# Explain the 1,000x Speed Difference when taking the Mean
In this challenge, your goal is to explain why taking the `mean` of the following DataFrame is more than 1,000x faster when using the parameter `numeric_only=True`.

### The Challenge
The `bikes` dataset below has about 50,000 rows. Taking the `mean` of the entire DataFrame returns the mean of all the numeric columns. If we set the parameter `numeric_only` to `True`, the exact same result is returned. However, the second option is more than 1,000x faster, taking the operation from over 40 seconds down to around 15 milliseconds.

The challenge is to explain why this speed difference exists despite each of these operations returning the exact same result. The solution is fairly nuanced and requires a deep understanding of pandas.

In [1]:
import pandas as pd
bikes = pd.read_csv('https://raw.githubusercontent.com/DunderData/Pandas-Challenges/master/data/bikes.csv')
bikes.head()

,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


In [2]:
bikes.shape

(50089, 19)

### Taking the mean
Calling the `mean` method with the defaults is extremely slow.

In [None]:
bikes.mean()

Setting the parameter `numeric_only` to `True` makes a huge difference even though the returned result is the same.

In [None]:
bikes.mean(numeric_only=True)

### Timing each operation

There is a performance difference of over 1,000x: from about 40 seconds down to about 15 ms.

In [None]:
%timeit -n 1 -r 1 bikes.mean()

In [None]:
%timeit -n 1 -r 1 bikes.mean(numeric_only=True)

## Solution

The solution relies on a thorough understanding of the object data type. DataFrame columns that are of the object data type may contain any Python object. Object columns may be composed of integers, floats, strings, lists, dictionaries, other DataFrames, or any other object. Typically, columns with the object data type contain only strings, but this isn't guaranteed. The object data type is the most flexible and it is this flexibility that causes the tremendous slowdown above.
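As a small illustration of that flexibility, the sketch below builds a single Series out of five unrelated Python objects. The variable names `mixed` and `element_types` are made up for this example.

```python
import pandas as pd

# One Series holding an int, a float, a string, a list, and a dict.
mixed = pd.Series([1, 2.5, 'three', [4, 5], {'six': 6}])

# pandas falls back to the object data type to hold them all.
print(mixed.dtype)

# Each element keeps its original Python type.
element_types = [type(value).__name__ for value in mixed]
print(element_types)
```

Because pandas cannot know ahead of time what an object column holds, it has to treat every element as an arbitrary Python object.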

### Integers as objects
Let's create a Series of integers and calculate some summary statistics on it. Note that the data type is formally a 64-bit integer after creation.

In [7]:
s_int = pd.Series([10, 99, -123, 88])
s_int

0     10
1     99
2   -123
3     88
dtype: int64

Verify the data type.

In [8]:
s_int.dtype

dtype('int64')

Let's calculate the sum and mean.

In [9]:
s_int.sum()

74

In [10]:
s_int.mean()

18.5

Within pandas, the `astype` method may be used to change the data type of a Series. Let's change the Series so that its data type is object.

In [11]:
s_obj = s_int.astype('object')
s_obj

0      10
1      99
2    -123
3      88
dtype: object

Both the `sum` and `mean` methods work and return the same results as above.

In [12]:
s_obj.sum()

74

In [13]:
s_obj.mean()

18.5

Typically, you would never want to convert a Series of integers to object, as you would lose the optimizations granted to you through the numpy library. A Series with the `int64` data type stores its data internally in a numpy array, which keeps the values in contiguously allocated memory as a C integer array. When a Series is converted to the object data type, each integer is no longer stored as a C integer but as a Python integer object. Let's verify this by retrieving the type of an individual value in each numpy array.

In [14]:
type(s_int.values[0])

numpy.int64

In [15]:
type(s_obj.values[0])

int
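The storage difference can also be seen directly on the underlying arrays. The following is a minimal sketch using plain numpy; the names `arr_int` and `arr_obj` are introduced only for this example, and the exact size reported by `sys.getsizeof` depends on your Python build.

```python
import sys
import numpy as np

arr_int = np.array([10, 99, -123, 88], dtype='int64')
arr_obj = arr_int.astype('object')

# int64 array: every value occupies 8 bytes inside one contiguous buffer.
print(arr_int.itemsize, arr_int.nbytes)

# object array: the buffer only holds references. Each value lives
# elsewhere in memory as a full Python int object.
print(arr_obj.itemsize)
print(type(arr_obj[0]), sys.getsizeof(arr_obj[0]))
```

Any operation on the object array has to follow each reference and work with a Python object, instead of looping over raw C integers.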

### Operations on an object array are slow
Changing the data type of a column of integers to object will have no impact on the result for several methods, but performance will decline enormously. Below, a numpy array of 1 million integers is created. It is then summed both as an integer array and as an object array, with the object version being roughly 55x slower in the timings shown.

In [16]:
import numpy as np
a_int = np.random.randint(0, 10, 1000000)
a_obj = a_int.astype('object')

In [17]:
a_int

array([3, 5, 4, ..., 3, 1, 3])

In [18]:
a_obj

array([3, 5, 4, ..., 3, 1, 3], dtype=object)

In [19]:
%timeit -n 5 a_int.sum()

539 µs ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [20]:
%timeit -n 5 a_obj.sum()

30.1 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
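If you are working outside of a notebook, where the `%timeit` magic is not available, a similar comparison can be sketched with the standard library's `timeit` module. The variable names `t_int` and `t_obj` are made up for this example, and the ratio you see will vary by machine and numpy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a_int = rng.integers(0, 10, 1_000_000)
a_obj = a_int.astype('object')

# Both arrays hold the same values, so the sums must agree.
assert a_int.sum() == a_obj.sum()

# Best of 3 repeats, 5 calls per repeat, reported per call.
t_int = min(timeit.repeat(a_int.sum, number=5, repeat=3)) / 5
t_obj = min(timeit.repeat(a_obj.sum, number=5, repeat=3)) / 5

print(f'int64:  {t_int * 1e3:.3f} ms')
print(f'object: {t_obj * 1e3:.3f} ms')
print(f'ratio:  {t_obj / t_int:.0f}x')
```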


## Strings can be added together
One interesting property of strings in Python is that they can be concatenated together with the plus operator.

In [21]:
string1 = 'mac'
string2 = 'hine'
string1 + string2

'machine'

### String Series
Whenever you have a column of data in pandas that contains strings, its data type is object by default. Let's create a Series of strings and verify that its data type is indeed object.

In [22]:
s = pd.Series(['The', 'quick', 'brown', 'fox'])
s

0      The
1    quick
2    brown
3      fox
dtype: object

### Summing a Series of strings
The Series `sum` method simply adds together every value in the Series. Because addition is a valid operation with strings in Python, the method completes with our current Series.

In [23]:
s.sum()

'Thequickbrownfox'

### Taking the mean of a string Series
Taking the `mean` of a Series of strings is meaningless and pandas will raise an error. Let's attempt this and make a note of the error message.

In [24]:
s.mean()

TypeError: Could not convert Thequickbrownfox to numeric

### Summing and then dividing
The error message reads 'Could not convert Thequickbrownfox to numeric'. This implies that pandas has taken the time to compute the sum first before trying to divide by the total length of the Series.
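To make the "sum first, divide second" order concrete, here is a simplified pure-Python model of it. This is only a sketch of the idea and not pandas' actual implementation; `naive_mean` is a hypothetical helper written for this example.

```python
from functools import reduce
import operator


def naive_mean(values):
    """Add every value together, then divide by the number of values."""
    total = reduce(operator.add, values)  # works for ints AND for strings
    return total / len(values)            # only fails here for strings


print(naive_mean([10, 99, -123, 88]))

try:
    naive_mean(['The', 'quick', 'brown', 'fox'])
except TypeError as exc:
    # The full concatenation has already happened by the time this fails.
    print('failed at the division step:', exc)
```

With strings, all of the work of adding the values together is done before the division exposes the problem.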

### Why is there no error with the bikes DataFrame?
You might be wondering why our bikes DataFrame did not produce an error when taking the mean, but the above Series did. DataFrames have a concept called 'nuisance columns', which are columns on which a calculation cannot be completed. These nuisance columns are silently (without an error or warning) dropped from the result. Only columns where the operation succeeds are returned. Taking the mean of a DataFrame that contains columns without a mean is therefore valid.

For instance, we can turn our string Series into a one column DataFrame with the `to_frame` method and then compute the `mean`. Notice that there is no error here as there was above when computed on a Series. Instead, an empty Series is returned as the one column in the DataFrame is a nuisance column.

In [25]:
df = s.to_frame('Words')
df.head()

,Words
0,The
1,quick
2,brown
3,fox


In [26]:
df.mean()

Series([], dtype: float64)
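How non-numeric columns are handled by default has changed between pandas releases, so it can be safer not to rely on it. The sketch below shows two explicit ways to restrict the calculation to numeric columns. The small DataFrame and the names `numeric_means` and `flagged_means` are made up for this example.

```python
import pandas as pd

df_mixed = pd.DataFrame({
    'Words': ['The', 'quick', 'brown', 'fox'],
    'Numbers': [10, 99, -123, 88],
})

# Option 1: keep only the numeric columns, then take the mean.
numeric_means = df_mixed.select_dtypes(include='number').mean()

# Option 2: ask mean to ignore non-numeric columns itself.
flagged_means = df_mixed.mean(numeric_only=True)

print(numeric_means)
print(flagged_means)
```

Both options never touch the `Words` column, so no string concatenation takes place.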

## Explaining what happens during the challenge problem
When taking the `mean` of the bikes DataFrame above, pandas first sums every single column regardless of its data type. Once the sum is complete, then it divides by the number of rows to get the mean of that column. For columns of strings, it is only at this stage where the division happens that pandas is unable to compute a mean and declares it a nuisance column.

Concatenating strings is far more expensive than adding integers or floats. Because the call to `mean` first concatenates every single value in each string column, the operation is terribly slow.
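The cost of repeated concatenation can be felt even without pandas. The following rough sketch adds up 10,000 integers and 10,000 copies of a station name one pair at a time, mimicking a running sum. It is a plain-Python approximation rather than a measurement of pandas itself, and the timings depend heavily on your machine.

```python
import operator
import time
from functools import reduce

n = 10_000
numbers = list(range(n))
words = ['Lake Shore Dr & Monroe St'] * n

start = time.perf_counter()
total = reduce(operator.add, numbers)
t_num = time.perf_counter() - start

# Each addition builds a brand-new, ever-longer string and copies
# everything accumulated so far into it.
start = time.perf_counter()
joined = reduce(operator.add, words)
t_str = time.perf_counter() - start

print(f'adding integers:       {t_num * 1e3:.2f} ms')
print(f'concatenating strings: {t_str * 1e3:.2f} ms')
print(f'final string length:   {len(joined):,}')
```

The integer running total stays small the whole time, while the string running total keeps growing, so each later step has more to copy than the one before it.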

Setting the `numeric_only` parameter to `True` tells pandas not to even attempt to sum the object columns, which is why we see the huge gap in performance even though the result is the same.

### Why can't pandas skip columns of strings?
It does seem that the logical thing to do is for pandas to skip columns where the mean is not a valid option, such as columns with strings in them. However, since object columns can contain any Python object, the mean might still be a valid operation, as we saw with the object Series of integers above.

pandas does not make any assumptions about the data contained in the object column. It just follows its procedure for calculating the mean, which is summing the column and then dividing by the length. If at any point an error occurs, the column is declared as a nuisance and dropped from the result.
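The "try it, and drop the column if it fails" procedure can be modeled in a few lines of plain Python. This is a simplified sketch of the behavior described above, not pandas' real code; `sum_then_divide` and `column_means` are hypothetical helpers, and the sample values are taken from the first five rows of the bikes data.

```python
def sum_then_divide(values):
    """Add all values together first, then divide by the count."""
    total = values[0]
    for value in values[1:]:
        total = total + value
    return total / len(values)


def column_means(columns):
    """Return the mean of every column for which the calculation succeeds."""
    means = {}
    for name, values in columns.items():
        try:
            means[name] = sum_then_divide(values)
        except TypeError:
            # A "nuisance column": drop it without complaint.
            continue
    return means


columns = {
    'tripduration': [993, 623, 1040, 667, 130],
    'gender': ['Male', 'Male', 'Male', 'Male', 'Male'],
}
result = column_means(columns)
print(result)
```

Note that the `gender` column is fully concatenated before it is discarded, which is exactly the wasted work behind the slowdown.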

### Can't pandas build a special case for this?
Yes, it would still be possible for pandas to inspect values when taking the mean of an object column and if it is a data type that does not have a mean, then raise the error immediately at that point.
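One way such an up-front check could look is sketched below, using `pandas.api.types.infer_dtype` to inspect the values of an object column before doing any arithmetic. The helper `has_numeric_values` is hypothetical and is not something pandas does internally. The inspection itself has to scan the values, so it is not free either.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Kinds reported by infer_dtype that we treat as safe to average.
NUMERIC_KINDS = {'integer', 'floating', 'mixed-integer-float'}


def has_numeric_values(column):
    """Return True if an object column appears to hold only numbers."""
    return infer_dtype(column, skipna=True) in NUMERIC_KINDS


s_words = pd.Series(['The', 'quick', 'brown', 'fox'], dtype='object')
s_numbers = pd.Series([10, 99, -123, 88], dtype='object')

print(infer_dtype(s_words), has_numeric_values(s_words))
print(infer_dtype(s_numbers), has_numeric_values(s_numbers))
```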
