# 5. Say No to `apply`

The `apply` method should be avoided whenever a built-in Pandas method can do the same job, which is nearly always.

## `apply` - a method that does nothing (almost)
The `apply` method does almost nothing on its own. It exists so that you can use your own custom function when the operation you need doesn't already exist in Pandas. For instance, there is no single method for finding the difference between the maximum and minimum value of a column. You can define your own function that does this and pass it to the `apply` method.

Let's see how this would work by creating some random data from a normal distribution.

In [None]:
import numpy as np
import pandas as pd

a = np.random.randn(10 ** 4, 5)
df = pd.DataFrame(data=a, columns=['a', 'b', 'c', 'd', 'e'])
df.head()

Now, let's define our own function to take the maximum and minimum value of each column and return the difference. By default, Pandas will pass each column to the custom function as a Series. This is why it is defined with the variable name `s`.

In [None]:
def min_max_diff(s):
    return s.max() - s.min()

In [None]:
df.apply(min_max_diff)
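
To see what `apply` actually hands to the function, here is a minimal sketch on a tiny DataFrame. The names `toy` and `describe_input` are made up for illustration. The helper simply reports the type and length of whatever it receives.

```python
import pandas as pd

# Hypothetical toy data, small enough to check by eye
toy = pd.DataFrame({'a': [1, 5, 3], 'b': [10, 40, 20]})

def describe_input(s):
    # Report what kind of object apply passed in and how long it is
    return f'{type(s).__name__} of length {len(s)}'

# One call per column: each column arrives as a Series
passed = toy.apply(describe_input)
print(passed)

def min_max_diff(s):
    return s.max() - s.min()

# The result is a Series indexed by the column names
result = toy.apply(min_max_diff)
print(result)
```

Each column arrives as a whole Series, so `s.max()` and `s.min()` inside the function are ordinary Series methods.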

### No need to use `apply` here
There is no need to use `apply` here. Pandas already has methods for the min and the max. We can find the min and max of each of the columns separately and then subtract the results.

In [None]:
df.max()

In [None]:
df.min()

In [None]:
df.max() - df.min()

### Time performance of each

In [None]:
%timeit -n 5 df.apply(min_max_diff)

In [None]:
%timeit -n 5 df.max() - df.min()

## Using `apply` when `axis='columns'` - the slowest operation in Pandas
Let's say we wanted to repeat this example, except this time we were interested in finding the difference between the min and max values of each row. By default, `apply` will pass each column to the custom function. We can change the direction of the operation by setting the `axis` parameter to `'columns'`, so that each row is passed instead.

In [None]:
df.apply(min_max_diff, axis='columns').head()

In [None]:
(df.max(axis='columns') - df.min(axis='columns')).head()

### Time performance

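The vectorized version is around 1,000 times faster than `apply` with `axis='columns'` on this data. The exact ratio will vary with your machine and Pandas version.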

In [None]:
%timeit -n 1 -r 1 df.apply(min_max_diff, axis='columns')

In [None]:
%timeit -n 5 df.max(axis='columns') - df.min(axis='columns')
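
A speed comparison only means something if both approaches return the same answer. The sketch below checks that on a smaller, seeded frame. The name `small`, the seed, and the size are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
small = pd.DataFrame(rng.standard_normal((1000, 5)), columns=list('abcde'))

def min_max_diff(s):
    return s.max() - s.min()

# Row-wise with apply: min_max_diff is called once per row (1,000 calls)
slow = small.apply(min_max_diff, axis='columns')

# Vectorized: two reductions across the columns, then one subtraction
fast = small.max(axis='columns') - small.min(axis='columns')

same = np.allclose(slow, fast)
print(same)
```

The `apply` version makes one Python-level function call per row, while the vectorized version makes a fixed number of calls regardless of how many rows there are.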

### `apply` is no different than a for loop
The `apply` method is just a one-line automated for loop. We can reproduce what it does with an actual explicit for loop like this:

In [None]:
data = {}
for col in df.columns:
    data[col] = df[col].max() - df[col].min()
pd.Series(data)

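Notice that the loop's performance is similar to the idiomatic solution from above. There are only five columns to iterate over, and each call to `max` and `min` is itself vectorized.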

In [None]:
%%timeit -n 5

data = {}
for col in df.columns:
    data[col] = df[col].max() - df[col].min()
pd.Series(data)

And here is `apply` reproduced over each row. It's incredibly slow because the loop body runs once for each of the 10,000 rows.

In [None]:
%%timeit -n 1 -r 1

data = {}
for row in range(len(df)):
    vals = df.iloc[row]
    data[row] = vals.max() - vals.min()
pd.Series(data)

## Summary of `apply`
* `apply` is an automated for loop that passes each column or row to a user-defined function
* Use `apply` as a method of last resort
* Using `apply` with `axis='columns'` is one of the slowest operations in all of Pandas

## How to un-`apply`
If you have already created a user-defined function that you use for `apply`, you can often make it work without `apply`.
* Always choose a Pandas method over any user-defined function with `apply`
* Try to convert each line of your user-defined function into an operation on whole columns that runs outside the function
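
As a minimal sketch of the second point, consider a row-wise function and its line-by-line translation into column operations. The data and the function `weighted_sum` are hypothetical and are not part of the original material.

```python
import pandas as pd

# Hypothetical data and function, for illustration only
scores = pd.DataFrame({'exam': [80, 90, 70], 'homework': [100, 60, 90]})

def weighted_sum(row):
    exam_part = row['exam'] * 0.7
    homework_part = row['homework'] * 0.3
    return exam_part + homework_part

# With apply: the function runs once per row
with_apply = scores.apply(weighted_sum, axis='columns')

# Un-applied: each line of the function becomes an operation on a whole column
exam_part = scores['exam'] * 0.7
homework_part = scores['homework'] * 0.3
without_apply = exam_part + homework_part

print(without_apply)
```

The body of the function barely changes. The only difference is that `row[...]` becomes `scores[...]`, so each line operates on every row at once.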

## Acceptable use cases for `apply`
The `apply` method should be used only when the operation cannot be easily completed with Pandas methods directly, which is very rare. Here is one example with the college dataset. The `md_earn_wne_p10` and `grad_debt_mdn_supp` columns appear to be numeric, but are actually read in as strings. Let's select these columns into their own DataFrame and look at their data types.

In [None]:
college = pd.read_csv('data/college.csv')
college.head()

Notice the 'PrivacySuppressed' strings. They are the reason these columns were read in as strings and not numbers.

In [None]:
df = college[['md_earn_wne_p10', 'grad_debt_mdn_supp']]
df.head(20)

In [None]:
df.dtypes

The function `to_numeric` can coerce single columns, but not entire DataFrames, to a numeric data type. We can use `apply` here to have it iterate over the columns and coerce each one to a numeric type. The `apply` method allows us to pass additional keyword arguments to the function it is applying. We must do that here and set `errors` to the string `'coerce'` to force Pandas to turn all 'PrivacySuppressed' strings into missing values.

In [None]:
df_num = df.apply(pd.to_numeric, errors='coerce')
df_num.head(20)

In [None]:
df_num.dtypes
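
If you don't have the college dataset at hand, the same pattern can be tried on a small stand-in. The DataFrame `toy` and its column names below are invented for illustration. Only the 'PrivacySuppressed' marker mirrors the real data.

```python
import pandas as pd

# Hypothetical stand-in for the two college columns
toy = pd.DataFrame({
    'earnings': ['42000', 'PrivacySuppressed', '38500'],
    'debt': ['PrivacySuppressed', '21000.5', '19000'],
})

# apply forwards errors='coerce' to pd.to_numeric for each column
toy_num = toy.apply(pd.to_numeric, errors='coerce')

print(toy_num)
print(toy_num.dtypes)
```

Both columns come back as floats, since the coerced 'PrivacySuppressed' entries become `NaN`, which is a floating-point value.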

### Exercise 3
<span style="color:green; font-size:16px">Add a column named **`distance`** to the following DataFrame that computes the Euclidean distance between points **`(x1, y1)`** and **`(x2, y2)`**. Calculate it once with **`apply`** and again idiomatically using vectorized operations. Time the difference between them.</span>

In [None]:
# run this first
df = pd.DataFrame(np.random.randint(0, 20, (100000, 4)), 
                  columns=['x1', 'y1', 'x2', 'y2'])
df.head()

### Exercise 4
<span style="color:green; font-size:16px">Using the college dataset, add a new column that has the word 'yes' if the school has a median total SAT score more than 1100 or 'no' if it does not. Do not use `apply`. </span>

In [None]:
college = pd.read_csv('data/college.csv')
college.head()