# `map` vs `apply`

Do you know the difference between the **`map`** and **`apply`** Series methods?

In [1]:
from IPython.display import IFrame

In [2]:
IFrame('http://etc.ch/RjoN', 400, 300)

In [6]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8VzFnGsNyv3rYRtjyM1R0g6HvOxV', 400, 300)

# Primary usage of `map` method
As the name implies, **`map`** can literally map one value to another in a Series. Pass it a dictionary (or another Series). Let's see an example:

In [None]:
import pandas as pd
import numpy as np

In [None]:
s = pd.Series(np.random.randint(1, 7, 10))
s

Create mapping dictionary

In [None]:
d = {1:'odd', 2:'even', 3:'odd', 4:'even', 5:'odd', 6:'even'}

In [None]:
s.map(d)

Works the same if you use a Series

In [None]:
s1 = pd.Series(d)
s1

In [None]:
s.map(s1)
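
One detail worth knowing about literal mapping: a value with no matching key is not left unchanged. It becomes missing (`NaN`). Here is a minimal sketch, using a small Series (`rolls`) and a deliberately incomplete dictionary (`partial`), both made up for illustration:

```python
import pandas as pd

# Hypothetical data: a few die rolls and a mapping that omits 5 and 6
rolls = pd.Series([1, 2, 5, 6, 3])
partial = {1: 'odd', 2: 'even', 3: 'odd', 4: 'even'}

mapped = rolls.map(partial)
print(mapped)

# Values without a key in the dictionary come back as NaN
print(mapped.isna().sum())
```

So if missing values show up after a `map`, the first thing to check is whether the dictionary covers every value in the Series.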

#### `map` example with more data
Let's map 1 million integers ranging from 1 to 100 to 'even'/'odd' strings.

In [None]:
n = 1000000 # 1 million
s = pd.Series(np.random.randint(1, 101, n))
s.head()

Create the mapping

In [None]:
d = {i: 'odd' if i % 2 else 'even' for i in range(1, 101)}
print(d)

In [None]:
s.map(d).head(10)

### Exercise 1
<span style="color:green; font-size:16px">Can you use the **`apply`** method to do the same thing? Time the difference between the **`apply`** and **`map`**.</span>

In [None]:
# your code here

### `map` and `apply` can both take functions
Unfortunately, both **`map`** and **`apply`** can accept a function that gets implicitly passed each value in the Series. The result of each operation is exactly the same.

In [None]:
a = s.apply(lambda x: 'odd' if x % 2 else 'even')
b = s.map(lambda x: 'odd' if x % 2 else 'even')

a.equals(b)

This dual functionality of **`map`** confuses users. It can accept a dictionary but it can also accept a function.

### Suggestion: only use `map` for literal mapping
It makes more sense to me that the **`map`** method be used for one purpose only: mapping each value in a Series to another value with a dictionary or a Series.

### Use `apply` only for functions
**`apply`** must take a function and, when given one, has more options than **`map`**, so it should be used when you want to apply a function to each value in a Series. When both are passed the same function, there is no meaningful difference in speed between the two.
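
One of the extra options **`apply`** offers is forwarding additional arguments to your function. The sketch below assumes a made-up helper, `label_value`, and a small hypothetical Series; only `args` and the pass-through keyword arguments are part of the pandas API.

```python
import pandas as pd

# Hypothetical helper: label a value relative to a threshold
def label_value(x, threshold, high='high', low='low'):
    return high if x > threshold else low

values = pd.Series([10, 55, 80, 30])

# Extra positional arguments go in `args`; keyword arguments are passed through
labels = values.apply(label_value, args=(50,), high='big', low='small')
print(labels)
```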

### Exercise 2
<span style="color:green; font-size:16px">Use the **`map`** method with a two-item dictionary to convert the Series of integers to 'even/odd' strings. You will need to perform an operation on the Series first. Is this faster or slower than the results in exercise 1?</span>

In [None]:
# run this code first
n = 1000000 # 1 million
s = pd.Series(np.random.randint(1, 101, n))

In [None]:
# your code here

### Exercise 3
<span style="color:green; font-size:16px">Write a for-loop to convert each value in the Series to 'even/odd' strings. Time the operation.</span>

In [None]:
# your code here

# Vectorized if-then-else with NumPy `where`
The NumPy **`where`** function provides us with a vectorized if-then-else that is very fast. Let's convert the Series again to 'even/odd' strings.

In [None]:
s = pd.Series(np.random.randint(1, 101, n))

In [None]:
np.where(s % 2, 'odd', 'even')

In [None]:
%timeit np.where(s % 2, 'odd', 'even')
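
Note that **`np.where`** hands back a NumPy array rather than a Series, so the index is dropped. Here is a small sketch of wrapping the result back into a Series, using a hypothetical Series `nums` with a non-default index:

```python
import numpy as np
import pandas as pd

# Small hypothetical Series with a non-default index
nums = pd.Series([3, 8, 11, 20], index=['a', 'b', 'c', 'd'])

result = np.where(nums % 2, 'odd', 'even')
print(type(result))  # a NumPy array, the index is gone

# Wrap the array back into a Series to keep the original labels
labeled = pd.Series(result, index=nums.index)
print(labeled)
```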

### Exercise 4
<span style="color:green; font-size:16px">Convert the values from 1-33 to 'low', 34-67 to 'medium' and the rest 'high'.</span>

In [None]:
# your code here

### There is a DataFrame/Series `where` method
There is a DataFrame/Series **`where`** method but it works differently. You must pass it a boolean DataFrame/Series and it will preserve all the values where the condition is True. The other values are converted to missing by default, but you can supply a replacement value with the **`other`** parameter.

In [None]:
s.where(s > 50).head(10)

In [None]:
s.where(s > 50, other=-1).head(10)
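
To make the difference concrete, here is a sketch that produces the same result both ways on a small made-up Series (`vals`). The point to notice is that **`np.where`** needs the original values passed explicitly, while the Series method keeps them implicitly.

```python
import numpy as np
import pandas as pd

vals = pd.Series([10, 60, 45, 90])

# Series.where keeps values where the condition is True, replaces the rest
kept = vals.where(vals > 50, other=-1)

# np.where chooses between two alternatives, so the original values
# must be passed explicitly as the "True" branch
chosen = pd.Series(np.where(vals > 50, vals, -1))

print(kept.tolist())
print(chosen.tolist())
```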

# Do we really need `apply`?
As we saw from this last example, we could eliminate the need for the **`apply`** method. Most examples of code that use **`apply`** do not actually need it.

### `apply` doesn't really do anything
By itself, the **`apply`** method doesn't really do anything. 
* For Series, it iterates over every single value and passes that value to a function that you must pass to **`apply`**. 
* For a DataFrame, it iterates over each column or row as a Series and calls your passed function on that Series
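
The two bullets above can be sketched as plain Python loops. This is only a rough mental model with made-up data (`ser`, `frame`, `double`), not how pandas implements **`apply`** internally.

```python
import pandas as pd

def double(x):
    return x * 2

ser = pd.Series([1, 2, 3])
frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Series: the function receives one value at a time
by_apply = ser.apply(double)
by_loop = pd.Series([double(v) for v in ser], index=ser.index)

# DataFrame (default axis): the function receives one column (a Series) at a time
col_apply = frame.apply(lambda col: col.sum())
col_loop = pd.Series({name: frame[name].sum() for name in frame.columns})

print(by_apply.equals(by_loop), col_apply.equals(col_loop))
```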

Let's see a simple example of **`apply`** used to multiply each value of a Series by 2:

In [None]:
s = pd.Series(np.random.randint(1, 101, n))

In [None]:
s.apply(lambda x: x * 2).head()

In [None]:
(s * 2).head()

In [None]:
%timeit s.apply(lambda x: x * 2)

In [None]:
%timeit s * 2

### Use vectorized solution whenever possible
As you can see, the solution with **`apply`** was more than 2 orders of magnitude slower than the vectorized solution. Even a plain Python loop can be faster than **`apply`**.

In [None]:
%timeit pd.Series([v * 2 for v in s])
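
The **`%timeit`** magic only works inside IPython/Jupyter. If you want to repeat the comparison in a plain Python script, a sketch with the standard library `timeit` module might look like this. The Series size and the number of repetitions are arbitrary choices, and the absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1 million values above so the sketch runs quickly
small = pd.Series(np.random.randint(1, 101, 10_000))

candidates = {
    'apply': lambda: small.apply(lambda x: x * 2),
    'for-loop': lambda: pd.Series([v * 2 for v in small]),
    'vectorized': lambda: small * 2,
}

# Time each approach; absolute numbers depend on machine and pandas version
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=5)
    print(f'{name}: {seconds / 5:.6f} s per run')
```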

I like to call **`apply`** the **method of last resort**. There is rarely a reason to use it over other methods. Pandas and NumPy both provide a tremendous amount of functionality that covers nearly everything you need to do.

Always use pandas and NumPy methods first before anything else.

### Use-cases for `apply` on a Series
When there is no vectorized implementation in pandas, NumPy, or another scientific library, you can use **`apply`**.

A simple example (that's not too practical) is finding the underlying data type of each value in a Series.

In [None]:
s = pd.Series(['a', {'TX':'Texas'}, 99, (0, 5)])
s

In [None]:
s.apply(type)

A more practical example might be from a library that doesn't work directly with arrays, like finding the edit distance between two strings from the NLTK library.

In [None]:
from nltk.metrics import edit_distance

In [None]:
edit_distance('Kaitlyn', 'Kaitlin')

In [None]:
s = pd.Series(['Kaitlyn', 'Katelyn', 'Kaitlin', 'Katelynn', 'Katlyn',
               'Kaitlynn', 'Katelin', 'Katlynn', 'Kaitlin', 'Caitlyn', 'Caitlynn'])
s

Using **`apply`** here is correct

In [None]:
s.apply(lambda x: edit_distance(x, 'Kaitlyn'))

### Using `apply` on a DataFrame
By default, **`apply`** will call the passed function on each individual column of a DataFrame. The column will be passed to the function as a Series.

In [None]:
df = pd.DataFrame(np.random.rand(100, 5), columns=['a', 'b', 'c', 'd', 'e'])
df.head()

In [None]:
df.apply(lambda s: s.max())

We can change the direction of the operation by setting the **`axis`** parameter to **`1`** or **`'columns'`**.

In [None]:
df.apply(lambda s: s.max(), axis='columns').head(10)

#### Never actually perform these operations when a DataFrame method exists
Let's replace these two calls with the built-in DataFrame methods and time the differences.

In [None]:
df.max()

In [None]:
df.max(axis='columns').head(10)

In [None]:
%timeit df.apply(lambda s: s.max())

In [None]:
%timeit df.max()

In [None]:
%timeit df.apply(lambda s: s.max(), axis='columns')

In [None]:
%timeit df.max(axis='columns')

The built-in methods are about 5x and 70x faster, and the code is much more readable.

### Infected by the documentation
Unfortunately, the official pandas documentation is littered with examples that don't need **`apply`**. Can you fix the following 2 misuses of **`apply`** [found here](http://pandas.pydata.org/pandas-docs/stable/10min.html#apply)?



### Exercise 1
<span style="color:green; font-size:16px">Make the following idiomatic</span>

In [None]:
df.apply(np.cumsum).head()

In [None]:
# your code here

### Exercise 2
<span style="color:green; font-size:16px">Make the following idiomatic</span>

In [None]:
df.apply(lambda x: x.max() - x.min())

In [None]:
# your code here

### `apply` with `axis=1` is the slowest operation you can do in pandas
If you call **`apply`** with **`axis=1`** or identically with **`axis='columns'`** on a DataFrame, pandas will iterate row by row to complete your operation. Since there are almost always more rows than columns, this will be extremely slow.
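
Here is a small sketch of the two styles side by side. The DataFrame `scores`, its column names and the weights are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, made up for illustration
scores = pd.DataFrame(np.random.randint(0, 100, (1000, 2)),
                      columns=['midterm', 'final'])

# Row by row: the lambda is called once per row with a Series
slow = scores.apply(lambda row: 0.4 * row['midterm'] + 0.6 * row['final'],
                    axis='columns')

# Vectorized: a handful of whole-column operations, however many rows there are
fast = 0.4 * scores['midterm'] + 0.6 * scores['final']

print(np.allclose(slow, fast))
```

The row-wise version calls a Python function 1,000 times, while the vectorized version performs the arithmetic on entire columns at once.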

### Exercise 3
<span style="color:green; font-size:16px">Add a column named **`distance`** to the following DataFrame that computes the Euclidean distance between points **`(x1, y1)`** and **`(x2, y2)`**. Calculate it once with **`apply`** and again idiomatically using vectorized operations. Time the difference between them.</span>

In [None]:
# run this first
df = pd.DataFrame(np.random.randint(0, 20, (100000, 4)), 
                  columns=['x1', 'y1', 'x2', 'y2'])
df.head()

In [None]:
# your code here

### Use-cases for apply on a DataFrame

DataFrames and Series have nearly all of their methods in common. For methods that only exist for Series, you might need to use **`apply`**.

In [None]:
weather = pd.DataFrame({'Houston': ['rainy', 'sunny', 'sunny', 'cloudy', 'rainy', 'sunny'],
                        'New York':['sunny', 'sunny', 'snowy', 'snowy', 'rainy', 'cloudy'],
                        'Seattle':['sunny', 'cloudy', 'cloudy', 'cloudy', 'cloudy', 'rainy'],
                        'Las Vegas':['sunny', 'sunny', 'sunny', 'sunny', 'sunny', 'sunny']})
weather

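Counting the frequencies of each column is normally done by the Series **`value_counts`** method. There is no DataFrame method that counts each column separately (the DataFrame **`value_counts`** added in pandas 1.1 counts unique rows instead), so you can use the Series method here with **`apply`**.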

In [None]:
# pd.Series.value_counts avoids the deprecated top-level pd.value_counts
weather.apply(pd.Series.value_counts)

In [None]:
%matplotlib inline
weather.apply(pd.Series.value_counts).plot(kind='bar')

### Using `apply` with the Series accessors `str`, `dt` and `cat`
Pandas Series, depending on their data type, can access additional Series-only methods through **`str`**, **`dt`** and **`cat`** for string, datetime and categorical type columns.

In [None]:
weather.Houston.str.capitalize()

Since this method exists only for Series, you can use **`apply`** here to capitalize each column.

In [None]:
weather.apply(lambda x: x.str.capitalize())

This is one case where you can use the **`applymap`** method (named **`DataFrame.map`** in pandas 2.1 and later) by directly using the string method on each value.

In [None]:
weather.applymap(str.capitalize)
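
The **`dt`** and **`cat`** accessors mentioned above work the same way as **`str`**. A brief sketch on two small Series (`dates` and `sizes`, both made up for illustration):

```python
import pandas as pd

# Hypothetical data, made up for illustration
dates = pd.Series(pd.to_datetime(['2018-01-15', '2018-06-30', '2018-12-25']))
sizes = pd.Series(['small', 'large', 'small'], dtype='category')

# `dt` exposes datetime components and methods
print(dates.dt.month)
print(dates.dt.day_name())

# `cat` exposes the categories and their integer codes
print(sizes.cat.categories)
print(sizes.cat.codes)
```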

In [None]:
employee = pd.read_csv('../data/employee.csv')
employee.head()

Select just the titles and departments 

In [None]:
emp_title_dept = employee[['DEPARTMENT', 'POSITION_TITLE']]
emp_title_dept.head()

Let's find all the departments and titles that contain the word 'police'.

In [None]:
has_police = emp_title_dept.apply(lambda x: x.str.upper().str.contains('POLICE'))
has_police.head()

Let's use these boolean values to only select rows that have both values as **`True`**.

In [None]:
emp_title_dept[has_police.all(axis='columns')].head(10)
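
If you don't have the employee file at hand, the same pattern can be tried on a few made-up rows. This sketch uses a hypothetical `staff` DataFrame, and it swaps the `str.upper()` step for the `case=False` option of `str.contains`, which is an alternative way to ignore case.

```python
import pandas as pd

# Hypothetical rows standing in for the employee data
staff = pd.DataFrame({
    'DEPARTMENT': ['Houston Police Department', 'Library', 'Houston Police Department'],
    'POSITION_TITLE': ['POLICE OFFICER', 'LIBRARIAN', 'CLERK'],
})

# One boolean column per original column
mask = staff.apply(lambda col: col.str.contains('police', case=False))

# `all` keeps rows where every column matched, `any` where at least one did
both = staff[mask.all(axis='columns')]
either = staff[mask.any(axis='columns')]
print(both)
print(either)
```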

### How fast are the `str` accessor methods?
Not any faster than looping...

In [None]:
%timeit employee['POSITION_TITLE'].str.upper()

In [None]:
%timeit employee['POSITION_TITLE'].apply(str.upper)

In [None]:
%timeit pd.Series([x.upper() for x in employee['POSITION_TITLE']])

In [None]:
%timeit employee['POSITION_TITLE'].max()

In [None]:
%timeit employee['BASE_SALARY'].max()

In [None]:
%timeit employee['POSITION_TITLE'].values.max()

In [None]:
%timeit employee['BASE_SALARY'].values.max()

In [None]:
a_list = employee['POSITION_TITLE'].tolist()

In [None]:
%timeit max(a_list)

### Exercise 4
<span style="color:green; font-size:16px">The following example is from the documentation. Produce the same result without using apply by creating a function that accepts a DataFrame and returns a DataFrame.</span>

In [None]:
df = pd.DataFrame(np.random.randint(0, 20, (10, 4)), 
                  columns=['x1', 'y1', 'x2', 'y2'])
df.head()

In [None]:
def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

In [None]:
df.apply(subtract_and_divide, args=(5,), divide=3)

In [None]:
# your code here

### Exercise 5
<span style="color:green; font-size:16px">Make the following idiomatic:</span>

In [None]:
college = pd.read_csv('../data/college.csv', 
                      usecols=lambda x: 'UGDS' in x or x == 'INSTNM', 
                      index_col='INSTNM')
college = college.dropna()
college.shape

In [None]:
college.head()

In [None]:
def max_race_count(s):
    max_race_pct = s.iloc[1:].max()
    return (max_race_pct * s.loc['UGDS']).astype(int)

In [None]:
college.apply(max_race_count, axis=1).head()

In [None]:
# your code here

# Tips for debugging `apply`
It is more difficult to debug code that uses **`apply`** when you use a custom function. This is because all the code in your custom function gets executed at once. You aren't stepping through the code one line at a time and checking the output.

### Using the `display` IPython function and print statements to inspect custom function
Let's say you didn't know what **`apply`** with **`axis='columns'`** was implicitly passing to the custom function.

In [None]:
# what the hell is x?
def func(x):
    return 1

In [None]:
college.apply(func, axis=1).head()

It's obvious that you need to know what object **`x`** is in **`func`**. One thing we can do is print out its type. To stop the output we can force an error with a bare **`raise`** statement.

In [None]:
# what the hell is x?
def func(x):
    print(type(x))
    raise
    return 1

college.apply(func, axis=1).head()

Ok, great. We know that **`x`** is a Series. If it got printed twice, that is because older versions of pandas (before 1.1) call your function twice on the first row/column to determine whether they can take a fast path. This is a small implementation detail that shouldn't affect you unless your function has side effects, such as modifying variables outside its own scope.

Let's go one step further and display **`x`** on the screen

In [None]:
from IPython.display import display

In [None]:
# what the hell is x?
def func(x):
    display(x)
    raise
    return 1

college.apply(func, axis=1).head()
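
An alternative to forcing an error is to pull out a single row yourself and call the function on it directly. The sketch below uses a small made-up DataFrame (`schools`) as a stand-in for the college data. It is an additional technique, not a replacement for the approach above.

```python
import pandas as pd

# Hypothetical stand-in for the college data
schools = pd.DataFrame({'UGDS': [1000.0, 250.0],
                        'UGDS_WHITE': [0.6, 0.2],
                        'UGDS_BLACK': [0.3, 0.7]},
                       index=['School A', 'School B'])

def func(x):
    return type(x).__name__

# With axis=1, apply passes each row as a Series, so grab one row by hand
first_row = schools.iloc[0]
print(type(first_row))
print(first_row)

# Call the function directly on that single row
print(func(first_row))
```

Because only one row is involved, nothing floods the screen and you can inspect intermediate values step by step.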

### Exercise 1
<span style="color:green; font-size:16px">Use the **`display`** function after each line in a custom function that gets used with **`apply`** and **`axis='columns'`** to find the population of the second highest race per school. Make sure you raise an exception or else you will have to kill your kernel because of the massive output.</span>

In [None]:
# your code here

### Exercise 2 - Very difficult
<span style="color:green; font-size:16px">Can you do this without using **`apply`**?</span>

In [None]:
# your code here

### Exercise 3
<span style="color:green; font-size:16px">When **`apply`** is called on a Series, what is the data type that gets passed to the function?</span>

In [None]:
# your code here

# Summary
* **`map`** is a Series method. I suggest using it only by passing it a dictionary/Series and NOT a function
* Use **`apply`** when you want to apply a function to each value of a Series or each row/column of a DataFrame
* You rarely need **`apply`** - Try pandas and NumPy functions first
* Using **`apply`** on a DataFrame with **`axis='columns'`** is the slowest operation in pandas
* You can use **`apply`** on a DataFrame when you need to call a method that is available only to Series (like **`value_counts`**)
* Debug apply by printing and using the **`display`** IPython function inside your custom function