## 12. How to apply functions to a pandas Series and DataFrame?
We are familiar with functions that take a Python object and return an output. We can apply such functions to a Series or a DataFrame as well. In this blog, we will learn about the 'map', 'apply', 'applymap', 'agg' and 'describe' methods.

We will first read Kaggle’s Titanic training dataset. Each row of the dataset represents information for one passenger.

In [1]:
import pandas as pd

In [2]:
train = pd.read_csv("http://bit.ly/kaggletrain")
train.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


### 12.1. Using 'map' as a series method

We will learn one common use of the ‘map( )’ method: as a data preparation step for machine learning. It is a series method that lets us map the values in a series to the values we want. We pass a dictionary to ‘map( )’, where each ‘key’ is a value found in the series and each ‘value’ is what that key is mapped to.

In [3]:
train["Sex_num"] = train.Sex.map({"female":0, "male":1})

In [4]:
train.loc[0:4, ["Sex", "Sex_num"]]

|  | Sex | Sex_num |
|---|---|---|
| 0 | male | 1 |
| 1 | female | 0 |
| 2 | female | 0 |
| 3 | female | 0 |
| 4 | male | 1 |
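One detail worth knowing when mapping with a dictionary is what happens to values that have no matching key. The following is a minimal sketch on a small hand-made series (the sample values, including "unknown", are hypothetical and not taken from the Titanic data).

```python
import pandas as pd

# Hypothetical sample data; "unknown" is deliberately missing from the mapping.
sex = pd.Series(["male", "female", "female", "unknown"])

# Keys are the existing values, values are what they get replaced with.
sex_num = sex.map({"female": 0, "male": 1})

# Values without a matching key become NaN, so it is worth counting them.
n_unmapped = sex_num.isna().sum()
print(sex_num)
print("unmapped values:", n_unmapped)
```

Checking for NaN after a ‘map( )’ call is a cheap way to catch misspelled or unexpected categories before they reach a model.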


### 12.2. Using 'apply' as a series method

We use the ‘apply’ method to apply a function to every element of a series. Suppose we want to find the length of each string in the ‘Name’ column. To do so, we select the series, call the ‘apply( )’ method on it, and pass the name of the function without parentheses. We save the result in a new column called ‘Name_length’.


In [5]:
train["Name_length"] = train.Name.apply(len)

In [6]:
train.loc[0:4, ["Name", "Name_length"]]

|  | Name | Name_length |
|---|---|---|
| 0 | Braund, Mr. Owen Harris | 23 |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 51 |
| 2 | Heikkinen, Miss. Laina | 22 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 44 |
| 4 | Allen, Mr. William Henry | 24 |


Now, suppose we want to round up the values in the ‘Fare’ column. We pass NumPy's ceiling function (np.ceil) to the apply method and save the result in the ‘fair_ceil’ column.

In [7]:
import numpy as np

In [8]:
train.head(2)

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_num | Name_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S | 1 | 23 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 51 |


In [9]:
train["fair_ceil"] = train.Fare.apply(np.ceil)

In [10]:
train.loc[0:4, ["Fare", "fair_ceil"]]

|  | Fare | fair_ceil |
|---|---|---|
| 0 | 7.25 | 8.0 |
| 1 | 71.2833 | 72.0 |
| 2 | 7.925 | 8.0 |
| 3 | 53.1 | 54.0 |
| 4 | 8.05 | 9.0 |
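For these two particular tasks, pandas and NumPy also offer vectorized forms that give the same values without calling a function once per element. Here is a small sketch on hand-typed rows that mirror the ‘Name’ and ‘Fare’ columns (the DataFrame `df` and the variable names are only for illustration).

```python
import numpy as np
import pandas as pd

# Two rows copied by hand from the output above, so no download is needed.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Fare": [7.25, 7.925],
})

# apply calls the function once for every element of the series ...
name_length = df["Name"].apply(len)
fare_ceil = df["Fare"].apply(np.ceil)

# ... while these forms operate on the whole series in one call.
name_length_vec = df["Name"].str.len()
fare_ceil_vec = np.ceil(df["Fare"])

print(name_length.tolist(), name_length_vec.tolist())
print(fare_ceil.tolist(), fare_ceil_vec.tolist())
```

‘apply’ remains the general tool when no built-in string method or NumPy function covers the task.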


 
For our final problem, suppose we want to create a column holding the last name from the ‘Name’ column. We start by using the string method ‘split( )’ on the ‘Name’ column, which returns a series of lists. We then need to take the first element of each list and assign it to a new column. We can write our own function, such as ‘get_element’, that returns the element at a given position. Because we pass the function name to ‘apply’ without parentheses, any other arguments the function needs are passed to ‘apply’ after the function name, separated by a comma (here, ‘position=0’). For simpler problems like extracting the first element of a list, we can use a ‘lambda’ function instead. We write ‘lambda x:’, followed by what we want to return. Here x represents a list, so we return its first element.

In [11]:
train.Name.str.split(",").head()

0                           [Braund,  Mr. Owen Harris]
1    [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                            [Heikkinen,  Miss. Laina]
3      [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                          [Allen,  Mr. William Henry]
Name: Name, dtype: object

In [12]:
def get_element(my_list, position):
    return my_list[position]

In [13]:
train.Name.str.split(",").apply(get_element, position=0).head()

0       Braund
1      Cumings
2    Heikkinen
3     Futrelle
4        Allen
Name: Name, dtype: object

In [14]:
train.Name.str.split(",").apply(lambda x: x[0]).head()

0       Braund
1      Cumings
2    Heikkinen
3     Futrelle
4        Allen
Name: Name, dtype: object
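The different ways of passing the extra argument can be compared side by side. This is a minimal sketch on two hand-typed names. It also shows the ‘args’ parameter of ‘apply’ and the ‘.str.get( )’ accessor, neither of which is used in the cells above, as alternative routes to the same result.

```python
import pandas as pd

# Two names copied by hand from the Titanic output above.
names = pd.Series(["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"])

def get_element(my_list, position):
    """Return the element of my_list at the given position."""
    return my_list[position]

parts = names.str.split(",")  # a series of lists

# Extra argument passed by keyword, as in the cell above.
last_kw = parts.apply(get_element, position=0)

# Extra argument passed positionally through the args tuple.
last_args = parts.apply(get_element, args=(0,))

# A lambda avoids defining a named function.
last_lambda = parts.apply(lambda x: x[0])

# The .str accessor can also pick an element from each list.
last_str = parts.str.get(0)

print(last_kw.tolist())
```

All four variables hold the same values, so the choice is mostly a matter of readability.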

### 12.3. Using 'apply' as a DataFrame method

We can also use apply as a DataFrame method. To learn about it, we will load the ‘drinks’ dataset, which contains alcohol consumption by country. As we already know, we can select a subset of a DataFrame using ‘loc’.

In [15]:
drinks = pd.read_csv("http://bit.ly/drinksbycountry")
drinks.head()

|  | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
| 1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
| 4 | Angola | 217 | 57 | 45 | 5.9 | Africa |


In [16]:
drinks.loc[:, "beer_servings":"wine_servings"]

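|  | beer_servings | spirit_servings | wine_servings |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 89 | 132 | 54 |
| 2 | 25 | 0 | 14 |
| 3 | 245 | 138 | 312 |
| 4 | 217 | 57 | 45 |
| ... | ... | ... | ... |
| 188 | 333 | 100 | 3 |
| 189 | 111 | 2 | 1 |
| 190 | 6 | 0 | 0 |
| 191 | 32 | 19 | 4 |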


When using apply as a DataFrame method, we use the ‘axis’ parameter to specify the direction in which the function is applied. If we don’t specify the axis, it defaults to ‘axis=0’, which runs the function down the rows, i.e. once per column. We will use only the numeric columns, as they make more sense for the ‘max’ and ‘argmax’ functions. We can find the maximum value in each column using ‘axis=0’ and the maximum value in each row using ‘axis=1’. We can use NumPy's ‘argmax’ function along with ‘axis=1’ to find out which column holds the maximum value in each row. Since our subset DataFrame has three columns, the output maps to the column names as ‘beer_servings:0’, ‘spirit_servings:1’, ‘wine_servings:2’.

In [17]:
drinks.loc[:, "beer_servings":"wine_servings"].apply(max, axis=0)

beer_servings      376
spirit_servings    438
wine_servings      370
dtype: int64

In [18]:
drinks.loc[:, "beer_servings":"wine_servings"].apply(max, axis=1)

0        0
1      132
2       25
3      312
4      217
      ... 
188    333
189    111
190      6
191     32
192     64
Length: 193, dtype: int64

In [19]:
drinks.loc[:, "beer_servings":"wine_servings"].apply(np.argmax, axis=1).head()

0    0
1    1
2    0
3    2
4    0
dtype: int64
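The effect of ‘axis’ is easier to see on a frame small enough to check by eye. The sketch below uses three rows typed in by hand from the drinks output (with the country as index, which is our own choice for readability) and adds ‘idxmax’, a DataFrame method not used above, as one way to get the column name directly instead of its position.

```python
import numpy as np
import pandas as pd

# Three rows copied by hand from the drinks output above.
servings = pd.DataFrame(
    {
        "beer_servings": [0, 89, 245],
        "spirit_servings": [0, 132, 138],
        "wine_servings": [0, 54, 312],
    },
    index=["Afghanistan", "Albania", "Andorra"],
)

# axis=0: the function receives one column at a time -> one value per column.
col_max = servings.apply(max, axis=0)

# axis=1: the function receives one row at a time -> one value per row.
row_max = servings.apply(max, axis=1)

# Position (0, 1 or 2) of the largest value in each row.
top_pos = servings.apply(np.argmax, axis=1)

# Same question, answered with the column label instead of its position.
top_name = servings.idxmax(axis=1)

print(col_max, row_max, top_pos, top_name, sep="\n\n")
```

When several columns tie, as in the all-zero Afghanistan row, both ‘argmax’ and ‘idxmax’ report the first one.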

### 12.4. Using 'applymap' as DataFrame method

When we use ‘applymap’ as a DataFrame method, the function is applied to every element of the DataFrame rather than along rows or columns. Say we want to convert all integer values to float. We use the ‘applymap( )’ method and pass it the name of the function without parentheses.

In [20]:
drinks.loc[:, "beer_servings":"wine_servings"].applymap(float)

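|  | beer_servings | spirit_servings | wine_servings |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 |
| 1 | 89.0 | 132.0 | 54.0 |
| 2 | 25.0 | 0.0 | 14.0 |
| 3 | 245.0 | 138.0 | 312.0 |
| 4 | 217.0 | 57.0 | 45.0 |
| ... | ... | ... | ... |
| 188 | 333.0 | 100.0 | 3.0 |
| 189 | 111.0 | 2.0 | 1.0 |
| 190 | 6.0 | 0.0 | 0.0 |
| 191 | 32.0 | 19.0 | 4.0 |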


We can also change the existing DataFrame by assigning to it the result of applying the function to the original. The number of columns and the number of rows must match on both sides when doing so.

In [21]:
drinks.loc[:, "beer_servings":"wine_servings"] = drinks.loc[:, "beer_servings":"wine_servings"].applymap(float)
drinks.head()

|  | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0.0 | 0.0 | 0.0 | 0.0 | Asia |
| 1 | Albania | 89.0 | 132.0 | 54.0 | 4.9 | Europe |
| 2 | Algeria | 25.0 | 0.0 | 14.0 | 0.7 | Africa |
| 3 | Andorra | 245.0 | 138.0 | 312.0 | 12.4 | Europe |
| 4 | Angola | 217.0 | 57.0 | 45.0 | 5.9 | Africa |
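Two version-related points are worth a hedged note here. pandas 2.1 deprecated ‘applymap’ in favour of ‘DataFrame.map’, which does the same element-wise job. And in recent pandas versions, assigning through ‘.loc[:, ...]’ may try to keep the existing integer dtype. The sketch below, on a tiny hand-made frame, picks whichever element-wise method is available and uses ‘astype’ with plain column assignment as an alternative way to make the conversion stick. Treat it as one possible approach, not the only one.

```python
import pandas as pd

# Two rows copied by hand from the drinks output above.
servings = pd.DataFrame({"beer_servings": [0, 89], "wine_servings": [0, 54]})

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap.
elementwise = servings.map if hasattr(servings, "map") else servings.applymap
as_float = elementwise(float)

# For a plain type conversion, astype gives the same values in one call.
as_float_vec = servings.astype(float)

# Assigning by column selection replaces the columns, dtype included.
cols = ["beer_servings", "wine_servings"]
servings[cols] = servings[cols].astype(float)

print(as_float)
print(servings.dtypes)
```

An element-wise method is still the right tool when the function is not a simple type conversion, for example a custom formatting function.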


### 12.5. Using 'agg' as series and DataFrame methods

The aggregate method ‘agg( )’ can be used with both a pandas Series and a DataFrame. When working with a series, we select the series, call the ‘agg( )’ method on it, and pass it a list of function names. It returns a series with one value per function. When using it as a DataFrame method, we call ‘agg( )’ on the DataFrame and pass it the same kind of list. It returns a DataFrame with one row per function, holding the values for each column.

In [22]:
drinks.beer_servings.agg(["mean", "min", "max"])

mean    106.160622
min       0.000000
max     376.000000
Name: beer_servings, dtype: float64

In [23]:
drinks.agg(["mean", "min", "max"])

|  | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| min | Afghanistan | 0.0 | 0.0 | 0.0 | 0.0 | Africa |
| max | Zimbabwe | 376.0 | 438.0 | 370.0 | 14.4 | South America |
| mean |  | 106.160622 | 80.994819 | 49.450777 | 4.717098 |  |
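The shapes returned by ‘agg( )’ can be checked on a small numeric frame. The sketch below uses three hand-typed rows and also shows the dictionary form of ‘agg( )’, which is not used above, for choosing different functions per column.

```python
import pandas as pd

# Three rows copied by hand from the drinks output above (numeric columns only).
servings = pd.DataFrame({"beer_servings": [0, 89, 25], "wine_servings": [0, 54, 14]})

# Series + list of function names -> a series indexed by those names.
beer_stats = servings["beer_servings"].agg(["mean", "min", "max"])

# DataFrame + list of function names -> one row per function, one column per input column.
all_stats = servings.agg(["mean", "min", "max"])

# A dictionary selects functions column by column; combinations not requested are NaN.
picked = servings.agg({"beer_servings": ["min", "max"], "wine_servings": ["mean"]})

print(beer_stats, all_stats, picked, sep="\n\n")
```

Restricting the frame to numeric columns first avoids asking for the mean of a text column such as ‘country’.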


### 12.6. Comparing 'agg' and 'describe' methods

What we just did with the aggregate method can also be achieved with the ‘describe( )’ method. But notice that the aggregate method is much more flexible: we can limit the output to the values we want and can even apply functions that the 'describe( )' method does not calculate.

In [24]:
drinks.beer_servings.describe()

count    193.000000
mean     106.160622
std      101.143103
min        0.000000
25%       20.000000
50%       76.000000
75%      188.000000
max      376.000000
Name: beer_servings, dtype: float64

In [25]:
drinks.describe()

|  | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|
| count | 193.0 | 193.0 | 193.0 | 193.0 |
| mean | 106.160622 | 80.994819 | 49.450777 | 4.717098 |
| std | 101.143103 | 88.284312 | 79.697598 | 3.773298 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 20.0 | 4.0 | 1.0 | 1.3 |
| 50% | 76.0 | 56.0 | 8.0 | 4.2 |
| 75% | 188.0 | 128.0 | 59.0 | 7.2 |
| max | 376.0 | 438.0 | 370.0 | 14.4 |
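To make the flexibility of ‘agg( )’ concrete, here is a small sketch on five hand-typed beer values. It contrasts the fixed set of rows from ‘describe( )’ with an ‘agg( )’ call that asks only for ‘sum’, ‘median’ and ‘nunique’. These three function names are our own pick for illustration.

```python
import pandas as pd

# The first five beer_servings values, copied by hand from the output above.
beer = pd.Series([0, 89, 25, 245, 217], name="beer_servings")

# describe always returns the same fixed set of statistics for a numeric series.
summary = beer.describe()

# agg returns exactly what we ask for, including statistics describe leaves out.
custom = beer.agg(["sum", "median", "nunique"])

print(summary.index.tolist())
print(custom)
```

‘median’ matches the ‘50%’ row of ‘describe( )’, while ‘sum’ and ‘nunique’ have no counterpart there for a numeric series.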


Saying the aggregate method is more flexible doesn’t make it superior to the 'describe( )' method. If the 'describe( )' method gives us what we need, it is easier to use than typing out all the functions whose values it provides. One common parameter of the 'describe( )' method is ‘include’. By default, the describe method calculates values for numeric columns only. However, we can ask it to include particular data types, or all of them, as follows.

In [26]:
drinks.describe(include='all')

|  | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| count | 193 | 193.0 | 193.0 | 193.0 | 193.0 | 193 |
| unique | 193 |  |  |  |  | 6 |
| top | Nicaragua |  |  |  |  | Africa |
| freq | 1 |  |  |  |  | 53 |
| mean |  | 106.160622 | 80.994819 | 49.450777 | 4.717098 |  |
| std |  | 101.143103 | 88.284312 | 79.697598 | 3.773298 |  |
| min |  | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 25% |  | 20.0 | 4.0 | 1.0 | 1.3 |  |
| 50% |  | 76.0 | 56.0 | 8.0 | 4.2 |  |
| 75% |  | 188.0 | 128.0 | 59.0 | 7.2 |  |


Similar to ‘include’, another parameter is ‘exclude’. We can read ‘exclude’ as ‘except’: we get information for every data type except the ones we pass to ‘exclude’.

In [27]:
drinks.describe(exclude='number')

|  | country | continent |
|---|---|---|
| count | 193 | 193 |
| unique | 193 | 6 |
| top | Nicaragua | Africa |
| freq | 1 | 53 |


In [28]:
drinks.describe(exclude='object')

|  | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|
| count | 193.0 | 193.0 | 193.0 | 193.0 |
| mean | 106.160622 | 80.994819 | 49.450777 | 4.717098 |
| std | 101.143103 | 88.284312 | 79.697598 | 3.773298 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 20.0 | 4.0 | 1.0 | 1.3 |
| 50% | 76.0 | 56.0 | 8.0 | 4.2 |
| 75% | 188.0 | 128.0 | 59.0 | 7.2 |
| max | 376.0 | 438.0 | 370.0 | 14.4 |
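The effect of ‘include’ and ‘exclude’ on which columns are summarised can be verified on a tiny mixed-type frame. This is a minimal sketch with three hand-typed rows. The frame `df` is only an illustration and not the full drinks dataset.

```python
import pandas as pd

# Three rows copied by hand from the drinks output above: two text columns, one numeric.
df = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra"],
    "beer_servings": [89, 25, 245],
    "continent": ["Europe", "Africa", "Europe"],
})

# Default: only numeric columns are summarised.
numeric_only = df.describe()

# include="all": every column, with NaN where a statistic does not apply.
everything = df.describe(include="all")

# exclude="number": everything except the numeric columns.
text_only = df.describe(exclude="number")

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
print(text_only)
```

For text columns, ‘describe( )’ reports ‘count’, ‘unique’, ‘top’ and ‘freq’ in place of the numeric statistics.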
