# Apply functions to dataframes and series

In [1]:
import pandas as pd

In [2]:
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


This section covers three methods:
- map
- apply
- applymap

## Map is a Series method

Here's what we are going to use it for. Say you need to create a dummy variable for `Sex`, that is, translate the values male and female to 1 and 0. `map` lets you translate the existing values of a Series to a different set of values.

In [3]:
train['Sex_num'] = train.Sex.map({'female':0, 'male':1})

In [4]:
train.loc[0:4, ['Sex', 'Sex_num']]

,Sex,Sex_num
0,male,1
1,female,0
2,female,0
3,female,0
4,male,1


What we see here is that male has been translated to 1 and female has been translated to 0.

There are other uses of `map`, but this is the main one, and most people use `map` only for this purpose.
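Two details of `map` are easy to check on a small example: values missing from the dictionary become `NaN`, and `map` also accepts a function. The sketch below uses a made-up Series (`sex` is illustrative and not taken from the Titanic file).

```python
import pandas as pd

# Hypothetical sample data, not taken from the Titanic dataset
sex = pd.Series(['male', 'female', 'unknown'])

# Values absent from the dictionary ('unknown' here) are mapped to NaN
sex_num = sex.map({'female': 0, 'male': 1})

# map also accepts a function, which is called on each value
sex_upper = sex.map(str.upper)

print(sex_num)
print(sex_upper)
```

If unmapped values should raise attention rather than silently turn into `NaN`, checking `sex_num.isna().sum()` after the call is one simple safeguard.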

## Apply is a Series method and a DataFrame method

We will start with `apply` as a Series method. What does `apply` do? **It applies a function to each element in a Series.**

Let's say you want to calculate the length of each string in the `Name` column and store the result in a new column `Name_length` that contains an integer value.

In [5]:
train['Name_length'] = train.Name.apply(len)

Now let's compare `Name` and `Name_length` by using `loc` again.

In [6]:
train.loc[0:4,['Name', 'Name_length']]

,Name,Name_length
0,"Braund, Mr. Owen Harris",23
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",51
2,"Heikkinen, Miss. Laina",22
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",44
4,"Allen, Mr. William Henry",24


So the `apply` method, when used as a Series method, calls `len` on every element of the `Name` Series.
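For string lengths specifically, pandas also offers the string accessor `.str.len()`, which gives the same numbers without `apply`. A minimal sketch, assuming a two-element sample Series (`names` is a stand-in for `train.Name`):

```python
import pandas as pd

# Two names copied from the output above, used as a stand-in for train.Name
names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])

via_apply = names.apply(len)  # calls the built-in len() on each string
via_str = names.str.len()     # pandas string accessor, same result

print(via_apply.tolist())
print(via_str.tolist())
```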

It is more common to use `apply` with a NumPy function, so let's first import NumPy.

In [7]:
import numpy as np

Now let's say we look at the `Fare` column and want to round each value up to the nearest integer.

In [8]:
train['Fare_ceil'] = train.Fare.apply(np.ceil)

So, I have applied `np.ceil` function to `Fare` Series and saved the result to `Fare_ceil` column.

In [10]:
train.loc[0:4, ['Fare', 'Fare_ceil']]

,Fare,Fare_ceil
0,7.25,8.0
1,71.2833,72.0
2,7.925,8.0
3,53.1,54.0
4,8.05,9.0
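NumPy functions such as `np.ceil` can also be called on the whole Series directly, which returns the same values as `apply`. The sketch below uses the first three fares from the table above; the variable names are illustrative. It also contrasts `np.ceil` with `round`, since the two are easy to confuse.

```python
import numpy as np
import pandas as pd

# First three fares from the table above
fare = pd.Series([7.25, 71.2833, 7.925])

ceil_apply = fare.apply(np.ceil)  # element by element through apply
ceil_direct = np.ceil(fare)       # NumPy function called on the whole Series

# round(2) keeps two decimal places; it does not round up to an integer
rounded = fare.round(2)

print(ceil_apply.tolist())
print(ceil_direct.tolist())
print(rounded.tolist())
```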


Now, let's use `apply` to solve a harder problem.

Let's extract the last name of each person from the `Name` column. Initially, you might think that a string operation on the `Name` Series is enough. Here is that approach:

In [14]:
train.Name.str.split(',').head(20)

0                            [Braund,  Mr. Owen Harris]
1     [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                             [Heikkinen,  Miss. Laina]
3       [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                           [Allen,  Mr. William Henry]
5                                   [Moran,  Mr. James]
6                            [McCarthy,  Mr. Timothy J]
7                     [Palsson,  Master. Gosta Leonard]
8     [Johnson,  Mrs. Oscar W (Elisabeth Vilhelmina ...
9                [Nasser,  Mrs. Nicholas (Adele Achem)]
10                   [Sandstrom,  Miss. Marguerite Rut]
11                          [Bonnell,  Miss. Elizabeth]
12                    [Saundercock,  Mr. William Henry]
13                       [Andersson,  Mr. Anders Johan]
14              [Vestrom,  Miss. Hulda Amanda Adolfina]
15                  [Hewlett,  Mrs. (Mary D Kingcome) ]
16                              [Rice,  Master. Eugene]
17                      [Williams,  Mr. Charles 

What we get here is a Series of lists, each containing strings.

So, how do we get just the first part out of each list?

What we really wish to say is: "Hey pandas, take this result, `train.Name.str.split(',')`, and pull out the first list element from each Series element." So I am going to write a function that does this and then apply it to the Series.

In [15]:
def getElement(my_list, position):
    return my_list[position]

This function takes two arguments, a list and a position, and returns the list element at that position.

So we first split to create a Series of lists, and then apply `getElement` to each element of that Series.

In [22]:
train.Name.str.split(',').apply(getElement, position=0).head(20)

0            Braund
1           Cumings
2         Heikkinen
3          Futrelle
4             Allen
5             Moran
6          McCarthy
7           Palsson
8           Johnson
9            Nasser
10        Sandstrom
11          Bonnell
12      Saundercock
13        Andersson
14          Vestrom
15          Hewlett
16             Rice
17         Williams
18    Vander Planke
19       Masselmani
Name: Name, dtype: object

So I am saying: "pandas, take this Series, `train.Name.str.split(',')`, apply the `getElement` function to every element, and pass it the **keyword argument** `position=0`."

The result is a Series of strings holding the last names of the passengers.
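Extra arguments can reach the applied function in two ways: as keyword arguments, as above, or positionally through the `args` tuple of `Series.apply`. A small sketch on hand-built lists (the `parts` Series and the name `get_element` are illustrative):

```python
import pandas as pd

def get_element(my_list, position):
    """Return the item of my_list at the given position."""
    return my_list[position]

# Hypothetical stand-in for train.Name.str.split(',')
parts = pd.Series([
    ['Braund', ' Mr. Owen Harris'],
    ['Allen', ' Mr. William Henry'],
])

by_keyword = parts.apply(get_element, position=0)  # extra keyword argument
by_args = parts.apply(get_element, args=(0,))      # extra positional argument

print(by_keyword.tolist())
print(by_args.tolist())
```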

If you are familiar with lambda functions, you might be thinking that they can do the same job. That is what we will try next.

You can rewrite the function passed to `apply` as a lambda function, like below:

In [23]:
train.Name.str.split(',').apply(lambda x:x[0]).head(20)

0            Braund
1           Cumings
2         Heikkinen
3          Futrelle
4             Allen
5             Moran
6          McCarthy
7           Palsson
8           Johnson
9            Nasser
10        Sandstrom
11          Bonnell
12      Saundercock
13        Andersson
14          Vestrom
15          Hewlett
16             Rice
17         Williams
18    Vander Planke
19       Masselmani
Name: Name, dtype: object

If you are familiar with `lambda` functions, this will be clear: `lambda x: x[0]` takes each list and returns its first element. `lambda` functions are used a lot with `apply`.
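For this particular task, the `.str` accessor can also index into each list, so neither a helper function nor a lambda is strictly required. A minimal sketch, assuming a two-name sample in place of `train.Name`:

```python
import pandas as pd

# Two names copied from the output above, used as a stand-in for train.Name
names = pd.Series(['Braund, Mr. Owen Harris', 'Allen, Mr. William Henry'])
parts = names.str.split(',')

last_lambda = parts.apply(lambda x: x[0])  # the lambda approach
last_index = parts.str[0]                  # indexing through the .str accessor
last_get = parts.str.get(0)                # equivalent .str.get call

print(last_lambda.tolist())
print(last_index.tolist())
print(last_get.tolist())
```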

So, that is all with `apply` as a Series method. Now let's move on to apply as a DataFrame method.

## Apply as a DataFrame method

In [24]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')

> So, apply as a dataframe method applies a function along either axis of a dataframe.

In [25]:
drinks.head()

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [27]:
# Let's take a subset of the columns for this example
drinks.loc[:,'beer_servings':'wine_servings'].head()

,beer_servings,spirit_servings,wine_servings
0,0,0,0
1,89,132,54
2,25,0,14
3,245,138,312
4,217,57,45


So, if I am using something like this: `drinks.loc[:,'beer_servings':'wine_servings'].apply()`, then I am using the apply method as a DataFrame method and not as a Series method.

In [28]:
drinks.loc[:,'beer_servings':'wine_servings'].apply(max, axis=0)

beer_servings      376
spirit_servings    438
wine_servings      370
dtype: int64

In [31]:
drinks.loc[:,'beer_servings':'wine_servings'].apply(max, axis=1).head(10)

0      0
1    132
2     25
3    312
4    217
5    128
6    221
7    179
8    261
9    279
dtype: int64
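The `axis` argument decides what the function receives: with `axis=0` it is called once per column, and with `axis=1` once per row. The sketch below rebuilds the first three rows of the servings columns by hand (the name `servings` is illustrative) and compares `apply(max, ...)` with the built-in `DataFrame.max`.

```python
import pandas as pd

# First three rows of the servings columns shown above
servings = pd.DataFrame({
    'beer_servings': [0, 89, 25],
    'spirit_servings': [0, 132, 0],
    'wine_servings': [0, 54, 14],
})

col_max = servings.apply(max, axis=0)  # max is called on each column
row_max = servings.apply(max, axis=1)  # max is called on each row

# The built-in reductions return the same numbers
col_max_builtin = servings.max(axis=0)
row_max_builtin = servings.max(axis=1)

print(col_max.tolist())
print(row_max.tolist())
```

For simple reductions like this one, the built-in methods are usually the more direct choice; `apply` is more useful when no built-in method exists for the function.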

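Sometimes you want to know which column holds the maximum in each row. Here, `np.argmax` is useful. Don't fret over it; it is just FYI. The output below shows column labels, which is what older pandas releases return here; recent releases return integer positions from `np.argmax`.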

In [33]:
drinks.loc[:,'beer_servings':'wine_servings'].apply(np.argmax, axis=1).head(10)

0      beer_servings
1    spirit_servings
2      beer_servings
3      wine_servings
4      beer_servings
5    spirit_servings
6      wine_servings
7    spirit_servings
8      beer_servings
9      beer_servings
dtype: object
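Another way to get the label of the column holding each row's maximum is `DataFrame.idxmax(axis=1)`. A minimal sketch on the first three rows of the servings columns, rebuilt by hand (the name `servings` is illustrative):

```python
import pandas as pd

# First three rows of the servings columns shown above
servings = pd.DataFrame({
    'beer_servings': [0, 89, 25],
    'spirit_servings': [0, 132, 0],
    'wine_servings': [0, 54, 14],
})

# Label of the column with the largest value in each row;
# ties (row 0, all zeros) resolve to the first column
top_drink = servings.idxmax(axis=1)

print(top_drink.tolist())
```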

## applymap is a DataFrame method

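`applymap` applies a function to every element of a DataFrame. For example, if I wanted to convert every element of the DataFrame to a float value, this is how I would do it. (In pandas 2.1 and later, `applymap` is deprecated in favor of `DataFrame.map`, which behaves the same way.)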

In [35]:
drinks.loc[:,'beer_servings':'wine_servings'].applymap(float).head(10)

,beer_servings,spirit_servings,wine_servings
0,0.0,0.0,0.0
1,89.0,132.0,54.0
2,25.0,0.0,14.0
3,245.0,138.0,312.0
4,217.0,57.0,45.0
5,102.0,128.0,45.0
6,193.0,25.0,221.0
7,21.0,179.0,11.0
8,261.0,72.0,212.0
9,279.0,75.0,191.0


You can also use this to overwrite the existing DataFrame columns.

In [36]:
drinks.loc[:,'beer_servings':'wine_servings'] = drinks.loc[:,'beer_servings':'wine_servings'].applymap(float)

In [37]:
drinks.head()

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0.0,0.0,0.0,0.0,Asia
1,Albania,89.0,132.0,54.0,4.9,Europe
2,Algeria,25.0,0.0,14.0,0.7,Africa
3,Andorra,245.0,138.0,312.0,12.4,Europe
4,Angola,217.0,57.0,45.0,5.9,Africa


We have essentially converted the datatype of these columns.

In [38]:
drinks.dtypes

country                          object
beer_servings                   float64
spirit_servings                 float64
wine_servings                   float64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
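When the goal is only a type conversion, `astype` is a more direct alternative to an element-wise function. The sketch below assumes a tiny hand-built frame (`drinks_small` and `cols` are illustrative names) and converts just the selected columns.

```python
import pandas as pd

# Hypothetical two-row stand-in for the drinks DataFrame
drinks_small = pd.DataFrame({
    'country': ['Afghanistan', 'Albania'],
    'beer_servings': [0, 89],
    'wine_servings': [0, 54],
})

cols = ['beer_servings', 'wine_servings']

# astype accepts a {column: dtype} mapping and returns a new DataFrame;
# columns not listed (country) keep their original dtype
converted = drinks_small.astype({col: float for col in cols})

print(converted.dtypes)
```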