# Using Custom Functions in Pandas  

**Why use custom functions in pandas?**

Imagine you need to apply a function to an entire dataset (e.g. convert all upper case letters to lower case for every column that contains strings). You wouldn't want to do it manually like this:
```
df["col1"] = df["col1"].str.lower()
df["col2"] = df["col2"].str.lower()
...
df["col100"] = df["col100"].str.lower()
```
This is fine for a couple of columns, but we need a more efficient way when there are many more.

## 1.Setup

In [1]:
#load modules
import numpy as np
import pandas as pd

Let's use a subset of the Titanic data frame.

In [2]:
#load data
titanic = pd.read_csv("titanic.csv")
titanic.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|------|-------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 8.05 |  | S |


## 2.Calculate Percentage of Missing Values in Each Column

### 2.1.Use Built-in Functions/Methods

In [3]:
titanic.isnull().mean()

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
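
This works because `isnull()` returns a boolean frame and `mean()` treats `True` as 1 and `False` as 0, so each column's mean is its share of missing values. A minimal sketch on a small made-up frame (`toy` and its columns are illustrative and not part of the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
})

mask = toy.isnull()          # boolean frame: True where a value is missing
missing_share = mask.mean()  # True counts as 1, False as 0

print(missing_share["a"])  # 0.5  (2 of 4 values missing)
print(missing_share["b"])  # 0.25 (1 of 4 values missing)
```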

### 2.2.Use Custom Functions to calculate missing percentage
#### 2.2.1.External Function

In [4]:
#define our custom function
def calc_missing_pct(series):
    return series.isnull().mean()

titanic.apply(calc_missing_pct)

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
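
One advantage of a named function is that it can take extra parameters. As a sketch, keyword arguments passed to `apply` are forwarded to the function; the frame `toy`, the helper `exceeds_missing_share` and the `threshold` parameter below are hypothetical names chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame and helper, used only for illustration
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
})

def exceeds_missing_share(series, threshold):
    # True when the column's share of missing values is above the threshold
    return series.isnull().mean() > threshold

# Extra keyword arguments given to apply are forwarded to the function
flags = toy.apply(exceeds_missing_share, threshold=0.4)

print(flags.tolist())  # [True, False]
```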

#### 2.2.2.Anonymous Functions(Lambda)

In [5]:
titanic.apply(lambda series: series.isnull().mean())

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
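
By default `DataFrame.apply` calls the function once per column (`axis=0`) and passes each column as a Series; with `axis=1` it passes each row instead. A small sketch on a made-up frame (`toy` and `count_missing` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

def count_missing(series):
    # series is one column (axis=0) or one row (axis=1)
    return int(series.isnull().sum())

per_column = toy.apply(count_missing)       # axis=0 is the default
per_row = toy.apply(count_missing, axis=1)  # one call per row

print(per_column.tolist())  # [1, 2]
print(per_row.tolist())     # [1, 2, 0]
```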

### 2.3.More on Lambda  

#### Comparison Between For Loop, List Comprehension and Map/Lambda

Let's do a simple calculation: add one to each number in the list \[0,1,2,3,4\].


In [6]:
my_list = [0,1,2,3,4]

**1. For Loop**

In [7]:
my_list2 = []
for x in my_list:
    my_list2.append(x + 1)
my_list2

[1, 2, 3, 4, 5]

**2. List Comprehension**

In [8]:
[x + 1 for x in my_list]

[1, 2, 3, 4, 5]

**3. Lambda and Map**

In [9]:
list(map(lambda x: x + 1,my_list))

[1, 2, 3, 4, 5]

`map()` does not return a list. It returns a lazy map object (an iterator), so we need to wrap it in `list()` to get the results as a list:

In [10]:
map(lambda x: x + 1,my_list)

<map at 0x2372a330080>
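
Because the map object is an iterator, it computes values only when asked and can be consumed only once. A short sketch (the variable names are illustrative):

```python
my_list = [0, 1, 2, 3, 4]

lazy = map(lambda x: x + 1, my_list)  # nothing is computed yet

first_pass = list(lazy)   # consumes the iterator
second_pass = list(lazy)  # already exhausted, so this is empty

print(first_pass)   # [1, 2, 3, 4, 5]
print(second_pass)  # []
```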

**4. Comparison between JavaScript and Python on map and lambda**

JavaScript:

```javascript
var my_list = [0,1,2,3,4];
my_list.map(function(x){return x + 1});
// Or
my_list.map(x => x + 1);

```

Python:  

```python
my_list = [0,1,2,3,4]
list(map(lambda x: x + 1,my_list))
```

Python Pandas:

```python
my_series = pd.Series([0,1,2,3,4])
my_series.map(lambda x: x + 1)
# Or
my_series + 1
```
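
Both pandas forms give the same result here. As a rough guide, `Series.map` with a lambda calls Python once per element, while `my_series + 1` is a single vectorized operation, so the vectorized form is usually the better choice when one exists. A quick check:

```python
import pandas as pd

my_series = pd.Series([0, 1, 2, 3, 4])

mapped = my_series.map(lambda x: x + 1)  # Python-level call per element
vectorized = my_series + 1               # single vectorized operation

print(mapped.tolist())      # [1, 2, 3, 4, 5]
print(vectorized.tolist())  # [1, 2, 3, 4, 5]
```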

## 3.Get the Second Value in Each Column

In [16]:
titanic.apply(lambda x: x.iloc[1])

PassengerId                                                    2
Survived                                                       1
Pclass                                                         1
Name           Cumings, Mrs. John Bradley (Florence Briggs Th...
Sex                                                       female
Age                                                           38
Fare                                                     71.2833
Cabin                                                        C85
Embarked                                                       C
dtype: object
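
Positional and label-based lookups only agree when the index is the default 0, 1, 2, and so on. A sketch with a made-up frame whose index starts at 1 shows the difference between `.iloc` (position) and `.loc` (label); `toy` is an illustrative name:

```python
import pandas as pd

# Hypothetical frame whose index labels are 1, 2, 3 instead of 0, 1, 2
toy = pd.DataFrame(
    {"a": [10, 20, 30], "b": [100, 200, 300]},
    index=[1, 2, 3],
)

by_position = toy.apply(lambda col: col.iloc[1])  # second row, by position
by_label = toy.apply(lambda col: col.loc[1])      # row labelled 1 (the first row here)

print(by_position.tolist())  # [20, 200]
print(by_label.tolist())     # [10, 100]
```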

## 4.Get Data Type of Each Column

In [12]:
titanic.apply(lambda x: x.dtype)

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
Fare           float64
Cabin           object
Embarked        object
dtype: object
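
The same information is available without `apply` through the built-in `dtypes` attribute. A small sketch comparing the two on a made-up frame (`toy` is an illustrative name):

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"n": [1, 2], "f": [1.5, 2.5], "s": ["a", "b"]})

via_apply = toy.apply(lambda col: col.dtype)  # custom function, one call per column
via_attribute = toy.dtypes                    # built-in equivalent

print(list(via_apply) == list(via_attribute))  # True
```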

## 5. Data Manipulation Using Custom Function

Our next task is slightly more complicated:  
1. For all string columns, calculate the number of unique values.
2. For all numeric columns, calculate the average.

In [19]:
def clean_data(series):
    # Numeric columns: report the mean (two decimals); string columns: count unique values
    if str(series.dtype) in ["float64", "int64"]:
        return f"mean: {series.mean():.2f}"
    elif str(series.dtype) in ["object"]:
        return f"nunique:{series.nunique()}"
    else:
        data_type = str(series.dtype)
        return f"Unexpected Data Type: {data_type}"

titanic.apply(clean_data)

PassengerId    mean: 446.00
Survived         mean: 0.38
Pclass           mean: 2.31
Name            nunique:891
Sex               nunique:2
Age             mean: 29.70
Fare            mean: 32.20
Cabin           nunique:147
Embarked          nunique:3
dtype: object
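
The dtype check above matches only `float64`, `int64` and `object`, so a column stored as `int32` or `float32` would fall into the "Unexpected Data Type" branch. One possible alternative, sketched here under the assumption that any numeric width should be averaged, is `pandas.api.types.is_numeric_dtype`. Note that it also treats boolean columns as numeric. The frame `toy` and the function `summarize` are hypothetical names:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame with non-64-bit numeric columns
toy = pd.DataFrame({
    "small_int": pd.Series([1, 2, 3], dtype="int32"),
    "price": pd.Series([1.5, 2.5, 3.5], dtype="float32"),
    "label": ["a", "b", "a"],
})

def summarize(series):
    # is_numeric_dtype covers every integer and float width, not just 64-bit
    if is_numeric_dtype(series):
        return f"mean: {series.mean():.2f}"
    return f"nunique: {series.nunique()}"

summary = toy.apply(summarize)

print(summary.tolist())  # ['mean: 2.00', 'mean: 2.50', 'nunique: 2']
```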

In the following example, we add one to each numerical column.

In [14]:
def add_one_to_numerical(series):
    # Add one to numeric columns; return every other column unchanged
    if str(series.dtype) in ["float64", "int64"]:
        return series + 1
    else:
        return series

titanic.apply(add_one_to_numerical).head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|------|-------|----------|
| 0 | 2 | 1 | 4 | Braund, Mr. Owen Harris | male | 23.0 | 8.25 |  | S |
| 1 | 3 | 2 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 39.0 | 72.2833 | C85 | C |
| 2 | 4 | 2 | 4 | Heikkinen, Miss. Laina | female | 27.0 | 8.925 |  | S |
| 3 | 5 | 2 | 2 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 36.0 | 54.1 | C123 | S |
| 4 | 6 | 1 | 4 | Allen, Mr. William Henry | male | 36.0 | 9.05 |  | S |
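
Note that `apply` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a variable if you want to keep it. A minimal sketch on a made-up frame (`toy` and `result` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"n": [1, 2], "s": ["a", "b"]})

def add_one_to_numerical(series):
    # Add one to numeric columns; return every other column unchanged
    if str(series.dtype) in ["float64", "int64"]:
        return series + 1
    return series

result = toy.apply(add_one_to_numerical)

print(toy["n"].tolist())     # [1, 2]  (original is unchanged)
print(result["n"].tolist())  # [2, 3]
print(result["s"].tolist())  # ['a', 'b']
```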


## Exercises
Please use lambda and/or `apply` for data wrangling. The following functions should work on any dataset, so avoid referring to specific column names.
1. Convert all upper case letters to lower case letters.
1. Round all numbers to the nearest ten.
1. Add the column name before each value for columns that contain strings (see the example below).
      
| PassengerId | Survived | Pclass | Name                                                     | Sex        | Age | Fare    | Cabin      | Embarked   |
|-------------|----------|--------|----------------------------------------------------------|------------|-----|---------|------------|------------|
| 1           | 0        | 3      | Name:Braund, Mr. Owen Harris                             | Sex:male   | 22  | 7.25    |            | Embarked:S |
| 2           | 1        | 1      | Name:Cumings, Mrs. John Bradley (Florence Briggs Thayer) | Sex:female | 38  | 71.2833 | Cabin:C85  | Embarked:C |
| 3           | 1        | 3      | Name:Heikkinen, Miss. Laina                              | Sex:female | 26  | 7.925   |            | Embarked:S |
| 4           | 1        | 1      | Name:Futrelle, Mrs. Jacques Heath (Lily May Peel)        | Sex:female | 35  | 53.1    | Cabin:C123 | Embarked:S |
| 5           | 0        | 3      | Name:Allen, Mr. William Henry                            | Sex:male   | 35  | 8.05    |            | Embarked:S |

## Readings

1. Map, Reduce and Filter  
http://book.pythontips.com/en/latest/map_filter.html  