# Part 3: Functions and descriptive statistics

Last week we learned how to select rows, columns and individual elements from a dataframe. In this week's tutorial, we will explore some common summary functions that let us quickly draw insights about the different features in a dataframe.

As in last week's tutorial, we will be working with the [Titanic](https://www.kaggle.com/c/titanic/data) dataset from Kaggle.

## Import pandas library

In [1]:
import pandas as pd

## Import data

In [2]:
data = pd.read_csv("titanic/train.csv")
data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
data.shape

(891, 12)

## Summary functions

Summary functions like describe and info give a high-level summary of our data.

Let's see how they work.

In [4]:
# Describe function on numerical variable

data['Fare'].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

In [4]:
# Describe function on text variable

data['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

In [6]:
# Info function

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
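
The examples above call describe on one column at a time. As a minimal sketch, describe can also be called on a whole dataframe. The `toy` dataframe below is a made-up stand-in for the Titanic data (its values are assumptions, not taken from the dataset), used only so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical miniature dataframe; the values are invented for illustration
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", "S", None],
})

# By default, describe() on a dataframe summarises only the numeric columns
numeric_summary = toy.describe()

# include="all" adds the non-numeric columns to the summary as well
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Statistics that do not apply to a column (for example the mean of Embarked) show up as NaN in the combined summary.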


## Unique and value counts function

In [12]:
# How many unique Embarked values are there?

data['Embarked'].nunique()

3

In [13]:
# What are the unique Embarked values?

data['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [14]:
# What are the counts of those individual values?

data['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64
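
Notice that unique lists nan as a value, while nunique and value_counts leave missing values out by default. The sketch below shows how the `dropna` and `normalize` parameters change that behaviour. The `ports` series is a small made-up example, not the real Embarked column.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
ports = pd.Series(["S", "C", "S", "Q", np.nan, "S"], name="Embarked")

# Missing values are excluded by default
counts = ports.value_counts()

# dropna=False keeps the missing values as their own category
counts_with_nan = ports.value_counts(dropna=False)

# normalize=True returns proportions instead of raw counts
shares = ports.value_counts(normalize=True)

n_unique = ports.nunique()                       # 3, NaN is not counted
n_unique_with_nan = ports.nunique(dropna=False)  # 4, NaN is counted

print(counts_with_nan)
print(shares)
```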

## Descriptive statistics

In [15]:
# What is the oldest age?

data['Age'].max()

80.0

In [11]:
# Who is that passenger?
# Recall loc function from last week

data.loc[data['Age'] == 80, :]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
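
Filtering on the literal value 80 works, but it means typing the result of max back in by hand. One possible alternative, sketched here on a made-up `toy` dataframe with placeholder names, is idxmax and idxmin, which return the index label of the largest and smallest value.

```python
import pandas as pd

# Hypothetical dataframe; names and ages are placeholders
toy = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, 80.0, 0.42],
})

# idxmax returns the index label of the row holding the maximum Age
oldest_label = toy["Age"].idxmax()
oldest_row = toy.loc[oldest_label]

# idxmin does the same for the minimum
youngest_row = toy.loc[toy["Age"].idxmin()]

print(oldest_row)
print(youngest_row)
```

Keep in mind that idxmax and idxmin return only the first matching label, whereas the boolean filter returns every row that shares the extreme value.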


In [16]:
# What is the youngest age?

data['Age'].min()

0.42

In [17]:
# Who is that passenger?

data.loc[data['Age'] == 0.42, :]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C


In [18]:
# What is the average age?

data['Age'].mean()

29.69911764705882

In [19]:
# What is the median fare?

data['Fare'].median()

14.4542

In [20]:
# What is the most frequent Embarked value?
# We can cross check this with the value counts function above
# This should return 'S' as the answer

data['Embarked'].mode()

0    S
Name: Embarked, dtype: object

There are more functions for descriptive statistics than I have shown here. If you are interested, have a look at [this page](https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm).
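
As a small taste of those other functions, here is a sketch using std, quantile and agg. The `fares` series holds made-up values rather than the real Fare column.

```python
import pandas as pd

# Hypothetical fares, invented for illustration
fares = pd.Series([7.25, 71.28, 7.93, 53.10, 8.05], name="Fare")

# Sample standard deviation (pandas divides by n - 1 by default)
spread = fares.std()

# First quartile, i.e. the 25% row that describe() reports
q1 = fares.quantile(0.25)

# agg computes several statistics in one call
summary = fares.agg(["min", "median", "max"])

print(spread)
print(q1)
print(summary)
```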

## Map and apply function

Both map and apply help us transform our data. Map is a series method, that is, it only works on a single column, whereas apply works on both a single column and an entire dataframe.

Because this is a beginner's course in both pandas and Python, I want to go over some basics about functions before we get into how to use map and apply.

So what is a function? The easiest way to think about a function is that it takes in one or more inputs and returns an output. For example, y = x + 1 is a function. It takes in a number x and returns that number plus one.

The methods for descriptive statistics in the section above, such as max, min and mean, are examples of functions that are already built into pandas so that we don't have to write them ourselves. But what if we have come up with our own transformation that we would like to apply to our dataframe? This is where map and apply come in.

So what's the game plan?
1. First, we have to write out our desired function.
2. Then, we need to apply that function over a series in our dataframe (via map) or over the entire dataframe (via apply).

In Python, there are two ways to write functions that you should know of. The first is via def and the second is via a lambda function, which is a more compact way to write a short, single-expression function. In this next section, I will teach you both of these methods.

In [21]:
# Say we want to write a function which computes the cube of a number
# Method 1: def

def cube(number):
    # Named number rather than input so we don't shadow Python's built-in input()
    output = number ** 3
    return output

cube(2)

8

In [22]:
# Method 2: lambda function

cube = lambda number: number ** 3
cube(3)

27
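
To tie this back to the earlier y = x + 1 example, the sketch below writes that function both ways and then passes each version to map on a small made-up series. The names `add_one` and `values` are only illustrative. In practice a lambda is usually written inline like this instead of being given a name first.

```python
import pandas as pd

# Method 1: def
def add_one(x):
    return x + 1

# Hypothetical series of numbers
values = pd.Series([1, 2, 3])

# Pass the named function to map
with_def = values.map(add_one)

# Method 2: write the lambda inline, right where it is needed
with_lambda = values.map(lambda x: x + 1)

print(with_def.tolist())
print(with_lambda.tolist())
```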

Now that we have learned how to write functions, let's move on to applying functions to our dataframe.

Suppose we would like to extract the last name from the Name column of our dataset. This requires a small string method called split, but don't worry, I will explain it very clearly in the video tutorial.

In [19]:
# Define our function
def extractLastName(name):
    token = name.split(',')
    return token[0]

# Apply the function to the Name column and assign the result to a new column called Last Name
# (map would give the same result here because we are working on a single column)
data['Last Name'] = data['Name'].apply(extractLastName)

# Let's have a look at the first 5 rows
data.loc[:4, ['Name', 'Last Name']]

,Name,Last Name
0,"Braund, Mr. Owen Harris",Braund
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Cumings
2,"Heikkinen, Miss. Laina",Heikkinen
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Futrelle
4,"Allen, Mr. William Henry",Allen
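
To make the earlier distinction between map and apply concrete, here is a minimal sketch on a two-row `toy` dataframe that reuses two names from the output above. The `family_size` calculation is a hypothetical example of a row-wise transformation and is not part of the original tutorial.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "SibSp": [1, 0],
    "Parch": [0, 0],
})

def extract_last_name(name):
    # Everything before the first comma is the last name
    return name.split(",")[0]

# On a single column, map and apply give the same result
via_map = toy["Name"].map(extract_last_name)
via_apply = toy["Name"].apply(extract_last_name)

# On a dataframe, apply with axis=1 passes one row at a time to the function,
# so the function can combine several columns (hypothetical family size)
family_size = toy.apply(lambda row: row["SibSp"] + row["Parch"] + 1, axis=1)

# pandas also offers built-in string methods that avoid writing a function
via_str = toy["Name"].str.split(",").str[0]

print(via_map.tolist())
print(family_size.tolist())
```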


## Bonus tip

You can also use the map function to encode categorical variables. This is particularly useful when you are preparing your dataset for machine learning. Most machine learning algorithms cannot learn from non-numeric inputs, so we first have to turn our categorical variables into numbers before fitting the model to our data.

Examples of categorical variables in our Titanic dataset are the Pclass, Sex and Embarked columns.

Don't worry if you are not familiar with machine learning. This section is merely to illustrate how you can encode a column using the map function.

Suppose we want to encode the Sex column such that male is assigned 1 and female is assigned 0.

In [20]:
# Encode male as 1 and female as 0
data['Encoded Sex'] = data['Sex'].map({'male':1, 'female':0})

# Show the first 5 rows of Sex and Encoded Sex
data.loc[:4, ['Sex', 'Encoded Sex']]

,Sex,Encoded Sex
0,male,1
1,female,0
2,female,0
3,female,0
4,male,1
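
One thing to watch out for when mapping with a dictionary: any value that is not a key in the dictionary becomes NaN. The sketch below uses a made-up series containing an "unknown" entry (an assumption for illustration, since the Sex column in the output above only shows male and female) to show how to check for that.

```python
import pandas as pd

# Hypothetical series; "unknown" is an invented value with no dictionary entry
sex = pd.Series(["male", "female", "unknown"], name="Sex")

encoded = sex.map({"male": 1, "female": 0})

# Values without a matching key come back as NaN,
# which also turns the result into a float column
missing = encoded.isna().sum()

print(encoded)
print(missing)
```

Counting the missing values right after encoding is a cheap way to catch typos or unexpected categories.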


An alternative way to accomplish this is via a pandas function called get_dummies. Instead of a single encoded column, it creates one indicator column per category.

In [21]:
pd.get_dummies(data['Sex'])

,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
...,...,...
886,0,1
887,1,0
888,1,0
889,0,1
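
Depending on your pandas version, get_dummies may return True/False values instead of the 0/1 shown above. As a sketch on a made-up three-row `toy` dataframe, the `dtype` parameter requests integers explicitly, `drop_first` keeps a single column when there are only two categories, and join attaches the result to the original dataframe.

```python
import pandas as pd

# Hypothetical miniature dataframe
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})

# dtype=int asks for 0/1 integers regardless of the version's default
dummies = pd.get_dummies(toy["Sex"], dtype=int)

# drop_first=True drops the first category (female), leaving one male column
single = pd.get_dummies(toy["Sex"], dtype=int, drop_first=True)

# Attach the indicator columns to the original dataframe
encoded = toy.join(dummies)

print(encoded)
print(single)
```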
