# 3: Functions and descriptive statistics

Last week we learned how to select rows, columns and elements from a dataframe. In this week's tutorial, we will explore some common summary functions that let us quickly draw insights about the different features in a dataframe.

As we did last week, we will be working with the [Titanic](https://www.kaggle.com/c/titanic/data) dataset from Kaggle.

## Import pandas library

In [None]:
import pandas as pd

## Import data

In [None]:
data = pd.read_csv("train.csv")
data.head()

In [None]:
data.shape

## Summary functions

Summary functions like describe and info give a high-level summary of our data.

Let's see how they work.
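
Before running them on the Titanic data, here is a minimal sketch on a tiny made-up dataframe. The `toy` dataframe and its `age` and `port` columns are assumptions for illustration only and are not taken from train.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "port": ["S", "C", "S", None],
})

# info() prints column names, non-null counts and dtypes (it returns None)
toy.info()

# describe() summarises only the numeric columns by default
numeric_summary = toy.describe()
print(numeric_summary)

# include="all" also summarises text columns (count, unique, top, freq)
full_summary = toy.describe(include="all")
print(full_summary)
```

Note that missing values are left out of the counts, which is why `count` can be smaller than the number of rows.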

In [None]:
data["Parch"].info()

In [None]:
data.describe()

In [None]:
data.describe(include='all')

In [None]:
# Describe function on numerical variable

data['Fare'].describe()

In [None]:
# Describe function on text variable

data['Embarked'].describe()

In [None]:
data["Sex"].describe()

In [None]:
data['Embarked']

## Unique and value counts function

In [None]:
# How many unique Embarked values are there?

data['Embarked'].nunique()

In [None]:
# What are the unique Embarked values?

data['Embarked'].unique()

In [None]:
# What are the counts of those individual values?

data['Embarked'].value_counts(ascending= True)

In [None]:
data['Embarked'].value_counts(dropna= False)

In [None]:
data['Embarked'].value_counts(normalize= True)  # proportion = count of each value / total number of non-missing values

In [None]:
data['Embarked'].count()
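
To see how the options used above differ, here is a small sketch on a made-up Series. The `ports` values are hypothetical and chosen only so that the arithmetic is easy to check by hand.

```python
import pandas as pd

# Hypothetical values, including one missing entry
ports = pd.Series(["S", "C", "S", None, "Q", "S"])

counts = ports.value_counts()                     # missing values are skipped
with_missing = ports.value_counts(dropna=False)   # missing values get their own row
shares = ports.value_counts(normalize=True)       # counts divided by the non-missing total

print(counts)
print(with_missing)
print(shares)

# count() ignores missing values, len() does not
print(ports.count(), len(ports))
```

With `normalize=True` the results are proportions that add up to 1, so "S" appears as 3 out of 5 non-missing values.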

## Descriptive statistics

In [None]:
# What is the oldest age?

print(data['Age'].max())

In [None]:
data['Age'].min()

In [None]:
# Who is that passenger?
# Recall loc function from last week

data.loc[data['Age'] == 80, :]

In [None]:
# Who is the youngest passenger? (0.42 is the minimum age found above, typed in by hand)

data.loc[data['Age'] == 0.42, :]

In [None]:
# Same lookup, but computing the minimum age instead of typing it in by hand

data.loc[data['Age'] == data['Age'].min(), :]

In [None]:
# What is the average age?

data['Age'].mean()   # sum of the non-missing ages / number of non-missing ages

In [None]:
data.loc[data['Age'] == 29, :].count()

In [None]:
# What is the median fare?

data['Fare'].median()

In [None]:
# What is the most frequent Embarked value?
# We can cross check this with the value counts function above
# This should return 'S' as the answer

data['Embarked'].mode()[0]

There are more functions for descriptive statistics than what I have shown here. If you are interested, you can have a look at [this page](https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm).
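
As a small taste of those other functions, here is a sketch on a made-up Series. The `fares` values are hypothetical and are not rows from train.csv.

```python
import pandas as pd

# Hypothetical values, including one missing entry
fares = pd.Series([7.25, 71.28, 7.92, 53.10, None])

print(fares.sum())            # total of the non-missing values
print(fares.std())            # sample standard deviation (divides by n - 1)
print(fares.quantile(0.25))   # first quartile
print(fares.quantile(0.75))   # third quartile

# agg computes several statistics in one call
summary = fares.agg(["min", "median", "max"])
print(summary)
```

Like mean and median, these methods skip missing values by default.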

## Map and apply function

Both map and apply help us transform our data. map is a Series method, that is, it only works on a single column, whereas apply works on a single column as well as on an entire dataframe.

Because this is a beginner's course on pandas as well as Python, I want to first go over some basics about functions before we get into how we can use map and apply.

So what is a function? The easiest way to think about a function is that it takes in one or more variables and returns an output. For example, y = x + 1 is a function. It takes in a number x and returns that number plus one.

The methods for descriptive statistics in the section above, such as max, min and mean, are all examples of functions that are already built into pandas, so we don't have to write them ourselves. But what if we have come up with our own transformation that we would like to apply to our dataframe? This is where map and apply come in.

So what's the game plan?
1. First, we have to write out our desired function.
2. Then, we need to apply that function over a series in our dataframe (via map) or over the entire dataframe (via apply).

In Python, there are two ways to write functions that you should know about. The first is via def and the second is via a lambda function, which is a more compact way to write a function that fits in a single expression. In this next section, I will teach you both methods.

In [None]:
# Say we want to write a function which computes the cube of a number
# Method 1: def

def cube(n):
    output = n ** 3
    return output

cube(2)

In [None]:
# Method 2: lambda function

cube = lambda n: n ** 3
cube(3)

Now that we have learned how to write functions, let's move on to applying functions to our dataframe.
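
The two-step game plan can be sketched on a tiny made-up dataframe first. The `toy` dataframe below is hypothetical, and the cube function is defined again so that the example runs on its own.

```python
import pandas as pd

# Step 1: write the function
def cube(n):
    return n ** 3

# Hypothetical toy dataframe
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Step 2a: map runs the function on every value of one column
a_cubed = toy["a"].map(cube)

# Step 2b: apply on a dataframe passes each column (a Series) to the function
all_cubed = toy.apply(cube)

print(a_cubed.tolist())
print(all_cubed)
```

Neither call changes `toy` itself. Each returns a new object, which is why the results are assigned to new names.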

Suppose we would like to extract the last name from the Name column of our dataset. This requires a string method called split, which breaks a string into a list of pieces at a chosen separator. Don't worry, I will explain it in the video tutorial.
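
As a quick sketch of the idea, here is split used on a single name written in the same "Last, Title. First" layout as the Name column. The variable names are chosen for illustration only.

```python
# One name in the "Last, Title. First" layout used by the Name column
full_name = "Braund, Mr. Owen Harris"

# Splitting at the comma separates the last name from the rest
parts = full_name.split(",")
last_name = parts[0]

# Splitting the rest at the full stop isolates the title;
# strip() removes the space left over after the comma
title = parts[1].split(".")[0].strip()

print(parts)
print(last_name)
print(title)
```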

In [None]:
str1 = "Ahmed Hassan Amr Mohamed"
str2 = str1.split(" ")
print(str2[0])

In [None]:
data["Name"]

## Difference between apply and map functions

## apply
apply is used when you want to run a function on each value of a Series (a variable or column).


In [None]:
def extractTitle(name):
    # "Braund, Mr. Owen Harris" -> ["Braund", " Mr. Owen Harris"]
    token = name.split(',')
    # " Mr. Owen Harris" -> [" Mr", " Owen Harris"]
    token2 = token[1].split('.')
    # strip() removes the leading space left over from the comma split
    return token2[0].strip()

# Apply the function to the Name column and store the result in a new column called Titles
data['Titles'] = data['Name'].apply(extractTitle)  # using a user-defined function
data['titles'] = data['Name'].apply(lambda name: name.split('.')[0].split(',')[1].strip())  # same result using a lambda function
data.head()

In [None]:
# Define our function
def extractLastName(name):
    token = name.split(',')
    return token[0]

# Apply the function to the Name column and assign the result to a new column called Last Name
data['Last Name'] = data['Name'].apply(extractLastName)            # using a user-defined function
data['Last Name'] = data['Name'].apply(lambda x: x.split(',')[0])  # same result using a lambda function

# Let's have a look at the first 5 rows
data.loc[:4, ['Last Name', 'Name']]


## map 
map is used to substitute each value in a Series with another one, for example by looking it up in a dictionary.
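
A minimal sketch of this substitution, using a made-up Series of port codes. The `ports` values are hypothetical, and "X" is deliberately left out of the dictionary to show what happens to values without a match.

```python
import pandas as pd

# Hypothetical port codes; "X" has no entry in the dictionary below
ports = pd.Series(["S", "C", "Q", "S", "X"])

port_names = ports.map({"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"})
print(port_names)

# Values that are missing from the dictionary become NaN
print(port_names.isna().sum())
```

It is worth checking for such unexpected missing values after a dictionary-based map, because a typo in a key silently produces NaN rather than an error.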

## Bonus tip

You can also use the map function to encode categorical variables. This is particularly useful when you are preparing your dataset for machine learning. Most machine learning algorithms cannot learn from non-numeric inputs, so we first have to turn our categorical variables into numbers before fitting a model to our data.

Examples of categorical variables in our Titanic dataset are the Pclass, Sex and Embarked columns.

Don't worry if you do not know any machine learning. This section is merely to illustrate how you can encode using the map function.

Suppose we want to encode the Sex column such that male gets assigned 1 and female gets assigned 0.

In [None]:
# Encode male as 1 and female as 0
data['Encoded Sex'] = data['Sex'].map({'male':1, 'female':0})

# Show the first 5 rows of Sex and Encoded Sex
data.loc[:4, ['Sex', 'Encoded Sex']]

An alternative way to accomplish this is via a pandas function called get_dummies.

In [None]:
pd.get_dummies(data['Sex'])

In [None]:
pd.get_dummies(data['Pclass'])

In [None]:
pd.get_dummies(data['Embarked'])

In [None]:
pd.get_dummies(data['Age'])
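
To make the output of get_dummies easier to read, here is a sketch on a tiny made-up Series. The `sex` values are hypothetical, and the `dtype=int` and `drop_first=True` arguments are optional extras that the cells above do not use.

```python
import pandas as pd

# Hypothetical toy values
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# One column per distinct value; dtype=int shows 0/1 instead of False/True
dummies = pd.get_dummies(sex, dtype=int)
print(dummies)

# drop_first=True drops the first category, leaving a single 0/1 column
single = pd.get_dummies(sex, drop_first=True, dtype=int)
print(single)
```

Because get_dummies creates one column per distinct value, it suits columns with few categories such as Sex, Pclass and Embarked. On a continuous column such as Age it produces one column for every distinct age.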

In [None]:
data.head()

In [None]:
numeric = data.loc[:, ["PassengerId","Age", "Fare", "Parch", "SibSp", "Survived"]]
numeric

In [None]:
numeric.corr()