# Exercise 01 - Data Processing with Python

##  Introduction

Machine learning is the science of programming computers to learn from data. In order to build sophisticated machine learning models, it is important to prepare the data to learn from beforehand. This Notebook gives an introduction to the required data processing steps applied to the Titanic Survival dataset.

This dataset contains demographics and passenger information from 891 of the 2224 passengers and crew on board the RMS Titanic. 

![RMS Titanic departing Southampton on April 10, 1912.](https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/640px-RMS_Titanic_3.jpg)







## Objectives
After this exercise you should be familiar with the following operations, which are needed when you work with data.
 * Data loading
 * Data viewing
 * Data cleaning
 * Data slicing
 * Data mapping

In this exercise we'll mainly use *pandas* and *numpy* for data processing, *scikit-learn* for data analysis, and *seaborn* and *matplotlib* for visualization.


## Why do we use Python?
* A large number of libraries make Python applicable to every step of the data science process, such as
    * data management,
    * analytical processing, and
    * visualization.
    
* Python-based Jupyter Notebooks facilitate the analysis and ensure repeatability
* Python is an easy-to-learn and readable language
* Python is an open language with a vibrant community
 
 * **C**
 ```
 #include <stdio.h>
 int main() {
 printf("Hello World\n");
 }
 ```
 
 * **Python**
 ```python
 print('Hello World')
 ```
 * **C**
 ```
 #include <stdio.h>
 int main() {
 int x = 3;
 int y = 4;
 printf("%d\n", x + y);
 }
 ```
 * **Python**
 ```python
 x = 3
 y = 4
 print(x+y)
  ```

## Getting Started with the Dataset

For this exercise we'll be using the Titanic data set.
It contains the passenger data from the RMS Titanic, including whether a passenger survived the sinking of the ship or not.
The dataset can be downloaded from [Kaggle](https://www.kaggle.com/c/3136/download-all).

**Our goal is to create a model to predict whether a passenger survived or not based on the given attributes.**

In [None]:
# install a new version of Seaborn
#!pip install -q seaborn==0.9.0

**Note:** Colab imports *seaborn* at startup, before the new version has been installed with pip. You therefore need to restart the runtime after the installation to pick up the new version.

In [None]:
# Import required Packages
import pandas as pd
import numpy as np
import sys
import seaborn as sns
from collections import Counter
import matplotlib.pyplot as plt
import sklearn

In [None]:
# Test whether the version of Seaborn is 0.9.0
if (sns.__version__ == '0.9.0'):
  print('correct seaborn version is loaded ...')

**If you are using Google Colab, you have to upload the dataset with the following command:**


In [None]:
# Upload the dataset to the cloud
# from google.colab import files
# files.upload()

Now we can use the dataset and start analyzing it.


In [None]:
# Load the train and test datasets from the CSV files
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Keep both data frames in one list, so that each processing step
# can be applied to the training and the test data alike
full_data = [train, test]

# Display the first 5 rows of the train data set
train.head(5)
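
A short note on `full_data`: the list stores references to the two data frames, not copies. The sketch below uses two toy frames (the names `toy_train`, `toy_test` and `toy_full` and all values are made up for illustration) to show what this means in practice. Changes made through the list are visible in the original frames, but re-assigning a name does not update the list.

```python
import pandas as pd

# Two toy frames standing in for train and test (illustrative values only)
toy_train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05]})
toy_test = pd.DataFrame({"Fare": [7.83, 9.69]})
toy_full = [toy_train, toy_test]

# Adding a column through the list changes the original objects
for frame in toy_full:
    frame["Fare_rounded"] = frame["Fare"].round()
print("Fare_rounded" in toy_train.columns)   # True: same object

# Re-assigning the name creates a new frame; the list still holds the old one
toy_train = toy_train.drop(0, axis=0).reset_index(drop=True)
print(len(toy_train), len(toy_full[0]))      # 2 3
```

Whenever a step returns a new data frame (for example `drop`), the list therefore has to be rebuilt if later loops should see the result.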

In [None]:
# Print the columns of the data frame
train.columns.values


 *  **Survived**: Outcome of survival (int: 0 = No; 1 = Yes) 
 *  **Pclass**: Socio-economic class (int: 1 = Upper class; 2 = Middle class; 3 = Lower class)
 * **Name**: Name of passenger (string)
 * **Sex**: Sex of the passenger (string)
 * **Age**: Age of the passenger (float: Some entries contain NaN)
 * **SibSp**: Number of siblings and spouses of the passenger aboard (int)
 * **Parch**: Number of parents and children of the passenger aboard (int)
 * **Ticket**: Ticket number of the passenger (string)
 * **Fare**: Fare paid by the passenger (float)
 * **Cabin**: Cabin number of the passenger (string: Some entries contain NaN)
 * **Embarked**: Port of embarkation of the passenger (string: C = Cherbourg; Q = Queenstown; S = Southampton)


In [None]:
# Inspect the data, *info* can be used to show how complete or incomplete the
# dataset is
train.info()

To display information about a specific passenger, we can select a row with the following command:

In [None]:
# iloc: index location
train.iloc[14]
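
Selecting rows is the core of data slicing, so here is a minimal sketch of the most common options on a toy table (the frame `toy` and its values are invented for illustration). It contrasts position-based access (`iloc`), label-based access (`loc`) and filtering with a Boolean mask.

```python
import pandas as pd

# Toy passenger table with a non-default index (illustrative values only)
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid", "Dee"], "Age": [22.0, 38.0, 26.0, 35.0]},
    index=[10, 11, 12, 13],
)

row_by_position = toy.iloc[2]          # third row, regardless of index labels
row_by_label = toy.loc[12]             # row whose index label is 12
first_two = toy.iloc[0:2]              # positional slice, end excluded
older_than_30 = toy[toy["Age"] > 30]   # Boolean mask

print(row_by_position["Name"], row_by_label["Name"])   # Cid Cid
print(len(first_two), list(older_than_30["Name"]))     # 2 ['Bob', 'Dee']
```

With the default index (0, 1, 2, ...) `iloc` and `loc` return the same row, which is why the difference is easy to overlook until rows have been dropped or reordered.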

In [None]:
# Retrieve a random sample of n rows from the data set
train.sample(5)

In [None]:
# Retrieve a statistical description of the data set
train.describe()

Above we can see that 38% of the passengers in the training set survived the sinking of the RMS Titanic. We can also see that the passenger ages range from 0.4 years to 80 years.
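
The survival rate can be read from the `mean` row of `describe()` because the label is coded as 0 and 1: the mean of such a column is the share of ones. A minimal sketch with a toy column (values invented for illustration):

```python
import pandas as pd

# Toy outcome column: 1 = survived, 0 = did not survive (illustrative values)
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0])

rate_from_mean = survived.mean()
rate_from_counts = survived.value_counts(normalize=True).loc[1]

print(rate_from_mean, rate_from_counts)   # 0.375 0.375
```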


From the tables above, we can note a few things:
1. Several features are non-numeric and need to be converted into numeric ones
2. The features have widely different ranges
3. Some features contain missing values

## Visualizing the Data

After a look at the tables and the description of our datasets we can't say a lot about the data yet. We are unaware of the distribution and correlation of the variables regarding the chances of survival for any given passenger.
To get a better understanding about the dataset and its variables, it is helpful to visualize it.

A good tool to visualize data using Python is the library *matplotlib*. It is well [documented](https://matplotlib.org/) and allows for extensive customizability.

Additionally we will introduce [*seaborn*](https://seaborn.pydata.org/), a wrapper which uses matplotlib, but offers a higher-level interface for visualizing data.

In [None]:
# First we start off with matplotlib and setup the figures and plots
f,ax = plt.subplots(1,3,figsize=(20,5))
colors = ["r", "g", "b"]

# Now we look at some general distributions of the data.


# Survived Class
x_survived = [0,1]
y_survived = [np.where(train["Survived"] == (i))[0].size for i in x_survived]
ax[0].bar(x_survived, y_survived, color=colors)
ax[0].set_xticks(x_survived)
ax[0].set_title('Survived')


# Passenger Class
x_pclass = [1, 2, 3]
y_pclass = [np.where(train["Pclass"] == (i))[0].size for i in x_pclass]
ax[1].bar(x_pclass, y_pclass, color=colors)
ax[1].set_xticks(x_pclass)
ax[1].set_title('Passengers by Class')


# Age (only whole-number ages are counted; missing and fractional ages are skipped)
x_age = np.arange(0, 81)
y_age = [np.where(train["Age"] == (i))[0].size for i in x_age]
ax[2].bar(x_age, y_age, color="green")
ax[2].set_xticks(np.arange(0, 81, 10))
ax[2].set_title('Passengers by Age')


# Display the graphs
plt.show()

In [None]:
# The categorical histograms we created with matplotlib can be created with the
# "countplot" command in seaborn. The setup is very similar, but easier.

f,ax = plt.subplots(1,3,figsize=(20,5))

sns.countplot(x='Survived',data=train,ax=ax[0])
ax[0].set_title('Survived')

sns.countplot(x='Pclass',data=train,ax=ax[1])
ax[1].set_title('Passengers by Class')

sns.distplot(train['Age'].dropna(),ax=ax[2],bins=20, kde = False)
ax[2].set_title('Passengers by Age');

In [None]:
# Now we are going to use various plot options of seaborn to 
# look at survival rates based on other attributes:
f,ax = plt.subplots(1,3,figsize=(20,5))

sns.countplot(x='Pclass',hue='Survived',data=train,ax=ax[0])
ax[0].set_title('Survival by Class')

sns.countplot(x='Sex',hue='Survived',data=train,ax=ax[1])
ax[1].set_title('Survival by Sex')

sns.countplot(x='Embarked',hue='Survived',data=train,ax=ax[2])
ax[2].set_title('Survival by Embarked');

In [None]:
f,ax = plt.subplots(1,3,figsize=(20,5))

sns.countplot(x='SibSp',hue='Survived',data=train,ax=ax[0])
ax[0].set_title('Survival by SibSp')

sns.countplot(x='Parch',hue='Survived',data=train,ax=ax[1])
ax[1].set_title('Survival by Parch')

sns.distplot(train[train['Survived']==0]['Fare'].dropna(),ax=ax[2],kde=False,color='red',bins=20)
sns.distplot(train[train['Survived']==1]['Fare'].dropna(),ax=ax[2],kde=False,color='blue',bins=20)
ax[2].set_title('Survival by Fare');

This was only a quick overview of the relationship between features before we start a more detailed analysis in the following sections.

## Cleaning the Data
Data from the real world is messy. Normally there are missing values, outliers and invalid data (e.g. negative values for age) in a data set. We can solve problems with data quality by replacing these values, trying to close the gap by interpolation or by dropping the respective entries. 


### Detecting and Filtering Outliers

Outliers that are either very large or small skew the overall view of the data. One way of detecting outliers could be the use of the standard deviation. If we assume that the data is normally distributed, then 95 percent of the data is within 1.96 standard deviations of the mean. So we can drop the values either above or below that range.
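
As a small worked example of this rule, the sketch below applies the 1.96 standard deviation bounds to a toy fare column (values invented for illustration, not taken from the dataset).

```python
import pandas as pd

# Toy fares with one extreme value (illustrative values only)
fares = pd.Series([7, 8, 8, 9, 10, 10, 11, 12, 13, 200], dtype=float)

mean, std = fares.mean(), fares.std()
lower, upper = mean - 1.96 * std, mean + 1.96 * std

# Everything outside [lower, upper] is flagged as an outlier
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers.tolist())   # [200.0]
```

Note that the extreme value itself inflates both the mean and the standard deviation, so this rule tends to flag only the most extreme points.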

In [None]:
f,ax = plt.subplots(1,2,figsize=(13,5))

# The outliers in Fare (Fare paid by the passenger)
sns.regplot(x=train["PassengerId"], y=train["Fare"], fit_reg=False,ax=ax[0])

# The outliers in SibSp(Number of siblings and spouses of the passenger aboard)
sns.regplot(x=train["PassengerId"], y=train["SibSp"], fit_reg=False, ax=ax[1])

ax[0].set_title('Passengers by Fare')
ax[1].set_title('Passengers by Number of Siblings and Spouses')

plt.show()

In [None]:
# Outlier detection  Method 1 using Standard Deviation
def detect_outliers_sd(df,n,features):
    outlier_indices = []
    # iterate over features(columns)
    for col in features:
        # mean
        mean = df[col].mean()
        # standard deviation
        std = df[col].std()
        # the upper bound
        top = mean + std * 1.96
        #  the lower bound 
        bot = mean - std * 1.96
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < bot) | (df[col] > top)].index       
        # append the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
    # select observations that are outliers in more than n features
    outlier_indices = Counter(outlier_indices)        
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
    return multiple_outliers   

In [None]:
# Outlier detection  Method 2 using Interquartile Ranges 
def detect_outliers_iqr(df, n ,features):

    outlier_indices = []

    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%); nanpercentile ignores missing values (e.g. in Age)
        Q1 = np.nanpercentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.nanpercentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1

        # outlier step
        outlier_step = 1.5 * IQR

        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step )].index

        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)

    # select observations that are outliers in more than n features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
    return multiple_outliers
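
A small worked example of the interquartile range rule on a toy age column (values invented for illustration). It also shows a pitfall: `np.percentile` propagates missing values, whereas `Series.quantile` skips them.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing entry (illustrative values only)
ages = pd.Series([22, 38, 26, 35, np.nan, 54, 2, 27, 14, 80])

naive_q1 = np.percentile(ages, 25)                   # NaN in the input propagates
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)    # quantile() skips NaN
step = 1.5 * (q3 - q1)

# Values more than 1.5 * IQR below Q1 or above Q3 are flagged
outliers = ages[(ages < q1 - step) | (ages > q3 + step)]
print(naive_q1, q1, q3, outliers.tolist())           # nan 22.0 38.0 [80.0]
```

When the quartiles are NaN, every comparison against them is false, so no outlier would be reported for that column at all.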

Detect outliers ...

In [None]:
# detect outliers from Age, SibSp , Parch and Fare
outliers_to_drop = detect_outliers_iqr(train,2,["Age","SibSp","Parch","Fare"])
train.loc[outliers_to_drop] # Show the outliers rows

... and remove them

In [None]:
# Drop the outliers and rebuild full_data, which would otherwise still hold the old frame
train = train.drop(outliers_to_drop, axis = 0).reset_index(drop=True)
full_data = [train, test]


### Complementary functions

Most Machine Learning algorithms cannot work with missing values, so let’s create a few functions to take care of the missing values.

In [None]:
# isnull().sum() counts the missing values per column in both datasets.
print(train.isnull().sum())
print('-'*30)
print(test.isnull().sum())

We can complete missing data by calculating:
* mean, 
* median, or 
* mean + randomized standard deviation.  

Before we can complete the missing data, we should decide which method is best based on the description of the data.
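
To see why the choice matters, the sketch below fills one missing value in a toy fare column (values invented for illustration) once with the mean and once with the median.

```python
import numpy as np
import pandas as pd

# Toy fares with one missing entry and one extreme value (illustrative only)
fare = pd.Series([7.0, 8.0, np.nan, 13.0, 512.0])

filled_mean = fare.fillna(fare.mean())      # pulled up by the extreme fare
filled_median = fare.fillna(fare.median())  # robust against the extreme fare

print(filled_mean.iloc[2], filled_median.iloc[2])   # 135.0 10.5
```

For skewed attributes the median is usually the more typical value, while the mean is a reasonable choice for roughly symmetric ones.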

In [None]:
# The outliers in Age 
sns.regplot(x=train["PassengerId"], y=train["Age"], fit_reg=False)
plt.title('Age by Passenger')
plt.show()

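To complete the missing data of *Age*, we use the mean + randomized standard deviation: each missing age is replaced by a random integer drawn from the range between mean - std and mean + std, where the standard deviation describes the spread of the data.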

In [None]:
# Fill the missing data in Age with random integers drawn uniformly
# from the range [mean - std, mean + std).
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    # randint expects integer bounds, so the limits are truncated explicitly
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    df_age = dataset['Age'].copy()
    df_age[df_age.isnull()] = age_null_random_list
    dataset['Age'] = df_age.astype(int)

Next, let us look at the *Fare* attribute:

In [None]:
# Median and summary statistics of Fare
print("median {}".format(train['Fare'].median()))
train['Fare'].describe()

In [None]:
# Fill the missing data in Fare with the median fare of the training set.
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())

Before we fill the missing data in *Embarked*, we visualize the attribute to decide which option is best:

In [None]:
sns.countplot(x='Embarked',data=train)
plt.show()

In [None]:
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')

Let us check again whether there are any missing values:

In [None]:
# update the dataframes
train = full_data[0]
test = full_data[1]
# isnull().any() reports for each column whether at least one value is missing
print(train.isnull().any())
print('-'*30)
print(test.isnull().any())

Great! Nothing important is missing anymore (only *Cabin*, which we drop later, still has gaps), and we did not have to remove any rows because of missing values.

## Feature Engineering
Qualitative data is often nominal (e.g. names) or categorical (e.g. sex). Such values can't be ordered and are difficult to evaluate. Therefore we want to convert all our categorical data to quantitative data, i.e. numerical or ordinal values.

We can convert the names to an attribute based on their length:

In [None]:
for dataset in full_data:
    try:
        dataset['Name_length'] = dataset['Name'].apply(len)
    except KeyError:
        print("Name_length feature is located in the data frame")
        
train['Name_length'].head()

In [None]:
fig, ax = plt.subplots(2,1,figsize=(20,10))

# The number of survivors by Name length.
sum_survived_by_name = train[["Name_length", "Survived"]].groupby(['Name_length'],as_index=False).sum()
sns.barplot(x='Name_length', y='Survived', data=sum_survived_by_name, ax = ax[0])
ax[0].set_title('Survivors by Name length')

# The survival rate (mean of Survived) by Name length.
avg_survived_by_name = train[["Name_length", "Survived"]].groupby(['Name_length'],as_index=False).mean()
sns.barplot(x='Name_length', y='Survived', data=avg_survived_by_name, ax = ax[1])
ax[1].set_title('Survival Rates by Name length')

plt.show()

From the graphics above we can see that passengers with longer names were more likely to survive, perhaps because rich families tend to have longer names.

It can also be helpful to create meaningful "bins" for attributes. Therefore we will divide the Name_length feature into a small number of classes, chosen so that the passengers within each class have a similar survival rate.

In [None]:
for dataset in full_data:
    dataset.loc[ dataset['Name_length'] <= 23, 'Name_length']= 0
    dataset.loc[(dataset['Name_length'] > 23) & (dataset['Name_length'] <= 28), 'Name_length']= 1
    dataset.loc[(dataset['Name_length'] > 28) & (dataset['Name_length'] <= 40), 'Name_length']= 2
    dataset.loc[ dataset['Name_length'] > 40, 'Name_length'] = 3
train['Name_length'].value_counts()

As a next step we can map categorical attributes to discrete numerical values:



In [None]:
# Mapping Gender
for dataset in full_data:
    # np.where takes a Boolean condition, the value to use where it is True
    # and the value to use where it is False
    if dataset['Sex'].isin(['male', 'female']).all():
        dataset['Sex'] = np.where(dataset['Sex']=='female', 1, 0)
    else:
        print('The value is already converted ')
train['Sex'].head()
      

Binning is also useful for continuous attributes. For example we can look at the *Age* attribute:


In [None]:
#plot distributions of passengers who survived or died by age
a = sns.FacetGrid(train, hue='Survived', aspect=5)
a.map(sns.kdeplot, 'Age', shade=True)
a.set(xlim=(0, train['Age'].max()))
a.add_legend();

We can see that up to the age of 14 the chance of survival is higher than the chance of dying.
Conversely, the chance of dying is higher between the ages of 14 and 30. The relation changes a couple more times across the remaining age ranges.

Therefore we choose the following categories for age:
* 0: 14 or younger
* 1: older than 14, up to 30
* 2: older than 30, up to 40
* 3: older than 40, up to 50
* 4: older than 50, up to 60
* 5: older than 60

In [None]:
for dataset in full_data:
    dataset.loc[ dataset['Age'] <= 14, 'Age_bin'] = 0
    dataset.loc[(dataset['Age'] > 14) & (dataset['Age'] <= 30), 'Age_bin'] = 1
    dataset.loc[(dataset['Age'] > 30) & (dataset['Age'] <= 40), 'Age_bin'] = 2
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 50), 'Age_bin'] = 3
    dataset.loc[(dataset['Age'] > 50) & (dataset['Age'] <= 60), 'Age_bin'] = 4
    dataset.loc[ dataset['Age'] > 60, 'Age_bin'] = 5
train['Age_bin'].value_counts()
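
The same binning can also be written with `pd.cut`, which assigns each value to a right-closed interval. The sketch below is an alternative formulation on toy ages (values invented for illustration), not the approach used in this notebook; the outer edges `-1` and infinity are assumptions chosen so that every age falls into some bin.

```python
import pandas as pd

# Toy ages that sit on and next to the bin edges (illustrative values only)
ages = pd.Series([4, 14, 15, 30, 31, 45, 60, 61, 80])

# Right-closed intervals (a, b], matching the <= comparisons used above
edges = [-1, 14, 30, 40, 50, 60, float("inf")]
age_bin = pd.cut(ages, bins=edges, labels=[0, 1, 2, 3, 4, 5]).astype(int)

print(age_bin.tolist())   # [0, 0, 1, 1, 2, 3, 4, 5, 5]
```

Keeping the edges in one list makes it easier to see and change the bin boundaries than editing a chain of `loc` conditions.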

The next step is to map the Embarked feature: 

In [None]:
# Mapping Embarked; replacing via a dict leaves already converted values untouched
embarked_codes = {'S': 0, 'C': 1, 'Q': 2}
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].replace(embarked_codes).astype(int)
train['Embarked'].head()

Additionally, data might be skewed. For example, if we look at the *Fare* attribute, we can see that it is heavily right-skewed: most fares are low, and a long tail extends towards very high fares:

In [None]:
fig, ax = plt.subplots(figsize=(18,5))
sns.distplot(train["Fare"][train["Survived"] == 0], color="red", ax=ax)
sns.distplot(train["Fare"][train["Survived"] == 1], color="blue", ax=ax);

To reduce the skewness of this attribute, we can transform it with the log function, which compresses the large values and spreads out the small ones:

In [None]:
# Apply log to Fare to reduce the skewness of its distribution (fares of 1 or less are set to 0)
for dataset in full_data:
    dataset["Fare_log"] = dataset["Fare"].map(lambda i: np.log(i) if i > 1 else 0)
    
fig, ax = plt.subplots(figsize=(18,5))
sns.distplot(train["Fare_log"][train["Survived"] == 0], color="r", ax=ax)
sns.distplot(train["Fare_log"][train["Survived"] == 1], color="b", ax=ax);
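
The effect of the transform can also be checked numerically with `Series.skew()`. The sketch below uses a toy fare column (values invented for illustration) and the same transform as in the cell above.

```python
import numpy as np
import pandas as pd

# Toy fares with a long right tail (illustrative values only)
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 71.28, 512.33])

# Same transform as above: log for fares above 1, otherwise 0
fare_log = fare.map(lambda value: np.log(value) if value > 1 else 0)

skew_before, skew_after = fare.skew(), fare_log.skew()
print(skew_before > skew_after > 0)   # True: still right-skewed, but less so
```

A skewness of 0 corresponds to a symmetric distribution, so a smaller positive value after the transform indicates a shorter right tail.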

Now we can define bins more easily:
The survival rate is lower for a *Fare_log* value of less than 2.7 and higher for values greater than 2.7.

In [None]:
for dataset in full_data:
    dataset.loc[ dataset['Fare_log'] <= 2.7, 'Fare_bin'] = 0
    dataset.loc[ dataset['Fare_log'] > 2.7, 'Fare_bin'] = 1
    dataset['Fare_bin'] = dataset['Fare_bin'].astype(int)
train['Fare_bin'].value_counts()

In [None]:
# Print the first 5 rows of the updated dataset
train.head()

## Feature Selection

Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant attributes from data that do not contribute to the accuracy of a predictive model or may in fact decrease the accuracy of the model.

Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.

**Which features within the dataset contribute significantly to our goal?**


To calculate the correlation matrix, we should first remove all remaining string attributes:



In [None]:
train.info()

We will drop the following features:
* Name
* Ticket
* Cabin

In [None]:
# Feature selection
drop_elements = [ 'Name', 'Ticket', 'Cabin']
try: 
  train = train.drop(drop_elements, axis = 1)
  test  = test.drop(drop_elements, axis = 1)
except KeyError:
  print("The features are already removed.")

We can examine the data after removing the features.

In [None]:
train.head()

### Correlation analysis - Multi-variate analysis
* Basically, correlation measures how closely two variables move in the same direction. Therefore we try to find out whether there is a correlation between a feature and the label. In other words: as the feature values change, does the label change as well, and vice versa?

* The data may contain a lot of redundant information distributed among multiple variables, which is a problem called multivariate correlation.

#### Heatmap for the correlation matrix

In [None]:
colormap = plt.cm.RdBu
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
plt.show()

From the *Survived* column of the heatmap we can see that survival has a strong relation with sex and a potential relation with class (or fare).
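
Instead of reading the values off the heatmap, the correlations with the label can also be ranked directly. A minimal sketch on a toy frame (the frame `toy` and its values are invented for illustration; the column names mirror the dataset):

```python
import pandas as pd

# Toy data: Sex mostly agrees with Survived, PassengerId is just a counter
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
    "Sex":         [1, 0, 1, 0, 1, 0, 0, 0],
    "Survived":    [1, 0, 1, 0, 1, 0, 1, 0],
})

# Absolute Pearson correlation of every feature with the label, strongest first
ranking = (
    toy.corr()["Survived"]
    .drop("Survived")
    .abs()
    .sort_values(ascending=False)
)
print(ranking.round(2).to_dict())   # {'Sex': 0.77, 'PassengerId': 0.22}
```

The absolute value is used because a strong negative correlation is just as informative for prediction as a strong positive one.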

### Features Selection
We will drop the features that show little correlation with the *Survived* label.

In [None]:
# Feature selection
drop_elements = ['PassengerId', 'SibSp', 'Age_bin','Embarked']
try: 
  train = train.drop(drop_elements, axis = 1)
  test  = test.drop(drop_elements, axis = 1)
except KeyError:
  print("The features are already removed.")

Let us explore the dataset after removing the features:

In [None]:
test.sample(5)

## Predictive Modelling

In [None]:
import sklearn # Collection of machine learning algorithms
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC


### Implementation: Shuffle and Split Data

The next step requires that we take the train dataset and split the data into training and testing subsets. We should do this because we want to test how well our model generalizes to unseen data.

Use `train_test_split` from `sklearn.model_selection` to shuffle and split the data into training and testing sets.
* Split the data into 70% training and 30% testing.
* Set the *random_state* for train_test_split to 101. This ensures results are consistent over multiple runs.
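
The effect of `test_size` and `random_state` can be checked on a toy array before applying the split to the real data. The sketch below assumes scikit-learn is installed; the arrays and variable names are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)                # 10 samples, 2 features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])    # toy labels

# Two splits with the same random_state select exactly the same rows
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=101)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.3, random_state=101)

print(X_tr.shape, X_te.shape)          # (7, 2) (3, 2)
print(np.array_equal(X_te, X_te2))     # True
```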



In [None]:
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]


from sklearn.model_selection import train_test_split

X_train, x_test, Y_train, y_test = train_test_split(X_train, Y_train, test_size=0.3, random_state=101)


In the following sections we train these classifiers and report their accuracy on the held-out test split:
* Logistic Regression
* Decision Tree
* Perceptron
* SVM

###  Logistic Regression

Logistic regression is a machine learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (survived) or 0 (not survived). In other words, the logistic regression model predicts P(Y=1) as a function of X (the features). This makes it a binary classifier.
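
To make the idea of predicting P(Y=1) concrete, the sketch below computes the probability by hand with the logistic (sigmoid) function. The weights, the bias and the two toy passengers are hypothetical values chosen for illustration; a fitted model would learn its own coefficients from the data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for two features: [Sex, Pclass]
weights = np.array([1.5, -0.8])
bias = -0.2

# Two toy passengers: a woman in 1st class and a man in 3rd class
passengers = np.array([[1.0, 1.0], [0.0, 3.0]])

probabilities = sigmoid(passengers @ weights + bias)   # P(Y=1) per passenger
predictions = (probabilities >= 0.5).astype(int)       # threshold at 0.5
print(predictions.tolist())   # [1, 0]
```

The threshold of 0.5 is the usual default for turning the probability into a class label.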

In [None]:
logreg = LogisticRegression(solver="newton-cg")
logreg.fit(X_train, Y_train)
Y_pred1 = logreg.predict(x_test)
print("The accuracy is {}%".format(round(logreg.score(x_test, y_test) * 100, 2)))

### Decision Tree
Decision tree classifiers are attractive models if we care about interpretability. As
the name decision tree suggests, we can think of this model as breaking down our
data by making decisions based on asking a series of questions.

In [None]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred7 = decision_tree.predict(x_test)
print("The accuracy is {}%".format(round(decision_tree.score(x_test, y_test) * 100, 2)))

### Perceptron
The perceptron is a supervised binary classifier.

In [None]:
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred4 = perceptron.predict(x_test)
print("The accuracy is {}%".format(round(perceptron.score(x_test, y_test) * 100, 2)))

### Support Vector Machines 
Support Vector Machines (SVM) are kernel-based methods that require only a user-specified kernel function $K$, i.e., a similarity function over pairs of data points. The kernel implicitly maps the data into a kernel (dual) space in which the learning algorithm operates linearly.

In [None]:
svc=SVC(gamma="auto")
svc.fit(X_train, Y_train)
Y_pred2 = svc.predict(x_test)
print("The accuracy is {}%".format(round(svc.score(x_test, y_test) * 100, 2)))

## Conclusion
* Machine Learning is about algorithms that are capable of learning from data, instead of having to explicitly code rules.
* In an ML project you gather data in a training set, and you feed the training set to a learning algorithm.
* The system will not perform well if your training set is too small, or if the data is unrepresentative, noisy, or polluted with irrelevant features (garbage in, garbage out).
* Lastly, your model needs to be neither too simple (in which case it will underfit) nor too complex (in which case it will overfit).