# Lab 4: Logistic Regression (continuation) 

In this practice session, you will work with the "Titanic" dataset (from the website [http://www.data-mania.com/](http://www.data-mania.com/)). This dataset contains information about the passengers of the sunken ocean liner RMS Titanic. The goal is to build a classifier that predicts whether a passenger survived the disaster, based on features such as the passenger's age, sex and ticket fare.

The "Titanic" dataset lacks some information for some passengers. On the other hand, it contains information that is not usable for survival prediction, such as the ticket or cabin number. Hence, in the first part you will preprocess the training data by removing unnecessary features and filling in the missing values.

In the second part, you will train a logistic classifier using the [**sklearn**](http://scikit-learn.org/stable/) library. Then, you can assess its accuracy with some of the [**metrics**](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) provided by sklearn.

### Load and process dataset
In this session, we will use two new libraries for data structures and plotting. The [**pandas**](http://pandas.pydata.org/pandas-docs/stable/) library stores data in a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) structure, which is similar to a classic 2-D array but has labels for rows and columns; these labels can be used to index a specific element of the dataframe.  
The [**seaborn**](https://seaborn.pydata.org/index.html) library helps us generate meaningful statistical graphics by extending the functionality of the matplotlib library.
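
To illustrate what labelled indexing means, here is a minimal sketch on a small made-up table. The row labels `p1`..`p3` and the values are hypothetical and are not taken from the Titanic file.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file)
df = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]},
    index=["p1", "p2", "p3"],
)

age_p2 = df.loc["p2", "Age"]   # label-based indexing: (row label, column label)
fare_first = df.iloc[0, 1]     # position-based indexing: (row 0, column 1)
print(age_p2, fare_first)
```

`loc` selects by label and `iloc` selects by integer position; both are used later in this lab.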

<font color="blue">**Question 1:**</font>
- Load the data from the given "url" into the "titanic" variable and explore it (what are its attributes, its size, ...?).  
**Hint:** You can use the [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function from the pandas library.
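
As a hedged illustration of how `read_csv` behaves, the sketch below parses a tiny CSV held in a string instead of a URL. The CSV content is invented for the example; only the column names mimic the real file.

```python
import io
import pandas as pd

# Hypothetical CSV content; the last row has an empty Age field
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22
2,1,1,38
3,1,3,
"""

# read_csv accepts a path, a URL or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))   # column names taken from the header line
toy.info()                 # dtypes and non-null counts per column
```

Empty fields are read as missing values (NaN), which is why `info()` reports fewer non-null entries for such columns.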

In [None]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn import metrics 

# load titanic dataset using the url below
url = 'https://raw.githubusercontent.com/BigDataGal/Python-for-Data-Science/master/titanic-train.csv'
titanic = # ** your code here **

# explore data
print("The size of the titanic dataset is: ",titanic.shape)
print("Feature names are: ",titanic.columns)

print("\nSome information about the dataset:")
titanic.info()
print("\nWhat the dataset looks like:")
titanic.head()


<font color="blue">**Question 2:**</font>
- What is the number of samples (passengers) in each class (y=1: survived, y=0: did not survive)?  
**Hint:** You can index the dataframe with a boolean condition on the "Survived" column. You can also use the "[where](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html)" function from the pandas library.
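
The boolean-condition idea from the hint can be sketched on a toy column as follows. The data are made up, and `value_counts` is shown only as one possible alternative, not as the expected answer.

```python
import pandas as pd

# Hypothetical labels: 1 = survived, 0 = did not survive
toy = pd.DataFrame({"Survived": [0, 1, 1, 0, 0]})

# A comparison yields a boolean Series; summing it counts the True entries
n_survived = (toy["Survived"] == 1).sum()

# Indexing with the boolean Series keeps only the matching rows
n_died = len(toy[toy["Survived"] == 0])

# value_counts gives the size of every class in one call
counts = toy["Survived"].value_counts()
print(n_survived, n_died)
```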

In [None]:
# determine the number of passengers who survived / did not survive to better understand the dataset
nbr_surviv = # ** your code here **
nbr_died = # ** your code here **

print("The number of passengers who survived in the titanic dataset is: ",nbr_surviv)
print("The number of passengers who died in the titanic dataset is: ",nbr_died)


#### Remove non-significant features
Some features in the "titanic" dataset are not informative and do not help to predict survival. Thus, we should remove them.

<font color="blue">**Question 3:**</font>
- The "Name" and "Cabin" features are two uninformative features, and there are two others: try to identify them. Then, remove all four features from the "titanic" dataframe.  
**Hint:** You can use the "[drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)" function from the pandas library.
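
A minimal sketch of how `drop` removes columns, using a toy dataframe whose column names (`id`, `label`, `note`) are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "label": [0, 1], "note": ["a", "b"]})

# axis=1 means "drop columns"; without inplace=True a new DataFrame is returned
reduced = toy.drop(["id", "note"], axis=1)

print(list(reduced.columns))
print(list(toy.columns))   # the original dataframe is unchanged
```

Because `drop` returns a new dataframe by default, the result must be assigned back to a variable for the change to persist.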

In [None]:
titanic = # ** your code here **
titanic.head()

#### Deal with missing values
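We note from the printed information about our dataset (obtained with the command titanic.info()) that the "Age", "Cabin" and "Embarked" features have missing values for some passengers. The "Cabin" feature has already been removed in Question 3, so only "Age" and "Embarked" remain to be handled.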

First, we will estimate the missing "Age" values from the passenger class "Pclass". If we calculate the mean age of each class, we note that 1st class passengers tend to be older than 3rd class passengers. Hence, you should fill the missing age values with the mean age of the corresponding passenger class.
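
The idea of filling a missing value with the mean of its group can be sketched on toy data. Note that this sketch uses `groupby(...).transform("mean")`, which is an alternative to the hand-filled list used in this lab; the ages below are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value per class
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# For every row, the mean age of that row's class (NaN values are ignored)
class_mean = toy.groupby("Pclass")["Age"].transform("mean")

# Replace only the missing ages by the mean of the corresponding class
toy["Age"] = toy["Age"].fillna(class_mean)
print(toy)
```

In this toy table the missing 1st class age becomes 45.0 and the missing 3rd class age becomes 25.0.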

In [None]:
%matplotlib inline

# visualize Age box plot for each passenger class
sns.set_style('whitegrid')
sns.boxplot(x='Pclass', y='Age', data=titanic, palette='hls')
plt.show()

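# mean of each numeric feature per passenger class
# (numeric_only=True skips text columns such as 'Sex', which recent pandas versions refuse to average)
titanic.groupby('Pclass').mean(numeric_only=True)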


<font color="blue">**Question 4:**</font>
- Fill the "approx_age" list with the estimated ages from the previous table of means. Put the estimated age for 1st class at the beginning of the list and the one for 3rd class at the end, so that approx_age[k-1] is the estimate for class k.

In [None]:
# estimate missing Age
approx_age = # ** your code here **

null_age_idx=pd.isnull(titanic.Age)    # you could also use:  null_age_idx=pd.isnull(titanic['Age'])
titanic.loc[null_age_idx,'Age']=[approx_age[i-1] for i in titanic.loc[null_age_idx,'Pclass']]

print("The number of missing values per feature:\n",titanic.isnull().sum())

- We note that there are two passengers whose port of embarkation is unknown. We can discard these two samples by dropping them.  
**Hint:** We can use the "[dropna](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)" function from the pandas library.
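
As a small sketch of what `dropna` does, consider the toy table below (invented values). By default a row is removed if any of its columns is missing; the `subset` parameter restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Embarked": ["S", "C", np.nan],
})

all_complete = toy.dropna()                     # drop rows with any missing value
port_known = toy.dropna(subset=["Embarked"])    # only require "Embarked" to be present

print(len(all_complete), len(port_known))
```

This is why it matters to fill the missing ages before calling `dropna()` without arguments: otherwise those rows would be discarded as well.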

In [None]:
titanic=titanic.dropna()
titanic.isnull().sum()

#### Insert dummy variables
The "titanic" dataset contains some qualitative features such as "Sex" and the "Embarked" port. To make them usable in our computational model, we must encode them numerically, for instance by introducing an order between the "Embarked" categories.

In our case, we will use the dummy variable method, which consists of creating a new boolean variable for each category and encoding each category with "True" in the corresponding dummy variable.
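
The dummy variable method can be sketched on a toy column of ports (the values are made up). The sketch also shows the effect of `drop_first=True`, which removes the first category because it is implied when all remaining dummies are false.

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category: C, Q, S
full = pd.get_dummies(toy["Embarked"])

# drop_first=True removes the first category (C); a "C" row is then
# encoded as "neither Q nor S"
reduced = pd.get_dummies(toy["Embarked"], drop_first=True)

print(full)
print(reduced)
```

Dropping one column avoids carrying a redundant feature, since the full set of dummies always sums to one per row.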

In [None]:
# create the "sex" dummy variable
sex = pd.get_dummies(titanic['Sex'],drop_first=True)
print("Replacement 'Sex' dummy variable:\n",sex.head())

# create the "embark_port" dummy variables
embark_port = pd.get_dummies(titanic['Embarked'],drop_first=True)
print("Replacement 'Embarked' dummy variables:\n",embark_port.head())

# remove qualitative features
titanic.drop(['Sex', 'Embarked'],axis=1,inplace=True)

# insert the new dummy variables to replace the qualitative features
titanic_dmy = pd.concat([titanic,sex,embark_port],axis=1)
titanic_dmy.head()

#### Study correlation between features

Execute the following code block to visualize the correlation matrix of our processed dataset.

In [None]:
from matplotlib import cm
sns.heatmap(titanic_dmy.corr(),cmap=cm.coolwarm,annot=True, fmt=".2f")  

From the correlation matrix, we note that the passenger "Pclass" and ticket "Fare" features are correlated (absolute correlation coefficient $> 0.5$). Thus, we can keep only one of them. We will keep "Pclass" and remove the "Fare" feature, since it is less expressive.

We also note that the "Survived" and "male" features are correlated (again in absolute value). This means that the "male" feature is a strong indicator of whether the passenger survived.
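
When reading a correlation matrix, a threshold is usually applied to the absolute value of the coefficient, because a strong negative correlation is as informative as a strong positive one. A minimal sketch on invented columns `x`, `y`, `z`:

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # y = 2x, perfectly positively correlated with x
    "z": [4.0, 3.0, 2.0, 1.0],   # z decreases as x increases
})

corr = toy.corr()            # Pearson correlation by default
strong = corr.abs() > 0.5    # compare the absolute value to the threshold

print(corr)
print(strong)
```

Here `x` and `z` have a coefficient of -1, which a plain `corr > 0.5` test would miss.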

In [None]:
titanic_dmy=titanic_dmy.drop(['Fare'],axis=1)

print("\nInformation about the final dataset:")
titanic_dmy.info()

titanic_dmy.head()

### Train logistic classifier and predict
<font color="blue">**Question 5:**</font>
- Use the "[fit](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit)" function to train the logistic model "Logistic_Regr".
- Use the "[predict](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict)" function to predict whether each passenger (row of the X array) survived or not.
- Calculate the accuracy (number of correct predictions / total number of passengers) of the logistic model.
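
The fit / predict / accuracy workflow can be sketched on synthetic data. Everything below (the names `X_toy`, `y_toy`, `clf`, and the way the labels are generated) is an assumption made for illustration and is unrelated to the Titanic features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Synthetic, linearly separable data: label is 1 when the two features sum to > 0
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X_toy, y_toy)          # estimate the model coefficients
y_hat = clf.predict(X_toy)     # predicted class (0 or 1) for each row

# Accuracy = number of correct predictions / number of samples
acc_manual = np.mean(y_hat == y_toy)
acc_sklearn = metrics.accuracy_score(y_toy, y_hat)
print(acc_manual, acc_sklearn)
```

Both computations give the same number. Keep in mind that accuracy measured on the same data used for training tends to be optimistic compared with accuracy on unseen data.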

In [None]:
X = titanic_dmy.iloc[:,1:8].values
y = titanic_dmy.iloc[:,0].values

# train logistic classifier
Logistic_Regr = LogisticRegression()
# ** your code here **

# predict survivability
y_pred = # ** your code here **

# calculate accuracy
accuracy = # ** your code here **
print("The accuracy of our logistic classifier is: ", accuracy)
