# Titanic: Machine Learning from Disaster - Random Forest

## I - Exploratory data analysis

* Data extraction: we'll load the dataset and have a first look at it.
* Cleaning: we'll fill in missing values.
* Plotting: we'll create some charts that will (hopefully) reveal correlations and hidden patterns in the data.
* Assumptions: we'll formulate hypotheses from the charts.

1. Import useful libraries


In [6]:

# import libraries we will use
import warnings
warnings.filterwarnings('ignore')

# matplotlib for plotting
from matplotlib import pyplot as plt
import matplotlib
# matplotlib.style.use('ggplot')
%matplotlib inline

# seaborn for plotting
import seaborn as sns

# pandas for dataframes
import pandas as pd
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

# numpy for linear algebra
import numpy as np

# sklearn for machine learning libraries
from sklearn.ensemble import RandomForestClassifier


Two datasets are available: a training set and a test set.

2. Load the training and test sets.


In [None]:
# load training set into a pandas dataframe
data = pd.read_csv('./train.csv')

In [None]:
test_data = pd.read_csv('./test.csv')

3. Show the head of each dataframe

In [None]:
data.head(10)

In [None]:
test_data.head(10)

The Survived column is the target variable. If Survived = 1 the passenger survived; if Survived = 0 the passenger died.
The other variables that describe the passengers are:

* PassengerId: an id given to each traveler on the boat
* Pclass: the passenger class. It has three possible values: 1, 2, 3
* Name: the passenger's name
* Sex: the passenger's sex
* Age: the passenger's age
* SibSp: number of siblings and spouses traveling with the passenger
* Parch: number of parents and children traveling with the passenger
* Ticket: the ticket number
* Fare: the ticket fare
* Cabin: the cabin number
* Embarked: the port of embarkation. It has three possible values: S (Southampton), C (Cherbourg), Q (Queenstown)

4. Display dataframe information with pandas

In [None]:
# dataframe information
data.info()
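
`info()` reports the non-null count of each column, which is how missing values first show up. A more direct view is to count the nulls per column. The sketch below uses a small hypothetical frame (`toy`, with made-up values) rather than the real training set, so it can run on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training set (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing entries per column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

Running the same two calls on `data` and `test_data` would list the columns that need cleaning.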

5. Describe numerical features using the describe method.

In [None]:
# statistical view of the numeric columns
data.describe()

6. Let's now make some charts.

Let's visualize the distributions of some attributes and how survival varies with them.

In [None]:
figure = plt.figure(figsize=(12,7))
data.Age.hist()

In [None]:
# correlate the survival with the Sex variable.
survived_sex = data[data['Survived']==1]['Sex'].value_counts()
dead_sex = data[data['Survived']==0]['Sex'].value_counts()
df = pd.DataFrame([survived_sex,dead_sex])
df.index = ['Survived','Dead']
df.plot(kind='bar',stacked=True, figsize=(10,6),color = ['g','r'])
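
The stacked bars show raw counts. A survival rate per group can make the comparison easier to read, because the mean of a 0/1 column is the fraction of ones. A minimal sketch on a hypothetical toy frame (made-up rows, not the real data):

```python
import pandas as pd

# Hypothetical toy data: three women and three men
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Mean of the 0/1 target within each group = survival rate of that group
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

The same expression applied to `data` (before `Sex` is converted to integers) would give the observed rates for the training set.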

In [None]:
data.Pclass.hist()

In [None]:
#correlate the survival with the Pclass variable. 
plt.hist([data[data['Survived']==1]['Pclass'],data[data['Survived']==0]['Pclass']], stacked=True, color = ['g','r'],
         bins = 30,label = ['Survived','Dead'])
plt.xlabel('Pclass')
plt.ylabel('Number of passengers')

In [None]:
# combine the age, the fare and the survival on a single chart
plt.figure(figsize=(15,8))
ax = plt.subplot()
ax.scatter(data[data['Survived']==1]['Age'],data[data['Survived']==1]['Fare'],c='green',s=40)
ax.scatter(data[data['Survived']==0]['Age'],data[data['Survived']==0]['Fare'],c='red',s=40)
ax.set_xlabel('Age')
ax.set_ylabel('Fare')
ax.legend(('survived','dead'),scatterpoints=1,loc='upper right',fontsize=15,)

In [None]:
# ticket fare correlates with Pclass
ax = plt.subplot()
ax.set_ylabel('Average fare')
# select the Fare column before averaging so non-numeric columns are never aggregated
data.groupby('Pclass')['Fare'].mean().plot(kind='bar', figsize=(15,8), ax=ax)

## II - Feature engineering

### Processing Age

In [None]:
# find unique values from age attribute
data.Age.unique()

In [None]:
test_data.Age.unique()

In [None]:
# Filling the missing values with median of Age column
data['Age'] = data.Age.fillna(data.Age.median())

In [None]:
# Check whether any missing values (nan) remain
data.Age.unique()

In [None]:
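# Fill the missing values of the test set's Age column with its median
# (fillna must be called on test_data.Age, not data.Age, or training ages would be copied into the test set)
test_data['Age'] = test_data.Age.fillna(test_data.Age.median())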

In [None]:
test_data.Age.unique()
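
A common variation on median imputation is to compute the statistic once on the training set and reuse it for the test set, so that both are filled with the same value. This is a sketch of that variation on hypothetical toy frames (`train_toy`, `test_toy`), not a description of what the cells above do.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with some missing ages
train_toy = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test_toy = pd.DataFrame({"Age": [np.nan, 50.0, np.nan]})

# Median is computed on the training ages only (NaN is skipped by default)
age_median = train_toy["Age"].median()

# The same value fills the gaps in both frames
train_toy["Age"] = train_toy["Age"].fillna(age_median)
test_toy["Age"] = test_toy["Age"].fillna(age_median)
print(test_toy)
```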

### Processing Sex

In [None]:
# map the string values male and female to 0 and 1 respectively
data.Sex = data.Sex.map({'male': 0, 'female': 1}).astype(int)

In [None]:
# map the string values male and female to 0 and 1 respectively
test_data.Sex = test_data.Sex.map({'male': 0, 'female': 1}).astype(int)

In [None]:
#checking the data description
data.describe()

In [None]:
#select specific columns
data.loc[(data["Sex"]==1)  & (data["Survived"]==1), ["Sex","Age","Survived"]]

In [None]:
data.Sex.unique()

In [None]:
test_data.Sex.unique()
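
`Series.map` with a dictionary returns NaN for any value that is not a key, and `.astype(int)` then fails on that NaN. Checking for unmapped values before the cast makes the cause easy to find. A minimal sketch on a hypothetical column that contains one unexpected spelling:

```python
import pandas as pd

# Hypothetical column with an unexpected capitalised value
toy = pd.DataFrame({"Sex": ["male", "female", "Female", "male"]})
sex_codes = {"male": 0, "female": 1}

# Values missing from the dictionary come back as NaN
mapped = toy["Sex"].map(sex_codes)

# List the original strings that could not be mapped
unmapped = toy.loc[mapped.isnull(), "Sex"].unique().tolist()
print(unmapped)
```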

### Processing Fare

In [None]:
data.Fare.unique()

In [None]:
test_data.Fare.unique()

In [None]:
test_data['Fare'] = test_data.Fare.fillna(data.Fare.median())

### Processing Pclass

In [None]:
data.Pclass.unique()

### Processing SibSp, Parch

In [None]:
data.SibSp.unique()

In [None]:
data.Parch.unique()

### Processing Embarked

In [None]:
data.Embarked.unique()

In [None]:
# count passengers per port to find the most frequent Embarked value
data.Embarked.value_counts()

In [None]:
# replace the two missing values of Embarked with the most frequent value, 'S'
data.Embarked = data.Embarked.fillna('S')

In [None]:
data.Embarked.unique()

In [None]:
# map the Embarked strings to integers: C, Q and S become 0, 1 and 2 respectively
Ports = list(enumerate(np.unique(data['Embarked'])))    # all distinct Embarked values, sorted
Ports_dict = {name: i for i, name in Ports}             # dictionary of the form port: index
data.Embarked = data.Embarked.map(lambda x: Ports_dict[x]).astype(int)    # convert Embarked strings to int


In [None]:
data.Embarked.unique()

In [None]:
test_data.Embarked.unique()

In [None]:
# reuse the dictionary built from the training set so both datasets share the same codes
test_data.Embarked = test_data.Embarked.map(Ports_dict).astype(int)

In [None]:
test_data.Embarked.unique()
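
The integer codes are only meaningful if the training and test sets share one mapping. If each set derived its own dictionary and one of them lacked a port, the same letter could receive different codes. The sketch below illustrates this on hypothetical toy frames; `port_codes` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy frames; the test frame has no "C" passengers
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test_toy = pd.DataFrame({"Embarked": ["Q", "S"]})

# One mapping, derived from the training values in sorted order
port_codes = {name: i for i, name in enumerate(sorted(train_toy["Embarked"].unique()))}

# Apply the same mapping to both frames
train_toy["Embarked"] = train_toy["Embarked"].map(port_codes).astype(int)
test_toy["Embarked"] = test_toy["Embarked"].map(port_codes).astype(int)
print(port_codes)
print(test_toy)
```

Had the dictionary been rebuilt from `test_toy` alone, "Q" would have become 0 and "S" 1, which no longer matches the training codes.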

## III - Modeling

We'll be using Random Forests, which have proven effective on classification problems of this kind.

1. Use the training set to build a predictive model.

2. Evaluate the model using the training set.

3. Predict on the test set and generate an output file for the submission.

In [None]:
# target variable and the feature columns used for training
df_out = data.Survived
df_features = data[['Sex', 'Age', 'Pclass', 'Fare', 'SibSp', 'Parch', 'Embarked']]
df_features


In [None]:
clf = RandomForestClassifier()

In [None]:
df_out.shape

In [None]:
df_features.shape

In [None]:
# fit the forest, then compute its accuracy on the same training data
clf = clf.fit(df_features, df_out)
score = clf.score(df_features, df_out)
score
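
Accuracy measured on the rows the model was fitted on tends to be optimistic, since a random forest can largely memorise its training data. Cross-validation is one way to get a less biased estimate. The sketch below is an assumption-laden illustration on synthetic data (`X_toy`, `y_toy` are generated here, and `n_estimators=50`, `cv=5` are arbitrary choices), not the notebook's own evaluation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: the label depends on the first feature plus some noise
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)
y_toy = (X_toy[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on each of 5 held-out folds
cv_scores = cross_val_score(model, X_toy, y_toy, cv=5)
print(cv_scores.mean())
```

With the notebook's variables, the equivalent call would be `cross_val_score(clf, df_features, df_out, cv=5)`.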


In [None]:
for header, value in zip(df_features.columns,clf.feature_importances_):
    print (header," : ", value)
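
The importances are easier to compare when sorted. One option is to wrap them in a pandas Series indexed by feature name. The sketch below uses synthetic data with hypothetical columns `a`, `b`, `c`, where only `a` determines the label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; the label is a function of column "a" only
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.rand(200, 3), columns=["a", "b", "c"])
y_toy = (X_toy["a"] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Label each importance with its feature name and sort, largest first
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the fitted `clf`, the same pattern would be `pd.Series(clf.feature_importances_, index=df_features.columns)`.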


In [None]:
test_features=test_data[['Sex', 'Age','Pclass','Fare','SibSp','Parch','Embarked']]

In [None]:
Output = clf.predict(test_features)


In [None]:
for header, value in zip(test_features.columns,clf.feature_importances_):
    print (header," : ", value)

In [None]:
result = pd.DataFrame(columns=['PassengerId', 'Survived'])
result['PassengerId'] = test_data.PassengerId
result['Survived'] = Output.astype(int)
result.to_csv('randomForest.csv', index=False)
# print(result)
test_data['Survived']=result['Survived']
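
Before uploading, it can help to sanity-check the output frame. The helper below, `check_submission`, is hypothetical and only tests a few properties implied by the cell above: the two column names, no missing values, 0/1 predictions, and unique ids.

```python
import pandas as pd

def check_submission(frame):
    """Return a list of problems found in a submission-style frame (hypothetical helper)."""
    problems = []
    if list(frame.columns) != ["PassengerId", "Survived"]:
        problems.append("unexpected columns")
        return problems
    if frame.isnull().any().any():
        problems.append("missing values")
    if not frame["Survived"].isin([0, 1]).all():
        problems.append("Survived must be 0 or 1")
    if frame["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId")
    return problems

# Made-up examples: one clean frame, one with a bad label and a repeated id
good = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": [892, 892], "Survived": [0, 2]})
print(check_submission(good))
print(check_submission(bad))
```

Calling `check_submission(result)` in the notebook should return an empty list if the file is well formed.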

In [None]:
test_data.head(10)