<h1>Titanic competition</h1>

<div style="background:#abd5f5; color:#069; border:1px solid #b3deff; padding: 20px">
<h2>Table of contents</h2>
<ul>    
<li><a href="#Data-Analysis">Data Analysis</a></li>
<li><a href="#Data-Preprocessing">Data Preprocessing</a></li>
<li><a href="#Modelling">Modelling</a></li>
<li><a href="#Prediction">Prediction</a></li>
</ul>
</div>

<div style="background:#abd5f5; color:#069; border:1px solid #b3deff; padding: 20px">
<h2 id='Data-Analysis'>Data Analysis</h2>
</div>

In [None]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# load the train and test sets into dataframes
pass_train=pd.read_csv('../input/titanic/train.csv')
pass_test=pd.read_csv('../input/titanic/test.csv')

In [None]:
pass_train.info()

Let's explore the data!
The first parameter is Pclass. I'm going to create a bar chart to see how passenger class relates to survival.

In [None]:
# function for grouped bar chart
def grouped_bar_chart(labels,set1,set2):
    x = np.arange(len(labels))  # the label locations
    width = 0.35  # the width of the bars

    fig, ax = plt.subplots()
    rects1 = ax.bar(x - width/2, set1, width, label='Dead') # rectangles for the dead
    rects2 = ax.bar(x + width/2, set2, width, label='Survived') # rectangles for the survived

    ax.set_ylabel('Number of people')
    ax.set_xticks(x)
    ax.set_xticklabels(labels)
    ax.legend()
    fig.tight_layout()
    plt.show()

In [None]:
# grouping data by surviving and the class
gr=pass_train[['Survived','Pclass','PassengerId']].groupby(['Survived','Pclass']).count()
gr=gr.reset_index()
grouped_bar_chart(['1st class', '2nd class', '3rd class'],
                  gr[gr['Survived']==0]['PassengerId'],
                  gr[gr['Survived']==1]['PassengerId'])

As we see, if you are rich and travel 1st class, you have a better chance of surviving. Unfortunately, travelling 3rd class decreases your chance of surviving.

The next parameter is sex.
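
Raw counts are hard to compare when the classes differ in size, so a survival rate per class can help. A minimal sketch on a small made-up frame (the `toy` data is illustrative only, not the competition data); on the real data the same call could be applied to `pass_train`:

```python
import pandas as pd

# Illustrative toy data only; not the real Titanic passengers
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of survivors in each group
survival_rate = toy.groupby('Pclass')['Survived'].mean()
print(survival_rate)
```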

In [None]:
# grouping data by surviving and the sex
gr=pass_train[['Survived','Sex','PassengerId']].groupby(['Survived','Sex']).count()
gr=gr.reset_index()
grouped_bar_chart(gr[gr['Survived']==0]['Sex'],
                  gr[gr['Survived']==0]['PassengerId'],
                  gr[gr['Survived']==1]['PassengerId'])

Obviously, being a woman was better than being a man if you travelled on the Titanic.

The next parameter is number of siblings and spouses.

In [None]:
# grouping data by surviving and the number of siblings and spouses
gr=pass_train[['Survived','SibSp','PassengerId']].groupby(['Survived','SibSp']).count()
gr

In [None]:
gr=gr.reset_index()
# The number of groups is different for Survived=0 and Survived=1. Therefore I add missing rows to create a plot.
for i in range(0,2):
    for j in range(1,9):
        if len(gr[(gr['Survived']==i)&(gr['SibSp']==j)])==0:
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            gr=pd.concat([gr,pd.DataFrame([{'Survived':i,'SibSp':j,'PassengerId':0}])],ignore_index=True)
gr=gr.sort_values(by=['Survived','SibSp'])
grouped_bar_chart(gr[gr['Survived']==0]['SibSp'],
                  gr[gr['Survived']==0]['PassengerId'],
                  gr[gr['Survived']==1]['PassengerId'])

Having exactly one spouse or sibling is slightly better for survival.

Let's explore the number of parents or children.
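
Adding the missing rows by hand works, but a contingency table fills absent combinations with zeros automatically. A minimal sketch using `pd.crosstab` on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Illustrative toy data only; SibSp=2 has no survivors, SibSp=3 has no deaths
toy = pd.DataFrame({
    'Survived': [0, 0, 1, 1, 0, 1],
    'SibSp':    [0, 1, 0, 1, 2, 3],
})

# Rows are SibSp values, columns are Survived values;
# combinations that never occur are counted as 0
counts = pd.crosstab(toy['SibSp'], toy['Survived'])
print(counts)
```

The columns `counts[0]` and `counts[1]` have the same length, so they could be passed to `grouped_bar_chart` together with `counts.index` as labels.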

In [None]:
gr=pass_train[['Survived','Parch','PassengerId']].groupby(['Survived','Parch']).count()
gr

In [None]:
gr=gr.reset_index()
# The number of groups is different for Survived=0 and Survived=1. Therefore I add missing rows to create a plot.
# DataFrame.append was removed in pandas 2.0, so use pd.concat
gr=pd.concat([gr,pd.DataFrame([{'Survived':1,'Parch':4,'PassengerId':0},
                               {'Survived':1,'Parch':6,'PassengerId':0}])],ignore_index=True)
gr=gr.sort_values(by=['Survived','Parch'])
grouped_bar_chart(gr[gr['Survived']==0]['Parch'],
                  gr[gr['Survived']==0]['PassengerId'],
                  gr[gr['Survived']==1]['PassengerId'])

The best option is to have exactly one parent or child.

The next parameter is the port of embarkation.

In [None]:
gr=pass_train[['Survived','Embarked','PassengerId']].groupby(['Survived','Embarked']).count()
gr=gr.reset_index()
grouped_bar_chart(['Cherbourg','Queenstown','Southampton'],
                  gr[gr['Survived']==0]['PassengerId'],
                  gr[gr['Survived']==1]['PassengerId'])

Passengers who embarked at Cherbourg had a better chance of surviving than those who embarked at Southampton.

Next, I extract the title from names.

In [None]:
pass_train['Title']=pass_train['Name'].str.extract(r',\s?(.+?)\.\s')
pass_test['Title']=pass_test['Name'].str.extract(r',\s?(.+?)\.\s')
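
To see what the regular expression captures, here is a small sketch on a few sample names written in the `Surname, Title. Given names` format of the Name column. Passing `expand=False` is my addition so that the result is a Series:

```python
import pandas as pd

# Sample names in the 'Surname, Title. Given names' format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# The lazy group captures everything between the comma and the first '. '
titles = names.str.extract(r',\s?(.+?)\.\s', expand=False)
print(titles.tolist())
```

Because the group is lazy, multi-word titles such as `the Countess` are captured whole, while the match still stops at the first period followed by whitespace.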

In [None]:
gr=pass_train[['Survived','Title','PassengerId']].groupby(['Survived','Title']).count()
gr=gr.reset_index()
# I only show groups with at least 3 people
gr=gr[gr['PassengerId']>=3]
# DataFrame.append was removed in pandas 2.0, so use pd.concat
gr=pd.concat([gr,pd.DataFrame([{'Survived':1,'Title':'Rev','PassengerId':0}])],ignore_index=True)
grouped_bar_chart(gr[gr['Survived']==0]['Title'],
                  gr[gr['Survived']==0]['PassengerId'],
                  gr[gr['Survived']==1]['PassengerId'])

As we see, married women (Mrs) had a better chance of surviving than unmarried women (Miss).

The next parameter is fare. Calculate the median value.

In [None]:
pass_train[['Fare','Survived']].groupby(['Survived']).median()

Calculate the mean value.

In [None]:
pass_train[['Fare','Survived']].groupby(['Survived']).mean()

The higher the fare, the better the chance of surviving. I think the fare is connected with the class.

Finally, I get the deck from the cabin number. I think the deck is connected with the class.
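
One way to check whether the fare is connected with the class is to compare the median fare per class. A minimal sketch on made-up numbers (the `toy` fares are illustrative, not taken from the data set); on the real data the same call could be run on `pass_train`:

```python
import pandas as pd

# Illustrative toy data only
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Fare':   [80.0, 60.0, 26.0, 13.0, 8.05, 7.25],
})

# Median fare for each passenger class
fare_by_class = toy.groupby('Pclass')['Fare'].median()
print(fare_by_class)
```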

In [None]:
pass_train['Deck']=pass_train['Cabin'].str.slice(0,1)
pass_test['Deck']=pass_test['Cabin'].str.slice(0,1)
pd.concat([pass_train,pass_test])[['Deck','Pclass','PassengerId']].groupby(['Pclass','Deck']).count()

The "Cabin" parameter has a lot of missing values. The remaining values are connected with the class: the higher the class, the higher the deck.

<div style="background:#abd5f5; color:#069; border:1px solid #b3deff; padding: 20px">
<h2 id='Data-Preprocessing'>Data Preprocessing</h2>
</div>

Let's copy the dependent and independent variables into Y and X.

In [None]:
Y_train=pass_train[['Survived']].copy()
params=['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Title','Deck']
X_train=pass_train[params].copy()
X_test=pass_test[params].copy()

# concat X_train and X_test to preprocess them together
X_train['Set']='train'
X_test['Set']='test'
X_full=pd.concat([X_train,X_test])
X_full.info()

Checking missing values

In [None]:
pd.isna(X_full).sum()

Mark missing deck with "Z"

In [None]:
# assign the result back: inplace replace on a selected column is deprecated in recent pandas
X_full['Deck']=X_full['Deck'].fillna('Z')

Replace missing age and fare with the median

In [None]:
# assign the result back: inplace replace on a selected column is deprecated in recent pandas
X_full['Age']=X_full['Age'].fillna(X_full['Age'].median())
X_full['Fare']=X_full['Fare'].fillna(X_full['Fare'].median())
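
A small sketch of median imputation on a made-up column (the `ages` values are illustrative). It also shows why the median can be preferable to the mean here: a single very old passenger pulls the mean up but barely moves the median.

```python
import numpy as np
import pandas as pd

# Illustrative toy column with one missing value and one outlier
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 80.0])

# median() and mean() both skip missing values by default
print('median =', ages.median(), 'mean =', ages.mean())

# fillna returns a new Series with the gap filled
filled = ages.fillna(ages.median())
print(filled.tolist())
```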

Replace missing ports of embarkation with the most frequent value

In [None]:
X_full['Embarked']=X_full['Embarked'].fillna(X_full['Embarked'].value_counts().idxmax())

The next step is preparing categorical values

In [None]:
# Check values of 'Sex' in both sets
X_full[['Set','Sex','Pclass']].groupby(['Set','Sex']).count()

In [None]:
# Check values of 'Embarked' in both sets
X_full[['Set','Embarked','Pclass']].groupby(['Set','Embarked']).count()

In [None]:
# Check values of 'Title' in both sets
X_full[['Set','Title','Pclass']].groupby(['Set','Title']).count().sort_values(by=['Set','Title'])

As we see, there are values in the train set that are not in the test set and vice versa. Let's replace them with the closest common titles.

In [None]:
X_full['Title']=X_full['Title'].replace(['Capt','Don','Dona','Jonkheer','Lady','Major','Mlle','Mme','Sir','the Countess'],
                                        ['Mr','Mr','Mrs','Mr','Mrs','Mr','Miss','Mrs','Mr','Mrs'])

In [None]:
# Check values of 'Deck' in both sets
X_full[['Set','Deck','Pclass']].groupby(['Set','Deck']).count().sort_values(by=['Set','Deck'])

One passenger had the deck 'T'. Let's replace this with 'Z'.

In [None]:
X_full['Deck']=X_full['Deck'].replace('T','Z')

Convert categorical values into numerical.

In [None]:
# map returns a new Series, so assign the result back to the column
X_full['Deck']=X_full['Deck'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7,'Z':0})
X_full['Sex']=X_full['Sex'].map({'male':1,'female':0})
cat_params=['Embarked','Title']
X_full=pd.get_dummies(data=X_full,columns=cat_params,drop_first=True)
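
To illustrate what `drop_first=True` does, here is a minimal sketch on a made-up column (the `toy` frame is an assumption for illustration). The first category gets no column of its own and is implied when all remaining dummy columns are zero:

```python
import pandas as pd

# Illustrative toy data only
toy = pd.DataFrame({'Embarked': ['C', 'Q', 'S', 'S']})

# One dummy column per category except the first one ('C')
encoded = pd.get_dummies(data=toy, columns=['Embarked'], drop_first=True)
print(encoded)
```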

<div style="background:#abd5f5; color:#069; border:1px solid #b3deff; padding: 20px">
<h2 id='Modelling'>Modelling</h2>
</div>

I'll try to build several models using different methods and choose the best one.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

# array of methods
models=[LogisticRegression(),
        RandomForestClassifier(),
        SVC(),
        DecisionTreeClassifier(),
        KNeighborsClassifier()]

X_train=X_full[X_full['Set']=='train'].copy()
X_test=X_full[X_full['Set']=='test'].copy()
X_train.drop(columns=['Set'],inplace=True)
X_test.drop(columns=['Set'],inplace=True)

# scaling
scale=StandardScaler().fit(X_train)
X_train_sc=scale.transform(X_train)
X_test_sc=scale.transform(X_test)

for model in models:
    # evaluate the model with 10-fold cross-validation
    results=cross_validate(model,X_train_sc,Y_train.values.ravel(),cv=10)
    # mean accuracy over the folds (the default score for classifiers)
    acc=results['test_score'].mean()
    # print the result
    print('accuracy for',type(model).__name__,'=',acc)


<p>I got very good cross-validation accuracy for each method. Nevertheless, I think the actual accuracy on the test set will be lower. This may be connected with 2 problems:</p>
<ul>
<li>Hyperparameters. X_train contains 21 variables. Some may increase the accuracy score without having any actual connection with Y.</li>
<li>Overfitting. We have too little data to build a reliable model.</li>
</ul>
<p>I'll try to solve these problems with grid search optimization and combine all methods using a voting classifier.</p>
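
Another possible source of optimistic scores is that the scaler is fit on the whole training set before cross-validation, so every validation fold has already influenced the scaling. The effect is likely small for standard scaling, but it can be avoided by putting the scaler inside a pipeline. A minimal sketch on synthetic data (the data from `make_classification` is a stand-in, not the Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# The pipeline refits the scaler on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
results = cross_validate(pipe, X, y, cv=5)
print('mean accuracy =', results['test_score'].mean())
```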

In [None]:
# Logistic Regression
from sklearn.model_selection import GridSearchCV

param_lr={'penalty':['l1','l2'],
         'C' : [0.01,0.1,1,10,50,100,200,300],
         'solver':['liblinear', 'saga']}

gs_lr = GridSearchCV(LogisticRegression(),param_grid = param_lr, scoring="accuracy",n_jobs=-1)
gs_lr.fit(X_train_sc,Y_train.values.ravel())
best_lr=gs_lr.best_estimator_
print(best_lr)
print('score=',gs_lr.best_score_)

In [None]:
# Random Forest
param_rf={'max_features': [1, 2, 3, 5, 10],
          'min_samples_split': [2, 3, 5, 7, 10],
          'min_samples_leaf': [1, 3, 5, 7, 10],
          'bootstrap': [False],
          'n_estimators' :[100,200,300]}

gs_rf = GridSearchCV(RandomForestClassifier(),param_grid = param_rf, scoring="accuracy",n_jobs=-1)
gs_rf.fit(X_train_sc,Y_train.values.ravel())
best_rf=gs_rf.best_estimator_
print(best_rf)
print('score=',gs_rf.best_score_)

In [None]:
# Support Vector Machine
param_sv={'probability':[True],
          'gamma': [ 0.001, 0.01, 0.1, 1],
          'C': [1, 10, 50, 100, 200, 300, 1000]}

gs_sv = GridSearchCV(SVC(),param_grid = param_sv, scoring="accuracy",n_jobs=-1)
gs_sv.fit(X_train_sc,Y_train.values.ravel())
best_sv=gs_sv.best_estimator_
print(best_sv)
print('score=',gs_sv.best_score_)

In [None]:
# Decision Tree
param_dt={'max_features': [1, 2, 3, 5, 6, 7, 8, 9, 10, 15],
          'min_samples_split': [2, 3, 4, 5, 6, 7, 10, 15],
          'min_samples_leaf': [1, 2, 3, 5, 6, 7, 8, 10, 15],
          'splitter':['best']}

gs_dt = GridSearchCV(DecisionTreeClassifier(),param_grid = param_dt, scoring="accuracy",n_jobs=-1)
gs_dt.fit(X_train_sc,Y_train.values.ravel())
best_dt=gs_dt.best_estimator_
print(best_dt)
print('score=',gs_dt.best_score_)

In [None]:
# KNN
param_kn={'n_neighbors':[1,2,3,5,7,10,14,15]}

gs_kn = GridSearchCV(KNeighborsClassifier(),param_grid = param_kn, scoring="accuracy",n_jobs=-1)
gs_kn.fit(X_train_sc,Y_train.values.ravel())
best_kn=gs_kn.best_estimator_
print(best_kn)
print('score=',gs_kn.best_score_)

In [None]:
from sklearn.ensemble import VotingClassifier
vote=VotingClassifier(estimators=[('lr',best_lr),
                                  ('rfc', best_rf),
                                  ('svc',best_sv),
                                  ('dtc',best_dt),
                                  ('knc',best_kn)],
                      voting='soft', n_jobs=-1)

vote = vote.fit(X_train_sc, Y_train.values.ravel())
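
With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. That is why the SVC grid above sets `probability` to `True`. A minimal sketch that reproduces the averaging by hand on synthetic data (the two-member `vote_demo` ensemble and the data are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Titanic features (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

vote_demo = VotingClassifier(
    estimators=[('lr', LogisticRegression()), ('knc', KNeighborsClassifier())],
    voting='soft')
vote_demo.fit(X, y)

# Average the class probabilities of the fitted members by hand
member_probas = [est.predict_proba(X) for est in vote_demo.estimators_]
manual_proba = np.mean(member_probas, axis=0)

# Compare with the probabilities reported by the ensemble itself
print(np.allclose(manual_proba, vote_demo.predict_proba(X)))
```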

<div style="background:#abd5f5; color:#069; border:1px solid #b3deff; padding: 20px">
<h2 id='Prediction'>Prediction</h2>
</div>

In [None]:
Y_predict=vote.predict(X_test_sc)
pass_test['Survived']=Y_predict

In [None]:
pass_test.set_index('PassengerId',inplace=True)
pass_test[['Survived']].to_csv('result.csv')