<h1 id="Titanic" style="color:white;background:#0087B6;padding:8px;border-radius:8px"> Titanic - Machine Learning Solution </h1>

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

<center><img src="https://i0.wp.com/stringsmagazine.com/wp-content/uploads/2016/02/Titanic-e1630453763407.jpg?fit=800%2C450&ssl=1"></center>

## Problem definition

The goal is to build a predictive model that answers the question “What sorts of people were more likely to survive?” using passenger data (e.g. name, age, gender, socio-economic class).

## This notebook contains the following sections:
* Importing libraries
* Reading datasets
* Missing values
    * Visualizing missing data
    * Handling missing values
* Data manipulation
    * Checking the cardinality of features
    * Manipulating the 'Name' column
        * Extracting name titles
    * Manipulating the 'Ticket' column
        * Separating ticket prefixes and numbers
    * Reducing the cardinality of ticket prefixes and ticket numbers
* Correlations
    * Correlation between features and target
    * Correlation of features with themselves
* Adding new features
    * Visualizing the relation between features and target
    * Combining correlated features
* Encoding categorical features
* Modeling
    * Cross validation score for each model
    * Stacking ensemble
* Making prediction on test data


### I tried to keep the code short. I hope you enjoy it.

<h1 id="Import libraries" style="color:white;background:#0087B6;padding:8px;border-radius:8px">1. Import libraries</h1>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, KFold, RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score, ConfusionMatrixDisplay

sns.set()

<h1 id="Reading dataset" style="color:white;background:#0087B6;padding:8px;border-radius:8px">2. Reading datasets</h1>

In [None]:
train=pd.read_csv('../input/titanic/train.csv')
test=pd.read_csv('../input/titanic/test.csv')

#Combining train and test dataset for preprocessing
all_df=pd.concat([train,test], ignore_index=True,sort=False)
#Drop target column
all_df.drop('Survived',axis=1,inplace=True)


<h1 id="Missing values" style="color:white;background:#0087B6;padding:8px;border-radius:8px">3. Missing values</h1>

<h2 style="color:white;background:gray;padding:8px;border-radius:8px"> Visualize missing data </h2>

In [None]:
#Calculate and plot the percentage of missing values per column
def get_missings(df):
    labels,values = list(),list()
    #Check the dataframe passed in, not the global all_df
    if df.isna().sum().sum()>0:
        for column in df.columns:
            if df[column].isnull().sum():
                labels.append(column)
                values.append((df[column].isnull().sum() / len(df[column]))*100)
        #Make a dataframe 
        missings=pd.DataFrame({'Features':labels,'MissingPercent':values }).sort_values(by='MissingPercent',ascending=False)
        plt.figure(figsize=(10,7))
        sns.barplot(x=missings.Features,y=missings.MissingPercent).set_title('The Percentage of Missing Values')
        return missings.style.set_properties(**{'background-color': 'black','color': 'white'})
    else:
        return False

In [None]:
#Get the percentage of missing values
get_missings(all_df)
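
The percentage table built inside `get_missings` can also be computed without an explicit loop. The following is a minimal sketch on a made-up dataframe (the column names `a`, `b`, `c` are placeholders, not Titanic columns); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy data: 'a' has 2 of 4 values missing, 'c' has 1 of 4, 'b' has none
df = pd.DataFrame({
    'a': [1, np.nan, 3, np.nan],
    'b': [1, 2, 3, 4],
    'c': [np.nan, 2, 3, 4],
})

# Mean of the boolean mask = fraction missing; scale to percent
pct = df.isna().mean().mul(100)

# Keep only columns that actually have missing values, largest first
pct = pct[pct > 0].sort_values(ascending=False)
print(pct)
```

The result is a Series indexed by column name, which can be passed to `sns.barplot` in the same way as the `missings` dataframe.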

<h2 style="color:white;background:gray;padding:8px;border-radius:8px"> Dealing with missing values</h2>

### Drop cabin 

In [None]:
all_df.drop('Cabin',axis=1,inplace=True)

### Fill the Age column with the median based on passenger class

In [None]:
#Mask to access the null part of the age based on Pclass
mask1=(all_df.Pclass==1) & (all_df.Age.isnull())
mask2=(all_df.Pclass==2) & (all_df.Age.isnull())
mask3=(all_df.Pclass==3) & (all_df.Age.isnull())

#Fill the Age with median based on Pclass
all_df.loc[mask1,'Age']=all_df[all_df.Pclass==1]['Age'].median()
all_df.loc[mask2,'Age']=all_df[all_df.Pclass==2]['Age'].median()
all_df.loc[mask3,'Age']=all_df[all_df.Pclass==3]['Age'].median()
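
The three masks above can be collapsed into a single step with `groupby(...).transform('median')`. This is a sketch on toy data under the assumption that the goal is the same (per-class median imputation); it is an alternative formulation, not the code this notebook ran.

```python
import numpy as np
import pandas as pd

# Toy data: one missing age in class 1 and one in class 3
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# Median age of each passenger's own class, aligned row by row
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Fill only the missing ages with the matching class median
toy['Age'] = toy['Age'].fillna(class_median)
print(toy)
```

This form also scales to any number of classes without adding a mask per class.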

### Fill the 'Fare' 
There is one missing value in the 'Fare' column.

Check the Pclass for this passenger

In [None]:
all_df[['Pclass','Fare']][all_df.Fare.isnull()].head(50)

### Fill with the median of 3rd-class passengers fare

In [None]:
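#Assign the result back; calling fillna(inplace=True) on a column accessor triggers a chained-assignment warning in recent pandas versions
all_df['Fare']=all_df['Fare'].fillna(all_df.Fare[all_df.Pclass == 3].median())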

### Describe Embarked

In [None]:
all_df['Embarked'].describe()

### Fill 'Embarked' with top value

In [None]:
#Assign the result back instead of using inplace=True on a column accessor
all_df['Embarked']=all_df['Embarked'].fillna('S')

<h1 id="Data manipulation" style="color:white;background:#0087B6;padding:8px;border-radius:8px">4. Data manipulation</h1>

## Checking cardinality of features

In [None]:
#Get the number of unique values in each column
for column in all_df.columns:
    print(column +'---->', len(all_df[column].unique()) )

### Drop PassengerId

In [None]:
all_df.drop('PassengerId',axis=1,inplace=True)

<div class="alert alert-info"><h3> Features with high cardinality :</h3>

* Name
* Ticket
* Fare

</div>

### Converting all strings to lowercase

In [None]:
#Lowercase strings
for col in all_df.columns:
    if pd.api.types.is_string_dtype(all_df[col]):
        all_df[col] = all_df[col].str.lower()

<h2 style="color:white;background:gray;padding:8px;border-radius:8px"> Manipulating the 'Name' column  </h2>

In [None]:
all_df['Name']

## Extracting the name prefixes (Mr, Miss, Mrs,...)

In [None]:
#Take the substring between ', ' and the last '.', which is normally the title (prefix)
all_df['Name_Prefix']=all_df['Name'].apply(lambda x: x[x.find(', ')+len(', '):x.rfind('.')])

### Check cardinality

In [None]:
#Get unique name prefixes 
all_df['Name_Prefix'].unique()

One name contains two dots ('Mrs. Martin (Elizabeth L. Barrett)'), so taking everything up to the last dot captured more than the title.

Let's fix it


In [None]:
all_df['Name_Prefix']=all_df['Name_Prefix'].replace("mrs. martin (elizabeth l","mrs")
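
As an alternative to patching the one problematic value, the title can be extracted with a regular expression that stops at the first dot after the comma. This is a sketch on a few made-up lowercase names in the same "lastname, title. rest" layout; the pattern is an assumption about that layout, not something taken from the original notebook.

```python
import pandas as pd

# Sample names in the dataset's layout (already lowercased)
names = pd.Series([
    'braund, mr. owen harris',
    'rothschild, mrs. martin (elizabeth l. barrett)',
    'heikkinen, miss. laina',
])

# After the comma, capture everything up to (not including) the first dot
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```

Because `[^.]+` cannot cross a dot, a second dot later in the name no longer affects the result.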

### Value counts :

In [None]:
all_df['Name_Prefix'].value_counts()

### Translate a few name titles to English

In [None]:
all_df['Name_Prefix']=all_df['Name_Prefix'].replace("mlle","miss") #French  to En
all_df['Name_Prefix']=all_df['Name_Prefix'].replace("mme","mrs")   #French  to En
all_df['Name_Prefix']=all_df['Name_Prefix'].replace("don","sir")   #Spanish to En
all_df['Name_Prefix']=all_df['Name_Prefix'].replace("dona","mrs")  #Spanish to En

### Check the number of unique lastnames

In [None]:
#The Number of unique lastnames
len((all_df["Name"].str.split(",").str.get(0)).unique())

Unfortunately, there are a lot of unique last names, but I'll create the Lastname column anyway because it may help when manipulating other features.

In [None]:
#Split by comma and get the first part which would be the lastname
all_df['Lastname']=all_df["Name"].str.split(",").str.get(0)

### Drop name 

In [None]:
all_df.drop('Name',axis=1,inplace=True)

<h2 style="color:white;background:gray;padding:8px;border-radius:8px"> Manipulating the 'Ticket' </h2>

In [None]:
#Explore
all_df.Ticket

<h3 class="alert alert-info">There are more than 900 distinct tickets; we need variables with low cardinality</h3>

## Separate ticket prefixes and ticket numbers

In [None]:
#Split by space and store the first part in TicketPre ('NoPre' when the ticket is purely numeric)
all_df['TicketPre']=all_df.Ticket.apply(lambda x: x.split(' ')[0] if not x.isdigit() else 'NoPre')
#Split by space and store the last part in TicketNum (the whole ticket when it is purely numeric)
all_df['TicketNum']=all_df.Ticket.apply(lambda x: x.split(' ')[-1] if not x.isdigit() else x)
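
To make the splitting rule easier to follow, here is the same logic as a small standalone function applied to a few illustrative ticket strings. The function name `split_ticket` is hypothetical and only used for this sketch.

```python
def split_ticket(ticket):
    """Return (prefix, number) for a ticket string.

    Purely numeric tickets get the placeholder prefix 'NoPre'.
    Otherwise the first space-separated token is the prefix
    and the last token is the number.
    """
    if ticket.isdigit():
        return 'NoPre', ticket
    parts = ticket.split(' ')
    return parts[0], parts[-1]


# Illustrative inputs covering the different shapes of ticket strings
for t in ['a/5 21171', '113803', 'sc/ah basle 541', 'line']:
    print(t, '->', split_ticket(t))
```

Note that a ticket with no space and no digits, such as `'line'`, ends up as both prefix and number, which is why the 'line' values need special handling later on.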

## Check unique values for Ticket prefix

In [None]:
all_df['TicketPre'].unique()

<p style="color:black"> Not bad, but it can still be reduced by removing slashes and dots</p>

<p style="color:black">I'm not sure whether there is a difference between STON/O and SOTON/OQ or between S.C./PARIS and SC/Paris</p>

<p style="color:black">Let's check the TicketNum that may help </p>

In [None]:
#Get ticket numbers with similar ticket prefixes
all_df[['TicketPre','TicketNum']][(all_df.TicketPre=='ston/o2.')|
                                 (all_df.TicketPre=='soton/o2')
                                 ].sort_values(by='TicketNum').head(10)

### Still not sure but for now let's just get rid of those dots/slashes

In [None]:
#Remove slashes and dots from TicketPre
reps = {'.' : '','/':''}
all_df.TicketPre=all_df.TicketPre.str.translate(str.maketrans(reps))

In [None]:
all_df['TicketPre']=all_df['TicketPre'].replace("sotono2","stono2") #Improves score
all_df['TicketPre']=all_df['TicketPre'].replace("sotonoq","stonoq") #Improves score

## Working with Ticket numbers

In [None]:
all_df.TicketNum.sort_values()

### 4 cells contain the word 'line', let's check the Lastnames:

In [None]:
all_df[['TicketPre','TicketNum','Lastname']][all_df['TicketNum']=='line']

### Are there other passengers with these lastnames ?

In [None]:
all_df[['TicketNum','Lastname']][all_df.Lastname=='johnson']

### Replace 'line' with '347742'

In [None]:
all_df.TicketNum=all_df.TicketNum.replace('line',347742)

### Change the data type of TicketNum to int

In [None]:
all_df.TicketNum=all_df.TicketNum.astype(int)

### Let's see how many unique ticket numbers there are

In [None]:
#The number of unique ticket numbers
len(all_df.TicketNum.unique())

### Consecutive ticket numbers

<p style="color:black">There are runs of consecutive numbers in the TicketNum column, and passengers holding sequentially numbered tickets may have been related to each other. I don't want to treat this feature as a number, so after a bit of processing I'll change its type to a categorical datatype. </p>

### The idea is detecting the first ticket number of consecutive ticket numbers and replacing the rest with it

<p class="alert alert-info"> <b> e.g. [21,22,23,24] >>> will be >>> [21,21,21,21] </b> </p>

In [None]:
#Store the head of each sequence and numbers after that in a dict {head : range(FirstNum,LastNum)}
s, head = {}, None
for x in sorted(all_df['TicketNum']):
    if head is None or x != s[head].stop:
        head = x
    s[head] = range(head, x+1)

### An example for the ticket number '3101279' 

In [None]:
s[3101279]

#### It tells us that there are consecutive ticket numbers from 3101279 to 3101296 and the first number is the head of this sequence
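
The same grouping idea can be expressed with pandas operations instead of a loop. The sketch below uses made-up numbers and is an alternative formulation, not the code this notebook ran. One assumed difference: it removes duplicates before looking for runs, so a repeated ticket number does not start a new run, and its group counts may therefore differ from the loop above.

```python
import pandas as pd

# Toy ticket numbers, including a duplicate (22) inside a run
nums = pd.Series([21, 22, 22, 23, 40, 41, 100])

# Sorted unique values; a new run starts wherever the gap to the previous value is not 1
uniq = pd.Series(sorted(nums.unique()))
run_id = (uniq.diff() != 1).cumsum()

# The head of each run is its smallest value
head_of = uniq.groupby(run_id).transform('min')

# Map every original number to the head of its run
mapping = dict(zip(uniq, head_of))
groups = nums.map(mapping)
print(groups.tolist())
```

A dictionary lookup per ticket also avoids scanning every stored range for each row.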


### Define a function that returns the head for given ticket number

In [None]:
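#If the given ticket number is part of a sequence, return the head (first number) of that sequence.
def get_head(ticketNum):
    x=ticketNum  #Fallback: a number found in no sequence is its own head
    for head,seq in s.items():  #'seq' avoids shadowing the built-in range
        if ticketNum in seq:
            x=head
    return x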

### Now it's time to replace ticket numbers with the head of sequence (if there is a sequence)


In [None]:
#Replace each ticket number with the head of its sequence
all_df['TicketNum_Groups']=all_df['TicketNum'].apply(get_head)

#Get the number of unique ticketNumbers
len(all_df['TicketNum_Groups'].unique())

#### The number of unique ticket numbers has been reduced to 551. It's not perfect, but it will do for now.


In [None]:
#Change the type of ticket numbers to the object datatype
all_df['TicketNum_Groups']=all_df['TicketNum_Groups'].astype(object)

### Drop ticket

In [None]:
all_df.drop('Ticket',axis=1,inplace=True)
all_df.drop('TicketNum',axis=1,inplace=True)

<h1 id="Correlations" style="color:white;background:#0087B6;padding:8px;border-radius:8px">5. Correlations</h1>

### Correlations between each feature and target

In [None]:
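#Only numeric columns can be correlated; recent pandas versions raise an error on text columns
all_df[:891].select_dtypes('number').corrwith(train['Survived']).sort_values(ascending=False).head(10)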

### Correlation between features

In [None]:
plt.figure(figsize=(10,7))
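#Restrict to numeric columns; recent pandas versions no longer drop text columns silently
sns.heatmap(all_df.select_dtypes('number').corr(),annot=True)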

### High correlations : 

* <b>Positive</b> correlation between <b>Fare</b> and <b>Target</b>
* <b>Negative</b> correlation between <b>Pclass</b> and <b>Target</b>
* <b>Positive</b> correlation between <b>SibSp</b> and <b>Parch</b>
* <b>Negative</b> correlation between <b>Pclass</b> and <b>Fare </b>
* <b>Negative</b> correlation between <b>Pclass</b> and <b>Age </b>

<h1 id="Adding new features" style="color:white;background:#0087B6;padding:8px;border-radius:8px">6. Adding new features </h1>

## SibSp - Parch

### Combine SibSp and Parch (sibling-spouse, parent-children)

In [None]:
all_df['Relatives'] = all_df['SibSp'] + all_df['Parch']

### Visualize relatives and survival rate 

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x=all_df[:891]['Relatives'],hue=train.Survived)
plt.ylabel('The Number Of Passengers')

## Age

In [None]:
#Create bins for age column
bins = [0, 13,24,40,50,80]
all_df['Age_Bins'] = pd.cut(all_df['Age'], bins)

#Plot
plt.figure(figsize=(10,7))
sns.countplot(x=all_df[:891]['Age_Bins'],hue=train.Survived)
plt.ylabel('The Number Of Passengers')

## Pclass

In [None]:
#Plot Survival rate based on Pclass
plt.figure(figsize=(10,7))
sns.countplot(x=all_df[:891]['Pclass'],hue=train.Survived)
plt.ylabel('The Number Of Passengers')

## Fare

In [None]:
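#Create bins: (0, 50] and (50, 320]
#Note: pd.cut leaves values outside these intervals (e.g. a fare of exactly 0 or above 320) as NaN
bins = [0,50,320]
all_df['Fare_Bins'] = pd.cut(all_df['Fare'], bins)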
#Plot
plt.figure(figsize=(10,7))
sns.countplot(x=all_df[:891]['Fare_Bins'],hue=train.Survived)
plt.ylabel('The Number Of Passengers')
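
By default, `pd.cut` builds right-closed intervals and returns NaN for values that fall outside every bin. The sketch below, on made-up fares, shows that behavior with the bins used above, plus one possible way to cover the full range. The widened bins are an assumption for illustration, not what this notebook uses.

```python
import pandas as pd

# Made-up fares, including a zero and a value above the last bin edge
fares = pd.Series([0.0, 7.25, 50.0, 80.0, 512.33])

# Bins as used above: intervals (0, 50] and (50, 320]
binned = pd.cut(fares, [0, 50, 320])
print(binned.isna().tolist())  # which fares fell outside every bin

# Possible variant: include the lowest edge and leave the top bin open-ended
widened = pd.cut(fares, [0, 50, float('inf')], include_lowest=True)
print(widened.isna().sum())
```

Rows that end up as NaN in a binned column get all-zero dummy columns when the column is later passed to `pd.get_dummies`.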

## Add new feature based on Fare, Pclass, Age

In [None]:
all_df['Fare/Pclass/Age']=( all_df['Fare'] / all_df['Pclass'] ) / (all_df['Age'])

<h1 id="Encoding features" style="color:white;background:#0087B6;padding:8px;border-radius:8px">7. Encoding categorical features</h1>

In [None]:
#Drop list
toDrop=['Age_Bins','Fare','Pclass','Age','Parch','SibSp','Lastname']
#Get dummies
all_df_copy=pd.get_dummies(all_df.drop(toDrop,axis=1) ,drop_first=True)
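
To show what `drop_first=True` does, here is a small sketch on a made-up dataframe. For each categorical column, the first category in sorted order is dropped and becomes the implicit baseline, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    'Relatives': [1, 0, 2],
    'Sex': ['male', 'female', 'male'],
    'Embarked': ['s', 'c', 'q'],
})

# 'female' and 'c' are first alphabetically, so their dummy columns are dropped
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

A row with zeros in every dummy column of a feature therefore belongs to that feature's dropped baseline category.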

In [None]:
all_df_copy

<h1 id="Modeling" style="color:white;background:#0087B6;padding:8px;border-radius:8px">8. Modeling </h1>

### Split the preprocessed data into train/test

In [None]:
train_df=all_df_copy[:891].copy()
train_df['Survived']=train['Survived'].copy()
test_df=all_df_copy[891:].copy()

### Define variables

In [None]:
#Independent variables
X=train_df.drop('Survived',axis=1)
#Target
y=train_df['Survived']
#Scaler
scaler=StandardScaler()
x_scaled=scaler.fit_transform(X)
#Split the train df to train/test
X_train,X_test,y_train,y_test= train_test_split(x_scaled,y,test_size=0.2, random_state=42)
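
Here the scaler is fitted on all training rows before the split, so the held-out rows influence the scaling statistics. One common way to avoid this is to put the scaler and the model in a scikit-learn pipeline, so the scaler is refitted on the training part of each fold. The sketch below uses synthetic data from `make_classification` and a plain `LogisticRegression`; both are stand-ins for illustration, not this notebook's data or tuned models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed feature matrix and target
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = make_pipeline(StandardScaler(), LogisticRegression())

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipe, X_toy, y_toy, scoring='accuracy', cv=cv)
print(scores.mean())
```

The same pipeline object can be passed to `cross_val_score` or used as a base estimator in a stacking ensemble, with the unscaled features as input.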

### Define models 

In [None]:
# get a list of models to evaluate
def get_models():
    models = dict()
    models['LGBMClassifier'] = LGBMClassifier()
    models['LogisticRegression'] = LogisticRegression()
    models['DecisionTree'] = DecisionTreeClassifier(max_depth=8) #Tuned
    models['RandomForest'] = RandomForestClassifier(max_depth=32) #Tuned
    models['GradientBoosting'] = GradientBoostingClassifier(max_depth=5) #Tuned
    models['svc'] = SVC(C=100, gamma=0.001, kernel='sigmoid') #Tuned
    return models

### Define evaluation function

In [None]:
# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores

### Get cross-validated score 

In [None]:
models=get_models()
results,names=list(),list()

for name,model in models.items():
    scores=evaluate_model(model,x_scaled,y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))

## Stacking Classifier

In [None]:
def get_stacking():
    # define the base models
    level0 = list()
    level0.append(('LogisticR', LogisticRegression()))
    level0.append(('LGBMClassifier', LGBMClassifier()))
    level0.append(('DecisionTree', DecisionTreeClassifier(max_depth=8)))
    level0.append(('RandomForest', RandomForestClassifier(max_depth=32)))
    level0.append(('GBoost', GradientBoostingClassifier()))
    level0.append(('svc', SVC(C=100, gamma=0.001, kernel='sigmoid')))
    # define meta learner model
    level1 = LogisticRegression()
    # define the stacking ensemble
    model = StackingClassifier(estimators=level0, final_estimator=level1, cv=10, n_jobs= -1)
    return model

### Validation score and confusion matrix

In [None]:
model= get_stacking()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print("Validation accuracy : ", accuracy_score(y_test,y_pred))

In [None]:
#confusion_matrix expects (y_true, y_pred); rows are true classes, columns are predicted classes
CM=confusion_matrix(y_test,y_pred,labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=CM, display_labels=model.classes_)
disp.plot(cmap='viridis')
plt.grid(False)
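
scikit-learn's `confusion_matrix` takes the true labels first and the predictions second, so rows correspond to true classes and columns to predicted classes. The following sketch on made-up labels shows that swapping the two arguments transposes the matrix, which would mislabel the axes of the plot.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: three true negatives-or-false-positives, two true positives
y_true = [0, 0, 0, 1, 1]
y_hat = [0, 1, 1, 1, 1]

# Correct order: rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

# Swapped order gives the transpose
cm_swapped = confusion_matrix(y_hat, y_true, labels=[0, 1])

print(cm)
print(cm_swapped)
```

Accuracy is unaffected by the argument order, but precision, recall, and the plotted matrix are.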

<h1 id="Make prediction on test data" style="color:white;background:#0087B6;padding:8px;border-radius:8px">9. Make prediction on test data </h1>

In [None]:
test_pred=model.predict(scaler.transform(test_df))
sub=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':test_pred})

In [None]:
#Save predictions
sub.to_csv('submission.csv',index=False)