# Creator:
### Najeeb Zaidi
### Github: https://github.com/snajeebz
### zaidi.nh@gmail.com
### Contributors: 
1. https://github.com/snajeebz
## Dataset Source: 
1. https://www.kaggle.com/competitions/titanic

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import os
import seaborn as sns
#Disable warning
import warnings
warnings.filterwarnings("ignore")



# Importing the Dataset

In [2]:
try:  # local environment
    train_df = pd.read_csv("Dataset/train.csv")
    serving_df = pd.read_csv("Dataset/test.csv")
except FileNotFoundError:  # Kaggle environment
    train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
    serving_df = pd.read_csv("/kaggle/input/titanic/test.csv")

train_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


# Strategy:
1. Data preparation and scikit-learn algorithm implementation
2. Model and hyperparameter tuning
3. TensorFlow model implementation
4. TensorFlow model and hyperparameter tuning

## 1. Data Preparation and scikit-learn Algorithm Implementation

### Steps:
1. Dataset EDA
2. Data wrangling
3. Train/test dataset preparation for scikit-learn
4. scikit-learn ML models plus hyperparameter tuning
5. Submission of the best results

## 1. Dataset EDA

In [3]:
train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [4]:
train_df=pd.get_dummies(train_df, columns=['Sex'])


# Observations:
1. Cabin has 687 nulls, which is more than 75% of the rows, so it is better not to use Cabin as a feature for our model.
2. Age has around 20% nulls, so we will do our best to fill them in.
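
The percentages quoted above follow directly from the null counts. A minimal sketch of that check, assuming a small made-up frame `demo` in place of `train_df` (the helper name `missing_share` is hypothetical and does not appear in the notebook):

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

# Hypothetical toy data standing in for train_df.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
shares = missing_share(demo)
print(shares)

# Same arithmetic applied to the counts reported above (891 training rows).
cabin_share = 687 / 891
age_share = 177 / 891
print(f"Cabin: {cabin_share:.1%}, Age: {age_share:.1%}")
```

On the real data, `missing_share(train_df)` would list Cabin first, followed by Age and Embarked.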

## Let's dig deep into Age

In [5]:
print("Group by Parch: \n",train_df['Age'].isna().groupby(train_df['Parch']).value_counts())
print("Group by SibSp: \n",train_df['Age'].isna().groupby(train_df['SibSp']).value_counts())
print("Group by Pclass: \n",train_df['Age'].isna().groupby(train_df['Pclass']).value_counts())



Group by Parch: 
 Parch  Age  
0      False    521
       True     157
1      False    110
       True       8
2      False     68
       True      12
3      False      5
4      False      4
5      False      5
6      False      1
Name: count, dtype: int64
Group by SibSp: 
 SibSp  Age  
0      False    471
       True     137
1      False    183
       True      26
2      False     25
       True       3
3      False     12
       True       4
4      False     18
5      False      5
8      True       7
Name: count, dtype: int64
Group by Pclass: 
 Pclass  Age  
1       False    186
        True      30
2       False    173
        True      11
3       False    355
        True     136
Name: count, dtype: int64


## Observation
1. About 16% of female passengers have a missing Age.
2. About 21% of male passengers have a missing Age.
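
The cell above groups the missing ages by Parch, SibSp and Pclass but not by sex. A sketch of how a per-group missing rate could be computed, assuming a made-up frame `demo` with a raw `Sex` column (in this notebook `Sex` has already been one-hot encoded, so `train_df['Sex_female']` would be the grouping key instead):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the raw training frame.
demo = pd.DataFrame({
    "Sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "Age": [22.0, np.nan, 35.0, np.nan, 26.0, 14.0, np.nan, 38.0],
})

# The mean of a boolean mask is the share of rows where Age is missing.
missing_by_sex = demo["Age"].isna().groupby(demo["Sex"]).mean()
print(missing_by_sex)
```

Using `.mean()` rather than `.value_counts()` gives the rate per group directly, which is easier to compare across groups of different sizes.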

# Let's impute the nulls.

### For the results of ML-based imputation, see the [Imputing the NaNs by ML](https://www.kaggle.com/code/najeebz/titanic-deep-learning) notebook:
> It gave about 80% accurate results.

In [6]:
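def impute_age(data):
    # Fill missing ages with the median of the known ages (median() skips NaNs).
    fill = data['Age'].median()
    # Assign back instead of using inplace=True on a column selection,
    # which newer pandas versions warn about and may not apply.
    data['Age'] = data['Age'].fillna(fill)
    return data
X=train_df[['SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass', 'Age']]
train_df = impute_age(train_df)
print('After Imputations:\n ', train_df.isnull().sum())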



After Imputations:
  PassengerId      0
Survived         0
Pclass           0
Name             0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Sex_female       0
Sex_male         0
dtype: int64


In [7]:
df= train_df[['Age', 'SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass', 'Survived']]
figure= px.imshow(df.corr(), text_auto=True, width=1200, height=1200)
figure.show()

There is a noticeable correlation between SibSp, Parch, and Age. A machine learning model trained on these features may therefore give a useful way to fill the NaNs in Age.
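
A rough sketch of what such model-based imputation could look like. This is not the method used in this notebook (which fills with the median) nor necessarily the one in the linked notebook; the helper `impute_age_with_model`, the choice of `LinearRegression`, and the toy data are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_age_with_model(data: pd.DataFrame, features: list) -> pd.DataFrame:
    """Fill missing Age values with predictions from a regression on `features`."""
    data = data.copy()
    known = data["Age"].notna()
    if known.all():
        return data  # nothing to impute
    model = LinearRegression()
    # Train only on rows where Age is known.
    model.fit(data.loc[known, features], data.loc[known, "Age"])
    # Predict Age for the rows where it is missing.
    data.loc[~known, "Age"] = model.predict(data.loc[~known, features])
    return data

# Hypothetical toy data standing in for train_df.
demo = pd.DataFrame({
    "SibSp": [1, 0, 3, 0, 1, 4],
    "Parch": [0, 0, 1, 0, 2, 1],
    "Age": [22.0, 35.0, 2.0, np.nan, 27.0, np.nan],
})
filled = impute_age_with_model(demo, ["SibSp", "Parch"])
print(filled)
```

A linear model can predict implausible ages (for example negative values), so in practice the predictions would need clipping or a different regressor.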

# Generating Feature by Clustering

In [8]:
def cluster(X):
    # Assign each row to one of 5 KMeans clusters (labels 0-4).
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=5, random_state=0, n_init="auto")
    kmeans.fit(X)
    return kmeans.labels_
X=train_df[['SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass']]
train_df['Clusters']=(cluster(X)+1)/5  # rescale labels 0-4 to 0.2-1.0
train_df


Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_female,Sex_male,Clusters
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,S,False,True,0.6
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,True,False,0.2
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,S,True,False,0.6
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,S,True,False,0.2
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,S,False,True,0.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,S,False,True,0.6
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,S,True,False,0.6
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",28.0,1,2,W./C. 6607,23.4500,,S,True,False,0.6
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,C,False,True,0.6


In [9]:
df= train_df[['Age', 'SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass','Clusters', 'Survived']]
figure= px.imshow(df.corr(), text_auto=True, width=1200, height=1200)
figure.show()

### Clusters added a feature with 15% correlation.

# Preparing Training Dataset

In [10]:
X=train_df[['Age', 'SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass','Clusters']]
y=train_df['Survived']  # 1-D target; a single-column DataFrame triggers a scikit-learn shape warning

In [11]:
def scale(X):
    # Standardize each feature to zero mean and unit variance.
    from sklearn import preprocessing
    scaler = preprocessing.StandardScaler()
    return scaler.fit_transform(X)

In [12]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(scale(X),y,train_size=0.8, random_state=42)
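
The cell above scales the full feature matrix before splitting, so the held-out rows also contribute to the mean and variance used for scaling. One common alternative, sketched here with made-up data (`X_demo` and `y_demo` are placeholders, not the notebook's variables), is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix and labels standing in for X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuse the same mean and variance

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Keeping the fitted `scaler` object around also allows the same transformation to be applied later to the serving data.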

# Evaluation of the Training Model

In [13]:
def evaluate(y_test, y_pred):
    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score)
    # Accuracy: share of all predictions that are correct.
    print("Accuracy: ",accuracy_score(y_test,y_pred))
    # Precision: of the passengers predicted to survive, the share that actually survived.
    print("Precision Score : ", precision_score(y_test,y_pred))
    # Recall: of the passengers who actually survived, the share predicted to survive.
    # average='macro' reports the unweighted mean of the recall of both classes.
    print("Recall Score: ", recall_score(y_test,y_pred, average='macro'))
    # F1: harmonic mean of precision and recall for the survived class.
    print("F1 Score: ",f1_score(y_test,y_pred))
    cm = confusion_matrix(y_test, y_pred)
    figure= px.imshow(cm,text_auto=True, width=1200, height=1200)
    figure.show()

# Training KNN Classifier

In [14]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train)
y_pred=knn.predict(X_test)
evaluate(y_test,y_pred)

Accuracy:  0.7932960893854749
Precision Score :  0.7534246575342466
Recall Score:  0.7859073359073359
F1 Score:  0.7482993197278911
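
To make the four scores concrete, here is a worked example using confusion-matrix counts that are consistent with the KNN output above. The counts are reconstructed from the printed scores, not read from the confusion-matrix plot, so treat them as an inference:

```python
# Counts inferred from the KNN scores above (179 test rows in total).
tn, fp, fn, tp = 87, 18, 19, 55

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)       # correct among predicted survivors
recall_pos = tp / (tp + fn)      # found among actual survivors
recall_neg = tn / (tn + fp)      # found among actual non-survivors
macro_recall = (recall_pos + recall_neg) / 2  # what average='macro' reports
f1 = 2 * precision * recall_pos / (precision + recall_pos)  # harmonic mean

print(f"accuracy={accuracy:.4f} precision={precision:.4f}")
print(f"macro recall={macro_recall:.4f} f1={f1:.4f}")
print(f"recall for the survived class only={recall_pos:.4f}")
```

The macro-averaged recall (about 0.786) is higher than the recall for the survived class alone (about 0.743) because the model identifies non-survivors more reliably than survivors. The F1 score is computed from the survived-class recall, not from the macro average.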


# Training Random Forest Classifier

In [15]:
from sklearn.ensemble import RandomForestClassifier
rf =RandomForestClassifier(n_jobs=-1,verbose=1) 
print ('Training the model')
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)
evaluate(y_test,y_pred)

Training the model


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.1s


Accuracy:  0.8044692737430168
Precision Score :  0.782608695652174
Recall Score:  0.7934362934362934
F1 Score:  0.7552447552447553


[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.3s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished


# Training MLP Classifier

In [16]:
from sklearn.neural_network import MLPClassifier
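clf = MLPClassifier(solver='adam', 
              max_iter=1000, 
              alpha=10, 
              hidden_layer_sizes=(10,),   # one hidden layer with 10 units
              random_state=5,
              activation='identity',      # linear activation, so the network is a linear model
              batch_size=360, 
              learning_rate='adaptive',   # only used when solver='sgd'
              verbose=False,
              early_stopping=False, 
              n_iter_no_change=100)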

print ('Training the model')
clf.fit(X_train,y_train)
print(clf.score(X_train,y_train))
y_pred=clf.predict(X_test)
evaluate(y_test,y_pred)

Training the model
0.797752808988764
Accuracy:  0.7988826815642458
Precision Score :  0.7794117647058824
Recall Score:  0.7866795366795367
F1 Score:  0.7464788732394366


## Observations:
- All the models reach roughly 0.8 accuracy, with little difference between them.
- However, the best accuracy (0.804) is obtained by the Random Forest Classifier, which is the model used for the submission.
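
With differences this small on a single 80/20 split, cross-validation gives a steadier comparison. A minimal sketch using synthetic data from `make_classification` as a stand-in for the prepared Titanic features (the data, the model settings and the `models` dictionary are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

models = {
    # The pipeline refits the scaler inside each fold, avoiding leakage.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the gap between two mean scores is smaller than the fold-to-fold standard deviation, the ranking of the models should not be read as meaningful.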

# Preparing the Test Data

In [17]:
df = serving_df[['Age','SibSp', 'Parch', 'Fare','PassengerId', 'Sex','Pclass']].copy()  # copy to avoid modifying a view of serving_df
print('Before Processing: \n', df.isnull().sum())
df=pd.get_dummies(df, columns=['Sex'])
df = impute_age(df)
df['Fare']=df['Fare'].fillna(df['Fare'].median())
X=df[['SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass']]
df['Clusters']=(cluster(X)+1)/5  
print('After Processing: \n', df.isnull().sum())


Before Processing: 
 Age            86
SibSp           0
Parch           0
Fare            1
PassengerId     0
Sex             0
Pclass          0
dtype: int64
After Processing: 
 Age            0
SibSp          0
Parch          0
Fare           0
PassengerId    0
Pclass         0
Sex_female     0
Sex_male       0
Clusters       0
dtype: int64
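
Note that calling `cluster(X)` on the serving data fits a fresh KMeans model, and the cluster numbering from an independent fit is not guaranteed to match the numbering seen during training. A sketch of one alternative, fitting once and reusing the centroids through `predict`, with made-up arrays in place of the real feature matrices:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-ins for the training and serving feature matrices.
rng = np.random.default_rng(0)
X_train_demo = np.vstack([
    rng.normal(0.0, 0.3, size=(30, 2)),
    rng.normal(5.0, 0.3, size=(30, 2)),
])
X_serving_demo = np.array([[0.1, -0.2], [5.2, 4.9]])

# Fit once on the training rows, then reuse the same centroids for serving rows.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X_train_demo)
train_labels = kmeans.labels_
serving_labels = kmeans.predict(X_serving_demo)

print(serving_labels)
```

In this notebook that would mean returning the fitted model from `cluster` (or storing it) so the serving rows can be labelled with the clusters learned on the training set.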


# Preparing the Results for Submission

In [18]:
X=df[['Age', 'SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass','Clusters']]
result=pd.DataFrame(columns=['PassengerId', 'Survived'])
result['PassengerId']=serving_df['PassengerId']
result['Survived']=rf.predict(scale(X))

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished


In [19]:
result=result.set_index('PassengerId')
result

PassengerId,Survived
892,0
893,0
894,0
895,1
896,0
...,...
1305,0
1306,1
1307,0
1308,0


In [20]:
result.to_csv('submission.csv')