<a href="https://www.kaggle.com/code/najeebz/titanic-clustering-for-beginners-features-engr?scriptVersionId=157488193" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Creator:
### Najeeb Zaidi
### Github: https://github.com/snajeebz
### zaidi.nh@gmail.com
### Contributors: 
1. https://github.com/snajeebz
## Dataset Source: 
1. https://www.kaggle.com/competitions/titanic

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import os
import seaborn as sns
#Disable warning
import warnings
warnings.filterwarnings("ignore")



# Importing the Dataset

In [None]:
try:   # for local environment
    train_df = pd.read_csv("Dataset/train.csv")
    serving_df = pd.read_csv("Dataset/test.csv")
except FileNotFoundError:   # for Kaggle environment
    train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
    serving_df = pd.read_csv("/kaggle/input/titanic/test.csv")

train_df.head(10)

# Strategy:
1. Data Preparation and Scikit Learn Algo implementation
2. Model and Hyper Parameter Tuning
3. Tensorflow Models implementation
4. Tensorflow Models and Hyper-Parameters Tuning

## 1. Data Preparation and Scikit Learn Algo implementation

### Steps:
1. Dataset EDA
2. Data Wrangling
3. Test Train Dataset preparation for scikit-Learn
4. Scikit Learn ML Model Plus Hyper Parameters Tuning
5. Submission of the Best Results.

## 1. Dataset EDA

In [None]:
train_df.isnull().sum()

In [None]:
train_df=pd.get_dummies(train_df, columns=['Sex'])


# Observations:
1. Cabin has 687 nulls, which is more than 75% of the rows, so it is better not to use Cabin as a feature for our model.
2. Age has around 20% nulls, so we will do our best to fill them.
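The percentages above can be read directly from the data instead of being worked out by hand. This is a minimal sketch on a tiny made-up frame (not the Titanic data); the helper name `null_share` is hypothetical.

```python
import pandas as pd

def null_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)

# Tiny illustrative frame with made-up values
toy = pd.DataFrame({
    "Cabin": [None, None, None, "C85"],
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

shares = null_share(toy)
print(shares)
```

On the real data, `null_share(train_df)` should put Cabin at the top of the list.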

## Let's dig deep into Age

In [None]:
print("Group by Parch: \n",train_df['Age'].isna().groupby(train_df['Parch']).value_counts())
print("Group by SibSp: \n",train_df['Age'].isna().groupby(train_df['SibSp']).value_counts())
print("Group by Pclass: \n",train_df['Age'].isna().groupby(train_df['Pclass']).value_counts())



## Observation
1. About 16% of the female passengers have a missing Age.
2. About 21% of the male passengers have a missing Age.
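The share of missing Age per sex can be computed with the same `isna().groupby(...)` pattern used above, taking the mean instead of the counts. This is a sketch on made-up rows; it assumes the one-hot column `Sex_female` created earlier, and the `female`/`male` labels are added only for readability.

```python
import pandas as pd

# Made-up rows, only to show the pattern
toy = pd.DataFrame({
    "Sex_female": [True, True, True, True, False, False, False, False],
    "Age": [22.0, None, 26.0, 35.0, None, None, 54.0, 2.0],
})

# Turn the one-hot flag back into a readable label for the grouping key
sex = toy["Sex_female"].map({True: "female", False: "male"})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing Age within each group
missing_by_sex = toy["Age"].isna().groupby(sex).mean()
print(missing_by_sex)
```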

# Let's impute the nulls.

### For ML-based imputation of the NaNs, refer to the results in the [Imputing the NaNs by ML](https://www.kaggle.com/code/najeebz/titanic-deep-learning) notebook:
> It produced results that were 80% accurate.

In [None]:
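def impute_age(data):
    # Fill missing ages with the median of the known ages
    # (Series.median() skips NaNs by default)
    fill = data['Age'].median()
    # Assign the result back instead of calling fillna(inplace=True) on a
    # column selection, which may not update the DataFrame in recent pandas
    data['Age'] = data['Age'].fillna(fill)
    return data
train_df = impute_age(train_df)
X=train_df[['SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass', 'Age']]
print('After Imputations:\n ', train_df.isnull().sum())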



In [None]:
df= train_df[['Age', 'SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass', 'Survived']]
figure= px.imshow(df.corr(), text_auto=True, width=1200, height=1200)
figure.show()

SibSp and Parch show a noticeable correlation with Age. If we apply a machine learning algorithm, we can probably get a useful model to fill the NaNs of Age.
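As a rough illustration of that idea, the sketch below fills missing ages with predictions from a regressor trained on the rows where Age is known. It is not the method used in this notebook (which uses the median); the function name `impute_age_ml`, the choice of `LinearRegression` and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_age_ml(data: pd.DataFrame, features: list) -> pd.DataFrame:
    """Fill missing Age with predictions from a model fit on rows where Age is known."""
    data = data.copy()  # leave the caller's frame untouched
    known = data["Age"].notna()
    if known.all():
        return data
    model = LinearRegression().fit(data.loc[known, features], data.loc[known, "Age"])
    data.loc[~known, "Age"] = model.predict(data.loc[~known, features])
    return data

# Made-up rows where Age = 30 - 5*SibSp - 3*Parch, with the last Age missing
toy = pd.DataFrame({
    "SibSp": [0, 1, 2, 0, 1, 3, 1],
    "Parch": [0, 0, 1, 2, 1, 2, 2],
    "Age": [30.0, 25.0, 17.0, 24.0, 22.0, 9.0, np.nan],
})

filled = impute_age_ml(toy, ["SibSp", "Parch"])
print(filled)
```

On the real data the relationship is much noisier, so the predictions should be checked on held-out rows with a known Age before they are trusted.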

# Generating Feature by Clustering

In [None]:
def cluster(X):
    from sklearn.cluster import KMeans
    # Fit K-Means with 5 clusters and return the cluster label (0-4) of every row
    kmeans = KMeans(n_clusters=5, random_state=0, n_init="auto")
    kmeans.fit(X)
    return kmeans.labels_
X=train_df[['SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass']]
# Shift the labels to 1-5 and divide by 5 so the feature lies in (0, 1]
train_df['Clusters']=(cluster(X)+1)/5
train_df


In [None]:
df= train_df[['Age', 'SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass','Clusters', 'Survived']]
figure= px.imshow(df.corr(), text_auto=True, width=1200, height=1200)
figure.show()

### The Clusters feature adds a correlation of about 15%.

# Preparing Training Dataset

In [None]:
X=train_df[['Age', 'SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass','Clusters']]
y=train_df['Survived']  # 1-D target, the shape scikit-learn classifiers expect

In [None]:
def scale(X):
    from sklearn import preprocessing
    # Standardize every column to zero mean and unit variance
    scaler=preprocessing.StandardScaler()
    return scaler.fit_transform(X)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(scale(X),y,train_size=0.8, random_state=42)

# Evaluation of the Training Model

In [None]:
def evaluate(y_test,y_pred):
    from sklearn.metrics import precision_score
    from sklearn.metrics import recall_score
    from sklearn.metrics import f1_score
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import confusion_matrix
    print("Accuracy: ",accuracy_score(y_test,y_pred))
    # Precision = TP / (TP + FP): of the passengers predicted to survive, the share who actually survived.
    # E.g. if 70 are predicted to survive and 56 of them really did, precision is 56/70 = 0.8.
    print("Precision Score : ", precision_score(y_test,y_pred))
    # Recall = TP / (TP + FN): of the passengers who actually survived, the share predicted to survive.
    # E.g. if 100 actually survived and 67 of them were predicted to survive, recall is 0.67.
    # average='macro' averages the recall of both classes, unlike precision and F1 here,
    # which score the positive (survived) class only.
    print("Recall Score: ", recall_score(y_test,y_pred, average='macro'))
    # F1 = harmonic mean of precision and recall
    print("F1 Score: ",f1_score(y_test,y_pred))
    cm = confusion_matrix(y_test, y_pred)
    figure= px.imshow(cm,text_auto=True, width=1200, height=1200)
    figure.show()
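To make the metric definitions concrete, here is a small worked example in plain Python for the positive (survived = 1) class. The labels are made up and the helper `binary_metrics` is hypothetical; scikit-learn's binary precision, recall and F1 follow the same formulas.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class labelled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 4 passengers actually survived; 3 are predicted to survive, 2 of them correctly
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = binary_metrics(y_true_demo, y_pred_demo)
print(precision, recall, f1)
```

Here TP = 2, FP = 1 and FN = 2, so precision is 2/3, recall is 2/4 and F1 is 4/7.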

# Training KNN Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train)
y_pred=knn.predict(X_test)
evaluate(y_test,y_pred)

# Training Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf =RandomForestClassifier(n_jobs=-1,verbose=1) 
print ('Training the model')
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)
evaluate(y_test,y_pred)

# Training MLP Classifier

In [None]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='adam', 
              max_iter =1000, 
              alpha=10, 
              hidden_layer_sizes=10, 
              random_state=5,
              activation='identity',
              batch_size=360, 
              learning_rate='adaptive',  # only used when solver='sgd'; ignored by 'adam'
              verbose=0,
              early_stopping=False, 
              n_iter_no_change=100)

print ('Training the model')
clf.fit(X_train,y_train)
print(clf.score(X_train,y_train))
y_pred=clf.predict(X_test)
evaluate(y_test,y_pred)

## Observations:
- All the models reach about 0.8 accuracy, with little difference between them.
- However, the best results are obtained by the KNN classifier.
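When the scores are this close, a single train/test split may not be enough to rank the models. A hedged way to check is k-fold cross-validation, sketched below on synthetic data from `make_classification` (not the Titanic features); the model settings and the names `X_demo`, `y_demo` and `cv_means` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 features used in this notebook
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    # KNN is distance based, so scaling happens inside the pipeline (per fold)
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds for each model
cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
print(cv_means)
```

With the real data, `X_demo` and `y_demo` would be replaced by the unscaled `X` and `y` prepared above.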

# Preparing the Test Data

In [None]:
df = serving_df[['Age','SibSp', 'Parch', 'Fare','PassengerId', 'Sex','Pclass']]
print('Before Processing: \n', df.isnull().sum())
df=pd.get_dummies(df, columns=['Sex'])
df = impute_age(df)
df['Fare']=df['Fare'].fillna(df['Fare'].median())
df['Clusters']=np.nan
X=df[['SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass']]
df['Clusters']=(cluster(X)+1)/5  
print('After Processing: \n', df.isnull().sum())


# Preparing the Results for Submission

In [None]:
X=df[['Age', 'SibSp', 'Parch','Fare','Sex_female', 'Sex_male','Pclass','Clusters']]
result=pd.DataFrame(columns=['PassengerId', 'Survived'])
result['PassengerId']=serving_df['PassengerId']
result['Survived']=rf.predict(scale(X))
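Note that the `cluster` and `scale` helpers fit a new K-Means model and a new scaler every time they are called, so the test rows get cluster labels and scaling that may not match what the classifier saw during training. A possible alternative, sketched here on random stand-in arrays, is to fit both once on the training features and reuse them for the test features. The names `train_X`, `test_X` and `add_cluster_feature` are hypothetical, and `n_init=10` is chosen only so the sketch also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_X = rng.normal(size=(60, 3))  # stand-in for the training features
test_X = rng.normal(size=(20, 3))   # stand-in for the test features

# Fit the clustering once, on the training features only
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10).fit(train_X)

def add_cluster_feature(features, labels):
    """Append the cluster label, rescaled to (0, 1], as an extra column."""
    return np.column_stack([features, (labels + 1) / 5])

train_full = add_cluster_feature(train_X, kmeans.labels_)
# predict() assigns each test row to the nearest centroid learned from the training rows
test_full = add_cluster_feature(test_X, kmeans.predict(test_X))

# Fit the scaler on the training rows and apply the same transform to the test rows
scaler = StandardScaler().fit(train_full)
train_scaled = scaler.transform(train_full)
test_scaled = scaler.transform(test_full)
```

The classifier would then be trained on `train_scaled` and used to predict on `test_scaled`.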

In [None]:
result=result.set_index('PassengerId')
result

In [None]:
result.to_csv('submission.csv')