


**Synopsis :** 

Titanic, in full Royal Mail Ship (RMS) Titanic, British luxury passenger liner that sank on April 14–15, 1912, during its maiden voyage, en route to New York City from Southampton, England, killing about 1,500 passengers and ship personnel. One of the most famous tragedies in modern history, it inspired numerous stories, several films, and a musical and has been the subject of much scholarship and scientific speculation.

**Data :**

There are two datasets: `train.csv` and `test.csv`.
The `train.csv` dataset contains the details of a subset of the passengers on board (891 to be exact) and, importantly, reveals whether each of them survived, also known as the “ground truth”.
The `test.csv` dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.

**Goal :**

Given a training set of passengers labeled as having survived or not survived the Titanic disaster, can our model predict whether each passenger in the test dataset, which does not contain the survival information, survived?


### Required Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

## Dataset

In [None]:
train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")

### Let's talk about the train dataset!

In [None]:
train.sample(5)

In [None]:
train.shape

In [None]:
train.info()

In [None]:
# Drop useless columns
train = train.drop(['Cabin','Ticket','Name','PassengerId'],axis=1)

**"NaN" values in the training dataset**

In [None]:
# Number of NaN values per column
train.isnull().sum()

In [None]:
## Dealing with missing values ##
freq = train.Embarked.dropna().mode()
print(freq,'\n')
train['Embarked'] = train['Embarked'].fillna(freq[0]) # fill "NAN" values with the most frequent value

mean = train['Age'].dropna().mean()
train['Age'] = train['Age'].fillna(round(mean))
print(round(mean))
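
The imputation step above can be tried in isolation. The following is a minimal sketch on a made-up four-row frame (the `toy` data and the names `most_common_port` and `mean_age` are assumptions for illustration, not part of the notebook), filling `Embarked` with the most frequent value and `Age` with the rounded mean:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# mode() ignores missing values and returns a Series, so take its first entry
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

# Rounded mean of the known ages; mean() skips NaN by default
mean_age = round(toy["Age"].mean())
toy["Age"] = toy["Age"].fillna(mean_age)

print(toy)
```

Assigning the result of `fillna` back to the column, as the notebook does, avoids relying on in-place modification of a column selection.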

**Converting categorical feature to numeric**

In [None]:
# Encode the categorical columns as integers.
# Assigning the mapped column back avoids in-place replace on a column selection,
# which newer pandas versions may not propagate to the DataFrame.
train['Sex'] = train['Sex'].map({'female': 0, 'male': 1})
train['Embarked'] = train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
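
Integer codes such as 0, 1, 2 for `Embarked` imply an ordering between ports that some models will pick up on. One common alternative, shown here only as a sketch and not as the approach this notebook uses, is one-hot encoding with `pd.get_dummies`. The `toy` frame below is made up for illustration:

```python
import pandas as pd

# Toy frame standing in for the Sex and Embarked columns (made-up values)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "Q", "C"],
})

# Binary column: a single 0/1 code is enough
toy["Sex"] = toy["Sex"].map({"female": 0, "male": 1})

# Three-valued column: one indicator column per port instead of codes 0/1/2
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(encoded)
```

If this encoding were used, the same transformation would have to be applied to the test data so that both frames end up with identical columns.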

In [None]:
print(train.isnull().sum() , train.shape ,train.head(), train.describe().T ,sep = ' \n ***********   *************  *********** \n ' )

### Data Analysis & Visualization

In [None]:
sns.set(rc={'figure.figsize':(13,13)})
ax = sns.heatmap(train.corr(), annot=True)

In [None]:
cols = ['Pclass','Sex','SibSp' ,'Parch','Embarked']
for col in cols :
    print(train[[col, 'Survived']].groupby([col],as_index=False).mean().sort_values(by='Survived', ascending=False),end=' \n ******** ******* ********* \n ')

In [None]:
fig, axes = plt.subplots(5, 1, figsize=(6, 12))
axes = axes.flatten()

# One count plot per categorical column, split by survival
for ax, col in zip(axes, cols):
    sns.countplot(x=col, data=train, ax=ax, hue='Survived', palette="OrRd")
    ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

In [None]:
_ = sns.FacetGrid(train, col='Survived')
_.map(plt.hist, 'Age', bins=15)

In [None]:
_ = sns.FacetGrid(train, col='Pclass')
_.map(plt.hist, 'Age', bins=15)

In [None]:
_ = sns.FacetGrid(train, col='Sex')
_.map(plt.hist, 'Age', bins=15)

In [None]:
fig,ax=plt.subplots(1,3,figsize=(20,8))
sns.histplot(train[train['Pclass']==1].Fare,ax=ax[0],kde=True, stat="density", linewidth=0)
ax[0].set_title('Fares in Pclass 1')
sns.histplot(train[train['Pclass']==2].Fare,ax=ax[1],kde=True, stat="density", linewidth=0)
ax[1].set_title('Fares in Pclass 2')
sns.histplot(train[train['Pclass']==3].Fare,ax=ax[2],kde=True, stat="density", linewidth=0)
ax[2].set_title('Fares in Pclass 3')
plt.show()

### Let's talk about the test dataset!

In [None]:
test.sample(5)

In [None]:
# info() prints its report directly and returns None, so call it on its own
test.info()
print(test.shape, test.isnull().sum(), sep=' \n ***********  *************  ************ \n')

In [None]:
# Drop useless columns
test = test.drop(['Cabin','Ticket','Name','PassengerId'],axis=1)


## Dealing with missing values ##
freq = test.Fare.dropna().mode()
print(freq,'\n')
test['Fare'] = test['Fare'].fillna(freq[0]) # fill "NAN" values with the most frequent value

mean = test['Age'].dropna().mean()
test['Age'] = test['Age'].fillna(round(mean))
print(round(mean))

In [None]:

# Apply the same integer encoding used for the training data
test['Sex'] = test['Sex'].map({'female': 0, 'male': 1})
test['Embarked'] = test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

In [None]:
test.sample(5)

## Machine learning model

### Training and Predictions

In [None]:
x_test = test
x_train = train.drop("Survived", axis=1)
y_train = train["Survived"]
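
The models below expect the training and test features to have the same columns in the same order. A small sanity check along these lines can catch a preprocessing mismatch early. This is a sketch only: `check_feature_alignment` is a hypothetical helper, and the toy frames are made up for illustration.

```python
import pandas as pd

def check_feature_alignment(train_features: pd.DataFrame, test_features: pd.DataFrame) -> None:
    """Raise ValueError if the two frames differ in column names or column order."""
    train_cols = list(train_features.columns)
    test_cols = list(test_features.columns)
    if train_cols != test_cols:
        raise ValueError(f"Column mismatch: train={train_cols}, test={test_cols}")

# Toy frames standing in for the preprocessed train and test data
toy_train = pd.DataFrame({"Pclass": [1, 3], "Sex": [0, 1], "Survived": [1, 0]})
toy_test = pd.DataFrame({"Pclass": [2], "Sex": [1]})

# Passes silently once the target column is removed from the training features
check_feature_alignment(toy_train.drop("Survived", axis=1), toy_test)
```

In the notebook, the equivalent call would be `check_feature_alignment(x_train, x_test)`.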

### 1. Logistic Regression

In [None]:
model1 = LogisticRegression(solver='liblinear')
model1.fit(x_train,y_train)
prediction = model1.predict(x_test)
prediction[:10]

### 2. K nearest neighbor (KNN)

In [None]:
model2 = KNeighborsClassifier(n_neighbors=3)
model2.fit(x_train, y_train)
prediction = model2.predict(x_test)
prediction[:10]

### 3. Naive Bayes

In [None]:
model3 = GaussianNB()
model3.fit(x_train,y_train)
prediction = model3.predict(x_test)
prediction[:10]

### 4. Support Vector Machine (SVM)

In [None]:
model4 = SVC(kernel='linear')
model4.fit(x_train, y_train)
prediction = model4.predict(x_test)
prediction[:10]

### Evaluating

In [None]:
# Accuracy (%) on the training data itself; this tends to overestimate performance on unseen passengers
score1 = round(model1.score(x_train, y_train) * 100, 2)
score2 = round(model2.score(x_train, y_train) * 100, 2)
score3 = round(model3.score(x_train, y_train) * 100, 2)
score4 = round(model4.score(x_train, y_train) * 100, 2)
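
Because the scores above are computed on the same rows the models were fitted on, a held-out split usually gives a more realistic estimate. The following is a minimal sketch, assuming synthetic data from `make_classification` in place of the Titanic features. The split size, the random seeds, and the variable names are illustrative choices, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (x_train, y_train); 7 features like the preprocessed notebook data
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# Hold out 20% of the labeled rows for validation, keeping the class balance
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = LogisticRegression(solver="liblinear").fit(X_fit, y_fit)
val_pred = clf.predict(X_val)

# Accuracy and confusion matrix on rows the model has not seen
val_acc = accuracy_score(y_val, val_pred)
cm = confusion_matrix(y_val, val_pred)
print(round(val_acc * 100, 2))
print(cm)
```

The same pattern would apply to the other three classifiers by swapping the estimator.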

In [None]:
# Use a name other than `dict` so the built-in type is not shadowed
scores = {'Model': ['Logistic Regression', 'K nearest neighbor', 'Naive Bayes', 'Support Vector Machine'],
          'Score': [score1, score2, score3, score4]}
models_score = pd.DataFrame(scores)

In [None]:
models_score

In [None]:
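# Kaggle expects PassengerId next to Survived; `prediction` comes from the last model fitted (SVM)
passenger_ids = pd.read_csv("../input/titanic/test.csv")['PassengerId']
submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': prediction})
submission.to_csv('my_submission.csv', index=False)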