# **The Titanic Survivor**

**Done by:** *Saja Abdalaal*

## **Background Information**

On Sunday, April 14, 1912, the passenger ocean liner RMS Titanic, the largest such ship 
at that time, struck an iceberg in the North Atlantic and sank in less than 3 hours. More 
than 1500 of her 2224 passengers and crew perished. The dataset about the survivors 
is perhaps one of the most cited and studied in data analytics courses and is 
used to illustrate machine learning algorithms, cluster analysis, and basic statistical 
and visualization methods using R and Python.

# **Data**

In [None]:
# Import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
url = '/kaggle/input/test-file/tested.csv'

df = pd.read_csv(url)

df.head()

In [None]:
df.shape

**PassengerId**: Passenger ID<br>
**Survived**: Survival (0 = No; 1 = Yes)<br>
**Pclass**: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)<br>
**Name**: Name<br>
**Sex**: Sex<br>
**Age**: Age<br>
**SibSp**: Number of Siblings/Spouses Aboard<br>
**Parch**: Number of Parents/Children Aboard<br>
**Ticket**: Ticket Number<br>
**Fare**: Passenger Fare<br>
**Cabin**: Cabin<br>
**Embarked**: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)<br>

In [None]:
df.info()

In [None]:
df['Cabin'].value_counts()

# **Data preprocessing**

## Null values

In [None]:
# checking nulls
df.isnull().sum()

There are nulls in the **Age**, **Fare** and **Cabin** attributes.

------------------

**Filling Fare**

Fill the nulls with the mean fare of the passenger's Pclass:

In [None]:
df[df.Fare.isnull()]

In [None]:
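# Mean fare of 3rd-class passengers
fare_mean_c3 = df.Fare[df.Pclass == 3].mean()
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Fare'] = df['Fare'].fillna(fare_mean_c3)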
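
The cell above hard-codes class 3. As a minimal sketch, assuming missing fares could occur in any class, the same fill can be done per class with a group-wise mean. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare":   [80.0, np.nan, 8.0, 10.0, np.nan],
})

# Mean fare of each row's own class, aligned with the original index
class_mean = toy.groupby("Pclass")["Fare"].transform("mean")

# Only the missing fares are replaced; existing values stay untouched
toy["Fare"] = toy["Fare"].fillna(class_mean)
print(toy)
```

Each missing fare takes the mean of its own class, so no class has to be named explicitly.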

-----------------------------
**Filling Age**

To fill the missing ages, we can look at each passenger's title (Miss, Mr, Mrs, Master, Dr), take the average age for each title, and then fill a missing age according to the passenger's title. Master is indeed one of the titles used on the Titanic: it was used for boys and young men, mostly by English people.

In [None]:
# Match titles literally (regex=False): as a regex, '.' is a wildcard and 'Mr.' would also match 'Mrs'
mean_age_miss = df[df["Name"].str.contains('Miss.', regex=False, na=False)]['Age'].mean().round()
mean_age_mrs = df[df["Name"].str.contains('Mrs.', regex=False, na=False)]['Age'].mean().round()
mean_age_mr = df[df["Name"].str.contains('Mr.', regex=False, na=False)]['Age'].mean().round()
mean_age_master = df[df["Name"].str.contains('Master.', regex=False, na=False)]['Age'].mean().round()

print('Mean age of Miss. title {}'.format(mean_age_miss))
print('Mean age of Mrs. title {}'.format(mean_age_mrs))
print('Mean age of Mr. title {}'.format(mean_age_mr))
print('Mean age of Master. title {}'.format(mean_age_master))

def fill_age(name_age):
    # Access the row by label; positional access on a labelled Series is deprecated
    name = name_age['Name']
    age = name_age['Age']

    if pd.isnull(age):
        if 'Mr.' in name:
            return mean_age_mr
        if 'Mrs.' in name:
            return mean_age_mrs
        if 'Miss.' in name:
            return mean_age_miss
        if 'Master.' in name:
            return mean_age_master
        if 'Dr.' in name:
            return mean_age_master
        if 'Ms.' in name:
            return mean_age_miss
    # Known ages are returned as-is; an unmatched title leaves the age missing
    return age

df['Age'] = df[['Name', 'Age']].apply(fill_age, axis=1)

fig, (ax1) = plt.subplots(1, 1, figsize=(10,5))
sns.heatmap(df.isnull(),cmap='copper', ax=ax1)
plt.tight_layout()
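
The title-based fill above can also be sketched without a chain of `if` tests, by pulling the title out of the name once and mapping it to a mean age. This is only a sketch under assumptions: the `toy` rows are made up, and the regular expression assumes names follow the `Surname, Title. Given names` pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Smith, Mrs. Anna", "Brown, Mr. Paul",
             "Brown, Miss. Eva", "Clark, Mr. Tom", "Clark, Miss. Amy"],
    "Age": [30.0, 28.0, 40.0, 20.0, np.nan, np.nan],
})

# Capture the text between the comma and the first period, e.g. "Mr"
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Rounded mean age per title, computed from the known ages only
title_means = toy.groupby("Title")["Age"].mean().round()

# Look up each row's title mean and use it where the age is missing
toy["Age"] = toy["Age"].fillna(toy["Title"].map(title_means))
print(toy)
```

Because each name yields exactly one title, "Mr" and "Mrs" cannot be confused, and a title with no known ages simply leaves the age missing.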

-----------------

The **Cabin** attribute has a lot of missing values, so we drop it.

In [None]:
df.drop(['Cabin'], axis=1, inplace=True)

## Drop irrelevant columns

In [None]:
df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)

## Categorical values

In [None]:
df.nunique()

In [None]:
df_cat = df[['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'Survived']]
df_cat.head()

In [None]:
categories = {"female": 1, "male": 0}
df['Sex']= df['Sex'].map(categories)

prepared_df = pd.concat([df, pd.get_dummies(df['Embarked'],drop_first=True)], axis=1) 

prepared_df.drop(['Embarked'], axis=1, inplace=True)

In [None]:
prepared_df.head()

In [None]:
prepared_df.info()

## Correlation

In [None]:
corr = prepared_df.corr()
plt.figure(figsize=(15,10))
sns.heatmap(corr, cbar=True, fmt='.2f', annot=True, annot_kws={'size':15}, cmap='Greens')

**Note that 'Sex' has a 100% correlation with the target attribute, so we have to drop it; otherwise our ML model would overfit to this single column later.**

In [None]:
prepared_df.drop(['Sex'], axis=1, inplace=True)
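
A correlation of exactly 1 means the feature alone reproduces the target. Below is a minimal sketch of how one might confirm this directly instead of reading it off the heatmap. The `toy` frame is made up, and the 0.999 threshold is an arbitrary choice to absorb floating-point error.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Sex":      [1, 0, 1, 0, 1],
    "Age":      [22, 35, 41, 19, 30],
    "Survived": [1, 0, 1, 0, 1],
})

# Pearson correlation of every feature with the target
target_corr = toy.corr()["Survived"].drop("Survived")

# Features whose absolute correlation is numerically 1 give the label away
leaky = target_corr[target_corr.abs() > 0.999].index.tolist()

# Row-by-row check: does the feature equal the target everywhere?
identical = bool((toy["Sex"] == toy["Survived"]).all())
print(leaky, identical)
```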

## Data distribution

In [None]:
prepared_df.hist(bins=50, figsize=(10,7))
plt.show()

In [None]:
prepared_df_num = prepared_df[['Age', 'Fare']]

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(5,3))
ax = ax.flatten()
# One box plot per numeric column
for i, col in enumerate(prepared_df_num.columns):
    sns.boxplot(data=prepared_df_num, y=col, ax=ax[i])
plt.tight_layout()

### Skewness
A normal distribution has a skewness of 0. As a rule of thumb, a skewness between -1 and 1 is considered acceptable; values well outside this range indicate a strongly skewed distribution and may point to the presence of outliers.

In [None]:
print('skewness value of Age: ',prepared_df['Age'].skew())
print('skewness value of Fare: ',prepared_df['Fare'].skew())

From the output above, the ‘Fare’ skewness value of 3.69 shows that the variable is strongly right-skewed, which suggests the presence of outliers.
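
To make the skewness figure less of a black box, here is a minimal sketch of how it can be computed by hand. It assumes `Series.skew()` returns the bias-corrected sample skewness (the adjusted Fisher-Pearson coefficient), which is my understanding of pandas' behaviour. The helper name `sample_skewness` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def sample_skewness(values):
    """Bias-corrected sample skewness (adjusted Fisher-Pearson coefficient)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)   # second central moment
    m3 = np.mean(dev ** 3)   # third central moment
    g1 = m3 / m2 ** 1.5      # uncorrected skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1

symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 50])   # one large value stretches the right tail

print(symmetric.skew(), sample_skewness(symmetric))
print(right_skewed.skew(), sample_skewness(right_skewed))
```

The symmetric series has a skewness of 0, while the single large value in the second series pushes its skewness well above 1.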

----------------------------

**Flooring And Capping** <br>
In this quantile-based technique, values below a lower bound are raised to that bound (flooring) and values above an upper bound are lowered to that bound (capping). The bounds are derived from the 25th percentile (Q1) and the 75th percentile (Q3) of the ‘Fare’ variable: they are the box-plot whiskers, Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, where IQR = Q3 - Q1.

The code below does not drop any rows; it replaces every ‘Fare’ value outside the whiskers with the nearest whisker value.

In [None]:
Q1 = prepared_df['Fare'].quantile(0.25)
Q3 = prepared_df['Fare'].quantile(0.75)
IQR = Q3 - Q1
whisker_width = 1.5
lower_whisker = Q1 -(whisker_width*IQR)
upper_whisker = Q3 + (whisker_width*IQR)
# Floor values below the lower whisker and cap values above the upper whisker
prepared_df['Fare'] = prepared_df['Fare'].clip(lower=lower_whisker, upper=upper_whisker)
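
The effect of flooring and capping is easiest to see on a handful of numbers. The sketch below wraps the same whisker logic in a small helper; the name `cap_by_iqr` and the sample values are hypothetical and not part of the notebook.

```python
import pandas as pd

def cap_by_iqr(series, whisker_width=1.5):
    """Clip values to [Q1 - w*IQR, Q3 + w*IQR]; no rows are removed."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - whisker_width * iqr,
                       upper=q3 + whisker_width * iqr)

fares = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])   # made-up values
capped = cap_by_iqr(fares)
print(capped.tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0, 8.5]
```

Here Q1 = 2.25 and Q3 = 4.75, so the upper whisker is 4.75 + 1.5 × 2.5 = 8.5. The value 100 is pulled down to 8.5 and the series keeps all six entries.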

# **Feature engineering**


In [None]:
fig, axx = plt.subplots(1, 3, figsize=(20,5))
axx[0].set_title('Number of Siblings/Spouses')
sns.countplot(x='SibSp', data=prepared_df, ax=axx[0])
axx[1].set_title('Number of Parents/Children')
sns.countplot(x='Parch', data=prepared_df, ax=axx[1])
axx[2].set_title('Distribution of Classes')
sns.countplot(x='Pclass', data=prepared_df, ax=axx[2])
plt.tight_layout()

We can see that most people travelled alone and most belonged to 3rd class (the lowest). This matches what we saw earlier with the cabins and fares: most people without an assigned cabin paid a small fare, so it makes sense that they belong to class 3. Based on the SibSp (Siblings/Spouses) and Parch (Parents/Children) attributes, we can create a new feature that specifies whether a person was travelling alone or with family, as well as a feature for the size of the family. These attributes could be of interest.

----------------------------------

In [None]:
# 1 if the passenger has no siblings/spouses and no parents/children aboard, else 0
prepared_df['Alone'] = ((prepared_df['SibSp'] + prepared_df['Parch']) == 0).astype(int)
# Family size, counting the passenger themselves
prepared_df['Familiars'] = 1 + prepared_df['SibSp'] + prepared_df['Parch']


In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(prepared_df.corr(), annot=True)
plt.tight_layout()

# **Split Data**

In [None]:
X = prepared_df.drop(['Survived'], axis=1)
y = prepared_df['Survived']

## Normalize data

In [None]:
from sklearn.preprocessing import MinMaxScaler
mns = MinMaxScaler()
X = pd.DataFrame(mns.fit_transform(X), columns=X.columns)


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=42)
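
In the cells above, the scaler is fitted on all of `X` before the split, so the test rows influence the minimum and maximum used for scaling. A common alternative, sketched here on made-up arrays, is to fit the scaler on the training part only and reuse it for the test part. The `*_toy` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data, for illustration only
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
# Learn min and max from the training rows only
X_train_scaled = scaler.fit_transform(X_train_toy)
# Apply the same min and max to the test rows without refitting
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

With this approach a test value can fall outside the 0 to 1 range (here 20 maps to 2.0), which is expected when it lies outside the range seen during training.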

# **Model training**

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=2)
clf.fit(X_train, y_train)


In [None]:
print('Training score: ', round(clf.score(X_train, y_train),3))
print('Testing score: ', round(clf.score(X_test, y_test),3))

In [None]:
feature_imp = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)


plt.figure(figsize=(10,6))
sns.barplot(x=feature_imp, y=feature_imp.index)

# Add labels to graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.tight_layout()

In [None]:
# Retrain the same classifier on a reduced feature set
X_2 = X.drop(['SibSp', 'Alone', 'S', 'Q'], axis=1)

X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y, test_size=0.30, random_state=42)

clf.fit(X_train_2, y_train_2)

print('Training score: ', round(clf.score(X_train_2, y_train_2),3))
print('Testing score: ', round(clf.score(X_test_2, y_test_2),3))

## SVC

In [None]:
from sklearn.svm import SVC
clf2 = SVC(gamma='auto')
clf2.fit(X_train, y_train)

print('Training score: ', round(clf2.score(X_train, y_train),3))
print('Testing score: ', round(clf2.score(X_test, y_test),3))

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression().fit(X_train, y_train)

print('Training score: ', round(lr.score(X_train, y_train),3))
print('Testing score: ', round(lr.score(X_test, y_test),3))

## GaussianNB

In [None]:
from sklearn.naive_bayes import GaussianNB
G = GaussianNB().fit(X_train, y_train)

print('Training score: ', round(G.score(X_train, y_train),3))
print('Testing score: ', round(G.score(X_test, y_test),3))
