### Introduction:
In this notebook we explore the key determinants of survival on the Titanic through the following steps:
1. Load and check the data.
2. Feature engineering.
3. Exploratory data analysis.
4. Model training, validation, and prediction.

### 1. Load and Check

In [None]:
# Import the required libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import random as rnd

In [None]:
# Load The Data
df_train = pd.read_csv("../input/titanic/train.csv")
df_test = pd.read_csv("../input/titanic/test.csv")

As you can see, the provided data is split into two parts: training and test data. We keep both frames in a list so that every feature engineering step is applied to each of them in the same loop. Once the exploratory data analysis is complete and model training starts, we use them separately again.
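
If the two frames were literally concatenated instead, the round trip could look like the sketch below. It is only an illustration: the toy frames and their values are invented, and `train_part` / `test_part` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for df_train / df_test (values are invented for illustration)
toy_train = pd.DataFrame({"PassengerId": [1, 2], "SibSp": [1, 0], "Parch": [0, 2], "Survived": [0, 1]})
toy_test = pd.DataFrame({"PassengerId": [3], "SibSp": [0], "Parch": [0]})

# Stack both frames; the outer index level records which part each row came from
combined = pd.concat([toy_train, toy_test], keys=["train", "test"])

# A feature engineered once now exists in both parts
combined["num_family"] = 1 + combined["SibSp"] + combined["Parch"]

# Split again by the outer index level; the test part has no real target column
train_part = combined.xs("train")
test_part = combined.xs("test").drop(columns="Survived")
```

One side effect to keep in mind: because the test rows have no `Survived` value, that column holds NaN for them while the frames are combined.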

In [None]:
df_train.tail()

In [None]:
df_test.head()

In [None]:
dfs = [df_train, df_test] # to apply any change to both data frames

In [None]:
df_train.info()

In [None]:
df_test.info()

The general information about the dataframes points to several problems:
1. PassengerId is stored as an integer although it has no numeric meaning, so it should be converted to an object.
2. Categorical variables such as Sex and Embarked are stored as strings and must be encoded numerically. We will do that later, before training, using pd.get_dummies().
3. Age, Fare, Cabin, and Embarked have missing values.
4. Name needs to be split into Title and Name.
5. SibSp and Parch should be combined into one variable (number of family members).

### 2. Feature Engineering

In [None]:
# Converting PassengerId to an object
for df in dfs:
    df["PassengerId"] = df["PassengerId"].astype("object")

In [None]:
# Count and percentage of missing values per column (training set)
count_missing_train = df_train.isnull().sum()
percent_missing_train = round(df_train.isnull().sum()/len(df_train) * 100, 1)
missing_train = pd.concat([count_missing_train, percent_missing_train], axis = 1)
missing_train.columns = ["Missing (count)", "Missing (%)"]
missing_train

In [None]:
# Count and percentage of missing values per column (test set)
count_missing_test = df_test.isnull().sum()
percent_missing_test = round(df_test.isnull().sum()/len(df_test) * 100, 1)
missing_test = pd.concat([count_missing_test, percent_missing_test], axis = 1)
missing_test.columns = ["Missing (count)", "Missing (%)"]
missing_test

### How to deal with Missing Data?
There are many strategies to fill missing data:
1. Fill with the mean (better suited to continuous variables without outliers)
2. Fill with the median (better suited to continuous variables with outliers)
3. Fill with the mode (better suited to categorical variables)
4. Drop the entire variable if the number of missing points is too large

Based on the above strategies:
1. Embarked: has just 2 missing values, so we will fill them with the mode (the most frequent value).
2. Cabin: 77.1 percent of its values are missing, so we will drop it entirely.
3. Age: about 20 percent of the values are missing. We will fill them with the mean age of the passengers who share the same Sex and Pclass.
4. Fare (test data only): fill the missing values with the median.
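
As a quick illustration of the four strategies, here is a minimal sketch on an invented toy frame. The column names mirror the Titanic data, but the values are made up and the frame name `toy` is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],      # continuous, no extreme values
    "Fare": [7.0, 8.0, np.nan, 500.0],      # continuous, with an outlier
    "Embarked": ["S", "S", None, "C"],      # categorical
    "Cabin": [None, None, None, "C85"],     # mostly missing
})

toy["Age"] = toy["Age"].fillna(toy["Age"].mean())                     # strategy 1: mean
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())                # strategy 2: median
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])   # strategy 3: mode
toy = toy.drop(columns="Cabin")                                       # strategy 4: drop
```

The Fare column shows why the median is preferred when outliers are present: the single value of 500 would pull a mean-based fill far above what most passengers paid.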

In [None]:
df_train.columns

In [None]:
# Fill Embarked and Fare Variables 
df_train["Embarked"] = df_train["Embarked"].fillna(df_train["Embarked"].mode()[0])
df_test["Fare"] = df_test["Fare"].fillna(df_test["Fare"].median())

In [None]:
#  Drop Cabin
df_train = df_train.drop("Cabin", axis = 1)
df_test = df_test.drop("Cabin", axis = 1)

In [None]:
# Fill Age: iterate over Sex (0 or 1) and Pclass (1, 2, 3) and use the mean Age
# of each of the six combinations as the guess for its missing values.
guess_ages = np.zeros((2,3))
dfs = [df_train, df_test]  # rebuild the list: dropping Cabin created new frames
for df in dfs:
    # Not idempotent: a second run maps the already-encoded 0/1 values to NaN
    df["Sex"] = df["Sex"].map({"male":1, "female":0})
    for i in range(0, 2):
        for j in range(0,3):
            guess_df = df[(df["Sex"] == i)&(df["Pclass"] == j+1)]["Age"].dropna()
            
            age_guess = guess_df.mean()
            guess_ages[i,j] = int(age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0,3):
            df.loc[ (df.Age.isnull()) & (df.Sex == i) & (df.Pclass == j+1),'Age'] = guess_ages[i,j]            
    
    df.Age = df.Age.astype(int)
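
The same group-wise idea can also be written without explicit loops. The sketch below is an alternative on invented toy data, not the cell above: it fills each missing Age with the mean of its (Sex, Pclass) group through `groupby(...).transform("mean")`, and it skips the rounding to the nearest 0.5.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex": [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# transform("mean") returns one value per row: the mean Age of that row's group
# (missing values are ignored when the mean is computed)
group_mean = toy.groupby(["Sex", "Pclass"])["Age"].transform("mean")

# Only the missing entries are replaced; known ages stay untouched
toy["Age"] = toy["Age"].fillna(group_mean)
```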

In [None]:
df_train.head()

In [None]:
df_train.info()

### Done! Now it is all nice and clean
#### More feature engineering is still required. We will create new variables as follows:
1. Number of family members on board = 1 + SibSp + Parch
2. Separate the Title from the Name
3. Regroup the rare Titles
4. Separate purely numeric Tickets from text-numeric Tickets: this may say something about the income, and thus the social class, of the passenger.
5. Regroup the Ticket labels
6. Flag passengers who travel alone (family size of 1)
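
To make the Title and Ticket steps concrete, here is a small sketch on a few sample strings. It uses pandas string methods (`str.extract`, `str.isnumeric`) as one possible approach; the variable names `titles` and `is_numeric` are assumptions for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

tickets = pd.Series(["A/5 21171", "113803"])

# 1 when the ticket consists of digits only, 0 when it carries a text prefix
is_numeric = tickets.str.isnumeric().astype(int)
```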

In [None]:
for df in dfs:
    df["num_family"] = 1 + df.SibSp + df.Parch
    df["Title"] = df["Name"].apply(lambda x: x.split(",")[1].split(".")[0].strip())
    df["Title"] = df["Title"].replace(["Mlle", "Major", "Col", "Jonkheer", "Ms", "Lady", "the Countess", "Mme", "Sir", "Capt", "Don"], "Other")
    df['numeric_ticket'] = df.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
    df["Ticket_Text"] = df["Ticket"].apply(lambda x: x.split(" ")[0].replace("/", "").replace(".", "").lower() if len(x.split(" ")) > 1 else 0)
    df["Ticket_Text"] = df["Ticket_Text"].replace(["swpp", "sotono2", "ppp", "fa", "casoton", "sop", "sp", "as", "sca4", "scow", "fc", "sc"], "other")
    df["Is_Alone"] = df["num_family"].apply(lambda x: 1 if x < 2 else 0)
    

In [None]:
df_train.head()

To better understand the relationship between survival and continuous features such as Fare and Age, we convert the latter into categorical features.
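
A point worth keeping in mind with quantile binning: `pd.qcut` derives its bin edges from the data it is given, so calling it separately on the training and test sets can attach the same label to different value ranges. The sketch below shows one way to avoid that, assuming invented fare values and hypothetical labels: compute the edges on the training values with `retbins=True`, then reuse them on the test values with `pd.cut`.

```python
import pandas as pd

train_fare = pd.Series([5.0, 7.0, 8.0, 13.0, 26.0, 80.0])
test_fare = pd.Series([6.0, 30.0])
labels = ["low", "mid", "high"]

# q=3 splits the training values into three equally populated bins
train_bins, edges = pd.qcut(train_fare, q=3, labels=labels, retbins=True)

# Reuse the training edges so that a label covers the same range in both sets
test_bins = pd.cut(test_fare, bins=edges, labels=labels, include_lowest=True)
```

Test values that fall outside the training range become NaN with this approach, so the outer edges may need to be widened first.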

In [None]:
for df in dfs:
    df['Age_bins'] = pd.qcut(df['Age'], labels = ["<19", "19-23", "24-25", "26-31", "32-40", "41-80"], q = 6)
    

In [None]:
dfs = [df_train, df_test]

In [None]:
for df in dfs:
    df["Fare_bins"] = pd.qcut(df["Fare"], labels = ["<7", "7-8.5", "8.6-13", "14-25", "26-51", "52-512"], q = 6)

### 3. Exploratory Data Analysis
In this section we explore who had the highest probability of survival. Along the way, we select the most relevant variables to feed the model.
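
Most of the tables in this section boil down to a survival rate per group. A minimal sketch of that pattern on invented data follows. It also reports the group size, since a rate computed from a handful of passengers deserves less trust; the names `toy` and `summary` are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of survivors; count gives the group size
summary = (
    toy.groupby("Pclass")["Survived"]
       .agg(rate="mean", n="count")
       .sort_values("rate", ascending=False)
)
summary["rate"] = (summary["rate"] * 100).round(1)
```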

In [None]:
df_train.describe()

In [None]:
Fare_Sur = pd.pivot_table(data = df_train, index = "Fare_bins", values = "Survived").sort_values(by = "Survived", ascending = False) * 100
round(Fare_Sur, 1)

In [None]:
Age_Sur = pd.pivot_table(data = df_train, index = "Age_bins", values = "Survived").sort_values(by = "Survived", ascending = False) * 100
round(Age_Sur, 1)

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.barplot(data = df_train, x = "Fare_bins", y = "Survived", ci = None)
plt.title("Distribution of survivors by Fare")
plt.show()

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.barplot(data = df_train, x = "Pclass", y = "Survived", ci = None)
plt.title("Distribution of survivors by Pclass")
plt.show()

In [None]:
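# numeric_only=True skips text columns such as Name and Ticket (required by pandas >= 2.0)
df_train.groupby("Fare_bins").mean(numeric_only = True)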

**Conclusion 1:** The wealthier the passenger (higher fare, higher ticket class), the more likely they were to survive.

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.barplot(data = df_train, x = "Age_bins", y = "Survived", ci = None)
plt.title("Distribution of survivors by Age")
plt.show()

**Age bins (excluding 24 - 25) have similar survival rates. We need to investigate this age band specifically.**

In [None]:
df_train.groupby("Age_bins").mean(numeric_only = True)

**Conclusion 2: from the previous tables we can clearly see why those aged 24 - 25 were the least likely to survive**
1. They were almost entirely male (93% of this age band were males)
2. They were the most disadvantaged (they paid the lowest Fare)
3. They had many siblings and other family members on board.

**Here are the possibilities:**
1. It seems that they sacrificed themselves to rescue others.
2. Being disadvantaged and poor, they were the least likely to have had access to survival equipment such as life jackets.
3. Analyzing the second table suggests that being rich is the most important determinant of survival.

**From the previous tables, we noticed that those with numeric tickets were most likely to be young, disadvantaged males with low odds of survival. Let's investigate further.**

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.barplot(data = df_train, x = "numeric_ticket", y = "Survived", ci = None)
plt.title("Distribution of survivors by numeric_ticket")
plt.show()

In [None]:
df_train.groupby("numeric_ticket").mean(numeric_only = True)

In [None]:
Tic_Sur = round(pd.pivot_table(data = df_train, index = "Ticket_Text", values = "Survived").sort_values(by = "Survived", ascending = False) * 100, 0)
Fare_Sur = round(pd.pivot_table(data = df_train, index = "Ticket_Text", values = "Fare"), 0)
Tic_Sur_count = df_train.Ticket_Text.value_counts()
Tic_Sur_count = pd.DataFrame(Tic_Sur_count)
Ticket_Text_Survival = pd.concat([Tic_Sur, Tic_Sur_count, Fare_Sur], axis = 1)
Ticket_Text_Survival.columns = ["% of Survivor", "N", "Mean Fare"]
Ticket_Text_Survival

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.barplot(data = Ticket_Text_Survival, x = Ticket_Text_Survival.index, y = Ticket_Text_Survival["% of Survivor"], ci = None)
plt.title("Distribution of survivors by Ticket_Text")
plt.xticks(rotation = 90)
plt.show()

**Conclusion 3:** It turns out that our first impression was not quite correct. Whether or not the ticket has text on it has little impact on survival, although it does correlate with Fare and Age as we expected. However, when we look at the distribution of survivors by the text written on each ticket, we immediately see a strong association with survival, keeping in mind that some prefixes cover only a few passengers (see the N column).

**Now let's see if having family members on board affects survival**

In [None]:
df_train.Is_Alone.value_counts()

- **537 passengers have no family members on board**

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.barplot(data = df_train, x = "Is_Alone", y = "Survived", ci = None)
plt.title("Distribution of survivors by Is_Alone")
plt.show()

In [None]:
df_train.groupby("Is_Alone").mean(numeric_only = True)

**Conclusion 4:** Unexpectedly, those who were alone were less likely to survive than those who were not. This might be because they were older and less well-off.

**Let's now see whether males have a higher survival rate than females**

In [None]:
# Rename the Sex variable to prevent any misconception
df_train = df_train.rename(columns = {"Sex": "Is_Male"})
df_test = df_test.rename(columns = {"Sex": "Is_Male"})

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.barplot(data = df_train, x = "Is_Male", y = "Survived", ci = None)
plt.title("Distribution of survivors by Is_Male")
plt.show()

In [None]:
df_train.groupby("Is_Male").mean(numeric_only = True)

In [None]:
pd.pivot_table(data = df_train, index = ["Is_Male", "Pclass"], values =["Survived", "Age", "Fare", "Is_Alone"])

In [None]:
pd.pivot_table(data = df_train, index = ["Is_Male", "Pclass"], values = "Survived", columns = "Age_bins")

**Conclusion 5:** Regardless of ticket class and age, females have higher survival rates than males.

**Let's now explore the impact of Titles on survival**

In [None]:
Tit_Sur = pd.pivot_table(data = df_train, index = "Title", values = "Survived").sort_values(by = "Survived", ascending = False)
plt.figure(figsize = (8, 4), dpi = 100)
# Plot from the pivot table itself so that x and y come from the same frame
sns.barplot(data = Tit_Sur, x = Tit_Sur.index, y = "Survived", ci = None)
plt.title("Distribution of survivors by Title")
plt.show()

In [None]:
pd.pivot_table(data = df_train, index = "Title", values = ["Survived", "Age", "Fare", "Is_Alone"]).sort_values("Survived", ascending = False)

**Conclusion 6:** Regardless of ticket class and age, passengers titled Miss and Mrs have higher survival rates than those with other titles.

### Let's now have a look at the statistical distribution of each feature

In [None]:
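# numeric_only=True restricts the correlation matrix to numeric columns (needed in pandas >= 2.0)
df_train.corr(numeric_only = True)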

In [None]:
plt.figure(figsize = (12, 5), dpi = 100)
sns.heatmap(round(df_train.corr(numeric_only = True), 1), annot = True, cmap = "viridis", annot_kws={"fontsize":10})
plt.show()

In [None]:
pairplot_data = df_train[['Survived', 'Pclass', 'Is_Male', 'Age', 'SibSp', 'Parch', 'Fare', 'num_family', 'numeric_ticket', 'Is_Alone']]
sns.pairplot(pairplot_data, diag_kind = "kde", hue = "Survived")
plt.show()

In [None]:
#cols = ['Pclass', 'Is_Male', 'Age', 'SibSp', 'Parch', 'Fare', 'num_family', 'numeric_ticket', 'Is_Alone']
#for col in cols:
    #plt.figure(figsize = (8, 4), dpi = 100)
    #sns.kdeplot(data = df_train, x= col, hue = "Survived")

### 4. Model Training, Validation, and Prediction

In [None]:
df_train.head()

In [None]:
## Dropping unnecessary features for model training
df_train = df_train.drop(["Name", "Ticket", "Fare", "Age"], axis = 1)
df_test = df_test.drop(["Name", "Ticket", "Fare", "Age"], axis = 1)

In [None]:
# dropping repetitive features
df_train = df_train.drop(["SibSp", "Parch", "num_family"], axis = 1)
df_test = df_test.drop(["SibSp", "Parch", "num_family"], axis = 1)

In [None]:
df_train.head()

In [None]:
df_train.info()

In [None]:
# Creating dummy variables
X = df_train.drop(["Survived", "PassengerId"], axis = 1)
X_dum = pd.get_dummies(X, drop_first = True)
# Encode the test frame after dropping PassengerId; otherwise every id becomes its own dummy column
df_test_dum = df_test.drop("PassengerId", axis = 1)
df_test_dum = pd.get_dummies(df_test_dum, drop_first = True)
y = df_train["Survived"]

In [None]:
# Make sure the test set has exactly the training columns, in the same order:
# test-only dummies are dropped and dummies missing from the test set are added as 0
df_test_dum = df_test_dum.reindex(columns = X_dum.columns, fill_value = 0)

If you skip the preceding cell, the upcoming model training and prediction steps will fail with an error indicating that the test data set does not have the same columns as the training data set. The error results from wrangling the two data sets separately: some variables in the test set have categories that do not exist in the same variable in the training set.
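
A toy illustration of the mismatch, and of one way to resolve it with `DataFrame.reindex`, is sketched below. The two small frames are invented for illustration: the test frame contains a title ("Dona") that the training frame lacks, and it is missing two titles that the training frame has.

```python
import pandas as pd

toy_train = pd.DataFrame({"Title": ["Mr", "Mrs", "Master"], "Is_Alone": [1, 0, 0]})
toy_test = pd.DataFrame({"Title": ["Mr", "Dona"], "Is_Alone": [1, 1]})

# Encoding each frame on its own yields different dummy columns
X_train = pd.get_dummies(toy_train)
X_test = pd.get_dummies(toy_test)
mismatch = set(X_train.columns) ^ set(X_test.columns)

# Keep exactly the training columns, in the training order:
# dummies missing from the test set are filled with 0, test-only dummies are dropped
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```

After the reindex, a model fitted on `X_train` sees the same features in the same order at prediction time.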

In [None]:
#import cross validation 
from sklearn.model_selection import cross_val_score

In [None]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(max_iter = 2000)

#Model Training
log_reg.fit(X_dum, y)

#Prediction and validation
acc_score = round(log_reg.score(X_dum, y) * 100, 2)
acc_score

#cross validation
cv = cross_val_score(log_reg,X_dum,y,cv=5)
cv_mean = round(cv.mean() * 100, 2)

pd.DataFrame({"acc_score": [acc_score], "cv_score": [cv_mean]})

In [None]:
#Support Vector Machine
from sklearn.svm import SVC
svc = SVC()  # lowercase name so that the SVC class is not shadowed

#Model Training
svc.fit(X_dum, y)

#Training accuracy
acc_score = round(svc.score(X_dum, y) * 100, 2)

#cross validation
cv = cross_val_score(svc,X_dum,y,cv=5)
cv_mean = round(cv.mean() * 100, 2)

pd.DataFrame({"acc_score": [acc_score], "cv_score": [cv_mean]})

In [None]:
#Decision Tree
from sklearn import tree
dt = tree.DecisionTreeClassifier()

#Model Training
dt.fit(X_dum, y)

#Prediction and validation
acc_score = round(dt.score(X_dum, y) * 100, 2)


#cross validation
cv = cross_val_score(dt,X_dum,y,cv=5)
cv_mean = round(cv.mean() * 100, 2)

pd.DataFrame({"acc_score": [acc_score], "cv_score": [cv_mean]})

In [None]:
#Random Forest
from sklearn.ensemble import RandomForestClassifier
rs = RandomForestClassifier()

#Model Training
rs.fit(X_dum, y)

#Prediction and validation
acc_score = round(rs.score(X_dum, y) * 100, 2)

#cross validation
cv = cross_val_score(rs,X_dum,y,cv=5)
cv_mean = round(cv.mean() * 100, 2)

pd.DataFrame({"acc_score": [acc_score], "cv_score": [cv_mean]})

In [None]:
#Gradient Boosting (scikit-learn's GradientBoostingClassifier, not the XGBoost library)
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()

#Model Training
gbc.fit(X_dum, y)

#Training accuracy
acc_score = round(gbc.score(X_dum, y) * 100, 2)

#cross validation
cv = cross_val_score(gbc,X_dum,y,cv=5)
cv_mean = round(cv.mean() * 100, 2)

pd.DataFrame({"acc_score": [acc_score], "cv_score": [cv_mean]})
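
The model cells above all repeat the same fit, score, and cross-validate pattern, so the comparison could also be collected into a single table. The sketch below shows the idea on synthetic data from `make_classification`, which stands in for `X_dum` and `y`; the dictionary `models` and the frame `results` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_dum / y
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=2000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_toy, y_toy)
    rows.append({
        "model": name,
        # accuracy on the data the model was trained on (optimistic)
        "acc_score": round(model.score(X_toy, y_toy) * 100, 2),
        # mean accuracy over 5 held-out folds (more realistic)
        "cv_score": round(cross_val_score(model, X_toy, y_toy, cv=5).mean() * 100, 2),
    })

results = pd.DataFrame(rows).set_index("model")
```

A large gap between `acc_score` and `cv_score`, as an unpruned decision tree typically shows, is a sign of overfitting, which is why the cross-validated score is the better basis for choosing a model.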

### Hyperparameter Tuning
Let's now tune the hyperparameters to improve the model results.
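
As a reminder of what a grid search does, here is a minimal sketch on synthetic data (again from `make_classification`, standing in for the Titanic features). Every candidate in the grid is evaluated with 5-fold cross-validation, and `best_score_` is the mean held-out accuracy of the winning candidate, expressed as a fraction between 0 and 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_dum / y
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Four candidates for the inverse regularization strength C
param_grid = {"C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(LogisticRegression(max_iter=2000), param_grid=param_grid, cv=5)
search.fit(X_toy, y_toy)

best_C = search.best_params_["C"]
best_cv_accuracy = search.best_score_
n_candidates = len(search.cv_results_["params"])
```

The number of fits grows with the product of the grid sizes times the number of folds, so large grids such as the Random Forest one further down can take a long time to run.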

#### 1. Logistic Regression

In [None]:
# import GridSearchCV
from sklearn.model_selection import GridSearchCV
log_reg = LogisticRegression(max_iter = 2000)

# param_grid
param_grid = {'max_iter' : [2000],
              'penalty' : ['l1', 'l2'],
              'C' : np.logspace(-4, 4, 50),
              'solver' : ['liblinear']}

# grid_model: Logistic Regression
lr_tuned = GridSearchCV(log_reg, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_lr = lr_tuned.fit(X_dum, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(lr_tuned.best_params_))
print("Best score is {}".format(lr_tuned.best_score_ * 100))

#### 2. Random Forest Classifier

In [None]:
rs = RandomForestClassifier()

# param_grid
param_grid = {'n_estimators': [400, 450, 500, 550],
              'criterion': ['gini', 'entropy'],
              'bootstrap': [True],
              'max_depth': [15, 20, 25],
              # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
              'max_features': ['sqrt', 10],
              'min_samples_leaf': [2, 3],
              'min_samples_split': [2, 3]}

# grid_model: Random Forest
rs_tuned = GridSearchCV(rs, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
best_rs = rs_tuned.fit(X_dum, y)

# Print the tuned parameters and score
print("Tuned Random Forest Parameters: {}".format(rs_tuned.best_params_))
print("Best score is {}".format(rs_tuned.best_score_ * 100))

In [None]:
#Prediction and submission
y_predict = best_lr.predict(df_test_dum)

#Create a DataFrame with the passenger ids and our predictions of whether they survived
submission = pd.DataFrame({'PassengerId':df_test['PassengerId'],'Survived':y_predict})

#Visualize the first 5 rows
submission.head()

In [None]:
#Submission
filename = 'Titanic_Predictions_2.csv'

submission.to_csv(filename,index=False)

print('Saved file: ' + filename)