# Survivor Prediction on Titanic using Machine Learning

## Objective: Predict survival on the Titanic

## Description

Visit [Titanic - Machine Learning from Disaster Competition on Kaggle](https://www.kaggle.com/competitions/titanic/overview) to view the challenge description and instructions.

## My Approach

Here is a brief description of my approach. Detailed explanations accompany the code cells.

My first submission scored 0.6866 using the K-Nearest Neighbour technique, with the model trained on only 3 features: `Pclass`, `Sex` and `Age`.

For my next 7 submissions, I tuned hyperparameters with `sklearn.model_selection.GridSearchCV` and included five features: `Pclass`, `Sex`, `Age`, `SibSp` and `Parch`. I used several techniques, including K-Nearest Neighbour, Logistic Regression and Decision Tree, to train and predict. The train and test accuracy on labelled data was between 0.73 and 0.81. My submission scores were between 0.72 and 0.77.

For my later submissions, I changed the approach for addressing missing values in the `Age` column and also implemented **Feature Engineering**, in which new features were added.

- The calculation of missing ages is described with the code cells below.  
[Jump to Handling missing values](#Handling-missing-values).
- From `Parch` and `SibSp`, I derived two columns, `FamilySize` and `IsAlone`. `FamilySize` contains numerical values and `IsAlone` contains binary (0/1) values.  
[Jump to Interaction Feature](#Interaction-Feature-using-SibSp-and-Parch).
- From the `FamilySize` feature created above, I created the `FamilySizeGroup` feature, which groups family size into the categories `Single`, `Small` and `Large`. I later assigned numerical values to these categories using `preprocessing.LabelEncoder()`.  
[Jump to Family Size Bins](#Family-Size-Bins).
- From the `Age` column, I created age bins using `pandas.qcut()` so that each bin holds roughly the same number of passengers. The bins are labelled with numbers, so they act as ordered categories.  
[Jump to Age Bins](#Age-Bins).
- From the `Fare` column, I created fare bins using `pandas.qcut()` so that each bin holds roughly the same number of passengers. The bins are labelled with numbers, so they act as ordered categories.  
[Jump to Fare Bins](#Fare-Bins).
- On the `Sex` column, I applied **One-hot encoding**, which created two additional columns of indicator values. I then dropped the `Sex` column from my training and test data.  
[Jump to Encode Categorical Data](#Encode-Categorical-Data).
- From the `Name` column, I extracted the title and then applied **One-hot encoding** to the title categories. This added 5 new columns.  
[Jump to Titles](#Titles)  
[Jump to Encode Categorical Data](#Encode-Categorical-Data).
- I also created new features from the `Ticket` column, but the model then became overly complex and my submission score dropped to 0.72996. I dropped these new features, along with `Ticket`, from the dataset.


Applying the above methods gave submission scores above 0.77.

**MY LATEST SUBMISSION SCORE IS 0.78708. I plan to break the 0.8 barrier.**

---

**I have included several ML techniques in the code. Tuning and training are performed for every technique, and the predictions from each technique are stored in separate CSV files. The code also generates the confusion matrix and reports the accuracy score on training data for each ML algorithm.**  
>[Jump to Models Development](#Model-Development-using-Training-Data).


**Please feel free to share your feedback/suggestions/appreciation on my work.**

## Code

### Import Libraries

In [None]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
import sklearn.tree as tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn import metrics
import seaborn as sns

### Read Dataset

In [None]:
df_train = pd.read_csv("./kaggle-titanic-comp/train.csv")
df_train.set_index('PassengerId', inplace=True)
df_train.shape

In [None]:
df_train.head(10)

In [None]:
df_train.info()

From the dataset snapshot and info, we can make some initial observations to understand our data and to determine the preprocessing required in our dataset.
- The `Name`, `Sex`, `Ticket`, `Cabin` and `Embarked` columns contain strings, which we will need to convert to a numeric datatype.
- Some of the values in the `Age` and `Cabin` columns are NaN. We need to address these null values if we want to keep these columns in our analysis.
- The `Sex` column contains categorical values that can easily be converted to a numeric datatype.
- The `Name` column contains a title that we can extract and use in our analysis.
- We can use the counts in `SibSp` and `Parch` columns to identify the number of family members traveling with the passenger.

Let's dive deeper into our dataset and find out which columns we need in our analysis and the feature engineering that we have to do.

In [None]:
df_train.describe()

### Handling missing values

#### Find out which columns have missing values

In [None]:
df_train.isna().sum()

So we have 177 values missing in the `Age` column, 687 in the `Cabin` column and 2 in the `Embarked` column. Let's address each of these columns individually.

#### Missing Values in Age Column

In [None]:
df_train['Age'].median()

##### Histogram of passengers in different age groups

In [None]:
import matplotlib.pyplot as plt
plt.hist(df_train['Age'].dropna(), bins=20)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Histogram of Age')
plt.show()

From the above histogram we can see that most passengers were between 20 and 40 years old, and the median age of the dataset is 28 years.

The median is a common choice for filling missing age values. However, if we insert a single median age, we put all of those passengers, who have different features, in the same age category. Instead, we should take the other features of the passengers with missing ages into account before assigning any age values to them.

Passengers may belong to different age categories depending on their sex, passenger class, port of embarkation, sibling/spouse count and parent/child count. So let's first explore each of these features individually.

##### Histogram of males and females passenger in different age groups

In [None]:
men = df_train[df_train["Sex"] == "male"]
women = df_train[df_train["Sex"] == "female"]

# create a histogram of age for men
plt.hist(men["Age"].dropna(), bins=30, alpha=0.5, label="Men")

# create a histogram of age for women
plt.hist(women["Age"].dropna(), bins=30, alpha=0.5, label="Women")

# add labels and title to the histogram
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age Distribution for Men and Women")
plt.legend()

# display the histogram
plt.show()

##### Check the count of missing age values in different column features

In [None]:
df_na = df_train[df_train['Age'].isna()]
#df_na['Embarked'].value_counts()
print("Bifurcation of missing age values in the following column features")
print(f"in Sex:\n{df_na['Sex'].value_counts()}")
print(f"in Pclass:\n{df_na['Pclass'].value_counts()}")
print(f"in Embarked:\n{df_na['Embarked'].value_counts()}")
print(f"in SibSp:\n{df_na['SibSp'].value_counts()}")
print(f"in Parch:\n{df_na['Parch'].value_counts()}")

##### Find the median age of each of these categories:

In [None]:
df_train.groupby(['Pclass'])['Age'].median()

In [None]:
df_train.groupby(['Sex'])['Age'].median()

In [None]:
df_train.groupby(['Embarked'])['Age'].median()

In [None]:
df_train.groupby(['SibSp'])['Age'].median()

In [None]:
df_train.groupby(['Parch'])['Age'].median()

**Take a close look at the above median ages across the different features and categories. They suggest that filling the missing `Age` values with the median age of the whole dataset is not the right approach, and that it might lead to inaccurate predictions. Instead, I am going to fill each missing `Age` value with the median age of the passenger's combined categories.**

For example, find the median age of males who travelled in Pclass = 3, embarked at S, and had no sibling/spouse and no parent/child, then use it to fill the missing ages of the passengers in that category. Similarly, I will find the median age for all the other combinations and fill the missing age values using loops.

In [None]:
# This next code line is only an example. Full operation is performed in the next code cell.
# Find the median age of males, traveling in Pclass = 3, Embarked at S, Have no Sibling/Spouse, and Have no Parent/Child
df_train[(df_train['Sex']=='male') & (df_train['Pclass']==3) & (df_train['Embarked']=='S') & (df_train['SibSp']==0) & (df_train['Parch']==0)]['Age'].median()

In [None]:
import warnings
warnings.filterwarnings("ignore")
sex=['male', 'female']
pclasses = [1, 2, 3]
embarked = ['C', 'Q', 'S']
sibsp = [0, 1, 2, 3, 8]
parch = [0, 1, 2]

median_ages = []
for sx in sex:
    for pc in pclasses:
        for e in embarked:
            for s in sibsp:
                for p in parch:
                    mask = (df_train['Sex'] == sx) & (df_train['Pclass'] == pc) & (df_train['Embarked'] == e) & (df_train['SibSp'] == s) & (df_train['Parch'] == p)
                    median_age = df_train.loc[mask, 'Age'].median()
                    df_train.loc[mask & (df_train['Age'].isna()), 'Age'] = median_age
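
The same group-wise fill can also be sketched without nested loops, using `groupby(...).transform('median')`. The example below is a minimal sketch on a small made-up frame, not a replacement for the cell above: the `toy` frame, its values and the `AgeFilled` column are assumptions for illustration, and only two grouping columns are used. A group whose ages are all missing has no median, so its rows stay unfilled.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values, not taken from train.csv)
toy = pd.DataFrame({
    'Sex':    ['male', 'male', 'male', 'female', 'female', 'female'],
    'Pclass': [3, 3, 3, 1, 1, 2],
    'Age':    [20.0, 30.0, np.nan, 40.0, np.nan, np.nan],
})

# Median age of each (Sex, Pclass) group, broadcast back to every row
group_median = toy.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Fill only the missing ages; known ages are left untouched
toy['AgeFilled'] = toy['Age'].fillna(group_median)
print(toy)
```

The last row belongs to a group with no known ages, so it remains NaN and needs a coarser fallback.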

##### Check the count of missing values in the `Age` feature again

In [None]:
df_train['Age'].isna().sum()

So, there are still 20 missing values in the `Age` column. Filling them requires a closer look at the features of the passengers whose age is missing.

In [None]:
df_na = df_train[df_train['Age'].isna()]
df_na

These age values were not filled because no passenger with a known age matches the features of the above passengers, and thus no median value exists.

Let's fill in the ages of these passengers individually. Take a look at the above data and you will notice the following points:

1. `PassengerId = 47, 110, 187, 215, 242, 365, 613, 769` share the same features, i.e. they travelled in `Pclass=3` with `SibSp=1`, `Parch=0`, but differ in sex and port of embarkation.
2. `PassengerId = 49, 302, 331` share the same features, i.e. they travelled in `Pclass=3` with `SibSp=2`, `Parch=0`, but differ in sex and port of embarkation.
3. `PassengerId = 160, 181, 202, 325, 793, 847, 864` are members of the Sage family, who travelled in `Pclass=3` with `SibSp=8`, `Parch=2`. We will deal with the ages of the Sage family separately.
4. `PassengerId = 594` travelled in `Pclass=3` with `SibSp=0`, `Parch=2`, `Sex=female`, and `PassengerId = 889` travelled in `Pclass=3` with `SibSp=1`, `Parch=2`, `Sex=female`. These two passengers can be combined in the same category.

Let's analyse the above points one by one to find the best way to represent their ages.

**1. Find the median age of passengers who travelled in `Pclass=3` with `SibSp=1`, `Parch=0`, leaving out the `Sex` and `Embarked` features. I left out the `Sex` feature because I checked that the median is the same for both sexes when it is included in the criteria.**

In [None]:
median = df_train[(df_train['Pclass']==3) & (df_train['SibSp']==1) & (df_train['Parch']==0)]['Age'].median()
median

Fill the ages of `PassengerId = 47, 110, 187, 215, 242, 365, 613, 769` with the above calculated median age = 25 years.

In [None]:
pid = [47, 110, 187, 215, 242, 365, 613, 769]

for p in pid:
    df_train.loc[p, 'Age'] = median

**2. Find the median age of passengers who travelled in `Pclass=3` with `SibSp=2`, `Parch=0`, separately for `Sex=male` and `Sex=female`, leaving out the `Embarked` feature.**

In [None]:
median_male = df_train[(df_train['Sex']=='male') & (df_train['Pclass']==3) & (df_train['SibSp']==2) & (df_train['Parch']==0)]['Age'].median()
median_female = df_train[(df_train['Sex']=='female') & (df_train['Pclass']==3) & (df_train['SibSp']==2) & (df_train['Parch']==0)]['Age'].median()
print(f"Male median for the above calculation criteria is: {median_male} and female median is: {median_female}")

Fill the ages of `PassengerId = 49, 302` with the above calculated male median age = 27 years and `PassengerId = 331` with the female median age = 18 years.

In [None]:
df_train.loc[49, 'Age'] = median_male
df_train.loc[302, 'Age'] = median_male
df_train.loc[331, 'Age'] = median_female

**3. Find the median age of passengers with the feature `Parch=2`.**

In [None]:
median = df_train[(df_train['Parch']==2)]['Age'].median()
median

Looking at the titles of the Sage family passengers, I am confident that `PassengerId=160`, who has the title Master, was under 18 years old. I will use the above calculated median age = 15 years for this passenger. For the rest of the Sage family, I am not sure about their age group, so I will give them the median age of all the passengers.

In [None]:
df_train.loc[160, 'Age'] = median

In [None]:
median = df_train['Age'].median()
median

In [None]:
pid = [181, 202, 325, 793, 847, 864]

for p in pid:
    df_train.loc[p, 'Age'] = median

**4. Find the median age of passengers who travelled in `Pclass=3` with `SibSp=0`, `Parch=2`, `Sex=female`, leaving out the `Embarked` feature.**

In [None]:
median = df_train[(df_train['Sex']=='female') & (df_train['Pclass']==3) & (df_train['SibSp']==0) & (df_train['Parch']==2)]['Age'].median()
median

Fill the ages of `PassengerId = 594, 889` with the above calculated median age = 15 years.

In [None]:
df_train.loc[594, 'Age'] = median
df_train.loc[889, 'Age'] = median

In [None]:
df_train.isna().sum()
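
The manual steps above all follow one pattern: when the most specific group has no median, fall back to a coarser group, and finally to the overall median. The following is a minimal sketch of that pattern on made-up data. `fill_with_fallback`, the `levels` list and the `toy` frame are hypothetical names, and the grouping levels are an assumption rather than the exact choices made above.

```python
import numpy as np
import pandas as pd

def fill_with_fallback(df, target, levels):
    """Fill NaNs in `target` with group medians, trying each grouping in turn."""
    filled = df[target].copy()
    for cols in levels:
        if cols:
            # Median of the known values within each group
            estimate = df.groupby(cols)[target].transform('median')
        else:
            # Empty grouping: fall back to the overall median
            estimate = pd.Series(df[target].median(), index=df.index)
        filled = filled.fillna(estimate)
    return filled

# Toy data (made-up values, not taken from train.csv)
toy = pd.DataFrame({
    'Sex':    ['male', 'male', 'female', 'female', 'female'],
    'Pclass': [3, 3, 1, 2, 2],
    'Age':    [22.0, np.nan, 38.0, np.nan, np.nan],
})

# Try (Sex, Pclass) first, then Sex alone, then the overall median
toy['AgeFilled'] = fill_with_fallback(toy, 'Age', [['Sex', 'Pclass'], ['Sex'], []])
print(toy)
```

Here the two second-class females have no same-class reference, so they take the median of all females with a known age.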

#### Missing Values in Cabin Column

Since the number of missing values in the `Cabin` column is very high (687 values), I will drop the `Cabin` column from my analysis.

#### Missing Values in Embarked Column

##### Plot Passenger Count vs Embarked Ports

In [None]:
plt.bar(df_train['Embarked'].value_counts().index, df_train['Embarked'].value_counts(), width=0.4, alpha=0.8)
plt.xlabel("Embarked Ports")
plt.ylabel("Count")
plt.title("Embarked Count")
plt.show()
df_train['Embarked'].mode()[0]

Most of the passengers embarked at port 'S'. Since only 2 port values are missing, I will fill them with the mode of the `Embarked` feature.

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode()[0])

In [None]:
df_train.isna().sum()

### Exploratory Data Analysis

#### Univariate Analysis

In [None]:
df_train.drop(['Cabin'], axis=1, inplace=True)
df_train.head()

##### Survived and deaths count

In [None]:
df_train['Survived'].value_counts()

##### Gender
We can see in the pivot table that females have a higher chance of survival.

In [None]:
df_train.pivot_table(values='Survived', index='Sex', columns=df_train['Survived'], aggfunc='count')

##### Socio-Economic Status

From the below plot, we can tell that a Class 1 passenger has more chance of surviving than a Class 3 passenger.

In [None]:
psgr_survived = df_train[df_train['Survived']==1]

plt.bar(df_train['Pclass'].value_counts().index, df_train['Pclass'].value_counts(), width=0.4, color='red', alpha=0.5, label='Total Passengers')
plt.bar(psgr_survived['Pclass'].value_counts().index, psgr_survived['Pclass'].value_counts(), width=0.4, color='blue', alpha=0.7, label='Passengers Survived')
plt.xticks([1, 2, 3])
plt.legend()
plt.xlabel("Passenger Class")
plt.ylabel("Passenger Count")
plt.title("Survival Likelihood based on Socio-Economic Status")

plt.show()

##### Age

Passengers below 8 years and above 35 years of age had a better chance of surviving as compared to passengers aged between 15 and 35 years.

In [None]:
plt.hist(df_train['Age'], color='red', bins=20, alpha=0.5, label='Total Passengers')
plt.hist(psgr_survived['Age'], color='blue', bins=20, alpha=0.7, label='Passengers Survived')
plt.legend()
plt.xlabel("Passenger Age")
plt.ylabel("Passenger Count")
plt.title("Survival Likelihood based on Age")

plt.show()

##### Embarkation Port

Passengers who embarked at port 'S' have the lowest survival ratio. However, looking at the numbers below, the embarkation port does not support a strong conclusion about the chances of survival, so I will drop this feature from my analysis.

In [None]:
plt.bar(df_train['Embarked'].value_counts().index, df_train['Embarked'].value_counts(), width=0.4, color='red', alpha=0.5, label='Total Passengers')
plt.bar(psgr_survived['Embarked'].value_counts().index, psgr_survived['Embarked'].value_counts(), width=0.4, color='blue', alpha=0.7, label='Passengers Survived')
plt.legend()
plt.xlabel("Port of Embarkment")
plt.ylabel("Passenger Count")
plt.title("Survival Likelihood based on Embarkment Port")

plt.show()

In [None]:
print(f"Survival percentage for passengers embarked at different ports\
:\n{round(psgr_survived['Embarked'].value_counts()/df_train['Embarked'].value_counts()*100,3)}")

##### Sibling/Spouse and Parent/Child

Most passengers were travelling without parents or children.  
Most passengers were travelling without siblings or a spouse.  
The data also suggests that the passengers who were traveling alone have a lower survival chance.

In [None]:
psgr_survived = df_train[df_train['Survived']==1]

fig, ax = plt.subplots(1, 2, figsize=(10, 5))

ax[0].bar(df_train['SibSp'].value_counts().index, df_train['SibSp'].value_counts(), width=0.4, color='red', alpha=0.5, label='Total Passengers')
ax[0].bar(psgr_survived['SibSp'].value_counts().index, psgr_survived['SibSp'].value_counts(), width=0.4, color='blue', alpha=0.7, label='Passengers Survived')
#plt.xticks([1, 2, 3])
ax[0].legend()
ax[0].set_xlabel("Sibling/Spouse")
ax[0].set_ylabel("Passenger Count")
ax[0].set_ylim(0, 715)
ax[0].set_title("Survival Likelihood based on\nSibling Spouse count")

ax[1].bar(df_train['Parch'].value_counts().index, df_train['Parch'].value_counts(), width=0.4, color='red', alpha=0.5, label='Total Passengers')
ax[1].bar(psgr_survived['Parch'].value_counts().index, psgr_survived['Parch'].value_counts(), width=0.4, color='blue', alpha=0.7, label='Passengers Survived')
#plt.xticks([1, 2, 3])
ax[1].legend()
ax[1].set_xlabel("Parent/Child")
ax[1].set_ylabel("Passenger Count")
ax[1].set_title("Survival Likelihood based on\nParent Child count")

plt.show()

#### Multivariate Analysis

##### Sex and Socio-Economic Status

Most Class 1 and Class 2 female passengers survived, in contrast to Class 3 female passengers. Among males, the likelihood of survival is higher if the passenger belonged to Class 1.

In [None]:
# Subset the data to include only males and females
ml_all = df_train[df_train['Sex'] == 'male']['Pclass'].value_counts()
ml_srv = df_train[(df_train['Sex'] == 'male') & (df_train['Survived'] == 1)]['Pclass'].value_counts()
fm_all = df_train[df_train['Sex'] == 'female']['Pclass'].value_counts()
fm_srv = df_train[(df_train['Sex'] == 'female') & (df_train['Survived'] == 1)]['Pclass'].value_counts()

In [None]:
# Create the figure and axis objects
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

ax[0].bar(ml_all.index, ml_all, width=0.4, color='blue', alpha=0.4, label='Total Males')
ax[0].bar(ml_srv.index, ml_srv, width=0.4, color='green', alpha=0.6, label='Males Survived')
ax[0].set_title('Male Survival Likelihood\nbased on Socio-economic Status')
ax[0].set_xlabel("Passenger Class")
ax[0].set_ylabel("Passenger Count")
ax[0].set_xticks([1, 2, 3])
ax[0].legend()

ax[1].bar(fm_all.index, fm_all, width=0.4, color='pink', alpha=0.8, label='Total Females')
ax[1].bar(fm_srv.index, fm_srv, width=0.4, color='green', alpha=0.6, label='Females Survived')
ax[1].set_title('Female Survival Likelihood\nbased on Socio-economic Status')
ax[1].set_xlabel("Passenger Class")
#ax[1].set_ylabel("Passenger Count")
ax[1].set_xticks([1, 2, 3])
ax[1].legend()

# plt.ylim() only affects the last subplot, so set the limit on both axes
for a in ax:
    a.set_ylim(0, 360)
plt.show()

In [None]:
table = pd.DataFrame({'Sex': ['Males', 'Females'],
                     'Class 1': [round(ml_srv[1]/ml_all[1]*100, 3), round(fm_srv[1]/fm_all[1]*100, 3)],
                     'Class 2': [round(ml_srv[2]/ml_all[2]*100, 3), round(fm_srv[2]/fm_all[2]*100, 3)],
                     'Class 3': [round(ml_srv[3]/ml_all[3]*100, 3), round(fm_srv[3]/fm_all[3]*100, 3)]
                     })
table.set_index(['Sex'], inplace=True)
table
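
The same kind of survival-percentage table can be sketched in one step with `groupby` and `unstack`: because `Survived` is 0 or 1, its mean within a group is the survival rate. This is a minimal sketch on made-up rows; the `toy` frame and the `rates` name are assumptions for illustration.

```python
import pandas as pd

# Toy data (made-up values, not taken from train.csv)
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'female', 'female', 'male', 'female'],
    'Pclass':   [1, 1, 1, 3, 3, 3],
    'Survived': [1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column is the survival fraction; unstack puts classes in columns
rates = (toy.groupby(['Sex', 'Pclass'])['Survived'].mean()
            .unstack('Pclass') * 100).round(3)
print(rates)
```

On the real data, the frame and column names would be `df_train`, `Sex`, `Pclass` and `Survived`.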

##### Sex and Age

A higher percentage of females aged 0 to 8 years and 35 to 60 years survived the disaster, whereas for males the survival percentage is higher for ages 0 to 10 years and 25 to 35 years.

In [None]:
# Subset the data to include only males and females
ml_all = df_train[df_train['Sex'] == 'male']['Age']
ml_srv = df_train[(df_train['Sex'] == 'male') & (df_train['Survived'] == 1)]['Age']
fm_all = df_train[df_train['Sex'] == 'female']['Age']
fm_srv = df_train[(df_train['Sex'] == 'female') & (df_train['Survived'] == 1)]['Age']

In [None]:
# Create the figure and axis objects
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

ax[0].hist(ml_all, color='blue', alpha=0.4, label='Total Males')
ax[0].hist(ml_srv, color='green', alpha=0.6, label='Males Survived')
ax[0].set_title('Male Survival Likelihood\nbased on Age')
ax[0].set_xlabel("Passenger Age")
ax[0].set_ylabel("Passenger Count")
ax[0].set_xticks(np.linspace(0, 80, 9))
ax[0].legend()

ax[1].hist(fm_all, color='pink', alpha=0.8, label='Total Females')
ax[1].hist(fm_srv, color='green', alpha=0.6, label='Females Survived')
ax[1].set_title('Female Survival Likelihood\nbased on Age')
ax[1].set_xlabel("Passenger Age")
ax[1].set_xticks(np.linspace(0, 80, 9))
ax[1].legend()

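# plt.xlim()/plt.ylim() only affect the last subplot, so set the limits on both axes
for a in ax:
    a.set_xlim(0, 85)
    a.set_ylim(0, 220)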

plt.show()

##### Age and Sibling/Spouse
Most passengers were travelling without siblings or a spouse.  
Passengers who were travelling alone or with a family of 3 or more members have a lower survival chance, whereas passengers with 1 or 2 family members who were aged 15 to 25 years had the highest chance of surviving.

In [None]:
# Subset the ages by sibling/spouse count, for all passengers and for survivors
p0_all = df_train[df_train['SibSp'] == 0]['Age']
p0_srv = df_train[(df_train['SibSp'] == 0) & (df_train['Survived'] == 1)]['Age']
p1_all = df_train[df_train['SibSp'] == 1]['Age']
p1_srv = df_train[(df_train['SibSp'] == 1) & (df_train['Survived'] == 1)]['Age']
p2_all = df_train[df_train['SibSp'] == 2]['Age']
p2_srv = df_train[(df_train['SibSp'] == 2) & (df_train['Survived'] == 1)]['Age']
p3_all = df_train[df_train['SibSp'] == 3]['Age']
p3_srv = df_train[(df_train['SibSp'] == 3) & (df_train['Survived'] == 1)]['Age']
p4_all = df_train[df_train['SibSp'] == 4]['Age']
p4_srv = df_train[(df_train['SibSp'] == 4) & (df_train['Survived'] == 1)]['Age']
p5_all = df_train[df_train['SibSp'] >= 5]['Age']
p5_srv = df_train[(df_train['SibSp'] >= 5) & (df_train['Survived'] == 1)]['Age']

In [None]:
# Create the figure and axis objects
fig, ax = plt.subplots(2, 3, figsize=(10, 10))

p = [[p0_all, p0_srv], [p1_all, p1_srv], [p2_all, p2_srv], [p3_all, p3_srv], [p4_all, p4_srv], [p5_all, p5_srv]]
titles = ['SibSp = 0', 'SibSp = 1', 'SibSp = 2', 'SibSp = 3', 'SibSp = 4', 'SibSp >= 5']
# ax.flat walks the 2x3 grid row by row, matching the order of the subsets in p
for axis, (p_all, p_srv), title in zip(ax.flat, p, titles):
    axis.hist(p_all, color='red', bins=20, alpha=0.5, label='Total Passengers')
    axis.hist(p_srv, color='blue', bins=20, alpha=0.7, label='Survived')
    axis.set_title(title)
    axis.set_xlabel("Age")
    axis.set_ylabel("Count")
    axis.set_xticks(np.linspace(0, 80, 9))
    axis.set_ylim(0, 150)
    axis.legend(fontsize=8)

fig.suptitle("Survival Likelihood based on Age and Sibling/Spouse count", fontsize=15)
plt.show()

### Feature Engineering

#### Interaction Feature using `SibSp` and `Parch`

This code creates two new features, `FamilySize` and `IsAlone`, from the existing features `Parch` and `SibSp`.
These two new features can potentially provide information about the survival of a passenger that is not captured in the `Parch` and `SibSp` features.

In [None]:
df_train['FamilySize'] = df_train['Parch'] + df_train['SibSp'] + 1
df_train['IsAlone'] = np.where(df_train['FamilySize'] == 1, 1, 0)
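
As a quick worked example of the two formulas above, here is a minimal sketch on three made-up passengers. The `toy` frame is an assumption for illustration, and the boolean comparison cast to `int` is an equivalent alternative to `np.where`.

```python
import pandas as pd

# Toy data: a solo traveller, a family of four, and a large family (made-up rows)
toy = pd.DataFrame({'SibSp': [0, 1, 8], 'Parch': [0, 2, 2]})

# The passenger counts as one member of their own family
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# 1 when the passenger has no relatives on board, else 0
toy['IsAlone'] = (toy['FamilySize'] == 1).astype(int)
print(toy)
```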

In [None]:
psgr_srv = df_train[df_train['Survived']==1]

fig, ax = plt.subplots(1, 2, figsize=(10, 5))

ax[0].bar(df_train['FamilySize'].value_counts().index, df_train['FamilySize'].value_counts(), width=0.4, color='red', alpha=0.5, label='Total Passengers')
ax[0].bar(psgr_srv['FamilySize'].value_counts().index, psgr_srv['FamilySize'].value_counts(), width=0.4, color='blue', alpha=0.7, label='Passengers Survived')
ax[0].set_xticks(np.linspace(0, 11, 12))
ax[0].set_xlabel("Family Size")
ax[0].set_ylabel("Count")
ax[0].set_title("Survival Likelihood\nbased on Family Size")
ax[0].legend()

ax[1].bar(df_train['IsAlone'].value_counts().index, df_train['IsAlone'].value_counts(), width=0.1, color='red', alpha=0.5, label='Total Passengers')
ax[1].bar(psgr_srv['IsAlone'].value_counts().index, psgr_srv['IsAlone'].value_counts(), width=0.1, color='blue', alpha=0.7, label='Passengers Survived')
ax[1].set_xticks([0, 1])
ax[1].set_xticklabels(['With Family', 'Alone'])
ax[1].set_xlabel("Alone Traveler?")
ax[1].set_ylabel("Count")
ax[1].set_title("Survival Likelihood\nbased on traveling solo/with family")
ax[1].legend()

plt.show()

#### Categories from Features 

##### Age Bins

Creating age bins can help in improving accuracy by converting the continuous variable "age" into a categorical variable. Converting a continuous variable into a categorical variable can make it easier for the model to detect patterns and relationships between the target variable and the predictor variable. Additionally, age bins can also reduce the impact of noise or outliers in the data and make the variable more interpretable.

In [None]:
df_train['AgeBin'] = pd.qcut(df_train['Age'], 8, labels=[1, 2, 3, 4, 5, 6, 7, 8])
df_train.groupby(['AgeBin'])['Age'].count()

In [None]:
bins = pd.qcut(df_train['Age'], 8, labels=[1, 2, 3, 4, 5, 6, 7, 8], retbins=True)[1]
print(bins)
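
The bin edges returned by `retbins=True` matter when the same binning has to be applied to other data, such as the test set: calling `pd.qcut()` again on new data would compute new quantiles, whereas `pd.cut()` reuses fixed edges. The following is a minimal sketch on made-up ages; `ages`, `new_ages`, `edges` and the four-bin choice are assumptions for illustration.

```python
import pandas as pd

# Toy ages 1..16 (made-up values) split into 4 equal-frequency bins
ages = pd.Series(range(1, 17), dtype=float)
labels = [1, 2, 3, 4]
age_bin, edges = pd.qcut(ages, 4, labels=labels, retbins=True)
print(age_bin.value_counts().sort_index())

# Reuse the same edges on new data instead of recomputing quantiles
new_ages = pd.Series([2.0, 10.0, 16.0])
new_bin = pd.cut(new_ages, bins=edges, labels=labels, include_lowest=True)
print(new_bin.tolist())
```

Values outside the original range would fall outside every bin and become NaN, so the outer edges may need widening in practice.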

##### Fare Bins

In [None]:
# FamilySize already counts the passenger (Parch + SibSp + 1), so divide by it directly
df_train['FarePerPerson'] = df_train['Fare']/df_train['FamilySize']

# create fare bins
df_train['FareBin'] = pd.qcut(df_train['FarePerPerson'], 5, labels=[1, 2, 3, 4, 5])
df_train.groupby(['FareBin'])['FarePerPerson'].count()

In [None]:
# Use FarePerPerson so the printed edges match the FareBin feature created above
bins = pd.qcut(df_train['FarePerPerson'], 5, labels=[1, 2, 3, 4, 5], retbins=True)[1]
print(bins)

##### Family Size Bins

In [None]:
df_train['FamilySizeGroup'] = 'Single'
df_train.loc[(df_train['FamilySize'] > 1) & (df_train['FamilySize'] <= 4), 'FamilySizeGroup'] = 'Small'
df_train.loc[(df_train['FamilySize'] > 4), 'FamilySizeGroup'] = 'Large'

##### Titles

In [None]:
# Raw string so that `\.` is read as a regex escape, not a Python string escape
df_train['Title'] = df_train.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
df_train['Title'] = df_train['Title'].replace(['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df_train['Title'] = df_train['Title'].replace('Mlle', 'Miss')
df_train['Title'] = df_train['Title'].replace('Ms', 'Miss')
df_train['Title'] = df_train['Title'].replace('Mme', 'Mrs')
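
To make the extraction pattern concrete, here is a minimal sketch on a few invented names written in the dataset's `Surname, Title. Given names` layout. The names and the shortened list of rare titles are assumptions for illustration. The pattern captures the run of letters that follows a space and ends with a period.

```python
import pandas as pd

# Invented names in the "Surname, Title. Given names" layout
names = pd.Series(['Smith, Mr. John', 'Doe, Mlle. Anna',
                   'Roe, Col. James', 'Poe, Master. Tim'])

# Capture the letters between a space and the first following period
title = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Merge uncommon titles and map French forms to their English equivalents
title = title.replace(['Capt', 'Col', 'Major'], 'Rare')
title = title.replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
print(title.tolist())
```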

##### Creating feature from Ticket Prefix and Ticket Number - Not required

In [None]:
#df_train['TicketPrefix'] = df_train['Ticket'].apply(lambda x: x.split(' ')[0])
#df_train['TicketNumber'] = df_train['Ticket'].apply(lambda x: x.split(' ')[-1])
#df_train['TicketCombined'] = df_train['TicketPrefix'] + df_train['TicketNumber']
# Apply one-hot encoding
#df_train = pd.get_dummies(df_train, columns=['TicketCombined'])

In [None]:
# Find columns that are present in the test dataset
# but missing from the train dataset. These columns
# were added to the test data through feature engineering.
#train_cols = set(df_train.columns)
#test_cols = set(df_test.columns)
#missing_cols = test_cols - train_cols
#print("No. of Columns missing in train data:", len(missing_cols))

# Add the missing columns to the train dataframe
#for col in df_test.columns:
#    if col.startswith('TicketCombined'):
#        if col not in df_train.columns:
#            df_train[col] = 0

### Model Development using Training Data

#### Create NumPy Arrays from Dataframes

#### Encode Categorical Data

The `Sex` feature in this dataset is categorical. Most ML libraries do not handle string categories directly, so categorical features such as `Sex` and `Title` need to be converted to numerical values before a model can use them. We can convert these features using **pandas.get_dummies()**, which turns a categorical variable into indicator variables, or `LabelEncoder()`, which maps each category to an integer.

For features whose categories have a numerical relationship, I will use the `LabelEncoder` method. As we can see, the categories in the feature `FamilySizeGroup` are `Single`, `Small` and `Large`. We can establish a numerical relationship between these categories, so I will apply `LabelEncoder` to `FamilySizeGroup`, which replaces the category names with integer codes. Note that `LabelEncoder` assigns the codes in alphabetical order (`Large`=0, `Single`=1, `Small`=2), so the codes do not follow the size order.

For features that have independent categories (meaning that there is no relationship between categories such as `male` and `female`), I will use the `pd.get_dummies` method, which creates a new column for each category and assigns a binary value indicating whether that category is present. These features are `Sex`, which has the categories `male` and `female`, and `Title`, which has the categories `Rare`, `Mr`, `Mrs`, `Miss` and `Master`. This method is called **One-hot encoding**.

After applying one-hot encoding, I will drop some columns from our dataframe because these features either (1) don't provide any meaningful information or (2) have already had all meaningful information extracted from them by feature engineering.  
After dropping the columns, I will convert the Pandas dataframe to a NumPy array. Subsequently, I will apply the `LabelEncoder` method to the `FamilySizeGroup` column.

In [None]:
# One-hot encoding for Sex and Title features using pd.get_dummies() method
encode_sex = pd.get_dummies(df_train['Sex'], prefix='Sex')
encode_ttl = pd.get_dummies(df_train['Title'], prefix='Title')
df_train = pd.concat([df_train, encode_sex, encode_ttl], axis=1) 
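
One practical point with `pd.get_dummies()` is that it only creates columns for the categories it sees, so a second dataset (for example the test data) can end up with fewer columns. The following is a minimal sketch of aligning the columns with `reindex`. The two small title series are made up, and `train_enc` and `test_enc` are hypothetical names.

```python
import pandas as pd

# Made-up title columns: the second set lacks 'Mrs' and 'Rare'
train_titles = pd.Series(['Mr', 'Mrs', 'Rare', 'Mr'])
test_titles = pd.Series(['Mr', 'Mr'])

train_enc = pd.get_dummies(train_titles, prefix='Title', dtype=int)
test_enc = pd.get_dummies(test_titles, prefix='Title', dtype=int)

# Give the second frame exactly the training columns, filling absent ones with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```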

In [None]:
df_train.columns

In [None]:
# Drop columns that are not required
df = df_train.drop(['Survived', 'Name', 'Fare', 'Sex', 'Ticket', 'Embarked', 'FamilySize', 'FarePerPerson', 'Title'], axis=1)
x = df.values
df.head()
#x[0:5]

In [None]:
df.info()

In [None]:
# Encode FamilySizeGroup feature using LabelEncoder() method
le_fsize = preprocessing.LabelEncoder()
le_fsize.fit(['Single','Small', 'Large'])
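# Look up the column position by name instead of hard-coding index 7
fsize_col = df.columns.get_loc('FamilySizeGroup')
x[:, fsize_col] = le_fsize.transform(x[:, fsize_col])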
x[0:5]
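
`LabelEncoder` sorts its classes before assigning codes, so the integers follow alphabetical order rather than family size. If an explicit size order is wanted, a plain dictionary mapping is one option. This is a minimal sketch; the `groups` series and the `order` mapping are assumptions for illustration, not part of the notebook's pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

groups = pd.Series(['Single', 'Small', 'Large', 'Small'])

# LabelEncoder sorts the classes, so the codes are alphabetical
le = LabelEncoder().fit(['Single', 'Small', 'Large'])
print(list(le.classes_))
alphabetical = le.transform(groups)
print(alphabetical.tolist())

# Hypothetical explicit mapping that keeps the size order Single < Small < Large
order = {'Single': 0, 'Small': 1, 'Large': 2}
ordinal = groups.map(order)
print(ordinal.tolist())
```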

In [None]:
y = df_train['Survived'].values
y[0:5]

#### Normalize Data

Standardization gives the data zero mean and unit variance. It is good practice, especially for algorithms such as KNN that are based on the distance between data points:

In [None]:
x = preprocessing.StandardScaler().fit(x).transform(x.astype(float))
x[0:5]

#### Train-Test Split

Although the train and test data are provided in separate files, I will split the labelled train data into train and test subsets. I will train the model on the train subset and then run it on the test subset to measure the accuracy. I will repeat the training several times using different train-to-test ratios. Once a high accuracy is achieved, I will run the code on the actual test data to make the predictions.

In [None]:
xtrain_train, xtrain_test, ytrain_train, ytrain_test = train_test_split(x, y, test_size=0.3, random_state=2)
print ('Train set:', xtrain_train.shape,  ytrain_train.shape)
print ('Test set:', xtrain_test.shape,  ytrain_test.shape)
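
A common variation on the standardization step is to fit the scaler on the training split only and then apply it to the hold-out split, so that the hold-out rows do not influence the mean and variance. The following is a minimal sketch on synthetic data: the random `X` and `y` arrays stand in for the real features and labels, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

# Learn mean and variance from the training split only
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)  # scaled with the training statistics

print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
```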

In [None]:
def plot_confusion_matrix(y, y_predict):
    """Plot the confusion matrix of true labels `y` against predictions `y_predict`."""
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y, y_predict)
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, fmt='d', ax=ax)  # annot=True writes the count in each cell
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')
    # confusion_matrix orders classes ascending: 0 (died) first, then 1 (survived)
    ax.xaxis.set_ticklabels(['died', 'survived'])
    ax.yaxis.set_ticklabels(['died', 'survived'])
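
To read the heatmap correctly it helps to recall how `confusion_matrix` lays out its cells: rows are true labels, columns are predicted labels, and classes appear in ascending order. Since `Survived` is coded 0 for died and 1 for survived, the first row and column belong to "died". A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = died, 1 = survived
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() yields the cells row by row:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```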

#### Model - K-Nearest Neighbor (KNN)

##### Hyperparameter Tuning

In [None]:
parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'weights': ['uniform', 'distance'],  # these are `weights` options, not `algorithm` options
              'p': [1, 2, 3]}

print("Model - k-Nearest Neighbor")
print("==========================")
print("Tuning hyperparameters")


KNN = KNeighborsClassifier()
knn_cv = GridSearchCV(KNN, param_grid=parameters, cv=20, verbose=0).fit(xtrain_train, ytrain_train)

print("tuned hyperparameters :(best parameters) ",knn_cv.best_params_)
print("accuracy :",knn_cv.best_score_)

##### Train with best Parameters

In [None]:
# Unpack every tuned hyperparameter so none of the searched settings is dropped
knn = KNeighborsClassifier(**knn_cv.best_params_).fit(xtrain_train, ytrain_train)
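
As a side note, `GridSearchCV` uses `refit=True` by default, so the search object already holds the best configuration refitted on all the data passed to `fit`. The sketch below, on synthetic data from `make_classification`, suggests that `best_estimator_` and a manually retrained model give the same predictions; the names `search`, `best_knn` and `manual_knn` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split
x_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance'], 'p': [1, 2]}
search = GridSearchCV(KNeighborsClassifier(), param_grid=grid, cv=5).fit(x_demo, y_demo)

# With the default refit=True, the winning configuration has already been
# refitted on all of x_demo, so it can be used directly
best_knn = search.best_estimator_
manual_knn = KNeighborsClassifier(**search.best_params_).fit(x_demo, y_demo)

same_predictions = (best_knn.predict(x_demo) == manual_knn.predict(x_demo)).all()
print(search.best_params_, same_predictions)
```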

##### Prediction and Confusion Matrix
We can now use the model to make predictions on the held-out split (`xtrain_test`):

In [None]:
knn_score = knn.score(xtrain_test, ytrain_test)
print(knn_score)
yhat = knn.predict(xtrain_test)
plot_confusion_matrix(ytrain_test, yhat)

#### Model - Logistic Regression

##### Hyperparameter Tuning

In [None]:
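# Only penalty/solver pairs that scikit-learn supports are searched:
# 'l1' needs 'liblinear' or 'saga'; 'elasticnet' is left out because it
# additionally requires an l1_ratio value, and the unpenalized option is
# left out because its spelling differs between versions ('none' vs None)
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
parameters = [{'penalty': ['l2'],
               'C': C_values,
               'fit_intercept': [True, False],
               'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
              {'penalty': ['l1'],
               'C': C_values,
               'fit_intercept': [True, False],
               'solver': ['liblinear', 'saga']}]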

print("Model - Logistic Regression")
print("===========================")
print("Tuning hyperparameters")

lr = LogisticRegression()
logreg_cv = GridSearchCV(lr, param_grid=parameters, cv=20, verbose=0).fit(xtrain_train,ytrain_train)

print("tuned hyperparameters :(best parameters) ", logreg_cv.best_params_)
print("accuracy :", logreg_cv.best_score_)

##### Train with best Parameters

In [None]:
logreg = LogisticRegression(penalty=logreg_cv.best_params_['penalty'],
                            C=logreg_cv.best_params_['C'],
                            fit_intercept=logreg_cv.best_params_['fit_intercept'],
                            solver=logreg_cv.best_params_['solver']).fit(xtrain_train, ytrain_train)

##### Prediction and Confusion Matrix

In [None]:
logreg_score = logreg.score(xtrain_test, ytrain_test)
print(logreg_score)

yhat = logreg.predict(xtrain_test)
plot_confusion_matrix(ytrain_test, yhat)

#### Model - Support Vector Machine

##### Hyperparameter Tuning

In [None]:
parameters = {'C': [0.1, 1, 10, 100],
              'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
              'degree': [2, 3, 4],
              'gamma': [0.1, 0.01, 0.001, 'scale', 'auto']}

print("Model - Support Vector Machine")
print("==============================")
print("Tuning hyperparameters")

svm = SVC()
svm_cv = GridSearchCV(svm, param_grid=parameters, cv=3, verbose=0).fit(xtrain_train, ytrain_train)

print("tuned hyperparameters :(best parameters) ", svm_cv.best_params_)
print("accuracy :", svm_cv.best_score_)

##### Train with best Parameters

In [None]:
svm = SVC(C=svm_cv.best_params_['C'],
          kernel=svm_cv.best_params_['kernel'],
          degree=svm_cv.best_params_['degree'],
          gamma=svm_cv.best_params_['gamma']).fit(xtrain_train, ytrain_train)

##### Prediction and Confusion Matrix

In [None]:
svm_score = svm.score(xtrain_test, ytrain_test)
print(svm_score)

yhat = svm.predict(xtrain_test)
plot_confusion_matrix(ytrain_test, yhat)

#### Model - Decision Tree Classifier

##### Hyperparameter Tuning

In [None]:
parameters = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              'max_depth': [2*n for n in range(1,10)],
              'max_features': [None, 'sqrt'],  # 'auto' was an alias of 'sqrt' and is gone in newer scikit-learn
              'min_samples_leaf': [2, 5, 10, 20, 50],
              'min_samples_split': [2, 5, 10, 20]}  # an integer value must be at least 2

print("Model - Decision Tree")
print("=====================")
print("Tuning hyperparameters")

tree = DecisionTreeClassifier()
tree_cv = GridSearchCV(tree, param_grid=parameters, cv=20, verbose=0).fit(xtrain_train, ytrain_train)

print("tuned hyperparameters :(best parameters) ", tree_cv.best_params_)
print("accuracy :", tree_cv.best_score_)

##### Train with best Parameters

In [None]:
tree = DecisionTreeClassifier(criterion=tree_cv.best_params_['criterion'],
                              splitter=tree_cv.best_params_['splitter'],
                              max_depth=tree_cv.best_params_['max_depth'],
                              max_features=tree_cv.best_params_['max_features'],
                              min_samples_leaf=tree_cv.best_params_['min_samples_leaf'],
                              min_samples_split=tree_cv.best_params_['min_samples_split']).fit(xtrain_train, ytrain_train)

##### Prediction and Confusion Matrix

In [None]:
tree_score = tree.score(xtrain_test, ytrain_test)
print(tree_score)

yhat = tree.predict(xtrain_test)
plot_confusion_matrix(ytrain_test, yhat)

#### Model - Naive Bayes

##### Hyperparameter Tuning

In [None]:
parameters = {'var_smoothing': [1e-09, 1e-08, 1e-07, 1e-06, 1e-05]}

print("Model - Gaussian Naive Bayes")
print("============================")
print("Tuning var_smoothing")

gnb = GaussianNB()
gnb_cv = GridSearchCV(gnb, param_grid=parameters, cv=20, verbose=0).fit(xtrain_train, ytrain_train)

print("var_smoothing :(best parameter) ", gnb_cv.best_params_['var_smoothing'])
print("accuracy :", gnb_cv.best_score_)

##### Train with best Parameters

In [None]:
gnb = GaussianNB(var_smoothing=gnb_cv.best_params_['var_smoothing']).fit(xtrain_train, ytrain_train)

##### Prediction and Confusion Matrix

In [None]:
gnb_score = gnb.score(xtrain_test, ytrain_test)
print(gnb_score)

yhat = gnb.predict(xtrain_test)
plot_confusion_matrix(ytrain_test, yhat)

#### Model - XGBoost

##### Hyperparameter Tuning

In [None]:
parameters = {'learning_rate': [0.1, 0.05, 0.01],
              'max_depth': [3, 5, 7],
              'min_child_weight': [1, 3, 5],
              'n_estimators': [100, 200, 300]
              }

print("Model - XGBoost")
print("===============")
print("Tuning hyperparameters")

xgb = XGBClassifier()
xgb_cv = GridSearchCV(xgb, param_grid=parameters, cv=5, verbose=0).fit(xtrain_train, ytrain_train)


print("tuned hyperparameters :(best parameters) ", xgb_cv.best_params_)
print("accuracy :", xgb_cv.best_score_)

##### Train with best Parameters

In [None]:
xgb = XGBClassifier(learning_rate=xgb_cv.best_params_['learning_rate'],
                    max_depth=xgb_cv.best_params_['max_depth'],
                    min_child_weight=xgb_cv.best_params_['min_child_weight'],
                    n_estimators=xgb_cv.best_params_['n_estimators']).fit(xtrain_train, ytrain_train)

##### Prediction and Confusion Matrix

In [None]:
xgb_score = xgb.score(xtrain_test, ytrain_test)
print(xgb_score)

yhat = xgb.predict(xtrain_test)
plot_confusion_matrix(ytrain_test, yhat)

#### Model - Random Forest Classifier

##### Hyperparameter Tuning

In [None]:
# Define the hyperparameters to be tuned
parameters = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 8, 15, 25, 30],
    'min_samples_split': [2, 5, 10, 15, 100],
    'min_samples_leaf': [1, 2, 5, 10]
}

print("Model - Random Forest")
print("=====================")
print("Tuning hyperparameters")


rfc = RandomForestClassifier()
# Use GridSearchCV to perform hyperparameter tuning
rfc_cv = GridSearchCV(rfc, param_grid=parameters, cv=5, verbose=0).fit(xtrain_train, ytrain_train)

print("tuned hyperparameters :(best parameters) ", rfc_cv.best_params_)
print("accuracy :", rfc_cv.best_score_)

##### Train with best Parameters

In [None]:
# Use the best parameters to train and test the model
rfc = RandomForestClassifier(n_estimators=rfc_cv.best_params_['n_estimators'],
                             max_depth=rfc_cv.best_params_['max_depth'],
                             min_samples_split=rfc_cv.best_params_['min_samples_split'],
                             min_samples_leaf=rfc_cv.best_params_['min_samples_leaf']).fit(xtrain_train, ytrain_train)

##### Prediction and Confusion Matrix

In [None]:
rfc_score = rfc.score(xtrain_test, ytrain_test)
print(rfc_score)

yhat = rfc.predict(xtrain_test)
plot_confusion_matrix(ytrain_test, yhat)

#### Compare Models

In [None]:
models = {
    'KNeighborsClassifier': knn_cv.best_score_,
    'DecisionTreeClassifier': tree_cv.best_score_,
    'SupportVectorMachine': svm_cv.best_score_,
    'LogisticRegression': logreg_cv.best_score_,
    'Naive Bayes': gnb_cv.best_score_,
    'xGBoost': xgb_cv.best_score_,
    'RandomForest': rfc_cv.best_score_
}

bestmodel = max(models, key=models.get)
print(f"Best performing model is: {bestmodel} with accuracy score: {models.get(bestmodel):<6.3f}")

params = {
    'KNeighborsClassifier': knn_cv.best_params_,
    'DecisionTreeClassifier': tree_cv.best_params_,
    'SupportVectorMachine': svm_cv.best_params_,
    'LogisticRegression': logreg_cv.best_params_,
    'Naive Bayes': gnb_cv.best_params_,
    'xGBoost': xgb_cv.best_params_,
    'RandomForest': rfc_cv.best_params_
}

bestparams = params.get(bestmodel)
print(f"Tuned hyperparameters for the {bestmodel} are:", bestparams)

fig = plt.figure(figsize = (10, 5))
bars = plt.bar(*zip(*models.items()), width=0.6)

for bar in bars:
    height = bar.get_height()
    plt.annotate(str(round(height,3)),
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),  # 3 points vertical offset
                textcoords="offset points",
                ha='center', va='bottom')
plt.ylim(0,1)
plt.xticks(rotation=45)
plt.show()

In [None]:
df_score = pd.DataFrame({'Models':['KNN', 'Tree', 'SVM', 'LR', 'GaussNB', 'xGBoost', 'RandomFor'],
                         'Score':[knn_score, tree_score, svm_score, logreg_score, gnb_score, xgb_score, rfc_score]
                         })
df_score

#### Cross Validation

In [None]:
"""from sklearn.model_selection import cross_val_score

best_params = tree_cv.best_params_

# create an instance of the decision tree classifier with the best hyperparameters
tree = DecisionTreeClassifier(**best_params)

# perform cross-validation to evaluate the performance of the model
cv_scores = cross_val_score(tree, xtrain_train, ytrain_train, cv=20)

# get the mean accuracy from cross-validation
mean_accuracy = np.mean(cv_scores)

print("Mean Accuracy:", mean_accuracy)
print(f"Mean cross-validation score: {cv_scores.mean():.2f}")
print(f"Standard deviation: {cv_scores.std():.2f}")

# Select the best model
best_model = tree_cv.best_estimator_
print(f"Best model parameters: {tree_cv.best_params_}")

# Re-train the best model on the entire training dataset
best_model.fit(xtrain_train, ytrain_train)

# Evaluate the final model on a separate test set
yhat = best_model.predict(xtrain_test)
accuracy = accuracy_score(ytrain_test, yhat)
print(f"Accuracy on test set: {accuracy:.2f}")

# Fine-tune the hyperparameters of the best model using GridSearchCV
parameters = {'criterion': ['gini', 'entropy'],
     'splitter': ['best', 'random'],
     'max_depth': [2*n for n in range(1,10)],
     'max_features': ['auto', 'sqrt'],
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10]} # Define the hyperparameter grid
tree_cv = GridSearchCV(best_model, param_grid=parameters, cv=20).fit(xtrain_train, ytrain_train)
"""

In [None]:
"""print(tree_cv.score(xtrain_test, ytrain_test))

yhat = tree_cv.predict(xtrain_test)
plot_confusion_matrix(ytrain_test, yhat)
"""

### Running the Model on Test Data

In [None]:
df_test = pd.read_csv("./kaggle-titanic-comp/test.csv")
df_test.set_index('PassengerId', inplace=True)
df_test.head(10)

In [None]:
df_test.info()

In [None]:
df_test.shape

In [None]:
df_test.isna().sum()

#### Missing Values in Age Column

In [None]:
sex=['male', 'female']
pclasses = [1, 2, 3]
embarked = ['C', 'Q', 'S']
sibsp = [0, 1, 2, 8]
parch = [0, 1, 2, 4, 9]

median_ages = []
for sx in sex:
    for pc in pclasses:
        for e in embarked:
            for s in sibsp:
                for p in parch:
                    mask = (df_test['Sex'] == sx) & (df_test['Pclass'] == pc) & (df_test['Embarked'] == e) & (df_test['SibSp'] == s) & (df_test['Parch'] == p)
                    median_age = df_test.loc[mask, 'Age'].median()
                    df_test.loc[mask & (df_test['Age'].isna()), 'Age'] = median_age
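
The nested loops above can also be expressed with a group-wise transform. The sketch below is an alternative formulation on a made-up dataframe (`demo` is illustrative, with fewer grouping columns than the real data). One difference to be aware of: `groupby` covers every combination present in the data, not only the values listed explicitly.

```python
import numpy as np
import pandas as pd

# Made-up passengers; NaN marks a missing age
demo = pd.DataFrame({
    'Sex':    ['male', 'male', 'male', 'female', 'female', 'female'],
    'Pclass': [3, 3, 3, 1, 1, 1],
    'Age':    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age of each (Sex, Pclass) group, broadcast back to every row
group_median = demo.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Only the missing ages are replaced; known ages stay untouched
demo['Age'] = demo['Age'].fillna(group_median)
print(demo)
```

As with the loop, a group in which every age is missing has no median, so its rows stay empty and need a coarser criterion.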

In [None]:
df_na = df_test[df_test['Age'].isna()]
df_na

**1. Find the median age of passengers who traveled in `Pclass=3`, `SibSp=1`, `Parch=0`, leaving out the `Sex` and `Embarked` features. I left out the `Sex` feature because I checked that the median is the same for both genders when it is included in the calculation criteria.**

In [None]:
median = df_test[(df_test['Pclass']==3) & (df_test['SibSp']==1) & (df_test['Parch']==0)]['Age'].median()
median

Fill the ages of `PassengerId = 1013, 1141, 1165` with the above calculated median age = 25 years.

In [None]:
pid = [1013, 1141, 1165]

for p in pid:
    df_test.loc[p, 'Age'] = median

**2. Find the median age of passengers who traveled in `Pclass=3`, `SibSp=2`, `Parch=0`, `Sex=male/female`, leaving out the `Embarked` feature.**

In [None]:
median_male = df_test[(df_test['Sex']=='male') & (df_test['Pclass']==3) & (df_test['SibSp']==2) & (df_test['Parch']==0)]['Age'].median()
median_female = df_test[(df_test['Sex']=='female') & (df_test['Pclass']==3) & (df_test['SibSp']==2) & (df_test['Parch']==0)]['Age'].median()
print(f"Male median for the above calculation criteria is: {median_male} and female median is: {median_female}")

Fill the ages of `PassengerId = 921, 1189` with the above calculated male median age = 27 years and `PassengerId = 1019` with female median age = 18 years.

In [None]:
df_test.loc[921, 'Age'] = median_male
df_test.loc[1189, 'Age'] = median_male
df_test.loc[1019, 'Age'] = median_female

**3. Find the median age of passengers with the feature `Parch>=4`, `Sex=male/female`.**

In [None]:
median_male = df_test[(df_test['Parch']>=4)  & (df_test['Sex']=='male')]['Age'].median()
median_female = df_test[(df_test['Parch']>=4)  & (df_test['Sex']=='female')]['Age'].median()
print(f"Male median for the above calculation criteria is: {median_male} and female median is: {median_female}")

Fill the ages of `PassengerId = 1024, 1257` with the above calculated female median age = 39 years and `PassengerId = 1234` with male median age = 40 years.

In [None]:
df_test.loc[1024, 'Age'] = median_female
df_test.loc[1257, 'Age'] = median_female
df_test.loc[1234, 'Age'] = median_male

**4. Find the median age of passengers with the feature `Parch=2`, `SibSp=8`.**

In [None]:
median = df_test[(df_test['Parch']==2)  & (df_test['SibSp']==8)]['Age'].median()
median

Fill the age of `PassengerId = 1080` with the above calculated median age = 14.5 years.

In [None]:
df_test.loc[1080, 'Age'] = median

**5. Find the median age of passengers who traveled in `Pclass=3`, `SibSp=0`, `Parch=2`, `Sex=female`, leaving out the `Embarked` feature.**

In [None]:
median = df_test[(df_test['Sex']=='female') & (df_test['Pclass']==3) & (df_test['SibSp']==0) & (df_test['Parch']==2)]['Age'].median()
median

In [None]:
df_test.loc[1117, 'Age'] = median

**6. Find the median age of passengers in the training data (`df_train`) who traveled in `Pclass=3`, `SibSp=1`, `Parch=2`, `Sex=male`, leaving out the `Embarked` feature.**

In [None]:
median = df_train[(df_train['Sex']=='male') & (df_train['Pclass']==3) & (df_train['SibSp']==1) & (df_train['Parch']==2)]['Age'].median()
median

Fill the age of `PassengerId = 1136` with the above calculated median age = 15 years.

In [None]:
df_test.loc[1136, 'Age'] = median
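
The manual steps above all follow one pattern: when the finest grouping has no known ages, fall back to a coarser one. A hedged sketch of automating that pattern is shown below. The dataframe `demo` and the fallback order in `levels` are assumptions for illustration, not the exact criteria used in steps 1 to 6.

```python
import numpy as np
import pandas as pd

# Made-up passengers; NaN marks a missing age
demo = pd.DataFrame({
    'Sex':    ['male', 'male', 'male', 'male', 'female'],
    'Pclass': [3, 3, 3, 3, 1],
    'SibSp':  [0, 0, 1, 2, 0],
    'Age':    [22.0, 28.0, 40.0, np.nan, np.nan],
})

# Hypothetical fallback order: finest grouping first, coarser ones afterwards
levels = [['Sex', 'Pclass', 'SibSp'], ['Sex', 'Pclass'], ['Pclass']]
for keys in levels:
    medians = demo.groupby(keys)['Age'].transform('median')
    demo['Age'] = demo['Age'].fillna(medians)

# Last resort: the overall median for rows that no grouping could fill
demo['Age'] = demo['Age'].fillna(demo['Age'].median())
print(demo)
```

The hand-picked criteria in this notebook have the advantage that each choice can be inspected and justified individually, which an automatic chain does not offer.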

#### Missing values in Fare column

In [None]:
# Find the details of the row where Fare is missing, i.e. nan
df_na = df_test[df_test['Fare'].isna()]
df_na

In [None]:
pclass = df_na['Pclass'].values[0]
age = df_na['Age'].values[0]
sibsp = df_na['SibSp'].values[0]
parch = df_na['Parch'].values[0]
ticket = df_na['Ticket'].values[0]
emb = df_na['Embarked'].values[0]

print(pclass, age, sibsp, parch, ticket, emb)

# Find the median of rows in the df_test dataframe which meet the conditions similar to the one in the missing fare row
msk = (df_test['Pclass'] == pclass) & (df_test['Parch'] == parch) & (df_test['SibSp'] == sibsp) & (df_test['Embarked'] == emb)
median_fare = df_test.loc[msk, 'Fare'].median()
df_test.loc[1044, 'Fare'] = median_fare

In [None]:
df_test.isna().sum()

#### Feature Engineering on Test Data

##### Interaction Feature

In [None]:
df_test['FamilySize'] = df_test['Parch'] + df_test['SibSp'] + 1
df_test['IsAlone'] = np.where(df_test['FamilySize'] == 1, 1, 0)

##### Age Bins

In [None]:
df_test['AgeBin'] = pd.qcut(df_test['Age'], 8, labels=[1, 2, 3, 4, 5, 6, 7, 8])
df_test.groupby(['AgeBin'])['Age'].count()
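
Note that `pd.qcut` computes its quantile edges from whatever data it receives, so bins built on the test data may cover different age ranges than bins built on the training data. One possible way to keep them consistent, sketched here on made-up ages, is to capture the training edges with `retbins=True` and apply them to the test ages with `pd.cut`. This is an illustration rather than what the cell above does.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for the train and test `Age` columns
train_age = pd.Series([2, 10, 18, 22, 26, 30, 38, 60], dtype=float)
test_age = pd.Series([1, 20, 28, 70], dtype=float)

# retbins=True also returns the quantile edges learned from the training ages
train_bins, edges = pd.qcut(train_age, 4, labels=[1, 2, 3, 4], retbins=True)

# Widen the outer edges so test ages outside the training range still get a bin
open_edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))
test_bins = pd.cut(test_age, bins=open_edges, labels=[1, 2, 3, 4])
print(edges, test_bins.tolist())
```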

##### Fare Bins

In [None]:
df_test['FarePerPerson'] = df_test['Fare']/(df_test['FamilySize'] + 1)

# create fare bins
df_test['FareBin'] = pd.qcut(df_test['FarePerPerson'], 5, labels=[1, 2, 3, 4, 5])
df_test.groupby(['FareBin'])['FarePerPerson'].count()

##### FamilySize Bins

In [None]:
df_test['FamilySizeGroup'] = 'Single'
df_test.loc[(df_test['FamilySize'] > 1) & (df_test['FamilySize'] <= 4), 'FamilySizeGroup'] = 'Small'
df_test.loc[(df_test['FamilySize'] > 4), 'FamilySizeGroup'] = 'Large'
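
The same three groups can be produced in a single call with `pd.cut`. A minimal sketch on made-up family sizes follows; the bin boundaries mirror the conditions above (1 is Single, 2 to 4 is Small, above 4 is Large).

```python
import pandas as pd

# Made-up family sizes
family_size = pd.Series([1, 2, 4, 5, 11])

# Bins are right-inclusive: (0, 1] -> Single, (1, 4] -> Small, (4, inf) -> Large
groups = pd.cut(family_size, bins=[0, 1, 4, float('inf')],
                labels=['Single', 'Small', 'Large'])
print(groups.tolist())
```

The result is a categorical column; calling `.astype(str)` on it gives plain strings if a later step expects them.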

##### Creating Titles

In [None]:
df_test['Title'] = df_test.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
df_test['Title'] = df_test['Title'].replace(['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df_test['Title'] = df_test['Title'].replace('Mlle', 'Miss')
df_test['Title'] = df_test['Title'].replace('Ms', 'Miss')
df_test['Title'] = df_test['Title'].replace('Mme', 'Mrs')
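
To see what the regular expression captures, the sketch below runs it on a few sample names written in the dataset's `Surname, Title. Given names` format. The pattern matches the first word that is preceded by a space and followed by a period.

```python
import pandas as pd

# Sample names in the dataset's "Surname, Title. Given names" format
names = pd.Series([
    'Kelly, Mr. James',
    'Wilkes, Mrs. James (Ellen Needs)',
    'Oliva y Ocana, Dona. Fermina',
])

# The raw string keeps the backslash intact for the regex engine
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
raw_titles = titles.tolist()

# Uncommon titles are merged into one category, as in the cell above
titles = titles.replace(['Dona'], 'Rare')
print(raw_titles, titles.tolist())
```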

##### Creating feature from Ticket Prefix and Ticket Number

In [None]:
#df_test['Ticket'] = df_test['Ticket'].astype(str)
#df_test['TicketPrefix'] = df_test['Ticket'].apply(lambda x: x.split(' ')[0])
#df_test['TicketNumber'] = df_test['Ticket'].apply(lambda x: x.split(' ')[-1])
#df_test['TicketCombined'] = df_test['TicketPrefix'] + df_test['TicketNumber']
# Apply one-hot encoding
#df_test = pd.get_dummies(df_test, columns=['TicketCombined'])

In [None]:
# Find columns that are missing in test data
# but are available in train data. These columns
# were added in the train data through feature engineering.
#train_cols = set(df_train.columns)
#test_cols = set(df_test.columns)
#missing_cols = train_cols - test_cols
#print("No. of Columns missing in test data:", len(missing_cols))
# Add missing columns in the test dataframe
#for col in df_train.columns:
#    if col.startswith('TicketCombined'):
#        if col not in df_test.columns:
#            df_test[col] = 0

#### Create NumPy Arrays from Test Dataframe

In [None]:
# One-hot encoding for Sex and Title features using pd.get_dummies() method
encode_sex = pd.get_dummies(df_test['Sex'], prefix='Sex')
encode_ttl = pd.get_dummies(df_test['Title'], prefix='Title')
df_test = pd.concat([df_test, encode_sex, encode_ttl], axis=1)
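
Because `pd.get_dummies` only creates columns for categories that actually occur, the test dataframe can end up with fewer or differently ordered `Title_*` columns than the training dataframe. A hedged sketch of one way to guard against that is to reindex the test dummies to the training columns; the title lists below are made up for illustration.

```python
import pandas as pd

# Made-up title columns; the test data lacks 'Master' and 'Rare'
train_titles = pd.Series(['Mr', 'Mrs', 'Miss', 'Master', 'Rare'])
test_titles = pd.Series(['Mr', 'Mrs', 'Miss'])

train_dummies = pd.get_dummies(train_titles, prefix='Title')
test_dummies = pd.get_dummies(test_titles, prefix='Title')

# Reindex the test dummies to the training columns; missing categories become 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

Comparing `df_train.columns` with `df_test.columns` after encoding is a cheap way to confirm that both feature matrices line up before predicting.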

In [None]:
df_test.columns

In [None]:
# Drop columns that are not required
df = df_test.drop(['Name', 'Sex', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'FamilySize', 'FarePerPerson', 'Title'], axis=1)
xtest = df.values
df.head()
#xtest[0:5]

In [None]:
df.info()

#### Encode Categorical Variables in Test Dataframe

In [None]:
# Encode FamilySizeGroup feature using LabelEncoder() method
le_fsize = preprocessing.LabelEncoder()
le_fsize.fit(['Single','Small', 'Large'])
xtest[:,7] = le_fsize.transform(xtest[:,7])
xtest[0:5]

#### Normalize Test Data

In [None]:
xtest = preprocessing.StandardScaler().fit(xtest).transform(xtest.astype(float))
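
The cell above fits a new scaler on the test data. A common alternative, shown here only as a sketch on toy arrays, is to fit the scaler once on the training features and reuse it for the test features, so that both are expressed on the same scale. The names `x_train_demo` and `x_test_demo` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature arrays standing in for the train and test matrices
x_train_demo = np.array([[1.0], [2.0], [3.0]])
x_test_demo = np.array([[2.0], [4.0]])

scaler = StandardScaler().fit(x_train_demo)    # statistics come from the training data only
x_test_scaled = scaler.transform(x_test_demo)  # the test data reuses those statistics

# A test value equal to the training mean maps to 0
print(scaler.mean_, x_test_scaled.ravel())
```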

#### Predict Test Data

In [None]:
model_list = ['KNN', 'LogReg', 'SVM', 'Tree', 'GaussNB', 'XGBoost', 'Forest']

models = {'KNN': knn,
          'LogReg': logreg,
          'SVM': svm,
          'Tree': tree,
          'GaussNB': gnb,
          'XGBoost': xgb,
          'Forest': rfc
         }

#### Save Predictions to File

In [None]:
for m in model_list:
    model = models.get(m)
    # Predict
    ypred = model.predict(xtest)
    # Save predictions to file
    file_name = f'./kaggle-titanic-dataset/prediction_{m}.csv'
    df_predicted = pd.DataFrame({'Survived':ypred}, index=df_test.index)
    df_predicted.to_csv(file_name)
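
Kaggle expects the submission file to have exactly two columns, `PassengerId` and `Survived`. Because `df_test` is indexed by `PassengerId`, `to_csv` writes the index name as the first header. The sketch below checks this on a three-row example written to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Three made-up predictions indexed by PassengerId
index = pd.Index([892, 893, 894], name='PassengerId')
df_predicted = pd.DataFrame({'Survived': [0, 1, 0]}, index=index)

# Write to an in-memory buffer to inspect the resulting CSV text
buffer = io.StringIO()
df_predicted.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```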