# **Flow**

 1. **Inputting & Importing**
 2. **Data Preprocessing**
    * Missing Values
        * Cabin
        * Age
        * Embarked
        * Fare
    * Column Transformations
        * SibSp & Parch
        * Fare
        * Age
        * Ticket
        * Name
    * Categorical Encoding
        * Mean Encoding
        * One-Hot Encoding
        
 3. **Correlation & Feature Selection**
 4. **Splitting data**
 5. **Feature Scaling**
 6. **Models & Selection**
 7. **Learning Curve for Hyperparameters**
 8. **Final Model with Hyperparameters**
 9. **Submission**

# **1. Inputting & Importing**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import random

In [None]:
%config Completer.use_jedi = False
import warnings
warnings.filterwarnings("ignore")
sns.set(rc={'figure.figsize':(18,10)})

# Colors
cyan = '#00FFD1'
red = '#FF007D'
prussian = '#0075FF'
green = '#EEF622'
yellow = '#FFF338'
violet = '#9B65FF'
orange = '#FFA500'
blue = '#00EBFF'
vermillion = '#FF6900'

red2 = '#FF2626'
seagreen = '#28FFBF'
green2 = '#FAFF00'
navyblue = '#04009A'

darkgreen = '#206A5D'
lightgreen = '#CCF6C8'
pink = '#F35588'
mauve = '#BAABDA'
lightblue = '#1CC5DC'
mustard = '#FDB827'
deeppurple = '#723881'



color_list = [cyan,red,prussian,green,violet,orange,yellow,blue,vermillion,red2,seagreen,green2,navyblue,darkgreen,lightgreen,pink,mauve,lightblue,mustard,deeppurple]
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=color_list)

In [None]:
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import matplotlib
import re
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from IPython.display import Markdown, display

In [None]:
def printmd(string):
    display(Markdown(string))

In [None]:
train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.columns

In [None]:
test.columns

# **2. Data Preprocessing**

## Missing Values

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

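- About 20% of the **Age** values in the training set are missing.
- About 0.2% of the **Embarked** values in the training set are missing.
- About 77% of the **Cabin** values are missing, so the column is dropped for now.
- **Fare** has just one missing value, in the test set.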
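
The percentages above can be derived from a boolean missing-value mask. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.1],
})

# isnull() gives a True/False mask; the column mean of that mask is the
# fraction of missing entries, so multiplying by 100 gives a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```

Applying the same expression to `train` or `test` gives the share of missing values per column rather than the raw counts.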

### Cabin

In [None]:
#train_initial = train.copy()
#test_initial = test.copy()

In [None]:
train = train.drop(['Cabin'],axis=1)
test = test.drop(['Cabin'],axis=1)

### Age

In [None]:
sns.displot(data=train['Age'],kde=True,height=6.5,color=random.choice(color_list));

In [None]:
train['Age'] = train['Age'].fillna(train['Age'].mean())
test['Age'] = test['Age'].fillna(train['Age'].mean())

### Embarked

In [None]:
train['Embarked'].mode()

In [None]:
train['Embarked'] = train['Embarked'].fillna('S')

### Fare

In [None]:
sns.displot(train['Fare'],bins=25,color=random.choice(color_list));

In [None]:
test['Fare'] = test['Fare'].fillna(train['Fare'].mode()[0])

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

## Column Transformations

### SibSp and Parch

In [None]:
train['Family'] = train['SibSp']+train['Parch']
test['Family'] = test['SibSp']+test['Parch']
train=train.drop(['SibSp','Parch'],axis=1)
test=test.drop(['SibSp','Parch'],axis=1)

### Fare

In [None]:
train['Fare'] = train['Fare'].astype('int32')
test['Fare'] = test['Fare'].astype('int32')

### Age

In [None]:
train['Age'] = train['Age'].astype('int32')
test['Age'] = test['Age'].astype('int32')

### Ticket

In [None]:
Ticket_temp_train = train['Ticket'].value_counts()
Ticket_temp_test = test['Ticket'].value_counts()

In [None]:
Ticket_temp_train_df = pd.DataFrame({'ticket':Ticket_temp_train.index,'freq':Ticket_temp_train.values})
Ticket_temp_test_df = pd.DataFrame({'ticket':Ticket_temp_test.index,'freq':Ticket_temp_test.values})

In [None]:
def analyse_tickets(freq_to_stop_at,dataframe):
    # Display the rows of every ticket, grouped by ticket frequency,
    # for all tickets that occur at least freq_to_stop_at times
    flag = None
    for i in range(len(Ticket_temp_train_df)): # rows are sorted by descending frequency
        ticket_name = Ticket_temp_train_df.iloc[i,0]
        ticket_freq = Ticket_temp_train_df.iloc[i,1]

        if ticket_freq < freq_to_stop_at: # all remaining tickets are rarer, so stop
            break

        if flag != ticket_freq: # a new frequency group starts: print its header
            flag = ticket_freq
            printmd('---')
            printmd('### Ticket frequency: **%d**'%(ticket_freq))

        printmd(' #### *Ticket Name:* **%s**'%(ticket_name))
        display(dataframe.loc[dataframe['Ticket']==ticket_name])
        print('\n\n') # End of one ticket

In [None]:
analyse_tickets(6,train) # Enter frequency to stop at and dataframe to work with

#### **Observations**

* Some passengers have more than one cabin.
* Some passengers who are not from the same family share a cabin.
* Should Age be binned into categories?
* Hardly any 3rd-class passengers have a cabin recorded.
* The few 3rd-class passengers with a recorded cabin are mostly in cabins starting with F or G.
* Passengers on the same ticket are mostly in the same cabin and belong to the same class.

**Berth numbers were given for some passengers. Odd for lower berths and even for upper berths.** [source](https://www.encyclopedia-titanica.org/cabins.html)

#### Getting Ticket prefix values

In [None]:
c = -1
tick_1 = {}
for i in range(0,len(train['Ticket'])):
    c=c+1
    match = re.search('^[a-zA-Z]+',train.loc[i,'Ticket'])
    if (match):
        tick_1[c] = match.group()

In [None]:
tick1_s = pd.Series(tick_1)

In [None]:
tick_prefix_train = []
for i in range(0,len(train['Ticket'])):
    match = re.search('^[a-zA-Z]+',train.loc[i,'Ticket'])
    if (match):
        tick_prefix_train.append(match.group())
    else:
        tick_prefix_train.append('Null')
        
        
tick_prefix_test = []
for i in range(0,len(test['Ticket'])):
    match = re.search('^[a-zA-Z]+',test.loc[i,'Ticket'])
    if (match):
        tick_prefix_test.append(match.group())
    else:
        tick_prefix_test.append('Null')

In [None]:
train['Ticket_prefix'] = tick_prefix_train
test['Ticket_prefix'] = tick_prefix_test
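
The prefix extraction above can also be written with pandas string methods instead of an explicit loop. This is a sketch of an alternative, not the code used in this notebook; the ticket strings are made up in the style of the `Ticket` column:

```python
import pandas as pd

# Made-up ticket strings in the same style as the Titanic 'Ticket' column
tickets = pd.Series(['A/5 21171', 'PC 17599', '113803', 'STON/O2. 3101282'])

# str.extract returns the first capture group, or NaN where the pattern
# does not match; tickets without a leading run of letters become 'Null'
prefixes = tickets.str.extract(r'^([a-zA-Z]+)', expand=False).fillna('Null')
print(prefixes.tolist())
```

The regular expression is the same one used in the loops, so the result should match as long as every ticket is a string.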

In [None]:
train.head()

-----------

In [None]:
Ticket_pre_df = pd.DataFrame({'prefix':train['Ticket_prefix'].value_counts().index, 'freq':train['Ticket_prefix'].value_counts().values})

In [None]:
def analyse_prefix(freq_to_stop_at,dataframe):
    # Display the rows of every ticket prefix that occurs at least freq_to_stop_at times.
    # The loop starts at 1 to skip the most frequent entry, which is expected to be
    # 'Null' (tickets without a prefix)
    flag = None
    for i in range(1,len(Ticket_pre_df)): # rows are sorted by descending frequency
        ticket_name = Ticket_pre_df.iloc[i,0]
        ticket_freq = Ticket_pre_df.iloc[i,1]

        if ticket_freq < freq_to_stop_at: # all remaining prefixes are rarer, so stop
            break

        if flag != ticket_freq: # a new frequency group starts: print its header
            flag = ticket_freq
            printmd('---')
            printmd('### Ticket frequency: **%d**'%(int(ticket_freq)))

        printmd(' #### *Ticket Name:* **%s**'%(ticket_name))
        display(dataframe.loc[dataframe['Ticket_prefix']==ticket_name])
        print('\n\n') # End of one prefix

In [None]:
analyse_prefix(11,train) # show prefixes that occur at least 11 times

Grouping all unique tickets to a common value

In [None]:
for i in range(0,len(Ticket_temp_train_df.iloc[:,:])):
    if (Ticket_temp_train_df.loc[i,'freq'] == 1):
        train['Ticket'] = train['Ticket'].replace([ Ticket_temp_train_df.loc[i,'ticket'] ],'ticketcount_1')
        
for i in range(0,len(Ticket_temp_test_df.iloc[:,:])):
    if (Ticket_temp_test_df.loc[i,'freq'] == 1):
        test['Ticket'] = test['Ticket'].replace([ Ticket_temp_test_df.loc[i,'ticket'] ],'ticketcount_1')
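
The grouping of one-off tickets can be sketched without a row-by-row loop. The example below is an assumed alternative on made-up tickets, using `value_counts` and `where`:

```python
import pandas as pd

tickets = pd.Series(['347082', 'PC 17599', '347082', '113803', '347082'])

# Map every ticket to how often it occurs in the column
freq = tickets.map(tickets.value_counts())

# Keep tickets that occur more than once; replace the rest with one label
grouped = tickets.where(freq > 1, 'ticketcount_1')
print(grouped.tolist())
```

Because each ticket is looked up once instead of calling `replace` per unique ticket, this form tends to scale better on larger columns.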

In [None]:
train.head()

In [None]:
train['Ticket'].value_counts()

### Name

In [None]:
name_titles_train = []
for i in range(0,len(train['Name'])):
    title = (train.loc[i,'Name'].split(', ')[1]).split(' ')[0]
    name_titles_train.append(title)


name_titles_test = []
for i in range(0,len(test['Name'])):
    title = (test.loc[i,'Name'].split(', ')[1]).split(' ')[0]
    name_titles_test.append(title)

In [None]:
train['Title'] = name_titles_train
test['Title'] = name_titles_test
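
As a point of comparison, titles can also be pulled out with a single regular expression. This sketch is an alternative rather than the method used above, and it behaves slightly differently: it drops the trailing period and keeps multi-word titles whole, whereas the split-based loop keeps only the first word after the comma (including its period).

```python
import pandas as pd

# Names written in the "Surname, Title. Given names" format of the dataset
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```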

In [None]:
train = train.drop(['Name'],axis=1)
test = test.drop(['Name'],axis=1)

## Categorical Encoding

In [None]:
train.head(7)

In [None]:
# Categories

for i in (2,3,5,7,9,10):
    c = train.columns[i]
    printmd('### %s'%(c))
    display(train[c].value_counts())
    print(' ')

### Mean Encoding for **Ticket**, **Ticket_prefix** and **Title** columns

In [None]:
def Mean_Encoding(column_name):
    new_smooth_name = column_name+'_smean_encod'
    
    mean = train['Survived'].mean()
    agg= train.groupby(column_name)['Survived'].agg(['count','mean'])
    counts = agg['count']
    means = agg['mean']
    weight = 100
    smooth = (counts*means + weight*mean)/(counts+weight)
    
    train.loc[:,new_smooth_name] = train[column_name].map(smooth)
    test.loc[:,new_smooth_name] = test[column_name].map(smooth)    
    

In [None]:
Mean_Encoding('Ticket')

In [None]:
Mean_Encoding('Ticket_prefix')

In [None]:
Mean_Encoding('Title')
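
To make the smoothing formula in `Mean_Encoding` concrete, here is a small worked example on made-up data. The frame `toy` and the weight of 2 are assumptions chosen so the arithmetic is easy to follow; the notebook itself uses a weight of 100.

```python
import pandas as pd

toy = pd.DataFrame({
    'Title': ['A', 'A', 'A', 'B'],
    'Survived': [1, 1, 0, 0],
})
weight = 2  # much smaller than the notebook's 100, to suit four rows

prior = toy['Survived'].mean()  # overall survival rate: 2/4 = 0.5
agg = toy.groupby('Title')['Survived'].agg(['count', 'mean'])

# Weighted average of the category mean and the overall mean:
# A: (3 * 2/3 + 2 * 0.5) / (3 + 2) = 0.6
# B: (1 * 0   + 2 * 0.5) / (1 + 2) = 1/3
smooth = (agg['count'] * agg['mean'] + weight * prior) / (agg['count'] + weight)
print(smooth)
```

Rare categories are pulled strongly towards the overall mean, while frequent ones stay close to their own mean. A larger `weight` strengthens that pull.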

In [None]:
test.isnull().sum()

**The missing values in `test` mean that it contains categories that never occur in `train`, so `map` had no smoothed value to assign to them.**
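
The effect can be reproduced in isolation. In this sketch the encodings in `smooth` and the value of `prior` are made-up numbers, and filling with the prior is only one possible fallback:

```python
import pandas as pd

# Hypothetical smoothed encodings learned from the training set
smooth = pd.Series({'Mr.': 0.2, 'Mrs.': 0.7})
prior = 0.4  # assumed overall survival rate, made up for this example

test_titles = pd.Series(['Mr.', 'Dona.', 'Mrs.'])

encoded = test_titles.map(smooth)  # 'Dona.' is not in smooth, so it maps to NaN
filled = encoded.fillna(prior)     # fall back to the prior for unseen categories
print(encoded.isnull().sum(), filled.tolist())
```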

#### Missing values after mean Encoding

In [None]:
sns.displot(data=train['Ticket_smean_encod'],kde=True,height=6.5,color=random.choice(color_list));

In [None]:
sns.displot(data=train['Ticket_prefix_smean_encod'],kde=True,height=6.5,color=random.choice(color_list));

In [None]:
sns.displot(data=train['Title_smean_encod'],kde=True,height=6.5,color=random.choice(color_list));

In [None]:
test['Ticket_smean_encod'] = test['Ticket_smean_encod'].fillna(train['Ticket_smean_encod'].mean())
test['Ticket_prefix_smean_encod'] = test['Ticket_prefix_smean_encod'].fillna(train['Ticket_prefix_smean_encod'].mean())
test['Title_smean_encod'] = test['Title_smean_encod'].fillna(train['Title_smean_encod'].mean())

In [None]:
test.isnull().sum()

### One-Hot Encoding for **Sex**, **Embarked** and **Pclass** columns

In [None]:
# Sex

train['Sex_female'] = pd.get_dummies(train.Sex, prefix='Sex')['Sex_female']
train['Sex_male'] = pd.get_dummies(train.Sex, prefix='Sex')['Sex_male']
test['Sex_female'] = pd.get_dummies(test.Sex, prefix='Sex')['Sex_female']
test['Sex_male'] = pd.get_dummies(test.Sex, prefix='Sex')['Sex_male']

In [None]:
# Pclass

train['Pclass_1'] = pd.get_dummies(train.Pclass, prefix='Pclass')['Pclass_1']
train['Pclass_2'] = pd.get_dummies(train.Pclass, prefix='Pclass')['Pclass_2']
train['Pclass_3'] = pd.get_dummies(train.Pclass, prefix='Pclass')['Pclass_3']

test['Pclass_1'] = pd.get_dummies(test.Pclass, prefix='Pclass')['Pclass_1']
test['Pclass_2'] = pd.get_dummies(test.Pclass, prefix='Pclass')['Pclass_2']
test['Pclass_3'] = pd.get_dummies(test.Pclass, prefix='Pclass')['Pclass_3']

In [None]:
# Embarked

train['Embarked_C'] = pd.get_dummies(train.Embarked, prefix='Embarked')['Embarked_C']
train['Embarked_Q'] = pd.get_dummies(train.Embarked, prefix='Embarked')['Embarked_Q']
train['Embarked_S'] = pd.get_dummies(train.Embarked, prefix='Embarked')['Embarked_S']

test['Embarked_C'] = pd.get_dummies(test.Embarked, prefix='Embarked')['Embarked_C']
test['Embarked_Q'] = pd.get_dummies(test.Embarked, prefix='Embarked')['Embarked_Q']
test['Embarked_S'] = pd.get_dummies(test.Embarked, prefix='Embarked')['Embarked_S']
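
Encoding `train` and `test` separately works only when both contain the same categories. The sketch below, on made-up frames, shows one way to guard against a category that is missing from the test data by reindexing to the training columns:

```python
import pandas as pd

toy_train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
toy_test = pd.DataFrame({'Embarked': ['S', 'C']})  # 'Q' never appears here

train_d = pd.get_dummies(toy_train, columns=['Embarked'], dtype=int)
test_d = pd.get_dummies(toy_test, columns=['Embarked'], dtype=int)

# Give the test frame exactly the training columns; categories that are
# absent from the test data become all-zero columns
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_d)
```

Passing `columns=[...]` to `pd.get_dummies` also encodes a column in a single call instead of one call per dummy.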

In [None]:
train.columns

In [None]:
train.head()

# **3. Correlation and Feature Selection**

In [None]:
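fig, ax = plt.subplots(figsize=(18,16)) 
my_c = sns.diverging_palette(20, 220, as_cmap=True)
# keep only numeric and dummy columns; text columns such as Sex or Ticket cannot be correlated
corr = train.select_dtypes(include=['number','bool']).corr()
mask = np.triu(corr) # hide the redundant upper triangle
sns.heatmap(corr,cmap=my_c,linewidths=1.5,ax=ax,annot=True,center=0,square=True,mask=mask);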

In [None]:
# df_train = train[['Age','Fare','Family','Ticket_smean_encod','Ticket_prefix_smean_encod','Title_smean_encod','Sex_female','Pclass_1','Pclass_2','Embarked_C',
#        'Embarked_Q','Survived']] # omitted extra dummy variables
# df_test = test[['Age','Fare','Family','Ticket_smean_encod','Ticket_prefix_smean_encod','Title_smean_encod','Sex_female','Pclass_1','Pclass_2','Embarked_C',
#        'Embarked_Q']] # omitted extra dummy variables

# df_train = train[['Age','Fare','Family','Ticket_prefix_smean_encod','Sex_female','Pclass_1','Pclass_2','Embarked_C',
#        'Embarked_Q','Survived']] # omitted extra dummy variables
# df_test = test[['Age','Fare','Family','Ticket_prefix_smean_encod','Sex_female','Pclass_1','Pclass_2','Embarked_C',
#        'Embarked_Q']] # omitted extra dummy variables

# This feature set gave 77.9% accuracy
# .copy() makes independent frames, so scaling them in place later does not touch train/test
df_train = train[['Age','Fare','Ticket_prefix_smean_encod','Sex_female','Pclass_1','Pclass_2','Embarked_C',
       'Embarked_Q','Survived']].copy() # omitted extra dummy variables
df_test = test[['Age','Fare','Ticket_prefix_smean_encod','Sex_female','Pclass_1','Pclass_2','Embarked_C',
       'Embarked_Q']].copy() # omitted extra dummy variables



In [None]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
fig, ax = plt.subplots(figsize=(16,14)) 
my_c = sns.diverging_palette(20, 220, as_cmap=True)
mask = np.triu(df_train.corr())
sns.heatmap(df_train.corr(),cmap=my_c,linewidths=1.5,ax=ax,annot=True,center=0,square=True,mask=mask);
plt.savefig('correlation.png')

# **4. Splitting data**

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

X = df_train.iloc[:,:-1]
y = df_train.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
X.head()

# **5. Feature Scaling**

In [None]:
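from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# cast to float first so the scaled values can be written back into the integer and dummy columns
X_train = X_train.astype('float64')
X_test = X_test.astype('float64')
X_train.iloc[:,:6] = sc.fit_transform(X_train.iloc[:,:6]) # fit on the training split only
X_test.iloc[:,:6] = sc.transform(X_test.iloc[:,:6]) # reuse the training statistics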

In [None]:
X_train.head()

In [None]:
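# Scale the full training data and the Kaggle test data for the final submission models
sc = StandardScaler()
X = X.astype('float64') # also detaches X from df_train
df_test = df_test.astype('float64')
X.iloc[:,:6] = sc.fit_transform(X.iloc[:,:6])
df_test.iloc[:,:6] = sc.transform(df_test.iloc[:,:6])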
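
Selecting columns by position is easy to get out of sync when the feature list changes. As a sketch of an alternative, `ColumnTransformer` can scale columns by name and pass the rest through unchanged. The frames and the choice of `continuous` columns below are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

toy_train = pd.DataFrame({'Age': [20, 30, 40, 50],
                          'Fare': [10, 20, 30, 40],
                          'Sex_female': [0, 1, 1, 0]})
toy_test = pd.DataFrame({'Age': [35], 'Fare': [25], 'Sex_female': [1]})

continuous = ['Age', 'Fare']  # columns chosen for this example
ct = ColumnTransformer([('scale', StandardScaler(), continuous)],
                       remainder='passthrough')

train_scaled = ct.fit_transform(toy_train)  # statistics learned from training rows only
test_scaled = ct.transform(toy_test)        # reuses the training mean and std
print(test_scaled)
```

The output is a NumPy array with the scaled columns first, followed by the passed-through columns.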

# **6. Models & Selection**

**Run only one of these four models, then go to the Submission section.**

## 1. Random Forest

### Testing accuracy

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=13,random_state=0)
rfc.fit(X_train,y_train)

In [None]:
y_pred = rfc.predict(X_test.iloc[:,:])

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

### <span style="background-color:LightGreen;">Actual</span> (For Submission)

In [None]:
sc = StandardScaler()
X.iloc[:,:6] = sc.fit_transform(X.iloc[:,:6])
df_test.iloc[:,:6] = sc.transform(df_test.iloc[:,:6])

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X,y)

In [None]:
y_pred = rfc.predict(df_test.iloc[:,:])

## 2. Logistic Regression

### Testing accuracy

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)

In [None]:
y_pred = lr.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

### <span style="background-color:LightGreen;">Actual</span>

In [None]:
lr = LogisticRegression()
lr.fit(X,y)

y_pred = lr.predict(df_test.iloc[:,:])


## 3. Gradient Boosting

### Testing Accuracy

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
gbc = GradientBoostingClassifier(n_estimators=50, learning_rate=0.06,max_depth=5, random_state=0).fit(X_train, y_train)
#gbc.score(X_test, y_test)

In [None]:
y_pred = gbc.predict(X_test)
accuracy_score(y_test,y_pred)

### <span style="background-color:LightGreen;">Actual</span>

In [None]:
gbc = GradientBoostingClassifier(n_estimators=50, learning_rate=0.06,max_depth=5, random_state=0).fit(X, y)
y_pred = gbc.predict(df_test.iloc[:,:])

## 4. Naive Bayes

### Test

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

In [None]:
y_pred = gnb.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

### <span style="background-color:LightGreen;">Actual</span>

In [None]:
gnb = GaussianNB()
gnb.fit(X, y)
y_pred = gnb.predict(df_test.iloc[:,:])
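
A single train/test split can favour one model by chance. The sketch below compares the same four model families with 5-fold cross-validation. It runs on synthetic data from `make_classification` as a stand-in for the prepared features, and the settings are assumptions rather than the ones tuned in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the prepared Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=50, random_state=0),
    'Naive Bayes': GaussianNB(),
}

# Mean accuracy over 5 folds for each model
cv_scores = {name: cross_val_score(m, X_demo, y_demo, cv=5, scoring='accuracy').mean()
             for name, m in models.items()}
for name, score in cv_scores.items():
    print('%s: %.3f' % (name, score))
```

Replacing `X_demo` and `y_demo` with `X` and `y` would give a comparison on the actual data.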

# **7. Learning Curve Plot for Gradient Boosting Hyperparameters**

In each of the following plots, the best value is the point on the x-axis where the Test curve peaks, just before it starts to decrease.
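
Reading the peak off a plot can also be done programmatically. A minimal sketch, using made-up scores (`demo_values` and `demo_test_scores` are hypothetical names and numbers):

```python
import numpy as np

# Made-up scores for five candidate values of a hyperparameter
demo_values = [1, 2, 3, 4, 5]
demo_test_scores = [0.78, 0.81, 0.83, 0.82, 0.80]

# np.argmax returns the index of the first maximum
best_value = demo_values[int(np.argmax(demo_test_scores))]
print(best_value)
```

Scores from a single split are noisy, so a peak that is only marginally higher than its neighbours may not be a meaningful optimum.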

### Max Depth

In [None]:
values = [i for i in range(1, 15, 1)] ## Max Depth

In [None]:
values

In [None]:
train_scores = []
test_scores = []
# evaluate a gradient boosting model for each max depth
for i in values:
    # configure the model
    model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,max_depth=i, random_state=0)
    # fit model on the training dataset
    model.fit(X_train, y_train)
    # evaluate on the train dataset
    train_yhat = model.predict(X_train)
    train_acc = accuracy_score(y_train, train_yhat)
    train_scores.append(train_acc)
    # evaluate on the test dataset
    test_yhat = model.predict(X_test)
    test_acc = accuracy_score(y_test, test_yhat)
    test_scores.append(test_acc)
    # summarize progress
    print('>%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))

In [None]:
# Max Depth
plt.plot(values,train_scores,color=blue,label='Train');
plt.plot(values,test_scores,color=red,label='Test');
plt.legend();

Therefore, **Max depth = 5**

### N Estimators

In [None]:
values = [i for i in range(75, 200, 5)] ## N Estimators

In [None]:
values

In [None]:
train_scores = []
test_scores = []
# evaluate a gradient boosting model for each number of estimators
for i in values:
    # configure the model
    model = GradientBoostingClassifier(n_estimators=i, learning_rate=0.1,max_depth=5, random_state=0)
    # fit model on the training dataset
    model.fit(X_train, y_train)
    # evaluate on the train dataset
    train_yhat = model.predict(X_train)
    train_acc = accuracy_score(y_train, train_yhat)
    train_scores.append(train_acc)
    # evaluate on the test dataset
    test_yhat = model.predict(X_test)
    test_acc = accuracy_score(y_test, test_yhat)
    test_scores.append(test_acc)
    # summarize progress
    print('>%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))

In [None]:
# n_estimators
plt.plot(values,train_scores,color=blue,label='Train');
plt.plot(values,test_scores,color=red,label='Test');
plt.legend();

Therefore, **n_estimators=100**

### Learning Rate

In [None]:
values = [i for i in np.linspace(0.01,0.1,30)] ## Learning Rate

In [None]:
values

In [None]:
train_scores = []
test_scores = []
# evaluate a gradient boosting model for each learning rate
for i in values:
    # configure the model
    model = GradientBoostingClassifier(n_estimators=100, learning_rate=i,max_depth=5, random_state=0)
    # fit model on the training dataset
    model.fit(X_train, y_train)
    # evaluate on the train dataset
    train_yhat = model.predict(X_train)
    train_acc = accuracy_score(y_train, train_yhat)
    train_scores.append(train_acc)
    # evaluate on the test dataset
    test_yhat = model.predict(X_test)
    test_acc = accuracy_score(y_test, test_yhat)
    test_scores.append(test_acc)
    # summarize progress
    print('>%3f, train: %.3f, test: %.3f' % (i, train_acc, test_acc))

In [None]:
# Learning rate
plt.plot(values,train_scores,color=blue,label='Train');
plt.plot(values,test_scores,color=red,label='Test');
plt.legend();

Therefore, **learning_rate=0.072069**
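
The three hyperparameters above were tuned one at a time, each with the others held fixed. A joint search is a common alternative. The following sketch uses `GridSearchCV` on synthetic data, and the grid values are assumptions kept small so that it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {
    'max_depth': [2, 3],
    'n_estimators': [25, 50],
    'learning_rate': [0.05, 0.1],
}

# Every combination in the grid is scored with 3-fold cross-validation
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Because it scores combinations rather than single parameters, a joint search can find settings that a one-at-a-time sweep would miss, at the cost of more model fits.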

# **8. Final Model with hyperparameters**

### Test

In [None]:
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.072069,max_depth=5, random_state=0)
gbc.fit(X_train,y_train)
y_pred = gbc.predict(X_test)
accuracy_score(y_test,y_pred)

### <span style="background-color:LightGreen;">**Actual**</span>

In [None]:
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.072069,max_depth=5, random_state=0).fit(X, y)
y_pred = gbc.predict(df_test.iloc[:,:])

# **9. Submission**

In [None]:
Submission = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': y_pred}) # Kaggle expects the column name 'PassengerId'

In [None]:
Submission.to_csv('submission.csv', index=False)
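
Before uploading, it can help to check the file's shape. The helper below is a hypothetical sketch (`check_submission` is not part of any library) that verifies the column names, the row count and that the predictions are 0 or 1:

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Return True if df has the two expected columns, the expected
    number of rows, and only 0/1 values in 'Survived'."""
    return (list(df.columns) == ['PassengerId', 'Survived']
            and len(df) == expected_rows
            and set(df['Survived'].unique()) <= {0, 1})

demo = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
print(check_submission(demo, expected_rows=3))
```

For the real file, `expected_rows` would be `len(test)`.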

# **Conclusion**

Gradient Boosting (scikit-learn's `GradientBoostingClassifier`) works best, with 77.9% accuracy.