<a href="https://www.kaggle.com/code/veravarela/predict-survival-on-the-titanic?scriptVersionId=138410782" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Libraries and Others

## Libraries

In [None]:
import sys
import os

import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_score, KFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
import seaborn as sns
from plotly.subplots import make_subplots
import plotly.graph_objects as go

## Hide Warnings

In [None]:
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

# Import and Load Data

## Import Data

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Load Train Data

In [None]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

## Load Test Data

In [None]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

# Exploratory Data Analysis (EDA) - Train and Test Data

## Data Overview

In [None]:
train_data.describe()

##### Analyzing the statistics of the numerical variables: (i) the "Pclass" values look normal - the min. and max. show that there are 3 classes; (ii) the "Age" values look normal too - the min. and max. are plausible (this variable doesn't seem to have outliers, but I'll check it in the outliers step); (iii) the values of "SibSp" (# of Siblings/Spouses Aboard the Titanic) and "Parch" (# of Parents/Children Aboard the Titanic) look suspicious - it's unusual to have 8 siblings on the trip, or 6 children, when most passengers have 0 (I'll check it in the outliers step); (iv) I'll also analyze the "Fare" values in the outliers step because the max. value is very large; (v) the counts show that there are missing values for the "Age" variable - I'll check it in the missing values step.

In [None]:
train_data.describe(include=['O'])

##### Analyzing the statistics of the categorical variables: (i) the counts show that there are missing values for the "Cabin" and "Embarked" variables - I'll check it in the missing values step.

## Analyze the Correlation Between the Variables

In [None]:
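cmap = sns.cubehelix_palette(start=2)
# Correlate the numeric columns only; text columns such as "Name" have no correlation
corr = train_data.corr(numeric_only=True)
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(10,5))
sns.heatmap(corr, vmin=-1, vmax=1, annot=True, mask=mask, cmap=cmap);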

##### Note: As we can see, the variables are not strongly correlated - most pairs have a correlation below 30% in absolute value. Only two pairs have a correlation above 30% - Fare/Pclass and Parch/SibSp. In this work I'll keep all the variables. In the future, I'll analyze these two pairs more closely and exclude one variable from each pair to see whether it affects my prediction.
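The pairs mentioned in the note can also be listed programmatically instead of being read off the heatmap. The following is a minimal sketch on a toy DataFrame (the values are illustrative, not the real Titanic data); the helper name `correlated_pairs` and the 0.3 threshold are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few numeric Titanic columns (illustrative values only)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "Fare":   [7.25, 71.28, 7.92, 53.10, 8.05, 13.00, 51.86, 21.08],
    "SibSp":  [1, 1, 0, 1, 0, 0, 0, 3],
    "Parch":  [0, 0, 0, 0, 0, 0, 0, 1],
})

def correlated_pairs(df, threshold=0.3):
    """Return (col_a, col_b, corr) for every pair with |corr| above threshold."""
    corr = df.corr(numeric_only=True)
    # Keep the strict lower triangle so that each pair appears only once
    lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
    pairs = lower.stack()
    pairs = pairs[pairs.abs() > threshold]  # also discards any remaining NaN
    return [(a, b, round(float(v), 2)) for (a, b), v in pairs.items()]

pairs = correlated_pairs(toy)
for col_a, col_b, value in pairs:
    print(f"{col_a} / {col_b}: {value}")
```

On the real data, the same helper could be called as `correlated_pairs(train_data)`.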

## Missing Values

### Searching for Missing Values - Train Data

In [None]:
train_data.count()

### Confirming Missing Values - Train Data

In [None]:
train_data.isna().sum()

##### Dealing with missing values: (i) "Age" is a numeric variable with less than 1/4 of its data missing, so I'll replace the missing values with the median; (ii) I'll exclude the "Cabin" variable because more than 3/4 of its data is missing - I can't use a variable with so much missing data, and I can't drop those rows either because I would have almost no data left for the analysis; (iii) I'll remove the rows with missing "Embarked" values, provided the test_data set has no missing values for "Embarked" - it's a categorical variable and I could try to infer the value for these two rows from the other variables, but since it's just 2 rows in 891, I don't think the time spent is worth it.

In [None]:
test_data['Embarked'].count()

# no missing values: the count equals the number of rows in test_data (418)

### Deal With Missing Values - Train Data

#### Replace 'Age' Missing Values by 'Age' Median

In [None]:
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())

#### Remove Rows with "Embarked" Missing Values

In [None]:
train_data[train_data['Embarked'].isna()]

In [None]:
train_data.dropna(subset=['Embarked'], how='all', inplace=True)

### Searching for Missing Values - Test Data

In [None]:
test_data.count()

### Deal with Missing Values - Test Data

#### Replace 'Age' Missing Values by 'Age' Median

In [None]:
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())

#### Replace 'Fare' Missing Values by 'Fare' Median (for the same reason as for the 'Age' missing values)

In [None]:
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())
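The cells above fill each dataset with its own median. A common alternative, sketched here on toy data, is to compute the medians on the training set only and reuse them for the test set, so that both sets are imputed with the same values. This is not what the notebook does; the names `train_toy`, `test_toy` and `medians` are assumptions for the example.

```python
import pandas as pd

# Toy frames with gaps in "Age" and "Fare" (illustrative values only)
train_toy = pd.DataFrame({"Age": [22.0, None, 38.0, 26.0],
                          "Fare": [7.25, 71.28, 7.92, 8.05]})
test_toy = pd.DataFrame({"Age": [None, 47.0],
                         "Fare": [None, 7.0]})

# Medians are learned from the training data only
medians = train_toy[["Age", "Fare"]].median()

# fillna with a Series fills each column with the value under the matching label
train_filled = train_toy.fillna(medians)
test_filled = test_toy.fillna(medians)

print(medians)
print(test_filled)
```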

## Outliers

### Searching for Outliers - Train Data

In [None]:
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=('Age', '# of Siblings / Spouses Aboard the Titanic (SibSp)', '# of Parents / Children Aboard the Titanic (Parch)', 'Fare'))

fig.add_trace(go.Box(x=train_data['Age']),
              row=1, col=1)

fig.add_trace(go.Box(x=train_data['SibSp']),
              row=1, col=2)

fig.add_trace(go.Box(x=train_data['Parch']),
              row=2, col=1)

fig.add_trace(go.Box(x=train_data['Fare']),
              row=2, col=2)

fig.update_layout(height=500, width=1000, yaxis_visible=False, yaxis2_visible=False, yaxis3_visible=False, yaxis4_visible=False, showlegend=False)

fig.show()

##### Note: I will check the number of passengers with: (i) "Age" over 66, (ii) "SibSp" over 5 and (iii) "Fare" over 300. I can't simply remove these rows if the test_data dataset also contains such passengers, because I need to predict the survival of all 418 passengers in test_data. So, if only a small number of passengers (in train_data and test_data) have these characteristics, I'll replace the values with the median; if a considerable number do, I'll analyze them separately.
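The thresholds in the note were read off the box plots. As a hedged alternative, cut-offs can be derived with the 1.5 × IQR rule that box plots conventionally use for their whiskers. The sketch below uses a toy fare series; the helper name `iqr_bounds` and the factor `k=1.5` are assumptions for this example and will not necessarily reproduce the thresholds chosen above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy fares with one extreme value (illustrative values only)
fares = pd.Series([7.25, 7.92, 8.05, 13.0, 21.08, 26.0, 51.86, 512.33])

low, high = iqr_bounds(fares)
is_outlier = (fares < low) | (fares > high)
n_outliers = int(is_outlier.sum())

print(f"fences: [{low:.2f}, {high:.2f}], outliers: {n_outliers}")
```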

#### Check the Number of Passengers with "Age" over 66 years old, "SibSp" over 5 and "Fare" over 300 - Train and Test Data

In [None]:
train_data.dtypes

In [None]:
train_data['Age'][train_data['Age'] > 66].count()

In [None]:
train_data['SibSp'][train_data['SibSp'] > 5].count()

In [None]:
train_data['Fare'][train_data['Fare'] > 300].count()

In [None]:
test_data['Age'][test_data['Age'] > 66].count()

In [None]:
test_data['SibSp'][test_data['SibSp'] > 5].count()

In [None]:
test_data['Fare'][test_data['Fare'] > 300].count()

### Deal with Outliers - Train Data

In [None]:
median_age = train_data['Age'].median()
train_data['Age'] = np.where(train_data['Age'] > 66, median_age, train_data['Age'])

In [None]:
median_sibsp = train_data['SibSp'].median()
train_data['SibSp'] = np.where(train_data['SibSp'] > 5, median_sibsp, train_data['SibSp'])

In [None]:
median_fare = train_data['Fare'].median()
train_data['Fare'] = np.where(train_data['Fare'] > 300, median_fare, train_data['Fare'])

### Deal with Outliers - Test Data

In [None]:
median_age = test_data['Age'].median()
test_data['Age'] = np.where(test_data['Age'] > 66, median_age, test_data['Age'])

In [None]:
median_sibsp = test_data['SibSp'].median()
test_data['SibSp'] = np.where(test_data['SibSp'] > 5, median_sibsp, test_data['SibSp'])

In [None]:
median_fare = test_data['Fare'].median()
test_data['Fare'] = np.where(test_data['Fare'] > 300, median_fare, test_data['Fare'])

## Transforming Variables

### Embarked Variable - Train Data

#### Analyze Embarked Data Type

In [None]:
print(train_data['Embarked'].unique())

#### Create a New Column with the Port Embarkation

In [None]:
train_data['Port_Embarkation'] = np.where(train_data['Embarked'].astype(str).str[0] == 'C', 'C',
                                 np.where(train_data['Embarked'].astype(str).str[0] == 'Q', 'Q',
                                 np.where(train_data['Embarked'].astype(str).str[0] == 'S', 'S',
                                 'NaN')))

#### Count the Number of Passengers by Port Embarkation

In [None]:
train_data.groupby(['Port_Embarkation'])['PassengerId'].count()

### Embarked Variable - Test Data

#### Create a New Column with the Port Embarkation

In [None]:
test_data['Port_Embarkation'] = np.where(test_data['Embarked'].astype(str).str[0] == 'C', 'C',
                                np.where(test_data['Embarked'].astype(str).str[0] == 'Q', 'Q',
                                np.where(test_data['Embarked'].astype(str).str[0] == 'S', 'S',
                                'NaN')))

#### Count the Number of Passengers by Port Embarkation

In [None]:
test_data.groupby(['Port_Embarkation'])['PassengerId'].count()

# Modeling - Choose the Model

## Features to Consider

In [None]:
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Port_Embarkation']

train = train_data[features]
test = test_data[features]

## Transform Categorical Variables into Numeric Variables

In [None]:
X_train = pd.get_dummies(train)
y_train = train_data['Survived']

X_test = pd.get_dummies(test)

### Confirm If I Have the Same Number of Columns in Both Datasets

In [None]:
print(len(X_train.columns))
print(len(X_test.columns))
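Equal column counts do not guarantee that both datasets have the same columns in the same order: a category present in only one dataset produces a dummy column the other lacks. A minimal sketch of one way to align them, using toy frames (the names `train_toy` and `test_toy` are assumptions for the example), is to reindex the test dummies on the training columns:

```python
import pandas as pd

# Toy frames: "C" appears only in train, "Q" only in test
train_toy = pd.DataFrame({"Sex": ["male", "female"],
                          "Port_Embarkation": ["S", "C"]})
test_toy = pd.DataFrame({"Sex": ["female", "male"],
                         "Port_Embarkation": ["Q", "S"]})

X_train_toy = pd.get_dummies(train_toy)

# Keep exactly the training columns: missing ones are added as 0,
# columns unseen during training are dropped
X_test_toy = pd.get_dummies(test_toy).reindex(columns=X_train_toy.columns,
                                              fill_value=0)

print(list(X_train_toy.columns))
print(list(X_test_toy.columns))
```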

## Run some Models to Choose the Best One

##### Note: to choose, I'll use the 'accuracy', 'precision' and 'neg_mean_squared_error' scores

In [None]:
cv = KFold(n_splits=10, shuffle=True, random_state=1)

models = [LogisticRegression(random_state=20), 
          DecisionTreeClassifier(random_state=20),
          KNeighborsClassifier(n_neighbors=5),
          SVC(random_state=20),
          RandomForestClassifier(random_state=20),
          xgb.XGBClassifier(random_state=20)]

names = ['LogisticRegression', 'Decision Tree', 'K Neighbors','SVC','Random Forest','XGBoost']

for model, name in zip(models, names):
    print(name)
    for score in ['accuracy', 'precision', 'neg_mean_squared_error']:
        result = cross_val_score(model, X_train, y_train, scoring=score, cv=cv)
        print(score,': %.4f (%.3f)' % (np.mean(result), np.std(result)))
    print('\r\n')

##### Note: the best models were Random Forest and XGBoost, so I'll use them in the next step
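The loop above calls `cross_val_score` once per metric, so each model is cross-validated three times. As a sketch of an alternative, `cross_validate` accepts a list of scorers and evaluates them in a single pass. The example runs on synthetic data from `make_classification`, not on the Titanic features; `X_demo` and `y_demo` are assumptions for the example. It also shows that for 0/1 labels the negative mean squared error is simply accuracy minus 1, so it carries no extra information.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary classification data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
scoring = ["accuracy", "precision", "neg_mean_squared_error"]

# One cross-validation run that returns every metric per fold
scores = cross_validate(LogisticRegression(max_iter=1000), X_demo, y_demo,
                        scoring=scoring, cv=cv)

for metric in scoring:
    values = scores[f"test_{metric}"]
    print(f"{metric}: {values.mean():.4f} ({values.std():.3f})")
```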

# Modeling - Hyperparameter Tuning using Randomized Search Type

## Random Forest Classifier Model

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

param_space = {'n_estimators': [20,40,60,80,100,120,140,160],
               'criterion': ['gini', 'entropy', 'log_loss'],
               'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None],
               'min_samples_split': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
               'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
               'min_weight_fraction_leaf': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.10],
               'max_features': ['sqrt', 'log2', None],
               'max_leaf_nodes': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None],
               'min_impurity_decrease': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
               'verbose': [0, 1, 2, 3],
               'max_samples': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
               }

rfc = RandomForestClassifier()


rfc_randomsearch = RandomizedSearchCV(rfc, param_space, scoring='accuracy', cv=5, random_state = 10)

rfc_randomsearch.fit(X_train, y_train)

print(rfc_randomsearch.best_params_)

print('Accuracy with hyperparameter tuning Randomized Search: %.2f' % ( rfc_randomsearch.best_score_))
# 62%
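Several values in the search space above fall outside what `RandomForestClassifier` accepts or can use: `min_samples_split` must be at least 2 when given as an integer, `min_weight_fraction_leaf` must lie in [0, 0.5], and `max_leaf_nodes` must be at least 2. In addition, the Gini impurity of a binary problem never exceeds 0.5, so a `min_impurity_decrease` of 1 or more prevents any split, and integer `max_samples` values of 1 to 10 give each tree only that many rows. This would be consistent with the 62% score, which is close to the share of non-survivors in the training data. The sketch below shows a search space restricted to valid ranges, run on synthetic data; the specific values, `n_iter=5` and `cv=3` are assumptions chosen to keep the example fast, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

param_space = {
    "n_estimators": [20, 40, 60, 80, 100],
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 4, 6, 8, 10],         # integers must be >= 2
    "min_samples_leaf": [1, 2, 4, 8],
    "max_features": ["sqrt", "log2", None],
    "max_leaf_nodes": [8, 16, 32, None],           # integers must be >= 2
    "min_impurity_decrease": [0.0, 0.001, 0.01],   # small values, so splits remain possible
    "max_samples": [0.5, 0.75, None],              # fractions of the rows per tree
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=20), param_space,
                            n_iter=5, scoring="accuracy", cv=3, random_state=10)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Best cross-validated accuracy: %.2f" % search.best_score_)
```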

## XGBoost Model

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

param_space = {'booster': ['gbtree', 'gblinear', 'dart'],
               'verbosity': [0, 1, 2, 3],
               'max_depth': [3, 4, 5, 6, 7, 8, 9, 10]
               }

xgbclass = xgb.XGBClassifier()


xgbclass_randomsearch = RandomizedSearchCV(xgbclass, param_space, scoring='accuracy', cv=5, random_state = 10)

xgbclass_randomsearch.fit(X_train, y_train)

print(xgbclass_randomsearch.best_params_)

print('Accuracy with hyperparameter tuning Randomized Search: %.2f' % ( xgbclass_randomsearch.best_score_))
# 83%

##### Note: the best model after hyperparameter tuning was XGBoost, with 83% accuracy.

# Modeling - Applying the Best Model (XGBoost)

In [None]:
best_model = xgb.XGBClassifier(verbosity = 3, max_depth = 3, booster = 'gbtree', random_state=20)
best_model.fit(X_train, y_train)
y_predict = best_model.predict(X_test)

# Analyze the Results

## Big Picture

In [None]:
# Copy so that adding the prediction column does not modify test_data
results = test_data.copy()
results['Survived_Predict'] = y_predict
results

In [None]:
results.describe()

## How Many Passengers Do I Predict Survived?

In [None]:
survivors_count = results.loc[results['Survived_Predict'] == 1, 'PassengerId'].count()
total = results['Survived_Predict'].count()

print(f'Survived {survivors_count} ({(survivors_count/total*100).round(decimals=1)}%) passengers, out of a total of {total}.')

## Who Do I Predict Survived?

In [None]:
# survivors predicted 
survivors = pd.DataFrame(results.loc[results['Survived_Predict'] == 1])

# graphs (set the theme before creating the figure so that it applies to these axes)
sns.set_theme(style='dark')
window, graphs = plt.subplots(nrows=4, ncols=2, figsize=(10,15))
plt.tight_layout()
plt.subplots_adjust(hspace=0.5, wspace=0.5)

ax1 = sns.histplot(data=survivors, x='Sex', ax=graphs[0][0])
ax2 = sns.histplot(data=survivors, x='Age', ax=graphs[0][1])
ax3 = sns.histplot(data=survivors, x='SibSp', ax=graphs[1][0])
ax4 = sns.histplot(data=survivors, x='Parch', ax=graphs[1][1])
ax5 = sns.histplot(data=survivors, x='Pclass', ax=graphs[2][0])
ax6 = sns.histplot(data=survivors, x='Fare', ax=graphs[2][1])
ax8 = sns.histplot(data=survivors, x='Port_Embarkation', ax=graphs[3][0])

ax1.set(title='Sex', ylabel='Survivors')
ax2.set(title='Age', ylabel='Survivors')
ax3.set(title='# of Siblings / Spouses Aboard the Titanic (SibSp)', ylabel='Survivors')
ax4.set(title='# of Parents / Children Aboard the Titanic (Parch)', ylabel='Survivors')
ax5.set(title='Ticket Class (Pclass)', ylabel='Survivors')
ax6.set(title='Fare', ylabel='Survivors')
ax8.set(title='Port Embarkation', ylabel='Survivors')

# Output in .csv File

In [None]:
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': y_predict})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")