<a href="https://colab.research.google.com/github/sahinozan/Titanic/blob/master/titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## 1. Preparation


### 1.1 Importing Libraries


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict, cross_validate, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder
import warnings

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

warnings.filterwarnings("ignore")
np.set_printoptions(precision=5)

#### Optional Style Settings


In [2]:
%matplotlib inline
plt.rcParams['figure.dpi'] = 200
plt.rcParams['savefig.dpi'] = 200
sns.set(rc={"figure.dpi": 200, 'savefig.dpi': 200})
sns.set_context('notebook')
sns.set_style("ticks")


### 1.2 Load Dataset


In [None]:
df_train = pd.read_csv('https://raw.githubusercontent.com/sahinozan/Titanic/master/train.csv')
df_test = pd.read_csv('https://raw.githubusercontent.com/sahinozan/Titanic/master/test.csv')

In [None]:
df_train.head()

In [None]:
df_test.head()

### 1.3 Checking Null Values


In [None]:
pd.DataFrame(data=[df_train.isna().sum(), df_test.isna().sum()], index=['Train', 'Test']).T

The **Age** and **Cabin** features have many null values. The **Embarked** feature has 2 null values in the train data, and the **Fare** feature has a single null value in the test data.


### 1.4 Checking Duplicate Values


In [None]:
print(f'Number of duplicate values in train data: {df_train.duplicated().sum()}')
print(f'Number of duplicate values in test data: {df_test.duplicated().sum()}')

There are no duplicate rows in the train or test data.


### 1.5 Checking Dataset Features


In [None]:
df_train.describe()

In [None]:
df_train.info()

The train dataset has 12 columns, including the target **Survived**.

- **PassengerId**: Identification number of the passenger
- **Survived**: Whether a passenger survived or not (0 or 1)
- **Pclass**: The socio-economic class
  - Upper: 1
  - Middle: 2
  - Lower: 3
- **Name**: Name of the passenger
- **Sex**: Gender of the passenger (Male or Female)
- **Age**: Age of the passenger in years
- **SibSp**: Number of siblings / spouses aboard
- **Parch**: Number of parents / children aboard
- **Ticket**: Ticket Number
- **Fare**: Passenger Fare
- **Cabin**: Cabin Number
- **Embarked**: Port of Embarkation
  - C: Cherbourg
  - Q: Queenstown
  - S: Southampton


Numerical:

- **Age**, **SibSp**, **Parch**, and **Fare**

Categorical:

- **Survived**, **Pclass**, **Sex**, **Ticket**, **Cabin**, **Embarked**, **Name**, and **PassengerId**


The **Name**, **Sex**, **Ticket**, **Cabin**, and **Embarked** features have the `object` type. **Sex** and **Embarked** take only a few distinct values, so we will convert them to the `category` type to reduce memory use.

> **Name**, **Ticket**, and **Cabin** will not be part of the training set, so we will not convert them to the `category` type.
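
To illustrate why `category` can be more efficient for low-cardinality columns, here is a minimal sketch that compares the memory footprint of the two dtypes. The toy column below is hypothetical and only stands in for a feature like **Sex**; exact byte counts depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 10,000 strings drawn from two distinct values
rng = np.random.default_rng(0)
sex_object = pd.Series(rng.choice(['male', 'female'], size=10_000), dtype=object)
sex_category = sex_object.astype('category')

# deep=True also counts the Python string objects, not just the pointers
object_bytes = sex_object.memory_usage(deep=True)
category_bytes = sex_category.memory_usage(deep=True)

print(f'object:   {object_bytes} bytes')
print(f'category: {category_bytes} bytes')
```

A categorical column stores each distinct value once plus a small integer code per row, so the saving grows with the number of rows.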


In [None]:
df_train[['Sex', 'Embarked']] = df_train[['Sex', 'Embarked']].astype('category')

In [None]:
df_train.info()

## 2. Exploratory Data Analysis


### 2.1 Univariate Analysis


We will analyze and visualize each feature separately to understand the data in depth. We will use bar plots and pie charts for `categorical` features, and histograms and box plots for `numerical` features.

The custom function below annotates each bar of a bar plot with its exact count.

In [None]:
def bar_plot_annotate(axes, column):
    for _i in range(len(df_train[column].dropna().unique())):
        _x = axes.patches[_i].get_x() + axes.patches[_i].get_width() / 2
        _y = axes.patches[_i].get_height() / 2
        axes.annotate(
            text=df_train.groupby(column).agg({'Ticket': 'count'}).loc[df_train[column].unique()[_i], 'Ticket'],
            xy=(_x, _y), ha='center', va='center')
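
As an alternative to computing the positions by hand, Matplotlib 3.4 and later provide `Axes.bar_label`. The sketch below is not the notebook's method; it uses a hypothetical toy column and a plain Matplotlib bar chart to show the idea.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for one categorical column
toy = pd.Series(['S', 'C', 'S', 'Q', 'S', 'C'], name='Embarked')
counts = toy.value_counts()  # sorted by count: S=3, C=2, Q=1

fig, ax = plt.subplots()
bars = ax.bar(list(counts.index), counts.to_numpy())

# label_type='center' writes each bar's height in the middle of the bar
labels = ax.bar_label(bars, label_type='center')
```

With a seaborn count plot, the bar containers are usually reachable through `ax.containers`, so the same call can be applied there.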

#### 2.1.1 Analysis of Survived


In [None]:
_, ax = plt.subplots(1, 2, figsize=(20, 7))
g = sns.countplot(data=df_train, x='Survived', ax=ax[0])
ax[0].set_title('Bar Chart')

df_train['Survived'].value_counts().plot(kind='pie', autopct="%1.1f%%", ax=ax[1])
ax[1].set_title('Pie Chart')

bar_plot_annotate(g, 'Survived')
plt.show()

We will try to find out what distinguishes the **38.4%** of passengers who survived.

- **61.6%** of the passengers did **not** survive.
- Only **38.4%** of the passengers survived.


#### 2.1.2 Analysis of Sex


In [None]:
_, ax = plt.subplots(1, 2, figsize=(20, 7))
g = sns.countplot(data=df_train, x='Sex', order=['male', 'female'], ax=ax[0])
ax[0].set_title('Bar Chart')

df_train['Sex'].value_counts().plot(kind='pie', autopct="%1.1f%%", ax=ax[1])
ax[1].set_title('Pie Chart')

bar_plot_annotate(g, 'Sex')
plt.show()

- **64.8%** of the passengers are **Male**.
- Only **35.2%** of the passengers are **Female**.


#### 2.1.3 Analysis of Age


In [None]:
_, ax = plt.subplots(1, 2, figsize=(20, 7))
sns.histplot(data=df_train, x='Age', kde=True, ax=ax[0])
ax[0].set_title('Age Distribution Histogram')

sns.boxplot(data=df_train, x='Age', ax=ax[1])
ax[1].set_title('Age Distribution Boxplot')
plt.show()

In [None]:
print(f'Average Age: {df_train["Age"].mean()}')
print(f'Lowest Age: {df_train["Age"].min()}')
print(f'Highest Age: {df_train["Age"].max()}')

In [None]:
number_of_people = max(dict(df_train["Age"].value_counts()).values())
most_frequent_age = [key for key, value in dict(df_train["Age"].value_counts()).items() if value == number_of_people]
print(f'Most frequent age is {most_frequent_age[0]} with {number_of_people} passengers.')

The age of the passengers ranges from **0.42** to **80** years, with an average of **29.7**.


#### 2.1.4 Analysis of Fare


In [None]:
_, ax = plt.subplots(1, 2, figsize=(20, 7))
sns.histplot(data=df_train, x='Fare', kde=True, ax=ax[0])
ax[0].set_title('Fare Distribution Histogram')

sns.boxplot(data=df_train, x='Fare', ax=ax[1])
ax[1].set_title('Fare Distribution Boxplot')
plt.show()

In [None]:
print(f'Average Fare: ${df_train["Fare"].mean():.2f}')
print(f'Lowest Fare: ${df_train["Fare"].min()}')
print(f'Highest Fare: ${df_train["Fare"].max():.2f}')

In [None]:
print('Number of passengers who paid $0.0: ', df_train[df_train["Fare"] == df_train["Fare"].min()].shape[0])
print('Number of passengers who paid $512.33: ', df_train[df_train["Fare"] == df_train["Fare"].max()].shape[0])

In [None]:
number_of_people = max(dict(df_train["Fare"].value_counts()).values())
most_frequent_fare = [key for key, value in dict(df_train["Fare"].value_counts()).items() if value == number_of_people]
print(f'Most frequent fare is ${most_frequent_fare[0]} which is paid by {number_of_people} passengers.')

- 15 passengers have a fare of **0.0** dollars, meaning they did **not** pay for the voyage.
- Only 3 passengers paid the highest fare, **512.33** dollars.
- The average fare is **32.2** dollars.


#### 2.1.5 Analysis of Pclass


In [None]:
_, ax = plt.subplots(1, 2, figsize=(20, 7))
g = sns.countplot(data=df_train, x='Pclass', order=[3, 1, 2], ax=ax[0])
ax[0].set_title('Bar Chart')

df_train['Pclass'].value_counts().plot(kind='pie', autopct="%1.1f%%", ax=ax[1])
ax[1].set_title('Pie Chart')

bar_plot_annotate(g, 'Pclass')
plt.show()

**55.1%** of the passengers hold a 3rd class ticket. The shares of 1st and 2nd class ticket holders are fairly close, at **24.2%** and **20.7%** respectively.


#### 2.1.6 Analysis of SibSp and Parch


In [None]:
_, ax = plt.subplots(2, 2, figsize=(20, 12))
sns.histplot(data=df_train, x='SibSp', kde=True, ax=ax[0, 0])
ax[0, 0].set_title('Siblings and Spouses Distribution Histogram')

sns.boxplot(data=df_train, x='SibSp', ax=ax[0, 1])
ax[0, 1].set_title('Siblings and Spouses Distribution Boxplot')

sns.histplot(data=df_train, x='Parch', kde=True, ax=ax[1, 0])
ax[1, 0].set_title('Parents and Children Distribution Histogram')

sns.boxplot(data=df_train, x='Parch', ax=ax[1, 1])
ax[1, 1].set_title('Parents and Children Distribution Boxplot')

plt.show()

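- In each histogram the bin at 0 holds over **600** passengers, so most passengers have no siblings or spouses aboard (`SibSp` = 0) and most have no parents or children aboard (`Parch` = 0).
- In each histogram the bin at 1 holds over **100** passengers, who travel with exactly one relative of that type.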


#### 2.1.7 Analysis of Embarked


In [None]:
_, ax = plt.subplots(1, 2, figsize=(20, 7))
g = sns.countplot(data=df_train, x='Embarked', order=['S', 'C', 'Q'], ax=ax[0])
ax[0].set_title('Bar Chart')

df_train['Embarked'].value_counts().plot(kind='pie', autopct="%1.1f%%", ax=ax[1])
ax[1].set_title('Pie Chart')

bar_plot_annotate(g, 'Embarked')
plt.show()

Most of the passengers, approximately **72.4%**, boarded the Titanic at Southampton.


### 2.2 Multivariate Analysis


This function will be used to annotate bar plots with multiple features.

In [None]:
def stacked_bar_plot_annotate(axes, column, order):
    # Count (category, Survived) pairs once instead of once per bar
    counts = df_train.groupby(column)['Survived'].value_counts()
    for _i, category in enumerate(order):
        b1 = counts.loc[category, 1]  # survivors in this category
        b2 = counts.loc[category, 0]  # non-survivors in this category
        # Read the bar geometry from the axes that was passed in
        _x = axes.patches[_i].get_x() + axes.patches[_i].get_width() / 2
        axes.annotate(text=b1, xy=(_x, b1 / 2), ha='center', va='center')
        axes.annotate(text=b2, xy=(_x, b1 + b2 / 2), ha='center', va='center')

#### 2.2.1 Analysis of Survived and Age


In [None]:
plt.figure(figsize=(20, 7))
sns.histplot(data=df_train, x='Age', hue='Survived', multiple='stack', kde=True)
plt.show()

In [None]:
number_of_survival_under_10 = df_train[(df_train['Age'] <= 10) & (df_train['Survived'] == 1)].shape[0]
number_of_survival_over_65 = df_train[(df_train['Age'] >= 65) & (df_train['Survived'] == 1)].shape[0]

In [None]:
print(f'Number of people survived in 0-10 age range: {number_of_survival_under_10}')
print(f'Number of people survived in 65+ age range: {number_of_survival_over_65}')

- The **0-10** age range has a higher survival rate. Children may have been given priority for the lifeboats.
- The **65+** age range has an extremely low survival rate. One possible explanation is that elderly passengers were less physically able to reach safety.
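
The cells above print survivor counts, while the observations are about rates. A minimal sketch of how a survival rate per age group could be computed is shown here. The toy rows, the bin edges, and the `AgeGroup` column name are assumptions made for illustration; in the notebook, `df_train` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy rows standing in for df_train
toy = pd.DataFrame({
    'Age':      [4, 8, 25, 30, 40, 70, 66, 9],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0],
})

# Assumed bin edges: [0, 11), [11, 65), [65, 121) to mirror the 0-10 and 65+ ranges
labels = ['0-10', '11-64', '65+']
toy['AgeGroup'] = pd.cut(toy['Age'], bins=[0, 11, 65, 121], right=False, labels=labels)

# The mean of a 0/1 column is the share of survivors in each group
rates = toy.groupby('AgeGroup', observed=True)['Survived'].mean()
print(rates)
```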


#### 2.2.2 Analysis of Survived and Sex


In [None]:
plt.figure(figsize=(20, 7))
g = sns.histplot(data=df_train, x='Sex', hue='Survived', multiple='stack')

stacked_bar_plot_annotate(g, 'Sex', order=['female', 'male'])
plt.show()

In [None]:
for i in ['female', 'male']:
    survived = df_train[(df_train['Sex'] == i) & (df_train['Survived'] == 1)].shape[0]
    total = df_train[df_train['Sex'] == i].shape[0]
    print(f'{survived / total * 100 :.2f}% of the {i} passengers survived.')

- Most of the survivors are **female**.
- A large majority of the female passengers survived (about 74%).
- Only a small share of the male passengers survived (about 19%).
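
The per-group percentages can also be read from a single table. As a sketch under assumed toy data (not the notebook's own approach), `pd.crosstab` with `normalize='index'` gives the survival share within each group:

```python
import pandas as pd

# Hypothetical toy rows standing in for df_train
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'female'],
    'Survived': [1, 1, 0, 0, 0, 1, 0, 1],
})

# normalize='index' turns each row of counts into row-wise shares
table = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index')
print(table)
```

The same call should work for **Pclass** or **Embarked** by swapping the first argument.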


#### 2.2.3 Analysis of Survived and Pclass


In [None]:
plt.figure(figsize=(20, 7))
ax = sns.histplot(data=df_train, x='Pclass', hue='Survived', multiple='stack', discrete=True)
ax.set_xticks([1, 2, 3])

stacked_bar_plot_annotate(ax, 'Pclass', order=[1, 2, 3])
plt.show()

In [None]:
for i in range(1, 4):
    survived = df_train[(df_train['Pclass'] == i) & (df_train['Survived'] == 1)].shape[0]
    total = df_train[df_train['Pclass'] == i].shape[0]
    print(f'{survived / total * 100 :.2f}% of the Pclass-{i} passengers survived.')

- Upper-class passengers survived at a higher rate than Middle- and Lower-class passengers.
- Upper-class passengers may have had a higher priority in the rescue process.


#### 2.2.4 Analysis of Survived and Embarked


In [None]:
plt.figure(figsize=(20, 7))
ax = sns.histplot(data=df_train, x='Embarked', hue='Survived', multiple='stack', discrete=True)

stacked_bar_plot_annotate(ax, 'Embarked', order=['C', 'Q', 'S'])
plt.show()

In [None]:
for i, k in {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}.items():
    survived = df_train[(df_train['Embarked'] == i) & (df_train['Survived'] == 1)].shape[0]
    total = df_train[df_train['Embarked'] == i].shape[0]
    print(f'{survived / total * 100 :.2f}% of the passengers embarked in {k} survived.')

- Most of the passengers who survived boarded at **Southampton**. This is likely because **Southampton** is the port where the most passengers boarded.
- More than **50%** of the passengers who boarded at **Cherbourg** survived.


#### 2.2.5 Analysis of Survived and SibSp


In [None]:
plt.figure(figsize=(20, 7))
sns.histplot(data=df_train, x='SibSp', hue='Survived', multiple='stack', kde=True)
plt.show()

#### 2.2.6 Analysis of Survived and Parch


In [None]:
plt.figure(figsize=(20, 7))
sns.histplot(data=df_train, x='Parch', hue='Survived', multiple='stack', kde=True)
plt.show()

## 3. Feature Engineering


In [None]:
df_train.columns

In [None]:
plt.figure(figsize=(20, 7))
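# numeric_only=True skips non-numeric columns, which corr() rejects since pandas 2.0
sns.heatmap(data=df_train.corr(numeric_only=True), annot=True)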
plt.show()

When we look at the relations between `Survived` and other features, we observe:

- `PassengerId` has a negligible negative correlation with `Survived`, approximately **-0.005**.
- `Pclass` has a moderate negative correlation with `Survived`, approximately **-0.34**.
- `Fare` has a weak positive correlation with `Survived`, approximately **0.26**.
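
For reference, `DataFrame.corr` uses the Pearson coefficient by default. For two columns $x$ and $y$ with means $\bar{x}$ and $\bar{y}$ over $n$ rows, it is defined as:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The value lies between -1 and 1 and measures linear association only, so a value near 0 does not rule out a non-linear relationship.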


### 3.1 PassengerId


In [None]:
df_train['PassengerId']

The **PassengerId** column contains **891** unique values, one per passenger, so it carries no predictive information. We are removing this column from the dataset.


In [None]:
df_train.drop('PassengerId', axis=1, inplace=True)
df_test.drop('PassengerId', axis=1, inplace=True)

### 3.2 Name


In [None]:
df_train['Name']

The **Name** column is unique for each passenger, so in its raw form it gives the model nothing to generalize from. We are removing this column from the dataset.


In [None]:
df_train.drop(labels='Name', axis=1, inplace=True)
df_test.drop(labels='Name', axis=1, inplace=True)

### 3.3 Ticket


In [None]:
df_train['Ticket']

The **Ticket** column holds free-form ticket numbers with very high cardinality (681 distinct values among the **891** passengers in the train data). These values are not useful to us in raw form, so we are removing this column from the dataset.


In [None]:
df_train.drop(labels='Ticket', axis=1, inplace=True)
df_test.drop(labels='Ticket', axis=1, inplace=True)

### 3.4 Cabin


In [None]:
df_train['Cabin']

In [None]:
print(f'Missing cabin values in train data: {df_train["Cabin"].isna().sum() / df_train["Cabin"].shape[0] * 100:.2f}%')
print(f'Missing cabin values in test data: {df_test["Cabin"].isna().sum() / df_test["Cabin"].shape[0] * 100:.2f}%')

The **Cabin** column has too many missing values (**77.10%** in the train data), so we are removing it from the dataset.


In [None]:
df_train.drop(labels='Cabin', axis=1, inplace=True)
df_test.drop(labels='Cabin', axis=1, inplace=True)

### 3.5 Age


In [None]:
df_train['Age']

In [None]:
print(f'Missing age values in train data: {df_train["Age"].isna().sum() / df_train["Age"].shape[0] * 100:.2f}%')
print(f'Missing age values in test data: {df_test["Age"].isna().sum() / df_test["Age"].shape[0] * 100:.2f}%')

There are missing values in the **Age** column, but the share is moderate (**19.87%** in the train data), so we will impute them with the median.


In [None]:
imputer = SimpleImputer(strategy='median')
imputer = imputer.fit(df_train[['Age']])
df_train['Age'] = imputer.transform(df_train[['Age']])

In [None]:
# Reuse the median learned from the train data instead of refitting on the test data
df_test['Age'] = imputer.transform(df_test[['Age']])
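
Imputation statistics are commonly learned on the training split only and then applied unchanged to the test split, so that no information from the test data influences preprocessing. A minimal sketch of that pattern follows; the toy frames `toy_train` and `toy_test` are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the train and test splits
toy_train = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0, 54.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 47.0, 62.0]})

imputer = SimpleImputer(strategy='median')
imputer.fit(toy_train[['Age']])  # median learned from the training rows only

# transform returns a 2-D array; ravel() flattens it back to a single column
toy_train['Age'] = imputer.transform(toy_train[['Age']]).ravel()
toy_test['Age'] = imputer.transform(toy_test[['Age']]).ravel()

print(imputer.statistics_)  # the learned median
```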

### 3.6 Fare


In [None]:
df_train['Fare']

In [None]:
print(f'Number of missing fare values in train data: {df_train["Fare"].isna().sum()}')
print(f'Number of missing fare values in test data: {df_test["Fare"].isna().sum()}')

There is only a single missing value, and it is in the test data, so we will impute it.


In [None]:
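# The train data has no missing fares, so learn the median there and apply it to the test data
imputer = SimpleImputer(strategy='median').fit(df_train[['Fare']])
df_test['Fare'] = imputer.transform(df_test[['Fare']])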

### 3.7 Embarked


In [None]:
df_train['Embarked']

In [None]:
print(f'Number of missing embarked values in train data: {df_train["Embarked"].isna().sum()}')
print(f'Number of missing embarked values in test data: {df_test["Embarked"].isna().sum()}')

There are only 2 missing values in the train data. As with the previous features, we will impute them, this time with the most frequent value.


In [None]:
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(df_train[['Embarked']])
df_train[['Embarked']] = imputer.transform(df_train[['Embarked']])

We will also use **OneHotEncoder** to extract separate features from this column. We will create **C**, **Q**, and **S** features which will represent **Cherbourg**, **Queenstown**, and **Southampton**.


In [None]:
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False)
encoder = encoder.fit(df_train[['Embarked']])
df_train[['C', 'Q', 'S']] = encoder.transform(df_train[['Embarked']])

In [None]:
# Reuse the encoder fitted on the train data so both sets share the same columns
df_test[['C', 'Q', 'S']] = encoder.transform(df_test[['Embarked']])

We successfully separated the **Embarked** column into **C**, **Q**, and **S** columns. Now we can remove the **Embarked** column itself because we don't need it anymore.
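
The column names **C**, **Q**, and **S** rely on `OneHotEncoder` emitting its columns in sorted category order. As a minimal sketch on hypothetical toy frames, the fitted `categories_` attribute can be used to name the columns instead of typing them by hand. The default sparse output is converted with `.toarray()` here so the example does not depend on the name of the dense-output parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames standing in for the train and test splits
toy_train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
toy_test = pd.DataFrame({'Embarked': ['Q', 'S']})

# handle_unknown='ignore' encodes an unseen category as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(toy_train[['Embarked']])

# Categories are stored sorted, so the columns come out as C, Q, S
columns = list(encoder.categories_[0])
encoded_test = pd.DataFrame(
    encoder.transform(toy_test[['Embarked']]).toarray(),
    columns=columns,
    index=toy_test.index,
)
print(encoded_test)
```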


In [None]:
df_train.drop(labels='Embarked', axis=1, inplace=True)
df_test.drop(labels='Embarked', axis=1, inplace=True)

### 3.8 Sex


In [None]:
df_train['Sex']

In [None]:
print(f'Number of missing sex values in train data: {df_train["Sex"].isna().sum()}')
print(f'Number of missing sex values in test data: {df_test["Sex"].isna().sum()}')

- We don't have **any** missing values in the **Sex** column, so we can use this feature once it is converted to a numeric form.
- As with **Embarked**, we will use **OneHotEncoder** to split this column into separate **Female** and **Male** features.


In [None]:
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False)
encoder = encoder.fit(df_train[['Sex']])
df_train[['Female', 'Male']] = encoder.transform(df_train[['Sex']])

In [None]:
# Reuse the encoder fitted on the train data so both sets share the same columns
df_test[['Female', 'Male']] = encoder.transform(df_test[['Sex']])

In [None]:
df_train.head()

We successfully separated the **Sex** column into **Female** and **Male** columns. Now we can remove the **Sex** column itself because we don't need it anymore.


In [None]:
df_train.drop(labels='Sex', axis=1, inplace=True)
df_test.drop(labels='Sex', axis=1, inplace=True)

### 3.9 Parch and SibSp


We will combine the **Parch** and **SibSp** columns into a single feature called **FamilySize**, the number of relatives a passenger has aboard.


In [None]:
df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch']
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch']
df_train.drop(labels=['SibSp', 'Parch'], axis=1, inplace=True)
df_test.drop(labels=['SibSp', 'Parch'], axis=1, inplace=True)

In [None]:
df_train

In [None]:
df_test

In [None]:
X_train = df_train.drop(labels='Survived', axis=1)
y_train = df_train['Survived'].copy()
X_test = df_test.copy()

## 4. Model Building


We are done with the data preparation. Now, we are going to build our Machine Learning model.
We are going to use the models below:

- Logistic Regression
- K-Nearest Neighbor (KNN)
- Decision Tree
- Random Forest
- Support Vector Machine


### 4.1 Model Comparison


In [None]:
estimators = {
    'Logistic Regression': LogisticRegression(),
    'K-Nearest Neighbor': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Support Vector Machine': SVC()
}

We will use `cross-validation` to compare our models.


In [None]:
cv_scores = pd.DataFrame(columns=["Estimator", "F1-Score", "Accuracy", "Overall"])


def compare_models(estimator=estimators, cv=10):
    global cv_scores
    rows = []
    for name, est in estimator.items():
        # Out-of-fold predictions feed the classification report
        cv_pred = cross_val_predict(estimator=est, X=X_train, y=y_train, cv=cv)
        cv_score = cross_validate(estimator=est, X=X_train, y=y_train, cv=cv, scoring=['f1', 'accuracy'])
        f1 = cv_score["test_f1"].mean()
        accuracy = cv_score["test_accuracy"].mean()
        rows.append({
            "Estimator": name,
            "F1-Score": f1,
            "Accuracy": accuracy,
            "Overall": (f1 + accuracy) / 2
        })
        print(f'\nClassification Report for {name}')
        print(classification_report(y_true=y_train, y_pred=cv_pred))
    # DataFrame.append was removed in pandas 2.0, so build the table in one step
    cv_scores = pd.DataFrame(rows, columns=cv_scores.columns)

In [None]:
compare_models()

We will look at the **F1-Score** and **Accuracy** to decide which model performs better without tuning.
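
To make the scoring concrete, here is a minimal sketch of what `cross_validate` returns when given several metrics. It uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, so the numbers it prints are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary data as a stand-in for X_train / y_train
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

scores = cross_validate(LogisticRegression(), X, y, cv=5, scoring=['f1', 'accuracy'])

# One value per fold and metric, keyed as 'test_<metric>'
f1 = scores['test_f1'].mean()
accuracy = scores['test_accuracy'].mean()
overall = (f1 + accuracy) / 2  # unweighted average, as in the "Overall" column
print(f'F1={f1:.3f}  Accuracy={accuracy:.3f}  Overall={overall:.3f}')
```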


In [None]:
cv_scores.sort_values('Overall', ascending=False).reindex(
    columns=['Estimator', 'F1-Score', 'Accuracy', 'Overall']).set_index('Estimator')

We can see that **Random Forest** performs best overall, so we will continue with the **RandomForestClassifier**.


### 4.2 Hyperparameter Tuning


We will use **randomized search** (`RandomizedSearchCV`) to find good hyperparameter values for the **RandomForestClassifier**.

In [None]:
n_estimators = np.linspace(50, 500, int((500 - 50) / 20), dtype=int)
max_depth = [5, 10, 50, 100, 200, 300, 400, 500]
min_samples_split = [2, 4, 6, 8, 10]
max_features = ['sqrt', 'log2']
bootstrap = [True, False]

params = {
    'n_estimators': n_estimators,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'max_features': max_features,
    'bootstrap': bootstrap
}

In [None]:
rfc = RandomForestClassifier()
random_search_cv = RandomizedSearchCV(
    estimator=rfc,
    param_distributions=params,
    n_iter=100,
    cv=10,
    n_jobs=-1
)

search = random_search_cv.fit(X_train, y_train)

We found the best parameters from fitting the **Randomized Search**.

In [None]:
search.best_params_
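
Because the search samples combinations at random, the reported parameters can differ between runs unless `random_state` is fixed. The sketch below, on synthetic data and with a deliberately tiny, hypothetical grid, shows one way to make the search repeatable and to reuse `best_params_` without retyping the values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data as a stand-in for X_train / y_train
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

# Deliberately tiny, hypothetical grid so the sketch runs quickly
params = {'n_estimators': [10, 20], 'max_depth': [3, 5], 'min_samples_split': [2, 4]}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=params,
    n_iter=4,
    cv=3,
    random_state=42,  # fixes which combinations are sampled
).fit(X, y)

# Unpack the winning combination instead of copying it by hand
best_model = RandomForestClassifier(**search.best_params_, random_state=42)
print(search.best_params_)
```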

We will compare our base **Random Forest** model to the one with these parameters to figure out the amount of improvement we achieved.

In [None]:
best_model = RandomForestClassifier(n_estimators=478,
                                    min_samples_split=10,
                                    max_features='sqrt',
                                    max_depth=10,
                                    bootstrap=False,
                                    random_state=42)

best_model_score = cross_validate(estimator=best_model, X=X_train, y=y_train, cv=10, scoring=(['f1', 'accuracy']))

In [None]:
f1 = best_model_score["test_f1"].mean()
accuracy = best_model_score["test_accuracy"].mean()
best_model_df = pd.DataFrame(
    {'F1-Score': [f1], 'Accuracy': [accuracy], 'Overall': [(f1 + accuracy) / 2]},
    index=['Best Random Model'])
comparison = cv_scores[cv_scores['Estimator'] == 'Random Forest'].set_index('Estimator')
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
comparison = pd.concat([comparison, best_model_df]).sort_values('Overall', ascending=False)
comparison = comparison.rename(index={'Random Forest': 'Base Model'})

In [None]:
comparison

In [None]:
best_random_overall = comparison.loc['Best Random Model', 'Overall']
base_overall = comparison.loc['Base Model', 'Overall']
print(f'Improvement: {(best_random_overall - base_overall) / base_overall * 100 :.2f}%')

We achieved a **3.04%** relative improvement in the Overall score over our base Random Forest model.