![](https://www.intrepidstrategic.com/wp-content/uploads/2023/09/titanic.jpg)

## A study in feature engineering

### This notebook takes a segmentation approach
* Segment by surname country of origin, since England was relatively homogeneous in the early 1900s
* Bin name titles by kind (i.e., nobility titles, professional titles, etc.)
* Identify likely parents of various ages based on historic data
* Identify & bin by group size
* Normalize fare & bin by sixths
* Age binning

## The local Anaconda version of this notebook landed in the TOP 13%!
Kaggle won't run the fine-tuning in its entirety. I am currently trying to work out why the Anaconda version regularly performs well while the Kaggle version does not. I'm still in the early learning phase of ML and am eager to hear your feedback.

## COMMENTS ARE WELCOME!
I'm here to learn! 

In [None]:
!pip install dataprep

## Import what we'll need

In [None]:
import plotly.express as px
import numpy as np 
import pandas as pd 
import scipy
from scipy.stats import mode
import shap
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVC
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.impute import IterativeImputer
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance
from xgboost import cv
import xgboost as xgb
from dataprep.eda import create_report
from IPython.display import display
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Examine the data

In [None]:
# Read the train_data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
display(train_data.head())

# Read the test_data
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
display(test_data.head())

# Display the information of test_data
display(test_data.info())

## Dataprep Create Report

In [None]:
create_report(train_data)

## Find null values

In [None]:
test_data_df = pd.DataFrame(test_data.isnull().sum())
print(test_data_df)

In [None]:
train_data_df = pd.DataFrame(train_data.isnull().sum())
print(train_data_df)


## Change the null cabin to a category 'Unknown'
Most cabin values in the data are null, and I am unsure why. The missingness may itself carry information, so we'll put the nulls into a category of their own.

In [None]:
# Find and replace null values in train_data for 'Cabin' and 'Ticket'
# (assigning the result back avoids chained in-place fills on a column)
train_data['Cabin'] = train_data['Cabin'].fillna('Unknown')
train_data['Ticket'] = train_data['Ticket'].fillna('Unknown')

# Find and replace null values in test_data for 'Cabin' and 'Ticket'
test_data['Cabin'] = test_data['Cabin'].fillna('Unknown')
test_data['Ticket'] = test_data['Ticket'].fillna('Unknown')

## Take a look at ages

In [None]:
unique_ages = train_data['Age'].unique()
print(unique_ages)

#### We'll use predictive modeling to fill ages in before running final prediction


## Let's round these floats to whole years.

And convert Age to a nullable integer type
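
Here is a small sketch of what the rounding step does, using a made-up `ages` series rather than the notebook's data. It assumes pandas' nullable `Int64` dtype, which keeps missing values as `<NA>` instead of forcing the column back to float.

```python
import pandas as pd

# Made-up ages for illustration: a fractional infant age, a ".5" estimated
# age, and a missing value.
ages = pd.Series([22.0, 0.42, 28.5, None])

# round() uses round-half-to-even, so 28.5 becomes 28; the missing value
# survives the cast because Int64 is a nullable integer dtype.
rounded = ages.round().astype("Int64")

print(rounded)
```

Note that infants under six months all collapse to age 0 after rounding, which is fine for binning but loses a little detail.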

In [None]:
# Convert the Age column to numeric (including NaN values)
train_data['Age'] = pd.to_numeric(train_data['Age'], errors='coerce')
test_data['Age'] = pd.to_numeric(test_data['Age'], errors='coerce')

# Round the float values and then convert to integers
train_data['Age'] = train_data['Age'].round().astype('Int64')
test_data['Age'] = test_data['Age'].round().astype('Int64')

#### Now we'll view that again

In [None]:
unique_ages = train_data['Age'].unique()
print(unique_ages)

## Quickly view unique values
#### Make a dataframe with unique options in each column

In [None]:
unique_values_dict = {}

# Exclude the 'PassengerId' column
columns_to_check = [col for col in train_data.columns if col != 'PassengerId']

# Find the maximum number of unique values across columns
max_unique_values = max([len(train_data[col].dropna().unique()) for col in columns_to_check])

# Populate unique_values_dict
for col in columns_to_check:
    unique_values = train_data[col].dropna().unique()
    padding = max_unique_values - len(unique_values)
    unique_values_dict[col] = list(unique_values) + [''] * padding

# Create DataFrame
unique_df = pd.DataFrame(unique_values_dict)

unique_df.head(100)

## Take a look at all abbreviations in the 'Name' column

In [None]:
# Extract the first word that ends with a period from the 'Name' column
# (raw strings keep '\.' from being read as an invalid escape sequence)
train_data['Title'] = train_data['Name'].str.extract(r'([A-Za-z]+\.)', expand=False)
test_data['Title'] = test_data['Name'].str.extract(r'([A-Za-z]+\.)', expand=False)

In [None]:
train_data['Title'].unique()

## One-hot encode titles

#### Mr.
Grown Men
#### Master.
Men under 18
#### MsMlle
Women who are unmarried
#### MrsMme
Women who are married
#### Nobility
Men and women with nobility titles
#### ProTitle
People with a title signifying their profession
#### Military
People with a military title
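
The grouping above can be sketched as a plain lookup table. This is only an illustration: `TITLE_BINS` and `title_bin` are hypothetical names, the sample names are made up apart from following the dataset's "Surname, Title. Given names" layout, and titles the bins above don't mention (such as "Miss.") are simply left unmapped.

```python
import re

# Title-to-bin lookup built from the groups described above.
TITLE_BINS = {
    "Mr.": "Mr.", "Master.": "Master.",
    "Ms.": "MsMlle", "Mlle.": "MsMlle",
    "Mrs.": "MrsMme", "Mme.": "MrsMme",
    "Don.": "Nobility", "Sir.": "Nobility",
    "Lady.": "Nobility", "Jonkheer.": "Nobility",
    "Rev.": "ProTitle", "Dr.": "ProTitle",
    "Major.": "Military", "Col.": "Military", "Capt.": "Military",
}

def title_bin(name):
    """Return the bin for the first word ending in a period, or None."""
    match = re.search(r"[A-Za-z]+\.", name)
    if match is None:
        return None
    return TITLE_BINS.get(match.group(0))

print(title_bin("Braund, Mr. Owen Harris"))
print(title_bin("Doe, Dr. Colette"))
print(title_bin("Smith, Miss. Jane"))
```

Looking up the exact extracted title, rather than searching the whole name for each title, means a given name like "Colette" cannot be mistaken for "Col.".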

In [None]:
# Initialize new columns with zero
train_data[['MsMlle', 'MrsMme', 'Nobility', 'ProTitle', 'Military']] = 0
test_data[['MsMlle', 'MrsMme', 'Nobility', 'ProTitle', 'Military']] = 0

# Define combined titles and their respective column names
combined_titles = {
    'MsMlle': ['Ms.', 'Mlle.'],
    'MrsMme': ['Mrs.', 'Mme.'],
    'Nobility': ['Don.', 'Sir.', 'Lady.', 'Jonkheer.'],
    'ProTitle': ['Rev.', 'Dr.'],
    'Military': ['Major.', 'Col.', 'Capt.']
}

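# Apply the changes to train_data and test_data.
# regex=False matches each title literally; as a regex, the trailing '.'
# is a wildcard, so 'Col.' would also match surnames such as 'Coleman'.
for column_name, titles in combined_titles.items():
    for title in titles:
        train_data.loc[train_data['Name'].str.contains(title, regex=False), column_name] = 1
        test_data.loc[test_data['Name'].str.contains(title, regex=False), column_name] = 1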
        
titles = ['Mr.', 'Master.']

In [None]:
for title in titles:
    train_data[title] = train_data['Name'].apply(lambda x: 1 if title in x else 0)
    test_data[title] = test_data['Name'].apply(lambda x: 1 if title in x else 0)

## Remove the titles from the names before we move on

In [None]:
titles_to_check = ['Mr.', 'Mrs.', 'Miss.', 'Master.', 'Don.', 'Rev.', 'Dr.', 'Mme.', 'Ms.', 'Major.', 'Lady.', 'Sir.', 'Mlle.', 'Col.', 'Capt.', 'Countess.', 'Jonkheer.']

# Define a function to remove the titles
def remove_titles_from_name(df, titles):
    for title in titles:
        df['Name'] = df['Name'].str.replace(title, '', regex=False)
    return df

# Apply the function to both train_data and test_data
train_data = remove_titles_from_name(train_data, titles_to_check)
test_data = remove_titles_from_name(test_data, titles_to_check)

# What's in a name?
### a hypothesis

Surname origin may well be important, as it could identify passengers who travelled from a nearby country to embark on the Titanic. The fated ship set sail from Southampton in southern England in 1912. In the 1901 census, 96% of the population had been born in England or Wales. If you happen to find census data, either tabular or raw images, please let me know so I can gain a more accurate understanding of the surname breakdown of English residents.

### Import dataframes & assign country
[from Surname Language of Origin](https://www.kaggle.com/datasets/sinclairg/surname-language-of-origin/) by Sinclair

In [None]:
def load_txt_to_df(filepath):
    with open(filepath, 'r') as f:
        lines = f.readlines()
    df = pd.DataFrame(lines, columns=['Surname'])
    df['Surname'] = df['Surname'].str.strip()
    return df
filepaths = {
    "arabic": "/kaggle/input/surname-language-of-origin/data/names/Arabic.txt",
    "chinese": "/kaggle/input/surname-language-of-origin/data/names/Chinese.txt",
    "czech": "/kaggle/input/surname-language-of-origin/data/names/Czech.txt",
    "dutch": "/kaggle/input/surname-language-of-origin/data/names/Dutch.txt",
    "english": "/kaggle/input/surname-language-of-origin/data/names/English.txt",
    "french": "/kaggle/input/surname-language-of-origin/data/names/French.txt",
    "german": "/kaggle/input/surname-language-of-origin/data/names/German.txt",
    "greek": "/kaggle/input/surname-language-of-origin/data/names/Greek.txt",
    "irish": "/kaggle/input/surname-language-of-origin/data/names/Irish.txt",
    "italian": "/kaggle/input/surname-language-of-origin/data/names/Italian.txt",
    "japanese": "/kaggle/input/surname-language-of-origin/data/names/Japanese.txt",
    "korean": "/kaggle/input/surname-language-of-origin/data/names/Korean.txt",
    "polish": "/kaggle/input/surname-language-of-origin/data/names/Polish.txt",
    "portuguese": "/kaggle/input/surname-language-of-origin/data/names/Portuguese.txt",
    "russian": "/kaggle/input/surname-language-of-origin/data/names/Russian.txt",
    "scottish": "/kaggle/input/surname-language-of-origin/data/names/Scottish.txt",
    "spanish": "/kaggle/input/surname-language-of-origin/data/names/Spanish.txt",
    "vietnamese": "/kaggle/input/surname-language-of-origin/data/names/Vietnamese.txt",
}

dataframes = {}
for name, path in filepaths.items():
    dataframes[name] = load_txt_to_df(path)

## One-hot encode names to country of origin

In [None]:
# Create columns for each origin in train_data and test_data
origins = list(dataframes.keys())
for origin in origins:
    train_data[origin] = 0
    test_data[origin] = 0

# Function to one-hot encode based on name substrings
def one_hot_encode_by_origin(df):
    for index, row in df.iterrows():
        name = row['Name'].lower()
        for origin, origin_df in dataframes.items():
            if any(subname.lower() in name for subname in origin_df['Surname']):
                df.at[index, origin] = 1

# One-hot encode the train and test data
one_hot_encode_by_origin(train_data)
one_hot_encode_by_origin(test_data)

In [None]:
train_data.describe()

In [None]:
# 1. Compute the sum for each country column
train_sums = train_data[origins].sum()
test_sums = test_data[origins].sum()

# 2. Concatenate the sums
all_sums = pd.concat([train_sums, test_sums], axis=1, keys=['Train', 'Test'])

# 3. Plot using matplotlib
all_sums.plot(kind='bar', figsize=(14,7))
plt.title('Counts of People by Country in Train and Test Data')
plt.ylabel('Number of People')
plt.xlabel('Country')
plt.tight_layout()
plt.show()

In [None]:
# 1. Sum of 1s across country columns for each row
train_data['sum_of_countries'] = train_data[origins].sum(axis=1)
test_data['sum_of_countries'] = test_data[origins].sum(axis=1)

# 2. Filter rows where this sum is 2 or more
multi_country_df = train_data[train_data['sum_of_countries'] >= 2]

# 3. Compute the percentage of such occurrences
percent_multi_country = (multi_country_df[origins].sum() / len(multi_country_df)) * 100

# 4. Plot the percentages
plt.figure(figsize=(14,7))
percent_multi_country.plot(kind='bar', color='c')
plt.title('Percentage of Names Associated with Multiple Countries in Train Data')
plt.ylabel('Percentage')
plt.xlabel('Country')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## We can see that the anglicized versions of Asian names appear too often.
#### I have verified that Russians were, in fact, aboard the Titanic. 
Let's remove the columns representing Chinese, Japanese, Korean, and Vietnamese. Then, if somebody has all 0s on country of origin we can add them to a column called 'othercountry.'
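
One possible reason short names over-match is that the encoder above looks for each list entry anywhere inside the full name string. The sketch below contrasts that with matching the surname exactly. Everything in it is hypothetical: `toy_lists` is a made-up stand-in for the surname files, and `surname_of` assumes names follow the dataset's "Surname, Title. Given names" layout.

```python
def surname_of(name):
    """Take the text before the first comma, lower-cased."""
    return name.split(",")[0].strip().lower()

def origins_substring(name, lists):
    """Origins whose list has any entry appearing anywhere in the name."""
    lowered = name.lower()
    return sorted(
        origin for origin, names in lists.items()
        if any(entry.lower() in lowered for entry in names)
    )

def origins_exact(name, lists):
    """Origins whose list contains the surname as a whole word."""
    surname = surname_of(name)
    return sorted(
        origin for origin, names in lists.items()
        if surname in {entry.lower() for entry in names}
    )

# Made-up miniature surname lists, for illustration only.
toy_lists = {"chinese": ["Lee", "An"], "english": ["Fleet", "Andrews"]}

name = "Fleet, Frederick"
print(origins_substring(name, toy_lists))  # 'lee' is found inside 'fleet'
print(origins_exact(name, toy_lists))
```

Exact matching is stricter and will miss compound or hyphenated surnames, so it is a trade-off rather than a drop-in fix.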

In [None]:
columns_to_drop = ['chinese', 'japanese', 'korean', 'vietnamese']

# Drop the columns from train_data
train_data = train_data.drop(columns=columns_to_drop, axis=1)

# Drop the columns from test_data
test_data = test_data.drop(columns=columns_to_drop, axis=1)

In [None]:
columns_to_check = ['arabic', 'czech', 'dutch', 'english', 'french', 'german', 'greek', 'irish', 'italian', 'polish', 'portuguese', 'russian', 'scottish', 'spanish']

# Check and set 'othercountry' for train_data
train_data['othercountry'] = np.where(train_data[columns_to_check].sum(axis=1) == 0, 1, 0)

# Check and set 'othercountry' for test_data
test_data['othercountry'] = np.where(test_data[columns_to_check].sum(axis=1) == 0, 1, 0)

## Onward. Let's look to see if 'Cabin' shows a pattern

In [None]:
test_data['Cabin'].unique()

## There is a pattern. Each cabin starts with a letter from A to F!
#### A quick Google search shows these are the Sun Deck, Upper Promenade, Upper Deck, Saloon Deck, Main Deck, and Middle Deck.
## Let's one-hot encode this into decks
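
As a compact illustration of the idea, the first letter of each cabin can be mapped to a deck name and then one-hot encoded. The `cabins` values here are a made-up sample, and `deck_names` simply reuses the deck labels from this notebook; any letter without an entry falls back to `'Unknown'`.

```python
import pandas as pd

# Made-up sample, including the 'Unknown' placeholder used for null cabins.
cabins = pd.Series(["C85", "Unknown", "E46", "B57 B59", "G6"])

# Deck labels as used in this notebook, keyed by the cabin's first letter.
deck_names = {
    "A": "Upper_Prom_Deck",
    "B": "Prom_Deck_Glass",
    "C": "Upper_Deck",
    "D": "Saloon_Deck",
    "E": "Main_Deck",
    "F": "Middle_Deck",
}

# Letters with no mapping (including the 'U' of 'Unknown') become 'Unknown'.
deck = cabins.str[0].map(deck_names).fillna("Unknown")
dummies = pd.get_dummies(deck, prefix="Cabin")

print(deck.tolist())
print(dummies.columns.tolist())
```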

In [None]:
def categorize_cabin(cabin):
    if cabin.startswith('S'):
        return 'Sun_Deck'
    elif cabin.startswith('A'):
        return 'Upper_Prom_Deck'
    elif cabin.startswith('B'):
        return 'Prom_Deck_Glass'
    elif cabin.startswith('C'):
        return 'Upper_Deck'
    elif cabin.startswith('D'):
        return 'Saloon_Deck'
    elif cabin.startswith('E'):
        return 'Main_Deck'
    elif cabin.startswith('F'):
        return 'Middle_Deck'
    else:
        return 'Unknown'

# Apply the categorization function to the 'Cabin' column in both train_data and test_data
train_data['Cabin_Category'] = train_data['Cabin'].apply(categorize_cabin)
test_data['Cabin_Category'] = test_data['Cabin'].apply(categorize_cabin)

# Perform one-hot encoding on the 'Cabin_Category' column
train_data = pd.get_dummies(train_data, columns=['Cabin_Category'], prefix='Cabin')
test_data = pd.get_dummies(test_data, columns=['Cabin_Category'], prefix='Cabin')

# Drop the original 'Cabin' column
train_data.drop(columns=['Cabin'], inplace=True)
test_data.drop(columns=['Cabin'], inplace=True)

# Print the updated data
print(train_data.head())
print(test_data.head())

In [None]:
unique_values_dict = {}

# Exclude the 'PassengerId' column
columns_to_check = [col for col in test_data.columns if col != 'PassengerId']

# Find the maximum number of unique values across columns
max_unique_values = max([len(test_data[col].dropna().unique()) for col in columns_to_check])

# Populate unique_values_dict
for col in columns_to_check:
    unique_values = test_data[col].dropna().unique()
    padding = max_unique_values - len(unique_values)
    unique_values_dict[col] = list(unique_values) + [''] * padding

# Create DataFrame
unique_df = pd.DataFrame(unique_values_dict)

unique_df.head(100)

## One-Hot Encode Sex

In [None]:
# Perform one-hot encoding for 'Sex' column in train_data
train_data = pd.get_dummies(train_data, columns=['Sex'], prefix='Sex')

# Perform one-hot encoding for 'Sex' column in test_data
test_data = pd.get_dummies(test_data, columns=['Sex'], prefix='Sex')

print(train_data.head())
print(test_data.head())

## One-Hot Encode Embarked

In [None]:
# Perform one-hot encoding for 'Embarked' column in train_data
train_data = pd.get_dummies(train_data, columns=['Embarked'], prefix=['Embarked'])

# Perform one-hot encoding for 'Embarked' column in test_data
test_data = pd.get_dummies(test_data, columns=['Embarked'], prefix=['Embarked'])

print(train_data.head())
print(test_data.head())

## We don't want to be redundant with 'Mr.' and 'Mrs,' so let's drop the names, as well as Ticket and Title

In [None]:
train_data.drop(columns=['Name'], inplace=True)
test_data.drop(columns=['Name'], inplace=True)
train_data.drop(columns=['Ticket'], inplace=True)
test_data.drop(columns=['Ticket'], inplace=True)
test_data.drop(columns=['Title'], inplace=True)
train_data.drop(columns=['Title'], inplace=True)

In [None]:
test_data.info()

In [None]:
train_data.info()

## View a histogram of the fare

In [None]:
fig = px.histogram(train_data, x='Fare', nbins=20, title='Histogram of Fare')
fig.update_xaxes(title_text='Fare')
fig.update_yaxes(title_text='Frequency')
fig.show()


#### There appear to be some extreme outliers. Let's explore further

In [None]:
fig = px.box(train_data, y='Fare', title='Box Plot of Fare')
fig.update_xaxes(title_text='Fare')
fig.update_yaxes(title_text='Value')
fig.show()

In [None]:
plt.figure(figsize=(12, 6))

sns.kdeplot(train_data['Fare'], label='train_data', fill=True)
sns.kdeplot(test_data['Fare'], label='test_data', fill=True)

plt.title('Fare in train_data and test_data')
plt.xlabel('Fare')
plt.ylabel('Density')
plt.legend()

plt.show()

## Fare is skewed.
#### Apply a log transform (`log1p`) to reduce the skew
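
To make the effect concrete, here is a minimal sketch on a handful of illustrative fares (not the notebook's data) showing that `np.log1p` lowers the sample skewness and can be undone with `np.expm1`. `log1p(x)` computes `log(1 + x)`, so a fare of 0 maps to 0 instead of negative infinity.

```python
import numpy as np
import pandas as pd

# Illustrative fares with one very large value, mimicking a long right tail.
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.3, 512.3])

logged = np.log1p(fares)      # log(1 + fare), safe for zero fares
restored = np.expm1(logged)   # inverse transform

print(f"skew before: {fares.skew():.2f}")
print(f"skew after:  {logged.skew():.2f}")
```

The transform compresses the tail rather than removing it, so the largest fare still sits apart from the rest, just far less dramatically.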

In [None]:
# Apply log transformation to 'Fare' in train_data and test_data
train_data['Fare'] = np.log1p(train_data['Fare'])
test_data['Fare'] = np.log1p(test_data['Fare'])

# Now plot the KDE again
plt.figure(figsize=(12, 6))

sns.kdeplot(train_data['Fare'], label='train_data', fill=True)
sns.kdeplot(test_data['Fare'], label='test_data', fill=True)

plt.title('Log-Transformed Fare in train_data and test_data')
plt.xlabel('Log-Transformed Fare')
plt.ylabel('Density')
plt.legend()

plt.show()

## Verify

In [None]:
fig = px.box(train_data, y='Fare', title='Box Plot of Fare')
fig.update_xaxes(title_text='Fare')
fig.update_yaxes(title_text='Value')
fig.show()

Much better!

In [None]:
test_data.head()

In [None]:
train_data.head()

In [None]:
train_data.info()

In [None]:
test_data.info()

In [None]:
test_data.head()

## 'Age' Still has some null values. 
#### Let's use predictive modeling to guess their values.
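
For orientation, this is roughly how scikit-learn's `IterativeImputer` (imported at the top of this notebook) fills gaps: each column with missing values is modeled from the other columns. The `toy` frame below is made up for illustration, and the settings are just one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come first)
from sklearn.impute import IterativeImputer

# Made-up miniature frame with two missing ages.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "SibSp": [1, 0, 0, 0, 1, 4],
    "Age": [38.0, 54.0, np.nan, 22.0, np.nan, 2.0],
})

imputer = IterativeImputer(max_iter=10, random_state=42)

# fit_transform returns a NumPy array, so rebuild the DataFrame with the
# original column names and index.
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled)
```

Observed values pass through unchanged; only the missing cells are estimated, and the estimates are continuous, so they may need rounding or binning afterwards.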

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
features = ['Cabin_Main_Deck', 'Cabin_Middle_Deck', 'Cabin_Prom_Deck_Glass', 'Cabin_Saloon_Deck', 'Cabin_Unknown', 'Cabin_Upper_Deck', 'Cabin_Upper_Prom_Deck', 'Mr.', 'Mrs.', 'Miss.', 'Master.', 'Other', "Sex_male", "Sex_female", "SibSp", "Parch", "Embarked_C", "Embarked_Q", "Embarked_S", "Fare","arabic", "czech", "dutch", "english", "french", "german", "greek", "irish", "italian", "polish", "portuguese", "scottish", "spanish", "othercountry",'Age']

## Identify Couples

In [None]:
# For train_data
train_data['YoungCoupleTrip'] = 0
train_data.loc[(train_data['SibSp'] == 1) & (train_data['Parch'] == 0) & (train_data['Age'] > 17) & (train_data['Age'] < 24), 'YoungCoupleTrip'] = 1

# For test_data
test_data['YoungCoupleTrip'] = 0
test_data.loc[(test_data['SibSp'] == 1) & (test_data['Parch'] == 0) & (test_data['Age'] > 17) & (test_data['Age'] < 24), 'YoungCoupleTrip'] = 1


In [None]:
# For YoungishCouples
train_data['YoungishCouples'] = 0
train_data.loc[(train_data['SibSp'] == 1) & (train_data['Parch'] == 0) & (train_data['Age'] > 23) & (train_data['Age'] < 35), 'YoungishCouples'] = 1

test_data['YoungishCouples'] = 0
test_data.loc[(test_data['SibSp'] == 1) & (test_data['Parch'] == 0) & (test_data['Age'] > 23) & (test_data['Age'] < 35), 'YoungishCouples'] = 1

# For MidAgeCouples (>= 35 so that age 35 is not left out of every band)
train_data['MidAgeCouples'] = 0
train_data.loc[(train_data['SibSp'] == 1) & (train_data['Parch'] == 0) & (train_data['Age'] >= 35) & (train_data['Age'] < 50), 'MidAgeCouples'] = 1

test_data['MidAgeCouples'] = 0
test_data.loc[(test_data['SibSp'] == 1) & (test_data['Parch'] == 0) & (test_data['Age'] >= 35) & (test_data['Age'] < 50), 'MidAgeCouples'] = 1

# For GrandCouples
train_data['GrandCouples'] = 0
train_data.loc[(train_data['SibSp'] == 1) & (train_data['Parch'] == 0) & (train_data['Age'] > 49), 'GrandCouples'] = 1

test_data['GrandCouples'] = 0
test_data.loc[(test_data['SibSp'] == 1) & (test_data['Parch'] == 0) & (test_data['Age'] > 49), 'GrandCouples'] = 1


## Identify Young Families 

In [None]:
# For train_data
train_data['YoungParents'] = 0
train_data.loc[(train_data['SibSp'] >= 1) & (train_data['Parch'] == 1) & (train_data['Age'] > 17) & (train_data['Age'] < 24), 'YoungParents'] = 1

# For test_data
test_data['YoungParents'] = 0
test_data.loc[(test_data['SibSp'] >= 1) & (test_data['Parch'] == 1) & (test_data['Age'] > 17) & (test_data['Age'] < 24), 'YoungParents'] = 1

# For YoungishParents
# (Parch >= 1: at least one parent or child aboard; each *Parents column is
# initialised to 0 so unmatched rows are not left as NaN)
train_data['YoungishParents'] = 0
train_data.loc[(train_data['SibSp'] >= 1) & (train_data['Parch'] >= 1) & (train_data['Age'] > 23) & (train_data['Age'] < 35), 'YoungishParents'] = 1

test_data['YoungishParents'] = 0
test_data.loc[(test_data['SibSp'] >= 1) & (test_data['Parch'] >= 1) & (test_data['Age'] > 23) & (test_data['Age'] < 35), 'YoungishParents'] = 1

# For MidAgeParents
train_data['MidAgeParents'] = 0
train_data.loc[(train_data['SibSp'] >= 1) & (train_data['Parch'] >= 1) & (train_data['Age'] >= 35) & (train_data['Age'] < 50), 'MidAgeParents'] = 1

test_data['MidAgeParents'] = 0
test_data.loc[(test_data['SibSp'] >= 1) & (test_data['Parch'] >= 1) & (test_data['Age'] >= 35) & (test_data['Age'] < 50), 'MidAgeParents'] = 1

# For GrandParents
train_data['GrandParents'] = 0
train_data.loc[(train_data['SibSp'] >= 1) & (train_data['Parch'] >= 1) & (train_data['Age'] > 49), 'GrandParents'] = 1

test_data['GrandParents'] = 0
test_data.loc[(test_data['SibSp'] >= 1) & (test_data['Parch'] >= 1) & (test_data['Age'] > 49), 'GrandParents'] = 1


## Encode Family Size

In [None]:
# Create a FamilySize column
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1

# Create an Alone column, 1 if FamilySize is 1 else 0
train_data['Alone'] = (train_data['FamilySize'] == 1).astype(int)
test_data['Alone'] = (test_data['FamilySize'] == 1).astype(int)

In [None]:
# Group by FamilySize to get counts
family_size_counts = train_data['FamilySize'].value_counts().sort_index()

# Create a bar plot
fig = px.bar(family_size_counts, 
             x=family_size_counts.index, 
             y=family_size_counts.values, 
             labels={'x':'Family Size', 'y':'Number of Passengers'},
             title="Distribution of FamilySize in train_data")

# Show plot
fig.show()

### Encode Family Size
We already have a column titled "Alone." Now, using the chart, let's one-hot encode family sizes of 2 and 3 on their own, 4 and 5 together, and 6 or more together.
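
The same grouping can be written in one step with `pd.cut`. This is a sketch on a made-up `family_size` series; the bin edges and labels are chosen to mirror the categories described here, with `pd.cut`'s default right-closed intervals.

```python
import pandas as pd

# Made-up family sizes for illustration.
family_size = pd.Series([1, 2, 3, 4, 5, 6, 8, 11])

# Right-closed intervals: (0, 1], (1, 2], (2, 3], (3, 5], (5, inf)
bins = [0, 1, 2, 3, 5, float("inf")]
labels = ["Alone", "TwoPass", "ThreePass", "FourFivePass", "SixPlusPass"]

group = pd.cut(family_size, bins=bins, labels=labels)
dummies = pd.get_dummies(group).astype(int)

print(group.tolist())
```

Because `group` is categorical, `get_dummies` emits a column for every label even if a category happens to be empty, which helps keep train and test columns in step.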

In [None]:
# One-hot encode FamilySize for the category of size 2
train_data['TwoPass'] = train_data['FamilySize'].apply(lambda x: 1 if x == 2 else 0)
test_data['TwoPass'] = test_data['FamilySize'].apply(lambda x: 1 if x == 2 else 0)

# One-hot encode FamilySize for the category of size 3
train_data['ThreePass'] = train_data['FamilySize'].apply(lambda x: 1 if x == 3 else 0)
test_data['ThreePass'] = test_data['FamilySize'].apply(lambda x: 1 if x == 3 else 0)

# One-hot encode FamilySize for the combined category of sizes 4 and 5
train_data['FourFivePass'] = train_data['FamilySize'].apply(lambda x: 1 if x == 4 or x == 5 else 0)
test_data['FourFivePass'] = test_data['FamilySize'].apply(lambda x: 1 if x == 4 or x == 5 else 0)

# One-hot encode FamilySize for the category of size 6 and above
train_data['SixPlusPass'] = train_data['FamilySize'].apply(lambda x: 1 if x >= 6 else 0)
test_data['SixPlusPass'] = test_data['FamilySize'].apply(lambda x: 1 if x >= 6 else 0)


In [None]:
train_data.drop('FamilySize', axis=1, inplace=True)
test_data.drop('FamilySize', axis=1, inplace=True)

## One Hot Encode Fare into sixths
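
As an alternative sketch of equal-width binning, `pd.cut` can build the six intervals from the training range and apply the same edges to the test set. The two series are made-up stand-ins for the log-transformed fares; widening the outer edges to infinity and leaving missing fares unbinned are my assumptions, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Stand-ins for log-transformed fares; the test series includes a missing
# value and a value above the training maximum.
train_fare = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
test_fare = pd.Series([0.5, 6.0, np.nan, 7.5])

# 7 edges give 6 equal-width bins over the training range.
edges = np.linspace(train_fare.min(), train_fare.max(), 7)
# Open the outer edges so out-of-range test fares still land in a bin.
edges[0], edges[-1] = -np.inf, np.inf
labels = [f"Fare_{i}" for i in range(1, 7)]

train_bin = pd.cut(train_fare, bins=edges, labels=labels)
test_bin = pd.cut(test_fare, bins=edges, labels=labels)

# Categorical input means all six dummy columns appear in both frames.
train_dummies = pd.get_dummies(train_bin, prefix="Fare_Interval")
test_dummies = pd.get_dummies(test_bin, prefix="Fare_Interval")

print(test_bin.tolist())
```

A missing fare stays missing here and ends up with zeros in every `Fare_Interval_*` column, instead of being counted in the top bin.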

In [None]:
# Determine the Fare range
min_fare = train_data['Fare'].min()
max_fare = train_data['Fare'].max()

# Calculate the interval length
interval_length = (max_fare - min_fare) / 6

# Assign each Fare to its respective interval
# (a NaN fare fails every comparison and falls through to Fare_6)
def assign_interval(fare):
    for i in range(6):
        if fare <= min_fare + interval_length * (i+1):
            return f"Fare_{i+1}"
    return "Fare_6"

train_data['Fare_Interval'] = train_data['Fare'].apply(assign_interval)
test_data['Fare_Interval'] = test_data['Fare'].apply(assign_interval)

# One-hot encode the intervals
train_data = pd.get_dummies(train_data, columns=['Fare_Interval'], drop_first=False)
test_data = pd.get_dummies(test_data, columns=['Fare_Interval'], drop_first=False)


#### Drop Fare

In [None]:
train_data.drop('Fare', axis=1, inplace=True)
test_data.drop('Fare', axis=1, inplace=True)

#### Verify and Inspect all Features

In [None]:
train_data.info()

#### Use this to define 'features'

In [None]:
features = [
    "Pclass", "Age", "SibSp", "Parch", 
    "MsMlle", "MrsMme", "Nobility", "ProTitle", 
    "Military", "Mr.", "Master.", 
    "arabic", "czech", "dutch", "english", 
    "french", "german", "greek", "irish", 
    "italian", "polish", "portuguese", 
    "russian", "scottish", "spanish", 
    "sum_of_countries", "othercountry", 
    "Cabin_Main_Deck", "Cabin_Middle_Deck", 
    "Cabin_Prom_Deck_Glass", "Cabin_Saloon_Deck", 
    "Cabin_Unknown", "Cabin_Upper_Deck", 
    "Cabin_Upper_Prom_Deck", "Sex_female", 
    "Sex_male", "Embarked_C", "Embarked_Q", 
    "Embarked_S", "YoungCoupleTrip", 
    "YoungishCouples", "MidAgeCouples", 
    "GrandCouples", "YoungParents", 
    "YoungishParents", "MidAgeParents", 
    "GrandParents", "Alone", "TwoPass", 
    "ThreePass", "FourFivePass", "SixPlusPass", 
    "Fare_Interval_Fare_1", "Fare_Interval_Fare_2", 
    "Fare_Interval_Fare_3", "Fare_Interval_Fare_4", 
    "Fare_Interval_Fare_5", "Fare_Interval_Fare_6"
]

In [None]:
# Extract columns needed for imputation
train_subset = train_data[features]
test_subset = test_data[features]

# Instantiate and fit the imputer
imputer = IterativeImputer(max_iter=10, random_state=42)
imputer.fit(train_subset)

# Apply imputation on train and test data
train_data_imputed = imputer.transform(train_subset)
test_data_imputed = imputer.transform(test_subset)

# Convert imputed data back to DataFrames (keeping each original index) and update the original data
train_data[features] = pd.DataFrame(train_data_imputed, columns=features, index=train_data.index)
test_data[features] = pd.DataFrame(test_data_imputed, columns=features, index=test_data.index)

In [None]:
test_data_df = pd.DataFrame(test_data.isnull().sum())
print(test_data_df)
train_data_df = pd.DataFrame(train_data.isnull().sum())
print(train_data_df)

In [None]:
# Add a new column "AgeInt" by converting the "Age" column to integers
train_data['AgeInt'] = train_data['Age'].astype(int)

# Calculate the count of each unique age
age_counts = train_data['AgeInt'].value_counts().reset_index()
age_counts.columns = ['AgeInt', 'Count']

# Plot using sns.barplot
plt.figure(figsize=(10, 6))
sns.barplot(data=age_counts, x='AgeInt', y='Count', palette='viridis')
plt.title('Count of Each Age in train_data')
plt.ylabel('Count')
plt.xlabel('Age')
plt.xticks(rotation=45)
plt.show()

# Calculate the percentage of 'Survived' for each age group using "AgeInt"
age_survived_percentage = train_data.groupby('AgeInt')['Survived'].mean() * 100

# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x=age_survived_percentage.index, y=age_survived_percentage.values, palette='viridis')
plt.xlabel('Age')
plt.ylabel('Percentage Survived')
plt.title('Percentage Survived by Age')
plt.xticks(rotation=45)
plt.show()

# Remove the "AgeInt" column
train_data.drop('AgeInt', axis=1, inplace=True)

### bin by threes and view again

In [None]:
# Bin the ages by threes
max_age = train_data['Age'].max()
bins = list(range(0, int(max_age) + 4, 3))
labels = [f'{i}-{i+2}' for i in bins[:-1]]

train_data['Age_bins'] = pd.cut(train_data['Age'], bins=bins, labels=labels, right=False)
test_data['Age_bins'] = pd.cut(test_data['Age'], bins=bins, labels=labels, right=False)

# Calculate counts for each bin
age_bin_counts = train_data['Age_bins'].value_counts().sort_index().reset_index()
age_bin_counts.columns = ['Age Bins', 'Count']

# Plot using sns.barplot
plt.figure(figsize=(15, 6))
sns.barplot(data=age_bin_counts, x='Age Bins', y='Count', palette='viridis')
plt.title('Distribution of Age in 3-Year Bins')
plt.ylabel('Count')
plt.xlabel('Age Bins')
plt.xticks(rotation=45)
plt.show()

## One hot encode these age bins and drop the age and age bin columns

In [None]:
# One-hot encode the Age_bins column
age_dummies = pd.get_dummies(train_data['Age_bins'], prefix='AgeBin')

# Concatenate the one-hot encoded columns to the original DataFrame
train_data = pd.concat([train_data, age_dummies], axis=1)

# Drop the Age and Age_bins columns
train_data.drop(['Age', 'Age_bins'], axis=1, inplace=True)

# One-hot encode the Age_bins column
age_dummies = pd.get_dummies(test_data['Age_bins'], prefix='AgeBin')

# Concatenate the one-hot encoded columns to the original DataFrame
test_data = pd.concat([test_data, age_dummies], axis=1)

# Drop the Age and Age_bins columns
test_data.drop(['Age', 'Age_bins'], axis=1, inplace=True)

In [None]:
features.remove('Age')

In [None]:
# List of AgeBin features
agebin_features = [
    'AgeBin_0-2', 'AgeBin_3-5', 'AgeBin_6-8', 'AgeBin_9-11', 'AgeBin_12-14', 'AgeBin_15-17',
    'AgeBin_18-20', 'AgeBin_21-23', 'AgeBin_24-26', 'AgeBin_27-29', 'AgeBin_30-32', 
    'AgeBin_33-35', 'AgeBin_36-38', 'AgeBin_39-41', 'AgeBin_42-44', 'AgeBin_45-47',
    'AgeBin_48-50', 'AgeBin_51-53', 'AgeBin_54-56', 'AgeBin_57-59', 'AgeBin_60-62', 
    'AgeBin_63-65', 'AgeBin_66-68', 'AgeBin_69-71', 'AgeBin_72-74', 'AgeBin_75-77', 
    'AgeBin_78-80'
]

# Extend the original features list with the AgeBin features
features.extend(agebin_features)

#### View all features

In [None]:
features

# Model Selection

### We're predicting "Survived," which makes this a binary classification problem
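
Since train and test were one-hot encoded separately, it is worth checking that both end up with the same feature columns before fitting. A minimal sketch, using made-up `train` and `test` frames and `DataFrame.reindex` to add any missing dummy column as zeros:

```python
import pandas as pd

# Made-up frames: 'Q' appears only in train.
train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Survived": [0, 1, 1]})
test = pd.DataFrame({"Embarked": ["S", "S", "C"]})

train_X = pd.get_dummies(train.drop(columns="Survived"), columns=["Embarked"])
test_X = pd.get_dummies(test, columns=["Embarked"])

# Columns the model would expect but the test frame lacks.
missing = sorted(set(train_X.columns) - set(test_X.columns))

# Match the training column order; absent columns are filled with 0.
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)

print(missing)
print(test_X.columns.tolist())
```

`reindex` also drops any test-only column, which is usually what you want because the model never saw it during training.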

In [None]:
train_data.shape

In [None]:
# Set the display options to show more rows
pd.set_option('display.max_rows', None)

# Create a DataFrame with columns of train_data and test_data for comparison
comparison_df = pd.DataFrame({
    "train_data columns": pd.Series(train_data.columns),
    "test_data columns": pd.Series(test_data.columns)
})

# Display the comparison DataFrame
print(comparison_df)

# Reset the display option to the default, if desired
pd.reset_option('display.max_rows')

# Prepare XGBoost & Fine Tune
This ran in Anaconda and doesn't need to run again. Keep scrolling to see the best parameters.
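
A quick back-of-the-envelope count shows why this search is expensive. The sketch copies the search space from the commented-out grid in this notebook, assuming the trailing `9` in `colsample_bytree` was meant to be `0.9`, and multiplies out the number of candidates and cross-validation fits.

```python
from math import prod

# Search space copied from the commented-out grid in this notebook.
# Assumption: the last colsample_bytree value was meant to be 0.9.
param_grid = {
    "n_estimators": [100, 500, 1000],
    "max_depth": [2, 3, 4, 5, 6, 7],
    "learning_rate": [0.001, 0.005, 0.01],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1],
    "colsample_bytree": [0.7, 0.75, 0.8, 0.85, 0.9],
    "reg_lambda": [0.001, 1.0],  # np.logspace(-3, 0, 2)
    "reg_alpha": [0.001, 1.0],   # np.logspace(-3, 0, 2)
}
cv_folds = 5

# An exhaustive grid search fits every combination once per fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

That comes to 6,480 candidates and 32,400 model fits, many of them with 1,000 trees, so a randomized search over the same space may be a more practical fit for a hosted notebook session.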

In [None]:
# Separate X and y from train_data using 'features'
X_train = train_data[features]
y_train = train_data['Survived']

# Split train_data into training and validation sets
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size=0.2, random_state=64)


In [None]:
# # Initialize XGBoost model
# xgb_model = XGBClassifier(
#     objective='binary:logistic',
#     n_estimators=500,
#     max_depth=10,
#     learning_rate=0.01,
#     subsample=1,
#     colsample_bytree=0.6,
#     reg_lambda=0.1,
#     reg_alpha=0.1,
#     random_state=64
# )

In [None]:
# # Update hyperparameter search space for XGBoost
# param_grid = {
#     'n_estimators': [100, 500, 1000],
#     'max_depth': [2, 3, 4, 5, 6, 7],
#     'learning_rate': [0.001, 0.005, 0.01],
#     'subsample': [0.5,0.6, 0.7, 0.8, 0.9, 1],
#     'colsample_bytree': [0.7, 0.75, 0.8, 0.85, 0.9],
#     'reg_lambda': np.logspace(-3, 0, 2),
#     'reg_alpha': np.logspace(-3, 0, 2)
# }


# # Use GridSearch
# search = GridSearchCV(xgb_model, param_grid=param_grid, scoring='roc_auc', cv=5, verbose=1, n_jobs=-1)
# search.fit(X_train_split, y_train_split)

# # Using the best parameters found
# xgb_model_best = search.best_estimator_

In [None]:
# best_parameters = search.best_params_
# print(best_parameters)

In [None]:
# # Predict on the validation set using the best estimator
# y_pred_best_val = search.best_estimator_.predict(X_val_split)

# # Get prediction probabilities for AUC score using the best estimator
# y_pred_prob_best_val = search.best_estimator_.predict_proba(X_val_split)[:, 1]

# from sklearn.metrics import accuracy_score, roc_auc_score

# # Calculate accuracy
# accuracy_best_val = accuracy_score(y_val_split, y_pred_best_val)

# # Calculate ROC AUC
# roc_auc_best_val = roc_auc_score(y_val_split, y_pred_prob_best_val)

# print(f"Optimized XGBoost Validation Accuracy: {accuracy_best_val:.2f}")
# print(f"Optimized XGBoost Validation AUC: {roc_auc_best_val:.2f}")


### The Best Parameters

In [None]:
xgb_model_best = XGBClassifier(
    objective='binary:logistic',
    n_estimators=500,
    max_depth=3,
    learning_rate=0.001,
    subsample=1,
    colsample_bytree=0.7,
    reg_lambda=0.001,
    reg_alpha=1.0,
    random_state=64
)

# Model Ensemble by Random Seed Mode
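
The idea is a majority vote: each seed produces one 0/1 prediction per passenger, and the most common value wins. Below is a minimal NumPy-only sketch on made-up predictions; it assumes binary labels and breaks ties toward 0, which is also what a mode that returns the smallest tied value would do.

```python
import numpy as np

# Rows are models (one per seed), columns are passengers.
# Made-up 0/1 predictions for illustration.
all_preds = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
])

n_models = all_preds.shape[0]
votes_for_survival = all_preds.sum(axis=0)

# A passenger is predicted to survive only if more than half the models say so.
majority = (votes_for_survival * 2 > n_models).astype(int)

print(majority)
```

The result is always a 1-D array with one entry per passenger, which makes it easy to confirm the shape before writing a submission file.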

In [None]:
# Set up number of random seeds
n_seeds = 100
all_preds = []

test_data_filtered = test_data[features]


In [None]:
# For each seed, train the model and predict outcomes on test set
for seed in range(n_seeds):
    # Split train_data into training and validation sets
    X_train_split, _, y_train_split, _ = train_test_split(X_train, y_train, test_size=0.2, random_state=seed)

    # Train the model on the training subset
    xgb_model_best.fit(X_train_split, y_train_split)

    # Predict on the test_data using the model
    y_pred_test = xgb_model_best.predict(test_data_filtered)
    
    all_preds.append(y_pred_test)

In [None]:
# Convert list of predictions to a numpy array
all_preds_array = np.array(all_preds)

# Calculate the mode for each passenger.
# np.ravel gives a 1-D array whether scipy returns shape (1, n) (older
# releases) or (n,) (newer releases, where keepdims defaults to False).
final_preds = np.ravel(mode(all_preds_array, axis=0).mode)

In [None]:
test_data['Survived'] = final_preds
output = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': final_preds})
output.set_index("PassengerId", inplace=True)
output.to_csv('submission.csv')
output

## What are the best features?
These results are from the final random seed, not the entire ensemble

In [None]:
X_train_filtered = X_train[features]

# Initialize SHAP explainer. 
explainer = shap.TreeExplainer(xgb_model_best)

# Calculate SHAP values for a particular dataset. 
shap_values = explainer.shap_values(X_train_filtered)

# Visualize the SHAP values for a specific instance 
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0, :], X_train_filtered.iloc[0, :])

In [None]:
shap.summary_plot(shap_values, X_train_filtered)

In [None]:
# Get feature importances from the model
feature_importances = xgb_model_best.feature_importances_

# Pair each feature with its importance score
features_with_scores = list(zip(features, feature_importances))

# Sort the pairs by importance score in descending order
sorted_features_with_scores = sorted(features_with_scores, key=lambda x: x[1], reverse=True)

# Display the sorted features and their importance scores
for feature, score in sorted_features_with_scores:
    print(f"Feature: {feature}, Importance: {score:.5f}")

## Next Steps
* Figure out why my Anaconda notebook is generating a higher score than this Kaggle notebook
* Try setting stumps early and then building the model from there with step-size tuning
* Try [HyperOpt](http://hyperopt.github.io/hyperopt/) for fine-tuning instead
* Use [YellowBrick](https://www.scikit-yb.org/en/latest/) to visualize details about XGBoost
* Implement [XGBFIR](https://github.com/limexp/xgbfir) to develop a better understanding of feature interactions
