(**Click the icon below to open this notebook in Colab**)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/machine-learning-for-actuarial-science/blob/main/2025-spring/week05/notebook/demo.ipynb)

We will work with the Titanic dataset from Kaggle:
- **Data** https://www.kaggle.com/competitions/titanic/data
- **The Titanic** https://en.wikipedia.org/wiki/Titanic

| Variable   | Definition                                | Key                                  |
|------------|-------------------------------------------|--------------------------------------|
| survival   | Survival                                 | 0 = No, 1 = Yes                     |
| pclass     | Ticket class                             | 1 = 1st, 2 = 2nd, 3 = 3rd           |
| sex        | Sex                                      |                                      |
| Age        | Age in years                             |                                      |
| sibsp      | # of siblings / spouses aboard the Titanic |                                      |
| parch      | # of parents / children aboard the Titanic |                                      |
| ticket     | Ticket number                            |                                      |
| fare       | Passenger fare                           |                                      |
| cabin      | Cabin number                             |                                      |
| embarked   | Port of Embarkation                     | C = Cherbourg, Q = Queenstown, S = Southampton |


# 1. Loading the data

In [None]:
import pandas as pd

In [None]:
train = pd.read_csv('../data/titanic/train.csv')
test = pd.read_csv('../data/titanic/test.csv')
train.head(2)

In [None]:
train.sample(2)

In [None]:
train.info()

In [None]:
train.dtypes

In [None]:
train.describe()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set style
sns.set_style("whitegrid")

# Define color palette for consistency
palette = "husl"

# Create subplots
fig, axes = plt.subplots(4, 3, figsize=(20, 16))

# 1st row
sns.countplot(x='Pclass', data=train, ax=axes[0, 0])
sns.countplot(x='Sex', data=train, ax=axes[0, 1])
sns.countplot(x='Embarked', data=train, ax=axes[0, 2])

# 2nd row
sns.boxplot(x='Pclass', y='Age', data=train, ax=axes[1, 0])
sns.histplot(train['Fare'].dropna(), ax=axes[1, 1], bins=30, color='b')
sns.countplot(x='SibSp', hue='Survived', data=train, ax=axes[1,2], palette=palette)

# 3rd row
sns.countplot(x='Pclass', hue='Survived', data=train, ax=axes[2, 0], palette=palette)
sns.countplot(x='Sex', hue='Survived', data=train, ax=axes[2, 1], palette=palette)
sns.histplot(x='Age', hue='Survived', data=train, ax=axes[2, 2], bins=5, palette=palette)

# 4th row
sns.countplot(x='Parch', hue='Survived', data=train, ax=axes[3, 0], palette=palette)
sns.stripplot(x='Pclass', y='Fare', hue='Survived', data=train, palette=palette, ax=axes[3, 1], jitter=True, dodge=True)
sns.countplot(x='Embarked', hue='Survived', data=train, ax=axes[3, 2], palette=palette)

# Set titles for each subplot
# Note: the hue='Survived' plots show passenger counts, not survival rates
titles = [
    "Total Passengers by Class",
    "Total Passengers by Gender",
    "Total Passengers by Port of Embarkation",
    "Age Box Plot by Class",
    "Fare Distribution",
    "Survival Counts by SibSp",
    "Survival Counts by Class",
    "Survival Counts by Gender",
    "Survival Counts by Age",
    "Survival Counts by Parch",
    "Fare by Class and Survival",
    "Survival Counts by Port of Embarkation"
]

# Assign titles correctly
for ax, title in zip(axes.flat, titles):
    ax.set_title(title)

# Adjust layout
plt.tight_layout()
plt.show()


# 2. Exploratory Data Analysis

## 2.0 Quick survey across key variables

- Some quick analysis across key columns is usually a good starting point to help understand the data.
- **Domain knowledge and common sense are important too!!**

In [None]:
# convert all column names to lowercase
train.columns = train.columns.str.lower()
test.columns = test.columns.str.lower()

In [None]:
train.info()

In [None]:
# how many passengers survived? also include each group's share of the total

dist_survived_train = train.groupby('survived')['passengerid'].count().reset_index()
dist_survived_train['percentage'] = (dist_survived_train['passengerid'] / len(train)).map(lambda x: round(x, 4))
dist_survived_train

It is also worth checking that the train and test datasets have consistent distributions across shared features such as `pclass`.

In [None]:
# train['pclass'].head(5)
train.groupby('pclass')['passengerid'].count().reset_index()

In [None]:
test.groupby('pclass')['passengerid'].count().reset_index()
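
Raw counts are hard to compare when the two datasets differ in size, so proportions are easier to read side by side. Below is a minimal sketch on made-up toy frames; `compare_proportions`, `toy_train`, and `toy_test` are hypothetical names for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real train/test data
toy_train = pd.DataFrame({'pclass': [1, 2, 3, 3, 3, 1, 2, 3]})
toy_test = pd.DataFrame({'pclass': [3, 3, 1, 2]})

def compare_proportions(train_df, test_df, col):
    """Return the share of each category of `col` in both frames, side by side."""
    return pd.DataFrame({
        'train': train_df[col].value_counts(normalize=True),
        'test': test_df[col].value_counts(normalize=True),
    }).sort_index()

pclass_comparison = compare_proportions(toy_train, toy_test, 'pclass')
print(pclass_comparison)
```

The same helper could be applied to the real `train` and `test` frames for any column they share.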

The variable `Pclass` here represents the passenger class, which indicates the socio-economic status (SES) of the passenger based on their ticket type.

| Pclass | Description   | Socio-Economic Status     |
|--------|-------------|--------------------------|
| 1      | First Class  | Upper Class (Wealthy)    |
| 2      | Second Class | Middle Class             |
| 3      | Third Class  | Lower Class (Poorer)     |

- First-class passengers had priority boarding, cabins on the higher decks, and potentially priority during evacuation.
- Third-class passengers were mostly housed on the lower decks, which made it harder to reach the lifeboats.

![](https://rpmarchildon.com/wp-content/uploads/2018/06/titanic_class_cabin_locations.png)

In [None]:
# let pandas display all rows instead of hiding some of them

pd.set_option('display.max_rows', None)

train.groupby(['pclass','embarked'])['passengerid'].count().reset_index()

C = Cherbourg, Q = Queenstown, S = Southampton

![](https://d.newsweek.com/en/full/2248395/titanic-journey.jpg?w=1200&f=ea15a8ece59fe5cc42a6ab06fb1fb672)

In [None]:
train.groupby('embarked')['passengerid'].count().reset_index()

In [None]:
# The cabin variable
train['cabin'].head(5)

![](https://www.titanicandco.com/titanic/images/deckplan1.jpg)

In [None]:
dist_sex_train = train.groupby(['sex'])['passengerid'].count().reset_index()
dist_sex_train['percentage'] = (dist_sex_train['passengerid'] / len(train)).map(lambda x: '{:.2%}'.format(x))
dist_sex_train

In [None]:
dist_sex_test = test.groupby(['sex'])['passengerid'].count().reset_index()
dist_sex_test['percentage'] = (dist_sex_test['passengerid'] / len(test)).map(lambda x: '{:.2%}'.format(x))
dist_sex_test

In [None]:
# distribution of other numerical features
train[[
    'age',
    'sibsp',
    'parch',
    'fare'
]].describe()

In [None]:
test[[
    'age',
    'sibsp',
    'parch',
    'fare'
]].describe()

In [None]:
# ticket number??

train['ticket'].head(10)

- Raw Ticket values are not directly useful because they are alphanumeric and contain no obvious numerical meaning.
- However, feature engineering can extract useful patterns that might impact survival probability.
- Possible insights:
  - Passengers with the same ticket number likely traveled together, which can indicate family or group survival dependencies.
  - Ticket prefixes might correlate with cabin class or embarkation location.
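
The two ideas above can be sketched as follows. This is an illustration on a few made-up ticket strings, and the names `ticket_group_size`, `ticket_prefix`, and the `'NONE'` placeholder are assumptions rather than features used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample of ticket strings in the style of the dataset
toy = pd.DataFrame({
    'ticket': ['A/5 21171', 'PC 17599', '113803', '113803', 'PC 17599', '373450']
})

# Number of passengers sharing each ticket (a proxy for travelling together)
toy['ticket_group_size'] = toy.groupby('ticket')['ticket'].transform('count')

def ticket_prefix(ticket):
    """Return the leading non-numeric token of a ticket, or 'NONE' if it starts with digits."""
    head = ticket.split()[0]
    return 'NONE' if head.isdigit() else head

toy['ticket_prefix'] = toy['ticket'].map(ticket_prefix)
print(toy)
```

Whether either feature actually helps would still need to be checked against `survived`.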

## 2.1 Outlier detection

![](https://miro.medium.com/v2/resize:fit:1400/1*0MPDTLn8KoLApoFvI0P2vQ.png)

In [None]:
import numpy as np
from collections import Counter

# Outlier detection using the 1.5 * IQR rule
def detect_outliers(df, n, features):
    """Return the indices of rows that are outliers in more than n features."""
    outlier_indices = []
    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%); nanpercentile skips missing values (age still has NaNs here)
        Q1 = np.nanpercentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.nanpercentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # outlier step
        outlier_step = 1.5 * IQR
        # indices of rows outside [Q1 - step, Q3 + step] for this column
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)

    # count how many features flagged each row, then keep rows flagged in more than n of them
    outlier_counts = Counter(outlier_indices)
    multiple_outliers = [k for k, v in outlier_counts.items() if v > n]
    return multiple_outliers

# detect rows that are outliers in more than 2 of age, sibsp, parch and fare
outliers_to_drop = detect_outliers(train, 2, ["age", "sibsp", "parch", "fare"])
train.loc[outliers_to_drop]  # show the outlier rows

In [None]:
# Drop outliers
# train = train.drop(outliers_to_drop, axis = 0).reset_index(drop=True)

## 2.2 Handle Missing values

In [None]:
train.info()

As can be seen above, there are missing values in the following columns:
- `Age`
- `Cabin`
- `Embarked`

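Most algorithms cannot work with missing values, so they usually need to be handled (imputed or dropped) before modeling. Some implementations can handle them natively, for example the histogram-based gradient boosting trees in scikit-learn.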
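
A per-column summary of how much is missing can help decide how to treat each column. Here is a minimal sketch on a made-up toy frame; `missing_summary` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps similar to the Titanic columns
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, np.nan],
    'cabin': [np.nan, 'C85', np.nan, np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10],
})

def missing_summary(df):
    """Count and share of missing values per column, keeping only columns with gaps."""
    summary = pd.DataFrame({
        'n_missing': df.isnull().sum(),
        'share_missing': df.isnull().mean().round(4),
    })
    return summary[summary['n_missing'] > 0].sort_values('n_missing', ascending=False)

summary = missing_summary(toy)
print(summary)
```

A column that is mostly missing may be better turned into an indicator than imputed.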

In [None]:
# how to quickly locate columns with null values??
train.isnull().sum()

In [None]:
test.isnull().sum()

Possible reasons why `age` is missing for some passengers:

| Reason                                     | Who Is Affected?               |
|--------------------------------------------|--------------------------------|
| Poor record-keeping (pre-1912 era)         | Mostly third-class passengers  |
| Ticketing system didn’t require age        | Families, group travelers      |
| Crew members not consistently recorded     | Crew entries in the dataset    |
| Passengers may have hidden or omitted age  | Various                        |


In [None]:
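# fill the missing values in age with the median age
# (computed on train and reused for test, so both sets are imputed with the same value)
age_median = train['age'].median()
train['age'] = train['age'].fillna(age_median)
test['age'] = test['age'].fillna(age_median)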

- First-class passengers had private cabins, which were recorded.
- Some second-class passengers also had assigned cabins, but not all.
- Most third-class passengers didn’t have individual cabins but instead stayed in large dormitory-style areas (especially in the lower decks).

In [None]:
train['has_cabin'] = train['cabin'].notna().astype(int)
train.groupby('pclass')['has_cabin'].mean().reset_index()

## 2.3 Feature Engineering

Here are the general types of data we could encounter:
- **Categorical data**:
  - With ordinal relationships - e.g., ratings, grades
  - Without ordinal relationships - e.g., colors, brands
- **Numerical data**:
  - Discrete - e.g., number of children, number of votes
  - Continuous - e.g., height, weight, temperature

Ultimately, we want to convert all data into numerical data for computation, which means that we need to convert categorical data into numerical data.


### 2.3.1 Encode categorical variables

In [None]:
# Encoding ordinal variables

from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'education': ['primary', 'secondary', 'tertiary', 'primary']})
encoder = OrdinalEncoder(categories=[['primary', 'secondary', 'tertiary']])  # Define order

df['education_encoded'] = encoder.fit_transform(df[['education']])
df

In [None]:
# Encoding nominal variables (no order implied): one indicator column per category

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
encoded_arr = encoder.fit_transform(df[['education']])
encoded_arr


In [None]:
df_encoded_arr = pd.DataFrame(encoded_arr, columns=encoder.get_feature_names_out(['education']))
df_encoded_arr

In the Titanic dataset, `sex` has only two categories, so a simple 0/1 mapping is enough:

In [None]:
train.sex.value_counts().reset_index()

In [None]:
train['sex'] = train['sex'].map( {'female': 0, 'male': 1} ).astype(int)    
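
A column with more than two unordered categories, such as `embarked`, would more naturally be one-hot encoded. The sketch below uses `pd.get_dummies` on a made-up sample; applying it to the real `train` frame is left as an assumption, since the notebook does not do so.

```python
import pandas as pd

# Hypothetical toy sample; the real column uses the same codes (C, Q, S)
toy = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})

# One indicator column per port, since the ports have no natural order
dummies = pd.get_dummies(toy['embarked'], prefix='embarked', dtype=int)
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Rows with a missing port get zeros in all indicator columns unless `dummy_na=True` is passed.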

### 2.3.2 Normalize numerical variables

| **Transformation**        | **Description**                                         | **Method**             |
|---------------------------|---------------------------------------------------------|------------------------|
| Normalization             | Scales to [0, 1] range                                  | `MinMaxScaler()`       |
| Standardization           | Scales to have mean 0, std 1                            | `StandardScaler()`     |
| Log Transformation        | Compresses large values, reduces skewness               | `np.log()`             |
| [Box-Cox Transformation](https://builtin.com/data-science/box-cox-transformation-target-variable)    | Stabilizes variance and normalizes data                 | `stats.boxcox()` [[reference](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html)]      |
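
For reference, the transformations in the table can be written out as below. The symbols are assumptions introduced here: $x$ is a value in a column, $x_{\min}$ and $x_{\max}$ are the column minimum and maximum, $\mu$ and $\sigma$ are the column mean and standard deviation, and $\lambda$ is the Box-Cox parameter.

```latex
\begin{aligned}
\text{Normalization:}\quad & x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \\
\text{Standardization:}\quad & z = \frac{x - \mu}{\sigma} \\
\text{Log transformation:}\quad & x' = \ln(x), \quad x > 0 \\
\text{Box-Cox:}\quad & x^{(\lambda)} =
  \begin{cases}
    \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
    \ln(x), & \lambda = 0
  \end{cases}
  \quad x > 0
\end{aligned}
```

The log transformation is the $\lambda = 0$ case of Box-Cox, which is why both require strictly positive inputs.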


In [None]:
# Normalization
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Sample data
df = pd.DataFrame({'Age': [22, 38, 26, 35, 35, 54, 2],
                   'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 51.8625, 21.075]})

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Normalize the numerical data
df_normalized = df.copy()
df_normalized[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

df_normalized


In [None]:
# Standardization
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
scaler = StandardScaler()

# Standardize the numerical data
df_standardized = df.copy()
df_standardized[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

df_standardized


In [None]:
# Log transformation - right skewed data
import numpy as np

# Apply log transformation (for positive values)
df_transformed = df.copy()
df_transformed['Fare_log'] = np.log(df_transformed['Fare'] + 1)  # Added 1 to avoid log(0)

df_transformed


In [None]:
# Box-Cox transformation
from scipy import stats

# Apply Box-Cox transformation (only for positive values)
df_boxcox = df.copy()
df_boxcox['Fare_boxcox'], _ = stats.boxcox(df_boxcox['Fare'] + 1)  # Added 1 to avoid 0 values

df_boxcox


### 2.3.3 Create meaningful features

Sometimes we can combine or repurpose existing features to create new ones that correlate more strongly with the target variable. **Correlation analysis** between the explanatory variables and the target variable is a typical criterion for evaluating the usefulness of a feature.
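
As an illustration of such a correlation check, the sketch below ranks features by the absolute value of their Pearson correlation with the target. The toy values are made up, and Pearson correlation is only one possible choice of measure.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's lowercase convention
toy = pd.DataFrame({
    'survived': [0, 1, 1, 0, 1, 0],
    'pclass':   [3, 1, 1, 3, 2, 3],
    'fare':     [7.25, 71.28, 53.10, 8.05, 13.00, 7.90],
})

# Pearson correlation of each feature with the target, strongest first
corr_with_target = (
    toy.corr()['survived']
       .drop('survived')
       .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

A negative sign (as for `pclass` in this toy sample) still indicates a useful feature; it is the magnitude that matters for ranking.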

#### `embarked`

In [None]:
train.groupby('embarked')['survived'].mean().reset_index()

In [None]:
sns.barplot(x='embarked', y='survived', data=train)
plt.show()

In [None]:
sns.barplot(x="embarked", y="survived", hue="sex", data=train)
plt.show()

#### `name`

In [None]:
train.head(3)

In [None]:
train[['name']].head(10)

In [None]:
train['name_length'] = train['name'].map(len)
train[['name', 'name_length']].head(3)

In [None]:
train.groupby('name_length')['passengerid'].count().reset_index()

In [None]:
train['name_length'] = train['name'].map(len)
passenger_count = train.groupby(['name_length'])['passengerid'].count().reset_index().rename(columns={'passengerid':'num_passengers'})
survival_dist = train.groupby(['name_length'])['survived'].mean().reset_index()
fig, (axis1,axis2,axis3) = plt.subplots(3,1,figsize=(18,6))
sns.barplot(x='name_length', y='num_passengers', data=passenger_count, ax = axis1)
sns.barplot(x='name_length', y='survived', data=survival_dist, ax = axis2)
sns.pointplot(x='name_length', y='survived', data=survival_dist, ax = axis3)
plt.show()

In [None]:
train['name_length_cat'] = train['name_length'].apply(lambda x: 0 if x <= 23 else 1 if x <= 28 else 2 if x <= 40 else 3)
train['name_length_cat'].value_counts().reset_index()
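
Nested conditional lambdas like the one above get hard to read as the number of bins grows. A possible alternative is `pd.cut`, sketched here on made-up name lengths with the same bin edges (23, 28, 40); the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical name lengths; the edges match the thresholds 23, 28 and 40
name_lengths = pd.Series([12, 23, 24, 28, 29, 40, 41, 82])

# Right-closed intervals: (-inf, 23], (23, 28], (28, 40], (40, inf)
bins = [-float('inf'), 23, 28, 40, float('inf')]
name_length_cat = pd.cut(name_lengths, bins=bins, labels=[0, 1, 2, 3]).astype(int)
print(name_length_cat.tolist())
```

Because `pd.cut` closes intervals on the right by default, a value equal to an edge falls in the lower bin, matching the `<=` comparisons.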

#### `age`

In [None]:
#plot distributions of age of passengers who survived or did not survive
sns.kdeplot(x='age', data=train, hue='survived', common_norm=False)
# sns.displot(x='age', data=train, hue='survived', kind='kde', common_norm=False)
plt.show()

In [None]:
train['age_cat'] = train['age'].apply(
    lambda x: 0 if x <= 14 else 1 if x <= 30 else 2 if x <= 40 else 3 if x <= 50 else 4 if x <= 60 else 5
)
train.age_cat.value_counts().reset_index() #.sort_values(by='age_cat')

In [None]:
train.groupby(['age_cat'])['survived'].mean().reset_index()

#### `familysize`

In [None]:
train['family_size'] = train['sibsp'] + train['parch'] + 1
train['is_alone'] = train['family_size'].apply(lambda x: 1 if x == 1 else 0)

fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.barplot(x='family_size', y='survived', hue='sex', data=train, ax=axes[0])
sns.barplot(x='is_alone', y='survived', hue='sex', data=train, ax=axes[1])
plt.show()

#### `fare`

In [None]:
sns.histplot(x='fare', data=train, bins=30)
plt.show()

In [None]:
# Apply log to fare to reduce the skewness of its distribution (fares of 0 are kept at 0)
train["fare_log"] = train["fare"].map(lambda i: np.log(i) if i > 0 else 0)

sns.kdeplot(x='fare_log', data=train, hue='survived', common_norm=False)
plt.show()

In [None]:
train['fare_log_cat'] = train['fare_log'].apply(
    lambda x: 0 if x <= 2.7 else 1 if x <= 3.2 else 2 if x <= 3.6 else 3
)
train['fare_log_cat'].value_counts().reset_index()

#### `cabin`

In [None]:
# 1 if a cabin was recorded, 0 if it is missing
train['has_cabin'] = train['cabin'].notna().astype(int)
train.groupby('has_cabin')['survived'].mean().reset_index()

#### `title`

In [None]:
import re

# Define function to extract titles from passenger names
def get_title(name):
    # A title is the word that follows a space and ends with a period, e.g. " Mr."
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

train['title'] = train['name'].apply(get_title)

fig, (axis1) = plt.subplots(1,figsize=(18,6))
sns.barplot(x="title", y="survived", data=train, ax=axis1);
plt.show()
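
Titles that occur only a handful of times give noisy survival estimates, so one option is to pool them into a single category. The sketch below does this on made-up titles; the `'Rare'` label and the `min_count` threshold are assumptions, not choices made in the original notebook.

```python
import pandas as pd

# Hypothetical titles, as a function like get_title might return them
titles = pd.Series(['Mr', 'Mrs', 'Miss', 'Mr', 'Master', 'Dr', 'Mrs', 'Rev', 'Mr', 'Miss'])

# Assumed threshold: titles seen fewer than min_count times are pooled as 'Rare'
min_count = 2
counts = titles.value_counts()
rare = counts[counts < min_count].index

# Keep frequent titles as-is and replace the rest
titles_grouped = titles.where(~titles.isin(rare), 'Rare')
print(titles_grouped.value_counts())
```

On the real data the threshold would need tuning, since a very small toy sample like this one makes most titles look rare.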

#### `deck`

In [None]:
train.head(3)

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
train[['cabin']].head(3)

In [None]:
# Map deck letters to integers; "U" (unknown) marks passengers without a recorded cabin
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}

train['cabin'] = train['cabin'].fillna("U0")
# The deck is the leading letter(s) of the cabin string, e.g. "C85" -> "C"
train['deck'] = train['cabin'].map(lambda x: re.search(r"[a-zA-Z]+", x).group())
# Letters that are not in the mapping become 0
train['deck'] = train['deck'].map(deck).fillna(0).astype(int)

train.deck.value_counts()

In [None]:
sns.barplot(x = 'deck', y = 'survived', order=[1,2,3,4,5,6,7,8], data=train)
plt.show()

In [None]:
# colormap = plt.cm.RdBu
# plt.figure(figsize=(14,12))
# plt.title('Pearson Correlation of Features', y=1.05, size=15)
# sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

In [None]:
# g = sns.pairplot(train[[u'Survived', u'Pclass', u'Sex', u'Age', u'Fare',
#        u'FamilySize', u'Title']], hue='Survived', palette = 'seismic',size=1.2,diag_kind = 'kde',diag_kws=dict(shade=True),plot_kws=dict(s=10) )
# g.set(xticklabels=[])