# 2110446 DATA SCIENCE AND DATA ENGINEERING

## **Unit 02:** Data Preparation

- **Problem:** Modified Titanic (`02_dataprep_01_2025s2`)
- **Author:** Worralop Srichainont
- **Year:** 2025 (Semester 2)

## Dependencies

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Data Resources
File URL

In [2]:
TITANIC_URL = "https://raw.githubusercontent.com/reisenx/2110446-DATA-SCI-ENG/refs/heads/main/02-Data-Preparation/Grader/02_dataprep_01_2025s2/code/titanic_to_student.csv"

Load the file, using the first column as the index.

In [3]:
TITANIC_DF = pd.read_csv(TITANIC_URL, index_col=0)
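
To illustrate what `index_col=0` does, here is a minimal sketch that reads a made-up in-memory CSV (not the Titanic file) with and without the option. The CSV text and the variable names are assumptions for demonstration only.

```python
import io

import pandas as pd

# Hypothetical CSV whose first column is an unnamed row index.
CSV_TEXT = ",Name,Fare\n0,Alice,7.25\n1,Bob,71.28\n"

# Without index_col, the unnamed first column becomes a regular column.
plain_df = pd.read_csv(io.StringIO(CSV_TEXT))

# With index_col=0, the first column is used as the row index instead.
indexed_df = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)

print(list(plain_df.columns))
print(list(indexed_df.columns))
```

Without the option, pandas keeps the index values as an extra column named `Unnamed: 0`, which would then count as one more variable in the later questions.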

Display the Titanic `DataFrame`.

In [4]:
TITANIC_DF.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,2,1.0,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1.0,0,PC 17599,71.2833,C85,C
1,4,1.0,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1.0,0,113803,53.1,C123,S
2,6,0.0,3.0,"Moran, Mr. James",male,,0.0,0,330877,8.4583,,Q
3,8,0.0,3.0,"Palsson, Master. Gosta Leonard",male,2.0,3.0,1,349909,21.075,,S
4,10,1.0,2.0,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1.0,0,237736,30.0708,,C
5,12,1.0,1.0,"Bonnell, Miss. Elizabeth",female,58.0,0.0,0,113783,26.55,C103,S
6,14,0.0,3.0,"Andersson, Mr. Anders Johan",male,39.0,1.0,5,347082,31.275,,S
7,16,1.0,2.0,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0.0,0,248706,16.0,,S
8,18,1.0,2.0,"Williams, Mr. Charles Eugene",male,,0.0,0,244373,13.0,,S
9,20,1.0,3.0,"Masselmani, Mrs. Fatima",female,,0.0,0,2649,7.225,,C


# Problem `Q1`
How many rows are there in `titanic_to_student.csv`?

In [5]:
dataset_rows = TITANIC_DF.shape[0]
print(f"There are {dataset_rows} rows in the original dataset.")

There are 445 rows in the original dataset.


# Problem `Q2`

## Problem `Q2-1`
Drop variables with more than 50% missing values.

To drop columns based on their missing values, use the `.dropna()` method.
- `axis=1` drops columns instead of rows.
- `thresh` is the minimum number of non-missing values a column must have in order to be kept.
- `inplace=True` modifies the `DataFrame` directly instead of returning a new one.

In [6]:
# Create a deep copy of the DataFrame.
cleaned_titanic_df = TITANIC_DF.copy()

# Count the rows.
dataset_rows = cleaned_titanic_df.shape[0]

# A column is kept only if it has at least this many non-missing values.
drop_threshold = 0.5 * dataset_rows

# Drop columns with more than 50% missing values.
cleaned_titanic_df.dropna(axis=1, thresh=drop_threshold, inplace=True)
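
Because `thresh` counts non-missing values, it can help to see the rule on a tiny table. The sketch below uses a made-up `toy_df` (an assumption, not the Titanic data) and rounds the threshold up with `math.ceil`, which is one possible way to get an integer threshold that still drops only columns with strictly more than 50% missing values.

```python
import math

import pandas as pd

# Hypothetical 4-row table with 0%, 50% and 75% missing values.
toy_df = pd.DataFrame(
    {
        "full": [1, 2, 3, 4],
        "half_missing": [1.0, None, 3.0, None],
        "mostly_missing": [None, None, None, 4.0],
    }
)

# Minimum number of non-missing values a column needs to survive.
min_non_missing = math.ceil(0.5 * len(toy_df))

# Keep only columns with at least `min_non_missing` non-missing values.
kept_df = toy_df.dropna(axis=1, thresh=min_non_missing)

print(list(kept_df.columns))
```

The column with exactly 50% missing values is kept, because the task only asks to drop columns with more than 50% missing.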

## Problem `Q2-2`
Check every column except `Age` and `Fare` for flat values, and drop the columns where a single value makes up more than 70% of the rows.

To count the values in a column, use `value_counts()` with `dropna=False` so that missing values are counted as well.

In [7]:
EXAMPLE_COLUMN = "Parch"
cleaned_titanic_df[EXAMPLE_COLUMN].value_counts(dropna=False)

Parch
0    339
1     57
2     43
5      3
4      2
3      1
Name: count, dtype: int64

Next, use `.iloc[0]` to take only the first entry, regardless of its label. This is the highest count, because `value_counts()` sorts the counts in descending order.

In [8]:
highest_count = cleaned_titanic_df[EXAMPLE_COLUMN].value_counts(dropna=False).iloc[0]
print(f"The highest count of '{EXAMPLE_COLUMN}' column is {highest_count}.")

The highest count of 'Parch' column is 339.


So the logic is to iterate over each column:
- Find the count of each value in the current column.
- Choose only the highest count.
- If the highest count exceeds the threshold, append the column name to a list.

In [9]:
# Ignore these columns.
IGNORE_COLUMNS = ("Age", "Fare")

# Initialize a list to store column names to drop.
drop_columns = []

# Calculate drop threshold.
drop_threshold = 0.7 * dataset_rows

# Iterate each column.
for column_name in cleaned_titanic_df.columns:
    # Skip the ignored columns.
    if column_name in IGNORE_COLUMNS:
        continue

    # Calculate the highest count of the current columns.
    highest_count = cleaned_titanic_df[column_name].value_counts(dropna=False).iloc[0]

    # If the count exceeds the threshold, append the column name to a list.
    if highest_count > drop_threshold:
        drop_columns.append(column_name)

# Display column to drop
for column_name in drop_columns:
    print(f"- Drop '{column_name}' column.")

- Drop 'Parch' column.


Drop columns from the list.

In [10]:
cleaned_titanic_df.drop(columns=drop_columns, inplace=True)
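
The same check can also be written with relative frequencies instead of raw counts. The helper below is a hypothetical sketch (the name `find_flat_columns` and the toy data are assumptions); it relies on `value_counts(normalize=True)`, which returns each value's share of the rows.

```python
import pandas as pd


def find_flat_columns(df, threshold=0.7, ignore=()):
    """Return columns whose most frequent value exceeds `threshold` of rows."""
    flat_columns = []
    for column_name in df.columns:
        if column_name in ignore:
            continue
        # Share of the most frequent value, counting missing values too.
        top_share = (
            df[column_name].value_counts(dropna=False, normalize=True).iloc[0]
        )
        if top_share > threshold:
            flat_columns.append(column_name)
    return flat_columns


# Hypothetical table: two flat columns (one flat because of missing values).
toy_df = pd.DataFrame(
    {
        "mostly_zero": [0, 0, 0, 0, 1],
        "varied": [1, 2, 3, 4, 5],
        "mostly_missing": [None, None, None, None, "x"],
    }
)

print(find_flat_columns(toy_df))
```

Comparing shares against `0.7` avoids having to carry the row count around, at the cost of one extra division per column.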

From `Q2-1` and `Q2-2`, how many columns do we have left?

In [11]:
dataset_cols = cleaned_titanic_df.shape[1]
print(f"There are {dataset_cols} columns in the cleaned dataset.")

There are 10 columns in the cleaned dataset.


# Problem `Q3`
Remove all rows with missing targets (`Survived`).
How many rows do we have left?

Example rows with a missing target.

In [12]:
missing_titanic_df = TITANIC_DF[TITANIC_DF["Survived"].isna()]

DISPLAY_COLUMNS = ["Name", "Age", "Sex", "Survived"]
missing_titanic_df[DISPLAY_COLUMNS]

Unnamed: 0,Name,Age,Sex,Survived
29,"Goodwin, Master. William Frederick",11.0,male,
30,"Icard, Miss. Amelie",38.0,female,
31,"Skoog, Master. Harald",4.0,male,
32,"Moubarek, Master. Gerios",,male,
33,"Crease, Mr. Ernest James",19.0,male,
34,"Kink, Mr. Vincenz",26.0,male,
35,"Goodwin, Miss. Lillian Amy",16.0,female,
36,"Chronopoulos, Mr. Apostolos",26.0,male,
37,"Moen, Mr. Sigurd Hansen",25.0,male,
38,"Moutal, Mr. Rahamin Haim",,male,


In [13]:
# Remove all rows with missing Survived column.
cleaned_titanic_df = TITANIC_DF.dropna(subset=["Survived"])

# Output amount of rows of the cleaned dataset.
dataset_rows = cleaned_titanic_df.shape[0]
print(f"There are {dataset_rows} rows in the cleaned dataset.")

There are 432 rows in the cleaned dataset.
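
As a small sketch of how `subset` narrows the check, the made-up table below (an assumption, not the Titanic data) has missing values in two columns, but only rows with a missing `Survived` are removed.

```python
import pandas as pd

# Hypothetical table with gaps in both the target and a feature.
toy_df = pd.DataFrame(
    {
        "Survived": [1.0, None, 0.0],
        "Age": [None, 30.0, 22.0],
    }
)

# Only the target column decides whether a row is dropped.
labelled_df = toy_df.dropna(subset=["Survived"])

print(labelled_df.index.tolist())
```

Row `0` survives even though its `Age` is missing, because `Age` is not listed in `subset`.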


# Problem `Q4`
Handle outliers in the `Fare` column.

Calculate the IQR of the `Fare` column.

In [14]:
# Calculate quantile
q1 = TITANIC_DF["Fare"].quantile(0.25)
q3 = TITANIC_DF["Fare"].quantile(0.75)

# Calculate IQR
iqr = q3 - q1

# Display value
print(f"Q1: {q1}")
print(f"Q3: {q3}")
print(f"IQR: {iqr}")

Q1: 7.925
Q3: 34.375
IQR: 26.45


Calculate the lower and upper bounds.

In [15]:
# Calculate lower bound.
lower_bound = q1 - (1.5 * iqr)

# Calculate upper bound
upper_bound = q3 + (1.5 * iqr)

# Display value
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")

Lower bound: -31.749999999999996
Upper bound: 74.05
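
The 1.5 × IQR rule is easier to follow on a handful of numbers. This is a minimal sketch on a made-up series (`toy_fares` and the other `toy_` names are assumptions), using the same quantile and bound formulas as above.

```python
import pandas as pd

# Hypothetical fares with one obviously extreme value.
toy_fares = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Quartiles and interquartile range.
toy_q1 = toy_fares.quantile(0.25)
toy_q3 = toy_fares.quantile(0.75)
toy_iqr = toy_q3 - toy_q1

# Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers.
toy_lower = toy_q1 - 1.5 * toy_iqr
toy_upper = toy_q3 + 1.5 * toy_iqr

is_outlier = (toy_fares < toy_lower) | (toy_fares > toy_upper)
print(toy_lower, toy_upper, toy_fares[is_outlier].tolist())
```

A negative lower bound, as in the `Fare` result above, simply means that no fare can be a low outlier, since fares are never negative.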


Example outlier rows.

In [16]:
outlier_titanic_df = cleaned_titanic_df[
    (cleaned_titanic_df["Fare"] < lower_bound)
    | (cleaned_titanic_df["Fare"] > upper_bound)
]

DISPLAY_COLUMNS = ["Name", "Sex", "Age", "Fare"]
outlier_titanic_df[DISPLAY_COLUMNS].head(10)

Unnamed: 0,Name,Sex,Age,Fare
13,"Fortune, Mr. Charles Alexander",male,19.0,263.0
15,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,146.5208
69,"Giglio, Mr. Victor",male,24.0,79.2
97,"Lurette, Miss. Elise",female,58.0,146.5208
107,"Newell, Miss. Madeleine",female,31.0,113.275
122,"Minahan, Dr. William Edward",male,44.0,90.0
128,"Cherry, Miss. Gladys",female,30.0,86.5
134,"Bissette, Miss. Amelia",female,35.0,135.6333
137,"Andrews, Miss. Kornelia Theodosia",female,63.0,77.9583
145,"Bishop, Mrs. Dickinson H (Helen Walton)",female,19.0,91.0792


Replace each outlier with the nearest bound (the lower bound or the upper bound).
- Use `.loc[row_names, col_names]` to access the data.
- `row_names` selects the outlier rows.
- `col_names` selects the `Fare` column.

In [17]:
# Create a deep copy of the DataFrame.
cleaned_titanic_df = TITANIC_DF.copy()

# Handle low outlier.
low_outlier_rows = cleaned_titanic_df["Fare"] < lower_bound
cleaned_titanic_df.loc[low_outlier_rows, "Fare"] = lower_bound

# Handle high outlier.
high_outlier_rows = cleaned_titanic_df["Fare"] > upper_bound
cleaned_titanic_df.loc[high_outlier_rows, "Fare"] = upper_bound
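
An alternative to the two `.loc` assignments is `Series.clip()`, which caps values at both bounds in one call. The sketch below uses made-up fares and rounded bounds similar to the ones printed earlier; the `toy_` names are assumptions.

```python
import pandas as pd

# Hypothetical fares: one below, two inside and one above the bounds.
toy_fares = pd.Series([-50.0, 10.0, 30.0, 200.0])

# Illustrative bounds (rounded versions of the values printed earlier).
toy_lower = -31.75
toy_upper = 74.05

# clip() replaces values below/above the bounds with the bounds themselves.
clipped_fares = toy_fares.clip(lower=toy_lower, upper=toy_upper)

print(clipped_fares.tolist())
```

`clip()` returns a new `Series`, so the result has to be assigned back to the column if the `DataFrame` should be updated.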

Calculate the new mean value.

In [18]:
# Calculate average value.
avg_fare = cleaned_titanic_df["Fare"].mean()
print(f"The new average fare is {round(avg_fare, 2)} dollars.")

The new average fare is 26.27 dollars.


# Problem `Q5`

What is the average (mean) of `Age` after imputing the missing values (rounded to 2 decimal places)?

Calculate the mean of the `Age` column.

In [19]:
average_age = TITANIC_DF["Age"].mean()

Example rows with a missing `Age`.

In [20]:
missing_titanic_df = TITANIC_DF[TITANIC_DF["Age"].isna()]

DISPLAY_COLUMNS = ["Name", "Sex", "Age"]
missing_titanic_df[DISPLAY_COLUMNS].head(10)

Unnamed: 0,Name,Sex,Age
2,"Moran, Mr. James",male,
8,"Williams, Mr. Charles Eugene",male,
9,"Masselmani, Mrs. Fatima",female,
14,"Todoroff, Mr. Lalio",male,
15,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,
22,"Rogers, Mr. William John",male,
23,,female,
27,"Woolner, Mr. Hugh",male,
32,"Moubarek, Master. Gerios",male,
38,"Moutal, Mr. Rahamin Haim",male,


Impute the missing values with the average age.

In [21]:
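# Create a deep copy of the DataFrame.
cleaned_titanic_df = TITANIC_DF.copy()

# Fill the missing values with the mean and assign the result back,
# because fillna() returns a new Series instead of modifying the column.
cleaned_titanic_df["Age"] = cleaned_titanic_df["Age"].fillna(average_age)

# Display the imputed column.
cleaned_titanic_df["Age"]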

0      38.000000
1      35.000000
2      29.136361
3       2.000000
4      14.000000
         ...    
440    33.000000
441    28.000000
442    39.000000
443    19.000000
444    26.000000
Name: Age, Length: 445, dtype: float64

Calculate the new average age.

In [22]:
average_age = cleaned_titanic_df["Age"].mean()
print(f"The new average age is {round(average_age, 2)}.")

The new average age is 29.14.
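
Two details are worth checking on a small example: `fillna()` returns a new `Series` rather than changing the original, and imputing with the mean leaves the mean itself unchanged. The sketch below uses a made-up `toy_ages` series (an assumption, not the Titanic data).

```python
import pandas as pd

# Hypothetical ages with one missing value.
toy_ages = pd.Series([10.0, 20.0, None, 30.0])

# mean() skips missing values by default.
toy_mean = toy_ages.mean()

# Calling fillna() without assigning the result changes nothing.
toy_ages.fillna(toy_mean)
still_missing = int(toy_ages.isna().sum())

# Assigning the result gives the imputed series.
imputed_ages = toy_ages.fillna(toy_mean)

print(still_missing, imputed_ages.tolist(), imputed_ages.mean())
```

Since every gap is filled with the mean, the sum grows by exactly the mean for each added value, so the average stays where it was. That is why the new average age equals the original mean of `Age`.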


# Problem `Q6`

Convert categorical values to numeric values.

For the variable `Embarked`, perform dummy coding.
What is the average (mean) of `Embarked_Q` after performing dummy coding (rounded to 2 decimal places)?

To perform one-hot encoding, use `pd.get_dummies()`.
- `prefix` is the prefix of the new column names (e.g. `Embarked`).
- `drop_first=True` creates `k-1` columns out of `k` categories.

In this case, we leave `drop_first=False` (the default) to see how one-hot encoding works.

In [23]:
TITANIC_DF["Embarked"].head(10)

0    C
1    S
2    Q
3    S
4    C
5    S
6    S
7    S
8    S
9    C
Name: Embarked, dtype: object

In [24]:
# Perform one-hot encoding
titanic_one_hot_df = pd.get_dummies(TITANIC_DF["Embarked"], prefix="Embarked")

# Display result of one-hot encoding
titanic_one_hot_df.head(10)

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,True,False,False
1,False,False,True
2,False,True,False
3,False,False,True
4,True,False,False
5,False,False,True
6,False,False,True
7,False,False,True
8,False,False,True
9,True,False,False
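
For comparison, here is a minimal sketch of what `drop_first=True` would produce. The `toy_ports` series is made up for illustration; only the `pd.get_dummies()` parameters come from the explanation above.

```python
import pandas as pd

# Hypothetical embarkation ports.
toy_ports = pd.Series(["C", "S", "Q", "S"], name="Embarked")

# One-hot encoding: one column per category (k columns).
one_hot_df = pd.get_dummies(toy_ports, prefix="Embarked")

# Dummy coding: the first category is dropped (k-1 columns).
dummy_df = pd.get_dummies(toy_ports, prefix="Embarked", drop_first=True)

print(list(one_hot_df.columns))
print(list(dummy_df.columns))
```

With `drop_first=True`, a passenger from port `C` is represented by `False` in every remaining column. The `Embarked_Q` column is identical in both variants, so its mean does not depend on this choice.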


Concatenate the one-hot encoding columns into the `DataFrame`.

In [25]:
cleaned_titanic_df = pd.concat([TITANIC_DF, titanic_one_hot_df], axis=1)

DISPLAY_COLUMNS = ["Name", "Sex", "Age", "Embarked_C", "Embarked_Q", "Embarked_S"]
cleaned_titanic_df[DISPLAY_COLUMNS].head(10)

Unnamed: 0,Name,Sex,Age,Embarked_C,Embarked_Q,Embarked_S
0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,True,False,False
1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,False,False,True
2,"Moran, Mr. James",male,,False,True,False
3,"Palsson, Master. Gosta Leonard",male,2.0,False,False,True
4,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,True,False,False
5,"Bonnell, Miss. Elizabeth",female,58.0,False,False,True
6,"Andersson, Mr. Anders Johan",male,39.0,False,False,True
7,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,False,False,True
8,"Williams, Mr. Charles Eugene",male,,False,False,True
9,"Masselmani, Mrs. Fatima",female,,True,False,False


Calculate the mean of `Embarked_Q` column.

Note that `False` and `True` are equivalent to `0` and `1` respectively.

In [26]:
mean_embarked_q = cleaned_titanic_df["Embarked_Q"].mean()
print(f"The mean of Embarked_Q column is {round(mean_embarked_q, 2)}")

The mean of Embarked_Q column is 0.06
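
As a quick check of that equivalence, the sketch below takes the mean of a made-up boolean series (`toy_flags` is an assumption) and compares it with the mean of the same values cast to integers.

```python
import pandas as pd

# Hypothetical indicator column: two True values out of four.
toy_flags = pd.Series([True, False, False, True])

# The mean of a boolean column is the share of True values.
share_true = toy_flags.mean()

# Casting to int first gives the same result.
share_from_ints = toy_flags.astype(int).mean()

print(share_true, share_from_ints)
```

So the mean of `Embarked_Q` can be read directly as the proportion of passengers who embarked at port `Q`.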


# Problem `Q7`
Perform a stratified train/test split with a `70% : 30%` ratio and random seed `123`.

Show the proportion between survived (`1`) and died (`0`) in all datasets (`total data`, `train`, `test`).

What is the proportion of survivors (survived = 1) in the training data (rounded to 2 decimal places)?

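Missing values must be handled before splitting, because they cause an error.

For numeric columns, fill the missing values with the column mean. Note that `Survived` is itself a numeric column, so its 13 missing targets are also filled with the mean (`0.416667`) and stay in the dataset as a third value alongside `0` and `1`.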

In [27]:
# Calculate the mean of numeric columns.
numeric_means = TITANIC_DF.select_dtypes(include="number").mean()

# Display the mean of numeric columns.
numeric_means

PassengerId    446.000000
Survived         0.416667
Pclass           2.265700
Age             29.136361
SibSp            0.516355
Parch            0.379775
Fare            34.238473
dtype: float64

In [28]:
# Fill missing value on numeric columns with mean value.
cleaned_titanic_df = TITANIC_DF.fillna(numeric_means)
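
Passing a `Series` to `DataFrame.fillna()` fills each column with the value stored under its name. The sketch below shows this on a made-up table (`toy_df` is an assumption), including the effect on the target column and one possible way to leave the target untouched.

```python
import pandas as pd

# Hypothetical table with gaps in the target, a feature and a text column.
toy_df = pd.DataFrame(
    {
        "Survived": [1.0, 0.0, None],
        "Age": [20.0, None, 40.0],
        "Name": ["A", None, "C"],
    }
)

# Means of the numeric columns only (Survived: 0.5, Age: 30.0).
toy_means = toy_df.select_dtypes(include="number").mean()

# Every numeric column, including the target, is filled with its mean.
filled_all = toy_df.fillna(toy_means)

# Dropping the target from the means leaves its missing values in place.
filled_features = toy_df.fillna(toy_means.drop("Survived"))

print(filled_all)
print(filled_features)
```

Text columns such as `Name` are not in `toy_means`, so their missing values are left as they are in both cases.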

Separate features and targets from a dataset.
- **Features** means all columns except the target.
- **Target** means the `Survived` column.

In [29]:
features = cleaned_titanic_df.drop(columns=["Survived"])
target = cleaned_titanic_df["Survived"]

Split the dataset into train and test datasets with a 7:3 ratio using stratification.
- Stratification ensures that the train and test datasets keep the same `survived` and `died` proportions.

To split the dataset, use the `train_test_split()` function from `sklearn.model_selection`.
- `test_size` is the fraction of the data assigned to the test dataset.
- `random_state` is the random seed, which makes the split reproducible.
- Pass the target to the `stratify` parameter.

In [30]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.3, random_state=123, stratify=target
)
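
To see what stratification guarantees, here is a minimal sketch on made-up data with a known 40% positive rate. The names `toy_features`, `toy_target` and the `x_`/`y_` variables are assumptions; the `train_test_split()` arguments mirror the call above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 rows, 40 positives and 60 negatives.
toy_features = pd.DataFrame({"x": range(100)})
toy_target = pd.Series([1] * 40 + [0] * 60)

x_train, x_test, y_train, y_test = train_test_split(
    toy_features, toy_target, test_size=0.3, random_state=123, stratify=toy_target
)

# Both parts keep the 40% positive rate of the full dataset.
print(len(y_train), len(y_test))
print(y_train.mean(), y_test.mean())
```

Without `stratify`, the positive rate in each part would vary with the random seed instead of matching the full dataset.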

Display the number of rows in the train and test datasets.

In [31]:
train_dataset_rows = target_train.shape[0]
test_dataset_rows = target_test.shape[0]

print(f"The train dataset has {train_dataset_rows} rows.")
print(f"The test dataset has {test_dataset_rows} rows.")

The train dataset has 311 rows.
The test dataset has 134 rows.


Calculate the proportion of survivors in the training data.

In [32]:
# Calculate the proportion
train_dataset_rows = target_train.shape[0]
train_dataset_survivors = (target_train == 1).sum()
proportions = train_dataset_survivors / train_dataset_rows

# Display the value.
print(f"There are {train_dataset_survivors} survivors in the train dataset.")
print(f"The train dataset has {train_dataset_rows} rows.")
print(f"The proportion of survivors in the train dataset is {round(proportions, 2)}.")

There are 126 survivors in the train dataset.
The train dataset has 311 rows.
The proportion of survivors in the train dataset is 0.41.
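
The task also asks for the proportions in all three datasets. One possible way to show them side by side is `value_counts(normalize=True)`; the sketch below uses made-up targets (the `toy_` series are assumptions and stand in for the full, train and test targets).

```python
import pandas as pd

# Hypothetical targets standing in for the total, train and test datasets.
toy_total = pd.Series([0, 0, 0, 1, 1] * 2)
toy_train = pd.Series([0, 0, 0, 1, 1])
toy_test = pd.Series([0, 0, 0, 1, 1])

# One column of proportions per dataset, one row per class.
toy_summary = pd.DataFrame(
    {
        "total": toy_total.value_counts(normalize=True),
        "train": toy_train.value_counts(normalize=True),
        "test": toy_test.value_counts(normalize=True),
    }
)

print(toy_summary.round(2))
```

Each column sums to 1, and the row for class `1` is the proportion of survivors in that dataset.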
