### Import Required Libraries

We import the core Python libraries required for data manipulation, path handling, and exploratory analysis.

In [1]:
import pandas as pd
from pathlib import Path


### Loading the Training and Test Datasets

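The training and test datasets are loaded from the `data/` directory. The input paths are converted to absolute paths with `resolve()`; because they start from `../data`, they are still interpreted relative to the directory the notebook runs in.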
These datasets will be used for exploratory data analysis and feature engineering.

In [2]:

train_path = Path("../data/train.csv").resolve()
test_path = Path("../data/test.csv").resolve()

out_dir = Path("../data/processed")
out_dir.mkdir(parents=True, exist_ok=True)

train_out_path = out_dir / "train_cleaned.csv"
test_out_path = out_dir / "test_cleaned.csv"
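
Since the resolved paths still depend on the working directory, a fail-fast check before loading can give a clearer error than a failed `read_csv`. The following is a minimal sketch; `require_file` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path


def require_file(path: Path) -> Path:
    """Return the absolute path, or raise if the file does not exist."""
    resolved = path.resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Expected input file not found: {resolved}")
    return resolved
```

It could be used as `train_path = require_file(Path("../data/train.csv"))`.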


In [3]:
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

### Combine Training and Test Data for Consistent Preprocessing

To ensure consistent feature engineering and preprocessing, we temporarily combine the training and test datasets.
A flag (`is_train`) is added to allow safe separation later.

In [4]:
train_data['is_train'] = 1
test_data['is_train'] = 0

full_data = pd.concat([train_data, test_data], ignore_index=True)
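
The round trip through the combined frame can be illustrated on toy data. This is a sketch with made-up values; the `toy_*` names are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are illustrative).
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
toy_test = pd.DataFrame({"PassengerId": [3]})

toy_train["is_train"] = 1
toy_test["is_train"] = 0
toy_full = pd.concat([toy_train, toy_test], ignore_index=True)

# The flag recovers exactly the rows that came from each frame.
recovered_train = toy_full[toy_full["is_train"] == 1]
recovered_test = toy_full[toy_full["is_train"] == 0]

# Columns missing from one frame (here Survived) become NaN for its rows.
print(toy_full)
```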


### Previewing the Training Data

A quick preview helps verify that the data has loaded correctly and provides an initial sense of feature types and values.


In [5]:
train_data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | is_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 1 |


### Inspecting Dataset Structure and Data Types

The `info()` method provides a concise summary of the dataset, including:

1. Number of entries

2. Column names

3. Data types

4. Count of non-null values

Comparing each non-null count with the total number of entries shows which columns have missing values.

In [6]:
train_data.info()

<class 'pandas.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    str    
 4   Sex          891 non-null    str    
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    str    
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    str    
 11  Embarked     889 non-null    str    
 12  is_train     891 non-null    int64  
dtypes: float64(2), int64(6), str(5)
memory usage: 90.6 KB
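
The non-null counts above imply the missing counts by subtraction (for example, 891 - 714 = 177 missing `Age` values). The counts can also be obtained directly; a minimal sketch on toy data with illustrative values:

```python
import pandas as pd

# Toy frame with gaps similar to Age and Cabin (values are illustrative).
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

missing = toy.isna().sum()                   # missing values per column
missing_share = toy.isna().mean().round(2)   # fraction of rows missing

summary = pd.DataFrame({"missing": missing, "share": missing_share})
print(summary[summary["missing"] > 0])       # only columns with gaps
```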


### Statistical Summary of Numerical Features

The `describe()` method generates descriptive statistics for numerical columns, such as:

1. Mean

2. Standard deviation

3. Minimum and maximum values

4. Quartiles

This gives an overview of data distribution and potential anomalies.

In [7]:
train_data.describe()

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | is_train |
|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 | 1.0 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 | 0.0 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 | 1.0 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 1.0 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 | 1.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 1.0 |


### Identifying Missing Age Values

Here we filter the training data to show only the rows where the `Age` column is missing (NaN).
According to the `info()` output above, 177 of the 891 rows are affected (891 - 714 non-null), which motivates the imputation strategy used later.

In [8]:
train_data[train_data['Age'].isna()]

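|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | is_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 1 |
| 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S | 1 |
| 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C | 1 |
| 26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C | 1 |
| 28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 859 | 860 | 0 | 3 | Razi, Mr. Raihed | male | NaN | 0 | 0 | 2629 | 7.2292 | NaN | C | 1 |
| 863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S | 1 |
| 868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S | 1 |
| 878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S | 1 |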


### Extract Passenger Titles from Names

Passenger titles (e.g., Mr, Mrs, Miss) are extracted from the `Name` column using a regular expression that captures the text between the comma and the first period.
Titles often capture social status and correlate strongly with age and survival.

In [9]:
full_data['Title'] = full_data['Name'].str.extract(r',\s*([^\.]+)\.')
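
To see what the pattern captures, the same regular expression can be applied with the standard `re` module to a few names from the preview above. This is only an illustration; the notebook itself uses `str.extract`.

```python
import re

# Same pattern as in the cell above: text between ", " and the first "."
TITLE_PATTERN = re.compile(r",\s*([^\.]+)\.")

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
]

titles = [TITLE_PATTERN.search(name).group(1) for name in names]
print(titles)  # ['Mr', 'Miss', 'Mrs']
```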


### Grouping Rare Titles

Many titles appear very infrequently.
To reduce noise and dimensionality, we group uncommon titles into a single category called "Rare".

In [10]:
full_data['Title'] = full_data['Title'].where(
    full_data['Title'].isin(['Mr', 'Mrs', 'Master', 'Miss']),
    'Rare'
)
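
Which titles count as rare can be checked with `value_counts()` before grouping. A minimal sketch on an illustrative Series (the title values here are made up, not taken from the dataset):

```python
import pandas as pd

# Illustrative titles; the real column comes from the extraction step.
titles = pd.Series(["Mr", "Mr", "Miss", "Mrs", "Master", "Dr", "Rev", "Mr"])

print(titles.value_counts())  # inspect frequencies before grouping

# Keep the four common titles, replace everything else with "Rare".
common = ["Mr", "Mrs", "Master", "Miss"]
grouped = titles.where(titles.isin(common), "Rare")
print(grouped.value_counts())
```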


### Encoding Titles as Numerical Values

Most machine learning models require numerical inputs.
Here we map each title category to an integer code for model compatibility; the codes are arbitrary labels and carry no ordinal meaning.

In [11]:
title_mapping = {
    'Mr': 0,
    'Miss': 1,
    'Mrs': 2,
    'Master': 3,
    'Rare': 4
}

full_data['Title'] = full_data['Title'].map(title_mapping)


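# Safety net: after the grouping step every title should already be mapped;
# any value the mapping missed falls back to the 'Rare' code (4).
full_data['Title'] = full_data['Title'].fillna(4)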

### Impute Missing Age Values Using Training Data Only

Missing values in the `Age` column are filled with the mean age per title (rounded to two decimals), calculated only from the training rows.
This avoids data leakage while producing more realistic age estimates than a single global mean.

In [12]:
title_age_means = (
    full_data[full_data['is_train'] == 1]
    .groupby('Title')['Age']
    .mean()
)

for title, avg_age in title_age_means.items():
    full_data.loc[
        (full_data['Title'] == title) & (full_data['Age'].isna()),
        'Age'
    ] = round(avg_age, 2)
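
The loop above can also be written without explicit iteration by mapping each row's title to its training mean. The sketch below uses toy data and is intended to produce the same kind of fill; it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

toy = pd.DataFrame({
    "Title":    [0, 0, 0, 1, 1, 1],
    "Age":      [30.0, 40.0, None, 20.0, None, None],
    "is_train": [1, 1, 0, 1, 1, 0],
})

# Means come from training rows only; mean() skips NaN ages.
train_means = toy[toy["is_train"] == 1].groupby("Title")["Age"].mean().round(2)

# Map each row's title to its training mean, then fill only the gaps.
toy["Age"] = toy["Age"].fillna(toy["Title"].map(train_means))
print(toy)
```

Existing ages are left untouched because `fillna` only writes where `Age` is missing.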



### Validating the Sex Column

We check whether the `Sex` column of the training data contains only the expected values (`male`, `female`).
An empty result means every training row is valid; note that the test rows are not covered by this check.

In [13]:
train_data[~train_data['Sex'].isin(['male', 'female'])]


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | is_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


### Encoding Sex as Numerical Values

We convert the categorical Sex column into numeric form:

male → 0

female → 1

Most machine learning algorithms require numeric inputs, so this encoding is needed before modeling.

In [14]:
full_data['Sex'] = full_data['Sex'].map({
    'male': 0,
    'female': 1
})


### Exploring the Embarked Feature

We inspect the Embarked column, which represents the port where passengers boarded the ship.

In [15]:
train_data['Embarked']

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: str

### Validating Embarked Values

This step looks for training rows whose `Embarked` value falls outside the known categories. Missing values are caught as well, because NaN is not in the list:

S (Southampton)

C (Cherbourg)

Q (Queenstown)

In [16]:
train_data[~train_data['Embarked'].isin(['S','C','Q'])]

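|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | is_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 1 |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 1 |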


### Handling Missing Embarked Values

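Missing values in `Embarked` are filled with the constant 'C' (Cherbourg). Note that 'C' is not the most frequent port in the training data; 'S' (Southampton) is, so this is a fixed-value fill rather than a mode imputation.
Either way, no missing values remain in this feature afterwards.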

In [17]:
full_data['Embarked'] = full_data['Embarked'].fillna('C')
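
The most frequent port can also be computed from the training rows instead of being hard-coded. A minimal sketch on toy data, assuming the same `is_train` flag as in the combined frame; the values are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", None, "Q", "S", None],
    "is_train": [1, 1, 1, 1, 0, 0, 0],
})

# Most frequent port among training rows; mode() ignores NaN by default.
train_mode = toy.loc[toy["is_train"] == 1, "Embarked"].mode().iloc[0]

# Fill gaps in both train and test rows with the training mode.
toy["Embarked"] = toy["Embarked"].fillna(train_mode)
```

On the real data this would pick whichever port is most common in the training rows, which may differ from the constant used above.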


### Split the Combined Dataset Back into Train and Test

After preprocessing, we separate the combined dataset back into cleaned training and test datasets.
The helper column `is_train` is removed.

In [18]:
train_cleaned = full_data[full_data['is_train'] == 1].drop(columns=['is_train'])
test_cleaned = full_data[full_data['is_train'] == 0].drop(columns=['is_train'])



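### Remove Survived from the Test Set

The test data carries no survival labels, so after the concatenation its rows hold only NaN in `Survived`. We drop the column from the test set.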

In [19]:
test_cleaned = test_cleaned.drop(columns=['Survived'])
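
One side effect of the concatenation is worth knowing: because the test rows hold NaN in `Survived`, pandas stores the whole column as `float64`, and the training labels stay floats after the split. The sketch below shows this on toy data and one optional way to restore the integer type; the `toy_*` names are illustrative.

```python
import pandas as pd

toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "is_train": 1})
toy_test = pd.DataFrame({"PassengerId": [3], "is_train": 0})
toy_full = pd.concat([toy_train, toy_test], ignore_index=True)

print(toy_full["Survived"].dtype)  # float64, because test rows hold NaN

train_part = toy_full[toy_full["is_train"] == 1].drop(columns=["is_train"])
test_part = toy_full[toy_full["is_train"] == 0].drop(columns=["is_train", "Survived"])

# Optional: restore the integer label type on the training rows.
train_part = train_part.astype({"Survived": "int64"})
```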


### Handling Missing Fare Values in Test Set

Missing `Fare` values in the test dataset are imputed with the median fare from the training dataset, which prevents data leakage.


In [20]:
fare_median = train_cleaned['Fare'].median()
test_cleaned['Fare'] = test_cleaned['Fare'].fillna(fare_median)


### Final Data Validation

As a final check, we count the remaining missing values in each dataset. Only `Cabin` still has gaps (687 in train, 327 in test); it is left untreated in this notebook.

In [21]:
print("Train missing values:")
print(train_cleaned.isna().sum())

print("\nTest missing values:")
print(test_cleaned.isna().sum())



Train missing values:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
Title            0
dtype: int64

Test missing values:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
Title            0
dtype: int64
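
The visual check above can be turned into an assertion so that a later change to the pipeline fails loudly. This is a sketch; `columns_with_missing` is a hypothetical helper, and treating `Cabin` as the only allowed column is an assumption based on the output above.

```python
import pandas as pd


def columns_with_missing(df: pd.DataFrame) -> list[str]:
    """Return the names of columns that contain at least one missing value."""
    counts = df.isna().sum()
    return counts[counts > 0].index.tolist()


# Toy frame: only Cabin has a gap (values are illustrative).
toy = pd.DataFrame({"Age": [22.0, 35.0], "Cabin": [None, "C85"]})

unexpected = set(columns_with_missing(toy)) - {"Cabin"}
assert not unexpected, f"Unexpected missing values in: {sorted(unexpected)}"
```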


### Export Cleaned Datasets to CSV Files

The cleaned datasets are saved as CSV files for use in modeling and experimentation.

In [22]:
train_cleaned.to_csv(train_out_path, index=False)
test_cleaned.to_csv(test_out_path, index=False)
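
A quick round trip confirms that a file written with `index=False` reloads with the same columns and shape. The sketch below uses a temporary directory and toy values rather than the real output paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.2833]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_cleaned.csv"
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# With index=False, no extra index column appears on reload.
same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape
```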
