### Import Required Libraries

We import pandas for data manipulation, along with the project's path constants for the raw and cleaned datasets.

In [None]:
import pandas as pd
from ml.src.config.paths import (
    RAW_TRAIN_PATH,
    RAW_TEST_PATH,
    CLEANED_TRAIN_PATH,
    CLEANED_TEST_PATH,
)


### Loading the Training and Test Datasets

The training and test datasets are loaded from the `data/` directory. Their locations come from the imported path constants rather than relative paths, so loading does not depend on the current working directory.
These datasets will be used for exploratory data analysis and feature engineering.

In [None]:
train_data = pd.read_csv(RAW_TRAIN_PATH)
test_data = pd.read_csv(RAW_TEST_PATH)

### Combine Training and Test Data for Consistent Preprocessing

To ensure consistent feature engineering and preprocessing, we temporarily combine the training and test datasets.
A flag (`is_train`) is added to allow safe separation later.

In [None]:
train_data['is_train'] = 1
test_data['is_train'] = 0

full_data = pd.concat([train_data, test_data], ignore_index=True)
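
A minimal sketch of what the concatenation does, using two tiny made-up frames (the names `toy_train`, `toy_test`, and all values are illustrative, not taken from the dataset). A column that exists in only one frame, such as `Survived`, is filled with NaN for the rows of the other frame.

```python
import pandas as pd

# Toy frames standing in for the real data (values are hypothetical).
toy_train = pd.DataFrame({'Age': [22.0, 38.0], 'Survived': [0, 1]})
toy_test = pd.DataFrame({'Age': [34.0, None]})

# Flag each row's origin so the frames can be separated again later.
toy_train['is_train'] = 1
toy_test['is_train'] = 0

toy_full = pd.concat([toy_train, toy_test], ignore_index=True)

# Test rows have no target, so concat fills Survived with NaN there.
n_missing_target = int(toy_full['Survived'].isna().sum())
print(toy_full)
print('Rows without a target:', n_missing_target)
```

This is why the target column has to be dropped from the test split once the combined frame is separated again.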


### Previewing the Training Data

A quick preview helps verify that the data has loaded correctly and provides an initial sense of feature types and values.


In [None]:
train_data.head()

### Inspecting Dataset Structure and Data Types

The `info()` method provides a concise summary of the dataset, including:

1. Number of entries
2. Column names
3. Data types
4. Count of non-null values

A non-null count lower than the number of entries indicates missing values in that column.

In [None]:
train_data.info()

### Statistical Summary of Numerical Features

The `describe()` method generates descriptive statistics for numerical columns, such as:

1. Mean
2. Standard deviation
3. Minimum and maximum values
4. Quartiles

This gives an overview of the data distribution and helps spot potential anomalies.

In [None]:
train_data.describe()

### Identifying Missing Age Values

Here we filter the dataset to show only rows where the Age column has missing values (NaN).
This helps us understand how many records are affected and plan an imputation strategy.

In [None]:
train_data[train_data['Age'].isna()]
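
To quantify how many records are affected, the filtered view can be complemented with a per-column count and share of missing values. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical data with gaps in two columns.
toy = pd.DataFrame({
    'Age': [22.0, None, 30.0, None],
    'Fare': [7.25, 8.05, None, 13.0],
})

# isna() yields booleans; sum() counts them and mean() gives the share.
summary = pd.DataFrame({
    'missing': toy.isna().sum(),
    'share': toy.isna().mean(),
})
print(summary)
```

The same two calls applied to `train_data` would show how much of `Age` needs imputation.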

### Extract Passenger Titles from Names

Passenger titles (e.g., Mr, Mrs, Miss) are extracted from the Name column using a regular expression.
Titles often capture social status and correlate strongly with age and survival.

In [None]:
# expand=False returns a Series rather than a one-column DataFrame.
full_data['Title'] = full_data['Name'].str.extract(r',\s*([^\.]+)\.', expand=False)
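
As a quick illustration of what the pattern captures, here it is applied to a few invented names written in the same "Surname, Title. Given names" layout (the names are hypothetical). The group matches everything between the comma and the first period.

```python
import pandas as pd

# Invented names following the "Surname, Title. Given names" layout.
names = pd.Series([
    'Smith, Mr. John',
    'Doe, Mrs. Jane (Mary A.)',
    'Roe, the Countess. of (Lucy)',
])

# Capture the text after the comma up to the first period.
titles = names.str.extract(r',\s*([^\.]+)\.', expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Mrs', 'the Countess']
```

Multi-word titles such as the last one are captured whole, which is one reason uncommon titles need grouping afterwards.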


### Grouping Rare Titles

Many titles appear very infrequently.
To reduce noise and dimensionality, we group uncommon titles into a single category called "Rare".

In [None]:
# Keep the four common titles; replace everything else with 'Rare'.
full_data['Title'] = full_data['Title'].where(
    full_data['Title'].isin(['Mr', 'Mrs', 'Master', 'Miss']),
    'Rare'
)
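
A small sketch of the grouping step on made-up titles (the list below is illustrative). Checking `value_counts()` before and after makes it easy to see which titles were folded into "Rare".

```python
import pandas as pd

# Hypothetical extracted titles, including two uncommon ones.
titles = pd.Series(['Mr', 'Miss', 'Dr', 'Mrs', 'Rev', 'Master', 'Mr'])
common = ['Mr', 'Mrs', 'Master', 'Miss']

print(titles.value_counts())

# where() keeps values where the condition holds and substitutes 'Rare' otherwise.
grouped = titles.where(titles.isin(common), 'Rare')
print(grouped.value_counts())
```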


### Encoding Titles as Numerical Values

Most machine learning models require numerical inputs.
Here we map each title category to an integer value for model compatibility.

In [None]:
title_mapping = {
    'Mr': 0,
    'Miss': 1,
    'Mrs': 2,
    'Master': 3,
    'Rare': 4
}

full_data['Title'] = full_data['Title'].map(title_mapping)

# Safeguard: any title missing from the mapping is treated as 'Rare' (4).
full_data['Title'] = full_data['Title'].fillna(title_mapping['Rare'])

### Impute Missing Age Values Using Training Data Only

Missing values in the Age column are filled using the average age per title, calculated only from the training data.
This avoids data leakage while producing more realistic age estimates.

In [None]:
title_age_means = (
    full_data[full_data['is_train'] == 1]
    .groupby('Title')['Age']
    .mean()
)

for title, avg_age in title_age_means.items():
    full_data.loc[
        (full_data['Title'] == title) & (full_data['Age'].isna()),
        'Age'
    ] = round(avg_age, 2)
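
The same idea can be written without a loop by mapping each row's title to its training-set mean. This is a sketch on made-up data, not a replacement for the cell above: it skips the rounding, and the fallback to the overall training mean for titles unseen in training is an added assumption.

```python
import pandas as pd

# Hypothetical combined frame: three training rows, three test rows.
toy = pd.DataFrame({
    'Title': [0, 0, 1, 0, 1, 2],
    'Age': [20.0, 30.0, 10.0, None, None, None],
    'is_train': [1, 1, 1, 0, 0, 0],
})

# Statistics come from training rows only, to avoid leakage.
train_rows = toy[toy['is_train'] == 1]
title_age_means = train_rows.groupby('Title')['Age'].mean()
overall_mean = train_rows['Age'].mean()  # assumed fallback, not in the original

toy['Age'] = (
    toy['Age']
    .fillna(toy['Title'].map(title_age_means))  # per-title mean
    .fillna(overall_mean)                       # titles absent from training
)
print(toy)
```

Title 2 never occurs in the toy training rows, so its missing age falls through to the overall training mean.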



### Validating the Sex Column

We check whether the Sex column contains only the expected values (male, female).
This ensures data consistency before encoding.

In [None]:
train_data[~train_data['Sex'].isin(['male', 'female'])]


### Encoding Sex as Numerical Values

We convert the categorical Sex column into numeric form (male → 0, female → 1), since most machine learning algorithms require numeric input.

In [None]:
full_data['Sex'] = full_data['Sex'].map({
    'male': 0,
    'female': 1
})
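
Because `map` turns any value missing from the dictionary into NaN, it can be worth checking for unmapped entries after encoding. A minimal sketch with an invented, inconsistently capitalised value:

```python
import pandas as pd

# Hypothetical column with one inconsistently capitalised entry.
sex = pd.Series(['male', 'female', 'Male'])
sex_mapping = {'male': 0, 'female': 1}

encoded = sex.map(sex_mapping)

# Values absent from the mapping become NaN; list the originals.
unmapped = sex[encoded.isna()]
print(unmapped.tolist())  # ['Male']
```

Running the same check on the combined frame would also cover the test rows, which the validation step above does not inspect.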


### Exploring the Embarked Feature

We inspect the Embarked column, which represents the port where passengers boarded the ship.

In [None]:
train_data['Embarked']

### Validating Embarked Values

This step lists rows whose embarkation value falls outside the known categories: S (Southampton), C (Cherbourg), and Q (Queenstown). Rows with a missing value (NaN) are also returned, because `isin` evaluates to False for NaN.

In [None]:
train_data[~train_data['Embarked'].isin(['S','C','Q'])]
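
A frequency table is a compact complement to the row filter. The sketch below uses invented values; passing `dropna=False` makes missing entries show up as their own row.

```python
import pandas as pd

# Hypothetical embarkation codes with one missing entry.
embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S'])

# dropna=False keeps NaN as its own row in the frequency table.
counts = embarked.value_counts(dropna=False)
n_missing = int(embarked.isna().sum())
print(counts)
print('Missing:', n_missing)
```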

### Handling Missing Embarked Values

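Missing values in Embarked are filled with the most frequent embarkation port, computed from the training rows only. In the standard Titanic training set this is 'S' (Southampton).
This ensures no missing values remain in this feature.

In [None]:
# Most frequent port among training rows (mode() ignores NaN).
embarked_mode = full_data.loc[full_data['is_train'] == 1, 'Embarked'].mode()[0]
full_data['Embarked'] = full_data['Embarked'].fillna(embarked_mode)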


### Split the Combined Dataset Back into Train and Test

After preprocessing, we separate the combined dataset back into cleaned training and test datasets.
The helper column is_train is removed.

In [None]:
train_cleaned = full_data[full_data['is_train'] == 1].drop(columns=['is_train'])
test_cleaned = full_data[full_data['is_train'] == 0].drop(columns=['is_train'])



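### Remove Survived from the Test Set

The test data has no target column. `Survived` appears in `test_cleaned` only because the concatenation filled it with NaN for the test rows, so we drop it.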

In [None]:
test_cleaned = test_cleaned.drop(columns=['Survived'])


### Handling Missing Fare Values in Test Set

Missing fare values in the test dataset are imputed using the median fare from the training dataset to prevent data leakage.


In [None]:
fare_median = train_cleaned['Fare'].median()
test_cleaned['Fare'] = test_cleaned['Fare'].fillna(fare_median)


### Final Data Validation

As a final check, we confirm that there are no unexpected missing values remaining in either dataset.

In [None]:
print("Train missing values:")
print(train_cleaned.isna().sum())

print("\nTest missing values:")
print(test_cleaned.isna().sum())
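
The printed counts can also be turned into a hard check that fails loudly. The helper below is a hypothetical sketch (`assert_no_missing` is not part of the notebook), and `Cabin` is used only as an example of a column one might deliberately leave unimputed.

```python
import pandas as pd

def assert_no_missing(df, name, allowed=()):
    """Raise ValueError if any column outside `allowed` contains NaN."""
    missing = df.drop(columns=list(allowed), errors='ignore').isna().sum()
    offenders = missing[missing > 0]
    if not offenders.empty:
        raise ValueError(f'{name} still has missing values: {offenders.to_dict()}')

# Hypothetical frame: Age is complete, Cabin has a gap.
toy = pd.DataFrame({'Age': [22.0, 30.0], 'Cabin': [None, 'C85']})

# Passes because the only incomplete column is exempted.
assert_no_missing(toy, 'toy', allowed=['Cabin'])
```

Calling the helper without the `allowed` argument on the same frame would raise, which is the behaviour one wants for columns that should already be clean.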



### Export Cleaned Datasets to CSV Files

The cleaned datasets are saved as CSV files for use in modeling and experimentation.

In [None]:
CLEANED_TRAIN_PATH.parent.mkdir(parents=True, exist_ok=True)

train_cleaned.to_csv(CLEANED_TRAIN_PATH, index=False)
test_cleaned.to_csv(CLEANED_TEST_PATH, index=False)

print(f"Saved train_cleaned.csv to: {CLEANED_TRAIN_PATH}")
print(f"Saved test_cleaned.csv to: {CLEANED_TEST_PATH}")
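
A quick round-trip check can confirm that an exported file reads back with the expected shape and columns. This sketch writes a made-up frame to a temporary directory instead of the project paths; the file name `toy_cleaned.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned frame.
toy = pd.DataFrame({'Age': [22.0, 30.5], 'Sex': [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'cleaned' / 'toy_cleaned.csv'
    # Create the parent folder first, as in the export cell above.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```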
