# Data Cleaning (Titanic Dataset)

In this notebook, we will:
- Handle missing values
- Drop unnecessary columns
- Create new features if needed
- Prepare dataset for further analysis
- Save cleaned dataset for future use

Cleaning matters because many models cannot handle missing values, and irrelevant columns add noise rather than signal.

### Step 1: Set Project Root for Python Imports

In [1]:
import sys
import os

# Add project root to sys.path
sys.path.append(os.path.abspath("..")) 

### Step 2: Load Dataset
- Load the Titanic dataset and inspect the first few rows to understand its structure.

In [2]:
# Load the Titanic training set
from src.data import load_data
df = load_data(r"D:\Thiru\ML_Projects\Titanic-Survival-Prediction\Data\Raw\train.csv")

# Show the first 5 rows
print(df.head(5))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


### Step 3: Missing Values
- Age is missing in 177 of 891 rows (about 20%) and Cabin in 687 (about 77%).
- Embarked is missing in only 2 rows.
- Each column needs its own strategy: impute the missing values or drop the column.

In [3]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
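
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of such a report; `missing_report` is a hypothetical helper that is not part of `src.data`, and the toy frame stands in for the real dataset.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and percentage of missing values per column."""
    report = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
    })
    # Keep only columns that actually have gaps, worst first
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy data for illustration only (not the real Titanic file)
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})
report = missing_report(toy)
print(report)
```

On the real frame, the same call would be `missing_report(df)`.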

### Step 4: Handle missing values
- **Age** → fill with the median (less sensitive to outliers than the mean)
- **Embarked** → fill with mode (most common value)
- **Cabin** → drop column (too many missing values)

In [4]:
from src.data import preprocess_mis_val
df = preprocess_mis_val(df)
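
The source of `preprocess_mis_val` is not shown in this notebook. As a rough sketch of the three strategies listed above, assuming plain pandas and the standard Titanic column names, it could look like the following. `fill_missing_sketch` is a hypothetical name, not the project's function.

```python
import pandas as pd


def fill_missing_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for preprocess_mis_val (the real code may differ)."""
    out = df.copy()
    # Age: impute with the median
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Embarked: impute with the most frequent value
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    # Cabin: too sparse to impute, so drop the column
    out = out.drop(columns=["Cabin"])
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, "C123"],
})
cleaned = fill_missing_sketch(toy)
print(cleaned.isnull().sum())
```

Working on a copy leaves the caller's frame untouched, which makes the cell safe to rerun.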

### Step 5: Dropping Columns
- PassengerId, Name, and Ticket are dropped as they are not useful for prediction.

In [5]:
from src.data import drop_cols
df = drop_cols(df)
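
For reference, dropping these three columns takes a single pandas call. The sketch below is an assumption about what `drop_cols` might do, not its actual source; `drop_cols_sketch` and `COLS_TO_DROP` are hypothetical names.

```python
import pandas as pd

# Columns named in this step
COLS_TO_DROP = ["PassengerId", "Name", "Ticket"]


def drop_cols_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for drop_cols (the real code may differ)."""
    # errors="ignore" lets the cell be rerun after the columns are already gone
    return df.drop(columns=COLS_TO_DROP, errors="ignore")


# Toy row for illustration, copied from the first row shown in Step 2
toy = pd.DataFrame({
    "PassengerId": [1],
    "Survived": [0],
    "Name": ["Braund, Mr. Owen Harris"],
    "Ticket": ["A/5 21171"],
    "Fare": [7.25],
})
trimmed = drop_cols_sketch(toy)
print(trimmed.columns.tolist())
```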

### Step 6: Feature Engineering
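- FamilySize: total family members on board, counting the passenger (SibSp + Parch + 1, consistent with the values printed in Step 8).
- IsAlone: 1 if the passenger is travelling alone, 0 otherwise.
- These features may help improve model performance.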

In [6]:
from src.data import feature_engi
df = feature_engi(df)
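
The two features can be derived from `SibSp` and `Parch` alone. The sketch below assumes FamilySize counts the passenger (SibSp + Parch + 1), which agrees with the FamilySize values printed in Step 8, and assumes IsAlone is 1 exactly when FamilySize is 1. `add_family_features_sketch` is a hypothetical name, and the real `feature_engi` may differ.

```python
import pandas as pd


def add_family_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for feature_engi (the real code may differ)."""
    out = df.copy()
    # Siblings/spouses + parents/children + the passenger
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # A family of one means the passenger travels alone
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    return out


# SibSp and Parch of the first three passengers shown in Step 2
toy = pd.DataFrame({"SibSp": [1, 1, 0], "Parch": [0, 0, 0]})
features = add_family_features_sketch(toy)
print(features)
```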

### Step 7: Encoding Categorical Variables
- Sex and Embarked are converted to integer codes for modeling. In the output of Step 8, male maps to 0 and female to 1, while Embarked S maps to 2 and C to 0.

In [7]:
from src.data import encode_var
df = encode_var(df)
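
One way to get integer codes is an explicit mapping, which keeps the encoding visible and stable across runs. The sketch below is an assumption about what `encode_var` might do. The codes for male, female, S, and C match the output printed in Step 8; the code for Q (1) is a guess based on alphabetical ordering. `encode_sketch` and both dictionaries are hypothetical names.

```python
import pandas as pd

SEX_CODES = {"male": 0, "female": 1}
EMBARKED_CODES = {"C": 0, "Q": 1, "S": 2}  # Q -> 1 is assumed, not confirmed by the output


def encode_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for encode_var (the real code may differ)."""
    out = df.copy()
    out["Sex"] = out["Sex"].map(SEX_CODES)
    out["Embarked"] = out["Embarked"].map(EMBARKED_CODES)
    return out


# Sex and Embarked of the first three passengers shown in Step 2
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})
encoded = encode_sketch(toy)
print(encoded)
```

Note that `Series.map` yields NaN for any value missing from the dictionary, so unexpected categories surface as missing values in the final check rather than being silently encoded.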

### Step 8: Final check
We confirm that no missing values remain and that every column is numeric.

In [8]:
from src.data import basic_info
df = basic_info(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    int64  
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64  
 5   Parch       891 non-null    int64  
 6   Fare        891 non-null    float64
 7   Embarked    891 non-null    int64  
 8   FamilySize  891 non-null    int64  
 9   IsAlone     891 non-null    int64  
dtypes: float64(2), int64(8)
memory usage: 69.7 KB
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked  FamilySize  \
0         0       3    0  22.0      1      0   7.2500         2           2   
1         1       1    1  38.0      1      0  71.2833         0           2   
2         1       3    1  26.0      0      0   7.9250         2           1   
3         1       1    1  35.0      
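
Reading the printed summary works, but the same check can be made to fail loudly if something slips through. A minimal sketch, assuming a hypothetical helper `assert_clean` that is not part of `src.data`:

```python
import pandas as pd


def assert_clean(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise if any value is missing or any column is non-numeric."""
    missing = df.isnull().sum()
    assert missing.sum() == 0, f"Missing values remain: {missing[missing > 0].to_dict()}"
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    assert not non_numeric, f"Non-numeric columns remain: {non_numeric}"


# Toy frame for illustration, using a few columns from the output above
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "FamilySize": [2, 2, 1],
})
assert_clean(toy)  # passes silently when the frame is clean
```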

### Step 9: Save cleaned dataset
We save the cleaned and processed dataset so it can be reused in future notebooks.

In [9]:
from src.data import save_clean_data
df = save_clean_data(df)

Cleaned dataset saved to D:\Thiru\ML_Projects\Titanic-Survival-Prediction\Data\processed\cleaned_titanic.csv
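
The path above is absolute and specific to one machine. As a sketch of a more portable approach, the output location can be passed in as a parameter and its folder created on demand. `save_clean_data_sketch` is a hypothetical name, the real `save_clean_data` may work differently, and the temporary directory is used only so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_clean_data_sketch(df: pd.DataFrame, out_path) -> Path:
    """Hypothetical stand-in for save_clean_data (the real code may differ)."""
    out_path = Path(out_path)
    # Create the target folder if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False avoids writing the row index as an extra column
    df.to_csv(out_path, index=False)
    return out_path


# Toy frame for illustration only
toy = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0]})

with tempfile.TemporaryDirectory() as tmp:
    saved = save_clean_data_sketch(toy, Path(tmp) / "processed" / "cleaned_titanic.csv")
    reloaded = pd.read_csv(saved)

print(reloaded.equals(toy))
```

This sketch returns the written path rather than a DataFrame, so its result should not be assigned back to `df`.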


### Summary
- Handled missing values:
  - Filled Age with median
  - Filled Embarked with mode
  - Dropped Cabin
- Dropped irrelevant columns: PassengerId, Name, Ticket
- Added new features: FamilySize, IsAlone
- Encoded categorical variables (Sex, Embarked)
- Dataset is now clean and ready for EDA and modeling.
- **Saved the cleaned dataset** to `D:\Thiru\ML_Projects\Titanic-Survival-Prediction\Data\processed\cleaned_titanic.csv` for use in future notebooks.
