- Load the data from the `data` folder for the analysis below.

In [33]:
import pandas as pd
# Forward slash avoids the invalid "\p" escape sequence and works on every OS
df = pd.read_csv("data/phase_2_titanic_dataset.csv")

Standardization centers the values around zero and rescales them to unit variance:
$$z = \frac{x - \operatorname{mean}(x)}{\operatorname{std}(x)}$$

In [34]:
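# Shown for reference only: this cell is left commented out, and the
# min-max normalized columns created below are used instead.
# from sklearn.preprocessing import StandardScaler

# scaler = StandardScaler()
# df['age_scaled'] = scaler.fit_transform(df[['age']])
# df['fare_scaled'] = scaler.fit_transform(df[['fare']])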
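
As a quick sanity check of the formula, the sketch below compares a manual z-score with `StandardScaler` on a small made-up column. The `toy` frame and its values are assumptions for illustration, not the Titanic data. The detail worth noticing is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample version (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the `age` column (illustrative values only).
toy = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0, 54.0]})

# Manual z-score using the population standard deviation (ddof=0),
# which matches what StandardScaler computes.
z_manual = (toy["age"] - toy["age"].mean()) / toy["age"].std(ddof=0)

# Same transform via scikit-learn; ravel() flattens the (n, 1) result.
z_sklearn = StandardScaler().fit_transform(toy[["age"]]).ravel()

print(np.allclose(z_manual, z_sklearn))  # True
```

With `ddof=1` the two results would differ slightly, which matters mostly on small samples.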


#### 🔹 2. Normalization (Min-Max Scaling)


This rescales the values to the [0, 1] range:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

In [35]:
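from sklearn.preprocessing import MinMaxScaler

# fit_transform refits the scaler on every call, so one instance can be
# reused here; after the second call it only remembers the `fare` range.
scaler = MinMaxScaler()
df['age_norm'] = scaler.fit_transform(df[['age']])
df['fare_norm'] = scaler.fit_transform(df[['fare']])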
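
An alternative worth considering is to scale both columns in a single call, so that one fitted scaler keeps the min and max of every column and can later undo the transform. The sketch below uses a made-up `toy` frame (an assumption, not the real dataset) and checks the result against the formula above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the Titanic columns (illustrative values only).
toy = pd.DataFrame({
    "age": [2.0, 20.0, 38.0, 74.0],
    "fare": [0.0, 8.0, 32.0, 512.0],
})

cols = ["age", "fare"]
scaler = MinMaxScaler()

# One fit_transform call scales each column independently and stores
# both columns' min/max on the same fitted scaler.
scaled = scaler.fit_transform(toy[cols])
toy["age_norm"] = scaled[:, 0]
toy["fare_norm"] = scaled[:, 1]

# Manual check of the min-max formula for the age column.
age_manual = (toy["age"] - toy["age"].min()) / (toy["age"].max() - toy["age"].min())

# The fitted scaler can map normalized values back to the originals.
restored = scaler.inverse_transform(scaled)

print(toy[["age_norm", "fare_norm"]].round(3))
```

Reusing one scaler across separate `fit_transform` calls gives the same normalized columns, but only the last column's range survives on the scaler object.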



Most models expect numeric input, so the string (categorical) columns need to be converted to numbers.

#### 🔹 1. Label Encoding (e.g. for binary columns)


In [36]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['sex_encoded'] = le.fit_transform(df['sex'])  # classes are sorted alphabetically: female=0, male=1
df['sex_encoded']


0      1
1      0
2      0
3      0
4      1
      ..
674    0
675    0
676    0
677    1
678    1
Name: sex_encoded, Length: 679, dtype: int64
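
Rather than relying on a comment for the `male=1, female=0` mapping, the fitted encoder can be inspected directly. The sketch below is a minimal illustration on a hand-written list of labels (an assumption, not the real column): `classes_` holds the sorted labels, and each label's code is its position in that array.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["male", "female", "female", "male"])

# classes_ is sorted, and each label's integer code is its index there.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(mapping)  # {'female': 0, 'male': 1}
```

`le.inverse_transform(codes)` recovers the original strings if they are needed later.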


#### 🔹 2. One-Hot Encoding (e.g. for columns like `embarked`, `class`)


In [37]:
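# drop_first=True drops the first category of each column (it becomes the
# implicit baseline), so k categories yield k-1 dummy columns
df = pd.get_dummies(df, columns=['embarked', 'class'], drop_first=True)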
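
To see what `drop_first=True` does, here is a small sketch on a made-up `embarked` column (the `toy` frame is an assumption for illustration). The first category in sorted order, `C`, gets no column of its own and is represented by all remaining dummies being zero.

```python
import pandas as pd

# Toy column with the three embarkation codes (illustrative values only).
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

encoded = pd.get_dummies(toy, columns=["embarked"], drop_first=True)

print(list(encoded.columns))  # ['embarked_Q', 'embarked_S']

# The row that was "C" has no dummy set: it is the implicit baseline.
baseline_row = encoded.iloc[1]
```

Depending on the pandas version, the dummy columns are booleans or 0/1 integers; both work as model inputs.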



### ✅ Step 3: Feature Selection (Basic)
Remove less useful features or redundant columns.


In [38]:
# Drop the raw columns replaced by encoded/scaled versions (sex, age, fare)
# and the remaining columns not useful for prediction
df = df.drop(columns=['sex', 'age', 'fare', 'deck', 'embark_town', 'who', 'alive', 'adult_male'])
df.columns


Index(['survived', 'pclass', 'sibsp', 'parch', 'alone', 'age_norm',
       'fare_norm', 'sex_encoded', 'embarked_Q', 'embarked_S', 'class_Second',
       'class_Third'],
      dtype='object')


You can also use `df.corr()` to inspect pairwise correlations between features and, if needed, drop one feature from each highly correlated pair.
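
One possible way to act on the correlation matrix is sketched below. The `toy` frame, its values, and the `threshold` of 0.8 are all assumptions chosen for illustration; a suitable cutoff depends on the data and the model.

```python
import numpy as np
import pandas as pd

# Toy frame where `class_Third` is largely determined by `pclass`
# (illustrative values only).
toy = pd.DataFrame({
    "pclass": [1, 2, 3, 3, 1, 2],
    "class_Third": [0, 0, 1, 1, 0, 0],
    "sibsp": [1, 0, 0, 3, 1, 0],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # hypothetical cutoff
# Flag the later column of every pair whose |correlation| exceeds it.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print(to_drop)  # ['class_Third']
reduced = toy.drop(columns=to_drop)
```

This keeps the first column of each correlated pair, which is an arbitrary choice; domain knowledge may suggest keeping the other one.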

---

The next step is **Phase 4: Data Splitting**, the final part before modeling.


In [39]:
df.to_csv("data/phase_3_titanic_dataset.csv", index=False)