# Titanic Dataset — Summary Table

| Feature        | Type               | Description |
|----------------|--------------------|-------------|
| **survived**   | Binary (0/1)       | Passenger survival (0 = no, 1 = yes). |
| **pclass**     | Categorical (1/2/3)| Passenger ticket class (proxy for socio-economic status). |
| **sex**        | Categorical        | Gender of the passenger (`male`, `female`). |
| **age**        | Continuous (float) | Age of the passenger (in years). Missing for some. |
| **sibsp**      | Integer            | Number of siblings/spouses aboard. |
| **parch**      | Integer            | Number of parents/children aboard. |
| **fare**       | Continuous (float) | Ticket price (in 1912 British pounds). |
| **embarked**   | Categorical        | Port of embarkation (`C` = Cherbourg, `Q` = Queenstown, `S` = Southampton). |
| **class**      | Categorical        | Passenger class (`First`, `Second`, `Third`). |
| **who**        | Categorical        | Simplified category (`man`, `woman`, `child`). |
| **adult_male** | Boolean            | Whether passenger is an adult male (`True`/`False`). |
| **deck**       | Categorical        | Cabin deck letter (`A–G` or missing). |
| **embark_town**| String             | Embarkation town name (`Cherbourg`, `Queenstown`, `Southampton`). |
| **alive**      | String             | Human-readable survival status (`yes`, `no`). |
| **alone**      | Boolean            | True if passenger had no family aboard. |


# 1. Imports

In [76]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
import joblib

# 2. Load Titanic Dataset

In [77]:
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.head()

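| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |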


## 2.1 Checking Data Quality
Before engineering features, check the quality of the raw dataset.
Typical checks include:

- **Missing values** → identify columns containing NaNs.
- **Duplicated entries** → check whether any rows are repeated.
- **Data types** → confirm each column has the expected type (numerical vs. categorical).
- **Basic statistics** → look for outliers, skewness, and unexpected ranges.

In [78]:
# --- Checking Data Quality ---

# 1. Overview of data types and non-null counts
print("=== Dataset Info ===")
titanic.info()  # info() prints the summary itself and returns None, so no print() is needed


=== Dataset Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [79]:
# 2. Missing values summary
print("\n=== Missing Values ===")
print(titanic.isnull().sum())


=== Missing Values ===
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
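
The raw counts are easier to judge as a share of the 891 rows. The following is a minimal sketch that reuses the counts printed above; in the notebook itself, `titanic.isnull().mean()` should give the same ratios directly.

```python
import pandas as pd

# Missing-value counts copied from the output above (891 rows in total).
n_rows = 891
missing_counts = pd.Series({"age": 177, "embarked": 2, "deck": 688, "embark_town": 2})

# Convert counts to percentages of all rows, largest first.
missing_pct = (missing_counts / n_rows * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Read this way, `deck` is missing in roughly three quarters of the rows, while `age` is missing in about one fifth, which makes `age` a more reasonable candidate for imputation.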


In [80]:
# 3. Duplicates
print("\n=== Duplicated Rows ===")
print("Number of duplicates:", titanic.duplicated().sum())



=== Duplicated Rows ===
Number of duplicates: 107
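
This version of the dataset has no passenger identifier among its 15 columns, so a flagged "duplicate" may simply be two different passengers with identical recorded attributes. Before dropping anything, it can help to look at the full duplicate groups. The sketch below uses an invented toy frame (not the real data) to show the difference between `duplicated()` and `duplicated(keep=False)`.

```python
import pandas as pd

# Toy frame (invented rows): two passengers share every recorded attribute.
toy = pd.DataFrame({
    "pclass": [3, 3, 1, 3],
    "sex": ["male", "male", "female", "male"],
    "age": [22.0, 22.0, 38.0, 30.0],
    "fare": [7.25, 7.25, 71.28, 8.05],
})

# duplicated() flags only the second and later occurrences of a row...
n_flagged = toy.duplicated().sum()

# ...while keep=False marks every member of a duplicate group for inspection.
groups = toy[toy.duplicated(keep=False)]

print("Flagged as duplicates:", n_flagged)
print(groups)
```

In the notebook, `titanic[titanic.duplicated(keep=False)]` would list the corresponding groups for manual review.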


In [81]:

# 4. Basic statistics
print("\n=== Descriptive Statistics ===")
display(titanic.describe(include="all").transpose())


=== Descriptive Statistics ===


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
survived,891.0,,,,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
pclass,891.0,,,,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
sex,891.0,2.0,male,577.0,,,,,,,
age,714.0,,,,29.699118,14.526497,0.42,20.125,28.0,38.0,80.0
sibsp,891.0,,,,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
parch,891.0,,,,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
fare,891.0,,,,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292
embarked,889.0,3.0,S,644.0,,,,,,,
class,891.0,3.0,Third,491.0,,,,,,,
who,891.0,3.0,man,537.0,,,,,,,


# 3. Data Cleaning

In [82]:
# Drop rows where the target is missing (a no-op here: `survived` has no NaNs)
titanic = titanic.dropna(subset=["survived"])

# Categorical columns with gaps: register "Unknown" as a valid category, then fill with it
for col in ["deck", "embark_town"]:
    titanic[col] = titanic[col].astype("category")
    titanic[col] = titanic[col].cat.add_categories("Unknown")
    titanic[col] = titanic[col].fillna("Unknown")

# Numerical column: fill missing ages with the median age
# (computed on the full dataset, i.e. before the train/test split)
titanic["age"] = titanic["age"].fillna(titanic["age"].median())

# For categorical 'embarked', we use the most frequent category
titanic["embarked"] = titanic["embarked"].fillna("S")
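
The age median above is computed on the whole dataset before the train/test split, so test rows influence the fill value. One possible alternative, sketched here as an assumption rather than as what this notebook does, is to move imputation into the preprocessing pipeline with `SimpleImputer`, which learns the median from the training rows only. The toy values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (invented values); NaN marks a missing age.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"age": [np.nan, 80.0]})

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # median is learned in fit()
    ("scaler", StandardScaler()),
])

# fit() sees only the training rows, so the test set cannot leak into the median.
numeric_pipeline.fit(train)
learned_median = numeric_pipeline.named_steps["imputer"].statistics_[0]
print("Median learned from training data:", learned_median)

# The missing test age is filled with the training median before scaling.
test_scaled = numeric_pipeline.transform(test)
print(test_scaled)
```

With this design the fill value becomes part of the fitted pipeline, so it is also saved and reloaded together with the model.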


# 4. Feature Creation

In [83]:
titanic["family_size"] = titanic["sibsp"] + titanic["parch"] + 1
titanic["is_alone"] = (titanic["family_size"] == 1).astype(int)
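# Coarse title proxy derived from `who` (not parsed from passenger names,
# which this dataset does not include)
titanic["title"] = titanic["who"].map(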
    {"man": "Mr", "woman": "Mrs", "child": "Master"}
)

titanic[["survived","family_size","is_alone","title"]].head()

| | survived | family_size | is_alone | title |
|---|---|---|---|---|
| 0 | 0 | 2 | 0 | Mr |
| 1 | 1 | 2 | 0 | Mrs |
| 2 | 1 | 1 | 1 | Mrs |
| 3 | 1 | 2 | 0 | Mrs |
| 4 | 0 | 1 | 1 | Mr |
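
The engineered `is_alone` flag appears to carry the same information as the dataset's existing `alone` column. The sketch below checks this on the `sibsp`, `parch`, and `alone` values of the five rows shown in the `titanic.head()` output; agreement on five rows is only a hint, not proof for the full dataset.

```python
import pandas as pd

# sibsp, parch and alone copied from the five head() rows shown earlier.
sample = pd.DataFrame({
    "sibsp": [1, 1, 0, 1, 0],
    "parch": [0, 0, 0, 0, 0],
    "alone": [False, False, True, False, True],
})

# Same feature definitions as in the notebook cell.
sample["family_size"] = sample["sibsp"] + sample["parch"] + 1
sample["is_alone"] = (sample["family_size"] == 1).astype(int)

# Compare the engineered flag with the column that ships with the dataset.
matches = (sample["is_alone"].astype(bool) == sample["alone"]).all()
print("is_alone matches alone on this sample:", matches)
```

In the notebook, the same comparison on the full frame would be `(titanic["is_alone"].astype(bool) == titanic["alone"]).all()`.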


# 5. Define Features and Target

In [84]:

X = titanic[
    ["pclass","sex","age","fare","embarked","family_size","is_alone","title"]
]
y = titanic["survived"]


# 6. Feature Selection Example

In [85]:
# Remove correlated or redundant features
# Example: drop 'is_alone' if family_size already encodes similar info
print(X.shape)
X_selected = X.drop(columns=["is_alone"])
print(X_selected.shape)

(891, 8)
(891, 7)
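
The redundancy argument can be checked numerically, since `is_alone` is a deterministic function of `family_size`. The following sketch uses invented family sizes to show both the correlation and the fact that the flag can be rebuilt from `family_size` alone.

```python
import pandas as pd

# Toy family sizes (invented); is_alone is derived exactly as in the notebook.
toy = pd.DataFrame({"family_size": [1, 1, 2, 3, 1, 5, 2, 1]})
toy["is_alone"] = (toy["family_size"] == 1).astype(int)

# Pearson correlation: a strongly negative value signals overlapping information.
corr = toy["family_size"].corr(toy["is_alone"])
print(f"corr(family_size, is_alone) = {corr:.2f}")

# The flag can be reconstructed from family_size without any other column.
rebuilt = (toy["family_size"] == 1).astype(int)
print("Recoverable from family_size:", rebuilt.equals(toy["is_alone"]))
```

Dropping the flag is still a judgment call: for a linear model such as logistic regression, a binary "alone" indicator can add a non-linear cue that `family_size` by itself does not express.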


# 8. Train/Test Split



In [86]:
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(712, 7)
(179, 7)
(712,)
(179,)
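
The split above is purely random, so the share of survivors can differ slightly between the training and test sets. A common variant, shown here as an optional sketch and not as the split this notebook uses, passes `stratify` to keep the class ratio equal in both partitions. The toy target is invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target (invented): 40% positives, loosely mirroring the ~38% survival rate.
y_toy = np.array([1] * 40 + [0] * 60)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positive rate (train):", y_tr.mean())
print("Positive rate (test): ", y_te.mean())
```

Applied to the notebook, this would mean adding `stratify=y` to the existing call; note that doing so changes which rows land in the test set, so the accuracy figures reported below would no longer match exactly.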


# 9. Preprocessing Pipelines

In [87]:

numeric_features = ["age","fare","family_size"]
categorical_features = ["pclass","sex","embarked","title"]

numeric_transformer = Pipeline(steps=[
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)


# 10. Full Pipeline with Logistic Regression

In [88]:
print(X_train.shape)

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=100))
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy (without PCA):", accuracy_score(y_test, y_pred))


(712, 7)
Accuracy (without PCA): 0.8044692737430168
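
An accuracy of about 0.80 is easier to interpret next to a trivial baseline that always predicts the majority class. The sketch below uses `DummyClassifier` on invented labels whose class balance is close to the dataset-wide survival rate of about 0.38 reported by `describe()`; it does not reproduce the notebook's actual test split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (invented): 62 negatives and 38 positives.
y_toy = np.array([0] * 62 + [1] * 38)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit().
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print("Majority-class baseline accuracy:", baseline_acc)
```

In the notebook, the equivalent check would fit the dummy model on `X_train, y_train` and score it on `X_test, y_test`, giving a reference point for the logistic regression result.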


# 11. Pipeline with Feature Extraction (PCA)

In [89]:
print(X_train.shape)
pca_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("pca", PCA(n_components=3)),
    ("classifier", LogisticRegression(max_iter=100))
])

pca_pipeline.fit(X_train, y_train)
y_pred_pca = pca_pipeline.predict(X_test)

print("Accuracy (with PCA):", accuracy_score(y_test, y_pred_pca))

(712, 7)
Accuracy (with PCA): 0.770949720670391
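
The drop from about 0.80 to about 0.77 suggests that three components discard some information the classifier could use. One way to explore this is to look at how much variance each component retains. The sketch below runs PCA on a synthetic matrix (invented data, not the Titanic features) and prints `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic matrix (invented): 200 rows, 6 columns driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

pca = PCA(n_components=3).fit(X_toy)

# Share of total variance captured by each retained component.
ratios = pca.explained_variance_ratio_
print("Per component:", ratios.round(3))
print("Cumulative:   ", ratios.cumsum().round(3))
```

For the fitted notebook pipeline, the same attribute should be reachable as `pca_pipeline.named_steps["pca"].explained_variance_ratio_`. Keep in mind that retained variance is not the same as predictive value, so a high cumulative ratio does not guarantee unchanged accuracy.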


# 12. Save Pipelines (Serialization)

In [90]:

joblib.dump(clf, "titanic_feature_pipeline.pkl")
joblib.dump(pca_pipeline, "titanic_pca_pipeline.pkl")


['titanic_pca_pipeline.pkl']

# 13. Reload & Predict

In [91]:

loaded_clf = joblib.load("titanic_feature_pipeline.pkl")
print("Reloaded model accuracy:", accuracy_score(y_test, loaded_clf.predict(X_test)))


Reloaded model accuracy: 0.8044692737430168
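
A reloaded pipeline is typically used on new rows rather than on the original test set. The sketch below builds a small stand-in pipeline on invented rows, round-trips it through joblib in a temporary directory, and predicts for one new passenger. The file name `toy_pipeline.pkl` and the three-column layout are assumptions made for illustration only.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (invented rows) with a subset of the notebook's columns.
train = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
    "sex": ["male", "female", "female", "female", "male", "male"],
})
labels = [0, 1, 1, 1, 0, 0]

model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["age", "fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ])),
    ("classifier", LogisticRegression()),
])
model.fit(train, labels)

# Round-trip the fitted pipeline through joblib in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# New rows must be a DataFrame with the same column names used in training.
new_passengers = pd.DataFrame({"age": [30.0], "fare": [10.0], "sex": ["female"]})
same_output = (model.predict(new_passengers) == reloaded.predict(new_passengers)).all()
print("Prediction:", reloaded.predict(new_passengers))
print("Reloaded pipeline matches original:", same_output)
```

Because the preprocessing steps are stored inside the pipeline, the caller passes raw columns and does not need to repeat scaling or encoding. Pickled scikit-learn objects are generally expected to be loaded with the same library version that created them.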


# 14. Summary
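Key takeaways:
- Feature engineering creates new variables from existing ones (here `family_size`, `is_alone`, and `title`).
- Feature selection removes redundant variables (here `is_alone`, which is derived from `family_size`).
- Feature extraction with PCA projects the preprocessed features onto fewer dimensions; with 3 components, test accuracy dropped from about 0.80 to about 0.77.
- Pipelines bundle preprocessing and the model, so the same transformations are applied at training and prediction time.
- Serialization with joblib saves the fitted pipeline to disk so it can be reloaded for prediction; the reloaded model reproduced the original accuracy.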

