# 🚀 Task 1: Data Pipeline Development

This notebook presents an automated extract-transform-load (ETL) pipeline for the **Titanic Dataset**. It covers data ingestion, cleaning, transformation, and loading, using `pandas` for the data handling and `scikit-learn` for label encoding.

## 🎯 Objectives
- Automate data ingestion, cleaning, transformation, and loading
- Use `pandas` and `scikit-learn` for preprocessing and feature engineering
- Ensure reproducibility and scalability of the ETL process
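
One way to work toward the reproducibility objective is to wrap the individual steps in a single function that returns a new frame. The sketch below is an illustration, not the notebook's own code: `run_pipeline` and the `sample` frame are hypothetical, and the categorical encoding uses pandas category codes as a stand-in for `LabelEncoder` (both assign integer codes in sorted label order).

```python
import pandas as pd


def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and transform a Titanic-style DataFrame; returns a new frame."""
    out = df.copy()
    # Impute missing values: median for Age, most frequent value for Embarked.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Drop Cabin; errors="ignore" tolerates inputs that lack the column.
    out = out.drop(columns=["Cabin"], errors="ignore")
    # Integer-code the categoricals in sorted label order.
    for col in ("Sex", "Embarked"):
        out[col] = out[col].astype("category").cat.codes
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    return out


# Tiny hand-made sample (not real Titanic rows) to show the call.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", None, "C"],
})
cleaned = run_pipeline(sample)
```

Because the function copies its input, the same raw frame can be passed through it repeatedly and always yields the same result.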

In [12]:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [13]:
# Load the Titanic dataset
filepath = "titanic.csv"  # Replace with your dataset path
df = pd.read_csv(filepath)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
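
The `Unnamed: 0` column in the output above suggests that the CSV was written with its row index included; that is an inference from the header, not something the notebook states. Assuming so, the sketch below shows two ways to deal with it, using a small inline string as a stand-in for `titanic.csv`.

```python
import io
import pandas as pd

# Stand-in for titanic.csv: the empty first header field is a saved index.
raw = ",PassengerId,Survived\n0,1,0\n1,2,1\n"

as_loaded = pd.read_csv(io.StringIO(raw))
print(list(as_loaded.columns))  # ['Unnamed: 0', 'PassengerId', 'Survived']

# Option 1: treat the first column as the index while reading.
indexed = pd.read_csv(io.StringIO(raw), index_col=0)

# Option 2: drop the redundant column after loading.
dropped = as_loaded.drop(columns=["Unnamed: 0"])
```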


In [14]:
# Handle missing values
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection: that chained form raises a FutureWarning and stops
# working in pandas 3.0, because df[col] behaves as a copy.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df = df.drop(columns=['Cabin'])
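
A quick way to confirm that imputation worked is to count missing values before and after. The sketch below uses the assignment form `df[col] = df[col].fillna(value)` on a hypothetical `toy` frame with made-up values.

```python
import pandas as pd

# Made-up values with one gap per column.
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})
before = toy.isna().sum()

# Assigning the filled column back modifies the original frame.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
after = toy.isna().sum()

print(before.to_dict())  # {'Age': 1, 'Embarked': 1}
print(after.to_dict())   # {'Age': 0, 'Embarked': 0}
```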


In [15]:
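# Encode categorical variables
# Use one encoder per column so each keeps its own fitted classes_;
# refitting a single encoder would overwrite the Sex mapping.
sex_encoder = LabelEncoder()
embarked_encoder = LabelEncoder()
df['Sex'] = sex_encoder.fit_transform(df['Sex'])
df['Embarked'] = embarked_encoder.fit_transform(df['Embarked'])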
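
To see which integer each label received, the fitted encoder's `classes_` attribute can be inspected. This is a minimal sketch on illustrative values; `sex_values` and `encoder` are names chosen for the example.

```python
from sklearn.preprocessing import LabelEncoder

sex_values = ["male", "female", "female", "male"]  # illustrative values
encoder = LabelEncoder()
codes = encoder.fit_transform(sex_values)

# classes_ is sorted, and each label's position is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'female': 0, 'male': 1}

# inverse_transform recovers the original labels from the codes.
restored = encoder.inverse_transform(codes)
```

Keeping the fitted encoder around is what makes the integer codes reversible later, for example when reporting results by category.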

In [16]:
# Feature engineering: FamilySize = siblings/spouses + parents/children + the passenger
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
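
As a worked check of the formula, the sketch below applies it to the `SibSp` and `Parch` values of the five passengers shown in the `df.head()` output; the frame name `head` is only for this example.

```python
import pandas as pd

# SibSp and Parch for the five passengers shown in the df.head() output.
head = pd.DataFrame({"SibSp": [1, 1, 0, 1, 0], "Parch": [0, 0, 0, 0, 0]})

# The +1 counts the passenger, so someone travelling alone gets 1.
head["FamilySize"] = head["SibSp"] + head["Parch"] + 1
print(head["FamilySize"].tolist())  # [2, 2, 1, 2, 1]
```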

In [17]:
# Save cleaned data
df.to_csv("cleaned_titanic.csv", index=False)
print("Cleaned dataset saved successfully.")

Cleaned dataset saved successfully.
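
The load step can be verified with a round trip: write the frame, read it back, and compare. The sketch below does this in a temporary directory with a small made-up frame, so it does not touch the notebook's real output file.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the cleaned dataset.
frame = pd.DataFrame({"PassengerId": [1, 2], "FamilySize": [2, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_titanic.csv")
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

round_trip_ok = (
    list(reloaded.columns) == list(frame.columns)
    and reloaded.shape == frame.shape
)
print(round_trip_ok)  # True
```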
