# Titanic ETL Pipeline

**Objective:**
- Create an automated pipeline for data ingestion, cleaning, transformation, and loading.
- Use pandas and scikit-learn for preprocessing and feature engineering.
- Ensure reproducibility and scalability of the ETL process.

**Dataset:** Titanic Dataset from Kaggle


In [3]:
!pip install scikit-learn


Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.7.0-cp313-cp313-win_amd64.whl.metadata (14 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.1-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.7.0-cp313-cp313-win_amd64.whl (10.7 MB)




In [1]:
# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split


In [2]:
# Step 1: Data Ingestion
train_df = pd.read_csv("C:\\Users\\saksh\\OneDrive\\Desktop\\Python Practice\\titanic\\train.csv")

print(train_df)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ... 
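The ingestion step above reads from an absolute, machine-specific path. As a minimal sketch of a more portable variant, the helper below checks that the file exists and that the columns used later in this notebook are present. The name `load_titanic` and the `REQUIRED_COLUMNS` set are hypothetical and introduced only for illustration.

```python
from pathlib import Path

import pandas as pd

# Columns the later cells rely on (collected from this notebook's own steps).
REQUIRED_COLUMNS = {"Survived", "Pclass", "Name", "Sex", "Age",
                    "SibSp", "Parch", "Fare", "Embarked"}


def load_titanic(csv_path):
    """Hypothetical helper: read the CSV and fail early if columns are missing."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    df = pd.read_csv(csv_path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return df
```

Assuming the CSV sits in a `titanic` folder next to the notebook, a call might look like `train_df = load_titanic(Path("titanic") / "train.csv")`.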

In [3]:
# Step 2: Basic Data Cleaning
train_df = train_df.drop(columns=["PassengerId", "Ticket", "Cabin"])
# Assign the filled column back; inplace=True on a column selection will stop working in pandas 3.0.
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])
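Before imputing, it can help to see how much is actually missing in each column. The sketch below uses a small made-up frame with Titanic-style column names (the values are illustrative, not taken from the dataset). It then fills `Embarked` with its mode by assigning the result back to the column.

```python
import numpy as np
import pandas as pd

# Toy frame with Titanic-style column names; the values are made up.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count and share of missing values per column, before any imputation.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "share": df.isna().mean(),
})
print(missing)

# Fill the categorical column with its most frequent value,
# assigning back rather than calling fillna(..., inplace=True) on a selection.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
```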


In [None]:
# Feature Engineering: Extracting Title
# Keep working on the train_df cleaned in Step 2; re-reading the CSV here would discard that cleaning.
train_df['Title'] = train_df['Name'].str.extract(r',\s*([^\.]+)\s*\.', expand=False)
train_df = train_df.drop(columns=["Name"])
title_counts = train_df['Title'].value_counts()
rare_titles = title_counts[title_counts < 10].index  # titles that occur fewer than 10 times
train_df['Title'] = train_df['Title'].replace(list(rare_titles), 'Rare')
print(rare_titles)


Index(['Dr', 'Rev', 'Col', 'Mlle', 'Major', 'Ms', 'Mme', 'Don', 'Lady', 'Sir',
       'Capt', 'the Countess', 'Jonkheer'],
      dtype='object', name='Title')

In [6]:
# Feature Engineering: Creating FamilySize
# FamilySize = siblings/spouses + parents/children + the passenger
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
train_df = train_df.drop(columns=["SibSp", "Parch"])

In [None]:
# Preprocessing: Encoding and Scaling
# Uses the Title and FamilySize columns created in the previous cells.
numeric_features = ['Age', 'Fare', 'FamilySize']
categorical_features = ['Sex', 'Embarked', 'Title']
# Numeric columns: fill missing values with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
print(numeric_transformer)

# Encoding categorical features manually with LabelEncoder
# (each column gets integer codes assigned in sorted label order)
for col in categorical_features:
    train_df[col] = LabelEncoder().fit_transform(train_df[col])

train_df[numeric_features] = numeric_transformer.fit_transform(train_df[numeric_features])
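The encoding loop above discards each fitted `LabelEncoder`, so the integer codes cannot be mapped back to the original labels afterwards. A minimal sketch of one way around this is to keep the encoders in a dictionary. The toy frame and the `encoders` name are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two of the categorical columns; the values are made up.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Keep one fitted encoder per column so the mapping can be inspected later.
encoders = {}
for col in ["Sex", "Embarked"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# classes_ holds the original labels; the label at position i is encoded as i.
mapping = {col: [str(label) for label in enc.classes_]
           for col, enc in encoders.items()}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
decoded_sex = encoders["Sex"].inverse_transform(df["Sex"])
```

Storing the encoders also makes it possible to apply the same mapping to a separate test file instead of refitting on it.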

In [9]:
# Splitting Data (Optional)
X = train_df.drop("Survived", axis=1)
y = train_df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
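In the cells above, the imputer and scaler are fitted on the full dataset before the split, so statistics from the test rows influence the training features. The following is a sketch of an alternative, not the notebook's own method: split first, then fit a `ColumnTransformer` on the training rows only. The toy data, the use of `OneHotEncoder` in place of `LabelEncoder`, and the `stratify=y` argument are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with Titanic-style columns; the values are made up.
raw = pd.DataFrame({
    "Age": [22, 38, np.nan, 35, 35, 54, 2, 27, 14, 4],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.08, 11.13, 30.07, 16.7],
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "female", "female", "female"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})
X = raw.drop(columns="Survived")
y = raw["Survived"]

# Split first so the test rows never influence the fitted statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# Fit on the training rows only, then apply the same transform to the test rows.
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

One-hot encoding avoids implying an order between categories such as embarkation ports, which integer codes can suggest to some models.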

In [10]:
# Saving cleaned dataset
cleaned_data = pd.concat([X, y], axis=1)
# File name now matches the one reported in the message below.
output_path = "C:\\Users\\saksh\\OneDrive\\Desktop\\Python Practice\\titanic\\titanic_cleaned.csv"
cleaned_data.to_csv(output_path, index=False)
print("ETL Pipeline complete. Cleaned data saved as 'titanic_cleaned.csv'.")

ETL Pipeline complete. Cleaned data saved as 'titanic_cleaned.csv'.
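To support the reproducibility goal, a quick round-trip check can confirm that the saved file reloads to the same table. This is a sketch using a small made-up frame and a temporary directory, not the notebook's real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the cleaned dataset; the values are made up.
cleaned = pd.DataFrame({
    "Pclass": [3, 1],
    "FamilySize": [2, 1],
    "Survived": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "titanic_cleaned.csv"
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# True when columns, dtypes, and values all survive the save/load cycle.
round_trip_ok = reloaded.equals(cleaned)
print(round_trip_ok)
```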
