# ETL Pipeline (Airflow Concept Simulation in Python)

## 🎯 Objective
Simulate an ETL (Extract, Transform, Load) pipeline in plain Python.  
Learn the workflow of the ETL pipelines used in data engineering without setting up Airflow.

By the end of this notebook, you will:
- Understand ETL pipeline steps
- Implement ETL using Python functions
- Load data into a CSV file and a SQLite database

## ETL Pipeline Overview

**ETL Steps:**

1. **Extract**: Collect data from sources (CSV, API, database, etc.)
2. **Transform**: Clean, process, and prepare data for analysis
3. **Load**: Store transformed data into a destination (CSV, database, data warehouse)

**Python Simulation:**
- We'll use `pandas` for data handling
- We'll simulate extraction from CSV
- We'll clean and transform data
- We'll load data into a new CSV (or SQLite database)
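
As a rough sketch of how the three steps compose, each one can be written as a small function whose output feeds the next. The names `extract`, `transform`, `load`, and the in-memory `RAW_CSV` sample are assumptions made for illustration; the notebook's own cells below work on real files instead.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real source file
RAW_CSV = "id,value\n1,10\n2,\n3,30\n"


def extract(source):
    """Extract: read raw rows from a CSV source."""
    return pd.read_csv(source)


def transform(df):
    """Transform: fill missing values with the column mean."""
    out = df.copy()
    out["value"] = out["value"].fillna(out["value"].mean())
    return out


def load(df, target):
    """Load: write the cleaned rows to a CSV target."""
    df.to_csv(target, index=False)


# Chain the steps: the output of each one is the input of the next
buffer = io.StringIO()
load(transform(extract(io.StringIO(RAW_CSV))), buffer)
result_csv = buffer.getvalue()
print(result_csv)
```

Keeping the steps as separate functions makes each one easy to test on its own before wiring them together.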


In [16]:
# Import Required Libraries
import pandas as pd
import numpy as np
import sqlite3
from datetime import datetime

print("Libraries imported successfully!")


Libraries imported successfully!


## Extract Step
We will simulate extracting data from a CSV file.

**Task:**
- Read a sample CSV
- Inspect data

In [17]:
# Sample data creation (simulating a CSV input)
data = {
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, np.nan, 30, 45, 22],
    "salary": [50000, 60000, None, 80000, 45000],
    "department": ["HR", "Engineering", "Engineering", "Finance", "HR"]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Save to CSV (simulate source file)
df.to_csv("input_data.csv", index=False)

# Read CSV (Extract)
extracted_data = pd.read_csv("input_data.csv")
print("Extracted Data:")
extracted_data


Extracted Data:


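|   | id | name | age | salary | department |
|---|----|------|-----|--------|------------|
| 0 | 1 | Alice | 25.0 | 50000.0 | HR |
| 1 | 2 | Bob |  | 60000.0 | Engineering |
| 2 | 3 | Charlie | 30.0 |  | Engineering |
| 3 | 4 | David | 45.0 | 80000.0 | Finance |
| 4 | 5 | Eve | 22.0 | 45000.0 | HR |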


## Transform Step
**Transformations:**
1. Fill missing values
2. Convert data types
3. Create new columns
4. Filter data based on condition


In [18]:
# Fill missing values
transformed_data = extracted_data.copy()
transformed_data['age'] = transformed_data['age'].fillna(transformed_data['age'].mean())
transformed_data['salary'] = transformed_data['salary'].fillna(transformed_data['salary'].median())

# Convert data types
transformed_data['name'] = transformed_data['name'].astype(str)
transformed_data['department'] = transformed_data['department'].astype(str)

# Create new column: Tax (10% of salary)
transformed_data['tax'] = transformed_data['salary'] * 0.10

# Filter: Only include employees older than 25
transformed_data = transformed_data[transformed_data['age'] > 25]

print("Transformed Data:")
transformed_data


Transformed Data:


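|   | id | name | age | salary | department | tax |
|---|----|------|-----|--------|------------|-----|
| 1 | 2 | Bob | 30.5 | 60000.0 | Engineering | 6000.0 |
| 2 | 3 | Charlie | 30.0 | 55000.0 | Engineering | 5500.0 |
| 3 | 4 | David | 45.0 | 80000.0 | Finance | 8000.0 |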
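
The filled-in values in the output can be checked by hand. The sketch below recomputes them from the sample columns alone; the variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the sample "age" and "salary" columns
ages = pd.Series([25, np.nan, 30, 45, 22])
salaries = pd.Series([50000, 60000, np.nan, 80000, 45000])

# mean() and median() skip NaN by default
age_fill = ages.mean()            # (25 + 30 + 45 + 22) / 4
salary_fill = salaries.median()   # midpoint of 50000 and 60000

print(age_fill, salary_fill)
```

This gives 30.5 for Bob's age and 55000.0 for Charlie's salary, matching the table. Note that the filter uses a strict `> 25`, so Alice (exactly 25) is dropped, while Bob is kept only because his imputed age is 30.5.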


## Load Step
**Load the transformed data** into:
1. CSV
2. SQLite database


In [20]:
# Load to CSV
transformed_data.to_csv("transformed_data.csv", index=False)
print("Data loaded into 'transformed_data.csv' successfully!")

# Load to SQLite database
conn = sqlite3.connect("etl_pipeline.db")
transformed_data.to_sql("employees", conn, if_exists="replace", index=False)
print("Data loaded into SQLite database 'etl_pipeline.db' successfully!")

# Query to verify, then close the connection
result = pd.read_sql("SELECT * FROM employees", conn)
conn.close()
result


Data loaded into 'transformed_data.csv' successfully!
Data loaded into SQLite database 'etl_pipeline.db' successfully!


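|   | id | name | age | salary | department | tax |
|---|----|------|-----|--------|------------|-----|
| 0 | 2 | Bob | 30.5 | 60000.0 | Engineering | 6000.0 |
| 1 | 3 | Charlie | 30.0 | 55000.0 | Engineering | 5500.0 |
| 2 | 4 | David | 45.0 | 80000.0 | Finance | 8000.0 |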
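
One way to make sure the SQLite connection is always released, even if the load fails partway, is to wrap it in `contextlib.closing`. The sketch below uses only the standard library and an in-memory database; the table layout and the two sample rows are assumptions that mirror part of the output above, and the `DROP TABLE IF EXISTS` line imitates what `if_exists="replace"` does.

```python
import sqlite3
from contextlib import closing

# Two illustrative rows (id, name, salary)
rows = [(2, "Bob", 60000.0), (3, "Charlie", 55000.0)]

# closing() guarantees conn.close() runs when the block exits
with closing(sqlite3.connect(":memory:")) as conn:
    # "with conn" commits on success and rolls back on an exception
    with conn:
        conn.execute("DROP TABLE IF EXISTS employees")
        conn.execute(
            "CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)"
        )
        conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    loaded = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

print(f"Rows loaded: {loaded}")
```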


## Automation / Pipeline Function
Encapsulate the ETL steps in a single reusable Python function, standing in for the tasks an Airflow DAG would orchestrate.

In [21]:
def etl_pipeline(input_file, output_csv, db_name):
    # Extract
    df = pd.read_csv(input_file)
    
    # Transform
    df['age'] = df['age'].fillna(df['age'].mean())
    df['salary'] = df['salary'].fillna(df['salary'].median())
    df['tax'] = df['salary'] * 0.10
    df = df[df['age'] > 25]
    
    # Load
    df.to_csv(output_csv, index=False)
    conn = sqlite3.connect(db_name)
    df.to_sql("employees", conn, if_exists="replace", index=False)
    conn.close()
    print(f"ETL pipeline executed successfully! Output saved to {output_csv} and {db_name}")

# Run pipeline
etl_pipeline("input_data.csv", "final_data.csv", "final_etl.db")


ETL pipeline executed successfully! Output saved to final_data.csv and final_etl.db
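
The function above runs its steps in a fixed order. To get a feel for the DAG idea itself, here is a minimal sketch in which each step is a separate task and the run order is derived from declared dependencies using the standard library's `graphlib.TopologicalSorter` (Python 3.9+). This is not Airflow's API: the task functions, the shared `context` dict, and the sample ages are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter


def extract(ctx):
    # Sample ages with one missing value
    ctx["raw"] = [25, None, 30, 45, 22]


def transform(ctx):
    # Replace the missing age with the mean of the known ages
    known = [a for a in ctx["raw"] if a is not None]
    mean_age = sum(known) / len(known)
    ctx["clean"] = [mean_age if a is None else a for a in ctx["raw"]]


def load(ctx):
    # Keep only ages strictly above 25, as in the notebook's filter
    ctx["loaded"] = [a for a in ctx["clean"] if a > 25]


tasks = {"extract": extract, "transform": transform, "load": load}

# Map each task to the set of tasks that must finish before it
dependencies = {"transform": {"extract"}, "load": {"transform"}}

context = {}
run_order = list(TopologicalSorter(dependencies).static_order())
for name in run_order:
    tasks[name](context)

print(run_order)
print(context["loaded"])
```

Declaring dependencies rather than hard-coding the call order is the part that carries over to a real scheduler; scheduling, retries, and logging are what Airflow adds on top.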


## Summary
- Extracted data from CSV
- Cleaned missing values and transformed columns
- Filtered relevant rows
- Loaded data into CSV and SQLite database
- Encapsulated ETL as a reusable Python function

This simulates an **Airflow-like ETL workflow** without requiring an Airflow installation.

# ETL Pipeline with a Real Dataset (Titanic)

## Objective
- Extract, transform, and load the Titanic dataset
- Handle missing values, categorical encoding, and filtering
- Load transformed data into CSV and SQLite

## Extract Step
- Load the Titanic dataset directly from a URL (`pd.read_csv` accepts a local CSV path in the same way)

In [22]:
# Load dataset from URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_df = pd.read_csv(url)

print("First 5 rows of the dataset:")
titanic_df.head()

First 5 rows of the dataset:


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


## Transform Step
**Transformations:**
1. Handle missing values
2. Encode categorical columns
3. Create new features
4. Filter rows (optional)
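
For step 1, a text column such as `Embarked` has no mean, so a common choice is to fill gaps with the most frequent value. A minimal sketch on a made-up series (the `ports` values are assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical embarkation codes with one missing entry
ports = pd.Series(["S", "C", None, "S", "Q"])

# mode() returns a Series (there can be ties), so take its first entry
most_common = ports.mode()[0]
filled = ports.fillna(most_common)

print(most_common)
print(filled.tolist())
```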

In [23]:
# Copy data for transformation
df = titanic_df.copy()

# Fill missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode categorical columns
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

# Create new feature: FamilySize = SibSp + Parch + 1
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Filter: Keep only passengers with Fare > 0
df = df[df['Fare'] > 0]

print("Transformed data:")
df.head()


Transformed data:


|   | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_male | Embarked_Q | Embarked_S | FamilySize |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | True | False | True | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | False | False | False | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | False | False | True | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | False | False | True | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.05 |  | True | False | True | 1 |
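
The output has no `Sex_female` or `Embarked_C` column because `drop_first=True` drops the first category (in sorted order) of each encoded column; that category is implied when all remaining indicators are false. A small sketch on made-up rows (the `sample` frame is an assumption for illustration):

```python
import pandas as pd

# Three hypothetical passengers covering every category
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

encoded = pd.get_dummies(sample, columns=["Sex", "Embarked"], drop_first=True)
encoded_columns = list(encoded.columns)

print(encoded_columns)
print(encoded)
```

The second row boarded at `C`, so both `Embarked_Q` and `Embarked_S` are false for it, just as for passenger 2 in the table above.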


## Load Step
- Save the transformed dataset to CSV and SQLite

In [24]:
# Load to CSV
df.to_csv("titanic_transformed.csv", index=False)
print("Data loaded into 'titanic_transformed.csv' successfully!")

# Load to SQLite
conn = sqlite3.connect("titanic_etl.db")
df.to_sql("passengers", conn, if_exists="replace", index=False)
print("Data loaded into SQLite database 'titanic_etl.db' successfully!")

# Verify by reading from the database, then close the connection
result = pd.read_sql("SELECT * FROM passengers LIMIT 5", conn)
conn.close()
result


Data loaded into 'titanic_transformed.csv' successfully!
Data loaded into SQLite database 'titanic_etl.db' successfully!


|   | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_male | Embarked_Q | Embarked_S | FamilySize |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | 1 | 0 | 1 | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 0 | 0 | 0 | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | 0 | 0 | 1 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | 0 | 0 | 1 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.05 |  | 1 | 0 | 1 | 1 |
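
Notice that the indicator columns were `True`/`False` in the DataFrame but come back from SQLite as `1`/`0`: SQLite has no separate boolean storage type, so the flags are stored as integers. If boolean columns are needed after reading, they can be cast back. A minimal sketch with an in-memory database (the `flags` frame is an assumption for illustration):

```python
import sqlite3

import pandas as pd

# Hypothetical boolean indicator column
flags = pd.DataFrame({"Sex_male": [True, False, True]})

conn = sqlite3.connect(":memory:")
flags.to_sql("passengers", conn, index=False)
round_trip = pd.read_sql("SELECT * FROM passengers", conn)
conn.close()

# Values come back as integers; cast them to restore the bool dtype
restored = round_trip.astype({"Sex_male": bool})

print(round_trip["Sex_male"].tolist())
print(restored["Sex_male"].tolist())
```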


## Encapsulate ETL as Function

In [26]:
def titanic_etl_pipeline(input_url, output_csv, db_name):
    # Extract
    df = pd.read_csv(input_url)
    
    # Transform
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df = df[df['Fare'] > 0]
    
    # Load
    df.to_csv(output_csv, index=False)
    conn = sqlite3.connect(db_name)
    df.to_sql("passengers", conn, if_exists="replace", index=False)
    conn.close()
    print(f"ETL pipeline executed! Output saved to {output_csv} and {db_name}")

# Run pipeline
titanic_etl_pipeline(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
    "titanic_final.csv",
    "titanic_final.db"
)

ETL pipeline executed! Output saved to titanic_final.csv and titanic_final.db


## 📊 Summary
- Extracted Titanic dataset from URL
- Handled missing values and encoded categorical columns
- Created a new feature (`FamilySize`)
- Filtered rows and saved transformed data to CSV and SQLite
- Encapsulated ETL steps into a reusable function
