<a href="https://colab.research.google.com/github/saerarawas/AAI_634O_A11_202520/blob/main/week3/Implementing_ETL_Using_Python_for_a_Healthcare_Application_Implementing_the_ETL_Process.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hands-on Lab: Implementing the ETL (Extract, Transform, Load) Process**

**Objective:**

In this hands-on lab, students will learn how to implement the fundamental steps of the ETL process by extracting data from multiple sources, transforming the data, and loading it into a database. Students will use Python along with libraries such as Pandas for data transformation and PyMongo for loading the data into a MongoDB database.

By the end of this lab, students will be able to:

* Extract data from different sources (CSV and API).
* Clean, transform, and validate the data.
* Load the transformed data into MongoDB.
* Automate the ETL process by building a reusable pipeline.

**Pre-requisites:**

* Basic knowledge of Python.
* MongoDB Atlas account (or a local MongoDB instance).
* Install the required Python libraries:



**In this Lab:**

You are tasked with creating an ETL pipeline for a fictitious retail company. You will extract product and sales data from different sources (a CSV file and a REST API), transform the data by cleaning and standardizing it, and load the transformed data into MongoDB for further analysis.

Use Python and Pandas to extract the product data from this CSV file.

**1) Extract Data**
**Patient data (CSV file):**
You have a CSV file named patients.csv that contains basic patient information, such as ID,
name, age, and gender.


In [24]:
import pandas as pd

# Read the CSV file from the content folder
patients_df = pd.read_csv('/content/patients.csv')
print("Extracted Patients Data:")
print(patients_df)


Extracted Patients Data:
    patient_id             name  age  gender
0         P001      James Smith   45    Male
1         P002     Mary Johnson   32  Female
2         P003  Robert Williams   56    Male
3         P004   Patricia Brown   29  Female
4         P005       John Jones   67    Male
..         ...              ...  ...     ...
195       P196     Emily Brooks   41  Female
196       P197      Jack Fisher   29    Male
197       P198       Judith Lee   50  Female
198       P199       Sean Kelly   38    Male
199       P200  Rebecca Sanders   57  Female

[200 rows x 4 columns]


**Diagnostics data (simulated API):**

Diagnostics data is retrieved from a simulated API that provides information about medical
tests and results.

In [25]:
import requests

# Simulated API response (in a real scenario, use requests.get(URL).json())
diagonistic_data = [
    {"diagonistic_id": "D001", "patient_id": "P001", "test": "Blood Test", "result": "Normal"},
    {"diagonistic_id": "D002", "patient_id": "P002", "test": "X-Ray", "result": "Fracture"},
    {"diagonistic_id": "D003", "patient_id": "P003", "test": "MRI", "result": "Normal"}
]

print("Extracted Diagnostic Data:")
print(diagonistic_data)


Extracted Diagnostic Data:
[{'diagonistic_id': 'D001', 'patient_id': 'P001', 'test': 'Blood Test', 'result': 'Normal'}, {'diagonistic_id': 'D002', 'patient_id': 'P002', 'test': 'X-Ray', 'result': 'Fracture'}, {'diagonistic_id': 'D003', 'patient_id': 'P003', 'test': 'MRI', 'result': 'Normal'}]


**Step 2: Transform Data**

**2.1. Clean the Diagnostic Data**


Clean patient data: Let’s assume you need to filter out patients who are younger than 40
years old for a specific study.



In [26]:
# Filter out patients who are younger than 40 years old
filtered_patients_df = patients_df[patients_df['age'] >= 40]

# Display the cleaned data
print("Filtered Patients Data (Age >= 40):")
print(filtered_patients_df)

# Save the cleaned data to a new CSV file
filtered_patients_df.to_csv('/content/filtered_patients.csv', index=False)

Filtered Patients Data (Age >= 40):
    patient_id               name  age  gender
0         P001        James Smith   45    Male
2         P003    Robert Williams   56    Male
4         P005         John Jones   67    Male
5         P006       Linda Garcia   40  Female
7         P008      Barbara Davis   55  Female
..         ...                ...  ...     ...
193       P194  Dorothy Patterson   48  Female
194       P195      Benjamin Ward   55    Male
195       P196       Emily Brooks   41  Female
197       P198         Judith Lee   50  Female
199       P200    Rebecca Sanders   57  Female

[127 rows x 4 columns]


**2.2. Enrich the Diagnostic Data**


Enrich diagnostic data with patient information: Join the diagnostics data with
patient details (name, age, gender) to provide context for the test results.


In [27]:
diagonistic_df = pd.DataFrame(diagonistic_data)

# Join the diagnostic data with the filtered patient details
enriched_df = pd.merge(diagonistic_df, filtered_patients_df, on="patient_id")

# Display the enriched DataFrame
print("Enriched Diagonistics Data:")
print(enriched_df)

# Save the enriched DataFrame to a new CSV file
enriched_df.to_csv('/content/enriched_diagonistics.csv', index=False)
print("enriched_diagnostics.csv file has been created.")


Enriched Diagonistics Data:
  diagonistic_id patient_id        test  result             name  age gender
0           D001       P001  Blood Test  Normal      James Smith   45   Male
1           D003       P003         MRI  Normal  Robert Williams   56   Male
enriched_diagnostics.csv file has been created.


**Step 3: Load Data into MongoDB**

Now that the data is transformed and cleaned, load the product and sales data into MongoDB.

**3.1. Connect to MongoDB**

Ensure you have MongoDB running locally or use MongoDB Atlas. Connect to MongoDB using PyMongo.

In [28]:
!pip install pymongo
!pip install --upgrade pymongo

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
uri = "mongodb+srv://tsjannoun123:KufyyNNqnno0atX9@cluster0.sb8py.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"
# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))
# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)
from pymongo import MongoClient

# Access a specific database
db = client['patients_db']



Pinged your deployment. You successfully connected to MongoDB!


**3.2. Load Patients Data**

Insert the transformed patients data into the MongoDB patients collection.

In [29]:
# Convert DataFrame to dictionary and insert into MongoDB
patients_records = filtered_patients_df.to_dict(orient='records')
# Changed orient to 'records' as 'patients' is not a valid option
db.patients.insert_many(patients_records)
print("Loaded Patients Data into MongoDB")

Loaded Patients Data into MongoDB


**3.3. Load Diagonistic Data**

Insert the enriched diagonistic data into the MongoDB sales collection.

In [30]:
# Convert DataFrame to dictionary and insert into MongoDB
# diagnostic_data = sales_df.to_dict(orient='records') # sales_df is not defined, using enriched_df instead
diagnostic_data = enriched_df.to_dict(orient='records')  # enriched_df holds the diagnostic data
db.diagnostics.insert_many(diagnostic_data)
print("Loaded Diagnostics Data into MongoDB")



Loaded Diagnostics Data into MongoDB


**Step 4: Automate the ETL Process**

To make the ETL process reusable, wrap the steps into functions and run the ETL pipeline from start to finish.

In [31]:
# Function to extract patient data from CSV
def extract_patients():
    return pd.read_csv('/content/filtered_patients.csv')

# Function to extract diagnostic data
def extract_diagnostics():
    diagnostic_data = [
        {"diagnostic_id": "D001", "patient_id": "P001", "test": "Blood Test", "result": "Normal"},
        {"diagnostic_id": "D002", "patient_id": "P002", "test": "X-Ray", "result": "Fracture"},
        {"diagnostic_id": "D003", "patient_id": "P003", "test": "MRI", "result": "Normal"}
    ]
    return pd.DataFrame(diagnostic_data)

# Function to transform patient data
def transform_patients(patients_df):
    # Filter patients who are 40 years old or older
    patients_df['age'] = pd.to_numeric(patients_df['age'], errors='coerce')
    return patients_df[patients_df['age'] >= 40]

# Function to transform diagnostic data
def transform_diagnostics(diagnostic_df, patients_df):
    return pd.merge(diagnostic_df, patients_df[['patient_id', 'name', 'age', 'gender']], on='patient_id', how='left')

# Function to load data into MongoDB
def load_data(patients_df, diagnostic_df, connection_string, database_name):
    client = MongoClient(connection_string)
    db = client[database_name]
    # Convert DataFrame to dictionary and insert into MongoDB
    patients_records = patients_df.to_dict(orient='records')
    db.patients.insert_many(patients_records)
    print("Loaded Patients Data into MongoDB")

    diagnostics_records = diagnostic_df.to_dict(orient='records')
    db.diagnostics.insert_many(diagnostics_records)
    print("Loaded Diagnostics Data into MongoDB")

# Complete ETL pipeline
def run_etl(patients_file, diagnostics_file, connection_string, database_name):
    patients_df = extract_patients()
    diagnostic_df = extract_diagnostics()
    transformed_patients_df = transform_patients(patients_df)
    transformed_diagnostics_df = transform_diagnostics(diagnostic_df, transformed_patients_df)
    load_data(transformed_patients_df, transformed_diagnostics_df, connection_string, database_name)

# Parameters
patients_file = '/content/filtered_patients.csv'
diagnostics_file = '/content/diagnostics.csv'
connection_string = "mongodb+srv://tsjannoun123:KufyyNNqnno0atX9@cluster0.sb8py.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"
database_name = "patients_db"

# Run the ETL pipeline
run_etl(patients_file, diagnostics_file, connection_string, database_name)
print("ETL Process Completed!")



Loaded Patients Data into MongoDB
Loaded Diagnostics Data into MongoDB
ETL Process Completed!


**Conclusion:**
This hands-on lab provides a comprehensive introduction to the ETL process, from extracting raw data from multiple sources, transforming it for quality and consistency, and finally loading it into MongoDB.