<a href="https://colab.research.google.com/github/saerarawas/AAI_634O_A11_202520/blob/main/week3/Implementing_ETL_Using_Python_for_a_Healthcare_Application_Implementing_the_ETL_Process.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hands-on Lab: Implementing the ETL (Extract, Transform, Load) Process**

**Objective:**

In this hands-on lab, students will learn how to implement the fundamental steps of the ETL process by extracting data from multiple sources, transforming the data, and loading it into a database. Students will use Python along with libraries such as Pandas for data transformation and PyMongo for loading the data into a MongoDB database.

By the end of this lab, students will be able to:

* Extract data from different sources (CSV and API).
* Clean, transform, and validate the data.
* Load the transformed data into MongoDB.
* Automate the ETL process by building a reusable pipeline.

**Pre-requisites:**

* Basic knowledge of Python.
* MongoDB Atlas account (or a local MongoDB instance).
* Install the required Python libraries: pandas, requests, and pymongo.



**In this Lab:**

You are tasked with creating an ETL pipeline for a fictitious healthcare provider. You will extract patient and diagnostics data from different sources (a CSV file and a simulated REST API), transform the data by cleaning and enriching it, and load the transformed data into MongoDB for further analysis.

Use Python and Pandas to extract the patient data from the CSV file.

**Step 1: Extract Data**

**Patient data (CSV file):**

You have a CSV file named patients.csv that contains basic patient information, such as ID, name, age, and gender.


In [92]:
import pandas as pd

# Read the CSV file from the content folder
patients_df = pd.read_csv('/content/patients.csv')
print("Extracted Patients Data:")
print(patients_df)
# Print length of DataFrame
file_length = len(patients_df)
print(f"Total Patients: {file_length}")


Extracted Patients Data:
    patient_id             name  age  gender
0         P001      James Smith   45    Male
1         P002     Mary Johnson   32  Female
2         P003  Robert Williams   56    Male
3         P004   Patricia Brown   29  Female
4         P005       John Jones   67    Male
..         ...              ...  ...     ...
195       P196     Emily Brooks   41  Female
196       P197      Jack Fisher   29    Male
197       P198       Judith Lee   50  Female
198       P199       Sean Kelly   38    Male
199       P200  Rebecca Sanders   57  Female

[200 rows x 4 columns]
Total Patients: 200
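
Before transforming, it can help to confirm that the extracted file has the shape the later steps rely on. The sketch below is one possible check and is not part of the original lab: the helper name `validate_patients` and the inline sample CSV are assumptions for illustration, while the column names come from the output above.

```python
import io

import pandas as pd

# Columns the later steps rely on (taken from the patients.csv output above)
REQUIRED_COLUMNS = ["patient_id", "name", "age", "gender"]


def validate_patients(df):
    """Hypothetical helper: raise ValueError if the frame is not usable downstream."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df["patient_id"].duplicated().any():
        raise ValueError("duplicate patient_id values found")
    if not pd.api.types.is_numeric_dtype(df["age"]):
        raise ValueError("age column is not numeric")
    return df


# Small in-memory stand-in for /content/patients.csv
sample_csv = io.StringIO(
    "patient_id,name,age,gender\n"
    "P001,James Smith,45,Male\n"
    "P002,Mary Johnson,32,Female\n"
)
sample_df = validate_patients(pd.read_csv(sample_csv))
print(f"Validated {len(sample_df)} patient rows")
```

In the lab itself, the same check could be applied to `patients_df` right after `pd.read_csv`.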


**Diagnostics data (simulated API):**

Diagnostics data is retrieved from a simulated API that provides information about medical
tests and results.

In [93]:
import requests

# Simulated API response (in a real scenario, use requests.get(URL).json())
diagnostic_data = [
    {"diagnostic_id": "D001", "patient_id": "P001", "test": "Blood Test", "result": "Normal"},
    {"diagnostic_id": "D002", "patient_id": "P002", "test": "X-Ray", "result": "Fracture"},
    {"diagnostic_id": "D003", "patient_id": "P003", "test": "MRI", "result": "Normal"}
]

print("Extracted Diagnostic Data:")
print(diagnostic_data)
# Print the number of diagnostic records (diagnostic_data is a list, not a DataFrame)
file_length = len(diagnostic_data)
print(f"Diagnostic Data: {file_length}")

Extracted Diagnostic Data:
[{'diagnostic_id': 'D001', 'patient_id': 'P001', 'test': 'Blood Test', 'result': 'Normal'}, {'diagnostic_id': 'D002', 'patient_id': 'P002', 'test': 'X-Ray', 'result': 'Fracture'}, {'diagnostic_id': 'D003', 'patient_id': 'P003', 'test': 'MRI', 'result': 'Normal'}]
Diagnostic Data: 3
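
With a real API, the hard-coded list would be replaced by an HTTP request. The following is a minimal sketch of what that might look like with the requests library. The helper name `fetch_diagnostics`, its `http_get` parameter, and the URL in the usage comment are all assumptions for illustration; the lab does not define a real endpoint.

```python
import requests


def fetch_diagnostics(url, timeout=10, http_get=requests.get):
    """Hypothetical helper: GET a JSON array of diagnostic records from `url`."""
    response = http_get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    records = response.json()
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of diagnostic records")
    return records


# Usage (placeholder URL, not a real service):
# diagnostic_data = fetch_diagnostics("https://example.com/api/diagnostics")
```

Passing the HTTP function in as a parameter is a design choice that lets the helper be exercised with a stand-in during testing, without network access.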


**Step 2: Transform Data**

**2.1. Clean the Patient Data**


Clean patient data: Let’s assume a specific study includes only patients older than 40, so you need to filter out everyone aged 40 or younger.



In [94]:
# Keep only patients older than 40 (drops everyone aged 40 or younger)
filtered_patients_df = patients_df[patients_df['age'] > 40]

# Display the cleaned data
print("Filtered Patients Data (Age > 40):")
print(filtered_patients_df)

# Save the cleaned data to a new CSV file
filtered_patients_df.to_csv('/content/filtered_patients.csv', index=False)
# Print the number of patients that remain after filtering
file_length = len(filtered_patients_df)
print(f"Patients older than 40: {file_length}")

Filtered Patients Data (Age > 40):
    patient_id                name  age  gender
0         P001         James Smith   45    Male
2         P003     Robert Williams   56    Male
4         P005          John Jones   67    Male
7         P008       Barbara Davis   55  Female
9         P010  Elizabeth Martinez   62  Female
..         ...                 ...  ...     ...
193       P194   Dorothy Patterson   48  Female
194       P195       Benjamin Ward   55    Male
195       P196        Emily Brooks   41  Female
197       P198          Judith Lee   50  Female
199       P200     Rebecca Sanders   57  Female

[120 rows x 4 columns]
Patients older than 40: 120
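
Whether a patient aged exactly 40 is kept depends on the comparison operator used in the filter. A small sketch with made-up ages (not taken from patients.csv) shows the difference between `>` and `>=`:

```python
import pandas as pd

# Made-up ages chosen to sit on either side of the cutoff
ages_df = pd.DataFrame({"patient_id": ["A", "B", "C"], "age": [39, 40, 41]})

strictly_over_40 = ages_df[ages_df["age"] > 40]   # drops the 40-year-old
forty_and_over = ages_df[ages_df["age"] >= 40]    # keeps the 40-year-old

print(strictly_over_40["patient_id"].tolist())  # ['C']
print(forty_and_over["patient_id"].tolist())    # ['B', 'C']
```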


**2.2. Enrich the Diagnostic Data**


Enrich diagnostic data with patient information: Join the diagnostics data with
patient details (name, age, gender) to provide context for the test results.


In [95]:
# Convert diagnostic_data to a DataFrame
diagnostics_df = pd.DataFrame(diagnostic_data)

# Join diagnostics data with patient data to add patient name, age, and gender
diagnostics_df = pd.merge(diagnostics_df, patients_df[['patient_id', 'name', 'age', 'gender']], on='patient_id', how='left')
print("Enriched Diagnostics Data:")
print(diagnostics_df)

Enriched Diagnostics Data:
  diagnostic_id patient_id        test    result             name  age  gender
0          D001       P001  Blood Test    Normal      James Smith   45    Male
1          D002       P002       X-Ray  Fracture     Mary Johnson   32  Female
2          D003       P003         MRI    Normal  Robert Williams   56    Male
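
A left join keeps every diagnostic record even when no matching patient exists, in which case the patient columns are filled with missing values. The sketch below, using made-up sample frames, shows one way to surface such rows with the `indicator` and `validate` parameters of `pd.merge`. The ID `P999` is invented for illustration.

```python
import pandas as pd

patients = pd.DataFrame({
    "patient_id": ["P001", "P002"],
    "name": ["James Smith", "Mary Johnson"],
})
diagnostics = pd.DataFrame({
    "diagnostic_id": ["D001", "D002", "D999"],
    "patient_id": ["P001", "P002", "P999"],  # P999 is a made-up ID with no patient row
})

merged = pd.merge(
    diagnostics,
    patients,
    on="patient_id",
    how="left",
    validate="many_to_one",  # raises MergeError if patient_id repeats in patients
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Diagnostic rows that found no matching patient
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["diagnostic_id"].tolist())  # ['D999']
```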


**Step 3: Load Data into MongoDB**

Now that the data is transformed and cleaned, load the patient and diagnostics data into MongoDB.

**3.1. Connect to MongoDB**

Ensure you have MongoDB running locally or use MongoDB Atlas. Connect to MongoDB using PyMongo.

In [96]:
!pip install --upgrade pymongo

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

# Replace <username> and <password> (and the cluster host) with your own Atlas connection details.
# Never leave real credentials in a shared notebook.
uri = "mongodb+srv://<username>:<password>@cluster0.sb8py.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"
# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))
# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

# Access a specific database
db = client['patients_healthcare_db']



Pinged your deployment. You successfully connected to MongoDB!
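
Connection strings carry credentials, so it is safer to keep them out of the notebook. One possible approach, sketched below, reads the username and password from environment variables and escapes them before building the URI. The helper name `build_mongo_uri`, the variable names `MONGODB_USER` and `MONGODB_PASSWORD`, and the host in the example are assumptions for illustration.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(host, user_var="MONGODB_USER", password_var="MONGODB_PASSWORD", env=os.environ):
    """Hypothetical helper: assemble a mongodb+srv URI from environment variables."""
    user = quote_plus(env[user_var])          # escape reserved characters such as @ or :
    password = quote_plus(env[password_var])
    return f"mongodb+srv://{user}:{password}@{host}/?retryWrites=true&w=majority"


# Example with a fake environment mapping and a placeholder host
fake_env = {"MONGODB_USER": "lab_user", "MONGODB_PASSWORD": "p@ss:word"}
print(build_mongo_uri("cluster0.example.mongodb.net", env=fake_env))
# mongodb+srv://lab_user:p%40ss%3Aword@cluster0.example.mongodb.net/?retryWrites=true&w=majority
```

The resulting string can be passed to `MongoClient` in place of a hard-coded URI.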


**3.2. Load Patients Data**

Insert the transformed patients data into the MongoDB patients collection.

In [97]:
# Convert DataFrame to a list of dictionaries and insert into MongoDB
patients_records = filtered_patients_df.to_dict(orient='records')
insert_result = db.patients.insert_many(patients_records)
print("Loaded Patients Data into MongoDB")
# Report how many documents this insert added
# (counting the whole collection would also include documents from earlier runs)
record_count = len(insert_result.inserted_ids)
print(f"Number of records loaded: {record_count}")

Loaded Patients Data into MongoDB
Number of records loaded: 120
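
Running `insert_many` a second time on the same data adds the same patients again. One way to make the load repeatable is to upsert on `patient_id` instead. The sketch below uses PyMongo's `UpdateOne` and `bulk_write`; the helper names `build_upsert_specs` and `upsert_records` are hypothetical and not part of the original lab.

```python
def build_upsert_specs(records, key="patient_id"):
    """Hypothetical helper: one (filter, update) pair per record, matched on `key`."""
    return [({key: record[key]}, {"$set": record}) for record in records]


def upsert_records(collection, records, key="patient_id"):
    """Insert new records and overwrite existing ones that share the same key."""
    from pymongo import UpdateOne  # imported here so build_upsert_specs runs without PyMongo

    operations = [
        UpdateOne(filter_doc, update_doc, upsert=True)
        for filter_doc, update_doc in build_upsert_specs(records, key)
    ]
    return collection.bulk_write(operations)


specs = build_upsert_specs([{"patient_id": "P001", "name": "James Smith", "age": 45}])
print(specs[0][0])  # {'patient_id': 'P001'}
```

In the lab, this could be called as `upsert_records(db.patients, filtered_patients_df.to_dict(orient='records'))`. Pass freshly built records rather than dictionaries already handed to `insert_many`, since PyMongo adds an `_id` field to inserted documents in place.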


**3.3. Load Diagnostics Data**

Insert the enriched diagnostics data into the MongoDB diagnostics collection.

In [98]:
# Convert DataFrame to a list of dictionaries and insert into MongoDB
diagnostics_records = diagnostics_df.to_dict(orient='records')
insert_result1 = db.diagnostics.insert_many(diagnostics_records)
if insert_result1.acknowledged:
    print(f"{len(insert_result1.inserted_ids)} records of diagnostics data loaded into MongoDB")
else:
    print("Error loading diagnostics data into MongoDB")

3 records of diagnostics data loaded into MongoDB
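
When a diagnostic record has no matching patient, the left join leaves NaN in the patient columns, and `to_dict` passes those NaN values straight through to MongoDB. A minimal sketch of one way to store them as null instead follows. The helper name `to_mongo_records` and the sample frame are assumptions for illustration, and the helper assumes every cell holds a scalar value.

```python
import pandas as pd


def to_mongo_records(df):
    """Hypothetical helper: convert a DataFrame to dicts, mapping missing values to None."""
    records = df.to_dict(orient="records")
    return [
        {key: (None if pd.isna(value) else value) for key, value in record.items()}
        for record in records
    ]


# Made-up frame: D002 has no matching patient, so name and age are missing
enriched = pd.DataFrame({
    "diagnostic_id": ["D001", "D002"],
    "name": ["James Smith", None],
    "age": [45.0, float("nan")],
})
print(to_mongo_records(enriched)[1])
# {'diagnostic_id': 'D002', 'name': None, 'age': None}
```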


**Step 4: Automate the ETL Process**

To make the ETL process reusable, wrap the steps into functions and run the ETL pipeline from start to finish.

In [99]:
def extract_patients():
    # Read the raw patient file; filtering happens in transform_patients
    return pd.read_csv('/content/patients.csv')

def extract_diagnostics():
    # Uses the simulated API response defined in the Extract step
    return pd.DataFrame(diagnostic_data)

def transform_patients(patients_df):
    # Keep only patients older than 40
    return patients_df[patients_df['age'] > 40]

def transform_diagnostics(diagnostics_df, patients_df):
    # Enrich each diagnostic record with the patient's name, age, and gender
    return pd.merge(diagnostics_df, patients_df[['patient_id', 'name', 'age', 'gender']], on='patient_id', how='left')

def load_data(patients_df, diagnostics_df):
    # Uses the db handle created when connecting to MongoDB
    db.patients_ETL.insert_many(patients_df.to_dict(orient='records'))
    db.diagnostics_ETL.insert_many(diagnostics_df.to_dict(orient='records'))

# Run the ETL pipeline
patients_df = extract_patients()
diagnostics_df = extract_diagnostics()
transformed_patients_df = transform_patients(patients_df)
# Enrich against the full patient list so every diagnostic record finds its patient
transformed_diagnostics_df = transform_diagnostics(diagnostics_df, patients_df)
load_data(transformed_patients_df, transformed_diagnostics_df)
print("ETL Process Completed!")

ETL Process Completed!
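
The functions above rely on notebook globals (`db`, `diagnostic_data`) and a fixed file path, which makes the pipeline hard to rerun elsewhere. As a sketch of one alternative, the version below passes every dependency in as a parameter. The names `run_etl`, `FakeCollection`, and `min_age` are hypothetical, and the sample data is made up so the example runs without a database.

```python
import io

import pandas as pd


def run_etl(patients_source, diagnostic_records, patients_collection, diagnostics_collection, min_age=40):
    """Hypothetical pipeline: extract, transform, and load with every dependency passed in."""
    # Extract
    patients_df = pd.read_csv(patients_source)
    diagnostics_df = pd.DataFrame(diagnostic_records)
    # Transform: keep patients older than min_age, then enrich diagnostics with patient details
    filtered_df = patients_df[patients_df["age"] > min_age]
    enriched_df = diagnostics_df.merge(
        patients_df[["patient_id", "name", "age", "gender"]], on="patient_id", how="left"
    )
    # Load
    patients_collection.insert_many(filtered_df.to_dict(orient="records"))
    diagnostics_collection.insert_many(enriched_df.to_dict(orient="records"))
    return len(filtered_df), len(enriched_df)


class FakeCollection:
    """Stand-in for a PyMongo collection so the sketch runs without a database."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)


csv_text = "patient_id,name,age,gender\nP001,James Smith,45,Male\nP002,Mary Johnson,32,Female\n"
patients_out, diagnostics_out = FakeCollection(), FakeCollection()
counts = run_etl(
    io.StringIO(csv_text),
    [{"diagnostic_id": "D001", "patient_id": "P001", "test": "Blood Test", "result": "Normal"}],
    patients_out,
    diagnostics_out,
)
print(counts)  # (1, 1)
```

With a real database, `db.patients_ETL` and `db.diagnostics_ETL` could be passed in place of the fake collections. Note that a real PyMongo `insert_many` rejects an empty list, so a production version would need to guard against empty results.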


**Conclusion:**
This hands-on lab introduces the ETL process end to end: extracting raw data from multiple sources, transforming it for quality and consistency, and loading it into MongoDB.