sebastianjaimesg-coder/Python_ETLpipeline


# 📊 ETL Pipeline with Python

This project is an example of an ETL pipeline that analyzes, cleans, validates, and exports a hospital dataset originally provided in JSON format.

## 📂 Project Structure

```
ETL_pipeline/
│
├─ data/
│   ├─ raw/        # Original data (JSON)
│   ├─ interim/    # Intermediate data generated by explore.py
│   └─ clean/      # Final cleaned data
│
├─ reports/        # Exploration, cleaning, and validation reports
│
├─ src/            # Python scripts
│   ├─ explore.py       # Initial dataset exploration
│   ├─ clean.py         # Cleaning and validation
│   ├─ export_excel.py  # Export cleaned dataset to Excel
│   ├─ compare.py       # Quality comparison before/after cleaning
│   └─ load_to_dw.py    # Data Warehouse load simulation
│
├─ tests/          # Automated data quality tests
│
├─ requirements.txt
└─ README.md
```

βš™οΈ Requirements

Python 3.10+

Libraries:

pandas numpy python-dateutil openpyxl pyarrow

Installation:

pip install -r requirements.txt

## 🚀 Step-by-Step Execution

### 1️⃣ Place the original file

Save `dataset_hospital.json` in `data/raw/`.

### 2️⃣ Data exploration

```bash
python src/explore.py
```

Outputs:
- Intermediate CSV files (`pacientes_raw.csv`, `citas_medicas_raw.csv`) in `data/interim/`
- `exploration_report.md` in `reports/`
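A rough sketch of what the exploration step might do. The JSON snippet below is invented for illustration, and the top-level keys `pacientes` and `citas_medicas` are assumptions; the real script reads `data/raw/dataset_hospital.json` and writes the interim CSVs and report.

```python
import json
import pandas as pd

# Hypothetical slice of data/raw/dataset_hospital.json (assumed shape)
raw = json.loads("""{
  "pacientes": [{"id_paciente": 1, "edad": null}],
  "citas_medicas": [{"id_cita": 10, "id_paciente": 1}]
}""")

# Split the nested JSON into one DataFrame per entity
pacientes = pd.DataFrame(raw["pacientes"])
citas = pd.DataFrame(raw["citas_medicas"])

# A quick null profile like the one summarized in reports/exploration_report.md
nulls_per_column = pacientes.isna().sum().to_dict()
```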

### 3️⃣ Data cleaning and validation

```bash
python src/clean.py
```

Outputs:
- Clean CSV files in `data/clean/`
- `cleaning_summary.md` and `orphan_citas.csv` in `reports/`

### 4️⃣ Export to Excel

```bash
python src/export_excel.py
```

Output:
- `hospital_dataset_clean.xlsx` in `data/`
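A minimal sketch of a multi-sheet Excel export with pandas and the `openpyxl` engine. The sample frames are stand-ins for the cleaned CSVs, and the export here goes to an in-memory buffer; the real script targets `data/hospital_dataset_clean.xlsx`.

```python
from io import BytesIO

import pandas as pd

# Invented sample tables standing in for data/clean/ CSVs
pacientes = pd.DataFrame({"id_paciente": [1, 2], "sexo": ["F", "M"]})
citas = pd.DataFrame({"id_cita": [10], "id_paciente": [1]})

# Write each table to its own sheet of a single workbook
buffer = BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pacientes.to_excel(writer, sheet_name="pacientes", index=False)
    citas.to_excel(writer, sheet_name="citas_medicas", index=False)

# Read one sheet back to confirm the round trip
back = pd.read_excel(BytesIO(buffer.getvalue()), sheet_name="pacientes")
```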

### 5️⃣ Data Warehouse load simulation

```bash
python src/load_to_dw.py
```

Loads the cleaned data into a SQLite target structure with two tables: `dim_pacientes` and `fact_citas`.
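A minimal sketch of what such a star-schema load might look like, using only the standard library's `sqlite3` with an in-memory database. The column sets and sample rows are assumptions for illustration; the real script reads the cleaned CSVs from `data/clean/`.

```python
import sqlite3

# In-memory stand-in for the project's SQLite target
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star-schema target: one dimension table, one fact table
cur.execute("""CREATE TABLE dim_pacientes (
    id_paciente INTEGER PRIMARY KEY,
    sexo TEXT,
    ciudad TEXT)""")
cur.execute("""CREATE TABLE fact_citas (
    id_cita INTEGER PRIMARY KEY,
    id_paciente INTEGER REFERENCES dim_pacientes(id_paciente),
    costo REAL)""")

# Load sample rows in place of the cleaned CSV contents
cur.executemany("INSERT INTO dim_pacientes VALUES (?, ?, ?)",
                [(1, "F", "Bogota"), (2, "M", "Medellin")])
cur.executemany("INSERT INTO fact_citas VALUES (?, ?, ?)",
                [(10, 1, 120.0), (11, 2, 95.5)])
conn.commit()

# Join fact to dimension to verify the load
rows = cur.execute("""SELECT f.id_cita, d.ciudad
                      FROM fact_citas f
                      JOIN dim_pacientes d USING (id_paciente)
                      ORDER BY f.id_cita""").fetchall()
```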

## 🧹 Cleaning Process Overview

Detected issues:
- **Patients:** duplicates by `id_paciente`; null values in `edad`, `sexo`, `email`, `telefono`, `ciudad`.
- **Appointments:** null values in `fecha_cita`, `especialidad`, `medico`, `costo`, `estado_cita`; 190 orphan records (nonexistent patient IDs).

Actions taken:
- Standardized date format (`YYYY-MM-DD`)
- Recalculated age from `fecha_nacimiento`
- Normalized gender (`M` / `F`)
- Removed duplicates by `id_paciente`
- Referential integrity check (orphan appointments logged in `orphan_citas.csv`)
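The cleaning actions could be sketched in pandas roughly as follows. The sample rows, column values, and normalization rules are invented for illustration and do not come from the real dataset or `clean.py`.

```python
import pandas as pd
from dateutil import parser

# Invented sample rows; the real script reads data/interim/*.csv
pacientes = pd.DataFrame({
    "id_paciente": [1, 2, 2, 3],
    "fecha_nacimiento": ["1990-05-01", "1985/12/31", "1985/12/31", "2000-01-15"],
    "sexo": ["m", "F", "F", "masculino"],
})
citas = pd.DataFrame({"id_cita": [10, 11], "id_paciente": [1, 99]})

# Standardize mixed date formats to YYYY-MM-DD
pacientes["fecha_nacimiento"] = pacientes["fecha_nacimiento"].apply(
    lambda s: parser.parse(s).strftime("%Y-%m-%d")
)

# Normalize gender to M / F (first letter, uppercased)
pacientes["sexo"] = pacientes["sexo"].str.strip().str[0].str.upper()

# Remove duplicate patients, keeping the first occurrence
pacientes = pacientes.drop_duplicates(subset="id_paciente")

# Referential integrity: appointments pointing at nonexistent patients
orphans = citas[~citas["id_paciente"].isin(pacientes["id_paciente"])]
```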

## 📈 Data Quality Indicators

| Table | Initial rows | Final rows | Initial null values | Final null values | Initial duplicates | Final duplicates |
|---|---|---|---|---|---|---|
| pacientes | 5,010 | 5,000 | 7,671 | 6,021 | 10 | 0 |
| citas_medicas | 9,961 | 9,961 | 11,250 | 11,250 | 0 | 0 |

## ✅ Automated Validation

Tests executed with pytest (`tests/test_data_quality.py`):
- No duplicate IDs in patients or appointments
- All patient IDs are valid
- Referential integrity between appointments and patients
- All required columns are present

Result: all tests passed successfully.
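A sketch of what such pytest-style checks might look like. The inline frames are stand-ins for the real cleaned CSVs, and the actual `tests/test_data_quality.py` may differ in names and coverage.

```python
import pandas as pd

# Hypothetical cleaned frames standing in for data/clean/ CSVs
pacientes = pd.DataFrame({"id_paciente": [1, 2, 3]})
citas = pd.DataFrame({"id_cita": [10, 11], "id_paciente": [1, 3]})


def test_no_duplicate_ids():
    # IDs must be unique in both tables
    assert pacientes["id_paciente"].is_unique
    assert citas["id_cita"].is_unique


def test_referential_integrity():
    # Every appointment must reference an existing patient
    assert citas["id_paciente"].isin(pacientes["id_paciente"]).all()


def test_required_columns_present():
    # Key columns must survive the cleaning step
    assert {"id_cita", "id_paciente"} <= set(citas.columns)
```

Run with `pytest tests/` from the project root; pytest discovers and executes every `test_*` function.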

## 📦 Deliverables

Data:
- `data/raw/dataset_hospital.json` → original source
- `data/interim/` → intermediate data
- `data/clean/` → cleaned data
- `data/hospital_dataset_clean.xlsx` → final consolidated dataset

Reports:
- `exploration_report.md`
- `cleaning_summary.md`
- `orphan_citas.csv`
- `final_report.md` (this technical document)

Scripts:
- `explore.py`, `clean.py`, `export_excel.py`, `compare.py`, `load_to_dw.py`

## 💡 Recommendations for Improvement

- Fill in missing key fields from external sources.
- Review and correct orphan appointments.
- Standardize specialty and physician names.
- Define business rules to impute `costo` and `estado_cita`.
