π ETL pipeline with python
This project is a example of a ETL pipeline, with the goal of analyzing, cleaning, validating, and exporting a hospital dataset originally in JSON format.
π Project Structure
ETL_pipeline/
β
ββ data/
β ββ raw/ # Original data (JSON)
β ββ interim/ # Intermediate data generated by explore.py
β ββ clean/ # Final cleaned data
β
ββ reports/ # Exploration, cleaning, and validation reports
β
ββ src/ # Python scripts
β ββ explore.py # Initial dataset exploration
β ββ clean.py # Cleaning and validation
β ββ export_excel.py # Export cleaned dataset to Excel
β ββ compare.py # Quality comparison before/after cleaning
β ββ load_to_dw.py # Data Warehouse load simulation
β
ββ tests/ # Automated data quality tests
β
ββ requirements.txt
ββ README.md
Python 3.10+
Libraries:
pandas numpy python-dateutil openpyxl pyarrow
pip install -r requirements.txt
1οΈβ£ Place the original file Save dataset_hospital.json in data/raw/.
2οΈβ£ Data exploration python src/explore.py
Outputs: Intermediate CSV files (pacientes_raw.csv, citas_medicas_raw.csv) in data/interim/ exploration_report.md in reports/
3οΈβ£ Data cleaning and validation python src/clean.py
Outputs: Clean CSV files in data/clean/ cleaning_summary.md and orphan_citas.csv in reports/
4οΈβ£ Export to Excel python src/export_excel.py
Output: hospital_dataset_clean.xlsx in data/
5οΈβ£ Data Warehouse load simulation python src/load_to_dw.py
Loads cleaned data into a SQLite target structure with: dim_pacientes fact_citas
Detected issues: Patients: duplicates by id_paciente, null values in edad, sexo, email, telefono, ciudad. Appointments: null values in fecha_cita, especialidad, medico, costo, estado_cita; 190 orphan records (nonexistent patient IDs).
Actions taken: Standardized date format (YYYY-MM-DD) Recalculated age from fecha_nacimiento Normalized gender (M / F) Removed duplicates (id_paciente) Referential integrity check (orphan appointments logged in orphan_citas.csv)
Table Initial rows Final rows Initial null values Final null values Initial duplicates Final duplicates pacientes 5,010 5,000 7,671 6,021 10 0 citas_medicas 9,961 9,961 11,250 11,250 0 0
Tests executed using pytest (tests/test_data_quality.py): No duplicate IDs in patients or appointments All patient IDs are valid Referential integrity between appointments and patients All required columns are present Result: All tests passed successfully.
Data: data/raw/dataset_hospital.json β Original source data/interim/ β Intermediate data data/clean/ β Cleaned data data/hospital_dataset_clean.xlsx β Final consolidated dataset
Reports: exploration_report.md cleaning_summary.md orphan_citas.csv final_report.md (this technical document) Scripts: explore.py, clean.py, export_excel.py, compare.py, load_to_dw.py
Fill in missing key fields from external sources. Review and correct orphan appointments. Standardize specialty and physician names. Define business rules to impute costo and estado_cita.