GitHub - sebastianjaimesg-coder/Python_ETLpipeline

📊 ETL Pipeline with Python

This project is an example of an ETL pipeline that analyzes, cleans, validates, and exports a hospital dataset originally stored in JSON format.

📂 Project Structure
ETL_pipeline/
│
├─ data/
│   ├─ raw/        # Original data (JSON)
│   ├─ interim/    # Intermediate data generated by explore.py
│   └─ clean/      # Final cleaned data
│
├─ reports/        # Exploration, cleaning, and validation reports
│
├─ src/            # Python scripts
│   ├─ explore.py       # Initial dataset exploration
│   ├─ clean.py         # Cleaning and validation
│   ├─ export_excel.py  # Export cleaned dataset to Excel
│   ├─ compare.py       # Quality comparison before/after cleaning
│   └─ load_to_dw.py    # Data Warehouse load simulation
│
├─ tests/          # Automated data quality tests
│
├─ requirements.txt
└─ README.md

⚙️ Requirements

Python 3.10+

Libraries:

- pandas
- numpy
- python-dateutil
- openpyxl
- pyarrow

Installation:

pip install -r requirements.txt

🚀 Step-by-Step Execution

1️⃣ Place the original file

Save dataset_hospital.json in data/raw/.

2️⃣ Data exploration

python src/explore.py

Outputs:
- Intermediate CSV files (pacientes_raw.csv, citas_medicas_raw.csv) in data/interim/
- exploration_report.md in reports/
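As a rough illustration of what an exploration pass can report, here is a minimal standard-library sketch. It assumes, without confirmation from the repository, that dataset_hospital.json is a JSON object whose top-level keys are `pacientes` and `citas_medicas`, each holding a list of records. The helper `profile_records` and the field `id_cita` are hypothetical names used only for this example.

```python
import json
from collections import Counter


def profile_records(records):
    """Return the row count and the number of null/empty values per field."""
    nulls = Counter()
    for row in records:
        for field, value in row.items():
            if value is None or value == "":
                nulls[field] += 1
    return {"rows": len(records), "nulls": dict(nulls)}


# Tiny inline sample standing in for data/raw/dataset_hospital.json.
# The top-level keys are an assumption inferred from the interim file names.
sample = json.loads("""
{
  "pacientes": [
    {"id_paciente": 1, "edad": 34, "sexo": "F", "ciudad": "Bogota"},
    {"id_paciente": 2, "edad": null, "sexo": "", "ciudad": null}
  ],
  "citas_medicas": [
    {"id_cita": 10, "id_paciente": 1, "costo": null}
  ]
}
""")

# One profile per table, keyed by table name.
report = {name: profile_records(rows) for name, rows in sample.items()}
print(report)
```

A per-table profile like this is the kind of summary that could feed exploration_report.md; the actual report contents are not described in this README.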

3️⃣ Data cleaning and validation

python src/clean.py

Outputs:
- Clean CSV files in data/clean/
- cleaning_summary.md and orphan_citas.csv in reports/

4️⃣ Export to Excel

python src/export_excel.py

Output: hospital_dataset_clean.xlsx in data/

5️⃣ Data Warehouse load simulation

python src/load_to_dw.py

Loads cleaned data into a SQLite target structure with:
- dim_pacientes
- fact_citas
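The target structure can be pictured with the following sketch, which uses Python's built-in sqlite3 module and an in-memory database. Only the table names dim_pacientes and fact_citas come from this README; the column subsets, the `id_cita` key, and the sample rows are assumptions made for illustration, and the real schema in load_to_dw.py may differ.

```python
import sqlite3

# Hypothetical column subsets; the real schema in load_to_dw.py may differ.
DDL = """
CREATE TABLE dim_pacientes (
    id_paciente INTEGER PRIMARY KEY,
    sexo        TEXT,
    ciudad      TEXT
);
CREATE TABLE fact_citas (
    id_cita     INTEGER PRIMARY KEY,
    -- SQLite only enforces this reference when PRAGMA foreign_keys = ON.
    id_paciente INTEGER REFERENCES dim_pacientes (id_paciente),
    fecha_cita  TEXT,
    costo       REAL
);
"""

# Made-up rows standing in for the cleaned CSV files.
pacientes = [(1, "F", "Bogota"), (2, "M", "Medellin")]
citas = [(10, 1, "2024-01-15", 120.0), (11, 2, "2024-02-03", None)]

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.executescript(DDL)
conn.executemany("INSERT INTO dim_pacientes VALUES (?, ?, ?)", pacientes)
conn.executemany("INSERT INTO fact_citas VALUES (?, ?, ?, ?)", citas)
conn.commit()

# Appointments per city, joining the fact table to the patient dimension.
rows = conn.execute(
    "SELECT p.ciudad, COUNT(*) FROM fact_citas c "
    "JOIN dim_pacientes p USING (id_paciente) "
    "GROUP BY p.ciudad ORDER BY p.ciudad"
).fetchall()
print(rows)
```

Foreign-key enforcement is left off here on purpose: with it enabled, appointments that point to nonexistent patients would be rejected at load time. Whether the project's load step enforces the reference is not stated.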

🧹 Cleaning Process Overview

Detected issues:
- Patients: duplicates by id_paciente; null values in edad, sexo, email, telefono, ciudad.
- Appointments: null values in fecha_cita, especialidad, medico, costo, estado_cita; 190 orphan records (nonexistent patient IDs).

Actions taken:
- Standardized date format (YYYY-MM-DD)
- Recalculated age from fecha_nacimiento
- Normalized gender (M / F)
- Removed duplicates (id_paciente)
- Referential integrity check (orphan appointments logged in orphan_citas.csv)
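The cleaning actions listed above could look roughly like this in pandas. This is a sketch under stated assumptions, not the code in clean.py: the sample rows, the `SEXO_MAP` spellings, the `id_cita` column, and the helper `clean_pacientes` are all hypothetical, and the reference date is passed in explicitly so the age calculation is reproducible.

```python
import pandas as pd

# Made-up rows; the real data lives in data/interim/.
pacientes = pd.DataFrame({
    "id_paciente": [1, 2, 2, 3],
    "fecha_nacimiento": ["1990-05-20", "1985-12-01", "1985-12-01", "not a date"],
    "sexo": ["female", "M", "M", None],
})
citas = pd.DataFrame({"id_cita": [10, 11, 12], "id_paciente": [1, 2, 99]})

# Hypothetical mapping; the raw spellings in the real dataset are not documented.
SEXO_MAP = {"m": "M", "male": "M", "masculino": "M",
            "f": "F", "female": "F", "femenino": "F"}


def clean_pacientes(df, today):
    # Keep the first row for each id_paciente.
    out = df.drop_duplicates(subset="id_paciente", keep="first").copy()
    # Unparseable dates become NaT instead of raising.
    born = pd.to_datetime(out["fecha_nacimiento"], errors="coerce")
    out["fecha_nacimiento"] = born.dt.strftime("%Y-%m-%d")
    # Age = year difference, minus one if the birthday has not happened yet.
    before_birthday = (born.dt.month > today.month) | (
        (born.dt.month == today.month) & (born.dt.day > today.day))
    age = today.year - born.dt.year - before_birthday.astype(int)
    out["edad"] = age.astype("Int64")  # nullable integer keeps missing ages
    # Values outside the mapping (including nulls) stay missing.
    out["sexo"] = out["sexo"].str.strip().str.lower().map(SEXO_MAP)
    return out


clean = clean_pacientes(pacientes, today=pd.Timestamp("2024-06-01"))
# Orphans: appointments whose id_paciente has no matching patient.
orphans = citas[~citas["id_paciente"].isin(clean["id_paciente"])]
print(clean)
print(orphans)
```

In this sketch the orphan rows are only selected, not dropped, which matches the README's description of logging them to orphan_citas.csv.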

📈 Data Quality Indicators

| Table | Initial rows | Final rows | Initial null values | Final null values | Initial duplicates | Final duplicates |
|---|---|---|---|---|---|---|
| pacientes | 5,010 | 5,000 | 7,671 | 6,021 | 10 | 0 |
| citas_medicas | 9,961 | 9,961 | 11,250 | 11,250 | 0 | 0 |
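Indicators of this kind can be computed with a few pandas calls. The sketch below assumes that "null values" means the total count of null cells across all columns and that "duplicates" means surplus rows sharing the same key; how compare.py actually defines them is not stated here. The function `quality_indicators` and the sample data are hypothetical.

```python
import pandas as pd


def quality_indicators(df, key):
    """Rows, total null cells, and surplus rows sharing the same key."""
    return {
        "rows": len(df),
        "null_values": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated(subset=key).sum()),
    }


# Made-up table with one duplicated id_paciente and two null cells.
raw = pd.DataFrame({
    "id_paciente": [1, 2, 2, 3],
    "ciudad": ["Bogota", None, None, "Cali"],
})
clean = raw.drop_duplicates(subset="id_paciente")

before = quality_indicators(raw, "id_paciente")
after = quality_indicators(clean, "id_paciente")
print(before)
print(after)
```

Running the same function on the raw and cleaned tables gives the before/after pairs shown in the table columns.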

✅ Automated Validation

Tests executed using pytest (tests/test_data_quality.py):
- No duplicate IDs in patients or appointments
- All patient IDs are valid
- Referential integrity between appointments and patients
- All required columns are present

Result: All tests passed successfully.
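For orientation, checks like the ones listed above might be written as plain pytest-style functions. This is an illustrative sketch, not the contents of tests/test_data_quality.py: the inline tables stand in for the files in data/clean/, the `id_cita` column and `REQUIRED_PACIENTES` set are hypothetical, and "valid" is assumed to mean non-null and positive.

```python
import pandas as pd

# Inline stand-ins for the cleaned tables; real tests would read data/clean/.
pacientes = pd.DataFrame({
    "id_paciente": [1, 2, 3],
    "edad": [34, 38, 51],
    "sexo": ["F", "M", "F"],
})
citas = pd.DataFrame({"id_cita": [10, 11], "id_paciente": [1, 3]})

# Hypothetical subset of required columns.
REQUIRED_PACIENTES = {"id_paciente", "edad", "sexo"}


def test_no_duplicate_ids():
    assert not pacientes["id_paciente"].duplicated().any()
    assert not citas["id_cita"].duplicated().any()


def test_patient_ids_are_valid():
    # Assumption: a valid ID is non-null and positive.
    ids = pacientes["id_paciente"]
    assert ids.notna().all()
    assert (ids > 0).all()


def test_referential_integrity():
    # Every appointment must point to an existing patient.
    assert citas["id_paciente"].isin(pacientes["id_paciente"]).all()


def test_required_columns_present():
    assert REQUIRED_PACIENTES.issubset(pacientes.columns)
```

pytest collects any function whose name starts with `test_`, so saving this as a file under tests/ and running `pytest` would execute all four checks.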

📦 Deliverables

Data:
- data/raw/dataset_hospital.json → Original source
- data/interim/ → Intermediate data
- data/clean/ → Cleaned data
- data/hospital_dataset_clean.xlsx → Final consolidated dataset

Reports:
- exploration_report.md
- cleaning_summary.md
- orphan_citas.csv
- final_report.md (this technical document)

Scripts: explore.py, clean.py, export_excel.py, compare.py, load_to_dw.py

💡 Recommendations for Improvement

- Fill in missing key fields from external sources.
- Review and correct orphan appointments.
- Standardize specialty and physician names.
- Define business rules to impute costo and estado_cita.
