*** Data Cleaning Process Documentation ***
Introduction

For this project, I cleaned a healthcare dataset to make it accurate, consistent, and aligned with HIPAA privacy requirements. The goal was to prepare it for star schema modeling and SQL analysis.

1. Removed Duplicates

I checked for duplicate admissions and patient records. Any repeated rows were dropped so that each admission is unique.
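A minimal sketch of this step, using only the Python standard library. The column names and sample rows are hypothetical, and the choice of key columns is an assumption about what makes an admission a duplicate:

```python
def drop_duplicate_rows(rows, key_columns):
    """Keep the first occurrence of each key and drop later repeats."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical sample rows; the real dataset's columns may differ.
admissions = [
    {"patient_name": "Ann Lee", "date_of_admission": "2023-01-05", "doctor": "Dr. Shah"},
    {"patient_name": "Ann Lee", "date_of_admission": "2023-01-05", "doctor": "Dr. Shah"},
    {"patient_name": "Ann Lee", "date_of_admission": "2023-03-10", "doctor": "Dr. Shah"},
]

deduped = drop_duplicate_rows(
    admissions, ["patient_name", "date_of_admission", "doctor"]
)
print(len(deduped))  # 2: the repeated 2023-01-05 admission is dropped
```

The second admission on 2023-03-10 is kept because a returning patient is not a duplicate; only rows that match on every key column are removed.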

2. Standardized Categorical Data

Columns like gender, admission_type, and test_results had inconsistencies (e.g., “male” vs. “Male”). I standardized these values so that each category is written one way throughout the dataset.
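One way this standardization could look, as a sketch. The title-case convention and the alias table are assumptions for illustration; the source only states that values such as “male” and “Male” were made uniform:

```python
def standardize_category(value, aliases=None):
    """Trim whitespace, normalize case, and map known aliases to one label."""
    # Collapse runs of whitespace, strip the ends, then apply title case.
    cleaned = " ".join(str(value).split()).title()
    if aliases:
        return aliases.get(cleaned, cleaned)
    return cleaned


# Hypothetical alias table; extend it to match what the data contains.
GENDER_ALIASES = {"M": "Male", "F": "Female"}

raw_values = ["male", " Male ", "MALE", "f", "Female"]
standardized = [standardize_category(v, GENDER_ALIASES) for v in raw_values]
print(standardized)  # ['Male', 'Male', 'Male', 'Female', 'Female']
```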

3. Anonymized Patient Names

To reduce PHI exposure in line with HIPAA, patient names were replaced with SHA-256 hashes. Because the same name always produces the same hash, patients can still be tracked consistently across admissions without storing the name itself.
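A sketch of name hashing with Python's hashlib. The normalization step and the optional salt parameter are assumptions added for illustration and are not described in the original process. A secret salt is worth considering because unsalted hashes of common names can be guessed by hashing a list of candidate names:

```python
import hashlib


def hash_patient_name(name, salt=""):
    """Return the SHA-256 hex digest of a normalized patient name.

    `salt` is a hypothetical optional secret prepended before hashing.
    """
    # Normalize so that spacing and casing differences map to the same hash.
    normalized = " ".join(name.split()).lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


hash_a = hash_patient_name("Ann Lee")
hash_b = hash_patient_name("  ann   LEE ")
print(hash_a == hash_b)  # True: same patient, same hash
print(len(hash_a))       # 64 hex characters
```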

4. Created Unique IDs

I generated surrogate IDs:

patient_id → consistent across multiple admissions

doctor_id → unique per doctor

admission_id → unique per admission
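The three IDs above could be generated roughly as follows. This is a sketch under assumptions: the patient_hash and doctor column names and the integer, first-seen numbering scheme are hypothetical, since the source does not say how the IDs were built:

```python
def assign_surrogate_ids(values, start=1):
    """Map each distinct value to a stable integer ID in first-seen order."""
    ids = {}
    for value in values:
        if value not in ids:
            ids[value] = start + len(ids)
    return ids


# Hypothetical rows: the same patient hash appears in two admissions.
rows = [
    {"patient_hash": "a1", "doctor": "Dr. Shah"},
    {"patient_hash": "b2", "doctor": "Dr. Kim"},
    {"patient_hash": "a1", "doctor": "Dr. Kim"},
]

patient_ids = assign_surrogate_ids(r["patient_hash"] for r in rows)
doctor_ids = assign_surrogate_ids(r["doctor"] for r in rows)

for admission_id, row in enumerate(rows, start=1):
    row["admission_id"] = admission_id                     # unique per row
    row["patient_id"] = patient_ids[row["patient_hash"]]   # shared across visits
    row["doctor_id"] = doctor_ids[row["doctor"]]           # one per doctor

print([r["patient_id"] for r in rows])  # [1, 2, 1]
```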

5. Calculated Length of Stay (LOS)

LOS was derived from date_of_admission and discharge_date. This metric will be central for operational and cost analysis.
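As an illustration, LOS can be computed as the difference in whole days between the two dates. The ISO date format (YYYY-MM-DD) and the convention that a same-day discharge counts as 0 days are assumptions:

```python
from datetime import date


def length_of_stay(date_of_admission, discharge_date):
    """Return the number of whole days between admission and discharge."""
    admitted = date.fromisoformat(date_of_admission)
    discharged = date.fromisoformat(discharge_date)
    return (discharged - admitted).days


los = length_of_stay("2023-01-05", "2023-01-09")
print(los)  # 4
```

A negative result means the discharge date precedes the admission date, which is worth flagging during the outlier step.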

6. Added Time Columns

I extracted additional time features:

admit_year and admit_month

admit_day_of_week and discharge_day_of_week

These will support trend and scheduling analysis.
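A sketch of how these columns might be derived with the standard datetime module. The ISO date format and the choice to store weekdays as English names are assumptions:

```python
from datetime import date

# Index matches date.weekday(), where Monday is 0.
DAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def time_features(date_of_admission, discharge_date):
    """Derive year, month, and weekday columns from the two date strings."""
    admitted = date.fromisoformat(date_of_admission)
    discharged = date.fromisoformat(discharge_date)
    return {
        "admit_year": admitted.year,
        "admit_month": admitted.month,
        "admit_day_of_week": DAY_NAMES[admitted.weekday()],
        "discharge_day_of_week": DAY_NAMES[discharged.weekday()],
    }


features = time_features("2023-01-05", "2023-01-09")
print(features["admit_day_of_week"])      # Thursday
print(features["discharge_day_of_week"])  # Monday
```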

7. Cleaned Outliers

I scanned for implausible values such as negative ages, unrealistic billing amounts, and LOS outliers. These were corrected where possible or flagged for review.
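A minimal sketch of the flagging part of this step. The column names (age, billing_amount, length_of_stay) and the thresholds are hypothetical; appropriate limits depend on the dataset and should be set with domain input:

```python
def flag_outliers(row, max_age=120, max_los_days=365):
    """Return a list of reasons a row looks implausible (empty if none)."""
    flags = []
    if not 0 <= row["age"] <= max_age:
        flags.append("age_out_of_range")
    if row["billing_amount"] < 0:
        flags.append("negative_billing")
    if not 0 <= row["length_of_stay"] <= max_los_days:
        flags.append("los_out_of_range")
    return flags


# Hypothetical rows for illustration.
ok_row = {"age": 45, "billing_amount": 1200.0, "length_of_stay": 3}
bad_row = {"age": -2, "billing_amount": -50.0, "length_of_stay": 900}

print(flag_outliers(ok_row))   # []
print(flag_outliers(bad_row))  # all three flags
```

Returning the reasons rather than dropping the row keeps the decision to correct or exclude a record with a reviewer.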

8. Dropped Unneeded Columns

After anonymizing, the original patient_name column was dropped to further reduce PHI risk.
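For completeness, a small sketch of dropping the column once the hash is in place. The patient_hash column name and the placeholder hash value are hypothetical:

```python
def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    return [
        {key: value for key, value in row.items() if key not in columns}
        for row in rows
    ]


# Hypothetical row that still carries the raw name next to its hash.
records = [{"patient_name": "Ann Lee", "patient_hash": "a1", "age": 45}]
cleaned_records = drop_columns(records, {"patient_name"})
print(cleaned_records)  # [{'patient_hash': 'a1', 'age': 45}]
```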

Conclusion

After cleaning, the dataset is now:

Free of duplicates and inconsistencies

Aligned with HIPAA privacy goals, with patient names hashed and the original name column removed

Structured with unique IDs for easier modeling

Enriched with LOS and time-based columns

The data is ready for star schema modeling and deeper SQL analysis.