*** Data Cleaning Process Documentation ***
Introduction
In this project, I worked on cleaning up a healthcare dataset. The main goal was to get rid of duplicates, fix inconsistencies, ensure the data was HIPAA-compliant, and make it ready for analysis. Here’s how I went about it.

1. Removing Duplicates
I started by looking for duplicate rows in the dataset. In healthcare data, it’s common for patients to have multiple appointments, but sometimes the same appointment gets recorded multiple times, or a single patient record is split across rows.

To fix this, I checked for any duplicates and resolved inconsistencies, especially where the same patient had different values for things like age or doctor. In those cases, I kept the most frequent value or the first one, depending on what made sense.
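
As a rough illustration of this step, here is a minimal Python sketch using only the standard library. The sample rows, the column names (name, date, age, doctor), and the helper functions are all assumptions made up for the example, not the project's actual code. It drops fully identical rows, then replaces conflicting values with the most frequent one per patient, falling back to the first value seen when there is a tie.

```python
from collections import Counter

# Hypothetical sample rows; the column names are assumptions for illustration.
rows = [
    {"name": "Ann Lee", "date": "2023-01-05", "age": 34, "doctor": "Dr. Shah"},
    {"name": "Ann Lee", "date": "2023-01-05", "age": 34, "doctor": "Dr. Shah"},  # exact duplicate
    {"name": "Ann Lee", "date": "2023-02-10", "age": 43, "doctor": "Dr. Shah"},  # conflicting age
    {"name": "Ann Lee", "date": "2023-03-12", "age": 34, "doctor": "Dr. Shah"},
    {"name": "Bo Chen", "date": "2023-01-07", "age": 51, "doctor": "Dr. Ruiz"},
]


def drop_exact_duplicates(records):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def most_frequent(values):
    """Return the most common value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


def resolve_conflicts(records, key, field):
    """Set `field` to the most frequent value within each `key` group."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], []).append(rec[field])
    chosen = {k: most_frequent(v) for k, v in groups.items()}
    return [{**rec, field: chosen[rec[key]]} for rec in records]


deduped = drop_exact_duplicates(rows)
cleaned = resolve_conflicts(deduped, key="name", field="age")
```

Grouping by name is only for the example. In practice a more reliable key (such as a patient ID) would be a safer choice, since two different patients can share a name.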

2. Handling Inconsistencies in Categorical Data
Next, I took a look at some of the categorical data. Inconsistent labels split what should be one category into several, which skews any grouping or counting later on. For example, gender might be entered as "male" in one row and "Male" in another, so I standardized the entries to make them consistent across the dataset.

I also checked a few other columns like admission_type and test_results to ensure there were no typos or variations in how the data was recorded.
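
The standardization described above could look something like the following Python sketch. The alias table, the sample values, and the list of allowed admission types are hypothetical; the real dataset's categories may differ.

```python
def normalize(value):
    """Trim whitespace, collapse repeated spaces, and title-case the text."""
    return " ".join(value.split()).title()


# Hypothetical mapping from known short forms to a canonical label.
GENDER_ALIASES = {"M": "Male", "F": "Female"}


def standardize_gender(value):
    """Normalize the text, then map known aliases to their canonical label."""
    cleaned = normalize(value)
    return GENDER_ALIASES.get(cleaned, cleaned)


def unexpected_values(values, allowed):
    """Return the distinct values that are not in the allowed set (likely typos)."""
    return sorted(set(values) - set(allowed))


raw_gender = ["male", "Male ", " FEMALE", "f", "M"]
standardized = [standardize_gender(v) for v in raw_gender]

# Assumed category labels, for illustration only.
allowed_admission_types = {"Emergency", "Urgent", "Elective"}
admission_types = ["Emergency", "Urgent", "Emergancy"]
typos = unexpected_values(admission_types, allowed_admission_types)
```

Listing the values that fall outside an allowed set is a quick way to surface typos for manual review rather than silently rewriting them.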

3. Anonymizing Patient Names (HIPAA Compliance)
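HIPAA is a big deal in healthcare, and patient privacy is crucial. Since the dataset contains sensitive information, I anonymized the patient names to help comply with HIPAA regulations. Instead of storing patient names, I hashed them using the SHA-256 algorithm. SHA-256 is a one-way function, so a name can’t be read back out of its hash, and the same name always produces the same hash, which keeps the data usable. One caveat: names are easy to guess, so a plain hash can still be matched by hashing a list of candidate names. Adding a secret salt or key makes that much harder.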
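
Here is a minimal sketch of the hashing step using Python's hashlib. The name normalization and the keyed variant (HMAC with a secret key) are my own assumptions added for illustration; the project itself only states that SHA-256 was used.

```python
import hashlib
import hmac


def _normalize(name):
    """Lowercase and collapse whitespace so one person maps to one hash."""
    return " ".join(name.split()).lower()


def hash_name(name):
    """Plain SHA-256 of the normalized name, as a 64-character hex string."""
    return hashlib.sha256(_normalize(name).encode("utf-8")).hexdigest()


def keyed_hash_name(name, secret_key):
    """HMAC-SHA-256 variant: without the key, candidate names can't be tested."""
    return hmac.new(secret_key, _normalize(name).encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical values for demonstration only.
plain = hash_name("Ann Lee")
keyed = keyed_hash_name("Ann Lee", secret_key=b"replace-with-a-real-secret")
```

The plain version is deterministic, so anyone can hash a list of common names and compare the results. The keyed version keeps the same consistency across rows while requiring the secret to reproduce a hash.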

4. Generating Unique Identifiers
To make the dataset more manageable, I created unique IDs for both patients and doctors. The patient_id is now consistent across all records for the same patient, which is useful for any analysis that looks at multiple appointments for a patient. Similarly, each doctor now has a unique doctor_id.
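
One simple way to generate such IDs is to number each distinct value in order of first appearance. The sketch below assumes a hashed-name field called name_hash and an ID format like P0001; both are hypothetical choices for the example.

```python
def assign_ids(values, prefix):
    """Map each distinct value to a sequential ID, in order of first appearance."""
    ids = {}
    for value in values:
        if value not in ids:
            ids[value] = f"{prefix}{len(ids) + 1:04d}"
    return ids


# Hypothetical records; name_hash stands in for the hashed patient name.
records = [
    {"name_hash": "a1f3", "doctor": "Dr. Shah"},
    {"name_hash": "9c2e", "doctor": "Dr. Ruiz"},
    {"name_hash": "a1f3", "doctor": "Dr. Ruiz"},
]

patient_ids = assign_ids((r["name_hash"] for r in records), prefix="P")
doctor_ids = assign_ids((r["doctor"] for r in records), prefix="D")

with_ids = [
    {**r, "patient_id": patient_ids[r["name_hash"]], "doctor_id": doctor_ids[r["doctor"]]}
    for r in records
]
```

Because the mapping is built once and reused, every row for the same patient gets the same patient_id, which is what makes multi-appointment analysis straightforward.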

5. Handling Length of Stay (LOS)
Length of Stay (LOS) is an important metric, so I calculated it in SQL. Using the DATEDIFF() function, I took the difference between each patient’s admission and discharge dates, which gives the number of days they stayed in the hospital and can be used for various analyses later on.
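
The actual calculation was done in SQL, but the same logic can be cross-checked in Python. This is a small sketch assuming ISO-formatted (YYYY-MM-DD) admission and discharge dates; the function name and date format are assumptions. Note that DATEDIFF() takes different arguments depending on the SQL dialect: MySQL uses DATEDIFF(end, start) and returns days, while SQL Server uses DATEDIFF(datepart, start, end).

```python
from datetime import date


def length_of_stay(admission, discharge):
    """Number of days between admission and discharge (ISO date strings)."""
    admitted = date.fromisoformat(admission)
    discharged = date.fromisoformat(discharge)
    return (discharged - admitted).days


los_regular = length_of_stay("2023-01-05", "2023-01-09")
los_same_day = length_of_stay("2023-01-05", "2023-01-05")
los_reversed = length_of_stay("2023-01-09", "2023-01-05")  # negative: dates are swapped
```

A negative result means the discharge date comes before the admission date, which is worth flagging as a data error rather than keeping as a valid LOS.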

6. Dealing with Strange Data
I also checked for unreasonable values, such as negative ages or billing amounts that were way off. If a value was out of place, I corrected it where the right value was clear and flagged it for further review otherwise.
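
A range check like this can be sketched in a few lines of Python. The bounds used here (age between 0 and 120, billing between 0 and 1,000,000) and the field names are assumptions for illustration; sensible limits depend on the actual dataset.

```python
def flag_out_of_range(records, field, low, high):
    """Return the positions of rows whose `field` is missing or outside [low, high]."""
    flagged = []
    for position, rec in enumerate(records):
        value = rec.get(field)
        if value is None or not (low <= value <= high):
            flagged.append(position)
    return flagged


# Hypothetical rows with a few deliberately bad values.
records = [
    {"age": 34, "billing_amount": 1200.50},
    {"age": -2, "billing_amount": 980.00},      # negative age
    {"age": 57, "billing_amount": -150.00},     # negative billing amount
    {"age": 140, "billing_amount": 4300.00},    # implausible age
]

bad_age_rows = flag_out_of_range(records, "age", low=0, high=120)
bad_billing_rows = flag_out_of_range(records, "billing_amount", low=0, high=1_000_000)
```

Returning row positions instead of fixing values in place keeps the decision (correct or review) separate from the detection.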

7. Resetting the Index
The dataset originally had an index starting from 0, but I reset it so that it starts from 1. This gives the data a cleaner look, especially when it’s being used for reports or analysis.

8. Dropping Unnecessary Columns
After anonymizing the patient names and generating unique IDs, I realized that the name column wasn’t needed anymore, since patient_id already links all of a patient’s records. I dropped the name column to make the dataset cleaner and to keep identifying information out of it, in line with HIPAA.

Conclusion
At the end of the day, the dataset is in much better shape:

No duplicates, making the data more accurate.

Consistent values across columns, ensuring reliable analysis.

Patient names anonymized for HIPAA compliance.

Unique IDs for patients and doctors, which simplifies analysis.

LOS calculated in SQL for more efficient querying.

No weird data that could throw off results.

Now, the data is clean, compliant, and ready to be analyzed or used for further processing.