# 🧼 Data Cleaning: CMS Hospital General Information

In this notebook, we clean and prepare the **CMS Hospital General Information dataset** — a publicly available dataset from the Centers for Medicare & Medicaid Services (CMS): 🔗 [CMS Hospital Dataset](https://data.cms.gov/provider-data/dataset/xubh-q36u)

The dataset contains detailed information about U.S. hospitals, including names, locations, types, ownership structures, emergency services, and various performance metrics (e.g. mortality, safety, readmission, patient experience, and timely care).

### 🧭 Cleaning Objectives
Our goal is to prepare a clean, analysis-ready version of this dataset for PostgreSQL import and SQL-based analysis. This includes:
- Loading the raw CMS CSV into a pandas DataFrame
- Previewing column names, structure, and sample data
- Dropping footnote and metadata columns with high missingness
- Inspecting and converting numeric-like string columns
- Verifying uniqueness of the primary key (`Facility ID`)
- Saving the final cleaned dataset to disk

---
### 📦 Import Libraries

We begin by importing the necessary libraries for data handling and cleaning.  
This project uses `pandas` for data manipulation and `numpy` for numeric operations.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

### 📂 Load Raw CSV Data

We load the full raw dataset exported from CMS as a CSV file.  
The dataset is located at `../data/raw/cms_hospital_general_info.csv`.

In [2]:
# Load raw dataset
df = pd.read_csv("../data/raw/cms_hospital_general_info.csv")
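
One thing worth checking at load time: with default type inference, `read_csv` parses all-digit columns as integers, which drops leading zeros from values such as ZIP codes. Whether this affects the real file depends on its contents. The sketch below uses a made-up two-row sample (not the CMS file) and assumes the identifier-like columns should stay as text; `sample_csv` and `id_like` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical two-row sample, for illustration only (not real CMS data).
sample_csv = io.StringIO(
    "Facility ID,ZIP Code,State\n"
    "010001,01001,MA\n"
    "010005,35957,AL\n"
)

# Force identifier-like columns to strings so leading zeros survive.
id_like = {"Facility ID": str, "ZIP Code": str}
sample = pd.read_csv(sample_csv, dtype=id_like)

print(sample["ZIP Code"].tolist())
```

The same `dtype` mapping could be passed to the `read_csv` call above if the ZIP codes turn out to be parsed as integers.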

### 🧾 Initial Dataset Inspection

This section previews the dataset structure, column names, sample rows, and highlights missing values to inform the cleaning strategy.

In [3]:
# Show dataset dimensions and column names
print("Rows:", df.shape[0])
print("Columns:", df.shape[1])
print("Column names:")
print(df.columns.tolist())

# Preview first few rows (transposed)
print("\nSample data:")
print(df.head(3).T)

# Display columns with missing values
missing_pct = df.isnull().mean().sort_values(ascending=False)
print("\nMissing value summary:")
print(missing_pct[missing_pct > 0])

Rows: 5384
Columns: 38
Column names:
['Facility ID', 'Facility Name', 'Address', 'City/Town', 'State', 'ZIP Code', 'County/Parish', 'Telephone Number', 'Hospital Type', 'Hospital Ownership', 'Emergency Services', 'Meets criteria for birthing friendly designation', 'Hospital overall rating', 'Hospital overall rating footnote', 'MORT Group Measure Count', 'Count of Facility MORT Measures', 'Count of MORT Measures Better', 'Count of MORT Measures No Different', 'Count of MORT Measures Worse', 'MORT Group Footnote', 'Safety Group Measure Count', 'Count of Facility Safety Measures', 'Count of Safety Measures Better', 'Count of Safety Measures No Different', 'Count of Safety Measures Worse', 'Safety Group Footnote', 'READM Group Measure Count', 'Count of Facility READM Measures', 'Count of READM Measures Better', 'Count of READM Measures No Different', 'Count of READM Measures Worse', 'READM Group Footnote', 'Pt Exp Group Measure Count', 'Count of Facility Pt Exp Measures', 'Pt Exp Group Foo

### ✅ Inspection Summary

The dataset contains **5,384 rows** and **38 columns**. A sample of the data shows detailed hospital information, including facility names, locations, ownership types, service counts, and performance metrics.

Several columns contain a high percentage of missing values, notably the various `Footnote` fields and `Meets criteria for birthing friendly designation`. The footnote columns are dropped in the next step because they serve only as metadata; the birthing-friendly column is kept, so its missing values remain in the cleaned dataset.

---
### 🧹 Remove Footnote Columns with High Missingness

Based on the inspection above, several `Footnote` columns contain more than 50% missing values and serve only as metadata (e.g., `"TE Group Footnote"`, `"Hospital overall rating footnote"`). These columns are not useful for downstream analysis and are dropped.

In [4]:
# Drop footnote columns that contain mostly text metadata or have high missingness
cols_to_drop = [
    'TE Group Footnote',
    'READM Group Footnote',
    'MORT Group Footnote',
    'Safety Group Footnote',
    'Pt Exp Group Footnote',
    'Hospital overall rating footnote'
]
df.drop(columns=cols_to_drop, inplace=True)
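
As an alternative to a hand-written list, the columns to drop could be selected by name pattern and missingness. This is only a sketch on a toy frame with made-up values; the 50% threshold mirrors the criterion described above, and `toy`, `to_drop`, and `trimmed` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Facility ID": ["010001", "010005", "010006", "010007"],
    "MORT Group Footnote": [np.nan, np.nan, "5", np.nan],
    "Hospital overall rating footnote": [np.nan, "16", np.nan, np.nan],
    "Hospital overall rating": ["3", "Not Available", "4", "2"],
})

# Share of missing values per column.
missing_share = toy.isna().mean()

# Columns whose name mentions "footnote" (case-insensitive) and that are
# more than 50% missing.
is_footnote = toy.columns.str.contains("footnote", case=False)
is_sparse = (missing_share > 0.5).to_numpy()
to_drop = toy.columns[is_footnote & is_sparse].tolist()

trimmed = toy.drop(columns=to_drop)
print(to_drop)
```

An explicit list, as used in this notebook, is easier to audit; the pattern-based version is mainly useful if the column set changes between CMS releases.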

### 🔎 Preview Unique Values in Categorical Columns

Before we proceed to type conversions or encoding, we inspect the object (string) columns to understand the number and type of unique values. This helps identify columns that may need standardization, encoding, or manual correction (e.g. typos or inconsistent labels).

In [5]:
# Preview unique values in object columns (show the first 10 distinct values only)
for col in df.select_dtypes(include='object').columns:
    unique_vals = df[col].dropna().unique()
    print(f"\nColumn: {col} — {len(unique_vals)} unique values")
    print(pd.Series(unique_vals).head(10).to_list())


Column: Facility ID — 5384 unique values
['010001', '010005', '010006', '010007', '010008', '010011', '010012', '010016', '010018', '010019']

Column: Facility Name — 5257 unique values
['SOUTHEAST HEALTH MEDICAL CENTER', 'MARSHALL MEDICAL CENTERS', 'NORTH ALABAMA MEDICAL CENTER', 'MIZELL MEMORIAL HOSPITAL', 'CRENSHAW COMMUNITY HOSPITAL', "ST. VINCENT'S EAST", 'DEKALB REGIONAL MEDICAL CENTER', 'SHELBY BAPTIST MEDICAL CENTER', 'CALLAHAN EYE HOSPITAL', 'HELEN KELLER HOSPITAL']

Column: Address — 5355 unique values
['1108 ROSS CLARK CIRCLE', '2505 U S HIGHWAY 431 NORTH', '1701 VETERANS DRIVE', '702 N MAIN ST', '101 HOSPITAL CIRCLE', '50 MEDICAL PARK EAST DRIVE', '200 MED CENTER DRIVE', '1000 FIRST STREET NORTH', '1720 UNIVERSITY BLVD STE 305', '1300 SOUTH MONTGOMERY AVENUE']

Column: City/Town — 3028 unique values
['DOTHAN', 'BOAZ', 'FLORENCE', 'OPP', 'LUVERNE', 'BIRMINGHAM', 'FORT PAYNE', 'ALABASTER', 'SHEFFIELD', 'OZARK']

Column: State — 56 unique values
['AL', 'AK', 'AZ', 'AR', 'CA',

### 🔍 Summary of Unique Values in Object Columns

We explored the distinct values in all object-type columns to assess their cardinality and content.

Key takeaways:

- **ID, Address, and Contact Info** columns (e.g., `Facility ID`, `Address`, `Telephone Number`) have very high cardinality, as expected.
- `Meets criteria for birthing friendly designation` has only a single non-missing value (`'Y'`). It is kept in the dataset, and its missing entries are left as-is.
- The presence of `"Not Available"` values in several columns indicates a need for cleaning and type conversion.
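
To see which columns are affected by the placeholder, the occurrences of `"Not Available"` can be counted per column. A minimal sketch on a toy frame with made-up values (`toy` and `placeholder_counts` are illustrative names):

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Hospital overall rating": ["3", "Not Available", "4"],
    "MORT Group Measure Count": ["7", "7", "Not Available"],
    "State": ["AL", "AL", "AK"],
})

# Count the placeholder string in each column; keep columns where it occurs.
placeholder_counts = (toy == "Not Available").sum()
placeholder_counts = placeholder_counts[placeholder_counts > 0]
print(placeholder_counts)
```

Columns that appear in this result are candidates for the numeric conversion step.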

---

### 🔢 Convert Numeric-Like Columns to Proper Data Types

Performance-related fields contain numeric values mixed with `"Not Available"` strings, so pandas reads them as `object` columns instead of numbers.

Actions:

- Replace `"Not Available"` with `NaN` to standardize missing data.
- Convert the affected columns to numeric types (`float64`, since `NaN` cannot be stored in a plain integer column).

This prepares the dataset for accurate statistical operations and further analysis.

In [6]:
# Replace "Not Available" with NaN and convert columns to numeric
num_cols = [
    'Hospital overall rating',
    'MORT Group Measure Count',
    'Count of Facility MORT Measures',
    'Count of MORT Measures Better',
    'Count of MORT Measures No Different',
    'Count of MORT Measures Worse',
    'Safety Group Measure Count',
    'Count of Facility Safety Measures',
    'Count of Safety Measures Better',
    'Count of Safety Measures No Different',
    'Count of Safety Measures Worse',
    'READM Group Measure Count',
    'Count of Facility READM Measures',
    'Count of READM Measures Better',
    'Count of READM Measures No Different',
    'Count of READM Measures Worse',
    'Pt Exp Group Measure Count',
    'Count of Facility Pt Exp Measures',
    'TE Group Measure Count',
    'Count of Facility TE Measures'
]

df[num_cols] = df[num_cols].replace("Not Available", np.nan).apply(pd.to_numeric)
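
The effect of this conversion can be seen on a small example. The sketch below uses made-up values and also shows an optional extra step that is not part of this notebook: casting to pandas' nullable `Int64` dtype. Float columns are written to CSV as `3.0`, which may matter if the target PostgreSQL columns are declared as integers; `as_int` and `out` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy column standing in for a ratings field (values are made up).
toy = pd.DataFrame({"Hospital overall rating": ["3", "Not Available", "5"]})

# Same pattern as above: placeholder -> NaN, then parse as numbers.
as_float = pd.to_numeric(
    toy["Hospital overall rating"].replace("Not Available", np.nan)
)

# Optional extra step (an assumption, not part of this notebook):
# the nullable integer dtype stores whole numbers alongside missing values.
as_int = as_float.astype("Int64")

# Compare how both versions are written to CSV.
out = pd.DataFrame({"id": ["a", "b", "c"], "as_float": as_float, "as_int": as_int})
csv_text = out.to_csv(index=False)
print(csv_text)
```

In the CSV text, the float column shows values like `3.0` while the `Int64` column shows `3`; missing values are written as empty fields in both cases.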

### 🧪 Final Data Diagnostics

Before exporting, we run a final diagnostic to verify data integrity:

- Check the shape of the dataset
- Review remaining missing values
- Identify any duplicate rows
- Inspect column data types, especially numeric fields

This confirms that the dataset is clean, consistent, and ready for import into PostgreSQL for SQL querying.

In [7]:
# Final check for missing values, duplicates, and data types
print(f"\nShape: {df.shape}")

missing = df.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
if not missing.empty:
    print("Missing values:")
    print(missing)
else:
    print("No missing values")

duplicates = df.duplicated().sum()
print(f"{'Warning:' if duplicates else 'No'} duplicate rows: {duplicates}")

print("\nColumn type summary:")
print(df.dtypes.value_counts())
print(df.dtypes.sort_index())

print("\nSample numeric column types:")
print(df[num_cols].dtypes)


Shape: (5384, 32)
Missing values:
Meets criteria for birthing friendly designation    3154
Hospital overall rating                             2572
Count of Facility Pt Exp Measures                   2190
Count of Facility Safety Measures                   1989
Count of Safety Measures Worse                      1989
Count of Safety Measures Better                     1989
Count of Safety Measures No Different               1989
Count of MORT Measures Worse                        1872
Count of MORT Measures No Different                 1872
Count of MORT Measures Better                       1872
Count of Facility MORT Measures                     1872
Count of Facility READM Measures                    1088
Count of READM Measures Better                      1088
Count of READM Measures No Different                1088
Count of READM Measures Worse                       1088
Count of Facility TE Measures                        894
Safety Group Measure Count                           

### 🧾 Diagnostics Summary

The final inspection confirms:

- The dataset contains **5,384 rows** and **32 columns**
- No duplicate rows were found
- All numeric-like fields have been successfully converted to `float64`
- A few `object`-type columns remain, such as IDs, names, addresses, and categorical values
- **Missing values** are still present in many performance measure columns, as well as in the birthing designation and overall hospital rating. This is expected, as not all hospitals report every metric.

The dataset is now cleaned and structured, ready to be imported into a relational database for SQL analysis and dashboarding.

---
### 🔑 Check for Primary Key

We check whether the `Facility ID` column contains only unique values. If it does, it can serve as a valid **primary key** when importing the cleaned dataset into a relational database.

In [8]:
# Check uniqueness of 'Facility ID' to determine if it can serve as a primary key
is_unique = df['Facility ID'].is_unique
print("'Facility ID' is unique and can be used as primary key." if is_unique else "'Facility ID' is NOT unique — investigate further.")

'Facility ID' is unique and can be used as primary key.
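
A primary key must also be free of missing values, and `is_unique` alone does not check that (a single missing ID still counts as unique). The following sketch, on a toy frame with made-up rows, checks both conditions and lists the offending rows; `n_null`, `dupes`, and `is_valid_pk` are illustrative names.

```python
import pandas as pd

# Toy frame with one repeated ID and one missing ID (values are made up).
toy = pd.DataFrame({
    "Facility ID": ["010001", "010005", "010005", None],
    "Facility Name": ["A", "B", "B (duplicate)", "C"],
})

ids = toy["Facility ID"]
n_null = int(ids.isna().sum())

# keep=False marks every row that shares an ID, not just the repeats.
dupes = toy[ids.duplicated(keep=False) & ids.notna()]

is_valid_pk = n_null == 0 and dupes.empty
print(f"null IDs: {n_null}, rows with repeated IDs: {len(dupes)}, valid key: {is_valid_pk}")
```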


### 💾 Save Cleaned Dataset

The cleaned dataset is now saved to a processed CSV file, ready for loading into a PostgreSQL database for SQL analysis and dashboarding.

In [9]:
# Save cleaned dataset to CSV for PostgreSQL import
df.to_csv("../data/processed/hospital_info_clean.csv", index=False)
print("Final cleaned CSV saved to ../data/processed/hospital_info_clean.csv")

Final cleaned CSV saved to ../data/processed/hospital_info_clean.csv
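
A quick round-trip check can confirm that the saved file reads back as expected. This is a sketch on a toy frame written to a temporary directory, not the project's actual output path; the file name `toy_clean.csv` is hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame standing in for the cleaned data (values are made up).
toy = pd.DataFrame({
    "Facility ID": ["010001", "010005"],
    "Hospital overall rating": [3.0, np.nan],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.csv")  # hypothetical file name
    toy.to_csv(path, index=False)
    # Read IDs back as text; otherwise '010001' would be parsed as 10001.
    reloaded = pd.read_csv(path, dtype={"Facility ID": str})

same_shape = reloaded.shape == toy.shape
ids_match = reloaded["Facility ID"].tolist() == toy["Facility ID"].tolist()
missing_match = int(reloaded["Hospital overall rating"].isna().sum()) == 1
print(same_shape, ids_match, missing_match)
```

Missing values are written as empty fields, which PostgreSQL's `COPY` in CSV format reads as `NULL` by default.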


---

## ✅ Cleaning Complete

The CMS Hospital General Information dataset has been fully cleaned and saved in a PostgreSQL-friendly format.  
This cleaned version will serve as the foundation for structured SQL analysis and stakeholder-focused dashboarding in the next phases of the project.