# CVS Health Community Access Analysis - Part 1: Data Loading and Cleaning

This notebook loads and cleans the final CVS dataset for analysis. We'll prepare the data for all subsequent analysis notebooks.


In [None]:
# import necessary libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("libraries imported successfully")


## Load the Dataset

We load the final processed dataset, which contains all merged data, including SVI scores, health indicators, and clinic counts.


In [None]:
# load the final processed dataset
df = pd.read_csv(r"C:\Users\14122\OneDrive\Desktop\cvs_heath_project\data\processed\CVS_FINAL_DATASET.csv")

# display first few rows to inspect data structure
print(f"dataset shape: {df.shape}")
print("\nfirst few rows:")
df.head()
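
Because the `COUNTY` column holds FIPS codes, one option is to read it as text at load time so leading zeros are never lost. The sketch below is a minimal illustration, not the notebook's actual load step: it uses a hypothetical two-row in-memory CSV in place of the real file, and only the column names `COUNTY` and `clinic_count` are taken from this notebook.

```python
import io
import pandas as pd

# hypothetical two-row sample standing in for the real dataset file
sample_csv = io.StringIO("COUNTY,clinic_count\n01001,3\n42003,12\n")

# read the FIPS column as text so leading zeros are preserved
sample = pd.read_csv(sample_csv, dtype={"COUNTY": str})
print(sample["COUNTY"].tolist())
```

With `dtype={"COUNTY": str}`, the later `zfill(5)` step becomes a safety net rather than the only thing restoring the leading zeros.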


## Data Cleaning

We clean the data by converting numeric-looking columns to numeric types and formatting the county FIPS code consistently, so that later calculations and merges work correctly.
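
To illustrate why formatting matters, the short sketch below shows how `pd.to_numeric` treats values that contain commas or percent signs. The sample values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# made-up values with the kind of formatting that can appear in exported tables
raw = pd.Series(["1,200", "45%", "7"])

# with errors='coerce', anything that does not parse becomes NaN without warning
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced.isna().tolist())

# stripping commas and percent signs first lets every value parse
stripped = raw.str.replace(r"[,%]", "", regex=True)
cleaned = pd.to_numeric(stripped, errors="coerce")
print(cleaned.tolist())
```

The first conversion silently loses two of the three values, which is why it helps to strip formatting characters before converting.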


In [None]:
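# convert text columns to numeric when every non-missing value parses cleanly
# (commas and percent signs are stripped first); text columns such as
# county_full and state_full are left untouched
for col in df.columns:
    if col == "COUNTY" or pd.api.types.is_numeric_dtype(df[col]):
        continue  # COUNTY is an identifier; numeric columns need no conversion
    stripped = df[col].astype(str).str.replace(r"[,%]", "", regex=True).str.strip()
    converted = pd.to_numeric(stripped, errors="coerce")
    # errors='coerce' never raises, so instead of try/except we accept the
    # conversion only if no existing value was turned into NaN
    if converted.notna().sum() == df[col].notna().sum():
        df[col] = converted

# ensure COUNTY column is a string with 5-digit FIPS code format (leading zeros)
# this is important for merging with geographic data later
df["COUNTY"] = df["COUNTY"].astype(str).str.zfill(5)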

print(f"✓ dataset cleaned: {len(df)} counties")
print(f"✓ columns processed: {len(df.columns)}")
print("\nsample of cleaned data:")
df[['county_full', 'state_full', 'clinic_count', 'svi_overall']].head()
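
A quick sanity check after cleaning is to confirm that every FIPS code is exactly five digits. The helper below is a hypothetical sketch (the name `invalid_fips` is not part of this project), shown on made-up values; in the notebook it could be called as `invalid_fips(df["COUNTY"])`.

```python
import pandas as pd

def invalid_fips(codes: pd.Series) -> pd.Series:
    """Return the entries that are not exactly five digits."""
    as_text = codes.astype(str)
    return codes[~as_text.str.fullmatch(r"\d{5}")]

# made-up sample: two valid codes, one float artifact, one missing value
sample_codes = pd.Series(["01001", "42003", "1001.0", "nan"])
bad = invalid_fips(sample_codes)
print(bad.tolist())
```

An empty result suggests the codes are ready for merging with geographic data. Entries such as `1001.0` usually mean the column was read as a float before being converted to a string.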
