# 🧼📊 Clean + Explore: IMLS Public Library Survey (FY 2022)

This notebook loads the raw FY 2022 IMLS Public Library Survey file, cleans the key fields, and explores library visits and program counts across U.S. public libraries.

> ⚠️ **Note for Colab users**: First upload the raw file `PLS_FY22_AE_pud22i.csv` using the file upload button on the left sidebar.


In [None]:
# 📦 Uncomment the next line to install dependencies if they are missing
# %pip install pandas matplotlib seaborn

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# 📂 Load the raw data (uploaded in root directory for Colab users)
df_raw = pd.read_csv("PLS_FY22_AE_pud22i.csv", encoding="ISO-8859-1")
df_raw.head()
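
Before renaming, it can help to confirm that the loaded file actually contains the columns used in the cleaning step. The sketch below is one way to do that; `missing_columns` is a hypothetical helper (not part of pandas), and `sample` is stand-in data used only for illustration.

```python
import pandas as pd

# Raw column names this notebook relies on in the cleaning step
REQUIRED_COLUMNS = ["LIBNAME", "STABR", "VISITS", "TOTPRO", "TOTCIR"]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `frame` (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Stand-in data for illustration only; in the notebook, pass df_raw instead
sample = pd.DataFrame({"LIBNAME": ["Example Library"], "STABR": ["XX"], "VISITS": [100]})
print(missing_columns(sample))
```

An empty list means every expected column is present; anything else points to a different file version or a renamed field.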


## 🔧 Step 1: Clean and Prepare Key Fields

In [None]:
df = df_raw.rename(columns={
    "LIBNAME": "Library Name",
    "STABR": "State",
    "VISITS": "Visits",
    "TOTPRO": "Total Programs",
    "TOTCIR": "Circulation"
})[["Library Name", "State", "Visits", "Total Programs", "Circulation"]]

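# Replace the survey's negative sentinel codes with NaN so the metric columns
# stay numeric (mapping them to None can turn the columns into object dtype)
df = df.replace([-1, -3, -4, -9], float("nan"))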

# Drop rows where all key metrics are missing
df = df.dropna(subset=["Visits", "Total Programs", "Circulation"], how="all")
df.head()
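
To see what the sentinel replacement and `how="all"` do in isolation, here is a small sketch on made-up values (not taken from the IMLS file). It uses NaN as the replacement value so the columns remain numeric.

```python
import pandas as pd

# Illustrative values only, not real survey data
toy = pd.DataFrame({
    "Visits": [1200, -1, -3],
    "Total Programs": [35, -1, -3],
    "Circulation": [4000, 250, -9],
})

# Swap the negative sentinel codes for NaN so they are treated as missing
toy_clean = toy.replace([-1, -3, -4, -9], float("nan"))

# how="all" drops a row only when every listed metric is missing
toy_clean = toy_clean.dropna(
    subset=["Visits", "Total Programs", "Circulation"], how="all"
)
print(toy_clean)
```

The second row survives because it still has a circulation figure; only the third row, where all three metrics are missing, is dropped.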


In [None]:
df.isnull().sum()
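
Raw null counts are easier to compare as a share of rows. A minimal sketch on made-up data, using `isnull().mean()` to get the fraction missing per column:

```python
import pandas as pd

# Illustrative values only
toy = pd.DataFrame({
    "Visits": [1200, None, 800, None],
    "Circulation": [4000, 250, None, 90],
})

# The mean of a boolean mask is the fraction of True values, here the share missing
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```

In the notebook, the same expression can be applied to `df` directly.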

In [None]:
plt.figure(figsize=(10, 5))
df['Visits'].dropna().hist(bins=50)
plt.title("Distribution of Library Visits (FY 2022)")
plt.xlabel("Visits")
plt.ylabel("Number of Libraries")
plt.grid(True)
plt.show()
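
A histogram of raw counts can be hard to read if a few very large library systems stretch the x-axis; that is an assumption about this dataset, not something checked here. Quantiles offer a complementary view that is less sensitive to extreme values. The sketch below uses made-up numbers to show how the mean and median can diverge.

```python
import pandas as pd

# Illustrative values only: four small libraries and one very large system
visits = pd.Series([500, 1_000, 2_000, 4_000, 1_000_000], name="Visits")

# Quartiles summarize the bulk of the distribution
summary = visits.quantile([0.25, 0.5, 0.75])
print(summary)

# The mean is pulled far above the median by the single large value
print("mean:", visits.mean())
```

In the notebook, `df["Visits"].quantile([0.25, 0.5, 0.75])` would give the equivalent figures for the real data.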


In [None]:
top10 = (
    df[["Library Name", "Total Programs"]]
    .sort_values(by="Total Programs", ascending=False)
    .head(10)
    .reset_index(drop=True)
)
top10


In [None]:
df.to_csv("imls_pls_2022_cleaned.csv", index=False)
print("✅ Cleaned file saved as: imls_pls_2022_cleaned.csv")
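
After saving, a quick round trip can confirm the CSV reads back with the same shape and column names. This is a sketch on a stand-in frame written to a temporary directory; the file name `toy_cleaned.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned frame; illustrative values only
toy = pd.DataFrame({
    "Library Name": ["Example Library"],
    "State": ["XX"],
    "Visits": [1200.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    toy.to_csv(path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

print(reloaded.shape == toy.shape)
print(list(reloaded.columns) == list(toy.columns))
```

In the notebook, the same comparison could be run between `df` and `pd.read_csv("imls_pls_2022_cleaned.csv")`.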
