# Graphs!
### Let's see about making some graphs and charts
Starting from our file of resource calculations, the following prompt produced good suggestions.

File: resources_by_hood_calc-csv.csv

Prompt: Each record in this file is a Portland neighborhood. number_of_locations is the count of homeless resource providers in that neighborhood. final_resource_calc is the sum of all the resources available in that neighborhood, which is the rest of the fields except for the two I just described and the first one, which is "neighborhood". What interesting statistics, charts and graphs would you suggest for this data?

But first I need to add homeless population counts by neighborhood, using this file:

File: IRP_Campsite_Reports_latlong-hood-zips-2024.csv

Prompt: Before we do that, I think we should add a column to our original file for "homeless_pop", calculated as the number of times each individual neighborhood appears in the attached file. If a neighborhood appears in the second file but not the first, a record should be created for it, and the rest of the values in that row can be 0. I would like the Python code for this.

Note that the code below names the new column homeless_camps rather than homeless_pop.


In [6]:
import pandas as pd
from pathlib import Path
print("Libraries imported!")

# --- INPUT PATHS (edit as needed) ---
orig_path = Path("C:/Users/Steph/local/OIT-class/project-files/datasets/processed/resources_by_hood_calc-csv.csv")  # one row per neighborhood, resource columns
irp_path  = Path("C:/Users/Steph/local/OIT-class/project-files/datasets/processed/IRP_Campsite_Reports_latlong-hood-zips-2024.csv")  # many rows; we'll count by neighborhood
out_path  = Path("C:/Users/Steph/local/OIT-class/project-files/datasets/processed/resources_by_hood_with_homeless_camps-csv.csv")

# --- Helpers ---
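def normalize_hood(series: pd.Series) -> pd.Series:
    # Use the nullable "string" dtype: astype(str) would turn missing values
    # into the literal text "nan", which dropna() can no longer detect.
    return (series.astype("string")
                  .str.strip()
                  .str.replace(r"\s+", " ", regex=True)
                  .str.title())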

# --- Load original neighborhood resources ---
df_orig = pd.read_csv(orig_path, dtype=str, engine="python")
df_orig.columns = df_orig.columns.str.strip()

hood_col_orig = next((c for c in df_orig.columns if c.lower() == "neighborhood"), None)
if hood_col_orig is None:
    raise KeyError("Could not find a 'neighborhood' column in the original file.")
if hood_col_orig != "neighborhood":
    df_orig = df_orig.rename(columns={hood_col_orig: "neighborhood"})

df_orig["neighborhood"] = normalize_hood(df_orig["neighborhood"])

# --- Load IRP file and count rows per neighborhood -> homeless_camps ---
df_irp = pd.read_csv(irp_path, dtype=str, engine="python")
df_irp.columns = df_irp.columns.str.strip()

candidates = [c for c in df_irp.columns if c.lower() in {"neighborhood", "name", "neighborhood_name", "hood"}]
if not candidates:
    raise KeyError("Could not locate a neighborhood column in the IRP file.")
hood_col_irp = next((c for c in candidates if c.lower() == "neighborhood"), candidates[0])

df_irp["_neighborhood"] = normalize_hood(df_irp[hood_col_irp])
df_irp = df_irp.dropna(subset=["_neighborhood"])

homeless_counts = (
    df_irp.groupby("_neighborhood").size().rename("homeless_camps").reset_index()
          .rename(columns={"_neighborhood": "neighborhood"})
)

# --- Merge: include neighborhoods present in either file ---
df_merged = pd.merge(df_orig, homeless_counts, on="neighborhood", how="outer")

# Coerce all non-neighborhood columns to numeric ints (fill missing with 0)
for col in df_merged.columns:
    if col != "neighborhood":
        df_merged[col] = pd.to_numeric(df_merged[col], errors="coerce").fillna(0).astype(int)

# Save
df_merged.to_csv(out_path, index=False)
print(f"Saved with homeless_camps added: {out_path}")

# Optional peek
print(df_merged.head())



Libraries imported!
Saved with homeless_camps added: C:\Users\Steph\local\OIT-class\project-files\datasets\processed\resources_by_hood_with_homeless_camps-csv.csv
              neighborhood  library_open  temporary_shelter  day_center  \
0                  Alameda             0                  0           0   
1              Arbor Lodge             0                  0           0   
2  Ardenwald-Johnson Creek             1                  0           1   
3            Argay Terrace             0                  0           0   
4        Arlington Heights             0                  0           0   

   coord_reentry  group_therapy  showers  meals  food_box  laundry  \
0              0              0        0      0         0        0   
1              0              0        0      0         0        0   
2              0              0        1      0         1        0   
3              0              0        0      0         0        0   
4              0              0     
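
A quick way to see which neighborhoods the outer merge added is the `indicator=True` option of `pd.merge`, which records whether each row came from the left frame, the right frame, or both. The sketch below uses small toy frames standing in for `df_orig` and `homeless_counts`; the counts and the name "Example Heights" are made up for illustration and do not come from the project files.

```python
import pandas as pd

# Toy stand-ins for df_orig and homeless_counts (values are made up)
resources = pd.DataFrame({
    "neighborhood": ["Alameda", "Arbor Lodge"],
    "number_of_locations": [0, 2],
})
camps = pd.DataFrame({
    "neighborhood": ["Arbor Lodge", "Example Heights"],
    "homeless_camps": [5, 3],
})

# indicator=True adds a "_merge" column: "left_only", "right_only", or "both"
audit = pd.merge(resources, camps, on="neighborhood", how="outer", indicator=True)

only_in_camps = audit.loc[audit["_merge"] == "right_only", "neighborhood"].tolist()
only_in_resources = audit.loc[audit["_merge"] == "left_only", "neighborhood"].tolist()

print("Only in camp reports:", only_in_camps)
print("Only in resources file:", only_in_resources)
```

Rows marked "right_only" are the ones that get a new record with every resource column filled with 0, so listing them is a cheap check for spelling mismatches between the two files.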

Prompt: I see that there is at least one null in the neighborhood field. How would I remove records with a null neighborhood?

Prompt: Each record in this file is a Portland neighborhood, with the number of homeless camps reported in 2024 as homeless_camps. number_of_locations is the count of homeless resource providers in that neighborhood. final_resource_calc is the sum of all the resources available in that neighborhood, which is the rest of the fields except for the two I just described and the first one, which is "neighborhood". What interesting statistics, charts and graphs would you suggest for this data?

In [10]:
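# Drop records with a null neighborhood. Also catch the placeholder text
# ("Nan", "") left behind if missing values were cast with astype(str).
hood = df_merged["neighborhood"]
is_missing = hood.isna() | hood.astype(str).str.strip().str.lower().isin(["", "nan", "<na>"])
df_nonulls = df_merged[~is_missing].copy()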

# Save
df_nonulls.to_csv(out_path, index=False)
print(f"Saved with homeless_camps and no null neighborhood records: {out_path}")


Saved with homeless_camps and no null neighborhood records: C:\Users\Steph\local\OIT-class\project-files\datasets\processed\resources_by_hood_with_homeless_camps-csv.csv
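
One thing to watch for when dropping null neighborhoods: if a column has been cast with `astype(str)`, missing values become the text "nan" and `dropna` no longer sees them. The following is a minimal sketch on a made-up toy column (not the project data) comparing `astype(str)` with the nullable `"string"` dtype.

```python
import numpy as np
import pandas as pd

# Toy column with one missing neighborhood (values are made up)
raw = pd.Series(["alameda", np.nan, "arbor lodge"])

# astype(str) turns the missing value into the text "nan", then "Nan" after title()
as_text = raw.astype(str).str.strip().str.title()

# The nullable "string" dtype keeps the missing value as <NA>
as_string = raw.astype("string").str.strip().str.title()

print("Nulls seen after astype(str):     ", int(as_text.isna().sum()))
print("Nulls seen after astype('string'):", int(as_string.isna().sum()))
print(as_text.tolist())
```

If the first count is lower than expected, checking the column for the literal value "Nan" is a reasonable next step.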


In [None]:
# following the output from ChatGPT
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print("libraries imported!")

# ---------- Load & prep ----------
# Alternatively, reload the saved file: df = pd.read_csv(out_path)
df = df_nonulls.copy()  # use the frame with null neighborhoods removed
df.columns = df.columns.str.strip()

# Ensure numeric
for c in ["homeless_camps","number_of_locations","final_resource_calc"]:
    df[c] = pd.to_numeric(df[c], errors="coerce").fillna(0)

# Derived metrics
df["camps_per_location"]      = df["homeless_camps"] / df["number_of_locations"].replace(0, np.nan)
df["resources_per_location"]  = df["final_resource_calc"] / df["number_of_locations"].replace(0, np.nan)
df["resources_per_camp"]      = df["final_resource_calc"] / df["homeless_camps"].replace(0, np.nan)

# ---------- 1) Summary stats ----------
summary = {
    "total_homeless_camps": int(df["homeless_camps"].sum()),
    "total_locations": int(df["number_of_locations"].sum()),
    "total_resources": int(df["final_resource_calc"].sum()),
    "median_camps_per_location": float(df["camps_per_location"].median(skipna=True)),
    "median_resources_per_location": float(df["resources_per_location"].median(skipna=True)),
    "median_resources_per_camp": float(df["resources_per_camp"].median(skipna=True)),
}
print(summary)

# ---------- 2) Top/Bottom bar charts ----------
def plot_sorted_bar(series, title, xlabel):
    s = series.sort_values(ascending=False)
    # Grow the figure height with the number of bars so every label stays readable
    plt.figure(figsize=(10, max(6, 0.25 * len(s))))
    plt.barh(s.index, s.values)
    plt.gca().invert_yaxis()
    plt.title(title)
    plt.xlabel(xlabel)
    plt.tight_layout()
    plt.show()

print("Homeless camps by neighborhood.")
plot_sorted_bar(df.set_index("neighborhood")["homeless_camps"], "Homeless camps by neighborhood", "Count")
print("Total resources by neighborhood.")
plot_sorted_bar(df.set_index("neighborhood")["final_resource_calc"], "Total resources by neighborhood", "Count")

# ---------- 3) Scatter: camps vs resources (+ fit) ----------
x = df["homeless_camps"].values
y = df["final_resource_calc"].values
mask = ~(np.isnan(x) | np.isnan(y))
xv, yv = x[mask], y[mask]

# Fit line
if len(xv) >= 2:
    m, b = np.polyfit(xv, yv, 1)
    r = np.corrcoef(xv, yv)[0,1]
else:
    m, b, r = 0, 0, np.nan

plt.figure(figsize=(8, 6))
plt.scatter(df["homeless_camps"], df["final_resource_calc"])
if not np.isnan(r):
    xs = np.linspace(df["homeless_camps"].min(), df["homeless_camps"].max(), 100)
    plt.plot(xs, m*xs + b)
    plt.title(f"Resources vs Camps (Pearson r={r:.2f})")
else:
    plt.title("Resources vs Camps")
plt.xlabel("Homeless camps")
plt.ylabel("Total resources")
plt.tight_layout()
plt.show()

# Outliers by residual
if not np.isnan(r):
    df["residual"] = df["final_resource_calc"] - (m*df["homeless_camps"] + b)
    out_high_need_low_res = df.nsmallest(5, "residual")[["neighborhood","homeless_camps","final_resource_calc","residual"]]
    out_low_need_high_res = df.nlargest(5, "residual")[["neighborhood","homeless_camps","final_resource_calc","residual"]]
    print("\nHigh need / low resources (most negative residuals):\n", out_high_need_low_res)
    print("\nLow need / high resources (most positive residuals):\n", out_low_need_high_res)

# ---------- 4) Pareto curve & Gini ----------
def pareto_and_gini(series, title):
    s = series.sort_values(ascending=False).values
    cum = np.cumsum(s)
    cum_share = cum / cum[-1] if cum[-1] > 0 else cum
    x = np.arange(1, len(s)+1) / len(s)
    # Pareto plot
    plt.figure(figsize=(8, 5))
    plt.plot(x, cum_share)
    plt.plot([0,1],[0,1])  # equality line
    plt.title(f"Pareto: {title}")
    plt.xlabel("Share of neighborhoods")
    plt.ylabel("Cumulative share")
    plt.tight_layout()
    plt.show()
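    # Gini = 1 - 2 * area under Lorenz curve
    lorenz = np.insert(np.cumsum(np.sort(series.values)), 0, 0)
    lorenz = lorenz / lorenz[-1] if lorenz[-1] > 0 else lorenz
    x2 = np.linspace(0, 1, len(lorenz))
    # NumPy 2.0 introduced np.trapezoid and deprecated np.trapz; use whichever exists
    trapezoid = getattr(np, "trapezoid", None) or np.trapz
    gini = 1 - 2 * trapezoid(lorenz, x2)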
    print(f"Gini for {title}: {gini:.3f}")

pareto_and_gini(df.set_index("neighborhood")["homeless_camps"], "Homeless camps")
pareto_and_gini(df.set_index("neighborhood")["final_resource_calc"], "Resources")

# ---------- 5) Quadrant chart (medians) ----------
med_camps = df["homeless_camps"].median()
med_resources = df["final_resource_calc"].median()

plt.figure(figsize=(8, 6))
plt.scatter(df["homeless_camps"], df["final_resource_calc"])
plt.axvline(med_camps)
plt.axhline(med_resources)

# Optional labels for extreme points (added before plt.show() so they land on this figure)
top = df.nlargest(5, "homeless_camps")
for _, r_ in top.iterrows():
    plt.text(r_["homeless_camps"], r_["final_resource_calc"], r_["neighborhood"])

plt.title("Quadrants: Need vs Resources (medians)")
plt.xlabel("Homeless camps (need)")
plt.ylabel("Total resources")
plt.tight_layout()
plt.show()

# ---------- 6) Rates distributions ----------
def plot_hist(series, title, xlabel):
    vals = series.replace([np.inf, -np.inf], np.nan).dropna()
    plt.figure(figsize=(8,5))
    plt.hist(vals, bins=15)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.tight_layout()
    plt.show()

plot_hist(df["resources_per_location"], "Resources per location", "resources/location")
plot_hist(df["resources_per_camp"], "Resources per camp", "resources/camp")
plot_hist(df["camps_per_location"], "Camps per location", "camps/location")
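
To build some intuition for the Gini values printed in step 4, the same "1 - 2 * area under the Lorenz curve" calculation can be checked on toy inputs where the answer is known. This is a sketch under assumptions: the helper name `gini` and the toy values are illustrative and not part of the notebook, and the trapezoid rule is written out by hand so the snippet does not depend on a particular NumPy version.

```python
import numpy as np

def gini(values):
    """Gini coefficient as 1 - 2 * (area under the Lorenz curve)."""
    v = np.sort(np.asarray(values, dtype=float))
    total = v.sum()
    if total <= 0:
        return float("nan")  # undefined when there is nothing to distribute
    # Lorenz curve: cumulative share of the total, starting from 0
    lorenz = np.insert(np.cumsum(v), 0, 0.0) / total
    x = np.linspace(0.0, 1.0, len(lorenz))
    # Trapezoid rule written out by hand
    area = np.sum((lorenz[1:] + lorenz[:-1]) / 2.0 * np.diff(x))
    return 1.0 - 2.0 * area

equal = gini([1, 1, 1, 1])          # everything spread evenly
concentrated = gini([0, 0, 0, 4])   # everything in one neighborhood
print(round(equal, 3), round(concentrated, 3))
```

With this discrete formula, the most concentrated case for n neighborhoods gives (n - 1) / n rather than exactly 1, so values close to 1 in the real data mean that camps or resources sit in very few neighborhoods.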
