# Community Indicators

This notebook documents the process of analyzing community indicators based on census data. Below is an outline of the workflow:

## Workflow Steps

1. **Load Dataset**  
    - Import the dataset containing barangay-level census data.

2. **Filter Relevant Provinces**  
    - Focus on the 17 provinces of interest:
      - 801: CITY OF CALOOCAN  
      - 802: CITY OF LAS PIÑAS  
      - 803: CITY OF MAKATI  
      - 804: CITY OF MALABON  
      - 805: SANTA CRUZ  
      - 806: SAMPALOC  
      - 807: CITY OF MARIKINA  
      - 808: CITY OF MUNTINLUPA  
      - 809: CITY OF NAVOTAS  
      - 810: CITY OF PARAÑAQUE  
      - 811: PASAY CITY  
      - 812: CITY OF PASIG  
      - 813: QUEZON CITY  
      - 814: CITY OF SAN JUAN  
      - 815: CITY OF TAGUIG  
      - 816: CITY OF VALENZUELA  
      - 817: PATEROS  

3. **Data Cleaning and Transformation**  
    - Coerce distance fields to numeric values, handle missing data, and apply caps for distances exceeding 30 km.  
    - Create 16 community indicators based on the dataset.

4. **Province-Level Aggregation**  
    - Compute province-level averages for the community indicators.  
    - Save the aggregated data for further analysis or merging.

5. **Visualization and Analysis**  
    - Analyze trends and patterns in the community indicators across provinces.

## Key Variables

- **`df_barangay`**: Original dataset containing barangay-level census data.  
- **`df_brg`**: Filtered and transformed dataset focused on Region 13 and the selected provinces.  
- **`distance_vars`**: List of distance-related fields capped at 30 km.  
- **`pmt16_cols`**: List of 16 community indicators created from the dataset.  
- **`prov_comm_means`**: Province-level averages of the community indicators.

This documentation serves as a guide to understanding the steps and variables used in the analysis of community indicators.

In [2]:
import pandas as pd
import numpy as np

In [3]:
df_brg = pd.read_csv('./data/Census/PHL PSA CPH 2020 PUF/PUF for CPH Form 5 (Barangay Schedule Questionnaire)/Philippines/CPH PUF 2020 Philippines-Barangay Schedule.csv')

# Display the first few rows of the DataFrame
df_brg.head()

df_brg = df_brg[df_brg['REG'] == 13]

# Display the first few rows of the filtered DataFrame
df_brg

  df_brg = pd.read_csv('./data/Census/PHL PSA CPH 2020 PUF/PUF for CPH Form 5 (Barangay Schedule Questionnaire)/Philippines/CPH PUF 2020 Philippines-Barangay Schedule.csv')


Unnamed: 0,REG,PRV,MUN,BGY,Q1A,Q1B,Q2,Q3,Q3_DISTANCE,Q4A,...,Q4M,Q4M_DISTANCE,Q4N,Q4N_DISTANCE,Q4O,Q4O_DISTANCE,Q4P,Q4P_DISTANCE,Q4Q,Q4R
33797,13,806,1,1,2,1,1,1,1,2,...,2,1,1,,2,1,1,,1,1
33798,13,806,1,2,2,1,1,1,1,2,...,2,1,1,,2,1,1,,1,1
33799,13,806,1,3,2,1,1,1,1,2,...,2,1,1,,2,1,1,,1,1
33800,13,806,1,4,2,1,1,1,1,2,...,2,1,1,,2,1,1,,1,1
33801,13,806,1,5,2,1,1,1,1,2,...,2,1,1,,2,1,1,,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35502,13,815,0,24,2,2,1,1,1,2,...,2,2,1,,2,1,1,,1,1
35503,13,815,0,25,1,2,1,1,1,2,...,2,2,1,,2,2,1,,1,1
35504,13,815,0,26,2,1,1,1,2,2,...,2,2,1,,2,2,1,,1,1
35505,13,815,0,27,1,1,1,1,1,2,...,2,2,1,,2,2,1,,1,1


In [4]:
df_brg.columns = df_brg.columns.str.strip()

In [102]:
# ------------------------------------------------------------------
# 1.  Build a list that has BOTH distance vars and their base flags
# ------------------------------------------------------------------
distance_cols = [
    "Q3_DISTANCE",  "Q4D_DISTANCE", "Q4A_DISTANCE", "Q4E_DISTANCE",
    "Q4F_DISTANCE", "Q4G_DISTANCE", "Q4K_DISTANCE", "Q4J_DISTANCE",
    "Q4C_DISTANCE", "Q4M_DISTANCE", "Q4I_DISTANCE", "Q4N_DISTANCE",
    "Q4P_DISTANCE", "Q4O_DISTANCE", "Q4L_DISTANCE"
]

# Add the two “plain” binary vars that don’t have _DISTANCE versions
plain_indicator_cols = ["Q1B", "Q2", "Q4Q"]

# Create the matching base-name list (e.g. Q4O)
base_cols = [c.replace("_DISTANCE", "") for c in distance_cols
             if c.replace("_DISTANCE", "") in df_brg.columns]

orig_cols = plain_indicator_cols + distance_cols + base_cols
# Optional: sort so output is ordered nicely
orig_cols = sorted(set(orig_cols))

# ------------------------------------------------------------------
# 2.  Count missing values
# ------------------------------------------------------------------
nan_table = (df_brg[orig_cols]
               .isna()
               .sum()
               .rename("num_nans")
               .reset_index()
               .rename(columns={"index": "variable"})
               .sort_values("num_nans", ascending=False))

# 3.  Preview
display(nan_table.style.format({"num_nans": "{:,}"}))


Unnamed: 0,variable,num_nans
0,Q1B,0
17,Q4I_DISTANCE,0
31,Q4P_DISTANCE,0
30,Q4P,0
29,Q4O_DISTANCE,0
28,Q4O,0
27,Q4N_DISTANCE,0
26,Q4N,0
25,Q4M_DISTANCE,0
24,Q4M,0


In [103]:
for col in orig_cols:
    print(f"Value counts for column: {col}")
    print(df_brg[col].value_counts())
    print("-" * 40)

Value counts for column: Q1B
Q1B
1    1081
2     629
Name: count, dtype: int64
----------------------------------------
Value counts for column: Q2
Q2
1    1710
Name: count, dtype: int64
----------------------------------------
Value counts for column: Q3
Q3
1    1710
Name: count, dtype: int64
----------------------------------------
Value counts for column: Q3_DISTANCE
Q3_DISTANCE
1    1617
2      93
Name: count, dtype: int64
----------------------------------------
Value counts for column: Q4A
Q4A
2    1692
1      18
Name: count, dtype: int64
----------------------------------------
Value counts for column: Q4A_DISTANCE
Q4A_DISTANCE
2    1146
1     546
       18
Name: count, dtype: int64
----------------------------------------
Value counts for column: Q4C
Q4C
2    1364
1     346
Name: count, dtype: int64
----------------------------------------
Value counts for column: Q4C_DISTANCE
Q4C_DISTANCE
1    1082
      346
2     282
Name: count, dtype: int64
---------------------------------

In [104]:


# # -----------------------------------------------------------
# # 2.  Helper: coerce distance fields to numeric & cap at 30 km
# # -----------------------------------------------------------
# distance_vars = [
#     "Q3_DISTANCE", "Q4A_DISTANCE", "Q4E_DISTANCE", "Q4F_DISTANCE",
#     "Q4G_DISTANCE", "Q4K_DISTANCE", "Q4J_DISTANCE", "Q4C_DISTANCE",
#     "Q4P_DISTANCE", "Q4O_DISTANCE", "Q4L_DISTANCE"
# ]

# df_brg[distance_vars] = (
#     df_brg[distance_vars]
#       .apply(pd.to_numeric, errors="coerce")   # force numeric, NaN if blank
#       .fillna(30)                              # policy note: missing → 30 km
#       .clip(upper=30)                          # winsorise  >30 km
# )

# # -----------------------------------------------------------
# # 3.  Create the community indicators
# # -----------------------------------------------------------
df_brg["poblacion"]        = (df_brg["Q1B"] == 1).astype("uint8")
df_brg["street_pattern"]   = (df_brg["Q2"]  == 1).astype("uint8")
df_brg["acc_nat_hwy"]      = (df_brg["Q3_DISTANCE"]  == 1).astype("uint8")


df_brg["cemetery"]      = (df_brg["Q4D"] == 1).astype("uint8")
df_brg["city_hall"]     = (df_brg["Q4A"] == 1).astype("uint8")
df_brg["market"]        = (df_brg["Q4E"] == 1).astype("uint8")
df_brg["elem_sch"]      = (df_brg["Q4F"] == 1).astype("uint8")
df_brg["hs_sch"]        = (df_brg["Q4G"] == 1).astype("uint8")
df_brg["health"]        = (df_brg["Q4K"] == 1).astype("uint8")
df_brg["hospital"]      = (df_brg["Q4J"] == 1).astype("uint8")
df_brg["plaza"]         = (df_brg["Q4C"] == 1).astype("uint8")
df_brg["port"]          = (df_brg["Q4M"] == 1).astype("uint8")
df_brg["library"]       = (df_brg["Q4I"] == 1).astype("uint8")

df_brg["waterworks_system"] = (df_brg["Q4N"] == 1).astype("uint8")
df_brg["cell_signal"]       = (df_brg["Q4Q"] == 1).astype("uint8")

df_brg["landline"]      = (df_brg["Q4P"] == 1).astype("uint8")
df_brg["post_office"]   = (df_brg["Q4O"] == 1).astype("uint8")
df_brg["fire_station"]  = (df_brg["Q4L"] == 1).astype("uint8")

# -----------------------------------------------------------
# 4.  Keep a tidy list of the new columns for later use
# -----------------------------------------------------------
pmt16_cols = [
    "poblacion",
    "street_pattern",
    "acc_nat_hwy",
    "cemetery",
    "city_hall",
    "market",
    "elem_sch",
    "hs_sch",
    "health",
    "hospital",
    "plaza",
    "port",
    "library",
    "waterworks_system",
    "cell_signal",
    "landline",
    "post_office",
    "fire_station",
]


df_brg[pmt16_cols] = df_brg[pmt16_cols].astype(int)
print(df_brg[pmt16_cols].dtypes)
print(" community indicators built:", pmt16_cols)


poblacion            int64
street_pattern       int64
acc_nat_hwy          int64
cemetery             int64
city_hall            int64
market               int64
elem_sch             int64
hs_sch               int64
health               int64
hospital             int64
plaza                int64
port                 int64
library              int64
waterworks_system    int64
cell_signal          int64
landline             int64
post_office          int64
fire_station         int64
dtype: object
 community indicators built: ['poblacion', 'street_pattern', 'acc_nat_hwy', 'cemetery', 'city_hall', 'market', 'elem_sch', 'hs_sch', 'health', 'hospital', 'plaza', 'port', 'library', 'waterworks_system', 'cell_signal', 'landline', 'post_office', 'fire_station']


In [105]:
df_brg

Unnamed: 0,REG,PRV,MUN,BGY,Q1A,Q1B,Q2,Q3,Q3_DISTANCE,Q4A,...,health,hospital,plaza,port,library,waterworks_system,cell_signal,landline,post_office,fire_station
33797,13,806,1,1,2,1,1,1,1,2,...,1,0,0,0,0,1,1,1,0,0
33798,13,806,1,2,2,1,1,1,1,2,...,0,0,0,0,0,1,1,1,0,0
33799,13,806,1,3,2,1,1,1,1,2,...,0,0,0,0,0,1,1,1,0,0
33800,13,806,1,4,2,1,1,1,1,2,...,0,0,1,0,0,1,1,1,0,1
33801,13,806,1,5,2,1,1,1,1,2,...,0,0,0,0,1,1,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35502,13,815,0,24,2,2,1,1,1,2,...,1,0,1,0,1,1,1,1,0,0
35503,13,815,0,25,1,2,1,1,1,2,...,1,0,0,0,1,1,1,1,0,1
35504,13,815,0,26,2,1,1,1,2,2,...,1,0,0,0,1,1,1,1,0,0
35505,13,815,0,27,1,1,1,1,1,2,...,1,0,0,0,1,1,1,1,0,0


In [106]:
# 1. Province-level averages
prov_comm_means = (
  df_brg
    .groupby("PRV", as_index=False)[pmt16_cols]
    .mean()            # simple arithmetic mean
    .round(3)          # optional: nicer display, keep 3 decimals
)

# 2. Preview
prov_comm_means




Unnamed: 0,PRV,poblacion,street_pattern,acc_nat_hwy,cemetery,city_hall,market,elem_sch,hs_sch,health,hospital,plaza,port,library,waterworks_system,cell_signal,landline,post_office,fire_station
0,801,0.0,1.0,0.0,0.064,0.011,0.42,0.367,0.229,0.468,0.08,0.176,0.0,0.016,1.0,1.0,1.0,0.106,0.213
1,802,0.1,1.0,0.0,0.1,0.05,0.8,0.95,0.75,1.0,0.35,0.5,0.05,0.75,1.0,1.0,1.0,0.55,0.85
2,803,0.091,1.0,0.0,0.091,0.03,0.667,0.848,0.667,1.0,0.152,0.788,0.0,0.515,1.0,1.0,1.0,0.182,0.727
3,804,0.048,1.0,0.0,0.19,0.048,0.667,0.905,0.667,1.0,0.19,0.429,0.0,0.095,1.0,1.0,1.0,0.286,0.714
4,805,0.037,1.0,0.0,0.148,0.037,0.852,0.704,0.593,0.963,0.222,0.593,0.0,0.037,1.0,1.0,1.0,0.111,0.444
5,806,1.0,1.0,0.0,0.001,0.001,0.144,0.185,0.14,0.168,0.028,0.079,0.013,0.01,0.999,1.0,1.0,0.025,0.058
6,807,0.125,1.0,0.0,0.5,0.062,0.75,0.938,0.875,1.0,0.5,0.875,0.0,0.062,1.0,1.0,1.0,0.125,0.75
7,808,0.333,1.0,0.0,0.333,0.111,0.889,1.0,1.0,1.0,0.333,0.889,0.0,0.111,1.0,1.0,1.0,0.778,1.0
8,809,0.056,1.0,0.0,0.056,0.056,0.611,0.778,0.444,0.889,0.056,0.389,0.167,0.056,1.0,1.0,0.944,0.111,0.111
9,810,0.062,1.0,0.0,0.125,0.062,0.688,0.938,0.875,1.0,0.5,0.562,0.062,0.25,1.0,1.0,1.0,0.875,0.812


In [107]:
prov_comm_means.to_csv("output/pmt_comm_indicators_by_province.csv", index=False)