# Preparation and Cleaning of the FIES & LFS 2018 Dataset

This notebook prepares and cleans the merged Family Income and Expenditure Survey (FIES) and Labor Force Survey (LFS) 2018 dataset for Region 13 (Metro Manila). The workflow covers data loading, merging, filtering, cleaning, variable transformation, indicator construction, and saving the final dataset for analysis.

---

## Workflow Overview

1. **Data Loading**
    - Import necessary libraries (`pandas`, `numpy`, `os`, `importlib`).
    - Load individual-level and household-level CSVs into DataFrames.
    - Define the poverty-line lookup table (`df_pov`) inline, one row per district.

2. **Data Merging**
    - Merge individual and household DataFrames on shared keys (`W_REGN`, `W_PROV`, `SEQUENCE_NO`).
    - Filter merged data for Region 13 (`W_REGN == 13`).
    - Merge poverty indicators into the filtered dataset.

3. **Variable Cleaning and Transformation**
    - Clean proxy variables: convert types, handle missing values, remove leading zeros, and standardize formats.
    - Recode education levels using a custom function for consistent categories.
    - Contextually fill missing values for key variables based on age and other conditions.
    - Map categorical codes to descriptive labels using a mapping dictionary.

4. **Indicator Construction**
    - Create household-level indicators by aggregating individual data:
        - Education buckets
        - Age buckets
        - Occupation codes
        - Worker counts
    - Generate dummy variables for categorical columns (e.g., domestic helper status).
    - Extract household head sex and marital status.

5. **Final DataFrame Refinement**
    - Remove unnecessary columns and duplicates.
    - Merge household-level indicators.
    - Convert selected columns to categorical types for analysis.

6. **Saving the Cleaned Dataset**
    - Save the final, cleaned, and labeled DataFrame (`df_18`) to disk for further analysis.

---

## Outputs

- **Cleaned DataFrame (`df_18`)**: Contains merged, filtered, and labeled data for Region 13, with household and individual indicators.
- **CSV File**: Saved to `output/df_18_ols_occ.csv` for downstream analysis.

---

## Purpose

This notebook ensures that the FIES & LFS 2018 data is:
- Consistently formatted and labeled
- Ready for robust socioeconomic analysis
- Documented for reproducibility and transparency

In [1]:
# Load necessary libraries

import pandas as pd
import numpy as np
import importlib
import os

In [2]:
DATA_PATH = '/Users/Ruhama/Library/CloudStorage/OneDrive-UniversityofCopenhagen/Master thesis'

In [3]:
# Create output folder if it doesn't exist

output_folder = "output"
os.makedirs(output_folder, exist_ok=True)

## Loading and Merging the Dataset

This section covers the initial preparation of the FIES and LFS 2018 data:

**Inputs:**
- Individual-level CSV (household members)
- Household-level CSV (summary)
- Poverty-line lookup table (`df_pov`), defined inline with one row per district

**Steps:**
1. Load CSVs into DataFrames.
2. Merge on shared keys (`W_REGN`, `W_PROV`, `SEQUENCE_NO`).
3. Filter for Region 13 (`W_REGN == 13`).
4. Merge poverty indicators.

**Output:**  
A cleaned, merged DataFrame (`df_18`) for Region 13, ready for analysis.
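
The merge-and-filter steps can be sketched on toy data. The frames `members` and `summary` and all their values are hypothetical; only the key names come from this notebook. Passing `validate="m:1"` asks pandas to raise an error if the household table has duplicate keys, and `indicator=True` records which rows found a match.

```python
import pandas as pd

# Toy stand-ins for the member-level and household-level files (values are made up)
members = pd.DataFrame({
    "W_REGN": [13, 13, 13, 1],
    "W_PROV": [39, 39, 74, 28],
    "SEQUENCE_NO": [1, 1, 2, 3],
    "LC05_AGE": [40, 12, 33, 51],
})
summary = pd.DataFrame({
    "W_REGN": [13, 13, 1],
    "W_PROV": [39, 74, 28],
    "SEQUENCE_NO": [1, 2, 3],
    "FSIZE": [2, 1, 1],
})

keys = ["W_REGN", "W_PROV", "SEQUENCE_NO"]

# Many members map to one household; validate raises MergeError otherwise
merged = members.merge(summary, on=keys, how="inner", validate="m:1", indicator=True)

# Keep Region 13 only; .copy() gives an independent frame for later assignments
region13 = merged[merged["W_REGN"] == 13].copy()
print(region13[keys + ["FSIZE", "_merge"]])
```

Whether the real files satisfy the many-to-one assumption is not stated in the notebook, so the `validate` argument is a check worth running rather than a guarantee.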

In [4]:
# Load the individual-level file (household members)
file_path = DATA_PATH + '/FIES&LFS/FIES LFS Merge 2018/FIES-LFS PUF 2018 Household Members.CSV'
df_FIES18LSF = pd.read_csv(file_path)

# Load the household-level file (household summary)
file_path = DATA_PATH + '/FIES&LFS/FIES LFS Merge 2018/FIES-LFS PUF 2018 Household Summary.CSV'
df_FIES18LSF2 = pd.read_csv(file_path)

# Merge members with their household summary on the shared keys
merged_df = pd.merge(df_FIES18LSF, df_FIES18LSF2, on=['W_REGN', 'W_PROV', 'SEQUENCE_NO'])

# Keep only the rows where W_REGN equals 13
filtered_df = merged_df[merged_df['W_REGN'] == 13]

# Display the first few rows of the filtered dataframe
# filtered_df


In [5]:
df_pov = pd.DataFrame([
    {"District Name in NRC": "1st District", "W_PROV": 39, "poverty_line": 28682},
    {"District Name in NRC": "2nd District", "W_PROV": 74, "poverty_line": 28682},
    {"District Name in NRC": "3rd District", "W_PROV": 75, "poverty_line": 28682},
    {"District Name in NRC": "4th District", "W_PROV": 76, "poverty_line": 28682},
])

# Merge filtered_df with df_pov on the 'W_PROV' column
df_18 = pd.merge(filtered_df, df_pov, on='W_PROV', how='left')

# Display the first few rows of the merged dataframe
#df_18

## Remapping of the Highest Completed Grades

`LC07_GRADE` is recoded using the `recode_edu` function from the `education_recode` module to standardize educational attainment categories in `df_18`.

**Purpose:**  
- Groups education levels for easier analysis.
- Ensures consistent, interpretable data.


In [6]:

# Import the recoding helpers once (provides edu.MAP and edu.recode_edu)
edu = importlib.import_module("education_recode")

df_18["LC07_GRADE"] = df_18["LC07_GRADE"].apply(edu.recode_edu)


## Cleaning and Labeling Proxy Variables

This section outlines the cleaning and labeling of proxy variables in `df_18`, representing key household and individual characteristics for socioeconomic analysis.

**Steps:**
- **Identify Proxies:** Select columns as proxies (e.g., education, occupation, income, poverty line).
- **Clean Data:** Convert types, handle missing values, remove leading zeros, and standardize formats.
- **Label Variables:** Map codes to descriptive labels and fill missing values contextually.
- **Verify:** Print NaN counts per column.

**Output:**  
A cleaned and labeled `df_18` DataFrame, ready for analysis.
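
As a rough illustration of the cleaning step, the helper below turns zero-padded string codes into numbers and blanks into NaN. It is a simplified alternative, not the cell used in this notebook: the function name `clean_code_column` and the sample values are hypothetical, and it relies on `pd.to_numeric` to parse leading zeros instead of stripping them by hand.

```python
import pandas as pd

def clean_code_column(s: pd.Series) -> pd.Series:
    """Turn zero-padded string codes into floats, with blanks as NaN."""
    s = s.astype(str).str.strip()              # drop surrounding whitespace
    s = s.mask(s.eq(""))                       # blank-only cells become missing
    return pd.to_numeric(s, errors="coerce")   # '007' -> 7.0, unparseable text -> NaN

raw = pd.Series(["007", "00", "   ", "12", None])
cleaned = clean_code_column(raw)
print(cleaned.tolist())
```

Using `errors="coerce"` means any text that is not a number is silently turned into NaN, so it is worth comparing unique values before and after, as the notebook does with `obs_counts`.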


In [7]:
df_18['LC14_PROCC'].value_counts() 

# Left-pad each code with zeros to four characters, then keep only the first character (as a string)
df_18['LC14_PROCC'] = df_18['LC14_PROCC'].astype(str).str.zfill(4).str[0]
df_18['LC14_PROCC'].value_counts()

LC14_PROCC
     48348
5     7193
9     5832
4     4488
1     4296
7     3090
8     2989
2     2347
3     2131
6       62
0       49
Name: count, dtype: int64

In [8]:
with open(os.path.join(output_folder, 'all_columns_list_18.txt'), 'w') as f:
    for column in df_18.columns:
        f.write(f"{column}\n")

In [9]:
check_list = [
    'W_REGN',
    'W_PROV',
    'SEQUENCE_NO',
    'FSIZE',
    'PCINC',
    'URB',
    'RPROV',
    'RPSU',
    'BWEIGHT',
    'RFACT',
    'RFACT_POP',
    'HS001001_SEX',
    'HS001002_AGE',
    'HS001003_MS',
    'HS001004_HGC',
    'HS001005_JOB',
    'HS001006_OCC',
    'HS001007_IND',
    'HS001008_CW',
    'H150101_BLDG_TYPE',
    'H150102_ROOF',
    'H150103_WALLS',
    'H150103A_MAIN',
    'H150104_TENURE_STA',
    'H150109_TOILET',
    'H150110_ELECTRICITY',
    'H150111_WATER_SUPPLY',
    'H150113_RADIO_QTY',
    'H150114_TV_QTY',
    'H150115_VCD_QTY',
    'H150116_STEREO_QTY',
    'H150117_REF_QTY',
    'H150118_WASH_QTY',
    'H150119_AIRCON_QTY',
    'H150120_CAR_QTY',
    'H150121_LANDLINE_QTY',
    'H150122_CELL_QTY',
    'H150123_COMP_QTY',
    'H150124_OVEN_QTY',
    'H150125_BANCA_QTY',
    'H150126_MOTOR_QTY',
    'IND_4PS',
    'M4PS',
    'Y4PS',
    'T930530',
    'L1PRRCD',
    'LC101_LNO',
    'LC03_REL',
    'LC04_SEX',
    'LC05_AGE',
    'LC06_MSTAT',
    'LC07_GRADE',
    'LC08_CURSCH',
    'LC10_CONWR',
    'LC14_PROCC',
    'LC16_PKB',
    'LC17_NATEM',
    'LC23_PCLASS',
    'LC44_DIFF_SEE',
    'LC45_DIFF_HEAR',
    'LC46_DIFF_WALK',
    'LC47_DIFF_REM',
    'LC48_DIFF_CARE',
    'LC49_DIFF_COMM',
    'H150108_HSE_ALTERATION',
    'LC24_PBASIS',
    'LC41_WQTR',
    'LC43_QKB',
    'LC12_JOB'
]

In [10]:
# Columns in check_list that are not already numeric need cleaning
columns_to_clean = [
    variable for variable in check_list
    if df_18[variable].dtype not in ('int64', 'float64')
]

# Loop through the columns and clean the data
nan_counts = {}
obs_counts = {}
blank_values = ["", " ", "  ", "    ", "     "]

for column in columns_to_clean:

    obs_counts[column] = df_18[column].unique()

    # Convert to string to ensure consistency
    df_18[column] = df_18[column].astype(str)

    # Remove leading zeros, but keep standalone "0"
    df_18[column] = df_18[column].replace({'00': '0', '00000': '0'})
    df_18[column] = df_18[column].apply(lambda x: '0' if x == '0' else x.lstrip('0'))

    # Treat empty and blank-only strings as missing
    # (plain assignment instead of inplace=True avoids the pandas chained-assignment warning)
    df_18[column] = df_18[column].replace(blank_values, np.nan)

    # Count NaN values before the numeric conversion
    nan_counts[column] = df_18[column].isna().sum()

    # Convert to float; anything non-numeric becomes NaN
    df_18[column] = pd.to_numeric(df_18[column], errors='coerce').astype(float)

    obs_counts[column + '_new'] = df_18[column].unique()

# Print the count of NaN values for each column
print("NaN values per column:")
for column, count in nan_counts.items():
    print(f"{column}: {count}")

NaN values per column:
HS001006_OCC: 21644
HS001007_IND: 21644
HS001008_CW: 21644
M4PS: 70782
Y4PS: 72872
LC06_MSTAT: 6052
LC08_CURSCH: 50491
LC10_CONWR: 20908
LC14_PROCC: 48348
LC16_PKB: 48348
LC17_NATEM: 48360
LC23_PCLASS: 48360
LC44_DIFF_SEE: 21977
LC45_DIFF_HEAR: 21977
LC46_DIFF_WALK: 21977
LC47_DIFF_REM: 21977
LC48_DIFF_CARE: 21977
LC49_DIFF_COMM: 21977
LC24_PBASIS: 54166
LC41_WQTR: 35949
LC43_QKB: 49106
LC12_JOB: 39526



In [11]:
# Fill NaNs contextually. Each line only writes to rows that match the age or work condition.
# Age < 5 or > 24: label NaNs in LC08_CURSCH as 0
df_18.loc[(df_18['LC05_AGE'] < 5) | (df_18['LC05_AGE'] > 24), 'LC08_CURSCH'] = df_18['LC08_CURSCH'].fillna(0)
# Age <= 5: label NaNs in LC07_GRADE as 1
df_18.loc[(df_18['LC05_AGE'] <= 5), 'LC07_GRADE'] = df_18['LC07_GRADE'].fillna(1)
# Age <= 14 or > 65: fill NaNs in the employment variables with fixed codes
df_18.loc[(df_18['LC05_AGE'] <= 14) | (df_18['LC05_AGE'] > 65), 'LC17_NATEM'] = df_18['LC17_NATEM'].fillna(0)
df_18.loc[(df_18['LC05_AGE'] <= 14) | (df_18['LC05_AGE'] > 65), 'LC10_CONWR'] = df_18['LC10_CONWR'].fillna(0)
df_18.loc[(df_18['LC05_AGE'] <= 14) | (df_18['LC05_AGE'] > 65), 'LC23_PCLASS'] = df_18['LC23_PCLASS'].fillna(7)
df_18.loc[(df_18['LC05_AGE'] <= 14) | (df_18['LC05_AGE'] > 65), 'LC41_WQTR'] = df_18['LC41_WQTR'].fillna(3)
#df_18.loc[(df_18['LC05_AGE'] <= 14) | (df_18['LC05_AGE'] > 65), 'LC14_PROCC'] = df_18['LC14_PROCC'].fillna(0)
df_18.loc[(df_18['LC05_AGE'] <= 14) | (df_18['LC05_AGE'] > 65), 'LC12_JOB'] = df_18['LC12_JOB'].fillna(4)
# LC41_WQTR == 2: fill the remaining NaNs in LC23_PCLASS and LC17_NATEM
df_18.loc[(df_18['LC41_WQTR'] == 2), 'LC23_PCLASS'] = df_18['LC23_PCLASS'].fillna(8)
df_18.loc[(df_18['LC41_WQTR'] == 2), 'LC17_NATEM'] = df_18['LC17_NATEM'].fillna(4)

# Display the first few rows of the updated dataframe to verify the changes
# df_18
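
The conditional fills in this cell rely on index alignment: the right-hand side is computed for every row, but `.loc` only writes to rows where the mask is true. A small check on a toy frame (the ages and values are made up) may make this easier to see:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LC05_AGE": [3, 10, 30, 70],
    "LC08_CURSCH": [np.nan, np.nan, np.nan, 1.0],
})

# Rows outside the 5-24 age range get their missing value filled with 0
outside_school_age = (toy["LC05_AGE"] < 5) | (toy["LC05_AGE"] > 24)
toy.loc[outside_school_age, "LC08_CURSCH"] = toy["LC08_CURSCH"].fillna(0)

print(toy)
```

The 10-year-old keeps a NaN because the mask is false for that row, and the 70-year-old keeps the existing value of 1 because `fillna` only touches missing entries.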

In [12]:
df_18["H150110_ELECTRICITY"] = df_18["H150110_ELECTRICITY"].replace({2: 0})

In [13]:
list_vars = df_18.columns

## Individual and Household Indicator Construction

This section describes how household-level indicators are constructed by aggregating individual data from `df_18` using custom bucket functions and pandas group operations.

**Steps:**
- Copy `df_18` to `p` for processing.
- Define education and age buckets to ensure all categories are represented.
- Assign education and age buckets to individuals, one-hot encode, and aggregate counts by household (`SEQUENCE_NO`).
- Identify workers (`LC12_JOB == 1`), aggregate occupation codes, and count total workers per household.
- Merge all indicators into `merged_hh_data`, filling missing values with zeros.

**Output:**  
A household-level DataFrame (`merged_hh_data`) with counts for education, age, occupation, and workers, ready for analysis.
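
The one-hot encode, aggregate, and reindex pattern described above can be sketched on a toy frame. The column name `bucket` and the data are hypothetical; the bucket labels mirror the ones used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "SEQUENCE_NO": [1, 1, 2],
    "bucket": ["age_0_5", "age_18_64", "age_18_64"],
})
all_buckets = ["age_0_5", "age_6_12", "age_18_64"]

counts = (
    pd.get_dummies(toy.set_index("SEQUENCE_NO")["bucket"], dtype=int)
      .groupby(level="SEQUENCE_NO").sum()            # members per bucket, per household
      .reindex(columns=all_buckets, fill_value=0)    # add buckets nobody fell into
)
print(counts)
```

Reindexing against a fixed bucket list keeps the column set stable across survey years, even when a category happens to be empty in one of them.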


In [14]:
# --- education buckets -----------------------------------------
def educ_band(code):
    if code in (0, 1, 10, 191, 192):
        return 'educ_none'                       # no formal schooling
    elif 310 <= code <= 500:
        return 'educ_primary'                   # grade 1 - grade 10
    elif 510 <= code <= 520 or 601 <= code <= 699:
        return 'educ_secondary'                 # grade 11-12 + post-sec certs
    elif 710 <= code <= 799 or 801 <= code <= 899:
        return 'educ_college'                   # associate / bachelor / college years
    elif 910 <= code <= 949:
        return 'educ_postgrad'                  # masters / doctorate
    else:
        return 'educ_other'                     # 999 or anything unexpected

# ---------- age buckets ----------
def age_band(a):
    if   a <= 5:   return 'age_0_5'
    elif a <=12:   return 'age_6_12'
    elif a <=17:   return 'age_13_17'
    elif a <=64:   return 'age_18_64'
    else:          return 'age_65p'
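
For comparison, the same age bands can be produced in one vectorized call with `pd.cut`. This is an alternative sketch, not what the notebook runs; the sample ages are made up.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 5, 6, 12, 13, 17, 18, 64, 65, 90])

# Right-closed bins: (-inf, 5], (5, 12], (12, 17], (17, 64], (64, inf)
age_bucket = pd.cut(
    ages,
    bins=[-np.inf, 5, 12, 17, 64, np.inf],
    labels=["age_0_5", "age_6_12", "age_13_17", "age_18_64", "age_65p"],
)
print(age_bucket.tolist())
```

One behavioral difference is worth noting: `age_band` sends a missing age to its final `else` branch (`age_65p`), because every comparison with NaN is false, whereas `pd.cut` leaves a missing age as NaN.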

In [16]:
p = df_18.copy()

# Define all possible buckets for education and age groups
all_educ_buckets = ['educ_none', 'educ_primary', 'educ_secondary', 'educ_college', 'educ_postgrad', 'educ_other']
all_age_buckets = ['age_0_5', 'age_6_12', 'age_13_17', 'age_18_64', 'age_65p']

# 1. Education counts per household
educ_cnt = (
  pd.get_dummies(p.set_index('SEQUENCE_NO')['LC07_GRADE'].apply(educ_band))
    .groupby('SEQUENCE_NO').sum()
    .reindex(columns=all_educ_buckets, fill_value=0)
)

# 2. Age counts per household
age_cnt = (
  pd.get_dummies(p.set_index('SEQUENCE_NO')['LC05_AGE'].apply(age_band))
    .groupby('SEQUENCE_NO').sum()
    .reindex(columns=all_age_buckets, fill_value=0)
)

# 3. Master list of household IDs
all_hh = df_18['SEQUENCE_NO'].unique()

# 4. Select workers
workers = df_18[df_18['LC12_JOB'] == 1]

# 5. Occupation counts per household
hh_occ_cnt = (
  pd.get_dummies(workers.set_index('SEQUENCE_NO')['LC14_PROCC'], prefix='occ4d')
    .groupby('SEQUENCE_NO').sum()
    .reindex(all_hh, fill_value=0)
)
hh_occ_cnt.rename(columns=lambda col: col.replace('.0', ''), inplace=True)

# 6. Total worker count per household
hh_tot_workers = (
  workers.groupby('SEQUENCE_NO').size()
    .rename('n_workers')
    .reindex(all_hh, fill_value=0)
    .to_frame()
)

# Merge the dataframes on SEQUENCE_NO
merged_hh_data = hh_occ_cnt.merge(hh_tot_workers, on='SEQUENCE_NO', how='outer') \
                           .merge(age_cnt, on='SEQUENCE_NO', how='outer') \
                           .merge(educ_cnt, on='SEQUENCE_NO', how='outer') .fillna(0)

# Display the first few rows of the merged dataframe
# merged_hh_data.head()

## Mapping Categorical Variables to Descriptive Labels

Categorical codes in `df_18` are replaced with human-readable labels using a mapping dictionary loaded from `Mapping Dictionary for Thesis.py`. For each variable in `list_vars` that has an entry in the dictionary, codes are mapped to labels; codes without a matching key become NaN. The display line at the end of the cell is commented out and can be re-enabled for verification.

**Purpose:**  
- Improves clarity for analysis and reporting.
- Ensures consistent labeling of categorical variables.
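
Because `Series.map` returns NaN for any code missing from the dictionary, it can help to list unmapped codes explicitly. In the sketch below, `sex_labels` and the sample codes are hypothetical stand-ins; the real mappings live in the dictionary file.

```python
import pandas as pd

# Hypothetical code-to-label mapping, for illustration only
sex_labels = {1: "Male", 2: "Female"}

codes = pd.Series([1, 2, 2, 9], name="LC04_SEX")
labels = codes.map(sex_labels)

# Codes that were present in the data but absent from the mapping
unmapped = codes[labels.isna() & codes.notna()].unique()
print(labels.tolist(), unmapped)
```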


In [17]:
# Load the mapping dictionary
mapping_dict_path = './Mapping Dictionary for Thesis.py'
mapping_dict = {}
exec(open(mapping_dict_path, encoding='utf-8').read(), mapping_dict)

# Map the categorical variables
for var in list_vars:
    if var in mapping_dict:
        df_18[var] = df_18[var].map(mapping_dict[var])

# Display the first few rows of the dataframe to verify the mapping
# df_18.head()


## Domestic Helper Indicator Construction

This section creates a household-level indicator for domestic helpers:

**Steps:**
- **Dummy Creation:**  Use `pd.get_dummies()` on `LC03_REL` to generate a binary `LC03_REL_Domestic Helper` column.
- **Aggregation:**  Sum `LC03_REL_Domestic Helper` by `SEQUENCE_NO` to count helpers per household (`domestic_helper`).
- **Integration:**  Merge `domestic_helper` into `df_18` for household analysis.

**Outputs:**  
- `df_18_dummies2`: One row per household with its helper count.  
- `domestic_helper`: Indicator in `df_18`.
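
The same household count can be obtained without building the full set of dummy columns. This is an alternative sketch on made-up data, assuming the label `'Domestic Helper'` appears in `LC03_REL` as it does in the cell below.

```python
import pandas as pd

toy = pd.DataFrame({
    "SEQUENCE_NO": [1, 1, 1, 2],
    "LC03_REL": ["Head", "Domestic Helper", "Domestic Helper", "Head"],
})

# Boolean flag per member, summed within each household and broadcast back to every row
is_helper = toy["LC03_REL"].eq("Domestic Helper")
toy["domestic_helper"] = is_helper.groupby(toy["SEQUENCE_NO"]).transform("sum")
print(toy)
```

This avoids depending on `drop_first=True`, which removes one category's dummy column and would break the lookup if that category happened to be the helper label.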


In [18]:
create_dummies = ['LC03_REL']

# Create dummy variables for the specified columns
df_18_dummies = pd.get_dummies(df_18, columns=create_dummies, drop_first=True)

# List all the names of the newly created columns
new_columns = [col for col in df_18_dummies.columns if col not in df_18.columns]

df_18_dummies['domestic_helper'] = df_18_dummies.groupby('SEQUENCE_NO')['LC03_REL_Domestic Helper'].transform('sum')
df_18_dummies2 = df_18_dummies[['SEQUENCE_NO', 'domestic_helper']].drop_duplicates()

df_18 = df_18.merge(df_18_dummies2, on='SEQUENCE_NO', how='left')

# Display the first few rows of the updated dataframe
# df_18


## Household Head Sex and Marital Status Extraction

This step creates household-level indicators for the sex (`hh_sex`) and marital status (`hh_ms`) of the household head:

**Steps:**
- **Select Columns:** Extract `LC04_SEX`, `LC06_MSTAT`, `LC03_REL`, and `SEQUENCE_NO` from `df_18`.
- **Filter Heads:** Keep only rows where `LC03_REL` is `'Head'`.
- **Rename and Drop:** Rename `LC04_SEX` to `hh_sex`, `LC06_MSTAT` to `hh_ms`, and drop `LC03_REL`.
- **Merge:** Merge these indicators back into `df_18` by `SEQUENCE_NO`.

**Result:**  
`df_18` now includes `hh_sex` and `hh_ms` for each household.
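
A sketch of the head-extraction step on toy data follows. The relationship, sex, and marital status labels other than `'Head'` are hypothetical. The `validate="m:1"` argument is an added safeguard, not part of the notebook: it raises if a household has more than one head row, which would otherwise duplicate member rows in the merge.

```python
import pandas as pd

toy = pd.DataFrame({
    "SEQUENCE_NO": [1, 1, 2],
    "LC03_REL": ["Head", "Son", "Head"],
    "LC04_SEX": ["Male", "Male", "Female"],
    "LC06_MSTAT": ["Married", "Single", "Widowed"],
})

heads = (
    toy.loc[toy["LC03_REL"] == "Head", ["SEQUENCE_NO", "LC04_SEX", "LC06_MSTAT"]]
       .rename(columns={"LC04_SEX": "hh_sex", "LC06_MSTAT": "hh_ms"})
)

# Every member row receives the attributes of its household head
toy = toy.merge(heads, on="SEQUENCE_NO", how="left", validate="m:1")
print(toy[["SEQUENCE_NO", "hh_sex", "hh_ms"]])
```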


In [19]:
df_18_hh = df_18[['LC04_SEX',
    'LC06_MSTAT', 'LC03_REL', 'SEQUENCE_NO']]
df_18_hh = df_18_hh[df_18_hh['LC03_REL'] == 'Head']
# Rename and drop without inplace=True, since df_18_hh is a slice of df_18
df_18_hh = df_18_hh.rename(columns={'LC04_SEX': 'hh_sex', 'LC06_MSTAT': 'hh_ms'})
df_18_hh = df_18_hh.drop(columns=['LC03_REL'])

# Merge df_18 with df_18_hh on 'SEQUENCE_NO'
df_18 = pd.merge(df_18, df_18_hh, on='SEQUENCE_NO', how='left')

# Display the first few rows of the updated dataframe to verify the changes
# df_18.head()

## Final DataFrame Refinement

This section summarizes the last steps in preparing the household-level DataFrame (`df_18`):

**Steps:**
- **Column Cleanup:** Remove columns starting with `LC` and drop `NEWEMPSTAT`, `PWGTPRV`.
- **Merge Indicators:** Add household-level indicators from `merged_hh_data`.
- **Remove Duplicates:** Ensure each household is unique.
- **Format Indicators:** Convert selected columns to categorical types.
- **Verification:** Check indicator data types.

The result is a clean, household-indexed DataFrame ready for analysis.


In [20]:
df_18 = df_18.loc[:, ~df_18.columns.str.startswith('LC')]
df_18 = df_18.drop(columns=['NEWEMPSTAT', 'PWGTPRV'])

df_18 = df_18.merge(merged_hh_data, on='SEQUENCE_NO', how='left')

df_18 = df_18.drop_duplicates()

In [21]:
# List of indicators to reformat as categories
indicators_to_category = [
    'H150104_TENURE_STA',
    'H150102_ROOF',
    'H150103_WALLS',
    'H150101_BLDG_TYPE',
    'H150111_WATER_SUPPLY',
    'H150109_TOILET',
    'H150110_ELECTRICITY',
    'RPROV',
    'hh_sex',
    'hh_ms'
]

# Convert the specified columns to category type
df_18[indicators_to_category] = df_18[indicators_to_category].astype('category')

# Verify the changes
df_18[indicators_to_category].dtypes

H150104_TENURE_STA      category
H150102_ROOF            category
H150103_WALLS           category
H150101_BLDG_TYPE       category
H150111_WATER_SUPPLY    category
H150109_TOILET          category
H150110_ELECTRICITY     category
RPROV                   category
hh_sex                  category
hh_ms                   category
dtype: object

## Community Indicators: Average Barangay Metrics by Province

This section integrates community-level indicators into the household dataset by province. The workflow includes:

- **Loading the Community Indicators Dataset:**  
    Import barangay-level metrics from an external CSV file.

- **Filtering for Target Provinces:**  
    Select only the 17 Metro Manila cities and municipalities, which `RPROV` codes as provinces:
    - 801: Caloocan City
    - 802: City of Las Piñas
    - 803: City of Makati
    - 804: City of Malabon
    - 805: City of Mandaluyong
    - 806: City of Manila
    - 807: City of Marikina
    - 808: City of Muntinlupa
    - 809: City of Navotas
    - 810: City of Parañaque
    - 811: Pasay City
    - 812: City of Pasig
    - 813: Quezon City
    - 814: City of San Juan
    - 815: City of Taguig
    - 816: City of Valenzuela
    - 817: Pateros

- **Merging with Household Data:**  
    Attach average barangay-level indicators to each household based on province code.

**Purpose:**  
Enhance household-level analysis by incorporating contextual community metrics, enabling richer socioeconomic insights at the province level.
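
The attach-by-province step can be sketched as a left merge on toy data. The key name `PRV` is the one used in the merge cell of this notebook; the column `avg_metric` and all values are hypothetical.

```python
import pandas as pd

# Household rows; the third has no province code (for example, an unmapped name)
households = pd.DataFrame({"SEQUENCE_NO": [1, 2, 3], "RPROV": [801.0, 813.0, None]})

# Hypothetical province-level table of averaged barangay metrics
comm = pd.DataFrame({"PRV": [801, 813], "avg_metric": [0.4, 0.7]})

out = households.merge(comm, left_on="RPROV", right_on="PRV", how="left")
out = out.dropna(subset=["RPROV"])   # households without a province code are dropped
print(out)
```

Since rows with a missing `RPROV` are dropped, it is worth counting them before the `dropna` call to confirm that no province is lost because of a naming mismatch.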

In [22]:

# --- mapping ----------------------------------------------------
name_to_3dig = {
    "Caloocan City": 801,
    "City of Las Piñas": 802,
    "City of Makati": 803,
    "City of Malabon": 804,
    "'City of Mandaluyong'": 805,
    "City of Manila": 806,
    "City of Marikina": 807,
    "City of Muntinlupa": 808,
    "City of Navotas": 809,
    "City of Parañaque": 810,
    "Pasay City": 811,
    "City of Pasig": 812,
    "Quezon City": 813,
    "City of San Juan": 814,
    "Taguig City": 815,
    "City of Valenzuela": 816,
    "Pateros": 817,
}

# --- recode RPROV from area name to 3-digit code ---------------
# Names that are missing from name_to_3dig map to NaN;
# those rows are dropped after the community-indicator merge
df_18["RPROV_new"] = (
    df_18["RPROV"]         
      .map(name_to_3dig)      
)

df_18['RPROV'] = df_18["RPROV_new"]
df_18['RPROV'].dtypes

dtype('float64')

In [23]:
file_path = './output/pmt_comm_indicators_by_province.csv'

# Load the CSV file into a pandas DataFrame
df_comm_indicators = pd.read_csv(file_path)

df_18 = df_18.merge(df_comm_indicators, left_on='RPROV', right_on='PRV', how='left')
df_18 = df_18.dropna(subset=['RPROV'])
# df_18


## Saving the Cleaned DataFrame (`df_18`)

The final household-level DataFrame `df_18` is saved to disk for further analysis.

- **File:** `output/df_18_ols_occ.csv`
- **Contents:** Cleaned, merged, and labeled FIES & LFS 2018 data for Region 13, including household and individual indicators.
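
One caveat when reloading the file: CSV stores plain text, so the categorical dtypes set earlier are not preserved. A minimal round-trip sketch, using an in-memory buffer and made-up values in place of the real file:

```python
import io
import pandas as pd

df = pd.DataFrame({"hh_sex": pd.Categorical(["Male", "Female"]), "FSIZE": [3, 4]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Request the category dtype again on load; otherwise the column comes back as text
reloaded = pd.read_csv(buffer, dtype={"hh_sex": "category"})
print(reloaded.dtypes)
```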

In [25]:
df_18.to_csv(os.path.join(output_folder, 'df_18_ols_occ.csv'), index=False)
