# CSMODEL Machine Project

This Jupyter notebook presents a comprehensive analysis of the [Family Income and Expenditure Survey 2012 Vol. 1](https://psada.psa.gov.ph/auth/register) dataset from the Philippine Statistics Authority (PSA).

The goal of the project is to explore how **socioeconomic and demographic factors** may be associated with the **expenditure patterns** of Filipino households in 2012.

**Research Question**
1. [How are socioeconomic and demographic factors associated with the expenditure patterns of Filipino households in 2012?](#how-are-socioeconomic-and-demographic-factors-associated-with-the-expenditure-patterns-of-filipino-households-in-2012)

**Exploratory Data Analysis (EDA) Questions**
1. [How does the proportion of spending across **major expenditure categories** vary based on various **socioeconomic and demographic factors**?](#1-how-does-the-proportion-of-spending-across-major-expenditure-categories-vary-based-on-various-socioeconomic-and-demographic-factors)
2. [Which regions spend the highest proportion of their **food expenditure** on various **food categories**?](#2-which-regions-spend-the-highest-proportion-of-their-food-expenditure-on-various-food-categories)
3. [Is there a correlation between **total household income** and the proportion of food expenditure spent on **food consumed outside the home**?](#3-is-there-a-correlation-between-total-household-income-and-the-proportion-of-food-expenditure-spent-on-food-consumed-outside-the-home)
4. [How does **housing expenditure** differ between **urban** and **rural** households?](#4-how-does-housing-expenditure-differ-between-urban-and-rural-households)
5. [Is there a correlation between **total household income** and any of the **major expenditure categories**?](#5-is-there-a-correlation-between-total-household-income-and-any-of-the-major-expenditure-categories)

## Authors

The following students of De La Salle University - Manila, Philippines collaborated on this project:

<table>
  <thead>
    <tr>
      <th>Profile</th>
      <th>Author</th>
      <th>Contributions</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td align="center">
        <img src="https://github.com/qu1r0ra.png" width="50" height="50" style="border-radius: 50%;" />
      </td>
      <td>
        <strong>Christian Joseph Bunyi</strong>
        <br />
        <a href="https://github.com/qu1r0ra">@qu1r0ra</a>
      </td>
      <td>
        <ul>
          <li>Created and maintained the GitHub repository and the Jupyter notebook</li>
          <li>Constructed the research question</li>
          <li>Constructed EDA questions 1, 2, and 3</li>
          <li>Performed data cleaning and preprocessing (Section II)</li>
          <li>Performed EDA on EDA questions 1 and 2 (Section III)</li>
          <li>Wrote introductory and skeletal Markdown (Introduction, Authors, etc.)</li>
          <li>Wrote Markdown for Sections II and III</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td align="center">
        <img src="https://github.com/kelliekaw.png" width="50" height="50" style="border-radius: 50%;" />
      </td>
      <td>
        <strong>Kellie Kaw</strong>
        <br />
        <a href="https://github.com/kelliekaw">@kelliekaw</a>
      </td>
      <td>
        <ul>
          <li>Constructed EDA question 4</li>
          <li>Wrote structure of the data</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td align="center">
        <img src="https://github.com/JohnathanTantanan.png" width="50" height="50" style="border-radius: 50%;" />
      </td>
      <td>
        <strong>Lance Xavier Lim</strong>
        <br />
        <a href="https://github.com/JohnathanTantanan">@JohnathanTantanan</a>
      </td>
      <td>
        <ul>
          <li>[indicate contributions here]</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td align="center">
        <img src="https://github.com/jstnsy.png" width="50" height="50" style="border-radius: 50%;" />
      </td>
      <td>
        <strong>Justin John Abraham Sy</strong>
        <br />
        <a href="https://github.com/jstnsy">@jstnsy</a>
      </td>
      <td>
        <ul>
          <li>[indicate contributions here]</li>
        </ul>
      </td>
    </tr>
  </tbody>
</table>

## I. Dataset

```
Note from CJ: Employ a first person, narrative tone--as if we're guiding readers along the notebook.
```

### Description (Justin)

>[write stuff]

### Data collection method (Justin)

>[write stuff]

### Structure (Kellie)

The dataset we are working with is a structured dataset, with each row representing an observation and each column representing an attribute or feature.

The dataset contains **$40171$ observations** with **$119$ attributes**. The attributes are as follows:

`W_REGN` indicates the **region** where the household is located.

`W_OID` refers to **other unique IDs**. These are provinces and districts in the Philippines.

`W_SHSN` is the **sample household serial number**, uniquely identifying each sampled household.

`W_HCN` is the **household control number**.

`URB` indicates whether the household is located in an **urban** or **rural** area.

`RSTR` denotes the **stratum** the household belongs to.

`PSU` represents the **PSU number**.

`BWEIGHT` refers to the **base weight**.

`RFACT` refers to the **final weight**.

`FSIZE` refers to the **family size** of the household.

`AGRI_SAL` is the household's **salaries/wages** from **agricultural activity**.

`NONAGRI_SAL` is the household's **salaries/wages** from **non-agricultural activity**.

`WAGES` is the household's **salaries/wages** from both **agricultural** and **non-agricultural activities**.

`NETSHARE` represents the **net share** of **crops, fruits and vegetables produced, fishing or livestock and poultry raised by other households**.

`CASH_ABROAD` represents the household's **cash receipts, supports, etc.** from **abroad**.

`CASH_DOMESTIC` represents the household's **cash receipts, supports, etc.** from **domestic sources**.

`RENTALS_REC` refers to the **rentals received** from non-agricultural land, buildings, spaces, or other properties.

`INTEREST` refers to the amount received, in cash or in kind, as **interest** from bank deposits and loans extended to other families.

`PENSION` refers to the amount received, in cash or in kind, as **pension and retirement, workmen's compensation, or social security benefits**.

`DIVIDENDS` refers to the amount received, in cash or in kind, as **dividends** from investments such as stocks and bonds.

`OTHER_SOURCE` refers to the amount received, cash or in kind, from **other sources of income not elsewhere classified**, such as royalties and income of members below 10 years old.

`NET_RECEIPT` represents the **total net receipts** from family sustenance activity.

`REGFT` represents the total received as **gifts**.

`NET_CFG` refers to the **net income** from **crop farming and gardening**.

`NET_LPR` refers to the **net income** from **livestock and poultry raising**.

`NET_FISH` refers to the **net income** from **fishing**.

`NET_FOR` refers to the **net income** from **forestry and hunting**.

`NET_RET` refers to the **net income** from **wholesale and retail**.

`NET_MFG` refers to the **net income** from **manufacturing**.

`NET_COM` refers to the **net income** from **community, social, recreational, and personal services**.

`NET_TRANS` refers to the **net income** from **transportation, storage, and communication services**.

`NET_MIN` refers to the **net income** from **mining and quarrying**.

`NET_CONS` refers to the **net income** from **construction**.

`NET_NEC` refers to the **net income** from **entrepreneurial activities not elsewhere classified**.

`EAINC` is the **total income** from **entrepreneurial activities**.

`TOINC` is the **total income**.

`LOSSES` represents the **losses** from **entrepreneurial activities**.

`T_BREAD` refers to the **total food expenditure** for **bread and cereals**.

`T_MEAT` refers to the **total food expenditure** for **meat**.

`T_FISH` refers to the **total food expenditure** for **fish and seafood**.

`T_MILK` refers to the **total food expenditure** for **milk, cheese, and eggs**.

`T_OIL` refers to the **total food expenditure** for **oil and fats**.

`T_FRUIT` refers to the **total food expenditure** for **fruits**.

`T_VEG` refers to the **total food expenditure** for **vegetables**.

`T_SUGAR` refers to the **total food expenditure** for **sugar, jam, honey, chocolate, and confectionery**.

`T_FOOD_NEC` refers to the **total food expenditure** for **other food not elsewhere classified**.

`T_COFFEE` refers to the **total food expenditure** for **coffee, cocoa, and tea**.

`T_MINERAL` refers to the **total food expenditure** for **mineral water, soft drinks, fruit juices, and vegetable juices**.

`T_ALCOHOL` refers to the **total food expenditure** for **alcoholic beverages**.

`T_TOBACCO` refers to the **total food expenditure** for **tobacco**.

`T_OTHER_VEG` refers to the **total food expenditure** for **other vegetable-based products**.

`T_FOOD_HOME` refers to the **total food expenditure** for food **consumed at home**.

`T_FOOD_OUTSIDE` refers to the **total food expenditure** for food regularly **consumed outside of home**, i.e., in restaurants, cafes, and canteens.

`T_FOOD` refers to the **total food expenditure**.

`T_CLOTH` refers to the **total expenditure** for **clothing and footwear**.

`T_FURNISHING` refers to the **total expenditure** for **furnishings, household equipment, and routine household maintenance**.

`T_HEALTH` refers to the **total expenditure** for **medical products** such as medicines, appliances, and equipment, or any outpatient and inpatient **medical services**.

`T_HOUSING_WATER` refers to the **total expenditure** for **housing, water, electricity, gas, and other fuels**.

`T_ACTRENT` represents the **actual house rent**. It is the actual payment for the house or lot.

`T_RENTVAL` is the house rent or **rental value**.

`T_IMPUTED_RENT` is the **imputed house rental value**.

`T_BIMPUTED_RENT` is the **imputed housing benefit rental value**.

`T_TRANSPORT` refers to the **total expenditure** for **transportation**. This includes purchase of vehicles, operation, maintenance, and repair of personal transport equipment, and services.

`T_COMMUNICATION` refers to the **total expenditure** for **communication**.

`T_RECREATION` refers to the **total expenditure** for **recreation**. This includes all expenses incurred in acquiring equipment or items.

`T_EDUCATION` refers to the **total expenditure** for **education**. This includes tuition, allowances, and other school fees and contributions.

`T_MISCELLANEOUS` refers to the **total expenditure** for **miscellaneous goods and services** during the past month. This includes salons, barbershops, products for personal hygiene, and beauty products.

`T_OTHER_EXPENDITURE` refers to the **total other expenditure**. This includes premiums for insurance, interest payments, losses due to fire or theft, and legal and membership fees.

`T_OTHER_DISBURSEMENT` refers to the **total other disbursements**. These are non-family expenditures, including purchase or amortization of real property, payments of cash loans, installments for appliances or personal transport before 2012, loans granted to persons outside the household, amounts deposited in banks or investments, and major repair or construction of a house.

`T_NFOOD` refers to the **total non-food expenditure**.

`T_TOTEX` represents the **total expenditure**.

`T_TOTDIS` represents the **total disbursements**.

`T_OTHREC` refers to the **total other receipts**. These are non-income receipts, including the value at cost of real and personal property sold, loans from outside the household, payments for loans granted to others, and withdrawals from savings or business equity. Also included are profits from the sale of stocks and bonds, back pay and proceeds from insurance, net winnings from gambling, sweepstakes, and lotteries, and inheritance.

`T_TOREC` refers to the **total receipts**.

`FOOD_ACCOM_SRVC` represents the **accommodation services**.

`SEX` is the household head's **sex**.

`AGE` is the household head's **age**.

`MS` is the household head's **marital status**.

`HGC` is the household head's **highest grade completed**.

`JOB` is the household head's **job or business indicator** during the past six months.

`OCCUP` is the household head's **primary occupation** during the past six months.

`KB` is the household head's **kind of business or industry** during the past six months.

`CW` is the household head's **class of worker** during the past six months.

`HHTYPE` represents the **type of household**.

`MEMBERS` is the **number of family members**.

`AGELESS5` is the number of family members **below 5 years old**.

`AGE5_17` is the number of family members **from 5 to 17 years old**.

`EMPLOYED_PAY` is the number of family members **employed for pay** during the past six months.

`EMPLOYED_PROF` is the number of family members **employed for profit** during the past six months.

`SPOUSE_EMP` represents whether the spouse **has a job or business** during the past six months.

`BLDG_TYPE` refers to the **type of building** of the house.

`ROOF` refers to the **type of roof** of the house.

`WALLS` refers to the **type of walls** of the house.

`TENURE` refers to the **tenure status** of the housing unit and lot occupied by the family.

`HSE_ALTERTN` represents whether there were any **alterations or additions** to the house or other **major renovations** done during the past six months.

`TOILET` represents the kind of **toilet facilities** used by the family in the house.

`ELECTRIC` represents the **electricity indicator** in the building or house.

`WATER` refers to the **main source of water supply** of the family.

`DISTANCE` is the **distance** of the house **from the water source**.

`RADIO_QTY` refers to the number of **radios**.

`TV_QTY` refers to the number of **TVs**.

`CD_QTY` refers to the number of **CDs / VCDs / DVDs**.

`STEREO_QTY` refers to the number of **component / stereo sets**.

`REF_QTY` refers to the number of **refrigerators / freezers**.

`WASH_QTY` refers to the number of **washing machines**.

`AIRCON_QTY` refers to the number of **air conditioners**.

`CAR_QTY` refers to the number of **cars, jeeps, and vans**.

`LANDLINE_QTY` refers to the number of **landline / wireless telephones**.

`CELLPHONE_QTY` refers to the number of **cellular phones**.

`PC_QTY` refers to the number of **personal computers**.

`OVEN_QTY` refers to the number of **stoves with oven / gas range**.

`MOTOR_BANCA_QTY` refers to the number of **motorized boats**.

`MOTORCYCLE_QTY` refers to the number of **motorcycles / tricycles**.

`POP_ADJ` is the **population adjustment**.

`PCINC` represents the **per capita income**.

`NATPC` represents the **national per capita income decile**.

`NATDC` represents the **national income decile**.

`REGDC` represents the **regional income decile**.

`REGPC` represents the **regional per capita income decile**.

## II. Data Cleaning and Preprocessing (Lance and CJ)

Now that we have a good understanding of the dataset and how it was collected, we can proceed with cleaning and preprocessing it.

Cleaning the data is crucial to avoid errors or unexpected results later on, which may result from data that is inconsistent, incorrect, missing, etc.

First, let us import all the Python libraries and modules which we will be using throughout the notebook. Brief descriptions of the purpose of each library/module are indicated as comments.

In [None]:
import matplotlib.pyplot as plt     # plotting and figure customization
import numpy as np                  # numerical arrays and math functions
import pandas as pd                 # loading and manipulating tabular data
import seaborn as sns               # statistical visualizations built on matplotlib

Next, we load the dataset from a `.csv` file. The **pandas** library is ideal for this as it is optimized for handling tabular data like that from the survey.

In [None]:
df = pd.read_csv('data/FIES_PUF_2012_Vol_1.csv')

It's good practice to view high-level information about a dataset when looking at it for the first time. `df.info()` allows us to do so.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40171 entries, 0 to 40170
Columns: 119 entries, W_REGN to REGPC
dtypes: float64(5), int64(92), object(22)
memory usage: 36.5+ MB


From `df.info()`, we learn that the dataset indeed contains $40171$ entries or rows or **observations** and $119$ columns or **attributes**.

We also learn that:
- $5$ attributes are of datatype float64
- $92$ attributes are of datatype int64
- $22$ attributes are of datatype object
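
If we prefer not to read these counts off the `df.info()` output by hand, the same breakdown can be computed directly. The sketch below uses a small made-up DataFrame (`toy`, which is an assumption for illustration and not the FIES data) to show the call:

```python
import pandas as pd

# Hypothetical toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({
    'W_REGN': [14, 14, 41],                 # integer-coded attribute
    'TOINC': [120000, 95000, 310000],       # integer-valued attribute
    'PCINC': [30000.0, 19000.0, 62000.0],   # float-valued attribute
    'OCCUP': ['6111', ' ', '1314'],         # attribute stored as text
})

# Count how many columns share each data type
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```

Running `df.dtypes.value_counts()` on the actual dataset should give the same three counts listed above.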

It also helps to look at some sample observations to see how the data is structured and encoded. `df.head()` allows us to do so.

In [None]:
df.head()

From `df.head()`, we learn that the dataset's attribute names follow some coding scheme which does not clearly convey their meaning. This is not a problem, as we can refer to the **metadata dictionary** provided along with the dataset. Short descriptions of each attribute can also be seen in [Section I](#structure-kellie).

<br>

Next, we will drop unnecessary attributes. This step depends on our [research question](#research-question) and [EDA questions](#eda-questions), as they determine which variables are needed and which ones are irrelevant given the scope of our EDA.

>**Note:**
>From this point onward, the authors used the [Data Wrangler](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.datawrangler) extension in Visual Studio Code to make data cleaning easier.

>[dropping columns explanation]

In [None]:
# Drop columns: 'W_OID', 'W_SHSN' and 80 other columns
df = df.drop(columns=['W_OID', 'W_SHSN', 'W_HCN', 'RSTR', 'PSU', 'RFACT', 'BWEIGHT', 'FSIZE', 'AGRI_SAL', 'NONAGRI_SAL', 'WAGES', 'NETSHARE', 'CASH_ABROAD', 'CASH_DOMESTIC', 'RENTALS_REC', 'INTEREST', 'PENSION', 'DIVIDENDS', 'OTHER_SOURCE', 'NET_RECEIPT', 'REGFT', 'NET_LPR', 'NET_CFG', 'NET_FISH', 'NET_FOR', 'NET_RET', 'NET_MFG', 'NET_COM', 'NET_TRANS', 'NET_MIN', 'NET_CONS', 'NET_NEC', 'EAINC', 'LOSSES', 'T_ACTRENT', 'T_RENTVAL', 'T_IMPUTED_RENT', 'T_BIMPUTED_RENT', 'T_OTHREC', 'T_TOREC', 'FOOD_ACCOM_SRVC', 'MS', 'JOB', 'OCCUP', 'KB', 'CW', 'HHTYPE', 'MEMBERS', 'AGELESS5', 'AGE5_17', 'EMPLOYED_PAY', 'EMPLOYED_PROF', 'SPOUSE_EMP', 'BLDG_TYPE', 'ROOF', 'WALLS', 'TENURE', 'HSE_ALTERTN', 'TOILET', 'ELECTRIC', 'WATER', 'DISTANCE', 'RADIO_QTY', 'TV_QTY', 'CD_QTY', 'STEREO_QTY', 'REF_QTY', 'WASH_QTY', 'AIRCON_QTY', 'CAR_QTY', 'LANDLINE_QTY', 'CELLPHONE_QTY', 'PC_QTY', 'OVEN_QTY', 'MOTOR_BANCA_QTY', 'MOTORCYCLE_QTY', 'POP_ADJ', 'PCINC', 'NATPC', 'NATDC', 'REGDC', 'REGPC']);

In [None]:
# TODO: For debugging!
df.head()

>[converting numerical values to categorical values explanation]

In [None]:

# Change column type to string for the categorical columns 'W_REGN', 'URB', and 'SEX'
df = df.astype({'W_REGN': 'string', 'URB': 'string', 'SEX': 'string'})

# Replace the region codes "41" and "42" with "4A" and "4B" in column 'W_REGN'
df['W_REGN'] = df['W_REGN'].replace({'41': '4A', '42': '4B'})

# Replace the codes "1" and "2" with "Urban" and "Rural" in column 'URB'
df['URB'] = df['URB'].replace({'1': 'Urban', '2': 'Rural'})

# Replace the codes "1" and "2" with "Male" and "Female" in column 'SEX'
df['SEX'] = df['SEX'].replace({'1': 'Male', '2': 'Female'})

In [None]:
# TODO: For debugging!
df.head()

>[dropping duplicate values explanation]

In [None]:
df.duplicated().sum()       # Number of duplicate observations

Fortunately, we have no duplicate observations.
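
For reference, here is a minimal sketch of what the check reports and how duplicates could be removed had there been any. The `toy` frame and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly
toy = pd.DataFrame({
    'W_REGN': ['13', '4A', '13'],
    'TOINC': [250000, 180000, 250000],
})

num_duplicates = toy.duplicated().sum()     # rows identical to an earlier row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(num_duplicates)       # number of repeated rows
print(len(deduplicated))    # number of rows left after dropping them
```

Keep in mind that `duplicated()` compares only the columns still present in the frame. Since the household identifiers were dropped earlier, two distinct households with identical values in every remaining column would also be counted as duplicates.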

>[checking for null values explanation]

In [None]:
num_na_per_attribute = df.isna().sum()
num_na_per_attribute = num_na_per_attribute.sort_values(ascending=False)

print(num_na_per_attribute)

From this, we learn that the dataset **does not have any missing values**. We can also choose to leave values of 0 for various expenditure categories as it can be the case that the household does not spend any amount for a particular category. We can only trust that the figures provided by each household are accurate to a significant extent and that the values are encoded by PSA without error.
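
To see how missing values differ from legitimate zeros, the sketch below counts both on a small made-up frame (`toy` and its values are assumptions for illustration, not survey data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy expenditures (not real FIES values)
toy = pd.DataFrame({
    'T_ALCOHOL': [0, 1500, 0, 300],
    'T_TOBACCO': [0, 0, 0, 1200],
    'T_BREAD': [42000, 38000, np.nan, 51000],
})

missing_per_column = toy.isna().sum()   # count of missing entries per column
zero_share = toy.eq(0).mean()           # fraction of rows equal to 0 per column

print(missing_per_column)
print(zero_share)
```

A high share of zeros in a category such as alcohol or tobacco is plausible on its own, which is why we keep those values rather than treating them as missing.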

However, to check for *possible* outliers and to satisfy early curiosity, we shall take the liberty to check the distribution of each **atomic** numerical attribute. We won't check aggregate numerical attributes anymore on the assumption that they are summations of a set of atomic numerical attributes.

For this, we can create a **boxplot** for each attribute, as it is a convenient, summarized way of checking how a group of numerical data may be distributed. Moreover, since all attributes of interest fall within the same range [$0$ to $10^9$], we can group them together in the same graph to give us a high-level comparison of the distribution of various atomic expenditure categories.

Lastly, we will need to apply **log transformation** to the values due to their very wide range.
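
As a small illustration of what the log transformation does to values spanning several orders of magnitude, consider the sketch below. The values are made up, and the zero is replaced with 1 in the same way as in the plotting code that follows:

```python
import numpy as np
import pandas as pd

# Hypothetical expenditure values spanning several orders of magnitude
values = pd.Series([0, 10, 1_000, 100_000, 10_000_000])

# log10 is undefined at 0, so zeros are replaced with 1 first (log10(1) = 0)
log_values = np.log10(values.replace(0, 1))
print(log_values.tolist())
```

Each step of 1 on the log scale corresponds to a tenfold increase in the original value, which is what lets very small and very large amounts share one axis. An alternative that avoids altering the zeros is `np.log1p`, which computes $\log(1 + x)$.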

In [None]:
numerical_cols = [      # atomic numerical attributes of interest
    'TOINC', 'T_BREAD', 'T_MEAT', 'T_FISH', 'T_MILK',
    'T_OIL', 'T_FRUIT', 'T_VEG', 'T_SUGAR', 'T_FOOD_NEC',
    'T_COFFEE', 'T_MINERAL', 'T_ALCOHOL', 'T_TOBACCO', 'T_OTHER_VEG',
    'T_FOOD_OUTSIDE', 'T_CLOTH', 'T_FURNISHING', 'T_HEALTH', 'T_HOUSING_WATER',
    'T_TRANSPORT', 'T_COMMUNICATION', 'T_RECREATION', 'T_EDUCATION', 'T_MISCELLANEOUS',
    'T_OTHER_EXPENDITURE', 'T_OTHER_DISBURSEMENT', 'T_NFOOD'
]

GROUP_SIZE = 10     # number of attributes per plot

for i in range(0, len(numerical_cols), GROUP_SIZE):
    subset = numerical_cols[i:i+GROUP_SIZE]
    df_subset = df[subset].replace(0, 1)    # We replace 0s with 1s since 0 can't be viewed on a log transformation graph

    sns.boxplot(data=df_subset)
    plt.yscale("log")
    plt.ylim(1/2, 1e9 * 2)      # manually set the limit of the y-axis from slightly below 10^0 (or 1) to slightly above 10^9 (encoding limit)
    plt.xticks(rotation=45)
    plt.title(f"Boxplots of Atomic Numerical Attributes (Group {i//GROUP_SIZE + 1})")
    plt.show()



From these charts alone, we can already draw several insights about household expenditures. For instance, the first thing that surprised me (CJ) is that, judging by the medians, a household in 2012 *may have* spent the most on the **bread and cereals** food category, followed by **fish** and then **meat**. Neither of the latter two came close, even though they were the food categories I expected to come out on top.

However, we only intend to check for possible outliers, so we shall not analyze any further. At first glance, it appears that there are lots of outliers for each atomic expenditure category, but that does not mean we can simply discard those observations. In fact, there isn't really any obvious 'extreme' outlier, as the outliers for each attribute are pretty spread out, hence the densely blackened areas. This *might* simply be indicative of a **significant disparity** in the higher expenditure amounts for each attribute.
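
For context, a boxplot with default settings draws a point individually when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule by hand to made-up values:

```python
import pandas as pd

# Hypothetical expenditure values with one very large entry
values = pd.Series([1200, 1500, 1800, 2100, 2400, 2700, 3000, 90000])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are the ones drawn individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = values[(values < lower_fence) | (values > upper_fence)]
print(flagged.tolist())
```

Because expenditure data tends to be heavily right-skewed, it is likely that many legitimate high spenders fall above the upper fence, which is consistent with our decision not to treat these points as errors.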

Hence, we will not remove any observations and can reasonably conclude the data cleaning process.

<br>

Now that we have cleaned the dataset, we can proceed to **preprocessing**, which entails applying necessary transformations (e.g., *feature engineering*, *encoding*, *normalization*, *standardization*) to prepare our data for [**exploratory data analysis (EDA)**](#iii-exploratory-data-analysis).

First, we will do **feature engineering**, which is the creation of new features based on existing ones.

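We want to create the following new features, which we will use for EDA:
- **PROP_FOOD_OUTSIDE** = T_FOOD_OUTSIDE / T_FOOD (the proportion of food expenditure spent on food consumed outside the home)
- **T_VICES** = T_ALCOHOL + T_TOBACCO
- **T_HOME** = T_FURNISHING + T_HOUSING_WATER
- **AGE_GROUP**: age group of the head of the household, binned from AGE
- **HGC**: education level of the head of the household, recoded from the numeric HGC codes into descriptive labels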

<br>

>**Note:**
>Technically, we can choose not to create new features and instead compute said values on the fly as we need them (as pandas makes it easy to perform vectorized operations anyway). However, creating new features based on values we may need in the future makes things easier for us in the long term and saves time from having to recompute them, especially when they are needed in multiple places.

In [None]:
df['PROP_FOOD_OUTSIDE'] = df['T_FOOD_OUTSIDE'] / df['T_FOOD']
df['T_VICES'] = df['T_ALCOHOL'] + df['T_TOBACCO']
df['T_HOME'] = df['T_FURNISHING'] + df['T_HOUSING_WATER']
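
One thing to watch for with ratio features is a zero denominator. We have not checked whether any household reports a `T_FOOD` of 0, so the following is only a precautionary sketch on made-up data showing one way to guard against it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last household reports no food expenditure at all
toy = pd.DataFrame({
    'T_FOOD_OUTSIDE': [2000, 0, 15000, 0],
    'T_FOOD': [40000, 25000, 60000, 0],
})

# Treat a zero denominator as missing so the ratio becomes NaN instead of 0/0
safe_total = toy['T_FOOD'].replace(0, np.nan)
toy['PROP_FOOD_OUTSIDE'] = toy['T_FOOD_OUTSIDE'] / safe_total
print(toy)
```

Rows with a missing proportion can then be excluded explicitly when plotting or computing correlations.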

In [None]:
age_bins = [0, 29, 39, 49, 59, 120]
age_labels = ['Under 30', '30–39', '40–49', '50–59', '60+']

df['AGE_GROUP'] = pd.cut(df['AGE'], bins=age_bins, labels=age_labels)
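
To confirm that the bin edges behave as the labels suggest, we can run `pd.cut` on a few made-up ages that sit on either side of each boundary:

```python
import pandas as pd

age_bins = [0, 29, 39, 49, 59, 120]
age_labels = ['Under 30', '30–39', '40–49', '50–59', '60+']

# Hypothetical ages chosen to sit on either side of each bin edge
ages = pd.Series([29, 30, 39, 40, 59, 60])

# With the default right=True, each bin includes its right edge: (0, 29], (29, 39], ...
groups = pd.cut(ages, bins=age_bins, labels=age_labels)
print(groups.astype(str).tolist())
# → ['Under 30', '30–39', '30–39', '40–49', '50–59', '60+']
```

Note that any age outside the interval $(0, 120]$ would fall outside every bin and come back as a missing value.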

In [None]:
# The 'HGC - highest grade completed of the head of the family' codes are taken from the metadata dictionary.
def convert_hgc_code_to_string(code):
    exact_matches = {
        0: 'No Grade Completed',
        10: 'Preschool',
        280: 'Elementary Graduate',
        350: 'High School Graduate',
        900: 'Post Baccalaureate'
    }

    if code in exact_matches:
        return exact_matches[code]
    elif 210 <= code <= 260:
        return 'Elementary Undergraduate'
    elif 310 <= code <= 330:
        return 'High School Undergraduate'
    elif 410 <= code <= 420:
        return 'Post Secondary'
    elif 501 <= code <= 589:
        return 'Post Secondary / Technical Vocational Graduate'
    elif 810 <= code <= 840:
        return 'College Undergraduate'
    elif 601 <= code <= 689:
        return 'College Graduate'
    else:
        return 'N/A'

df['HGC'] = df['HGC'].apply(convert_hgc_code_to_string)

With all these changes, we may want to take another high-level view of our cleaned and preprocessed dataset.

In [None]:
df.info()

In [None]:
df.head()

Looking great! We can now proceed with EDA.

## III. Exploratory Data Analysis

>[write stuff]

### Research Question

#### How are socioeconomic and demographic factors associated with the expenditure patterns of Filipino households in 2012?

>[write stuff]

### EDA Questions

>[write stuff]

#### 1. How does the proportion of spending across major expenditure categories vary based on various socioeconomic and demographic factors?

*Socioeconomic* factors
- Income bracket of the household
- Education level of the household head

*Demographic* factors
- Region of the household
- Age of the household head
- Sex of the household head
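
One possible way to approach this question is to group households by a factor and compare each group's spending shares. The sketch below is only an illustration on made-up data, with `URB` standing in for any of the factors above and three categories standing in for the full list:

```python
import pandas as pd

# Hypothetical toy data (values are made up, not taken from the survey)
toy = pd.DataFrame({
    'URB': ['Urban', 'Urban', 'Rural', 'Rural'],
    'T_FOOD': [60000, 40000, 30000, 50000],
    'T_HOME': [30000, 50000, 10000, 10000],
    'T_TRANSPORT': [10000, 10000, 10000, 40000],
})
categories = ['T_FOOD', 'T_HOME', 'T_TRANSPORT']

# Sum spending per group, then divide each row by its total to get shares
group_totals = toy.groupby('URB')[categories].sum()
group_shares = group_totals.div(group_totals.sum(axis=1), axis=0)
print(group_shares.round(2))
```

This computes each group's share of its *aggregate* spending. Averaging per-household shares instead is an equally valid choice that gives every household the same weight regardless of how much it spends.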

In [None]:
# TODO: Will work on this soon - CJ


#### 2. Which regions spend the highest proportion of their food expenditure on various food categories?

>[write stuff]

#### 3. Is there a correlation between total household income and the proportion of food expenditure spent on food consumed outside the home?

>[write stuff]

First, let us define a generalized function that allows us to generate and display a **scatterplot** for correlation.

In [None]:
def generate_scatter(data, x, y, log_x=False, log_y=False, title=None):
    sns.scatterplot(data=data, x=x, y=y)

    if log_x:
        plt.xscale('log')
    if log_y:
        plt.yscale('log')

    plt.title(title or f"{x} vs. {y}")
    plt.xlabel(x)
    plt.ylabel(y)
    plt.show()

In [None]:
generate_scatter(df, 'TOINC', 'PROP_FOOD_OUTSIDE', log_x=True)

At first glance, the scatterplot does not show a meaningful pattern.
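
When a scatterplot is hard to read, a correlation coefficient can complement it. The sketch below, on made-up data, computes a Pearson coefficient against log-transformed income and a Spearman coefficient obtained as the Pearson correlation of the ranks:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not real survey values)
toy = pd.DataFrame({
    'TOINC': [80_000, 150_000, 300_000, 900_000, 2_500_000],
    'PROP_FOOD_OUTSIDE': [0.02, 0.05, 0.04, 0.15, 0.30],
})

# Pearson correlation on log-transformed income (linear association)
pearson_r = np.log10(toy['TOINC']).corr(toy['PROP_FOOD_OUTSIDE'])

# Spearman correlation computed as the Pearson correlation of the ranks
# (monotonic association, insensitive to the scale of either variable)
spearman_rho = toy['TOINC'].rank().corr(toy['PROP_FOOD_OUTSIDE'].rank())

print(round(pearson_r, 3), round(spearman_rho, 3))
```

Spearman's coefficient may be the more suitable of the two here, since income is heavily skewed and the relationship need not be linear.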

#### 4. How does housing expenditure differ between urban and rural households?

>[write stuff]

#### 5. Is there a correlation between total household income and any of the major expenditure categories?

>[write stuff]

In [None]:
# TODO: Generate log-log scatterplots for each major expenditure category.

major_expenditure_categories = [
    'T_FOOD',
]

In [None]:
# Currently hard-coded. Will revise soon.
generate_scatter(df, 'TOINC', 'T_FOOD', log_x=True, log_y=True)
generate_scatter(df, 'TOINC', 'T_CLOTH', log_x=True, log_y=True)
generate_scatter(df, 'TOINC', 'T_HOME', log_x=True, log_y=True)
generate_scatter(df, 'TOINC', 'T_TRANSPORT', log_x=True, log_y=True)
generate_scatter(df, 'TOINC', 'T_COMMUNICATION', log_x=True, log_y=True)
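
The hard-coded calls above could eventually be replaced by a loop over the list of categories. As a sketch of that idea on made-up data, the loop below computes a log-log Pearson correlation per category; the same loop body could call `generate_scatter` instead:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not real survey values)
toy = pd.DataFrame({
    'TOINC': [100_000, 200_000, 400_000, 800_000],
    'T_FOOD': [50_000, 80_000, 120_000, 200_000],
    'T_CLOTH': [0, 2_000, 5_000, 20_000],
})
categories = ['T_FOOD', 'T_CLOTH']

log_income = np.log10(toy['TOINC'])
log_log_corr = {}
for category in categories:
    # Keep only positive amounts, since log10 is undefined at 0
    positive = toy[category] > 0
    log_spending = np.log10(toy.loc[positive, category])
    log_log_corr[category] = log_income[positive].corr(log_spending)

print(pd.Series(log_log_corr).round(3))
```

Dropping zero amounts before taking logs is one choice among several; it means each coefficient describes only the households that spent something in that category.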

## IV. Data Mining

To be continued during phase 2.

<br>

## V. Statistical Inference

To be continued during phase 2.

<br>

## VI. Insights and Conclusions

To be continued during phase 2.

<br>

## Sources and Citations

During the preparation of this work, the authors used [NAME TOOL/SERVICE]
for the following purposes:

- [purposes]

After using this tool, the authors reviewed and edited the content as needed and take
full responsibility for the content of the publication.