# CSMODEL Machine Project

This Jupyter notebook presents a comprehensive analysis of the **Family Income and Expenditure Survey 2012 Vol. 1** dataset from the Philippine Statistics Authority (PSA).

The goal of the project is to explore how **socioeconomic factors** may be associated with the **expenditure patterns** of Filipino households in 2012.

**Research Question**
1. How are socioeconomic factors associated with the expenditure patterns of Filipino households in 2012?

**Exploratory Data Analysis (EDA) Questions**
1. How does the proportion of spending across **major expenditure categories** (e.g., food, vices) vary based on **socioeconomic factors** (e.g., region, income bracket)?
2. Which regions spend the highest proportion of their **food expenditure** on various **food categories** (e.g., vegetables, meats)?
3. Is there a correlation between **total household income** and the proportion of food expenditure spent on **food consumed outside the home**?
4. How does **housing expenditure** differ between **urban** and **rural** households?
5. Is there a correlation between **total household income** and **education-related expenditures**?

## Authors

The following students of De La Salle University - Manila, Philippines collaborated on this project:

<table>
  <thead>
    <tr>
      <th>Profile</th>
      <th>Author</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td align="center">
        <img src="https://github.com/qu1r0ra.png" width="50" height="50" style="border-radius: 50%;" />
      </td>
      <td>
        <strong>Christian Joseph Bunyi</strong>
        <br />
        <a href="https://github.com/qu1r0ra">@qu1r0ra</a>
      </td>
    </tr>
    <tr>
      <td align="center">
        <img src="https://github.com/kelliekaw.png" width="50" height="50" style="border-radius: 50%;" />
      </td>
      <td>
        <strong>Kellie Kaw</strong>
        <br />
        <a href="https://github.com/kelliekaw">@kelliekaw</a>
      </td>
    </tr>
    <tr>
      <td align="center">
        <img src="https://github.com/JohnathanTantanan.png" width="50" height="50" style="border-radius: 50%;" />
      </td>
      <td>
        <strong>Lance Xavier Lim</strong>
        <br />
        <a href="https://github.com/JohnathanTantanan">@JohnathanTantanan</a>
      </td>
    </tr>
    <tr>
      <td align="center">
        <img src="https://github.com/jstnsy.png" width="50" height="50" style="border-radius: 50%;" />
      </td>
      <td>
        <strong>Justin John Abraham Sy</strong>
        <br />
        <a href="https://github.com/jstnsy">@jstnsy</a>
      </td>
    </tr>
  </tbody>
</table>

## I. Dataset

```
Note from CJ: Employ a first person, narrative tone--as if we're guiding a reader along the notebook.
```

### Description (Justin)

[write stuff]

### Data collection method (Justin)

[write stuff]

### Structure (Kellie)

[write stuff]

## II. Data Cleaning and Preprocessing (Lance and CJ)

Now that we have a good understanding of the dataset and how it was collected, we can proceed with cleaning and preprocessing it.

Cleaning matters because inconsistent, incorrect, or missing data can cause errors or misleading results later in the analysis.
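
As a rough preview of what such checks can look like, here is a minimal sketch on a small made-up table. The `toy` DataFrame, its column names, and its values are assumptions for illustration only; they are not taken from the survey.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value, one duplicated row, and one implausible value.
toy = pd.DataFrame({
    "household_id": [1, 2, 2, 3, 4],
    "income": [1000.0, 2000.0, 2000.0, np.nan, -50.0],
})

# Missing data: count the missing entries in each column.
missing_per_column = toy.isna().sum()

# Inconsistent data: count rows that exactly repeat an earlier row.
duplicate_rows = int(toy.duplicated().sum())

# Incorrect data: count values that violate a simple domain rule
# (here we assume income should not be negative).
negative_income = int((toy["income"] < 0).sum())

print(missing_per_column)
print(duplicate_rows, negative_income)
```

Which rules make sense depends on the attribute, so checks like the last one should follow the metadata dictionary rather than intuition alone.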

First, let us import all the Python libraries and modules which we will be using throughout the notebook. Brief descriptions of the purpose of each library/module are indicated as comments.

In [1]:
import numpy as np          # numerical arrays and math functions
import pandas as pd         # loading and manipulating tabular data
...

Ellipsis

Next, we load the dataset from a `.csv` file. The **pandas** library is ideal for this, as it is optimized for handling tabular data such as the survey's.

In [3]:
df = pd.read_csv('data/FIES_PUF_2012_Vol_1.csv')

It's good practice to view high-level information about a dataset when looking at it for the first time. `df.info()` lets us do so.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40171 entries, 0 to 40170
Columns: 119 entries, W_REGN to REGPC
dtypes: float64(5), int64(92), object(22)
memory usage: 36.5+ MB


From `df.info()`, we learn that the dataset contains $40171$ rows, or **observations**, and $119$ columns, or **attributes**.

We can also learn that:
- $5$ attributes are of datatype float64
- $92$ attributes are of datatype int64
- $22$ attributes are of datatype object
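
To see which attributes fall under each datatype, one option is `DataFrame.select_dtypes`. The sketch below uses a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not the survey data); the same calls should apply to `df`.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are invented for illustration.
toy = pd.DataFrame({
    "W_REGN": [14, 14, 41],                                      # integers
    "BWEIGHT": [138.25, 138.25, 90.5],                           # floats
    "OCCUP": pd.Series(["6111", " ", "9211"], dtype="object"),   # strings stored as object
})

# Count how many columns there are per datatype, similar to the summary in df.info().
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# List the object-typed columns, which usually deserve a closer look
# because they may mix numbers, text, and blank strings.
object_cols = toy.select_dtypes(include="object").columns.tolist()
print(object_cols)
```

Applied to the real dataset, the second call would give us the names of the $22$ object-typed attributes to inspect during cleaning.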

It also helps to look at some sample observations to see how the data is structured and encoded. `df.head()` lets us do so.

In [5]:
df.head()

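| | W_REGN | W_OID | W_SHSN | W_HCN | URB | RSTR | PSU | BWEIGHT | RFACT | FSIZE | ... | PC_QTY | OVEN_QTY | MOTOR_BANCA_QTY | MOTORCYCLE_QTY | POP_ADJ | PCINC | NATPC | NATDC | REGDC | REGPC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14 | 101001000 | 2 | 25 | 2 | 21100 | 415052 | 138.25 | 200.6576 | 3.0 | ... | 1.0 | 1.0 | | | 0.946172 | 108417.0 | 9 | 8 | 8 | 9 |
| 1 | 14 | 101001000 | 3 | 43 | 2 | 21100 | 415052 | 138.25 | 200.6576 | 12.5 | ... | | 1.0 | | 1.0 | 0.946172 | 30631.6 | 5 | 9 | 9 | 4 |
| 2 | 14 | 101001000 | 4 | 62 | 2 | 21100 | 415052 | 138.25 | 200.6576 | 2.0 | ... | | 1.0 | | | 0.946172 | 86992.5 | 9 | 6 | 6 | 8 |
| 3 | 14 | 101001000 | 5 | 79 | 2 | 21100 | 415052 | 138.25 | 200.6576 | 4.0 | ... | | 1.0 | | | 0.946172 | 43325.75 | 6 | 6 | 6 | 6 |
| 4 | 14 | 101001000 | 10 | 165 | 2 | 21100 | 415052 | 138.25 | 200.6576 | 5.0 | ... | | | | 1.0 | 0.946172 | 37481.8 | 6 | 6 | 6 | 5 |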


From `df.head()`, we can see that the attribute names follow a coding scheme that does not immediately convey their meaning. This is not a problem, as we can refer to the **metadata dictionary** provided with the dataset.
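
If we later want more readable names, one option is to turn the relevant part of the metadata dictionary into a Python `dict` and pass it to `DataFrame.rename`. The sketch below is only an illustration: the `column_labels` mapping and its readable names are our own assumptions and should be checked against the official dictionary.

```python
import pandas as pd

# Hypothetical excerpt of the metadata dictionary: coded name -> readable name.
# The readable names are assumptions for illustration.
column_labels = {
    "W_REGN": "region",
    "URB": "urban_rural",
    "FSIZE": "family_size",
}

# Toy rows shaped like a few of the columns shown by df.head().
toy = pd.DataFrame({"W_REGN": [14, 14], "URB": [2, 2], "FSIZE": [3.0, 12.5]})

# rename() returns a new DataFrame; columns missing from the mapping keep their names.
readable = toy.rename(columns=column_labels)
print(readable.columns.tolist())
```

Keeping the original coded names in `df` and renaming only for display is one way to stay consistent with the metadata dictionary.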

>Continue here!!

In [None]:
# continue here!!

Now that we have cleaned the dataset, we preprocess it by applying the appropriate transformations (e.g., *encoding*, *normalization*, *standardization*). This prepares it for **exploratory data analysis (EDA)**.
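
To make these three terms concrete, here is a minimal sketch on made-up data. The `toy` DataFrame and its column names are assumptions for illustration; this is not necessarily the set of transformations the notebook ends up applying.

```python
import pandas as pd

# Toy data; column names and values are invented for illustration.
toy = pd.DataFrame({
    "income": [100.0, 200.0, 300.0, 400.0],
    "area": ["Urban", "Rural", "Rural", "Urban"],
})

# Normalization (min-max): rescale values to the range [0, 1].
income_range = toy["income"].max() - toy["income"].min()
toy["income_norm"] = (toy["income"] - toy["income"].min()) / income_range

# Standardization (z-score): center on the mean and scale by the standard deviation.
# ddof=0 uses the population standard deviation.
toy["income_std"] = (toy["income"] - toy["income"].mean()) / toy["income"].std(ddof=0)

# Encoding (one-hot): turn each category into its own indicator column.
encoded = pd.get_dummies(toy, columns=["area"])
print(encoded)
```

Normalization and standardization both put numeric attributes on comparable scales, while encoding makes categorical attributes usable in numeric computations.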

In [6]:
# Drop columns: 'W_OID', 'W_SHSN' and 81 other columns
df = df.drop(columns=['W_OID', 'W_SHSN', 'W_HCN', 'RSTR', 'PSU', 'RFACT', 'BWEIGHT', 'FSIZE', 'AGRI_SAL', 'NONAGRI_SAL', 'WAGES', 'NETSHARE', 'CASH_ABROAD', 'CASH_DOMESTIC', 'RENTALS_REC', 'INTEREST', 'PENSION', 'DIVIDENDS', 'OTHER_SOURCE', 'NET_RECEIPT', 'REGFT', 'NET_LPR', 'NET_CFG', 'NET_FISH', 'NET_FOR', 'NET_RET', 'NET_MFG', 'NET_COM', 'NET_TRANS', 'NET_MIN', 'NET_CONS', 'NET_NEC', 'EAINC', 'LOSSES', 'T_ACTRENT', 'T_RENTVAL', 'T_IMPUTED_RENT', 'T_BIMPUTED_RENT', 'T_OTHREC', 'T_TOREC', 'FOOD_ACCOM_SRVC', 'MS', 'AGE', 'JOB', 'OCCUP', 'KB', 'CW', 'HHTYPE', 'MEMBERS', 'AGELESS5', 'AGE5_17', 'EMPLOYED_PAY', 'EMPLOYED_PROF', 'SPOUSE_EMP', 'BLDG_TYPE', 'ROOF', 'WALLS', 'TENURE', 'HSE_ALTERTN', 'TOILET', 'ELECTRIC', 'WATER', 'DISTANCE', 'RADIO_QTY', 'TV_QTY', 'CD_QTY', 'STEREO_QTY', 'REF_QTY', 'WASH_QTY', 'AIRCON_QTY', 'CAR_QTY', 'LANDLINE_QTY', 'CELLPHONE_QTY', 'PC_QTY', 'OVEN_QTY', 'MOTOR_BANCA_QTY', 'MOTORCYCLE_QTY', 'POP_ADJ', 'PCINC', 'NATPC', 'NATDC', 'REGDC', 'REGPC'])

# Change column type to string for columns: 'W_REGN', 'URB' and 1 other column
df = df.astype({'W_REGN': 'string', 'URB': 'string', 'SEX': 'string'})

# Replace "41" with "4A" and "42" with "4B" in column: 'W_REGN'
df['W_REGN'] = df['W_REGN'].replace({'41': '4A', '42': '4B'})

# Replace "1" with "Urban" and "2" with "Rural" in column: 'URB'
df['URB'] = df['URB'].replace({'1': 'Urban', '2': 'Rural'})

# Replace "1" with "Male" and "2" with "Female" in column: 'SEX'
df['SEX'] = df['SEX'].replace({'1': 'Male', '2': 'Female'})

def convert_hgc_to_string(code):
    """Map a numeric HGC code to a descriptive education label; unrecognized codes map to 'N/A'."""
    if code == 0:
        return 'No Grade Completed'
    elif code == 10:
        return 'Preschool'
    elif 210 <= code <= 260:
        return 'Elementary Undergraduate'
    elif code == 280:
        return 'Elementary Graduate'
    elif 310 <= code <= 330:
        return 'High School Undergraduate'
    elif code == 350:
        return 'High School Graduate'
    elif 410 <= code <= 420:
        return 'Post Secondary'
    elif 501 <= code <= 589:
        return 'Post Secondary / Technical Vocational Graduate'
    elif 810 <= code <= 840:
        return 'College Undergraduate'
    elif 601 <= code <= 689:
        return 'College Graduate'
    elif code == 900: 
        return 'Post Baccalaureate'
    else:
        return 'N/A'

df['HGC'] = df['HGC'].apply(convert_hgc_to_string)
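
After recoding, it helps to confirm that no unexpected codes remain. The sketch below shows one way to do this on a made-up column (the `toy` frame and the stray value `"3"` are assumptions for illustration); the same pattern could be applied to `W_REGN`, `URB`, `SEX`, and `HGC`.

```python
import pandas as pd

# Toy column after recoding; "3" stands in for a code that was not recoded.
toy = pd.DataFrame({
    "URB": pd.Series(["Urban", "Rural", "Rural", "3"], dtype="string"),
})

# Frequency of each value, including missing ones.
counts = toy["URB"].value_counts(dropna=False)
print(counts)

# Any value outside the expected labels signals a code that slipped through.
expected = ["Urban", "Rural"]
unexpected = toy.loc[~toy["URB"].isin(expected), "URB"].unique().tolist()
print(unexpected)
```

An empty `unexpected` list would suggest that every code in the column was covered by the recoding step.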

## III. Exploratory Data Analysis (EDA)

### Research Question

#### How are socioeconomic factors associated with the expenditure patterns of Filipino households in 2012?

### EDA Questions

[write stuff]

#### 1. How does the proportion of spending across major expenditure categories vary based on socioeconomic factors (e.g., region, income bracket, family size, sex of the household head, and education level of the household head)?

[write stuff]
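
One possible way to approach this question is to convert each household's spending into shares of its total and then average those shares per group. The sketch below uses made-up data; the column names `region`, `food`, `housing`, and `other` are hypothetical placeholders, not the actual survey attribute names.

```python
import pandas as pd

# Toy households; names and amounts are hypothetical placeholders.
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "food": [50.0, 30.0, 20.0, 40.0],
    "housing": [30.0, 50.0, 60.0, 40.0],
    "other": [20.0, 20.0, 20.0, 20.0],
})
categories = ["food", "housing", "other"]

# Share of each category in a household's total spending (each row sums to 1).
shares = toy[categories].div(toy[categories].sum(axis=1), axis=0)

# Unweighted average share per region.
mean_shares = shares.groupby(toy["region"]).mean()
print(mean_shares)
```

Averaging household-level shares treats every household equally; dividing group totals instead would give more weight to households that spend more, so the choice should be stated alongside the results.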

#### 2. Which regions spend the highest proportion of their food expenditure on various food categories (e.g., vegetables, meats)?

[write stuff]
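
Given a table of average food shares per region, `DataFrame.idxmax` can report which region ranks highest for each food category. The sketch below assumes such a table already exists; the region labels, category names, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical mean shares of food expenditure per region (rows) and category (columns).
food_shares = pd.DataFrame(
    {"vegetables": [0.10, 0.15, 0.12], "meat": [0.20, 0.12, 0.18]},
    index=["Region A", "Region B", "Region C"],
)

# For each category (column), find the region (row label) with the largest share.
top_region = food_shares.idxmax()
print(top_region)
```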

#### 3. Is there a correlation between total household income and the proportion of food expenditure spent on food consumed outside the home?

[write stuff]
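
A minimal sketch of how such a correlation could be computed is shown below. The column names `total_income`, `food_total`, and `food_outside` are hypothetical placeholders, and the values are made up. We compute Pearson's coefficient for linear association and Spearman's coefficient (Pearson on ranks) for monotonic association.

```python
import pandas as pd

# Toy households; names and amounts are hypothetical placeholders.
toy = pd.DataFrame({
    "total_income": [100.0, 200.0, 300.0, 400.0, 500.0],
    "food_total": [50.0, 80.0, 100.0, 120.0, 125.0],
    "food_outside": [2.0, 8.0, 15.0, 24.0, 50.0],
})

# Proportion of food expenditure spent on food consumed outside the home.
toy["outside_share"] = toy["food_outside"] / toy["food_total"]

# Pearson correlation: strength of the linear relationship.
pearson_r = toy["total_income"].corr(toy["outside_share"])

# Spearman correlation: Pearson correlation of the ranks, which captures
# monotonic relationships and is less sensitive to extreme values.
spearman_r = toy["total_income"].rank().corr(toy["outside_share"].rank())

print(pearson_r, spearman_r)
```

Income data is often heavily skewed, so comparing both coefficients can indicate whether a few very high incomes drive the linear result.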

#### 4. How does housing expenditure differ between urban and rural households?

[write stuff]
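
One simple starting point is to compare summary statistics of housing expenditure across the two groups. The sketch below reuses the `URB` labels from our preprocessing step, but the `housing` column name and all values are made up for illustration.

```python
import pandas as pd

# Toy households; the 'housing' column name and amounts are hypothetical.
toy = pd.DataFrame({
    "URB": ["Urban", "Urban", "Rural", "Rural", "Rural"],
    "housing": [300.0, 500.0, 100.0, 200.0, 150.0],
})

# Number of households, mean, and median housing expenditure per group.
summary = toy.groupby("URB")["housing"].agg(["count", "mean", "median"])
print(summary)
```

Reporting the median next to the mean helps when a few households with very high spending pull the mean upward.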

#### 5. Is there a correlation between total household income and education-related expenditures?

[write stuff]

<br>

## IV. Data Mining

To be continued for phase 2.

<br>

## V. Statistical Inference

To be continued for phase 2.

<br>

## VI. Insights and Conclusions

To be continued for phase 2.

<br>

## Sources and Citations

During the preparation of this work, the authors used [NAME TOOL/SERVICE]
for the following purposes:

- [purposes]

After using this tool, the authors reviewed and edited the content as needed and take
full responsibility for the content of the publication.