# Homework 4

### Costa Rican Household Poverty Level Prediction (Kaggle Competition)

#### By: Spencer Wise

The goal of the Costa Rican Household Poverty Level Prediction contest is to develop a machine learning model that can predict the poverty level of households using both individual and household characteristics. Many social programs have a hard time making sure the right people are given the right amount of aid, and it’s especially tricky when a program focuses on the poorest segment of the population. The world’s poorest typically can’t provide the necessary income and expense records to prove that they qualify. Costa Rica is just one of many countries that face this same problem of assessing social need.

In this notebook, we will do our best to come up with a complete machine learning solution to this problem. First, we will get introduced to the problem, then explore the dataset, clean the dataset, work on feature engineering, try out multiple machine learning models, select a model, work to optimize the model, and finally, inspect the outputs of the model and draw conclusions.

### Data Overview and Problem

The data for this competition is provided in a training set file and a test set file: `train.csv` and `test.csv`. The training set has 9557 rows and 143 columns, and the testing set has 23856 rows and 142 columns. Each row represents one individual and each column is a feature, either unique to the individual, or for the household of that individual. The training set has one additional column, `Target`, which represents the poverty level on a scale from 1-4. A value of 1 is the most extreme poverty, while a value of 4 represents non vulnerable households. This is a multi-class classification machine learning problem with 4 classes.

### Objective

The objective is to predict poverty on a household level. We are given data on the individual level with each individual having unique features but also information about their household. In order to create a dataset for the task, we'll have to perform some aggregations of the individual data for each household. Moreover, we have to make a prediction for every individual in the test set, but "ONLY the heads of household are used in scoring" which means we want to predict poverty on a household basis.
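
As a rough illustration of the aggregation step described above, here is a minimal sketch on a tiny made-up table. The values and the output column names (`age_mean`, `age_max`, `escolari_max`, `n_members`) are assumptions for the example and are not taken from `train.csv`; only `idhogar`, `parentesco1`, `age`, and `escolari` are real column names from the dataset.

```python
import pandas as pd

# Toy individual-level data (made-up values, NOT from train.csv).
individuals = pd.DataFrame({
    "idhogar": ["h1", "h1", "h1", "h2", "h2"],
    "parentesco1": [1, 0, 0, 1, 0],
    "age": [40, 38, 10, 70, 65],
    "escolari": [11, 9, 4, 6, 6],
})

# Collapse the individual rows into one row per household.
household = individuals.groupby("idhogar").agg(
    age_mean=("age", "mean"),
    age_max=("age", "max"),
    escolari_max=("escolari", "max"),
    n_members=("age", "size"),
).reset_index()

# Attach the household-level features to the head-of-household rows,
# since only those rows are used in scoring.
heads = individuals[individuals["parentesco1"] == 1].merge(household, on="idhogar")

print(heads)
```

Which aggregations are actually useful is something to work out during feature engineering; the ones shown here are only placeholders.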

To begin, let's start by importing the necessary packages and reading in the data!

### Module Imports
We'll use a familiar stack of data science libraries: `pandas`, `numpy`, `matplotlib`, `seaborn`, and `sklearn` for modeling.

In [29]:
# Pandas for data loading, manipulation etc.
import pandas as pd

# Numeric functions
import numpy as np
from scipy import stats
from scipy.stats import norm
from collections import OrderedDict

# Plotting and visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting defaults
%matplotlib inline
plt.style.use('fivethirtyeight')

# Other Packages
import warnings
warnings.filterwarnings('ignore')

### Load Data

Here we are going to first load the training and test datasets into pandas dataframes.

In [3]:
# Loading datasets into separate pandas dataframes
train = pd.read_csv(r"C:\Users\Spencer\Dropbox\School\Fall 2018\Machine Learning\Projects\Homework_4\train.csv", encoding="latin1")
test = pd.read_csv(r"C:\Users\Spencer\Dropbox\School\Fall 2018\Machine Learning\Projects\Homework_4\test.csv", encoding="latin1")

In [8]:
# Print the shape of the test and training datasets
print(train.shape)
print(test.shape)

(9557, 143)
(23856, 142)


One thing to note right away is that the training dataset has one more column than the test dataset, because the training set includes our target value. Let's take a closer look by previewing our training data.
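
To confirm which column accounts for the difference, one option is to compare the column indexes directly. The sketch below uses two tiny stand-in frames (`train_demo` and `test_demo` are hypothetical names with made-up values); the same call should work on the real `train` and `test` dataframes.

```python
import pandas as pd

# Stand-in frames with made-up values; the real data has far more columns.
train_demo = pd.DataFrame({"Id": ["a"], "rooms": [3], "Target": [4]})
test_demo = pd.DataFrame({"Id": ["b"], "rooms": [5]})

# Columns present in the training frame but absent from the test frame.
extra_cols = train_demo.columns.difference(test_demo.columns)
print(list(extra_cols))
```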

In [16]:
# Preview the data
train.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,ID_279628684,190000.0,0,3,0,1,1,0,,0,...,100,1849,1,100,0,1.0,0.0,100.0,1849,4
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1,1.0,0,...,144,4489,1,144,0,1.0,64.0,144.0,4489,4
2,ID_68de51c94,,0,8,0,1,1,0,,0,...,121,8464,1,0,0,0.25,64.0,121.0,8464,4
3,ID_d671db89c,180000.0,0,5,0,1,1,1,1.0,0,...,81,289,16,121,4,1.777778,1.0,121.0,289,4
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1,1.0,0,...,121,1369,16,121,4,1.777778,1.0,121.0,1369,4


### Analyze the Target

The first thing that we want to do is identify and analyze the target. In this case, our target is the `Target` column (which represents the poverty level on a scale from 1-4) as it is the variable that we will be trying to predict.

In [19]:
# Descriptive summary of Target data
target = train['Target']
target.describe()

count    9557.000000
mean        3.302292
std         1.009565
min         1.000000
25%         3.000000
50%         4.000000
75%         4.000000
max         4.000000
Name: Target, dtype: float64

The next thing we'll want to do is get an idea of the distribution of the `Target` classes.
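
One simple way to look at the class distribution is `value_counts`. The sketch below runs on a short made-up series (`target_demo` is a hypothetical name, and its values are not the real labels); on the notebook's data the same calls would be applied to `train['Target']`.

```python
import pandas as pd

# Made-up labels for illustration only.
target_demo = pd.Series([4, 4, 4, 2, 3, 1, 4, 2], name="Target")

# Absolute count per class, ordered by class label.
counts = target_demo.value_counts().sort_index()

# Share of each class, useful for spotting class imbalance.
shares = target_demo.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```

If the classes turn out to be imbalanced, that is worth keeping in mind when choosing an evaluation metric and a validation scheme.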

## Learn the Data

Now that we've examined our target variable, we'll want to learn as much as we can about the data. The things that we will focus on are:

- Knowing which variables are the core variables in the dataset
- Knowing the rest of the variables in the dataset
- Knowing what is correlated
- Knowing if there are missing values

**Core Data**

Because the data contains 142 total variables, we'll focus on learning as much as we can about the core data fields outlined by the Inter-American Development Bank. The core variables are:

- `Id`: a unique identifier for each row.
- `Target`: the target is an ordinal variable indicating groups of income levels. 
    - 1 = extreme poverty 
    - 2 = moderate poverty 
    - 3 = vulnerable households 
    - 4 = non vulnerable households
- `idhogar`: this is a unique identifier for each household. This can be used to create household-wide features, etc. All rows in a given household will have a matching value for this identifier.
- `parentesco1`: indicates if this person is the head of the household.
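
Because scoring is based on heads of household, it can help to check how `idhogar`, `parentesco1`, and `Target` relate to each other. The sketch below uses a made-up table (`people` and its values are hypothetical) to show two sanity checks: households without a head, and households whose members carry different labels. Whether such cases occur in the real data is not shown here.

```python
import pandas as pd

# Made-up rows for illustration; h2 is deliberately inconsistent.
people = pd.DataFrame({
    "idhogar": ["h1", "h1", "h2", "h2", "h3"],
    "parentesco1": [1, 0, 0, 0, 1],
    "Target": [4, 4, 2, 3, 1],
})

# Households with no member flagged as head.
heads_per_household = people.groupby("idhogar")["parentesco1"].sum()
no_head = heads_per_household[heads_per_household == 0].index.tolist()

# Households whose members do not all share one Target value.
labels_per_household = people.groupby("idhogar")["Target"].nunique()
mixed_labels = labels_per_household[labels_per_household > 1].index.tolist()

# The rows that matter for scoring: one per household head.
heads = people[people["parentesco1"] == 1]

print(no_head, mixed_labels, len(heads))
```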

**Rest of Variables and Description**

- `v2a1`, Monthly rent payment
- `hacdor`, =1 Overcrowding by bedrooms
- `rooms`,  number of all rooms in the house
- `hacapo`, =1 Overcrowding by rooms
- `v14a`, =1 has bathroom in the household
- `refrig`, =1 if the household has refrigerator
- `v18q`, owns a tablet
- `v18q1`, number of tablets household owns
- `r4h1`, Males younger than 12 years of age
- `r4h2`, Males 12 years of age and older
- `r4h3`, Total males in the household
- `r4m1`, Females younger than 12 years of age
- `r4m2`, Females 12 years of age and older
- `r4m3`, Total females in the household
- `r4t1`, persons younger than 12 years of age
- `r4t2`, persons 12 years of age and older
- `r4t3`, Total persons in the household
- `tamhog`, size of the household
- `tamviv`, number of persons living in the household
- `escolari`, years of schooling
- `rez_esc`, Years behind in school
- `hhsize`, household size
- `paredblolad`, =1 if predominant material on the outside wall is block or brick
- `paredzocalo`, =1 if predominant material on the outside wall is socket (wood, zinc or asbestos)
- `paredpreb`, =1 if predominant material on the outside wall is prefabricated or cement
- `pareddes`, =1 if predominant material on the outside wall is waste material
- `paredmad`, =1 if predominant material on the outside wall is wood
- `paredzinc`, =1 if predominant material on the outside wall is zinc
- `paredfibras`, =1 if predominant material on the outside wall is natural fibers
- `paredother`, =1 if predominant material on the outside wall is other
- `pisomoscer`, =1 if predominant material on the floor is mosaic, ceramic, terrazzo
- `pisocemento`, =1 if predominant material on the floor is cement
- `pisoother`, =1 if predominant material on the floor is other
- `pisonatur`, =1 if predominant material on the floor is natural material
- `pisonotiene`, =1 if no floor at the household
- `pisomadera`, =1 if predominant material on the floor is wood
- `techozinc`, =1 if predominant material on the roof is metal foil or zinc
- `techoentrepiso`, =1 if predominant material on the roof is fiber cement, mezzanine
- `techocane`, =1 if predominant material on the roof is natural fibers
- `techootro`, =1 if predominant material on the roof is other
- `cielorazo`, =1 if the house has ceiling
- `abastaguadentro`, =1 if water provision inside the dwelling
- `abastaguafuera`, =1 if water provision outside the dwelling
- `abastaguano`, =1 if no water provision
- `public`, =1 electricity from CNFL, ICE, ESPH/JASEC
- `planpri`, =1 electricity from private plant
- `noelec`, =1 no electricity in the dwelling
- `coopele`, =1 electricity from cooperative
- `sanitario1`, =1 no toilet in the dwelling
- `sanitario2`, =1 toilet connected to sewer or cesspool
- `sanitario3`, =1 toilet connected to septic tank
- `sanitario5`, =1 toilet connected to black hole or latrine
- `sanitario6`, =1 toilet connected to other system
- `energcocinar1`, =1 no main source of energy used for cooking (no kitchen)
- `energcocinar2`, =1 main source of energy used for cooking electricity
- `energcocinar3`, =1 main source of energy used for cooking gas
- `energcocinar4`, =1 main source of energy used for cooking wood charcoal
- `elimbasu1`, =1 if rubbish disposal mainly by tanker truck
- `elimbasu2`, =1 if rubbish disposal mainly by botan hollow or buried
- `elimbasu3`, =1 if rubbish disposal mainly by burning
- `elimbasu4`, =1 if rubbish disposal mainly by throwing in an unoccupied space
- `elimbasu5`, =1 if rubbish disposal mainly by throwing in river, creek or sea
- `elimbasu6`, =1 if rubbish disposal mainly other
- `epared1`, =1 if walls are bad
- `epared2`, =1 if walls are regular
- `epared3`, =1 if walls are good
- `etecho1`, =1 if roof is bad
- `etecho2`, =1 if roof is regular
- `etecho3`, =1 if roof is good
- `eviv1`, =1 if floor is bad
- `eviv2`, =1 if floor is regular
- `eviv3`, =1 if floor is good
- `dis`, =1 if disabled person
- `male`, =1 if male
- `female`, =1 if female
- `estadocivil1`, =1 if less than 10 years old
- `estadocivil2`, =1 if free or coupled union
- `estadocivil3`, =1 if married
- `estadocivil4`, =1 if divorced
- `estadocivil5`, =1 if separated
- `estadocivil6`, =1 if widow/er
- `estadocivil7`, =1 if single
- `parentesco1`, =1 if household head
- `parentesco2`, =1 if spouse/partner
- `parentesco3`, =1 if son/daughter
- `parentesco4`, =1 if stepson/daughter
- `parentesco5`, =1 if son/daughter in law
- `parentesco6`, =1 if grandson/daughter
- `parentesco7`, =1 if mother/father
- `parentesco8`, =1 if father/mother in law
- `parentesco9`, =1 if brother/sister
- `parentesco10`, =1 if brother/sister in law
- `parentesco11`, =1 if other family member
- `parentesco12`, =1 if other non family member
- `idhogar`, Household level identifier
- `hogar_nin`, Number of children 0 to 19 in household
- `hogar_adul`, Number of adults in household
- `hogar_mayor`, # of individuals 65+ in the household
- `hogar_total`, # of total individuals in the household
- `dependency`, Dependency rate, calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)
- `edjefe`, years of education of male head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0
- `edjefa`, years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0
- `meaneduc`, average years of education for adults (18+)
- `instlevel1`, =1 no level of education
- `instlevel2`, =1 incomplete primary
- `instlevel3`, =1 complete primary
- `instlevel4`, =1 incomplete academic secondary level
- `instlevel5`, =1 complete academic secondary level
- `instlevel6`, =1 incomplete technical secondary level
- `instlevel7`, =1 complete technical secondary level
- `instlevel8`, =1 undergraduate and higher education
- `instlevel9`, =1 postgraduate higher education
- `bedrooms`, number of bedrooms
- `overcrowding`, # persons per room
- `tipovivi1`, =1 own and fully paid house
- `tipovivi2`, =1 own, paying in installments
- `tipovivi3`, =1 rented
- `tipovivi4`, =1 precarious
- `tipovivi5`, =1 other (assigned, borrowed)
- `computer`, =1 if the household has notebook or desktop computer
- `television`, =1 if the household has TV
- `mobilephone`, =1 if mobile phone
- `qmobilephone`, # of mobile phones
- `lugar1`, =1 region Central
- `lugar2`, =1 region Chorotega
- `lugar3`, =1 region Pacífico central
- `lugar4`, =1 region Brunca
- `lugar5`, =1 region Huetar Atlántica
- `lugar6`, =1 region Huetar Norte
- `area1`, =1 urban zone
- `area2`, =1 rural zone
- `age`, Age in years
- `SQBescolari`, escolari squared
- `SQBage`, age squared
- `SQBhogar_total`, hogar_total squared
- `SQBedjefe`, edjefe squared
- `SQBhogar_nin`, hogar_nin squared
- `SQBovercrowding`, overcrowding squared
- `SQBdependency`, dependency squared
- `SQBmeaned`, square of the mean years of education of adults (>=18) in the household
- `agesq`, Age squared
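
Two of the definitions above are easy to make concrete: the `dependency` rate and the `SQB*` squared columns. The sketch below works through both for one made-up household (the ages are hypothetical), following the formula given in the `dependency` description. It is only an illustration of the definitions, not a check of how the provided columns were actually computed.

```python
import pandas as pd

# Ages of the members of one made-up household.
ages = pd.Series([40, 38, 10, 70])

# Dependency rate as described in the list:
# (members younger than 19 or older than 64) / (members between 19 and 64).
dependents = ((ages < 19) | (ages > 64)).sum()
working_age = ((ages >= 19) & (ages <= 64)).sum()

# Note: a household with no working-age members would divide by zero
# and needs separate handling; that case is not covered here.
dependency = dependents / working_age

# The SQB* columns are plain squares of their base column, e.g. SQBage.
sqb_age = ages ** 2

print(dependency)
print(sqb_age.tolist())
```

Since the squared columns carry no information beyond their base columns, they are natural candidates to review when looking at correlated features.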

## Clean Data

### Outliers

Looking through the train and test sets, there is only one outlier: a single `rez_esc` value in the test set. According to the [answer](https://www.kaggle.com/c/costa-rican-household-poverty-prediction/discussion/61403) from the competition host, we can safely change that value to 5.

In [34]:
# Replace the outlier rez_esc value of 99.0 in the test set with 5
test.loc[test['rez_esc'] == 99.0, 'rez_esc'] = 5

### Missing Values

In [20]:
print ("Top Columns having missing values")
missmap = train.isnull().sum().to_frame().sort_values(0, ascending = False)
missmap.head()

Top Columns having missing values


Unnamed: 0,0
rez_esc,7928
v18q1,7342
v2a1,6860
SQBmeaned,5
meaneduc,5
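
Raw counts are easier to judge next to the share of rows they represent. The sketch below builds a small summary with both, using a made-up frame (`demo` and its values are hypothetical; the summary column names `n_missing` and `pct_missing` are also just illustrative). The same pattern should apply to `train`.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps, for illustration only.
demo = pd.DataFrame({
    "v2a1": [190000.0, np.nan, np.nan, 180000.0],
    "v18q1": [np.nan, 1.0, np.nan, np.nan],
    "rooms": [3, 4, 8, 5],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": demo.isnull().sum(),
    "pct_missing": demo.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

Columns where most rows are missing usually need a deliberate choice between imputing a meaningful default and dropping the column.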
