# Homework 4

### Costa Rican Household Poverty Level Prediction (Kaggle Competition)

#### By: Spencer Wise

The goal of the Costa Rican Household Poverty Level Prediction contest is to develop a machine learning model that can predict the poverty level of households using both individual and household characteristics. Many social programs have a hard time making sure the right people are given the right amount of aid, and it’s especially tricky when a program focuses on the poorest segment of the population. The world’s poorest typically can’t provide the necessary income and expense records to prove that they qualify. Costa Rica is just one of many countries that face this same problem of assessing social need.

In this notebook, we will work toward a complete machine learning solution to this problem. We will introduce the problem, explore and clean the dataset, engineer features, try out multiple machine learning models, select and optimize one of them, and finally inspect its outputs and draw conclusions.

### Data Overview and Problem

The data for this competition comes in two files, `train.csv` (training set) and `test.csv` (test set). The training set has 9557 rows and 143 columns, and the test set has 23856 rows and 142 columns. Each row represents one individual, and each column is a feature, either unique to the individual or shared by that individual's household. The training set has one additional column, `Target`, which represents the poverty level on a scale from 1 to 4. A value of 1 is the most extreme poverty, while a value of 4 represents non-vulnerable households. This is a multi-class classification problem with 4 classes.

### Objective

The objective is to predict poverty at the household level. The data is given at the individual level: each individual has unique features but also carries information about their household. To create a dataset for the task, we'll have to aggregate the individual data for each household. Moreover, we have to make a prediction for every individual in the test set, but "ONLY the heads of household are used in scoring", which means we want to predict poverty on a household basis.
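The aggregation step described above might look roughly like the following. This is a minimal sketch on toy data, not the notebook's actual pipeline: the column names `idhogar` (household identifier) and `parentesco1` (1 for the head of household) are assumptions about the dataset, and `age` and `escolari` are stand-ins for individual-level features.

```python
import pandas as pd

# Toy individual-level data. `idhogar` and `parentesco1` are assumed
# column names; `age` and `escolari` stand in for individual features.
people = pd.DataFrame({
    "idhogar":     ["h1", "h1", "h1", "h2", "h2"],
    "parentesco1": [1, 0, 0, 1, 0],
    "age":         [43, 40, 12, 67, 30],
    "escolari":    [10, 11, 5, 6, 12],
    "Target":      [4, 4, 4, 2, 2],
})

# Aggregate individual features up to one row per household.
household_features = people.groupby("idhogar").agg(
    age_mean=("age", "mean"),
    age_max=("age", "max"),
    escolari_max=("escolari", "max"),
    n_members=("age", "size"),
)

# Take the label from the head of household, since only heads are scored.
heads = (
    people.loc[people["parentesco1"] == 1, ["idhogar", "Target"]]
    .set_index("idhogar")
)
households = household_features.join(heads)
print(households)
```

The result has one row per household, which matches the level at which predictions are scored.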

To begin, let's start by importing the necessary packages and reading in the data!

### Module Imports
We'll use a familiar stack of data science libraries: `pandas`, `numpy`, `scipy`, `matplotlib`, and `seaborn`, with `sklearn` for modeling.

In [18]:
# Pandas for data loading, manipulation etc.
import pandas as pd

# Numeric functions
import numpy as np
from scipy import stats
from scipy.stats import norm

# Plotting and visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting defaults
%matplotlib inline
plt.style.use('fivethirtyeight')

# Other Packages
import warnings
warnings.filterwarnings('ignore')

### Load Data

First, we load the training and test datasets into pandas DataFrames.

In [3]:
# Loading datasets into separate pandas dataframes
train = pd.read_csv(r"C:\Users\Spencer\Dropbox\School\Fall 2018\Machine Learning\Projects\Homework_4\train.csv", encoding="latin1")
test = pd.read_csv(r"C:\Users\Spencer\Dropbox\School\Fall 2018\Machine Learning\Projects\Homework_4\test.csv", encoding="latin1")
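The absolute paths above only work on one machine. A more portable variant could look like this sketch, where the helper `load_split` and the `data/` folder are hypothetical names introduced for illustration:

```python
from pathlib import Path

import pandas as pd


def load_split(data_dir, name):
    """Read `<data_dir>/<name>.csv` into a DataFrame (hypothetical helper)."""
    return pd.read_csv(Path(data_dir) / f"{name}.csv", encoding="latin1")


# Hypothetical usage, assuming both files sit in a local `data/` folder:
# train = load_split("data", "train")
# test = load_split("data", "test")
```

Keeping the directory in a single argument means only one value changes when the notebook moves to another machine.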

In [8]:
# Print the shape of the training and test datasets
print(train.shape)
print(test.shape)

(9557, 143)
(23856, 142)
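The two shapes differ by one column. One way to check which column that is, sketched here on toy frames (`train_demo` and `test_demo` are stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-ins for the real train and test frames.
train_demo = pd.DataFrame({"Id": ["a"], "rooms": [3], "Target": [4]})
test_demo = pd.DataFrame({"Id": ["b"], "rooms": [5]})

# Columns present in the training frame but absent from the test frame.
extra_columns = train_demo.columns.difference(test_demo.columns)
print(list(extra_columns))
```

The same call on the real `train` and `test` frames would list whichever columns appear only in the training set.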


One interesting thing to note right off the bat is that the training dataset has one more column than the test dataset. That is because the training dataset includes our target value. Let's take a closer look by previewing our training data.

In [16]:
# Preview the data
train.head()

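| | Id | v2a1 | hacdor | rooms | hacapo | v14a | refrig | v18q | v18q1 | r4h1 | ... | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_279628684 | 190000.0 | 0 | 3 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 100 | 1849 | 1 | 100 | 0 | 1.0 | 0.0 | 100.0 | 1849 | 4 |
| 1 | ID_f29eb3ddd | 135000.0 | 0 | 4 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 144 | 4489 | 1 | 144 | 0 | 1.0 | 64.0 | 144.0 | 4489 | 4 |
| 2 | ID_68de51c94 | NaN | 0 | 8 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 121 | 8464 | 1 | 0 | 0 | 0.25 | 64.0 | 121.0 | 8464 | 4 |
| 3 | ID_d671db89c | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 81 | 289 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 289 | 4 |
| 4 | ID_d56d6f5f5 | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 121 | 1369 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 1369 | 4 |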
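The preview already shows missing values in `v2a1` and `v18q1`. A quick way to count missing values per column is sketched below on a small frame that copies three columns from the five previewed rows; the same two lines would apply to the full `train` frame.

```python
import numpy as np
import pandas as pd

# Three columns copied from the five previewed rows.
preview = pd.DataFrame({
    "v2a1":  [190000.0, 135000.0, np.nan, 180000.0, 180000.0],
    "v18q":  [0, 1, 0, 1, 1],
    "v18q1": [np.nan, 1.0, np.nan, 1.0, 1.0],
})

# Count missing values per column and keep only columns that have any.
missing = preview.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

In these five rows, `v18q1` is missing exactly where `v18q` is 0, which hints that some gaps are structural rather than random. That would need to be confirmed on the full dataset.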


### Analyze the Target

The first thing we want to do is identify and analyze the target. In this case, the target is the `Target` column, the variable we will be trying to predict.

In [19]:
# Descriptive summary of Target data
target = train['Target']
target.describe()

count    9557.000000
mean        3.302292
std         1.009565
min         1.000000
25%         3.000000
50%         4.000000
75%         4.000000
max         4.000000
Name: Target, dtype: float64
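`describe()` treats `Target` as a continuous number, but it is a class label. The median of 4 does tell us that at least half of the rows belong to class 4, so the classes are imbalanced. Per-class counts show this more directly. The sketch below uses made-up labels (`target_demo`), not the real distribution:

```python
import pandas as pd

# Made-up labels for illustration only, not the real Target column.
target_demo = pd.Series([4, 4, 4, 3, 2, 2, 1, 4], name="Target")

# Absolute and relative frequency of each class, ordered by class label.
counts = target_demo.value_counts().sort_index()
shares = target_demo.value_counts(normalize=True).sort_index()
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

Running the same calls on `train['Target']` would give the actual class balance, which matters when choosing an evaluation metric and a resampling or weighting strategy.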