# Costa Rican Household Poverty Level Prediction 

Develop a machine learning model that can predict the poverty level of households using both individual and household characteristics. 

## Problem and Data Explanation

The data for this competition is provided in two files: train.csv and test.csv. The training set has 9557 rows and 143 columns while the testing set has 23856 rows and 142 columns. Each row represents one individual and each column is a feature, either unique to the individual, or for the household of the individual. The training set has one additional column, Target, which represents the poverty level on a 1-4 scale and is the label for the competition. A value of 1 is the most extreme poverty.

This is a supervised multi-class classification machine learning problem:

- Supervised: provided with the labels for the training data
- Multi-class classification: Labels are discrete values with 4 classes

## Objective
The objective is to predict poverty on a household level. We are given data on the individual level with each individual having unique features but also information about their household. In order to create a dataset for the task, we'll have to perform some aggregations of the individual data for each household. Moreover, we have to make a prediction for every individual in the test set, but "ONLY the heads of household are used in scoring" which means we want to predict poverty on a household basis.

Important note: while all members of a household should have the same label in the training data, there are errors where individuals in the same household have different labels. In these cases, we are told to use the label for the head of each household, which can be identified by the rows where parentesco1 == 1.0. We will cover how to correct this in the notebook (for more info take a look at the competition main discussion).
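
As a minimal sketch of that correction, assuming a tiny made-up data frame rather than the real train.csv, we can find the households whose members disagree and overwrite every member's label with the label of the head of household. The values and the helper names (`inconsistent`, `head_label`) are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical values): household "b" has inconsistent labels
df = pd.DataFrame({
    "idhogar":     ["a", "a", "b", "b", "b"],
    "parentesco1": [1, 0, 1, 0, 0],
    "Target":      [4, 4, 2, 3, 2],
})

# Households whose members do not all share a single label
n_labels = df.groupby("idhogar")["Target"].nunique()
inconsistent = n_labels[n_labels > 1].index.tolist()

# Label of the head of each household (rows where parentesco1 == 1)
head_label = df.loc[df["parentesco1"] == 1].set_index("idhogar")["Target"]

# Give every member the head's label; keep the old label if there is no head
corrected = df["idhogar"].map(head_label)
df["Target"] = corrected.fillna(df["Target"]).astype(int)
```

Households without a head keep their original labels in this sketch, because there is no reference label to copy.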

The Target values represent poverty levels as follows:

__1 = extreme poverty__<br> 
__2 = moderate poverty__ <br> 
__3 = vulnerable households__<br> 
__4 = non vulnerable households__<br>

## Data Descriptions
The explanations for all 143 columns can be found in the competition documentation, but a few to note are below:

Id: a unique identifier for each individual, this should not be a feature that we use.

idhogar: a unique identifier for each household. This variable is not a feature, but will be used to group individuals by household as all individuals in a household will have the same identifier.

parentesco1: indicates if this person is the head of the household.

Target: the label, which should be equal for all members in a household

When we make a model, we'll train on a household basis with the label for each household the poverty level of the head of household. The raw data contains a mix of both household and individual characteristics and for the individual data, we will have to find a way to aggregate this for each household. Some of the individuals belong to a household with no head of household which means that unfortunately we can't use this data for training. These issues with the data are completely typical of real-world data and hence this problem is great preparation for the datasets you'll encounter in a data science job.
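
One possible way to build such a household-level table is sketched below on made-up data. The choice of aggregations (mean, max, size) and the new column names are assumptions for illustration, not the final feature set of this notebook.

```python
import pandas as pd

# Toy individual-level data (hypothetical values); household "c" has no head
people = pd.DataFrame({
    "idhogar":     ["a", "a", "b", "c", "c"],
    "parentesco1": [1, 0, 1, 0, 0],
    "age":         [40, 10, 65, 30, 5],
    "escolari":    [11, 3, 6, 9, 0],
    "Target":      [4, 4, 2, 3, 3],
})

# Aggregate the individual columns to one row per household
household = people.groupby("idhogar").agg(
    age_mean=("age", "mean"),
    age_max=("age", "max"),
    escolari_max=("escolari", "max"),
    n_members=("age", "size"),
)

# Attach the head of household's label; households without a head get NaN
heads = people.loc[people["parentesco1"] == 1].set_index("idhogar")["Target"]
household["Target"] = heads  # aligned on the idhogar index

# Only households with a head can be used for training
train_households = household.dropna(subset=["Target"])
```

Here household "c" is dropped from `train_households` because it has no head and therefore no usable label.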

## Metric
Ultimately we want to build a machine learning model that can predict the integer poverty level of a household. Our predictions will be assessed by the __Macro F1 Score__. You may be familiar with the standard F1 score for binary classification problems which is the harmonic mean of precision and recall:

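$$F_1 = \frac{2}{\frac{1}{\text{recall}} + \frac{1}{\text{precision}}} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$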
 
For multi-class problems, we have to average the F1 scores for each class. The macro F1 score averages the F1 score for each class without taking into account label imbalances.

$$\text{Macro } F_1 = \frac{F_1 \text{ Class 1} + F_1 \text{ Class 2} + F_1 \text{ Class 3} + F_1 \text{ Class 4}}{4}$$
 
In other words, the number of occurrences of each label does not figure into the calculation when using macro averaging (while it does when using the "weighted" score). For more information on the differences, look at the Scikit-Learn documentation for the F1 score or this Stack Exchange question and answers. If we want to assess our performance, we can use the code:

from sklearn.metrics import f1_score
f1_score(y_true, y_predicted, average='macro')

For this problem, the labels are imbalanced, which makes it a little strange to use macro averaging for the evaluation metric, but that's a decision made by the organizers and not something we can change! In your own work, you want to be aware of label imbalances and choose a metric accordingly.
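
To see how much the averaging choice matters with imbalanced labels, here is a small hand-rolled sketch in plain Python. The labels and predictions are made up, and `per_class_f1` is a hypothetical helper written for this illustration (it is not part of Scikit-Learn).

```python
from collections import Counter

def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

# Made-up, imbalanced labels: class 4 dominates and the model over-predicts it
y_true = [4, 4, 4, 4, 4, 4, 2, 2, 3, 1]
y_pred = [4, 4, 4, 4, 4, 4, 2, 4, 4, 4]
labels = [1, 2, 3, 4]

f1 = per_class_f1(y_true, y_pred, labels)

# Macro: plain average, every class counts equally
macro_f1 = sum(f1.values()) / len(labels)

# Weighted: each class counts in proportion to how often it occurs
support = Counter(y_true)
weighted_f1 = sum(f1[c] * support[c] for c in labels) / len(y_true)
```

On this toy example the macro score is about 0.37 while the weighted score is about 0.61, because the two rare classes that the model misses entirely pull the macro average down.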

## Roadmap
The end objective is a machine learning model that can predict the poverty level of a household. However, before we get carried away with modeling, it's important to understand the problem and data. Also, we want to evaluate numerous models before choosing one as the "best" and after building a model, we want to investigate the predictions. Our roadmap is therefore as follows:

- Understand the problem (we're almost there already)
- Exploratory Data Analysis
- Feature engineering to create a dataset for machine learning
- Compare several baseline machine learning models
- Try more complex machine learning models
- Optimize the selected model
- Investigate model predictions in context of problem
- Draw conclusions and lay out next steps

The steps laid out above are iterative, meaning that while we will go through them one at a time, we might go back to an earlier step and revisit some of our decisions. In general, data science is a non-linear practice where we are constantly evaluating our past decisions and making improvements. In particular, feature engineering, modeling, and optimization are steps that we often repeat because we never know if we got them right the first time!

### Getting Started
For the EDA we'll examine any interesting anomalies, trends, correlations, or patterns that can be used for feature engineering and for modeling. We'll make sure to investigate our data both quantitatively (with statistics) and visually (with figures).

Once we have a good grasp of the data and any potentially useful relationships, we can do some feature engineering (the most important part of the machine learning pipeline) and establish a baseline model. This won't get us to the top of the leaderboard, but it will provide a strong foundation to build on!

With all that info in mind (don't worry if you haven't got all the details), let's get started!

In [None]:
#Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set a few plotting defaults
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 18
plt.rcParams['patch.edgecolor'] = 'k'

In [None]:
#Read in Data and look at Summary Information
pd.options.display.max_columns = 150

# Read in data
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.head()

In [None]:
train.shape

In [None]:
test.shape

In [None]:
test

In [None]:
Ids = test['Id'].unique()
Ids

In [None]:
test.groupby('SQBescolari').describe()

In [None]:
train.groupby('SQBescolari').describe()

In [None]:
#Let's look at the distribution of the 'Target' variable from the training data
#(DataFrame.hist creates its own figure, so the size is passed to it directly)
train.hist(column='Target', figsize=(10,10))
plt.xlabel('Poverty Level')

In [None]:
#Now let's compare a few variables in the training and test datasets.

x = test['SQBescolari']
y = train['SQBescolari']

#Semi-transparent bars keep both distributions visible where they overlap
plt.hist(x, label='Test', alpha=0.6)
plt.hist(y, label='Train', color='purple', alpha=0.6)
plt.legend(loc='upper right')
plt.title('SQBescolari')
plt.show()

In [None]:
#To explore SQBovercrowding variable, a histogram won't show enough info so let's do a line plot
#(each value is plotted against its row index)
x = test['SQBovercrowding']
y = train['SQBovercrowding']

plt.plot(x, label='Test', marker='o')
plt.plot(y, label='Train',color='purple')
plt.legend(loc='upper left')
plt.title('SQBovercrowding')
plt.show()

#Well that isn't extremely helpful since there don't appear to be any patterns. 

#Let's look at the outlier data points a bit more in each of the Test and Training data sets.

In [None]:
#Let's zoom in on the test outliers by changing the x and y axis limits
plt.plot(x, label='Test', marker='o')
plt.xlim(16000,20100)
plt.ylim(80,180)
plt.title('SQBovercrowding Test')
plt.show()

#The outlier values in the test dataset seem to be 100 and 170.
#Most values in the training dataset are between 0-40; and 0-50 in the test dataset.

In [None]:
#Let's look more closely at the variable 'SQBescolari' (years of schooling squared) where it takes the values 0, 1, 4 or 9
#and where the Target is 1 or 2 (extreme or moderate poverty)

#Combine both conditions inside a single query string
SQBescolari_train = train.query('SQBescolari <= 9 and Target <= 2')
SQBescolari_train

#In the full training set, 755 rows have a Target of 1 and 1597 have a Target of 2, so 2352 rows have either 1 or 2
#out of the total train sample size of 9557.

#We still don't know if the 'SQBage' or 'SQBhogar_total' are significant variables so let's keep them for now.

In [None]:
#Explore relationship between 'SQBescolari', 'SQBage', 'SQBhogar_total', 'SQBovercrowding' and 'Target'
#(the output feature) in the training dataset

import seaborn as sns

#Calculate the correlation matrix for just these numeric columns
cols = ['SQBescolari', 'SQBage', 'SQBhogar_total', 'SQBovercrowding', 'Target']
corr = SQBescolari_train[cols].corr()

#Plot the heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)

#From the heatmap, 'SQBage' and 'SQBhogar_total' show close to 0 correlation with 'Target', so these variables can be dropped.

#'SQBescolari' is positively correlated with 'Target' (about 0.4) and 'SQBovercrowding' is negatively correlated (about -0.25)

In [None]:
trainclean = SQBescolari_train[['Id','SQBescolari', 'SQBovercrowding']]
trainclean

## Continue Exploratory Data Analysis with Test Data

In [None]:
test.groupby('SQBovercrowding').describe()

In [None]:
#From looking at the training dataset above, we see Target of 1 or 2 when 'SQBovercrowding' has values between 
#0.111 and 12.25 so let's query with this in mind.

SQBescolari_test= test.query('SQBescolari <=9')
SQBescolari_test

#3162 rows have SQBescolari = 0, 3879 have 1, 769 have 4 and 1051 have 9, out of the 23856 rows in the test set
#(together about 37% of the total sample size)

In [None]:
#Now let's find unique values for SQBovercrowding
SQBescolari_test.SQBovercrowding.unique()

In [None]:
#Clean up SQBescolari_test dataframe to get rid of unnecessary columns
testclean = SQBescolari_test[['Id','SQBescolari', 'SQBovercrowding']]
testclean

In [None]:
#Let's see if we can visualize the relationship (if any) between SQBescolari & SQBovercrowding

import seaborn as sns

#Calculate the correlation matrix (the 'Id' column is a string identifier, so leave it out)
corr = testclean[['SQBescolari', 'SQBovercrowding']].corr()

#Plot the heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)

#It appears that the two variables are highly linearly correlated.

## Duplicate Features

- 'SQBage' and 'agesq' both hold age squared, so they duplicate the information in 'age'
- 'SQBedjefe', 'edjefe'[101] and 'SQBmeaned' are duplicate metrics of 'SQBescolari'
- 'SQBhogar_nin' is duplicate metric of 'SQBhogar_total'
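
Before dropping a suspected duplicate, it is worth confirming the relationship in the data. The sketch below does this for the squared age columns on a tiny made-up frame; the values are hypothetical and the same check would need to be run on the real train data.

```python
import pandas as pd

# Toy data (hypothetical values) with two squared copies of 'age'
df = pd.DataFrame({
    "age":    [5, 30, 65],
    "SQBage": [25, 900, 4225],
    "agesq":  [25, 900, 4225],
})

# Check whether each candidate column is exactly age squared
is_square = {
    col: bool((df[col] == df["age"] ** 2).all())
    for col in ["SQBage", "agesq"]
}

# Drop only the columns that passed the check
redundant = [col for col, ok in is_square.items() if ok]
reduced = df.drop(columns=redundant)
```

A squared column carries no extra information for a tree-based model such as a random forest, since trees only use the ordering of the values.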

In [None]:
#Build a quick baseline Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

#Define input and output features
ytrain = train.iloc[:,-1] #Define target variable as last column of data frame (see https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)
Xtrain = train.drop('Target', axis=1)

#Fill NAs
Xtrain = Xtrain.fillna(-999)

#label encoder
for c in train.columns[train.dtypes == 'object']:
    Xtrain[c] = Xtrain[c].factorize()[0]

rf = RandomForestClassifier()
rf.fit(Xtrain,ytrain)

In [None]:
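#Test the model

#Create a copy of the test features, with columns in the same order as Xtrain
Xtest = test[Xtrain.columns].copy()

#Fill NAs
Xtest = Xtest.fillna(-999)

#Encode object columns with the same codes that factorize() gave the training data
#(values never seen in training get -1)
for c in train.columns[train.dtypes == 'object']:
    train_values = pd.Index(train[c].fillna(-999).unique())
    Xtest[c] = train_values.get_indexer(Xtest[c])

#The test set has no 'Target' column, so score the model on the training data instead.
#These scores are optimistic because the forest has already seen these rows.
ytest = ytrain
ypredictions = rf.predict(Xtrain)

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print("=== Confusion Matrix ===")
print(confusion_matrix(ytest, ypredictions))
print('\n')
print("=== Classification Report ===")
print(classification_report(ytest, ypredictions))
print('\n')

In [None]:
#Random Forest Classifier Model Metrics

# Use the forest's predict method on the test data
predictions = rf.predict(Xtest)
# Calculate the absolute errors on the labelled (training) data
errors = abs(ypredictions - ytest)

#Macro F1 Score is Model Evaluation Metric
from sklearn.metrics import f1_score

print("=== Macro F1 Score ===")
f1_score(ytest, ypredictions, average='macro')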

In [None]:
#Now add these tuned parameters to the model to see if we can improve results
from sklearn.metrics import f1_score

rfc = RandomForestClassifier(n_estimators=1000, max_depth=300, max_features='sqrt')
rfc.fit(Xtrain, ytrain)
rfc_predict = rfc.predict(Xtrain)

print("=== Macro F1 Score ===")
f1_score(ytrain, rfc_predict, average='macro')

#This score uses the tuned model's own predictions on the training data, so compare it with the baseline score above.
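
Scores computed on the same rows the forest was trained on tend to look better than they really are. A more honest estimate holds some labelled rows back. The sketch below shows the idea on synthetic data from `make_classification`, which is an assumption for illustration; on the real data we would split `Xtrain` and `ytrain` instead, ideally keeping all members of a household on the same side of the split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 4-class data standing in for the real features
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, weights=[0.1, 0.15, 0.15, 0.6], random_state=0,
)

# Hold out 25% of the labelled rows, keeping the class proportions
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_fit, y_fit)

# Score on rows the model has seen and on rows it has not
train_f1 = f1_score(y_fit, model.predict(X_fit), average="macro")
val_f1 = f1_score(y_val, model.predict(X_val), average="macro")
```

The gap between `train_f1` and `val_f1` gives a rough sense of how much the training score overstates performance.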

## More Exploratory Data Analysis - Feature Selection

In [None]:
#Plot Feature Importance
plt.figure(figsize=(10,10)) #Increased figure size to see which features are most interesting
plt.plot(rf.feature_importances_, 'bo') #change to points to see individual feature points.
plt.xticks(np.arange(Xtrain.shape[1]), Xtrain.columns.tolist(), rotation='vertical') #label each point with its feature name
plt.xlabel('Features')
plt.xlim(90,140)
plt.show()
#Let's take a closer look at the outliers to see which features might affect the model the most.

In [None]:
import numpy as np
np.set_printoptions(threshold=np.inf)  #https://stackoverflow.com/questions/1987694/how-to-print-the-full-numpy-array

print("-Here are the predicted Poverty Level Targets-")
ypredictions

In [None]:
#Read in Results Data
submission = pd.read_csv('CR_Kaggle_LKahn.csv')
submission.head(5)

In [None]:
submission.to_csv('./Submission_log_RF.csv', index=False) #index=False avoids writing an extra index column

In [None]:
#Next, let's try a NN to see if we can improve F1 macro score
from sklearn.neural_network import MLPClassifier

#Create a copy to work with
Xtrain = train.copy()

#Save and drop labels
ytrain = Xtrain.iloc[:,-1] #Define target variable as last column of data frame (see https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)
Xtrain = Xtrain.drop('Target', axis=1)

#Fill NAs
Xtrain = Xtrain.fillna(-999)

#label encoder
for c in train.columns[train.dtypes == 'object']:
    Xtrain[c] = Xtrain[c].factorize()[0]

MLP= MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5,2),random_state=1)
MLP.fit(Xtrain,ytrain)