# Contents

1. [The Problem](#The-Problem)
2. [Common Libraries](#Common-Libraries)
3. [Import Data](#Import-Data)
4. [Explore Data](#Explore-Data)
5. [Prepare Data](#Prepare-Data)
6. [Shortlist Models](#Shortlist-Models)
7. [Tune Models](#Tune-Models)
8. [Submission](#Submission)

# The Problem

For now, you can read about the problem on Kaggle. 

Based on that, this qualifies as a supervised multivariate classification problem in which we'll use batch learning from the supplied dataset.

## Performance evaluation 
As stated in the Kaggle description, performance is evaluated using the macro F1 score.

Todo:
* Describe the F1 score and evaluate whether it's a good fit.
* Is it the right measure given the business objectives?
* Minimum performance required?
* List assumptions
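
As a starting point for the first to-do item, here is a minimal sketch of how a macro F1 score is computed: F1 is calculated separately for each class and the per-class scores are averaged with equal weight, so rare classes count as much as common ones. The function name `macro_f1` and the toy labels are made up for illustration; in practice scikit-learn's `f1_score(..., average='macro')` is the usual way to compute this.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treat an empty denominator as 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (hypothetical, not taken from the competition data)
y_true = [1, 2, 3, 4, 4, 4, 4, 4]
y_pred = [4, 2, 4, 4, 4, 4, 4, 4]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}, macro F1: {macro_f1(y_true, y_pred):.3f}")
```

In this toy case the accuracy is noticeably higher than the macro F1, because the two classes that are never predicted each contribute an F1 of 0 to the average.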

# Common Libraries

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Set up Seaborn with default theme, scaling, and color palette
sns.set()

# Scikit-learn common imports
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, make_pipeline

# Import Data

In [None]:
def import_train_test():
    train = pd.read_csv('../input/train.csv')
    test = pd.read_csv('../input/test.csv')
    
    return train, test

train, test = import_train_test()

# Explore Data

In [None]:
print(train.shape)
train.head()

In [None]:
print(test.shape)
test.head()

Okay, the first thing we notice is that we have 9,557 rows in our training set and 23,856 in our test set. Each row represents an individual, and the household they belong to is indicated by their `idhogar` value. According to the business objective, we are only interested in classifying households, so the `Target` should be the same for every individual in a household. Let's see how many households are in each set.
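
The expectation that every member of a household shares one `Target` can be checked directly. The sketch below uses a small made-up DataFrame (the `people` frame and its values are hypothetical) and counts the distinct labels within each household; the same `groupby` should work on `train`.

```python
import pandas as pd

# Toy individuals (hypothetical values); household 'b' has conflicting labels
people = pd.DataFrame({
    'idhogar': ['a', 'a', 'b', 'b', 'c'],
    'Target':  [4, 4, 2, 3, 1],
})

# Number of distinct Target values within each household
labels_per_household = people.groupby('idhogar')['Target'].nunique()

# Households where not all members share the same label
inconsistent = labels_per_household[labels_per_household > 1]
print(f"{len(inconsistent)} household(s) with conflicting labels: {list(inconsistent.index)}")
```

If any households turn up here in the real data, we would need to decide which label to trust before aggregating to the household level.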

In [None]:
print(f"There are {len(train.idhogar.unique())} unique households in the training set.")
print(f"There are {len(test.idhogar.unique())} unique households in the test set.")

The test set has almost 2.5 times as many households as the training set. This means we should be very careful that our model doesn't overfit the training set, because if it does, we'll likely score poorly on the test submission.

Let's take a look at the label distribution for both individuals and households.

In [None]:
fig, axs = plt.subplots(1,2)
fig.tight_layout()

sns.countplot(x='Target', data=train, ax=axs[0])
axs[0].set_title('Train')

sns.countplot(x='Target', data=train.copy().groupby('idhogar').first(), ax=axs[1])
axs[1].set_title('Train Households')

plt.show()

Okay, the label distribution looks much the same for individuals in the train set and for the extracted households. We should take note that our classes are very imbalanced. If the test set follows the same distribution, we could use a naive model that simply predicts the `Target` to be 4 and we'd be right over 65% of the time. This certainly won't help IDB figure out how to focus their support, since every household would be classified as non-vulnerable and, therefore, not in need of additional support. Still, it's good to be aware of this as we evaluate our models.
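
To make the naive baseline concrete, here is a small sketch that computes how often a constant "always predict the majority class" model would be right. The `target` series is made-up data standing in for `train['Target']`, with class shares chosen only to illustrate the idea.

```python
import pandas as pd

# Hypothetical, imbalanced labels standing in for train['Target']
target = pd.Series([4] * 13 + [3] * 3 + [2] * 3 + [1] * 1)

# Share of each class; the largest share is the accuracy of a constant predictor
class_share = target.value_counts(normalize=True)
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(f"Always predicting {majority_class} is right {baseline_accuracy:.0%} of the time")
```

Such a model never identifies a single household in the other classes, which is why accuracy alone is a misleading yardstick here.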

It seems like a lot of the features in the dataset are boolean. To start exploring the data and distributions, let's take a look at all the numeric attributes that aren't boolean to see what their distributions are and see if we can identify anything interesting about the data we're working with.

In [None]:
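def is_boolean_column(data, column):
    return set(data[column].value_counts().keys()) == {0,1}

# Only consider numeric columns; object columns such as Id and idhogar can't be plotted or summarised
numeric_columns = train.select_dtypes(include=np.number).columns
numeric_non_boolean_attributes = [column for column in numeric_columns if not is_boolean_column(train, column)]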

train.hist(numeric_non_boolean_attributes, bins=50, figsize=(20,15))
plt.show()

From the histograms above, we can see a few things that we should be aware of and possibly address before training our models:

* While many of the attributes resemble a normal distribution, several have long tails and some are heavily skewed.
* `elimbasu5` is a boolean field and shouldn't appear here, but it slipped through the filter because all of its values are 0. A constant column tells us nothing, so we shouldn't pass it to our model.
* The attributes span very different ranges, so we'll need to scale them before training our model.
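
As a rough sketch of what addressing the long tails and the scaling could look like, the snippet below log-transforms a made-up, long-tailed series and then standardises it. The values are hypothetical, and whether a log transform is appropriate for any given attribute is an assumption to verify, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical long-tailed attribute (e.g. a rent-like amount)
values = pd.Series([0.0, 10.0, 20.0, 30.0, 1000.0])

# log1p compresses the long right tail; it requires values >= 0
compressed = np.log1p(values)

# Standardise to zero mean and unit variance (population std, ddof=0)
standardized = (compressed - compressed.mean()) / compressed.std(ddof=0)

print(standardized.round(2).tolist())
```

In a real pipeline the mean and standard deviation would be learned from the training set only and reused on the test set, which is what scikit-learn's `StandardScaler` does when placed inside a `Pipeline`.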

## Outliers
Let's find any attribute with values more than 1.5 times the IQR below the first quartile or above the third quartile.


In [None]:
q1s = train[numeric_non_boolean_attributes].quantile(0.25)
q3s = train[numeric_non_boolean_attributes].quantile(0.75)

iqrs = pd.DataFrame([q1s, q3s], index=['q1', 'q3']).transpose().drop('Target')
iqrs['iqr'] = iqrs.q3 - iqrs.q1
iqrs['iqr_amplified'] = iqrs.iqr * 1.5
iqrs['outlier_min'] = iqrs.q1 - iqrs.iqr_amplified
iqrs['outlier_max'] = iqrs.q3 + iqrs.iqr_amplified

def min_outlier_count(iqr_row):
    # iqr_row.name is the attribute name; the row already holds that attribute's bounds
    return (train[iqr_row.name] < iqr_row.outlier_min).sum()

def max_outlier_count(iqr_row):
    return (train[iqr_row.name] > iqr_row.outlier_max).sum()

iqrs['min_outlier_count'] = iqrs.apply(min_outlier_count, axis=1)
iqrs['max_outlier_count'] = iqrs.apply(max_outlier_count, axis=1)
iqrs[(iqrs.min_outlier_count > 0) | (iqrs.max_outlier_count > 0)]


In [None]:
with pd.option_context('display.max_columns', len(iqrs)):
    print(train[iqrs.index].describe())

A few columns have outliers if we simply define an outlier as a value more than 1.5 times the IQR beyond the quartiles, but this is mostly because these distributions aren't Gaussian and have long tails. Looking at the specifics, I don't see anything so far beyond reasonable as to be completely unexpected, though the numbers are depressing when you think about the real people behind them.

Aside from still thinking we should exclude `elimbasu5`, I don't think we should remove or constrain any outliers in this first pass.

## Attribute Correlation
Let's also evaluate if there's any obvious correlation between any of the attributes and the target variable that we should be aware of at this point.

In [None]:
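# Restrict to numeric columns; recent pandas versions raise an error on object columns
correlations = train.select_dtypes(include=np.number).corr()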

correlations['Target'].where(correlations['Target'].abs() > 0.25).dropna().sort_values(ascending=False)

We can see that there are a few attributes weakly correlated with the target variable, `Target`. That's good news; otherwise we'd have a tough time training a model worth anything.

One other thing I notice here is that the squared attributes, e.g. `SQBescolari` for `escolari`, seem to have a lower correlation with the target than their un-squared counterparts. We might want to omit the squared variables.

`meaneduc` (average years of education for adults, 18+) and `hogar_nin` (number of children aged 0 to 19 in the household) are the most correlated attributes, so we should pay extra attention to those two concepts.

In [None]:
fig, axs = plt.subplots(1,2)
fig.tight_layout()

sns.regplot(x='meaneduc', y='Target', data=train, ax=axs[0])
sns.regplot(x='hogar_nin', y='Target', data=train, ax=axs[1])

plt.show()

# Data Cleaning

First, let's create some reference lists we'll use for the features.

In [None]:
id_features = ['Id', 'idhogar']

features_and_descriptions = [('v2a1', 'Monthly rent payment'),
                             ('hacdor', '=1 Overcrowding by bedrooms'),
                             ('rooms', 'number of all rooms in the house'),
                             ('hacapo', '=1 Overcrowding by rooms'),
                             ('v14a', '=1 has bathroom in the household'),
                             ('refrig', '=1 if the household has refrigerator'),
                             ('v18q', 'owns a tablet'),
                             ('v18q1', 'number of tablets household owns'),
                             ('r4h1', 'Males younger than 12 years of age'),
                             ('r4h2', 'Males 12 years of age and older'),
                             ('r4h3', 'Total males in the household'),
                             ('r4m1', 'Females younger than 12 years of age'),
                             ('r4m2', 'Females 12 years of age and older'),
                             ('r4m3', 'Total females in the household'),
                             ('r4t1', 'persons younger than 12 years of age'),
                             ('r4t2', 'persons 12 years of age and older'),
                             ('r4t3', 'Total persons in the household'),
                             ('tamhog', 'size of the household'),
                             ('tamviv', 'number of persons living in the household'),
                             ('escolari', 'years of schooling'),
                             ('rez_esc', 'Years behind in school'),
                             ('hhsize', 'household size'),
                             ('paredblolad', '=1 if predominant material on the outside wall is block or brick'),
                             ('paredzocalo', '"=1 if predominant material on the outside wall is socket (wood,  zinc or absbesto"'),
                             ('paredpreb', '=1 if predominant material on the outside wall is prefabricated or cement'),
                             ('pareddes', '=1 if predominant material on the outside wall is waste material'),
                             ('paredmad', '=1 if predominant material on the outside wall is wood'),
                             ('paredzinc', '=1 if predominant material on the outside wall is zink'),
                             ('paredfibras', '=1 if predominant material on the outside wall is natural fibers'),
                             ('paredother', '=1 if predominant material on the outside wall is other'),
                             ('pisomoscer', '"=1 if predominant material on the floor is mosaic,  ceramic,  terrazo"'),
                             ('pisocemento', '=1 if predominant material on the floor is cement'),
                             ('pisoother', '=1 if predominant material on the floor is other'),
                             ('pisonatur', '=1 if predominant material on the floor is  natural material'),
                             ('pisonotiene', '=1 if no floor at the household'),
                             ('pisomadera', '=1 if predominant material on the floor is wood'),
                             ('techozinc', '=1 if predominant material on the roof is metal foil or zink'),
                             ('techoentrepiso', '"=1 if predominant material on the roof is fiber cement,  mezzanine "'),
                             ('techocane', '=1 if predominant material on the roof is natural fibers'),
                             ('techootro', '=1 if predominant material on the roof is other'),
                             ('cielorazo', '=1 if the house has ceiling'),
                             ('abastaguadentro', '=1 if water provision inside the dwelling'),
                             ('abastaguafuera', '=1 if water provision outside the dwelling'),
                             ('abastaguano', '=1 if no water provision'),
                             ('public', '"=1 electricity from CNFL,  ICE,  ESPH/JASEC"'),
                             ('planpri', '=1 electricity from private plant'),
                             ('noelec', '=1 no electricity in the dwelling'),
                             ('coopele', '=1 electricity from cooperative'),
                             ('sanitario1', '=1 no toilet in the dwelling'),
                             ('sanitario2', '=1 toilet connected to sewer or cesspool'),
                             ('sanitario3', '=1 toilet connected to  septic tank'),
                             ('sanitario5', '=1 toilet connected to black hole or letrine'),
                             ('sanitario6', '=1 toilet connected to other system'),
                             ('energcocinar1', '=1 no main source of energy used for cooking (no kitchen)'),
                             ('energcocinar2', '=1 main source of energy used for cooking electricity'),
                             ('energcocinar3', '=1 main source of energy used for cooking gas'),
                             ('energcocinar4', '=1 main source of energy used for cooking wood charcoal'),
                             ('elimbasu1', '=1 if rubbish disposal mainly by tanker truck'),
                             ('elimbasu2', '=1 if rubbish disposal mainly by botan hollow or buried'),
                             ('elimbasu3', '=1 if rubbish disposal mainly by burning'),
                             ('elimbasu4', '=1 if rubbish disposal mainly by throwing in an unoccupied space'),
                             ('elimbasu5', '"=1 if rubbish disposal mainly by throwing in river,  creek or sea"'),
                             ('elimbasu6', '=1 if rubbish disposal mainly other'),
                             ('epared1', '=1 if walls are bad'),
                             ('epared2', '=1 if walls are regular'),
                             ('epared3', '=1 if walls are good'),
                             ('etecho1', '=1 if roof are bad'),
                             ('etecho2', '=1 if roof are regular'),
                             ('etecho3', '=1 if roof are good'),
                             ('eviv1', '=1 if floor are bad'),
                             ('eviv2', '=1 if floor are regular'),
                             ('eviv3', '=1 if floor are good'),
                             ('dis', '=1 if disable person'),
                             ('male', '=1 if male'),
                             ('female', '=1 if female'),
                             ('estadocivil1', '=1 if less than 10 years old'),
                             ('estadocivil2', '=1 if free or coupled uunion'),
                             ('estadocivil3', '=1 if married'),
                             ('estadocivil4', '=1 if divorced'),
                             ('estadocivil5', '=1 if separated'),
                             ('estadocivil6', '=1 if widow/er'),
                             ('estadocivil7', '=1 if single'),
                             ('parentesco1', '=1 if household head'),
                             ('parentesco2', '=1 if spouse/partner'),
                             ('parentesco3', '=1 if son/doughter'),
                             ('parentesco4', '=1 if stepson/doughter'),
                             ('parentesco5', '=1 if son/doughter in law'),
                             ('parentesco6', '=1 if grandson/doughter'),
                             ('parentesco7', '=1 if mother/father'),
                             ('parentesco8', '=1 if father/mother in law'),
                             ('parentesco9', '=1 if brother/sister'),
                             ('parentesco10', '=1 if brother/sister in law'),
                             ('parentesco11', '=1 if other family member'),
                             ('parentesco12', '=1 if other non family member'),
                             ('idhogar', 'Household level identifier'),
                             ('hogar_nin', 'Number of children 0 to 19 in household'),
                             ('hogar_adul', 'Number of adults in household'),
                             ('hogar_mayor', '# of individuals 65+ in the household'),
                             ('hogar_total', '# of total individuals in the household'),
                             ('dependency', 'Dependency rate, calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)'),
                             ('edjefe', 'years of education of male head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0'),
                             ('edjefa', 'years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0'),
                             ('meaneduc', 'average years of education for adults (18+)'),
                             ('instlevel1', '=1 no level of education'),
                             ('instlevel2', '=1 incomplete primary'),
                             ('instlevel3', '=1 complete primary'),
                             ('instlevel4', '=1 incomplete academic secondary level'),
                             ('instlevel5', '=1 complete academic secondary level'),
                             ('instlevel6', '=1 incomplete technical secondary level'),
                             ('instlevel7', '=1 complete technical secondary level'),
                             ('instlevel8', '=1 undergraduate and higher education'),
                             ('instlevel9', '=1 postgraduate higher education'),
                             ('bedrooms', 'number of bedrooms'),
                             ('overcrowding', '# persons per room'),
                             ('tipovivi1', '=1 own and fully paid house'),
                             ('tipovivi2', '"=1 own,  paying in installments"'),
                             ('tipovivi3', '=1 rented'),
                             ('tipovivi4', '=1 precarious'),
                             ('tipovivi5', '"=1 other(assigned,  borrowed)"'),
                             ('computer', '=1 if the household has notebook or desktop computer'),
                             ('television', '=1 if the household has TV'),
                             ('mobilephone', '=1 if mobile phone'),
                             ('qmobilephone', '# of mobile phones'),
                             ('lugar1', '=1 region Central'),
                             ('lugar2', '=1 region Chorotega'),
                             ('lugar3', '=1 region Pacífico central'),
                             ('lugar4', '=1 region Brunca'),
                             ('lugar5', '=1 region Huetar Atlántica'),
                             ('lugar6', '=1 region Huetar Norte'),
                             ('area1', '=1 zona urbana'),
                             ('area2', '=2 zona rural'),
                             ('age', 'Age in years'),
                             ('SQBescolari', 'escolari squared'),
                             ('SQBage', 'age squared'),
                             ('SQBhogar_total', 'hogar_total squared'),
                             ('SQBedjefe', 'edjefe squared'),
                             ('SQBhogar_nin', 'hogar_nin squared'),
                             ('SQBovercrowding', 'overcrowding squared'),
                             ('SQBdependency', 'dependency squared'),
                             ('SQBmeaned', 'square of the mean years of education of adults (>=18) in the household'),
                             ('agesq', 'Age squared')]

all_features = [feature for feature, description in features_and_descriptions]


household_features_and_descriptions = [('v2a1', 'Monthly rent payment'),
                             ('hacdor', '=1 Overcrowding by bedrooms'),
                             ('rooms', 'number of all rooms in the house'),
                             ('hacapo', '=1 Overcrowding by rooms'),
                             ('v14a', '=1 has bathroom in the household'),
                             ('refrig', '=1 if the household has refrigerator'),
                             ('v18q1', 'number of tablets household owns'),
                             ('r4h1', 'Males younger than 12 years of age'),
                             ('r4h2', 'Males 12 years of age and older'),
                             ('r4h3', 'Total males in the household'),
                             ('r4m1', 'Females younger than 12 years of age'),
                             ('r4m2', 'Females 12 years of age and older'),
                             ('r4m3', 'Total females in the household'),
                             ('r4t1', 'persons younger than 12 years of age'),
                             ('r4t2', 'persons 12 years of age and older'),
                             ('r4t3', 'Total persons in the household'),
                             ('tamhog', 'size of the household'),
                             ('tamviv', 'number of persons living in the household'),
                             ('hhsize', 'household size'),
                             ('paredblolad', '=1 if predominant material on the outside wall is block or brick'),
                             ('paredzocalo', '"=1 if predominant material on the outside wall is socket (wood,  zinc or absbesto"'),
                             ('paredpreb', '=1 if predominant material on the outside wall is prefabricated or cement'),
                             ('pareddes', '=1 if predominant material on the outside wall is waste material'),
                             ('paredmad', '=1 if predominant material on the outside wall is wood'),
                             ('paredzinc', '=1 if predominant material on the outside wall is zink'),
                             ('paredfibras', '=1 if predominant material on the outside wall is natural fibers'),
                             ('paredother', '=1 if predominant material on the outside wall is other'),
                             ('pisomoscer', '"=1 if predominant material on the floor is mosaic,  ceramic,  terrazo"'),
                             ('pisocemento', '=1 if predominant material on the floor is cement'),
                             ('pisoother', '=1 if predominant material on the floor is other'),
                             ('pisonatur', '=1 if predominant material on the floor is  natural material'),
                             ('pisonotiene', '=1 if no floor at the household'),
                             ('pisomadera', '=1 if predominant material on the floor is wood'),
                             ('techozinc', '=1 if predominant material on the roof is metal foil or zink'),
                             ('techoentrepiso', '"=1 if predominant material on the roof is fiber cement,  mezzanine "'),
                             ('techocane', '=1 if predominant material on the roof is natural fibers'),
                             ('techootro', '=1 if predominant material on the roof is other'),
                             ('cielorazo', '=1 if the house has ceiling'),
                             ('abastaguadentro', '=1 if water provision inside the dwelling'),
                             ('abastaguafuera', '=1 if water provision outside the dwelling'),
                             ('abastaguano', '=1 if no water provision'),
                             ('public', '"=1 electricity from CNFL,  ICE,  ESPH/JASEC"'),
                             ('planpri', '=1 electricity from private plant'),
                             ('noelec', '=1 no electricity in the dwelling'),
                             ('coopele', '=1 electricity from cooperative'),
                             ('sanitario1', '=1 no toilet in the dwelling'),
                             ('sanitario2', '=1 toilet connected to sewer or cesspool'),
                             ('sanitario3', '=1 toilet connected to  septic tank'),
                             ('sanitario5', '=1 toilet connected to black hole or letrine'),
                             ('sanitario6', '=1 toilet connected to other system'),
                             ('energcocinar1', '=1 no main source of energy used for cooking (no kitchen)'),
                             ('energcocinar2', '=1 main source of energy used for cooking electricity'),
                             ('energcocinar3', '=1 main source of energy used for cooking gas'),
                             ('energcocinar4', '=1 main source of energy used for cooking wood charcoal'),
                             ('elimbasu1', '=1 if rubbish disposal mainly by tanker truck'),
                             ('elimbasu2', '=1 if rubbish disposal mainly by botan hollow or buried'),
                             ('elimbasu3', '=1 if rubbish disposal mainly by burning'),
                             ('elimbasu4', '=1 if rubbish disposal mainly by throwing in an unoccupied space'),
                             ('elimbasu5', '"=1 if rubbish disposal mainly by throwing in river,  creek or sea"'),
                             ('elimbasu6', '=1 if rubbish disposal mainly other'),
                             ('epared1', '=1 if walls are bad'),
                             ('epared2', '=1 if walls are regular'),
                             ('epared3', '=1 if walls are good'),
                             ('etecho1', '=1 if roof are bad'),
                             ('etecho2', '=1 if roof are regular'),
                             ('etecho3', '=1 if roof are good'),
                             ('eviv1', '=1 if floor are bad'),
                             ('eviv2', '=1 if floor are regular'),
                             ('eviv3', '=1 if floor are good'),
                             ('hogar_nin', 'Number of children 0 to 19 in household'),
                             ('hogar_adul', 'Number of adults in household'),
                             ('hogar_mayor', '# of individuals 65+ in the household'),
                             ('hogar_total', '# of total individuals in the household'),
                             ('dependency', 'Dependency rate, calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)'),
                             ('edjefe', 'years of education of male head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0'),
                             ('edjefa', 'years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0'),
                             ('meaneduc', 'average years of education for adults (18+)'),
                             ('bedrooms', 'number of bedrooms'),
                             ('overcrowding', '# persons per room'),
                             ('tipovivi1', '=1 own and fully paid house'),
                             ('tipovivi2', '"=1 own,  paying in installments"'),
                             ('tipovivi3', '=1 rented'),
                             ('tipovivi4', '=1 precarious'),
                             ('tipovivi5', '"=1 other(assigned,  borrowed)"'),
                             ('computer', '=1 if the household has notebook or desktop computer'),
                             ('television', '=1 if the household has TV'),
                             ('mobilephone', '=1 if mobile phone'),
                             ('qmobilephone', '# of mobile phones'),
                             ('lugar1', '=1 region Central'),
                             ('lugar2', '=1 region Chorotega'),
                             ('lugar3', '=1 region Pacífico central'),
                             ('lugar4', '=1 region Brunca'),
                             ('lugar5', '=1 region Huetar Atlántica'),
                             ('lugar6', '=1 region Huetar Norte'),
                             ('area1', '=1 zona urbana'),
                             ('area2', '=2 zona rural')]

household_features = [feature for feature, description in household_features_and_descriptions]

individual_features_and_descriptions = [('v18q', 'owns a tablet'),
                             ('escolari', 'years of schooling'),
                             ('rez_esc', 'Years behind in school'),
                             ('dis', '=1 if disable person'),
                             ('male', '=1 if male'),
                             ('female', '=1 if female'),
                             ('estadocivil1', '=1 if less than 10 years old'),
                             ('estadocivil2', '=1 if free or coupled uunion'),
                             ('estadocivil3', '=1 if married'),
                             ('estadocivil4', '=1 if divorced'),
                             ('estadocivil5', '=1 if separated'),
                             ('estadocivil6', '=1 if widow/er'),
                             ('estadocivil7', '=1 if single'),
                             ('parentesco1', '=1 if household head'),
                             ('parentesco2', '=1 if spouse/partner'),
                             ('parentesco3', '=1 if son/doughter'),
                             ('parentesco4', '=1 if stepson/doughter'),
                             ('parentesco5', '=1 if son/doughter in law'),
                             ('parentesco6', '=1 if grandson/doughter'),
                             ('parentesco7', '=1 if mother/father'),
                             ('parentesco8', '=1 if father/mother in law'),
                             ('parentesco9', '=1 if brother/sister'),
                             ('parentesco10', '=1 if brother/sister in law'),
                             ('parentesco11', '=1 if other family member'),
                             ('parentesco12', '=1 if other non family member'),
                             ('instlevel1', '=1 no level of education'),
                             ('instlevel2', '=1 incomplete primary'),
                             ('instlevel3', '=1 complete primary'),
                             ('instlevel4', '=1 incomplete academic secondary level'),
                             ('instlevel5', '=1 complete academic secondary level'),
                             ('instlevel6', '=1 incomplete technical secondary level'),
                             ('instlevel7', '=1 complete technical secondary level'),
                             ('instlevel8', '=1 undergraduate and higher education'),
                             ('instlevel9', '=1 postgraduate higher education'),
                             ('age', 'Age in years')]

individual_features = [feature for feature, description in individual_features_and_descriptions]

squared_features_and_descriptions = [('SQBescolari', 'escolari squared'),
                             ('SQBage', 'age squared'),
                             ('SQBhogar_total', 'hogar_total squared'),
                             ('SQBedjefe', 'edjefe squared'),
                             ('SQBhogar_nin', 'hogar_nin squared'),
                             ('SQBovercrowding', 'overcrowding squared'),
                             ('SQBdependency', 'dependency squared'),
                             ('SQBmeaned', 'square of the mean years of education of adults (>=18) in the household'),
                             ('agesq', 'Age squared')]

squared_features = [feature for feature, description in squared_features_and_descriptions]

squared_household_features_and_descriptions = [('SQBhogar_total', 'hogar_total squared'),
                             ('SQBedjefe', 'edjefe squared'),
                             ('SQBhogar_nin', 'hogar_nin squared'),
                             ('SQBovercrowding', 'overcrowding squared'),
                             ('SQBdependency', 'dependency squared'),
                             ('SQBmeaned', 'square of the mean years of education of adults (>=18) in the household')]

squared_household_features = [feature for feature, description in squared_household_features_and_descriptions]

squared_individual_features_and_descriptions = [('SQBescolari', 'escolari squared'),
                             ('SQBage', 'age squared'),
                             ('agesq', 'Age squared')]

squared_individual_features = [feature for feature, description in squared_individual_features_and_descriptions]

In [None]:
# Verify that seemingly duplicate attributes, SQBage and agesq, 
# are in fact duplicates and contain duplicate values in both train and test sets.

for df, name in [(train, 'train'), (test, 'test')]:
    assert df.agesq.equals(df.SQBage), f"agesq is not equivalent with SQBage in the {name} set"
    
# Remove duplicate column agesq from feature lists
lists_with_agesq = [features_and_descriptions, squared_features_and_descriptions, squared_individual_features_and_descriptions]
for feature_list in lists_with_agesq:
    try:
        feature_list.remove(('agesq', 'Age squared'))
    except ValueError:
        continue

# Verify we deleted them all
for feature_list in lists_with_agesq:
    assert 'agesq' not in {x for x, y in feature_list}, 'Duplicated column agesq is still in feature list.'

## Attributes with NaNs

In [None]:
def print_nan_counts(df):
    nan_counts = df.isna().sum()
    print(nan_counts[nan_counts > 0].sort_values(ascending = False))
    
print_nan_counts(train)

Good news is that for most features, there aren't a lot of missing values. Let's figure out how best to handle each of these.

### rez_esc

First, `rez_esc`. The description of this field is "Years behind in school".  Let's take a look to see if we can figure out why there are so many empty values here. My intuition is that it has something to do with whether the individual is in school or not. For example, a two year old can't really be "behind in school". Nor can a graduate. We'll also check to see if there's been any discussion for how this field was created by IDB.

In [None]:
print('Stats for potentially related characteristics for individuals where rez_esc isna.')
print(train[train.rez_esc.isna()][['age', 'escolari', 'meaneduc']].describe())

print('')
print('Stats for potentially related characteristics for individuals where rez_esc is not null.')
print(train[train.rez_esc.notna()][['age', 'escolari', 'meaneduc']].describe())

What jumps out from this analysis is that the individuals with a non-null `rez_esc` are all between 7 and 17 years old. The other variables don't show an obvious difference. Indeed, IDB clarifies in a discussion about this variable: 

> This variable is only collected for people between 7 and 19 years of age and it is the difference between the years of education a person should have and the years of education he/she has. it is capped at 5.

As a first step, then, let's create a custom transformer that fills in 0 for anyone who is younger than 7 or older than 19.

In [None]:
class ZerofillRezEscOutOfBounds(BaseEstimator, TransformerMixin):
    """Zerofill rez_esc for any row where age is < 7 or age is > 19."""
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X.loc[X.rez_esc.isna() & ((X.age < 7) | (X.age > 19)), 'rez_esc'] = 0
        return X
        
    
zerofiller = ZerofillRezEscOutOfBounds()
train_zerofilled = zerofiller.transform(train.copy())

assert not train_zerofilled[(train_zerofilled.age < 7) | (train_zerofilled.age > 19)].rez_esc.isna().any(), 'There are individuals younger than 7 or older than 19 with a na value for rez_esc.'
print(f"There are {train_zerofilled.rez_esc.isna().sum()} individuals with a na value for rez_esc.")
print(f"There are {train_zerofilled[(train_zerofilled.age < 7) | (train_zerofilled.age > 19)].rez_esc.isna().sum()} individuals younger than 7 or older than 19 with a na value for rez_esc.")

For the remaining 350 individuals between 7 and 19, we'll begin by simply imputing their values as part of our data preprocessing pipeline. If needed, we can look further into this later and see whether we can make a more meaningful inference from other data. For now, though, we'll spend the time on getting a working model and just use a basic imputer.
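
As a rough illustration of what a basic imputer does, the sketch below applies scikit-learn's `SimpleImputer` to a made-up `rez_esc` column. The toy values and the median strategy are assumptions chosen for illustration, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up): school-age individuals, two with a missing rez_esc
toy = pd.DataFrame({'age': [8, 12, 15, 17, 10],
                    'rez_esc': [0.0, 1.0, np.nan, 3.0, np.nan]})

# strategy='median' is an assumption; SimpleImputer defaults to the mean
imputer = SimpleImputer(strategy='median')
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

The imputer learns its fill value in `fit`, so inside a pipeline the value learned on the training set is reused on the test set.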

I see in the IDB response that this should be capped at 5. We should check to see that everyone is capped correctly or evaluate whether this matters.

In [None]:
train.loc[train.rez_esc > 5, id_features + individual_features].head()

In [None]:
test.loc[test.rez_esc > 5, id_features + individual_features].head()

It looks like there's one row in the test set with a value higher than the cap. I would normally assume that a value of 99 should just be set to the max, but in this case it looks like bad input. It's for an 8-year-old boy who has `instlevel2`, or "incomplete primary". This person actually seems right on track. Even if he were just starting kindergarten, he wouldn't be 5 years behind in school. For now, we'll set anything higher than 5 to 0.

In [None]:
class ZeroMaxRezEsc(BaseEstimator, TransformerMixin):
    """Zero out rez_esc for any row where rez_esc is > the prescribed max of 5."""
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X.loc[X.rez_esc > 5, 'rez_esc'] = 0
        return X
        
    
zero_rez_esc = ZeroMaxRezEsc()
test_zero_rez_esc = zero_rez_esc.transform(test.copy())

assert len(test_zero_rez_esc[test_zero_rez_esc.rez_esc > 5]) == 0, 'There are individuals with greater than 5 rez_esc'

### v18q1

Next, let's look at `v18q1`. The description for this field is "number of tablets household owns". There's also a field `v18q`, which represents whether the individual "owns a tablet". Theoretically, `v18q1` would be a derived field summing `v18q` over the individuals in the household. It's also possible that a blank `v18q1` represents a household without a tablet. Let's see if we can confirm both hypotheses before plowing forward.

In [None]:
train.loc[train.v18q1 > 1, ['Id', 'idhogar', 'parentesco1', 'v18q', 'v18q1']].head(20)

Wrong! We can immediately see that `v18q1`, the "number of tablets household owns", is *not* the sum of `v18q` over the individuals in the household. Household 28ec0c747 lists 2 for `v18q1`, but a 1 for each of the 6 individuals' `v18q`.

I'm now realizing that `v18q` is probably a boolean field indicating whether the household for that individual owns a tablet, similar to many of the other fields in the dataset. We can check.

In [None]:
train.v18q.value_counts()

Bingo! All of the values in the `v18q` column are 0s or 1s. ***And...*** that 7342 count rings a bell. In fact, it's the number of NaN values in the `v18q1` column. It's now pretty reasonable to assume that any individual with a 0 in `v18q` is part of a household that doesn't own a tablet and that, therefore, the count of tablets in the household, `v18q1`, should be set to 0. First, let's make sure this isn't a fluke and that there are households with more than 1 tablet. 

In [None]:
print(train.v18q1.value_counts())
# Parenthesize the comparison: & binds more tightly than ==
print(len(train.loc[(train.v18q == 0) & train.v18q1.isna()]))

Yep, I feel comfortable now creating a transformer to set those 0s. Moreover, we might want to consider excluding `v18q` from the final features used in the models, since it's redundant with `v18q1`: a 0 for `v18q1` is equivalent to a 0 for `v18q`, and anything else is equivalent to a 1 for `v18q`.
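
To make the redundancy argument concrete, here is a minimal sketch on made-up data. It counts the rows where `v18q` is 0 and `v18q1` is missing, then shows that the flag can be rebuilt from the zero-filled count. The toy values are assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up): v18q is the household tablet flag, v18q1 the tablet count
toy = pd.DataFrame({'v18q': [0, 1, 1, 0],
                    'v18q1': [np.nan, 1.0, 3.0, np.nan]})

# Parenthesize each comparison: & binds more tightly than ==
missing_without_tablet = ((toy.v18q == 0) & toy.v18q1.isna()).sum()

# After zero-filling, v18q can be rebuilt from v18q1,
# so the flag carries no extra information
filled = toy.v18q1.fillna(0)
rebuilt_flag = (filled > 0).astype(int)

print(missing_without_tablet)
print((rebuilt_flag == toy.v18q).all())
```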

In [None]:
class ZerofillV18q1ForFalseV18Q(BaseEstimator, TransformerMixin):
    """Zerofill v18q1 for any row where v18q is 0."""
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X.loc[X.v18q == 0, 'v18q1'] = 0
        return X
        
    
zerofiller = ZerofillV18q1ForFalseV18Q()
train_zerofilled = zerofiller.transform(train.copy())

assert not train_zerofilled.v18q1.isna().any(), 'There are individuals with na value for v18q1.'

### v2a1

Next up, let's look at the other attribute that had a lot of NaN values, `v2a1`, described as "Monthly rent payment". Why would someone not have a monthly rent payment? I can only think of a few valid reasons: they own a house outright, live with someone else rent-free, or are homeless. If we can't figure out how to fill in some of these values, we might have to ignore this field. That would be a shame, though, since there's probably a decent correlation between a household's poverty level and its rent (or whether the household owns its home). 

Again, the representative from IDB offers some help in the [Kaggle discussions](https://www.kaggle.com/c/costa-rican-household-poverty-prediction/discussion/61403#360694):

> In the example of v2a1 you should first filter by tipoviv3. Most households that don't have information will be households that do not rent or are not paying loans (tipoviv2). If a household is tipoviv3 and does not have information then you need to make a choice about the treatment of the variable either assume 0, impute a value or not use the variable. Basically in those cases we don't know the value.

Let's see what we can find out following that path.

In [None]:
for v in ['tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5']:
    print(f"Value counts for {v} for individuals with missing v2a1:")
    print(train.loc[train.v2a1.isna(), v].value_counts())
    print('')

We can see that most of the individuals who have a missing `v2a1` have a 1 for `tipovivi1`. Looking at the data definitions, this indicates individuals who "own and fully paid house". This corresponds to one of the hypotheses we had for why `v2a1` could legitimately be blank. For these, we should fill with 0s.

What about the others with a missing `v2a1`? A few have a 1 for `tipovivi4` and the rest have a 1 for `tipovivi5`. These represent "precarious" and "other(assigned,  borrowed)", respectively. I'm not sure exactly what precarious refers to, but homes that are assigned or borrowed (`tipovivi5`) can probably also be assigned a 0 for `v2a1`. Since `tipovivi4` represents such a small percentage of this set, we could choose to impute those values or set them to 0 as well. As a first round, we're going to leave them blank and impute them as part of our preprocessing pipeline, but we should flag this as something to revisit as we try to squeeze more performance from our models. We should probably also do some research to find out whether "precarious" is a common designation for housing. 

In [None]:
class ZerofillV2a1(BaseEstimator, TransformerMixin):
    """Zerofill v2a1 if tipovivi1 or tipovivi5 is 1."""
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X.loc[(X.v2a1.isna()) & ((X.tipovivi1 == 1) | (X.tipovivi5 == 1)), 'v2a1'] = 0
        return X
        
    
zerofiller = ZerofillV2a1()
train_zerofilled = zerofiller.transform(train.copy())

# Assert that everyone that isn't a tipovivi4 doesn't have a NaN value for v2a1
assert not train_zerofilled[train_zerofilled.tipovivi4 == 0].v2a1.isna().any(), 'There are individuals with na value for v2a1.'

Let's verify we've handled all the missing values we can.

In [None]:
nan_pipeline = make_pipeline(ZerofillRezEscOutOfBounds(), ZerofillV18q1ForFalseV18Q(), ZerofillV2a1())
train_nan = nan_pipeline.fit_transform(train.copy())

print_nan_counts(train_nan)

We've addressed all of the columns with a lot of missing values. We will handle the remaining missing values by imputing them as part of our preprocessing pipeline. The other two attributes have so few missing values that we can either impute them as well or drop those rows. For later tuning, we may want to look at them further to see if we can make some smart decisions about what they should be. For now, I think we've successfully built some transformers to handle NaN values. Next up, let's handle text and categorical attributes.

## Text & Categorical Attributes

The way we typically handle text columns is either one-hot encoding them or, if the values are ordinal in nature, transforming them into an obvious numeric value. Let's see what we've got here.

In [None]:
train.select_dtypes(include='object').head()

Aside from the IDs of the individual and household, there are really three columns, each of which looks to have a mix of numbers and yes/no values. According to the data descriptions:

> dependency, Dependency rate, calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)

> edjefe, years of education of male head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0

> edjefa, years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0

It seems like for `edjefe` and `edjefa` they added "no" for anyone who had 0 years of education and "yes" for anyone with 1. We can easily transform those back to numeric columns. I am a little suspicious of why they would do the latter and would want to validate with the stakeholders that they didn't transform anyone with more than 1 year of education into a "yes". For now, we'll respect the data definition.

`dependency` is a rate, though, and its description doesn't list this rule. Luckily, I found in the discussion that this is a general rule and can be applied to the `dependency` field as well:

> yes it is a general rule we applied to the database. In these three cases (edjefe, edjefa, dependency), when converting formats the variables were not delabeled for some reason. In the database yes is always 1 and no is always 0.

In [None]:
class TransformYesNoToNumeric(BaseEstimator, TransformerMixin):
    """Transform edjefe, edjefa, and dependency yes/no values to numeric values.
    yes=1 and no=0."""
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        replacements = {'yes': 1, 'no': 0}
        columns = [('edjefe', 'uint8'), 
                   ('edjefa', 'uint8'), 
                   ('dependency', 'float16')]
    
        for column, converted_type in columns:    
            if X[column].dtype == 'object':
                X[column] = X[column].replace(replacements).astype(converted_type)
        
        return X

yes_no_transformer = TransformYesNoToNumeric()
train_yes_no_transformed = yes_no_transformer.transform(train.copy())

# Assert that all columns aside from Id and idhogar are numeric
assert train_yes_no_transformed.select_dtypes(include='object').columns.values.tolist() == ['Id', 'idhogar'], 'There are columns aside from Id and idhogar that are type object.'

## Aggregating Individual Features to Household Features

Since the primary business objective is concerned with households, we're going to be training our model just on households. It would be a shame to lose all of the information embedded in the individual attributes. To address this, let's add some features that aggregate the individual attributes to the household level by applying a few different statistics.

In [None]:
class AggregateIndividualFeatures(BaseEstimator, TransformerMixin):
    """Aggregate individual features per household by grouping them by idhogar and
    applying sum, min, max, std, mean, and median.
    
    New features will be added as {feature name}-{aggregation type}, e.g. age-std
    """
    def __init__(self):
        self.excluded_individual_features = ['v18q', 'male', 'female', 'Target']
        self.individual_features_to_aggregate = [feature for feature in individual_features if feature not in self.excluded_individual_features]
        self.aggregations = ['sum', 'min', 'max', 'std', 'mean', 'median']
        self.aggregated_features = []
            
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        if 'instlevel' in X.columns and 'instlevel' not in self.individual_features_to_aggregate:
            self.individual_features_to_aggregate.append('instlevel')
            
        aggregates = X[id_features + self.individual_features_to_aggregate].groupby("idhogar").agg(self.aggregations)
        aggregates.columns = ['-'.join(column).strip() for column in aggregates.columns.values]
        self.aggregated_features = aggregates.columns.values.tolist()
    
        return X.merge(aggregates, on='idhogar', how='left')

    
ind_aggregator = AggregateIndividualFeatures()
train_ind_aggregated = ind_aggregator.transform(train.copy())

# To be exhaustive, we could loop through all households, but this spot check should 
# give us a lot more confidence and be a lot faster.
household_id = '6893e65ca'

for aggregation in ind_aggregator.aggregations:
    manual = train.loc[train.idhogar == household_id].age.apply(aggregation)
    aggregated = train_ind_aggregated.loc[(train_ind_aggregated.idhogar == household_id), [f"age-{aggregation}"]].iloc[0].values[0]
    assert manual == aggregated, f"Calculated {aggregation} for age doesn't match aggregation for household {household_id}"
    print(f"Calculated {aggregation} for age, {aggregated}, matches manual aggregation, {manual}, for household {household_id}")
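
To make the column naming concrete, here is a toy sketch of the same group, flatten, and merge steps on made-up data. The two-household frame and the reduced list of aggregations are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made up): two households, identified by idhogar
toy = pd.DataFrame({'idhogar': ['a', 'a', 'b'],
                    'age': [30, 10, 45],
                    'escolari': [11, 4, 6]})

# Aggregating with a list of functions yields two-level column labels
aggregates = toy.groupby('idhogar').agg(['sum', 'max'])

# Flatten ('age', 'sum') into 'age-sum'
aggregates.columns = ['-'.join(column) for column in aggregates.columns]

# Broadcast the household-level values back onto every individual row
merged = toy.merge(aggregates, on='idhogar', how='left')
print(merged)
```

Every member of a household ends up with the same aggregated values, which is what lets a later step keep a single row per household.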

## Ordinalize features

If we look at some of the feature names, we can see that there are some feature groups that have been broken out into a series of boolean values, e.g. instlevel[1-9]. This makes sense for any categorical feature when there isn't really any order to the categories. In the case of instlevel, though, each category is related to the other categories in an orderly fashion. If we remap these to values that capture the scale, we'll be able to reduce the dimensionality of our models as well as pass them more information. 

For now, we'll use a linear scaling, though I could also see an argument that exponential growth could be more accurate in the case of instlevel. 
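
As a small sketch of the idea, the example below collapses three made-up boolean columns into one ordinal column and then derives a hypothetical exponential variant. The toy frame and the base-2 scale are assumptions, not choices made in this notebook.

```python
import pandas as pd

# Toy data (made up): exactly one instlevel flag is set per row
toy = pd.DataFrame({'instlevel1': [1, 0, 0],
                    'instlevel2': [0, 0, 1],
                    'instlevel3': [0, 1, 0]})

columns = ['instlevel1', 'instlevel2', 'instlevel3']

# idxmax(axis=1) returns the name of the column holding the 1;
# its position in the ordered list (plus one) is the linear ordinal value
linear = toy[columns].idxmax(axis=1).apply(lambda name: columns.index(name) + 1)

# Hypothetical alternative: an exponential scale that widens
# the gap between the higher levels
exponential = 2 ** (linear - 1)

print(linear.tolist())
print(exponential.tolist())
```

The linear version assumes every step between levels is equally large, while the exponential version treats each step as twice the previous one.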

In [None]:
class Ordinalizer(BaseEstimator, TransformerMixin):
    """Add an ordered numeric attribute for previously broken out boolean attributes"""
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        features = ['instlevel', 'epared', 'etecho', 'eviv']

        for feature in features:
            columns = [column for column in all_features if column.startswith(feature)]
            X[feature] = X.loc[:, columns].idxmax(1).apply(lambda x: columns.index(x) + 1)
            
            # Don't drop instleveli columns since we'll use them in household aggregation
            if feature != 'instlevel':
                X.drop(columns, axis=1, inplace=True)
                
        return X
    
    
# Tests
ordinalizer = Ordinalizer()
train_ordinalized = ordinalizer.transform(train.copy())
    
# Test 10 random rows
for index, individual in train_ordinalized.sample(10).iterrows():
    for feature in ['instlevel', 'epared', 'etecho', 'eviv']:
        assert train.loc[index, f"{feature}{individual[feature]}"] == 1, f"Ordinalized {feature} doesn't match original boolean {feature} columns"
                      
# Test dropped columns
for feature in ['epared', 'etecho', 'eviv']:
    feature_columns = [column for column in train_ordinalized.columns if column.startswith(feature)]
    assert len(feature_columns) == 1, f"Original boolean {feature} columns are still present."

# Test AggregateIndividualFeatures picks up instlevel for aggregation
household_id = "6893e65ca"

instlevel_ind_aggregator = AggregateIndividualFeatures()
train_instlevel_ind_aggregated = instlevel_ind_aggregator.transform(train_ordinalized)
                      
assert 'instlevel-sum' in instlevel_ind_aggregator.aggregated_features, "instlevel is not in aggregated features."
                      
for aggregation in instlevel_ind_aggregator.aggregations:
    manual = train_ordinalized.loc[train_ordinalized.idhogar == household_id].instlevel.apply(aggregation)
    aggregated = train_instlevel_ind_aggregated.loc[(train_instlevel_ind_aggregated.idhogar == household_id), [f"instlevel-{aggregation}"]].iloc[0].values[0]
    assert manual == aggregated, f"Calculated {aggregation} for instlevel doesn't match aggregation for household {household_id}"
    print(f"Calculated {aggregation} for instlevel, {aggregated}, matches manual aggregation, {manual}, for household {household_id}")

## Feature engineering

In [None]:
class FeatureCreator(BaseEstimator, TransformerMixin):
    """Adds additional features"""
    
    def __init__(self):
        self.created_features = []
    
    def fit(self, X, y=None):
        self.created_features = []
        
        self._add_feature('rent_per_room', lambda X: X.v2a1 / X.rooms)
        self._add_feature('rent_per_hhsize', lambda X: X.v2a1 / X.hhsize)
        self._add_feature('tablets_per_hhsize', lambda X: X.v18q1 / X.hhsize)
        self._add_feature('tablets_per_adult', lambda X: X.v18q1 / X.hogar_adul)
        self._add_feature('escolari-mean_to_age_mean', lambda X: X['escolari-mean'] / X['age-mean'])
        self._add_feature('rez_esc-mean_to_age_mean', lambda X: X['rez_esc-mean'] / X['age-mean'])
        self._add_feature('males_to_females', lambda X: X.r4h3 / X.r4m3)
        self._add_feature('under12_to_over12', lambda X: X.r4t1 / X.r4t2)
        
        return self
    
    def transform(self, X, y=None):
        for label, calculation in self.created_features:
            X[label] = calculation(X)
        
        # Todo: address NaNs and infinity, probably on a per-feature basis, e.g.
        # X.replace({'males_to_females': {np.inf: np.nan},
        #            'under12_to_over12': {np.inf: np.nan}}, inplace=True)
        X.replace(np.inf, np.nan, inplace=True)
        
        return X
        
    def _add_feature(self, label, calculation):
        self.created_features.append((label, calculation))

        
# Tests
ind_aggregator = AggregateIndividualFeatures()
train_ind_aggregated = ind_aggregator.transform(train.copy())
feature_creator = FeatureCreator()
features_created = feature_creator.fit_transform(train_ind_aggregated.copy())

assert features_created.rent_per_room.equals(features_created.v2a1 / features_created.rooms), "rent_per_room wasn't created correctly."
assert len(features_created.columns) - len(train_ind_aggregated.columns) == 8, "Didn't create the expected number of new features"

features_created[['rent_per_room', 'v2a1', 'rooms']].head()
assert len(features_created.columns[features_created[features_created == np.inf].any()]) == 0, "infinity values present"
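
The infinity handling matters because ratio features divide by counts that can be zero. The sketch below, on made-up data, shows what pandas produces in that case and one way to clean it up. Replacing negative infinity as well is an assumption that goes slightly beyond what the transformer above does.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the second and third households have no females
toy = pd.DataFrame({'r4h3': [2, 1, 0], 'r4m3': [1, 0, 0]})

# 1/0 gives inf and 0/0 gives NaN; no exception is raised
ratio = toy.r4h3 / toy.r4m3

# Map both infinities to NaN so a downstream imputer can fill them
cleaned = ratio.replace([np.inf, -np.inf], np.nan)

print(ratio.tolist())
print(cleaned.tolist())
```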

## Household split

In [None]:
household_groups = train.groupby('idhogar')

# Verify that all the rows in a household group have the same value for the household features
assert len(household_groups) == len(train.idhogar.unique()), "Length of household groups is not the same as the number of unique household ids."
assert not household_groups[["idhogar"] + household_features].var().any().any(), "Not all rows for a group have the same values for a household feature"

In [None]:
# Todo: I don't love that the separating of the target variable happens
# here. Seems like it should be an explicit step in the pipeline or that
# We should have a better way of getting the household targets. Probably 
# the latter.

class ExtractHouseholds(BaseEstimator, TransformerMixin):
    """Returns dataframe for households."""
    
    def __init__(self, individual_aggregator=None, feature_creator=None):
        self.individual_aggregator = individual_aggregator
        self.feature_creator = feature_creator
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        self.included_features = household_features + squared_household_features
        
        if self.individual_aggregator:
            self.included_features = self.included_features + self.individual_aggregator.aggregated_features
            
        if self.feature_creator:
            self.included_features = self.included_features + [feature for feature, _ in self.feature_creator.created_features]
            
        # Don't include features that have been dropped earlier in the pipeline
        self.included_features = [feature for feature in self.included_features if feature in X]
    
        return X.groupby('idhogar')[self.included_features].first()
 
# Tests
household_extractor = ExtractHouseholds()
train_households = household_extractor.transform(train.copy())
assert len(train_households) == len(train.idhogar.unique()), f"Extracted households length ({len(train_households)}) doesn't match unique household identifiers in train set ({len(train.idhogar.unique())})."

household_extractor = ExtractHouseholds(individual_aggregator=instlevel_ind_aggregator)
train_households = household_extractor.transform(train_instlevel_ind_aggregated.copy())
assert len(train_households) == len(train_instlevel_ind_aggregated.idhogar.unique()), f"Extracted households length ({len(train_households)}) doesn't match unique household identifiers in train set ({len(train_instlevel_ind_aggregated.idhogar.unique())})."

train_households.head()
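
One detail of `groupby(...).first()` is worth keeping in mind: it returns the first non-null value per column within each group, which is not necessarily the first row. The toy sketch below, on made-up data, shows the behavior.

```python
import numpy as np
import pandas as pd

# Toy data (made up): household 'a' has a missing value in its first row
toy = pd.DataFrame({'idhogar': ['a', 'a', 'b'],
                    'rooms': [np.nan, 4.0, 3.0],
                    'hhsize': [2, 2, 1]})

# first() skips nulls column by column within each group
households = toy.groupby('idhogar')[['rooms', 'hhsize']].first()
print(households)
```

Because household features should be identical for every member of a household, this mostly works in our favor: a value missing on one row is filled from another row of the same household.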

# Shortlist Models

At this point, we can get a basic model trained and we can begin trying to find a shortlist of potential models to fine tune.

In [None]:
import scipy

# Transformers
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Cross validation
from sklearn.model_selection import cross_validate,  RandomizedSearchCV
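# make_scorer is exposed by sklearn.metrics; the private sklearn.metrics.scorer
# module is no longer available in recent scikit-learn releases
from sklearn.metrics import f1_score, make_scorer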

# Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier

In [None]:
class FeatureSelector(BaseEstimator, TransformerMixin):
    """Returns a DataFrame restricted to the selected features."""
    
    def __init__(self, selected_features=[], excluded_features=[]):
        self.selected_features = selected_features
        self.excluded_features = excluded_features
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        if not self.selected_features:
            self.selected_features = X.columns.tolist()
            
        if self.excluded_features:
            self.selected_features = [feature for feature in self.selected_features if feature not in self.excluded_features]

        return X[self.selected_features]

In [None]:
from sklearn.exceptions import NotFittedError

class ModelEvaluator():
    def __init__(self, pipeline, estimators, features=[], excluded_features=[]):
        self.pipeline = pipeline
        self.estimators = [self._init_estimator(estimator) for estimator in estimators]
        self._preprocessed = False
        self.features = features
        self.excluded_features = excluded_features
        
        # Ensure we're working with clean data and that none of our previous work
        # has leaked into the datasets.
        self.train, self.test = import_train_test()
        self.results = []
        
        
    def _init_estimator(self, estimator):
        return {'estimator': estimator,
                'fit': False,
                'cv_results': {},
                'feature_importances': None,
                'tuned': False,
                'tuned_best_estimator': None,
                'tuned_best_score': None,
                'tuned_best_params': None,
                'tuned_cv_results': {}}
        
    def _preprocess_data(self):
        try:
            self.pipeline.named_steps['featureselector'].selected_features = self.features
            self.pipeline.named_steps['featureselector'].excluded_features = self.excluded_features
        except KeyError:
            pass
        
        self._trainX_preprocessed = self.pipeline.fit_transform(self.train.copy())

        # Since X is now households, we need the corresponding targets to those households
        self._trainY_preprocessed = self.train.copy().groupby('idhogar')['Target'].first()
        
        self._testX_preprocessed = self.pipeline.transform(self.test.copy())


        self._preprocessed = True
        
        
    def evaluate(self):
        if not self._preprocessed:
            self._preprocess_data()
            
        print(f"Evaluating on {self._trainX_preprocessed.shape[1]} features and {self._trainX_preprocessed.shape[0]} samples.")
        print("")
        
        for estimator in self.estimators:
            estimator['cv_results'] = self._cross_validate(estimator)
            estimator['feature_importances'] = self._calculate_feature_importances(estimator)
            
        self._print_cv_results()

        
    def _cross_validate(self, estimator):
        return cross_validate(estimator['estimator'], 
                              self._trainX_preprocessed, 
                              self._trainY_preprocessed, 
                              cv=5, 
                              scoring='f1_macro',
                              n_jobs=-1,
                              return_train_score=False)
        
        
    def _fit_estimator(self, estimator):
        estimator['estimator'].fit(self._trainX_preprocessed, self._trainY_preprocessed)
        estimator['fit'] = True

        
    def _calculate_feature_importances(self, estimator):
        if not estimator['fit']:
            self._fit_estimator(estimator)
            
        try:
            return self._feature_importances_dataframe(estimator['estimator'].feature_importances_)
        except AttributeError:
            return None
            
        
    def _feature_importances_dataframe(self, feature_importances):
        # Get the features we actually trained on
        # Todo - consider moving this to preprocessing step
        trained_features = self.pipeline.named_steps['featureselector'].selected_features

        return pd.DataFrame(data={'importance': feature_importances}, 
                            index=trained_features).sort_values(by='importance', ascending=False)
    
    def _print_cv_results(self):
        for estimator in self.estimators:
            print(f"Scores for {estimator['estimator']}")
            print(f"Mean Macro F1 Score: {estimator['cv_results']['test_score'].mean()}, SD={estimator['cv_results']['test_score'].std()}")
            print("")
                  
                  
    def tune_hyperparameters(self, estimator, param_dist, n_iter=20, cv=5):
        random_search = RandomizedSearchCV(estimator['estimator'], 
                                           param_distributions=param_dist,
                                           n_iter=n_iter, 
                                           cv=cv, 
                                           scoring='f1_macro',
                                           n_jobs=-1,
                                           verbose=1)
                  
        random_search.fit(self._trainX_preprocessed, self._trainY_preprocessed)
                  
        estimator['tuned'] = True
        estimator['tuned_best_estimator'] = random_search.best_estimator_
        estimator['tuned_best_score'] = random_search.best_score_
        estimator['tuned_cv_results'] = random_search.cv_results_
        estimator['tuned_best_params'] = random_search.best_params_
        estimator['tuned_best_index'] = random_search.best_index_
                  
        return random_search
                  
        
    def prepare_submissions(self):
        """Prepare a submission csv for every tuned estimator"""
        for estimator in self.estimators:
            if not estimator['tuned']:
                continue
            
            self._save_submission(estimator)
    
                  
    def _save_submission(self, estimator):
        # Nothing to save if the estimator hasn't been tuned yet
        if estimator['tuned_best_estimator'] is None:
            return
                  
        # Make predictions on test set
        predictions = self.test.copy().groupby('idhogar').first().reset_index()
        predictions['Target'] = estimator['tuned_best_estimator'].predict(self._testX_preprocessed)

        # Merge household predictions back to individuals in test set
        test_results = self.test[['Id', 'idhogar']].copy()
        test_results = test_results.merge(predictions[['idhogar', 'Target']].copy(), on="idhogar", how="left").drop("idhogar", axis=1)

        assert test_results.shape[0] == self.test.shape[0], "Number of results doesn't match number of test samples."

        filename = f"{estimator['tuned_best_estimator'].__class__.__name__}-tuned-{estimator['tuned_best_score']:.3f}.csv"
        self._save_predictions(test_results, filename)
        
                  
    def _save_predictions(self, predictions, filename):
        """Write results to csv file."""
        predictions.to_csv(filename, index=False)
        print(f"Wrote results to {filename}")
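
The merge step in `_save_submission` can be illustrated on a toy frame. This is a minimal sketch, not the notebook's data: the ids and targets below are made up, and only the column names `Id`, `idhogar`, and `Target` come from the code above.

```python
import pandas as pd

# Toy individuals (hypothetical ids): two households, three people
individuals = pd.DataFrame({
    'Id': ['ID_a', 'ID_b', 'ID_c'],
    'idhogar': ['h1', 'h1', 'h2'],
})

# One prediction per household (made-up targets)
household_predictions = pd.DataFrame({
    'idhogar': ['h1', 'h2'],
    'Target': [4, 2],
})

# A left merge keeps every individual and copies the household's prediction onto each one
submission = (individuals
              .merge(household_predictions, on='idhogar', how='left')
              .drop('idhogar', axis=1))

print(submission)
```

Because the merge is a left join on `idhogar`, the row count of the result matches the number of individuals, which is what the assertion in `_save_submission` checks.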

In [None]:
# store this so we can pass it into household extractor
individual_aggregator = AggregateIndividualFeatures()
feature_creator = FeatureCreator()

# Create pipeline
pipeline = make_pipeline(ZerofillRezEscOutOfBounds(),
                         ZeroMaxRezEsc(),
                         ZerofillV18q1ForFalseV18Q(), 
                         ZerofillV2a1(),
                         TransformYesNoToNumeric(),
                         Ordinalizer(), # Make sure this happens before individual aggregation
                         individual_aggregator,
                         feature_creator,
                         ExtractHouseholds(individual_aggregator=individual_aggregator, 
                                           feature_creator=feature_creator),
                         FeatureSelector(),
                         SimpleImputer(),
                         MinMaxScaler()
                        )

# Estimators we want to evaluate
estimators = [DecisionTreeClassifier(random_state=42),
              KNeighborsClassifier(),
              LogisticRegression(random_state=42, solver='lbfgs', multi_class='multinomial', max_iter=1000),
              RandomForestClassifier(random_state=42, n_estimators=100),
              AdaBoostClassifier(random_state=42),
              GradientBoostingClassifier(random_state=42),
              SGDClassifier(random_state=42, max_iter=1000, tol=1e-3),
              MLPClassifier(random_state=42, solver='lbfgs')]

evaluator = ModelEvaluator(pipeline, estimators)
evaluator.evaluate()

### Detect correlated features
Let's see if any features are highly correlated with each other and can be removed.

In [None]:
# Build a pipeline that performs everything except imputation and scaling
pipeline = make_pipeline(ZerofillRezEscOutOfBounds(),
                         ZeroMaxRezEsc(),
                         ZerofillV18q1ForFalseV18Q(), 
                         ZerofillV2a1(),
                         TransformYesNoToNumeric(),
                         Ordinalizer(), # Make sure this happens before individual aggregation
                         individual_aggregator,
                         feature_creator,
                         ExtractHouseholds(individual_aggregator=individual_aggregator, 
                                           feature_creator=feature_creator)
                        )

evaluator_corr = ModelEvaluator(pipeline, estimators)
evaluator_corr._preprocess_data()

corr = evaluator_corr._trainX_preprocessed.corr()

In [None]:
print(evaluator_corr.train.columns)
print(evaluator_corr._trainX_preprocessed.columns)

In [None]:
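# Flag features whose absolute correlation with at least one other feature exceeds 0.95.
# Every feature correlates perfectly with itself, hence the "> 1".
correlated_features = corr[corr.abs().gt(0.95)].count() > 1

plt.figure(figsize=(25,25))
sns.heatmap(corr.loc[correlated_features, correlated_features])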

In [None]:
for attribute in corr.loc[correlated_features, correlated_features]:
    print(corr.loc[attribute, corr.loc[attribute].abs().gt(0.95)])
    print()

So, we definitely have some features that are highly correlated with others. When looking through the list and at the heatmap, it's clear that a lot of the correlations stem from the aggregation step on the individuals. There are a lot of correlations between an individual aggregation's `sum`, `max`, and `std`. We should probably only include one of these, maybe two, but definitely not all three.

The `instlevel` and `escolari` mins and sums are highly correlated.

`area1` and `area2` are perfectly inversely correlated, so we should only include one of them.

`public` and `coopele` aren't perfectly inversely correlated, but the correlation is still high, so we should only include one of those.

`r4t3`, `tamhog`, `hhsize`, and `hogar_total` are also all highly correlated and seem to all be about total household size. We should just include one of them.
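
To read off which features pair up, rather than scanning the heatmap, one option is to list each highly correlated pair once. The following is a minimal sketch on made-up toy data (the columns `a`, `b`, `c` are hypothetical); it uses the same 0.95 threshold as above but is not the approach the notebook itself takes.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 'b' is an exact multiple of 'a', 'c' is only weakly related
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 1, 4, 2, 3],
})

corr = df.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Stack into a Series indexed by (feature_1, feature_2) and filter on |r| > 0.95
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > 0.95]

print(high_pairs)
```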

In [None]:
size_features = ['r4t3', 'tamhog', 'hhsize', 'hogar_total']

def count_differing_rows(df, features):
    """Count rows where the given features don't all share the same value."""
    return len(df.loc[df[features].var(axis=1) != 0])

# How many rows disagree across all four attributes, and across the last three?
for name, df in [('train', train), ('test', test)]:
    print(f"Rows in {name} where {size_features} differ: {count_differing_rows(df, size_features)}")
    print(f"Rows in {name} where {size_features[1:]} differ: {count_differing_rows(df, size_features[1:])}")

train.loc[train[size_features].var(axis=1) != 0, size_features + ['Target']].head(10)

Well, in 48 rows of the train set and 101 rows of the test set, these 4 attributes aren't all equal. After inspecting the rows, it looks like `r4t3` is the one that differs and the other 3 are equal in all cases. With this knowledge, we should probably keep `r4t3` and one of the other three. We should probably also create a feature for the difference between these two.
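
A difference feature like the one suggested here could look as follows. This is only a sketch on made-up rows, and the name `r4t3_hhsize_diff` is hypothetical; it is not created anywhere else in this notebook.

```python
import pandas as pd

# Toy rows (made-up values); the second row mimics a disagreement between the two counts
households = pd.DataFrame({
    'r4t3':   [4, 5, 2],
    'hhsize': [4, 3, 2],
})

# Hypothetical feature: zero when the counts agree, non-zero when they differ
households['r4t3_hhsize_diff'] = households['r4t3'] - households['hhsize']

print(households)
```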

There's one other attribute that seems to be related even though it's not showing up as correlated: `tamviv`, defined as "number of persons living in the household". I'm not sure why that's different from `r4t3`, "Total persons in the household". Surely, `r4t3` isn't including skeletons in the closet, right? Maybe semi-permanent guests? Let's see how they differ.

In [None]:
print(train[['tamviv', 'r4t3', 'tamhog', 'hhsize', 'hogar_total']].corr())
print(test[['tamviv', 'r4t3', 'tamhog', 'hhsize', 'hogar_total']].corr())

In [None]:
print(train.loc[(train.tamviv != train.r4t3), ['tamviv', 'r4t3', 'tamhog', 'hhsize', 'hogar_total', 'Target']].head())
print(train.loc[(train.tamviv > train.r4t3), ['tamviv', 'r4t3', 'tamhog', 'hhsize', 'hogar_total', 'Target']].head())

equality = train.loc[(train.tamviv != train.r4t3)].equals(train.loc[(train.tamviv > train.r4t3)])
print(f"When tamviv isn't equal to r4t3, tamviv is greater than r4t3, train: {equality}")

equality = test.loc[(test.tamviv != test.r4t3)].equals(test.loc[(test.tamviv > test.r4t3)])
print(f"When tamviv isn't equal to r4t3, tamviv is greater than r4t3, test: {equality}")

# Whenever r4t3 is different than the hhsize attributes, is tamviv always equal to r4t3?
# Count the rows where it isn't; 0 means yes.
print(len(train.loc[(train.r4t3 != train.hhsize) & (train.tamviv != train.r4t3), ['tamviv', 'r4t3', 'tamhog', 'hhsize', 'hogar_total', 'Target']]))

So what should we do about `tamviv` and `r4t3`? The correlation isn't high enough to hit the 0.95 threshold in our automatic correlation detection above, but it's still high. It seems as if, when the two differ, there are more people living in the household than are counted as part of it.

It's hard to say exactly what's going on here and what the difference is, if there is a legitimate one, without talking to the stakeholders, but I think the best thing to do is merge the two columns by taking the max of the two into a new column. We'll create a new attribute `tamviv_r4t3_combined` and experiment with how it performs when only it is left in vs. all three. Worst case, we'll end up dropping it if the models don't use it.

In [None]:
class TamvivR4t3Combined(BaseEstimator, TransformerMixin):
    """Adds a new attribute with the max of tamviv and r4t3"""
    
    def __init__(self):
        self.created_features = []
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame isn't modified in place
        X = X.copy()
        X['tamviv_r4t3_combined'] = X[['r4t3', 'tamviv']].max(axis=1)
        
        self.created_features = ['tamviv_r4t3_combined']

        return X
    
    
tamviv_r4t3_combiner = TamvivR4t3Combined()
tamviv_r4t3_preprocessed = tamviv_r4t3_combiner.transform(train.copy())

max_check = tamviv_r4t3_preprocessed.tamviv_r4t3_combined.equals(tamviv_r4t3_preprocessed[['r4t3', 'tamviv']].max(axis=1))
assert max_check, "TamvivR4t3Combined didn't create the correct max column"

In [None]:
# store this so we can pass it into household extractor
individual_aggregator = AggregateIndividualFeatures()
feature_creator = FeatureCreator()

# Create pipeline
pipeline = make_pipeline(ZerofillRezEscOutOfBounds(),
                         ZeroMaxRezEsc(),
                         ZerofillV18q1ForFalseV18Q(), 
                         ZerofillV2a1(),
                         TransformYesNoToNumeric(),
                         Ordinalizer(), # Make sure this happens before individual aggregation
                         individual_aggregator,
                         feature_creator,
                         ExtractHouseholds(individual_aggregator=individual_aggregator, 
                                           feature_creator=feature_creator),
                         TamvivR4t3Combined(),
                         FeatureSelector(),
                         SimpleImputer(),
                         MinMaxScaler()
                        )

correlated_features_list = correlated_features[correlated_features].index

# sum_features = [feature for feature in correlated_features_list if feature[-4:] == '-sum']
# max_features = [feature for feature in correlated_features_list if feature[-4:] == '-max']
# std_features = [feature for feature in correlated_features_list if feature[-4:] == '-std']

excluded_features = ['elimbasu5', 'area2', 'r4t3', 'tamviv', 'coopele', 'SQBdependency', 'hogar_total', 'tamhog'] #+ squared_features + sum_features + max_features
kept_features = ['area1', 'public', 'hhsize', 'dependency']

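# Exclude the aggregation features more selectively than excluding them all:
# for each correlated feature not handled yet, keep it and exclude everything
# it is highly correlated with
for feature in correlated_features_list:
    if feature not in excluded_features + kept_features:
        # The upper bound just under 1 skips the feature's correlation with itself
        for excluded_feature in corr.loc[feature, corr.loc[feature].abs().between(0.95, 0.9999999)].index:
            excluded_features.append(excluded_feature)
            
        kept_features.append(feature)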
        
# print(f"Excluding features: {excluded_features}")

evaluator_excluded = ModelEvaluator(pipeline, estimators, excluded_features=excluded_features)
evaluator_excluded.evaluate()

In [None]:
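def select_important_features(feature_importances, threshold=0.9):
    """Return the leading features whose cumulative importance reaches the threshold.

    Assumes feature_importances is indexed by feature name, holds the importance
    in its first column, and is sorted from most to least important.
    """
    total_importance = 0
    important_features = []

    for feature, importance in feature_importances.iterrows():
        # Use positional access; importance is a row Series labelled by column name
        total_importance += importance.iloc[0]
        important_features.append(feature)

        if total_importance >= threshold:
            break
        
    return important_features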
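
The same cumulative cut-off can be written without an explicit loop. The sketch below uses made-up importances and placeholder feature names (`feature_a` to `feature_d`), and it sorts the values itself instead of relying on the input order; it is meant as an illustration of the idea rather than a replacement for the function above.

```python
import pandas as pd

# Hypothetical importances, indexed by feature name and summing to 1
feature_importances = pd.DataFrame(
    {'importance': [0.10, 0.50, 0.05, 0.35]},
    index=['feature_a', 'feature_b', 'feature_c', 'feature_d'],
)

# Sort first so the most important features are accumulated first
ranked = feature_importances['importance'].sort_values(ascending=False)
cumulative = ranked.cumsum()

# Keep every feature up to and including the one that crosses the threshold
threshold = 0.9
n_keep = int((cumulative < threshold).sum()) + 1
important = ranked.index[:n_keep].tolist()

print(important)
```

With these numbers the running total is 0.50, 0.85, 0.95, 1.00, so the first three features are kept at a 0.9 threshold.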

In [None]:
# Get important features from estimator
feature_importance_estimator = [estimator for estimator in evaluator_excluded.estimators if estimator['estimator'].__class__.__name__ == 'GradientBoostingClassifier'][0]
important_features = select_important_features(feature_importance_estimator['feature_importances'], threshold=0.95)

evaluator_important_features = ModelEvaluator(pipeline, estimators, important_features)
evaluator_important_features.evaluate()

# Tune Models

At this point, it seems like the `GradientBoostingClassifier` and `AdaBoostClassifier` stand out, so we should turn our attention to tuning those models.

In [None]:
estimators = [GradientBoostingClassifier(random_state=42)]
tuning_evaluator = ModelEvaluator(pipeline, estimators, important_features)
tuning_evaluator.evaluate()

In [None]:
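import scipy.stats

# truncnorm's first two arguments are clip points in standard deviations from loc,
# so learning_rate is drawn from roughly [0.05, 0.2]
param_dist = {'learning_rate': scipy.stats.truncnorm(-0.995, 2, loc=0.1, scale=0.05),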
              'n_estimators': scipy.stats.randint(80, 1000),
              'max_depth': scipy.stats.randint(3, 8),
              'min_samples_split': scipy.stats.randint(2, 100),
              'min_samples_leaf': scipy.stats.randint(20, 60),
              'max_features': ['sqrt', None, 'log2'],
              'subsample': scipy.stats.uniform(loc=0.6, scale=0.4)}

results = tuning_evaluator.tune_hyperparameters(tuning_evaluator.estimators[0], param_dist, n_iter=20, cv=5)
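
The `truncnorm` parameters are easy to misread, because the two clip points are given in standard deviations relative to `loc` and `scale`, not as raw values. A small sketch, using the same numbers as the `learning_rate` entry above, shows the range that actually gets sampled (the sample size and seed are arbitrary choices for illustration):

```python
from scipy import stats

loc, scale = 0.1, 0.05
a, b = -0.995, 2  # clip points, in standard deviations from loc

learning_rate = stats.truncnorm(a, b, loc=loc, scale=scale)

# The sampled values fall in [loc + a * scale, loc + b * scale]
lower = loc + a * scale
upper = loc + b * scale
print(lower, upper)

# Draw a few samples to confirm they stay inside those bounds
samples = learning_rate.rvs(size=1000, random_state=42)
print(samples.min(), samples.max())
```

So the search never tries a learning rate at or below 0.05, and never one above 0.2.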

In [None]:
print(results.best_params_)
print(results.best_score_)
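
Since the search is scored with `f1_macro`, `best_score_` is the mean cross-validated macro F1 of the best parameter set. As a reminder of what that measures, here is a small sketch with made-up labels: macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many samples it has.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for three classes
y_true = np.array([1, 1, 2, 2, 2, 3])
y_pred = np.array([1, 2, 2, 2, 3, 3])

# F1 for each class separately, then the macro average
per_class = f1_score(y_true, y_pred, average=None)
macro = f1_score(y_true, y_pred, average='macro')

print(per_class)
print(macro)
```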

In [None]:
# tuning_evaluator.estimators[0]['tuned_cv_results']

# Submission

In [None]:
tuning_evaluator.prepare_submissions()