# Costa Rican Household Poverty Level Prediction


The objective of the Costa Rican Household Poverty Level Prediction contest is to develop a machine learning model that can predict the poverty level of households using both individual and household characteristics. This "data science for good" project offers the opportunity to put our skills towards a task more beneficial to society than getting people to click on ads!

## Problem and Data Explanation

The data for this competition is provided in two files: `train.csv` and `test.csv`. The training set has 9557 rows and 143 columns while the testing set has 23856 rows and 142 columns. Each row represents __one individual__ and each column is a __feature, either unique to the individual, or for the household of the individual__. The training set has one additional column, `Target`, which represents the poverty level on a 1-4 scale and is the label for the competition. A value of 1 is the most extreme poverty. 

This is a __supervised multi-class classification machine learning problem__:

* __Supervised__: provided with the labels for the training data
* __Multi-class classification__: Labels are discrete values with 4 classes

### Objective

The objective is to predict poverty on a __household level__. We are given data on the individual level, with each individual having unique features but also information about their household. To build a dataset for the task, we'll have to perform some _aggregations of the individual data_ for each household. Moreover, we have to make a prediction for every individual in the test set, but _"ONLY the heads of household are used in scoring"_, which means we want to predict poverty on a household basis. 

__Important note: while all members of a household should have the same label in the training data, there are errors where individuals in the same household have different labels. In these cases, we are told to use the label for the head of each household, which can be identified by the rows where `parentesco1 == 1.0`.__ [competition main discussion](https://www.kaggle.com/c/costa-rican-household-poverty-prediction/discussion/61403)

The `Target` values represent poverty levels as follows:

    1 = extreme poverty 
    2 = moderate poverty 
    3 = vulnerable households 
    4 = non vulnerable households

The explanations for all 143 columns can be found in the [competition documentation](https://www.kaggle.com/c/costa-rican-household-poverty-prediction/data), but a few to note are below:

* __Id__: a unique identifier for each individual, this should not be a feature that we use! 
* __idhogar__: a unique identifier for each household. This variable is not a feature, but will be used to group individuals by household as all individuals in a household will have the same identifier.
* __parentesco1__: indicates if this person is the head of the household.
* __Target__: the label, which should be equal for all members in a household

### Metric

Predictions will be assessed by the __Macro F1 Score.__ 

```
from sklearn.metrics import f1_score
f1_score(y_true, y_predicted, average = 'macro')
```
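
To make the metric concrete, here is a minimal pure-Python sketch of how a macro F1 score can be computed: one F1 value per class, then an unweighted mean. The helper name `macro_f1` and the toy labels are assumptions for illustration only; in the notebook itself the score comes from `sklearn.metrics.f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Return (macro F1, per-class F1 dict): the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly gets an F1 of 0
        per_class[label] = 2 * tp / denom if denom else 0.0
    return sum(per_class.values()) / len(per_class), per_class


# Toy labels (hypothetical): a model that almost always predicts class 4
y_true = [1, 2, 3, 4, 4, 4, 4, 4]
y_pred = [4, 2, 4, 4, 4, 4, 4, 4]

score, per_class = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f'accuracy = {accuracy:.3f}, macro F1 = {score:.3f}')
```

Because every class counts equally, a model that mostly predicts the majority class can have a reasonable accuracy and still a low macro F1, which is why the minority poverty levels matter so much here.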

## Roadmap

a. Explore the data and visualize it.
b. Fill in missing values (NULL values) using the mean or median if the attribute is numeric, or the most frequently occurring value if the attribute is 'object' or categorical.
c. Perform feature engineering, possibly using only some selected numeric features.
d. Scale numeric features and, if required, one-hot encode categorical features.
e. If the number of features is very large, reduce dimensionality with PCA.
f. Select some estimators. Possible choices include some (or all) of these:

        GradientBoostingClassifier
        RandomForestClassifier
        KNeighborsClassifier
        ExtraTreesClassifier
        XGBoost
        LightGBM
   
   First perform modeling with default parameter values and record the baseline score.

g. Then tune the hyperparameters using Bayesian Optimization. 


In [None]:
# 1.0 Call libraries

# Data manipulation
%reset -f
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Set a few plotting defaults
%matplotlib inline

plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 18
plt.rcParams['patch.edgecolor'] = 'k'

# 1.0.1 For measuring time elapsed
import time

from collections import OrderedDict

In [None]:
# 1.1 Working with imbalanced data
# http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html
# Check imblearn version number as:
#   import imblearn;  imblearn.__version__
from imblearn.over_sampling import SMOTE, ADASYN

# 1.2 Processing data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import  OneHotEncoder as ohe
from sklearn.preprocessing import StandardScaler as ss
from sklearn.compose import ColumnTransformer as ct

# 1.3 Data imputation
from sklearn.impute import SimpleImputer


In [None]:
# 1.4 Model building
from sklearn.linear_model import LogisticRegression

# 1.5 for ROC graphs & metrics
import scikitplot as skplt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import average_precision_score
import sklearn.metrics as metrics

# to make this notebook's output stable across runs
# Note: the seed alone has not been enough so far; model outputs still vary between runs
np.random.seed(42)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
from sklearn.exceptions import ConvergenceWarning

# Filter out warnings from models

warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
warnings.filterwarnings('ignore', category = ConvergenceWarning)
warnings.filterwarnings('ignore', category = DeprecationWarning)
warnings.filterwarnings('ignore', category = UserWarning)
warnings.filterwarnings('ignore', category = FutureWarning)

# 1.9 Misc
import gc

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
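# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer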
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline


# Custom scorer for cross validation
scorer = make_scorer(f1_score, greater_is_better=True, average = 'macro')

In [None]:
# 1.3 Dimensionality reduction
from sklearn.decomposition import PCA

# 1.4 Data splitting and model parameter search
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from bayes_opt import BayesianOptimization

# 1.5 Modeling modules
# conda install -c anaconda py-xgboost
from xgboost.sklearn import XGBClassifier

In [None]:
# Model imports
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegressionCV, RidgeClassifierCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier

### Read in Data and Look at Summary Information

In [None]:
pd.options.display.max_columns = 150

# Read in data
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

#train = pd.read_csv('train.csv')
#test = pd.read_csv('test.csv')
train.head()


That gives us a look at all of the columns which don't appear to be in any order. To get a quick overview of the data we define `ExamineData`.

In [None]:
# 3.0 Let us understand train data
# 3.1 Begin by defining some functions
def ExamineData(x):
    """Print various characteristics of the DataFrame x."""
    print("Data shape:", x.shape)
    print("\nColumns:", x.columns)
    print("\nData types\n", x.dtypes)
    print("\nDescribe data\n", x.describe())
    print("\nData\n", x.head(2))
    print ("\nSize of data:", np.sum(x.memory_usage()))    # Get size of dataframes
    print("\nAre there any NULLS\n", np.sum(x.isnull()))


In [None]:
# 3.2 start examining data - commented after analysis due to large data dump on screen.
#ExamineData(train)

This tells us there are 130 integer columns, 8 float (numeric) columns, and 5 object columns. The integer columns probably represent Boolean variables (that take on either 0 or 1) or [ordinal variables](https://www.ma.utexas.edu/users/mks/statmistakes/ordinal.html) with discrete ordered values. The object columns might pose an issue because they cannot be fed directly into a machine learning model.

Let's glance at the test data which has many more rows (individuals) than the train. It does have one fewer column because there's no Target!

In [None]:
# commented after analysis due to large data dump on screen.
#ExamineData(test)

#### Integer Columns

Let's look at the distribution of unique values in the integer columns. For each column, we'll count the number of unique values and show the result in a bar plot.

#### Define plotting function `PlotKDE`

`PlotKDE` draws kernel density estimate (KDE) plots of the columns provided as `x`, colored by the value of the `Target`. With these plots, we can see whether a variable's distribution differs noticeably depending on the household poverty level.


In [None]:
def PlotKDE(x):
    """Plot a KDE of each column of x, one line per poverty level in train['Target']."""

    plt.figure(figsize = (20, 15))
#    plt.style.use('fivethirtyeight')
#    plt.style.available
    plt.style.use('seaborn-pastel')

    # Color mapping
    colors = OrderedDict({1: 'red', 2: 'orange', 3: 'blue', 4: 'green'})
    poverty_mapping = OrderedDict({1: 'extreme', 2: 'moderate', 3: 'vulnerable', 4: 'non vulnerable'})

    # Iterate through the columns
    for i, col in enumerate(x):
        ax = plt.subplot(8, 5, i + 1)
        # Iterate through the poverty levels
        for poverty_level, color in colors.items():
            # Plot each poverty level as a separate line
            sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                        ax = ax, color = color, label = poverty_mapping[poverty_level])
        
        plt.title(f'{col.capitalize()} Distribution'); plt.xlabel(f'{col}'); plt.ylabel('Density')

    plt.subplots_adjust(top = 2)


In [None]:
#PlotKDE(train.select_dtypes('int64'))

The columns with only 2 unique values represent Booleans (0 or 1). In a lot of cases, this boolean information is already on a household level. For example, the `refrig` column says whether or not the household has a refrigerator. When it comes time to make features from the Boolean columns that are on the household level, we will _not need to aggregate_ these. However, the Boolean columns that are on the individual level will need to be aggregated. 

#### Float Columns

Another column type is floats, which represent continuous variables. We can make a quick distribution plot to show the distribution of all float columns. We'll use an [`OrderedDict`](https://pymotw.com/2/collections/ordereddict.html) to map the poverty levels to colors because this keeps the keys and values in the same order as we specify (regular Python dictionaries also preserve insertion order from Python 3.7 onward, so this mainly matters on older versions).

The following graphs show the distributions of the `float` columns colored by the value of the `Target`. With these plots, we can see if there is a significant difference in the variable distribution depending on the household poverty level.

In [None]:
# 3.4 Visual examination of float columns
PlotKDE(train.select_dtypes('float'))

Later on we'll calculate correlations between the variables and the `Target` to gauge the relationships between the features, but these plots can already give us a sense of which variables may be most "relevant" to a model. For example, `meaneduc`, representing the average education of the adults in the household, appears to be related to the poverty level: __a higher average adult education is associated with higher values of the target, which are less severe levels of poverty__. The theme of the importance of education is one we will come back to again and again in this notebook! 

#### Object Columns


The `Id` and `idhogar` object types make sense because these are identifying variables. However, the other columns seem to be a mix of strings and numbers which we'll need to address before doing any machine learning. According to the documentation for these columns:

* `dependency`: Dependency rate, calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)
* `edjefe`: years of education of male head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0
* `edjefa`: years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0

For these three variables (`dependency`, `edjefe`, `edjefa`), "yes" = 1 and "no" = 0. We can correct the variables using a mapping and convert them to floats.

In [None]:
# 3.5 Object data types

train.select_dtypes('object').head()

mapObj = {"yes": 1, "no": 0}

In [None]:
# Apply same operation to both train and test
for df in [train, test]:
    # Fill in the values with the correct mapping
    df['dependency'] = df['dependency'].replace(mapObj).astype(np.float64)
    df['edjefa'] = df['edjefa'].replace(mapObj).astype(np.float64)
    df['edjefe'] = df['edjefe'].replace(mapObj).astype(np.float64)

train[['dependency', 'edjefa', 'edjefe']].describe()

In [None]:
PlotKDE(train.select_dtypes('float')) # the parameters are now classified as float

These variables are now correctly represented as numbers and can be fed into a machine learning model. 

Before starting feature engineering, we join the `test` and `train` dataframes. In feature engineering the same operations should be applied to both dataframes so that we end up with the same set of features. Later we can separate the two sets again based on the `Target` value, because the test data will have null values in the `Target` column.
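
As a small illustration of this join-then-split idea, the sketch below uses two hypothetical toy frames (`toy_train`, `toy_test`) and an illustrative `is_adult` feature, none of which come from the competition data. It assumes `pd.concat` is used for the join.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train and test
toy_train = pd.DataFrame({'Id': ['a', 'b'], 'age': [30, 12], 'Target': [4, 2]})
toy_test = pd.DataFrame({'Id': ['c'], 'age': [45]})

# The test rows get a null Target so the two frames share the same columns
toy_test['Target'] = np.nan
combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Any feature is built once, on the combined frame
combined['is_adult'] = (combined['age'] >= 18).astype(int)

# Separate the sets again using the null Target as the marker for test rows
train_part = combined[combined['Target'].notnull()]
test_part = combined[combined['Target'].isnull()]
print(train_part.shape, test_part.shape)
```

Both parts end up with identical columns, which is the point of engineering features on the combined frame.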

In [None]:
# 4.1 filling up column Target in test with nan
test['Target'] = np.nan

#4.2 appending test to train (DataFrame.append was removed in pandas 2.0, so use pd.concat)
X = pd.concat([train, test], ignore_index = True)

In [None]:
#4.3.1 Shape
train.shape #(9557, 143)
test.shape #(23856, 143)
X.shape #Sum of test and train: (33413, 143)

In [None]:
#4.3.2 info
train.info()
test.info()
X.info() 

For `X`, the memory usage and the `RangeIndex` length are the sums of those for `train` and `test`.

## Exploring Label Distribution


In [None]:
#4.4 Exploring Data distribution across classes
#4.4.1 Extract the records for heads of household where 'parentesco1==1'
X_heads = X.loc[X['parentesco1']==1].copy() #Make a copy to preserve X
X_heads.info() #10307 entries, 0 to 33409


In [None]:
#4.4.2 look at label distribution where 'Target is notnull'
X_heads_labels = X_heads.loc[(X_heads['Target'].notnull()), ['Target']]
X_heads_labels_counts = X_heads_labels['Target'].value_counts().sort_index()
X_heads_labels_counts 

The classes are imbalanced, with many more households classified as 4.0, i.e. non vulnerable.
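
One way to quantify the imbalance is to look at class shares and the corresponding "balanced" class weights. The sketch below uses a hypothetical toy label series, not the real counts; the weight formula `n_samples / (n_classes * count)` is the heuristic behind scikit-learn's `class_weight='balanced'` option.

```python
import pandas as pd

# Hypothetical toy labels with a dominant class 4
toy_labels = pd.Series([4, 4, 4, 4, 4, 4, 3, 2, 2, 1], name='Target')

counts = toy_labels.value_counts().sort_index()
share = counts / counts.sum()

# Rare classes receive larger weights, the dominant class a smaller one
weights = len(toy_labels) / (len(counts) * counts)

print(pd.DataFrame({'count': counts, 'share': share, 'weight': weights}))
```

Weights like these are one option for handling imbalance; oversampling with `SMOTE`, which is imported above, is another.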

## Addressing Wrong Labels

#### 4.5 Exploring the classification of members in a household and correcting errors as per the directions in the challenge, i.e. using the head of household's label as the correct label


In [None]:
#4.5.1 Group by household ('idhogar') and check whether all members share a single Target value
#train.groupby('idhogar').size() #Length: 2988
train_ok = train.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)                    
train_ok.size #2988 households

In [None]:
#4.5.2 Identify the labels with errors
train_notok = train_ok[train_ok != True]
train_notok.size #85 households have inconsistent labels

In [None]:
#4.5.3 View one example of incorrect labels
train[train['idhogar'] == train_notok.index[2]][['Id', 'idhogar', 'parentesco1', 'Target']]


In [None]:
#4.5.4 Fix the labels correctly

for not_ok_id in train_notok.index:
    # Find the correct Target value: the label of the head of household
    # (.iloc[0] assumes the household has a head; int() on a Series is deprecated)
    ok_target = int(train.loc[(train['idhogar'] == not_ok_id) & (train['parentesco1'] == 1.0), 'Target'].iloc[0])
    
    # Set the correct label for all members in the household
    train.loc[train['idhogar'] == not_ok_id, 'Target'] = ok_target

In [None]:
#Checking - Trying query function of dataframe
train_check = pd.DataFrame(train.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1))
train_check.query("Target != True") #Empty DataFrame here is a Success!
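
The same correction can also be written without a loop. This is only a sketch on a hypothetical toy frame, assuming each household has at most one head: it looks up the head's label per household with `map` and leaves households without a head untouched.

```python
import pandas as pd

# Hypothetical toy data: household h1 has inconsistent labels, h2 is consistent
toy = pd.DataFrame({
    'idhogar':     ['h1', 'h1', 'h1', 'h2', 'h2'],
    'parentesco1': [1, 0, 0, 0, 1],
    'Target':      [3, 2, 3, 4, 4],
})

# One label per household, taken from the row of the head of household
head_label = toy.loc[toy['parentesco1'] == 1].set_index('idhogar')['Target']

# Broadcast the head's label to every member; keep the original label
# for any household that has no head (map returns NaN there)
toy['Target'] = toy['idhogar'].map(head_label).fillna(toy['Target']).astype(int)

print(toy)
```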

## 4.6 How many families are there without a head of household?

`X_heads.info()` reports 10307 entries, i.e. rows with `parentesco1 == 1` across train and test, while `train_ok.size` returns 2988 households in the training data, i.e. groups from `groupby('idhogar')`. If a household does not have a head, then there is no reference value for its label, so we can't use any training data from a household without a head.


In [None]:
#4.6.1 How many households have a head (parentesco1 == 1)?
train_heads = pd.DataFrame(train.groupby('idhogar')['parentesco1'].sum())
train_heads.size #2988 records
train_heads.query("parentesco1 > 1").count() #just checking -- 0 households have more than one head of household
train_heads.query("parentesco1 == 1").count() #just checking -- 2973 households OK

train_heads.query("parentesco1 < 1").count() #15 households do not have head of household


In [None]:
#4.6.2 Households whose sum(parentesco1), computed in 4.6.1, is zero
# The data for these households cannot be used
train_heads_no = train_heads.query("parentesco1 == 0") #15 unique 'idhogar's
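
If these households are to be excluded explicitly, one possible approach is to filter on the household id. The sketch below runs on a hypothetical toy frame; the names `headless_ids` and `toy_with_heads` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: household h2 has no head of household
toy = pd.DataFrame({
    'idhogar':     ['h1', 'h1', 'h2', 'h2', 'h3'],
    'parentesco1': [1, 0, 0, 0, 1],
    'Target':      [4, 4, 2, 2, 1],
})

# Count heads per household and collect the ids with none
heads_per_household = toy.groupby('idhogar')['parentesco1'].sum()
headless_ids = heads_per_household[heads_per_household == 0].index

# Keep only rows that belong to a household with a head
toy_with_heads = toy[~toy['idhogar'].isin(headless_ids)]
print(toy_with_heads)
```

In this notebook the later selection of rows with `parentesco1 == 1` has the same effect, since a household without a head contributes no such row.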


## Missing Values

First we find the number of missing values in each column.

In [None]:
# Number of missing in each column
missing = pd.DataFrame(X.isnull().sum()).rename(columns = {0: 'total'})

# Create a percentage missing
missing['percent'] = missing['total'] / len(X)

missing.sort_values('percent', ascending = False).head(7).drop('Target')

`Target` was dropped from this table because we set it to `NaN` for the test data.

__v18q1__: Number of tablets owned by a family

This is a household variable so only select rows for head of household.



In [None]:
#X_heads['v18q1'].value_counts().sort_index()
X_heads['v18q1'].value_counts()

In [None]:
X_heads['v18q1'].value_counts().sum()

In [None]:
X_heads['v18q1'].isnull().describe()

While 1 is the most common value, 8044 values in this column are null. `v18q` indicates whether a family owns a tablet.

We `groupby` `v18q` and count the null values of `v18q1` in each group. If all 8044 nulls fall under `v18q == 0`, we know these families do not own a tablet.

In [None]:
X_heads.groupby('v18q')['v18q1'].apply(lambda x: x.isnull().sum())

Since the nulls belong to families that do not own a tablet, we can fill in the missing values of `v18q1` with zero.

In [None]:
X['v18q1'] = X['v18q1'].fillna(0)

__v2a1__: Monthly rent payment

The next column with missing values is `v2a1`, which represents the monthly rent payment. 

To understand the missing values, we use the home ownership variables below:

    tipovivi1 =1 own and fully paid house
    tipovivi2 =1 own,  paying installment
    tipovivi3 =1 rented
    tipovivi4 =1 precarious
    tipovivi5 =1 other
    

In [None]:
# Fill in households that own the house with 0 rent payment
X.loc[(X['tipovivi1'] == 1), 'v2a1'] = 0

# Create missing rent payment column
X['v2a1-missing'] = X['v2a1'].isnull()

X['v2a1-missing'].value_counts()

__rez_esc__: years behind in school

We look at the ages of those who have a value in this column and the ages of those who have a missing value.

In [None]:
X.loc[X['rez_esc'].notnull()]['age'].describe()

Among the rows where `rez_esc` is present, the oldest age is 17.

In [None]:
X.loc[X['rez_esc'].isnull()]['age'].describe()

For this variable, if the individual is over 19 or younger than 7 and the value is missing, we set it to zero (outside of school age). For everyone else, we let the imputation take care of it.

In [None]:
# If individual is over 19 or younger than 7 and missing years behind, set it to 0
X.loc[((X['age'] > 19) | (X['age'] < 7)) & (X['rez_esc'].isnull()), 'rez_esc'] = 0

# Add a flag for those between 7 and 19 with a missing value
X['rez_esc-missing'] = X['rez_esc'].isnull()

There is also one outlier in the `rez_esc` column. The maximum value for this variable is 5, so any value above 5 is set to 5.

In [None]:
X.loc[X['rez_esc'] > 5, 'rez_esc'] = 5
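
The same capping can be expressed with `Series.clip`, which leaves missing values untouched. A minimal sketch on a hypothetical toy series:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values including one outlier and one missing entry
rez = pd.Series([0, 2, 5, 99, np.nan])

# Cap everything above 5 at 5; NaN stays NaN for the imputer to handle later
capped = rez.clip(upper=5)
print(capped.tolist())
```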

# Feature Engineering



### Define Variable Categories

1. Individual Variables
    * Boolean
    * Integers with an ordering
2. Household variables
    * Boolean
    * Integers with an ordering
    * Continuous
3. Squared Variables: derived variables
4. Id variables: not used


In [None]:
id_ = ['Id', 'idhogar', 'Target']

In [None]:
ind_bool = ['v18q', 'dis', 'male', 'female', 'estadocivil1', 'estadocivil2', 'estadocivil3', 
            'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 
            'parentesco1', 'parentesco2',  'parentesco3', 'parentesco4', 'parentesco5', 
            'parentesco6', 'parentesco7', 'parentesco8',  'parentesco9', 'parentesco10', 
            'parentesco11', 'parentesco12', 'instlevel1', 'instlevel2', 'instlevel3', 
            'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 
            'instlevel9', 'mobilephone', 'rez_esc-missing']

ind_ordered = ['rez_esc', 'escolari', 'age']

In [None]:
hh_bool = ['hacdor', 'hacapo', 'v14a', 'refrig', 'paredblolad', 'paredzocalo', 
           'paredpreb','pisocemento', 'pareddes', 'paredmad',
           'paredzinc', 'paredfibras', 'paredother', 'pisomoscer', 'pisoother', 
           'pisonatur', 'pisonotiene', 'pisomadera',
           'techozinc', 'techoentrepiso', 'techocane', 'techootro', 'cielorazo', 
           'abastaguadentro', 'abastaguafuera', 'abastaguano',
            'public', 'planpri', 'noelec', 'coopele', 'sanitario1', 
           'sanitario2', 'sanitario3', 'sanitario5',   'sanitario6',
           'energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4', 
           'elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 
           'elimbasu5', 'elimbasu6', 'epared1', 'epared2', 'epared3',
           'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 
           'tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5', 
           'computer', 'television', 'lugar1', 'lugar2', 'lugar3',
           'lugar4', 'lugar5', 'lugar6', 'area1', 'area2', 'v2a1-missing']

hh_ordered = [ 'rooms', 'r4h1', 'r4h2', 'r4h3', 'r4m1','r4m2','r4m3', 'r4t1',  'r4t2', 
              'r4t3', 'v18q1', 'tamhog','tamviv','hhsize','hogar_nin',
              'hogar_adul','hogar_mayor','hogar_total',  'bedrooms', 'qmobilephone']

hh_cont = ['v2a1', 'dependency', 'edjefe', 'edjefa', 'meaneduc', 'overcrowding']

In [None]:
sqr_ = ['SQBescolari', 'SQBage', 'SQBhogar_total', 'SQBedjefe', 
        'SQBhogar_nin', 'SQBovercrowding', 'SQBdependency', 'SQBmeaned', 'agesq']

In [None]:
x = ind_bool + ind_ordered + id_ + hh_bool + hh_ordered + hh_cont + sqr_


#### Squared Variables

We remove all of the squared variables because these features are redundant and highly correlated with their non-squared versions.

In [None]:
# Remove squared variables
X = X.drop(columns = sqr_)
X.shape #(33413, 136)
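
To see why a squared column adds little, the sketch below measures the correlation between a variable and its square on synthetic ages. The age range is an assumption for illustration and is not taken from the data.

```python
import numpy as np
import pandas as pd

# Synthetic ages 0..97 and their squares (illustrative only)
ages = pd.DataFrame({'age': np.arange(0, 98)})
ages['agesq'] = ages['age'] ** 2

# Pearson correlation between the variable and its squared version
corr = ages['age'].corr(ages['agesq'])
print(f'correlation between age and agesq: {corr:.3f}')
```

For non-negative values such as ages or years of schooling, the correlation stays above 0.95 in this sketch, so the squared column carries little information beyond the original.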

## Id Variables

The Id variables are kept as is for identification.

## Household Variables

For the household variables, we keep only the rows for heads of household.

In [None]:
# .copy() so that adding columns to heads later does not trigger SettingWithCopyWarning
heads = X.loc[X['parentesco1'] == 1, :].copy()
heads = heads[id_ + hh_bool + hh_cont + hh_ordered]
heads.shape

## Feature Construction


The household feature we create is a `bonus`, where a family gets a point for each of a refrigerator, computer, tablet, or television that it owns.

In [None]:
# One point each for owning a refrigerator, computer, tablet, or television
heads['bonus'] = 1 * (heads['refrig'] + 
                      heads['computer'] + 
                      (heads['v18q1'] > 0) + 
                      heads['television'])

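# Recent seaborn versions require x and y as keyword arguments; the figure size is set on the figure
plt.figure(figsize = (10, 6))
sns.violinplot(x = 'bonus', y = 'Target', data = heads);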
plt.title('Target vs Bonus Variable');

## Per Capita Features

Next we compute per-capita features for the household.

In [None]:
heads['phones-per-capita'] = heads['qmobilephone'] / heads['tamviv']
heads['tablets-per-capita'] = heads['v18q1'] / heads['tamviv']
heads['rooms-per-capita'] = heads['rooms'] / heads['tamviv']
heads['rent-per-capita'] = heads['v2a1'] / heads['tamviv']

In [None]:
household_feats = list(heads.columns)

# Individual Level Variables

There are two types of individual level variables: Boolean (1 or 0 for True or False) and ordinal (discrete values with a meaningful ordering). 

In [None]:
ind = X[id_ + ind_bool + ind_ordered].copy()  # copy so new columns can be added safely
ind.shape #(33413, 40)

Because we have both `male` and `female` columns, one of them is redundant, so we remove the `male` column.

In [None]:
ind = ind.drop(columns = 'male')

### Feature Construction

We can make a few features using the existing data. For example, we can divide the years of schooling by the age.

In [None]:
ind['escolari/age'] = ind['escolari'] / ind['age']

plt.figure(figsize = (10, 8))
sns.violinplot(x = 'Target', y = 'escolari/age', data = ind);

## Feature Engineering through Aggregations

To aggregate the individual data for each household, we `groupby` `idhogar` and then `agg`.

In [None]:
# Group and aggregate
# 'Id' is a string identifier, so drop it too; recent pandas raises on sum/std of strings
ind_agg = ind.drop(columns = ['Target', 'Id']).groupby('idhogar').agg(['min', 'max', 'sum', 'count', 'std'])
ind_agg.head()

That gives us 185 features. We rename the columns to keep track of them.

In [None]:
# Rename the columns
# Iterate over the (column, statistic) pairs so each name matches its own column,
# rather than relying on the ordering of columns.levels
new_col = [f'{c}-{stat}' for c, stat in ind_agg.columns]
        
ind_agg.columns = new_col
ind_agg.head()

In [None]:
ind_agg.iloc[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]].head()
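
The aggregation and renaming steps can be checked on a small example. The sketch below uses a hypothetical three-row frame (`toy_ind`) and flattens the two-level column index by pairing each column with its own statistic.

```python
import pandas as pd

# Hypothetical toy individual-level data: h1 has two members, h2 has one
toy_ind = pd.DataFrame({
    'idhogar':  ['h1', 'h1', 'h2'],
    'escolari': [6, 11, 9],
    'age':      [40, 15, 70],
})

# One row per household, five statistics per individual-level column
toy_agg = toy_ind.groupby('idhogar').agg(['min', 'max', 'sum', 'count', 'std'])

# Flatten the (column, statistic) MultiIndex into names such as 'escolari-max'
toy_agg.columns = [f'{col}-{stat}' for col, stat in toy_agg.columns]

print(toy_agg[['escolari-min', 'escolari-max', 'age-count', 'age-std']])
```

Note that the standard deviation is `NaN` for a single-member household, which is one reason the aggregated features still need imputation before modeling.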

We merge `ind_agg` with `heads` to build the final dataset.

In [None]:
ind_feats = list(ind_agg.columns)

# Merge on the household id
final = heads.merge(ind_agg, on = 'idhogar', how = 'left')

print('Final features shape: ', final.shape)

In [None]:
final.head() #289 columns

Next we add the gender of the head of household.

In [None]:
head_gender = ind.loc[ind['parentesco1'] == 1, ['idhogar', 'female']]
final = final.merge(head_gender, on = 'idhogar', how = 'left').rename(columns = {'female': 'female-head'})

In [None]:
final.groupby('female-head')['Target'].value_counts(normalize=True)

Households with a female head are more likely to be poorer.

In [None]:
sns.violinplot(x = 'female-head', y = 'Target', data = final);
plt.title('Target by Female Head of Household');

# Machine Learning Modeling

Now we can get started with modeling! We'll use a Random Forest Classifier to establish a baseline and later try a Gradient Boosting Machine.

To assess our model, we'll use 10-fold cross validation on the training data, with the `F1 Macro` measure to evaluate performance.

In [None]:
# Labels for training
train_labels = np.array(list(final[final['Target'].notnull()]['Target'].astype(np.uint8)))

# Extract the training data
train_set = final[final['Target'].notnull()].drop(columns = ['Id', 'idhogar', 'Target'])
test_set = final[final['Target'].isnull()].drop(columns = ['Id', 'idhogar', 'Target'])

# Submission base which is used for making submissions to the competition
submission_base = test[['Id', 'idhogar']].copy()

In [None]:
features = list(train_set.columns)

pipeline = Pipeline([('imputer', SimpleImputer(strategy = 'median')), 
                      ('scaler', MinMaxScaler())])

# Fit and transform training data
train_set = pipeline.fit_transform(train_set)
test_set = pipeline.transform(test_set)
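
To show what this pipeline does under the hood, here is a rough NumPy equivalent on hypothetical toy arrays: medians and min/max are learned from the training data only and then reused on the test data. It is a sketch of the idea, not a replacement for the scikit-learn pipeline.

```python
import numpy as np

# Hypothetical toy feature matrices with missing values
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])
X_test = np.array([[np.nan, 30.0], [7.0, 10.0]])

# "fit": column medians come from the training data only
medians = np.nanmedian(X_train, axis=0)
X_train_filled = np.where(np.isnan(X_train), medians, X_train)
X_test_filled = np.where(np.isnan(X_test), medians, X_test)

# "fit": column min and max also come from the (imputed) training data
col_min = X_train_filled.min(axis=0)
col_max = X_train_filled.max(axis=0)

# "transform": scale both sets with the training statistics
train_scaled = (X_train_filled - col_min) / (col_max - col_min)
test_scaled = (X_test_filled - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)
```

The training values end up in the range 0 to 1, while test values can fall outside it because the scaling uses training statistics only.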

In [None]:
len(features)

# Model-1: RandomForestClassifier


In [None]:
model_RF = RandomForestClassifier(n_estimators=100, random_state=10, 
                               n_jobs = -1)
# 10 fold cross validation
cv_score_RF = cross_val_score(model_RF, train_set, train_labels, cv = 10, scoring = scorer)

print(f'10 Fold Cross Validation F1 Score = {round(cv_score_RF.mean(), 4)} with std = {round(cv_score_RF.std(), 4)}')

# Model-2: LogisticRegression With L2 Penalty

In [None]:
model_LRL2 = LogisticRegression(C=0.1, penalty='l2', random_state=10, n_jobs = -1)
# 10 fold cross validation
cv_score_LRL2 = cross_val_score(model_LRL2, train_set, train_labels, cv = 10, scoring = scorer)

print(f'10 Fold Cross Validation F1 Score = {round(cv_score_LRL2.mean(), 4)} with std = {round(cv_score_LRL2.std(), 4)}')

Result: F1 Score = 0.2933 with std = 0.0455, which is lower than the Random Forest baseline of 0.3368. This is along expected lines, but it is better to check! 

## Feature Importances

With a tree-based model, we can look at the feature importances which show a relative ranking of the usefulness of features in the model. These represent the sum of the reduction in impurity at nodes that used the variable for splitting, but we don't have to pay much attention to the absolute value. Instead we'll focus on relative scores.

If we want to view the feature importances, we'll have to train a model on the whole training set. Cross validation does not return the feature importances. 

In [None]:
model_RF.fit(train_set, train_labels)

# Feature importances into a dataframe
feature_importances = pd.DataFrame({'feature': features, 'importance': model_RF.feature_importances_})
feature_importances.head()

We plot the feature importances for visual analysis. The number of relevant features will also give us a reference point for PCA. 

In [None]:
def plot_feature_importances(df, n = 15, threshold = 0.95):
    """Plots n most important features. Also plots the cumulative importance 
    
    Args:
        df (dataframe): Dataframe of feature importances. Columns must be "feature" and "importance".
    
        n (int): Number of most important features to plot. Default is 15.
    
        threshold (float): Threshold for cumulative importance plot. Default is 95%.
        
    Returns:
        df (dataframe): Dataframe ordered by feature importances with a normalized column (sums to 1) 
                        and a cumulative importance column
    
    Note:
    
        * Normalization in this case means sums to 1. 
        * Cumulative importance is calculated by summing features from most to least important
        * A threshold of 0.95 will show the most important features needed to reach 95% of cumulative importance
    
    """
    plt.style.use('fivethirtyeight')
    
    # Sort features with most important at the head
    df = df.sort_values('importance', ascending = False).reset_index(drop = True)
    
    # Normalize the feature importances to add up to one and calculate cumulative importance
    df['importance_normalized'] = df['importance'] / df['importance'].sum()
    df['cumulative_importance'] = np.cumsum(df['importance_normalized'])
    
    plt.rcParams['font.size'] = 12
    
    # Bar plot of n most important features
    # (head(n) selects exactly n rows; label-based .loc[:n] would include n + 1)
    df.head(n).plot.barh(y = 'importance_normalized', 
                         x = 'feature', color = 'darkgreen', 
                         edgecolor = 'k', figsize = (12, 8),
                         legend = False, linewidth = 2)

    plt.xlabel('Normalized Importance', size = 18); plt.ylabel(''); 
    plt.title(f'{n} Most Important Features', size = 18)
    plt.gca().invert_yaxis()
    
    
    if threshold:
        # Cumulative importance plot
        plt.figure(figsize = (8, 6))
        plt.plot(list(range(len(df))), df['cumulative_importance'], 'b-')
        plt.xlabel('Number of Features', size = 16); plt.ylabel('Cumulative Importance', size = 16); 
        plt.title('Cumulative Feature Importance', size = 18);
        
        # Number of features needed for threshold cumulative importance
        # This is the index (will need to add 1 for the actual number)
        importance_index = np.min(np.where(df['cumulative_importance'] > threshold))
        
        # Add vertical line to plot
        plt.vlines(importance_index + 1, ymin = 0, ymax = 1.05, linestyles = '--', colors = 'red')
        plt.show();
        
        print('{} features required for {:.0f}% of cumulative importance.'.format(importance_index + 1, 
                                                                                  100 * threshold))
    
    return df


In [None]:
norm_fi = plot_feature_importances(feature_importances, threshold=0.95)
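
The threshold logic inside `plot_feature_importances` can be checked on a handful of numbers. The importances below are hypothetical, and the threshold of 0.80 is chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical raw importances for five features
importances = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

# Sort from most to least important, normalize to sum to 1, then accumulate
sorted_imp = np.sort(importances)[::-1]
cumulative = np.cumsum(sorted_imp / sorted_imp.sum())

# Index of the first feature at which the cumulative importance exceeds
# the threshold, plus 1 to turn the index into a count
threshold = 0.80
n_needed = int(np.argmax(cumulative > threshold)) + 1

print(f'{n_needed} features required for {100 * threshold:.0f}% of cumulative importance.')
```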

__Education & Age__: education- and age-related parameters are the most relevant. 
Two of the top three parameters are `escolari-max` and `escolari/age-sum`, and analysis of the other top parameters reveals that they are also education and age related.

We need 185 of the 287 features to account for 95% of the importance. 
This will be useful when tuning PCA.

# Model Selection

The baseline model is a Random Forest Classifier with an F1 score of 0.3368. 

1. Write a function that can evaluate a model. 
2. Try other models from scikit-learn or H2O.
3. Make a dataframe to hold the results for each model.

In [None]:
# Dataframe to hold results
model_results = pd.DataFrame(columns = ['model', 'cv_mean', 'cv_std'])

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
model_results = pd.concat([model_results,
                           pd.DataFrame({'model': 'RandomForestClassifier', 
                                         'cv_mean': cv_score_RF.mean(), 
                                         'cv_std': cv_score_RF.std()}, 
                                        index = [0])],
                          ignore_index = True)

In [None]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
model_results = pd.concat([model_results,
                           pd.DataFrame({'model': 'LogisticRegression', 
                                         'cv_mean': cv_score_LRL2.mean(), 
                                         'cv_std': cv_score_LRL2.std()}, 
                                        index = [0])],
                          ignore_index = True)

In [None]:

def cv_model(train, train_labels, model, name, model_results=None):
    """Perform 10-fold cross-validation of a model and optionally record the scores."""
    
    cv_scores = cross_val_score(model, train, train_labels, cv = 10, scoring=scorer, n_jobs = -1)
    print(f'10 Fold CV Score: {round(cv_scores.mean(), 5)} with std: {round(cv_scores.std(), 5)}')
    
    if model_results is not None:
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        model_results = pd.concat([model_results,
                                   pd.DataFrame({'model': name, 
                                                 'cv_mean': cv_scores.mean(), 
                                                 'cv_std': cv_scores.std()},
                                                index = [0])],
                                  ignore_index = True)

        return model_results

# Model-3: LinearSVC

In [None]:
model_results = cv_model(train_set, train_labels, LinearSVC(), 
                         'LSVC', model_results)

LSVC : 10 Fold CV Score: 0.28658 with std: 0.04568 -- lower than baseline 0.3368

# Model-4: GaussianNB

In [None]:
model_results = cv_model(train_set, train_labels, 
                         GaussianNB(), 'GNB', model_results)

GaussianNB : 10 Fold CV Score: 0.18349 with std: 0.04377 -- lower than baseline 0.3368

# Model-5: MLPClassifier

In [None]:
model_results = cv_model(train_set, train_labels, 
                         MLPClassifier(hidden_layer_sizes=(32, 64, 128, 64, 32)),
                         'MLP', model_results)

MLPClassifier : 10 Fold CV Score: 0.30368 with std: 0.0615 -- lower than baseline 0.3368

However, the score may improve with hyperparameter tuning.

# Model-6: LinearDiscriminantAnalysis

In [None]:
model_results = cv_model(train_set, train_labels, 
                          LinearDiscriminantAnalysis(), 
                          'LDA', model_results)

LinearDiscriminantAnalysis : 10 Fold CV Score: 0.32088 with std: 0.0568 -- slightly lower than baseline 0.3368

However, the score may improve with hyperparameter tuning.

# Model-7: RidgeClassifierCV

In [None]:
model_results = cv_model(train_set, train_labels, 
                         RidgeClassifierCV(), 'RIDGE', model_results)

RidgeClassifierCV : 10 Fold CV Score: 0.28046 with std: 0.03179 -- much lower than baseline 0.3368


# Model-8: KNeighborsClassifier

In [None]:
for n in [5, 10, 20]:
    print(f'\nKNN with {n} neighbors\n')
    model_results = cv_model(train_set, train_labels, 
                             KNeighborsClassifier(n_neighbors = n),
                             f'knn-{n}', model_results)

KNeighborsClassifier:

KNN with 5 neighbors - 10 Fold CV Score: 0.34332 with std: 0.0298 -- better than baseline 0.3368 (first model to do so)

KNN with 10 neighbors - 10 Fold CV Score: 0.33688 with std: 0.03935 -- equal to baseline 0.3368

KNN with 20 neighbors - 10 Fold CV Score: 0.30568 with std: 0.03353 -- lower than baseline 0.3368

Conclusions from KNN:
1. KNN with 5 neighbors performs best.
2. KNN performance drops as the number of neighbors increases.


# Model-9: ExtraTreesClassifier

In [None]:
model_results = cv_model(train_set, train_labels, 
                         ExtraTreesClassifier(n_estimators = 100, random_state = 10),
                         'EXT', model_results)

ExtraTreesClassifier : 10 Fold CV Score: 0.3276 with std: 0.03164 -- slightly lower than baseline 0.3368


## Comparing Model Performance

With the modeling results in a dataframe, we can plot them to see which model does the best.

In [None]:
model_results.set_index('model', inplace = True)
model_results['cv_mean'].plot.bar(color = 'aqua', figsize = (8, 6),
                                  yerr = list(model_results['cv_std']),
                                  edgecolor = 'k', linewidth = 2)
plt.title('Model F1 Score Results');
plt.ylabel('Mean F1 Score (with error bar)');
model_results.reset_index(inplace = True)

KNN-5 performs the best, with Random Forest coming in close behind. 
The standard deviation of the KNN-5 scores is also much tighter than that of the Random Forest.

# Model using XGBoost

After Random Forest, we try XGBoost, a gradient boosting machine.

In [None]:
train_set = pd.DataFrame(train_set, columns = features)
train_set.info()

In [None]:
test_set = pd.DataFrame(test_set, columns = features)
test_set.info()

In [None]:
features = list(train_set.columns)

## Bayesian Optimization

In [None]:
############### GG. Tuning using Bayes Optimization ############
"""
11. Step 1: Define BayesianOptimization function.
"""
# 11.1 Which parameters to consider and what is each one's range
para_set = {
           'learning_rate':  (0, 1),                 # any value between 0 and 1
           'n_estimators':   (10,100),               # any number between 10 and 100
           'max_depth':      (6,20),                 # any depth between 6 and 20
           'n_components' :  (150,200)               # any number between 150 and 200
            }

# 11.2 Create a function that when passed some parameters
#    evaluates results using cross-validation
#    This function is used by BayesianOptimization() object

def xg_eval(learning_rate,n_estimators, max_depth,n_components):
    # 12.1 Make pipeline. Pass parameters directly here
    pipe_xg1 = make_pipeline (ss(),                        # Scaler stays in the pipeline so it is refit on each training fold
                              PCA(n_components=int(round(n_components))),
                              XGBClassifier(
                                           silent = False,
                                           n_jobs=2,
                                           learning_rate=learning_rate,
                                           max_depth=int(round(max_depth)),
                                           n_estimators=int(round(n_estimators))
                                           )
                             )

    # 12.2 Now fit the pipeline and evaluate it with 10-fold cross-validation
    cv_result = cross_val_score(estimator = pipe_xg1,
                                X = train_set,
                                y = train_labels,
                                cv = 10,
                                n_jobs = -1,
                                scoring = scorer
                                ).mean()             # mean score across the 10 folds


    # 12.3 Finally return the mean CV score (a single number)
    return cv_result

#    return cv_result, pipe_xg1


In [None]:
# 12 This is the main workhorse
#      Instantiate BayesianOptimization() object
#
xgBO = BayesianOptimization(
                             xg_eval,     # Function to evaluate performance.
                             para_set     # Parameter set from where parameters will be selected
                             )


In [None]:
# 13. Gaussian process parameters
#     Modulate intelligence of Bayesian Optimization process
gp_params = {"alpha": 1e-5}      # Initialization parameter for gaussian
                                 # process.

# 14. Fit/train (so-to-say) the BayesianOptimization() object
#     Start optimization (takes about 25 minutes)
#     Our objective is to maximize performance (results)
start = time.time()
xgBO.maximize(init_points=10,    # Number of randomly chosen points to
                                 #  sample the target function before
                                 #  fitting the Gaussian process (gp)
               n_iter=15,        # Total number of times the
                                 #  process is to be repeated
               # acq="ucb",      # ucb: upper confidence bound
                                 # ei: expected improvement
               # kappa = 1.0     # kappa=1: prefer exploitation; kappa=10: prefer exploration
              **gp_params
               )
end = time.time()
(end-start)/60


In [None]:
# 15. Get values of parameters that maximise the objective
#xgBO.res
type(xgBO.res) #If type is list then call max directly

In [None]:
#xgBO.res['max']
xgBO.max

In [None]:
xgBO.max['params']

In [None]:
xgBO.max['params']['learning_rate']
xgBO.max['params']['n_estimators']
xgBO.max['params']['max_depth']
xgBO.max['params']['n_components']

### Best set of parameters recommended by Bayesian Optimization:

{'learning_rate': 0.9167937871301227,
 'max_depth': 19.94179375518867,
 'n_components': 178.42829982693257,
 'n_estimators': 10.011309079810477}
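
Two of the reported values sit almost exactly on the edge of their search range: `max_depth` is about 19.94 against an upper bound of 20, and `n_estimators` is about 10.01 against a lower bound of 10. That may suggest the ranges in `para_set` are worth revisiting. The sketch below flags such parameters; the helper `params_near_bounds` and its 5% tolerance are illustrative assumptions, not part of the `bayes_opt` API.

```python
def params_near_bounds(best_params, bounds, tolerance=0.05):
    """Return {name: 'lower' or 'upper'} for parameters lying within
    `tolerance` (as a fraction of the range width) of a search bound."""
    flagged = {}
    for name, value in best_params.items():
        low, high = bounds[name]
        margin = tolerance * (high - low)
        if value - low <= margin:
            flagged[name] = 'lower'
        elif high - value <= margin:
            flagged[name] = 'upper'
    return flagged

# Search ranges and best parameters copied from the cells above
para_set = {'learning_rate': (0, 1), 'n_estimators': (10, 100),
            'max_depth': (6, 20), 'n_components': (150, 200)}
best = {'learning_rate': 0.9167937871301227,
        'max_depth': 19.94179375518867,
        'n_components': 178.42829982693257,
        'n_estimators': 10.011309079810477}

print(params_near_bounds(best, para_set))
```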


In [None]:
cv_score_xgBO = xg_eval(
    xgBO.max['params']['learning_rate'],
    xgBO.max['params']['n_estimators'],
    xgBO.max['params']['max_depth'],
    xgBO.max['params']['n_components']
)

In [None]:
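# xg_eval returns the mean CV score (a scalar), so .mean() leaves it unchanged
# and .std() is 0.0 here rather than the spread across folds.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
model_results = pd.concat([model_results,
                           pd.DataFrame({'model': 'XGBClassifier', 
                                         'cv_mean': cv_score_xgBO.mean(), 
                                         'cv_std': cv_score_xgBO.std()}, 
                                        index = [0])],
                          ignore_index = True)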

In [None]:
model_results

In [None]:
cv_score_xgBO.mean()

In [None]:
cv_score_xgBO.std()
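
Because `xg_eval` returns the mean of the fold scores, `cv_score_xgBO` is a single number and its standard deviation is 0, which is not comparable to the `cv_std` values of the other models. One way around this is to keep the per-fold scores and summarize them separately. The sketch below uses only the standard library; `summarize_cv` and the fold scores are hypothetical and serve purely as an illustration.

```python
from statistics import mean, pstdev

def summarize_cv(fold_scores):
    """Return (mean, std) of per-fold scores.
    pstdev is the population standard deviation, matching NumPy's default."""
    return mean(fold_scores), pstdev(fold_scores)

# Hypothetical per-fold F1 scores, for illustration only
fold_scores = [0.30, 0.32, 0.28, 0.34, 0.31]
cv_mean, cv_std = summarize_cv(fold_scores)

# Summarizing an already averaged score always gives a spread of zero
_, std_of_mean = summarize_cv([cv_mean])

print(round(cv_mean, 4), round(cv_std, 4), std_of_mean)
```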

So the much-vaunted Bayesian Optimization does not improve on the baseline F1 score.

From a baseline score of 0.3368, we are now at 0.3087795925140086, which is lower.

This is disappointing. 


### Next steps (time permitting)
1. Try SVD instead of PCA.
2. Implement EvolutionaryAlgorithmSearchCV instead of Bayesian Optimization.
3. Implement LightGBM instead of XGBoost.
4. Predict using XGBoost / LightGBM.
5. Do some more data manipulation to extract predictions for the head of household (optional).
6. Export results to CSV with id and idhogar.
## Running XGBoost - fit

In [None]:
pipe_xg1 = make_pipeline (ss(),
                          PCA(n_components=int(round(xgBO.max['params']['n_components']))),
                          XGBClassifier(
                              silent = False,
                              n_jobs=-1,
                              learning_rate=xgBO.max['params']['learning_rate'],
                              max_depth=int(round(xgBO.max['params']['max_depth'])),
                              n_estimators=int(round(xgBO.max['params']['n_estimators']))
                          )
                         )


In [None]:
pipe_xg1.fit(train_set, train_labels)

# Make predictions on test data - XGBoost


In [None]:
test_set.info()

In [None]:
predictions = pipe_xg1.predict(test_set)
#predictions = [round(value) for value in test_labels]

In [None]:
predictions.size

In [None]:
predictions

# Joining predictions with ID and idhogar


In [None]:
submission_base.info()

In [None]:
test_ids = list(final.loc[final['Target'].isnull(), 'idhogar'])
predictions = pd.DataFrame({'idhogar': test_ids,
                               'Target': predictions})

# Make a submission dataframe
submission = submission_base.merge(predictions, 
                                   on = 'idhogar',
                                   how = 'left').drop(columns = ['idhogar'])
    
# Fill in households missing a head
submission['Target'] = submission['Target'].fillna(4).astype(np.int8)
submission.to_csv('Anirudh_submission.csv', index = False)
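
To see what the merge and fill steps do, here is a small self-contained sketch on toy data. The `Id` column name, the household IDs, and the predicted values are hypothetical; only the left merge on `idhogar` and the `fillna(4)` default follow the cell above.

```python
import pandas as pd

# Toy data: three individuals spread over two households (hypothetical values)
submission_base = pd.DataFrame({'Id': ['ID_1', 'ID_2', 'ID_3'],
                                'idhogar': ['h1', 'h1', 'h2']})

# Household-level predictions; household h2 has no prediction
predictions = pd.DataFrame({'idhogar': ['h1'], 'Target': [2]})

# Left merge copies each household prediction onto all of its members
submission = submission_base.merge(predictions,
                                   on='idhogar',
                                   how='left').drop(columns=['idhogar'])

# Individuals whose household has no prediction default to class 4
submission['Target'] = submission['Target'].fillna(4).astype('int8')
print(submission)
```

The left join keeps every individual from `submission_base`, so the submission has one row per person even when a household has no prediction.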