** Costa Rica Household Poverty Level Forecast **

The Inter-American Development Bank asked the Kaggle community to help some of the most impoverished families in the world.

The detailed problem description is available on the official competition page:
 https://www.kaggle.com/c/costa-rican-household-poverty-prediction


Put simply, we need to build a model that predicts a household's poverty level from its family composition, housing conditions (such as whether the home has a ceiling), living environment, and other available information.


This is a Kaggle kernel competition, so the code itself must be submitted through a kernel rather than uploading a CSV of predictions.

This notebook spends little time on outlier processing, but it covers the complete submission process:

1 Data exploration and data preprocessing
1.1 Review of the questions
1.2 Exploratory data analysis and outlier processing

2 Feature engineering
2.1 New Feature Construction
2.2 Synthesis of individual-level features and those of family-level

3 Model construction and tuning
3.1 Use and integration of LightGBM


There are 142 features in total; check their meanings when necessary:
https://www.kaggle.com/c/costa-rican-household-poverty-prediction/data

And here is a helpful basic walkthrough before you get your hands dirty:
https://www.kaggle.com/willkoehrsen/a-complete-introduction-and-walkthrough

In this demo, a simple LightGBM model achieves a score that ranks around 10th on the leaderboard. I did not focus on outlier handling or parameter tuning. **Applied machine learning is basically feature engineering! Spend more time understanding the features and their connections.**

I also consulted some related papers to better understand the following two questions:
1. How to deal with the imbalanced data.
2. How learning rate decay affects LightGBM training.

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set a few plotting defaults
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 18
plt.rcParams['patch.edgecolor'] = 'k'

** Read in data **
Training data has 9557 rows and 143 columns: 142 features, plus a target column

Test data has 23856 rows.

Target column: poverty level, with 4 classes
1 = extreme
2 = moderate
3 = vulnerable
4 = not vulnerable

The evaluation metric is macro F1: (F1 Class 1 + F1 Class 2 + F1 Class 3 + F1 Class 4) / 4
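
To make the metric concrete, here is a minimal pure-Python sketch of macro F1 on made-up labels. The helper name `macro_f1` and the toy data are assumptions for illustration only; on the same inputs it should agree with `f1_score(..., average='macro')` from scikit-learn. The point is that a model which always predicts the majority class can look acceptable on accuracy while scoring poorly on macro F1.

```python
def macro_f1(y_true, y_pred, classes=(1, 2, 3, 4)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); treat it as 0 when the denominator is 0
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (invented): class 4 dominates, class 1 is rare
y_true = [4, 4, 4, 4, 4, 4, 3, 2, 2, 1]
y_always_4 = [4] * 10  # a "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_always_4)) / len(y_true)
score = macro_f1(y_true, y_always_4)
print(f'accuracy = {accuracy:.2f}, macro F1 = {score:.4f}')
```

Because every class counts equally, the three classes that are never predicted each contribute an F1 of 0, which pulls the average down sharply.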

In [None]:
pd.options.display.max_columns = 150

# Read in data
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.head()

In [None]:
train.info()

In [None]:
from collections import OrderedDict

** Exploratory data analysis **
This data set is relatively complete. I show some plots of the relationship between several features and the target.
Ideally, an informative feature should be able to separate the four levels of the target.

Feature examples
v2a1: Monthly rent
v18q1: Number of tablets owned by the family

In [None]:
plt.figure(figsize = (20, 12))
plt.style.use('fivethirtyeight')

# Color mapping
colors = OrderedDict({1: 'red', 2: 'orange', 3: 'blue', 4: 'green'})
poverty_mapping = OrderedDict({1: 'extreme', 2: 'moderate', 3: 'vulnerable', 4: 'non vulnerable'})

for i,col in enumerate(train.select_dtypes('float')):
    ax=plt.subplot(4,2,i+1)
    for poverty_level,color in colors.items():
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = poverty_mapping[poverty_level])
        plt.title(f'{col.capitalize()} Distribution'); plt.xlabel(f'{col}'); plt.ylabel('Density')
        
plt.subplots_adjust(top = 2)

Checking the data types in the data set, I found that continuous and categorical values are mixed in some columns, so those variables need further processing. Several of these features are explained below:

dependency: dependency ratio = (number of family members under 19 or over 64) / (number of family members between 19 and 64)

edjefe: years of education of the male head of household, derived from escolari (years of education), head-of-household status and gender; yes = 1, no = 0

edjefa: years of education of the female head of household, derived from escolari (years of education), head-of-household status and gender; yes = 1, no = 0
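
As a worked example of the dependency ratio defined above, here is a small sketch. The function name `dependency_ratio`, the inclusive age bounds, and the choice to return `None` when there is no working-age member are assumptions made for illustration, not something taken from the dataset documentation.

```python
def dependency_ratio(ages):
    """Dependents (under 19 or over 64) divided by working-age members (19 to 64)."""
    dependents = sum(1 for a in ages if a < 19 or a > 64)
    working_age = sum(1 for a in ages if 19 <= a <= 64)
    # With no working-age member the ratio is undefined, so return None here
    return dependents / working_age if working_age else None

# Hypothetical household: two working-age adults, two children, one grandparent
ratio = dependency_ratio([35, 33, 8, 4, 70])
no_adults = dependency_ratio([70, 72])
print(ratio, no_adults)
```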

In [None]:
train.select_dtypes('object').head()

Replace yes/no with 1 and 0

In [None]:
# Use a name other than `map` to avoid shadowing the Python builtin
mapping = {'yes': 1, 'no': 0}
for df in [train, test]:
    df['dependency'] = df['dependency'].replace(mapping).astype(np.float64)
    df['edjefe'] = df['edjefe'].replace(mapping).astype(np.float64)
    df['edjefa'] = df['edjefa'].replace(mapping).astype(np.float64)
train[['dependency','edjefe','edjefa']].describe()

Distributions

In [None]:
plt.figure(figsize = (20, 16))
plt.style.use('ggplot')

# Color mapping
colors = OrderedDict({1: 'red', 2: 'orange', 3: 'blue', 4: 'green'})
poverty_mapping = OrderedDict({1: 'extreme', 2: 'moderate', 3: 'vulnerable', 4: 'non vulnerable'})

for i,col in enumerate(['dependency','edjefe','edjefa']):
    ax=plt.subplot(3,1,i+1)
    for poverty_level,color in colors.items():
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = poverty_mapping[poverty_level])
        plt.title(f'{col.capitalize()} Distribution'); plt.xlabel(f'{col}'); plt.ylabel('Density')
        


Combine test and train data

In [None]:
test['Target']=np.nan
# pd.concat is used because DataFrame.append is no longer available in recent pandas
data = pd.concat([train, test], ignore_index=True)

Among the four levels of poverty, there are very few "extremely poor" people, and the number of samples in the four categories is severely imbalanced. As a result, a machine learning model rarely gets the chance to learn from samples at the "extremely poor" level, which leads to inaccurate predictions for that class.
Two common ways to deal with imbalanced datasets are:
1. Resampling: generate new samples (for example by bootstrapping the rare classes) to balance the dataset.
2. Weighting: give the samples different weights, which changes the optimization process.
In this problem, I use the latter strategy.

The original idea comes from "Logistic Regression in Rare Events Data" (King and Zeng, 2001), a work that proposes corrections to logistic regression for imbalanced samples. The authors redesign the maximum likelihood estimation so that it works for imbalanced data, using two techniques.

1. Prior correction. Prior correction modifies the value of the estimated regression coefficient (the intercept) directly. The magnitude of the correction depends on the class proportions.

2. Weighting. During estimation, samples from the category with fewer samples receive larger weights, so they contribute more to the total error.
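
The weighting idea can be sketched with the "balanced" heuristic, where each class gets the weight n_samples / (n_classes * class_count). This is the rule that scikit-learn documents for `class_weight='balanced'`, which the LightGBM model later in this notebook also requests. The helper name `balanced_class_weights` and the label counts below are invented for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Made-up label distribution, only to illustrate the imbalance
labels = [1] * 2 + [2] * 4 + [3] * 4 + [4] * 10
weights = balanced_class_weights(labels)

# Total weight carried by each class after reweighting
totals = {c: weights[c] * labels.count(c) for c in weights}
print(weights)
print(totals)
```

With these weights every class carries the same total weight, so errors on the rare class are no longer drowned out by the majority class.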


LightGBM is tree-based, so the effect of sample weights can be illustrated with a single decision tree, as in this Stack Overflow answer:
https://stackoverflow.com/questions/34389624/what-does-sample-weight-do-to-the-way-a-decisiontreeclassifier-works-in-skle

In [None]:
from sklearn.tree import DecisionTreeClassifier as DTC

X = [[0],[1],[2]] # 3 simple training examples
Y = [ 1,  2,  1 ] # class labels

dtc = DTC(max_depth=1)

In [None]:
#no weighting
dtc.fit(X,Y)

print (dtc.tree_.threshold)
# [0.5, -2, -2]
#gini-index
print ( dtc.tree_.impurity)
# [0.44444444, 0, 0.5]

In [None]:
# weighting the samples
dtc.fit(X,Y,sample_weight=[1,2,3])

print (dtc.tree_.threshold)
# [1.5, -2, -2]
print (dtc.tree_.impurity)
# [0.44444444, 0.44444444, 0.]

It can be found that after the sample weight is added, the tree splitting standard and the leaf node purity will change accordingly. In this way, the estimation process is changed.

In [None]:
data['Target'].value_counts()

Since the competition scores predictions only for heads of household, in training we only look at the poverty level of the head of household (['parentesco1'] == 1).
All heads of household are extracted directly as the training set; there are 2973 of them.

In [None]:
heads=data.loc[data['parentesco1'] == 1].copy()
train_labels=data.loc[(data['Target'].notnull())&(data['parentesco1']==1),['Target','idhogar']]

In [None]:
len(train_labels)

In [None]:
type_counts=train_labels['Target'].value_counts()

In [None]:
type_counts=pd.DataFrame(type_counts)

In [None]:
type_counts['level']=type_counts.index

In [None]:
type_counts['level']=type_counts['level'].replace(poverty_mapping)


In [None]:
type_counts['Count']=type_counts['Target']

In [None]:
ax = sns.barplot(x="level", y="Count", data=type_counts)

The description states that each family can only have one poverty level, so we need to check whether any family was labeled with more than one level.

In [None]:
all_equal=train.groupby('idhogar')['Target'].apply(lambda x: x.nunique()==1)

In [None]:
print(len(all_equal))
all_equal.head()
type(all_equal)

In [None]:
not_equal= all_equal[all_equal!=True]

In [None]:
not_equal.head()

In [None]:
example=train.loc[train['idhogar']==not_equal.index[0],['idhogar', 'parentesco1', 'Target']]

And here is an example of data that we do not expect

In [None]:
example

In [None]:
households_leader=train.groupby('idhogar')['parentesco1'].sum()

In [None]:
households_leader.head()

There are some families whose number of heads is not exactly 1 ;D

In [None]:
households_leader_flase=train.loc[train['idhogar'].isin(households_leader
                                                       [households_leader !=1].index),:]

In [None]:
households_leader_flase[['idhogar', 'parentesco1', 'Target']]

If a household has more than one level of poverty, the poverty level of the head of household is considered to be accurate, and poverty levels of others are modified so that they are consistent with that of the head

In [None]:
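for household in not_equal.index:
    # Label of the head of household, which is taken to be the correct one
    true_target = int(train.loc[(train['idhogar'] == household) &
                                (train['parentesco1'] == 1.0), 'Target'].iloc[0])
    # Note: `data` was built from `train` earlier, so this correction is not propagated to `data`
    train.loc[train['idhogar'] == household, 'Target'] = true_target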

Next, handle missing values and outliers

In [None]:
missing = pd.DataFrame(data.isnull().sum()).rename(columns={0:'total'})
missing['Percent']=missing['total']/len(data)
missing.sort_values('Percent',ascending=False).head(10)

In [None]:
heads=data.loc[data['parentesco1']==1].copy()
plt.figure(figsize=(8,6))
heads['v18q1'].value_counts().sort_index().plot.bar()
plt.show()

In [None]:
heads.groupby('v18q')['v18q1'].apply(lambda x:x.isnull().sum())

In [None]:
data['v18q1']=data['v18q1'].fillna(0)

Handle v2a1 (monthly rent) based on home ownership status (tipovivi*)

In [None]:
own_variables=[x for x in data if x.startswith('tipo')]

In [None]:
own_variables

Meanings of those features:
tipovivi1 = 1: owns the house, fully paid
tipovivi2 = 1: owns the house, still paying
tipovivi3 = 1: rented
tipovivi4 = 1: precarious
tipovivi5 = 1: other (assigned, borrowed)

In [None]:
data.loc[data['v2a1'].isnull(),own_variables].sum().plot.bar()
plt.xticks([0, 1, 2, 3, 4],
           ['Owns and Paid Off', 'Owns and Paying', 'Rented', 'Precarious', 'Other'],
          rotation = 60)

In [None]:
owns=pd.DataFrame(data.loc[data['tipovivi3']==1,'Target'])

In [None]:
owns['Target'].value_counts()

A natural idea is that families who already own their house do not need to pay rent. Following this idea, fill in some of the missing values in v2a1.

In [None]:
# Fill in households that own the house with 0 rent payment
data.loc[(data['tipovivi1'] == 1), 'v2a1'] = 0


For the rows where v2a1 is still missing, I add a column called 'v2a1-missing' to the dataset to indicate whether v2a1 is missing for each row.

In [None]:

# Create missing rent payment column
data['v2a1-missing'] = data['v2a1'].isnull()


In [None]:
data['v2a1-missing'].value_counts()

Following discussions by others, the tipovivi3 indicator is used here to fill the remaining missing values.

In [None]:
data['v2a1'] = data['v2a1'].fillna(value=data['tipovivi3'])

In [None]:
data['v2a1'].head(20)


Then handle the missing values in rez_esc. This feature is the number of years behind in school, so it is only meaningful for people aged 7-19.
For people outside this range, I fill in 0, and I also add a column called rez_esc-missing to flag the values that are still missing.

In [None]:
data.loc[((data['age']>19) | (data['age']<7)) & (data['rez_esc'].isnull()), 'rez_esc']=0

In [None]:
data['rez_esc-missing'] = data['rez_esc'].isnull()

In [None]:
data['rez_esc']=data['rez_esc'].fillna(0)

There is also an outlier: one person is recorded as being 97 years behind in school, which cannot be right. Values above 5 are capped at 5.

In [None]:
data['rez_esc'].plot()

In [None]:
data.loc[data['rez_esc'] > 5, 'rez_esc'] = 5

After a preliminary look at the data and the handling of missing values and outliers, I can now do feature engineering.
First, clarify the type and level of each variable:

Types:
bool (true / false), ordered (ordinal), cont (continuous)

Levels:
ind (for individuals), hh (for households)

Individual-level variables can be aggregated to the household level, for example by computing an average, the highest education, or the age range over all people in a family.

Besides, id and sqr are the identifier columns and the squared versions of existing variables, which do not require much processing.

I do feature engineering to
1. Create new variables, such as the number of rooms in the house / total number of people, which reflects the degree of crowding
2. Aggregate individual-level variables to the household level

The specific meaning of each feature can be found on the official website. Many of the new features here were shared in Kaggle discussions and in posts by others.
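
Aggregating individual-level variables to the household level can be sketched on a toy table as follows. The column names `idhogar`, `age` and `escolari` follow the dataset, but the values and the derived `age-range` column are invented for illustration.

```python
import pandas as pd

# Toy individual-level table; the values are invented
people = pd.DataFrame({
    'idhogar': ['a', 'a', 'a', 'b', 'b'],
    'age': [40, 38, 10, 67, 30],
    'escolari': [11, 9, 4, 6, 16],
})

# One row per household, several statistics per individual-level column
agg = people.groupby('idhogar').agg(['min', 'max', 'mean'])

# Flatten the two-level column index into names such as 'age-max'
agg.columns = [f'{col}-{stat}' for col, stat in agg.columns]

# A derived household feature: spread of ages within the family
agg['age-range'] = agg['age-max'] - agg['age-min']
print(agg)
```

Building the flat names from the actual column tuples keeps each name attached to the right column, whatever order the statistics come out in.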


In [None]:
id_ = ['Id', 'idhogar', 'Target']

In [None]:
ind_bool = ['v18q', 'dis', 'male', 'female', 'estadocivil1', 'estadocivil2', 'estadocivil3', 
            'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 
            'parentesco1', 'parentesco2',  'parentesco3', 'parentesco4', 'parentesco5', 
            'parentesco6', 'parentesco7', 'parentesco8',  'parentesco9', 'parentesco10', 
            'parentesco11', 'parentesco12', 'instlevel1', 'instlevel2', 'instlevel3', 
            'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 
            'instlevel9', 'mobilephone', 'rez_esc-missing']

ind_ordered = ['rez_esc', 'escolari', 'age']

In [None]:
hh_bool = ['hacdor', 'hacapo', 'v14a', 'refrig', 'paredblolad', 'paredzocalo', 
           'paredpreb','pisocemento', 'pareddes', 'paredmad',
           'paredzinc', 'paredfibras', 'paredother', 'pisomoscer', 'pisoother', 
           'pisonatur', 'pisonotiene', 'pisomadera',
           'techozinc', 'techoentrepiso', 'techocane', 'techootro', 'cielorazo', 
           'abastaguadentro', 'abastaguafuera', 'abastaguano',
            'public', 'planpri', 'noelec', 'coopele', 'sanitario1', 
           'sanitario2', 'sanitario3', 'sanitario5',   'sanitario6',
           'energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4', 
           'elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 
           'elimbasu5', 'elimbasu6', 'epared1', 'epared2', 'epared3',
           'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 
           'tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5', 
           'computer', 'television', 'lugar1', 'lugar2', 'lugar3',
           'lugar4', 'lugar5', 'lugar6', 'area1', 'area2', 'v2a1-missing']

hh_ordered = [ 'rooms', 'r4h1', 'r4h2', 'r4h3', 'r4m1','r4m2','r4m3', 'r4t1',  'r4t2', 
              'r4t3', 'v18q1', 'tamhog','tamviv','hhsize','hogar_nin',
              'hogar_adul','hogar_mayor','hogar_total',  'bedrooms', 'qmobilephone']

hh_cont = ['v2a1', 'dependency', 'edjefe', 'edjefa', 'meaneduc', 'overcrowding']

In [None]:
sqr_ = ['SQBescolari', 'SQBage', 'SQBhogar_total', 'SQBedjefe', 
        'SQBhogar_nin', 'SQBovercrowding', 'SQBdependency', 'SQBmeaned', 'agesq']

In [None]:
heads = data.loc[data['parentesco1'] == 1, :]
heads = heads[id_ + hh_bool + hh_cont + hh_ordered]
heads.shape

For example, encode the household's electricity source as a single ordered variable.

In [None]:
elec = []

# Assign values
for i, row in heads.iterrows():
    if row['noelec'] == 1:
        elec.append(0)
    elif row['coopele'] == 1:
        elec.append(1)
    elif row['public'] == 1:
        elec.append(2)
    elif row['planpri'] == 1:
        elec.append(3)
    else:
        elec.append(np.nan)
        
# Record the new variable and missing flag
heads['elec'] = elec
heads['elec-missing'] = heads['elec'].isnull()


In [None]:
heads = heads.drop(columns = 'area2')

heads.groupby('area1')['Target'].value_counts(normalize = True)

In [None]:
# Wall ordinal variable
heads['walls'] = np.argmax(np.array(heads[['epared1', 'epared2', 'epared3']]),
                           axis = 1)

# heads = heads.drop(columns = ['epared1', 'epared2', 'epared3'])
#plot_categoricals('walls', 'Target', heads)

In [None]:
heads['epared2'].head()

In [None]:
# Roof ordinal variable
heads['roof'] = np.argmax(np.array(heads[['etecho1', 'etecho2', 'etecho3']]),
                           axis = 1)
#heads = heads.drop(columns = ['etecho1', 'etecho2', 'etecho3'])

# Floor ordinal variable
heads['floor'] = np.argmax(np.array(heads[['eviv1', 'eviv2', 'eviv3']]),
                           axis = 1)
# heads = heads.drop(columns = ['eviv1', 'eviv2', 'eviv3'])

In [None]:
# Create new feature
heads['walls+roof+floor'] = heads['walls'] + heads['roof'] + heads['floor']

#plot_categoricals('walls+roof+floor', 'Target', heads, annotate=False)
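
The `np.argmax` trick used for walls, roof and floor turns a group of mutually exclusive 0/1 columns into one ordinal column. A minimal sketch on invented data, assuming the three columns are ordered from worst to best condition (as epared1, epared2, epared3 are taken to be here):

```python
import numpy as np

# Rows are households; columns mimic epared1 (bad), epared2 (regular), epared3 (good)
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],   # no flag set at all
])

# Index of the column holding the 1 becomes the ordinal value
ordinal = np.argmax(one_hot, axis=1)

# argmax returns 0 for an all-zero row, so such rows look like the worst category
has_flag = one_hot.sum(axis=1) > 0
print(ordinal, has_flag)
```

If rows without any flag can occur, a mask like `has_flag` helps keep them apart from genuine "bad" rows.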

In [None]:
# No toilet, no electricity, no floor, no water service, no ceiling
heads['warning'] = 1 * (heads['sanitario1'] + 
                         (heads['elec'] == 0) + 
                         heads['pisonotiene'] + 
                         heads['abastaguano'] + 
                         (heads['cielorazo'] == 0))

In [None]:
# Owns a refrigerator, computer, tablet, and television
heads['bonus'] = 1 * (heads['refrig'] + 
                      heads['computer'] + 
                      (heads['v18q1'] > 0) + 
                      heads['television'])


In [None]:
heads['phones-per-capita'] = heads['qmobilephone'] / heads['tamviv']
heads['tablets-per-capita'] = heads['v18q1'] / heads['tamviv']
heads['rooms-per-capita'] = heads['rooms'] / heads['tamviv']
heads['rent-per-capita'] = heads['v2a1'] / heads['tamviv']

In [None]:
#feature from other notebook
heads.loc[(heads.v14a ==  1) & (heads.sanitario1 ==  1) & (heads.abastaguano == 0), "v14a"] = 0
heads.loc[(heads.v14a ==  1) & (heads.sanitario1 ==  1) & (heads.abastaguano == 0), "sanitario1"] = 0

In [None]:
# Copy so that adding a column does not write into a slice of `data`
ind = data[id_ + ind_bool + ind_ordered].copy()
ind['escolari/age'] = ind['escolari'] / ind['age']

plt.figure(figsize = (10, 8))
sns.violinplot(x = 'Target', y = 'escolari/age', data = ind);

In [None]:
# Define custom function
range_ = lambda x: x.max() - x.min()
range_.__name__ = 'range_'

# Group and aggregate
ind_agg = ind.drop(columns = 'Target').groupby('idhogar').agg(['min', 'max', 'sum', 'count', 'std', range_])
ind_agg.head()

In [None]:
# Rename the columns: flatten each (variable, statistic) pair into 'variable-statistic'.
# Iterating over the actual column tuples keeps every name attached to the right column.
ind_agg.columns = [f'{c}-{stat}' for c, stat in ind_agg.columns]
ind_agg.head()

In [None]:
ind_feats = list(ind_agg.columns)

# Merge on the household id
final = heads.merge(ind_agg, on = 'idhogar', how = 'left')

print('Final features shape: ', final.shape)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# Custom scorer for cross validation
scorer = make_scorer(f1_score, greater_is_better=True, average = 'macro')

In [None]:
# Labels for training
train_labels = np.array(list(final[final['Target'].notnull()]['Target'].astype(np.uint8)))

# Extract the training data
train_set = final[final['Target'].notnull()].drop(columns = ['Id', 'Target'])
test_set = final[final['Target'].isnull()].drop(columns = ['Id',  'Target'])

# Submission base which is used for making submissions to the competition
submission_base = test[['Id', 'idhogar']].copy()

In [None]:
train_set['adult'] = train_set['hogar_adul'] - train_set['hogar_mayor']
train_set['dependency_count'] = train_set['hogar_nin'] + train_set['hogar_mayor']
train_set['dependency'] = train_set['dependency_count'] / train_set['adult']
train_set['child_percent'] = train_set['hogar_nin']/train_set['hogar_total']
train_set['elder_percent'] = train_set['hogar_mayor']/train_set['hogar_total']
train_set['adult_percent'] = train_set['hogar_adul']/train_set['hogar_total']
test_set['adult'] = test_set['hogar_adul'] - test_set['hogar_mayor']
test_set['dependency_count'] = test_set['hogar_nin'] + test_set['hogar_mayor']
test_set['dependency'] = test_set['dependency_count'] / test_set['adult']
test_set['child_percent'] = test_set['hogar_nin']/test_set['hogar_total']
test_set['elder_percent'] = test_set['hogar_mayor']/test_set['hogar_total']
test_set['adult_percent'] = test_set['hogar_adul']/test_set['hogar_total']

train_set['rent_per_adult'] = train_set['v2a1']/train_set['hogar_adul']
train_set['rent_per_person'] = train_set['v2a1']/train_set['hhsize']
test_set['rent_per_adult'] = test_set['v2a1']/test_set['hogar_adul']
test_set['rent_per_person'] = test_set['v2a1']/test_set['hhsize']

train_set['overcrowding_room_and_bedroom'] = (train_set['hacdor'] + train_set['hacapo'])/2
test_set['overcrowding_room_and_bedroom'] = (test_set['hacdor'] + test_set['hacapo'])/2

In [None]:
train_set['r4h1_percent_in_male'] = train_set['r4h1'] / train_set['r4h3']
train_set['r4m1_percent_in_female'] = train_set['r4m1'] / train_set['r4m3']
train_set['r4h1_percent_in_total'] = train_set['r4h1'] / train_set['hhsize']
train_set['r4m1_percent_in_total'] = train_set['r4m1'] / train_set['hhsize']
train_set['r4t1_percent_in_total'] = train_set['r4t1'] / train_set['hhsize']
test_set['r4h1_percent_in_male'] = test_set['r4h1'] / test_set['r4h3']
test_set['r4m1_percent_in_female'] = test_set['r4m1'] / test_set['r4m3']
test_set['r4h1_percent_in_total'] = test_set['r4h1'] / test_set['hhsize']
test_set['r4m1_percent_in_total'] = test_set['r4m1'] / test_set['hhsize']
test_set['r4t1_percent_in_total'] = test_set['r4t1'] / test_set['hhsize']


In [None]:

train_set['rent_per_room'] = train_set['v2a1']/train_set['rooms']
train_set['bedroom_per_room'] = train_set['bedrooms']/train_set['rooms']
train_set['elder_per_room'] = train_set['hogar_mayor']/train_set['rooms']
train_set['adults_per_room'] = train_set['adult']/train_set['rooms']
train_set['child_per_room'] = train_set['hogar_nin']/train_set['rooms']
train_set['male_per_room'] = train_set['r4h3']/train_set['rooms']
train_set['female_per_room'] = train_set['r4m3']/train_set['rooms']
train_set['room_per_person_household'] = train_set['hhsize']/train_set['rooms']

test_set['rent_per_room'] = test_set['v2a1']/test_set['rooms']
test_set['bedroom_per_room'] = test_set['bedrooms']/test_set['rooms']
test_set['elder_per_room'] = test_set['hogar_mayor']/test_set['rooms']
test_set['adults_per_room'] = test_set['adult']/test_set['rooms']
test_set['child_per_room'] = test_set['hogar_nin']/test_set['rooms']
test_set['male_per_room'] = test_set['r4h3']/test_set['rooms']
test_set['female_per_room'] = test_set['r4m3']/test_set['rooms']
test_set['room_per_person_household'] = test_set['hhsize']/test_set['rooms']

train_set['rent_per_bedroom'] = train_set['v2a1']/train_set['bedrooms']
train_set['edler_per_bedroom'] = train_set['hogar_mayor']/train_set['bedrooms']
train_set['adults_per_bedroom'] = train_set['adult']/train_set['bedrooms']
train_set['child_per_bedroom'] = train_set['hogar_nin']/train_set['bedrooms']
train_set['male_per_bedroom'] = train_set['r4h3']/train_set['bedrooms']
train_set['female_per_bedroom'] = train_set['r4m3']/train_set['bedrooms']
train_set['bedrooms_per_person_household'] = train_set['hhsize']/train_set['bedrooms']

test_set['rent_per_bedroom'] = test_set['v2a1']/test_set['bedrooms']
test_set['edler_per_bedroom'] = test_set['hogar_mayor']/test_set['bedrooms']
test_set['adults_per_bedroom'] = test_set['adult']/test_set['bedrooms']
test_set['child_per_bedroom'] = test_set['hogar_nin']/test_set['bedrooms']
test_set['male_per_bedroom'] = test_set['r4h3']/test_set['bedrooms']
test_set['female_per_bedroom'] = test_set['r4m3']/test_set['bedrooms']
test_set['bedrooms_per_person_household'] = test_set['hhsize']/test_set['bedrooms']

train_set['tablet_per_person_household'] = train_set['v18q1']/train_set['hhsize']
train_set['phone_per_person_household'] = train_set['qmobilephone']/train_set['hhsize']
test_set['tablet_per_person_household'] = test_set['v18q1']/test_set['hhsize']
test_set['phone_per_person_household'] = test_set['qmobilephone']/test_set['hhsize']

train_set['age_12_19'] = train_set['hogar_nin'] - train_set['r4t1']
test_set['age_12_19'] = test_set['hogar_nin'] - test_set['r4t1']    



In [None]:
# Number of household members aged 18 or over, counted from the individual-level data
over_18 = data.loc[data['age'] >= 18].groupby('idhogar').size()

# Households without any member aged 18 or over get 0
train_set['num_over_18'] = train_set['idhogar'].map(over_18).fillna(0)
test_set['num_over_18'] = test_set['idhogar'].map(over_18).fillna(0)

In [None]:
train_set=train_set.drop(columns='idhogar')
test_set=test_set.drop(columns='idhogar')

After feature engineering, some of the divisions produced inf values. Turn those inf values into null values.

In [None]:
#deal with nan and inf
train_set=train_set.replace([np.inf, -np.inf], np.nan)
test_set=test_set.replace([np.inf, -np.inf], np.nan)
train_set.describe()

In [None]:
train_set.columns

In [None]:
# drop duplicated columns
needless_cols = ['r4t3', 'tamhog', 'tamviv', 'hhsize', 'v14a']
train_set = train_set.drop(needless_cols, axis=1)
test_set = test_set.drop(needless_cols, axis=1)

In [None]:
features = list(train_set.columns)


I do not have a better idea for dealing with the other missing values, so I just set them to 0.

In [None]:
train_set= train_set.fillna(0)
test_set= test_set.fillna(0)

In [None]:
test_ids = list(final.loc[final['Target'].isnull(), 'idhogar'])

In [None]:
train_set = pd.DataFrame(train_set, columns = features)

# Create correlation matrix
corr_matrix = train_set.corr()

# Select upper triangle of correlation matrix
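upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns whose absolute correlation with another column exceeds the threshold.
# The threshold here is 1, so effectively nothing is dropped; lower it (e.g. to 0.95) to remove highly correlated columns.
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 1)]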

to_drop

In [None]:
train_set = train_set.drop(columns = to_drop)
train_set.shape

In [None]:
test_set = pd.DataFrame(test_set, columns = features)
train_set, test_set = train_set.align(test_set, axis = 1, join = 'inner')
features = list(train_set.columns)

Recursive Feature Elimination with a Random Forest is used for feature selection. I do not know in advance whether it will help; only the test data will tell.

In [None]:
from sklearn.feature_selection import RFECV

# Create a model for feature selection
estimator = RandomForestClassifier(random_state = 10, n_estimators = 100,  n_jobs = -1)

# Create the object
selector = RFECV(estimator, step = 1, cv = 3, scoring= scorer, n_jobs = -1)

In [None]:
selector.fit(train_set, train_labels)

In [None]:
plt.plot(selector.grid_scores_);

plt.xlabel('Number of Features'); plt.ylabel('Macro F1 Score'); plt.title('Feature Selection Scores');
selector.n_features_

The selector keeps 105 features, so I will build the model on these.

In [None]:
train_selected = selector.transform(train_set)
test_selected = selector.transform(test_set)

In [None]:
# Convert back to dataframe
selected_features = train_set.columns[np.where(selector.ranking_==1)]
train_selected = pd.DataFrame(train_selected, columns = selected_features)
test_selected = pd.DataFrame(test_selected, columns = selected_features)

In [None]:
def macro_f1_score(labels, predictions):
    # Reshape the predictions as needed
    predictions = predictions.reshape(len(np.unique(labels)), -1 ).argmax(axis = 0)
    
    metric_value = f1_score(labels, predictions, average = 'macro')
    
    # Return is name, value, is_higher_better
    return 'macro_f1', metric_value, True


Add learning rate decay to the GBM algorithm [Stochastic Gradient Boosting, Jerome H. Friedman, March 26, 1999 ...]
To my knowledge, learning rate decay is mostly popular for deep learning algorithms, but here we do see improvements when it is applied.
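
To get a feel for the schedule, here is a small standalone sketch of exponential decay with a floor. The function name `decayed_learning_rate` and its parameter names are hypothetical; the default values (base 0.1, floor 0.02, decay factor 0.995) mirror the ones used in this notebook.

```python
import math

def decayed_learning_rate(current_iter, base=0.1, floor=0.02, decay=0.995):
    """Exponential decay base * decay**iter, never going below the floor."""
    return max(base * decay ** current_iter, floor)

# Sample the schedule at a few boosting iterations
schedule = [decayed_learning_rate(i) for i in (0, 100, 200, 400)]

# First iteration at which the floor takes over: smallest n with 0.995**n <= 0.02 / 0.1
first_floor_iter = math.ceil(math.log(0.02 / 0.1) / math.log(0.995))
print(schedule, first_floor_iter)
```

With these values the rate reaches its floor after a little more than 300 iterations, so early trees take large steps and later trees refine the fit with small ones.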

In [None]:


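def learning_rate_power_0997(current_iter):
    # Exponential decay with a lower bound.
    # Note: despite the function name, the decay factor used is 0.995 per iteration.
    base_learning_rate = 0.1
    min_learning_rate = 0.02
    lr = base_learning_rate * np.power(.995, current_iter)
    return max(lr, min_learning_rate)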


The training set is much smaller than the test set, so we need to watch carefully for overfitting.


In [None]:
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb
from IPython.display import display

def model_gbm(features, labels, test_features, test_ids, 
              nfolds = 5, return_preds = False, hyp = None,random_state_int=101,early_stopping=300):
    """Model using the GBM and cross validation.
       Trains with early stopping on each fold.
       Hyperparameters probably need to be tuned."""
    
    feature_names = list(features.columns)

    # Option for user specified hyperparameters
    if hyp is not None:
        # Using early stopping, so the number of estimators is not needed
        if 'n_estimators' in hyp:
            del hyp['n_estimators']
        params = hyp
    
    else:
        # Model hyperparameters
        params = {
                  'colsample_bytree': 0.88, 
                  'learning_rate': 0.028, 
                   'min_child_samples': 10, 
                   'num_leaves': 36, 
                  
                   'subsample': 0.54, 
                   'class_weight': 'balanced'}
    
    # Build the model
    model = lgb.LGBMClassifier(**params, objective = 'multiclass', 
                               n_jobs = 4, n_estimators = 10000,
                               random_state = 10)
    
    # Using stratified kfold cross validation
    strkfold = StratifiedKFold(n_splits = nfolds, shuffle = True, random_state= random_state_int)
    
    # Hold all the predictions from each fold
    predictions = pd.DataFrame()
    importances = np.zeros(len(feature_names))
    
    # Convert to arrays for indexing
    features = np.array(features)
    test_features = np.array(test_features)
    labels = np.array(labels).reshape((-1 ))
    
    valid_scores = []
    
    # Iterate through the folds
    for i, (train_indices, valid_indices) in enumerate(strkfold.split(features, labels)):
        
        # Dataframe for fold predictions
        fold_predictions = pd.DataFrame()
        
        # Training and validation data
        X_train = features[train_indices]
        X_valid = features[valid_indices]
        y_train = labels[train_indices]
        y_valid = labels[valid_indices]
        
        # Train with early stopping
        model.fit(X_train, y_train, early_stopping_rounds =early_stopping, 
                  eval_metric = macro_f1_score,
                  eval_set = [(X_train, y_train), (X_valid, y_valid)],
                  eval_names = ['train', 'valid'],
                  callbacks=[lgb.reset_parameter(learning_rate=learning_rate_power_0997)])
        
        # Record the validation fold score
        valid_scores.append(model.best_score_['valid']['macro_f1'])
        
        # Make predictions from the fold as probabilities
        fold_probabilitites = model.predict_proba(test_features)
        
        # Record each prediction for each class as a separate column
        for j in range(4):
            fold_predictions[(j + 1)] = fold_probabilitites[:, j]
            
        # Add needed information for predictions 
        fold_predictions['idhogar'] = test_ids
        fold_predictions['fold'] = (i+1)
        
        # Add the predictions as new rows to the existing predictions
        predictions = pd.concat([predictions, fold_predictions])
        
        # Feature importances
        importances += model.feature_importances_ / nfolds   
        
        # Display fold information
        display(f'Fold {i + 1}, Validation Score: {round(valid_scores[i], 5)}, Estimators Trained: {model.best_iteration_}')

    # Feature importances dataframe
    feature_importances = pd.DataFrame({'feature': feature_names,
                                        'importance': importances})
    
    valid_scores = np.array(valid_scores)
    display(f'{nfolds} cross validation score: {round(valid_scores.mean(), 5)} with std: {round(valid_scores.std(), 5)}.')
    
    # If we want to examine predictions don't average over folds
    if return_preds:
        predictions['Target'] = predictions[[1, 2, 3, 4]].idxmax(axis = 1)
        predictions['confidence'] = predictions[[1, 2, 3, 4]].max(axis = 1)
        return predictions, feature_importances
    
    # Average the predictions over folds
    predictions = predictions.groupby('idhogar', as_index = False).mean()
    
    # Find the class and associated probability
    predictions['Target'] = predictions[[1, 2, 3, 4]].idxmax(axis = 1)
    predictions['confidence'] = predictions[[1, 2, 3, 4]].max(axis = 1)
    predictions = predictions.drop(columns = ['fold'])
    
    # Merge with the base to have one prediction for each individual
    submission = submission_base.merge(predictions[['idhogar', 'Target']], on = 'idhogar', how = 'left').drop(columns = ['idhogar'])
        
    # Fill in the individuals that do not have a head of household with 4 since these will not be scored
    submission['Target'] = submission['Target'].fillna(4).astype(np.int8)
    
    # return the submission and feature importances along with validation scores
    return submission, feature_importances, valid_scores
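
The last part of the function averages the class probabilities of every household over the folds and then picks the most probable class. Here is a minimal sketch of that step on toy data. The household ids and probabilities are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy per-fold class probabilities for two hypothetical households
fold_preds = pd.DataFrame({
    'idhogar': ['h1', 'h2', 'h1', 'h2'],
    'fold':    [1, 1, 2, 2],
    1: [0.10, 0.60, 0.20, 0.40],
    2: [0.20, 0.20, 0.20, 0.30],
    3: [0.30, 0.10, 0.20, 0.20],
    4: [0.40, 0.10, 0.40, 0.10],
})

# Average the probabilities of each household over the folds
avg = fold_preds.groupby('idhogar', as_index=False).mean()

# The predicted class is the column label with the highest mean probability
avg['Target'] = avg[[1, 2, 3, 4]].idxmax(axis=1)
avg['confidence'] = avg[[1, 2, 3, 4]].max(axis=1)
avg = avg.drop(columns=['fold'])

print(avg)
```

Averaging probabilities before taking the maximum lets a fold that is very sure about a class outweigh folds that are nearly undecided, which a vote over per-fold class labels would not do.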

In [None]:
hyp1 = {
    'colsample_bytree': 0.88,
    'learning_rate': 0.08,
    'min_child_samples': 90,
    'num_leaves': 34,
    'subsample': 0.94,
    'reg_lambda': 0.5,
    'class_weight': 'balanced',
}

hyp2 = {
    'colsample_bytree': 0.88,
    'learning_rate': 0.08,
    'min_child_samples': 90,
    'num_leaves': 14,
    'subsample': 0.94,
    'reg_lambda': 0.5,
    'class_weight': 'balanced',
}

hyp3 = {
    'colsample_bytree': 0.78,
    'learning_rate': 0.08,
    'min_child_samples': 45,
    'num_leaves': 14,
    'subsample': 0.64,
    'reg_lambda': 0.1,
    'class_weight': 'balanced',
}

hyp4 = {
    'colsample_bytree': 0.72,
    'learning_rate': 0.08,
    'min_child_samples': 30,
    'num_leaves': 18,
    'subsample': 0.64,
    'reg_lambda': 0.1,
    'class_weight': 'balanced',
}
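
All four settings use 'class_weight': 'balanced' to handle the imbalanced labels. In the scikit-learn convention, which LGBMClassifier follows, each class gets the weight n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python. The helper name balanced_class_weights and the toy labels are my own and only serve as an illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical, heavily imbalanced labels (not the real class distribution)
toy_labels = [4] * 6 + [3] * 2 + [2] * 1 + [1] * 1
weights = balanced_class_weights(toy_labels)

for cls in sorted(weights):
    print(cls, round(weights[cls], 4))
```

The rare classes end up with the largest weights, so mistakes on them cost more during training.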

Train the models

In [None]:
%%capture --no-display

# submission1 and submission2 both use hyp1 with the same fold seed
submission1, gbm_fi_selected, valid_scores_selected = model_gbm(
    train_set, train_labels, test_set, test_ids,
    hyp=hyp1, random_state_int=103, early_stopping=300)
submission2, gbm_fi_selected, valid_scores_selected = model_gbm(
    train_set, train_labels, test_set, test_ids,
    hyp=hyp1, random_state_int=103, early_stopping=300)

# No hyp is passed here, so model_gbm falls back to its defaults
submission3, gbm_fi_selected, valid_scores_selected = model_gbm(
    train_selected, train_labels, test_selected, test_ids)


Train the model again, this time without learning rate decay.

In [None]:
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb
from IPython.display import display

def model_gbm_2(features, labels, test_features, test_ids, 
              nfolds = 5, return_preds = False, hyp = None,random_state_int=101,early_stopping=300):
    """Model using the GBM and cross validation.
       Trains with early stopping on each fold.
       Hyperparameters probably need to be tuned."""
    
    feature_names = list(features.columns)

    # Option for user specified hyperparameters
    if hyp is not None:
        # Early stopping is used, so the number of estimators is not needed
        if 'n_estimators' in hyp:
            del hyp['n_estimators']
        params = hyp
    
    else:
        # Model hyperparameters
        params = {
            'colsample_bytree': 0.88,
            'learning_rate': 0.028,
            'min_child_samples': 10,
            'num_leaves': 36,
            'subsample': 0.54,
            'class_weight': 'balanced',
        }
    
    # Build the model
    model = lgb.LGBMClassifier(**params, objective = 'multiclass', 
                               n_jobs = 4, n_estimators = 10000,
                               random_state = 10)
    
    # Using stratified kfold cross validation
    strkfold = StratifiedKFold(n_splits = nfolds, shuffle = True, random_state= random_state_int)
    
    # Hold all the predictions from each fold
    predictions = pd.DataFrame()
    importances = np.zeros(len(feature_names))
    
    # Convert to arrays for indexing
    features = np.array(features)
    test_features = np.array(test_features)
    labels = np.array(labels).reshape((-1 ))
    
    valid_scores = []
    
    # Iterate through the folds
    for i, (train_indices, valid_indices) in enumerate(strkfold.split(features, labels)):
        
        # Dataframe for fold predictions
        fold_predictions = pd.DataFrame()
        
        # Training and validation data
        X_train = features[train_indices]
        X_valid = features[valid_indices]
        y_train = labels[train_indices]
        y_valid = labels[valid_indices]
        
        # Train with early stopping
        model.fit(X_train, y_train, early_stopping_rounds =early_stopping, 
                  eval_metric = macro_f1_score,
                  eval_set = [(X_train, y_train), (X_valid, y_valid)],
                  eval_names = ['train', 'valid'],
                  verbose = 200)
        
        # Record the validation fold score
        valid_scores.append(model.best_score_['valid']['macro_f1'])
        
        # Make predictions from the fold as probabilities
        fold_probabilitites = model.predict_proba(test_features)
        
        # Record each prediction for each class as a separate column
        for j in range(4):
            fold_predictions[(j + 1)] = fold_probabilitites[:, j]
            
        # Add needed information for predictions 
        fold_predictions['idhogar'] = test_ids
        fold_predictions['fold'] = (i+1)
        
        # Add the predictions as new rows to the existing predictions
        predictions = pd.concat([predictions, fold_predictions])
        
        # Feature importances
        importances += model.feature_importances_ / nfolds   
        
        # Display fold information
        display(f'Fold {i + 1}, Validation Score: {round(valid_scores[i], 5)}, Estimators Trained: {model.best_iteration_}')

    # Feature importances dataframe
    feature_importances = pd.DataFrame({'feature': feature_names,
                                        'importance': importances})
    
    valid_scores = np.array(valid_scores)
    display(f'{nfolds} cross validation score: {round(valid_scores.mean(), 5)} with std: {round(valid_scores.std(), 5)}.')
    
    # If we want to examine predictions don't average over folds
    if return_preds:
        predictions['Target'] = predictions[[1, 2, 3, 4]].idxmax(axis = 1)
        predictions['confidence'] = predictions[[1, 2, 3, 4]].max(axis = 1)
        return predictions, feature_importances
    
    # Average the predictions over folds
    predictions = predictions.groupby('idhogar', as_index = False).mean()
    
    # Find the class and associated probability
    predictions['Target'] = predictions[[1, 2, 3, 4]].idxmax(axis = 1)
    predictions['confidence'] = predictions[[1, 2, 3, 4]].max(axis = 1)
    predictions = predictions.drop(columns = ['fold'])
    
    # Merge with the base to have one prediction for each individual
    submission = submission_base.merge(predictions[['idhogar', 'Target']], on = 'idhogar', how = 'left').drop(columns = ['idhogar'])
        
    # Fill in the individuals that do not have a head of household with 4 since these will not be scored
    submission['Target'] = submission['Target'].fillna(4).astype(np.int8)
    
    # return the submission and feature importances along with validation scores
    return submission, feature_importances, valid_scores

In [None]:
%%capture --no-display
submission4, gbm_fi_selected, valid_scores_selected = model_gbm_2(train_set, train_labels, test_set, test_ids,hyp = hyp3,random_state_int=103,early_stopping=300)
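
The only difference between model_gbm and model_gbm_2 is the learning rate schedule: the first passes a function to lgb.reset_parameter, which calls it with the current iteration index, while the second keeps the rate fixed. The definition of learning_rate_power_0997 is not part of this section, so the sketch below uses a hypothetical stand-in: exponential decay of the 0.08 base rate by a factor of 0.997 per iteration, with a lower bound that I added as an assumption.

```python
def decayed_learning_rate(current_iter, base_rate=0.08, decay=0.997, floor=1e-3):
    """Hypothetical schedule: exponential decay with a lower bound."""
    return max(base_rate * decay ** current_iter, floor)

def constant_learning_rate(current_iter, base_rate=0.08):
    """Schedule without decay: the rate is the same at every iteration."""
    return base_rate

# Compare both schedules at a few iterations
for it in (0, 100, 500, 1000):
    print(it, round(decayed_learning_rate(it), 5), constant_learning_rate(it))
```

With decay, later trees take smaller steps, so the model usually needs more iterations before early stopping triggers. Without decay it tends to stop earlier. Models trained under the two schedules can therefore disagree on some households, which is useful for the vote that follows.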

In [None]:
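# Rename each Target column so repeated merges do not create duplicate column names
suball = submission1.rename(columns={'Target': 'Target_1'})
suball = suball.merge(submission2.rename(columns={'Target': 'Target_2'}), on='Id')
suball = suball.merge(submission4.rename(columns={'Target': 'Target_4'}), on='Id')
suball = suball.merge(submission3.rename(columns={'Target': 'Target_3'}), on='Id')
# submission4 is merged a second time, so it counts twice among the five votes
suball_2 = suball.merge(submission4.rename(columns={'Target': 'Target_4b'}), on='Id')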

Take a majority vote across the models and use the winning class as the final submission.

In [None]:
# Keep the row position as a column so the vote can be merged back later
suball_2['index']=suball_2.index

In [None]:
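# Vote over the prediction columns only; 'Id' and 'index' must not count as votes
modle_13 = pd.DataFrame(suball_2.drop(columns=['Id', 'index']).mode(axis=1)[0].astype(int))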

In [None]:
modle_13['index']=modle_13.index

In [None]:
modle_13.head()

In [None]:
final = suball_2.merge(modle_13, on ='index')

In [None]:
final_res=final[['Id',0]]

In [None]:
final_res.columns=['Id','Target']

In [None]:
final_res.to_csv('submission.csv',index=False)
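
The voting cells above can also be written in a more compact form. Here is a minimal sketch on toy data. The column names m1 to m5 and the ids are hypothetical, and the tie-breaking behaviour (the smallest label wins) simply follows from taking column 0 of the mode.

```python
import pandas as pd

# Toy predictions from five hypothetical models for three individuals
votes = pd.DataFrame({
    'Id': ['ID_a', 'ID_b', 'ID_c'],
    'm1': [4, 2, 1],
    'm2': [4, 2, 3],
    'm3': [3, 2, 3],
    'm4': [4, 1, 1],
    'm5': [3, 1, 4],
})

vote_cols = ['m1', 'm2', 'm3', 'm4', 'm5']

# mode(axis=1) returns the most frequent value(s) of each row in sorted order,
# so column 0 holds the majority class (the smallest label if there is a tie)
votes['Target'] = votes[vote_cols].mode(axis=1)[0].astype(int)

final_res = votes[['Id', 'Target']]
print(final_res)
```

Restricting the mode to the vote columns keeps identifiers out of the count, and the cast to int avoids float labels in the CSV when some rows have ties.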