This is my first kernel on Kaggle. It is also one of the first times I have applied machine learning to a real-world problem (as opposed to an exercise from a class). I've used quite a lot of material from Will Koehrsen's walkthrough for this competition, with a few twists of my own added. His kernel can be viewed [here](https://www.kaggle.com/willkoehrsen/a-complete-introduction-and-walkthrough).

In [None]:
#Setup
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Any results you write to the current directory are saved as output.

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
test = pd.read_csv('../input/test.csv')
train = pd.read_csv('../input/train.csv')

In [None]:
test.info()

In [None]:
train.info()

Something interesting to note here: the test data set has almost 24 thousand entries, while the training set has fewer than 10 thousand.

Note also that a lot of the columns in the 140+ column dataset are binary/boolean values. We can determine precisely which ones by looking at the descriptions, or we can count the number of unique values for each column and count the number of columns with only two unique values. 

In [None]:
train.select_dtypes(np.int64).nunique().value_counts().sort_index().plot.bar(color = 'blue')
plt.xlabel('Number of Unique Values')
plt.ylabel('Count')
plt.title('Count of Unique Values in Columns of Type Int64')

So we see that there are about 101 columns with boolean entries. (I say "about" because it's possible, though very unlikely, that a column has only two unique values which do not represent true and false.) Quite a few of these are actually categorical variables which have been encoded. For example, `lugar1` to `lugar6` are simply columns indicating the region to which the household belongs. This is a purely categorical variable, so aside from possibly removing `lugar6`, we will leave these columns as is. Other columns represent ordinal variables, such as `eviv1`, `eviv2`, and `eviv3`, which tell us whether the floor is "bad", "regular", or "good", in that order. We may merge these three columns into one, with entries 0, 1, or 2 according to whether `eviv1`, `eviv2`, or `eviv3` is set.
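To go beyond "about", one could check which two-valued columns really contain only 0 and 1. The following is a minimal sketch on a made-up frame; the column names are hypothetical stand-ins, not columns from the competition data.

```python
import pandas as pd

# Toy frame: the column names below are invented for illustration.
toy = pd.DataFrame({
    'flag_a': [0, 1, 1, 0],      # genuine 0/1 indicator
    'flag_b': [1, 1, 0, 1],      # genuine 0/1 indicator
    'two_vals': [3, 7, 3, 7],    # two unique values, but not boolean
    'count': [1, 2, 3, 4],       # ordinary integer column
})

num_cols = toy.select_dtypes('number')
two_valued = [c for c in num_cols if num_cols[c].nunique() == 2]
# A column is a true indicator only if its values are a subset of {0, 1}.
binary = [c for c in two_valued if set(num_cols[c].unique()) <= {0, 1}]

print(two_valued)
print(binary)
```

Applied to `train`, the difference between the two lists would be exactly the columns that have two values without being indicators.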

Of the remaining columns, 8 are floats. These include `v2a1`, `v18q1`, `rez_esc`, `meaneduc`, `overcrowding`, and a few columns containing squared data. We will ignore the columns with squared data, since they're only really needed with linear models. Of the five float columns we are focusing on, `v2a1`, `meaneduc`, and `overcrowding` are the only ones for which a float makes sense. `v18q1` and `rez_esc` are both counts of discrete objects (tablets and years behind in schooling respectively) and only take on integer values.

The data we have is organized at the individual level, with certain columns which are pertinent to the individual in particular and certain others which are for the entire household. Since our task is only to classify at the household level, we will distinguish the individuals who are the heads of their households. We will also separate the household level columns and the columns which give features at the individual level. The latter will need to be aggregated before we proceed. 
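Selecting rows with `parentesco1 == 1` silently drops any household in which nobody is flagged as head. A small sketch of how one might check for that, on invented household ids:

```python
import pandas as pd

# Toy individual-level data; the ids are invented for illustration.
people = pd.DataFrame({
    'idhogar':     ['a', 'a', 'b', 'b', 'c'],
    'parentesco1': [1,   0,   0,   0,   1],
})

# Number of members flagged as head in each household.
heads_per_household = people.groupby('idhogar')['parentesco1'].sum()

# Households that would vanish when filtering on parentesco1 == 1.
no_head = heads_per_household[heads_per_household == 0].index.tolist()
print(no_head)
```

The same count also reveals households with more than one head, should any exist.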

In [None]:
# Quality of life changes:
test['Target'] = np.nan
data = pd.concat([train, test], ignore_index = True) # DataFrame.append was removed in pandas 2.0

heads = data.loc[data['parentesco1'] == 1, :]

**Correcting Errors in the Data**

We know from the discussion that there are some households with inconsistent Target data, that is, with members whose Target is different from that of the head of their household. We know that the correct label for these individuals is the label that has been assigned to the head of their household. We turn now to correcting such errors for the sake of completeness.

In [None]:
consistent = train.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)
inconsistent = consistent[consistent != True]
for household in inconsistent.index:
    head = train.loc[(train['idhogar'] == household) & (train['parentesco1'] == 1), 'Target']
    train.loc[train['idhogar'] == household, 'Target'] = int(head.iloc[0]) # label of the household head
    
# let's check again for inconsistencies
consistent = train.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)
inconsistent = consistent[consistent != True]
print('There are {} households with inconsistent Target labels.'.format(len(inconsistent)))
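The loop above can also be written without iteration. This is only a sketch on toy data, and it assumes every household has at most one head (with duplicate heads, `map` would fail on the non-unique index).

```python
import pandas as pd

toy = pd.DataFrame({
    'idhogar':     ['a', 'a', 'b', 'b', 'b'],
    'parentesco1': [1,   0,   0,   1,   0],
    'Target':      [4,   4,   2,   3,   2],
})

# One label per household, taken from the row flagged as head.
head_target = toy.loc[toy['parentesco1'] == 1].set_index('idhogar')['Target']

# Map each member to the head's label; keep the original label if no head exists.
toy['Target'] = toy['idhogar'].map(head_target).fillna(toy['Target']).astype(int)
print(toy['Target'].tolist())
```

Household `b` starts with labels 2, 3, 2 and ends with the head's label 3 on every row.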

**Exploring Missing Values in the Data**

In [None]:
missing = train.isnull().sum().sort_values(ascending = False)
missing = missing[missing > 0]
missing = (missing/len(train)) # express missing counts as ratio of whole
missing.round(3).plot.bar()

Three columns have a significant number of missing values. The other two, `meaneduc` and `SQBmeaned`, can actually be derived from the other values in the same row. The first of the three, `rez_esc`, represents the number of years that the individual is behind in their schooling.

>This [data] is only collected for people between 7 and 19 years of age and [is] the difference between the years of education a person should have and the years of education he/she has. [It] is capped at 5. 

The second column, `v18q1`, counts the number of tablets the household owns. The third column, `v2a1`,  records the household's monthly rent payment. `v18q` indicates whether the household owns a tablet or not. It is possible that `v18q1` lists `NaN` for the households for whom `v18q` is 0. 

In [None]:
heads.groupby('v18q')['v18q1'].apply(lambda x: x.isnull().sum())
# This counts the number of null entries in v18q1, grouped by v18q. 

# Calling .isnull() directly on the grouped column raises an AttributeError,
# since the groupby object does not expose that method; hence the apply above.

It seems that the `v18q1` column is `NaN` whenever `v18q` is 0. Thus, we may simply replace the `NaN` entries in `v18q1` with zeroes and possibly drop the `v18q` column altogether. 
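A cross-tabulation is another way to confirm this kind of claim. The sketch below uses a toy frame with the same two column names; the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'v18q':  [0, 0, 1, 1, 0],
    'v18q1': [np.nan, np.nan, 1.0, 2.0, np.nan],
})

# Rows: owns a tablet (0/1). Columns: whether the tablet count is missing.
table = pd.crosstab(toy['v18q'], toy['v18q1'].isnull())
print(table)

# The claim holds if "missing" and "owns no tablet" coincide on every row.
same = bool((toy['v18q1'].isnull() == (toy['v18q'] == 0)).all())
print(same)
```

If `same` came out `False` on the real data, filling with zero would overwrite genuinely unknown counts.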

In [None]:
data = data.fillna({'v18q1':0})

In [None]:
# and now we want to graph
df = data.loc[data['parentesco1'] == 1, ['idhogar', 'v18q1', 'Target']] # heads, with v18q1 already filled
relative = df.groupby('Target')['v18q1'].value_counts()
relative = relative/relative.groupby('Target').sum()
relative = relative.rename('counts').reset_index()
g = sns.catplot(data = relative, col = 'Target', kind = 'bar', y= 'counts', x = 'v18q1')

Could the same be true of `v2a1`? That is, could it be that the households who do not have a rent payment listed simply do not pay rent, say, for example, because they own the house already? To find out, we take a look at the `tipovivi` columns:

```
tipovivi1, =1 own and fully paid house
tipovivi2, =1 own, paying in installments
tipovivi3, =1 rented
tipovivi4, =1 precarious
tipovivi5, =1 other(assigned, borrowed)
```

In [None]:
house_vars = ['tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5']
nulls = train.loc[train.v2a1.isnull(), house_vars].sum()
notnulls = train.loc[train.v2a1.notnull(), house_vars].sum()
pd.DataFrame(data = {'isnull': nulls, 'notnull': notnulls})

So we see that the households with entries in `v2a1` either own the house and are paying it off in installments or are renting it. The households with null entries in `v2a1` either own the house outright, are classified as "precarious", or belong to the "other" category. For households that own the house, we may simply replace the null entries with 0. For the other cases we will have to impute the values, and we add a column indicating that no rent data was provided.

In [None]:
data.loc[data.tipovivi1 == 1, 'v2a1'] = 0 # if the family owns the house set rent to 0. 

data['v2a1-missing'] = data['v2a1'].isnull()
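The order of the two steps matters: owners are zeroed first, so the flag only marks rows that still need imputation. A toy walk-through with made-up rent values, where the median fill stands in for the imputer applied later in the notebook:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'tipovivi1': [1, 0, 0, 0],                   # 1 = owned and fully paid
    'v2a1': [np.nan, 50000.0, np.nan, 80000.0],  # monthly rent, made-up values
})

# Step 1: owners pay no rent, so their missing value is a true zero.
toy.loc[toy['tipovivi1'] == 1, 'v2a1'] = 0

# Step 2: flag whatever is still missing; these rows need real imputation.
toy['v2a1-missing'] = toy['v2a1'].isnull()

# Step 3: fill the remaining gaps with the median of the known values.
toy['v2a1'] = toy['v2a1'].fillna(toy['v2a1'].median())
print(toy)
```

Only the third row ends up flagged, and it receives the median of 0, 50000, and 80000.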

Finally we have `rez_esc`.  From the description, we know that only persons aged 7 to 19 will have values collected for this variable. Thus we may as well set the value for those outside of this age range to 0 and then add a column indicating which entries had to be imputed, as with `v2a1`.

In [None]:
data.loc[((data['age'] < 7) | (data['age'] > 19)) & (data['rez_esc'].isnull()), 'rez_esc'] = 0

data['rez_esc-missing'] = data['rez_esc'].isnull()

In [None]:
heads = data.loc[data.parentesco1 == 1, :]

We now proceed to separate the data as needed into their different categories. For example, we will distinguish between boolean columns at the household level, boolean columns at the individual level, categorical columns at the household level, and so on. We will also discard the squared variables, which are only really necessary when building a linear model.

In [None]:
id_ = ['Id', 'idhogar', 'Target']

In [None]:
ind_bool = ['dis', 'male', 'female', 'estadocivil1', 'estadocivil2', 'estadocivil3', 
            'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 
            'parentesco1', 'parentesco2',  'parentesco3', 'parentesco4', 'parentesco5', 
            'parentesco6', 'parentesco7', 'parentesco8',  'parentesco9', 'parentesco10', 
            'parentesco11', 'parentesco12', 'instlevel1', 'instlevel2', 'instlevel3', 
            'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 
            'instlevel9', 'mobilephone', 'rez_esc-missing']

ind_ordered = ['rez_esc', 'escolari', 'age']

In [None]:
hh_bool = ['v18q', 'hacdor', 'hacapo', 'v14a', 'refrig', 'paredblolad', 'paredzocalo', 
           'paredpreb','pisocemento', 'pareddes', 'paredmad',
           'paredzinc', 'paredfibras', 'paredother', 'pisomoscer', 'pisoother', 
           'pisonatur', 'pisonotiene', 'pisomadera',
           'techozinc', 'techoentrepiso', 'techocane', 'techootro', 'cielorazo', 
           'abastaguadentro', 'abastaguafuera', 'abastaguano',
            'public', 'planpri', 'noelec', 'coopele', 'sanitario1', 
           'sanitario2', 'sanitario3', 'sanitario5',   'sanitario6',
           'energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4', 
           'elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 
           'elimbasu5', 'elimbasu6', 'epared1', 'epared2', 'epared3',
           'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 
           'tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5', 
           'computer', 'television', 'lugar1', 'lugar2', 'lugar3',
           'lugar4', 'lugar5', 'lugar6', 'area1', 'area2', 'v2a1-missing']

hh_ordered = [ 'rooms', 'r4h1', 'r4h2', 'r4h3', 'r4m1','r4m2','r4m3', 'r4t1',  'r4t2', 
              'r4t3', 'v18q1', 'tamhog','tamviv','hhsize','hogar_nin',
              'hogar_adul','hogar_mayor','hogar_total',  'bedrooms', 'qmobilephone']

hh_cont = ['v2a1', 'dependency', 'edjefe', 'edjefa', 'meaneduc', 'overcrowding']

**Household Level Variables**

In [None]:
heads = heads[id_ + hh_bool + hh_cont + hh_ordered].copy() # copy so new columns can be added without warnings

A quick note here. `dependency`, `edjefe`, and `edjefa` are listed here as continuous variables, but the column data is of dtype object. This is because some of the entries are "no" and some are "yes". From the variable description, we know that "yes" represents 1 and "no" represents 0. Presumably, a "no" in `edjefe` means that the household leader is not male. To simplify matters, we will combine the data in `edjefe` and `edjefa` and then add a column that indicates whether the household leader is male or female.

In [None]:
maps = {'yes': 1, 'no': 0}

heads['dependency'] = heads.dependency.replace(maps).astype(np.float64)
edjefe = heads.edjefe.replace(maps).astype(np.float64)
edjefa = heads.edjefa.replace(maps).astype(np.float64)
heads['edjef'] = edjefe + edjefa
# Compare the mapped numbers, not the raw strings: 'no' != 0 is always True
heads['leader_male'] = edjefe != 0
heads = heads.drop(columns = ['edjefe', 'edjefa'])
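A toy illustration of the mapping, using invented values in the same mixed format as `edjefe`. It also shows why a comparison with 0 has to happen after the strings have been converted.

```python
import pandas as pd

# Made-up values; dtype=object mirrors how the column is read from the CSV.
edjefe = pd.Series(['no', '6', 'yes', '11'], dtype=object)

maps = {'yes': 1, 'no': 0}
years = edjefe.replace(maps).astype(float)
print(years.tolist())

# Comparing the raw strings with 0 is True on every row, because 'no' != 0.
raw_flag = edjefe.map(lambda x: x != 0)

# Comparing after the mapping separates the zero entries from the rest.
mapped_flag = years != 0
print(raw_flag.tolist(), mapped_flag.tolist())
```

Note that a zero after mapping is still ambiguous: it could mean "not the male head" or a male head with zero years of education.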



Note now that we have several variables which seem to overlap:

* r4t3, Total persons in the household
* tamhog, size of the household
* tamviv, number of persons living in the household
* hhsize, household size
* hogar_total, number of total individuals in the household

Let us examine the correlation matrix of these five features.

In [None]:
sns.heatmap(heads[['r4t3', 'tamhog', 'tamviv', 'hhsize', 'hogar_total']].corr(), annot = True, fmt = '.3f')

We see that `r4t3`, `tamhog`, `hhsize`, and `hogar_total` hold essentially the same data, so we will choose one (`hhsize`) and discard the rest. On the other hand, notice that `tamviv`, although highly correlated, differs slightly from `hhsize`. `tamviv` counts the number of people living in the household, which the organizers have indicated sometimes includes domestic workers, while `hhsize` counts the number of people who belong to the household, not all of whom may live in the household.

In [None]:
heads.plot.scatter(x = 'tamviv', y = 'hhsize')

sns.jointplot(x = 'tamviv', y = 'hhsize', data = heads, kind = 'hex', gridsize = 10)

There aren't any households for which `tamviv` is less than `hhsize`. There are some households with larger `tamviv` than `hhsize`, but as we can see from the hexplot, they are a small minority. Nevertheless, this may come into play, so we will create a new feature showing the difference between `tamviv` and `hhsize`. 

In [None]:
heads['hhsize-diff'] = heads.hhsize - heads.tamviv
heads = heads.drop(columns = ['tamhog', 'hogar_total', 'r4t3'])
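The observation from the plots can also be stated as counts. A sketch on made-up sizes; with the sign convention used here (`hhsize` minus `tamviv`), the new feature is zero or negative whenever `tamviv` is at least `hhsize`.

```python
import pandas as pd

toy = pd.DataFrame({'tamviv': [4, 5, 3, 6], 'hhsize': [4, 4, 3, 6]})

diff = toy['hhsize'] - toy['tamviv']
n_fewer = int((diff > 0).sum())   # dwelling holds fewer people than the household
n_more = int((diff < 0).sum())    # dwelling holds extra people
share_more = n_more / len(toy)
print(n_fewer, n_more, share_more)
```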

There are a number of other categorical variables which have already been encoded in the dataset. For example:

* paredblolad, =1 if predominant material on the outside wall is block or brick
* paredzocalo, =1 if predominant material on the outside wall is socket (wood,  zinc or asbestos)
* paredpreb, =1 if predominant material on the outside wall is prefabricated or cement
* pareddes, =1 if predominant material on the outside wall is waste material
* paredmad, =1 if predominant material on the outside wall is wood
* paredzinc, =1 if predominant material on the outside wall is zink
* paredfibras, =1 if predominant material on the outside wall is natural fibers
* paredother, =1 if predominant material on the outside wall is other

Since these columns encode a single categorical variable, any one of them is determined by the rest, so we can delete the `paredother` column in this case. Similarly, we can delete `pisoother`, `techootro`, `abastaguafuera`, `coopele`, `sanitario6`, `energcocinar4`, `elimbasu6`, `tipovivi5`, `lugar6`, and `area2`.

We would also like to create ordinal variables out of boolean variables like `epared1`, `epared2`, `epared3`, which are binary variables indicating whether the walls are "bad", "regular", or "good" respectively.

In [None]:
heads = heads.drop(columns = ['paredother', 'pisoother', 'techootro', 'abastaguafuera', 
                      'coopele', 'sanitario6', 'energcocinar4', 'elimbasu6', 
                      'tipovivi5', 'lugar6', 'area2'])
# delete the dummy variable trap columns

heads['wall'] = np.argmax(np.array(heads[['epared1', 'epared2', 'epared3']]), axis = 1)
heads['floor'] = np.argmax(np.array(heads[['eviv1', 'eviv2', 'eviv3']]), axis = 1)
heads['roof'] = np.argmax(np.array(heads[['etecho1', 'etecho2', 'etecho3']]), axis = 1)
heads = heads.drop(columns = ['epared1', 'epared2', 'epared3', 
                      'eviv1', 'eviv2', 'eviv3', 
                      'etecho1', 'etecho2', 'etecho3'])
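One caveat with the `argmax` approach, shown on a toy frame with made-up rows: a row in which none of the three indicators is set also maps to 0, which is indistinguishable from "bad". Checking that each row has exactly one indicator set would catch that.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'epared1': [1, 0, 0, 0],
    'epared2': [0, 1, 0, 0],
    'epared3': [0, 0, 1, 0],   # last row has no level set at all
})
cols = ['epared1', 'epared2', 'epared3']

# Position of the first maximum in each row: 0 = bad, 1 = regular, 2 = good.
wall = np.argmax(toy[cols].to_numpy(), axis=1)
print(wall.tolist())

# argmax returns 0 for an all-zero row, so verify the one-hot property separately.
exactly_one = toy[cols].sum(axis=1) == 1
print(exactly_one.tolist())
```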

**Some Additional Features**

In [None]:
heads['phones-per-capita'] = heads['qmobilephone']/heads['tamviv']
heads['tablets-per-capita'] = heads['v18q1']/heads['tamviv']
heads['rooms-per-capita'] = heads['rooms']/heads['tamviv']
heads['rent-per-capita'] = heads['v2a1']/heads['tamviv']

**Individual Level Variables**

In [None]:
ind = data[id_ + ind_bool + ind_ordered].copy() # copy so new columns can be added without warnings

As before, we have some boolean columns which are categorical variables that have been encoded. In particular, we can look at the `instlevel_` variables. We also have `male` and `female`, one of which we can remove. 

In [None]:
ind['inst'] = np.argmax(np.array(ind[['instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5',
                                      'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9']]), axis = 1)
ind = ind.drop(columns = [c for c in ind if c.startswith('instlevel')])
ind = ind.drop(columns = 'female')

Now we have to deal with aggregation. For the boolean variables such as `dis`, which specifies whether the individual is disabled or not, and `male`, we will take the mean over the entire household. This will tell us what proportion of the whole household is disabled or male. The variables which indicate the civil status of the individuals are handled the same way in the code below, so each becomes the share of household members with that status. For the ordered variables we additionally record the minimum and maximum within the household.

In [None]:
ind = ind.drop(columns = 'Target') # We don't need Target data from the individual level
ind_agg_ordered = ind[['age', 'escolari', 'rez_esc', 'inst', 'idhogar']].groupby('idhogar').agg(['min', 'max'])
# rename the columns
# each column label is a (feature, statistic) tuple, in the same order as the data
new_col = [f'{c}-{stat}' for c, stat in ind_agg_ordered.columns]
ind_agg_ordered.columns = new_col

ind_agg = ind.drop(columns = 'Id').groupby('idhogar').agg('mean') # Id is a string and cannot be averaged
# rename the columns
new_col = []
for c in ind_agg:
    new_col.append(f'{c}-mean')
ind_agg.columns = new_col

#concatenate the dataframes
ind_agg = pd.concat([ind_agg, ind_agg_ordered], axis = 1)
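On a toy frame, the aggregate-and-rename step looks like this. Iterating over the column tuples themselves keeps each name attached to its own data, without relying on the ordering of `columns.levels`. The values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'idhogar':  ['a', 'a', 'b'],
    'age':      [30, 5, 62],
    'escolari': [11, 0, 6],
})

agg = toy.groupby('idhogar').agg(['min', 'max'])

# Each column label is a (feature, statistic) tuple; join the two parts.
agg.columns = [f'{feature}-{stat}' for feature, stat in agg.columns]
print(agg)
```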

Are any of these variables we've created correlated with each other?

In [None]:
# Create correlation matrix
corr_matrix = ind_agg.corr()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]
to_drop

We will drop these three columns and then create the final data.

In [None]:
ind_agg = ind_agg.drop(columns = to_drop)
final = heads.merge(ind_agg, on = 'idhogar', how = 'left')
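The correlation filter used above can be checked on a toy frame where the answer is known in advance. Here column `b` is exactly twice column `a`, so it should be the only one flagged; the column names and values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],   # exactly 2 * a
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],    # only weakly related to a
})

corr = toy.corr()

# Keep only entries strictly above the diagonal so each pair is seen once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates strongly with any earlier column.
to_drop = [col for col in upper.columns if (upper[col].abs() > 0.95).any()]
print(to_drop)
```

Because only the upper triangle is kept, the first column of each correlated pair survives and the later one is dropped.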

The fourth class (non-vulnerable) is more easily distinguished from the first three. We will split our classification into two portions. First, we wish to separate the fourth class from the first three. Once we have accomplished this, we will try to train the machine to distinguish between the three vulnerable classes. 

In [None]:
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
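from sklearn.impute import SimpleImputer as Imputer # successor of the removed sklearn.preprocessing.Imputer
from sklearn.preprocessing import MinMaxScaler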
from sklearn.pipeline import Pipeline

# custom scorer with macro f1
scorer = make_scorer(f1_score, greater_is_better = True, average = 'macro')

# this is where we will make our predictions when submitting
submissions_base = test[['Id', 'idhogar']].copy()

train_set = final[final.Target.notnull()].drop(columns = ['Id', 'idhogar', 'Target'])
test_set = final[final.Target.isnull()].drop(columns = ['Id', 'idhogar', 'Target'])
train_labels = np.array(list(final[final['Target'].notnull()]['Target'].astype(np.uint8)))
test_ids = list(final.loc[final.Target.isnull(), 'idhogar'])
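To make the macro F1 metric concrete, here is a hand-rolled version in plain NumPy. It is a simplified sketch (classes are taken from `y_true` only, and `macro_f1` is a name chosen for illustration), not a replacement for `f1_score`. The example labels are invented to show why accuracy can look fine while macro F1 is poor.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (classes taken from y_true)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

y_true = [4, 4, 4, 4, 1, 2]
y_pred = [4, 4, 4, 4, 4, 4]   # always predict the majority class

accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
score = macro_f1(y_true, y_pred)
print(round(accuracy, 3), round(score, 3))
```

Class 4 scores 0.8 while classes 1 and 2 score 0, so the macro average is far below the accuracy. Every class counts equally regardless of its size.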

In [None]:
def split_model(model, train_set, train_labels, test_set, test_ids):
    """ 
    Main learning module, two phases:
    phase 1 separates class 4 from classes 1-3, phase 2 labels the rest.
    train_set and test_set hold household level features.
    """
    
    # Phase 1: Distinguish class 4 from classes 1-3
    train0_labels = np.array([x < 4 for x in train_labels])
    model.fit(train_set, train0_labels)
    test0_labels = model.predict(test_set)
    
    # Filter out the non-vulnerable households
    test1_set = test_set[test0_labels]
    test1_ids = [test_ids[i] for i in range(len(test_ids)) if test0_labels[i]]
    train1_set = train_set[train0_labels]
    train1_labels = train_labels[train0_labels]
    
    # Phase 2: Distinguish between classes 1-3
    model.fit(train1_set, train1_labels)
    labels = model.predict(test1_set)
    labels = pd.DataFrame({'idhogar': test1_ids, 'Target': labels})
    
    # Everything not labeled by Phase 2 either has no parentesco1 == 1 or belongs to class 4
    submission = submissions_base.merge(labels, how = 'left', on = 'idhogar').drop(columns = 'idhogar')
    submission['Target'] = submission['Target'].fillna(4).astype(np.int8)
    return submission
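Stripped of the models, the way the two phases are combined comes down to a boolean mask. The arrays below are made-up predictions for five test households; no classifier is involved.

```python
import numpy as np

# Phase 1 output: True = predicted vulnerable (one of classes 1-3).
is_vulnerable = np.array([True, False, True, True, False])

# Phase 2 output, one label per household that passed phase 1.
phase2_labels = np.array([2, 1, 3])

# Start from class 4 everywhere, then overwrite the vulnerable positions.
final_labels = np.full(is_vulnerable.shape, 4, dtype=np.int8)
final_labels[is_vulnerable] = phase2_labels
print(final_labels.tolist())
```

One consequence of this design is that a mistake in phase 1 cannot be undone: a vulnerable household that phase 1 calls non-vulnerable is never seen by phase 2.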


In [None]:
#features = list(train_set.columns)
pipeline = Pipeline([('imputer', Imputer(strategy = 'median')), ('scaler', MinMaxScaler())])

# Impute missing values as well as scale data
train_set = pipeline.fit_transform(train_set)
test_set = pipeline.transform(test_set)

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 100, n_jobs = -1)
cv_score = cross_val_score(model, train_set, train_labels, cv = 10, scoring = scorer)
cv_score.mean(), cv_score.std()

In [None]:
RF_submission = split_model(model, train_set, train_labels, test_set, test_ids)
RF_submission.to_csv('RF_submission.csv', index = False)

Submitting this result gives us a surprisingly high score of 0.414. I say surprising in comparison with the cross-validation score shown above for the random forest classifier run without phases. Would this score improve with some parameter tuning or perhaps a more powerful gradient boosting model?

In [None]:
from xgboost import XGBClassifier

model = XGBClassifier(objective = 'multi:softprob', num_class = 4, learning_rate = 0.006)
#cv_score = cross_val_score(model, train_set, train_labels, cv = 10, scoring = scorer)
#cv_score.mean(), cv_score.std()

In [None]:
# Phase 1: Distinguish class 4 from classes 1-3
model = XGBClassifier(learning_rate = 0.05)
train0_labels = np.array([x < 4 for x in train_labels])
model.fit(train_set, train0_labels)
test0_labels = model.predict(test_set).astype(bool) # boolean mask for the filtering below

# Filter out the non-vulnerable households
test1_set = test_set[test0_labels]
test1_ids = [test_ids[i] for i in range(len(test_ids)) if test0_labels[i]]
train1_set = train_set[train0_labels]
train1_labels = train_labels[train0_labels]

# Phase 2: Distinguish between classes 1-3
# Recent XGBoost versions expect labels 0..n_classes-1, so shift 1-3 to 0-2 and back
model = XGBClassifier(objective = 'multi:softprob', learning_rate = 0.05)
model.fit(train1_set, train1_labels - 1)
labels = model.predict(test1_set) + 1
labels = pd.DataFrame({'idhogar': test1_ids, 'Target': labels})

# Everything not labeled by Phase 2 either has no parentesco1 == 1 or belongs to class 4
submission = submissions_base.merge(labels, how = 'left', on = 'idhogar').drop(columns = 'idhogar')
submission['Target'] = submission['Target'].fillna(4).astype(np.int8)
submission.to_csv('xgb_submission.csv', index = False)

In [None]:
import lightgbm as lgb

model = lgb.LGBMClassifier(boosting_type = 'dart', colsample_bytree = 0.88, learning_rate = 0.028,
                          min_child_samples = 10, num_leaves = 36, reg_alpha = 0.76, reg_lambda = 0.43,
                          subsample_for_bin = 40000, subsample = 0.54, class_weight = 'balanced',
                          objective = 'multiclass', n_estimators = 100, random_state = 10)
cv_score = cross_val_score(model, train_set, train_labels, cv = 10, scoring = scorer)
cv_score.mean(), cv_score.std()

# Phase 1: Distinguish class 4 from classes 1-3
model = lgb.LGBMClassifier(boosting_type = 'dart', colsample_bytree = 0.88, learning_rate = 0.028,
                          min_child_samples = 10, num_leaves = 36, reg_alpha = 0.76, reg_lambda = 0.43,
                          subsample_for_bin = 40000, subsample = 0.54, class_weight = 'balanced',
                          objective = 'binary', n_estimators = 100, random_state = 10)
train0_labels = np.array([x < 4 for x in train_labels])
model.fit(train_set, train0_labels)
test0_labels = model.predict(test_set)

# Filter out the non-vulnerable households
test1_set = test_set[test0_labels]
test1_ids = [test_ids[i] for i in range(len(test_ids)) if test0_labels[i]]
train1_set = train_set[train0_labels]
train1_labels = train_labels[train0_labels]

# Phase 2: Distinguish between classes 1-3
model = lgb.LGBMClassifier(boosting_type = 'dart', colsample_bytree = 0.88, learning_rate = 0.028,
                          min_child_samples = 10, num_leaves = 36, reg_alpha = 0.76, reg_lambda = 0.43,
                          subsample_for_bin = 40000, subsample = 0.54, class_weight = 'balanced',
                          objective = 'multiclass', n_estimators = 100, random_state = 10)
model.fit(train1_set, train1_labels)
labels = model.predict(test1_set)
labels = pd.DataFrame({'idhogar': test1_ids, 'Target': labels})

# Everything not labeled by Phase 2 either has no parentesco1 == 1 or belongs to class 4
submission = submissions_base.merge(labels, how = 'left', on = 'idhogar').drop(columns = 'idhogar')
submission['Target'] = submission['Target'].fillna(4).astype(np.int8)
submission.to_csv('lgb_submission.csv', index = False)