# Costa Rica
This approach trains an ensemble of neural networks and is adapted from [Lesson 4](http://course.fast.ai/lessons/lesson4.html) of the Fast.Ai Deep Learning for coders course.

Categorical features are embedded rather than 1-hot encoded (https://arxiv.org/abs/1604.06737)

Since predictions are only scored on the heads of households, the data for the other household members is not thrown away. Instead, it is used to pretrain the base model for the final classifier, mainly to learn better-than-random weights for the embedding matrices.

NB: the pretraining does not use any external data, nor is a pretrained model used.

Training of the network uses Smith's 1-cycle policy (https://arxiv.org/abs/1803.09820) and cosine annealed learning rates (https://arxiv.org/abs/1608.03983).
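As a rough illustration of the shape such a schedule can take, here is a minimal sketch: a linear warm-up to the peak learning rate followed by cosine annealing back down. The function name, its parameters, and the default values are hypothetical, and this is not the fastai implementation used later in the notebook.

```python
import math

def one_cycle_lr(step, total_steps, lr_max, div=5.0, warmup_frac=0.125):
    """Illustrative schedule: linear warm-up to lr_max, then cosine decay.

    All names and defaults here are hypothetical; this is not fastai's code.
    """
    lr_min = lr_max / div
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from lr_min up towards lr_max
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    # Cosine annealing from lr_max back down towards lr_min
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Learning rate for each of 80 hypothetical training iterations
schedule = [one_cycle_lr(s, 80, lr_max=2e-3) for s in range(80)]
```

The peak is reached at the end of the warm-up, after which the rate only decreases.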

The training data is also partially duplicated to balance the class ratios. I don't know whether the class ratios in the test data match those in the training data, so this should hopefully remove some bias.
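A minimal sketch of this kind of duplication on toy labels, assuming class 4 is the majority class as in this dataset. The helper name `oversample_indices` is hypothetical. Each minority class receives a whole number of extra copies, so the result is only approximately balanced, and a minority class can end up slightly larger than the majority.

```python
import numpy as np

def oversample_indices(labels, majority_class):
    """Return row indices with minority-class rows repeated (hypothetical helper)."""
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    n_major = np.sum(labels == majority_class)
    out = [idx]  # every original row is kept once
    for c in np.unique(labels):
        if c == majority_class:
            continue
        rows = idx[labels == c]
        n_copies = n_major // len(rows)  # whole extra copies of this class
        out.extend([rows] * n_copies)
    return np.concatenate(out)

labels = np.array([4, 4, 4, 4, 4, 4, 1, 2, 2, 3, 3, 3])
balanced = labels[oversample_indices(labels, majority_class=4)]
counts = {int(c): int(np.sum(balanced == c)) for c in np.unique(balanced)}
```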

Some parts of the data correction and the household-feature engineering were borrowed from Gaxx's [Exploratory data analysis + LightGBM](https://www.kaggle.com/gaxxxx/exploratory-data-analysis-lightgbm) kernel. This is my first kernel-based Kaggle competition, so please comment if I should be doing more than just upvoting and referencing it.

Since the approach here requires a more up to date version of the fastai library, we need to download it (I couldn't find a way to update the installed version)

### If running on Kaggle

In [None]:
!git clone https://github.com/fastai/fastai.git
!mv fastai/fastai/ fastai/Fastai
import sys
sys.path.append('./fastai/')

In [None]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [None]:
from Fastai.structured import *
from Fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)
import warnings
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)
from sklearn.model_selection import train_test_split, StratifiedKFold

### If running locally

from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)
import warnings
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)
from sklearn.model_selection import train_test_split, StratifiedKFold

## Data import

### If running on Kaggle

In [None]:
!cp ../input/*csv .
PATH='./'

### If running locally

PATH='/home/giles/Downloads/fastai_data/costa-rica/'

In [None]:
train = pd.read_csv(f'{PATH}train.csv')
test = pd.read_csv(f'{PATH}test.csv')

In [None]:
len(train),len(test)

###  Fix columns

#### Outlier correction
https://www.kaggle.com/c/costa-rican-household-poverty-prediction/discussion/61403

In [None]:
test.loc[test['rez_esc'] == 99.0 , 'rez_esc'] = 5

#### Drop unneeded columns

In [None]:
train.drop(columns=[x for x in train.columns if 'SQB' in x or x == 'agesq'], inplace=True)
test.drop(columns=[x for x in test.columns if 'SQB' in x or x == 'agesq'], inplace=True)

####  Inf check

In [None]:
train.replace([np.inf, -np.inf], np.nan, inplace=True)
test.replace([np.inf, -np.inf], np.nan, inplace=True)

#### NaN correction

In [None]:
train.columns[train.isna().any()].tolist(), test.columns[test.isna().any()].tolist()

In [None]:
#Fill na (from https://www.kaggle.com/gaxxxx/exploratory-data-analysis-lightgbm)
def replace_v18q1(x):
    # v18q == 0 means no tablet is owned, so the tablet count (v18q1) is 0
    if x['v18q'] == 0:
        return x['v18q']
    else:
        return x['v18q1']

train['v18q1'] = train.apply(replace_v18q1, axis=1)
test['v18q1'] = test.apply(replace_v18q1, axis=1)

train['v2a1'] = train['v2a1'].fillna(value=train['tipovivi3'])
test['v2a1'] = test['v2a1'].fillna(value=test['tipovivi3'])

In [None]:
# Fill missing rez_esc values with 0 (filling from rez_esc itself, not v18q1)
train['rez_esc'] = train.rez_esc.fillna(0).astype(np.int32)
test['rez_esc'] = test.rez_esc.fillna(0).astype(np.int32)

In [None]:
# Fill missing meaneduc values with 0 (filling from meaneduc itself, not v18q1)
train['meaneduc'] = train.meaneduc.fillna(0).astype(np.float32)
test['meaneduc'] = test.meaneduc.fillna(0).astype(np.float32)

In [None]:
train.columns[train.isna().any()].tolist(), test.columns[test.isna().any()].tolist()

#### Fix categoricals
From https://www.kaggle.com/gaxxxx/exploratory-data-analysis-lightgbm

In [None]:
train['roof_waste_material'] = np.nan
test['roof_waste_material'] = np.nan
train['electricity_other'] = np.nan
test['electricity_other'] = np.nan

def fill_roof_exception(x):
    if (x['techozinc'] == 0) and (x['techoentrepiso'] == 0) and (x['techocane'] == 0) and (x['techootro'] == 0):
        return 1
    else:
        return 0
    
def fill_no_electricity(x):
    if (x['public'] == 0) and (x['planpri'] == 0) and (x['noelec'] == 0) and (x['coopele'] == 0):
        return 1
    else:
        return 0

train['roof_waste_material'] = train.apply(lambda x : fill_roof_exception(x),axis=1)
test['roof_waste_material'] = test.apply(lambda x : fill_roof_exception(x),axis=1)
train['electricity_other'] = train.apply(lambda x : fill_no_electricity(x),axis=1)
test['electricity_other'] = test.apply(lambda x : fill_no_electricity(x),axis=1)

## Create features
Examine existing features and check cardinality

In [None]:
train.head().T.head(142)

In [None]:
for c in train.columns:
    print(c, len(set(train[c])))

We don't want to train on these features

In [None]:
ignore = [x for x in train.columns if x == 'Target' or x == 'idhogar'] + ['edjefe', 'edjefa', 'Id']

### Household 
Since the final training data will only contain the heads of households, we want to engineer features which capture the information of the other members of each household
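A minimal sketch of the idea on toy data: aggregate per-person columns within each household, then merge the statistics back so that every member (including the head) carries them. The rows and the choice of statistics are made up for illustration; only the column names `idhogar`, `age`, and `escolari` come from the dataset.

```python
import pandas as pd

people = pd.DataFrame({
    'idhogar':  ['a', 'a', 'a', 'b', 'b'],
    'age':      [40, 38, 10, 65, 30],
    'escolari': [11, 9, 4, 6, 12],
})

# One row per household, one column per (feature, statistic) pair
house = people.groupby('idhogar')[['age', 'escolari']].agg(['mean', 'min', 'max'])
house.columns = [f'{feat}_{stat}' for feat, stat in house.columns]

# Broadcast the household statistics back onto every member
people = people.merge(house.reset_index(), on='idhogar')
```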

In [None]:
train[train.idhogar == 'fd8a6d014'].T.head(142)

I'd initially tried my own approach, but Gaxx's turned out to be better and faster

In [None]:
#from https://www.kaggle.com/gaxxxx/exploratory-data-analysis-lightgbm
train['escolari_age'] = train['escolari']/train['age']
test['escolari_age'] = test['escolari']/test['age']

In [None]:
#from https://www.kaggle.com/gaxxxx/exploratory-data-analysis-lightgbm
df_train = pd.DataFrame()
df_test = pd.DataFrame()

aggr_mean_list = ['rez_esc', 'dis', 'male', 'female',
                  'estadocivil1', 'estadocivil2', 'estadocivil3', 'estadocivil4',
                  'estadocivil5', 'estadocivil6', 'estadocivil7',
                  'parentesco3', 'parentesco4', 'parentesco5', 'parentesco6', 'parentesco7',
                  'parentesco8', 'parentesco9', 'parentesco10', 'parentesco11', 'parentesco12',
                  'instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5',
                  'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9']

other_list = ['escolari', 'age', 'escolari_age']

for item in aggr_mean_list:
    group_train_mean = train[item].groupby(train['idhogar']).mean()
    group_test_mean = test[item].groupby(test['idhogar']).mean()
    new_col = item + '_aggr_mean'
    df_train[new_col] = group_train_mean
    df_test[new_col] = group_test_mean

for item in other_list:
    for function in ['mean','std','min','max','sum']:
        group_train = train[item].groupby(train['idhogar']).agg(function)
        group_test = test[item].groupby(test['idhogar']).agg(function)
        new_col = item + '_' + function
        df_train[new_col] = group_train
        df_test[new_col] = group_test

In [None]:
test.head()

In [None]:
#from https://www.kaggle.com/gaxxxx/exploratory-data-analysis-lightgbm
df_test = df_test.reset_index()
df_train = df_train.reset_index()

train = pd.merge(train, df_train, on='idhogar')
test = pd.merge(test, df_test, on='idhogar')

#fill all na as 0
train.fillna(value=0, inplace=True)
test.fillna(value=0, inplace=True)

Every row in the training data now contains the household features. We no longer need the parentesco features, except for parentesco1 (is head of household). We don't want to train on parentesco1 either, since it will always be True in the final training data, so we add it to ignore

In [None]:
train.drop(columns=['parentesco' + str(i+2) for i in range(11)], inplace=True)
ignore.append('parentesco1')

In [None]:
test.head()

### Categorical Features
The categorical features are supplied 1-hot encoded. The method of creating embeddings expects single values, so we need to combine the encodings back into single features
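A minimal sketch of the combination on toy data, with hypothetical column names. Each one-hot column is weighted by its 1-based position and the row is summed, so a row with exactly one flag set maps to that position and a row with no flag set maps to 0. If more than one flag were set in a row, the codes would no longer be unique.

```python
import numpy as np
import pandas as pd

onehot = pd.DataFrame({
    'wall_brick': [1, 0, 0, 0],
    'wall_wood':  [0, 1, 0, 0],
    'wall_zinc':  [0, 0, 1, 0],
})  # hypothetical columns; the last row has no flag set

# Weight column i by (i + 1) and sum across each row
weights = np.arange(1, onehot.shape[1] + 1)
codes = (onehot.to_numpy() * weights).sum(axis=1)
```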

In [None]:
def toCategorical(inData, columns, name):
    inData[name] = 0  # start every row at 0; rows with no flag set stay 0
    for i, c in enumerate(columns):
        inData.loc[:, name] += (i+1)*inData.loc[:, c]
    inData.drop(columns=columns, inplace=True)

In [None]:
wall_mat = ['paredblolad', 'paredzocalo', 'paredpreb', 'pareddes', 'paredmad', 'paredzinc', 'paredfibras', 'paredother']
toCategorical(train, wall_mat, 'wall_mat')
toCategorical(test, wall_mat, 'wall_mat')

In [None]:
floor_mat = ['pisomoscer', 'pisocemento', 'pisoother', 'pisonatur', 'pisonotiene', 'pisomadera']
toCategorical(train, floor_mat, 'floor_mat')
toCategorical(test, floor_mat, 'floor_mat')

In [None]:
roof_mat = ['techozinc', 'techoentrepiso', 'techocane', 'techootro', 'roof_waste_material']
toCategorical(train, roof_mat, 'roof_mat')
toCategorical(test, roof_mat, 'roof_mat')

In [None]:
water_prov = ['abastaguadentro', 'abastaguafuera', 'abastaguano']
toCategorical(train, water_prov, 'water_prov')
toCategorical(test, water_prov, 'water_prov')

In [None]:
elec_prov = ['public', 'planpri', 'noelec', 'coopele', 'electricity_other']
toCategorical(train, elec_prov, 'elec_prov')
toCategorical(test, elec_prov, 'elec_prov')

In [None]:
toilet = ['sanitario1', 'sanitario2', 'sanitario3', 'sanitario5', 'sanitario6']
toCategorical(train, toilet, 'toilet')
toCategorical(test, toilet, 'toilet')

In [None]:
cooking = ['energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4']
toCategorical(train, cooking, 'cooking')
toCategorical(test, cooking, 'cooking')

In [None]:
rubbish = ['elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 'elimbasu5', 'elimbasu6']
toCategorical(train, rubbish, 'rubbish')
toCategorical(test, rubbish, 'rubbish')

In [None]:
wall_quality = ['epared1', 'epared2', 'epared3']
toCategorical(train, wall_quality, 'wall_quality')
toCategorical(test, wall_quality, 'wall_quality')

In [None]:
roof_quality = ['etecho1', 'etecho2', 'etecho3']
toCategorical(train, roof_quality, 'roof_quality')
toCategorical(test, roof_quality, 'roof_quality')

In [None]:
floor_quality = ['eviv1', 'eviv2', 'eviv3']
toCategorical(train, floor_quality, 'floor_quality')
toCategorical(test, floor_quality, 'floor_quality')

In [None]:
gender = ['male', 'female']
toCategorical(train, gender, 'gender')
toCategorical(test, gender, 'gender')

In [None]:
civil_status = ['estadocivil1', 'estadocivil2', 'estadocivil3', 'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7']
toCategorical(train, civil_status, 'civil_status')
toCategorical(test, civil_status, 'civil_status')

In [None]:
education = ['instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5',
             'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9']
toCategorical(train, education, 'education')
toCategorical(test, education, 'education')

In [None]:
house_ownership = ['tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5']
toCategorical(train, house_ownership, 'house_ownership')
toCategorical(test, house_ownership, 'house_ownership')

In [None]:
region = ['lugar1', 'lugar2', 'lugar3', 'lugar4', 'lugar5', 'lugar6']
toCategorical(train, region, 'region')
toCategorical(test, region, 'region')

In [None]:
area = ['area1', 'area2']
toCategorical(train, area, 'area')
toCategorical(test, area, 'area')

All done, let's check the cardinality of all features and separate out the categorical ones

In [None]:
for c in [x for x in train.columns if x not in ignore]:
    print(c, len(set(train[c])))

In [None]:
cat_vars = [
'hacdor',
'hacapo',
'v14a',
'refrig',
'v18q',
'cielorazo',
'dis',
'computer',
'television',
'mobilephone',
'wall_mat',
'floor_mat',
'roof_mat',
'water_prov',
'elec_prov',
'toilet',
'cooking',
'rubbish',
'wall_quality',
'roof_quality',
'floor_quality',
'gender',
'civil_status',
'education',
'house_ownership',
'region',
'area']

### Continuous features
Only a few alterations are needed for the continuous features

In [None]:
train.replace({'dependency': {'no': 0, 'yes': 1}}, inplace=True)
test.replace({'dependency': {'no': 0, 'yes': 1}}, inplace=True)

In [None]:
contin_vars = [x for x in train.columns if x not in ignore and x not in cat_vars]
for c in [x for x in contin_vars]:
    print(c, len(set(train[c])))

# Data preparation
Now we'll slim the data to only the necessary features, and reset the test ID which was lost during the creation of the household features

In [None]:
dep = 'Target'

In [None]:
test.index=test['Id']
test[dep] = 0
len(test)

For the categorical features, we change their type in the dataframe to categorical

In [None]:
for v in cat_vars: 
    train[v] = train[v].astype('category').cat.as_ordered()
    
apply_cats(test, train)

for v in contin_vars:
    train[v] = train[v].fillna(0).astype('float32')
    test[v] = test[v].fillna(0).astype('float32')

train.head(2)

We'll now split out the training data (only heads of households) and keep the remainder (other members of each household) for pretraining

In [None]:
pretrain = train[train.parentesco1 == 0].copy()
train = train[train.parentesco1 == 1].copy()
train.reset_index(inplace=True)
pretrain.reset_index(inplace=True)
n = len(train)
print(f'Pretraining on {len(pretrain)}, final training on {n} points')

pretrain.drop(columns=['parentesco1'], inplace=True)
train.drop(columns=['parentesco1'], inplace=True)
test.drop(columns=['parentesco1'], inplace=True)

## Preprocessing
Using the training data we fit standardisation and normalisation transformations of the continuous features (ignoring categorical), and then apply these transformations to the train, pretrain, and test data
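The important point is that the statistics are fitted once on the training rows and then reused unchanged for the other splits. A minimal NumPy stand-in for that idea is below. It is not the `proc_df` call itself, and the arrays are toy values.

```python
import numpy as np

train_x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_x = np.array([[2.0, 40.0]])

# Fit: the mean and standard deviation come from the training rows only
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)

# Apply: the same shift and scale is reused for every other split
train_scaled = (train_x - mean) / std
test_scaled = (test_x - mean) / std
```

The test values are therefore not guaranteed to have zero mean or unit variance, which is what we want: they are placed on the scale the model saw during training.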

In [None]:
df, y, nas, mapper = proc_df(train[cat_vars+contin_vars+[dep]], 'Target', do_scale=True)

In [None]:
df_test, _, nas, mapper = proc_df(test[cat_vars+contin_vars+[dep]], 'Target', do_scale=True,
                                  mapper=mapper, na_dict=nas)

In [None]:
df_pre, yp, nas, mapper = proc_df(pretrain[cat_vars+contin_vars+[dep]], 'Target', do_scale=True,
                                  mapper=mapper, na_dict=nas)

For the categorical embeddings we set the output width to (c+1)//2, where c is the number of rows in the embedding matrix (the feature's cardinality plus one, as in `cat_sz`), or 50, whichever is lower. This reduces the number of input features whilst still allowing for a rich representation of the information they contain
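A small worked example of the sizing rule, following the arithmetic of the `cat_sz` and `emb_szs` expressions in the cell below. The helper name `embedding_width` and the example cardinalities are hypothetical.

```python
def embedding_width(n_levels, cap=50):
    """Embedding width for a feature with n_levels distinct categories."""
    n_rows = n_levels + 1              # one extra row, as in cat_sz
    return min(cap, (n_rows + 1) // 2)

# Widths for a few example cardinalities
widths = {n: embedding_width(n) for n in (2, 5, 9, 200)}
```

A binary feature gets a width of 2, and only very high-cardinality features hit the cap of 50.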

In [None]:
cat_sz = [(c, len(train[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]

In [None]:
#Just checking we've not missed anything
[x for x in df_test.columns if x not in df.columns] , [x for x in df.columns if x not in df_test.columns]

## Models and functions

In [None]:
def inv_y(a): return np.exp(a) #The model will output the log of predictions

In [None]:
def macro(y_pred, y_true): #Metric for comparison
    y_pred = np.argmax(inv_y(y_pred), axis=1) #We take the highest class prediction as the prediction
    f1 = sklearn.metrics.f1_score(y_true, y_pred, average='macro')
    return f1

In [None]:
#Simple model with one layer of 100 neurons, embeddings have dropout of 0.05, fully connected layers use DO=0.5
def getPreModel(md): 
    return md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                          0.05, 5, [100], [0.5]) 

In [None]:
def getModel(md):
    return md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                          0.05, 5, [100, 100], [0.5, 0.5]) 

### Functions to reoptimise training if model changed

trn_idx, val_idx = train_test_split(range(len(df)), test_size=0.1, stratify=y)
len(trn_idx), len(val_idx)

trn_Housholds = train['idhogar'].iloc[trn_idx]
val_Housholds = train['idhogar'].iloc[val_idx]

ptrn_idx = pretrain[pretrain['idhogar'].isin(trn_Housholds)].index
pval_idx = pretrain[pretrain['idhogar'].isin(val_Housholds)].index

len(ptrn_idx), len(pval_idx), (len(pretrain))

#Class balancing
tmpDF = df_pre.copy()
tmpY = yp.copy()
maxN = np.sum(np.equal(4, yp[ptrn_idx]))
for c in [1,2,3]:
    rows = pd.Series(np.equal(c, yp[ptrn_idx]), name='bools')
    n = np.sum(rows)
    nCopyMult = maxN//n
    for j in range(nCopyMult):
        tmpDF = tmpDF.append(df_pre.iloc[ptrn_idx][rows.values].copy(), ignore_index=True)
        tmpY = np.append(tmpY, yp[ptrn_idx][rows].copy())

plt.hist(tmpY)

pmd = ColumnarModelData.from_data_frame(PATH, pval_idx, tmpDF, tmpY.astype(int), cat_flds=cat_vars, bs=16,
                                        is_reg=False, is_multi=False)

m = getPreModel(pmd)

We use Smith's Learning rate range test to quickly find the optimal LR (https://arxiv.org/abs/1803.09820)
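The range test sweeps the learning rate upward over a short run while recording the loss, and a rate is then picked from the region where the loss is still falling. A minimal sketch of the sweep itself, assuming a geometric progression between two hypothetical bounds (this is not the `lr_find` implementation):

```python
def lr_range(lr_start, lr_end, n_steps):
    """Geometrically spaced learning rates from lr_start to lr_end (n_steps >= 2)."""
    ratio = (lr_end / lr_start) ** (1 / (n_steps - 1))
    return [lr_start * ratio ** i for i in range(n_steps)]

# Seven rates, one per hypothetical mini-batch, each 10x the previous one
lrs = lr_range(1e-5, 10.0, 7)
```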

m.lr_find()
m.sched.plot(n_skip_end=30)

Train the model using Smith's one-cycle policy

m.fit(2e-3, 1, wds=1e-3, metrics=[macro], cycle_len=15,use_clr=(5,8), best_save_name='pre')

tmpDF = df.copy()
tmpY = y.copy()
maxN = np.sum(np.equal(4, y[trn_idx]))
for c in [1,2,3]:
    rows = pd.Series(np.equal(c, y[trn_idx]), name='bools')
    n = np.sum(rows)
    nCopyMult = maxN//n
    for j in range(nCopyMult):
        tmpDF = tmpDF.append(df.iloc[trn_idx][rows.values].copy(), ignore_index=True)
        tmpY = np.append(tmpY, y[trn_idx][rows].copy())

plt.hist(tmpY)

md = ColumnarModelData.from_data_frame(PATH, val_idx, tmpDF, tmpY.astype(int), cat_flds=cat_vars, bs=16,
                                           test_df=df_test, is_reg=False, is_multi=False)

m = getModel(md)
m.model.load_state_dict(torch.load(m.get_model_path('pre')), strict=False)

m.summary()

m.freeze_to(2)

m.lr_find()
m.sched.plot(n_skip_end=30)

m.fit(8e-2,1,wds=1e-3,cycle_len=15,use_clr=(5,8), metrics=[macro], best_save_name='tmpbest')

m.load('tmpbest')
m.unfreeze()
m.bn_freeze(True)

m.lr_find()
m.sched.plot(n_skip_end=150)

lr = 8e-3
m.fit(np.array([lr/9,lr/3,lr])/5, 1, wds=1e-3, metrics=[macro], cycle_len=15,use_clr=(5,8), best_save_name='best')

## Stratified k-fold ensemble
We train the model 10 times using stratified cross-validation, balancing the classes, and pretraining the model
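To make "stratified" concrete, here is a hand-rolled sketch that deals the rows of each class out to the folds in turn, so every fold keeps roughly the overall class ratios. The notebook itself relies on scikit-learn's `StratifiedKFold`. The helper `stratified_folds` and the toy labels are hypothetical and only illustrate the idea.

```python
import numpy as np

def stratified_folds(labels, n_splits, seed=0):
    """Assign each row a fold id so every class is spread evenly across folds."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        rows = rng.permutation(np.flatnonzero(labels == c))
        # Deal the shuffled rows of this class out to the folds in turn
        fold_of[rows] = np.arange(len(rows)) % n_splits
    return fold_of

labels = np.array([1] * 10 + [2] * 20 + [4] * 50)
fold_of = stratified_folds(labels, n_splits=5)
val_idx = np.flatnonzero(fold_of == 0)  # fold 0 held out for validation
trn_idx = np.flatnonzero(fold_of != 0)  # remaining folds used for training
val_counts = {int(c): int(np.sum(labels[val_idx] == c)) for c in np.unique(labels)}
```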

In [None]:
def preTrainModel(trn_idx, val_idx):
    #Get indices of pretrain data
    trn_Housholds = train['idhogar'].iloc[trn_idx]
    val_Housholds = train['idhogar'].iloc[val_idx]
    ptrn_idx = pretrain[pretrain['idhogar'].isin(trn_Housholds)].index
    pval_idx = pretrain[pretrain['idhogar'].isin(val_Housholds)].index
    
    #Class balancing
    tmpDF = df_pre.copy()
    tmpY = yp.copy()
    maxN = np.sum(np.equal(4, yp[ptrn_idx]))
    for c in [1,2,3]:
        rows = pd.Series(np.equal(c, yp[ptrn_idx]), name='bools')
        n = np.sum(rows)
        nCopyMult = maxN//n
        for j in range(nCopyMult):
            tmpDF = tmpDF.append(df_pre.iloc[ptrn_idx][rows.values].copy(), ignore_index=True)
            tmpY = np.append(tmpY, yp[ptrn_idx][rows].copy())
    
    #Load data
    pmd = ColumnarModelData.from_data_frame(PATH, pval_idx, tmpDF, tmpY.astype(int), cat_flds=cat_vars, bs=16,
                                            is_reg=False, is_multi=False)
    
    #Create pre model and train with 1-cycle
    print('Pretraining model')
    m = getPreModel(pmd)
    m.fit(2e-3, 1, wds=1e-3, metrics=[macro], cycle_len=15,use_clr=(5,8), best_save_name='pre')

In [None]:
def trainModel(trn_idx, val_idx):
    #Balance training classes
    tmpDF = df.copy()
    tmpY = y.copy()
    maxN = np.sum(np.equal(4, y[trn_idx]))
    for c in [1,2,3]:
        rows = pd.Series(np.equal(c, y[trn_idx]), name='bools')
        n = np.sum(rows)
        nCopyMult = maxN//n
        for j in range(nCopyMult):
            tmpDF = tmpDF.append(df.iloc[trn_idx][rows.values].copy(), ignore_index=True)
            tmpY = np.append(tmpY, y[trn_idx][rows].copy())
            
    #Load data
    md = ColumnarModelData.from_data_frame(PATH, val_idx, tmpDF, tmpY.astype(int), cat_flds=cat_vars, bs=16,
                                           test_df=df_test, is_reg=False, is_multi=False)
    
    #Create new model and initialise with pretrained model
    m = getModel(md)
    m.model.load_state_dict(torch.load(m.get_model_path('pre')), strict=False)
    
    #Freeze all but last layer, to avoid destroying the pretrained weights, train with 1-cycle
    m.freeze_to(2)
    print('Training last layer')
    m.fit(8e-2,1,wds=1e-3,cycle_len=15,use_clr=(5,8), metrics=[macro], best_save_name='tmpbest')
    
    #Load best, unfreeze all layers for final training
    m.load('tmpbest')
    m.unfreeze()
    m.bn_freeze(True)
    
    #Final training, use differential learning rates, and train via 1-cycle
    lr = 8e-3
    print('Final training')
    m.fit(np.array([lr/9,lr/3,lr])/5, 1, wds=1e-3, metrics=[macro], cycle_len=15,use_clr=(5,8), best_save_name='best')
    m.load('best')
    
    return m

In [None]:
%%time
nSplits = 10
skf = StratifiedKFold(n_splits=nSplits, shuffle=True, random_state=1234)
folds = skf.split(df, y)

pred_test = []
valScore = 0
for i, (trn_idx, val_idx) in enumerate(folds):
    print('________________________')
    print('Running fold', i)
    
    preTrainModel(trn_idx, val_idx)
    
    m = trainModel(trn_idx, val_idx)
    
    #Test on val
    score = macro(*m.predict_with_targs())
    valScore += score
    print('Fold', i, 'score:', score)
    
    #Predict test and append for averaging
    pred_test.append(m.predict(True))
    print('________________________\n')

In [None]:
print("\nCV finished, mean validation score:", valScore/nSplits)

## Save test predictions
Take the average prediction of the 10 classifiers and save it to the submission CSV
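A minimal sketch of the averaging step on made-up numbers, assuming each model returns log-probabilities with shape (rows, classes). Averaging in log space and exponentiating gives an unnormalised geometric mean of the models' probabilities. The `exp` does not change which class wins, since it is monotonic.

```python
import numpy as np

# Hypothetical log-probabilities from 3 models, for 2 rows and 3 classes
log_preds = np.log(np.array([
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]],
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.5, 0.4, 0.1], [0.5, 0.3, 0.2]],
]))

mean_log = log_preds.mean(axis=0)              # average over the model axis
classes = np.argmax(np.exp(mean_log), axis=1)  # most likely class per row
```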

In [None]:
testClassPred = np.argmax(inv_y(np.mean(pred_test, axis=0)), axis=1)
testClassPred

In [None]:
test['Target']=testClassPred

In [None]:
csv_fn=f'{PATH}sub.csv'

In [None]:
test.head()

In [None]:
test[['Target']].to_csv(csv_fn, index=True)

In [None]:
len(test)

### If running on Kaggle
Delete fastai library, else Kaggle complains about output directory depth.

In [None]:
!rm -rf fastai