https://www.kaggle.com/c/costa-rican-household-poverty-prediction

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
%xmode Plain

from copy import deepcopy

## Basic data exploration
For visualization, refer to the other published kernels for this challenge.

In [None]:
train_data = pd.read_csv('/kaggle/input/costa-rican-household-poverty-prediction/train.csv')
test_data = pd.read_csv('/kaggle/input/costa-rican-household-poverty-prediction/test.csv')

train = deepcopy(train_data)
test = deepcopy(test_data)

In [None]:
print(train.info())
print(test.info())

In [None]:
train_data.head()

```
Target - the target is an ordinal variable indicating groups of income levels.
1 = extreme poverty
2 = moderate poverty
3 = vulnerable households
4 = non vulnerable households
```

In [None]:
train_data.Target.hist()

## Data cleaning
1. Clean and unify data types
2. Fill in missing values and delete duplicates
3. Distinguish categorical and numerical features; normalize and encode them if necessary


### Cleaning data types

**Most dtypes are floats and integers; check the columns of dtype 'object':**

- the first two columns (`Id`, `idhogar`) are the unique row id and the household id
- dependency: dependency rate, calculated as (number of household members younger than 19 or older than 64) / (number of household members between 19 and 64)
- edjefe: years of education of the male head of household; the column also contains the labels yes (= 1) and no (= 0)
- edjefa: years of education of the female head of household; the column also contains the labels yes (= 1) and no (= 0)

**The education of the female and the male head is stored in two separate features; later we combine them into one new feature that ignores gender.**
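As a minimal sketch of that conversion, the cell below builds a small toy frame (the values are invented, not taken from `train.csv`), maps the yes/no labels to 1/0, parses the remaining numeric strings, and sums the two head-of-household columns. The name `mixed_cols` is only illustrative.

```python
import pandas as pd

# Toy rows imitating the three object columns; the values are made up.
toy = pd.DataFrame({
    'dependency': ['yes', 'no', '.5', '8'],
    'edjefe': ['no', '11', 'yes', 'no'],
    'edjefa': ['6', 'no', 'no', 'no'],
})

mixed_cols = ['dependency', 'edjefe', 'edjefa']  # illustrative name
for col in mixed_cols:
    # Turn the two labels into numeric strings, then parse the whole column.
    toy[col] = pd.to_numeric(toy[col].replace({'yes': '1', 'no': '0'}))

# Combined education of the household head, regardless of gender.
toy['ed_all'] = toy['edjefe'] + toy['edjefa']

print(toy.dtypes)
print(toy)
```

In this toy data at most one of `edjefe` and `edjefa` is non-zero per row, so the sum simply picks whichever one is set.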

**As the poverty level is defined per household, we will reorganize the dataframe by household id.**

### Fill missing values (no duplicates)

Five features have missing values, and three of them are almost 70% missing. Let's check what these features mean:
- rez_esc, years behind in school
- v18q1, number of tablets the household owns
- v2a1, monthly rent payment
- meaneduc, average years of education for adults (18+)
- SQBmeaned, square of the mean years of education of adults (>=18) in the household

According to the value distributions and the feature meanings, NaN most likely means:
* rez_esc, years behind in school: 0 years
* v18q1, number of tablets the household owns: 0 tablets
* v2a1, monthly rent payment: no payment (0)

Rows with a missing mean education all belong to non-vulnerable households. As these are numerical features, we fill them with a central value of the column instead of 0; the `preprocess` function below uses the training-set mean:
* meaneduc, average years of education for adults (18+): mean
* SQBmeaned, square of the mean years of education of adults (>=18) in the household: mean
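The fill rules above can be sketched on a toy frame (invented values). Naming each column explicitly in the dictionary passed to `fillna` keeps every fill value paired with the intended column. The mean is used here; `.median()` could be swapped in the same way.

```python
import numpy as np
import pandas as pd

# Toy frame with the five columns that contain missing values (made-up numbers).
toy = pd.DataFrame({
    'rez_esc': [np.nan, 1.0, np.nan],
    'v18q1': [np.nan, 2.0, np.nan],
    'v2a1': [np.nan, 100000.0, 50000.0],
    'meaneduc': [6.0, np.nan, 10.0],
})
toy['SQBmeaned'] = toy['meaneduc'] ** 2

# One fill value per column: 0 where NaN means "none", a central value otherwise.
fill_values = {
    'rez_esc': 0,
    'v18q1': 0,
    'v2a1': 0,
    'meaneduc': toy['meaneduc'].mean(),
    'SQBmeaned': toy['SQBmeaned'].mean(),
}
filled = toy.fillna(value=fill_values)

print(filled)
```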

### Create new feature:
ed_all: years of education of the head of household, regardless of gender

## Outlier in test dataset


In [None]:
# Outlier in the test set: rez_esc is 99.0 in some rows; replace it with 5
test.loc[test['rez_esc'] == 99.0 , 'rez_esc'] = 5

## Correct target value mismatch

It turns out that members of the same household can have different target values.

In [None]:
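data = train  # alias: changes made through `data` also modify `train`
d = {}        # first Target seen for each household id
weird = []    # household ids with conflicting Target values (ids may repeat)
for _, row in data.iterrows():
    idhogar = row['idhogar']
    target = row['Target']
    if idhogar in d:
        if d[idhogar] != target:
            weird.append(idhogar)
    else:
        d[idhogar] = target

# number of distinct households with conflicting labels
len(set(weird))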

In [None]:
# Examine one of the mismatched households
data[data['idhogar']==weird[1]][['idhogar','parentesco1', 'Target']]

We were told that the correct target value is the one belonging to the head of the household, so we overwrite the other members' values with it.

In [None]:
for i in set(weird):
    hhold = data[data['idhogar'] == i][['idhogar', 'parentesco1', 'Target']]
    # Target of the household head (parentesco1 == 1)
    target = hhold[hhold['parentesco1'] == 1]['Target'].tolist()[0]
    # Overwrite the Target of every non-head member
    for idx, row in hhold.iterrows():
        if row['parentesco1'] != 1:
            data.at[idx, 'Target'] = target

In [None]:
data[data['idhogar']==weird[1]][['idhogar','parentesco1', 'Target']]
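The same correction can also be written without explicit loops. This is only a sketch on a toy frame (invented rows), and it assumes each household has at most one head; households without a head keep their original labels.

```python
import pandas as pd

# Toy data: household 'a' has one member whose label differs from the head's.
toy = pd.DataFrame({
    'idhogar': ['a', 'a', 'a', 'b', 'b'],
    'parentesco1': [1, 0, 0, 0, 1],
    'Target': [4, 3, 4, 2, 2],
})

# Households whose members do not all share one label.
n_labels = toy.groupby('idhogar')['Target'].nunique()
mismatched = n_labels[n_labels > 1].index.tolist()

# Label of each household head, indexed by household id.
head_target = toy.loc[toy['parentesco1'] == 1].set_index('idhogar')['Target']

# Give every member the head's label; fall back to the original label if no head exists.
toy['Target'] = toy['idhogar'].map(head_target).fillna(toy['Target']).astype(int)

print(mismatched)
print(toy)
```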

### Feature engineering
reference:
https://www.kaggle.com/gaxxxx/exploratory-data-analysis-lightgbm#Feature-Engineering

It turns out the original data is missing one indicator column for both the roof material and the electricity source, so we add these two features manually.

In [None]:
train_set = train
test_set = test

def fill_roof_exception(x):
    # 1 when none of the four roof-material indicators is set, else 0
    if (x['techozinc'] == 0) and (x['techoentrepiso'] == 0) and (x['techocane'] == 0) and (x['techootro'] == 0):
        return 1
    else:
        return 0

def fill_no_electricity(x):
    # 1 when none of the four electricity-source indicators is set, else 0
    if (x['public'] == 0) and (x['planpri'] == 0) and (x['noelec'] == 0) and (x['coopele'] == 0):
        return 1
    else:
        return 0

# Row-wise apply: one new indicator column per missing category
train_set['roof_waste_material'] = train_set.apply(fill_roof_exception, axis=1)
test_set['roof_waste_material'] = test_set.apply(fill_roof_exception, axis=1)
train_set['electricity_other'] = train_set.apply(fill_no_electricity, axis=1)
test_set['electricity_other'] = test_set.apply(fill_no_electricity, axis=1)
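Row-wise `apply` can be slow on larger frames. A vectorized equivalent is sketched below on toy data (invented rows); the helper name `none_set` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy roof indicators: only the second row has no roof material flagged.
toy = pd.DataFrame({
    'techozinc':      [1, 0, 0],
    'techoentrepiso': [0, 0, 0],
    'techocane':      [0, 0, 1],
    'techootro':      [0, 0, 0],
})

def none_set(df, cols):
    """Return 1 for rows where every indicator in `cols` equals 0, else 0."""
    return (df[cols] == 0).all(axis=1).astype(int)

roof_cols = ['techozinc', 'techoentrepiso', 'techocane', 'techootro']
toy['roof_waste_material'] = none_set(toy, roof_cols)

print(toy)
```

The same helper would work for the electricity indicators by passing `['public', 'planpri', 'noelec', 'coopele']`.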

In [None]:
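def owner_is_adult(x):
    # 1 if the person is older than 18, else 0
    if x['age'] <= 18:
        return 0
    else:
        return 1

# Note: despite its name, 'head_less_18' is 1 for people older than 18,
# and it is computed for every individual, not only for the household head.
train_set['head_less_18'] = train_set.apply(owner_is_adult, axis=1)
test_set['head_less_18'] = test_set.apply(owner_is_adult, axis=1)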

## Merge Preprocessing 

In [None]:
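# The five columns with missing values, listed explicitly so that each fill value
# below is paired with the intended column (sorting by missing count leaves the
# order of tied columns undefined).
missing_col = ['rez_esc', 'v18q1', 'v2a1', 'meaneduc', 'SQBmeaned']

def preprocess(df):
    # Note: df is modified in place and also returned.

    # replace yes and no with 1 and 0
    df.replace({'yes':1, 'no':0}, inplace=True)
    # modify data types
    df['dependency'] = df['dependency'].astype('float64')
    df[['edjefe','edjefa']] = df[['edjefe','edjefa']].astype('int64')
    # fill in missing values: 0 for the first three columns, training-set mean for the last two
    values = dict(zip(missing_col,[0,0,0,train['meaneduc'].mean(),train['SQBmeaned'].mean()]))
    df.fillna(value=values, inplace=True)
    # new feature: education of the household head, regardless of gender
    df['ed_all'] = df['edjefe']+df['edjefa']
    return df

train_pp = preprocess(train)
# target = train['Target']
test_pp = preprocess(test)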


In [None]:
print(train_pp.info())
print(test_pp.info())

In [None]:
train_pp.isnull().sum().sort_values(ascending=False)

### Features Selection

In [None]:
discard_feat = ['Id','idhogar','elimbasu5','Target']

In [None]:
values = train_pp.select_dtypes(['float64', 'int64'])
# Use {'Target'}: set('Target') would build a set of single characters instead
numerical_feat = list(set(values.columns[values.nunique()>2])-{'Target'})
categorical_feat = list(values.columns[values.nunique()<=2])

## Modelling

In [None]:
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

In [None]:
features = list(set(categorical_feat+numerical_feat)-set(discard_feat))

def train_val(model, data=train_pp,feat=features):
    X_train, X_valid, y_train, y_valid = train_test_split(data[feat],train['Target'],\
                                                          test_size=0.2, random_state=1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    macro_f1 = f1_score(y_valid, y_pred, average='macro')
    print('Macro f1 score is {} using model {}'.format(macro_f1, model))
    return model
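For reference, the macro F1 score used in `train_val` is the unweighted mean of the per-class F1 scores, so small classes count as much as large ones. The sketch below checks this on invented labels.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for the four poverty classes.
y_true = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4, 4, 4]

# One F1 value per class, in sorted label order (1, 2, 3, 4).
per_class = f1_score(y_true, y_pred, average=None)
# Macro average: plain mean of the per-class values.
macro = f1_score(y_true, y_pred, average='macro')

print(per_class)
print(macro, per_class.mean())
```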


In [None]:
import warnings 
from sklearn.exceptions import ConvergenceWarning

# Filter out warnings from models
warnings.filterwarnings('ignore', category = ConvergenceWarning)

# lsvc = LinearSVC()
# lsvc = train_val(lsvc)

# svc = SVC()
# svc = train_val(svc)

# gnb = GaussianNB()
# gnb = train_val(gnb)

# mlp = MLPClassifier()
# mlp = train_val(mlp)

# lg = LogisticRegression()
# lg = train_val(lg)

# ridge = RidgeClassifier()
# ridge = train_val(ridge)

# knn = KNeighborsClassifier()
# knn = train_val(knn)

rdf = RandomForestClassifier()
rdf = train_val(rdf)

xgb = XGBClassifier()
xgb = train_val(xgb)

### Model performance comparison
Naive Bayes performs worst; the better models are KNN, MLP, and the tree-based ensembles (random forest and XGB).

Next, let's use XGB to predict on the test data.

In [None]:
def submission(model, data=test_pp, feat=features):
    preds = model.predict(data[feat]) # select columns with feat so the feature order matches training

    sub = pd.DataFrame()
    sub['Id'] = test['Id']
    sub['Target'] = preds
    sub.to_csv('submission.csv', index=False) 

# submission(xgb)
# this submission obtained a score of 0.388

In [None]:
clf = XGBClassifier(learning_rate=0.1)
clf = train_val(clf)
submission(clf)

## Feature selection
- correlation: pearson/spearman
- pca
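Both ideas can be sketched on synthetic data. This is an illustration under assumptions, not what the notebook ran: the 0.95 thresholds and the helper name `drop_correlated` are arbitrary choices, and the toy columns are random numbers.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'a': x,
    'a_cubed': x ** 3,           # monotone in 'a', so rank correlation is 1
    'b': rng.normal(size=200),
    'c': rng.normal(size=200),
})

def drop_correlated(df, method='spearman', threshold=0.95):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr(method=method).abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

reduced, dropped = drop_correlated(toy)

# PCA on standardized features, keeping enough components for 95% of the variance.
scaled = StandardScaler().fit_transform(reduced)
pca = PCA(n_components=0.95, svd_solver='full')
components = pca.fit_transform(scaled)

print(dropped)
print(components.shape, pca.explained_variance_ratio_)
```

Spearman correlation catches monotone but non-linear relations that Pearson can understate, which is why it is used as the default here.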

## Hyper parameter tuning


https://www.kaggle.com/katacs/data-cleaning-and-random-forest

```
Params:
lr: 0.01~0.05
max_depth: 5~9 (feat importance num)
reg_alpha 
reg_gamma 
max_delta_step
min_child_weight: 5~11 (high depth, low weight)
colsample_bynode, bytree: 0.7~0.9 (lower faster)
subsample: 0.8~0.9
random seed
```


In [None]:
# from sklearn.model_selection import GridSearchCV

# X_train, X_valid, y_train, y_valid = train_test_split(data[features],train['Target'],\
#                                                           test_size=0.2, random_state=1)
# clf = XGBClassifier()
# parameters =    {
#             'max_depth': [5, 6],
#         'learning_rate': [0.01, 0.05, 0.1],
#             'n_estimators': [50, 100, 150],
#             'gamma': [0, 0.1],
# #             'reg_alpha': [0, 1e-2],
# #             'reg_lambda': [1e-2, 1],
#             }
# gs = GridSearchCV(clf, parameters, scoring='f1_macro', cv=3)
# gs.fit(X_train, y_train)
# print(gs.best_params_)
# print(gs.best_score_)
# print(gs.best_estimator_)
# submission(gs.best_estimator_)