<center> 
<strong>LightGBM+RFECV+BayesSearchCV</strong><br />
<img src="https://panampost.com/wp-content/uploads/pobreza-costa-rica-560x276.jpg">
Helping the Inter-American Development Bank with income qualification of the world's poorest families.

</center>

# Brief Introduction
According to the 2017 edition of Costa Rica's annual "State of the Nation" report, 20 percent of households were living in poverty and exclusion. Despite some improvements, such as the drop in the percentage of households in poverty between 2015 and 2016, last year 31.5 percent of Costa Rican households suffered from some kind of poverty (monetary, multidimensional, or other types). The report also states that Costa Rica has failed to name some of the structural problems underlying poverty. These facts call for action from the Costa Rican authorities against those structural problems, and institutions such as the Inter-American Development Bank are asking skilled people to help them tackle the issue.


# First steps
In this notebook, we approach this problem using LightGBM, recursive feature elimination (RFECV), and BayesSearchCV to find the best hyper-parameters.

In [None]:
import matplotlib
import numpy as np
import pandas as pd
import lightgbm as lgb
from skopt import BayesSearchCV
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split


df = pd.read_csv('../input/train.csv')

test_data = pd.read_csv('../input/test.csv')

pd.set_option('display.max_columns', 500)

# Data treatment

In [None]:
print(df.shape) # Shape of the data
df.head(10) # See the first 10 rows of the df

### Next, we built some functions to simplify our dataset:
1. First, we found that some people with the same 'idhogar' had different targets. We corrected this with a function that sets every target in a household to the target of the head of the household;
2. We also treated the 'dependency' column by recalculating all of its values;
3. Next, we treated the columns 'edjefe' and 'edjefa', which had some "no" and "yes" answers;
4. We then noted that "v2a1" had some NaN values and tried to figure out the reason for them;
5. "v18q1" had some inconsistencies too, and we took care of them;
6. In the last step, we treated 'rez_esc' and 'meaneduc', which both had some NaN values that we filled with zeros;
7. Checking the Kernel ["Start Here: A Complete Walkthrough"](https://www.kaggle.com/willkoehrsen/start-here-a-complete-walkthrough) may help you understand this dataset more deeply.

### Checking which data types we have in our dataset

In [None]:
print('Dtypes count:' + '\n', df.dtypes.value_counts())
columns_object = df.columns[df.dtypes == object]
print('Columns which could have a problem :', \
       columns_object) # Columns which need treatment because they are object type

The columns 'Id' and 'idhogar' don't need treatment because they are not features in themselves: 'Id' simply identifies each row and 'idhogar' identifies the household.

### Treating people with the same 'idhogar' and different targets

In [None]:
def correct_targets(df):
    # Check, per household, whether all members share the same target
    all_equal_groups_1 = df.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)
    
    # Selection of households where targets are not equal for all members
    all_not_equal_groups_1 = all_equal_groups_1[all_equal_groups_1 != True]

    for household in all_not_equal_groups_1.index:
        # We assumed that the correct label is the label of the head of the household, this is one possible approach
        head_target = df.loc[(df['idhogar'] == household) & (df['parentesco1'] == 1.0), 'Target']
        if head_target.empty:
            continue # No head recorded for this household, so its labels are left untouched
        true_target = int(head_target.iloc[0])
        # Setting the correct tag for every member of the household
        df.loc[df['idhogar'] == household, 'Target'] = true_target
    
    all_equal_groups_2 = df.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)
    all_not_equal_groups_2 = all_equal_groups_2[all_equal_groups_2 != True]
    
    n_corrected = len(all_not_equal_groups_1) - len(all_not_equal_groups_2)
    print("Number of targets corrected :", n_corrected)
    
    return df
    
df = correct_targets(df)
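
The same idea can also be written without an explicit loop. The sketch below is only an illustration on a made-up toy frame (the values and the `Target_fixed` column are hypothetical, not part of the competition data), and it assumes each household has at most one head:

```python
import pandas as pd

# Toy data: household 'a' has inconsistent labels, household 'b' is consistent.
toy = pd.DataFrame({
    'idhogar':     ['a', 'a', 'a', 'b', 'b'],
    'parentesco1': [1, 0, 0, 0, 1],
    'Target':      [2, 4, 2, 3, 3],
})

# Label of the head of each household (parentesco1 == 1), indexed by household id.
# Assumption: at most one head per household, so the index is unique.
head_target = toy.loc[toy['parentesco1'] == 1].set_index('idhogar')['Target']

# Give every member the head's label; where a household has no recorded head
# the mapping yields NaN, so the original label is kept instead.
toy['Target_fixed'] = toy['idhogar'].map(head_target).fillna(toy['Target']).astype(int)

print(toy['Target_fixed'].tolist())  # [2, 2, 2, 3, 3]
```

Households whose labels already agree are unaffected, because the head's label equals everyone else's.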

### Treating 'dependency'

In [None]:
df[columns_object[2]].head(5)

As you can see, this column has some 'yes' and 'no' answers when the answer should be a float or int.
Since 'dependency' is calculated as:  
dependency = (number of members of the household younger than 19 or older than 64)/(number of members of the household between 19 and 64)  
we can calculate it manually by summing the columns 'hogar_nin' and 'hogar_mayor' and dividing the result by 'hogar_adul' - 'hogar_mayor'.

In [None]:
(df['hogar_nin'] + df['hogar_adul'] == df['hogar_total']).all()
# Testing if 'hogar_total' is the sum of 'hogar_nin' + 'hogar_adul'

This means that the number of members of the household who are younger than 19 or older than 65 is given by:

In [None]:
inf_19_sup_65 = df['hogar_nin'] + df['hogar_mayor']

As you can see, the column 'hogar_adul' includes everyone older than 19, seniors among them.
The column 'hogar_mayor' has the number of people older than 65.
This means that the number of members of the household who are older than 19 and younger than 65 is given by:

In [None]:
sup_19_inf_65 = df['hogar_adul'] - df['hogar_mayor']

In [None]:
dependecy = inf_19_sup_65*1.0 / sup_19_inf_65 # Recalculates the dependency
dependecy.head(5)

In [None]:
dependecy = dependecy.replace([np.inf, -np.inf], np.nan) # Replaces all inf with NaN
dependecy.head(5)

In [None]:
dependecy.nlargest() # Gives the highest dependency rates of all families, without considering inf

We will assume that the families which have no one older than 19 and younger than 65 have a dependency rate of 8, which was the initial dependency rate before any treatment.

In [None]:
dependecy = dependecy.fillna(dependecy.nlargest().iloc[0]) 
# Finds the largest value and fills the NaN with it

In [None]:
df['dependency'].head(5)

In [None]:
df['dependency'] = dependecy.values
df['dependency'].head(5)

#### Defining a function to treat 'dependency'

In [None]:
def correct_dependency(df):
    inf_19_sup_65 = df['hogar_nin'] + df['hogar_mayor']
    sup_19_inf_65 = df['hogar_adul'] - df['hogar_mayor']
    dependecy = inf_19_sup_65*1.0 / sup_19_inf_65 # Recalculates the dependency
    dependecy = dependecy.replace([np.inf, -np.inf], np.nan) # Replaces all inf with NaN
    dependecy = dependecy.fillna(dependecy.nlargest().iloc[0]) # Finds the largest value and fills the NaN with it
    df['dependency'] = dependecy.values
    return df    
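
To make the handling of division by zero concrete, here is a small sketch on invented household counts (the numbers are hypothetical). It follows the same three steps as the function above: divide, turn infinities into NaN, and fill with the largest finite ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical households; the third one has no working-age member.
toy = pd.DataFrame({
    'hogar_nin':   [3, 0, 1, 0],
    'hogar_adul':  [1, 2, 0, 2],
    'hogar_mayor': [0, 1, 0, 0],
})

dependents = toy['hogar_nin'] + toy['hogar_mayor']    # children plus seniors
working_age = toy['hogar_adul'] - toy['hogar_mayor']  # adults who are not seniors

ratio = dependents / working_age                  # 1 / 0 gives inf in pandas
ratio = ratio.replace([np.inf, -np.inf], np.nan)  # inf becomes NaN
ratio = ratio.fillna(ratio.max())                 # NaN becomes the largest finite ratio

print(ratio.tolist())  # [3.0, 1.0, 3.0, 0.0]
```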

### Treating 'edjefe' and 'edjefa'

Based on the description of each feature, we found that we can assign 'yes' to 1 and 'no' to 0.

In [None]:
df[columns_object[3]].head(5)

In [None]:
df[columns_object[3]] = df[columns_object[3]].replace({'no': 0, 'yes':1}).astype(float)

In [None]:
df[columns_object[4]].head(5)

In [None]:
df[columns_object[4]] = df[columns_object[4]].replace({'no': 0, 'yes':1}).astype(float)

#### Defining a function to treat 'edjefe' and 'edjefa'

In [None]:
def correct_edjefe_edjefa(df):
    # Explicit column names, so the function does not depend on the global 'columns_object'
    for col in ['edjefe', 'edjefa']:
        df[col] = df[col].replace({'no': 0, 'yes':1}).astype(float)
    return df

## Columns with NaN
In this part, we check which columns have NaN values.

In [None]:
list_na = df.columns[df.isnull().any()].tolist() #It's a list of all columns that have NAN
for column in list_na:
    series = df[column]
    n_null = series.isnull().sum()
    print('The column ' + column + ' has ' + str(n_null) + ' null values.')

### Treating v2a1

After checking the documentation we saw that:
* v2a1, Monthly rent payment;  
* tipovivi1, =1 own and fully paid house;  
* tipovivi4, =1 precarious;  
* tipovivi5, "=1 other(assigned,  borrowed)"

Therefore the four variables above must be related: every time that tipovivi1 = 1, tipovivi4 = 1 or tipovivi5 = 1, no rent is expected, so a NaN in v2a1 can be replaced by 0.

In [None]:
df[['v2a1', 'tipovivi1', 'tipovivi4', 'tipovivi5']].head(5)

In [None]:
n_ret = 0
for index, row in df.iterrows():
    # The parentheses matter: 'and' binds tighter than 'or'
    if np.isnan(row['v2a1']) and (row['tipovivi1'] == 1 or row['tipovivi4'] == 1 or row['tipovivi5'] == 1):
        df.loc[index,'v2a1'] = 0
        n_ret += 1
        
print('The amount of v2a1 changed was :', n_ret)

#### Defining a function to treat 'v2a1'

In [None]:
def correct_v2a1(df):
    n_ret = 0
    for index, row in df.iterrows():
        # The parentheses matter: 'and' binds tighter than 'or'
        if np.isnan(row['v2a1']) and (row['tipovivi1'] == 1 or row['tipovivi4'] == 1\
        or row['tipovivi5'] == 1):
            df.loc[index,'v2a1'] = 0
            n_ret += 1
    print('The amount of v2a1 changed was :', n_ret)
    return df
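
The rule for 'v2a1' only applies when the rent value is missing. Because Python's `and` binds more tightly than `or`, the way the three tipovivi checks are grouped changes the result. A minimal illustration with hypothetical scalar values (not taken from the dataset):

```python
# A row where rent IS recorded (not missing) but tipovivi4 == 1.
rent_missing, tipovivi1, tipovivi4, tipovivi5 = False, 0, 1, 0

# Without parentheses Python evaluates (A and B) or C or D.
unparenthesized = rent_missing and tipovivi1 == 1 or tipovivi4 == 1 or tipovivi5 == 1

# With parentheses the missing-value check guards all three categories.
parenthesized = rent_missing and (tipovivi1 == 1 or tipovivi4 == 1 or tipovivi5 == 1)

print(unparenthesized, parenthesized)  # True False
```

With the first form, a recorded rent would be overwritten with 0 for such a row; with the second form it is left alone.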

### Treating v18q1

In [None]:
df[['v18q', 'v18q1']].head(5)

In the documentation it is stated that:  
* 'v18q' = owns a tablet;  
* 'v18q1' = number of tablets household owns.  
Therefore if the 'v18q' = 0 then 'v18q1' should also be zero.

In [None]:
n_ret = 0
for index, row in df.iterrows():
    if row['v18q'] == 0 and np.isnan(row['v18q1']):
        df.loc[index,'v18q1'] = 0
        n_ret += 1
        
print('The amount of v18q1 changed was :', n_ret)

In [None]:
def correct_v18q1(df):
    n_ret = 0
    for index, row in df.iterrows():
        if row['v18q'] == 0 and np.isnan(row['v18q1']):
            df.loc[index,'v18q1'] = 0
            n_ret += 1
    print('The amount of v18q1 changed was :', n_ret)
    return df
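
Row-by-row loops with `iterrows` can be slow on larger frames. As a possible alternative, the same rule can be expressed as a boolean mask. This is a sketch on made-up values, not a replacement we benchmarked:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two households without a tablet and a missing count,
# one with a known count, and one that owns a tablet but has no count.
toy = pd.DataFrame({
    'v18q':  [0, 1, 0, 1],
    'v18q1': [np.nan, 2.0, np.nan, np.nan],
})

# Rows that own no tablet and have a missing tablet count.
mask = (toy['v18q'] == 0) & toy['v18q1'].isnull()
toy.loc[mask, 'v18q1'] = 0

print('The amount of v18q1 changed was :', int(mask.sum()))  # 2
```

The last row keeps its NaN, because the rule only covers households where 'v18q' is 0.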

### Treating rez_esc

In [None]:
df[['rez_esc', 'escolari']].head(5)

The documentation states that 'rez_esc' means 'Years behind in school', so we assumed that all the NaN values should be equal to 0: people who never failed a year in school presumably left this field empty, leading to NaN.

In [None]:
df['rez_esc'] = df['rez_esc'].fillna(0)

### Treating meaneduc

We filled the 'meaneduc' and 'SQBmeaned' columns with zeros. We assumed that people with 'meaneduc' = NaN did not attend school, and therefore their 'meaneduc' = 'SQBmeaned' = 0.

In [None]:
df['meaneduc'] = df['meaneduc'].fillna(0)
df['SQBmeaned'] = df['SQBmeaned'].fillna(0)

#### Defining a function to treat 'meaneduc', 'rez_esc' and 'SQBmeaned'

In [None]:
def correct_meaneduc_rez_esc(df):
    df['rez_esc'] = df['rez_esc'].fillna(0)
    df['meaneduc'] = df['meaneduc'].fillna(0)
    df['SQBmeaned'] = df['SQBmeaned'].fillna(0)
    return df

# Feature importance
We discarded all columns that had a standard deviation of zero, and we also eliminated the columns 'idhogar' and 'Id'.

In [None]:
# We choose to drop all Ids
needless_col_prov = ['idhogar','Id']
df = df.drop(needless_col_prov, axis = 1)

needless_col = needless_col_prov

# We assumed that all columns with a std of zero should also be dropped
needless_col_prov = []
for col in df.columns:
    if df[col].std() == 0: 
        needless_col_prov.append(col)
print('The following columns have zero std so they will be discarded :', needless_col_prov)
df = df.drop(needless_col_prov, axis = 1)


needless_col = needless_col + needless_col_prov

In [None]:
print('In total there were', len(needless_col), 'columns eliminated.')
print('The eliminated columns were the following ones :', needless_col)
print('We are now considering', len(df.columns.tolist())-1, 'features.')
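
The zero-variance filter can also be written without a loop. The following sketch uses an invented three-column frame (the column names `constant` and `varying` are hypothetical) to show the idea:

```python
import pandas as pd

# Hypothetical frame with one constant column.
toy = pd.DataFrame({
    'constant': [1, 1, 1],
    'varying':  [0, 1, 2],
    'Target':   [1, 2, 3],
})

# Standard deviation per numeric column, then keep the names where it is zero.
stds = toy.std(numeric_only=True)
zero_std_cols = stds[stds == 0].index.tolist()

toy = toy.drop(columns=zero_std_cols)
print(zero_std_cols)         # ['constant']
print(toy.columns.tolist())  # ['varying', 'Target']
```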

# Features and target

In [None]:
X = df.drop('Target', axis = 1)
y = df[['Target']]

# LGBM hyper-parameters auto-optimizer
We used Bayesian optimization to tune our hyper-parameters.  
From here on, all the code that is commented out was only executed on our computer; we didn't execute it in the Kaggle environment because we assumed that it would take too long to process.

In [None]:
'''
bayes_cv_tuner = BayesSearchCV( estimator = lgb.LGBMClassifier(boosting_type='gbdt', n_jobs=-1, verbose=2),
        search_spaces = {
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'num_leaves': (2, 500),
        'max_depth': (0, 500),
        'min_child_samples': (0, 200),
        'max_bin': (100, 100000),
        'subsample': (0.01, 1.0, 'uniform'),
        'subsample_freq': (0, 10),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'min_child_weight': (0, 10),
        'subsample_for_bin': (100000, 500000),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'scale_pos_weight': (1e-6, 500, 'log-uniform'),
        'n_estimators': (10, 10000),
        },
        scoring = 'f1_macro', cv = StratifiedKFold(n_splits=2), n_iter = 30, verbose = 1, refit = True)
'''

# Recursive Feature elimination
We defined the function rfecv_opt, which is responsible for doing a recursive feature elimination with F1 Macro as the scoring metric.

In [None]:
'''
def rfecv_opt(model, n_jobs, X, y, cv = StratifiedKFold(2)):
    rfecv = RFECV(estimator = model, step = 1, cv = cv,
                    n_jobs = n_jobs, scoring = 'f1_macro', verbose = 1)
    rfecv.fit(X.values, y.values.ravel())
    print('Optimal number of features : %d' % rfecv.n_features_)
    print('Max score with current model :', round(np.max(rfecv.grid_scores_), 3))
    # Plot number of features VS. cross-validation scores
    plt.figure()
    plt.xlabel('Number of features selected')
    plt.ylabel('Cross validation score (f1_macro)')
    plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
    plt.show()
    important_columns = []
    n = 0
    for i in rfecv.support_:
        if i == True:
            important_columns.append(X.columns[n])
        n +=1
    return important_columns, np.max(rfecv.grid_scores_), rfecv
'''
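
Since F1 Macro drives both the feature elimination and the hyper-parameter search, it may help to see how it is computed. The helper below is a hand-written sketch (`macro_f1` is our own illustrative name, not a library function) intended to mirror what `f1_score(..., average='macro')` does: one F1 per class, then an unweighted mean.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for three classes.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

# Per-class F1: class 1 -> 0.5, class 2 -> 0.8, class 3 -> 2/3.
print(round(macro_f1(y_true, y_pred), 4))  # 0.6556
```

Because every class counts equally, a rare class that is predicted badly lowers the score as much as a frequent one, which is why the metric suits an imbalanced target.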

# Routine
Next, we wrote a function to optimize our model and run RFECV on every step.

In [None]:
'''
def routine(X, y, n_iter_max, n_jobs):
    list_models = []
    list_scores_max = []
    list_features = []
    list_f1_score = []
    for i in range(n_iter_max):
        print('Currently on iteration', i+1, 'of', n_iter_max, '.')
        if i == 0:
            model = lgb.LGBMClassifier(max_depth=-1, learning_rate=0.1, objective='multiclass',
                            silent = True, metric = 'None', n_jobs = n_jobs,
                            n_estimators = 8000, class_weight = 'balanced')
        else:
            print('Adjusting model.')
            X_provi = X[imp_columns]
            # Get current parameters and the best parameters    
            result = bayes_cv_tuner.fit(X_provi.values, y.values.ravel())
            best_params = pd.Series(result.best_params_)
            param_dict=pd.Series.to_dict(best_params)
            model = lgb.LGBMClassifier(colsample_bytree = param_dict['colsample_bytree'],
                          learning_rate = param_dict['learning_rate'],
                          max_bin = int(param_dict['max_bin']),
                          max_depth = int(param_dict['max_depth']),
                          min_child_samples = int(param_dict['min_child_samples']),
                          min_child_weight = param_dict['min_child_weight'],
                          n_estimators = int(param_dict['n_estimators']),
                          num_leaves = int(param_dict['num_leaves']),
                          reg_alpha = param_dict['reg_alpha'],
                          reg_lambda = param_dict['reg_lambda'],
                          scale_pos_weight = param_dict['scale_pos_weight'],
                          subsample = param_dict['subsample'],
                          subsample_for_bin = int(param_dict['subsample_for_bin']),
                          subsample_freq = int(param_dict['subsample_freq']),
                          n_jobs = n_jobs,
                          class_weight='balanced',
                          objective='multiclass'
                          )
        imp_columns, max_score, rfecv = rfecv_opt(model, n_jobs, X, y)
        list_models.append(model)
        list_scores_max.append(max_score)
        list_features.append(imp_columns)
        
    return list_models, list_scores_max, list_features
'''

In [None]:
'''
list_models, list_scores_max, list_features = routine(X, y, 15, 4)

index_max = list_scores_max.index(max(list_scores_max))
features = list_features[index_max]
model = list_models[index_max]
'''

# Results
After running it on our computer we got the following results.

In [None]:
model = lgb.LGBMClassifier(boosting_type='gbdt', class_weight='balanced',
        colsample_bytree=0.364429092365, learning_rate=0.11718910536,
        max_bin=75490, max_depth=312, min_child_samples=21,
        min_child_weight=7.0, min_split_gain=0.0, n_estimators=5392,
        n_jobs=15, num_leaves=249, objective='multiclass',
        random_state=None, reg_alpha=2.51960359296e-05,
        reg_lambda=10.9020792516, scale_pos_weight=0.0247756521295,
        silent=True, subsample=0.195224406679, subsample_for_bin=126252,
        subsample_freq=3)

features = ['v2a1', 'rooms', 'r4h2', 'r4h3', 'r4m2', 'r4m3', 'r4t1', 'r4t2', 'r4t3', 'tamhog', 'tamviv',\
            'escolari', 'energcocinar3', 'hogar_nin', 'hogar_adul', 'dependency', 'edjefe', 'edjefa',\
            'meaneduc', 'bedrooms', 'overcrowding', 'qmobilephone', 'lugar1', 'age', 'SQBescolari', 'SQBage',\
            'SQBedjefe', 'SQBovercrowding', 'SQBdependency', 'SQBmeaned', 'agesq']

# Testing our model on trainset
We divided our trainset into train and test subsets to check how well our model performs on data held out from the trainset.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train = X_train[features]
X_test = X_test[features]
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

In [None]:
test_model = model.fit(X_train, y_train, eval_set=[(X_test, y_test)], early_stopping_rounds=500, verbose=200)

In [None]:
predictions = test_model.predict(X_test)

print('F1-macro score on the held-out part of the trainset = ', f1_score(y_test, predictions, average='macro'))

# Treating testset
The treatment for the testset is similar to what we have done for the trainset.

In [None]:
print('Dtypes count:' + '\n', test_data.dtypes.value_counts())
columns_object = test_data.columns[test_data.dtypes == object]
print('Columns which could have a problem :', \
      columns_object) # Columns which need treatment because they are object type

In [None]:
test_data = correct_dependency(test_data) # Correcting the 'dependency' column problem
test_data = correct_edjefe_edjefa(test_data) # Correcting the 'edjefa' and 'edjefe' problem

In [None]:
list_na = test_data.columns[test_data.isnull().any()].tolist() 
#It's a list of all columns that have NAN
for column in list_na:
    series = test_data[column]
    n_null = series.isnull().sum()
    print('The column ' + column + ' has ' + str(n_null) + ' null values.')

In [None]:
test_data = correct_v2a1(test_data)
test_data = correct_v18q1(test_data)
test_data = correct_meaneduc_rez_esc(test_data)

In [None]:
if not test_data.columns[test_data.isnull().any()].tolist(): 
    #It's a list of all columns that have NAN
    print('There are no columns with NaN values on the testset')

# Training our model to make predictions

In [None]:
X = X[features]
y = y.values.ravel()

In [None]:
prediction_model = model.fit(X, y)

In [None]:
id_column = test_data.Id

y_pred_final = prediction_model.predict(test_data[features])

In [None]:
file_to_submit = pd.DataFrame({'Id':id_column, 'Target':y_pred_final})
file_to_submit.to_csv('prediction.csv', index=False)

# Conclusions
In this kernel, we fitted a LightGBM model and got a score of 0.97, meaning that this model can give good predictions overall.  
Our next steps will aim at improving our dataset treatment, as well as finding possible correlations between features, which may improve our score. We will also try to plot the number of features vs. score using recursive feature elimination and to improve our visualizations, giving us a better look at our challenge.