# Costa Rican Poverty Line Prediction

The Inter-American Development Bank is asking the Kaggle community for help with income qualification for some of the world's poorest families. Are you up for the challenge?

Here's the backstory: Many social programs have a hard time making sure the right people are given enough aid. It’s especially tricky when a program focuses on the poorest segment of the population. The world’s poorest typically can’t provide the necessary income and expense records to prove that they qualify.

In Latin America, one popular method uses an algorithm to verify income qualification. It’s called the Proxy Means Test (or PMT). With PMT, agencies use a model that considers a family’s observable household attributes, such as the material of their walls and ceiling or the assets found in the home, to classify them and predict their level of need.

While this is an improvement, accuracy remains a problem as the region’s population grows and poverty declines.

To improve on PMT, the IDB (the largest source of development financing for Latin America and the Caribbean) has turned to the Kaggle community. They believe that new methods beyond traditional econometrics, based on a dataset of Costa Rican household characteristics, might help improve PMT’s performance.

Beyond Costa Rica, many countries face this same problem of inaccurately assessing social need. If Kagglers can generate an improvement, the new algorithm could be implemented in other countries around the world.

## Importing the required libraries

In [None]:
# essential libraries
import numpy as np 
import pandas as pd
# for data visualization
import matplotlib.pyplot as plt
import seaborn as sns


#for data processing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import  OneHotEncoder as ohe
from sklearn.preprocessing import StandardScaler as ss
from sklearn.compose import ColumnTransformer as ct
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# for modeling estimators
from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier as gbm
from xgboost.sklearn import XGBClassifier
import lightgbm as lgb

# for measuring performance
from sklearn.metrics import accuracy_score
from sklearn.metrics import auc, roc_curve
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
import sklearn.metrics as metrics
from xgboost import plot_importance
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

#for tuning parameters
from bayes_opt import BayesianOptimization
from skopt import BayesSearchCV
from eli5.sklearn import PermutationImportance

# Misc.
import os
import time
import gc
import random
from scipy.stats import uniform
import warnings

## Reading the data

In [None]:
pd.options.display.max_columns = 150

# Read in data
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')


## Explore data and perform data visualization

In [None]:
train.head()

In [None]:
train.info()   

In [None]:
sns.countplot(x="Target", data=train)
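
The count plot above shows how the `Target` classes are distributed. The same information as numbers makes any imbalance easier to quote. A minimal sketch, assuming a small made-up frame `toy` in place of `train` (the values are illustrative, not taken from the data):

```python
import pandas as pd

# Toy stand-in for `train`; in the notebook this would be train["Target"].
toy = pd.DataFrame({"Target": [4, 4, 4, 4, 4, 4, 2, 2, 3, 1]})

# Absolute and relative frequency of each class, ordered by class label.
counts = toy["Target"].value_counts().sort_index()
shares = toy["Target"].value_counts(normalize=True).sort_index()

class_balance = pd.DataFrame({"count": counts, "share": shares})
print(class_balance)
```

A strongly skewed `share` column is a hint that plain accuracy may flatter a model that favours the largest class.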

In [None]:
sns.countplot(x="r4t3",hue="Target",data=train)


In [None]:
sns.countplot(x="v18q",hue="Target",data=train)

In [None]:
sns.countplot(x="v18q1",hue="Target",data=train)

In [None]:
sns.countplot(x="tamhog",hue="Target",data=train)

In [None]:
sns.countplot(x="hhsize",hue="Target",data=train)

In [None]:
sns.countplot(x="abastaguano",hue="Target",data=train)

In [None]:
sns.countplot(x="noelec",hue="Target",data=train)

In [None]:
train.select_dtypes('object').head()

## Converting categorical objects into numerical values

In [None]:
# 'dependency', 'edjefe' and 'edjefa' mix numbers with 'yes'/'no' strings:
# map 'yes' to 1 and 'no' to 0, then cast the whole column to float
yes_no_map = {'no': 0, 'yes': 1}
train['dependency'] = train['dependency'].replace(yes_no_map).astype(np.float32)
train['edjefe'] = train['edjefe'].replace(yes_no_map).astype(np.float32)
train['edjefa'] = train['edjefa'].replace(yes_no_map).astype(np.float32)

In [None]:
train[["dependency","edjefe","edjefa"]].describe()
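
The yes/no conversion is applied to `train` only. If the three columns were kept for modelling, `test` would need the same treatment. A minimal sketch of a reusable helper, assuming a hypothetical function name `encode_yes_no` and a toy frame; it maps to the strings `"0"` and `"1"` first so that the cast to float sees only numeric strings:

```python
import numpy as np
import pandas as pd

YES_NO_MAP = {"no": "0", "yes": "1"}          # string-to-string, then cast once
MIXED_COLS = ["dependency", "edjefe", "edjefa"]

def encode_yes_no(df, cols=MIXED_COLS):
    """Return a copy of df with 'yes'/'no' replaced by 1/0 and the columns cast to float32."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].replace(YES_NO_MAP).astype(np.float32)
    return out

# Made-up rows that mimic the mixed string/number content of these columns.
toy = pd.DataFrame({
    "dependency": ["yes", "no", "8", ".5"],
    "edjefe": ["10", "no", "yes", "6"],
    "edjefa": ["no", "11", "no", "yes"],
})
encoded = encode_yes_no(toy)
print(encoded)
```

Calling the helper once per frame, for example `train = encode_yes_no(train)` and `test = encode_yes_no(test)`, keeps both sets consistent.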

### Fill in missing values (NULL values) with 0

In [None]:
# Number of missing in each column
missing = pd.DataFrame(train.isnull().sum()).rename(columns = {0: 'total'})

# Create a percentage missing
missing['percent'] = missing['total'] / len(train)

missing.sort_values('percent', ascending = False).head(10)
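
The missing-value table is built twice in this notebook, before and after filling. A small helper avoids the repetition. This is a sketch, with `missing_summary` as a hypothetical name and a made-up frame standing in for `train`; as in the cell above, `percent` is a fraction between 0 and 1:

```python
import numpy as np
import pandas as pd

def missing_summary(df, top=10):
    """Count and share of missing values per column, largest share first."""
    total = df.isnull().sum()
    summary = pd.DataFrame({"total": total, "percent": total / len(df)})
    return summary.sort_values("percent", ascending=False).head(top)

# Toy frame with a few gaps; the column names are borrowed from the data set.
toy = pd.DataFrame({
    "v2a1": [np.nan, np.nan, 100000.0, np.nan],
    "v18q1": [np.nan, 1.0, np.nan, 2.0],
    "hhsize": [3, 4, 2, 5],
})
summary = missing_summary(toy)
print(summary)
```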


In [None]:
train['v18q1'] = train['v18q1'].fillna(0)
test['v18q1'] = test['v18q1'].fillna(0)

In [None]:
train['v2a1'] = train['v2a1'].fillna(0)
test['v2a1'] = test['v2a1'].fillna(0)

In [None]:
train['rez_esc'] = train['rez_esc'].fillna(0)
test['rez_esc'] = test['rez_esc'].fillna(0)
train['SQBmeaned'] = train['SQBmeaned'].fillna(0)
test['SQBmeaned'] = test['SQBmeaned'].fillna(0)
train['meaneduc'] = train['meaneduc'].fillna(0)
test['meaneduc'] = test['meaneduc'].fillna(0)

In [None]:
# Check the missing-value counts again to confirm that none remain
missing = pd.DataFrame(train.isnull().sum()).rename(columns = {0: 'total'})

# Create a percentage missing
missing['percent'] = missing['total'] / len(train)

missing.sort_values('percent', ascending = False).head(10)



### Dropping unnecessary columns

In [None]:
train.drop(['Id','idhogar',"dependency","edjefe","edjefa"], inplace = True, axis =1)

test.drop(['Id','idhogar',"dependency","edjefe","edjefa"], inplace = True, axis =1)

In [None]:
train.shape

In [None]:
test.shape

### Dividing the data into predictors & target

In [None]:
y = train['Target']   # select the target by name rather than by position
y.unique()


In [None]:
# Select the predictors by name so that the Target column cannot slip into X
X = train.drop(columns=['Target'])
X.shape


### Scaling numeric features & applying PCA to reduce features

In [None]:

my_imputer = SimpleImputer()
X = my_imputer.fit_transform(X)
scale = ss()
X = scale.fit_transform(X)
pca = PCA(0.95)
X = pca.fit_transform(X)
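
The cell above fits the imputer, the scaler and PCA on all rows before the train/test split, so the held-out rows influence the preprocessing. One possible alternative, sketched here on synthetic data (the array `X_toy` and the step names are assumptions), is to wrap the three steps in a `Pipeline` and fit it on the training part only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the predictors: 200 rows, 20 correlated columns, a few NaNs.
latent = rng.normal(size=(200, 5))
X_toy = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(200, 20))
X_toy[rng.random(X_toy.shape) < 0.02] = np.nan

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=0)

preprocess = Pipeline([
    ("impute", SimpleImputer()),        # mean imputation, as in the cell above
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),    # keep enough components for 95% of the variance
])

X_tr_red = preprocess.fit_transform(X_tr)   # statistics learned from training rows only
X_te_red = preprocess.transform(X_te)       # same transformation reused on held-out rows

pca_step = preprocess.named_steps["pca"]
print(X_tr_red.shape, X_te_red.shape, pca_step.explained_variance_ratio_.sum())
```

With a float `n_components`, PCA keeps the smallest number of components whose cumulative explained variance reaches that share, so the output width depends on the data.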


### Final features selected for modeling

In [None]:
X.shape, y.shape

### Splitting the data into train & test 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2   # hold out 20% of the rows for testing
)
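
The split above is random and unseeded, so the scores change from run to run and the class shares can drift between the two parts. A sketch of a stratified, repeatable split, assuming made-up imbalanced labels `y_toy` (the class shares and the seed are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))
# Imbalanced toy labels with made-up class shares.
y_toy = rng.choice([1, 2, 3, 4], size=1000, p=[0.1, 0.15, 0.15, 0.6])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,     # keep the class shares the same in both parts
    random_state=42,    # make the split repeatable
)

share_all = np.mean(y_toy == 4)
share_test = np.mean(y_te == 4)
print(round(share_all, 3), round(share_test, 3))
```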


# Modelling

## Modelling with Random Forest

In [None]:
 
modelrf = rf()

In [None]:
start = time.time()
modelrf = modelrf.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modelrf.predict(X_test)

In [None]:
(classes == y_test).sum()/y_test.size 
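
Accuracy alone can look good on imbalanced classes when a model mostly predicts the largest class. Macro-averaged F1 (`f1_score` is imported above but not used) weights every class equally, and a confusion matrix shows where the errors fall. A minimal sketch on made-up labels, not on the notebook's predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels: a classifier that mostly predicts the majority class 4.
y_true = np.array([4, 4, 4, 4, 4, 4, 2, 2, 3, 1])
y_pred = np.array([4, 4, 4, 4, 4, 4, 4, 2, 4, 4])

acc = accuracy_score(y_true, y_pred)    # same value as (y_pred == y_true).mean()

# Unweighted mean of the per-class F1 scores; classes that are never
# predicted count as 0 instead of raising a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Rows are true classes, columns are predicted classes, in the order 1, 2, 3, 4.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])

print(acc, macro_f1)
print(cm)
```

Here the accuracy is 0.7 while the macro F1 is much lower, because classes 1 and 3 are never predicted.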

## Performing tuning using Bayesian Optimization.

In [None]:
bayes_cv_tuner = BayesSearchCV(
    # Estimator, with the parameters that are NOT tuned fixed here
    rf(n_jobs=2),

    # Search space for the parameters to tune
    {
        'n_estimators': (100, 500),                       # integer range
        'criterion': ['gini', 'entropy'],                 # categorical choices
        'max_depth': (4, 100),                            # integer range
        'max_features': (10, 64),                         # integer range
        'min_weight_fraction_leaf': (0, 0.5, 'uniform')   # float range, sampled uniformly
    },

    n_iter=32,   # number of parameter settings to sample
    cv=3         # number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
# Best parameters found
bayes_cv_tuner.best_params_

In [None]:
# Best mean accuracy achieved during cross-validation
bayes_cv_tuner.best_score_

In [None]:
# Accuracy on the held-out test split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']
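
`cv_results_` holds more than the list of parameter sets: each set also has a mean cross-validation score and a rank. A sketch of turning it into a sorted table, using scikit-learn's `RandomizedSearchCV` on synthetic data as a stand-in; the assumption is that `BayesSearchCV` exposes the same `cv_results_` keys:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data and a deliberately tiny search space to keep the example fast.
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=5, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20, 40], "max_depth": [2, 4, 8]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X_toy, y_toy)

# One row per parameter set tried, best-ranked first.
results = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```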

### Random Forest accuracy improved from 71.91% to 76.20% after tuning

## Modelling with ExtraTreesClassifier

In [None]:
modeletf = ExtraTreesClassifier()

In [None]:
start = time.time()
modeletf = modeletf.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modeletf.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size
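
The fit, predict and score cells are repeated for every model in this notebook. A sketch of a helper that does all three in one call, assuming the hypothetical name `fit_and_score` and synthetic four-class data in place of the PCA-reduced features:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def fit_and_score(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return (accuracy on the held-out part, fit time in minutes)."""
    start = time.time()
    model.fit(X_tr, y_tr)
    minutes = (time.time() - start) / 60
    accuracy = (model.predict(X_te) == y_te).mean()
    return accuracy, minutes

X_toy, y_toy = make_classification(n_samples=400, n_features=10, n_informative=5,
                                   n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

models = {
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=4),
}
scores = {name: fit_and_score(m, X_tr, y_tr, X_te, y_te) for name, m in models.items()}
for name, (accuracy, minutes) in scores.items():
    print(f"{name}: accuracy={accuracy:.4f}, fit time={minutes:.4f} min")
```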

## Performing tuning using Bayesian Optimization.

In [None]:
bayes_cv_tuner = BayesSearchCV(
    # Estimator, with the parameters that are NOT tuned fixed here
    ExtraTreesClassifier(),

    # Search space for the parameters to tune
    {
        'n_estimators': (100, 500),                       # integer range
        'criterion': ['gini', 'entropy'],                 # categorical choices
        'max_depth': (4, 100),                            # integer range
        'max_features': (10, 64),                         # integer range
        'min_weight_fraction_leaf': (0, 0.5, 'uniform')   # float range, sampled uniformly
    },

    n_iter=32,   # number of parameter settings to sample
    cv=2         # number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
# Best parameters found
bayes_cv_tuner.best_params_

In [None]:
# Best mean accuracy achieved during cross-validation
bayes_cv_tuner.best_score_

In [None]:
# Accuracy on the held-out test split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']

## Modelling with KNeighborsClassifier

In [None]:
modelneigh = KNeighborsClassifier(n_neighbors=4)

In [None]:
start = time.time()
modelneigh = modelneigh.fit(X_train, y_train)
end = time.time()
(end-start)/60



In [None]:
classes = modelneigh.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size 

## Performing tuning using Bayesian Optimization.

In [None]:
bayes_cv_tuner = BayesSearchCV(
    # Estimator, with the parameters that are NOT tuned fixed here
    KNeighborsClassifier(n_neighbors=4),

    # Search space: only the distance metric is tuned
    {"metric": ["euclidean", "cityblock"]},

    n_iter=32,   # number of parameter settings to sample
    cv=2         # number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
# Best parameters found
bayes_cv_tuner.best_params_

In [None]:
# Best mean accuracy achieved during cross-validation
bayes_cv_tuner.best_score_

In [None]:
# Accuracy on the held-out test split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']

## Modelling with GradientBoostingClassifier

In [None]:
modelgbm=gbm()

In [None]:
start = time.time()
modelgbm = modelgbm.fit(X_train, y_train)
end = time.time()
(end-start)/60


In [None]:
classes = modelgbm.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size 

## Performing tuning using Bayesian Optimization.

In [None]:
bayes_cv_tuner = BayesSearchCV(
    # Estimator with default settings; every parameter below is tuned
    gbm(),

    # Search space for the parameters to tune
    {
        'n_estimators': (100, 500),                       # integer range
        'max_depth': (4, 100),                            # integer range
        'max_features': (10, 64),                         # integer range
        'min_weight_fraction_leaf': (0, 0.5, 'uniform')   # float range, sampled uniformly
    },

    n_iter=32,   # number of parameter settings to sample
    cv=2         # number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
# Best parameters found
bayes_cv_tuner.best_params_

In [None]:
# Best mean accuracy achieved during cross-validation
bayes_cv_tuner.best_score_

In [None]:
# Accuracy on the held-out test split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']

## Modelling with XGBClassifier

In [None]:
modelxgb=XGBClassifier()

In [None]:
start = time.time()
modelxgb = modelxgb.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modelxgb.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size 

## Performing tuning using Bayesian Optimization.

In [None]:
bayes_cv_tuner = BayesSearchCV(
    # Estimator, with the parameters that are NOT tuned fixed here
    XGBClassifier(n_jobs=2),

    # Search space for the parameters to tune.
    # 'criterion', 'max_features' and 'min_weight_fraction_leaf' are scikit-learn
    # tree parameters that XGBoost does not use, so they are left out of the search.
    {
        'n_estimators': (100, 500),   # integer range
        'max_depth': (4, 100)         # integer range
    },

    n_iter=32,   # number of parameter settings to sample
    cv=3         # number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
# Best parameters found
bayes_cv_tuner.best_params_

In [None]:
# Best mean accuracy achieved during cross-validation
bayes_cv_tuner.best_score_

In [None]:
# Accuracy on the held-out test split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']

## Modelling with LightGBM

In [None]:
modellgb = lgb.LGBMClassifier(
    max_depth=-1, learning_rate=0.1, objective='multiclass',
    random_state=None, silent=True, metric='None',
    n_jobs=4, n_estimators=5000, class_weight='balanced',
    colsample_bytree=0.93, min_child_samples=95, num_leaves=14, subsample=0.96
)
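
The `class_weight='balanced'` setting above weights each class inversely to its frequency, using `n_samples / (n_classes * count_of_class)`. A small sketch of that formula on made-up labels, computed with scikit-learn's `compute_class_weight` and checked by hand:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: class 4 dominates, classes 1 and 3 are rare.
y_toy = np.array([4, 4, 4, 4, 4, 4, 2, 2, 3, 1])
classes = np.unique(y_toy)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same formula written out: n_samples / (n_classes * count of each class).
class_counts = np.array([(y_toy == c).sum() for c in classes])
manual = len(y_toy) / (len(classes) * class_counts)

class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

Rare classes receive weights above 1 and the majority class a weight below 1, so errors on rare classes cost more during training.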

In [None]:
start = time.time()
modellgb = modellgb.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modellgb.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size 

## Performing tuning using Bayesian Optimization.

In [None]:
bayes_cv_tuner = BayesSearchCV(
    # Estimator, with the parameters that are NOT tuned fixed here
    lgb.LGBMClassifier(n_jobs=2),

    # Search space for the parameters to tune.
    # 'criterion', 'max_features' and 'min_weight_fraction_leaf' are scikit-learn
    # tree parameters that LightGBM does not use, so they are left out of the search.
    {
        'n_estimators': (100, 500),   # integer range
        'max_depth': (4, 100)         # integer range
    },

    n_iter=32,   # number of parameter settings to sample
    cv=3         # number of cross-validation folds
)

In [None]:

# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
# Best parameters found
bayes_cv_tuner.best_params_

In [None]:
# Best mean accuracy achieved during cross-validation
bayes_cv_tuner.best_score_

In [None]:
# Accuracy on the held-out test split
bayes_cv_tuner.score(X_test, y_test)

In [None]:
# All parameter sets that were tried
bayes_cv_tuner.cv_results_['params']

| Model | Accuracy with default parameters (%) | Accuracy with parameters tuned by Bayesian optimization (%) |
|---|---|---|
| RandomForestClassifier | 77.87 | 85.61 |
| KNeighborsClassifier | 80.70 | 81.85 |
| ExtraTreesClassifier | 77.98 | 86.97 |
| GradientBoostingClassifier | 80.75 | 91.42 |
| XGBoost | 78.03 | 91.57 |
| LightGBM | 93.41 | 92.05 |