# Poverty Prediction - Driven Data Challenge

--- Information taken from Driven Data ---

Design a model to predict whether a given household in a given country is poor. The training features are survey data from three countries. For each country (A, B, and C), survey data is provided at both the household and the individual level. Each household is identified by its id, and each individual is identified by both their household id and individual iid. Most households consist of multiple individuals.

TRAINING DATA DESCRIPTION

Predictions should be made at the household level only, but data for each of the three countries is provided at both the household and individual level. You may be able to construct additional household features from the individual data that are particularly useful for predicting at the household level, which is why we provide both. There are six training files in total.
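
As a rough illustration of building household features from individual records, the sketch below aggregates a toy individual-level table to one row per household. The real survey columns are obfuscated, so `age` and `employed` are hypothetical names chosen only for this example; `id` and `iid` follow the description above.

```python
import pandas as pd

# Toy individual-level table; `age` and `employed` are hypothetical columns.
indiv = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],       # household id
    "iid": [1, 2, 3, 1, 2],      # individual id within the household
    "age": [34, 31, 5, 60, 58],
    "employed": [1, 0, 0, 1, 1],
})

# One row per household: size, mean age, and share of employed members.
hh_features = indiv.groupby("id").agg(
    n_members=("iid", "count"),
    mean_age=("age", "mean"),
    share_employed=("employed", "mean"),
)
print(hh_features)
```

The resulting frame is indexed by household `id`, so it could be joined onto a household-level table with the same index.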

The dataset has been structured so that the id columns match across the individual and household datasets. For both datasets, an assessment of whether or not the household is above or below the poverty line is in the poor column. This binary variable is the target variable for the competition.

Each column in the dataset corresponds to a survey question. Each question is either multiple choice, in which case each choice has been encoded as a random string, or numeric. Many of the multiple-choice questions are about consumable goods, for example whether your household has items such as bar soap, cooking oil, matches, and salt. Numeric questions often ask things like "How many working cell phones in total does your household own?" or "How many separate rooms do the members of your household occupy?"

TEST DATA DESCRIPTION

The test data format is the same as the training data, except that the poor column is not included.

PERFORMANCE METRIC

Mean log loss is used as the performance metric.
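
For reference, a minimal sketch of mean log loss for binary labels is shown below. It assumes the standard binary cross-entropy definition; the clipping constant `eps` is an assumption of this sketch, added only to avoid taking the logarithm of zero.

```python
import math

def mean_log_loss(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # eps is an assumed clipping value
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident and correct predictions give a small loss ...
confident_right = mean_log_loss([1, 0], [0.9, 0.1])
# ... while confident and wrong predictions are penalised heavily.
confident_wrong = mean_log_loss([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Because the metric rewards well-calibrated probabilities, the submission uses `predict_proba` outputs rather than hard class labels.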

In [2]:
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import ExtraTreesClassifier
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import matplotlib.pylab as plt
%matplotlib inline



# Input File read

In [2]:
input_df1 = pd.read_csv('C://Users//saish//Documents//driven_data//combined_A.csv', index_col='id')
print('A records', input_df1.shape)

A records (37560, 387)


In [3]:
input_df2 = pd.read_csv('C://Users//saish//Documents//driven_data//combined_B.csv',index_col='id')
print('B records', input_df2.shape)
input_df2.head()

B records (20252, 667)


id,iid,RzaXNcgd,LfWEhutI,jXOqJdNL,wJthinfa_x,PTLgvdlQ,ZvEApWrk,euTESpHe,bDVMMSYY,aSzMhjgD,...,NZYkmhkD,fxWioPPP,ulQCDoYe,tzYvQeOb,DWmTWcUm,PxgyaWYq,NfpXxGQk,cavdrXpj,poor,country_y
62801,1,zTghO,pYfmQ,lNhMv,18,iuxWN,33,OLVWN,FDqwJ,rxJJI,...,uCOQO,UYIFp,9,,MAFfK,VnOFM,-7827.0,uJXdA,False,B
62801,2,zTghO,pYfmQ,lNhMv,18,iuxWN,33,OLVWN,FDqwJ,rxJJI,...,uCOQO,UYIFp,29,,MAFfK,VnOFM,,uJXdA,False,B
62801,3,zTghO,pYfmQ,lNhMv,18,iuxWN,33,OLVWN,FDqwJ,rxJJI,...,uCOQO,UYIFp,-82,,MAFfK,VnOFM,,uJXdA,False,B
20689,1,zTghO,pYfmQ,lNhMv,74,iuxWN,-2,OLVWN,FDqwJ,ufugi,...,uCOQO,UYIFp,-6,,MAFfK,ppEcI,-7867.0,uJXdA,True,B
20689,2,zTghO,pYfmQ,lNhMv,74,iuxWN,-2,OLVWN,FDqwJ,ufugi,...,uCOQO,UYIFp,19,,MAFfK,ppEcI,-7987.0,uJXdA,True,B


In [4]:
input_df3 = pd.read_csv('C://Users//saish//Documents//driven_data//combined_C.csv',index_col='id')
print('C records', input_df3.shape)
input_df3.head()

C records (29913, 206)


id,iid,GRGAYimk,DNnBfiSI,cNDTCUPU,GvTJUYOo,vmKoAlVH,LhUIIEHQ,DTNyjXJp,PNAiwXUz,ABnhybHK,...,NAxEQZVi,ShCKQiAy,rkLqZrQW,VGJlUgVG,kMVbipfP,sCTSWhXf,rVneGwzn,uVFOfrpa,poor,country_y
30639,1,VFAuL,boDkI,gJLrc,EPKkJ,YXkKd,7,XuMYE,-1,sEJgr,...,Rihyc,INYbJ,SoOdX,VlcEt,zzxBZ,yQhuJ,xgpHA,DnIbO,False,C
30639,2,VFAuL,boDkI,gJLrc,EPKkJ,YXkKd,7,XuMYE,-1,sEJgr,...,Rihyc,TYbsc,SoOdX,VlcEt,zzxBZ,yQhuJ,xgpHA,DnIbO,False,C
30639,3,VFAuL,boDkI,gJLrc,EPKkJ,YXkKd,7,XuMYE,-1,sEJgr,...,GkrMH,xJurw,pbPGJ,YYwlj,rPkFE,yQhuJ,ldKFc,kXobL,False,C
30639,9,VFAuL,boDkI,gJLrc,EPKkJ,YXkKd,7,XuMYE,-1,sEJgr,...,Rihyc,iuiyo,SoOdX,YYwlj,zzxBZ,yQhuJ,QGHnL,xRxWC,False,C
30639,10,VFAuL,boDkI,gJLrc,EPKkJ,YXkKd,7,XuMYE,-1,sEJgr,...,Rihyc,iuiyo,SoOdX,YYwlj,zzxBZ,yQhuJ,QGHnL,xRxWC,False,C


# Data imbalance check

In [5]:
input_df1.poor.value_counts()

True     19684
False    17876
Name: poor, dtype: int64

In [6]:
input_df2.poor.value_counts()

False    18375
True      1877
Name: poor, dtype: int64

In [7]:
input_df3.poor.value_counts()

False    22868
True      7045
Name: poor, dtype: int64
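
To make the imbalance easier to compare across countries, the sketch below converts the counts printed above into the share of rows labelled poor. The combined files appear to hold one row per individual (the `head()` outputs repeat each household `id` with different `iid` values), so these are row-level shares, not household-level shares.

```python
# Row counts copied from the value_counts() outputs above.
counts = {
    "A": {"poor": 19684, "not_poor": 17876},
    "B": {"poor": 1877, "not_poor": 18375},
    "C": {"poor": 7045, "not_poor": 22868},
}

# Share of rows labelled poor in each country.
poor_share = {
    country: c["poor"] / (c["poor"] + c["not_poor"])
    for country, c in counts.items()
}

for country, share in poor_share.items():
    print(f"{country}: {share:.1%} of rows are labelled poor")
```

Country A is roughly balanced, while B and C are skewed towards the non-poor class, with B the most imbalanced.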

# Pre-process

In [8]:
def data_preprocess(input_df1, enforce_cols=None):
    print('Initial Input Shape', input_df1.shape)
    numeric = input_df1.select_dtypes(include=['int64', 'float64'])
    input_df1[numeric.columns] = (numeric - numeric.mean()) / numeric.std()
    print('After standardization Input Shape', input_df1.shape)
    input_df1 = pd.get_dummies(input_df1)
    print('After encoding Input Shape', input_df1.shape)
    
    # Processing for the test set: align its columns with the training columns.
    # np.setdiff1d(a, b) returns the values that are in 'a' but not in 'b'.
    if enforce_cols is not None:
        to_drop = np.setdiff1d(input_df1.columns, enforce_cols)
        to_add = np.setdiff1d(enforce_cols, input_df1.columns)
        
        input_df1.drop(to_drop, axis=1, inplace=True)
        input_df1 = input_df1.assign(**{c: 0 for c in to_add})
    
        print('After enforcing Input Shape', input_df1.shape)
        
    return input_df1
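
The effect of the `enforce_cols` branch can be seen on a toy example. The sketch below uses `DataFrame.reindex` as a compact stand-in for the drop-and-assign steps in `data_preprocess`; it is not the notebook's own code, and the `fuel` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical feature with different levels in train and test.
train = pd.DataFrame({"fuel": ["gas", "wood", "coal"]})
test = pd.DataFrame({"fuel": ["gas", "dung"]})

train_enc = pd.get_dummies(train, dtype=int)  # fuel_coal, fuel_gas, fuel_wood
test_enc = pd.get_dummies(test, dtype=int)    # fuel_dung, fuel_gas

# Drop columns unseen in training, add missing ones as zeros,
# and put the columns in the training order.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

Unlike the drop-and-assign approach, `reindex` also fixes the column order in the same step; the notebook handles ordering later by selecting the test columns with the training frame's column index.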

In [9]:
aX_train = data_preprocess(input_df1.drop('poor', axis=1))
ay_train = np.ravel(input_df1.poor)
bX_train = data_preprocess(input_df2.drop('poor', axis=1))
by_train = np.ravel(input_df2.poor)
cX_train = data_preprocess(input_df3.drop('poor', axis=1))
cy_train = np.ravel(input_df3.poor)

Initial Input Shape (37560, 386)
After standardization Input Shape (37560, 386)
After encoding Input Shape (37560, 1134)
Initial Input Shape (20252, 666)
After standardization Input Shape (20252, 666)
After encoding Input Shape (20252, 3074)
Initial Input Shape (29913, 205)
After standardization Input Shape (29913, 205)
After encoding Input Shape (29913, 1096)


In [10]:
print('shape of input 1 after pre-processing', aX_train.shape)
print('shape of input 2 after pre-processing', bX_train.shape)
print('shape of input 3 after pre-processing', cX_train.shape)

shape of input 1 after pre-processing (37560, 1134)
shape of input 2 after pre-processing (20252, 3074)
shape of input 3 after pre-processing (29913, 1096)


# Imputing Null Values

In [11]:
bX_train = bX_train.fillna(bX_train.mean())
#bX_train.isnull().sum().sort_values(ascending=False)

In [12]:
#cX_train.isnull().sum().sort_values(ascending=False)

# Test Data creation

In [13]:
test_df1 = pd.read_csv('C://Users//saish//Documents//driven_data//new_comb//combined_test_A.csv',index_col='id')
print('A records', test_df1.shape)
test_df2 = pd.read_csv('C://Users//saish//Documents//driven_data//new_comb//combined_test_B.csv',index_col='id')
print('B records', test_df2.shape)
test_df3 = pd.read_csv('C://Users//saish//Documents//driven_data//new_comb//combined_test_C.csv',index_col='id')
print('C records', test_df3.shape)
##
a_test = data_preprocess(test_df1, enforce_cols=aX_train.columns)
##
b_test = data_preprocess(test_df2, enforce_cols=bX_train.columns)
b_test = b_test.fillna(b_test.mean())
#b_test.isnull().sum().sort_values(ascending=False)
##
c_test = data_preprocess(test_df3, enforce_cols=cX_train.columns)
c_test = c_test.fillna(c_test.mean())
#c_test.isnull().sum().sort_values(ascending=False)

A records (18535, 386)
B records (10066, 666)
C records (14701, 205)
Initial Input Shape (18535, 386)
After standardization Input Shape (18535, 386)
After encoding Input Shape (18535, 1125)
After enforcing Input Shape (18535, 1134)
Initial Input Shape (10066, 666)
After standardization Input Shape (10066, 666)
After encoding Input Shape (10066, 2954)
After enforcing Input Shape (10066, 3074)
Initial Input Shape (14701, 205)
After standardization Input Shape (14701, 205)
After encoding Input Shape (14701, 1075)
After enforcing Input Shape (14701, 1096)


# Model

# Tuning parameters for country A

XGBoost outperformed all the other algorithms tried. Below is the tuning procedure for XGBoost.

Following are the parameters tuned using GridSearchCV

- n_estimators: number of boosting rounds (trees)
- max_depth: maximum depth of each tree
- min_child_weight: minimum sum of instance weights needed in a child node
- gamma: minimum loss reduction required to make a further partition on a leaf node of the tree
- subsample: fraction of the training rows sampled for each tree
- colsample_bytree: fraction of the columns sampled when constructing each tree
- reg_alpha: L1 regularization term on the weights
- learning_rate: step-size shrinkage applied to each boosting update
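
The notebook tunes these parameters in stages rather than in one joint grid, which keeps each search small. As a rough sketch of the cost of a single stage, the code below enumerates the candidates for the first country A grid (`max_depth` and `min_child_weight`) and counts the model fits under 5-fold cross-validation. It uses `itertools.product` for illustration and does not call scikit-learn.

```python
from itertools import product

# First-stage grid from the country A search: depth and child weight together.
param_grid = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_child_weight": [1, 3, 5],
}
cv_folds = 5

# Every combination of the listed values is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold (the final refit is not counted here).
n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "fits")
```

Searching all eight parameters jointly at similar resolution would multiply these counts together, which is why a staged search is a common compromise.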

In [None]:
predictors, ts_x, train, ts_y = train_test_split(aX_train, ay_train, test_size = 0.2, random_state=42)

In [None]:
def modelfit(alg, train, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=100):
    # Note the argument order: `train` holds the labels, `predictors` the feature matrix.
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(predictors, label=train)
        # Cross-validate with early stopping to choose the number of boosting rounds
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
        # Printed inside the branch: cvresult only exists when useTrainCV is True
        print('no.estimators', cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(predictors, train, eval_metric='auc')
        
    #Predict training set:
    dtrain_predictions = alg.predict(predictors)
    dtrain_predprob = alg.predict_proba(predictors)[:,1]
        
    #Print model report:
    print ("\nModel Report")
    print ("Accuracy : %.4g" % metrics.accuracy_score(train, dtrain_predictions))
    print ("AUC Score (Train): %f" % metrics.roc_auc_score(train, dtrain_predprob))
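
With `early_stopping_rounds` set, `xgb.cv` stops once the cross-validated metric has not improved for that many rounds, and `modelfit` takes the length of the returned history (`cvresult.shape[0]`) as `n_estimators`. The sketch below is a simplified illustration of that stopping rule, not XGBoost's implementation; the function name `rounds_kept` and the score history are hypothetical.

```python
def rounds_kept(scores, patience):
    """Number of boosting rounds kept when a higher-is-better score
    has not improved for `patience` consecutive rounds."""
    best_round, best_score = 0, float("-inf")
    for i, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds: stop
    return best_round + 1  # rounds are counted from 1

# Hypothetical per-round cross-validated AUC values.
history = [0.70, 0.74, 0.77, 0.78, 0.78, 0.779, 0.777, 0.776]
n_estimators = rounds_kept(history, patience=3)
print(n_estimators)
```

This is why each tuning stage starts from `n_estimators=5000` in `modelfit` and then reuses the much smaller number it reports in the following grid searches.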
                    
    

In [None]:

xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, ay_train, aX_train)

In [None]:
param_test1 = {
 'max_depth': [2,4,6,8,10],
 'min_child_weight': [1,3,5]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=1222, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(aX_train,ay_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
param_test1 = {
 'max_depth': [1,2,3,5],
 'min_child_weight': [5,6,7]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=154, max_depth=2,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(aX_train,ay_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
param_test1 = {
 'min_child_weight': [4,5,6,7]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=154, max_depth=2,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(aX_train,ay_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
param_test3 = {
 'gamma':[i/10.0 for i in range(0,10)]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=154, max_depth=2,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(aX_train,ay_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:

xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=2,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, ay_train, aX_train)

In [None]:
param_test4 = {
 'subsample':[i/10.0 for i in range(4,10)],
 'colsample_bytree':[i/10.0 for i in range(4,10)]
}

gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=340, max_depth=2,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch4.fit(aX_train,ay_train)
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

In [None]:
param_test4 = {
 'subsample':[i/100.0 for i in range(85, 100, 5)],
 'colsample_bytree':[i/100.0 for i in range(65,80,5)]
}

gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=340, max_depth=2,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch4.fit(aX_train,ay_train)
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

In [None]:
param_test6 = {
 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=340, max_depth=2,
 min_child_weight=6, gamma=0, subsample=0.95, colsample_bytree=0.75,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch6.fit(aX_train,ay_train)
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

In [None]:
param_test6 = {
 'reg_alpha':[0.4,0.9,1,1.2,1.5,1.7,2,2.5,3,5,7]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=340, max_depth=2,
 min_child_weight=6, gamma=0, subsample=0.95, colsample_bytree=0.75,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch6.fit(aX_train,ay_train)
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=2,
 min_child_weight=6,
 gamma=0,
 subsample=0.95,
 colsample_bytree=0.75,
 objective= 'binary:logistic',
 reg_alpha = 2,
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, ay_train, aX_train)

In [None]:
param_test6 = {
 'learning_rate':[0.03,0.05,0.07,0.09,0.1,0.12,0.14,0.15,0.17,0.19,0.2]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=393, max_depth=2,
 min_child_weight=6, gamma=0, subsample=0.95, colsample_bytree=0.75, reg_alpha = 2,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch6.fit(aX_train,ay_train)
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=2,
 min_child_weight=6,
 gamma=0,
 subsample=0.95,
 colsample_bytree=0.75,
 reg_alpha = 2,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, ay_train, aX_train)

In [None]:
est = XGBClassifier(
 learning_rate =0.2,
 n_estimators=393,
 max_depth=2,
 min_child_weight=6,
 gamma=0,
 subsample=0.95,
 colsample_bytree=0.75,
 reg_alpha = 2,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

est.fit(predictors,train)

tr_pred_a = est.predict(ts_x)
#
accuracy = est.score(ts_x, ts_y)
print("Hold-out accuracy:",accuracy)
#classification report
print(classification_report(ts_y,tr_pred_a))
#
a_test_s = a_test[predictors.columns]
a_pred = est.predict_proba(a_test_s)

# Tuning parameters for country B

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, by_train, bX_train)

In [None]:
param_test1 = {
 'max_depth': [2,4,6,8,10,12],
 'min_child_weight': [1,2,3,4,5,6]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=90, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(bX_train,by_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
param_test1 = {
 'max_depth': [1,2,3],
 'min_child_weight': [1,2,3]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=90, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(bX_train,by_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
param_test1 = {
 'min_child_weight': [1,2,3,5,7,9,10,12]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=90, max_depth=1,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(bX_train,by_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
param_test3 = {
 'gamma':[i/10.0 for i in range(0,10)]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=90, max_depth=1,
 min_child_weight=3, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(bX_train,by_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=1,
 min_child_weight=3,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, by_train, bX_train)

In [None]:
param_test4 = {
 'subsample':[i/10.0 for i in range(4,10)],
 'colsample_bytree':[i/10.0 for i in range(4,10)]
}

gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=191, max_depth=1,
 min_child_weight=3, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch4.fit(bX_train,by_train)
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

In [None]:
param_test4 = {
 'subsample':[i/100.0 for i in range(85, 100, 5)],
 'colsample_bytree':[i/100.0 for i in range(45,55,5)]
}

gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=191, max_depth=1,
 min_child_weight=3, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch4.fit(bX_train,by_train)
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

In [None]:
param_test6 = {
 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=191, max_depth=1,
 min_child_weight=3, gamma=0, subsample=0.95, colsample_bytree=0.5,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch6.fit(bX_train,by_train)
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

In [None]:
param_test6 = {
 'reg_alpha':[0.01,0.05,0.09,0.1,0.15,0.2]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=191, max_depth=1,
 min_child_weight=3, gamma=0, subsample=0.95, colsample_bytree=0.5,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch6.fit(bX_train,by_train)
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=1,
 min_child_weight=3,
 gamma=0,
 subsample=0.95,
 colsample_bytree=0.5,
 objective= 'binary:logistic',
 nthread=4,
 reg_alpha = 0.1,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, by_train, bX_train)

In [None]:
param_test6 = {
 'learning_rate':[0.03,0.05,0.07,0.09,0.1,0.12,0.14,0.15,0.17,0.19,0.2]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=191, max_depth=1,
 min_child_weight=3, gamma=0, subsample=0.95, colsample_bytree=0.5, reg_alpha = 0.1,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch6.fit(bX_train,by_train)
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

In [None]:
#bX_train_int, by_train_int = shuffle(b_tr_svd, by_train, random_state=0)
tr_x, ts_x, tr_y, ts_y = train_test_split(bX_train, by_train, test_size = 0.2, random_state=42)
#
est = XGBClassifier(
 learning_rate =0.2,
 n_estimators=191,
 max_depth=1,
 min_child_weight=3,
 gamma=0,
 subsample=0.95,
 colsample_bytree=0.5,
 objective= 'binary:logistic',
 nthread=4,
 reg_alpha = 0.1,
 scale_pos_weight=1,
 seed=27)

#
est.fit(tr_x, tr_y)
tr_pred_b = est.predict(ts_x)
#
accuracy = est.score(ts_x, ts_y)
print("Hold-out accuracy:",accuracy)
#classification report
print(classification_report(ts_y,tr_pred_b))
#
b_test_s = b_test[tr_x.columns]
b_pred = est.predict_proba(b_test_s)

# Tuning parameters for country C

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, cy_train, cX_train)

In [None]:
param_test1 = {
 'max_depth': [2,4,6,8,10,12],
 'min_child_weight': [1,2,3,4,5,6]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=34, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(cX_train,cy_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
param_test1 = {
 'max_depth': [9,10,11],
 'min_child_weight': [1,2,3]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=34, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(cX_train,cy_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
param_test1 = {
 'min_child_weight': [1,2,3,5,7,9,11]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=34, max_depth=9,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(cX_train,cy_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
param_test3 = {
 'gamma':[i/10.0 for i in range(0,10)]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=34, max_depth=9,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

#
gsearch1.fit(cX_train,cy_train)
#
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=9,
 min_child_weight=2,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, cy_train, cX_train)

In [None]:
param_test4 = {
 'subsample':[i/10.0 for i in range(4,10)],
 'colsample_bytree':[i/10.0 for i in range(4,10)]
}

gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=31, max_depth=9,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch4.fit(cX_train,cy_train)
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

In [None]:
param_test4 = {
 'subsample':[i/100.0 for i in range(75,90,5)],
 'colsample_bytree':[i/100.0 for i in range(85,100,5)]
}

gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=31, max_depth=9,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch4.fit(cX_train,cy_train)
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=9,
 min_child_weight=2,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.85,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, cy_train, cX_train)

In [None]:
param_test6 = {
 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=30, max_depth=9,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.85,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch6.fit(cX_train,cy_train)
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

In [None]:
param_test6 = {
 'reg_alpha':[1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=30, max_depth=9,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.85,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch6.fit(cX_train,cy_train)
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.2,
 n_estimators=5000,
 max_depth=9,
 min_child_weight=2,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.85,
 reg_alpha = 1e-07,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, cy_train, cX_train)

In [None]:
param_test6 = {
 'learning_rate':[0.03,0.05,0.07,0.09,0.1,0.12,0.14,0.15,0.17,0.19,0.2]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.2, n_estimators=30, max_depth=9,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.85, reg_alpha = 1e-07,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)

gsearch6.fit(cX_train,cy_train)
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

In [None]:
xgb1 = XGBClassifier(
 learning_rate =0.17,
 n_estimators=5000,
 max_depth=9,
 min_child_weight=2,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.85,
 reg_alpha = 1e-07,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

modelfit(xgb1, cy_train, cX_train)

In [None]:
#cX_train_int, cy_train_int = shuffle(c_in, c_out, random_state=0)
tr_x, ts_x, tr_y, ts_y = train_test_split(cX_train, cy_train, test_size = 0.2, random_state=42)
#
est = XGBClassifier(
 learning_rate =0.17,
 n_estimators=38,
 max_depth=9,
 min_child_weight=2,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.85,
 reg_alpha = 1e-07,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

#
est.fit(tr_x, tr_y)
tr_pred_c = est.predict(ts_x)
#
accuracy = est.score(ts_x, ts_y)
print("Hold-out accuracy:",accuracy)
#classification report
print(classification_report(ts_y,tr_pred_c))
#
c_test_s = c_test[tr_x.columns]
c_pred = est.predict_proba(c_test_s)

# Change prediction format

In [None]:
def make_country_sub(preds, test_feat, country):
    # make sure we code the country correctly
    country_codes = ['A', 'B', 'C']
    
    # get just the poor probabilities
    country_sub = pd.DataFrame(data=preds[:, 1],  # proba p=1
                               columns=['poor'], 
                               index=test_feat.index)

    
    # add the country code for joining later
    country_sub["country"] = country
    return country_sub[["country", "poor"]]

In [None]:
a_sub = make_country_sub(a_pred, a_test, 'A')
b_sub = make_country_sub(b_pred, b_test, 'B')
c_sub = make_country_sub(c_pred, c_test, 'C')

In [None]:
sub_fl = pd.concat([a_sub, b_sub, c_sub])

In [None]:
sub_fl.head()

In [None]:
sub_fl.to_csv('C://Users//saish//Documents//driven_data//submission_0221.csv')

In [None]:
out_1 = pd.read_csv('C://Users//saish//Documents//driven_data//submission_0221.csv')
re = pd.read_csv('C://Users//saish//Documents//driven_data//submission_1.csv')

In [None]:
re.head()

In [None]:
out_2 = out_1.groupby(['id','country'], sort=False)['poor'].mean().reset_index()
out_2.head()

In [None]:
out_2 = out_2.set_index('id')
out_2 = out_2.reindex(index=re['id'])
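
The two cells above collapse the individual-level predictions to one probability per household and then reorder the rows to match a reference submission file. A minimal sketch of the same pattern on toy data follows; the ids, probabilities, and the `template` frame are hypothetical.

```python
import pandas as pd

# Hypothetical individual-level predictions: one row per household member.
preds = pd.DataFrame({
    "id": [101, 101, 102, 102, 102],
    "country": ["A", "A", "A", "A", "A"],
    "poor": [0.2, 0.4, 0.9, 0.8, 0.7],
})
# Hypothetical reference file that defines the required household order.
template = pd.DataFrame({"id": [102, 101]})

household = (
    preds.groupby(["id", "country"], sort=False)["poor"].mean()  # average per household
    .reset_index()
    .set_index("id")
    .reindex(template["id"])  # match the reference row order
)
print(household)
```

Averaging member-level probabilities is one simple way to aggregate; `reindex` requires the household ids in the index to be unique, which holds in this toy example.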

In [None]:
out_3 = out_2
out_3.head()

In [None]:
out_3.to_csv('C://Users//saish//Documents//driven_data//submission_fl.csv')