# Janatahack: Cross-sell Prediction

Your client is an insurance company that has provided Health Insurance to its customers. Now they need your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in Vehicle Insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000/- so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. Now if you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5000/-, that is where the concept of probabilities comes into the picture. For example, like you, there may be 100 customers paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) would get hospitalised that year. This way everyone shares the risk of everyone else.
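
The pooling arithmetic in this example can be made explicit. The sketch below uses the paragraph's numbers and assumes, as a simplification not stated in the text, that every claim costs the full cover. The function name `break_even_claim_rate` is made up for illustration.

```python
def break_even_claim_rate(premium, cover):
    """Fraction of customers who can claim the full cover before payouts exceed premiums."""
    return premium / cover

customers = 100
premium = 5000      # Rs. per customer per year
cover = 200000      # Rs. maximum payout per claim (assumed to be paid in full)

total_premiums = customers * premium              # money collected by the pool
rate = break_even_claim_rate(premium, cover)      # share of customers that can claim
claims_covered = total_premiums / cover           # number of full claims the pool can pay

print(f"Premium pool: Rs. {total_premiums}")
print(f"Break-even claim rate: {rate:.1%}")
print(f"Full claims the pool can pay: {claims_covered}")
```

Under that assumption the pool breaks even at 2.5 full claims per 100 customers, which is in line with the "2-3" figure used in the example.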

Just like medical insurance, there is vehicle insurance, where every year the customer needs to pay a premium of a certain amount to the insurance provider so that in case of an unfortunate accident involving the vehicle, the insurance provider will provide compensation (called ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then plan its communication strategy accordingly to reach out to those customers and optimise its business model and revenue.

Now, in order to predict whether the customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code type), vehicles (Vehicle Age, Damage), policy (Premium, sourcing channel), etc.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import xgboost
import matplotlib
import seaborn as sns
sns.set(rc={'figure.figsize':(10,8)})


# importing dataset

In [None]:
train = pd.read_csv('../input/janatahack-crosssell-prediction/train.csv')
test = pd.read_csv('../input/janatahack-crosssell-prediction/test.csv')
sample = pd.read_csv('../input/janatahack-crosssell-prediction/sample_submission.csv')

In [None]:
train.head()

In [None]:
test.head()

# Preprocessing of data

In [None]:
train.head()

In [None]:
test[test['Annual_Premium']>200000]

In [None]:
# mark train and test rows, drop training rows with very large premiums, then merge
train['train_or_test'] = 1
test['train_or_test'] = 0
train = train[train['Annual_Premium']<200000]
train.reset_index(drop=True, inplace=True)
merge_data = pd.concat([train, test])

## Checking null values

##### There are no null values (Response is our target variable; it is empty in the test dataset)
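
A check along these lines would support that statement. This is a minimal sketch on a made-up frame standing in for `merge_data`; the column values are invented for illustration.

```python
import numpy as np
import pandas as pd

# toy stand-in for merge_data: the test row has no Response
toy = pd.DataFrame({
    'Age': [25, 40, 33],
    'Response': [1.0, 0.0, np.nan],
    'train_or_test': [1, 1, 0],
})

# missing values per column over train and test together
null_counts = toy.isnull().sum()
print(null_counts)

# restrict to training rows to confirm the gaps come only from the test set
train_nulls = toy[toy['train_or_test'] == 1].isnull().sum()
print(train_nulls)
```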

### Unique Values in dataset

In [None]:
merge_data.nunique()

In [None]:
merge_data.info()

In [None]:
merge_data.loc[merge_data.train_or_test==1].info()

In [None]:
merge_data.fillna(0,inplace=True)

In [None]:
merge_data.head()

In [None]:
# changing type of Response, Region_Code, Policy_Sales_Channel
merge_data['Response'] = merge_data['Response'].astype('int')
merge_data['Region_Code'] = merge_data['Region_Code'].astype('int')
merge_data['Policy_Sales_Channel'] = merge_data['Policy_Sales_Channel'].astype('int')

In [None]:
merge_data['Age'].plot.hist()

In [None]:
# creating new category for Age
merge_data['Age_group'] = pd.cut(x=merge_data['Age'], bins=range(20,90,10)).astype('str')


In [None]:
merge_data['Age_group'].value_counts()
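
One detail of `pd.cut` is worth keeping in mind here: intervals are right-closed by default, so with the edges used above a value of exactly 20, or anything above 80, falls outside every bin and becomes NaN (the string 'nan' after `astype('str')`). Whether such ages occur depends on the data. The sketch below shows the behaviour on a few made-up ages, together with one possible alternative that is an assumption rather than the notebook's choice.

```python
import pandas as pd

ages = pd.Series([20, 25, 30, 80, 85])

# same edges as above: (20, 30], (30, 40], ..., (70, 80]
groups = pd.cut(ages, bins=range(20, 90, 10))
print(groups.isna().tolist())

# possible alternative: include the lowest edge and add a (80, 90] bin
covered = pd.cut(ages, bins=range(20, 100, 10), include_lowest=True)
print(covered.isna().tolist())
```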

In [None]:
# splitting Vintage (days) into whole years and whole months
merge_data['year_vintage'] = merge_data['Vintage']//365
merge_data['months_vintage'] = merge_data['Vintage']//30

In [None]:
merge_data.head()

# EDA of the Data
We should ask the following questions:
1. Distribution of Driving_License users with Response
2. Distribution of Gender with respect to Response
3. Which regions have more responses
4. Whether Previously_Insured customers will take insurance or not
5. Which age group has the most insurance responses
6. Does Policy_Sales_Channel affect responses?
7. Whether older customers will stick with insurance or not
8. Whether past damage of vehicle affects Response

#### 1. Distribution of Driving_License users with Response

In [None]:
merge_data.groupby('Response')['Driving_License'].value_counts()

In [None]:
merge_data.groupby('Response')['Driving_License'].value_counts().plot(kind='bar')

##### Insights
We can say that people who don't have a license usually don't go for the insurance.
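
Raw counts are dominated by group size, and `merge_data` also contains the test rows whose Response was filled with 0. A response rate per group, computed on training rows only, can be easier to read. The sketch below uses made-up data to show the idea.

```python
import pandas as pd

toy = pd.DataFrame({
    'Driving_License': [1, 1, 1, 1, 0, 0, 1, 0],
    'Response':        [1, 0, 0, 1, 0, 0, 0, 0],
    'train_or_test':   [1, 1, 1, 1, 1, 1, 0, 0],   # last two rows mimic test rows
})

# keep training rows only, then take the share of positive responses per group
train_rows = toy[toy['train_or_test'] == 1]
response_rate = train_rows.groupby('Driving_License')['Response'].mean()
print(response_rate)
```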

#### 2. Distribution of Gender with respect to Response

In [None]:
merge_data.groupby('Response')['Gender'].value_counts()

In [None]:
merge_data.groupby('Response')['Gender'].value_counts().plot(kind='bar')

In [None]:
sns.countplot(x='Response', hue='Gender', data=merge_data)

#### 3. Which regions have more responses 

In [None]:
merge_data[merge_data.Response==1]['Region_Code'].value_counts().nlargest(5)

In [None]:
merge_data[merge_data.Response==1]['Region_Code'].value_counts().plot(kind='bar')

In [None]:
# region no. 28 has the highest number of users who take the insurance

#### 4. Whether Previously_Insured customers will take insurance or not

In [None]:
merge_data.groupby('Response')['Previously_Insured'].value_counts()

In [None]:
sns.countplot(x='Response', hue='Previously_Insured', data=merge_data)

In [None]:
## users who have previously taken insurance will mostly take the insurance

#### 5. Which age group has the most insurance responses

In [None]:
sns.countplot(x='Response', hue='Age_group', data=merge_data[merge_data.Response==1])

In [None]:
# people in the age groups 40-50 and 30-40 have the highest number of insurance responses

#### 6. Does Policy_Sales_Channel affect responses?

In [None]:
merge_data[merge_data.Response==1]['Policy_Sales_Channel'].value_counts().nlargest(10).plot(kind='bar')

In [None]:
# sales channels 26 and 124 have a high number of responses

#### 7. Whether older customers will stick with insurance or not

In [None]:
# sns.barplot(x='Vintage', y='Response', data=merge_data[merge_data.Response==1])
merge_data.groupby('Response')['year_vintage'].value_counts()
# Vintage

#### 8. Whether past damage of vehicle affects Response

In [None]:
merge_data.groupby('Response')['Vehicle_Damage'].value_counts()

In [None]:
# vehicles that were damaged in the past are the most likely to get insurance
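
To back a "most likely" statement with shares rather than counts, a row-normalised cross-tabulation is one option. This is a sketch on made-up values, not the notebook's output.

```python
import pandas as pd

toy = pd.DataFrame({
    'Vehicle_Damage': ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No'],
    'Response':       [1, 1, 0, 0, 0, 0, 0, 1],
})

# each row sums to 1: share of Response 0/1 within a Vehicle_Damage level
shares = pd.crosstab(toy['Vehicle_Damage'], toy['Response'], normalize='index')
print(shares)
```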

## Data Postprocessing after EDA

In [None]:
merge_data['annual_bin'] = pd.cut(merge_data['Annual_Premium'], bins=np.arange(0,200000,5000), labels=range(1,40)).cat.add_categories(40)

In [None]:
merge_data['annual_bin'] = merge_data['annual_bin'].fillna(40)
merge_data['annual_bin'] = merge_data['annual_bin'].astype('int')
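
The two cells above can be traced on a handful of made-up premiums. The edges run from 0 to 195000 in steps of 5000, which gives 39 right-closed bins labelled 1 to 39. Anything above 195000 falls outside the last edge and ends up in the extra category 40.

```python
import numpy as np
import pandas as pd

premiums = pd.Series([2630, 5000, 7000, 195000, 199000])

# 40 edges -> 39 bins, labelled 1..39; label 40 is reserved for values beyond the last edge
bins = np.arange(0, 200000, 5000)
binned = pd.cut(premiums, bins=bins, labels=range(1, 40)).cat.add_categories(40)
binned = binned.fillna(40).astype('int')
print(binned.tolist())
```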

In [None]:
merge_data['Annual_Premium']=np.log(merge_data['Annual_Premium'])

In [None]:
merge_data.head()

In [None]:
merge_data.columns

In [None]:
# ordinal codes follow vehicle age; the frequency encoding applied later replaces these values
merge_data['Vehicle_Age'] = merge_data['Vehicle_Age'].map({'< 1 Year':0, '1-2 Year':1, '> 2 Years':2})

In [None]:
# from category_encoders.helmert import HelmertEncoder
# from category_encoders.cat_boost import CatBoostEncoder
# categories = ['Gender', 'Driving_License', 'Vehicle_Damage', 'Age_group']
# cat = CatBoostEncoder()

# merge_data.loc[merge_data.train_or_test==1, categories] = cat.fit_transform(merge_data[merge_data.train_or_test==1].drop(['Response'], axis=1)[categories], y=merge_data[merge_data.train_or_test==1]['Response'])
# merge_data.loc[merge_data.train_or_test==0, categories] = cat.transform(merge_data[merge_data.train_or_test==1].drop(['Response'], axis=1)[categories])

In [None]:
sns.countplot(x='Vehicle_Age', hue='Response', data=merge_data)

In [None]:
merge_data.head()

In [None]:
# convert Gender, Driving_License, Vehicle_Age, Vehicle_Damage and Age_group categories to integer values
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
categories = ['Gender', 'Driving_License','Vehicle_Age', 'Vehicle_Damage', 'Age_group']
for i in categories:
    merge_data[i] = encoder.fit_transform(merge_data[i])

In [None]:
# frequency encoding
fe=merge_data.groupby('Vehicle_Age').size()/len(merge_data)
merge_data['Vehicle_Age']=merge_data['Vehicle_Age'].apply(lambda x: fe[x])

In [None]:
# frequency encoding
fe=merge_data.groupby('Policy_Sales_Channel').size()/len(merge_data)
merge_data['Policy_Sales_Channel']=merge_data['Policy_Sales_Channel'].apply(lambda x: fe[x])
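
Frequency encoding replaces each category with the share of rows that carry it. An equivalent way to write it, shown here on a made-up series, is `value_counts(normalize=True)` followed by `map`.

```python
import pandas as pd

channel = pd.Series([26, 26, 124, 152], name='Policy_Sales_Channel')

# share of rows per category, then look up each row's category in that table
freq = channel.value_counts(normalize=True)
encoded = channel.map(freq)
print(encoded.tolist())
```

In the notebook the shares are computed over train and test rows together, so both sets are encoded with the same table.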

In [None]:
merge_data.head()

In [None]:
merge_data.info()

In [None]:
# Annual_Premium
merge_data['mean_Annual_Premium_region']=merge_data.groupby(['Region_Code'])['Annual_Premium'].transform('mean')
merge_data['sum_Annual_Premium_region']=merge_data.groupby(['Region_Code'])['Annual_Premium'].transform('sum')
merge_data['max_Annual_Premium_region']=merge_data.groupby(['Region_Code'])['Annual_Premium'].transform('max')
merge_data['min_Annual_Premium_region']=merge_data.groupby(['Region_Code'])['Annual_Premium'].transform('min')
merge_data['std_Annual_Premium_region']=merge_data.groupby(['Region_Code'])['Annual_Premium'].transform('std')

# age_Group
merge_data['mean_Annual_Premium_Age_group']=merge_data.groupby(['Age_group'])['Annual_Premium'].transform('mean')
merge_data['sum_Annual_Premium_Age_group']=merge_data.groupby(['Age_group'])['Annual_Premium'].transform('sum')
merge_data['max_Annual_Premium_Age_group']=merge_data.groupby(['Age_group'])['Annual_Premium'].transform('max')
merge_data['min_Annual_Premium_Age_group']=merge_data.groupby(['Age_group'])['Annual_Premium'].transform('min')
merge_data['std_Annual_Premium_Age_group']=merge_data.groupby(['Age_group'])['Annual_Premium'].transform('std')

# Policy_Sales_Channel
merge_data['mean_Annual_Premium_Policy_Sales_Channel']=merge_data.groupby(['Policy_Sales_Channel'])['Annual_Premium'].transform('mean')
merge_data['sum_Annual_Premium_Policy_Sales_Channel']=merge_data.groupby(['Policy_Sales_Channel'])['Annual_Premium'].transform('sum')
merge_data['max_Annual_Premium_Policy_Sales_Channel']=merge_data.groupby(['Policy_Sales_Channel'])['Annual_Premium'].transform('max')
merge_data['min_Annual_Premium_Policy_Sales_Channel']=merge_data.groupby(['Policy_Sales_Channel'])['Annual_Premium'].transform('min')
merge_data['std_Annual_Premium_Policy_Sales_Channel']=merge_data.groupby(['Policy_Sales_Channel'])['Annual_Premium'].transform('std')
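
The fifteen assignments above follow one pattern: a group column, a statistic, and a column name built from both. A more compact form is sketched below on made-up data. The `group_cols` mapping is a hypothetical helper that mirrors the naming used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Region_Code': [28, 28, 8, 8],
    'Annual_Premium': np.log([30000.0, 40000.0, 25000.0, 35000.0]),
})

stats = ['mean', 'sum', 'max', 'min', 'std']
# group column -> suffix used in the new column names
group_cols = {'Region_Code': 'region'}

for col, suffix in group_cols.items():
    grouped = toy.groupby(col)['Annual_Premium']
    for stat in stats:
        # transform broadcasts the group statistic back to every row of the group
        toy[f'{stat}_Annual_Premium_{suffix}'] = grouped.transform(stat)

print(toy.columns.tolist())
```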

In [None]:
merge_data.head()

In [None]:
## Now our data is ready for modelling.

In [None]:
# merge_data = pd.get_dummies(merge_data, columns=['Vehicle_Damage', 'Previously_Insured'])

In [None]:
merge_data.head()

## Modelling 

In [None]:
# importing models
from sklearn import tree
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier
import xgboost
import lightgbm as lgb
from imblearn.over_sampling import SMOTE


In [None]:
# drop the id column because it is not useful for prediction
merge_data.drop(['id'], axis=1, inplace=True)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# Divide train and test for training
X = merge_data[merge_data['train_or_test']==1].drop(['Response', 'train_or_test'],axis=1)
y = merge_data[merge_data['train_or_test']==1]['Response']

In [None]:
features=['Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured',
       'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
       'Policy_Sales_Channel','Vintage',
       'Age_group', 'months_vintage','mean_Annual_Premium_region',
       'sum_Annual_Premium_region', 'max_Annual_Premium_region',
       'min_Annual_Premium_region', 'mean_Annual_Premium_Age_group',
       'sum_Annual_Premium_Age_group', 'max_Annual_Premium_Age_group',
       'min_Annual_Premium_Age_group',
       'mean_Annual_Premium_Policy_Sales_Channel',
       'sum_Annual_Premium_Policy_Sales_Channel',
       'max_Annual_Premium_Policy_Sales_Channel',
       'min_Annual_Premium_Policy_Sales_Channel'
         ]
# features=['Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured',
#        'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
#        'Policy_Sales_Channel', 'Vintage',
#        'Age_group', 'months_vintage', 'mean_Annual_Premium_region',
#        'sum_Annual_Premium_region', 'max_Annual_Premium_region',
#        'min_Annual_Premium_region', 'mean_Annual_Premium_Age_group',
#        'sum_Annual_Premium_Age_group', 'max_Annual_Premium_Age_group',
#        'min_Annual_Premium_Age_group',
#        'mean_Annual_Premium_Policy_Sales_Channel',
#        'sum_Annual_Premium_Policy_Sales_Channel',
#        'max_Annual_Premium_Policy_Sales_Channel',
#        'min_Annual_Premium_Policy_Sales_Channel'
#          ]




In [None]:
# Divide train and test for training
train = merge_data[merge_data['train_or_test']==1]
X = train.drop(['Response', 'train_or_test'],axis=1)[features]
y = merge_data[merge_data['train_or_test']==1]['Response']

In [None]:
X.shape, y.shape

In [None]:
test.shape

In [None]:
## test dataset
test = merge_data[merge_data['train_or_test']==0].drop(['Response', 'train_or_test'],axis=1)[features]

In [None]:
test.head().shape

In [None]:
np.arange(0.1,1+0.1,0.1,dtype=float)

In [None]:
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope
import xgboost as xgb
from functools import partial
from sklearn import model_selection,metrics

In [None]:
param_space_xgb = {
        'max_depth': scope.int(hp.quniform("max_depth", 3,15,1)),
        'n_estimators':scope.int(hp.quniform("n_estimators", 100, 600, 1)),
        'colsample_bytree':hp.uniform("colsample_bytree", 0.01, 1),
        
        'scale_pos_weight': scope.int(hp.quniform("scale_pos_weight", 1, 20, 1)),
        'eta':hp.uniform("eta", 0.01, 0.3),

        'tree_method':hp.choice('tree_method', ["gpu_hist"]),
        'gpu_id':hp.choice('gpu_id', [0]),
    }
param_space_catboost = {
        'depth': scope.int(hp.quniform("max_depth", 3,15,1)),
        'n_estimators':scope.int(hp.quniform("n_estimators", 100, 1500,1)),
        
        'scale_pos_weight': scope.int(hp.quniform("scale_pos_weight", 1, 20, 1)),
        'learning_rate':hp.uniform("learning_rate", 0.01, 0.3),
        'l2_leaf_reg':hp.uniform("l2_leaf_reg", 0.5,1),
        'bagging_temperature':hp.uniform("bagging_temperature", 0.01,1),
    
        'eval_metric':hp.choice('eval_metric', ["AUC"]),
        'task_type':hp.choice('task_type', ['GPU']),
        'devices':hp.choice('devices', ['0:1']),
    
    }
    
trials = Trials()
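
Both spaces search `scale_pos_weight` between 1 and 20. A common reference point for that parameter, mentioned in XGBoost's parameter documentation, is the ratio of negative to positive training examples. The sketch below computes it on made-up labels. The 12% positive rate is an assumption for illustration, not a figure from this dataset.

```python
import numpy as np

y_toy = np.array([0] * 88 + [1] * 12)   # made-up labels, 12% positives

n_pos = int((y_toy == 1).sum())
n_neg = int((y_toy == 0).sum())

# negatives per positive: a typical starting value for scale_pos_weight
ratio = n_neg / n_pos
print(round(ratio, 2))
```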

In [None]:
import catboost

In [None]:
def optimize(params, x, y):
    
    model = catboost.CatBoostClassifier(**params)

    kf = model_selection.StratifiedKFold(n_splits=5)
    aucs = []
    for idx in kf.split(X=x, y=y):
        train_idx, test_idx = idx[0], idx[1]
        x_train = x[train_idx]
        y_train = y[train_idx]

        x_test = x[test_idx]
        y_test = y[test_idx]

        model.fit(x_train, y_train, eval_set=[(x_test, y_test)], early_stopping_rounds=100, verbose=200)
        preds = model.predict_proba(x_test)[:,1]
        fold_auc = metrics.roc_auc_score(y_test, preds)
        aucs.append(fold_auc)
        
    return -1.0*np.mean(aucs)
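
The objective above is a stratified K-fold loop that averages the ROC AUC and flips its sign, because hyperopt minimises. The same structure is sketched below on synthetic data, with scikit-learn's `LogisticRegression` swapped in as a CPU-only stand-in for the CatBoost model. The helper name `mean_cv_auc` is made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# synthetic, imbalanced stand-in for (X, y)
X_toy, y_toy = make_classification(n_samples=500, n_features=8,
                                   weights=[0.85, 0.15], random_state=0)

def mean_cv_auc(model, x, y, n_splits=5):
    """Average ROC AUC over stratified folds, refitting the model on each fold."""
    kf = StratifiedKFold(n_splits=n_splits)
    aucs = []
    for train_idx, valid_idx in kf.split(x, y):
        model.fit(x[train_idx], y[train_idx])
        preds = model.predict_proba(x[valid_idx])[:, 1]
        aucs.append(roc_auc_score(y[valid_idx], preds))
    return float(np.mean(aucs))

score = mean_cv_auc(LogisticRegression(max_iter=1000), X_toy, y_toy)
loss = -score   # value a minimiser such as hyperopt's fmin would receive
print(score)
```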

In [None]:
optimize_function = partial(optimize, x = X.values, y=y.values)
results = fmin(fn=optimize_function, space=param_space_catboost, max_evals=15, trials=trials, algo=tpe.suggest)
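# fmin returns a dict of the best sampled values (indices for hp.choice entries);
# space_eval maps it back to the actual parameter values
from hyperopt import space_eval
print(space_eval(param_space_catboost, results))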

In [None]:
from hyperopt import hp
from hyperopt.pyll import scope
param_hyperopt= {
    'learning_rate': hp.choice('learning_rate', [0.1, 0.4, 0.04,0.3]),
    'depth': hp.choice('max_depth', np.arange(1, 13+1, dtype=int)),
    'n_estimators': hp.choice('n_estimators', [100, 200, 300,400, 500,600,700,800,900]),
    'task_type':hp.choice('task_type', ['GPU']),
    'early_stopping_rounds': hp.choice('early_stopping_rounds', [100]),
    'devices': hp.choice('devices',['0:1']),
#     'colsample_bytree': hp.choice('colsample_bytree', np.arange(0.6,1+0.1,0.1,dtype=float)),
    'l2_leaf_reg': hp.uniform('l2_leaf_reg', 0.0, 1.0),
    'eval_metric': hp.choice('eval_metric', ['AUC']),
    'random_strength': hp.choice('random_strength', np.arange(0.1,1+0.1,0.1,dtype=float)),
    'bagging_temperature': hp.choice('bagging_fraction', np.arange(0.1,1+0.1,0.1,dtype=float)),
}

In [None]:
import lightgbm as lgb
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import time
from sklearn.metrics import roc_auc_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)


def my_custom_loss_func(y_true, y_pred):
    print(y_pred.shape, y_true.shape)
    return roc_auc_score(y_true, y_pred)

custom_roc = make_scorer(my_custom_loss_func, greater_is_better=True)
custom = {'roc':custom_roc}

def hyperopt(param_space, X_train, y_train, X_test, num_eval):
    
    start = time.time()
    
    def objective_function(params):
        clf = CatBoostClassifier(**params)
        cv_probs = cross_val_predict(clf, X_train, y_train, cv=skf,method='predict_proba')
        cv_probs = np.array(cv_probs)
        auc = []
        for train_index, test_index in skf.split(X_train, y_train):
            we = np.array(y_train)
            auc.append(roc_auc_score(we[test_index], cv_probs[test_index][:,1]))
        score = np.mean(auc)
        print(np.mean(auc))
        return {'loss': 1-score, 'status': STATUS_OK}

    trials = Trials()
    best_param = fmin(objective_function, 
                      param_space, 
                      algo=tpe.suggest, 
                      max_evals=num_eval, 
                      trials=trials,
                      rstate= np.random.RandomState(1))
    loss = [x['result']['loss'] for x in trials.trials]

    # fmin returns indices for hp.choice entries; space_eval maps them back to parameter values
    from hyperopt import space_eval
    best_params = space_eval(param_space, best_param)

    # refit on the full training data with the best parameters found
    clf_best = CatBoostClassifier(**best_params)
    clf_best.fit(X_train, y_train)

    print("")
    print("##### Results")
    # loss is 1 - AUC, so the best trial is the one with the smallest loss
    print("Score best parameters: ", 1 - min(loss), min(loss))
    print("Best parameters: ", best_params)
#     print("Test Score: ", clf_best.score(X_test, y_test))
#     print("Time elapsed: ", time.time() - start)
#     print("Parameter combinations evaluated: ", num_eval)
    
    return trials

In [None]:
from hyperopt import fmin, tpe, Trials
import numpy as np

from sklearn.model_selection import train_test_split
trials = Trials()
# X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, stratify=y)

results_hyperopt = hyperopt(param_hyperopt, X, y, test, 75)

# objective_function is local to hyperopt(), so the search runs through the call above;
# the best parameters are printed there and every trial is kept in results_hyperopt


In [None]:

from sklearn.svm import SVC
from imblearn.combine import SMOTETomek


from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from imblearn.over_sampling import SVMSMOTE

class Modelling_class:
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.roc_auc_each = {}
        self.xgb = xgboost.XGBClassifier(
    booster = "gbtree",
            eval_metric='auc',reg_lambda=2,max_delta_step=4,
    n_estimators=500,
    learning_rate=0.04, colsample_bytree=0.9,
    seed=42,tree_method='gpu_hist', gpu_id=0, )
        
        self.lgb = lgb.LGBMClassifier(boosting_type='gbdt',n_estimators=500,max_depth=10,learning_rate=0.04,objective='binary',metric='auc',is_unbalance=True,
                 colsample_bytree=0.5,reg_lambda=2,reg_alpha=2,random_state=22,n_jobs=-1)
        
        self.lgb_k = lgb.LGBMClassifier(random_state=22,n_jobs=-1,max_depth=-1,min_data_in_leaf=24,num_leaves=49,bagging_fraction=0.01,metric='auc',
                        colsample_bytree=1.0,lambda_l1=1,lambda_l2=11,learning_rate=0.1,n_estimators=5000)
        
        self.rf = RandomForestClassifier(n_jobs=-1)
        self.skf = StratifiedKFold(n_splits=10)
        self.lin = Pipeline([('scaler', StandardScaler()), ('svc', LogisticRegression(max_iter=500 ))])
        self.sgd = Pipeline([('scaler', StandardScaler()), ('sgd', SGDClassifier(loss='log'))])
        self.svc = Pipeline([('scaler', StandardScaler()), ('svc', SVC(gamma='auto', class_weight='balanced'))])
        self.quad = Pipeline([('scaler', StandardScaler()), ('quad', QuadraticDiscriminantAnalysis())])
        
        self.ada = Pipeline([('scaler', StandardScaler()), ('ada', AdaBoostClassifier(n_estimators=100, random_state=0))])
        
        self.kf = KFold(n_splits=5)
        self.catboost = CatBoostClassifier(n_estimators=2000,
                       random_state=20,
                       eval_metric='AUC',
                       learning_rate=0.1,
                       depth=6,l2_leaf_reg=0.5,
#                        bagging_temperature=0.1,
                       task_type='GPU', verbose=10,devices='0:1'
                       #num_leaves=64
                       
                       )
        self.best_catboost = CatBoostClassifier(random_state=42)
        
        self.total = 0
        self.model_dict = {}
        self.temp = None
        self.y_pred_tot = np.zeros((len(test), 2))
        self.models = None
        self.define_models()
    
    def define_models(self):
        
        self.models =  {
            'lgb': self.lgb_k,
            'xgb': self.xgb, 
                 
#                 'rf': self.rf, 
            
#                 'lin':self.lin,
#                 'sgd':self.sgd,
#             'svc':self.svc,
#             'quad':self.ada,
#             'best_Cat':self.best_catboost
                'catboost':self.catboost
        }
    
    def stacking(self):
        self.models = {'stacking': StackingClassifier(estimators = [('xgb', self.xgb), ('lgb',self.lgb)])}
        
        
    def evaluate(self, test, target, smote=False, standard_scalar = False):    
        c = 0
    
        for model_name, model in self.models.items():
            c = 0
           
            for train_id, test_id in self.skf.split(self.X,self.y):
                
#                 if c > //:
#                     continue
                
                X_train, X_test = self.X.loc[train_id], self.X.loc[test_id]
                y_train, y_test = self.y[train_id], self.y[test_id]
                
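                if model_name == 'stacking':
                    # StackingClassifier.fit does not accept eval_set or early-stopping arguments
                    model.fit(X_train, y_train)
                else:
                    model.fit(X_train, y_train,eval_set=[(X_test, y_test)], early_stopping_rounds=100, verbose=200)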
                self.model_dict[model_name+str(c)] = model

                y_pred = model.predict(X_test)
                self.temp = model.predict_proba(X_test)
                y_proba =  model.predict_proba(X_test)[:,1]

                self.roc_auc_each[model_name+str(c)] = roc_auc_score(y_test,y_proba)

                c+=1
                print("model_name-",model_name,"-score:",roc_auc_score(y_test,y_proba))

                self.total+=1
                print(self.total_roc_auc())
                finals = np.array(model.predict_proba(test)[:,1])
                sample[target] = finals
                sample.to_csv('./'+model_name+str(c)+'.csv',index=False)
                self.y_pred_tot+= model.predict_proba(test)
                print(self.y_pred_tot.shape)
                
    def total_roc_auc(self):
        print("TOTAL ROC_AUC")
        return sum(self.roc_auc_each.values())/self.total
models = Modelling_class(X,y )
models.evaluate(test, 'Response')
print(models.total_roc_auc())

In [None]:
models.y_pred_tot=models.y_pred_tot/models.total
sample['Response'] = models.y_pred_tot[:,1]
sample['Response'].value_counts(normalize=True)
sample.to_csv('xgb_lgb_cat_norm.csv',index=False)
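
The averaging step can be checked on a small example. In the sketch below, three made-up `predict_proba` outputs are summed into a running total and divided by the number of fits, which is what `y_pred_tot / total` does above.

```python
import numpy as np

# made-up predict_proba outputs from three fitted models on four test rows
fold_preds = [
    np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]),
    np.array([[0.8, 0.2], [0.5, 0.5], [0.3, 0.7], [0.9, 0.1]]),
    np.array([[0.7, 0.3], [0.4, 0.6], [0.1, 0.9], [0.8, 0.2]]),
]

running_total = np.zeros((4, 2))
for proba in fold_preds:
    running_total += proba

# dividing by the number of fits turns the sum back into probabilities
averaged = running_total / len(fold_preds)
positive_class = averaged[:, 1]
print(positive_class)
```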

In [None]:
def without_valid_prediction(test):
    # keys in model_dict are the model name plus the fold index, e.g. 'catboost1'
    cat = models.model_dict['catboost1']
    cat.fit(X,y)
    cats_pred = cat.predict_proba(test)[:,1]
    final = np.array(cats_pred)
    sample['Response'] = final
    sample['Response'].value_counts(normalize=True)
    sample.to_csv('we.csv',index=False)
without_valid_prediction(test)

In [None]:
first = pd.read_csv('https://datahack-prod.s3.amazonaws.com/submissions/janatahack-cross-sell-prediction/896_729963_us_xgb_lgb_norm15_9_3dnzg6c.csv')
second = pd.read_csv('https://datahack-prod.s3.amazonaws.com/submissions/janatahack-cross-sell-prediction/896_729963_us_xgb_lgb_cat_norm_7_RFIJ2mJ.csv')
first['Response'] = (first['Response'] + second['Response'])/2
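
The average above is taken row by row, which relies on both files listing the ids in the same order. If that is not guaranteed, merging on `id` first is one option. The sketch below uses made-up frames, not the actual submissions.

```python
import pandas as pd

first_toy = pd.DataFrame({'id': [1, 2, 3], 'Response': [0.2, 0.6, 0.9]})
second_toy = pd.DataFrame({'id': [3, 1, 2], 'Response': [0.7, 0.4, 0.2]})   # different row order

# align the two prediction columns on id before averaging
merged = first_toy.merge(second_toy, on='id', suffixes=('_first', '_second'))
merged['Response'] = (merged['Response_first'] + merged['Response_second']) / 2
blend = merged[['id', 'Response']]
print(blend)
```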

In [None]:
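# `test` was overwritten with the model features above (no id, log-scaled premium),
# so reload the raw file to find the ids of the high-premium rows
raw_test = pd.read_csv('../input/janatahack-crosssell-prediction/test.csv')
outlier_ids = raw_test.loc[raw_test['Annual_Premium']>200000, 'id']
# for those rows, keep the predictions of the second submission only
first.loc[first['id'].isin(outlier_ids), 'Response'] = second.loc[second['id'].isin(outlier_ids), 'Response']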

In [None]:
first.to_csv('final.csv', index=False)

In [None]:
first.head()

In [None]:
second

In [None]:
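# switch to the stacked model and reset the accumulated test predictions
models.stacking()
models.y_pred_tot = np.zeros((len(test), 2))
models.evaluate(test, 'Response')

In [None]:
# average over the folds used in evaluate()
models.y_pred_tot=models.y_pred_tot/models.skf.get_n_splits()
sample['Response'] = models.y_pred_tot[:,1]
# sample['Response'].value_counts(normalize=True)
sample.to_csv('stacking.csv',index=False)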

In [None]:
stacking = pd.read_csv('stacking.csv')
xgb_final = pd.read_csv('https://datahack-prod.s3.amazonaws.com/submissions/janatahack-cross-sell-prediction/896_729963_us_xgb_lgb_norm_3.csv')
xgb_final['Response'] = (stacking['Response']+xgb_final['Response'])/2


In [None]:
xgb_final.to_csv('xgb_stacking.csv', index=False)