# Problem Statement
## Predicting Coupon Redemption

XYZ Credit Card company regularly helps its merchants understand their data better and make key business decisions accurately by providing machine learning and analytics consulting. ABC is an established brick-and-mortar retailer that frequently conducts marketing campaigns for its diverse product range. As a merchant of XYZ, ABC has asked XYZ to assist with its discount marketing process using the power of machine learning. Can you wear the AmExpert hat and help out ABC?  

 
Discount marketing and coupon usage are very widely used promotional techniques to attract new customers and to retain & reinforce loyalty of existing customers. The measurement of a consumer’s propensity towards coupon usage and the prediction of the redemption behaviour are crucial parameters in assessing the effectiveness of a marketing campaign.  

 
ABC’s promotions are shared across various channels including email, notifications, etc. A number of these campaigns include coupon discounts that are offered for a specific product or range of products. The retailer would like the ability to predict whether customers will redeem the coupons received across channels, which will enable the retailer’s marketing team to design coupon constructs accurately and develop more precise and targeted marketing strategies.  

 
The data available in this problem contains the following information, including the details of a sample of campaigns and coupons used in previous campaigns:

- User Demographic Details
- Campaign and coupon Details
- Product details
- Previous transactions 

Based on previous transaction and performance data from the last 18 campaigns, predict, for each coupon and customer combination in the next 10 campaigns (the test set), the probability that the customer will redeem the coupon.

## Dataset Description
Here is the schema for the different data tables available. The detailed data dictionary is provided next.

![alt text](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2019/09/Screenshot-2019-09-27-at-10.27.29-PM.png)

You are provided with the following files in train.zip:

**train.csv:** Train data containing the coupons offered to the given customers under the 18 campaigns

|Variable|	Definition|
|--------|------------|
|id|	Unique id for coupon customer impression|
|campaign_id|	Unique id for a discount campaign|
|coupon_id|	Unique id for a discount coupon|
|customer_id|	Unique id for a customer|
|redemption_status|	(target) (0 - Coupon not redeemed, 1 - Coupon redeemed)|


**campaign_data.csv:** Campaign information for each of the 28 campaigns

|Variable|	Definition|
|--------|------------|
|campaign_id|	Unique id for a discount campaign|
|campaign_type|	Anonymised Campaign Type (X/Y)|
|start_date|	Campaign Start Date|
|end_date|	Campaign End Date|  

**coupon_item_mapping.csv:** Mapping of coupon and items valid for discount under that coupon

|Variable|	Definition|
|--------|------------|
|coupon_id|	Unique id for a discount coupon (no order)|
|item_id|	Unique id for items for which given coupon is valid (no order)|    

**customer_demographics.csv:** Customer demographic information for some customers

|Variable|	Definition|
|--------|------------|
|customer_id|	Unique id for a customer|
|age_range|	Age range of customer family in years|
|marital_status|	Married/Single|
|rented|	0 - not rented accommodation, 1 - rented accommodation|
|family_size|	Number of family members|
|no_of_children|	Number of children in the family|
|income_bracket|	Label Encoded Income Bracket (Higher income corresponds to higher number)|  

**customer_transaction_data.csv:** Transaction data for all customers for duration of campaigns in the train data

|Variable|	Definition|
|--------|------------|
|date|	Date of Transaction|
|customer_id|	Unique id for a customer|
|item_id|	Unique id for item|
|quantity|	quantity of item bought|
|selling_price|	Sales value of the transaction|
|other_discount|Discount from other sources such as manufacturer coupon/loyalty card|
|coupon_discount|	Discount availed from retailer coupon|  

**item_data.csv:** Item information for each item sold by the retailer

|Variable|	Definition|
|--------|------------|
|item_id|	Unique id for item|
|brand|	Unique id for item brand|
|brand_type|	Brand Type (Local/Established)|
|category|	Item Category|  

**test.csv:** Contains the coupon customer combination for which redemption status is to be predicted 

|Variable|Definition|
|--------|----------|
|id|	Unique id for coupon customer impression|
|campaign_id|	Unique id for a discount campaign|
|coupon_id|	Unique id for a discount coupon|
|customer_id|	Unique id for a customer|  

**Campaign, coupon and customer data** for test set is also contained in train.zip

**sample_submission.csv:** This file contains the format in which you have to submit your predictions.  

<br>
<br>
To summarise the entire process:

- Customers receive coupons under various campaigns and may choose to redeem them.
- They can redeem the given coupon for any valid product for that coupon as per coupon item mapping within the duration between campaign start date and end date
- Next, the customer will redeem the coupon for an item at the retailer store and that will reflect in the transaction table in the column coupon_discount.
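
The three steps above can be sketched on toy data. This is only an illustration under stated assumptions: all values are made up, the frame names (`offers`, `hits`, `result`) and the `looks_redeemed` column are hypothetical, and because the transaction table does not record which coupon was used, the match on item, customer and date window is a heuristic rather than the organisers' labelling rule.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are invented for illustration).
offers = pd.DataFrame({"campaign_id": [1, 1], "coupon_id": [100, 200], "customer_id": [7, 7]})
campaign_toy = pd.DataFrame({
    "campaign_id": [1],
    "start_date": pd.to_datetime(["2013-01-01"]),
    "end_date": pd.to_datetime(["2013-01-31"]),
})
coupon_item_toy = pd.DataFrame({"coupon_id": [100, 100, 200], "item_id": [1, 2, 3]})
transactions_toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-10", "2013-02-05"]),
    "customer_id": [7, 7],
    "item_id": [2, 3],
    "coupon_discount": [-5.0, -3.0],  # raw discounts are stored as negative amounts
})

# Expand every offer to the items its coupon covers, then attach the
# customer's purchases of those items.
candidates = (
    offers.merge(campaign_toy, on="campaign_id")
    .merge(coupon_item_toy, on="coupon_id")
    .merge(transactions_toy, on=["customer_id", "item_id"])
)

# Keep purchases inside the campaign window that carry a coupon discount.
in_window = candidates["date"].between(candidates["start_date"], candidates["end_date"])
used_coupon = candidates["coupon_discount"] != 0
keys = ["campaign_id", "coupon_id", "customer_id"]
hits = candidates.loc[in_window & used_coupon, keys].drop_duplicates()
hits["looks_redeemed"] = 1

result = offers.merge(hits, on=keys, how="left")
result["looks_redeemed"] = result["looks_redeemed"].fillna(0).astype(int)
print(result)
```

Coupon 100 is flagged because item 2 was bought with a discount inside the window; coupon 200 is not, because the only matching purchase falls after the campaign ended.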
 
## Evaluation Metric
Submissions are evaluated on **area under the ROC curve** between the predicted probability and the observed target.
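
As a minimal sketch of what this metric measures, the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The helper `roc_auc` below is a hypothetical pure-Python illustration on made-up scores; in practice `sklearn.metrics.roc_auc_score` (imported later in this notebook) would be used.

```python
def roc_auc(y_true, y_score):
    """Pairwise definition of AUC: share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrongly ordered pair 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Because the metric depends only on the ranking of the scores, it remains informative when redemptions are rare.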
 

## Public and Private Split
- Test data is further randomly divided into Public (40%) and Private data (60%)
- Your initial responses will be checked and scored on the Public data.
- The final rankings would be based on your private score which will be published once the competition is over.





<div class="alert alert-block alert-success">
<b>Results:</b> With feature engineering and same model, xxxxxxx.
</div>

In [0]:
!pip install -q catboost

In [0]:
import numpy as np
import pandas as pd
import sklearn
# No warnings about setting value on copy of slice
pd.options.mode.chained_assignment = None

# Display up to 2000 columns of a dataframe
pd.set_option('display.max_columns', 2000)

# Matplotlib visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Set default font size
# plt.rcParams['font.size'] = 24

# Seaborn for visualization
import seaborn as sns
#sns.set(font_scale = 2)
#sns.set(style='white', context='notebook', palette='deep')

# Splitting data into training and testing
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, StratifiedShuffleSplit

# XGBoost
import xgboost as xgb

# light GBM
import lightgbm as lgb

# catboost
import catboost as cbst

# Logistic Regression
from sklearn.linear_model import LogisticRegression

# SVM
from sklearn import svm 

# KNN
from sklearn.neighbors import KNeighborsClassifier

# Feature Selection
from sklearn.feature_selection import RFE 

# Preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
# Feature Scaling
from sklearn.preprocessing import StandardScaler

# Metrics
from sklearn.metrics import classification_report, matthews_corrcoef, f1_score
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score, roc_auc_score
from sklearn import metrics

# date-time manipulation
from datetime import datetime
import datetime as dt

import gc

#import itertools
import imblearn
import matplotlib
import platform

# SMOTE / Imbalance dataset handling
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced

from imblearn.over_sampling import RandomOverSampler, SMOTENC, BorderlineSMOTE
from imblearn.under_sampling import ClusterCentroids, RandomUnderSampler
from imblearn.combine import SMOTEENN, SMOTETomek

from statsmodels.stats.outliers_influence import variance_inflation_factor



In [0]:
# Get all version information
print('The python version is {}.'.format(platform.python_version()))
print('The numpy version is {}.'.format(np.__version__))
print('The pandas version is {}.'.format(pd.__version__))
print('The scikit-learn version is {}.'.format(sklearn.__version__))
print('The imblearn version is {}.'.format(imblearn.__version__))
print('The matplotlib version is {}.'.format(matplotlib.__version__))
print('The seaborn version is {}.'.format(sns.__version__))
print('The xgboost version is {}.'.format(xgb.__version__))

The python version is 3.6.8.
The numpy version is 1.16.5.
The pandas version is 0.24.2.
The scikit-learn version is 0.21.3.
The imblearn version is 0.4.3.
The matplotlib version is 3.0.3.
The seaborn version is 0.9.0.
The xgboost version is 0.90.


In [0]:
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [0]:
# Path of Dataset

DATASET_AND_OUTPUT_PATH = "/content/gdrive/My Drive/ColabNotebooks/AmExpart-2019/data_set/"
TRAIN_DATASET_NAME = "train.csv"
TEST_DATASET_NAME = "test_QyjYwdj.csv"

CAMPAIGN_DATASET_NAME = "campaign_data.csv"
COUPON_ITEM_MAPPING_DATASET_NAME = "coupon_item_mapping.csv"

CUST_DEMO_DATASET_NAME = "customer_demographics.csv"
CUST_TRANS_DATASET_NAME = "customer_transaction_data.csv"
ITEM_DATASET_NAME = "item_data.csv"

SAMPLE_SUBMISSION_NAME = "sample_submission_Byiv0dS.csv"

In [0]:
train = pd.read_csv(DATASET_AND_OUTPUT_PATH + TRAIN_DATASET_NAME)
test = pd.read_csv(DATASET_AND_OUTPUT_PATH + TEST_DATASET_NAME)

campaign = pd.read_csv(DATASET_AND_OUTPUT_PATH + CAMPAIGN_DATASET_NAME)
coupon_item = pd.read_csv(DATASET_AND_OUTPUT_PATH + COUPON_ITEM_MAPPING_DATASET_NAME)

customer_demographic = pd.read_csv(DATASET_AND_OUTPUT_PATH + CUST_DEMO_DATASET_NAME)
customer_transaction =  pd.read_csv(DATASET_AND_OUTPUT_PATH + CUST_TRANS_DATASET_NAME)

item = pd.read_csv(DATASET_AND_OUTPUT_PATH + ITEM_DATASET_NAME)

sample_submission = pd.read_csv(DATASET_AND_OUTPUT_PATH + SAMPLE_SUBMISSION_NAME)

print("Shape of train set", train.shape,
      "\nShape of test set ", test.shape,
      "\nShape of campaign set ", campaign.shape,
      "\nShape of coupon item mapping set ", coupon_item.shape,
      "\nShape of customer demographics ", customer_demographic.shape,
      "\nShape of customer transaction ", customer_transaction.shape,
      "\nShape of item set ", item.shape,
      "\nShape of sample_submission set ", sample_submission.shape)

Shape of train set (78369, 5) 
Shape of test set  (50226, 4) 
Shape of campaign set  (28, 4) 
Shape of coupon item mapping set  (92663, 2) 
Shape of customer demographics  (760, 7) 
Shape of customer transaction  (1324566, 7) 
Shape of item set  (74066, 4) 
Shape of sample_submission set  (50226, 2)


In [0]:
train.describe(),test.describe()

(                  id   campaign_id     coupon_id   customer_id  \
 count   78369.000000  78369.000000  78369.000000  78369.000000   
 mean    64347.975449     13.974441    566.363243    787.451888   
 std     37126.440855      8.019215    329.966054    456.811339   
 min         1.000000      1.000000      1.000000      1.000000   
 25%     32260.000000      8.000000    280.000000    399.000000   
 50%     64318.000000     13.000000    597.000000    781.000000   
 75%     96577.000000     13.000000    857.000000   1190.000000   
 max    128595.000000     30.000000   1115.000000   1582.000000   
 
        redemption_status  
 count       78369.000000  
 mean            0.009302  
 std             0.095999  
 min             0.000000  
 25%             0.000000  
 50%             0.000000  
 75%             0.000000  
 max             1.000000  ,
                   id   campaign_id     coupon_id   customer_id
 count   50226.000000  50226.000000  50226.000000  50226.000000
 mean    64220

In [0]:
# No campaign Id is common between train and test data
pd.Series(list(set(train["campaign_id"]) & set(test["campaign_id"])))

Series([], dtype: float64)

In [0]:
# There are 81 common coupons between train and test data
pd.Series(list(set(train["coupon_id"]) & set(test["coupon_id"]))).count()

81

In [0]:
# There are 1096 common customers between train and test data
pd.Series(list(set(train["customer_id"]) & set(test["customer_id"]))).count()

1096

In [0]:
train.head(),test.head()

(   id  campaign_id  coupon_id  customer_id  redemption_status
 0   1           13         27         1053                  0
 1   2           13        116           48                  0
 2   6            9        635          205                  0
 3   7           13        644         1050                  0
 4   9            8       1017         1489                  0,
    id  campaign_id  coupon_id  customer_id
 0   3           22        869          967
 1   4           20        389         1566
 2   5           22        981          510
 3   8           25       1069          361
 4  10           17        498          811)

In [0]:
train.shape, test.shape

((78369, 5), (50226, 4))

In [0]:
train_dummy = train.copy()
test_dummy = test.copy()

In [0]:
df = train_dummy.append(test_dummy, ignore_index=True)
df.shape

... of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.

(128595, 5)
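
`DataFrame.append` was deprecated and later removed from pandas, so on a newer version the same stacking step can be written with `pd.concat`. The sketch below uses toy frames (`train_toy` and `test_toy` are hypothetical names with made-up rows) and passes `sort=False`, which keeps the column order of the first frame; the outputs recorded in this notebook show alphabetically sorted columns because the original call sorted them.

```python
import pandas as pd

train_toy = pd.DataFrame({
    "id": [1, 2],
    "campaign_id": [13, 13],
    "coupon_id": [27, 116],
    "customer_id": [1053, 48],
    "redemption_status": [0, 0],
})
test_toy = pd.DataFrame({
    "id": [3],
    "campaign_id": [22],
    "coupon_id": [869],
    "customer_id": [967],
})

# Stack train and test so every feature is engineered once for both.
combined = pd.concat([train_toy, test_toy], ignore_index=True, sort=False)

# Test rows have no target, so a missing redemption_status marks them.
is_test = combined["redemption_status"].isna()
train_part = combined[~is_test]
test_part = combined[is_test].drop(columns="redemption_status")
print(combined.shape, len(train_part), len(test_part))
```

Splitting on the missing target afterwards avoids carrying a separate indicator column through the merges.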

### Customer Demographics

In [0]:
# No duplicate row in customer demographics
customer_demographic.duplicated().value_counts()

False    760
dtype: int64

In [0]:
customer_demographic.isna().sum(), customer_demographic["age_range"].unique(),customer_demographic["marital_status"].unique(), customer_demographic["rented"].unique(), customer_demographic["family_size"].unique(),customer_demographic["no_of_children"].unique(), customer_demographic["income_bracket"].unique()

(customer_id         0
 age_range           0
 marital_status    329
 rented              0
 family_size         0
 no_of_children    538
 income_bracket      0
 dtype: int64,
 array(['70+', '46-55', '26-35', '36-45', '18-25', '56-70'], dtype=object),
 array(['Married', nan, 'Single'], dtype=object),
 array([0, 1]),
 array(['2', '3', '4', '1', '5+'], dtype=object),
 array([nan, '1', '2', '3+'], dtype=object),
 array([ 4,  5,  3,  6,  1,  7,  2,  8,  9, 12, 10, 11]))

In [0]:
customer_demographic.dtypes

customer_id        int64
age_range         object
marital_status    object
rented             int64
family_size       object
no_of_children    object
income_bracket     int64
dtype: object

In [0]:
customer_demographic["age_range"] = pd.Series(customer_demographic['age_range'].factorize()[0])
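
Note that `factorize` numbers the labels in order of first appearance, so the resulting codes do not follow the age order ('70+' becomes 0). The sketch below contrasts this with an ordered categorical encoding; the explicit `age_order` list is an assumption added for illustration and is not what the notebook uses.

```python
import pandas as pd

# Same order in which the unique age ranges appear in the data above.
ages = pd.Series(["70+", "46-55", "26-35", "36-45", "18-25", "56-70"])

# factorize: codes follow order of first appearance.
appearance_codes = ages.factorize()[0]

# Alternative (assumption): spell out the natural order so codes increase with age.
age_order = ["18-25", "26-35", "36-45", "46-55", "56-70", "70+"]
ordinal_codes = pd.Categorical(ages, categories=age_order, ordered=True).codes

print(list(appearance_codes))  # [0, 1, 2, 3, 4, 5]
print(list(ordinal_codes))     # [5, 3, 1, 2, 0, 4]
```

Tree-based models can usually cope with either encoding; an ordered one may matter more for linear or distance-based models.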

In [0]:
customer_demographic[customer_demographic["no_of_children"] == "3+"].head(5)

Unnamed: 0,customer_id,age_range,marital_status,rented,family_size,no_of_children,income_bracket
16,31,3,Single,0,5+,3+,2
17,33,1,Married,0,5+,3+,9
25,45,1,Married,0,5+,3+,1
28,52,3,Married,0,5+,3+,7
36,71,3,Married,0,5+,3+,4


In [0]:
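# Convert the open-ended buckets to numbers so both columns can be cast to float
# ('3+' children -> 3).
# NOTE: '5+' family members is mapped to 3 below, which looks like a typo for 5;
# the outputs recorded in this notebook were produced with the value 3.

customer_demographic['no_of_children'] = customer_demographic['no_of_children'].replace('3+', 3).astype(float)
customer_demographic['family_size'] = customer_demographic['family_size'].replace('5+', 3).astype(float)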

In [0]:
customer_demographic.head()

Unnamed: 0,customer_id,age_range,marital_status,rented,family_size,no_of_children,income_bracket
0,1,0,Married,0,2.0,,4
1,6,1,Married,0,2.0,,5
2,7,2,,0,3.0,1.0,3
3,8,2,,0,4.0,2.0,6
4,10,1,Single,0,1.0,,5


In [0]:
customer_demographic['marital_status'] = pd.Series(customer_demographic['marital_status'].factorize()[0])

In [0]:
customer_demographic.head()

Unnamed: 0,customer_id,age_range,marital_status,rented,family_size,no_of_children,income_bracket
0,1,0,0,0,2.0,,4
1,6,1,0,0,2.0,,5
2,7,2,-1,0,3.0,1.0,3
3,8,2,-1,0,4.0,2.0,6
4,10,1,1,0,1.0,,5


In [0]:
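def return_marital_status(x):
  # Codes after factorize: 0 = Married, 1 = Single, -1 = missing.
  # Missing values are inferred from the number of adults (family_size - no_of_children).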
  if (x["marital_status"] == -1) & (x["family_size"] - x["no_of_children"] == 2):
    x["marital_status"] = 0
  elif (x["marital_status"] == -1) & (x["family_size"] - x["no_of_children"] == 1):
    x["marital_status"] = 1
  elif (x["marital_status"] == -1) & (x["family_size"] == 1) & (np.isnan(x["no_of_children"])):
    x["marital_status"] = 1
  elif (x["marital_status"] == -1) & (x["family_size"] >= 2) & (np.isnan(x["no_of_children"])):
    x["marital_status"] = 0  
  
  if (x["family_size"] >= 2) & (x["marital_status"] == -1) & (x["family_size"] == x["no_of_children"]):
    x["marital_status"] = 0 # 0 Married 1 Single
    
  return x["marital_status"]

In [0]:
customer_demographic["marital_status"] = customer_demographic.apply(return_marital_status, axis="columns")

In [0]:
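def return_no_of_children(x):
  # Fill missing no_of_children: a single adult leaves family_size - 1 children,
  # a married couple leaves family_size - 2.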
  if (x["marital_status"] == 1) & (x["family_size"] >= 1) & (np.isnan(x["no_of_children"])):
    x["no_of_children"] = x["family_size"] - 1
  elif (x["marital_status"] ==  0) & (x["family_size"] >= 2) & (np.isnan(x["no_of_children"])):
    x["no_of_children"] = x["family_size"] - 2 
  
  return x["no_of_children"]

In [0]:
customer_demographic["no_of_children"] = customer_demographic.apply(return_no_of_children, axis="columns")

In [0]:
customer_demographic.head()

Unnamed: 0,customer_id,age_range,marital_status,rented,family_size,no_of_children,income_bracket
0,1,0,0.0,0,2.0,0.0,4
1,6,1,0.0,0,2.0,0.0,5
2,7,2,0.0,0,3.0,1.0,3
3,8,2,0.0,0,4.0,2.0,6
4,10,1,1.0,0,1.0,0.0,5


In [0]:
customer_demographic.isna().sum()

customer_id       0
age_range         0
marital_status    0
rented            0
family_size       0
no_of_children    0
income_bracket    0
dtype: int64
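
The two row-wise `apply` functions above can also be expressed with vectorised rules. The following is a sketch on made-up rows, assuming the same codes (0 = Married, 1 = Single, -1 = missing) and the same rule order; the frame `demo` and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "marital_status": [-1, -1, -1, -1, 1, -1],
    "family_size": [3.0, 2.0, 1.0, 4.0, 2.0, 3.0],
    "no_of_children": [1.0, 1.0, np.nan, np.nan, np.nan, 3.0],
})

missing = demo["marital_status"] == -1
size = demo["family_size"]
adults = size - demo["no_of_children"]   # NaN when no_of_children is unknown
kids_unknown = demo["no_of_children"].isna()

# np.select takes the first matching rule, mirroring the if/elif chain.
demo["marital_status"] = np.select(
    [
        missing & (adults == 2),                  # two adults -> married
        missing & (adults == 1),                  # one adult -> single
        missing & kids_unknown & (size == 1),     # lives alone -> single
        missing & kids_unknown & (size >= 2),     # assume a couple -> married
        missing & (size >= 2) & (adults == 0),    # fallback rule -> married
    ],
    [0, 1, 1, 0, 0],
    default=demo["marital_status"],
)

single = demo["marital_status"] == 1
married = demo["marital_status"] == 0
demo["no_of_children"] = np.select(
    [kids_unknown & single & (size >= 1), kids_unknown & married & (size >= 2)],
    [size - 1, size - 2],
    default=demo["no_of_children"],
)
print(demo)
```

On a table of 760 customers the speed difference is negligible; the main benefit is that the rules are listed in one place.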

In [0]:
customer_demographic.nunique()

customer_id       760
age_range           6
marital_status      2
rented              2
family_size         4
no_of_children      4
income_bracket     12
dtype: int64

In [0]:
# customer_demo_agg_col={
#      'age_range':['mean'],'marital_status':['count'], 'rented':['mean'], "family_size":['mean'], "no_of_children":['count','mean'], "income_bracket":['sum']
# }

In [0]:
# customer_demographic["customer_id"].duplicated().value_counts()

In [0]:
# customer_demographic_int = customer_demographic.groupby("customer_id").agg(customer_demo_agg_col)
# customer_demographic_int.head(10)

In [0]:
# customer_demographic_int.columns=['cdi_' + '_'.join(col).strip() for col in customer_demographic_int.columns.values]
# customer_demographic_int.reset_index(inplace=True)

In [0]:
# customer_demographic_int.head(10)

In [0]:
# df = df.merge(customer_demographic_int, on="customer_id", how='left')
df = df.merge(customer_demographic, on="customer_id", how='left')
df.shape

(128595, 11)

In [0]:
df.head()

Unnamed: 0,campaign_id,coupon_id,customer_id,id,redemption_status,age_range,marital_status,rented,family_size,no_of_children,income_bracket
0,13,27,1053,1,0.0,1.0,1.0,0.0,1.0,0.0,5.0
1,13,116,48,2,0.0,3.0,0.0,0.0,2.0,0.0,3.0
2,9,635,205,6,0.0,1.0,0.0,0.0,2.0,0.0,7.0
3,13,644,1050,7,0.0,,,,,,
4,8,1017,1489,9,0.0,1.0,0.0,0.0,2.0,0.0,3.0


In [0]:
# df.head()

In [0]:
# del customer_demographic
# import gc
# gc.collect()

### Review/Merge Campaign Table

In [0]:
campaign.head(),campaign.isna().sum()

(   campaign_id campaign_type start_date  end_date
 0           24             Y   21/10/13  20/12/13
 1           25             Y   21/10/13  22/11/13
 2           20             Y   07/09/13  16/11/13
 3           23             Y   08/10/13  15/11/13
 4           21             Y   16/09/13  18/10/13, campaign_id      0
 campaign_type    0
 start_date       0
 end_date         0
 dtype: int64)

In [0]:
campaign["start_date"] = pd.to_datetime(campaign["start_date"], dayfirst=True)
campaign["end_date"] = pd.to_datetime(campaign["end_date"], dayfirst=True)

In [0]:
campaign["campaign_type"] = pd.Series(campaign['campaign_type'].factorize()[0])

In [0]:
df = df.merge(campaign, on="campaign_id", how='left')
df.shape

(128595, 14)

In [0]:
# Create a new feature: total number of campaign days.
df["campaign_days"] = (df["end_date"]-df["start_date"]).dt.days

In [0]:
# Create new features based on start and end date

date_cols = ["start_date", "end_date"]

for date_col in date_cols:
  #df[date_col + "_in_seconds"] = (df[date_col] - dt.datetime(1970,1,1)).dt.total_seconds()
  df[date_col + "_in_days"] = (df[date_col] - dt.datetime(1970,1,1)).dt.days
    
#     train_df[date_col + "_month"] = train_df[date_col].dt.month
#     train_df[date_col + "_year"] = train_df[date_col].dt.year
#     train_df[date_col + "_week"] = train_df[date_col].dt.year
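
To make the date features concrete, here is a small worked example for campaign 24 from the table above (21/10/13 to 20/12/13). It is a sketch: the frame `camp` is a hypothetical one-row copy, and the explicit `format="%d/%m/%y"` is an assumption used in place of `dayfirst=True` so that parsing does not depend on inference.

```python
import pandas as pd

camp = pd.DataFrame({
    "campaign_id": [24],
    "start_date": ["21/10/13"],
    "end_date": ["20/12/13"],
})

# Parse day-first dates with an explicit format.
for col in ["start_date", "end_date"]:
    camp[col] = pd.to_datetime(camp[col], format="%d/%m/%y")

# Campaign length in days.
camp["campaign_days"] = (camp["end_date"] - camp["start_date"]).dt.days

# Days since the Unix epoch, giving a plain integer that models can use.
epoch = pd.Timestamp("1970-01-01")
for col in ["start_date", "end_date"]:
    camp[col + "_in_days"] = (camp[col] - epoch).dt.days

print(camp[["campaign_days", "start_date_in_days", "end_date_in_days"]])
```

The campaign lasts 60 days, and its start and end fall 15999 and 16059 days after 1970-01-01.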

### Review / Merge coupon item mapping and item data

In [0]:
# Any duplicate entry in item data -> No
item.duplicated().value_counts()

False    74066
dtype: int64

In [0]:
# Any value missing in item data  -> No
item.isna().sum()

item_id       0
brand         0
brand_type    0
category      0
dtype: int64

In [0]:
# Total unique values in each feature of item data

for k in item.columns:
    print(k,item[k].nunique())

item_id 74066
brand 5528
brand_type 2
category 19


In [0]:
item.head(5)

Unnamed: 0,item_id,brand,brand_type,category
0,1,1,Established,Grocery
1,2,1,Established,Miscellaneous
2,3,56,Local,Bakery
3,4,56,Local,Grocery
4,5,56,Local,Grocery


In [0]:
# Any duplicate entry in coupon item mapping data : No
coupon_item.duplicated().value_counts()

False    92663
dtype: int64

In [0]:
# Any value missing in coupon item mapping data  -> No

coupon_item.isna().sum()

coupon_id    0
item_id      0
dtype: int64

In [0]:
# Neither coupon_id nor item_id is unique in this table
coupon_item.describe()

Unnamed: 0,coupon_id,item_id
count,92663.0,92663.0
mean,155.967387,36508.613071
std,282.99172,21131.312716
min,1.0,1.0
25%,22.0,18255.5
50%,30.0,37955.0
75%,42.0,54191.5
max,1116.0,74061.0


In [0]:
for k in coupon_item.columns:
    print(k,coupon_item[k].nunique())

coupon_id 1116
item_id 36289


In [0]:
coupon_item_with_item_data = coupon_item.merge(item, on="item_id", how='left')

In [0]:
coupon_item_with_item_data.isna().sum()

coupon_id     0
item_id       0
brand         0
brand_type    0
category      0
dtype: int64

In [0]:
coupon_item_with_item_data.head()

Unnamed: 0,coupon_id,item_id,brand,brand_type,category
0,105,37,56,Local,Grocery
1,107,75,56,Local,Grocery
2,494,76,209,Established,Grocery
3,522,77,278,Established,Grocery
4,518,77,278,Established,Grocery


In [0]:
coupon_item_with_item_data['brand_type'] = pd.Series(coupon_item_with_item_data['brand_type'].factorize()[0]).replace(-1, np.nan)
coupon_item_with_item_data['category'] = pd.Series(coupon_item_with_item_data['category'].factorize()[0]).replace(-1, np.nan)
coupon_item_with_item_data['brand'] = pd.Series(coupon_item_with_item_data['brand'].factorize()[0]).replace(-1, np.nan)

In [0]:
coupon_item_with_item_data.head()

Unnamed: 0,coupon_id,item_id,brand,brand_type,category
0,105,37,0,0,0
1,107,75,0,0,0
2,494,76,1,1,0
3,522,77,2,1,0
4,518,77,2,1,0


In [0]:
coupon_item_with_item_data.columns, coupon_item_with_item_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 92663 entries, 0 to 92662
Data columns (total 5 columns):
coupon_id     92663 non-null int64
item_id       92663 non-null int64
brand         92663 non-null int64
brand_type    92663 non-null int64
category      92663 non-null int64
dtypes: int64(5)
memory usage: 4.2 MB


(Index(['coupon_id', 'item_id', 'brand', 'brand_type', 'category'], dtype='object'),
 None)

In [0]:

coupon_item_with_item_data_agg_col={
     'item_id':['nunique'], 'brand':['nunique','mean'], 'brand_type':['nunique','mean'], 'category':['nunique','mean']
    
}



coupon_item_with_item_data_agg_col

{'brand': ['nunique', 'mean'],
 'brand_type': ['nunique', 'mean'],
 'category': ['nunique', 'mean'],
 'item_id': ['nunique']}

In [0]:
coupon_item_with_item_data_int = coupon_item_with_item_data.groupby("coupon_id").agg(coupon_item_with_item_data_agg_col)
coupon_item_with_item_data_int.head(15)

coupon_id,item_id nunique,brand nunique,brand mean,brand_type nunique,brand_type mean,category nunique,category mean
1,39,3,740.564103,1,1.0,2,5.384615
2,2,1,1667.0,1,1.0,1,0.0
3,17,2,22.882353,1,1.0,1,0.0
4,24,1,92.0,1,1.0,1,0.0
5,7,1,213.0,1,1.0,1,3.0
6,3713,170,319.50983,2,0.828171,5,0.887692
7,2942,371,1140.457172,2,0.935758,14,8.675391
8,3718,955,682.296934,2,0.90936,6,7.733997
9,1535,59,157.792182,2,0.603257,5,0.020847
10,11,3,417.454545,1,1.0,2,1.090909


In [0]:
coupon_item_with_item_data_int.columns=['ciidi_' + '_'.join(col).strip() for col in coupon_item_with_item_data_int.columns.values]
coupon_item_with_item_data_int.reset_index(inplace=True)

In [0]:
coupon_item_with_item_data_int.head(10)

Unnamed: 0,coupon_id,ciidi_item_id_nunique,ciidi_brand_nunique,ciidi_brand_mean,ciidi_brand_type_nunique,ciidi_brand_type_mean,ciidi_category_nunique,ciidi_category_mean
0,1,39,3,740.564103,1,1.0,2,5.384615
1,2,2,1,1667.0,1,1.0,1,0.0
2,3,17,2,22.882353,1,1.0,1,0.0
3,4,24,1,92.0,1,1.0,1,0.0
4,5,7,1,213.0,1,1.0,1,3.0
5,6,3713,170,319.50983,2,0.828171,5,0.887692
6,7,2942,371,1140.457172,2,0.935758,14,8.675391
7,8,3718,955,682.296934,2,0.90936,6,7.733997
8,9,1535,59,157.792182,2,0.603257,5,0.020847
9,10,11,3,417.454545,1,1.0,2,1.090909


In [0]:
del coupon_item_with_item_data

gc.collect()

62

In [0]:
df.shape

(128595, 17)

In [0]:
df = df.merge(coupon_item_with_item_data_int, on="coupon_id", how='left')
df.shape

(128595, 24)
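
The aggregate, flatten and merge pattern used in this section can be seen end to end on a toy mapping. This is a sketch with made-up ids; the `ciidi_` prefix follows the notebook, while `mapping`, `offers_toy` and `merged` are hypothetical names.

```python
import pandas as pd

mapping = pd.DataFrame({
    "coupon_id": [1, 1, 2],
    "item_id": [10, 11, 10],
    "brand": [5, 5, 7],
})

# One row per coupon; the dict maps each column to its list of aggregations.
agg_spec = {"item_id": ["nunique"], "brand": ["nunique", "mean"]}
per_coupon = mapping.groupby("coupon_id").agg(agg_spec)

# The result has two-level columns such as ("brand", "mean"); join them into flat names.
per_coupon.columns = ["ciidi_" + "_".join(col) for col in per_coupon.columns]
per_coupon = per_coupon.reset_index()

# A left merge keeps every offer; coupons missing from the mapping get NaN features.
offers_toy = pd.DataFrame({"id": [1, 2, 3], "coupon_id": [2, 1, 3]})
merged = offers_toy.merge(per_coupon, on="coupon_id", how="left")
print(merged)
```

Comparing the row count before and after the merge, as the notebook does with `df.shape`, is a quick way to confirm that the aggregated table really has one row per key.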

### Review/Merge Customer Transaction Data

In [0]:
# Any duplicate entry in customer_transaction data -> Yes
customer_transaction.duplicated().value_counts()

False    1321650
True        2916
dtype: int64

In [0]:
# Remove all duplicate rows
customer_transaction.drop_duplicates(inplace = True)
customer_transaction.duplicated().value_counts()

False    1321650
dtype: int64

In [0]:
# Check customer transaction min and max date
customer_transaction["date"] = pd.to_datetime(customer_transaction["date"])
customer_transaction["date"].min(),customer_transaction["date"].max()

(Timestamp('2012-01-02 00:00:00'), Timestamp('2013-07-03 00:00:00'))

In [0]:
# Train and test share no campaign ids. Test campaign ids lie between 16 and 25; find the corresponding date range.
test_campaigns = campaign[campaign["campaign_id"].between(16, 25)]
test_campaigns["start_date"].min(), test_campaigns["end_date"].max()

(Timestamp('2013-07-15 00:00:00'), Timestamp('2013-12-20 00:00:00'))

- Customer transaction data covers 2012-01-02 to 2013-07-03.  

- Test campaigns run from 2013-07-15 to 2013-12-20.  

- The customer transaction data therefore contains no transactions from the test period.
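
The same conclusion can be checked programmatically. The helper below is a hypothetical sketch that returns the number of days between the last transaction and the first campaign start; the toy frames reuse the boundary dates printed above.

```python
import pandas as pd


def days_between_history_and_campaigns(transactions, campaigns):
    """Days from the last transaction to the first campaign start (positive means no overlap)."""
    last_transaction = transactions["date"].max()
    first_start = campaigns["start_date"].min()
    return (first_start - last_transaction).days


toy_transactions = pd.DataFrame({"date": pd.to_datetime(["2012-01-02", "2013-07-03"])})
toy_test_campaigns = pd.DataFrame({"start_date": pd.to_datetime(["2013-07-15", "2013-10-21"])})

gap = days_between_history_and_campaigns(toy_transactions, toy_test_campaigns)
print(gap)  # 12 days between the last transaction and the first test campaign
```

A positive gap means that features built from the transaction history describe behaviour before the test campaigns began.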

In [0]:
# Attach item attributes (brand, brand_type, category) to each transaction
customer_transaction = customer_transaction.merge(item, on="item_id", how='left')

In [0]:
customer_transaction['brand_type'] = pd.Series(customer_transaction['brand_type'].factorize()[0]).replace(-1, np.nan)
customer_transaction['category'] = pd.Series(customer_transaction['category'].factorize()[0]).replace(-1, np.nan)
customer_transaction['brand'] = pd.Series(customer_transaction['brand'].factorize()[0]).replace(-1, np.nan)

In [0]:
# Flag each transaction by whether its item is covered by any coupon in the coupon item mapping.
item_has_coupon = customer_transaction["item_id"].isin(coupon_item["item_id"].unique())

# 1 if the item is not valid for any coupon
customer_transaction["trans_with_no_coupon_matching"] = (~item_has_coupon).astype("int64")

# 1 if the item is valid for at least one coupon
customer_transaction["trans_with_Valid_coupon_matching"] = item_has_coupon.astype("int64")

In [0]:
customer_transaction["item_price"] = customer_transaction["selling_price"]/customer_transaction["quantity"] 

In [0]:
# Remove all transactions that do not belong to any campaign, i.e. whose item_id does not appear in the coupon item mapping

# customer_transaction.drop(customer_transaction[~customer_transaction["item_id"].isin(list(coupon_item["item_id"].unique()))].index, inplace=True)

In [0]:
customer_transaction.shape

(1321650, 13)

In [0]:
customer_transaction.head()

Unnamed: 0,date,customer_id,item_id,quantity,selling_price,other_discount,coupon_discount,brand,brand_type,category,trans_with_no_coupon_matching,trans_with_Valid_coupon_matching,item_price
0,2012-01-02,1501,26830,1,35.26,-10.69,0.0,0,0,0,0,1,35.26
1,2012-01-02,1501,54253,1,53.43,-13.89,0.0,0,0,0,0,1,53.43
2,2012-01-02,1501,31962,1,106.5,-14.25,0.0,1,1,1,0,1,106.5
3,2012-01-02,1501,33647,1,67.32,0.0,0.0,2,1,1,1,0,67.32
4,2012-01-02,1501,48199,1,71.24,-28.14,0.0,1,1,1,0,1,71.24


In [0]:
customer_transaction.dtypes

date                                datetime64[ns]
customer_id                                  int64
item_id                                      int64
quantity                                     int64
selling_price                              float64
other_discount                             float64
coupon_discount                            float64
brand                                        int64
brand_type                                   int64
category                                     int64
trans_with_no_coupon_matching                int64
trans_with_Valid_coupon_matching             int64
item_price                                 float64
dtype: object

In [0]:
customer_transaction["purchase_date_in_days"] = (customer_transaction["date"] - dt.datetime(1970,1,1)).dt.days

In [0]:
# Flip the sign of the discounts (stored as negative values in the raw data, e.g. -10.69)

customer_transaction["other_discount"] = customer_transaction["other_discount"].apply(lambda x: 0 if x == 0 else  -x)
customer_transaction["coupon_discount"] = customer_transaction["coupon_discount"].apply(lambda x: 0 if x == 0 else  -x)

In [0]:
# Express coupon and other discounts as a percentage of the pre-discount price (selling_price + discount)

customer_transaction["coupon_discount_percentage"] = (customer_transaction["coupon_discount"] * 100) / (customer_transaction["selling_price"] + customer_transaction["coupon_discount"])
customer_transaction["other_discount_percentage"] = (customer_transaction["other_discount"] * 100) / (customer_transaction["selling_price"] + customer_transaction["other_discount"])
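
As a worked example of this formula, the first transaction in the table has a selling price of 35.26 and an other discount of 10.69, so the discount is 10.69 out of a pre-discount price of 45.95. The function below is a hypothetical scalar version; the guard for a zero pre-discount price is an added assumption and is not part of the notebook's code.

```python
def discount_percentage(selling_price, discount):
    """Discount as a percentage of the price before the discount was applied."""
    pre_discount_price = selling_price + discount
    if pre_discount_price == 0:
        # Assumption: treat a zero pre-discount price as "no discount" to avoid dividing by zero.
        return 0.0
    return 100.0 * discount / pre_discount_price


pct = discount_percentage(35.26, 10.69)
print(round(pct, 6))  # 23.264418, matching other_discount_percentage for that row
```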

In [0]:
customer_transaction["dt_day"] = customer_transaction["date"].dt.day
customer_transaction["dt_week_day"] = customer_transaction["date"].dt.weekday

In [0]:
customer_transaction.head()

Unnamed: 0,date,customer_id,item_id,quantity,selling_price,other_discount,coupon_discount,brand,brand_type,category,trans_with_no_coupon_matching,trans_with_Valid_coupon_matching,item_price,purchase_date_in_days,coupon_discount_percentage,other_discount_percentage,dt_day,dt_week_day
0,2012-01-02,1501,26830,1,35.26,10.69,0.0,0,0,0,0,1,35.26,15341,0.0,23.264418,2,0
1,2012-01-02,1501,54253,1,53.43,13.89,0.0,0,0,0,0,1,53.43,15341,0.0,20.632799,2,0
2,2012-01-02,1501,31962,1,106.5,14.25,0.0,1,1,1,0,1,106.5,15341,0.0,11.801242,2,0
3,2012-01-02,1501,33647,1,67.32,0.0,0.0,2,1,1,1,0,67.32,15341,0.0,0.0,2,0
4,2012-01-02,1501,48199,1,71.24,28.14,0.0,1,1,1,0,1,71.24,15341,0.0,28.315556,2,0


In [0]:
# Any value missing in customer_transaction data  -> No
customer_transaction.isna().sum()

date                                0
customer_id                         0
item_id                             0
quantity                            0
selling_price                       0
other_discount                      0
coupon_discount                     0
brand                               0
brand_type                          0
category                            0
trans_with_no_coupon_matching       0
trans_with_Valid_coupon_matching    0
item_price                          0
purchase_date_in_days               0
coupon_discount_percentage          0
other_discount_percentage           0
dt_day                              0
dt_week_day                         0
dtype: int64

In [0]:
for k in customer_transaction.columns:
    print(k,customer_transaction[k].nunique())

date 549
customer_id 1582
item_id 74063
quantity 9252
selling_price 4923
other_discount 1418
coupon_discount 232
brand 5528
brand_type 2
category 19
trans_with_no_coupon_matching 2
trans_with_Valid_coupon_matching 2
item_price 17452
purchase_date_in_days 549
coupon_discount_percentage 2859
other_discount_percentage 27991
dt_day 31
dt_week_day 7


In [0]:
cat_agg = ["count","nunique"]
num_agg = ["min", "mean","max","sum"]
date_agg = ["count", "nunique", "max" , "min"]

# Aggregations computed per customer for each transaction column
customer_transaction_agg_col = {
    "item_id": ["nunique"], "quantity": ["count", "nunique", "mean"], "dt_day": ["mean"], "dt_week_day": ["mean"], "item_price": ["mean", "std", "sum"],
    "selling_price": ["mean"], "other_discount": ["mean"], "brand": ["nunique", "mean"], "brand_type": ["nunique", "mean"], "category": ["nunique", "mean"],
    "coupon_discount": ["mean"], "purchase_date_in_days": ["mean"], "coupon_discount_percentage": ["min", "mean", "max"], "other_discount_percentage": ["min", "mean", "max"],
    "trans_with_no_coupon_matching": ["mean", "sum"], "trans_with_Valid_coupon_matching": ["mean", "sum"],
}

customer_transaction_agg_col

{'brand': ['nunique', 'mean'],
 'brand_type': ['nunique', 'mean'],
 'category': ['nunique', 'mean'],
 'coupon_discount': ['mean'],
 'coupon_discount_percentage': ['min', 'mean', 'max'],
 'dt_day': ['mean'],
 'dt_week_day': ['mean'],
 'item_id': ['nunique'],
 'item_price': ['mean', 'std', 'sum'],
 'other_discount': ['mean'],
 'other_discount_percentage': ['min', 'mean', 'max'],
 'purchase_date_in_days': ['mean'],
 'quantity': ['count', 'nunique', 'mean'],
 'selling_price': ['mean'],
 'trans_with_Valid_coupon_matching': ['mean', 'sum'],
 'trans_with_no_coupon_matching': ['mean', 'sum']}

In [0]:
customer_transaction_int = customer_transaction.groupby("customer_id").agg(customer_transaction_agg_col)
#customer_transaction_int.head(10)

In [0]:
customer_transaction_int.columns=['cti_' + '_'.join(col).strip() for col in customer_transaction_int.columns.values]
customer_transaction_int.reset_index(inplace=True)
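Grouping with a dictionary of aggregations yields a two-level column index of `(column, statistic)` pairs, which the join above flattens into names such as `cti_quantity_mean`. A minimal sketch on made-up data (the `toy` frame and its values are assumptions, not competition data):

```python
import pandas as pd

# Toy transactions (made-up values, not from the competition data)
toy = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "quantity": [1, 3, 2],
    "selling_price": [10.0, 30.0, 8.0],
})

agg = toy.groupby("customer_id").agg(
    {"quantity": ["count", "mean"], "selling_price": ["mean"]}
)

# agg.columns is a MultiIndex of (column, statistic) tuples;
# join each tuple with "_" and add a prefix to get flat feature names
agg.columns = ["cti_" + "_".join(col) for col in agg.columns.values]
agg = agg.reset_index()  # turn customer_id back into a regular column

print(list(agg.columns))
```

The prefix keeps these customer-level features distinguishable from features built from other tables once everything is merged.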

In [0]:
customer_transaction_int.head(10)

Unnamed: 0,customer_id,cti_item_id_nunique,cti_quantity_count,cti_quantity_nunique,cti_quantity_mean,cti_dt_day_mean,cti_dt_week_day_mean,cti_item_price_mean,cti_item_price_std,cti_item_price_sum,cti_selling_price_mean,cti_other_discount_mean,cti_brand_nunique,cti_brand_mean,cti_brand_type_nunique,cti_brand_type_mean,cti_category_nunique,cti_category_mean,cti_coupon_discount_mean,cti_purchase_date_in_days_mean,cti_coupon_discount_percentage_min,cti_coupon_discount_percentage_mean,cti_coupon_discount_percentage_max,cti_other_discount_percentage_min,cti_other_discount_percentage_mean,cti_other_discount_percentage_max,cti_trans_with_no_coupon_matching_mean,cti_trans_with_no_coupon_matching_sum,cti_trans_with_Valid_coupon_matching_mean,cti_trans_with_Valid_coupon_matching_sum
0,1,463,1046,5,1.170172,14.728489,2.232314,84.577165,44.131219,88467.714167,93.848537,16.159207,163,229.816444,2,0.913958,9,1.85086,1.955631,15644.305927,0.0,1.429817,50.0,0.0,11.618699,59.918983,0.517208,541,0.482792,505
1,2,352,419,5,1.131265,15.298329,3.260143,94.373588,65.278592,39542.533333,102.864033,16.83043,153,274.181384,2,0.78043,9,1.816229,0.595084,15645.009547,0.0,0.275801,50.0,0.0,11.305725,68.652645,0.360382,151,0.639618,268
2,3,406,705,12,11.578723,15.670922,2.631206,71.007598,61.988335,50060.356693,103.617404,22.714227,114,155.634043,2,0.841135,8,1.384397,3.091546,15616.889362,0.0,1.962749,67.112577,0.0,14.570377,69.700837,0.441135,311,0.558865,394
3,4,125,220,5,1.272727,15.418182,3.981818,129.373114,189.919812,28462.085,154.423727,13.305409,72,232.890909,2,0.872727,8,1.631818,0.404773,15573.309091,0.0,0.075758,16.666667,0.0,7.627963,50.166659,0.413636,91,0.586364,129
4,5,490,792,17,117.869949,16.324495,2.675505,104.222145,95.464683,82543.939144,130.827146,13.657917,168,380.993687,2,0.775253,14,2.07197,0.114684,15642.174242,0.0,0.08984,50.0,0.0,8.724264,59.816678,0.369949,293,0.630051,499
5,6,429,583,4,1.212693,18.12693,2.277873,91.470758,79.93713,53327.451667,102.072419,12.001938,142,266.300172,2,0.660377,9,2.169811,0.702607,15678.077187,0.0,0.413127,37.451372,0.0,9.34628,63.025012,0.353345,206,0.646655,377
6,7,780,1052,11,54.906844,15.875475,2.148289,86.753607,78.319456,91264.79472,101.948118,17.73269,218,229.205323,2,0.806084,13,1.742395,0.695817,15654.069392,0.0,0.354028,50.0,0.0,13.336028,61.391719,0.378327,398,0.621673,654
7,8,719,1308,78,1020.214067,16.154434,2.607034,107.494347,130.352651,140602.605693,227.149847,23.478716,194,224.565749,2,0.737003,12,2.551223,4.694534,15654.925076,0.0,2.715901,55.55209,0.0,11.637692,95.240642,0.388379,508,0.611621,800
8,9,405,556,24,211.904676,15.131295,3.413669,73.609785,60.929356,40927.040716,102.95009,20.993129,117,170.793165,2,0.715827,11,1.928058,0.809766,15638.30036,0.0,0.365336,50.0,0.0,17.212905,69.604915,0.365108,203,0.634892,353
9,10,268,491,11,165.305499,14.749491,3.11609,84.688004,55.063871,41581.810165,103.788798,11.479552,111,475.657841,2,0.778004,10,2.093686,0.0,15596.843177,0.0,0.0,0.0,0.0,8.048009,74.984081,0.466395,229,0.533605,262


In [0]:
del customer_transaction
gc.collect()

77

In [0]:
df.shape

(128595, 24)

In [0]:
df = df.merge(customer_transaction_int, on="customer_id", how='left')
df.shape

(128595, 53)
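The row count is unchanged (128595) while the column count grows from 24 to 53, which is what a left merge against one row per customer should produce. A small sketch of how that expectation could be enforced, assuming toy frames with hypothetical values; `validate="many_to_one"` asks pandas to raise if the right-hand keys are not unique:

```python
import pandas as pd

# Toy frames (made-up values); the right frame has one row per customer
left = pd.DataFrame({"customer_id": [1, 1, 2, 3], "coupon_id": [10, 11, 10, 12]})
right = pd.DataFrame({"customer_id": [1, 2], "cti_quantity_mean": [1.5, 2.0]})

# A left merge keeps every left row; validate guards against duplicate right keys
merged = left.merge(right, on="customer_id", how="left", validate="many_to_one")

assert len(merged) == len(left)  # no rows added or dropped
# Customers with no aggregated transactions end up with NaN features
print(merged["cti_quantity_mean"].isna().sum())
```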

In [0]:
# Done till Here

### Model Building process

In [0]:
df.drop(columns=["start_date","end_date","campaign_id","customer_id","coupon_id"], inplace=True)
# df.drop(columns=["start_date","end_date"], axis="columns", inplace = True)

In [0]:
# df = pd.get_dummies(df,columns=["coupon_id", "customer_id"],drop_first=True)

In [0]:
df.shape

(128595, 48)

In [0]:
# Rows with a known target form the training set; rows without one form the test set
train_dummy = df[df["redemption_status"].notnull()].copy()
test_dummy = df[df["redemption_status"].isnull()].copy()

test_dummy.drop("redemption_status", axis="columns", inplace=True)

In [0]:
del df
gc.collect()

56

In [0]:
# train_dummy.fillna(0, inplace=True)
# test_dummy.fillna(0, inplace=True)

# Impute missing demographic values with the most frequent value (mode) of each column.
# Each frame is filled with its own mode, as in the original run.
# Assigning the result back avoids relying on inplace fillna on a column selection.
demographic_cols = ["age_range", "marital_status", "rented", "family_size", "no_of_children", "income_bracket"]

for col in demographic_cols:
    train_dummy[col] = train_dummy[col].fillna(train_dummy[col].mode()[0])
    test_dummy[col] = test_dummy[col].fillna(test_dummy[col].mode()[0])
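A common alternative to imputing each frame with its own mode is to learn the fill values on the training rows only and reuse them for the test rows, so both sets are encoded identically. This is not what the notebook does; the sketch below only illustrates the idea on made-up frames (`train_toy` and `test_toy` are hypothetical):

```python
import pandas as pd

# Toy frames (made-up values) with missing entries
train_toy = pd.DataFrame({"marital_status": ["Married", "Married", None, "Single"]})
test_toy = pd.DataFrame({"marital_status": [None, "Single", "Single"]})

# Learn one fill value per column from the training rows only
fill_values = {col: train_toy[col].mode()[0] for col in ["marital_status"]}

# Apply the same fill values to both frames
train_filled = train_toy.fillna(fill_values)
test_filled = test_toy.fillna(fill_values)

print(test_filled["marital_status"].tolist())
```

Here the missing test value is filled with the training mode ("Married") even though "Single" is more frequent in the test frame.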

In [0]:
test_dummy.shape, train_dummy.shape

((50226, 47), (78369, 48))

In [0]:
cat_cols = ['age_range','marital_status','rented','income_bracket','campaign_type']
#cat_cols = ['coupon_id', 'customer_id', 'age_range','marital_status','rented','income_bracket','campaign_type']

for col in cat_cols:
  train_dummy[col] = train_dummy[col].astype('category')
  test_dummy[col] = test_dummy[col].astype('category')

In [0]:
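# Take the target from train_dummy so the labels stay row-aligned with X
X, y = train_dummy.drop(["id","redemption_status"], axis="columns"), train_dummy["redemption_status"].reset_index(drop=True)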
Xtest = test_dummy.drop(["id"], axis="columns")

X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.25,random_state = 2019, stratify = y)

In [0]:
# print(X.shape,Xtest.shape)


categorical_features_indices = np.where(X_train.dtypes =='category')[0]
categorical_features_indices

array([0, 1, 2, 5, 6])

In [0]:
# model = cbst.CatBoostClassifier(n_estimators=2500,random_state=1994,learning_rate=0.03,eval_metric='AUC')

# model.fit(X_train,y_train,eval_set=[(X_val, y_val.values)], early_stopping_rounds=200,verbose=200,cat_features=categorical_features_indices)

# p = model.predict_proba(X_val)[:,-1]

# print(roc_auc_score(y_val,p))

In [0]:
errCB = []          # per-fold validation AUC (higher is better, despite the name)
y_pred_tot_cb = []  # per-fold predictions on the test set

fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=2019)

for train_index, test_index in fold.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]  # positional, consistent with X.iloc
    # (random_seed=123, eval_metric='AUC', n_estimators=1100,max_depth=7, learning_rate=0.03, colsample_bylevel=0.1, reg_lambda=70)
    model = cbst.CatBoostClassifier(n_estimators=3500, random_state=1994,eval_metric='AUC',learning_rate=0.03, reg_lambda=70) # 2500-5000 and .03 to .01 and reg_lambda
    model.fit(X_train,y_train,eval_set=[(X_test, y_test)], early_stopping_rounds=200,verbose=200,cat_features=categorical_features_indices)
    preds = model.predict_proba(X_test)[:,-1]

    # The held-out fold also drives early stopping, so this AUC is somewhat optimistic
    fold_auc = roc_auc_score(y_test, preds)
    print("err_cb: ", fold_auc)
    errCB.append(fold_auc)

    # Keep this fold's test-set probabilities for averaging later
    y_pred_tot_cb.append(model.predict_proba(Xtest)[:,-1])

0:	test: 0.6953987	best: 0.6953987 (0)	total: 156ms	remaining: 9m 4s
200:	test: 0.8926164	best: 0.8926817 (194)	total: 18.8s	remaining: 5m 8s
400:	test: 0.9008737	best: 0.9008737 (400)	total: 37.1s	remaining: 4m 46s
600:	test: 0.9052705	best: 0.9053411 (584)	total: 55.4s	remaining: 4m 27s
800:	test: 0.9075607	best: 0.9075607 (800)	total: 1m 13s	remaining: 4m 8s
1000:	test: 0.9095368	best: 0.9095368 (1000)	total: 1m 31s	remaining: 3m 49s
1200:	test: 0.9113665	best: 0.9114018 (1192)	total: 1m 50s	remaining: 3m 31s
1400:	test: 0.9129932	best: 0.9130303 (1367)	total: 2m 8s	remaining: 3m 12s
1600:	test: 0.9142883	best: 0.9145071 (1554)	total: 2m 27s	remaining: 2m 54s
1800:	test: 0.9155216	best: 0.9155851 (1772)	total: 2m 45s	remaining: 2m 36s
2000:	test: 0.9166543	best: 0.9166790 (1970)	total: 3m 4s	remaining: 2m 17s
2200:	test: 0.9179388	best: 0.9180711 (2180)	total: 3m 22s	remaining: 1m 59s
2400:	test: 0.9190927	best: 0.9191686 (2376)	total: 3m 40s	remaining: 1m 41s
2600:	test: 0.9194703	

In [0]:
np.mean(errCB,0)

0.9287377709006248

In [0]:
sample_submission["redemption_status"] = np.mean(y_pred_tot_cb, 0)
sample_submission.head()

Unnamed: 0,id,redemption_status
0,3,0.168929
1,4,0.004439
2,5,0.042379
3,8,0.002022
4,10,0.001
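Each of the ten folds contributes one vector of test-set probabilities, and `np.mean(..., 0)` averages them row by row across folds. A minimal sketch with made-up numbers (three hypothetical folds, three test rows):

```python
import numpy as np

# One array of test-set probabilities per fold (made-up values)
fold_preds = [
    np.array([0.10, 0.80, 0.30]),
    np.array([0.20, 0.60, 0.30]),
    np.array([0.30, 0.70, 0.30]),
]

# axis=0 averages across folds, leaving one value per test row
avg = np.mean(fold_preds, axis=0)
print(avg.shape)
```

Averaging over folds tends to smooth out the variation between models trained on different subsets of the training data.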


In [0]:
# Leaderboard (LB) score: .8612; with new values: .86559
# sample_submission.to_csv("/content/gdrive/My Drive/ColabNotebooks/AmExpart-2019/data_set/amexpert-2019-catboost-final.csv",index=False)
# sample_submission.shape

#### LightGBM Model Start

In [0]:
# Take the target from train_dummy so the labels stay row-aligned with X
X, y = train_dummy.drop(["id","redemption_status"], axis="columns"), train_dummy["redemption_status"].reset_index(drop=True)
Xtest = test_dummy.drop(["id"], axis="columns")

print(X.shape, Xtest.shape)



(78369, 46) (50226, 46)


In [0]:
X.columns, train["redemption_status"].value_counts()/train.shape[0]

(Index(['age_range', 'marital_status', 'rented', 'family_size',
        'no_of_children', 'income_bracket', 'campaign_type', 'campaign_days',
        'start_date_in_days', 'end_date_in_days', 'ciidi_item_id_nunique',
        'ciidi_brand_nunique', 'ciidi_brand_mean', 'ciidi_brand_type_nunique',
        'ciidi_brand_type_mean', 'ciidi_category_nunique',
        'ciidi_category_mean', 'cti_item_id_nunique', 'cti_quantity_count',
        'cti_quantity_nunique', 'cti_quantity_mean', 'cti_dt_day_mean',
        'cti_dt_week_day_mean', 'cti_item_price_mean', 'cti_item_price_std',
        'cti_item_price_sum', 'cti_selling_price_mean',
        'cti_other_discount_mean', 'cti_brand_nunique', 'cti_brand_mean',
        'cti_brand_type_nunique', 'cti_brand_type_mean', 'cti_category_nunique',
        'cti_category_mean', 'cti_coupon_discount_mean',
        'cti_purchase_date_in_days_mean', 'cti_coupon_discount_percentage_min',
        'cti_coupon_discount_percentage_mean',
        'cti_coupon_disco

In [0]:
err = []         # per-fold validation AUC (higher is better, despite the name)
y_pred_tot = []  # per-fold predictions on the test set

fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=2019)

for train_index, test_index in fold.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]  # positional, consistent with X.iloc

    # (n_estimators=1000, max_depth=5, learning_rate=0.01, random_state=i, colsample_bytree=0.2, reg_lambda=15, reg_alpha=10)
    model = lgb.LGBMClassifier(n_estimators=5000, max_depth=5, random_state=1994,learning_rate=0.01,colsample_bytree=0.2,objective='binary',scale_pos_weight=1, reg_lambda=15, reg_alpha=10)
    # Note: LightGBM >= 4.0 removed early_stopping_rounds/verbose from fit();
    # there, pass callbacks=[lgb.early_stopping(200), lgb.log_evaluation(200)] instead.
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='auc', early_stopping_rounds=200,verbose=200)

    preds = model.predict_proba(X_test)[:,-1]
    fold_auc = roc_auc_score(y_test, preds)
    print("err: ", fold_auc)
    err.append(fold_auc)

    # Keep this fold's test-set probabilities for averaging later
    y_pred_tot.append(model.predict_proba(Xtest)[:,-1])


Training until validation scores don't improve for 200 rounds.
[200]	valid_0's binary_logloss: 0.0424645	valid_0's auc: 0.894921
[400]	valid_0's binary_logloss: 0.0398969	valid_0's auc: 0.903174
[600]	valid_0's binary_logloss: 0.0387473	valid_0's auc: 0.909252
[800]	valid_0's binary_logloss: 0.0380326	valid_0's auc: 0.913043
[1000]	valid_0's binary_logloss: 0.0376203	valid_0's auc: 0.915199
[1200]	valid_0's binary_logloss: 0.0372832	valid_0's auc: 0.917274
[1400]	valid_0's binary_logloss: 0.0370208	valid_0's auc: 0.91869
[1600]	valid_0's binary_logloss: 0.0368389	valid_0's auc: 0.919701
[1800]	valid_0's binary_logloss: 0.0367072	valid_0's auc: 0.920622
[2000]	valid_0's binary_logloss: 0.036624	valid_0's auc: 0.921136
[2200]	valid_0's binary_logloss: 0.0365825	valid_0's auc: 0.921358
[2400]	valid_0's binary_logloss: 0.0365641	valid_0's auc: 0.921413
[2600]	valid_0's binary_logloss: 0.0365411	valid_0's auc: 0.921533
[2800]	valid_0's binary_logloss: 0.0365246	valid_0's auc: 0.921586
[3000

In [0]:
# #from lightgbm import plot_importance

# fig, ax = plt.subplots(figsize=(12,30))
# lgb.plot_importance(model, max_num_features=100, height=0.8, ax=ax)
# ax.grid(False)
# plt.title("LightGBM - Feature Importance", fontsize=15)
# plt.show()

In [0]:
print(np.mean(err,0))
# sample_submission["redemption_status"] = np.mean(y_pred_tot, 0)
# sample_submission.to_csv("/content/gdrive/My Drive/ColabNotebooks/AmExpart-2019/data_set/amexpert-2019-lgbm-final.csv",index=False)

0.9319929035740022


In [0]:
# sample_submission["redemption_status"] = np.mean(y_pred_tot, 0)*0.35 + np.mean(y_pred_tot_cb, 0)*0.65
# sample_submission.to_csv("/content/gdrive/My Drive/ColabNotebooks/AmExpart-2019/data_set/amexpert-2019-cb-lgb-stack-6535.csv",index=False)

In [0]:
sample_submission.shape

(50226, 2)

**Sachin LightGBM Model Ends**

#### Train 15 LightGBM and 10 CatBoost models, then take the harmonic mean (HM) of their respective arithmetic means (AM)

In [0]:
preds_lgbm15 = np.zeros((len(test_dummy), 1))
m = 15

for i in range(m):
    print("training LGBC model {}".format(i))
    lgbc = lgb.LGBMClassifier(n_estimators=1000, max_depth=5, learning_rate=0.01, random_state = i + 2000, colsample_bytree=0.2, reg_lambda=15, reg_alpha=10)
    lgbc.fit(train_dummy.drop(["id","redemption_status"],axis="columns"), train_dummy["redemption_status"])
    preds_lgbm15 = preds_lgbm15 + lgbc.predict_proba(test_dummy.drop(["id"], axis="columns"))[:,1].reshape(-1, 1)
    
    
preds_lgbm15 = preds_lgbm15/m

training LGBC model 0
training LGBC model 1
training LGBC model 2
training LGBC model 3
training LGBC model 4
training LGBC model 5
training LGBC model 6
training LGBC model 7
training LGBC model 8
training LGBC model 9
training LGBC model 10
training LGBC model 11
training LGBC model 12
training LGBC model 13
training LGBC model 14


In [0]:
preds_cb_15 = np.zeros((len(test_dummy), 1))

n = 10

for i in range(n):
  print("training catboost model {}".format(i))
  cbc = cbst.CatBoostClassifier(random_seed = 123 + i, 
                                eval_metric='AUC', 
                                n_estimators=1100, 
                                max_depth=7, 
                                learning_rate=0.03, 
                                colsample_bylevel=0.1, reg_lambda=70)

  cbc.fit(train_dummy.drop(["id","redemption_status"], axis="columns"), train_dummy["redemption_status"],cat_features=categorical_features_indices)

  preds_cb_15 = preds_cb_15 + cbc.predict_proba(test_dummy.drop(["id"], axis="columns"))[:,1].reshape(-1, 1)
  
preds_cb_15 = preds_cb_15/n

training catboost model 0
0:	total: 35.8ms	remaining: 39.3s
1:	total: 97.2ms	remaining: 53.4s
2:	total: 166ms	remaining: 1m
3:	total: 226ms	remaining: 1m 1s
4:	total: 293ms	remaining: 1m 4s
5:	total: 365ms	remaining: 1m 6s
6:	total: 423ms	remaining: 1m 5s
7:	total: 485ms	remaining: 1m 6s
8:	total: 559ms	remaining: 1m 7s
9:	total: 622ms	remaining: 1m 7s
10:	total: 682ms	remaining: 1m 7s
11:	total: 745ms	remaining: 1m 7s
12:	total: 788ms	remaining: 1m 5s
13:	total: 860ms	remaining: 1m 6s
14:	total: 924ms	remaining: 1m 6s
15:	total: 981ms	remaining: 1m 6s
16:	total: 1.06s	remaining: 1m 7s
17:	total: 1.13s	remaining: 1m 7s
18:	total: 1.19s	remaining: 1m 7s
19:	total: 1.25s	remaining: 1m 7s
20:	total: 1.31s	remaining: 1m 7s
21:	total: 1.38s	remaining: 1m 7s
22:	total: 1.43s	remaining: 1m 6s
23:	total: 1.49s	remaining: 1m 6s
24:	total: 1.55s	remaining: 1m 6s
25:	total: 1.61s	remaining: 1m 6s
26:	total: 1.67s	remaining: 1m 6s
27:	total: 1.73s	remaining: 1m 6s
28:	total: 1.79s	remaining: 1m 6s

In [0]:
from scipy.stats import hmean

sub = pd.DataFrame()

sub['id'] = test_dummy['id'].copy()

sub['redemption_status1'] = preds_lgbm15
sub['redemption_status2'] = preds_cb_15

hmean_preds = hmean(sub[['redemption_status1', 'redemption_status2']].values, axis=1)

sub['redemption_status'] = hmean_preds

sub[['id', 'redemption_status']].to_csv("/content/gdrive/My Drive/ColabNotebooks/AmExpart-2019/data_set/amexpert-2019-lgbm-15_cbst-10_Final.csv", index=False)
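The final blend first averages the seeds within each model family (arithmetic mean), then combines the two family averages per test row with a harmonic mean. A small standard-library sketch of that arithmetic, using made-up probabilities for two hypothetical seeds per family:

```python
from statistics import harmonic_mean, mean

# Rows are seeds, columns are test rows (made-up probabilities)
lgbm_runs = [[0.20, 0.02], [0.24, 0.04]]
cb_runs = [[0.05, 0.03], [0.07, 0.03]]

# Arithmetic mean across seeds, per test row
lgbm_am = [mean(col) for col in zip(*lgbm_runs)]
cb_am = [mean(col) for col in zip(*cb_runs)]

# Harmonic mean of the two family averages, per test row
blend = [harmonic_mean([a, b]) for a, b in zip(lgbm_am, cb_am)]
print([round(v, 4) for v in blend])
```

The harmonic mean never exceeds the arithmetic mean and is pulled toward the smaller of the two inputs, so a row keeps a high blended score only when both model families assign it a high probability.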