Changing feature enginering

In [2]:
%reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? n
Nothing done.


In [1]:
import numpy as np
import pandas as pd
import datetime
import gc
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
np.random.seed(4590)

In [2]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max(
            )
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [5]:
df_train = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/train.csv')
df_test = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/test.csv')
df_hist_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/historical_transactions.csv')
df_new_merchant_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/new_merchant_transactions.csv')

In [6]:
df_train=reduce_mem_usage(df_train)
df_test=reduce_mem_usage(df_test)
df_hist_trans=reduce_mem_usage(df_hist_trans)
df_new_merchant_trans=reduce_mem_usage(df_new_merchant_trans)

Mem. usage decreased to  4.04 Mb (56.2% reduction)
Mem. usage decreased to  2.24 Mb (52.5% reduction)
Mem. usage decreased to 1749.11 Mb (43.7% reduction)
Mem. usage decreased to 114.20 Mb (45.5% reduction)


Count number of purchases made in each merchant. We will use it to fill NaN values by most frequent merchant id.

Chech weather we have NaN values in the following categories

In [7]:
df_hist_trans.isnull().any()

authorized_flag         False
card_id                 False
city_id                 False
category_1              False
installments            False
category_3               True
merchant_category_id    False
merchant_id              True
month_lag               False
purchase_amount         False
purchase_date           False
category_2               True
state_id                False
subsector_id            False
dtype: bool

In [8]:
df_new_merchant_trans.isnull().any()

authorized_flag         False
card_id                 False
city_id                 False
category_1              False
installments            False
category_3               True
merchant_category_id    False
merchant_id              True
month_lag               False
purchase_amount         False
purchase_date           False
category_2               True
state_id                False
subsector_id            False
dtype: bool

It seems that 'category_2' , 'category_3' and 'mechant_id' has Nan values in historical and new mechant transactions. Now, let's count values in each of this categories for historical transactions. Let's start with 'category_2'

In [9]:
df_hist_trans.category_2.value_counts(dropna=False)

 1.0    15177199
 3.0     3911795
 5.0     3725915
NaN      2652864
 4.0     2618053
 2.0     1026535
Name: category_2, dtype: int64

There is 2652864 Nan values in 'category_2'. Now, let's check if in each 'card_id' 'category_2' have only 'Nan' unique value and change it to the most frequent values found in the whole history transacations. Most frequent seems to be 1.(in the above cell).

In order to check we can use groupby and sum function. By setting min_count=1, we can get nan value for sum of nan array, if we didn't it will give zero by default.

In [10]:
group_cat2=df_hist_trans.groupby(['card_id']).category_2.sum(min_count=1)

Now let's select those who have non-values only

In [11]:
group_cat2_nan=group_cat2[group_cat2.isnull()]

In [12]:
group_cat2_nan.head()

card_id
C_ID_001b4c5151   NaN
C_ID_001c09a36b   NaN
C_ID_0028e15a78   NaN
C_ID_002b706ded   NaN
C_ID_0030e0945f   NaN
Name: category_2, dtype: float16

I am setting index as 'card_id'. In order to , change 'category_2' values. I tried just by using 
df_hist_trans.loc[df_hist_trans.card_id.isin(group_cat2_nan.index)].category_2=1
or 
df_hist_trans[df_hist_trans.card_id.isin(group_cat2_nan.index)].category_2
and you can check by
df_hist_trans.loc[df_hist_trans.card_id.isin(group_cat2_nan.index)].category_2
or
df_hist_trans.loc[df_hist_trans.card_id.isin(group_cat2_nan.index)].category_2
You will see that values do not change. Both of this methods gives copy from the dataframe so we can't change it. It seems when you use masking(df_hist_trans.card_id.isin(group_cat2_nan.index)) you will have copy. 
Therefore, I decided to call from index and index will be 'card_id'.


In [13]:
df_hist_trans[df_hist_trans.card_id.isin(group_cat2_nan.index)].category_2=1

In [14]:
df_hist_trans[df_hist_trans.card_id.isin(group_cat2_nan.index)].category_2.head()

15207   NaN
15208   NaN
15209   NaN
15210   NaN
15211   NaN
Name: category_2, dtype: float16

You can see that it doesn't work. So , let's change index of df_hist_trans to 'card_id'. Call dataframe from their respective indexes

In [15]:
df_hist_trans.set_index('card_id',inplace=True)

In [16]:
df_hist_trans.loc[group_cat2_nan.index,'category_2']=1

In [17]:
df_hist_trans.loc[group_cat2_nan.index,'category_2'].unique()

array([ 1.])

Now, we only changed the once which have only NaN values in 'category_2'. Let's check how many nan are still there. So nan values are reduced by 115446.

In [18]:
df_hist_trans.category_2.value_counts(dropna=False)

 1.0    15327456
 3.0     3911795
 5.0     3725915
 4.0     2618053
NaN      2502607
 2.0     1026535
Name: category_2, dtype: int64

Now let's reset_index and groupby 'card_id' and 'category' . We can look at number of counts in each category of 'category_2'.

In [19]:
df_hist_trans.reset_index(inplace=True)

In [20]:
category_2_count=df_hist_trans.groupby(['card_id','category_2']).count()

In [21]:
category_2_count.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,authorized_flag,city_id,category_1,installments,category_3,merchant_category_id,merchant_id,month_lag,purchase_amount,purchase_date,state_id,subsector_id
card_id,category_2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
C_ID_00007093c1,3.0,120,120,120,120,120,120,120,120,120,120,120,120
C_ID_00007093c1,5.0,1,1,1,1,1,1,1,1,1,1,1,1
C_ID_0001238066,1.0,95,95,95,95,94,95,95,95,95,95,95,95
C_ID_0001238066,5.0,20,20,20,20,19,20,20,20,20,20,20,20
C_ID_0001506ef0,1.0,2,2,2,2,2,2,2,2,2,2,2,2
C_ID_0001506ef0,3.0,64,64,64,64,64,64,64,64,64,64,64,64
C_ID_0001793786,1.0,11,11,11,11,11,11,11,11,11,11,11,11
C_ID_0001793786,2.0,76,76,76,76,76,76,76,76,76,76,76,76
C_ID_0001793786,3.0,15,15,15,15,15,15,15,15,15,15,15,15
C_ID_000183fdda,1.0,7,7,7,7,7,7,7,7,7,7,7,7


We need only one column(since all of them are same) let's choose 'authorized_flag'. 

In [22]:
category_2_count=category_2_count.authorized_flag

In [23]:
category_2_count.head(20)

card_id          category_2
C_ID_00007093c1  3.0           120
                 5.0             1
C_ID_0001238066  1.0            95
                 5.0            20
C_ID_0001506ef0  1.0             2
                 3.0            64
C_ID_0001793786  1.0            11
                 2.0            76
                 3.0            15
C_ID_000183fdda  1.0             7
                 2.0             1
                 3.0           131
                 5.0             1
C_ID_00024e244b  1.0             3
                 3.0            67
C_ID_0002709b5a  1.0             1
                 2.0            52
                 5.0            14
C_ID_00027503e2  1.0             3
                 3.0            39
Name: authorized_flag, dtype: int64

We will need only index of maximum values. We can do it by groupby(level=0), level=0 is 'card_id' in our case.

In [24]:
category_2_count_max=category_2_count.groupby(level=0).idxmax()



Now we need only second part of the tuple. Finally, we will obtain Series object with corresponding max count of categories for each 'card_id'

In [25]:
category_2_count_max=category_2_count_max.apply(lambda x: x[1])

In [26]:
category_2_count_max.head()

card_id
C_ID_00007093c1    3.0
C_ID_0001238066    1.0
C_ID_0001506ef0    3.0
C_ID_0001793786    2.0
C_ID_000183fdda    3.0
Name: authorized_flag, dtype: float64

Now, we can input most frequent 'category_2' value for each non value in certain 'card_id'

In [27]:
df_hist_trans.set_index('card_id',inplace=True)
df_hist_trans.head()

Unnamed: 0_level_0,authorized_flag,city_id,category_1,installments,category_3,merchant_category_id,merchant_id,month_lag,purchase_amount,purchase_date,category_2,state_id,subsector_id
card_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
C_ID_4e6213e9bc,Y,88,N,0,A,80,M_ID_e020e9b302,-8,-0.703331,2017-06-25 15:33:07,1.0,16,37
C_ID_4e6213e9bc,Y,88,N,0,A,367,M_ID_86ec983688,-7,-0.733128,2017-07-15 12:10:45,1.0,16,16
C_ID_4e6213e9bc,Y,88,N,0,A,80,M_ID_979ed661fc,-6,-0.720386,2017-08-09 22:04:29,1.0,16,37
C_ID_4e6213e9bc,Y,88,N,0,A,560,M_ID_e6d5ae8ea6,-5,-0.735352,2017-09-02 10:06:26,1.0,16,34
C_ID_4e6213e9bc,Y,88,N,0,A,80,M_ID_e020e9b302,-11,-0.722865,2017-03-10 01:14:19,1.0,16,37


Let's check what where value counts before

In [28]:
df_hist_trans.category_2.value_counts(dropna=False)

 1.0    15327456
 3.0     3911795
 5.0     3725915
 4.0     2618053
NaN      2502607
 2.0     1026535
Name: category_2, dtype: int64

Fill nan values according to given series by using fillna function

In [29]:
df_hist_trans.category_2=df_hist_trans.category_2.fillna(category_2_count_max)

Finally, we get rid of all nan values for 'category_2' . Now, we can try same for 'category_3' and 'merchant_id'. Also, we need to do same for 'df_new_merchant_trans'. Finally, let's check if there is any nan values in category_2

In [30]:
df_hist_trans.category_2.value_counts(dropna=False)

1.0    16804879
3.0     4289903
5.0     4050578
4.0     2793190
2.0     1173811
Name: category_2, dtype: int64

Let's do same for 'category_3'

In [31]:
df_hist_trans.category_3.value_counts(dropna=False)

A      15411747
B      11677522
C       1844933
NaN      178159
Name: category_3, dtype: int64

Let's change 'A' , 'B', 'C' to numerical values in order to be able to use sum(min_counts=1) function

In [32]:
d_cat3={'A':1,'B':2,'C':3}

In [33]:
df_hist_trans.category_3=df_hist_trans.category_3.map(d_cat3)

In [34]:
df_hist_trans.head()

Unnamed: 0_level_0,authorized_flag,city_id,category_1,installments,category_3,merchant_category_id,merchant_id,month_lag,purchase_amount,purchase_date,category_2,state_id,subsector_id
card_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
C_ID_4e6213e9bc,Y,88,N,0,1.0,80,M_ID_e020e9b302,-8,-0.703331,2017-06-25 15:33:07,1.0,16,37
C_ID_4e6213e9bc,Y,88,N,0,1.0,367,M_ID_86ec983688,-7,-0.733128,2017-07-15 12:10:45,1.0,16,16
C_ID_4e6213e9bc,Y,88,N,0,1.0,80,M_ID_979ed661fc,-6,-0.720386,2017-08-09 22:04:29,1.0,16,37
C_ID_4e6213e9bc,Y,88,N,0,1.0,560,M_ID_e6d5ae8ea6,-5,-0.735352,2017-09-02 10:06:26,1.0,16,34
C_ID_4e6213e9bc,Y,88,N,0,1.0,80,M_ID_e020e9b302,-11,-0.722865,2017-03-10 01:14:19,1.0,16,37


In [35]:
df_hist_trans.category_3.value_counts(dropna=False)

 1.0    15411747
 2.0    11677522
 3.0     1844933
NaN       178159
Name: category_3, dtype: int64

In [36]:
group_cat3=df_hist_trans.groupby(['card_id']).category_3.sum(min_count=1)

In [37]:
group_cat3.isnull().sum()

0

### This means that we don't have any card_id which have only NaN values in category_3. So we can jump to changing nan values in each card id.

In [38]:
df_hist_trans.reset_index(inplace=True)
category_3_count=df_hist_trans.groupby(['card_id','category_3']).count()
category_3_count=category_3_count.authorized_flag
category_3_count_max=category_3_count.groupby(level=0).idxmax()
category_3_count_max=category_3_count_max.apply(lambda x: x[1])
category_3_count_max.head()


card_id
C_ID_00007093c1    2.0
C_ID_0001238066    2.0
C_ID_0001506ef0    1.0
C_ID_0001793786    1.0
C_ID_000183fdda    2.0
Name: authorized_flag, dtype: float64

### Let's check again what was before

In [39]:
df_hist_trans.set_index('card_id',inplace=True)
df_hist_trans.category_3.value_counts(dropna=False)

 1.0    15411747
 2.0    11677522
 3.0     1844933
NaN       178159
Name: category_3, dtype: int64

In [40]:
df_hist_trans.category_3=df_hist_trans.category_3.fillna(category_3_count_max)
df_hist_trans.category_3.value_counts(dropna=False)

1.0    15412531
2.0    11833535
3.0     1866295
Name: category_3, dtype: int64

In [41]:
d_cat3_inv={1.0:"A",2.0:"B",3.0:"C"}
df_hist_trans.category_3=df_hist_trans.category_3.map(d_cat3_inv)
df_hist_trans.category_3.value_counts(dropna=False)

A    15412531
B    11833535
C     1866295
Name: category_3, dtype: int64

### Now let's do it for merchant_id

In [42]:
#df_hist_trans.set_index('card_id',inplace=True)
df_hist_trans.merchant_id.value_counts(dropna=False).head()

M_ID_00a6ca8a8a    1115097
M_ID_e5374dabc0     428619
M_ID_9139332ccc     361385
M_ID_50f575c681     183894
M_ID_fc7d7969c3     177040
Name: merchant_id, dtype: int64

In [43]:
df_hist_trans.reset_index(inplace=True)
merchant_id_count=df_hist_trans.groupby(['card_id','merchant_id']).count()
merchant_id_count=merchant_id_count.authorized_flag
merchant_id_count_max=merchant_id_count.groupby(level=0).idxmax()
merchant_id_count_max=merchant_id_count_max.apply(lambda x: x[1])
merchant_id_count_max.head()

card_id
C_ID_00007093c1    M_ID_9400cf2342
C_ID_0001238066    M_ID_d17aabd756
C_ID_0001506ef0    M_ID_b1fc88154d
C_ID_0001793786    M_ID_923d57de8d
C_ID_000183fdda    M_ID_f9cfe0a43b
Name: authorized_flag, dtype: object

### Let's check again what was before

In [44]:
df_hist_trans.set_index('card_id',inplace=True)
df_hist_trans.merchant_id.value_counts(dropna=False).head(20)

M_ID_00a6ca8a8a    1115097
M_ID_e5374dabc0     428619
M_ID_9139332ccc     361385
M_ID_50f575c681     183894
M_ID_fc7d7969c3     177040
M_ID_5ba019a379     170935
NaN                 138481
M_ID_f86439cec0     110341
M_ID_1f4773aa76     106476
M_ID_86be58d7e0      97259
M_ID_98b342c0e3      93394
M_ID_d855771cd9      84377
M_ID_6f274b9340      81072
M_ID_cd2c0b07e9      80179
M_ID_57df19bf28      76750
M_ID_b9dcf28cb9      75487
M_ID_b98db225f5      70384
M_ID_445742726b      68499
M_ID_2637773dd2      66836
M_ID_82a30d9203      65853
Name: merchant_id, dtype: int64

In [45]:
df_hist_trans.merchant_id=df_hist_trans.merchant_id.fillna(merchant_id_count_max)
df_hist_trans.merchant_id.value_counts(dropna=False).head()

M_ID_00a6ca8a8a    1130790
M_ID_e5374dabc0     433318
M_ID_9139332ccc     364256
M_ID_50f575c681     185941
M_ID_fc7d7969c3     177967
Name: merchant_id, dtype: int64

### Finally, let's check if any nan values left

In [46]:
df_hist_trans.isnull().any()

authorized_flag         False
city_id                 False
category_1              False
installments            False
category_3              False
merchant_category_id    False
merchant_id             False
month_lag               False
purchase_amount         False
purchase_date           False
category_2              False
state_id                False
subsector_id            False
dtype: bool

In [47]:
#df_hist_trans.to_csv('C:/Users/user/Documents/Salamat/ELO/historical_transactions_new.csv')

### Now, let's do same for new merchant transactions


In [48]:
#df_new_merchant_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/new_merchant_transactions.csv')

In [49]:
df_new_merchant_trans.isnull().any()

authorized_flag         False
card_id                 False
city_id                 False
category_1              False
installments            False
category_3               True
merchant_category_id    False
merchant_id              True
month_lag               False
purchase_amount         False
purchase_date           False
category_2               True
state_id                False
subsector_id            False
dtype: bool

In [50]:
df_new_merchant_trans.category_2.value_counts(dropna=False)

 1.0    1058242
 3.0     289525
 5.0     259266
 4.0     178590
NaN      111745
 2.0      65663
Name: category_2, dtype: int64

In [51]:
group_cat2=df_new_merchant_trans.groupby(['card_id']).category_2.sum(min_count=1)

In [52]:
group_cat2_nan=group_cat2[group_cat2.isnull()]
group_cat2_nan.head()

card_id
C_ID_000cfb6503   NaN
C_ID_000f6fea6a   NaN
C_ID_000f7e3e49   NaN
C_ID_0017dadfd5   NaN
C_ID_001b43d48f   NaN
Name: category_2, dtype: float16

In [53]:
df_new_merchant_trans.set_index('card_id',inplace=True)
df_new_merchant_trans.loc[group_cat2_nan.index,'category_2']=1
df_new_merchant_trans.loc[group_cat2_nan.index,'category_2'].unique()

array([ 1.])

In [54]:
df_new_merchant_trans.category_2.value_counts(dropna=False)

 1.0    1077518
 3.0     289525
 5.0     259266
 4.0     178590
NaN       92469
 2.0      65663
Name: category_2, dtype: int64

In [55]:
df_new_merchant_trans.reset_index(inplace=True)
category_2_count=df_new_merchant_trans.groupby(['card_id','category_2']).count()
category_2_count.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,authorized_flag,city_id,category_1,installments,category_3,merchant_category_id,merchant_id,month_lag,purchase_amount,purchase_date,state_id,subsector_id
card_id,category_2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
C_ID_00007093c1,1.0,1,1,1,1,1,1,1,1,1,1,1,1
C_ID_00007093c1,3.0,1,1,1,1,1,1,1,1,1,1,1,1
C_ID_0001238066,1.0,20,20,20,20,19,20,20,20,20,20,20,20
C_ID_0001238066,5.0,3,3,3,3,3,3,3,3,3,3,3,3
C_ID_0001506ef0,3.0,2,2,2,2,2,2,1,2,2,2,2,2
C_ID_0001793786,1.0,15,15,15,15,15,15,15,15,15,15,15,15
C_ID_0001793786,2.0,8,8,8,8,8,8,8,8,8,8,8,8
C_ID_0001793786,3.0,5,5,5,5,5,5,5,5,5,5,5,5
C_ID_0001793786,5.0,1,1,1,1,1,1,1,1,1,1,1,1
C_ID_000183fdda,3.0,11,11,11,11,10,11,11,11,11,11,11,11


In [56]:
category_2_count=category_2_count.authorized_flag
category_2_count_max=category_2_count.groupby(level=0).idxmax()
category_2_count_max=category_2_count_max.apply(lambda x: x[1])
category_2_count_max.head()

card_id
C_ID_00007093c1    1.0
C_ID_0001238066    1.0
C_ID_0001506ef0    3.0
C_ID_0001793786    1.0
C_ID_000183fdda    3.0
Name: authorized_flag, dtype: float64

In [57]:
df_new_merchant_trans.set_index('card_id',inplace=True)
print("before replacing nan values")
print(df_new_merchant_trans.category_2.value_counts(dropna=False))
df_new_merchant_trans.category_2=df_new_merchant_trans.category_2.fillna(category_2_count_max)
print("after replacing nan values")
print(df_new_merchant_trans.category_2.value_counts(dropna=False))

before replacing nan values
 1.0    1077518
 3.0     289525
 5.0     259266
 4.0     178590
NaN       92469
 2.0      65663
Name: category_2, dtype: int64
after replacing nan values
1.0    1129966
3.0     304780
5.0     270897
4.0     185915
2.0      71473
Name: category_2, dtype: int64


### Now let's do it in category_3

In [58]:
df_new_merchant_trans.category_3.value_counts(dropna=False)

A      922244
B      836178
C      148687
NaN     55922
Name: category_3, dtype: int64

In [59]:
d_cat3={'A':1,'B':2,'C':3}
df_new_merchant_trans.category_3=df_new_merchant_trans.category_3.map(d_cat3)
df_new_merchant_trans.category_3.value_counts(dropna=False)

 1.0    922244
 2.0    836178
 3.0    148687
NaN      55922
Name: category_3, dtype: int64

In [60]:
group_cat3.isnull().sum()

0

### This means that we don't have any card_id which have only NaN values in category_3. So we can jump to changing nan values in each card id.

In [61]:
df_new_merchant_trans.reset_index(inplace=True)
category_3_count=df_new_merchant_trans.groupby(['card_id','category_3']).count()
category_3_count=category_3_count.authorized_flag
category_3_count_max=category_3_count.groupby(level=0).idxmax()
category_3_count_max=category_3_count_max.apply(lambda x: x[1])
category_3_count_max.head()

card_id
C_ID_00007093c1    2.0
C_ID_0001238066    2.0
C_ID_0001506ef0    1.0
C_ID_0001793786    1.0
C_ID_000183fdda    2.0
Name: authorized_flag, dtype: float64

### Let's check again what was before

In [62]:
df_new_merchant_trans.set_index('card_id',inplace=True)
df_new_merchant_trans.category_3.value_counts(dropna=False)

 1.0    922244
 2.0    836178
 3.0    148687
NaN      55922
Name: category_3, dtype: int64

In [63]:
df_new_merchant_trans.category_3=df_new_merchant_trans.category_3.fillna(category_3_count_max)
df_new_merchant_trans.category_3.value_counts(dropna=False)

 1.0    922575
 2.0    883126
 3.0    154744
NaN       2586
Name: category_3, dtype: int64

### There are still categories which are Nan. Maybe we can get them from historical transactions.

In [64]:
df_hist_trans.category_3.value_counts(dropna=False)

A    15412531
B    11833535
C     1866295
Name: category_3, dtype: int64

In [65]:
#df_hist_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/historical_transactions.csv')
#df_hist_trans=reduce_mem_usage(df_hist_trans)

In [66]:
df_hist_trans.category_3=df_hist_trans.category_3.map(d_cat3)

In [67]:
category_3_count_h=df_hist_trans.groupby(['card_id','category_3']).count()
category_3_count_h=category_3_count_h.authorized_flag
category_3_count_h_max=category_3_count_h.groupby(level=0).idxmax()
category_3_count_h_max=category_3_count_h_max.apply(lambda x: x[1])
category_3_count_h_max.head()

card_id
C_ID_00007093c1    2
C_ID_0001238066    2
C_ID_0001506ef0    1
C_ID_0001793786    1
C_ID_000183fdda    2
Name: authorized_flag, dtype: int64

In [68]:
df_new_merchant_trans.category_3=df_new_merchant_trans.category_3.fillna(category_3_count_h_max)
df_new_merchant_trans.category_3.value_counts(dropna=False)

1.0    922582
2.0    884860
3.0    155589
Name: category_3, dtype: int64

In [69]:
df_new_merchant_trans.category_3=df_new_merchant_trans.category_3.map(d_cat3_inv)
df_new_merchant_trans.category_3.value_counts(dropna=False)

A    922582
B    884860
C    155589
Name: category_3, dtype: int64

### Good we are able to remove all nan values by using historic transactions for category_3
### Now, let's work on 'merchant_id'

In [70]:
df_new_merchant_trans.merchant_id.value_counts(dropna=False).head()

NaN                26216
M_ID_00a6ca8a8a    23018
M_ID_cd2c0b07e9    19118
M_ID_9139332ccc    14220
M_ID_50f575c681    13778
Name: merchant_id, dtype: int64

In [71]:
df_new_merchant_trans[df_new_merchant_trans.merchant_id.isnull()].shape

(26216, 13)

In [72]:

#df_new_merchant_trans[df_new_merchant_trans.merchant_id.isnull()].index.unique().shape

### Almost all of them are unique

In [73]:
df_new_merchant_trans.reset_index(inplace=True)
merchant_id_count=df_new_merchant_trans.groupby(['card_id','merchant_id']).count()
merchant_id_count=merchant_id_count.authorized_flag
merchant_id_count_max=merchant_id_count.groupby(level=0).idxmax()
merchant_id_count_max=merchant_id_count_max.apply(lambda x: x[1])
merchant_id_count_max.head()

card_id
C_ID_00007093c1    M_ID_00a6ca8a8a
C_ID_0001238066    M_ID_00a6ca8a8a
C_ID_0001506ef0    M_ID_ab756f937e
C_ID_0001793786    M_ID_0360f86430
C_ID_000183fdda    M_ID_113378fe3b
Name: authorized_flag, dtype: object

In [74]:
df_new_merchant_trans.merchant_id=df_new_merchant_trans.merchant_id.fillna(merchant_id_count_max)
df_new_merchant_trans.merchant_id.value_counts(dropna=False).head()

NaN                26216
M_ID_00a6ca8a8a    23018
M_ID_cd2c0b07e9    19118
M_ID_9139332ccc    14220
M_ID_50f575c681    13778
Name: merchant_id, dtype: int64

### It doesn't help at all. So, let's use data from historical transactions.
### It seems that historic transaction doesn't help as well. Probably

In [75]:
#df_hist_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/historical_transactions.csv')
#df_hist_trans=reduce_mem_usage(df_hist_trans)


In [76]:
merchant_id_count_h=df_hist_trans.groupby(['card_id','merchant_id']).count()
merchant_id_count_h=merchant_id_count_h.authorized_flag
merchant_id_count_max_h=merchant_id_count_h.groupby(level=0).idxmax()
merchant_id_count_max_h=merchant_id_count_max_h.apply(lambda x: x[1])
merchant_id_count_max_h.head()

card_id
C_ID_00007093c1    M_ID_9400cf2342
C_ID_0001238066    M_ID_d17aabd756
C_ID_0001506ef0    M_ID_b1fc88154d
C_ID_0001793786    M_ID_923d57de8d
C_ID_000183fdda    M_ID_f9cfe0a43b
Name: authorized_flag, dtype: object

In [77]:
df_new_merchant_trans.merchant_id=df_new_merchant_trans.merchant_id.fillna(merchant_id_count_max_h)
df_new_merchant_trans.merchant_id.value_counts(dropna=False).head()

NaN                26216
M_ID_00a6ca8a8a    23018
M_ID_cd2c0b07e9    19118
M_ID_9139332ccc    14220
M_ID_50f575c681    13778
Name: merchant_id, dtype: int64

### It seems that historic transaction doesn't help as well. Probably, those card_id with nan values appear only in new transaction

In [78]:
car_id_nan=df_new_merchant_trans[df_new_merchant_trans.merchant_id.isnull()].card_id

In [79]:

df_hist_trans[df_hist_trans.index.isin(car_id_nan)].merchant_id.value_counts()

M_ID_00a6ca8a8a    117564
M_ID_e5374dabc0     38087
M_ID_9139332ccc     26338
M_ID_50f575c681     18789
M_ID_5ba019a379     16390
M_ID_fc7d7969c3     13282
M_ID_f86439cec0     10975
M_ID_1f4773aa76     10091
M_ID_98b342c0e3      9857
M_ID_d855771cd9      8569
M_ID_cd2c0b07e9      8220
M_ID_86be58d7e0      7555
M_ID_b9dcf28cb9      7378
M_ID_2637773dd2      6833
M_ID_82a30d9203      6532
M_ID_b98db225f5      6506
M_ID_6f274b9340      5598
M_ID_940fb4498f      5445
M_ID_c03b62d83d      5376
M_ID_57df19bf28      5061
M_ID_48257bb851      4820
M_ID_820c7b73c8      4691
M_ID_445742726b      4688
M_ID_deb43ff012      4538
M_ID_a9d91682ad      4436
M_ID_26d4fadb60      4220
M_ID_1ac6bbc867      4210
M_ID_7c5e93af2f      4137
M_ID_b5b80addf5      4110
M_ID_59764e8cb1      3823
                    ...  
M_ID_06a294f13c         1
M_ID_9ae1973b8a         1
M_ID_c0426aeaeb         1
M_ID_f09d59089a         1
M_ID_beda97ddfe         1
M_ID_22f1e676fc         1
M_ID_dc1fb7e314         1
M_ID_6e755b3

### Let's just change to the most frequent value

In [80]:
df_new_merchant_trans.merchant_id.value_counts().head()

M_ID_00a6ca8a8a    23018
M_ID_cd2c0b07e9    19118
M_ID_9139332ccc    14220
M_ID_50f575c681    13778
M_ID_725a60d404     7029
Name: merchant_id, dtype: int64

In [81]:
df_new_merchant_trans['merchant_id'].fillna('M_ID_00a6ca8a8a',inplace=True)

In [82]:
df_new_merchant_trans.isnull().any()

card_id                 False
authorized_flag         False
city_id                 False
category_1              False
installments            False
category_3              False
merchant_category_id    False
merchant_id             False
month_lag               False
purchase_amount         False
purchase_date           False
category_2              False
state_id                False
subsector_id            False
dtype: bool

In [83]:
#df_new_merchant_trans.to_csv('C:/Users/user/Documents/Salamat/ELO/new_merchant_transactions_new.csv')

In [84]:
# df_hist_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/historical_transactions_new.csv')
# df_hist_trans=reduce_mem_usage(df_hist_trans)


In [85]:
df_hist_trans.isnull().any()

authorized_flag         False
city_id                 False
category_1              False
installments            False
category_3              False
merchant_category_id    False
merchant_id             False
month_lag               False
purchase_amount         False
purchase_date           False
category_2              False
state_id                False
subsector_id            False
dtype: bool

# Testing is started here

In [3]:
df_hist_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/historical_transactions_new.csv')
df_new_merchant_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/new_merchant_transactions_new.csv')
df_train = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/train.csv')
df_test = pd.read_csv('test_without_nan.csv') # nan values filled by first purchase from historic transactions


In [4]:
df_hist_trans=reduce_mem_usage(df_hist_trans)
df_new_merchant_trans=reduce_mem_usage(df_new_merchant_trans)
df_train=reduce_mem_usage(df_train)
df_test=reduce_mem_usage(df_test)

Mem. usage decreased to 1749.11 Mb (43.7% reduction)
Mem. usage decreased to 114.20 Mb (45.5% reduction)
Mem. usage decreased to  4.04 Mb (56.2% reduction)
Mem. usage decreased to  2.24 Mb (52.5% reduction)


### Now let's work with dates . We need reference date at month_lag=0. Let's start with converting purchase_date to datime

In [5]:
df_hist_trans.purchase_date=pd.to_datetime(df_hist_trans.purchase_date)

In [6]:
pur_date=df_hist_trans[df_hist_trans.month_lag==0].groupby('card_id').purchase_date.max()

In [7]:
index_month_lag_nan=df_hist_trans[df_hist_trans.card_id.isin(pur_date.index)==False].index

In [8]:
card_id_nan_unique=df_hist_trans.loc[index_month_lag_nan].card_id.unique()

In [9]:
df_hist_trans.purchase_date=pd.to_datetime(df_hist_trans.purchase_date)

In [10]:
df_hist_trans.head()

Unnamed: 0,card_id,authorized_flag,city_id,category_1,installments,category_3,merchant_category_id,merchant_id,month_lag,purchase_amount,purchase_date,category_2,state_id,subsector_id
0,C_ID_4e6213e9bc,Y,88,N,0,A,80,M_ID_e020e9b302,-8,-0.703331,2017-06-25 15:33:07,1.0,16,37
1,C_ID_4e6213e9bc,Y,88,N,0,A,367,M_ID_86ec983688,-7,-0.733128,2017-07-15 12:10:45,1.0,16,16
2,C_ID_4e6213e9bc,Y,88,N,0,A,80,M_ID_979ed661fc,-6,-0.720386,2017-08-09 22:04:29,1.0,16,37
3,C_ID_4e6213e9bc,Y,88,N,0,A,560,M_ID_e6d5ae8ea6,-5,-0.735352,2017-09-02 10:06:26,1.0,16,34
4,C_ID_4e6213e9bc,Y,88,N,0,A,80,M_ID_e020e9b302,-11,-0.722865,2017-03-10 01:14:19,1.0,16,37


In [11]:
pur_date.head()

card_id
C_ID_00007093c1   2018-02-27 05:14:57
C_ID_0001238066   2018-02-27 16:18:59
C_ID_0001506ef0   2018-02-17 12:33:56
C_ID_0001793786   2017-10-31 20:20:18
C_ID_000183fdda   2018-02-25 20:57:08
Name: purchase_date, dtype: datetime64[ns]

In [12]:
pur_date.shape

(293172,)

In [13]:
df_hist_trans.groupby('card_id').count().shape

(325540, 13)

### There is a lot of missing values. I have tried various ways to fillna . Starting from simplest for more in depth
1) Just fill it with 2018 Feb

2) Check max(month_lag) which is something but not [-1,-11] in hystory_transactions, than just find month_lag==0 purchase date by adding respective month_lag*30. It makes sense to do. However, error was actually higher than previous example (3.699)

3) Since  at purchase_date at max(month_lag) might be in between the beggining and the end of the month we might have error of 1 month. Therefore, I tried to add max(purchase_date)+min(purchase_date)/(1+(min(abs(month_lag)+max(abs(month_lag)) (doesn't help)

4) Since it doesn't help I decided to change which are in 2018 to Feb 2018

### Finally, first one seems to be best solution.

I will show how I did it just each of them.


### 1) Filling nan values with 2018 Feb
Let's choose values which are NaN in history_transactions


In [14]:
df_hist_trans.reset_index(inplace=True)

In [15]:
card_id_nan_unique=df_hist_trans[df_hist_trans.card_id.isin(pur_date.index)==False].card_id.unique()

In [16]:
df=pd.DataFrame(card_id_nan_unique)
df['month_lag_date']=pd.to_datetime('2008-02') # Seetting all nan values to 2018 Feb
df.set_index(0,inplace=True)
new_map=df.month_lag_date
del df

pur_date_1=pur_date.append(new_map)
train_month_lag_0=df_train.card_id.map(pur_date_1)
test_month_lag_0=df_test.card_id.map(pur_date_1)
hist_lag_0=df_hist_trans.card_id.map( pur_date_1)
new_mech_lag_0=df_new_merchant_trans.card_id.map(pur_date_1)

### 2) Filling nan values with respect to max(month_lag) in history_transacation. Simply by adding number of days with respect max(month lag) . max(purchase_date)+ abs(max(month_lag))*30



In [17]:
hist_month_lag_max=df_hist_trans.loc[index_month_lag_nan,['card_id','purchase_date','month_lag']].groupby('card_id').max()
hist_month_lag_max_=hist_month_lag_max.copy()
new_map=hist_month_lag_max_.purchase_date+ pd.to_timedelta(hist_month_lag_max_.month_lag.abs()*30,unit='D')

train_month_lag_0=df_train.card_id.map(new_map)
test_month_lag_0=df_test.card_id.map(new_map)
hist_lag_0=df_hist_trans.card_id.map( new_map)
new_mech_lag_0=df_new_merchant_trans.card_id.map(new_map)

In [18]:
pur_date_2=pur_date.append(new_map)

### Now you just to add to purch. Incomment below lines if u wanna try method 3

In [19]:

hist_month_lag_min_max=df_hist_trans.loc[index_month_lag_nan,['card_id','month_lag','purchase_date']].groupby('card_id').agg(['min','max'])
hist_month_lag_min_max['average_month']=(hist_month_lag_min_max['purchase_date']['max']-hist_month_lag_min_max['purchase_date']['min'])
hist_month_lag_min_max['average_month']=hist_month_lag_min_max['average_month']/(1-hist_month_lag_min_max.month_lag['min']+hist_month_lag_min_max.month_lag['max'])
hist_month_lag_min_max['month_lag_0_date']=hist_month_lag_min_max.purchase_date['max']-hist_month_lag_min_max.month_lag['max']*hist_month_lag_min_max['average_month']
new_map=hist_month_lag_min_max.month_lag_0_date

In [20]:
pur_date_3=pur_date.append(new_map)

### Now you just to add to purch. Incomment below lines if u wanna try method 4

In [21]:
ref_2018=pd.to_datetime('2018-02')
new_map=new_map.apply(lambda x: ref_2018 if x.year==2018 else x )
pur_date_4=pur_date.append(new_map)
# train_month_lag_0=df_train.card_id.map(pur_date_4)
# test_month_lag_0=df_test.card_id.map(pur_date_4)
# hist_lag_0=df_hist_trans.card_id.map( pur_date_4)
# new_mech_lag_0=df_new_merchant_trans.card_id.map(pur_date_4)

### Finally, uncomment and change name of the files if you wanna save results .

In [22]:
# train_month_lag_0.to_csv('C:/Users/user/Documents/Salamat/ELO/train_month_lag_0_updated.csv',index=False,header=None)
# test_month_lag_0.to_csv('C:/Users/user/Documents/Salamat/ELO/test_month_lag_0_updated.csv',index=False,header=None)
# hist_lag_0.to_csv('C:/Users/user/Documents/Salamat/ELO/hist_month_lag_0_updated.csv',index=False,header=None)
# new_mech_lag_0.to_csv('C:/Users/user/Documents/Salamat/ELO/new_mech_month_lag_0_updated.csv',index=False,header=None)

### 1) is the best solution 4) is second best solution
## Finally, we fill all missing values now we can do some feature engineering

### reload files if required

In [23]:
# df_train = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/train.csv')
# df_test = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/test.csv')
# df_hist_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/historical_transactions_new.csv')
# df_new_merchant_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/new_merchant_transactions_new.csv')

# df_train=reduce_mem_usage(df_train)
# df_test=reduce_mem_usage(df_test)
# df_hist_trans=reduce_mem_usage(df_hist_trans)
# df_new_merchant_trans=reduce_mem_usage(df_new_merchant_trans)


In [24]:

# train_month_lag_0=pd.read_csv('C:/Users/user/Documents/Salamat/ELO/train_month_lag_0.csv',header=None,squeeze=True,parse_dates=[0])
# test_month_lag_0=pd.read_csv('C:/Users/user/Documents/Salamat/ELO/test_month_lag_0.csv',header=None,squeeze=True,parse_dates=[0])
# hist_lag_0=pd.read_csv('C:/Users/user/Documents/Salamat/ELO/hist_month_lag_0.csv',header=None,squeeze=True,parse_dates=[0])
# new_mech_lag_0=pd.read_csv('C:/Users/user/Documents/Salamat/ELO/new_mech_month_lag_0.csv',header=None,squeeze=True,parse_dates=[0])

In [25]:
#for df in [df_hist_trans,df_new_merchant_trans]:
#    df['category_2'].fillna(1.0,inplace=True)
#    df['category_3'].fillna('A',inplace=True)
#    df['merchant_id'].fillna('M_ID_00a6ca8a8a',inplace=True)

In [26]:
pur_date.shape

(293172,)

In [27]:
pur_date_4.head()

card_id
C_ID_00007093c1   2018-02-27 05:14:57
C_ID_0001238066   2018-02-27 16:18:59
C_ID_0001506ef0   2018-02-17 12:33:56
C_ID_0001793786   2017-10-31 20:20:18
C_ID_000183fdda   2018-02-25 20:57:08
dtype: datetime64[ns]

In [28]:
pur_date_4.shape

(325540,)

In [71]:
# pur_date_1.to_csv("pur_date_3.csv",header=None,index='card_id')
# pur_date_2.to_csv("pur_date_3.csv",header=None,index='card_id')
# pur_date_3.to_csv("pur_date_3.csv",header=None,index='card_id')
# pur_date_4.to_csv("pur_date_4.csv",header=None,index='card_id')
## Example of openning the pur_date_4
#pur_date_4_1=pd.read_csv("pur_date_4.csv",header=None,index_col=[0], squeeze=True,parse_dates=[1])


In [65]:
def get_new_columns(name,aggs):
    return [name + '_' + k + '_' + agg for k in aggs.keys() for agg in aggs[k]]

In [30]:
methods=[pur_date_1,pur_date_2,pur_date_3,pur_date_4]
    
    

In [174]:
pur_date_4.head()

card_id
C_ID_00007093c1   2018-02-27 05:14:57
C_ID_0001238066   2018-02-27 16:18:59
C_ID_0001506ef0   2018-02-17 12:33:56
C_ID_0001793786   2017-10-31 20:20:18
C_ID_000183fdda   2018-02-25 20:57:08
dtype: datetime64[ns]

In [197]:
results=[]
CV_error=[]
#methods=[pur_date_1,pur_date_2,pur_date_3,pur_date_4]
methods=[pur_date_4]
for method in methods:
    df_hist_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/historical_transactions_new.csv')
    df_new_merchant_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/new_merchant_transactions_new.csv')
    df_train = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/train.csv')
    df_test = pd.read_csv('test_without_nan.csv') # nan values filled by first purchase from historic transactions
    
    
    
    #I we can take weekofyear,day of week, weekend as extra parameter which I didn't. I am not sure if month diff is correct. Still confused about this issue
    
    for df in [df_hist_trans,df_new_merchant_trans]:
        df['purchase_date'] = pd.to_datetime(df['purchase_date'])
        df['year'] = df['purchase_date'].dt.year
        df['weekofyear'] = df['purchase_date'].dt.weekofyear
        df['month'] = df['purchase_date'].dt.month
        df['dayofweek'] = df['purchase_date'].dt.dayofweek
        df['weekend'] = (df.purchase_date.dt.weekday >=5).astype(int)
        df['hour'] = df['purchase_date'].dt.hour
        df['authorized_flag'] = df['authorized_flag'].map({'Y':1, 'N':0})
        df['category_1'] = df['category_1'].map({'Y':1, 'N':0})
        df['month_diff'] = ((df['card_id'].map(method) - df['purchase_date']).dt.days)//30
        df['month_diff'] += df['month_lag']
        
        #https://www.kaggle.com/c/elo-merchant-category-recommendation/discussion/73244
        #df['month_diff'] = ((datetime.datetime.today() - df['purchase_date']).dt.days)//30
        #df['month_diff'] += df['month_lag']

    # We will use our extracted values for reference. From month_lag at zero (max)    

#     df_hist_trans['month_diff'] = ((df_hist_trans['card_id'].map(method) - df_hist_trans['purchase_date']).dt.days)//30
#     df_hist_trans['month_diff'] += df_hist_trans['month_lag']

#     df_new_merchant_trans['month_diff'] = ((df_new_merchant_trans['card_id'].map(method) - df_new_merchant_trans['purchase_date']).dt.days)//30
#     df_new_merchant_trans['month_diff'] += df_new_merchant_trans['month_lag']


    aggs = {}
    for col in ['month','hour','weekofyear','dayofweek','year','subsector_id','merchant_id','merchant_category_id']:
        aggs[col] = ['nunique']

    aggs['purchase_amount'] = ['sum','max','min','mean','var']
    aggs['installments'] = ['sum','max','min','mean','var']
    aggs['purchase_date'] = ['max','min']
    aggs['month_lag'] = ['max','min','mean','var']
    aggs['month_diff'] = ['mean']
    aggs['authorized_flag'] = ['sum', 'mean']
    aggs['weekend'] = ['sum', 'mean']
    aggs['category_1'] = ['sum', 'mean']
    aggs['card_id'] = ['size']

    for col in ['category_2','category_3']:
        df_hist_trans[col+'_mean'] = df_hist_trans.groupby([col])['purchase_amount'].transform('mean')
        aggs[col+'_mean'] = ['mean']    

    new_columns = get_new_columns('hist',aggs)
    df_hist_trans_group = df_hist_trans.groupby('card_id').agg(aggs)
    df_hist_trans_group.columns = new_columns
    df_hist_trans_group.reset_index(drop=False,inplace=True)
    df_hist_trans_group['hist_purchase_date_diff'] = (df_hist_trans_group['hist_purchase_date_max'] - df_hist_trans_group['hist_purchase_date_min']).dt.days
    df_hist_trans_group['hist_purchase_date_average'] = df_hist_trans_group['hist_purchase_date_diff']/df_hist_trans_group['hist_card_id_size']

    #df_hist_trans_group['hist_purchase_date_uptonow'] = (datetime.datetime.today() - df_hist_trans_group['hist_purchase_date_max']).dt.days
    #df_hist_trans_group['hist_purchase_date_uptonow'] = (hist_lag_0 - df_hist_trans_group['hist_purchase_date_max']).dt.days

    df_hist_trans_group['hist_purchase_date_uptonow'] = (df_hist_trans_group['card_id'].map(method) - df_hist_trans_group['hist_purchase_date_max']).dt.days



    df_train = df_train.merge(df_hist_trans_group,on='card_id',how='left')
    df_test = df_test.merge(df_hist_trans_group,on='card_id',how='left')
    #del df_hist_trans_group;gc.collect()

    aggs = {}
    for col in ['month','hour','weekofyear','dayofweek','year','subsector_id','merchant_id','merchant_category_id']:
        aggs[col] = ['nunique']
    aggs['purchase_amount'] = ['sum','max','min','mean','var']
    aggs['installments'] = ['sum','max','min','mean','var']
    aggs['purchase_date'] = ['max','min']
    aggs['month_lag'] = ['max','min','mean','var']
    aggs['month_diff'] = ['mean']
    aggs['weekend'] = ['sum', 'mean']
    aggs['category_1'] = ['sum', 'mean']
    aggs['card_id'] = ['size']

    for col in ['category_2','category_3']:
        df_new_merchant_trans[col+'_mean'] = df_new_merchant_trans.groupby([col])['purchase_amount'].transform('mean')
        aggs[col+'_mean'] = ['mean']

    new_columns = get_new_columns('new_hist',aggs)
    df_hist_trans_group = df_new_merchant_trans.groupby('card_id').agg(aggs)
    df_hist_trans_group.columns = new_columns
    df_hist_trans_group.reset_index(drop=False,inplace=True)
    df_hist_trans_group['new_hist_purchase_date_diff'] = (df_hist_trans_group['new_hist_purchase_date_max'] - df_hist_trans_group['new_hist_purchase_date_min']).dt.days
    df_hist_trans_group['new_hist_purchase_date_average'] = df_hist_trans_group['new_hist_purchase_date_diff']/df_hist_trans_group['new_hist_card_id_size']

    #df_hist_trans_group['new_hist_purchase_date_uptonow'] = (datetime.datetime.today() - df_hist_trans_group['new_hist_purchase_date_max']).dt.days
    #df_hist_trans_group['new_hist_purchase_date_uptonow'] = (new_mech_lag_0 - df_hist_trans_group['new_hist_purchase_date_max']).dt.days


    df_hist_trans_group['new_hist_purchase_date_uptonow'] = (df_hist_trans_group['card_id'].map(method) - df_hist_trans_group['new_hist_purchase_date_max']).dt.days



    df_train = df_train.merge(df_hist_trans_group,on='card_id',how='left')
    df_test = df_test.merge(df_hist_trans_group,on='card_id',how='left')
    #del df_hist_trans_group;gc.collect()

    df_train['outliers'] = 0
    df_train.loc[df_train['target'] < -30, 'outliers'] = 1
    df_train['outliers'].value_counts()


    for df in [df_train,df_test]:
        df['first_active_month'] = pd.to_datetime(df['first_active_month'])
        df['dayofweek'] = df['first_active_month'].dt.dayofweek
        df['weekofyear'] = df['first_active_month'].dt.weekofyear
        df['month'] = df['first_active_month'].dt.month

        #df['elapsed_time'] = (datetime.datetime.today() - df['first_active_month']).dt.days
        df['elapsed_time'] = (df['card_id'].map(method) - df['first_active_month']).dt.days

        df['hist_first_buy'] = (df['hist_purchase_date_min'] - df['first_active_month']).dt.days
        df['new_hist_first_buy'] = (df['new_hist_purchase_date_min'] - df['first_active_month']).dt.days
        for f in ['hist_purchase_date_max','hist_purchase_date_min','new_hist_purchase_date_max',\
                         'new_hist_purchase_date_min']:
            df[f] = df[f].astype(np.int64) * 1e-9
        df['card_id_total'] = df['new_hist_card_id_size']+df['hist_card_id_size']
        df['purchase_amount_total'] = df['new_hist_purchase_amount_sum']+df['hist_purchase_amount_sum']

#     for f in ['feature_1','feature_2','feature_3']:
#         order_label = df_train.groupby([f])['outliers'].mean()
#         df_train[f] = df_train[f].map(order_label)
#         df_test[f] = df_test[f].map(order_label)


    ### We will change only elapsed time

    #df_train['elapsed_time'] = (df_train['card_id'].map(method) - df_train['first_active_month']).dt.days
    #df_test['elapsed_time'] = (df_test['card_id'].map(method) - df_test['first_active_month']).dt.days  

    df_train_columns = [c for c in df_train.columns if c not in ['card_id', 'first_active_month','target','outliers']]
    #df_train=df_train[df_train.outliers==0]
    target = df_train['target']
    #del df_train['target']

    param = {'num_leaves': 31,
             'min_data_in_leaf': 30, 
             'objective':'regression',
             'max_depth': -1,
             'learning_rate': 0.01,
             "min_child_samples": 20,
             "boosting": "gbdt",
             "feature_fraction": 0.9,
             "bagging_freq": 1,
             "bagging_fraction": 0.9 ,
             "bagging_seed": 11,
             "metric": 'rmse',
             "lambda_l1": 0.1,
             "verbosity": -1,
             "nthread": 4,
             "random_state": 4590}
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=4590)
    oof = np.zeros(len(df_train))
    predictions = np.zeros(len(df_test))
    feature_importance_df = pd.DataFrame()

    for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train,df_train['outliers'].values)):
        print("fold {}".format(fold_))
        trn_data = lgb.Dataset(df_train.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
        val_data = lgb.Dataset(df_train.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

        num_round = 10000
        clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 100)
        oof[val_idx] = clf.predict(df_train.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)

        fold_importance_df = pd.DataFrame()
        fold_importance_df["Feature"] = df_train_columns
        fold_importance_df["importance"] = clf.feature_importance()
        fold_importance_df["fold"] = fold_ + 1
        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

        predictions += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / folds.n_splits

    CV_error.append(np.sqrt(mean_squared_error(oof, target)))
    results.append(predictions)

fold 0
Training until validation scores don't improve for 100 rounds.
[100]	training's rmse: 3.66886	valid_1's rmse: 3.72467
[200]	training's rmse: 3.59118	valid_1's rmse: 3.68887
[300]	training's rmse: 3.54326	valid_1's rmse: 3.67347
[400]	training's rmse: 3.50678	valid_1's rmse: 3.66511
[500]	training's rmse: 3.47738	valid_1's rmse: 3.65915
[600]	training's rmse: 3.45237	valid_1's rmse: 3.65582
[700]	training's rmse: 3.42981	valid_1's rmse: 3.65329
[800]	training's rmse: 3.40921	valid_1's rmse: 3.65211
[900]	training's rmse: 3.39008	valid_1's rmse: 3.65073
[1000]	training's rmse: 3.37374	valid_1's rmse: 3.65024
[1100]	training's rmse: 3.35735	valid_1's rmse: 3.64979
[1200]	training's rmse: 3.34156	valid_1's rmse: 3.64959
[1300]	training's rmse: 3.32589	valid_1's rmse: 3.64908
[1400]	training's rmse: 3.31077	valid_1's rmse: 3.64896
Early stopping, best iteration is:
[1319]	training's rmse: 3.32324	valid_1's rmse: 3.64886
fold 1
Training until validation scores don't improve for 100 ro

In [None]:
CV_error

In [73]:
CV_error

[3.6489937737530007]

In [69]:
CV_error

[3.6499894856856048]

In [209]:
df_train.outliers.shape

(201917,)

In [195]:
from sklearn.model_selection import KFold, RepeatedKFold,RepeatedStratifiedKFold
from sklearn.linear_model import BayesianRidge


In [199]:
param = {'num_leaves': 31,
         'min_data_in_leaf': 32, 
         'objective':'regression',
         'max_depth': -1,
         'learning_rate': 0.005,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.1,
         "nthread": 4,
         "verbosity": -1}

In [200]:
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train.values, df_train['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))

fold n°0
Training until validation scores don't improve for 200 rounds.
[100]	training's rmse: 3.73586	valid_1's rmse: 3.75937
[200]	training's rmse: 3.67147	valid_1's rmse: 3.71703
[300]	training's rmse: 3.62707	valid_1's rmse: 3.69454
[400]	training's rmse: 3.59471	valid_1's rmse: 3.68294
[500]	training's rmse: 3.56743	valid_1's rmse: 3.67564
[600]	training's rmse: 3.54506	valid_1's rmse: 3.67025
[700]	training's rmse: 3.52573	valid_1's rmse: 3.66666
[800]	training's rmse: 3.50806	valid_1's rmse: 3.66269
[900]	training's rmse: 3.49247	valid_1's rmse: 3.65964
[1000]	training's rmse: 3.47827	valid_1's rmse: 3.6578
[1100]	training's rmse: 3.4651	valid_1's rmse: 3.65621
[1200]	training's rmse: 3.4527	valid_1's rmse: 3.65514
[1300]	training's rmse: 3.44112	valid_1's rmse: 3.65405
[1400]	training's rmse: 3.43048	valid_1's rmse: 3.65332
[1500]	training's rmse: 3.42028	valid_1's rmse: 3.65277
[1600]	training's rmse: 3.41055	valid_1's rmse: 3.65206
[1700]	training's rmse: 3.4013	valid_1's rms

[1800]	training's rmse: 3.3906	valid_1's rmse: 3.65851
[1900]	training's rmse: 3.38164	valid_1's rmse: 3.65811
[2000]	training's rmse: 3.37334	valid_1's rmse: 3.65764
[2100]	training's rmse: 3.36515	valid_1's rmse: 3.65724
[2200]	training's rmse: 3.35735	valid_1's rmse: 3.65709
[2300]	training's rmse: 3.35011	valid_1's rmse: 3.65696
[2400]	training's rmse: 3.34313	valid_1's rmse: 3.65673
[2500]	training's rmse: 3.33596	valid_1's rmse: 3.65659
[2600]	training's rmse: 3.32911	valid_1's rmse: 3.65654
[2700]	training's rmse: 3.32184	valid_1's rmse: 3.65643
[2800]	training's rmse: 3.31496	valid_1's rmse: 3.65633
[2900]	training's rmse: 3.3076	valid_1's rmse: 3.65632
[3000]	training's rmse: 3.30018	valid_1's rmse: 3.6563
[3100]	training's rmse: 3.29309	valid_1's rmse: 3.6564
Early stopping, best iteration is:
[2947]	training's rmse: 3.30421	valid_1's rmse: 3.6562
CV score: 3.65055 


In [201]:
lgbparam = {'num_leaves': 31,
            'boosting_type': 'rf',
             'min_data_in_leaf': 30, 
             'objective':'regression',
             'max_depth': -1,
             'learning_rate': 0.01,
             "min_child_samples": 20,
             "boosting": "gbdt",
             "feature_fraction": 0.9,
             "bagging_freq": 1,
             "bagging_fraction": 0.9 ,
             "bagging_seed": 11,
             "metric": 'rmse',
             "lambda_l1": 0.1,
             "verbosity": -1,
             "nthread": 4,
             "random_state": 4590}

In [202]:
#from sklearn.model_selection import RepeatedKFold
folds = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=4520)

oof_lgb = np.zeros(len(df_train))
predictions_lgb = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train,df_train['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 11000
    clf = lgb.train(lgbparam, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 100)
    oof_lgb[val_idx] = clf.predict(df_train.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions_lgb += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / (5 * 2)

print("CV score: {:<8.5f}".format(mean_squared_error(oof_lgb, target)**0.5))



fold n°0
Training until validation scores don't improve for 100 rounds.
[100]	training's rmse: 3.67343	valid_1's rmse: 3.7118
[200]	training's rmse: 3.59598	valid_1's rmse: 3.67579
[300]	training's rmse: 3.54689	valid_1's rmse: 3.6603
[400]	training's rmse: 3.51014	valid_1's rmse: 3.65252
[500]	training's rmse: 3.48141	valid_1's rmse: 3.64794
[600]	training's rmse: 3.4561	valid_1's rmse: 3.64449
[700]	training's rmse: 3.43343	valid_1's rmse: 3.64201
[800]	training's rmse: 3.41318	valid_1's rmse: 3.6397
[900]	training's rmse: 3.39459	valid_1's rmse: 3.63834
[1000]	training's rmse: 3.37798	valid_1's rmse: 3.6371
[1100]	training's rmse: 3.36131	valid_1's rmse: 3.63657
[1200]	training's rmse: 3.34485	valid_1's rmse: 3.6356
[1300]	training's rmse: 3.32922	valid_1's rmse: 3.63549
Early stopping, best iteration is:
[1299]	training's rmse: 3.3293	valid_1's rmse: 3.63547
fold n°1
Training until validation scores don't improve for 100 rounds.
[100]	training's rmse: 3.67046	valid_1's rmse: 3.7189

[1600]	training's rmse: 3.27924	valid_1's rmse: 3.66032
[1700]	training's rmse: 3.26654	valid_1's rmse: 3.66044
[1800]	training's rmse: 3.25208	valid_1's rmse: 3.6601
Early stopping, best iteration is:
[1783]	training's rmse: 3.25421	valid_1's rmse: 3.66005
CV score: 3.65308 


In [187]:
predictions_lgb

array([-2.09231531, -0.17194729, -0.95052083, ...,  0.85005684,
       -2.80954205,  0.09606236])

In [158]:
sub_df1 = pd.DataFrame({"card_id":df_test["card_id"].values})
sub_df1["target"] = predictions_lgb
sub_df1.to_csv("submit_lgb1_stacking_w_outlier.csv", index=False)

In [203]:
train_stack = np.vstack([oof,oof_lgb]).transpose()
test_stack = np.vstack([predictions,predictions_lgb]).transpose()

folds = RepeatedKFold(n_splits=5,n_repeats=1,random_state=4520)
oof_stack = np.zeros(train_stack.shape[0])
predictions_stack = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, target)):
    print("fold n°{}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values

    print("-" * 10 + "Stacking " + str(fold_) + "-" * 10)
#     cb_model = CatBoostRegressor(iterations=3000, learning_rate=0.1, depth=8, l2_leaf_reg=20, bootstrap_type='Bernoulli',  eval_metric='RMSE', metric_period=50, od_type='Iter', od_wait=45, random_seed=17, allow_writing_files=False)
#     cb_model.fit(trn_data, trn_y, eval_set=(val_data, val_y), cat_features=[], use_best_model=True, verbose=True)
    clf = BayesianRidge()
    clf.fit(trn_data, trn_y)
    
    oof_stack[val_idx] = clf.predict(val_data)
    predictions_stack += clf.predict(test_stack) / 5


np.sqrt(mean_squared_error(target.values, oof_stack))

fold n°0
----------Stacking 0----------
fold n°1
----------Stacking 1----------
fold n°2
----------Stacking 2----------
fold n°3
----------Stacking 3----------
fold n°4
----------Stacking 4----------


3.6491930700667297

In [163]:
predictions_stack

array([-2.1060703 , -0.17353636, -0.96257875, ...,  0.87252091,
       -2.99655373,  0.13051841])

In [204]:
sample_submission = pd.read_csv('sample_submission.csv')
sample_submission['target'] = predictions_stack
sample_submission.to_csv('Bayesian_Ridge_Stacking_w_outliers_pur_date_4.csv', index=False)

### Now let's predict  outliers with binary classifications and binary logloss metric

In [210]:
target=df_train.outliers

In [211]:
param = {'num_leaves': 31,
         'min_data_in_leaf': 32, 
         'objective':'binary',
         'max_depth': -1,
         'learning_rate': 0.005,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'binary_logloss',
         "lambda_l1": 0.1,
         "nthread": 4,
         "verbosity": -1}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train.values, df_train['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))

fold n°0
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.0465725	valid_1's binary_logloss: 0.0494251
[200]	training's binary_logloss: 0.0423393	valid_1's binary_logloss: 0.0467432
[300]	training's binary_logloss: 0.0398121	valid_1's binary_logloss: 0.04563
[400]	training's binary_logloss: 0.0379655	valid_1's binary_logloss: 0.0451054
[500]	training's binary_logloss: 0.0364101	valid_1's binary_logloss: 0.0448216
[600]	training's binary_logloss: 0.0350469	valid_1's binary_logloss: 0.0446494
[700]	training's binary_logloss: 0.0338615	valid_1's binary_logloss: 0.0445619
[800]	training's binary_logloss: 0.0328614	valid_1's binary_logloss: 0.0444887
[900]	training's binary_logloss: 0.031945	valid_1's binary_logloss: 0.044434
[1000]	training's binary_logloss: 0.0310842	valid_1's binary_logloss: 0.0444219
[1100]	training's binary_logloss: 0.0303012	valid_1's binary_logloss: 0.0443903
[1200]	training's binary_logloss: 0.0295564	valid_1's binary_

In [212]:
lgbparam = {'num_leaves': 31,
            'boosting_type': 'rf',
             'min_data_in_leaf': 30, 
             'objective':'binary',
             'max_depth': -1,
             'learning_rate': 0.01,
             "min_child_samples": 20,
             "boosting": "gbdt",
             "feature_fraction": 0.9,
             "bagging_freq": 1,
             "bagging_fraction": 0.9 ,
             "bagging_seed": 11,
             "metric": 'binary_logloss',
             "lambda_l1": 0.1,
             "verbosity": -1,
             "nthread": 4,
             "random_state": 4590}

folds = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=4520)

oof_lgb = np.zeros(len(df_train))
predictions_lgb = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train,df_train['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 11000
    clf = lgb.train(lgbparam, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 100)
    oof_lgb[val_idx] = clf.predict(df_train.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions_lgb += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / (5 * 2)

print("CV score: {:<8.5f}".format(mean_squared_error(oof_lgb, target)**0.5))


fold n°0
Training until validation scores don't improve for 100 rounds.
[100]	training's binary_logloss: 0.042444	valid_1's binary_logloss: 0.0464311
[200]	training's binary_logloss: 0.037997	valid_1's binary_logloss: 0.0445148
[300]	training's binary_logloss: 0.0350549	valid_1's binary_logloss: 0.0440185
[400]	training's binary_logloss: 0.0328816	valid_1's binary_logloss: 0.0437926
[500]	training's binary_logloss: 0.0311618	valid_1's binary_logloss: 0.0437086
[600]	training's binary_logloss: 0.0296233	valid_1's binary_logloss: 0.0436764
[700]	training's binary_logloss: 0.0282174	valid_1's binary_logloss: 0.043643
[800]	training's binary_logloss: 0.0269396	valid_1's binary_logloss: 0.0436123
[900]	training's binary_logloss: 0.0257815	valid_1's binary_logloss: 0.0435908
[1000]	training's binary_logloss: 0.0246591	valid_1's binary_logloss: 0.0436136
Early stopping, best iteration is:
[948]	training's binary_logloss: 0.0252231	valid_1's binary_logloss: 0.0435791
fold n°1
Training until va

In [213]:
train_stack = np.vstack([oof,oof_lgb]).transpose()
test_stack = np.vstack([predictions,predictions_lgb]).transpose()

folds = RepeatedKFold(n_splits=5,n_repeats=1,random_state=4520)
oof_stack = np.zeros(train_stack.shape[0])
predictions_stack = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, target)):
    print("fold n°{}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values

    print("-" * 10 + "Stacking " + str(fold_) + "-" * 10)
#     cb_model = CatBoostRegressor(iterations=3000, learning_rate=0.1, depth=8, l2_leaf_reg=20, bootstrap_type='Bernoulli',  eval_metric='RMSE', metric_period=50, od_type='Iter', od_wait=45, random_seed=17, allow_writing_files=False)
#     cb_model.fit(trn_data, trn_y, eval_set=(val_data, val_y), cat_features=[], use_best_model=True, verbose=True)
    clf = BayesianRidge()
    clf.fit(trn_data, trn_y)
    
    oof_stack[val_idx] = clf.predict(val_data)
    predictions_stack += clf.predict(test_stack) / 5


np.sqrt(mean_squared_error(target.values, oof_stack))

fold n°0
----------Stacking 0----------
fold n°1
----------Stacking 1----------
fold n°2
----------Stacking 2----------
fold n°3
----------Stacking 3----------
fold n°4
----------Stacking 4----------


0.09925568471873035

In [214]:
sample_submission = pd.read_csv('sample_submission.csv')
sample_submission['target'] = predictions_stack
sample_submission.to_csv('Bayesian_Ridge_Stacking_outliers_map_pur_date_4.csv', index=False)

### Now. let's do predictions without outliers

In [218]:
df_train1=df_train[df_train.outliers==0]
target=df_train1.target

In [219]:
param = {'num_leaves': 31,
         'min_data_in_leaf': 32, 
         'objective':'regression',
         'max_depth': -1,
         'learning_rate': 0.005,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.1,
         "nthread": 4,
         "verbosity": -1}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train1))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train1.values, df_train1['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train1.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train1.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train1.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))

fold n°0
Training until validation scores don't improve for 200 rounds.
[100]	training's rmse: 1.64179	valid_1's rmse: 1.64989
[200]	training's rmse: 1.60555	valid_1's rmse: 1.61671
[300]	training's rmse: 1.58583	valid_1's rmse: 1.59937
[400]	training's rmse: 1.57369	valid_1's rmse: 1.58911
[500]	training's rmse: 1.565	valid_1's rmse: 1.5824
[600]	training's rmse: 1.55842	valid_1's rmse: 1.57776
[700]	training's rmse: 1.55304	valid_1's rmse: 1.57441
[800]	training's rmse: 1.54852	valid_1's rmse: 1.57182
[900]	training's rmse: 1.54455	valid_1's rmse: 1.56977
[1000]	training's rmse: 1.54102	valid_1's rmse: 1.56822
[1100]	training's rmse: 1.53777	valid_1's rmse: 1.567
[1200]	training's rmse: 1.53483	valid_1's rmse: 1.56602
[1300]	training's rmse: 1.53209	valid_1's rmse: 1.56529
[1400]	training's rmse: 1.52949	valid_1's rmse: 1.56466
[1500]	training's rmse: 1.52708	valid_1's rmse: 1.56418
[1600]	training's rmse: 1.52477	valid_1's rmse: 1.56378
[1700]	training's rmse: 1.52254	valid_1's rmse

[3500]	training's rmse: 1.49585	valid_1's rmse: 1.53725
[3600]	training's rmse: 1.49423	valid_1's rmse: 1.53719
[3700]	training's rmse: 1.49258	valid_1's rmse: 1.53712
[3800]	training's rmse: 1.49097	valid_1's rmse: 1.53709
[3900]	training's rmse: 1.48939	valid_1's rmse: 1.53707
[4000]	training's rmse: 1.48776	valid_1's rmse: 1.53703
[4100]	training's rmse: 1.48615	valid_1's rmse: 1.53699
[4200]	training's rmse: 1.48458	valid_1's rmse: 1.53696
[4300]	training's rmse: 1.48301	valid_1's rmse: 1.53691
[4400]	training's rmse: 1.48149	valid_1's rmse: 1.53685
[4500]	training's rmse: 1.47992	valid_1's rmse: 1.53681
[4600]	training's rmse: 1.47832	valid_1's rmse: 1.53676
[4700]	training's rmse: 1.47684	valid_1's rmse: 1.53669
[4800]	training's rmse: 1.47531	valid_1's rmse: 1.53666
[4900]	training's rmse: 1.47373	valid_1's rmse: 1.53664
[5000]	training's rmse: 1.4722	valid_1's rmse: 1.53658
[5100]	training's rmse: 1.47065	valid_1's rmse: 1.53655
[5200]	training's rmse: 1.46917	valid_1's rmse: 1

In [220]:
lgbparam = {'num_leaves': 31,
            'boosting_type': 'rf',
             'min_data_in_leaf': 30, 
             'objective':'regression',
             'max_depth': -1,
             'learning_rate': 0.01,
             "min_child_samples": 20,
             "boosting": "gbdt",
             "feature_fraction": 0.9,
             "bagging_freq": 1,
             "bagging_fraction": 0.9 ,
             "bagging_seed": 11,
             "metric": 'rmse',
             "lambda_l1": 0.1,
             "verbosity": -1,
             "nthread": 4,
             "random_state": 4590}

folds = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=4520)

oof_lgb = np.zeros(len(df_train1))
predictions_lgb = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train1,df_train1['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train1.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train1.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 11000
    clf = lgb.train(lgbparam, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 100)
    oof_lgb[val_idx] = clf.predict(df_train1.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions_lgb += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / (5 * 2)

print("CV score: {:<8.5f}".format(mean_squared_error(oof_lgb, target)**0.5))


fold n°0
Training until validation scores don't improve for 100 rounds.
[100]	training's rmse: 1.60625	valid_1's rmse: 1.61426
[200]	training's rmse: 1.57443	valid_1's rmse: 1.58652
[300]	training's rmse: 1.55907	valid_1's rmse: 1.57509
[400]	training's rmse: 1.54926	valid_1's rmse: 1.56914
[500]	training's rmse: 1.54188	valid_1's rmse: 1.56567
[600]	training's rmse: 1.53575	valid_1's rmse: 1.56341
[700]	training's rmse: 1.53051	valid_1's rmse: 1.56213
[800]	training's rmse: 1.52571	valid_1's rmse: 1.56123
[900]	training's rmse: 1.52132	valid_1's rmse: 1.56059
[1000]	training's rmse: 1.51738	valid_1's rmse: 1.5601
[1100]	training's rmse: 1.51353	valid_1's rmse: 1.5597
[1200]	training's rmse: 1.50982	valid_1's rmse: 1.55947
[1300]	training's rmse: 1.50633	valid_1's rmse: 1.55928
[1400]	training's rmse: 1.50277	valid_1's rmse: 1.55911
[1500]	training's rmse: 1.49938	valid_1's rmse: 1.559
[1600]	training's rmse: 1.49611	valid_1's rmse: 1.55896
[1700]	training's rmse: 1.4929	valid_1's rmse

[1900]	training's rmse: 1.48431	valid_1's rmse: 1.56346
[2000]	training's rmse: 1.48112	valid_1's rmse: 1.56335
[2100]	training's rmse: 1.47799	valid_1's rmse: 1.5633
[2200]	training's rmse: 1.4749	valid_1's rmse: 1.56329
[2300]	training's rmse: 1.47198	valid_1's rmse: 1.5632
[2400]	training's rmse: 1.46898	valid_1's rmse: 1.5632
[2500]	training's rmse: 1.46608	valid_1's rmse: 1.56312
Early stopping, best iteration is:
[2491]	training's rmse: 1.46631	valid_1's rmse: 1.5631
fold n°6
Training until validation scores don't improve for 100 rounds.
[100]	training's rmse: 1.60835	valid_1's rmse: 1.60489
[200]	training's rmse: 1.57664	valid_1's rmse: 1.57776
[300]	training's rmse: 1.56139	valid_1's rmse: 1.56649
[400]	training's rmse: 1.55144	valid_1's rmse: 1.56039
[500]	training's rmse: 1.5438	valid_1's rmse: 1.55685
[600]	training's rmse: 1.53756	valid_1's rmse: 1.55471
[700]	training's rmse: 1.53216	valid_1's rmse: 1.55343
[800]	training's rmse: 1.52728	valid_1's rmse: 1.55252
[900]	train

In [221]:
train_stack = np.vstack([oof,oof_lgb]).transpose()
test_stack = np.vstack([predictions,predictions_lgb]).transpose()

folds = RepeatedKFold(n_splits=5,n_repeats=1,random_state=4520)
oof_stack = np.zeros(train_stack.shape[0])
predictions_stack = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, target)):
    print("fold n°{}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values

    print("-" * 10 + "Stacking " + str(fold_) + "-" * 10)
#     cb_model = CatBoostRegressor(iterations=3000, learning_rate=0.1, depth=8, l2_leaf_reg=20, bootstrap_type='Bernoulli',  eval_metric='RMSE', metric_period=50, od_type='Iter', od_wait=45, random_seed=17, allow_writing_files=False)
#     cb_model.fit(trn_data, trn_y, eval_set=(val_data, val_y), cat_features=[], use_best_model=True, verbose=True)
    clf = BayesianRidge()
    clf.fit(trn_data, trn_y)
    
    oof_stack[val_idx] = clf.predict(val_data)
    predictions_stack += clf.predict(test_stack) / 5


np.sqrt(mean_squared_error(target.values, oof_stack))

fold n°0
----------Stacking 0----------
fold n°1
----------Stacking 1----------
fold n°2
----------Stacking 2----------
fold n°3
----------Stacking 3----------
fold n°4
----------Stacking 4----------


1.5544492450406158

In [222]:
sample_submission = pd.read_csv('sample_submission.csv')
sample_submission['target'] = predictions_stack
sample_submission.to_csv('Bayesian_Ridge_Stacking_wo_outliers_pur_date_4.csv', index=False)

In [231]:
map1=pd.read_csv('Bayesian_Ridge_Stacking_outliers_map_pur_date_4.csv',squeeze=True)
map1.set_index('card_id',inplace=True)

In [235]:
map2=map1.sort_values('target',ascending=False).head(25000)

In [223]:
model_with_outlier=pd.read_csv("Bayesian_Ridge_Stacking_w_outliers_pur_date_4.csv")
model_without_outlier=pd.read_csv("Bayesian_Ridge_Stacking_wo_outliers_pur_date_4.csv")
model_with_outlier.set_index('card_id',inplace=True)
model_without_outlier.set_index('card_id',inplace=True)

In [236]:
model_with_mixed=model_without_outlier.copy()


Unnamed: 0_level_0,target
card_id,Unnamed: 1_level_1
C_ID_aae50409e7,-22.631091
C_ID_a74b12dcf8,-25.189084
C_ID_bced41d837,-16.296076
C_ID_6ab591cf62,-21.624292
C_ID_ac114ef831,-21.948205


In [139]:
map2=map1.sort_values('target',ascending=False).head(25000)

In [166]:
model_with_outlier=pd.read_csv("Bayesian_Ridge_Stacking_w_outliers.csv")
model_without_outlier=pd.read_csv("Bayesian_Ridge_Stacking_wo_outliers.csv")
model_with_outlier.set_index('card_id',inplace=True)
model_without_outlier.set_index('card_id',inplace=True)

In [238]:
model_with_mixed=model_without_outlier.copy()
model_with_mixed.loc[map2.index]=model_with_outlier.loc[map2.index]
model_with_mixed.reset_index(inplace=True)
model_with_mixed.to_csv("Bayesian_Ridge_Stacking_mixed_outliers_pur_date_4.csv",index=False)

In [None]:
map2=map1.sort_values('target',ascending=False).head(25000)

In [240]:
model_with_mixed.target.min()

-25.189084327999662

### We not to try replacing more target values with outliers in final submission
### We can move treshould value of 25 000

### Now let's try to do wi 50 000 and 12 500 basically logarithmic search


In [428]:
map1=pd.read_csv('Bayesian_Ridge_Stacking_outliers_map_pur_date_4.csv',squeeze=True)
map1.set_index('card_id',inplace=True)

threshould=37500
map2=map1.sort_values('target',ascending=False).head(threshould)

model_with_outlier=pd.read_csv("Bayesian_Ridge_Stacking_w_outliers.csv")
model_without_outlier=pd.read_csv("Bayesian_Ridge_Stacking_wo_outliers.csv")
model_with_outlier.set_index('card_id',inplace=True)
model_without_outlier.set_index('card_id',inplace=True)

model_with_mixed=model_without_outlier.copy()
model_with_mixed.loc[map2.index]=model_with_outlier.loc[map2.index]
model_with_mixed.reset_index(inplace=True)
model_with_mixed.to_csv("Bayesian_Ridge_Stacking_mixed_outliers_pur_date_4_"+str(threshould)+".csv",index=False)

### best score is still for 25 000 . We need to find models to find best threshould values. This might be done by predicting it for training set and try to change treshould value. The possible problem might arise due to overfitting on training set. But still worth to try.


In [245]:

target=df_train.outliers

In [246]:
param = {'num_leaves': 31,
         'min_data_in_leaf': 32, 
         'objective':'binary',
         'max_depth': -1,
         'learning_rate': 0.005,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'binary_logloss',
         "lambda_l1": 0.1,
         "nthread": 4,
         "verbosity": -1}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train.values, df_train['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))

fold n°0
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.0465725	valid_1's binary_logloss: 0.0494251
[200]	training's binary_logloss: 0.0423393	valid_1's binary_logloss: 0.0467432
[300]	training's binary_logloss: 0.0398121	valid_1's binary_logloss: 0.04563
[400]	training's binary_logloss: 0.0379655	valid_1's binary_logloss: 0.0451054
[500]	training's binary_logloss: 0.0364101	valid_1's binary_logloss: 0.0448216
[600]	training's binary_logloss: 0.0350469	valid_1's binary_logloss: 0.0446494
[700]	training's binary_logloss: 0.0338615	valid_1's binary_logloss: 0.0445619
[800]	training's binary_logloss: 0.0328614	valid_1's binary_logloss: 0.0444887
[900]	training's binary_logloss: 0.031945	valid_1's binary_logloss: 0.044434
[1000]	training's binary_logloss: 0.0310842	valid_1's binary_logloss: 0.0444219
[1100]	training's binary_logloss: 0.0303012	valid_1's binary_logloss: 0.0443903
[1200]	training's binary_logloss: 0.0295564	valid_1's binary_

In [247]:
lgbparam = {'num_leaves': 31,
            'boosting_type': 'rf',
             'min_data_in_leaf': 30, 
             'objective':'binary',
             'max_depth': -1,
             'learning_rate': 0.01,
             "min_child_samples": 20,
             "boosting": "gbdt",
             "feature_fraction": 0.9,
             "bagging_freq": 1,
             "bagging_fraction": 0.9 ,
             "bagging_seed": 11,
             "metric": 'binary_logloss',
             "lambda_l1": 0.1,
             "verbosity": -1,
             "nthread": 4,
             "random_state": 4590}

folds = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=4520)

oof_lgb = np.zeros(len(df_train))
predictions_lgb = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train,df_train['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 11000
    clf = lgb.train(lgbparam, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 100)
    oof_lgb[val_idx] = clf.predict(df_train.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions_lgb += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / (5 * 2)

print("CV score: {:<8.5f}".format(mean_squared_error(oof_lgb, target)**0.5))

fold n°0
Training until validation scores don't improve for 100 rounds.
[100]	training's binary_logloss: 0.042444	valid_1's binary_logloss: 0.0464311
[200]	training's binary_logloss: 0.037997	valid_1's binary_logloss: 0.0445148
[300]	training's binary_logloss: 0.0350549	valid_1's binary_logloss: 0.0440185
[400]	training's binary_logloss: 0.0328816	valid_1's binary_logloss: 0.0437926
[500]	training's binary_logloss: 0.0311618	valid_1's binary_logloss: 0.0437086
[600]	training's binary_logloss: 0.0296233	valid_1's binary_logloss: 0.0436764
[700]	training's binary_logloss: 0.0282174	valid_1's binary_logloss: 0.043643
[800]	training's binary_logloss: 0.0269396	valid_1's binary_logloss: 0.0436123
[900]	training's binary_logloss: 0.0257815	valid_1's binary_logloss: 0.0435908
[1000]	training's binary_logloss: 0.0246591	valid_1's binary_logloss: 0.0436136
Early stopping, best iteration is:
[948]	training's binary_logloss: 0.0252231	valid_1's binary_logloss: 0.0435791
fold n°1
Training until va

In [248]:
train_stack = np.vstack([oof,oof_lgb]).transpose()
test_stack = np.vstack([predictions,predictions_lgb]).transpose()

folds = RepeatedKFold(n_splits=5,n_repeats=1,random_state=4520)
oof_stack = np.zeros(train_stack.shape[0])
predictions_stack = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, target)):
    print("fold n°{}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values

    print("-" * 10 + "Stacking " + str(fold_) + "-" * 10)
#     cb_model = CatBoostRegressor(iterations=3000, learning_rate=0.1, depth=8, l2_leaf_reg=20, bootstrap_type='Bernoulli',  eval_metric='RMSE', metric_period=50, od_type='Iter', od_wait=45, random_seed=17, allow_writing_files=False)
#     cb_model.fit(trn_data, trn_y, eval_set=(val_data, val_y), cat_features=[], use_best_model=True, verbose=True)
    clf = BayesianRidge()
    clf.fit(trn_data, trn_y)
    
    oof_stack[val_idx] = clf.predict(val_data)
    predictions_stack += clf.predict(test_stack) / 5


np.sqrt(mean_squared_error(target.values, oof_stack))

fold n°0
----------Stacking 0----------
fold n°1
----------Stacking 1----------
fold n°2
----------Stacking 2----------
fold n°3
----------Stacking 3----------
fold n°4
----------Stacking 4----------


0.09925568471873035

In [252]:
train_stack.shape

(201917, 2)

In [253]:
train_map=clf.predict(train_stack)

### Now let's get df_train predicted values without outliers

In [359]:
df_train1=df_train[df_train.outliers==0]
target=df_train1.target

In [360]:
param = {'num_leaves': 31,
         'min_data_in_leaf': 32, 
         'objective':'regression',
         'max_depth': -1,
         'learning_rate': 0.005,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.1,
         "nthread": 4,
         "verbosity": -1}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train1))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train1.values, df_train1['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train1.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train1.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train1.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))


fold n°0
Training until validation scores don't improve for 200 rounds.
[100]	training's rmse: 1.64179	valid_1's rmse: 1.64989
[200]	training's rmse: 1.60555	valid_1's rmse: 1.61671
[300]	training's rmse: 1.58583	valid_1's rmse: 1.59937
[400]	training's rmse: 1.57369	valid_1's rmse: 1.58911
[500]	training's rmse: 1.565	valid_1's rmse: 1.5824
[600]	training's rmse: 1.55842	valid_1's rmse: 1.57776
[700]	training's rmse: 1.55304	valid_1's rmse: 1.57441
[800]	training's rmse: 1.54852	valid_1's rmse: 1.57182
[900]	training's rmse: 1.54455	valid_1's rmse: 1.56977
[1000]	training's rmse: 1.54102	valid_1's rmse: 1.56822
[1100]	training's rmse: 1.53777	valid_1's rmse: 1.567
[1200]	training's rmse: 1.53483	valid_1's rmse: 1.56602
[1300]	training's rmse: 1.53209	valid_1's rmse: 1.56529
[1400]	training's rmse: 1.52949	valid_1's rmse: 1.56466
[1500]	training's rmse: 1.52708	valid_1's rmse: 1.56418
[1600]	training's rmse: 1.52477	valid_1's rmse: 1.56378
[1700]	training's rmse: 1.52254	valid_1's rmse

[3500]	training's rmse: 1.49585	valid_1's rmse: 1.53725
[3600]	training's rmse: 1.49423	valid_1's rmse: 1.53719
[3700]	training's rmse: 1.49258	valid_1's rmse: 1.53712
[3800]	training's rmse: 1.49097	valid_1's rmse: 1.53709
[3900]	training's rmse: 1.48939	valid_1's rmse: 1.53707
[4000]	training's rmse: 1.48776	valid_1's rmse: 1.53703
[4100]	training's rmse: 1.48615	valid_1's rmse: 1.53699
[4200]	training's rmse: 1.48458	valid_1's rmse: 1.53696
[4300]	training's rmse: 1.48301	valid_1's rmse: 1.53691
[4400]	training's rmse: 1.48149	valid_1's rmse: 1.53685
[4500]	training's rmse: 1.47992	valid_1's rmse: 1.53681
[4600]	training's rmse: 1.47832	valid_1's rmse: 1.53676
[4700]	training's rmse: 1.47684	valid_1's rmse: 1.53669
[4800]	training's rmse: 1.47531	valid_1's rmse: 1.53666
[4900]	training's rmse: 1.47373	valid_1's rmse: 1.53664
[5000]	training's rmse: 1.4722	valid_1's rmse: 1.53658
[5100]	training's rmse: 1.47065	valid_1's rmse: 1.53655
[5200]	training's rmse: 1.46917	valid_1's rmse: 1

In [361]:
lgbparam = {'num_leaves': 31,
            'boosting_type': 'rf',
             'min_data_in_leaf': 30, 
             'objective':'regression',
             'max_depth': -1,
             'learning_rate': 0.01,
             "min_child_samples": 20,
             "boosting": "gbdt",
             "feature_fraction": 0.9,
             "bagging_freq": 1,
             "bagging_fraction": 0.9 ,
             "bagging_seed": 11,
             "metric": 'rmse',
             "lambda_l1": 0.1,
             "verbosity": -1,
             "nthread": 4,
             "random_state": 4590}

folds = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=4520)

oof_lgb = np.zeros(len(df_train1))
predictions_lgb = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train1,df_train1['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train1.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train1.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 11000
    clf = lgb.train(lgbparam, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 100)
    oof_lgb[val_idx] = clf.predict(df_train1.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions_lgb += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / (5 * 2)

print("CV score: {:<8.5f}".format(mean_squared_error(oof_lgb, target)**0.5))


fold n°0
Training until validation scores don't improve for 100 rounds.
[100]	training's rmse: 1.60625	valid_1's rmse: 1.61426
[200]	training's rmse: 1.57443	valid_1's rmse: 1.58652
[300]	training's rmse: 1.55907	valid_1's rmse: 1.57509
[400]	training's rmse: 1.54926	valid_1's rmse: 1.56914
[500]	training's rmse: 1.54188	valid_1's rmse: 1.56567
[600]	training's rmse: 1.53575	valid_1's rmse: 1.56341
[700]	training's rmse: 1.53051	valid_1's rmse: 1.56213
[800]	training's rmse: 1.52571	valid_1's rmse: 1.56123
[900]	training's rmse: 1.52132	valid_1's rmse: 1.56059
[1000]	training's rmse: 1.51738	valid_1's rmse: 1.5601
[1100]	training's rmse: 1.51353	valid_1's rmse: 1.5597
[1200]	training's rmse: 1.50982	valid_1's rmse: 1.55947
[1300]	training's rmse: 1.50633	valid_1's rmse: 1.55928
[1400]	training's rmse: 1.50277	valid_1's rmse: 1.55911
[1500]	training's rmse: 1.49938	valid_1's rmse: 1.559
[1600]	training's rmse: 1.49611	valid_1's rmse: 1.55896
[1700]	training's rmse: 1.4929	valid_1's rmse

[1900]	training's rmse: 1.48431	valid_1's rmse: 1.56346
[2000]	training's rmse: 1.48112	valid_1's rmse: 1.56335
[2100]	training's rmse: 1.47799	valid_1's rmse: 1.5633
[2200]	training's rmse: 1.4749	valid_1's rmse: 1.56329
[2300]	training's rmse: 1.47198	valid_1's rmse: 1.5632
[2400]	training's rmse: 1.46898	valid_1's rmse: 1.5632
[2500]	training's rmse: 1.46608	valid_1's rmse: 1.56312
Early stopping, best iteration is:
[2491]	training's rmse: 1.46631	valid_1's rmse: 1.5631
fold n°6
Training until validation scores don't improve for 100 rounds.
[100]	training's rmse: 1.60835	valid_1's rmse: 1.60489
[200]	training's rmse: 1.57664	valid_1's rmse: 1.57776
[300]	training's rmse: 1.56139	valid_1's rmse: 1.56649
[400]	training's rmse: 1.55144	valid_1's rmse: 1.56039
[500]	training's rmse: 1.5438	valid_1's rmse: 1.55685
[600]	training's rmse: 1.53756	valid_1's rmse: 1.55471
[700]	training's rmse: 1.53216	valid_1's rmse: 1.55343
[800]	training's rmse: 1.52728	valid_1's rmse: 1.55252
[900]	train

In [362]:
train_stack = np.vstack([oof,oof_lgb]).transpose()
test_stack = np.vstack([predictions,predictions_lgb]).transpose()

folds = RepeatedKFold(n_splits=5,n_repeats=1,random_state=4520)
oof_stack = np.zeros(train_stack.shape[0])
predictions_stack = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, target)):
    print("fold n°{}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values

    print("-" * 10 + "Stacking " + str(fold_) + "-" * 10)
#     cb_model = CatBoostRegressor(iterations=3000, learning_rate=0.1, depth=8, l2_leaf_reg=20, bootstrap_type='Bernoulli',  eval_metric='RMSE', metric_period=50, od_type='Iter', od_wait=45, random_seed=17, allow_writing_files=False)
#     cb_model.fit(trn_data, trn_y, eval_set=(val_data, val_y), cat_features=[], use_best_model=True, verbose=True)
    clf = BayesianRidge()
    clf.fit(trn_data, trn_y)
    
    oof_stack[val_idx] = clf.predict(val_data)
    predictions_stack += clf.predict(test_stack) / 5


np.sqrt(mean_squared_error(target.values, oof_stack))

fold n°0
----------Stacking 0----------
fold n°1
----------Stacking 1----------
fold n°2
----------Stacking 2----------
fold n°3
----------Stacking 3----------
fold n°4
----------Stacking 4----------


1.5544492450406158

In [363]:
train_wo_outliers=clf.predict(train_stack1)

### Now let's predict with outliers


In [354]:
df_train1=df_train
target=df_train1.target

In [355]:
param = {'num_leaves': 31,
         'min_data_in_leaf': 32, 
         'objective':'regression',
         'max_depth': -1,
         'learning_rate': 0.005,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.1,
         "nthread": 4,
         "verbosity": -1}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train1))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train1.values, df_train1['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train1.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train1.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train1.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))


fold n°0
Training until validation scores don't improve for 200 rounds.
[100]	training's rmse: 3.73586	valid_1's rmse: 3.75937
[200]	training's rmse: 3.67147	valid_1's rmse: 3.71703
[300]	training's rmse: 3.62707	valid_1's rmse: 3.69454
[400]	training's rmse: 3.59471	valid_1's rmse: 3.68294
[500]	training's rmse: 3.56743	valid_1's rmse: 3.67564
[600]	training's rmse: 3.54506	valid_1's rmse: 3.67025
[700]	training's rmse: 3.52573	valid_1's rmse: 3.66666
[800]	training's rmse: 3.50806	valid_1's rmse: 3.66269
[900]	training's rmse: 3.49247	valid_1's rmse: 3.65964
[1000]	training's rmse: 3.47827	valid_1's rmse: 3.6578
[1100]	training's rmse: 3.4651	valid_1's rmse: 3.65621
[1200]	training's rmse: 3.4527	valid_1's rmse: 3.65514
[1300]	training's rmse: 3.44112	valid_1's rmse: 3.65405
[1400]	training's rmse: 3.43048	valid_1's rmse: 3.65332
[1500]	training's rmse: 3.42028	valid_1's rmse: 3.65277
[1600]	training's rmse: 3.41055	valid_1's rmse: 3.65206
[1700]	training's rmse: 3.4013	valid_1's rms

[1800]	training's rmse: 3.3906	valid_1's rmse: 3.65851
[1900]	training's rmse: 3.38164	valid_1's rmse: 3.65811
[2000]	training's rmse: 3.37334	valid_1's rmse: 3.65764
[2100]	training's rmse: 3.36515	valid_1's rmse: 3.65724
[2200]	training's rmse: 3.35735	valid_1's rmse: 3.65709
[2300]	training's rmse: 3.35011	valid_1's rmse: 3.65696
[2400]	training's rmse: 3.34313	valid_1's rmse: 3.65673
[2500]	training's rmse: 3.33596	valid_1's rmse: 3.65659
[2600]	training's rmse: 3.32911	valid_1's rmse: 3.65654
[2700]	training's rmse: 3.32184	valid_1's rmse: 3.65643
[2800]	training's rmse: 3.31496	valid_1's rmse: 3.65633
[2900]	training's rmse: 3.3076	valid_1's rmse: 3.65632
[3000]	training's rmse: 3.30018	valid_1's rmse: 3.6563
[3100]	training's rmse: 3.29309	valid_1's rmse: 3.6564
Early stopping, best iteration is:
[2947]	training's rmse: 3.30421	valid_1's rmse: 3.6562
CV score: 3.65055 


In [356]:
lgbparam = {'num_leaves': 31,
            'boosting_type': 'rf',
             'min_data_in_leaf': 30, 
             'objective':'regression',
             'max_depth': -1,
             'learning_rate': 0.01,
             "min_child_samples": 20,
             "boosting": "gbdt",
             "feature_fraction": 0.9,
             "bagging_freq": 1,
             "bagging_fraction": 0.9 ,
             "bagging_seed": 11,
             "metric": 'rmse',
             "lambda_l1": 0.1,
             "verbosity": -1,
             "nthread": 4,
             "random_state": 4590}

folds = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=4520)

oof_lgb = np.zeros(len(df_train1))
predictions_lgb = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train1,df_train1['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train1.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx],categorical_feature=['feature_1','feature_2','feature_3'])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train1.iloc[val_idx][df_train_columns], label=target.iloc[val_idx],categorical_feature=['feature_1', 'feature_2','feature_3'])#, categorical_feature=categorical_feats)

    num_round = 11000
    clf = lgb.train(lgbparam, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 100)
    oof_lgb[val_idx] = clf.predict(df_train1.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions_lgb += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / (5 * 2)

print("CV score: {:<8.5f}".format(mean_squared_error(oof_lgb, target)**0.5))

fold n°0
Training until validation scores don't improve for 100 rounds.
[100]	training's rmse: 3.67343	valid_1's rmse: 3.7118
[200]	training's rmse: 3.59598	valid_1's rmse: 3.67579
[300]	training's rmse: 3.54689	valid_1's rmse: 3.6603
[400]	training's rmse: 3.51014	valid_1's rmse: 3.65252
[500]	training's rmse: 3.48141	valid_1's rmse: 3.64794
[600]	training's rmse: 3.4561	valid_1's rmse: 3.64449
[700]	training's rmse: 3.43343	valid_1's rmse: 3.64201
[800]	training's rmse: 3.41318	valid_1's rmse: 3.6397
[900]	training's rmse: 3.39459	valid_1's rmse: 3.63834
[1000]	training's rmse: 3.37798	valid_1's rmse: 3.6371
[1100]	training's rmse: 3.36131	valid_1's rmse: 3.63657
[1200]	training's rmse: 3.34485	valid_1's rmse: 3.6356
[1300]	training's rmse: 3.32922	valid_1's rmse: 3.63549
Early stopping, best iteration is:
[1299]	training's rmse: 3.3293	valid_1's rmse: 3.63547
fold n°1
Training until validation scores don't improve for 100 rounds.
[100]	training's rmse: 3.67046	valid_1's rmse: 3.7189

[1600]	training's rmse: 3.27924	valid_1's rmse: 3.66032
[1700]	training's rmse: 3.26654	valid_1's rmse: 3.66044
[1800]	training's rmse: 3.25208	valid_1's rmse: 3.6601
Early stopping, best iteration is:
[1783]	training's rmse: 3.25421	valid_1's rmse: 3.66005
CV score: 3.65308 


In [357]:
train_stack = np.vstack([oof,oof_lgb]).transpose()
test_stack = np.vstack([predictions,predictions_lgb]).transpose()

folds = RepeatedKFold(n_splits=5,n_repeats=1,random_state=4520)
oof_stack = np.zeros(train_stack.shape[0])
predictions_stack = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, target)):
    print("fold n°{}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values

    print("-" * 10 + "Stacking " + str(fold_) + "-" * 10)
#     cb_model = CatBoostRegressor(iterations=3000, learning_rate=0.1, depth=8, l2_leaf_reg=20, bootstrap_type='Bernoulli',  eval_metric='RMSE', metric_period=50, od_type='Iter', od_wait=45, random_seed=17, allow_writing_files=False)
#     cb_model.fit(trn_data, trn_y, eval_set=(val_data, val_y), cat_features=[], use_best_model=True, verbose=True)
    clf = BayesianRidge()
    clf.fit(trn_data, trn_y)
    
    oof_stack[val_idx] = clf.predict(val_data)
    predictions_stack += clf.predict(test_stack) / 5


np.sqrt(mean_squared_error(target.values, oof_stack))

fold n°0
----------Stacking 0----------
fold n°1
----------Stacking 1----------
fold n°2
----------Stacking 2----------
fold n°3
----------Stacking 3----------
fold n°4
----------Stacking 4----------


3.6491930700667297

In [358]:
train_stack1=np.copy(train_stack)

In [268]:
train_w_outliers=clf.predict(train_stack)

In [273]:
train_w_outliers.shape[0]-train_wo_outliers.shape[0]

2207

In [364]:
train_map.shape

(201917,)

In [365]:
train_w_outliers.shape

(201917,)

In [366]:
train_wo_outliers.shape

(201917,)

In [404]:
df=df_train[['card_id','outliers','target']].copy()
df['predicted_target']=train_w_outliers
df['predicted_outliers']=train_map
df['predicted_target_wo_outliers']=train_wo_outliers

In [405]:
df.sort_values('predicted_outliers',ascending=False,inplace=True)

In [420]:
errors={}
for i in range(0,199710,100):
    a=df.predicted_target_wo_outliers.iloc[:i]
    b=df.predicted_target.iloc[i:]
    c=a.append(b)
    e=mean_squared_error(df.target.values, c)
    errors[i]=e

In [421]:
plt.figure()
plt.plot(errors.keys(),errors.values())

<IPython.core.display.Javascript object>

[<matplotlib.lines.Line2D at 0x1386a0b0fd0>]

In [424]:
### Threshould value for minimum error
print(min(errors,key=errors.get))
### However, we find out if we reduce to 

4200


In [None]:
# df_train = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/train.csv')
# df_test = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/test.csv')
# df_hist_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/historical_transactions_new.csv')
# df_new_merchant_trans = pd.read_csv('C:/Users/user/Documents/Salamat/ELO/new_merchant_transactions_new.csv')

In [435]:
# df_train.to_csv('C:/Users/user/Documents/Salamat/ELO/train_pur_date_4.csv', index=False)
# df_test.to_csv('C:/Users/user/Documents/Salamat/ELO/test_pur_date_4.csv',index=False)

In [437]:
# df_hist_trans.to_csv('C:/Users/user/Documents/Salamat/ELO/historical_transactions_pur_date_4.csv',index=False)
# df_new_merchant_trans.to_csv('C:/Users/user/Documents/Salamat/ELO/new_merchant_transactions_pur_date_4.csv',index=False)

In [521]:
df_new_merchant_trans=pd.read_csv('C:/Users/user/Documents/Salamat/ELO/new_merchant_transactions_pur_date_4.csv')

In [449]:
df_train_original=pd.read_csv('train.csv',parse_dates=['first_active_month'])

In [460]:
columns=['card_id', 'authorized_flag', 'city_id', 'category_1', 'installments',
       'category_3', 'merchant_category_id', 'merchant_id', 'month_lag',
       'purchase_amount', 'purchase_date', 'category_2', 'month_diff', 'category_2_mean', 'category_3_mean']

In [461]:
df_new_merchant_trans[df_new_merchant_trans.card_id=='C_ID_415bb3a509'][columns]

Unnamed: 0,card_id,authorized_flag,city_id,category_1,installments,category_3,merchant_category_id,merchant_id,month_lag,purchase_amount,purchase_date,category_2,month_diff,category_2_mean,category_3_mean
0,C_ID_415bb3a509,1,107,0,1,B,307,M_ID_b0c793002c,1,-0.557617,2018-03-11 14:57:36,1.0,0,-0.557412,-0.571578
1,C_ID_415bb3a509,1,140,0,1,B,307,M_ID_88920c89e8,1,-0.569336,2018-03-19 18:53:37,1.0,0,-0.557412,-0.571578
2,C_ID_415bb3a509,1,330,0,1,B,507,M_ID_ad5237ef6b,2,-0.55127,2018-04-26 14:08:44,1.0,-1,-0.557412,-0.571578
3,C_ID_415bb3a509,1,-1,1,1,B,661,M_ID_9e84cda3b1,1,-0.671875,2018-03-07 09:43:21,1.0,0,-0.557412,-0.571578


In [469]:
df_train_original.head()

Unnamed: 0,first_active_month,card_id,feature_1,feature_2,feature_3,target
0,2017-06-01,C_ID_92a2005557,5,2,1,-0.820283
1,2017-01-01,C_ID_3d0044924f,4,1,0,0.392913
2,2016-08-01,C_ID_d639edf6cd,2,2,0,0.688056
3,2017-09-01,C_ID_186d6a6901,4,3,0,0.142495
4,2017-11-01,C_ID_cdbd2c0db2,1,3,0,-0.159749


In [596]:
df_new=df_new_merchant_trans.groupby('card_id').agg({'merchant_id':['count','nunique']})
df_hist=df_hist_trans.groupby('card_id').agg({'merchant_id':['count','nunique']})


In [597]:
df_new=df_new['merchant_id']
df_hist=df_hist['merchant_id']

In [598]:
df_new['diff']=df_new['count']-df_new['nunique']
df_new['ratio']=df_new['diff']/(df_new['count']+1e-3)
#df_new['diff'].value_counts()
df_hist['diff']=df_hist['count']-df_hist['nunique']
df_hist['ratio']=df_hist['diff']/(df_hist['count']+1e-3)

In [599]:
df_hist.loc[df_new[df_new['diff']==5].index,'ratio'].mean()

0.4320662016467264

In [600]:
df_hist.loc[df_new[df_new['diff']==0].index,'ratio'].mean()

0.5145632418637889

In [595]:
df_hist.head()

Unnamed: 0,index,card_id,count,nunique,diff,ratio
0,0,C_ID_00007093c1,149,29,120,0.805364
1,1,C_ID_0001238066,123,65,58,0.471541
2,2,C_ID_0001506ef0,66,28,38,0.575749
3,3,C_ID_0001793786,216,119,97,0.449072
4,4,C_ID_000183fdda,144,73,71,0.493052


In [601]:
df_hist.reset_index(inplace=True)

In [602]:
df_new.reset_index(inplace=True)

In [603]:
# df_train_original.reset_index(inplace=True)
# df_train_original.merge()
df_train_original = df_train_original.merge(df_hist,on='card_id',how='left')


In [615]:
plt.figure()
plt.scatter(df_train_original.diff_y,df_train_original.target)

<IPython.core.display.Javascript object>

<matplotlib.collections.PathCollection at 0x138f96b72e8>

In [617]:
import xlearn as xl

XLearnLibraryNotFound: Cannot find xlearn Library in the candidate path

In [586]:
df_train_original[df_train_original['diff']==0].target.mean()

-0.44013124225561284

In [591]:
df_train_original[df_train_original['diff']==5].target.mean()

-0.38353952

In [505]:
df_train_ex=df_train_original.copy()
#df.set_index('card_id',inplace=True)
intersect=[]

hist_freq=[]
new_freq=[]


df_hist_group=df_hist_trans.groupby('card_id').agg('count')
df_new_group=df_new_merchant_trans.groupby('card_id')


df_hist_group['hist_pur_counts']=df_hist_group.counts()






for card_id in df_train_original.card_id:
    
    new_merchants=df_new_merchant_trans.merchant_id[df_new_merchant_trans.card_id==card_id].values
    old_merchants=df_hist_trans.merchant_id[df_hist_trans.card_id==card_id].values
    a=set(new_merchants)
    b=set(old_merchants)
    intersect.append(b.intersection(a))
    
    hist_freq.append(len(old_merchants)/(len(b)+1e-3)) # add small number to avoid division by zero
    new_freq.append(len(new_merchants)/(len(a)+1e-3)) # add small number to avoid division by zero
df_train_ex['intersec_merchant_id']=intersect
df_train_ex['hist_freq']=hist_freq
df_train_ex['new_freq']=new_freq 

KeyboardInterrupt: 

In [509]:
len(new_freq)

236

In [510]:
new_freq

[0.9999565236294073,
 0.9998333611064822,
 0.9990009990009991,
 0.999857163262391,
 0.9999722229938058,
 0.9997500624843788,
 0.9998000399920015,
 0.9996667777407531,
 0.9995002498750625,
 0.9996667777407531,
 0.9995002498750625,
 0.9990009990009991,
 0.9990009990009991,
 0.9997500624843788,
 0.9998333611064822,
 0.0,
 0.9996667777407531,
 0.9990009990009991,
 0.9997500624843788,
 0.9999000099990002,
 0.9999696978879429,
 0.0,
 0.9990009990009991,
 0.9999000099990002,
 0.9998889012331964,
 0.0,
 0.9998889012331964,
 0.9990009990009991,
 0.9997500624843788,
 0.999857163262391,
 0.9998889012331964,
 0.9998000399920015,
 0.9995002498750625,
 0.9997500624843788,
 0.9999000099990002,
 0.9997500624843788,
 0.9990009990009991,
 0.9999615399407714,
 0.9998333611064822,
 0.9995002498750625,
 0.9998750156230471,
 0.9999375039060058,
 0.999857163262391,
 0.9995002498750625,
 0.9998750156230471,
 0.9999230828397816,
 0.9999333377774815,
 0.9990009990009991,
 0.999857163262391,
 0.0,
 0.99987501562

In [504]:
df_new_merchant_trans.merchant_id[df_new_merchant_trans.card_id==card_id].values

array(['M_ID_47179467b9', 'M_ID_87c8e4b8c0', 'M_ID_c0499d51ac',
       'M_ID_f87158682e', 'M_ID_cd2c0b07e9', 'M_ID_7502bfffb5',
       'M_ID_dc9b2bc91f', 'M_ID_02e6a68aab', 'M_ID_b361c6a843'],
      dtype=object)

In [502]:
plt.figure()
plt.scatter(hist_freq,new_freq)

<IPython.core.display.Javascript object>

<matplotlib.collections.PathCollection at 0x137fe7d9cc0>

In [500]:
len(new_freq)

137

In [473]:
new_merchants=df_new_merchant_trans[df_new_merchant_trans.card_id=='C_ID_92a2005557'].merchant_id.values

In [475]:
old_merchants=df_hist_trans[df_hist_trans.card_id=='C_ID_92a2005557'].merchant_id.values

In [477]:
a=set(new_merchants)
b=set(old_merchants)

In [486]:
len(a)

23

In [484]:
len(b)

94

In [487]:
old_merchants.shape

(260,)

In [480]:
a.intersection(b)

set()

In [170]:
model_with_mixed.loc[map2.index]=model_with_outlier.loc[map2.index]

In [171]:
model_with_mixed.reset_index(inplace=True)

In [172]:
model_with_mixed.to_csv("Bayesian_Ridge_Stacking_mixed_outliers.csv",index=False)

In [218]:
index_high_prob=map2[map2.target>0.5]

In [429]:
df_train.shape

(201917, 87)

In [430]:
df_test.shape

(123623, 85)

In [223]:
model_with_mixed[model_with_mixed.card_id.isin(index_high_prob.index)]

Unnamed: 0,card_id,target
7750,C_ID_a74b12dcf8,-25.262202
20556,C_ID_aae50409e7,-21.672768
27982,C_ID_e7f772dfc0,-18.387164
30248,C_ID_65a0e440f8,-15.969893
32446,C_ID_ac114ef831,-21.413514
70804,C_ID_833aa2f7af,-15.424178
77945,C_ID_6ab591cf62,-20.672101
80840,C_ID_bced41d837,-15.642591
104991,C_ID_86ddafb51c,-20.197719
114106,C_ID_e54aeb08f7,-17.926073


In [225]:
model_with_mixed_2=model_with_mixed.copy()
model_with_mixed_2.set_index('card_id',inplace=True)
model_with_mixed_2.loc[index_high_prob.index]=-33.21928095

In [227]:
model_with_mixed_2.reset_index(inplace=True)
model_with_mixed.to_csv("chau_feature_engineering_date_fixed_4_mixed_outliers_whighprop.csv",index=False)

In [None]:
from sklearn.metrics import log_loss
from sklearn.metrics import roc_auc_score
#sklearn.metrics.roc_auc_score(y_true, y_score, average=’macro’, sample_weight=None, max_fpr=None)

In [None]:
sub_df = pd.DataFrame({"card_id":df_test["card_id"].values})
sub_df["target"] = predictions
sub_df.to_csv("chau_pur_date_4_trained_wo_out.csv", index=False)

In [151]:
map_out=pd.read_csv("predicted_outliers_test.csv",squeeze=True)
map_out.set_index("card_id",inplace=True)

In [169]:
map1=map_out.sort_values('target',ascending=False).head(1340)

In [166]:
len(map_out)*1.09/100

1347.4907

In [165]:
df_train_origina=pd.read_csv('train.csv')
print("ration of outliers")
print(len(df_train_origina[df_train_origina.target<-20])/len(df_train_origina))
print(" unique values of outliers")
print(df_train_origina[df_train_origina.target<-20].target.unique())


ration of outliers
0.010930233709890698
 unique values of outliers
[-33.21928095]


In [189]:
map1=map_out.sort_values('target',ascending=False).head(1340)
map2=map_out.sort_values('target',ascending=False).head(25000)



sub_df=pd.read_csv("chau_pur_date_4_trained_wo_out.csv")
sub_df.set_index('card_id',inplace=True)
sub_df.loc[map2.index,'target']=-33.21928095
sub_df.reset_index(inplace=True)
sub_df.to_csv("chau_pur_date_4_trained_w_out_25000.csv", index=False)

If that it is same ration than outliers will be ~1340 in test set. Howver, according to their results 
https://www.kaggle.com/waitingli/combining-your-model-with-a-model-without-outlier
they choose first 25000 to be outliers rather than 1340.


In [None]:
sub_df = pd.DataFrame({"card_id":df_train["card_id"].values})
sub_df["target_predicted"] = oof
sub_df["target_real"]=target
sub_df.to_csv("predicted_outliers_train.csv", index=False)

In [160]:
df_train_origina=pd.read_csv('train.csv')

In [None]:
name='purchase_date_'
for i in range(1,5):
    name_=name+str(i)
    sub_df = pd.DataFrame({"card_id":df_test["card_id"].values})
    sub_df["target"] = results[i-1]
    sub_df.to_csv("chau_feature_engineering_date_fixed_"+str(i)+".csv", index=False)

In [None]:
sub_df = pd.DataFrame({"card_id":df_test["card_id"].values})
sub_df["target"] = predictions
sub_df.to_csv("chau_feature_engineering_date_fixed_4_catfeatures.csv", index=False)

pur_date_1,pur_date_2,pur_date_3,pur_date_4
best seems to be still first and last
[3.6503261351619978, 3.652261631810461, 3.6524115687225662, 3.6506695114985384]


trying without using maping outliers: CV score 
3.649842864600676
     for f in ['feature_1','feature_2','feature_3']:
         order_label = df_train.groupby([f])['outliers'].mean()
         df_train[f] = df_train[f].map(order_label)
         df_test[f] = df_test[f].map(order_label)
         
         
same as previous with categorical features for fearure_1,2,3
3.6491202164373253

Same as previous but without outliers: CV_score
Outliers have significant effect on the score

1.5559535690683506

predicting outliers based on pur_date_4
log_loss on train
0.044014957513606616
roc_auc
0.903436128739

with categorical feature_1,feature_2,feature_3
0.0440283251986
0.903455034721



In [None]:
mask_out=pd.read_csv("predicted_outliers_test.csv",index_col='card_id')

### 1 fixed

3.6499271201859389


In [None]:
cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:1000].index)

best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure(figsize=(14,25))
sns.barplot(x="importance",
            y="Feature",
            data=best_features.sort_values(by="importance",
                                           ascending=False))
plt.title('LightGBM Features (avg over folds)')
plt.tight_layout()
plt.savefig('lgbm_importances_feature_date_fixed_outlier_prediction.png')

In [None]:
sub_df = pd.DataFrame({"card_id":df_test["card_id"].values})
sub_df["target"] = predictions
sub_df.to_csv("chau_feature_engineering_3_date_updated3.csv", index=False)

**To be continued ...**

In [None]:
best_features.groupby('Feature').mean().sort_values('importance',ascending=False).to_csv('feature_importance2.csv')

In [None]:
df_train.to_csv('C:/Users/user/Documents/Salamat/ELO/train_new_lag_updated.csv',index=False)

In [None]:
df_test.to_csv('C:/Users/user/Documents/Salamat/ELO/test_new_lag_updated.csv',index=False)

In [None]:
best_features.groupby('Feature').mean().sort_values('importance',ascending=False).to_csv('best_features_dates.csv')

In [None]:
len(target[(target<=1)&(target>=-1)])/len(target)