# Mean encodings
In this programming assignment you will be working with the 1C dataset from the final competition. You are asked to encode `item_id` in 4 different ways:

1) Via KFold scheme;<br>
2) Via Leave-one-out scheme;<br>
3) Via smoothing scheme;<br>
4) Via expanding mean scheme.

You will need to submit the correlation coefficient between the resulting encoding and the target variable, up to 4 decimal places.

## General tips
- Fill NaNs in the encoding with 0.3343.
- Some encoding schemes depend on sorting order, so in order to avoid confusion, please use the following code snippet to construct the data frame. This snippet also implements mean encoding without regularization.

In [2]:
import pandas as pd
import numpy as np
from itertools import product

## Read Data

In [4]:
sales = pd.read_csv('sales_train.csv')

## Aggregate data

Since the competition task is to make a monthly prediction, we need to aggregate the data to the monthly level before doing any encodings. The following code cell serves just that purpose.

In [8]:
index_cols  = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

#get aggregated values for (shop_id, item_id, month)
#named aggregation is used here: the nested-dict renaming form
#agg({'item_cnt_day':{'target':'sum'}}) raises an error since pandas 1.0
#the resulting columns are already flat, so no column renaming is needed
gb = sales.groupby(index_cols,as_index=False).agg(target=('item_cnt_day','sum'))

#join aggregated data to the grid
all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)

#sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True)
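To see what the grid construction does, here is a minimal sketch on a made-up sales log. The frame `toy_sales` and every `toy_*` name are hypothetical and exist only for illustration; the real data comes from `sales_train.csv`.

```python
import numpy as np
import pandas as pd
from itertools import product

# Hypothetical toy sales log: two months, a handful of shops and items
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'shop_id':        [1, 1, 2, 1],
    'item_id':        [10, 11, 10, 10],
    'item_cnt_day':   [2.0, 1.0, 3.0, 4.0],
})
toy_cols = ['shop_id', 'item_id', 'date_block_num']

# Cartesian product of the shops and items seen in each month
toy_grid = []
for block_num in toy_sales['date_block_num'].unique():
    month = toy_sales[toy_sales['date_block_num'] == block_num]
    toy_grid.append(np.array(
        list(product(month['shop_id'].unique(), month['item_id'].unique(), [block_num])),
        dtype='int32'))
toy_grid = pd.DataFrame(np.vstack(toy_grid), columns=toy_cols)

# Monthly totals, left-joined onto the grid; pairs with no sales become 0
toy_gb = (toy_sales.groupby(toy_cols, as_index=False)['item_cnt_day'].sum()
          .rename(columns={'item_cnt_day': 'target'}))
toy_all = pd.merge(toy_grid, toy_gb, how='left', on=toy_cols).fillna(0)
print(toy_all)
```

Month 0 contributes 2 shops x 2 items = 4 rows and month 1 contributes 1 row. The pair (shop 2, item 11) never appears in the log, so it enters the grid with a target of 0.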

In [9]:
all_data.head()

| | shop_id | item_id | date_block_num | target |
|---|---|---|---|---|
| 139255 | 0 | 19 | 0 | 0 |
| 141495 | 0 | 27 | 0 | 0 |
| 144968 | 0 | 28 | 0 | 0 |
| 142661 | 0 | 29 | 0 | 0 |
| 138947 | 0 | 32 | 0 | 6 |


In [19]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3792745 entries, 139255 to 3789465
Data columns (total 5 columns):
shop_id            int32
item_id            int32
date_block_num     int32
target             float64
item_target_enc    float64
dtypes: float64(2), int32(3)
memory usage: 130.2 MB


## Mean encodings without regularization
Now that the technical work is done, we are ready to actually mean encode the desired `item_id` variable.

Here are two ways to implement mean encoding features without any regularization. You can use this code as a starting point to implement regularized techniques.

__Method 1__

In [10]:
#Calculate a mapping: {item_id: target_mean}
item_id_target_mean = all_data.groupby('item_id').target.mean()

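#In our non-regularized case, we just *map* the computed means onto the item_id values
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

#Fill NaNs (assign the result back: an inplace fillna on a selected column
#is chained assignment and may leave all_data unchanged in recent pandas)
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)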

#Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.5819146035879366


The printed value is the correlation coefficient between the target variable and your new encoded item_id feature. 

__Method 2__

In [11]:
'''
   Unlike `.target.mean()`, `transform` returns a Series
   aligned with the index of `all_data`.
   This single line of code is equivalent to the first two lines of Method 1.
'''

all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')

# Fill NaNs (assigned back rather than filled inplace on a selected column)
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.5819146035879366
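Both methods print the same correlation because they compute the same feature. A minimal sketch on a made-up frame (the name `toy` and its values are hypothetical) makes the equivalence explicit:

```python
import pandas as pd

toy = pd.DataFrame({'item_id': [1, 1, 2, 2, 2],
                    'target':  [0.0, 2.0, 1.0, 1.0, 4.0]})

# Method 1 style: build the {item_id: mean} mapping, then map it onto the rows
enc_map = toy['item_id'].map(toy.groupby('item_id')['target'].mean())
# Method 2 style: transform broadcasts each group mean back to every row
enc_transform = toy.groupby('item_id')['target'].transform('mean')

print(enc_map.tolist())                   # [1.0, 1.0, 2.0, 2.0, 2.0]
print((enc_map == enc_transform).all())   # True
```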


## 1. KFold Regularization

First, implement the KFold scheme with five folds. Use `KFold(5)` from `sklearn.model_selection`.

Split your data into 5 folds with `sklearn.model_selection.KFold` and the `shuffle=False` argument.
Iterate through the folds: use all but the current fold to calculate the mean target for each level of `item_id`, and fill the current fold with those values.

See Method 1 from the example implementation. In particular, learn what the `map` and `pd.Series.map` functions do. They are pretty handy in many situations.
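As a rough illustration of what `pd.Series.map` does inside a single fold, consider the sketch below. The frames `toy_tr` and `toy_val` are made up and stand in for the training and validation parts of one split.

```python
import pandas as pd

# Hypothetical training part and validation part of one fold
toy_tr = pd.DataFrame({'item_id': [1, 1, 2], 'target': [0.0, 2.0, 5.0]})
toy_val = pd.DataFrame({'item_id': [1, 2, 3]})

# Means are computed on the training part only
tr_means = toy_tr.groupby('item_id')['target'].mean()

# Series.map looks each item_id up in tr_means;
# item 3 never occurs in the training part, so it maps to NaN
val_enc = toy_val['item_id'].map(tr_means)
print(val_enc.tolist())                  # [1.0, 5.0, nan]
print(val_enc.fillna(0.3343).tolist())   # [1.0, 5.0, 0.3343]
```

This is why the NaN fill matters in the KFold scheme: any item that appears only in the current fold has no out-of-fold mean.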

In [42]:
#Code adapted from: https://github.com/mervynlee94/Advance-Machine-Learning/
#blob/master/%20Course%202%20of%207:%20How%20to%20Win%20a%20Data%20Science%20Competition:
#%20Learn%20from%20Top%20Kagglers/Programming%20assignment%2C%20week%203:%20Mean%20encodings/
#Programming_assignment_week_3.ipynb

#Split data into 5 folds
from sklearn import model_selection

kf = model_selection.KFold(5, shuffle=False)
all_data['item_target_enc'] = np.nan

In [56]:
#Iterate through folds, using all but the current fold to calculate the mean target for each level of item_id

for tr_ind, val_ind in kf.split(all_data):
    X_tr, X_val = all_data.iloc[tr_ind], all_data.iloc[val_ind]
    all_data.loc[all_data.index[val_ind], 'item_target_enc'] = X_val['item_id'].map(X_tr.groupby('item_id').target.mean())

all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)
encoded_feature = all_data['item_target_enc'].values

all_data.head()

| | shop_id | item_id | date_block_num | target | item_target_enc |
|---|---|---|---|---|---|
| 139255 | 0 | 19 | 0 | 0 | 0.3343 |
| 141495 | 0 | 27 | 0 | 0 | 0.056 |
| 144968 | 0 | 28 | 0 | 0 | 0.157333 |
| 142661 | 0 | 29 | 0 | 0 | 0.032967 |
| 138947 | 0 | 32 | 0 | 6 | 1.88 |


In [57]:
#Correlation for KFold Regularization
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)

0.5140762895921509


## 2. Leave-one-out Regularization

Now, implement the leave-one-out scheme. Note that if you simply set the number of folds to the number of samples and run the code from the KFold scheme, you will probably wait a very long time.

To implement a faster version, note that to calculate the mean target value using all the objects but one given object, you can:

1. Calculate the sum of the target values using all the objects. <br>
2. Then subtract the target of the given object and divide the resulting value by n_objects - 1. <br><br>
Note that you do not need to perform step 1 for every object, and step 2 can be implemented without any for loop. <br>

It is most convenient to use the `.transform` function, as in Method 2.
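The two steps can be sketched with `.transform` on a made-up frame. The name `toy` and its values are hypothetical; the sums and counts are taken per `item_id`, as in the encodings above.

```python
import pandas as pd

toy = pd.DataFrame({'item_id': [1, 1, 1, 2],
                    'target':  [1.0, 2.0, 6.0, 3.0]})

grp = toy.groupby('item_id')['target']
group_sum = grp.transform('sum')      # step 1, computed once per group
group_cnt = grp.transform('count')

# Step 2, vectorised: remove the row's own target from its group sum.
# A group with a single row gives 0/0 = NaN, filled with 0.3343 as in the tips.
toy_loo = ((group_sum - toy['target']) / (group_cnt - 1)).fillna(0.3343)
print(toy_loo.tolist())   # [4.0, 3.5, 1.5, 0.3343]
```

For the first row, the other two targets of item 1 are 2 and 6, so its encoding is (9 - 1) / 2 = 4.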

In [55]:
leave_one_out_sum = all_data['item_id'].map(all_data.groupby('item_id').target.sum())
leave_one_out_count = all_data['item_id'].map(all_data.groupby('item_id').target.count())

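#Remove each row's own target from its item's sum, then divide by the remaining count
all_data['item_target_enc'] = (leave_one_out_sum - all_data['target'])/(leave_one_out_count-1)
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)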
encoded_feature = all_data['item_target_enc'].values

all_data.head()

| | shop_id | item_id | date_block_num | target | item_target_enc |
|---|---|---|---|---|---|
| 139255 | 0 | 19 | 0 | 0 | 0.022727 |
| 141495 | 0 | 27 | 0 | 0 | 0.066239 |
| 144968 | 0 | 28 | 0 | 0 | 0.160256 |
| 142661 | 0 | 29 | 0 | 0 | 0.044248 |
| 138947 | 0 | 32 | 0 | 6 | 2.589744 |


In [54]:
#Leave One Out Correlation
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)

0.5770489635001632


## 3. Smoothing

Next, implement the smoothing scheme with $\alpha = 100$. Use the formula from the first slide in the video and $0.3343$ as `globalmean`. Note that `nrows` is the number of objects that belong to a certain category (not the number of rows in the dataset).
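The slide itself is not reproduced here. Assuming it matches what the code in this section computes, the smoothed encoding of a category is:

$$
\text{smooth} = \frac{\text{mean}(\text{target}) \cdot \text{nrows} + \text{globalmean} \cdot \alpha}{\text{nrows} + \alpha}
$$

As a worked example with made-up numbers, a category with $\text{nrows} = 100$ and $\text{mean}(\text{target}) = 1.0$ gets

$$
\frac{1.0 \cdot 100 + 0.3343 \cdot 100}{100 + 100} = \frac{133.43}{200} = 0.66715,
$$

halfway between its own mean and the global mean. Categories with far fewer than $\alpha$ rows are pulled close to `globalmean`, while categories with far more rows stay close to their own mean.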

In [52]:
alpha = 100
globalmean = 0.3343
#Make a copy of the data to work with
train_new = all_data.copy()
#Number of rows for each item_id
nrows = train_new.groupby('item_id').size()
#Mean target for each item_id
means = train_new.groupby('item_id').target.agg('mean')

#Smoothed mean per item_id, mapped back onto the rows
score = (means*nrows + globalmean*alpha) / (nrows+alpha)
train_new['smooth'] = train_new['item_id'].map(score)
encoded_feature = train_new['smooth'].values

#Let's look at this dataframe with new 'smooth' column
train_new.head()

| | shop_id | item_id | date_block_num | target | item_target_enc | smooth |
|---|---|---|---|---|---|---|
| 139255 | 0 | 19 | 0 | 0 | 0.022727 | 0.237448 |
| 141495 | 0 | 27 | 0 | 0 | 0.066239 | 0.113234 |
| 144968 | 0 | 28 | 0 | 0 | 0.160256 | 0.190562 |
| 142661 | 0 | 29 | 0 | 0 | 0.044248 | 0.132813 |
| 138947 | 0 | 32 | 0 | 6 | 2.589744 | 2.19935 |


In [None]:
#Smoothing Correlation
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)

## 4. Expanding Mean Regularization
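In the expanding mean scheme, each row is encoded with the mean target of the earlier rows of the same item, which is why the sorting order matters. A minimal sketch on a made-up frame (the name `toy` and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'item_id': [1, 1, 2, 1],
                    'target':  [2.0, 4.0, 5.0, 0.0]})

# Sum and count of the *previous* rows of the same item
prev_sum = toy.groupby('item_id')['target'].cumsum() - toy['target']
prev_cnt = toy.groupby('item_id').cumcount()

# The first occurrence of an item has no history: 0/0 = NaN, filled with 0.3343
toy_exp = (prev_sum / prev_cnt).fillna(0.3343)
print(toy_exp.tolist())   # [0.3343, 2.0, 0.3343, 3.0]
```

The last row belongs to item 1, whose two earlier targets are 2 and 4, so its encoding is 3. A row's own target never enters its encoding.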

In [50]:
cumsum = all_data.groupby('item_id').target.cumsum() - all_data['target']
cumcnt = all_data.groupby('item_id').cumcount()

#Add mean target to dataframe. Mean target = cumsum / cumcnt
#(the first occurrence of an item gives 0/0 = NaN, which is filled below)
train_new['mean_target'] = cumsum / cumcnt
train_new['mean_target'] = train_new['mean_target'].fillna(0.3343)
encoded_feature = train_new['mean_target'].values

train_new.head()

| | shop_id | item_id | date_block_num | target | item_target_enc | smooth | mean_target |
|---|---|---|---|---|---|---|---|
| 139255 | 0 | 19 | 0 | 0 | 0.022727 | 0.237448 | 0.3343 |
| 141495 | 0 | 27 | 0 | 0 | 0.066239 | 0.113234 | 0.3343 |
| 144968 | 0 | 28 | 0 | 0 | 0.160256 | 0.190562 | 0.3343 |
| 142661 | 0 | 29 | 0 | 0 | 0.044248 | 0.132813 | 0.3343 |
| 138947 | 0 | 32 | 0 | 6 | 2.589744 | 2.19935 | 0.3343 |


In [58]:
#Expanding Mean Correlation
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)

0.5140762895921509
