Version 1.1.0

# Mean encodings

In this programming assignment you will work with the `1C` dataset from the final competition. You are asked to encode `item_id` in 4 different ways:

    1) Via KFold scheme;  
    2) Via Leave-one-out scheme;
    3) Via smoothing scheme;
    4) Via expanding mean scheme.

**You will need to submit** the correlation coefficient between the resulting encoding and the target variable, up to 4 decimal places.

### General tips

* Fill NaNs in the encoding with `0.3343`.
* Some encoding schemes depend on the sorting order, so to avoid confusion, please use the following code snippet to construct the data frame. This snippet also implements mean encoding without regularization.

In [98]:
import pandas as pd
import numpy as np
from itertools import product
from grader import Grader


In [100]:
!pip install scikit-learn

In [101]:
import sklearn

# Read data

In [3]:
sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')

In [4]:
sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In [68]:
print("data amount: "+ str(len(sales)))

data amount: 2935849


In [64]:
n_Id= len(sales["item_id"].unique())
print('num of unique id :  '+ str(n_Id))

num of unique id :  21807


In [65]:
n_Shop= len(sales["shop_id"].unique())
print('num of shop id :  '+ str(n_Shop))

num of shop id :  60


# Aggregate data

Since the competition task is to make a monthly prediction, we need to aggregate the data to the monthly level before doing any encodings. The following code cell serves just that purpose.

In [11]:
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

#get aggregated values for (shop_id, item_id, month)
gb = sales.groupby(index_cols, as_index=False)['item_cnt_day'].sum()

#rename the aggregated column to `target`
gb = gb.rename(columns={'item_cnt_day': 'target'})
#join aggregated data to the grid
all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)
#sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True)



In [17]:
len(sales)

2935849

In [32]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target
139255,0,19,0,0.0
141495,0,27,0,0.0
144968,0,28,0,0.0
142661,0,29,0,0.0
138947,0,32,0,6.0


In [46]:
sales[(sales["shop_id"]==59) & (sales["date_block_num"]==0) ].head(10)

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
40084,10.01.2013,0,59,22151,399.0,1.0
77502,04.01.2013,0,59,5603,699.0,1.0
77503,19.01.2013,0,59,5587,199.0,2.0
77504,31.01.2013,0,59,5613,5571.0,1.0
77505,10.01.2013,0,59,5623,699.0,1.0
77506,14.01.2013,0,59,5623,699.0,1.0
77507,10.01.2013,0,59,5629,2390.0,1.0
77508,04.01.2013,0,59,5643,2390.0,1.0
77509,17.01.2013,0,59,5643,2390.0,2.0


In [45]:
all_data[(all_data["shop_id"]==59) & (all_data["date_block_num"]==0) ].head(10)

Unnamed: 0,shop_id,item_id,date_block_num,target
1300,59,19,0,0.0
3540,59,27,0,0.0
7013,59,28,0,0.0
4706,59,29,0,0.0
992,59,32,0,3.0
993,59,33,0,0.0
994,59,34,0,0.0
1292,59,35,0,1.0
4717,59,40,0,0.0
4110,59,41,0,0.0


# Mean encodings without regularization

After we did the technical work, we are ready to actually *mean encode* the desired `item_id` variable.

Here are two ways to implement mean encoding features *without* any regularization. You can use this code as a starting point to implement regularized techniques. 

#### Method 1

In [47]:
# Calculate a mapping: {item_id: target_mean}
item_id_target_mean = all_data.groupby('item_id').target.mean()

# In our non-regularized case we just *map* the computed means to the `item_id`'s
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

# Fill NaNs (assign the result instead of relying on chained in-place modification)
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.4830386988621791


In [76]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,0.022222
141495,0,27,0,0.0,0.056834
144968,0,28,0,0.0,0.141176
142661,0,29,0,0.0,0.037383
138947,0,32,0,6.0,1.319042


In [74]:
np.corrcoef(all_data['target'].values, encoded_feature)

array([[1.       , 0.4830387],
       [0.4830387, 1.       ]])

#### Method 2

In [89]:
# Unlike `.target.mean()`, `transform` returns a Series
# with the same index as `all_data`.
# This single line of code is equivalent to the first two lines of Method 1.
all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')

# Fill NaNs (assign the result instead of relying on chained in-place modification)
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.4830386988621791


See the printed value? It is the correlation coefficient between the target variable and your new encoded feature. You need to **compute the correlation coefficient** for each encoding that you implement and **submit those to Coursera**.
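
To see the two unregularized methods side by side, here is a minimal sketch on a tiny frame. The `toy` data and its values are made up for illustration and are not taken from the `1C` dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the competition dataset)
toy = pd.DataFrame({
    'item_id': [1, 1, 1, 2, 2, 3],
    'target':  [0.0, 2.0, 4.0, 1.0, 3.0, 5.0],
})

# Method 1: build a {item_id: target_mean} mapping and map it back
means = toy.groupby('item_id')['target'].mean()
enc_map = toy['item_id'].map(means)

# Method 2: `transform` returns a Series aligned with `toy`'s index
enc_transform = toy.groupby('item_id')['target'].transform('mean')

# Both routes should give the same per-row encoding
assert np.allclose(enc_map.values, enc_transform.values)

# Correlation between the target and the encoded feature
corr = np.corrcoef(toy['target'].values, enc_map.values)[0][1]
print(round(corr, 4))
```

Note that each row's own target contributes to its encoding here, which is exactly the leakage the regularized schemes below try to reduce.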

In [90]:
grader = Grader()

# 1. KFold scheme

Explained starting at 41 sec of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

**Now it's your turn to write the code!** 

You may use the 'Regularization' video as a reference for all further tasks.

First, implement the KFold scheme with five folds. Use `KFold(5)` from `sklearn.model_selection`.

1. Split your data into 5 folds using `sklearn.model_selection.KFold` with the `shuffle=False` argument.
2. Iterate through the folds: use all but the current fold to calculate the mean target for each level of `item_id`, and fill the current fold.

    *  See **Method 1** from the example implementation. In particular, learn what the `map` and `pd.Series.map` functions do. They are pretty handy in many situations.
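
The two steps above can be sketched on a toy frame as follows. The helper name `kfold_mean_encode`, its parameters, and the toy values are hypothetical and only illustrate the idea; three folds are used so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_mean_encode(df, col, target, n_splits=5, fill_value=0.3343):
    """Hypothetical helper: out-of-fold mean encoding of `col`."""
    encoded = pd.Series(np.nan, index=df.index, dtype='float64')
    kf = KFold(n_splits=n_splits, shuffle=False)
    for train_idx, val_idx in kf.split(df):
        # Means are computed only from rows outside the current fold
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        # Fill the current fold by mapping its categories to those means
        encoded.iloc[val_idx] = df[col].iloc[val_idx].map(fold_means).values
    # Categories never seen in the training folds stay NaN until filled here
    return encoded.fillna(fill_value)

# Toy data (made-up values)
toy = pd.DataFrame({
    'item_id': [1, 1, 2, 2, 1, 2],
    'target':  [1.0, 3.0, 0.0, 2.0, 5.0, 4.0],
})
toy['item_kfold_enc'] = kfold_mean_encode(toy, 'item_id', 'target', n_splits=3)
print(toy)
```

With `shuffle=False` the folds are contiguous blocks of rows, so the result depends on the row order of the frame.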

In [105]:
from sklearn.model_selection import KFold

In [126]:
kf = KFold(n_splits=5, shuffle=False)

In [127]:
print(kf)

KFold(n_splits=5, random_state=None, shuffle=False)


In [130]:
trainData=[]
testData=[]

In [133]:
#store the train/validation indices of each fold in two lists
for train_index, test_index in kf.split(all_data):
    trainData.append(train_index)
    testData.append(test_index)

In [188]:
#create new data for generating result 
all_data_copy = all_data.copy()
all_data_copy = all_data_copy.drop(columns=["item_target_enc"])

In [202]:
all_data_copy_new = pd.DataFrame()

In [203]:
folds = []
for i in range(5):
    #get the train and validation indices of fold i
    train_i_Index, test_i_Index = trainData[i], testData[i]
    #copy the validation rows so that adding a column does not write to a slice
    train_i = all_data_copy.iloc[train_i_Index]
    test_i = all_data_copy.iloc[test_i_Index].copy()
    #calculate the per-item target means on the training part only
    means = train_i.groupby('item_id').target.mean()
    #map the means to the validation part and fill the NaNs
    test_i["mean_target"] = test_i["item_id"].map(means).fillna(0.3343)
    folds.append(test_i)

#stack the encoded folds back into one data frame
all_data_copy_new = pd.concat(folds)


In [209]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,item_target_enc
139255,0,19,0,0.0,0.022222
141495,0,27,0,0.0,0.056834
144968,0,28,0,0.0,0.141176
142661,0,29,0,0.0,0.037383
138947,0,32,0,6.0,1.319042


In [208]:
all_data_copy_new.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,mean_target
139255,0,19,0,0.0,0.3343
141495,0,27,0,0.0,0.048523
144968,0,28,0,0.0,0.142424
142661,0,29,0,0.0,0.030303
138947,0,32,0,6.0,0.89402


In [210]:
encoded_feature = all_data_copy_new['mean_target'].values

In [211]:
# YOUR CODE GOES HERE

# You will need to compute correlation like that
corr = np.corrcoef(all_data_copy_new['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('KFold_scheme', corr)

0.41645907127988024
Current answer for task KFold_scheme is: 0.41645907127988024


# 2. Leave-one-out scheme

Now, implement the leave-one-out scheme. Note that if you simply set the number of folds to the number of samples and run the code from the **KFold scheme**, you will probably wait for a very long time.

To implement a faster version, note that to calculate the mean target value using all the objects but one *given object*, you can:

1. Calculate the sum of the target values using all the objects.
2. Then subtract the target of the *given object* and divide the resulting value by `n_objects - 1`.

Note that you do not need to perform `1.` for every object, and `2.` can be implemented without any `for` loop.

It is most convenient to use the `.transform` function, as in **Method 2**.
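
As a minimal sketch of this trick on made-up data (the `toy` frame and the column name `item_loo_enc` are illustrative assumptions), the per-category sum and count can be broadcast to every row with `transform`, after which the leave-one-out mean is plain column arithmetic:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'item_id': [1, 1, 1, 2, 2],
    'target':  [0.0, 2.0, 4.0, 1.0, 3.0],
})

grouped = toy.groupby('item_id')['target']
group_sum = grouped.transform('sum')    # step 1: per-item sum, repeated on every row
group_cnt = grouped.transform('count')  # per-item number of objects

# step 2: remove the row's own target and divide by n_objects - 1
toy['item_loo_enc'] = (group_sum - toy['target']) / (group_cnt - 1)

# An item with a single row would give 0/0 = NaN, so fill as in the general tips
toy['item_loo_enc'] = toy['item_loo_enc'].fillna(0.3343)
print(toy)
```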

In [214]:
all_data_copy["target"][0]

1.0

In [216]:
(all_data_copy["target"].sum()-all_data_copy["target"][0])/(len(all_data_copy)-1)

0.3342729957139777

In [220]:
totalSumTarget = all_data_copy["target"].sum()
length = (len(all_data_copy))

In [218]:
# placeholder column for the leave-one-out encoding
all_data_copy["mean_enc"] = np.nan

In [222]:
df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})

In [223]:
df.transform(lambda x: x + 1)

Unnamed: 0,A,B
0,1,2
1,2,3
2,3,4


In [235]:
def meanF(target,totalSum,n):
    print(target)
    print((totalSum-target)/(n-1))
    return (totalSum-target)/(n-1)

In [230]:
all_data_copy_head= all_data_copy.head(10)

In [241]:
#for each row, compute the sum of the target over all rows with the same item_id
sums_value = all_data_copy.groupby("item_id")["target"].transform('sum')

In [242]:
n_objects = all_data_copy.groupby("item_id")["target"].transform('size')

In [247]:
all_data_copy["sums_value"] = sums_value
all_data_copy["n_objects"] = n_objects

In [249]:
#leave-one-out mean: exclude the row's own target from its item's sum and count
all_data_copy["mean_enc"] = (all_data_copy["sums_value"] - all_data_copy["target"]) / (all_data_copy["n_objects"] - 1)

In [251]:
all_data_copy.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,sums_value,n_objects,mean_enc
139255,0,19,0,0.0,1.0,45,0.022727
141495,0,27,0,0.0,42.0,739,0.056911
144968,0,28,0,0.0,84.0,595,0.141414
142661,0,29,0,0.0,12.0,321,0.0375
138947,0,32,0,6.0,2092.0,1586,1.316088


In [252]:
encoded_feature = all_data_copy["mean_enc"].values


In [253]:
# YOUR CODE GOES HERE

corr = np.corrcoef(all_data_copy['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Leave-one-out_scheme', corr)

0.480384831129305
Current answer for task Leave-one-out_scheme is: 0.480384831129305


# 3. Smoothing

Explained starting at 4:03 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Next, implement the smoothing scheme with $\alpha = 100$. Use the formula from the first slide in the video and $0.3343$ as `globalmean`. Note that `nrows` is the number of objects that belong to a certain category (not the number of rows in the dataset).
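
The slide itself is not reproduced here, so the sketch below assumes the commonly used smoothed mean, `(mean(target) * nrows + globalmean * alpha) / (nrows + alpha)`, with `nrows` counted per category as described above. The `toy` frame and the column name `item_smooth_enc` are made up for illustration.

```python
import pandas as pd

alpha = 100
globalmean = 0.3343

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'item_id': [1, 1, 1, 2],
    'target':  [1.0, 2.0, 3.0, 10.0],
})

grouped = toy.groupby('item_id')['target']
item_mean = grouped.transform('mean')  # per-category target mean
nrows = grouped.transform('count')     # number of objects in the category

# Assumed formula: shrink the category mean towards the global mean;
# the fewer rows a category has, the stronger the pull
toy['item_smooth_enc'] = (item_mean * nrows + globalmean * alpha) / (nrows + alpha)
print(toy)
```

With only a handful of rows per category and `alpha = 100`, both encodings end up close to `globalmean`, even though the raw category means are 2 and 10.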

In [254]:
all_data_copy_Q3 = all_data_copy.drop(columns=["mean_enc"])

In [256]:
alpha = 100
globalmean = 0.3343

In [257]:
all_data_copy_Q3["means"] = (all_data_copy_Q3["sums_value"]+globalmean*alpha)/(all_data_copy_Q3["n_objects"]+alpha)

In [259]:
all_data_copy_Q3.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,sums_value,n_objects,means
139255,0,19,0,0.0,1.0,45,0.237448
141495,0,27,0,0.0,42.0,739,0.089905
144968,0,28,0,0.0,84.0,595,0.168964
142661,0,29,0,0.0,12.0,321,0.10791
138947,0,32,0,6.0,2092.0,1586,1.260635


In [260]:
encoded_feature = all_data_copy_Q3["means"].values

In [261]:
# YOUR CODE GOES HERE

corr = np.corrcoef(all_data_copy_Q3['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Smoothing_scheme', corr)

0.4818198797097282
Current answer for task Smoothing_scheme is: 0.4818198797097282


# 4. Expanding mean scheme

Explained starting at 5:50 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization).

Finally, implement the *expanding mean* scheme. It is basically already implemented for you in the video, but you can challenge yourself and try to implement it yourself. You will need [`cumsum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.cumsum.html) and [`cumcount`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html) functions from pandas.
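
A minimal sketch of how `cumsum` and `cumcount` combine, on made-up data (the `toy` frame and the column name `item_exp_enc` are illustrative assumptions): each row is encoded with the mean target of the earlier rows of the same category, so the result depends on the row order.

```python
import pandas as pd

# Toy data (hypothetical values); row order matters for this scheme
toy = pd.DataFrame({
    'item_id': [1, 2, 1, 1, 2],
    'target':  [1.0, 4.0, 3.0, 5.0, 6.0],
})

grouped = toy.groupby('item_id')['target']

# Sum of targets of the previous rows in the same category (current row excluded)
prev_sum = grouped.cumsum() - toy['target']
# Number of previous rows in the same category (0 for the first occurrence)
prev_cnt = grouped.cumcount()

# The first occurrence of each category gives 0/0 = NaN, filled with the global mean
toy['item_exp_enc'] = (prev_sum / prev_cnt).fillna(0.3343)
print(toy)
```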

In [262]:
all_data_copy_Q4 = all_data_copy.drop(columns=["mean_enc"])

In [263]:
all_data_copy_Q4.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,sums_value,n_objects
139255,0,19,0,0.0,1.0,45
141495,0,27,0,0.0,42.0,739
144968,0,28,0,0.0,84.0,595
142661,0,29,0,0.0,12.0,321
138947,0,32,0,6.0,2092.0,1586


In [266]:
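#sum of the targets of all previous rows with the same item_id (current row excluded)
cumsum = all_data_copy.groupby("item_id")["target"].cumsum() - all_data_copy["target"]
#number of previous rows with the same item_id
cumcnt = all_data_copy.groupby("item_id").cumcount()
all_data_copy_Q4["mean_enc"] = cumsum / cumcnt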

In [269]:
all_data_copy_Q4["mean_enc"] = all_data_copy_Q4["mean_enc"].fillna(0.3343)

In [272]:
all_data_copy_Q4.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,sums_value,n_objects,mean_enc
139255,0,19,0,0.0,1.0,45,0.3343
141495,0,27,0,0.0,42.0,739,0.3343
144968,0,28,0,0.0,84.0,595,0.3343
142661,0,29,0,0.0,12.0,321,0.3343
138947,0,32,0,6.0,2092.0,1586,0.3343


In [273]:
encoded_feature = all_data_copy_Q4["mean_enc"].values


In [274]:
# YOUR CODE GOES HERE

corr = np.corrcoef(all_data_copy_Q4['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Expanding_mean_scheme', corr)

0.5025245211081701
Current answer for task Expanding_mean_scheme is: 0.5025245211081701


## Authorization & Submission
To submit assignment parts to the Coursera platform, please enter your e-mail and token into the variables below. You can generate a token on this programming assignment's page. Note: the token expires 30 minutes after generation.

In [275]:
STUDENT_EMAIL = "1439631673@qq.com"
STUDENT_TOKEN = "myZOCCrQf2RFE468"
grader.status()

You want to submit these numbers:
Task KFold_scheme: 0.41645907127988024
Task Leave-one-out_scheme: 0.480384831129305
Task Smoothing_scheme: 0.4818198797097282
Task Expanding_mean_scheme: 0.5025245211081701


In [276]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Submitted to Coursera platform. See results on assignment page!
