# Encoding Techniques Illustrated
by A4Ayub Data Science Labs (http://www.a4ayub.me/)

### Class Problem Statement

Illustrate concepts on:

1. One-Hot Encoding
2. Label Encoding
3. Ordinal Encoding
4. Helmert Encoding
5. Binary Encoding
6. Frequency Encoding
7. Mean Encoding
8. Weight of Evidence Encoding
9. Probability Ratio Encoding
10. Hashing
11. Backward Difference Encoding
12. Leave One Out Encoding
13. James-Stein Encoding
14. M-Estimator Encoding

<font color="red">**Please note that the illustrations in this notebook are NOT meant to show results or accuracy; they only explain the various concepts.**</font>

### Data Description

<font color="red">**This data is proprietary and cannot be shared with anyone who is NOT attending A4Ayub Data Science Labs!**</font>

Each row in the dataset corresponds to one unique product in a basket (e.g. if there are three occurrences of the same product in a basket, that basket has one row for the product, with quantity equal to three).

The file has the below structure:

| Column Name | Description | Type | Sample Values |
| --- | --- | --- | --- |
| shop_week | Identifies the week of the basket. The format is YYYYWW, where the first four characters identify the fiscal year and the last two identify the week within that year. Because this is a fiscal year, the first week does not start in January (see the time.csv file for the start and end dates of each week). | Char | 200735 |
| shop_date | Date on which the shopping was done, in yyyymmdd format | Char | 20060413, 20060412 |
| shop_weekday | Identifies the day of the week | Num | 1=Sunday, 2=Monday, …, 7=Saturday |
| shop_hour | Hour slot of the shopping | Num | 0=00:00-00:59, 1=01:00-01:59, …, 23=23:00-23:59 |
| Quantity | Number of items of the same product bought in this basket | Num | Integer number |
| spend | Spend associated with the items bought | Num | Number with two decimal digits |
| prod_code | Product Code | Char | PRD0900001, PRD0900003 |
| prod_code_10 | Product Hierarchy Level 10 Code | Char | CL00072, CL00144 |
| prod_code_20 | Product Hierarchy Level 20 Code | Char | DEP00021, DEP00051 |
| prod_code_30 | Product Hierarchy Level 30 Code | Char | G00007, G00015 |
| prod_code_40 | Product Hierarchy Level 40 Code | Char | D00002, D00003 |
| cust_code | Customer Code | Char | CUST0000001624, CUST0000001912 |
| cust_price_sensitivity | Customer’s Price Sensitivity | Char | LA=Less Affluent, MM=Mid Market, UM=Up Market, XX=unclassified |
| cust_lifestage | Customer’s Lifestage | Char | YA=Young Adults, OA=Older Adults, YF=Young Families, OF=Older Families, PE=Pensioners, OT=Other, XX=unclassified |
| basket_id | Basket ID. All items in a basket share the same basket_id value. | Num | 994100100000020, 994100100000344 |
| basket_size | Basket size | Char | L=Large, M=Medium, S=Small |
| basket_price_sensitivity | Basket price sensitivity  | Char | LA=Less Affluent, MM=Mid Market, UM=Up Market, XX=unclassified |
| basket_type | Basket type | Char | Small Shop, Top Up, Full Shop, XX |
| basket_dominant_mission | Shopping dominant mission | Char | Fresh, Grocery, Mixed, Non Food, XX |
| store_code | Store Code | Char | STORE00001, STORE00002 |
| store_format | Format of the Store | Char | LS, MS, SS, XLS |
| store_region | Region the store belongs to | Char | E02, W01, E01, N03 |


### Workbench

#### Importing the required libraries

In [2]:
# Import the numpy and pandas package
import numpy as np
import pandas as pd

# Data Visualisation
import matplotlib.pyplot as plt 
from matplotlib import gridspec
import seaborn as sns

# Import the warnings
import warnings

# Import statsmodels
import statsmodels.formula.api as smf

# Import RMSE
from statsmodels.tools.eval_measures import rmse

# Import Linear Regression from scikit-learn
from sklearn.linear_model import LinearRegression

# configuration settings
%matplotlib inline 
sns.set(color_codes=True)
warnings.filterwarnings('ignore') # Suppress the warnings
pd.options.display.max_columns = None # Display all columns




#### Load the data into a dataframe

In [3]:
# load the data into a dataframe called supermarket_till_transactions_df
supermarket_till_transactions_df = pd.read_csv("data/supermarket_till_transactions.csv")

In [4]:
# view the top five records
supermarket_till_transactions_df.head(5)

Unnamed: 0,SHOP_WEEK,SHOP_DATE,SHOP_WEEKDAY,SHOP_HOUR,QUANTITY,SPEND,PROD_CODE,PROD_CODE_10,PROD_CODE_20,PROD_CODE_30,PROD_CODE_40,CUST_CODE,CUST_PRICE_SENSITIVITY,CUST_LIFESTAGE,BASKET_ID,BASKET_SIZE,BASKET_PRICE_SENSITIVITY,BASKET_TYPE,BASKET_DOMINANT_MISSION,STORE_CODE,STORE_FORMAT,STORE_REGION
0,200607,20060413,5,20,1,103,PRD0900097,CL00001,DEP00001,G00001,D00001,CUST0000634693,LA,YF,994100100532898,L,LA,Top Up,Fresh,STORE00001,LS,E02
1,200607,20060412,4,19,1,28,PRD0900353,CL00070,DEP00020,G00007,D00002,CUST0000634693,LA,YF,994100100532897,M,MM,Small Shop,Fresh,STORE00001,LS,E02
2,200607,20060413,5,20,3,84,PRD0900550,CL00167,DEP00055,G00016,D00003,CUST0000634693,LA,YF,994100100532898,L,LA,Top Up,Fresh,STORE00001,LS,E02
3,200607,20060412,4,19,1,221,PRD0901647,CL00010,DEP00003,G00002,D00001,CUST0000634693,LA,YF,994100100532897,M,MM,Small Shop,Fresh,STORE00001,LS,E02
4,200607,20060413,5,20,1,334,PRD0902064,CL00073,DEP00021,G00007,D00002,CUST0000634693,LA,YF,994100100532898,L,LA,Top Up,Fresh,STORE00001,LS,E02


## One-Hot Encoding

I will use the column PROD_CODE_30 to illustrate one-hot encoding. This column has 17 categories, and I will use both pandas get_dummies and the scikit-learn OneHotEncoder.

### Using get_dummies

In [5]:
supermarket_till_transactions_df["PROD_CODE_30"].value_counts()

G00007    30
G00004    25
G00021    12
G00016    11
G00022     5
G00015     5
G00028     5
G00013     5
G00003     4
G00010     4
G00002     3
G00014     2
G00008     2
G00029     2
G00001     2
G00006     1
G00027     1
Name: PROD_CODE_30, dtype: int64

In [6]:
example_categorical_ohe_df = supermarket_till_transactions_df[["PROD_CODE_30"]]
example_categorical_ohe_df.sample(5)

Unnamed: 0,PROD_CODE_30
21,G00007
86,G00007
29,G00008
35,G00021
88,G00022


In [7]:
encoded_example_categorical_ohe_df = pd.get_dummies(example_categorical_ohe_df,prefix=["PROD_CODE_30"], 
                                                    columns=["PROD_CODE_30"])
encoded_example_categorical_ohe_df.sample(5)

Unnamed: 0,PROD_CODE_30_G00001,PROD_CODE_30_G00002,PROD_CODE_30_G00003,PROD_CODE_30_G00004,PROD_CODE_30_G00006,PROD_CODE_30_G00007,PROD_CODE_30_G00008,PROD_CODE_30_G00010,PROD_CODE_30_G00013,PROD_CODE_30_G00014,PROD_CODE_30_G00015,PROD_CODE_30_G00016,PROD_CODE_30_G00021,PROD_CODE_30_G00022,PROD_CODE_30_G00027,PROD_CODE_30_G00028,PROD_CODE_30_G00029
12,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
9,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
110,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
61,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
78,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


In [8]:
encoded_example_categorical_ohe_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 17 columns):
PROD_CODE_30_G00001    119 non-null uint8
PROD_CODE_30_G00002    119 non-null uint8
PROD_CODE_30_G00003    119 non-null uint8
PROD_CODE_30_G00004    119 non-null uint8
PROD_CODE_30_G00006    119 non-null uint8
PROD_CODE_30_G00007    119 non-null uint8
PROD_CODE_30_G00008    119 non-null uint8
PROD_CODE_30_G00010    119 non-null uint8
PROD_CODE_30_G00013    119 non-null uint8
PROD_CODE_30_G00014    119 non-null uint8
PROD_CODE_30_G00015    119 non-null uint8
PROD_CODE_30_G00016    119 non-null uint8
PROD_CODE_30_G00021    119 non-null uint8
PROD_CODE_30_G00022    119 non-null uint8
PROD_CODE_30_G00027    119 non-null uint8
PROD_CODE_30_G00028    119 non-null uint8
PROD_CODE_30_G00029    119 non-null uint8
dtypes: uint8(17)
memory usage: 2.1 KB


This has created 17 new columns, one for each category in PROD_CODE_30.
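
As a quick check of the column count, the sketch below runs `pd.get_dummies` on a small made-up frame and confirms that the number of dummy columns equals the number of distinct categories. The frame `toy_df` and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the category values are made up for illustration
toy_df = pd.DataFrame({"PROD_CODE_30": ["G00001", "G00007", "G00004", "G00007"]})

# One dummy column is created per distinct category
toy_dummies = pd.get_dummies(toy_df, prefix=["PROD_CODE_30"], columns=["PROD_CODE_30"])

n_categories = toy_df["PROD_CODE_30"].nunique()
print(list(toy_dummies.columns))
print(toy_dummies.shape[1] == n_categories)
```

Each row has exactly one active column, which is what makes the encoding "one-hot".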

### Using Scikit-Learn OneHotEncoder

In [9]:
#import the encoding library
from sklearn.preprocessing import OneHotEncoder

#Initialize the encoder
ohe = OneHotEncoder()

# create an array
ohe_array = ohe.fit_transform(example_categorical_ohe_df.PROD_CODE_30.values.reshape(-1,1)).toarray()

# create a new dataframe using the array
encoder_df = pd.DataFrame(
    ohe_array,
    columns=["PROD_CODE_30_" + str(category) for category in ohe.categories_[0]]
)

# join the two dataframes
encoded_df = pd.concat([example_categorical_ohe_df,encoder_df], axis=1)

#view the sample
encoded_df.sample(5)

Unnamed: 0,PROD_CODE_30,PROD_CODE_30_G00001,PROD_CODE_30_G00002,PROD_CODE_30_G00003,PROD_CODE_30_G00004,PROD_CODE_30_G00006,PROD_CODE_30_G00007,PROD_CODE_30_G00008,PROD_CODE_30_G00010,PROD_CODE_30_G00013,PROD_CODE_30_G00014,PROD_CODE_30_G00015,PROD_CODE_30_G00016,PROD_CODE_30_G00021,PROD_CODE_30_G00022,PROD_CODE_30_G00027,PROD_CODE_30_G00028,PROD_CODE_30_G00029
40,G00016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
14,G00004,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84,G00021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
86,G00007,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
52,G00015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
encoded_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 18 columns):
PROD_CODE_30           119 non-null object
PROD_CODE_30_G00001    119 non-null float64
PROD_CODE_30_G00002    119 non-null float64
PROD_CODE_30_G00003    119 non-null float64
PROD_CODE_30_G00004    119 non-null float64
PROD_CODE_30_G00006    119 non-null float64
PROD_CODE_30_G00007    119 non-null float64
PROD_CODE_30_G00008    119 non-null float64
PROD_CODE_30_G00010    119 non-null float64
PROD_CODE_30_G00013    119 non-null float64
PROD_CODE_30_G00014    119 non-null float64
PROD_CODE_30_G00015    119 non-null float64
PROD_CODE_30_G00016    119 non-null float64
PROD_CODE_30_G00021    119 non-null float64
PROD_CODE_30_G00022    119 non-null float64
PROD_CODE_30_G00027    119 non-null float64
PROD_CODE_30_G00028    119 non-null float64
PROD_CODE_30_G00029    119 non-null float64
dtypes: float64(17), object(1)
memory usage: 16.8+ KB
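
The `info()` output above shows the `OneHotEncoder` columns stored as float64, whereas `get_dummies` produced uint8 columns. If a compact integer layout is preferred, one option is the encoder's `dtype` argument. This is a minimal sketch on made-up data; `toy_df`, `toy_ohe` and the other names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the category values are made up for illustration
toy_df = pd.DataFrame({"PROD_CODE_30": ["G00001", "G00007", "G00004", "G00007"]})

# dtype sets the element type of the encoded matrix (the default is float64)
toy_ohe = OneHotEncoder(dtype=np.uint8)
toy_array = toy_ohe.fit_transform(toy_df[["PROD_CODE_30"]]).toarray()

# Name the columns after the categories the encoder learned
toy_encoded_df = pd.DataFrame(
    toy_array,
    columns=["PROD_CODE_30_" + str(category) for category in toy_ohe.categories_[0]],
    index=toy_df.index,
)
print(toy_encoded_df.dtypes)
```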


## Label Encoding

In [22]:
from sklearn.preprocessing import LabelEncoder
# take a copy so that new columns can be added without a SettingWithCopyWarning
prod_code_le_df = supermarket_till_transactions_df[["PROD_CODE_30"]].copy()
prod_code_le_df.sample(5)

Unnamed: 0,PROD_CODE_30
67,G00007
96,G00004
48,G00029
25,G00004
54,G00004


### Using Label Encoder

In [24]:
prod_code_le_df["PROD_CODE_LE"] = LabelEncoder().fit_transform(prod_code_le_df.PROD_CODE_30)
prod_code_le_df.sample(5)

Unnamed: 0,PROD_CODE_30,PROD_CODE_LE
48,G00029,16
10,G00004,3
54,G00004,3
111,G00021,12
8,G00013,8


### Using pandas Factorize

In [21]:
# factorize the original column; codes follow the order of first appearance
prod_code_le_df["PROD_CODE_FAC"] = pd.factorize(prod_code_le_df["PROD_CODE_30"])[0]
prod_code_le_df.sample(5)

Unnamed: 0,PROD_CODE_30,PROD_CODE_LE,PROD_CODE_FAC
91,G00008,6,9
89,G00010,7,8
49,G00007,5,1
35,G00021,12,7
112,G00004,3,4
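
The two columns above hold different codes for the same category because `LabelEncoder` numbers the categories in sorted order, while `pd.factorize` numbers them in order of first appearance unless `sort=True` is passed. The sketch below shows this on a made-up series (`toy_codes` is hypothetical).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; deliberately not in sorted order
toy_codes = pd.Series(["G00007", "G00001", "G00004", "G00007"])

# LabelEncoder numbers the categories in sorted order
le_codes = LabelEncoder().fit_transform(toy_codes)

# factorize numbers them in order of first appearance by default ...
fac_codes, fac_uniques = pd.factorize(toy_codes)

# ... and in sorted order when sort=True
fac_sorted_codes, fac_sorted_uniques = pd.factorize(toy_codes, sort=True)

print(le_codes)
print(fac_codes)
print(fac_sorted_codes)
```

Either numbering is a valid label encoding. What matters is that the same mapping is reused whenever new data is encoded.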


## Ordinal Encoding

In [30]:
# take a copy so that new columns can be added without a SettingWithCopyWarning
basket_size_df = supermarket_till_transactions_df[["BASKET_SIZE"]].copy()
basket_size_df.sample(5)

Unnamed: 0,BASKET_SIZE
60,L
20,L
55,L
19,M
118,M


In [31]:
# map each size to an integer rank so that the order S < M < L is preserved numerically
basket_size_dic = {"S": 1, "M": 2, "L": 3}
basket_size_df["BASKET_SIZE_ORDINAL"] = basket_size_df.BASKET_SIZE.map(basket_size_dic)
basket_size_df.sample(5)

Unnamed: 0,BASKET_SIZE,BASKET_SIZE_ORDINAL
15,L,3
104,L,3
62,S,1
43,L,3
0,L,3
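
An alternative to a hand-written dictionary is an ordered pandas `Categorical`, which stores the order S < M < L explicitly. This is a minimal sketch on made-up values (`toy_sizes` is hypothetical), not the method used in the cell above.

```python
import pandas as pd

# Hypothetical toy data
toy_sizes = pd.Series(["L", "M", "S", "L"])

# Declare the order explicitly: S < M < L
size_cat = pd.Categorical(toy_sizes, categories=["S", "M", "L"], ordered=True)

# .codes is zero-based, so add 1 to get S=1, M=2, L=3
toy_ordinal = pd.Series(size_cat.codes + 1, index=toy_sizes.index)
print(toy_ordinal.tolist())
```

Values that are not in the declared categories get the code -1 (0 after the shift), so they are easy to detect.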


## Helmert Encoding

In [38]:
import category_encoders as ce
encoder = ce.HelmertEncoder(cols=['BASKET_SIZE'],drop_invariant=True)
dfh = encoder.fit_transform(basket_size_df['BASKET_SIZE'])
df = pd.concat([basket_size_df,dfh],axis=1)
df.sample(5)

Unnamed: 0,BASKET_SIZE,BASKET_SIZE_ORDINAL,BASKET_SIZE_0,BASKET_SIZE_1
13,L,3,-1.0,-1.0
2,L,3,-1.0,-1.0
41,L,3,-1.0,-1.0
117,M,2,1.0,-1.0
43,L,3,-1.0,-1.0
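
The -1/1 pattern in the output is consistent with a Helmert contrast matrix, in which each column compares one level with the levels before it. The sketch below rebuilds such a matrix with NumPy. `helmert_contrast` is a hypothetical helper, not part of category_encoders, and the row order L, M, S is an assumption based on the order in which the categories first appear in the data.

```python
import numpy as np

def helmert_contrast(n_levels):
    """Hypothetical helper: build an n_levels x (n_levels - 1) Helmert contrast matrix."""
    contrast = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        contrast[: j + 1, j] = -1   # every earlier level gets -1 in column j
        contrast[j + 1, j] = j + 1  # the level being compared gets j + 1
    return contrast

# Assumed row order: L, M, S (order of first appearance)
print(helmert_contrast(3))
```

Under that assumption the first two rows, (-1, -1) and (1, -1), match the values shown for L and M, and every column sums to zero.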


## Binary Encoding

In [36]:
bi_encoder = ce.BinaryEncoder(cols=['BASKET_SIZE'],drop_invariant=True)
encoded_df = bi_encoder.fit_transform(basket_size_df['BASKET_SIZE'])
encoded_df.sample(5)

Unnamed: 0,BASKET_SIZE_1,BASKET_SIZE_2
78,0,1
46,0,1
79,0,1
25,0,1
64,1,0


In [37]:
merged_df = pd.concat([basket_size_df,encoded_df],axis=1)
merged_df.sample(5)

Unnamed: 0,BASKET_SIZE,BASKET_SIZE_ORDINAL,BASKET_SIZE_1,BASKET_SIZE_2
108,M,2,1,0
114,M,2,1,0
44,L,3,0,1
69,M,2,1,0
28,L,3,0,1
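
The output is consistent with a two-step process: each category receives an integer code, and that code is then written in binary with one column per bit. The sketch below is a simplified re-implementation in pandas and NumPy, not the library's own code. It assumes that codes start at 1 and follow the order of first appearance, which is inferred from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy_sizes = pd.Series(["L", "M", "S", "L"], name="BASKET_SIZE")

# Step 1: integer code per category, in order of first appearance, starting at 1
codes, uniques = pd.factorize(toy_sizes)
codes = codes + 1

# Step 2: write each code in binary, one column per bit (most significant bit first)
n_bits = int(codes.max()).bit_length()
bits = (codes[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1

toy_binary_df = pd.DataFrame(
    bits,
    columns=[f"BASKET_SIZE_{i + 1}" for i in range(n_bits)],
    index=toy_sizes.index,
)
print(toy_binary_df)
```

With three categories two bits are enough, which is why binary encoding needs far fewer columns than one-hot encoding as the number of categories grows.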


## Frequency Encoding

In [40]:
fe = basket_size_df.groupby('BASKET_SIZE').size()/len(basket_size_df)
basket_size_df.loc[:,'BASKET_SIZE_FREQ_ENC'] = basket_size_df['BASKET_SIZE'].map(fe)
basket_size_df.sample(5)

Unnamed: 0,BASKET_SIZE,BASKET_SIZE_ORDINAL,BASKET_SIZE_FREQ_ENC
1,M,2,0.252101
114,M,2,0.252101
86,L,3,0.697479
64,M,2,0.252101
23,L,3,0.697479
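
The same relative frequencies can also be obtained with `value_counts(normalize=True)`. This is a minimal sketch on made-up data (`toy_df` is hypothetical).

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"BASKET_SIZE": ["L", "M", "L", "S", "L"]})

# Relative frequency of each category (the shares sum to 1)
freq = toy_df["BASKET_SIZE"].value_counts(normalize=True)

# Replace each category by its share of the rows
toy_df["BASKET_SIZE_FREQ_ENC"] = toy_df["BASKET_SIZE"].map(freq)
print(toy_df)
```

In this toy example M and S occur equally often and therefore receive the same encoded value, which is a known limitation of frequency encoding.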


## Mean Encoding

In [43]:
# take a copy so that new columns can be added without a SettingWithCopyWarning
mean_encode_df = supermarket_till_transactions_df[["BASKET_SIZE","SPEND"]].copy()

In [44]:
mean_encode = mean_encode_df.groupby('BASKET_SIZE')['SPEND'].mean()
print(mean_encode)

BASKET_SIZE
L    207.783133
M    175.800000
S    274.833333
Name: SPEND, dtype: float64


In [46]:
mean_encode_df.loc[:,"BASKET_SIZE_ME"] = mean_encode_df["BASKET_SIZE"].map(mean_encode)
mean_encode_df.sample(5)

Unnamed: 0,BASKET_SIZE,SPEND,BASKET_SIZE_ME
105,L,223,207.783133
11,L,340,207.783133
45,L,97,207.783133
78,L,52,207.783133
118,M,499,175.8
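
When mean encoding feeds a model, the category means are usually learned on the training rows only and then applied to new rows. The sketch below illustrates this on made-up data. The frames `train_df` and `test_df`, the unseen category "XL" and the fallback to the global mean are all assumptions made for illustration, not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical training and test data
train_df = pd.DataFrame({
    "BASKET_SIZE": ["L", "L", "M", "M", "S"],
    "SPEND": [100, 200, 50, 150, 300],
})
test_df = pd.DataFrame({"BASKET_SIZE": ["L", "S", "XL"]})

# Learn the per-category means from the training rows only
category_means = train_df.groupby("BASKET_SIZE")["SPEND"].mean()
global_mean = train_df["SPEND"].mean()

# Apply the learned means; unseen categories fall back to the global mean
test_df["BASKET_SIZE_ME"] = test_df["BASKET_SIZE"].map(category_means).fillna(global_mean)
print(test_df)
```

Keeping the target values of the test rows out of the encoding avoids leaking the target into the feature.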
