# Encoding Techniques

1. OneHotEncoding
2. Ordinal Number Encoding
3. Count / Frequency Encoding
4. Target Guided Ordinal Encoding
5. Mean Encoding

# 1. OneHotEncoding

- One column is split into N columns, where N is the number of classes (distinct values) in that column, e.g. the Sex column has two classes, so N=2.

- It simply builds a truth table of 1s and 0s. Passing drop_first=True removes the first category's column; that category is then represented by all the remaining columns being 0.

In [2]:
import numpy as np
import pandas as pd
data = pd.read_csv('C:/Users/Shubham/Documents/Data Science/Notebooks/00. Data_Store/titanic_train.csv', usecols=["Sex", "Embarked"])
data.head()

Unnamed: 0,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S


### We can call get_dummies() on the whole DataFrame and it will automatically encode all the categorical columns.

In [3]:
df_encoded = pd.get_dummies(data, drop_first=True)
df_encoded

Unnamed: 0,Sex_male,Embarked_Q,Embarked_S
0,1,0,1
1,0,0,0
2,0,0,1
3,0,0,1
4,1,0,1
...,...,...,...
886,1,0,1
887,0,0,1
888,0,0,1
889,1,0,0


In [11]:
sex_dummies = pd.get_dummies(data["Sex"])
emb_dummies = pd.get_dummies(data["Embarked"])

emb_dummies_drop_first = pd.get_dummies(data["Embarked"], drop_first=True)

In [7]:
sex_dummies.head(3)

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0


In [6]:
emb_dummies.head(3)

Unnamed: 0,C,Q,S
0,0,0,1
1,1,0,0
2,0,0,1


### Dummy variable trap
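**The drop_first=True parameter drops the first column to avoid the dummy variable trap: the dropped category is implied whenever the other two columns are both 0, so keeping it would only add a redundant, perfectly collinear column.**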

In [12]:
emb_dummies_drop_first.head(3)

Unnamed: 0,Q,S
0,0,1
1,0,0
2,0,1
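
The redundancy behind the dummy variable trap can be checked directly. The following is a minimal sketch on made-up port values (not the Titanic file), assuming pandas is installed: every row of the full encoding sums to 1, and the dropped column can be rebuilt from the ones that remain.

```python
import pandas as pd

# Toy data (hypothetical values, not read from the Titanic file)
ports = pd.Series(["S", "C", "S", "Q", "C"], name="Embarked")

full = pd.get_dummies(ports, dtype=int)                      # columns C, Q, S
reduced = pd.get_dummies(ports, drop_first=True, dtype=int)  # columns Q, S

# Each row of the full encoding sums to 1, so the columns are linearly dependent
row_sums = full.sum(axis=1)

# The dropped column C is recoverable from the remaining columns
rebuilt_c = 1 - reduced.sum(axis=1)

print(row_sums.tolist())
print(rebuilt_c.tolist(), full["C"].tolist())
```

Because C carries no extra information once Q and S are known, dropping it loses nothing and removes the exact collinearity that troubles linear models.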


### For Large number of categories

When there are many categories, we cannot simply let one-hot encoding create that many columns, otherwise we run into the curse of dimensionality. Instead, we keep only the most frequent classes and treat all the others as belonging to one separate class.

e.g. if there are 20 categories and the last 10 have very few instances, we keep the first 10 classes and treat all the others as belonging to an 11th class.

In [19]:
data = pd.read_csv('C:/Users/Shubham/Documents/Data Science/Notebooks/00. Data_Store/mercedes.csv')
data.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [38]:
data.shape

(4209, 388)

In [28]:
data["X1"].value_counts()

aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
w      52
z      46
u      37
e      33
m      32
t      31
h      29
f      23
y      23
j      22
n      19
k      17
p       9
g       6
ab      3
d       3
q       3
Name: X1, dtype: int64

In [30]:
list_10 = data["X1"].value_counts().sort_values(ascending=False).head(10).index
list_10 = list(list_10)
list_10

['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o']

In [35]:
for categ in list_10:
    data[categ] = np.where(data["X1"]==categ, 1, 0)

# Build a new list instead of appending to list_10, so that re-running the cell
# does not put "X1" into the loop and overwrite the original column
data[list_10 + ["X1"]]

Unnamed: 0,aa,s,b,l,v,r,i,a,c,o,X1
0,0,0,0,0,1,0,0,0,0,0,v
1,0,0,0,0,0,0,0,0,0,0,t
2,0,0,0,0,0,0,0,0,0,0,w
3,0,0,0,0,0,0,0,0,0,0,t
4,0,0,0,0,1,0,0,0,0,0,v
...,...,...,...,...,...,...,...,...,...,...,...
4204,0,1,0,0,0,0,0,0,0,0,s
4205,0,0,0,0,0,0,0,0,0,1,o
4206,0,0,0,0,1,0,0,0,0,0,v
4207,0,0,0,0,0,1,0,0,0,0,r
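
The cell above encodes the top 10 categories, but the rare ones end up as all-zero rows rather than in an explicit extra class. One possible way to add that pooled class is sketched below; the helper name `one_hot_top_k` and the label `"other"` are illustrative assumptions, and the data is a toy column rather than the Mercedes file.

```python
import pandas as pd

def one_hot_top_k(series, k, other_label="other"):
    """One-hot encode the k most frequent categories and pool the rest."""
    top = series.value_counts().head(k).index
    # Keep frequent labels as they are, replace everything else with one label
    pooled = series.where(series.isin(top), other_label)
    return pd.get_dummies(pooled, dtype=int)

# Toy column (hypothetical values)
x1 = pd.Series(["aa", "s", "aa", "b", "q", "aa", "s", "d"])
encoded = one_hot_top_k(x1, k=2)
print(encoded)
```

With this variant every row has exactly one 1, so a rare category is distinguishable from a missing value.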


# 2. Ordinal Number Encoding

When categories have a particular order, e.g. the grading system A, B, C, D, E, F, the grades can be ranked as 1, 2, 3, 4, 5, 6. The ranks can also be assigned in inverse order.
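
As a small illustration of the grade example, here is a sketch that uses an ordered pandas Categorical instead of a hand-written dictionary. The grade values are made up, and treating A as rank 1 is an assumption taken from the ranking described above.

```python
import pandas as pd

grades = pd.Series(["B", "A", "F", "C", "A"])

# Declare the order explicitly: A is ranked first, F last
order = ["A", "B", "C", "D", "E", "F"]
as_ordered = pd.Categorical(grades, categories=order, ordered=True)

# codes are 0-based positions in `order`; add 1 to get ranks 1..6
ranks = pd.Series(as_ordered.codes + 1, index=grades.index)

# Inverse ranking, if a higher number should mean a better grade
inverse_ranks = len(order) + 1 - ranks

print(ranks.tolist())
print(inverse_ranks.tolist())
```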

### Creating data of weekdays for demonstration

In [32]:
import datetime

today = datetime.datetime.today()

days = [today - datetime.timedelta(x) for x in range(1,16)]
days = pd.DataFrame(days)
days.columns=["Day"]

days["Weekday"] = days["Day"].dt.day_name()

In [33]:
days

Unnamed: 0,Day,Weekday
0,2021-05-16 19:12:54.968777,Sunday
1,2021-05-15 19:12:54.968777,Saturday
2,2021-05-14 19:12:54.968777,Friday
3,2021-05-13 19:12:54.968777,Thursday
4,2021-05-12 19:12:54.968777,Wednesday
5,2021-05-11 19:12:54.968777,Tuesday
6,2021-05-10 19:12:54.968777,Monday
7,2021-05-09 19:12:54.968777,Sunday
8,2021-05-08 19:12:54.968777,Saturday
9,2021-05-07 19:12:54.968777,Friday


### Ordinal Number Encoding using *map()* function

We pass a dictionary of key-value pairs, where the keys are the values found in the dataset and the values are their replacements.

In [34]:
dictionary={'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6,'Sunday':7}

days["Weekday"] = days["Weekday"].map(dictionary)

In [35]:
days

Unnamed: 0,Day,Weekday
0,2021-05-16 19:12:54.968777,7
1,2021-05-15 19:12:54.968777,6
2,2021-05-14 19:12:54.968777,5
3,2021-05-13 19:12:54.968777,4
4,2021-05-12 19:12:54.968777,3
5,2021-05-11 19:12:54.968777,2
6,2021-05-10 19:12:54.968777,1
7,2021-05-09 19:12:54.968777,7
8,2021-05-08 19:12:54.968777,6
9,2021-05-07 19:12:54.968777,5


# 3. Count or Frequency Encoding

It replaces categories with their frequency of occurrence.

### Advantages - 
1. Easy to implement
2. Does not increase the feature space

### Disadvantages - 
1. Categories with the same frequency receive the same encoded value, so the model can no longer tell them apart.
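
The disadvantage can be seen on a toy column (hypothetical values): two different categories that occur equally often collapse onto the same number after encoding.

```python
import pandas as pd

# Toy data (hypothetical): "red" and "blue" both appear twice
colors = pd.Series(["red", "blue", "green", "red", "blue", "green", "green"])

counts = colors.value_counts().to_dict()   # green -> 3, red -> 2, blue -> 2
encoded = colors.map(counts)

# For each encoded value, count how many distinct categories share it
collisions = colors.groupby(encoded).nunique()

print(encoded.tolist())
print(collisions.to_dict())
```

Any encoded value shared by more than one category marks information that the encoding has thrown away.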

In [49]:
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data' , header = None,index_col=None)
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [69]:
columns=[1,3,5,6,7,8,9,13]

df = data[columns]

df.columns=['Employment','Degree','Status','Designation','family_job','Race','Sex','Country']

df.head()

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba


In [70]:
for col in df.columns:
    print(col, " : ", len(df[col].unique()), "labels")

Employment  :  9 labels
Degree  :  16 labels
Status  :  7 labels
Designation  :  15 labels
family_job  :  6 labels
Race  :  5 labels
Sex  :  2 labels
Country  :  42 labels


In [71]:
df["Country"].value_counts()

 United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 

In [72]:
country_map = df["Country"].value_counts()
country_map = country_map.to_dict()
country_map

{' United-States': 29170,
 ' Mexico': 643,
 ' ?': 583,
 ' Philippines': 198,
 ' Germany': 137,
 ' Canada': 121,
 ' Puerto-Rico': 114,
 ' El-Salvador': 106,
 ' India': 100,
 ' Cuba': 95,
 ' England': 90,
 ' Jamaica': 81,
 ' South': 80,
 ' China': 75,
 ' Italy': 73,
 ' Dominican-Republic': 70,
 ' Vietnam': 67,
 ' Guatemala': 64,
 ' Japan': 62,
 ' Poland': 60,
 ' Columbia': 59,
 ' Taiwan': 51,
 ' Haiti': 44,
 ' Iran': 43,
 ' Portugal': 37,
 ' Nicaragua': 34,
 ' Peru': 31,
 ' France': 29,
 ' Greece': 29,
 ' Ecuador': 28,
 ' Ireland': 24,
 ' Hong': 20,
 ' Trinadad&Tobago': 19,
 ' Cambodia': 19,
 ' Laos': 18,
 ' Thailand': 18,
 ' Yugoslavia': 16,
 ' Outlying-US(Guam-USVI-etc)': 14,
 ' Hungary': 13,
 ' Honduras': 13,
 ' Scotland': 12,
 ' Holand-Netherlands': 1}

In [73]:
df = df.copy()  # work on an independent copy to avoid SettingWithCopyWarning
df["Country"] = df["Country"].map(country_map)
df

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,29170
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,29170
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,29170
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,95
...,...,...,...,...,...,...,...,...
32556,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,29170
32557,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,29170
32558,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,29170
32559,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,29170


### Hence the Country column is now encoded with each category's frequency of occurrence.
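
When the same mapping has to be applied to data that was not used to build it, some categories may be missing from the map. The sketch below uses made-up train and test columns. Using relative frequencies and filling unseen categories with 0 are both assumptions for illustration, not part of the notebook above.

```python
import pandas as pd

train = pd.Series(["US", "US", "Mexico", "US", "Cuba"])
test = pd.Series(["US", "Cuba", "Peru"])   # "Peru" never appears in train

# Relative frequencies learned from the training data only
freq_map = train.value_counts(normalize=True).to_dict()

train_encoded = train.map(freq_map)
# Unseen categories map to NaN; treating them as frequency 0 is one possible choice
test_encoded = test.map(freq_map).fillna(0.0)

print(test_encoded.tolist())
```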

# 4. Target Guided Ordinal Encoding

1. Compute the mean of the target for each label (for a binary target, this is the probability of the positive class)
2. Order the labels according to that mean
3. Replace each label with its rank in that order


In [96]:
import pandas as pd
df=pd.read_csv('C:/Users/Shubham/Documents/Data Science/Notebooks/00. Data_Store/titanic_train.csv', usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [97]:
df['Cabin'] = df['Cabin'].fillna('Missing')

df['Cabin'] = df['Cabin'].astype(str).str[0]

df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [98]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [99]:
survival_rate = df.groupby(["Cabin"])["Survived"].mean()
labels = survival_rate.sort_values()
labels

Cabin
T    0.000000
M    0.299854
A    0.466667
G    0.500000
C    0.593220
F    0.615385
B    0.744681
E    0.750000
D    0.757576
Name: Survived, dtype: float64

In [100]:
ordinal_labels = { k: i for i, k in enumerate(labels.index, 0)}
ordinal_labels

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [102]:
df["Cabin_Ordinal_labels"] = df["Cabin"].map(ordinal_labels)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_Ordinal_labels
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1
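
The cells above can be folded into one reusable helper. This is a sketch on a toy frame, and the function name `target_guided_ordinal` is a hypothetical choice, not something defined elsewhere in this notebook.

```python
import pandas as pd

def target_guided_ordinal(frame, column, target):
    """Map each category to its rank when ordered by mean target value."""
    means = frame.groupby(column)[target].mean().sort_values()
    return {label: rank for rank, label in enumerate(means.index)}

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Cabin":    ["A", "A", "B", "B", "C", "C"],
    "Survived": [0,   1,   1,   1,   0,   0],
})

mapping = target_guided_ordinal(toy, "Cabin", "Survived")
toy["Cabin_rank"] = toy["Cabin"].map(mapping)
print(mapping)
```

Here C has the lowest survival rate (0.0), A is in the middle (0.5) and B is the highest (1.0), so they receive ranks 0, 1 and 2.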


# 5. Mean Encoding

Similar to target guided encoding; the only difference is that here we replace the categorical value directly with the probability of occurrence of the positive class (the mean of the target) in binary classification.

### Advantages

1. Captures information within the label, therefore yielding more predictive features
2. Creates a monotonic relationship between the variable and the target

### Disadvantages
1. It may cause over-fitting, because the encoding is computed from the target itself, so the model may give the feature more significance than its actual influence on the results.

In [1]:
mean_label = df.groupby(["Cabin"])["Survived"].mean().to_dict()

In [108]:
df["Cabin_mean_labels"] = df["Cabin"].map(mean_label)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_Ordinal_labels,Cabin_mean_labels
0,0,M,1,0.299854
1,1,C,4,0.59322
2,1,M,1,0.299854
3,1,C,4,0.59322
4,0,M,1,0.299854
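
Rare categories are where the over-fitting risk is highest: cabin T above gets a mean of exactly 0.0, which is based on very few passengers. One common mitigation, which this notebook does not use, is to shrink each category mean toward the global mean. The sketch below assumes a smoothing weight `m` (a hypothetical parameter) and uses toy data.

```python
import pandas as pd

def smoothed_mean_encoding(frame, column, target, m=10):
    """Blend each category mean with the global mean, weighted by category size."""
    global_mean = frame[target].mean()
    stats = frame.groupby(column)[target].agg(["mean", "count"])
    smooth = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smooth.to_dict()

# Toy data (hypothetical): category "T" has a single row
toy = pd.DataFrame({
    "Cabin":    ["T", "M", "M", "M", "M"],
    "Survived": [0,   1,   0,   1,   1],
})

mapping = smoothed_mean_encoding(toy, "Cabin", "Survived", m=2)
print(mapping)
```

With a global mean of 0.6 and m=2, the single-row category T moves from 0.0 to 0.4, while M, which has four rows, moves only from 0.75 to 0.7.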
