### Ordinal Number Encoding

Ordinal number encoding applies when the categories of a feature have a natural order or rank.

For example: a student's performance, a customer's review rating, a person's level of education, etc.

![image.png](attachment:image.png)

In the above example, the categories follow an order, rank, or sequence. We can rank students by performance, customers by the feedback they gave, and people by their highest level of education. Features like these are called ordinal features, and we encode them by replacing each category with a number that reflects its rank.
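
As a minimal sketch of this idea, the snippet below encodes a hypothetical `education` column. The column name, the labels, and their ranking are assumptions chosen for illustration; they do not come from a dataset used in this notebook.

```python
import pandas as pd

# Hypothetical data: labels and their ranking are illustrative assumptions.
people = pd.DataFrame(
    {"education": ["Bachelors", "High School", "PhD", "Masters", "High School"]}
)

# Spell out the order explicitly, from lowest to highest rank.
education_order = ["High School", "Bachelors", "Masters", "PhD"]
education_rank = {label: rank for rank, label in enumerate(education_order, start=1)}

# Option 1: map each label to its rank with a plain dictionary.
people["education_encoded"] = people["education"].map(education_rank)

# Option 2: an ordered Categorical yields 0-based codes in the same order.
people["education_code"] = pd.Categorical(
    people["education"], categories=education_order, ordered=True
).codes

print(people)
```

Writing the order out by hand matters here: sorting the labels alphabetically would put `Bachelors` before `High School`, which is not the intended ranking.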

In [1]:
import pandas as pd
import datetime

In [2]:
today_date = datetime.datetime.today()
today_date

datetime.datetime(2023, 3, 18, 23, 8, 24, 897326)

In [3]:
today_date - datetime.timedelta(1)

datetime.datetime(2023, 3, 17, 23, 8, 24, 897326)

In [4]:
dates = [today_date - datetime.timedelta(x) for x in range(15)]
dates

[datetime.datetime(2023, 3, 18, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 17, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 16, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 15, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 14, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 13, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 12, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 11, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 10, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 9, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 8, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 7, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 6, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 5, 23, 8, 24, 897326),
 datetime.datetime(2023, 3, 4, 23, 8, 24, 897326)]

In [5]:
dates = pd.DataFrame(dates,columns=["date"])
dates

Unnamed: 0,date
0,2023-03-18 23:08:24.897326
1,2023-03-17 23:08:24.897326
2,2023-03-16 23:08:24.897326
3,2023-03-15 23:08:24.897326
4,2023-03-14 23:08:24.897326
5,2023-03-13 23:08:24.897326
6,2023-03-12 23:08:24.897326
7,2023-03-11 23:08:24.897326
8,2023-03-10 23:08:24.897326
9,2023-03-09 23:08:24.897326


In [6]:
dates["day"] = dates["date"].dt.day
dates

Unnamed: 0,date,day
0,2023-03-18 23:08:24.897326,18
1,2023-03-17 23:08:24.897326,17
2,2023-03-16 23:08:24.897326,16
3,2023-03-15 23:08:24.897326,15
4,2023-03-14 23:08:24.897326,14
5,2023-03-13 23:08:24.897326,13
6,2023-03-12 23:08:24.897326,12
7,2023-03-11 23:08:24.897326,11
8,2023-03-10 23:08:24.897326,10
9,2023-03-09 23:08:24.897326,9


In [7]:
dates["day_name"] = dates["date"].dt.day_name()
dates

Unnamed: 0,date,day,day_name
0,2023-03-18 23:08:24.897326,18,Saturday
1,2023-03-17 23:08:24.897326,17,Friday
2,2023-03-16 23:08:24.897326,16,Thursday
3,2023-03-15 23:08:24.897326,15,Wednesday
4,2023-03-14 23:08:24.897326,14,Tuesday
5,2023-03-13 23:08:24.897326,13,Monday
6,2023-03-12 23:08:24.897326,12,Sunday
7,2023-03-11 23:08:24.897326,11,Saturday
8,2023-03-10 23:08:24.897326,10,Friday
9,2023-03-09 23:08:24.897326,9,Thursday


In [8]:
ordinal_data = {
    "Sunday":7,"Saturday":6,"Friday":5,"Thursday":4,
    "Wednesday":3,"Tuesday":2,"Monday":1
}

In [9]:
dates["ordinal_day_name"] = dates["day_name"].map(ordinal_data)
dates

Unnamed: 0,date,day,day_name,ordinal_day_name
0,2023-03-18 23:08:24.897326,18,Saturday,6
1,2023-03-17 23:08:24.897326,17,Friday,5
2,2023-03-16 23:08:24.897326,16,Thursday,4
3,2023-03-15 23:08:24.897326,15,Wednesday,3
4,2023-03-14 23:08:24.897326,14,Tuesday,2
5,2023-03-13 23:08:24.897326,13,Monday,1
6,2023-03-12 23:08:24.897326,12,Sunday,7
7,2023-03-11 23:08:24.897326,11,Saturday,6
8,2023-03-10 23:08:24.897326,10,Friday,5
9,2023-03-09 23:08:24.897326,9,Thursday,4
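
The hand-written dictionary can probably be replaced by pandas' own weekday accessor. The sketch below assumes a fresh sample of dates (the frame `sample` and its column names are hypothetical) and checks that `dt.dayofweek + 1` agrees with the mapping used above.

```python
import pandas as pd

# Assumed sample range; 2023-03-09 is the Thursday shown in the table above.
sample = pd.DataFrame({"date": pd.date_range("2023-03-09", periods=10, freq="D")})

ordinal_data = {
    "Sunday": 7, "Saturday": 6, "Friday": 5, "Thursday": 4,
    "Wednesday": 3, "Tuesday": 2, "Monday": 1,
}

# Encoding via the dictionary, as in the cells above.
sample["mapped"] = sample["date"].dt.day_name().map(ordinal_data)

# dt.dayofweek counts Monday as 0, so adding 1 gives Monday=1 ... Sunday=7.
sample["from_accessor"] = sample["date"].dt.dayofweek + 1

print((sample["mapped"] == sample["from_accessor"]).all())
```

The dictionary is still the better choice when the desired order differs from the calendar order, for instance when the week should start on Sunday.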


### Count/Frequency Encoding

In [10]:
df = pd.read_csv("mercedes_benz.csv",usecols=["X1","X2"])

In [11]:
df.head()

Unnamed: 0,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n


In [12]:
for i in df.columns:
    print("Column",i,"-",df[i].nunique())

Column X1 - 27
Column X2 - 44


In [13]:
df["X1"].value_counts()

aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
w      52
z      46
u      37
e      33
m      32
t      31
h      29
f      23
y      23
j      22
n      19
k      17
p       9
g       6
ab      3
q       3
d       3
Name: X1, dtype: int64

In [14]:
frequency = df["X1"].value_counts().to_dict()
frequency

{'aa': 833,
 's': 598,
 'b': 592,
 'l': 590,
 'v': 408,
 'r': 251,
 'i': 203,
 'a': 143,
 'c': 121,
 'o': 82,
 'w': 52,
 'z': 46,
 'u': 37,
 'e': 33,
 'm': 32,
 't': 31,
 'h': 29,
 'f': 23,
 'y': 23,
 'j': 22,
 'n': 19,
 'k': 17,
 'p': 9,
 'g': 6,
 'ab': 3,
 'q': 3,
 'd': 3}

In [15]:
df["X1_encoded"] = df["X1"].map(frequency)

In [16]:
df.head()

Unnamed: 0,X1,X2,X1_encoded
0,v,at,408
1,t,av,31
2,w,n,52
3,t,n,31
4,v,n,408


### Advantages And Disadvantages of Count/Frequency Encoding

**Advantages**
- Easy to use
- Does not increase the feature space

**Disadvantages**
- Categories with the same frequency receive the same encoded value, so the model can no longer tell them apart
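
The collision described above is visible in the `X1` counts, where `f` and `y` both occur 23 times. The sketch below reproduces the effect on a small toy column (assumed data, not the Mercedes-Benz file) and also shows relative frequencies through `value_counts(normalize=True)`, which collide in the same way.

```python
import pandas as pd

# Toy column (assumed data): "f" and "y" occur equally often.
toy = pd.DataFrame({"X1": ["aa", "aa", "aa", "f", "f", "y", "y", "s"]})

counts = toy["X1"].value_counts()                # absolute counts
shares = toy["X1"].value_counts(normalize=True)  # relative frequencies

toy["X1_count"] = toy["X1"].map(counts)
toy["X1_freq"] = toy["X1"].map(shares)

# One row per category: "f" and "y" end up with identical encoded values.
print(toy.drop_duplicates("X1"))
```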

### Mean Encoding

Mean encoding, also called target encoding, is a popular encoding approach among Kagglers. There are many variations; the walkthrough below uses the basic version. Mean encoding is similar to label encoding, except that the labels are derived directly from the target: each category of the feature is replaced with the mean value of the target variable for that category, computed on the training data. This brings out the relationship between each category and the target, but it captures nothing beyond that relationship. Its advantages are that it adds no new columns, so the size of the data is unchanged, and that it can help models learn faster. Mean encoding is, however, notorious for over-fitting, so regularization through cross-validation, smoothing, or a similar approach is needed on most occasions.
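
One common form of smoothing blends each category mean with the global mean, weighted by how many rows the category has. The sketch below is a minimal illustration under stated assumptions: the function name `smoothed_mean_encoding`, the weight `m`, and the toy data are all hypothetical, and other smoothing formulas exist.

```python
import pandas as pd

def smoothed_mean_encoding(frame, column, target, m=10):
    """Blend each category mean with the global mean.

    m is an assumed smoothing weight: the larger it is, the more
    rare categories are pulled toward the global mean.
    """
    global_mean = frame[target].mean()
    stats = frame.groupby(column)[target].agg(["count", "mean"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smoothed.to_dict()

# Toy data (assumed): cabin "T" has a single row, so its raw mean is unreliable.
toy = pd.DataFrame({
    "cabin": ["A", "A", "A", "A", "B", "B", "T"],
    "survived": [1, 0, 1, 1, 0, 1, 0],
})

encoding = smoothed_mean_encoding(toy, "cabin", "survived", m=2)
toy["cabin_encoded"] = toy["cabin"].map(encoding)
print(toy)
```

With smoothing, the single-row category `T` no longer gets an encoded value of exactly 0; it lands between its raw mean and the global mean.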

In [17]:
df = pd.read_csv("titanic.csv",usecols=["Survived","Cabin"])

In [18]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [19]:
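# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df["Cabin"] = df["Cabin"].fillna("Missing")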

In [20]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [21]:
df["Cabin"].value_counts()

Missing        687
B96 B98          4
G6               4
C23 C25 C27      4
D                3
              ... 
B80              1
B41              1
A23              1
E17              1
C106             1
Name: Cabin, Length: 148, dtype: int64

In [22]:
df["Cabin"] = df["Cabin"].str[0]

In [23]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [24]:
df["Cabin"].value_counts()

M    687
C     59
B     47
D     33
E     32
A     15
F     13
G      4
T      1
Name: Cabin, dtype: int64

In [25]:
df.groupby("Cabin")["Survived"].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [26]:
# Mean survival rate per cabin letter, stored as a lookup dictionary
mean_encoding = df.groupby("Cabin")["Survived"].mean().to_dict()
mean_encoding

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [27]:
df["Cabin_Encoded"] = df["Cabin"].map(mean_encoding)

In [28]:
df.head()

Unnamed: 0,Survived,Cabin,Cabin_Encoded
0,0,M,0.299854
1,1,C,0.59322
2,1,M,0.299854
3,1,C,0.59322
4,0,M,0.299854
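
The encoding above uses every row's own target value when computing its category mean, which is the source of the over-fitting mentioned earlier. One way to regularize is to compute the means out of fold. The sketch below is a hedged illustration, not the method used in this notebook: the function name, `n_splits`, `seed`, and the toy data are assumptions, and it expects a unique index.

```python
import numpy as np
import pandas as pd

def out_of_fold_mean_encoding(frame, column, target, n_splits=5, seed=0):
    """Encode each row with category means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    shuffled = np.random.default_rng(seed).permutation(len(frame))
    for fold_positions in np.array_split(shuffled, n_splits):
        holdout = frame.index[fold_positions]
        training = frame.drop(index=holdout)
        fold_means = training.groupby(column)[target].mean()
        fold_values = frame.loc[holdout, column].map(fold_means)
        # Categories absent from the training folds fall back to the training mean.
        fold_values = fold_values.fillna(training[target].mean())
        encoded.loc[holdout] = fold_values.to_numpy()
    return encoded

# Toy data (assumed), standing in for the Cabin/Survived columns.
toy = pd.DataFrame({
    "cabin": list("MMMMCCCCBB"),
    "survived": [0, 0, 1, 0, 1, 1, 0, 1, 1, 1],
})
toy["cabin_oof"] = out_of_fold_mean_encoding(toy, "cabin", "survived", n_splits=5)
print(toy)
```

Rows of the same category can now receive slightly different values, because each fold sees a different subset of the data.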


### Target Guided Ordinal Encoding

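- Order the labels by the mean of the target for each label
- Replace each label with its integer rank in that ordering, so that labels with a higher probability of the target being 1 receive a higher number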

In [29]:
df = pd.read_csv("titanic.csv",usecols=["Survived","Cabin"])

In [30]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [31]:
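# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df["Cabin"] = df["Cabin"].fillna("Missing")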

In [32]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [33]:
df["Cabin"] = df["Cabin"].str[0]

In [34]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [35]:
df.groupby("Cabin")["Survived"].mean().sort_values()

Cabin
T    0.000000
M    0.299854
A    0.466667
G    0.500000
C    0.593220
F    0.615385
B    0.744681
E    0.750000
D    0.757576
Name: Survived, dtype: float64

In [36]:
ordinal_labels = df.groupby("Cabin")["Survived"].mean().sort_values().index
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [37]:
ordinal_labels_org = {k:i for i,k in enumerate(ordinal_labels)}

In [38]:
ordinal_labels_org

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [39]:
df["Cabin_Encoded"] = df["Cabin"].map(ordinal_labels_org)

In [40]:
df.head()

Unnamed: 0,Survived,Cabin,Cabin_Encoded
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1
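
The steps above can be wrapped in a small helper so the mapping is learned once and reused on new rows. This is a sketch under assumptions: the function name, the toy frames, and the choice of `-1` for categories that never appeared in the training data are all hypothetical.

```python
import pandas as pd

def target_guided_ordinal_map(frame, column, target):
    """Rank categories by ascending target mean and return {category: rank}."""
    ordered = frame.groupby(column)[target].mean().sort_values().index
    return {category: rank for rank, category in enumerate(ordered)}

# Toy training data (assumed).
train = pd.DataFrame({
    "cabin": ["M", "M", "M", "C", "C", "B", "B", "T"],
    "survived": [0, 0, 1, 1, 0, 1, 1, 0],
})
mapping = target_guided_ordinal_map(train, "cabin", "survived")
print(mapping)

# Hypothetical new rows; "Z" never appeared in the training data.
new_rows = pd.DataFrame({"cabin": ["B", "Z", "M"]})
# Assumption: unseen categories are sent to -1 instead of being left as NaN.
new_rows["cabin_encoded"] = new_rows["cabin"].map(mapping).fillna(-1).astype(int)
print(new_rows)
```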


### Probability Ratio Encoding
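1. Compute the probability of survival for each category of the categorical feature (Cabin): pr(Survived)
2. Compute the probability of not surviving: 1 - pr(Survived)
3. Compute the ratio pr(Survived) / pr(Not Survived)
4. Build a dictionary that maps each cabin category to its probability ratio
5. Replace the values of the categorical feature with the mapped ratios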
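
The five steps above can be sketched as one function. The function name, the `eps` parameter, and the toy data are assumptions for illustration. The upper clip is also an addition that the walkthrough below does not use: it keeps the ratio finite when every row of a category has the target equal to 1, which would otherwise divide by zero.

```python
import pandas as pd

def probability_ratio_map(frame, column, target, eps=1e-6):
    """Return {category: P(target=1) / P(target=0)} for a binary target."""
    # Step 1: probability of the positive class per category.
    p_positive = frame.groupby(column)[target].mean()
    # Assumption: cap the probability just below 1 to avoid dividing by zero.
    p_positive = p_positive.clip(upper=1 - eps)
    # Steps 2 and 3: probability of the negative class, then the ratio.
    p_negative = 1 - p_positive
    # Step 4: dictionary from category to ratio.
    return (p_positive / p_negative).to_dict()

# Toy data (assumed): every row in cabin "G" survived.
toy = pd.DataFrame({
    "cabin": ["A", "A", "B", "B", "B", "B", "G", "G"],
    "survived": [1, 0, 1, 1, 1, 0, 1, 1],
})

ratios = probability_ratio_map(toy, "cabin", "survived")
# Step 5: replace the categorical values with their ratios.
toy["cabin_encoded"] = toy["cabin"].map(ratios)
print(toy)
```

Without the clip, cabin `G` in this toy data would be encoded as infinity, which most models cannot handle.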

In [41]:
df = pd.read_csv("titanic.csv",usecols=["Cabin","Survived"])

In [42]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [43]:
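# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df["Cabin"] = df["Cabin"].fillna("Missing")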

In [44]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [45]:
df["Cabin"] = df["Cabin"].str[0]

In [46]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [47]:
df["Cabin"].value_counts()

M    687
C     59
B     47
D     33
E     32
A     15
F     13
G      4
T      1
Name: Cabin, dtype: int64

In [48]:
prob_df = df.groupby("Cabin")["Survived"].mean()
prob_df

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [49]:
prob_df = pd.DataFrame(prob_df)
prob_df

Cabin,Survived
A,0.466667
B,0.744681
C,0.59322
D,0.757576
E,0.75
F,0.615385
G,0.5
M,0.299854
T,0.0


In [50]:
prob_df["Died"] = 1 - prob_df["Survived"]

In [51]:
prob_df.head()

Cabin,Survived,Died
A,0.466667,0.533333
B,0.744681,0.255319
C,0.59322,0.40678
D,0.757576,0.242424
E,0.75,0.25


In [52]:
prob_df["probability_ratio"] = prob_df["Survived"]/prob_df["Died"]

In [53]:
prob_df.head()

Cabin,Survived,Died,probability_ratio
A,0.466667,0.533333,0.875
B,0.744681,0.255319,2.916667
C,0.59322,0.40678,1.458333
D,0.757576,0.242424,3.125
E,0.75,0.25,3.0


In [54]:
prob_encoded = prob_df["probability_ratio"].to_dict()
prob_encoded

{'A': 0.875,
 'B': 2.916666666666666,
 'C': 1.4583333333333333,
 'D': 3.125,
 'E': 3.0,
 'F': 1.6000000000000003,
 'G': 1.0,
 'M': 0.42827442827442824,
 'T': 0.0}

In [55]:
df["Cabin_encoded"] = df["Cabin"].map(prob_encoded)

In [56]:
df.head()

Unnamed: 0,Survived,Cabin,Cabin_encoded
0,0,M,0.428274
1,1,C,1.458333
2,1,M,0.428274
3,1,C,1.458333
4,0,M,0.428274
