## Handling Categorical Features (Ordinal Encoding-2)

In [1]:
import pandas as pd
import numpy as np

### Ordinal Encoding

Ordinal encoding is used for ordinal categorical variables, i.e. columns whose categorical values have a natural order.
For example, the values "poor", "good", and "excellent" in a "Review" column are ordered.
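
As a minimal sketch of the idea, the ordered labels can be mapped to integer ranks. The toy DataFrame, the `review` column, and the names `review_order` and `review_mapping` are hypothetical and are not part of the Adult dataset used below.

```python
import pandas as pd

# Hypothetical toy data; the "review" column and its values are illustrative only.
reviews = pd.DataFrame({"review": ["good", "poor", "excellent", "good"]})

# Explicit order, from lowest to highest.
review_order = ["poor", "good", "excellent"]
review_mapping = {label: rank for rank, label in enumerate(review_order)}

# Replace each label with its rank in the chosen order.
reviews["review_ordinal"] = reviews["review"].map(review_mapping)
print(reviews)
```

The order is supplied by hand here because pandas cannot infer it from the strings themselves.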

### Introduction to Adult Dataset

**Abstract:** 
Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.

**Data Set Information:**

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

Prediction task is to determine whether a person makes over 50K a year.

You can learn more about the dataset <a href="http://archive.ics.uci.edu/ml/datasets/Adult">here</a>.

In [2]:
col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

In [3]:
adult = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", names=col_names, na_values=" ?")
adult.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K

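Fields in this file are separated by a comma followed by a space, which is why the missing-value marker above is written as `" ?"` with a leading space. One possible alternative, sketched here on a small in-memory sample rather than the real file, is to pass `skipinitialspace=True` so that the space is stripped while reading. The sample string and the reduced column list are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical three-column sample mimicking the ", " separator of adult.data.
raw = "39, State-gov, Bachelors\n50, ?, HS-grad\n"

sample = pd.read_csv(
    io.StringIO(raw),
    names=["age", "workclass", "education"],
    na_values="?",          # no leading space needed once it is stripped
    skipinitialspace=True,  # drop the space that follows each comma
)
print(sample)
```

The rest of this notebook keeps the original loading call, so the labels there retain their leading space.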

In [4]:
print("Number of unique values in each column")
for i in adult.columns:
    print(i,":", adult[i].nunique())

Number of unique values in each column
age : 73
workclass : 8
fnlwgt : 21648
education : 16
education-num : 16
marital-status : 7
occupation : 14
relationship : 6
race : 5
sex : 2
capital-gain : 119
capital-loss : 92
hours-per-week : 94
native-country : 41
income : 2


#### 4. Target Mean Encoding

In this method, we encode each category using the target. Each label is replaced with the probability of the target being 1 for that category, i.e. the mean of the binary target within the category.

In [5]:
adult["income"].value_counts()

 <=50K    24720
 >50K      7841
Name: income, dtype: int64

We begin by mapping income over 50K to 1 and income lower than or equal to 50K to 0.

In [6]:
adult["income"] = adult["income"].map({" <=50K":0, " >50K":1})
adult.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


We can group by the "education" column and take the mean of the "income" column. Because "income" is now 0 or 1, this mean is the probability of the target ("income") being 1 for each category.

In [7]:
adult.groupby("education")["income"].mean()

education
 10th            0.066452
 11th            0.051064
 12th            0.076212
 1st-4th         0.035714
 5th-6th         0.048048
 7th-8th         0.061920
 9th             0.052529
 Assoc-acdm      0.248360
 Assoc-voc       0.261216
 Bachelors       0.414753
 Doctorate       0.740920
 HS-grad         0.159509
 Masters         0.556587
 Preschool       0.000000
 Prof-school     0.734375
 Some-college    0.190235
Name: income, dtype: float64

We can turn the series above into a mapping dictionary with the "to_dict()" Series method.

In [8]:
edu_mapping = adult.groupby("education")["income"].mean().to_dict()
edu_mapping

{' 10th': 0.06645230439442658,
 ' 11th': 0.05106382978723404,
 ' 12th': 0.07621247113163972,
 ' 1st-4th': 0.03571428571428571,
 ' 5th-6th': 0.04804804804804805,
 ' 7th-8th': 0.06191950464396285,
 ' 9th': 0.05252918287937743,
 ' Assoc-acdm': 0.24835988753514526,
 ' Assoc-voc': 0.26121562952243127,
 ' Bachelors': 0.4147525676937442,
 ' Doctorate': 0.7409200968523002,
 ' HS-grad': 0.15950861822683554,
 ' Masters': 0.5565873476494486,
 ' Preschool': 0.0,
 ' Prof-school': 0.734375,
 ' Some-college': 0.19023453572898094}

Finally, we map the values using the dictionary we created above.

In [9]:
adult["education_ordinal"] = adult["education"].map(edu_mapping)
adult.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,education_ordinal
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,0.414753
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,0.414753
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,0.159509
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,0.051064
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,0.414753

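The group-by, dictionary, and map steps can also be collapsed into one call. The sketch below uses a small hypothetical DataFrame (`toy`, with invented rows) to show that `groupby(...).transform("mean")` gives the same encoded column as the dictionary approach.

```python
import pandas as pd

# Hypothetical toy data with a binary target; values are invented for illustration.
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "HS-grad", "Masters"],
    "income": [1, 0, 0, 0, 1, 1],
})

# One step: broadcast each group's mean back onto its rows.
toy["education_encoded"] = toy.groupby("education")["income"].transform("mean")

# Same result via the dictionary route used in this notebook.
mapping = toy.groupby("education")["income"].mean().to_dict()
mapped = toy["education"].map(mapping)
print(toy)
```

The dictionary route has one practical advantage: the mapping can be stored and reused on rows that were not part of the group-by.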

We drop the column we just created before proceeding to the next encoding method.

In [10]:
adult.drop("education_ordinal", axis=1, inplace=True)

In [11]:
adult.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


#### Advantages and Disadvantages of Target Mean Encoding

##### Advantages
1. Easy to use
2. Does not increase the feature space

##### Disadvantages
1. The process becomes more complicated when the target has more than two classes
2. Prone to overfitting and target leakage, because the encoding is computed from the target itself

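One common way to limit this leakage is to learn the mapping from training rows only and then apply it to held-out rows. The sketch below is an illustration under assumptions, not part of the original notebook: the `train` and `holdout` frames are hypothetical, and falling back to the overall training mean for unseen categories is just one possible choice.

```python
import pandas as pd

# Hypothetical toy split; in practice these rows would come from splitting `adult`.
train = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "Masters"],
    "income": [1, 0, 0, 0, 1],
})
holdout = pd.DataFrame({"education": ["HS-grad", "Masters", "Doctorate"]})

# Learn the category means from the training rows only.
train_mapping = train.groupby("education")["income"].mean()
global_mean = train["income"].mean()

# Categories never seen in training map to NaN; fall back to the training mean.
holdout["education_encoded"] = holdout["education"].map(train_mapping).fillna(global_mean)
print(holdout)
```

Here "Doctorate" does not occur in `train`, so it receives the overall training mean instead of a category-specific value.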
#### 5. Probability Ratio Encoding

In this method, we encode each category using the ratio of the probability of the target being 1 to the probability of the target being 0. Each label is replaced with this ratio for its category.
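
Written as a formula, for a category $c$ of the encoded column and a binary target $y$ (both symbols are introduced here only for illustration):

$$
\text{ratio}(c) = \frac{P(y = 1 \mid c)}{P(y = 0 \mid c)} = \frac{P(y = 1 \mid c)}{1 - P(y = 1 \mid c)}
$$

For instance, a category with $P(y = 1 \mid c) = 0.4$ gets $0.4 / 0.6 \approx 0.667$. A ratio above 1 means the target is more likely to be 1 than 0 for that category.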

As before, we group by the "education" column and take the mean of the "income" column, which gives the probability of the target being 1 (income over 50K) for each category. We assign the result to prob_df.

In [12]:
prob_df=adult.groupby("education")["income"].mean()
prob_df.head()

education
 10th       0.066452
 11th       0.051064
 12th       0.076212
 1st-4th    0.035714
 5th-6th    0.048048
Name: income, dtype: float64

We will change the above series to a DataFrame.

In [13]:
prob_df=pd.DataFrame(prob_df)
prob_df.head()

education,income
10th,0.066452
11th,0.051064
12th,0.076212
1st-4th,0.035714
5th-6th,0.048048


We will change the column name to a more meaningful label.

In [14]:
prob_df.columns=["prob_income_>50k"]

We will calculate probability of income being less than or equal to 50k and add it to our DataFrame.

In [15]:
prob_df["prob_income_<=50k"]=1-prob_df["prob_income_>50k"]
prob_df.head()

education,prob_income_>50k,prob_income_<=50k
10th,0.066452,0.933548
11th,0.051064,0.948936
12th,0.076212,0.923788
1st-4th,0.035714,0.964286
5th-6th,0.048048,0.951952


We will calculate the probability ratio: the probability of income being over 50K divided by the probability of income being lower than or equal to 50K.

In [16]:
prob_df["prob_ratio"]=prob_df["prob_income_>50k"]/prob_df["prob_income_<=50k"]
prob_df.head()

education,prob_income_>50k,prob_income_<=50k,prob_ratio
10th,0.066452,0.933548,0.071183
11th,0.051064,0.948936,0.053812
12th,0.076212,0.923788,0.0825
1st-4th,0.035714,0.964286,0.037037
5th-6th,0.048048,0.951952,0.050473

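The division above assumes the denominator is never 0. If every row of some category had income 1, the ratio for that category would be infinite. The sketch below shows one possible guard, flooring the denominator at a small constant. The `prob` frame, the invented "All-high" category, and the value of `eps` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical probabilities; "All-high" is an invented category whose rows all have income 1.
prob = pd.DataFrame(
    {"prob_income_>50k": [0.0, 0.414753, 1.0]},
    index=["Preschool", "Bachelors", "All-high"],
)
prob["prob_income_<=50k"] = 1 - prob["prob_income_>50k"]

# Plain division yields inf when the denominator is 0.
raw_ratio = prob["prob_income_>50k"] / prob["prob_income_<=50k"]

# One possible guard: floor the denominator at a small constant (an arbitrary choice).
eps = 1e-6
prob["prob_ratio"] = prob["prob_income_>50k"] / prob["prob_income_<=50k"].clip(lower=eps)
print(prob)
```

With the guard, the extreme category gets a large but finite value, and the other ratios are unchanged.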

We will create a dictionary from the "prob_ratio" series for mapping.

In [17]:
edu_mapping = prob_df["prob_ratio"].to_dict()
edu_mapping

{' 10th': 0.0711825487944891,
 ' 11th': 0.05381165919282511,
 ' 12th': 0.08249999999999999,
 ' 1st-4th': 0.037037037037037035,
 ' 5th-6th': 0.050473186119873815,
 ' 7th-8th': 0.06600660066006601,
 ' 9th': 0.055441478439425054,
 ' Assoc-acdm': 0.33042394014962595,
 ' Assoc-voc': 0.3535749265426053,
 ' Bachelors': 0.7086790044671348,
 ' Doctorate': 2.8598130841121487,
 ' HS-grad': 0.18978019487876727,
 ' Masters': 1.2552356020942408,
 ' Preschool': 0.0,
 ' Prof-school': 2.764705882352941,
 ' Some-college': 0.23492547425474256}

Finally, we map the values using the dictionary we created above.

In [18]:
adult["education_ordinal"] = adult["education"].map(edu_mapping)
adult.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,education_ordinal
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,0.708679
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,0.708679
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,0.18978
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,0.053812
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,0.708679


#### Advantages and Disadvantages of Probability Ratio Encoding

##### Advantages
1. Easy to use
2. Does not increase the feature space

##### Disadvantages
1. The process becomes more complicated when the target has more than two classes
2. Prone to overfitting and target leakage, because the encoding is computed from the target itself