## Handling Categorical Features (Ordinal Encoding-1)

In [1]:
import pandas as pd
import numpy as np


### Ordinal Encoding

Ordinal encoding is used for ordinal categorical variables, i.e. columns whose categorical values have a natural order.
For example, in a "Review" column the values "poor", "good", and "excellent" are ordered.
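
As a minimal sketch of the idea, assuming a hypothetical `review` column (it is not part of the Adult dataset used below), the ordered labels can be mapped to integers that preserve their rank:

```python
import pandas as pd

# Hypothetical toy data; the "review" column is only for illustration.
reviews = pd.DataFrame({"review": ["good", "poor", "excellent", "good"]})

# The integers carry the order poor < good < excellent.
review_order = {"poor": 0, "good": 1, "excellent": 2}
reviews["review_ordinal"] = reviews["review"].map(review_order)

print(reviews)
```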

### Introduction to Adult Dataset

**Abstract:** 
Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.

**Data Set Information:**

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))

The prediction task is to determine whether a person makes over 50K a year.

You can learn more about the dataset <a href="http://archive.ics.uci.edu/ml/datasets/Adult">here</a>.

In [2]:
col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

In [3]:
adult = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", names=col_names, na_values=" ?")
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
print("Number of unique values in each column")
for i in adult.columns:
    print(i,":", adult[i].nunique())

Number of unique values in each column
age : 73
workclass : 8
fnlwgt : 21648
education : 16
education-num : 16
marital-status : 7
occupation : 14
relationship : 6
race : 5
sex : 2
capital-gain : 119
capital-loss : 92
hours-per-week : 94
native-country : 41
income : 2


#### 1. Ordinal number Encoding

In this method we map the categories to integers based on our knowledge of the order of the categories.

In the DataFrame above, "education-num" is already the ordinal encoding of the "education" column. To practice, we will first drop that column and then recreate the encoding ourselves.

In [5]:
adult.drop("education-num", axis=1, inplace=True)

We will create a dictionary that we will later use to map the values. Note that the keys keep the leading space carried by the raw labels in this file.

In [6]:
edu_mapping={" Preschool":1,
             " 1st-4th":2,
             " 5th-6th":3,
             " 7th-8th":4,
             " 9th":5,
             " 10th":6,
             " 11th":7,
             " 12th":8,
             " HS-grad":9,
             " Some-college":10,
             " Assoc-voc":11,
             " Assoc-acdm":12,
             " Bachelors":13,
             " Masters":14,
             " Prof-school":15,
             " Doctorate":16}
edu_mapping

{' Preschool': 1,
 ' 1st-4th': 2,
 ' 5th-6th': 3,
 ' 7th-8th': 4,
 ' 9th': 5,
 ' 10th': 6,
 ' 11th': 7,
 ' 12th': 8,
 ' HS-grad': 9,
 ' Some-college': 10,
 ' Assoc-voc': 11,
 ' Assoc-acdm': 12,
 ' Bachelors': 13,
 ' Masters': 14,
 ' Prof-school': 15,
 ' Doctorate': 16}

In [7]:
adult["education_ordinal"] = adult["education"].map(edu_mapping)

In [8]:
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,education_ordinal
0,39,State-gov,77516,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,13
1,50,Self-emp-not-inc,83311,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,13
2,38,Private,215646,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,9
3,53,Private,234721,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,7
4,28,Private,338409,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,13
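
`Series.map` returns `NaN` for any value that is missing from the dictionary, so a mistyped key (for example, one without the leading space present in this dataset's labels) fails silently. A minimal sketch of a check, using a small stand-in Series rather than the full download:

```python
import pandas as pd

# Stand-in for adult["education"]; note the leading spaces in the raw labels.
education = pd.Series([" Bachelors", " HS-grad", " 11th", " Bachelors"])

# Deliberately incomplete mapping: "HS-grad" lacks the leading space.
partial_mapping = {" Bachelors": 13, "HS-grad": 9, " 11th": 7}

encoded = education.map(partial_mapping)

# Labels that were not found in the mapping come back as NaN.
unmapped = education[encoded.isna()].unique()
print(unmapped)
```

An empty `unmapped` array would indicate that every label was covered by the dictionary.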


We will drop the column we just created before proceeding to the next encoding method.

In [9]:
adult.drop("education_ordinal", axis=1, inplace=True)

In [10]:
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


#### Advantages and Disadvantages of Ordinal number Encoding

##### Advantages
1. Easy to use
2. Does not increase the feature space

##### Disadvantages
1. Because we supply the order and the spacing ourselves, a wrong order or unrealistic gaps between the integers may bias the model towards some categories
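
An alternative sketch, not used in this notebook, is to declare the order once with a pandas ordered `Categorical` and read off the integer codes. The category list below is an abbreviated, illustrative subset of the education levels, and the codes start at 0 rather than 1.

```python
import pandas as pd

education = pd.Series([" HS-grad", " Bachelors", " Masters", " HS-grad"])

# Abbreviated order for illustration; a real list would name every level.
levels = [" HS-grad", " Bachelors", " Masters"]

ordered = pd.Categorical(education, categories=levels, ordered=True)

# .codes gives the 0-based position of each value in `levels` (-1 if absent).
codes = pd.Series(ordered.codes, index=education.index)
print(codes.tolist())
```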

#### 2. Count or Frequency Encoding

In this method we map each category to the number of times it occurs in the column.

In [11]:
adult["education"].value_counts().to_dict()

{' HS-grad': 10501,
 ' Some-college': 7291,
 ' Bachelors': 5355,
 ' Masters': 1723,
 ' Assoc-voc': 1382,
 ' 11th': 1175,
 ' Assoc-acdm': 1067,
 ' 10th': 933,
 ' 7th-8th': 646,
 ' Prof-school': 576,
 ' 9th': 514,
 ' 12th': 433,
 ' Doctorate': 413,
 ' 5th-6th': 333,
 ' 1st-4th': 168,
 ' Preschool': 51}

In [12]:
edu_count_map = adult["education"].value_counts().to_dict()

In [13]:
adult["education_ordinal"] = adult["education"].map(edu_count_map)
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,education_ordinal
0,39,State-gov,77516,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,5355
1,50,Self-emp-not-inc,83311,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,5355
2,38,Private,215646,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,10501
3,53,Private,234721,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,1175
4,28,Private,338409,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,5355


We will drop the column we just created before proceeding to the next encoding method.

In [14]:
adult.drop("education_ordinal", axis=1, inplace=True)

In [15]:
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


#### Advantages and Disadvantages of Count Encoding

##### Advantages
1. Easy to use
2. Does not increase the feature space

##### Disadvantages
1. Categories with the same frequency receive the same value, so the model cannot tell them apart
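
A small made-up Series shows this limitation: two different categories that occur equally often receive the same code. The sketch also shows the relative-frequency variant, obtained with `value_counts(normalize=True)`, which has the same limitation.

```python
import pandas as pd

# Made-up data: "a" and "b" both appear twice, "c" appears once.
s = pd.Series(["a", "b", "a", "b", "c"])

count_map = s.value_counts().to_dict()
freq_map = s.value_counts(normalize=True).to_dict()

# After encoding, "a" and "b" can no longer be distinguished.
count_encoded = s.map(count_map)
freq_encoded = s.map(freq_map)

print(count_encoded.tolist())
print(freq_encoded.tolist())
```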

#### 3. Target Guided Ordinal Encoding

In this method we encode the categories according to the target: the labels are ranked by the probability of the target being 1 within each category, and each label is replaced by its rank.

In [16]:
adult["income"].value_counts()

 <=50K    24720
 >50K      7841
Name: income, dtype: int64

We will begin by mapping income over 50K to 1 and income lower than or equal to 50K to 0.

In [17]:
adult["income"] = adult["income"].map({" <=50K":0, " >50K":1})
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [18]:
adult["education"].unique()

array([' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th',
       ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th',
       ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th',
       ' Preschool', ' 12th'], dtype=object)

We can group by the "education" column and then take the mean of the "income" column to estimate the probability of the target ("income") being 1 for each category.
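
The mean works as a probability estimate because "income" now holds only 0 and 1, so the group mean is the fraction of rows with income 1. Writing $n_c$ for the number of rows in category $c$ (notation introduced here for illustration):

$$
\hat{P}(\text{income}=1 \mid \text{education}=c) = \frac{1}{n_c} \sum_{i \,:\, \text{education}_i = c} \text{income}_i
$$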

In [19]:
adult.groupby("education")["income"].mean()

education
 10th            0.066452
 11th            0.051064
 12th            0.076212
 1st-4th         0.035714
 5th-6th         0.048048
 7th-8th         0.061920
 9th             0.052529
 Assoc-acdm      0.248360
 Assoc-voc       0.261216
 Bachelors       0.414753
 Doctorate       0.740920
 HS-grad         0.159509
 Masters         0.556587
 Preschool       0.000000
 Prof-school     0.734375
 Some-college    0.190235
Name: income, dtype: float64

We will sort the values in ascending order and then retrieve the index, which lists the categories in that order.

In [20]:
ordinal_labels = adult.groupby("education")["income"].mean().sort_values().index
ordinal_labels

Index([' Preschool', ' 1st-4th', ' 5th-6th', ' 11th', ' 9th', ' 7th-8th',
       ' 10th', ' 12th', ' HS-grad', ' Some-college', ' Assoc-acdm',
       ' Assoc-voc', ' Bachelors', ' Masters', ' Prof-school', ' Doctorate'],
      dtype='object', name='education')

We will use a dictionary comprehension with enumerate to create a dictionary with order starting with 0.

In [21]:
edu_mapping = {k:i for i,k in enumerate(ordinal_labels,0)}
edu_mapping

{' Preschool': 0,
 ' 1st-4th': 1,
 ' 5th-6th': 2,
 ' 11th': 3,
 ' 9th': 4,
 ' 7th-8th': 5,
 ' 10th': 6,
 ' 12th': 7,
 ' HS-grad': 8,
 ' Some-college': 9,
 ' Assoc-acdm': 10,
 ' Assoc-voc': 11,
 ' Bachelors': 12,
 ' Masters': 13,
 ' Prof-school': 14,
 ' Doctorate': 15}

Finally, we map the values using the dictionary we created above.

In [22]:
adult["education_ordinal"] = adult["education"].map(edu_mapping)
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,education_ordinal
0,39,State-gov,77516,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,12
1,50,Self-emp-not-inc,83311,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,12
2,38,Private,215646,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,8
3,53,Private,234721,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,3
4,28,Private,338409,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,12


#### Advantages and Disadvantages of Target guided Ordinal Encoding

##### Advantages
1. Easy to use
2. Does not increase the feature space

##### Disadvantages
1. The process becomes more complicated when the target has more than two classes
2. Prone to overfitting and to creating a biased dataset, because the encoding is derived from the target itself
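
One common way to limit this bias is to learn the mapping from the training rows only and then apply it to held-out rows. The sketch below uses made-up data and hypothetical names (`fit_target_ordinal`, `train`, `test`); it is an illustration under those assumptions, not part of the original notebook.

```python
import pandas as pd

def fit_target_ordinal(frame, column, target):
    """Rank categories by mean target value, lowest first (0-based)."""
    ordered = frame.groupby(column)[target].mean().sort_values().index
    return {label: rank for rank, label in enumerate(ordered)}

# Made-up training and test data for illustration.
train = pd.DataFrame({
    "education": ["HS", "HS", "BSc", "BSc", "PhD", "PhD"],
    "income":    [0,    0,    0,     1,     1,     1],
})
test = pd.DataFrame({"education": ["PhD", "HS", "MSc"]})

# The mapping is learned from the training rows only.
mapping = fit_target_ordinal(train, "education", "income")

# Apply the training-derived mapping; "MSc" was never seen, so it becomes NaN.
test["education_ordinal"] = test["education"].map(mapping)
print(mapping)
print(test)
```

Categories that were unseen during training map to `NaN` and need a fallback of your choosing, for example a dedicated code or the most common rank.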