## Handling Categorical Features (Nominal Encoding)

In [1]:
import pandas as pd
import numpy as np

### Nominal Encoding

Nominal encoding is used for nominal categorical variables, i.e. columns whose categorical values have no inherent order.
For example, the "sex" column in a dataset is an unordered category.

### Introduction to Adult Dataset

**Abstract:** 
Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.

**Data Set Information:**

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

Prediction task is to determine whether a person makes over 50K a year.

You can learn more about the dataset <a href="http://archive.ics.uci.edu/ml/datasets/Adult">here</a>.

In [2]:
col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

In [3]:
adult = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", names=col_names, na_values=" ?")
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
adult.shape

(32561, 15)

In [5]:
adult.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

In [6]:
adult.isna().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
income               0
dtype: int64

In [7]:
print("Number of unique values in each column")
for i in adult.columns:
    print(i,":", adult[i].nunique(), "unique labels")

Number of unique values in each column
age : 73 unique labels
workclass : 8 unique labels
fnlwgt : 21648 unique labels
education : 16 unique labels
education-num : 16 unique labels
marital-status : 7 unique labels
occupation : 14 unique labels
relationship : 6 unique labels
race : 5 unique labels
sex : 2 unique labels
capital-gain : 119 unique labels
capital-loss : 92 unique labels
hours-per-week : 94 unique labels
native-country : 41 unique labels
income : 2 unique labels


#### 1. One Hot Encoding for variables with few categories

We can perform one hot encoding using the pandas top-level function "get_dummies()". We pass the DataFrame and the names of the columns we want to one hot encode. The original columns are removed and a new column is created for each category.

In [8]:
pd.get_dummies(adult, columns=["race", "sex", "income"]).head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,capital-gain,capital-loss,...,native-country,race_ Amer-Indian-Eskimo,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Female,sex_ Male,income_ <=50K,income_ >50K
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,2174,0,...,United-States,0,0,0,0,1,0,1,1,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,0,0,...,United-States,0,0,0,0,1,0,1,1,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,0,0,...,United-States,0,0,0,0,1,0,1,1,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,0,0,...,United-States,0,0,1,0,0,0,1,1,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,0,0,...,Cuba,0,0,1,0,0,1,0,1,0
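
As a minimal sketch of what "get_dummies()" does, the toy DataFrame below (hypothetical data, not taken from the Adult dataset) is encoded on a single column. Passing `dtype=int` is an assumption made here so that the indicators appear as 0/1; recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Adult dataset.
toy = pd.DataFrame({"age": [39, 50, 28], "sex": ["Male", "Male", "Female"]})

# One new indicator column is created per label; the original "sex" column is removed.
# dtype=int keeps the indicators as 0/1 (newer pandas versions default to booleans).
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)
print(encoded)
```

Columns that are not listed in `columns` (here "age") pass through unchanged.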


Note that to represent 'n' categories the above code created 'n' new columns, but 'n-1' new columns are enough. For example, for the two values of "sex", male and female, a single column for male is sufficient: if the value is 1 the sex is male, and if the value is 0 the sex is female. We don't need an additional column for the female category. We can achieve this using the parameter "drop_first=True", which drops the column for the first category of each encoded variable. We could overwrite the original DataFrame to make the changes permanent, but we will skip that part for now and move on to another method.

In [9]:
pd.get_dummies(adult, columns=["race", "sex", "income"], drop_first = True).head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,capital-gain,capital-loss,hours-per-week,native-country,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Male,income_ >50K
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,2174,0,40,United-States,0,0,0,1,1,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,0,0,13,United-States,0,0,0,1,1,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,0,0,40,United-States,0,0,0,1,1,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,0,0,40,United-States,0,1,0,0,1,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,0,0,40,Cuba,0,1,0,0,0,0
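
The effect of "drop_first=True" can be checked on a small example. The sketch below uses a hypothetical toy column and compares the full and reduced encodings; the dropped category is the one whose rows end up with all indicators equal to 0.

```python
import pandas as pd

# Hypothetical toy data with three categories.
toy = pd.DataFrame({"race": ["White", "Black", "Other", "White"]})

full = pd.get_dummies(toy, columns=["race"], dtype=int)
reduced = pd.get_dummies(toy, columns=["race"], drop_first=True, dtype=int)

# The column missing from the reduced encoding is the dropped category.
dropped = sorted(set(full.columns) - set(reduced.columns))

# Rows belonging to the dropped category have every remaining indicator set to 0.
all_zero_rows = reduced.index[reduced.sum(axis=1) == 0].tolist()
print(dropped, all_zero_rows)
```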


#### 2. One Hot Encoding for variables with many categories

One hot encoding creates one new feature per category. If we have a categorical column with many categories, it will add many new features. If there are multiple such columns, the number of features in our DataFrame grows quickly and may hamper the accuracy of our model. We will discuss one method that can be applied to a categorical column with many categories: we will encode only the top 10 categories instead of all 41 categories present in the "native-country" column. Understand that we are treating "native-country" as a nominal category only for learning purposes. With respect to predicting income, location can affect income, so one could argue that "native-country" carries an order rather than being purely nominal.

We prefer this method when there is a rapid decrease in value counts after the first n categories.

Why does this method work?  
This method probably works because categories with low value counts may very well be outliers or errors. In addition, the remaining categories are still represented implicitly, as rows in which all the new columns are 0, like the dropped "female" column above. However, this method is not well studied and may not be suitable for your data.

In [10]:
adult["native-country"].value_counts().head(10)

 United-States    29170
 Mexico             643
 Philippines        198
 Germany            137
 Canada             121
 Puerto-Rico        114
 El-Salvador        106
 India              100
 Cuba                95
 England             90
Name: native-country, dtype: int64
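
One way to judge whether the value counts fall off quickly enough is to look at the cumulative share of rows covered by the top categories. The sketch below is only an illustration: it re-enters the ten counts printed above by hand and takes the number of non-missing rows from the shape and missing-value outputs shown earlier (32561 rows, 583 missing).

```python
import pandas as pd

# Top-10 counts copied from the value_counts() output above.
top_counts = pd.Series({
    "United-States": 29170, "Mexico": 643, "Philippines": 198, "Germany": 137,
    "Canada": 121, "Puerto-Rico": 114, "El-Salvador": 106, "India": 100,
    "Cuba": 95, "England": 90,
})

# Non-missing rows: total rows minus missing "native-country" values.
n_rows = 32561 - 583

# Cumulative fraction of non-missing rows covered by the top 1, 2, ..., 10 labels.
coverage = top_counts.cumsum() / n_rows
print(coverage.round(3))
```

With these numbers the first category alone covers about 91% of the non-missing rows and the top 10 together cover about 96%, which is the kind of steep drop-off this method relies on.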

We will begin by forming a list of the 10 most frequent countries in the "native-country" column.

In [11]:
top_10 = list(adult["native-country"].value_counts().head(10).index)
top_10

[' United-States',
 ' Mexico',
 ' Philippines',
 ' Germany',
 ' Canada',
 ' Puerto-Rico',
 ' El-Salvador',
 ' India',
 ' Cuba',
 ' England']

We will use a loop to encode the above 10 categories.

In [12]:
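# Add one 0/1 indicator column per top-10 country.
# The labels (and so the new column names) keep the leading space from the raw file.
for category in top_10:
    adult[category] = np.where(adult["native-country"] == category, 1, 0)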

We can view the last few columns of our DataFrame to see the change.

In [13]:
adult.iloc[0:5, -12:]

Unnamed: 0,native-country,income,United-States,Mexico,Philippines,Germany,Canada,Puerto-Rico,El-Salvador,India,Cuba,England
0,United-States,<=50K,1,0,0,0,0,0,0,0,0,0
1,United-States,<=50K,1,0,0,0,0,0,0,0,0,0
2,United-States,<=50K,1,0,0,0,0,0,0,0,0,0
3,United-States,<=50K,1,0,0,0,0,0,0,0,0,0
4,Cuba,<=50K,0,0,0,0,0,0,0,0,1,0
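
The loop above modifies the DataFrame in place and names the new columns after the raw labels. A possible variation, sketched below under stated assumptions, wraps the same idea in a helper that returns a copy and prefixes the new column names. The function name `encode_top_k` and its `prefix` parameter are hypothetical and are not part of pandas; the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

def encode_top_k(df, column, k=10, prefix=None):
    """Return a copy of df with 0/1 indicator columns for the k most frequent labels."""
    prefix = column if prefix is None else prefix
    out = df.copy()
    top_labels = out[column].value_counts().head(k).index
    for label in top_labels:
        # Strip stray whitespace from the label when building the column name.
        out[f"{prefix}_{str(label).strip()}"] = np.where(out[column] == label, 1, 0)
    return out

# Hypothetical toy data; labels carry a leading space like the raw Adult file.
toy = pd.DataFrame({"country": [" US", " US", " Mexico", " Cuba", " US", " Mexico"]})
encoded = encode_top_k(toy, "country", k=2)
print(encoded)
```

Rows whose label is outside the top k (here " Cuba") get 0 in every new column, and the input DataFrame is left untouched.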


#### Advantages and Disadvantages of One Hot Encoding 

##### Advantages
1. Easy to implement

##### Disadvantages
1. Creating additional features may hamper accuracy (curse of dimensionality)
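
To put a number on this disadvantage, the sketch below adds up how many indicator columns a full one hot encoding of every object column would create. The counts are copied by hand from the unique-label output shown earlier; nothing is read from the dataset here.

```python
# Unique-label counts for the object columns, copied from the nunique() output above.
n_labels = {
    "workclass": 8, "education": 16, "marital-status": 7, "occupation": 14,
    "relationship": 6, "race": 5, "sex": 2, "native-country": 41, "income": 2,
}

full = sum(n_labels.values())                    # one indicator column per label
reduced = sum(n - 1 for n in n_labels.values())  # with drop_first=True: n - 1 per column
print(full, reduced)
```

Under these counts the nine object columns would turn into 101 indicator columns (92 with "drop_first=True"), with "native-country" alone contributing 41 of them.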