# Encoding categorical variables with Python



Categorical data takes its values from a limited set of labels, or categories, rather than from a numeric scale. There are two types: nominal and ordinal. Nominal data has no particular order or hierarchy, for example the names of states. Ordinal data has categories with a meaningful order, but the size of the differences between categories is unknown or not meaningful, for example ratings such as low, medium, and high.
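To make the distinction concrete, here is a minimal sketch using pandas' `Categorical` type. The color and size values are hypothetical toy data, not part of the dataset used in this article.

```python
import pandas as pd

# Hypothetical toy values, not taken from the article's dataset.
colors = pd.Categorical(["red", "green", "blue", "green"])  # nominal: no order

sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,  # ordinal: small < medium < large
)

print(colors.ordered)            # whether an order is defined for colors
print(sizes.ordered)             # whether an order is defined for sizes
print(sizes.codes)               # position of each value in the category order
print(sizes.min(), sizes.max())  # min/max only make sense for ordered categories
```

Only the ordered categorical supports comparisons such as `min` and `max`, which is the practical difference between the two types.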

We will be working with a dataset of 1,000 companies (1000_Companies.csv) for this article to demonstrate how to work with categorical data. Let’s explore it and see what type of data we are working with.

In [1]:
import pandas as pd
import numpy as np


In [2]:
df = pd.read_csv('1000_Companies.csv')
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [6]:
df.columns

Index(['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'], dtype='object')

# Ordinal encoding


We mentioned already that ordinal data has an order and a hierarchy between its values. Our data frame has one categorical feature, State. States have no natural order, so the order we assign below is arbitrary and serves only to demonstrate the technique. Let us look at the State feature and run value_counts to see how many times each label appears.

In [3]:
df.State

0        New York
1      California
2         Florida
3        New York
4         Florida
          ...    
995    California
996    California
997    California
998    California
999      New York
Name: State, Length: 1000, dtype: object

In [18]:
print(df['State'].value_counts())

California    344
New York      334
Florida       322
Name: State, dtype: int64


We need to convert these labels into numbers, and we can do this with two different approaches. First, we can create a dictionary where every label is the key and its new numeric code is the value, for example 'Florida': 3. Then we map each label from the State column to its numeric value and store the result in a new column called states_codes.

In [19]:
states_dict =  {'California':1, 'New York':2, 'Florida':3}

In [12]:
states_dict

{'California': 1, 'New York': 2, 'Florida': 3}

In [22]:
new_col= df['State'].map(states_dict)


In [24]:
print(new_col.head())

0    2
1    1
2    3
3    2
4    3
Name: State, dtype: int64
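One thing worth checking with the dictionary approach is whether every label actually appears in the mapping. The sketch below uses a small hypothetical series in which 'Texas' is deliberately left out of the dictionary, to show how `.map` behaves in that case.

```python
import pandas as pd

# Hypothetical sample; 'Texas' is deliberately missing from the mapping.
states = pd.Series(["New York", "California", "Florida", "Texas"], name="State")
states_dict = {"California": 1, "New York": 2, "Florida": 3}

codes = states.map(states_dict)

# Labels absent from the dictionary become NaN, and the column turns float.
unmapped = states[codes.isna()].unique()
print(codes.tolist())
print(unmapped)
```

If `unmapped` is not empty, the dictionary needs extending before the codes are used in a model.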


In [6]:
df.State.shape

(1000,)

In [25]:
df['states_codes'] = new_col

In [4]:
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit,states_codes
0,165349.2,136897.8,471784.1,New York,192261.83,2
1,162597.7,151377.59,443898.53,California,191792.06,1
2,153441.51,101145.55,407934.54,Florida,191050.39,3
3,144372.41,118671.85,383199.62,New York,182901.99,2
4,142107.34,91391.77,366168.42,Florida,166187.94,3


# Sklearn encoding

The second approach uses the OrdinalEncoder class from sklearn.preprocessing. We follow a similar approach: we pass our categories as a list in the desired order, and then call .fit_transform on the values of our State feature. The encoder expects a 2-D array, which is why you’ll notice the call to .reshape(-1,1). Note that the encoder numbers the categories starting from 0, whereas our dictionary started at 1.

Note also that NaN values can make .fit_transform fail, for example when they are not part of the category list or on older scikit-learn versions, so it is safest to handle them before encoding.
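As a rough illustration of handling missing values first, the sketch below fills NaN with a placeholder label and adds that label to the category list. The sample data and the 'Unknown' label are assumptions made for this example; whether a placeholder, dropping rows, or another strategy is appropriate depends on your data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample with one missing value.
sample = pd.DataFrame({"State": ["New York", np.nan, "Florida", "California"]})

# Assumption: missing states are treated as their own 'Unknown' label.
# Selecting with double brackets keeps the data 2-D, so no reshape is needed.
filled = sample[["State"]].fillna("Unknown")

encoder = OrdinalEncoder(
    categories=[["California", "New York", "Florida", "Unknown"]]
)
encoded = encoder.fit_transform(filled)
print(encoded.ravel())
```

Selecting the column as `sample[["State"]]` is an alternative to `.reshape(-1,1)`, since a one-column DataFrame is already two-dimensional.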

In [30]:
# using scikit-learn
from sklearn.preprocessing import OrdinalEncoder
 
# create encoder and set category order
encoder = OrdinalEncoder(categories=[['California', 'New York', 'Florida']])
state_reshaped = np.array(df['State']).reshape(-1,1)
state_reshaped

array([['New York'],
       ['California'],
       ['Florida'],
       ['New York'],
       ['Florida'],
       ...

In [33]:
scaled_val = encoder.fit_transform(state_reshaped)
print(scaled_val)

[[1.]
 [0.]
 [2.]
 [1.]
 [2.]
 ...

In [35]:
df['sklearn_scaled'] =  scaled_val
print(df.head())

   R&D Spend  Administration  Marketing Spend       State     Profit  \
0  165349.20       136897.80        471784.10    New York  192261.83   
1  162597.70       151377.59        443898.53  California  191792.06   
2  153441.51       101145.55        407934.54     Florida  191050.39   
3  144372.41       118671.85        383199.62    New York  182901.99   
4  142107.34        91391.77        366168.42     Florida  166187.94   

   states_codes  sklearn_scaled  
0             2             1.0  
1             1             0.0  
2             3             2.0  
3             2             1.0  
4             3             2.0  
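To double-check which code belongs to which label, a fitted encoder exposes `categories_` and `inverse_transform`. The sketch below is a minimal example on a hypothetical five-row sample that mirrors the first rows of the State column, rather than the full data frame.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical five-row sample mirroring the first rows of the State column.
states = np.array(
    ["New York", "California", "Florida", "New York", "Florida"], dtype=object
).reshape(-1, 1)

encoder = OrdinalEncoder(categories=[["California", "New York", "Florida"]])
codes = encoder.fit_transform(states)

# categories_ lists the labels in the order used for the codes (0, 1, 2, ...).
print(encoder.categories_)

# inverse_transform maps the numeric codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded.ravel().tolist())
```

A round trip that returns the original labels is a quick sanity check that the category order was set as intended.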
