### Encoding categorical data into a numeric representation ###

<b>Please note: these encodings are applied only to features, because many algorithms cannot accept categorical features. Do not apply them to labels.</b>

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Create example dataframe
# Feature Columns: gender, city, temperature
# Label Column: rating


data = pd.DataFrame(
       [['female', 'New York', 'low', 4], ['female', 'London', 'medium', '3f'], ['male', 'New Delhi', 'high', 2]],
       columns=['Gender', 'City', 'Temperature', 'Rating'])
print(f"Data:\n{data}\nShape:\n{data.shape}\n\nData Types:\n{data.dtypes}")

Data:
   Gender       City Temperature Rating
0  female   New York         low      4
1  female     London      medium     3f
2    male  New Delhi        high      2
Shape:
(3, 4)

Data Types:
Gender         object
City           object
Temperature    object
Rating         object
dtype: object


In [3]:
# Preprocessing: remove the stray "f" from "3f"
# If a dataset contains many aberrations of this kind, regular expressions are usually the most efficient way to clean up the strings

# Assign the result back to the column. Calling replace(..., inplace=True) on data["Rating"]
# acts on an intermediate copy, raises a FutureWarning, and will stop working in pandas 3.0.
data["Rating"] = data["Rating"].replace("3f", 3).astype(int)

print(f"{data}\n\nData Types:\n{data.dtypes}")

   Gender       City Temperature  Rating
0  female   New York         low       4
1  female     London      medium       3
2    male  New Delhi        high       2

Data Types:
Gender         object
City           object
Temperature    object
Rating          int64
dtype: object
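When stray characters like this show up in many rows, a regular expression can clean the whole column in one pass. The following is a minimal sketch, not part of the original notebook: the `ratings` values are made up for illustration, and the pattern assumes every value contains exactly one run of digits worth keeping.

```python
import pandas as pd

# Hypothetical messy ratings: integers mixed with strings that carry stray text
ratings = pd.Series([4, "3f", 2, "5 stars"], name="Rating")

# Convert everything to strings, pull out the first run of digits, then cast to int
clean = ratings.astype(str).str.extract(r"(\d+)", expand=False).astype(int)

print(clean.tolist())
```

A value with no digits at all would come back as missing from `str.extract`, and the final `astype(int)` would then fail, so real data may need an explicit check before the cast.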



    - All of the features are categorical and stored as strings. The only numeric column is the label, Rating.
    
    - Gender is a binary category. In this dataset it is either male or female.
    
    - City is a nominal category. This is because it's not meaningful to order the cities in any way.
    
    - Temperature is an ordinal category. This is because there is a meaningful order to the categories,
        i.e. greater-than and less-than comparisons are meaningful.
    


In [4]:
# One can specify the order if we know it - as in the case of temperature in this dataset
data['Temperature_encoded'] = data['Temperature'].map( {'low':0, 'medium':1, 'high':2}) # Encoding Categorical Data
print(data)

   Gender       City Temperature  Rating  Temperature_encoded
0  female   New York         low       4                    0
1  female     London      medium       3                    1
2    male  New Delhi        high       2                    2
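An alternative to a hand-written dictionary is an ordered pandas categorical, which derives the integer codes from the order in which the categories are listed. This is a sketch under assumptions: the extra value `"scorching"` is hypothetical and is included only to show what happens to a value outside the declared categories.

```python
import pandas as pd

# The last value is hypothetical and not one of the declared categories
temperature = pd.Series(["low", "medium", "high", "scorching"])

# The position in `categories` defines the rank: low < medium < high
ordered = pd.Categorical(
    temperature, categories=["low", "medium", "high"], ordered=True
)

# Values outside the declared categories receive the code -1
codes = ordered.codes.tolist()
print(codes)
```

The `-1` code makes unexpected values easy to spot, whereas `map()` with a dictionary turns them into `NaN`.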


In [5]:
# Use LabelEncoder for the City column
# The codes follow the sorted order of the city names, so they do not imply any ranking
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder() # Instantiate the LabelEncoder object

data["City_Encoded"] = le.fit_transform(data['City']) # Fit the encoder and transform the City column

print(f"Encoded Data:\n{data.drop(['City','Temperature'],axis=1)}") # Show the data without the original columns (data itself is not modified)

Encoded Data:
   Gender  Rating  Temperature_encoded  City_Encoded
0  female       4                    0             2
1  female       3                    1             0
2    male       2                    2             1
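The codes above (New York is 2, London is 0, New Delhi is 1) come from `LabelEncoder` sorting the unique values before numbering them. The short sketch below, which uses a plain list rather than the notebook's dataframe, shows how to inspect that mapping and how to recover the original strings.

```python
from sklearn.preprocessing import LabelEncoder

cities = ["New York", "London", "New Delhi"]

le = LabelEncoder()
codes = le.fit_transform(cities)

# classes_ holds the unique values in sorted order; the index of each is its code
classes = list(le.classes_)

# inverse_transform maps the integer codes back to the original strings
decoded = list(le.inverse_transform(codes))

print(classes)
print(codes.tolist())
print(decoded)
```

Keeping the fitted encoder around matters in practice, because the same object is needed to encode new data consistently or to decode values later.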


<b>Binary encoding</b>

Binary features (such as Gender) are a special case of categorical features. Here's a way to encode them with `map()`, which also preserves any missing values as missing:

In [6]:
data['Gender'] = data['Gender'].map( {'male':1, 'female':0} )
print(f"Encoded Data:\n{data.drop(['City','Temperature'],axis=1)}")

Encoded Data:
   Gender  Rating  Temperature_encoded  City_Encoded
0       0       4                    0             2
1       0       3                    1             0
2       1       2                    2             1
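The dataset above has no missing values, so the claim about preserving them is not visible in the output. The sketch below uses a made-up series with one missing entry and one unexpected value (`"unknown"`, an assumption for illustration) to show what `map()` does with each.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing entry and a value absent from the mapping
gender = pd.Series(["male", "female", np.nan, "unknown"])

encoded = gender.map({"male": 1, "female": 0})

# Missing entries stay missing, and unmapped values also become NaN
missing_mask = encoded.isna().tolist()
print(encoded.tolist())
print(missing_mask)
```

Because unmapped values silently become `NaN` as well, it is worth checking `isna()` after the mapping to catch typos or unexpected categories.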


<b>One-hot encoding (a very popular method)</b>

One-hot encoding represents each possible value of a category as a separate 0/1 feature. The most straightforward way to do this is with pandas (e.g. with the City feature again):

In [10]:
pd.get_dummies(data['City'], prefix='City').astype(int) # One hot encoding

   City_London  City_New Delhi  City_New York
0            0               0              1
1            1               0              0
2            0               1              0


In [9]:
# To one-hot encode all of the dataset's categorical features at once (when order does not matter), pass the whole DataFrame to get_dummies

data = pd.DataFrame(
       [['female', 'New York', 'low', 4], ['female', 'London', 'medium', 3], ['male', 'New Delhi', 'high', 2]],
       columns=['Gender', 'City', 'Temperature', 'Rating'])

new_data =  pd.get_dummies(data).astype(int)

print(new_data,'\n',new_data.shape)

   Rating  Gender_female  Gender_male  City_London  City_New Delhi  \
0       4              1            0            0               0   
1       3              1            0            1               0   
2       2              0            1            0               1   

   City_New York  Temperature_high  Temperature_low  Temperature_medium  
0              1                 0                1                   0  
1              0                 0                0                   1  
2              0                 1                0                   0   
 (3, 9)
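`get_dummies` also accepts parameters that control which columns are encoded and how many indicator columns are produced. The sketch below is one possible variation, not the notebook's method: it encodes only City on a reduced copy of the data, requests integer output directly through `dtype`, and uses `drop_first=True` to drop the first category, since it is implied when all remaining indicators are 0.

```python
import pandas as pd

# Reduced copy of the example data, with only one categorical column
data = pd.DataFrame(
    {"City": ["New York", "London", "New Delhi"], "Rating": [4, 3, 2]}
)

# Encode only the City column and drop its first category (London)
encoded = pd.get_dummies(data, columns=["City"], drop_first=True, dtype=int)

print(encoded)
print(encoded.shape)
```

Dropping one indicator per feature avoids perfectly correlated columns, which can matter for linear models; tree-based models are generally indifferent to it.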


https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html