# Hands on: Handling categorical attributes 

- `sklearn.preprocessing.LabelEncoder` (Python class)
- `sklearn.preprocessing.OrdinalEncoder` (Python class)
- `sklearn.preprocessing.OneHotEncoder` (Python class)

## Why map or encode categorical data?


### Overview

- [1 Encoding class labels](#ch1)

    - [1.1 with numpy](#ch1_1)

    - [1.2 with `LabelEncoder`](#ch1_2)

- [2 Encoding ordinal features](#ch2)

    - [2.1 with `OrdinalEncoder`](#ch2_1)

    - [2.2 with map](#ch2_2)

    - [2.3 with `LabelEncoder`?](#ch2_3)

- [3 Encoding nominal features](#ch3)

    - [3.1 using `OneHotEncoder`](#ch3_1)

    - [3.2 using `get_dummies` (Pandas)](#ch3_2)

    - [3.3 with `LabelEncoder`?](#ch3_3)

- [4 Some notes on encoding features](#ch4)

- [5 Addressing heterogeneous data](#ch5)
   
   

In [1]:
# display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'brandA', 'class1'],
                   ['red', 'XL', 13.5, 'brandB', 'class2'],
                   ['blue', 'L', 15.3, 'brandA', 'class1']])
df.columns = ['color', 'size', 'price', 'brand', 'classlabel']
df_original = df.copy()
df
df.loc[1,'size']

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,blue,L,15.3,brandA,class1


'XL'

# 1 Encoding class labels <a name="ch1"></a>

## 1.1  with numpy  <a name="ch1_1"></a>

In [3]:
import numpy as np
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping
df['classlabel'] = df['classlabel'].map(class_mapping)
df

{'class1': 0, 'class2': 1}

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,0
1,red,XL,13.5,brandB,1
2,blue,L,15.3,brandA,0


**--------------------------------------------------------------------------------------------------------------**

**reverse:** Convert the class labels back to the original representation.

In [4]:
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,blue,L,15.3,brandA,class1


## 1.2 with `LabelEncoder`  <a name="ch1_2"></a>

In [5]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder() 

df['classlabel'] = class_le.fit_transform(df['classlabel'])
df


Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,0
1,red,XL,13.5,brandB,1
2,blue,L,15.3,brandA,0


**--------------------------------------------------------------------------------------------------------------**

**reverse:** Convert the class labels back to the original representation.

In [6]:
df['classlabel'] = class_le.inverse_transform(df['classlabel']) 
df

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,blue,L,15.3,brandA,class1
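
To see which integer `LabelEncoder` assigned to which label, the fitted encoder can be inspected through its `classes_` attribute. The following is a minimal sketch on a stand-alone copy of the class labels; the names `le_demo`, `labels` and `mapping` are illustrative and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['class1', 'class2', 'class1']   # same labels as df['classlabel']
le_demo = LabelEncoder()                  # illustrative name
codes = le_demo.fit_transform(labels)

# classes_ holds the sorted unique labels; the position of a label is its code
mapping = {str(label): int(idx) for idx, label in enumerate(le_demo.classes_)}
print(mapping)

# a fitted encoder can encode new data, as long as it contains only known labels
new_codes = le_demo.transform(['class2', 'class2'])
print(new_codes)
```

The resulting dictionary matches the `class_mapping` built by hand in section 1.1.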


# 2 Encoding ordinal features  <a name="ch2"></a>

## 2.1 with `OrdinalEncoder` <a name="ch2_1"></a>

In [7]:
from sklearn.preprocessing import OrdinalEncoder

**Encoding a matrix of categorical features**

In [8]:
# OrdinalEncoder class is intended for input variables that are organized into rows and columns, e.g. a matrix.

X = df[['color', 'size', 'brand', 'price']].values
oe = OrdinalEncoder()
X[:, [0,1]] = oe.fit_transform(X[:, [0,1]]) # output: array
df2 = df.copy()
df2[['color', 'size', 'brand', 'price']] = X
df2


Unnamed: 0,color,size,price,brand,classlabel
0,1.0,1.0,10.1,brandA,class1
1,2.0,2.0,13.5,brandB,class2
2,0.0,0.0,15.3,brandA,class1


**Encoding a column of categorical features**

In [9]:
df3 = df.copy()
df3[['size']] = oe.fit_transform(df[['size']])  # fit on the original 'size' column, not the already encoded df2
df3

Unnamed: 0,color,size,price,brand,classlabel
0,green,1.0,10.1,brandA,class1
1,red,2.0,13.5,brandB,class2
2,blue,0.0,15.3,brandA,class1


**Q: Take a look at the feature 'size'. Do you think it's correctly encoded?**

**Q: Is it important to keep the order for the feature 'size'? Why?**

### How to encode ordinal features while preserving their order?

In [10]:
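# list the categories from lowest to highest rank, e.g. categories=[['XS', 'S', 'M', 'L', 'XL']] for more sizes
Ord_enc = OrdinalEncoder(categories=[['M', 'L', 'XL']])
# replace the whole column (append + 1 to start the codes at 1 instead of 0)
df['size'] = Ord_enc.fit_transform(df[['size']]).ravel()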
df



Unnamed: 0,color,size,price,brand,classlabel
0,green,0.0,10.1,brandA,class1
1,red,2.0,13.5,brandB,class2
2,blue,1.0,15.3,brandA,class1


**--------------------------------------------------------------------------------------------------------------**

**reverse:** Convert the data back to the original representation

In [11]:
df['size'] = Ord_enc.inverse_transform(df[['size']]).ravel()
df

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,blue,L,15.3,brandA,class1
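
With an explicit `categories` list, a size that is not in the list raises an error at transform time by default. The sketch below shows one way to handle this, assuming scikit-learn 0.24 or later, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'` together with `unknown_value`. The size `'S'` and the variable names are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = np.array([['M'], ['XL'], ['L']], dtype=object)

# ranks follow the order of the list: M=0, L=1, XL=2; unknown sizes become -1
enc = OrdinalEncoder(categories=[['M', 'L', 'XL']],
                     handle_unknown='use_encoded_value',
                     unknown_value=-1)
encoded = enc.fit_transform(sizes)

# 'S' was never declared as a category, so it is mapped to unknown_value
unseen = enc.transform(np.array([['S'], ['XL']], dtype=object))
print(encoded.ravel(), unseen.ravel())
```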


## 2.2 with map  <a name="ch2_2"></a>

In [12]:
size_mapping = {'XL': 2, # dictionary
                'L': 1,
                'M': 0}

df['size'] = df['size'].map(size_mapping)
df

Unnamed: 0,color,size,price,brand,classlabel
0,green,0,10.1,brandA,class1
1,red,2,13.5,brandB,class2
2,blue,1,15.3,brandA,class1


**reverse mapping:**
Convert the data back to the original representation

In [13]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'] = df['size'].map(inv_size_mapping)
df

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,blue,L,15.3,brandA,class1
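
One caveat of the dictionary approach is that `Series.map` silently returns `NaN` for any value that is missing from the dictionary. A minimal sketch, using a hypothetical size `'S'` that does not appear in the notebook's data:

```python
import pandas as pd

size_mapping = {'M': 0, 'L': 1, 'XL': 2}
sizes = pd.Series(['M', 'XL', 'S'])     # 'S' is a hypothetical unseen size

encoded = sizes.map(size_mapping)       # values without a dictionary entry become NaN
missing = sizes[encoded.isna()].tolist()
print(encoded)
print('not in the mapping:', missing)
```

Checking for `NaN` after mapping is a cheap way to detect categories that were forgotten in the dictionary.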


## 2.3 with `LabelEncoder`? <a name="ch2_3"></a>
**Would `LabelEncoder` work for ordinal features?**

In [14]:
X = df[['color', 'size', 'brand']].values
print(X)
le = LabelEncoder()
X[:, 1] = le.fit_transform(X[:, 1])
print('new X')
print(X)

[['green' 'M' 'brandA']
 ['red' 'XL' 'brandB']
 ['blue' 'L' 'brandA']]
new X
[['green' 1 'brandA']
 ['red' 2 'brandB']
 ['blue' 0 'brandA']]


**Q: Which disadvantage do you identify in using `LabelEncoder` for ordinal features?**

### Try to use `LabelEncoder` to encode a matrix input and see the error. Figure out why!
Run the code below.

In [16]:

X = df[['color', 'size', 'brand']].values
print(X)
labenc = LabelEncoder()
X[:, [0,1]] = labenc.fit_transform(X[:, [0,1]])
print(X)

# LabelEncoder expects a one-dimensional input (the single target variable); it does not work with matrices.


[['green' 'M' 'brandA']
 ['red' 'XL' 'brandB']
 ['blue' 'L' 'brandA']]


ValueError: y should be a 1d array, got an array of shape (3, 2) instead.
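
The error can be reproduced in isolation and contrasted with `OrdinalEncoder`, which is designed for two-dimensional feature input. This is a minimal sketch on a stand-alone copy of the two columns; `toy`, `raised` and `matrix` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

toy = pd.DataFrame({'color': ['green', 'red', 'blue'],
                    'size': ['M', 'XL', 'L']})

# LabelEncoder is meant for a 1D target, so a (3, 2) array is rejected
raised = False
try:
    LabelEncoder().fit_transform(toy.values)
except ValueError as err:
    raised = True
    print('LabelEncoder failed:', err)

# OrdinalEncoder accepts a 2D input and encodes each column independently
matrix = OrdinalEncoder().fit_transform(toy)
print(matrix)
```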

# 3 Encoding nominal features <a name="ch3"></a>

## 3.1 using `OneHotEncoder` <a name="ch3_1"></a>

In [17]:
from sklearn.preprocessing import OneHotEncoder
X = df[['color', 'brand']].values
print(X)
ohe = OneHotEncoder()
X=ohe.fit_transform(X).toarray()
print(X)
pd.DataFrame(ohe.fit_transform(df[['color', 'size', 'brand']]).toarray())

[['green' 'brandA']
 ['red' 'brandB']
 ['blue' 'brandA']]
[[0. 1. 0. 1. 0.]
 [0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0.]]


Unnamed: 0,0,1,2,3,4,5,6,7
0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
2,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


## 3.2 using `get_dummies` (Pandas) <a name="ch3_2"></a>

In [18]:
df
features_dummies = pd.get_dummies(df, columns=['color', 'brand'])
features_dummies

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,blue,L,15.3,brandA,class1


Unnamed: 0,size,price,classlabel,color_blue,color_green,color_red,brand_brandA,brand_brandB
0,M,10.1,class1,0,1,0,1,0
1,XL,13.5,class2,0,0,1,0,1
2,L,15.3,class1,1,0,0,1,0
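
Depending on the pandas version, `get_dummies` may return boolean columns (`True`/`False`) instead of the 0/1 integers shown above. A minimal sketch that requests integers explicitly through the `dtype` parameter, on a stand-alone copy of three columns (`toy` and `dummies` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({'color': ['green', 'red', 'blue'],
                    'size': ['M', 'XL', 'L'],
                    'brand': ['brandA', 'brandB', 'brandA']})

# dtype=int forces 0/1 indicator columns regardless of the pandas default
dummies = pd.get_dummies(toy, columns=['color', 'brand'], dtype=int)
print(dummies)
```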


## 3.3 with `LabelEncoder`? <a name="ch3_3"></a>
**Would `LabelEncoder` work for nominal features?**

In [19]:
X = df[['color', 'brand']].values
le = LabelEncoder()
X[:,0] = le.fit_transform(df['color'])
X

array([[1, 'brandA'],
       [2, 'brandB'],
       [0, 'brandA']], dtype=object)

**Q: Which disadvantage do you identify in using `LabelEncoder` for nominal features?**

# 4 Some notes on encoding features <a name="ch4"></a>

### N1: `LabelEncoder` shouldn't be used to encode nominal features

`LabelEncoder` assigns integers to labels in alphabetical order, which imposes an ordinal relationship where no such relationship exists.
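
A small numeric illustration of this point: with alphabetical integer codes, 'red' ends up twice as far from 'blue' as 'green' is, whereas one-hot vectors keep every pair of colors at the same distance. The sketch below uses plain numpy and hand-written codes that mirror what `LabelEncoder` would produce for these three colors; the helper names are illustrative.

```python
import numpy as np

colors = ['blue', 'green', 'red']
int_codes = {'blue': 0, 'green': 1, 'red': 2}   # alphabetical integer codes
one_hot = {c: np.eye(3)[i] for i, c in enumerate(colors)}

def int_dist(a, b):
    """Distance between two colors under integer coding."""
    return abs(int_codes[a] - int_codes[b])

def oh_dist(a, b):
    """Euclidean distance between two colors under one-hot coding."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

print(int_dist('blue', 'green'), int_dist('blue', 'red'))   # unequal distances
print(oh_dist('blue', 'green'), oh_dist('blue', 'red'))     # equal distances
```

A distance-based or linear model would treat the unequal integer distances as meaningful, which is the artificial order this note warns about.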



### N2: `LabelEncoder` shouldn't be used to encode ordinal features

`LabelEncoder` assigns integers to labels in alphabetical order, which makes it impossible to define the specific rank order of the ordinal feature.
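
For the sizes used in this notebook, the alphabetical codes and the intended ranks can be compared side by side. A minimal sketch; the `intended` dictionary repeats the M < L < XL ranking used in section 2, and the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

sizes = ['M', 'XL', 'L']

# alphabetical codes: L=0, M=1, XL=2, so 'L' is ranked below 'M'
le_codes = LabelEncoder().fit_transform(sizes)

# intended ranking: M < L < XL
intended = {'M': 0, 'L': 1, 'XL': 2}
intended_codes = [intended[s] for s in sizes]

for s, a, b in zip(sizes, le_codes, intended_codes):
    print(f'{s}: LabelEncoder={a}, intended={b}')
```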



### N3: The two encoding procedures `OneHotEncoder` and `get_dummies` shouldn't be used for ordinal features

`OneHotEncoder` and `get_dummies` don't preserve order (they are meant for nominal features).

**See what happens when `get_dummies` is applied to ordinal features.**

In [20]:
features_dummies = pd.get_dummies(df, columns=['price', 'color', 'size', 'brand'])
features_dummies

Unnamed: 0,classlabel,price_10.1,price_13.5,price_15.3,color_blue,color_green,color_red,size_L,size_M,size_XL,brand_brandA,brand_brandB
0,class1,1,0,0,0,1,0,0,1,0,1,0
1,class2,0,1,0,0,0,1,0,0,1,0,1
2,class1,0,0,1,1,0,0,1,0,0,1,0


# 5 Addressing heterogeneous data: `make_column_transformer` <a name="ch5"></a>
## Applying OrdinalEncoder and OneHotEncoder to part of the columns

In [21]:
import sklearn
target = df.classlabel.values  # equivalent to df['classlabel'].values
features = df[['price', 'color', 'size', 'brand']]
features

Unnamed: 0,price,color,size,brand
0,10.1,green,M,brandA
1,13.5,red,XL,brandB
2,15.3,blue,L,brandA


In [22]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import make_column_transformer

# determine categorical and numerical features
numerical_columns = features.select_dtypes(include=['int64', 'float64']).columns
nominal_columns = ['color', 'brand'] 
ordinal_columns = ['size'] 

# simple preprocessing step: encode the categorical (nominal and ordinal) features
# and pass the remaining (numerical) columns through unchanged
ColEncoding = make_column_transformer(
    (OneHotEncoder(), nominal_columns), # one-hot encoding for nominal features
    (OrdinalEncoder(categories=[['M', 'L', 'XL']]), ordinal_columns), # ordinal encoding for ordinal features
    remainder='passthrough')


features_encoded = ColEncoding.fit_transform(features)
features_encoded = pd.DataFrame(features_encoded)  # alternatively pass columns=ColEncoding.get_feature_names_out()
features_encoded.columns = ['color_blue', 'color_green', 'color_red', 'brand_brandA', 'brand_brandB', 'size', 'price']
features_encoded

Unnamed: 0,color_blue,color_green,color_red,brand_brandA,brand_brandB,size,price
0,0.0,1.0,0.0,1.0,0.0,0.0,10.1
1,0.0,0.0,1.0,0.0,1.0,2.0,13.5
2,1.0,0.0,0.0,1.0,0.0,1.0,15.3
