# Feature Transformation

## Encoding Categorical Data

### A technique for converting categorical variables into numerical values so that they can be fitted to a machine learning model.

There are two types of categorical variables:

Ordinal categorical variables --> Ordinal Encoding<br>
Nominal categorical variables --> One-Hot Encoding<br>
Note: to encode the labelled (output) feature, we use Label Encoding.

### Ordinal and Label Encoding

Ordinal Encoding is applied to ordinal categorical input features, and Label Encoding is applied to the categorical output (labelled) feature. Both map each category to an integer. The technique is called ordinal encoding when applied to input features and label encoding when applied to the output feature.

Example : HG<UG<PG after ordinal encoding --> 0<1<2
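The mapping in the example above can be sketched with scikit-learn's `OrdinalEncoder`. This is a minimal illustration on a made-up column (the `toy_education` frame and its values are assumptions for the demo, not rows from `customer.csv`).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column, made up for illustration (not taken from customer.csv)
toy_education = pd.DataFrame({"education": ["UG", "HG", "PG", "UG"]})

# Explicit order HG < UG < PG, so the codes become 0 < 1 < 2
toy_encoder = OrdinalEncoder(categories=[["HG", "UG", "PG"]])
toy_codes = toy_encoder.fit_transform(toy_education)

print(toy_codes.ravel())  # [1. 0. 2. 1.]
```

The encoder returns floats by default, which is why the codes print as `1.` rather than `1`.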

### Importing Dependencies:


In [1]:
import numpy as np
import pandas as pd

In [2]:
df=pd.read_csv('customer.csv')
df.head()

| | age | gender | review | education | purchased |
|---|---|---|---|---|---|
| 0 | 30 | Female | Average | School | No |
| 1 | 68 | Female | Poor | UG | No |
| 2 | 70 | Female | Good | PG | No |
| 3 | 72 | Female | Good | PG | No |
| 4 | 16 | Female | Average | UG | No |


### Features data type:

age-->numerical>continuous<br>
gender-->categorical>nominal(One-Hot encoding)<br>
review-->categorical>ordinal(ordinal encoding)<br>
education-->categorical>ordinal(ordinal encoding)<br>
purchased-->categorical>nominal(label encoding)<br>
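The notebook goes on to demonstrate ordinal and label encoding only. For the nominal `gender` feature, one-hot encoding could look roughly like the sketch below. The `toy_gender` frame is made up for illustration, and the `'Male'` value is an assumption (the rows shown above contain only `'Female'`).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy nominal column, made up for illustration
toy_gender = pd.DataFrame({"gender": ["Female", "Male", "Female"]})

# One output column per category; no order is implied between categories
toy_ohe = OneHotEncoder()
toy_one_hot = toy_ohe.fit_transform(toy_gender).toarray()  # sparse -> dense

print(toy_ohe.categories_)  # learned categories, in sorted order
print(toy_one_hot)          # rows: [1, 0], [0, 1], [1, 0]
```

Each row has exactly one `1`, marking the category that row belongs to.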

In [3]:
df1=df.iloc[:,2:]
#first 5 rows
df1.head()

| | review | education | purchased |
|---|---|---|---|
| 0 | Average | School | No |
| 1 | Poor | UG | No |
| 2 | Good | PG | No |
| 3 | Good | PG | No |
| 4 | Average | UG | No |


### Ordinal Encoding

In [4]:
#recommended: split into train and test sets before encoding, so the encoder is fitted on the training data only
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1.iloc[:,0:2], df1.iloc[:,-1], test_size=0.3, random_state=0)

In [5]:
#shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((35, 2), (15, 2), (35,), (15,))
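The shapes above follow from `test_size=0.3` applied to 50 rows: 15 rows go to the test set and the remaining 35 to the train set. A small sketch on a made-up 50-row frame (the `toy` frame and its column names are assumptions) reproduces the arithmetic:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 50 rows, only to illustrate the split sizes
toy = pd.DataFrame({"feature": range(50), "target": ["No", "Yes"] * 25})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy[["feature"]], toy["target"], test_size=0.3, random_state=0
)

print(toy_X_train.shape, toy_X_test.shape, toy_y_train.shape, toy_y_test.shape)
# (35, 1) (15, 1) (35,) (15,)
```

The split keeps the original index labels, which is why the rows of a split frame are no longer numbered 0, 1, 2, and so on.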

In [6]:
#first 5 rows of X_train before ordinal encoding
X_train.head()

| | review | education |
|---|---|---|
| 7 | Poor | School |
| 14 | Poor | PG |
| 45 | Poor | PG |
| 48 | Good | UG |
| 29 | Average | UG |


In [7]:
#first 5 rows of X_test before ordinal encoding
X_test.head()

| | review | education |
|---|---|---|
| 28 | Poor | School |
| 11 | Good | UG |
| 10 | Good | UG |
| 41 | Good | PG |
| 2 | Good | PG |


In [8]:
#OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
oe=OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])
#observation: passing categories is optional, but recommended for ordinal encoding,
#because it fixes the order of the categories (codes are assigned in ascending order: 0, 1, 2, ... and so on);
#without it, the encoder orders the categories by sorting them, which may not match the intended ranking
#here, 'Poor'=0, 'Average'=1, and 'Good'=2
#and 'School'=0, 'UG'=1, 'PG'=2

In [9]:
#categories
oe.categories

[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']]
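To see why listing the categories explicitly matters, the sketch below compares the default behaviour with an explicit order on a made-up `toy_reviews` column (an assumption for illustration). Without `categories`, the encoder sorts the values it finds, so `'Average'` gets the lowest code and `'Poor'` the highest.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column, made up for illustration
toy_reviews = pd.DataFrame({"review": ["Poor", "Average", "Good"]})

# Default: categories are inferred from the data and sorted alphabetically
auto_encoder = OrdinalEncoder().fit(toy_reviews)
auto_codes = auto_encoder.transform(toy_reviews).ravel()
print(auto_encoder.categories_)  # [array(['Average', 'Good', 'Poor'], dtype=object)]
print(auto_codes)                # [2. 0. 1.]

# Explicit: the order we pass defines the codes
explicit_encoder = OrdinalEncoder(categories=[["Poor", "Average", "Good"]]).fit(toy_reviews)
explicit_codes = explicit_encoder.transform(toy_reviews).ravel()
print(explicit_codes)            # [0. 1. 2.]
```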

In [10]:
# fit the encoder on the train set only; it learns the category-to-integer mapping
oe.fit(X_train)
# transform train and test sets
X_train=oe.transform(X_train)
X_test=oe.transform(X_test)
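Because the encoder is fitted on the train set only, a category that first appears in the test set raises an error under the default settings. One possible way to handle this, assuming scikit-learn 0.24 or later, is the `handle_unknown` and `unknown_value` parameters. The frames and the `'Excellent'` value below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data, made up for illustration; 'Excellent' never appears in the train set
toy_train = pd.DataFrame({"review": ["Poor", "Good", "Average"]})
toy_test = pd.DataFrame({"review": ["Good", "Excellent"]})

# Unknown categories are mapped to -1 instead of raising an error
tolerant_encoder = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
tolerant_encoder.fit(toy_train)
toy_test_codes = tolerant_encoder.transform(toy_test)

print(toy_test_codes.ravel())  # [ 2. -1.]
```

Whether `-1` is a sensible code for unseen values depends on the downstream model, so treat this as one option rather than a default.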

In [11]:
#array of X_train after ordinal encoding (compare with X_train before encoding)
X_train

array([[0., 0.],
       [0., 2.],
       [0., 2.],
       [2., 1.],
       [1., 1.],
       [0., 1.],
       [1., 1.],
       [1., 1.],
       [0., 1.],
       [2., 2.],
       [1., 0.],
       [0., 2.],
       [1., 1.],
       [1., 0.],
       [2., 0.],
       [1., 0.],
       [0., 1.],
       [2., 0.],
       [2., 1.],
       [0., 1.],
       [0., 0.],
       [1., 2.],
       [1., 2.],
       [2., 0.],
       [2., 0.],
       [2., 1.],
       [1., 2.],
       [0., 2.],
       [2., 1.],
       [0., 2.],
       [0., 2.],
       [2., 2.],
       [1., 0.],
       [2., 2.],
       [1., 1.]])

In [12]:
#array of X_test after ordinal encoding (compare with X_test before encoding)
X_test

array([[0., 0.],
       [2., 1.],
       [2., 1.],
       [2., 2.],
       [2., 2.],
       [0., 2.],
       [2., 0.],
       [0., 0.],
       [0., 2.],
       [1., 1.],
       [2., 2.],
       [0., 0.],
       [0., 2.],
       [1., 0.],
       [2., 0.]])
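Instead of comparing the encoded arrays with the original rows by eye, the codes can be mapped back with `inverse_transform`. The sketch below uses a made-up two-row frame (`toy_X` is an assumption) with the same category orders as above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows, made up for illustration
toy_X = pd.DataFrame({"review": ["Poor", "Good"], "education": ["PG", "School"]})

check_encoder = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]]
)
toy_X_codes = check_encoder.fit_transform(toy_X)
print(toy_X_codes)    # rows: [0., 2.] and [2., 0.]

# Map the integer codes back to the original category names
toy_X_decoded = check_encoder.inverse_transform(toy_X_codes)
print(toy_X_decoded)  # rows: ['Poor', 'PG'] and ['Good', 'School']
```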

### Label Encoding

In [13]:
#y_train before label encoding (a pandas Series)
y_train

7     Yes
14    Yes
45    Yes
48    Yes
29    Yes
15     No
30     No
32    Yes
16    Yes
42    Yes
20    Yes
43     No
8      No
13     No
25     No
5     Yes
17    Yes
40     No
49     No
1      No
12     No
37    Yes
24    Yes
6      No
23     No
36    Yes
21     No
19    Yes
9     Yes
39     No
46     No
3      No
0      No
47    Yes
44     No
Name: purchased, dtype: object

In [14]:
#y_test before label encoding (a pandas Series)
y_test

28     No
11    Yes
10    Yes
41    Yes
2      No
27     No
38     No
31    Yes
22    Yes
4      No
33    Yes
35    Yes
26     No
34     No
18     No
Name: purchased, dtype: object

In [15]:
#LabelEncoder
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

In [16]:
# fit the encoder on the train set only; it learns the class-to-integer mapping
le.fit(y_train)
# transform train and test sets
y_train=le.transform(y_train)
y_test=le.transform(y_test)

In [17]:
#classes
le.classes_

array(['No', 'Yes'], dtype=object)
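The position of each class in `classes_` is its integer code, so `'No'` maps to 0 and `'Yes'` to 1. A minimal sketch on a made-up label list (`toy_labels` is an assumption) shows the round trip. Note that `LabelEncoder` takes a 1D array of labels, while `OrdinalEncoder` expects a 2D array of features.

```python
from sklearn.preprocessing import LabelEncoder

# Toy 1D label list, made up for illustration
toy_labels = ["No", "Yes", "Yes", "No"]

toy_label_encoder = LabelEncoder()
toy_label_codes = toy_label_encoder.fit_transform(toy_labels)

print(toy_label_encoder.classes_)  # ['No' 'Yes'] (sorted; index = code)
print(toy_label_codes)             # [0 1 1 0]

# Recover the original labels from the codes
toy_labels_decoded = toy_label_encoder.inverse_transform(toy_label_codes)
print(toy_labels_decoded)          # ['No' 'Yes' 'Yes' 'No']
```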

In [18]:
#y_train after label encoding (compare with y_train before encoding)
print(y_train)

[1 1 1 1 1 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 1 0]


In [19]:
#array of y_test after label encoding (compare with y_test before encoding)
y_test

array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0])