Many practical datasets contain categorical variables. These features are typically stored as text values which represent various traits of the observations. Ex: geographic designations or color.


The challenge is to turn these text attributes into numerical values for further processing.

Reasons for the challenge:
1. Many machine learning models are algebraic. This means that their input must be numerical.
2. Categorical features may have a very large number of levels (high cardinality), where most of the levels appear in a relatively small number of instances.
3. For the machine, categorical data doesn’t carry the context or information that humans can easily associate and understand. For the model, blue, red and green are just three different levels (possible values) of the same feature Color. If you don’t specify additional contextual information, the model cannot tell how similar or different two levels are.
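Before choosing an encoding, it can help to check how many levels a categorical feature has and how often each one occurs. The following is a minimal sketch using pandas; the `colors` column and its values are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
colors = pd.Series(["blue", "red", "green", "blue", "red", "blue"], name="Color")

# Number of distinct levels (the cardinality of the feature).
n_levels = colors.nunique()

# Number of rows that fall into each level.
level_counts = colors.value_counts()

print(n_levels)
print(level_counts)
```

A feature with many levels that each appear only a few times is the high cardinality case described in point 2.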



Machine learning algorithms require input and output variables to be numbers. This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model.

This applies to the output as well as the input. If the output, our dependent variable, is in text format, we need to encode it to numbers too.

There are three main types of encoding:
1. Integer Encoding
2. One Hot Encoding (example provided below)
3. Learned Embedding
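To make the difference between the first two types concrete, here is a small sketch that uses pandas only. The `countries` series is a made-up example. A learned embedding is left out because it has to be trained as part of a neural network.

```python
import pandas as pd

# Made-up example column.
countries = pd.Series(["Mexico", "India", "Srilanka", "India"], name="Country")

# Integer encoding: each category is mapped to one integer code.
# pandas sorts the categories alphabetically: India=0, Mexico=1, Srilanka=2.
integer_codes = countries.astype("category").cat.codes

# One hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(countries, dtype=int)

print(integer_codes.tolist())
print(one_hot)
```

Integer encoding keeps a single column but implies an order between the categories, while one hot encoding adds columns but implies no order.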

Let us look at one hot encoding in detail.
First we import the libraries and the dataset.
We then replace missing values, if any.
Then we encode the categorical values.

In the example we are going to use, countries will be the categories we will be working with.

In [1]:
# import the numpy library under the shortcut name 'np'
import numpy as np

In [2]:
# pyplot is a module in matplotlib
import matplotlib.pyplot as plt

In [3]:
import pandas as pd

In [4]:
# pass the data from file into a variable 'dataset'
dataset = pd.read_csv('Data_encode_category.csv')

In [5]:
dataset

    Country   Age   Salary  Purchased
0    Mexico  44.0  72000.0         No
1     India  27.0  48000.0        Yes
2  Srilanka  30.0  54000.0         No
3     India  38.0  61000.0         No
4  Srilanka  40.0      NaN        Yes
5    Mexico  35.0  58000.0        Yes
6     India   NaN  52000.0         No
7    Mexico  48.0  79000.0        Yes
8  Srilanka  50.0  83000.0         No
9    Mexico  37.0  67000.0        Yes


Missing data is represented as NaN in pandas. We can see that two rows have missing values: row 4 has no salary and row 6 has no age.
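On a larger dataset the missing entries are hard to spot by eye, so pandas can count them instead. The sketch below uses a three-row stand-in frame named `sample`, which is an assumption for illustration; the same calls can be applied to `dataset`.

```python
import numpy as np
import pandas as pd

# Small stand-in for the dataset (three rows only).
sample = pd.DataFrame({
    "Country": ["Srilanka", "Mexico", "India"],
    "Age": [40.0, 35.0, np.nan],
    "Salary": [np.nan, 58000.0, 52000.0],
})

# Count of missing values in each column.
missing_per_column = sample.isna().sum()

# Index labels of the rows that contain at least one missing value.
rows_with_missing = sample.index[sample.isna().any(axis=1)].tolist()

print(missing_per_column)
print(rows_with_missing)
```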

In [6]:
X = dataset.iloc[:, 0:3]
y = dataset.iloc[:, -1:]

In [7]:
print(X)


    Country   Age   Salary
0    Mexico  44.0  72000.0
1     India  27.0  48000.0
2  Srilanka  30.0  54000.0
3     India  38.0  61000.0
4  Srilanka  40.0      NaN
5    Mexico  35.0  58000.0
6     India   NaN  52000.0
7    Mexico  48.0  79000.0
8  Srilanka  50.0  83000.0
9    Mexico  37.0  67000.0


In [8]:
print(y)

  Purchased
0        No
1       Yes
2        No
3        No
4       Yes
5       Yes
6        No
7       Yes
8        No
9       Yes


In [9]:
from sklearn.impute import SimpleImputer

SimpleImputer is an Imputation transformer for completing missing values.

In [10]:
# Create an object of the class SimpleImputer that replaces missing values with the mean of their column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

The SimpleImputer class takes the following parameters:
missing_values: The placeholder for the missing values.
strategy: Imputation strategy. Ex:
    mean (default): replace missing values using the mean of the column
    most_frequent: replace missing values with the most frequent value of the column
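The effect of the two strategies can be compared on a tiny column. This is a sketch on made-up numbers (the `ages` array is not taken from the dataset).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single column with one missing entry in the last row.
ages = np.array([[30.0], [30.0], [50.0], [np.nan]])

# 'mean' fills the gap with the average of the observed values.
mean_filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(ages)

# 'most_frequent' fills the gap with the value that occurs most often.
mode_filled = SimpleImputer(missing_values=np.nan, strategy="most_frequent").fit_transform(ages)

print(mean_filled[3, 0])
print(mode_filled[3, 0])
```

Here the mean strategy fills in 110/3 (about 36.67), while most_frequent fills in 30.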

In [11]:
imputer.fit(X.iloc[:,1:])

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

The fit method computes the mean of each column, ignoring the missing values. We pass only the numeric columns (Age and Salary), since a mean cannot be computed for the text column Country.
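After fitting, the values the imputer has learned can be inspected through its statistics_ attribute. A minimal sketch on a made-up two-column array (first column standing in for Age, second for Salary):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric data with one gap in each column.
values = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(values)

# One learned fill value per column, computed from the observed entries only.
learned_means = imp.statistics_
print(learned_means)
```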

In [12]:
xy = imputer.transform(X.iloc[:,1:])

In [13]:
X.iloc[:,1:] = xy

The transform method actually replaces the missing values.
The statement above processes all the rows of columns 1 and 2 of X, i.e. Age and Salary, and fills each missing entry with the mean computed by fit.

transform returns a new array, so the result has to be written back into our dataset X. Hence xy is assigned to X.iloc[:, 1:].
Alternatively, we can also write both steps as one statement:
X.iloc[:, 1:] = imputer.transform(X.iloc[:, 1:])
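The fit and transform steps can also be combined with fit_transform, and the columns can be selected by name rather than by position. The sketch below assumes a small stand-in frame called `frame`; it is not the notebook's X.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in frame with one gap in each numeric column.
frame = pd.DataFrame({
    "Country": ["Mexico", "India", "Srilanka"],
    "Age": [44.0, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan],
})

numeric_cols = ["Age", "Salary"]

# Learn the column means and fill the gaps in a single call,
# then write the result back into the same columns.
frame[numeric_cols] = SimpleImputer(strategy="mean").fit_transform(frame[numeric_cols])

print(frame)
```

Selecting by name avoids errors when the column order changes.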

In [14]:
print(X)

    Country        Age        Salary
0    Mexico  44.000000  72000.000000
1     India  27.000000  48000.000000
2  Srilanka  30.000000  54000.000000
3     India  38.000000  61000.000000
4  Srilanka  40.000000  63777.777778
5    Mexico  35.000000  58000.000000
6     India  38.777778  52000.000000
7    Mexico  48.000000  79000.000000
8  Srilanka  50.000000  83000.000000
9    Mexico  37.000000  67000.000000


Now, import the necessary classes to encode the categorical data.

We need to turn the categories into numbers. The single Country column will be transformed into 3 numerical columns, one for each category. The number "1" is assigned when the row belongs to that column's country and "0" is assigned otherwise.
Ex: For a row with the country Mexico, the value will be "1" in the column for Mexico and "0" in the other two columns.

This encoding implies no order or relationship between the three countries.
Each country is represented by a binary vector such as 0,0,1.
The new columns will only contain 0 and 1.

To achieve this, we will use 2 classes, ColumnTransformer and OneHotEncoder.

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

We combine both classes for the encoding.
We create the ColumnTransformer first. Its transformers argument is a list of tuples, and each tuple has 3 items:
1. Name- a label of our choice for the transformation (here 'encoder').
2. Encoder- an instance of the encoder class we are using (here OneHotEncoder()).
3. Index- the index of the column which needs to be transformed (here [0], the Country column).

Then we pass remainder='passthrough', so that all the remaining columns, which are not transformed, are kept unchanged.

In [15]:
coltrans = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

In [16]:
X = np.array(coltrans.fit_transform(X))

We fit coltrans on X and transform X in one step, then convert the result to a NumPy array.

In [49]:
print(X)

[[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.40000000e+01
  7.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 2.70000000e+01
  4.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.00000000e+01
  5.40000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.80000000e+01
  6.10000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 4.00000000e+01
  6.37777778e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.50000000e+01
  5.80000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.87777778e+01
  5.20000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.80000000e+01
  7.90000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 5.00000000e+01
  8.30000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.70000000e+01
  6.70000000e+04]]


Here, we see that the first column, Country, is replaced by three binary columns, which stand for the countries India, Mexico and Srilanka in that order.
The first row starts with 0 1 0, which states that the row is categorised as Mexico.
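The order of the one hot columns can be read from the fitted encoder rather than inferred from the output. The sketch below rebuilds a tiny transformer on a made-up frame named `demo`. The `sparse_threshold=0` argument is an addition for this small demo to force a dense array and is not used in the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up frame with one categorical and one numeric column.
demo = pd.DataFrame({
    "Country": ["Mexico", "India", "Srilanka"],
    "Age": [44.0, 27.0, 30.0],
})

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array (demo-only choice)
)
encoded = ct.fit_transform(demo)

# Categories found in column 0, in the order of the output columns.
category_order = ct.named_transformers_["encoder"].categories_[0].tolist()

print(category_order)
print(encoded)
```

The categories are sorted alphabetically, which is why India comes first even though Mexico is the first row.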

In [18]:
laben = LabelEncoder()

Not only the input, but also the output should be encoded, from Yes, No to 1, 0.
To do this we use the LabelEncoder class imported earlier.
We instantiate the label encoder, then fit and transform it on our dependent variable vector.

In [19]:
# y is a one-column DataFrame; flatten it to a 1-D array, which is what LabelEncoder expects
y = laben.fit_transform(y.values.ravel())


In [20]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
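To check which integer stands for which label, the fitted encoder exposes classes_, and inverse_transform maps the codes back to text. A short sketch on a made-up `labels` array:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up target values.
labels = np.array(["No", "Yes", "No", "Yes"])

le = LabelEncoder()
codes = le.fit_transform(labels)

# The position of each class in classes_ is its integer code.
class_order = le.classes_.tolist()

# Map the integer codes back to the original text labels.
decoded = le.inverse_transform(codes).tolist()

print(class_order)
print(codes)
print(decoded)
```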


The printed array is the encoded output, with 0 for No and 1 for Yes.
We have now seen how to encode categorical data. We will see more about it in future.