# Categorical Data

#### Data Preprocessing

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv("Data.csv")

In [2]:
dataset

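| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |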


In [3]:
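# Features: every column except the last; target: the Purchased column
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
from sklearn.impute import SimpleImputer
# Replace missing Age and Salary values (columns 1 and 2) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])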

### Encoding categorical data for Country Column

Initially, the Country column contains country names written as text.
There are 2 categorical variables (Country and Purchased).

In [4]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [5]:
from sklearn.preprocessing import LabelEncoder

In [6]:
labelencoder_X = LabelEncoder()
labelencoder_X.fit_transform(X[:,0])

array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0])

The countries are no longer written as text; each one is now represented by an integer code.
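To see which integer stands for which country, one option is to inspect the encoder's `classes_` attribute, whose positions correspond to the integer codes. The snippet below is a minimal sketch that rebuilds the Country column by hand; the names `countries`, `encoder` and `mapping` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Country column copied from the dataset shown above
countries = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany',
                      'France', 'Spain', 'France', 'Germany', 'France'],
                     dtype=object)

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the labels in sorted order; the index of each label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'France': 0, 'Germany': 1, 'Spain': 2}
```

Because the labels are sorted alphabetically, the codes carry no meaning beyond identifying the country.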

In [7]:
X[:,0] = labelencoder_X.fit_transform(X[:,0])

In [8]:
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

The text is converted to numbers because machine learning models are built on mathematical equations, and only numbers can enter those equations. This comes with a drawback: a model may treat the codes as ordered and conclude that Spain (2) > Germany (1) > France (0), even though there is no such order among the 3 countries.

So we have to prevent the model from assuming an order between the countries.

To do this, we use dummy variables.
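As a quick illustration of what dummy variables look like, the sketch below uses `pandas.get_dummies` on three sample countries. This is a different tool from the one used in the rest of this notebook and is shown only to make the idea concrete; the variable names are illustrative.

```python
import pandas as pd

# Three sample rows, one per country
countries = pd.Series(['France', 'Spain', 'Germany'], name='Country')

# One 0/1 column per country; each row has exactly one 1
dummies = pd.get_dummies(countries, dtype=int)
print(dummies)
# Columns are France, Germany, Spain (alphabetical order):
# France  -> 1, 0, 0
# Spain   -> 0, 0, 1
# Germany -> 0, 1, 0
```

No column is numerically larger than another, so the encoding does not suggest any order between the countries.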

#### Creating dummy variables

In [9]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [10]:
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# np.str was removed in NumPy 1.24; the built-in str gives the same string dtype
X = np.array(columnTransformer.fit_transform(X), dtype=str)

In [11]:
X

array([['1.0', '0.0', '0.0', '44.0', '72000.0'],
       ['0.0', '0.0', '1.0', '27.0', '48000.0'],
       ['0.0', '1.0', '0.0', '30.0', '54000.0'],
       ['0.0', '0.0', '1.0', '38.0', '61000.0'],
       ['0.0', '1.0', '0.0', '40.0', '63777.77777777778'],
       ['1.0', '0.0', '0.0', '35.0', '58000.0'],
       ['0.0', '0.0', '1.0', '38.77777777777778', '52000.0'],
       ['1.0', '0.0', '0.0', '48.0', '79000.0'],
       ['0.0', '1.0', '0.0', '50.0', '83000.0'],
       ['1.0', '0.0', '0.0', '37.0', '67000.0']], dtype='<U17')

The single Country column is now replaced by 3 columns. The 1st, 2nd and 3rd columns represent France, Germany and Spain respectively.
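The column order can be checked rather than assumed by reading the fitted encoder's `categories_` attribute. The sketch below runs the same kind of `ColumnTransformer` on a three-row sample; it is fed the raw country names instead of the label-encoded integers so that the categories are readable, and `X_demo`, `ct` and `encoded` are illustrative names.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small sample with the same layout as X: Country, Age, Salary
X_demo = np.array([['France', 44.0, 72000.0],
                   ['Spain', 27.0, 48000.0],
                   ['Germany', 30.0, 54000.0]], dtype=object)

ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
encoded = np.array(ct.fit_transform(X_demo), dtype=float)

# The one-hot columns follow the order of categories_
categories = ct.named_transformers_['encoder'].categories_[0]
print(list(categories))  # ['France', 'Germany', 'Spain']
print(encoded[:, :3])    # one-hot block: a single 1.0 per row
```

The remaining columns (Age and Salary) are passed through unchanged and appear after the one-hot columns.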

### Encoding categorical data for Purchased Column

In [12]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

In [13]:
labelencoder_y = LabelEncoder()
labelencoder_y.fit_transform(y)

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

In [14]:
y = labelencoder_y.fit_transform(y)

In [15]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

The Purchased column is now encoded: 0 = No and 1 = Yes.
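If the original labels are needed again later, for example to report predictions as Yes/No, `LabelEncoder.inverse_transform` maps the integers back. A minimal sketch, with the Purchased values copied from above and illustrative variable names:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

purchased = np.array(['No', 'Yes', 'No', 'No', 'Yes',
                      'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)

encoder = LabelEncoder()
encoded = encoder.fit_transform(purchased)

# classes_ shows the mapping: index 0 is 'No', index 1 is 'Yes'
print(list(encoder.classes_))  # ['No', 'Yes']

# inverse_transform recovers the original text labels
decoded = encoder.inverse_transform(encoded)
print(decoded[:3])
```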