# Variable conversion

In this activity you will learn to convert variables from one type to another: numeric to categorical, and categorical to numeric.

## Numeric to categorical

Consider the wine dataset we used earlier:

In [7]:
import sklearn.datasets as datasets
import pandas as pd
import numpy as np

dataset = datasets.load_wine()
X = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])

print(X.head(5))

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  
0                          3.92   1065.0  
1                          3.40   1050.0  
2                          3.17   1185.0  
3                          3.45   1480.0  
4                          2.93    735.0  

In [4]:
X.shape

(178, 13)

Let's first split the variable 'flavanoids' into 5 equal-width bins using the pandas function `pd.cut`:

In [2]:
flavanoids = pd.cut(X['flavanoids'], 5)
print(flavanoids.value_counts())

(2.236, 3.184]    64
(0.335, 1.288]    51
(1.288, 2.236]    43
(3.184, 4.132]    19
(4.132, 5.08]      1
Name: flavanoids, dtype: int64
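
The interval notation above is precise but not very readable. As a minimal sketch, `pd.cut` also accepts a `labels` argument to name the bins, or `labels=False` to return the bin positions as integers. The data and the bin names below are hypothetical and used only for illustration; they are not part of the wine dataset.

```python
import pandas as pd

# Hypothetical values, used only for illustration (not the wine data)
values = pd.Series([0.5, 1.5, 2.5, 3.5, 4.5])

# Hypothetical bin names: one label per bin, ordered from lowest to highest
names = ["very low", "low", "medium", "high", "very high"]
named_bins = pd.cut(values, 5, labels=names)

# labels=False returns the integer position of each bin instead of a name
bin_codes = pd.cut(values, 5, labels=False)

print(named_bins.tolist())
print(bin_codes.tolist())
```

Named bins keep their order, so the resulting categorical variable is still ordinal.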


In [6]:
X.head(10)

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
| 5 | 14.2 | 1.76 | 2.45 | 15.2 | 112.0 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450.0 |
| 6 | 14.39 | 1.87 | 2.45 | 14.6 | 96.0 | 2.5 | 2.52 | 0.3 | 1.98 | 5.25 | 1.02 | 3.58 | 1290.0 |
| 7 | 14.06 | 2.15 | 2.61 | 17.6 | 121.0 | 2.6 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295.0 |
| 8 | 14.83 | 1.64 | 2.17 | 14.0 | 97.0 | 2.8 | 2.98 | 0.29 | 1.98 | 5.2 | 1.08 | 2.85 | 1045.0 |
| 9 | 13.86 | 1.35 | 2.27 | 16.0 | 98.0 | 2.98 | 3.15 | 0.22 | 1.85 | 7.22 | 1.01 | 3.55 | 1045.0 |


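Notice that the bins produced by `pd.cut` all have equal width, but the observations are spread unevenly across them: one bin holds 64 wines and another only 1.
We can use `pd.qcut` instead to obtain quantile-based bins that each contain roughly the same number of observations: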

In [9]:
flavanoids = pd.qcut(X['flavanoids'], 5)
print(flavanoids.value_counts())

(2.46, 2.98]      36
(1.738, 2.46]     36
(0.339, 0.872]    36
(2.98, 5.08]      35
(0.872, 1.738]    35
Name: flavanoids, dtype: int64
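
To make the difference between the two functions concrete, the sketch below compares them on a small, deliberately skewed sample. The data are hypothetical and chosen only for illustration; `retbins=True` asks pandas to return the bin edges alongside the binned values.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample, used only for illustration (not the wine data)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="values")

# pd.cut: bins of equal width; retbins=True also returns the bin edges
width_bins, width_edges = pd.cut(values, 5, retbins=True)

# pd.qcut: bin edges are placed at quantiles, so counts are (roughly) equal
freq_bins, freq_edges = pd.qcut(values, 5, retbins=True)

# sort=False keeps the bins in their natural order instead of sorting by count
width_counts = width_bins.value_counts(sort=False)
freq_counts = freq_bins.value_counts(sort=False)

print(width_edges)
print(width_counts)
print(freq_edges)
print(freq_counts)
```

With a single large outlier, the equal-width bins put almost every observation in the first bin, while the quantile-based bins stay balanced at the cost of very uneven bin widths.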


## Categorical to numeric

Let's create a colour variable by drawing 100 random colours with unequal probabilities:

In [11]:
colours = ['blue', 'red', 'green', 'yellow']
colour_array = np.random.choice(colours, 100, p=[0.5, 0.1, 0.1, 0.3])
print(colour_array)

['blue' 'red' 'blue' 'blue' 'blue' 'green' 'blue' 'blue' 'yellow' 'yellow'
 'green' 'blue' 'green' 'blue' 'red' 'green' 'blue' 'red' 'green' 'blue'
 'blue' 'yellow' 'blue' 'red' 'yellow' 'yellow' 'blue' 'blue' 'green'
 'blue' 'red' 'blue' 'blue' 'yellow' 'blue' 'blue' 'yellow' 'blue'
 'yellow' 'blue' 'blue' 'blue' 'yellow' 'blue' 'blue' 'red' 'blue' 'green'
 'blue' 'yellow' 'yellow' 'yellow' 'yellow' 'yellow' 'blue' 'yellow'
 'blue' 'red' 'yellow' 'yellow' 'blue' 'blue' 'blue' 'yellow' 'green'
 'blue' 'blue' 'red' 'yellow' 'blue' 'yellow' 'yellow' 'blue' 'yellow'
 'red' 'yellow' 'blue' 'blue' 'yellow' 'blue' 'blue' 'yellow' 'blue' 'red'
 'green' 'blue' 'blue' 'blue' 'blue' 'blue' 'green' 'yellow' 'yellow'
 'blue' 'blue' 'blue' 'red' 'green' 'blue' 'blue']


We can obtain dummy (one-hot) variables with `pd.get_dummies`:

In [18]:
dummy_colours = pd.get_dummies(colour_array, prefix='color', drop_first=True)
dummy_colours.head(10)

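| | color_green | color_red | color_yellow |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 |
| 8 | 0 | 0 | 1 |
| 9 | 0 | 0 | 1 |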


Notice that blue is not included. Because of the `drop_first` parameter, the first category (blue, in alphabetical order) is dropped and serves as the reference level: a row with zeros in all three columns represents blue.
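
The effect of `drop_first` is easiest to see side by side. The following is a minimal sketch on a small fixed sample (hypothetical, so the result is reproducible). Passing `dtype=int` is an assumption made here to get 0/1 columns, since newer pandas versions return boolean columns by default.

```python
import numpy as np
import pandas as pd

# Small fixed sample (hypothetical) so the result is reproducible
sample = np.array(['blue', 'red', 'green', 'yellow', 'blue'])

# Without drop_first: one column per category
full = pd.get_dummies(sample, prefix='color', dtype=int)

# With drop_first: the first category in sorted order ('blue') is dropped
reduced = pd.get_dummies(sample, prefix='color', drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))

# Rows that are all zero in the reduced frame correspond to 'blue'
is_blue = reduced.sum(axis=1) == 0
print(is_blue.tolist())
```

Dropping one column loses no information, because the dropped category can always be recovered from the remaining columns.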

We can also use scikit-learn:

In [19]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# We can use a label encoder to transform categories into numbers
enc = LabelEncoder()
colour_label = enc.fit_transform(colour_array)
print(colour_label)

[0 2 0 0 0 1 0 0 3 3 1 0 1 0 2 1 0 2 1 0 0 3 0 2 3 3 0 0 1 0 2 0 0 3 0 0 3
 0 3 0 0 0 3 0 0 2 0 1 0 3 3 3 3 3 0 3 0 2 3 3 0 0 0 3 1 0 0 2 3 0 3 3 0 3
 2 3 0 0 3 0 0 3 0 2 1 0 0 0 0 0 1 3 3 0 0 0 2 1 0 0]


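You will notice that every colour now has its own integer value. `LabelEncoder` assigns the integers in sorted order of the categories, so here blue is 0, green is 1, red is 2 and yellow is 3.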
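
As a sketch of how to inspect and reverse this mapping, the example below uses the encoder's `classes_` attribute and `inverse_transform` method on a small fixed sample (hypothetical data). It also shows one possible use of the imported `OneHotEncoder`, which turns the same categories into one column each; this is an illustration, not necessarily how the activity continues.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Small fixed sample (hypothetical) so the result is reproducible
sample = np.array(['blue', 'red', 'green', 'yellow', 'blue'])

enc = LabelEncoder()
labels = enc.fit_transform(sample)

# classes_ lists the categories in the order of their integer codes
print(enc.classes_)
print(labels)

# inverse_transform maps the integers back to the original categories
restored = enc.inverse_transform(labels)
print(restored)

# OneHotEncoder expects a 2-D input, hence the reshape; its default output
# is a sparse matrix, so it is converted to a dense array for display
onehot = OneHotEncoder().fit_transform(sample.reshape(-1, 1)).toarray()
print(onehot)
```

The integer codes imply an ordering (blue < green < red < yellow) that the colours do not actually have, which is one reason to prefer a one-hot representation for unordered categories.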