# Variable conversion

In this activity you will learn to convert variables from one type to another: numeric to categorical, and categorical to numeric.

## Numeric to categorical

Consider the wine dataset we used earlier:

In [2]:
import sklearn.datasets as datasets
import pandas as pd
import numpy as np

dataset = datasets.load_wine()
X = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])

print(X.head())

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  
0                  

Let's first split the variable 'flavanoids' into 5 equal-width bins using pandas:

In [3]:
flavanoids = pd.cut(X['flavanoids'], 5)
print(flavanoids.value_counts())

(2.236, 3.184]    64
(0.335, 1.288]    51
(1.288, 2.236]    43
(3.184, 4.132]    19
(4.132, 5.08]      1
Name: flavanoids, dtype: int64


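Notice that the bins all have the same width, but the number of observations per bin is very uneven.
We can use a different function, which places the bin edges at quantiles of the data, to obtain bins that each hold roughly the same number of observations: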

In [4]:
flavanoids = pd.qcut(X['flavanoids'], 5)
print(flavanoids.value_counts())

(0.339, 0.872]    36
(1.738, 2.46]     36
(2.46, 2.98]      36
(0.872, 1.738]    35
(2.98, 5.08]      35
Name: flavanoids, dtype: int64
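
To make the difference between the two functions easier to see, here is a minimal sketch on a small, made-up sample (the `values` series is hypothetical and is not part of the wine dataset). It uses `retbins=True` to return the bin edges and `labels=False` to get integer bin codes instead of intervals:

```python
import pandas as pd

# Hypothetical right-skewed sample, used only for illustration.
values = pd.Series([0.5, 0.7, 0.9, 1.0, 1.2, 1.5, 2.0, 3.0, 4.5, 9.0])

# Equal-width bins: the value range is split into 3 intervals of equal width.
width_bins, width_edges = pd.cut(values, 3, retbins=True)

# Equal-frequency bins: the edges are placed at quantiles of the data.
freq_bins, freq_edges = pd.qcut(values, 5, retbins=True)

# Count observations per bin, listed in bin order.
width_counts = width_bins.value_counts().sort_index()
freq_counts = freq_bins.value_counts().sort_index()

print(width_edges)
print(width_counts)
print(freq_edges)
print(freq_counts)

# labels=False returns the integer position of each bin instead of an interval.
codes = pd.qcut(values, 5, labels=False)
print(codes.tolist())
```

On this sample the equal-width bins put most observations into the first interval, while each quantile bin holds two values. The integer codes are a convenient starting point if the binned variable is to be used as an ordinal feature.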


## Categorical to numeric

Let's create a colour variable:

In [14]:
colours = ['red', 'blue' , 'green', 'yellow']
colour_array = np.random.choice(colours, 100, p=[0.5, 0.1, 0.1, 0.3])
print(colour_array)

['yellow' 'red' 'red' 'green' 'blue' 'red' 'yellow' 'blue' 'red' 'yellow'
 'red' 'yellow' 'red' 'blue' 'red' 'red' 'red' 'red' 'red' 'red' 'green'
 'red' 'green' 'yellow' 'yellow' 'red' 'red' 'red' 'yellow' 'blue' 'red'
 'green' 'yellow' 'red' 'yellow' 'yellow' 'yellow' 'green' 'yellow'
 'yellow' 'red' 'red' 'red' 'yellow' 'blue' 'green' 'red' 'green' 'red'
 'green' 'green' 'red' 'yellow' 'red' 'yellow' 'red' 'red' 'red' 'red'
 'red' 'red' 'red' 'blue' 'red' 'green' 'green' 'yellow' 'yellow' 'red'
 'red' 'yellow' 'red' 'yellow' 'red' 'red' 'yellow' 'red' 'red' 'red'
 'yellow' 'yellow' 'red' 'blue' 'yellow' 'red' 'green' 'red' 'red' 'red'
 'red' 'yellow' 'red' 'blue' 'yellow' 'red' 'red' 'red' 'red' 'blue'
 'yellow']


We can obtain dummy (one-hot) variables, one indicator column per category, with the following code:

In [18]:
dummy_colours = pd.get_dummies(colour_array, prefix='color', drop_first=True)
dummy_colours.head()

   color_green  color_red  color_yellow
0            0          0             1
1            0          1             0
2            0          1             0
3            1          0             0
4            0          0             0


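Notice that blue is not included. Because of the ```drop_first``` parameter, pandas drops the first category (blue, which comes first in sorted order) and uses it as the reference level: a row with zeros in all three columns, such as row 4, represents blue.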
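
The effect of ```drop_first``` is easiest to see by comparing the two encodings side by side. The following is a small sketch on a made-up sample (the `sample` series is hypothetical, chosen so that each colour appears once):

```python
import pandas as pd

# Hypothetical sample in which every colour appears exactly once.
sample = pd.Series(['yellow', 'red', 'green', 'blue'])

# Full encoding: one indicator column per category.
full = pd.get_dummies(sample, prefix='color')

# Reduced encoding: the first category in sorted order is dropped.
reduced = pd.get_dummies(sample, prefix='color', drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())

# Rows that are zero in every remaining column belong to the dropped category.
is_baseline = reduced.sum(axis=1) == 0
print(sample[is_baseline].tolist())
```

Dropping one column loses no information, since the dropped category can always be recovered from the all-zero rows. It also avoids perfectly collinear columns, which matters for models such as linear regression with an intercept.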

We can also use scikit-learn:

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# We can use a label encoder to transform categories into numbers
enc = LabelEncoder()
colour_label = enc.fit_transform(colour_array)
print(colour_label)

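You will notice that every colour now has its own integer value. The encoder assigns the integers in sorted order of the category names, so blue is 0, green is 1, red is 2 and yellow is 3. Keep in mind that these integers suggest an ordering that the colours do not actually have, which can mislead models that treat the values as numeric.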
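
The cell above imports `OneHotEncoder` without using it. As a hedged sketch of how the two encoders relate, the example below applies both to a small, made-up array (`sample` is hypothetical). It calls `.toarray()` on the encoder output because `OneHotEncoder` returns a sparse matrix by default:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample, used only for illustration.
sample = np.array(['yellow', 'red', 'green', 'blue', 'red'])

# Label encoding: one integer per category, assigned in sorted order.
label_enc = LabelEncoder()
labels = label_enc.fit_transform(sample)
print(label_enc.classes_)
print(labels)

# The mapping is reversible.
restored = label_enc.inverse_transform(labels)
print(restored)

# One-hot encoding: expects a 2-D input, hence the reshape to one column.
onehot_enc = OneHotEncoder()
onehot = onehot_enc.fit_transform(sample.reshape(-1, 1)).toarray()
print(onehot_enc.categories_[0])
print(onehot)
```

Each row of `onehot` contains a single 1, and its column position matches the integer that `LabelEncoder` assigned to the same colour. Unlike ```pd.get_dummies``` with ```drop_first=True```, this default configuration keeps a column for every category.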