# **Feature Engineering Exercise Solution**

Data scientists often describe data processing and feature engineering as the toughest step in building
any Machine Learning system. Because it requires both domain knowledge and mathematical transformations,
feature engineering is often said to be as much an art as a science. The obvious complexities involve dealing
with diverse types of data and variables. Besides this, each Machine Learning problem or task needs
specific features, and there is no one-size-fits-all solution. This makes feature
engineering all the more difficult and complex.

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

# Feature Engineering on Numeric Data



Even though numeric data can be fed directly into Machine Learning models, you still need to
engineer features that are relevant to the scenario, problem, and domain before building a model. Hence
the need for feature engineering remains. Important aspects of numeric features include feature scale and
distribution. In some scenarios we need to apply specific transformations to change the scale of numeric
values, and in others we need to change the overall distribution of the numeric values, for example by
transforming a skewed distribution into a more nearly normal one.
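
As a minimal sketch of the second kind of transformation (changing the shape of a distribution), the block below applies a natural log to ten `pedigree` values taken from the diabetes dataset used later in this notebook. The helper `sample_skewness` is a hypothetical name introduced only for this illustration, and a log transform is just one of several possible choices.

```python
import numpy as np

# Ten 'pedigree' values copied from the diabetes dataset used later on.
pedigree = np.array([0.627, 0.351, 0.672, 0.167, 2.288,
                     0.201, 0.248, 0.134, 0.158, 0.232])

def sample_skewness(x):
    """Third standardized moment (hypothetical helper); 0 for a symmetric sample."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# All values are strictly positive, so a plain natural log is safe here.
log_pedigree = np.log(pedigree)

skew_raw = sample_skewness(pedigree)
skew_log = sample_skewness(log_pedigree)
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The log compresses the long right tail, so the skewness of the transformed sample is smaller than that of the raw sample, although ten values are far too few to say anything about normality.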

In [156]:
# Import necessary dependencies and settings
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import scipy.stats as spstats

%matplotlib inline
mpl.style.reload_library()
mpl.style.use('classic')
mpl.rcParams['figure.facecolor'] = (1, 1, 1, 0)
mpl.rcParams['figure.figsize'] = [6.0, 4.0]
mpl.rcParams['figure.dpi'] = 100

## Raw Measures

Raw measures typically mean using numeric variables directly as features without any form of
transformation or engineering. Typically these features indicate values or counts.

### Values

Usually, scalar values in their raw form indicate a specific measurement, metric, or observation belonging to
a specific variable or field. The semantics of this field are usually obtained from the field name itself or from
a data dictionary, if present.

### Ecoli Dataset

The Ecoli dataset is used for predicting protein localization sites in E. coli. It includes 336 samples and 8 attributes (7 predictive and 1 sequence name), plus the localization site as the target.

You can learn more about the dataset here:
* Ecoli Dataset ([ecoli.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.data))
* Ecoli Dataset Description ([ecoli.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.names))

Attribute Information.

1.  **accession**: Accession number for the SWISS-PROT database
1.  **mcg**: McGeoch's method for signal sequence recognition.
1.  **gvh**: von Heijne's method for signal sequence recognition.
1.  **lip**: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.
1.  **chg**: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.
1.  **aac**: Score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
1.  **alm1**: Score of the ALOM membrane spanning region prediction program.
1.  **alm2**: Score of ALOM program after excluding putative cleavable signal regions from the sequence.


The localization sites, with the number of samples in each class:

```
  cp  | cytoplasm                                   | 143
  im  | inner membrane without signal sequence      |  77 
  pp  | periplasm                                   |  52
  imU | inner membrane, uncleavable signal sequence |  35
  om  | outer membrane                              |  20
  omL | outer membrane lipoprotein                  |   5
  imL | inner membrane lipoprotein                  |   2
  imS | inner membrane, cleavable signal sequence   |   2
```

In [157]:
# Download Ecoli dataset
!wget -O ecoli.csv "https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/ecoli.csv"
!head ecoli.csv 

ecoli_df = pd.read_csv('ecoli.csv')
ecoli_df.head(10)

--2022-05-20 04:29:02--  https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/ecoli.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16498 (16K) [text/plain]
Saving to: ‘ecoli.csv’


2022-05-20 04:29:02 (81.0 MB/s) - ‘ecoli.csv’ saved [16498/16498]

accession,mcg,gvh,lip,chg,aac,alm1,alm2,site
EMRA_ECOLI,0.06,0.61,0.48,0.50,0.49,0.92,0.37,im
AAT_ECOLI,0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
ATKC_ECOLI,0.85,0.53,0.48,0.50,0.53,0.52,0.35,imS
ACEA_ECOLI,0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp
FADL_ECOLI,0.78,0.68,0.48,0.50,0.83,0.40,0.29,om
NLPA_ECOLI,0.75,0.55,1.00,1.00,0.40,0.47,0.30,imL
MULI_ECOLI,0.77,0.57,1.00,0.50,0.37,0.54,0.01,omL
ACEK_ECOLI,0.56,0.40,0.48,0.50,0.49,0.37,0.46,cp
ACKA_ECOLI,0.59,0.49,0.48,0.50,0.52,0.45,0.36,cp


Unnamed: 0,accession,mcg,gvh,lip,chg,aac,alm1,alm2,site
0,EMRA_ECOLI,0.06,0.61,0.48,0.5,0.49,0.92,0.37,im
1,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp
2,ATKC_ECOLI,0.85,0.53,0.48,0.5,0.53,0.52,0.35,imS
3,ACEA_ECOLI,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp
4,FADL_ECOLI,0.78,0.68,0.48,0.5,0.83,0.4,0.29,om
5,NLPA_ECOLI,0.75,0.55,1.0,1.0,0.4,0.47,0.3,imL
6,MULI_ECOLI,0.77,0.57,1.0,0.5,0.37,0.54,0.01,omL
7,ACEK_ECOLI,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp
8,ACKA_ECOLI,0.59,0.49,0.48,0.5,0.52,0.45,0.36,cp
9,AGP_ECOLI,0.74,0.49,0.48,0.5,0.42,0.54,0.36,pp


In [158]:
# Show some of the features
ecoli_df[['mcg', 'gvh', 'chg']].head()

Unnamed: 0,mcg,gvh,chg
0,0.06,0.61,0.5
1,0.49,0.29,0.5
2,0.85,0.53,0.5
3,0.07,0.4,0.5
4,0.78,0.68,0.5


In [159]:
# Compute basic statistical measures on the fields of 'mcg', 'gvh', 'chg'
ecoli_df[['mcg', 'gvh', 'chg']].describe()

Unnamed: 0,mcg,gvh,chg
count,336.0,336.0,336.0
mean,0.50006,0.5,0.501488
std,0.194634,0.148157,0.027277
min,0.0,0.16,0.5
25%,0.34,0.4,0.5
50%,0.5,0.47,0.5
75%,0.6625,0.57,0.5
max,0.89,1.0,1.0


### Counts
Raw numeric measures can also indicate counts, frequencies, and occurrences of specific attributes.

### Diabetes Dataset
The dataset classifies each patient record by whether or not diabetes sets in within five years.
There are 768 examples and eight input variables, making this a binary classification problem.

You can learn more about the dataset here:

* Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
* Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))

In [160]:
# Download Diabetes dataset
!wget -O pima-indians-diabetes.csv "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
!head pima-indians-diabetes.csv 

--2022-05-20 04:29:02--  https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23278 (23K) [text/plain]
Saving to: ‘pima-indians-diabetes.csv’


2022-05-20 04:29:02 (84.0 MB/s) - ‘pima-indians-diabetes.csv’ saved [23278/23278]

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1


In [161]:
diabetes_df = pd.read_csv('pima-indians-diabetes.csv', header=None)
diabetes_df.columns=['pregnancy', 'glucose', 'bp', 'triceps', 'insulin', 'bmi', 'pedigree', 'age', 'diabetes']
diabetes_df.head(10)

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [162]:
diabetes_df.describe()

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


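## Binarization

Often the raw value or count carries more detail than a model needs, and we only care whether it crosses
some threshold. In this case, a binary feature is preferred as opposed to a count-based or value-based
feature. Here we binarize the 'age' field: patients older than 50 are marked 1 and all others 0.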

In [163]:
# Binarize 'age' field manually
age = np.array(diabetes_df['age']) 
old = np.array(diabetes_df['age']) 
old[age > 50] = 1
old[age <= 50] = 0
diabetes_df['old'] = old

diabetes_df.head(10)

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes,old
0,6,148,72,35,0,33.6,0.627,50,1,0
1,1,85,66,29,0,26.6,0.351,31,0,0
2,8,183,64,0,0,23.3,0.672,32,1,0
3,1,89,66,23,94,28.1,0.167,21,0,0
4,0,137,40,35,168,43.1,2.288,33,1,0
5,5,116,74,0,0,25.6,0.201,30,0,0
6,3,78,50,32,88,31.0,0.248,26,1,0
7,10,115,0,0,0,35.3,0.134,29,0,0
8,2,197,70,45,543,30.5,0.158,53,1,1
9,8,125,96,0,0,0.0,0.232,54,1,1


In [164]:
# Binarize 'age' field using Binarizer
from sklearn.preprocessing import Binarizer

# Binarize data (set feature values to 0 or 1) according to a threshold.
# Values greater than the threshold map to 1, while values less than
# or equal to the threshold map to 0. With the default threshold of 0,
# only positive values map to 1.
bn = Binarizer(threshold=50)
bn_old = bn.transform([diabetes_df['age']])[0]
diabetes_df['bn_old'] = bn_old
diabetes_df.head(10)

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes,old,bn_old
0,6,148,72,35,0,33.6,0.627,50,1,0,0
1,1,85,66,29,0,26.6,0.351,31,0,0,0
2,8,183,64,0,0,23.3,0.672,32,1,0,0
3,1,89,66,23,94,28.1,0.167,21,0,0,0
4,0,137,40,35,168,43.1,2.288,33,1,0,0
5,5,116,74,0,0,25.6,0.201,30,0,0,0
6,3,78,50,32,88,31.0,0.248,26,1,0,0
7,10,115,0,0,0,35.3,0.134,29,0,0,0
8,2,197,70,45,543,30.5,0.158,53,1,1,1
9,8,125,96,0,0,0.0,0.232,54,1,1,1
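
Both versions above can also be written as a single vectorized comparison in pandas. The sketch below uses a hypothetical mini-frame named `ages` that reuses the ten `age` values shown in the output; it is an alternative formulation, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical mini-frame reusing the ten 'age' values shown above.
ages = pd.DataFrame({'age': [50, 31, 32, 21, 33, 30, 26, 29, 53, 54]})

# A boolean comparison is already a binarization; astype(int) turns
# True/False into 1/0. An age equal to the threshold (50) maps to 0,
# matching the strict "greater than" rule used by Binarizer.
ages['old'] = (ages['age'] > 50).astype(int)
print(ages)
```

Only the last two rows (ages 53 and 54) are flagged, which agrees with the `old` and `bn_old` columns above.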


## Rounding
Often when dealing with numeric attributes like proportions or percentages, we may not need values with a
high amount of precision. Hence it makes sense to round these high-precision values to integers.
These integers can then be used directly as raw numeric values or even as categorical
(discrete-class based) features.

In [165]:
# Create a column 'pedigree_scale_10' by scaling 'pedigree' by 10 and rounding to the nearest integer
diabetes_df['pedigree_scale_10'] = np.array(np.round((diabetes_df['pedigree'] * 10)), dtype='int')
# Create a column 'pedigree_scale_100' by scaling 'pedigree' by 100 and rounding to the nearest integer
diabetes_df['pedigree_scale_100'] = np.array(np.round((diabetes_df['pedigree'] * 100)), dtype='int')
diabetes_df

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes,old,bn_old,pedigree_scale_10,pedigree_scale_100
0,6,148,72,35,0,33.6,0.627,50,1,0,0,6,63
1,1,85,66,29,0,26.6,0.351,31,0,0,0,4,35
2,8,183,64,0,0,23.3,0.672,32,1,0,0,7,67
3,1,89,66,23,94,28.1,0.167,21,0,0,0,2,17
4,0,137,40,35,168,43.1,2.288,33,1,0,0,23,229
...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0,1,1,2,17
764,2,122,70,27,0,36.8,0.340,27,0,0,0,3,34
765,5,121,72,23,112,26.2,0.245,30,0,0,0,2,24
766,1,126,60,0,0,30.1,0.349,47,1,0,0,3,35
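
One detail worth keeping in mind when rounding with NumPy: `np.round` sends exact halves to the nearest even integer rather than always rounding them up. The short sketch below illustrates this on a few made-up values and shows one possible "round half up" alternative for non-negative inputs; the variable names are illustrative only.

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, 3.5])

# np.round uses "round half to even": 0.5 -> 0, 1.5 -> 2, 2.5 -> 2, 3.5 -> 4
rounded_half_even = np.round(halves).astype(int)

# A common "round half up" alternative for non-negative values: floor(x + 0.5)
rounded_half_up = np.floor(halves + 0.5).astype(int)

print(rounded_half_even)  # [0 2 2 4]
print(rounded_half_up)    # [1 2 3 4]
```

For features such as the scaled pedigree values the difference only matters for values that land exactly on a half, but it can be surprising when results are compared against hand calculations.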


## Interactions
Often in real-world datasets and scenarios, it makes sense to also capture the
interactions between feature variables as a part of the input feature set.

In [166]:
gvh_lip = ecoli_df[['gvh','lip']]
gvh_lip.head()

Unnamed: 0,gvh,lip
0,0.61,0.48
1,0.29,0.48
2,0.53,0.48
3,0.4,0.48
4,0.68,0.48


In [167]:
from sklearn.preprocessing import PolynomialFeatures
# build features up to the second degree using the PolynomialFeatures class from scikit-learn's API.
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(gvh_lip)
res

array([[0.61  , 0.48  , 0.3721, 0.2928, 0.2304],
       [0.29  , 0.48  , 0.0841, 0.1392, 0.2304],
       [0.53  , 0.48  , 0.2809, 0.2544, 0.2304],
       ...,
       [0.6   , 0.48  , 0.36  , 0.288 , 0.2304],
       [0.61  , 0.48  , 0.3721, 0.2928, 0.2304],
       [0.74  , 0.48  , 0.5476, 0.3552, 0.2304]])

We have a total of five features including the new interaction
features.

We can see the degree of each feature in the matrix.

In [168]:
pd.DataFrame(pf.powers_, columns=['gvh_degree', 'lip_degree'])

Unnamed: 0,gvh_degree,lip_degree
0,1,0
1,0,1
2,2,0
3,1,1
4,0,2


Now that we know what each feature actually represents from the degrees depicted, we can assign a
name to each feature as follows to get the updated feature set.

In [169]:
intr_features = pd.DataFrame(res, columns=['gvh', 'lip', 'gvh^2', 'gvh x lip', 'lip^2'])
intr_features.head(5)  

Unnamed: 0,gvh,lip,gvh^2,gvh x lip,lip^2
0,0.61,0.48,0.3721,0.2928,0.2304
1,0.29,0.48,0.0841,0.1392,0.2304
2,0.53,0.48,0.2809,0.2544,0.2304
3,0.4,0.48,0.16,0.192,0.2304
4,0.68,0.48,0.4624,0.3264,0.2304
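
To make explicit what each of the five columns holds, the degree-2 expansion can be reproduced by hand with NumPy. The sketch below hard-codes the first three `gvh`/`lip` rows and the exponent table shown in the `powers_` output; it is meant as an illustration of the arithmetic, not as a description of how scikit-learn implements it internally.

```python
import numpy as np

# First three (gvh, lip) rows from the Ecoli data shown above.
X = np.array([[0.61, 0.48],
              [0.29, 0.48],
              [0.53, 0.48]])

# Exponent table copied from the powers_ output: one row per output feature.
powers = np.array([[1, 0],   # gvh
                   [0, 1],   # lip
                   [2, 0],   # gvh^2
                   [1, 1],   # gvh x lip
                   [0, 2]])  # lip^2

# For each exponent pair (a, b) compute gvh**a * lip**b.
# Broadcasting: (n, 1, 2) ** (1, 5, 2) gives (n, 5, 2); the product over
# the last axis multiplies the two powered inputs together.
manual = np.prod(X[:, None, :] ** powers[None, :, :], axis=2)
print(np.round(manual, 4))
```

The first row reproduces the values in the table above (0.61, 0.48, 0.3721, 0.2928, 0.2304).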


Transforming new data in the future (during predictions)

In [170]:
# take some sample new observations for the gvh and lip features and try to transform
# them using this same mechanism.
new_df = pd.DataFrame([[0.35, 0.49],[0.46, 0.38], [0.25, 0.48]], 
                      columns=['gvh', 'lip'])
new_df

Unnamed: 0,gvh,lip
0,0.35,0.49
1,0.46,0.38
2,0.25,0.48


In [171]:
# use the pf object that we created earlier and transform these input features to give us the
# interaction features
new_res = pf.transform(new_df)
new_intr_features = pd.DataFrame(new_res, 
                                 columns=['gvh', 'lip', 'gvh^2', 'gvh x lip', 'lip^2'])
new_intr_features

Unnamed: 0,gvh,lip,gvh^2,gvh x lip,lip^2
0,0.35,0.49,0.1225,0.1715,0.2401
1,0.46,0.38,0.2116,0.1748,0.1444
2,0.25,0.48,0.0625,0.12,0.2304


# Feature Engineering on Categorical Data

Any attribute or feature that is categorical in nature represents discrete values that belong to a specific
finite set of categories or classes. Category or class labels can be text or numeric in nature. Usually there are
two types of categorical variables—nominal and ordinal.


In [172]:
# Import necessary dependencies and settings
import pandas as pd
import numpy as np

## Transforming Nominal Features

Nominal features or attributes are categorical variables that usually have a finite set of distinct discrete
values. Often these values are in string or text format and Machine Learning algorithms cannot understand
them directly. Hence usually you might need to transform these features into a more representative numeric
format.

In [173]:
ecoli_df.head(11)

Unnamed: 0,accession,mcg,gvh,lip,chg,aac,alm1,alm2,site
0,EMRA_ECOLI,0.06,0.61,0.48,0.5,0.49,0.92,0.37,im
1,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp
2,ATKC_ECOLI,0.85,0.53,0.48,0.5,0.53,0.52,0.35,imS
3,ACEA_ECOLI,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp
4,FADL_ECOLI,0.78,0.68,0.48,0.5,0.83,0.4,0.29,om
5,NLPA_ECOLI,0.75,0.55,1.0,1.0,0.4,0.47,0.3,imL
6,MULI_ECOLI,0.77,0.57,1.0,0.5,0.37,0.54,0.01,omL
7,ACEK_ECOLI,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp
8,ACKA_ECOLI,0.59,0.49,0.48,0.5,0.52,0.45,0.36,cp
9,AGP_ECOLI,0.74,0.49,0.48,0.5,0.42,0.54,0.36,pp


The dataframe shows the attributes of the Ecoli dataset. The site feature, which holds the
localization site as a text label, is a nominal categorical variable.

In [174]:
sites = np.unique(ecoli_df['site'])
sites

array(['cp', 'im', 'imL', 'imS', 'imU', 'om', 'omL', 'pp'], dtype=object)

This output tells us we have 8 distinct sites in the Ecoli dataset.

In [175]:
from sklearn.preprocessing import LabelEncoder

# Let’s transform this feature now using a mapping scheme of 'site'
sle = LabelEncoder()
site_labels = sle.fit_transform(ecoli_df['site'])
site_mappings = {index: label for index, label in enumerate(sle.classes_)}
site_mappings

{0: 'cp', 1: 'im', 2: 'imL', 3: 'imS', 4: 'imU', 5: 'om', 6: 'omL', 7: 'pp'}

A mapping scheme has been generated where each site value is
mapped to a number with the help of the LabelEncoder object sle. The transformed labels are stored in the
site_labels variable.

In [176]:
ecoli_df['siteLabel'] = site_labels
ecoli_df.head(11)

Unnamed: 0,accession,mcg,gvh,lip,chg,aac,alm1,alm2,site,siteLabel
0,EMRA_ECOLI,0.06,0.61,0.48,0.5,0.49,0.92,0.37,im,1
1,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp,0
2,ATKC_ECOLI,0.85,0.53,0.48,0.5,0.53,0.52,0.35,imS,3
3,ACEA_ECOLI,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp,0
4,FADL_ECOLI,0.78,0.68,0.48,0.5,0.83,0.4,0.29,om,5
5,NLPA_ECOLI,0.75,0.55,1.0,1.0,0.4,0.47,0.3,imL,2
6,MULI_ECOLI,0.77,0.57,1.0,0.5,0.37,0.54,0.01,omL,6
7,ACEK_ECOLI,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp,0
8,ACKA_ECOLI,0.59,0.49,0.48,0.5,0.52,0.45,0.36,cp,0
9,AGP_ECOLI,0.74,0.49,0.48,0.5,0.42,0.54,0.36,pp,7


The siteLabel field shows the mapped numeric labels for each of the site labels and we can clearly
see that this adheres to the mappings that we generated earlier.
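
The mapping above assigns indices to the class names in sorted order. As a rough sketch of the same idea without scikit-learn, `np.unique` with `return_inverse=True` returns both the sorted classes and the integer label of each observation. The array below contains only the first five `site` values, so the indices differ from those of the full dataset, which has eight classes.

```python
import numpy as np

# First five 'site' values from the Ecoli dataframe shown above.
sites = np.array(['im', 'cp', 'imS', 'cp', 'om'])

# classes: sorted unique values; labels: index of each observation's class.
classes, labels = np.unique(sites, return_inverse=True)
mapping = {int(i): str(c) for i, c in enumerate(classes)}

print(mapping)  # {0: 'cp', 1: 'im', 2: 'imS', 3: 'om'}
print(labels)   # [1 0 2 0 3]
```

Indexing `classes[labels]` recovers the original strings, which is the role `inverse_transform` plays on a fitted LabelEncoder.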

## Transforming Ordinal Features

Ordinal features are similar to nominal features except that order matters and is an inherent property with
which we can interpret the values of these features. Like nominal features, even ordinal features might be
present in text form and you need to map and transform them into their numeric representation.

In [177]:
# The Pokémon dataset describes fictional creatures and is also available on Kaggle.
!wget -O Pokemon.csv "https://raw.githubusercontent.com/dipanjanS/practical-machine-learning-with-python/master/notebooks/Ch04_Feature_Engineering_and_Selection/datasets/Pokemon.csv"
!head Pokemon.csv 

poke_df = pd.read_csv('Pokemon.csv', encoding='utf-8')
poke_df.head()

--2022-05-20 04:29:03--  https://raw.githubusercontent.com/dipanjanS/practical-machine-learning-with-python/master/notebooks/Ch04_Feature_Engineering_and_Selection/datasets/Pokemon.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47228 (46K) [text/plain]
Saving to: ‘Pokemon.csv’


2022-05-20 04:29:03 (12.5 MB/s) - ‘Pokemon.csv’ saved [47228/47228]

#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,Gen 1,FALSE
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,Gen 1,FALSE
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,Gen 1,FALSE
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,Gen 1,FALSE
4,Charmander,Fire,,309,39,52,43,60,50,65,Gen 1,FALSE
5,Charmeleon,Fire,,405

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,Gen 1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,Gen 1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,Gen 1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,Gen 1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,Gen 1,False


In [178]:
poke_df = pd.read_csv('Pokemon.csv', encoding='utf-8')
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)

np.unique(poke_df['Generation'])

array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)

From this output we can see that there are a total of six
generations of Pokémon. This attribute is definitely ordinal because Pokémon belonging to Generation 1
were introduced earlier in the video games and the television shows than Generation 2 and so on. Hence
they have a sense of order among them.

However, no generic module or function can infer this order automatically from the text labels.
Hence we need to hand-craft the mapping using our own logic, which is depicted in the
following code snippet.

In [179]:
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}

poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]

Unnamed: 0,Name,Generation,GenerationLabel
4,Octillery,Gen 2,2
5,Helioptile,Gen 6,6
6,Dialga,Gen 4,4
7,DeoxysDefense Forme,Gen 3,3
8,Rapidash,Gen 1,1
9,Swanna,Gen 5,5


## Encoding Categorical Features

If we feed these transformed numeric representations of categorical features directly into an
algorithm, the model will interpret them as raw numeric features, and hence a notion of magnitude
will be wrongly introduced into the system.

Hence models built using these features directly would
be sub-optimal or incorrect. There are several schemes and strategies in which dummy features are
created for each unique value or label out of all the distinct categories in a feature. In the subsequent
sections, we will discuss some of these schemes including one hot encoding, dummy coding, effect coding,
and feature hashing schemes.

### One Hot Encoding Scheme
Given the numeric representation of a categorical feature with m labels, the one hot encoding
scheme encodes or transforms the feature into m binary features, each of which can only contain a value of 1 or 0.
Each observation in the categorical feature is thus converted into a vector of size m with only one of the
values as 1 (indicating it as active).
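
The definition can be sketched in a couple of lines of NumPy: row `k` of an m-by-m identity matrix is exactly the one hot vector for label `k`. The labels below are made up for illustration and assume integer labels in the range 0 to m-1.

```python
import numpy as np

# Illustrative integer labels for a feature with m = 6 categories.
labels = np.array([1, 5, 3, 2, 0, 4])
m = 6

# Row k of the identity matrix has a single 1 in position k,
# so indexing the identity by the labels yields one hot vectors.
one_hot = np.eye(m, dtype=int)[labels]
print(one_hot)
```

Every row contains exactly one 1, and its position equals the original label.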

In [180]:
poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]

Unnamed: 0,Name,Generation,Legendary
4,Octillery,Gen 2,False
5,Helioptile,Gen 6,False
6,Dialga,Gen 4,True
7,DeoxysDefense Forme,Gen 3,True
8,Rapidash,Gen 1,False
9,Swanna,Gen 5,False


In [181]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# transform and map pokemon generations
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels

# transform and map pokemon legendary status
leg_le = LabelEncoder()
leg_labels = leg_le.fit_transform(poke_df['Legendary'])
poke_df['Lgnd_Label'] = leg_labels

poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]
poke_df_sub.iloc[4:10]

Unnamed: 0,Name,Generation,Gen_Label,Legendary,Lgnd_Label
4,Octillery,Gen 2,1,False,0
5,Helioptile,Gen 6,5,False,0
6,Dialga,Gen 4,3,True,1
7,DeoxysDefense Forme,Gen 3,2,True,1
8,Rapidash,Gen 1,0,False,0
9,Swanna,Gen 5,4,False,0


In [182]:
# encode generation labels using one-hot encoding scheme
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)

# encode legendary status labels using one-hot encoding scheme
leg_ohe = OneHotEncoder()
leg_feature_arr = leg_ohe.fit_transform(poke_df[['Lgnd_Label']]).toarray()
leg_feature_labels = ['Legendary_'+str(cls_label) for cls_label in leg_le.classes_]
leg_features = pd.DataFrame(leg_feature_arr, columns=leg_feature_labels)

In [183]:
# Let’s now concatenate these feature frames and see the final result.
poke_df_ohe = pd.concat([poke_df_sub, gen_features, leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'],gen_feature_labels,
              ['Legendary', 'Lgnd_Label'],leg_feature_labels], [])
poke_df_ohe[columns].iloc[4:10]

Unnamed: 0,Name,Generation,Gen_Label,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6,Legendary,Lgnd_Label,Legendary_False,Legendary_True
4,Octillery,Gen 2,1,0.0,1.0,0.0,0.0,0.0,0.0,False,0,1.0,0.0
5,Helioptile,Gen 6,5,0.0,0.0,0.0,0.0,0.0,1.0,False,0,1.0,0.0
6,Dialga,Gen 4,3,0.0,0.0,0.0,1.0,0.0,0.0,True,1,0.0,1.0
7,DeoxysDefense Forme,Gen 3,2,0.0,0.0,1.0,0.0,0.0,0.0,True,1,0.0,1.0
8,Rapidash,Gen 1,0,1.0,0.0,0.0,0.0,0.0,0.0,False,0,1.0,0.0
9,Swanna,Gen 5,4,0.0,0.0,0.0,0.0,1.0,0.0,False,0,1.0,0.0


We can clearly see the new one hot encoded features
for Gen_Label and Lgnd_Label. Each of these one hot encoded features is binary in nature and if they
contain the value 1, it means that feature is active for the corresponding observation.

In [184]:
# The following code shows us two dummy data points pertaining to new Pokemon.
new_poke_df = pd.DataFrame([['PikaZoom', 'Gen 3', True], 
                           ['CharMyToast', 'Gen 4', False]],
                           columns=['Name', 'Generation', 'Legendary'])
new_poke_df

Unnamed: 0,Name,Generation,Legendary
0,PikaZoom,Gen 3,True
1,CharMyToast,Gen 4,False


In [185]:
# converting the text categories into numeric representations using our previously built LabelEncoder objects
new_gen_labels = gen_le.transform(new_poke_df['Generation'])
new_poke_df['Gen_Label'] = new_gen_labels

new_leg_labels = leg_le.transform(new_poke_df['Legendary'])
new_poke_df['Lgnd_Label'] = new_leg_labels

new_poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]

Unnamed: 0,Name,Generation,Gen_Label,Legendary,Lgnd_Label
0,PikaZoom,Gen 3,2,True,1
1,CharMyToast,Gen 4,3,False,0


In [186]:
# use our previously fitted OneHotEncoder objects to perform one hot encoding on these new data observations
new_gen_feature_arr = gen_ohe.transform(new_poke_df[['Gen_Label']]).toarray()
new_gen_features = pd.DataFrame(new_gen_feature_arr, columns=gen_feature_labels)

new_leg_feature_arr = leg_ohe.transform(new_poke_df[['Lgnd_Label']]).toarray()
new_leg_features = pd.DataFrame(new_leg_feature_arr, columns=leg_feature_labels)

new_poke_ohe = pd.concat([new_poke_df, new_gen_features, new_leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'], gen_feature_labels,
               ['Legendary', 'Lgnd_Label'], leg_feature_labels], [])
new_poke_ohe[columns]

Unnamed: 0,Name,Generation,Gen_Label,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6,Legendary,Lgnd_Label,Legendary_False,Legendary_True
0,PikaZoom,Gen 3,2,0.0,0.0,1.0,0.0,0.0,0.0,True,1,0.0,1.0
1,CharMyToast,Gen 4,3,0.0,0.0,0.0,1.0,0.0,0.0,False,0,1.0,0.0
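
New observations may contain a category that was never seen during fitting, in which case a fitted LabelEncoder raises an error on `transform`. One possible way to handle this, sketched below under the assumption of a reasonably recent scikit-learn, is to fit `OneHotEncoder` directly on the string column with `handle_unknown='ignore'`. The value 'Gen 7' is a hypothetical unseen category used only for this example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Small illustrative training column and two new observations;
# 'Gen 7' is a hypothetical category that was not seen during fit.
train = np.array([['Gen 1'], ['Gen 2'], ['Gen 3']], dtype=object)
new = np.array([['Gen 3'], ['Gen 7']], dtype=object)

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)
encoded = ohe.transform(new).toarray()

print(ohe.categories_[0])
print(encoded)
```

The known category 'Gen 3' gets its usual one hot vector, while the unseen 'Gen 7' becomes a row of zeros. Whether silently ignoring unknown categories is appropriate depends on the application.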


In [187]:
# Pandas provides the get_dummies() function, which can help us easily perform one hot encoding
gen_onehot_features = pd.get_dummies(poke_df['Generation'])
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]

Unnamed: 0,Name,Generation,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6
4,Octillery,Gen 2,0,1,0,0,0,0
5,Helioptile,Gen 6,0,0,0,0,0,1
6,Dialga,Gen 4,0,0,0,1,0,0
7,DeoxysDefense Forme,Gen 3,0,0,1,0,0,0
8,Rapidash,Gen 1,1,0,0,0,0,0
9,Swanna,Gen 5,0,0,0,0,1,0
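
Depending on the pandas version, `get_dummies` may return boolean columns (True/False) rather than the 0/1 integers shown above. Passing `dtype=int` makes the output explicit; the sketch below uses a small hypothetical Series rather than the full Pokémon frame.

```python
import pandas as pd

# Three illustrative generation values.
generation = pd.Series(['Gen 2', 'Gen 6', 'Gen 4'], name='Generation')

# dtype=int forces 0/1 integer columns regardless of the version default.
onehot = pd.get_dummies(generation, dtype=int)
print(onehot)
```

The columns are ordered by category name ('Gen 2', 'Gen 4', 'Gen 6'), not by order of appearance.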


### Dummy Coding Scheme
The dummy coding scheme is similar to the one hot encoding scheme, except that when applied to a
categorical feature with m distinct labels, it yields m-1 binary features. Thus each
value of the categorical variable gets converted into a vector of size m-1. The extra feature is completely
disregarded, so if the category values range over {0, 1, ..., m-1}, either the 0th or the (m-1)th category is
usually represented by a vector of all zeros (0).

In [188]:
# Create dummy coding scheme on Pokémon Generation by dropping the first level binary encoded feature (Gen 1).
gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]

Unnamed: 0,Name,Generation,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6
4,Octillery,Gen 2,1,0,0,0,0
5,Helioptile,Gen 6,0,0,0,0,1
6,Dialga,Gen 4,0,0,1,0,0
7,DeoxysDefense Forme,Gen 3,0,1,0,0,0
8,Rapidash,Gen 1,0,0,0,0,0
9,Swanna,Gen 5,0,0,0,1,0


In [189]:
# choose to drop the last level binary encoded feature (Gen 6)
gen_onehot_features = pd.get_dummies(poke_df['Generation'])
gen_dummy_features = gen_onehot_features.iloc[:,:-1]
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]

Unnamed: 0,Name,Generation,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5
4,Octillery,Gen 2,0,1,0,0,0
5,Helioptile,Gen 6,0,0,0,0,0
6,Dialga,Gen 4,0,0,0,1,0
7,DeoxysDefense Forme,Gen 3,0,0,1,0,0
8,Rapidash,Gen 1,1,0,0,0,0
9,Swanna,Gen 5,0,0,0,0,1
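
Dropping one level loses no information, because the dropped column can always be recovered as 1 minus the sum of the remaining columns. The following sketch checks this on a small hypothetical Series with one value per generation; `recovered_gen1` is a name introduced only for this illustration.

```python
import pandas as pd

# One illustrative observation per generation.
generation = pd.Series(['Gen 2', 'Gen 6', 'Gen 4', 'Gen 3', 'Gen 1', 'Gen 5'])

# Full one hot encoding (m columns) and dummy coding (m-1 columns).
full = pd.get_dummies(generation, dtype=int)
dummy = full.drop(columns='Gen 1')

# A row of all zeros in the dummy coding means "Gen 1", so the dropped
# column equals 1 minus the row sum of the remaining columns.
recovered_gen1 = 1 - dummy.sum(axis=1)
print(recovered_gen1.tolist())  # [0, 0, 0, 0, 1, 0]
```

The recovered column matches the original 'Gen 1' column of the full encoding, which is why the extra feature can be disregarded.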
