# **Feature Engineering Exercise Solution**

Data processing and feature
engineering are often described by data scientists as the toughest step in building any Machine Learning
system. Because it requires both domain knowledge and mathematical transformations, feature
engineering is often said to be both an art and a science. The obvious complexities involve dealing
with diverse types of data and variables. Besides this, each Machine Learning problem or task needs
specific features, and there is no one-size-fits-all solution in feature engineering. This makes feature
engineering all the more difficult and complex.

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

# Feature Engineering on Numeric Data



Even though numeric data can be fed directly into Machine Learning models, you still need to
engineer features that are relevant to the scenario, problem, and domain before building a model. Hence
the need for feature engineering remains. Important aspects of numeric features include feature scale and
distribution. In some scenarios we need to apply specific transformations to change the scale of numeric
values, and in others we need to change the overall distribution of the numeric values, for example
transforming a skewed distribution into a normal distribution.
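
As a minimal sketch of the distribution-changing case, the snippet below applies a log transform to a small right-skewed sample and compares the skewness before and after. The sample values are illustrative only, and `np.log1p` is just one common choice of transform, not the only option.

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed sample (many small values, one large outlier)
raw = pd.Series([0, 0, 0, 94, 168, 0, 88, 0, 543, 0], dtype=float)

# log1p computes log(1 + x), so zeros are handled without error
logged = np.log1p(raw)

# Series.skew() returns the sample skewness; values nearer 0 are more symmetric
skew_before = raw.skew()
skew_after = logged.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transform compresses the long right tail, so the skewness of the transformed sample is noticeably smaller than that of the raw sample.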

In [None]:
# Import necessary dependencies and settings
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import scipy.stats as spstats

%matplotlib inline
mpl.style.reload_library()
mpl.style.use('classic')
mpl.rcParams['figure.facecolor'] = (1, 1, 1, 0)
mpl.rcParams['figure.figsize'] = [6.0, 4.0]
mpl.rcParams['figure.dpi'] = 100

## Raw Measures

Using raw measures typically
means using numeric variables directly as features without any form of transformation or engineering.
Typically these features indicate values or counts.

### Values

Usually, scalar values in their raw form indicate a specific measurement, metric, or observation belonging to
a specific variable or field. The semantics of this field are usually obtained from the field name itself or from a data
dictionary, if present.

### Ecoli Dataset

The Ecoli dataset is used for predicting protein localization sites in E. coli.
```
Number of Instances:  336 
Number of Attributes: 8 ( 7 predictive, 1 name )
Attribute Information.
  1. Sequence Name: Accession number for the SWISS-PROT database
  2. mcg: McGeoch's method for signal sequence recognition.
  3. gvh: von Heijne's method for signal sequence recognition.
  4. lip: von Heijne's Signal Peptidase II consensus sequence score (Binary attribute).
  5. chg: Presence of charge on N-terminus of predicted lipoproteins (Binary attribute).
  6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
  7. alm1: score of the ALOM membrane spanning region prediction program.
  8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.
Missing Attribute Values: None.
Class Distribution. The class is the localization site.
  cp  (cytoplasm)                                    143
  im  (inner membrane without signal sequence)        77
  pp  (periplasm)                                     52
  imU (inner membrane, uncleavable signal sequence)   35
  om  (outer membrane)                                20
  omL (outer membrane lipoprotein)                     5
  imL (inner membrane lipoprotein)                     2
  imS (inner membrane, cleavable signal sequence)      2
```
You can learn more about the dataset here:
* Ecoli Dataset ([ecoli.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.data))
* Ecoli Dataset Description ([ecoli.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.names))


In [None]:
# Download Ecoli dataset
!pip install wget
!python -m wget -o ecoli.csv "https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/ecoli.csv"

ecoli_df = pd.read_csv('ecoli.csv')
ecoli_df.head(10)

--2022-05-20 06:11:16--  https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/ecoli.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16498 (16K) [text/plain]
Saving to: ‘ecoli.csv’


2022-05-20 06:11:16 (101 MB/s) - ‘ecoli.csv’ saved [16498/16498]

accession,mcg,gvh,lip,chg,aac,alm1,alm2,site
EMRA_ECOLI,0.06,0.61,0.48,0.50,0.49,0.92,0.37,im
AAT_ECOLI,0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
ATKC_ECOLI,0.85,0.53,0.48,0.50,0.53,0.52,0.35,imS
ACEA_ECOLI,0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp
FADL_ECOLI,0.78,0.68,0.48,0.50,0.83,0.40,0.29,om
NLPA_ECOLI,0.75,0.55,1.00,1.00,0.40,0.47,0.30,imL
MULI_ECOLI,0.77,0.57,1.00,0.50,0.37,0.54,0.01,omL
ACEK_ECOLI,0.56,0.40,0.48,0.50,0.49,0.37,0.46,cp
ATKA_ECOLI,0.72,0.42,0.48,0.50,0.65,0.77,0.79,imU


Unnamed: 0,accession,mcg,gvh,lip,chg,aac,alm1,alm2,site
0,EMRA_ECOLI,0.06,0.61,0.48,0.5,0.49,0.92,0.37,im
1,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp
2,ATKC_ECOLI,0.85,0.53,0.48,0.5,0.53,0.52,0.35,imS
3,ACEA_ECOLI,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp
4,FADL_ECOLI,0.78,0.68,0.48,0.5,0.83,0.4,0.29,om
5,NLPA_ECOLI,0.75,0.55,1.0,1.0,0.4,0.47,0.3,imL
6,MULI_ECOLI,0.77,0.57,1.0,0.5,0.37,0.54,0.01,omL
7,ACEK_ECOLI,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp
8,ATKA_ECOLI,0.72,0.42,0.48,0.5,0.65,0.77,0.79,imU
9,AGP_ECOLI,0.74,0.49,0.48,0.5,0.42,0.54,0.36,pp


In [None]:
# Show some of the features
ecoli_df[['mcg', 'gvh', 'chg']].head()

Unnamed: 0,mcg,gvh,chg
0,0.06,0.61,0.5
1,0.49,0.29,0.5
2,0.85,0.53,0.5
3,0.07,0.4,0.5
4,0.78,0.68,0.5


In [None]:
# Compute basic statistical measures on the fields of 'mcg', 'gvh', 'chg'
ecoli_df[['mcg', 'gvh', 'chg']].describe()

Unnamed: 0,mcg,gvh,chg
count,336.0,336.0,336.0
mean,0.50006,0.5,0.501488
std,0.194634,0.148157,0.027277
min,0.0,0.16,0.5
25%,0.34,0.4,0.5
50%,0.5,0.47,0.5
75%,0.6625,0.57,0.5
max,0.89,1.0,1.0


### Counts
Raw numeric measures can also indicate counts, frequencies and occurrences of specific attributes.

### Diabetes Dataset
The dataset classifies each patient record as
either showing an onset of diabetes within five years or not.

```
Number of Instances: 768
Number of Attributes: 8 plus class 
For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as "tested positive for
   diabetes")
   Class Value  Number of instances
   0            500
   1            268
```
You can learn more about the dataset here:

* Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
* Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))

In [None]:
# Download Diabetes dataset
!python -m wget -o pima-indians-diabetes.csv "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"


--2022-05-20 06:11:17--  https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23278 (23K) [text/plain]
Saving to: ‘pima-indians-diabetes.csv’


2022-05-20 06:11:17 (102 MB/s) - ‘pima-indians-diabetes.csv’ saved [23278/23278]

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1


In [None]:
diabetes_df = pd.read_csv('pima-indians-diabetes.csv', header=None)
diabetes_df.columns=['pregnancy', 'glucose', 'bp', 'triceps', 'insulin', 'bmi', 'pedigree', 'age', 'diabetes']
diabetes_df.head(10)

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [None]:
diabetes_df.describe()

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


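## Binarization

Often the exact value or count of a feature is not what matters; we only care whether it crosses
a threshold. For example, instead of a patient's exact age we may only want to know whether the
patient is older than 50. In this case, a binary feature is preferred over the raw numeric feature.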

In [None]:
# Binarize 'age' field manually
age = np.array(diabetes_df['age']) 
old = np.array(diabetes_df['age']) 
old[age > 50] = 1
old[age <= 50] = 0
diabetes_df['old'] = old

diabetes_df.head(10)

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes,old
0,6,148,72,35,0,33.6,0.627,50,1,0
1,1,85,66,29,0,26.6,0.351,31,0,0
2,8,183,64,0,0,23.3,0.672,32,1,0
3,1,89,66,23,94,28.1,0.167,21,0,0
4,0,137,40,35,168,43.1,2.288,33,1,0
5,5,116,74,0,0,25.6,0.201,30,0,0
6,3,78,50,32,88,31.0,0.248,26,1,0
7,10,115,0,0,0,35.3,0.134,29,0,0
8,2,197,70,45,543,30.5,0.158,53,1,1
9,8,125,96,0,0,0.0,0.232,54,1,1


In [None]:
# Binarize 'age' field using Binarizer
from sklearn.preprocessing import Binarizer

# Binarize data (set feature values to 0 or 1) according to a threshold.
# Values greater than the threshold map to 1, while values less than
# or equal to the threshold map to 0. With the default threshold of 0,
# only positive values map to 1.
bn = Binarizer(threshold=50)
bn_old = bn.transform([diabetes_df['age']])[0]
diabetes_df['bn_old'] = bn_old
diabetes_df.head(10)

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes,old,bn_old
0,6,148,72,35,0,33.6,0.627,50,1,0,0
1,1,85,66,29,0,26.6,0.351,31,0,0,0
2,8,183,64,0,0,23.3,0.672,32,1,0,0
3,1,89,66,23,94,28.1,0.167,21,0,0,0
4,0,137,40,35,168,43.1,2.288,33,1,0,0
5,5,116,74,0,0,25.6,0.201,30,0,0,0
6,3,78,50,32,88,31.0,0.248,26,1,0,0
7,10,115,0,0,0,35.3,0.134,29,0,0,0
8,2,197,70,45,543,30.5,0.158,53,1,1,1
9,8,125,96,0,0,0.0,0.232,54,1,1,1
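
As a cross-check of the two cells above, the sketch below compares a plain boolean comparison with `Binarizer` on a small stand-in for the `age` column (the first ten ages from the table). It passes the data as a single column of shape `(n_samples, 1)`, which is the layout scikit-learn transformers normally expect; the variable names here are illustrative and not part of the exercise.

```python
import pandas as pd
from sklearn.preprocessing import Binarizer

# Small stand-in for diabetes_df['age'] (first ten rows of the table above)
ages = pd.DataFrame({'age': [50, 31, 32, 21, 33, 30, 26, 29, 53, 54]})

# Plain comparison: values strictly greater than the threshold map to 1
old_manual = (ages['age'] > 50).astype(int)

# Binarizer applied to a 2-D input with one feature column
old_sklearn = Binarizer(threshold=50).fit_transform(ages[['age']]).ravel()

print(old_manual.tolist())
print(old_sklearn.tolist())
```

Both approaches treat an age of exactly 50 as 0, because only values strictly above the threshold map to 1.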


## Rounding
Often when dealing with numeric attributes like proportions or percentages, we may not need values with a
high amount of precision. Hence it makes sense to round off these high-precision values into
integers. These integers can then be used directly as raw numeric values or even as categorical
(discrete-class based) features.

In [None]:
# Create a column 'pedigree_scale_10' by scaling 'pedigree' by 10 and rounding to an integer
diabetes_df['pedigree_scale_10'] = np.array(np.round((diabetes_df['pedigree'] * 10)), dtype='int')
# Create a column 'pedigree_scale_100' by scaling 'pedigree' by 100 and rounding to an integer
diabetes_df['pedigree_scale_100'] = np.array(np.round((diabetes_df['pedigree'] * 100)), dtype='int')
diabetes_df

Unnamed: 0,pregnancy,glucose,bp,triceps,insulin,bmi,pedigree,age,diabetes,old,bn_old,pedigree_scale_10,pedigree_scale_100
0,6,148,72,35,0,33.6,0.627,50,1,0,0,6,63
1,1,85,66,29,0,26.6,0.351,31,0,0,0,4,35
2,8,183,64,0,0,23.3,0.672,32,1,0,0,7,67
3,1,89,66,23,94,28.1,0.167,21,0,0,0,2,17
4,0,137,40,35,168,43.1,2.288,33,1,0,0,23,229
...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0,1,1,2,17
764,2,122,70,27,0,36.8,0.340,27,0,0,0,3,34
765,5,121,72,23,112,26.2,0.245,30,0,0,0,2,24
766,1,126,60,0,0,30.1,0.349,47,1,0,0,3,35
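
One detail of the rounding step that can surprise: `np.round` rounds values that lie exactly halfway between two integers to the nearest even integer. The sketch below shows this on a few hand-picked values and contrasts it with a simple "round half up" alternative; whether the difference matters depends on the feature, so treat this as an illustration rather than a recommendation.

```python
import numpy as np

# Values that sit exactly halfway between two integers
halves = np.array([0.5, 1.5, 2.5, 3.5])

# np.round rounds halves to the nearest even integer
rounded = np.round(halves).astype(int)
print(rounded)

# Adding 0.5 and flooring gives "round half up" behaviour instead
half_up = np.floor(halves + 0.5).astype(int)
print(half_up)
```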


## Interactions
Often in real-world datasets and scenarios, it makes sense to also try to capture the
interactions between these feature variables as a part of the input feature set.

In [None]:
gvh_lip = ecoli_df[['gvh','lip']]
gvh_lip.head()

Unnamed: 0,gvh,lip
0,0.61,0.48
1,0.29,0.48
2,0.53,0.48
3,0.4,0.48
4,0.68,0.48


In [None]:
from sklearn.preprocessing import PolynomialFeatures
# build features up to the second degree using the PolynomialFeatures class from scikit-learn's API.
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(gvh_lip)
res

array([[0.61  , 0.48  , 0.3721, 0.2928, 0.2304],
       [0.29  , 0.48  , 0.0841, 0.1392, 0.2304],
       [0.53  , 0.48  , 0.2809, 0.2544, 0.2304],
       ...,
       [0.6   , 0.48  , 0.36  , 0.288 , 0.2304],
       [0.61  , 0.48  , 0.3721, 0.2928, 0.2304],
       [0.74  , 0.48  , 0.5476, 0.3552, 0.2304]])

We have a total of five features including the new interaction
features.
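
To make the five columns concrete, here is a small sketch that rebuilds the same degree-2 expansion by hand with NumPy for the first three `(gvh, lip)` rows shown above. It is only meant to show where each column comes from; the variable names are illustrative.

```python
import numpy as np

# First three (gvh, lip) rows from the dataframe above
X = np.array([[0.61, 0.48],
              [0.29, 0.48],
              [0.53, 0.48]])
gvh, lip = X[:, 0], X[:, 1]

# Degree-2 expansion without a bias column: x1, x2, x1^2, x1*x2, x2^2
manual = np.column_stack([gvh, lip, gvh**2, gvh * lip, lip**2])
print(np.round(manual, 4))
```

The fourth column, `gvh * lip`, is the actual interaction term; the third and fifth are the squared terms that `interaction_only=False` keeps.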

We can see the degree of each feature in the matrix.

In [None]:
pd.DataFrame(pf.powers_, columns=['gvh_degree', 'lip_degree'])

Unnamed: 0,gvh_degree,lip_degree
0,1,0
1,0,1
2,2,0
3,1,1
4,0,2


Now that we know what each feature actually represents from the degrees depicted, we can assign a
name to each feature as follows to get the updated feature set.

In [None]:
intr_features = pd.DataFrame(res, columns=['gvh', 'lip', 'gvh^2', 'gvh x lip', 'lip^2'])
intr_features.head(5)  

Unnamed: 0,gvh,lip,gvh^2,gvh x lip,lip^2
0,0.61,0.48,0.3721,0.2928,0.2304
1,0.29,0.48,0.0841,0.1392,0.2304
2,0.53,0.48,0.2809,0.2544,0.2304
3,0.4,0.48,0.16,0.192,0.2304
4,0.68,0.48,0.4624,0.3264,0.2304


Transforming new data in the future (during predictions)

In [None]:
# take some sample new observations for the gvh and lip features and try to transform
# them using this same mechanism.
new_df = pd.DataFrame([[0.35, 0.49],[0.46, 0.38], [0.25, 0.48]], 
                      columns=['gvh', 'lip'])
new_df

Unnamed: 0,gvh,lip
0,0.35,0.49
1,0.46,0.38
2,0.25,0.48


In [None]:
# use the pf object that we created earlier and transform these input features to give us the
# interaction features
new_res = pf.transform(new_df)
new_intr_features = pd.DataFrame(new_res, 
                                 columns=['gvh', 'lip', 'gvh^2', 'gvh x lip', 'lip^2'])
new_intr_features

Unnamed: 0,gvh,lip,gvh^2,gvh x lip,lip^2
0,0.35,0.49,0.1225,0.1715,0.2401
1,0.46,0.38,0.2116,0.1748,0.1444
2,0.25,0.48,0.0625,0.12,0.2304


# Feature Engineering on Categorical Data

Any attribute or feature that is categorical in nature represents discrete values that belong to a specific
finite set of categories or classes. Category or class labels can be text or numeric in nature. Usually there are
two types of categorical variables—nominal and ordinal.
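
The difference between the two types can be sketched with pandas categoricals. The `site` values come from the Ecoli data; the `size` feature is a made-up example used only to illustrate an ordered category.

```python
import pandas as pd

# Nominal: categories have no inherent order
site = pd.Categorical(['im', 'cp', 'om', 'cp'])

# Ordinal: the order of the categories list defines the ranking
size = pd.Categorical(['medium', 'small', 'large', 'small'],
                      categories=['small', 'medium', 'large'],
                      ordered=True)

print(site.ordered, size.ordered)
print(size.codes.tolist())
print(size.min(), size.max())
```

Only the ordered categorical supports comparisons such as `min` and `max`, which is what makes the integer codes of an ordinal feature meaningful as ranks.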


In [None]:
# Import necessary dependencies and settings
import pandas as pd
import numpy as np

## Transforming Nominal Features

Nominal features or attributes are categorical variables that usually have a finite set of distinct discrete
values. Often these values are in string or text format and Machine Learning algorithms cannot understand
them directly. Hence usually you might need to transform these features into a more representative numeric
format.

In [None]:
ecoli_df.head(11)

Unnamed: 0,accession,mcg,gvh,lip,chg,aac,alm1,alm2,site
0,EMRA_ECOLI,0.06,0.61,0.48,0.5,0.49,0.92,0.37,im
1,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp
2,ATKC_ECOLI,0.85,0.53,0.48,0.5,0.53,0.52,0.35,imS
3,ACEA_ECOLI,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp
4,FADL_ECOLI,0.78,0.68,0.48,0.5,0.83,0.4,0.29,om
5,NLPA_ECOLI,0.75,0.55,1.0,1.0,0.4,0.47,0.3,imL
6,MULI_ECOLI,0.77,0.57,1.0,0.5,0.37,0.54,0.01,omL
7,ACEK_ECOLI,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp
8,ATKA_ECOLI,0.72,0.42,0.48,0.5,0.65,0.77,0.79,imU
9,AGP_ECOLI,0.74,0.49,0.48,0.5,0.42,0.54,0.36,pp


The dataset depicted in this dataframe shows us various attributes pertaining to E. coli proteins. The
`site` feature is a nominal categorical variable.

In [None]:
sites = np.unique(ecoli_df['site'])
sites

array(['cp', 'im', 'imL', 'imS', 'imU', 'om', 'omL', 'pp'], dtype=object)

This output tells us we have 8 distinct sites in the Ecoli dataset.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Let’s transform this feature now using a mapping scheme of 'site'
sle = LabelEncoder()
site_labels = sle.fit_transform(ecoli_df['site'])
site_mappings = {index: label for index, label in enumerate(sle.classes_)}
site_mappings

{0: 'cp', 1: 'im', 2: 'imL', 3: 'imS', 4: 'imU', 5: 'om', 6: 'omL', 7: 'pp'}

A mapping scheme has been generated where each site value is
mapped to a number with the help of the LabelEncoder object sle. The transformed labels are stored in the
site_labels variable.

In [None]:
ecoli_df['siteLabel'] = site_labels
ecoli_df.head(11)

Unnamed: 0,accession,mcg,gvh,lip,chg,aac,alm1,alm2,site,siteLabel
0,EMRA_ECOLI,0.06,0.61,0.48,0.5,0.49,0.92,0.37,im,1
1,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp,0
2,ATKC_ECOLI,0.85,0.53,0.48,0.5,0.53,0.52,0.35,imS,3
3,ACEA_ECOLI,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp,0
4,FADL_ECOLI,0.78,0.68,0.48,0.5,0.83,0.4,0.29,om,5
5,NLPA_ECOLI,0.75,0.55,1.0,1.0,0.4,0.47,0.3,imL,2
6,MULI_ECOLI,0.77,0.57,1.0,0.5,0.37,0.54,0.01,omL,6
7,ACEK_ECOLI,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp,0
8,ATKA_ECOLI,0.72,0.42,0.48,0.5,0.65,0.77,0.79,imU,4
9,AGP_ECOLI,0.74,0.49,0.48,0.5,0.42,0.54,0.36,pp,7


The siteLabel field shows the mapped numeric labels for each of the site labels and we can clearly
see that this adheres to the mappings that we generated earlier.
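
The mapping also works in reverse, which is useful when a model predicts integer codes and the original labels are needed again. A minimal sketch, fitting a separate `LabelEncoder` (named `le_demo` here for illustration) on the eight site labels listed above:

```python
from sklearn.preprocessing import LabelEncoder

# The eight site labels listed in the output above
site_values = ['cp', 'im', 'imL', 'imS', 'imU', 'om', 'omL', 'pp']

le_demo = LabelEncoder()
le_demo.fit(site_values)

# Forward direction: label -> integer code (classes are stored in sorted order)
codes = le_demo.transform(['im', 'pp', 'imS'])

# Reverse direction: integer code -> original label
labels = le_demo.inverse_transform(codes)

print(codes.tolist(), labels.tolist())
```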

## Transforming Ordinal Features

Ordinal features are similar to nominal features except that order matters and is an inherent property with
which we can interpret the values of these features. Like nominal features, even ordinal features might be
present in text form and you need to map and transform them into their numeric representation.

### Create Generation based on 'age'

In [None]:
age = np.array(diabetes_df['age']) 

diabetes_df['Generation'] = diabetes_df['age'].apply(lambda value: 'Gen Z' 
                                                          if value <= 25 else 'Millennials' 
                                                              if value <= 41 else 'Gen X'
                                                                  if value <= 57 else 'Boomers II'
                                                                     if value <= 67 else 'Boomers I'
                                                                        if value <= 76 else 'Post WWII'
                                                                            if value <= 94 else 'WWII'
                                                              )

diabetes_df[['age', 'Generation']].head(10)

Unnamed: 0,age,Generation
0,50,Gen X
1,31,Millennials
2,32,Millennials
3,21,Gen Z
4,33,Millennials
5,30,Millennials
6,26,Millennials
7,29,Millennials
8,53,Gen X
9,54,Gen X


In [None]:
np.unique(diabetes_df['Generation'])

array(['Boomers I', 'Boomers II', 'Gen X', 'Gen Z', 'Millennials',
       'Post WWII'], dtype=object)

From this output we can see that there are a total of six generations of people. This attribute is clearly ordinal, since the generations have a natural order among them.

However, there is no generic module or function to map and transform these features into numeric representations. Hence we need to hand-craft this using our own logic, which is depicted in the following code snippet.

In [None]:
gen_ord_map = {'Gen Z': 1, 'Millennials': 2, 'Gen X': 3, 
               'Boomers II': 4, 'Boomers I': 5, 'Post WWII': 6}

diabetes_df['GenerationLabel'] = diabetes_df['Generation'].map(gen_ord_map)
diabetes_df[['age', 'Generation', 'GenerationLabel']].iloc[4:10]

Unnamed: 0,age,Generation,GenerationLabel
4,33,Millennials,2
5,30,Millennials,2
6,26,Millennials,2
7,29,Millennials,2
8,53,Gen X,3
9,54,Gen X,3
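
As an alternative to the nested conditional and the hand-written dictionary, the same binning can be sketched with `pd.cut`. The bin edges below mirror the thresholds used in the lambda above, and the sample ages are taken from the table; this is one possible formulation under those assumptions, not a replacement for the exercise code.

```python
import numpy as np
import pandas as pd

# A few ages from the table above
ages = pd.Series([50, 31, 21, 53, 26])

# Bin edges mirror the thresholds in the lambda; intervals are (low, high]
edges = [-np.inf, 25, 41, 57, 67, 76, 94, np.inf]
names = ['Gen Z', 'Millennials', 'Gen X', 'Boomers II',
         'Boomers I', 'Post WWII', 'WWII']

generation = pd.cut(ages, bins=edges, labels=names)

# The result is an ordered categorical, so codes follow the label order
generation_label = generation.cat.codes + 1

print(generation.tolist())
print(generation_label.tolist())
```

Because the labels are listed from youngest to oldest, the codes reproduce the ranks in `gen_ord_map` without a separate mapping step.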


### Create BMI Class based on 'bmi'

In [None]:
bmi = np.array(diabetes_df['bmi']) 

# Every bmi value above 29.9 maps to 'Class II obesity'
diabetes_df['BMI'] = diabetes_df['bmi'].apply(lambda value: 'Underweight' 
                                                          if value <= 18.5 else 'Normal' 
                                                              if value <= 22.9 else 'Pre-obese'
                                                                  if value <= 24.9 else 'Class I obesity'
                                                                     if value <= 29.9 else 'Class II obesity'
                                                              )

diabetes_df[['bmi', 'BMI']].head(10)

Unnamed: 0,bmi,BMI
0,33.6,Class II obesity
1,26.6,Class I obesity
2,23.3,Pre-obese
3,28.1,Class I obesity
4,43.1,Class II obesity
5,25.6,Class I obesity
6,31.0,Class II obesity
7,35.3,Class II obesity
8,30.5,Class II obesity
9,0.0,Underweight


In [None]:
np.unique(diabetes_df['BMI'])

array(['Class I obesity', 'Class II obesity', 'Normal', 'Pre-obese',
       'Underweight'], dtype=object)

From this output we can see that there are a total of five BMI classes. This attribute is clearly ordinal, since the classes have a natural order among them.

However, there is no generic module or function to map and transform these features into numeric representations. Hence we need to hand-craft this using our own logic, which is depicted in the following code snippet.

In [None]:
bmi_ord_map = {'Underweight': 1, 'Normal': 2, 'Pre-obese': 3, 
               'Class I obesity': 4, 'Class II obesity': 5}

diabetes_df['BMILabel'] = diabetes_df['BMI'].map(bmi_ord_map)
diabetes_df[['bmi', 'BMI', 'BMILabel']].iloc[4:10]

Unnamed: 0,bmi,BMI,BMILabel
4,43.1,Class II obesity,5
5,25.6,Class I obesity,4
6,31.0,Class II obesity,5
7,35.3,Class II obesity,5
8,30.5,Class II obesity,5
9,0.0,Underweight,1


In this output, the BMILabel field shows the numeric rank assigned to each BMI class, and it
adheres to the mapping defined in bmi_ord_map.
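
One thing to watch with hand-written mappings: `Series.map` with a dictionary returns `NaN` for any label that is not a key. The sketch below uses a deliberately incomplete dictionary (the names `ord_map` and `classes` are illustrative) to show how such gaps can be detected.

```python
import pandas as pd

# Deliberately incomplete mapping to show the failure mode
ord_map = {'Underweight': 1, 'Normal': 2, 'Pre-obese': 3}
classes = pd.Series(['Normal', 'Pre-obese', 'Class I obesity'])

mapped = classes.map(ord_map)

# Labels missing from the dictionary come back as NaN
unmapped = classes[mapped.isna()].unique().tolist()
print(unmapped)
```

A check like this after each `map` call makes it harder for a misspelled or forgotten category to slip through silently.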

## Encoding Categorical Features

If we directly feed these transformed numeric
representations of categorical features into any algorithm, the model will essentially try to interpret these as
raw numeric features, and hence the notion of magnitude will be wrongly introduced in the system.

Hence models built using these features directly would
be sub-optimal or incorrect. There are several schemes and strategies where dummy features are
created for each unique value or label out of all the distinct categories in any feature. In the subsequent
sections, we will discuss some of these schemes including one hot encoding, dummy coding, effect coding,
and feature hashing schemes.

### One Hot Encoding Scheme
Given a numeric representation of a categorical feature with m labels, the one hot encoding
scheme encodes or transforms the feature into m binary features, each of which can only contain a value of 1 or 0. Each observation in the categorical feature is thus converted into a vector of size m with only one of the
values as 1 (indicating it as active).
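
The definition can be sketched directly with NumPy, independent of any encoder class. The labels below are made-up integer codes for a feature with m = 4 categories.

```python
import numpy as np

# Integer labels for a feature with m = 4 categories
labels = np.array([2, 0, 3, 2])
m = 4

# Row i of the identity matrix is the one-hot vector for label i
one_hot = np.eye(m, dtype=int)[labels]
print(one_hot)
```

Each row has length m and contains exactly one 1, at the position of its label.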

In [None]:
diabetes_df[['diabetes', 'Generation', 'BMI']].iloc[4:10]

Unnamed: 0,diabetes,Generation,BMI
4,1,Millennials,Class II obesity
5,0,Millennials,Class I obesity
6,1,Millennials,Class II obesity
7,0,Millennials,Class II obesity
8,1,Gen X,Class II obesity
9,1,Gen X,Underweight


In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# transform and map diabetes generations
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(diabetes_df['Generation'])
diabetes_df['Gen_Label'] = gen_labels

# transform and map diabetes bmi status
bmi_le = LabelEncoder()
bmi_labels = bmi_le.fit_transform(diabetes_df['BMI'])
diabetes_df['BMI_Label'] = bmi_labels

diabetes_df_sub = diabetes_df[['diabetes', 'Generation', 'Gen_Label', 'BMI', 'BMI_Label']]
diabetes_df_sub.iloc[4:10]

Unnamed: 0,diabetes,Generation,Gen_Label,BMI,BMI_Label
4,1,Millennials,4,Class II obesity,1
5,0,Millennials,4,Class I obesity,0
6,1,Millennials,4,Class II obesity,1
7,0,Millennials,4,Class II obesity,1
8,1,Gen X,2,Class II obesity,1
9,1,Gen X,2,Underweight,4


In [None]:
# encode generation labels using one-hot encoding scheme
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(diabetes_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)

# encode bmi labels using one-hot encoding scheme
bmi_ohe = OneHotEncoder()
bmi_feature_arr = bmi_ohe.fit_transform(diabetes_df[['BMI_Label']]).toarray()
bmi_feature_labels = ['BMI_'+str(cls_label) for cls_label in bmi_le.classes_]
bmi_features = pd.DataFrame(bmi_feature_arr, columns=bmi_feature_labels)

In [None]:
# Let’s now concatenate these feature frames and see the final result.
diabetes_df_ohe = pd.concat([diabetes_df_sub, gen_features, bmi_features], axis=1)
columns = sum([['diabetes', 'Generation', 'Gen_Label'],gen_feature_labels,
              ['BMI', 'BMI_Label'],bmi_feature_labels], [])
diabetes_df_ohe[columns].iloc[4:10]

Unnamed: 0,diabetes,Generation,Gen_Label,Boomers I,Boomers II,Gen X,Gen Z,Millennials,Post WWII,BMI,BMI_Label,BMI_Class I obesity,BMI_Class II obesity,BMI_Normal,BMI_Pre-obese,BMI_Underweight
4,1,Millennials,4,0.0,0.0,0.0,0.0,1.0,0.0,Class II obesity,1,0.0,1.0,0.0,0.0,0.0
5,0,Millennials,4,0.0,0.0,0.0,0.0,1.0,0.0,Class I obesity,0,1.0,0.0,0.0,0.0,0.0
6,1,Millennials,4,0.0,0.0,0.0,0.0,1.0,0.0,Class II obesity,1,0.0,1.0,0.0,0.0,0.0
7,0,Millennials,4,0.0,0.0,0.0,0.0,1.0,0.0,Class II obesity,1,0.0,1.0,0.0,0.0,0.0
8,1,Gen X,2,0.0,0.0,1.0,0.0,0.0,0.0,Class II obesity,1,0.0,1.0,0.0,0.0,0.0
9,1,Gen X,2,0.0,0.0,1.0,0.0,0.0,0.0,Underweight,4,0.0,0.0,0.0,0.0,1.0


We can clearly see the new one hot encoded features
for Gen_Label and BMI_Label. Each of these one hot encoded features is binary in nature and if they
contain the value 1, it means that feature is active for the corresponding observation.

In [None]:
# The following code shows us two dummy data points pertaining to new patients.
new_diabetes_df = pd.DataFrame([['1', 'Gen X', 'Pre-obese'], 
                           ['0', 'Boomers II', 'Class I obesity']],
                           columns=['diabetes', 'Generation', 'BMI'])
new_diabetes_df

Unnamed: 0,diabetes,Generation,BMI
0,1,Gen X,Pre-obese
1,0,Boomers II,Class I obesity


In [None]:
# converting the text categories into numeric representations using our previously built LabelEncoder objects
new_gen_labels = gen_le.transform(new_diabetes_df['Generation'])
new_diabetes_df['Gen_Label'] = new_gen_labels

new_bmi_labels = bmi_le.transform(new_diabetes_df['BMI'])
new_diabetes_df['BMI_Label'] = new_bmi_labels

new_diabetes_df[['diabetes', 'Generation', 'Gen_Label', 'BMI', 'BMI_Label']]

Unnamed: 0,diabetes,Generation,Gen_Label,BMI,BMI_Label
0,1,Gen X,2,Pre-obese,3
1,0,Boomers II,1,Class I obesity,0


In [None]:
# use our previously built OneHotEncoder objects and perform one hot encoding on these new data observations
new_gen_feature_arr = gen_ohe.transform(new_diabetes_df[['Gen_Label']]).toarray()
new_gen_features = pd.DataFrame(new_gen_feature_arr, columns=gen_feature_labels)

new_bmi_feature_arr = bmi_ohe.transform(new_diabetes_df[['BMI_Label']]).toarray()
new_bmi_features = pd.DataFrame(new_bmi_feature_arr, columns=bmi_feature_labels)

new_diabetes_ohe = pd.concat([new_diabetes_df, new_gen_features, new_bmi_features], axis=1)
columns = sum([['diabetes', 'Generation', 'Gen_Label'], gen_feature_labels,
               ['BMI', 'BMI_Label'], bmi_feature_labels], [])
new_diabetes_ohe[columns]

Unnamed: 0,diabetes,Generation,Gen_Label,Boomers I,Boomers II,Gen X,Gen Z,Millennials,Post WWII,BMI,BMI_Label,BMI_Class I obesity,BMI_Class II obesity,BMI_Normal,BMI_Pre-obese,BMI_Underweight
0,1,Gen X,2,0.0,0.0,1.0,0.0,0.0,0.0,Pre-obese,3,0.0,0.0,0.0,1.0,0.0
1,0,Boomers II,1,0.0,1.0,0.0,0.0,0.0,0.0,Class I obesity,0,1.0,0.0,0.0,0.0,0.0
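
New data can contain a category the encoder never saw during fitting. As a hedged sketch of one way to handle that, the example below fits `OneHotEncoder` directly on string values with `handle_unknown='ignore'`; the toy frames `train` and `new` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Generation': ['Gen X', 'Millennials', 'Gen Z', 'Gen X']})
new = pd.DataFrame({'Generation': ['Millennials', 'WWII']})  # 'WWII' is unseen

# handle_unknown='ignore' encodes unseen categories as an all-zero row
ohe_demo = OneHotEncoder(handle_unknown='ignore')
ohe_demo.fit(train[['Generation']])

encoded = ohe_demo.transform(new[['Generation']]).toarray()
print(ohe_demo.categories_[0].tolist())
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error for the unseen label instead.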


In [None]:
# Pandas provides the get_dummies() function that can help us easily perform one hot encoding
gen_onehot_features = pd.get_dummies(diabetes_df['Generation'])
pd.concat([diabetes_df[['diabetes', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]

Unnamed: 0,diabetes,Generation,Boomers I,Boomers II,Gen X,Gen Z,Millennials,Post WWII
4,1,Millennials,0,0,0,0,1,0
5,0,Millennials,0,0,0,0,1,0
6,1,Millennials,0,0,0,0,1,0
7,0,Millennials,0,0,0,0,1,0
8,1,Gen X,0,0,1,0,0,0
9,1,Gen X,0,0,1,0,0,0
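
Two practical caveats with `pd.get_dummies`, sketched on toy data: recent pandas versions return boolean columns unless a `dtype` is given, and the function only creates columns for categories present in the data it receives, so new data usually has to be realigned to the training columns. The series names below are illustrative.

```python
import pandas as pd

train_gen = pd.Series(['Millennials', 'Gen X', 'Gen Z'], name='Generation')
new_gen = pd.Series(['Gen X', 'Gen X'], name='Generation')

# dtype=int forces 0/1 columns; recent pandas versions default to booleans
train_dummies = pd.get_dummies(train_gen, dtype=int)

# get_dummies only sees the categories present in the data it is given,
# so the new data is realigned to the training columns
new_dummies = (pd.get_dummies(new_gen, dtype=int)
               .reindex(columns=train_dummies.columns, fill_value=0))

print(new_dummies)
```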


### Dummy Coding Scheme
The dummy coding scheme is similar to the one hot encoding scheme, except that when the dummy coding
scheme is applied to a categorical feature with m distinct labels, we get m-1 binary features. Thus each
value of the categorical variable gets converted into a vector of size m-1. The extra feature is completely
disregarded, so if the category values range over {0, 1, ..., m-1}, either the 0th or the (m-1)th category is usually
represented by a vector of all zeros (0).

In [None]:
# Create dummy coding scheme on diabetes Generation by dropping the first level binary encoded feature (Boomers I).
gen_dummy_features = pd.get_dummies(diabetes_df['Generation'], drop_first=True)
pd.concat([diabetes_df[['diabetes', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]

Unnamed: 0,diabetes,Generation,Boomers II,Gen X,Gen Z,Millennials,Post WWII
4,1,Millennials,0,0,0,1,0
5,0,Millennials,0,0,0,1,0
6,1,Millennials,0,0,0,1,0
7,0,Millennials,0,0,0,1,0
8,1,Gen X,0,1,0,0,0
9,1,Gen X,0,1,0,0,0


In [None]:
# choose to drop the last level binary encoded feature (Post WWII)
gen_onehot_features = pd.get_dummies(diabetes_df['Generation'])
gen_dummy_features = gen_onehot_features.iloc[:,:-1]
pd.concat([diabetes_df[['diabetes', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]

Unnamed: 0,diabetes,Generation,Boomers I,Boomers II,Gen X,Gen Z,Millennials
4,1,Millennials,0,0,0,0,1
5,0,Millennials,0,0,0,0,1
6,1,Millennials,0,0,0,0,1
7,0,Millennials,0,0,0,0,1
8,1,Gen X,0,0,1,0,0
9,1,Gen X,0,0,1,0,0
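
The "vector of all zeros" for the dropped category can be checked on a toy series. In this sketch `'Boomers I'` sorts first and therefore becomes the baseline when `drop_first=True`; `dtype=int` is passed only to get 0/1 values regardless of pandas version.

```python
import pandas as pd

gen = pd.Series(['Boomers I', 'Gen X', 'Millennials', 'Boomers I'])

# m = 3 categories -> m - 1 = 2 columns; 'Boomers I' is the dropped baseline
dummies = pd.get_dummies(gen, drop_first=True, dtype=int)
print(dummies)

# Rows belonging to the baseline category are all zeros
baseline_rows = dummies[gen == 'Boomers I']
print(baseline_rows)
```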
