# Chapter 4 Lab 3

## Goal

The goal of this lab is to demonstrate how you can use Python to perform feature selection - specifically forward and backward selection.

scikit-learn has several tools that come close to R's stepwise model selection, but none of them reproduces its output exactly. Instead, we will install and import a library called 'mlxtend'. mlxtend has a feature_selection.SequentialFeatureSelector class (named identically to the one in sklearn) that works more like R's stepwise methodology.

We do have to do one or two things manually as well. See the comments below as we step through this.

First, we import what we need and then read in the normalized numeric data from Lab 1 of this chapter.

In [296]:
#! pip install mlxtend

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, make_scorer
from math import log
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import functools
import operator
# from sklearn.preprocessing import StandardScaler

In [3]:
dota_df1 = pd.read_csv("Dota_normalized.csv")
dota_df1.head()

Unnamed: 0,GamesPlayed,GamesWon,GamesLeft,Ditches,Points,Kills,KillsPerMin,Deaths,Assists,CreepsKilled,CreepsDenied,NeutralsKilled,TowersDestroyed,RaxsDestroyed,TotalTime
0,0.141363,0.14059,0.15,0.333333,0.35024,0.095653,0.285714,0.111196,0.122158,0.104007,0.069642,0.080559,0.136486,0.13234,0.259791
1,0.020602,0.022109,0.0,0.0,0.262429,0.015711,0.309524,0.026489,0.021663,0.012018,0.011797,0.010606,0.018839,0.020158,0.039233
2,0.000634,0.0,0.0,0.0,0.269743,0.000463,0.261905,0.001648,0.000523,0.000475,0.000228,0.000163,0.0,0.0,0.001125
3,0.031379,0.033447,0.125,0.055556,0.434886,0.045447,0.619048,0.026372,0.027159,0.037422,0.042929,0.03303,0.05075,0.031551,0.057959
4,0.0,0.000567,0.0,0.0,0.2842,0.000379,0.666667,0.000118,0.000291,7.8e-05,0.000182,7e-05,0.0,0.0,0.000139


Since we'll use linear regression, we break the data up into feature (X) and target (y). In this case, we want to see how the variables relate to 'Kills', so 'Kills' is our target (y).

In [4]:
y = dota_df1[' Kills']
X = dota_df1.drop([' Kills'], axis=1)
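
Note that the column labels in this file carry a leading space (hence `' Kills'`), which is easy to mistype. A minimal sketch of one way to normalize such labels, using a small made-up frame (`demo_df` is a hypothetical stand-in, not the lab data):

```python
import pandas as pd

# Hypothetical stand-in for the lab data: labels with a leading space.
demo_df = pd.DataFrame({" Kills": [0.1, 0.2], " Deaths": [0.3, 0.4]})

# Strip surrounding whitespace from every column label.
demo_df.columns = demo_df.columns.str.strip()

y_demo = demo_df["Kills"]
X_demo = demo_df.drop(columns=["Kills"])
print(list(X_demo.columns))
```

The rest of this lab keeps the original labels, so the leading space still appears in the outputs below.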

In [5]:
feature_names = np.array(X.columns)

## Linear Regression and AIC

Quick detour: As far as I know, scikit-learn has no pre-packaged AIC scorer for LinearRegression. (statsmodels does report an AIC for its own OLS fits, but we need something we can hand to the feature selector.)

As a result, we will define our own function to calculate AIC and then use 'make_scorer' to turn it into a scoring method that can be passed to the selector. Since the R code in the lab suggests that all 14 features are important to the regression and gives them an AIC score, let's regress on all 14 and see if our custom AIC formula agrees!

Step 1: instantiate a linear regression object and fit it. Later, we can use it to predict what we think 'y' would be and compare the two to help calculate our AIC.

In [6]:
lm = LinearRegression()
lm.fit(X,y)

Step 2: Let's define AIC in a function. We have to be careful because the 'make_scorer' factory limits the way we can pass variables to it: the function must take the true values first and the predicted values second. The function below follows that signature.

In [7]:
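def aic_min(y_true, y_hat):
    # n_param is read from the global `lm` fitted above on all 14 features
    # (14 coefficients + intercept = 15), not from the model being scored.
    n_param = len(lm.coef_) + 1
    mse = mean_squared_error(y_true, y_hat)
    # AIC up to an additive constant: n * ln(MSE) + 2 * k
    aic = len(y_true) * log(mse) + 2 * n_param
    return aic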
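
Because `aic_min` takes its parameter count from the global `lm`, the penalty term is the same for every model it scores. One possible alternative, sketched here on synthetic data, is a scorer written directly with scikit-learn's `scorer(estimator, X, y)` signature that counts the columns it actually receives. The name `neg_aic_scorer` and the demo data are illustrative assumptions, not part of the lab.

```python
import numpy as np
from math import log
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def neg_aic_scorer(estimator, X, y):
    """Return -AIC, so that a larger score means a better model."""
    X = np.asarray(X)
    n_param = X.shape[1] + 1  # one coefficient per column + intercept
    mse = mean_squared_error(y, estimator.predict(X))
    return -(len(y) * log(mse) + 2 * n_param)


# Synthetic data: y depends on the first column only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

small = LinearRegression().fit(X_demo[:, :1], y_demo)
full = LinearRegression().fit(X_demo, y_demo)

print(neg_aic_scorer(small, X_demo[:, :1], y_demo))  # penalty uses 2 parameters
print(neg_aic_scorer(full, X_demo, y_demo))          # penalty uses 4 parameters
```

Whether a selector accepts a plain callable like this for its `scoring` argument should be checked against the documentation of the installed version. The lab itself continues with `aic_min` and `make_scorer`.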

Step 3: We use the 'make_scorer' factory to convert our function to a scorer.

In [8]:
score = make_scorer(aic_min, greater_is_better=False)

Step 4: Let's compare!

Calling our aic_min function directly with the original y values and those predicted by our model gives us an AIC score that matches the R code output! So far, so good!

In [9]:
aic_score = aic_min(y, lm.predict(X))
print(f"AIC: {aic_score:.3f}")

AIC: -8794.529


But! When we use the scorer that was produced by 'make_scorer', the magnitude still agrees, but now the sign is positive! Potentially confusing!

The reason is that scikit-learn scorers always follow a "greater is better" convention. Setting 'greater_is_better=False' tells 'make_scorer' that lower values of our function are better, so the scorer negates the result: our AIC of -8794.529 comes back as 8794.529. This is a sign flip, not an absolute value.

In [10]:
aic_custom_score = score(lm, X, y)
print(f"AIC: {aic_custom_score:.3f}")

AIC: 8794.529
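
To see the sign behaviour of 'make_scorer' in isolation, here is a small sketch on synthetic data with a metric that is always positive (mean squared error). The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

# Small synthetic regression problem.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=50)
model = LinearRegression().fit(X_demo, y_demo)

# The raw metric, called directly.
raw = mean_squared_error(y_demo, model.predict(X_demo))

# The same metric wrapped as a "lower is better" scorer.
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
flipped = mse_scorer(model, X_demo, y_demo)

print(raw, flipped)
```

The scorer returns the negative of the raw value, so a positive MSE comes back negative, just as our negative AIC came back positive.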


## Sequential Feature Selection

mlxtend provides us with backward and forward selection. We will have to massage the output slightly to approximate the R stepwise output. We can however output the most helpful parts!

Let's break down the parameters:

1. 'lm' is the model we're using, the same one we fitted above.
2. 'k_features=(1,14)' sets how many features to select; the default is 1. Setting a range is like telling it "I'll take any number of features from 1 to 14 in total!"
3. 'forward=False' is somewhat self-explanatory: it means we want backward selection. Forward is the default and does not need to be specified.
4. 'scoring=score' is where we declare that we want to use our custom AIC scorer from up above.
5. 'cv=0' stops the selector from using cross-validation. sklearn's selector, by contrast, does not let you turn cross-validation off, which is a weakness for our purposes.
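
To make the idea of backward selection without cross-validation concrete, here is a minimal hand-rolled sketch. It is not mlxtend's implementation: the function `backward_path`, the use of training MSE as the criterion, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def backward_path(X, y):
    """Greedy backward elimination scored by training MSE (no CV)."""
    remaining = list(range(X.shape[1]))
    path = {}
    while remaining:
        model = LinearRegression().fit(X[:, remaining], y)
        mse = mean_squared_error(y, model.predict(X[:, remaining]))
        path[len(remaining)] = (tuple(remaining), mse)
        if len(remaining) == 1:
            break
        # Try dropping each feature; keep the drop that hurts MSE least.
        trials = []
        for j in remaining:
            cols = [c for c in remaining if c != j]
            fit = LinearRegression().fit(X[:, cols], y)
            trials.append((mean_squared_error(y, fit.predict(X[:, cols])), j))
        remaining.remove(min(trials)[1])
    return path


rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 4))
# Only columns 0 and 2 carry signal in this synthetic target.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=300)

path = backward_path(X_demo, y_demo)
for k, (cols, mse) in path.items():
    print(k, cols, round(mse, 4))
```

Each pass refits the model once per candidate removal, which is why the full search gets expensive as the number of features grows.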


### Backward Feature Selection

Instantiate our selector - backwards will be first.

In [11]:
selector_b = SFS(lm, k_features=(1,14), forward=False, scoring=score, cv=0)
selector_b.fit(X,y)

We can now access some attributes and put them into a simple dataframe.

This shows us the AIC connected to each set of features.

In [12]:
pd.set_option('display.max_colwidth', None)

back_df = pd.DataFrame.from_dict(selector_b.subsets_, orient='index', columns=['avg_score', 'feature_names']).round(1)
back_df.rename(columns={'avg_score':'AIC', 'feature_names':'Features'}, inplace=True)
# The subset size lives in the index, so label it there.
back_df.index.name = '# Features'
back_df

# Features,AIC,Features
14,8794.5,"( GamesPlayed, GamesWon, GamesLeft, Ditches, Points, KillsPerMin, Deaths, Assists, CreepsKilled, CreepsDenied, NeutralsKilled, TowersDestroyed, RaxsDestroyed, TotalTime)"
13,8789.3,"( GamesPlayed, GamesWon, GamesLeft, Ditches, KillsPerMin, Deaths, Assists, CreepsKilled, CreepsDenied, NeutralsKilled, TowersDestroyed, RaxsDestroyed, TotalTime)"
12,8781.2,"( GamesPlayed, GamesWon, GamesLeft, Ditches, KillsPerMin, Deaths, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed, TotalTime)"
11,8764.6,"( GamesPlayed, GamesWon, GamesLeft, Ditches, Deaths, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed, TotalTime)"
10,8747.3,"( GamesPlayed, GamesWon, GamesLeft, Ditches, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed, TotalTime)"
9,8730.1,"( GamesPlayed, GamesWon, GamesLeft, Ditches, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed)"
8,8711.0,"( GamesPlayed, GamesWon, GamesLeft, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed)"
7,8693.9,"( GamesPlayed, GamesWon, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed)"
6,8663.2,"( GamesPlayed, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed)"
5,8625.7,"( GamesPlayed, Assists, CreepsKilled, CreepsDenied, TowersDestroyed)"


Remember! Our custom scorer reports the AIC with its sign flipped, so 8794.5 is our best score (an AIC of -8794.5). We can verify the score of the model that our selector picked and then the features attached to that model.

In [353]:
selector_b.k_score_

8794.528865544704

In [405]:
selector_b.k_feature_names_

(' GamesPlayed',
 ' GamesWon',
 ' GamesLeft',
 ' Ditches',
 ' Points',
 ' KillsPerMin',
 ' Deaths',
 ' Assists',
 ' CreepsKilled',
 ' CreepsDenied',
 ' NeutralsKilled',
 ' TowersDestroyed',
 ' RaxsDestroyed',
 ' TotalTime')
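
The selector's pick can also be cross-checked by hand: the chosen model should be the one with the largest stored score. A minimal sketch, using three rows copied from the backward-selection table above (the dictionary `subsets_excerpt` is a trimmed stand-in for `selector_b.subsets_`, keeping only the scores):

```python
# Excerpt of the backward-selection table: subset size -> reported score.
subsets_excerpt = {
    14: {"avg_score": 8794.5},
    13: {"avg_score": 8789.3},
    12: {"avg_score": 8781.2},
}

# The largest score wins, because the scorer reports the sign-flipped AIC.
best_k = max(subsets_excerpt, key=lambda k: subsets_excerpt[k]["avg_score"])
best_aic = -subsets_excerpt[best_k]["avg_score"]
print(best_k, best_aic)
```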

### Forward Feature Selection

Now, we can follow the same methodology for forward selection. (Remember, you really don't need to include "forward=True" in this case.)

We can quickly run the selector, dump the output into a dataframe and check the value of the AIC and its set of features.

In [13]:
selector_f = SFS(lm, k_features=(1,14), forward=True, scoring=score, cv=0)
selector_f.fit(X,y)

In [14]:
forward_df = pd.DataFrame.from_dict(selector_f.subsets_, orient='index', columns=['avg_score', 'feature_names']).round(1)
forward_df.rename(columns={'avg_score':'AIC', 'feature_names':'Features'}, inplace=True)
# The subset size lives in the index, so label it there.
forward_df.index.name = '# Features'

forward_df


# Features,AIC,Features
1,7948.4,"( TowersDestroyed,)"
2,8358.9,"( GamesWon, TowersDestroyed)"
3,8490.7,"( GamesWon, CreepsKilled, TowersDestroyed)"
4,8542.6,"( GamesWon, CreepsKilled, CreepsDenied, TowersDestroyed)"
5,8602.7,"( GamesPlayed, GamesWon, CreepsKilled, CreepsDenied, TowersDestroyed)"
6,8658.8,"( GamesPlayed, GamesWon, Assists, CreepsKilled, CreepsDenied, TowersDestroyed)"
7,8693.9,"( GamesPlayed, GamesWon, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed)"
8,8716.7,"( GamesPlayed, GamesWon, KillsPerMin, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed)"
9,8729.8,"( GamesPlayed, GamesWon, GamesLeft, KillsPerMin, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed)"
10,8748.0,"( GamesPlayed, GamesWon, GamesLeft, Ditches, KillsPerMin, Assists, CreepsKilled, CreepsDenied, TowersDestroyed, RaxsDestroyed)"


In [15]:
selector_f.k_score_

8794.528865544704

In [16]:
selector_f.k_feature_names_

(' GamesPlayed',
 ' GamesWon',
 ' GamesLeft',
 ' Ditches',
 ' Points',
 ' KillsPerMin',
 ' Deaths',
 ' Assists',
 ' CreepsKilled',
 ' CreepsDenied',
 ' NeutralsKilled',
 ' TowersDestroyed',
 ' RaxsDestroyed',
 ' TotalTime')
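
Although both selectors end up at the full 14-feature model, their intermediate subsets are not always the same. A small sketch comparing two rows hand-copied from the backward and forward tables above (leading spaces dropped from the names; the dictionaries are stand-ins, not selector attributes):

```python
# Feature sets at sizes 5 and 7, copied from the two tables above.
backward = {
    5: {"GamesPlayed", "Assists", "CreepsKilled", "CreepsDenied", "TowersDestroyed"},
    7: {"GamesPlayed", "GamesWon", "Assists", "CreepsKilled", "CreepsDenied",
        "TowersDestroyed", "RaxsDestroyed"},
}
forward = {
    5: {"GamesPlayed", "GamesWon", "CreepsKilled", "CreepsDenied", "TowersDestroyed"},
    7: {"GamesPlayed", "GamesWon", "Assists", "CreepsKilled", "CreepsDenied",
        "TowersDestroyed", "RaxsDestroyed"},
}

for k in sorted(backward):
    # Symmetric difference: features chosen by one path but not the other.
    diff = backward[k] ^ forward[k]
    print(k, "same" if not diff else sorted(diff))
```

At size 7 the two paths agree, while at size 5 they differ in one feature each, which is typical of greedy searches that start from opposite ends.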