# Task 7: AutoFeatureSelector Tool
## This task tests your understanding of the feature selection methods covered in the lecture and your ability to apply them to a real-world dataset: selecting the best features and building an automated feature selection tool for your toolkit

### Use your knowledge of different feature selection methods to build an Automatic Feature Selection tool
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: FIFA 19 Player Skills
#### Attributes: FIFA 2019 player attributes like Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

In [43]:
player_df = pd.read_csv("data/fifa19.csv")

In [4]:
player_df.head()

| | Unnamed: 0 | ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | ... | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 158023 | L. Messi | 31 | https://cdn.sofifa.org/players/4/19/158023.png | Argentina | https://cdn.sofifa.org/flags/52.png | 94 | 94 | FC Barcelona | ... | 96.0 | 33.0 | 28.0 | 26.0 | 6.0 | 11.0 | 15.0 | 14.0 | 8.0 | €226.5M |
| 1 | 1 | 20801 | Cristiano Ronaldo | 33 | https://cdn.sofifa.org/players/4/19/20801.png | Portugal | https://cdn.sofifa.org/flags/38.png | 94 | 94 | Juventus | ... | 95.0 | 28.0 | 31.0 | 23.0 | 7.0 | 11.0 | 15.0 | 14.0 | 11.0 | €127.1M |
| 2 | 2 | 190871 | Neymar Jr | 26 | https://cdn.sofifa.org/players/4/19/190871.png | Brazil | https://cdn.sofifa.org/flags/54.png | 92 | 93 | Paris Saint-Germain | ... | 94.0 | 27.0 | 24.0 | 33.0 | 9.0 | 9.0 | 15.0 | 15.0 | 11.0 | €228.1M |
| 3 | 3 | 193080 | De Gea | 27 | https://cdn.sofifa.org/players/4/19/193080.png | Spain | https://cdn.sofifa.org/flags/45.png | 91 | 93 | Manchester United | ... | 68.0 | 15.0 | 21.0 | 13.0 | 90.0 | 85.0 | 87.0 | 88.0 | 94.0 | €138.6M |
| 4 | 4 | 192985 | K. De Bruyne | 27 | https://cdn.sofifa.org/players/4/19/192985.png | Belgium | https://cdn.sofifa.org/flags/7.png | 91 | 92 | Manchester City | ... | 88.0 | 68.0 | 58.0 | 51.0 | 15.0 | 13.0 | 5.0 | 10.0 | 13.0 | €196.4M |


In [5]:
numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

In [6]:
player_df = player_df[numcols+catcols]

In [7]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.dropna()

In [8]:
traindf = pd.DataFrame(traindf,columns=features)

In [9]:
y = traindf['Overall']>=87
X = traindf.copy()
del X['Overall']

y.reset_index(drop=True, inplace=True)

In [10]:
X.head()

| | Crossing | Finishing | ShortPassing | Dribbling | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Stamina | ... | Nationality_Uganda | Nationality_Ukraine | Nationality_United Arab Emirates | Nationality_United States | Nationality_Uruguay | Nationality_Uzbekistan | Nationality_Venezuela | Nationality_Wales | Nationality_Zambia | Nationality_Zimbabwe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84.0 | 95.0 | 90.0 | 97.0 | 87.0 | 96.0 | 91.0 | 86.0 | 91.0 | 72.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 84.0 | 94.0 | 81.0 | 88.0 | 77.0 | 94.0 | 89.0 | 91.0 | 87.0 | 88.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 79.0 | 87.0 | 84.0 | 96.0 | 78.0 | 95.0 | 94.0 | 90.0 | 96.0 | 81.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 17.0 | 13.0 | 50.0 | 18.0 | 51.0 | 42.0 | 57.0 | 58.0 | 60.0 | 43.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 93.0 | 82.0 | 92.0 | 86.0 | 91.0 | 91.0 | 78.0 | 76.0 | 79.0 | 90.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [11]:
y

0         True
1         True
2         True
3         True
4         True
         ...  
18154    False
18155    False
18156    False
18157    False
18158    False
Name: Overall, Length: 18159, dtype: bool

In [12]:
len(X.columns)

223

### Set a fixed number of features to select

In [13]:
feature_name = list(X.columns)
# maximum number of features to select
num_feats=30

In [14]:
def get_corr_support(selected_features):
    # Boolean mask over feature_name: True where the feature was selected
    selected = set(selected_features)
    return [feature in selected for feature in feature_name]


In [15]:
from sklearn.preprocessing import MinMaxScaler

def get_scaled_data(X):
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)
    return pd.DataFrame(X_scaled, columns=X.columns)

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [16]:
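def cor_selector(X, y, num_feats):
    # Pearson correlation of every (scaled) feature with the boolean target "Overall"
    pearson_cor = pd.concat([get_scaled_data(X), y], axis=1).corr()["Overall"].to_dict()
    del pearson_cor["Overall"]

    # Ranks by signed correlation, so only positively correlated features are favoured;
    # sort on abs(x[1]) instead to rank by correlation strength regardless of sign
    sorted_features_with_values = sorted(pearson_cor.items(), key=lambda x: x[1], reverse=True)[:num_feats]
    selected_features = [name for name, _ in sorted_features_with_values]
    return get_corr_support(selected_features), selected_features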

In [17]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

30 selected features


### List the selected features from Pearson Correlation

In [18]:
cor_feature

['Reactions',
 'Body Type_C. Ronaldo',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_Courtois',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Position_LF',
 'Position_RF',
 'ShortPassing',
 'Volleys',
 'LongPassing',
 'FKAccuracy',
 'BallControl',
 'Finishing',
 'LongShots',
 'ShotPower',
 'Dribbling',
 'Nationality_Belgium',
 'Crossing',
 'Agility',
 'Weak Foot',
 'Stamina',
 'Nationality_Slovenia',
 'Nationality_Gabon',
 'Strength',
 'SprintSpeed',
 'Acceleration',
 'Nationality_Uruguay',
 'Position_LAM',
 'Nationality_Costa Rica']
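
Because `cor_selector` sorts on the signed coefficient, a feature with a strong negative correlation never makes the cut. The following is a minimal sketch of a variant that ranks by absolute correlation instead. `abs_cor_selector` and the toy data are hypothetical names introduced for illustration; they are not part of the assignment code.

```python
import numpy as np
import pandas as pd

def abs_cor_selector(X, y, num_feats):
    """Rank features by |Pearson r| with the target (hypothetical variant)."""
    y_num = np.asarray(y, dtype=float)
    scores = {}
    for col in X.columns:
        r = np.corrcoef(X[col].to_numpy(dtype=float), y_num)[0, 1]
        # A constant column has undefined correlation; treat it as zero
        scores[col] = 0.0 if np.isnan(r) else abs(r)
    selected = sorted(scores, key=scores.get, reverse=True)[:num_feats]
    support = [col in selected for col in X.columns]
    return support, selected

# Toy data: "pos" rises with the target, "neg" falls with it, "noise" is weak
toy_X = pd.DataFrame({
    "pos": [1, 2, 3, 4, 5, 6],
    "neg": [6, 5, 4, 3, 2, 1],
    "noise": [3, 1, 3, 1, 3, 1],
})
toy_y = pd.Series([False, False, False, True, True, True])
toy_support, toy_selected = abs_cor_selector(toy_X, toy_y, num_feats=2)
print(toy_selected)
```

With the signed ranking, `neg` would be placed last; ranking by magnitude treats it as equally informative as `pos`.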

## Filter Feature Selection - Chi-Square

In [19]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

### Chi-Squared Selector function

In [20]:
def chi_squared_selector(X, y, num_feats):
    # chi2 requires non-negative inputs, so score the min-max scaled features
    chi2_features = SelectKBest(chi2, k=num_feats)
    chi2_features.fit(get_scaled_data(X), y)
    selected_features = list(chi2_features.get_feature_names_out())
    return get_corr_support(selected_features), selected_features

In [21]:
chi_support, chi_feature = chi_squared_selector(X, y, num_feats)
print(str(len(chi_feature)), 'selected features')

30 selected features


### List the selected features from Chi-Square 

In [22]:
chi_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'LongShots',
 'Position_CM',
 'Position_LAM',
 'Position_LF',
 'Position_LW',
 'Position_RB',
 'Position_RF',
 'Body Type_C. Ronaldo',
 'Body Type_Courtois',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_Egypt',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Gabon',
 'Nationality_Slovakia',
 'Nationality_Slovenia',
 'Nationality_Spain',
 'Nationality_Uruguay']
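
scikit-learn's `chi2` scorer expects non-negative feature values, which is why the selector fits on min-max scaled data. Here is a small sketch of that step on toy data; the column names `signal` and `flat` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# "signal" contains negative values, so it must be rescaled before chi2
toy_X = pd.DataFrame({
    "signal": [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0],
    "flat": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
})
toy_y = np.array([0, 0, 0, 1, 1, 1])

# Map every column into [0, 1], keeping the column names
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy_X), columns=toy_X.columns)

# Keep the single feature with the highest chi-square statistic
selector = SelectKBest(chi2, k=1).fit(scaled, toy_y)
chi_mask = selector.get_support().tolist()
chi_selected = scaled.columns[selector.get_support()].tolist()
print(chi_selected)
```

The boolean mask from `get_support()` plays the same role as the support list returned by the notebook's selector functions.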

## Wrapper Feature Selection - Recursive Feature Elimination

In [23]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

### RFE Selector function

In [24]:
def rfe_selector(X, y, num_feats):
    # Recursively drop up to 10 of the weakest features (by coefficient magnitude) per round
    estimator = LogisticRegression()
    selector = RFE(estimator, n_features_to_select=num_feats, step=10, verbose=True)
    selector = selector.fit(get_scaled_data(X), y)
    return selector.support_, list(selector.get_feature_names_out())

In [25]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')

Fitting estimator with 223 features.
Fitting estimator with 213 features.
Fitting estimator with 203 features.
Fitting estimator with 193 features.
Fitting estimator with 183 features.
Fitting estimator with 173 features.
Fitting estimator with 163 features.
Fitting estimator with 153 features.
Fitting estimator with 143 features.
Fitting estimator with 133 features.
Fitting estimator with 123 features.
Fitting estimator with 113 features.
Fitting estimator with 103 features.
Fitting estimator with 93 features.
Fitting estimator with 83 features.
Fitting estimator with 73 features.
Fitting estimator with 63 features.
Fitting estimator with 53 features.
Fitting estimator with 43 features.
Fitting estimator with 33 features.
30 selected features
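
The verbose log follows from how RFE with `step=10` trims features: each round removes up to 10 features, but never more than are needed to reach the target. The sketch below reproduces that arithmetic. The helper `rfe_schedule` is hypothetical, written only to mirror the counts in the log.

```python
def rfe_schedule(n_features, n_select, step):
    """Feature counts at the start of each elimination round, plus the final count."""
    counts = []
    remaining = n_features
    while remaining > n_select:
        counts.append(remaining)
        # Never remove more than is needed to land exactly on n_select
        remaining -= min(step, remaining - n_select)
    return counts, remaining

rounds, final_count = rfe_schedule(n_features=223, n_select=30, step=10)
print(len(rounds), rounds[0], rounds[-1], final_count)
```

The last round starts from 33 features and removes only 3, which is why the log ends at 33 and the selector still returns exactly 30 features.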


### List the selected features from RFE

In [26]:
rfe_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'SprintSpeed',
 'Agility',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Strength',
 'Weak Foot',
 'Position_CAM',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LM',
 'Position_RB',
 'Position_RCB',
 'Position_RF',
 'Position_RM',
 'Position_RW',
 'Body Type_Courtois',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_Gabon',
 'Nationality_Netherlands',
 'Nationality_Slovenia',
 'Nationality_Uruguay']

## Embedded Selection - Lasso: SelectFromModel

In [27]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [28]:
def embedded_log_reg_selector(X, y, num_feats):
    # Note: LogisticRegression() defaults to an L2 penalty; for Lasso-style (L1)
    # selection, pass penalty="l1" with a compatible solver such as "liblinear"
    estimator = LogisticRegression()

    model = SelectFromModel(estimator, max_features=num_feats)
    model.fit(get_scaled_data(X), y)

    selected_features = list(model.get_feature_names_out())
    return get_corr_support(selected_features), selected_features

In [29]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

30 selected features


In [30]:
embedded_lr_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'SprintSpeed',
 'Agility',
 'Volleys',
 'Reactions',
 'Strength',
 'Weak Foot',
 'Position_CAM',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LM',
 'Position_RB',
 'Position_RCB',
 'Position_RM',
 'Position_RW',
 'Body Type_Courtois',
 'Body Type_Lean',
 'Body Type_Stocky',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_France',
 'Nationality_Gabon',
 'Nationality_Netherlands',
 'Nationality_Slovenia',
 'Nationality_Uruguay']

## Tree-based (Random Forest): SelectFromModel

In [31]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [32]:
def embedded_rf_selector(X, y, num_feats):
    # Select by impurity-based importances; no random_state is set, so runs may differ
    estimator = RandomForestClassifier()

    model = SelectFromModel(estimator, max_features=num_feats)
    model.fit(get_scaled_data(X), y)

    selected_features = list(model.get_feature_names_out())
    return get_corr_support(selected_features), selected_features

In [33]:
embedder_rf_support, embeded_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embeded_rf_feature)), 'selected features')

21 selected features


In [34]:
embeded_rf_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Body Type_Courtois']
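
The random forest returned 21 features rather than the requested 30. A likely explanation is that `SelectFromModel` applies its default importance threshold (the mean importance for estimators like this one) in addition to the `max_features` cap. The scikit-learn documentation suggests `threshold=-np.inf` to select on `max_features` alone. A minimal sketch on synthetic data, where `count_selected` is a hypothetical helper:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: f0 and f1 drive the target, the rest are noise
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"f{i}" for i in range(6)])
toy_y = (toy_X["f0"] + 0.5 * toy_X["f1"] > 0).astype(int)

def count_selected(threshold):
    """Number of features kept with max_features=3 under the given threshold."""
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    model = SelectFromModel(forest, max_features=3, threshold=threshold)
    model.fit(toy_X, toy_y)
    return int(model.get_support().sum())

default_count = count_selected(None)     # importance must also reach the default threshold
exact_count = count_selected(-np.inf)    # only the max_features cap applies
print(default_count, exact_count)
```

Whether to disable the threshold is a design choice: keeping it filters out weak features, while disabling it guarantees a fixed number of votes per method.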

## Tree-based (Light GBM): SelectFromModel

In [35]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

In [36]:
def embedded_lgbm_selector(X, y, num_feats):
    # Select by LightGBM feature importances
    estimator = LGBMClassifier()

    model = SelectFromModel(estimator, max_features=num_feats)
    model.fit(get_scaled_data(X), y)

    selected_features = list(model.get_feature_names_out())
    return get_corr_support(selected_features), selected_features

In [38]:
embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embedded_lgbm_feature)), 'selected features')

22 selected features


In [39]:
embedded_lgbm_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Position_LCB',
 'Body Type_Lean',
 'Nationality_Italy']

## Putting all of it together: AutoFeatureSelector Tool

In [41]:
pd.set_option('display.max_rows', None)
# put all selection together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'Logistics':embedded_lr_support,
                                    'Random Forest':embedder_rf_support, 'LightGBM':embedded_lgbm_support})
# count how many methods selected each feature (sum the boolean columns only)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)
# display the top num_feats features by vote count
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feats)


| | Feature | Pearson | Chi-2 | RFE | Logistics | Random Forest | LightGBM | Total |
|---|---|---|---|---|---|---|---|---|
| 1 | Volleys | True | True | True | True | True | True | 6 |
| 2 | ShortPassing | True | True | True | True | True | True | 6 |
| 3 | Reactions | True | True | True | True | True | True | 6 |
| 4 | LongPassing | True | True | True | True | True | True | 6 |
| 5 | Finishing | True | True | True | True | True | True | 6 |
| 6 | BallControl | True | True | True | True | True | True | 6 |
| 7 | Strength | True | False | True | True | True | True | 5 |
| 8 | SprintSpeed | True | False | True | True | True | True | 5 |
| 9 | FKAccuracy | True | True | True | False | True | True | 5 |
| 10 | Body Type_Courtois | True | True | True | True | True | False | 5 |


## Can you build a Python script that takes a dataset and a list of feature selection methods you want to try, and outputs the best (maximum votes) features across all methods?

In [None]:
Py file attached. 
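
The attached script is not reproduced here. As a rough idea of what such a tool could look like, the sketch below tallies votes from any selectors that follow the notebook's `(X, y, num_feats) -> (support, features)` convention. The function `auto_feature_selector`, the two toy selectors, and the toy data are all hypothetical stand-ins, not the attached implementation.

```python
import pandas as pd

def auto_feature_selector(X, y, selectors, num_feats):
    """Run each selector and return the features with the most votes."""
    votes = pd.DataFrame({"Feature": list(X.columns)})
    for name, selector in selectors.items():
        support, _ = selector(X, y, num_feats)
        votes[name] = list(support)
    # Sum only the boolean vote columns, then rank as the notebook does
    votes["Total"] = votes.drop(columns="Feature").sum(axis=1)
    votes = votes.sort_values(["Total", "Feature"], ascending=False)
    return votes.head(num_feats)["Feature"].tolist(), votes

# Two simple stand-in selectors using the same return convention
def top_variance(X, y, num_feats):
    chosen = X.var().nlargest(num_feats).index.tolist()
    return [c in chosen for c in X.columns], chosen

def top_abs_corr(X, y, num_feats):
    chosen = X.corrwith(y.astype(float)).abs().nlargest(num_feats).index.tolist()
    return [c in chosen for c in X.columns], chosen

toy_X = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [0, 10, 0, 10, 0, 10],
    "c": [1, 1, 1, 1, 1, 2],
})
toy_y = pd.Series([False, False, False, True, True, True])

best, votes = auto_feature_selector(
    toy_X, toy_y,
    selectors={"Variance": top_variance, "AbsCorr": top_abs_corr},
    num_feats=2,
)
print(best)
```

Passing the methods as a dictionary of callables keeps the voting logic independent of any particular selector, so the notebook's own functions (for example `cor_selector` or `rfe_selector`) could be dropped in under the same convention.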