# Task 7: AutoFeatureSelector Tool
## This task tests your understanding of the feature selection methods covered in the lecture and your ability to apply them to a real-world dataset: selecting the best features and building an automated feature selection tool for your toolkit

### Use your knowledge of different feature selection methods to build an Automatic Feature Selection tool
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: FIFA 19 Player Skills
#### Attributes: FIFA 2019 player attributes such as Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

In [2]:
player_df = pd.read_csv(r"C:\Users\14163\Desktop\university cu boulder\GorgeBrown\Mashine Learning 1\Assignment 7\fifa19.csv")



In [3]:
numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

In [4]:
player_df = player_df[numcols+catcols]

In [5]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.dropna()

In [6]:
traindf = pd.DataFrame(traindf,columns=features)

In [7]:
y = traindf['Overall']>=87
X = traindf.copy()
del X['Overall']

In [8]:
X.head()

| | Crossing | Finishing | ShortPassing | Dribbling | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Stamina | ... | Nationality_Uganda | Nationality_Ukraine | Nationality_United Arab Emirates | Nationality_United States | Nationality_Uruguay | Nationality_Uzbekistan | Nationality_Venezuela | Nationality_Wales | Nationality_Zambia | Nationality_Zimbabwe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84.0 | 95.0 | 90.0 | 97.0 | 87.0 | 96.0 | 91.0 | 86.0 | 91.0 | 72.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 84.0 | 94.0 | 81.0 | 88.0 | 77.0 | 94.0 | 89.0 | 91.0 | 87.0 | 88.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 79.0 | 87.0 | 84.0 | 96.0 | 78.0 | 95.0 | 94.0 | 90.0 | 96.0 | 81.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 17.0 | 13.0 | 50.0 | 18.0 | 51.0 | 42.0 | 57.0 | 58.0 | 60.0 | 43.0 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 93.0 | 82.0 | 92.0 | 86.0 | 91.0 | 91.0 | 78.0 | 76.0 | 79.0 | 90.0 | ... | False | False | False | False | False | False | False | False | False | False |


In [9]:
len(X.columns)

223

### Fix the number of features to select

In [10]:
feature_name = list(X.columns)
# maximum number of features to select
num_feats=30

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [11]:
def cor_selector(X, y, num_feats):
    cor_list = []
    for i in X.columns:
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(abs(cor))  # take absolute value

    cor_list = pd.Series(cor_list, index=X.columns)
    cor_feature = cor_list.sort_values(ascending=False).index[:num_feats].tolist()
    cor_support = [True if col in cor_feature else False for col in X.columns]
    return cor_support, cor_feature


In [12]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

30 selected features


### List the selected features from Pearson Correlation

In [13]:
cor_feature

['Reactions',
 'Body Type_Courtois',
 'Body Type_C. Ronaldo',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Body Type_Neymar',
 'Body Type_Messi',
 'Position_LF',
 'Position_RF',
 'ShortPassing',
 'Volleys',
 'LongPassing',
 'FKAccuracy',
 'BallControl',
 'Finishing',
 'LongShots',
 'ShotPower',
 'Dribbling',
 'Nationality_Belgium',
 'Crossing',
 'Agility',
 'Weak Foot',
 'Stamina',
 'Nationality_Slovenia',
 'Nationality_Gabon',
 'Strength',
 'SprintSpeed',
 'Acceleration',
 'Nationality_Uruguay',
 'Position_LAM',
 'Nationality_Costa Rica']
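Some one-hot columns can end up constant after `dropna()`, and `np.corrcoef` returns NaN for a constant column. The sketch below shows one way to make that case explicit while computing all correlations in a single call. The name `cor_selector_vectorized` and the toy data are hypothetical, and scoring constant columns as 0 is an assumption rather than something the assignment prescribes.

```python
import numpy as np
import pandas as pd

def cor_selector_vectorized(X, y, num_feats):
    """Rank columns by |Pearson r| with y; constant columns score 0 (assumption)."""
    Xf = X.astype(float)
    yf = pd.Series(np.asarray(y, dtype=float), index=X.index)
    # corrwith yields NaN for constant columns; treat those as uninformative
    scores = Xf.corrwith(yf).abs().fillna(0.0)
    cor_feature = scores.sort_values(ascending=False).index[:num_feats].tolist()
    cor_support = X.columns.isin(cor_feature).tolist()
    return cor_support, cor_feature

# Toy data (hypothetical): "a" tracks the target, "b" partly, "const" not at all
toy_X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 0.0, 1.0, 1.0],
    "const": [5.0, 5.0, 5.0, 5.0],
})
toy_y = pd.Series([False, False, True, True])
support, names = cor_selector_vectorized(toy_X, toy_y, num_feats=2)
```

The return values have the same shape as those of `cor_selector`, so the function could be swapped in without touching the voting step.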

## Filter Feature Selection - Chi-Square

In [14]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

### Chi-Squared Selector function

In [15]:
def chi_squared_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    
    # Step 1: Scale the features to be in [0,1] range
    X_norm = MinMaxScaler().fit_transform(X)
    
    # Step 2: Apply SelectKBest with chi2 scoring function
    chi_selector = SelectKBest(score_func=chi2, k=num_feats)
    chi_selector.fit(X_norm, y)
    
    # Step 3: Get the boolean mask and feature names
    chi_support = chi_selector.get_support()
    chi_feature = X.loc[:, chi_support].columns.tolist()
    
    # Your code ends here
    return chi_support, chi_feature

In [16]:
chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
print(str(len(chi_feature)), 'selected features')

30 selected features


### List the selected features from Chi-Square 

In [17]:
chi_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'LongShots',
 'Position_CM',
 'Position_LAM',
 'Position_LF',
 'Position_LW',
 'Position_RB',
 'Position_RF',
 'Body Type_C. Ronaldo',
 'Body Type_Courtois',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_Egypt',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Gabon',
 'Nationality_Slovakia',
 'Nationality_Slovenia',
 'Nationality_Spain',
 'Nationality_Uruguay']
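`SelectKBest` also exposes the raw chi-square statistics, which can help explain why a rare one-hot column outranks a skill attribute. The following is a minimal sketch on synthetic data; `chi_squared_scores` and the `demo_*` names are hypothetical and not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

def chi_squared_scores(X, y, num_feats):
    """Return a per-feature table of chi-square scores plus the selection mask."""
    X_norm = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs
    selector = SelectKBest(score_func=chi2, k=num_feats).fit(X_norm, y)
    table = pd.DataFrame({
        "feature": X.columns,
        "chi2": selector.scores_,
        "p_value": selector.pvalues_,
        "selected": selector.get_support(),
    })
    return table.sort_values("chi2", ascending=False).reset_index(drop=True)

# Synthetic demo: one column shifted by the class label, one pure noise
rng = np.random.default_rng(0)
demo_y = np.array([0] * 50 + [1] * 50)
demo_X = pd.DataFrame({
    "informative": demo_y * 10 + rng.normal(size=100),
    "noise": rng.normal(size=100),
})
scores = chi_squared_scores(demo_X, demo_y, num_feats=1)
```

Sorting by `chi2` puts the strongest candidates first, and the `selected` column reproduces the mask that `get_support()` returns.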

## Wrapper Feature Selection - Recursive Feature Elimination

In [18]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

### RFE Selector function

In [19]:

def rfe_selector(X, y, num_feats):
    # Scale the features
    X_scaled = MinMaxScaler().fit_transform(X)
    
    # Initialize a base model
    model = LogisticRegression(max_iter=1000)
    
    # Run RFE with verbose output
    rfe = RFE(estimator=model, n_features_to_select=num_feats, step=1, verbose=1)
    rfe.fit(X_scaled, y)
    
    # Extract selected features
    rfe_support = rfe.get_support()
    rfe_feature = X.loc[:, rfe_support].columns.tolist()
    
    return rfe_support, rfe_feature

In [20]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')

Fitting estimator with 223 features.
Fitting estimator with 222 features.
Fitting estimator with 221 features.
Fitting estimator with 220 features.
Fitting estimator with 219 features.
Fitting estimator with 218 features.
Fitting estimator with 217 features.
Fitting estimator with 216 features.
Fitting estimator with 215 features.
Fitting estimator with 214 features.
Fitting estimator with 213 features.
Fitting estimator with 212 features.
Fitting estimator with 211 features.
Fitting estimator with 210 features.
Fitting estimator with 209 features.
Fitting estimator with 208 features.
Fitting estimator with 207 features.
Fitting estimator with 206 features.
Fitting estimator with 205 features.
Fitting estimator with 204 features.
Fitting estimator with 203 features.
Fitting estimator with 202 features.
Fitting estimator with 201 features.
Fitting estimator with 200 features.
Fitting estimator with 199 features.
Fitting estimator with 198 features.
Fitting estimator with 197 features.
...

### List the selected features from RFE

In [21]:
rfe_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Strength',
 'Weak Foot',
 'Position_CAM',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LM',
 'Position_RB',
 'Position_RCB',
 'Position_RF',
 'Position_RM',
 'Position_RW',
 'Body Type_Courtois',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Gabon',
 'Nationality_Netherlands',
 'Nationality_Slovenia',
 'Nationality_Uruguay']
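The log above shows RFE refitting the model once per eliminated feature, which is what `step=1` asks for. If runtime matters, a larger `step` removes several features per round. The sketch below illustrates this on synthetic data; the dataset, `rfe_fast`, and the value `step=5` are assumptions chosen for the demo, and a coarser step may select a slightly different subset than `step=1`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the FIFA features (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=5, random_state=0
)
X_demo = MinMaxScaler().fit_transform(X_demo)

# step=5 drops five features per round instead of one
rfe_fast = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    step=5,
)
rfe_fast.fit(X_demo, y_demo)
n_selected = int(rfe_fast.get_support().sum())
```

Here 40 features shrink to 10 in six elimination rounds rather than thirty.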

## Embedded Selection - Lasso: SelectFromModel

In [22]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [23]:
def embedded_log_reg_selector(X, y, num_feats):
    # Step 1: Normalize features to [0, 1]
    X_scaled = MinMaxScaler().fit_transform(X)

    # Step 2: Fit Logistic Regression with L1 penalty
    lasso = LogisticRegression(penalty="l1", solver='liblinear', max_iter=1000)
    lasso.fit(X_scaled, y)

    # Step 3: Select top `num_feats` features based on coefficient magnitude
    coef = np.abs(lasso.coef_)[0]
    top_indices = np.argsort(coef)[-num_feats:]

    # Step 4: Create support mask and feature names list
    embedded_lr_support = [i in top_indices for i in range(X.shape[1])]
    embedded_lr_feature = X.columns[top_indices].tolist()

    return embedded_lr_support, embedded_lr_feature

In [24]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

30 selected features


In [25]:
embedded_lr_feature

['Nationality_Bosnia Herzegovina',
 'Nationality_Botswana',
 'Nationality_Bermuda',
 'Body Type_Stocky',
 'Balance',
 'Nationality_France',
 'Nationality_Portugal',
 'Preferred Foot_Right',
 'Nationality_Croatia',
 'Position_RCB',
 'Nationality_Germany',
 'Nationality_Brazil',
 'Nationality_England',
 'Nationality_Belgium',
 'Position_CAM',
 'Nationality_Slovenia',
 'Position_RW',
 'Position_LM',
 'Nationality_Italy',
 'Nationality_Netherlands',
 'Position_LCB',
 'Nationality_Uruguay',
 'Position_RB',
 'Body Type_Lean',
 'Position_GK',
 'Position_CM',
 'LongPassing',
 'Aggression',
 'Position_LW',
 'Reactions']

## Tree-based (Random Forest): SelectFromModel

In [26]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [27]:
def embedded_rf_selector(X, y, num_feats):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X, y)
    # max_features is only an upper bound: with the default threshold (mean
    # importance), SelectFromModel may keep fewer than num_feats features.
    selector = SelectFromModel(rf, max_features=num_feats, prefit=True)
    embedded_rf_support = selector.get_support()
    embedded_rf_feature = X.loc[:, embedded_rf_support].columns.tolist()
    return embedded_rf_support, embedded_rf_feature

In [28]:
embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embedded_rf_feature)), 'selected features')  
print("Selected features:", embedded_rf_feature)


24 selected features
Selected features: ['Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Stamina', 'Volleys', 'FKAccuracy', 'Reactions', 'Balance', 'ShotPower', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Weak Foot', 'Preferred Foot_Right', 'Body Type_Courtois', 'Body Type_Normal', 'Nationality_Slovenia']


In [29]:
# Assuming embedded_rf_feature is already defined
print("[")
for feat in embedded_rf_feature:
    print(f"    '{feat}',")
print("]")



[
    'Crossing',
    'Finishing',
    'ShortPassing',
    'Dribbling',
    'LongPassing',
    'BallControl',
    'Acceleration',
    'SprintSpeed',
    'Agility',
    'Stamina',
    'Volleys',
    'FKAccuracy',
    'Reactions',
    'Balance',
    'ShotPower',
    'Strength',
    'LongShots',
    'Aggression',
    'Interceptions',
    'Weak Foot',
    'Preferred Foot_Right',
    'Body Type_Courtois',
    'Body Type_Normal',
    'Nationality_Slovenia',
]
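The selector returned 24 features rather than 30 because `SelectFromModel` applies its default importance threshold before the `max_features` cap. If exactly `num_feats` features are wanted, scikit-learn's documented option is to pass `threshold=-np.inf` so that only `max_features` decides. The sketch below compares both settings on synthetic data; the dataset and the mask names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the FIFA features (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=5, random_state=0
)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Default threshold: only features at or above the mean importance survive,
# so the result can be smaller than max_features
default_mask = SelectFromModel(rf, max_features=30, prefit=True).get_support()

# threshold=-np.inf disables the importance cut, leaving exactly the top 30
exact_mask = SelectFromModel(
    rf, threshold=-np.inf, max_features=30, prefit=True
).get_support()
```

Which behaviour is preferable depends on the goal: the default drops weak features even when the budget allows them, while the `-np.inf` variant keeps every method's selection the same size for voting.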


## Tree-based (Light GBM): SelectFromModel

In [30]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

In [31]:
def embedded_lgbm_selector(X, y, num_feats):
    # LightGBM rejects some special characters in feature names, so fit on a renamed copy
    X_clean = X.copy()
    X_clean.columns = X_clean.columns.str.replace('[^A-Za-z0-9_]+', '_', regex=True)

    model = LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)
    model.fit(X_clean, y)

    # Keep the num_feats features with the highest importance
    importances = model.feature_importances_
    top_indices = importances.argsort()[-num_feats:]

    embedded_lgbm_support = [i in top_indices for i in range(X.shape[1])]
    embedded_lgbm_feature = X.columns[top_indices].tolist()

    return embedded_lgbm_support, embedded_lgbm_feature


In [32]:
embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embedded_lgbm_feature)), 'selected features')

30 selected features


In [33]:
embedded_rf_feature  

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Right',
 'Body Type_Courtois',
 'Body Type_Normal',
 'Nationality_Slovenia']

## Putting all of it together: AutoFeatureSelector Tool

In [34]:
# put all selection together
feature_selection_df = pd.DataFrame({
    'Feature': feature_name,
    'Pearson': cor_support,
    'Chi-2': chi_support,
    'RFE': rfe_support,
    'Logistics': embedded_lr_support,
    'Random Forest': embedded_rf_support,
    # 'LightGBM': embedded_lgbm_support  # Uncomment only if defined
})

# count the selected times for each feature
feature_selection_df['Total'] = np.sum(feature_selection_df.iloc[:, 1:], axis=1)

# sort and display the top 30
feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=[False, True])
feature_selection_df.index = range(1, len(feature_selection_df) + 1)
feature_selection_df.head(30)



| | Feature | Pearson | Chi-2 | RFE | Logistics | Random Forest | Total |
|---|---|---|---|---|---|---|---|
| 1 | LongPassing | True | True | True | True | True | 5 |
| 2 | Nationality_Slovenia | True | True | True | True | True | 5 |
| 3 | Reactions | True | True | True | True | True | 5 |
| 4 | BallControl | True | True | True | False | True | 4 |
| 5 | Body Type_Courtois | True | True | True | False | True | 4 |
| 6 | FKAccuracy | True | True | True | False | True | 4 |
| 7 | Finishing | True | True | True | False | True | 4 |
| 8 | Nationality_Belgium | True | True | True | True | False | 4 |
| 9 | Nationality_Uruguay | True | True | True | True | False | 4 |
| 10 | ShortPassing | True | True | True | False | True | 4 |


## Can you build a Python script that takes dataset and a list of different feature selection methods that you want to try and output the best (maximum votes) features from all methods?

In [55]:
def preprocess_dataset(dataset_path):
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # Load the dataset
    df = pd.read_csv(dataset_path)

    # Define relevant features
    numcols = ['Overall', 'Crossing','Finishing','ShortPassing','Dribbling','LongPassing',
               'BallControl','Acceleration','SprintSpeed','Agility','Stamina','Volleys',
               'FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots',
               'Aggression','Interceptions']

    catcols = ['Preferred Foot', 'Position', 'Body Type', 'Nationality', 'Weak Foot']

    # Select only those columns
    df = df[numcols + catcols]

    # Drop rows with missing values and reset index
    df = df.dropna().reset_index(drop=True)

    # Binary target based on Overall score
    y = (df['Overall'] >= 87)

    # Prepare numeric and categorical features
    X_num = df[numcols].drop(columns='Overall')
    X_cat = pd.get_dummies(df[catcols], drop_first=True)

    # Normalize numeric features
    scaler = MinMaxScaler()
    X_num_scaled = pd.DataFrame(scaler.fit_transform(X_num), columns=X_num.columns).reset_index(drop=True)

    # Reset index for categorical dummies to ensure alignment
    X_cat = X_cat.reset_index(drop=True)

    # Combine scaled numeric and encoded categorical features
    X = pd.concat([X_num_scaled, X_cat], axis=1)

    # Clean column names for LightGBM compatibility
    X.columns = X.columns.str.replace('[^A-Za-z0-9_]+', '_', regex=True)

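    # Note: this is the total column count, so selectors that take exactly
    # num_feats features keep every column; pass a smaller value (the cells
    # above use 30) for the vote to discriminate between features.
    num_feats = X.shape[1]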

    return X, y, num_feats




In [56]:
def embedded_lgbm_selector(X, y, num_feats):
    from lightgbm import LGBMClassifier

    # Ensure feature names are valid
    X.columns = X.columns.str.replace('[^A-Za-z0-9_]+', '_', regex=True)

    model = LGBMClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)

    importances = model.feature_importances_
    top_indices = importances.argsort()[-num_feats:]

    embedded_lgbm_support = [i in top_indices for i in range(X.shape[1])]
    embedded_lgbm_feature = X.columns[top_indices].tolist()

    return embedded_lgbm_support, embedded_lgbm_feature



In [57]:
def autoFeatureSelector(dataset_path, methods=(), top_k=5):
    from collections import Counter

    # Preprocessing
    X, y, num_feats = preprocess_dataset(dataset_path)

    # Store selected features per method
    selected_features_per_method = []

    if 'pearson' in methods:
        cor_support, cor_feature = cor_selector(X, y, num_feats)
        selected_features_per_method.append(cor_feature)

    if 'chi-square' in methods:
        chi_support, chi_feature = chi_squared_selector(X, y, num_feats)
        selected_features_per_method.append(chi_feature)

    if 'rfe' in methods:
        rfe_support, rfe_feature = rfe_selector(X, y, num_feats)
        selected_features_per_method.append(rfe_feature)

    if 'log-reg' in methods:
        embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
        selected_features_per_method.append(embedded_lr_feature)

    if 'rf' in methods:
        embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
        selected_features_per_method.append(embedded_rf_feature)

    if 'lgbm' in methods:
        embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
        selected_features_per_method.append(embedded_lgbm_feature)

    # Flatten all features selected across methods
    all_features = [feature for sublist in selected_features_per_method for feature in sublist]

    # Count frequency of each feature
    vote_count = Counter(all_features)

    # Sort by vote count (descending), then alphabetically
    sorted_votes = sorted(vote_count.items(), key=lambda x: (-x[1], x[0]))

    # Return top_k features
    best_features = [feature for feature, count in sorted_votes[:top_k]]

    return best_features


In [58]:
best_features = autoFeatureSelector(
    dataset_path=r"C:\Users\14163\Desktop\university cu boulder\GorgeBrown\Mashine Learning 1\Assignment 7\fifa19.csv",
    methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm']
)
best_features


[LightGBM] [Info] Number of positive: 55, number of negative: 18092
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000815 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1798
[LightGBM] [Info] Number of data points in the train set: 18147, number of used features: 122
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003031 -> initscore=-5.795892
[LightGBM] [Info] Start training from score -5.795892


['Acceleration', 'Aggression', 'Agility', 'Balance', 'BallControl']
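The five names returned above are in alphabetical order, which suggests the vote was decided mostly by the tie-break rather than by the selectors. One plausible reading is that `preprocess_dataset` sets `num_feats` to the full column count, so most methods return every feature. The sketch below reproduces that effect in isolation; `vote` is a hypothetical helper that mirrors the counting logic in `autoFeatureSelector`, and the feature picks are made up for illustration.

```python
from collections import Counter

def vote(selected_per_method, top_k):
    """Rank features by how many methods chose them; ties break alphabetically."""
    counts = Counter(f for selected in selected_per_method for f in selected)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_k]]

all_features = ["Reactions", "BallControl", "Agility", "Acceleration"]

# Every method keeps every feature: all counts tie, the alphabet decides
uninformative = vote([all_features, all_features, all_features], top_k=2)

# Each method keeps only its top two (hypothetical picks): the vote discriminates
informative = vote(
    [["Reactions", "BallControl"],
     ["Reactions", "Agility"],
     ["Reactions", "BallControl"]],
    top_k=2,
)
```

With a per-method budget well below the total number of columns, the ranking reflects agreement between methods instead of feature names.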

### Lastly, can you turn this notebook into a Python script, run it, and submit the Python (.py) file that takes a dataset and a list of methods as inputs and outputs the best features?
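A possible starting point for the script's command-line interface is sketched below with `argparse`. The flag names (`--methods`, `--top-k`) and the helpers `parse_args` and `main` are assumptions, not part of the assignment; `autoFeatureSelector` is assumed to be defined earlier in the same file, copied from the notebook.

```python
import argparse

# Method keys as used by autoFeatureSelector in this notebook
AVAILABLE_METHODS = ["pearson", "chi-square", "rfe", "log-reg", "rf", "lgbm"]

def parse_args(argv=None):
    """Parse the dataset path, the selection methods, and the number of features."""
    parser = argparse.ArgumentParser(
        description="Vote-based automatic feature selection"
    )
    parser.add_argument("dataset_path", help="path to the CSV file")
    parser.add_argument(
        "--methods",
        nargs="+",
        choices=AVAILABLE_METHODS,
        default=AVAILABLE_METHODS,
        help="selection methods to run (default: all)",
    )
    parser.add_argument(
        "--top-k", type=int, default=5, help="number of features to report"
    )
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Assumption: autoFeatureSelector is defined above in the same script
    best = autoFeatureSelector(
        args.dataset_path, methods=args.methods, top_k=args.top_k
    )
    print(best)
```

A script built this way would end with the usual `if __name__ == "__main__": main()` guard and could be run as, for example, `python auto_feature_selector.py fifa19.csv --methods pearson rf --top-k 10`, where the file name is hypothetical.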