## Feature Selection by Random Shuffling

A popular method of feature selection consists of randomly shuffling the values of a specific variable and determining how that permutation affects the performance metric of the machine learning algorithm. In other words, the idea is to permute the values of each feature, one feature at a time, and measure how much the permutation worsens the accuracy, the roc_auc, or the mse of the machine learning model (or any other performance metric). If a variable is important, a random permutation of its values will degrade the metric substantially. Conversely, if a variable is not important, shuffling its values should have little to no effect on the model performance metric we are assessing.

The procedure goes more or less like this:

- Build a machine learning model and store its performance metric
- Shuffle 1 feature, and make a new prediction using the previous model
- Determine the performance of this prediction
- Determine the change in the performance of the prediction with the shuffled feature vs the original one
- Repeat for each feature

To select features, we choose those that induced a decrease in model performance, beyond an arbitrarily set threshold.
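The steps above can be sketched as a small helper. This is a minimal sketch rather than the code used later in this notebook: the function name `shuffle_importance`, the synthetic data from `make_classification`, and the threshold of 0.01 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def shuffle_importance(model, X, y, seed=0):
    """Drop in roc-auc caused by shuffling each column of X in turn.

    Hypothetical helper: `model` must already be fitted and expose predict_proba.
    """
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = {}
    for col in X.columns:
        X_shuffled = X.copy()
        # permute the values of one column and leave the others untouched
        X_shuffled[col] = rng.permutation(X_shuffled[col].to_numpy())
        shuffled = roc_auc_score(y, model.predict_proba(X_shuffled)[:, 1])
        drops[col] = baseline - shuffled
    return pd.Series(drops)


# synthetic binary problem (assumption: stands in for a real dataset)
X_arr, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
model.fit(X, y)

importance = shuffle_importance(model, X, y)
threshold = 0.01  # arbitrary cut-off, as described above
selected = importance[importance > threshold].index.tolist()
print(importance.sort_values(ascending=False))
print("selected:", selected)
```

The threshold is a free parameter: a larger value keeps fewer features, so it is worth checking the model performance again after the selection.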

I will demonstrate how to select features based on random shuffling on a classification problem and on a regression problem.

**Note** For the demonstration, I will continue to use Random Forests, but this selection procedure can be used with any machine learning algorithm. In fact, the importance of the features is determined specifically for the algorithm used. Therefore, different algorithms may return different subsets of important features.
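To illustrate that the importances depend on the model, here is a sketch that uses scikit-learn's `permutation_importance` from `sklearn.inspection`, which applies the same shuffle-and-score idea with repeated shuffles. The synthetic data and the choice of the two models are assumptions for illustration and are not part of this notebook's datasets.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic binary problem (assumption)
X_arr, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                               n_redundant=2, random_state=1)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_depth=3, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

rankings = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # mean drop in roc-auc over 5 shuffles of each feature, on the training set
    result = permutation_importance(model, X_train, y_train, scoring="roc_auc",
                                    n_repeats=5, random_state=0)
    rankings[name] = pd.Series(result.importances_mean, index=X_train.columns)

comparison = pd.DataFrame(rankings)
print(comparison.sort_values("random_forest", ascending=False))
```

Comparing the two columns side by side shows whether both models rely on the same features or rank them differently.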

In [52]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score, mean_squared_error, r2_score, accuracy_score

## Classification

In [46]:
# load dataset
data = pd.read_csv('C:/Users/RAJENDRA REDDY/Downloads/Genre1.csv')
data.shape

(200, 36)

In [47]:
data.head()

Unnamed: 0,chroma_stft_min,chroma_stft_max,chroma_cqt_min,chroma_cqt_max,chroma_cens_min,chroma_cens_max,melspectogram_min,melspectogram_max,mfcc_min,mfcc_max,...,zero_crossing_rate_min,zero_crossing_rate_max,tempogram_min,tempogram_max,delta_mfcc_min,delta_mfcc_max,mel_to_stft_min,mel_to_stft_max,class,song
0,0.001296,1,0.033154,1,0.003514,0.739581,8.89e-06,6547.407,-162.60739,148.07231,...,0.02002,0.305176,-2.85e-16,1,-27.087835,25.198893,0,18.772789,1,Aa To Sahii (sahi)_shortened.wav
1,0.002739,1,0.062056,1,0.020606,0.682328,1.99e-09,3179.2095,-243.84023,156.03381,...,0.008301,0.543457,-2.85e-16,1,-24.83185,26.813145,0,14.955276,1,Aadat (23)_shortened.wav
2,0.003432,1,0.056286,1,0.02501,0.674345,1.47e-06,367.87683,-197.41306,134.92323,...,0.054688,0.480957,-3.32e-16,1,-14.765142,14.908866,0,9.169767,1,Aag Chahat Ki Lag Jayegi (1)_shortened.wav
3,0.000696,1,0.049335,1,0.0,0.777123,8.43e-07,5928.974,-204.6526,162.19836,...,0.004883,0.195801,-2.44e-16,1,-29.71674,21.724106,0,17.88996,1,Aahista Aahista (16)_shortened.wav
4,0.000197,1,0.02621,1,0.0,0.782509,2.6e-11,722.85565,-351.27094,223.6753,...,0.024902,0.239258,-3.61e-16,1,-22.297218,16.177706,0,10.572831,1,Aaiye Meharban (23)_shortened.wav


**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps to avoid overfitting.

In [48]:
# separate train and test sets
feature_cols = ['chroma_stft_min', 'chroma_stft_max', 'chroma_cqt_min',
       'chroma_cqt_max', 'chroma_cens_min', 'chroma_cens_max',
       'melspectogram_min', 'melspectogram_max', 'mfcc_min', 'mfcc_max',
       'rms_min', 'rms_max', 'spectral_centroid_min', 'spectral_centroid_max',
       'spectral_bandwidth_min', 'spectral_bandwidth_max',
       'spectral_contrast_min', 'spectral_contrast_max',
       'spectral_flatness_min', 'spectral_flatness_max',
       'spectral_rolloff_min', 'spectral_rolloff_max', 'poly_features_min',
       'poly_features_max', 'tonnetz_min', 'tonnetz_max',
       'zero_crossing_rate_min', 'zero_crossing_rate_max', 'tempogram_min',
       'tempogram_max', 'delta_mfcc_min', 'delta_mfcc_max', 'mel_to_stft_min',
       'mel_to_stft_max']

X_train, X_test, y_train, y_test = train_test_split(data[feature_cols],data['class'],test_size=0.3,random_state=0)

X_train.shape, X_test.shape

((140, 34), (60, 34))

In [49]:
# for this method, it is necessary to reset the indices of the returned
# datasets

X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
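Why the reset matters: pandas aligns on index labels when a Series is assigned to a DataFrame column. After a random split the rows keep their original labels, so a shuffled Series whose index was reset to 0..n-1 no longer lines up with them. The toy DataFrame below is an assumption for illustration only.

```python
import pandas as pd

# toy frame whose index looks like the one left behind by a random split
df = pd.DataFrame({"a": [10, 20, 30, 40]}, index=[7, 2, 9, 4])

# shuffled values with a fresh 0..n-1 index, as in the shuffling loop
shuffled = df["a"].sample(frac=1, random_state=10).reset_index(drop=True)

# without resetting the frame's index, labels 7, 9 and 4 have no match
# in the shuffled Series, so those rows become NaN
misaligned = df.copy()
misaligned["a"] = shuffled
print(misaligned["a"].isna().sum())  # 3

# after resetting the frame's index, every label 0..3 finds a value
aligned = df.reset_index(drop=True)
aligned["a"] = shuffled
print(aligned["a"].isna().sum())  # 0
```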

### Train ML algo with all features

In [53]:
# The first step to determine feature importance by feature shuffling
# is to build the machine learning model for which we want to 
# select features

# In this case, I will build Random Forests, but remember that 
# you can use this procedure with any other machine learning algorithm

# I build few and shallow trees to avoid overfitting
rf = RandomForestClassifier(
    n_estimators=50, max_depth=2, random_state=2909, n_jobs=4)

rf.fit(X_train, y_train)


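# print roc-auc in train and testing sets
# (roc_auc_score, not accuracy_score, is the metric that accepts probabilities)
print('train auc score: ',
      roc_auc_score(y_train, rf.predict_proba(X_train.fillna(0)), multi_class="ovr"))
print('test auc score: ',
      roc_auc_score(y_test, rf.predict_proba(X_test.fillna(0)), multi_class="ovr"))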

train auc score:  1.0
test auc score:  1.0


### Shuffle features and assess performance drop

In [29]:
# in this cell, I will shuffle one by one, each feature of the dataset

# then I use the dataset with the shuffled variable to make predictions
# with the random forests I trained in the previous cell

# overall train roc-auc: using all the features
train_roc = roc_auc_score(y_train, (rf.predict_proba(X_train)),multi_class="ovr")

# list to capture the performance shift
performance_shift = []

# selection  logic
for feature in X_train.columns:

    X_train_c = X_train.copy()

    # shuffle individual feature
    X_train_c[feature] = X_train_c[feature].sample(
        frac=1, random_state=10).reset_index(drop=True)

    # make prediction with shuffled feature and calculate roc-auc
    shuff_roc = roc_auc_score(y_train, rf.predict_proba(X_train_c),multi_class="ovr")
    
    drift = train_roc - shuff_roc

    # save the drop in roc-auc
    performance_shift.append(drift)
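A single shuffle with one fixed seed gives a noisy estimate of the drop. One possible refinement, sketched here under assumptions (the synthetic data, the helper name `mean_shuffle_drop`, and the choice of five repeats are all illustrative and not part of this notebook), is to repeat the shuffle with different seeds and average the results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def mean_shuffle_drop(model, X, y, n_repeats=5):
    """Mean and standard deviation of the roc-auc drop per feature.

    Hypothetical helper: X must have a default 0..n-1 index and
    `model` must already be fitted.
    """
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = pd.DataFrame(index=X.columns, columns=range(n_repeats), dtype=float)
    for seed in range(n_repeats):
        for col in X.columns:
            X_c = X.copy()
            # same shuffling idiom as the loop above, one seed per repeat
            X_c[col] = X_c[col].sample(
                frac=1, random_state=seed).reset_index(drop=True)
            shuffled = roc_auc_score(y, model.predict_proba(X_c)[:, 1])
            drops.loc[col, seed] = baseline - shuffled
    return drops.mean(axis=1), drops.std(axis=1)


# synthetic binary problem (assumption)
X_arr, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
model = RandomForestClassifier(
    n_estimators=30, max_depth=2, random_state=0).fit(X, y)

mean_drop, std_drop = mean_shuffle_drop(model, X, y)
print(pd.DataFrame({"mean_drop": mean_drop, "std_drop": std_drop}))
```

A feature whose mean drop is small compared with its standard deviation is hard to distinguish from noise.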

In [30]:
# let's have a look at our list of performances
performance_shift

[0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623,
 0.0003287388914665623]

In [31]:
# Now I will transform the list into a pandas Series
# for easy manipulation

feature_importance = pd.Series(performance_shift)

# add variable names in the index
feature_importance.index = X_train.columns

feature_importance.head()

chroma_stft_min    0.000329
chroma_stft_max    0.000329
chroma_cqt_min     0.000329
chroma_cqt_max     0.000329
chroma_cens_min    0.000329
dtype: float64

In [32]:
# Now I will sort the Series according to the drop in performance
# caused by feature shuffling

feature_importance.sort_values(ascending=False)

mel_to_stft_max           0.000329
mfcc_min                  0.000329
spectral_bandwidth_min    0.000329
spectral_centroid_max     0.000329
spectral_centroid_min     0.000329
rms_max                   0.000329
rms_min                   0.000329
mfcc_max                  0.000329
melspectogram_max         0.000329
mel_to_stft_min           0.000329
melspectogram_min         0.000329
chroma_cens_max           0.000329
chroma_cens_min           0.000329
chroma_cqt_max            0.000329
chroma_cqt_min            0.000329
chroma_stft_max           0.000329
spectral_bandwidth_max    0.000329
spectral_contrast_min     0.000329
spectral_contrast_max     0.000329
spectral_flatness_min     0.000329
spectral_flatness_max     0.000329
spectral_rolloff_min      0.000329
spectral_rolloff_max      0.000329
poly_features_min         0.000329
poly_features_max         0.000329
tonnetz_min               0.000329
tonnetz_max               0.000329
zero_crossing_rate_min    0.000329
zero_crossing_rate_m

In [33]:
# visualise the top 10 features that caused the major drop
# in the roc-auc (aka model performance)

feature_importance.sort_values(ascending=False).head(10)

mel_to_stft_max           0.000329
mfcc_min                  0.000329
spectral_bandwidth_min    0.000329
spectral_centroid_max     0.000329
spectral_centroid_min     0.000329
rms_max                   0.000329
rms_min                   0.000329
mfcc_max                  0.000329
melspectogram_max         0.000329
mel_to_stft_min           0.000329
dtype: float64

In [34]:
# original number of features (rows in this case)
feature_importance.shape[0]

34

In [35]:
# number of features that cause a drop in performance
# when shuffled

feature_importance[feature_importance>0].shape[0]

34

Here, all 34 of the 34 features caused a drop in the performance of the random forest when their values were permuted, so with a threshold of 0 none of them is discarded. In general, we would select the features that cause a drop and discard the rest, and the model should keep its original performance.

In [36]:
# print the important features

feature_importance[feature_importance>0].index

Index(['chroma_stft_min', 'chroma_stft_max', 'chroma_cqt_min',
       'chroma_cqt_max', 'chroma_cens_min', 'chroma_cens_max',
       'melspectogram_min', 'melspectogram_max', 'mfcc_min', 'mfcc_max',
       'rms_min', 'rms_max', 'spectral_centroid_min', 'spectral_centroid_max',
       'spectral_bandwidth_min', 'spectral_bandwidth_max',
       'spectral_contrast_min', 'spectral_contrast_max',
       'spectral_flatness_min', 'spectral_flatness_max',
       'spectral_rolloff_min', 'spectral_rolloff_max', 'poly_features_min',
       'poly_features_max', 'tonnetz_min', 'tonnetz_max',
       'zero_crossing_rate_min', 'zero_crossing_rate_max', 'tempogram_min',
       'tempogram_max', 'delta_mfcc_min', 'delta_mfcc_max', 'mel_to_stft_min',
       'mel_to_stft_max'],
      dtype='object')

### Select features

In [38]:
# Now let's build a random forest using only the selected features

# capture the selected features
selected_features = feature_importance[feature_importance > 0].index

# train a new random forest using only the selected features
rf = RandomForestClassifier(n_estimators=50,
                            max_depth=2,
                            random_state=2909,
                            n_jobs=4)

rf.fit(X_train[selected_features], y_train)

# print roc-auc in train and testing sets

print('train auc score: ',roc_auc_score(y_train, (rf.predict_proba(X_train[selected_features])),multi_class="ovr"))
print('test auc score: ',roc_auc_score(y_test, (rf.predict_proba(X_test[selected_features])),multi_class="ovr"))
selected_features

Index(['chroma_stft_min', 'chroma_stft_max', 'chroma_cqt_min',
       'chroma_cqt_max', 'chroma_cens_min', 'chroma_cens_max',
       'melspectogram_min', 'melspectogram_max', 'mfcc_min', 'mfcc_max',
       'rms_min', 'rms_max', 'spectral_centroid_min', 'spectral_centroid_max',
       'spectral_bandwidth_min', 'spectral_bandwidth_max',
       'spectral_contrast_min', 'spectral_contrast_max',
       'spectral_flatness_min', 'spectral_flatness_max',
       'spectral_rolloff_min', 'spectral_rolloff_max', 'poly_features_min',
       'poly_features_max', 'tonnetz_min', 'tonnetz_max',
       'zero_crossing_rate_min', 'zero_crossing_rate_max', 'tempogram_min',
       'tempogram_max', 'delta_mfcc_min', 'delta_mfcc_max', 'mel_to_stft_min',
       'mel_to_stft_max'],
      dtype='object')

In [21]:
X_train = X_train[selected_features]
X_test =  X_test[selected_features]
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
# the experimental import is only required by scikit-learn versions before 1.0
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              HistGradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)

clf = AdaBoostClassifier(n_estimators=100)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print('Ada Boost roc-auc: {}'.format(roc_auc_score(y_train, y_pred,multi_class="ovo")))


clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
  max_depth=1, random_state=0).fit(X_train, y_train)

clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print('GradientBoostingClassifier roc-auc: {}'.format(roc_auc_score(y_train, y_pred,multi_class="ovo")))


clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print('HistGradientBoostingClassifier roc-auc: {}'.format(roc_auc_score(y_train, y_pred,multi_class="ovo")))


clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
     min_samples_split=2, random_state=0)

clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print('ExtraTreesClassifier roc-auc: {}'.format(roc_auc_score(y_train, y_pred,multi_class="ovo")))

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()

eclf = VotingClassifier(
     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
     voting='soft')

params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200]}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(X_train,y_train)
y_pred = grid.predict_proba(X_train)
print('Voting Classifier roc-auc: {}'.format(roc_auc_score(y_train, y_pred,multi_class="ovo")))

clf = DecisionTreeClassifier(criterion="entropy", max_depth=9)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print("Decision Tree Accuracy:",roc_auc_score(y_train, y_pred,multi_class="ovo"))

clf = BaggingClassifier(base_estimator=SVC(),
                        n_estimators=10, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print('BaggingClassifier roc-auc: {}'.format(roc_auc_score(y_train, y_pred,multi_class="ovo")))

clf = KNeighborsClassifier(n_neighbors = 5)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)

print("KNN {}nn score: {}".format(5, roc_auc_score(y_train, y_pred,multi_class="ovo")))

clf = GaussianNB()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print("Accuracy of Naive Bayes Algo: ", roc_auc_score(y_train, y_pred,multi_class="ovo"))


nca = NeighborhoodComponentsAnalysis(random_state=42)
n = []
for i in range(500):
    
    knn = KNeighborsClassifier(n_neighbors=i+1)
    clf = Pipeline([('nca', nca), ('knn', knn)])
    clf = clf.fit(X_train,y_train)
    y_pred = clf.predict_proba(X_train)
    n.append(roc_auc_score(y_train, y_pred,multi_class="ovo"))
print("Accuracy of NeighborhoodComponentsAnalysis:",max(n))



clf = MLPClassifier(random_state=1, max_iter=600).fit(X_train, y_train)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict_proba(X_train)
print("Accuracy of MLPClassifier",roc_auc_score(y_train, y_pred,multi_class="ovo"))

Ada Boost roc-auc: 0.8374698536641564
GradientBoostingClassifier roc-auc: 0.9969756715990477
HistGradientBoostingClassifier roc-auc: 1.0
ExtraTreesClassifier roc-auc: 1.0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Voting Classifier roc-auc: 0.959613804805104
Decision Tree Accuracy: 0.9976198825013437
BaggingClassifier roc-auc: 0.7220450330750114
KNN {}nn score: {} 0.888264048750608
Accuracy of Naive Bayes Algo:  0.7841478742298282
Accuracy of NeighborhoodComponentsAnalysis: 1.0
Accuracy of MLPClassifier 0.743599156055825


As you can see, the random forest trained on the selected features performs similarly to (or even slightly better than) the random forest built using all of the features, while providing a simpler and faster model.

## Regression

In [3]:
# load dataset
data = pd.read_csv('C:/Users/RAJENDRA REDDY/Downloads/finalData.csv')
data.shape

(1004, 36)

In [4]:
# In practice, feature selection should be done after data pre-processing,
# so ideally, all the categorical variables are encoded into numbers,
# and then you can assess how predictive they are of the target

# here for simplicity I will use only numerical variables
# select numerical columns:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

(1004, 35)

In [5]:
# separate train and test sets
feature_cols = ['chroma_stft_min', 'chroma_stft_max', 'chroma_cqt_min',
       'chroma_cqt_max', 'chroma_cens_min', 'chroma_cens_max',
       'melspectogram_min', 'melspectogram_max', 'mfcc_min', 'mfcc_max',
       'rms_min', 'rms_max', 'spectral_centroid_min', 'spectral_centroid_max',
       'spectral_bandwidth_min', 'spectral_bandwidth_max',
       'spectral_contrast_min', 'spectral_contrast_max',
       'spectral_flatness_min', 'spectral_flatness_max',
       'spectral_rolloff_min', 'spectral_rolloff_max', 'poly_features_min',
       'poly_features_max', 'tonnetz_min', 'tonnetz_max',
       'zero_crossing_rate_min', 'zero_crossing_rate_max', 'tempogram_min',
       'tempogram_max', 'delta_mfcc_min', 'delta_mfcc_max', 'mel_to_stft_min',
       'mel_to_stft_max']

X_train, X_test, y_train, y_test = train_test_split(data[feature_cols],data['class'],test_size=0.3,random_state=0)

X_train.shape, X_test.shape

((702, 34), (302, 34))

In [6]:
# for this method, it is necessary to reset the indices of the returned
# datasets

X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)

In [7]:
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

### Train ML algo with all features

In [8]:
# The first step to determine feature importance by feature shuffling
# is to build the machine learning model for which we want to
# select features

# In this case, I will build Random Forests, but remember that
# you can use this procedure for any other machine learning algorithm

# I build few and shallow trees to avoid overfitting
rf = RandomForestRegressor(n_estimators=100,
                           max_depth=3,
                           random_state=2909,
                           n_jobs=4)

rf.fit(X_train, y_train)

# print performance metrics
print('train rmse: ', mean_squared_error(y_train, rf.predict(X_train), squared=False))
print('train r2: ', r2_score(y_train, (rf.predict(X_train))))
print()
print('test rmse: ', mean_squared_error(y_test, rf.predict(X_test), squared=False))
print('test r2: ', r2_score(y_test, rf.predict(X_test)))

train rmse:  1.0768967419854518
train r2:  0.4276288164482638

test rmse:  1.0854408911392635
test r2:  0.4003900240187429


### Shuffle features and assess performance drift

In [9]:
# in this cell, I will shuffle one by one, each feature of the dataset
# and then use the dataset with the shuffled variable to make predictions
# using the random forests I trained in the previous cell

# overall train rmse: using all the features
train_rmse = mean_squared_error(y_train, rf.predict(X_train), squared=False)

# list to capture the performance shift
performance_shift = []

# for each feature:
for feature in X_train.columns:
    
    X_train_c = X_train.copy()

    # shuffle individual feature
    X_train_c[feature] = X_train_c[feature].sample(frac=1, random_state=11).reset_index(
        drop=True)

    # make prediction with shuffled feature and calculate rmse
    shuff_rmse = mean_squared_error(y_train, rf.predict(X_train_c), squared=False)

    drift = train_rmse - shuff_rmse

    # store the change in rmse (negative when shuffling increases the error)
    performance_shift.append(drift)

In [10]:
# Now I will transform the list into a pandas Series
# for easy manipulation

feature_importance = pd.Series(performance_shift)

# add variable names in the index
feature_importance.index = X_train.columns

feature_importance.head()

chroma_stft_min   -0.000918
chroma_stft_max    0.000000
chroma_cqt_min    -0.005392
chroma_cqt_max     0.000000
chroma_cens_min   -0.001040
dtype: float64

In [11]:
# Note here that when looking at the rmse, the smaller the better.

# as we compute original_rmse - shuffled_data_rmse,
# shuffling an important feature increases the rmse
# and therefore produces a negative value

# thus, we are looking for negative values here

# number of features that cause a drop in performance
# when shuffled

feature_importance[feature_importance<0].shape[0]

29
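Because the sign of the difference depends on whether the metric is an error (lower is better) or a score (higher is better), it can help to normalise the direction so that a positive value always means that shuffling hurt the model. A minimal sketch, assuming a hypothetical helper `performance_drop` and made-up metric values:

```python
def performance_drop(baseline, shuffled, greater_is_better):
    """Positive result means shuffling made the model worse (hypothetical helper)."""
    if greater_is_better:
        return baseline - shuffled  # e.g. roc-auc, r2, accuracy
    return shuffled - baseline      # e.g. rmse, mse


# made-up numbers for illustration only
print(performance_drop(0.90, 0.85, greater_is_better=True))   # roc-auc fell
print(performance_drop(1.08, 1.20, greater_is_better=False))  # rmse rose
```

With this convention the same `> threshold` filter can be used for the classification and the regression examples.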

In [12]:
# and the variable names

feature_importance[feature_importance<0].index

Index(['chroma_stft_min', 'chroma_cqt_min', 'chroma_cens_min',
       'chroma_cens_max', 'melspectogram_min', 'melspectogram_max', 'mfcc_min',
       'mfcc_max', 'rms_min', 'rms_max', 'spectral_centroid_min',
       'spectral_centroid_max', 'spectral_bandwidth_min',
       'spectral_bandwidth_max', 'spectral_contrast_min',
       'spectral_contrast_max', 'spectral_flatness_min',
       'spectral_flatness_max', 'spectral_rolloff_min', 'spectral_rolloff_max',
       'poly_features_min', 'poly_features_max', 'tonnetz_min', 'tonnetz_max',
       'zero_crossing_rate_min', 'zero_crossing_rate_max', 'delta_mfcc_min',
       'delta_mfcc_max', 'mel_to_stft_max'],
      dtype='object')

### Select features

In [13]:
# Now let's compare the performance of a random forest
# built only using the selected features

# slice the data
feat = feature_importance[feature_importance<0].index

X_train = X_train[feat]
X_test = X_test[feat]

In [14]:
X_train.shape, X_test.shape

((702, 29), (302, 29))

# Classifiers

The model with fewer features shows similar performance to the model with all features.

In [30]:
# build and evaluate the model

rf = RandomForestRegressor(n_estimators=2000,
                           max_depth=3,
                           random_state=2909,
                           n_jobs=4)

rf.fit(X_train, y_train)

# print performance metrics
print('train rmse: ', mean_squared_error(y_train, rf.predict(X_train), squared=False))
print('train r2: ', r2_score(y_train, (rf.predict(X_train))))
print()
print('test rmse: ', mean_squared_error(y_test, rf.predict(X_test), squared=False))
print('test r2: ', r2_score(y_test, rf.predict(X_test)))

train rmse:  1.0766863652680665
train r2:  0.4278524252760294

test rmse:  1.0875362655862453
test r2:  0.3980727718508713


In [44]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = rf.predict(X_test)
rf.score(X_test,y_test)

0.3980727718508714

In [48]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_hastie_10_2, make_classification
from sklearn.neural_network import MLPClassifier

clf = DecisionTreeClassifier(criterion="entropy", max_depth=9)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Decision Tree Accuracy:",metrics.accuracy_score(y_test, y_pred))


LogReg_clf = LogisticRegression(random_state = 1)
LogReg_clf.fit(X_train, y_train)
y_pred = LogReg_clf.predict(X_test)
acc =  accuracy_score(y_test, y_pred)
print("Regressor Accuracy:",acc)


knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train,y_train)
prediction = knn.predict(X_test)

print("KNN {}nn score: {}".format(5,knn.score(X_test,y_test)))


svm = SVC(random_state = 1)
svm.fit(X_train,y_train)
print("Accuracy of svm algo: ",svm.score(X_test,y_test))

nb = GaussianNB()
nb.fit(X_train,y_train)
print("Accuracy of Naive Bayes Algo: ", nb.score(X_test,y_test))


rf = RandomForestClassifier(n_estimators = 1000, random_state = 1)
rf.fit(X_train,y_train)
print("Accuracy of Random Forest Algo: ", rf.score(X_test,y_test))


nca = NeighborhoodComponentsAnalysis(random_state=42)
n = []
for i in range(500):
    
    knn = KNeighborsClassifier(n_neighbors=i+1)

    nca_pipe = Pipeline([('nca', nca), ('knn', knn)])
    nca_pipe.fit(X_train, y_train)
    n.append(nca_pipe.score(X_test, y_test))
print("Accuracy of NeighborhoodComponentsAnalysis:",max(n))


clf = GradientBoostingClassifier(n_estimators=50, learning_rate=1.0,
    max_depth=1, random_state=0).fit(X_train, y_train)
print("Accuracy of GradientBoostingClassifier:",clf.score(X_test, y_test))


clf = MLPClassifier(random_state=1, max_iter=600).fit(X_train, y_train)
print("Accuracy of MLPClassifier",clf.score(X_test, y_test))

Decision Tree Accuracy: 0.41721854304635764
Regressor Accuracy: 0.40066225165562913
KNN 5nn score: 0.37748344370860926


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy of svm algo:  0.4105960264900662
Accuracy of Naive Bayes Algo:  0.4105960264900662
Accuracy of Random Forest Algo:  0.4867549668874172
Accuracy of NeighborhoodComponentsAnalysis: 0.4370860927152318
Accuracy of GradientBoostingClassifier: 0.46357615894039733
Accuracy of MLPClassifier 0.347682119205298


That is all for this lecture, I hope you enjoyed it and see you in the next one!