<center><h1>Modelling</h1></center>

<h2>Importing Libraries</h2>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

<h2>Importing Feature sets</h2>

In [2]:
X_train = pd.read_csv("../Processed Data/All Features/X_train.csv")
X_test = pd.read_csv("../Processed Data/All Features/X_test.csv")
Y_train = pd.read_csv("../Processed Data/All Features/Y_train.csv")
Y_test = pd.read_csv("../Processed Data/All Features/Y_test.csv")

In [3]:
y_train = Y_train["activity_code"] - 1
y_test = Y_test["activity_code"] - 1

In [4]:
y_test.value_counts()

5    537
4    532
0    496
3    491
1    471
2    420
Name: activity_code, dtype: int64

<h2>Cross Validation</h2>

It is generally a good idea to hold out a validation set, so we can measure model performance on data the model was not trained on and decide whether anything needs to change. Relying on a single split, however, makes the estimate depend on which samples happen to land in that one validation set, which might have unforeseen consequences. So instead of one split we divide the training set into several folds of roughly equal size (training size / number of folds). Each fold serves once as the validation set while the remaining folds are used for training, which gives us multiple validation scores.

We are going to use StratifiedKFold to split our training data into 5 sets of training and validation data. Stratification preserves the class distribution in every fold, which comes in handy with imbalanced datasets. Although our dataset is not severely imbalanced, it still serves the purpose.
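To make the stratification claim concrete, here is a minimal sketch on toy labels. The label counts (50 / 30 / 20) and the names `X_toy`, `y_toy`, `skf_toy` are made up for illustration and are not part of this notebook's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 50 / 30 / 20 samples for classes 0, 1, 2
y_toy = np.array([0] * 50 + [1] * 30 + [2] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for train_idx, val_idx in skf_toy.split(X_toy, y_toy):
    # Count how many samples of each class ended up in this validation fold
    counts = np.bincount(y_toy[val_idx], minlength=3).tolist()
    fold_counts.append(counts)
    print(f"validation size = {len(val_idx)}, class counts = {counts}")
```

Because each class count here divides evenly by 5, every validation fold should receive the same 10 / 6 / 4 mix, mirroring the 50 / 30 / 20 proportions of the full toy set.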

In [5]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [6]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
xgb_model = XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1, eval_metric='mlogloss')

# The models we want to test
models = {
    "Random Forest" : rf_model,
    "XGBoost" : xgb_model
}

In [91]:
# Feature sets we want to test our models on
feature_sets = {
    "All features" : (X_train, y_train)
}

In [8]:
def calculate_cross_validation_scores(models, feature_sets):
    """
    Performs k-fold cross-validation for each model on each feature set and aggregates the results.

    This function executes a comprehensive evaluation pipeline by training and testing every
    provided model on every provided dataset (feature set). It calculates key performance
    metrics (accuracy and F1 macro) including their mean and standard deviation across all folds.
    The folds come from the global `skf` splitter defined above.

    Args:
        models (dict): A dictionary of initialized model objects for evaluation.
                       Format: { 'model_name': model_instance }
        feature_sets (dict): A dictionary of datasets (feature matrices and target vectors).
                             Format: { 'feature_set_name': (X_train, y_train) }

    Returns:
        dict: A nested dictionary containing raw cross_validate results for each model-feature set combination.
              The keys are strings of the format "{model_name} {feature_set_name}".
              The values are the full output dictionaries returned by sklearn's `cross_validate` function.
    """

    model_scores = {}
    
    for model_name, model in models.items():
        for feature_name, (X_train, y_train) in feature_sets.items():
            print(f"\nScore for {model_name} and the feature set {feature_name}\n")
            
            cv_scores = cross_validate(model, X_train, y_train, cv=skf, scoring=['accuracy', 'f1_macro'])

            model_feature_name = model_name + " " + feature_name
            
            if model_feature_name not in model_scores.keys():
                model_scores[model_feature_name] = cv_scores

            mean_accuracy = cv_scores['test_accuracy'].mean()
            std_accuracy = cv_scores['test_accuracy'].std()
            f1_macro = cv_scores['test_f1_macro'].mean()
            f1_macro_std = cv_scores['test_f1_macro'].std()

            print(f"Mean accuracy across 5 folds is {mean_accuracy:.3f}")
            print(f"Standard Deviation in accuracy across 5 folds is {std_accuracy:.3f}")
            print(f"Mean f1_macro across 5 folds is {f1_macro:.3f}")
            print(f"Standard Deviation in f1_macro across 5 folds is {f1_macro_std:.3f}") 
    return model_scores

In [9]:
# The scores variable contains dictionary of dictionaries of scores for each model and feature set
scores = calculate_cross_validation_scores(models, feature_sets)


Score for Random Forest and the feature set All features

Mean accuracy across 5 folds is 0.982
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.982
Standard Deviation in f1_macro across 5 folds is 0.002

Score for XGBoost and the feature set All features

Mean accuracy across 5 folds is 0.991
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.992
Standard Deviation in f1_macro across 5 folds is 0.002


In [10]:
scores

{'Random Forest All features': {'fit_time': array([4.54524422, 1.40171099, 1.47933531, 1.20769525, 1.356987  ]),
  'score_time': array([0.02460408, 0.02986217, 0.01997781, 0.02173686, 0.01852226]),
  'test_accuracy': array([0.9789259 , 0.98300476, 0.97959184, 0.9829932 , 0.98367347]),
  'test_f1_macro': array([0.97904515, 0.98304675, 0.97966997, 0.98291896, 0.98448599])},
 'XGBoost All features': {'fit_time': array([5.75532794, 5.83246112, 7.73947215, 5.84456205, 5.45334387]),
  'score_time': array([0.06110168, 0.04615879, 0.04358673, 0.036659  , 0.03780484]),
  'test_accuracy': array([0.98844324, 0.99184228, 0.99319728, 0.99115646, 0.99251701]),
  'test_f1_macro': array([0.98844554, 0.99218359, 0.99331957, 0.99140177, 0.99278015])}}
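
The raw dictionary above is hard to compare at a glance. As a sketch, the per-fold arrays can be collapsed into a small summary table; the fold scores below are copied from the output above, while the name `summary` and its column names are illustrative choices.

```python
import numpy as np
import pandas as pd

# Per-fold scores copied from the cross_validate output shown above
scores = {
    "Random Forest All features": {
        "test_accuracy": np.array([0.9789259, 0.98300476, 0.97959184, 0.9829932, 0.98367347]),
        "test_f1_macro": np.array([0.97904515, 0.98304675, 0.97966997, 0.98291896, 0.98448599]),
    },
    "XGBoost All features": {
        "test_accuracy": np.array([0.98844324, 0.99184228, 0.99319728, 0.99115646, 0.99251701]),
        "test_f1_macro": np.array([0.98844554, 0.99218359, 0.99331957, 0.99140177, 0.99278015]),
    },
}

# One row per experiment, one column per aggregated metric
summary = pd.DataFrame({
    name: {
        "accuracy_mean": s["test_accuracy"].mean(),
        "accuracy_std": s["test_accuracy"].std(),
        "f1_macro_mean": s["test_f1_macro"].mean(),
        "f1_macro_std": s["test_f1_macro"].std(),
    }
    for name, s in scores.items()
}).T

print(summary.round(3))
```

NumPy's `.std()` is the population standard deviation (`ddof=0`), which is also what the printed summaries in this notebook use.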

<h2>Model Efficiency</h2>

Efficiency is measured as performance per feature. The scores can be interpreted as follows:
<ul>
    <li><b>High Efficiency:</b> Greater than 2.0</li>
    <li><b>Good Efficiency:</b> 1.5 - 2.0</li>
    <li><b>Moderate Efficiency:</b> 1.0 - 1.5</li>
    <li><b>Low Efficiency:</b> Less than 1.0</li>
</ul>

In [11]:
def evaluate_model_efficiency(scores, feature_sets):
    """
    Calculates a model efficiency metric for each experiment by combining predictive performance
    with feature set economy.

    The efficiency score is defined as: (Mean_Accuracy * Mean_F1_Macro * 1000) / Number_of_Features.
    This rewards models that achieve high performance with fewer features.

    Args:
        scores (dict): A dictionary where keys are experiment names and values are dictionaries
                       containing 'test_accuracy' and 'test_f1_macro' score arrays.
        feature_sets (dict): A dictionary where keys are feature set names and values are tuples
                             containing (feature_matrix, target_vector). The number of features
                             is extracted from the shape of the matrix.

    Returns:
        dict: A dictionary where keys are experiment names and values are the calculated efficiency score.
    """

    efficiencies = {}
    
    for experiment, score in scores.items():

        accuracy = score['test_accuracy'].mean()
        f1_macro = score['test_f1_macro'].mean()
        
        num_features = 0
 
        for feature_name in feature_sets.keys():
            if feature_name in experiment:
                
                num_features = feature_sets[feature_name][0].shape[1]
                break

        efficiency = (accuracy * f1_macro * 1000) / num_features
        
        print(f"Efficency per feature for {experiment} is {efficiency:.2f}")
        
        efficiencies[experiment] = efficiency
    return efficiencies

In [12]:
# efficiencies will be a dictionary of 'experiment_name': efficiency pairs
efficiencies = evaluate_model_efficiency(scores, feature_sets)

Efficency per feature for Random Forest All features is 1.72
Efficency per feature for XGBoost All features is 1.75
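
As a quick sanity check, the two values above can be reproduced by hand from the rounded cross-validation means and the 561 columns of the full feature set. The helper name `efficiency` is illustrative only; it restates the formula from `evaluate_model_efficiency`.

```python
def efficiency(mean_accuracy, mean_f1_macro, num_features):
    # (accuracy * F1 macro * 1000) / number of features
    return (mean_accuracy * mean_f1_macro * 1000) / num_features

# Rounded means taken from the cross-validation printout; 561 is the full feature count
rf_all = efficiency(0.982, 0.982, 561)
xgb_all = efficiency(0.991, 0.992, 561)

print(f"Random Forest, all features: {rf_all:.2f}")
print(f"XGBoost, all features: {xgb_all:.2f}")
```

Both results land in the "Good Efficiency" band (1.5 - 2.0) of the scale above.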


<h2>Feature Sets from Random Forest Feature Importance</h2>

In [13]:
rf_model.fit(X_train, y_train)

RandomForestClassifier(n_jobs=-1, random_state=42)

In [70]:
rf_feature_importances = rf_model.feature_importances_

In [15]:
feature_names = X_train.columns.tolist()

In [16]:
rf_importance_df = pd.DataFrame({
    "feature" : feature_names,
    "importance" : rf_feature_importances
}).sort_values("importance", ascending=False)

print("Top 10 important features are\n")
print(rf_importance_df.head(10))

Top 10 important features are

                       feature  importance
40     tGravityAcc-mean()-X_40    0.036380
49      tGravityAcc-max()-X_49    0.030331
558   angle(X,gravityMean)_558    0.029676
41     tGravityAcc-mean()-Y_41    0.025355
56   tGravityAcc-energy()-X_56    0.024963
559   angle(Y,gravityMean)_559    0.024415
52      tGravityAcc-min()-X_52    0.022650
50      tGravityAcc-max()-Y_50    0.021705
53      tGravityAcc-min()-Y_53    0.021635
57   tGravityAcc-energy()-Y_57    0.017179


In [17]:
def remove_correlated_features(X_data, feature_names, feature_importances, threshold):
    """
    Returns the set of features to drop so that, within each pair of features whose
    correlation exceeds `threshold`, only the more important one is kept.

    Note: only positive correlations above the threshold are treated as redundant.
    """
    # Pairwise Pearson correlation matrix (symmetric, indexed by feature name)
    corr_matrix = X_data.corr()

    # Look-up table: feature name -> importance
    importance_of = dict(zip(feature_importances['feature'], feature_importances['importance']))

    features_to_remove = set()

    for feature1 in feature_names:
        if feature1 in features_to_remove:
            continue

        # Features still in play that are highly correlated with feature1
        highly_correlated = [
            feature2 for feature2 in feature_names
            if feature1 != feature2
            and feature2 not in features_to_remove
            and corr_matrix.loc[feature1, feature2] > threshold
        ]

        # Of each correlated pair, drop the less important feature
        for feature2 in highly_correlated:
            if importance_of[feature1] < importance_of[feature2]:
                features_to_remove.add(feature1)
            else:
                features_to_remove.add(feature2)

    return features_to_remove

In [18]:
features_to_remove = remove_correlated_features(X_train, feature_names, rf_importance_df, 0.95)

In [19]:
len(features_to_remove)

284
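
The idea behind this pruning step is easier to see on a toy frame. The sketch below is a simplified greedy variant, not the notebook's exact function: it walks features from most to least important and keeps one only if it is not too correlated with anything already kept. The data, the importances, and the name `prune_correlated` are all made up for illustration.

```python
import pandas as pd

def prune_correlated(X, importance, threshold):
    """Keep features in descending importance, skipping any that correlate
    above `threshold` with a feature that has already been kept."""
    corr = X.corr()
    kept = []
    for name in sorted(X.columns, key=lambda c: importance[c], reverse=True):
        if all(corr.loc[name, k] <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data (hypothetical): "b" is almost a multiple of "a", "c" is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
})
toy_importance = {"a": 0.5, "b": 0.3, "c": 0.2}

kept = prune_correlated(toy, toy_importance, threshold=0.9)
print(kept)
```

Here "b" is dropped because it is nearly a copy of the more important "a", while "c" survives because it is only weakly (and negatively) correlated with "a".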

In [20]:
top_n_features = 561
test = X_train[rf_importance_df['feature'][:top_n_features]]
test.shape

(7352, 561)

In [21]:
X_train.columns.values[1]

'tBodyAcc-mean()-Y_1'

In [76]:
def create_feature_set(X_data, importance_df, top_n_features, correlation_threshold):
    
    important_features = X_data[importance_df['feature'][:top_n_features]]
    
    filtered_importance_df = importance_df[importance_df['feature'].isin(important_features.columns)] 
    
    features_to_remove = remove_correlated_features(important_features, important_features.columns, 
                                                    filtered_importance_df, correlation_threshold)
    
    final_feature_set = important_features.drop(list(features_to_remove), axis=1)
    
    return final_feature_set

In [77]:
test_set = create_feature_set(X_train, rf_importance_df, 100, 0.9)

In [78]:
test_set.shape

(7352, 36)

In [79]:
feature_counts = [80, 100, 120]
correlation_thresholds = [0.85, 0.90, 0.95]

rf_feature_sets = {}

for count in feature_counts:
    for threshold in correlation_thresholds:
        feature_set = create_feature_set(X_train, rf_importance_df, count, threshold)
        identifying_string = "RF feature count " + str(count) + " correlation threshold " + str(threshold)
        rf_feature_sets[identifying_string] = feature_set
        print(f"{identifying_string} reduced to {feature_set.shape[1]} features")

RF feature count 80 correlation threshold 0.85 reduced to 25 features
RF feature count 80 correlation threshold 0.9 reduced to 26 features
RF feature count 80 correlation threshold 0.95 reduced to 33 features
RF feature count 100 correlation threshold 0.85 reduced to 34 features
RF feature count 100 correlation threshold 0.9 reduced to 36 features
RF feature count 100 correlation threshold 0.95 reduced to 43 features
RF feature count 120 correlation threshold 0.85 reduced to 37 features
RF feature count 120 correlation threshold 0.9 reduced to 39 features
RF feature count 120 correlation threshold 0.95 reduced to 47 features


In [75]:
rf_feature_sets['RF feature count 80 correlation threshold 0.85']

Unnamed: 0,tGravityAcc-mean()-X_40,"angle(X,gravityMean)_558",tGravityAcc-mean()-Y_41,"angle(Y,gravityMean)_559",tGravityAcc-energy()-Y_57,tGravityAccMag-std()_214,"angle(Z,gravityMean)_560","fBodyAccJerk-bandsEnergy()-1,16_403","tGravityAcc-arCoeff()-Z,2_74",tGravityAcc-min()-Z_54,...,tGravityAcc-entropy()-Y_63,tGravityAccMag-arCoeff()1_222,fBodyGyro-maxInds-Z_450,"fBodyAccJerk-bandsEnergy()-1,8_381","tGravityAcc-arCoeff()-Y,2_70","tGravityAcc-arCoeff()-X,2_66","fBodyAccJerk-bandsEnergy()-9,16_382","fBodyGyro-bandsEnergy()-1,8_460",tBodyGyro-entropy()-X_142,fBodyGyro-meanFreq()-X_451
0,0.963396,-0.841247,-0.140840,0.179941,-0.970905,-0.950551,-0.058627,-0.999900,0.995675,0.056483,...,-1.0,-0.173179,-1.000000,-0.999986,0.720862,0.591146,-0.999980,-0.999865,0.082632,-0.257549
1,0.966561,-0.844788,-0.141551,0.180289,-0.970583,-0.976057,-0.054317,-0.999817,0.834271,0.102764,...,-1.0,0.081569,-1.000000,-0.999996,0.125345,0.413856,-0.999980,-0.999851,0.007469,-0.048167
2,0.966878,-0.848933,-0.142010,0.180637,-0.970368,-0.988020,-0.049118,-0.999732,0.714392,0.102764,...,-1.0,0.038049,-1.000000,-0.999994,0.270500,0.027481,-0.999944,-0.999680,-0.260943,-0.216685
3,0.967615,-0.848649,-0.143976,0.181935,-0.969400,-0.986421,-0.047663,-0.999798,0.386373,0.095753,...,-1.0,-0.092856,-0.793103,-0.999998,0.228310,0.075427,-0.999965,-0.999964,-0.930551,0.216862
4,0.968224,-0.847865,-0.148750,0.185151,-0.967051,-0.991275,-0.043892,-0.999878,0.239268,0.094059,...,-1.0,0.180441,-1.000000,-0.999995,0.089943,0.268918,-0.999983,-0.999870,-0.628861,-0.153343
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7347,0.923148,-0.791883,-0.222004,0.238604,-0.918375,-0.093688,0.049819,-0.839183,0.802503,-0.071977,...,-1.0,-0.617906,-0.793103,-0.839256,0.684225,0.660477,-0.762201,-0.818556,0.195518,-0.434780
7348,0.918343,-0.771840,-0.242054,0.252676,-0.902880,-0.148539,0.050053,-0.843252,0.721749,-0.068919,...,-1.0,-0.468825,-0.931034,-0.854278,0.654116,0.660353,-0.759555,-0.866443,0.535505,-0.516570
7349,0.919810,-0.779133,-0.236950,0.249145,-0.907561,-0.158701,0.040811,-0.840560,0.835444,-0.068919,...,-1.0,-0.492911,-0.931034,-0.815380,0.448116,0.637481,-0.791132,-0.801291,-0.107501,-0.289537
7350,0.922323,-0.785181,-0.233230,0.246432,-0.910648,-0.185720,0.025339,-0.822665,0.858624,-0.040009,...,-1.0,-0.526184,-0.793103,-0.822905,0.404027,0.666204,-0.844415,-0.890578,-0.605100,-0.362980


In [62]:

list(rf_feature_sets.keys())

['RF feature count 80 correlation threshold 0.85',
 'RF feature count 80 correlation threshold 0.9',
 'RF feature count 80 correlation threshold 0.95',
 'RF feature count 100 correlation threshold 0.85',
 'RF feature count 100 correlation threshold 0.9',
 'RF feature count 100 correlation threshold 0.95',
 'RF feature count 120 correlation threshold 0.85',
 'RF feature count 120 correlation threshold 0.9',
 'RF feature count 120 correlation threshold 0.95']

In [63]:
random_forest_feature_sets = {}
for id_string, feature_set in rf_feature_sets.items():
    random_forest_feature_sets[id_string] = (feature_set, y_train)

In [103]:
list(random_forest_feature_sets.keys())

['RF feature count 80 correlation threshold 0.85',
 'RF feature count 80 correlation threshold 0.9',
 'RF feature count 80 correlation threshold 0.95',
 'RF feature count 100 correlation threshold 0.85',
 'RF feature count 100 correlation threshold 0.9',
 'RF feature count 100 correlation threshold 0.95',
 'RF feature count 120 correlation threshold 0.85',
 'RF feature count 120 correlation threshold 0.9',
 'RF feature count 120 correlation threshold 0.95']

<h2>XGBoost Feature Importance</h2>

In [66]:
xgb_model.fit(X_train, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='mlogloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=-1, num_parallel_tree=None, objective='multi:softprob', ...)

In [67]:
xgb_feature_importances = xgb_model.feature_importances_

In [72]:
xgb_importance_df = pd.DataFrame({
    "feature" : feature_names,
    "importance" : xgb_feature_importances
}).sort_values("importance", ascending=False)
xgb_importance_df.head()

                                 feature  importance
330       fBodyAcc-bandsEnergy()-1,8_330    0.072055
52                tGravityAcc-min()-X_52    0.063094
296            fBodyAcc-skewness()-X_296    0.046372
410  fBodyAccJerk-bandsEnergy()-9,16_410    0.039299
201                tBodyAccMag-std()_201    0.037172


In [80]:
xgb_feature_sets = {}

for count in feature_counts:
    for threshold in correlation_thresholds:
        feature_set = create_feature_set(X_train, xgb_importance_df, count, threshold)
        identifying_string = "XGB feature count " + str(count) + " correlation threshold " + str(threshold)
        xgb_feature_sets[identifying_string] = feature_set
        print(f"{identifying_string} reduced to {feature_set.shape[1]} features")

XGB feature count 80 correlation threshold 0.85 reduced to 37 features
XGB feature count 80 correlation threshold 0.9 reduced to 45 features
XGB feature count 80 correlation threshold 0.95 reduced to 53 features
XGB feature count 100 correlation threshold 0.85 reduced to 46 features
XGB feature count 100 correlation threshold 0.9 reduced to 53 features
XGB feature count 100 correlation threshold 0.95 reduced to 65 features
XGB feature count 120 correlation threshold 0.85 reduced to 54 features
XGB feature count 120 correlation threshold 0.9 reduced to 61 features
XGB feature count 120 correlation threshold 0.95 reduced to 75 features


In [83]:
xgboost_feature_sets = {}
for id_string, feature_set in xgb_feature_sets.items():
    xgboost_feature_sets[id_string] = (feature_set, y_train)

<h2>Importing Other Feature Sets</h2>

In [85]:
X_train_anova_60 = pd.read_csv("../Processed Data/ANOVA hybrid set/X_train_anova_filtered_60.csv")
X_test_anova_60 = pd.read_csv("../Processed Data/ANOVA hybrid set/X_test_anova_filtered_60.csv")

In [86]:
X_train_mean_std = pd.read_csv("../Processed Data/Mean Features/feature_reduced_X_train.csv")
X_test_mean_std = pd.read_csv("../Processed Data/Mean Features/feature_reduced_X_test.csv")

In [92]:
other_feature_sets = {
    "anova_60" : (X_train_anova_60, y_train),
    "mean_std" : (X_train_mean_std, y_train)
}

feature_sets.update(other_feature_sets)

In [93]:
list(feature_sets.keys())

['All features', 'anova_60', 'mean_std']

In [94]:
rf_scores = calculate_cross_validation_scores({"Random Forest" : rf_model}, random_forest_feature_sets)


Score for Random Forest and the feature set RF feature count 80 correlation threshold 0.85

Mean accuracy across 5 folds is 0.978
Standard Deviation in accuracy across 5 folds is 0.003
Mean f1_macro across 5 folds is 0.978
Standard Deviation in f1_macro across 5 folds is 0.003

Score for Random Forest and the feature set RF feature count 80 correlation threshold 0.9

Mean accuracy across 5 folds is 0.978
Standard Deviation in accuracy across 5 folds is 0.003
Mean f1_macro across 5 folds is 0.978
Standard Deviation in f1_macro across 5 folds is 0.003

Score for Random Forest and the feature set RF feature count 80 correlation threshold 0.95

Mean accuracy across 5 folds is 0.980
Standard Deviation in accuracy across 5 folds is 0.003
Mean f1_macro across 5 folds is 0.980
Standard Deviation in f1_macro across 5 folds is 0.003

Score for Random Forest and the feature set RF feature count 100 correlation threshold 0.85

Mean accuracy across 5 folds is 0.979
Standard Deviation in accuracy a

In [95]:
rf_efficiencies = evaluate_model_efficiency(rf_scores, random_forest_feature_sets)

Efficency per feature for Random Forest RF feature count 80 correlation threshold 0.85 is 38.29
Efficency per feature for Random Forest RF feature count 80 correlation threshold 0.9 is 36.78
Efficency per feature for Random Forest RF feature count 80 correlation threshold 0.95 is 36.93
Efficency per feature for Random Forest RF feature count 100 correlation threshold 0.85 is 28.18
Efficency per feature for Random Forest RF feature count 100 correlation threshold 0.9 is 26.71
Efficency per feature for Random Forest RF feature count 100 correlation threshold 0.95 is 26.67
Efficency per feature for Random Forest RF feature count 120 correlation threshold 0.85 is 25.93
Efficency per feature for Random Forest RF feature count 120 correlation threshold 0.9 is 24.66
Efficency per feature for Random Forest RF feature count 120 correlation threshold 0.95 is 24.67


In [96]:
best_rf_experiment = max(rf_efficiencies, key=rf_efficiencies.get)

In [129]:
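# Strip the "Random Forest " model-name prefix to recover the feature set key
most_efficent_rf_model_id = best_rf_experiment[len("Random Forest "):]
# Note: this is the (feature_matrix, target_vector) tuple, not just the feature matrix
X_train_efficent_rf = random_forest_feature_sets[most_efficent_rf_model_id]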

In [132]:
X_train_efficent_rf

(      tGravityAcc-mean()-X_40  angle(X,gravityMean)_558  \
 0                    0.963396                 -0.841247   
 1                    0.966561                 -0.844788   
 2                    0.966878                 -0.848933   
 3                    0.967615                 -0.848649   
 4                    0.968224                 -0.847865   
 ...                       ...                       ...   
 7347                 0.923148                 -0.791883   
 7348                 0.918343                 -0.771840   
 7349                 0.919810                 -0.779133   
 7350                 0.922323                 -0.785181   
 7351                 0.918707                 -0.783267   
 
       tGravityAcc-mean()-Y_41  angle(Y,gravityMean)_559  \
 0                   -0.140840                  0.179941   
 1                   -0.141551                  0.180289   
 2                   -0.142010                  0.180637   
 3                   -0.143976        

In [118]:
print(f"The most efficent model is {best_rf_experiment} with {random_forest_feature_sets[most_efficent_rf_model_id][0].shape[1]} features")

The most efficent model is Random Forest RF feature count 80 correlation threshold 0.85 with 25 features


<h3>But with only 25 features, the model might not capture all the patterns and may perform poorly on the testing set. So let us also consider one more feature set with higher accuracy, so we can make a final decision when we do our final testing.</h3>

Our most accurate model appears to be RF feature count 120 correlation threshold 0.95, so this will be the other feature set considered for final testing.

In [110]:
most_accurate_rf_model_id = "RF feature count 120 correlation threshold 0.95"
print(f"The most accurate model is Random Forest {most_accurate_rf_model_id} with {random_forest_feature_sets[most_accurate_rf_model_id][0].shape[1]} features")

The most accurate model is Random Forest RF feature count 120 correlation threshold 0.95 with 47 features


<h2>Finalizing XGBoost feature sets</h2>

In [98]:
xgb_scores = calculate_cross_validation_scores({"XGBoost" : xgb_model}, xgboost_feature_sets)


Score for XGBoost and the feature set XGB feature count 80 correlation threshold 0.85

Mean accuracy across 5 folds is 0.988
Standard Deviation in accuracy across 5 folds is 0.001
Mean f1_macro across 5 folds is 0.988
Standard Deviation in f1_macro across 5 folds is 0.001

Score for XGBoost and the feature set XGB feature count 80 correlation threshold 0.9

Mean accuracy across 5 folds is 0.989
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.989
Standard Deviation in f1_macro across 5 folds is 0.002

Score for XGBoost and the feature set XGB feature count 80 correlation threshold 0.95

Mean accuracy across 5 folds is 0.990
Standard Deviation in accuracy across 5 folds is 0.001
Mean f1_macro across 5 folds is 0.990
Standard Deviation in f1_macro across 5 folds is 0.001

Score for XGBoost and the feature set XGB feature count 100 correlation threshold 0.85

Mean accuracy across 5 folds is 0.987
Standard Deviation in accuracy across 5 folds is 0.0

In [100]:
xgb_efficiencies = evaluate_model_efficiency(xgb_scores, xgboost_feature_sets)

Efficency per feature for XGBoost XGB feature count 80 correlation threshold 0.85 is 26.38
Efficency per feature for XGBoost XGB feature count 80 correlation threshold 0.9 is 21.72
Efficency per feature for XGBoost XGB feature count 80 correlation threshold 0.95 is 21.79
Efficency per feature for XGBoost XGB feature count 100 correlation threshold 0.85 is 21.19
Efficency per feature for XGBoost XGB feature count 100 correlation threshold 0.9 is 18.39
Efficency per feature for XGBoost XGB feature count 100 correlation threshold 0.95 is 18.46
Efficency per feature for XGBoost XGB feature count 120 correlation threshold 0.85 is 18.16
Efficency per feature for XGBoost XGB feature count 120 correlation threshold 0.9 is 16.08
Efficency per feature for XGBoost XGB feature count 120 correlation threshold 0.95 is 16.08


In [112]:
best_xgb_experiment = max(xgb_efficiencies, key=xgb_efficiencies.get)
best_xgb_experiment

'XGBoost XGB feature count 80 correlation threshold 0.85'

In [115]:
most_efficent_xgb_model_id = best_xgb_experiment[8:]
most_efficent_xgb_model_id


'XGB feature count 80 correlation threshold 0.85'

In [119]:
print(f"The most efficent model is {best_xgb_experiment} with {xgboost_feature_sets[most_efficent_xgb_model_id][0].shape[1]} features")

The most efficent model is XGBoost XGB feature count 80 correlation threshold 0.85 with 37 features


If we look at the results, the best accuracy is 99%, which is achieved by 4 different feature sets. We pick the one with the fewest features, since the additional features did nothing for the model's accuracy.

So we pick feature count 80, correlation threshold 0.95 as the most accurate one.

In [125]:
most_accurate_xgb_model_id = 'XGB feature count 80 correlation threshold 0.95'

In [126]:
print(f"The most accurate model is {most_accurate_xgb_model_id} with {xgboost_feature_sets[most_accurate_xgb_model_id][0].shape[1]} features")

The most accurate model is XGB feature count 80 correlation threshold 0.95 with 53 features


<h3>If we look at both the Random Forest and XGBoost models, the feature set with the fewest features is often the most efficient one. This follows directly from how evaluate_model_efficiency is defined: dividing by the number of features biases the score towards the smallest feature sets.</h3>

This is why selecting another, highly accurate feature set is also necessary, so that we do not end up at test time with models that perform poorly.
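
The size of this bias can be illustrated with a few numbers. In the sketch below, the first two cases use rounded means reported earlier in this notebook, so the results differ slightly from the printed efficiencies. The third case is a purely hypothetical model invented to show how far the metric can be pushed by feature count alone.

```python
def efficiency(mean_accuracy, mean_f1_macro, num_features):
    # Same formula as evaluate_model_efficiency
    return (mean_accuracy * mean_f1_macro * 1000) / num_features

# Random Forest on all 561 features (rounded means from the notebook)
full = efficiency(0.982, 0.982, 561)
# Random Forest on the 25-feature set (rounded means from the notebook)
compact = efficiency(0.978, 0.978, 25)
# Hypothetical: a much weaker model that happens to use only 10 features
weak_but_tiny = efficiency(0.80, 0.80, 10)

print(f"561 features, 0.982 accuracy: {full:.2f}")
print(f" 25 features, 0.978 accuracy: {compact:.2f}")
print(f" 10 features, 0.800 accuracy: {weak_but_tiny:.2f}")
```

Under this metric the hypothetical 80%-accuracy model would score highest of the three, which is why efficiency is treated here as one criterion alongside raw accuracy rather than as the sole selection rule.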

<h2>Baseline Establishment</h2>

In [134]:
final_feature_sets = {
    "All Features" : (X_train, y_train),
    "Mean std features" : (X_train_mean_std, y_train),
    "Anova features" : (X_train_anova_60, y_train),
    "Most efficent Random Forest features" : random_forest_feature_sets[most_efficent_rf_model_id],
    "Most accurate Random Forest features" : random_forest_feature_sets[most_accurate_rf_model_id],
    "Most efficent XGBoost features" : xgboost_feature_sets[most_efficent_xgb_model_id],
    "Most accurate XGBoost features" : xgboost_feature_sets[most_accurate_xgb_model_id]
}

In [135]:
# Comprehensive baseline evaluation
final_scores = calculate_cross_validation_scores(models, final_feature_sets)
final_efficiencies = evaluate_model_efficiency(final_scores, final_feature_sets)


Score for Random Forest and the feature set All Features

Mean accuracy across 5 folds is 0.982
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.982
Standard Deviation in f1_macro across 5 folds is 0.002

Score for Random Forest and the feature set Mean std features

Mean accuracy across 5 folds is 0.978
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.978
Standard Deviation in f1_macro across 5 folds is 0.002

Score for Random Forest and the feature set Anova features

Mean accuracy across 5 folds is 0.984
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.984
Standard Deviation in f1_macro across 5 folds is 0.002

Score for Random Forest and the feature set Most efficent Random Forest features

Mean accuracy across 5 folds is 0.978
Standard Deviation in accuracy across 5 folds is 0.003
Mean f1_macro across 5 folds is 0.978
Standard Deviation in f1_macro across 5 fold