<center><h1>Modelling</h1></center>

<h2>Importing Libraries</h2>

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

<h2>Importing Feature sets</h2>

In [14]:
X_train = pd.read_csv("../Processed Data/All Features/X_train.csv")
X_test = pd.read_csv("../Processed Data/All Features/X_test.csv")
Y_train = pd.read_csv("../Processed Data/All Features/Y_train.csv")
Y_test = pd.read_csv("../Processed Data/All Features/Y_test.csv")

In [15]:
y_train = Y_train["activity_code"] - 1
y_test = Y_test["activity_code"] - 1

In [16]:
y_test.value_counts()

5    537
4    532
0    496
3    491
1    471
2    420
Name: activity_code, dtype: int64

<h2>Cross Validation</h2>

It is generally a good idea to have a validation set, so we can test our model performance on validation and see if there is something we need to do. But having one split may result in depending solely on a single set, which might have unforseen consequesnces. So that is why instead of just 1 we have multiple splits, on the training set, in a training size /  number of folds (or) sets ratio, wo that we can have multiple validation sets while the remaining sets are used for training.

We are going to be using StratifiegKFold for splitting our training data into 5 sets of training and validation data. Stratified preserves the class distribution which comes handy with Imbalanced datasets. Althought our dataset is not severely imbalanced, it serves the purpose.

In [17]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [18]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
xgb_model = XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1, eval_metric='mlogloss')

models = {
    "Random Forest" : rf_model,
    "XGBoost" : xgb_model
}

In [19]:
feature_sets = {
    "All features" : (X_train, y_train)
}

In [20]:
"""
Performs k-fold cross-validation for each model on each feature set and aggregates the results.

Args:
    models (dict): Format: { 'model_name': model_instance }
    feature_sets (dict): Format: { 'feature_set_name': (X_train, y_train) }

Returns:
    dict: A nested dictionary containing raw cross_validate results for each model-feature set combination.
"""


def calculate_cross_validation_scores(models, feature_sets):
    
    model_scores = {}
    
    for model_name, model in models.items():
        for feature_name, (X_train, y_train) in feature_sets.items():
            print(f"\nScore for {model_name} and the feature set {feature_name}\n")
            
            cv_scores = cross_validate(model, X_train, y_train, cv=skf, scoring=['accuracy', 'f1_macro'])

            model_feature_name = model_name + " " + feature_name
            
            if model_feature_name not in model_scores.keys():
                model_scores[model_feature_name] = cv_scores

            mean_accuracy = cv_scores['test_accuracy'].mean()
            std_accuracy = cv_scores['test_accuracy'].std()
            f1_macro = cv_scores['test_f1_macro'].mean()
            f1_macro_std = cv_scores['test_f1_macro'].std()

            print(f"Mean accuracy across 5 folds is {mean_accuracy:.3f}")
            print(f"Standard Deviation in accuracy across 5 folds is {std_accuracy:.3f}")
            print(f"Mean f1_macro across 5 folds is {f1_macro:.3f}")
            print(f"Standard Deviation in f1_macro across 5 folds is {f1_macro_std:.3f}") 
    return model_scores

In [21]:
scores = calculate_cross_validation_scores(models, feature_sets)


Score for Random Forest and the feature set All features

Mean accuracy across 5 folds is 0.982
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.982
Standard Deviation in f1_macro across 5 folds is 0.002

Score for XGBoost and the feature set All features

Mean accuracy across 5 folds is 0.991
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.992
Standard Deviation in f1_macro across 5 folds is 0.002


In [22]:
scores

{'Random Forest All features': {'fit_time': array([4.71441412, 1.19176006, 1.38288283, 1.19029713, 1.20343184]),
  'score_time': array([0.02812099, 0.02442789, 0.01977324, 0.02151394, 0.02130103]),
  'test_accuracy': array([0.9789259 , 0.98300476, 0.97959184, 0.9829932 , 0.98367347]),
  'test_f1_macro': array([0.97904515, 0.98304675, 0.97966997, 0.98291896, 0.98448599])},
 'XGBoost All features': {'fit_time': array([5.30612469, 5.26603103, 5.31443477, 5.21232986, 5.43376279]),
  'score_time': array([0.04024911, 0.03685904, 0.04674101, 0.03651094, 0.04222679]),
  'test_accuracy': array([0.98844324, 0.99184228, 0.99319728, 0.99115646, 0.99251701]),
  'test_f1_macro': array([0.98844554, 0.99218359, 0.99331957, 0.99140177, 0.99278015])}}

<h2>Model Efficency</h2>

Efficency can be determined as performance per feature. The scores can be interpreted as follows:
<ul>
    <li><b>High Efficiency:</b> Greater than 2.0</li>
    <li><b>Good Efficency:</b> 1.5 - 2.0</li>
    <li><b>Moderate Efficency:</b> 1.0-1.5</li>
    <li><b>Low Efficency:</b>  Less than 1.0</li>
</ul>

In [23]:
"""
Calculates a model efficiency metric for each experiment by combining predictive performance
with feature set economy.

Args:
    scores (dict): A dictionary where keys are experiment names and values are dictionaries
                   containing 'test_accuracy' and 'test_f1_macro' score arrays.
    feature_sets (dict): A dictionary where keys are feature set names and values are tuples
                         containing (feature_matrix, target_vector).

Returns:
    dict: A dictionary where keys are experiment names and values are the calculated efficiency score.
"""


def evaluate_model_efficiency(scores, feature_sets):
    
    efficiencies = {}
    
    for experiment, score in scores.items():

        accuracy = score['test_accuracy'].mean()
        f1_macro = score['test_f1_macro'].mean()
        
        num_features = 0
 
        for feature_name in feature_sets.keys():
            if feature_name in experiment:
                
                num_features = feature_sets[feature_name][0].shape[1]
                break

        efficiency = (accuracy * f1_macro * 1000) / num_features
        
        print(f"Efficency per feature for {experiment} is {efficiency:.2f}")
        
        efficiencies[experiment] = efficiency
    return efficiencies

In [25]:
efficiencies = evaluate_model_efficiency(scores, feature_sets)

Efficency per feature for Random Forest All features is 1.72
Efficency per feature for XGBoost All features is 1.75


<h2>Feature set on Random Forest Feature Importance</h2>

In [26]:
rf_model.fit(X_train, y_train)

RandomForestClassifier(n_jobs=-1, random_state=42)

In [27]:
rf_feature_importances = rf_model.feature_importances_

In [28]:
feature_names = X_train.columns.tolist()

In [30]:
rf_importance_df = pd.DataFrame({
    "feature" : feature_names,
    "importance" : rf_feature_importances
}).sort_values("importance", ascending=False)

print("Top 10 important features are\n")
print(rf_importance_df.head(10))

Top 10 important features are

                       feature  importance
40     tGravityAcc-mean()-X_40    0.036380
49      tGravityAcc-max()-X_49    0.030331
558   angle(X,gravityMean)_558    0.029676
41     tGravityAcc-mean()-Y_41    0.025355
56   tGravityAcc-energy()-X_56    0.024963
559   angle(Y,gravityMean)_559    0.024415
52      tGravityAcc-min()-X_52    0.022650
50      tGravityAcc-max()-Y_50    0.021705
53      tGravityAcc-min()-Y_53    0.021635
57   tGravityAcc-energy()-Y_57    0.017179


<b>The total of importance for all the feature combined should be equal to 1.</b>

<h2>Highly co-related Features Removal</h2>

Since the second phase of our goal is to remove one of the two highly related features based on the correlation threshold, we need to have a function to do it a number of times.

In [36]:
def remove_correlated_features(X_data, feature_names, feature_importances, threshold):
    
    """
   Removes highly correlated features while preserving those with highest importance scores.
   
   Args:
       X_data: Feature matrix for correlation calculation
       feature_names: List of feature column names
       feature_importances: DataFrame with feature importance scores
       threshold: Correlation threshold above which features are considered redundant
       
   Returns:
       set: Feature names to be removed from the dataset
       
    """
    
    corr_series = X_data.corr()
    
    features_to_remove = set()
    
    for feature1 in feature_names:
        if feature1 in features_to_remove:
            continue
            
        highly_correlated = []
        for feature2 in feature_names:
            if feature1 != feature2 and feature2 not in features_to_remove:
                try:
                    corr_value = corr_series.loc[feature1, feature2]
                except KeyError:
                    corr_value = corr_series.loc[feature2, feature1]
                if corr_value > threshold:
                    highly_correlated.append(feature2)
                    
        for feature2 in highly_correlated:
            f1 = feature_importances[feature_importances['feature'] == feature1].values[0][1]
            f2 = feature_importances[feature_importances['feature'] == feature2].values[0][1]
            
            if f1 < f2:
                features_to_remove.add(feature1)
            else:
                features_to_remove.add(feature2)
                
    return features_to_remove

In [37]:
features_to_remove = remove_correlated_features(X_train, feature_names, rf_importance_df, 0.95)

After knowing what features to remove, we have to filter out the features to be removed from the existing importance feature set to get the final feature set.

In [38]:
def create_feature_set(X_data, importance_df, top_n_features, correlation_threshold):
    """
    Creates optimized feature set by selecting top-N important features and removing correlations.
   
   Args:
       X_data: Input feature matrix
       importance_df: DataFrame containing feature importance scores sorted in descending order
       top_n_features: Number of top features to select based on importance
       correlation_threshold: Correlation threshold for feature removal
       
   Returns:
       DataFrame: Filtered feature set with reduced dimensionality and multicollinearity
   """
    
    important_features = X_data[importance_df['feature'][:top_n_features]]
    
    filtered_importance_df = importance_df[importance_df['feature'].isin(important_features.columns)] 
    
    features_to_remove = remove_correlated_features(important_features, important_features.columns, 
                                                    filtered_importance_df, correlation_threshold)
    
    final_feature_set = important_features.drop(list(features_to_remove), axis=1)
    
    return final_feature_set

<h2>Creating Feature Sets</h2>

We are going to use multiple choices for both the number of features and correlation thresholds, so we get multiple different feature sets, we can choose from based on their accuracy and f1_macro scores.

In [39]:
feature_counts = [80, 100, 120]
correlation_thresholds = [0.85, 0.90, 0.95]

rf_feature_sets = {}

for count in feature_counts:
    for threshold in correlation_thresholds:
        feature_set = create_feature_set(X_train, rf_importance_df, count, threshold)
        identifying_string = "RF feature count " + str(count) + " correlation threshold " + str(threshold)
        rf_feature_sets[identifying_string] = feature_set
        print(f"{identifying_string} reduced to {feature_set.shape[1]} features")

RF feature count 80 correlation threshold 0.85 reduced to 25 features
RF feature count 80 correlation threshold 0.9 reduced to 26 features
RF feature count 80 correlation threshold 0.95 reduced to 33 features
RF feature count 100 correlation threshold 0.85 reduced to 34 features
RF feature count 100 correlation threshold 0.9 reduced to 36 features
RF feature count 100 correlation threshold 0.95 reduced to 43 features
RF feature count 120 correlation threshold 0.85 reduced to 37 features
RF feature count 120 correlation threshold 0.9 reduced to 39 features
RF feature count 120 correlation threshold 0.95 reduced to 47 features


<b>Let us go ahead and arrange them in a dictionary, so it is easy to access when we want to work on them.</b>

In [40]:
random_forest_feature_sets = {}
for id_string, feature_set in rf_feature_sets.items():
    random_forest_feature_sets[id_string] = (feature_set, y_train)

<h2>XGBoost Feature Importance</h2>

Just as we made different feature sets based on feature importance and correlation removal based Random Forest, we are going to do the same for XGBoost.

In [42]:
xgb_model.fit(X_train, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='mlogloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=-1, num_parallel_tree=None, objective='multi:softprob', ...)

In [43]:
xgb_feature_importances = xgb_model.feature_importances_

In [45]:
xgb_importance_df = pd.DataFrame({
    "feature" : feature_names,
    "importance" : xgb_feature_importances
}).sort_values("importance", ascending=False)

print(xgb_importance_df.head())

                                 feature  importance
330       fBodyAcc-bandsEnergy()-1,8_330    0.072055
52                tGravityAcc-min()-X_52    0.063094
296            fBodyAcc-skewness()-X_296    0.046372
410  fBodyAccJerk-bandsEnergy()-9,16_410    0.039299
201                tBodyAccMag-std()_201    0.037172


In [46]:
xgb_feature_sets = {}

for count in feature_counts:
    for threshold in correlation_thresholds:
        feature_set = create_feature_set(X_train, xgb_importance_df, count, threshold)
        identifying_string = "XGB feature count " + str(count) + " correlation threshold " + str(threshold)
        xgb_feature_sets[identifying_string] = feature_set
        print(f"{identifying_string} reduced to {feature_set.shape[1]} features")

XGB feature count 80 correlation threshold 0.85 reduced to 37 features
XGB feature count 80 correlation threshold 0.9 reduced to 45 features
XGB feature count 80 correlation threshold 0.95 reduced to 53 features
XGB feature count 100 correlation threshold 0.85 reduced to 46 features
XGB feature count 100 correlation threshold 0.9 reduced to 53 features
XGB feature count 100 correlation threshold 0.95 reduced to 65 features
XGB feature count 120 correlation threshold 0.85 reduced to 54 features
XGB feature count 120 correlation threshold 0.9 reduced to 61 features
XGB feature count 120 correlation threshold 0.95 reduced to 75 features


In [47]:
xgboost_feature_sets = {}
for id_string, feature_set in xgb_feature_sets.items():
    xgboost_feature_sets[id_string] = (feature_set, y_train)

<h2> Importing other Feature Sets</h2>

We got two other feature sets we have from the preprocessing dictionary, which we are going to be evaluating in this notebook.

In [48]:
X_train_anova_60 = pd.read_csv("../Processed Data/ANOVA hybrid set/X_train_anova_filtered_60.csv")
X_test_anova_60 = pd.read_csv("../Processed Data/ANOVA hybrid set/X_test_anova_filtered_60.csv")

In [49]:
X_train_mean_std = pd.read_csv("../Processed Data/Mean Features/feature_reduced_X_train.csv")
X_test_mean_std = pd.read_csv("../Processed Data/Mean Features/feature_reduced_X_test.csv")

In [50]:
other_feature_sets = {
    "anova_60" : (X_train_anova_60, y_train),
    "mean_std" : (X_train_mean_std, y_train)
}

feature_sets.update(other_feature_sets)

In [51]:
rf_scores = calculate_cross_validation_scores({"Random Forest" : rf_model}, random_forest_feature_sets)


Score for Random Forest and the feature set RF feature count 80 correlation threshold 0.85

Mean accuracy across 5 folds is 0.978
Standard Deviation in accuracy across 5 folds is 0.003
Mean f1_macro across 5 folds is 0.978
Standard Deviation in f1_macro across 5 folds is 0.003

Score for Random Forest and the feature set RF feature count 80 correlation threshold 0.9

Mean accuracy across 5 folds is 0.978
Standard Deviation in accuracy across 5 folds is 0.003
Mean f1_macro across 5 folds is 0.978
Standard Deviation in f1_macro across 5 folds is 0.003

Score for Random Forest and the feature set RF feature count 80 correlation threshold 0.95

Mean accuracy across 5 folds is 0.980
Standard Deviation in accuracy across 5 folds is 0.003
Mean f1_macro across 5 folds is 0.980
Standard Deviation in f1_macro across 5 folds is 0.003

Score for Random Forest and the feature set RF feature count 100 correlation threshold 0.85

Mean accuracy across 5 folds is 0.979
Standard Deviation in accuracy a

<ul>
    <li>We can see that the higheset accuracy is given by 120 feature .95 correlation feature set which is 98.1%.</li>
    <li>The accuracies do not differ very much although the number of features increase as we do from top to bottom.</li>
</ul>

Let us also calculate the efficency of the feature set.

<b>One thing to be aware of is the feature sets with the lowest number of features yield higher efficency, since the information the model is able to capture after the first few features will be redundant and hence the value added by the first few features cannot be added by the next ones.</b>

In [56]:
rf_efficiencies = evaluate_model_efficiency(rf_scores, random_forest_feature_sets)

Efficency per feature for Random Forest RF feature count 80 correlation threshold 0.85 is 38.29
Efficency per feature for Random Forest RF feature count 80 correlation threshold 0.9 is 36.78
Efficency per feature for Random Forest RF feature count 80 correlation threshold 0.95 is 36.93
Efficency per feature for Random Forest RF feature count 100 correlation threshold 0.85 is 28.18
Efficency per feature for Random Forest RF feature count 100 correlation threshold 0.9 is 26.71
Efficency per feature for Random Forest RF feature count 100 correlation threshold 0.95 is 26.67
Efficency per feature for Random Forest RF feature count 120 correlation threshold 0.85 is 25.93
Efficency per feature for Random Forest RF feature count 120 correlation threshold 0.9 is 24.66
Efficency per feature for Random Forest RF feature count 120 correlation threshold 0.95 is 24.67


In [57]:
best_rf_experiment = max(rf_efficiencies, key=rf_efficiencies.get)

In [58]:
most_efficent_rf_model_id = best_rf_experiment[14:]
most_efficent_rf_model_id
X_train_efficent_rf = random_forest_feature_sets[most_efficent_rf_model_id]

In [59]:
print(f"The most efficent model is {best_rf_experiment} with {random_forest_feature_sets[most_efficent_rf_model_id][0].shape[1]} features")

The most efficent model is Random Forest RF feature count 80 correlation threshold 0.85 with 25 features


<h3>But with only 25 features, the model might not be able to capture all the patterns and perform bad on the testing set. So let us also consider one more set with higher accuracy, so we can come to a final decision when we do our final testing.</h3>

Looks like our best performing model is RF feature count 120 correlation threshold 0.95, so this will be our other feature set being considered for final testing.

In [60]:
most_accurate_rf_model_id = "RF feature count 120 correlation threshold 0.95"
print(f"The most accurate model is Random Forest{most_accurate_rf_model_id} with {random_forest_feature_sets[most_accurate_rf_model_id][0].shape[1]} features")

The most accurate model is Random ForestRF feature count 120 correlation threshold 0.95 with 47 features


<h2>Finalizing XGBoost feature sets</h2>

In [61]:
xgb_scores = calculate_cross_validation_scores({"XGBoost" : xgb_model}, xgboost_feature_sets)


Score for XGBoost and the feature set XGB feature count 80 correlation threshold 0.85

Mean accuracy across 5 folds is 0.988
Standard Deviation in accuracy across 5 folds is 0.001
Mean f1_macro across 5 folds is 0.988
Standard Deviation in f1_macro across 5 folds is 0.001

Score for XGBoost and the feature set XGB feature count 80 correlation threshold 0.9

Mean accuracy across 5 folds is 0.989
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.989
Standard Deviation in f1_macro across 5 folds is 0.002

Score for XGBoost and the feature set XGB feature count 80 correlation threshold 0.95

Mean accuracy across 5 folds is 0.990
Standard Deviation in accuracy across 5 folds is 0.001
Mean f1_macro across 5 folds is 0.990
Standard Deviation in f1_macro across 5 folds is 0.001

Score for XGBoost and the feature set XGB feature count 100 correlation threshold 0.85

Mean accuracy across 5 folds is 0.987
Standard Deviation in accuracy across 5 folds is 0.0

In [64]:
xgb_efficiencies = evaluate_model_efficiency(xgb_scores, xgboost_feature_sets)

Efficency per feature for XGBoost XGB feature count 80 correlation threshold 0.85 is 26.38
Efficency per feature for XGBoost XGB feature count 80 correlation threshold 0.9 is 21.72
Efficency per feature for XGBoost XGB feature count 80 correlation threshold 0.95 is 21.79
Efficency per feature for XGBoost XGB feature count 100 correlation threshold 0.85 is 21.19
Efficency per feature for XGBoost XGB feature count 100 correlation threshold 0.9 is 18.39
Efficency per feature for XGBoost XGB feature count 100 correlation threshold 0.95 is 18.46
Efficency per feature for XGBoost XGB feature count 120 correlation threshold 0.85 is 18.16
Efficency per feature for XGBoost XGB feature count 120 correlation threshold 0.9 is 16.08
Efficency per feature for XGBoost XGB feature count 120 correlation threshold 0.95 is 16.08


In [65]:
best_xgb_experiment = max(xgb_efficiencies, key=xgb_efficiencies.get)

In [68]:
most_efficent_xgb_model_id = best_xgb_experiment[8:]

In [69]:
print(f"The most efficent model is {best_xgb_experiment} with {xgboost_feature_sets[most_efficent_xgb_model_id][0].shape[1]} features")

The most efficent model is XGBoost XGB feature count 80 correlation threshold 0.85 with 37 features


If we look at the results the best accuracy is 99%, which is given 4 different feature sets. So we pick the one with the least number of features, since it can be noted that the additional features did not do any good to the model's accuracy.

So we pick Feature count 80 threshold 0.85 for the most accurate one.

In [70]:
most_accurate_xgb_model_id = 'XGB feature count 80 correlation threshold 0.95'

In [71]:
print(f"The most efficent model is {most_accurate_xgb_model_id} with {xgboost_feature_sets[most_accurate_xgb_model_id][0].shape[1]} features")

The most efficent model is XGB feature count 80 correlation threshold 0.95 with 53 features


<h3>If we look at both Random Forest and XGBoost models, the feature set with least number of features is often the most efficent one. This is obviously biased towards the feature set with least features if we look at how we are evaluating our model efficency in our evaluate_model_efficiency function.</h3> 

So this is why selecting another feature set that is highly accurate is also neccesary, so that while testing we do not end up with models that does not perform good.

<h2>Baseline Establishment</h2>

In [72]:
final_feature_sets = {
    "All Features" : (X_train, y_train),
    "Mean std features" : (X_train_mean_std, y_train),
    "Anova features" : (X_train_anova_60, y_train),
    "Most efficent Random Forest features" : random_forest_feature_sets[most_efficent_rf_model_id],
    "Most accurate Random Forest features" : random_forest_feature_sets[most_accurate_rf_model_id],
    "Most efficent XGBoost features" : xgboost_feature_sets[most_efficent_xgb_model_id],
    "Most accurate XGBoost features" : xgboost_feature_sets[most_accurate_xgb_model_id]
}

In [73]:
final_feature_sets['Most efficent Random Forest features'][0].shape[1]

25

In [74]:
final_scores = calculate_cross_validation_scores(models, final_feature_sets)
final_efficiencies = evaluate_model_efficiency(final_scores, final_feature_sets)


Score for Random Forest and the feature set All Features

Mean accuracy across 5 folds is 0.982
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.982
Standard Deviation in f1_macro across 5 folds is 0.002

Score for Random Forest and the feature set Mean std features

Mean accuracy across 5 folds is 0.978
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.978
Standard Deviation in f1_macro across 5 folds is 0.002

Score for Random Forest and the feature set Anova features

Mean accuracy across 5 folds is 0.984
Standard Deviation in accuracy across 5 folds is 0.002
Mean f1_macro across 5 folds is 0.984
Standard Deviation in f1_macro across 5 folds is 0.002

Score for Random Forest and the feature set Most efficent Random Forest features

Mean accuracy across 5 folds is 0.978
Standard Deviation in accuracy across 5 folds is 0.003
Mean f1_macro across 5 folds is 0.978
Standard Deviation in f1_macro across 5 fold

<b>There are a few things we can infer from the results</b>
<ul>
    <li>
        The feature set with the highest accuracy is the model which is trained on all the features and XGBoost. <b>This will be out baseline model.</b></li>
    <li>But the most efficent model we have is the XGBoost model trained on feature set that is extracted from Random Forest feature importances which has 25 features in it with an efficency value of 38.75. This is a massive reduction in features almost a 95% but just a loss of 0.7% accuracy</li>

<b>So here are the feature sets we are going to perform hyper parameter tuning and evaluating final results</b>

<ol>
    <li>All Features</li>
    <li>Research Benchmark (Mean/Std) features</li>
    <li>XGBoost most accurate set</li>
    <li>Random Forest most efficent set</li>

We have the first 2 sets already, but we need to export the XGBoost most accurate set and Random Forest most efficent set.

In [75]:
xgb_most_accurate_train = xgboost_feature_sets[most_accurate_xgb_model_id][0]
xgb_most_accurate_test = X_test[xgb_most_accurate_train.columns]
rf_most_efficent_train = random_forest_feature_sets[most_efficent_rf_model_id][0]
rf_most_efficent_test = X_test[rf_most_efficent_train.columns]

<b>Exporting the feature sets</b>

In [76]:
xgb_most_accurate_train.to_csv("../Final Feature Sets/xgb_most_accurate_train.csv")
xgb_most_accurate_test.to_csv("../Final Feature Sets/xgb_most_accurate_test.csv")
rf_most_efficent_train.to_csv("../Final Feature Sets/rf_most_efficent_train.csv")
rf_most_efficent_test.to_csv("../Final Feature Sets/rf_most_efficent_test.csv")

Now onto our Final journey with XGBoost <b>Hyper Parameter Tuning and Evaluation</b>.