**Import the libraries and the dataset**

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('/content/drive/MyDrive/ELOC-SW/features/features_original.csv')

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
data.head()

In [None]:
data.info()

**Check for duplicated rows**

In [None]:
data.duplicated().any()
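
`duplicated().any()` only reports whether at least one repeated row exists. As a minimal sketch on a toy frame (the `toy` DataFrame and its values are made up for illustration, they are not taken from the dataset), the duplicates could also be counted and dropped if the check returns `True`:

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one.
toy = pd.DataFrame({'f1': [1, 2, 1],
                    'f2': [0.5, 0.7, 0.5],
                    'feature_class': [0, 1, 0]})

has_duplicates = toy.duplicated().any()   # True if any row repeats an earlier row
n_duplicates = toy.duplicated().sum()     # number of repeated rows
deduplicated = toy.drop_duplicates().reset_index(drop=True)  # keep the first occurrence

print(has_duplicates, n_duplicates, deduplicated.shape)
```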

## Model Training, Testing and Evaluation

**Steps**

- Train, test and evaluate an SVM model using **all the variables**.

- Feature selection using Recursive Feature Elimination with Cross Validation (RFECV).

- Train, test and evaluate the model with the **selected features**.

- Compare the performance of the two models.


**Import the modules for Machine Learning**

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedStratifiedKFold, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.metrics import confusion_matrix, classification_report

### Train and test an SVM model **using all the variables**

**Separate the independent variables from the dependent variable**

In [None]:
X_data = data.drop('feature_class', axis = 1)
y = data['feature_class']

**Scale the independent variables using Quantile Transformer** (Uniform Distribution)

In [None]:
std = QuantileTransformer(n_quantiles=100)

In [None]:
X = X_data.values

# ensure inputs are floats 
X = X.astype('float32')

X = std.fit_transform(X)
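
As a minimal sketch on synthetic data (`X_demo`, `y_demo` and the `qt` name are made up for illustration), the quantile mapping can also be learned from the training rows only and then applied to the test rows, which keeps test-set information out of the scaling step. With `output_distribution='uniform'` (the default), every feature is mapped into the range [0, 1].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

# Hypothetical skewed features and binary labels, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.exponential(scale=2.0, size=(500, 3)).astype('float32')
y_demo = rng.integers(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

qt = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
X_tr_scaled = qt.fit_transform(X_tr)   # quantiles are learned from the train rows only
X_te_scaled = qt.transform(X_te)       # test rows are mapped with the train quantiles

print(X_tr_scaled.min(), X_tr_scaled.max())
```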

**Split the dataset into train (80%) and test (20%) sets**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
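
The `stratify=y` argument keeps the class proportions the same in both sets. A small sketch with made-up imbalanced labels (`X_demo` and `y_demo` are hypothetical) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

train_share = y_tr.mean()   # fraction of class 1 in the train set
test_share = y_te.mean()    # fraction of class 1 in the test set
print(train_share, test_share)
```

Without `stratify`, the two fractions could differ by chance, which matters most when one class is rare.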

**Create an SVC object and fit it to the train set**

In [None]:
svc = SVC()

In [None]:
svc.fit(X_train, y_train)

In [None]:
y_train_predict = svc.predict(X_train)

In [None]:
sns.heatmap(confusion_matrix(y_train, y_train_predict), 
            annot = True)
plt.title('Confusion matrix SVM (train set)', 
          size = 13)
plt.ylabel('Real Values', size = 11)
plt.xlabel('Predicted Values', size = 11)
plt.show()

In [None]:
print(classification_report(y_train, y_train_predict))

**Perform cross-validation to evaluate the model**

I will use RepeatedStratifiedKFold to cross validate the model on the train set.

I will use 5 splits and 10 repetitions. 

- The dataset will be split into five parts. Four parts will be used to train the model and one part to test it. Each of the five parts will in turn be used as the test set, so one repetition produces five evaluation scores.

- The steps above will be repeated 10 times, for a total of 50 train/test runs.

- Get the average accuracy.

- I am also interested in the recall so I will perform the cross validation process again using recall as scoring metric.
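
The scheme described in the list above can be sketched on placeholder data (`X_demo` and `y_demo` are made up, and the seed is arbitrary) to confirm how many train/test runs it produces:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical labels and placeholder features, for illustration only.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((100, 2))

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)
splits = list(rskf.split(X_demo, y_demo))

n_runs = len(splits)               # 5 splits x 10 repetitions
train_idx, test_idx = splits[0]    # indices of the first train/test run
print(n_runs, len(train_idx), len(test_idx))
```

Passing such an object as `cv` to `cross_val_score` returns one score per run, so the mean and standard deviation are computed over all of them.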

In [None]:
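# 5 splits x 10 repetitions, as described above (the seed value is arbitrary)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)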

In [None]:
accuracies = cross_val_score(svc, X = X_train, y = y_train, 
                             scoring = 'accuracy', cv = cv, 
                             n_jobs = -1)

**Accuracy**

In [None]:
print(f"Accuracy:\nmean: {accuracies.mean():.3f}, std: {accuracies.std():.3f}")

**Recall**

In [None]:
recalls_w = cross_val_score(svc, X = X_train, y = y_train, 
                            scoring = 'recall', cv = cv, 
                            n_jobs = -1)

In [None]:
print(f"Recall:\nmean: {recalls_w.mean():.3f}, std: {recalls_w.std():.3f}")

Let's see the performance on the test set.

In [None]:
y_predict = svc.predict(X_test)

In [None]:
sns.heatmap(confusion_matrix(y_test, y_predict), 
            annot = True)
plt.title('Confusion matrix SVM (test set)', 
          size = 13)
plt.ylabel('Real Values', size = 11)
plt.xlabel('Predicted Values', size = 11)
plt.show()

In [None]:
print(classification_report(y_test, y_predict))

## Feature selection and second model training


- Use Recursive Feature Elimination with Cross-Validation to select the most relevant features for the model.

In [None]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
# X1_train, X1_test, y1_train, y1_test = train_test_split(X, y,
#                                                         test_size = 0.2, 
#                                                         random_state = 10)

**Recursive Feature Elimination Cross Validation (RFECV)**

Recursive Feature Elimination (RFE) trains a model using all the available features and computes the importance of each feature in the model. The least important features are eliminated and the process is repeated until the selected number of features is reached. To implement RFE we need to choose an estimator and the number of features we want to keep.

Because I do not know what the optimal number of features might be, I will use Recursive Feature Elimination with Cross Validation. In this case, the algorithm scores feature subsets of different sizes with cross-validation and then selects the subset that returns the best mean score.
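
A minimal sketch of the idea on synthetic data follows. The dataset from `make_classification`, the `selector` name, and the 5-fold setting are assumptions chosen for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Hypothetical data: 10 features, of which only 3 carry signal.
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# A linear kernel exposes coef_, which RFECV uses to rank the features.
selector = RFECV(SVC(kernel='linear'), step=1,
                 cv=StratifiedKFold(5), scoring='accuracy')
selector.fit(X_demo, y_demo)

print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask of the kept features
print(selector.ranking_)      # rank 1 marks a kept feature
```

The estimator passed to RFECV needs to expose feature importances, which is why a linear kernel is used here instead of the default RBF kernel.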

**Create an RFECV object based on a linear-kernel SVM and an SVM classifier**

In [None]:
rfecv = RFECV(SVC(kernel="linear"))
svc1 = SVC()

**Create a pipeline for feature selection and model fitting**

In [None]:
pipeline = Pipeline(steps = [('Feature Selection', rfecv), 
                             ('Model', svc1)])

**Let's use cross-validation to select the best features**

In [None]:
rfecv.fit(X1_train, y1_train)

In [None]:
rfecv.n_features_


Let's see which features contributed to the model.

In [None]:
features1 = pd.DataFrame(rfecv.support_, index = X_data.columns, 
                         columns = ['Features'])

In [None]:
features = features1[features1['Features']].index
features

**Apply cross validation with the selected features**

In [None]:
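# same scheme as for the first model: 5 splits x 10 repetitions (the seed value is arbitrary)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)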

In [None]:
accuracies_rfe = cross_val_score(pipeline, X = X1_train, 
                                 y = y1_train, scoring = 'accuracy', 
                                 cv = cv, n_jobs = -1)

**Accuracy**

In [None]:
print(f"Accuracy:\nmean: {accuracies_rfe.mean():.3f}, std: {accuracies_rfe.std():.3f}")

**Recall**

In [None]:
recalls_rfe = cross_val_score(pipeline, X = X1_train, 
                              y = y1_train, scoring = 'recall', 
                              cv = cv, n_jobs = -1)

In [None]:
print(f"Recall:\nmean: {recalls_rfe.mean():.3f}, std: {recalls_rfe.std():.3f}")

Performance on the test set using the features selected. I will take the train and test set I created before and keep only the features selected by RFECV.
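
Keeping only the selected features amounts to applying the boolean mask stored in `support_` to the columns of the feature matrix. The sketch below uses synthetic data (`X_demo`, `y_demo` and `selector` are made up for illustration) and checks that masking by hand gives the same result as the selector's own `transform` method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Hypothetical data, used only to obtain a fitted selector.
X_demo, y_demo = make_classification(n_samples=150, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

selector = RFECV(SVC(kernel='linear'), cv=3).fit(X_demo, y_demo)

X_masked = X_demo[:, selector.support_]      # manual column selection
X_transformed = selector.transform(X_demo)   # selection done by the selector

print(X_masked.shape, np.array_equal(X_masked, X_transformed))
```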

In [None]:
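# keep only the columns selected by RFECV
X1_train_lr = pd.DataFrame(X1_train, columns = X_data.columns)[features]

In [None]:
X1_test_lr = pd.DataFrame(X1_test, columns = X_data.columns)[features]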

**Fit a new model to the train set with the selected variables**

In [None]:
svc2 = SVC()

In [None]:
svc2.fit(X1_train_lr, y1_train)

In [None]:
X1_train_lr.shape


**Let's see how the model performs on the test set**

In [None]:
y1_predict = svc2.predict(X1_test_lr)

In [None]:
sns.heatmap(confusion_matrix(y1_test, y1_predict), annot = True)
plt.title('Confusion matrix SVM-selected (test set)', 
          size = 13)
plt.ylabel('Real Values', size = 11)
plt.xlabel('Predicted Values', size = 11)
plt.show()

In [None]:
print(classification_report(y1_test, y1_predict))

**Compare the results of cross validation of the model with all variables and the model with selected variables.**

In [None]:
models = pd.DataFrame({
    'SVM': {'accuracy': accuracies.mean(), 'a_std': accuracies.std(),
            'recall': recalls_w.mean(), 'r_std': recalls_w.std()},
    'SVM_selected': {'accuracy': accuracies_rfe.mean(), 'a_std': accuracies_rfe.std(),
                     'recall': recalls_rfe.mean(), 'r_std': recalls_rfe.std()},
})

In [None]:
models.T