# K-Fold Cross Validation

### Importing libraries

In [59]:
import pandas as pd

### Importing the dataset

In [60]:
dataset = pd.read_csv('./filez/Social_Network_Ads.csv')
dataset.head()

| | Age | EstimatedSalary | Purchased |
|---|---|---|---|
| 0 | 19 | 19000 | 0 |
| 1 | 35 | 20000 | 0 |
| 2 | 26 | 43000 | 0 |
| 3 | 27 | 57000 | 0 |
| 4 | 19 | 76000 | 0 |


### Splitting the dataset into Train/Test sets

In [61]:
from sklearn.model_selection import train_test_split

X = dataset.drop("Purchased", axis=1)
y = dataset["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

### Feature Scaling

In [62]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

### Training the Kernel SVM model on the Training set

In [63]:
from sklearn.svm import SVC

classifier = SVC(kernel="rbf", random_state=0)
classifier.fit(X_train, y_train)

### Making the Confusion Matrix

In [64]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

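def evaluate_model(classifier):
    """Print test-set metrics for a fitted classifier (reads the global X_test and y_test)."""
    # Predict labels for the held-out test set
    y_pred = classifier.predict(X_test)

    # Per-class precision/recall/F1, the raw confusion matrix, and overall accuracy
    print("1) classification_report:\n\n", classification_report(y_test, y_pred))
    print("2) confusion_matrix:\n\n", confusion_matrix(y_test, y_pred), "\n")
    print("3) accuracy_score:\n\n", accuracy_score(y_test, y_pred))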

evaluate_model(classifier)

1) classification_report:

               precision    recall  f1-score   support

           0       0.96      0.94      0.95        68
           1       0.88      0.91      0.89        32

    accuracy                           0.93       100
   macro avg       0.92      0.92      0.92       100
weighted avg       0.93      0.93      0.93       100

2) confusion_matrix:

 [[64  4]
 [ 3 29]] 

3) accuracy_score:

 0.93


### Applying K-Fold Cross Validation
- `cv=10` sets the number of folds. The training set is divided into 10 parts; the model is trained on 9 parts and scored on the remaining part, and this is repeated 10 times so that each part serves as the validation fold exactly once. For a classifier, an integer `cv` makes scikit-learn use stratified folds, so every fold keeps roughly the same class proportions as the full training set.

In [65]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
print(f"Accuracy {accuracies.mean() * 100:,.2f}%")
print(f"Standard Deviation: {accuracies.std()*100:,.2f}%")

Accuracy 90.33%
Standard Deviation: 6.57%
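
To make the procedure concrete, the loop that `cross_val_score` runs for `cv=10` can be sketched by hand. This is only an illustration: `X_demo` and `y_demo` are synthetic stand-ins generated with `make_classification` (not the Social Network Ads data), and `fold_scores` is a name introduced here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the scaled training set (300 rows, 2 features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

model = SVC(kernel="rbf", random_state=0)

# For a classifier, an integer cv resolves to unshuffled stratified folds.
skf = StratifiedKFold(n_splits=10)

fold_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Accuracy on the one part this copy of the model has not seen
    fold_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

fold_scores = np.array(fold_scores)

# The built-in helper should produce the same ten numbers.
reference = cross_val_score(model, X_demo, y_demo, cv=10)

print(f"Manual loop:     {fold_scores.mean() * 100:.2f}% +/- {fold_scores.std() * 100:.2f}%")
print(f"cross_val_score: {reference.mean() * 100:.2f}% +/- {reference.std() * 100:.2f}%")
```

Each iteration fits a fresh clone, so nothing learned on one fold carries over to the next; the mean and standard deviation are then taken over the ten validation scores.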


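- **Accuracy** = 90.33%, which means that, averaged over the 10 folds, the classifier correctly predicts the class of 90.33% of the held-out samples. Each fold's score is computed on the part of the training set that the model did not see during that fit, so this figure estimates performance on unseen data rather than accuracy on the data the model was trained on. It is a good result for this problem.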
- **Standard Deviation** = 6.57%, which measures how much the per-fold accuracies vary around the mean. A lower value is desirable because it indicates consistent performance across different subsets of the training data. The value here needs context: the test set of 100 rows is 25% of the data, so the training set has 300 rows and each of the 10 validation folds holds about 30 samples. A single misclassification therefore moves a fold's accuracy by roughly 3.3 percentage points, and a spread of 6.57% corresponds to about two samples per fold. The variability is mostly a consequence of small folds; on its own it is not evidence of overfitting.
- **Consistency in Performance**: The test set accuracy (0.93) is slightly higher than the mean cross-validation accuracy (0.90). The gap of about 2.7 percentage points is well within one standard deviation of the fold scores (6.57%), so the two estimates agree. Both are measured on data held out from fitting (the validation folds in one case, the test set in the other), which suggests the model generalizes consistently.
- **Potential Overfitting Concerns**: A substantial gap can indicate overfitting, especially when the model scores very high on the data it was fitted on but noticeably worse on held-out data. In our case the held-out estimates are close to each other, so overfitting is unlikely to be a significant concern.
- **Dataset Variability**: The slight difference could also be due to the variability in the dataset itself. The test set might have been slightly easier for the model to predict, or the training set used in cross-validation might have been more challenging or diverse.
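
One way to probe the overfitting question more directly is to compare training-fold and validation-fold scores inside the cross-validation itself. The sketch below does this with `cross_validate(..., return_train_score=True)`. It also wraps the scaler and the SVM in a pipeline so that the scaler is refit on the training part of every fold; in the notebook above the scaler was fitted on the whole training set before cross-validation, so each validation fold slightly influenced the scaling. `X_demo` and `y_demo` are again synthetic stand-ins, not the notebook's data, and the printed numbers will differ from the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the unscaled training set.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Scaling happens inside the pipeline, so it is learned per fold
# from that fold's training part only.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=0))

results = cross_validate(
    pipeline, X_demo, y_demo, cv=10, return_train_score=True
)

train_mean = results["train_score"].mean()
val_mean = results["test_score"].mean()  # "test_score" here means the validation fold
val_std = results["test_score"].std()

print(f"Mean training-fold accuracy:   {train_mean * 100:.2f}%")
print(f"Mean validation-fold accuracy: {val_mean * 100:.2f}%")
print(f"Validation std deviation:      {val_std * 100:.2f}%")
print(f"Train minus validation gap:    {(train_mean - val_mean) * 100:.2f} points")
```

A training-fold accuracy far above the validation-fold accuracy would point toward overfitting; a small gap supports the reading that the fold-to-fold spread comes from small validation folds.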