<a href="https://colab.research.google.com/github/tennille-bernard/Kal-Academy-Modules/blob/main/Model_Performance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**3 Techniques for Model Performance**  
1. K-Fold Cross Validation
2. Grid Search
3. XGBoost  

All three can be used for both classification and regression models.

**K-Fold Cross Validation**: Goal is to get a more reliable estimate of how well the model performs, which then guides improvements.
1. Split data into training and test data.
2. Break the training set into k pieces (folds), here 10. Similar to how a book is broken up into chapters. Testing on individual chapters helps with refinement more than testing on the whole book at once.
3. In each of the 10 rounds, the model trains on 9 folds and is tested on the remaining fold, so every fold serves as the test piece exactly once.
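The splitting described above can be sketched on a toy array. This is a minimal illustration, assuming scikit-learn's `KFold` and a made-up 20-row dataset (`X_toy` is hypothetical, not the Social_Network_Ads data):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy "training set" of 20 samples (hypothetical data, not Social_Network_Ads.csv).
X_toy = np.arange(20).reshape(-1, 1)

kfold = KFold(n_splits=10)  # 10 folds, the same k that cv = 10 requests later

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_toy), start=1):
    # Each round holds out one fold and trains on the other nine.
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train on {len(train_idx)} rows, hold out rows {val_idx.tolist()}")
```

Every row is held out exactly once across the 10 rounds, which is what makes the averaged score less dependent on one lucky or unlucky split.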

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

**Support Vector Machine (SVM) model with feature scaling**

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

**Create model**

In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

**Confusion Matrix - test for accuracy**

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[64  4]
 [ 3 29]]


0.93
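The 0.93 accuracy can be read straight off the confusion matrix: correct predictions sit on the diagonal. A short check, re-entering the printed matrix by hand:

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes.
cm = np.array([[64, 4],
               [3, 29]])

correct = np.trace(cm)   # diagonal entries: 64 + 29 correct predictions
total = cm.sum()         # all test samples
accuracy = correct / total
print(accuracy)          # 0.93
```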

**Applying K-Fold Cross Validation**

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 90.33 %
Standard Deviation: 6.57 %


1. estimator = classifier, which is what we called our SVC above.
2. X = X_train and y = y_train; cv = 10 is the K in K-fold.
3. {:.2f} %  --> Format placeholder in the string. format() inserts its value there, in this case accuracies.mean()*100, as a float rounded to 2 decimal places.
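A quick look at the `{:.2f}` placeholder in isolation. The value `0.9033` is a hypothetical stand-in for `accuracies.mean()`; the f-string line is an equivalent spelling, not something the notebook uses:

```python
mean_accuracy = 0.9033  # hypothetical stand-in for accuracies.mean()

# str.format() fills the {:.2f} placeholder with the value, rounded to 2 decimals.
old_style = "Accuracy: {:.2f} %".format(mean_accuracy * 100)

# The same formatting written as an f-string.
f_string = f"Accuracy: {mean_accuracy * 100:.2f} %"

print(old_style)   # Accuracy: 90.33 %
```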

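In this case, the cross-validated accuracy (90.33%) is lower than the single test-set accuracy (93%). K-fold doesn't change the model itself; it gives a more reliable estimate of its accuracy, and the 6.57% standard deviation shows how much the score varies from fold to fold.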
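To see what `cross_val_score` does with cv = 10, here is a hand-written loop that should produce the same per-fold scores. This is a sketch under assumptions: the data is synthetic (`make_classification`, not the CSV), and it relies on scikit-learn's behavior of using stratified, unshuffled folds when a classifier is given an integer `cv`:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Social_Network_Ads.csv file).
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
model = SVC(kernel='rbf', random_state=0)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    fold_model = clone(model)                              # fresh, unfitted copy per fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])   # train on 9 folds
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))  # accuracy on the 10th
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(estimator=model, X=X_demo, y=y_demo, cv=10)
print("Manual mean: {:.2f} %".format(manual_scores.mean() * 100))
print("cross_val_score mean: {:.2f} %".format(library_scores.mean() * 100))
```

The original fitted classifier is never modified: each fold trains its own copy, and only the ten scores are returned.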

**GridSearch Cross Validation (CV)**  
This is a very powerful way of improving model performance by searching for the best combination of hyperparameters.

**Parameters for kernel SVM model**
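1. C = regularization parameter (inverse strength). High C = less regularization: the model tries hard to classify every training point correctly (potentially overfitting). Low C = more regularization: the model tolerates some misclassified training points in exchange for a wider margin (potentially underfitting).
2. kernel = type of kernel ('linear', 'rbf', 'poly').  
3. gamma = kernel coefficient, which is particularly relevant for rbf. Controls how far the influence of a single training point reaches. High gamma = each point only influences its close neighbors, potentially overfitting. Low gamma = each point's influence reaches far, potentially underfitting.
4. degree = degree of the polynomial, only used by the polynomial kernel.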
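The role of gamma is easier to see in the RBF kernel formula itself, K(a, b) = exp(-gamma * ||a - b||^2). The sketch below evaluates it for two hypothetical points one unit apart and cross-checks against scikit-learn's `rbf_kernel`; the points and the gamma values are chosen only for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical points one unit apart (think: after feature scaling).
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 0.0]])

similarities = {}
for gamma in [0.1, 0.5, 0.9]:
    # RBF kernel by hand: exp(-gamma * squared distance between the points)
    by_hand = np.exp(-gamma * np.sum((a - b) ** 2))
    from_sklearn = rbf_kernel(a, b, gamma=gamma)[0, 0]
    similarities[gamma] = by_hand
    print(f"gamma = {gamma}: similarity = {by_hand:.4f} (sklearn: {from_sklearn:.4f})")
```

As gamma grows, the similarity between the same two points shrinks, so each training point affects a smaller neighborhood and the decision boundary can bend more tightly around individual points.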


Explanation of the code below.
1. 'C' = a list of values for GridSearchCV to try = 0.25,..., 1
2. The first dictionary pairs those C values with a linear kernel.
3. The second dictionary pairs the same C values with an rbf kernel, so it also tries gamma values of 0.1,..., 0.9.
4. We ask GridSearchCV to run through all those combinations
5. estimator = classifier aka the name of the model
6. param_grid = the grid to search, here the variable 'parameters'
7. scoring = 'accuracy' = combinations are ranked by their accuracy score
8. cv = number of folds for k-fold = 10. GridSearchCV uses K-fold under the hood to score each combination.
9. n_jobs = how many jobs (CPU cores) to run in parallel. It does not limit the number of combinations; the full grid is always tried. n_jobs = -1 uses all available processors.
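To get a feel for how much work the grid implies, the combinations can be counted with scikit-learn's `ParameterGrid`. The grid below mirrors the one used in the next cell:

```python
from sklearn.model_selection import ParameterGrid

parameters = [{'C': [0.25, 0.5, 0.75, 1], 'kernel': ['linear']},
              {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['rbf'],
               'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]

n_combinations = len(ParameterGrid(parameters))  # 4 linear + 4 * 9 rbf
n_fits = n_combinations * 10                     # each combination is fit once per fold (cv = 10)
print(n_combinations, n_fits)                    # 40 400
```

With the default refit behavior, GridSearchCV performs one additional fit at the end to train the best combination on the whole training set.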


In [None]:
from sklearn.model_selection import GridSearchCV
parameters = [{'C': [0.25, 0.5, 0.75, 1], 'kernel': ['linear']},
              {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]

grid_search = GridSearchCV(estimator = classifier,
                            param_grid = parameters,
                            scoring = 'accuracy',
                            cv = 10,
                            n_jobs = -1)
grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)

Best Accuracy: 90.67 %
Best Parameters: {'C': 0.5, 'gamma': 0.6, 'kernel': 'rbf'}


**How to use this information**
1. Go back to the code used to build the model, and input the info above so it reads: classifier = SVC(C = 0.5, kernel = 'rbf', gamma = 0.6, random_state = 0)
2. Re-run the fit and confusion matrix cells to check the tuned model against the test set.
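As an alternative to retyping the values, the fitted search object already holds a tuned model. A minimal sketch, assuming synthetic data and a deliberately small grid (`X_demo`, `y_demo`, `small_grid`, and `tuned_classifier` are hypothetical names, not from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Social_Network_Ads.csv file).
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

small_grid = {'C': [0.5, 1], 'kernel': ['rbf'], 'gamma': [0.1, 0.6]}  # kept small for speed
search = GridSearchCV(estimator=SVC(random_state=0), param_grid=small_grid,
                      scoring='accuracy', cv=5)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is an SVC already retrained
# on all the data passed to fit(), using the best parameter combination.
tuned_classifier = search.best_estimator_
print(search.best_params_)
print(tuned_classifier.predict(X_demo[:5]))
```

In the notebook's own variables, this would correspond to using `grid_search.best_estimator_` in place of a hand-edited `SVC(...)` call.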

**XGBoost (Extreme Gradient Boosting):** A powerful gradient boosting model, built from an ensemble of decision trees, that offers high performance and execution speed.  
