# Applying the Machine Learning Process: Cross-Validation

## Outline

* [Benefits from Cross-Validation](#Benefits-from-Cross-Validation)
* [Cross-Validation Explained](#Cross-Validation-Explained)
* [Performing Cross-Validation](#Performing-Cross-Validation)
* [Searching for Optimal Model Parameters using Grid Search](#Searching-for-Optimal-Model-Parameters-using-Grid-Search)
* [Challenge](#Challenge)

## Benefits from Cross-Validation

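Cross-validation scores a model on several train/test splits instead of one, so the performance estimate depends less on how a single split happened to fall. That makes it useful for:

* Parameter tuning
* Feature selection
* Model selection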

---

## Cross-Validation Explained

In [None]:
from sklearn.model_selection import KFold

In [None]:
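# Toy data: 25 samples with values 1..25.
X = range(1, 26)
# Five contiguous folds, no shuffling (the default).
kf = KFold(n_splits=5)

# split() yields arrays of row indices (0..24), not the values in X.
for train, test in kf.split(X):
    print('Train:', train, 'Test:', test)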
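
To make the splitting rule concrete, here is a minimal pure-Python sketch of how unshuffled k-fold indices can be built. The helper name `kfold_indices` is hypothetical and is not part of scikit-learn. It assumes contiguous folds, with any remainder spread over the first folds, which should match the printout above for 25 samples and 5 splits.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        # The current block is held out; everything else is used for training.
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(kfold_indices(25, 5))
for train, test in folds:
    print('Train:', train, 'Test:', test)
```

Every sample lands in the test set exactly once, and in the training set for the other four folds.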

---

## Performing Cross-Validation

In [None]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

In [None]:
iris_data_url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
df = pd.read_csv(iris_data_url)

In [None]:
X = df.drop(['species'], axis='columns')
y = df['species']

In [None]:
knn = KNeighborsClassifier(n_neighbors=20)
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
print(scores)

In [None]:
print(scores.mean())

Next, vary K (`n_neighbors`) from 1 to 30 and record the mean 10-fold cross-validated accuracy for each value.

In [None]:
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

print(k_scores)
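
Several values of K often reach the same top accuracy, so it can help to list all of them rather than only the first. The sketch below is one way to do that. `best_k_values` is a hypothetical helper, and the demo numbers are made up for illustration; they are not results from the iris run above.

```python
def best_k_values(k_values, mean_scores, tol=1e-12):
    """Return every k whose mean score is within tol of the maximum."""
    top = max(mean_scores)
    return [k for k, s in zip(k_values, mean_scores) if top - s <= tol]


# Illustrative numbers only, not output from the cells above.
demo_k = [1, 2, 3, 4, 5]
demo_scores = [0.90, 0.95, 0.93, 0.95, 0.91]
tied = best_k_values(demo_k, demo_scores)
print(tied)
```

In the notebook, the same call would be `best_k_values(k_range, k_scores)`. Which of the tied values to keep is a judgment call; a larger K gives a smoother decision boundary for KNN.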

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

---

## Searching for Optimal Model Parameters using Grid Search

In [None]:
import pandas as pd

In [None]:
iris_data_url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
df = pd.read_csv(iris_data_url)

In [None]:
X = df.drop(['species'], axis='columns')
y = df['species']

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
k_range = list(range(1, 31))
print(k_range)

In [None]:
param_grid = dict(n_neighbors=k_range)
print(param_grid)
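
A grid search evaluates every combination of the values listed in the parameter grid. The sketch below expands a grid by hand to show what that means. `expand_grid` is a hypothetical helper, not a scikit-learn function, and the second parameter, `weights`, is added here only to show a grid with more than one dimension.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of the values in param_grid."""
    names = sorted(param_grid)
    combos = product(*(param_grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical two-parameter grid for KNeighborsClassifier.
demo_grid = {'n_neighbors': [1, 5, 10], 'weights': ['uniform', 'distance']}
candidates = expand_grid(demo_grid)
print(len(candidates))
for params in candidates:
    print(params)
```

The single-parameter grid above has 30 candidates, so with `cv=10` the search fits the model 300 times before refitting the best candidate on all of the data.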

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=True)
grid.fit(X, y)
grid.cv_results_

In [None]:
grid_mean_scores = []
for result in grid.cv_results_['mean_test_score']:
    grid_mean_scores.append(result)

In [None]:
# Same result as the loop in the previous cell, without the explicit append.
grid_mean_scores = list(grid.cv_results_['mean_test_score'])
print(grid_mean_scores)

In [None]:
plt.plot(k_range, grid_mean_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

In [None]:
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

## Challenge

Pick one dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), try `GridSearchCV` with any model you like, and look at the results you get.