# Digits Data K-Fold Classification

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [2]:
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from sklearn.datasets import load_digits
digits = load_digits()

#### The usual approach is to split the dataset into training and test sets with train_test_split

In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)

#### Now I will fit three models (Logistic Regression, SVM and Random Forest) on this split and calculate the test score of each

In [5]:
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9537037037037037
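The warning above is raised because the solver hit its iteration limit before converging. A minimal sketch of the two remedies the message itself suggests, scaling the features and raising max_iter, could look like this. The pipeline, the value max_iter=1000 and random_state=0 are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
# random_state=0 is an assumption, used only to make this sketch repeatable
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

# Standardize the pixel features, then fit logistic regression
# with a higher iteration cap than the default.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(x_train, y_train)
scaled_lr_score = scaled_lr.score(x_test, y_test)
print(scaled_lr_score)
```

The exact score depends on the split, so it will not match the number above.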

In [6]:
svm = SVC()
svm.fit(x_train,y_train)
svm.score(x_test,y_test)

0.9796296296296296

In [7]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(x_train,y_train)
rf.score(x_test,y_test)

0.9666666666666667

##### We have now calculated a score for each model, and SVC gave the best result. However, train_test_split draws a random split, so running it again changes the scores and possibly the ranking of the models. To get a more stable estimate we use the K-fold method, which trains and tests each model on several different splits of the data.
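To see how much a single split can move the score, here is a small sketch that repeats the split with different seeds and scores the same SVC each time. The seeds 0 to 4 are arbitrary assumptions chosen for illustration; the scores typically differ slightly from one split to the next.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()

split_scores = []
for seed in range(5):
    # A different random_state gives a different train/test partition
    x_tr, x_te, y_tr, y_te = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=seed
    )
    split_scores.append(SVC().fit(x_tr, y_tr).score(x_te, y_te))

print(split_scores)
```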

## Basic API of the K-Fold method

In [8]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)   # n_splits=3 creates three folds
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [9]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)    # each fold holds out a different third of the data as the test set

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


###### To avoid repeating the fit-and-score code, we can write one function and call it for each model

In [10]:
def get_score(model,x_train,x_test,y_train,y_test):
    model.fit(x_train,y_train)
    return model.score(x_test,y_test)

In [11]:
get_score(SVC(),x_train,x_test, y_train,y_test)

0.9796296296296296

In [12]:
get_score(LogisticRegression(),x_train,x_test, y_train,y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9537037037037037

In [13]:
get_score(RandomForestClassifier(),x_train,x_test, y_train,y_test)

0.9685185185185186
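The helper can also be combined with the fold indices from KFold, so that each model is scored once per fold instead of on a single split. This is a rough sketch of that manual loop, assuming three unshuffled folds as in the KFold example; the list names and the n_estimators=40 setting are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.svm import SVC


def get_score(model, x_train, x_test, y_train, y_test):
    # Fit on the training part and return accuracy on the held-out part
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)


digits = load_digits()
kf = KFold(n_splits=3)

scores_svm = []
scores_rf = []
for train_index, test_index in kf.split(digits.data):
    # Use the index arrays to slice out this fold's train and test data
    x_tr, x_te = digits.data[train_index], digits.data[test_index]
    y_tr, y_te = digits.target[train_index], digits.target[test_index]
    scores_svm.append(get_score(SVC(), x_tr, x_te, y_tr, y_te))
    scores_rf.append(
        get_score(RandomForestClassifier(n_estimators=40), x_tr, x_te, y_tr, y_te)
    )

print(scores_svm)
print(scores_rf)
```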

#### Computing the scores with cross_val_score

In [14]:
from sklearn.model_selection import cross_val_score

In [15]:
cross_val_score(LogisticRegression(),digits.data,digits.target) # uses 5 folds by default

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

array([0.92222222, 0.86944444, 0.94150418, 0.93871866, 0.89693593])

In [16]:
cross_val_score(SVC(),digits.data,digits.target)

array([0.96111111, 0.94444444, 0.98328691, 0.98885794, 0.93871866])

In [17]:
cross_val_score(RandomForestClassifier(),digits.data,digits.target)

array([0.91944444, 0.89444444, 0.95543175, 0.95543175, 0.92200557])
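Because cross_val_score returns one score per fold, the three models are easier to compare through the mean and spread of those scores. A possible sketch follows; the dictionary names are hypothetical, and max_iter=2000 for LogisticRegression is an assumption made only to reduce the convergence warnings.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = load_digits()

models = {
    "logistic_regression": LogisticRegression(max_iter=2000),  # assumed value
    "svm": SVC(),
    "random_forest": RandomForestClassifier(),
}

summary = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, digits.data, digits.target)
    # Store the mean and standard deviation across the folds
    summary[name] = (fold_scores.mean(), fold_scores.std())

for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.3f}, std={std:.3f}")
```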

#### By default it uses 5 folds; to change the number of folds, pass the cv argument

In [18]:
cross_val_score(SVC(),digits.data,digits.target,cv=3)

array([0.96494157, 0.97996661, 0.96494157])
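The cv argument also accepts a splitter object instead of an integer, which gives control over shuffling. As far as I know, an integer cv with a classifier uses stratified folds without shuffling. The sketch below passes a shuffled StratifiedKFold instead; random_state=0 is an arbitrary assumption for repeatability.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# Three stratified folds, with the samples shuffled before splitting
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(SVC(), digits.data, digits.target, cv=cv)
print(shuffled_scores)
```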

###### K-fold can be used to choose the best model for a dataset, and also for tuning a model. For example, if I decide to use a random forest for my problem, how many trees do I need? To find out, I run cross-validation with different values of n_estimators, starting with 5 estimators and 10 folds, and take the average of the scores. Note that the run with 40 estimators uses 5 folds rather than 10.

In [23]:
score1 = cross_val_score(RandomForestClassifier(n_estimators=5),digits.data,digits.target,cv=10)
np.average(score1)

0.8736902545003102

In [24]:
score2 = cross_val_score(RandomForestClassifier(n_estimators=40),digits.data,digits.target,cv=5)
np.average(score2)

0.9354750851129682

In [25]:
score3 = cross_val_score(RandomForestClassifier(n_estimators=2),digits.data,digits.target,cv=10)
np.average(score3)

0.743423339540658
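The same comparison can be written as a loop that keeps the number of folds fixed, so that only n_estimators changes between runs. This is a minimal sketch under that assumption; the candidate values mirror the ones tried above, and random_state=0 is added only to make the forests repeatable.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()

mean_scores = {}
for n in (2, 5, 40):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # Same cv=10 for every candidate, so the averages are directly comparable
    fold_scores = cross_val_score(model, digits.data, digits.target, cv=10)
    mean_scores[n] = fold_scores.mean()

for n, mean in mean_scores.items():
    print(n, round(mean, 4))
```

The candidate with the highest average score is the natural choice, keeping in mind that more trees also cost more training time.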