# Cross Validation
🦊 `Notebook by` [Md.Samiul Alim](https://github.com/sami0055)

😋  `Machine Learning Source Codes` [GitHub](https://github.com/sami0055/Machine-Learning)

## Load Housing Dataset

In [1]:
from sklearn.datasets import fetch_california_housing
# Load the housing dataset
housing=fetch_california_housing()
x,y=housing.data,housing.target
# Print the shapes
print(x.shape)
print(y.shape)
# Print types
print(type(x))
print(type(y))

(20640, 8)
(20640,)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


## Load the dataset as a dataframe

In [5]:
x_df,y_df=fetch_california_housing(return_X_y=True,as_frame=True)
print(x_df.shape)
print(y_df.shape)
print(type(x_df))
print(type(y_df))

(20640, 8)
(20640,)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [6]:
# Alternative: convert x and y to pandas objects manually
# (uses the lowercase x, y and housing variables defined in the first cell)
# import pandas as pd
# x_df = pd.DataFrame(x, columns=housing.feature_names)
# y_df = pd.Series(y, name='MedHouseVal')
# x_df.head()

In [8]:
x_df.head()

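|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |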


In [9]:
y_df.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64

## K-Fold Cross Validation
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

In [12]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Define the number of folds
k=5
kf=KFold(n_splits=k,shuffle=True,random_state=42)
# Initialize an empty list to store the mean squared errors
mse_scores=[]
fold_count=0
# Perform K-fold cv manually
for train_index,test_index in kf.split(x_df):
    x_train,x_test=x_df.iloc[train_index],x_df.iloc[test_index]
    y_train,y_test=y_df.iloc[train_index],y_df.iloc[test_index]
    model=LinearRegression()
    model.fit(x_train,y_train)
    # Make predictions on the held-out test fold
    y_pred=model.predict(x_test)
    mse=mean_squared_error(y_test,y_pred)
    fold_count+=1
    print("Fold",fold_count,", MSE ",mse)
    mse_scores.append(mse)

# Calculate avg
avg_mse=np.mean(mse_scores)
print("(CV) Average Mean Squared Error:", avg_mse)

Fold 1 , MSE  0.5558915986952442
Fold 2 , MSE  0.527656254773633
Fold 3 , MSE  0.5092832097248633
Fold 4 , MSE  0.5048507784142143
Fold 5 , MSE  0.5551804780114864
(CV) Average Mean Squared Error: 0.5305724639238882
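
To make the splitting step in the loop above more concrete, here is a minimal pure-Python sketch of how a k-fold splitter can partition row indices. The helper name `k_fold_indices` is hypothetical and is not part of scikit-learn. It follows the fold-size rule described in the `KFold` documentation (the first `n_samples % n_splits` folds get one extra sample), but its shuffling uses Python's `random` module, so it will not reproduce the exact folds that `KFold(shuffle=True, random_state=42)` produces.

```python
import random


def k_fold_indices(n_samples, n_splits, shuffle=False, seed=None):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper for illustration only; not scikit-learn's implementation.
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")

    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)

    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]                 # held-out fold
        train = indices[:start] + indices[stop:]   # everything else
        yield train, test
        start = stop


# The housing data has 20640 rows, so 5 folds hold out 4128 rows each.
fold_sizes = [len(test) for _, test in k_fold_indices(20640, 5)]
print(fold_sizes)
```

Every row lands in exactly one test fold, which is why averaging the per-fold MSE uses each sample for evaluation exactly once.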


## Load Iris Dataset
Doc: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html


In [13]:
from sklearn.datasets import load_iris
# Load the Iris dataset
iris=load_iris()
x=iris.data
y=iris.target
#Print the shapes
print(x.shape)
print(y.shape)
# check type
print(type(x))
print(type(y))

(150, 4)
(150,)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


## K-Fold CV

In [14]:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
# Initialize the Decision tree classifier
clf=DecisionTreeClassifier()
# define the number of folds for k-fold cv
k=5
kf=KFold(n_splits=k,shuffle=True,random_state=42)
# Perform k-fold cv and compute the scores
scores=cross_val_score(clf,x,y,cv=kf)
# Print the cross-validation scores
print("Cross-validation scores:", scores)

# Print the mean of the scores
print("Mean accuracy:", np.mean(scores))

Cross-validation scores: [1.         0.96666667 0.93333333 0.93333333 0.93333333]
Mean accuracy: 0.9533333333333335
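
The mean alone hides how much the folds disagree. As a small sketch, the spread can be reported next to the mean. The scores below are copied from the output above rather than recomputed, and the standard library's `statistics` module is used here only to keep the example self-contained; with the NumPy array returned by `cross_val_score`, `np.std(scores)` would be the natural choice.

```python
import statistics

# Fold accuracies copied from the K-fold output above.
scores = [1.0, 0.96666667, 0.93333333, 0.93333333, 0.93333333]

mean_accuracy = statistics.fmean(scores)
# pstdev is the population standard deviation (ddof=0),
# the same convention NumPy's np.std uses by default.
std_accuracy = statistics.pstdev(scores)

print(f"Accuracy: {mean_accuracy:.3f} +/- {std_accuracy:.3f}")
```

Because `DecisionTreeClassifier()` is created without a `random_state`, rerunning the cell may give slightly different fold scores.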


## Leave-One-Out Cross-Validation
LeaveOneOut: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html

In [16]:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree classifier
model = DecisionTreeClassifier()

# Initialize Leave-One-Out Cross-Validation
loo = LeaveOneOut()

# Perform cross-validation
scores = cross_val_score(model, x, y, cv=loo)
print("Number of iterations:",len(scores))

# Print the cross-validation scores
print("Cross-validation scores:", scores)

# Calculate and print the mean accuracy
accuracy = scores.mean()
print("Mean Accuracy:", accuracy)

Number of iterations: 150
Cross-validation scores: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
Mean Accuracy: 0.9533333333333334
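
Leave-one-out is k-fold with one fold per sample, which is why the run above reports 150 iterations and every score is either 0 or 1. The following sketch shows the index pattern in plain Python. The helper `leave_one_out_indices` is hypothetical, and the count of 7 misclassified samples is read off the zeros in the score array printed above.

```python
def leave_one_out_indices(n_samples):
    """Yield (train_indices, test_indices) with exactly one held-out sample.

    Hypothetical helper for illustration only; not scikit-learn's implementation.
    """
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]


splits = list(leave_one_out_indices(150))  # Iris has 150 samples
n_splits = len(splits)

# Each score is 1 (correct) or 0 (wrong), so the mean accuracy is
# simply the fraction of held-out samples predicted correctly.
misclassified = 7  # zeros counted in the score array above
mean_accuracy = (n_splits - misclassified) / n_splits

print("Number of iterations:", n_splits)
print("Mean accuracy:", mean_accuracy)
```

The cost is one model fit per sample, so this scheme is usually reserved for small datasets such as Iris.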


## Stratified K-Fold Cross-Validation
Doc: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

In [18]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree classifier
clf = DecisionTreeClassifier()

# Define the number of folds for Stratified K-fold cross-validation
k = 5

# Initialize StratifiedKFold cross-validation
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)

# Perform Stratified K-fold cross-validation and compute the scores
scores = cross_val_score(clf, x, y, cv=skf)
print("Number of iterations:",len(scores))

# Print the cross-validation scores
print("Cross-validation scores:", scores)

# Calculate and print the mean accuracy
accuracy = scores.mean()
print("Mean Accuracy:", accuracy)

Number of iterations: 5
Cross-validation scores: [1.         0.96666667 0.93333333 0.96666667 0.9       ]
Mean Accuracy: 0.9533333333333335
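
What distinguishes the stratified variant is that each fold keeps roughly the same class proportions as the full dataset. The sketch below illustrates that idea by dealing each class's indices round-robin across the folds. The helper `stratified_k_fold_indices` is hypothetical, it does not shuffle, and it is not the algorithm scikit-learn's `StratifiedKFold` uses internally. The label list mimics the Iris layout of 50 samples per class.

```python
from collections import Counter, defaultdict


def stratified_k_fold_indices(labels, n_splits):
    """Yield (train_indices, test_indices) with class balance kept per fold.

    Hypothetical helper for illustration only; not scikit-learn's implementation.
    """
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's indices round-robin so every fold gets its share.
    folds = [[] for _ in range(n_splits)]
    for members in by_class.values():
        for position, idx in enumerate(members):
            folds[position % n_splits].append(idx)

    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)


# Same class layout as Iris: three classes with 50 samples each.
labels = [0] * 50 + [1] * 50 + [2] * 50
class_counts = [
    Counter(labels[i] for i in test)
    for _, test in stratified_k_fold_indices(labels, 5)
]
for fold_number, counts in enumerate(class_counts, start=1):
    print("Fold", fold_number, dict(counts))
```

With 5 folds, each test fold in this sketch holds 10 samples of every class, whereas a plain shuffled `KFold` gives no such guarantee.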


## SKFold CV Manually