# k-fold cross-validation
k-fold cross-validation is a procedure for estimating how well a model will perform on new data: the dataset is split into k folds, and each fold in turn is held out for evaluation while the model is trained on the remaining k-1 folds. There are common tactics for choosing the value of k for your dataset, and commonly used variations, such as stratified and repeated cross-validation, are available in scikit-learn.
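
To make the procedure concrete, here is a minimal pure-Python sketch of how k-fold splitting partitions sample indices. The helper name `k_fold_indices` is hypothetical and written only for illustration; in practice scikit-learn's `KFold` produces these splits for you.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # When n_samples is not divisible by k, the first `extra` folds get one more sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = indices[start:stop]                # held-out fold
        train = indices[:start] + indices[stop:]  # everything else
        yield train, test
        start = stop


for fold, (train, test) in enumerate(k_fold_indices(10, 3), start=1):
    print("Fold {}: train={}, test={}".format(fold, train, test))
```

Every sample lands in exactly one test fold, so each observation is used for evaluation once and for training k-1 times.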

We are going to repeat all the steps from before, so refer to the Linear-Regression notebook for explanations.

In [1]:
import numpy as np
import pandas as pd

In [2]:
cars =  pd.read_csv("cleaned_data.csv")
cars.columns

Index(['Name', 'style', 'Exterior color', 'interior color', 'Engine',
       'drive type', 'Fuel Type', 'Transmission', 'Mileage', 'mpg city',
       'mpg highway', 'price', 'Year', 'Engine V', 'Brand'],
      dtype='object')

In [3]:
X =  cars[['Name', 'style', 'Exterior color', 'interior color', 'Engine',
       'drive type', 'Fuel Type', 'Transmission', 'Mileage', 'mpg city',
       'mpg highway', 'Year', 'Engine V', 'Brand']]


Y = cars["price"].values

In [4]:
from sklearn.preprocessing import OneHotEncoder

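onehot = OneHotEncoder(categories="auto", handle_unknown="ignore")

# One-hot encode style, Engine, drive type, Fuel Type, Transmission and Brand
categorical_features = onehot.fit_transform(X.iloc[:, [1,4,5,6,7,13]]).toarray()
# Drop the encoded columns plus Name, Exterior color and interior color,
# keeping only the numeric columns, then append the encoded features
X = np.delete(X.values, [0,1,2,3,4,5,6,7,13], 1)
X = np.concatenate((X,categorical_features), axis=1)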

In [5]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X, Y,
    test_size=0.1,
    random_state=42
)

In [6]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
std_scaler.fit(x_train)

x_train_std = std_scaler.transform(x_train)
x_test_std  = std_scaler.transform(x_test)

In [12]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

pip_lr = make_pipeline(
    StandardScaler(),
    PCA(),
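    LinearRegression(fit_intercept=True)  # `normalize` was removed in scikit-learn 1.2
)
# The pipeline standardizes its input itself, so fit and score on the raw splits
pip_lr.fit(x_train, y_train)
h_pred = pip_lr.predict(x_test)
# For a regressor, score() returns the coefficient of determination (R^2), not accuracy
print("Test R^2 : {:.3f}".format(pip_lr.score(x_test, y_test)))

Test R^2 : 0.856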


### Stratified K-Folds cross-validator.
Provides train/test indices to split data in train/test sets.
<br><br>
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class, so it is intended for classification targets with discrete class labels.<br>
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html">read full documentation</a>
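
As a small illustration of what "preserving the percentage of samples for each class" means, the sketch below runs `StratifiedKFold` on made-up class labels (not the cars data; `X_toy` and `y_toy` are placeholder names chosen for this example) and counts the classes in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy classification labels (made up for illustration): 80% class 0, 20% class 1
y_toy = np.array([0] * 16 + [1] * 4)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=4)
test_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X_toy, y_toy), start=1):
    # Count how many samples of each class ended up in this test fold
    counts = np.bincount(y_toy[test_idx], minlength=2)
    test_counts.append(counts.tolist())
    print("Fold {}: test class counts = {}".format(fold, counts))
```

Each test fold receives four samples of class 0 and one of class 1, matching the 80/20 split of the full label vector.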

In [None]:
from sklearn.model_selection import KFold

# StratifiedKFold needs discrete class labels; `price` is a continuous
# regression target, so plain KFold with shuffling is used here instead.
kfold = KFold(n_splits=10, shuffle=True, random_state=42).split(x_train, y_train)
scores = list()

for k, (train, test) in enumerate(kfold):

    pip_lr.fit(x_train[train], y_train[train])
    score = pip_lr.score(x_train[test], y_train[test])  # R^2 on the held-out fold
    scores.append(score)

    print("Fold: {}, train size: {}, R^2: {:.3f}".format(
        k+1,
        len(train),
        score
    ))

print("CV R^2: {:.3f} +/- {:.3f}".format(
    np.mean(scores),
    np.std(scores)
))

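scikit-learn also implements a k-fold cross-validation scorer, `cross_val_score`, which allows us to evaluate our model less verbosely. With an integer `cv`, it uses stratified k-fold for classifiers and plain k-fold for other estimators, such as this regression pipeline: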

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    estimator=pip_lr,
    X=x_train,
    y=y_train,
    cv=10,
    n_jobs=1
)

# For a regressor the default score is R^2, not accuracy
print("cv R^2 scores: {}".format(scores))
print("cv R^2: {:.3f} +/- {:.3f}".format(
    np.mean(scores),
    np.std(scores)
))
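
The repeated variation mentioned in the introduction can be sketched the same way by passing a `RepeatedKFold` object as `cv`. The example below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` in place of the cars features, and the names `X_demo`, `y_demo` and `demo_scores` are introduced here for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the cars features (an assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

model = make_pipeline(StandardScaler(), LinearRegression())

# 5 folds repeated 3 times, each repeat with a different shuffle: 15 scores
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
demo_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")

print("number of scores: {}".format(len(demo_scores)))
print("R^2: {:.3f} +/- {:.3f}".format(demo_scores.mean(), demo_scores.std()))
```

Repeating the split with different shuffles costs more fits, but averaging over the repeats makes the estimate less dependent on one particular partition of the data.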

Sina Kazemi<br>
Github : <a href="https://github.com/sina96n/">sina96n</a>