<a href="https://colab.research.google.com/github/swethag04/ml-projects/blob/main/linear-regression/cross_validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Cross Validation


1.   LOOCV: Leave-one-out cross-validation holds out a single observation as the validation set, trains on the remaining n - 1 observations, and repeats this for every observation, so each prediction is made on data the model was not trained on. This can be used wh

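2.   K-fold: The data is split into k folds. Each fold serves once as the validation set while the model is trained on the other k - 1 folds, and the k scores are averaged to estimate performance on unseen data. The same procedure can be used to tune hyperparameters by comparing the cross-validated scores of candidate settings. Every example is used for validation exactly once and for training k - 1 times. It is more computationally expensive than a single split because the model is fit k times.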

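3. Holdout cross validation: Data is divided randomly into two sets: a training set and a test/validation set (the holdout set). The model is fit once on the training set and scored on the holdout set. This is the cheapest method, but the estimate depends on which rows happen to land in the holdout set.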
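
The three schemes above can be compared side by side. The following is a minimal sketch on synthetic data from `make_regression` (an assumption for illustration, not the housing data used later); the names `X_demo`, `y_demo`, and `model` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)

# Synthetic regression data (illustrative assumption, not the housing data)
X_demo, y_demo = make_regression(n_samples=60, n_features=4,
                                 noise=10.0, random_state=0)
model = Ridge()

# Holdout: one random split, one fit, one score
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)
holdout_mse = mean_squared_error(y_te, model.fit(X_tr, y_tr).predict(X_te))

# K-fold: k fits, each fold used once for validation
kfold_scores = cross_val_score(
    model, X_demo, y_demo,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error")

# LOOCV: n fits, each observation used once for validation
loo_scores = cross_val_score(model, X_demo, y_demo, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")

print("Holdout MSE:", holdout_mse, "from 1 fit")
print("5-fold MSE: ", -kfold_scores.mean(), "from", len(kfold_scores), "fits")
print("LOOCV MSE:  ", -loo_scores.mean(), "from", len(loo_scores), "fits")
```

The number of fits (1, k, and n) is the main cost difference between the three methods.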



In [36]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SequentialFeatureSelector

### Boston housing dataset:

- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per USD 10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000s

In [5]:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df = pd.read_csv('/content/sample_data/housing.csv', names=column_names, header=None, delimiter=r"\s+")

In [6]:
df.head()

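|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |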


In [7]:
df.dtypes

CRIM       float64
ZN         float64
INDUS      float64
CHAS         int64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD          int64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
MEDV       float64
dtype: object

In [8]:
df.shape

(506, 14)

In [9]:
df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [11]:
X = df.drop('MEDV', axis=1)
y = df['MEDV']

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, test_size=0.3)
train_idx = X_train.index
test_idx = X_test.index
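
Saving the index of each split makes it possible to hand the same single split to `cross_val_score` as a custom `cv` iterable. The sketch below uses synthetic data and hypothetical names (`X_toy`, `y_toy`, `single_split`). It relies on the DataFrame having a default `RangeIndex`, so that index labels coincide with the row positions that `cv` expects.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a feature table (illustrative assumption)
X_arr, y_arr = make_regression(n_samples=100, n_features=5,
                               noise=5.0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y_toy = pd.Series(y_arr)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.3, random_state=42)

# With a default RangeIndex, index labels equal row positions,
# so they can be used directly as (train, test) index arrays.
single_split = [(X_tr.index.to_numpy(), X_te.index.to_numpy())]

cv_mse = -cross_val_score(Ridge(), X_toy, y_toy, cv=single_split,
                          scoring="neg_mean_squared_error")[0]
manual_mse = mean_squared_error(y_te, Ridge().fit(X_tr, y_tr).predict(X_te))

print("cross_val_score on the single split:", cv_mse)
print("manual fit/predict on the same split:", manual_mse)
```

If the DataFrame has been filtered or re-indexed, the labels no longer match positions and the split should be rebuilt from positional indices instead.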

In [38]:
# Simple (holdout) cross validation: fit on the training split, score on the test split.
# The same single split could be passed to cross_val_score or
# SequentialFeatureSelector as cv=[(train_idx, test_idx)].
ridge = Ridge().fit(X_train, y_train)
mse = mean_squared_error(y_test, ridge.predict(X_test))
print("Ridge with simple cross validation MSE: ", mse)

Ridge with simple cross validation MSE:  22.044053089861013
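
A holdout score comes from one particular split, so it can move when the split changes. The following sketch, on synthetic data with hypothetical names (`X_syn`, `y_syn`, `holdout_mses`), repeats the holdout procedure over several values of `random_state` to show the spread. It does not reproduce the housing result above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative assumption)
X_syn, y_syn = make_regression(n_samples=200, n_features=5,
                               noise=15.0, random_state=0)

holdout_mses = []
for seed in range(10):
    # A different seed puts different rows in the holdout set
    X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn,
                                          test_size=0.3, random_state=seed)
    fitted = Ridge().fit(X_a, y_a)
    holdout_mses.append(mean_squared_error(y_b, fitted.predict(X_b)))

holdout_mses = np.array(holdout_mses)
print("min / mean / max holdout MSE:",
      holdout_mses.min(), holdout_mses.mean(), holdout_mses.max())
```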


In [39]:
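# K Fold (5 fold) cross validation
# Forward-select 5 features, scoring each candidate set by 5-fold CV MSE.
selector = SequentialFeatureSelector(estimator=Ridge(),
                                     n_features_to_select=5,
                                     scoring='neg_mean_squared_error',
                                     cv=5)
Xt = selector.fit_transform(X, y)
ridge = Ridge().fit(Xt, y)
# NOTE: cross_val_score clones and refits the estimator on the data it is given.
# It is given X (all 13 features), so the MSE printed below is for Ridge on the
# full feature set. Pass Xt instead of X to score the 5 selected features.
scores = cross_val_score(ridge, X, y,
                         cv=5,
                         scoring='neg_mean_squared_error')
print("5 fold cross validation MSE: ", -scores.mean())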

5 fold cross validation MSE:  35.266731097496645
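
When `cv` is an integer and the estimator is a regressor, scikit-learn uses `KFold` without shuffling, so the folds are contiguous blocks of rows. `train_test_split`, by contrast, shuffles by default. This difference in how rows are assigned may contribute to a gap between a holdout score and a k-fold score when the rows are ordered in some meaningful way. The sketch below only illustrates the fold layout on a tiny hypothetical array.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows (illustrative assumption)
X_rows = np.arange(20).reshape(10, 2)

plain = KFold(n_splits=5)  # what cv=5 uses for a regressor
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)

plain_tests = [test.tolist() for _, test in plain.split(X_rows)]
shuffled_tests = [test.tolist() for _, test in shuffled.split(X_rows)]

print("unshuffled test folds:", plain_tests)
# -> [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
print("shuffled test folds:  ", shuffled_tests)
```

Passing a `KFold(shuffle=True, random_state=...)` object as `cv` is one way to check whether row order affects the score.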


In [40]:
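# K Fold (10 fold) cross validation
# Forward-select 5 features, scoring each candidate set by 10-fold CV MSE.
selector = SequentialFeatureSelector(estimator=Ridge(),
                                     n_features_to_select=5,
                                     scoring='neg_mean_squared_error',
                                     cv=10)
Xt = selector.fit_transform(X, y)
ridge = Ridge().fit(Xt, y)
# NOTE: cross_val_score clones and refits the estimator on the data it is given.
# It is given X (all 13 features), so the MSE printed below is for Ridge on the
# full feature set. Pass Xt instead of X to score the 5 selected features.
scores = cross_val_score(ridge, X, y,
                         cv=10,
                         scoring='neg_mean_squared_error')
print("10 fold cross validation MSE: ", -scores.mean())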

10 fold cross validation MSE:  34.07824620925932
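
In the k-fold cells above, the selected features `Xt` are computed on the full dataset and are not the data that `cross_val_score` evaluates. One common way to score feature selection together with the model is to wrap both in a pipeline, so that selection is redone inside each training fold. The following is a minimal sketch on synthetic data. The names `X_sel`, `y_sel`, `pipe`, and `pipe_scores`, and the inner `cv=3`, are assumptions for illustration, and the printed value is unrelated to the housing results.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with 3 informative features out of 6 (illustrative assumption)
X_sel, y_sel = make_regression(n_samples=80, n_features=6, n_informative=3,
                               noise=5.0, random_state=0)

# Selector and model in one estimator: each outer fold refits both,
# so the validation rows never influence which features are chosen.
pipe = make_pipeline(
    SequentialFeatureSelector(Ridge(), n_features_to_select=3,
                              scoring="neg_mean_squared_error", cv=3),
    Ridge(),
)

pipe_scores = cross_val_score(pipe, X_sel, y_sel, cv=5,
                              scoring="neg_mean_squared_error")
print("Pipeline 5-fold MSE:", -pipe_scores.mean())

# Fit once on all data to inspect how many features the selector keeps
pipe.fit(X_sel, y_sel)
n_kept = int(pipe[0].get_support().sum())
print("features kept:", n_kept)
```

The nested cross-validation makes this slower than scoring a fixed feature set, since the selector runs its own inner folds for every outer fold.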
