Overfitting and underfitting in linear regression
In this exercise, you will practice identifying overfitting and underfitting in linear regression models using both holdout and cross-validation techniques. Use the "main.ipynb" notebook to perform this task. 

You will use the California Housing dataset available in the sklearn library.

If you get stuck, refer to "main_solved.ipynb" for help with the code in "main.ipynb".

Note: To complete this activity, click the Submit/Mark button in the bottom-right corner. These activities are for your own practice in a coding environment and are not assessed. If you have any questions about an activity, please ask your facilitator in a live session or on the Discussion board.
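
The two evaluation techniques named above differ only in how rows are assigned to training and testing. The following is a minimal sketch of that assignment in plain Python, not scikit-learn's implementation: the helper names `holdout_indices` and `kfold_indices` are hypothetical, and the rounding of the test-set size is an assumption made for illustration. The code cell below relies on `train_test_split` and `cross_val_score` to do this work.

```python
import random


def holdout_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices once and split them into one train part and one test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)  # assumption: simple rounding
    return indices[n_test:], indices[:n_test]


def kfold_indices(n_samples, n_folds=5):
    """Yield (train, test) index lists; every row is in the test part exactly once."""
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Holdout: a single split, so the error estimate depends on one random partition
train_idx, test_idx = holdout_indices(10)
print("holdout:", len(train_idx), "train rows,", len(test_idx), "test rows")

# k-fold: the model would be refitted once per fold and the fold errors averaged
for fold, (train, test) in enumerate(kfold_indices(10, n_folds=5)):
    print("fold", fold, "tests on rows", test)
```

Holdout produces one error estimate from one partition, while k-fold averages several estimates, each computed on rows the corresponding model did not see during fitting.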

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the California Housing data: feature matrix X and target vector y
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Holdout split: keep 20% of the rows as a test set (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear regression on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# Error on the data the model was fitted to
y_train_pred = model.predict(X_train)
mse_train = mean_squared_error(y_train, y_train_pred)

# Error on the held-out data the model has not seen
y_test_pred = model.predict(X_test)
mse_test = mean_squared_error(y_test, y_test_pred)

print("Training MSE: ", mse_train)
print("Testing MSE: ", mse_test)

# 5-fold cross-validation on the full dataset; a clone of the model is refitted for each fold
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mse_cv = -scores.mean()  # the scorer returns negative MSE, so negate the mean
print("Cross-Validation MSE: ", mse_cv)

Training MSE:  0.5179331255246699
Testing MSE:  0.5558915986952427
Cross-Validation MSE:  0.558290171768681
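
One way to read these three numbers is to measure how far the held-out errors sit above the training error. The sketch below does that arithmetic on the values printed above. The helper `relative_gap` and the 10% tolerance `GAP_TOLERANCE` are hypothetical and chosen only for illustration; the exercise does not prescribe a threshold. The cross-validation models are fitted on different subsets than the holdout model, so that comparison is approximate.

```python
# MSE values copied from the notebook output above
mse_train = 0.5179331255246699
mse_test = 0.5558915986952427
mse_cv = 0.558290171768681


def relative_gap(train_error, heldout_error):
    """Fraction by which the held-out error exceeds the training error."""
    return (heldout_error - train_error) / train_error


# Hypothetical tolerance, used here only to illustrate the comparison
GAP_TOLERANCE = 0.10

gap_test = relative_gap(mse_train, mse_test)
gap_cv = relative_gap(mse_train, mse_cv)

print(f"Holdout gap: {gap_test:.1%}")
print(f"Cross-validation gap: {gap_cv:.1%}")
print("Gap within tolerance:", max(gap_test, gap_cv) <= GAP_TOLERANCE)
```

Held-out errors only slightly above the training error, as here, suggest the model is not strongly overfitting. A small gap does not rule out underfitting: that depends on whether the error level itself is acceptable, which requires a baseline or a more flexible model for comparison.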
