# Practice Activity 7.1: Cross-Validation and Tuning

Please provide an ipynb notebook or Colab link showing the tasks marked "Practice Activities" in Chapter 13.

## 13.2.5: Predictions

In [2]:
import numpy as np
import pandas as pd
import plotnine as p9
from plotnine import *
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import (
    GridSearchCV,
    ParameterGrid,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    PolynomialFeatures,
    StandardScaler,
)

In [3]:
ames = pd.read_csv("C:/Users/ryanc/Desktop/GSB_544/Data/AmesHousing.csv")

Consider four possible models for predicting house prices:

1. Using only the size and number of rooms.
2. Using size, number of rooms, and building type.
3. Using size and building type, and their interaction.
4. Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Set up a pipeline for each of these four models.

Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

Note: You should only use the function train_test_split() one time in your code; that is, we should be predicting on the same test set for all four models.

In [5]:
# Rename some columns to shorter names
ames = ames.rename(columns={'Gr Liv Area': 'Size',
                                  'TotRms AbvGrd': 'Rooms',
                                  'Bldg Type': 'BldType'})

In [6]:
y = ames['SalePrice']
X = ames[['Size', 'Rooms', 'BldType']]
         
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

**Pipeline Model #1: Size and # of Rooms**

In [8]:
ct = ColumnTransformer(
    [
        ("standardize", StandardScaler(), ["Size", "Rooms"])
    ],
    remainder = "drop"
    )

lr_pipeline_1 = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", LinearRegression())]
)

#Fit model on training dataset
lr_pipeline_fitted = lr_pipeline_1.fit(X_train, y_train)

#Predict model on test dataset
y_preds = lr_pipeline_fitted.predict(X_test)

#Calculate root mean squared error
# (np.sqrt of the MSE; the squared=False argument is deprecated in recent scikit-learn)
rmse1 = np.sqrt(mean_squared_error(y_test, y_preds))

print(lr_pipeline_fitted.named_steps['linear_regression'].coef_)
print("RMSE:", rmse1)

[ 67328.06324265 -16226.58537719]
RMSE: 54885.46616419861


**Pipeline Model #2: Size, # of Rooms, and Building Type**

In [11]:
ct = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False), ["BldType"]),
    ("standardize", StandardScaler(), ["Size", "Rooms"]),
  ],
  remainder = "drop"
)


lr_pipeline_2 = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", LinearRegression())]
)

#Fit model on training dataset
lr_pipeline_fitted = lr_pipeline_2.fit(X_train, y_train)

#Predict model on test dataset
y_preds = lr_pipeline_fitted.predict(X_test)

#Calculate root mean squared error
# (np.sqrt of the MSE; the squared=False argument is deprecated in recent scikit-learn)
rmse2 = np.sqrt(mean_squared_error(y_test, y_preds))

print(lr_pipeline_fitted.named_steps['linear_regression'].coef_)
print("RMSE:", rmse2)

[ 21113.46072653 -34545.60098891 -31744.86454276    687.25247162
  44489.75233352  61347.77761947  -7799.08533781]
RMSE: 53184.25719760635


In [28]:
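# BldType is categorical text, so it cannot be multiplied by Size directly;
# the Size-by-BldType interaction has to be built from dummified columns.

X1 = ames[["Size", "Rooms"]]
X2 = ames[["Size", "Rooms", "BldType"]]
X3 = ames[["Size", "BldType"]]
X4 = ames[["Size", "Rooms", "BldType"]]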
y = ames[["SalePrice"]]

0.5070175206515921

In [31]:
# The remaining names used below are already imported at the top of the notebook
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# Extract the relevant features and the target variable
features = ames[['Size', 'Rooms', 'BldType']]
target = ames['SalePrice']

# Split the data into training and testing sets
# (this is a second, different split: test_size=0.2 instead of 0.25, so the
# scores below are not computed on the same test set as Models #1 and #2)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Create a pipeline for preprocessing and modeling
pipeline = make_pipeline(
    make_column_transformer(
        (StandardScaler(), ['Size', 'Rooms']),  # Scaling for numerical features
        (OneHotEncoder(), ['BldType'])    # One-hot encoding for categorical feature
    ),
    # Expands ALL preprocessed columns together (dummies included), so this
    # also creates every cross-column product up to degree 5
    PolynomialFeatures(degree=5),
    LinearRegression()             # Linear regression model
)

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Evaluate the model
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

print(f"Training R2 score: {train_score}")
print(f"Testing R2 score: {test_score}")

# Make predictions
predictions = pipeline.predict(X_test)

Training R2 score: 0.6145739675492172
Testing R2 score: -58.34782370352903


## 13.3.1: cross_val_score

Once again consider four modeling options for house price:

1. Using only the size and number of rooms.
2. Using size, number of rooms, and building type.
3. Using size and building type, and their interaction.
4. Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Use cross_val_score with the pipelines you made earlier to find the cross-validated root mean squared error for each model.

Which do you prefer? Does this agree with your conclusion from earlier?

## 13.3.3: One-Hundred Modeling Options

Consider one hundred modeling options for house price:

- House size, trying degrees 1 through 10
- Number of rooms, trying degrees 1 through 10
- Building Type

Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.

Q1: Which model performed the best?

Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?