# Practice Activity 7.1: Cross-Validation and Tuning

Please provide an ipynb notebook or Colab link showing the tasks marked "Practice Activities" in Chapter 13.

## 13.2.5: Predictions

Consider four possible models for predicting house prices:

1. Using only the size and number of rooms.
2. Using size, number of rooms, and building type.
3. Using size and building type, and their interaction.
4. Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Set up a pipeline for each of these four models.

Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

Note: Call train_test_split() only once in your code, so that all four models are evaluated on the same test set.
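A minimal sketch of how the four pipelines and the shared test set could fit together. It runs on a small synthetic DataFrame rather than the Ames data, so the printed errors say nothing about which model wins on the real data. The helper names `make_model`, `poly5`, and `dummies`, the step names, and the use of `PolynomialFeatures(interaction_only=True)` to build the interaction are assumptions made for illustration, not a prescribed solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic stand-in for the Ames data (assumption: columns already renamed
# to Size, Rooms, BldType; all values are made up for illustration).
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "Size": rng.uniform(600, 3500, n),
    "Rooms": rng.integers(3, 12, n),
    "BldType": rng.choice(["1Fam", "Twnhs", "Duplex"], n),
})
demo["SalePrice"] = 100 * demo["Size"] + 2000 * demo["Rooms"] + rng.normal(0, 20000, n)

X = demo[["Size", "Rooms", "BldType"]]
y = demo["SalePrice"]

# One split, shared by all four models.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


def poly5():
    # Standardize first so the fifth powers stay numerically manageable.
    return Pipeline([("scale", StandardScaler()),
                     ("poly", PolynomialFeatures(degree=5, include_bias=False))])


def dummies():
    # Dummy-encode building type, dropping one level as the baseline.
    return ("bld", OneHotEncoder(drop="first"), ["BldType"])


def make_model(transformers, interaction=False):
    # Columns not named in `transformers` are dropped by ColumnTransformer.
    steps = [("prep", ColumnTransformer(transformers, sparse_threshold=0))]
    if interaction:
        # Adds pairwise products of the prepared columns (Size x each dummy).
        steps.append(("interact", PolynomialFeatures(
            degree=2, interaction_only=True, include_bias=False)))
    steps.append(("ols", LinearRegression()))
    return Pipeline(steps)


models = {
    "model_1": make_model([("num", StandardScaler(), ["Size", "Rooms"])]),
    "model_2": make_model([("num", StandardScaler(), ["Size", "Rooms"]), dummies()]),
    "model_3": make_model([("num", StandardScaler(), ["Size"]), dummies()],
                          interaction=True),
    "model_4": make_model([("size", poly5(), ["Size"]),
                           ("rooms", poly5(), ["Rooms"]),
                           dummies()]),
}

# Fit each pipeline on the training set and score it on the same test set.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, preds)))
    print(name, round(rmse[name], 1))
```

Because every pipeline receives the same `X_train` and `X_test`, the RMSE values are directly comparable. Each `ColumnTransformer` selects the columns its model needs, so a single feature frame serves all four models.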

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [6]:
ames = pd.read_csv("C:/Users/ryanc/Desktop/GSB_544/Data/AmesHousing.csv")

ames.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   MS Zoning        2930 non-null   object 
 4   Lot Frontage     2440 non-null   float64
 5   Lot Area         2930 non-null   int64  
 6   Street           2930 non-null   object 
 7   Alley            198 non-null    object 
 8   Lot Shape        2930 non-null   object 
 9   Land Contour     2930 non-null   object 
 10  Utilities        2930 non-null   object 
 11  Lot Config       2930 non-null   object 
 12  Land Slope       2930 non-null   object 
 13  Neighborhood     2930 non-null   object 
 14  Condition 1      2930 non-null   object 
 15  Condition 2      2930 non-null   object 
 16  Bldg Type        2930 non-null   object 
 17  House Style   

In [14]:
# Renaming some columns
ames = ames.rename(columns={'Gr Liv Area': 'Size',
                            'TotRms AbvGrd': 'Rooms',
                            'Bldg Type': 'BldType'})

In [21]:
lr = LinearRegression()

#Testing Model 1 by doing it manually, will confirm it is the same as with pipeline
testdata = ames.copy()
X = testdata[["Size", "Rooms"]]
y = testdata[["SalePrice"]]


# Single split: every model and pipeline below is meant to reuse this same train/test data
X_train, X_test, y_train, y_test = train_test_split(X, y)

X_train_s = (X_train - X_train.mean())/X_train.std()

lr_fitted = lr.fit(X_train_s, y_train)
lr_fitted.coef_

array([[ 73286.48380872, -19276.14911271]])

In [22]:
# X_test is not scaled here, although the model was fit on scaled data; this mismatch explains the very poor R2
y_preds = lr_fitted.predict(X_test)

r2_score(y_test, y_preds)

-2342795.5882222354

In [26]:
# Scale the test set with the training mean and standard deviation
X_test_s = (X_test - X_train.mean())/X_train.std()
y_preds = lr_fitted.predict(X_test_s)

r2_score(y_test, y_preds)

0.5070175206515921

In [27]:
lr_pipeline = Pipeline(
  [("standardize", StandardScaler()),
  ("linear_regression", LinearRegression())]
)

lr_pipeline
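The manual check above can be compared with the pipeline directly. The following is a small sketch on synthetic data (the arrays and coefficients are made up, not taken from the Ames file) showing that manual standardization with training statistics and a `StandardScaler` pipeline give the same predictions. One detail worth knowing: pandas `.std()` divides by n - 1 while `StandardScaler` divides by n, so the fitted coefficients differ slightly, but the predictions of an ordinary least squares model do not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * [500.0, 2.0] + [1500.0, 6.0]
y = 100 * X[:, 0] - 5000 * X[:, 1] + rng.normal(0, 10000, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual route: standardize both sets with the TRAINING mean and std
# (ddof=1 mimics the pandas default used in the cells above).
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0, ddof=1)
manual = LinearRegression().fit((X_train - mu) / sd, y_train)
manual_preds = manual.predict((X_test - mu) / sd)

# Pipeline route: the scaler is fit on the training data during fit()
# and reused automatically at predict time.
pipe = Pipeline([("standardize", StandardScaler()),
                 ("linear_regression", LinearRegression())])
pipe_preds = pipe.fit(X_train, y_train).predict(X_test)

same_predictions = np.allclose(manual_preds, pipe_preds)
print(same_predictions)
```

The pipeline removes the chance of forgetting to scale the test set, which is the mistake behind the large negative R2 seen earlier.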

In [28]:
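# BldType is categorical, so Size * BldType cannot be computed by direct multiplication;
# the interaction has to be built from the dummy-encoded columns, for example inside a pipeline.

X1 = ames[["Size", "Rooms"]]
X2 = ames[["Size", "Rooms", "BldType"]]
X3 = ames[["Size", "BldType"]]           # interaction terms are created during preprocessing
X4 = ames[["Size", "Rooms", "BldType"]]  # polynomial terms are created during preprocessing
y = ames[["SalePrice"]]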


In [31]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# Attempt at Model 4, using the renamed `ames` DataFrame with `SalePrice` as the target

# Extract the relevant features and the target variable
features = ames[['Size', 'Rooms', 'BldType']]
target = ames['SalePrice']

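# Split the data into training and testing sets
# Note: this is a second call to train_test_split, so this test set differs from the one used above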
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Create a pipeline for preprocessing and modeling
pipeline = make_pipeline(
    make_column_transformer(
        (StandardScaler(), ['Size', 'Rooms']),  # Scaling for numerical features
        (OneHotEncoder(), ['BldType'])    # One-hot encoding for categorical feature
    ),
    PolynomialFeatures(degree=5),  # Expands ALL transformed columns jointly: powers and cross-terms, dummies included
    LinearRegression()             # Linear regression model
)

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Evaluate the model
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

print(f"Training R2 score: {train_score}")
print(f"Testing R2 score: {test_score}")

# Make predictions
predictions = pipeline.predict(X_test)

Training R2 score: 0.6145739675492172
Testing R2 score: -58.34782370352903


## 13.3.1: cross_val_score

Once again consider four modeling options for house price:

1. Using only the size and number of rooms.
2. Using size, number of rooms, and building type.
3. Using size and building type, and their interaction.
4. Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Use cross_val_score with the pipelines you made earlier to find the cross-validated root mean squared error for each model.

Which do you prefer? Does this agree with your conclusion from earlier?
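A short sketch of how `cross_val_score` can report a cross-validated RMSE. The data here is synthetic and the pipeline is a plain two-step stand-in; in the activity, the pipelines built earlier would take its place. The scoring string `"neg_root_mean_squared_error"` is a built-in scikit-learn scorer; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with noise standard deviation 1 (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 300)

pipe = Pipeline([("standardize", StandardScaler()),
                 ("linear_regression", LinearRegression())])

# scikit-learn scorers follow "higher is better", so RMSE comes back negated.
neg_rmse = cross_val_score(pipe, X, y, cv=5,
                           scoring="neg_root_mean_squared_error")

# Flip the sign and average over the five folds.
cv_rmse = -neg_rmse.mean()
print(cv_rmse)
```

Passing the whole pipeline, rather than pre-scaled data, means the scaler is refit inside each fold, so no information from the held-out fold leaks into preprocessing.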

## 13.3.3: One-Hundred Modeling Options

Consider one hundred modeling options for house price:

- House size, trying degrees 1 through 10
- Number of rooms, trying degrees 1 through 10
- Building Type

Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.
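One way the two-element dictionary from the hint might look is sketched below, again on synthetic data. The step names (`prep`, `size`, `rooms`, `poly`) are assumptions chosen for this example; the parameter keys must match whatever names your own pipeline uses, joined by double underscores.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic stand-in for the renamed Ames columns (values are made up).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Size": rng.uniform(600, 3500, n),
    "Rooms": rng.integers(3, 12, n),
    "BldType": rng.choice(["1Fam", "Twnhs", "Duplex"], n),
})
y = 100 * X["Size"] + 2000 * X["Rooms"] + rng.normal(0, 20000, n)


def scaled_poly():
    # Scale before raising to high powers; the degree is set by the grid search.
    return Pipeline([("scale", StandardScaler()),
                     ("poly", PolynomialFeatures(include_bias=False))])


prep = ColumnTransformer([
    ("size", scaled_poly(), ["Size"]),
    ("rooms", scaled_poly(), ["Rooms"]),
    ("bld", OneHotEncoder(handle_unknown="ignore"), ["BldType"]),
], sparse_threshold=0)

pipe = Pipeline([("prep", prep), ("ols", LinearRegression())])

# Two tuning parameters with ten values each: 10 x 10 = 100 candidate models.
param_grid = {
    "prep__size__poly__degree": list(range(1, 11)),
    "prep__rooms__poly__degree": list(range(1, 11)),
}

search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best candidate
```

With 5-fold cross-validation this fits 500 models, which is quick for linear regression on a small dataset but grows multiplicatively with every extra tuning parameter.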

Q1: Which model performed the best?

Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?