# Exercise - Cross-Validation and the Train-Test Split

In this exercise, you will split a dataset into training and test sets, impute missing values and scale features without leaking information from the test set, and run k-fold cross-validation to obtain a more reliable estimate of the model's performance on unseen data.

The dataset is a modified version of the ["Housing Prices Dataset" from Kaggle](https://www.kaggle.com/datasets/yasserh/housing-prices-dataset).

In [107]:
# DO NOT MODIFY - imports
import pandas as pd

# 1. Data Preparation

Other than a few missing values which were introduced intentionally for the purpose of this demo, the dataset is clean and free from duplicated rows and other issues. You do not need to write your own code in this section. However, please read this section and inspect the code thoroughly to understand how the dataset is being set up for the next step.

In [108]:
# DO NOT MODIFY - Data loading and inspection
df = pd.read_csv("Housing_Modified_2.csv")
df.head()

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420.0,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960.0,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960.0,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500.0,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420.0,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [109]:
# DO NOT MODIFY - Check for missing values
df.isnull().sum()

price                0
area                13
bedrooms             0
bathrooms            0
stories              0
mainroad             0
guestroom            0
basement             0
hotwaterheating      0
airconditioning      0
parking              0
prefarea             0
furnishingstatus     0
dtype: int64

We will impute the missing values in the `area` column.  
But first, run the cell below to convert the categorical "`yes`/`no`" columns to ones and zeros (integers).

In [110]:
# DO NOT MODIFY - Data preparation
# Convert "yes" and "no" to 1 and 0
yes_no_columns = ["mainroad", "guestroom", "basement", "hotwaterheating", "airconditioning", "prefarea"]
df[yes_no_columns] = df[yes_no_columns].map({"yes": 1, "no": 0}.get)
df.head()

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420.0,4,2,3,1,0,0,0,1,2,1,furnished
1,12250000,8960.0,4,4,4,1,0,0,0,1,3,0,furnished
2,12250000,9960.0,3,2,2,1,0,1,0,0,2,1,semi-furnished
3,12215000,7500.0,4,2,2,1,0,1,0,1,3,1,furnished
4,11410000,7420.0,4,1,2,1,1,1,0,1,2,0,furnished


In [111]:
df.dtypes

price                 int64
area                float64
bedrooms              int64
bathrooms             int64
stories               int64
mainroad              int64
guestroom             int64
basement              int64
hotwaterheating       int64
airconditioning       int64
parking               int64
prefarea              int64
furnishingstatus     object
dtype: object

Run the cell below to one-hot encode the `furnishingstatus` column. The first resulting column (`furnished`) is dropped to avoid multicollinearity, which makes `furnished` the baseline category.

In [112]:
# DO NOT MODIFY - One-hot encoding `furnishingstatus`
df = pd.get_dummies(df, columns=["furnishingstatus"], drop_first=True)
df.dtypes

price                                int64
area                               float64
bedrooms                             int64
bathrooms                            int64
stories                              int64
mainroad                             int64
guestroom                            int64
basement                             int64
hotwaterheating                      int64
airconditioning                      int64
parking                              int64
prefarea                             int64
furnishingstatus_semi-furnished       bool
furnishingstatus_unfurnished          bool
dtype: object
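To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The `toy` DataFrame and the `is_baseline` name are made up for illustration; only the three category labels come from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical) with the same three furnishing categories as the exercise.
toy = pd.DataFrame(
    {"furnishingstatus": ["furnished", "semi-furnished", "unfurnished", "furnished"]}
)

# Categories are sorted alphabetically, so "furnished" is the first one and gets dropped.
encoded = pd.get_dummies(toy, columns=["furnishingstatus"], drop_first=True)
print(encoded.columns.tolist())

# A row where both remaining indicators are off belongs to the dropped baseline category.
is_baseline = ~encoded.any(axis=1)
print(is_baseline.tolist())
```

No information is lost by dropping the column: a house that is neither semi-furnished nor unfurnished must be furnished.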

We are now ready to split the data, impute missing values and scale the features if need be.

# 2. Train-Test Split and Proper Imputation and Scaling

Create the feature matrix `X`, consisting of all columns except `price`. Then create the target `y` from the values in the `price` column.

In [113]:
# FILL IN - Create feature set `X` and target `y`
X = df.drop(columns=["price"])
y = df["price"]

Split the data into training and testing sets using a 70/30 split. Shuffle the data while you split it, using a random seed of 52.

In [114]:
# DO NOT MODIFY - imports
from sklearn.model_selection import train_test_split

# FILL IN - Split the data into training and testing sets (70% train, 30% test) with a random state of 52
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=52)
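As a quick sanity check on the split sizes, the sketch below splits a plain index array instead of the real data. It assumes the dataset has 545 rows, which is the sum of the training and test counts that `describe()` reports later in this exercise; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 545  # assumed: 381 training rows + 164 test rows
indices = np.arange(n_rows)

# test_size=0.3 rounds the test set up to a whole number of rows.
train_idx, test_idx = train_test_split(indices, test_size=0.3, random_state=52)
print(len(train_idx), len(test_idx))

# Repeating the call with the same seed reproduces the same shuffled split.
train_again, test_again = train_test_split(indices, test_size=0.3, random_state=52)
print(np.array_equal(test_idx, test_again))
```

`train_test_split` shuffles by default, so no extra argument is needed for that; fixing `random_state` is what makes the shuffle repeatable.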

Impute missing values in the `area` column using the `SimpleImputer` class from Scikit-Learn. Use the `median` strategy.

In [115]:
# DO NOT MODIFY - imports
from sklearn.impute import SimpleImputer

# FILL IN - Fit the imputer on the training data, then transform the training AND test data using the fitted imputer
imputer = SimpleImputer(strategy="median")
X_train["area"] = imputer.fit_transform(X_train[["area"]])
X_test["area"] = imputer.transform(X_test[["area"]])
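The following toy sketch shows why the order of `fit_transform` and `transform` matters. The numbers are invented; the point is that the fill value for the test set comes from the training rows alone.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one missing value in each set.
train = pd.DataFrame({"area": [1000.0, 2000.0, np.nan, 4000.0]})
test = pd.DataFrame({"area": [np.nan, 9000.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train[["area"]])  # learns the training median
test_filled = imputer.transform(test[["area"]])        # reuses it, learns nothing new

# The learned fill value is stored on the fitted imputer.
print(imputer.statistics_)
print(test_filled.ravel())
```

The missing test value is filled with the training median, not with anything computed from the test rows. Calling `fit_transform` on the test set instead would let test data influence the preprocessing, which is the leakage this exercise is meant to avoid.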

Below, we pick out the columns that were originally numeric (and not just 0 or 1). Scale these features using a `MinMaxScaler`: fit it on the training data only, then use the fitted scaler to transform both the training and test sets.

In [116]:
#  DO NOT MODIFY - Features that were originally numeric (and not just 0 or 1)
numeric_columns = ["area", "bedrooms", "bathrooms", "stories", "parking"]

# DO NOT MODIFY - imports
from sklearn.preprocessing import MinMaxScaler

# FILL IN - Fit the MinMaxScaler on the training data, then transform the training AND test data using the fitted scaler
scaler = MinMaxScaler()
X_train[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])
X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])

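Using `describe()`, verify the result of the scaling. Every scaled column in the training set should now range from exactly zero to one. The test set was transformed with the training minimum and maximum, so its values can fall slightly outside that range (for example, a negative minimum for `bedrooms` or a maximum above one for `bathrooms`); this is expected and is not a sign of an error.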

In [117]:
X_train.describe()

,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea
count,381.0,381.0,381.0,381.0,381.0,381.0,381.0,381.0,381.0,381.0,381.0
mean,0.239572,0.239501,0.133858,0.262467,0.855643,0.175853,0.35958,0.047244,0.304462,0.225722,0.233596
std,0.144209,0.182872,0.241559,0.280861,0.351913,0.381196,0.480508,0.21244,0.460784,0.286791,0.423674
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.134021,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.205498,0.25,0.0,0.333333,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.305842,0.25,0.0,0.333333,1.0,0.0,1.0,0.0,1.0,0.333333,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [118]:
X_test.describe()

,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea
count,164.0,164.0,164.0,164.0,164.0,164.0,164.0,164.0,164.0,164.0,164.0
mean,0.245309,0.245427,0.164634,0.28252,0.865854,0.182927,0.329268,0.042683,0.341463,0.243902,0.237805
std,0.15682,0.188782,0.271949,0.308026,0.341853,0.387791,0.471387,0.20276,0.475653,0.28861,0.427043
min,0.024055,-0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.127148,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.201375,0.25,0.0,0.333333,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.336082,0.25,0.5,0.333333,1.0,0.0,1.0,0.0,1.0,0.333333,0.0
max,0.958763,0.75,1.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
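The test-set summary above contains values such as `-0.25` and `1.5`. A minimal sketch of how this can happen, using invented bedroom counts (not the actual rows of the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values: the training set spans 2 to 6 bedrooms.
train_bedrooms = np.array([[2.0], [3.0], [4.0], [6.0]])
# The test set contains a value below the training minimum.
test_bedrooms = np.array([[1.0], [5.0]])

scaler = MinMaxScaler()
scaler.fit(train_bedrooms)  # learns min=2 and max=6 from the training data only

# Each value is mapped to (x - min) / (max - min) using the training min and max.
scaled_test = scaler.transform(test_bedrooms)
print(scaled_test.ravel())
```

A test value below the training minimum maps to a negative number, and one above the training maximum would map to more than one. This is the price of keeping the test set out of the fitting step, and it mirrors what the model would face on genuinely new data.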


# 3. K-Fold Cross-Validation

Train a linear regression model on the training set and output its *training* score (which, by default, is the R-squared for regression tasks).

**HINT:** Use the `score()` method of the fitted model.

In [119]:
# DO NOT MODIFY - imports
from sklearn.linear_model import LinearRegression

# FILL IN - Train a linear regression model and output its R-squared score on the training set
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_train, y_train)

0.6829310708044551

Can we expect a similarly high score on unseen data? Before looking at the holdout (test) set, cross-validate the model using 5-fold CV and output the average score.

In [120]:
# DO NOT MODIFY - imports
from sklearn.model_selection import cross_val_score

# FILL IN - Cross-validate the model using 5-fold CV and output the mean R-squared score
r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
r2_scores.mean()

np.float64(0.6364504055297886)
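One subtlety: the training set in this exercise was imputed and scaled once, before cross-validation, so each validation fold contributed to the median and to the min/max. For simple transforms like these the effect is usually small, but a stricter variant refits the preprocessing inside every fold. The sketch below shows one way to do that with a scikit-learn `Pipeline`. It runs on synthetic data (`X_demo` and `y_demo` are made up) and is an illustration, not part of the graded solution.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data (hypothetical), with some values knocked out afterwards.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)
X_demo[rng.random(200) < 0.05, 0] = np.nan

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the imputer and scaler never see the corresponding validation fold.
pipeline = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler(), LinearRegression())
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")
print(scores.mean())
```

Note that this pipeline scales every column. Restricting the scaler to a subset such as `numeric_columns` would need an additional step, for example a `ColumnTransformer`.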

Finally, evaluate the trained model on the test set and output the test score (R-squared). Is it closer to the training score or the average CV score?

In [121]:
# DO NOT MODIFY - imports
from sklearn.metrics import r2_score

# FILL IN - Evaluate the model on the test set and output its R-squared score
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
r2

0.6559372457471379
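Two details are worth checking here. First, `cross_val_score` works on clones of the estimator, so the model fitted earlier is still the one trained on the full training set. Second, for a regressor, `score()` and `r2_score` compute the same quantity. The sketch below illustrates both on synthetic data; all names and values are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical): first 80 rows for training, last 40 as a holdout.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 2))
y_demo = 4.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.3, size=120)

model = LinearRegression().fit(X_demo[:80], y_demo[:80])
coef_before = model.coef_.copy()

# Cross-validation fits clones, leaving the already fitted model untouched.
cross_val_score(model, X_demo[:80], y_demo[:80], cv=5, scoring="r2")
print(np.allclose(coef_before, model.coef_))

# Two equivalent ways to get the holdout R-squared.
via_score = model.score(X_demo[80:], y_demo[80:])
via_metric = r2_score(y_demo[80:], model.predict(X_demo[80:]))
print(via_score, via_metric)
```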