# Exercise 03: Predicting house prices

### GRA 4160

Data: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is taken from a Kaggle competition, and it contains information about houses in Ames, Iowa.
The goal of the competition was to predict the sale price of a house based on various features of the house.

The dataset includes the following information:

1. SalePrice: the sale price of the house (the target variable)
2. Various features of the house such as the overall quality, the living area, the number of bedrooms and bathrooms, the year built, etc.
3. Various features of the neighborhood such as the overall condition of the property, the proximity to various amenities, etc.

The training file includes 1460 observations (houses) and 81 columns: an `Id`, 79 features, and the target `SalePrice`. The features include both numerical and categorical variables.

Some numerical variables are continuous, while others are discrete. Some categorical variables are ordinal (the categories have a natural ordering), while others are nominal (the categories have no inherent ordering).
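
The distinction matters when you encode categorical columns as numbers. The sketch below is an illustration on made-up rows, not part of the exercise solution: it maps an ordinal column to ordered integers and one-hot encodes a nominal column. The column names follow the dataset, but the toy values and the assumed ordering `Po < Fa < TA < Gd < Ex` should be checked against the data description.

```python
import pandas as pd

# Toy rows for illustration only; the values are made up, not taken from train.csv.
toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "Fa"],                      # ordinal: quality grades
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown"],   # nominal: no ordering
})

# Ordinal variable: map categories to integers that respect their order.
# The ordering below is an assumption; verify it in the data description.
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
toy["ExterQual_code"] = toy["ExterQual"].map(quality_order)

# Nominal variable: one-hot encode, giving one indicator column per category.
neighborhood_dummies = pd.get_dummies(toy["Neighborhood"], prefix="Neighborhood")

encoded = pd.concat([toy[["ExterQual_code"]], neighborhood_dummies], axis=1)
print(encoded)
```

Using integers for a nominal variable would impose an ordering that does not exist, which is why the two cases are treated differently here.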

The dataset is a good example of a real-world dataset that requires feature engineering, cleaning, and preprocessing before the model can be trained on it.
There are missing values and outliers in the dataset that you must deal with.
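
A first look at both problems might go as follows. This is a minimal sketch on a made-up frame called `toy`; in the exercise you would apply the same calls to the loaded data. The 1.5 × IQR rule used to flag outliers is a common heuristic chosen here for illustration, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; replace `toy` with the loaded DataFrame.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, 70.0],
    "GrLivArea": [1500, 1600, 1400, 1550, 5600],
    "SalePrice": [200000, 210000, 180000, 205000, 160000],
})

# Number of missing values per column, largest first.
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts[missing_counts > 0])

# Flag potential outliers in one column with the 1.5 * IQR rule.
q1, q3 = toy["GrLivArea"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["GrLivArea"] < q1 - 1.5 * iqr) | (toy["GrLivArea"] > q3 + 1.5 * iqr)
print(toy[is_outlier])
```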

## Exercises:

1. Load the house price dataset. Have a look at its variables. What do you think are the best predictors for the sale price?
2. Split the data into a training and a test set (create the variables `X_train`, `X_test`, `y_train`, `y_test`).
3. Do some data cleaning and preprocessing:

   a. As a minimum, keep only the numerical columns and drop the rows with missing values.
   
   b. Normalize the data (e.g., standardize each column to mean zero and standard deviation one).
   
4. Train a model for predicting the house price using the numerical columns of the dataset. Report both the in-sample and the out-of-sample performance of the model. Report at least the Mean Squared Error (MSE) and the $R^2$.
5. Do the same using Ridge and Lasso models.
6. Use the Lasso algorithm to identify the 10 most important features in the data set. Tip: You can use `SelectFromModel` from `sklearn.feature_selection`.
7. Train a linear regression model where you only include the 10 most important features you found in Exercise 6. Report at least the Mean Squared Error (MSE) and $R^2$.
8. Write some code so that you can experiment with how changing the inputs affects the predicted price. For example, write a function that takes a vector of non-normalized features for one or more units and returns the predicted price for those units. The function should accept data that is not normalized (normalize it inside the function before making the prediction), and the price it returns should be on the original scale, not normalized. Try changing some features one at a time and see how the price predictions change.

In [None]:
# Exercise 1

import pandas as pd

data = pd.read_csv('../../data/house-prices/train.csv')
data.head()

In [None]:
# Exercise 2

from sklearn.model_selection import train_test_split

X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=10)

In [None]:
# Exercise 3a

# numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_train.select_dtypes(include=['object']).columns

# We drop the rows with nans
X_train_num = X_train[numeric_cols].dropna()
X_test_num = X_test[numeric_cols].dropna()

# Keep only the target values for the rows without NaNs
y_train_num = y_train.loc[X_train_num.index]
y_test_num = y_test.loc[X_test_num.index]

In [None]:
# Exercise 3b

from sklearn.preprocessing import StandardScaler

# Initialize the scalers and fit them on the training data only
scaler_x = StandardScaler()
scaler_x.fit(X_train_num)

scaler_y = StandardScaler()
scaler_y.fit(y_train_num.values.reshape(-1,1))

# Normalize the data
X_train_norm = pd.DataFrame(scaler_x.transform(X_train_num),
                            index=X_train_num.index, columns=X_train_num.columns)
X_test_norm = pd.DataFrame(scaler_x.transform(X_test_num),
                           index=X_test_num.index, columns=X_test_num.columns)

y_train_norm = pd.DataFrame(scaler_y.transform(y_train_num.values.reshape(-1,1)),
                            index=y_train_num.index, columns=['SalePrice'])['SalePrice']
y_test_norm = pd.DataFrame(scaler_y.transform(y_test_num.values.reshape(-1,1)),
                           index=y_test_num.index, columns=['SalePrice'])['SalePrice']

In [None]:
# Exercise 4

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the Linear Regression model
lm = LinearRegression()

# Fit the model to the training data
lm.fit(X_train_norm, y_train_norm)

In [None]:
# In-sample fit
y_lm = lm.predict(X_train_norm)

# In-sample performance
mse_lm_is = mean_squared_error(y_train_norm, y_lm)
r2_lm_is = r2_score(y_train_norm, y_lm)

print(f'In-sample Mean Squared Error (LM): {mse_lm_is:.4f}')
print(f'In-sample R-Squared (LM): {r2_lm_is:.4f}\n')

In [None]:
# Make predictions on the test data
y_pred_lm = lm.predict(X_test_norm)

# Evaluate the model's performance
mse_lm = mean_squared_error(y_test_norm, y_pred_lm)
r2_lm = r2_score(y_test_norm, y_pred_lm)

print(f'Out-of-sample Mean Squared Error (LM): {mse_lm:.4f}')
print(f'Out-of-sample R-Squared (LM): {r2_lm:.4f}')
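
Because the target was standardized, the MSE above is measured in squared standard deviations of the training prices, not in squared dollars. Mapping the predictions back with the target scaler multiplies the MSE by the variance the scaler learned, while $R^2$ stays the same. The sketch below checks this on synthetic data; all names prefixed with `demo_` and the data itself are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (not the house price data).
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(200, 3))
demo_y = 180_000 + 40_000 * demo_X[:, 0] + 10_000 * demo_X[:, 1] + rng.normal(scale=5_000, size=200)

# Standardize the target, as in Exercise 3b.
demo_scaler_y = StandardScaler()
demo_y_norm = demo_scaler_y.fit_transform(demo_y.reshape(-1, 1)).ravel()

demo_model = LinearRegression().fit(demo_X, demo_y_norm)
demo_pred_norm = demo_model.predict(demo_X)

# Map the predictions back to the original scale.
demo_pred = demo_scaler_y.inverse_transform(demo_pred_norm.reshape(-1, 1)).ravel()

mse_norm = mean_squared_error(demo_y_norm, demo_pred_norm)
mse_orig = mean_squared_error(demo_y, demo_pred)
rmse_orig = np.sqrt(mse_orig)

# The two MSEs differ by the factor var(y); R^2 does not depend on the scale.
print(f"MSE (normalized): {mse_norm:.4f}")
print(f"RMSE (original units): {rmse_orig:.0f}")
print(f"R-Squared: {r2_score(demo_y, demo_pred):.4f}")
```

Reporting the RMSE in original units alongside the normalized MSE makes the size of a typical prediction error easier to interpret.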

In [None]:
# Exercise 5

from sklearn.linear_model import Ridge, Lasso

# Initialize the models
ridge = Ridge(alpha=0.5)
lasso = Lasso(alpha=0.01)

# Fit the models to the training data
ridge.fit(X_train_norm, y_train_norm)
lasso.fit(X_train_norm, y_train_norm)

In [None]:
# Make predictions on the train data
y_ridge = ridge.predict(X_train_norm)
y_lasso = lasso.predict(X_train_norm)

mse_ridge_is = mean_squared_error(y_train_norm, y_ridge)
mse_lasso_is = mean_squared_error(y_train_norm, y_lasso)
r2_ridge_is = r2_score(y_train_norm, y_ridge)
r2_lasso_is = r2_score(y_train_norm, y_lasso)

print(f'In-sample Mean Squared Error (Ridge): {mse_ridge_is:.4f}')
print(f'In-sample Mean Squared Error (Lasso): {mse_lasso_is:.4f}')
print(f'In-sample R-Squared (Ridge): {r2_ridge_is:.4f}')
print(f'In-sample R-Squared (Lasso): {r2_lasso_is:.4f}')

In [None]:
# Make predictions on the test data
y_pred_ridge = ridge.predict(X_test_norm)
y_pred_lasso = lasso.predict(X_test_norm)

# Evaluate the model's performance
mse_ridge = mean_squared_error(y_test_norm, y_pred_ridge)
mse_lasso = mean_squared_error(y_test_norm, y_pred_lasso)
r2_ridge = r2_score(y_test_norm, y_pred_ridge)
r2_lasso = r2_score(y_test_norm, y_pred_lasso)

print(f'Out-of-sample Mean Squared Error (Ridge): {mse_ridge:.4f}')
print(f'Out-of-sample Mean Squared Error (Lasso): {mse_lasso:.4f}')
print(f'Out-of-sample R-Squared (Ridge): {r2_ridge:.4f}')
print(f'Out-of-sample R-Squared (Lasso): {r2_lasso:.4f}')
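
The penalty strengths used here (`alpha=0.5` for Ridge and `alpha=0.01` for Lasso) are fixed by hand. One optional alternative, sketched below on synthetic data, is to let `RidgeCV` and `LassoCV` pick `alpha` by cross-validation on the training set. The candidate grid and the names `X_demo` and `y_demo` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic data for illustration only (not the house price data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# Candidate penalty strengths on a log scale (an arbitrary grid for this sketch).
alphas = np.logspace(-3, 1, 20)

# RidgeCV uses an efficient leave-one-out scheme by default; LassoCV uses 5 folds here.
ridge_cv = RidgeCV(alphas=alphas).fit(X_demo, y_demo)
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=0).fit(X_demo, y_demo)

print(f"Selected alpha (Ridge): {ridge_cv.alpha_:.4f}")
print(f"Selected alpha (Lasso): {lasso_cv.alpha_:.4f}")
```

Only the training data should be used for this search, so that the test set still gives an honest out-of-sample estimate.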

In [None]:
# Exercise 6

from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Initialize the Lasso model
lasso = Lasso(alpha=0.01)

# Fit the model to the training data
lasso.fit(X_train_norm, y_train_norm)

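# Create a SelectFromModel object to select the 10 most important features.
# threshold=-inf ranks purely by coefficient size, so exactly 10 features are kept
# (with the default threshold, max_features is only an upper bound).
sfm = SelectFromModel(lasso, max_features=10, threshold=-float("inf"))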

# Fit the SelectFromModel object to the training data
sfm.fit(X_train_norm, y_train_norm)

# Get the selected features
important_features = X_train_norm.columns[sfm.get_support()]
print("The 10 most important features are: ", important_features)
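
In `SelectFromModel`, `max_features` caps the number of selected features, while `threshold` decides which features qualify at all. For an L1-penalized estimator such as Lasso the default threshold is small but positive, so coefficients that were shrunk to zero never qualify and fewer than 10 features can come back. The sketch below compares the default with `threshold=-np.inf` on synthetic data; the names `X_demo` and `y_demo` and the deliberately strong penalty `alpha=0.5` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data for illustration only: 15 features, of which only three drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 15)),
                      columns=[f"x{i}" for i in range(15)])
y_demo = (3.0 * X_demo["x0"] + 2.0 * X_demo["x1"] - 1.0 * X_demo["x2"]
          + rng.normal(scale=0.1, size=300))

# Default threshold: features with (near) zero coefficients are dropped,
# so max_features acts as an upper bound only.
sfm_default = SelectFromModel(Lasso(alpha=0.5), max_features=10).fit(X_demo, y_demo)

# threshold=-np.inf: select purely by rank, so exactly max_features are kept.
sfm_exact = SelectFromModel(Lasso(alpha=0.5), max_features=10,
                            threshold=-np.inf).fit(X_demo, y_demo)

n_default = int(sfm_default.get_support().sum())
n_exact = int(sfm_exact.get_support().sum())
print(f"Selected with default threshold: {n_default}")
print(f"Selected with threshold=-inf:    {n_exact}")
```

With rank-based selection, some of the kept features may have a coefficient of exactly zero, so it is worth inspecting the Lasso coefficients before treating all 10 as important.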

In [None]:
# Exercise 7

# Initialize the Linear Regression model
lm2 = LinearRegression()

# Fit the model to the training data
lm2.fit(X_train_norm[important_features], y_train_norm)

# Make predictions on the test data
y_pred_lm2 = lm2.predict(X_test_norm[important_features])

# Evaluate the model's performance
mse_lm2 = mean_squared_error(y_test_norm, y_pred_lm2)
r2_lm2 = r2_score(y_test_norm, y_pred_lm2)

print(f'Out-of-sample Mean Squared Error (LM, selected features): {mse_lm2:.4f}')
print(f'Out-of-sample R-Squared (LM, selected features): {r2_lm2:.4f}')

In [None]:
# Exercise 8

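def inspectPrediction(model, observed_units):
    """Predict prices in original units from features that are not normalized."""
    # Normalize the features with the scaler fitted on the training data
    units_norm = pd.DataFrame(scaler_x.transform(observed_units),
                              index=observed_units.index, columns=observed_units.columns)
    price_prediction = model.predict(units_norm)
    return scaler_y.inverse_transform(price_prediction.reshape(-1, 1))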

In [None]:
observed_units = X_test_num.iloc[0:5].copy()
observed_units

In [None]:
first_5_predicted_price = pd.DataFrame(inspectPrediction(lm, observed_units), index=y_test_num.iloc[0:5].index, columns=['SalePrice'])
first_five_actual_price = y_test_num.iloc[0:5]

In [None]:
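# Change some features. 854 and 381 are index labels (not positions) and must
# match rows of `observed_units`; they depend on the train/test split.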
observed_units.at[854, 'OverallQual'] = 8
observed_units.at[381, 'YearBuilt'] = 1970

observed_units

In [None]:
first_5_predicted_price_changed = pd.DataFrame(inspectPrediction(lm, observed_units), index=y_test_num.iloc[0:5].index,
                                       columns=['SalePrice'])

pd.DataFrame([first_five_actual_price, first_5_predicted_price['SalePrice'], first_5_predicted_price_changed['SalePrice']],
             index=['Actual', 'Predicted', 'Edited']).T
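
An alternative way to organize Exercise 8, sketched below under stated assumptions, is to let scikit-learn handle both scalers: a `Pipeline` standardizes the features and `TransformedTargetRegressor` standardizes the target and undoes that transformation when predicting. The synthetic data, the helper `price_change`, and the names `X_demo`, `y_demo` and `price_model` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (not the house price data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=200),
    "GrLivArea": rng.normal(1500, 400, size=200),
})
y_demo = (20_000 * X_demo["OverallQual"] + 60 * X_demo["GrLivArea"]
          + rng.normal(scale=5_000, size=200))

# Features are standardized inside the pipeline, the target inside the wrapper,
# so the model takes raw features and returns prices on the original scale.
price_model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), LinearRegression()),
    transformer=StandardScaler(),
).fit(X_demo, y_demo)


def price_change(model, units, feature, new_value):
    """Return the change in predicted price when `feature` is set to `new_value`."""
    edited = units.copy()
    edited[feature] = new_value
    return model.predict(edited) - model.predict(units)


# Raise the overall quality of the first three units by one grade.
units = X_demo.iloc[:3]
delta = price_change(price_model, units, "OverallQual", units["OverallQual"] + 1)
print(delta)
```

For a linear model the change is the same for every unit, because it equals the coefficient of the edited feature expressed in original units.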