# Homework 1
## Deadline: Jan 27, 2024

### Instructions
Submit one Python notebook file for grading. Your file must include mathematical work (type it or insert pictures of your handwritten work), **text explanations** of your work, **well-commented code**, and the **outputs** from your code.

## Problems

1. Ridge regression is a modified version of linear regression that penalizes large coefficients. It does so by adding a so-called $L^2$ penalty term to the loss function (e.g. mean squared error): $L(\theta)=\frac{1}{n}\sum\limits_{i=1}^n \left(\hat{f}(x_i)-y_i\right)^2 + \lambda\sum\limits_{i=1}^d \theta_i^2$

    where $\lambda>0$ is a **hyperparameter** that must be tuned by the user. An appropriate choice of $\lambda$ often helps when the input features are highly correlated or when the model overfits.

     a. **[5 points]** Write each part of $L(\theta)$ in matrix-vector form where $\hat{f}$ is an LBF expansion regression model. Define each matrix and vector separately by writing their elements with subscripts, and state their dimensions.

Ridge regression loss: $L(\theta)=\frac{1}{n}\sum\limits_{i=1}^n \left(\hat{f}(x_i)-y_i\right)^2 + \lambda\sum\limits_{i=1}^d \theta_i^2$

<!-- Loss Function = $\frac{1}{n}\sum\limits_{i=1}^n \left(\hat{f}(x_i)-y_i\right)^2$

$L^2$ Penalty Term = $\lambda\sum\limits_{i=1}^d \theta_i^2$ -->



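Data matrix, $X \in \mathbb{R}^{n\times d}$ (row $i$ holds the $d$ features of input $x_i$):

$$X = \begin{bmatrix}
x_{11} & \cdots & x_{1d}\\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nd}\\
\end{bmatrix}$$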

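Basis-expanded matrix, $X_h \in \mathbb{R}^{n\times(M+1)}$ (row $i$ applies the basis functions $h_0, \dots, h_M$ to $x_i$):

$$X_h = \begin{bmatrix}
h_0(x_1) & \cdots & h_M(x_1)\\
\vdots & \ddots & \vdots \\
h_0(x_n) & \cdots & h_M(x_n)\\
\end{bmatrix}$$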

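$\hat{f}(X) = X_h\theta$, where $\theta = (\theta_0, \dots, \theta_M)^T \in \mathbb{R}^{M+1}$ and $y = (y_1, \dots, y_n)^T \in \mathbb{R}^n$.

$L(\theta) = \frac{1}{n}||X_h\theta - y||^2_2 + \lambda\theta^T D\theta$, where $D = \mathrm{diag}(0, 1, \dots, 1) \in \mathbb{R}^{(M+1)\times(M+1)}$ leaves $\theta_0$ unpenalized because the penalty sum starts at $i=1$. (If the intercept were penalized as well, $D$ would be the identity and the penalty would reduce to $\lambda\theta^T\theta$.)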
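
As a quick sanity check, the sketch below evaluates the summation form of $L(\theta)$ from the problem statement and a matrix-vector form in which the penalty skips $\theta_0$, and compares the two. Everything here is an assumption made for illustration: the data are random stand-ins, and the names `X_h`, `D`, `lam`, `loss_sum`, and `loss_mat` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 8, 3                       # 8 samples, basis functions h_0..h_3 (illustrative sizes)

X_h = rng.normal(size=(n, M + 1)) # stand-in for the basis-expanded matrix
X_h[:, 0] = 1.0                   # assume h_0(x) = 1, the constant basis function
theta = rng.normal(size=M + 1)
y = rng.normal(size=n)
lam = 0.7

# Summation form: mean squared error plus the penalty over theta_1..theta_M.
mse_sum = sum((X_h[i] @ theta - y[i]) ** 2 for i in range(n)) / n
penalty_sum = lam * sum(theta[j] ** 2 for j in range(1, M + 1))
loss_sum = mse_sum + penalty_sum

# Matrix-vector form: D is the identity with a zero in the (0, 0) entry,
# so theta_0 does not contribute to the penalty.
D = np.eye(M + 1)
D[0, 0] = 0.0
residual = X_h @ theta - y
loss_mat = residual @ residual / n + lam * (theta @ D @ theta)

print(np.isclose(loss_sum, loss_mat))
```

The two values agree up to floating-point rounding, which is what the matrix form is meant to guarantee.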


b. **[5 points]** Solve the following optimization problem by hand for the loss function $L$ above: $\min\limits_{\theta\in\mathbb{R}^{d+1}} L(\theta)$.
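
One possible route to the minimizer is sketched here, under stated assumptions. The loss is taken in the matrix form $L(\theta)=\frac{1}{n}\|X_h\theta-y\|_2^2+\lambda\theta^T D\theta$, where $X_h$ is the basis-expanded data matrix and $D=\mathrm{diag}(0,1,\dots,1)$ is notation introduced for this sketch (it is not in the problem statement) so that $\theta_0$ stays unpenalized, matching the penalty sum that starts at $i=1$. If the intercept is meant to be penalized too, replace $D$ with the identity.

$$\begin{aligned}
\nabla_\theta L(\theta) &= \frac{2}{n}X_h^T\left(X_h\theta - y\right) + 2\lambda D\theta \\
\nabla_\theta L(\theta) = 0 &\iff \left(X_h^T X_h + n\lambda D\right)\theta = X_h^T y \\
\hat{\theta} &= \left(X_h^T X_h + n\lambda D\right)^{-1} X_h^T y
\end{aligned}$$

The Hessian $\frac{2}{n}X_h^T X_h + 2\lambda D$ is positive semidefinite, so $L$ is convex and a stationary point is a global minimizer. With $\lambda>0$, the matrix $X_h^T X_h + n\lambda D$ is invertible whenever the first column of $X_h$ is nonzero (for example when $h_0(x)=1$): any vector $v$ in its null space must satisfy $X_h v = 0$ and $v_1=\dots=v_M=0$, which forces $v_0=0$ as well.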

c. **[5 points]** Write a Python class for this ridge LBF expansion regression model with a `fit` function applying the formula from part (b) to compute the parameters $\theta$ and a `predict` function to make predictions for input data after the model has been fit.
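
A minimal sketch of such a class follows, assuming the closed form $\hat{\theta} = (X_h^T X_h + n\lambda D)^{-1}X_h^T y$ with $D=\mathrm{diag}(0,1,\dots,1)$ so that the intercept is not penalized. The class name `RidgeLBFRegression`, the `basis_functions` and `lam` arguments, and the demo data are all hypothetical choices made for illustration; the assignment does not prescribe them.

```python
import numpy as np


class RidgeLBFRegression:
    """Ridge regression on a linear basis function (LBF) expansion.

    Minimizes (1/n) * ||X_h @ theta - y||^2 + lam * sum(theta[1:] ** 2),
    where column j of X_h is basis_functions[j] applied to the inputs.
    """

    def __init__(self, basis_functions, lam=1.0):
        # basis_functions: list of callables, each mapping an (n, d) array to
        # an (n,) array. The first is assumed to be the constant h_0(x) = 1.
        self.basis_functions = basis_functions
        self.lam = lam
        self.theta = None

    def _expand(self, X):
        """Build the (n, M+1) basis-expanded matrix X_h."""
        X = np.asarray(X, dtype=float)
        return np.column_stack([h(X) for h in self.basis_functions])

    def fit(self, X, y):
        X_h = self._expand(X)
        y = np.asarray(y, dtype=float)
        n, m = X_h.shape
        D = np.eye(m)
        D[0, 0] = 0.0  # leave the intercept theta_0 unpenalized
        # Solve (X_h^T X_h + n*lam*D) theta = X_h^T y without forming an inverse.
        self.theta = np.linalg.solve(X_h.T @ X_h + n * self.lam * D, X_h.T @ y)
        return self

    def predict(self, X):
        if self.theta is None:
            raise RuntimeError("Call fit before predict.")
        return self._expand(X) @ self.theta


# Usage on noiseless synthetic data with a quadratic basis in one feature.
basis = [
    lambda X: np.ones(len(X)),   # h_0(x) = 1
    lambda X: X[:, 0],           # h_1(x) = x
    lambda X: X[:, 0] ** 2,      # h_2(x) = x^2
]
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(50, 1))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 0] ** 2

model = RidgeLBFRegression(basis, lam=0.0).fit(X_demo, y_demo)
print(model.theta)  # close to [1, 2, -3] because lam = 0 and the data are noiseless
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a design choice for numerical stability; the fitted parameters are the same in exact arithmetic.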

2. Use the details about houses in a real estate dataset and attempt to predict the list price for the houses. Use the [Mount Pleasant Real Estate dataset](https://www.hawkeslearning.com/Statistics/dis/datasets.html).

    a. **[5 points]** Read the dataset into Python and preprocess the data, excluding the "Misc Exterior" and "Amenities" columns, into an appropriate data matrix for regression analysis. Randomly split the data into a training set, validation set, and test set at 60\%/20\%/20\%.


In [36]:
'''

This cell imports the data from an Excel file and stores it in a pandas DataFrame.
The data is then cleaned so that every cell contains a numeric value rather than a string.
Once all data preparation has been completed, the data is split into training, validation,
and test sets.

'''

import pandas as pd
from sklearn.model_selection import train_test_split

path_to_file = "/Users/spencerhirsch/Documents/GitHub/senior/mlwhite/hw/hw1/Mount_Pleasant_Real_Estate_Data.xlsx"
data = pd.read_excel(path_to_file)      # Take in file as excel file.

'''
The following code one-hot encodes the categorical "Subdivision" and "House Style" columns: each
distinct string value becomes its own indicator column (the first level of each is dropped). The
excluded columns are removed, rows with missing values are dropped, Yes/No entries are mapped to
1/0, and everything is cast to float so that all values contained in the table are numeric.
'''

data = pd.get_dummies(data, columns=["Subdivision", "House Style"], drop_first=True)
data = data.drop(columns=["Misc Exterior", "Amenities", "ID"])
data = data.dropna()
data = data.replace({"Yes": 1, "No": 0})
data = data.astype(float)
print(data)


dataY = data["List Price"].to_numpy()                           # Target (list price) saved as Y
dataX = data.drop(columns = ["List Price"]).to_numpy()          # Features saved as X

# Hold out 40% of the rows, then split that holdout in half: 60% train, 20% validation, 20% test.
trainX, restX, trainY, restY = train_test_split(dataX, dataY, test_size = 0.4, random_state = 1)
valX, testX, valY, testY = train_test_split(restX, restY, test_size = 0.5, random_state=1)

b. **[5 points]** Fit the least squares hyperplane to the training set to predict house prices, and evaluate its fit on the validation set.

In [46]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

model = LinearRegression()

model.fit(trainX, trainY)

# return the predicted outputs for the datapoints in the training set
trainPredictions = model.predict(trainX)
# print(trainPredictions)
print('The mean absolute error on the training set is', mean_absolute_error(trainY, trainPredictions))
print()
predictions = model.predict(valX)
# print(predictions)
print('The mean absolute error on the validation set is', mean_absolute_error(valY, predictions))

The mean absolute error on the training set is 61132.35365541476

The mean absolute error on the validation set is 74602.94721962644


c. **[5 points]** Fit a ridge regression to the training set to predict house prices, and evaluate its fit on the validation set. Repeat this for several different values of $\lambda$.

In [49]:
from sklearn.linear_model import Ridge

lambdas = [0, 0.1, 1, 2, 5, 10]
for λ in lambdas:
    model = Ridge(alpha=λ)
    model.fit(trainX, trainY)
    prediction = model.predict(valX)
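    # Score this ridge model's own predictions (`prediction`), not the least squares `predictions` from part (b).
    print('The mean absolute error when λ = %s on the validation set is %s' % (λ, mean_absolute_error(valY, prediction)))

Note: the output recorded below scored the least squares `predictions` from part (b) instead of `prediction`, which is why every value equals the part (b) validation error; rerunning the cell gives one error per $\lambda$.

The mean absolute error when λ = 0 on the validation set is 74602.94721962644
The mean absolute error when λ = 0.1 on the validation set is 74602.94721962644
The mean absolute error when λ = 1 on the validation set is 74602.94721962644
The mean absolute error when λ = 2 on the validation set is 74602.94721962644
The mean absolute error when λ = 5 on the validation set is 74602.94721962644
The mean absolute error when λ = 10 on the validation set is 74602.94721962644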
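
One caveat when relating these runs to the loss in problem 1: scikit-learn's `Ridge` minimizes $\|y - Xw\|_2^2 + \alpha\|w\|_2^2$ without the $\frac{1}{n}$ factor and leaves the intercept unpenalized, so `alpha` corresponds to $n\lambda$ rather than $\lambda$. The sketch below illustrates this on synthetic stand-in data (not the real estate dataset); the closed form it compares against assumes an unpenalized intercept, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=n)

lam = 0.1  # lambda in the (1/n)-scaled loss from problem 1

# Closed-form minimizer of (1/n)||X_h theta - y||^2 + lam * sum(theta[1:]**2),
# where X_h prepends a column of ones for the intercept.
X_h = np.column_stack([np.ones(n), X])
D = np.eye(d + 1)
D[0, 0] = 0.0
theta = np.linalg.solve(X_h.T @ X_h + n * lam * D, X_h.T @ y)

# scikit-learn's objective has no 1/n factor, so pass alpha = n * lam.
sk_model = Ridge(alpha=n * lam).fit(X, y)

print("max coefficient difference:", np.max(np.abs(theta[1:] - sk_model.coef_)))
print("intercept difference:", abs(theta[0] - sk_model.intercept_))
```

Under this correspondence, the grid `[0, 0.1, 1, 2, 5, 10]` passed directly as `alpha` amounts to very small values of $\lambda$ once divided by the number of training rows, so a wider grid may be worth trying.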


d. **[5 points]** Fit an LBF expansion of your choice to the training set to predict house prices, and evaluate its fit on the validation set. Repeat this for several different values of $\lambda$.

In [146]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

lambdas = [0, 0.1, 1, 2, 5, 10]
for λ in lambdas:
    print(λ)

0
0.1
1
2
5
10
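
The loop above only prints each $\lambda$. One way the fit could be completed is sketched below, assuming a degree-2 polynomial expansion as the LBF of choice and using synthetic stand-in data so the example runs on its own. The degree, the scaling step, the `alpha` grid, and all variable names are illustrative assumptions, not the assignment's required approach.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for (trainX, trainY) and (valX, valY).
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 2))
y = 1.0 + X[:, 0] ** 2 - X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=300)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

val_errors = {}
for alpha in [0.1, 1, 2, 5, 10]:
    # Standardize the inputs, expand them into all monomials up to degree 2,
    # then fit ridge regression on the expanded features.
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2, include_bias=False),
        Ridge(alpha=alpha),
    )
    model.fit(X_train, y_train)
    val_errors[alpha] = mean_absolute_error(y_val, model.predict(X_val))
    print(f"alpha = {alpha}: validation MAE = {val_errors[alpha]:.4f}")

# Pick the penalty with the lowest validation error.
best_alpha = min(val_errors, key=val_errors.get)
print("best alpha:", best_alpha)
```

On the homework data, the same pipeline would be fit on `trainX`, `trainY` and scored on `valX`, `valY`. Standardizing first is a design choice: the ridge penalty treats all coefficients alike, so features on very different scales would otherwise be penalized unevenly.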
