# Project 4B: Regression
## University of Toronto | Project by Riyan Roy

In this part of the project, I extended linear regression analysis to explore additional features and regularization. I used a toy dataset (`LR_data.csv`) with 10 features (`x1`-`x10`) and a measurement `y`. The main focus was on linear regression, polynomial features, and regularization techniques.

## Part 1: Data Preparation
- Printed the dataset and created NumPy arrays for the inputs (`X`, columns `x1`-`x10`) and the output (`y`).
- Split the dataset into training and validation sets (80% training, 20% validation).

## Part 2: Linear Regression
- Standardized the data using `StandardScaler` from sklearn, fitting the scaler on the training split only.
- Used `sklearn.linear_model.LinearRegression` to perform linear regression.
- Printed the RMSE for training and validation data.
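
The RMSE reported in this part is the square root of the mean squared residual. A minimal NumPy sketch of that calculation follows; the helper name `rmse` and the sample values are illustrative and not part of the project code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only (not taken from LR_data.csv).
example = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because RMSE is expressed in the same units as `y`, training and validation values can be compared directly with each other and with the scale of the target.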

## Part 3: Linear Regression with Additional Features
- Expanded the non-standardized inputs with polynomial features up to degree 8 using `sklearn.preprocessing.PolynomialFeatures`, then standardized the expanded features.
- Performed linear regression using `sklearn.linear_model.LinearRegression`.
- Printed the RMSE for training and validation data.
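
To get a feel for how much a degree-8 expansion enlarges the model, the number of generated columns can be counted directly. The sketch below assumes the `PolynomialFeatures` defaults (`include_bias=True`, `interaction_only=False`), under which the count for `n` inputs and degree `d` is the binomial coefficient C(n + d, d). The helper names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

def n_poly_terms(n_features, degree):
    """Number of monomials of total degree <= degree, including the constant."""
    return comb(n_features + degree, degree)

def count_by_enumeration(n_features, degree):
    """Cross-check: enumerate the monomials degree by degree."""
    return sum(
        1
        for d in range(degree + 1)
        for _ in combinations_with_replacement(range(n_features), d)
    )

# 10 input features expanded to degree 8.
n_columns = n_poly_terms(10, 8)  # 43758
```

With tens of thousands of columns, a linear model will often have more free coefficients than a toy dataset has rows, which is consistent with a training RMSE close to zero.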

## Part 4: Linear Regression with Additional Features and Regularization
- Used `sklearn.linear_model.Ridge` to perform linear regression with regularization.
- Applied the model to the processed data from Part 3.
- Swept `alpha` from 1E-2 to 1E10 to find the optimal regularization strength.
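
A sweep from 1E-2 to 1E10 in order-of-magnitude steps contains 13 values. One way to build that grid without typing each number by hand is sketched below; the variable name `alphas` is an assumption for illustration.

```python
# Exponents -2 through 10 inclusive give 13 values, one per order of magnitude.
alphas = [10.0 ** k for k in range(-2, 11)]

for alpha in alphas:
    # Print each value in scientific notation to make gaps or repeats easy to spot.
    print(f"{alpha:.0e}")
```

`np.logspace(-2, 10, num=13)` produces the same grid as a NumPy array.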

## Findings and Results
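- Linear regression (Part 2) set the baseline: an RMSE of about 1.63E7 on the training data and 1.41E7 on the validation data.
- Adding degree-8 polynomial features (Part 3) lowered the validation RMSE to about 1.09E7, but the training RMSE fell to about 1.14E-7. The model fits the training set almost exactly, which is a clear sign of overfitting.
- Ridge regularization (Part 4) penalizes large coefficients to counter that overfitting. At the largest `alpha` in the sweep (1E10), the RMSE was about 2.35E7 on the training data and 1.92E7 on the validation data, both worse than the baseline, which indicates underfitting at that strength.
- The choice of `alpha` therefore matters: the validation RMSE for every value in the sweep is stored in `r_rmse_val`, and the preferred `alpha` is the one that minimizes it.
- Overall, the project shows how additional features and regularization strength together move a linear model between overfitting and underfitting.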

## Gradient Descent with Additional Features and Regularization [2 marks + 1 mark Git submission]

We'll apply linear regression to a toy dataset (`LR_data.csv`), with 10 features `x1`-`x10` and a "measurement" `y`. We'll take a few shortcuts by using built-in sklearn functions.

1. Data Preparation **[0.5]**
  * Print the dataset, and create NumPy arrays with inputs (X) and outputs (y).
  * Split the dataset into training and validation sets (80% training, 20% validation). When splitting, set `random_state=1`.

2. Linear Regression **[0.5]**
  * Standardize the data using `StandardScaler` from sklearn.
  * Use the `sklearn.linear_model.LinearRegression` class [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to perform linear regression.
  * Print the RMSE for training and validation data.

3. Linear Regression with Additional Features **[0.5]**
  * Let's add more features to our dataset (degree 8) using `sklearn.preprocessing.PolynomialFeatures` [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html). You'll want to create the additional features first, then perform standardization (start from non-standardized data).
  * Again, use `sklearn.linear_model.LinearRegression` to perform linear regression.
  * Print the RMSE for training and validation data.

4. Linear Regression with Additional Features and Regularization **[0.5]**
  * Let's switch models, and instead use the `sklearn.linear_model.Ridge` class [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) to perform linear regression with regularization. Apply the model to the processed data (additional features, standardized) you used in 3 above. Use a `for` loop to run `sklearn.linear_model.Ridge` with different `alpha` values. Specifically, sweep `alpha` from 1E-2 to 1E10 (each step is an order of magnitude jump).
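
To see why `alpha` matters, recall that ridge regression minimizes `||y - Xw||^2 + alpha * ||w||^2`, so a larger `alpha` pushes the coefficients toward zero. The sketch below illustrates this with the closed-form solution on synthetic, centered data. It is a simplified illustration without an intercept term, not a description of how `sklearn.linear_model.Ridge` is implemented, and the function name and data are assumptions.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for centered X and y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

# Center both so the intercept can be ignored.
X = X - X.mean(axis=0)
y = y - y.mean()

# Coefficient norm for a weak, a moderate, and a very strong penalty.
norms = [float(np.linalg.norm(ridge_closed_form(X, y, a))) for a in (1e-2, 1e2, 1e6)]
```

The coefficient norm shrinks as `alpha` grows. At very large values the model predicts nearly the mean of `y` for every input, which is the underfitting end of the sweep.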

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the toy dataset: the feature columns come first and the target y is the last column.
df = pd.read_csv("https://raw.githubusercontent.com/aps1070-2019/datasets/master/LR_data.csv", skipinitialspace=True)

In [2]:
# Inputs are every column except the last; the last column is the target.
x_array = df.iloc[:, :-1].to_numpy()
y_array = df.iloc[:, -1].to_numpy()

In [3]:
x_train, x_val, y_train, y_val = train_test_split(x_array, y_array, test_size=0.2, random_state=1)

In [4]:
x = pd.DataFrame(x_train)

In [5]:
# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
std = scaler.fit(x_train)
std_x_train = std.transform(x_train)
std_x_val = std.transform(x_val)
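
As a rough illustration of what the cell above does, standardization subtracts the training mean and divides by the training standard deviation, and the same two statistics are then reused for the validation split. The NumPy sketch below uses synthetic arrays and hypothetical variable names; it mirrors the idea rather than the `StandardScaler` source code.

```python
import numpy as np

# Synthetic stand-ins for a training and a validation split.
rng = np.random.default_rng(1)
train = rng.normal(loc=5.0, scale=3.0, size=(80, 2))
val = rng.normal(loc=5.0, scale=3.0, size=(20, 2))

# Statistics come from the training split only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

std_train = (train - mu) / sigma
std_val = (val - mu) / sigma  # reuses the training statistics
```

The standardized training columns have mean 0 and standard deviation 1 by construction. The validation columns only approximately do, which is expected because no validation information is used to compute the statistics.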


In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linear_reg = LinearRegression().fit(std_x_train, y_train)
y_pred_train = linear_reg.predict(std_x_train)
y_pred_val = linear_reg.predict(std_x_val)

# mean_squared_error expects (y_true, y_pred); RMSE is its square root.
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_val = np.sqrt(mean_squared_error(y_val, y_pred_val))


In [7]:
print('The RMSE (Using simple linear regression) for training set is {} & for validation set is {}'.format(rmse_train,rmse_val))

The RMSE (Using simple linear regression) for training set is 16296980.655667374 & for validation set is 14061578.864980102


In [8]:
from sklearn.preprocessing import PolynomialFeatures

# Expand the raw (non-standardized) inputs to all polynomial terms up to degree 8.
poly = PolynomialFeatures(degree=8)
poly_x_train = poly.fit_transform(x_train)
poly_x_val = poly.transform(x_val)

# Standardize the expanded features using training statistics only.
scaler1 = StandardScaler()
std1 = scaler1.fit(poly_x_train)
std_poly_x_train = std1.transform(poly_x_train)
std_poly_x_val = std1.transform(poly_x_val)

linear_reg1 = LinearRegression().fit(std_poly_x_train, y_train)
poly_y_pred_train = linear_reg1.predict(std_poly_x_train)
poly_y_pred_val = linear_reg1.predict(std_poly_x_val)

poly_rmse_train = np.sqrt(mean_squared_error(y_train, poly_y_pred_train))
poly_rmse_val = np.sqrt(mean_squared_error(y_val, poly_y_pred_val))



In [10]:
print('The RMSE (Using feature mapping on linear regression) for training set is {} & for validation set is {}'.format(poly_rmse_train,poly_rmse_val))

The RMSE (Using feature mapping on linear regression) for training set is 1.1447930035368027e-07 & for validation set is 10920843.908991363
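
The expand-then-standardize-then-fit order used in this part can also be written as a single scikit-learn pipeline, which keeps the steps in sequence and fits each one on the training data only. The sketch below is an alternative formulation on synthetic data with degree 2 (to stay fast), not the code that produced the result above. The data and variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, noise-free quadratic target for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X[:, 0] ** 2 + 3.0 * X[:, 0] * X[:, 1] - X[:, 1]

X_train, X_val = X[:160], X[160:]
y_train, y_val = y[:160], y[160:]

# Polynomial expansion first, then standardization, then the linear model.
model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
```

Because the synthetic target is exactly quadratic, a degree-2 expansion recovers it and the validation RMSE is close to zero. On real data, the gap between training and validation RMSE is the quantity to watch.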


In [9]:
from sklearn.linear_model import Ridge

r_rmse_train = []
r_rmse_val = []
# Sweep alpha from 1E-2 to 1E10, one order of magnitude per step (13 values).
alphas = [10.0 ** k for k in range(-2, 11)]
for alpha in alphas:
  ridge_reg = Ridge(alpha=alpha).fit(std_poly_x_train, y_train)
  ridge_y_pred_train = ridge_reg.predict(std_poly_x_train)
  ridge_y_pred_val = ridge_reg.predict(std_poly_x_val)

  ridge_rmse_train = np.sqrt(mean_squared_error(y_train, ridge_y_pred_train))
  ridge_rmse_val = np.sqrt(mean_squared_error(y_val, ridge_y_pred_val))

  # Store one RMSE pair per alpha, in sweep order.
  r_rmse_train.append(ridge_rmse_train)
  r_rmse_val.append(ridge_rmse_val)


In [14]:
print('The RMSE (Using feature mapping and ridge regression) for training set is {} & for validation set is {}'.format(r_rmse_train[-1],r_rmse_val[-1]))

The RMSE (Using feature mapping and ridge regression) for training set is 23499904.341360077 & for validation set is 19236486.536391094
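
The figures above come from the last entry of each list, which corresponds to the largest `alpha` in the sweep. To report the best setting instead, select the entry with the lowest validation RMSE. The sketch below shows one way to do that; the helper name `best_alpha` is hypothetical, and the RMSE values are placeholders rather than results from this notebook.

```python
def best_alpha(alphas, val_rmse):
    """Return (alpha, rmse) for the entry with the lowest validation RMSE."""
    if len(alphas) != len(val_rmse):
        raise ValueError("alphas and val_rmse must have the same length")
    idx = min(range(len(val_rmse)), key=val_rmse.__getitem__)
    return alphas[idx], val_rmse[idx]

# Placeholder values for illustration only (not results from LR_data.csv).
example_alphas = [10.0 ** k for k in range(-2, 3)]
placeholder_val_rmse = [5.0, 4.2, 3.1, 3.6, 4.9]

alpha_star, rmse_star = best_alpha(example_alphas, placeholder_val_rmse)
```

With the notebook's own data, the second argument would be `r_rmse_val` and the first would be the list of `alpha` values in the order the loop visited them.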
