# <center> Data Science 1 - Final Assignment </center>
<center>Created by Zsófia Rebeka Katona</center>

---

The goal of this assignment is to explore the concepts of ridge regression and principal component analysis (PCA) in the context of predictive modeling. We'll examine two exercises:

1. Ridge Regression Analysis:

- We begin by considering a simple predictive model where the response variable is predicted using only a constant term. We then introduce ridge regression, which is a regularized version of linear regression, and compare its performance with ordinary least squares (OLS) regression.
- We generate a sample dataset with a known true parameter and noise distribution. Using this dataset, we compute ridge regression estimates for various values of the regularization parameter lambda (λ) and analyze the bias, variance, and mean squared error (MSE) of these estimates.
- Finally, we plot the bias, variance, and MSE as functions of lambda to interpret the results and determine whether ridge regression provides better predictions than OLS regression.
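
The quantities in this exercise have closed forms that are useful for checking the simulation. The sketch below assumes that the ridge penalty is applied to the constant itself, and writes $\beta_0$ for the true parameter, $\sigma^2$ for the noise variance and $n$ for the sample size; this notation is an assumption made for illustration.

$$
\hat{\beta}_\lambda = \arg\min_{b} \sum_{i=1}^{n} (y_i - b)^2 + \lambda b^2 = \frac{\sum_{i=1}^{n} y_i}{n + \lambda} = \frac{n}{n + \lambda}\,\bar{y}
$$

$$
\text{Bias}(\hat{\beta}_\lambda) = -\frac{\lambda \beta_0}{n + \lambda}, \qquad
\text{Var}(\hat{\beta}_\lambda) = \frac{n \sigma^2}{(n + \lambda)^2}, \qquad
\text{MSE}(\hat{\beta}_\lambda) = \frac{\lambda^2 \beta_0^2 + n \sigma^2}{(n + \lambda)^2}
$$

Setting $\lambda = 0$ recovers the OLS estimate $\bar{y}$. Under these assumptions the MSE is minimized at $\lambda = \sigma^2 / \beta_0^2$, so a small positive $\lambda$ gives a lower MSE than OLS, while a large $\lambda$ gives a higher one because the squared bias dominates.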

In [2]:
# Importing the required libraries
import numpy as np
import pandas as pd

from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

## 1. Ridge Regression Analysis

In [4]:
# Setting the random seed
np.random.seed(20240315)

# Settings
n = 100
sigma = 1
beta_zero = 2
epsilon = np.random.normal(loc = 0, scale = sigma, size = n)
y = beta_zero + epsilon

# OLS estimate: with only a constant, the OLS estimator is the sample mean
beta_ols = np.mean(y)

# Setting the evaluation parameters
x_to_evaluate = np.array([[0, 0]])
alphas_to_try = np.arange(0.01, 0.5, 0.02)

results = np.empty(len(alphas_to_try))


In [5]:
# Lasso simulation
import numpy as np
from sklearn.linear_model import Lasso

def f(X): 
    # f(X) = x1 + x2
    # X[:, 0] takes every row of column 0; X[:, 1] takes every row of column 1
    return X[:, 0] + X[:, 1]


n = 20
R = 1000
# The double brackets [[...]] create a 2D array with a single row,
# so we evaluate the prediction at the point (0, 0)
x_to_evaluate = np.array([[0, 0]])
alphas_to_try = np.arange(0.01, 0.5, 0.02)

# Monte Carlo
# The results array has R rows (one per replication) and one column per alpha
results = np.empty((R, len(alphas_to_try)))
for r in range(R):

    # Generate data
    X1, X2 = [np.random.uniform(0, 1, n) for _ in range(2)]
    X = np.column_stack((X1, X2))
    # The noise has standard deviation 2, so its variance is 4
    epsilon = np.random.normal(0, 2, n)
    Y = f(X) + epsilon

    # Estimate LASSO for each alpha and store the prediction at x_to_evaluate
    for id_a, a in enumerate(alphas_to_try):
        lasso = Lasso(alpha = a).fit(X, Y)
        pred = lasso.predict(x_to_evaluate)[0]
        results[r, id_a] = pred

# Compute bias, variance and MSE of the predictions at x_to_evaluate
bias = np.mean(results - f(x_to_evaluate), axis = 0) 
variance = np.var(results, axis = 0)
mse = np.mean(((results - f(x_to_evaluate))**2), axis = 0)
bias, variance, mse

(array([0.00634562, 0.13753596, 0.26143482, 0.37223659, 0.47357481,
        0.56313742, 0.64057492, 0.70877762, 0.76713541, 0.8162697 ,
        0.85749461, 0.88947128, 0.91502421, 0.93528656, 0.95058186,
        0.96239506, 0.97139741, 0.97774684, 0.98221055, 0.98551271,
        0.98760404, 0.98894507, 0.98987615, 0.99049462, 0.99060436]),
 array([1.5960561 , 1.30833153, 1.07833484, 0.8904092 , 0.73866206,
        0.61573652, 0.5179621 , 0.44139926, 0.38029965, 0.33219423,
        0.29597748, 0.26828762, 0.24815833, 0.23350876, 0.22240506,
        0.21489912, 0.20967495, 0.20580029, 0.2032034 , 0.20126642,
        0.20039972, 0.20003824, 0.19988344, 0.20002622, 0.20002846]),
 array([1.59609636, 1.32724767, 1.146683  , 1.02896928, 0.96293516,
        0.93286027, 0.92829833, 0.94376497, 0.96879638, 0.99849045,
        1.03127448, 1.05944678, 1.08542763, 1.10826971, 1.12601093,
        1.14110338, 1.15328788, 1.16178917, 1.16794097, 1.17250171,
        1.17576147, 1.17805059, 1.17973823, 

In [None]:
R = 1000  # Repeat part b) 1000 times (the R mentioned in 1. c))

## 2. Linear regression

Suppose we estimate the regression coefficients in a linear regression model by minimizing the function provided for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

- (a) As we increase s from 0, the training RSS will:
    - i. Increase initially, and then eventually start decreasing in an inverted U shape.
    - ii. Decrease initially, and then eventually start increasing in a U shape.
    - iii. Steadily increase.
    - iv. Steadily decrease.
    - v. Remain constant.
- (b) Repeat (a) for test RSS.
- (c) Repeat (a) for variance.
- (d) Repeat (a) for (squared) bias.
- (e) Repeat (a) for the irreducible error.

---

Objective:

The exercise aims to understand how the training residual sum of squares (RSS), test RSS, variance, squared bias, and irreducible error change as we vary the constraint parameter s in a regularized linear regression model.

Linear Regression with Regularization:

In a linear regression model, we choose the coefficients that minimize the RSS, the sum of squared differences between the observed and predicted values of the response variable. In this exercise, we additionally constrain the magnitude of the regression coefficients (β): the sum of their absolute values may not exceed a threshold s.

Effect of Increasing s:

As s increases from 0, the constraint becomes less restrictive, allowing larger coefficient values.
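
Written out, the constrained problem described above would look like the following. This is a sketch consistent with the verbal description, assuming $p$ predictors $x_{ij}$ and an unpenalized intercept $\beta_0$; the exact form of the function given in the exercise is not reproduced here.

$$
\min_{\beta_0, \beta_1, \dots, \beta_p} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s
$$

At $s = 0$ every $\beta_j$ must be zero, and once $s$ is at least as large as the sum of absolute values of the OLS coefficients, the constraint no longer binds and the solution is the OLS fit.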

Let's analyze the expected behavior for each of the metrics:

(a) Training RSS: iv. Steadily decrease. At s = 0 the coefficients are forced to zero, so the model is heavily regularized and the training RSS is at its highest. As s increases, the set of admissible coefficients grows, so the minimized training RSS can only fall or stay the same. Once s is large enough that the constraint no longer binds, the solution coincides with the OLS fit and the training RSS stays at its minimum. Overfitting does not raise the training RSS; it shows up in the test RSS.

(b) Test RSS: ii. Decrease initially, and then eventually start increasing in a U shape. With a small s the model underfits and generalizes poorly, so the test RSS is high. As s increases, the fall in bias outweighs the rise in variance and the test RSS decreases. If s becomes too large, the model starts to overfit, the rise in variance dominates, and the test RSS increases again.

(c) Variance: iii. Steadily increase. As s increases, the model becomes more flexible and its coefficient estimates depend more heavily on the particular training sample, so the variance grows.

(d) Squared Bias: iv. Steadily decrease. As s increases, the model becomes more flexible and can capture more of the true relationship in the data, so the bias shrinks towards the bias of the OLS fit.

(e) Irreducible Error: v. Remain constant. The irreducible error does not depend on s because it represents the inherent noise in the data that cannot be reduced by any model.

Conclusion:

In summary, as we increase s from 0:

- Training RSS steadily decreases until the constraint stops binding.
- Test RSS first decreases and then increases, giving a U shape.
- Variance steadily increases.
- Squared bias steadily decreases.
- Irreducible error remains constant.

This exercise helps in understanding the trade-offs involved in choosing the appropriate level of regularization in a linear regression model and its impact on various aspects of model performance.
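
The behavior of the training and test RSS can be checked numerically. The sketch below is only an illustration on simulated data: the data-generating process, sample sizes and alpha grid are all hypothetical, and it uses scikit-learn's penalized form of the lasso, where a smaller `alpha` corresponds to a larger budget s.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical data: 10 predictors, only the first three matter
n_train, n_test, p = 50, 1000, 10
beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))

def simulate(n):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(scale=2.0, size=n)
    return X, y

X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

# Decreasing alpha corresponds to an increasing budget s
alphas = np.logspace(1, -3, 20)
budget, train_rss, test_rss = [], [], []
for a in alphas:
    model = Lasso(alpha=a, max_iter=100_000, tol=1e-8).fit(X_train, y_train)
    budget.append(np.abs(model.coef_).sum())  # implied s = sum of |beta_j|
    train_rss.append(np.sum((y_train - model.predict(X_train)) ** 2))
    test_rss.append(np.sum((y_test - model.predict(X_test)) ** 2))

budget, train_rss, test_rss = map(np.array, (budget, train_rss, test_rss))
```

Plotting `train_rss` and `test_rss` against `budget` gives a picture of parts (a) and (b); how pronounced the U shape of the test RSS is depends on the simulated data.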

## 3. Principal Component Analysis (PCA):

- We then move on to the second exercise, which involves a dense regression model with 50 correlated predictors. Using PCA, we compute the principal components of the predictors and their corresponding scores.
- We estimate OLS regression models using both the original predictors and the principal components on a training sample. Then, we use these models to predict the outcomes in a test sample and compute the mean squared prediction error (MSPE) for comparison.
- Finally, we discuss and explain the patterns observed in the MSPE for different sample sizes in comparison with a reference table provided in lecture slides.
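
The steps above could be sketched as follows. This is a minimal illustration under stated assumptions, not the assignment's specification: the equicorrelated design with correlation `rho`, the coefficient vector `beta`, the sample sizes and the number of components `k` are all hypothetical choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

p, n_train, n_test, k = 50, 200, 1000, 5   # all hypothetical settings
rho = 0.9                                   # assumed common pairwise correlation

# Equicorrelated predictors: unit variances, correlation rho off the diagonal
cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)
beta = np.full(p, 0.1)                      # dense model: every coefficient is nonzero

def simulate(n):
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + rng.normal(size=n)
    return X, y

X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

# OLS on the original 50 predictors
ols = LinearRegression().fit(X_train, y_train)
mspe_ols = np.mean((y_test - ols.predict(X_test)) ** 2)

# OLS on the scores of the first k principal components
pca = PCA(n_components=k).fit(X_train)
pcr = LinearRegression().fit(pca.transform(X_train), y_train)
mspe_pcr = np.mean((y_test - pcr.predict(pca.transform(X_test))) ** 2)
```

Repeating this for several values of `n_train` gives the MSPE comparison across sample sizes. The principal components are fitted on the training sample only and then applied to the test sample, so the test data do not influence the estimated components.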