<a href="https://colab.research.google.com/github/slyofzero/ML-algorithms-from-SCRATCH/blob/main/Ridge_Regression_from_Scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#AIM-

To create a Machine Learning model that uses Ridge Regression

---

So... Ridge Regression is a modified version of Linear Regression. To learn about Ridge Regression, you have to make sure you understand Linear Regression first. If you don't, then [click here](https://medium.com/@sly.of.zero/linear-regression-101-f4c27fb7a586).

If you don't know what Gradient Descent is, then [click here](https://medium.com/@sly.of.zero/gradient-descent-6a449eae1095).

It is an absolute must that you know both concepts before you proceed with this notebook. Take your time to learn these things, I'll wait for you here.

\
\
\
\
\
$$…$$
\
\
\
\
\
$$…$$
\
\
\
\
\
$$…$$
\
\
\
\
Learned them??

Cool

Let's proceed then.

---

First, let's import the necessary modules.

In [1]:
# Importing the necessary modules.
import numpy as np
import plotly.graph_objects as go

In [2]:
# Creating some linearly related random data.
x = np.array([1, 2, 5, 6, 8, 9, 12, 14])
y = np.array([3, 6, 8, 4, 9, 12, 9, 12])

Let's plot the data we have above.

In [3]:
# Plotting the data.
fig = go.Figure()
fig.add_traces(go.Scatter(x = x, y = y, mode = "markers"))
fig.update_layout(title = "Data")
fig.show()

Now let's try to plot a line of best fit through the data. We could calculate the slope and intercept of this line using **Gradient Descent** coded by hand in Python, but there already exists a Python library that can fit the line for us. So let's use that instead.

I'll be using the `LinearRegression` class from `sklearn.linear_model` to calculate the slope and intercept (betas). It solves the least-squares problem directly rather than running gradient descent, but it minimises the same squared error, so we get the same line.

But to fit x and y into the class, we first need to convert them into 2D arrays. (Strictly, only x has to be 2D, but reshaping y as well keeps the shapes consistent.)

In [4]:
# Reshaping x and y.
x_reshaped = x.reshape(-1, 1)
y_reshaped = y.reshape(-1, 1)

In [5]:
# Calculating the slope and intercepts (betas).
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(x_reshaped, y_reshaped)

slope = linreg.coef_
intercept = linreg.intercept_
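
As a quick sanity check on the fit above, the same slope and intercept can be recovered with NumPy's closed-form least-squares fit. This is only a sketch: `np.polyfit` is not used elsewhere in this notebook, and the names `check_slope` and `check_intercept` are made up for the example.

```python
import numpy as np

# Same data as above.
x = np.array([1, 2, 5, 6, 8, 9, 12, 14])
y = np.array([3, 6, 8, 4, 9, 12, 9, 12])

# A degree-1 polynomial fit is ordinary least squares for a straight line.
# np.polyfit returns the coefficients from the highest degree to the lowest.
check_slope, check_intercept = np.polyfit(x, y, 1)

print(check_slope, check_intercept)
```

If the two approaches agree, `check_slope` and `check_intercept` should match `slope` and `intercept` from the cell above up to floating-point error.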

Now that we have both the slope and the intercept, let's calculate the line of best fit and plot it.

In [6]:
# Calculating the line of best fit.
best_fit = ((slope * x) + intercept)[0]
best_fit

array([ 4.23382226,  4.82830026,  6.61173425,  7.20621225,  8.39516825,
        8.98964625, 10.77308024, 11.96203624])

In [7]:
# Plotting the line of best fit.
fig = go.Figure()
fig.add_traces(go.Scatter(x = x, y = y, mode = "markers", name = "Data"))
fig.add_traces(go.Scatter(x = x, y = best_fit, mode = "lines", name = "Line of Best Fit"))
fig.update_layout(title = "Line of best fit on the data")
fig.show()

The line of best fit in the graph above seems to capture the linear relation pretty nicely because the data it had to work on was of a decent size.

But what would happen if it only had 2 points to work with?

In [8]:
# Getting the first two values from x and y
new_x = x[:2]
new_y = y[:2]

new_x_reshaped = new_x.reshape(-1, 1)
new_y_reshaped = new_y.reshape(-1, 1)

In [9]:
# Creating a line of best fit for the new x and y.
linreg.fit(new_x_reshaped, new_y_reshaped)

new_slope = linreg.coef_
new_intercept = linreg.intercept_

new_best_fit = ((new_slope * new_x) + new_intercept)[0]
new_best_fit

array([3., 6.])

Now let's plot this new line of best fit over the new data.

In [10]:
# Plotting the new line of best fit.
fig = go.Figure()
fig.add_traces(go.Scatter(x = new_x, y = new_y, mode = "markers", name = "New Data"))
fig.add_traces(go.Scatter(x = new_x, y = new_best_fit, mode = "lines", name = "New Line of Best Fit"))
fig.update_layout(title = "New line of best fit on the new data")
fig.show()

This new line of best fit seems to capture the trend between the new x and y values pretty accurately, but let's plot this line over the original data.

In [11]:
# Plotting the new best fit line over the original one.
fig = go.Figure()
fig.add_traces(go.Scatter(
    x = x, 
    y = y,
    mode = "markers",
    name = "Original Data"))

fig.add_traces(go.Scatter(
  x = new_x,
  y = new_y,
  mode = "markers",
  name = "New Data",
  marker = dict(
      line = dict(
          color = "orange",
          width = 5
      )
  )))

fig.add_traces(go.Scatter(
  x = new_x,
  y = new_best_fit,
  mode = "lines",
  name = "New Line of Best Fit"))

fig.add_traces(go.Scatter(
  x = x,
  y = best_fit,
  mode = "lines",
  name = "Line of Best Fit",
  line = dict(
      color = "red"
  )))

fig.update_layout(title = "New line of best fit vs Original line of best fit")

fig.show()

As you can see here, the new line of best fit doesn't overlap with the original line of best fit.

This means that if we only have a small amount of data, the line of best fit we get may not capture the actual trend of the data correctly.

In [12]:
# Plotting the new line of best fit over the original data.
new_best_fit = ((new_slope * x) + new_intercept)[0]

fig = go.Figure()
fig.add_traces(go.Scatter(
    x = x, 
    y = y,
    mode = "markers",
    name = "Original Data"))

fig.add_traces(go.Scatter(
    x = x,
    y = new_best_fit,
    mode = "lines",
    name = "New line of best fit"
))

fig.show()

Yeahhhhhhh...this new line of best fit sucks.
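
To put a number on how badly it does, here is a small sketch that compares the mean squared error of both lines on the whole data. The helper `mse` and the `sub_*` / `full_*` names are made up for this example, and the full-data line is refitted with `np.polyfit` so the block runs on its own.

```python
import numpy as np

x = np.array([1, 2, 5, 6, 8, 9, 12, 14])
y = np.array([3, 6, 8, 4, 9, 12, 9, 12])

def mse(slope, intercept):
    """Mean squared error of a line, measured on the whole data."""
    return np.mean((y - (slope * x + intercept)) ** 2)

# Line fitted on the whole data (closed-form least squares).
full_slope, full_intercept = np.polyfit(x, y, 1)

# Line through the first two points only: (1, 3) and (2, 6).
sub_slope = (y[1] - y[0]) / (x[1] - x[0])
sub_intercept = y[0] - sub_slope * x[0]

mse_full = mse(full_slope, full_intercept)
mse_subset = mse(sub_slope, sub_intercept)
print(mse_full, mse_subset)
```

The two-point line fits its own two points perfectly, but its error on the rest of the data is far larger than that of the line fitted on everything.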

But how can we create a line of best fit for the whole data, using only a small subset of the data???

---

##**Ridge Regression**

Ridge Regression is a modified version of Linear Regression which can help us get a more reliable line of best fit when we only have a small subset of the original data.

This is very useful in cases where we don't have a lot of data. If we use normal methods to get the Line of Best Fit for this data, we might end up with a line that won't predict well for future data, like in the case above.

###***Cost Function***

If you remember, in Gradient Descent we use **MSE** as the cost function.

$$J(\beta_0, \beta_1) = \frac{\sum_{i = 1}^{n}(y_{i} - \beta_{0} - \beta_{1} x_{i})^2}{n}$$
\
For highest possible accuracy we want to minimize the cost function *$J(\beta_0, \beta_1)$*.

$$J(\beta_0, \beta_1) ≈ 0$$

\

For Ridge Regression, we'll change this formula a little.

$$J(\beta_0, \beta_1) = \frac{\sum_{i = 1}^{n}(y_{i} - \beta_{0} - \beta_{1} x_{i})^2}{n} + \lambda(\beta_1^2)$$

\

The formula above is for Simple Linear Regression.

If we were dealing with Multiple Linear Regression with $p$ features, where $x_{ij}$ is the value of feature $j$ for data point $i$, the cost function would be,

$$J(\beta_0, \beta_1, ...,\beta_p) = \frac{\sum_{i = 1}^{n}(y_{i} - \beta_{0} - \beta_{1} x_{i1} - ... - \beta_p x_{ip})^2}{n} + \lambda(\beta_1^2 + \beta_2^2 + ... + \beta_p^2)$$
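
The cost function above can be sketched in a few lines of NumPy. The function name `ridge_cost`, its arguments, and the two sets of betas tried below are assumptions made for this example; they are not part of the notebook's own code.

```python
import numpy as np

def ridge_cost(X, y, beta_0, betas, lam):
    """MSE plus lam times the sum of squared slopes (the intercept is not penalised)."""
    X = np.asarray(X, dtype=float)
    betas = np.asarray(betas, dtype=float)
    y_pred = beta_0 + X @ betas
    mse = np.mean((y - y_pred) ** 2)
    penalty = lam * np.sum(betas ** 2)
    return mse + penalty

# One feature column, using the first two points of the data: (1, 3) and (2, 6).
X = np.array([[1.0], [2.0]])
y = np.array([3.0, 6.0])

# A perfect fit with a steep slope vs. a worse fit with a gentle slope.
cost_steep = ridge_cost(X, y, beta_0=0.0, betas=[3.0], lam=1.0)
cost_flat = ridge_cost(X, y, beta_0=3.6, betas=[0.6], lam=1.0)
print(cost_steep, cost_flat)
```

Even though the steep line has zero error on these two points, its cost is higher once the squared slope is added in.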

\

### ***Penalisation***

This extra term, $\lambda(\beta_1^2)$, that has been added to the Cost Function for Gradient Descent is called **penalisation**.

Here $\lambda$ is called the **penalisation factor**. If the value for $\lambda$ is set to a very large number like 100000, then the slope of the best fit line would be very close to 0. Not exactly zero, but very close to it.

This new term penalises large slope values by giving them a high Cost Function value. This is done because large slope values can be a sign of overfitting.

For large slope values, $\beta_1$ would be a large number, so the whole term $\lambda(\beta_1^2)$ would be large as well, which would in turn increase the Cost Function. Meaning, for large values of $\beta_1$ our Cost Function won't be minimized.
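
To see the effect of $\lambda$ on the slope, here is a small sketch. It assumes the closed-form minimiser of the simple-regression cost function, $\beta_1 = \mathrm{cov}(x, y) / (\mathrm{var}(x) + \lambda)$, which comes from setting both partial derivatives to zero. The helper name `ridge_slope_closed_form` is made up for this example.

```python
import numpy as np

def ridge_slope_closed_form(x, y, lam):
    """Slope that minimises MSE + lam * slope**2 for a single feature."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    # Covariance and variance divided by n, matching the 1/n in the MSE.
    return np.mean(x_c * y_c) / (np.mean(x_c ** 2) + lam)

# The first two points of the data: (1, 3) and (2, 6).
x = np.array([1.0, 2.0])
y = np.array([3.0, 6.0])

for lam in [0, 1, 100000]:
    print(lam, ridge_slope_closed_form(x, y, lam))
```

With $\lambda = 0$ this is just the ordinary slope through the two points, and as $\lambda$ grows the slope shrinks towards zero without ever reaching it.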

We can write Ridge Regression in code by using the exact same Gradient Descent procedure and replacing the old Cost Function with the new one.
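
As a sketch of what changes inside Gradient Descent, here are the partial derivatives of the penalised cost function, writing $y_i$ for the observed values, $x_i$ for the inputs and $\alpha$ for the learning rate (the symbol $\alpha$ is an assumption, the notebook's code calls it `learning_rate`).

$$\frac{\partial J}{\partial \beta_0} = -\frac{2}{n}\sum_{i = 1}^{n}(y_{i} - \beta_{0} - \beta_{1} x_{i})$$

$$\frac{\partial J}{\partial \beta_1} = -\frac{2}{n}\sum_{i = 1}^{n}x_{i}(y_{i} - \beta_{0} - \beta_{1} x_{i}) + 2\lambda\beta_1$$

$$\beta_0 \leftarrow \beta_0 - \alpha\frac{\partial J}{\partial \beta_0}, \qquad \beta_1 \leftarrow \beta_1 - \alpha\frac{\partial J}{\partial \beta_1}$$

The penalty does not contain $\beta_0$, so only the derivative for the slope picks up the extra $2\lambda\beta_1$ term.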

For this example let's set the penalisation factor to 1.

In [13]:
# Creating the Ridge Regression function.
def ridge_regression(x, y, learning_rate = 0.001, iterations = 10000, pennalise_factor = 1):
  beta_0 = beta_1 = 0.0
  n = len(x)

  for i in range(iterations):
    y_pred = beta_0 + (beta_1 * x)

    # Gradients of the ridge cost. The penalty only depends on beta_1,
    # so it adds 2 * pennalise_factor * beta_1 to the slope's gradient.
    beta_0_d = (-2/n) * np.sum(y - y_pred)
    beta_1_d = (-2/n) * np.sum(x * (y - y_pred)) + (2 * pennalise_factor * beta_1)

    beta_0 = beta_0 - (beta_0_d * learning_rate)
    beta_1 = beta_1 - (beta_1_d * learning_rate)

  # Cost (MSE + penalty) at the final betas.
  y_pred = beta_0 + (beta_1 * x)
  gradient_cost = np.mean((y - y_pred) ** 2)
  pennalisation = pennalise_factor * (beta_1 ** 2)
  cost = round(gradient_cost + pennalisation, 5)

  return (round(beta_0, 2), round(beta_1, 2), cost)

In [14]:
# Fitting Ridge Regression on the two-point subset and getting the beta values.
ridge_betas = ridge_regression(new_x, new_y)

ridge_intercept = ridge_betas[0]
ridge_slope = ridge_betas[1]

# Evaluating the ridge line at every original x so it can be plotted over the whole data.
ridge_best_fit = (ridge_slope * x) + ridge_intercept
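
One way to check that a run like this has actually settled on the minimum is to evaluate the gradient of the penalised cost at the final betas, since it should be very close to zero there. The sketch below does that on the two-point subset. The helper `ridge_gradient`, the `sub_*` names and the learning rate of 0.01 are assumptions picked for this example so that the run converges fully.

```python
import numpy as np

def ridge_gradient(x, y, beta_0, beta_1, lam):
    """Partial derivatives of MSE + lam * beta_1**2 (hypothetical helper)."""
    residual = y - (beta_0 + beta_1 * x)
    n = len(x)
    beta_0_d = (-2 / n) * np.sum(residual)
    beta_1_d = (-2 / n) * np.sum(x * residual) + 2 * lam * beta_1
    return beta_0_d, beta_1_d

# The first two points of the data: (1, 3) and (2, 6).
sub_x = np.array([1.0, 2.0])
sub_y = np.array([3.0, 6.0])
lam = 1

# Plain gradient descent using the penalised gradient.
beta_0 = beta_1 = 0.0
for _ in range(10000):
    beta_0_d, beta_1_d = ridge_gradient(sub_x, sub_y, beta_0, beta_1, lam)
    beta_0 -= 0.01 * beta_0_d
    beta_1 -= 0.01 * beta_1_d

# Both components should be (almost) zero at the minimum.
final_gradient = ridge_gradient(sub_x, sub_y, beta_0, beta_1, lam)
print(beta_0, beta_1, final_gradient)
```

If the gradient is still noticeably away from zero, the run needs more iterations or a larger learning rate.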

In [15]:
# Plotting the Ridge Regression line against the plain best fit line from the subset of the data.
fig = go.Figure()
fig.add_traces(go.Scatter(x = new_x, y = new_y, mode = "markers", name = "New Data"))
fig.add_traces(go.Scatter(x = x, y = new_best_fit, mode = "lines", name = "New Line of Best Fit"))
fig.add_traces(go.Scatter(x = x, y = ridge_best_fit, mode = "lines", name = "Ridge Line of Best Fit"))
fig.update_layout(title = "Ridge line of best fit vs new line of best fit")
fig.show()

Let's plot this line of best fit obtained through Ridge Regression on the whole data.

In [16]:
# Plotting the ridge line of best fit over the original data.
new_best_fit = ((new_slope * x) + new_intercept)[0]

fig = go.Figure()
fig.add_traces(go.Scatter(
    x = x, 
    y = y,
    mode = "markers",
    name = "Original Data"))

fig.add_traces(go.Scatter(
    x = x,
    y = ridge_best_fit,
    mode = "lines",
    name = "Ridge Regression line of best fit"
))

fig.add_traces(go.Scatter(
  x = x,
  y = best_fit,
  mode = "lines",
  name = "Line of Best Fit"))

fig.show()

As you can see, this new line of best fit we got using the subset of the data is very close to the line we got by using Linear Regression on the whole data.
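
For a cross-check with a library implementation, here is a sketch using scikit-learn's `Ridge` on the two-point subset, compared against an ordinary least-squares line on the whole data. One assumption to be aware of: `Ridge` adds its penalty to the sum of squared errors rather than the mean, so `alpha` is set to `n * lambda` here to match the cost function used in this notebook. The `sub_*` and `full_*` names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

x = np.array([1, 2, 5, 6, 8, 9, 12, 14])
y = np.array([3, 6, 8, 4, 9, 12, 9, 12])
sub_x, sub_y = x[:2], y[:2]

lam = 1

# alpha = n * lam converts the notebook's MSE-based penalty
# into Ridge's sum-of-squares-based penalty.
ridge = Ridge(alpha = lam * len(sub_x))
ridge.fit(sub_x.reshape(-1, 1), sub_y)

ridge_slope, ridge_intercept = ridge.coef_[0], ridge.intercept_

# Ordinary least squares on the whole data, for comparison.
full_slope, full_intercept = np.polyfit(x, y, 1)

print(ridge_slope, ridge_intercept)
print(full_slope, full_intercept)
```

On this particular data the two lines come out close to each other, but that depends on the choice of $\lambda$, so it is worth trying a few values.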

---

Yeah so...that's Ridge Regression in a nutshell. It isn't difficult if you already understand Gradient Descent, but if you still had some difficulties, don't worry. Everyone learns at a different rate. Go through the notebook once more and I am sure you'll understand it!

##END OF THE NOTEBOOK.