
#### **Polynomial Regression**

In this assignment, you will implement polynomial regression and apply it to the [Assignment 4 Dataset](https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/mirsazzathossain/CSE317-Lab-Numerical-Methods/blob/main/datasets/data.csv).

The dataset contains two columns: the first is the feature and the second is the label. The goal is to find the best-fit line for the data.

You will need to perform the following regression tasks and find the best one for the dataset.

1.    **Linear Regression:**

     The equation we are trying to fit is:
     $$y = \theta_0 + \theta_1 x$$
     where $x$ is the feature and $y$ is the label.

     We can rewrite the equation in matrix form as:
     $$Y = X\theta$$
     where $X$ is a matrix with two columns (the first is all 1s and the second is the feature), $Y$ is the vector of labels, and $\theta$ is a vector with two elements, $\theta_0$ and $\theta_1$. The $X$ matrix looks like this:
     $$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
2. **Quadratic Regression:**

     The equation we are trying to fit is:
     $$y = \theta_0 + \theta_1 x + \theta_2 x^2$$
     where $x$ is the feature and $y$ is the label.

     We can rewrite the equation in matrix form as:
     $$Y = X\theta$$
     where $X$ is a matrix with three columns (all 1s, the feature, and the feature squared), $Y$ is the vector of labels, and $\theta$ is a vector with three elements, $\theta_0$, $\theta_1$, and $\theta_2$. The $X$ matrix looks like this:
     $$X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}$$
3. **Cubic Regression:**

     The equation we are trying to fit is:
     $$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$$
     where $x$ is the feature and $y$ is the label.

     We can rewrite the equation in matrix form as:
     $$Y = X\theta$$
     where $X$ is a matrix with four columns (all 1s, the feature, the feature squared, and the feature cubed), $Y$ is the vector of labels, and $\theta$ is a vector with four elements, $\theta_0$, $\theta_1$, $\theta_2$, and $\theta_3$. The $X$ matrix looks like this:
     $$X = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 & x_n^3 \end{bmatrix}$$
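
The three $X$ matrices above differ only in how many powers of the feature they contain, so one helper can build all of them. A minimal sketch, assuming a hypothetical helper named `design_matrix` (not part of the assignment) and illustrative input values that are not taken from the dataset:

```python
import numpy as np

def design_matrix(x, degree):
    """Return the matrix whose columns are x**0, x**1, ..., x**degree."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** k for k in range(degree + 1)])

# Illustrative values only, not taken from the dataset.
X = design_matrix([1.0, 2.0, 3.0], degree=2)
print(X)
```

Calling it with `degree=1`, `degree=2`, or `degree=3` gives the linear, quadratic, and cubic matrices respectively.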

Take 15 data points from the dataset and use them as the training set. Use the remaining data points as the test set. For each regression task, find the best $\theta$ vector using the training set. Then, calculate the mean squared error (MSE) on the test set. Plot the training set, the test set (in a different color), and the best fit line for each regression task. Which regression task gives the best fit line? Which regression task gives the lowest MSE on the test set? Report your answers in a Markdown cell.
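
The split-and-evaluate step might be sketched as follows. The assignment does not say how the 15 training points should be chosen, so the shuffled split here is an assumption, as are the helper names `split_train_test` and `mean_squared_error` and the synthetic data standing in for the dataset.

```python
import numpy as np

def split_train_test(x, y, num_train=15, seed=0):
    """Shuffle the indices, then use the first num_train points for training."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    train, test = order[:num_train], order[num_train:]
    return x[train], y[train], x[test], y[test]

def mean_squared_error(predictions, actual):
    """Average of the squared differences between predictions and labels."""
    predictions = np.asarray(predictions, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((predictions - actual) ** 2))

# Synthetic stand-in for the dataset: 20 points on the line y = 1 + 2x.
x = np.arange(20, dtype=float)
y = 1.0 + 2.0 * x
x_train, y_train, x_test, y_test = split_train_test(x, y)

# A model that always predicts 0 is off by exactly y on every test point.
baseline_mse = mean_squared_error(np.zeros_like(y_test), y_test)
print(len(x_train), len(x_test), baseline_mse)
```

Shuffling before splitting guards against a dataset that happens to be sorted by feature value; taking the first 15 rows in file order is an equally valid reading of the task.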

**Note:** Do not use built-in functions that perform polynomial regression, such as `np.polyfit` or `sklearn.linear_model.LinearRegression`. You must implement the regression tasks yourself.
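
One way to implement the fit by hand is the least-squares normal equation, $X^T X \theta = X^T Y$, solved as an ordinary linear system. The sketch below assumes that approach (the assignment does not prescribe one); `fit_polynomial` is a hypothetical name and the data is synthetic.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares polynomial coefficients theta_0, ..., theta_degree."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([x ** k for k in range(degree + 1)])
    # Solve (X^T X) theta = X^T y rather than inverting X^T X explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic check: noiseless points on y = 2 - x + 0.5 x^2.
x = np.linspace(-2.0, 2.0, 9)
y = 2.0 - x + 0.5 * x ** 2
theta = fit_polynomial(x, y, degree=2)
print(np.round(theta, 6))
```

`np.linalg.solve` is a general linear-system solver, not a regression routine, and it tends to be more numerically stable than forming the inverse with `np.linalg.inv`.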

In [1]:
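import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Mount Google Drive when running in Colab; otherwise fall back to a local file.
try:
    from google.colab import drive
    drive.mount('/content/drive')
    data_path = '/content/drive/MyDrive/Colab Notebooks/data.csv'
except ModuleNotFoundError:
    # Assumption: outside Colab, a copy of data.csv sits next to this notebook.
    data_path = 'data.csv'

df = pd.read_csv(data_path)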
#print(df.columns)

features_column_index = 0
labels_column_index = 1
features = df.iloc[:, features_column_index].values
labels = df.iloc[:, labels_column_index].values

num_training_samples = 15
X_train = features[:num_training_samples]
y_train = labels[:num_training_samples]
X_test = features[num_training_samples:]
y_test = labels[num_training_samples:]

# Step 2: fit each model with the normal equation (X^T X) theta = X^T y
def linear_regression(X, y):
    # Columns: 1, x
    X_augmented = np.column_stack((np.ones_like(X), X))
    # Solving the linear system avoids forming an explicit inverse.
    theta = np.linalg.solve(X_augmented.T @ X_augmented, X_augmented.T @ y)
    return theta

def quadratic_regression(X, y):
    # Columns: 1, x, x^2
    X_augmented = np.column_stack((np.ones_like(X), X, X ** 2))
    theta = np.linalg.solve(X_augmented.T @ X_augmented, X_augmented.T @ y)
    return theta

def cubic_regression(X, y):
    # Columns: 1, x, x^2, x^3
    X_augmented = np.column_stack((np.ones_like(X), X, X ** 2, X ** 3))
    theta = np.linalg.solve(X_augmented.T @ X_augmented, X_augmented.T @ y)
    return theta

# Step 3
def calculate_mse(predictions, actual):
    mse = np.mean((predictions - actual) ** 2)
    return mse

# Step 4
def plot_regression_results(X_train, y_train, X_test, y_test, y_pred, label):
    plt.scatter(X_train, y_train, label='Training Data')
    plt.scatter(X_test, y_test, label='Test Data')
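    # Sort by feature value so the fitted curve is drawn left to right.
    order = np.argsort(X_test)
    plt.plot(X_test[order], y_pred[order], label=label, color='red')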
    plt.xlabel('Feature')
    plt.ylabel('Label')
    plt.legend()
    plt.title(label)
    plt.show()


regression_methods = [
    ("Linear Regression", linear_regression),
    ("Quadratic Regression", quadratic_regression),
    ("Cubic Regression", cubic_regression)
]

for label, regression_method in regression_methods:
    theta = regression_method(X_train, y_train)


    if label == "Linear Regression":
        X_pred = np.column_stack((np.ones_like(X_test), X_test))
    elif label == "Quadratic Regression":
        X_pred = np.column_stack((np.ones_like(X_test), X_test, X_test ** 2))
    else:
        X_pred = np.column_stack((np.ones_like(X_test), X_test, X_test ** 2, X_test ** 3))

    y_pred = np.dot(X_pred, theta)
    mse = calculate_mse(y_pred, y_test)
    print(f"{label}: MSE = {mse:.2f}")
    plot_regression_results(X_train, y_train, X_test, y_test, y_pred, label)

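
Because `plt.plot` connects points in the order they are given, a smooth best-fit curve is often drawn from predictions on an evenly spaced grid rather than on the test points themselves. A sketch of that idea, using a hypothetical helper `curve_points` and made-up coefficients rather than fitted ones:

```python
import numpy as np

def curve_points(theta, x_min, x_max, num=200):
    """Evaluate theta_0 + theta_1 x + theta_2 x^2 + ... on an evenly spaced grid."""
    x_grid = np.linspace(x_min, x_max, num)
    y_grid = sum(t * x_grid ** k for k, t in enumerate(theta))
    return x_grid, y_grid

# Made-up coefficients for y = 1 + 2x + 3x^2, for illustration only.
x_grid, y_grid = curve_points([1.0, 2.0, 3.0], x_min=0.0, x_max=2.0, num=5)
print(x_grid, y_grid)
```

The returned arrays could then be passed to `plt.plot(x_grid, y_grid)` alongside the two scatter plots, with `x_min` and `x_max` taken from the range of the feature column.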