# Assignment 2 - Question 2
The objective of this assignment is to familiarize you with the problem of `Linear Regression`.

## Instructions
- Write your code and analysis in the indicated cells.
- Ensure that this notebook runs without errors when the cells are run in sequence.
- Do not attempt to change the contents of other cells.
- Do not use built-in library functions until a cell explicitly permits them.

## Submission
- Ensure that this notebook runs without errors when the cells are run in sequence.
- Rename the notebook to `<roll_number>_Assignment2_Q2.ipynb`.

## 2.0 Background about the dataset

TLDR: You have 4 independent variables (`float`) for each molecule. You can use a linear combination of these 4 independent variables to predict the bandgap (dependent variable) of each molecule.

You can read more about the problem in [Li et al, Bandgap tuning strategy by cations and halide ions of lead halide perovskites learned from machine learning, RSC Adv., 2021,11, 15688-15694](https://doi.org/10.1039/D1RA03117A).
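
The linear combination mentioned in the TLDR above can be sketched as follows. This is only an illustration: the feature values and coefficients are made-up numbers, not values taken from `bg_data.txt` or from the paper, and the names `x`, `w`, and `bandgap_pred` are hypothetical.

```python
import numpy as np

# Hypothetical feature values for one molecule, in the order Cs, FA, Cl, Br.
x = np.array([0.5, 0.25, 0.0, 1.0])

# Hypothetical coefficients: w[0] is the intercept, w[1:] weight the 4 features.
w = np.array([1.5, 0.2, -0.1, 0.4, 0.3])

# Linear combination: intercept plus the weighted sum of the features.
bandgap_pred = w[0] + np.dot(w[1:], x)
print(bandgap_pred)
```

Training a linear regression model amounts to choosing the five numbers in `w` so that predictions like this one are close to the measured bandgaps.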

In [118]:
import csv
import random
import numpy as np

In [119]:
all_molecules = list()

with open('bg_data.txt', 'r') as infile:
    input_rows = csv.DictReader(infile)
    
    for row in input_rows:
        current_mol = ([float(row['Cs']), float(row['FA']), float(row['Cl']), float(row['Br'])], float(row['Bandgap']))
        all_molecules.append(current_mol)

random.shuffle(all_molecules)


num_train = int(len(all_molecules) * 0.8)

# each point in x_train has 4 values - 1 for each feature
x_train = [x[0] for x in all_molecules[:num_train]]
# each point in y_train has 1 value - the bandgap of the molecule
y_train = [x[1] for x in all_molecules[:num_train]]

x_test = [x[0] for x in all_molecules[num_train:]]
y_test = [x[1] for x in all_molecules[num_train:]]

### 2.1 Implement a Linear Regression model that minimizes the MSE **without using any libraries**. You may use NumPy to vectorize your code, but *do not use numpy.polyfit* or anything similar.

2.1.1 Explain how you plan to implement Linear Regression in 5-10 lines.

<!-- your answer to 2.1.1 -->

2.1.2 Implement Linear Regression using `x_train` and `y_train` as the train dataset.

2.1.2.1 Choose the best learning rate and print the learning rate for which you achieved the best MSE.

In [120]:
# implement Linear Regression

# All calculations below are matrix-based: the data points are processed
# together as arrays rather than one at a time.

# Standardize each column to zero mean and unit variance.
# Expects a 2-D np array (rows = samples, columns = features); returns a new array.
def normalize(arr):
    mean = arr.mean(axis=0)
    std = arr.std(axis=0)
    return (arr - mean) / std

# linear model: x is (n, d+1) with a leading column of ones, coeff is (d+1,)
def func(x, coeff):
    return np.matmul(x, np.transpose(coeff))

# cost calculation fn (expects every array to be np array)
# returns half the mean squared error: sum((fx - y)^2) / (2n)
def cost(x, y, coeff):
    fx = func(x, coeff)
    return np.matmul((fx-y), np.transpose(fx-y))/(2*len(y))

# gradient descent fn (also expects np arrays)
# updates coeff in place and returns the cost after each epoch
def gradientDescent(x, y, coeff, learning_rate, num_epochs):
    costs = []
    # prepend a column of ones so coeff[0] acts as the intercept
    x_ = np.hstack((np.ones((x.shape[0],1)), x))
    for i in range(num_epochs):
        fx = func(x_, coeff)
        # gradient of the cost with respect to coeff: X^T (fx - y) / n
        derivative = np.matmul(np.transpose(x_), fx-y)/len(x_)
        coeff -= learning_rate*derivative
        costs.append(cost(x_, y, coeff))
    return costs
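
One way to gain confidence in the gradient used in `gradientDescent` is to compare it against a finite-difference estimate. The sketch below is self-contained and uses random synthetic data rather than the bandgap dataset; the helper names `mse_cost`, `mse_gradient`, and `numerical_gradient` are introduced here for illustration and are not part of the assignment.

```python
import numpy as np

def mse_cost(X, y, w):
    """Half mean squared error: (1 / 2n) * sum((Xw - y)^2)."""
    residual = X @ w - y
    return residual @ residual / (2 * len(y))

def mse_gradient(X, y, w):
    """Analytic gradient of mse_cost with respect to w: X^T (Xw - y) / n."""
    return X.T @ (X @ w - y) / len(y)

def numerical_gradient(X, y, w, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (mse_cost(X, y, w + step) - mse_cost(X, y, w - step)) / (2 * eps)
    return grad

# Synthetic data: a bias column of ones plus 4 random features.
rng = np.random.default_rng(0)
X = np.hstack((np.ones((20, 1)), rng.normal(size=(20, 4))))
y = rng.normal(size=20)
w = rng.normal(size=5)

# The two gradients should agree up to floating-point rounding.
max_abs_diff = np.max(np.abs(mse_gradient(X, y, w) - numerical_gradient(X, y, w)))
print(max_abs_diff)
```

If the two gradients disagree by more than rounding error, the update rule is likely using the wrong sign, the wrong transpose, or the wrong normalization by `n`.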

In [151]:
x_train = np.array(x_train, dtype=np.float32)
y_train = np.array(y_train, dtype=np.float32)
coeff = np.zeros((len(x_train[0]) + 1), dtype=np.float32)

# x_train = normalize(x_train)
# y_train = normalize(y_train.reshape(-1, 1)).ravel()
costs = gradientDescent(x_train, y_train, coeff, learning_rate=.6, num_epochs=1000)

In [152]:
costs

[0.15185848315035225,
 0.09593633041548591,
 0.08110929726809299,
 0.06976576819866738,
 0.060551965877338444,
 0.05301632094831185,
 0.04681650176240702,
 0.04168295056554719,
 0.03740300590520639,
 0.033808699355505574,
 0.030767209994865236,
 0.028173350782171317,
 0.02594368797543315,
 0.024011902918561003,
 0.022325172375061916,
 0.02084130792998218,
 0.01952651547475888,
 0.018353631487701118,
 0.017300743056810758,
 0.016350089374685843,
 0.015487212975883081,
 0.014700280154315279,
 0.013979546752078898,
 0.013316940534100345,
 0.012705731092400023,
 0.012140264091055009,
 0.011615757684760142,
 0.011128135968190598,
 0.010673900130902416,
 0.010250021649304627,
 0.009853861936667402,
 0.009483105082964767,
 0.009135704690935645,
 0.00880983991076661,
 0.00850388381690092,
 0.008216371434935001,
 0.007945979837705166,
 0.007691506562569011,
 0.007451857205379597,
 0.007226029609247539,
 0.0070131028529556065,
 0.006812234463980545,
 0.006622642023794909,
 0.006443605042619004,


2.1.3 Make a [Parity Plot](https://en.wikipedia.org/wiki/Parity_plot) of your model's bandgap predictions on the test set with the actual values.

In [123]:
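# Get the predictions of x_test into `y_pred`
x_test = np.array(x_test, dtype=np.float32)
# prepend the bias column, matching what gradientDescent does for x_train
y_pred = func(np.hstack((np.ones((x_test.shape[0], 1)), x_test)), coeff)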

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,20))

ax.scatter(y_test, y_pred)

lims = [
    np.min([ax.get_xlim(), ax.get_ylim()]),
    np.max([ax.get_xlim(), ax.get_ylim()]),
]
ax.plot(lims, lims, 'k-', alpha=0.75, zorder=0)
ax.set_aspect('equal')
ax.set_xlim(lims)
ax.set_ylim(lims)

ax.set_title('Parity Plot of Custom Linear Regression')
ax.set_xlabel('Ground truth bandgap values')
ax.set_ylabel('Predicted bandgap values')
plt.show()

ModuleNotFoundError: No module named 'matplotlib'

### 2.2 Implement Ridge regression
2.2.1 Explain Ridge regression briefly in 1-2 lines.

<!-- Your answer to 2.2.1 -->

2.2.2 Implement Ridge regression and make a table of different RMSE scores you achieved with different values of alpha. What does the parameter `alpha` do? How does it affect the results here? Explain in 5-10 lines in total. (You can use scikit-learn from this cell onwards)

In [None]:
# you should not have imported sklearn before this point
import sklearn

# implement Ridge regression and make a table where you explore the effect of different values of `alpha`
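
The kind of table asked for in 2.2.2 could look like the sketch below. To keep the example free of extra dependencies it uses a NumPy closed-form ridge solution instead of scikit-learn, and it runs on synthetic data rather than the bandgap dataset; `fit_ridge`, `rmse`, and the alpha grid are assumptions made for illustration. The penalty here applies to the feature weights only, not to the intercept.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge: minimise ||Xw + b - y||^2 + alpha * ||w||^2 (intercept b unpenalised)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centring removes the intercept from the solve
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in for the dataset, split 80/20 like the notebook.
rng = np.random.default_rng(0)
X_all = rng.uniform(0.0, 1.0, size=(100, 4))
y_all = 1.5 + X_all @ np.array([0.2, -0.1, 0.4, 0.3]) + rng.normal(scale=0.05, size=100)
X_tr, X_te, y_tr, y_te = X_all[:80], X_all[80:], y_all[:80], y_all[80:]

# One table row per alpha: test RMSE and the norm of the learned weights.
results = []
print(f"{'alpha':>8} | {'test RMSE':>9} | {'||w||':>7}")
for alpha in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w, b = fit_ridge(X_tr, y_tr, alpha)
    results.append((alpha, rmse(y_te, X_te @ w + b), float(np.linalg.norm(w))))
    print(f"{alpha:>8} | {results[-1][1]:>9.4f} | {results[-1][2]:>7.4f}")
```

The last column shows the shrinkage effect directly: as `alpha` grows, the weight norm falls toward zero and the model approaches predicting the mean bandgap.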

### 2.3 Implement Lasso regression
2.3.1 Explain Lasso regression briefly in 1-2 lines.

2.3.2 Implement Lasso regression and make a table of different RMSE scores you achieved with different values of alpha. What does the parameter `alpha` do? How does it affect the results here? Explain in 5-10 lines in total.

In [None]:
# implement Lasso regression and make a table where you explore the effect of different values of `alpha`
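
A comparable table for 2.3.2 is sketched below. Again, this is not scikit-learn's implementation: it is a small NumPy coordinate-descent solver written for illustration, minimising `(1 / 2n) * ||Xw + b - y||^2 + alpha * ||w||_1` on synthetic data in which the second feature has no effect on the target. `soft_threshold`, `fit_lasso`, and the alpha grid are hypothetical names and values.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def fit_lasso(X, y, alpha, num_sweeps=200):
    """Coordinate descent for (1 / 2n) * ||Xw + b - y||^2 + alpha * ||w||_1."""
    n, d = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centring handles the unpenalised intercept
    col_sq = (Xc ** 2).sum(axis=0) / n
    w = np.zeros(d)
    for _ in range(num_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution added back in.
            partial_residual = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w, y_mean - x_mean @ w

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in for the dataset; the second feature has a true weight of 0.
rng = np.random.default_rng(0)
X_all = rng.uniform(0.0, 1.0, size=(100, 4))
y_all = 1.5 + X_all @ np.array([0.2, 0.0, 0.4, 0.3]) + rng.normal(scale=0.05, size=100)
X_tr, X_te, y_tr, y_te = X_all[:80], X_all[80:], y_all[:80], y_all[80:]

# One table row per alpha: test RMSE and the number of non-zero weights.
lasso_results = []
print(f"{'alpha':>8} | {'test RMSE':>9} | {'non-zero':>8}")
for alpha in [0.0001, 0.001, 0.01, 0.1, 1.0]:
    w, b = fit_lasso(X_tr, y_tr, alpha)
    lasso_results.append((alpha, rmse(y_te, X_te @ w + b), int(np.count_nonzero(w))))
    print(f"{alpha:>8} | {lasso_results[-1][1]:>9.4f} | {lasso_results[-1][2]:>8}")
```

Unlike the L2 penalty in ridge regression, the L1 penalty can set weights to exactly zero, which is why the non-zero count is a useful column to report alongside RMSE.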