### Understanding the theory

Linear regression is a basic technique frequently used in predictive analysis. It attempts to predict the output (Y) given a set of features/inputs (X). **It assumes a linear relationship between the inputs and output.** This essentially means choosing a linear function to model the relationship between the inputs and the output.
![Multivariate linear regression](https://miro.medium.com/max/1120/0*AqzOn7p--nveVULA.png "title")*This image shows an optimal plane found in multivariate linear regression. Here we try to find a plane instead of a line, since our input is not a single variable.*

*The univariate linear regression equation:*
$$
y = bx + c
$$

*The multivariate linear regression equation:*
$$
y = WX^{T} + c
$$

The first equation is clearly a secondary school level linear equation.
The second equation is slightly more complex as it is presented in a vectorised form. The key is simply to understand that *X* is a **vector**, so *W* must be a vector of the same size:

$$
X\in \mathbb{R}^{n} \implies W\in \mathbb{R}^{n}
$$


We have to train the model by supplying data to the machine beforehand, so that it produces a set of coefficients *W* and *c*. When we then feed a new example of data, *X*, into the machine, our prediction, *y*, should be relatively accurate.

$$
y_{new} = w_{1}x_{1} + w_{2}x_{2} + \dots + w_{n}x_{n} + c
$$

and <center>$y_{new}$ is sufficiently close to $y_{actual}$</center>


We can then do a neat little trick here: $c$, being our bias term, can be seen as $w_{0}x_{0}$, where $x_{0} = 1$ and $w_{0} = c$, and it will similarly give us a constant!

We thus can simplify our equation to:

$$
y = \sum_{i=0}^{n}w_{i}x_{i}
$$
or 

$$
y = WX^{T}
$$

Do note that vector $X$ represents a single example, so with $x_{0}$ included it is an $(n+1) \times 1$ vector.
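To convince ourselves that folding the bias into the weights changes nothing, here is a minimal sketch with made-up numbers. The values of `w`, `c` and `x` are purely illustrative assumptions and are not taken from the dataset.

```python
import numpy as np

# Hypothetical weights, bias and a single example (illustrative values only).
w = np.array([2.0, -1.0, 0.5])
c = 4.0
x = np.array([1.0, 3.0, 2.0])

# Explicit form: weighted sum of the features plus the bias term.
y_explicit = np.dot(w, x) + c

# Bias trick: prepend x_0 = 1 to the example and w_0 = c to the weights.
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([c], w))
y_folded = np.dot(w_aug, x_aug)

print(y_explicit, y_folded)  # both forms give the same prediction
```

This is the same idea as adding a column of ones to the data matrix before training.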

<img src="https://www.androidcentral.com/sites/androidcentral.com/files/styles/xlarge/public/article_images/2020/10/genshin-impact-bennett_2.png" width="125"/>
<center>Nice!</center>

### Putting it into practice

#### Data pre-processing

We will take on the task of **predicting the salinity of water** using data collected from CalCOFI. The description of the dataset is as follows:

"
The CalCOFI data set represents the longest (1949-present) and most complete (more than 50,000 sampling stations) time series of oceanographic and larval fish data in the world. It includes abundance data on the larvae of over 250 species of fish; larval length frequency data and egg abundance data on key commercial species; and oceanographic and plankton data. The physical, chemical, and biological data collected at regular time and space intervals quickly became valuable for documenting climatic cycles in the California Current and a range of biological responses to them. CalCOFI research drew world attention to the biological response to the dramatic Pacific-warming event in 1957-58 and introduced the term “El Niño” into the scientific literature.
"

More information about this dataset can be found [here](https://new.data.calcofi.org/index.php/database/calcofi-database/bottle-field-descriptions).

In this task, I will first read the data into a pandas DataFrame, remove the redundant variables, and then use the remaining ones to perform a multivariate regression analysis. There will be little to no elaboration from this point on, and any comments crucial for understanding will be within the code.

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation

In [None]:
data = pd.read_csv("../input/calcofi/bottle.csv")
data.info()

In [None]:
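#just some typical data cleaning
# keep only the columns that have at least 30% non-null values
# (recent pandas versions do not accept `how` and `thresh` together)
data.dropna(axis=1, thresh=int(data.shape[0]*0.3), inplace=True)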
data.drop(['Cst_Cnt','Btl_Cnt','Sta_ID','Depth_ID'],axis = 1, inplace = True)
data = data.loc[:,data.nunique()>100]
data = data[data['Salnty'].notna()]
data

In [None]:
data.corrwith(data['Salnty'])

I have to comment that after I found the correlation with salinity, I realised that most of the `R_`-prefixed columns are actually repeats of the variables above! That is why they have incredibly similar correlations. I therefore had to remove these columns manually, and finally we have our set of input data X.
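If you would rather not spot the repeated columns by eye, one possible approach is to look for pairs of columns that are almost perfectly correlated with each other. The sketch below uses a small synthetic DataFrame, so the column names (`temp`, `temp_repeat`, `oxygen`) and the `threshold` value are assumptions for illustration, not the columns or settings of this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.normal(15.0, 3.0, size=200)

# Toy data: "temp_repeat" is a near-copy of "temp", "oxygen" is unrelated.
toy = pd.DataFrame({
    "temp": temp,
    "temp_repeat": temp + rng.normal(0.0, 0.001, size=200),
    "oxygen": rng.normal(5.0, 1.0, size=200),
})

# Absolute pairwise correlations between all columns.
corr = toy.corr().abs()

# Assumed cut-off: pairs above this are treated as repeats of each other.
threshold = 0.999
cols = list(corr.columns)
near_duplicates = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > threshold
]
print(near_duplicates)
```

A high correlation alone does not prove two columns are the same measurement, so it is still worth checking the field descriptions before dropping anything.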

In [None]:
x = data.drop(['Depthm','Salnty','NO2uM','R_Depth','R_SALINITY','R_TEMP','R_POTEMP',
              'R_SIGMA','R_O2','R_O2Sat','R_PO4','R_NO3','R_NO2','R_SIO3'], axis = 1)

In [None]:
x.info()

Now we see a much smaller set of variables, and these are the ones that we will use for our model :)
But first, we will need to do away with the NULL values.
Here, we have a few choices:
1. Remove the rows with missing values - in our case this would work just fine since we have so much data!
2. Replace them with the mean, median, mode or similar
3. Predict them using a linear model built on the other variables, and then use the filled-in values for training.

In this case, I will just take the easy way out and go with no. 2. I am not a big fan of losing data, despite us having a huge dataset to start with :D
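To make the first two choices concrete, here is a small sketch on a toy DataFrame. The columns `a` and `b` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 50.0],
})

# Choice 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Choice 2: fill each column's gaps with that column's mean or median.
mean_filled = toy.fillna(toy.mean())
median_filled = toy.fillna(toy.median())

print(len(toy), len(dropped))
print(mean_filled)
print(median_filled)
```

Dropping rows loses half of this toy table, while filling keeps every row at the cost of inventing values, which is the trade-off described above.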

In [None]:
x.fillna(value = x.mean(), inplace=True)

In [None]:
x.info()
#Looks good!

### Linear regression from scratch

It is time for the real thing! We will now move our data to numpy, in order to really implement everything ourselves and I promise you we will only see numpy and matplotlib from now on.

Let us break down the individual steps in linear regression:

1. Initialise a model
2. Make predictions of the training data with our model
3. Find out the loss between our prediction of the salinity and the actual salinity
4. Carry out gradient descent
5. Update the weights

#### Gradient descent

Now would be a great time to talk about gradient descent. Let's see an example of a typical loss function, the mean squared error (scaled by $\frac{1}{2}$ so that the derivative comes out cleaner). It is given by:

$$
L(Y_{pred},Y_{actual}) = \frac{1}{2m}\sum_{i=1}^{m}{(y^{(i)}_{pred}-y^{(i)}_{actual})^2}
$$

If you recall, our prediction is based on the equation $Y=WX^{T}$, so if we substitute this into the equation above and convert it to a vectorised form, we get:

$$
L(X,W,Y_{actual}) = \frac{1}{2m}\sum((XW^{T})-Y_{actual})^2
$$

$X$ now represents a matrix, and each row of $X$ represents a vector $X^{(i)}$, which is the $i^{th}$ example in our data.
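As a quick sanity check, the per-example sum and the vectorised form should give the same number. This is a minimal sketch on random data, assuming $W$ is stored as a row vector so that $XW^{T}$ is a column of predictions. The sizes `m` and `n` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3                          # 5 examples, 3 features (arbitrary)
X = rng.normal(size=(m, n))          # each row is one example
W = rng.normal(size=(1, n))          # row vector of weights
Y_actual = rng.normal(size=(m, 1))

# Per-example form: accumulate the squared error one example at a time.
loss_loop = 0.0
for i in range(m):
    y_pred_i = np.dot(X[i], W[0])
    loss_loop += (y_pred_i - Y_actual[i, 0]) ** 2
loss_loop /= 2 * m

# Vectorised form: all predictions at once, then sum the squared residuals.
residual = X @ W.T - Y_actual
loss_vec = np.sum(residual ** 2) / (2 * m)

print(loss_loop, loss_vec)  # the two values agree up to rounding
```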

Now, to get our prediction as close to the actual value as possible, all we want is to minimise $L$! Our past experience in secondary school tells us that a minimum is achieved where the derivative is zero, $\frac{dy}{dx} = 0$. So what we need to do is differentiate $L$ with respect to $W$, since $W$ is the only variable we can actually control, and then move in the direction opposite to the gradient, which is downhill!
<img src='https://miro.medium.com/max/2648/1*y8nYa2Ij4ic_lPdD1c_dtw.png' width='700'/>
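Before moving to the full model, it may help to see the idea on a one-variable toy problem. This sketch minimises $f(w) = (w - 3)^2$ by repeatedly stepping downhill, that is, against the sign of the derivative. The function, starting point, step size and number of steps are all assumptions chosen for illustration.

```python
# Toy function with its minimum at w = 3, and its derivative.
def f(w):
    return (w - 3.0) ** 2

def df(w):
    return 2.0 * (w - 3.0)

w = 0.0        # arbitrary starting point
alpha = 0.1    # step size
history = []

for _ in range(50):
    history.append(f(w))
    # Step against the derivative: a positive slope pushes w down, a negative slope pushes it up.
    w = w - alpha * df(w)

print(w)  # ends up very close to 3, where the derivative is zero
```

The recorded `history` shrinks at every step here because the function is a simple bowl and the step size is small enough.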

#### Finding the derivative

We now just need to find the **partial derivatives** of $L$ with respect to the weights, $W$. Applying the chain rule to the loss equation above, we get:

$$
\frac{\partial L}{\partial W_{j}} = \frac{1}{m}\sum_{i=1}^{m}(X^{(i)}W^{T}-Y^{(i)}_{actual})x^{(i)}_{j}
$$

The vectorised implementation:

$$
{\nabla_{W} L} = \frac{1}{m}(X^{T}(XW^{T}-Y_{actual}))
$$
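One way to gain confidence in a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does this on random data, assuming $W$ is a row vector as in the formulas. The helper names `toy_loss` and `analytic_grad`, the array sizes and `eps` are hypothetical choices for this check only.

```python
import numpy as np

def toy_loss(X, W, Y):
    # Halved mean squared error, matching the loss formula above.
    m = X.shape[0]
    return np.sum((X @ W.T - Y) ** 2) / (2 * m)

def analytic_grad(X, W, Y):
    # Vectorised gradient: (1/m) * X^T (X W^T - Y), shape (n, 1).
    m = X.shape[0]
    return (X.T @ (X @ W.T - Y)) / m

rng = np.random.default_rng(1)
m, n = 6, 3
X = rng.normal(size=(m, n))
W = rng.normal(size=(1, n))
Y = rng.normal(size=(m, 1))

# Central finite differences: nudge one weight at a time.
eps = 1e-6
numeric = np.zeros((n, 1))
for j in range(n):
    step = np.zeros_like(W)
    step[0, j] = eps
    numeric[j, 0] = (toy_loss(X, W + step, Y) - toy_loss(X, W - step, Y)) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_grad(X, W, Y)))
print(max_diff)  # should be tiny if the derivation is right
```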

We will then perform the gradient update. Here $\alpha$ represents the learning rate. You can think of it as a control of how fast we roll down the gradient hill:

$$
W_{j} := W_{j} - {\alpha}\left(\frac{\partial L}{\partial W_{j}}\right)
$$
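To get a feel for what the learning rate does, here is a small sketch that applies the same update rule to $f(w) = w^2$ with three different values of $\alpha$. The function, the starting point and the three rates are assumptions picked to show a slow, a fast and a diverging run.

```python
def run_descent(alpha, steps=20, w_start=5.0):
    # Repeatedly apply w := w - alpha * f'(w) for f(w) = w^2, where f'(w) = 2w.
    w = w_start
    for _ in range(steps):
        w = w - alpha * 2.0 * w
    return w

results = {alpha: run_descent(alpha) for alpha in (0.01, 0.4, 1.1)}

for alpha, w_final in results.items():
    print(alpha, w_final)
```

With the small rate, `w` creeps towards 0. With the moderate rate, it gets there quickly. With the large rate, each step overshoots the minimum by more than the last and `w` moves further away instead.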

Enough said! Let's get to the code.

In [None]:
X = x.to_numpy()
y = data['Salnty'].to_numpy().reshape((-1,1))
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.concatenate((np.ones((X.shape[0],1)), X),axis=1)

In [None]:
def initialise(input_size):
    return np.random.randn(input_size,1)

In [None]:
def predict(x, params):
    return np.dot(x, params).reshape((-1,1))

In [None]:
def loss(y_pred, y_actual):
    return (1/(2*y_actual.shape[0])) * (np.sum((y_pred-y_actual)**2))

In [None]:
def grad(y_pred, y_actual, x):
    return (1/x.shape[0]) * np.dot(x.T,(y_pred - y_actual))

In [None]:
def optimise(w, grad, alpha):
    w = w - alpha * grad
    return w

In [None]:
# Finally here is our training algorithm
def train(X, y_actual, alpha, batch_size, epoch):
    w = initialise(X.shape[1])
    log = []
    for i in range(epoch):
        # sample a random mini-batch (with replacement) from the available rows
        index = np.random.randint(0, X.shape[0], size=batch_size)
        x_train = X[index, :]
        y_train = y_actual[index, :]
        y_pred = predict(x_train, w)
        grad_w = grad(y_pred, y_train, x_train)
        w = optimise(w, grad_w, alpha)
        # loss of the prediction made before this step's weight update
        log.append(loss(y_pred, y_train))
    # return the trained weights as well, so they can be used for prediction
    return w, log

In [None]:
alpha = 0.1
batch_size = 512
epoch = 1000
w, log = train(X, y, alpha, batch_size, epoch)

In [None]:
plt.plot(np.arange(0,epoch), log, 'r-')
plt.title("Loss over epochs")
plt.xlabel("Epoch")
plt.ylabel("Loss")

plt.show()

And we did it! The graph above shows the change in loss over time! Note that if we use our W now to predict, we will actually still be a bit off. This is rather to be expected, because the values of salinity are all very clustered together, and it is hard for us to get the decimals right unless we run a huge number of training loops and also significantly increase the precision in our parameters and data. This notebook is done mainly to illuminate the ideas behind linear regression, and not so much to create an accurate model itself!

Another point to note is that the loss curve is not going to be perfectly smooth. For linear regression with a squared-error loss, the loss surface is actually a convex bowl, so there are no local minima to get trapped in, but because every step uses a different random mini-batch, you can suddenly see a jump in your loss! For more complex models, the loss surface looks more like a real-life mountain range with a lot of peaks and valleys, and there you can also get trapped in a local minimum. Don't worry about it, this does happen from time to time and there are things that can be done to reduce the chance (dynamic learning rate etc.). I hope this exercise has given you a clearer idea of linear regression. Cheers!