# Report

<hr>

### Pre-Processing:
__1. Shuffling the data:__ The dataset was randomly shuffled so that the split into training and testing data is random and the model generalises.<br>
__2. Splitting the data:__ The shuffled dataset was split, in order, into training and testing data in a 7:3 ratio.<br>
__3. Standardising the data:__ The feature attributes of the training and testing datasets were standardised separately to prevent leakage of testing data into training data. Finally, the Pandas Series were converted into NumPy arrays for easier and faster computation.<br>
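
The three steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the report's actual code: the function name `shuffle_split_standardise`, the seed values and the synthetic stand-in columns are all assumptions made for the example.

```python
import numpy as np

def shuffle_split_standardise(X, y, train_frac=0.7, seed=0):
    """Shuffle rows, split 7:3 in order, and standardise each split separately."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # random row order
    X, y = X[order], y[order]

    n_train = int(round(train_frac * len(X)))
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]

    # Each split is standardised with its own mean and standard deviation,
    # as described in step 3 of the report.
    X_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
    X_test = (X_test - X_test.mean(axis=0)) / X_test.std(axis=0)
    return X_train, X_test, y_train, y_test

# Synthetic stand-in for the (age, bmi, children) columns (hypothetical values).
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(loc=[40.0, 30.0, 1.0], scale=[14.0, 6.0, 1.2], size=(100, 3))
y_demo = demo_rng.normal(13000.0, 1000.0, size=100)

X_tr, X_te, y_tr, y_te = shuffle_split_standardise(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

After this step every feature column of each split has zero mean and unit standard deviation, which is what the later sections rely on.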

<hr>

### Model:
After each random shuffle and split of the dataset, three models were generated: one by solving the Normal Equations, one by Gradient Descent and one by Stochastic Gradient Descent. The intercept of each model was close to 13000, as expected, since this is the mean of the target attribute. The sum of squares of errors was used to measure the accuracy of the predictive model. All three algorithms yielded very similar linear regression models.<br>
$y = ω_{0} + ω_{1}x_{1} + ω_{2}x_{2} + ω_{3}x_{3}$<br>
$x_{1}, x_{2}, x_{3}$ represent the age, bmi and number of children respectively. $ω_{1}, ω_{2}, ω_{3}$ are the weights associated with $x_{1}, x_{2}, x_{3}$, and $ω_{0}$ is the intercept (bias).<br><br>
$$X = 
\begin{bmatrix} 
1 & x_{11} & x_{12} & x_{13}\\
1 & x_{21} & x_{22} & x_{23}\\
. & . & . & . \\
. & . & . & . \\
1 & x_{m1} & x_{m2} & x_{m3}\\
\end{bmatrix}
\quad
$$
$$ $$
$$Y = 
\begin{bmatrix} 
y_{1}\\
y_{2}\\
.\\
.\\
y_{m}\\
\end{bmatrix}
\quad
$$ where $m$ = size of training data
$$ $$
$$ω = 
\begin{bmatrix} 
ω_{0}\\
ω_{1}\\
ω_{2}\\
ω_{3}\\
\end{bmatrix}
\quad
$$
$$ $$
$$b = X^{T} . Y$$
$$E(ω) = \frac{1}{2} \sum_{n=1}^{m} (x_{n}.ω - y_{n})^{2}$$
where $E(ω)$ is the sum of squares of errors and $x_{n}$ is the $n$-th row of $X$.<br>
Matrix products and inversion were performed with the NumPy library after converting the Pandas Series to NumPy arrays.

<hr>

### Algorithms:
#### Linear Regression by Solving Normal Equations
$$ ω = (X^{T}.X)^{-1}.X^{T}.Y = (X^{T}.X)^{-1}.b$$
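
A minimal sketch of the closed-form solution, assuming `X` already contains the leading column of ones. The helper names and the toy data are hypothetical. The report describes an explicit matrix inversion; this sketch instead calls `np.linalg.solve` on the same system, which gives the same weights whenever $X^{T}.X$ is invertible.

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve (X^T X) w = X^T y for w; X must include the column of ones."""
    b = X.T @ y
    return np.linalg.solve(X.T @ X, b)

def sum_squared_error(X, y, w):
    """E(w) = 1/2 * sum of squared residuals."""
    residual = X @ w - y
    return 0.5 * residual @ residual

# Toy data generated from known weights (hypothetical values, no noise).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
X_toy = np.column_stack([np.ones(50), features])
true_w = np.array([13000.0, 3.0, 2.0, 1.0])
y_toy = X_toy @ true_w

w_ne = fit_normal_equations(X_toy, y_toy)
print(w_ne)
```

Because the toy targets contain no noise, the recovered weights should match `true_w` up to floating-point error.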
#### Linear Regression by Gradient Descent
Gradient:
$$ \frac{\partial E(ω)}{\partial ω} = X^{T}.(X.ω - Y) $$
Weight update:
$$ ω = ω - η \frac{\partial E(ω)}{\partial ω} $$
where $η$ is the learning rate.
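
The update rule can be sketched as a short loop. This is an illustration under stated assumptions rather than the report's implementation: the function name, the learning rate, the epoch limit and the stopping test on the size of the weight change are all hypothetical choices.

```python
import numpy as np

def fit_gradient_descent(X, y, lr=0.001, n_epochs=5000, tol=1e-9):
    """Batch gradient descent on E(w) = 1/2 * ||X w - y||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        grad = X.T @ (X @ w - y)        # gradient over the full training set
        w_new = w - lr * grad
        # Assumed stopping rule: stop once the weights barely change.
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    return w

# Toy data with standardised features and known weights (hypothetical values).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
features = (features - features.mean(axis=0)) / features.std(axis=0)
X_toy = np.column_stack([np.ones(50), features])
true_w = np.array([13000.0, 3.0, 2.0, 1.0])
y_toy = X_toy @ true_w

w_gd = fit_gradient_descent(X_toy, y_toy)
print(w_gd)
```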
#### Linear Regression by Stochastic Gradient Descent
SGD makes sequential passes over the training data and, during each pass, updates the feature weights one example at a time, with the aim of approaching the optimal weights that minimize the loss.<br>
Gradient for the $n$-th training example:
$$ \frac{\partial E_{n}(ω)}{\partial ω} = (x_{n}.ω - y_{n}).x_{n}^{T} $$
Weight update:
$$ ω = ω - η \frac{\partial E_{n}(ω)}{\partial ω} $$
where $E_{n}(ω) = \frac{1}{2}(x_{n}.ω - y_{n})^{2}$ is the error on the $n$-th training example and $η$ is the learning rate.
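
A minimal sketch of the per-example update, assuming the rows are visited in their stored order on every pass (the data is already shuffled during pre-processing). The function name, learning rate and number of epochs are hypothetical, and the toy data is noise-free so that the result can be compared with known weights.

```python
import numpy as np

def fit_sgd(X, y, lr=0.01, n_epochs=300):
    """Stochastic gradient descent: one weight update per training example."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for n in range(len(X)):              # one sequential pass over the data
            x_n = X[n]
            grad_n = (x_n @ w - y[n]) * x_n  # gradient of the n-th example's error
            w = w - lr * grad_n
    return w

# Toy data with standardised features and known weights (hypothetical values).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
features = (features - features.mean(axis=0)) / features.std(axis=0)
X_toy = np.column_stack([np.ones(50), features])
true_w = np.array([13000.0, 3.0, 2.0, 1.0])
y_toy = X_toy @ true_w

w_sgd = fit_sgd(X_toy, y_toy)
print(w_sgd)
```

On noisy data the SGD weights would keep fluctuating around the minimum instead of settling exactly, which is consistent with the slightly different error values reported for SGD.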

<hr>

### Mean, Variance and Minimum of Training Error
#### Normal Equations
Mean of training errors obtained over 20 regression models = 60863489090.22312<br>
Variance of training errors obtained over 20 regression models = 2.3592510523138785e+18<br>
Minimum of training errors obtained over 20 regression models = 58689130587.04303<br>
#### Gradient Descent
Mean of training errors obtained over 20 regression models = 60863489090.34578<br>
Variance of training errors obtained over 20 regression models = 2.359251052323336e+18<br>
Minimum of training errors obtained over 20 regression models = 58689130587.1583<br>
#### Stochastic Gradient Descent
Mean of training errors obtained over 20 regression models = 60863507490.357376<br>
Variance of training errors obtained over 20 regression models = 2.3592502939019807e+18<br>
Minimum of training errors obtained over 20 regression models = 58689132776.446106<br>

### Mean, Variance and Minimum of Testing Error
#### Normal Equations
Mean of testing error obtained over 20 regression models = 25548673963.726776<br>
Variance of testing error obtained over 20 regression models = 2.2762131308133821e+18<br>
Minimum testing error obtained over 20 regression models = 22987187804.050194<br>
#### Gradient Descent
Mean of testing error obtained over 20 regression models = 25548672100.185345<br>
Variance of testing error obtained over 20 regression models = 2.2762222241018094e+18<br>
Minimum testing error obtained over 20 regression models = 22987179664.889412<br>
#### Stochastic Gradient Descent
Mean of testing error obtained over 20 regression models = 25548244755.53107<br>
Variance of testing error obtained over 20 regression models = 2.279184996206161e+18<br>
Minimum testing error obtained over 20 regression models = 22982193889.584408<br>
<br>
The error values are large because of the chosen performance measure, the sum of squares of errors: it is not averaged over the number of examples, and it is computed on a target attribute whose mean is about 13000.

<hr>

### Plot the Convergence: Error vs Epochs
__Plot of $E(\omega)$ against the number of iterations of Gradient Descent__<br>
<img src="GD_error_epoch.jpeg" alt="kindly put all the files in same directory to display the image"><br>
__Plot of $E(\omega)$ against the number of iterations of Stochastic Gradient Descent__<br>
<img src="SGD_error_epoch.jpeg" alt="kindly put all the files in same directory to display the image"><br>

<hr>

### Additional Questions
All three methods yield very similar models. The Normal Equations and Gradient Descent results differ only in the decimal places, while the Stochastic Gradient Descent results differ by a small amount. This is because the error function has a single global minimum, and all three algorithms aim to converge to it. Since the minimum is unique, the models converge to the same weights. The small discrepancy arises because the gradient descent and stochastic gradient descent algorithms are stopped early, once the change in the gradient is too small to alter the model significantly. The most efficient algorithm to work with would be Stochastic Gradient Descent, which reaches the minimum faster than the other algorithms when the dataset is very large.<br><br>
Standardization is a scaling technique in which the values are centered around the mean, i.e. the distribution is rescaled to zero mean and unit standard deviation. In multivariate regression, standardization brings variables with different units onto a comparable scale, which helps all the weights update at similar rates and yields a more accurate predictive model.<br><br>
In general, increasing the number of training iterations reduces the loss and results in a more accurate model. However, after a large number of epochs the change in the error function becomes negligible, and the computational cost of running more iterations outweighs the gain in predictive accuracy.<br><br>
The stochastic gradient descent plot converged to the minimum significantly faster than the gradient descent plot: SGD took approximately 8,000 epochs, while GD took approximately 140,000 epochs to reach it. However, the GD algorithm yielded a more accurate model. On real-world data, SGD is the more practical way of implementing linear regression.<br><br>
If a very large learning rate is used in the GD/SGD algorithm, each update overshoots the minimum of the loss function. The weights may then keep diverging from the minimum indefinitely, and a loop that stops only on convergence would never terminate.<br><br>
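
The overshooting behaviour can be reproduced on a small example. This is a sketch on synthetic data with hypothetical names and learning rates. For this quadratic error function, batch gradient descent is stable only while $η$ stays below 2 divided by the largest eigenvalue of $X^{T}.X$, and the code computes that threshold for the toy data.

```python
import numpy as np

def error_history(X, y, lr, n_steps):
    """Run batch gradient descent and record E(w) before every update."""
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(n_steps):
        residual = X @ w - y
        history.append(0.5 * residual @ residual)
        w = w - lr * (X.T @ residual)
    return history

# Toy data with standardised features (hypothetical values).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
features = (features - features.mean(axis=0)) / features.std(axis=0)
X_toy = np.column_stack([np.ones(50), features])
y_toy = X_toy @ np.array([13000.0, 3.0, 2.0, 1.0])

# Stability threshold for the learning rate on this data set.
lr_limit = 2.0 / np.linalg.eigvalsh(X_toy.T @ X_toy).max()

small_lr_errors = error_history(X_toy, y_toy, lr=0.001, n_steps=20)
large_lr_errors = error_history(X_toy, y_toy, lr=0.1, n_steps=20)

print("learning-rate limit:", lr_limit)
print("small rate, first/last error:", small_lr_errors[0], small_lr_errors[-1])
print("large rate, first/last error:", large_lr_errors[0], large_lr_errors[-1])
```

With the small rate the error shrinks at every step, while with the large rate it grows at every step instead of converging.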
If the model does not have a bias term, its prediction is zero when all the variables are zero. However, the mean of the 'charges' variable is about 13000. Such a model would therefore not fit the data well and would produce larger error values: the minimum of the error function would be larger without a bias term.<br><br>
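
The effect of dropping the bias term can be checked numerically. The sketch below uses synthetic data with a target mean near 13000 and fits both models with `np.linalg.lstsq`, which is a stand-in for the report's own solvers. All names and values are hypothetical.

```python
import numpy as np

def minimum_error(X, y):
    """Least-squares fit, returning E(w) = 1/2 * sum of squared residuals."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = X @ w - y
    return 0.5 * residual @ residual

# Synthetic standardised features and a target with mean near 13000.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
features = (features - features.mean(axis=0)) / features.std(axis=0)
noise = rng.normal(0.0, 1000.0, size=200)
y_toy = 13000.0 + features @ np.array([3000.0, 2000.0, 500.0]) + noise

with_bias = minimum_error(np.column_stack([np.ones(200), features]), y_toy)
without_bias = minimum_error(features, y_toy)

print("minimum error with bias   :", with_bias)
print("minimum error without bias:", without_bias)
```

Because the standardised features have zero mean, they cannot account for the mean of the target, so the error without a bias term is larger by $\frac{1}{2} m \bar{y}^{2}$, where $\bar{y}$ is the mean of the target.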
The final weight vector defines the linear model that best fits the training data under the initial hypothesis. Since the feature attributes are standardized, the weights update at similar rates and can be compared directly. Notably, $ω_{1}$ has the largest value among all the weights excluding the bias, meaning that the prediction changes most for a given change in $x_{1}$ compared with the same change in the other feature attributes. Therefore, 'age' has the highest influence on the target attribute. Similarly, it can be deduced that 'children' has the least influence on the target attribute.