# Vectorized Linear Regression
## The model itself is implemented using only NumPy and pandas. scikit-learn is used only to load the dataset, and seaborn and Matplotlib only for plotting.
## The goal is to minimize the cost function by finding the optimal values of the parameters theta using gradient descent.

### Multivariate linear regression on the Boston Housing Prices dataset. No off-the-shelf machine learning implementation is used.
### **To my surprise, I couldn't find a vectorized linear regression implementation for this widely known dataset in particular, so I hope this notebook is a useful contribution to the machine learning community.**
### Why is vectorization important? Vectorizing the computations lets the hardware process whole arrays at once through optimized, parallelized low-level routines instead of slow Python loops. The time saved is modest on a small dataset like this one, but it matters with big data, where seconds here can translate to days.
### Thus, vectorization greatly reduces training time without changing what the algorithm computes.
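
As a rough illustration of the point above, the sketch below compares explicit Python loops with the equivalent NumPy matrix-vector product. The data is synthetic, the helper name `predict_loop` is made up for this example, and the timings will differ from machine to machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 13))      # synthetic data, not the Boston dataset
theta_demo = rng.normal(size=13)

def predict_loop(X, theta):
    """Compute X @ theta with explicit Python loops (slow)."""
    out = np.zeros(X.shape[0])
    for r in range(X.shape[0]):
        for c in range(X.shape[1]):
            out[r] += X[r, c] * theta[c]
    return out

start = time.perf_counter()
slow = predict_loop(X_demo, theta_demo)
loop_time = time.perf_counter() - start

start = time.perf_counter()
fast = X_demo @ theta_demo                # vectorized: one call into optimized native code
vec_time = time.perf_counter() - start

print("results match:", np.allclose(slow, fast))
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.6f}s")
```

Both versions compute the same predictions; only the time taken differs.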

In [None]:
from sklearn import datasets
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Loading the Boston dataset from sklearn.datasets

In [None]:
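#Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older release.
boston_data = datasets.load_boston()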
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)                            #Appending the target feature to the dataset
df.head()

In [None]:
df.info()

In [None]:
df.shape

506 examples with 13 features and 1 target variable.

In [None]:
Y = df['target']
y = Y.copy(deep=True) 
df.drop('target', axis = 1, inplace = True)

### After dropping the target: 506 examples and 13 features.

In [None]:
df.describe()

## Breakdown of each feature:

crim-
per capita crime rate by town.

zn-
proportion of residential land zoned for lots over 25,000 sq.ft.

indus-
proportion of non-retail business acres per town.

chas-
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox-
nitrogen oxides concentration (parts per 10 million).

rm-
average number of rooms per dwelling.

age-
proportion of owner-occupied units built prior to 1940.

dis-
weighted mean of distances to five Boston employment centres.

rad-
index of accessibility to radial highways.

tax-
full-value property-tax rate per $10,000.

ptratio-
pupil-teacher ratio by town.

black-
1000(Bk - 0.63)^2 where Bk is the proportion of African-American residents by town.

lstat-
lower status of the population (percent).

target-
median value of owner-occupied homes in $1000s.

### Since the features have very different ranges, we standardise them so that no single feature dominates the others during gradient descent.
### Z-score standardisation (often loosely called mean normalisation) rescales each feature to have a mean of 0 and a standard deviation of 1.
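
A minimal sketch of z-score standardisation on a made-up DataFrame (the values are illustrative, not taken from the Boston data). One detail worth knowing: pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, while NumPy's `.std()` uses the population version (`ddof=0`), so the two give slightly different scaled values unless `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two features on very different scales
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

# Vectorized over all columns at once: subtract each column's mean, divide by its std
scaled = (demo - demo.mean()) / demo.std()
print(scaled.mean())   # approximately 0 for every column
print(scaled.std())    # 1 for every column

# The same computation in NumPy needs ddof=1 to match pandas
arr = demo.to_numpy()
np_scaled = (arr - arr.mean(axis=0)) / arr.std(axis=0, ddof=1)
```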

In [None]:
#Feature scaling
#Features on very different scales slow down gradient descent, because the features
#with the largest ranges dominate the gradient updates.
#Performing Z-score standardisation
for i in df.columns:
    if i == 'CHAS':
        #We don't standardise categorical data (CHAS only has binary values)
        continue
    df[i] = (df[i]-(df[i].mean()))/(df[i].std())

In [None]:
#Dataset with normalised values
df.head()

### Visualising the target values

In [None]:
sns.scatterplot(data = y)

## The plot above shows that house prices are capped at $50,000 (a target value of 50). This can limit the accuracy of our model for houses worth more than $50,000, both among the training examples and in real-life use. This cap on the highest price is one limitation of the Boston dataset.
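
One way to gauge how much the cap matters is to count the examples that sit exactly at the maximum target value. The sketch below uses a small made-up series in place of `y` and assumes that the maximum value is the cap; on the real data the same steps can be run on `y` directly.

```python
import numpy as np
import pandas as pd

# Hypothetical target values in $1000s, standing in for y
prices = pd.Series([21.6, 50.0, 34.7, 50.0, 18.9, 50.0])

cap = prices.max()                          # assumed to be the censoring value
at_cap = np.isclose(prices, cap).sum()      # number of examples sitting at the cap
print(f"{at_cap} of {len(prices)} examples are at the cap of {cap}")
```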

### Converting the dataset into NumPy arrays to implement the vectorized linear regression model

In [None]:
x = df.to_numpy()
print(x.shape, y.shape)
target = y.to_numpy()
numExamples = x.shape[0]
numFeatures = x.shape[1]

In [None]:
x = np.append(np.ones((numExamples,1)),x, axis = 1)   #Adding a column of ones for the bias (intercept) term
x.shape

In [None]:
theta = np.zeros((numFeatures + 1,1)) #Initializing theta values as 0.

In [None]:
theta.shape

In [None]:
iterations = [200,300,400,500,600,700,1000]
alpha = [.001 , .003, .01, .03, .1 ,.3]
target = np.reshape(target, (len(target),1))                      #Column vector, so h-target does not broadcast to (m, m)
for i in iterations:
    for j in alpha:
        theta = np.zeros((numFeatures + 1,1))                     #Starting every run from theta = 0
        for iters in range(i):
            h = x@theta                                           #Hypothesis function
            error = h-target
            J = ((error**2).sum())*(1/(2*target.shape[0]))        #Cost function (evaluated before this iteration's update)
            gradient = (x.T@error)*(j/target.shape[0])            #Gradient scaled by the learning rate
            theta = theta-gradient                                #Simultaneously updating all thetas
        print("for",i,"iterations and alpha =",j, "cost is",J)
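
For reference, the quantities computed inside the loop can be written in matrix form. The notation is an assumption of this note rather than something defined in the notebook: m is the number of examples (`numExamples`), n the number of features (`numFeatures`), X the m x (n+1) matrix `x` including the column of ones, y the m x 1 `target` vector, and alpha the learning rate.

```latex
h_\theta(X) = X\theta

J(\theta) = \frac{1}{2m}\,(X\theta - y)^{T}(X\theta - y)

\nabla_\theta J(\theta) = \frac{1}{m}\,X^{T}(X\theta - y)

\theta := \theta - \alpha\,\nabla_\theta J(\theta)
```

Because the update is a single matrix expression, all n+1 parameters are updated simultaneously without an explicit loop over features.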

In [None]:
#Training with the combination of iterations and alpha chosen from the search above: (300, .03)
theta = np.zeros((numFeatures + 1,1))
iterations = 300
alpha = 0.03
J_History = np.zeros((iterations, 1))
target = np.reshape(target, (len(target),1))                      #Column vector of shape (numExamples, 1)
for i in range(iterations):
    h = x@theta                                                   #Hypothesis function
    error = h-target
    J = ((error**2).sum())*(1/(2*target.shape[0]))                #Cost function
    gradient = (x.T@error)*(alpha/target.shape[0])                #Gradient scaled by the learning rate
    theta = theta-gradient                                        #Simultaneously updating all thetas
    J_History[i] = J                                              #Storing the cost for this iteration
plt.plot(range(iterations), J_History)
plt.xlabel('Number of iterations')
plt.ylabel('Cost function')
plt.title('Cost function decreases steadily as the number of iterations increases')
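
As a sanity check, gradient descent on this convex cost should approach the ordinary least-squares solution when the learning rate is small enough. The sketch below wraps the same update in a hypothetical helper, `gradient_descent`, and compares its result with `np.linalg.lstsq` on synthetic data. The data, the helper name, and the hyperparameters are all assumptions made for this illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for linear regression; returns theta and the cost history."""
    m = X.shape[0]
    theta = np.zeros((X.shape[1], 1))
    history = np.zeros(iterations)
    for k in range(iterations):
        error = X @ theta - y                        # residuals, shape (m, 1)
        history[k] = (error ** 2).sum() / (2 * m)    # cost before this update
        theta = theta - (alpha / m) * (X.T @ error)  # simultaneous update of all parameters
    return theta, history

# Synthetic, already standardised-looking data with a known linear relationship
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
X_demo = np.hstack([np.ones((200, 1)), features])    # bias column first
true_theta = np.array([[2.0], [1.0], [-3.0], [0.5]])
y_demo = X_demo @ true_theta + 0.1 * rng.normal(size=(200, 1))

theta_gd, cost_history = gradient_descent(X_demo, y_demo, alpha=0.1, iterations=2000)
theta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)   # closed-form least squares

print("largest difference between the two solutions:", np.abs(theta_gd - theta_ls).max())
```

If the two solutions differ noticeably on real data, that usually points to too few iterations or a learning rate that is too small.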

In [None]:
theta #Final values of theta obtained by our vectorized linear regression model

## To predict the price of a new house, scale its 13 feature values the same way as the training data, prepend a 1 for the bias term, and multiply the result by theta.
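
A sketch of that prediction step with made-up numbers. A new example has to go through the same preprocessing as the training data: it is scaled with the training-set mean and standard deviation (which would need to be saved before the features are overwritten), any column left unscaled in training, such as CHAS here, stays unscaled, and a 1 is prepended for the bias term. The names `train_mean`, `train_std`, and `theta_demo` and all values below are hypothetical, and 3 features are used instead of 13 for brevity.

```python
import numpy as np

# Hypothetical statistics saved from the training set (one entry per feature)
train_mean = np.array([3.0, 10.0, 0.5])
train_std = np.array([1.5, 4.0, 0.1])

# Hypothetical learned parameters: bias first, then one weight per feature
theta_demo = np.array([[22.5], [-1.0], [2.0], [0.5]])

new_house = np.array([4.5, 6.0, 0.6])                    # raw, unscaled feature values
scaled = (new_house - train_mean) / train_std            # same scaling as in training
x_new = np.concatenate(([1.0], scaled)).reshape(1, -1)   # prepend the bias term, shape (1, 4)

prediction = (x_new @ theta_demo).item()                 # predicted price in $1000s
print(prediction)
```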

#### If this notebook helped you in learning, an upvote would be huge! 
#### Thank you :)