# Bivariate Linear Regression

In the Univariate Linear Regression problem, we solved a problem that only depended on one variable $x$.

In this notebook, we will try to solve a regression problem that depends on two variables.

Here's our problem statement:

*A marketing company wants to find out how its Sales are affected by advertising in Newspaper and on Television. Here is the data set that they have provided. Using this data set, find a model that can predict the Sales given advertising on Television and in Newspaper.*




In [7]:
import pandas as pd
df = pd.read_csv("sales.csv")
df.head()

| | Newspaper | TV | Sales |
|---|---|---|---|
| 0 | 5 | 5 | 500 |
| 1 | 2 | 8 | 740 |
| 2 | 8 | 2 | 260 |
| 3 | 3 | 5 | 480 |
| 4 | 1 | 7 | 640 |


***Aim :*** To find the best-fit plane that can predict Sales given advertising from Newspaper and TV

![Screenshot%202019-06-09%20at%202.37.17%20PM.png](attachment:Screenshot%202019-06-09%20at%202.37.17%20PM.png)

Plotting the data set as a contour plot, we can see that our data lives in 3D space: two inputs (Newspaper and TV) and one output (Sales).

## Standard Notation:
$m$ *= Number of training examples*

$x_{i}$ *= $i^{th}$ Input Variable/Feature*

$y$ *= Output Variable/ Target Variable*

# Definitions

Since our data set has 2 features, we will use $x_{1}$ and $x_{2}$ to refer to our 2 features and $y$ to refer to our Sales output.

$x_{1} =$ *Number of Advertisements printed in Newspaper*

$x_{2} =$ *Number of Advertisements displayed on TV*

$y =$ *Total Sales generated*

In [14]:
df.columns = ['x1','x2','y']
df.head()

| | x1 | x2 | y |
|---|---|---|---|
| 0 | 5 | 5 | 500 |
| 1 | 2 | 8 | 740 |
| 2 | 8 | 2 | 260 |
| 3 | 3 | 5 | 480 |
| 4 | 1 | 7 | 640 |


The $i^{th}$ training example will be $(x_{1}^{i},x_{2}^{i},y^{i})$, where the superscript $i$ is an index into the training set, not an exponent.

# Mathematical Interpretation

We will feed our training data to a *Learning Algorithm*, which will output a function that we're going to call the *hypothesis* $h(x)$

The job of the hypothesis function is: *Given $x_{1}$ and $x_{2}$, find an estimated value of $y$ that matches the pattern in the training set*

#### Define a *hypothesis function*

$h_{\theta}(x_{1},x_{2}) = \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}$

$\theta_{0}$, $\theta_{1}$ and $\theta_{2}$ are called parameters of our learning model. They are generally expressed as $\theta_{i}$ which refers to the $i^{th}$ parameter and in our case we have 3 parameters.
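As a minimal sketch, the hypothesis can be evaluated for several training examples at once with NumPy. The function name `hypothesis` and the parameter values `1.0, 2.0, 3.0` are illustrative assumptions and not values from this notebook; the inputs are the five rows shown by `df.head()` above.

```python
import numpy as np

# The five rows shown by df.head() above.
x1 = np.array([5, 2, 8, 3, 1], dtype=float)  # Newspaper
x2 = np.array([5, 8, 2, 5, 7], dtype=float)  # TV

def hypothesis(theta0, theta1, theta2, x1, x2):
    """Evaluate h_theta(x1, x2) = theta0 + theta1*x1 + theta2*x2."""
    return theta0 + theta1 * x1 + theta2 * x2

# Arbitrary parameter values, chosen only to show the arithmetic.
predictions = hypothesis(1.0, 2.0, 3.0, x1, x2)
print(predictions)  # [26. 29. 23. 22. 24.]
```

For the first row, for instance, $1 + 2 \cdot 5 + 3 \cdot 5 = 26$.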

***AIM :*** We want to find those values of $\theta_{0}$, $\theta_{1}$ and $\theta_{2}$ for which our hypothesis function $h_{\theta}(x)$ is close to $y$ for our training examples $(x_{1}^{i},x_{2}^{i},y^{i})$

For this, we will define our problem as an Optimization Problem in which we minimize the mean squared difference between the estimate predicted by the hypothesis function and the actual given output


*Minimize :* $\dfrac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x_{1}^{i},x_{2}^{i})-y^{i})^{2}$ 

where $h_{\theta}(x_{1}^{i},x_{2}^{i}) = \theta_{0}+\theta_{1}x_{1}^{i}+\theta_{2}x_{2}^{i}$

We will call the above mean of squared differences the cost function $J(\theta_{0},\theta_{1},\theta_{2})$:

$J(\theta_{0},\theta_{1},\theta_{2})$ = $\dfrac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x_{1}^{i},x_{2}^{i})-y^{i})^{2}$ 
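To make the formula concrete, here is a small sketch that evaluates $J$ on the five rows shown by `df.head()` earlier. The name `cost_j` and the guess $(0, 10, 80)$ are illustrative assumptions; a full data set would give different numbers.

```python
import numpy as np

# The five rows shown by df.head() earlier in this notebook.
x1 = np.array([5, 2, 8, 3, 1], dtype=float)           # Newspaper
x2 = np.array([5, 8, 2, 5, 7], dtype=float)           # TV
y = np.array([500, 740, 260, 480, 640], dtype=float)  # Sales

def cost_j(theta0, theta1, theta2):
    """J = 1/(2m) * sum of squared differences between h and y."""
    m = len(y)
    residuals = (theta0 + theta1 * x1 + theta2 * x2) - y
    return np.sum(residuals ** 2) / (2 * m)

cost_at_zero = cost_j(0.0, 0.0, 0.0)     # every prediction is 0
cost_at_guess = cost_j(0.0, 10.0, 80.0)  # an arbitrary, closer guess
print(cost_at_zero, cost_at_guess)       # 150520.0 1670.0
```

A lower value of $J$ means the hypothesis is closer to the training outputs, which is why the closer guess scores far better than the all-zero parameters.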

To minimize the cost function $J(\theta_{0},\theta_{1},\theta_{2})$  we will use an algorithm called *Gradient Descent*

First, we choose initial values for $\theta_{0}$, $\theta_{1}$ and $\theta_{2}$. They can be chosen arbitrarily; here we start all three at zero.


$\theta_{0} = 0$

$\theta_{1} = 0$

$\theta_{2} = 0$




Then we will keep changing these values of $\theta_{0}$, $\theta_{1}$ and $\theta_{2}$ until we reach a global minimum.

To do this, we will use the following gradient descent algorithm:


$repeat$ $until$ $convergence$ *{*

$\theta_{j} := \theta_{j} - \alpha \dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{j}}$

*}*

*where* $\alpha$ *is called the learning rate.* 

We apply this update to every parameter.

Since we have 3 parameters, $\theta_{0}$, $\theta_{1}$ and $\theta_{2}$, we must update all three simultaneously: every new value is computed from the current values of all the parameters, and only then are the parameters overwritten.

This means we first compute temporary values $t_{0}$, $t_{1}$ and $t_{2}$ and then assign them:

$t_{0} := \theta_{0} - \alpha \dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{0}}$ 

$t_{1} := \theta_{1} - \alpha \dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{1}}$ 

$t_{2} := \theta_{2} - \alpha \dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{2}}$ 


$\theta_{0} = t_{0}$

$\theta_{1} = t_{1}$

$\theta_{2} = t_{2}$
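The sketch below illustrates why the temporaries matter. It performs one step from zero on the five rows shown by `df.head()`, once simultaneously and once by overwriting each parameter immediately. It uses the partial-derivative formulas that this notebook states next; the names `gradient`, `simultaneous` and `sequential` are illustrative assumptions.

```python
import numpy as np

# The five rows shown by df.head() earlier in this notebook.
x1 = np.array([5, 2, 8, 3, 1], dtype=float)
x2 = np.array([5, 8, 2, 5, 7], dtype=float)
y = np.array([500, 740, 260, 480, 640], dtype=float)

def gradient(theta):
    """Partial derivatives of J at theta = (theta0, theta1, theta2)."""
    error = theta[0] + theta[1] * x1 + theta[2] * x2 - y
    return np.array([error.mean(), (error * x1).mean(), (error * x2).mean()])

alpha = 0.01

# Simultaneous update: every derivative is evaluated at the OLD parameters.
theta = np.zeros(3)
simultaneous = theta - alpha * gradient(theta)

# Sequential update (not what the algorithm prescribes): each parameter is
# overwritten at once, so later derivatives see partly updated values.
sequential = np.zeros(3)
for j in range(3):
    sequential[j] = sequential[j] - alpha * gradient(sequential)[j]

print(simultaneous)  # approximately [5.24, 16.28, 31.64]
print(sequential)    # same first entry, different second and third
```

The two results agree only in $\theta_{0}$, because that is the only update computed before anything was overwritten.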


Now, using multivariate calculus, we can find that

$\dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{0}} = \dfrac{1}{m}\sum_{i=1}^{m} (h_{\theta}(x_{1}^{i},x_{2}^{i})-y^{i})$

$\dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{1}} = \dfrac{1}{m}\sum_{i=1}^{m} (h_{\theta}(x_{1}^{i},x_{2}^{i})-y^{i})x_{1}^{i}$

and

$\dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{2}} = \dfrac{1}{m}\sum_{i=1}^{m} (h_{\theta}(x_{1}^{i},x_{2}^{i})-y^{i})x_{2}^{i}$
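For reference, here is a sketch of the chain-rule step behind these results, shown for $\theta_{1}$ (the other two follow the same pattern):

$\dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{1}} = \dfrac{1}{2m}\sum_{i=1}^{m} 2\,(h_{\theta}(x_{1}^{i},x_{2}^{i})-y^{i})\,\dfrac{\partial h_{\theta}(x_{1}^{i},x_{2}^{i})}{\partial \theta_{1}} = \dfrac{1}{m}\sum_{i=1}^{m} (h_{\theta}(x_{1}^{i},x_{2}^{i})-y^{i})\,x_{1}^{i}$

since $\dfrac{\partial h_{\theta}(x_{1}^{i},x_{2}^{i})}{\partial \theta_{1}} = x_{1}^{i}$. For $\theta_{0}$ the inner derivative is $1$, and for $\theta_{2}$ it is $x_{2}^{i}$. The factor $\dfrac{1}{2}$ in the cost function cancels the $2$ produced by differentiating the square.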


*We will know that we have reached a global minimum when the derivative terms of gradient descent* $\dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{0}}$, $\dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{1}}$ and $\dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{2}}$ *all become zero.* Since $J$ is a convex function of the parameters, such a point is a global minimum. In practice, gradient descent only brings the derivatives very close to zero.
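As a quick illustration, the sketch below evaluates the three derivative terms on the five rows shown by `df.head()`, first at the starting point and then at $(0, 10, 90)$, the values this notebook later reports as the function used to create the data set. The function name `gradient` is an illustrative assumption.

```python
import numpy as np

# The five rows shown by df.head() earlier in this notebook.
x1 = np.array([5, 2, 8, 3, 1], dtype=float)
x2 = np.array([5, 8, 2, 5, 7], dtype=float)
y = np.array([500, 740, 260, 480, 640], dtype=float)

def gradient(theta0, theta1, theta2):
    """Return the three partial derivatives of J as a NumPy array."""
    error = theta0 + theta1 * x1 + theta2 * x2 - y
    return np.array([error.mean(), (error * x1).mean(), (error * x2).mean()])

grad_at_start = gradient(0.0, 0.0, 0.0)
grad_at_fit = gradient(0.0, 10.0, 90.0)
print(grad_at_start)  # [ -524. -1628. -3164.]
print(grad_at_fit)    # all zeros: these five rows are fitted exactly
```

The large negative derivatives at the start tell gradient descent to increase all three parameters; at the fitted values there is nothing left to correct.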

So all we have to do is

$repeat$ $until$ $convergence$ *{*

$t_{0} := \theta_{0} - \alpha \dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{0}}$ 

$t_{1} := \theta_{1} - \alpha \dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{1}}$ 

$t_{2} := \theta_{2} - \alpha \dfrac{\partial J(\theta_{0},\theta_{1},\theta_{2})}{\partial \theta_{2}}$ 


$\theta_{0} = t_{0}$

$\theta_{1} = t_{1}$

$\theta_{2} = t_{2}$


*}*
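The loop above can also be sketched in vectorized form. This is an alternative formulation under stated assumptions, not the implementation used in this notebook: it runs a fixed number of iterations, uses only the five rows shown by `df.head()`, and the names `gradient_descent`, `X_b` and `n_iterations` are illustrative.

```python
import numpy as np

# The five rows shown by df.head() earlier in this notebook (columns x1, x2).
X = np.array([[5, 5], [2, 8], [8, 2], [3, 5], [1, 7]], dtype=float)
y = np.array([500, 740, 260, 480, 640], dtype=float)

def gradient_descent(X, y, alpha=0.01, n_iterations=10000):
    """Batch gradient descent for h = theta0 + theta1*x1 + theta2*x2."""
    m = len(y)
    # Prepend a column of ones so theta0 is handled like the other parameters.
    X_b = np.column_stack([np.ones(m), X])
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_iterations):
        error = X_b @ theta - y           # h(x) - y for every example
        gradient = (X_b.T @ error) / m    # all partial derivatives at once
        theta = theta - alpha * gradient  # simultaneous update
    final_cost = np.sum((X_b @ theta - y) ** 2) / (2 * m)
    return theta, final_cost

theta, final_cost = gradient_descent(X, y)
print(theta, final_cost)
```

Because `theta` is replaced as a whole array, the update is simultaneous by construction. With only these five rows, in which $x_{1}$ and $x_{2}$ are strongly correlated, the intercept can converge slowly, so the parameters after 10,000 iterations may differ from those obtained on the full data set.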

In [16]:
print(df.head())

   x1  x2    y
0   5   5  500
1   2   8  740
2   8   2  260
3   3   5  480
4   1   7  640


In [17]:
t0 = 0
t1 = 0
t2 = 0

In [24]:
def hypo(t0, t1, x1, t2, x2):
    # Hypothesis h(x1, x2) = t0 + t1*x1 + t2*x2
    result = (t0 + t1*x1 + t2*x2)
    return result

In [25]:
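def cost(t0, t1, t2):
    # Cost J(t0, t1, t2): half the mean squared error over all rows of df
    total = 0  # running sum of squared errors (avoids shadowing built-in sum)
    for index, row in df.iterrows():
        hyp = hypo(t0, t1, row["x1"], t2, row["x2"])
        squared_error = (hyp - row["y"])**2
        total = total + squared_error
    totalcost = 0.5*(1/len(df))*total
    return totalcost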

In [26]:
def dTermT0(t0, t1, t2):
    # Partial derivative of J with respect to theta0: mean of (h - y)
    total = 0
    for index, row in df.iterrows():
        hyp = hypo(t0, t1, row["x1"], t2, row["x2"])
        error = (hyp - row["y"])
        total = total + error
    return (1/len(df))*total

In [27]:
def dTermT1(t0, t1, t2):
    # Partial derivative of J with respect to theta1: mean of (h - y)*x1
    total = 0
    for index, row in df.iterrows():
        hyp = hypo(t0, t1, row["x1"], t2, row["x2"])
        error = (hyp - row["y"])*(row["x1"])
        total = total + error
    return (1/len(df))*total

In [28]:
def dTermT2(t0, t1, t2):
    # Partial derivative of J with respect to theta2: mean of (h - y)*x2
    total = 0
    for index, row in df.iterrows():
        hyp = hypo(t0, t1, row["x1"], t2, row["x2"])
        error = (hyp - row["y"])*(row["x2"])
        total = total + error
    return (1/len(df))*total

In [38]:
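# Initial values of theta0, theta1, theta2 and the learning rate
a = 0
b = 0
c = 0
lr = 0.01

pcost = cost(a,b,c)  # cost before the latest update
ncost = cost(a,b,c)  # cost after the latest update

iteration = 0

# Keep updating while the cost does not increase, up to an iteration cap
while ncost <= pcost:
    iteration += 1
    pcost = cost(a,b,c)
    # Compute all three new values from the current a, b, c (simultaneous update)
    new_a = a - (dTermT0(a,b,c))*lr
    new_b = b - (dTermT1(a,b,c))*lr
    new_c = c - (dTermT2(a,b,c))*lr
    a = new_a
    b = new_b
    c = new_c
    ncost = cost(a,b,c)

    # Safety cap so the loop always terminates
    if iteration > 10000:
        break

print("The parameters should be: theta0 = ",a," theta1 = ", b," theta2 = ",c)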

The parameters should be: theta0 =  0.005509496683877309  theta1 =  9.999453970291498  theta2 =  89.99933575508194


This means the learning algorithm has found the following hypothesis function:

$h(x_{1},x_{2}) = 0.0055 + 9.9994x_{1}+ 89.9993x_{2}$

which is very close to the actual function used to create the data set:

$f(x_{1},x_{2}) = 0 + 10x_{1}+ 90x_{2}$
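As an independent cross-check, using a different technique from the gradient descent in this notebook, the same linear model can be fitted in closed form with NumPy's least-squares solver. The sketch assumes only the five rows shown by `df.head()`; the name `X_b` is illustrative.

```python
import numpy as np

# The five rows shown by df.head() earlier in this notebook.
x1 = np.array([5, 2, 8, 3, 1], dtype=float)
x2 = np.array([5, 8, 2, 5, 7], dtype=float)
y = np.array([500, 740, 260, 480, 640], dtype=float)

# Design matrix with a leading column of ones for the intercept theta0.
X_b = np.column_stack([np.ones(len(y)), x1, x2])

# Least-squares solution of X_b @ theta = y.
theta, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)
print(theta)  # approximately [0, 10, 90], up to floating-point rounding
```

For small problems a direct solve like this is a convenient way to confirm that gradient descent is heading to the right parameters.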

# Important Conclusions

1. It is unlikely that the learning algorithm will find a set of parameters for which the cost function is exactly zero. Using this as the condition of the while loop is therefore not practical: the loop might never terminate and would produce no answer.


2. In Univariate Linear Regression, we solved this by checking that the new cost after each gradient descent step is lower than the previous cost, and stopping as soon as it starts to overshoot. In the bivariate case, relying on this check alone is also impractical because it can take a very long time.


3. A better approach in the bivariate or multivariate case is to run the algorithm for as long as time allows. With a suitable learning rate, the longer the algorithm runs, the better the solution it will predict. Once the computational time runs out, take the last parameter values that the learning algorithm produced as the final answer. Alternatively, set the program to run for a fixed number of iterations and print the final answer once it has done so.


4. The learning algorithm depends heavily on the learning rate. A rate that is too large can make the cost increase and diverge, while a rate that is too small makes convergence very slow. Always validate the learning rate by trying several values and checking how the cost behaves.
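The effect of the learning rate can be sketched on the five rows shown by `df.head()`. The rates `0.0001` and `0.1`, the 20-step budget and the name `cost_after` are illustrative assumptions; `0.01` is the rate used in this notebook.

```python
import numpy as np

# The five rows shown by df.head(), with a leading column of ones for theta0.
X_b = np.column_stack([np.ones(5), [5, 2, 8, 3, 1], [5, 8, 2, 5, 7]])
y = np.array([500, 740, 260, 480, 640], dtype=float)

def cost_after(alpha, n_steps=20):
    """Run n_steps of gradient descent from zero and return the final cost J."""
    m = len(y)
    theta = np.zeros(3)
    for _ in range(n_steps):
        theta = theta - alpha * (X_b.T @ (X_b @ theta - y)) / m
    return np.sum((X_b @ theta - y) ** 2) / (2 * m)

initial_cost = cost_after(alpha=0.0, n_steps=0)  # cost at the starting point
print("start", initial_cost)
for alpha in (0.0001, 0.01, 0.1):
    print(alpha, cost_after(alpha))
```

On these rows the smallest rate barely reduces the cost in 20 steps, `0.01` reduces it substantially, and `0.1` makes the cost grow far beyond its starting value.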