# Multiple Linear Regression

Simple Linear Regression -> 

```
y = b0 + b1 * x1
```

Multiple Linear Regression ->

```
y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn
```

![](assets/WhatsApp%20Image%202022-10-21%20at%208.25.11%20PM.jpeg)
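
As a minimal sketch of the multiple linear regression equation above, a prediction is just the intercept plus the dot product of the coefficients and the features. The coefficient and feature values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical coefficients, for illustration only.
b0 = 10.0                        # intercept
b = np.array([2.0, 0.5, -1.0])   # b1, b2, b3

# Two observations with three features each (x1, x2, x3).
X = np.array([
    [1.0, 4.0, 2.0],
    [3.0, 0.0, 1.0],
])

# y = b0 + b1 * x1 + b2 * x2 + b3 * x3, evaluated for every row at once.
y = b0 + X @ b
print(y)
```

Writing the sum as `X @ b` is what lets the same formula scale from one feature to n features without changing the code.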

Assumptions of a Linear Regression: 

These assumptions must hold for the dataset we are using; only then should we use linear regression.

1. Linearity
2. Homoscedasticity
3. Multivariate normality
4. Independence of errors
5. Lack of multicollinearity

[https://towardsdatascience.com/assumptions-of-linear-regression-fdb71ebeaa8b](https://towardsdatascience.com/assumptions-of-linear-regression-fdb71ebeaa8b)
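
One way to check the last assumption (lack of multicollinearity) is the variance inflation factor (VIF), which is not mentioned in these notes but is a common diagnostic. The sketch below is a rough illustration on synthetic data: the helper name `vif` and the generated columns are assumptions made for the example, and the "VIF above roughly 10" rule of thumb is only a convention.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples, n_features)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on all other columns, with an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data: x3 is almost an exact multiple of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2, x3])

v = vif(X)
print(v)  # large values for x1 and x3, a value near 1 for x2
```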

## Dummy Variables

The example below assumes that only two unique states exist: NY and Cal.

`y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1`

- `y`: Profit
- `x1`: R&D Spend
- `x2`: Admin
- `x3`: Marketing
- `D1`: dummy variable for the State column (1 if the state is NY, 0 if it is Cal)

We don't add a `b5 * D2` term (the dummy for Cal) because it is not needed: `D1 = 1` identifies NY, and when `D1 = 0` the equation already describes Cal, whose effect is absorbed into the constant `b0`.

![](assets/WhatsApp%20Image%202022-10-21%20at%208.25.11%20PM%20(1).jpeg)

![](assets/WhatsApp%20Image%202022-10-21%20at%208.25.11%20PM%20(2).jpeg)
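
The single-dummy encoding described above can be sketched with pandas. The rows are hypothetical; `pd.get_dummies` with `drop_first=True` is one way to get a single column for a two-state category, not necessarily the approach used later in this notebook.

```python
import pandas as pd

# Hypothetical rows; only the State column matters here.
df = pd.DataFrame({"State": ["NY", "Cal", "Cal", "NY"]})

# drop_first=True drops the first category in sorted order ("Cal"),
# leaving a single column that plays the role of D1.
dummies = pd.get_dummies(df["State"], drop_first=True, dtype=int)
print(dummies)
```

The remaining `NY` column is 1 for New York rows and 0 for California rows, which is exactly the `D1` term in the equation.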

## Dummy Variable Trap

`y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1 + b5 * D2`

What happens if we include the second dummy variable in the model as well? We would basically be duplicating a variable, because D2 always equals 1 - D1. The phenomenon where one or several independent variables in a linear regression predict another is called multicollinearity. As a result, the model cannot distinguish the effect of D1 from the effect of D2, and therefore it won't work properly. This is called the Dummy Variable Trap. We can't have `b0`, `b4 * D1` and `b5 * D2` in our model at the same time, that is, the constant and both dummy variables.

So, whenever you're building a model, always omit one dummy variable. This applies irrespective of the number of dummy variables in that specific dummy set; e.g., if there are 9, include 8.

Also note that if you have two sets of dummy variables, you need to apply the same rule to each set. For instance, we could have had a column that specifies the industry in which each company operates. In that case we would have had to perform exactly the same steps and create another set of dummy variables specifically for that column, again omitting one of them.
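
The trap can be made concrete by looking at the rank of the design matrix. In this sketch (hypothetical data, with `x1` standing in for any numeric feature), adding `D2` next to the constant and `D1` adds a column but no new information, so the rank does not increase.

```python
import numpy as np

# Hypothetical data: six startups in two states.
state = np.array(["NY", "Cal", "Cal", "NY", "Cal", "NY"])
x1 = np.array([10.0, 20.0, 35.0, 40.0, 55.0, 60.0])

const = np.ones(len(state))
D1 = (state == "NY").astype(float)
D2 = (state == "Cal").astype(float)   # always equal to 1 - D1

# Constant + both dummies: 4 columns, but one is redundant.
full = np.column_stack([const, x1, D1, D2])
# Constant + one dummy: 3 independent columns.
reduced = np.column_stack([const, x1, D1])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # 4 columns, rank 3
print(reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

Because `const = D1 + D2`, the four-column matrix has only three independent columns, so the coefficients `b0`, `b4` and `b5` are not uniquely determined.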


## P-value

The p-value is the probability of observing a result at least as extreme as the one we actually got, given that we are in a universe where the null hypothesis (H0) is true.

## Statistical Significance

![](assets/WhatsApp%20Image%202022-10-21%20at%208.25.11%20PM%20(3).jpeg)

Statistical significance helps answer whether our findings are real or could just be due to chance: how sure we are about them and whether they have statistical backing.

So if the p-value is below alpha = 0.05, a result this extreme would occur less than 5% of the time in the H0 universe. We therefore reject that hypothesis and conclude that we are in the H1 universe, where the coin is not fair.

alpha does not have to be 5 percent; it can be 1 percent or 10 percent, depending on our experiment.

Statistical significance is the point where in human intuitive terms you get uneasy about the null hypothesis being true, and you get super suspicious about it. 

alpha is the threshold the p-value must fall below for us to reject the null hypothesis. Once it does, we state that we live in the alternative universe (H1).
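
To make the coin example concrete, here is a small sketch under an assumed experiment: we flip a coin `k` times and get tails every time. Under H0 (a fair coin) the probability of that streak is `0.5 ** k`, which serves as a one-sided p-value. The function name and the experiment setup are illustrative assumptions.

```python
alpha = 0.05  # significance level, chosen before running the experiment

def p_value_all_tails(k: int) -> float:
    """Probability of k tails in k flips if the coin is fair (H0)."""
    return 0.5 ** k

for k in range(1, 7):
    p = p_value_all_tails(k)
    verdict = "reject H0" if p < alpha else "keep H0"
    print(f"{k} tails in a row: p-value = {p:.4f} -> {verdict}")

# Smallest streak that is statistically significant at this alpha.
first_rejection = next(k for k in range(1, 50) if p_value_all_tails(k) < alpha)
print(first_rejection)
```

With alpha = 0.05, four tails in a row (p = 0.0625) is not yet enough to reject H0, while five in a row (p = 0.03125) is. A stricter alpha such as 0.01 would require a longer streak.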


![](assets/WhatsApp%20Image%202022-10-21%20at%208.25.11%20PM%20(4).jpeg)

![](assets/WhatsApp%20Image%202022-10-21%20at%208.25.11%20PM%20(5).jpeg)

![](assets/WhatsApp%20Image%202022-10-21%20at%208.25.12%20PM%20(1).jpeg)

![](assets/WhatsApp%20Image%202022-10-21%20at%208.25.12%20PM%20(2).jpeg)

![](assets/WhatsApp%20Image%202022-10-21%20at%208.25.12%20PM%20(3).jpeg)

![](assets/WhatsApp%20Image%202022-10-21%20at%208.25.26%20PM.jpeg)


## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [2]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [3]:
print(X)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

## Encoding categorical data

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [5]:
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

## Splitting the dataset into the Training set and Test set

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training the Multiple Linear Regression model on the Training set

In [7]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

## Predicting the Test set results

In [8]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]


## Making a single prediction (for example the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = 'California')

In [9]:
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))

[181566.92]


Therefore, our model predicts that the profit of a Californian startup which spent 160000 in R&D, 130000 in Administration and 300000 in Marketing is $181,566.92.

**Important note 1:** Notice that the values of the features were all input in a double pair of square brackets. That's because the "predict" method always expects a 2D array as the format of its inputs. And putting our values into a double pair of square brackets makes the input exactly a 2D array. Simply put:

$1, 0, 0, 160000, 130000, 300000 \rightarrow \textrm{scalars}$

$[1, 0, 0, 160000, 130000, 300000] \rightarrow \textrm{1D array}$

$[[1, 0, 0, 160000, 130000, 300000]] \rightarrow \textrm{2D array}$

**Important note 2:** Notice also that the "California" state was not input as a string in the last column but as "1, 0, 0" in the first three columns. That's because the predict method expects the one-hot-encoded values of the state, and as we see in the second row of the matrix of features X, "California" was encoded as "1, 0, 0". Be careful to include these values in the first three columns, not the last three, because the ColumnTransformer places the columns produced by the encoder first and appends the passthrough columns after them.
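
Instead of typing "1, 0, 0" by hand, the fitted transformer can encode a raw row for us. The sketch below is an illustration, not the notebook's own code: it refits a `ColumnTransformer` on just the first three rows of the dataset shown earlier, and it sets `sparse_threshold=0` (an addition not used in the notebook) to force a dense result.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the dataset: R&D, Administration, Marketing, State.
X_raw = np.array([
    [165349.2, 136897.8, 471784.1, 'New York'],
    [162597.7, 151377.59, 443898.53, 'California'],
    [153441.51, 101145.55, 407934.54, 'Florida'],
], dtype=object)

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [3])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
ct.fit(X_raw)

# A raw, unencoded row in the original column order.
new_startup = np.array([[160000, 130000, 300000, 'California']], dtype=object)

# The encoder output comes first, then the passthrough columns.
encoded = ct.transform(new_startup).astype(float)
print(encoded)
```

The result has the dummy columns first (California, Florida, New York in alphabetical order) followed by the three spend columns, so it can be passed straight to `regressor.predict`.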

## Getting the final linear regression equation with the values of the coefficients

In [10]:
print(regressor.coef_)
print(regressor.intercept_)

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.529248579056


Therefore, the equation of our multiple linear regression model is:

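$\textrm{Profit} = 86.6 \times \textrm{Dummy State 1} - 873 \times \textrm{Dummy State 2} + 786 \times \textrm{Dummy State 3} + 0.773 \times \textrm{R\&D Spend} + 0.0329 \times \textrm{Administration} + 0.0366 \times \textrm{Marketing Spend} + 42467.53$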

**Important Note:** To get these coefficients we called the "coef_" and "intercept_" attributes from our regressor object. Attributes in Python are different from methods: they are accessed without parentheses and usually hold a simple value or an array of values.
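
As a quick check, the equation can be evaluated by hand for the Californian startup from the single prediction above. This sketch uses the rounded coefficients exactly as printed, so the result is close to, but not exactly, the 181566.92 returned by `regressor.predict`.

```python
import numpy as np

# Rounded values copied from the printed coef_ and intercept_ above.
coef = np.array([86.6, -873.0, 786.0, 0.773, 0.0329, 0.0366])
intercept = 42467.53

# California startup: dummies (1, 0, 0), then R&D, Administration, Marketing.
x = np.array([1, 0, 0, 160000, 130000, 300000])

# Same formula that predict applies: intercept + sum of coefficient * feature.
profit = intercept + coef @ x
print(round(profit, 2))
```

Using the unrounded `regressor.coef_` and `regressor.intercept_` in place of the copied values should reproduce the prediction exactly.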