# Multiple Linear Regression 

Multiple linear regression extends simple linear regression to more than one independent variable.
For example, a salary can be predicted from years of experience, number of hours worked, and number of projects worked on.

#### Dummy Variables

When a data set contains a categorical variable, we create artificial (dummy) variables to represent an attribute with two or
more distinct categories/levels.

For example, a dataset might have a State column with two categories, New York and California. **Rows with New York get a dummy value of 1, and rows with California get 0**, so the coefficient of that dummy variable measures how much being in New York shifts the dependent variable relative to California.

Dummy variables are useful because they enable us to use a single regression equation to represent multiple groups. This means that we don’t need to write out separate equation models for each subgroup. 

| Category | New York (dummy) | California (dummy) |
| ----------- | ----- | ----- |
| California      | 0 | 1 |
| New York  | 1 | 0 |
| New York      | 1 | 0 |
| California  | 0 | 1 |
| California      | 0 | 1 |
| New York  | 1 | 0 |

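The number of dummy columns depends on the number of categories: one column per category. Only one fewer than the number of categories needs to enter the regression equation, because the last column is fully determined by the others.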

    y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1

Here b0 is the constant, b1 to b4 are the coefficients, x1 to x3 are the ordinary variables and D1 is the dummy variable.

The default category is included in the constant: if we omit the California column, a row with D1 = 0 is a California row, and California's effect is absorbed into b0.
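As a rough illustration of the table and equation above, the sketch below builds dummy columns with `pandas.get_dummies`. The toy `State` column is hypothetical and only mirrors the six rows of the table; it is not the data set used later in this notebook.

```python
import pandas as pd

# Hypothetical toy column mirroring the table above
df = pd.DataFrame({"State": ["California", "New York", "New York",
                             "California", "California", "New York"]})

# One 0/1 column per category
all_dummies = pd.get_dummies(df["State"], dtype=int)

# Drop the first category (California) so that it becomes the baseline
# that is absorbed by the constant term
k_minus_1 = pd.get_dummies(df["State"], drop_first=True, dtype=int)

print(all_dummies)
print(k_minus_1)
```

With `drop_first=True` only the New York column remains, which plays the role of the single dummy variable in the equation.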

#### What is statistical significance?

There are two hypotheses:
- **null hypothesis**: we assume things are fair, for example that a coin is fair and lands on heads or tails with equal probability.
- **alternative hypothesis**: things are not fair, for example the coin is biased, so repeated tosses keep giving the same result.

Let's toss a coin and see what happens, assuming that it is a fair coin:

A coin is tossed once and we get tails: P = 0.5

A coin is tossed twice and we get tails both times: P = 0.25

A coin is tossed three times and we get tails every time: P = 0.125

A coin is tossed 4 times, all tails: P = 0.0625

A coin is tossed 5 times, all tails: P = 0.03125

If the same result keeps repeating, we start to feel uncomfortable, because with a fair coin the probability of always getting the same result is very low.

In a fair universe this probability keeps dropping with every toss. In an unfair universe, where the coin always lands on tails, the probability of this outcome would be 100%, and the repeated result would not be surprising at all.

The uneasy feeling must turn into a decision at some point, and this is where the significance level alpha comes in: **if the P value drops below alpha, we reject the null hypothesis**.

- A small p (≤ 0.05): reject the null hypothesis. This is strong evidence against the null hypothesis.
- A large p (> 0.05): the evidence against the null hypothesis is weak, so you do not reject it.
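The coin example can be worked through in a few lines. This is a minimal sketch assuming a fair coin and a one-sided question ("how likely is a run of tails this long?"); the helper name `p_all_tails` and the constant `ALPHA` are introduced here only for illustration.

```python
ALPHA = 0.05  # significance level, chosen before looking at the data


def p_all_tails(n_tosses: int) -> float:
    """Probability that a fair coin shows tails on every one of n_tosses."""
    return 0.5 ** n_tosses


for n in range(1, 6):
    p_value = p_all_tails(n)
    decision = "reject the null" if p_value <= ALPHA else "do not reject the null"
    print(f"{n} tails in a row: P = {p_value:.5f} -> {decision}")

# First run length at which the probability falls to or below ALPHA
first_rejection = next(n for n in range(1, 100) if p_all_tails(n) <= ALPHA)
print(first_rejection)
```

With alpha set to 0.05, four tails in a row (P = 0.0625) is not yet enough to reject the fair-coin assumption, while five in a row (P = 0.03125) is.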

### Selection Mechanisms 

We need to throw out some columns while building a model:
- if it has a lot of columns, the model becomes unreliable,
- the more variables we keep, the more we have to explain to every person who uses the model,

and so we need to select the right variables.

Methods are:

1. **All-in method**: we use it when prior knowledge tells us that this exact set of variables is needed to obtain optimum results, or when we are required to use all the variables.


2. **Backward Elimination** : 
- Step 1: Select a significance level to stay in the model (Eg: SL=0.05).
- Step 2: Fit the full model with the possible predictors.
- Step 3: Consider the predictor with the highest p-value. If P>SL, go to step 4 otherwise your model is completed.
- Step 4: Remove the predictor.
- Step 5: Fit the model without this variable and go back to Step 3.

3. **Forward Selection** :
- Step 1: Select a significance level to enter in the model (Eg: SL=0.05)
- Step 2: Fit all Simple Regression models y ~ Xn, select the one with the lowest p-value.
- Step 3: Keep the variable and fit all possible models with one extra predictor added to the one(s) you already have.
- Step 4: Consider the predictor with the lowest p-value. If **p-value < SL, go to Step 3**, else the model is completed.
- We stop only when P > SL: the newly added variable is no longer significant, so we keep the previous model.

4. **Bidirectional Elimination** :
- Step 1: Select a significance level to enter and to stay in the model (Eg: SLENTER=0.05, SLSTAY=0.05)
- Step 2: Perform the next step of forward selection (new variables must have p-value<SLENTER to enter)
- Step 3: Perform all steps of backward elimination (old variables must have p-value<SLSTAY to stay), then repeat Steps 2 and 3 until no variable can enter or exit.
- Step 4: When no new variables can enter and no old variables can exit, the model is final.

5. **All Possible Models**
- Step 1: Select the criterion of the goodness of fit.
- Step 2: Construct all possible regression models: 2^N − 1 total combinations for N candidate variables.
- Step 3: Select one with the best criterion.

But this approach is expensive: it has to consider every combination of columns in the data set, which means building an exponential number of models and consuming a lot of resources.
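The backward elimination steps listed above can be sketched with NumPy and SciPy. This is an illustrative sketch under stated assumptions, not a reference implementation: the helpers `ols_p_values` and `backward_elimination` are hypothetical names, the data is synthetic (only columns 0 and 2 actually influence `y`), and the p-values come from the usual two-sided t-test on ordinary least squares coefficients.

```python
import numpy as np
from scipy import stats


def ols_p_values(design, y):
    """Two-sided t-test p-values for OLS coefficients.

    `design` must already contain the intercept column.
    """
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - k)          # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)   # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), df=n - k)


def backward_elimination(X, y, sl=0.05):
    """Return the indices of the columns of X that survive elimination."""
    kept = list(range(X.shape[1]))                    # Step 2: start with all predictors
    while kept:
        design = np.column_stack([np.ones(len(y)), X[:, kept]])
        p = ols_p_values(design, y)[1:]               # skip the intercept
        worst = int(np.argmax(p))                     # Step 3: highest p-value
        if p[worst] <= sl:
            break                                     # every predictor is significant
        kept.pop(worst)                               # Steps 4 and 5: remove and refit
    return kept


# Synthetic data: y depends on columns 0 and 2 only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

selected = backward_elimination(X_demo, y_demo, sl=0.05)
print(selected)
```

The two informative columns are kept; a pure-noise column can occasionally survive by chance, which is exactly what the significance level controls.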

In [1]:
# We need to understand how each variable is related to the profit.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
dataset = pd.read_csv('50_Startups.csv')
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[3])],remainder='passthrough')
# each transformer is a tuple: a name, the encoder to apply, and the indexes of the columns to encode
# remainder='passthrough' keeps the columns that are not transformed
# fit_transform fits the encoder and transforms the column in one step
x = np.array(ct.fit_transform(x))
# fit_transform does not necessarily return a NumPy array, so we convert the result with NumPy
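To see which dummy column belongs to which category, the fitted encoder can be inspected. The sketch below uses a small toy array with three hypothetical state labels rather than `50_Startups.csv`; the names `toy` and `ct_demo` are introduced only for this illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: three numeric columns followed by a categorical column
toy = np.array([
    [1.0, 2.0, 3.0, "New York"],
    [4.0, 5.0, 6.0, "California"],
    [7.0, 8.0, 9.0, "Florida"],
], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
encoded = np.array(ct_demo.fit_transform(toy))

# The categories are sorted, and the dummy columns follow that order
categories = ct_demo.named_transformers_["encoder"].categories_[0]
print(categories)
print(encoded)
```

The dummy columns come first in the output and follow the sorted category order, so the first toy row (New York) starts with `0, 0, 1`.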

In [4]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=0.2, random_state=0)

In [5]:
print(x_train)

[[0.0 1.0 0.0 55493.95 103057.49 214634.81]
 [0.0 0.0 1.0 46014.02 85047.44 205517.64]
 [0.0 1.0 0.0 75328.87 144135.98 134050.07]
 [1.0 0.0 0.0 46426.07 157693.92 210797.67]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 1000.23 124153.04 1903.93]
 [0.0 0.0 1.0 542.05 51743.15 0.0]
 [0.0 0.0 1.0 65605.48 153032.06 107138.38]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [0.0 1.0 0.0 61994.48 115641.28 91131.24]
 [1.0 0.0 0.0 63408.86 129219.61 46085.25]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [1.0 0.0 0.0 23640.93 96189.63 148001.11]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 15505.73 127382.3 35534.17]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [1.0 0.0 0.0 64664.71 139553.16 137962.62]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [1.0 0.0 0.0 28754.33 118546.05 172795.67]
 [1.

We don't need to drop a dummy column by hand here: `LinearRegression` still fits and predicts with all the dummy columns present.

Note that `LinearRegression` does not compute P values or select features: it fits a coefficient for every column it is given, so picking the most statistically significant features is a separate step.

In [6]:
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
mlr.fit(x_train,y_train)

LinearRegression()

In [11]:
# Show the predicted profits next to the actual test-set profits to judge the accuracy
y_pred = mlr.predict(x_test)
np.set_printoptions(precision=2)
# concatenate the predicted profits (left column) and the real profits (right column)
print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))
# The second argument is the axis and can take two values:
# 0 means a vertical concatenation, and 1 means a horizontal concatenation.

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
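One way to summarise how close the two columns are is the coefficient of determination, R². The sketch below recomputes it with NumPy from the ten pairs printed above (predicted on the left, actual on the right); the variable names are introduced here for illustration.

```python
import numpy as np

# Predicted and actual profits copied from the printed output above
y_pred_shown = np.array([103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
                         116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test_shown = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                         105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test_shown - y_pred_shown) ** 2)
ss_tot = np.sum((y_test_shown - y_test_shown.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"R^2 on the test set: {r2:.3f}")
```

A value close to 1 means the predictions explain most of the variation in the actual profits.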


Backward elimination is not required to make predictions here: `LinearRegression` fits all the features it is given. Scikit-Learn does not select the statistically significant features automatically, so P-value-based selection, if wanted, has to be done as a separate step.

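**Making a single prediction (for example the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and the state encoded as `[0.0, 0.0, 1.0]`)**

`OneHotEncoder` orders the categories alphabetically, so `[0.0, 0.0, 1.0]` selects the last state in that order. If the State column holds California, Florida and New York, this is New York, and California would be `[1.0, 0.0, 0.0]`.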

In [13]:
print(mlr.predict([[0.0, 0.0, 1.0, 160000, 130000, 300000]]))

[182266.29]


**How do I get the final regression equation y = b0 + b1 x1 + b2 x2 + ... with the final values of the coefficients?**

In [15]:
print(mlr.coef_)
print(mlr.intercept_)

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.52924854249


$$\textrm{Profit} = 86.6 \times \textrm{Dummy State 1} - 873 \times \textrm{Dummy State 2} + 786 \times \textrm{Dummy State 3} + 0.773 \times \textrm{R&D Spend} + 0.0329 \times \textrm{Administration} + 0.0366 \times \textrm{Marketing Spend} + 42467.53$$
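As a sanity check, the equation can be evaluated by hand for the single prediction made earlier. The sketch below uses the rounded coefficients from the equation, so the result only approximately matches the `182266.29` returned by `mlr.predict`; the variable names are introduced here for illustration.

```python
import numpy as np

# Rounded coefficients and intercept taken from the equation above
coef = np.array([86.6, -873.0, 786.0, 0.773, 0.0329, 0.0366])
intercept = 42467.53

# Same feature vector as in the single prediction:
# three dummy columns, then R&D Spend, Administration, Marketing Spend
features = np.array([0.0, 0.0, 1.0, 160000, 130000, 300000])

# y = b0 + b1*x1 + ... + b6*x6 written as a dot product
manual_prediction = coef @ features + intercept
print(round(float(manual_prediction), 2))
```

The small gap between this value and the model's output comes from rounding the coefficients to three significant digits.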