# Multiple Linear Regression
- Simple linear regression suits one independent variable and one dependent variable.
- Multiple linear regression, by contrast, predicts a single dependent variable from multiple independent variables.
- Number of independent variables (X) = n
- Number of dependent variables (Y) = 1
- The algorithm follows the multiple linear regression equation:
<img src="../images/multiple_linear_regression_eqn.png" alt="multiple_linear_regression_eqn.png">
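In text form, the equation is usually written as y = b0 + b1·x1 + b2·x2 + ... + bn·xn. The sketch below assumes that form and uses made-up coefficients and feature values (they are not taken from the dataset) to show how a single prediction is computed.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
b0 = 2.0                        # intercept (constant term)
b = np.array([0.5, -1.0, 3.0])  # coefficients b1..bn, one per independent variable
x = np.array([4.0, 2.0, 1.0])   # one observation with n = 3 independent variables

# y = b0 + b1*x1 + b2*x2 + ... + bn*xn
y = b0 + np.dot(b, x)
print(y)  # 5.0
```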


## Assumptions of linear regression:-
``` 
Before fitting a linear regression, we usually run some checks on the dataset. They are not compulsory, but the model may predict better if the data satisfies them.
```

There are five assumptions associated with a linear regression model:
1. **Linearity**: The relationship between X and the mean of Y is linear.
2. **Homoscedasticity**: The variance of the residuals is the same for any value of X.
3. **Independence**: Observations are independent of each other.
4. **Multivariate Normality**: For any fixed value of X, the residuals (and therefore Y) are normally distributed.
5. **Lack of multicollinearity**: The independent variables are not strongly correlated with each other. Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of the regression model, so you might not be able to trust the p-values to identify statistically significant independent variables.
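One way to check the multicollinearity assumption is to look at pairwise correlations and at the variance inflation factor (VIF) of each feature. The sketch below is only an illustration on synthetic data, not on the startup dataset; the helper `vif` is a hypothetical name written for this example, and the feature `x3` is deliberately built as a near copy of `x1`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features (assumption: not the startup data).
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)   # nearly a copy of x1
X = np.column_stack([x1, x2, x3])

# Pairwise correlations: off-diagonal values near +1 or -1 are a warning sign.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2), where R^2 comes
    from regressing column j on the remaining columns (with an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 1))
```

Here `x1` and `x3` get very large VIF values while `x2` stays close to 1. A VIF above roughly 5 to 10 is a commonly quoted rule of thumb for problematic multicollinearity.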

## Dummy Variables
- A dummy variable represents one category of a categorical variable as a 0/1 column, so the categorical variable becomes a vector of dummy variables.
- When converting a categorical variable, we need to be careful about the **Dummy Variable Trap**.
- Example table column:

<img src="../images/dummy_variables_eg_1.png" alt="dummy_variables_eg_1.png">

- When we convert the categorical data to vector form, it looks like the table below.
- The columns Chennai, CBE and Ooty are the dummy variables for the regression equation.

<img src="../images/dummy_variables_eg_2.png" alt="dummy_variables_eg_2.png">
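The conversion can be sketched with plain NumPy. The city column below is a hypothetical example that reuses the category names from the table; in the code later in this notebook, `OneHotEncoder` does the same job.

```python
import numpy as np

# Hypothetical city column using the category names from the example.
cities = np.array(["Chennai", "CBE", "Ooty", "Chennai", "Ooty"])

# np.unique returns the sorted categories and, for each row, the index of its category.
categories, codes = np.unique(cities, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
dummies = np.eye(len(categories), dtype=int)[codes]

print(categories)  # ['CBE' 'Chennai' 'Ooty']
print(dummies)
```

Each row of `dummies` contains exactly one 1, in the column of that row's city.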


## Dummy Variables Trap Removal
- We convert categorical data so that the resulting dummy variables can be passed to the regression equation.
- As discussed above, we have three dummy variables, so our regression equation looks something like the one below.

<img src="../images/dummy_variables.png" alt="Image">

- You may wonder: we have three dummy variables but only use two of them. Where is the third?
- The third dummy variable is redundant: it always equals 1 minus the sum of the others, so its effect is absorbed into the constant b0. Including all of them alongside b0 is the **dummy variable trap**.
- Dropping one dummy variable avoids the trap. We need not worry about it here, because the `LinearRegression` class in **Scikit-Learn** still fits correctly and gives the same predictions when all dummy columns are included.
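The redundancy behind the dummy variable trap can be seen numerically. The sketch below uses made-up dummy columns and a made-up target, and fits with NumPy's least-squares solver rather than Scikit-Learn, so it illustrates the idea only.

```python
import numpy as np

# Hypothetical one-hot columns D1, D2, D3 for six rows.
D = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
], dtype=float)
ones = np.ones((len(D), 1))            # column that multiplies the constant b0
y = np.array([10.0, 20.0, 30.0, 12.0, 28.0, 22.0])  # made-up target

full = np.hstack([ones, D])            # constant + all three dummies
reduced = np.hstack([ones, D[:, :2]])  # constant + two dummies (third dropped)

# D1 + D2 + D3 equals the constant column, so one column of `full` is redundant.
print(np.linalg.matrix_rank(full))     # 3, although there are 4 columns
print(np.linalg.matrix_rank(reduced))  # 3, equal to the number of columns

# Both designs give the same fitted values.
coef_full, *_ = np.linalg.lstsq(full, y, rcond=None)
coef_reduced, *_ = np.linalg.lstsq(reduced, y, rcond=None)
print(np.allclose(full @ coef_full, reduced @ coef_reduced))  # True
```

In the reduced design the constant is the mean of the dropped category, and each remaining dummy coefficient is that category's difference from it.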


## Lots of theory! Let's get into code 😀

## Data preprocessing
- Import the necessary libraries.
- Load dataset (50_Startups).
- Our dataset doesn't have any missing values, so we can skip that step.
- But we have categorical string data, which needs to be converted.
- Prepare the training and testing datasets.
- Linear regression does not need feature scaling: each feature has its own coefficient, which compensates for that feature's scale.
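The point about feature scaling can be checked with a small experiment. This is a sketch on synthetic data, with a hypothetical helper `fit_predict` that runs ordinary least squares through NumPy. It fits the same model on raw and on standardised features and compares the predictions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption): two features on very different scales.
X = np.column_stack([rng.uniform(0, 1, 50), rng.uniform(0, 100000, 50)])
y = 3.0 + 2.0 * X[:, 0] + 0.0005 * X[:, 1] + rng.normal(0, 0.1, 50)

def fit_predict(X, y):
    """Ordinary least squares with an intercept; returns coefficients and fitted values."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, A @ coef

coef_raw, pred_raw = fit_predict(X, y)

# Standardise each feature to zero mean and unit variance, then refit.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
coef_scaled, pred_scaled = fit_predict(X_scaled, y)

print(coef_raw)     # the coefficients differ between the two fits...
print(coef_scaled)
print(np.allclose(pred_raw, pred_scaled))  # ...but the predictions agree
```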

In [1]:
# Import the necessary libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 

In [2]:
# Load dataset (50_Startups).
dataset = pd.read_csv(r"../dataset/50_Startups.csv")
X = dataset.iloc[:, :-1].values # [row, column]
Y = dataset.iloc[:, -1].values
print("Dataset", dataset, "X", X, "Y", Y, sep="\n")

Dataset
    R&D Spend  Administration  Marketing Spend       State     Profit
0   165349.20       136897.80        471784.10    New York  192261.83
1   162597.70       151377.59        443898.53  California  191792.06
2   153441.51       101145.55        407934.54     Florida  191050.39
3   144372.41       118671.85        383199.62    New York  182901.99
4   142107.34        91391.77        366168.42     Florida  166187.94
5   131876.90        99814.71        362861.36    New York  156991.12
6   134615.46       147198.87        127716.82  California  156122.51
7   130298.13       145530.06        323876.68     Florida  155752.60
8   120542.52       148718.95        311613.29    New York  152211.77
9   123334.88       108679.17        304981.62  California  149759.96
10  101913.08       110594.11        229160.95     Florida  146121.95
11  100671.96        91790.61        249744.55  California  144259.40
12   93863.75       127320.38        249839.44     Florida  141585.52
13   91992.3

In [3]:
# We have categorical string data, which needs to be converted.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])],remainder='passthrough')
X = ct.fit_transform(X)
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

In [4]:
# Prepare testing and training dataset.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

## Train Multiple Linear Regression Model

In [5]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression()

## Test Multiple Linear Regression Model

In [6]:
y_pred = regressor.predict(x_test)
print(y_pred)

[103015.20159796 132582.27760816 132447.73845174  71976.09851258
 178537.48221055 116161.24230166  67851.69209676  98791.73374686
 113969.43533013 167921.06569551]


## Compare predictions with actual test values

In [7]:
np.set_printoptions(precision=2)
y_pred_vertical = y_pred.reshape(len(y_pred), 1)
y_test_vertical = y_test.reshape(len(y_test), 1)
print(np.concatenate((y_pred_vertical, y_test_vertical), 1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
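The side-by-side comparison above can be summarised in a single number. The sketch below copies the ten predicted and actual profits from that output, then computes the mean absolute error and the R² score by hand with NumPy. These metrics are an addition for illustration and were not part of the original notebook.

```python
import numpy as np

# Predicted and actual profits copied from the output above.
y_pred = np.array([103015.20, 132582.28, 132447.74, 71976.10, 178537.48,
                   116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

# Mean absolute error: average size of the prediction error.
mae = np.mean(np.abs(y_test - y_pred))

# R^2: share of the variance in y_test explained by the predictions.
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE = {mae:.2f}")
print(f"R^2 = {r2:.3f}")
```

On these ten rows the R² comes out at roughly 0.93, so the model explains most of the variation in profit on the test set.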


## Test with new data
- Given a startup with the following details:
    - R&D Spend = 160000
    - Administration Spend = 130000
    - Marketing Spend = 300000
    - State = 'California' -> (1, 0, 0)

In [8]:
new_prediction = regressor.predict([[1, 0, 0, 160000, 130000, 300000]])
print(new_prediction)

[181566.92]


Therefore, our model predicts that a Californian startup that spent 160000 on R&D, 130000 on Administration and 300000 on Marketing will make a profit of \$181,566.92.

**Important note 1:** Notice that the values of the features were all input in a double pair of square brackets. That's because the "predict" method always expects a 2D array as the format of its inputs. And putting our values into a double pair of square brackets makes the input exactly a 2D array. Simply put:

$1, 0, 0, 160000, 130000, 300000 \rightarrow \textrm{scalars}$

$[1, 0, 0, 160000, 130000, 300000] \rightarrow \textrm{1D array}$

$[[1, 0, 0, 160000, 130000, 300000]] \rightarrow \textrm{2D array}$

**Important note 2:** Notice also that the "California" state was not input as a string in the last column but as "1, 0, 0" in the first three columns. That's because the predict method expects the one-hot-encoded values of the state, and as we see in the second row of the matrix of features X, "California" was encoded as "1, 0, 0". Be careful to put these values in the first three columns, not the last three, because `ColumnTransformer` places the one-hot-encoded columns first and the passthrough columns after them.
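Instead of typing the one-hot values by hand, a fitted `ColumnTransformer` can encode a raw row. The sketch below is self-contained and therefore fits the transformer on three made-up rows, not on the startup dataset. In the notebook itself, the `ct` object fitted in cell `In [3]` would play this role.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three made-up rows in the dataset's column order:
# R&D Spend, Administration, Marketing Spend, State
X_raw = np.array([
    [1000.0, 2000.0, 3000.0, "New York"],
    [1500.0, 2500.0, 3500.0, "California"],
    [1200.0, 2200.0, 3200.0, "Florida"],
], dtype=object)

ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [3])],
                       remainder="passthrough")
ct.fit(X_raw)

# New observation as a 2D array (one row), with the state still as a string.
new_row = np.array([[160000.0, 130000.0, 300000.0, "California"]], dtype=object)
encoded = ct.transform(new_row)
if hasattr(encoded, "toarray"):      # densify if a sparse matrix is returned
    encoded = encoded.toarray()
encoded = np.asarray(encoded, dtype=float)

print(new_row.ndim)      # 2
print(encoded.tolist())  # [[1.0, 0.0, 0.0, 160000.0, 130000.0, 300000.0]]
```

The encoded row has the dummy columns first, matching the layout passed to `predict` above.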

## We can even get the coefficients of the equation
- These coefficients are computed automatically when the LinearRegression model is fitted.


In [9]:
print("b0 =", regressor.intercept_)
print("b1-n =", regressor.coef_)

print("Then our final equation can be written as:")
# Each feature uses its own coefficient: coef_[3], coef_[4] and coef_[5].
print(f"profit (y) = {regressor.intercept_:.4f} + ({regressor.coef_[0]:.4f} * D1) + ({regressor.coef_[1]:.4f} * D2) + ({regressor.coef_[2]:.4f} * D3) + ({regressor.coef_[3]:.4f} * R&D Spend as x1) + ({regressor.coef_[4]:.4f} * Administration as x2) + ({regressor.coef_[5]:.4f} * Marketing Spend as x3)")

b0 = 42467.52924854249
b1-n = [ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
Then our final equation can be written as:
profit (y) = 42467.5292 + (86.6384 * D1) + (-872.6458 * D2) + (786.0074 * D3) + (0.7735 * R&D Spend as x1) + (0.0329 * Administration as x2) + (0.0366 * Marketing Spend as x3)
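As a sanity check, the equation can be evaluated by hand for the Californian startup used earlier. The sketch below takes the intercept and coefficients from the output above, rounded to four decimal places, so the result matches the earlier prediction of 181566.92 only approximately.

```python
import numpy as np

# Intercept and coefficients from the output above, rounded to four decimals.
b0 = 42467.5292
b = np.array([86.6384, -872.6458, 786.0074, 0.7735, 0.0329, 0.0366])

# California startup: D1, D2, D3, R&D Spend, Administration, Marketing Spend
x = np.array([1, 0, 0, 160000, 130000, 300000], dtype=float)

# profit = b0 + b1*D1 + b2*D2 + b3*D3 + b4*x1 + b5*x2 + b6*x3
profit = b0 + b @ x
print(round(profit, 2))
```

The result lands within a few dollars of the value returned by `regressor.predict`; the gap comes only from rounding the coefficients.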
