# Regression Extras
- In this notebook, we look at some extra topics that can help you build a more efficient model.
    1. What is statistical significance & p-value?
    2. Building best model
    3. Adjusted R squared factor

## What is statistical significance & p-value?

- Let's assume we live in a world where a coin has a head and a tail as its faces, so we have a 50-50 chance of getting each face. We call this assumption the null hypothesis, H0.  
- We toss the same coin 6 times; let's see the results in the table below.

| Coin | Result | Probability of getting this run of tails under H0 (fair coin) |
|------|--------|-------------------------------------------|
| <img src="../images/coin.png" alt="coin.png" width="30"> | 1st time you got tails 😀 | 0.5 (50%) |
| <img src="../images/coin.png" alt="coin.png" width="30"> | 2nd time you got tails 🙂 | 0.25 (25%) |
| <img src="../images/coin.png" alt="coin.png" width="30"> | 3rd time you got tails, again 😮 | 0.125 (12.5%) |
| <img src="../images/coin.png" alt="coin.png" width="30"> | 4th time.... tails 😑 | 0.0625 (6.25%) |
| <img src="../images/coin.png" alt="coin.png" width="30"> | 5th time, tails 🤔 | 0.03125 (about 3.1%) |
| <img src="../images/coin.png" alt="coin.png" width="30"> | 6th time, tails again 🧐 | 0.015625 (about 1.6%) |

- After seeing the table above, we may think: **"Is this coin fake, or does it have tails on both sides? 🧐"**
- At which toss do you start to feel suspicious about the result? I feel suspicious at the 4th toss.
- Notice that the probability of getting tails every time halves with each extra toss.
- We humans get a suspicious feeling, but what about machines or algorithms?
- To answer this question we use a value called the **p-value**: the probability of getting a result at least as extreme as the one observed, assuming H0 is true. We compare it with a threshold fixed in advance, the **significance level (SL)**. For our case, we fix **SL = 0.05 (5%)**.
- If the p-value is less than 0.05, the result is too unlikely under H0, so we reject H0 (here: the coin is probably not fair). In regression, H0 for a feature is "its coefficient is zero", so a p-value below SL suggests the feature is useful, while a p-value above SL gives no evidence that it is.
- Choosing SL = 0.05 means we accept a 5% risk of rejecting H0 when it is actually true, which corresponds to a 95% confidence level.
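
The coin example can be reproduced in a few lines. This is a minimal sketch, assuming a one-sided test where the "result" is an unbroken run of tails; the names `ALPHA` and `prob_all_tails` are made up for the illustration.

```python
# Probability of seeing k tails in a row if the coin is fair (H0).
ALPHA = 0.05  # significance level, fixed before looking at the data


def prob_all_tails(k: int, p_tail: float = 0.5) -> float:
    """Probability of k consecutive tails for a coin with P(tail) = p_tail."""
    return p_tail ** k


for k in range(1, 7):
    p = prob_all_tails(k)
    verdict = "reject H0" if p < ALPHA else "keep H0"
    print(f"{k} tails in a row: p = {p:.6f} -> {verdict}")
```

With SL = 0.05, the run of tails first becomes "statistically significant" at the 5th toss (0.03125 < 0.05); the 4th toss (0.0625) is still above the threshold.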

<img src="https://blog.analytics-toolkit.com/wp-content/uploads/2017/09/2017-09-11-Statistical-Significance-P-Value-1.png" alt="p-value image" width="500">

## Building best model

- If you have one feature x(1) to predict the dependent variable y, you can use Simple Linear Regression.
- If you have many features x(n) to predict the dependent variable y, you can use Multiple Linear Regression or one of the other regression models we have learned.
- But how can we find the unwanted features that are useless for predicting y?

```
Example:
Let's assume we are going to predict "Profit" (y),
which depends on
1. "R&D Spend" of the company.
2. "Administration Spend"  of the company.
3. "Marketing Spend" of the company.
4. "State" where the company is located.
```

- Can you guess which set of feature variables is most useful for predicting "Profit"?
- Let's find it out 😎.

### 5 methods of model building
1. All-in
2. Backward elimination (Stepwise Regression)
3. Forward selection (Stepwise Regression)
4. Bidirectional elimination (Stepwise Regression)
5. All Possible Models (Score Comparison)

#### 1. All-in
- Use this if you have prior knowledge about the dataset and you are sure that y depends on all the feature variables.
- Use it as well if someone gives you a dataset that is already known to be complete and relevant; in that case you have to use all feature variables.
- We also do **All-in** as the first step of *Backward elimination*.

#### 2. Backward elimination
- **STEP 1**: Select a *significance level* to **stay** in the model. ```example: SL_STAY = 0.05 (5%)```
- **STEP 2**: Fit the *All-in* model with all possible feature variables.
- **STEP 3**: Find the feature with the highest p-value. If ```p > SL_STAY```, go to **STEP 4**; otherwise go to **END**.
- **STEP 4**: Remove that feature.
- **STEP 5**: Refit the model with the remaining features and go back to **STEP 3**.
- **END**: 🥳 Your model is ready 🥳

#### 3. Forward selection
- **STEP 1**: Select a *significance level* to **enter** the model. ```example: SL_ENTER = 0.05 (5%)```
- **STEP 2**: Fit a simple linear regression of y on every single feature x(n) and pick the feature with the lowest p-value.
- **STEP 3**: Keep the selected feature(s) in the model and try adding each of the remaining features, one at a time.
- **STEP 4**: Take the candidate with the lowest p-value. If ```p < SL_ENTER```, add it and go back to **STEP 3**; otherwise go to **END**.
- **END**: 🥳 Keep the previous model (without the last candidate); that's the model you are looking for 🥳

#### 4. Bidirectional elimination
- **STEP 1**: Select *significance levels* to **enter and stay** in the model. ```example:  SL_ENTER = 0.05 (5%) & SL_STAY = 0.05 (5%)```
- **STEP 2**: Perform **Forward selection** to add feature variables to the set (with SL_ENTER = 0.05).
- **STEP 3**: Perform all steps of **Backward elimination** on the selected set (with SL_STAY = 0.05), then go back to **STEP 2**.
- **STEP 4**: Repeat **STEP 2 & 3** until no variable can enter or leave the model, then go to **END**.
- **END**: 🥳 Your model is ready 🥳

#### 5. All Possible Model
- **STEP 1**: Select one goodness-of-fit criterion. ```example: R^2```
- **STEP 2**: Construct all possible models from the N features. ```i.e., N features give (2^N)-1 combinations in total```
- **STEP 3**: Find the best model among them by applying the criterion.
- **END**: 🥳 Your model is ready 🥳

> Note: If you have 10 features, you need to fit 1023 models to pick the best one 😫.
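
To see where the `(2^N)-1` count comes from, the subsets can simply be enumerated. A small sketch using only the standard library; the helper name `all_feature_subsets` is made up for the illustration, and the feature names are the ones from the "Profit" example.

```python
from itertools import combinations


def all_feature_subsets(features):
    """Yield every non-empty subset of the given features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


features = ["R&D Spend", "Administration Spend", "Marketing Spend", "State"]
subsets = list(all_feature_subsets(features))
print(len(subsets))  # 2**4 - 1 = 15 candidate models

# With 10 features the count grows to 2**10 - 1 = 1023.
print(sum(1 for _ in all_feature_subsets(range(10))))
```

Each subset would then be fitted and scored with the chosen criterion, which is why this method becomes expensive quickly.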

## Adjusted R squared

- We all learned about [R squared](https://github.com/sanjaysanju618/Machine-learning/blob/main/notebook/8%20.%20Regression%20Model%20Selection.ipynb), which is a great measure that helps us evaluate the model's performance.

<img src="../images/r_squared_eqn.png" alt="r_squared_eqn.png" width="500">

- But there is one problem with it! Guess what? Answer this question: "What happens to the R squared value if you add a new feature to the model?"
- The answer is that R squared never decreases, and it almost always increases! Why? Even if the new variable is not important for prediction, least squares can still use it to reduce the residual error a little. Let's say it has about 0.0001% of influence.
- Then how do we judge whether the extra feature really improved the model 🤔?
- Here comes our hero, **Adjusted R squared** 😎.

<img src="../images/adj_r_squared_eqn.png" alt="adj_r_squared_eqn.png" width="500">

### How does Adjusted R squared adjust for the model?
- Well, we have p and n for that.
- p is the number of regressors (features), and n is the sample size (number of rows in the dataset).
- Adding a new variable increases ```p```, so ```(n-1)/(n-1-p)``` increases.
- Adding a new variable increases ```R^2``` (or leaves it unchanged), so ```1-R^2``` decreases.
- There is a battle between these two factors. Adjusted R squared is 1 minus their product, so it rises only when the new variable improves ```R^2``` enough to outweigh the penalty for the larger ```p```.
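
The formula can be written as a one-line function. A minimal sketch, assuming the usual definition `1 - (1 - R^2) * (n - 1) / (n - 1 - p)`; the function name `adjusted_r2` is hypothetical. The inputs are the rounded R squared values that appear in the first two OLS summaries later in this notebook (n = 50 rows).

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a model with p regressors fitted on n samples."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - 1 - p)


# Three features, R^2 = 0.951
print(round(adjusted_r2(0.951, n=50, p=3), 3))  # 0.948
# Two features, R^2 = 0.950
print(round(adjusted_r2(0.950, n=50, p=2), 3))  # 0.948
```

Dropping one feature lowered R squared slightly, but the smaller penalty for `p` compensated, so Adjusted R squared stayed the same to three decimals. Because the inputs are already rounded, results computed this way can differ from the library's output in the last digit.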

# Lots of theory, let's code 🥳

## Data preprocessing
✔️ Import the necessary libraries.

✔️ Load dataset (50_Startups_Updated.csv).

❌ Our dataset doesn't have any missing data.

❌ We don't have categorical string data (all loaded feature columns are numeric).

✔️ We have 50 rows of data. So, we can split this dataset into testing and training datasets to evaluate the result.

⚠️ Please apply feature scaling only if required by the regression model.

In [1]:
# Import libraries....
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
np.set_printoptions(precision=2)

In [2]:
# Load dataset....
dataset = pd.read_csv(r"../dataset/50_Startups_Updated.csv")
X = dataset.iloc[:, :-1].values # [row, column]
Y = dataset.iloc[:, -1].values
print(X)

[[165349.2  136897.8  471784.1 ]
 [162597.7  151377.59 443898.53]
 [153441.51 101145.55 407934.54]
 [144372.41 118671.85 383199.62]
 [142107.34  91391.77 366168.42]
 [131876.9   99814.71 362861.36]
 [134615.46 147198.87 127716.82]
 [130298.13 145530.06 323876.68]
 [120542.52 148718.95 311613.29]
 [123334.88 108679.17 304981.62]
 [101913.08 110594.11 229160.95]
 [100671.96  91790.61 249744.55]
 [ 93863.75 127320.38 249839.44]
 [ 91992.39 135495.07 252664.93]
 [119943.24 156547.42 256512.92]
 [114523.61 122616.84 261776.23]
 [ 78013.11 121597.55 264346.06]
 [ 94657.16 145077.58 282574.31]
 [ 91749.16 114175.79 294919.57]
 [ 86419.7  153514.11      0.  ]
 [ 76253.86 113867.3  298664.47]
 [ 78389.47 153773.43 299737.29]
 [ 73994.56 122782.75 303319.26]
 [ 67532.53 105751.03 304768.73]
 [ 77044.01  99281.34 140574.81]
 [ 64664.71 139553.16 137962.62]
 [ 75328.87 144135.98 134050.07]
 [ 72107.6  127864.55 353183.81]
 [ 66051.52 182645.56 118148.2 ]
 [ 65605.48 153032.06 107138.38]
 [ 61994.4

In [3]:
# We use the statsmodels module to get the statistics of the model....
import statsmodels.api as sm

- The statsmodels module takes the data exactly as given and does not add anything to it.
- Let's say we have
- y = b0 * x0 + b1 * x1 + b2 * x2 + b3 * x3
- where x0 is always 1, so b0 is the intercept.
- Our usual regression model (scikit-learn's LinearRegression) handles this automatically and fits the intercept as if x0 = 1.
- But statsmodels' OLS does not assume x0 = 1, so we need to add an extra column of ones at the beginning of the dataset.

In [4]:
# Adding x0=1 at beginning of the dataset
X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)
print(X)

[[1.00e+00 1.65e+05 1.37e+05 4.72e+05]
 [1.00e+00 1.63e+05 1.51e+05 4.44e+05]
 [1.00e+00 1.53e+05 1.01e+05 4.08e+05]
 [1.00e+00 1.44e+05 1.19e+05 3.83e+05]
 [1.00e+00 1.42e+05 9.14e+04 3.66e+05]
 [1.00e+00 1.32e+05 9.98e+04 3.63e+05]
 [1.00e+00 1.35e+05 1.47e+05 1.28e+05]
 [1.00e+00 1.30e+05 1.46e+05 3.24e+05]
 [1.00e+00 1.21e+05 1.49e+05 3.12e+05]
 [1.00e+00 1.23e+05 1.09e+05 3.05e+05]
 [1.00e+00 1.02e+05 1.11e+05 2.29e+05]
 [1.00e+00 1.01e+05 9.18e+04 2.50e+05]
 [1.00e+00 9.39e+04 1.27e+05 2.50e+05]
 [1.00e+00 9.20e+04 1.35e+05 2.53e+05]
 [1.00e+00 1.20e+05 1.57e+05 2.57e+05]
 [1.00e+00 1.15e+05 1.23e+05 2.62e+05]
 [1.00e+00 7.80e+04 1.22e+05 2.64e+05]
 [1.00e+00 9.47e+04 1.45e+05 2.83e+05]
 [1.00e+00 9.17e+04 1.14e+05 2.95e+05]
 [1.00e+00 8.64e+04 1.54e+05 0.00e+00]
 [1.00e+00 7.63e+04 1.14e+05 2.99e+05]
 [1.00e+00 7.84e+04 1.54e+05 3.00e+05]
 [1.00e+00 7.40e+04 1.23e+05 3.03e+05]
 [1.00e+00 6.75e+04 1.06e+05 3.05e+05]
 [1.00e+00 7.70e+04 9.93e+04 1.41e+05]
 [1.00e+00 6.47e+04 1.40e

## Let's find the best-fit model with the help of Backward elimination.
- Why Backward elimination? It is a simple and fast method to start with. The other methods need more bookkeeping or, in the case of All Possible Models, far more computation.

In [5]:
# Let's give names to the column indices for readability....
X0_Coefficient=0
R_and_D_Spend=1
Administration_Spend=2
Marketing_Spend=3

In [6]:
# Preparing X_opt with [X0_Coefficient, R_and_D_Spend, Administration_Spend, Marketing_Spend] columns...
X_opt = X[:, [X0_Coefficient, R_and_D_Spend, Administration_Spend, Marketing_Spend]]
X_opt = X_opt.astype(np.float64)

# Fit regression model using "Ordinary Least Squares" algorithm...
regressor_OLS = sm.OLS(Y, X_opt).fit()
regressor_OLS.summary()

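| Statistic | Value | Statistic | Value |
|------|--------|------|--------|
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 296.0 |
| Date: | Sat, 09 Oct 2021 | Prob (F-statistic): | 4.53e-30 |
| Time: | 09:35:28 | Log-Likelihood: | -525.39 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 46 | BIC: | 1066.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|------|------|------|------|------|------|------|
| const | 5.012e+04 | 6572.353 | 7.626 | 0.000 | 3.69e+04 | 6.34e+04 |
| x1 | 0.8057 | 0.045 | 17.846 | 0.000 | 0.715 | 0.897 |
| x2 | -0.0268 | 0.051 | -0.526 | 0.602 | -0.130 | 0.076 |
| x3 | 0.0272 | 0.016 | 1.655 | 0.105 | -0.006 | 0.060 |

| Statistic | Value | Statistic | Value |
|------|--------|------|--------|
| Omnibus: | 14.838 | Durbin-Watson: | 1.282 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.442 |
| Skew: | -0.949 | Prob(JB): | 2.21e-05 |
| Kurtosis: | 5.586 | Cond. No. | 1400000.0 |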


## Result review 1

<img src="../images/p_value_and_adj_r_squared_highlights.png" alt="p_value_and_adj_r_squared_highlights.png" >

- The p-value for x2 (Administration_Spend) is 0.602 > (SL) 0.05, far above the threshold. ❌
- Adj. R-squared: 0.948, close to 1, so the fit is good. ✔️
> So, we remove **Administration_Spend** from the model.

In [7]:
# Preparing X_opt with [X0_Coefficient, R_and_D_Spend, Marketing_Spend] columns...
X_opt = X[:, [X0_Coefficient, R_and_D_Spend, Marketing_Spend]]
X_opt = X_opt.astype(np.float64)

# Fit regression model using "Ordinary Least Squares" algorithm...
regressor_OLS = sm.OLS(Y, X_opt).fit()
regressor_OLS.summary()

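| Statistic | Value | Statistic | Value |
|------|--------|------|--------|
| Dep. Variable: | y | R-squared: | 0.950 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 450.8 |
| Date: | Sat, 09 Oct 2021 | Prob (F-statistic): | 2.16e-31 |
| Time: | 09:35:28 | Log-Likelihood: | -525.54 |
| No. Observations: | 50 | AIC: | 1057.0 |
| Df Residuals: | 47 | BIC: | 1063.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|------|------|------|------|------|------|------|
| const | 4.698e+04 | 2689.933 | 17.464 | 0.000 | 4.16e+04 | 5.24e+04 |
| x1 | 0.7966 | 0.041 | 19.266 | 0.000 | 0.713 | 0.880 |
| x2 | 0.0299 | 0.016 | 1.927 | 0.060 | -0.001 | 0.061 |

| Statistic | Value | Statistic | Value |
|------|--------|------|--------|
| Omnibus: | 14.677 | Durbin-Watson: | 1.257 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.161 |
| Skew: | -0.939 | Prob(JB): | 2.54e-05 |
| Kurtosis: | 5.575 | Cond. No. | 532000.0 |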


## Result review 2

- The p-value for x2 (now Marketing_Spend, because the columns are renumbered after the removal) is 0.060 > (SL) 0.05, only just above SL. ⚠️
- Adj. R-squared: 0.948, the same as before to three decimals. ✔️
> So, we try removing **Marketing_Spend** from the model.

In [8]:
# Preparing X_opt with [X0_Coefficient, R_and_D_Spend] columns...
X_opt = X[:, [X0_Coefficient, R_and_D_Spend]]
X_opt = X_opt.astype(np.float64)

# Fit regression model using "Ordinary Least Squares" algorithm...
regressor_OLS = sm.OLS(Y, X_opt).fit()
regressor_OLS.summary()

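| Statistic | Value | Statistic | Value |
|------|--------|------|--------|
| Dep. Variable: | y | R-squared: | 0.947 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 849.8 |
| Date: | Sat, 09 Oct 2021 | Prob (F-statistic): | 3.50e-32 |
| Time: | 09:35:28 | Log-Likelihood: | -527.44 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 48 | BIC: | 1063.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|------|------|------|------|------|------|------|
| const | 4.903e+04 | 2537.897 | 19.320 | 0.000 | 4.39e+04 | 5.41e+04 |
| x1 | 0.8543 | 0.029 | 29.151 | 0.000 | 0.795 | 0.913 |

| Statistic | Value | Statistic | Value |
|------|--------|------|--------|
| Omnibus: | 13.727 | Durbin-Watson: | 1.116 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 18.536 |
| Skew: | -0.911 | Prob(JB): | 9.44e-05 |
| Kurtosis: | 5.361 | Cond. No. | 165000.0 |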


## Result review 3

- The p-value for x1 (R_and_D_Spend) is 0.000 < (SL) 0.05, below the significance level. ✔️
- Adj. R-squared: 0.945, so the goodness of fit has dropped. ❌
> Removing **Marketing_Spend** lowers Adj. R-squared! So, we take the previous X_opt as the final feature set.

## Final selected model 
- The last model, **Result review 3** with [X0_Coefficient, R_and_D_Spend], has a lower Adj. R-squared.
- Out of the 3 results, **Result review 2** keeps the highest Adj. R-squared (0.948) with fewer features, and its largest p-value (0.060) is *very close* to SL.
- So, we conclude that Profit is best explained by the features [R_and_D_Spend, Marketing_Spend], together with the constant column X0_Coefficient for the intercept.

In [9]:
# Selected....
X_opt = X[:, [X0_Coefficient, R_and_D_Spend, Marketing_Spend]]
X_opt = X_opt.astype(np.float64)

## Let's create a multiple linear regression model to check the final result.

In [10]:
# Prepare testing and training dataset.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_opt, Y, test_size=0.2, random_state=0)

In [11]:
# Train Multiple Linear Regression Model
# (the column of ones in X_opt is redundant here: LinearRegression fits its own intercept)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression()

In [12]:
# Compare the predictions on the test set with the actual values
y_pred = regressor.predict(x_test)
y_pred_vertical = y_pred.reshape(len(y_pred), 1)
y_test_vertical = y_test.reshape(len(y_test), 1)
print("-----y_pred vs y_test-----\n", np.concatenate((y_pred_vertical, y_test_vertical), 1))

-----y_pred vs y_test-----
 [[102284.65 103282.38]
 [133873.92 144259.4 ]
 [134182.15 146121.95]
 [ 73701.11  77798.83]
 [180642.25 191050.39]
 [114717.25 105008.31]
 [ 68335.08  81229.06]
 [ 97433.46  97483.56]
 [114580.92 110352.25]
 [170343.32 166187.94]]
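
To put a single number on this comparison, R squared can be computed on the test split. A minimal sketch with NumPy, using the ten prediction/target pairs printed above; the variable names are made up for the example, and with only ten test rows the score is a rough indication at best.

```python
import numpy as np

# (y_pred, y_test) pairs copied from the output above
pairs = np.array([
    [102284.65, 103282.38],
    [133873.92, 144259.40],
    [134182.15, 146121.95],
    [73701.11, 77798.83],
    [180642.25, 191050.39],
    [114717.25, 105008.31],
    [68335.08, 81229.06],
    [97433.46, 97483.56],
    [114580.92, 110352.25],
    [170343.32, 166187.94],
])
y_pred_demo, y_test_demo = pairs[:, 0], pairs[:, 1]

ss_res = np.sum((y_test_demo - y_pred_demo) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test_demo - y_test_demo.mean()) ** 2)  # total sum of squares
r2_test = 1.0 - ss_res / ss_tot
print(f"R^2 on the test split: {r2_test:.3f}")
```

In the notebook itself, `sklearn.metrics.r2_score(y_test, y_pred)` should give the same value directly.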
