# 1.3 Multiple Linear Regression & Backward Elimination

For a better understanding of the current notebook, beginners should first go through these links:

 [1.1 Data Preprocessing](http://www.kaggle.com/saikrishna20/data-preprocessing-tools)


[1.2 Simple linear Regression](https://www.kaggle.com/saikrishna20/1-2-simple-linear-regression) 

They cover data preprocessing and simple linear regression, which will help you understand this notebook better.

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
dataset = pd.read_csv('../input/50-startups/50_Startups.csv')
X = dataset.iloc[:, :-1].values # features
y = dataset.iloc[:, -1].values # target

In [None]:
X[:5]

## Encoding categorical data

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first'), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
# scikit-learn's LinearRegression also works if all dummy columns are kept,
# but other models need n-1 dummy columns when a column has n unique values
# (for example the statsmodels OLS fit with a constant column used later).
# To avoid confusion, drop='first' removes the first dummy column here.

In [None]:
print(X)
# The dummy variables are always created in the first columns.
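
To see what `drop='first'` does to the state column, here is a minimal sketch on a toy column (the `states` array is made up for illustration and is not read from the dataset). It assumes the three state names that appear in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column holding the three state names
states = np.array([['New York'], ['California'], ['Florida']])

encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(states).toarray()

# Categories are sorted alphabetically, so 'California' comes first and is dropped
print(encoder.categories_[0])

# Each row now has two dummy columns: [is Florida, is New York]
print(encoded)
```

The dropped category acts as the baseline: a row of two zeros means "California".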

## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training the Multiple Linear Regression model on the Training set

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

## Predicting the Test set results

In [None]:
y_pred = regressor.predict(X_test)

In [None]:
y_pred

y_pred is a one-dimensional NumPy array.

In [None]:
np.set_printoptions(precision=2) # only two decimals after point
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In the output above we have all the values predicted by the model on the left side and the real values on the right side.
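
The reshape and concatenate step can be easier to follow on a tiny example. This is only a sketch: `y_pred_demo` and `y_test_demo` are made-up values, not results from the model.

```python
import numpy as np

# Hypothetical values standing in for y_pred and y_test
y_pred_demo = np.array([100.0, 200.0, 300.0])
y_test_demo = np.array([110.0, 190.0, 305.0])

# reshape(-1, 1) turns each 1D array into a single column,
# and axis=1 places the two columns next to each other
side_by_side = np.concatenate(
    (y_pred_demo.reshape(-1, 1), y_test_demo.reshape(-1, 1)), axis=1
)
print(side_by_side)

# np.column_stack builds the same two-column array in one call
same_result = np.column_stack((y_pred_demo, y_test_demo))
print(np.array_equal(side_by_side, same_result))
```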

In [None]:
import statsmodels.api as sm

# Significance level - Backward elimination

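Several techniques exist for finding the features that have the greatest effect on the output.

Here we are going to look at backward elimination.

The statsmodels `OLS` class does not add an intercept term on its own, so we first need to add a column of ones at the start of the feature matrix.

In backward elimination we remove, one at a time, the features that are least statistically significant.

Two quantities are involved: the p-value of each feature and a significance level that we choose in advance (0.05 here).

The p-value is the probability of seeing a coefficient estimate at least as extreme as the one observed if the true coefficient of the feature were zero.

The significance level is a fixed threshold that each p-value is compared against.

If the p-value is high, the evidence that the feature affects the output is weak.

Hence we delete features one by one, each time removing the feature with the highest p-value above the significance level and then refitting the model.

We stop when every remaining feature has a p-value below the significance level; those are the features of most significance.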

In [None]:
# Building the optimal model using Backward Elimination
import statsmodels.api as sm
X = np.append(arr = np.ones((X.shape[0], 1)).astype(float), values = X, axis = 1)  # one row of ones per observation
print(X)
X_opt = np.array(X[:, [0, 1, 2, 3, 4, 5]], dtype=float)


In [None]:
X_opt

The variable with the highest p-value is deleted, provided that p-value is above the significance level of 0.05, because it has little statistical significance for the outcome.

In [None]:
model = sm.OLS(endog = y, exog = X_opt)
regressor_OLS = model.fit()
regressor_OLS.summary()

**R squared – It tells about the goodness of the fit. It ranges between 0 and 1. The closer the value to 1, the better it is. It is the proportion of the variance of the dependent variable that the model explains. However, it is biased in that it never decreases when variables are added, even irrelevant ones.**


**Adj R squared – This parameter has a penalising factor (the number of regressors), so unlike R squared it can decrease when a variable that adds little explanatory power is included. If its value increases after removing an unnecessary feature, go ahead with the smaller model; if it decreases, stop and revert to the previous model.**


**F statistic – It is used to compare two variances and is always greater than 0. It is formulated as $v_1^2 / v_2^2$. In regression, it is the ratio of the explained to the unexplained variance of the model.**


**AIC and BIC – AIC stands for Akaike’s information criterion and BIC stands for Bayesian information criterion. Both these parameters depend on the likelihood function L.**


**Skew – Informs about the data symmetry about the mean.**


**Kurtosis – It measures the shape of the distribution, i.e. how much of the data lies far from the mean (in the tails) rather than close to it.**


**Omnibus – D’Agostino’s test. It provides a combined statistical test for the presence of skewness and kurtosis.**


**Log-likelihood – It is the log of the likelihood function.**
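
To make the difference between R squared and adjusted R squared concrete, here is a minimal NumPy sketch on synthetic data. The helper name `r2_and_adjusted` and all the data are made up for illustration; it uses the usual formulas $R^2 = 1 - SS_{res}/SS_{tot}$ and $R^2_{adj} = 1 - (1 - R^2)(n - 1)/(n - p - 1)$, where $n$ is the number of rows and $p$ the number of regressors without the constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x_useful = rng.normal(size=n)   # feature that really drives the target
x_noise = rng.normal(size=n)    # feature unrelated to the target
y_demo = 3.0 + 2.0 * x_useful + rng.normal(scale=0.5, size=n)

def r2_and_adjusted(features, target):
    """Fit OLS with an intercept and return (R squared, adjusted R squared)."""
    n_obs, n_regressors = features.shape
    design = np.column_stack((np.ones(n_obs), features))
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coefs
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)
    return r2, adj_r2

r2_small, adj_small = r2_and_adjusted(x_useful.reshape(-1, 1), y_demo)
r2_big, adj_big = r2_and_adjusted(np.column_stack((x_useful, x_noise)), y_demo)

print("one feature :", r2_small, adj_small)
print("two features:", r2_big, adj_big)
```

R squared for the two-feature model cannot be lower than for the one-feature model, while adjusted R squared is free to move in either direction.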

In [None]:
X_opt = np.array(X[:, [0, 1, 3, 4, 5]], dtype=float)
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

In [None]:
X_opt = np.array(X[:, [0,3, 4, 5]], dtype=float)
#X_opt = X[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

In [None]:
X_opt = np.array(X[:, [0, 3, 5]], dtype=float)
#X_opt = X[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

In [None]:


X_opt = np.array(X[:, [0,3]], dtype=float)
#X_opt = X[:, [0, 3]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

### Making a single prediction (for example the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = 'California')

In [None]:
print(regressor.predict([[ 0, 0, 160000, 130000, 300000]]))

Therefore, our model predicts that the profit of a Californian startup which spent 160000 in R&D, 130000 in Administration and 300000 in Marketing is $181,566.92.

**Important note 1:** Notice that the values of the features were all input in a double pair of square brackets. That's because the "predict" method always expects a 2D array as the format of its inputs. And putting our values into a double pair of square brackets makes the input exactly a 2D array. Simply put:

$0, 0, 160000, 130000, 300000 \rightarrow \textrm{scalars}$

$[0, 0, 160000, 130000, 300000] \rightarrow \textrm{1D array}$

$[[0, 0, 160000, 130000, 300000]] \rightarrow \textrm{2D array}$

**Important note 2:** Notice also that the "California" state was not input as a string in the last column but as "0, 0" in the first two columns. That's because the predict method expects the one-hot-encoded values of the state. Since the encoder was created with `drop='first'`, the first category in alphabetical order, "California", is dropped and becomes the baseline encoded as "0, 0", as we see in the second row of the matrix of features X. Be careful to put these values in the first two columns, not the last two, because the `ColumnTransformer` places the encoded columns before the passthrough columns.
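
Rather than typing the dummy values by hand, a raw row can be passed through a fitted `ColumnTransformer`. The sketch below uses made-up rows (`X_raw` is hypothetical) in the same column order as the dataset; `sparse_threshold=0` is added here only to make sure the output is a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: R&D Spend, Administration, Marketing Spend, State
X_raw = np.array([
    [165000.0, 137000.0, 472000.0, 'New York'],
    [163000.0, 151000.0, 444000.0, 'California'],
    [153000.0, 101000.0, 408000.0, 'Florida'],
], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), [3])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
ct_demo.fit(X_raw)

# New observation given in raw form, with the state as a string
new_startup = np.array([[160000.0, 130000.0, 300000.0, 'California']], dtype=object)
encoded_row = np.array(ct_demo.transform(new_startup), dtype=float)
print(encoded_row)
```

The encoded row has the two dummy columns first, followed by the three spend columns, which is the layout `regressor.predict` expects.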

## Getting the final linear regression equation with the values of the coefficients

In [None]:
print(regressor.coef_)
print(regressor.intercept_)

Therefore, the equation of our multiple linear regression model is:

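$$\textrm{Profit} = 86.6 \times \textrm{Dummy State 1} - 873 \times \textrm{Dummy State 2} + 786 \times \textrm{Dummy State 3} + 0.773 \times \textrm{R&D Spend} + 0.0329 \times \textrm{Administration} + 0.0366 \times \textrm{Marketing Spend} + 42467.53$$

The three dummy coefficients in this equation correspond to an encoding that keeps all three state columns. Because this notebook uses `OneHotEncoder(drop='first')`, `regressor.coef_` contains only five values (two state dummies followed by the three spend columns), so the dummy coefficients and the intercept in your output will differ from the ones shown here, while the predictions stay the same.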

**Important Note:** To get these coefficients we called the "coef_" and "intercept_" attributes from our regressor object. Attributes in Python are different from methods: they are accessed without parentheses and usually hold a simple value or an array of values.
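
To show how the attributes relate to the `predict` method, here is a small sketch on synthetic data (`X_demo`, `y_demo` and `model_demo` are made up for illustration). It rebuilds the predictions by hand from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = 4.0 + X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

model_demo = LinearRegression().fit(X_demo, y_demo)

# coef_ and intercept_ are attributes: accessed without parentheses
manual_predictions = X_demo @ model_demo.coef_ + model_demo.intercept_

# predict is a method: it is called with the feature matrix as argument
print(np.allclose(manual_predictions, model_demo.predict(X_demo)))
```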

# If you like this notebook, please upvote it.


# If it needs improvement, please comment below.


# * Enjoy Machine Learning.