# Multiple Linear Regression

**Regression** models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

**Multiple linear regression** is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

1. How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
2. The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

**Assumptions of multiple linear regression**

Multiple linear regression makes all of the same assumptions as simple linear regression:

1. Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variables.

    Homogeneity of variance, also known as homoscedasticity, is the assumption that the variance of the errors or residuals in a statistical model is constant across all levels of the independent variable(s). In simpler terms, it means that the spread of the residuals around the fitted line is roughly the same across the whole range of predicted values.

    This is an important assumption of parametric statistical tests because they are sensitive to unequal variances. Unequal variances (heteroscedasticity) do not bias the coefficient estimates themselves, but they make the usual standard errors unreliable, which distorts test results.

    If the variances are not equal, then the standard errors, confidence intervals, and p-values will be affected, and the conclusions drawn from the analysis may not be valid.

    There are various methods to test for homogeneity of variance, including graphical methods such as a scatterplot of residuals against fitted values or boxplots of residuals by group.

2. Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among variables.

    In multiple linear regression, it is possible that some of the independent variables are correlated with one another, so it is important to check for this before developing the regression model. If two independent variables are too highly correlated (r² > ~0.6), then only one of them should be used in the regression model.

3. Normality: the residuals (the errors of the model) follow a normal distribution.

4. Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.
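The homoscedasticity assumption above can be checked roughly by comparing how widely the residuals scatter at low and high fitted values. The following is a minimal sketch on synthetic data, not a formal test; the helper name `residual_spread_ratio` and the simulated data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(1.0, 10.0, size=n)

def residual_spread_ratio(x, y):
    """Fit y = b0 + b1*x by least squares and return the ratio of residual
    standard deviations (upper half of fitted values / lower half)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    order = np.argsort(fitted)
    half = len(order) // 2
    low, high = resid[order[:half]], resid[order[half:]]
    return high.std() / low.std()

# Hypothetical data: constant noise versus noise that grows with x
y_const = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=n)
y_grow = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x, size=n)

ratio_const = residual_spread_ratio(x, y_const)
ratio_grow = residual_spread_ratio(x, y_grow)
print(ratio_const, ratio_grow)
```

A ratio near 1 is consistent with constant variance, while a ratio well above 1 suggests that the error grows with the fitted value.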

The **general equation** for a multiple linear regression model is:

y = β0 + β1x1 + β2x2 + ... + βpxp + ε

Where:

- y is the dependent variable
- β0 is the intercept or constant term
- β1, β2, ..., βp are the coefficients or regression weights for the independent variables x1, x2, ..., xp, respectively
- ε is the error term, which represents the random variation or unexplained factors affecting the dependent variable

The goal of multiple linear regression is to estimate the values of the regression coefficients that best fit the data and allow us to make predictions about the dependent variable for new values of the independent variables.

**Multicollinearity** refers to a situation in multiple linear regression where two or more independent variables in the model are highly correlated with each other. This can lead to problems in the estimation of the regression coefficients, making them unstable or difficult to interpret.

Multicollinearity can arise for several reasons, including:

1. Including redundant or highly similar variables in the model
2. Measuring the same variable using different units or scales
3. Overfitting the model to the data, i.e., including too many independent variables relative to the sample size
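One common way to quantify multicollinearity is the variance inflation factor (VIF), which is not named in the text above and is introduced here only as an illustration. The sketch below regresses each predictor on the others and reports 1 / (1 - R²); the function name `vif` and the synthetic columns are assumptions.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)   # hypothetical near-copy of a
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

In this setup the near-duplicate columns `a` and `c` get large values, while the unrelated column `b` stays close to 1.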


The **dummy variable trap** is a common problem that can occur in multiple linear regression models when one or more categorical variables are included as independent variables. It occurs when we include a categorical variable in the regression model using a set of dummy variables, but we do not exclude one of the dummy variables from the regression equation.

For example, suppose we have a categorical variable "Region" with three possible values: East, West, and South. We could create three dummy variables, "East," "West," and "South," one for each region. However, if we include all three dummy variables in a regression model that also has an intercept, the model will suffer from the dummy variable trap: the three dummies always sum to 1, which is exactly the intercept column, so they are perfectly collinear with it and the individual coefficients cannot be estimated uniquely.


To avoid the dummy variable trap, we need to exclude one of the dummy variables from the regression equation (in this example, East). The omitted category becomes the baseline, and each remaining dummy coefficient measures the difference from that baseline. This approach is known as the "reference category" or "baseline category" method.

In summary, the dummy variable trap occurs when we include all dummy variables for a categorical variable in the regression equation together with an intercept, and it can be avoided by excluding one of the dummy variables. By avoiding the trap, we can obtain stable, interpretable estimates of the regression coefficients and make more reliable predictions using the model.

The "dummy variable trap" is a term used in statistics and regression analysis. It is a special case of multicollinearity among predictor variables (also known as independent variables or features): the dummy variables created from one categorical variable are perfectly collinear, meaning that any one of them can be predicted perfectly from the others.

In regression analysis, dummy variables are used to represent categorical data as numerical values. For example, a dummy variable could be used to represent a categorical variable such as gender, with 1 indicating male and 0 indicating female. However, if a regression model includes a separate dummy for every category (one for male and one for female) together with an intercept, the dummies always sum to 1 and are perfectly collinear with the intercept. The coefficients then cannot be estimated uniquely, and the software may fail or report unstable estimates with inflated standard errors.

To avoid the dummy variable trap, one dummy variable must be dropped for each categorical variable in the regression model. This is typically done by selecting a reference category and dropping the dummy variable for that category. For example, if gender and marital status are both included in a regression model, the dummy variable for one gender and for one marital status (usually the most common one) is dropped from the model.
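The trap can be seen directly in the rank of the design matrix. The sketch below uses a small hypothetical "Region" column; with an intercept and all three dummies the matrix has four columns but only rank 3, and dropping one dummy restores full column rank.

```python
import numpy as np

# Hypothetical sample of the Region variable
regions = np.array(["East", "West", "South", "West", "East", "South"])
categories = np.unique(regions)                       # sorted: East, South, West
dummies = (regions[:, None] == categories).astype(float)
intercept = np.ones((len(regions), 1))

full = np.hstack([intercept, dummies])                # intercept + all 3 dummies
reduced = np.hstack([intercept, dummies[:, 1:]])      # drop the first category (East)

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # 4 columns, rank 3
print(reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```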

### Multiple linear regression

In [1]:
#data preprocessing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#importing dataset
dataset = pd.read_csv("50_Startups.csv")
dataset.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [2]:
dataset.shape

(50, 5)

In [3]:
X  = dataset.iloc[:,:-1].values
y = dataset.loc[:,'Profit'].values
print(y)


[192261.83 191792.06 191050.39 182901.99 166187.94 156991.12 156122.51
 155752.6  152211.77 149759.96 146121.95 144259.4  141585.52 134307.35
 132602.65 129917.04 126992.93 125370.37 124266.9  122776.86 118474.03
 111313.02 110352.25 108733.99 108552.04 107404.34 105733.54 105008.31
 103282.38 101004.64  99937.59  97483.56  97427.84  96778.92  96712.8
  96479.51  90708.19  89949.14  81229.06  81005.76  78239.91  77798.83
  71498.49  69758.98  65200.33  64926.08  49490.75  42559.73  35673.41
  14681.4 ]


In [4]:
# Label-encode the State column (index 3) as integers
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:,3] = labelencoder_X.fit_transform(X[:,3])
X

array([[165349.2, 136897.8, 471784.1, 2],
       [162597.7, 151377.59, 443898.53, 0],
       [153441.51, 101145.55, 407934.54, 1],
       [144372.41, 118671.85, 383199.62, 2],
       [142107.34, 91391.77, 366168.42, 1],
       [131876.9, 99814.71, 362861.36, 2],
       [134615.46, 147198.87, 127716.82, 0],
       [130298.13, 145530.06, 323876.68, 1],
       [120542.52, 148718.95, 311613.29, 2],
       [123334.88, 108679.17, 304981.62, 0],
       [101913.08, 110594.11, 229160.95, 1],
       [100671.96, 91790.61, 249744.55, 0],
       [93863.75, 127320.38, 249839.44, 1],
       [91992.39, 135495.07, 252664.93, 0],
       [119943.24, 156547.42, 256512.92, 1],
       [114523.61, 122616.84, 261776.23, 2],
       [78013.11, 121597.55, 264346.06, 0],
       [94657.16, 145077.58, 282574.31, 2],
       [91749.16, 114175.79, 294919.57, 1],
       [86419.7, 153514.11, 0.0, 2],
       [76253.86, 113867.3, 298664.47, 0],
       [78389.47, 153773.43, 299737.29, 2],
       [73994.56, 122782.75, 30331

In [5]:
# Dummy encoding
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
columnTransform = make_column_transformer((OneHotEncoder(categories = 'auto'),[3]),remainder ='passthrough')
X = columnTransform.fit_transform(X)
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

In [6]:
# Avoiding the dummy variable trap: drop the first dummy column (California)
X = X[:,1:]
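To make the encoding and the dropped column concrete, here is a small NumPy-only sketch that reproduces the first five rows of the dummy columns shown in the output above. It illustrates what the encoder produces and is not the scikit-learn implementation; the variable names are chosen for this example.

```python
import numpy as np

# State values of the first five rows of the dataset
states = np.array(["New York", "California", "Florida", "New York", "Florida"])

# np.unique sorts the labels, so California -> 0, Florida -> 1, New York -> 2
categories, codes = np.unique(states, return_inverse=True)
one_hot = np.eye(len(categories))[codes]

# Dropping the first column makes California the reference category
reduced = one_hot[:, 1:]
print(categories)
print(one_hot)
print(reduced)
```

After the drop, a row with two zeros in the dummy columns stands for California, so no information is lost.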

In [7]:
# splitting the dataset
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 0)

In [8]:
# Fitting Multiple Linear Regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)

In [10]:
regressor.coef_

array([-9.59284160e+02,  6.99369053e+02,  7.73467193e-01,  3.28845975e-02,
        3.66100259e-02])

In [12]:
regressor.intercept_

42554.16761776563
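The coefficients and intercept above define the fitted equation, so a prediction can be reproduced by hand. The sketch below plugs in the first row of the dataset (the New York startup); the column order is inferred from the encoding steps earlier in this notebook, and the coefficient values are copied from the rounded outputs, so the result is approximate.

```python
import numpy as np

# Values copied from the regressor.coef_ and regressor.intercept_ outputs above
coef = np.array([-9.59284160e+02, 6.99369053e+02, 7.73467193e-01,
                 3.28845975e-02, 3.66100259e-02])
intercept = 42554.16761776563

# Column order after encoding and dropping the first dummy:
# [Florida, New York, R&D Spend, Administration, Marketing Spend]
first_row = np.array([0.0, 1.0, 165349.2, 136897.8, 471784.1])  # New York row

# y = b0 + b1*x1 + ... + b5*x5
predicted_profit = intercept + first_row @ coef
print(predicted_profit)
```

The result comes out at roughly 192,900, against a recorded profit of 192261.83 for that row.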

In [9]:
# predicting the test set results
y_pred = regressor.predict(X_test)
y_pred

array([103015.20159796, 132582.27760816, 132447.73845175,  71976.09851259,
       178537.48221054, 116161.24230163,  67851.69209676,  98791.73374688,
       113969.43533012, 167921.0656955 ])

In [11]:
diff = pd.DataFrame({"Predict":y_pred,"Values":y_test , 'Difference':y_pred-y_test})
diff

|   | Predict | Values | Difference |
|---|---|---|---|
| 0 | 103015.201598 | 103282.38 | -267.178402 |
| 1 | 132582.277608 | 144259.4 | -11677.122392 |
| 2 | 132447.738452 | 146121.95 | -13674.211548 |
| 3 | 71976.098513 | 77798.83 | -5822.731487 |
| 4 | 178537.482211 | 191050.39 | -12512.907789 |
| 5 | 116161.242302 | 105008.31 | 11152.932302 |
| 6 | 67851.692097 | 81229.06 | -13377.367903 |
| 7 | 98791.733747 | 97483.56 | 1308.173747 |
| 8 | 113969.43533 | 110352.25 | 3617.18533 |
| 9 | 167921.065696 | 166187.94 | 1733.125696 |


In [16]:
from sklearn import metrics
metrics.mean_squared_error(y_test,y_pred)

83502864.03252874
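The mean squared error is simply the average of the squared differences between predictions and actual values. As a quick check, the sketch below recomputes it from the "Difference" column of the table above and also takes the square root (RMSE), which is expressed in the same units as Profit and is easier to read.

```python
import numpy as np

# "Difference" column copied from the table above (prediction minus actual)
differences = np.array([
    -267.178402, -11677.122392, -13674.211548, -5822.731487, -12512.907789,
    11152.932302, -13377.367903, 1308.173747, 3617.18533, 1733.125696,
])

mse = np.mean(differences ** 2)   # mean squared error
rmse = np.sqrt(mse)               # root mean squared error, in units of Profit
print(mse, rmse)
```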

In [26]:
#using Standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [32]:
scaled_regressor = LinearRegression()
X_train_scaled

array([[ 1.73205081, -0.73379939, -0.35006454, -0.78547109,  0.1011968 ],
       [-0.57735027,  1.36277029, -0.55530319, -1.48117426,  0.02734979],
       [ 1.73205081, -0.73379939,  0.07935762,  0.80133381, -0.55152132],
       [-0.57735027, -0.73379939, -0.54638238,  1.32505817,  0.07011684],
       [ 1.73205081, -0.73379939,  0.43485371, -0.35598663,  0.75148516],
       [ 1.73205081, -0.73379939,  1.26943143,  0.85518519,  0.98603118],
       [ 1.73205081, -0.73379939,  1.04525007,  1.28077047,  0.4404    ],
       [-0.57735027,  1.36277029, -1.529843  ,  0.02942065, -1.6218751 ],
       [-0.57735027,  1.36277029, -1.53976251, -2.76767264, -1.6372965 ],
       [-0.57735027,  1.36277029, -0.13115188,  1.14497701, -0.76949991],
       [-0.57735027,  1.36277029,  0.92791613, -0.02992062,  0.48303162],
       [ 1.73205081, -0.73379939, -0.20932933, -0.2993768 , -0.89915412],
       [-0.57735027, -0.73379939, -0.17870828,  0.2251352 , -1.26401642],
       [-0.57735027, -0.73379939,  0.1

In [33]:
scaled_regressor.fit(X_train_scaled,y_train)

In [34]:
y_pred = scaled_regressor.predict(X_test_scaled)

In [35]:
y_pred

array([103015.20159796, 132582.27760816, 132447.73845174,  71976.09851258,
       178537.48221055, 116161.24230165,  67851.69209676,  98791.73374687,
       113969.43533012, 167921.0656955 ])

In [36]:
from sklearn import metrics
metrics.mean_squared_error(y_test,y_pred)

83502864.03257717
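The predictions and the MSE are the same as before scaling, up to floating-point noise. That is expected for ordinary least squares with an intercept: standardizing the features only rescales the coefficients. The sketch below shows this on synthetic data with plain NumPy; the data and the helper `fit_predict` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 1000, size=(40, 3))
X_test = rng.uniform(0, 1000, size=(10, 3))
y_train = 5.0 + X_train @ np.array([0.8, 0.03, 0.04]) + rng.normal(size=40)

def fit_predict(X_tr, y_tr, X_te):
    """Least-squares fit with an intercept, then predict on X_te."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    beta, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ beta

# Standardize with the training-set mean and (population) standard deviation
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
pred_raw = fit_predict(X_train, y_train, X_test)
pred_scaled = fit_predict((X_train - mean) / std, y_train, (X_test - mean) / std)

print(np.allclose(pred_raw, pred_scaled))
```

Scaling still matters for models that penalize or compare coefficient sizes, but for plain linear regression it leaves the fit unchanged.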

Taking a step back, is this the optimal model we can build from this dataset? We built it using all of the independent variables. But what if some of them are highly statistically significant, with a large impact on the dependent variable Profit, while others are not significant at all? If we removed the variables that are not statistically significant, we might still get equally good predictions.

###  Backward elimination method

Backward elimination is a method used in regression analysis to select the most significant independent variables (predictor variables) for inclusion in a regression model. The backward elimination method involves starting with a model that includes all the independent variables, and then systematically removing variables until only the most significant ones remain.

The backward elimination method involves the following steps:

1. Start with a model that includes all independent variables.
2. Evaluate the significance of each independent variable using a statistical test, such as the F-test or t-test. The least significant variable (i.e., the one with the highest p-value) is removed from the model.
3. Re-evaluate the remaining independent variables and repeat step 2 until all remaining variables are statistically significant.

This process continues until only the statistically significant independent variables remain in the model. The final model is considered the "best" model, as it includes only the most significant independent variables.

The backward elimination method can help to simplify a regression model by removing unnecessary variables and reducing the risk of overfitting (i.e., the model fitting too closely to the training data and performing poorly on new data). However, it is important to note that the backward elimination method may not always result in the best model, as the significance of independent variables can be influenced by other variables in the model.


Algorithm:

Step 1: Select a significance level to stay in the model (e.g. SL = 0.05)

Step 2: Fit the full model with all possible predictors

Step 3: Consider the predictor with the highest p-value. If P > SL, go to Step 4; otherwise, finish

Step 4: Remove the predictor

Step 5: Fit the model without this variable, then return to Step 3


In [31]:
# Building an optimal model using the backward elimination method
# The multiple linear regression equation is y = b0 + b1*x1 + b2*x2 + ... + bn*xn,
# where y is the dependent variable, b0 is the constant, x1, x2, ... are the independent variables and b1, b2, ... are their coefficients.
# b0 has no independent variable of its own, but it can be written as b0*x0 with x0 = 1.
# sm.OLS does not add the b0 term by itself, so we add a column of ones for it.
import statsmodels.api as sm 
print(X)
X = np.append(arr=np.ones([50,1]),values = X,axis = 1)  # put the column of ones first (arr), followed by X (values)
# Note: every run of this cell prepends another column of ones, so run it only once.

[[1.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [1.0 1.0 0.0 153441.51 101145.55 407934.54]
 [1.0 0.0 1.0 144372.41 118671.85 383199.62]
 [1.0 1.0 0.0 142107.34 91391.77 366168.42]
 [1.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [1.0 1.0 0.0 130298.13 145530.06 323876.68]
 [1.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [1.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [1.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [1.0 1.0 0.0 119943.24 156547.42 256512.92]
 [1.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [1.0 0.0 1.0 94657.16 145077.58 282574.31]
 [1.0 1.0 0.0 91749.16 114175.79 294919.57]
 [1.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [1.0 0.0 1.0 78389.47 153773.43 299737.29]
 [1.0 1.0 0.0 73994.56 122782.75 3

In [16]:
X

array([[1.0, 1.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [1.0, 1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [1.0, 1.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [1.0, 1.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [1.0, 1.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [1.0, 1.0, 0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [1.0, 1.0, 0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [1.0, 1.0, 1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [1.0, 1.0, 0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [1.0, 1.0, 0.0, 0.0, 123334.88, 108679.17, 304981.62],
       [1.0, 1.0, 1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [1.0, 1.0, 0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [1.0, 1.0, 1.0, 0.0, 93863.75, 127320.38, 249839.44],
       [1.0, 1.0, 0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [1.0, 1.0, 1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [1.0, 1.0, 0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [1.0, 1.0, 

Ordinary Least Squares (OLS) is a method used in linear regression analysis to estimate the parameters of a linear regression model. The OLS algorithm involves finding the values of the regression coefficients that minimize the sum of the squared residuals between the predicted values and the actual values of the dependent variable.

The OLS algorithm involves the following steps:

1. Define the linear regression model: The linear regression model is defined as a linear combination of the independent variables (predictors) and the regression coefficients.

2. Estimate the regression coefficients: The regression coefficients are estimated using the OLS method, which involves minimizing the sum of the squared residuals between the predicted values and the actual values of the dependent variable. This is done by calculating the derivative of the sum of the squared residuals with respect to each regression coefficient, and setting them to zero.

3. Evaluate the goodness of fit: The goodness of fit of the model is evaluated using measures such as the R-squared value, which represents the proportion of variance in the dependent variable that is explained by the independent variables.

4. Make predictions: The final step is to use the estimated regression coefficients to make predictions of the dependent variable based on the values of the independent variables.

The OLS algorithm is widely used in linear regression analysis due to its simplicity and effectiveness. However, it assumes that the residuals are normally distributed and have constant variance, and that there is no multicollinearity (high correlation) among the independent variables. If these assumptions are not met, alternative methods such as robust regression or ridge regression may be used.
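Setting the derivatives of the squared-residual sum to zero leads to the normal equations XᵀXβ = Xᵀy. The sketch below solves them directly with NumPy on synthetic data and then computes R-squared; the data and variable names are assumptions for illustration, and library routines typically use more numerically stable solvers than this direct solve.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
true_beta = np.array([1.0, 2.0, -3.0])                       # hypothetical coefficients
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations: (X'X) b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Goodness of fit: share of the variance in y explained by the model
residuals = y - X @ beta_hat
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print(beta_hat, r_squared)
```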





In [20]:
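# step 2
# X_opt holds the columns currently in the model; one column is removed at every elimination step
# Note: in the recorded run X already carries two columns of ones (see the array above), so columns 0 and 1 are identical
X_opt = X[:,[0,1,2,3,4,5]]
print(X_opt) # dtype is object here, so it is converted to float before fitting
X_opt = np.array(X_opt,dtype = float)
regressor_OLS = sm.OLS(endog = y , exog = X_opt).fit()  # endog is the dependent variable, exog holds the independent variables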

[[1.0 1.0 0.0 1.0 165349.2 136897.8]
 [1.0 1.0 0.0 0.0 162597.7 151377.59]
 [1.0 1.0 1.0 0.0 153441.51 101145.55]
 [1.0 1.0 0.0 1.0 144372.41 118671.85]
 [1.0 1.0 1.0 0.0 142107.34 91391.77]
 [1.0 1.0 0.0 1.0 131876.9 99814.71]
 [1.0 1.0 0.0 0.0 134615.46 147198.87]
 [1.0 1.0 1.0 0.0 130298.13 145530.06]
 [1.0 1.0 0.0 1.0 120542.52 148718.95]
 [1.0 1.0 0.0 0.0 123334.88 108679.17]
 [1.0 1.0 1.0 0.0 101913.08 110594.11]
 [1.0 1.0 0.0 0.0 100671.96 91790.61]
 [1.0 1.0 1.0 0.0 93863.75 127320.38]
 [1.0 1.0 0.0 0.0 91992.39 135495.07]
 [1.0 1.0 1.0 0.0 119943.24 156547.42]
 [1.0 1.0 0.0 1.0 114523.61 122616.84]
 [1.0 1.0 0.0 0.0 78013.11 121597.55]
 [1.0 1.0 0.0 1.0 94657.16 145077.58]
 [1.0 1.0 1.0 0.0 91749.16 114175.79]
 [1.0 1.0 0.0 1.0 86419.7 153514.11]
 [1.0 1.0 0.0 0.0 76253.86 113867.3]
 [1.0 1.0 0.0 1.0 78389.47 153773.43]
 [1.0 1.0 1.0 0.0 73994.56 122782.75]
 [1.0 1.0 1.0 0.0 67532.53 105751.03]
 [1.0 1.0 0.0 1.0 77044.01 99281.34]
 [1.0 1.0 0.0 0.0 64664.71 139553.16]
 [1.0 1.

In [21]:
#Step 3 
regressor_OLS.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.948 |
| Model: | OLS | Adj. R-squared: | 0.943 |
| Method: | Least Squares | F-statistic: | 205.0 |
| Date: | Wed, 19 Apr 2023 | Prob (F-statistic): | 2.9e-28 |
| Time: | 00:31:53 | Log-Likelihood: | -526.75 |
| No. Observations: | 50 | AIC: | 1064.0 |
| Df Residuals: | 45 | BIC: | 1073.0 |
| Df Model: | 4 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.73e+04 | 3185.530 | 8.571 | 0.000 | 2.09e+04 | 3.37e+04 |
| x1 | 2.73e+04 | 3185.530 | 8.571 | 0.000 | 2.09e+04 | 3.37e+04 |
| x2 | 1091.1075 | 3377.087 | 0.323 | 0.748 | -5710.695 | 7892.910 |
| x3 | -39.3434 | 3309.047 | -0.012 | 0.991 | -6704.106 | 6625.420 |
| x4 | 0.8609 | 0.031 | 27.665 | 0.000 | 0.798 | 0.924 |
| x5 | -0.0527 | 0.050 | -1.045 | 0.301 | -0.154 | 0.049 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 14.275 | Durbin-Watson: | 1.197 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 19.26 |
| Skew: | -0.953 | Prob(JB): | 6.57e-05 |
| Kurtosis: | 5.369 | Cond. No. | 7.23e+16 |
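The adjusted R-squared in the summary above can be reproduced from the other reported figures. It penalizes R-squared for the number of predictors in the model. The sketch below uses the rounded values from the summary, so it only matches to the displayed precision.

```python
n_obs = 50         # "No. Observations" in the summary
df_model = 4       # "Df Model" in the summary
r_squared = 0.948  # "R-squared" in the summary (already rounded)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - df_model - 1)
print(round(adj_r_squared, 3))  # → 0.943
```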


In [None]:
#step 4
X_opt = X[:,[0,1,3,4,5]]