# Multiple Linear Regression

Multiple Linear Regression is an extension of simple linear regression that models the relationship between one dependent variable and two or more independent variables. This technique is useful when the outcome of interest is influenced by multiple factors.

## Purpose
The main goal of Multiple Linear Regression is to understand how several independent variables (predictors) collectively influence a single dependent variable (response). It also allows us to predict the dependent variable from the values of the independent variables.

## Assumptions
Multiple Linear Regression relies on several key assumptions:
* Linearity: The relationship between the dependent variable and each independent variable is linear.
* Independence: Observations are independent of each other.
* Homoscedasticity: The variance of residuals (errors) is constant across all levels of the independent variables.
* Normality: The residuals are normally distributed.
* No Multicollinearity: Independent variables are not too highly correlated with each other.

## Equation
The relationship between the dependent variable and multiple independent variables is described by the equation:

$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k + \epsilon$

where

* $Y$ is the dependent variable
* $X_1, X_2, \dots, X_k$ are the independent variables
* $\beta_0$ is the intercept
* $\beta_1, \beta_2, \dots, \beta_k$ are the coefficients (slopes) for each independent variable
* $\epsilon$ is the error term.

## Estimating Parameters
Parameters $\beta_0, \beta_1, \dots, \beta_k$ are estimated using the least squares method, which minimizes the sum of squared residuals (the differences between observed and predicted values).
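
To make the least squares step concrete, here is a minimal sketch using NumPy. The data is synthetic and the "true" coefficients (2, 3, -1.5) are made up for illustration; they are not taken from the example later in this notebook. When $X^TX$ is invertible, the least squares solution can be written as $\hat{\beta} = (X^TX)^{-1}X^Ty$, but `np.linalg.lstsq` solves the same minimization in a numerically safer way.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3*x1 - 1.5*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so that the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares: finds beta minimizing the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta_hat
print("Estimated [beta_0, beta_1, beta_2]:", beta_hat)
print("Sum of squared residuals:", np.sum(residuals ** 2))
```

The estimates should land close to the coefficients used to generate the data, with small deviations caused by the noise term.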

## Interpretation
* Intercept ($\beta_0$): The expected value of $Y$ when all independent variables are zero.
* Coefficients ($\beta_i$): The expected change in $Y$ for a one-unit change in $X_i$, holding all other variables constant.
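
These two interpretations can be checked numerically. The sketch below uses made-up coefficients ($\beta_0 = 50$, $\beta_1 = 2$, $\beta_2 = -3$) and a hypothetical helper `predict`; it ignores the error term and only evaluates the deterministic part of the equation.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 50.0
beta = np.array([2.0, -3.0])

def predict(x):
    """Deterministic part of the model: beta_0 + beta_1*x_1 + beta_2*x_2."""
    return beta_0 + beta @ x

x = np.array([10.0, 4.0])
x_plus_one = x + np.array([1.0, 0.0])  # raise x_1 by one unit, keep x_2 fixed

print(predict(np.zeros(2)))              # intercept: prediction when all x are zero
print(predict(x))                        # 50 + 2*10 - 3*4
print(predict(x_plus_one) - predict(x))  # equals beta_1
```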

## Goodness of Fit
* R-squared ($R^2$): Indicates the proportion of variance in the dependent variable explained by the independent variables. A higher $R^2$ indicates a better fit, but on the training data $R^2$ never decreases when a predictor is added, even an irrelevant one.
* Adjusted R-squared: Penalizes $R^2$ for the number of predictors in the model, which makes it a fairer measure when comparing models with different numbers of variables.
* Residuals Analysis: Examining residuals helps check assumptions and identify potential issues.
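
As a small illustration, both measures can be computed directly from their definitions, $R^2 = 1 - SS_{res}/SS_{tot}$ and $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, where $n$ is the number of observations and $k$ the number of predictors. The function names and the four data points below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """R-squared penalized for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Toy values (made up): ss_res = 0.10, ss_tot = 20
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=len(y_true), n_predictors=1)
print(r2, adj_r2)
```

The adjusted value is always at most the unadjusted one, and the gap widens as predictors are added without a matching gain in fit.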

## Multicollinearity
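When independent variables are highly correlated, the coefficients become difficult to estimate accurately: their standard errors grow and the estimates can change sharply with small changes in the data. Multicollinearity can be detected using:
* Variance Inflation Factor (VIF): Values above 10 are commonly taken to indicate high multicollinearity.
* Tolerance: The reciprocal of VIF. Values below 0.2 (equivalently, a VIF above 5) are often used as a stricter warning threshold.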
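
The sketch below shows one way to compute VIF and tolerance with NumPy alone. It relies on the standard definition $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The helper name `vif` and the synthetic predictors are assumptions made for illustration; libraries such as statsmodels provide their own implementation.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors: x3 is almost a copy of x1, x2 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.05 * rng.normal(size=500)

vifs = vif(np.column_stack([x1, x2, x3]))
tolerance = 1.0 / vifs
print("VIF:      ", vifs)
print("Tolerance:", tolerance)
```

In this setup the first and third predictors should show very large VIF values, while the unrelated second predictor stays close to 1.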

# Example

In the following example we fit a model that predicts a startup's profit from factors such as administrative and R&D costs and the location of the company.

## Import the Libraries

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Import the data

In [7]:
data = pd.read_csv('Startups.csv')
X = data.iloc[:, :-1].values
y = data.iloc[:,-1].values
print(X)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

### Encoding Categorical Data

In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3
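
The column order of the encoded array is easier to read with a smaller input. The sketch below reuses three rows from the data above, keeping only the first numeric column and the state, so the state sits in column 1 rather than column 3. The variable names `X_small`, `ct_small`, and `encoded` are introduced here for illustration. `OneHotEncoder` sorts the categories, and `ColumnTransformer` places the encoded columns first, followed by the passthrough columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows taken from the data above: first numeric column and the state
X_small = np.array([
    [165349.2, 'New York'],
    [162597.7, 'California'],
    [153441.51, 'Florida'],
], dtype=object)

ct_small = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [1])],  # state is column 1 here
    remainder='passthrough',
)
encoded = np.array(ct_small.fit_transform(X_small))

# Categories learned by the encoder, in the order of the one-hot columns
categories = ct_small.named_transformers_['encoder'].categories_[0]
print(categories)
print(encoded)
```

This matches the output above: the first three columns stand for California, Florida, and New York, in that order, and the numeric columns follow.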

### Splitting the data into the Training set and Test set

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


### Train the Multiple Linear Regression model on the Training Set

In [11]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

### Predict the Test Set Results

In [14]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

A few notes on the code above:

* `np.set_printoptions` controls how NumPy arrays are printed. `precision=2` limits floating-point numbers to 2 decimal places.
* `np.concatenate` joins a sequence of arrays along an existing axis.
* `y_pred.reshape(len(y_pred), 1)` reshapes `y_pred` into a 2D array with one column and as many rows as there are elements in `y_pred`. This is necessary because `y_pred` is originally a 1D array, and concatenation requires the arrays to have compatible shapes.
* `y_test.reshape(len(y_test), 1)` does the same for the true values `y_test`.
* The two reshaped arrays are concatenated horizontally (along axis 1), so each prediction (left column) appears next to its corresponding true value (right column) in the resulting 2D array.

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
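
To summarize how close the predictions are, the printed pairs can be scored directly. The sketch below copies the ten rows from the output above into an array and computes $R^2$ and the mean absolute error with NumPy. The names `results`, `pred`, and `actual` are introduced here for illustration, and the numbers are the rounded values shown above, so the scores are approximate.

```python
import numpy as np

# Predicted (left) and actual (right) profits, copied from the output above
results = np.array([
    [103015.2, 103282.38],
    [132582.28, 144259.4],
    [132447.74, 146121.95],
    [71976.1, 77798.83],
    [178537.48, 191050.39],
    [116161.24, 105008.31],
    [67851.69, 81229.06],
    [98791.73, 97483.56],
    [113969.44, 110352.25],
    [167921.07, 166187.94],
])
pred, actual = results[:, 0], results[:, 1]

ss_res = np.sum((actual - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot
mae = np.mean(np.abs(actual - pred))

print(f"R^2 on the test set: {r2:.3f}")
print(f"Mean absolute error: {mae:.2f}")
```

On these ten test points $R^2$ comes out at roughly 0.93. Inside the notebook, `sklearn.metrics.r2_score(y_test, y_pred)` computes the same quantity without copying values by hand.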

