Multiple linear regression is a statistical method that models a dependent variable as a linear function of two or more independent variables (predictors). It extends simple linear regression, which uses only one independent variable.

In multiple linear regression, the relationship between the independent variables (X1, X2, ..., Xn) and the dependent variable (Y) is expressed by the following equation:

Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn + ε

Where:
- Y is the dependent variable.
- X1, X2, ..., Xn are the independent variables.
- β0 is the intercept (the value of Y when all independent variables are zero).
- β1, β2, ..., βn are the coefficients (slopes) representing the change in Y for a one-unit change in the corresponding independent variable, holding other variables constant.
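- ε is the error term, capturing the variation in Y that the linear combination of predictors does not explain. Its sample counterpart, the residual, is the difference between the observed and predicted values of Y.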

The goal of multiple linear regression is to estimate the coefficients (β) that best fit the observed data. This is typically done by minimizing the sum of squared differences between the observed and predicted values of Y, a method known as ordinary least squares (OLS) regression.
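
The least-squares idea described above can be sketched with NumPy alone. This is a minimal illustration, not the method used later in this notebook: the data is synthetic, and the names `X_design`, `beta_hat`, and `sse` are assumptions introduced for the example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first coefficient plays the role of β0
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the coefficients that minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Sum of squared differences between observed and predicted values
residuals = y - X_design @ beta_hat
sse = float(residuals @ residuals)

print(beta_hat)  # estimates of β0, β1, β2
print(sse)
```

Any other coefficient vector gives a larger sum of squared residuals on the same data, which is what "best fit" means in the OLS sense.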

Multiple linear regression is widely used in fields such as economics, finance, the social sciences, and engineering, both for predicting outcomes and for understanding the relationships between variables. The model assumes that Y is linear in the coefficients. Inference on those coefficients additionally relies on errors with constant variance (homoscedasticity) and approximately normal residuals. These assumptions should be checked before interpreting the results.
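
As a rough sketch of what checking these assumptions can look like, the snippet below fits a model to synthetic data and computes two crude numeric indicators from the residuals. The data, the split into low and high fitted values, and the names `spread_ratio` and `skewness` are all assumptions made for illustration. In practice, residual-versus-fitted plots and Q-Q plots are the more common tools.

```python
import numpy as np

# Synthetic, well-behaved data (an assumption for illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Fit by least squares with an explicit intercept column
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
fitted = X_design @ beta_hat
residuals = y - fitted

# Rough homoscedasticity check: compare residual spread for low vs. high fitted values
order = np.argsort(fitted)
low, high = residuals[order[:100]], residuals[order[100:]]
spread_ratio = high.std() / low.std()

# Rough normality check: sample skewness of the residuals (near 0 if symmetric)
z = (residuals - residuals.mean()) / residuals.std()
skewness = float(np.mean(z ** 3))

print(spread_ratio, skewness)
```

A spread ratio far from 1 would hint at non-constant variance, and a skewness far from 0 would hint at non-normal residuals. Neither number is a formal test.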

# Multiple Linear Regression

## Importing the libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [3]:
dataset = pd.read_csv('50_Startups.csv')
# Features: every column except the last
X = dataset.iloc[:, :-1].values
# Target: the last column
y = dataset.iloc[:, -1].values

In [4]:
print(X[:5])


[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']]


In [5]:
print(y[:5])

[192261.83 191792.06 191050.39 182901.99 166187.94]
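
The `iloc` slicing used above can be tried on a small table. The sketch below uses a hypothetical toy DataFrame (`feature_a`, `feature_b`, and `target` are made-up column names, not columns of `50_Startups.csv`) to show how `[:, :-1]` and `[:, -1]` separate the features from the target.

```python
import pandas as pd

# Hypothetical toy table: two feature columns followed by the target column
toy = pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0],
    'feature_b': ['x', 'y', 'x'],
    'target': [10.0, 20.0, 30.0],
})

X_toy = toy.iloc[:, :-1].values  # all rows, every column except the last
y_toy = toy.iloc[:, -1].values   # all rows, only the last column

print(X_toy.shape, X_toy.dtype)
print(y_toy.shape, y_toy.dtype)
```

Because the feature columns mix numbers and strings, `X_toy` is an array of generic Python objects. The same thing happens with `X` above, which is why its printed rows show the state names in quotes.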


## Encoding categorical data

In [6]:
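from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode the categorical column at index 3; pass the numeric columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
# The encoded columns come first in the result, followed by the passthrough columns
X = np.array(ct.fit_transform(X))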

In [7]:
print(X[:5])

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]]
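
To see which of the three leading columns corresponds to which state, the fitted encoder can be inspected. The sketch below repeats the same transformation on a few toy rows; the numeric values are hypothetical, and `rows`, `encoded`, and `categories` are names introduced for the example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows (hypothetical values): two numeric columns and one categorical column
rows = np.array([
    [10.0, 1.0, 'New York'],
    [20.0, 2.0, 'California'],
    [30.0, 3.0, 'Florida'],
], dtype=object)

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [2])],
    remainder='passthrough',
)
encoded = np.array(ct.fit_transform(rows))

# The encoder sorts the categories; that order is the order of the one-hot columns
categories = ct.named_transformers_['encoder'].categories_[0]

print(categories)
print(encoded)
```

The sorted order (California, Florida, New York) is consistent with the output above, where the first row, a New York startup, has its 1.0 in the third column.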


## Splitting the dataset into the Training set and Test set

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
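
The effect of `test_size` and `random_state` can be checked on stand-in arrays. The sketch below uses made-up data with 50 rows (`X_demo` and `y_demo` are illustrative names): 20% of the rows go to the test set, and repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 50 rows, where row i is [2*i, 2*i + 1] and its target is i
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # 40 training rows and 10 test rows

# Same random_state, same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te2))
```

Rows of the feature matrix and entries of the target are shuffled together, so each test row still lines up with its own target value.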

## Training the Multiple Linear Regression model on the Training set

In [9]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
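
After fitting, a `LinearRegression` object exposes the estimated intercept and slopes as `intercept_` and `coef_`. The sketch below shows this on noise-free synthetic data, where the true values are known. The data and the names `X_demo`, `y_demo`, and `model` are assumptions for illustration, not part of the startup dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data: y = 5 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = 5.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

model = LinearRegression()
model.fit(X_demo, y_demo)

print(model.intercept_)  # estimate of β0
print(model.coef_)       # estimates of β1 and β2, in column order
```

The same two attributes are available on `regressor` above, with one coefficient per column of `X_train`.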

## Predicting the Test set results

In [10]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Print actual values (left column) next to predicted values (right column)
print(np.concatenate((y_test.reshape(-1, 1), y_pred.reshape(-1, 1)), axis=1))

[[103282.38 103015.2 ]
 [144259.4  132582.28]
 [146121.95 132447.74]
 [ 77798.83  71976.1 ]
 [191050.39 178537.48]
 [105008.31 116161.24]
 [ 81229.06  67851.69]
 [ 97483.56  98791.73]
 [110352.25 113969.44]
 [166187.94 167921.07]]
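
A side-by-side listing is easier to judge with a summary number. As one possible follow-up, the sketch below copies the ten actual and predicted values from the printed output above into arrays and computes two common metrics with `sklearn.metrics`. The choice of R² and mean absolute error is an assumption here, not something this notebook prescribes.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Actual and predicted values copied from the printed test-set output above
y_true = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])
y_hat = np.array([103015.20, 132582.28, 132447.74, 71976.10, 178537.48,
                  116161.24, 67851.69, 98791.73, 113969.44, 167921.07])

r2 = r2_score(y_true, y_hat)              # share of variance in y_true explained
mae = mean_absolute_error(y_true, y_hat)  # average absolute error, in the units of y

print(r2, mae)
```

In the notebook itself, the same calls would take `y_test` and `y_pred` directly.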
