
# Multiple Linear Regression

Multiple Linear Regression is a regression technique used to predict a dependent variable (y) using more than one independent variable (x1, x2, ..., xn).


The general mathematical model is:


$$
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
$$



Where:


*   y → Dependent variable (output)
*   x₁, x₂, … → Independent variables (inputs)
*   b₀ → Intercept (constant term)
*   b₁, b₂, … → Coefficients; each bᵢ is the expected change in y for a one-unit increase in xᵢ, with the other variables held fixed
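
As a small illustration of the formula above, the prediction for one observation is the intercept plus the dot product of the coefficients and the inputs. The numbers below are made up for the example and are not taken from the dataset.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula.
b0 = 10.0                        # intercept b_0
b = np.array([2.0, -1.0, 0.5])   # coefficients b_1, b_2, b_3
x = np.array([3.0, 4.0, 8.0])    # one observation x_1, x_2, x_3

# y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3
y_hat = b0 + b @ x
print(y_hat)  # 10 + 6 - 4 + 4 = 16.0
```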

In this notebook, using the "50_Startups.csv" dataset, the following steps are performed:

*   Converting categorical variables into numerical form using One-Hot Encoding,
*   Splitting the dataset into training and test sets,
*   Training a Multiple Linear Regression model,
*   Comparing the predicted test results with the actual values.




## Importing the libraries

In [50]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [51]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [52]:
# FOR R

# dataset = read.csv('50_Startups.csv')

In [53]:
print(X)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

## Encoding categorical data

In [54]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
# Converts the categorical column to numerical form.
# transformers = [(name, transformer, column indices)]:
#   'encoder' is a label, OneHotEncoder() does the encoding, [3] is the State column.
# remainder='passthrough' -> keep the other (numeric) columns unchanged instead of dropping them.
X = np.array(ct.fit_transform(X))

In [55]:
# FOR R

# dataset$State = factor(dataset$State,
#                          levels = c('New York', 'California', 'Florida'),
#                          labels = c(1, 2, 3))

In [56]:
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3
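
The printed output suggests that the one-hot columns are placed first and follow the alphabetical order of the categories (California, Florida, New York), which is why a New York row starts with `0.0 0.0 1.0`. The sketch below checks this on a three-row toy array; `X_toy`, `ct_toy` and the numeric values are hypothetical and only mirror the layout of the real data.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): two numeric columns and one categorical column.
X_toy = np.array([
    [100.0, 50.0, 'New York'],
    [200.0, 60.0, 'California'],
    [300.0, 70.0, 'Florida'],
], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [2])],  # encode column index 2
    remainder='passthrough',                           # keep the numeric columns
)
X_enc = np.array(ct_toy.fit_transform(X_toy))

# Categories learned by the fitted encoder, in the order of the new columns.
categories = ct_toy.named_transformers_['encoder'].categories_[0]
print(categories)
print(X_enc)
```

The encoded columns come first in the result, followed by the passthrough columns in their original order.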

## Splitting the dataset into the Training set and Test set

In [57]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [58]:
# FOR R

# install.packages('caTools')
# library(caTools)
# set.seed(123)
# split = sample.split(dataset$Profit, SplitRatio = 0.8)
# training_set = subset(dataset, split == TRUE)
# test_set = subset(dataset, split == FALSE)
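
To see what `test_size = 0.2` and `random_state = 0` do, here is a minimal sketch on placeholder arrays with 50 rows (the row count is an assumption, inferred from the ten test rows that appear in the prediction output). The names `X_demo` and `y_demo` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 50 rows (assumed size of the dataset).
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # 80% of the rows for training, 20% for testing

# Using the same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te_again))
```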

## Training the Multiple Linear Regression model on the Training set

In [59]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [60]:
# FOR R

# regressor = lm(formula = Profit ~ ., # "." means all independent variables.
#                data = training_set)
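
After fitting, the learned b₀ and b₁, b₂, … are available as `intercept_` and `coef_` on the regressor. The sketch below fits a model to synthetic, noise-free data generated from known coefficients (an assumption made for the example) and checks that they are recovered.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y = 3 + 2*x1 - 1*x2, with no noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(30, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

print(model.intercept_)  # estimate of b_0
print(model.coef_)       # estimates of b_1 and b_2
```

On the startup data the same attributes can be read from `regressor` once `fit` has run.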

## Predicting the Test set results

In [61]:
y_pred = regressor.predict(X_test)
# Predicting for X_test
np.set_printoptions(precision=2)
# Limit the decimal part to 2 digits.
print(
    np.concatenate(
        (
            y_pred.reshape(len(y_pred), 1), # Predicted values
            y_test.reshape(len(y_test), 1) # Real values
        ),
        1
    )
)
# Build a two-column table to compare predicted and actual values side by side.

# reshape(rows, columns)
# Returns the same data in a new shape; here each 1-D vector becomes a single column.

# np.concatenate((array1, array2), axis)
# Joins arrays along an existing axis; their shapes must match on every other axis.
# array1, array2 → arrays to be joined
# axis → the direction in which they are joined
# (0: one below the other (adds rows), 1: side by side (adds columns))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
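
The effect of `reshape` and of the `axis` argument is easier to see on a few values. This sketch reuses the first three rows of the output above; the variable names are illustrative.

```python
import numpy as np

pred = np.array([103015.2, 132582.28, 132447.74])    # first three predictions
actual = np.array([103282.38, 144259.4, 146121.95])  # matching actual values

# Turn each 1-D vector into a column, then join the columns side by side.
table = np.concatenate((pred.reshape(-1, 1), actual.reshape(-1, 1)), axis=1)
print(table.shape)    # one row per sample, two columns

# Joining the 1-D vectors along axis 0 appends one after the other instead.
stacked = np.concatenate((pred, actual), axis=0)
print(stacked.shape)  # a single vector with six entries
```

`reshape(-1, 1)` lets NumPy infer the number of rows, which is equivalent to `reshape(len(pred), 1)` here.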


In [62]:
# FOR R

# y_pred = predict(regressor, newdata = test_set)
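
A side-by-side table is hard to judge at a glance, so it can help to summarise it with a single number. The sketch below is one possible way to do that, not part of the original workflow: it copies the ten printed rows into arrays and computes the mean absolute error and R² with scikit-learn. In the notebook itself the same calls could be applied directly to `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Values copied from the printed comparison table (rounded to 2 decimals).
y_pred_shown = np.array([103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
                         116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test_shown = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                         105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

mae = mean_absolute_error(y_test_shown, y_pred_shown)  # average absolute gap
r2 = r2_score(y_test_shown, y_pred_shown)              # share of variance explained

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.4f}")
```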

## Building the Optimal Model Using Backward Elimination

**Backward Elimination**

Backward Elimination is a feature selection technique used to build an optimal regression model by removing statistically insignificant independent variables. The goal is to keep only the variables that have a meaningful impact on the dependent variable while improving model interpretability and simplicity.

In [63]:
# FOR R

# regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
#                data = dataset)
# summary(regressor)

The process starts by fitting a regression model using all available independent variables. After training the model, the statistical significance of each variable is evaluated using its p-value. Variables with a p-value higher than a chosen significance level (commonly 0.05) are considered statistically insignificant.
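
For reference, under the classical ordinary least squares assumptions the p-value of a coefficient comes from a t-test of the hypothesis that the coefficient is zero. This is the quantity R's summary() reports in the `Pr(>|t|)` column. In the formulas below, N is the number of observations, n the number of independent variables, and X the data matrix with a leading column of ones for the intercept (these symbols are introduced here for the example):

$$
t_j = \frac{b_j}{\operatorname{SE}(b_j)}, \qquad
\operatorname{SE}(b_j) = \sqrt{\hat{\sigma}^2 \left[(X^\top X)^{-1}\right]_{jj}}, \qquad
\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N - n - 1}
$$

$$
p_j = 2 \left( 1 - F_{N-n-1}\left( |t_j| \right) \right)
$$

where F is the cumulative distribution function of Student's t distribution with N − n − 1 degrees of freedom. A large p-value means the data are compatible with b_j = 0, which is why such variables are candidates for removal.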

In [64]:
# FOR R

# regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
#                data = dataset)
# summary(regressor)

At each step, the variable with the highest p-value above the significance level is removed from the model. The regression model is then refitted using the remaining variables, and the process is repeated until all variables in the model have p-values below the significance threshold.

In [65]:
# FOR R

# regressor = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
#                data = dataset)
# summary(regressor)

The elimination process follows these steps:


1.   Fit the regression model using all available independent variables.
2.   Examine the p-values in the summary() output of the model.
3.   Identify the independent variable with the highest p-value.
4.   If the highest p-value is greater than the chosen significance level (usually 0.05), remove that variable from the model; otherwise, stop and keep the current model.
5.   Refit the regression model using the remaining variables.
6.   Repeat from step 2 with the new summary() output until no remaining variable has a p-value above the significance level.



This approach follows a “start with everything, then eliminate” strategy: the final model keeps only the variables whose p-values fall below the chosen significance level.
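
The same loop can be sketched in Python. The version below is a minimal illustration under stated assumptions, not the notebook's own method: the helpers `ols_p_values` and `backward_elimination` are hypothetical names, the p-values are computed directly with NumPy and SciPy (libraries such as statsmodels report the same quantities), and the data are synthetic. Note that with one-hot encoded columns, one dummy column would have to be dropped first, since all dummies together with an intercept make the matrix inversion ill-defined.

```python
import numpy as np
from scipy import stats


def ols_p_values(design, y):
    """Two-sided t-test p-values for OLS; design must include the intercept column."""
    n_obs, n_params = design.shape
    beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n_obs - n_params                        # residual degrees of freedom
    sigma2 = residuals @ residuals / dof          # estimated error variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)


def backward_elimination(X, y, names, sl=0.05):
    """Drop the predictor with the highest p-value until none exceeds sl."""
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        p = ols_p_values(design, y)[1:]           # skip the intercept
        worst = int(np.argmax(p))
        if p[worst] <= sl:
            break                                 # every remaining variable is significant
        keep.pop(worst)                           # remove it and refit
    return [names[i] for i in keep]


# Synthetic data (hypothetical): y depends on x1 and x3 but not on x2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 + 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

selected = backward_elimination(X_demo, y_demo, ["x1", "x2", "x3"])
print(selected)
```

Because the test is applied at a 5% level, an irrelevant variable can still survive by chance, so the result should be read as a screening step rather than a guarantee.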

## Implementation of Backward Elimination in R

In [None]:
# backwardElimination <- function(x, sl) {
#   numVars = length(x)  # number of columns in x, including Profit
#   for (i in c(1:numVars)) {
#     regressor = lm(formula = Profit ~ ., data = x)
#     # Largest p-value among the coefficients, skipping the intercept (row 1)
#     maxVar = max(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"])
#     if (maxVar > sl) {
#       # Position of that coefficient; drop column j of x
#       j = which(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"] == maxVar)
#       x = x[, -j]
#     }
#     numVars = numVars - 1
#   }
#   return(summary(regressor))
# }
#
# SL = 0.05  # significance level
# dataset = dataset[, c(1, 2, 3, 4, 5)]  # keep columns 1 to 5
# backwardElimination(training_set, SL)