## Multiple Linear Regression

<font color='grey'>Author : Surejya Suresh </font> | [GitHub](https://github.com/surejyaa) | [LinkedIn](https://www.linkedin.com/in/surejyaa)

### Import required packages

In [1]:
import pandas as pd #data manipulation and analysis
import numpy as np #linear algebra
import matplotlib.pyplot as plt #visualization

from sklearn.compose import ColumnTransformer #apply transformers to selected columns
from sklearn.preprocessing import OneHotEncoder #encode categorical features
from sklearn.model_selection import train_test_split #split the dataset into train and test sets
from sklearn.linear_model import LinearRegression #linear regression model
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error #evaluation metrics

### Import dataset

In [2]:
data = pd.read_csv("datasets/startups_data.csv")
data.head()  #head of the dataset

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [3]:
data.tail()  #tail of the dataset

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 45 | 1000.23 | 124153.04 | 1903.93 | New York | 64926.08 |
| 46 | 1315.46 | 115816.21 | 297114.46 | Florida | 49490.75 |
| 47 | 0.0 | 135426.92 | 0.0 | California | 42559.73 |
| 48 | 542.05 | 51743.15 | 0.0 | New York | 35673.41 |
| 49 | 0.0 | 116983.8 | 45173.06 | California | 14681.4 |


In [4]:
data.shape  #dimensions of the data: (rows, columns)

(50, 5)

In [5]:
# input features: every column except the last
X = data.iloc[:, :-1].values

# target: the last column (Profit)
y = data.iloc[:, -1].values

In [6]:
X[0]

array([165349.2, 136897.8, 471784.1, 'New York'], dtype=object)

In [7]:
y[0]

192261.83

### Encoding Categorical Values

In [8]:
column_transform = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(column_transform.fit_transform(X))
X[0]

array([0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1], dtype=object)
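
The encoded row above starts with three 0/1 columns, one per state, followed by the three numeric columns passed through unchanged. The sketch below reproduces the encoding on three rows copied from `data.head()` to show which dummy column maps to which state. The names `X_demo`, `ct`, and `state_order` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows copied from data.head(): R&D Spend, Administration,
# Marketing Spend, State (column index 3 is the categorical one).
X_demo = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.59, 443898.53, "California"],
    [153441.51, 101145.55, 407934.54, "Florida"],
], dtype=object)

# Same transformer configuration as the notebook cell above.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X_demo))

# OneHotEncoder sorts the categories, so the dummy columns follow this order.
state_order = list(ct.named_transformers_["encoder"].categories_[0])
print("Dummy column order:", state_order)
print("First encoded row:", X_encoded[0])
```

Because the categories are sorted alphabetically (California, Florida, New York), a leading `0.0, 0.0, 1.0` marks a New York row, which matches `X[0]` in the notebook.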

### Splitting the dataset into training and test sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print("X_train size:", len(X_train))
print("y_train size:", len(y_train))
print("X_test size:", len(X_test))
print("y_test size:", len(y_test))

X_train size: 35
y_train size: 35
X_test size: 15
y_test size: 15


### Training the Multiple Linear Regression model

In [10]:
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)
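
After fitting, a `LinearRegression` object exposes the learned weights as `coef_` and the bias term as `intercept_`. The sketch below illustrates this on synthetic data generated from a known linear rule, so the recovered values can be checked by eye. The data and the names `X_toy`, `y_toy`, and `toy_model` are assumptions made for illustration and are unrelated to the startups dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (NOT the startups dataset):
# y = 2*x1 + 3*x2 + 5
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(20, 2))
y_toy = 2 * X_toy[:, 0] + 3 * X_toy[:, 1] + 5

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# One coefficient per input column, plus a single intercept.
print("Coefficients:", toy_model.coef_)
print("Intercept:", toy_model.intercept_)
```

In the notebook itself, `model_lr.coef_` would hold six values in the column order of the encoded `X`: the three state dummies first, then the three spend columns.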

### Prediction on test data - X_test

In [11]:
y_pred = model_lr.predict(X_test)

#Compare the predicted value with the actual target value
df_check = pd.DataFrame({'ActualValues': y_test, 'PredictedValues': y_pred})
df_check

| | ActualValues | PredictedValues |
|---|---|---|
| 0 | 103282.38 | 104282.764722 |
| 1 | 144259.4 | 132536.884992 |
| 2 | 146121.95 | 133910.850078 |
| 3 | 77798.83 | 72584.774894 |
| 4 | 191050.39 | 179920.927619 |
| 5 | 105008.31 | 114549.310792 |
| 6 | 81229.06 | 66444.432613 |
| 7 | 97483.56 | 98404.968401 |
| 8 | 110352.25 | 114499.828086 |
| 9 | 166187.94 | 169367.506399 |


### Evaluating the results

In [12]:
#Evaluation metrics
#R^2 (coefficient of determination) as a percentage; this is not a classification accuracy
r2_percent = r2_score(y_test, y_pred) * 100
print("R2 score (%):", r2_percent)

mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)

mae = mean_absolute_error(y_test, y_pred)
print("MAE:", mae)

R2 score (%): 93.5868097004653
MSE: 61903144.40236594
MAE: 6520.697183079641
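
The MSE is in squared profit units, so its square root (RMSE) is often easier to read next to the MAE. The sketch below derives the RMSE from the MSE printed above, and also recomputes R² by hand as 1 - SS_res / SS_tot on the first five rows of the comparison table. Since that is only a subset of the test set, the resulting R² is for illustration and will not equal the full test-set value. The names `rmse`, `y_true`, `y_hat`, and `r2_manual` are illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score

# MSE copied from the evaluation output above.
mse = 61903144.40236594
rmse = np.sqrt(mse)  # back in the same units as Profit
print("RMSE:", rmse)

# First five rows of the Actual/Predicted comparison table (a subset only).
y_true = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39])
y_hat = np.array([104282.764722, 132536.884992, 133910.850078,
                  72584.774894, 179920.927619])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("Manual R2 (subset):", r2_manual)
print("sklearn R2 (subset):", r2_score(y_true, y_hat))
```

The two R² values printed at the end should agree, which confirms that `r2_score` measures explained variance rather than a fraction of correct predictions.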
