# Multiple Linear Regression

In most cases, several variables together predict the value of the target variable. When more than one independent variable is present, the process is called multiple linear regression (MLR). MLR is an extension of simple linear regression (SLR).

Generally, the model has the form $\hat{y} = \theta_0+\theta_1x_1+\theta_2x_2+\dots$, with fitting parameters $\theta_i$, feature variables $x_i$, and $\hat{y}$ as the predicted value of the target variable. Whereas in SLR the objective was to find the best-fitting line, in MLR the objective is to find the best-fitting hyperplane.
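
As a minimal sketch of the model form above, the predictions for several samples can be written as a single matrix-vector product. The values of `theta0`, `theta`, and `X` below are made up for illustration and are not fitted parameters.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration
theta0 = 1.0                      # intercept (theta_0)
theta = np.array([2.0, 3.0])      # one coefficient per feature (theta_1, theta_2)

# Two samples with two features each
X = np.array([[1.0, 2.0],
              [0.5, 4.0]])

# y_hat = theta_0 + theta_1 * x_1 + theta_2 * x_2, evaluated for every row of X
y_hat = theta0 + X @ theta
print(y_hat)
```

Each row of `X` is one sample, so the same expression works for any number of samples and features as long as `theta` has one entry per column.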

In [1]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

We use the same fuel-consumption and CO2 emissions dataset as in the Simple Linear Regression notebook.

In [2]:
df = pd.read_csv('fuel_consumption.csv')

In [3]:
df.head()

Unnamed: 0,MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG,CO2EMISSIONS
0,2014,ACURA,ILX,COMPACT,2.0,4,AS5,Z,9.9,6.7,8.5,33,196
1,2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,11.2,7.7,9.6,29,221
2,2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,6.0,5.8,5.9,48,136
3,2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,12.7,9.1,11.1,25,255
4,2014,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS6,Z,12.1,8.7,10.6,27,244


Let's select the features we want to use for regression.

In [4]:
cdf = df[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY', 'FUELCONSUMPTION_COMB', 'CO2EMISSIONS']]

In [5]:
cdf.head()

Unnamed: 0,ENGINESIZE,CYLINDERS,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,CO2EMISSIONS
0,2.0,4,9.9,6.7,8.5,196
1,2.4,4,11.2,7.7,9.6,221
2,1.5,4,6.0,5.8,5.9,136
3,3.5,6,12.7,9.1,11.1,255
4,3.5,6,12.1,8.7,10.6,244


In [6]:
# Boolean mask: roughly 80% of the rows go to the training set, the rest to the test set
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
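
Because the mask above is drawn without a fixed seed, the split (and therefore the fitted coefficients and scores) changes from run to run. A minimal sketch of a reproducible variant, assuming a seeded NumPy generator; the frame `demo` and the seed value are made up and stand in for `cdf`:

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for cdf
demo = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

rng = np.random.default_rng(0)        # fixed seed (arbitrary value) for reproducibility
mask = rng.random(len(demo)) < 0.8    # True for roughly 80% of the rows
train_demo = demo[mask]
test_demo = demo[~mask]

# Every row lands in exactly one of the two subsets
print(len(train_demo), len(test_demo))
```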

Let's predict CO2 emissions using `ENGINESIZE`, `CYLINDERS`, and `FUELCONSUMPTION_COMB`:

In [7]:
from sklearn import linear_model
lr = linear_model.LinearRegression()
x = np.array(train[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']])
y = np.array(train[['CO2EMISSIONS']])
lr.fit(x, y)
lr.coef_

array([[10.83408502,  7.67429606,  9.46521457]])

**Scikit-Learn's `LinearRegression` uses the ordinary least squares (OLS) method to determine the fitting coefficients.**

OLS is a method for estimating the coefficients of a linear regression model. It determines the coefficients by minimizing the sum of squared residuals between the target variable ($y$) and the predicted output ($\hat{y}$) over all samples in the dataset, which is equivalent to minimizing the mean squared error (MSE). The minimizing parameters can be found in two ways:
<ul>
    <li>Solving for the model parameters analytically using closed-form equations</li>
    <li>Using an iterative optimization algorithm (gradient descent, stochastic gradient descent, Newton’s method, etc.)</li>
</ul>
Iterative optimization algorithms scale better to very large datasets, because the analytical solution requires building and solving matrix systems whose memory and compute cost grows quickly with the size of the data.
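
The two approaches can be compared on a small example. The sketch below uses synthetic data (made up here, not the fuel-consumption dataset), solves the normal equations $(X^TX)\theta = X^Ty$ directly, and then runs plain batch gradient descent on the MSE. The learning rate and iteration count are arbitrary choices that happen to work for this data, and this is not a description of what `LinearRegression` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): y = 4 + 3*x1 - 2*x2 + small noise
X = rng.normal(size=(200, 2))
y = 4.0 + X @ np.array([3.0, -2.0]) + 0.01 * rng.normal(size=200)

# Prepend a column of ones so the intercept is learned as theta[0]
Xb = np.hstack([np.ones((len(X), 1)), X])

# 1) Closed form: solve the normal equations (Xb^T Xb) theta = Xb^T y
theta_closed = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# 2) Batch gradient descent on the mean squared error
theta_gd = np.zeros(Xb.shape[1])
learning_rate = 0.1
for _ in range(2000):
    # Gradient of mean((Xb @ theta - y)**2) with respect to theta
    gradient = 2.0 / len(y) * Xb.T @ (Xb @ theta_gd - y)
    theta_gd -= learning_rate * gradient

print(theta_closed)
print(theta_gd)
```

Both estimates should agree closely and land near the generating values 4, 3, and -2. The closed-form route needs the full $X^TX$ matrix, while gradient descent only needs matrix-vector products per step.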

Now we can use the fitted model to make predictions on the test set.

In [8]:
x_test = np.array(test[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']])
y_test = np.array(test[['CO2EMISSIONS']])
yhat = lr.predict(x_test)
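# np.mean of the squared residuals is the mean squared error, not the residual sum of squares
print('Mean squared error : %.2f' % np.mean((yhat-y_test)**2))
# LinearRegression.score returns the coefficient of determination (R^2)
print('R^2 score %.2f' % lr.score(x_test, y_test))

Mean squared error : 548.70
R^2 score 0.86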
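
The two reported quantities can also be computed by hand. A minimal sketch with made-up numbers (standing in for `y_test` and `yhat`), showing how the residual sum of squares, the mean squared error, and the $R^2$ value returned by `LinearRegression.score` relate to each other:

```python
import numpy as np

# Made-up targets and predictions, standing in for y_test and yhat
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred
rss = np.sum(residuals ** 2)                   # residual sum of squares
mse = rss / len(y_true)                        # mean squared error
tss = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - rss / tss                           # coefficient of determination (R^2)

print(rss, mse, r2)
```

The MSE is the residual sum of squares divided by the number of samples, and $R^2$ compares the residual sum of squares with the spread of the targets around their mean.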
