# **Multiple Linear Regression**

**Objectives**: After completing this lab you will be able to:
*   Use scikit-learn to implement Multiple Linear Regression
*   Create a model, train it, test it and use the model

**Problem:** Predict carbon dioxide emissions for light-duty vehicles for sale in Canada.

**Step 1: Importing needed packages**

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

**Step 2: Data Collection**

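To download the data, we use `!wget`. The first command below (commented out) fetches the copy hosted on IBM Object Storage; the active command downloads the 2005-2014 fuel consumption ratings file from the Government of Canada open data portal and saves it as `FuelConsumption.csv`.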

In [None]:
#!wget -O FuelConsumption.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv

!wget -O FuelConsumption.csv https://open.canada.ca/data/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64/resource/2309538b-53d1-4635-a88e-e237bfcef7a2/download/my2005-2014-fuel-consumption-ratings-5-cycle.csv

**Step 3: Load dataset**

In [None]:
df = pd.read_csv("FuelConsumption.csv")

# take a look at the dataset
df.head()

**Step 4: Understand data**

The dataset contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada.

*   MODEL YEAR e.g. 2014
*   MAKE e.g. Acura
*   MODEL e.g. ILX
*   VEHICLE CLASS e.g. SUV
*   ENGINE SIZE e.g. 4.7
*   CYLINDERS e.g. 6
*   TRANSMISSION e.g. A6
*   FUELTYPE e.g. z
*   FUEL CONSUMPTION in CITY (L/100 km) e.g. 9.9
*   FUEL CONSUMPTION in HWY (L/100 km) e.g. 8.9
*   FUEL CONSUMPTION COMB (L/100 km) e.g. 9.2
*   CO2 EMISSIONS (g/km) e.g. 182 --> low --> 0

**Step 5: Data preprocessing**

Let's rename the columns for ease of use. We rename only the columns that are useful for this problem and leave the others unchanged.

In [None]:
df = df.rename(columns={
    'Engine size (L)': 'ENGINESIZE',
    'City (L/100 km)': 'FUELCONSUMPTION_CITY',
    'Highway (L/100 km)': 'FUELCONSUMPTION_HWY',
    'Combined (L/100 km)': 'COMBINED',
    'CO2 emissions (g/km)': 'CO2EMISSIONS',
})
df.head()
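
`DataFrame.rename` skips keys that do not match any column and raises no error by default, so a misspelled header goes unnoticed until a later cell fails. The sketch below shows one way to check for this. The two-row `sample` frame and the `rename_map` name are hypothetical stand-ins for the downloaded file and the mapping above, not part of the lab's data.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the downloaded CSV (assumption).
sample = pd.DataFrame({
    "Engine size (L)": [2.0, 3.5],
    "Cylinders": [4, 6],
    "Combined (L/100 km)": [8.5, 11.1],
    "CO2 emissions (g/km)": [196, 255],
})

rename_map = {
    "Engine size (L)": "ENGINESIZE",
    "Combined (L/100 km)": "COMBINED",
    "CO2 emissions (g/km)": "CO2EMISSIONS",
    "City (L/100 km)": "FUELCONSUMPTION_CITY",  # deliberately absent from sample
}

# Collect mapping keys that are not actual column names before renaming.
missing = [name for name in rename_map if name not in sample.columns]
renamed = sample.rename(columns=rename_map)

print("Columns not found:", missing)
print("Columns after rename:", list(renamed.columns))
```

If `missing` is not empty for the real `df`, the header names in the CSV probably differ from the ones used in the mapping.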

In [None]:
# Let's select some features that we want to use for regression.

cdf = df[['ENGINESIZE','Cylinders','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY','COMBINED','CO2EMISSIONS']]
cdf.head(9)

**Step 6: Exploratory data analysis (EDA)**

Let's plot Emission values with respect to Engine size.

In [None]:
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS,  color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

**Step 7: Data split (Train/Test Split)**

Let's split data into training and testing sets. Around 80% of the entire dataset will be used for training and 20% for testing. We create a mask to select random rows using the **np.random.rand()** function:

In [None]:
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]

This is our training data:

In [None]:
train.head()

In [None]:
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,  color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
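
The random mask in Step 7 produces a different split on every run, and the split is only approximately 80/20. If a reproducible split is preferred, scikit-learn's `train_test_split` with a fixed `random_state` is one alternative. This is a sketch on synthetic data, not the lab's method; the `demo` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cdf (assumption): 100 rows of random values.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "ENGINESIZE": rng.uniform(1.0, 6.0, size=100),
    "CO2EMISSIONS": rng.uniform(100.0, 400.0, size=100),
})

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
train_demo, test_demo = train_test_split(demo, test_size=0.2, random_state=42)

print(len(train_demo), len(test_demo))
```

Running the cell again with the same `random_state` yields the same rows in each set, which makes results comparable between runs.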

**Step 8.a: Model development (Training the model)**

Since multiple variables affect CO2EMISSIONS, we use multiple linear regression. We predict CO2EMISSIONS from three features of each car: ENGINESIZE, Cylinders and COMBINED. Let's train our model on the training set.
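
Written out, the model being fitted has the following form. The symbols $\theta_0, \dots, \theta_3$ are notation introduced here for illustration; in scikit-learn the fitted $\theta_0$ is available as `intercept_` and the remaining values as `coef_`.

$$
\widehat{\text{CO2EMISSIONS}} = \theta_0 + \theta_1 \cdot \text{ENGINESIZE} + \theta_2 \cdot \text{Cylinders} + \theta_3 \cdot \text{COMBINED}
$$

Ordinary least squares chooses the $\theta$ values that minimize the sum of squared differences between the predicted and the observed emissions over the training rows.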

In [None]:
from sklearn import linear_model
from sklearn.metrics import r2_score

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','Cylinders','COMBINED']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(x, y)
# The coefficients
print('Coefficients: ', regr.coef_)

# Predict the target values on the training set
y_train_pred = regr.predict(x)

# Calculate R-squared on the training set
r2_training = r2_score(y, y_train_pred)

# Print the R-squared value
print("Training R-squared:", r2_training)
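
To see what the fitted coefficients mean, the sketch below fits `LinearRegression` on synthetic, noise-free data and rebuilds the predictions by hand as intercept plus a weighted sum of the features. The arrays `X_demo`, `y_demo` and the values in `true_coef` are invented for illustration and are unrelated to the vehicle data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): 50 rows, 3 features, exact linear relationship.
rng = np.random.default_rng(1)
X_demo = rng.uniform(1.0, 10.0, size=(50, 3))
true_coef = np.array([10.0, 5.0, 12.0])
y_demo = 40.0 + X_demo @ true_coef  # no noise, so the fit can be exact

model = LinearRegression().fit(X_demo, y_demo)

# A prediction is the intercept plus the dot product of features and coefficients.
manual_pred = model.intercept_ + X_demo @ model.coef_

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Matches predict():", np.allclose(manual_pred, model.predict(X_demo)))
```

The same relationship holds for `regr` above: each entry of `regr.coef_` is the estimated change in CO2EMISSIONS for a one-unit change in the corresponding feature, with the other features held fixed.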

**Step 8.b: Model development (Testing the model)**

Let's test our model on the testing set.
The best possible R-squared score is 1.0; lower values indicate a worse fit.

In [None]:
x = np.asanyarray(test[['ENGINESIZE','Cylinders','COMBINED']])
y = np.asanyarray(test[['CO2EMISSIONS']])
# Predict on the NumPy array so the input type matches the data used in fit()
y_test_pred = regr.predict(x)

# The mean of the squared residuals is the mean squared error (MSE), not a sum
print("Mean squared error: %.2f"
      % np.mean((y_test_pred - y) ** 2))

# R-squared: 1 is perfect prediction
print('Testing R-squared: %.2f' % regr.score(x, y))
# you can also use r2_testing = r2_score(y, y_test_pred)
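
The two numbers printed above can be reproduced by hand. The expression `np.mean((y_test_pred - y) ** 2)` averages the squared residuals, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks both against scikit-learn on four made-up values; `y_true` and `y_pred` are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observed and predicted emissions (assumption, for illustration only).
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_pred = np.array([210.0, 240.0, 310.0, 340.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

mse_manual = ss_res / len(y_true)  # mean squared error
r2_manual = 1 - ss_res / ss_tot    # coefficient of determination

print("MSE:", mse_manual, "sklearn:", mean_squared_error(y_true, y_pred))
print("R-squared:", r2_manual, "sklearn:", r2_score(y_true, y_pred))
```

Because `ss_res` can exceed `ss_tot` on unseen data, R-squared on a test set can be negative when the model predicts worse than simply using the mean.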