# External Lab 

Each question is worth 1 mark.

# Multiple Linear Regression

## Problem Statement

Use Multiple Linear Regression to **predict the consumption of petrol** from four relevant variables: the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

## Dataset

There are 48 rows of data.  The data include:

      I,  the index;
      A1, the petrol tax;
      A2, the per capita income;
      A3, the number of miles of paved highway;
      A4, the proportion of drivers;
      B,  the consumption of petrol.

### Reference 

    Helmut Spaeth,
    Mathematical Algorithms for Linear Regression,
    Academic Press, 1991,
    ISBN 0-12-656460-4.

    S Weisberg,
    Applied Linear Regression,
    New York, 1980, pages 32-33.

## Question 1 - Exploratory Data Analysis

*Read the dataset given in file named **'petrol.csv'**. Check the statistical details of the dataset.*

**Hint:** You can use **df.describe()**

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline


In [2]:
data = pd.read_csv('petrol.csv')
print(data.head())
data.describe()

   tax   income   highway     dl   consumption
0  9.0     3571      1976  0.525           541
1  9.0     4092      1250  0.572           524
2  9.0     3865      1586  0.580           561
3  7.5     4870      2351  0.529           414
4  8.0     4399       431  0.544           410


            tax       income      highway        dl  consumption
count      48.0         48.0         48.0      48.0         48.0
mean   7.668333  4241.833333  5565.416667  0.570333   576.770833
std     0.95077   573.623768  3491.507166   0.05547   111.885816
min         5.0       3063.0        431.0     0.451        344.0
25%         7.0       3739.0      3110.25   0.52975        509.5
50%         7.5       4298.0       4735.5    0.5645        568.5
75%       8.125      4578.75       7156.0   0.59525       632.75
max        10.0       5342.0      17782.0     0.724        968.0


## Question 2 - Cap outliers

Find the outliers using (Q1 - 1.5 * IQR) as the minimum cap and (Q3 + 1.5 * IQR) as the maximum cap. Only data points that fall within this range are kept. Data points outside this range are outliers, and the entire row containing one must be removed.

In [4]:
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

# Keep only the rows in which every column lies within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
df = data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]

print(df)


     tax   income   highway     dl   consumption
0   9.00     3571      1976  0.525           541
1   9.00     4092      1250  0.572           524
2   9.00     3865      1586  0.580           561
3   7.50     4870      2351  0.529           414
4   8.00     4399       431  0.544           410
6   8.00     5319     11868  0.451           344
7   8.00     5126      2138  0.553           467
8   8.00     4447      8577  0.529           464
9   7.00     4512      8507  0.552           498
10  8.00     4391      5939  0.530           580
12  7.00     4817      6930  0.574           525
13  7.00     4207      6580  0.545           508
14  7.00     4332      8159  0.608           566
15  7.00     4318     10340  0.586           635
16  7.00     4206      8508  0.572           603
17  7.00     3718      4725  0.540           714
19  8.50     4341      6010  0.677           640
20  7.00     4593      7834  0.663           649
21  8.00     4983       602  0.602           540
22  9.00     4897   

In [None]:
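# Alternative method: filter one column at a time
def remove_outlier(df, col_name):
    q1 = df[col_name].quantile(0.25)
    q3 = df[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    low_val = q1 - 1.5 * iqr
    high_val = q3 + 1.5 * iqr
    # Inclusive bounds, so values exactly on a bound are kept, as in the method above
    return df.loc[(df[col_name] >= low_val) & (df[col_name] <= high_val)]

# Quartiles are recomputed on the rows that survived the earlier columns,
# so the result can differ from the one-shot filter above.
# Work on a copy so that `data` itself is left unchanged for later cells.
filtered = data.copy()
for col in data.columns:
    filtered = remove_outlier(filtered, col)
    print(len(filtered))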
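
The question title speaks of capping, while both cells above remove rows. As a minimal sketch of the difference, assuming a small toy frame with made-up values (not the petrol data), `DataFrame.clip` pulls out-of-range values back to the bounds and keeps every row, whereas the boolean filter drops the affected rows:

```python
import pandas as pd

# Toy data with hypothetical values; column "a" contains one obvious outlier.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Capping: values outside the bounds are replaced by the bound; no rows are lost.
capped = toy.clip(lower=lower, upper=upper, axis=1)

# Removal: any row with at least one value outside the bounds is dropped.
inside = ((toy >= lower) & (toy <= upper)).all(axis=1)
removed = toy[inside]

print(capped)
print(removed)
```

Capping preserves the sample size, which matters with only 48 rows, but it also distorts the capped values, so which option is appropriate depends on the exercise.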


## Question 3 - Independent variables and collinearity
Which attributes seem to have a stronger association with the dependent variable, consumption?

In [24]:
df.corr()

                   tax     income    highway         dl  consumption
tax                1.0  -0.109537  -0.390602  -0.314702    -0.446116
income       -0.109537        1.0   0.051169   0.150689    -0.347326
highway      -0.390602   0.051169        1.0  -0.016193     0.034309
dl           -0.314702   0.150689  -0.016193        1.0     0.611788
consumption  -0.446116  -0.347326   0.034309   0.611788          1.0


From the correlation values above, the proportion of drivers (dl) has the strongest association with consumption (about 0.61). Tax has a moderate negative association (about -0.45), income a weaker negative one (about -0.35), and highway almost none (about 0.03).

Insights:
- As tax increases, consumption decreases.
- As the proportion of drivers increases, consumption increases.
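
One way to read the correlation table is to rank the features by the magnitude of their correlation with the target, so that strong negative associations are not overlooked. The sketch below simply copies the consumption column from the `df.corr()` output above into a Series; the variable names are illustrative.

```python
import pandas as pd

# Correlations with consumption, copied from the df.corr() output above.
corr_with_target = pd.Series(
    {"tax": -0.446116, "income": -0.347326, "highway": 0.034309, "dl": 0.611788},
    name="consumption",
)

# Order the features by absolute correlation, largest first, keeping the signs.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)

print(ranked)
```

Pairwise correlation only captures linear, one-at-a-time association, so this ranking is a starting point for feature selection rather than a final answer.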

## Question 4 - Transform the dataset
Divide the data into feature (X) and target (Y) sets.

In [36]:
# ' dl' is spelled with its leading space because that is how the label appears in this DataFrame
X = df.loc[:, ['tax', ' dl']]
X.head()

   tax     dl
0  9.0  0.525
1  9.0  0.572
2  9.0   0.58
3  7.5  0.529
4  8.0  0.544


In [37]:
Y = df.iloc[:, -1]
Y.head()

0    541
1    524
2    561
3    414
4    410
Name:  consumption, dtype: int64
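
The column labels in this notebook appear to carry a leading space (note `' dl'` in the selection and `Name:  consumption` in the output). A minimal sketch of normalising such labels, assuming a two-row stand-in for the file whose header has a space after each comma (the real layout of petrol.csv is an assumption here):

```python
import io
import pandas as pd

# Hypothetical stand-in for petrol.csv: the header has a space after each comma.
raw = (
    "tax, income, highway, dl, consumption\n"
    "9.0,3571,1976,0.525,541\n"
    "9.0,4092,1250,0.572,524\n"
)

sample = pd.read_csv(io.StringIO(raw))
before = list(sample.columns)  # labels as read, leading spaces included

# Strip surrounding whitespace from every column label.
sample.columns = sample.columns.str.strip()

# The features and target can now be selected by their plain names.
X_sample = sample[["tax", "dl"]]
y_sample = sample["consumption"]

print(before)
print(list(sample.columns))
```

Selecting by name after stripping is less fragile than relying on `' dl'` or on `iloc[:, -1]`, although either works as long as the labels and column order stay as they are.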

## Question 5 - Split data into train, test sets
Divide the data into training and test sets with 80-20 split using scikit-learn. Print the shapes of training and test feature sets.

In [38]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)  

print(X_train.shape)
print(X_test.shape)

(34, 2)
(9, 2)
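
Because `train_test_split` shuffles the rows randomly, each run of the cell above produces a different partition, and every number computed downstream changes with it. A minimal sketch of a reproducible split, assuming synthetic data with 43 rows (the row count implied by the shapes above) and an arbitrary seed of 42:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 43 rows and two features; the values are random, not the petrol data.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(43, 2)), columns=["tax", "dl"])
y_demo = pd.Series(rng.normal(size=43), name="consumption")

# Fixing random_state pins the shuffle, so repeated calls return the same partition.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b = split_b[0]

print(X_train_a.shape, X_test_a.shape)
```

The seed value itself is arbitrary; what matters is that it is fixed, so that coefficients and scores can be compared across runs.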


In [39]:
X_train

     tax     dl
35  6.58  0.629
24   8.5  0.551
9    7.0  0.552
46   7.0  0.623
32   8.0  0.578
6    8.0  0.451
20   7.0  0.663
42   7.0  0.603
44   6.0  0.672
7    8.0  0.553


## Question 6 - Build Model
Estimate the coefficients for each input feature. Construct and display a DataFrame that pairs each name in X.columns with its coefficient.

In [40]:
from sklearn.linear_model import LinearRegression  
reg1 = LinearRegression()  
lm = reg1.fit(X_train, y_train)  

In [41]:
coeff_df = pd.DataFrame(reg1.coef_, X.columns, columns=['Coefficient'])  
coeff_df

     Coefficient
tax    -17.77311
dl   1050.687797


In [33]:
reg1.coef_

array([-32.60016349, 891.17172251])
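
To make the meaning of `coef_` concrete, the sketch below fits `LinearRegression` on synthetic, noise-free data generated from a known equation (an assumption chosen for illustration) and checks that a prediction is the intercept plus the coefficient-weighted sum of the features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 3 + 2*x1 - 1*x2 (no noise).
rng = np.random.default_rng(1)
X_syn = pd.DataFrame(rng.normal(size=(50, 2)), columns=["x1", "x2"])
y_syn = 3.0 + 2.0 * X_syn["x1"] - 1.0 * X_syn["x2"]

model = LinearRegression().fit(X_syn, y_syn)

# One row per feature, in the same layout as the coefficient table above.
coeffs = pd.DataFrame(model.coef_, index=X_syn.columns, columns=["Coefficient"])
print(coeffs)
print("intercept:", model.intercept_)

# A prediction is the intercept plus the coefficient-weighted sum of the features.
manual = model.intercept_ + X_syn.to_numpy() @ model.coef_
```

The fitted intercept is stored separately in `intercept_`, so a coefficient table on its own does not fully describe the model.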

# R-Square 

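## Question 7 - Evaluate the model
Calculate the R-squared score for the above model. (For a regressor, `score` returns R-squared, the coefficient of determination, rather than a classification accuracy.)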

In [30]:
# R-squared measured on the training data
r2 = reg1.score(X_train, y_train)
print(r2)


0.43496069248561087
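
The score above is computed on the training rows. As a minimal sketch on synthetic data (the coefficients, noise level and seed below are assumptions for illustration), R-squared can be computed by hand as 1 - SS_res / SS_tot and evaluated separately on the training and test sets:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem: two features plus Gaussian noise.
rng = np.random.default_rng(2)
X_syn = rng.normal(size=(60, 2))
y_syn = 1.0 + 4.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=1.0, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

train_r2 = r_squared(y_tr, model.predict(X_tr))
test_r2 = r_squared(y_te, model.predict(X_te))

print("train R-squared:", train_r2)
print("test R-squared:", test_r2)
```

The test-set value is usually the more honest estimate of how the model performs on unseen data, and with a test set as small as the 9 rows here it can vary considerably from split to split.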


## Question 8 - Repeat the same multiple linear regression modelling after adding both the income and highway features
Find R-squared.


In [31]:
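# select feature variables (every column except the last)
# note: this reads from `data`, not from the outlier-filtered `df` used in Questions 4 to 7
X = data.iloc[:,:-1]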
#print(X.head())

# dependent variable
Y = data.iloc[:,-1]
# print(Y.head())

# split train and test set
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)  
#print(X_train.shape)
# print(X_test.shape)

# Model
from sklearn.linear_model import LinearRegression  
reg = LinearRegression()  
reg.fit(X_train, y_train)

# R-squared on the training set
r2 = reg.score(X_train, y_train)
print(r2)

0.6940210663453725


## Question 9 - Print the coefficients of the multiple linear regression model

In [32]:
coeff_df = pd.DataFrame(reg.coef_, X.columns,  columns = ["coefficient"])  
coeff_df

         coefficient
tax       -35.098742
income     -0.055894
highway     -0.00265
dl       1363.414145
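
The coefficients above differ by orders of magnitude largely because the features are measured on very different scales (dl is a fraction, income is in the thousands), so their raw sizes are not directly comparable. A minimal sketch of one common remedy, standardising the features before fitting, on synthetic data whose scales loosely mimic dl and income (all values below are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Two synthetic features on very different scales.
small = rng.normal(loc=0.57, scale=0.05, size=80)
large = rng.normal(loc=4200.0, scale=570.0, size=80)
X_syn = np.column_stack([small, large])
y_syn = 100.0 + 1000.0 * small - 0.05 * large + rng.normal(scale=5.0, size=80)

# Coefficients in the original units: change in y per unit change in each feature.
raw_model = LinearRegression().fit(X_syn, y_syn)

# Coefficients after standardising: change in y per one standard deviation of each feature.
scaler = StandardScaler().fit(X_syn)
std_model = LinearRegression().fit(scaler.transform(X_syn), y_syn)

print("raw coefficients:", raw_model.coef_)
print("standardised coefficients:", std_model.coef_)
```

Standardising does not change the fitted predictions of an ordinary least squares model; it only rescales each coefficient by the standard deviation of its feature.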


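## Question 10
In one or two sentences, give your reasoning about R-squared on the basis of the above findings.

Answer: The R-squared value increased from about 0.43 with two features (tax and dl) to about 0.69 once income and highway were added. On the same training data, R-squared cannot decrease when independent variables are added to a least-squares model, so this increase by itself does not show that the extra features improve predictions.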
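
Adjusted R-squared is a common way to account for the number of features when comparing models. The sketch below uses synthetic data and a deliberately irrelevant extra feature (both are assumptions for illustration) to show that training R-squared does not drop when a feature is added, while the adjusted value penalises the extra term. The helper `adjusted_r2` is a hypothetical name, not a scikit-learn function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

rng = np.random.default_rng(4)
n = 40
X_base = rng.normal(size=(n, 2))
y_syn = 2.0 + 3.0 * X_base[:, 0] - 1.5 * X_base[:, 1] + rng.normal(size=n)

# Append a pure-noise feature that has no relationship with y.
X_more = np.column_stack([X_base, rng.normal(size=n)])

# Both models are fitted and scored on the same rows.
r2_base = LinearRegression().fit(X_base, y_syn).score(X_base, y_syn)
r2_more = LinearRegression().fit(X_more, y_syn).score(X_more, y_syn)

print("R-squared:", r2_base, "->", r2_more)
print("adjusted:", adjusted_r2(r2_base, n, 2), "->", adjusted_r2(r2_more, n, 3))
```

Comparing adjusted R-squared, or R-squared on a held-out test set, gives a fairer picture than training R-squared alone of whether added features are worthwhile.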