We will use the physical attributes of a car to predict its fuel economy in miles per gallon (mpg).

Linear regression produces a model in the form:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

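The coefficients are estimated by minimising the residual sum of squares (RSS) over the n observations and p features:

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \dots - \hat{\beta}_p x_{ip})^2$$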
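
As a rough illustration of what minimising the RSS means, the sketch below fits a small synthetic data set with NumPy's least-squares solver and checks that nudging the coefficients makes the RSS larger. The data, the helper name `rss` and the variable names are assumptions made for this example and are not part of the auto-mpg analysis.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])


def rss(beta):
    """Residual sum of squares for a coefficient vector beta."""
    residuals = y - X_design @ beta
    return float(residuals @ residuals)


# Least-squares solution: the coefficient vector that minimises the RSS
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)            # estimates of the intercept and the two slopes
print(rss(beta_hat))       # RSS at the minimum
print(rss(beta_hat + 0.1)) # RSS after shifting every coefficient by 0.1
```

scikit-learn's LinearRegression solves the same least-squares problem, so the coefficients it reports can be read in the same way.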

In [26]:
import pandas as pd

df = pd.read_csv('auto-mpg.csv')

In [27]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


We don’t need the car name column, so let’s remove it.

In [28]:
df = df.drop('car name', axis=1)

In [29]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
0,18.0,8,307.0,130,3504,12.0,70,1
1,15.0,8,350.0,165,3693,11.5,70,1
2,18.0,8,318.0,150,3436,11.0,70,1
3,16.0,8,304.0,150,3433,12.0,70,1
4,17.0,8,302.0,140,3449,10.5,70,1


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null int64
acceleration    398 non-null float64
model year      398 non-null int64
origin          398 non-null int64
dtypes: float64(3), int64(4), object(1)
memory usage: 25.0+ KB


Also note that the column "origin" records where the car came from. This is a nominal categorical variable (its codes 1, 2 and 3 have no natural order), so we need to create binary dummy variables for it.

In [31]:
df['origin'] = df['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})

In [32]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
0,18.0,8,307.0,130,3504,12.0,70,america
1,15.0,8,350.0,165,3693,11.5,70,america
2,18.0,8,318.0,150,3436,11.0,70,america
3,16.0,8,304.0,150,3433,12.0,70,america
4,17.0,8,302.0,140,3449,10.5,70,america


In [33]:
df = pd.get_dummies(df, columns=['origin'])

In [34]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin_america,origin_asia,origin_europe
0,18.0,8,307.0,130,3504,12.0,70,1,0,0
1,15.0,8,350.0,165,3693,11.5,70,1,0,0
2,18.0,8,318.0,150,3436,11.0,70,1,0,0
3,16.0,8,304.0,150,3433,12.0,70,1,0,0
4,17.0,8,302.0,140,3449,10.5,70,1,0,0


Some horsepower values are missing and are recorded as question marks (which is why the column has dtype object), so we’ll replace these with NaN and drop the affected rows.

In [35]:
import numpy as np

In [36]:
df = df.replace('?', np.nan)   # replace '?' with NaN

In [37]:
df.isnull().mean()

mpg               0.000000
cylinders         0.000000
displacement      0.000000
horsepower        0.015075
weight            0.000000
acceleration      0.000000
model year        0.000000
origin_america    0.000000
origin_asia       0.000000
origin_europe     0.000000
dtype: float64

In [38]:
df = df.dropna()

In [39]:
df.isnull().mean()

mpg               0.0
cylinders         0.0
displacement      0.0
horsepower        0.0
weight            0.0
acceleration      0.0
model year        0.0
origin_america    0.0
origin_asia       0.0
origin_europe     0.0
dtype: float64
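
Replacing the question marks and dropping the rows removes the missing values, but the horsepower column will most likely still hold strings rather than numbers. A minimal sketch of an explicit conversion with `pd.to_numeric` is shown below; the toy frame `toy` is an assumption for illustration and stands in for the real data.

```python
import numpy as np
import pandas as pd

# Toy column mimicking horsepower: numeric strings plus a '?' placeholder
toy = pd.DataFrame({'horsepower': ['130', '165', '?', '150']})

# errors='coerce' turns anything that cannot be parsed (here '?') into NaN
toy['horsepower'] = pd.to_numeric(toy['horsepower'], errors='coerce')

# Drop the rows whose horsepower could not be parsed
toy = toy.dropna()

print(toy['horsepower'].dtype)
print(len(toy))
```

With the column stored as a numeric dtype, later steps do not have to rely on implicit string-to-number conversion.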

In [40]:
df.shape

(392, 10)

In [41]:
df.origin_america.value_counts() #distinct value count

1    245
0    147
Name: origin_america, dtype: int64

In [42]:
import matplotlib.pyplot as plt

# Scatter plot of mpg against the number of cylinders
x1 = df['cylinders'].to_numpy()
y1 = df['mpg'].to_numpy()

plt.figure(figsize=[10, 10])
plt.scatter(x=x1, y=y1)
plt.xlabel("cylinders")
plt.ylabel("mpg")
plt.show()

Now we can split our data into a training set and a test set.

In [43]:
from sklearn.model_selection import train_test_split

X = df.drop('mpg', axis=1)
y = df[['mpg']]   # double brackets keep y as a one-column DataFrame

# Split X and y into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [44]:
X.head()

,cylinders,displacement,horsepower,weight,acceleration,model year,origin_america,origin_asia,origin_europe
0,8,307.0,130,3504,12.0,70,1,0,0
1,8,350.0,165,3693,11.5,70,1,0,0
2,8,318.0,150,3436,11.0,70,1,0,0
3,8,304.0,150,3433,12.0,70,1,0,0
4,8,302.0,140,3449,10.5,70,1,0,0


We train our LinearRegression model on the training set.

In [45]:
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [46]:
import matplotlib.pyplot as plt

# A fitted model cannot be passed to plt.plot; plot its test-set predictions against the observed mpg instead
plt.scatter(X_test['cylinders'], y_test['mpg'])
plt.scatter(X_test['cylinders'], regression_model.predict(X_test).ravel(), color='black')
plt.show()

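Now that our model is trained, we can view its coefficients using regression_model.coef_. This is a two-dimensional NumPy array with one row per target and one coefficient per feature; it has a single row here because y was passed as a one-column DataFrame.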

In [47]:
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))

The coefficient for cylinders is -0.24633755869962146
The coefficient for displacement is 0.02387033830714965
The coefficient for horsepower is -0.006017238617773323
The coefficient for weight is -0.00733643294389931
The coefficient for acceleration is 0.218977781041248
The coefficient for model year is 0.7851801072779493
The coefficient for origin_america is -1.7624934092199307
The coefficient for origin_asia is 0.8096269190858508
The coefficient for origin_europe is 0.9528664901340741


regression_model.intercept_ holds the intercept (β0). It is a one-element array here, again because y was passed as a one-column DataFrame.

In [48]:
intercept = regression_model.intercept_[0]

print("The intercept for our model is {}".format(intercept))

The intercept for our model is -19.809183848815874


So we can write our linear model as:

$$Y = -19.81 - 0.25 X_1 + 0.02 X_2 - 0.01 X_3 - 0.01 X_4 + 0.22 X_5 + 0.79 X_6 - 1.76 X_7 + 0.81 X_8 + 0.95 X_9$$

Here X1 to X9 are the features in the order printed above, with coefficients rounded to two decimal places.
Note that, because we have not scaled the features, they are measured in different units and on very different scales, so these coefficients say nothing about the relative importance of each feature.
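
To make the point about scale concrete, here is a hedged sketch on synthetic data, not the auto-mpg set. The two features `weight` and `accel`, their distributions and the true coefficients are all assumptions chosen for illustration. It fits the same model on raw and on standardised features using scikit-learn's StandardScaler.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): two features on very different scales
rng = np.random.default_rng(0)
weight = rng.normal(3000, 800, size=200)   # values in the thousands
accel = rng.normal(15, 3, size=200)        # values in the tens
X = np.column_stack([weight, accel])
y = 45 - 0.007 * weight + 0.2 * accel + rng.normal(scale=1.0, size=200)

# Same model, fitted on raw features and on standardised features
raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)

print(raw.coef_)     # effect per unit of each feature
print(scaled.coef_)  # effect per standard deviation of each feature
```

On the raw features the weight coefficient looks tiny next to the acceleration coefficient, yet after standardising it is the larger of the two in absolute value. Only coefficients fitted on comparable scales can be compared in this way.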

A common measure of how well a regression model fits the data is the R² statistic.

The R² statistic is defined as follows:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

The RSS (residual sum of squares) measures the variability left unexplained after performing the regression.
The TSS (total sum of squares) measures the total variability in Y.
The R² statistic therefore measures the proportion of variability in Y that is explained by X using our model.
R² can be computed on our test set with the model’s score method.
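
The definition can be checked by hand. The sketch below computes RSS, TSS and R² for five observed mpg values and compares the result with scikit-learn's `r2_score`; the predictions in `y_pred` are made up for illustration and do not come from the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([18.0, 15.0, 18.0, 16.0, 17.0])  # first five mpg values in the data
y_pred = np.array([17.5, 15.5, 17.0, 16.5, 17.5])  # made-up predictions (assumption)

rss = np.sum((y_true - y_pred) ** 2)          # variability left unexplained
tss = np.sum((y_true - y_true.mean()) ** 2)   # total variability around the mean
r2_manual = 1 - rss / tss

print(rss, tss)
print(r2_manual)
print(r2_score(y_true, y_pred))  # should agree with the manual value
```

For a regressor such as LinearRegression, the score method reports this same quantity for the data passed to it.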

In [49]:
regression_model.score(X_test, y_test)

0.8285231316459772

So on the test set, 82.85% of the variability in Y is explained by X using our model.

We can also compute the mean squared error with scikit-learn’s mean_squared_error function, which compares the ground truth for the test set (data not used for training) with the model’s predictions for it:

In [50]:
from sklearn.metrics import mean_squared_error

y_predict = regression_model.predict(X_test)

regression_model_mse = mean_squared_error(y_test, y_predict)  # signature is (y_true, y_pred)

regression_model_mse

12.230963834602674

In [51]:
import math

math.sqrt(regression_model_mse)

3.4972794904900972

So the root mean squared error on our test set is 3.50 mpg, which gives a rough sense of how far a typical prediction is from the ground truth mpg.
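
The square root of the MSE is not the same thing as the mean absolute error, because squaring weights large misses more heavily. A small sketch with made-up numbers (the arrays below are assumptions for illustration, not model output) shows the difference:

```python
import numpy as np

y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_pred = np.array([21.0, 24.0, 30.0, 41.0])  # made-up predictions with one large miss

errors = y_true - y_pred
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, in mpg
mae = np.mean(np.abs(errors))    # mean absolute error, in mpg

print(mse, rmse, mae)
```

Here the single large miss pulls the RMSE well above the mean absolute error, so the two figures are best reported under their own names.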

We can use our model to predict the miles per gallon for another, unseen car. Let’s give it a go on the following:

Cylinders – 4
Displacement – 121
Horsepower – 110
Weight – 2800
Acceleration – 15.4
Year – 81
Origin – Europe

In [None]:
# The last three values are origin_america, origin_asia, origin_europe
regression_model.predict([[4, 121, 110, 2800, 15.4, 81, 0, 0, 1]])

The specifications above are those of a Saab 900s, and it turns out that the prediction is quite close to the actual figure of 26 mpg for this car.
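
When predicting from a bare list, the values must be in exactly the same order as the training columns. One way to reduce the risk of a mix-up is to build the new observation as a one-row DataFrame and reorder it by the training column names. The sketch below uses a tiny made-up training frame with two features, so `train`, `features` and `new_car` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny made-up training frame (an assumption for illustration)
train = pd.DataFrame({
    'cylinders': [8, 4, 6, 4],
    'weight': [3504, 2130, 2900, 2000],
    'mpg': [18.0, 32.0, 22.0, 34.0],
})
features = ['cylinders', 'weight']
model = LinearRegression().fit(train[features], train['mpg'])

# Describe the new car by column name, then reorder to match the training columns
new_car = pd.DataFrame([{'weight': 2800, 'cylinders': 4}])[features]
prediction = model.predict(new_car)

print(new_car)
print(prediction)
```

The same pattern would apply to the nine-column model above by listing all nine feature names.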