<h1 style="color:red">Predicting Car Price Using Linear Regression</h1>
<p><b>By:</b> Mir Habeebullah Shah Quadri<br><b>Dept:</b> AI<br><b>Roll.No: </b> 18B81DA914</p>
<h4 style="color:red">Problem Statement:</h4>
<p>Apply predictive analysis on the 'auto-price' dataset using linear regression.</p>
<h4 style="color:blue">Solution</h4>

<p><b>Step 1:</b> Import the necessary libraries required.</p>

In [45]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from bokeh.io import output_notebook
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.models import Slope
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

<p><b>Step 2:</b> Read the csv file into a pandas dataframe.</p>

In [46]:
data = pd.read_csv('auto-price.csv')

<p><b>Step 3:</b> List out the features of the dataset.</p>

In [47]:
list(data.columns)

['Symboling',
 'normalized-losses',
 'make',
 'fuel-type',
 'aspiration',
 'num-of-doors',
 'body-style',
 'drive-wheels',
 'engine-location',
 'wheel-base',
 'length',
 'width',
 'height',
 'curb-weight',
 'engine-type',
 'num-of-cylinders',
 'engine-size',
 'fuel-system',
 'bore',
 'stroke',
 'compression-ratio',
 'horsepower',
 'peak-rpm',
 'city-mpg',
 'highway-mpg',
 'price']

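<p><b>Step 4:</b> Check the size of the dataset. Note that <code>data.size</code> returns the total number of elements (rows × columns), not the number of rows: here 205 rows × 26 columns = 5330.</p>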

In [48]:
data.size

5330
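
<p>As a small illustration of the difference between element count and row count, the sketch below uses a made-up three-row frame (the values are placeholders, not taken from 'auto-price.csv').</p>

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    'horsepower': [111, 154, 102],
    'price': [13495, 16500, 13950],
})

n_rows, n_cols = toy.shape      # shape is (rows, columns)
print('rows:', n_rows)          # same value as len(toy)
print('columns:', n_cols)
print('elements:', toy.size)    # size is rows * columns
```

<p>For the row count alone, <code>len(data)</code> or <code>data.shape[0]</code> is usually the clearer choice.</p>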

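<p><b>Step 5:</b> Compute summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric feature in the dataset. Columns stored as text, including 'horsepower' and 'price', are left out of this summary.</p>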

In [49]:
data.describe()

statistic,Symboling,wheel-base,length,width,height,curb-weight,engine-size,compression-ratio,city-mpg,highway-mpg
count,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0
mean,0.834146,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,10.142537,25.219512,30.75122
std,1.245307,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,3.97204,6.542142,6.886443
min,-2.0,86.6,141.1,60.3,47.8,1488.0,61.0,7.0,13.0,16.0
25%,0.0,94.5,166.3,64.1,52.0,2145.0,97.0,8.6,19.0,25.0
50%,1.0,97.0,173.2,65.5,54.1,2414.0,120.0,9.0,24.0,30.0
75%,2.0,102.4,183.1,66.9,55.5,2935.0,141.0,9.4,30.0,34.0
max,3.0,120.9,208.1,72.3,59.8,4066.0,326.0,23.0,49.0,54.0


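<p><b>Step 6:</b> Convert 'horsepower' and 'price' to numeric values; with <code>errors='coerce'</code>, entries that cannot be parsed become NaN. Drop the rows where either feature is missing and keep the result for the calculations that follow. The element count falls from 5330 to 5174, i.e. 6 rows are removed and 199 remain.</p>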

In [50]:
data['horsepower'] = pd.to_numeric(data['horsepower'], errors = 'coerce')
data['price'] = pd.to_numeric(data['price'], errors = 'coerce')
data.dropna(subset=['horsepower', 'price'], inplace = True)
data.size

5174
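
<p>The coercion step can be tried on a tiny example. In the sketch below, the '?' placeholder and all values are assumptions made for illustration; the real file may mark missing entries differently.</p>

```python
import pandas as pd

# Hypothetical raw columns read as text; '?' stands in for any
# non-numeric placeholder (an assumption, not taken from the real file).
raw = pd.DataFrame({
    'horsepower': ['111', '?', '154'],
    'price': ['13495', '16500', '?'],
})

for col in ['horsepower', 'price']:
    # Unparseable entries become NaN instead of raising an error.
    raw[col] = pd.to_numeric(raw[col], errors='coerce')

# Keep only the rows where both columns hold a number.
clean = raw.dropna(subset=['horsepower', 'price'])

print(raw.isna().sum())
print(len(raw), 'rows before,', len(clean), 'rows after')
```

<p>Returning a new frame from <code>dropna</code>, as done here, leaves the original untouched, which can make it easier to check how many rows were lost.</p>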

<p><b>Step 7:</b> Apply scipy's <code>pearsonr</code> function to 'horsepower' and 'price' to measure the linear correlation between them. It returns the correlation coefficient and the p-value.</p>

In [51]:
pearsonr(data['horsepower'], data['price'])

(0.8105330821322063, 1.1891278276946011e-47)

<p>The correlation coefficient is about 0.81 and the p-value (about 1.2e-47) is far below 0.05, so there is a strong, statistically significant positive linear correlation between the two features.</p>
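
<p>To make the coefficient less of a black box, here is a minimal sketch of the Pearson formula in plain Python. The function name <code>pearson_r</code> and the four sample points are made up for illustration; the notebook itself relies on <code>scipy.stats.pearsonr</code>.</p>

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up sample points (not from the dataset).
hp = [70, 100, 150, 200]
price = [7000, 9500, 16000, 30000]

r = pearson_r(hp, price)
print(round(r, 3))
```

<p>A value close to +1 or -1 indicates a nearly linear relationship, while a value close to 0 indicates little linear relationship.</p>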

<p><b>Step 8:</b> Visualize the relationship between horsepower and price by drawing every automobile as a point on a scatter plot.</p>

In [52]:
output_notebook()

In [53]:
source = ColumnDataSource(data = dict(
    x = data['horsepower'], 
    y = data['price'],
    make = data['make']
))

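# '@' fields read the hovered point's values from the data source;
# '$x'/'$y' would show the cursor position instead
tooltips = [
    ('make', '@make'),
    ('horsepower', '@x'),
    ('price', '@y{0}')
]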

p = figure(plot_width=600, plot_height=400, tooltips=tooltips)
p.xaxis.axis_label = 'Horsepower'
p.yaxis.axis_label = 'Price'

p.circle('x', 'y', source = source, size = 8, color = 'blue', alpha = 0.5)
show(p)

<p>The scatter plot shows a clear positive relationship between horsepower and price: automobiles with more horsepower tend to be priced higher.</p>

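<p><b>Step 9:</b> Split the dataset in a 70/30 ratio for training and testing. Since no <code>random_state</code> is passed, the split is random and the numbers reported below will differ slightly from run to run.</p>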

In [54]:
train, test = train_test_split(data, test_size = 0.3)
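
<p>The sketch below shows roughly what a seeded 70/30 shuffle split does, written with NumPy only. The helper <code>split_indices</code> and the seed value are hypothetical; in the notebook, a comparable effect can be had by passing <code>random_state</code> to <code>train_test_split</code>.</p>

```python
import numpy as np

def split_indices(n_rows, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) for a shuffled split of range(n_rows)."""
    rng = np.random.default_rng(seed)          # fixed seed gives a repeatable shuffle
    shuffled = rng.permutation(n_rows)
    n_test = int(np.ceil(n_rows * test_size))  # round the test share up
    return shuffled[n_test:], shuffled[:n_test]

# 199 is the number of rows left after dropping missing values.
train_idx, test_idx = split_indices(199, test_size=0.3, seed=0)
print(len(train_idx), 'training rows,', len(test_idx), 'test rows')
```

<p>Calling the helper again with the same seed returns the same indices, which makes metrics comparable between runs.</p>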

<p><b>Step 10:</b> Create the linear regression model and fit it to the training data, with 'horsepower' as the single predictor and 'price' as the target. From the fitted model, we then read off the slope and intercept.</p>

In [55]:
model = linear_model.LinearRegression()

training_x = np.array(train['horsepower']).reshape(-1, 1)
training_y = np.array(train['price'])

model.fit(training_x, training_y)

# np.asscalar was removed from recent NumPy releases; float() does the same job
slope = float(model.coef_[0])
intercept = float(model.intercept_)
print('Slope:', round(slope, 2), 'Intercept:', round(intercept, 2))

Slope: 167.74 Intercept: -4395.1
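
<p>For a single predictor, the least-squares slope and intercept have a closed form. The sketch below computes them with NumPy on four made-up points; <code>fit_line</code> is a hypothetical helper, not part of scikit-learn.</p>

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares for one predictor: y ~ slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope = covariance(x, y) / variance(x)
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # the fitted line always passes through the point of means
    intercept = y_mean - slope * x_mean
    return float(slope), float(intercept)

# Made-up sample points (not from the dataset).
hp = [70, 100, 150, 200]
price = [7000, 9500, 16000, 30000]

slope, intercept = fit_line(hp, price)
print('Slope:', round(slope, 2), 'Intercept:', round(intercept, 2))
```

<p>Read the same way, the values fitted above predict a price of about 167.74 × 100 - 4395.1 = 12378.9 for a 100-horsepower automobile, and each additional unit of horsepower adds about 167.74 to the predicted price.</p>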


<p><b>Step 11:</b> Using the slope and intercept values, we define the line that best fits the training data and draw it over the scatter plot.</p>

In [56]:
best_fit = Slope(gradient=slope, y_intercept = intercept, line_color='red', line_width=3)
p.add_layout(best_fit)
show(p)

<p><b>Step 12:</b> We now compute evaluation metrics on the training data (mean absolute error, mean squared error and R² score).<br>Then we compute the same metrics on the test data.</p>

In [58]:
def predict_metrics(lr, x, y):
    pred = lr.predict(x)
    mae = mean_absolute_error(y, pred)
    mse = mean_squared_error(y, pred)
    r2 = r2_score(y, pred)
    return mae, mse, r2

training_mae, training_mse, training_r2 = predict_metrics(model, training_x, training_y)

test_x = np.array(test['horsepower']).reshape(-1, 1)
test_y = np.array(test['price'])
test_mae, test_mse, test_r2 = predict_metrics(model, test_x, test_y)

print('training mae: ', training_mae, 'training mse: ', training_mse, 'training r2: ', training_r2)
print('test mae: ', test_mae, 'test mse: ', test_mse, 'test r2: ', test_r2)

training mae:  3242.959508841529 training mse:  20930583.916503247 training r2:  0.620354614485515
test mae:  3411.9625407557223 test mse:  23957037.22245476 test r2:  0.7090342742589079
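
<p>The three metrics are simple functions of the residuals. The following sketch spells them out with NumPy on made-up numbers; <code>regression_metrics</code> is a hypothetical helper that is meant to mirror, not replace, the scikit-learn functions used above.</p>

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) computed directly from the residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))                 # average absolute miss
    mse = np.mean(residuals ** 2)                    # average squared miss
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return float(mae), float(mse), float(r2)

# Made-up targets and predictions.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

mae, mse, r2 = regression_metrics(y_true, y_pred)
print('mae:', mae, 'mse:', mse, 'r2:', r2)
```

<p>MAE is in the same unit as the target (price), MSE is in squared units, and R² is unitless, which is why the three values differ so much in magnitude.</p>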


<p><b>Step 13:</b> Next, we consider more candidate features and compute the correlation of each one with 'price'.</p>

In [59]:
cols = ['horsepower', 'engine-size', 'peak-rpm', 'length', 'width', 'height']

for col in cols:
    data[col] = pd.to_numeric(data[col], errors='coerce')

data.dropna(subset=cols + ['price'], inplace=True)  # pearsonr needs every compared column free of NaN

for col in cols:
    print(col, pearsonr(data[col], data['price']))


horsepower (0.8105330821322063, 1.1891278276946011e-47)
engine-size (0.8738869517981516, 1.2650674479074428e-63)
peak-rpm (-0.10164886620219901, 0.15311824317199588)
length (0.6939647745646871, 6.39831060305001e-30)
width (0.7538710519013427, 8.679834788813268e-38)
height (0.13499022754460993, 0.05730390719825449)
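
<p>One simple way to turn correlations like these into a feature list is a cutoff on the absolute value of r. In the sketch below the coefficients are copied (rounded) from the output above, while the cutoff of 0.5 is a hypothetical choice made for illustration.</p>

```python
# Pearson r with 'price', rounded from the output above.
r_with_price = {
    'horsepower': 0.8105,
    'engine-size': 0.8739,
    'peak-rpm': -0.1016,
    'length': 0.6940,
    'width': 0.7539,
    'height': 0.1350,
}

MIN_ABS_R = 0.5  # hypothetical cutoff, not a value from the notebook

# Use abs() so strong negative correlations would be kept as well.
selected = [name for name, r in r_with_price.items() if abs(r) >= MIN_ABS_R]
print(selected)
```

<p>A cutoff on pairwise correlation ignores redundancy between the features themselves, so it is best treated as a first filter rather than a full selection method.</p>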


<p><b>Step 14:</b> Since 'peak-rpm' (r ≈ -0.10) and 'height' (r ≈ 0.13) are only weakly correlated with 'price', and neither correlation is significant at the 0.05 level, we drop them from the features. We split the data once again in a 70/30 ratio and train the model on the remaining four features.</p>

In [60]:
model_cols = ['horsepower', 'engine-size', 'length', 'width']
multi_x = data[model_cols].to_numpy()  # feature matrix, one column per feature
multi_train_x, multi_test_x, multi_train_y, multi_test_y = train_test_split(multi_x, data['price'], test_size=0.3)

<p><b>Step 15:</b> We fit a multiple linear regression model and read off the intercept and the coefficient of each feature.</p>

In [61]:
multi_model = linear_model.LinearRegression()
multi_model.fit(multi_train_x, multi_train_y)
multi_intercept = multi_model.intercept_
multi_coeffs = dict(zip(model_cols, multi_model.coef_))
print('intercept: ', multi_intercept)
print('coefficients: ', multi_coeffs)

intercept:  -60579.124436381666
coefficients:  {'horsepower': 54.21620388122346, 'engine-size': 88.44509192715641, 'length': 32.43136308745014, 'width': 779.1586116809}
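
<p>With several features, the same least-squares idea becomes a small linear algebra problem. The sketch below recovers known coefficients from made-up data using <code>np.linalg.lstsq</code>; the variable names and the generating formula are assumptions for illustration only.</p>

```python
import numpy as np

# Made-up data generated from y = 3 + 2*a - 1*b, so the answer is known.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 3.0 + 2.0 * a - 1.0 * b

# A leading column of ones lets the solver estimate the intercept.
X = np.column_stack([np.ones_like(a), a, b])

# Solve min ||X @ params - y||^2 for params = [intercept, coef_a, coef_b].
params, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, coef_a, coef_b = params
print('intercept:', round(float(intercept), 6))
print('coefficients:', round(float(coef_a), 6), round(float(coef_b), 6))
```

<p>Each coefficient is the change in the prediction per unit of its own feature with the others held fixed, so coefficients of features measured on different scales (such as 'width' and 'horsepower' above) are not directly comparable in size.</p>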


<p><b>Step 16:</b> We compute the MAE, MSE and R² score on the training and test data.</p>

In [28]:
multi_train_mae, multi_train_mse, multi_train_r2 = predict_metrics(multi_model, multi_train_x, multi_train_y)
multi_test_mae, multi_test_mse, multi_test_r2 = predict_metrics(multi_model, multi_test_x, multi_test_y)

print('training mae: ', multi_train_mae, 'training mse: ', multi_train_mse, 'training r2: ', multi_train_r2)
print('test mae: ', multi_test_mae, 'test mse: ', multi_test_mse, 'test r2: ', multi_test_r2)

training mae:  2441.0373472667497 training mse:  11550939.568584947 training r2:  0.8253298076168814
test mae:  2502.4743131055566 test mse:  12406338.17545234 test r2:  0.7755701121414397
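
<p>Because MSE is in squared price units, taking its square root (RMSE) can make the two models easier to compare. The sketch below does this for the test-set MSE values reported in this notebook; the dictionary labels are made up.</p>

```python
import math

# Test-set MSE values copied from the outputs of Step 12 and Step 16.
reported_test_mse = {
    'horsepower only': 23957037.22245476,
    'four features': 12406338.17545234,
}

# RMSE is the square root of MSE and is expressed in price units.
rmse = {name: math.sqrt(mse) for name, mse in reported_test_mse.items()}

for name, value in rmse.items():
    print(f'{name}: RMSE = {value:.0f}')
```

<p>The four-feature model shows the lower test error. The two models were evaluated on different random 70/30 splits, though, so the comparison is indicative rather than exact.</p>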
