#### Introduction

This is a simple exercise for practicing linear regression, a basic machine learning algorithm, and for measuring the accuracy of the fitted model. We will evaluate the model with the following metrics:

1. RMSE
2. MAE
3. MAPE
4. MSE
5. R squared
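For reference, these metrics can be written as follows. The notation is an assumption introduced here for illustration: $y_i$ is the actual price of the $i$-th test sample, $\hat{y}_i$ is its predicted price, $\bar{y}$ is the mean of the actual prices, and $n$ is the number of test samples.

```latex
\begin{align*}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \\
\mathrm{MAPE} &= \frac{100}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \\
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
R^2           &= 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
\end{align*}
```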

We will use a data set from Kaggle (source: https://www.kaggle.com/aishwaryamuthukumar/cars-dataset-audi-bmw-ford-hyundai-skoda-vw).



#### Importing Libraries

As the first step, import the required libraries. This process is iterative: we may come back partway through the coding to add a new library.

In [63]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
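# regression metrics (accuracy_score is a classification metric and does not apply here)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score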

#### Importing data file

Next, import the data set. Here we load it directly from GitHub.

In [64]:
df = pd.read_csv("https://raw.githubusercontent.com/SKawsar/Data_Visualization_with_Python/main/bmw.csv")

In [65]:
df.head() #to check if the data loaded correctly

|   | model | year | price | transmission | mileage | fuelType | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|
| 0 | 5 Series | 2014 | 11200 | Automatic | 67068 | Diesel | 57.6 | 2.0 |
| 1 | 6 Series | 2018 | 27000 | Automatic | 14827 | Petrol | 42.8 | 2.0 |
| 2 | 5 Series | 2016 | 16000 | Automatic | 62794 | Diesel | 51.4 | 3.0 |
| 3 | 1 Series | 2017 | 12750 | Automatic | 26676 | Diesel | 72.4 | 1.5 |
| 4 | 7 Series | 2014 | 14500 | Automatic | 39554 | Diesel | 50.4 | 3.0 |


The first step of this task, loading the required data, is done. Let's describe the features for easy understanding:

model: the model of the car.

year: the year the model was launched.

price: the resale price of the car.

transmission: the transmission type of the car.

mileage: the total distance the car has already travelled.

fuelType: the fuel the car requires.

mpg: the miles travelled per gallon of fuel.

engineSize: the engine size of the car.

Now check the features more closely to see which are numeric and which are of object type.

In [66]:
df.info()  # info() prints its report itself and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10781 entries, 0 to 10780
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         10781 non-null  object 
 1   year          10781 non-null  int64  
 2   price         10781 non-null  int64  
 3   transmission  10781 non-null  object 
 4   mileage       10781 non-null  int64  
 5   fuelType      10781 non-null  object 
 6   mpg           10781 non-null  float64
 7   engineSize    10781 non-null  float64
dtypes: float64(2), int64(3), object(3)
memory usage: 673.9+ KB


Check for missing values.

In [67]:
df.isnull().sum()

model           0
year            0
price           0
transmission    0
mileage         0
fuelType        0
mpg             0
engineSize      0
dtype: int64

There are no missing values, and we now have a description of our data set. Next, we separate the features from the target variable. We will use the numeric columns as features and price as the target variable.

In [68]:
# keep only the numeric columns as features
X = df.drop(["model", "price", "transmission", "fuelType"], axis=1)
# price is the target variable
Y = df["price"]
X.head()

|   | year | mileage | mpg | engineSize |
|---|---|---|---|---|
| 0 | 2014 | 67068 | 57.6 | 2.0 |
| 1 | 2018 | 14827 | 42.8 | 2.0 |
| 2 | 2016 | 62794 | 51.4 | 3.0 |
| 3 | 2017 | 26676 | 72.4 | 1.5 |
| 4 | 2014 | 39554 | 50.4 | 3.0 |


Before applying linear regression, we need to split the data set. As a standard practice, we will use an 80-20 split (80% training data, 20% testing data). For this we use train_test_split from sklearn.model_selection, which we imported earlier.

In [69]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
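To make the 80-20 split concrete, here is a minimal sketch of a shuffled hold-out split using only NumPy. The helper name `simple_train_test_split` is hypothetical and introduced for illustration. It is not scikit-learn's implementation and will not select the same rows as `random_state=1` above. It only shows the idea of shuffling the row indices and cutting them into two disjoint groups.

```python
import numpy as np


def simple_train_test_split(n_samples, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) for a shuffled hold-out split.

    Hypothetical helper for illustration only; it does not reproduce
    the rows chosen by sklearn's train_test_split.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)         # random order of row indices
    n_test = int(np.ceil(test_size * n_samples))  # size of the test part
    return shuffled[n_test:], shuffled[:n_test]


# 10781 is the number of rows reported by df.info()
train_idx, test_idx = simple_train_test_split(10781, test_size=0.2)
print(len(train_idx), len(test_idx))
```

With 10781 rows, this leaves 8624 rows for training and 2157 for testing, and no row appears in both groups.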

To apply linear regression, we use the LinearRegression model that we imported from scikit-learn earlier. We can fit it directly to the training part of the data set we have already split.

In [70]:
model = LinearRegression()

In [71]:
model.fit(X_train, y_train)
model.predict(X_test)

array([28070.5828543 , 24454.46184519, 10840.97137118, ...,
       28401.03849011, 30112.51356905, 16222.44919822])
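LinearRegression fits an ordinary least squares model, and the fitted values are available afterwards as `model.coef_` and `model.intercept_`. As a rough sketch of what happens during fitting, the example below solves the same kind of least squares problem with NumPy. The data is synthetic and made up for illustration (the names `X_demo`, `y_demo`, `true_coef` and `true_intercept` are assumptions, not part of the BMW data set), so the recovered coefficients can be compared with known ones.

```python
import numpy as np

# synthetic, noise-free data: y = 2*x1 - 3*x2 + 5 (illustrative values only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
true_coef = np.array([2.0, -3.0])
true_intercept = 5.0
y_demo = X_demo @ true_coef + true_intercept

# prepend a column of ones so the intercept is estimated with the slopes
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# least squares solution of A @ beta ~= y_demo
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept, coef = beta[0], beta[1:]

# predictions are a weighted sum of the features plus the intercept
y_hat = A @ beta
print("intercept:", intercept)
print("coefficients:", coef)
```

Because the synthetic target has no noise, the estimated intercept and coefficients match the values used to generate it, up to floating point error.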

We now have our predicted values. To check how well the model fits, we will compute the five accuracy scores listed in the introduction. A short description of each score is given before its code. As a first step, we create a DataFrame to make the calculations easier.

In [72]:
y_actual = y_test
y_predicted = model.predict(X_test)
df_accuracy = pd.DataFrame({"y_actual": y_actual,
                            "y_predicted": y_predicted})

# prediction error, with its absolute and squared values
df_accuracy["dif"] = df_accuracy["y_actual"] - df_accuracy["y_predicted"]
df_accuracy["abs_error"] = np.abs(df_accuracy["dif"])
df_accuracy["squared_error"] = df_accuracy["dif"]**2

# deviation of each actual value from the mean actual value (needed for R squared)
df_accuracy["actual_subtract_mean"] = df_accuracy["y_actual"] - df_accuracy["y_actual"].mean()
df_accuracy["squared_actual_subtract_mean"] = df_accuracy["actual_subtract_mean"]**2

display(df_accuracy)

|   | y_actual | y_predicted | dif | abs_error | squared_error | actual_subtract_mean | squared_actual_subtract_mean |
|---|---|---|---|---|---|---|---|
| 3840 | 35470 | 28070.582854 | 7399.417146 | 7399.417146 | 5.475137e+07 | 12640.937413 | 1.597933e+08 |
| 7757 | 15490 | 24454.461845 | -8964.461845 | 8964.461845 | 8.036158e+07 | -7339.062587 | 5.386184e+07 |
| 10325 | 17000 | 10840.971371 | 6159.028629 | 6159.028629 | 3.793363e+07 | -5829.062587 | 3.397797e+07 |
| 685 | 10991 | 9436.633065 | 1554.366935 | 1554.366935 | 2.416057e+06 | -11838.062587 | 1.401397e+08 |
| 1947 | 21050 | 21811.413561 | -761.413561 | 761.413561 | 5.797506e+05 | -1779.062587 | 3.165064e+06 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9964 | 19980 | 16990.638004 | 2989.361996 | 2989.361996 | 8.936285e+06 | -2849.062587 | 8.117158e+06 |
| 2039 | 33980 | 32127.152828 | 1852.847172 | 1852.847172 | 3.433043e+06 | 11150.937413 | 1.243434e+08 |
| 1608 | 19372 | 28401.038490 | -9029.038490 | 9029.038490 | 8.152354e+07 | -3457.062587 | 1.195128e+07 |
| 6951 | 35793 | 30112.513569 | 5680.486431 | 5680.486431 | 3.226793e+07 | 12963.937413 | 1.680637e+08 |


In [73]:
# Mean absolute error (MAE): the mean of the absolute differences between
# the actual and predicted values. Lower is better.
MAE = df_accuracy["abs_error"].mean()
print("MAE = ", MAE)

# Mean absolute percentage error (MAPE): the mean of the absolute errors
# divided by the actual values, expressed as a percentage. Lower is better.
MAPE = np.round(np.mean(df_accuracy["abs_error"]/df_accuracy["y_actual"])*100, 2)
print("MAPE = ", MAPE)

# Mean squared error (MSE): the mean of the squared differences between
# the actual and predicted values. Lower is better.
MSE = df_accuracy["squared_error"].mean()
print("MSE = ", MSE)

# Root mean squared error (RMSE): the square root of MSE, which puts the
# error back in the units of the target. Lower is better.
RMSE = np.round(np.sqrt(MSE), 2)
print("RMSE = ", RMSE)

# Coefficient of determination (R squared): the proportion of the variance
# in the actual values that the model explains, computed as
# 1 - (sum of squared errors) / (sum of squared deviations from the mean).
# Higher is better. The maximum is 1, and the value is negative when the
# model predicts worse than always predicting the mean.
r_squared = np.round(1- df_accuracy["squared_error"].sum()/df_accuracy["squared_actual_subtract_mean"].sum(), 2)
print("r_squared = ", r_squared)

MAE =  4616.644554773189
MAPE =  23.88
MSE =  47012798.583861955
RMSE =  6856.59
r_squared =  0.64
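The same five scores can be collected in one small function, which makes them easy to try on toy numbers. The sketch below uses only NumPy. The function name `regression_metrics` and the toy values are assumptions made for illustration and do not come from the BMW data set. It also shows the two reference points for R squared: a model that always predicts the mean scores 0, and a model that does worse than that scores below 0.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, MAPE (in %), MSE, RMSE and R squared for two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)                       # sum of squared errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return {
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err / y_true)) * 100,  # assumes no zero actuals
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1 - ss_res / ss_tot,
    }


# toy values, chosen so the arithmetic is easy to check by hand
y_true = [100, 200, 300, 400]
scores = regression_metrics(y_true, [110, 190, 310, 380])
baseline = regression_metrics(y_true, [250, 250, 250, 250])  # always the mean
worse = regression_metrics(y_true, [400, 300, 200, 100])     # reversed order

print(scores)
print("R2 of mean predictor:", baseline["R2"])
print("R2 of reversed predictor:", worse["R2"])
```

For the first set of toy predictions the absolute errors are 10, 10, 10 and 20, so MAE is 12.5 and MSE is 175. scikit-learn also provides `mean_absolute_error`, `mean_squared_error` and `r2_score` in `sklearn.metrics`, which can be used to cross-check hand-computed values.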


With an R squared of 0.64 and a MAPE of 23.88%, the accuracy results are not encouraging. Either the linear regression model is not a good fit for this problem, or the features we excluded are important.
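One way to explore the second explanation is to turn an excluded object column into numeric columns so that the model can use it. Below is a minimal sketch of one-hot encoding with NumPy, using the fuelType and engineSize values from the five rows shown by `df.head()`. The variable names are assumptions made for illustration. In practice `pd.get_dummies` or scikit-learn's `OneHotEncoder` would typically do this step. Whether the extra columns actually improve the scores is not tested here.

```python
import numpy as np

# values taken from the five rows displayed by df.head()
fuel_type = np.array(["Diesel", "Petrol", "Diesel", "Diesel", "Diesel"])
engine_size = np.array([2.0, 2.0, 3.0, 1.5, 3.0])

# map each category to an integer code (categories come back sorted)
categories, codes = np.unique(fuel_type, return_inverse=True)

# one column per category, with a 1 in the column of the row's category
one_hot = np.eye(len(categories))[codes]

# drop the first category so the columns are not redundant with the intercept
one_hot = one_hot[:, 1:]

# combine the numeric feature with the encoded categorical feature
X_demo = np.column_stack([engine_size, one_hot])
print(categories)
print(X_demo)
```

Here the second column of `X_demo` is 1 for petrol cars and 0 for diesel cars, so a linear model can learn a separate price offset for each fuel type.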