<h1>Machine Learning for the Skoda Used Car Dataset Using Python</h1>

<b>Prepared By:</b>
<br>Suman Biswas
<br>Scientific Officer (Statistics)
<br>Bangladesh Agricultural Research Institute
<br>Gazipur-1701, Bangladesh

<b>This topic covers how to</b>
- load the Skoda used car dataset
- separate the numeric features from the target variable
- split the original dataset into a train set (80%) and a test set (20%)
- fit a linear regression model and predict 'price' for the test set
- compute the RMSE between the actual and the predicted test-set prices

#### Import required Libraries

In [81]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import pandas as pd

#### (A) Load Data

In [82]:
# Load data
df=pd.read_csv('skoda.csv')

# Display the first 10 rows and the dataset's shape
display((df.head(10)))
print(df.shape)

| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Octavia | 2017 | 10550 | Manual | 25250 | Petrol | 150 | 54.3 | 1.4 |
| 1 | Citigo | 2018 | 8200 | Manual | 1264 | Petrol | 145 | 67.3 | 1.0 |
| 2 | Octavia | 2019 | 15650 | Automatic | 6825 | Diesel | 145 | 67.3 | 2.0 |
| 3 | Yeti Outdoor | 2015 | 14000 | Automatic | 28431 | Diesel | 165 | 51.4 | 2.0 |
| 4 | Superb | 2019 | 18350 | Manual | 10912 | Petrol | 150 | 40.9 | 1.5 |
| 5 | Yeti Outdoor | 2017 | 13250 | Automatic | 47005 | Diesel | 145 | 51.4 | 2.0 |
| 6 | Superb | 2019 | 15250 | Manual | 14850 | Petrol | 145 | 40.9 | 1.5 |
| 7 | Octavia | 2019 | 18950 | Automatic | 5850 | Diesel | 150 | 50.4 | 2.0 |
| 8 | Kodiaq | 2019 | 29900 | Automatic | 2633 | Petrol | 150 | 31.4 | 2.0 |
| 9 | Octavia | 2017 | 18990 | Manual | 20000 | Petrol | 150 | 43.5 | 2.0 |


(6267, 9)


In [83]:
## Display variable names
print(df.columns)

Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'tax',
       'mpg', 'engineSize'],
      dtype='object')


#### (B) Separating the numeric features and target variable

In [84]:
## Checking data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6267 entries, 0 to 6266
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         6267 non-null   object 
 1   year          6267 non-null   int64  
 2   price         6267 non-null   int64  
 3   transmission  6267 non-null   object 
 4   mileage       6267 non-null   int64  
 5   fuelType      6267 non-null   object 
 6   tax           6267 non-null   int64  
 7   mpg           6267 non-null   float64
 8   engineSize    6267 non-null   float64
dtypes: float64(2), int64(4), object(3)
memory usage: 440.8+ KB


<b>Observation:</b>
<br>None of the nine columns has missing values. year, mileage, tax, mpg and engineSize are the numeric variables that will be used as features; the remaining numeric variable, 'price', will be the target.

In [85]:
features=['year', 'mileage', 'tax', 'mpg', 'engineSize']
target=['price']

X=df[features]
y=df[target]

# Print the shape of the dataset features and target variable
print(X.shape, y.shape)

(6267, 5) (6267, 1)


#### (C) Split the original dataset into the train set (80%) and the test set (20%)

In [86]:
X_train, X_test, y_train, y_test=train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)

print('X_train shape: ', X_train.shape, 'X_test shape: ', X_test.shape, 'y_train shape: ', y_train.shape, 'y_test shape: ', y_test.shape)

X_train shape:  (5013, 5) X_test shape:  (1254, 5) y_train shape:  (5013, 1) y_test shape:  (1254, 1)
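
The 5013/1254 row counts follow from how the fractional sizes are rounded. As a minimal sketch, assuming the test fraction is rounded up and the train fraction is rounded down (an assumption about scikit-learn's behaviour that is consistent with the shapes printed above), the counts can be reproduced by hand. `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, train_size=0.8, test_size=0.2):
    """Hypothetical helper: row counts for a fractional train/test split.

    Assumes the test fraction is rounded up and the train fraction is
    rounded down, which matches the shapes printed above.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor(train_size * n_samples)
    return n_train, n_test


# 6267 is the number of rows reported by df.shape
n_train, n_test = split_sizes(6267)
print(n_train, n_test)
```

Because 6267 is not divisible by 5, neither fraction gives a whole number of rows, so the two counts are rounded in opposite directions and still add up to 6267.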


#### (D) Perform Linear Regression and Predict 'price' for the test set

##### Linear Regression

In [87]:
model=LinearRegression()
model=model.fit(X_train, y_train)
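
`LinearRegression` fits an ordinary least squares model. To make that concrete, the following is a minimal sketch of least squares for a single feature in plain Python. The function name `fit_simple_ols` and the toy data are made up for illustration; the notebook's model uses five features, which scikit-learn handles with matrix methods rather than this closed-form one-feature formula.

```python
def fit_simple_ols(x, y):
    """Least-squares slope and intercept for one feature (illustrative)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy data lying exactly on the line y = 2x + 1
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_simple_ols(x, y)
print(slope, intercept)
```

On noiseless data like this the fit recovers the generating line; on real data it returns the line that minimises the sum of squared residuals.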

##### Predicting the 'price' for the test set

In [88]:
y_pred=model.predict(X_test)
print(y_pred)

[[13779.14498775]
 [15731.35912899]
 [17043.09723094]
 ...
 [14900.56791431]
 [17736.20873264]
 [12706.21305435]]
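
Each predicted value is the fitted intercept plus a weighted sum of that row's features. The sketch below shows the arithmetic for one row; `predict_one` is a hypothetical helper, and the coefficients, intercept and feature values are made-up numbers, not the ones learned from the Skoda data.

```python
def predict_one(features, coefficients, intercept):
    """Linear prediction for a single row: intercept + sum(coef * feature)."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    return intercept + sum(c * f for c, f in zip(coefficients, features))


# Made-up values for illustration only
coefficients = [2.0, -1.0]
intercept = 10.0
row = [3.0, 4.0]
print(predict_one(row, coefficients, intercept))
```

For the fitted model in this notebook, the corresponding learned values are available as `model.coef_` and `model.intercept_`.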


In [89]:
# Original y_test values
print(y_test)

      price
1261   9990
2182  15893
2094  16290
2552  24995
2812   9982
...     ...
5799  10790
106    9150
3931  15899
4540  16595
2815  13791

[1254 rows x 1 columns]


#### (E) Find the RMSE value from the actual test data and the predicted data

In [90]:
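# RMSE is the square root of the MSE. Taking the root explicitly avoids the
# `squared=False` argument, which was deprecated in scikit-learn 1.4.
RMSE=mean_squared_error(y_test, y_pred) ** 0.5
print('RMSE:','%.2f' % RMSE)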

RMSE: 3410.42
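
RMSE is the square root of the mean of the squared differences between actual and predicted values. As a minimal sketch of that calculation in plain Python (the function name `rmse` and the toy numbers are made up for illustration and are not taken from the Skoda data):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up toy values: the errors are 1, 0 and -2
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print('RMSE:', '%.2f' % rmse(actual, predicted))
```

Because the square root undoes the squaring, RMSE is expressed in the same units as the target, so the value of 3410.42 above is in the same units as 'price'.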
