In [1]:
import pandas as pd

In [2]:
import numpy as np

These cells import the packages we need.
pandas is used to read the dataset in .csv format; NumPy is used later to locate the best score.

In [3]:
car=pd.read_csv('train.csv')

The dataset is read into the DataFrame 'car'.

In [4]:
car.head()

Unnamed: 0,Weight_kg,Horsepower,Cylinders,Fuel_Type,Road_Type,Age_years,Mileage_km_per_l
0,1726,167,3,Diesel,Highway,23,14.3
1,2059,182,6,Petrol,City,12,12.0
2,1460,86,5,Diesel,Highway,12,18.16
3,1894,80,4,Petrol,City,7,12.0
4,1730,148,6,Petrol,City,11,12.0


In [5]:
car.shape

(20000, 7)

The dataset consists of 20000 instances/rows and 7 columns.

In [6]:
car.info

<bound method DataFrame.info of        Weight_kg  Horsepower  Cylinders Fuel_Type Road_Type  Age_years  \
0           1726         167          3    Diesel   Highway         23   
1           2059         182          6    Petrol      City         12   
2           1460          86          5    Diesel   Highway         12   
3           1894          80          4    Petrol      City          7   
4           1730         148          6    Petrol      City         11   
...          ...         ...        ...       ...       ...        ...   
19995       2265         177          3    Diesel      City         25   
19996       1170          98          4    Diesel      City          6   
19997       2359         220          5    Petrol      City          1   
19998       2384         151          6    Diesel      City         14   
19999        839         245          6    Petrol   Highway         20   

       Mileage_km_per_l  
0                 14.30  
1                 12.00  
2
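
Note that 'car.info' above is written without parentheses, so the cell shows the bound method (which prints the DataFrame itself) instead of the column summary. A minimal sketch of the difference, using a tiny made-up frame named 'demo' rather than the real dataset:

```python
import io
import pandas as pd

# Tiny stand-in frame; the values are for illustration only.
demo = pd.DataFrame({
    "Weight_kg": [1726, 2059, 1460],
    "Fuel_Type": ["Diesel", "Petrol", "Diesel"],
})

# Without parentheses we only get a reference to the method.
method_ref = demo.info
print(callable(method_ref))

# With parentheses the summary (dtypes, non-null counts) is produced.
# buf captures it as text so it can be inspected as well as printed.
buf = io.StringIO()
demo.info(buf=buf)
summary = buf.getvalue()
print(summary)
```

In the notebook itself, 'car.info()' would print the same kind of summary for the full dataset.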

In [7]:
x = car.drop(columns='Mileage_km_per_l')

This puts all the columns except Mileage_km_per_l into 'x'. Mileage_km_per_l is the output we want to predict.

In [8]:
y = car['Mileage_km_per_l']

This puts the output column, i.e. Mileage_km_per_l, into 'y'.

We will now train the model using 'scikit-learn'.

In [9]:
from sklearn.model_selection import train_test_split

To understand:
- Training set : Used to train the model.
- Testing set : Used to assess the model's performance on new, unseen data.

The 'train_test_split' function automates this process by randomly dividing the data into these subsets based on specified proportions.
Here the argument 'test_size = 0.2' means that 20% of the total dataset (train.csv) is randomly assigned to the testing set and the remaining 80% to the training set.
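
As a rough illustration of the proportions, here is a sketch on a hypothetical 10-row frame (the names 'demo', 'dx_train' and so on are made up for this example). With 'test_size=0.2', 2 rows go to the testing set and 8 to the training set:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the real dataset.
demo = pd.DataFrame({"feature": np.arange(10), "target": np.arange(10) * 2.0})
demo_x = demo[["feature"]]
demo_y = demo["target"]

# random_state fixes the shuffle so the same split is produced every run.
dx_train, dx_test, dy_train, dy_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=0
)

print(dx_train.shape, dx_test.shape)
```

Without 'random_state' the rows are shuffled differently on every run, so scores will vary slightly between runs.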

In [10]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

'OneHotEncoder' -- Many Machine Learning algorithms cannot work with categorical data directly.
                    This transforms each category into its own binary (0/1) column, making it easier for the algorithms to process.
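
A small sketch of what the encoding produces, using a made-up three-row column (the names 'demo' and 'demo_ohe' are only for this example):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up column with two categories.
demo = pd.DataFrame({"Fuel_Type": ["Diesel", "Petrol", "Diesel"]})

demo_ohe = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = demo_ohe.fit_transform(demo).toarray()

print(demo_ohe.categories_)  # categories found, in sorted order
print(encoded)               # one 0/1 column per category
```

Each row has a 1 in the column of its own category and 0 elsewhere.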

'r2_score' -- R^2 (coefficient of determination) is a statistical measure of the proportion of the variance in the target that the model explains.
            * R^2=1 : Regression predictions fit the data perfectly.
            * R^2=0 : Model explains none of the variability of the response data around its mean.
            * R^2<0 : Model fits worse than always predicting the mean.
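
To make the definition concrete, here is a sketch that computes R^2 by hand as 1 - SS_res / SS_tot and compares it with 'r2_score'. The numbers are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions.
y_true = np.array([12.0, 14.0, 16.0, 18.0])
y_hat = np.array([12.5, 13.5, 16.5, 17.5])

# Residual sum of squares: error left over after the model's predictions.
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_hat)

# Predicting the mean for every row gives R^2 = 0 by construction.
r2_mean_only = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_sklearn, r2_mean_only)
```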

'make_pipeline' -- used to chain a sequence of data transformers followed by a final estimator.
It simplifies the workflow by combining multiple steps, such as pre-processing, feature scaling and model training, into a single pipeline.


In [12]:
ohe = OneHotEncoder()
ohe.fit(x[['Cylinders','Fuel_Type','Road_Type']])

In our dataset, the categorical columns are:
- Cylinders (3, 4, 5, 6)
- Fuel_Type (Petrol, Diesel)
- Road_Type (City, Highway)

In [13]:
ohe.categories_

[array([3, 4, 5, 6]),
 array(['Diesel', 'Petrol'], dtype=object),
 array(['City', 'Highway'], dtype=object)]

In [14]:
column_trans = make_column_transformer((OneHotEncoder(categories = ohe.categories_),['Cylinders','Fuel_Type','Road_Type']),remainder = 'passthrough')

'make_column_transformer' -- applies different preprocessing to different columns of heterogeneous data, for example encoding only the categorical columns. The 'remainder' argument specifies what to do with the columns that are not listed: you can either 'drop' them or 'passthrough' them unchanged.

In our case, we one-hot encode the categorical columns and 'passthrough' the remaining columns.
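
A sketch of the effect on a made-up two-column frame. The names 'demo' and 'demo_trans' are hypothetical, and 'sparse_threshold=0' is set here only so that the result is always a dense array that is easy to print:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Made-up frame: one categorical column, one numeric column.
demo = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Petrol", "Diesel"],
    "Weight_kg": [1726, 2059, 1460],
})

demo_trans = make_column_transformer(
    (OneHotEncoder(), ["Fuel_Type"]),  # encode only this column
    remainder="passthrough",           # keep Weight_kg unchanged
    sparse_threshold=0,                # assumption: force dense output
)
out = demo_trans.fit_transform(demo)

# Encoded columns come first, passed-through columns are appended after them.
print(out)
```

With 'remainder="drop"' the Weight_kg column would be discarded and only the two encoded columns would remain.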

In [15]:
lr = LinearRegression()

Create a LinearRegression object named 'lr'.

In [16]:
pipe = make_pipeline(column_trans,lr)

'pipe' in the above code is the pipeline object. It bundles the preprocessing and the model, so it can be dumped to a file and reused directly, for example in a website.

In [17]:
pipe.fit(x_train,y_train)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



This is the representation of our pipeline.

In [18]:
y_pred=pipe.predict(x_test)

In [19]:
y_pred

array([14.66349631, 16.13443465, 10.63904477, ..., 13.19537919,
       15.71128306, 13.56809745])

In [20]:
r2_score(y_test,y_pred)

0.8037077413091894

A higher r2_score means the model explains more of the variance in the test data. Here it explains about 80%.

In [21]:
y_pred.shape

(4000,)

In [22]:
scores=[]
for i in range(1000):
    x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=i)
    lr = LinearRegression()
    pipe = make_pipeline(column_trans,lr)
    pipe.fit(x_train,y_train)
    y_pred = pipe.predict(x_test)
    scores.append(r2_score(y_test,y_pred))

In [23]:
np.argmax(scores)

np.int64(220)

In [24]:
scores[np.argmax(scores)]

0.81583210142894

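This is the maximum r2_score over the 1000 splits.

Note that the seed only changes which rows land in the testing set, so this picks the most favourable split rather than a better model. The typical score across splits is the more realistic estimate of accuracy.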
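
To see how much the score depends on the split alone, one option is to look at the mean and standard deviation of the scores next to the best one. This is a sketch on synthetic data from 'make_regression', not on the car dataset, and all names in it are made up for the example:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the car dataset.
demo_x, demo_y = make_regression(
    n_samples=200, n_features=3, noise=20.0, random_state=0
)

demo_scores = []
for seed in range(20):
    # Same model and data each time; only the split changes.
    xtr, xte, ytr, yte = train_test_split(
        demo_x, demo_y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(xtr, ytr)
    demo_scores.append(r2_score(yte, model.predict(xte)))

print("best:", max(demo_scores))
print("mean:", np.mean(demo_scores))
print("std :", np.std(demo_scores))
```

The same three numbers can be computed from the 'scores' list built in the notebook.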

In [25]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=np.argmax(scores))
lr = LinearRegression()
pipe = make_pipeline(column_trans,lr)
pipe.fit(x_train,y_train)
y_pred = pipe.predict(x_test)
r2_score(y_test,y_pred)

0.81583210142894

In [26]:
import pickle

'pickle' allows you to convert a Python object into a byte stream (serialization), and then reconstruct the original object from that byte stream (deserialization).

'pickle.dump(obj, file)' : Serialize the object obj and write it to the file.

In [27]:
with open('LinearRegression.pkl','wb') as f:
    pickle.dump(pipe,f)

Here we open a new file with the .pkl extension in 'wb' (write binary) mode and pass it to pickle.dump together with the 'pipe' object. This saves the pickled pipeline in our project folder, and the 'with' block closes the file afterwards.
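
Loading the file back works the same way with 'pickle.load' and 'rb' (read binary) mode. A minimal round-trip sketch, using a small stand-in model and a temporary file so that nothing in the project folder is overwritten (all names are made up for the example):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model fitted on y = 2x + 1; the notebook pickles the full pipeline.
demo_model = LinearRegression().fit(
    np.array([[0.0], [1.0], [2.0]]), np.array([1.0, 3.0, 5.0])
)

# Write to a temporary path so the example does not touch real files.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(demo_path, "wb") as f:
    pickle.dump(demo_model, f)

# Read the bytes back and rebuild the fitted model.
with open(demo_path, "rb") as f:
    restored = pickle.load(f)

restored_pred = restored.predict(np.array([[3.0]]))
print(restored_pred)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.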

The cell below shows the prediction for a single car.

In [28]:
pipe.predict(pd.DataFrame([[2460.00,148.00,4,'Petrol','Highway',4.0]],columns = ['Weight_kg','Horsepower','Cylinders','Fuel_Type','Road_Type','Age_years']))

array([13.60893539])