<h3>Luggage Bags Cost Prediction</h3>

The attached dataset contains data on roughly 160 different bags associated with ABC Industries (the file loaded below has 159 rows).

The bags have certain attributes which are described below:

1. Height – The height of the bag
2. Width – The width of the bag
3. Length – The length of the bag
4. Weight – The weight the bag can carry
5. Weight1 – Weight the bag can carry after expansion

The company now wants to predict the cost it should set for a new variant of these bags
based on the attributes above.

It therefore wants you to build a prediction model that estimates
the cost of a bag from its attributes.

The multiple linear regression equation has the form

Cost = B0 + B1(Height) + B2(Width) + B3(Length) + B4(Weight) + B5(Weight1)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
import math

#### Read the input into the data object

In [2]:
data = pd.read_csv(r'Data_miniproject.csv')

#### Display the head of the data

In [3]:
data.head()

    Cost  Weight  Weight1  Length   Height   Width
0  242.0    23.2     25.4    30.0    11.52    4.02
1  290.0    24.0     26.3    31.2    12.48  4.3056
2  340.0    23.9     26.5    31.1  12.3778  4.6961
3  363.0    26.3     29.0    33.5    12.73  4.4555
4  430.0    26.5     29.0    34.0   12.444   5.134


#### Display number of rows and columns in the data

In [4]:
data.shape

(159, 6)

#### Summarize each column: count, mean, standard deviation, min, quartiles (including the median) and max

In [5]:
data.describe()

             Cost     Weight    Weight1     Length     Height      Width
count       159.0      159.0      159.0      159.0      159.0      159.0
mean   398.326415   26.24717  28.415723  31.227044   8.970994   4.417486
std    357.978317   9.996441  10.716328  11.610246   4.286208   1.685804
min           0.0        7.5        8.4        8.8     1.7284     1.0476
25%         120.0      19.05       21.0      23.15     5.9448    3.38565
50%         273.0       25.2       27.3       29.4      7.786     4.2485
75%         650.0       32.7       35.5      39.65    12.3659     5.5845
max        1650.0       59.0       63.4       68.0     18.957      8.142


# Multiple Linear Regression by Solving the Normal Equation (OLS)

In [6]:
y = data.pop('Cost')          # Target: remove 'Cost' from data and keep it as y
X = data.copy()               # Features: the five remaining columns
column_names = X.columns
scaler = StandardScaler()     # Standardize each feature to zero mean and unit variance
X = pd.DataFrame(scaler.fit_transform(X),columns=column_names)

### Scaling is not required for Ordinary Least Squares: the closed-form solution gives the same predictions with or without standardized features; only the units of the coefficients change.

### With an iterative method such as Gradient Descent, feature scaling helps the optimizer converge faster.
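
The claim that scaling does not affect OLS predictions can be checked directly. The sketch below uses synthetic data as a stand-in for the bag features (the array names and generated values are assumptions, not the real dataset) and fits `LinearRegression` once on raw and once on standardized inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three bag features (assumption: not the real dataset).
rng = np.random.default_rng(0)
X_raw = rng.uniform(low=[1.0, 1.0, 8.0], high=[19.0, 8.0, 68.0], size=(50, 3))
y_demo = 5.0 + X_raw @ np.array([20.0, 35.0, 4.0]) + rng.normal(0.0, 10.0, size=50)

# Fit once on the raw features and once on the standardized features.
raw_model = LinearRegression().fit(X_raw, y_demo)
X_std = StandardScaler().fit_transform(X_raw)
std_model = LinearRegression().fit(X_std, y_demo)

# The fitted values agree; only the coefficients are expressed in different units.
same_predictions = np.allclose(raw_model.predict(X_raw), std_model.predict(X_std))
print("Predictions identical:", same_predictions)
print("Raw coefficients:     ", raw_model.coef_)
print("Scaled coefficients:  ", std_model.coef_)
```

One practical reason to standardize anyway is that the scaled coefficients are expressed per standard deviation of each feature, which makes their magnitudes easier to compare.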

In [7]:
reg = LinearRegression().fit(X,y)   # Using LinearRegression() and fitting the model

#### Obtaining the Coefficients 

In [8]:
print("Coefficient is: ",reg.coef_)

Coefficient is:  [ 621.36698553  -69.72252823 -335.9401089   120.90631145   37.76626234]


#### Obtaining the Intercept

In [9]:
print("Intercept is: ",reg.intercept_)

Intercept is:  398.32641509433967
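
The intercept printed above matches the mean of `Cost` reported by `data.describe()` (398.326415). That is expected: when every feature is standardized to zero mean, the OLS intercept equals the mean of the target. The sketch below illustrates this by solving the normal equation with NumPy on synthetic data; the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic standardized features and target (assumption: not the real dataset).
rng = np.random.default_rng(1)
X_demo = StandardScaler().fit_transform(rng.normal(size=(40, 5)))
true_coef = np.array([600.0, -70.0, -300.0, 120.0, 40.0])
y_demo = 400.0 + X_demo @ true_coef + rng.normal(0.0, 50.0, size=40)

# Design matrix with a leading column of ones for the intercept B0.
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Normal equation: (A^T A) beta = A^T y. Solve the system instead of inverting.
beta = np.linalg.solve(A.T @ A, A.T @ y_demo)

sk_model = LinearRegression().fit(X_demo, y_demo)
print("Intercept (normal equation):", beta[0])
print("Intercept (sklearn):        ", sk_model.intercept_)
print("Mean of target:             ", y_demo.mean())
```

For strongly correlated features, such as several length-like measurements, `A.T @ A` can be ill-conditioned; `np.linalg.lstsq` is a more numerically robust alternative.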


In [15]:
print(X)

       Weight   Weight1    Length    Height     Width
0   -0.305789 -0.282303 -0.106020  0.596579 -0.236529
1   -0.225507 -0.198054 -0.002337  0.821261 -0.066579
2   -0.235542 -0.179332 -0.010977  0.797341  0.165793
3    0.005302  0.054694  0.196390  0.879771  0.022621
4    0.025372  0.054694  0.239592  0.812835  0.426371
..        ...       ...       ...       ...       ...
154 -1.479903 -1.517960 -1.540309 -1.610359 -1.799403
155 -1.459833 -1.499238 -1.531669 -1.530878 -1.873547
156 -1.419692 -1.443072 -1.505748 -1.566687 -1.881402
157 -1.309305 -1.321378 -1.384784 -1.427243 -1.398568
158 -1.249094 -1.255851 -1.298381 -1.413341 -1.510440

[159 rows x 5 columns]


In [27]:
print(X.iloc[0].values.reshape(1,-1))

[[-0.30578858 -0.28230301 -0.10602023  0.59657867 -0.23652895]]


In [20]:
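# First row of the scaled X, copied by hand at 8 decimals (so predictions differ slightly from X.iloc[0])
X_pred = np.array([-0.30578858, -0.28230301, -0.10602023, 0.59657867, -0.23652895])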

In [21]:
X_pred_r = X_pred.reshape(1, -1)

In [28]:
reg.predict(X.iloc[0].values.reshape(1,-1))

array([326.81612777])

In [29]:
reg.predict(X_pred_r)

array([326.81612622])
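
The two predictions above differ in the sixth decimal only because `X_pred` was typed in from rounded values. In practice a new bag arrives in raw units, so it has to pass through the same fitted scaler before prediction. The following is a minimal sketch of that flow, assuming synthetic training data; `scaler_demo`, `model_demo` and the generated values are hypothetical, so the predicted cost will not match the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["Weight", "Weight1", "Length", "Height", "Width"]

# Synthetic training data (assumption), standing in for Data_miniproject.csv.
rng = np.random.default_rng(2)
X_train = pd.DataFrame(rng.uniform(5.0, 60.0, size=(30, 5)), columns=feature_names)
y_train = 10.0 * X_train["Weight"] + 5.0 * X_train["Height"] + rng.normal(0.0, 5.0, 30)

# Fit the scaler on the training features, then fit the model on scaled values.
scaler_demo = StandardScaler().fit(X_train)
model_demo = LinearRegression().fit(scaler_demo.transform(X_train), y_train)

# A new bag described in raw units (these are the first-row values from data.head()).
new_bag = pd.DataFrame([[23.2, 25.4, 30.0, 11.52, 4.02]], columns=feature_names)

# Reuse the scaling learned from the training data; never refit on new data.
new_bag_scaled = scaler_demo.transform(new_bag)
predicted_cost = model_demo.predict(new_bag_scaled)
print(predicted_cost)
```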

#### RMSE, RSE and R2 Score for OLS 

In [10]:
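y_hat = reg.predict(X)
mse = mean_squared_error(y, y_hat)    # sklearn's argument order is (y_true, y_pred)
ne_rmse = math.sqrt(mse)
# Note: 'rse' below is RSS / (n - 2), a residual variance estimate, not its square root.
# With 5 features plus an intercept, the usual degrees of freedom would be n - 6.
ne_rse = (ne_rmse**2)*X.shape[0]
ne_rse /= X.shape[0]-2
print('mse: ',mse)
print('rmse: ',ne_rmse)
print('rse: ',ne_rse)
print('r2 score: ',reg.score(X,y))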

mse:  14607.878944541948
rmse:  120.86305864300286
rse:  14793.966574408725
r2 score:  0.8852867046546207
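
The `rse` value printed above is the residual sum of squares divided by `n - 2`. The residual standard error is more commonly defined as the square root of RSS divided by `n - p - 1`, where `p` is the number of features. A small helper along those lines might look like this; the function name and the toy numbers are made up for illustration.

```python
import numpy as np

def residual_standard_error(y_true, y_pred, n_features):
    """Return sqrt(RSS / (n - p - 1)) for a linear model with an intercept."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)       # residual sum of squares
    dof = len(y_true) - n_features - 1         # observations minus fitted parameters
    if dof <= 0:
        raise ValueError("need more observations than fitted parameters")
    return float(np.sqrt(rss / dof))

# Tiny worked example with made-up numbers: RSS = 1.0 and dof = 3.
y_obs = [3.0, 5.0, 7.0, 9.0, 12.0]
y_fit = [3.5, 4.5, 7.0, 9.5, 11.5]
rse_demo = residual_standard_error(y_obs, y_fit, n_features=1)
print(rse_demo)  # sqrt(1/3), about 0.577
```

In this notebook the equivalent call would be `residual_standard_error(y, reg.predict(X), n_features=X.shape[1])`.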


In [11]:
import pickle

In [12]:
with open('regmodel.pkl','wb') as f:   # the context manager closes the file after writing
    pickle.dump(reg,f)

In [14]:
with open('regmodel.pkl','rb') as f:   # the context manager closes the file after reading
    pickled_model = pickle.load(f)

In [23]:
pickled_model.predict(X_pred_r)

array([326.81612622])
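
The pickled file holds only the regression model, so whoever loads it must also reproduce the exact scaling before calling `predict`. One possible way to avoid that is to bundle the scaler and the model in a scikit-learn `Pipeline` and pickle the pipeline. The sketch below assumes synthetic data, and `pipeline_demo.pkl` is a hypothetical file name.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic raw features and target (assumption: not the real dataset).
rng = np.random.default_rng(3)
X_raw = rng.uniform(1.0, 60.0, size=(40, 5))
y_demo = X_raw @ np.array([12.0, -2.0, 3.0, 8.0, 20.0]) + rng.normal(0.0, 5.0, 40)

# Scaling and regression travel together as a single estimator.
pipeline = make_pipeline(StandardScaler(), LinearRegression()).fit(X_raw, y_demo)

# Save and reload inside a temporary directory to keep the demo tidy.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_demo.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline accepts raw (unscaled) measurements directly.
new_bag = np.array([[23.2, 25.4, 30.0, 11.52, 4.02]])
print(restored.predict(new_bag))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.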