# TIME SERIES ANALYSIS - STORE SALES ANALYSIS AND PREDICTION

## PROJECT DESCRIPTION
A time series is a sequence of data points recorded in chronological order. It captures the observation of one or more variables at regular or irregular intervals. Time series data can come from many sources, such as stock prices, weather conditions, population statistics, economic indicators, or sensor readings. In this project, we predict store sales using data from Corporation Favorita, a large Ecuador-based grocery retailer.

To achieve our goal, we will employ a combination of statistical techniques and machine learning algorithms specifically designed for time series analysis.

By leveraging these analytical tools, we aim to make accurate predictions and provide valuable insights into the future behaviour of the time series. The outcomes of this project will not only enhance our understanding of the underlying dynamics of the dataset but also enable us to make informed decisions and formulate effective strategies based on the predicted values.

The project will follow a systematic approach, encompassing data preprocessing, exploratory data analysis, model selection, parameter tuning, and evaluation. Throughout the process, we will adhere to the industry-standard methodology, the CRISP-DM framework, to ensure a structured and reliable analysis.

By the end of this project, we anticipate obtaining robust and reliable forecasting models that can be applied to future time periods, enabling us to make data-driven decisions, optimize resource allocation, and achieve improved performance in the relevant domain.

## Import Required Packages

In [1]:
# Data Handling tools
import numpy as np
import pandas as pd
from scipy import stats
import pyodbc
from dotenv import dotenv_values #import the dotenv_values function from the dotenv package
import warnings

# Machine Learning tools
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from sklearn.model_selection import train_test_split
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from category_encoders.binary import BinaryEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor
from xgboost import XGBRegressor

# Visualization tools
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm 

# Evaluation metrics
from sklearn.metrics import mean_squared_error, mean_squared_log_error

warnings.filterwarnings('ignore')

### Data Loading
This section loads the datasets.

In [2]:
df_train = pd.read_csv('C:/Users/user/P4_Analysis/Data/train.csv')
df_stores = pd.read_csv('C:/Users/user/P4_Analysis/Data/stores.csv')
df_trans = pd.read_csv('C:/Users/user/P4_Analysis/Data/transactions.csv')
df_oil = pd.read_csv('C:/Users/user/P4_Analysis/Data/oil.csv')
df_ss = pd.read_csv('C:/Users/user/P4_Analysis/Data/submission.csv')
df_test = pd.read_csv('C:/Users/user/P4_Analysis/Data/test.csv')
df_holi = pd.read_csv('C:/Users/user/P4_Analysis/Data/holidays_events.csv')

#### Oil dataset

In [3]:
df_oil.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1175 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB


In [4]:
df_oil.isnull().sum()

date           0
dcoilwtico    43
dtype: int64

In [5]:
# Backfill missing oil prices (fillna(method=...) is deprecated in recent pandas versions)
df_oil['dcoilwtico'] = df_oil['dcoilwtico'].bfill()
df_oil.isnull().sum()

date          0
dcoilwtico    0
dtype: int64
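
Backfilling copies the next available observation backward, so every filled value reflects a price from a later date. The sketch below contrasts backward and forward filling on a small, hypothetical series (the values are made up for illustration); it is not a recommendation for one method over the other, only a way to see what each one does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices; the NaNs stand in for the gaps in dcoilwtico.
oil = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=5, freq="D"),
    "dcoilwtico": [np.nan, 93.0, np.nan, 95.0, 96.0],
})

# Backward fill: each gap takes the NEXT known value (uses later information).
oil["bfill"] = oil["dcoilwtico"].bfill()

# Forward fill: each gap takes the PREVIOUS known value; a leading gap stays NaN.
oil["ffill"] = oil["dcoilwtico"].ffill()

# Forward fill first, then backward fill only for what remains at the start.
oil["ffill_then_bfill"] = oil["dcoilwtico"].ffill().bfill()
```

Forward filling alone cannot fill a gap in the first row, which may be one reason to prefer backfill or a combination of the two.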

#### Converting the 'date' column in the datasets to datetime format

In [6]:
# Train dataset
df_train['date'] = pd.to_datetime(df_train['date'])

# Test dataset
df_test['date'] = pd.to_datetime(df_test['date'])

# Holiday Events dataset
df_holi['date'] = pd.to_datetime(df_holi['date'])

# Oil dataset
df_oil['date'] = pd.to_datetime(df_oil['date'])

# Transactions dataset
df_trans['date'] = pd.to_datetime(df_trans['date'])

#### Checking completeness of train dataset

In [7]:
# Check the completeness of the train dataset
min_date = df_train['date'].min()
max_date = df_train['date'].max()
expected_dates = pd.date_range(start=min_date, end=max_date)

missing_dates = expected_dates[~expected_dates.isin(df_train['date'])]

if len(missing_dates) == 0:
    print("The train dataset is complete. It includes all the required dates.")
else:
    print("The train dataset is incomplete. The following dates are missing:")
    print(missing_dates)

The train dataset is incomplete. The following dates are missing:
DatetimeIndex(['2013-12-25', '2014-12-25', '2015-12-25', '2016-12-25'], dtype='datetime64[ns]', freq=None)


In [8]:
# Complete the missing dates in the train dataset
# Create an index of the missing dates as a DatetimeIndex object
missing_dates = pd.Index(['2013-12-25', '2014-12-25', '2015-12-25', '2016-12-25'], dtype='datetime64[ns]')

# Create a DataFrame with the missing dates, using the 'date' column
missing_data = pd.DataFrame({'date': missing_dates})

# Concatenate the original train dataset and the missing data DataFrame
# ignore_index=True ensures a new index is assigned to the resulting DataFrame
df_train = pd.concat([df_train, missing_data], ignore_index=True)

# Sort the DataFrame based on the 'date' column in ascending order
df_train.sort_values('date', inplace=True)
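
The cell above hardcodes the four missing dates. As a sketch under the assumption that the same gaps should be filled whatever they happen to be, the missing days can instead be derived from the data itself. The toy frame and the names `train`, `expected`, `missing`, and `filled` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one calendar day (2013-12-25) absent.
train = pd.DataFrame({
    "date": pd.to_datetime(["2013-12-24", "2013-12-26", "2013-12-27"]),
    "sales": [10.0, 12.0, 9.0],
})

# Every calendar day between the first and last observation.
expected = pd.date_range(train["date"].min(), train["date"].max(), freq="D")

# Days in the expected range that never occur in the data.
missing = expected.difference(pd.DatetimeIndex(train["date"].unique()))

# Append one placeholder row per missing day; the other columns become NaN.
filled = (
    pd.concat([train, pd.DataFrame({"date": missing})], ignore_index=True)
    .sort_values("date")
    .reset_index(drop=True)
)
```

The placeholder rows carry NaN in every column except `date`. Integer columns such as `id` and `store_nbr` therefore become floats, and a later inner join on `store_nbr` would discard these rows again.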

In [9]:
# Check the completeness of the train dataset
min_date = df_train['date'].min()
max_date = df_train['date'].max()
expected_dates = pd.date_range(start=min_date, end=max_date)

missing_dates = expected_dates[~expected_dates.isin(df_train['date'])]

if len(missing_dates) == 0:
    print("The train dataset is complete. It includes all the required dates.")
else:
    print("The train dataset is incomplete. The following dates are missing:")
    print(missing_dates)

The train dataset is complete. It includes all the required dates.


### Combine all the datasets into one

In [10]:
# Merge df_train with df_stores on the 'store_nbr' column
merged_df1 = df_train.merge(df_stores, on='store_nbr', how='inner')

# Merge the result with df_trans on the 'date' and 'store_nbr' columns
merged_df2 = merged_df1.merge(df_trans, on=['date', 'store_nbr'], how='inner')

# Merge the result with df_holi on the 'date' column
# (an inner join keeps only dates that appear in the holidays table)
merged_df3 = merged_df2.merge(df_holi, on='date', how='inner')

# Merge the result with df_oil on the 'date' column
finaldata = merged_df3.merge(df_oil, on='date', how='inner')

# View the first five rows of the merged dataset
finaldata.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type_x,cluster,transactions,type_y,locale,locale_name,description,transferred,dcoilwtico
0,73062.0,2013-02-11,1.0,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13,396,Holiday,National,Ecuador,Carnaval,False,97.01
1,73085.0,2013-02-11,1.0,MAGAZINES,0.0,0.0,Quito,Pichincha,D,13,396,Holiday,National,Ecuador,Carnaval,False,97.01
2,73084.0,2013-02-11,1.0,"LIQUOR,WINE,BEER",21.0,0.0,Quito,Pichincha,D,13,396,Holiday,National,Ecuador,Carnaval,False,97.01
3,73083.0,2013-02-11,1.0,LINGERIE,0.0,0.0,Quito,Pichincha,D,13,396,Holiday,National,Ecuador,Carnaval,False,97.01
4,73082.0,2013-02-11,1.0,LAWN AND GARDEN,3.0,0.0,Quito,Pichincha,D,13,396,Holiday,National,Ecuador,Carnaval,False,97.01
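
Each `how='inner'` merge keeps only rows whose keys appear in both frames, so joining with the holidays table restricts the result to dates listed in that table. The sketch below, on hypothetical toy data, contrasts this with a left join that keeps every sales row. The `"Work Day"` label for unmatched days is an assumed placeholder, not something the notebook defines.

```python
import pandas as pd

# Hypothetical toy data: three sales days, only one of which is a holiday.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2013-02-10", "2013-02-11", "2013-02-12"]),
    "sales": [5.0, 21.0, 3.0],
})
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2013-02-11"]),
    "type": ["Holiday"],
})

# Inner join: keeps only dates present in BOTH frames.
inner = sales.merge(holidays, on="date", how="inner")

# Left join: keeps every sales row; indicator marks where a match was found.
left = sales.merge(holidays, on="date", how="left", indicator=True)

# Label unmatched days explicitly ("Work Day" is an assumed placeholder label).
left["type"] = left["type"].fillna("Work Day")
```

Comparing `len(inner)` with `len(left)` shows how many rows an inner join silently drops.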


#### Rechecking that the data is complete

In [11]:
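# Check the completeness of the merged dataset
min_date = finaldata['date'].min()
max_date = finaldata['date'].max()
expected_dates = pd.date_range(start=min_date, end=max_date)

# Note: membership is tested against df_train['date'], not finaldata['date'],
# so this confirms the train dates only; dates dropped by the inner merges are not detected.
missing_dates = expected_dates[~expected_dates.isin(df_train['date'])]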

if len(missing_dates) == 0:
    print("The train dataset is complete. It includes all the required dates.")
else:
    print("The train dataset is incomplete. The following dates are missing:")
    print(missing_dates)

The train dataset is complete. It includes all the required dates.
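
The recheck above takes its date range from `finaldata` but tests membership against `df_train['date']`. A small helper that receives the frame to be checked avoids mixing the two. This is a minimal sketch; `find_missing_dates` and the toy frame are hypothetical names introduced here.

```python
import pandas as pd

def find_missing_dates(df: pd.DataFrame, date_col: str = "date") -> pd.DatetimeIndex:
    """Return calendar days between the first and last date that are absent from df."""
    dates = pd.DatetimeIndex(pd.to_datetime(df[date_col]).unique())
    expected = pd.date_range(dates.min(), dates.max(), freq="D")
    return expected.difference(dates)

# Hypothetical merged frame that retains only two non-adjacent days.
merged = pd.DataFrame({"date": pd.to_datetime(["2013-02-11", "2013-02-13"])})
gaps = find_missing_dates(merged)
```

Calling the helper on the merged frame would report any dates lost during the joins.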


In [12]:
finaldata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322047 entries, 0 to 322046
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   id            322047 non-null  float64       
 1   date          322047 non-null  datetime64[ns]
 2   store_nbr     322047 non-null  float64       
 3   family        322047 non-null  object        
 4   sales         322047 non-null  float64       
 5   onpromotion   322047 non-null  float64       
 6   city          322047 non-null  object        
 7   state         322047 non-null  object        
 8   type_x        322047 non-null  object        
 9   cluster       322047 non-null  int64         
 10  transactions  322047 non-null  int64         
 11  type_y        322047 non-null  object        
 12  locale        322047 non-null  object        
 13  locale_name   322047 non-null  object        
 14  description   322047 non-null  object        
 15  transferred   322

In [13]:
finaldata.isnull().sum()

id              0
date            0
store_nbr       0
family          0
sales           0
onpromotion     0
city            0
state           0
type_x          0
cluster         0
transactions    0
type_y          0
locale          0
locale_name     0
description     0
transferred     0
dcoilwtico      0
dtype: int64

In [14]:
# Rename the columns with more descriptive names
finaldata = finaldata.rename(columns={"type_x": "store_type", "type_y": "holiday_type","dcoilwtico":"oil_price" })
finaldata.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,store_type,cluster,transactions,holiday_type,locale,locale_name,description,transferred,oil_price
0,73062.0,2013-02-11,1.0,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13,396,Holiday,National,Ecuador,Carnaval,False,97.01
1,73085.0,2013-02-11,1.0,MAGAZINES,0.0,0.0,Quito,Pichincha,D,13,396,Holiday,National,Ecuador,Carnaval,False,97.01
2,73084.0,2013-02-11,1.0,"LIQUOR,WINE,BEER",21.0,0.0,Quito,Pichincha,D,13,396,Holiday,National,Ecuador,Carnaval,False,97.01
3,73083.0,2013-02-11,1.0,LINGERIE,0.0,0.0,Quito,Pichincha,D,13,396,Holiday,National,Ecuador,Carnaval,False,97.01
4,73082.0,2013-02-11,1.0,LAWN AND GARDEN,3.0,0.0,Quito,Pichincha,D,13,396,Holiday,National,Ecuador,Carnaval,False,97.01


## Feature engineering and feature scaling

In [15]:
# Ensure 'date' is a datetime column, then derive calendar features from it

finaldata.date = pd.to_datetime(finaldata.date)

finaldata['year'] = finaldata.date.dt.year

finaldata['month'] = finaldata.date.dt.month

finaldata['dayofmonth'] = finaldata.date.dt.day

finaldata['dayofweek'] = finaldata.date.dt.dayofweek

finaldata.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,store_type,cluster,...,holiday_type,locale,locale_name,description,transferred,oil_price,year,month,dayofmonth,dayofweek
0,73062.0,2013-02-11,1.0,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13,...,Holiday,National,Ecuador,Carnaval,False,97.01,2013,2,11,0
1,73085.0,2013-02-11,1.0,MAGAZINES,0.0,0.0,Quito,Pichincha,D,13,...,Holiday,National,Ecuador,Carnaval,False,97.01,2013,2,11,0
2,73084.0,2013-02-11,1.0,"LIQUOR,WINE,BEER",21.0,0.0,Quito,Pichincha,D,13,...,Holiday,National,Ecuador,Carnaval,False,97.01,2013,2,11,0
3,73083.0,2013-02-11,1.0,LINGERIE,0.0,0.0,Quito,Pichincha,D,13,...,Holiday,National,Ecuador,Carnaval,False,97.01,2013,2,11,0
4,73082.0,2013-02-11,1.0,LAWN AND GARDEN,3.0,0.0,Quito,Pichincha,D,13,...,Holiday,National,Ecuador,Carnaval,False,97.01,2013,2,11,0


In [16]:
#drop unnecessary columns
finaldata.drop(columns=['id','locale', 'locale_name', 'description', 'transferred', 'state',  'store_type'], inplace=True)

In [17]:
# set the date column as the index
finaldata.set_index('date', inplace=True)

In [18]:
finaldata.head()

date,store_nbr,family,sales,onpromotion,city,cluster,transactions,holiday_type,oil_price,year,month,dayofmonth,dayofweek
2013-02-11,1.0,AUTOMOTIVE,0.0,0.0,Quito,13,396,Holiday,97.01,2013,2,11,0
2013-02-11,1.0,MAGAZINES,0.0,0.0,Quito,13,396,Holiday,97.01,2013,2,11,0
2013-02-11,1.0,"LIQUOR,WINE,BEER",21.0,0.0,Quito,13,396,Holiday,97.01,2013,2,11,0
2013-02-11,1.0,LINGERIE,0.0,0.0,Quito,13,396,Holiday,97.01,2013,2,11,0
2013-02-11,1.0,LAWN AND GARDEN,3.0,0.0,Quito,13,396,Holiday,97.01,2013,2,11,0


In [19]:

final_data = finaldata.copy()

##### Features Encoding

In [20]:
# Select the categorical columns
categorical_columns = ["family", "city", "holiday_type"]
categorical_data = final_data[categorical_columns]
columns = list(final_data.columns)
print(columns)

['store_nbr', 'family', 'sales', 'onpromotion', 'city', 'cluster', 'transactions', 'holiday_type', 'oil_price', 'year', 'month', 'dayofmonth', 'dayofweek']


In [21]:
numerical_columns = [i for i in columns if i not in categorical_columns]

In [22]:
numerical_columns.remove('sales')
print(numerical_columns)

['store_nbr', 'onpromotion', 'cluster', 'transactions', 'oil_price', 'year', 'month', 'dayofmonth', 'dayofweek']


In [23]:
# Instantiate a BinaryEncoder transformer
encoder = BinaryEncoder(drop_invariant=False, return_df=True,)

# Apply the encoder on the categorical data
binary_encoded = encoder.fit(final_data[categorical_columns])

binary_encoded
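
Binary encoding first maps each category to an integer and then writes that integer in binary, one column per bit, so a column with k categories needs roughly log2(k) columns instead of k. The sketch below illustrates the idea with plain pandas. It is not the `category_encoders` implementation, and the library's exact bit assignment may differ; `binary_encode` is a hypothetical helper.

```python
import pandas as pd

def binary_encode(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Illustrative binary encoding: category -> integer code -> binary digits."""
    categories = pd.unique(series)
    # Codes start at 1 in this sketch, leaving 0 free for unseen values (an assumption).
    codes = series.map({cat: i + 1 for i, cat in enumerate(categories)})
    # Number of bits needed to represent the largest code.
    n_bits = max(1, len(categories).bit_length())
    bits = {
        f"{prefix}_{b}": (codes // 2 ** (n_bits - 1 - b)) % 2
        for b in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index).astype(int)

toy = pd.Series(["AUTOMOTIVE", "MAGAZINES", "LINGERIE", "AUTOMOTIVE"])
encoded = binary_encode(toy, "family")
```

Three categories fit in two bit columns here. In the same way, the three categorical columns in this notebook expand into 14 binary columns (`family_0` to `family_5`, `city_0` to `city_4`, `holiday_type_0` to `holiday_type_2`).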


##### Feature Scaling

In [24]:
final_data[numerical_columns].head()

date,store_nbr,onpromotion,cluster,transactions,oil_price,year,month,dayofmonth,dayofweek
2013-02-11,1.0,0.0,13,396,97.01,2013,2,11,0
2013-02-11,1.0,0.0,13,396,97.01,2013,2,11,0
2013-02-11,1.0,0.0,13,396,97.01,2013,2,11,0
2013-02-11,1.0,0.0,13,396,97.01,2013,2,11,0
2013-02-11,1.0,0.0,13,396,97.01,2013,2,11,0


In [25]:
# create an instance of StandardScaler
scaler = StandardScaler()

scaler.set_output(transform="pandas")

# fit the scaler on the numerical columns (the transform is applied in the next cell)
scale_nums = scaler.fit(final_data[numerical_columns])
scale_nums

In [26]:
# transform the numerical and categorical columns
scale_nums = scaler.transform(final_data[numerical_columns])
binary_encoded = encoder.transform(final_data[categorical_columns])
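
Standard scaling subtracts a column's mean and divides by its standard deviation. The cells above learn those statistics from the full dataset before it is split; a common alternative is to learn them from the training portion only and reuse them on later rows. The NumPy sketch below shows the arithmetic under that assumption, with hypothetical values.

```python
import numpy as np

# Hypothetical feature matrix: rows are ordered in time, columns are numeric features.
X = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 120.0], [4.0, 400.0]])
X_fit, X_new = X[:3], X[3:]

# "fit": learn per-column mean and population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
mean = X_fit.mean(axis=0)
std = X_fit.std(axis=0, ddof=0)

# "transform": apply the SAME statistics to both old and new rows.
Z_fit = (X_fit - mean) / std
Z_new = (X_new - mean) / std
```

The rows used for fitting end up with mean 0 and unit variance, while new rows are expressed relative to the training distribution.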

In [27]:

final_data_encoded = pd.concat([scale_nums, binary_encoded, final_data.sales], axis=1)
final_data_encoded.head()

date,store_nbr,onpromotion,cluster,transactions,oil_price,year,month,dayofmonth,dayofweek,family_0,...,family_5,city_0,city_1,city_2,city_3,city_4,holiday_type_0,holiday_type_1,holiday_type_2,sales
2013-02-11,-1.666843,-0.240273,0.948024,-1.273993,1.342694,-1.613649,-1.662104,-0.40364,-1.336086,0,...,1,0,0,0,0,1,0,0,1,0.0
2013-02-11,-1.666843,-0.240273,0.948024,-1.273993,1.342694,-1.613649,-1.662104,-0.40364,-1.336086,0,...,0,0,0,0,0,1,0,0,1,0.0
2013-02-11,-1.666843,-0.240273,0.948024,-1.273993,1.342694,-1.613649,-1.662104,-0.40364,-1.336086,0,...,1,0,0,0,0,1,0,0,1,21.0
2013-02-11,-1.666843,-0.240273,0.948024,-1.273993,1.342694,-1.613649,-1.662104,-0.40364,-1.336086,0,...,0,0,0,0,0,1,0,0,1,0.0
2013-02-11,-1.666843,-0.240273,0.948024,-1.273993,1.342694,-1.613649,-1.662104,-0.40364,-1.336086,0,...,1,0,0,0,0,1,0,0,1,3.0


##### Data Splitting

In [28]:
# Make a copy of the final_data_encoded as data
onedata = final_data_encoded.copy()
onedata

date,store_nbr,onpromotion,cluster,transactions,oil_price,year,month,dayofmonth,dayofweek,family_0,...,family_5,city_0,city_1,city_2,city_3,city_4,holiday_type_0,holiday_type_1,holiday_type_2,sales
2013-02-11,-1.666843,-0.240273,0.948024,-1.273993,1.342694,-1.613649,-1.662104,-0.403640,-1.336086,0,...,1,0,0,0,0,1,0,0,1,0.00000
2013-02-11,-1.666843,-0.240273,0.948024,-1.273993,1.342694,-1.613649,-1.662104,-0.403640,-1.336086,0,...,0,0,0,0,0,1,0,0,1,0.00000
2013-02-11,-1.666843,-0.240273,0.948024,-1.273993,1.342694,-1.613649,-1.662104,-0.403640,-1.336086,0,...,1,0,0,0,0,1,0,0,1,21.00000
2013-02-11,-1.666843,-0.240273,0.948024,-1.273993,1.342694,-1.613649,-1.662104,-0.403640,-1.336086,0,...,0,0,0,0,0,1,0,0,1,0.00000
2013-02-11,-1.666843,-0.240273,0.948024,-1.273993,1.342694,-1.613649,-1.662104,-0.403640,-1.336086,0,...,1,0,0,0,0,1,0,0,1,3.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-01-01,-0.127903,-0.240273,-1.597692,0.445461,-0.439190,-0.018471,-1.970502,-1.491163,0.672159,0,...,0,1,0,0,1,1,0,0,1,0.00000
2015-01-01,-0.127903,-0.240273,-1.597692,0.445461,-0.439190,-0.018471,-1.970502,-1.491163,0.672159,0,...,1,1,0,0,1,1,0,0,1,105.00000
2015-01-01,-0.127903,-0.240273,-1.597692,0.445461,-0.439190,-0.018471,-1.970502,-1.491163,0.672159,1,...,0,1,0,0,1,1,0,0,1,121.94100
2015-01-01,-0.127903,-0.240273,-1.597692,0.445461,-0.439190,-0.018471,-1.970502,-1.491163,0.672159,1,...,1,1,0,0,1,1,0,0,1,279.16998


In [29]:
# Create the feature dataframe using the selected columns
X = onedata.drop(["sales"], axis=1)

# Get the target variable
y = onedata.sales

# Split the data into training and testing sets
# (random 80/20 split: rows are shuffled, so test dates overlap with training dates)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
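
`train_test_split` shuffles rows by default, so the test set contains dates that also occur in the training set. For forecasting, a chronological hold-out is a common alternative: train on earlier dates and evaluate on later ones. The sketch below assumes a date-indexed frame like `onedata`; the toy data and the 80% cutoff are illustrative choices.

```python
import pandas as pd

# Hypothetical toy frame indexed by date, several rows per day as in the merged data.
dates = pd.to_datetime(
    ["2013-01-01"] * 2 + ["2013-01-02"] * 2 + ["2013-01-03"] * 2
    + ["2013-01-04"] * 2 + ["2013-01-05"] * 2
)
data = pd.DataFrame(
    {"feature": range(10), "sales": range(10, 20)}, index=dates
).sort_index()

# Use the last ~20% of distinct days as the hold-out period.
unique_days = data.index.unique().sort_values()
cutoff = unique_days[int(len(unique_days) * 0.8)]

train = data[data.index < cutoff]
test = data[data.index >= cutoff]

X_train, y_train = train.drop(columns="sales"), train["sales"]
X_test, y_test = test.drop(columns="sales"), test["sales"]
```

Splitting on distinct days rather than on row counts keeps all rows of a given day on the same side of the cutoff.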

In [30]:

# Cross-check for null values in the test features

X_test.isnull().sum()

store_nbr         0
onpromotion       0
cluster           0
transactions      0
oil_price         0
year              0
month             0
dayofmonth        0
dayofweek         0
family_0          0
family_1          0
family_2          0
family_3          0
family_4          0
family_5          0
city_0            0
city_1            0
city_2            0
city_3            0
city_4            0
holiday_type_0    0
holiday_type_1    0
holiday_type_2    0
dtype: int64

### Decision Tree Regression Model

#### Train the Model

In [31]:
# Decision Tree Regression Model
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)

# Make prediction on X_test
dt_pred = dt.predict(X_test)

#### Decision Tree Regression evaluation metrics

In [32]:
# MSLE is undefined for negative values, so take absolute values of the targets and predictions
y_test_abs = abs(y_test)
dt_pred_abs = abs(dt_pred)

In [33]:
# Evaluate our models
mse = mean_squared_error(y_test, dt_pred)
msle = mean_squared_log_error(y_test_abs, dt_pred_abs)
rmse = np.sqrt(mse).round(2)
rmsle = np.sqrt((msle)).round(2)

dt_results = pd.DataFrame([['Decision Tree', mse, msle, rmse, rmsle]], columns = ['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])
dt_results

Unnamed: 0,Model,MSE,MSLE,RMSE,RMSLE
0,Decision Tree,389224.688096,0.345161,623.88,0.59
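
RMSLE is the square root of the mean of (log(1 + prediction) − log(1 + actual))², so it requires non-negative inputs. Taking `abs()` turns a prediction of -50 into 50. Clipping negative predictions to zero is another option that may suit a quantity like sales, which cannot be negative. The NumPy sketch below uses clipping; `rmsle` is a hypothetical helper and the values are made up.

```python
import numpy as np

def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error, clipping negative predictions to zero."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip instead of abs(): a prediction of -50 becomes 0, not 50.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical values, including one negative prediction.
y_true = [0.0, 21.0, 3.0]
y_pred = [0.0, 21.0, -50.0]
score = rmsle(y_true, y_pred)
```

Because the two treatments score negative predictions differently, models that produce them (such as linear regression or gradient boosting) can rank differently depending on which one is used.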


### XGBoost Model

In [34]:
# XGBoost Model
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)

#### Evaluation of the XGBoost Model

In [35]:
# MSLE is undefined for negative values, so take absolute values of the targets and predictions
y_test_abs = abs(y_test)
xgb_pred_abs = abs(xgb_pred)

In [36]:
# Evaluate our models
mse = mean_squared_error(y_test, xgb_pred)
msle = mean_squared_log_error(y_test_abs, xgb_pred_abs)
rmse = np.sqrt(mse).round(2)
rmsle = np.sqrt(msle).round(2)


# Create a DataFrame for the current model's results
model_results = pd.DataFrame([['XGBoost', mse, msle, rmse, rmsle]], columns=['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])

# Display the results DataFrame
model_results

Unnamed: 0,Model,MSE,MSLE,RMSE,RMSLE
0,XGBoost,370252.616163,6.096088,608.48,2.47


### Linear Regression Model

In [37]:
# Linear Regression Model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make prediction on X_test
lr_pred = lr.predict(X_test)

#### Evaluation Metrics for Linear Regression

In [38]:
# MSLE is undefined for negative values, so take absolute values of the targets and predictions
y_test_abs = abs(y_test)
lr_pred_abs = abs(lr_pred)

In [39]:
# Evaluate our models
mse = mean_squared_error(y_test, lr_pred)
msle = mean_squared_log_error(y_test_abs, lr_pred_abs)
rmse = np.sqrt(mse).round(2)
rmsle = np.sqrt(msle).round(2)

lr_results = pd.DataFrame([['Linear Regression', mse, msle, rmse, rmsle]], columns = ['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])
lr_results

Unnamed: 0,Model,MSE,MSLE,RMSE,RMSLE
0,Linear Regression,1244209.0,11.881793,1115.44,3.45


### Random Forest Regression Model

In [40]:
# Random Forest Regression Model
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

# Make prediction on X_test
rf_pred = rf.predict(X_test)

In [41]:
# MSLE is undefined for negative values, so take absolute values of the targets and predictions
y_test_abs = abs(y_test)
rf_pred_abs = abs(rf_pred)

In [42]:
# Evaluate our models
mse = mean_squared_error(y_test, rf_pred)
msle = mean_squared_log_error(y_test_abs, rf_pred_abs)
rmse = np.sqrt(mse).round(2)
rmsle = np.sqrt(msle).round(2)

rf_results = pd.DataFrame([['Random Forest', mse, msle, rmse, rmsle]], columns = ['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])
rf_results

Unnamed: 0,Model,MSE,MSLE,RMSE,RMSLE
0,Random Forest,297248.537792,0.276875,545.21,0.53


#### Results overview

In [43]:
print(dt_results)
print(model_results)
print(lr_results)
print(rf_results)

           Model            MSE      MSLE    RMSE  RMSLE
0  Decision Tree  389224.688096  0.345161  623.88   0.59
     Model            MSE      MSLE    RMSE  RMSLE
0  XGBoost  370252.616163  6.096088  608.48   2.47
               Model           MSE       MSLE     RMSE  RMSLE
0  Linear Regression  1.244209e+06  11.881793  1115.44   3.45
           Model            MSE      MSLE    RMSE  RMSLE
0  Random Forest  297248.537792  0.276875  545.21   0.53
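
Printing four one-row frames makes the models hard to compare at a glance. As a sketch, the rows can be stacked into one table and ranked. The metric values are copied from the outputs above; `xgb_results` is a hypothetical name for the frame the notebook calls `model_results`.

```python
import pandas as pd

columns = ["Model", "MSE", "MSLE", "RMSE", "RMSLE"]

# Values copied from the four result tables above.
dt_results = pd.DataFrame([["Decision Tree", 389224.688096, 0.345161, 623.88, 0.59]], columns=columns)
xgb_results = pd.DataFrame([["XGBoost", 370252.616163, 6.096088, 608.48, 2.47]], columns=columns)
lr_results = pd.DataFrame([["Linear Regression", 1244209.0, 11.881793, 1115.44, 3.45]], columns=columns)
rf_results = pd.DataFrame([["Random Forest", 297248.537792, 0.276875, 545.21, 0.53]], columns=columns)

# Stack the one-row frames and rank by RMSLE (lower is better).
comparison = (
    pd.concat([dt_results, xgb_results, lr_results, rf_results], ignore_index=True)
    .sort_values("RMSLE")
    .reset_index(drop=True)
)
```

Random Forest ranks first on both RMSE and RMSLE. XGBoost is second on RMSE but third on RMSLE, behind the Decision Tree.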


In [44]:
# Creating a dictionary of objects to export
exports = {"encoder": encoder,
           "scaler":scaler,
           "model":rf}

In [45]:
import pickle

In [46]:
# Exporting the dictionary with Pickle
with open("streamlit_toolkit", "wb") as file:
    pickle.dump(exports, file)

In [47]:
# Save the trained Random Forest model to a file
with open('random_forest_model.pkl', 'wb') as model_file:
    pickle.dump(rf, model_file)

In [48]:
# Load the saved Random Forest model from a file
with open('random_forest_model.pkl', 'rb') as model_file:
    loaded_rf_model = pickle.load(model_file)

# Now 'loaded_rf_model' contains your trained Random Forest model
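
The exported dictionary bundles the encoder, scaler, and model so that an application can repeat the same preprocessing before predicting. The helper below is a minimal sketch of that flow; `predict_sales` is a hypothetical name, and it assumes both transformers return DataFrames, as configured in this notebook.

```python
import pandas as pd

def predict_sales(toolkit, raw, numerical_columns, categorical_columns):
    """Apply the exported scaler, encoder, and model to raw feature rows.

    `toolkit` is the dictionary with keys "scaler", "encoder", and "model";
    `raw` is shaped like final_data without the 'sales' column.
    """
    # Re-apply the fitted transformers; both are assumed to return DataFrames.
    scaled = toolkit["scaler"].transform(raw[numerical_columns])
    encoded = toolkit["encoder"].transform(raw[categorical_columns])

    # Keep the training-time column order: scaled numerics, then encoded categoricals.
    features = pd.concat([scaled, encoded], axis=1)
    return toolkit["model"].predict(features)
```

The dictionary itself would be obtained with `pickle.load` on the `streamlit_toolkit` file. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.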

In [49]:
# Exporting the requirements
requirements = "\n".join(f"{m.__name__}=={m.__version__}" for m in globals().values() if getattr(m, "__version__", None))

with open("requirements.txt", "w") as f:
    f.write(requirements)
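
The `globals()` approach only sees objects that expose `__version__`, which are mostly modules imported under a name such as `np` or `pd`. Packages used only through `from package import name`, such as scikit-learn and xgboost here, are likely to be missing from the file. As an alternative sketch, the standard-library `importlib.metadata` can look up installed versions by distribution name. The `wanted` list is an assumption based on this notebook's imports.

```python
from importlib.metadata import PackageNotFoundError, version

def pin_requirements(distributions):
    """Return 'name==version' lines for installed distributions, skipping missing ones."""
    lines = []
    for name in distributions:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed in this environment; leave it out rather than guess.
            continue
    return lines

# Distribution names are assumptions based on the notebook's imports; the import
# name and the distribution name can differ (sklearn vs scikit-learn).
wanted = ["numpy", "pandas", "scikit-learn", "xgboost", "category_encoders", "statsmodels"]
requirements = "\n".join(pin_requirements(wanted))
```

The resulting string can be written to `requirements.txt` in the same way as above.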