# <a id="1">Contents</a>

## [1. Introduction](#intro)
## [2. Data Analysis](#data-analysis)
## [3. Time Series Analysis](#time-series-analysis)
### [3.1 Hodrick-Prescott filter](#hodrick-prescott-filter)

# <a id="intro">1. Introduction</a>  

Competition pushes retailers to increase profits and reduce costs, and therefore to improve the profit margin on perishable food products. This means avoiding the cost of lost sales and, because of the short shelf life of these products, ensuring that inventory does not build up. An efficient forecasting system can reduce inventory, adapt to changes and increase profits.

A time series is a sequence of data points indexed by time, typically ordered and equally spaced. Time serves as the only feature in this format of data, and the behavior of the data is analyzed through time. Time series forecasting uses past observations of a variable to develop a model describing the underlying relationship. The model is then used to extrapolate the time series into the future. This approach is useful when there are no other explanatory variables influencing the generation of the underlying data.

## <a id="data-analysis">2. Data Analysis</a>

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.filters.hp_filter import hpfilter
from statsmodels.tsa.holtwinters import ExponentialSmoothing
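# Note: AR and arima_model.ARIMA are legacy statsmodels classes; recent statsmodels releases
# replace them with AutoReg and statsmodels.tsa.arima.model.ARIMA.
from statsmodels.tsa.ar_model import AR
from statsmodels.tsa.arima_model import ARIMA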

from pmdarima import auto_arima

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

import urllib.parse
from sqlalchemy import create_engine

# Custom upload with connection string
from engine_info import server_info

#modules for deep learning with LSTM
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout

#additional modules
from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings('ignore')

from matplotlib import rcParams

In [None]:
# Creating a connection to AWS RDS
params = urllib.parse.quote_plus(server_info)
engine = create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)
connection = engine.connect()

In [None]:
# Upload Durban Fresh Produce Market sales data
sales = pd.read_sql_table(
    table_name='Durban_Fresh_produce_market',
    con=connection,
    parse_dates=['Date']
)

### <a id="data-overview">2.1 Data Overview</a>

In [None]:
# Check the first five rows
sales.head()

In [None]:
# View datatype of each column
sales.info()

Based on the data types shown above, some of the columns are not stored in an appropriate format.

In [None]:
# Convert some of the columns to their appropriate data type
float_columns = ['Weight_Kg', 'Low_Price', 'High_Price', 'Average_Price', 
                 'Sales_Total', 'Total_Kg_Sold', 'Total_Qty_Sold', 'Stock_On_Hand']


# Convert the columns to numeric
for col in float_columns:
    # sales[col] = sales[col].astype(float)
    sales[col] = pd.to_numeric(sales[col])

In [None]:
sales.info()

In [None]:
sales.head()

In [None]:
# Check for the number of days in the database
print(f"{sales['Date'].nunique()} days recorded in the database")

In [None]:
# Remove rows where total sales equal 0, because the average price for those rows is registered as zero.
filtered_sales = sales[sales['Sales_Total'] != 0]

In [None]:
print(f"{filtered_sales['Date'].nunique()} days recorded in the database after removing rows with zero sales")

Therefore, no days were lost by removing the rows for items that were not sold on a particular day.

In [None]:
# EXAMPLE
# Check PINEAPPLE commodity to observe daily sales
sales[
    (sales['Commodities'] == 'PINEAPPLE QUEEN VICTORIA') &
    (sales['Container'] == 'LM080') &
    (sales['Province'] == 'NATAL')
].sort_values('Date').head(20)

Based on the table above, multiple sales of the same product can take place on the same day, which means that the data for these products has to be consolidated into a single row per product per day.

In [None]:
# Consolidate repeated same-day sales of the same product into a single row per day
df = filtered_sales.groupby(
    ['Province', 'Container', 'Size_Grade', 'Weight_Kg', 'Commodities', 'Date']
).agg(
    {
        'Low_Price': 'min',        # lowest price seen that day
        'High_Price': 'max',       # highest price seen that day
        'Sales_Total': 'sum',
        'Total_Qty_Sold': 'sum',
        'Total_Kg_Sold': 'sum',
        'Stock_On_Hand': 'sum'
    }
)

In [None]:
df

In [None]:
# Reset index to ensure that every row in every column has data.
df.reset_index(inplace=True)

In [None]:
df.head()

In [None]:
# Calculate average price per kilogram of each item
df['avg_price_per_kg'] = round(df['Sales_Total'] / df['Total_Kg_Sold'], 2)

### <a id="product-selection">2.2 Product Selection</a>

In [None]:
# Fraction of recorded days on which each product was sold (1.0 means sold every day)
day_count = sales['Date'].nunique()
df.groupby(['Province', 'Container', 'Size_Grade', 'Weight_Kg', 'Commodities']).size().div(day_count).sort_values(ascending=False)

Based on the results above, **APPLE GOLDEN DELICIOUS** has been sold on every day recorded in the database. In this notebook, this product is used as a template for developing a forecasting model.

In [None]:
filtered_df = df[
    (df['Commodities'] == 'APPLE GOLDEN DELICIOUS') & 
    (df['Weight_Kg'] == 12.0) &
    (df['Size_Grade'] == '1S') &
    (df['Container'] == 'EC120') &
    (df['Province'] == 'CAPE')
]

In [None]:
filtered_df.head()

In [None]:
# For time series modelling, only the date and the variable of interest (here "avg_price_per_kg") are needed.
# .copy() avoids modifying a view of filtered_df when the index is set in place below.
price = filtered_df[['Date', 'avg_price_per_kg']].copy()

In [None]:
price.set_index('Date', inplace=True)

In [None]:
price.head()

In [None]:
ax = price.plot(figsize=(12,6), title="APPLE GOLDEN DELICIOUS")
ax.autoscale(axis='x', tight=True)
ax.set(ylabel='R/kg');

In [None]:
# Create a copy so as to add some columns to the copied dataframe and not the original
copy_price = price.copy()

In [None]:
# Simple moving average and rolling standard deviation over a 5-day window
copy_price['5-day-SMA'] = copy_price['avg_price_per_kg'].rolling(window=5).mean()  # ideally one trading week
copy_price['5-day-Std'] = copy_price['avg_price_per_kg'].rolling(window=5).std()

In [None]:
ax = copy_price[['avg_price_per_kg', '5-day-SMA', '5-day-Std']].plot(figsize=(14,6), title="APPLE GOLDEN DELICIOUS")
ax.autoscale(axis='x', tight=True)
ax.set(ylabel='R/kg')
ax.legend(bbox_to_anchor=(1,1));

## <a id="time-series-analysis">3. Time Series Analysis</a>

In [None]:
# View of the date index
price.index

The `freq` of the index is currently `None`. Since the data is recorded daily but no data is available for weekends, the frequency is set to business day (Mon-Fri), with a backfill method to fill in holidays on which no data was recorded.
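To make the effect of the backfill concrete, the sketch below applies the same `asfreq` call to a small made-up series in which one business day (a public holiday) is missing. The dates and prices are invented for illustration only.

```python
import pandas as pd

# Toy prices: Wednesday 2020-01-01 is missing (a holiday) and the weekend has no rows.
dates = pd.to_datetime(
    ["2019-12-30", "2019-12-31", "2020-01-02", "2020-01-03", "2020-01-06"]
)
toy = pd.DataFrame({"avg_price_per_kg": [10.0, 11.0, 12.0, 13.0, 14.0]}, index=dates)

# Reindex to business days; the missing Wednesday takes the value of the next observation
toy_b = toy.asfreq("B", method="backfill")

print(toy_b.index.freqstr)
print(toy_b)
```

The weekend is skipped entirely because it is not part of the business-day calendar, while the missing Wednesday is filled with Thursday's price.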

In [None]:
price = price.asfreq('B', method='backfill')
#price = price.asfreq('B')

In [None]:
price.head(10)

In [None]:
#price.interpolate(inplace=True)

### <a id="hodrick-prescott-filter">3.1 Hodrick-Prescott filter</a>

The Hodrick-Prescott filter is used to get the trend of the data. This approach separates the time-series into a trend component and a cyclical component.
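As a rough illustration of what the filter computes, the trend can be obtained by solving a penalized least-squares problem: stay close to the data while keeping the second differences of the trend small. The sketch below is a plain NumPy version written for this explanation, not the statsmodels implementation; the function name `hp_filter_sketch` and the synthetic series are assumptions, and `lamb=1600` mirrors the default of `hpfilter`.

```python
import numpy as np

def hp_filter_sketch(y, lamb=1600.0):
    """Return (cycle, trend) for a 1-D series using a dense linear solve."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Second-difference operator: row t computes y[t] - 2*y[t+1] + y[t+2]
    K = np.zeros((n - 2, n))
    for t in range(n - 2):
        K[t, t:t + 3] = [1.0, -2.0, 1.0]
    # The trend solves (I + lamb * K'K) trend = y
    trend = np.linalg.solve(np.eye(n) + lamb * K.T @ K, y)
    return y - trend, trend

# Synthetic series: linear drift plus a cycle plus noise (illustrative only)
rng = np.random.default_rng(0)
t = np.arange(100)
series = 0.05 * t + np.sin(2 * np.pi * t / 20) + rng.normal(scale=0.1, size=100)
cycle, trend = hp_filter_sketch(series)
```

By construction the two components add back up to the original series, and a larger `lamb` produces a smoother trend.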

In [None]:
price_cycle, price_trend = hpfilter(price)

In [None]:
price['trend'] = price_trend

In [None]:
ax = price[['trend','avg_price_per_kg']].plot(figsize=(12,6), title="APPLE GOLDEN DELICIOUS")
ax.autoscale(axis='x', tight=True)
ax.set(ylabel='R/kg');

In [None]:
del price['trend']

### <a id="seasonal-decomposition">3.2 Seasonal Decomposition</a>

Time series decomposition deconstructs the time series into level, trend, seasonal and noise components. The model is assumed to be additive, i.e. the value of the variable is the sum of its components.
<p style="text-align: center; font-weight: bold;">
$y(t) = level + trend + seasonality + noise$
</p>
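The additive idea can be sketched by hand with a classical moving-average decomposition. This is a simplified illustration on synthetic, noise-free data and may differ in detail from what `seasonal_decompose` does; the period of 7 and all numbers are assumptions, and the level is absorbed into the trend term.

```python
import numpy as np
import pandas as pd

period = 7
n = 10 * period
t = np.arange(n)
true_trend = 20.0 + 0.1 * t
pattern = np.array([1.0, 0.5, 0.0, -0.5, -1.0, -0.5, 0.5])  # sums to zero
true_seasonal = np.tile(pattern, n // period)
y = pd.Series(true_trend + true_seasonal)

# Trend: centred moving average over one full period (odd period assumed)
trend = y.rolling(window=period, center=True).mean()

# Seasonal: average detrended value at each position in the cycle, centred on zero
detrended = y - trend
seasonal_means = detrended.groupby(t % period).mean()
seasonal_means -= seasonal_means.mean()
seasonal = pd.Series(seasonal_means.to_numpy()[t % period], index=y.index)

# Residual (noise): whatever the trend and seasonal terms do not explain
resid = y - trend - seasonal
```

The trend and residual are undefined for the first and last few observations because the centred window does not fit there.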

In [None]:
rcParams['figure.figsize'] = 12,8

In [None]:
result = seasonal_decompose(price['avg_price_per_kg'], model='additive')  
result.plot();

### <a id="forecasting">3.3 Forecasting</a>

**Holt - Winters method**

The Holt-Winters method is a generalized exponential smoothing method that incorporates **trend** and **seasonal** variation in the model. It applies exponentially decaying weights to past observations so that the most recent observations carry the most weight.
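The exponential weighting can be seen in the simplest case, simple exponential smoothing of the level only (without the trend and seasonal terms that Holt-Winters adds). The sketch below uses made-up prices and an illustrative smoothing parameter `alpha`, and checks that the recursive update equals a weighted sum with geometrically decaying weights.

```python
import numpy as np

alpha = 0.3  # smoothing parameter (illustrative value)
y = np.array([10.0, 12.0, 11.0, 13.0, 14.0, 13.5])

# Recursive form: level_t = alpha * y_t + (1 - alpha) * level_{t-1}
level = y[0]  # initialise with the first observation
for obs in y[1:]:
    level = alpha * obs + (1 - alpha) * level

# Equivalent weighted-sum form: weight alpha * (1 - alpha)**k on the observation k steps back
k = np.arange(len(y) - 1)
weights = alpha * (1 - alpha) ** k
weighted = weights @ y[:0:-1] + (1 - alpha) ** (len(y) - 1) * y[0]

print(level, weighted)
```

Both forms give the same smoothed level, and the weights shrink by a factor of `1 - alpha` for each step back in time.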

In [None]:
train_data = price.iloc[:-30]
test_data = price.iloc[-30:]

In [None]:
model = ExponentialSmoothing(train_data['avg_price_per_kg'], trend='add',seasonal='add',seasonal_periods=7) 
# seasonal_periods=7 assumes a cycle of 7 observations; note that the index uses business-day frequency (5 observations per week)
fitted_model = model.fit()

In [None]:
test_predictions = fitted_model.forecast(30).rename('Forecast')

In [None]:
train_data['avg_price_per_kg'].plot(legend=True, label='TRAIN', figsize=(16,5))
test_data['avg_price_per_kg'].plot(legend=True, label='TEST')
ax = test_predictions.plot(legend=True, label='PREDICTION', title="APPLE GOLDEN DELICIOUS")
ax.set(ylabel="R/kg");

In [None]:
hw_pred = np.sqrt(mean_squared_error(test_data, test_predictions))

In [None]:
hw_pred

**Autoregressive (AR) model**

The Holt-Winters method forecasts the variable of interest using a linear combination of predictors. These predictors are the set of level, trend and seasonal predictors. 

The autoregressive model instead uses a linear combination of past values of the variable itself. It is a regression equation in which the variable of interest is regressed on its own lagged values up to order $p$.

### $y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + \varepsilon_{t}$

where $c$ is a constant, $\phi_{1}, \dots, \phi_{p}$ are the lag coefficients up to order $p$, and $\varepsilon_{t}$ is white noise.

For example, an <strong>AR(1)</strong> model would follow the formula

&nbsp;&nbsp;&nbsp;&nbsp;$y_{t} = c + \phi_{1}y_{t-1} + \varepsilon_{t}$

whereas an <strong>AR(2)</strong> model would follow the formula

&nbsp;&nbsp;&nbsp;&nbsp;$y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \varepsilon_{t}$

and so on.
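As a small illustration of the AR(1) formula, the sketch below simulates a series with known coefficients and recovers them by regressing $y_{t}$ on a constant and $y_{t-1}$ with ordinary least squares. The values of `c_true` and `phi_true` and the sample size are arbitrary choices for this example, and the estimation method is plain NumPy rather than the statsmodels routine used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
c_true, phi_true, n = 2.0, 0.6, 2000

# Simulate y_t = c + phi * y_{t-1} + eps_t with standard normal noise
y = np.empty(n)
y[0] = c_true / (1 - phi_true)  # start at the process mean
for t in range(1, n):
    y[t] = c_true + phi_true * y[t - 1] + rng.normal()

# Regress y_t on a constant and its first lag
X = np.column_stack([np.ones(n - 1), y[:-1]])
(c_hat, phi_hat), *_ = np.linalg.lstsq(X, y[1:], rcond=None)

# One-step-ahead forecast from the last observation
next_forecast = c_hat + phi_hat * y[-1]
print(c_hat, phi_hat, next_forecast)
```

An AR(2) fit would add a second column holding $y_{t-2}$ to the design matrix, and so on for higher orders.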

In [None]:
def ar_model(data, lags=1):
    """
    Returns an AutoRegressive model specified by the number of lags
    
    Parameters
    -----------
    data: pd.Series
        A pandas series with a datetime index whose frequency is specified
    lags: int
        The number of lags that the AutoRegressive model will use
        
    Returns
    -------
    An AR model specified by the number of lags
    """
    
    model = AR(data)
    ar = model.fit(maxlag=lags)
    
    return ar    

In [None]:
# AR(1) model
ar1 = ar_model(train_data['avg_price_per_kg'])

In [None]:
# This is the general format for obtaining predictions
start=len(train_data)
end=len(train_data)+len(test_data)-1
predictions1 = ar1.predict(start=start, end=end, dynamic=False).rename('AR(1) Predictions')

In [None]:
# Storage for scoring each of the models
scores = pd.DataFrame(columns=["RMSE"])
scores.index.name = "model"

In [None]:
scores.loc['AR(1)'] = np.sqrt(mean_squared_error(test_data, predictions1))

In [None]:
# AR(2) model
ar2 = ar_model(train_data['avg_price_per_kg'], lags=2)

In [None]:
predictions2 = ar2.predict(start=start, end=end, dynamic=False).rename('AR(2) Predictions')
scores.loc['AR(2)'] = np.sqrt(mean_squared_error(test_data, predictions2))

In [None]:
test_data['avg_price_per_kg'].plot(legend=True)
predictions1.plot(legend=True)
predictions2.plot(legend=True);

In [None]:
scores.sort_values(by="RMSE")

Based on the results, the RMSE decreases as the number of lags increases. The next step is to determine the lag at which the RMSE reaches its minimum.

In [None]:
ar_rmse = []
for i in range(1, 30):  # the upper bound of 30 lags is arbitrary
    ar = ar_model(train_data['avg_price_per_kg'], lags=i)
    price_pred = ar.predict(start=start, end=end, dynamic=False)
    ar_rmse.append(np.sqrt(mean_squared_error(test_data, price_pred)))

In [None]:
plt.plot(range(1, 30), ar_rmse);

In [None]:
# AR(5) model
ar5 = ar_model(train_data['avg_price_per_kg'], lags=5)
predictions5 = ar5.predict(start=start, end=end, dynamic=False).rename('AR(5) Predictions')

In [None]:
scores.loc['AR(5)'] = np.sqrt(mean_squared_error(test_data, predictions5))

In [None]:
test_data['avg_price_per_kg'].plot(legend=True)
predictions1.plot(legend=True)
predictions2.plot(legend=True)
predictions5.plot(legend=True);

In [None]:
# Identify the best AR() model to use for forecasting
model = AR(train_data['avg_price_per_kg'])
arfit = model.fit(maxiter=1000)

In [None]:
arfit.params

In [None]:
# AR(15) model
ar15 = ar_model(train_data['avg_price_per_kg'], lags=15)
predictions15 = ar15.predict(start=start, end=end, dynamic=False).rename('AR(15) Predictions')

In [None]:
scores.loc['AR(15)'] = np.sqrt(mean_squared_error(test_data, predictions15))

In [None]:
test_data['avg_price_per_kg'].plot(legend=True)
predictions5.plot(legend=True)
predictions15.plot(legend=True);

In [None]:
# AR(16) model
ar16 = ar_model(train_data['avg_price_per_kg'], lags=16)
predictions16 = ar16.predict(start=start, end=end, dynamic=False).rename('AR(16) Predictions')

In [None]:
scores.loc['AR(16)'] = np.sqrt(mean_squared_error(test_data, predictions16))

In [None]:
scores.sort_values(by="RMSE")

**To do:** plot how the forecast changes as the number of lags is varied.

**Autoregressive Integrated Moving Average (ARIMA) model**

The ARIMA model combines two models: the AR model, which uses past values of the time series, and the Moving Average (MA) model, which uses past values of the forecast errors. The "Integrated" part refers to differencing the series $d$ times before the AR and MA terms are applied.

### $$ y_{t} = c + \sum^p_{i=1} \phi_{i} y_{t-i} + \sum^q_{j=1} \theta_{j} \varepsilon_{t-j} + \varepsilon_{t} $$

As seen earlier, these models can be used separately or, as in this section, combined. The fitting process returns the estimated coefficients $\phi_{i}$ and $\theta_{j}$, but before fitting, the order ($p,q$) of the model needs to be determined.
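The order passed to `ARIMA` in this notebook has three entries, $(p, d, q)$, where $d$ is the number of times the series is differenced before the AR and MA terms are fitted. The sketch below illustrates only the differencing step for $d = 1$ and how forecasts of changes are turned back into price levels. All numbers, including the forecast changes, are made up for illustration.

```python
import numpy as np

prices = np.array([10.0, 10.5, 11.5, 11.0, 12.0, 13.5])

# d = 1: model the day-to-day changes instead of the levels
diffs = np.diff(prices)  # [0.5, 1.0, -0.5, 1.0, 1.5]

# Suppose a fitted model produced these forecasts of the next three changes (made-up numbers)
forecast_diffs = np.array([0.4, 0.3, 0.2])

# Integrate back to levels by cumulatively adding the changes to the last observed price
forecast_levels = prices[-1] + np.cumsum(forecast_diffs)  # [13.9, 14.2, 14.4]
print(forecast_levels)
```

Requesting predictions on the original scale, as done with `typ='levels'` in the notebook, corresponds to this final integration step.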

In [None]:
model = auto_arima(train_data['avg_price_per_kg'],error_action='ignore', suppress_warnings=True, start_p=0, start_q=0,
                          max_p=6, max_q=3)

In [None]:
model.summary()

In [None]:
model = ARIMA(train_data['avg_price_per_kg'],order=(1,1,1))
results = model.fit()
results.summary()

In [None]:
start = len(train_data)
end = len(train_data) + len(test_data) - 1
predictions = results.predict(start=start, end=end, typ='levels').rename("ARIMA(1,1,1) predictions")

In [None]:
#predictions = pd.Series(predictions, index=test_data.index)

In [None]:
rcParams['figure.figsize'] = 12,8

In [None]:
test_data['avg_price_per_kg'].plot(legend=True)
predictions15.plot(legend=True)
predictions.plot(legend=True);

In [None]:
scores.loc['ARIMA(1,1,1)'] = np.sqrt(mean_squared_error(test_data, predictions))

In [None]:
scores.sort_values(by="RMSE")

### Still needs to be looked at
- Granger Causality Test
- Vector AutoRegression (VAR) methods

## <a id="regression">4. Regression</a>

### <a id='reg-data-analysis'>4.1 Data Analysis</a>

In [None]:
# Reminder of how the dataFrame looks
df.head()

In [None]:
df['low_price_per_kg'] = round(df['Low_Price'] / df['Weight_Kg'], 2)
df['high_price_per_kg'] = round(df['High_Price'] / df['Weight_Kg'], 2)

For time series analysis, the data was filtered as follows:
```python
filtered_df = df[
    (df['Commodities'] == 'APPLE GOLDEN DELICIOUS') &
    (df['Weight_Kg'] == 12.0) &
    (df['Size_Grade'] == '1S') &
    (df['Container'] == 'EC120') &
    (df['Province'] == 'CAPE')
]
```
For regression, the filter on 'Province' is dropped: where the product comes from might affect its price, so all provinces are kept and 'Province' can serve as a feature.

In [None]:
filtered_df = df[
    (df['Commodities'] == 'APPLE GOLDEN DELICIOUS') & 
    (df['Weight_Kg'] == 12.0) &
    (df['Size_Grade'] == '1S') &
    (df['Container'] == 'EC120') 
]

In [None]:
apples = filtered_df[[
    'Province', 'Date', 'Low_Price', 'High_Price', 'Sales_Total', 'Total_Qty_Sold',
    'Total_Kg_Sold', 'Stock_On_Hand', 'avg_price_per_kg', 'low_price_per_kg', 'high_price_per_kg'
]]

To check for multicollinearity, only numerical columns can be used.

In [None]:
# Check for multicollinearity (numeric columns only)
sns.heatmap(apples.select_dtypes(include='number').corr(), annot=True, cbar=False);

There is a high correlation among the three features **Sales_Total**, **Total_Qty_Sold** and **Total_Kg_Sold**. For the sake of determining inventory levels, only **Total_Qty_Sold** will be kept. There is also perfect correlation between **low_price_per_kg** and **Low_Price**, as well as between **high_price_per_kg** and **High_Price**. Since the target variable is expressed per kilogram, the per-kilogram values will be kept. There also seems to be a high correlation between **avg_price_per_kg** and **low_price_per_kg**. However, the regression analysis will use lagged values of the features to predict the target variable, so correlations with the target variable will be assessed once the lag has been determined.
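Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below shows one way to do this on a small made-up DataFrame; the column values and the cut-off of 0.9 are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up numbers purely for illustration
toy = pd.DataFrame({
    "Total_Qty_Sold": [10, 20, 30, 40, 50],
    "Total_Kg_Sold": [120, 240, 360, 480, 600],  # exactly 12 kg per unit sold
    "Stock_On_Hand": [5, 1, 4, 2, 3],
    "Province": ["CAPE", "CAPE", "NATAL", "NATAL", "CAPE"],
})

# Absolute correlations between numeric columns only
corr = toy.select_dtypes(include="number").corr().abs()

# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # illustrative cut-off
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)  # [('Total_Qty_Sold', 'Total_Kg_Sold')]
```

One column from each listed pair is then a candidate for removal.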

In [None]:
apples.columns

In [None]:
# High correlation columns shall be removed 
rem_col = ['Sales_Total', 'Total_Kg_Sold', 'Low_Price', 'High_Price']
# The remaining columns after removing correlated columns
cols = [col for col in apples.columns if col not in rem_col]

In [None]:
# .copy() so that new columns can be added without modifying a view of `apples`
apples_df = apples[cols].copy()

In [None]:
apples_df.head()

### <a id='feature-engineering'>4.2 Feature Engineering and Visualizations</a>

In [None]:
apples_df['Province'].value_counts()

Based on the frequency of purchases from each province, apples from ORANGE FREE STATE are bought least regularly. One option is to remove these rows; another is to combine them with TRANSVAAL and rename both as "INLAND". The latter option is preferred since no data is lost.

In [None]:
apples_df['Province'] = apples_df['Province'].apply(lambda x: x if x not in ["TRANSVAAL", "ORANGE FREE STATE"] else "INLAND")

In [None]:
apples_df['Province'].value_counts()

In [None]:
def plot_swarmplot(data_frame, x, y):
    """
    Returns swarmplot based on variables of interest
    
    Parameters
    -----------
    data_frame: DataFrame
        A DataFrame containing x and y variables
    x, y: str
        Features of interest in the DataFrame, x and
        y plotted on the x-axis and y-axis respectively
    
    Returns
    --------
        A seaborn graph object
    """

    plt.figure(figsize=(16,5))
    sns.swarmplot(x=x, y=y, data=data_frame)
    plt.title("Price variations of Apples Golden Delicious");

In [None]:
plot_swarmplot(apples_df, 'Province', 'avg_price_per_kg')

In [None]:
apples_df['weekday'] = apples_df['Date'].apply(lambda x: x.day_name())

In [None]:
apples_df['month'] = apples_df['Date'].apply(lambda x: x.month_name())

In [None]:
def season(month):
    """
    Returns the season in which the month falls (Southern Hemisphere seasons)
    
    Parameters
    -----------
    month: str
        The month of the year as a full month name
    
    Returns
    --------
    str:
        The season of the year
        
    Examples
    ---------
    >>> season('October')
    'spring'
    """
    
    # Seasons
    summer = ['December', 'January', 'February']
    autumn = ['March', 'April', 'May']
    winter = ['June', 'July', 'August']
    spring = ['September', 'October', 'November']
    
    if month in summer:
        return 'summer'
    elif month in autumn:
        return 'autumn'
    elif month in winter:
        return 'winter'
    else:
        return 'spring'

In [None]:
apples_df['season'] = apples_df['month'].apply(season)

In [None]:
plot_swarmplot(apples_df, 'season', 'avg_price_per_kg')

Based on this plot, there appears to be no clear difference in average price per kilogram between the seasons, which is consistent with apples not being a seasonal fruit.

In [None]:
apples_df.sort_values('Date', inplace=True)

In [None]:
apples_df.head()

In [None]:
# Check if end of the month(25th - 31st) will influence the prices
apples_df['is_month_end'] = apples_df['Date'].apply(lambda x: 1 if x.day in range(25,32) else 0)

In [None]:
# Replace spaces, hyphens and full stops in the Province names with underscores
apples_df['Province'] = apples_df['Province'].apply(lambda x: x.replace(" ", "_").replace("-", "_").replace(".", "_"))

In [None]:
apples_df['Province'].unique()

In [None]:
X_features = list(filter(lambda x: x != 'avg_price_per_kg', apples_df.columns))

In [None]:
X = apples_df[X_features]
y = apples_df['avg_price_per_kg']

In [None]:
X.columns

For the base model, certain features have to be dropped because they are only recorded after the product has been sold for that day, whereas predictions of the average price per kilogram are made before any transaction has taken place. The columns to be dropped are **low_price_per_kg**, **high_price_per_kg**, **Total_Qty_Sold** and **Stock_On_Hand**. Although these columns are dropped, their lagged values might serve as inputs, which can be looked at later on. Furthermore, the **Date** column has already been used to generate features, hence it will also be dropped.
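A lagged feature of the kind mentioned above could be built with `shift`. The sketch below uses a made-up DataFrame with one row per day; the column name `Total_Qty_Sold_lag1` is hypothetical. With several provinces per date, as in this notebook, the shift would presumably need to be applied within each province group.

```python
import pandas as pd

# Made-up data: one row per business day
toy = pd.DataFrame(
    {
        "Total_Qty_Sold": [100, 120, 90, 110],
        "avg_price_per_kg": [8.0, 8.5, 9.0, 8.7],
    },
    index=pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-08", "2020-01-09"]),
)

# shift(1) moves each value down one row, so every row sees the previous day's quantity
toy["Total_Qty_Sold_lag1"] = toy["Total_Qty_Sold"].shift(1)

# The first row has no previous day, so it is dropped before modelling
lagged = toy.dropna(subset=["Total_Qty_Sold_lag1"])
print(lagged)
```

Because the lagged value is known before trading starts on a given day, it can be used as an input without leaking same-day information.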

In [None]:
# List of columns to serve as input for the regression model
lst = [col for col in X.columns if col not in [
    'low_price_per_kg', 'high_price_per_kg', 'Stock_On_Hand', 'Date', 'Total_Qty_Sold'
]]

In [None]:
X = X[lst]

In [None]:
X.head()

In [None]:
X = pd.get_dummies(X, drop_first=True)

In [None]:
X.head()

Since time is no longer used as a sequential feature, the data does not need to be split in time order. Therefore `shuffle` can remain at `True`, the default for `train_test_split`.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Initially, a linear regression model is compared against a baseline that always predicts the mean price of the training set.

In [None]:
mean_predict = np.ones(shape=(len(y_test),)) * y_train.mean()

In [None]:
base_pred = np.sqrt(mean_squared_error(y_test, mean_predict))

In [None]:
base_pred

In [None]:
scores.loc['base_pred'] = base_pred

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X_train, y_train)

In [None]:
predict = lr.predict(X_test)

In [None]:
reg_rmse = np.sqrt(mean_squared_error(y_test, predict))

In [None]:
reg_rmse

In [None]:
plt.figure(figsize=(9,5))
plt.plot(y_test, predict, '.')
plt.plot(y_test, y_test, 'r')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title(f'Linear regression RMSE = {reg_rmse}');

In [None]:
scores.loc['linear_regression'] = reg_rmse

In [None]:
scores.sort_values(by='RMSE')

## <a id="deep-learning">5. Deep Learning</a>

For deep learning, a Long Short-Term Memory (**LSTM**) network was used to generate forecasts. An LSTM is a special kind of recurrent neural network that is capable of learning long-term dependencies in data. This is possible because the repeating module of the network contains four layers that interact with each other.
![image.png](attachment:image.png)

In [None]:
# Create a scaler to scale the data to the range (0, 1)
scaler = MinMaxScaler()
scaler.fit(train_data)
train_data_scaled = scaler.transform(train_data)
test_data_scaled = scaler.transform(test_data)

In [None]:
test_data_scaled[:5]

**Creating a time series generator from keras for our scaled train and test data**
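Conceptually, the generator slides a fixed-length window over the series and pairs each window with the value that immediately follows it. The NumPy sketch below reproduces that idea on a toy series; the helper name `make_windows` is hypothetical and the code is an illustration of the windowing, not the Keras implementation.

```python
import numpy as np

def make_windows(series, length):
    """Split a 1-D array into (inputs, targets) pairs of sliding windows."""
    series = np.asarray(series, dtype=float)
    inputs = np.stack([series[i:i + length] for i in range(len(series) - length)])
    targets = series[length:]
    # Add a trailing feature axis: (samples, timesteps, features)
    return inputs[..., np.newaxis], targets

X, y = make_windows(np.arange(10), length=3)
print(X.shape, y.shape)    # (7, 3, 1) (7,)
print(X[0].ravel(), y[0])  # [0. 1. 2.] 3.0
```

With `length=15`, as used in this notebook, each training sample holds 15 consecutive scaled prices and the target is the price on the following day.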

In [None]:
generator = TimeseriesGenerator(train_data_scaled, train_data_scaled, length=15, batch_size=1)

**LSTM model**

In [None]:
model = Sequential()
model.add(LSTM(150, activation='relu', input_shape=(15, 1)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.summary()

In [None]:
# Fit the model; model.fit accepts the generator directly (fit_generator is deprecated in TensorFlow 2)
model.fit(generator, epochs=40, verbose=0)

In [None]:
model_loss = model.history.history['loss']
plt.figure(figsize=(14, 5))
plt.plot(range(1, len(model_loss)+1), model_loss) 
plt.ylabel("mean squared error")
plt.xlabel("epochs")
plt.title("LSTM model performance")
plt.autoscale(axis='x', tight=True);

After a certain number of epochs the loss starts to converge to a certain value.

Using the model to predict the average price per KG
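Forecasting several steps ahead is done recursively: each prediction is appended to the input window and the oldest value is dropped before the next prediction is made. The sketch below shows that loop with a trivial stand-in for the trained network; the function `one_step_model`, which simply averages the window, is a hypothetical placeholder and not the LSTM.

```python
import numpy as np

def one_step_model(window):
    """Stand-in for a trained model: predicts the mean of the window."""
    return window.mean()

history = np.array([1.0, 2.0, 3.0])
window = history.copy()
forecasts = []
for _ in range(3):
    pred = one_step_model(window)
    forecasts.append(pred)
    # Drop the oldest value and append the prediction as the newest one
    window = np.append(window[1:], pred)

print(forecasts)
```

Because later predictions are based on earlier predictions rather than on observed prices, errors can accumulate over the forecast horizon.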

In [None]:
output=[]
reshaped_data=np.reshape(train_data_scaled[-15:],(1, 15, 1))
for i in range(len(test_data_scaled)):
    prediction=model.predict(reshaped_data)[0]
    output.append(prediction)
    reshaped_data=np.append(reshaped_data[:,1:,:],[[prediction]],axis=1) 

In [None]:
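# Undo the scaling and attach the test dates so that the forecast can be scored and plotted
output = pd.Series(
    scaler.inverse_transform(output).ravel(),
    index=test_data.index,
    name='LSTM Predictions'
)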

In [None]:
lstm_rmse = np.sqrt(mean_squared_error(test_data, output))

In [None]:
lstm_rmse

In [None]:
test_data['avg_price_per_kg'].plot(figsize = (16,5), legend=True)
ax = output.plot(legend = True)
ax.set(ylabel="R/kg");

# Conclusion

The data input has to be in equal intervals.