In March 2012, I defended my PhD thesis, [The time series forecast model on the most similar pattern](https://www.mbureau.energy/blog/short-term-electricity-price-forecast-model-most-similar-pattern), at Bauman Moscow State Technical University. The thesis was published in Russian on the website of the [Mathematical Bureau](https://www.mbureau.energy) and had received over 65,000 views by the end of 2018. According to the Russian Science Index, it has been cited 64 times.

The forecast model on the most similar pattern belongs to the class of statistical time series models, along with ARIMAX, GARCH, ES and others. In its simplest version, the model needs only the actual values of the target time series to calculate the forecast. I developed the model to forecast electricity prices and consumption on the Russian Wholesale Electricity Market. Since then, other researchers have applied it in a variety of other areas.

Below I introduce a Python implementation of the model for the **electricity consumption forecast problem**. The analogous MATLAB 2015b version for electricity price forecasting can be found in my blog post [The time series forecast model on the most similar pattern example in MATLAB](https://www.mbureau.energy/blog/time-series-forecast-model-most-similar-pattern-example-matlab).

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

First of all, *forecast_moment* is the first time stamp for which a forecast value is calculated. The *forecast_horizon* in this example equals 24 because we are dealing with the day-ahead sector of the electricity market and the data are hourly. The only model parameter is *M*, the pattern length in hours. *M* should be estimated in advance; here I chose its value based on my experience.

In [None]:
forecast_moment = pd.Timestamp(2011, 9, 2, 0, 0, 0)   # pd.datetime was removed from recent pandas
forecast_horizon = 24
column_name = 'consumption_eur'         # column to forecast
time_index_name = 'timestep'
M = 144                                 # model on the most similar pattern parameter

Load the values of *consumption_eur*. This is the scheduled day-ahead consumption, the so-called market demand, for the European price zone of the Russian Wholesale Electricity Market.

In [None]:
data_source = pd.read_csv('/kaggle/input/russian-wholesale-electricity-market/RU_Electricity_Market_PZ_dayahead_price_volume.csv')
data_source.index = pd.to_datetime(data_source[time_index_name], format='%Y-%m-%d %H:%M')
data_source.index.name = time_index_name
data = data_source[column_name]
data.head()

**1. Define the latest available actual pattern**

These are the *M* most recent actual values preceding *forecast_moment*.
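
As a quick check of the window arithmetic, the sketch below rebuilds the same hourly range for the parameter values used in this notebook and confirms that it holds exactly *M* time stamps and ends one hour before the forecast moment. The name `window` is illustrative only, and the lowercase `'h'` alias is used on the assumption that a reasonably recent pandas is installed.

```python
import pandas as pd

forecast_moment = pd.Timestamp(2011, 9, 2)
M = 144  # pattern length in hours

# M hourly time stamps, the last one being one hour before the forecast moment
window = pd.date_range(forecast_moment - pd.Timedelta(hours=M),
                       forecast_moment - pd.Timedelta(hours=1), freq='h')

print(len(window))   # 144
print(window[0])     # 2011-08-27 00:00:00
print(window[-1])    # 2011-09-01 23:00:00
```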

In [None]:
pattern_latest_available_range = pd.date_range(forecast_moment - pd.Timedelta(M, unit='h'),
                                               forecast_moment - pd.Timedelta(1, unit='h'), freq='h')
pattern_latest_available = data.loc[pattern_latest_available_range]

**2. Find the most similar pattern.**

To find the most similar pattern, all available patterns of the target time series are checked by looping through the history in daily steps (the time delay, *d* in the code). The pattern that gives the maximum similarity value is taken as the result. Make sure this similarity value is close to one.
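
To illustrate why the absolute Pearson correlation works as a similarity measure here, the following sketch uses synthetic data only. The helper name `similarity` is hypothetical and not part of the notebook. A scaled and shifted copy of a pattern, even one with a negative slope, scores one, while an unrelated pattern scores much lower.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=144)              # synthetic stand-in for a historical pattern


def similarity(a, b):
    """Absolute Pearson correlation between two patterns of equal length."""
    return np.abs(np.corrcoef(a, b)[0, 1])


scaled = 2.5 * x + 10.0               # same shape, different level and scale
mirrored = -0.7 * x + 3.0             # negative slope still counts as similar
unrelated = rng.normal(size=144)      # independent noise

print(round(similarity(x, scaled), 6))     # 1.0
print(round(similarity(x, mirrored), 6))   # 1.0
print(similarity(x, unrelated))            # a small value, far from one
```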

In [None]:
looping_dates_range = pd.date_range(data.index[0],
                                    forecast_moment - pd.Timedelta(M + forecast_horizon, unit='h'), freq='D')
similarity_measure = []
time_delay = []

# Looping through timeseries history

for d in looping_dates_range:

    pattern_temp_range = pd.date_range(d, d + pd.Timedelta(M-1, unit='h'), freq='h')
    ds = data.loc[pattern_temp_range].values
    time_delay.append(d)

    if np.sum(ds) == 0:
        similarity_measure.append(0)
    else:
        # Similarity measure = abs of linear correlation
        c = np.abs(np.corrcoef(ds, pattern_latest_available.values))
        similarity_measure.append(c[0, 1])

Find the pattern that corresponds to the maximum similarity value.

In [None]:
similarity = pd.DataFrame(similarity_measure, index=time_delay, columns=['similarity'])
max_similarity = np.max(similarity.values)
max_time_delay = similarity[similarity.values == max_similarity].index
max_similarity_pattern_range = pd.date_range(max_time_delay[0],
                                             max_time_delay[0] + pd.Timedelta(M-1, unit='h'), freq='h')
max_similarity_pattern = data.loc[max_similarity_pattern_range]

**3. Calculate the linear coefficients**

When we talk about the resemblance, or similarity, of patterns, we do not claim that the patterns are identical. They probably are not, but they are linearly dependent. Why linear? Because the similarity measure is the Pearson correlation, which measures exactly this type of dependence.
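
In other words, the latest pattern is approximated as `a1 * similar + a0`. The sketch below shows this fit on synthetic data, using `np.polyfit` in place of `LinearRegression` for brevity. The names `a1`, `a0`, `similar` and `latest` are illustrative assumptions, not notation taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "most similar pattern" and a noisy linear image of it
similar = rng.uniform(60_000, 90_000, size=144)
latest = 1.05 * similar - 2_000 + rng.normal(0, 50, size=144)

# Least-squares fit of latest ~ a1 * similar + a0
a1, a0 = np.polyfit(similar, latest, deg=1)
model = a1 * similar + a0             # the most similar pattern mapped onto the latest one

print(a1, a0)                         # approximately 1.05 and -2000
```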


In [None]:
regress = LinearRegression()
x = np.column_stack((max_similarity_pattern.values, np.ones(len(max_similarity_pattern))))
regress.fit(x, pattern_latest_available.values)
max_similarity_pattern_model = regress.predict(x)

**4. Define the base pattern.**

The base pattern is the segment of the time series that comes right after the most similar pattern on the time axis. Its length equals *forecast_horizon*.


In [None]:
base_pattern_range = pd.date_range(max_similarity_pattern.index[-1] + pd.Timedelta(1, unit='h'),
                                   max_similarity_pattern.index[-1] + pd.Timedelta(forecast_horizon, unit='h'), freq='h')
base_pattern = data.loc[base_pattern_range]

**5. Calculate forecast values.**

The forecast is calculated by applying the obtained linear dependence to the base pattern.

In [None]:
forecast_range = pd.date_range(pattern_latest_available.index[-1] + pd.Timedelta(1, unit='h'),
                               pattern_latest_available.index[-1] + pd.Timedelta(forecast_horizon, unit='h'), freq='h')
x = np.column_stack((base_pattern.values, np.ones(len(base_pattern))))
y = regress.predict(x)
forecast = pd.DataFrame(y, index=forecast_range, columns=[column_name])

Well, that's it, the forecast is done.
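
For reference, the five steps can be condensed into a single function. This is a minimal sketch under stated assumptions, not the notebook's implementation: it expects a gap-free hourly series, works on positions instead of time stamps, skips flat candidate patterns (where the correlation is undefined) rather than testing for a zero sum, and is run here on synthetic data. The function name `most_similar_pattern_forecast` is hypothetical.

```python
import numpy as np
import pandas as pd


def most_similar_pattern_forecast(series, forecast_moment, M, horizon):
    """Hypothetical helper: steps 1-5 for a gap-free hourly series."""
    values = series.to_numpy(dtype=float)
    t0 = int(series.index.searchsorted(forecast_moment))  # position of the first forecast hour
    latest = values[t0 - M:t0]                             # step 1: latest available pattern

    # Step 2: daily steps through the history, leaving room for the base pattern
    best_start, best_similarity = None, -1.0
    for start in range(0, t0 - M - horizon + 1, 24):
        candidate = values[start:start + M]
        if np.std(candidate) == 0:                         # correlation undefined for a flat pattern
            continue
        similarity = abs(np.corrcoef(candidate, latest)[0, 1])
        if similarity > best_similarity:
            best_start, best_similarity = start, similarity

    similar = values[best_start:best_start + M]
    a1, a0 = np.polyfit(similar, latest, deg=1)            # step 3: linear coefficients
    base = values[best_start + M:best_start + M + horizon] # step 4: base pattern
    index = pd.date_range(forecast_moment, periods=horizon, freq='h')
    return pd.Series(a1 * base + a0, index=index), best_similarity  # step 5: forecast


# Synthetic hourly load with daily and weekly cycles plus noise (illustrative only)
rng = np.random.default_rng(42)
index = pd.date_range('2011-07-01', '2011-09-02 23:00', freq='h')
hours = np.arange(len(index))
load = (80_000 + 8_000 * np.sin(2 * np.pi * hours / 24)
        + 3_000 * np.sin(2 * np.pi * hours / 168)
        + rng.normal(0, 300, size=len(index)))
series = pd.Series(load, index=index)

forecast_moment = pd.Timestamp(2011, 9, 2)
forecast, best_similarity = most_similar_pattern_forecast(series, forecast_moment, M=144, horizon=24)

actual = series.loc[forecast.index]
mape = np.mean(np.abs((actual - forecast) / actual)) * 100
print(f'similarity = {best_similarity:.3f}, MAPE = {mape:.2f} %')
```

Working on positions keeps the sketch short, but it silently assumes that every hour is present in the index. The time-stamp-based cells above make the same assumption, except that a missing hour would surface there as a lookup error.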

Additionally, I estimate the forecast error using the [MAE](https://en.wikipedia.org/wiki/Mean_absolute_error) and [MAPE](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error) values.
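
As a small worked example of the two metrics, the sketch below evaluates them on two made-up points. The helper names `mae_score` and `mape_score` are hypothetical. Note that MAPE divides by the actual values, so it is undefined when an actual value is zero.

```python
import numpy as np


def mae_score(actual, forecast):
    """Mean absolute error, in the units of the series."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs(actual - forecast))


def mape_score(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100


example_actual = [100.0, 200.0]
example_forecast = [125.0, 150.0]

print(mae_score(example_actual, example_forecast))    # 37.5  = (25 + 50) / 2
print(mape_score(example_actual, example_forecast))   # 25.0  = (25 % + 25 %) / 2
```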

In [None]:
actual = data.loc[forecast_range]
mae = np.mean(np.abs(actual.values.ravel() - forecast.values.ravel()))
mape = np.mean(np.abs((actual.values.ravel() - forecast.values.ravel()) / actual.values.ravel())) * 100
error_line = 'MAE = %2.2f MWh, MAPE = %2.2f %% ' % (mae, mape)

And plot the results. Note that on the bottom chart both *forecast* and *pattern_latest_available* are shifted along the x-axis so that they overlay the base pattern and the most similar pattern.

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(20, 20))
ax1.plot(actual.index, actual.values, label='Actuals')
ax1.plot(forecast.index, forecast.values, label='Forecast')
ax1.set_title('Forecast for ' + column_name + ': ' + error_line, fontsize=20)
ax1.legend()

ax2.plot(max_similarity_pattern.index, pattern_latest_available.values, label='Latest available pattern (x-shifted)')
ax2.plot(max_similarity_pattern.index, max_similarity_pattern.values, label='Max similarity pattern')
ax2.plot(max_similarity_pattern.index, max_similarity_pattern_model, label='Model of max similarity pattern')
ax2.plot(base_pattern.index, base_pattern.values, label='Base pattern values')
ax2.plot(base_pattern.index, forecast.values, label='Forecast (x-shifted)')
ax2.legend()

plt.subplots_adjust(hspace=0.3)
plt.show()