Time Series Analysis

Rafael Garcia Leiva edited this page Sep 21, 2021 · 6 revisions

A time series is a sequence of data points X = {x_1, x_2, ..., x_n} measured at fixed time intervals. From a machine learning point of view, forecasting a time series refers to the problem of estimating the unknown value x_{t+1} given the k known previous values x_t, ..., x_{t-k+1}.
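
As a minimal sketch of this framing, a series can be turned into a supervised learning dataset in which each row holds k consecutive values and the target is the value that follows. The helper name `make_lagged_windows` and the toy series are illustrative assumptions and are not part of the nescience library.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_lagged_windows(series, k):
    """Return (X, y): each row of X holds k consecutive values, y the next one."""
    series = np.asarray(series, dtype=float)
    if not 1 <= k < len(series):
        raise ValueError("k must satisfy 1 <= k < len(series)")
    # Windows of length k + 1: the first k entries are inputs, the last is the target.
    windows = sliding_window_view(series, k + 1)
    return windows[:, :k], windows[:, k]

toy = np.arange(10.0)        # hypothetical toy series 0, 1, ..., 9
X, y = make_lagged_windows(toy, k=3)
print(X.shape, y.shape)      # (7, 3) (7,)
print(X[0], y[0])            # [0. 1. 2.] 3.0
```

Any regression model could then be fitted on (X, y); the choice of k controls how much history the model sees.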

The nescience library provides the class TimeSeries to analyze and find optimal models for time series.

Auto-miscoding

Auto-miscoding (see Miscoding) allows us to estimate how relevant the previous values of a time series are to forecast its future values. The auto-miscoding method compares the time series with a lagged version of itself, for multiple lag values. In this sense, auto-miscoding addresses the same problem as auto-correlation.

import numpy as np
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

Let's create a simple sinusoidal time series, constantly increasing in the mean and in the standard deviation:

data = np.array([x + np.sin(x) * 0.1 * x + np.random.randn() * 0.1 for x in range(1, 200)])
plt.plot(data)

Figure: Synthetic Time Series

The following code computes a classical auto-correlogram for this time series.

plot_acf(data)
plt.title("Autocorrelation")
plt.xlabel("Lag")
plt.ylabel("Correlation")
plt.show()

Figure: Auto-correlation

As we can observe, the auto-correlation tells us that beyond lag 15 the time series has no predictive power, which is not true, as we know from the formula that generated the data. The problem is that auto-correlation is only well defined for stationary time series: it requires a constant mean and a constant standard deviation, and this series has neither.
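
To illustrate how much the trend shapes the correlogram, the sketch below computes a sample auto-correlation by hand, once on the raw series and once after subtracting a fitted straight line. The helper `sample_acf`, the fixed random seed and the linear detrending step are assumptions made for this example; none of them is part of the nescience library, and detrending only addresses the drifting mean, not the growing variance.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample auto-correlation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    return np.array([np.dot(centered[lag:], centered[:len(x) - lag]) / denom
                     for lag in range(max_lag + 1)])

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
t = np.arange(1, 200)
series = t + np.sin(t) * 0.1 * t + rng.normal(scale=0.1, size=t.size)

# Remove a fitted straight line so that the mean no longer drifts.
slope, intercept = np.polyfit(t, series, deg=1)
detrended = series - (slope * t + intercept)

acf_raw = sample_acf(series, max_lag=25)
acf_detrended = sample_acf(detrended, max_lag=25)
print(np.round(acf_raw[[3, 6]], 2))        # raw series at lags 3 and 6
print(np.round(acf_detrended[[3, 6]], 2))  # detrended series at lags 3 and 6
```

On the raw series the slowly decaying trend dominates every lag, while the detrended series alternates in sign with the period of the sinusoid.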

Let's see how auto-miscoding applies to the same time series:

from nescience.timeseries import TimeSeries
ts = TimeSeries(auto=False)
ts.fit(data)
mscd = ts.auto_miscoding(max_lag=25)
plt.bar(x=np.arange(len(mscd)), height=mscd)
plt.xlabel("Lag")
plt.ylabel("Miscoding")
plt.title("Auto-miscoding")
plt.show()

Figure: Auto-miscoding

As we can see, auto-miscoding is able to recognize that a lagged version of the time series is relevant to forecast future values, even for large lags and even though the time series is non-stationary. Moreover, auto-miscoding emphasizes the seasonal component of the series.

Cross-miscoding

Cross-miscoding (see Miscoding) allows us to estimate how relevant the previous values of a multivariate time series are to forecast future values. The cross-miscoding method compares the time series with lagged versions of its predictive features, for multiple lag values. In this sense, cross-miscoding addresses the same problem as cross-correlation.
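
The idea of comparing a target with a lagged predictive feature can be sketched as follows. This example uses the absolute Pearson correlation as a stand-in measure, not miscoding, and both the helper `lagged_correlation` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def lagged_correlation(y, x, max_lag):
    """Absolute Pearson correlation between y[t] and x[t - lag], lag = 0..max_lag."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    scores = []
    for lag in range(max_lag + 1):
        # Align y[t] with x[t - lag]: drop the first `lag` targets
        # and the last `lag` feature values.
        y_part = y[lag:]
        x_part = x[:len(x) - lag]
        scores.append(abs(np.corrcoef(y_part, x_part)[0, 1]))
    return np.array(scores)

rng = np.random.default_rng(0)
x = rng.normal(size=300)                      # hypothetical predictive feature
y = np.empty_like(x)
y[:2] = rng.normal(size=2)
y[2:] = x[:-2] + 0.1 * rng.normal(size=298)   # y depends on x two steps earlier

scores = lagged_correlation(y, x, max_lag=10)
print(int(np.argmax(scores)))                 # lag with the strongest relation
```

The same alignment of y[t] with x[t - lag] underlies any lag-based relevance measure; only the score computed on the aligned pairs changes.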

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.api import datasets
from nescience.timeseries import TimeSeries

Let's start by loading the sample data: US Macroeconomic Data for 1959Q1 - 2009Q3.

mdata = datasets.macrodata.load_pandas().data
mdata.plot()
plt.show()

Figure: US macroeconomic data

Our goal is to identify which attributes have the highest (temporal) predictive power over the unemployment rate. As an example, we will compare "realgdp" (real gross domestic product) and "realgovt" (real federal consumption expenditures & gross investment).

mdata = mdata.drop(["year", "quarter"], axis=1)
y = mdata["unemp"]
X = mdata.drop(["unemp"], axis=1)
ts = TimeSeries(multivariate=True, auto=False)
ts.fit(y, X)

Let's compute the cross-miscoding of "unemp" given a lagged version of "realgdp".

mscd = ts.cross_miscoding(attribute=0, max_lag=50)
plt.bar(x=np.arange(len(mscd)), height=mscd)
plt.show()

Figure: Cross-miscoding of "unemp" with lagged "realgdp"

Now let's compute the cross-miscoding of "unemp" given a lagged version of "realgovt".

mscd = ts.cross_miscoding(attribute=3, max_lag=50)
plt.bar(x=np.arange(len(mscd)), height=mscd)
plt.show()

Figure: Cross-miscoding of "unemp" with lagged "realgovt"

In the short term, it seems that "realgdp" has a higher predictive power than "realgovt".
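
The integer passed as `attribute` in the calls above appears to select a feature by its column position in X; this reading is an assumption and is not confirmed by the library documentation here. Under that assumption, the position can be looked up by name instead of being counted by hand. The sketch below rebuilds only the column layout of the dataset, with the column names listed by hand (also an assumption; check `mdata.columns` on your installation).

```python
import pandas as pd

# Column names of the statsmodels macrodata dataset, listed by hand
# (an assumption; verify against mdata.columns).
columns = ["year", "quarter", "realgdp", "realcons", "realinv", "realgovt",
           "realdpi", "cpi", "m1", "tbilrate", "unemp", "pop", "infl", "realint"]
mdata = pd.DataFrame([[0.0] * len(columns)], columns=columns)  # placeholder row

# Same column removal as in the example: time stamps and the target are dropped.
X = mdata.drop(["year", "quarter", "unemp"], axis=1)

# Position of each feature among the columns of X.
gdp_index = X.columns.get_loc("realgdp")
govt_index = X.columns.get_loc("realgovt")
print(gdp_index, govt_index)
```

Looking the index up by name keeps the call correct if columns are later added or removed.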

Forecasting

A canonical example of a time series is the air passengers dataset, which consists of monthly totals of US airline passengers from 1949 to 1960:

Figure: Air Passengers

This dataset has to be imported in the following way to be used with the nescience library:

>>> import pandas as pd
>>> air = pd.read_csv("AirPassengers.csv")
>>> ts = air["#Passengers"].values

To evaluate the quality of the predictions made with the TimeSeries class, we will compare them against a dummy model that simply returns the value x_t as its prediction for x_{t+1}.

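import numpy as np

def dummy_score(ts):
    # Score (1 - u/v) of the dummy model that predicts each value with the previous one
    mean = np.mean(ts)
    # u: squared prediction errors (at i = 0, ts[i-1] wraps around to the last element)
    u = np.sum([(ts[i] - ts[i-1])**2 for i in range(0, len(ts)-1)])
    # v: squared deviations from the mean of the series
    v = np.sum([(ts[i] - mean)**2 for i in range(0, len(ts)-1)])
    score = 1 - u/v
    return score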

If we apply our dummy model to the air passengers dataset we will get the following score:

>>> dummy_score(ts)
0.870694834815031
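
For reference, the same persistence baseline can be written in vectorized form. This is a sketch under its own conventions, not the function that produced the score above: the name `persistence_r2` is an assumption, it pairs every value from the second one onwards with its predecessor, and it centres the total sum of squares on the mean of the targets, so its value on the air passengers data will generally differ from the one reported.

```python
import numpy as np

def persistence_r2(series):
    """R^2 of the forecast that predicts x[t+1] = x[t]."""
    series = np.asarray(series, dtype=float)
    targets = series[1:]          # x[1], ..., x[n-1]
    predictions = series[:-1]     # x[0], ..., x[n-2]
    ss_res = np.sum((targets - predictions) ** 2)      # squared forecast errors
    ss_tot = np.sum((targets - targets.mean()) ** 2)   # spread of the targets
    return 1.0 - ss_res / ss_tot

print(persistence_r2([1, 2, 3, 4, 5]))
```

A score close to 1 means that the previous value alone already explains most of the variation, which is why a strongly trending series sets a high bar for any forecasting model.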

The same dataset modeled with the TimeSeries class provides the following score:

>>> from nescience.timeseries import TimeSeries
>>> model = TimeSeries()
>>> model.fit(ts)
TimeSeries()
>>> model.score(ts)
0.9844207308319846

Supported Models

The following families of models are currently supported for the auto-time series part:
