Time Series Analysis

Rafael Garcia Leiva edited this page Sep 22, 2021 · 6 revisions

A time series is a sequence of data points X = {x_1, x_2, ..., x_n} measured at fixed time intervals. From a machine learning point of view, forecasting a time series refers to the problem of estimating the unknown value x_{t+1} given the k most recent known values x_t, ..., x_{t-k+1}.
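The forecasting problem described above can be framed as supervised learning by sliding a window of length k over the series. The following is only a minimal sketch with NumPy; the helper name `make_windows` and the toy series are assumptions made for illustration and are not part of the nescience library.

```python
import numpy as np

def make_windows(x, k):
    """Return (inputs, targets): each row of inputs holds k consecutive
    values and the matching target is the value that follows them."""
    x = np.asarray(x, dtype=float)
    if k < 1 or k >= len(x):
        raise ValueError("k must satisfy 1 <= k < len(x)")
    # One window per position that still has a following value to predict
    inputs = np.stack([x[i:i + k] for i in range(len(x) - k)])
    targets = x[k:]
    return inputs, targets

# Hypothetical toy series, used only to show the shapes involved
series = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
inputs, targets = make_windows(series, k=3)
```

Each row of `inputs` can then be fed to any regression model, with the corresponding entry of `targets` as the value to predict.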

The nescience library provides the class TimeSeries to analyze and find optimal models for time series.

Auto-miscoding

Auto-miscoding (see Miscoding) allows us to estimate how relevant the previous values of a time series are for forecasting its future values. The auto-miscoding method compares the time series with a lagged version of itself, for multiple lag values. In this sense, auto-miscoding addresses the same problem as auto-correlation.

import numpy as np
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt
from statsmodels.api import datasets

Let's load a sample dataset, the Mauna Loa Weekly Atmospheric CO2 Data.

co2  = datasets.co2.load_pandas()
data = co2.data.dropna(axis=0).values.flatten()
plt.plot(data)

[Figure: Mauna Loa weekly CO2 time series]

The following code computes a classical auto-correlogram for this time series.

plot_acf(data)
plt.title("Autocorrelation")
plt.xlabel("Lag")
plt.ylabel("Correlation")
plt.show()

[Figure: auto-correlogram of the CO2 time series]

As we can observe, the auto-correlation suggests that all the past weeks have a similarly high predictive power, which is not true. The problem is that auto-correlation is only meaningful for stationary time series: it assumes a constant mean and a constant standard deviation, and the upward trend of this series violates that assumption.
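The effect of a trend on the auto-correlogram can be reproduced on a small synthetic example. The sketch below uses plain NumPy; the function `sample_acf`, the synthetic series, and the use of first differencing as a remedy are assumptions chosen for illustration, not something the nescience library does.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[lag:] * centered[:len(x) - lag]) / denom
        for lag in range(max_lag + 1)
    ])

# Synthetic series: linear trend + period-12 seasonality + small noise
rng = np.random.default_rng(0)
t = np.arange(240)
series = 0.5 * t + 5.0 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, t.size)

# On the raw series the trend dominates and every lag looks highly correlated
acf_raw = sample_acf(series, max_lag=24)

# After first differencing the trend is gone and the seasonal lag stands out
acf_diff = sample_acf(np.diff(series), max_lag=24)
```

On the raw series the correlation stays high for every lag, while on the differenced series it peaks at the seasonal lag of 12 and is strongly negative at lag 6.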

Let's see how auto-miscoding applies to the same time series:

from nescience.timeseries import TimeSeries
ts = TimeSeries(auto=False)
ts.fit(data)
mscd = ts.auto_miscoding(max_lag=100)
plt.bar(x=np.arange(len(mscd)), height=mscd)
plt.xlabel("Lag")
plt.ylabel("Miscoding")
plt.title("Auto-miscoding")
plt.show()

[Figure: auto-miscoding of the CO2 time series by lag]

As we can see, auto-miscoding is able to recognize that the time series has a seasonal component, even in the case of a non-stationary time series.

Cross-miscoding

Cross-miscoding (see Miscoding) allows us to estimate how relevant the previous values of a multivariate time series are for forecasting future values. The cross-miscoding method compares the time series with a lagged version of its predictive features, for multiple lag values. In this sense, cross-miscoding addresses the same problem as cross-correlation.
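For comparison, the classical lagged cross-correlation mentioned above can be sketched in a few lines of NumPy. This is not how cross-miscoding is computed; the function `lagged_cross_correlation` and the synthetic data (a target that follows its feature with a delay of 3 steps) are assumptions made for illustration.

```python
import numpy as np

def lagged_cross_correlation(y, x, max_lag):
    """Pearson correlation between y[t] and x[t - lag] for lag = 0..max_lag."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    result = []
    for lag in range(max_lag + 1):
        y_part = y[lag:]              # target from time `lag` onwards
        x_part = x[:len(x) - lag]     # feature shifted back by `lag` steps
        result.append(np.corrcoef(y_part, x_part)[0, 1])
    return np.array(result)

# Synthetic example: y follows x with a delay of 3 steps plus small noise
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.empty_like(x)
y[:3] = rng.normal(size=3)
y[3:] = 0.9 * x[:-3] + 0.1 * rng.normal(size=497)

ccf = lagged_cross_correlation(y, x, max_lag=10)
best_lag = int(np.argmax(np.abs(ccf)))
```

Here `best_lag` recovers the delay built into the synthetic data.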

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.api import datasets
from nescience.timeseries import TimeSeries

Let's start by loading the sample data: US Macroeconomic Data for 1959Q1 - 2009Q3.

mdata = datasets.macrodata.load_pandas().data
mdata.plot()
plt.show()

[Figure: US macroeconomic time series]

Our goal is to identify which attributes have the highest (temporal) predictive power over the unemployment rate. As an example, we will compare "realgdp" (real gross domestic product) and "realgovt" (real federal consumption expenditures & gross investment).

mdata = mdata.drop(["year", "quarter"], axis=1)
y = mdata["unemp"]
X = mdata.drop(["unemp"], axis=1)
ts = TimeSeries(multivariate=True, auto=False)
ts.fit(y, X)

Let's compute the cross-miscoding of "unemp" given a lagged version of "realgdp".

mscd = ts.cross_miscoding(attribute=0, max_lag=50)
plt.bar(x=np.arange(len(mscd)), height=mscd)
plt.show()

[Figure: cross-miscoding of "unemp" against lagged "realgdp"]

Now let's compute the cross-miscoding of "unemp" given a lagged version of "realgovt".

mscd = ts.cross_miscoding(attribute=3, max_lag=50)
plt.bar(x=np.arange(len(mscd)), height=mscd)
plt.show()

[Figure: cross-miscoding of "unemp" against lagged "realgovt"]

In the short term, it seems that "realgdp" has a higher predictive power than "realgovt".

Forecasting

A canonical example of a time series is the air passengers dataset, which contains the monthly passenger totals of a US airline from 1949 to 1960:

[Figure: monthly air passenger totals, 1949 to 1960]

This dataset has to be imported in the following way to be used with the nescience library:

>>> import pandas as pd
>>> air = pd.read_csv("AirPassengers.csv")
>>> ts = air["#Passengers"].values

In order to evaluate the quality of the predictions made with the TimeSeries class, we will compare against a dummy model whose prediction for x_{t+1} is simply the last observed value x_t.

import numpy as np

def dummy_score(ts):
    # R^2-style score of the naive forecast that predicts x_{t+1} = x_t
    mean = np.mean(ts)
    # Start at i = 1 so that ts[i-1] never wraps around to the last element
    u = np.sum([(ts[i] - ts[i-1])**2 for i in range(1, len(ts))])
    v = np.sum([(ts[i] - mean)**2 for i in range(1, len(ts))])
    score = 1 - u/v
    return score

If we apply our dummy model to the air passengers dataset we get the following score:

>>> round(float(dummy_score(ts)), 2)
0.92
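The persistence baseline can also be written without explicit loops. The following is a minimal vectorized sketch of the same idea (predict each value with the one before it); the name `persistence_score` and the toy array are assumptions made for illustration.

```python
import numpy as np

def persistence_score(ts):
    """R^2-style score of the naive forecast x_{t+1} = x_t."""
    ts = np.asarray(ts, dtype=float)
    actual = ts[1:]        # values to be predicted
    predicted = ts[:-1]    # naive forecast: the previous value
    u = np.sum((actual - predicted) ** 2)
    v = np.sum((actual - ts.mean()) ** 2)
    return 1.0 - u / v

# Hypothetical toy series: u = 4 and v = 6, so the score is 1 - 4/6
toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_score = persistence_score(toy)
```

A score close to 1 means the previous value is already a good predictor, so any model worth using should improve on this number.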

The same dataset modeled with the TimeSeries class provides the following score:

>>> from nescience.timeseries import TimeSeries
>>> model = TimeSeries()
>>> model.fit(ts)
TimeSeries()
>>> model.score(ts)
0.9844207308319846

Supported Models

The following families of models are currently supported for the auto-time series part:
