Time Series Analysis
A time series is a sequence of data points X = {x1, x2, ..., xn} measured at fixed time intervals. From a machine learning point of view, forecasting a time series refers to the problem of estimating the unknown value xt+1 given the known previous values xt, ..., xt-k.
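The forecasting setup described above can be made concrete by turning the series into a supervised learning table, where each row holds a window of consecutive known values and the target is the value that follows. The following is a minimal sketch using only NumPy; the helper name `make_lagged` and its `window` parameter are hypothetical and are not part of the nescience library.

```python
import numpy as np

def make_lagged(series, window):
    """Return (X, y): each row of X holds `window` consecutive values,
    and y holds the value that immediately follows each row."""
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - window
    if n_rows <= 0:
        raise ValueError("series must be longer than the window")
    # Row i contains series[i], ..., series[i + window - 1]
    X = np.stack([series[i:i + window] for i in range(n_rows)])
    # The target for row i is series[i + window]
    y = series[window:]
    return X, y

series = np.arange(10.0)          # toy series 0, 1, ..., 9
X, y = make_lagged(series, window=3)
print(X.shape, y.shape)           # (7, 3) (7,)
print(X[0], y[0])                 # [0. 1. 2.] 3.0
```

Any regression model can then be trained on `X` and `y`; the window length plays the role of the number of past values considered relevant.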
The nescience library provides the class TimeSeries to analyze and find optimal models for time series.
Auto-miscoding (see Miscoding) allows us to estimate how relevant the previous values of a time series are for forecasting its future values. The auto-miscoding method compares the time series with a lagged version of itself, for multiple lag values. In this sense, auto-miscoding addresses the same problem as auto-correlation.
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt
from statsmodels.api import datasets
Let's load a sample dataset, the Mauna Loa Weekly Atmospheric CO2 Data.
co2 = datasets.co2.load_pandas()
data = co2.data.dropna(axis=0).values.flatten()
plt.plot(data)
The following code computes a classical auto-correlogram for this time series.
plot_acf(data)
plt.title("Autocorrelation")
plt.xlabel("Lag")
plt.ylabel("Correlation")
plt.show()
As we can observe, the auto-correlation tells us that all the past weeks have an equally high predictive power, which is not true. The problem is that auto-correlation is not defined for non-stationary time series (we need a constant mean and a constant standard deviation in order to compute it).
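The effect of non-stationarity on the auto-correlogram can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the analysis of the CO2 dataset itself: it builds an artificial series with a linear trend plus a seasonal component of period 52, and computes the sample auto-correlation by hand with a hypothetical helper `sample_acf`. On the raw series the trend keeps the correlation high at every lag, while on the first-differenced series (a common way to remove a trend) the seasonal pattern becomes visible.

```python
import numpy as np

def sample_acf(x, lag):
    """Sample auto-correlation of x at a given non-negative lag."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    if lag == 0:
        return 1.0
    return float(np.dot(xc[:-lag], xc[lag:]) / np.dot(xc, xc))

t = np.arange(520)
# Synthetic series: linear trend plus a seasonal cycle of period 52
series = 0.05 * t + np.sin(2 * np.pi * t / 52)
# First difference removes the linear trend
diff = np.diff(series)

lags = (1, 26, 52)
acf_raw = [sample_acf(series, k) for k in lags]
acf_diff = [sample_acf(diff, k) for k in lags]

print("raw series: ", np.round(acf_raw, 2))
print("differenced:", np.round(acf_diff, 2))
```

In the raw series the values stay high at all three lags, whereas after differencing the half-period lag (26) is strongly negative and the full-period lag (52) is strongly positive.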
Let's see how auto-miscoding applies to the same time series:
from nescience.timeseries import TimeSeries
ts = TimeSeries(auto=False)
ts.fit(data)
mscd = ts.auto_miscoding(max_lag=100)
plt.bar(x=np.arange(len(mscd)), height=mscd)
plt.xlabel("Lag")
plt.ylabel("Miscoding")
plt.title("Auto-miscoding")
plt.show()
As we can see, auto-miscoding is able to recognize that the time series has a seasonal component, even in the case of a non-stationary time series.
Cross-miscoding (see Miscoding) allows us to estimate how relevant the previous values of a multivariate time series are for forecasting future values. The cross-miscoding method compares the time series with lagged versions of its predictive features, for multiple lag values. In this sense, cross-miscoding addresses the same problem as cross-correlation.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.api import datasets
from nescience.timeseries import TimeSeries
Let's start by loading the sample data: US Macroeconomic Data for 1959Q1 - 2009Q3.
mdata = datasets.macrodata.load_pandas().data
mdata.plot()
plt.show()
Our goal is to identify which attributes have the highest (temporal) predictive power over the unemployment rate. As an example, we will compare "realgdp" (real gross domestic product) and "realgovt" (real federal consumption expenditures & gross investment).
mdata = mdata.drop(["year", "quarter"], axis=1)
y = mdata["unemp"]
X = mdata.drop(["unemp"], axis=1)
ts = TimeSeries(multivariate=True, auto=False)
ts.fit(y, X)
Let's compute the cross-miscoding of "unemp" given a lagged version of "realgdp".
mscd = ts.cross_miscoding(attribute=0, max_lag=50)
plt.bar(x=np.arange(len(mscd)), height=mscd)
plt.show()
Now let's compute the cross-miscoding of "unemp" given a lagged version of "realgovt".
mscd = ts.cross_miscoding(attribute=3, max_lag=50)
plt.bar(x=np.arange(len(mscd)), height=mscd)
plt.show()
In the short term, it seems that "realgdp" has a higher predictive power than "realgovt".
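For comparison, the classical counterpart mentioned above, cross-correlation, can be sketched with plain NumPy. This is not how the nescience library computes cross-miscoding; it only illustrates the idea of comparing a target with lagged versions of a predictive feature. The helper `lagged_correlation` is hypothetical, and the data is synthetic: the target is built to follow the feature with a delay of 3 steps.

```python
import numpy as np

def lagged_correlation(target, feature, max_lag):
    """Pearson correlation between target[t] and feature[t - lag],
    for lag = 0, ..., max_lag."""
    target = np.asarray(target, dtype=float)
    feature = np.asarray(feature, dtype=float)
    corrs = []
    for lag in range(max_lag + 1):
        # Align target[t] with feature[t - lag] by trimming both ends
        t = target[lag:]
        f = feature[:len(feature) - lag]
        corrs.append(np.corrcoef(t, f)[0, 1])
    return np.array(corrs)

rng = np.random.default_rng(0)
feature = rng.normal(size=500)
# Synthetic target: the feature delayed by 3 steps, plus a little noise
target = np.roll(feature, 3) + 0.1 * rng.normal(size=500)

corrs = lagged_correlation(target, feature, max_lag=10)
best_lag = int(np.argmax(corrs))
print(best_lag)  # 3
```

The lag with the highest correlation recovers the delay used to build the synthetic target. Unlike cross-miscoding, this measure only captures linear relationships and assumes stationary series.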
A canonical example of a time series is the air passengers dataset, composed of monthly totals of US airline passengers from 1949 to 1960.

This dataset has to be imported in the following way to be used with the nescience library:
>>> import pandas as pd
>>> air = pd.read_csv("AirPassengers.csv")
>>> ts = air["#Passengers"].values
In order to evaluate the quality of the predictions made with the TimeSeries class, we will compare them against a dummy model whose prediction for xt+1 is simply the value xt.
import numpy as np
def dummy_score(ts):
    # R^2-style score of the naive forecast that predicts each value from the previous one
    mean = np.mean(ts)
    # Note: for i = 0, ts[i-1] is ts[-1], the last element of the series
    u = np.sum([(ts[i] - ts[i-1])**2 for i in range(0, len(ts)-1)])
    v = np.sum([(ts[i] - mean)**2 for i in range(0, len(ts)-1)])
    score = 1 - u/v
    return score
If we apply our dummy model to the air passengers dataset we will get the following score:
>>> dummy_score(ts)
0.870694834815031
The same dataset modeled with the TimeSeries class provides the following score:
>>> from nescience.timeseries import TimeSeries
>>> model = TimeSeries()
>>> model.fit(ts)
TimeSeries()
>>> model.score(ts)
0.9844207308319846
The following families of models are currently supported for the auto-time series part: