
SARIMAX: incremental Kalman filter (question) #5563

Open
vss888 opened this issue Apr 10, 2019 · 33 comments

@vss888 commented Apr 10, 2019

Could you please let me know if it is possible to use the current SARIMAX code to apply the Kalman filter incrementally (to make it faster)?

Here is what I mean. Imagine one fits a SARIMAX model (#1) using data with indices in [iFitBegin, iFitEnd], and then wants to use that model to make predictions for data with indices in [iFitEnd+1, iDataEnd]. To do that in the current code, it seems, one has to create a new SARIMAX model (#2) using data with indices in [iFitBegin, iDataEnd] and apply the Kalman filter to model #2 with parameters taken from model #1; after that one can request predictions from model #2. However, this is computationally expensive (for large models) if predictions are made on streaming data (where new data arrives one point at a time), since every time one has to redo all Kalman filtering from iFitBegin to the index iPred (the one at which a prediction is required). Mathematically, the Kalman filter is applied incrementally: if filtering is done up to index iPred, then to filter up to index iPred+1 one just takes the state at iPred and applies a single step of the Kalman filter using the one new data point at iPred+1. It would be great if one could do this in the code incrementally as well.
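
To make the incremental idea concrete, here is a minimal, illustrative sketch of a single Kalman filter step for a generic linear-Gaussian state space model (plain numpy, not tied to the SARIMAX internals; all names are made up for illustration):

import numpy as np

def kalman_step(x, P, y, T, Z, Q, H):
    # One step of the Kalman filter for the linear-Gaussian model
    #   state:       x_{t+1} = T x_t + eta,  eta ~ N(0, Q)
    #   observation: y_t     = Z x_t + eps,  eps ~ N(0, H)
    # (x, P) is the filtered state mean/covariance at time t; returns it at t+1.
    x_pred = T @ x                         # prediction step
    P_pred = T @ P @ T.T + Q
    v = y - Z @ x_pred                     # forecast error for the new observation
    F = Z @ P_pred @ Z.T + H               # forecast error covariance
    K = P_pred @ Z.T @ np.linalg.inv(F)    # Kalman gain
    return x_pred + K @ v, P_pred - K @ Z @ P_pred

Calling such a step once per new observation is all that is needed mathematically; the question is whether the SARIMAX machinery can expose an equivalent operation.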

Could you please let me know if there is a way to do such incremental Kalman filtering (i.e. one state update at a time) in the current SARIMAX implementation?

Thank you for your help!!!

@ChadFulton (Member)

We don't have this built in, but it is a feature that's been requested in the past and in principle wouldn't be too hard to add. I would be happy to look at a PR to add this feature.

@vss888 (Author) commented Apr 11, 2019

@ChadFulton Thank you very much for your reply, Chad! Such a feature would be incredibly valuable for practical applications of your code.

Our estimates indicate that the Kalman filtering part of out-of-sample prediction would become between 3,000 and 25,000 times faster (in our use cases) if we could do the filtering incrementally. If the prediction part is done independently of the filtering (for example, during the call to SARIMAXResults.get_prediction), then it would add only a little to the overall time required to make a prediction (about the same time scale as one step of Kalman filtering). Overall, deploying our models would become much easier and less computationally intensive, since we would be able to make multiple predictions sequentially instead of parallelizing them.

It would be really great if we could look into a PR to add this feature some time soon. If there is any part of the implementation I could help with, please do let me know and I would be happy to contribute.

@vss888 (Author) commented Apr 16, 2019

@ChadFulton Please, let me know if there is any code I could help with to implement the incremental Kalman filtering.

To be able to handle state updates by the model at index iPred+1 (for example, if late data arrives into the system), it would be great to be able to save the state object at index iPred, so that the Kalman filtering step from the state at iPred to the state at iPred+1 could be run multiple times. After the state is updated, one should be able to update the prediction as well.

@ChadFulton (Member)

Yes, this is all possible, it's mostly just a matter of figuring out the user interface. If you want to take a stab at it, I'd be happy to review a PR.

@vss888 (Author) commented May 17, 2019

@ChadFulton I have just spent a full day reading your code. While I have a better idea of what it does, the overall structure is still a mystery to me, and I still do not see where even to begin making changes. To my taste, too much unrelated functionality is lumped together, and the code would benefit from being factored into logically separated pieces. For example, I would not make KalmanFilter inherit from Representation; instead, a KalmanFilter would be something that can be applied to a Representation. Different Kalman filters would inherit from a base KalmanFilter, instead of having multiple options all in one file, and KalmanSmoother would not inherit from KalmanFilter.

Getting back to the immediate needs, my understanding is that I need to:

  1. find where one step of Representation update is done
  2. separate it into a new method (e.g. kalman_filter.step), which would be called in a loop in the filter method approximately like this (possibly implemented in Cython):
def filter(initial_representation, kalman_filter, observations, inplace=False):
    # 'initial_representation' : representation of the initial state
    # 'observations' : vector of observations which follow the initial state
    # 'inplace' : if the 'initial_representation' should be updated in place
    representation = initial_representation if inplace else initial_representation.copy()
    for obs in observations:
        kalman_filter.step(representation, obs) # update representation in place
    return representation

Could you please help me with step 1 above, i.e. find where a single step of the Representation update is done by a Kalman filter?

Thank you for your help!

@vss888 (Author) commented May 17, 2019

I might have found the relevant pieces of code: {{prefix}}KalmanFilter.__call__ seems to implement all of the Kalman filter steps (the filter function in my previous post), while {{prefix}}KalmanFilter.__next__ seems to implement one such step (the kalman_filter.step function in my previous post).

@ChadFulton (Member)

As a preliminary, it's probably useful to know that the background for this implementation comes from my use in time series macro econometrics, where the goal is usually estimation of parameters / running the smoother on a fixed dataset.

There is a performance cost to converting between numpy arrays and Cython memoryviews, and so things have been designed to minimize that conversion. That's why the design strategy is to convert everything once at the beginning to a memoryview, run the full filter and smoother in Cython, and then convert all results back to numpy arrays once at the end.

This is of course not the best approach for e.g. optimal control, where the goal is to apply the Kalman filter online and with fixed parameters. However, this was just not the use case this was built for.

To my taste, too much unrelated functionality is lumped together, and the code would benefit from being factored into logically separated pieces. For example, I would not make KalmanFilter inherit from Representation; instead, a KalmanFilter would be something that can be applied to a Representation. Different Kalman filters would inherit from a base KalmanFilter, instead of having multiple options all in one file, and KalmanSmoother would not inherit from KalmanFilter.

It is possible that it would have been better to use composition rather than inheritance, but this is likely not going to change now.

I might have found the relevant pieces of code: {{prefix}}KalmanFilter.__call__ seems to implement all of the Kalman filter steps (the filter function in my previous post), while {{prefix}}KalmanFilter.__next__ seems to implement one such step (the kalman_filter.step function in my previous post).

Yes, this is correct. However, I do not think you will want to go this route for various technical reasons related to what I described above. Because of the design principles behind our implementation, you are likely better off thinking about "batch" updates, where in practice the "batch" update could certainly be a single additional observation (but if you already have n additional observations, then it would be best to do one batch update with n new observations rather than n batch updates of 1 observation each).

Mathematically, the core idea is the same as the usual online updates. Once you have fit the model with data between [iFitBegin, iFitEnd] in model (1), you can easily apply the Kalman filter only to the data [iFitEnd+1, iDataEnd] in model (2) by initializing the model (2) Kalman filter with the last state and covariance from model (1).

Here is the (not very user-friendly) way to do such a batch update:

import numpy as np
import pandas as pd
import statsmodels.api as sm
endog = np.log(sm.datasets.macrodata.load_pandas().data['realgdp']).diff().iloc[1:]
endog.index = pd.date_range('1959Q2', '2009Q3', freq='QS')

iFitBegin = 0
iFitEnd = 100
iDataEnd = len(endog)

# Fit on the training sample
endog_training = endog.iloc[iFitBegin:iFitEnd + 1]
mod_training = sm.tsa.SARIMAX(endog_training, order=(1, 0, 0))
res_training = mod_training.fit()
params_training = res_training.params

# Now apply the filter to the next dataset
endog_test = endog.iloc[iFitEnd + 1:iDataEnd]
mod_test = sm.tsa.SARIMAX(endog_test, order=(1, 0, 0))
mod_test.ssm.initialize_known(res_training.predicted_state[..., -1],
                              res_training.predicted_state_cov[..., -1])
res_test = mod_test.filter(params_training)

# Finally, filter everything all at once so we can check results
mod_all = sm.tsa.SARIMAX(endog, order=(1, 0, 0))
res_all = mod_all.filter(params_training)

from numpy.testing import assert_allclose
assert_allclose(res_all.filtered_state[..., iFitEnd + 1:iDataEnd],
                res_test.filtered_state)

It would be nice to wrap this into a convenient function, like suggested in #4188, but it's not clear to me what the exact right interface is.
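
Just as a rough illustration (the name filter_new_observations is hypothetical, not an existing statsmodels function), such a wrapper might look something like:

def filter_new_observations(res_prev, new_endog, **model_kwargs):
    # Hypothetical helper: build a model on only the new observations,
    # initialize its filter from the last predicted state of `res_prev`,
    # and filter using the previously estimated parameters.
    mod_new = sm.tsa.SARIMAX(new_endog, **model_kwargs)
    mod_new.ssm.initialize_known(res_prev.predicted_state[..., -1],
                                 res_prev.predicted_state_cov[..., -1])
    return mod_new.filter(res_prev.params)

# e.g., continuing the snippet above:
# res_test = filter_new_observations(res_training, endog_test, order=(1, 0, 0))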

@ChadFulton (Member)

(Note that this exact example may not work with versions behind what's currently in GitHub master, although there is a way to make it work in older versions too.)

@bashtage (Member)

There is a performance cost to converting between numpy arrays and Cython memoryviews, and so things have been designed to minimize that conversion. That's why the design strategy is to convert everything once at the beginning to a memoryview, run the full filter and smoother in Cython, and then convert all results back to numpy arrays once at the end.

I don't think this costs very much. With this Cython file:

#cython: language_level=3
from libc.stdint cimport uint32_t
from cpython.pycapsule cimport PyCapsule_IsValid, PyCapsule_GetPointer

import numpy as np
cimport numpy as np
cimport cython

np.import_array()

def time_mem_view(double[::1] x, np.npy_intp n):
    x[n] = -1.0

def time_noop(object x, object n):
    pass

I get

In [4]: %timeit extending.time_mem_view(x,135)
837 ns ± 2.37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit extending.time_noop(x,135)
108 ns ± 6.88 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

While it is much higher than a no-op, I wouldn't call 700ns a meaningful performance cost.

@ChadFulton (Member)

I guess the best I can do is to rephrase: "at one time there was a perceived performance cost to converting..."

@vss888 (Author) commented May 21, 2019

@ChadFulton Thank you very much, Chad, for such a detailed reply!
@bashtage Thank you, Kevin, for the benchmark! It opens the door to potentially re-implementing this in Cython. I believe the right approach would be to implement it in Cython along the lines of what I described in my previous post. That way online optimal control becomes possible and Chad stays happy: if the code gets all the observations at once, the conversion from numpy arrays to Cython memoryviews happens only once, and if one feeds the code one data row at a time it is still quite efficient (as Kevin showed above).

Before deciding on a user interface, I wanted to check how well the code works in practice. In my use case, a simple test produces somewhat different results for the two approaches (the prediction standard deviation gets larger by ~6% when I use initialize_known, so something seems to be missing).

I tried to reproduce this using your code snippet, and I can see that if I change the ARIMA order from (1,0,0) to, for example, (3,0,1), then the assertion at the end of the snippet fails:

import numpy as np
import pandas as pd
import statsmodels.api as sm
endog = np.log(sm.datasets.macrodata.load_pandas().data['realgdp']).diff().iloc[1:]
endog.index = pd.date_range('1959Q2', '2009Q3', freq='QS')

iFitBegin = 0
iFitEnd = 100
iDataEnd = len(endog)

arima_order = (3, 0, 1)

# Fit on the training sample
endog_training = endog.iloc[iFitBegin:iFitEnd + 1]
mod_training = sm.tsa.SARIMAX(endog_training, order=arima_order)
res_training = mod_training.fit()
params_training = res_training.params

# Now apply the filter to the next dataset
endog_test = endog.iloc[iFitEnd + 1:iDataEnd]
mod_test = sm.tsa.SARIMAX(endog_test, order=arima_order)
mod_test.ssm.initialize_known(res_training.predicted_state    [..., -1],
                              res_training.predicted_state_cov[..., -1])
res_test = mod_test.filter(params_training)

# Finally, filter everything all at once so we can check results
mod_all = sm.tsa.SARIMAX(endog, order=arima_order)
res_all = mod_all.filter(params_training)

from numpy.testing import assert_allclose
assert_allclose(res_all.filtered_state[..., iFitEnd + 1:iDataEnd],
                res_test.filtered_state)

gives the following output:

>>> import numpy as np
>>> import pandas as pd
>>> import statsmodels.api as sm
>>> endog = np.log(sm.datasets.macrodata.load_pandas().data['realgdp']).diff().iloc[1:]
>>> endog.index = pd.date_range('1959Q2', '2009Q3', freq='QS')
>>> 
>>> iFitBegin = 0
>>> iFitEnd = 100
>>> iDataEnd = len(endog)
>>> 
>>> arima_order = (3, 0, 1)
>>> 
>>> # Fit on the training sample
... endog_training = endog.iloc[iFitBegin:iFitEnd + 1]
>>> mod_training = sm.tsa.SARIMAX(endog_training, order=arima_order)
>>> res_training = mod_training.fit()
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            5     M =           10
 This problem is unconstrained.

At X0         0 variables are exactly at the bounds

At iterate    0    f= -3.08211D+00    |proj g|=  3.78424D+00

 Bad direction in the line search;
   refresh the lbfgs memory and restart the iteration.

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    5      3     43      2     0     0   2.940D-02  -3.083D+00
  F =  -3.0825409348556598     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             

 Warning:  more than 10 function and gradient
   evaluations in the last line search.  Termination
   may possibly be caused by a bad search direction.

 Cauchy                time 0.000E+00 seconds.
 Subspace minimization time 0.000E+00 seconds.
 Line search           time 0.000E+00 seconds.

 Total User time 0.000E+00 seconds.

>>> params_training = res_training.params
>>> 
>>> # Now apply the filter to the next dataset
... endog_test = endog.iloc[iFitEnd + 1:iDataEnd]
>>> mod_test = sm.tsa.SARIMAX(endog_test, order=arima_order)
>>> mod_test.ssm.initialize_known(res_training.predicted_state    [..., -1],
...                               res_training.predicted_state_cov[..., -1])
>>> res_test = mod_test.filter(params_training)
>>> 
>>> # Finally, filter everything all at once so we can check results
... mod_all = sm.tsa.SARIMAX(endog, order=arima_order)
>>> res_all = mod_all.filter(params_training)
>>> 
>>> from numpy.testing import assert_allclose
>>> assert_allclose(res_all.filtered_state[..., iFitEnd + 1:iDataEnd],
...                 res_test.filtered_state)
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File ".../Python-3.6.3/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 1493, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File ".../Python-3.6.3/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatch: 1.32%
Max absolute difference: 3.66600378e-09
Max relative difference: 4.52140969e-07
 x: array([[ 9.671517e-03,  8.108095e-03,  9.392402e-03,  8.431222e-03,
         1.549991e-02,  7.560947e-03,  9.563068e-03,  4.009178e-03,
         9.595188e-03,  4.821794e-03,  5.528936e-03,  1.057792e-02,...
 y: array([[ 9.671517e-03,  8.108099e-03,  9.392402e-03,  8.431222e-03,
         1.549991e-02,  7.560947e-03,  9.563068e-03,  4.009178e-03,
         9.595188e-03,  4.821794e-03,  5.528936e-03,  1.057792e-02,...

I am looking into it further, but if you have some idea of what might be causing it, please, let us know.

@vss888 (Author) commented May 21, 2019

Picking higher MA (i.e. q) orders in the ARIMA specification causes the assertion in the snippet to fail more dramatically. Try, for example, arima_order = (1, 0, 5): it gives "Mismatch: 20.3%". Other examples of arima_order values that cause the assertion to fail:

arima_order = (1, 0, 9)
arima_order = (1, 0, 8)
arima_order = (1, 0, 7)
arima_order = (1, 0, 6)
arima_order = (1, 0, 5) 
arima_order = (1, 0, 4)
arima_order = (1, 0, 3)
arima_order = (1, 0, 2)
arima_order = (1, 0, 1)
arima_order = (2, 0, 1)
arima_order = (3, 0, 1)
arima_order = (4, 0, 1)
arima_order = (6, 0, 1)
arima_order = (8, 0, 1)

@vss888 (Author) commented May 21, 2019

I think the difficulty with figuring out the user interface comes from the fact that KalmanFilter inherits from Representation, so the state and the state-update operation end up being the same thing. This leads to a situation where, to copy a state, one has to copy a model. Moreover, a model also represents (contains) the data that will be used to fit it, so a model and its data end up being the same thing. The end result is that the model, the state, and the data are all the same object. In addition, states are contained in the fit results, but one does not get a whole state object from a fit result, only an array. Face palm...

I would define a class State (Representation) containing the set of numbers representing the state and the operations to check those numbers for self-consistency. A class Model (KalmanFilter) would represent the operations of state evolution, prediction, log-likelihood calculation, and fitting. The fit function of Model (not the constructor!) should take data as an input. A class FitResult should effectively (though not necessarily literally) contain a sequence of states plus fit metrics (like the log-likelihood, the model parameters, the covariance matrix of the model parameters, and functions to compute various derived values). A model object could then be used like this: model.filter(initial_state, new_observations, fit_result) to advance a state forward and to make predictions.
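
A rough sketch of the kind of interface I have in mind (all names here are purely illustrative, not existing statsmodels classes):

class State:
    """Mean and covariance of the state at one point in time."""
    def __init__(self, mean, cov):
        self.mean = mean
        self.cov = cov

class Model:
    """Holds no data; knows how to evolve a State, compute the likelihood, and fit."""
    def fit(self, data):
        ...  # would return a FitResult (parameters, log-likelihood, sequence of States)

    def filter(self, initial_state, new_observations, fit_result):
        ...  # would return the State after processing new_observations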

The code does not need rewriting, just refactoring into more independent concepts. Without such refactoring, it is hard to come up with a pretty user interface. I am willing to do the refactoring, but I am unlikely to succeed without some help from somebody who understands the code in detail.

@ChadFulton (Member)

It opens the door to potentially re-implementing this in Cython. I believe the right approach would be to implement it in Cython along the lines of what I described in my previous post. That way online optimal control becomes possible and Chad stays happy: if the code gets all the observations at once, the conversion from numpy arrays to Cython memoryviews happens only once, and if one feeds the code one data row at a time it is still quite efficient (as Kevin showed above).

I have to say that it seems unlikely to me that we would want to fundamentally restructure the state space models at this point, particularly since the solution I gave already works for your use case. I will expand on this more in a second comment replying to your next message.

Before deciding on a user interface, I wanted to check how well the code works in practice. In my use case, a simple test produces somewhat different results for the two approaches (the prediction standard deviation gets larger by ~6% when I use initialize_known, so something seems to be missing).
I tried to reproduce this using your code snippet, and I can see that if I change the ARIMA order from (1,0,0) to, for example, (3,0,1), then the assertion at the end of the snippet fails:

This is because Kalman filtering converges after some number of iterations to a steady-state, after which point we can improve performance by no longer updating the covariance matrix. To determine convergence to the steady-state, we have a tolerance that is very close to zero. You can get some (very small) differences if you run the Kalman filter with different convergence tolerances (which is implicitly happening here). If you replace sm.tsa.SARIMAX(..., order=(1, 0, 9)) with sm.tsa.SARIMAX(..., order=(1, 0, 9), tolerance=0) in all cases, then you will match exactly in all cases.
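
Concretely, the only change to the snippet above is the extra keyword in each model construction, e.g.:

# tolerance=0 disables the steady-state convergence check, so the state
# covariance is updated at every step and the two runs match exactly
mod_training = sm.tsa.SARIMAX(endog_training, order=(1, 0, 9), tolerance=0)
mod_test = sm.tsa.SARIMAX(endog_test, order=(1, 0, 9), tolerance=0)
mod_all = sm.tsa.SARIMAX(endog, order=(1, 0, 9), tolerance=0)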

@ChadFulton (Member) commented May 22, 2019

A couple of notes on what you've written. First, I'm going to jump to the most important comment, which you may find disappointing:

The fit function of Model (not the constructor!) should take data as an input.

This is not the design pattern followed by Statsmodels. In the model classes in this library (including OLS, GLM, etc., and also state space models), the data and fundamental model specification are always passed as part of the constructor. I realize that this is different from e.g. scikit-learn, but we are not planning to change this. As a result, the suggestion of removing data out of the constructor is unfortunately a non-starter.

The code does not need rewriting, just refactoring into more independent concepts. Without such refactoring, it is hard to come up with a pretty user interface. I am willing to do the refactoring, but I am unlikely to succeed without some help from somebody who understands the code in detail.

Thanks very much for offering to help write code, that is always very much appreciated! However, you seem to be suggesting quite a lot of work that would fundamentally change the way the state space models run. Before considering doing this, I think it would be important to show why our current solution does not work. In particular, why not just wrap the code I gave above into a helper method?

If you really want to proceed despite this warning, I would be still be happy to review your code, but unfortunately I have to say that I don't have huge amounts of time to discuss rewriting code that seems to work well to me.

I think the difficulty with figuring out the user interface comes from the fact that KalmanFilter inherits from Representation, so the state and the state-update operation end up being the same thing. This leads to a situation where, to copy a state, one has to copy a model.

It's not clear to me what you are referring to as "a state". Do you mean the conditional mean and covariance at a particular time period?

In addition, states are contained in the fit results, but one does not get a whole state object from a fit result, only an array.

Similar to my above comment, I don't know what you mean by "the whole state" here. However, the complete output of the Kalman filter (and perhaps smoother) are available in the fit results.

A class FitResult should effectively (though not necessarily literally) contain a sequence of states plus fit metrics (like the log-likelihood, the model parameters, the covariance matrix of the model parameters, and functions to compute various derived values).

As far as I can tell, this is what our results objects do contain. Is there something that is missing there?

@vss888 (Author) commented May 23, 2019

@ChadFulton My goal was in no way to criticize the existing code, only to offer help. My apologies if that was the impression I created. Since what worried you most was how to wrap the code snippet in a nice user interface, that is exactly what I tried to help with. However, I quickly realized that it was not easy, and thinking about the reasons led me to the conclusion that it was due to how the code is currently organized. For example, the function implementing an incremental Kalman filter has to create an exact copy of the original model that was used to fit the training sample, which can be done by:

  • passing a function that creates the original model,
  • creating a copy of the original model and then resetting it and attaching new data to it (this entails adding two new methods, model.reset and model.set_data, which could be merged into a single method).

In addition, I believe that restructuring the code would greatly enhance its performance. At the moment, I can see a speed-up by a factor of <15 in a simple test case, while one should be able to get a speed-up factor of ~3000 in the same case.

If you replace sm.tsa.SARIMAX(..., order=(1, 0, 9)) with sm.tsa.SARIMAX(..., order=(1, 0, 9), tolerance=0) in all cases, then you will match exactly in all cases.

Indeed, once I did that, the assertion passed successfully, but once I set enforce_stationarity=False (and added exogenous data), the assertion failed once again. I have created a test data set with one endogenous variable and a few exogenous variables, since this more closely resembles one of our use cases. The corresponding code follows. Is there anything else one has to do to make the code snippet work correctly when enforce_stationarity=False? Thank you very much for your help!

import numpy as np
import pandas as pd
import statsmodels.api as sm
data = pd.read_csv('https://github.com/statsmodels/statsmodels/files/3208318/data.zip')
data.drop(columns=['dummy'], inplace=True)
endog = data['x']
# exog = None # no exogeneous variables case
exog = data[ [x for x in data.columns if x.startswith('e')] ]

arima_order = (3, 0, 1)
tolerance = 0
#enforce_stationarity=True # XXX: works
enforce_stationarity=False # XXX: does not work
enforce_invertibility=False

iFitBegin = 0
iFitEnd = len(endog)-41-1
iDataEnd = len(endog)

# Fit on the training sample
endog_training = endog.iloc[iFitBegin:iFitEnd + 1]
exog_training  = exog .iloc[iFitBegin:iFitEnd + 1] if exog is not None else None
mod_training = sm.tsa.SARIMAX(endog_training, exog=exog_training, order=arima_order,
    enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
    tolerance=tolerance)
res_training = mod_training.fit()
params_training = res_training.params

# Now apply the filter to the next dataset
endog_test = endog.iloc[iFitEnd + 1:iDataEnd]
exog_test  = exog .iloc[iFitEnd + 1:iDataEnd] if exog is not None else None
mod_test = sm.tsa.SARIMAX(endog_test, exog=exog_test, order=arima_order,
    enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
    tolerance=tolerance)
mod_test.ssm.initialize_known(res_training.predicted_state    [..., -1],
                              res_training.predicted_state_cov[..., -1])
res_test = mod_test.filter(params_training)

# Finally, filter everything all at once so we can check results
mod_all = sm.tsa.SARIMAX(endog, exog=exog, order=arima_order,
    enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
    tolerance=tolerance)
res_all = mod_all.filter(params_training)

from numpy.testing import assert_allclose
assert_allclose(res_all .filtered_state[..., iFitEnd + 1:iDataEnd],
                res_test.filtered_state)
assert_allclose(res_all .get_prediction(start=iFitEnd+1, end=iDataEnd-1).predicted_mean,
                res_test.get_prediction(start=iFitEnd +1-len(endog_training),
                                          end=iDataEnd-1-len(endog_training)).predicted_mean)

gives

>>> import numpy as np
>>> import pandas as pd
>>> import statsmodels.api as sm
>>> data = pd.read_csv('https://github.com/statsmodels/statsmodels/files/3208318/data.zip')
>>> data.drop(columns=['dummy'], inplace=True)
>>> endog = data['x']
>>> # exog = None # no exogeneous variables case
... exog = data[ [x for x in data.columns if x.startswith('e')] ]
>>> 
>>> arima_order = (3, 0, 1)
>>> tolerance = 0
>>> #enforce_stationarity=True
... enforce_stationarity=False
>>> enforce_invertibility=False
>>> 
>>> iFitBegin = 0
>>> iFitEnd = len(endog)-41-1
>>> iDataEnd = len(endog)
>>> 
>>> # Fit on the training sample
... endog_training = endog.iloc[iFitBegin:iFitEnd + 1]
>>> exog_training  = exog .iloc[iFitBegin:iFitEnd + 1] if exog is not None else None
>>> mod_training = sm.tsa.SARIMAX(endog_training, exog=exog_training, order=arima_order,
...     enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
...     tolerance=tolerance)
>>> res_training = mod_training.fit()
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =           46     M =           10
 This problem is unconstrained.

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.25198D-01    |proj g|=  4.98917D-02

At iterate    5    f=  2.23076D-01    |proj g|=  2.70743D-02

At iterate   10    f=  2.21376D-01    |proj g|=  5.78439D-02

At iterate   15    f=  2.09607D-01    |proj g|=  2.00068D-01

At iterate   20    f=  1.94864D-01    |proj g|=  2.18334D-01

At iterate   25    f=  1.94039D-01    |proj g|=  1.26863D-02

At iterate   30    f=  1.93994D-01    |proj g|=  1.40626D-02

At iterate   35    f=  1.93986D-01    |proj g|=  2.43847D-03

At iterate   40    f=  1.93985D-01    |proj g|=  5.24934D-04

At iterate   45    f=  1.93985D-01    |proj g|=  1.14569D-03

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
   46     45     58      1     0     0   1.146D-03   1.940D-01
  F =  0.19398530156927296     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             

 Cauchy                time 0.000E+00 seconds.
 Subspace minimization time 0.000E+00 seconds.
 Line search           time 0.000E+00 seconds.

 Total User time 0.000E+00 seconds.

>>> params_training = res_training.params
>>> 
>>> # Now apply the filter to the next dataset
... endog_test = endog.iloc[iFitEnd + 1:iDataEnd]
>>> exog_test  = exog .iloc[iFitEnd + 1:iDataEnd] if exog is not None else None
>>> mod_test = sm.tsa.SARIMAX(endog_test, exog=exog_test, order=arima_order,
...     enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
...     tolerance=tolerance)
>>> mod_test.ssm.initialize_known(res_training.predicted_state    [..., -1],
...                               res_training.predicted_state_cov[..., -1])
>>> res_test = mod_test.filter(params_training)
>>> 
>>> # Finally, filter everything all at once so we can check results
... mod_all = sm.tsa.SARIMAX(endog, exog=exog, order=arima_order,
...     enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
...     tolerance=tolerance)
>>> res_all = mod_all.filter(params_training)
>>> 
>>> from numpy.testing import assert_allclose
>>> assert_allclose(res_all .filtered_state[..., iFitEnd + 1:iDataEnd],
...                 res_test.filtered_state)
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File ".../Python-3.6.3/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 1493, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File ".../Python-3.6.3/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatch: 34.1%
Max absolute difference: 0.16789461
Max relative difference: inf
 x: array([[ 1.700444e-01,  5.913961e-02, -1.705444e-01, -1.296107e-01,
         2.856858e-01, -1.929758e-01,  6.747454e-02, -2.871661e-01,
        -1.312029e-02,  2.934357e-01, -4.076935e-01,  4.119701e-01,...
 y: array([[ 1.700444e-01,  5.913961e-02, -1.705444e-01, -1.296107e-01,
         2.856858e-01, -1.929758e-01,  6.747454e-02, -2.871661e-01,
        -1.312029e-02,  2.934357e-01, -4.076935e-01,  4.119701e-01,...
>>> assert_allclose(res_all .get_prediction(start=iFitEnd+1, end=iDataEnd-1).predicted_mean,
...                 res_test.get_prediction(start=iFitEnd +1-len(endog_training),
...                                           end=iDataEnd-1-len(endog_training)).predicted_mean)
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File ".../Python-3.6.3/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 1493, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File ".../Python-3.6.3/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatch: 100%
Max absolute difference: 0.16789461
Max relative difference: 6.13242913
 x: array([-1.167966,  1.463285, -0.236501, -0.028315,  0.087713, -0.197135,
        0.026948,  0.036051,  0.115366,  0.072279, -0.008309,  0.192155,
       -0.091452,  0.080386, -0.396223,  0.072319, -0.061391,  0.012717,...
 y: array([-1.308032,  1.505413, -0.186883, -0.171807, -0.080181, -0.273773,
       -0.057278, -0.033971,  0.024664, -0.021499, -0.081685,  0.108427,
       -0.158343,  0.015663, -0.437569,  0.027645, -0.09929 , -0.024326,...

data.zip

@ChadFulton (Member)

Indeed, once I did that, the assertion passed successfully, but once I set enforce_stationarity=False (and added exogenous data), the assertion failed once again. I have created a test data set with one endogenous variable and a few exogenous variables, since this more closely resembles one of our use cases. The corresponding code follows. Is there anything else one has to do to make the code snippet work correctly when enforce_stationarity=False? Thank you very much for your help!

This is kind of strange - I copied and pasted your code into a notebook, and it runs through for me with no errors (the assertions pass). I may be using a more recent version of Statsmodels, though, and maybe we fixed a bug since the last release (which was quite a while ago).

@ChadFulton (Member)

@ChadFulton My goal was in no way to criticize the existing code, only to offer help. My apologies if that was the impression I created.

No problem at all - I was not in any way offended, and I'm sorry if it came off that way! We very much appreciate your offer to help.

Since what worried you most was how to wrap the code snippet in a nice user interface, that is exactly what I tried to help with. However, I quickly realized that it was not easy, and thinking about the reasons led me to the conclusion that it was due to how the code is currently organized. For example, the function implementing an incremental Kalman filter has to create an exact copy of the original model that was used to fit the training sample

Yes, that's right - in fact we already do this to support forecasting. See e.g. the FilterResults.predict method. I was thinking this same approach could be used here.

In addition, I believe that restructuring the code would greatly enhance its performance. At the moment, I can see a speed-up by a factor of <15 in a simple test case, while one should be able to get a speed-up factor of ~3000 in the same case.

If you have examples of performance improvements that you can share, that would be interesting to see.

@vss888 (Author) commented May 24, 2019

@ChadFulton "This is kind of strange - I copied and pasted your code into a notebook, and it runs through for me with no errors (the assertions pass). I may be using a more recent version of Statsmodels, though, and maybe we fixed a bug since the last release (which was quite a while ago)."

You are right, once I installed the latest master of statsmodels, the assertions in the code snippet passed.

Here is code that benchmarks how much faster predictions can be made in an online setting (for one particular test data/model case):

import numpy as np
import pandas as pd
import statsmodels.api as sm
data = pd.read_csv('https://github.com/statsmodels/statsmodels/files/3208318/data.zip')
data.drop(columns=['dummy'], inplace=True)
endog = data['x']
# exog = None # no exogeneous variables case
exog = data[ [x for x in data.columns if x.startswith('e')] ]

arima_order = (3, 0, 1)
tolerance = 0
enforce_stationarity=False
enforce_invertibility=False

iFitBegin = 0
iFitEnd = len(endog)-41-1
iDataEnd = len(endog)

def CreateModel(endog_data, exog_data):
    return sm.tsa.SARIMAX(endog_data, exog=exog_data, order=arima_order,
        enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
        tolerance=tolerance)

# Fit on the training sample
endog_training = endog.iloc[iFitBegin:iFitEnd + 1]
exog_training  = exog .iloc[iFitBegin:iFitEnd + 1] if exog is not None else None
mod_training = CreateModel(endog_training, exog_training)
res_training = mod_training.fit()
params_training = res_training.params

from time import process_time

prediction_steps = list(range(iFitEnd+1,iDataEnd))

def GetLastState(fit_result):
    return (fit_result.predicted_state    [..., -1],
            fit_result.predicted_state_cov[..., -1])

last_state = GetLastState(res_training)
results1 = list()
t1 = -process_time()
for i in prediction_steps:
    # Now apply the filter to the next dataset
    endog_test = endog.iloc[i:i+1]
    exog_test  = exog .iloc[i:i+1] if exog is not None else None
    mod_test = CreateModel(endog_test, exog_test)
    mod_test.ssm.initialize_known(*last_state)
    res_test = mod_test.filter(params_training)
    last_state = GetLastState(res_test)
    results1.append( res_test.get_prediction(start=0, end=0).predicted_mean.values[0] )
t1 += process_time()

results2 = list()
t2 = -process_time()
for i in prediction_steps:
    # Finally, filter everything all at once so we can check results
    endog_all = endog.iloc[iFitBegin:i+1]
    exog_all  = exog .iloc[iFitBegin:i+1] if exog is not None else None
    mod_all = CreateModel(endog_all, exog_all)
    res_all = mod_all.filter(params_training)
    results2.append( res_all.get_prediction(start=i-iFitBegin, end=i-iFitBegin).predicted_mean.values[0] )
t2 += process_time()

print('time incremental:',t1,'sec')
print('time all        :',t2,'sec')
print('speed up        :',t2/t1)
print('achievable speed up estimate:',iFitEnd-iFitBegin+0.5*(iDataEnd-(iFitEnd+1)))
# above is an estimate of how many more Kalman filter steps are done in the 2nd case vs 1st one

from numpy.testing import assert_allclose
assert_allclose(np.array(results1), np.array(results2))

The relevant output is the following:

time incremental: 0.5111017090000018 sec
time all : 13.707173090999998 sec
speed up : 26.818875479439942
achievable speed up estimate: 2725.5

This means that the current incremental version is about 27 times faster than the original one (which reruns the Kalman filter over the whole data set up to the last available data point). However, this version is still about 100 times slower than it could be (the potentially achievable speed-up factor is ~2700) if all the unnecessary boilerplate code were removed (i.e. if the code were better organized to avoid calling the boilerplate code for every prediction). I estimated the achievable speed-up factor by simply counting how many more Kalman filter steps the second version does in comparison to the first (i.e. incremental) one.

To achieve better performance, one should be able to:

  • add data points to an existing model (instead of recreating one from scratch), optionally skipping checks of the input for correctness,
  • run the Kalman filter from the last data point for which the state is already known up to the last added data point.

These two steps could actually be merged into a single new function, e.g. add_and_filter_new(data, validate_data=True). It would be great if we could add such functionality, and I am ready to help with it.
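
Roughly, here is a sketch of the behaviour I have in mind, reusing the CreateModel and GetLastState helpers from the benchmark above (the name add_and_filter_new is of course only a proposal, and a real implementation would live inside the model/results classes rather than recreate the model):

def add_and_filter_new(prev_result, new_endog, new_exog=None, validate_data=True):
    # combine the two steps above: attach new data and filter only over it,
    # starting from the last predicted state of `prev_result`
    if validate_data:
        assert len(new_endog) > 0
    mod_new = CreateModel(new_endog, new_exog)
    mod_new.ssm.initialize_known(*GetLastState(prev_result))
    return mod_new.filter(prev_result.params)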

@ChadFulton (Member)

Thanks for posting that. I agree that it can definitely be faster to not re-filter all of the original observations if you don't need to.

These two steps could actually be merged into a single new function, e.g. add_and_filter_new(data, validate_data=True). It would be great if we could add such functionality, and I am ready to help with it.

I agree that this would be great! I'd be happy to look over a PR.

I just want to add the caution that there are a lot of things going on in the state space models that need to be maintained with a PR (a few issues are mentioned in #4188). Hopefully this is relatively straightforward, but I would hazard a guess that trying to get absolutely maximal performance may be hard to integrate with the current code. My suggestion is to start with a new method that just wraps the approach I gave (I think this alone will require plenty of work).

@vss888 (Author) commented May 30, 2019

@ChadFulton There are some surprising discoveries about where the time is actually spent. For the test model benchmark, here is the approximate breakdown for the incremental filtering code:

  • data selection: 0.05 sec
  • model creation: 0.07 sec
  • setting last state: 0.00 sec
  • filtering: 0.84 sec
  • getting last state: 0.03 sec
  • compute prediction: 0.01 sec

So, most of the time is actually spent doing one filtering step (not getting everything else prepared for it). Moreover, if I request the raw result (return_ssm=True) from the filter function, the filtering time drops to 0.09 sec. That means that most of the time (~90%) in the filter function is spent constructing the result object (i.e. in the _wrap_results call). If you have any insight into how realistic it is to make that call faster, please let me know. For example, maybe there is an easy way to create predictions from the raw results of a filter call? If so, one could skip construction of SARIMAXResults altogether and go straight to the point.

Also, I have discovered that the SARIMAXResults.get_prediction function re-creates the original SARIMAX model. That means a SARIMAX creates a SARIMAXResults, which creates a SARIMAX again to make a prediction. My impression is that the SARIMAXResults should not be created in the first place, and the model object itself should be used to make predictions. Otherwise, we just copy the same information back and forth, and create and re-create the same objects (first we create a SARIMAX to do a fit, then we create another SARIMAX to do filtering, and then we create one more SARIMAX to make a prediction). This may be how it is currently done in statsmodels in general, but this observation reinforces my skepticism that it is a good idea to do it this way.

People do use a scikit-learn-like interface for doing time series analysis (for example, in Keras) and it works great. I do not see why it cannot be done in statsmodels. It was sad for me to discover that in some polls on data analytics tools statsmodels is not even an option one can vote for. Also, I remember the hard time I had figuring out how to use statsmodels when I was just starting with it. It all points in the same direction: the statsmodels code could be organized better.

@ChadFulton (Member)

So, most of the time is actually spent doing one filtering step (not getting everything else prepared for it). Moreover, if I request the raw result (return_ssm=True) from the filter function, the filtering time drops to 0.09 sec. That means that most of the time (~90%) in the filter function is spent constructing the result object (i.e. in the _wrap_results call). If you have any insight into how realistic it is to make that call faster, please let me know.

There are a couple of things slowing things down here:

First, the full results object computes standard errors for the parameters. The default for this is the outer product of gradients method. We compute the gradients numerically, and this requires additional iterations of the Kalman filter. If you don't need these, you can pass cov_type='none' to the filter call.
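
For example, in the benchmark above the filter call would just become:

# skip computing parameter standard errors (and the extra numerical-gradient
# Kalman filter iterations they require)
res_test = mod_test.filter(params_training, cov_type='none')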

Aside from this, most of the remaining time is spent in KalmanFilter.update_filter. This is the step where we convert the Cython memoryviews into numpy array objects. In addition to just converting, we have to make copies (so that your results arrays from one application of the Kalman filter don't change when you run the filter again with different parameters). There is a lot of output from the Kalman filter, and so this can be expensive with long input arrays.

For example, maybe there is an easy way to create predictions from the raw results of a filter call? If so, one could skip construction of SARIMAXResults altogether and go straight to the point.

It is possible to skip both the construction of SARIMAXResults and even the KalmanFilter.update_filter call. For example, the loglike and loglikeobs methods skip these since they don't need the output.

How exactly this would work depends on exactly what you want to do. In general you would want to call the _filter method of self.ssm rather than the filter method, and then manually retrieve the results you want directly from the Cython instance.

Also, I have discovered that the SARIMAXResults.get_prediction function re-creates the original SARIMAX model. That means a SARIMAX creates a SARIMAXResults, which creates a SARIMAX again to make a prediction.

This is simply a convenient way to generate updated state space system matrices in the time-varying case (which occurs for example if trend='t'). It takes a trivial amount of time and prevents a lot of code duplication.

My impression is that the SARIMAXResults should not be created in the first place, and the model object itself should be used to make predictions. Otherwise, we just copy the same information back and forth, and create and re-create the same objects (first we create a SARIMAX to do a fit, then we create another SARIMAX to do filtering, and then we create one more SARIMAX to make a prediction). This may be how it is currently done in statsmodels in general, but this observation reinforces my skepticism that it is a good idea to do it this way.

We have been careful to avoid extraneous information copying, while also balancing user-friendliness for the most typical use cases.

People do use a scikit-learn-like interface for doing time series analysis (for example, in Keras) and it works great. I do not see why it cannot be done in statsmodels. It was sad for me to discover that in some polls on data analytics tools statsmodels is not even an option one can vote for. Also, I remember the hard time I had figuring out how to use statsmodels when I was just starting with it. It all points in the same direction: the statsmodels code could be organized better.

If you have strong feelings about this, please feel free to bring it up on the mailing list. Of course, there is a lot that can be done to make it more user friendly (especially better documentation and more examples). Unfortunately, everyone is a volunteer with limited time, so we do the best we can :)

However, if a change to e.g. the scikit interface were to occur (which I have to say is very unlikely) it would have to be done across Statsmodels, and not just in the state space component.

@vss888 (Author) commented Jul 22, 2019

@ChadFulton I have some more time to contribute to further improving the efficiency of the Kalman filter implementation. Thank you very much for your reply on StackOverflow! I have installed the latest development version of statsmodels and tested both the extend and append methods on the same test model we discussed above. Here are a few useful observations for the new statsmodels:

  1. Giving cov_type='none' to filter reduces CPU time by ~40% (0.83 sec vs 0.51 sec).
  2. If I use extend (vs explicitly creating a new model, giving it the state and the state covariance matrix, and then running filter on new data), then the CPU time decreases to 0.47 sec, which is only 8% faster than the number above. Giving cov_type='none' as a keyword parameter to extend does not help, which might indicate that currently this parameter is ignored in the function. However, in the new statsmodels the filtering is about 2 times faster than in version 0.9.0.
  3. append is much slower (13.6 sec) than extend: a factor of ~29 in the test, and so it is much slower even than the "old" way (i.e. redoing all the filtering steps from the beginning of the training data) of doing filtering (0.57 sec).
  4. It also seems that giving return_ssm=True to filter does not help with speed anymore.

It is possible to skip both the construction of SARIMAXResults and even the KalmanFilter.update_filter call.
...
How exactly this would work depends on exactly what you want to do.

I want the simplest thing possible: to get the mean prediction one step into the future (after extending the data with a single data point) as quickly as possible. Could you please let me know if you think that approach (i.e. working out the prediction from the raw Cython output) is a promising way to achieve it (especially in light of observation 4 above)? This seems to be the only as-yet-unexplored idea I could fish out of your last post above.

Thank you very much for your help!

@ChadFulton (Member)

Thanks for this comment, there are two things you pointed out that need to be added to the current version of extend: (1) kwargs to pass to the underlying method (like cov_type), and (2) it should transfer the covariance parameters to the new results object.

We'll probably have to revisit your timings after I implement this.

One other note: append is not intended to improve performance, it is just a convenience method for creating a new model with additional observations. This is described in its docstring, but maybe I should be more explicit. As mentioned in #5287, we would ideally rewrite append to make it as fast as extend, but doing so would be relatively tough.
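
For reference, the intended usage of the two methods (in the development version at the time of writing) is roughly:

# `extend` creates results for only the new observations, starting from the end
# of `res_training`; `append` re-creates a model over the combined sample, which
# is convenient but does not avoid re-filtering the original observations.
res_extended = res_training.extend(endog_test, exog=exog_test)
res_appended = res_training.append(endog_test, exog=exog_test)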

Could you please let me know if you think that approach (i.e. working out the prediction from the raw Cython output) is a promising way to achieve it (especially in light of observation 4 above)?

Possibly, although I still think the extend method will work for you once we've sorted out some things.

@vss888 (Author) commented Aug 14, 2019

@ChadFulton Chad, looking at the documentation/source of SARIMAXResults, I can see that one has two options for getting results from the get_prediction function:

full_results : boolean, optional
If True, returns a FilterResults instance; if False returns a
tuple with forecasts, the forecast errors, and the forecast error
covariance matrices. Default is False.

I was wondering if there is any way to avoid computing forecast errors and the forecast error covariance matrices (i.e. only get forecasts) to save time.

Thank you very much for your help!

@vss888 (Author) commented Aug 20, 2019

While working on speeding up the SARIMAXResults.get_prediction function, I found that the issue is due to many layers of result "wrapping":

  • the result is a statsmodels.tsa.statespace.mlemodel.PredictionResultsWrapper
  • which wraps statsmodels.tsa.statespace.mlemodel.PredictionResults
  • which wraps statsmodels.tsa.statespace.kalman_filter.PredictionResults
  • which wraps statsmodels.tsa.statespace.kalman_smoother.SmootherResults
  • which wraps statsmodels.tsa.statespace.kalman_filter.FilterResults
  • which wraps statsmodels.tsa.statespace.representation.FrozenRepresentation

If one only needs the average prediction, it could be made very fast by skipping the Russian doll style layers of result wrapping and doing the following:

sarimaxResults.filter_results.predict(start, start+1, dynamic=False).forecasts[0,0]

In my use case, it is about 120 times faster than calling SARIMAXResults.get_prediction. I am not sure yet if it could be simplified even further. Maybe, we could add a new method implementing the fast prediction?

Could you please let me know if there has been any progress in making SARIMAXResults.extend more performant?

@ChadFulton (Member)

In my use case, it is about 120 times faster than calling SARIMAXResults.get_prediction. I am not sure yet if it could be simplified even further. Maybe, we could add a new method implementing the fast prediction?

You are free to call the method directly, as you just did. What more would you want?

Could you please let me know if there has been any progress in making SARIMAXResults.extend more performant?

Yes, those fixes have been merged, see #6074.

@vss888 (Author) commented Aug 20, 2019

What more would you want?

I was thinking about other people, not about myself. In addition, more functionality could potentially be added, like giving the start time as a non-integer or making dynamic predictions. I am sorry if it came across as asking for something unreasonable.

@vss888 (Author) commented Aug 20, 2019

@ChadFulton Thank you very much, Chad, for the new version of extend. I have just installed the master version of statsmodels and benchmarked it. In my test, the new extend is about as fast as the lower-level way of extending forecasts (only ~8% slower), but of course much more convenient.

Could you please let us know if there is any potential to speed it up even further, or whether it is already as fast as it can be (assuming one only cares about average predictions and does not need errors or the covariance matrix of the model parameters)?

@ChadFulton (Member)

I have just installed the master version of statsmodels and benchmarked it. In my test, the new extend is about as fast as the lower-level way of extending forecasts (only ~8% slower), but of course much more convenient.

Great news!

Could you please let us know if there is any potential to speed it up even further, or whether it is already as fast as it can be (assuming one only cares about average predictions and does not need errors or the covariance matrix of the model parameters)?

I think that most of the obvious performance improvements have been implemented now, although it's possible there could be more to do. If you have a particular example where it seems slow, that would help in determining if we could improve it further.

More generally, we want to balance ease-of-use with performance, so for the main method get_prediction we will be most interested in implementing (a) a level of performance that meets the needs of most use cases, and (b) performance improvements that don't compromise the user experience.

Which is just to say that my goal is not going to be to maximize performance at all costs for the primary interface methods. But I hope that we can provide alternative options for people with that goal - they just might not be as convenient as the primary interface.

@giuliobeseghi

What more would you want?

I was thinking about other people, not about myself. In addition, more functionality could potentially be added, like giving the start time as a non-integer or making dynamic predictions. I am sorry if it came across as asking for something unreasonable.

Totally agree with @vss888. In several cases (such as mine), people want a fast convergence without worrying too much about getting a "detailed" result. It's OK, as long as it's safe enough.

It's true that most of the time people could call methods like sarimaxResults.filter_results.predict directly, but it's unlikely that everyone will know which method to call and how safe it is without having to dig into the docs (and possibly getting lost). I'll give you an example:

How exactly this would work depends on exactly what you want to do. In general you would want to call the _filter method of self.ssm rather than the filter method, and then manually retrieve the results you want directly from the Cython instance.

I know what filter does and I know it's safe because it's designed to be called directly. I guess I could figure out what _filter does, but I will never feel "safe" by calling it directly, because it's a private method and therefore not designed for it. I have no idea about how to retrieve the results from the Cython instance. Maybe it's something really trivial, maybe it's not. Does it make sense for me to spend time on it, without knowing how much time I actually have to spend on it? I guess not, in many cases.

@ChadFulton, I think balancing (or even giving priority to) ease-of-use with performance is the right approach, and the code shouldn't be designed to be "fast" as much as it is designed to be simple, logical and versatile. However, I don't think that cases where we want to achieve a fast optimisation are uncommon. Maybe a lighter but faster version could help in many cases? It could possibly be initialized with a keyword like fast=True, while keeping the default as False to perform the convergence as it is currently done unless otherwise specified.

Mine was just a thought, hopefully it will be helpful, thank you guys for maintaining and contributing to such a useful library :)

@bashtage (Member)

I know what filter does and I know it's safe because it's designed to be called directly. I guess I could figure out what _filter does, but I will never feel "safe" by calling it directly, because it's a private method and therefore not designed for it. I have no idea about how to retrieve the results from the Cython instance. Maybe it's something really trivial, maybe it's not. Does it make sense for me to spend time on it, without knowing how much time I actually have to spend on it? I guess not, in many cases.

More documentation is always welcome. Docs and examples, while not glamorous to write, are much simpler and safer to write than a large refactoring.

For example, it could be very useful to have a notebook focused on pure-prediction applications of SARIMAX which demonstrated all of the available shortcuts.

I think balancing (or even giving priority to) ease-of-use with performance is the right approach, and the code shouldn't be designed to be "fast" as much as it is designed to be simple, logical and versatile. However, I don't think that cases where we want to achieve a fast optimisation are uncommon. Maybe a lighter but faster version could help in many cases? It could possibly be initialized with a keyword like fast=True, while keeping the default as False to perform the convergence as it is currently done unless otherwise specified.

What is needed, then, is a careful line_profiler run of both the relevant Python code and the Cython code. This is a straightforward but time-consuming exercise. It is the only way to precisely determine where the hot spots are that might be worth the time to refactor out of the critical path.
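
For the Python layer, a minimal line_profiler run might look roughly like this (the driver script and function name are illustrative; profiling the Cython layer additionally requires rebuilding the extensions with line tracing enabled):

# driver.py -- illustrative script for kernprof / line_profiler
import numpy as np
import statsmodels.api as sm

@profile  # injected by kernprof at runtime; no import needed
def run_filter(mod, params):
    return mod.filter(params, cov_type='none')

endog = np.log(sm.datasets.macrodata.load_pandas().data['realgdp']).diff().iloc[1:]
mod = sm.tsa.SARIMAX(endog, order=(3, 0, 1))
run_filter(mod, mod.start_params)

# run with:  kernprof -l -v driver.py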

In short, @ChadFulton does a ton of work for this module, which appears to have a pretty wide user base, including in machine learning applications. It would be great to see some community support materialize at any level.

@ChadFulton (Member)

@bashtage is right, it is important going forwards that we get more community support for this project, and there are lots of ways to do that which don't require getting too much in the weeds, like documentation or examples. (And if you do want to get in the weeds, line profiling would be awesome!).

The point that I think I am not getting across clearly is that we are not purposely letting slow code exist, but I have my own use cases and I don't always know what others may want out of this package. If there is an important use case that we have not covered, it would be great to improve that situation, but at the least we need the users who are running into those problems to provide runnable examples that we can examine for bottlenecks. @giuliobeseghi, if you have such an example, please do post it, even if I can't guarantee that we can make it substantially faster.
