SARIMAX: incremental Kalman filter (question) #5563
We don't have this built in, but it is a feature that's been requested in the past and in principle wouldn't be too hard to add. I would be happy to look at a PR to add this feature.
@ChadFulton Thank you very much for your reply, Chad! Such a feature would be incredibly valuable for practical applications of your code. Our estimates indicate that the Kalman filtering part of out-of-sample predictions would become 3,000 to 25,000 times faster (in our use cases) if we could do the filtering incrementally, provided the prediction part is done independently from the filtering. It would be really great if we could look into a PR to add this feature some time soon. If there is any part of the implementation which I could help with, please do let me know and I would be happy to contribute.
@ChadFulton Please let me know if there is any code I could help with to implement the incremental Kalman filtering, in particular how to handle state updates by the model at a given index.
Yes, this is all possible, it's mostly just a matter of figuring out the user interface. If you want to take a stab at it, I'd be happy to review a PR.
@ChadFulton I have just spent a full day reading your code. While I have a better idea of what it does, the overall structure is still a mystery to me, and I still do not see where even to begin to make changes. To my taste, too many different pieces of functionality are lumped together, and the code would benefit from being factored into logically separated pieces. For example, I would not make `KalmanFilter` inherit from `Representation`. Getting back to the immediate needs, my understanding is that I need to:

1. find where a single step of the Kalman filter is implemented in the code;
2. write a function along the following lines:
```python
def filter(initial_representation, kalman_filter, observations, inplace=False):
    # 'initial_representation': representation of the initial state
    # 'observations': vector of observations which follow the initial state
    # 'inplace': whether 'initial_representation' should be updated in place
    representation = initial_representation if inplace else initial_representation.copy()
    for obs in observations:
        kalman_filter.step(representation, obs)  # update representation in place
    return representation
```

Could you please help me with step 1 above, i.e. to find where a single step of the Kalman filter is implemented? Thank you for your help!
I might have found the relevant pieces of code: `{{prefix}}KalmanFilter.__call__` seems to implement all of the Kalman filter steps (the `{{prefix}}` is a dtype template prefix in the Cython sources).
As a preliminary, it's probably useful to know that the background for this implementation comes from my use in time series macroeconometrics, where the goal is usually estimation of parameters / running the smoother on a fixed dataset. There is a performance cost to converting between numpy arrays and Cython memoryviews, and so things have been designed to minimize that conversion. That's why the design strategy is to convert everything once at the beginning to a memoryview, run the full filter and smoother in Cython, and then convert all results back to numpy arrays once at the end. This is of course not the best approach for e.g. optimal control, where the goal is to apply the Kalman filter online and with fixed parameters. However, this was just not the use case this was built for.
It is possible that it would have been better to use composition rather than inheritance, but this is likely not going to change now.
Yes, this is correct. However, I do not think you will want to go this route, for various technical reasons related to what I described above. Because of the design principles behind our implementation, you are likely better off thinking about "batch" updates, where in practice the "batch" could certainly be a single additional observation.

Mathematically, the core idea is the same as the usual online updates: the predicted state and covariance at the end of the training sample summarize everything the filter has learned from it, so a new filter initialized with them and run over the subsequent observations produces exactly the same filtered output as re-running the filter over the full sample. Here is the (not very user-friendly) way to do such a batch update:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
endog = np.log(sm.datasets.macrodata.load_pandas().data['realgdp']).diff().iloc[1:]
endog.index = pd.date_range('1959Q2', '2009Q3', freq='QS')
iFitBegin = 0
iFitEnd = 100
iDataEnd = len(endog)
# Fit on the training sample
endog_training = endog.iloc[iFitBegin:iFitEnd + 1]
mod_training = sm.tsa.SARIMAX(endog_training, order=(1, 0, 0))
res_training = mod_training.fit()
params_training = res_training.params
# Now apply the filter to the next dataset
endog_test = endog.iloc[iFitEnd + 1:iDataEnd]
mod_test = sm.tsa.SARIMAX(endog_test, order=(1, 0, 0))
mod_test.ssm.initialize_known(res_training.predicted_state[..., -1],
res_training.predicted_state_cov[..., -1])
res_test = mod_test.filter(params_training)
# Finally, filter everything all at once so we can check results
mod_all = sm.tsa.SARIMAX(endog, order=(1, 0, 0))
res_all = mod_all.filter(params_training)
from numpy.testing import assert_allclose
assert_allclose(res_all.filtered_state[..., iFitEnd + 1:iDataEnd],
                res_test.filtered_state)
```

It would be nice to wrap this into a convenient function, like suggested in #4188, but it's not clear to me what the exact right interface is.
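For illustration, such a wrapper might look like the sketch below. The function name `filter_new_observations` is hypothetical (not part of statsmodels); it merely packages the `initialize_known` recipe from the snippet above:

```python
import statsmodels.api as sm

def filter_new_observations(res, new_endog, order):
    # Hypothetical helper, not part of statsmodels: continue filtering over
    # `new_endog`, starting from the last predicted state in `res`.
    mod_new = sm.tsa.SARIMAX(new_endog, order=order)
    # Seed the new model with the final one-step-ahead predicted state and
    # covariance from the previous filtering run.
    mod_new.ssm.initialize_known(res.predicted_state[..., -1],
                                 res.predicted_state_cov[..., -1])
    # Re-use the previously estimated parameters; no re-fitting is done.
    return mod_new.filter(res.params)

# Usage with the objects from the snippet above:
# res_test = filter_new_observations(res_training, endog_test, (1, 0, 0))
```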
(Note that this exact example may not work with versions behind what's currently in GitHub master, although there is a way to make it work in older versions too.)
I don't think this is very much. Timing the conversion in the relevant Cython file, I get roughly 700 ns per call. While that is much higher than a no-op, I wouldn't call 700 ns a meaningful performance cost.
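For context, a rough pure-Python analogue of that conversion cost can be measured as below. Cython typed memoryviews are not identical to Python `memoryview` objects, so this only indicates the order of magnitude, not the exact figure quoted above:

```python
import timeit
import numpy as np

a = np.zeros((100, 100))
n = 100_000
# Cost of wrapping an ndarray in a buffer view, and of converting a view back.
t_view = timeit.timeit(lambda: memoryview(a), number=n)
t_back = timeit.timeit(lambda: np.asarray(memoryview(a)), number=n)
print(f"memoryview(a):             {t_view / n * 1e9:.0f} ns per call")
print(f"np.asarray(memoryview(a)): {t_back / n * 1e9:.0f} ns per call")
```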
I guess the best I can do is to rephrase: "at one time there was a perceived performance cost to converting..."
@ChadFulton Thank you very much, Chad, for such a detailed reply! Before deciding on the user interface, I wanted to check how well the code works in practice. I can see that in my use case, a simple test produces somewhat different results in the two cases (the prediction standard deviation gets larger by ~6% when I use the incremental approach). I tried to reproduce this using your code snippet, and I can see that if I replace the ARIMA order (1, 0, 0) with (3, 0, 1), the assertion fails:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
endog = np.log(sm.datasets.macrodata.load_pandas().data['realgdp']).diff().iloc[1:]
endog.index = pd.date_range('1959Q2', '2009Q3', freq='QS')
iFitBegin = 0
iFitEnd = 100
iDataEnd = len(endog)
arima_order = (3, 0, 1)
# Fit on the training sample
endog_training = endog.iloc[iFitBegin:iFitEnd + 1]
mod_training = sm.tsa.SARIMAX(endog_training, order=arima_order)
res_training = mod_training.fit()
params_training = res_training.params
# Now apply the filter to the next dataset
endog_test = endog.iloc[iFitEnd + 1:iDataEnd]
mod_test = sm.tsa.SARIMAX(endog_test, order=arima_order)
mod_test.ssm.initialize_known(res_training.predicted_state [..., -1],
res_training.predicted_state_cov[..., -1])
res_test = mod_test.filter(params_training)
# Finally, filter everything all at once so we can check results
mod_all = sm.tsa.SARIMAX(endog, order=arima_order)
res_all = mod_all.filter(params_training)
from numpy.testing import assert_allclose
assert_allclose(res_all.filtered_state[..., iFitEnd + 1:iDataEnd],
                res_test.filtered_state)
```

fails with an assertion error.
I am looking into it further, but if you have some idea of what might be causing it, please let us know.
Picking a higher MA (i.e. q) or AR (i.e. p) order shows the same behavior. I tested the following combinations:

```python
arima_order = (1, 0, 9)
arima_order = (1, 0, 8)
arima_order = (1, 0, 7)
arima_order = (1, 0, 6)
arima_order = (1, 0, 5)
arima_order = (1, 0, 4)
arima_order = (1, 0, 3)
arima_order = (1, 0, 2)
arima_order = (1, 0, 1)
arima_order = (2, 0, 1)
arima_order = (3, 0, 1)
arima_order = (4, 0, 1)
arima_order = (6, 0, 1)
arima_order = (8, 0, 1)
```
I think the difficulty with figuring out the user interface comes from the fact that `KalmanFilter` inherits from `Representation`, and so the state and the state update operation end up being the same thing. This leads to a situation where, to copy a state, one has to copy a model. Moreover, a model also represents (contains) the data which will be used to fit it, and so a model and data end up being the same thing. The end result is that model, state, and data are all the same object. In addition, states are contained in the fit results, but one does not get a whole state object from a fit result, only an array. Face palm... I would define a separate state concept, distinct from the model and the data. The code does not need rewriting, just refactoring into more independent concepts. Without such refactoring, it is hard to come up with a pretty user interface. I am willing to do the refactoring, but I am unlikely to succeed without some help from somebody who understands the code in detail.
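To make the proposal concrete, here is a purely illustrative sketch of the kind of separation being suggested; none of these classes exist in statsmodels, and the filter step follows the textbook recursion rather than the library's internals:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class KalmanState:
    # The conditional mean and covariance of the state at one point in time;
    # copying a state no longer requires copying a model or its data.
    mean: np.ndarray
    cov: np.ndarray

    def copy(self):
        return KalmanState(self.mean.copy(), self.cov.copy())

class StatelessKalmanFilter:
    # Holds only the system matrices (Z, T, Q, H); owns no data and no state.
    def __init__(self, design, transition, state_cov, obs_cov):
        self.Z, self.T = design, transition
        self.Q, self.H = state_cov, obs_cov

    def step(self, state, obs):
        # Filter one observation and return the next one-step-ahead
        # predicted state (textbook Kalman recursion).
        v = obs - self.Z @ state.mean                # forecast error
        F = self.Z @ state.cov @ self.Z.T + self.H   # forecast error cov
        K = state.cov @ self.Z.T @ np.linalg.inv(F)  # gain
        m = state.mean + K @ v                       # filtered mean
        C = state.cov - K @ self.Z @ state.cov       # filtered cov
        return KalmanState(self.T @ m, self.T @ C @ self.T.T + self.Q)
```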
I have to say that it seems unlikely to me that we would want to fundamentally restructure the state space models at this point, particularly since the solution I gave already works for your use case. I will expand on this more in a second comment replying to your next message.
This is because the Kalman filter converges after some number of iterations to a steady state, after which point we can improve performance by no longer updating the covariance matrix. To determine convergence to the steady state, we have a tolerance that is very close to zero. You can get some (very small) differences if you run the Kalman filter with different convergence tolerances (which is implicitly happening here). If you set the tolerance to zero (i.e. pass `tolerance=0` when constructing the model), the convergence check is disabled and the results should agree.
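Concretely, the tolerance can be set when constructing the model (the same keyword is used in the snippets later in this thread):

```python
import numpy as np
import statsmodels.api as sm

endog = np.log(sm.datasets.macrodata.load_pandas().data['realgdp']).diff().iloc[1:]
# tolerance=0 disables the steady-state convergence shortcut, so the exact
# covariance recursion is run at every step and incremental and full-sample
# filtering should agree to high precision.
mod = sm.tsa.SARIMAX(endog, order=(3, 0, 1), tolerance=0)
```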
A couple of notes on what you've written. First, I'm going to jump to the most important comment, which you may find disappointing:
This is not the design pattern followed by Statsmodels. In the model classes in this library (including OLS, GLM, etc., and also the state space models), the data and the fundamental model specification are always passed as part of the constructor. I realize that this is different from e.g. scikit-learn, but we are not planning to change this. As a result, the suggestion of removing the data from the constructor is unfortunately a non-starter.
Thanks very much for offering to help write code, that is always very much appreciated! However, you seem to be suggesting quite a lot of work that would fundamentally change the way the state space models run. Before considering doing this, I think it would be important to show why our current solution does not work. In particular, why not just wrap the code I gave above into a helper method? If you really want to proceed despite this warning, I would still be happy to review your code, but unfortunately I have to say that I don't have huge amounts of time to discuss rewriting code that seems to work well to me.
It's not clear to me what you are referring to as "a state". Do you mean the conditional mean and covariance at a particular time period?
Similar to my above comment, I don't know what you mean by "the whole state" here. However, the complete output of the Kalman filter (and perhaps smoother) is available in the fit results.
As far as I can tell, this is what our results objects do contain. Is there something that is missing there?
@ChadFulton In no way was my goal to criticize the existing code; I only meant to offer help. My apologies if that was the impression I created. Since what worried you most was how to wrap the code snippet with a nice user interface, this was exactly what I tried to help with. However, I quickly realized that it was not easy, and thinking about the reasons led me to the conclusion that it was due to how the code is currently organized. For example, in a function implementing an incremental Kalman filter, one has to create an exact copy of the original model that was used to fit the training sample, which can only be done by re-creating the model with all of its original constructor arguments.
In addition, I believe that restructuring the code would enhance its performance greatly. At the moment, I can see a speed-up by a factor of <15 in a simple test case, while one should be able to get a speed-up factor of ~3000 in the same case.
Indeed, once I did that, the assertion passed successfully, but once I set `enforce_stationarity=False`, it failed again. The following code:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
data = pd.read_csv('https://github.com/statsmodels/statsmodels/files/3208318/data.zip')
data.drop(columns=['dummy'], inplace=True)
endog = data['x']
# exog = None # no exogeneous variables case
exog = data[ [x for x in data.columns if x.startswith('e')] ]
arima_order = (3, 0, 1)
tolerance = 0
#enforce_stationarity=True # XXX: works
enforce_stationarity=False # XXX: does not work
enforce_invertibility=False
iFitBegin = 0
iFitEnd = len(endog)-41-1
iDataEnd = len(endog)
# Fit on the training sample
endog_training = endog.iloc[iFitBegin:iFitEnd + 1]
exog_training = exog .iloc[iFitBegin:iFitEnd + 1] if exog is not None else None
mod_training = sm.tsa.SARIMAX(endog_training, exog=exog_training, order=arima_order,
enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
tolerance=tolerance)
res_training = mod_training.fit()
params_training = res_training.params
# Now apply the filter to the next dataset
endog_test = endog.iloc[iFitEnd + 1:iDataEnd]
exog_test = exog .iloc[iFitEnd + 1:iDataEnd] if exog is not None else None
mod_test = sm.tsa.SARIMAX(endog_test, exog=exog_test, order=arima_order,
enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
tolerance=tolerance)
mod_test.ssm.initialize_known(res_training.predicted_state [..., -1],
res_training.predicted_state_cov[..., -1])
res_test = mod_test.filter(params_training)
# Finally, filter everything all at once so we can check results
mod_all = sm.tsa.SARIMAX(endog, exog=exog, order=arima_order,
enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
tolerance=tolerance)
res_all = mod_all.filter(params_training)
from numpy.testing import assert_allclose
assert_allclose(res_all .filtered_state[..., iFitEnd + 1:iDataEnd],
res_test.filtered_state)
assert_allclose(res_all .get_prediction(start=iFitEnd+1, end=iDataEnd-1).predicted_mean,
res_test.get_prediction(start=iFitEnd +1-len(endog_training),
                end=iDataEnd-1-len(endog_training)).predicted_mean)
```

gives an assertion failure.
This is kind of strange - I copied and pasted your code into a notebook, and it runs through for me with no errors (the assertions pass). I may be using a more recent version of Statsmodels, though, and maybe we fixed a bug since the last release (which was quite a while ago).
No problem at all - I was not in any way offended, and I'm sorry if it came off that way! We very much appreciate your offer to help.
Yes, that's right - in fact we already do this to support forecasting (see e.g. the forecasting code).
If you have examples of performance improvements that you can share, that would be interesting to see.
You are right: once I installed the latest master of statsmodels, the assertions in the code snippet passed. Here is code that benchmarks how much faster predictions can be done in an online setting (for one particular test data/model case):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
data = pd.read_csv('https://github.com/statsmodels/statsmodels/files/3208318/data.zip')
data.drop(columns=['dummy'], inplace=True)
endog = data['x']
# exog = None # no exogeneous variables case
exog = data[ [x for x in data.columns if x.startswith('e')] ]
arima_order = (3, 0, 1)
tolerance = 0
enforce_stationarity=False
enforce_invertibility=False
iFitBegin = 0
iFitEnd = len(endog)-41-1
iDataEnd = len(endog)
def CreateModel(endog_data, exog_data):
return sm.tsa.SARIMAX(endog_data, exog=exog_data, order=arima_order,
enforce_stationarity=enforce_stationarity, enforce_invertibility=enforce_invertibility,
tolerance=tolerance)
# Fit on the training sample
endog_training = endog.iloc[iFitBegin:iFitEnd + 1]
exog_training = exog .iloc[iFitBegin:iFitEnd + 1] if exog is not None else None
mod_training = CreateModel(endog_training, exog_training)
res_training = mod_training.fit()
params_training = res_training.params
from time import process_time
prediction_steps = list(range(iFitEnd+1,iDataEnd))
def GetLastState(fit_result):
return (fit_result.predicted_state [..., -1],
fit_result.predicted_state_cov[..., -1])
last_state = GetLastState(res_training)
results1 = list()
t1 = -process_time()
for i in prediction_steps:
# Now apply the filter to the next dataset
endog_test = endog.iloc[i:i+1]
exog_test = exog .iloc[i:i+1] if exog is not None else None
mod_test = CreateModel(endog_test, exog_test)
mod_test.ssm.initialize_known(*last_state)
res_test = mod_test.filter(params_training)
last_state = GetLastState(res_test)
results1.append( res_test.get_prediction(start=0, end=0).predicted_mean.values[0] )
t1 += process_time()
results2 = list()
t2 = -process_time()
for i in prediction_steps:
# Finally, filter everything all at once so we can check results
endog_all = endog.iloc[iFitBegin:i+1]
exog_all = exog .iloc[iFitBegin:i+1] if exog is not None else None
mod_all = CreateModel(endog_all, exog_all)
res_all = mod_all.filter(params_training)
results2.append( res_all.get_prediction(start=i-iFitBegin, end=i-iFitBegin).predicted_mean.values[0] )
t2 += process_time()
print('time incremental:',t1,'sec')
print('time all :',t2,'sec')
print('speed up :',t2/t1)
print('achievable speed up estimate:',iFitEnd-iFitBegin+0.5*(iDataEnd-(iFitEnd+1)))
# above is an estimate of how many more Kalman filter steps are done in the 2nd case vs 1st one
from numpy.testing import assert_allclose
assert_allclose(np.array(results1), np.array(results2))
```
The timing output shows that the current incremental version is about 27 times faster than the original one (which re-runs the Kalman filter over the whole data set up to the last available data point). However, this version is still about 100 times slower than it could be (the potentially achievable speed-up factor is ~2700), because unnecessary boilerplate code is executed for every prediction. I estimated the achievable speed-up factor by simply counting how many more Kalman filter steps the second version does in comparison to the first (i.e. incremental) one. To achieve better performance, one should be able to (1) update the filter state with a single new observation without re-creating the model, and (2) obtain the prediction from the updated state without constructing the full results object.
These two steps could actually be merged into a single new function.
Thanks for posting that. I agree that it can definitely be faster to not re-filter all of the original observations if you don't need to.
I agree that this would be great! I'd be happy to look over a PR. I just want to add the caution that there are a lot of things going on in the state space models that need to be maintained with a PR (a few issues are mentioned in #4188). Hopefully this is relatively straightforward, but I would hazard a guess that trying to get absolute maximal performance may be hard to integrate with the current code. My suggestion is to start with a new method that just wraps the approach I gave (I think this alone will require plenty of work).
@ChadFulton There are some surprising discoveries in terms of where the time is actually spent. In the test model benchmark code, I looked at the approximate breakdown of where time goes in the incremental filtering path.
So, most of the time is actually spent doing the one filtering step itself (not getting everything else prepared for it). Moreover, requesting the raw results instead of the full results wrapper makes things noticeably faster. Also, I have discovered that the method which regenerates the state space system matrices takes a noticeable share of the time.

People do use an sklearn-like interface for doing time series analysis (for example, in Keras) and it works great. I do not see why it cannot be done in statsmodels. It was sad for me to discover that in some polls on data analytics tools statsmodels is not even an option one can vote for. Also, I remember the hard time I had figuring out how to use statsmodels when I just started with it. It all points in the same direction: the statsmodels code could be organized better.
There are a couple of things slowing things down here. First, the full results object computes standard errors for the parameters. The default for this is the outer product of gradients method; we compute the gradients numerically, and this requires additional iterations of the Kalman filter. If you don't need these, you can pass `cov_type='none'`. Aside from this, most of the remaining time is spent in overhead around the filter rather than in the filter iterations themselves.
It is possible to skip both the construction of the model object and the construction of the full results wrapper. How exactly this would work depends on exactly what you want to do. In general you would want to call the lower-level filter routines directly.
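For example (a sketch based on the options described above; the exact keywords are from the statsmodels API and may differ across versions), assuming `mod_test` and `params_training` from the earlier snippets:

```python
# Skip the parameter covariance calculation in the results wrapper.
res_fast = mod_test.filter(params_training, cov_type='none')
# Skip the results wrapper entirely and work with the raw state space
# filter output instead.
raw = mod_test.filter(params_training, return_ssm=True)
print(raw.filtered_state.shape)
```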
This is simply a convenient way to generate updated state space system matrices in the time-varying case (which occurs, for example, if exogenous regressors are included).
We have been careful to avoid extraneous information copying, while also balancing user-friendliness for the most typical use cases.
If you have strong feelings about this, please feel free to bring it up on the mailing list. Of course, there is a lot that can be done to make it more user friendly (especially better documentation and more examples). Unfortunately, everyone is a volunteer with limited time, so we do the best we can :) However, if a change to e.g. the scikit interface were to occur (which I have to say is very unlikely) it would have to be done across Statsmodels, and not just in the state space component.
@ChadFulton I have some more time to contribute to further improving the efficiency of the Kalman filter implementation. Thank you very much for your reply on StackOverflow! I have installed the latest development version of statsmodels and tested both the `extend` and `append` methods on the same test model which we discussed above. Here are a few useful observations for the new statsmodels:
I want the simplest thing possible: to get the mean prediction one step into the future (after extending the data with a single data point) as quickly as possible. Could you please let me know if you think the approach of working out the prediction from the raw Cython output is a promising way to achieve it (especially in light of observation 4 above)? This seems to be the only unexplored idea I could fish out of your last post above. Thank you very much for your help!
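For reference, a sketch of how the two methods under discussion are used (based on the development-version API at the time; signatures may differ in other releases), assuming `res_training` and `endog_test` from the first (no-exog) snippet above:

```python
# `extend` filters only the new observations, initialized from the last
# predicted state, without revisiting the training sample.
res_ext = res_training.extend(endog_test)
print(res_ext.forecast(1))

# `append` returns a results object covering the combined sample.
res_app = res_training.append(endog_test)
print(res_app.forecast(1))
```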
Thanks for this comment, there are two things you pointed out that need to be added to the current version of `extend`. We'll probably have to revisit your timings after I implement this. One other note:
Possibly, although I still think the `extend`-based approach is the way to go.
@ChadFulton Chad, looking at the documentation/source of SARIMAXResults, I can see that one has two options to get results from the prediction function.
I was wondering if there is any way to avoid computing the forecast errors and the forecast error covariance matrices (i.e. to get only the forecasts) in order to save time. Thank you very much for your help!
While working on speeding up the prediction step, I noticed the following: if one only needs the average prediction, it can be made very fast by skipping the Russian-doll-style layers of result wrapping and doing the following:

```python
sarimaxResults.filter_results.predict(start, start + 1, dynamic=False).forecasts[0, 0]
```

In my use case, it is about 120 times faster than calling `get_prediction`. Could you please let me know if there has been any progress on the two fixes you mentioned above?
You are free to call the method directly, as you just did. What more would you want?
Yes, those fixes have been merged, see #6074.
I was thinking about other people, not about myself. In addition, more functionality could potentially be added, like giving the start time as a non-integer or making dynamic predictions. I am sorry if it came across as asking for something unreasonable.
@ChadFulton Thank you very much, Chad, for the new version of `extend`! Could you please let us know if there is any potential to speed it up even further, or whether it is already as fast as it can be (assuming that one only cares about average predictions, and does not need errors or the covariance matrix of model parameters)?
Great news!
I think that most of the obvious performance improvements have been implemented now, although it's possible there could be more to do. If you have a particular example where it seems slow, that would help in determining if we could improve it further. More generally, we want to balance ease-of-use with performance for the main interface methods. Which is just to say that my goal is not going to be to maximize performance at all costs for the primary interface methods. But I hope that we can provide alternative options for people with that goal - they just might not be as convenient as the primary interface.
Totally agree with @vss888. In several cases (such as mine), people want fast convergence without worrying too much about getting a "detailed" result. It's OK, as long as it's safe enough. It's true that most of the time people could call the lower-level methods directly.
I know, @ChadFulton. I think balancing (or even giving priority to) ease-of-use with performance is the right approach, and the code shouldn't be designed to be "fast" as much as it is designed to be simple, logical and versatile. However, I don't think that cases where we want to achieve a fast optimisation are uncommon. Maybe a lighter but faster version could help in many cases? It could possibly be enabled with a keyword argument at initialization. Mine was just a thought, hopefully it will be helpful. Thank you guys for maintaining and contributing to such a useful library :)
More documentation is always welcome. Docs and examples, while not glorious to write, are much simpler and safer to write than a large refactoring. For example, it could be very useful to have a notebook focused on pure-prediction applications of SARIMAX which demonstrated all of the available shortcuts.
What is needed then is a careful line_profiler run of both the relevant Python code and the Cython code. This is a straightforward but time-consuming exercise. This is the only way to precisely determine where the hot spots are that might be worth investing the time to refactor out of a path. In short, @ChadFulton does a ton for this module that appears to have a pretty wide user base including in machine learning applications. It would be great to see some community support materializing at any level.
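A sketch of how the Python side of such a run might be set up with line_profiler (assumed installed; the Cython side additionally requires building with the profiling/linetrace compiler directives):

```python
import numpy as np
import statsmodels.api as sm
from line_profiler import LineProfiler

endog = np.log(sm.datasets.macrodata.load_pandas().data['realgdp']).diff().iloc[1:]
mod = sm.tsa.SARIMAX(endog, order=(1, 0, 0))

profiler = LineProfiler()
# Time the high-level filter entry point line by line.
profiler.add_function(type(mod).filter)
profiler.runcall(mod.filter, mod.start_params)
profiler.print_stats()
```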
@bashtage is right, it is important going forwards that we get more community support for this project, and there are lots of ways to do that which don't require getting too much in the weeds, like documentation or examples. (And if you do want to get in the weeds, line profiling would be awesome!). The point that I think I am not getting across clearly is that we are not purposely letting slow code exist, but I have my own use cases and I don't always know what others may want out of this package. If there is an important use case that we have not covered, it would be great to improve that situation, but at the least we need the users who are running into those problems to provide runnable examples that we can examine for bottlenecks. @giuliobeseghi, if you have such an example, please do post it, even if I can't guarantee that we can make it substantially faster.
Could you please let me know if it is possible to use the current SARIMAX code to apply the Kalman filter incrementally (to make it faster)?

Here is what I mean. Imagine one fits a SARIMAX model (#1) using data with indices in `[iFitBegin, iFitEnd]`, then one wants to use such a model to make predictions for data with indices in `[iFitEnd+1, iDataEnd]`. To do that in the current code, it seems, one has to create a new SARIMAX model (#2) using data with indices in `[iFitBegin, iDataEnd]` and apply the Kalman filter on model #2 with parameters taken from model #1. After that, one can request predictions from model #2. However, this is computationally expensive (for large models) if predictions are made on streaming data (where new data arrives one point at a time), since every time one has to redo all Kalman filtering from `iFitBegin` to the index `iPred` (the one at which a prediction is required). Mathematically, the Kalman filter is applied incrementally: if filtering is done up to index `iPred`, then to do filtering up to index `iPred+1` one just has to take the state at `iPred` and apply one step of the Kalman filter using the single new data point at `iPred+1`. It would be great if one could do it in the code incrementally as well.

Could you please let me know if there is a way to do such incremental Kalman filtering (i.e. one state update at a time) in the current SARIMAX implementation?

Thank you for your help!!!