WIP: Automatic forecasting GSoC-2018-Project #4621

Open · wants to merge 49 commits into statsmodels:master from abhijeetpanda12:auto-forecast-1
Conversation

5 participants
abhijeetpanda12 commented May 12, 2018

@ChadFulton I have only pushed the function definition. We can discuss the improvements/ additions over here.


coveralls commented May 12, 2018

Coverage Status

Coverage increased (+0.03%) to 84.267% when pulling a007dd7 on abhijeetpanda12:auto-forecast-1 into 428b49c on statsmodels:master.


ChadFulton commented May 12, 2018

Thanks for setting up this PR!

I don't see a function definition yet, though, so maybe you didn't commit it before pushing?

Below is a copy of what I e-mailed to you yesterday, for reference:

I agree that getting started with a function to select the ARIMA orders (p, q) is a good idea. I think you should have the following in mind:

  • Let's focus on the SARIMAX class for now (I know it better than the ARIMA classes, and in some ways it is more consistent)
  • For now, why don't you create a function that does (p, q) selection for SARIMAX models (in the future we can try to generalize if that seems useful, but for now let's keep it simple).

Why don't you start with the following:

  • Check out a new feature branch in Git (named whatever you like)
  • Create a new package (i.e. directory) under statsmodels/tsa called "automatic" (and put in at least an __init__.py file)
  • Create a new module (i.e. file) in statsmodels/tsa/automatic called "sarimax.py"
  • Create a function in that module called "auto_order" (or if you want to suggest another name, that's good too).
  • As we go through the summer we will improve the features of this function, but for now, maybe start with the following signature:
def auto_order(endog, criteria='aic', d=0, max_order=(3, 3), spec=None):
  • criteria will eventually be expanded, but for now you can check for anything other than aic and raise a NotImplementedError
  • For now, spec should be a dictionary of kwargs that you will pass to each of the SARIMA models (e.g. this could include exog, or seasonal_order, etc.).

I think you should start with non-stepwise model selection (i.e. just the exhaustive search) which should be reasonably straightforward and then we can get started with the stepwise approach later this week.

Two notes:

  • Please do follow up with any questions even though there may be a slight delay from me in responding.
  • Please push the PR to Github and create a Pull Request as soon as you have the basic structure set up and just a stub of the function (i.e. you should only have
def auto_order(endog, criteria='aic', d=0, max_order=(3, 3), spec=None):
    pass

in your sarimax.py file). Then as you add new features, you can push them up too, and we can have better discussions on the Github pull request page.
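The exhaustive (non-stepwise) search described above might be sketched roughly as follows. This is not code from the PR; `fit_criterion` is a hypothetical stand-in for fitting `sm.tsa.statespace.SARIMAX(endog, order=(p, d, q), **spec)` and returning the requested information criterion (e.g. `res.aic`):

```python
def auto_order_exhaustive(max_order, fit_criterion):
    """Return the (p, q) pair minimizing an information criterion.

    max_order : tuple (max_p, max_q); the grid is searched inclusively.
    fit_criterion : callable (p, q) -> float; hypothetically, a wrapper
        that fits a SARIMAX model of that order and returns its AIC.
    """
    best_order, best_value = None, float('inf')
    # exhaustive grid search over every candidate (p, q) order
    for p in range(max_order[0] + 1):
        for q in range(max_order[1] + 1):
            value = fit_criterion(p, q)
            if value < best_value:
                best_order, best_value = (p, q), value
    return best_order
```

Keeping the model-fitting step behind a callable here is only for illustration; the actual function discussed in this PR fits the SARIMAX models inline.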


ChadFulton commented May 12, 2018

When you get a chance, I think it would be great if you could send an update to the mailing list with the link to this PR and a status update on what you're planning for this week (i.e. the first week of your proposal).


ChadFulton commented May 12, 2018

Hopefully the basic non-stepwise version of this function is straightforward and can be pushed pretty quickly today or tomorrow, since we should get ARMA p and q selection done this week. But if you're having trouble figuring out how to get started, that's no problem at all. A couple of good ways to get help if you're stuck:

  • You can always add a comment here with your question, and maybe some short example code.
  • If the example code is longer, you can create a gist (gist.github.com) and share the link to it.
  • Especially if the question is design or big-picture oriented, you can also feel free to send a question to the mailing list (even though it's probably still me and Josef commenting :)

abhijeetpanda12 commented May 13, 2018

I have added the function definition now.
In case I get stuck anywhere, I'll comment here about my issues and post my progress on the mailing list.


josef-pkt commented May 13, 2018

This pull request introduces 1 alert when merging adfce60 into f790b88 - view on lgtm.com

new alerts:

  • 1 for Unnecessary pass

Comment posted by lgtm.com


codecov-io commented May 13, 2018

Codecov Report

Merging #4621 into master will decrease coverage by 0.65%.
The diff coverage is 76.22%.


@@            Coverage Diff             @@
##           master    #4621      +/-   ##
==========================================
- Coverage   81.75%   81.09%   -0.66%     
==========================================
  Files         569      592      +23     
  Lines       89851    90470     +619     
  Branches    10022    10120      +98     
==========================================
- Hits        73455    73367      -88     
- Misses      14126    14799     +673     
- Partials     2270     2304      +34
Impacted Files Coverage Δ
statsmodels/tsa/automatic/tests/test_forecast.py 100% <100%> (ø)
statsmodels/tsa/automatic/__init__.py 100% <100%> (ø)
...tatsmodels/tsa/automatic/tests/test_exponential.py 100% <100%> (ø)
statsmodels/tsa/automatic/tests/test_sarimax.py 100% <100%> (ø)
statsmodels/tsa/automatic/tests/test_tscv.py 100% <100%> (ø)
statsmodels/tsa/automatic/tests/test_transform.py 100% <100%> (ø)
statsmodels/tsa/automatic/sarimax.py 42.3% <42.3%> (ø)
statsmodels/tsa/automatic/exponentialsmoothing.py 54.54% <54.54%> (ø)
statsmodels/tsa/automatic/forecast.py 65.55% <65.55%> (ø)
statsmodels/tsa/automatic/tscv.py 66.66% <66.66%> (ø)
... and 339 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 428b49c...a007dd7. Read the comment docs.


abhijeetpanda12 commented May 14, 2018

@ChadFulton I have made a GitHub gist for the non-stepwise model selection for SARIMAX at
https://gist.github.com/abhijeetpanda12/1dad1bdab503edf2bec98934591d8ced
Could you take a look and suggest the changes/improvements I should make?


ChadFulton commented May 14, 2018

Great! As I mentioned on the gist, please go ahead and add that code as a commit on this PR. In the meantime, here are some comments:

First, rather than putting any datasets in the sarimax.py file, create a new folder in the "automatic" folder called tests, and create a new file called test_sarimax.py. In that file, create a function (e.g. test_non_stepwise). That will be the first unit test, and you should test that your auto_order function produces the result you expect. For the moment you can just run the function once, and then test that future test runs return that same result (we call this a "smoke test" because all it tests is that nothing has changed in the function).

Then, with the test separated from the code, you should modify the function to use whatever endog is passed, and not a hardcoded dataset.

Finally, you haven't made use of the spec argument:

For now, spec should be a dictionary of kwargs that you will pass to each of the SARIMA models (e.g. this could include exog, or seasonal_order, etc.).

So you would want to do e.g.:

mod = sm.tsa.statespace.SARIMAX(endog, order=(p, d, q), **spec)

(we may expand spec later, but for now this is good enough).


ChadFulton commented May 14, 2018

Then, with the test separated from the code, you should modify the function to use whatever endog is passed, and not a hardcoded dataset.

For examples of these tests, see e.g. the tests in statespace/tests (or anywhere else in Statsmodels).

To run your test, from the command line:

cd /path/to/statsmodels/tsa/automatic
# To run all tests:
pytest tests
# To run only a certain file:
pytest tests/test_sarimax.py
# To run only a certain test:
pytest tests/test_sarimax.py::test_non_stepwise

abhijeetpanda12 commented May 14, 2018

@ChadFulton thanks for your comments and reviews. I have made a few of the necessary changes, like separating the test file from the original sarimax.py file, but I haven't written the unit test yet, which I'll complete and commit today.
Do we have a standard dataset on which to test the functions, or can we use any publicly available dataset?
Now that the non-stepwise function has been created, I am thinking of working on the stepwise algorithm.
Do you have anything else for me to keep in mind before moving to the next step?


ChadFulton commented May 14, 2018

Do we have a standard dataset on which to test the functions, or can we use any publicly available dataset?

Yes, you can use any of the variables in the macrodata dataset. For example, I used the following in one of my test files:

import os

import pandas as pd

from statsmodels import datasets

current_path = os.path.dirname(os.path.abspath(__file__))
macrodata = datasets.macrodata.load_pandas().data
macrodata.index = pd.PeriodIndex(start='1959Q1', end='2009Q3', freq='Q')

For tests in which d=0, you can use macrodata.infl, and for tests with d=1 (or even d=2) you can use macrodata.cpi.


ChadFulton commented May 14, 2018

Now that the non-stepwise function has been created, I am thinking of working on the stepwise algorithm.
Do you have anything else for me to keep in mind before moving to the next step?

Please carry on with the stepwise version. We can always come back and clean things up here later, but it's more important at this stage to keep things moving.

By the way, you should in general keep moving on the project as you see fit / as the timeline suggests - you don't need to check in with me first. Since we are on essentially opposite time zones, you'll need to be able to keep working on things even if you don't hear back from me for several hours.

data = pd.read_stata(BytesIO(wpi1))
data.index = data.t

def auto_order(endog, criteria='aic', d=0, max_order=(3, 3), spec=None):

ChadFulton May 14, 2018

You don't need to (and shouldn't) replicate the function to be tested in the unit test. Instead, you should import the function you want to test, e.g.:

from statsmodels.tsa.automatic import sarimax

and then in your test_non_stepwise function you should do something like:

p, q = sarimax.auto_order(...)  # where you fill in the appropriate parameters for your test
desired_p = ...  # where you fill in whatever you think the returned `p` should be equal to
desired_q = ...  # similarly for what the returned `q` should be

assert_equal(p, desired_p)
assert_equal(q, desired_q)

In general, you'll want to test out various different types of parameterizations (e.g. different max_order, different spec, etc).

@ChadFulton ChadFulton changed the title from "automatic package created" to "WIP: Automatic forecasting" May 15, 2018

@ChadFulton ChadFulton self-assigned this May 15, 2018

@ChadFulton ChadFulton modified the milestones: 0.10, Someday May 15, 2018

aic_matrix = np.zeros(p, q)
for p in range(max_order[0]):
    for q in range(max_order[1]):
        if p == 0 and q == 0:

ChadFulton May 16, 2018

You don't need to skip this case, since if exog is set in spec, then linear regression is the special case p=0, q=0.

abhijeetpanda12 May 17, 2018

Yes, I agree, but if we don't skip this case, the SARIMAX model fails because the number of states must be positive.
I think we should put a check in for this.

ChadFulton May 17, 2018

If this is true, can you file a bug report with code to replicate? When I try, it seems to work:

import pandas as pd
import statsmodels.api as sm

dta = sm.datasets.macrodata.load_pandas().data
dta.index = pd.PeriodIndex(start='1959Q1', end='2009Q3', freq='Q')

mod = sm.tsa.SARIMAX(dta['infl'], order=(0, 0, 0))
res = mod.fit()
print(res.summary())
                           Statespace Model Results                           
==============================================================================
Dep. Variable:                   infl   No. Observations:                  203
Model:                        SARIMAX   Log Likelihood                -619.610
Date:                Thu, 17 May 2018   AIC                           1241.220
Time:                        07:51:27   BIC                           1244.533
Sample:                    03-31-1959   HQIC                          1242.560
                         - 09-30-2009                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
sigma2        26.2234      2.466     10.632      0.000      21.389      31.058
===================================================================================
Ljung-Box (Q):                      712.11   Jarque-Bera (JB):                60.66
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.45   Skew:                             0.74
Prob(H) (two-sided):                  0.00   Kurtosis:                         5.23
===================================================================================

abhijeetpanda12 May 17, 2018

In that case, I'll check this code again and make the necessary changes.

abhijeetpanda12 May 20, 2018

It throws a ValueError saying that the number of states must be positive, which I have pasted at
https://gist.github.com/abhijeetpanda12/e2d2c14744c84b368910842a4ecfe0f7

ChadFulton May 20, 2018

Can you do:

import statsmodels.api as sm
print(sm.version.version)

You seem to be using an old version, whereas you should be developing against master.

abhijeetpanda12 May 21, 2018

I was using version 0.8. I have now rebased my local repository onto the current master, running version 0.10.
Thanks for helping me out.

@abhijeetpanda12 abhijeetpanda12 force-pushed the abhijeetpanda12:auto-forecast-1 branch from b4a20f7 to 201f14c May 22, 2018

@abhijeetpanda12

This comment has been minimized.

Copy link
Author

abhijeetpanda12 commented May 22, 2018

@ChadFulton I have now written the stepwise algorithm, without taking the seasonal components into account. Please review the code and provide some feedback.
Meanwhile, I am working on my targets for the second week: how the results of the forecast function should be designed. Your help in this area would be highly valuable.


josef-pkt commented May 22, 2018

This pull request introduces 1 alert when merging 201f14c into 6611e87 - view on lgtm.com

new alerts:

  • 1 for Redundant comparison

Comment posted by lgtm.com

model = int(np.where(aic_vals == min_aic)[0])
p, q = order_init[model][0], order_init[model][1]
# only p varies by +1
if(p+1 <= max_order[0]):

ChadFulton May 23, 2018

In Statsmodels, we put spaces between mathematical operators other than exponentiation, so this should be p + 1.

Also, no need for the parentheses, so maybe:

if p + 1 <= max_order[0]:

(and similar in the rest of the function).

abhijeetpanda12 May 23, 2018

I have made the changes.


def auto_order(
        endog, criteria='aic', d=0, max_order=(3, 3),
        stepwise=False, **spec):

ChadFulton May 23, 2018

At some point we need to add a docstring here, giving at least the Parameters and Returns sections. It would also be good to have a Notes section describing the basics of the stepwise approach and giving the citation.

abhijeetpanda12 May 23, 2018

I have added a simple docstring with the citation of Hyndman's paper in the Notes section. I'll add more details to it after completing the function.

def auto_order(endog, criteria='aic', d=0, max_order=(3, 3), spec=None):

def auto_order(
        endog, criteria='aic', d=0, max_order=(3, 3),

ChadFulton May 23, 2018

The first line of arguments should be on the same line as the def part.

abhijeetpanda12 May 23, 2018

I followed the E127 convention along with the pycodestyle suggestions, so I had to move it to a new line.
I'll make the desired changes.

p += 1
min_aic = res.aic
# only p varies by -1
if(p-1 >= 0):

ChadFulton May 23, 2018

If the previous case (p+1) improved the aic, then this will test the original p value again.

E.g. if order_init was minimized with p=1, q=0, and then p+1=2 was better than p=1, this section will test p=1 again.

abhijeetpanda12 May 23, 2018

Yes, I understand that. It should be checking for p=0. I'll add a new variable that will keep track of the changes in p and assign it at the end.
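The bookkeeping being discussed here, avoiding a second fit of an order that was already evaluated, can be sketched with a small cache. None of these names come from the PR; `fit_criterion` is a hypothetical stand-in for fitting a SARIMAX model of a given order and returning its AIC:

```python
def auto_order_stepwise(start_order, max_order, fit_criterion):
    """Stepwise (p, q) search that never fits the same order twice.

    start_order : tuple, the (p, q) pair to start from
    max_order : tuple (max_p, max_q), upper bounds on p and q
    fit_criterion : callable (p, q) -> float, e.g. returning the AIC
    """
    tried = {}

    def evaluate(order):
        # cache criterion values so stepping back to an already-fit
        # order (e.g. after p + 1 improved the AIC) costs nothing
        if order not in tried:
            tried[order] = fit_criterion(*order)
        return tried[order]

    best = start_order
    best_value = evaluate(best)
    improved = True
    while improved:
        improved = False
        p, q = best
        # vary p and q by +/-1 around the current best order
        for cand in [(p + 1, q), (p - 1, q), (p, q + 1), (p, q - 1),
                     (p + 1, q + 1), (p - 1, q - 1)]:
            if not (0 <= cand[0] <= max_order[0]
                    and 0 <= cand[1] <= max_order[1]):
                continue
            value = evaluate(cand)
            if value < best_value:
                best, best_value = cand, value
                improved = True
    return best
```

The cache plays the role of the tracking variable mentioned above: re-visiting the original p after an improvement simply reads the stored value instead of re-fitting the model.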

q += 1
min_aic = res.aic
# only q varies by -1
if(q-1 >= 0):

ChadFulton May 23, 2018

Similarly to the p-1 case, if q+1 improved the aic, this will test the original q again.

q += 1
min_aic = res.aic
# both p and q vary by -1
if(q-1 >= 0) and (q-1 >= 0):

ChadFulton May 23, 2018

See my notes on the p-1 and q-1 cases above, similar issue here.

desired_p = 2
desired_q = 2
assert_equal(p, desired_p)
assert_equal(q, desired_q)
assert p == desired_p

ChadFulton May 23, 2018

We use the numpy testing functions (plain assert statements can be stripped out when Python runs with optimizations), so something like:

from numpy.testing import assert_equal

and then later:

assert_equal(p, desired_p)

abhijeetpanda12 May 23, 2018

Sure, I'll use the numpy testing functions.


ChadFulton commented May 23, 2018

I have now written the stepwise algorithm, without taking the seasonal components into account. Please review the code and provide some feedback.

Looks like a great start, I've added a few comments.


ChadFulton commented May 23, 2018

Meanwhile, I am working on my targets for the second week: how the results of the forecast function should be designed.

I'll write some more thoughts on this in response to your question on the mailing list, and maybe we can get others' input as well.

@abhijeetpanda12 abhijeetpanda12 force-pushed the abhijeetpanda12:auto-forecast-1 branch from 201f14c to bca1cec May 23, 2018


abhijeetpanda12 commented May 23, 2018

@ChadFulton I have made all the changes from your review. Please let me know if any other changes are required.

@abhijeetpanda12 abhijeetpanda12 force-pushed the abhijeetpanda12:auto-forecast-1 branch from a852ff8 to a007dd7 Dec 5, 2018


ChadFulton commented Dec 30, 2018

@abhijeetpanda12 - it would be great to get some of your hard work merged into Statsmodels. Do you think you'll have some time to work on this in the coming month?


abhijeetpanda12 commented Dec 30, 2018

@ChadFulton I would definitely like to see this PR merged, and I have recently started working on it. I will be able to take out some time and complete it by the end of next month.


ChadFulton commented Jan 19, 2019

@abhijeetpanda12 how are things? Is there anything you're stuck on, or that we can help with?


abhijeetpanda12 commented Jan 24, 2019

I am having an issue with the Results classes. I want to know which values to include in the results summary. I am taking reference from your PR #5151. Could you give me an overview of what I should consider?


ChadFulton commented Jan 27, 2019

There is no specific list of attributes that Results classes must hold; in general, they should hold the "important results", and those differ depending on the situation. One thing that almost all Results classes include is a summary method that can be used to print a text-based table showing whatever is important about the results.

In this case, I think the Results should hold the different models that were considered, as well as the model that was ultimately selected. In addition, it should show the criterion that was used in selecting the best model (and maybe other criteria, if available, even if they weren't specifically used to select the best model this time).
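As a purely hypothetical illustration of that idea (none of these names come from statsmodels), a minimal results container might hold the candidate orders with their criterion values and expose a summary method:

```python
class AutoOrderResults:
    """Sketch of a results container for automatic order selection.

    candidates : dict mapping (p, q) orders to criterion values; a real
        implementation would likely hold fitted results objects instead.
    """

    def __init__(self, candidates, criterion='aic'):
        self.candidates = candidates
        self.criterion = criterion
        # the selected model is the one minimizing the criterion
        self.selected_order = min(candidates, key=candidates.get)

    def summary(self):
        # text table listing every model considered, marking the winner
        lines = ['Automatic order selection (criterion: %s)' % self.criterion]
        for order, value in sorted(self.candidates.items()):
            marker = '  <- selected' if order == self.selected_order else ''
            lines.append('order=%s  %s=%.2f%s'
                         % (order, self.criterion, value, marker))
        return '\n'.join(lines)
```

This keeps every model that was considered, the selected model, and the selection criterion visible in one place, matching the shape of the Results objects described above without committing to any particular statsmodels API.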

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment