ENH: Wrap X12/X13-ARIMA AUTOMDL. Closes #442. #1587

Merged
merged 22 commits into from Sep 24, 2014

Projects

None yet

6 participants

@jseabold
Member

This works but is still WIP for some cleanup.

  • add to release notes
  • add tests (will have to run locally only unless we build x12 on travis/throw up a binary somewhere)
  • make sure API is sensible
  • deal with x12path systematically, allow setting at build time, allow setting via a function
  • finish docstrings
  • download links for the binaries
  • add to optional dependencies

The spec code in there should allow wrapping more than AUTOMDL, but I don't know that we really need that. We could, for example, allow use of x12arima as a method in ARIMA and use it's seasonal ARIMA until we have a native solution.

@coveralls

Coverage Status

Coverage remained the same when pulling 3aae32f on jseabold:add-x12 into 5171e28 on statsmodels:master.

@bashtage
Contributor

Nice addition.

@coveralls

Coverage Status

Coverage remained the same when pulling fcfeebb on jseabold:add-x12 into d7ff182 on statsmodels:master.

@josef-pkt josef-pkt commented on an outdated diff Apr 15, 2014
statsmodels/tsa/api.py
@@ -13,3 +13,4 @@
from .base import datetools
from .seasonal import seasonal_decompose
from ..graphics import tsaplots as graphics
+import x12
@josef-pkt
josef-pkt Apr 15, 2014 Member

from . import x12

@josef-pkt
Member

some docs in how to get X12/X13 and what needs to be on the path is also needed.
download link ?
https://www.census.gov/srd/www/x13as/

I think once it's working, additional usecases will show up where this could be expanded.

@jseabold
Member

I'm not sure on windows. I might need some help here. I've always built it from source, so I'm not sure all the different binary names you might get. We just need x12a or x13a (x13as?) on the path for linux.

@josef-pkt
Member

x13as.exe
most likely it's easy, unzip and add to system path (or specify path variable)

E:\Josef\_progs\x13as>x13as Testairline -s -o testout

X-13ARIMA-SEATS Seasonal Adjustment Program
Version Number 1.1 Build 9
Execution began  Apr 14, 2014  22.05.58
@jseabold
Member

Apparently the specs for x12 and x13 are different. x13 provides a converter utility. I don't know if it makes sense to rely on this utility. Probably best to have native support for both specs. Or just target x13?

@josef-pkt
Member

I don't see a reason to support x12 if the latest version is x13.
We don't have legacy code to support.

How easy is it to install x13 parallel to x12 on Linux? For users who already have x12.
I don't see a problem on Windows because it's just an unzip as far as I saw.

@jseabold
Member

Well the reason to support it is that I wrote the specs for x12 first.

It's trivial to install the new one. I just built it, and they also provide binaries for linux.

I'll see if there are any changes to the specs we use. Hopefully, they both work out of the box.

@jseabold
Member

Looks like these specifications were unaffected. Seems to run fine in x12 or x13. I'm thinking about changing the namespace to x13 though. I added a note that we use x12 and x13 interchangeably but both should work.

@jseabold
Member

Re: namespace. What's the proper way to only export these two functions in __all__ to the api? I thought if all was defined then it would only tab-complete those, but that's not the case. Should both functions be preceded by x13 and we just import this in the tsa namespace? I.e.,

sm.tsa.x13_select_arima_order
sm.tsa.x13_arima_analysis
@jseabold
Member

I slightly prefer sm.tsa.x13.select_arima_order and sm.tsa.x13.arima_analysis but not enough to bend over backwards to make it happen.

@josef-pkt
Member

If it's just adding two functions, then I would prefer the underline names, not really worth starting a new namespace.
If the plan is to extend this, and we get a collection of functions, then a new namespace would be fine.

y X ?

@josef-pkt josef-pkt and 1 other commented on an outdated diff Apr 15, 2014
statsmodels/tsa/x12.py
+ year, stperiod = start_date.year, start_date.quarter
+ else: # pragma: no cover
+ raise ValueError("Only monthly and quarterly periods are supported."
+ " Please report or send a pull request if you want "
+ "this extended.")
+
+ if hasattr(x, 'name'):
+ name = x.name or "Unnamed Series"
+ else:
+ name = 'Unnamed Series'
+ series_spec = SeriesSpec(data=data, name=name, period=period,
+ title=name, start="{}.{}".format(year, stperiod))
+ return series_spec
+
+
+def x13arima_analysis(y, x12path=None, X=None, log=None, outlier=True,
@josef-pkt
josef-pkt Apr 15, 2014 Member

I think x12path should go later, assuming that in general the path will be picked up automatically and doesn't have to be given on each function call.
maxorder and maxdiff look like higher priority arguments to me.

(spot check: I haven't looked at more than the signature, and whether it's really a function)

@jseabold
jseabold Apr 15, 2014 Member

These are the kind of checks that are helpful. Now that it's general enough to be smart, I think this makes sense. I'm using keyword args in all my user code for this so far.

@josef-pkt
josef-pkt Apr 15, 2014 Member

question: Does this only allow for annual seasonality? Is seasonality inferred from freq, or is there a way to set it explicitly?

@jseabold
jseabold Apr 15, 2014 Member

No.

Seasonality is inferred from the index, but if you pass an array-like with a start date and a freq, then it's settable. That said, it only allows quarterly and monthly with a note to expand this if it's desired for some reason.

@josef-pkt josef-pkt commented on an outdated diff Apr 15, 2014
statsmodels/tsa/x12.py
+ warn("Failed to delete resource {}".format(ftempout.name),
+ IOWarning)
+
+ seasadj = _convert_out_to_series(seasadj, y.index, 'seasadj')
+ trend = _convert_out_to_series(trend, y.index, 'trend')
+ irregular = _convert_out_to_series(irregular, y.index, 'irregular')
+
+ #NOTE: there isn't likely anything in stdout that's not in results
+ # so may be safe to just suppress and remove it
+ if not retspec:
+ return results, seasadj, trend, irregular, stdout
+ else:
+ return results, seasadj, trend, irregular, stdout, spec
+
+
+def select_arima_order(y, x12path=None, X=None, log=None, outlier=True,
@josef-pkt
josef-pkt Apr 15, 2014 Member

same as above, I would think that maxorder will be used most of the time by users

@josef-pkt josef-pkt and 1 other commented on an outdated diff Apr 15, 2014
statsmodels/tsa/x12.py
+ x12path : str or None
+ The path to x12 or x13 binary. If None, the program will attempt
+ to find x13as or x12a on the PATH or by looking at X13PATH or X12PATH
+ depending on the value of prefer_x13.
+ X : array-like
+ Exogenous variables.
+ log : bool or None
+ If None, it is automatically determined whether to log the series or not.
+ If False, logs are not taken. If True, logs are taken.
+ outlier : bool
+ Whether or not outliers are tested for and corrected, if detected.
+ maxorder : tuple
+ The maximum order of the regular and seasonal ARMA polynomials to
+ examine during the model identification. The order for the regular
+ polynomial must be greater than zero and no larger than 4. The
+ order for the seaonal polynomial may be 1 or 2.
@josef-pkt
josef-pkt Apr 15, 2014 Member

typo: seasonal

is order <= 4 an x13 limitation? If we have monthly or higher frequency data, then we might want to try more lags.

@jseabold
jseabold Apr 15, 2014 Member

I really can't type that word. Yes 4 is an x13 limitation. There are also series length, etc. limitations from x13. It may be possible to change it in the source and compile your own (there are some limitations you can change at your own peril), but we don't support this.

@josef-pkt josef-pkt commented on an outdated diff Apr 15, 2014
statsmodels/tsa/x12.py
+ args = [x12path, "-m " + specpath]
+ elif datameta:
+ args = [x12path, "-d " + specpath]
+ else:
+ args = [x12path, specpath]
+
+ if outname:
+ args += [outname]
+
+ return subprocess.Popen(args, stdout=subprocess.PIPE,
+ stderr=subprocess.STDOUT)
+
+
+def _make_automdl_options(maxorder, maxdiff, diff):
+ options = "\n"
+ options += "maxorder = ({} {})\n".format(maxorder[0], maxorder[1])
@josef-pkt
josef-pkt Apr 15, 2014 Member

IIRC, this is not valid python 2.6 syntax {0}, {1}

Changed in version 2.7: The positional argument specifiers can be omitted, so '{} {}' is equivalent to '{0} {1}'.

@jseabold
Member

I think I've ticked off all the check boxes. The path isn't handled systematically in that it's a settable property at build-time or at run-time that will always be respected. We'll have to wait for that once we have a proper options system.

@bashtage bashtage and 2 others commented on an outdated diff Apr 16, 2014
statsmodels/tsa/x13.py
+import tempfile
+import re
+from warnings import warn
+
+import pandas as pd
+
+from statsmodels.compat.python import iteritems
+from statsmodels.tools.tools import Bunch
+from statsmodels.tools.sm_exceptions import (X12NotFoundError, X12Error,
+ IOWarning)
+
+__all__ = ["x13_arima_select_order", "x13_arima_analysis"]
+
+_binary_names = ('x13as.exe', 'x13as', 'x12a.exe', 'x12a')
+
+class _freq_to_period:
@bashtage
bashtage Apr 16, 2014 Contributor

Isn't this an old style class that doesn't work under Python 3?

@jseabold
jseabold Apr 16, 2014 Member

I think all classes in Python 3 are new-style. We inherit from object in < 2.7 to ensure this. Could be wrong though.

@josef-pkt
josef-pkt Apr 16, 2014 Member

We need (object) for python 2 to get new-style classes (which are not new at all anymore).
python 3 only has new-style classes, whether it does or doesn't inherit from object.

I think it would be better to stick to pep-8 capitalized class names

@josef-pkt
josef-pkt Apr 16, 2014 Member

maybe this should inherit from dict ?

@jseabold
jseabold Apr 16, 2014 Member

It's a private, internal only class for this module, so I'm not really that worried about it. Really, it should be replaced by "the way" to do this. I don't know exactly what that is since there's still a few loose ends floating around re: period-handling. I'll likely just file a ticket to look at this once I can come back and merge the exponential smoothing and I have a better idea.

@jseabold
Member

One minor issue is where to put these in the docs. I stuck them with the tsa.stattools stuff in source/tsa.rst, but they might not "belong" there.

@coveralls

Coverage Status

Coverage remained the same when pulling 18e2359 on jseabold:add-x12 into d7ff182 on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 7d895ed on jseabold:add-x12 into d7ff182 on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 7d895ed on jseabold:add-x12 into d7ff182 on statsmodels:master.

@josef-pkt josef-pkt and 2 others commented on an outdated diff Apr 16, 2014
statsmodels/tsa/x13.py
+ year, stperiod = start_date.year, start_date.quarter
+ else: # pragma: no cover
+ raise ValueError("Only monthly and quarterly periods are supported."
+ " Please report or send a pull request if you want "
+ "this extended.")
+
+ if hasattr(x, 'name'):
+ name = x.name or "Unnamed Series"
+ else:
+ name = 'Unnamed Series'
+ series_spec = SeriesSpec(data=data, name=name, period=period,
+ title=name, start="{0}.{1}".format(year, stperiod))
+ return series_spec
+
+
+def x13_arima_analysis(y, maxorder=(2, 1), maxdiff=(2, 1), diff=None, X=None,
@josef-pkt
josef-pkt Apr 16, 2014 Member

I would still prefer the more standard

def x13_arima_analysis(endog, exog=None, maxorder=(2, 1), maxdiff=(2, 1), diff=None, ...

@jseabold
jseabold Apr 16, 2014 Member

Well, I'm still trying to move away from this, and I haven't had time to finish the great rename yet. I had to go through again last week people getting lost when I bring this up. Really don't want to get into this debate again right now though.

@argriffing
argriffing Apr 18, 2014

If statsmodels were only an econometrics package then maybe people would immediately understand endog vs. exog, but coming from outside of econometrics this terminology is strange to me. When I see endog I just think of the shibe doge surrounded with econometrics jargon thought bubbles 'such pricing', 'much elasticity', and I forget what I was doing.

@josef-pkt
josef-pkt Apr 18, 2014 Member

see #395 for the looooong discussion
The main point: None of the code that I'm reviewing has one letter names in public functions, and internally only if it's a temp variable within a few lines of code. (As a maintainer I try to "memorize" as few things as possible per line of code. That's the only way to keep track of 100,000 lines of code.)

@argriffing
argriffing Apr 18, 2014

I see. It looks like an important part of your system!

@jseabold
jseabold Apr 18, 2014 Member

That's a bit of a straw man. No one is suggesting that we all of a sudden start writing unclear and undocumented code, only that we move away from using a technical term that less than 3% of our target audience (based on sampling each time I give a talk. Last data point 100-150 people for a talk.) for a very, very prominent user-facing variable name.

@josef-pkt
josef-pkt Apr 18, 2014 Member

maybe a bit, but x and y are neither descriptive nor specific names, and I didn't object too strongly to finding other "names" that are not letters.
Besides, I think most of those 97% almost never even need to see "endog" and "exog".

@argriffing
argriffing Apr 18, 2014

Besides, I think most of those 97% almost never even need to see "endog" and "exog".

Aren't they literally the first thing you see in statsmodels docs after the function name? For example linear regression or multinomial logistic regression, unless most of the 97% are not going to look at docs.

@jseabold
jseabold Apr 18, 2014 Member

Do you find their use to be confusing here?

@argriffing
argriffing Apr 18, 2014

Personally I don't find it too confusing anymore, I just need a moment to get re-oriented to the statsmodels notation. But the first time I used statsmodels endog and exog sent me on a half-hour expedition to learn what they were all about. Most people probably would not have spent that long -- they would have either understood more quickly or they would have just moved on to something else.

After reading the discussion in #395 I can see how using endog and exog instead of y and x could help maintainability and development by putting more semantics into the variable names, especially if you are used to that system. I think it's OK that statsmodels has a handful of developers actively contributing and testing research code using eccentric but internally consistent notation, and I assume that other projects will start repackaging the code as it matures and give it a less eccentric interface.

@jseabold
jseabold Apr 18, 2014 Member

Could you add this to #395? It would be good to keep all viewpoints in the same place.

@coveralls

Coverage Status

Coverage remained the same when pulling b5b74a9 on jseabold:add-x12 into d7ff182 on statsmodels:master.

@josef-pkt josef-pkt commented on an outdated diff Apr 18, 2014
statsmodels/tsa/x13.py
+ # call x12 arima
+ p = run_spec(x12path, ftempin.name[:-4], ftempout.name)
+ p.wait()
+ stdout = p.stdout.read()
+ if print_stdout:
+ print(p.stdout.read())
+ # check for errors
+ errors = _open_and_read(ftempout.name + '.err')
+ _check_errors(errors)
+
+ # read in results
+ results = _open_and_read(ftempout.name + '.out')
+ seasadj = _open_and_read(ftempout.name + '.d11')
+ trend = _open_and_read(ftempout.name + '.d12')
+ irregular = _open_and_read(ftempout.name + '.d13')
+ except X13Error, err:
@josef-pkt
josef-pkt Apr 18, 2014 Member

not python 3 syntax use as err we don't need py 2.5 anymore.

BTW: What's the point in catching the exception and then raising it right away again?
try ... finally ... ?

@coveralls

Coverage Status

Coverage remained the same when pulling 8f6b50b on jseabold:add-x12 into d7ff182 on statsmodels:master.

@bashtage
Contributor

endog and exog are especially problematic in the instrumental variable code, when variables in the exog set can be endogenous.

I don't think this is likely to change, by y or Y and X are so standard and broad that pretty much anyone who has been around statistical models will recognize these.

@argriffing argriffing referenced this pull request Apr 19, 2014
Open

endog, exog -> y, x #395

@lminer
lminer commented Sep 10, 2014

This is a fantastic addition. Why is frequency limited to months and quarters? Is it due to a limitation in x13?

@josef-pkt
Member

@lminer I don't find the reference right now at to Census website, but AFAIR the frequency limitation comes from X13 which is used for monthly or quarterly data (unemployment rate, GDP, ....).

@lminer
lminer commented Sep 11, 2014

Is it not possible to just pretend that weekly data is in fact monthly data? Can I do a pull request on a pull request?

@josef-pkt
Member

Can I do a pull request on a pull request?
Yes, PR for the branch that is the base of the PR, i.e. Skipper's

For AR(I)MA you can pretend whatever frequency you want. I don't know the details of the date handling, but I guess you can fake it for X13.
However, I guess there is a limitation on the seasonality, i.e. either 4 periods or 12 periods.
For example, I think daily data with weekly pattern wouldn't be possible to fake.
There should be more information about what's possible in the X12/X13 documentation.

@lminer
lminer commented Sep 11, 2014

I'll take a look at it when I get a chance and see what can be done

@jseabold
Member

I know there are data limitations put in place at compile time on the
number of observations, etc. I also suspect that periodicity is limited to
monthly and quarterly data as Josef mentioned. It could probably be hacked
to work differently but YMMV.

@jseabold
Member

I think this is a candidate for 0.6 too. Probably mark as experimental.

@jseabold jseabold added this to the 0.6 milestone Sep 24, 2014
@josef-pkt
Member

no x and y, it's still endog and exog.
If we ever do a rename, it won't be x and y, and we do it all at once and not start to have inconsistent naming in statsmodels.

I, and Kerby and Frank, spent time trying to get a consistent "generic" structure and names so we can write "Meta" or "Prefix" functions and classes on top of the existing models.

otherwise, IIRC, the PR looked fine and finding x13 wasn't a problem

@jseabold
Member

Well, we can do whatever we want, but I'm not having this conversation again.

@bashtage
Contributor

Is this a GitHub bug? Isn't the x/y conversation half a year old?

@argriffing

Is this a GitHub bug? Isn't the x/y conversation half a year old?

Nah, some of this PR had X which @jseabold is changing to exog, not half a year ago but like half an hour ago in jseabold@e1fff26

I'm not sure of the status of the x/y conversation, I assume stats in python will change to x/y one way or another, possibly involving statsmodels becoming de-facto replaced by some project where the developers are paid to be less eccentric.

@coveralls

Coverage Status

Coverage decreased (-0.37%) when pulling b354f24 on jseabold:add-x12 into e27e5dc on statsmodels:master.

@jseabold
Member

On Wed, Sep 24, 2014 at 3:26 PM, argriffing notifications@github.com
wrote:

Is this a GitHub bug? Isn't the x/y conversation half a year old?

Nah, some of this PR had X which @jseabold https://github.com/jseabold
is changing to exog, not half a year ago but like half an hour ago in
jseabold@e1fff26
jseabold@e1fff26

I'm not sure of the status of the x/y conversation, I assume stats in
python will change to x/y one way or another, possibly involving
statsmodels becoming de-facto replaced by some project where the developers
are paid to be less eccentric.

:/

@bashtage
Contributor

I see. I don't mind either way, but that find/replace seems to have gone awry

"""
If x12path is not given, then either x13as[.exe] or x12a[.exe] must
be found on the PATH. Otherwise, the environmental variable exog12PATH or
X13PATH must be defined. If prefer_x13 is True, only X13PATH is searched
for. If it is false, only exog12PATH is searched for.
"""

exog12PATH

jseabold added some commits Sep 24, 2014
@jseabold jseabold STY: Add extra letters to variable names 3bb7ec6
@jseabold jseabold STY: Pep-8
aa16c75
@jseabold
Member

Should be fixed

@josef-pkt
Member

find/replace seems to have gone awry

which I think proves my point

@argriffing If you want to fork statsmodels over x and y, then you are welcome to it. I won't have to maintain that fork. (You are then also free to use any other letter from the ASCII or greek alphabet.)

@jseabold
Member

On Wed, Sep 24, 2014 at 3:51 PM, Josef Perktold notifications@github.com
wrote:

find/replace seems to have gone awry

which I think proves my point

That I was careless with my regex and only accounted for 13 and not 12?
That's a reason to alienate 98% of users?

@josef-pkt
Member

That I was careless with my regex and only accounted for 13 and not 12?
That's a reason to alienate 98% of users?

I don't think 98% of users are alienated.
However, yes, you are the expert on regular expressions and are careless.
The last time I used a regular expression is several years ago, and for me it wouldn't be carelessness, I just wouldn't know, at least not without spending several hours trying out regexes.

Plus, we need a context sensitive regex in our head whenever we read a piece of code with x in it.

@coveralls

Coverage Status

Coverage decreased (-0.37%) when pulling aa16c75 on jseabold:add-x12 into 57a876d on statsmodels:master.

@jseabold
Member

On Wed, Sep 24, 2014 at 4:14 PM, Josef Perktold notifications@github.com
wrote:

That I was careless with my regex and only accounted for 13 and not 12?
That's a reason to alienate 98% of users?

I don't think 98% of users are alienated.

Again, this is based on the several straw polls I conduct w/1-200 people
each time I give a talk on statsmodels plus every discussion we've had
about this.

However, yes, you are the expert on regular expressions and are careless.

Do you want to get ad hominem now? That'll be even more fun for everyone.

The last time I used a regular expression is several years ago, and for me
it wouldn't be carelessness, I just wouldn't know, at least not without
spending several hours trying out regexes.

Plus, we need a context sensitive regex in our head whenever we read a
piece of code with x in it.

Or style guidelines, code comments, or documentation, but who needs those
right? Let's continue to imagine the horrors that could happen if people
take things to the sloppy extreme.

@jseabold jseabold merged commit 61d3654 into statsmodels:master Sep 24, 2014

2 checks passed

continuous-integration/appveyor AppVeyor build succeeded
Details
continuous-integration/travis-ci The Travis CI build passed
Details
@jseabold jseabold deleted the jseabold:add-x12 branch Sep 24, 2014
@argriffing

If you want to fork statsmodels over x and y, then you are welcome to it. I won't have to maintain that fork. (You are then also free to use any other letter from the ASCII or greek alphabet.)

Honestly I have only appreciation for the work that you guys put into making this free software, and this notation does not matter to me much at all. If I ever fork statsmodels over notation it will be because I want the variables named likelihood to actually be a likelihood, not a log likelihood or a negative log likelihood or penalized negative log likelihood or a loss function unrelated to likelihoods in any way :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment