CLN: Refactored so that there is no longer a need for 2to3 #1520

Merged
merged 31 commits into from Apr 10, 2014

5 participants

@bashtage
Contributor

The codebase has been refactored so that it can exist in a single, unified base.
The strategy closely followed six and pandas.compat, but does not involve any further
dependencies.

The compatibility location has been renamed from compatnp, which housed compatnp.py3k, to
just compat, and the main py3k compatibility files are in __init__.py so that they can
be directly accessed.

@josef-pkt
Member

oh no, why did you merge all commits.

It's difficult to see which are innocent changes, and which need closer review

https://github.com/scipy/scipy/pull/397/commits

@jseabold jseabold commented on an outdated diff Mar 26, 2014
statsmodels/api.py
+from statsmodels.genmod.generalized_estimating_equations import GEE
+from statsmodels.genmod import families
+import statsmodels.robust as robust
+from statsmodels.robust.robust_linear_model import RLM
+from statsmodels.discrete.discrete_model import (Poisson, Logit, Probit,
+ MNLogit, NegativeBinomial)
+from statsmodels.tsa import api as tsa
+from statsmodels.nonparametric import api as nonparametric
+import statsmodels.distributions as distributions
+from statsmodels.__init__ import test
+from statsmodels import version
+from statsmodels.info import __doc__
+from statsmodels.graphics.gofplots import qqplot, qqplot_2samples, qqline, ProbPlot
+from statsmodels.graphics import api as graphics
+from statsmodels.stats import api as stats
+from statsmodels.emplike import api as emplike
@jseabold
jseabold Mar 26, 2014 Member

Are these changes strictly necessary? Explicit relative imports are a little easier to read IMO.

@bashtage
Contributor

They are all innocent :)

On Mar 26, 2014 7:33 PM, Josef Perktold notifications@github.com wrote:

oh no, why did you merge all commits.

It's difficult to see which are innocent changes, and which need closer review

https://github.com/scipy/scipy/pull/397/commits



@jseabold
Member

FWIW, pandas is the only package I contribute to with the squash everything and let god sort 'em out philosophy. I find it to be pretty much awful, but I (try to) pick my battles over there. I know other non-pandas devs are also surprised when they're asked to squash everything in a PR over there. "But version control and logs are...useful...?"

We tend to prefer more commits even if some are meandering to squashing all of them. That said, I don't know which case this falls into. I don't mind so much here just browsing so far, but Josef ends up doing the (vast) majority of the outside code review, so I usually defer to him.

@jseabold
Member

If it's dire, I think you can recover the unsquashed commits with git reflog, but I'd do it on a temporary local branch and be very sure first.

@jseabold
Member

This is a huge effort, though. Thanks for taking it on.

Right now I'm thinking about things like lrange, lmap, and cPickle. I'm almost never going to remember to use these, though I know travis wouldn't let me forget.

Would it be more desirable to just add a list call whenever we really need a list back vs. when xrange e.g. would do the job? I didn't look to see if this is already the case or if you replaced all map/range/etc.

For cPickle, should we alias pickle to cPickle in 2.7, so that it's the odd person out, or should we stick with the slower pickle in 2.x and not worry about special casing it. I think I vote for the latter.

@jseabold
Member

Keep in mind that I'm pretty much exclusively using 2.7 for at least the near future.

@jseabold
Member

I also wonder if we just want to use six or if there's a reason we haven't been?

@josef-pkt
Member

There are two problems with squashing too much:

  • for immediate code review: I cannot review every line in detail, so I need to screen what will most likely be innocent, cosmetic, style changes, and what are the critical parts. I got quite good at spotting this, but it is impossible to look at some details if they are hidden in a few hundred or thousand lines of changes.
  • history: If we have a problem that shows up later, we often need to go back to understand why or where it was introduced. If we don't have nice commits and commit messages, then this is a needle in a haystack. blame often helps but is difficult across code moves.

commits should be logical units, but if selective squashing is too difficult (I often don't try), then I'd rather have too many than too few commits. Since commits are bunched together in merges, many commits still don't pollute the main history line of master.

@josef-pkt
Member

I also wonder if we just want to use six or if there's a reason we haven't been?

I don't think we gain anything using six. Most of what I've seen in this PR looks good, close to python 3. scipy has deleted most of the six module.

@jseabold
Member

And for the above, I'm suggesting to just use range everywhere we're using xrange and just accept that it's more memory efficient in Python 3. Stick/carrot and all that.

@bashtage
Contributor

There is surprisingly little surprising about porting to a single code base. The biggest surprise was that it has to be done all at once - or has to involve an external dependency like six. This happens because running 2to3 on compat mungs (http://en.wikipedia.org/wiki/Mung) it.

I think the only challenges I came across were (a) circular import issues, which took a while but were ultimately easy to avoid and (b) print >> buf, something which I had never seen and for which Google was no help. Thinking about it and seeing some other code examples (but no explanation) indicated that this was just an old shorthand for buf.write(str(something) + '\n'), which is what redirecting print to buf would do.
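
The equivalence described above can be sketched as follows. This is only an illustration: buf and something are placeholder names, and the print function's file argument stands in for the old print >> buf statement.

```python
from io import StringIO

something = 42

# Python 2 statement form (not valid Python 3 syntax):
#     print >> buf, something

# Portable replacement: write the string form plus a newline.
buf_write = StringIO()
buf_write.write(str(something) + '\n')

# Function form (Python 3, or Python 2 with
# `from __future__ import print_function`).
buf_print = StringIO()
print(something, file=buf_print)

# Both buffers now hold the same text.
same = buf_write.getvalue() == buf_print.getvalue()
```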

FWIW, pandas is the only package I contribute to with the squash everything and let god sort 'em out philosophy. I find it to be pretty much awful, but I (try to) pick my battles over there. I know other non-pandas devs are also surprised when they're asked to squash everything in a PR over there. "But version control and logs are...useful...?"

We tend to prefer more commits even if some are meandering to squashing all of them. That said, I don't know which case this falls into. I don't mind so much here just browsing so far, but Josef ends up doing the (vast) majority of the outside code review, so I usually defer to him.

@bashtage
Contributor

Code that uses list(zip()) and list(range()) will obviously work correctly without issue, although range should be explicitly imported from compat to ensure that xrange is used on 2.x and range is used on 3k.

In most cases with numerical code, the difference between range and xrange is small.
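
A minimal sketch of what such compat helpers might look like. The names range, lrange, lzip and lmap come from the thread, but the bodies below are an assumption for illustration and are not copied from this PR.

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    # range is already lazy on Python 3; the l* helpers force a list.
    def lrange(*args):
        return list(range(*args))

    def lzip(*args):
        return list(zip(*args))

    def lmap(func, *args):
        return list(map(func, *args))
else:
    import __builtin__
    # On Python 2 the builtins already return lists, so no extra
    # list() call is needed; range is pointed at the lazy xrange.
    range = xrange  # noqa: F821 (exists on Python 2 only)
    lrange = __builtin__.range
    lzip = __builtin__.zip
    lmap = __builtin__.map

squares = lmap(lambda i: i * i, range(4))
pairs = lzip('ab', lrange(2))
```

With helpers like these, code that needs a real list says so through the l prefix, and everything else stays lazy on both major versions.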

This is a huge effort, though. Thanks for taking it on.

Right now I'm thinking about things like lrange, lmap, and cPickle. I'm almost never going to
remember to use these, though I know travis wouldn't let me forget.

Would it be more desirable to just add a list call whenever we really need a list back vs. when xrange e.g. would do the job? I didn't look to see if this is already the case or if you replaced all map/range/etc.

For cPickle, should we alias pickle to cPickle in 2.7, so that it's the odd person out, or should we stick with the slower pickle in 2.x and not worry about special casing it. I think I vote for the latter.

@jseabold
Member

print >> buf is a Wes McKinney-ism AFAIK. I've never seen anyone else use it.

@bashtage
Contributor

In seriousness, the commits were not done in any particular order, mostly working through the issues I saw in the 2to3 report until there were no more actual issues.

oh no, why did you merge all commits.

It's difficult to see which are innocent changes, and which need closer review

@bashtage
Contributor

Given the rarity of that construct, it seems like it should be avoided.

print >> buf is a Wes McKinney-ism AFAIK. I've never seen anyone else use it.

@bashtage
Contributor

You can blame @jreback for my commit style - I do sort of like it. Certainly more than my natural semi-random or too-frequent commit style.

FWIW, pandas is the only package I contribute to with the squash everything and let god sort 'em out philosophy. I find it to be pretty much awful, but I (try to) pick my battles over there. I know other non-pandas devs are also surprised when they're asked to squash everything in a PR over there. "But version control and logs are...useful...?"

@bashtage
Contributor

Easy to roll back some, but then you have a mix of absolute import statsmodels.blah and from .blah import blerg. I find the all absolute version to be slightly easier to read.

Are these changes strictly necessary? Explicit relative imports are a little easier to read IMO.

@bashtage
Contributor

The biggest difficulty with this PR is that it will require rebasing anything else in the queue since all non-trivial files were touched. That said, while going through the code, I noted many examples of

(a) non-imported code being called (although some of this was determined by PyCharm, which isn't 100% reliable)
(b) modules called which are not in requirements
(c) print functions in actual code serving as warnings

These should all be eliminated in the near future.

@jseabold
Member

Thanks. We had an unwritten, though I believe spoken, rule of explicit
relative imports in api.py files and full paths in test suites and in the
code.

@bashtage
Contributor

The problem of unwritten rules is that they are ...

@jseabold
Member

Sure, though it was 'spoken' about on the mailing list many moons ago. It's been mainly two of us for a while, so not a large need to write everything down. Maybe it should go in the dev docs or we can revisit these decisions.

@jseabold
Member

I know many things that I knew long ago to be completely and utterly wrong these days.

@jreback
Contributor
jreback commented Mar 26, 2014

squash bashing should be directed to @wesm :)

@josef-pkt
Member

about imports
using relative imports within a directory and absolute imports across directories is by far the easiest way to stay out of circular import problems IMO

@josef-pkt
Member

about lzip, lrange and similar:
I like them for now: They show that they were changed for py 2 py 3 compatibility, and can be changed again on an individual basis.

for example: dict(lzip(...)) doesn't need list, but I haven't checked and don't remember for python 2.6. In the long term I find np.array(list(zip(...))) easier to understand than using lzip (explicit is better than ...)
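
A small sketch of the distinction made here, with made-up keys and values. dict() accepts any iterable of pairs, while code that indexes the result needs a real list on Python 3. NumPy is left out to keep the example dependency-free.

```python
keys = ['a', 'b', 'c']
vals = [1, 2, 3]

# dict() consumes any iterable of pairs, so no list() is needed.
d = dict(zip(keys, vals))

# Indexing needs a real sequence; on Python 3 zip() returns an
# iterator, so the explicit list() is required here.
pairs = list(zip(keys, vals))
first = pairs[0]
```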

@bashtage
Contributor

Right now I'm thinking about things like lrange, lmap, and cPickle. I'm almost never going to remember to use these, though I know travis wouldn't let me forget.

I think using pickle everywhere is reasonable - in fact, using things that look more like Python3 probably makes sense. As for the l functions - lzip, lmap and lrange - the advantage of using these is that they avoid extra calls to list on 2.x - although in most applications this is not a big deal.

@coveralls

Coverage Status

Coverage remained the same when pulling 97351b3 on bashtage:remove-2to3 into e397e42 on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 5113bed on bashtage:remove-2to3 into e397e42 on statsmodels:master.

@coveralls

Coverage Status

Changes Unknown when pulling b887e2f on bashtage:remove-2to3 into * on statsmodels:master*.

@coveralls

Coverage Status

Coverage remained the same when pulling 185b5db on bashtage:remove-2to3 into b0e5b41 on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 219264c on bashtage:remove-2to3 into 2762065 on statsmodels:master.

@jseabold
Member
jseabold commented Apr 3, 2014

Should we go ahead and bite the bullet and merge this? I don't mind doing some of the rebasing work for existing PRs. We might lose green button ability on a lot of them.

@bashtage
Contributor
bashtage commented Apr 3, 2014

Rebased

@jseabold
Member
jseabold commented Apr 3, 2014

Thanks. I might ask you to do that again after a few more spring cleaning merges, if you don't mind... Otherwise, we should merge this and I should cleanup the other PRs.

@bashtage
Contributor
bashtage commented Apr 3, 2014

One way or the other. I think there are about 20 files this PR doesn't touch.

@jseabold
Member
jseabold commented Apr 3, 2014

Ok, let me do a bit more small cleanup merges, then we'll see about having you do one more rebase, if necessary, and we'll merge this. For the rest of the outstanding PRs, I'll do the rebasing as we merge. Have you observed (m)any merge conflicts or is git handling them ok? Sometimes I'm amazed what it can suss out automatically.

@bashtage
Contributor
bashtage commented Apr 3, 2014

The typical rebase has 2 or 3 conflicts. Most are obvious and have to do with the compat or __future__ imports.

@coveralls

Coverage Status

Coverage remained the same when pulling 07430d2 on bashtage:remove-2to3 into ec765bb on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 4e1de69 on bashtage:remove-2to3 into 067b41f on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 1cae1f7 on bashtage:remove-2to3 into bed3499 on statsmodels:master.

@jseabold
Member
jseabold commented Apr 5, 2014

Ok what do you think? Rebase and merge now? Likely give up green button for a bit or will need to ask authors to rebase all PRs / do it ourselves and open new ones when merging.

@bashtage
Contributor
bashtage commented Apr 5, 2014

Depends on whether you have any that are close - very likely to require rebase, and of course Python 3 direct compatibility, which most PRs don't currently have.

I'm happy for this to go ahead once I get Travis to pass - a little non Python 3 code in the last couple of patches.

@jseabold
Member
jseabold commented Apr 5, 2014

I'm mostly worried about third-party PRs. I tried to get in the ones that were close Thursday and pinged the other ones. @josef-pkt what do you think?

I'm also fine to let this sit until we hear back from a few PRs, but we're going to have bit rot one way or the other.

@josef-pkt
Member

I'd rather not rush this in. But I also don't want to wait for long, now that it is available.

We should check at least some of the larger pull requests to see how difficult the rebasing will be.
One candidate is facet plot, which I think is the most difficult one in terms of python 2,3 compatibility.

For most of the other PRs I expect only small adjustments, but I don't like merge conflicts early in a rebase that cause follow-up conflicts over many commits.

@coveralls

Coverage Status

Coverage remained the same when pulling f09df51 on bashtage:remove-2to3 into bed3499 on statsmodels:master.

@jseabold
Member
jseabold commented Apr 5, 2014

Hmm, facetplot is a tough one, regardless. We're going to have to just make some decisions there about unicode if we want to merge it. Given the comments about rebasing this after merging several PRs, I don't think it'll be too difficult. The question is just whether this sits while we merge or we merge this and deal with the merge conflicts / small compat issues.

@bashtage
Contributor
bashtage commented Apr 5, 2014

It can sit for a while - I've been rebasing it pretty much any time I see a couple of commits to master so that it doesn't become un-rebasable. I would say it automatically rebases 60% of the time - but there are almost always some 2-vs-3 compat issues.

@josef-pkt
Member

One difference, based on my experience:
If you rebase this on top of master, then there are only very few commits that touch the same file and cause merge conflicts.
If you rebase a branch that has many commits for the same parts, then the same merge conflict can show up many times.

Some files didn't get changed much in this PR. Rebasing will not be much of a problem with PRs that change those files. (I haven't checked yet.)

@bashtage
Contributor
bashtage commented Apr 5, 2014

I have seen that a couple of times - I suppose I modified a file in a couple of commits.

I agree that it is easier to rebase this on top of most PRs - in fact, it would be easiest to rebase if this PR was squashed to a single commit - then there would be at most 1 merge conflict per file per rebase.

@josef-pkt
Member

Don't squash before I have looked at it some more :)

examples
#1433 should be easy; Paul also squashed into two commits, and only two lines are changed here.
I guess the GEE changes should go in before this.
PR's that add new code like Panel and Mixed will most likely be easy to rebase.

@josef-pkt
Member

facetplot actually shouldn't be affected much by what goes in first, since it's almost all new code.
in robust there are also only very few changes, so it shouldn't create many merge conflicts for my PR

@coveralls

Coverage Status

Coverage remained the same when pulling df57681 on bashtage:remove-2to3 into 8285516 on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 957b6ee on bashtage:remove-2to3 into a738b4f on statsmodels:master.

@josef-pkt
Member

Why did you move a lot of code into the compat.__init__.py?
I like to keep the __init__.py essentially empty so we can have specific imports, and avoid most possibilities for circular imports.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Apr 9, 2014
statsmodels/api.py
@@ -1,22 +1,24 @@
-import iolib, datasets, tools
-from tools.tools import add_constant, categorical
-import regression
+import statsmodels.iolib as iolib
+import statsmodels.datasets as datasets
+import statsmodels.tools as tools
+from .tools.tools import add_constant, categorical
+import statsmodels.regression as regression
@josef-pkt
josef-pkt Apr 9, 2014 Member

why not from . import regression ?
that's how 2to3 translated it.

@bashtage
bashtage Apr 10, 2014 Contributor

Switched all these - this was the first file changed for obvious reasons, and I wasn't fully up to speed on relative/absolute imports.

@josef-pkt josef-pkt commented on the diff Apr 9, 2014
statsmodels/compat/ordereddict.py
@@ -165,12 +165,12 @@ def update(*args, **kwds):
for key in other:
self[key] = other[key]
elif hasattr(other, 'keys'):
- for key in other.keys():
+ for key in iterkeys(other):
@josef-pkt
josef-pkt Apr 9, 2014 Member

Since the old version didn't use iterkeys(), I would have left it unchanged which would still be correct py2 and py3 with more efficient py3.

We are using ordereddict only for small dictionaries.

@bashtage
bashtage Apr 10, 2014 Contributor

This doesn't really matter since this is only used by Python 2.6, however prevents noise when looking through 2to3 for changes, and so simplifies things.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Apr 9, 2014
statsmodels/base/tests/test_shrink_pickle.py
fh = BytesIO() # use cPickle with binary content
# test unwrapped results load save pickle
self.results._results.save(fh)
fh.seek(0, 0)
res_unpickled = self.results._results.__class__.load(fh)
- assert_(type(res_unpickled) is type(self.results._results))
+ assert_(isinstance(res_unpickled, type(self.results._results)))
@josef-pkt
josef-pkt Apr 9, 2014 Member

I think the previous version type is type is stronger, it rules out subclasses
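
The difference between the two checks can be shown with a small sketch. The class names are hypothetical and only illustrate the subclass case.

```python
class Results(object):
    """Stand-in for a results class (hypothetical)."""


class WrappedResults(Results):
    """Stand-in for a subclass of the results class (hypothetical)."""


obj = WrappedResults()

# isinstance accepts subclasses of the expected type.
loose = isinstance(obj, Results)

# An identity check on type() demands the exact class.
strict = type(obj) is Results
```

Here loose is True and strict is False, so a test that must reject subclasses needs the type(...) is type(...) form.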

@bashtage
bashtage Apr 10, 2014 Contributor

This was me being aggressive in clean up - perhaps too aggressive.

@josef-pkt josef-pkt commented on the diff Apr 9, 2014
statsmodels/base/tests/test_shrink_pickle.py
@@ -209,7 +206,7 @@ def setup(self):
TestRemoveDataPickleNegativeBinomial,
TestRemoveDataPickleLogit, TestRemoveDataPickleRLM,
TestRemoveDataPickleGLM]:
- print cls
+ print(cls)
@josef-pkt
josef-pkt Apr 9, 2014 Member

stray print ?

@bashtage
bashtage Apr 10, 2014 Contributor

In __main__

@josef-pkt josef-pkt and 1 other commented on an outdated diff Apr 10, 2014
statsmodels/compat/__init__.py
@@ -0,0 +1,234 @@
+"""
+Python 3 compatibility tools.
+
+"""
@josef-pkt
josef-pkt Apr 10, 2014 Member

functions and function names look good overall

I would make urlxxx lazy, there is no need to import it almost all the time

for usage: I wouldn't bother most of the time about iterkeys() In most of our cases dictionary only have a few keys, and py3 keys() doesn't require a helper function and only has a small cost in py2.

@bashtage
bashtage Apr 10, 2014 Contributor

I'm not sure how to do this using this file structure since __init__ is processed once and for all.
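
One possible way to make such a helper lazy, offered only as a sketch and not as what this PR does: wrap the import in a function so the urllib machinery is loaded on first call and not when the package is imported. The wrapper below is an assumption for illustration.

```python
def urlopen(*args, **kwargs):
    """Import the real urlopen only when it is first needed."""
    try:
        from urllib.request import urlopen as _urlopen  # Python 3
    except ImportError:
        from urllib2 import urlopen as _urlopen  # Python 2
    return _urlopen(*args, **kwargs)
```

The trade-off is a small import lookup on every call, which is negligible next to a network request.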

The consistent use of iterkeys, iteritems and itervalues is to be explicit -- using one or the other makes for generally ugly code since sometimes an iterator is fine but in others a list is needed. So list(iterkeys(some_dict)) makes clear what is required, while some_dict.keys() does not.
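
A sketch of how helpers of this kind could be written. The function names follow the thread, but the bodies are an assumption modelled on the six-style approach and are not taken from this PR.

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    def iterkeys(d):
        return iter(d.keys())

    def itervalues(d):
        return iter(d.values())

    def iteritems(d):
        return iter(d.items())
else:
    def iterkeys(d):
        return d.iterkeys()

    def itervalues(d):
        return d.itervalues()

    def iteritems(d):
        return d.iteritems()

some_dict = {'a': 1, 'b': 2}

# An iterator is enough for a reduction...
total = sum(itervalues(some_dict))

# ...while list(...) states explicitly that a list is required.
keys_as_list = list(iterkeys(some_dict))
```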

@josef-pkt josef-pkt commented on the diff Apr 10, 2014
statsmodels/emplike/tests/test_regression.py
@@ -146,10 +144,12 @@ def test_ci_beta1(self):
beta1ci = self.res1.conf_int_el(1, method='nm')
assert_almost_equal(beta1ci, self.res2.test_ci_beta1, 6)
+ @slow
@josef-pkt
josef-pkt Apr 10, 2014 Member

why did you change the slow here?
nm should be better than powell, IIRC

@josef-pkt josef-pkt and 1 other commented on an outdated diff Apr 10, 2014
statsmodels/discrete/tests/test_discrete.py
@@ -8,7 +8,8 @@
tests.
"""
# pylint: disable-msg=E1101
-
+from statsmodels.compat import range
+from .results.results_discrete import RandHIE, Anes
@josef-pkt
josef-pkt Apr 10, 2014 Member

general python convention: local or within-package imports come last.
import python, numpy, scipy, .... statsmodels
statsmodels.compat is an exception, I think, should count as python

@bashtage
bashtage Apr 10, 2014 Contributor

That one was redundant.

I have been trying to leave imports alone, with the exception of
from __future__ always being first, since it has to be, compat always being second, followed by compat.submodule.

@bashtage
bashtage Apr 10, 2014 Contributor

I agree that it is helpful for compat to be special since it should be removed someday.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Apr 10, 2014
statsmodels/distributions/empirical_distribution.py
@@ -1,6 +1,7 @@
"""
Empirical CDF Functions
"""
+from statsmodels.compat import urlopen
@josef-pkt
josef-pkt Apr 10, 2014 Member

import urlopen in if __name__ == "__main__"
it's not used by the functions in the module

@bashtage
bashtage Apr 10, 2014 Contributor

Rolled back

@josef-pkt josef-pkt commented on an outdated diff Apr 10, 2014
statsmodels/examples/run_all.py
@@ -24,7 +26,7 @@
import glob
filelist = glob.glob('*.py')
-print zip(range(len(filelist)), filelist)
+print(lzip(lrange(len(filelist)), filelist))
@josef-pkt
josef-pkt Apr 10, 2014 Member

lrange has redundant l
not that it matters

@josef-pkt josef-pkt commented on the diff Apr 10, 2014
statsmodels/examples/run_all.py
This is done mainly to check that they are up to date.
(y/n) >>> """)
+has_errors = []
@josef-pkt
josef-pkt Apr 10, 2014 Member

we only need the has_errors inside the if

@bashtage
bashtage Apr 10, 2014 Contributor

has_errors appears below in

print('\nModules that raised exception:')
print(has_errors)

and so must be defined so that if there are no errors, it still works.
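
The point can be sketched with a reduced version of the script's control flow. The function, its arguments and the fake runner are hypothetical; only the has_errors name and the two print calls come from the thread.

```python
def run_all(filelist, run, runner):
    # Defined before the branch so the report below works even
    # when nothing is run.
    has_errors = []
    if run:
        for name in filelist:
            try:
                runner(name)
            except Exception:
                has_errors.append(name)
    print('\nModules that raised exception:')
    print(has_errors)
    return has_errors


def _fake_runner(name):
    # Pretend one example script fails.
    if name == 'bad.py':
        raise RuntimeError(name)


skipped = run_all(['ok.py', 'bad.py'], run=False, runner=_fake_runner)
failed = run_all(['ok.py', 'bad.py'], run=True, runner=_fake_runner)
```

If has_errors were created only inside the if branch, the run=False call would raise a NameError at the final print.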

@josef-pkt josef-pkt commented on the diff Apr 10, 2014
statsmodels/genmod/dependence_structures/covstruct.py
ix = cpp1[ky]
tables[ky][1, 1] += emat_11[ix[:, 0], ix[:, 1]].sum()
tables[ky][1, 0] += emat_10[ix[:, 0], ix[:, 1]].sum()
tables[ky][0, 1] += emat_01[ix[:, 0], ix[:, 1]].sum()
tables[ky][0, 0] += emat_00[ix[:, 0], ix[:, 1]].sum()
- cor_expval = self.pooled_odds_ratio(tables.values())
+
+ cor_expval = self.pooled_odds_ratio(list(itervalues(tables)))
@josef-pkt
josef-pkt Apr 10, 2014 Member

just list(tables.values()) should do it, is plain py3 (checked on py 3.3) and list is noop on py2

@bashtage
bashtage Apr 10, 2014 Contributor

Another measure to reduce the noise produced by 2to3 checks.

@josef-pkt josef-pkt commented on the diff Apr 10, 2014
statsmodels/genmod/dependence_structures/covstruct.py
@@ -647,7 +648,7 @@ def observed_crude_oddsratio(self, parent):
# Storage for the contingency tables for each (c,c')
tables = {}
- for ii in cpp[0].keys():
+ for ii in iterkeys(cpp[0]):
@josef-pkt
josef-pkt Apr 10, 2014 Member

I guess ok, I have no idea how large cpp is, even if iterkeys() wasn't used before

@josef-pkt josef-pkt commented on the diff Apr 10, 2014
statsmodels/genmod/generalized_estimating_equations.py
@@ -22,7 +22,7 @@
LA Mancl LA, TA DeRouen (2001). A covariance estimator for GEE with
improved small-sample properties. Biometrics. 2001 Mar;57(1):126-34.
"""
-
@josef-pkt
josef-pkt Apr 10, 2014 Member

I guess there will be merge conflicts with this module, still ongoing refactoring, new PR is in master.

We should have this PR in before merging the next round of GEE changes.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Apr 10, 2014
statsmodels/graphics/tests/test_mosaicplot.py
@@ -79,7 +81,7 @@ def test_mosaic_simple():
# the cartesian product of all the categories is
# the complete set of categories
keys = list(product(*key_set))
- data = OrderedDict(zip(keys, range(1, 1 + len(keys))))
+ data = OrderedDict(zip(keys, lrange(1, 1 + len(keys))))
@josef-pkt
josef-pkt Apr 10, 2014 Member

lrange not needed, zip works with iterator, I think

@bashtage
bashtage Apr 10, 2014 Contributor

Seems to work - was worried that it wouldn't work mixing list and iterator.
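
A quick check of this, using made-up category names: zip accepts any mix of iterables, so a list of keys can be paired with a lazy range directly.

```python
from collections import OrderedDict
from itertools import product

# Hypothetical categories, for illustration only.
key_set = [['male', 'female'], ['old', 'young']]

keys = list(product(*key_set))

# keys is a list and range(...) is lazy on Python 3; zip handles both.
data = OrderedDict(zip(keys, range(1, 1 + len(keys))))
```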

@josef-pkt josef-pkt commented on an outdated diff Apr 10, 2014
statsmodels/graphics/tests/test_mosaicplot.py
@@ -152,7 +154,7 @@ def test_mosaic_very_complex():
fig, axes = pylab.subplots(L, L)
for i in range(L):
for j in range(L):
- m = set(range(L)).difference(set((i, j)))
+ m = set(lrange(L)).difference(set((i, j)))
@josef-pkt
josef-pkt Apr 10, 2014 Member

set also works from iterator, lrange not needed

@josef-pkt josef-pkt commented on the diff Apr 10, 2014
statsmodels/graphics/tests/test_mosaicplot.py
eq(res[('a',)], (0, 0, 0.5, 1))
eq(res[('b',)], (0.5, 0, 0.5, 1))
# subdivide a in two sublevel
res_bis = _key_splitting(res, ['c', 'd'], [1, 1], ('a',), False, 0)
- assert_(res_bis.keys() == [('a', 'c'), ('a', 'd'), ('b',)])
+ assert_(list(iterkeys(res_bis)) == [('a', 'c'), ('a', 'd'), ('b',)])
@josef-pkt
josef-pkt Apr 10, 2014 Member

I think all these should be just list(res_bis.keys())

@bashtage
bashtage Apr 10, 2014 Contributor

Done for 2to3 noise purposes.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Apr 10, 2014
statsmodels/iolib/foreign.py
@@ -357,7 +357,7 @@ def variables(self):
"""
Returns a list of the dataset's StataVariables objects.
"""
- return map(_StataVariable, zip(range(self._header['nvar']),
+ return lmap(_StataVariable, lzip(lrange(self._header['nvar']),
@josef-pkt
josef-pkt Apr 10, 2014 Member

I guess lzip not needed iterator inside (l)map

@bashtage
bashtage Apr 10, 2014 Contributor

Removed

@josef-pkt josef-pkt commented on the diff Apr 10, 2014
statsmodels/iolib/foreign.py
@@ -534,7 +534,7 @@ def _next(self):
if self._has_string_data:
data = [None]*self._header['nvar']
for i in range(len(data)):
- if type(typlist[i]) is int:
+ if isinstance(typlist[i], int):
@josef-pkt
josef-pkt Apr 10, 2014 Member

no idea if it matters, isinstance allows subclasses
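
One concrete case where the two checks disagree for int is bool, which is a subclass of int. Whether that can occur in typlist is not settled in the thread; the sketch only shows the general difference.

```python
flag = True

# bool is a subclass of int, so isinstance accepts it.
accepts_subclass = isinstance(flag, int)

# The exact-type check rejects it.
exact_match = type(flag) is int
```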

@josef-pkt josef-pkt commented on the diff Apr 10, 2014
statsmodels/iolib/summary.py
@@ -780,11 +778,11 @@ def summary_return(tables, return_fmt='text'):
if return_fmt == 'text':
strdrop = lambda x: str(x).rsplit('\n',1)[0]
#convert to string drop last line
- return '\n'.join(map(strdrop, tables[:-1]) + [str(tables[-1])])
+ return '\n'.join(lmap(strdrop, tables[:-1]) + [str(tables[-1])])
@josef-pkt
josef-pkt Apr 10, 2014 Member

join takes an iterable, lmap redundant, map is fine
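
One caveat for this particular line: the map result is concatenated with a list before join sees it, and on Python 3 a map object does not support +. The sketch below uses made-up table strings and shows two forms that work on both versions; itertools.chain is a suggested alternative, not something from this PR.

```python
from itertools import chain

# Hypothetical stand-ins for the summary tables.
tables = ['t1 line\nt1 last', 't2 line\nt2 last', 't3 line']
strdrop = lambda x: str(x).rsplit('\n', 1)[0]

# A bare map cannot be added to a list on Python 3.
try:
    map(strdrop, tables[:-1]) + [str(tables[-1])]
    concat_ok = True
except TypeError:
    concat_ok = False

# Forcing a list first (what lmap does) keeps the concatenation valid.
with_list = '\n'.join(list(map(strdrop, tables[:-1])) + [str(tables[-1])])

# chain avoids the intermediate list entirely.
with_chain = '\n'.join(chain(map(strdrop, tables[:-1]), [str(tables[-1])]))
```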

@josef-pkt josef-pkt commented on the diff Apr 10, 2014
statsmodels/iolib/summary2.py
@@ -78,9 +78,11 @@ def add_dict(self, d, ncols=2, align='l', float_format="%.4f"):
Data alignment (l/c/r)
'''
- keys = [_formatter(x, float_format) for x in d.keys()]
- vals = [_formatter(x, float_format) for x in d.values()]
- data = np.array(zip(keys, vals))
+ keys = [_formatter(x, float_format) for x in iterkeys(d)]
+
+
+ vals = [_formatter(x, float_format) for x in itervalues(d)]
@josef-pkt
josef-pkt Apr 10, 2014 Member

iterkeys and itervalues are noise, not worth the bother for py2 and changing it again sometime in future

@bashtage
bashtage Apr 10, 2014 Contributor

Done to reduce noise from 2to3. Even with all of the changes, and disabling future and print, it still produces around 1000 lines of output (down from 30,000 originally).

@josef-pkt josef-pkt commented on the diff Apr 10, 2014
statsmodels/iolib/summary2.py
@@ -78,9 +78,11 @@ def add_dict(self, d, ncols=2, align='l', float_format="%.4f"):
Data alignment (l/c/r)
'''
- keys = [_formatter(x, float_format) for x in d.keys()]
- vals = [_formatter(x, float_format) for x in d.values()]
- data = np.array(zip(keys, vals))
+ keys = [_formatter(x, float_format) for x in iterkeys(d)]
+
+
+ vals = [_formatter(x, float_format) for x in itervalues(d)]
+ data = np.array(lzip(keys, vals))
@josef-pkt
josef-pkt Apr 10, 2014 Member

could just transpose instead of zip

@josef-pkt
Member

(out of time for now)

summary so far

I don't like the file structure in compat much, but haven't thought yet about alternative similar to before

I think iterkeys and iteritems are largely redundant, but I don't think we need to change them back from the way it is (ignoring my in-code comments). They will be easy to grep later.
But I don't think we need to follow a strict pattern like in this PR for future changes and code.

Some extra usage of lzip, lrange, ...., in this PR that could be removed.

Overall a lot of work and looks good. Thanks Kevin.

I need to go through the rest, but I think we can merge next week (after pycon),
and then start to sort out and rebase pull requests, and old branches.

@bashtage
Contributor

I will defend the structure of compat. The old structure was problematic for two reasons

  • It was named compatnp which is probably legacy from early days when it was just for numpy
  • The py3k module is not correctly named when using an integrated code base since it provides compatibility across all covered versions of Python

The main purpose of the module is to produce Python compatibility and so this is done in the main module and not submodules. Of course, this could be done using a different file structure, but this seems unnecessary.

I think the nested structure with submodules numpy and scipy is especially clear, and later pandas, patsy or other dependencies can be added to provide backward/forward compatibility.

I don't think circular imports are an issue here since this is more-or-less a once-and-for-all change. There may be Python 3.x features that need to be backported, but this is about the only scenario that I could see being an issue.

For iterkeys and its two cousins, this was mostly done to reduce 2to3 unnecessary output, which I have been regularly checking as I rebase. It still produces a non-trivial output, even with future and print disabled (with these it is hopeless).

@josef-pkt
Member

What's the noise that you are referring to? I didn't run this branch yet.

About module structure:

scipy tradition, what we are trying to do is to build a hierarchy of files, essentially a tree of modules where the import structure is easy to understand.

some basic tools and helper functions like compat3k should be at the bottom, which avoids importing the __init__.py like here:
https://github.com/statsmodels/statsmodels/pull/1520/files#diff-488ac74512611828cae8ff295e9e37fcR10
(I didn't go through all the imports yet.)

statsmodels.compat is for all version compatibility problems, not just python 2 versus python 3. That's why the original name was compatnp, because I only expected to handle different numpy versions.

compat.py3k was intended for specific transition 2 to 3 compatibility, not for general python compatibilities. Having a separate module makes it easier to see what's specific to the 2 to 3 transition. The other python compatibility problems go away as the version within py 2 or py3 increases. for example the backports for python 2.5 (obsolete) and 2.6. I haven't looked closely yet at python 3.2, 3.3, 3.4, because we don't use any specific features yet.

compat.numpy.__init__.py and the same for scipy have redundant directory layers since they can be put into modules with the same name. Because the code is currently in the __init__.py, it wouldn't be possible to build a hierarchy of modules inside the compat.numpy directory. I don't think that this part will ever get large enough to "deserve" their own directories instead of one to three modules.

@josef-pkt
Member

note:
we could keep compat py3k in the compat namespace directly (in contrast to what we had before). However, we can import them in compat.__init__.py from a py3k module, and then all other modules in statsmodels.compat could import the local module from .py3k import filter, ..... without having to import the __init__.py

@bashtage
Contributor

What's the noise that you are referring to? I didn't run this branch yet.

The noise in the output of 2to3.py, which needs to be run to detect changes in un-run code. While it is more correct than not, it makes a non-trivial number of mistakes, like replacing dict(zip with dict(list(zip.

But there is another reason to import things like iteritems - it provides more uniform execution across both python 2 and 3.
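The point about uniform execution can be sketched roughly as follows. This is a minimal illustration in the style of six, not the actual statsmodels.compat code; the `PY3` flag mirrors the name exposed in the compat namespace, and the function bodies are assumptions.

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    def iteritems(obj):
        # Python 3: items() returns a lazy view, so wrap it in an iterator
        return iter(obj.items())

    def iterkeys(obj):
        return iter(obj.keys())

    def itervalues(obj):
        return iter(obj.values())
else:
    def iteritems(obj):
        # Python 2: the dedicated iterator methods avoid building a list
        return obj.iteritems()

    def iterkeys(obj):
        return obj.iterkeys()

    def itervalues(obj):
        return obj.itervalues()


# The calling code is identical on both Python lines
params = {"const": 1.0, "x1": 0.5}
total = sum(value for _, value in iteritems(params))
```

With helpers like these the call site never materializes a list on either interpreter, which is the behavior a plain `.items()` call only guarantees on Python 3.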

@bashtage
Contributor

we could keep compat py3k in the compat namespace directly (in contrast to what we had before). However, we can import them in compat.__init__.py from a py3k module, and then all other modules in statsmodels.compat could import the local module from .py3k import filter, ..... without having to import the __init__.py

I don't think py3k makes sense in a package that does not use 2to3, since it isn't just shimming problems in python 3.x that 2to3 gets wrong. The compatibility code is attempting to provide uniformity across all supported versions so that things like range, zip and map always behave like an iterator, and to provide a common location for other functions that have different locations.

If it were moved to a separate file the only reasonable names I can see are either python, so that they would be compat.python or possibly common, although using common is not very precise.
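As a rough sketch of what "always behave like an iterator" means in practice, a compatibility module can rebind the lazy names once and offer list-returning variants next to them. The names `lrange`, `lzip` and `lmap` appear in this PR; the implementation below is an assumption for illustration, not the code under review.

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    import builtins

    # Python 3 builtins are already lazy
    range = builtins.range
    zip = builtins.zip
    map = builtins.map
else:
    import itertools

    # Python 2: swap in the lazy equivalents
    range = xrange  # noqa: F821 (only defined on Python 2)
    zip = itertools.izip
    map = itertools.imap


def lrange(*args):
    """Return a list, like range() did on Python 2."""
    return list(range(*args))


def lzip(*args):
    """Return a list of tuples, like zip() did on Python 2."""
    return list(zip(*args))


def lmap(func, *args):
    """Return a list, like map() did on Python 2."""
    return list(map(func, *args))
```

Code that needs indexing or repeated iteration asks for the `l`-prefixed variant explicitly; everything else gets the lazy version regardless of interpreter.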

@josef-pkt
Member

The compatibility code is attempting to provide uniformity across all supported versions so that things like range, zip and map always behave like an iterator, and to provide a common location for other functions that have different locations.

but almost all of these are only because of the py2 - py3 transition, and we drop the entire module when we drop py 2.x support. However, we add and drop compatibility functions as any of the python versions within the 2.x and within the 3.x line changes.
I don't care much about the name, the original name came from the numpy module that this was taken/copied from, numpy.compat.py3k.py, scipy uses now a modified six.py.

@josef-pkt
Member

aside: pycon got cancelled for me today because of a sick boy at home, so I try to look more at this.

@bashtage
Contributor

but almost all of these are only because of the py2 - py3 transition, and we drop the entire module when we drop py 2.x support. However, we add and drop compatibility functions as any of the python versions within the 2.x and within the 3.x line changes.

While I agree that most is due to 2.x and 3.x incompatibilities, I would imagine that this structure is unlikely to go away even once 3.x is standardized upon. For example, suppose you wanted to use the new Enum in Python 3.4. I would think the natural place to find a backport would be in compat.python or just compat.

The structure I followed was mostly inspired by pandas.

@josef-pkt
Member

Enum would be in collections I guess, like OrderedDict and counter, to match the python import

from (statsmodels.compat.)collections import Enum
or
from (statsmodels.compat.)itertools import new_fancy_iterator
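Wherever such a backport ends up living, the module body usually reduces to a try/except import. The sketch below is only an illustration: in Python 3.4 `Enum` is provided by the standard `enum` module, the fallback class is a deliberately empty placeholder rather than a real backport, and `Link` is a hypothetical example class.

```python
try:
    # Python 3.4+: use the standard library implementation
    from enum import Enum
except ImportError:
    # Older Pythons: hypothetical placeholder, for illustration only.
    # A real backport would have to reproduce the Enum semantics.
    class Enum(object):
        """Stand-in so that the import below never fails."""


class Link(Enum):
    # Hypothetical members, used only to show the import working
    LOGIT = 1
    PROBIT = 2
```

Callers would import `Enum` from the compat location and never need to know which branch was taken.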

@bashtage
Contributor

Or statsmodels.compat.python.collections?

Not obvious which is better, since something like numpy is an external package while collections is an internal module.

@josef-pkt
Member

The structure I followed was mostly inspired by pandas.

I don't know why they did this. In the other directories they follow also the rule to keep __init__.py empty. They even followed the api.py pattern which I didn't know about, except they import everything at the top level.

pandas is or was compared to statsmodels much more a single purpose library, to provide a few data structures with associated tools, while we are a library that collects different models and statistical tools, and users might not use many of them.

(There are things where I will, most likely, never follow the "pandas pattern", being stingy on letters beyond 3 for method names, confounding __repr__ and __str__, and overloading methods with "magic", for example.)

@josef-pkt
Member

Or statsmodels.compat.python.collections?

I think that path gets too long. I'm all in favor of deepening the structure when we want to grow there (*). But in this case it's easy to keep track that compat.collections or compat.itertools are python. I don't expect that we get many of this kind, so far we only have collections. (plus we had iter_compat.py which was misnamed and a bit of a mix of version and py2-3 problems.)
And it's unlikely that we get numpy_collections, python_collections, scipy_collections and patsy_collections.

(*) A long time ago we decided to add subdirectories for models even if we had only one module in it, like regression.linear_model, in the hope that eventually we will need it and to be "future-proof", making it less likely that we have to restructure several times.

@bashtage
Contributor

I agree that this is on the long side.

In fact, one of the reasons for using __init__ for the core Python compatibility code was to make the import as short as reasonable since 95% of the imports from compat are due to Python language difference, and this had the shortest path.

@josef-pkt
Member

You can still do in compat.__init__.py:

from .py3k_or_whatever import *

and define __all__ in py3k_or_whatever.py (*)
I think that's easier to do now, given that the from statsmodels.compat.py3k import xxx have changed.

rename the current __init__.py to py3k_or_whatever.py and create a new __init__.py with just the above import.
local files can import from py3k_or_whatever, code outside of compat doesn't need to be changed.

>>> import statsmodels.compat.__init__ as smc
>>> dir(smc)
['BytesIO', 'HTTPError', 'PY3', 'PY3_2', 'StringIO', '__builtins__', '__cached__', '__doc__', '__file__', '__initializing__', '__loader__', '__name__', '__package__', 'advance_iterator', 'asbytes', 'asbytes_nested', 'asstr', 'asstr2', 'asunicode', 'asunicode_nested', 'builtins', 'bytes', 'cPickle', 'cStringIO', 'callable', 'combinations', 'filter', 'functools', 'get_class', 'get_function_name', 'getexception', 'input', 'io', 'isfileobj', 'iteritems', 'iterkeys', 'itertools', 'itervalues', 'lfilter', 'lmap', 'long', 'lrange', 'lzip', 'map', 'next', 'open_latin1', 'pickle', 'range', 'reduce', 'str', 'strchar', 'string_types', 'sys', 'unichr', 'urljoin',
'urllib', 'urlopen', 'urlretrieve', 'zip', 'zip_longest']
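The re-export pattern described above can be tried out on a throwaway package. This is a sketch under stated assumptions: `demo_compat` and its `python` submodule are hypothetical names standing in for statsmodels.compat and py3k_or_whatever, and the helper functions are placeholders.

```python
import importlib
import os
import sys
import tempfile
import textwrap

# Build a temporary package on disk (hypothetical names throughout)
root = tempfile.mkdtemp()
pkg = os.path.join(root, "demo_compat")
os.mkdir(pkg)

with open(os.path.join(pkg, "python.py"), "w") as fh:
    fh.write(textwrap.dedent('''\
        __all__ = ["lrange", "lzip"]

        def lrange(*args):
            return list(range(*args))

        def lzip(*args):
            return list(zip(*args))

        def _private_helper():
            return None
    '''))

with open(os.path.join(pkg, "__init__.py"), "w") as fh:
    # The package namespace only re-exports what python.py lists in __all__
    fh.write("from .python import *\n")

sys.path.insert(0, root)
demo_compat = importlib.import_module("demo_compat")

# Both import styles resolve to the same function object
from demo_compat import lrange as short_path
from demo_compat.python import lrange as long_path
```

Code outside the package can keep the short import, while modules inside it can import from the submodule directly and avoid depending on `__init__.py`.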
@josef-pkt
Member

Looks good to me.

Why does it say "We can’t automatically merge this pull request."? From the network graph it looks like it's based on current master.
Also, TravisCI didn't run after the last commits and I don't see a job in the queue.

I ran the not-slow tests again on py 3.3 after an inplace develop install, and I get only one unrelated failure in test_numdiff.TestGradLogit.test_hess (pandas 0.10 was too old). I didn't test with any other python version.

I think we can just merge this, after we clear up why it cannot be merged automatically and after another TravisCI run.
We should watch out what pythonxy Ubuntu and nipy Debian testing are saying in the next days.

I read (mostly skimmed) through the rest that github shows and checked that the type check in tools.data didn't change.
I didn't see any problems. I think the compat structure is fine now.
If there are minor problems, then we can fix them if or when we run into them.

@josef-pkt
Member

minor, style question:

In the last changes we import compat.python into compat.__init__.py, but the other modules are changed to importing from compat.python.
What import do we recommend for general use?:

from statsmodels.compat import range, ....
or
from statsmodels.compat.python import iterkeys

If the second, then importing compat.python into compat.__init__.py is redundant.

Kevin Sheppard added some commits Mar 26, 2014
Kevin Sheppard CLN: Refactored so that there is no longer a need for 2to3
The codebase has been refactored so that it can exist in a single, unified base.
The strategy closely followed six and pandas.compat, but does not involve any further
dependencies.

The compatability location has been renamed form compatnp, which housed compatnp.py3k, to
just compat, and the main py3k compatability files are in __init__.py so that they can
be directly accessed

Conflicts:
	statsmodels/genmod/tests/test_gee.py
	statsmodels/nonparametric/tests/test_kernels.py
	statsmodels/sandbox/regression/tests/test_gmm.py
	statsmodels/tools/parallel.py
	statsmodels/tsa/vector_ar/tests/test_var.py

Conflicts:
	statsmodels/base/data.py
	statsmodels/tools/decorators.py

Conflicts:
	statsmodels/base/model.py
	statsmodels/compatnp/py3k.py

Conflicts:
	statsmodels/base/model.py
	statsmodels/genmod/generalized_linear_model.py

Conflicts:
	statsmodels/tools/catadd.py

Conflicts:
	statsmodels/base/model.py
	statsmodels/sandbox/regression/tests/test_gmm_poisson.py
	statsmodels/sandbox/stats/multicomp.py

Conflicts:
	statsmodels/genmod/generalized_estimating_equations.py
	statsmodels/genmod/tests/test_gee.py
4f85a22
Kevin Sheppard Rolled back absolute imports in api.py 9faa25d
Kevin Sheppard Refactored compatnp to compat and moved all py3k to __init__. compatnp
is not the correct name, and it makes more sense for compat to be the
location of any backward/forward compatability code using compat for
the majority of the py3k code and submodules np (or numpy) for numpy and scipy.

Conflicts:
	statsmodels/base/data.py
	statsmodels/tools/decorators.py

Conflicts:
	statsmodels/base/model.py
	statsmodels/tools/tools.py

Conflicts:
	statsmodels/genmod/tests/test_gee.py
3463733
Kevin Sheppard Final removal of 2.x print ecdec83
Kevin Sheppard Fixes reduce-related issues
Conflicts:
	statsmodels/base/data.py

Conflicts:
	statsmodels/base/model.py
e1edee0
Kevin Sheppard Removes extra .items() and replaced with iteritems.
Corrected typo in import BytesIO
07dc268
Kevin Sheppard Many small fixes related to things missed that are not covered by tests, including

raw_input
f64a25c
Kevin Sheppard Removed redundant parentheses that produced tuples in print statements 48bf12f
Kevin Sheppard Rolled back a small error in the previous commit 8198357
Kevin Sheppard Fix broken execfile fix c9643c5
Kevin Sheppard Removed unnecessary use of lzip and lmap
Imported zip and map from compat where used

Conflicts:
	statsmodels/tools/catadd.py
6cffe14
Kevin Sheppard Rebase error fixes 4f8e80f
Kevin Sheppard Cleaned up old, out dated import code for packages that are now required.

Removed a small amount of unnecessary compatability code for slogdet

Conflicts:
	statsmodels/iolib/tests/test_table_econpy.py
f1be5c4
Kevin Sheppard Further refinements to the compat module 9713866
Kevin Sheppard Remove unnecessary compat file a28efb7
Kevin Sheppard Re-factored compat to provide clear locations for dependency compatibility code
da88bef
Kevin Sheppard Added protocol to cPickle to prevent error on 2.6 0e62c88
Kevin Sheppard Fix missing parentheses in tuple comprehension and missing lmap
Conflicts:
	statsmodels/graphics/tests/test_tsaplots.py
982954d
Kevin Sheppard Improved idiomatic python use by removing type comparisons and using isinstance

Removed unneeded .sort() and replaced with sorted()
Replaced range with lrange where needed

Conflicts:
	statsmodels/genmod/generalized_estimating_equations.py
84cf1f7
Kevin Sheppard Removed future with_statement f724ffc
Kevin Sheppard Removed rebase issue 9796eea
Kevin Sheppard Refactored from compatnp to compat using the structure
compat for core Python compatibility
compat.numpy for numpy compatibility
compat.scipy for scipy compatibility

Other locations can be used for compatibility across version of other dependencies, e.g.
compat.pandas

Conflicts:
	statsmodels/base/model.py
1a323c0
Kevin Sheppard Moved NumpyVersion to compat.scipy since it comes from scipy
Fixed import path for NumpyVersion from scipy to lib._version

Conflicts:
	statsmodels/compat/scipy/__init__.py
4d9b3c7
Kevin Sheppard Fix newly merged print statements
Fix rebase issues
18577f1
Kevin Sheppard Fixed missed range uses dfde60f
Kevin Sheppard Fixed some Python 3 compat issues 00a2d03
Kevin Sheppard Fixed examples and notebooks to be Python 2/3 compatible e14825e
Kevin Sheppard Rolled back some aggressive changes. 8ccbc00
Kevin Sheppard Moved compat.__init__ to compat.python
Imported from .python in __init__
f96dc4a
Kevin Sheppard Moved compat.__init__ to compat.python
Conflicts:
	statsmodels/compat/__init__.py
	statsmodels/compat/scipy/__init__.py
	statsmodels/compat/tests/test_collections.py
	statsmodels/compat/tests/test_itercompat.py

Conflicts:
	statsmodels/genmod/tests/test_gee.py
b235b7b
Kevin Sheppard Simplified compat numpy and scipy so that they are just files.
Fixed minor rebase issue
8f1ed7a
@josef-pkt
Member

@bashtage @jseabold any objection to merging?

TravisCI should come back green soon. My latest 3.3 run also came back green (except unrelated failure)

@jseabold
Member

Merge is fine with me. Will update other PRs as needed.

@bashtage
Contributor

Seems fine now. Had missed an absolute_import

@bashtage
Contributor

from statsmodels.compat import range, ....
or
from statsmodels.compat.python import iterkeys

If the second, then importing compat.python into compat.__init__.py is redundant.

My preference is for the former since it is very common, and it occasionally saves splitting imports across two lines.

@josef-pkt josef-pkt merged commit 5931ec4 into statsmodels:master Apr 10, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
@bashtage bashtage deleted the bashtage:remove-2to3 branch Apr 10, 2014
@josef-pkt josef-pkt added the PR label Apr 14, 2014