Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that tuples are not considered as data, not as data containers #820

Merged
merged 3 commits into from May 11, 2013

Conversation

toobaz
Copy link
Contributor

@toobaz toobaz commented May 8, 2013

It can happen (not necessarily by craziness, but because of default names given by pandas.core.groupby.DataFrameGroupBy.agg() ) that names of columns are tuples. If one tries to print a regression summary() in such cases,

  1. if the column is used as a dependent variable, there will be an error,
  2. if the column is used as an independent variable, "%s" will be printed instead than its name.

My change ensures that tuples are considered as objects for interpolation, not as containers of objects. It should have no effect at all on any data type which is not a tuple.

BTW: tested only on d5ed746 , unfortunately later commits give me an error ("ImportError: cannot import name kalman_loglike", at line 32 of statsmodels/statsmodels/tsa/kalmanf/kalmanfilter.py ). But I did check that the file was not changed by any subsequent commit.

@@ -679,7 +679,7 @@ def format(self, width, output_format='txt', **fmt_dict):
data_aligns = fmt.get('data_aligns','c')
if isinstance(datatype, int):
datatype = datatype % len(data_fmts) #constrain to indexes
content = data_fmts[datatype] % data
content = data_fmts[datatype] % (data,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might not work if data is already an iterable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean there could be formats strings with more than one "%s"? There aren't, and I don't see why there should be, but maybe I'm missing something. What I say is precisely that only this will work if data is already an iterable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know this code, and never looked at the details, I sent an email to Alan, the author of SimpleTable.

It's likely that you are right, since this is in a single cell (after looking a bit closer). Then, the change would really be without effect in other cases than tuple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's precisely my point.

I'll try to produce a test, but 1) I never produced a test in my life, so I'll do that when I have some time, 2) anyway, the test would certainly have formats with one "%s" only, since I can't think of anything different (and anyway, one must change the code to do something different).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to test or come up with something where we have more than a single "%s".

What would be good to have is your original case, a simplified version, where names of columns are tuples.

(writing unit tests is a good skill to practise :)

@josef-pkt
Copy link
Member

@toobaz Thank you, can you provide a simple test case, that is made to work with this change.

A case with a pandas dataframe in a model.summary would be best.
It could be that we should "sanitize" the names already in the model data handling, where we know more about the meaning of the names/tuples.

About the import error, the current master requires compiling. Building the extensions is not optional anymore.

@josef-pkt
Copy link
Member

One problem in this is that our "real" test coverage for the summaries and SimpleTable is not very high. So we might be introducing a mistake without getting a test failure.

@jseabold
Copy link
Member

jseabold commented May 8, 2013

If you could file an issue about the build error that would be helpful, so we can fail more gracefully.

@josef-pkt
Copy link
Member

Alan made the same change

https://code.google.com/p/econpy/source/diff?spec=svn197&r=197&format=side&path=/trunk/utilities/table.py

So this should be good to merge.
(our version differs a bit)

@toobaz
Copy link
Contributor Author

toobaz commented May 11, 2013

I added a test ( ae1a38b ). It's the first time I write one, so feel free to point out at enhancements.

Notice I test only the minimum related to my changes in the code. I couldn't test anyway the whole regression summary, since it contains the date.

@@ -3,6 +3,8 @@
from statsmodels.iolib.table import SimpleTable, default_txt_fmt
from statsmodels.iolib.table import default_latex_fmt
from statsmodels.iolib.table import default_html_fmt
import pandas
from statsmodels.api import OLS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from statsmodels.regression.linear_model import OLS

It doesn't make much difference in this case, but it's better to import from the actual module (inside statsmodels)

@josef-pkt
Copy link
Member

Thank you Pitro
The test looks good, matches the style of the SimpleTable tests, which is different from our usual nose based tests.

Would you please change the import, then we can just merge through github. This branch is only one commit behind master.

@toobaz
Copy link
Contributor Author

toobaz commented May 11, 2013

Done. Notice that in that file there are several tests which are disabled (indented).

@josef-pkt
Copy link
Member

I didn't know about the nested functions or never paid enough attention.)

It looks like they are indented intentionally, they use objects created in the outer functions.

I never checked whether nose runs nested functions. They might be candidates for some test refactoring, but that's not relevant for this pull request.

@josef-pkt
Copy link
Member

I restarted the Travis CI build, because it had an unrelated build/network error

josef-pkt added a commit that referenced this pull request May 11, 2013
SimpleTable: Ensure that tuples are considered as data, not as data containers
@josef-pkt josef-pkt merged commit 9bd0cbd into statsmodels:master May 11, 2013
@josef-pkt
Copy link
Member

merging
Thanks Pietro

@toobaz toobaz deleted the fix_tuples_printout branch February 28, 2014 12:53
PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014
SimpleTable: Ensure that tuples are considered as data, not as data containers
@krrishsgk
Copy link

I had the same error. "ImportError: cannot import name kalman_loglike". I've installed Cython and installed statsmodels with 'build-ext'. I still get the error. How do I get past it? This is the beginning of the code I'm running where it throws the error:

import numpy as np
import pandas as pd

import statsmodels.api as sm

df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/longley.csv', index_col=0)
df.head()

@jseabold
Copy link
Member

If you do python setup.py build_ext --inplace you have to import from the source tree unless you next install it with something like python setup.py install --user. Can you ask on the mailing list instead of on this issue for help?

@krrishsgk
Copy link

Yeah ok. Didn't know that. Sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants