Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: store.select should warn about unimplemented operators #2973

Closed
michaelaye opened this issue Mar 6, 2013 · 22 comments
Closed

Feature: store.select should warn about unimplemented operators #2973

michaelaye opened this issue Mar 6, 2013 · 22 comments

Comments

@michaelaye
Copy link
Contributor

Currently, when doing this (with object dtypes)

store.select('df', 'x != none') #this is not NaN, but the string 'none'

one receives exactly the same object as when doing

store.select('df', 'x = none')

The suggestion is that the query parser warns about the unknown or unimplemented operator.

@jreback
Copy link
Contributor

jreback commented Mar 6, 2013

can u post a couple of sample rows from df
as well as df.info()? (I want to see dtypes u r passing)

@michaelaye
Copy link
Contributor Author

Here's the store:

<class 'pandas.io.pytables.HDFStore'>
File path: /Users/maye/data/marszoo/planet_four_classifications_2013-02-23.h5
/df            frame_table  (typ->appendable,nrows->9275468,ncols->17,indexers->[index],dc->[image_id,image_name,user_name,marking])

Here's df.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9275468 entries, 0 to 9275467
Data columns:
classification_id    9275468  non-null values
created_at           9275468  non-null values
image_id             9275468  non-null values
image_name           9275468  non-null values
user_name            9275468  non-null values
marking              9275468  non-null values
x_tile               9275468  non-null values
y_tile               9275468  non-null values
x                    9275468  non-null values
y                    9275468  non-null values
image_x              9275468  non-null values
image_y              9275468  non-null values
radius_1             9275468  non-null values
radius_2             9275468  non-null values
distance             9275468  non-null values
angle                9275468  non-null values
spread               9275468  non-null values
dtypes: object(17)

and here are a couple of lines with that 'none' entry (note that everything from column 'x' is empty for those lines):

classification_id   created_at  image_id    image_name  user_name   marking x_tile  y_tile  x   y   image_x image_y radius_1    radius_2    distance    angle   spread
df_index                                                                    
23   50eace09e39956220600081f    2013-01-07 13:30:49     APF00008jy  ESP_012265_0950     lukesmith   none    5   2                                  
41   50eaf01ae3995621fc00093e    2013-01-07 15:56:10     APF00007su  ESP_012604_0965     not-logged-in   none    1   2                                  
42   50eaf01ee3995621d300010a    2013-01-07 15:56:14     APF00003bv  ESP_011460_0980     not-logged-in   none    2   1                                  
44   50eaf03a45d7e142f50000cb    2013-01-07 15:56:42     APF000030j  ESP_011900_0985     not-logged-in   none    5   1                                  
45   50eaf03ce3995621d3000117    2013-01-07 15:56:44     APF0000p3r  ESP_020150_0950     not-logged-in   none    7   1                                                                  

@jreback
Copy link
Contributor

jreback commented Mar 6, 2013

sorry I think df.info() doesn't show me what I need

an u show

df.get_dtype_counts()

x is an object column (string)?
when u say empty, you mean == '' would be true ?
a len 0 string?

@jreback
Copy link
Contributor

jreback commented Mar 6, 2013

fixed! ty

had an incorrect implementation...thanks for the debugging!

fyi...you can put np.nan in your string columns and they will come back that way (rather than '')

only issue is I can't think of a nice way to select on them

e.g.

store.select('df',['x!=none]') will work (and get '' and np.nan columns if you have them)

store.select('df',['x','!=',np.nan]) is a little awkard (and doesn't work!).....any thoughts?
I think in SQL land they would do somethign like: x=NULL ....?

jreback added a commit that referenced this issue Mar 6, 2013
BUG: HDFStore didn't implement != correctly for string columns query, GH #2973
@michaelaye
Copy link
Contributor Author

Thanks, amazing fast response! ;)

About np.nan, I don't understand where I could 'put np.nan in my string'? Do you mean, that the store.select will treat empty strings as np.nans and return np.nan if I tried to filter for it? But that might not always be what the user wants? Maybe I get you wrong.

I can see 4 scenarios:

  1. 'none' is a string that the user needs to be able to search for. Then store.select('df',['x!=none']) should only get those and should work, I guess?
  2. none is just another way of saying np.nan, but then it should not be in the HDFStore in the first place, and the user possibly wants to clean it up, but first needs to select for it, so number 1. applies
  3. The user wants to select for empty strings. Here I don't know how to select on them?
  4. The user wants to select on nan: The implementation for that should be guided by the way the HDFStore stores actually nan values (which I don't know).

Does that help?

@jreback
Copy link
Contributor

jreback commented Mar 6, 2013

you can also specify terms like:

Term('x','=',value)

value could be a string/list/object (useful when say u are comparing against a date)

On Mar 5, 2013, at 10:10 PM, "K.-Michael Aye" notifications@github.com wrote:

Thanks, amazing fast response! ;)

About np.nan, I don't understand where I could 'put np.nan in my string'? Do you mean, that the store.select will treat empty strings as np.nans and return np.nan if I tried to filter for it? But that might not always be what the user wants? Maybe I get you wrong.

I can see 3 scenarios:

  1. 'none' is a string that the user needs to be able to search for. Then store.select('df',['x!=none']) should only get those and should work, I guess?
  2. none is just another way of saying np.nan, but then it should not be in the HDFStore in the first place, and the user possibly wants to clean it up, but first needs to select for it, so number 1. applies
  3. The user wants to select for empty strings. Here I don't know how to select on them?
  4. The user wants to select on nan: The implementation for that should be guided by the way the HDFStore stores actually nan values (which I don't know).

Does that help?


Reply to this email directly or view it on GitHub.

@jreback
Copy link
Contributor

jreback commented Mar 6, 2013

  1. should work if u update to master
  2. You should store strings with a None value as np.nan - they will come back that way (you can specify how they r actually stored as well - see the docs, this is call the nan_rep)
  3. and 4) that way I was asking
    say u actually want to select a string that is a np.nan - actually I think u can just select the nan_rep string (but may have to set it to something other than '') as I am not sure that works - I will think about this

in any event let me know if u have any more issues

Jeff

On Mar 5, 2013, at 10:10 PM, "K.-Michael Aye" notifications@github.com wrote:

Thanks, amazing fast response! ;)

About np.nan, I don't understand where I could 'put np.nan in my string'? Do you mean, that the store.select will treat empty strings as np.nans and return np.nan if I tried to filter for it? But that might not always be what the user wants? Maybe I get you wrong.

I can see 3 scenarios:

  1. 'none' is a string that the user needs to be able to search for. Then store.select('df',['x!=none']) should only get those and should work, I guess?
  2. none is just another way of saying np.nan, but then it should not be in the HDFStore in the first place, and the user possibly wants to clean it up, but first needs to select for it, so number 1. applies
  3. The user wants to select for empty strings. Here I don't know how to select on them?
  4. The user wants to select on nan: The implementation for that should be guided by the way the HDFStore stores actually nan values (which I don't know).

Does that help?


Reply to this email directly or view it on GitHub.

@michaelaye
Copy link
Contributor Author

it does not seem to be working yet:

http://nbviewer.ipython.org/5097158

@jreback
Copy link
Contributor

jreback commented Mar 6, 2013

afba0 is the current commit, your version references the prior one

@michaelaye
Copy link
Contributor Author

Hm, am I doing something wrong?

(master)[maye@lunatic ~/Dropbox/src/pandas]$ git log -1
commit afba0d0e401a294f83cde3df4609179f64f165b2
Merge: 3790f16 197b3c7
Author: jreback <jeff@reback.net>
Date:   Tue Mar 5 19:04:09 2013 -0800

    Merge pull request #2976 from jreback/pytables_select

    BUG: HDFStore didn't implement != correctly for string columns query, GH #2973
(master)[maye@lunatic ~/Dropbox/src/pandas]$ git pull upstream master
From https://github.com/pydata/pandas
 * branch            master     -> FETCH_HEAD
Already up-to-date.

@jreback
Copy link
Contributor

jreback commented Mar 6, 2013

that looks right
did u restart the notebook kernel?
or start a new one?
I usually have to force a path in notebook
(eg sys.path.insert(0,'path to pandas git')

try in your git dir with ipython at command line

I can be reached on my cell 917-971-6387

On Mar 6, 2013, at 3:31 AM, "K.-Michael Aye" notifications@github.com wrote:

Hm, am I doing something wrong?

(master)[maye@lunatic ~/Dropbox/src/pandas]$ git log -1
commit afba0d0
Merge: 3790f16 197b3c7
Author: jreback jeff@reback.net
Date: Tue Mar 5 19:04:09 2013 -0800

Merge pull request #2976 from jreback/pytables_select

BUG: HDFStore didn't implement != correctly for string columns query, GH #2973

(master)[maye@lunatic ~/Dropbox/src/pandas]$ git pull upstream master
From https://github.com/pydata/pandas

  • branch master -> FETCH_HEAD
    Already up-to-date.

    Reply to this email directly or view it on GitHub.

@jreback
Copy link
Contributor

jreback commented Mar 6, 2013

did you get this to work?

@michaelaye
Copy link
Contributor Author

yes, finally. I think it was not found because I tried to install via egginstall, while I had installed the last dev version via setup.py install. I guess the egg is only being used when there's no pandas folder? I really don't understand all these dist-tools...

@jreback
Copy link
Contributor

jreback commented Mar 6, 2013

great

never used egg install

@michaelaye
Copy link
Contributor Author

but: what about the initial idea of this feature request? ;)

@jreback
Copy link
Contributor

jreback commented Mar 7, 2013

do you have a case where it fails? (aside from the != and bool cases we fixed)
I test with Terms that have: datetimes,float,int,string,bool....is there something else missing?

the issue is that I think we always do a valid conversion, so how do I know when its invalid?

@michaelaye
Copy link
Contributor Author

Doh, sorry, I didn't click that all is covered now. (BTW, the website need to include the != operator now as supported for queries.
Only thing I can think about now, is to warn the user when an OR is being tried, but maybe you have done that already?

@jreback
Copy link
Contributor

jreback commented Mar 7, 2013

you can't specify an OR explicity (yet!), are you doing this somehow?

you basically just concat 2 results of 2 queries together (in docs under advanced)

@michaelaye
Copy link
Contributor Author

No, I'm not hacking it somehow together. ;)
yes, saw the OR replacement, thx.
Do you want me to submit a nano-sized-diff-PR for the improved docs saying that != is now supported?

@jreback
Copy link
Contributor

jreback commented Mar 7, 2013

thanks. but..added in another PR

but feel free to review the docs for anything missing/confusing!

@jreback
Copy link
Contributor

jreback commented Mar 14, 2013

@michaelaye shall we close this ? do you have a case where the selection should warn? (I am guessing if you did then we would have fixed it!)

@michaelaye
Copy link
Contributor Author

yep. thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants