
Can not put a dataframe into hdfstore *completely* #3012

Closed
simomo opened this Issue · 3 comments

simomo
  • I loaded a dataframe from MySQL:
df_bugs_activity_4w = psql.read_frame('select * from bugs_activity limit 0, 40000', conn)
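For readers on current pandas: psql.read_frame was later removed, and pandas.read_sql is the replacement. A minimal sketch of the same step, using an in-memory SQLite table as a stand-in (the schema and values here are illustrative, not the reporter's actual bugs_activity table):

```python
import sqlite3
import pandas as pd

# Illustrative stand-in for the MySQL table: a few of the issue's
# columns, with attach_id mostly NULL as in the report.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bugs_activity (bug_id INTEGER, attach_id INTEGER, who INTEGER)"
)
conn.executemany(
    "INSERT INTO bugs_activity VALUES (?, ?, ?)",
    [(1, None, 35), (2, 301879, 35), (3, None, 56), (4, None, 207)],
)
conn.commit()

# Modern replacement for psql.read_frame('select ...', conn)
df = pd.read_sql("SELECT * FROM bugs_activity LIMIT 40000", conn)
print(len(df))                              # 4
print(int(df["attach_id"].notna().sum()))   # 1 non-null attach_id
```

Note that the NULLs force attach_id to float64 with NaN holes, just as in the frame shown below.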
  • Here is the structure of df_bugs_activity_4w:
In[19]: df_bugs_activity_4w
Out[19]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 39999
Data columns:
bug_id       40000  non-null values
attach_id    13  non-null values
who          40000  non-null values
bug_when     40000  non-null values
fieldid      40000  non-null values
added        40000  non-null values
removed      40000  non-null values
id           40000  non-null values
dtypes: float64(1), int64(4), object(3)
  • Then I converted the object columns:
In [60]: df_bugs_activity_4w = df_bugs_activity_4w.convert_objects()
In [61]: df_bugs_activity_4w
Out[61]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 39999
Data columns:
bug_id       40000  non-null values
attach_id    13  non-null values
who          40000  non-null values
bug_when     40000  non-null values
fieldid      40000  non-null values
added        40000  non-null values
removed      40000  non-null values
id           40000  non-null values
dtypes: datetime64[ns](1), float64(1), int64(4), object(2)
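convert_objects was later deprecated and removed; on current pandas the equivalent step is infer_objects plus an explicit to_datetime. A sketch with made-up data shaped like the columns above:

```python
import pandas as pd

# Toy frame mimicking the object columns above (values are invented).
df = pd.DataFrame({
    "fieldid": pd.Series([16, 5, 10], dtype=object),          # ints boxed as object
    "bug_when": ["1999-04-06 11:22:11", "1999-04-06 11:22:12",
                 "1999-05-14 15:56:12"],                      # timestamps as strings
})

df = df.infer_objects()                          # soft-converts object -> int64
df["bug_when"] = pd.to_datetime(df["bug_when"])  # object -> datetime64[ns]
print(df.dtypes)
```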
  • I put it into an HDFStore, then read it back, and found the number of rows had changed from 40,000 to 13! That's weird. It seems the count of non-null 'attach_id' values limits the total number of rows when the dataframe is put into the HDFStore.
In [63]: %prun store.put('df_bugs_activity_4w1', df_bugs_activity_4w, table=True)

In [64]: %time store.get('df_bugs_activity_4w1')
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.01 s
Out[64]:
bug_id  attach_id   who bug_when    fieldid added   removed id
2012     301879  0   35 1999-04-06 11:22:11  16  dev     bug     2013
2014     301879  0   35 1999-04-06 11:22:12  5   para    us 2015
2835     301879  0   56 1999-05-14 15:56:12  10  clo     op  2836
31244    301879  0   207    2001-07-18 14:11:38  10  op  clo     31245
31252    301879  0   207    2001-07-18 15:40:52  10  ana     op  31253
31283    301879  0   35 2001-07-18 21:21:33  16  lui     dev     31284
31285    301879  0   35 2001-07-18 21:21:34  15  296     10  31286
31287    301879  0   35 2001-07-18 21:21:35  5   unk     para    31288
31393    301879  0   159    2001-07-19 12:41:07  16  prat    lui     31394
31472    301879  0   207    2001-07-19 17:27:31  10  ope     ana     31473
32675    301879  0   207    2001-08-02 10:09:08  10  clos    op   32676
38609    235837  0   201    2001-09-26 20:28:11  15  310-    300-3   38610
38610    235838  0   201    2001-09-26 20:28:11  15  310-    300     38611
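The 13 rows that come back match the 13 non-null attach_id values exactly, so the write behaves as if every row with a missing attach_id were silently dropped. The sketch below is not pandas' internal code, just a reproduction of the symptom on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame: attach_id is mostly NaN, as in the issue (values invented).
df = pd.DataFrame({
    "bug_id": range(6),
    "attach_id": [np.nan, 301879.0, np.nan, np.nan, 235837.0, np.nan],
})

# Dropping rows whose attach_id is NaN yields the same kind of row loss
# the reporter saw on the HDFStore round trip (40000 -> 13 in the issue).
survivors = df.dropna(subset=["attach_id"])
print(len(df), "->", len(survivors))  # 6 -> 2
```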


In [66]: store
Out[66]:
<class 'pandas.io.pytables.HDFStore'>
File path: sample_no_fill.h5
/df_bugs_4w                      frame_table  (typ->appendable,nrows->40000,ncols->52,indexers->[index])
/df_bugs_4w1                     frame_table  (typ->legacy,nrows->None,ncols->0,indexers->[])           
/df_bugs_activity_4w             frame_table  (typ->appendable,nrows->13,ncols->8,indexers->[index])    
/df_bugs_activity_4w1            frame_table  (typ->appendable,nrows->13,ncols->8,indexers->[index]) 
jreback
Owner

pls post a sample of the data as it comes out of SQL, before any other operations. Post it as a text string, EXACTLY as you have it (e.g. do a df.to_csv()) with a subset of the rows.
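For reference, to_csv returns the CSV text as a string when no path is given, so a row subset can be pasted verbatim into the issue. The frame below is a stand-in for the real data:

```python
import pandas as pd

# Stand-in frame; the real sample would be the first rows out of SQL.
df = pd.DataFrame({"bug_id": [2012, 2014], "who": [35, 35]})

# With no path argument, to_csv returns the CSV as a string.
sample = df.head(20).to_csv()
print(sample)
```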

jreback
Owner

@simomo PR #3013 should fix your 2nd issue (there was a bug), pls give it a try and let me know

simomo

This issue has been solved. Thanks~

simomo closed this