Add HDF5 Engine #79

Closed
ethanwhite opened this issue Mar 3, 2013 · 13 comments

Comments

@ethanwhite
Member

ethanwhite commented Mar 3, 2013

HDF5 seems like a really powerful emerging approach to storing data, and a lot of the more cutting-edge folks are using it. PyTables should help.

@henrykironde
Contributor

This should be easier now, since we have:

  • A fetch function.

  • An efficient install process that keeps a record of the installed table names (an engine instance holds those table names).

  • CSV and SQLite engines which, together with Pandas/NumPy, make installing data into HDF5 easy.
    An additional reference is PyTables.

@harshitbansal05
Contributor

@henrykironde can I work on this one?

@henrykironde
Contributor

Yes, go for it

@harshitbansal05
Contributor

@henrykironde I read about PyTables and, as far as I understand, I only need to take the dictionary provided by the fetch function (table names mapped to pandas data frames) and save its contents into an HDF5 file. That means a SQLite database would be created first, followed by the HDF5 file. Am I right?

@henrykironde
Contributor

That could be one way. If you install data into SQLite or CSV, pandas can create a data frame from that, so we may not actually need to use fetch. I think we do the same thing in fetch.

Once we have a data frame, we can check how PyTables creates HDF5 files, or even the Python module hdf5. If we can create the HDF5 files without introducing either of these package dependencies, that would be great, because users would not need to install more dependencies. However, we can still use these packages when testing how well we create the HDF5 files.

Let me know the design you come up with before you move on to coding it.

@harshitbansal05
Contributor

@henrykironde @ethanwhite, pandas currently converts its data frames to HDF5 files using a class named HDFStore, which uses PyTables under the hood. We can use this class to convert data frames to HDF5 files (after building the data frames from SQLite), but that means adding PyTables as a new dependency. I looked around but could not find any other suitable method for converting to HDF5 files.
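
For illustration, here is a minimal sketch of the HDFStore approach described above. It assumes pandas and PyTables are installed. The function name `write_tables_to_hdf5` and the `example_tables` dictionary are hypothetical stand-ins for a table-name-to-data-frame mapping, not part of the project's API.

```python
import os
import tempfile

import pandas as pd


def write_tables_to_hdf5(tables, path):
    """Write each DataFrame in `tables` (name -> DataFrame) to one HDF5 file."""
    # HDFStore needs the optional PyTables package ("tables") to be installed.
    with pd.HDFStore(path, mode="w") as store:
        for name, frame in tables.items():
            # format="table" keeps the dataset appendable and queryable later.
            store.put(name, frame, format="table", data_columns=True)


# Hypothetical stand-in for a name -> DataFrame dictionary.
example_tables = {
    "species": pd.DataFrame({"id": [1, 2], "name": ["DM", "PP"]}),
    "plots": pd.DataFrame({"plot": [1, 2, 3], "area": [0.25, 0.25, 0.5]}),
}

hdf5_path = os.path.join(tempfile.mkdtemp(), "example.h5")
write_tables_to_hdf5(example_tables, hdf5_path)

# Read one table back to confirm the round trip.
species = pd.read_hdf(hdf5_path, "species")
```

Each table ends up under its own key in a single `.h5` file, so one file can hold a whole dataset.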

@henrykironde
Contributor

Okay, let's go with that. Use pandas, and we can import PyTables if it is not shipped with pandas.

@ethanwhite
Member Author

Does using pandas cause us any issues with out of memory scale data? In other words, do we need to load all of the data into an in-RAM data frame this way?

@henrykironde
Contributor

Just throwing this out here: pandas may have something like an external data frame that reads from disk. I am not sure.

@DumbMachine
Contributor

Hey, I was wondering if this issue is solved or being worked upon currently. I had a few suggestions for it.

> Does using pandas cause us any issues with out of memory scale data? In other words, do we need to load all of the data into an in-RAM data frame this way?

pandas has good support for out-of-memory-scale data: we can read data from sources like CSV in chunks. Adding the data from a large CSV to an HDF5 file then becomes as simple as:

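import pandas as pd

# DATASET_NAME, LARGE_CSV_FILE, CHUNKSIZE and TABLE_NAME are placeholders.
# Stream the CSV in chunks so the full table never has to fit in RAM.
with pd.HDFStore('{}.h5'.format(DATASET_NAME)) as store:
    for chunk in pd.read_csv(LARGE_CSV_FILE, chunksize=CHUNKSIZE):
        store.append(TABLE_NAME, chunk, data_columns=True, index=False)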
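
One thing to watch with chunked appends: the width of a string column is fixed when the table is first created, so a longer string in a later chunk can make `append` raise a `ValueError`. The sketch below shows one way around that with the `min_itemsize` argument. The in-memory CSV, the `people` key and the width of 32 are made up for the example.

```python
import io
import os
import tempfile

import pandas as pd

# Tiny in-memory CSV standing in for a large file; the last name is the longest.
csv_text = "id,name\n1,ab\n2,cd\n3,a much longer name\n"

h5_path = os.path.join(tempfile.mkdtemp(), "strings.h5")
with pd.HDFStore(h5_path, mode="w") as store:
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
        # min_itemsize reserves a width of 32 for "name"; without it the width
        # would be set by the first chunk's longest value and the second
        # append would fail on the longer string.
        store.append("people", chunk, data_columns=True, index=False,
                     min_itemsize={"name": 32})

people = pd.read_hdf(h5_path, "people")
```

In practice the reserved width would have to come from knowledge of the data, for example a column length declared in the dataset's metadata.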

To go from SQLite to HDF5 we can do the same as above; the only difference is the API used to read the data:

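import sqlite3
import pandas as pd

# DB_NAME, TABLE_NAME, DATASET_NAME and CHUNKSIZE are placeholders.
cnx = sqlite3.connect('sqlite.db')
query = "SELECT * FROM {db}_{table}".format(db=DB_NAME, table=TABLE_NAME)
with pd.HDFStore('{}.h5'.format(DATASET_NAME)) as store:
    for chunk in pd.read_sql_query(query, cnx, chunksize=CHUNKSIZE):
        store.append(TABLE_NAME, chunk, data_columns=True, index=False)
cnx.close()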
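
As a self-contained sketch of the same idea, the example below builds a small throwaway SQLite table, copies it to HDF5 in chunks, and then reads it back from disk both with a filter and in chunks. The table name `demo_counts`, its columns and the chunk size of 4 are invented for the example. It assumes pandas and PyTables are installed.

```python
import os
import sqlite3
import tempfile

import pandas as pd

workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "example.db")
h5_path = os.path.join(workdir, "example.h5")

# Build a small SQLite table to stand in for an installed dataset.
cnx = sqlite3.connect(db_path)
cnx.execute("CREATE TABLE demo_counts (site INTEGER, total INTEGER)")
cnx.executemany("INSERT INTO demo_counts VALUES (?, ?)",
                [(i % 3, i) for i in range(10)])
cnx.commit()

# Copy the table to HDF5 four rows at a time.
with pd.HDFStore(h5_path, mode="w") as store:
    for chunk in pd.read_sql_query("SELECT * FROM demo_counts", cnx, chunksize=4):
        store.append("demo_counts", chunk, data_columns=True, index=False)
cnx.close()

# Read back from disk: filter on a data column, then iterate in chunks.
with pd.HDFStore(h5_path, mode="r") as store:
    site_zero = store.select("demo_counts", where="site == 0")
    chunk_sizes = [len(part) for part in store.select("demo_counts", chunksize=4)]
```

Because the table is stored in table format with data columns, `select` can filter or page through it without loading the whole table, which bears on the in-RAM question raised earlier in the thread.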

If this issue is not currently being worked upon by anyone, I would love to take up this issue. Otherwise, I'd still be willing to help if required.

@henrykironde
Contributor

@DumbMachine, this looks like what we need. I have not had time to get my hands on this. However, since @harshitbansal05 was working on it, you could collaborate, with the goal of getting it working in the best way possible.

@DumbMachine
Contributor

@harshitbansal05, let me know if I can help you in any way. In the meantime, I made a rough implementation of the HDF5engine by reusing CSVengine and converting the CSV to an HDF5 file. The results can be seen here. If these are the results you want, I can make a pull request to fully implement it.

@henrykironde
Contributor

Added in 05351dc
