Add HDF5 Engine #79

ethanwhite · 2013-03-03T20:22:07Z

Seems like a really powerful emerging approach to storing data that lots of more cutting edge folks are using. PyTables should help.

henrykironde · 2019-03-02T23:52:16Z

We already have the following:

- A fetch function.
- Efficient data installation that keeps a record of the installed table names (an engine instance holds those table names).
- CSV, SQLite, and Pandas/NumPy, each of which makes it easy to install data into HDF5.

An additional reference is PyTables.

harshitbansal05 · 2019-03-03T05:20:53Z

@henrykironde can I work on this one?

henrykironde · 2019-03-03T05:51:01Z

Yes, go for it

harshitbansal05 · 2019-03-03T07:03:26Z

@henrykironde I read about PyTables and, as far as I understand, I only need to take the dictionary provided by the fetch function (table names mapped to pandas data frames) and save its contents into an HDF5 file. That means a SQLite database would be created first, followed by the HDF5 file. Am I right?

henrykironde · 2019-03-03T08:50:42Z

That could be one way. If you install data into SQLite or CSV, Pandas can create a data frame from that, so we may not actually need to use fetch. I think fetch does the same thing internally.

Once we have a data frame, we can look at how PyTables creates HDF5 files, or even at the Python module hdf5. If we can create the HDF5 files without introducing either of these packages as a dependency, that would be great, because users would not need to install anything extra. We can still use these packages in tests to check how well we are creating the HDF5 files.

Let me know the design you come up with before you move on to coding it.

harshitbansal05 · 2019-03-05T19:16:06Z

@henrykironde @ethanwhite, pandas currently converts its data frames to HDF5 files using a class named HDFStore, which uses PyTables under the hood. We can use this class to convert data frames to HDF5 files (after building the data frames from SQLite), but that means adding PyTables as a new dependency. I looked around and could not find any other suitable method for converting to HDF5 files.

henrykironde · 2019-03-05T20:56:24Z

Okay, let's go with that. Use pandas, and we can import PyTables if it is not shipped with pandas.
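
One way to handle PyTables as an optional dependency is to check for it at runtime before offering the engine. This is only a minimal sketch: the helper name `has_module` is hypothetical and not part of the retriever. PyTables is imported under the module name `tables`.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


# PyTables installs as the module "tables"; pandas treats it as optional.
if not has_module("tables"):
    print("PyTables is not installed; an HDF5 engine could not be used.")
```

A check like this would let the other engines keep working for users who never install PyTables.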

ethanwhite · 2019-03-05T20:58:18Z

Does using pandas cause us any issues with out of memory scale data? In other words, do we need to load all of the data into an in-RAM data frame this way?

henrykironde · 2019-03-05T22:36:32Z

Just throwing this out here: pandas may have something like an external data frame that reads from disk. I am not sure.

DumbMachine · 2019-03-20T19:22:15Z

Hey, I was wondering whether this issue is solved or currently being worked on. I have a few suggestions for it.

> Does using pandas cause us any issues with out of memory scale data? In other words, do we need to load all of the data into an in-RAM data frame this way?

Pandas has good support for larger-than-memory data: we can read data from sources like CSV in chunks. Adding the data from a large CSV to an HDF5 file then becomes as simple as:

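import pandas as pd

# DATASET_NAME, LARGE_CSV_FILE, CHUNKSIZE and TABLE_NAME are placeholders.
store = pd.HDFStore('{}.h5'.format(DATASET_NAME))
for chunk in pd.read_csv(LARGE_CSV_FILE, chunksize=CHUNKSIZE):
    # Append chunk by chunk so the whole CSV is never held in memory.
    store.append(TABLE_NAME, chunk, data_columns=True, index=False)
store.close()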

To transform data from SQLite to HDF5 we can do the same as above; the only difference is the API used:

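import sqlite3
import pandas as pd

cnx = sqlite3.connect('sqlite.db')
store = pd.HDFStore('{}.h5'.format(DATASET_NAME))
# DB_NAME is a placeholder, like the other upper-case names.
query = "SELECT * FROM {db}_{table}".format(db=DB_NAME, table=TABLE_NAME)
for chunk in pd.read_sql_query(query, cnx, chunksize=CHUNKSIZE):
    store.append(TABLE_NAME, chunk, data_columns=True, index=False)
store.close()
cnx.close()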
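
The SQLite approach above could be wrapped into a reusable helper along these lines. This is a rough sketch under assumptions, not the retriever's actual engine code: the function names `iter_table_chunks` and `sqlite_table_to_hdf5` and the default chunk size are made up for illustration, and writing still requires PyTables to be installed.

```python
import sqlite3

import pandas as pd


def iter_table_chunks(db_path, table, chunksize):
    """Yield DataFrames of at most `chunksize` rows from a SQLite table."""
    cnx = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        query = 'SELECT * FROM "{}"'.format(table.replace('"', '""'))
        for chunk in pd.read_sql_query(query, cnx, chunksize=chunksize):
            yield chunk
    finally:
        cnx.close()


def sqlite_table_to_hdf5(db_path, table, h5_path, chunksize=50000):
    """Append a SQLite table to an HDF5 file chunk by chunk; return rows written."""
    rows = 0
    # HDFStore needs PyTables; the import only happens when the store is opened.
    with pd.HDFStore(h5_path) as store:
        for chunk in iter_table_chunks(db_path, table, chunksize):
            store.append(table, chunk, data_columns=True, index=False)
            rows += len(chunk)
    return rows
```

Only one chunk is held in memory at a time, which is the property asked about earlier in the thread. The returned row count can be compared with a `SELECT COUNT(*)` on the source table as a basic sanity check.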

If no one is currently working on this issue, I would love to take it up. Otherwise, I'd still be willing to help if required.

henrykironde · 2019-03-20T23:27:39Z

@DumbMachine, this looks like what we need. I have not had time to get my hands on this. However, since @harshitbansal05 was working on it, you could collaborate, with the goal of getting it working in the best way possible.

DumbMachine · 2019-03-21T11:56:28Z

@harshitbansal05, let me know if I can help you in any way. In the meantime I made a rough implementation of the HDF5engine by reusing CSVengine and converting the CSV to an HDF5 file. The results can be seen here. If these are the results you want, I can make a pull request to fully implement it.

Fixes: weecology#79

henrykironde · 2021-04-14T01:27:17Z

Added in 05351dc

ethanwhite mentioned this issue Jul 21, 2013

Add indexes for large datasets #95

Open

This was referenced Mar 8, 2015

Add taxonomic name resolution to the EcoData Retriever to facilitate data science approaches to ecology numfocus/gsoc#1

Closed

Improving reproducibility in science by adding provenance tracking to the EcoData Retriever numfocus/gsoc#2

Closed

ethanwhite added this to the 2.1 milestone Nov 30, 2015

henrykironde added the getting-started label Mar 2, 2019

harshitbansal05 added a commit to harshitbansal05/retriever that referenced this issue Mar 28, 2019

Add HDF5 engine using Pandas and SQLite

7bcd347

Fixes: weecology#79

harshitbansal05 mentioned this issue Mar 28, 2019

Add HDF5 engine using Pandas and SQLite #1296

Closed

coolalexzb mentioned this issue Mar 21, 2020

Allow consuming JSON data #1334

Closed

henrykironde closed this as completed Apr 14, 2021
