Add HDF5 Engine #79
Since we have …
@henrykironde can I work on this one?

Yes, go for it.
@henrykironde I read about PyTables and, as far as I understand, I only need to use the dictionary provided by the …
That could be one way. If you install data into … If we have a data frame, we can check out how … Let me know the design you come out with before you move on to coding it out.
@henrykironde @ethanwhite, currently pandas converts its dataframes to HDF5 files using a class named …
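For reference, here is a minimal sketch of the pandas-side HDF5 round trip (this assumes the PyTables package is installed, since pandas uses it under the hood; the file name, key, and sample data are illustrative, not part of the project):

```python
import pandas as pd

# Illustrative sample data, not a real dataset from this project
df = pd.DataFrame({'site': ['A1', 'B2'], 'value': [3.1, 4.2]})

# Write the dataframe to an HDF5 file; pandas delegates to its HDFStore
# machinery, which in turn uses PyTables.
df.to_hdf('example.h5', key='mytable', mode='w')

# Read it back to confirm the round trip preserves the data
restored = pd.read_hdf('example.h5', 'mytable')
```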
Okay, let's go with that. Use pandas, and we can import the …
Does using pandas cause us any issues with out-of-memory-scale data? In other words, do we need to load all of the data into an in-RAM data frame this way?
Just throwing this out here: pandas may have something like an external data frame that reads from disk, but I am not sure.
Hey, I was wondering if this issue is solved or being worked on currently. I had a few suggestions for it.

Pandas has good support for out-of-memory-scale data: we can read data from sources like CSV in chunks. So adding the data from a large CSV to an HDF5 file becomes as simple as:

```python
import pandas as pd

store = pd.HDFStore('{}.h5'.format(DATASET_NAME))
for chunk in pd.read_csv(LARGE_CSV_FILE, chunksize=CHUNKSIZE):
    store.append(TABLE_NAME, chunk, data_columns=True, index=False)
store.close()
```

To transform data from SQLite to h5 we can do the same as above, the only difference being the API used:

```python
import sqlite3
import pandas as pd

cnx = sqlite3.connect('sqlite.db')
store = pd.HDFStore('{}.h5'.format(DATASET_NAME))
# Fill in the actual database and table names before running the query
query = "SELECT * FROM {db}_{table}".format(db=DB_NAME, table=TABLE_NAME)
for chunk in pd.read_sql_query(query, cnx, chunksize=CHUNKSIZE):
    store.append(TABLE_NAME, chunk, data_columns=True, index=False)
store.close()
```

If this issue is not currently being worked on by anyone, I would love to take it up. Otherwise, I'd still be willing to help if required.
@DumbMachine, this looks like what we need. I have not had time to get my hands on this; however, since @harshitbansal05 was working on this, you two could collaborate with the goal of getting it working in the best way possible.
@harshitbansal05, let me know if I can help you in any way. In the meantime I made a rough implementation of the HDF5 engine by reusing the CSV engine and converting the CSV to an HDF5 file. The results can be seen here. If these are the results you desire, then I can make a pull request to fully implement the same.
Added in 05351dc |
Seems like a really powerful emerging approach to storing data that lots of more cutting edge folks are using. PyTables should help.
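As a quick sketch of what using PyTables directly could look like (assuming the `tables` package is installed; the file name and the `Observation` table layout below are illustrative, not the project's actual schema):

```python
import tables

# Illustrative row description: one fixed-width string column and one float
class Observation(tables.IsDescription):
    site = tables.StringCol(16)
    value = tables.Float64Col()

# Create an HDF5 file with a single table and append one row
with tables.open_file('observations.h5', mode='w') as h5:
    table = h5.create_table('/', 'observations', Observation)
    row = table.row
    row['site'] = 'A1'
    row['value'] = 3.14
    row.append()
    table.flush()

# Read the rows back to confirm the file round-trips
with tables.open_file('observations.h5', mode='r') as h5:
    values = [r['value'] for r in h5.root.observations]
```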