Problems with vaex support for Python 3 #369

Once we installed vaex using conda, the original Python 3 was replaced by Python 2 when calling ipython. In principle vaex supports Python 3; how do we avoid this?
Comments
What I suspect is happening is: the vaex package from the maartenbreddels channel requires Python 2, so conda downgrades the whole environment to install it. Now when you execute ipython, you get Python 2 from that downgraded environment. |
Hi again, here are more details. We have installed the latest Anaconda version with Python 3 as root. When we follow your instructions (conda install -c maartenbreddels vaex), vaex is installed successfully, but it downgrades our conda Python version to 2.7.6. I can then invoke python from conda, which is the 2.7.6 version, and I can import vaex with "import vaex", but everything runs under Python 2.7.6. Can we use vaex in Python 3? Regards. P.S. Here is the message we get when we try to install vaex with conda:

```
[root@skun6 ~]# conda install -c maartenbreddels vaex

## Package Plan ##

  environment location: /usr/local/anaconda3

  added / updated specs:

The following packages will be downloaded:

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main

The following packages will be REMOVED:

  anaconda-2019.03-py37_0

The following packages will be UPDATED:

  anaconda-project   pkgs/main/linux-64::anaconda-project-~ --> pkgs/main/noarch::anaconda-project-0.8.3-py_0

The following packages will be DOWNGRADED:

  ipyw_jlab_nb_ext~  0.1.0-py37_0 --> 0.1.0-py27_0

Proceed ([y]/n)?
```
|
Hi, can you please try installing from conda-forge? `conda install -c conda-forge vaex`. Best from a clean env. Cheers, |
Thanks! Now it seems to work following your suggestions. BUT, I get this error when reading my hdf5 table:

```
In [15]: ds = vaex.open('/home/users/dae/ishiyama/Uchuu/Rockstar/007/out_7.rockstar.0.hdf5')
```
|
Hi, how did you create the hdf5 file? |
Hi, the hdf5 files have been created by our group with our own C-based code, starting from a huge ASCII file that is split and then converted to several hdf5 files. I guess we are not following a standard format? You can take a look at the code here: https://bitbucket.org/cnvega/rockstar_outputs/src/default/ Your help is very much welcome! Thanks. |
Easiest / fastest way would probably be to use vaex to read the ascii file and output a single (or multiple) hdf5 files. Then you are guaranteed compatibility. Maybe I could help with this, if you send me a couple of lines from that ascii file? In general, you can use vaex.read_csv for files like this. I hope this helps. |
Thank you! Indeed, starting from the ascii would certainly be best. It'd be great if you could help with this. Please find below the first lines of the ascii (halo catalog), which include the header and data for 4 halos:

```
#ID DescID Mvir Vmax Vrms Rvir Rs Np X Y Z VX VY VZ JX JY JZ Spin rs_klypin Mvir_all M200b M200c M500c M2500c Xoff Voff spin_bullock b_to_a c_to_a A[x] A[y] A[z] b_to_a(500c) c_to_a(500c) Ax Ay Az T/|U| M_pe_Behro
```
|
Ah, this is from ROCKSTAR, the clustering algorithm, right? It should be straightforward to read in the data then. All you need to do is this:
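The code block that originally followed this comment was not preserved; the sketch below is one plausible reconstruction, using vaex.read_csv (which forwards its keyword arguments to pandas.read_csv). The filename out_7.list and the truncated column_names list are assumptions for illustration:

```python
import vaex

# Column names from the ROCKSTAR header line shown above
# (first entries only; extend with the full header before running)
column_names = ['ID', 'DescID', 'Mvir', 'Vmax', 'Vrms', 'Rvir', 'Rs', 'Np',
                'X', 'Y', 'Z', 'VX', 'VY', 'VZ']

# Lines starting with '#' (the header) are skipped; whitespace is the separator
df = vaex.read_csv('out_7.list', sep=r'\s+', comment='#', names=column_names)
```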
where column_names is the list of column names from the header. Alternatively, you can set the header to be inferred. That requires the top non-comment line of the file to contain all column names. You can either edit the file to achieve this, or perhaps adjust the output of ROCKSTAR so that the header is a bit different. Hope this helps. Please let me know if this works. |
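For the inferred-header alternative, a minimal sketch; the filename is hypothetical, and it assumes the leading '#' has been removed from the header line so pandas can infer the column names from the first row:

```python
import vaex

# With the '#' stripped from the header line, pandas infers the
# column names from the first row (its default behaviour)
df = vaex.read_csv('out_7_edited.list', sep=r'\s+')
```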
That's right! We are using ROCKSTAR to create the halo catalogs. In this case it is for a new two-trillion-particle N-body simulation! So, ROCKSTAR provides an ASCII file for each time epoch. The ASCII file is huge; we have more than 4 billion halos! This is why we converted the ASCII file to hdf5, and also split it up to help with the file transfer. OK, good. Let me then follow your advice and use vaex.read_csv ... Thank you! |
Hi Jovan, I forgot to ask: once we read the ASCII file in vaex, how can we convert it into several hdf5 files? Thanks! |
Once you read everything in:
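The snippet that originally followed was not preserved; below is a plausible sketch using df.export_hdf5 (the method used later in this thread) plus row slicing to write several files. The number of files and the paths are illustrative:

```python
# Export everything into a single hdf5 file
df.export_hdf5('/somewhere/on/disk/file.hdf5', progress=True)

# Or split the rows over several smaller hdf5 files
n_files = 8
edges = [len(df) * i // n_files for i in range(n_files + 1)]
for i in range(n_files):
    df[edges[i]:edges[i + 1]].export_hdf5(f'/somewhere/on/disk/part_{i}.hdf5')
```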
You may want to read through the documentation on exporting as well. Cheers |
Got it, thanks! Let me work on it. Keep in touch. |
Hi, vaex read the ASCII file well and it worked fine, great! But when I want to create an hdf5 version with

```python
df.export_hdf5('/somewhere/on/disk/file.hdf5', progress=True)
```

I get this error:

```
OSError                                   Traceback (most recent call last)
~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/dataframe.py in export_hdf5(self, path, column_names, byteorder, shuffle, selection, progress, virtual, sort, ascending)
~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/export.py in export_hdf5(dataset, path, column_names, byteorder, shuffle, selection, progress, virtual, sort, ascending)
~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/hdf5/export.py in export_hdf5(dataset, path, column_names, byteorder, shuffle, selection, progress, virtual, sort, ascending)
~/.conda/envs/vaexenv/lib/python3.7/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, **kwds)
~/.conda/envs/vaexenv/lib/python3.7/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.create()

OSError: Unable to create file (unable to truncate a file which is already open)
```
|
Odd, are you maybe writing to a file you already opened? Can you change the filename? |
Yeah, I changed the filename and the error persists. Yet, I've noticed that the hdf5 file is created in the directory ... |
Hi @fprada, can you tell me which version of h5py you have installed in the same env as vaex? Can you try writing to a different directory altogether? Also, if that does not work, can you give us the output of ... Cheers |
(Oops, sorry, closed it by mistake.) On the positive side, I think I figured it out. I think some of the column names are too exotic for h5py, in particular things like 'T/|U|' and potentially 'A[x]' and 'c_to_a(500c)'. I suggest renaming the columns to contain only letters (lower or upper case), numbers, and underscores. Other characters such as [, (, \, /, ?, etc. may cause issues. I am not sure if this is due to vaex or h5py at this point. Please try using simpler column names, and exporting again. Cheers. |
@fprada The first time you run it, you see a different stacktrace than the second time (that confused me!). The first time it got confused by the exotic column names (like 'T/|U|'). The second time it complains that the file is already open (which is the stacktrace you gave); I think we can improve that as well. The workaround, for now, is what Jovan suggested:
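The workaround snippet itself was not preserved; a sketch of the idea, assuming vaex's get_column_names and rename_column methods, replacing every character h5py may choke on with an underscore:

```python
# Make column names safe for h5py: keep only letters, digits and underscores
for name in df.get_column_names():
    safe = ''.join(c if (c.isalnum() or c == '_') else '_' for c in name)
    if safe != name:
        df.rename_column(name, safe)  # e.g. 'T/|U|' -> 'T__U_'

df.export_hdf5('/somewhere/on/disk/file.hdf5', progress=True)
```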
|
Excellent, it works! Thanks very much Maarten and Jovan for your help. Now it creates the hdf5 file, and when I read it with vaex everything looks fine. Great. Now, reading a much bigger ascii file (230 GB) with vaex takes really long (still reading after 1.5 hrs; it's taking all 128 GB of RAM, running on 1 CPU). Is there a way to speed up the reading? Why does it take all that RAM? There are 661592956 rows in the original ascii file. Note that this is a file with only 1/8 of the entire Rockstar data, which contains about 5 billion rows for one redshift snapshot of the simulation :-) Let me also mention that when I exported the previous ascii file to hdf5, I noticed that its size is about the same as the ascii. Our hdf5 file created with C is about half the size. Is there a way in vaex to reduce the size (some compression?) when exporting hdf5? This is our main interest in having the data in hdf5 instead of ascii. I should mention that our interest in vaex is to provide efficient manipulation and analysis of our data for the entire astronomical community. We do plan to have a first data release soon. Thanks again for all your support! |
Hi @fprada, I am happy to hear that it works. Perhaps it is best to open another issue regarding any follow-up questions, so as to not divert this thread too much, but I will offer some advice here. To your 1st point: well, you are trying to read a 230 GB file, but you only have 128 GB of RAM, so that sets a limit on how much you can effectively read into memory at one time. Your computer is probably using the swap disk as additional RAM, but this is much slower and is best avoided if possible. How to deal with this: we will eventually provide support for converting larger-than-memory text (csv, ascii) files to hdf5 out of the box, but we are busy working on other stuff right now, so this will perhaps happen in a month or two. In the meantime you can do the following: familiarize yourself with pandas.read_csv and its chunksize argument, read the ascii file in chunks that fit in memory, and export each chunk to its own hdf5 file. Once you have the data in hdf5, regardless of the size, you can work with the entire dataset, as vaex memory-maps hdf5 files and only touches the data it actually needs. About the size of the hdf5: when the ascii data is read by Python, it is stored as 64-bit floats by default, which is likely twice the width of what your C code writes, hence the factor of two in file size. We would be very grateful if you cite/mention the use of this project if it helps you out :) |
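A sketch of the chunked-conversion route described above, assuming pandas.read_csv with chunksize, vaex.from_pandas, and vaex.open_many; the filename, chunk size, output paths, and the truncated column_names list are illustrative:

```python
import pandas as pd
import vaex

column_names = ['ID', 'DescID', 'Mvir', 'Vmax']  # extend with the full ROCKSTAR header

# Stream the huge ascii file in chunks that fit comfortably in RAM,
# exporting each chunk to its own hdf5 file
paths = []
reader = pd.read_csv('out_7.list', sep=r'\s+', comment='#',
                     names=column_names, chunksize=5_000_000)
for i, chunk in enumerate(reader):
    path = f'/somewhere/on/disk/part_{i}.hdf5'
    vaex.from_pandas(chunk).export_hdf5(path)
    paths.append(path)

# The parts then open as one memory-mapped DataFrame
df = vaex.open_many(paths)
```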
Hi, FYI: after more than 2.5 hrs the reading still hasn't finished ... still going. Thanks. |
Thanks Jovan, that's why we split the original big ascii file into several smaller hdf5 files. We did that with our own C code, but unfortunately vaex cannot recognise our format, likely because of that issue you pointed out with the names of the columns. If we can solve this, then the best would be to use vaex.open_many to read the many hdf5 files. I will take a look at pandas.read_csv ... It'd be a pleasure to acknowledge vaex. Hopefully we will make use of it once we are able to make it work for our application ;-) It is an amazing tool! Congratulations. Best. |
It might be possible to make your format compatible, but I'm not sure which is more work at this point. Thanks for your positive words, glad you find it helpful. I'll close this issue; feel free to open new ones for new issues. Cheers, Maarten |
Thank you! Thanks Maarten and Jovan. |