[BUG-REPORT] problem using data from parquet/pyarrow in correlation function #1336
Comments
Hi Laurie, I imagine that the correlation function is not built to correlate non-numeric datatypes. I'd be curious if you label encoded your strings prior to correlating, would that work? Might be good to try it on a smaller dataset to see if the result is what you're after. |
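A minimal sketch of the label-encoding idea suggested above, written in plain Python rather than Vaex or pandas. The helper names `label_encode` and `pearson` and the sample data are hypothetical. Codes are assigned in order of first appearance, which is an arbitrary choice: the coefficient depends on that ordering, so the result is only meaningful when the categories have a natural order.

```python
import math


def label_encode(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one numeric column and one string column.
numbers = [1, 2, 3, 4]
labels = ["low", "low", "high", "high"]

encoded = label_encode(labels)  # strings replaced by integer codes
r = pearson(numbers, encoded)
print(encoded, r)
```

Trying this on a small dataset first, as suggested, makes it easy to check whether the encoded result is what you are after.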
Hi @kmcentush, Thank you for your quick reply. Pandas corr supports strings, so I wondered whether Vaex intended to do the same. If not, it would be nice to have a more informative error message like "not supported", or a mention in the docs that the data must be numeric. Yes, I will try label encoding. Thank you for the suggestion. Please feel free to close if there is nothing actionable here. Thanks again, |
I'd be curious to see if label encoding in Vaex then correlating returns the same results as doing it in Pandas. Please let me know what you find! |
Hi, I don't understand: what is the mathematical background for calculating a correlation between a numeric and a non-numeric data type? Also, I think the error message is quite clear: vaex assumes that one is using numerical columns for correlation and, if not, will try to convert them to floats. In this case it fails, and the message reflects that. I also had a (very) brief look at the pdf.corr() on your example in the OP, I see that column I am happy to get corrected if anything I wrote here is wrong in any way! |
Hi @JovanVeljanoski, Thanks for your reply. I should apologize-- I was mistaken about what I said earlier. The data I used with
I should have gone back and looked more carefully at what I had done before opening an issue with you. I am sorry about this. I would still ultimately be interested in doing this in Vaex (if this is possible?) but I realize this is now a different question. I appreciate your and @kmcentush's fast responses. My apologies for providing the wrong information earlier. Thank you for Vaex. So far I am finding it very powerful. Best, |
If the values are unique, you can try vaex's If you have a sample of the input data (prior to encoding as you do w/ pandas pivot), I can take a look. I have some various pivot snippets that might work. |
Thank you. Yes, tl;dr we are tracking Python libraries used in jobs at NERSC. Each dataframe row represents one job and each column represents one library. As you might expect one job can have many libraries. Here is a snippet:
We are using these data to figure out which libraries are correlated. Thank you very much, |
Cool! I did some undergrad work at LBNL in the electrochemical space. My background is in chemical engineering. 😄 Do you have a snippet of some example rows prior to the pivoted result you're showing there? That looks like a good goal but I'm happy to help you get there from a pre-pivoted sample. |
Hey @lastephey No worries! :). I know it is a challenge to do correlation between numerical and categorical (or string) data types. Depending on the problem there are various approaches, some are as @kmcentush describes. Btw - you are doing interesting work! If you publish a paper or something like that, I'd be interested in the results :). |
Hi @kmcentush and @JovanVeljanoski, Thanks-- here are the relevant pre-pivoted columns:
This is only a small fraction but we have many rows for the same job, each corresponding to a single library. We pivot to get everything grouped together according to job. If this isn't possible in Vaex, we are also using dask-cudf so we can explore options there, too. Yes we are working on a talk/paper for SciPy 2021:
As you can see the analysis is still in progress. :) Thank you very much, |
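The pivot described above (many rows per job, one per library, grouped into one row per job) can be sketched in plain Python. This is only an illustration, not the Vaex utility code discussed in this thread: the job and library names are made up, the function name `pivot_to_indicators` is hypothetical, and recording presence (0/1) rather than a count per job is an assumption.

```python
from collections import defaultdict

# Hypothetical long-form records: one (job, library) pair per row.
rows = [
    ("job1", "numpy"),
    ("job1", "scipy"),
    ("job2", "numpy"),
    ("job3", "pandas"),
    ("job3", "numpy"),
    ("job1", "numpy"),  # the same library may appear more than once per job
]


def pivot_to_indicators(records):
    """Return (sorted library names, {job: [0/1 per library]})."""
    libs_by_job = defaultdict(set)
    for job, lib in records:
        libs_by_job[job].add(lib)
    libraries = sorted({lib for _, lib in records})
    table = {
        job: [int(lib in libs) for lib in libraries]
        for job, libs in libs_by_job.items()
    }
    return libraries, table


libraries, table = pivot_to_indicators(rows)
print(libraries)
for job, indicators in table.items():
    print(job, indicators)
```

Each row of the resulting table is one job and each column is one library, which is the shape needed for a column-by-column correlation.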
Okay, you managed to get me to dig up my vaex utility code from last year! I believe this should do the trick. It works on the latest version of vaex and was originally written for v4.0 prereleases, so there's definitely an opportunity to make things more efficient! I wasn't sure if you wanted to count ex:
|
Wow, thank you so much!! I was hoping there would be something quick that wouldn't take you much effort 😱 -- but thank you, I really appreciate it. I'll test this out and report back. Indeed, only counting |
I've tested this. It seems to work well to produce a dataframe in the form I needed (first column is I do have a few questions. I'd like to use This works for a single pair:
If I try this to use all columns I get:
My real data is shape I'm sorry if I'm just misunderstanding the API. Thank you very much, |
Can you try running that on just a subset of the data? That is, try running the correlation of all columns but not all of your rows. If that still happens even with just a few rows, it's most likely some (unintentional) evaluation limit that you're butting up against. Does your stacktrace show which call is actually recursive? The correlation function uses a lot of other functions that may actually be the offenders. Regardless, any sort of test case you can share will help. :) |
Hi @kmcentush, I tried with 10 rows and still hit the limit:
The stacktrace is quite long. Here is just a subset of what I hope is important:
Are you able to reproduce on your end? Would you like me to send you a file/dataframe with 10 rows to test? I am happy to close this issue as we're now very far off topic and open a new one to capture the recursion problem if it's helpful. Please let me know. Thank you again for your help, |
Hi Laurie, That subset looks like it shows the issue! In the I find it interesting that this would be caused by the |
Hi Regarding the usage of correlation (same goes for the mutual_information method): it is not a bug, it is a feature :) This is an older part of vaex that has not been updated in a long time, and the API is a bit different. You can use it in two ways: one as @lastephey found out, by passing a single If you want to calculate more correlations at once, you need to pass a list of tuples to the
# assume the dataframe has x, y, z columns:
df.correlation(x=[('x', 'y'), ('x', 'z'), ('y', 'z')])
@kmcentush don't bother with this, it touches a very, very old part of vaex that might be tricky to understand. I think @maartenbreddels has basically updated this to a more modern API, but it lives in a branch somewhere and needs a final bit of ironing out before it is merged. I expect it to happen soon-ish. |
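For a dataframe with many columns, the list of tuples shown in the comment above does not have to be typed by hand. A small sketch using the standard library follows; the column names are hypothetical, and the final call is left as a comment because it only mirrors the form given in this thread and has not been run here.

```python
from itertools import combinations

# Hypothetical column names; with a real dataframe these would come from
# the columns you want to correlate.
columns = ["x", "y", "z"]

# Every unordered pair of distinct columns, in the order they are listed.
pairs = list(combinations(columns, 2))
print(pairs)

# With a Vaex dataframe `df`, the pairs would then be passed in the form
# shown in the comment above:
# df.correlation(x=pairs)
```

Note that n columns produce n * (n - 1) / 2 pairs, so the list grows quickly for wide dataframes.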
Hi, I just wanted to let you all know that our SciPy paper is published here. We acknowledged you (the Vaex developers)-- thank you for your help in this effort. You are free to close this issue if it's not useful for you. We were able to perform the calculation we needed in Dask-cuDF. Thank you again, |
Thanks for all your help. Closing in the spirit of Closember. |
Dear Vaex developers,
Description
I would like to use the Vaex correlation function to calculate the correlation coefficient between two columns of a dataframe (one of type int64, the other of type str). I am loading the dataframe from a parquet file generated using the pyarrow engine. It seems that correlation may not work correctly with data in this format.
Here is a reproducer:
This fails with a traceback that ends in ValueError: could not convert string to float: 'hello'.
I thought I might need to explicitly request .values:
This fails with a traceback that ends in
Software information
Apologies if I am using Vaex or this function incorrectly.
Thank you very much,
Laurie