Call community detector lib with parquet data #90
Conversation
Needed by the community detector lib.
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
LGTM!
Overall looks good, but I think it's OK to have another file for now with an appropriate data structure. Let me try it locally first and get back to you here.
@carlos every time I run it locally I get an error. Could you look into that and let me know what I'm doing wrong to trigger it?
What does … from the PR description mean? Does it refer to writing a single parquet file?
On a fresh virtualenv, installing the dependencies fails for me. Am I doing something wrong, or does Cython need to be added to requirements.txt?
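If Cython really is needed at build time (e.g. when no pyarrow wheel is available for the platform), one possible workaround is to list it before pyarrow in requirements.txt, or to install it in a separate pip invocation first. A hypothetical snippet, not this repo's actual file:

```
Cython   # hypothetical: only needed when pyarrow has to be built from source
pyarrow
```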
LGTM!
Please feel free to merge as soon as the 3 minor issues from the comments above are addressed here.
@bzz:
I cannot reproduce this message. What is your Scala version? Could it be related to the size of the data? What DB are you using to test it?
This refers to saving the data produced by the Python script.
It works for me on Linux with Python 3.6.5. Is that your setup? If that's the problem, since the stable version is 3.6, I think it's reasonable to ignore this for now.
It should be fixed only in hadoop-common 2.9, and that may not be easy to upgrade, since it has to be aligned with the Spark and Cassandra connector versions. Anyway, it should be handled outside of this PR.
Scala 2.12.2
The same dump that was used for the demo.
Description updated; I'll look into the other issues later.
But I think we might want to explore using a native Python parquet implementation instead of pyArrow. 👍 Let's update the deps so …
That sounds good to me! Feel free to push a commit with the new requirements.txt, or ping me with the contents that work on macOS.
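For context, a minimal sketch of what swapping the reader could look like. fastparquet is just one candidate for a "native Python" implementation, and the file name is made up:

```python
import pyarrow.parquet as pq
from fastparquet import ParquetFile

# Current approach: pyarrow, which wraps the Arrow C++ libraries.
df_arrow = pq.read_table("cc.parquet").to_pandas()

# Possible alternative: fastparquet, implemented in Python (with numba).
df_native = ParquetFile("cc.parquet").to_pandas()
```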
Could you please update the PR description?
It's not clear from the PR description, but at the current state results are just printed to STDOUT.
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
New requirements.txt pushed. I actually followed the guide you posted, but for me …
LGTM, let's merge it!
Partially implements #60.

This PR writes extra data in `report`, and adds a new `report.py` command that reads parquet files and calls the community detector Python lib.

Next steps, to be done in future PRs:

- Make `report.py` output its data as a parquet file (see the sketch after this list).
- Call the `report.py` command, wait for it to finish, and read the new parquet data.
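For the first item above, a rough sketch of what writing that output could look like; the `communities` dict and the column names are illustrative assumptions, not the actual `report.py` schema:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical detector output: element id -> community id.
communities = {"element-a": 0, "element-b": 0, "element-c": 1}

df = pd.DataFrame(
    {"element_id": list(communities), "community": list(communities.values())}
)

# A single parquet file that the Scala side can later read back.
pq.write_table(pa.Table.from_pandas(df), "communities.parquet")
```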
Notes:

If you look at `report.py`, it would make sense to write a single parquet file with the columns `element_id, cc, buckets`. But as I put in a TODO, the community detector lib actually uses `cc -> element_ids` internally. So the current `cc.parquet` could be read and used, skipping `build_id_to_cc`. Left as a TODO to avoid changing too much until we have all the parts working together. A sketch of that idea follows.
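A rough sketch of that note, assuming the column names `element_id` and `cc` (the real schema lives in `cc.parquet`):

```python
from collections import defaultdict

import pyarrow.parquet as pq

# Read the connected-components file produced by the report command.
df = pq.read_table("cc.parquet").to_pandas()

# Build cc -> element_ids directly, the shape the community detector
# lib uses internally, so build_id_to_cc could be skipped.
cc_to_elements = defaultdict(list)
for element_id, cc in zip(df["element_id"], df["cc"]):
    cc_to_elements[cc].append(element_id)
```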