
Call community detector lib with parquet data #90

Merged (3 commits) on Apr 19, 2018

Conversation

@carlosms (Contributor) commented Apr 12, 2018

Partially implements #60.

This PR writes extra data in the report, and adds a new report.py script that reads the parquet files and calls the community detector Python lib.

$ ./report --keyspace apollo -o src/main/python/community-detector/parquets/
$ python src/main/python/community-detector/report.py src/main/python/community-detector/parquets

Next steps, to be done in future PRs:

  • save the report.py output data as a parquet file.
  • make the Scala report app call the report.py command, wait for it to finish, and read the new parquet data.
  • use this data (possibly calling the DB to get extra info) in the Scala report output.

Notes:
If you look at report.py, it would make sense to write a single parquet file with the columns element_id, cc, buckets. But, as I noted in a TODO, the community detector lib internally uses a cc -> element_ids mapping, so the current cc.parquet could be read and used directly, skipping build_id_to_cc. Left as a TODO to avoid changing too much until we have all the parts working together.
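For illustration, a minimal sketch of that inversion, assuming cc.parquet stores one row per element with hypothetical columns element_id and cc (the real schema may differ):

# Sketch only: the column names `element_id` and `cc` are assumptions,
# not necessarily what cc.parquet actually contains.
from collections import defaultdict

import pyarrow.parquet as pq


def cc_to_element_ids(path):
    # Invert the per-element table into the cc -> element_ids
    # mapping that the community detector lib uses internally.
    columns = pq.read_table(path).to_pydict()
    mapping = defaultdict(list)
    for element_id, cc in zip(columns["element_id"], columns["cc"]):
        mapping[cc].append(element_id)
    return dict(mapping)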

Needed by the community detector lib

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
@carlosms requested review from smacker and bzz on Apr 12, 2018 at 11:00
@smacker (Contributor) left a comment

LGTM!

@bzz (Contributor) commented Apr 13, 2018

Overall looks good, but I think it's OK to have another file for now with the appropriate data structure.

But let me try it locally first and get back to you here.

@bzz mentioned this pull request on Apr 17, 2018
@bzz (Contributor) commented Apr 17, 2018

@carlosms every time I run it locally, I get:

  1. A strange error
$ ./report --keyspace apollo -o apollo_dump_17.04/demo/cc/

[info] Running tech.sourced.gemini.ReportApp --keyspace apollo -o apollo_dump_17.04/demo/cc/
No duplicates found.
 WARN 00:09:24 org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner (FileSystem.java:2995) - exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
	at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
	at java.lang.Thread.run(Thread.java:748)
[success] Total time: 13 s, completed Apr 18, 2018 12:09:24 AM

buckets.parquet and cc.parquet are created though.

Could you look into that and let me know what I'm doing wrong to trigger this error?

  2. A quick question

Also, as 2 parquet files were created, could you please help me understand what

save the output data as a parquet file.

from the PR description means? Does it refer to writing a single parquet file?

  3. On using the community detector lib

On a fresh virtualenv, installing the dependencies fails for me:

$ pip3 install -r src/main/python/community-detector/requirements.txt

Collecting pyarrow==0.9.0 (from -r src/main/python/community-detector/requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/be/2d/11751c477e4e7f4bb07ac7584aafabe0d0608c170e4bff67246d695ebdbe/pyarrow-0.9.0.tar.gz (8.5MB)
    100% |████████████████████████████████| 8.5MB 887kB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/rx/z9zyr71d70x92zwbn3rrjx4c0000gn/T/pip-install-1zra07c5/pyarrow/setup.py", line 29, in <module>
        from Cython.Distutils import build_ext as _build_ext
    ModuleNotFoundError: No module named 'Cython'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/rx/z9zyr71d70x92zwbn3rrjx4c0000gn/T/pip-install-1zra07c5/pyarrow/

Am I doing something wrong, or does Cython need to be added to requirements.txt?

@bzz (Contributor) left a comment

LGTM, please feel free to merge as soon as the 3 minor issues from the comment above are addressed.

@carlosms (Contributor, Author) commented

@bzz:

  1. A strange error

I cannot reproduce this message. What is your Scala version? Could it be related to the size of the data? What DB are you using to test it?

  2. A quick question

save the output data as a parquet file

This refers to saving the data produced by the Python script report.py to a new (third) parquet file. This file will be read from ReportApp.scala.
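For illustration, a minimal sketch of writing such a third parquet file with pyarrow; the column names and the output file name here are hypothetical, not the final schema:

# Sketch only: the `element_id`/`community` columns and the file name
# are assumptions; the real schema is left for the follow-up PR.
import pyarrow as pa
import pyarrow.parquet as pq


def write_communities(element_ids, community_ids, path="communities.parquet"):
    # Persist the community assignments so ReportApp.scala can read them back.
    table = pa.Table.from_pydict({
        "element_id": element_ids,
        "community": community_ids,
    })
    pq.write_table(table, path)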

  3. On using the community detector lib

It works for me on Linux with Python 3.6.5.
I think you may be using Python 3.7, where pip install does not work as it should. See:
https://issues.apache.org/jira/browse/ARROW-1661
apache/arrow#1125

Is this the case? If that's the problem, since the stable version is 3.6, I think it's reasonable to ignore this for now.

@bzz (Contributor) commented Apr 19, 2018

  1. Could that be somehow related to HADOOP-12829?

It is fixed only in hadoop-common 2.9, which may not be easy to upgrade to - it would need to be aligned with the Spark and Cassandra connector versions. Anyway, it should be handled outside of this PR.

What is your Scala version?

2.12.2

What DB are you using to test it?

The same dump that was used for the demo.

  2. Sounds good! Shall we update the PR description with this information, to clarify the confusion?
  3. Using Python 3.6.3 & virtualenv

@carlosms (Contributor, Author) commented

Description updated, I'll look into the other issues later.

@bzz (Contributor) commented Apr 19, 2018

  3. It can be worked around with pip3 install Cython, but then building the native dependencies for arrow fails on macOS:
CMake Error at cmake_modules/FindArrow.cmake:130 (message):
    Could not find the Arrow library.  Looked for headers in , and for libs in
  Call Stack (most recent call first):
    CMakeLists.txt:197 (find_package)

  error: command 'cmake' failed with exit status 1

But pip3 install pyarrow works and brings in 0.9.0.post1. Can we update the requirement so that 0.9.0.post1 satisfies it?

I think we might want to explore using a native Python parquet implementation instead of pyarrow.

👍 let's update the deps so 0.9.0.post1 works and merge this for now, and move feedback item 1 to a separate issue.

@carlosms (Contributor, Author) commented

That sounds good to me! Feel free to push a commit with the new requirements.txt, or ping me with the contents that work on macOS.

@bzz (Contributor) commented Apr 19, 2018

Feel free to push a commit with the new requirements.txt

Could you please update the requirements.txt following https://pip.pypa.io/en/stable/user_guide/ so that the version posted above satisfies it? I guess pip install 'pyarrow>=0.9.0' should generate an appropriate one, like pyarrow>=0.9.0.

@bzz mentioned this pull request on Apr 19, 2018
@bzz (Contributor) commented Apr 19, 2018

It's not clear from the PR description, but in the current state the results are just printed to STDOUT:

$ python3 src/main/python/community-detector/report.py apollo_dump_17.04/demo/cc

{'data': array([   9,   10,   31,   32,   59,   60,   64,   65,   70,   71,  193,
        194,  275,  276,  350,  351,  371,  372,   67, 1431,  919, 2968,
       2597, 3374,  711, 3783, 1632, 2041,  291, 3082, 1435, 2605, 1600,
       2114, 3779, 3410,  726,  860,  292, 2466,  547, 1318,  837, 1742,
       3685, 2284, 2797, 3439,  293, 3728,  406, 3095, 2297, 2490, 1093,
       3164, 1387, 1646,   68, 3729, 2600, 2231, 3020, 1870, 3286, 1373,
        997,  765,  295, 2690, 3858, 1303,  568, 2107, 3136,  841, 2770,
       1778,  296,  775, 1677, 2835, 3352, 2206, 1097, 3669, 1501, 2525],
      dtype=uint32), 'indptr': array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 28, 38, 48, 58, 68, 78, 88])}
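
For the record, an inference from the printed dict rather than something stated in the thread: this looks like a CSR-style layout, where community i owns the slice data[indptr[i]:indptr[i+1]]. A minimal sketch of decoding it:

import numpy as np


def communities_from_csr(output):
    # Split the flat `data` array into one list per community,
    # using consecutive `indptr` entries as slice boundaries.
    data, indptr = output["data"], output["indptr"]
    return [data[start:end].tolist()
            for start, end in zip(indptr[:-1], indptr[1:])]

# With the output above, the first two communities would be [9, 10] and [31, 32]:
example = {"data": np.array([9, 10, 31, 32], dtype=np.uint32),
           "indptr": np.array([0, 2, 4])}
print(communities_from_csr(example))  # [[9, 10], [31, 32]]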

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
@carlosms (Contributor, Author) commented

New requirements.txt pushed. I actually followed the guide you posted, but for me pip freeze always outputs exact versions, not >=. Is there a flag I might be missing?

@bzz (Contributor) commented Apr 19, 2018

LGTM, let's merge it!
