
Call community detector lib with parquet data #90

Merged (3 commits) on Apr 19, 2018

Conversation

@carlosms (Contributor) commented Apr 12, 2018

Partially implements #60.

This PR writes extra data in the report, and adds a new report.py script that reads the parquet files and calls the community detector Python lib.

$ ./report --keyspace apollo -o src/main/python/community-detector/parquets/
$ python src/main/python/community-detector/report.py src/main/python/community-detector/parquets

Next steps, to be done in future PRs:

  • save the report.py output data as a parquet file.
  • make the Scala report app call the report.py command, wait for it to finish, and read the new parquet data.
  • use this data (possibly calling the DB to get extra info) in the Scala report output.

Notes:
If you look at report.py, it would make sense to write a single parquet file with the columns element_id, cc, buckets. But, as I noted in a TODO, the community detector lib internally uses a cc -> element_ids mapping, so the current cc.parquet could be read and used directly, skipping build_id_to_cc. Left as a TODO to avoid changing too much until we have all the parts working together.
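For illustration, a minimal sketch of that inversion, assuming cc.parquet stores one row per element with hypothetical columns element_id and cc (the real schema may differ):

# Sketch only: the column names `element_id` and `cc` are assumptions,
# not necessarily what cc.parquet actually contains.
from collections import defaultdict

import pyarrow.parquet as pq


def cc_to_element_ids(path):
    # Invert the per-element table into the cc -> element_ids
    # mapping that the community detector lib uses internally.
    columns = pq.read_table(path).to_pydict()
    mapping = defaultdict(list)
    for element_id, cc in zip(columns["element_id"], columns["cc"]):
        mapping[cc].append(element_id)
    return dict(mapping)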

Needed by the community detector lib

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
@carlosms requested review from smacker and bzz on Apr 12, 2018 at 11:00
@smacker (Contributor) left a comment

LGTM!

@bzz (Contributor) commented Apr 13, 2018

Overall looks good, but I think it's OK to have another file for now with the appropriate data structure.

But let me try it locally first and get back to you here.

@bzz mentioned this pull request on Apr 17, 2018
@bzz (Contributor) commented Apr 17, 2018

@carlosms every time I run it locally, I get:

  1. A strange error
$ ./report --keyspace apollo -o apollo_dump_17.04/demo/cc/

[info] Running tech.sourced.gemini.ReportApp --keyspace apollo -o apollo_dump_17.04/demo/cc/
No duplicates found.
 WARN 00:09:24 org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner (FileSystem.java:2995) - exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
	at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
	at java.lang.Thread.run(Thread.java:748)
[success] Total time: 13 s, completed Apr 18, 2018 12:09:24 AM

buckets.parquet and cc.parquet are created though.

Could you look into that and let me know what I'm doing wrong to trigger this error?

  2. A quick question

Also, as 2 parquet files were created, could you please help me understand what

save the output data as a parquet file.

from the PR description means? Does it refer to writing a single parquet file?

  3. On using the community detector lib

On a fresh virtualenv, installing the dependencies fails for me:

$ pip3 install -r src/main/python/community-detector/requirements.txt

Collecting pyarrow==0.9.0 (from -r src/main/python/community-detector/requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/be/2d/11751c477e4e7f4bb07ac7584aafabe0d0608c170e4bff67246d695ebdbe/pyarrow-0.9.0.tar.gz (8.5MB)
    100% |████████████████████████████████| 8.5MB 887kB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/rx/z9zyr71d70x92zwbn3rrjx4c0000gn/T/pip-install-1zra07c5/pyarrow/setup.py", line 29, in <module>
        from Cython.Distutils import build_ext as _build_ext
    ModuleNotFoundError: No module named 'Cython'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/rx/z9zyr71d70x92zwbn3rrjx4c0000gn/T/pip-install-1zra07c5/pyarrow/

Am I doing something wrong, or does Cython need to be added to requirements.txt?

@bzz (Contributor) left a comment

LGTM, please feel free to merge as soon as the 3 minor issues from the comment above are addressed.

@carlosms (Contributor, Author) commented

@bzz:

  1. A strange error

I cannot reproduce this message. What is your Scala version? Could it be related to the size of the data? What DB are you using to test it?

  2. A quick question

save the output data as a parquet file

This refers to saving the data produced by the Python script report.py to a new (third) parquet file. This file will be read from ReportApp.scala.
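For illustration, a minimal sketch of writing such a third parquet file with pyarrow; the column names and the output file name here are hypothetical, not the final schema:

# Sketch only: the `element_id`/`community` columns and the file name
# are assumptions; the real schema is left for the follow-up PR.
import pyarrow as pa
import pyarrow.parquet as pq


def write_communities(element_ids, community_ids, path="communities.parquet"):
    # Persist the community assignments so ReportApp.scala can read them back.
    table = pa.Table.from_pydict({
        "element_id": element_ids,
        "community": community_ids,
    })
    pq.write_table(table, path)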

  3. On using the community detector lib

It works for me on Linux with Python 3.6.5.
I think you may be using Python 3.7, where pip install does not work as it should. See:
https://issues.apache.org/jira/browse/ARROW-1661
apache/arrow#1125

Is this the case? If that's the problem, since the stable version is 3.6, I think it's reasonable to ignore this for now.

@bzz (Contributor) commented Apr 19, 2018

  1. Could that be somehow related to HADOOP-12829?

It is fixed only in hadoop-common 2.9, which may not be easy to upgrade to - it would need to be aligned with the Spark and Cassandra connector versions. Anyway, it should be handled outside of this PR.

What is your Scala version?

2.12.2

What DB are you using to test it?

The same dump that was used for the demo.

  2. Sounds good! Shall we update the PR description with this information, to clarify the confusion?
  3. Using Python 3.6.3 & virtualenv

@carlosms (Contributor, Author) commented

Description updated, I'll look into the other issues later.

@bzz (Contributor) commented Apr 19, 2018

  3. It can be worked around with pip3 install Cython, but then building the native dependencies for arrow fails on macOS:
CMake Error at cmake_modules/FindArrow.cmake:130 (message):
    Could not find the Arrow library.  Looked for headers in , and for libs in
  Call Stack (most recent call first):
    CMakeLists.txt:197 (find_package)

  error: command 'cmake' failed with exit status 1

But pip3 install pyarrow works and brings in 0.9.0.post1. Can we update the requirement so that 0.9.0.post1 satisfies it?

I think we might want to explore using a native Python parquet implementation instead of pyarrow.

👍 let's update the deps so 0.9.0.post1 works and merge this for now, and move feedback item 1 to a separate issue.

@carlosms (Contributor, Author) commented

That sounds good to me! Feel free to push a commit with the new requirements.txt, or ping me with the contents that work on macOS.

@bzz (Contributor) commented Apr 19, 2018

Feel free to push a commit with the new requirements.txt

Could you please update the requirements.txt following https://pip.pypa.io/en/stable/user_guide/ so that the version posted above satisfies it? I guess pip install 'pyarrow>=0.9.0' should generate an appropriate one, like pyarrow>=0.9.0.

@bzz mentioned this pull request on Apr 19, 2018
@bzz (Contributor) commented Apr 19, 2018

It's not clear from the PR description, but in the current state the results are just printed to STDOUT:

$ python3 src/main/python/community-detector/report.py apollo_dump_17.04/demo/cc

{'data': array([   9,   10,   31,   32,   59,   60,   64,   65,   70,   71,  193,
        194,  275,  276,  350,  351,  371,  372,   67, 1431,  919, 2968,
       2597, 3374,  711, 3783, 1632, 2041,  291, 3082, 1435, 2605, 1600,
       2114, 3779, 3410,  726,  860,  292, 2466,  547, 1318,  837, 1742,
       3685, 2284, 2797, 3439,  293, 3728,  406, 3095, 2297, 2490, 1093,
       3164, 1387, 1646,   68, 3729, 2600, 2231, 3020, 1870, 3286, 1373,
        997,  765,  295, 2690, 3858, 1303,  568, 2107, 3136,  841, 2770,
       1778,  296,  775, 1677, 2835, 3352, 2206, 1097, 3669, 1501, 2525],
      dtype=uint32), 'indptr': array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 28, 38, 48, 58, 68, 78, 88])}
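
For the record, an inference from the printed dict rather than something stated in the thread: this looks like a CSR-style layout, where community i owns the slice data[indptr[i]:indptr[i+1]]. A minimal sketch of decoding it:

import numpy as np


def communities_from_csr(output):
    # Split the flat `data` array into one list per community,
    # using consecutive `indptr` entries as slice boundaries.
    data, indptr = output["data"], output["indptr"]
    return [data[start:end].tolist()
            for start, end in zip(indptr[:-1], indptr[1:])]

# With the output above, the first two communities would be [9, 10] and [31, 32]:
example = {"data": np.array([9, 10, 31, 32], dtype=np.uint32),
           "indptr": np.array([0, 2, 4])}
print(communities_from_csr(example))  # [[9, 10], [31, 32]]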

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
@carlosms (Contributor, Author) commented

New requirements.txt pushed. I actually followed the guide you posted, but for me pip freeze always outputs exact versions, not >=. Is there a flag I might be missing?

@bzz (Contributor) commented Apr 19, 2018

LGTM, let's merge it!
