[WIP] make tmsc work with git-based modelforge and sourced.ml #12

Open
wants to merge 4 commits into
base: master

@bzz bzz commented Aug 23, 2018

This is a one-Sunday-afternoon attempt to make tmsc great again.

It's WIP, as the usage of the BOW model from modelforge should be removed, per the discussion in src-d/models#11.

Early feedback is warmly appreciated though; it will help make this ready to merge at some point.

The current version is able to run and produce results:

$ python3 -m tmsc https://github.com/apache/spark

                Parallel and distributed processing - General IT	4.49
                Machine Learning, sklearn-like APIs - General IT	3.88
               Java/JS + async + JSON serialization - General IT	3.77
                            Cryptography: libraries - General IT	3.23
                        SQL, working with databases - General IT	3.18
                Java string input/output - Programming languages	3.16
                          Java: Spring, Hibernate - Technologies	3.11
                              Operations on numbers - General IT	3.02
                               Distributed clusters - General IT	2.69
           Functional programming, Scala - Programming languages	2.64

Full log

$ python3 -m tmsc https://github.com/apache/spark

/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: RuntimeWarning: 'tmsc.__main__' found in sys.modules after import of package 'tmsc', but prior to execution of 'tmsc.__main__'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
INFO:GitIndex:Index is cached
INFO:topics:Reading /Users/alex/.source{d}/topics/default.asdf...
INFO:docfreq:Reading /Users/alex/.source{d}/docfreq/default.asdf...
INFO:docfreq:Building the docfreq dictionary...
INFO:topic_detector:Loaded topics model: {'created_at': datetime.datetime(2017, 9, 18, 12, 27, 56, 74233),
 'dependencies': [{'created_at': datetime.datetime(2017, 6, 19, 9, 59, 14, 766638),
                   'dependencies': [],
                   'model': 'docfreq',
                   'uuid': 'f64bacd4-67fb-4c64-8382-399a8e7db52a',
                   'version': [1, 0, 0]}],
 'model': 'topics',
 'uuid': 'c70a7514-9257-4b33-b468-27a8588d4dfa',
 'version': [0, 3, 0]}
320 topics, 2015336 tokens
First 10 tokens: ['ulcancel', 'domainlin', 'trudi', 'fncreateinstancedbaselin', 'wbnz', 'lmultiplicand', 'otronumero', 'qxln', 'gvgq', 'polaroidish']
Topics: labeled, first 10: ['Zend framework, Magento - Technologies', 'AngularJS, promises - Technologies', 'Drupal - Technologies', 'HTML DOM - General IT', 'Cryptography: ciphers and certificates - General IT', 'HTML tags - General IT', 'Countries, Moodle - Technologies', '3D modelling and rendering, WebGL - Technologies', 'POSIX terminal interface, serial interface, image capture - General IT', 'Popular Wordpress plugins - Technologies']
non-zero elements: 16892389  (0.026194)
INFO:docfreq:Pruning to min 20 occurrences
INFO:docfreq:Size: 5720096 -> 416370
INFO:topic_detector:Loaded docfreq model: {'created_at': datetime.datetime(2017, 6, 19, 9, 59, 14, 766638),
 'dependencies': [],
 'model': 'docfreq',
 'uuid': 'f64bacd4-67fb-4c64-8382-399a8e7db52a',
 'version': [1, 0, 0]}
Number of words: 416370
Random 10 words: {'aaa': 6322, 'aaaa': 2676, 'aaaaa': 861, 'aaaaaa': 1163, 'aaaaaaa': 341, 'aaaaaaaa': 156, 'aaaaaaaaa': 119, 'aaaaaaaaaa': 189, 'aaaaaaaaaaa': 30, 'aaaaaaaaaaaa': 90}
Number of documents: 112273
INFO:bblfsh:Detected bblfsh server: 0.0.0.0:9432
WARNING:topic_detector:No BOW cache was loaded.
INFO:repo_cloner:Cloning from https://github.com/apache/spark...
INFO:repo_cloner:Finished cloning https://github.com/apache/spark
INFO:repo_cloner:Classifying the files...
INFO:repo_cloner:Result: {'ANTLR': 1, 'Batchfile': 19, 'C': 1, 'CSS': 6, 'CSV': 20, 'Csound': 3, 'Dockerfile': 3, 'HTML': 4, 'Java': 760, 'JavaScript': 16, 'Makefile': 2, 'Markdown': 4, 'PLSQL': 23, 'PLpgSQL': 62, 'PowerShell': 1, 'Python': 126, 'R': 68, 'RMarkdown': 1, 'Roff': 4, 'SQL': 158, 'SQLPL': 52, 'Scala': 2803, 'Shell': 77, 'Text': 260, 'Thrift': 2, 'reStructuredText': 6}
INFO:repo2bow:Fetching and processing UASTs...
INFO:repo2bow:https://github.com/apache/spark pending tasks: 880
INFO:repo2bow:https://github.com/apache/spark pending tasks: 872
INFO:repo2bow:https://github.com/apache/spark pending tasks: 864
INFO:repo2bow:https://github.com/apache/spark pending tasks: 856
INFO:repo2bow:https://github.com/apache/spark pending tasks: 848
INFO:repo2bow:https://github.com/apache/spark pending tasks: 840
INFO:repo2bow:https://github.com/apache/spark pending tasks: 832
INFO:repo2bow:https://github.com/apache/spark pending tasks: 824
INFO:repo2bow:https://github.com/apache/spark pending tasks: 816
INFO:repo2bow:https://github.com/apache/spark pending tasks: 808
INFO:repo2bow:https://github.com/apache/spark pending tasks: 800
INFO:repo2bow:https://github.com/apache/spark pending tasks: 792
INFO:repo2bow:https://github.com/apache/spark pending tasks: 784
INFO:repo2bow:https://github.com/apache/spark pending tasks: 776
INFO:repo2bow:https://github.com/apache/spark pending tasks: 768
INFO:repo2bow:https://github.com/apache/spark pending tasks: 760
INFO:repo2bow:https://github.com/apache/spark pending tasks: 752
INFO:repo2bow:https://github.com/apache/spark pending tasks: 744
INFO:repo2bow:https://github.com/apache/spark pending tasks: 736
INFO:repo2bow:https://github.com/apache/spark pending tasks: 728
INFO:repo2bow:https://github.com/apache/spark pending tasks: 720
INFO:repo2bow:https://github.com/apache/spark pending tasks: 712
INFO:repo2bow:https://github.com/apache/spark pending tasks: 704
INFO:repo2bow:https://github.com/apache/spark pending tasks: 696
INFO:repo2bow:https://github.com/apache/spark pending tasks: 688
INFO:repo2bow:https://github.com/apache/spark pending tasks: 680
INFO:repo2bow:https://github.com/apache/spark pending tasks: 672
INFO:repo2bow:https://github.com/apache/spark pending tasks: 664
INFO:repo2bow:https://github.com/apache/spark pending tasks: 656
INFO:repo2bow:https://github.com/apache/spark pending tasks: 648
INFO:repo2bow:https://github.com/apache/spark pending tasks: 640
INFO:repo2bow:https://github.com/apache/spark pending tasks: 632
INFO:repo2bow:https://github.com/apache/spark pending tasks: 624
INFO:repo2bow:https://github.com/apache/spark pending tasks: 616
INFO:repo2bow:https://github.com/apache/spark pending tasks: 608
INFO:repo2bow:https://github.com/apache/spark pending tasks: 600
INFO:repo2bow:https://github.com/apache/spark pending tasks: 592
INFO:repo2bow:https://github.com/apache/spark pending tasks: 584
INFO:repo2bow:https://github.com/apache/spark pending tasks: 576
INFO:repo2bow:https://github.com/apache/spark pending tasks: 568
INFO:repo2bow:https://github.com/apache/spark pending tasks: 560
INFO:repo2bow:https://github.com/apache/spark pending tasks: 552
INFO:repo2bow:https://github.com/apache/spark pending tasks: 544
INFO:repo2bow:https://github.com/apache/spark pending tasks: 536
INFO:repo2bow:https://github.com/apache/spark pending tasks: 528
INFO:repo2bow:https://github.com/apache/spark pending tasks: 520
INFO:repo2bow:https://github.com/apache/spark pending tasks: 512
INFO:repo2bow:https://github.com/apache/spark pending tasks: 504
INFO:repo2bow:https://github.com/apache/spark pending tasks: 496
INFO:repo2bow:https://github.com/apache/spark pending tasks: 488
INFO:repo2bow:https://github.com/apache/spark pending tasks: 480
INFO:repo2bow:https://github.com/apache/spark pending tasks: 472
INFO:repo2bow:https://github.com/apache/spark pending tasks: 464
INFO:repo2bow:https://github.com/apache/spark pending tasks: 456
INFO:repo2bow:https://github.com/apache/spark pending tasks: 448
INFO:repo2bow:https://github.com/apache/spark pending tasks: 440
INFO:repo2bow:https://github.com/apache/spark pending tasks: 432
INFO:repo2bow:https://github.com/apache/spark pending tasks: 424
INFO:repo2bow:https://github.com/apache/spark pending tasks: 416
INFO:repo2bow:https://github.com/apache/spark pending tasks: 408
INFO:repo2bow:https://github.com/apache/spark pending tasks: 400
INFO:repo2bow:https://github.com/apache/spark pending tasks: 392
INFO:repo2bow:https://github.com/apache/spark pending tasks: 384
INFO:repo2bow:https://github.com/apache/spark pending tasks: 376
INFO:repo2bow:https://github.com/apache/spark pending tasks: 368
INFO:repo2bow:https://github.com/apache/spark pending tasks: 360
INFO:repo2bow:https://github.com/apache/spark pending tasks: 352
INFO:repo2bow:https://github.com/apache/spark pending tasks: 344
INFO:repo2bow:https://github.com/apache/spark pending tasks: 336
INFO:repo2bow:https://github.com/apache/spark pending tasks: 328
INFO:repo2bow:https://github.com/apache/spark pending tasks: 320
INFO:repo2bow:https://github.com/apache/spark pending tasks: 312
WARNING:repo2bow:/var/folders/rx/z9zyr71d70x92zwbn3rrjx4c0000gn/T/repo2-0hkgj3su/apache&spark_github.com/sql/hive-thriftserver/src/gen/java/org/apache/hive/service/cli/thrift/TCLIService.java was skipped: it is too big - 516093 bytes
INFO:repo2bow:https://github.com/apache/spark pending tasks: 304
INFO:repo2bow:https://github.com/apache/spark pending tasks: 296
INFO:repo2bow:https://github.com/apache/spark pending tasks: 288
INFO:repo2bow:https://github.com/apache/spark pending tasks: 280
INFO:repo2bow:https://github.com/apache/spark pending tasks: 272
INFO:repo2bow:https://github.com/apache/spark pending tasks: 264
INFO:repo2bow:https://github.com/apache/spark pending tasks: 256
INFO:repo2bow:https://github.com/apache/spark pending tasks: 248
INFO:repo2bow:https://github.com/apache/spark pending tasks: 240
INFO:repo2bow:https://github.com/apache/spark pending tasks: 232
INFO:repo2bow:https://github.com/apache/spark pending tasks: 224
INFO:repo2bow:https://github.com/apache/spark pending tasks: 216
INFO:repo2bow:https://github.com/apache/spark pending tasks: 208
INFO:repo2bow:https://github.com/apache/spark pending tasks: 200
INFO:repo2bow:https://github.com/apache/spark pending tasks: 192
INFO:repo2bow:https://github.com/apache/spark pending tasks: 184
INFO:repo2bow:https://github.com/apache/spark pending tasks: 176
INFO:repo2bow:https://github.com/apache/spark pending tasks: 168
INFO:repo2bow:https://github.com/apache/spark pending tasks: 160
INFO:repo2bow:https://github.com/apache/spark pending tasks: 152
INFO:repo2bow:https://github.com/apache/spark pending tasks: 144
INFO:repo2bow:https://github.com/apache/spark pending tasks: 136
INFO:repo2bow:https://github.com/apache/spark pending tasks: 128
INFO:repo2bow:https://github.com/apache/spark pending tasks: 120
INFO:repo2bow:https://github.com/apache/spark pending tasks: 112
INFO:repo2bow:https://github.com/apache/spark pending tasks: 104
INFO:repo2bow:https://github.com/apache/spark pending tasks: 96
INFO:repo2bow:https://github.com/apache/spark pending tasks: 88
INFO:repo2bow:https://github.com/apache/spark pending tasks: 80
INFO:repo2bow:https://github.com/apache/spark pending tasks: 72
INFO:repo2bow:https://github.com/apache/spark pending tasks: 64
INFO:repo2bow:https://github.com/apache/spark pending tasks: 56
INFO:repo2bow:https://github.com/apache/spark pending tasks: 48
WARNING:repo2bow:/var/folders/rx/z9zyr71d70x92zwbn3rrjx4c0000gn/T/repo2-0hkgj3su/apache&spark_github.com/python/pyspark/sql/tests.py was skipped: it is too big - 270463 bytes
INFO:repo2bow:https://github.com/apache/spark pending tasks: 40
INFO:repo2bow:https://github.com/apache/spark pending tasks: 32
INFO:repo2bow:https://github.com/apache/spark pending tasks: 24
INFO:repo2bow:https://github.com/apache/spark pending tasks: 16
INFO:repo2bow:https://github.com/apache/spark pending tasks: 8
INFO:repo2bow:https://github.com/apache/spark pending tasks: 0
                Parallel and distributed processing - General IT	4.49
                Machine Learning, sklearn-like APIs - General IT	3.88
               Java/JS + async + JSON serialization - General IT	3.77
                            Cryptography: libraries - General IT	3.23
                        SQL, working with databases - General IT	3.18
                Java string input/output - Programming languages	3.16
                          Java: Spring, Hibernate - Technologies	3.11
                              Operations on numbers - General IT	3.02
                               Distributed clusters - General IT	2.69
           Functional programming, Scala - Programming languages	2.64
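
The ranking above can be read as a score of the repository's BOW against each topic. The log reports 320 topics over 2015336 tokens with about 2.6% non-zero elements, which suggests a sparse topics-by-tokens matrix. As a rough sketch of that idea (an assumption about the general shape of the computation, not code taken from tmsc), each topic is a sparse row of token weights and the score is the dot product of that row with the BOW. The name `rank_topics` and all numbers below are invented for illustration.

```python
def rank_topics(topic_rows, bow, top_n=10):
    """Return the top_n (topic, score) pairs, highest score first.

    topic_rows maps a topic label to a sparse {token: weight} row;
    bow maps a token to its weight in the repository.
    """
    scores = {}
    for label, row in topic_rows.items():
        # Sparse dot product: only tokens present in both contribute.
        scores[label] = sum(w * bow[tok] for tok, w in row.items() if tok in bow)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# Toy data (hypothetical): two topics and a three-token BOW.
topic_rows = {
    "Distributed clusters": {"cluster": 0.5, "node": 0.25},
    "SQL, working with databases": {"query": 0.5, "table": 0.5},
}
bow = {"cluster": 4.0, "node": 2.0, "query": 1.0}
ranked = rank_topics(topic_rows, bow, top_n=2)
```

With the toy data, "Distributed clusters" ranks first because two of its tokens appear in the BOW with larger weights.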

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
@bzz bzz requested a review from vmarkovtsev August 23, 2018 11:26
@vmarkovtsev
Collaborator

Gigantic effort @bzz!

@vmarkovtsev vmarkovtsev left a comment

All right, the old code is ancient, and we've replaced every single bit of the API. @zurk, can you please guide Alex on how to use the new API? This topic modeling thing extracts BOWs from repositories, so we need the right Transformer chain, Ignition, etc.
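
For readers unfamiliar with the term, a BOW (bag of words) here is a mapping from identifier tokens to weights for one repository. The sketch below is a simplified illustration under stated assumptions, not the sourced-ml Transformer chain: the regex tokenizer stands in for UAST-based identifier extraction via Babelfish, the TF-IDF weighting is one common choice rather than a confirmed detail, and `split_identifier`, `extract_tokens`, `prune_docfreq`, and `tfidf_bow` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase parts."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name)
    return [p.lower() for p in parts]


def extract_tokens(source):
    """Crude tokenizer: every identifier-like word, split into parts."""
    tokens = []
    for word in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        tokens.extend(split_identifier(word))
    return tokens


def prune_docfreq(docfreq, min_df):
    """Keep only tokens that occur in at least min_df documents."""
    return {tok: df for tok, df in docfreq.items() if df >= min_df}


def tfidf_bow(tokens, docfreq, num_docs):
    """Weight term counts by the log inverse document frequency."""
    counts = Counter(tok for tok in tokens if tok in docfreq)
    return {tok: tf * math.log(num_docs / docfreq[tok])
            for tok, tf in counts.items()}


# Toy document frequencies (hypothetical numbers).
docfreq = {"spark": 40, "context": 900, "rdd": 25, "tmp": 3}
pruned = prune_docfreq(docfreq, min_df=20)  # mirrors --prune-df 20
bow = tfidf_bow(extract_tokens("sparkContext.makeRDD(tmp)"), pruned, num_docs=1000)
```

Here `tmp` is dropped by pruning, `make` is dropped because it has no document frequency, and rare tokens such as `rdd` get a higher weight than common ones such as `context`.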

"github", "bblfsh", "babelfish", "ast2vec"],
install_requires=["ast2vec>=0.3.8-alpha"],
"github", "bblfsh", "babelfish"],
install_requires=["sourced-ml>=0.5.1", "ast2vec>=0.3.8-alpha"],
Collaborator

ast2vec is dead

Author

I know; it was kept just as a hack to satisfy internal imports 😕

@@ -33,6 +35,10 @@ def main():
     parser.add_argument("--prune-df", default=20, type=int,
                         help="Minimum number of times an identifer must occur in different "
                              "documents to be taken into account.")
+    parser.add_argument("--index_repo", default="https://github.com/src-d/models",
Collaborator

These two are normally hardcoded in modelforgecfg.py which exists in sourced-ml. Users should be abstracted from these details.

I think that

args.topics = Topics(log_level=args.log_level).load(source=args.topics)

will work - the backend will be created automatically. If it doesn't then there is a bug somewhere.

Author

@bzz bzz Aug 31, 2018

Seems like the 🐛 is confirmed:

import logging
from sourced.ml.models import Topics
topics = Topics(log_level=logging.INFO).load(source=None)

works as expected only if there is already a warm cache in ~/.source{d}/topics.

If the cache is empty, the very same code results in:

Traceback (most recent call last):
  File "./test_tm.py", line 6, in <module>
    topics = Topics(log_level=logging.INFO).load(source=None)
  File "~/src-d/tmsc/.venv3/lib/python3.6/site-packages/modelforge/model.py", line 82, in load
    raise ValueError("The backend must be set to load a UUID or the default "
ValueError: The backend must be set to load a UUID or the default model.
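
The behaviour reported above can be illustrated with a toy loader. This is a minimal sketch, not modelforge's code: `ToyModel`, `ToyBackend`, and the cache layout are hypothetical names invented for the illustration, and only the error message is copied from the traceback. It shows why a warm cache hides the problem: the backend is consulted only on a cache miss.

```python
import os
import tempfile


class ToyBackend:
    """Hypothetical stand-in for a remote model storage backend."""

    def fetch(self, name, path):
        # Pretend to download the default model into the cache.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as fout:
            fout.write("model:" + name)


class ToyModel:
    """Hypothetical stand-in for a model class with a local cache."""

    NAME = "topics"

    def __init__(self, cache_dir, backend=None):
        self.cache_dir = cache_dir
        self.backend = backend
        self.payload = None

    def load(self, source=None):
        # source=None means "load the default model" from the cache location.
        path = source or os.path.join(self.cache_dir, self.NAME, "default.asdf")
        if not os.path.exists(path):
            # Cache miss: a backend is required to fetch the file.
            if self.backend is None:
                raise ValueError("The backend must be set to load a UUID or "
                                 "the default model.")
            self.backend.fetch(self.NAME, path)
        with open(path) as fin:
            self.payload = fin.read()
        return self


def demo():
    with tempfile.TemporaryDirectory() as cache:
        try:
            ToyModel(cache).load()  # cold cache, no backend
            cold = "loaded"
        except ValueError:
            cold = "ValueError"
        ToyModel(cache, ToyBackend()).load()  # the backend warms the cache
        warm = ToyModel(cache).load().payload  # now works without a backend
    return cold, warm
```

Calling `demo()` exercises all three cases in a temporary directory: a cold cache without a backend, a cold cache with one, and a warm cache without one.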

@zurk
Contributor

zurk commented Aug 23, 2018

All right, I think this (https://github.com/src-d/ml/blob/master/sourced/ml/cmd/repos2bow.py) can be helpful. We use it to convert repositories to BOW models. It is this complex because we also use it in the Apollo project, but the basic pipeline idea for creating a BOW model can be found in the initial version of repo2bow: https://github.com/zurk/ml/blob/d7a093de39e90db9a9c74515d6b2029240de7b96/sourced/ml/cmd_entries/repos2bow.py

I am not sure how deep your knowledge of the new sourced-ml is, @bzz. If you want, we can have a call and I will explain the main aspects to you.

@vmarkovtsev
Collaborator

This is an excellent chance to improve our documentation btw.

@zurk
Contributor

zurk commented Aug 23, 2018

Yeah, good idea.
We have something here: https://docs.sourced.tech/sourced-ml but it tells you how to use it and says nothing about developing.

I think I can add more docstrings to our codebase. @bzz, if you can, please let me know about everything that is confusing or hard to get in sourced-ml; I will add docstrings there first.
I am asking because it is hard to spot the most problematic places from the inside :)

@vmarkovtsev
Collaborator

@bzz The core part here is extracting the BOW. You can use the revamped function from Vecino now: https://github.com/src-d/vecino/blob/master/vecino/repo2bow.py

@bzz
Author

bzz commented Sep 5, 2018

Yes, that is exactly the missing component that I had to resurrect from git history 🚀

Is it OK to use vecino as a dependency here?

@vmarkovtsev
Collaborator

@bzz It is completely fine to copy-paste for now - we will add this to sourced-ml once we have time.
