-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] make tmsc work with git-based modelforge and sourced.ml #12
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
Gigantic effort @bzz! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All right, the old code is ancient, and we've replaced any single bit of the API. @zurk can you please guide Alex how to use the new API? This topic modeling thing extracts BOWs from repositories so we need the right Transformer chain, Ignition, etc.
"github", "bblfsh", "babelfish", "ast2vec"], | ||
install_requires=["ast2vec>=0.3.8-alpha"], | ||
"github", "bblfsh", "babelfish"], | ||
install_requires=["sourced-ml>=0.5.1", "ast2vec>=0.3.8-alpha"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ast2vec is dead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, was kept just as a hack to satisfy internal imports 😕
@@ -33,6 +35,10 @@ def main(): | |||
parser.add_argument("--prune-df", default=20, type=int, | |||
help="Minimum number of times an identifer must occur in different " | |||
"documents to be taken into account.") | |||
parser.add_argument("--index_repo", default="https://github.com/src-d/models", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two are normally hardcoded in modelforgecfg.py
which exists in sourced-ml. Users should be abstracted from these details.
I think that
args.topics = Topics(log_level=args.log_level).load(source=args.topics)
will work - the backend will be created automatically. If it doesn't then there is a bug somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like a 🐛 confirmed
import logging
from sourced.ml.models import Topics
topics = Topics(log_level=logging.INFO).load(source=None)
works as expected only if there is warm cache already in ~/.source{d}/topics
In case if the cache is empty, the very same code results in
Traceback (most recent call last):
File "./test_tm.py", line 6, in <module>
topics = Topics(log_level=logging.INFO).load(source=None)
File "~/src-d/tmsc/.venv3/lib/python3.6/site-packages/modelforge/model.py", line 82, in load
raise ValueError("The backend must be set to load a UUID or the default "
ValueError: The backend must be set to load a UUID or the default model.
All right, I think this( https://github.com/src-d/ml/blob/master/sourced/ml/cmd/repos2bow.py ) can be helpful. We use it to convert repositories to BOW models. It is so complex because we also use it in the Apollo project. But the common pipeline idea to create BOW model can be found in an initial code of repo2bow: https://github.com/zurk/ml/blob/d7a093de39e90db9a9c74515d6b2029240de7b96/sourced/ml/cmd_entries/repos2bow.py I am not sure how deep your knowledge in new sourced-ml, @bzz, If you want we can have a call and I explain to you main aspects. |
This is an excellent chance to improve our documentation btw. |
yeah, good idea. I think I can add more docstrings to our codebase. @bzz if you can, please let me know about everything that is confusing or hard to get in sourced-ml, I will add docstrings there firstly. |
@bzz The core part here is extracting the BOW. You can use the revamped function from Vecino now: https://github.com/src-d/vecino/blob/master/vecino/repo2bow.py |
Yes, that is exactly missing component that I had to resurrect from git history 🚀 Is that ok to use vecino as dependency here? |
@bzz It is completely fine to copy-paste for now - we will add this to sourced-ml once we have time. |
This is one Sunday afternoon attempt to make tmsc great again.
It's WIP as usage of BOW model from modelforge should be removed as per discussion in src-d/models#11
Early feedback is warmly appreciated though, helping to make it ready to merge at some point.
Current version is able to run and produce results:
Full log