Readable json + updated models examples #20
Closed
Conversation
irinakhismatullina (Contributor) suggested changes on May 2, 2019
vmarkovtsev requested changes on May 7, 2019
README.md (Outdated)

```python
from sourced.ml.models import BOW
bow = BOW().load(bow)
import modelforge.backends
```
Collaborator:
No longer needed after src-d/modelforge#92
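For context, after src-d/modelforge#92 the README example can load a model without any backend setup. A minimal sketch of the trimmed example, assuming `load()` accepts the path of a downloaded `.asdf` file (the file name below is illustrative):

```python
from sourced.ml.models import BOW

# Assumed invocation: load() is given a local .asdf path; once
# src-d/modelforge#92 lands, no explicit "import modelforge.backends"
# is needed before loading. This mirrors the "code" snippets stored
# in the registry JSON, which use a %s placeholder for the source.
bow = BOW().load("1e3da42a-28b6-4b33-94a2-a5671f4102f4.asdf")
print("Number of documents:", len(bow))
print("Number of tokens:", len(bow.tokens))
```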
```diff
@@ -1 +1,410 @@
-{"meta": {"id2vec": {"default": "92609e70-f79c-46b5-8419-55726e873cfc", "code": "from sourced.ml.models import Id2Vec\nid2vec = Id2Vec().load(%s)\nprint(\"Number of tokens:\", len(id2vec))", "description": "Source code identifier embeddings, that is, every identifier is represented by a dense vector."}, "docfreq": {"default": "f64bacd4-67fb-4c64-8382-399a8e7db52a", "code": "from sourced.ml.models import DocumentFrequencies\ndf = DocumentFrequencies().load(%s)\nprint(\"Number of tokens:\", len(df))", "description": "Document frequencies of features extracted from source code, that is, how many documents (repositories, files or functions) contain each tokenized feature."}, "typos_correction": {"default": "245fae3a-2f87-4990-ab9a-c463393cfe51", "code": "from lookout.style.typos.corrector import TyposCorrector\ncorrector = TyposCorrector().load(%s)\nprint(\"Corrector configuration:\\n\", corrector.dump())", "description": "Model that suggests fixes to correct typos."}, "topics": {"default": "c70a7514-9257-4b33-b468-27a8588d4dfa", "code": "from sourced.ml.models import Topics\ntopics = Topics().load(%s)\nprint(\"Number of topics:\", len(topics))\nprint(\"Number of tokens:\", len(topics.tokens))", "description": "Topic modeling of Git repositories. All tokens are identifiers extracted from repositories and seen as indicators for topics. They are used to infer the topic(s) of repositories."}, "bow": {"default": "1e3da42a-28b6-4b33-94a2-a5671f4102f4", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "description": "Weighted bag-of-words, that is, every bag is a feature extracted from source code and associated with a weight obtained by applying TFIDF."}}, "models": {"id2vec": {"3467e9ca-ec11-444a-ba27-9fa55f5ee6c1": {"version": [1, 0, 0], "created_at": "2018-07-19 13:14:53.000621", "parent": "", "dependencies": [], "references": [["Source code identifier embeddings", "https://blog.sourced.tech/post/id2vec/"]], "extra": {"Size of each embedding": "300", "Data collection date": "June 2018", "Number of tokens": "999,424"}, "description": "A little under 1M identifier embeddings, generated for identifiers extracted from half of PGA in June 2018. New pipeline was used, with splitting and stemming of identifiers, the full description can be found in the \"Algorithms\" section of the [sourced.ml](https://github.com/src-d/ml) repository.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fid2vec%2F3467e9ca-ec11-444a-ba27-9fa55f5ee6c1.asdf", "size": "1.2 GB", "code": "from sourced.ml.models import Id2Vec\nid2vec = Id2Vec().load(%s)\nprint(\"Number of tokens:\", len(id2vec))", "license": ["", "none"]}, "92609e70-f79c-46b5-8419-55726e873cfc": {"created_at": "2017-06-18 17:37:06.255615", "parent": "", "dependencies": [], "references": [["Source code identifier embeddings", "https://blog.sourced.tech/post/id2vec/"]], "extra": {"Number of (sub)tokens": "5,720,096", "Number of repositories": "112,273", "Data collection date": "October 2016"}, "description": "Generated from 140,000 most starred projects on GitHub in October 2016. Legacy pipeline, no splitting and stemming, later converted with quality loss.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fid2vec%2F92609e70-f79c-46b5-8419-55726e873cfc.asdf", "license": ["", "undecided"], "size": "1.1 GB", "code": "from sourced.ml.models import Id2Vec\nid2vec = Id2Vec().load(%s)\nprint(\"Number of tokens:\", len(id2vec))", "version": [1, 0, 0]}}, "docfreq": {"55215392-36fc-43e5-b277-500f5b68d0c6": {"created_at": "2018-06-20 14:51:45.469503", "parent": "", "dependencies": [], "references": [], "extra": {"Number of distinct documents (files)": "7,873,334", "Data collection date": "July 2018", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in july 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). The document frequency here refers to the frequency of each feature across all documents (we only kept features that appeared at least 5 times).", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2F55215392-36fc-43e5-b277-500f5b68d0c6.asdf", "license": ["", "none"], "size": "69.9 MB", "code": "from sourced.ml.models import OrderedDocumentFrequencies\ndf = OrderedDocumentFrequencies().load(%s)\nprint(\"Number of documents:\", len(df))", "version": [1, 0, 0]}, "f64bacd4-67fb-4c64-8382-399a8e7db52a": {"created_at": "2017-06-19 09:59:14.766638", "parent": "", "dependencies": [], "references": [], "extra": {"Number of (sub)tokens": "5,720,096", "Number of repositories": "112,273", "Data collection date": "October 2016"}, "description": "5.7 million source code identifiers, extracted in october 2016 from all repositories we cloned - 10 million after de-duplication. Standard processing: splitting, stemming - as given in the paper. The document frequency here refers to the frequency of identifiers per repository.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2Ff64bacd4-67fb-4c64-8382-399a8e7db52a.asdf", "license": ["", "undecided"], "size": "24.3 MB", "code": "from sourced.ml.models import DocumentFrequencies\ndf = DocumentFrequencies().load(%s)\nprint(\"Number of tokens:\", len(df))", "version": [1, 0, 0]}}, "typos_correction": {"245fae3a-2f87-4990-ab9a-c463393cfe51": {"datasets": [[]], "references": [[]], "extra": {"Proba of >1 typo in a typo-ed word": 0, "Frequencies size": 50000, "Vocabulary size": 5000, "Train size": 50000, "Fasttext train size": 10000000}, "description": "Model that suggests fixes to correct typos.", "metrics": {"f1": 0.8862168782008462, "recall": 0.796, "accuracy": 0.8978, "precision": 0.9994977398292315}, "size": "66.7 MB", "license": "ODbL-1.0", "code": "from lookout.style.typos.corrector import TyposCorrector\ncorrector = TyposCorrector().load(%s)\nprint(\"Corrector configuration:\\n\", corrector.dump())", "vendor": "source{d}", "created_at": "2019-03-26 14:13:56", "dependencies": [], "series": 0.2, "tags": ["typos_correction"], "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Ftypos_correction%2F245fae3a-2f87-4990-ab9a-c463393cfe51.asdf", "environment": {"python": "3.5.2 (default, Nov 12 2018, 13:43:14) [GCC 5.4.0 20160609]", "packages": [["ConfigArgParse", "0.14.0"], ["Jinja2", "2.10"], ["MarkupSafe", "1.0"], ["Pillow", "5.4.1"], ["Pillow-SIMD", "5.1.1.post0"], ["PyStemmer", "1.3.0"], ["PyYAML", "3.12"], ["Pygments", "2.3.1"], ["Pympler", "0.6"], ["SQLAlchemy", "1.2.18"], ["SQLAlchemy-Utils", "0.33.11"], ["asdf", "2.3.0"], ["backcall", "0.1.0"], ["bblfsh", "2.12.7"], ["boto", "2.49.0"], ["boto3", "1.9.98"], ["botocore", "1.12.98"], ["cachetools", "2.1.0"], ["cairocffi", "0.8.0"], ["certifi", "2018.4.16"], ["cffi", "1.5.2"], ["chardet", "3.0.4"], ["clint", "0.5.1"], ["cycler", "0.10.0"], ["decorator", "4.3.0"], ["dulwich", "0.19.11"], ["gensim", "3.7.1"], ["google-auth", "1.6.3"], ["google-auth-httplib2", "0.0.3"], ["google-cloud-core", "0.25.0"], ["grpcio", "1.18.0"], ["grpcio-tools", "1.18.0"], ["httplib2", "0.12.1"], ["humanfriendly", "4.17"], ["humanize", "0.5.1"], ["idna", "2.5"], ["ipykernel", "4.8.2"], ["ipython", "6.3.1"], ["ipython-genutils", "0.2.0"], ["ipywidgets", "7.2.1"], ["jedi", "0.12.0"], ["jmespath", "0.9.3"], ["jsonschema", "2.6.0"], ["jupyter-client", "5.2.3"], ["jupyter-core", "4.4.0"], ["lookout-sdk", "0.4.1"], ["lookout-sdk-ml", "0.17.0"], ["lookout-style", "0.1.1"], ["lz4", "2.0.2"], ["matplotlib", "2.2.2"], ["modelforge", "0.12.0"], ["numpy", "1.14.0"], ["packaging", "19.0"], ["pandas", "0.22.0"], ["parso", "0.2.0"], ["pexpect", "4.5.0"], ["pickleshare", "0.7.4"], ["pip", "19.0.3"], ["prompt-toolkit", "1.0.15"], ["protobuf", "3.7.0"], ["psycopg2-binary", "2.7.7"], ["ptyprocess", "0.5.2"], ["pygtrie", "2.3"], ["pyparsing", "2.2.0"], ["python-dateutil", "2.7.3"], ["python-igraph", "0.7.1.post6"], ["pytz", "2018.4"], ["pyzmq", "17.0.0"], ["requests", "2.21.0"], ["requirements-parser", "0.2.0"], ["scikit-learn", "0.20.1"], ["scikit-optimize", "0.5.2"], ["scipy", "1.0.0"], ["semantic-version", "2.6.0"], ["setuptools", "40.8.0"], ["simplegeneric", "0.8.1"], ["six", "1.11.0"], ["smart-open", "1.8.0"], ["sourced-ml", "0.8.2"], ["spdx", "2.5.0"], ["stringcase", "1.2.0"], ["tornado", "5.0.2"], ["tqdm", "4.31.1"], ["traitlets", "4.3.2"], ["urllib3", "1.24.1"], ["wcwidth", "0.1.7"], ["xgboost", "0.72.1"], ["xxhash", "1.2.0"]], "platform": "Linux-4.15.15-coreos-x86_64-with-Ubuntu-16.04-xenial"}, "version": [1, 0, 0]}}, "topics": {"c70a7514-9257-4b33-b468-27a8588d4dfa": {"created_at": "2017-09-18 12:27:56.074233", "parent": "", "dependencies": ["f64bacd4-67fb-4c64-8382-399a8e7db52a"], "references": [["Topic modeling of public repositories at scale using names in source code", "https://arxiv.org/abs/1704.00135"]], "extra": {"Number of topics": "320", "Data collection date": "October 2016", "Number of tokens": "2,015,336"}, "description": "Generated from 2 million GitHub repositories in October 2016.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Ftopics%2Fc70a7514-9257-4b33-b468-27a8588d4dfa.asdf", "license": ["", "undecided"], "size": "95.1 MB", "code": "from sourced.ml.models import Topics\ntopics = Topics().load(%s)\nprint(\"Number of topics:\", len(topics))\nprint(\"Number of repositories:\", len(topics.tokens))", "version": [0, 3, 0]}}, "bow": {"694c20a0-9b96-4444-80ae-f2fa5bd1395b": {"version": [1, 0, 0], "created_at": "2018-07-17 10:28:56.243131", "parent": "", "dependencies": ["55215392-36fc-43e5-b277-500f5b68d0c6"], "references": [], "extra": {"Number of distinct documents (files)": "3,512,171", "Data collection date": "July 2018", "Other parts": "[da8c5dee-b285-4d55-8913-a5209f716564](da8c5dee-b285-4d55-8913-a5209f716564.md) and [1e0deee4-7dc1-400f-acb6-74c0f4aec471](1e0deee4-7dc1-400f-acb6-74c0f4aec471.md)", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in july 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try to use [apollo](https://github.com/src-d/apollo) at scale. We hit `scipy.sparse` limits while trying to merge sparse matrices for all bags, so this is only one of three `BOW` model holding bags.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F694c20a0-9b96-4444-80ae-f2fa5bd1395b.asdf", "size": "26.0 GB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "license": ["", "none"]}, "da8c5dee-b285-4d55-8913-a5209f716564": {"created_at": "2018-07-17 09:43:05.498579", "parent": "", "dependencies": ["55215392-36fc-43e5-b277-500f5b68d0c6"], "references": [], "extra": {"Number of distinct documents (files)": "3,493,288", "Data collection date": "July 2018", "Other parts": "[694c20a0-9b96-4444-80ae-f2fa5bd1395b](694c20a0-9b96-4444-80ae-f2fa5bd1395b.md) and [1e0deee4-7dc1-400f-acb6-74c0f4aec471](1e0deee4-7dc1-400f-acb6-74c0f4aec471.md)", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in july 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try to use [apollo](https://github.com/src-d/apollo) at scale. We hit `scipy.sparse` limits while trying to merge sparse matrices for all bags, so this is only one of three `BOW` model holding bags.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2Fda8c5dee-b285-4d55-8913-a5209f716564.asdf", "license": ["", "none"], "size": "25.8 GB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "version": [1, 0, 0]}, "1e0deee4-7dc1-400f-acb6-74c0f4aec471": {"version": [1, 0, 0], "created_at": "2018-07-17 10:16:51.105969", "parent": "", "dependencies": ["55215392-36fc-43e5-b277-500f5b68d0c6"], "references": [], "extra": {"Number of distinct documents (files)": "864,458", "Data collection date": "July 2018", "Other parts": "[694c20a0-9b96-4444-80ae-f2fa5bd1395b](694c20a0-9b96-4444-80ae-f2fa5bd1395b.md) and [da8c5dee-b285-4d55-8913-a5209f716564](da8c5dee-b285-4d55-8913-a5209f716564.md)", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in july 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try to use [apollo](https://github.com/src-d/apollo) at scale. We hit `scipy.sparse` limits while trying to merge sparse matrices for all bags, so this is only one of three `BOW` model holding bags.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F1e0deee4-7dc1-400f-acb6-74c0f4aec471.asdf", "size": "5.9 GB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "license": ["", "none"]}, "1e3da42a-28b6-4b33-94a2-a5671f4102f4": {"created_at": "2017-06-19 09:16:08.942880", "parent": "", "dependencies": ["f64bacd4-67fb-4c64-8382-399a8e7db52a"], "references": [["Similarity of GitHub Repositories by Source Code Identifiers", "http://vmarkovtsev.github.io/techtalks-2017-moscow/#"]], "extra": {"Number of (sub)tokens": "999,424", "Number of repositories": "112,273", "Data collection date": "October 2016"}, "description": "Bags of identifiers generated from 140,000 most starred projects on GitHub in October 2016 - ~112k after deduplication.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F1e3da42a-28b6-4b33-94a2-a5671f4102f4.asdf", "license": ["", "undecided"], "size": "380.8 MB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "version": [1, 0, 0]}}}}
+{
```
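The hunk above replaces the single minified line with roughly 410 pretty-printed lines. A minimal sketch of that transformation, assuming the registry is stored in a file named `index.json` (the file name is an assumption, not confirmed by this page):

```python
import json

# Read the minified registry shown in the hunk above.
with open("index.json") as fin:
    registry = json.load(fin)

# Re-serialize with indentation and stable key order; this is the whole
# "readable JSON" change: one minified line becomes human-diffable lines.
with open("index.json", "w") as fout:
    json.dump(registry, fout, indent=4, sort_keys=True)
    fout.write("\n")
```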
Collaborator:
Can you please split this PR into two:
- Readable JSON, no other changes.
- The extra swagger you add to the JSON and generate to Markdown (see the sketch below).
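For the second item, a minimal sketch of JSON-to-Markdown generation, assuming the registry sits in `index.json` and one page per model UUID is wanted (the registry's own "Other parts" fields already link to `<uuid>.md` files); the page layout and file names here are illustrative, not the repository's actual generator:

```python
import json

with open("index.json") as fin:  # assumed registry file name
    registry = json.load(fin)

fence = "`" * 3  # Markdown code fence, built indirectly so it does not nest here

for model_type, models in registry["models"].items():
    for uuid, meta in models.items():
        # Substitute the %s placeholder in the stored snippet with a local path.
        snippet = meta["code"] % ('"%s.asdf"' % uuid)
        page = [
            "# %s / %s" % (model_type, uuid),
            "",
            meta["description"],
            "",
            "* Size: %s" % meta.get("size", "unknown"),
            "* Download: %s" % meta.get("url", "unknown"),
            "",
            fence + "python",
            snippet,
            fence,
        ]
        with open("%s.md" % uuid, "w") as fout:
            fout.write("\n".join(page) + "\n")
```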
Signed-off-by: tristan kalos <tristan.kalos@live.fr>
must be merged after #18