
Conversation

@glimow (Contributor) commented May 2, 2019

must be merged after #18

@glimow (Contributor, Author) commented May 6, 2019

@irinakhismatullina @vmarkovtsev PTAL

README.md Outdated
```python
from sourced.ml.models import BOW
bow = BOW().load(bow)
import modelforge.backends
```
Collaborator commented:
No longer needed after src-d/modelforge#92
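For context, these README snippets are rendered from the registry's `code` templates, which carry a `%s` placeholder for the model argument. A minimal sketch of that substitution step (the file name below is hypothetical, and the resulting snippet needs no `modelforge.backends` import):

```python
# A registry-style "code" template: %s marks where the model path goes.
template = ('from sourced.ml.models import BOW\n'
            'bow = BOW().load(%s)\n'
            'print("Number of documents:", len(bow))')

# Substitute a hypothetical local path; the result is the snippet
# a generated README page would display.
snippet = template % '"bow.asdf"'
print(snippet)
```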

@@ -1 +1,410 @@
{"meta": {"id2vec": {"default": "92609e70-f79c-46b5-8419-55726e873cfc", "code": "from sourced.ml.models import Id2Vec\nid2vec = Id2Vec().load(%s)\nprint(\"Number of tokens:\", len(id2vec))", "description": "Source code identifier embeddings, that is, every identifier is represented by a dense vector."}, "docfreq": {"default": "f64bacd4-67fb-4c64-8382-399a8e7db52a", "code": "from sourced.ml.models import DocumentFrequencies\ndf = DocumentFrequencies().load(%s)\nprint(\"Number of tokens:\", len(df))", "description": "Document frequencies of features extracted from source code, that is, how many documents (repositories, files or functions) contain each tokenized feature."}, "typos_correction": {"default": "245fae3a-2f87-4990-ab9a-c463393cfe51", "code": "from lookout.style.typos.corrector import TyposCorrector\ncorrector = TyposCorrector().load(%s)\nprint(\"Corrector configuration:\\n\", corrector.dump())", "description": "Model that suggests fixes to correct typos."}, "topics": {"default": "c70a7514-9257-4b33-b468-27a8588d4dfa", "code": "from sourced.ml.models import Topics\ntopics = Topics().load(%s)\nprint(\"Number of topics:\", len(topics))\nprint(\"Number of tokens:\", len(topics.tokens))", "description": "Topic modeling of Git repositories. All tokens are identifiers extracted from repositories and seen as indicators for topics. They are used to infer the topic(s) of repositories."}, "bow": {"default": "1e3da42a-28b6-4b33-94a2-a5671f4102f4", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "description": "Weighted bag-of-words, that is, every bag is a feature extracted from source code and associated with a weight obtained by applying TFIDF."}}, "models": {"id2vec": {"3467e9ca-ec11-444a-ba27-9fa55f5ee6c1": {"version": [1, 0, 0], "created_at": "2018-07-19 13:14:53.000621", "parent": "", "dependencies": [], "references": [["Source code identifier embeddings", "https://blog.sourced.tech/post/id2vec/"]], "extra": {"Size of each embedding": "300", "Data collection date": "June 2018", "Number of tokens": "999,424"}, "description": "A little under 1M identifier embeddings, generated for identifiers extracted from half of PGA in June 2018. New pipeline was used, with splitting and stemming of identifiers; the full description can be found in the \"Algorithms\" section of the [sourced.ml](https://github.com/src-d/ml) repository.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fid2vec%2F3467e9ca-ec11-444a-ba27-9fa55f5ee6c1.asdf", "size": "1.2 GB", "code": "from sourced.ml.models import Id2Vec\nid2vec = Id2Vec().load(%s)\nprint(\"Number of tokens:\", len(id2vec))", "license": ["", "none"]}, "92609e70-f79c-46b5-8419-55726e873cfc": {"created_at": "2017-06-18 17:37:06.255615", "parent": "", "dependencies": [], "references": [["Source code identifier embeddings", "https://blog.sourced.tech/post/id2vec/"]], "extra": {"Number of (sub)tokens": "5,720,096", "Number of repositories": "112,273", "Data collection date": "October 2016"}, "description": "Generated from 140,000 most starred projects on GitHub in October 2016. Legacy pipeline, no splitting and stemming, later converted with quality loss.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fid2vec%2F92609e70-f79c-46b5-8419-55726e873cfc.asdf", "license": ["", "undecided"], "size": "1.1 GB", "code": "from sourced.ml.models import Id2Vec\nid2vec = Id2Vec().load(%s)\nprint(\"Number of tokens:\", len(id2vec))", "version": [1, 0, 0]}}, "docfreq": {"55215392-36fc-43e5-b277-500f5b68d0c6": {"created_at": "2018-06-20 14:51:45.469503", "parent": "", "dependencies": [], "references": [], "extra": {"Number of distinct documents (files)": "7,873,334", "Data collection date": "July 2018", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in July 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). The document frequency here refers to the frequency of each feature across all documents (we only kept features that appeared at least 5 times).", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2F55215392-36fc-43e5-b277-500f5b68d0c6.asdf", "license": ["", "none"], "size": "69.9 MB", "code": "from sourced.ml.models import OrderedDocumentFrequencies\ndf = OrderedDocumentFrequencies().load(%s)\nprint(\"Number of documents:\", len(df))", "version": [1, 0, 0]}, "f64bacd4-67fb-4c64-8382-399a8e7db52a": {"created_at": "2017-06-19 09:59:14.766638", "parent": "", "dependencies": [], "references": [], "extra": {"Number of (sub)tokens": "5,720,096", "Number of repositories": "112,273", "Data collection date": "October 2016"}, "description": "5.7 million source code identifiers, extracted in October 2016 from all repositories we cloned - 10 million after de-duplication. Standard processing: splitting, stemming - as given in the paper. The document frequency here refers to the frequency of identifiers per repository.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2Ff64bacd4-67fb-4c64-8382-399a8e7db52a.asdf", "license": ["", "undecided"], "size": "24.3 MB", "code": "from sourced.ml.models import DocumentFrequencies\ndf = DocumentFrequencies().load(%s)\nprint(\"Number of tokens:\", len(df))", "version": [1, 0, 0]}}, "typos_correction": {"245fae3a-2f87-4990-ab9a-c463393cfe51": {"datasets": [[]], "references": [[]], "extra": {"Proba of >1 typo in a typo-ed word": 0, "Frequencies size": 50000, "Vocabulary size": 5000, "Train size": 50000, "Fasttext train size": 10000000}, "description": "Model that suggests fixes to correct typos.", "metrics": {"f1": 0.8862168782008462, "recall": 0.796, "accuracy": 0.8978, "precision": 0.9994977398292315}, "size": "66.7 MB", "license": "ODbL-1.0", "code": "from lookout.style.typos.corrector import TyposCorrector\ncorrector = TyposCorrector().load(%s)\nprint(\"Corrector configuration:\\n\", corrector.dump())", "vendor": "source{d}", "created_at": "2019-03-26 14:13:56", "dependencies": [], "series": 0.2, "tags": ["typos_correction"], "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Ftypos_correction%2F245fae3a-2f87-4990-ab9a-c463393cfe51.asdf", "environment": {"python": "3.5.2 (default, Nov 12 2018, 13:43:14) [GCC 5.4.0 20160609]", "packages": [["ConfigArgParse", "0.14.0"], ["Jinja2", "2.10"], ["MarkupSafe", "1.0"], ["Pillow", "5.4.1"], ["Pillow-SIMD", "5.1.1.post0"], ["PyStemmer", "1.3.0"], ["PyYAML", "3.12"], ["Pygments", "2.3.1"], ["Pympler", "0.6"], ["SQLAlchemy", "1.2.18"], ["SQLAlchemy-Utils", "0.33.11"], ["asdf", "2.3.0"], ["backcall", "0.1.0"], ["bblfsh", "2.12.7"], ["boto", "2.49.0"], ["boto3", "1.9.98"], ["botocore", "1.12.98"], ["cachetools", "2.1.0"], ["cairocffi", "0.8.0"], ["certifi", "2018.4.16"], ["cffi", "1.5.2"], ["chardet", "3.0.4"], ["clint", "0.5.1"], ["cycler", "0.10.0"], ["decorator", "4.3.0"], ["dulwich", "0.19.11"], ["gensim", "3.7.1"], ["google-auth", "1.6.3"], ["google-auth-httplib2", "0.0.3"], ["google-cloud-core", "0.25.0"], ["grpcio", "1.18.0"], ["grpcio-tools", "1.18.0"], ["httplib2", "0.12.1"], ["humanfriendly", "4.17"], ["humanize", "0.5.1"], ["idna", "2.5"], ["ipykernel", "4.8.2"], ["ipython", "6.3.1"], ["ipython-genutils", "0.2.0"], ["ipywidgets", "7.2.1"], ["jedi", "0.12.0"], ["jmespath", "0.9.3"], ["jsonschema", "2.6.0"], ["jupyter-client", "5.2.3"], ["jupyter-core", "4.4.0"], ["lookout-sdk", "0.4.1"], ["lookout-sdk-ml", "0.17.0"], ["lookout-style", "0.1.1"], ["lz4", "2.0.2"], ["matplotlib", "2.2.2"], ["modelforge", "0.12.0"], ["numpy", "1.14.0"], ["packaging", "19.0"], ["pandas", "0.22.0"], ["parso", "0.2.0"], ["pexpect", "4.5.0"], ["pickleshare", "0.7.4"], ["pip", "19.0.3"], ["prompt-toolkit", "1.0.15"], ["protobuf", "3.7.0"], ["psycopg2-binary", "2.7.7"], ["ptyprocess", "0.5.2"], ["pygtrie", "2.3"], ["pyparsing", "2.2.0"], ["python-dateutil", "2.7.3"], ["python-igraph", "0.7.1.post6"], ["pytz", "2018.4"], ["pyzmq", "17.0.0"], ["requests", "2.21.0"], ["requirements-parser", "0.2.0"], ["scikit-learn", "0.20.1"], ["scikit-optimize", "0.5.2"], ["scipy", "1.0.0"], ["semantic-version", "2.6.0"], ["setuptools", "40.8.0"], ["simplegeneric", "0.8.1"], ["six", "1.11.0"], ["smart-open", "1.8.0"], ["sourced-ml", "0.8.2"], ["spdx", "2.5.0"], ["stringcase", "1.2.0"], ["tornado", "5.0.2"], ["tqdm", "4.31.1"], ["traitlets", "4.3.2"], ["urllib3", "1.24.1"], ["wcwidth", "0.1.7"], ["xgboost", "0.72.1"], ["xxhash", "1.2.0"]], "platform": "Linux-4.15.15-coreos-x86_64-with-Ubuntu-16.04-xenial"}, "version": [1, 0, 0]}}, "topics": {"c70a7514-9257-4b33-b468-27a8588d4dfa": {"created_at": "2017-09-18 12:27:56.074233", "parent": "", "dependencies": ["f64bacd4-67fb-4c64-8382-399a8e7db52a"], "references": [["Topic modeling of public repositories at scale using names in source code", "https://arxiv.org/abs/1704.00135"]], "extra": {"Number of topics": "320", "Data collection date": "October 2016", "Number of tokens": "2,015,336"}, "description": "Generated from 2 million GitHub repositories in October 2016.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Ftopics%2Fc70a7514-9257-4b33-b468-27a8588d4dfa.asdf", "license": ["", "undecided"], "size": "95.1 MB", "code": "from sourced.ml.models import Topics\ntopics = Topics().load(%s)\nprint(\"Number of topics:\", len(topics))\nprint(\"Number of repositories:\", len(topics.tokens))", "version": [0, 3, 0]}}, "bow": {"694c20a0-9b96-4444-80ae-f2fa5bd1395b": {"version": [1, 0, 0], "created_at": "2018-07-17 10:28:56.243131", "parent": "", "dependencies": ["55215392-36fc-43e5-b277-500f5b68d0c6"], "references": [], "extra": {"Number of distinct documents (files)": "3,512,171", "Data collection date": "July 2018", "Other parts": "[da8c5dee-b285-4d55-8913-a5209f716564](da8c5dee-b285-4d55-8913-a5209f716564.md) and [1e0deee4-7dc1-400f-acb6-74c0f4aec471](1e0deee4-7dc1-400f-acb6-74c0f4aec471.md)", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in July 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try to use [apollo](https://github.com/src-d/apollo) at scale. We hit `scipy.sparse` limits while trying to merge sparse matrices for all bags, so this is only one of three `BOW` models holding bags.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F694c20a0-9b96-4444-80ae-f2fa5bd1395b.asdf", "size": "26.0 GB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "license": ["", "none"]}, "da8c5dee-b285-4d55-8913-a5209f716564": {"created_at": "2018-07-17 09:43:05.498579", "parent": "", "dependencies": ["55215392-36fc-43e5-b277-500f5b68d0c6"], "references": [], "extra": {"Number of distinct documents (files)": "3,493,288", "Data collection date": "July 2018", "Other parts": "[694c20a0-9b96-4444-80ae-f2fa5bd1395b](694c20a0-9b96-4444-80ae-f2fa5bd1395b.md) and [1e0deee4-7dc1-400f-acb6-74c0f4aec471](1e0deee4-7dc1-400f-acb6-74c0f4aec471.md)", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in July 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try to use [apollo](https://github.com/src-d/apollo) at scale. We hit `scipy.sparse` limits while trying to merge sparse matrices for all bags, so this is only one of three `BOW` models holding bags.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2Fda8c5dee-b285-4d55-8913-a5209f716564.asdf", "license": ["", "none"], "size": "25.8 GB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "version": [1, 0, 0]}, "1e0deee4-7dc1-400f-acb6-74c0f4aec471": {"version": [1, 0, 0], "created_at": "2018-07-17 10:16:51.105969", "parent": "", "dependencies": ["55215392-36fc-43e5-b277-500f5b68d0c6"], "references": [], "extra": {"Number of distinct documents (files)": "864,458", "Data collection date": "July 2018", "Other parts": "[694c20a0-9b96-4444-80ae-f2fa5bd1395b](694c20a0-9b96-4444-80ae-f2fa5bd1395b.md) and [da8c5dee-b285-4d55-8913-a5209f716564](da8c5dee-b285-4d55-8913-a5209f716564.md)", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in July 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try to use [apollo](https://github.com/src-d/apollo) at scale. We hit `scipy.sparse` limits while trying to merge sparse matrices for all bags, so this is only one of three `BOW` models holding bags.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F1e0deee4-7dc1-400f-acb6-74c0f4aec471.asdf", "size": "5.9 GB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "license": ["", "none"]}, "1e3da42a-28b6-4b33-94a2-a5671f4102f4": {"created_at": "2017-06-19 09:16:08.942880", "parent": "", "dependencies": ["f64bacd4-67fb-4c64-8382-399a8e7db52a"], "references": [["Similarity of GitHub Repositories by Source Code Identifiers", "http://vmarkovtsev.github.io/techtalks-2017-moscow/#"]], "extra": {"Number of (sub)tokens": "999,424", "Number of repositories": "112,273", "Data collection date": "October 2016"}, "description": "Bags of identifiers generated from 140,000 most starred projects on GitHub in October 2016 - ~112k after deduplication.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F1e3da42a-28b6-4b33-94a2-a5671f4102f4.asdf", "license": ["", "undecided"], "size": "380.8 MB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "version": [1, 0, 0]}}}}
{
Collaborator commented:

Can you please split this PR into two:

  1. Readable JSON, no other changes.
  2. The extra Swagger you add to the JSON and the Markdown generated from it.
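Step 1 of the suggested split could be as simple as re-serializing the existing one-line file; a minimal standard-library sketch (the compact string below is a truncated stand-in for the real registry):

```python
import json

# Truncated stand-in for the one-line registry file.
compact = '{"meta": {"bow": {"default": "1e3da42a-28b6-4b33-94a2-a5671f4102f4"}}}'

# Re-serialize as indented, key-sorted JSON: readable, diff-friendly,
# and semantically identical to the compact original.
readable = json.dumps(json.loads(compact), indent=2, sort_keys=True)
print(readable)
```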

@glimow glimow mentioned this pull request May 8, 2019
Signed-off-by: tristan kalos <tristan.kalos@live.fr>
@glimow glimow closed this May 22, 2019