
Conversation

@glimow (Contributor) commented May 2, 2019

must be merged after #18

@glimow (Contributor, Author) commented May 6, 2019

@irinakhismatullina @vmarkovtsev PTAL

README.md Outdated
```python
from sourced.ml.models import BOW
bow = BOW().load(bow)
import modelforge.backends
```
Collaborator commented:
No longer needed after src-d/modelforge#92
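For context, these README snippets are rendered from the registry's `code` templates, which carry a `%s` placeholder for the model argument. A minimal sketch of that substitution step (the file name below is hypothetical, and the resulting snippet needs no `modelforge.backends` import):

```python
# A registry-style "code" template: %s marks where the model path goes.
template = ('from sourced.ml.models import BOW\n'
            'bow = BOW().load(%s)\n'
            'print("Number of documents:", len(bow))')

# Substitute a hypothetical local path; the result is the snippet
# a generated README page would display.
snippet = template % '"bow.asdf"'
print(snippet)
```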

@@ -1 +1,410 @@
{"meta": {"id2vec": {"default": "92609e70-f79c-46b5-8419-55726e873cfc", "code": "from sourced.ml.models import Id2Vec\nid2vec = Id2Vec().load(%s)\nprint(\"Number of tokens:\", len(id2vec))", "description": "Source code identifier embeddings, that is, every identifier is represented by a dense vector."}, "docfreq": {"default": "f64bacd4-67fb-4c64-8382-399a8e7db52a", "code": "from sourced.ml.models import DocumentFrequencies\ndf = DocumentFrequencies().load(%s)\nprint(\"Number of tokens:\", len(df))", "description": "Document frequencies of features extracted from source code, that is, how many documents (repositories, files or functions) contain each tokenized feature."}, "typos_correction": {"default": "245fae3a-2f87-4990-ab9a-c463393cfe51", "code": "from lookout.style.typos.corrector import TyposCorrector\ncorrector = TyposCorrector().load(%s)\nprint(\"Corrector configuration:\\n\", corrector.dump())", "description": "Model that suggests fixes to correct typos."}, "topics": {"default": "c70a7514-9257-4b33-b468-27a8588d4dfa", "code": "from sourced.ml.models import Topics\ntopics = Topics().load(%s)\nprint(\"Number of topics:\", len(topics))\nprint(\"Number of tokens:\", len(topics.tokens))", "description": "Topic modeling of Git repositories. All tokens are identifiers extracted from repositories and seen as indicators for topics. They are used to infer the topic(s) of repositories."}, "bow": {"default": "1e3da42a-28b6-4b33-94a2-a5671f4102f4", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "description": "Weighted bag-of-words, that is, every bag is a feature extracted from source code and associated with a weight obtained by applying TFIDF."}}, "models": {"id2vec": {"3467e9ca-ec11-444a-ba27-9fa55f5ee6c1": {"version": [1, 0, 0], "created_at": "2018-07-19 13:14:53.000621", "parent": "", "dependencies": [], "references": [["Source code identifier embeddings", "https://blog.sourced.tech/post/id2vec/"]], "extra": {"Size of each embedding": "300", "Data collection date": "June 2018", "Number of tokens": "999,424"}, "description": "A little under 1M identifier embeddings, generated for identifiers extracted from half of PGA in June 2018. New pipeline was used, with splitting and stemming of identifiers; the full description can be found in the \"Algorithms\" section of the [sourced.ml](https://github.com/src-d/ml) repository.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fid2vec%2F3467e9ca-ec11-444a-ba27-9fa55f5ee6c1.asdf", "size": "1.2 GB", "code": "from sourced.ml.models import Id2Vec\nid2vec = Id2Vec().load(%s)\nprint(\"Number of tokens:\", len(id2vec))", "license": ["", "none"]}, "92609e70-f79c-46b5-8419-55726e873cfc": {"created_at": "2017-06-18 17:37:06.255615", "parent": "", "dependencies": [], "references": [["Source code identifier embeddings", "https://blog.sourced.tech/post/id2vec/"]], "extra": {"Number of (sub)tokens": "5,720,096", "Number of repositories": "112,273", "Data collection date": "October 2016"}, "description": "Generated from 140,000 most starred projects on GitHub in October 2016. Legacy pipeline, no splitting and stemming, later converted with quality loss.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fid2vec%2F92609e70-f79c-46b5-8419-55726e873cfc.asdf", "license": ["", "undecided"], "size": "1.1 GB", "code": "from sourced.ml.models import Id2Vec\nid2vec = Id2Vec().load(%s)\nprint(\"Number of tokens:\", len(id2vec))", "version": [1, 0, 0]}}, "docfreq": {"55215392-36fc-43e5-b277-500f5b68d0c6": {"created_at": "2018-06-20 14:51:45.469503", "parent": "", "dependencies": [], "references": [], "extra": {"Number of distinct documents (files)": "7,873,334", "Data collection date": "July 2018", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in July 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). The document frequency here refers to the frequency of each feature across all documents (we only kept features that appeared at least 5 times).", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2F55215392-36fc-43e5-b277-500f5b68d0c6.asdf", "license": ["", "none"], "size": "69.9 MB", "code": "from sourced.ml.models import OrderedDocumentFrequencies\ndf = OrderedDocumentFrequencies().load(%s)\nprint(\"Number of documents:\", len(df))", "version": [1, 0, 0]}, "f64bacd4-67fb-4c64-8382-399a8e7db52a": {"created_at": "2017-06-19 09:59:14.766638", "parent": "", "dependencies": [], "references": [], "extra": {"Number of (sub)tokens": "5,720,096", "Number of repositories": "112,273", "Data collection date": "October 2016"}, "description": "5.7 million source code identifiers, extracted in October 2016 from all repositories we cloned - 10 million after de-duplication. Standard processing: splitting, stemming - as given in the paper. The document frequency here refers to the frequency of identifiers per repository.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2Ff64bacd4-67fb-4c64-8382-399a8e7db52a.asdf", "license": ["", "undecided"], "size": "24.3 MB", "code": "from sourced.ml.models import DocumentFrequencies\ndf = DocumentFrequencies().load(%s)\nprint(\"Number of tokens:\", len(df))", "version": [1, 0, 0]}}, "typos_correction": {"245fae3a-2f87-4990-ab9a-c463393cfe51": {"datasets": [[]], "references": [[]], "extra": {"Proba of >1 typo in a typo-ed word": 0, "Frequencies size": 50000, "Vocabulary size": 5000, "Train size": 50000, "Fasttext train size": 10000000}, "description": "Model that suggests fixes to correct typos.", "metrics": {"f1": 0.8862168782008462, "recall": 0.796, "accuracy": 0.8978, "precision": 0.9994977398292315}, "size": "66.7 MB", "license": "ODbL-1.0", "code": "from lookout.style.typos.corrector import TyposCorrector\ncorrector = TyposCorrector().load(%s)\nprint(\"Corrector configuration:\\n\", corrector.dump())", "vendor": "source{d}", "created_at": "2019-03-26 14:13:56", "dependencies": [], "series": 0.2, "tags": ["typos_correction"], "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Ftypos_correction%2F245fae3a-2f87-4990-ab9a-c463393cfe51.asdf", "environment": {"python": "3.5.2 (default, Nov 12 2018, 13:43:14) [GCC 5.4.0 20160609]", "packages": [["ConfigArgParse", "0.14.0"], ["Jinja2", "2.10"], ["MarkupSafe", "1.0"], ["Pillow", "5.4.1"], ["Pillow-SIMD", "5.1.1.post0"], ["PyStemmer", "1.3.0"], ["PyYAML", "3.12"], ["Pygments", "2.3.1"], ["Pympler", "0.6"], ["SQLAlchemy", "1.2.18"], ["SQLAlchemy-Utils", "0.33.11"], ["asdf", "2.3.0"], ["backcall", "0.1.0"], ["bblfsh", "2.12.7"], ["boto", "2.49.0"], ["boto3", "1.9.98"], ["botocore", "1.12.98"], ["cachetools", "2.1.0"], ["cairocffi", "0.8.0"], ["certifi", "2018.4.16"], ["cffi", "1.5.2"], ["chardet", "3.0.4"], ["clint", "0.5.1"], ["cycler", "0.10.0"], ["decorator", "4.3.0"], ["dulwich", "0.19.11"], ["gensim", "3.7.1"], ["google-auth", "1.6.3"], ["google-auth-httplib2", "0.0.3"], ["google-cloud-core", "0.25.0"], ["grpcio", "1.18.0"], ["grpcio-tools", "1.18.0"], ["httplib2", "0.12.1"], ["humanfriendly", "4.17"], ["humanize", "0.5.1"], ["idna", "2.5"], ["ipykernel", "4.8.2"], ["ipython", "6.3.1"], ["ipython-genutils", "0.2.0"], ["ipywidgets", "7.2.1"], ["jedi", "0.12.0"], ["jmespath", "0.9.3"], ["jsonschema", "2.6.0"], ["jupyter-client", "5.2.3"], ["jupyter-core", "4.4.0"], ["lookout-sdk", "0.4.1"], ["lookout-sdk-ml", "0.17.0"], ["lookout-style", "0.1.1"], ["lz4", "2.0.2"], ["matplotlib", "2.2.2"], ["modelforge", "0.12.0"], ["numpy", "1.14.0"], ["packaging", "19.0"], ["pandas", "0.22.0"], ["parso", "0.2.0"], ["pexpect", "4.5.0"], ["pickleshare", "0.7.4"], ["pip", "19.0.3"], ["prompt-toolkit", "1.0.15"], ["protobuf", "3.7.0"], ["psycopg2-binary", "2.7.7"], ["ptyprocess", "0.5.2"], ["pygtrie", "2.3"], ["pyparsing", "2.2.0"], ["python-dateutil", "2.7.3"], ["python-igraph", "0.7.1.post6"], ["pytz", "2018.4"], ["pyzmq", "17.0.0"], ["requests", "2.21.0"], ["requirements-parser", "0.2.0"], ["scikit-learn", "0.20.1"], ["scikit-optimize", "0.5.2"], ["scipy", "1.0.0"], ["semantic-version", "2.6.0"], ["setuptools", "40.8.0"], ["simplegeneric", "0.8.1"], ["six", "1.11.0"], ["smart-open", "1.8.0"], ["sourced-ml", "0.8.2"], ["spdx", "2.5.0"], ["stringcase", "1.2.0"], ["tornado", "5.0.2"], ["tqdm", "4.31.1"], ["traitlets", "4.3.2"], ["urllib3", "1.24.1"], ["wcwidth", "0.1.7"], ["xgboost", "0.72.1"], ["xxhash", "1.2.0"]], "platform": "Linux-4.15.15-coreos-x86_64-with-Ubuntu-16.04-xenial"}, "version": [1, 0, 0]}}, "topics": {"c70a7514-9257-4b33-b468-27a8588d4dfa": {"created_at": "2017-09-18 12:27:56.074233", "parent": "", "dependencies": ["f64bacd4-67fb-4c64-8382-399a8e7db52a"], "references": [["Topic modeling of public repositories at scale using names in source code", "https://arxiv.org/abs/1704.00135"]], "extra": {"Number of topics": "320", "Data collection date": "October 2016", "Number of tokens": "2,015,336"}, "description": "Generated from 2 million GitHub repositories in October 2016.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Ftopics%2Fc70a7514-9257-4b33-b468-27a8588d4dfa.asdf", "license": ["", "undecided"], "size": "95.1 MB", "code": "from sourced.ml.models import Topics\ntopics = Topics().load(%s)\nprint(\"Number of topics:\", len(topics))\nprint(\"Number of repositories:\", len(topics.tokens))", "version": [0, 3, 0]}}, "bow": {"694c20a0-9b96-4444-80ae-f2fa5bd1395b": {"version": [1, 0, 0], "created_at": "2018-07-17 10:28:56.243131", "parent": "", "dependencies": ["55215392-36fc-43e5-b277-500f5b68d0c6"], "references": [], "extra": {"Number of distinct documents (files)": "3,512,171", "Data collection date": "July 2018", "Other parts": "[da8c5dee-b285-4d55-8913-a5209f716564](da8c5dee-b285-4d55-8913-a5209f716564.md) and [1e0deee4-7dc1-400f-acb6-74c0f4aec471](1e0deee4-7dc1-400f-acb6-74c0f4aec471.md)", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in July 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try to use [apollo](https://github.com/src-d/apollo) at scale. We hit `scipy.sparse` limits while trying to merge sparse matrices for all bags, so this is only one of three `BOW` models holding bags.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F694c20a0-9b96-4444-80ae-f2fa5bd1395b.asdf", "size": "26.0 GB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "license": ["", "none"]}, "da8c5dee-b285-4d55-8913-a5209f716564": {"created_at": "2018-07-17 09:43:05.498579", "parent": "", "dependencies": ["55215392-36fc-43e5-b277-500f5b68d0c6"], "references": [], "extra": {"Number of distinct documents (files)": "3,493,288", "Data collection date": "July 2018", "Other parts": "[694c20a0-9b96-4444-80ae-f2fa5bd1395b](694c20a0-9b96-4444-80ae-f2fa5bd1395b.md) and [1e0deee4-7dc1-400f-acb6-74c0f4aec471](1e0deee4-7dc1-400f-acb6-74c0f4aec471.md)", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in July 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try to use [apollo](https://github.com/src-d/apollo) at scale. We hit `scipy.sparse` limits while trying to merge sparse matrices for all bags, so this is only one of three `BOW` models holding bags.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2Fda8c5dee-b285-4d55-8913-a5209f716564.asdf", "license": ["", "none"], "size": "25.8 GB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "version": [1, 0, 0]}, "1e0deee4-7dc1-400f-acb6-74c0f4aec471": {"version": [1, 0, 0], "created_at": "2018-07-17 10:16:51.105969", "parent": "", "dependencies": ["55215392-36fc-43e5-b277-500f5b68d0c6"], "references": [], "extra": {"Number of distinct documents (files)": "864,458", "Data collection date": "July 2018", "Other parts": "[694c20a0-9b96-4444-80ae-f2fa5bd1395b](694c20a0-9b96-4444-80ae-f2fa5bd1395b.md) and [da8c5dee-b285-4d55-8913-a5209f716564](da8c5dee-b285-4d55-8913-a5209f716564.md)", "Number of distinct features": "6,194,874"}, "description": "Bags of features, extracted in July 2018 from 7.8 million distinct files from PGA (taking only the `HEAD` commit), using all implemented extractors in `sourced.ml` at the time (`identifiers`, `literals`, `graphlets`, `children`, `node2vec` and `uast2seq`) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try to use [apollo](https://github.com/src-d/apollo) at scale. We hit `scipy.sparse` limits while trying to merge sparse matrices for all bags, so this is only one of three `BOW` models holding bags.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F1e0deee4-7dc1-400f-acb6-74c0f4aec471.asdf", "size": "5.9 GB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "license": ["", "none"]}, "1e3da42a-28b6-4b33-94a2-a5671f4102f4": {"created_at": "2017-06-19 09:16:08.942880", "parent": "", "dependencies": ["f64bacd4-67fb-4c64-8382-399a8e7db52a"], "references": [["Similarity of GitHub Repositories by Source Code Identifiers", "http://vmarkovtsev.github.io/techtalks-2017-moscow/#"]], "extra": {"Number of (sub)tokens": "999,424", "Number of repositories": "112,273", "Data collection date": "October 2016"}, "description": "Bags of identifiers generated from 140,000 most starred projects on GitHub in October 2016 - ~112k after deduplication.", "url": "https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F1e3da42a-28b6-4b33-94a2-a5671f4102f4.asdf", "license": ["", "undecided"], "size": "380.8 MB", "code": "from sourced.ml.models import BOW\nbow = BOW().load(%s)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))", "version": [1, 0, 0]}}}}
{
Collaborator commented:

Can you please split this PR into two:

  1. Readable JSON, no other changes.
  2. The extra Swagger you add to the JSON and the Markdown generated from it.
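Step 1 of the suggested split could be as simple as re-serializing the existing one-line file; a minimal standard-library sketch (the compact string below is a truncated stand-in for the real registry):

```python
import json

# Truncated stand-in for the one-line registry file.
compact = '{"meta": {"bow": {"default": "1e3da42a-28b6-4b33-94a2-a5671f4102f4"}}}'

# Re-serialize as indented, key-sorted JSON: readable, diff-friendly,
# and semantically identical to the compact original.
readable = json.dumps(json.loads(compact), indent=2, sort_keys=True)
print(readable)
```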

@glimow glimow mentioned this pull request May 8, 2019
Signed-off-by: tristan kalos <tristan.kalos@live.fr>
@glimow glimow closed this May 22, 2019