This tool is built on top of ast2vec Machine Learning models.
It provides an API and tools to train and use models for ecosystem exploratory snippet mining. It can help you learn new libraries faster and speed up your coding. The module allows you to train and use a hierarchical topic model on top of babelfish UASTs for any library you want.
Snippet Ranger is currently under active development.
```
pip3 install git+https://github.com/src-d/snippet-ranger
```
The project exposes two interfaces: API and command line. The command line is:
```
snippet_ranger --help
```
1. Get list of dependent repositories.
You should have the libraries.io (v1.0.0) dataset on your disk. You can download it here: https://libraries.io/data
Example for the numpy library:
```
snippet_ranger dependent_reps --librariesio_data ../libio/ -o . --libraries numpy:https://github.com/numpy/numpy
```
There are examples of output files in the data folder. You can use them to try snippet_ranger without needing to download the libraries.io dataset.
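The output of this step can be consumed directly from Python. A minimal sketch, assuming the file is plain text with one repository URL per line (this format is an assumption based on the example files in the data folder):

```python
# Sketch: load a dependent-repositories list produced by `dependent_reps`.
# Assumption: the file is plain text with one repository URL per line.

def load_repo_list(path):
    """Return non-empty, stripped lines from the file at `path`."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

For example, `load_repo_list("data/numpy_dependent_reps.txt")` would give you the list of repositories to clone in the next step.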
2. Clone repositories
Use ast2vec clone for this. It requires enry; install it via ast2vec enry if you do not have it.
Example:
```
ast2vec clone --ignore -o data/repos/numpy -t 16 --languages Python --linguist ./enry numpy.txt
```
You can skip this second step if you do not want to store repositories, but installing enry is still necessary.
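If you prefer to manage cloning yourself, the repository list can be turned into plain `git` commands. A rough sketch only: `ast2vec clone` additionally runs enry to filter by language, which this does not replicate:

```python
# Sketch: derive `git clone` commands for each repository URL in a list.
# This does NOT replace `ast2vec clone` (no enry language filtering).
import os

def clone_commands(urls, out_dir):
    """Build (command, target_dir) pairs without executing anything."""
    pairs = []
    for url in urls:
        name = url.rstrip("/").rsplit("/", 1)[-1]
        target = os.path.join(out_dir, name)
        pairs.append((["git", "clone", "--depth", "1", url, target], target))
    return pairs
```

Each pair can then be run with `subprocess.check_call` if you really want to clone outside of ast2vec.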
3. Convert to Source modelforge models
Use ast2vec repos2source for this.
You should have a bblfsh server running. Please use server v0.7.0 and Python driver v0.8.2:
```
BBLFSH_DRIVER_IMAGES="python=docker://bblfsh/python-driver:v0.8.2" docker run -e BBLFSH_DRIVER_IMAGES --rm --privileged -d -p 9432:9432 --name bblfsh bblfsh/server:v0.7.0 --log-level DEBUG
```
Example:
```
ast2vec repos2source -p 2 -t 8 --organize-files 2 -o data/sources $( find data/repos/numpy -maxdepth 1 -mindepth 1 -type d | xargs)
```
If you skipped the second step, replace data/repos/numpy with data/numpy_dependent_reps.txt:
```
ast2vec repos2source -p 2 -t 8 --organize-files 2 -o data/sources data/numpy_dependent_reps.txt
```
Check the ast2vec topic modeling instructions to learn more about the parameters.
4. Get UAST for the library
If the library is for Python, you should install it first to avoid losing autogenerated files. The UAST is built from the installation directory:
```
snippet_ranger pylib2uast -p 1 -o ./data/libraries_uasts numpy
```
You can use other languages supported by bblfsh: just download the library sources and run ast2vec repo2uast on them.
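The reason for installing the Python library first is that generated modules exist only in the installed package, not in the source tree. An illustrative, stdlib-only way to see which directory an installed package is actually imported from (the directory pylib2uast would then walk):

```python
# Sketch: locate the installation directory of an installed package.
# The UAST for the library is built from such a directory.
import importlib.util

def package_location(name):
    """Return the directory (or file) a package is imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.submodule_search_locations:
        return list(spec.submodule_search_locations)[0]
    return spec.origin
```

For instance, `package_location("numpy")` points at the installed site-packages copy, not at a source checkout.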
5. Extract snippets from Source model
Use snippet_ranger source2func for this. The command does the following:
- Filters out files without library usage.
- Splits files into functions, or takes the full file if there are no functions (i.e., it is just a script).
- Filters out split results without library function calls.
More ways of snippet extraction can be added later.
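The three steps above can be made concrete with a small sketch. This is not snippet-ranger's implementation (which operates on babelfish UASTs and resolves actual library function calls); it uses Python's stdlib `ast` module and a simplistic name check that ignores import aliases, purely to illustrate the idea:

```python
# Illustrative sketch of the source2func idea using the stdlib `ast` module.
# snippet-ranger itself works on babelfish UASTs, not on Python `ast`.
import ast

def uses_library(tree, lib):
    """Step 1: does the file import `lib` at all?"""
    for node in ast.walk(tree):
        if isinstance(node, ast.Import) and \
                any(a.name.split(".")[0] == lib for a in node.names):
            return True
        if isinstance(node, ast.ImportFrom) and \
                (node.module or "").split(".")[0] == lib:
            return True
    return False

def references_library(node, lib):
    """Step 3 (simplified): does this snippet mention the library name?"""
    return any(isinstance(n, ast.Name) and n.id == lib for n in ast.walk(node))

def extract_snippets(source, lib):
    tree = ast.parse(source)
    if not uses_library(tree, lib):   # step 1: drop files without the library
        return []
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    candidates = funcs or [tree]      # step 2: functions, or the whole script
    return [ast.get_source_segment(source, n) if n is not tree else source
            for n in candidates if references_library(n, lib)]  # step 3
```

So a file defining both a numpy-using function and a pure-Python helper yields only the numpy-using one as a snippet.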
Example:
```
snippet_ranger source2func -p 8 --library_name numpy --library_uast ./data/libraries_uasts/numpy.asdf -o ./data/funcs/numpy/ ./data/sources/numpy
```
If you get several "All functions are filtered and you get empty model." errors, that is OK.
6. Create vowpal wabbit dataset
Here you have two ways. The default is to use all simple identifiers as tokens for document modeling, as described in points 3-4 of the ast2vec topic modeling instructions.
The other is to use only specific identifiers that can be found in the library UAST.
For now, this means only function calls (fc).
Use snippet2fc_df and snippet2fc_bow for the second approach.
Example:
```
mkdir ./data/dfs_fc
snippet_ranger snippet2fc_df -p 8 --library_name numpy --library_uast ./data/libraries_uasts/numpy.asdf ./data/funcs/numpy/ ./data/dfs_fc/numpy.asdf
snippet_ranger snippet2fc_bow -p 8 --df ./data/dfs_fc/numpy.asdf -v 1000000 ./data/funcs/numpy/ ./data/bows_fc/numpy
```
Then you need to do the same as in points 5-7 of the ast2vec topic modeling instructions:
```
python3 -m ast2vec join-bow -p 16 --bow ./data/bows_fc/numpy ./data/bows_fc/numpy.asdf
python3 -m ast2vec bow2vw --bow ./data/bows_fc/numpy.asdf -o ./data/vowpal_wabbit/numpy_fc.txt
```
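For reference, Vowpal Wabbit's plain-text format puts one document per line: a document name, a `|`-prefixed namespace, then `token:count` pairs. A sketch of serializing a bag-of-words this way (the exact namespace layout that bow2vw emits may differ; `"text"` is just a placeholder):

```python
# Sketch: write a document in Vowpal Wabbit plain-text form.
# Assumption: "doc_name |namespace token:count ..." layout; bow2vw's
# actual output may differ in namespace naming.

def to_vowpal_wabbit(doc_name, counts, namespace="text"):
    pairs = " ".join("%s:%d" % (tok, n) for tok, n in sorted(counts.items()))
    return "%s |%s %s" % (doc_name, namespace, pairs)
```

One such line per snippet is what the topic modeling stage below consumes.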
Ongoing
You should install the BigARTM library. The easy way is to use the ast2vec bigartm command (not implemented yet).
You can check out a simple draft experiment in the notebook using the BigARTM Python API.
We use PEP8 with line length 99. All the tests must pass:
```
unittest discover /path/to/ast2vec
```
Apache 2.0.