vtext


NLP in Rust with Python bindings

This package aims to provide a high-performance toolkit for ingesting textual data for machine learning applications.

The API is currently unstable.

Features

  • Tokenization: Regexp tokenizer, Unicode segmentation + language-specific rules
  • Stemming: Snowball (in Python, 15-20x faster than NLTK)
  • Token counting: converting token counts to sparse matrices for use in machine learning libraries. Similar to CountVectorizer and HashingVectorizer in scikit-learn, but with less broad functionality.
  • Levenshtein edit distance; Sørensen-Dice, Jaro, and Jaro-Winkler string similarities (see the sketch after this list)
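
As a rough illustration of the stemming and string-similarity features, here is a minimal sketch. The module paths and signatures used below (vtext.stem.SnowballStemmer, vtext.metrics.string.dice_similarity) are assumptions inferred from this feature list, not confirmed against the released package:

# Hedged sketch: module paths below are assumed, not verified.
from vtext.stem import SnowballStemmer            # assumed location
from vtext.metrics.string import dice_similarity  # assumed location

# Snowball (Porter2) stemming; the language argument form is assumed
stemmer = SnowballStemmer(lang="english")
print(stemmer.stem("continuously"))   # Porter2 English yields "continu"

# Sørensen-Dice similarity over character bigrams: 2·|A∩B| / (|A| + |B|)
# "healed" -> {he, ea, al, le, ed}, "sealed" -> {se, ea, al, le, ed}
# 4 shared bigrams out of 5 + 5 -> 2*4/10 = 0.8
print(dice_similarity("healed", "sealed"))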

Usage

Usage in Python

vtext requires Python 3.5+ and can be installed with,

pip install --pre vtext

Below is a simple tokenization example,

>>> from vtext.tokenize import VTextTokenizer
>>> VTextTokenizer("en").tokenize("Flights can't depart after 2:00 pm.")
["Flights", "ca", "n't", "depart" "after", "2:00", "pm", "."]

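Token counting works along the same lines. Below is a hedged sketch assuming the vectorize module mirrors scikit-learn's CountVectorizer/HashingVectorizer fit_transform API; the module path, class names, and the n_jobs parameter are assumptions based on the feature list above and the benchmarks below, not confirmed API:

# Hedged sketch: vtext.vectorize and its class names are assumed here.
from vtext.vectorize import CountVectorizer, HashingVectorizer

docs = ["Flights can't depart after 2:00 pm.",
        "Flights depart at 2:00 pm."]

# Learn a vocabulary and build a sparse document-term count matrix
X = CountVectorizer().fit_transform(docs)

# Hash tokens into a fixed number of columns; n_jobs (assumed parameter)
# would enable the parallelism measured in the benchmarks below
X_hashed = HashingVectorizer(n_jobs=4).fit_transform(docs)
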
For more details see the project documentation: vtext.io/doc/latest/index.html

Usage in Rust

Add the following to Cargo.toml,

[dependencies]
vtext = "0.1.0-alpha.2"

For more details see the Rust documentation: docs.rs/vtext

Benchmarks

Tokenization

The following benchmarks illustrate tokenization accuracy (F1 score) on Universal Dependencies (UD) treebanks,

lang   dataset   regexp   spaCy 2.1   vtext
en     EWT       0.812    0.972       0.966
en     GUM       0.881    0.989       0.996
de     GSD       0.896    0.944       0.964
fr     Sequoia   0.844    0.968       0.971

and the English tokenization speed,

                       regexp   spaCy 2.1   vtext
Speed (10⁶ tokens/s)   3.1      0.14        2.1

Text vectorization

Below are benchmarks for converting textual data to a sparse document-term matrix using the 20 newsgroups dataset,

Speed (MB/s)        scikit-learn 0.20.1   vtext (n_jobs=1)   vtext (n_jobs=4)
CountVectorizer     14                    49                 NA
HashingVectorizer   19                    78                 227

See benchmarks/README.md for more details.

License

vtext is released under the Apache License, Version 2.0.
