unitoken

Working with large volumes of webtext can be a bit arduous, as the web contains many different languages. This package provides an automated way of tokenizing webtext in different languages by first detecting the language of a string, and then using an appropriate language-specific tokenizer to tokenize it. This is useful, for example, as preprocessing for subword tokenization.

It exposes two functions:

  • tokenize(str): tokenizes your text into words.
  • sent_tokenize(str): tokenizes your text into sentences, which are again split into words.

Both functions also take a dictionary mapping language codes to spacy models. This is handy if you'd like to use models you defined yourself.

Both functions return a tuple: the first item is the tokens in the text, and the second is a LanguageResult, indicating the detected language, the confidence, and whether the language was valid in the current context. If the language is invalid, i.e., valid == False, the returned list of tokens will be empty.

Example

from unitoken import tokenize

my_sentences = ["日本、朝鮮半島、中国東部まで広く分布し、その姿や鳴き声はよく知られている。",
"ประเทศโครเอเชียปรับมาใช้สกุลเงินยูโรและเข้าร่วมเขตเชงเกน", "The dog walked home and had a nice cookie."]

tokenized = []
for sentence in my_sentences:
    tokenized.append(tokenize(sentence))
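
Each call above returns a tuple, so you can also inspect the detection result. A minimal sketch, assuming the LanguageResult exposes language, confidence, and valid fields as described above (the exact attribute names are an assumption):

from unitoken import tokenize

# NOTE: the attribute names on the LanguageResult (language, confidence,
# valid) are assumed from the description above; check the package for the
# exact names.
tokens, info = tokenize("The dog walked home and had a nice cookie.")
if info.valid:
    print(info.language, info.confidence)
    print(tokens)
else:
    print("Language detection was not valid; the token list is empty.")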

Using a pre-defined model:

import spacy
from unitoken import tokenize

# Assumes en_core_web_sm is installed.
predefined_models = {"en": spacy.load("en_core_web_sm")}

my_sentences = ["日本、朝鮮半島、中国東部まで広く分布し、その姿や鳴き声はよく知られている。",
"ประเทศโครเอเชียปรับมาใช้สกุลเงินยูโรและเข้าร่วมเขตเชงเกน", "The dog walked home and had a nice cookie."]

tokenized = []
for sentence in my_sentences:
    tokenized.append(tokenize(sentence, models=predefined_models))
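
sent_tokenize works the same way, but the text is first split into sentences. A minimal sketch, assuming it returns the nested sentence/word structure described above together with a LanguageResult:

from unitoken import sent_tokenize

text = "The dog walked home. It had a nice cookie."

# Assumed return shape: a list of sentences, each itself a list of word
# tokens, plus the LanguageResult described above.
sentences, info = sent_tokenize(text)
for sentence in sentences:
    print(sentence)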

Installation

It can be installed by cloning the repository and running setup.py:

python3 setup.py install

or with pip, directly from the repository:

pip install git+https://github.com/stephantul/unitoken.git

Credits

This package builds on top of spacy and fasttext. I didn't really do a lot of work.

Improvements

  • The current implementation assumes that a text consists of a single language. This is obviously false. We could try a window approach to find windows in which languages differ, but this requires some experimentation; a rough sketch of the idea follows below.
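
A rough, untested sketch of that window idea, using only the tokenize function described above (the window size, the whitespace pre-splitting, and the language/valid attributes on LanguageResult are all assumptions):

from unitoken import tokenize

def detect_language_windows(text, window_size=20):
    """Assign a detected language to each window of roughly window_size words.

    Purely illustrative: the window size, the whitespace pre-splitting, and
    the LanguageResult attributes used here are assumptions.
    """
    words = text.split()
    results = []
    for start in range(0, len(words), window_size):
        window = " ".join(words[start:start + window_size])
        tokens, info = tokenize(window)
        results.append((window, info.language if info.valid else None))
    return results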

Author

Stéphan Tulkens

License

MIT

Version

0.2.0

