✂️ hashformers

Hashtag segmentation is the task of automatically inserting spaces between the words in a hashtag.

Hashformers is the current state of the art for hashtag segmentation. On average, hashformers is 10% more accurate than the second-best hashtag segmentation library (more details in the docs).

Hashformers is also language-agnostic: you can use it to segment hashtags not just in English, but also in any language with a GPT-2 model on the Hugging Face Model Hub.

✂️ Get started - Google Colab tutorial

✂️ Read the documentation

Basic usage

from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    reranker_model_name_or_path="bert-base-uncased"
)

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)

# [ 'we need a national park',
# 'ice cold' ]
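
The segmentations can be substituted back into the original text. The snippet below is a minimal sketch of that step, not part of hashformers: `expand_hashtags` and `fake_segment` are hypothetical names, and the sketch assumes that `ws.segment` accepts a list of hashtags (with the leading `#`) and returns a list of strings in the same order, as the example above suggests. `fake_segment` stands in for `ws.segment` so the code runs without downloading any model.

```python
import re
from typing import Callable, List

# Matches a "#" followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#\w+")


def expand_hashtags(text: str, segment: Callable[[List[str]], List[str]]) -> str:
    """Replace every hashtag in `text` with its segmented form.

    `segment` is any callable that maps a list of hashtags to a list of
    segmentations in the same order (assumed to match `ws.segment`).
    """
    hashtags = HASHTAG_PATTERN.findall(text)
    if not hashtags:
        return text
    # Deduplicate while keeping order, then segment in a single batched call.
    unique = list(dict.fromkeys(hashtags))
    segmented = dict(zip(unique, segment(unique)))
    return HASHTAG_PATTERN.sub(lambda match: segmented[match.group(0)], text)


def fake_segment(hashtags: List[str]) -> List[str]:
    """Hypothetical stand-in for `ws.segment`, backed by a fixed lookup table."""
    lookup = {
        "#icecold": "ice cold",
        "#weneedanationalpark": "we need a national park",
    }
    return [lookup[hashtag] for hashtag in hashtags]


print(expand_hashtags("This drink is #icecold", fake_segment))
# This drink is ice cold
```

With a real segmenter, you would pass `ws.segment` in place of `fake_segment`, provided it returns a plain list of strings in your installed version.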

Installation

Hashformers is compatible with Python 3.7.

pip install hashformers

It is possible to use hashformers without a reranker:

from hashformers import TransformerWordSegmenter as WordSegmenter
ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    reranker_model_name_or_path=None
)

If you want to use a BERT model as a reranker, you must install mxnet. Here we install hashformers with mxnet-cu110, which is compatible with Google Colab. If you are installing in another environment, replace it with the mxnet package that matches your CUDA version.

pip install mxnet-cu110 
pip install hashformers
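
If the same script has to run both with and without mxnet, one option is to choose the reranker at runtime. The snippet below is a sketch under that assumption: `choose_reranker` is a hypothetical helper, not part of hashformers, and it only checks whether a module named `mxnet` can be imported. It does not verify that the installed mxnet build matches your CUDA version.

```python
import importlib.util
from typing import Optional


def choose_reranker(
    preferred: str = "bert-base-uncased",
    backend: str = "mxnet",
) -> Optional[str]:
    """Return `preferred` if the `backend` module is importable, else None."""
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(backend) is None:
        return None
    return preferred


reranker = choose_reranker()
print(reranker)  # "bert-base-uncased" if mxnet is installed, otherwise None
```

The result can then be passed as `reranker_model_name_or_path=reranker` when constructing `WordSegmenter`, so the segmenter falls back to running without a reranker when mxnet is missing.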

Contributing

Pull requests are welcome! Read our paper for more details on the inner workings of our framework.

If you want to develop the library, you can install hashformers directly from this repository (or your fork):

git clone https://github.com/ruanchaves/hashformers.git
cd hashformers
pip install -e .

Citation

@misc{rodrigues2021zeroshot,
      title={Zero-shot hashtag segmentation for multilingual sentiment analysis}, 
      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
      year={2021},
      eprint={2112.03213},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}