Ruan Chaves edited this page Jun 5, 2023 · 11 revisions

✂️ hashformers

  1. What Is Hashformers?
  2. Quick Start
  3. Learn More

What Is Hashformers?

Hashtag segmentation is the task of automatically adding spaces between the words in a hashtag. Hashformers is the current state of the art for hashtag segmentation. It is also language-agnostic: you can use it to segment hashtags not only in English, but in any language with a model on the Hugging Face Model Hub.
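To make the task concrete, here is a toy sketch that splits a hashtag into words from a small hardcoded vocabulary. This is purely illustrative and is not how hashformers works internally: hashformers scores candidate segmentations with a language model instead of a fixed word list.

```python
# Toy illustration of the hashtag segmentation task (NOT the hashformers
# algorithm): recursively split a string into words from a tiny vocabulary.
VOCAB = {"ice", "cold", "we", "need", "a", "national", "park"}

def segment(text):
    """Return one segmentation of `text` into VOCAB words, or None."""
    if not text:
        return []
    # Try longer prefixes first so "national" beats "nation" + leftovers.
    for i in range(len(text), 0, -1):
        word, rest = text[:i], text[i:]
        if word in VOCAB:
            tail = segment(rest)
            if tail is not None:
                return [word] + tail
    return None

print(" ".join(segment("icecold")))              # ice cold
print(" ".join(segment("weneedanationalpark")))  # we need a national park
```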

Quick Start

This quick start guide covers everything most users will need to know about the library.

Installation

To begin, install hashformers using pip:

pip install hashformers

Loading the Hashtag Segmenter

Once you have hashformers installed, you can load a hashtag segmenter using your preferred model. Here's an example using the GPT-2 model:

from hashformers import TransformerWordSegmenter as WordSegmenter
ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="gpt2"
)

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)

# ['we need a national park',
#  'ice cold']

Available Model Types

Hashformers uses the minicons library to load models and compute the log-likelihood of candidate hashtag segmentations. You can use any model that can be loaded through the minicons library.

The following model types are available:

  • gpt2: a legacy alias, kept for backwards compatibility. It loads the IncrementalLMScorer in minicons.scorer.
  • bert: a legacy alias, kept for backwards compatibility. It loads the MaskedLMScorer in minicons.scorer.
  • incremental: loads an IncrementalLMScorer.
  • masked: loads a MaskedLMScorer.
  • seq2seq: loads a Seq2SeqScorer.

For more information on the scorers, check the minicons repository.
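The alias resolution described above can be pictured as a simple lookup table. The sketch below is illustrative only, not the library's actual code; it also passes unknown values through unchanged, which mirrors the class-name usage shown later on this page.

```python
# Illustrative mapping of hashformers model_type values to minicons scorer
# class names, per the list above (a sketch, not hashformers internals).
ALIAS_TO_SCORER = {
    "gpt2": "IncrementalLMScorer",        # legacy alias
    "bert": "MaskedLMScorer",             # legacy alias
    "incremental": "IncrementalLMScorer",
    "masked": "MaskedLMScorer",
    "seq2seq": "Seq2SeqScorer",
}

def resolve_scorer(model_type):
    """Map an alias to a scorer class name; pass class names through as-is."""
    return ALIAS_TO_SCORER.get(model_type, model_type)

print(resolve_scorer("gpt2"))                 # IncrementalLMScorer
print(resolve_scorer("IncrementalLMScorer"))  # IncrementalLMScorer
```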

While it is possible to use any model as the segmenter, our research shows that incremental models demonstrate good results in this role.

Example: German Hashtag Segmentation

Suppose you want to segment hashtags in German. You can use a German GPT-2 model from the Hugging Face Model Hub:

ws = WordSegmenter(
    segmenter_model_name_or_path="dbmdz/german-gpt2",
    segmenter_model_type="incremental"
)

Example: Large Language Models

You can also use large language models (LLMs) for hashtag segmentation. Here we combine GPT-J and Dolly:

ws = WordSegmenter(
    segmenter_model_name_or_path="EleutherAI/gpt-j-6b",
    segmenter_model_type="incremental",
    reranker_model_name_or_path="databricks/dolly-v2-3b",
    reranker_model_type="incremental"
)
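Conceptually, the segmenter proposes and ranks candidate segmentations, and the reranker then rescores the top candidates. The sketch below illustrates this two-stage idea with stand-in scoring functions; the function names and scoring rules are invented for illustration and are not the actual hashformers internals, which use language-model log-likelihoods.

```python
# Toy two-stage pipeline: a "segmenter" score ranks all candidates, then a
# "reranker" score reorders the top-k. Both scores are stand-ins for the
# language-model log-likelihoods hashformers computes.
def segmenter_score(candidate):
    # Stand-in heuristic: prefer segmentations with fewer words.
    return -len(candidate.split())

def reranker_score(candidate):
    # Stand-in second opinion from a different "model".
    return 1.0 if "ice cold" in candidate else 0.0

candidates = ["ice cold", "icec old", "i ce cold"]
top_k = sorted(candidates, key=segmenter_score, reverse=True)[:2]
best = max(top_k, key=lambda c: (reranker_score(c), segmenter_score(c)))
print(best)  # ice cold
```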

Using Class Names for Model Types

You can also call scorers by their class name directly. This can be useful if minicons implements a scorer that is not listed above. Here's an example:

# Achieving the same result using class name
ws = WordSegmenter(
    segmenter_model_name_or_path="dbmdz/german-gpt2",
    segmenter_model_type="IncrementalLMScorer"
)

Learn More

What we have covered so far will be sufficient for most projects using the hashformers library: simply load a model as the segmenter (we recommend using GPT-2 or some other incremental model) and start segmenting hashtags in your projects.

The remainder of this documentation covers use cases that may be useful for advanced users or developers who are seeking ways to contribute to the library.

  • Frequently-Asked-Questions-(FAQ) : Frequently asked questions about the library.

  • Segmenters : In-depth documentation of the TransformerWordSegmenter class and other segmenter classes available in the library for hashtag segmentation, such as TweetSegmenter and RegexWordSegmenter.

  • Benchmarking : How to benchmark your segmenters, and benchmarks for some common datasets.