# [Hashformers](https://github.com/ruanchaves/hashformers)

Hashformers is a framework for hashtag segmentation with transformers. For more information, please check the [GitHub repository](https://github.com/ruanchaves/hashformers). 

# Installation

Here we install `mxnet-cu110` and `hashformers`.

`mxnet-cu110` is compatible with Google Colab. If you are installing in another environment, replace it with the mxnet package that matches your CUDA version.

In [2]:
%%capture

!pip install mxnet-cu110 
!pip install hashformers
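
Outside Colab, the package name has to match the local CUDA version. The sketch below assumes the `mxnet-cu<major><minor>` naming pattern that `mxnet-cu110` follows; `mxnet_package_for_cuda` is a hypothetical helper written for this notebook, not part of mxnet or hashformers, and it does not check that a build for the resulting name actually exists on PyPI.

```python
import re

def mxnet_package_for_cuda(cuda_version):
    """Map a CUDA version string such as '11.0' to a pip package name.

    Assumption: the package follows the `mxnet-cu<major><minor>` pattern
    used by `mxnet-cu110`. Confirm on PyPI that the build exists.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)(?:\.\d+)?", cuda_version.strip())
    if match is None:
        raise ValueError(f"Unrecognized CUDA version: {cuda_version!r}")
    major, minor = match.groups()
    return f"mxnet-cu{major}{minor}"

print(mxnet_package_for_cuda("11.0"))
```

The CUDA version itself can usually be read from `nvcc --version` or `nvidia-smi`.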

# Loading the models

Here we initialize a simple word segmenter by selecting `distilgpt2` as the segmenter model.

In [3]:
%%capture

from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="distilgpt2",
    reranker_model_name_or_path=None
)

# Hashtag Segmentation with hashformers

After installing the library and loading the models, we can segment hashtags and look at the segmentations. 

Write hashtags below, one per line.

In [4]:
hashtags = """
#myoldphonesucks
#latinosinthedeepsouth
"""

In [None]:
hashtag_list = [ x.strip() for x in hashtags.split("\n") if x.strip()]
segmentation = ws.segment(hashtag_list)
print(*segmentation, sep='\n')
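
To keep each hashtag next to its segmentation, the two lists can be zipped into a dictionary. This is a minimal sketch that assumes `ws.segment` returns one segmentation per input hashtag, in input order; `parse_hashtags` and `pair_with_segmentations` are hypothetical helpers, and the `placeholder` list stands in for real model output so the cell runs without loading a model.

```python
def parse_hashtags(text):
    """Return the non-empty, stripped lines of a multi-line string."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def pair_with_segmentations(hashtag_list, segmentations):
    """Map each hashtag to the segmentation at the same position."""
    if len(hashtag_list) != len(segmentations):
        raise ValueError("Expected exactly one segmentation per hashtag.")
    return dict(zip(hashtag_list, segmentations))

hashtags = """
#myoldphonesucks
#latinosinthedeepsouth
"""

hashtag_list = parse_hashtags(hashtags)

# Placeholder values standing in for ws.segment(hashtag_list).
placeholder = ["my old phone sucks", "latinos in the deep south"]

pairs = pair_with_segmentations(hashtag_list, placeholder)
for hashtag, segmentation in pairs.items():
    print(f"{hashtag} -> {segmentation}")
```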

You can use **hashformers** to segment hashtags in any language, not just English. Visit the [HuggingFace Model Hub](https://huggingface.co/models) and choose a GPT-2 model and a BERT model for the WordSegmenter class.

Pass the GPT-2 model as `segmenter_model_name_or_path` and the BERT model as `reranker_model_name_or_path`. The segmenter is required; the reranker is optional.

In [6]:
%%capture

from hashformers import TransformerWordSegmenter as WordSegmenter

portuguese_ws = WordSegmenter(
    segmenter_model_name_or_path="pierreguillou/gpt2-small-portuguese",
    reranker_model_name_or_path="neuralmind/bert-base-portuguese-cased"
)

In [7]:
hashtag_list = [
    "#benficamemes",
    "#mouraria",
    "#CristianoRonaldo"
]

segmentations = portuguese_ws.segment(hashtag_list)

print(*segmentations, sep='\n')

ben ficam em es
m ouraria
Cristiano Ronaldo


# Advanced usage

## Speeding up

If you want to investigate the speed-accuracy trade-off, here are a few things that can be done to improve the speed of the segmentations:


* Turn off the reranker model by passing `use_reranker=False` to the `ws.segment` method.

* Adjust the `segmenter_gpu_batch_size` (default: `1`) and the `reranker_gpu_batch_size` (default: `2000`) parameters in the `WordSegmenter` initialization.

* Decrease the beam search parameters `topk` (default: `20`) and `steps` (default: `13`) when calling the `ws.segment` method.

In [8]:
%%capture

from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="distilgpt2",
    reranker_model_name_or_path="distilbert-base-uncased",
    segmenter_gpu_batch_size=1,
    reranker_gpu_batch_size=2000
)

In [9]:
%%timeit

hashtag_list = [
    "#myoldphonesucks",
    "#latinosinthedeepsouth",
    "#weneedanationalpark"
]

segmentations = ws.segment(hashtag_list)

1 loop, best of 5: 8.08 s per loop


In [10]:
%%timeit

hashtag_list = [
    "#myoldphonesucks",
    "#latinosinthedeepsouth",
    "#weneedanationalpark"
]

segmentations = ws.segment(
    hashtag_list,
    topk=5,
    steps=5,
    use_reranker=False
)

1 loop, best of 5: 3.03 s per loop
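
The `%%timeit` magic is only available inside IPython and Jupyter. In a plain Python script, the same comparison can be sketched with `time.perf_counter`. Here `best_time` and `compare_configs` are hypothetical helpers, and `fake_segment` is a stand-in for `ws.segment` so the example runs without a model; replace it with `ws.segment` to time the real segmenter.

```python
import time

def best_time(fn, repeats=5):
    """Return the fastest wall-clock time (in seconds) over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

def compare_configs(segment_fn, hashtags, configs, repeats=5):
    """Time `segment_fn(hashtags, **kwargs)` for each named configuration."""
    return {
        name: best_time(lambda kw=kwargs: segment_fn(hashtags, **kw), repeats)
        for name, kwargs in configs.items()
    }

# Stand-in for ws.segment; it only mimics the keyword arguments used above.
def fake_segment(hashtags, topk=20, steps=13, use_reranker=True):
    return [h.lstrip("#") for h in hashtags]

configs = {
    "default": {},
    "fast": {"topk": 5, "steps": 5, "use_reranker": False},
}

timings = compare_configs(fake_segment, ["#myoldphonesucks"], configs, repeats=3)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```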


## Getting the ranks

If you pass `return_ranks=True` to the `ws.segment` method, you will receive an object holding the ranks generated by the segmenter and the reranker, the dataframe used by the ensemble, and the final segmentations (accessed as attributes, e.g. `ranks.segmenter_rank`). A segmentation ranks higher if its score is **lower** than the scores of the other segmentations.

Rank outputs are useful if you want to combine the segmenter rank and the reranker rank in more sophisticated ways than the basic ensembler that ships with **hashformers**.

For instance, you may want to take two or more ranks (also called "runs"), convert them to the TREC format, and combine them with a rank fusion technique from the [trectools library](https://github.com/joaopalotti/trectools).

In [11]:
hashtag_list = [
    "#myoldphonesucks",
    "#latinosinthedeepsouth",
    "#weneedanationalpark"
]

ranks = ws.segment(
    hashtag_list,
    use_reranker=True,
    return_ranks=True
)

In [12]:
# Segmenter rank
ranks.segmenter_rank

| | characters | segmentation | score |
|---|---|---|---|
| 0 | latinosinthedeepsouth | latinos in the deep south | 50.041458 |
| 1 | latinosinthedeepsouth | latino s in the deep south | 53.423897 |
| 2 | latinosinthedeepsouth | latinosin the deep south | 53.662689 |
| 3 | latinosinthedeepsouth | la tinos in the deep south | 54.122768 |
| 4 | latinosinthedeepsouth | latinos in the deepsouth | 54.437469 |
| ... | ... | ... | ... |
| 905 | weneedanationalpark | weneed anatio nalpark | 80.100243 |
| 906 | weneedanationalpark | weneedanati onalpa rk | 80.674561 |
| 907 | weneedanationalpark | weneedanat ionalpa rk | 81.096085 |
| 908 | weneedanationalpark | weneedanat ionalpar k | 82.248749 |


In [13]:
# Reranker rank
ranks.reranker_rank

| | characters | segmentation | score |
|---|---|---|---|
| 0 | latinosinthedeepsouth | latinos in the deep south | 18.863357 |
| 1 | latinosinthedeepsouth | latino s in the deep south | 36.419517 |
| 2 | latinosinthedeepsouth | latinos in the deepsouth | 37.305017 |
| 3 | latinosinthedeepsouth | latin os in the deep south | 38.368534 |
| 4 | latinosinthedeepsouth | la tinos in the deep south | 38.611647 |
| ... | ... | ... | ... |
| 905 | weneedanationalpark | weneed a nati onalpark | 84.555845 |
| 906 | weneedanationalpark | w eneedanationalpar k | 85.361568 |
| 907 | weneedanationalpark | w eneedanationalp ark | 86.047094 |
| 908 | weneedanationalpark | w eneedanationa lpark | 86.134639 |
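
As one illustration of combining the two ranks, the sketch below applies reciprocal rank fusion, a common rank fusion technique, using plain pandas. It is a stand-in for illustration only: it is neither the ensembler that ships with hashformers nor the trectools API. The helpers `add_positions` and `reciprocal_rank_fusion` are hypothetical, the constant `k=60` is a conventional choice, and the toy scores are made up.

```python
import pandas as pd

def add_positions(rank_df):
    """Add a 1-based `position` per hashtag; the lowest score gets position 1."""
    out = rank_df.copy()
    out["position"] = out.groupby("characters")["score"].rank(
        method="first", ascending=True
    )
    return out

def reciprocal_rank_fusion(runs, k=60):
    """Sum 1 / (k + position) over all runs; a higher fused value is better."""
    parts = []
    for run in runs:
        positioned = add_positions(run)
        positioned["rrf"] = 1.0 / (k + positioned["position"])
        parts.append(positioned[["characters", "segmentation", "rrf"]])
    fused = (
        pd.concat(parts)
        .groupby(["characters", "segmentation"], as_index=False)["rrf"]
        .sum()
    )
    return fused.sort_values(
        ["characters", "rrf"], ascending=[True, False]
    ).reset_index(drop=True)

# Toy runs with made-up scores, standing in for the two rank dataframes.
segmenter_run = pd.DataFrame({
    "characters": ["abc"] * 3,
    "segmentation": ["ab c", "a bc", "abc"],
    "score": [10.0, 11.0, 12.0],
})
reranker_run = pd.DataFrame({
    "characters": ["abc"] * 3,
    "segmentation": ["abc", "ab c", "a bc"],
    "score": [5.0, 6.0, 7.0],
})

fused = reciprocal_rank_fusion([segmenter_run, reranker_run])
print(fused)
```

Note that the fused `rrf` column follows the opposite convention from `score`: larger values rank higher.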


## Evaluation 

The `evaluate_df` function evaluates the accuracy, F1, precision and recall of our segmentations. It uses the same evaluation method as previous authors in the field of hashtag segmentation (Çelebi et al., [BOUN Hashtag Segmentor](https://tabilab.cmpe.boun.edu.tr/projects/hashtag_segmentation/)).

We have to pass a dataframe with a field for the gold segmentations (the `gold_field`) and a field for our candidate segmentations (the `segmentation_field`).

The relationship between gold and candidate segmentations does not have to be one-to-one. If we pass more than one candidate segmentation for a single hashtag, `evaluate_df` measures the upper bound that can be achieved on our ranks (e.g. Acc@10, Recall@10).

### Minimal example

In [14]:
# Let's measure the actual performance of the segmenter: 
# we will evaluate only the top-1.
import pandas as pd
from hashformers.experiments.evaluation import evaluate_df

gold_segmentations = {
    "myoldphonesucks" : "my old phone sucks",
    "latinosinthedeepsouth": "latinos in the deep south",
    "weneedanationalpark": "we need a national park"
}

gold_df = pd.DataFrame(gold_segmentations.items(),
    columns=["characters", "gold"])

segmenter_top_1 = ranks.segmenter_rank.groupby('characters').head(1)

eval_df = pd.merge(gold_df, segmenter_top_1, on="characters")

eval_df

| | characters | gold | segmentation | score |
|---|---|---|---|---|
| 0 | myoldphonesucks | my old phone sucks | my old phone sucks | 34.331543 |
| 1 | latinosinthedeepsouth | latinos in the deep south | latinos in the deep south | 50.041458 |
| 2 | weneedanationalpark | we need a national park | we need a national park | 35.088081 |


In [15]:
evaluate_df(
    eval_df,
    gold_field="gold",
    segmentation_field="segmentation"
)

{'acc': 100.0, 'f1': 100.0, 'precision': 100.0, 'recall': 100.0}
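
To make the idea of an upper bound such as Acc@10 concrete, the sketch below computes accuracy at `k` with plain pandas: a hashtag counts as correct if its gold segmentation appears anywhere among its `k` lowest-scored candidates. This is an illustration of the concept, not the implementation of `evaluate_df`; `accuracy_at_k` is a hypothetical helper, the column names follow the dataframes above, and the toy scores are made up.

```python
import pandas as pd

def accuracy_at_k(gold_df, rank_df, k=10):
    """Percentage of hashtags whose gold segmentation is in their top-k candidates."""
    top_k = (
        rank_df.sort_values(["characters", "score"])
        .groupby("characters")
        .head(k)
    )
    # Inner merge: hashtags without any candidate are left out of the average.
    merged = pd.merge(gold_df, top_k, on="characters")
    merged["hit"] = merged["segmentation"] == merged["gold"]
    hits = merged.groupby("characters")["hit"].any()
    return 100.0 * hits.mean()

# Toy data with made-up scores.
toy_gold = pd.DataFrame({
    "characters": ["ab", "cd"],
    "gold": ["a b", "c d"],
})
toy_candidates = pd.DataFrame({
    "characters": ["ab", "ab", "cd", "cd"],
    "segmentation": ["ab", "a b", "c d", "cd"],
    "score": [1.0, 2.0, 1.0, 2.0],
})

print(accuracy_at_k(toy_gold, toy_candidates, k=1))
print(accuracy_at_k(toy_gold, toy_candidates, k=2))
```

In the toy data, the gold segmentation of `ab` is only the second-best candidate, so it counts as a miss at `k=1` and as a hit at `k=2`.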

### Benchmarking

Here we evaluate a `distilgpt2` model on roughly 1000 hashtags.

We draw our hashtags from 10 word segmentation datasets, taking the first 100 hashtags from each available split (`train`, `validation`, `test`) of each dataset. Splits that fail to load are skipped.

In [16]:
%%capture
!pip install datasets

In [17]:
%%capture
from hashformers.experiments.evaluation import evaluate_df
import pandas as pd
from hashformers import TransformerWordSegmenter
from datasets import load_dataset

user = "ruanchaves"

dataset_names = [
    "boun",
    "stan_small",
    "stan_large",
    "dev_stanford",
    "test_stanford",
    "snap",
    "hashset_distant",
    "hashset_manual",
    "hashset_distant_sampled",
    "nru_hse"
]

dataset_names = [ f"{user}/{dataset}" for dataset in dataset_names ]

ws = TransformerWordSegmenter(
    segmenter_model_name_or_path="distilgpt2",
    reranker_model_name_or_path=None
)

def generate_experiments(datasets, splits, samples=100):
    for dataset_name in datasets:
        for split in splits:
            try:
                dataset = load_dataset(dataset_name, split=f"{split}[0:{samples}]")
                yield {
                    "dataset": dataset,
                    "split": split,
                    "name": dataset_name
                }
            except Exception:
                # Skip dataset/split combinations that fail to load.
                continue

benchmark = []
for experiment in generate_experiments(dataset_names, ["train", "validation", "test"], samples=100):
    hashtags = experiment['dataset']['hashtag']
    annotations = experiment['dataset']['segmentation']
    segmentations = ws.segment(hashtags, use_reranker=False, return_ranks=False)

    # Build one row per hashtag with its gold and predicted segmentation.
    eval_df = pd.DataFrame([
        {
            "gold": gold,
            "hashtags": hashtag,
            "segmentation": segmentation
        }
        for gold, hashtag, segmentation in zip(annotations, hashtags, segmentations)
    ])

    eval_results = evaluate_df(
        eval_df,
        gold_field="gold",
        segmentation_field="segmentation"
    )

    # Tag the metrics with the dataset name and split they came from.
    eval_results.update({
        "name": experiment["name"],
        "split": experiment["split"]
    })
    benchmark.append(eval_results)

In [18]:
benchmark_df = pd.DataFrame(benchmark)
benchmark_df["name"] = benchmark_df["name"].apply(lambda x: x[(len(user) + 1):])
benchmark_df = benchmark_df.set_index(["name", "split"])
benchmark_df = benchmark_df.round(3)
benchmark_df

| name | split | f1 | acc | recall | precision |
|---|---|---|---|---|---|
| boun | validation | 94.577 | 93.0 | 92.766 | 96.46 |
| boun | test | 77.679 | 69.0 | 73.109 | 82.857 |
| stan_small | test | 73.926 | 71.0 | 70.492 | 77.711 |
| stan_large | train | 81.143 | 80.0 | 78.022 | 84.524 |
| stan_large | validation | 81.143 | 80.0 | 78.022 | 84.524 |
| stan_large | test | 79.599 | 80.0 | 75.796 | 83.803 |
| dev_stanford | validation | 78.75 | 78.0 | 77.301 | 80.255 |
| test_stanford | test | 68.896 | 69.474 | 62.424 | 76.866 |
| snap | train | 84.296 | 76.0 | 81.557 | 87.225 |
| hashset_distant | test | 86.331 | 78.0 | 86.022 | 86.643 |


In [19]:
benchmark_df.agg(['mean', 'std']).round(3)

| | f1 | acc | recall | precision |
|---|---|---|---|---|
| mean | 79.817 | 76.575 | 77.072 | 82.927 |
| std | 9.134 | 10.94 | 10.607 | 7.647 |
