# TweetNLP Introduction
This Colab notebook gives a short introduction to [`tweetnlp`](https://github.com/cardiffnlp/tweetnlp), a Python library of NLP models for tweets. This tutorial covers the following applications on tweets:
- [Text Classification](#scrollTo=KAZYjeskBqL4): Sentiment/Hate/Irony/Emoji/Emotion, etc.
- [NER](#scrollTo=WeREiLEjBlrj): Named entity recognition
- [Question Answering](#scrollTo=reZDePaBmYhA&line=4&uniqifier=1): Answer prediction given a question and a context (SQuAD style)
- [Question Answer Generation](#scrollTo=uqd7sBHhnwym&line=6&uniqifier=1): Generation of question and answer pairs from a context
- [Language Modeling](#scrollTo=COOoZHVAFCIG): Masked token prediction
- [Fine-tuning](#scrollTo=2plrPTqk7OHp): Model fine-tuning


## Installation
TweetNLP is available on pip or can be installed from source.


In [1]:
# Fix Colab Error
!pip install --upgrade google-cloud-storage



In [2]:
# via pip
!pip install tweetnlp



In [3]:
# # via source
# !git clone https://github.com/cardiffnlp/tweetnlp
# %cd tweetnlp
# !pip install . -U

In [4]:
! pip list | grep tweetnlp

tweetnlp                         0.2.2


In [5]:
!pip uninstall -y transformers huggingface_hub


Found existing installation: transformers 4.21.2
Uninstalling transformers-4.21.2:
  Successfully uninstalled transformers-4.21.2
Found existing installation: huggingface-hub 0.24.5
Uninstalling huggingface-hub-0.24.5:
  Successfully uninstalled huggingface-hub-0.24.5


In [6]:
!pip install huggingface_hub==0.23.0 transformers==4.31.0


Collecting huggingface_hub==0.23.0
  Using cached huggingface_hub-0.23.0-py3-none-any.whl.metadata (12 kB)
Collecting transformers==4.31.0
  Using cached transformers-4.31.0-py3-none-any.whl.metadata (116 kB)
Using cached huggingface_hub-0.23.0-py3-none-any.whl (401 kB)
Using cached transformers-4.31.0-py3-none-any.whl (7.4 MB)
Installing collected packages: huggingface_hub, transformers
Successfully installed huggingface_hub-0.23.0 transformers-4.31.0


In [7]:
# Clone the repository
!git clone https://github.com/cardiffnlp/tweetnlp.git
%cd tweetnlp

# Install other dependencies
!pip install -r requirements.txt

# Install tweetnlp if not already installed
!pip install tweetnlp

import tweetnlp

fatal: destination path 'tweetnlp' already exists and is not an empty directory.
/content/tweetnlp
ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'


In [8]:
import transformers
import huggingface_hub

print("Transformers version:", transformers.__version__)
print("Huggingface Hub version:", huggingface_hub.__version__)


Transformers version: 4.31.0
Huggingface Hub version: 0.23.0
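
The install cells above pin `transformers==4.31.0` and `huggingface_hub==0.23.0`. As a minimal sketch, the pins can be re-checked in one place before loading any model. The helper names `installed_version` and `check_pins` are hypothetical and not part of `tweetnlp`; the pinned versions are copied from the install cell.

```python
from importlib import metadata

# Pins copied from the install cell above.
PINS = {"transformers": "4.31.0", "huggingface_hub": "0.23.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each package to a tuple (expected, found, matches)."""
    report = {}
    for package, expected in pins.items():
        found = installed_version(package)
        report[package] = (expected, found, found == expected)
    return report


for package, (expected, found, ok) in check_pins(PINS).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{package}: expected {expected}, found {found} -> {status}")
```

A `MISMATCH` line suggests that a later `pip install` changed one of the pinned packages and that the runtime may need to be restarted after reinstalling.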


In [9]:
!pip install -U accelerate



In [10]:
import tweetnlp

All you need to do is import `tweetnlp`!

## Tweet Classification
The classification module consists of seven different tasks (Topic Classification, Sentiment Analysis, Irony Detection, Hate Speech Detection, Offensive Language Detection, Emoji Prediction, and Emotion Analysis).
In each example, the model is instantiated by `tweetnlp.load_model("task-name")`, and prediction is run by passing a text or a list of texts as an argument to the corresponding function.

### Topic Classification
The aim of this task is to assign to a given tweet the topics related to its content. The task is framed as a supervised multi-label classification problem in which each tweet is assigned one or more topics from a total of 19 available topics. The topics were carefully curated based on Twitter trends, with the aim of being broad and general, and include classes such as arts and culture, music, or sports. Our internally annotated dataset contains over 10K manually labeled tweets (check the paper [here](https://arxiv.org/abs/2209.09824), or the [huggingface dataset page](https://huggingface.co/datasets/cardiffnlp/tweet_topic_single)).

***Multi-label Model***

In [11]:
model = tweetnlp.load_model('topic_classification')  # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")  # Or `model.predict`

{'label': ['celebrity_&_pop_culture', 'music']}

In [12]:
# Note: the probability of the multi-label model is the output of a sigmoid function on the binary prediction of whether each topic is positive or negative.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)

{'label': ['celebrity_&_pop_culture', 'music'],
 'probability': {'arts_&_culture': 0.037371691316366196,
  'business_&_entrepreneurs': 0.010188562795519829,
  'celebrity_&_pop_culture': 0.92448890209198,
  'diaries_&_daily_life': 0.03425709903240204,
  'family': 0.007961373776197433,
  'fashion_&_style': 0.020642103627324104,
  'film_tv_&_video': 0.0806259736418724,
  'fitness_&_health': 0.006343095097690821,
  'food_&_dining': 0.004288368858397007,
  'gaming': 0.004327300935983658,
  'learning_&_educational': 0.010652054101228714,
  'music': 0.8291938304901123,
  'news_&_social_concern': 0.2468821108341217,
  'other_hobbies': 0.020671192556619644,
  'relationships': 0.020371057093143463,
  'science_&_technology': 0.0170074962079525,
  'sports': 0.014291051775217056,
  'travel_&_adventure': 0.010423894971609116,
  'youth_&_student_life': 0.008605164475739002}}
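
Because each topic is scored independently with a sigmoid, the probabilities above do not need to sum to 1, and several topics can be selected at once. The following is a minimal sketch of that selection step. The 0.5 threshold and the helper name `select_labels` are assumptions for illustration and are not taken from the `tweetnlp` source.

```python
import math


def sigmoid(logit):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def select_labels(probability, threshold=0.5):
    """Keep every topic whose independent probability reaches the threshold."""
    return [label for label, p in probability.items() if p >= threshold]


# A subset of the probabilities printed above (rounded).
probability = {
    "arts_&_culture": 0.0374,
    "celebrity_&_pop_culture": 0.9245,
    "film_tv_&_video": 0.0806,
    "music": 0.8292,
    "news_&_social_concern": 0.2469,
    "sports": 0.0143,
}

print(select_labels(probability))                 # assumed default threshold
print(select_labels(probability, threshold=0.2))  # a looser threshold adds topics
print(sigmoid(0.0))                               # a logit of 0 sits exactly at 0.5
```

Lowering the threshold trades precision for recall: more topics are returned, including weaker ones.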

***Single-label Model***

In [13]:
model = tweetnlp.load_model('topic_classification', multi_label=False)  # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")

{'label': 'pop_culture'}

In [14]:
# Note: the probability of the single-label model is the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)

{'label': 'pop_culture',
 'probability': {'arts_&_culture': 9.206245886161923e-05,
  'business_&_entrepreneurs': 6.916998972883448e-05,
  'pop_culture': 0.9995898604393005,
  'daily_life': 0.00011083026038249955,
  'sports_&_gaming': 8.668459486216307e-05,
  'science_&_technology': 5.152115045348182e-05}}
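
Unlike the multi-label case, a softmax produces one distribution over all labels, so the probabilities sum to 1 and the prediction is the single most probable label. The sketch below checks this on the probabilities printed above. The `softmax` helper and its example logits are illustrative assumptions, not values from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Probabilities printed above for the single-label model.
probability = {
    "arts_&_culture": 9.206245886161923e-05,
    "business_&_entrepreneurs": 6.916998972883448e-05,
    "pop_culture": 0.9995898604393005,
    "daily_life": 0.00011083026038249955,
    "sports_&_gaming": 8.668459486216307e-05,
    "science_&_technology": 5.152115045348182e-05,
}

total = sum(probability.values())
best = max(probability, key=probability.get)
print(f"sum = {total:.6f}, best = {best}")

# Hypothetical logits, only to show the softmax itself.
print([round(p, 3) for p in softmax([2.0, 1.0, 0.1])])
```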

***Dataset***

In [15]:
# Install or update necessary libraries
!pip install --upgrade datasets tweetnlp

from datasets import load_dataset

# Load two splits by name (note: both variables below read from 'cardiffnlp/tweet_topic_single')
dataset_multi_label = load_dataset('cardiffnlp/tweet_topic_single', split='train_2021')
dataset_single_label = load_dataset('cardiffnlp/tweet_topic_single', split='test_2021')

# Display a sample from each dataset to verify
print("Multi-label dataset sample:", dataset_multi_label[0])
print("Single-label dataset sample:", dataset_single_label[0])

# Function to get labels from the dataset
def get_labels(dataset):
    unique_labels = set()
    for example in dataset:
        if 'labels' in example:
            unique_labels.update(example['labels'])
        elif 'label' in example:
            unique_labels.add(example['label'])
    return sorted(list(unique_labels))

# Prepare labels and label-to-id mappings
labels_multi_label = get_labels(dataset_multi_label)
labels_single_label = get_labels(dataset_single_label)

label2id_multi_label = {label: idx for idx, label in enumerate(labels_multi_label)}
label2id_single_label = {label: idx for idx, label in enumerate(labels_single_label)}

print("Multi-label label to id mapping:", label2id_multi_label)
print("Single-label label to id mapping:", label2id_single_label)



In [16]:
dataset_multi_label

Dataset({
    features: ['text', 'date', 'label', 'label_name', 'id'],
    num_rows: 1516
})

In [17]:
label2id_multi_label

{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5}

In [18]:
dataset_single_label

Dataset({
    features: ['text', 'date', 'label', 'label_name', 'id'],
    num_rows: 1693
})

In [19]:
label2id_single_label

{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
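
The two mappings above go from integer to integer because the `label` column already stores class ids, so `get_labels` collects ids instead of names. If a name-to-id mapping is wanted, one option is to pair the `label_name` and `label` columns listed in the dataset features. This is a minimal sketch: `build_label2id` is a hypothetical helper, and the toy rows (texts and id assignment) are made up for illustration.

```python
def build_label2id(rows):
    """Pair each `label_name` with its integer `label`, checking consistency."""
    label2id = {}
    for row in rows:
        name, idx = row["label_name"], row["label"]
        if label2id.setdefault(name, idx) != idx:
            raise ValueError(f"label {name!r} maps to more than one id")
    # Sort by id so the mapping reads in class order.
    return dict(sorted(label2id.items(), key=lambda item: item[1]))


# Toy rows with the same `label` / `label_name` fields as the dataset;
# the texts and the id assignment are made up for illustration.
rows = [
    {"text": "example tweet 1", "label": 2, "label_name": "pop_culture"},
    {"text": "example tweet 2", "label": 0, "label_name": "arts_&_culture"},
    {"text": "example tweet 3", "label": 2, "label_name": "pop_culture"},
]

print(build_label2id(rows))
```

The same function should also accept a `datasets.Dataset`, since iterating over one yields dictionaries keyed by feature name.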

### Sentiment Analysis
The sentiment analysis task integrated in TweetNLP is a simplified version where the goal is to classify the sentiment of a tweet as one of the following three labels: positive, neutral or negative. The base dataset for English is the unified TweetEval version of the SemEval-2017 dataset from the task on Sentiment Analysis in Twitter (check the paper [here](https://arxiv.org/pdf/2010.12421.pdf)).

***English Model***

In [20]:
import tweetnlp

model = tweetnlp.load_model('sentiment')  # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving👍")  # Or `model.predict`

{'label': 'positive'}

In [21]:
model.sentiment("Yes, including Medicare and social security saving👍", return_probability=True)

{'label': 'positive',
 'probability': {'negative': 0.004584966693073511,
  'neutral': 0.19360849261283875,
  'positive': 0.8018065094947815}}

***Multilingual Model***

In [22]:
model = tweetnlp.load_model('sentiment', multilingual=True)  # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ✨")

{'label': 'positive'}

In [23]:
model.sentiment("天気が良いとやっぱり気持ち良いなあ✨", return_probability=True)

{'label': 'positive',
 'probability': {'negative': 0.028369639068841934,
  'neutral': 0.08128832280635834,
  'positive': 0.8903420567512512}}

***Dataset***

In [24]:
dataset, label2id = tweetnlp.load_dataset('sentiment')



In [25]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45615
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12284
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [26]:
label2id

{'negative': 0, 'neutral': 1, 'positive': 2}
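
The `label2id` dictionary maps label names to the integers stored in the dataset's `label` column. A small sketch of the reverse direction, turning integer labels back into names and counting them, is shown below. The list of integer labels is hypothetical and stands in for a real `label` column.

```python
from collections import Counter

label2id = {"negative": 0, "neutral": 1, "positive": 2}

# Invert the mapping so integer labels can be read as names.
id2label = {idx: label for label, idx in label2id.items()}

# Hypothetical integer labels, standing in for a dataset's `label` column.
labels = [2, 1, 2, 0, 2, 1]
names = [id2label[i] for i in labels]
print(names)
print(Counter(names).most_common())
```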

In [27]:
for l in ['arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=l)
    print(dataset_multilingual)
    print(label2id_multilingual)
    print()

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1839
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 324
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 870
    })
})
{'negative': 0, 'neutral': 1, 'positive': 2}

DatasetDict({

### Irony Detection
This is a binary classification task where, given a tweet, the goal is to detect whether it is ironic or not. It is based on the Irony Detection dataset from the SemEval 2018 task (check the paper [here](https://arxiv.org/pdf/2010.12421.pdf)).

***Model***

In [28]:
model = tweetnlp.load_model('irony')  # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media')  # Or `model.predict`

{'label': 'irony'}

In [29]:
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)

{'label': 'irony',
 'probability': {'non_irony': 0.083908811211586, 'irony': 0.9160912036895752}}
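
With `return_probability=True`, a binary task such as irony detection exposes the score of each class, which allows a stricter or looser decision rule than the default label. The sketch below applies a custom threshold to the probabilities printed above. The `decide` helper and the threshold values are assumptions for illustration and are not part of `tweetnlp`.

```python
def decide(probability, positive="irony", negative="non_irony", threshold=0.5):
    """Return the positive label only if its probability reaches the threshold."""
    return positive if probability[positive] >= threshold else negative


# Probabilities printed above.
probability = {"non_irony": 0.083908811211586, "irony": 0.9160912036895752}

print(decide(probability))                  # assumed default threshold of 0.5
print(decide(probability, threshold=0.95))  # stricter threshold
```

A higher threshold flags fewer tweets as ironic, which can be useful when false positives are costly.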

***Dataset***

In [30]:
dataset, label2id = tweetnlp.load_dataset('irony')

In [31]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2862
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 784
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 955
    })
})

In [32]:
label2id

{'non_irony': 0, 'irony': 1}

### Hate Speech Detection
The hate speech detection task consists of detecting whether a tweet is hateful towards women or immigrants. It is based on the Detection of Hate Speech task at SemEval 2019 (check the paper [here](https://arxiv.org/pdf/2010.12421.pdf)).

***Model***

In [33]:
model = tweetnlp.load_model('hate')  # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch')  # Or `model.predict`

{'label': 'NOT-HATE'}

In [34]:
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)

{'label': 'NOT-HATE',
 'probability': {'NOT-HATE': 0.94898921251297, 'HATE': 0.05101083219051361}}

***Dataset***

In [35]:
dataset, label2id = tweetnlp.load_dataset('hate')

In [36]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2970
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
})

In [37]:
label2id

{'non-hate': 0, 'hate': 1}
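
Note that the model output above uses the names `NOT-HATE` / `HATE`, while the dataset's `label2id` uses `non-hate` / `hate`. When comparing predictions with dataset labels, the names need to be aligned first. A minimal sketch follows: the `MODEL_TO_DATASET` mapping is inferred from the two outputs shown in this section, and the predictions and gold ids are hypothetical.

```python
label2id = {"non-hate": 0, "hate": 1}

# Explicit mapping from the model's label names to the dataset's names,
# inferred from the outputs shown in this section.
MODEL_TO_DATASET = {"NOT-HATE": "non-hate", "HATE": "hate"}


def prediction_to_id(prediction):
    """Convert a model output such as {'label': 'NOT-HATE'} to a dataset id."""
    return label2id[MODEL_TO_DATASET[prediction["label"]]]


predictions = [{"label": "NOT-HATE"}, {"label": "HATE"}]  # hypothetical outputs
gold = [0, 0]                                              # hypothetical gold ids

predicted = [prediction_to_id(p) for p in predictions]
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(predicted, accuracy)
```

An explicit dictionary is used instead of string manipulation so that an unexpected label name raises a `KeyError` rather than being silently mis-mapped.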

### Offensive Language Identification
This task consists of identifying whether some form of offensive language is present in a tweet. For our benchmark we rely on the SemEval 2019 OffensEval dataset (check the paper [here](https://arxiv.org/pdf/2010.12421.pdf)).

***Model***

In [38]:
model = tweetnlp.load_model('offensive')  # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.")  # Or `model.predict`

{'label': 'offensive'}

In [39]:
model.offensive("All two of them taste like ass.", return_probability=True)

{'label': 'offensive',
 'probability': {'non-offensive': 0.16420334577560425,
  'offensive': 0.8357967138290405}}

***Dataset***

In [40]:
dataset, label2id = tweetnlp.load_dataset('offensive')

In [41]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 11916
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 860
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1324
    })
})

In [42]:
label2id

{'non-offensive': 0, 'offensive': 1}

### Emoji Prediction
The goal of emoji prediction is to predict the final emoji on a given tweet. The dataset used to fine-tune our models is the TweetEval adaptation from the SemEval 2018 task on Emoji Prediction (check the paper [here](https://arxiv.org/pdf/2010.12421.pdf)), including 20 emoji as labels (❤, 😍, 😂, 💕, 🔥, 😊, 😎, ✨, 💙, 😘, 📷, 🇺🇸, ☀, 💜, 😉, 💯, 😁, 🎄, 📸, 😜).

***Model***

In [43]:
model = tweetnlp.load_model('emoji')  # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY')  # Or `model.predict`

{'label': '📷'}

In [44]:
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)

{'label': '📷',
 'probability': {'❤': 0.13197310268878937,
  '😍': 0.11246417462825775,
  '😂': 0.008415071293711662,
  '💕': 0.04842923581600189,
  '🔥': 0.014528140425682068,
  '😊': 0.1509673148393631,
  '😎': 0.08625394850969315,
  '✨': 0.016166353598237038,
  '💙': 0.07396606355905533,
  '😘': 0.03033280372619629,
  '📷': 0.16525329649448395,
  '🇺🇸': 0.020336609333753586,
  '☀': 0.007999823428690434,
  '💜': 0.01611141487956047,
  '😉': 0.012984540313482285,
  '💯': 0.012557176873087883,
  '😁': 0.03138682246208191,
  '🎄': 0.006829543504863977,
  '📸': 0.04188753664493561,
  '😜': 0.01115693524479866}}
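
With 20 classes the distribution above is fairly flat (the top emoji has a probability of roughly 0.17), so looking at the few most probable emoji can be more informative than the single predicted label. The sketch below ranks a probability dictionary. `top_k` is a hypothetical helper, and the dictionary is a rounded subset of the output above.

```python
def top_k(probability, k=3):
    """Return the k most probable labels with their scores, best first."""
    ranked = sorted(probability.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# A subset of the probabilities printed above (rounded).
probability = {
    "❤": 0.1320,
    "😍": 0.1125,
    "😊": 0.1510,
    "😎": 0.0863,
    "💙": 0.0740,
    "📷": 0.1653,
    "📸": 0.0419,
}

for emoji, score in top_k(probability):
    print(emoji, score)
```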

***Dataset***

In [45]:
dataset, label2id = tweetnlp.load_dataset('emoji')

In [46]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

In [47]:
label2id

{'❤': 0,
 '😍': 1,
 '😂': 2,
 '💕': 3,
 '🔥': 4,
 '😊': 5,
 '😎': 6,
 '✨': 7,
 '💙': 8,
 '😘': 9,
 '📷': 10,
 '🇺🇸': 11,
 '☀': 12,
 '💜': 13,
 '😉': 14,
 '💯': 15,
 '😁': 16,
 '🎄': 17,
 '📸': 18,
 '😜': 19}

### Emotion Recognition
Given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset we use the SemEval 2018 task on Affect in Tweets, simplified to only four emotions used in TweetEval: anger, joy, sadness and optimism (check the paper [here](https://arxiv.org/pdf/2010.12421.pdf)).

***Model***

In [48]:
model = tweetnlp.load_model('emotion')  # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.')  # Or `model.predict`

{'label': 'joy'}

In [49]:
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)

{'label': 'joy',
 'probability': {'anger': 0.0002580075233709067,
  'anticipation': 0.0005329722189344466,
  'disgust': 0.0002611202944535762,
  'fear': 0.00027552220853976905,
  'joy': 0.7721396684646606,
  'love': 0.18062669038772583,
  'optimism': 0.04208087548613548,
  'pessimism': 0.0002532521029934287,
  'sadness': 0.0006160670309327543,
  'surprise': 0.0005619610310532153,
  'trust': 0.0023938403464853764}}

***Dataset***

In [50]:
dataset, label2id = tweetnlp.load_dataset('emotion')

In [51]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 3257
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1421
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 374
    })
})

In [52]:
label2id

{'anger': 0, 'joy': 1, 'optimism': 2, 'sadness': 3}

## Named Entity Recognition
This module consists of a named-entity recognition (NER) model specifically trained for tweets. The model is instantiated by `tweetnlp.load_model("ner")`, and prediction is run by passing a text or a list of texts as an argument to the `ner` function (check the paper [here](https://arxiv.org/abs/2210.03797), or the [huggingface dataset page](https://huggingface.co/datasets/tner/tweetner7)).

***Model***

In [53]:
model = tweetnlp.load_model('ner')  # Or `model = tweetnlp.NER()`
model.ner('Jacob Collier is a Grammy-awarded English artist from London.')  # Or `model.predict`

[{'type': 'person', 'entity': 'Jacob Collier'},
 {'type': 'event', 'entity': ' Grammy'},
 {'type': 'location', 'entity': ' London'}]

In [54]:
# Note: the probability for the predicted entity is the mean of the probabilities over the sub-tokens representing the entity.
model.ner('Jacob Collier is a Grammy-awarded English artist from London.', return_probability=True)  # Or `model.predict`

[{'type': 'person',
  'entity': 'Jacob Collier',
  'probability': 0.9905317823092142},
 {'type': 'event', 'entity': ' Grammy', 'probability': 0.19164393842220306},
 {'type': 'location', 'entity': ' London', 'probability': 0.9607000350952148}]
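
As the note in the cell above says, the probability of an entity is the mean of the probabilities of its sub-tokens. A minimal sketch of that aggregation is shown below. The helper name `entity_probability` and the sub-token values are hypothetical, chosen only to show how one uncertain piece lowers the entity score.

```python
from statistics import mean


def entity_probability(sub_token_probabilities):
    """Average the per-sub-token probabilities of one predicted entity."""
    return mean(sub_token_probabilities)


# Hypothetical sub-token probabilities for a two-piece entity.
print(entity_probability([0.99, 0.97]))
# A single low-confidence piece pulls the average down.
print(entity_probability([0.95, 0.20, 0.90]))
```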

***Dataset***

In [55]:
dataset, label2id = tweetnlp.load_dataset('ner')

Generating test_2020 split:   0%|          | 0/576 [00:00<?, ? examples/s]

Generating test_2021 split:   0%|          | 0/2807 [00:00<?, ? examples/s]

Generating validation_2020 split:   0%|          | 0/576 [00:00<?, ? examples/s]

Generating validation_2021 split:   0%|          | 0/310 [00:00<?, ? examples/s]

Generating train_2020 split:   0%|          | 0/4616 [00:00<?, ? examples/s]

Generating train_2021 split:   0%|          | 0/2495 [00:00<?, ? examples/s]

Generating train_all split:   0%|          | 0/7111 [00:00<?, ? examples/s]

Generating validation_random split:   0%|          | 0/576 [00:00<?, ? examples/s]

Generating train_random split:   0%|          | 0/4616 [00:00<?, ? examples/s]

Generating extra_2020 split:   0%|          | 0/87880 [00:00<?, ? examples/s]

Generating extra_2021 split:   0%|          | 0/93594 [00:00<?, ? examples/s]

In [56]:
dataset

DatasetDict({
    test_2020: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 576
    })
    test_2021: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 2807
    })
    validation_2020: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 576
    })
    validation_2021: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 310
    })
    train_2020: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 4616
    })
    train_2021: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 2495
    })
    train_all: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 7111
    })
    validation_random: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 576
    })
    train_random: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 4616
    })
  

In [57]:
label2id

{'B-corporation': 0,
 'B-creative_work': 1,
 'B-event': 2,
 'B-group': 3,
 'B-location': 4,
 'B-person': 5,
 'B-product': 6,
 'I-corporation': 7,
 'I-creative_work': 8,
 'I-event': 9,
 'I-group': 10,
 'I-location': 11,
 'I-person': 12,
 'I-product': 13,
 'O': 14}
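
The `label2id` mapping above follows the BIO scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. The sketch below converts integer tags back to entity spans. `decode_bio` is a hypothetical helper, and the tokens and tag ids are a made-up example, not a row from the dataset.

```python
label2id = {
    "B-corporation": 0, "B-creative_work": 1, "B-event": 2, "B-group": 3,
    "B-location": 4, "B-person": 5, "B-product": 6, "I-corporation": 7,
    "I-creative_work": 8, "I-event": 9, "I-group": 10, "I-location": 11,
    "I-person": 12, "I-product": 13, "O": 14,
}
id2label = {idx: label for label, idx in label2id.items()}


def decode_bio(tokens, tags):
    """Group B-/I- tagged tokens into (type, text) entity spans."""
    entities = []
    current = None  # [type, [tokens]] for the entity being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = [tag[2:], [token]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:  # "O", or an I- tag that does not continue the open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]


# Hypothetical tokens and tag ids, for illustration only.
tokens = ["Jacob", "Collier", "is", "a", "Grammy", "-", "awarded",
          "English", "artist", "from", "London", "."]
tag_ids = [5, 12, 14, 14, 2, 14, 14, 14, 14, 14, 4, 14]

print(decode_bio(tokens, [id2label[i] for i in tag_ids]))
```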

## Question Answering
This module consists of a question answering model specifically trained for tweets.
The model is instantiated by `tweetnlp.load_model("question_answering")`,
and prediction is run by passing a question or a list of questions, along with a context or a list of contexts,
as arguments to the `question_answering` function (check the paper [here](https://arxiv.org/abs/2210.03992), or the [huggingface dataset page](https://huggingface.co/datasets/lmqg/qg_tweetqa)).

***Model***

In [58]:
model = tweetnlp.load_model('question_answering')  # Or `model = tweetnlp.QuestionAnswering()`
model.question_answering(
  question='who created the post as we know it today?',
  context="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`

{'generated_text': 'nebraska'}

***Dataset***

In [59]:
dataset = tweetnlp.load_dataset('question_answering')

In [60]:
dataset

DatasetDict({
    train: Dataset({
        features: ['answer', 'paragraph_question', 'question', 'paragraph'],
        num_rows: 9489
    })
    validation: Dataset({
        features: ['answer', 'paragraph_question', 'question', 'paragraph'],
        num_rows: 1086
    })
    test: Dataset({
        features: ['answer', 'paragraph_question', 'question', 'paragraph'],
        num_rows: 1203
    })
})
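
Since the model returns a free-form `generated_text`, predictions are usually compared with the gold `answer` after some normalization. The sketch below implements SQuAD-style exact match and token-level F1 as one possible way to do this. It is not necessarily the metric used in the paper, and the prediction/gold pairs are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction/gold pairs.
print(exact_match("Ben.", "ben"))
print(round(token_f1("ben bradlee", "ben"), 3))
print(token_f1("nebraska", "ben"))
```

Token-level F1 gives partial credit when the prediction overlaps the gold answer, whereas exact match only rewards identical normalized strings.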

## Question Answer Generation
This module consists of a question and answer pair generation model specifically trained for tweets.
The model is instantiated by `tweetnlp.load_model("question_answer_generation")`,
and prediction is run by passing a context or a list of contexts
as an argument to the `question_answer_generation` function (check the paper [here](https://arxiv.org/abs/2210.03992), or the [huggingface dataset page](https://huggingface.co/datasets/lmqg/qag_tweetqa)).


***Model***

In [61]:
model = tweetnlp.load_model('question_answer_generation')  # Or `model = tweetnlp.QuestionAnswerGeneration()`
model.question_answer_generation(
  text="'So much of The Post is Ben,' Mrs. Graham said in 1994, three years after Bradlee retired as editor. 'He created it as we know it today.'— Ed O'Keefe (@edatpost) October 21, 2014"
)  # Or `model.predict`

[{'question': 'who created the post?', 'answer': 'ben'},
 {'question': 'what did ben do in 1994?', 'answer': 'he retired as editor'}]

***Dataset***

In [62]:
dataset = tweetnlp.load_dataset("question_answer_generation")

In [63]:
dataset

DatasetDict({
    train: Dataset({
        features: ['answers', 'questions', 'paragraph', 'paragraph_id', 'questions_answers'],
        num_rows: 4536
    })
    validation: Dataset({
        features: ['answers', 'questions', 'paragraph', 'paragraph_id', 'questions_answers'],
        num_rows: 583
    })
    test: Dataset({
        features: ['answers', 'questions', 'paragraph', 'paragraph_id', 'questions_answers'],
        num_rows: 583
    })
})

## Language Modeling
The masked language model predicts the masked token in a given sentence. It is instantiated by `tweetnlp.load_model('language_model')` and runs the prediction when a text or a list of texts is passed as the argument to the `mask_prediction` function. Make sure that each text contains a `<mask>` token, since predicting that token is the model's objective.

In [64]:
model = tweetnlp.load_model('language_model')  # Or `model = tweetnlp.LanguageModel()`
model.mask_prediction("So glad I'm <mask> vaccinated.")  # Or `model.predict`

{'best_tokens': ['fully',
  'getting',
  'not',
  'still',
  'already',
  'all',
  'being',
  'completely',
  'now',
  'finally'],
 'best_scores': [1.0366179269138964e-11,
  1.6352424622723127e-11,
  1.9819900928808032e-11,
  4.5200146403523433e-10,
  1.0578219189483207e-05,
  0.0002495313819963485,
  2.3952554329298437e-05,
  1.8536427887738682e-05,
  2.8879319870611653e-05,
  5.781220806966303e-06],
 'best_sentences': ["So glad I'm fully vaccinated.",
  "So glad I'm getting vaccinated.",
  "So glad I'm not vaccinated.",
  "So glad I'm still vaccinated.",
  "So glad I'm already vaccinated.",
  "So glad I'm all vaccinated.",
  "So glad I'm being vaccinated.",
  "So glad I'm completely vaccinated.",
  "So glad I'm now vaccinated.",
  "So glad I'm finally vaccinated."]}

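The three lists in the result are index-aligned, so each candidate token can be paired with its score and its filled-in sentence. A minimal sketch using the first three entries of the output above as literals; the `candidates` structure is only an illustration, not something `tweetnlp` returns:

```python
masked_text = "So glad I'm <mask> vaccinated."

# First three entries of each list, copied from the output above.
prediction = {
    'best_tokens': ['fully', 'getting', 'not'],
    'best_scores': [1.0366179269138964e-11, 1.6352424622723127e-11, 1.9819900928808032e-11],
    'best_sentences': [
        "So glad I'm fully vaccinated.",
        "So glad I'm getting vaccinated.",
        "So glad I'm not vaccinated.",
    ],
}

# Zip the aligned lists into one record per candidate.
candidates = [
    {'token': token, 'score': score, 'sentence': sentence}
    for token, score, sentence in zip(
        prediction['best_tokens'],
        prediction['best_scores'],
        prediction['best_sentences'],
    )
]

for c in candidates:
    # In this output, each sentence equals the input with `<mask>` replaced by the token.
    assert c['sentence'] == masked_text.replace('<mask>', c['token'])
    print(f"{c['token']:>8}  {c['score']:.3e}  {c['sentence']}")
```
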
## Tweet Embedding
The tweet embedding model produces a fixed-length embedding for a tweet. The embedding represents the meaning of the tweet, so it can be used for semantic search over tweets via the similarity between embeddings. The model is instantiated by `tweetnlp.load_model('sentence_embedding')` and runs the prediction when a text or a list of texts is passed as the argument to the `embedding` function.

In [65]:
model = tweetnlp.load_model('sentence_embedding')

In [66]:
# Get sentence embedding
tweet = "I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done."
vectors = model.embedding(tweet)
vectors.shape


(768,)

In [67]:
# Get sentence embedding (multiple inputs)
tweet_corpus = [
    "Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair does not make it so. Charges require specific allegations and then proof. We have neither here.",
    "Trump appointed judge Stephanos Bibas ",
    "If your members can go to Puerto Rico they can get their asses back in the classroom. @CTULocal1",
    "@PolitiBunny @CTULocal1 Political leverage, science said schools could reopen, teachers and unions protested to keep'em closed and made demands for higher wages and benefits, they're usin Covid as a crutch at the expense of life and education.",
    "Congratulations to all the exporters on achieving record exports in Dec 2020 with a growth of 18 % over the previous year. Well done &amp; keep up this trend. A major pillar of our govt's economic policy is export enhancement &amp; we will provide full support to promote export culture.",
    "@ImranKhanPTI Pakistan seems a worst country in term of exporting facilities. I am a small business man and if I have to export a t-shirt having worth of $5 to USA or Europe. Postal cost will be around $30. How can we grow as an exporting country if this situation prevails. Think about it. #PM",
    "The thing that doesn’t sit right with me about “nothing good happened in 2020” is that it ignores the largest protest movement in our history. The beautiful, powerful Black Lives Matter uprising reached every corner of the country and should be central to our look back at 2020.",
    "@JoshuaPotash I kinda said that in the 2020 look back for @washingtonpost",
    "Is this a confirmation from Q that Lin is leaking declassified intelligence to the public? I believe so. If @realDonaldTrump didn’t approve of what @LLinWood is doing he would have let us know a lonnnnnng time ago. I’ve always wondered why Lin’s Twitter handle started with “LLin” https://t.co/0G7zClOmi2",
    "@ice_qued @realDonaldTrump @LLinWood Yeah 100%",
    "Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you."
    "The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned",
    "@ianmSC Beyond that, they put such huge confidence in masks so early with no strong evidence that they have any meaningful benefit, they don’t want to backtrack or admit they were wrong. They put the cart before the horse, now desperate to find any results that match their hypothesis.",
]
vectors = model.embedding(tweet_corpus, batch_size=3)
vectors.shape

(12, 768)
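
Because each tweet maps to a fixed-length vector, semantic search reduces to ranking vectors by a similarity measure. The sketch below uses cosine similarity on toy 3-dimensional vectors standing in for the 768-dimensional embeddings. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and cosine similarity is an assumption chosen for illustration, not a statement of how `tweetnlp` computes its scores:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(anchor, corpus):
    """Return (index, similarity) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(anchor, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors; real inputs would be the rows returned by `model.embedding(...)`.
anchor_vec = [1.0, 0.0, 1.0]
corpus_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the anchor
    [2.0, 0.0, 2.0],  # same direction as the anchor
    [1.0, 1.0, 0.0],  # partially aligned with the anchor
]

ranking = rank_by_similarity(anchor_vec, corpus_vecs)
for index, score in ranking:
    print(index, round(score, 3))
```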

In [68]:
# Similarity search
sims = []
for n, i in enumerate(tweet_corpus):
  _sim = model.similarity(tweet, i)
  sims.append([n, _sim])
print(f'anchor tweet: {tweet}\n')
for m, (n, s) in enumerate(sorted(sims, key=lambda x: x[1], reverse=True)):
  print(f' - top {m}: {tweet_corpus[n]}\n - similarity: {s}\n')

anchor tweet: I will never understand the decision making of the people of Alabama. Their new Senator is a definite downgrade. You have served with honor.  Well done.

 - top 0: Tomorrow is my last day as Senator from Alabama.  I believe our opportunities are boundless when we find common ground. As we swear in a new Congress &amp; a new President, demand from them that they do just that &amp; build a stronger, more just society.  It’s been an honor to serve you.The mask cult can’t ever admit masks don’t work because their ideology is based on feeling like a “good person”  Wearing a mask makes them a “good person” &amp; anyone who disagrees w/them isn’t  They can’t tolerate any idea that makes them feel like their self-importance is unearned
 - similarity: 0.7480925982953287

 - top 1: Trump appointed judge Stephanos Bibas 
 - similarity: 0.6289173552400941

 - top 2: Free, fair elections are the lifeblood of our democracy. Charges of unfairness are serious. But calling an election unfair 

## Use Custom Model
To use another model from a local path or the huggingface model hub, simply pass the model path/alias to the `load_model` function.

`tweetnlp.load_model('task', model_name='model-path/alias')`

Any classification model can also be used without specifying the task.


In [69]:
# another task, e.g. NER
model = tweetnlp.load_model('ner', model_name='tner/twitter-roberta-base-2019-90m-tweetner7-continuous')
model.ner("Jacob Collier is a Grammy-awarded English artist from London.")

[{'type': 'person', 'entity': 'Jacob Collier'},
 {'type': 'location', 'entity': ' London'}]
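
The NER result is a list of dictionaries with `type` and `entity` keys, and an entity string may carry leading whitespace (note `' London'` above). A minimal sketch that groups entities by type and strips that whitespace; `group_entities` is a hypothetical helper, not part of `tweetnlp`:

```python
from collections import defaultdict

# Output copied from the cell above.
entities = [
    {'type': 'person', 'entity': 'Jacob Collier'},
    {'type': 'location', 'entity': ' London'},
]


def group_entities(items):
    """Group entity strings by their type, trimming surrounding whitespace."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item['type']].append(item['entity'].strip())
    return dict(grouped)


print(group_entities(entities))
```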

## Fine-tuning Language Model with TweetNLP
TweetNLP provides an easy interface for fine-tuning language models on the supported datasets, backed by HuggingFace for model hosting and fine-tuning and by [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) for hyperparameter search.
- Supported Tasks: `sentiment`, `offensive`, `irony`, `hate`, `emotion`, `topic_classification`



In [70]:
import logging
import tweetnlp
from pprint import pprint

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')

# example inputs for model prediction
sample = [
    "How many more days until opening day? 😩",
    "All two of them taste like ass.",
    "If you wanna look like a badass, have drama on social media",
    "Whoever just unfollowed me you a bitch",
    "I love swimming for the same reason I love meditating...the feeling of weightlessness.",
    "Beautiful sunset last night from the pontoon @ Tupper Lake, New York",
    'Jacob Collier is a Grammy-awarded English artist from London.'
]

# set language model and task
language_model = 'cardiffnlp/twitter-roberta-base-2021-124m'
task = "irony"

# load dataset
dataset, label_to_id = tweetnlp.load_dataset(task)

# load trainer
trainer_class = tweetnlp.load_trainer(task)

# define trainer
trainer = trainer_class(
    language_model=language_model,
    dataset=dataset,
    label_to_id=label_to_id,
    max_length=128,
    split_train='train',
    split_test='test',
    output_dir='model_ckpt/test'
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-2021-124m and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The repository for accuracy contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/accuracy.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


The repository for f1 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/f1.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
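
`load_dataset` returns a `label_to_id` mapping alongside the dataset, and the trainer receives it as an argument. When reading integer predictions it can help to have the inverse mapping. A minimal sketch; the mapping below is an illustrative stand-in, and its label names and ids are assumptions rather than values confirmed by this notebook:

```python
# Illustrative stand-in; the real mapping is the second value returned by
# `tweetnlp.load_dataset(task)`, and its exact contents are an assumption here.
label_to_id = {'non_irony': 0, 'irony': 1}

# Invert the mapping so integer ids can be turned back into label names.
id_to_label = {idx: label for label, idx in label_to_id.items()}

predicted_ids = [1, 0, 0, 1]
predicted_labels = [id_to_label[i] for i in predicted_ids]
print(predicted_labels)
```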


In [None]:
# train
trainer.train(down_sample_size_train=1000, ray_result_dir="ray_results/test")

# save model checkpoint
trainer.save_model()




In [None]:
# model evaluation
metrics = trainer.evaluate()
pprint(metrics)
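
The trainer setup output above shows the `accuracy` and `f1` metric scripts being loaded. The keys that `trainer.evaluate()` actually returns are not shown in this notebook, so the sketch below only illustrates what the two metrics measure, computed by hand on made-up labels. The function names and the choice of macro averaging are assumptions for illustration:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 score for one label, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up gold labels and predictions (1 = irony, 0 = non-irony).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```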

In [None]:
# sample prediction
output = trainer.predict(sample)
print(f"Sample Prediction: {language_model} ({task})")
for s, p in zip(sample, output):
    pprint(s)
    pprint(p)