# Lab 04 - NLP 2

# Introduction (1 point)

## (1 point) Pick one of the datasets between hate and offensive, and justify your choice. Remember that it is for a commercial application (there is a good and a bad answer).

We chose the offensive dataset. The hate dataset is released under the CC BY-NC 4.0 license, which forbids commercial use, so it cannot be used in a commercial application.

In [24]:
import numpy as np
from bertopic import BERTopic
from umap import UMAP


In [2]:
from datasets import load_dataset

dataset = load_dataset("tweet_eval", "offensive")


In [22]:
print("Split label balance:")
print(f"  Train: {np.array(dataset['train']['label']).sum()/len(dataset['train']['label'])*100:.2f}%")
print(f"  Validation: {np.array(dataset['validation']['label']).sum()/len(dataset['validation']['label'])*100:.2f}%")
print(f"  Test: {np.array(dataset['test']['label']).sum()/len(dataset['test']['label'])*100:.2f}%")

print("\nSplit size:")
print(f"  Train: {len(dataset['train']['label'])} tweets")
print(f"  Validation: {len(dataset['validation']['label'])} tweets")
print(f"  Test: {len(dataset['test']['label'])} tweets")

print("\nExample:")
print(f"  Tweet: '{dataset['train']['text'][0]}'")
print(f"  Label: {dataset['train']['label'][0]}")


Split label balance:
  Train: 33.07%
  Validation: 34.67%
  Test: 27.91%

Split size:
  Train: 11916 tweets
  Validation: 1324 tweets
  Test: 860 tweets

Example:
  Tweet: '@user Bono... who cares. Soon people will understand that they gain nothing from following a phony celebrity. Become a Leader of your people instead or help and support your fellow countrymen.'
  Label: 0


## (1 point) Describe the dataset. Look at the splits, proportion of classes, and see what you can figure out by just looking at the text.

The dataset has two fields:
- text: a string feature containing the tweet.
- label: an int classification label (0: non-offensive, 1: offensive).

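There are 3 splits: train, validation and test.
 - train: 11.9k tweets (33% offensive)
 - validation: 1.32k tweets (35% offensive)
 - test: 860 tweets (28% offensive)

The classes are imbalanced: roughly one tweet in three is offensive, and the test split has a noticeably lower share of offensive tweets than the other two.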

Looking at the text, many tweets are about politics, love and the NFL. Almost every tweet contains an '@user' placeholder mentioning someone. Some tweets also contain emojis and hashtags.
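
The observations about mentions and hashtags can be quantified rather than eyeballed. The following is a minimal sketch, assuming a list of tweet strings; `sample_tweets` and `surface_stats` are hypothetical names, and the sample tweets are made up for illustration rather than taken from the dataset.

```python
import re

# Hypothetical sample tweets, made up for illustration (not taken from the dataset).
sample_tweets = [
    "@user thanks for the update #NFL",
    "what a game last night",
    "@user @user that is simply not true",
    "love this so much #blessed",
]

def surface_stats(tweets):
    """Return the share of tweets that contain a mention placeholder or a hashtag."""
    n = len(tweets)
    return {
        "mention": sum("@user" in t for t in tweets) / n,
        "hashtag": sum(bool(re.search(r"#\w+", t)) for t in tweets) / n,
    }

stats = surface_stats(sample_tweets)
for name, share in stats.items():
    print(f"{name}: {share:.0%}")
```

On the real data, the same function could be called with `dataset['train']['text']` to check how widespread the '@user' placeholder actually is.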

## (3 points) Use BERTopic to extract the topics within the data, and the main topics within each class.

In [47]:
umap_model = UMAP(random_state=42)

topic_model = BERTopic(umap_model=umap_model, embedding_model="all-MiniLM-L6-v2")

In [48]:
topics, probs = topic_model.fit_transform(dataset["train"]["text"] + dataset["validation"]["text"])

In [49]:
topics_per_class = topic_model.topics_per_class(dataset["train"]["text"] + dataset["validation"]["text"], dataset["train"]["label"] + dataset["validation"]["label"])
topic_model.visualize_topics_per_class(topics_per_class)
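
To make the per-class view concrete, the sketch below counts how many documents of each class fall into each topic. It is a simplified illustration only: it covers the document counts and does not reproduce the per-class topic words that BERTopic also computes. `toy_topics`, `toy_labels` and `topic_counts_per_class` are hypothetical names with made-up values; in the notebook, the inputs would be `topics` and the concatenated labels. Topic -1 is BERTopic's outlier bucket.

```python
from collections import Counter

# Hypothetical topic assignments and class labels, made up for illustration.
toy_topics = [0, 0, 1, -1, 1, 0, 2, -1]
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1]

def topic_counts_per_class(topics, labels):
    """Map each class label to a Counter of topic id -> number of documents."""
    counts = {}
    for topic, label in zip(topics, labels):
        counts.setdefault(label, Counter())[topic] += 1
    return counts

counts = topic_counts_per_class(toy_topics, toy_labels)
for label in sorted(counts):
    # most_common() lists the topics of this class from most to least frequent
    print(label, counts[label].most_common())
```

A topic that appears almost only under one label is a hint that a classifier could learn the topic instead of the offensiveness itself.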

## (1 point) What do you think about the results? How do you think it could impact a model trained on these data?



In [51]:
# Bigram model
topic_model = BERTopic(umap_model=umap_model, embedding_model="all-MiniLM-L6-v2", n_gram_range=(1, 2))
topics, probs = topic_model.fit_transform(dataset["train"]["text"] + dataset["validation"]["text"])
topics_per_class = topic_model.topics_per_class(dataset["train"]["text"] + dataset["validation"]["text"], dataset["train"]["label"] + dataset["validation"]["label"])
topic_model.visualize_topics_per_class(topics_per_class, title="<b>Topics per Class Bigram</b>")

# Evaluate a model (6 points)

## (2 points) Evaluate their model on the test split of the dataset you picked, using precision, recall, and F1-score.

In [1]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Tasks:
# emoji, emotion, hate, irony, offensive, sentiment
# stance/abortion, stance/atheism, stance/climate, stance/feminist, stance/hillary

task='offensive'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# download label mapping
labels=[]
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
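scores = output[0][0].detach().numpy()
scores = softmax(scores)

# Print the labels ranked from highest to lowest score
ranking = np.argsort(scores)[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")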


1) not-offensive 0.9073
2) offensive 0.0927
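
The two scores above sum to 1 because the raw model outputs (logits) are passed through a softmax. A minimal pure-Python sketch of that step, assuming a single row of logits; `softmax_1d` and `toy_logits` are hypothetical names, and the logit values are made up rather than taken from the model.

```python
import math

def softmax_1d(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum first avoids overflow in exp() without changing the result.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (not-offensive, offensive), made up for illustration.
toy_logits = [1.5, -0.8]
probs = softmax_1d(toy_logits)
print(probs)
```

The predicted label is the index of the largest probability, which is also the index of the largest logit.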


In [4]:
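import torch

# Predict the test split in batches, without tracking gradients,
# so that the 860 tweets do not have to fit in memory in a single forward pass
test_text = [preprocess(t) for t in dataset['test']['text']]
batch_size = 32
all_logits = []
model.eval()
with torch.no_grad():
    for i in range(0, len(test_text), batch_size):
        batch = test_text[i:i + batch_size]
        encoded_input = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=256)
        all_logits.append(model(**encoded_input).logits.numpy())
scores = softmax(np.concatenate(all_logits), axis=1)

# get the predicted labels
preds = scores.argmax(axis=1)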

In [None]:
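from datasets import load_metric

# Compute precision, recall and F1 for the offensive class (label 1)
references = dataset['test']['label']
for name in ("precision", "recall", "f1"):
    metric = load_metric(name)
    print(metric.compute(predictions=preds.tolist(), references=references))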
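
Precision, recall and F1 can also be written out from the confusion counts, which is useful as a cross-check on whatever library computes them. With 27.91% offensive tweets in the test split, a model that always predicts non-offensive would already reach 72.09% accuracy, so accuracy alone says little here. The sketch below is a from-scratch illustration; `binary_prf`, `toy_refs` and `toy_preds` are hypothetical names with made-up values, not results from the model.

```python
def binary_prf(references, predictions, positive=1):
    """Precision, recall and F1 for the `positive` class, from true/false positive counts."""
    pairs = list(zip(references, predictions))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical references and predictions, made up for illustration.
toy_refs = [1, 0, 1, 1, 0, 0, 1, 0]
toy_preds = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = binary_prf(toy_refs, toy_preds)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On the real data, the function could be called with `dataset['test']['label']` and the list of predicted labels.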