# Sentiment Analysis Example

This notebook contains an applied example of using RoBERTa for sentiment analysis.

For this example, we'll use the **cardiffnlp/twitter-roberta-base-sentiment** model found [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment).

---------

### Initial model example

Import the libraries needed:

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

Set up the model by loading the pretrained model and its matching tokenizer.

**What is a model?**

A model in Hugging Face refers to a machine learning model that has been trained and stored on the Hugging Face Hub ([more info here](https://huggingface.co/docs/hub/models)).

**What is a tokenizer?**

A tokenizer in Hugging Face is a tool that processes textual data into a format that can be understood by a machine learning model. It is an essential step in the Natural Language Processing (NLP) pipeline, responsible for translating text into numerical data that can be processed by the model ([more info here](https://medium.com/@awaldeep/hugging-face-understanding-tokenizers-1b7e4afdb154)).
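
To make the idea concrete, here is a toy sketch of what a tokenizer does: split text into tokens and map each token to an integer ID. The vocabulary and the `toy_encode` helper are invented for illustration; the real RoBERTa tokenizer works on subword units with a far larger vocabulary.

```python
# Toy illustration only: a whitespace tokenizer with a made-up vocabulary.
TOY_VOCAB = {"<unk>": 0, "i": 1, "like": 2, "love": 3, "you": 4}

def toy_encode(text):
    """Lowercase, split on whitespace, and look up each token's integer ID."""
    tokens = text.lower().split()
    return [TOY_VOCAB.get(token, TOY_VOCAB["<unk>"]) for token in tokens]

print(toy_encode("I like you"))   # [1, 2, 4]
print(toy_encode("I adore you"))  # [1, 0, 4] ("adore" is not in the toy vocabulary)
```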

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Read in [labels for the outcomes](https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/sentiment/mapping.txt). These translate the model output into words (e.g. 0 is negative, 1 is neutral, 2 is positive).

In [None]:
# Download the label mapping (one tab-separated "id, label" pair per line)
mapping_link = "https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/sentiment/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    lines = f.read().decode('utf-8').split("\n")
csvreader = csv.reader(lines, delimiter='\t')
# Skip blank lines and keep the label name from the second column
labels = [row[1] for row in csvreader if len(row) > 1]
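
The parsing step can be tried offline with an in-memory copy of the file. The `sample` string below is an assumed stand-in for `mapping.txt`, written to match the tab-separated 0/1/2 mapping described above rather than copied from the download.

```python
import csv

# Assumed stand-in for the downloaded file: one tab-separated id/label
# pair per line, with a trailing newline that yields one empty row.
sample = "0\tnegative\n1\tneutral\n2\tpositive\n"

rows = csv.reader(sample.split("\n"), delimiter="\t")
# Empty rows have fewer than two columns and are skipped.
labels = [row[1] for row in rows if len(row) > 1]
print(labels)  # ['negative', 'neutral', 'positive']
```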

Define a string to pass to the model to get its sentiment back:

In [None]:
text = "I like you. I love you!"

Encode the text so that it can be understood by the model, pass it to the model, and convert the output scores to probabilities:

In [None]:
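# Tokenize the text into PyTorch tensors and run the model
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Take the raw logits for the single input and convert them to probabilities
scores = output.logits[0].detach().numpy()
scores = softmax(scores)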
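
The model returns raw scores (logits), and `softmax` turns them into probabilities that sum to 1. A minimal pure-Python sketch of that calculation follows; `softmax_list` is an illustrative helper and the logits are made up, not real model output.

```python
import math

def softmax_list(logits):
    """Exponentiate each logit (shifted by the max for stability) and normalize."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-2.0, 0.5, 3.0]  # hypothetical values, not model output
probs = softmax_list(example_logits)
print([round(p, 4) for p in probs])
print(round(sum(probs), 4))  # 1.0
```

The largest logit always receives the largest probability, so the ordering of the labels is the same before and after softmax.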

View the output:

In [None]:
# Sort label indices from highest to lowest score
ranking = np.argsort(scores)[::-1]
for i, idx in enumerate(ranking):
    label = labels[idx]
    score = scores[idx]
    print(f"{i+1}) {label} {np.round(float(score), 4)}")
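
The same ranking can be written without index arithmetic by pairing each label with its score and sorting the pairs. This is a sketch in plain Python, with hypothetical scores standing in for the softmax output; `rank_labels` is an illustrative helper, not part of any library.

```python
def rank_labels(scores, labels):
    """Return (label, score) pairs sorted from highest to lowest score."""
    pairs = zip(labels, scores)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

example_labels = ["negative", "neutral", "positive"]
example_scores = [0.01, 0.04, 0.95]  # hypothetical probabilities

for rank, (label, score) in enumerate(rank_labels(example_scores, example_labels), start=1):
    print(f"{rank}) {label} {score:.4f}")
# 1) positive 0.9500
# 2) neutral 0.0400
# 3) negative 0.0100
```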

---------

### Model applied to dataset

Import the dataset loader:

In [None]:
from sklearn.datasets import fetch_20newsgroups

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date. More information can be found here: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

**Fetch data:**

In [None]:
# Fetch once and reuse the result in the cells below
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroups.keys()

In [None]:
newsgroups['target_names']

In [None]:
docs = newsgroups['data']

Create a function to clean the text:

In [None]:
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
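
A quick check of what `preprocess` does, using an invented tweet-style string. The function is repeated here so the snippet runs on its own.

```python
def preprocess(text):
    """Replace @mentions with '@user' and links with 'http'."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

sample = "@alice see https://example.com for details"  # made-up input
print(preprocess(sample))  # @user see http for details
```

Because the function splits on single spaces only, a mention or link that directly follows a newline or tab is left unchanged.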

In [None]:
text = docs[11]
text

In [None]:
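text = preprocess(text)
# Truncate long posts: RoBERTa-base accepts at most 512 tokens per input
encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
output = model(**encoded_input)
# Take the raw logits for the single input and convert them to probabilities
scores = output.logits[0].detach().numpy()
scores = softmax(scores)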

In [None]:
# Sort label indices from highest to lowest score
ranking = np.argsort(scores)[::-1]
for i, idx in enumerate(ranking):
    label = labels[idx]
    score = scores[idx]
    print(f"{i+1}) {label} {np.round(float(score), 4)}")
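
The cells above score a single post. To cover more of the dataset, the same steps can be wrapped in a loop. The sketch below assumes a `score_fn` callable that takes a string and returns one probability per label (for example, a function wrapping the preprocessing, tokenizer, model, and softmax calls above). Both `label_documents` and `dummy_score_fn` are placeholders invented here so the snippet runs without the model.

```python
def label_documents(docs, score_fn, labels):
    """Return the top label for each document, or None for empty documents."""
    results = []
    for doc in docs:
        if not doc.strip():
            # Removing headers, footers, and quotes can leave a post empty.
            results.append(None)
            continue
        scores = list(score_fn(doc))
        best = max(range(len(scores)), key=lambda i: scores[i])
        results.append(labels[best])
    return results

def dummy_score_fn(text):
    """Placeholder scorer: favors 'positive' if the text contains 'love'."""
    return [0.1, 0.2, 0.7] if "love" in text else [0.2, 0.6, 0.2]

labels = ["negative", "neutral", "positive"]
print(label_documents(["I love this", "", "It is a file"], dummy_score_fn, labels))
# ['positive', None, 'neutral']
```

Passing the scorer in as an argument keeps the loop independent of the model, so it can be tested with a stub before running the slower real inference.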