<a href="https://colab.research.google.com/github/twinarta/sentiment-analysis/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis using Python
8 January 2023, run using Google Colab (CPU, free)

## Install dependency

The only dependency is Hugging Face's Transformers library, which is used to load the sentiment analysis model and make predictions. The `torch` variant is installed to save time and disk space, instead of installing both the PyTorch and TensorFlow versions.

In [1]:
!pip install 'transformers[torch]'



## Imports

Import the libraries: Hugging Face's Transformers for the model, and `re` (regular expressions) for text processing.

In [2]:
from transformers import pipeline
import re

## Loading the sentiment analysis pretrained model from Huggingface

The model uses the RoBERTa architecture and is intended specifically for English. It returns one of 3 classes (positive, negative, neutral). Although RoBERTa is a fairly large model, it runs quite fast even on a CPU (less than 0.5 seconds for around 10 words on an Intel i3 machine).

In [3]:
sentiment_pipeline = pipeline("text-classification", model="j-hartmann/sentiment-roberta-large-english-3-classes")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the model checkpoint at j-hartmann/sentiment-roberta-large-english-3-classes were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that y

The `sentiment_pipeline` variable acts as a global variable that is accessed to run inference, so the model is loaded once instead of being instantiated again every time the `analyze_sentiment` function is called.

The pipeline also automatically prepares the input text for the model (for example, tokenization and conversion into the input tensors the model expects).

## Functions to preprocess raw text

1. Remove all characters except letters (uppercase and lowercase), commas, periods, apostrophes, and spaces.
2. Remove tags (XML or HTML tags)
3. Remove multiple whitespaces (if any)
4. Remove all non-ASCII characters
5. Remove emojis

These functions are intended for determining the sentiment of English text.

In [35]:
def remove_misc_characters(text):
  return re.sub("[^A-Za-z,.' ]+", '', text)

def remove_tags(text):
  return re.sub('<[^<]+?>', '', text)

def remove_multiple_whitespaces(text):
  return " ".join(text.split())

def clean_unicode(text):
  return (text.encode('ascii', 'ignore')).decode("utf-8")

# From: https://stackoverflow.com/a/58356570
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16 (emoji presentation)
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)
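To see what the individual helpers do, the sketch below applies the same patterns one step at a time to a made-up sample string. The sample text and the intermediate variable names are illustrative assumptions and not part of the notebook.

```python
import re

# Hypothetical input containing tags, repeated spaces, a non-ASCII letter and digits
sample = "<b>Great</b>   café, 10/10!"

# Same idea as clean_unicode: drop every character that is not ASCII
ascii_only = sample.encode("ascii", "ignore").decode("ascii")

# Same pattern as remove_tags: drop anything that looks like <...>
no_tags = re.sub(r"<[^<]+?>", "", ascii_only)

# Same idea as remove_multiple_whitespaces: collapse runs of whitespace
single_spaced = " ".join(no_tags.split())

# Same pattern as remove_misc_characters: keep letters, comma, dot, apostrophe, space
letters_only = re.sub(r"[^A-Za-z,.' ]+", "", single_spaced)

print(repr(ascii_only))     # '<b>Great</b>   caf, 10/10!'
print(repr(no_tags))        # 'Great   caf, 10/10!'
print(repr(single_spaced))  # 'Great caf, 10/10!'
print(repr(letters_only))   # 'Great caf, '
```

Note that dropping non-ASCII characters shortens words such as "café" rather than transliterating them, which is acceptable here only because the target text is English.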

## Combine the functions

A preprocess function combines the functions above with the following steps:
1. Remove all non-ASCII characters
2. Remove all emojis
3. Remove multiple whitespaces (ensuring there are only single spaces from now on)
4. Remove HTML/XML tags (for instance, `<div></div>` or `<span></span>`)
5. Remove all characters except letters (uppercase and lowercase), commas, periods, apostrophes, and spaces
6. Convert the preprocessed text into lowercase text
7. Remove all leading and trailing whitespace found in the text

If the resulting text is empty, return `None` as the result.

In [36]:
def preprocess(text):
  if text is None:
    return None

  preprocessed = clean_unicode(text)
  preprocessed = remove_emojis(preprocessed)
  preprocessed = remove_multiple_whitespaces(preprocessed)
  preprocessed = remove_tags(preprocessed)
  preprocessed = remove_misc_characters(preprocessed)
  preprocessed = preprocessed.lower()
  preprocessed = preprocessed.strip()

  if len(preprocessed) > 0:
    return preprocessed

  return None
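One consequence of this step order is worth keeping in mind: because characters are removed after whitespace has been collapsed, a removed character that stood between two spaces can leave a double space behind. The sketch below illustrates this with a made-up input, reusing the same patterns; the variable names are assumptions for illustration only.

```python
import re

text = "good - bad"  # hypothetical input

# Collapse whitespace first, as in step 3
collapsed = " ".join(text.split())

# Then strip disallowed characters, as in step 5; the hyphen disappears
stripped = re.sub(r"[^A-Za-z,.' ]+", "", collapsed)

print(repr(stripped))        # 'good  bad'
print(stripped.split(" "))   # ['good', '', 'bad']
print(stripped.split())      # ['good', 'bad']
```

Splitting on a single space counts the empty string as a word, whereas `split()` without arguments does not, which matters wherever words are counted afterwards.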

## Function to predict the sentiment of text

The function accesses the `sentiment_pipeline` variable as a global variable.

After the text is passed through the `preprocess()` function, several conditions are used to validate it:
1. The text can't be `None`.
2. The number of words must be at least 3, to ensure that the sentence is not too short.
3. The number of words must be at most 20, since long sentences may produce inaccurate results because the context might be too large (the average sentence contains 15-20 words).
4. The list returned as the prediction result must be valid: it must be non-empty and its first entry must contain the `score` and `label` keys.
5. The score of the inference result must reach a threshold (in this case, 0.6), so that only predictions with a high confidence score are trusted.

In [37]:
def analyze_sentiment(text):
  global sentiment_pipeline

  confidence_threshold = 0.6

  text = preprocess(text)

  if text is None:
    return None

  # split() without arguments ignores repeated spaces, so empty strings are not counted as words
  words = text.split()

  if len(words) < 3:
    return None

  if len(words) > 20:
    return None

  result = sentiment_pipeline(text)

  sentiment_result = None

  if len(result) > 0:
    if 'score' not in result[0].keys() or 'label' not in result[0].keys():
      return None

    if result[0]['score'] >= confidence_threshold:
      sentiment_result = result[0]['label'].lower()

  return sentiment_result
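The validation of the prediction result can be tried out without loading the model. The sketch below isolates that part in a hypothetical helper named `pick_label` (not part of the notebook) and feeds it hand-written lists that imitate the list-of-dictionaries shape a text-classification pipeline returns for a single input; the labels and scores are made up.

```python
def pick_label(result, confidence_threshold=0.6):
    """Return the lowercased top label, or None if it is missing or not confident enough."""
    if not result:
        return None

    top = result[0]
    if "score" not in top or "label" not in top:
        return None

    if top["score"] >= confidence_threshold:
        return top["label"].lower()

    return None


# Hand-written stand-ins for pipeline output (made-up values)
print(pick_label([{"label": "Positive", "score": 0.93}]))  # positive
print(pick_label([{"label": "neutral", "score": 0.41}]))   # None
print(pick_label([{"label": "negative"}]))                 # None
print(pick_label([]))                                      # None
```

Returning `None` for low-confidence predictions means callers have to handle three labels plus a "no answer" case.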

## Testing the function

Here are some regular sentences that produce each of the 3 classes (positive, negative, and neutral), representing the positive test cases.

## Positive test cases

### Neutral

In [48]:
result = analyze_sentiment("I ate a sandwhich")

print(result)

neutral


### Positive

In [49]:
result = analyze_sentiment("I really like this food!")

print(result)

positive


### Negative

In [50]:
result = analyze_sentiment("This is not what I ordered")

print(result)

negative


## Cases in which no explicit positive/negative word is mentioned

Here are several test cases in which no explicitly positive or negative word is mentioned; instead, the comparison introduced by the word "like" conveys the positive or negative nuance.

In [51]:
result = analyze_sentiment("You smell like a flower")

print(result)

positive


In [52]:
result = analyze_sentiment("You smell like a giraffe")

print(result)

negative


## Negative test cases

### 1. Empty string (no letters)
The function returns `None`. Inference does not take place if the input does not contain any letters.

In [53]:
result = analyze_sentiment("    -     ")

print(result)

None


In [54]:
result = analyze_sentiment("🤣👍👍")

print(result)

None


### 2. Texts containing emojis
The emojis are removed before the inference process.

In [55]:
result = analyze_sentiment("This is really good 🤣👍👍")

print(result)

positive


### 3. Sentences that are too long
Sentences that are too long return `None`, which avoids giving an ambiguous prediction and avoids having the model process a large text.

In [56]:
result = analyze_sentiment('"The quick brown fox jumps over the lazy dog" is an English-language pangram – a sentence that contains all the letters of the alphabet. The phrase is commonly used for touch-typing practice, testing typewriters and computer keyboards, displaying examples of fonts, and other applications involving text where the use of all letters in the alphabet is desired.')

print(result)

None
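The length check alone explains this result. As a rough sketch, the code below counts the words in just the first sentence of the input above, using the same character filter as the preprocessing step; the variable names are illustrative assumptions.

```python
import re

first_sentence = ('"The quick brown fox jumps over the lazy dog" is an '
                  'English-language pangram – a sentence that contains '
                  'all the letters of the alphabet.')

# Keep only letters, comma, dot, apostrophe and space, then lowercase
cleaned = re.sub(r"[^A-Za-z,.' ]+", "", first_sentence).lower()

# Count words, ignoring any repeated spaces left behind by removed characters
words = cleaned.split()

print(len(words))       # 23
print(len(words) > 20)  # True
```

Since the first sentence already exceeds the 20-word limit, the full input is rejected before the model is called.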
