<a href="https://colab.research.google.com/github/umbertoselva/NER-based-Sentiment-Analysis/blob/main/05_Sentiment_Analysis_distilBERT_with_Flair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 05 SENTIMENT ANALYSIS WITH FLAIR AND DistilBERT

This is Part 05 of my NER-based Sentiment Analysis Project: https://github.com/umbertoselva/NER-based-Sentiment-Analysis

Our goal in this notebook is to try a different method of carrying out Sentiment Analysis on our movie review dataset, extracted from the "I Just Watched" subreddit.

In Part 04 we trained our own model by fine-tuning a BERT model. Here we shall try out a pre-trained DistilBERT model with the Flair library (https://github.com/flairNLP/flair).

In [1]:
pip install flair

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting flair
  Downloading flair-0.11.3-py3-none-any.whl (401 kB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting mpld3==0.3
  Downloading mpld3-0.3.tar.gz (788 kB)
Collecting wikipedia-api
  Downloading Wikipedia-API-0.5.4.tar.gz (18 kB)
Collecting segtok>=1.5.7
  Downloading segtok-1.5.11-py3-none-any.whl (24 kB)
Collecting bpemb>=0.3.2
  Downloading bpemb-0.3.3-py3-none-any.whl (19 kB)
Collecting conllu>=4.0
  Downloading conllu-4.5.1-py2.py3-none-any.whl (16 kB)
Collecting deprecated>=1.2.4
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting pptree
  Downloading pptree-3.1.tar.gz (3.0 kB)
Collecting sqlitedict>=1.6.0
  Downloading sqlited

In [2]:
import flair

Leveraging a DistilBERT model for the specific task of Sentiment Analysis is as easy as typing the following line:

In [3]:
model = flair.models.TextClassifier.load('en-sentiment')

2022-07-18 09:46:38,500 https://nlp.informatik.hu-berlin.de/resources/models/sentiment-curated-distilbert/sentiment-en-mix-distillbert_4.pt not found in cache, downloading to /tmp/tmp1xd_67bw


100%|██████████| 265512723/265512723 [00:17<00:00, 15163929.31B/s]

2022-07-18 09:46:56,509 copying /tmp/tmp1xd_67bw to cache at /root/.flair/models/sentiment-en-mix-distillbert_4.pt





2022-07-18 09:46:56,966 removing temp file /tmp/tmp1xd_67bw
2022-07-18 09:46:57,002 loading file /root/.flair/models/sentiment-en-mix-distillbert_4.pt



We can now define a `get_sentiment()` function, similar to what we did in Part 04.

In this case we don't need a custom `prep_data()` function, because Flair provides the `Sentence` class, which tokenizes our raw text into a Sentence object that we can then feed to our model.

In [4]:
text = "this movie is awesome"

# tokenization
sentence = flair.data.Sentence(text) 

type(sentence)

flair.data.Sentence

In [5]:
sentence

Sentence: "this movie is awesome"

In [6]:
sentence[0] # indexes return single tokens

Token[0]: "this"

In [7]:
sentence.to_tokenized_string() # extracts the string

'this movie is awesome'

In [8]:
sentence.labels

[]

In [9]:
sentence.get_labels()

[]

Now let's try to predict the sentiment.

In [10]:
pred = model.predict(sentence)
pred

In [11]:
type(pred)

NoneType

The `.predict()` method returns `None`, so there is nothing useful to store in a variable. Instead, the prediction is attached in place, as `labels`, to the same Sentence object that was passed to the model.

In [12]:
sentence

Sentence: "this movie is awesome" → POSITIVE (0.9913)

In [13]:
sentence.labels

['Sentence: "this movie is awesome"'/'POSITIVE' (0.9913)]

In [14]:
type(sentence.labels)

list

In [15]:
# the labels are now a list with a single Label item
sentence.labels[0]

'Sentence: "this movie is awesome"'/'POSITIVE' (0.9913)

In [16]:
type(sentence.labels[0])

flair.data.Label

In [17]:
sentence.get_labels()

['Sentence: "this movie is awesome"'/'POSITIVE' (0.9913)]

In [18]:
sentence.get_labels()[0]

'Sentence: "this movie is awesome"'/'POSITIVE' (0.9913)

In [19]:
type(sentence.get_labels()[0])

flair.data.Label

In [20]:
sentence.get_labels()[0].value

'POSITIVE'

In [21]:
sentence.get_labels()[0].score

0.9912939667701721

In [22]:
sentence.labels[0].value

'POSITIVE'

In [23]:
sentence.labels[0].score

0.9912939667701721

In [24]:
type(sentence.labels[0].value)

str

In [25]:
type(sentence.labels[0].score)

float

So that is how to extract the sentiment value and the probability score.

Now let us define our `get_sentiment()` function which we will then apply to each single movie review in our dataset.

In [26]:
def get_sentiment(text):

  # preprocess/tokenize input string into Sentence object
  sentence = flair.data.Sentence(text)

  # predict sentiment (the result is saved in the Sentence object)
  model.predict(sentence)

  # extract sentiment value and probability
  sent_value = sentence.labels[0].value
  sent_score = sentence.labels[0].score

  return (sent_value, sent_score)
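
One caveat with the function above: `sentence.labels[0]` raises an `IndexError` whenever no label has been attached, and a missing review may not even be a string (pandas typically represents empty cells as `NaN`). The sketch below is a hypothetical wrapper, not part of the original notebook: the name `safe_sentiment` and the `(None, None)` fallback are assumptions. It skips blank or non-string inputs before handing the text to a scoring function such as `get_sentiment`.

```python
def safe_sentiment(text, score_fn):
    """Call score_fn(text) only for non-blank strings; otherwise return (None, None).

    score_fn stands in for get_sentiment from the cell above.
    """
    if not isinstance(text, str) or not text.strip():
        return (None, None)
    return score_fn(text)


# Stand-in scorer so the sketch runs without downloading the model
def fake_scorer(text):
    return ("POSITIVE", 0.99)


print(safe_sentiment("this movie is awesome", fake_scorer))
print(safe_sentiment("   ", fake_scorer))
print(safe_sentiment(float("nan"), fake_scorer))
```

In the notebook this could be applied as `df['selftext'].apply(lambda t: safe_sentiment(t, get_sentiment))`.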

Now let us load our "I Just Watched" subreddit movie review dataset as we created it in Parts 01 and 02 (including the NER 'people' column).

In [27]:
import pandas as pd

In [28]:
url = "https://drive.google.com/file/d/1rGO4DABtChIogEC8mn7EHpQiZotbapM1/view?usp=sharing"
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id
df = pd.read_csv(dwn_url, sep='|', encoding='utf-8')
df

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,people
0,t3_vzu4cb,1.657906e+09,Ijustwatched,IJW: Ang Babaeng Nawawala sa Sarili (2022),Source: [https://www.reeladvice.net/2022/07/an...,0.86,5.0,0.0,5.0,"['Albina', 'Ayanna Misola', 'Adrian Alandy']"
1,t3_vz90er,1.657840e+09,Ijustwatched,Ijw: Paws of Fury: The Legend of Hank (2022),"For a very little kid’s first parody/farce, it...",0.89,7.0,0.0,7.0,"['Marx', 'Mel Brooks', 'Mel']"
2,t3_vyxfuj,1.657810e+09,Ijustwatched,IJW: Kitty K7 (2022),Source: [https://www.reeladvice.net/2022/07/ki...,1.00,1.0,0.0,1.0,"['Hana', 'Rose van Ginkel', 'Kitty K7', 'Joy A..."
3,t3_vx6v7n,1.657617e+09,Ijustwatched,IJW : Man from Toronto (2022),"Was a pretty dope movie, watched it online ye...",0.74,4.0,0.0,4.0,"[""Kevin Hart's""]"
4,t3_vwmwkm,1.657558e+09,Ijustwatched,IJW: Thor: Love and Thunder (2022),Source: [https://www.reeladvice.net/2022/07/th...,0.74,4.0,0.0,4.0,"['Korg', 'Thor', 'Thors', 'Chris Hemsworth', '..."
...,...,...,...,...,...,...,...,...,...,...
992,t3_oj9jvl,1.626156e+09,Ijustwatched,IJW: Fired Up! [2009],Fired Up! is a dramedy romcom type film about ...,1.00,4.0,0.0,4.0,[]
993,t3_oinxgw,1.626083e+09,Ijustwatched,IJW: The 8th Night (2021),Plot is confusing to say the least. It appears...,1.00,5.0,0.0,5.0,"['Buddha', 'Kim Yoo Jung']"
994,t3_oilr8d,1.626072e+09,Ijustwatched,IJW: Diary of a Chambermaid [1964],Diary of a Chambermaid is a drama mystery roma...,1.00,3.0,0.0,3.0,[]
995,t3_oiisdi,1.626059e+09,Ijustwatched,IJW: Soldier (1998),I remember watching this growing up. Good acti...,1.00,5.0,0.0,5.0,[]
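
The conversion from a Google Drive share link to a direct-download link used in cell 28 can be wrapped in a small helper. This is only a sketch: the function name `drive_download_url` is hypothetical, and it assumes the share link follows the `.../file/d/<id>/...` pattern seen above. Looking up the segment after `d` rather than counting from the end keeps it working if the trailing `/view?...` part is absent.

```python
def drive_download_url(share_url):
    """Turn a Google Drive '.../file/d/<id>/...' share link into a direct-download link."""
    parts = share_url.split("/")
    # The file ID is the path segment that follows 'd'
    file_id = parts[parts.index("d") + 1]
    return "https://drive.google.com/uc?export=download&id=" + file_id


url = "https://drive.google.com/file/d/1rGO4DABtChIogEC8mn7EHpQiZotbapM1/view?usp=sharing"
print(drive_download_url(url))
```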


Now let us create a dedicated `sentiment` column and store in it the sentiment extracted from each review.

In [29]:
df['sentiment'] = df['selftext'].apply(get_sentiment)
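
Calling `get_sentiment` through `.apply()` runs the model on one review at a time. As far as I know, Flair's `predict()` also accepts a list of Sentence objects together with a `mini_batch_size` argument, which may be faster on larger datasets. The sketch below illustrates that pattern under this assumption. The helper name `get_sentiments_batched` is hypothetical, and `FakeSentence` and `FakeModel` are stand-ins so that the example runs without Flair; with the real library one would pass `model` and `flair.data.Sentence` instead.

```python
from types import SimpleNamespace


def get_sentiments_batched(texts, model, make_sentence, batch_size=32):
    """Score many texts with a single predict() call.

    model         : object with a Flair-style predict(sentences, mini_batch_size=...)
    make_sentence : callable turning a string into a Sentence-like object
    """
    sentences = [make_sentence(text) for text in texts]
    # predict() returns nothing; labels are attached to each sentence in place
    model.predict(sentences, mini_batch_size=batch_size)
    return [(s.labels[0].value, s.labels[0].score) for s in sentences]


# --- stand-ins so the sketch runs without Flair (not real Flair classes) ---
class FakeSentence:
    def __init__(self, text):
        self.text = text
        self.labels = []


class FakeModel:
    def predict(self, sentences, mini_batch_size=32):
        for s in sentences:
            value = "POSITIVE" if "good" in s.text else "NEGATIVE"
            s.labels.append(SimpleNamespace(value=value, score=1.0))


results = get_sentiments_batched(["good film", "dull film"], FakeModel(), FakeSentence)
print(results)
```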

In [30]:
df

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,people,sentiment
0,t3_vzu4cb,1.657906e+09,Ijustwatched,IJW: Ang Babaeng Nawawala sa Sarili (2022),Source: [https://www.reeladvice.net/2022/07/an...,0.86,5.0,0.0,5.0,"['Albina', 'Ayanna Misola', 'Adrian Alandy']","(NEGATIVE, 0.9999946355819702)"
1,t3_vz90er,1.657840e+09,Ijustwatched,Ijw: Paws of Fury: The Legend of Hank (2022),"For a very little kid’s first parody/farce, it...",0.89,7.0,0.0,7.0,"['Marx', 'Mel Brooks', 'Mel']","(NEGATIVE, 0.9990817308425903)"
2,t3_vyxfuj,1.657810e+09,Ijustwatched,IJW: Kitty K7 (2022),Source: [https://www.reeladvice.net/2022/07/ki...,1.00,1.0,0.0,1.0,"['Hana', 'Rose van Ginkel', 'Kitty K7', 'Joy A...","(NEGATIVE, 0.9888206124305725)"
3,t3_vx6v7n,1.657617e+09,Ijustwatched,IJW : Man from Toronto (2022),"Was a pretty dope movie, watched it online ye...",0.74,4.0,0.0,4.0,"[""Kevin Hart's""]","(POSITIVE, 0.9992497563362122)"
4,t3_vwmwkm,1.657558e+09,Ijustwatched,IJW: Thor: Love and Thunder (2022),Source: [https://www.reeladvice.net/2022/07/th...,0.74,4.0,0.0,4.0,"['Korg', 'Thor', 'Thors', 'Chris Hemsworth', '...","(NEGATIVE, 0.9997126460075378)"
...,...,...,...,...,...,...,...,...,...,...,...
992,t3_oj9jvl,1.626156e+09,Ijustwatched,IJW: Fired Up! [2009],Fired Up! is a dramedy romcom type film about ...,1.00,4.0,0.0,4.0,[],"(POSITIVE, 0.9996770620346069)"
993,t3_oinxgw,1.626083e+09,Ijustwatched,IJW: The 8th Night (2021),Plot is confusing to say the least. It appears...,1.00,5.0,0.0,5.0,"['Buddha', 'Kim Yoo Jung']","(NEGATIVE, 0.9999048709869385)"
994,t3_oilr8d,1.626072e+09,Ijustwatched,IJW: Diary of a Chambermaid [1964],Diary of a Chambermaid is a drama mystery roma...,1.00,3.0,0.0,3.0,[],"(POSITIVE, 0.9999856948852539)"
995,t3_oiisdi,1.626059e+09,Ijustwatched,IJW: Soldier (1998),I remember watching this growing up. Good acti...,1.00,5.0,0.0,5.0,[],"(POSITIVE, 0.9999256134033203)"
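
Because each cell of the new `sentiment` column holds a `(label, score)` tuple, sorting or averaging the results is awkward. One option, sketched here with a hypothetical helper name, is to fold the pair into a single signed number: positive for `POSITIVE`, negative for `NEGATIVE`. This assumes the model only emits those two labels, as in the output above.

```python
def signed_score(sentiment):
    """Map a (label, score) tuple to a float in [-1, 1]."""
    label, score = sentiment
    if label == "POSITIVE":
        return score
    if label == "NEGATIVE":
        return -score
    raise ValueError(f"unexpected label: {label!r}")


print(signed_score(("POSITIVE", 0.9992497563362122)))
print(signed_score(("NEGATIVE", 0.9990817308425903)))
```

On the dataframe this could be used as `df['sentiment'].apply(signed_score)`.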


Finally, let us save the dataframe into a CSV file for later use.

In [31]:
df.to_csv('ijw_subreddit_ner_sent_flair.csv', sep='|', encoding='utf-8', index=False)
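
A point to keep in mind when reloading this file: CSV has no tuple type, so the `sentiment` tuples are written as their string representation, e.g. `('NEGATIVE', 0.9999946355819702)`, and come back as plain strings. A minimal sketch of turning such a string back into a tuple with the standard library's `ast.literal_eval` (the helper name `parse_sentiment` is hypothetical):

```python
import ast


def parse_sentiment(cell):
    """Convert the string "('LABEL', score)" read from the CSV back into a tuple."""
    label, score = ast.literal_eval(cell)
    return (label, float(score))


print(parse_sentiment("('NEGATIVE', 0.9999946355819702)"))
```

After reading the file back with `pd.read_csv`, this could be applied as `df['sentiment'].apply(parse_sentiment)`.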


In [32]:
!ls

ijw_subreddit_ner_sent_flair.csv  sample_data
