<a href="https://colab.research.google.com/github/umbertoselva/NER-based-Sentiment-Analysis/blob/main/06_NER_based_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 06 NER-BASED SENTIMENT ANALYSIS

This is Part 06 of my NER-based Sentiment Analysis Project: https://github.com/umbertoselva/NER-based-Sentiment-Analysis

Let us summarize what we have done so far:

- We scraped the "I Just Watched" (IJW) subreddit and created a movie review dataset (Part 01)
- We applied Named Entity Recognition to this dataset with spaCy, extracted the 'PERSON' entities and filtered for full names of actors and film directors to find out who is mentioned most frequently (Part 02)
- We fine-tuned a BERT model with TensorFlow on a movie review dataset from Kaggle and then used the model to predict the sentiment of the IJW reviews (Part 03-04)
- We explored an alternative method to achieve the same, by using a pre-built DistilBERT model with Flair (Part 05)

We can now carry out a NER-based sentiment analysis to find out with what sentiment the IJW subreddit users speak of film industry people, and who is most popular.

Let us use the dataset with the sentiment labels extracted with our fine-tuned BERT model.



In [1]:
import pandas as pd

In [2]:
# the dataset with the sentiment column created with our fine-tuned BERT

url = "https://drive.google.com/file/d/1sNAoVuUdjeb-1l58a85Sm0aSD5TaXEFK/view?usp=sharing"
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id
df_bert = pd.read_csv(dwn_url, sep='|', encoding='utf-8')

In [3]:
df_bert

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,people,sentiment
0,t3_vzu4cb,1.657906e+09,Ijustwatched,IJW: Ang Babaeng Nawawala sa Sarili (2022),Source: [https://www.reeladvice.net/2022/07/an...,0.86,5.0,0.0,5.0,"['Albina', 'Ayanna Misola', 'Adrian Alandy']","('POSITIVE', 0.5522495)"
1,t3_vz90er,1.657840e+09,Ijustwatched,Ijw: Paws of Fury: The Legend of Hank (2022),"For a very little kid’s first parody/farce, it...",0.89,7.0,0.0,7.0,"['Marx', 'Mel Brooks', 'Mel']","('POSITIVE', 0.5305168)"
2,t3_vyxfuj,1.657810e+09,Ijustwatched,IJW: Kitty K7 (2022),Source: [https://www.reeladvice.net/2022/07/ki...,1.00,1.0,0.0,1.0,"['Hana', 'Rose van Ginkel', 'Kitty K7', 'Joy A...","('POSITIVE', 0.84092736)"
3,t3_vx6v7n,1.657617e+09,Ijustwatched,IJW : Man from Toronto (2022),"Was a pretty dope movie, watched it online ye...",0.74,4.0,0.0,4.0,"[""Kevin Hart's""]","('NEGATIVE', 0.5498567)"
4,t3_vwmwkm,1.657558e+09,Ijustwatched,IJW: Thor: Love and Thunder (2022),Source: [https://www.reeladvice.net/2022/07/th...,0.74,4.0,0.0,4.0,"['Korg', 'Thor', 'Thors', 'Chris Hemsworth', '...","('NEGATIVE', 0.5038758)"
...,...,...,...,...,...,...,...,...,...,...,...
992,t3_oj9jvl,1.626156e+09,Ijustwatched,IJW: Fired Up! [2009],Fired Up! is a dramedy romcom type film about ...,1.00,4.0,0.0,4.0,[],"('NEGATIVE', 0.5597052)"
993,t3_oinxgw,1.626083e+09,Ijustwatched,IJW: The 8th Night (2021),Plot is confusing to say the least. It appears...,1.00,5.0,0.0,5.0,"['Buddha', 'Kim Yoo Jung']","('POSITIVE', 0.76196575)"
994,t3_oilr8d,1.626072e+09,Ijustwatched,IJW: Diary of a Chambermaid [1964],Diary of a Chambermaid is a drama mystery roma...,1.00,3.0,0.0,3.0,[],"('POSITIVE', 0.5740766)"
995,t3_oiisdi,1.626059e+09,Ijustwatched,IJW: Soldier (1998),I remember watching this growing up. Good acti...,1.00,5.0,0.0,5.0,[],"('NEGATIVE', 0.5029748)"


Let us recall that when we first created the dataframe with the column "people", we populated this column with lists (a list in each cell).

However, when we saved the dataframe as a CSV file and then loaded it, these lists were not loaded as Python list objects, but as strings.

So now we first have to re-convert them into Python lists.

Same thing for the tuples in the sentiment column.

In [4]:
import ast

In [5]:
# ast.literal_eval can be passed directly, no lambda wrapper is needed
df_bert['people'] = df_bert['people'].apply(ast.literal_eval)
df_bert['sentiment'] = df_bert['sentiment'].apply(ast.literal_eval)
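
To see why this conversion is needed, here is a minimal sketch that does not require the CSV file. It assumes a cell holds the string form of a list or tuple, which is what we get after the CSV round trip. The helper `parse_cell` is a hypothetical name, not part of this project; it returns non-string values unchanged, so it could be applied again to a column that has already been converted.

```python
import ast

def parse_cell(cell):
    """Turn the string form of a Python literal back into the object.

    Hypothetical helper: non-string values are returned unchanged.
    """
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return cell

# how a list cell and a tuple cell look after loading the CSV
raw_people = "['Marx', 'Mel Brooks', 'Mel']"
raw_sentiment = "('POSITIVE', 0.5305168)"

people = parse_cell(raw_people)
sentiment = parse_cell(raw_sentiment)

# the raw cell is a str, the parsed cell is a list
print(type(raw_people).__name__, '->', type(people).__name__)
print(people[1], sentiment[0], sentiment[1])
```

Note that `ast.literal_eval` only evaluates Python literals (strings, numbers, tuples, lists, dicts and so on), which makes it a safer choice than `eval` for data read from a file.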

Now we want to match each relevant name ('PERSON', from the "people" column) to the sentiment of the reviews in which they appear, and the probability score of such sentiment.

Our final goal will be to rank them by average sentiment score.

Let us create a dictionary for this purpose.

In [6]:
sentiment_bert = {}

Let's loop through our dataset, extract the names, sentiment and scores and save them in the dictionary.

We want each name to be a key, whose value will be an embedded dictionary featuring two keys, 'POSITIVE' and 'NEGATIVE', corresponding to lists of all the scores (one per review in which the name appears)

```
{
  'person': {
    'POSITIVE': [0.99, 0.88, 0.95],
    'NEGATIVE': [0.84, 0.92, 0.65]
  },
  'person2': {
    'POSITIVE': [...],
    'NEGATIVE': [...]
  }
  ...
}
```

In [7]:
# each cell of the sentiment column contains a tuple
# so, in order to extract the sentiment label we take the item at index [0]
df_bert['sentiment'].iloc[0][0]

'POSITIVE'

In [8]:
# and index [1] for the probability score
df_bert['sentiment'].iloc[0][1]

0.5522495

Let us also filter the "people" the same way we did in Part 02, i.e. by keeping only those strings that contain a whitespace (separating first and last name). We only want "Tom Hanks", not just "Tom". Some actors might be mentioned by their first name (e.g. "Tom" for "Tom Hanks") but it is reasonable to expect this to happen in the text of a review only after they have already been introduced with their full name first. As for the rest, we will be mostly filtering out names that just refer to movie characters. A few might remain (e.g. "Freddy Krueger"), but that's OK. 

In [9]:
for i, row in df_bert.iterrows():
  
  # extract the label 'POSITIVE' or 'NEGATIVE'
  sentiment_label = row['sentiment'][0]
  # extract the probability score
  sent_prob_score = row['sentiment'][1]

  # loop through the people in the people column (each cell is a list)
  for person in row['people']:
    # record only for full names
    if ' ' in person:
      # add key if not present in dict
      if person not in sentiment_bert.keys():
        # add key corresponding to a dict with two POS/NEG keys
        # each corresponding to an empty list of scores
        sentiment_bert[person] = {'POSITIVE': [], 'NEGATIVE': []}
      # in either case, record the score in the appropriate POS or NEG list
      sentiment_bert[person][sentiment_label].append(sent_prob_score)
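
As an aside, the "add key if not present" step can also be delegated to `collections.defaultdict` from the standard library. The following is only a sketch on made-up data: the `rows` list and the name `scores_by_person` are hypothetical stand-ins for `df_bert` and `sentiment_bert`.

```python
from collections import defaultdict

# Hypothetical rows standing in for df_bert: (list of people, (label, score))
rows = [
    (['Chris Hemsworth', 'Thor'], ('NEGATIVE', 0.50)),
    (['Chris Hemsworth', 'Taika Waititi'], ('POSITIVE', 0.73)),
]

# every new key automatically starts out with two empty score lists
scores_by_person = defaultdict(lambda: {'POSITIVE': [], 'NEGATIVE': []})

for people, (label, score) in rows:
    for person in people:
        # record only full names
        if ' ' in person:
            scores_by_person[person][label].append(score)

print(dict(scores_by_person))
```

The result has the same nested structure as the dictionary built above, just without the explicit membership check.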

In [10]:
sentiment_bert

{'Ayanna Misola': {'NEGATIVE': [0.5841552],
  'POSITIVE': [0.5522495, 0.82105774]},
 'Adrian Alandy': {'NEGATIVE': [], 'POSITIVE': [0.5522495]},
 'Mel Brooks': {'NEGATIVE': [], 'POSITIVE': [0.5305168]},
 'Rose van Ginkel': {'NEGATIVE': [], 'POSITIVE': [0.84092736, 0.5278149]},
 'Kitty K7': {'NEGATIVE': [], 'POSITIVE': [0.84092736]},
 'Joy Aquino': {'NEGATIVE': [], 'POSITIVE': [0.84092736]},
 'Antoinette Jadaone': {'NEGATIVE': [], 'POSITIVE': [0.84092736]},
 "Rose van Ginkel's": {'NEGATIVE': [], 'POSITIVE': [0.84092736]},
 'Marco Gallo': {'NEGATIVE': [], 'POSITIVE': [0.84092736]},
 "Kevin Hart's": {'NEGATIVE': [0.5498567], 'POSITIVE': []},
 'Chris Hemsworth': {'NEGATIVE': [0.5038758],
  'POSITIVE': [0.72985405, 0.6967107]},
 'Natalie Portman': {'NEGATIVE': [0.5038758, 0.6152438],
  'POSITIVE': [0.72985405]},
 'Taika Waititi': {'NEGATIVE': [0.5038758],
  'POSITIVE': [0.72985405, 0.72160435, 0.87222373, 0.55458665]},
 'Jane Foster': {'NEGATIVE': [0.5038758], 'POSITIVE': [0.72985405]},
 'G

Some things can be improved with some cleaning, e.g. some names appear twice ('Nicole Kidman' and 'Nicole Kidman\'s'), but let's leave that for a second stage, and let's make do for now.
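
For that second stage, one possible cleaning step is sketched below. It assumes that the only duplicates we care about are possessive forms ending in `'s` (or the typographic `’s`); the helpers `normalize_name` and `full_names` are hypothetical names, not something defined elsewhere in this project.

```python
def normalize_name(name):
    """Strip surrounding whitespace and a trailing possessive ('s or ’s)."""
    name = name.strip()
    for suffix in ("'s", "’s"):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    return name

def full_names(people):
    """Return the distinct full names of one review, possessives removed."""
    cleaned = (normalize_name(p) for p in people)
    # a set ensures that a review is counted only once per person
    return {name for name in cleaned if ' ' in name}

people = ['Hana', 'Rose van Ginkel', "Rose van Ginkel's", "Kevin Hart's"]
print(sorted(full_names(people)))
```

In the loop above one would then iterate over `full_names(row['people'])` instead of `row['people']`. This would change the counts, so the figures shown in the rest of this notebook would no longer match exactly.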

So now we can look at the sentiment for any name:

In [11]:
sentiment_bert['Idris Elba']

{'NEGATIVE': [0.5215943, 0.7074299, 0.58697766, 0.7194377],
 'POSITIVE': [0.6008813,
  0.60693496,
  0.7901574,
  0.8163174,
  0.8312122,
  0.59141517]}

How can we extract useful information from this?

We can calculate the total sentiment by summing the positive scores and subtracting the sum of the negative scores.

e.g. for "Idris Elba" (approximately):

```
pos_freq = 6
sum_pos = 0.6 + 0.6 + 0.7 + 0.8 + 0.8 + 0.5 = 4
neg_freq = 4
sum_neg = 0.5 + 0.7 + 0.5 + 0.7 = 2.4
total_freq = 10

total_sentiment = sum_pos - sum_neg = 4 - 2.4 = 1.6
```

Then we can calculate the average positive and negative scores

```
avg_pos = sum_pos / pos_freq = 4 / 6 ≈ 0.67
avg_neg = sum_neg / neg_freq = 2.4 / 4 = 0.6
```

Then we can also calculate the overall average sentiment as follows:

```
avg_sentiment = total_sentiment / (pos_freq + neg_freq) = 1.6 / (6 + 4) = 1.6 / 10 = 0.16
```
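
The same arithmetic can be checked in a few lines of Python using the exact scores printed for "Idris Elba" above. This is just a sketch of the formulas; the variable names are chosen to mirror the pseudo-calculation.

```python
# exact scores copied from the sentiment_bert['Idris Elba'] output
idris_elba = {
    'NEGATIVE': [0.5215943, 0.7074299, 0.58697766, 0.7194377],
    'POSITIVE': [0.6008813, 0.60693496, 0.7901574,
                 0.8163174, 0.8312122, 0.59141517],
}

pos_freq = len(idris_elba['POSITIVE'])
neg_freq = len(idris_elba['NEGATIVE'])
total_freq = pos_freq + neg_freq

sum_pos = sum(idris_elba['POSITIVE'])
sum_neg = sum(idris_elba['NEGATIVE'])

avg_pos = sum_pos / pos_freq
avg_neg = sum_neg / neg_freq

# positive sum minus negative sum, averaged over all mentions
total_sentiment = sum_pos - sum_neg
avg_sentiment = total_sentiment / total_freq

print(total_freq, round(avg_pos, 6), round(avg_neg, 6), round(avg_sentiment, 6))
```

With the exact scores the averages come out at roughly 0.71 and 0.63, and the overall average at roughly 0.17, close to the rough estimate of 0.16.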

And we can use this information to rank the names.

Let us create a new list in which each item is a dict (one for each name) holding all this information.

We want it to have the following structure:
```
[
  {
    'person': 'Idris Elba',
    'avg_pos': 0.67,
    'avg_neg': 0.6,
    'total_freq': 10,
    'avg_sentiment': 0.16
  },
  {
    'person': ...,
    ...
  }
  ...
]

```


In [12]:
bert_avg_sentiment = []

In [13]:
# loop through each person key in the BERT sentiment dict
for person in sentiment_bert.keys():

  # FREQ
  # count the pos and neg ratings (i.e. len of the lists)
  pos_freq = len(sentiment_bert[person]['POSITIVE'])
  neg_freq = len(sentiment_bert[person]['NEGATIVE'])
  total_freq = pos_freq + neg_freq

  # SUM
  # sum up the scores in each list (the sum of an empty list is 0)
  # NB we keep the sums in local variables instead of overwriting the lists
  # in sentiment_bert, so that this cell can safely be re-run
  sum_pos = sum(sentiment_bert[person]['POSITIVE'])
  sum_neg = sum(sentiment_bert[person]['NEGATIVE'])

  # we can calculate the avg pos and neg (sum / freq)
  # NB do this only if the freq is not zero, otherwise assign the avg to 0
  # (This is to avoid division by 0)
  avg_pos = sum_pos / pos_freq if pos_freq != 0 else 0
  avg_neg = sum_neg / neg_freq if neg_freq != 0 else 0

  # we can calculate the total sentiment (POS sum minus NEG sum)
  total = sum_pos - sum_neg

  # calculate the overall average score (total / posfreq+negfreq)
  overall_avg = total / total_freq

  # append the avg to our avg_sentiment list
  bert_avg_sentiment.append({
      'person': person,
      'avg_pos': avg_pos,
      'avg_neg': avg_neg,
      'total_freq': total_freq,
      'avg_sentiment': overall_avg
  })
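
It is worth noting that the overall average is simply the mean of the signed scores: +score for a POSITIVE review and -score for a NEGATIVE one. Under that observation, a pandas-only alternative could look like the sketch below. The `toy` dataframe and its values are made up for illustration; only the column names `people` and `sentiment` follow `df_bert`.

```python
import pandas as pd

# Toy stand-in for df_bert (hypothetical rows, same 'people'/'sentiment' layout)
toy = pd.DataFrame({
    'people': [['Tom Hanks', 'Tom'], ['Tom Hanks', 'Kevin Bacon'], ['Kevin Bacon']],
    'sentiment': [('POSITIVE', 0.9), ('NEGATIVE', 0.6), ('NEGATIVE', 0.8)],
})

# signed score: +p for POSITIVE reviews, -p for NEGATIVE ones
toy['signed'] = [p if label == 'POSITIVE' else -p for label, p in toy['sentiment']]

# one row per (review, person) pair, keeping full names only
long = toy.explode('people').rename(columns={'people': 'person'})
long = long[long['person'].str.contains(' ', na=False)]

# number of mentions and mean signed score per person
summary = (
    long.groupby('person')['signed']
        .agg(total_freq='count', avg_sentiment='mean')
        .reset_index()
)
print(summary)
```

This variant does not produce the separate `avg_pos` and `avg_neg` columns, so it is a shortcut for the ranking rather than a full replacement of the loop.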

In [14]:
bert_avg_sentiment[:3]

[{'avg_neg': 0.5841552,
  'avg_pos': 0.68665362,
  'avg_sentiment': 0.26305068,
  'person': 'Ayanna Misola',
  'total_freq': 3},
 {'avg_neg': 0,
  'avg_pos': 0.5522495,
  'avg_sentiment': 0.5522495,
  'person': 'Adrian Alandy',
  'total_freq': 1},
 {'avg_neg': 0,
  'avg_pos': 0.5305168,
  'avg_sentiment': 0.5305168,
  'person': 'Mel Brooks',
  'total_freq': 1}]

We can easily convert this list of dicts to a dataframe.

In [15]:
ijw_sentiment_bert = pd.DataFrame(bert_avg_sentiment)

In [16]:
ijw_sentiment_bert

Unnamed: 0,person,avg_pos,avg_neg,total_freq,avg_sentiment
0,Ayanna Misola,0.686654,0.584155,3,0.263051
1,Adrian Alandy,0.552249,0.000000,1,0.552249
2,Mel Brooks,0.530517,0.000000,1,0.530517
3,Rose van Ginkel,0.684371,0.000000,2,0.684371
4,Kitty K7,0.840927,0.000000,1,0.840927
...,...,...,...,...,...
2864,Milana Vayntraub,0.960454,0.000000,1,0.960454
2865,Red Guardian,0.528371,0.000000,1,0.528371
2866,Apollo Creed,0.829540,0.000000,1,0.829540
2867,Kim Yoo Jung,0.761966,0.000000,1,0.761966


Let's check whether the data for "Idris Elba" matches our very rough calculations:

In [17]:
ijw_sentiment_bert.loc[ijw_sentiment_bert['person'] == 'Idris Elba']

Unnamed: 0,person,avg_pos,avg_neg,total_freq,avg_sentiment
185,Idris Elba,0.706153,0.63386,10,0.170148


Looks good!

Our dataframe has 2869 rows. However, most people are mentioned only once or twice. 

We may want to keep only those people who are mentioned more than a certain number of times, say 3 times.

Let us filter our dataframe accordingly.

In [18]:
ijw_sentiment_bert = ijw_sentiment_bert[ijw_sentiment_bert['total_freq'] > 3]

In [19]:
ijw_sentiment_bert

Unnamed: 0,person,avg_pos,avg_neg,total_freq,avg_sentiment
12,Taika Waititi,0.719567,0.503876,5,0.474879
38,Andrew Garfield,0.650010,0.633689,8,0.008160
49,Ethan Hawke,0.740359,0.000000,4,0.740359
55,Kane Hodder,0.000000,0.769500,6,-0.769500
56,Freddy Krueger,0.000000,0.676549,8,-0.676549
...,...,...,...,...,...
2227,Jeymes Samuel,0.770228,0.000000,4,0.770228
2272,Johnny Depp,0.883742,0.850527,4,-0.416960
2316,Edgar Wright,0.783773,0.000000,4,0.783773
2707,Sylvester Stallone,0.591415,0.671282,4,-0.355608


Now we can finally sort by average sentiment to see the most popular people!

In [20]:
ijw_sentiment_bert.sort_values('avg_sentiment', ascending=False).head(20)

Unnamed: 0,person,avg_pos,avg_neg,total_freq,avg_sentiment
166,Tom Hanks,0.844061,0.0,4,0.844061
2316,Edgar Wright,0.783773,0.0,4,0.783773
2227,Jeymes Samuel,0.770228,0.0,4,0.770228
2226,Rufus Buck,0.770228,0.0,4,0.770228
49,Ethan Hawke,0.740359,0.0,4,0.740359
1909,Thomasin McKenzie,0.714305,0.0,4,0.714305
156,Nicolas Cage,0.903847,0.668224,6,0.641836
1012,Emily Blunt,0.856362,0.582145,5,0.568661
411,Benedict Cumberbatch,0.671306,0.583549,7,0.492041
368,Jared Leto,0.780222,0.726666,5,0.478844


And the least popular ones!

In [21]:
ijw_sentiment_bert.sort_values('avg_sentiment', ascending=True).head(20)

Unnamed: 0,person,avg_pos,avg_neg,total_freq,avg_sentiment
902,Bill Moseley,0.0,0.870926,4,-0.870926
573,Linnea Quigley,0.0,0.869971,5,-0.869971
1791,James Karen,0.0,0.861162,4,-0.861162
390,Tommy Jarvis,0.0,0.830847,4,-0.830847
334,Brian Yuzna,0.0,0.81797,4,-0.81797
1411,Neal Marshall Stevens,0.0,0.817064,4,-0.817064
427,Tom Atkins,0.0,0.805105,4,-0.805105
634,Kevin Bacon,0.0,0.801383,4,-0.801383
1703,Ted Raimi,0.0,0.79812,4,-0.79812
231,Steve Miner,0.0,0.788963,4,-0.788963
