<a href="https://colab.research.google.com/github/umbertoselva/NER-based-Sentiment-Analysis/blob/main/02_NER_with_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NER WITH SPACY

This is part 02 of my NER-based Sentiment Analysis Project: 
https://github.com/umbertoselva/NER-based-Sentiment-Analysis

Our goals in this notebook are:

- A) to use the spaCy library to carry out Named Entity Recognition on the "I Just Watched" movie review dataset that we created in Part 01 and extract the 'PERSON' entities, i.e. the names of actors or movie directors;

- B) to calculate the frequency of each name to find out who the most frequently mentioned people are.

## A) EXTRACTING THE ENTITIES

Let us load the "I Just Watched" subreddit dataset that we had saved as a CSV file on Google Drive.

In [1]:
import pandas as pd

In [2]:
url = "https://drive.google.com/file/d/1fpUHi7suKqMybzYjSrGXzlfmPWdAPqoy/view?usp=sharing"
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id
df = pd.read_csv(dwn_url, sep='|', encoding='utf-8')
df

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score
0,t3_vzu4cb,1.657906e+09,Ijustwatched,IJW: Ang Babaeng Nawawala sa Sarili (2022),Source: [https://www.reeladvice.net/2022/07/an...,0.86,5.0,0.0,5.0
1,t3_vz90er,1.657840e+09,Ijustwatched,Ijw: Paws of Fury: The Legend of Hank (2022),"For a very little kid’s first parody/farce, it...",0.89,7.0,0.0,7.0
2,t3_vyxfuj,1.657810e+09,Ijustwatched,IJW: Kitty K7 (2022),Source: [https://www.reeladvice.net/2022/07/ki...,1.00,1.0,0.0,1.0
3,t3_vx6v7n,1.657617e+09,Ijustwatched,IJW : Man from Toronto (2022),"Was a pretty dope movie, watched it online ye...",0.74,4.0,0.0,4.0
4,t3_vwmwkm,1.657558e+09,Ijustwatched,IJW: Thor: Love and Thunder (2022),Source: [https://www.reeladvice.net/2022/07/th...,0.74,4.0,0.0,4.0
...,...,...,...,...,...,...,...,...,...
992,t3_oj9jvl,1.626156e+09,Ijustwatched,IJW: Fired Up! [2009],Fired Up! is a dramedy romcom type film about ...,1.00,4.0,0.0,4.0
993,t3_oinxgw,1.626083e+09,Ijustwatched,IJW: The 8th Night (2021),Plot is confusing to say the least. It appears...,1.00,5.0,0.0,5.0
994,t3_oilr8d,1.626072e+09,Ijustwatched,IJW: Diary of a Chambermaid [1964],Diary of a Chambermaid is a drama mystery roma...,1.00,3.0,0.0,3.0
995,t3_oiisdi,1.626059e+09,Ijustwatched,IJW: Soldier (1998),I remember watching this growing up. Good acti...,1.00,5.0,0.0,5.0
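The URL handling in the loading cell can be wrapped in a small helper. This is only a sketch: the name `drive_download_url` is made up for illustration, and it assumes the sharing link has the `/file/d/<file_id>/view` shape used above.

```python
def drive_download_url(share_url):
    """Turn a Google Drive sharing link into a direct-download link.

    Assumes the link looks like .../file/d/<file_id>/view?usp=sharing,
    so the file id is the second-to-last '/'-separated segment.
    """
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?export=download&id=' + file_id


share_url = "https://drive.google.com/file/d/1fpUHi7suKqMybzYjSrGXzlfmPWdAPqoy/view?usp=sharing"
dwn_url = drive_download_url(share_url)
print(dwn_url)
```

The resulting `dwn_url` can be passed to `pd.read_csv` exactly as in the cell above.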


Let's install spaCy

As we will use a Transformer model to extract the entities, let us install `spacy[transformers]`

In [3]:
!pip install spacy[transformers]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-transformers<1.2.0,>=1.1.2
  Downloading spacy_transformers-1.1.7-py2.py3-none-any.whl (53 kB)
Collecting spacy-alignments<1.0.0,>=0.7.2
  Downloading spacy_alignments-0.8.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Collecting transformers<4.21.0,>=3.4.0
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
...

Let us download spaCy's core Transformer model for the English language: `en_core_web_trf`, where "trf" stands for "transformer".

In [4]:
!python -m spacy download en_core_web_trf

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-trf==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.3.0/en_core_web_trf-3.3.0-py3-none-any.whl (460.3 MB)
Installing collected packages: en-core-web-trf
Successfully installed en-core-web-trf-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


Next, let us load the model pipeline

In [5]:
import spacy

In [6]:
nlp = spacy.load('en_core_web_trf')

When we feed a text to the pipeline, it automatically extracts the entities and stores them in the returned `Doc` object (under `doc.ents`).

Let us use spaCy's `displacy` visualizer to display them.

In [7]:
from spacy import displacy

In [8]:
# let's select a review
df['selftext'].iloc[6]

'[https://jwwreviews.blogspot.com/2022/07/thor-love-and-thunder.html](https://jwwreviews.blogspot.com/2022/07/thor-love-and-thunder.html)\n\n8.5/10\n\nIn the fourth Thor movie, thunder god Thor (played by Chris Hemsworth) ends up pursuing Gorr the God Butcher (Christian Bale) and runs into his ex Jane Foster (Natalie Portman), who now to his surprise has both his powers and hammer.\n\nLike the last film, this is directed by Taika Waititi, and again he makes this one of the funniest Marvel movies. Waititi and all the actors involved definitely seem like they\'re having goofy fun. However, the movie does a serious backbone to several characters\' motivations (especially Jane and Gorr, who has a pretty strong "how-I-became-a-villain" story.) The ending is particularly good and may surprise you.\n\nLove and Thunder does repeat the sin of the last one Ragnarok, but doubles down on it: there is a little too much reliance on jokes. Waititi seems unable to take many reprieves from the humor, a

In [9]:
# let's pass that review to the pipeline and save the Doc object
doc = nlp(df['selftext'].iloc[6])

In [10]:
# let's visualize the entities
displacy.render(doc, jupyter=True, style='ent')

# N.B. jupyter=True is necessary to render the graphics in a notebook
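Besides the visualization, the entities can also be inspected as plain text. The sketch below is illustrative only: `list_entities` is a hypothetical helper, and the stand-in object (with made-up labels) merely mimics the `ents`, `text` and `label_` attributes of a real spaCy `Doc` so that the example runs without the model.

```python
from types import SimpleNamespace


def list_entities(doc):
    """Return (text, label) pairs for every entity stored in doc.ents."""
    return [(ent.text, ent.label_) for ent in doc.ents]


# Stand-in for a real Doc (hypothetical data); in this notebook you would
# pass the `doc` returned by nlp(...) instead.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Chris Hemsworth", label_="PERSON"),
    SimpleNamespace(text="Marvel", label_="ORG"),
])

for text, label in list_entities(fake_doc):
    print(f"{label:<8} {text}")
```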

We want to extract only the `'PERSON'` entities.

The most frequent ones will be actors or directors mentioned in multiple reviews.

Let us create a custom function to achieve that.

We will then apply this function to each review in our dataframe.

In [11]:
def get_people(text):

  # initialize a list to store the PERSON entities we find
  person_list = []

  # create doc with the spacy pipeline
  # this doc will already contain the recognized entities
  doc = nlp(text)

  # loop through the ents (stored in the doc.ents) to find PERSON
  for ent in doc.ents:
    if ent.label_ == 'PERSON':
      person_list.append(ent.text)
  
  # remove duplicates from the list
  person_list = list(set(person_list))

  # return the list of unique PERSON entities
  return person_list

Now we shall apply this function to each cell of the `selftext` column, and save the result in a corresponding cell of a new column that we shall call `people`.

N.B. The following cell took 22min to run on a GPU with High RAM on Google Colab Pro.

In [12]:
df['people'] = df['selftext'].apply(get_people)
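Calling `nlp` once per row is simple but processes one text at a time. spaCy pipelines also offer `nlp.pipe`, which streams texts through the pipeline in batches and may reduce the run time. The following is a sketch, not what was run for this notebook: `get_people_batched` and the `FakeNlp` stand-in are hypothetical names, the batch size is an arbitrary choice, and every input is assumed to be a string.

```python
from types import SimpleNamespace


def get_people_batched(texts, nlp, batch_size=32):
    """Extract the unique PERSON entities of each text using nlp.pipe.

    Returns one list per input text. dict.fromkeys removes duplicates
    while keeping names in order of first appearance.
    """
    results = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        names = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
        results.append(list(dict.fromkeys(names)))
    return results


# Stand-in pipeline (hypothetical) so the sketch runs without the model:
# it treats every comma-separated chunk of the text as a PERSON entity.
class FakeNlp:
    def pipe(self, texts, batch_size=32):
        for text in texts:
            ents = [SimpleNamespace(text=chunk, label_='PERSON')
                    for chunk in text.split(',') if chunk]
            yield SimpleNamespace(ents=ents)


demo = get_people_batched(["Mel Brooks,Mel,Mel Brooks", ""], FakeNlp())
print(demo)
```

With the real pipeline, the call would look like `df['people'] = get_people_batched(df['selftext'].tolist(), nlp)`.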

In [13]:
df.head(10)

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,people
0,t3_vzu4cb,1657906000.0,Ijustwatched,IJW: Ang Babaeng Nawawala sa Sarili (2022),Source: [https://www.reeladvice.net/2022/07/an...,0.86,5.0,0.0,5.0,"[Ayanna Misola, Albina, Adrian Alandy]"
1,t3_vz90er,1657840000.0,Ijustwatched,Ijw: Paws of Fury: The Legend of Hank (2022),"For a very little kid’s first parody/farce, it...",0.89,7.0,0.0,7.0,"[Marx, Mel, Mel Brooks]"
2,t3_vyxfuj,1657810000.0,Ijustwatched,IJW: Kitty K7 (2022),Source: [https://www.reeladvice.net/2022/07/ki...,1.0,1.0,0.0,1.0,"[Antoinette Jadaone, Kitty K7, Joy Aquino, Ros..."
3,t3_vx6v7n,1657617000.0,Ijustwatched,IJW : Man from Toronto (2022),"Was a pretty dope movie, watched it online ye...",0.74,4.0,0.0,4.0,[Kevin Hart's]
4,t3_vwmwkm,1657558000.0,Ijustwatched,IJW: Thor: Love and Thunder (2022),Source: [https://www.reeladvice.net/2022/07/th...,0.74,4.0,0.0,4.0,"[Christian Bale, Natalie Portman, Chris Hemswo..."
5,t3_vw5wb1,1657501000.0,Ijustwatched,IJW: The Wonderful Summer of Mickey Mouse (2022),https://jwwreviews.blogspot.com/2022/07/the-wo...,0.8,3.0,0.0,3.0,"[Mickey, Chris Diamantoupolous', Mickey Mouse]"
6,t3_vw24r4,1657490000.0,Ijustwatched,IJW: Thor: Love and Thunder (2022),[https://jwwreviews.blogspot.com/2022/07/thor-...,0.87,6.0,0.0,6.0,"[Christian Bale, Natalie Portman, Chris Hemswo..."
7,t3_vuu2ph,1657343000.0,Ijustwatched,IJW: Highlander (1986),"First off, I am glad I watched this film. Afte...",1.0,19.0,0.0,19.0,[Henry Cavill]
8,t3_vuskii,1657338000.0,Ijustwatched,IJW: House Of Gucci (2022),The trailer looked amazing and all this market...,0.75,2.0,0.0,2.0,[mich]
9,t3_vuk84b,1657313000.0,Ijustwatched,IJW: Mega Shark vs Crocosaurus (2012),[https://foreverfinalgirl.com/mega-shark-vs-cr...,0.81,3.0,0.0,3.0,"[Michael Gaglio, Nigel Putnum, Nigel, Terry Mc..."


Let us save this dataframe into a CSV file for later use

In [14]:
df.to_csv('ijw_subreddit_ner.csv', sep='|', encoding='utf-8', index=False)

In [15]:
!ls

ijw_subreddit_ner.csv  sample_data


## B) ENTITY FREQUENCY

Let's load our CSV file from Google Drive

In [16]:
url = "https://drive.google.com/file/d/1rGO4DABtChIogEC8mn7EHpQiZotbapM1/view?usp=sharing"
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id
df = pd.read_csv(dwn_url, sep='|', encoding='utf-8')
df

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,people
0,t3_vzu4cb,1.657906e+09,Ijustwatched,IJW: Ang Babaeng Nawawala sa Sarili (2022),Source: [https://www.reeladvice.net/2022/07/an...,0.86,5.0,0.0,5.0,"['Albina', 'Ayanna Misola', 'Adrian Alandy']"
1,t3_vz90er,1.657840e+09,Ijustwatched,Ijw: Paws of Fury: The Legend of Hank (2022),"For a very little kid’s first parody/farce, it...",0.89,7.0,0.0,7.0,"['Marx', 'Mel Brooks', 'Mel']"
2,t3_vyxfuj,1.657810e+09,Ijustwatched,IJW: Kitty K7 (2022),Source: [https://www.reeladvice.net/2022/07/ki...,1.00,1.0,0.0,1.0,"['Hana', 'Rose van Ginkel', 'Kitty K7', 'Joy A..."
3,t3_vx6v7n,1.657617e+09,Ijustwatched,IJW : Man from Toronto (2022),"Was a pretty dope movie, watched it online ye...",0.74,4.0,0.0,4.0,"[""Kevin Hart's""]"
4,t3_vwmwkm,1.657558e+09,Ijustwatched,IJW: Thor: Love and Thunder (2022),Source: [https://www.reeladvice.net/2022/07/th...,0.74,4.0,0.0,4.0,"['Korg', 'Thor', 'Thors', 'Chris Hemsworth', '..."
...,...,...,...,...,...,...,...,...,...,...
992,t3_oj9jvl,1.626156e+09,Ijustwatched,IJW: Fired Up! [2009],Fired Up! is a dramedy romcom type film about ...,1.00,4.0,0.0,4.0,[]
993,t3_oinxgw,1.626083e+09,Ijustwatched,IJW: The 8th Night (2021),Plot is confusing to say the least. It appears...,1.00,5.0,0.0,5.0,"['Buddha', 'Kim Yoo Jung']"
994,t3_oilr8d,1.626072e+09,Ijustwatched,IJW: Diary of a Chambermaid [1964],Diary of a Chambermaid is a drama mystery roma...,1.00,3.0,0.0,3.0,[]
995,t3_oiisdi,1.626059e+09,Ijustwatched,IJW: Soldier (1998),I remember watching this growing up. Good acti...,1.00,5.0,0.0,5.0,[]


Note that when we first created the dataframe with the `people` column, each cell of that column held a Python list.

However, when we saved the dataframe as a CSV file and loaded it back, these lists were read in as strings rather than as Python list objects.

So we first have to convert them back into Python lists.

In [17]:
import ast

In [18]:
df['people'] = df['people'].apply(ast.literal_eval)
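To see what this conversion does to a single cell, here is a minimal sketch using a value copied from the table above (the variable names are illustrative):

```python
import ast

# A cell value as it comes back from the CSV file: a string, not a list.
raw_cell = "['Marx', 'Mel Brooks', 'Mel']"

# literal_eval only parses Python literals (lists, strings, numbers, ...),
# so unlike eval() it will not execute arbitrary code found in the file.
parsed = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(parsed).__name__)
print(parsed)
```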

Now let's find the most frequently cited people.

First we shall turn the `people` column into a list.

This will give us a list of lists (one list per cell).

In [19]:
people_lists = df['people'].to_list()

In [20]:
people_lists[:50]

[['Albina', 'Ayanna Misola', 'Adrian Alandy'],
 ['Marx', 'Mel Brooks', 'Mel'],
 ['Hana',
  'Rose van Ginkel',
  'Kitty K7',
  'Joy Aquino',
  'Antoinette Jadaone',
  "Rose van Ginkel's",
  'Marco Gallo'],
 ["Kevin Hart's"],
 ['Korg',
  'Thor',
  'Thors',
  'Chris Hemsworth',
  'Gorr',
  'Natalie Portman',
  'Valkyrie',
  'Taika Waititi',
  'Jane Foster',
  'Gorr the God Butcher',
  'Thor Odinson',
  'Tessa Thompson',
  'Christian Bale'],
 ['Mickey Mouse', 'Mickey', "Chris Diamantoupolous'"],
 ['Thor',
  'Chris Hemsworth',
  'Gorr',
  'Natalie Portman',
  'Taika Waititi',
  'Bale',
  'Jane Foster',
  'Waititi',
  'Gorr the God Butcher',
  'Gamemaster',
  'Jane',
  'Christian Bale'],
 ['Henry Cavill'],
 ['mich'],
 ['Jaleel White',
  'Terry',
  'Calvin',
  'Michael Gaglio',
  'Nigel',
  'Terry McCormick',
  'Urkel',
  'Smalls',
  'Robert Picardo',
  'Nigel Putnum'],
 ['Sharon',
  'Franklin Delano Floyd',
  'Forrest Gump',
  'Tonya Dawn Hughes',
  'Sharon Marshall',
  'Skye Borgman',
  'To

Let's turn everything into a single list.

This list will contain duplicates. We will then count the duplicates to measure each name's frequency.

In [21]:
people = [single_item for name_list in people_lists for single_item in name_list]

# # which is equivalent to
# people = []
# # loop through the list of lists
# for name_list in people_lists:
#   # loop through each sublist
#   for single_name in name_list:
#     people.append(single_name)
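Another way to flatten a list of lists is `itertools.chain.from_iterable` from the standard library. The sketch below compares it with the comprehension on a small sample taken from the output above; the variable names are illustrative.

```python
from itertools import chain

# A few sublists copied from the people_lists output above
people_lists_sample = [
    ['Albina', 'Ayanna Misola', 'Adrian Alandy'],
    ['Marx', 'Mel Brooks', 'Mel'],
    ["Kevin Hart's"],
]

# chain.from_iterable yields the items of each sublist in turn
flat = list(chain.from_iterable(people_lists_sample))

# The nested comprehension used in the notebook gives the same result
flat_comprehension = [name for sub in people_lists_sample for name in sub]

print(flat)
```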

In [22]:
people[:50]

['Albina',
 'Ayanna Misola',
 'Adrian Alandy',
 'Marx',
 'Mel Brooks',
 'Mel',
 'Hana',
 'Rose van Ginkel',
 'Kitty K7',
 'Joy Aquino',
 'Antoinette Jadaone',
 "Rose van Ginkel's",
 'Marco Gallo',
 "Kevin Hart's",
 'Korg',
 'Thor',
 'Thors',
 'Chris Hemsworth',
 'Gorr',
 'Natalie Portman',
 'Valkyrie',
 'Taika Waititi',
 'Jane Foster',
 'Gorr the God Butcher',
 'Thor Odinson',
 'Tessa Thompson',
 'Christian Bale',
 'Mickey Mouse',
 'Mickey',
 "Chris Diamantoupolous'",
 'Thor',
 'Chris Hemsworth',
 'Gorr',
 'Natalie Portman',
 'Taika Waititi',
 'Bale',
 'Jane Foster',
 'Waititi',
 'Gorr the God Butcher',
 'Gamemaster',
 'Jane',
 'Christian Bale',
 'Henry Cavill',
 'mich',
 'Jaleel White',
 'Terry',
 'Calvin',
 'Michael Gaglio',
 'Nigel',
 'Terry McCormick']

This list will contain actor and director names (which is what we are looking for), but also film character names. We expect most character names to appear in only a single review, so they shouldn't matter much for the frequency count.

In [23]:
from collections import Counter

In [24]:
people_freq = Counter(people)

In [25]:
people_freq.most_common(100)

[('Charles Band', 22),
 ('Jason', 19),
 ('Freddy', 18),
 ('Wes Craven', 15),
 ('Andy', 13),
 ('Michael', 13),
 ('Robert England', 12),
 ('Leatherface', 12),
 ('Billy', 11),
 ('Sam', 10),
 ('Dewey', 10),
 ('Jennifer', 10),
 ('Idris Elba', 10),
 ('Roger Ebert', 10),
 ('Peter', 10),
 ('Michael Myers', 10),
 ('Arthur', 10),
 ('Sam Raimi', 9),
 ('Frank', 9),
 ('Tommy', 9),
 ('Paul', 9),
 ('Nancy', 9),
 ('John Cena', 9),
 ('Chucky', 9),
 ('Victor Miller', 9),
 ('Andrew Garfield', 8),
 ('Sarah', 8),
 ('Freddy Krueger', 8),
 ('Batman', 8),
 ('Angela', 8),
 ('Jack', 8),
 ('Scott', 8),
 ('Tom Holland', 8),
 ('Sally', 8),
 ('David DeCoteau', 8),
 ('Bruce Campbell', 8),
 ('Toulon', 8),
 ('David Schmoeller', 8),
 ('Mike', 7),
 ('Tony', 7),
 ('Johnny', 7),
 ('Eddie', 7),
 ('Karen', 7),
 ('Ben', 7),
 ('Kevin Williamson', 7),
 ('Ghostface', 7),
 ('Alice', 7),
 ('Tina', 7),
 ('Michael Meyers', 7),
 ('Benedict Cumberbatch', 7),
 ('Marilyn Burns', 7),
 ('Richard', 7),
 ('Brad Dourif', 7),
 ('Jason Voorhe

OK, so there are indeed lots of first names, which could be either film character names or real people's names. Let us simply remove them (e.g. "Jason") and keep only names of at least two words, which are mostly first name + family name combinations (e.g. "Idris Elba").

In [26]:
people = [name for name in people if ' ' in name]
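A quick check on a handful of names from the outputs above shows what this filter keeps. It is a small illustration with a made-up variable name. Note that multi-word character names also pass the test, since the filter only looks for a space.

```python
sample_names = ['Jason', 'Idris Elba', 'Freddy', 'Michael Myers', 'Gorr the God Butcher']

# keep only the names that contain at least one space
multi_word = [name for name in sample_names if ' ' in name]

print(multi_word)
# ['Idris Elba', 'Michael Myers', 'Gorr the God Butcher']
```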

Let us count again

In [27]:
people_freq = Counter(people)

In [28]:
people_freq.most_common(50)

[('Charles Band', 22),
 ('Wes Craven', 15),
 ('Robert England', 12),
 ('Idris Elba', 10),
 ('Roger Ebert', 10),
 ('Michael Myers', 10),
 ('Sam Raimi', 9),
 ('John Cena', 9),
 ('Victor Miller', 9),
 ('Andrew Garfield', 8),
 ('Freddy Krueger', 8),
 ('Tom Holland', 8),
 ('David DeCoteau', 8),
 ('Bruce Campbell', 8),
 ('David Schmoeller', 8),
 ('Kevin Williamson', 7),
 ('Michael Meyers', 7),
 ('Benedict Cumberbatch', 7),
 ('Marilyn Burns', 7),
 ('Brad Dourif', 7),
 ('Jason Voorhees', 7),
 ('James Gunn', 7),
 ('Puppet Master', 7),
 ('Kane Hodder', 6),
 ('Tom Savini', 6),
 ('Nicolas Cage', 6),
 ('Rob Zombie’s', 6),
 ('Tom Cruise', 6),
 ('John Carpenter', 6),
 ('Jamie Lee Curtis', 6),
 ('Andy Serkis', 6),
 ('Alan Ritchson', 6),
 ('David Arquette', 6),
 ('Woody Harrelson', 6),
 ('Taika Waititi', 5),
 ('Quentin Tarantino', 5),
 ('Bruce Wayne', 5),
 ('Robert Pattinson', 5),
 ('Ben Affleck', 5),
 ('Miles Teller', 5),
 ('Jared Leto', 5),
 ('Laurie Strode', 5),
 ('Ryan Reynolds', 5),
 ('Linnea Quig
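The output also contains possessive forms such as "Rob Zombie’s" or "Kevin Hart's", which are counted separately from the plain name. One possible refinement, not applied in this notebook, is to strip a trailing possessive marker before counting. The sketch below assumes a hypothetical helper `strip_possessive` and uses a small illustrative sample rather than the real data.

```python
from collections import Counter


def strip_possessive(name):
    """Remove a trailing possessive marker ('s, ’s, or a bare apostrophe)."""
    # the longer suffixes are checked first so that "'s" wins over "'"
    for suffix in ("'s", "’s", "'", "’"):
        if name.endswith(suffix):
            return name[:-len(suffix)]
    return name


# Illustrative sample, not the real dataset
sample = ["Rob Zombie’s", "Rob Zombie", "Kevin Hart's",
          "Chris Diamantoupolous'", "Wes Craven"]

normalized = [strip_possessive(name) for name in sample]
freq = Counter(normalized)
print(freq.most_common(1))
```

Names that simply end in "s" without an apostrophe are left untouched by this rule.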

The frequency count gives us an idea of who the most "talked about" people in the subreddit are.

But with what sentiment do the subreddit users talk about these people? We shall find out at the end of our project.