### Transformers with Hugging Face

In this chapter, we discuss the Hugging Face Transformers library in Python and walk through examples that show how to use pretrained models to perform NLP tasks.

Topics:

- Hugging Face
- NLP applications:
  - Sentiment analysis
  - Named Entity Recognition: automatically pull key terms from text
  - Zero-Shot Classification: classify text without labeled training examples
  - Text Summarization
  - Text Generation
  - Document Similarity

#### Hugging Face

Hugging Face is the company that created the Transformers Python library. The library is popular because it makes it easy for data professionals to access and use pretrained LLMs.

They also host the Model Hub, which contains over 1 million pretrained, open-source models (in addition to base models, there are variants, fine-tuned models, experimental models, etc.).

Note: GitHub is the community for uploading and sharing code, and Model Hub is the community for uploading and sharing pretrained models.

Hugging Face workflow:

- Determine the goal: Different types of tasks use different types of LLMs.
  - Encoder-Only: Sentiment Analysis, Named Entity Recognition
  - Decoder-Only: Text Generation
  - Encoder-Decoder: Zero-Shot Classification, Text Summarization
  - Embedding: Document Similarity

- Identify a pretrained model (from Hugging Face's Model Hub): In Model Hub, we can sort models based on popularity, downloads, and so on.


- Specify input data (a single string, a series/column of text,...)

- Apply the pretrained model (pass in the input data and view the outputs)

Note: There are additional steps in case we want to optimize the model for our data.

For the rest of the work, we create a new environment called "nlp_transformers" and install the following packages:

- python
- Jupyter Notebook
- pandas
- numpy
- scikit-learn
- openpyxl (to read Excel files)
- transformers
- PyTorch (to run transformers; alternative: TensorFlow)

Note: If you run into any issues when installing packages, try the `pip` command to install Jupyter Notebook, transformers, and PyTorch.
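
Once the environment is set up, a quick check can confirm that the packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` list are illustrative assumptions, and note that import names can differ from install names (for example, scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages listed above (assumed; install names may differ,
# e.g. scikit-learn is imported as "sklearn" and PyTorch as "torch").
REQUIRED = ["pandas", "numpy", "sklearn", "openpyxl", "transformers", "torch"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print(f"Missing packages: {missing}")
else:
    print("All packages found.")
```

If the list is not empty, installing the missing packages with `pip` inside the active environment is usually the next thing to try.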

#### Sentiment Analysis with LLMs

Remember that with sentiment analysis, we are determining the positivity/negativity of text.

The default LLM for sentiment analysis is `DistilBERT` (encoder-only). It is a distilled (compressed) version of `BERT`: it has fewer parameters and runs faster.

Syntax:

`from transformers import pipeline`

`sentiment_analyzer = pipeline('sentiment-analysis',`
                                `model = "distilBERT/distilbert-...",`
                                `device=-1)`

The `pipeline` function lets us specify the task we are planning to perform.

`'sentiment-analysis'`: task

`model = "distilBERT/distilbert-..."`: This identifier comes from the Hugging Face Model Hub (we are choosing a particular pretrained model).

`device=-1`: Run the model on the CPU only. We can change this value to use a GPU instead.
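
As a rough sketch of how the `device` value could be chosen automatically, the helper below (the name `pick_device` is hypothetical) returns `'cuda'` for an NVIDIA GPU, `'mps'` for an Apple Silicon GPU, and `-1` for the CPU. It assumes PyTorch is the backend and that the installed transformers version accepts string device names; if PyTorch is not installed, it simply falls back to `-1`.

```python
def pick_device():
    """Return a value for pipeline(device=...): 'cuda', 'mps', or -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # PyTorch not found: fall back to the CPU setting

    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU

    # torch.backends.mps only exists in newer PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon GPU

    return -1  # CPU only


device = pick_device()
print(device)
```

The result can then be passed as `device = device` when creating the pipeline.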

In [1]:
import pandas as pd

In [2]:
### Read Data

df = pd.read_excel('Chapter6_Popchip_Reviews_Sentiment.xlsx')
df.head(3)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan ga...,0.9244
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,I like the puffed nature of this chip that mak...,0.7269
2,23691,A30NYUHEDLWI0Y,5,High,Great Alternative to Potato Chips,I just love these chips! I was always a big f...,0.979


In [3]:
### Note: Only part of the "Text" is visible. We can change the column width to show complete text.

pd.set_option('display.max_colwidth', None)

In [4]:
df.head(3)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269
2,23691,A30NYUHEDLWI0Y,5,High,Great Alternative to Potato Chips,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979


In [5]:
df.shape

(564, 7)

In [6]:
### To save run time, we consider only the first 30 rows.
### .copy() avoids a pandas SettingWithCopyWarning when we add columns later.

df = df.head(30).copy()

In [7]:
df.shape

(30, 7)

In [8]:
### Sentiment Analysis

In [9]:
from transformers import pipeline

In [10]:
sentiment_analysis = pipeline('sentiment-analysis',
                             model = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                             device = -1)

### We can set device = 'mps' (Apple Silicon) or device = 'cuda' (NVIDIA) to use the GPU.

Device set to use cpu


In [11]:
### Data

text1 = 'When life gives you lemons, make lemonade! 🙂'
text2 = 'A dozen lemons will make a gallon of lemonade.'
text3 = 'I didn\'t like the taste of that lemonade at all.'

In [12]:
sentiment_analysis(text1)

### The model labels the text as positive and is very confident about it.

[{'label': 'POSITIVE', 'score': 0.996239423751831}]

In [13]:
sentiment_analysis(text2)

[{'label': 'POSITIVE', 'score': 0.7781563401222229}]

In [14]:
sentiment_analysis(text3)

[{'label': 'NEGATIVE', 'score': 0.9955588579177856}]

Now we would like to apply this to an entire column of text.

We can simply use the `apply` method to run `sentiment_analysis` on each row:

In [15]:
# df['Text'].apply(sentiment_analysis)

The code above raises an error. For one of the reviews, the tokenized sequence is longer than the maximum sequence length the model can handle (668 > 512 tokens). Thus, we have to shorten the review text.

We can achieve this by adding the `truncation` argument to the pipeline. When it is `True`, the pipeline cuts off (truncates) the input to fit within the model's maximum sequence length.
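
To make the idea of truncation concrete, here is a toy sketch that keeps only the first 512 tokens of an over-long input. It is a simplification under stated assumptions: the review text is made up, tokens are produced by splitting on whitespace, and the helper `truncate` is hypothetical. The real tokenizer works on subword tokens and also reserves room for special tokens.

```python
def truncate(tokens, max_length=512):
    """Keep only the first max_length tokens and drop the rest."""
    return tokens[:max_length]


# Hypothetical over-long review: 668 whitespace-separated tokens.
review = "chips " * 668
tokens = review.split()

print(len(tokens), "->", len(truncate(tokens)))  # 668 -> 512
```

Everything after the cutoff is ignored by the model, so the prediction for a very long review is based only on its beginning.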

In [16]:
sentiment_analysis = pipeline('sentiment-analysis',
                             model = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                             device = -1,
                             truncation = True)

Device set to use cpu


In [17]:
df['Text'].apply(sentiment_analysis)

0     [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1      [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2     [{'label': 'NEGATIVE', 'score': 0.6984887719154358}]
3     [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4     [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
5     [{'label': 'POSITIVE', 'score': 0.9994196891784668}]
6     [{'label': 'POSITIVE', 'score': 0.9992188215255737}]
7     [{'label': 'POSITIVE', 'score': 0.9969040751457214}]
8     [{'label': 'POSITIVE', 'score': 0.9894027709960938}]
9     [{'label': 'POSITIVE', 'score': 0.9991832375526428}]
10    [{'label': 'POSITIVE', 'score': 0.9994851350784302}]
11    [{'label': 'NEGATIVE', 'score': 0.7255957126617432}]
12    [{'label': 'POSITIVE', 'score': 0.9966173768043518}]
13    [{'label': 'POSITIVE', 'score': 0.9997195601463318}]
14    [{'label': 'POSITIVE', 'score': 0.8944368958473206}]
15    [{'label': 'POSITIVE', 'score': 0.9989368319511414}]
16    [{'label': 'POSITIVE', 'score': 0.9998534917831421

**Timing, Logging and Device Setup**

Transformers emits many warning messages. Here we suppress the message "Device set to use cpu".

In [18]:
from transformers import logging

In [19]:
logging.set_verbosity_error()

sentiment_analyzer = pipeline('sentiment-analysis',
                             model = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                             device = -1,
                             truncation = True)

Timing

`%%time`: This cell magic reports the run time of the cell.

In [20]:
%%time

sentiment_analyzer = pipeline('sentiment-analysis',
                             model = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                             device = -1,
                             truncation = True)

CPU times: total: 375 ms
Wall time: 646 ms


CPU times: How much processor time did the computation take (summed across all threads)?

Wall time: How long did we have to wait in real time to see the final result?
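
Outside of Jupyter, the same two quantities can be measured with the standard library. The sketch below is an illustration, not what `%%time` does internally: the helper name `timed` is hypothetical, `time.process_time()` measures CPU time for the current process, and `time.perf_counter()` measures wall time.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, cpu_seconds, wall_seconds)."""
    start_wall = time.perf_counter()
    start_cpu = time.process_time()
    result = func(*args, **kwargs)
    cpu = time.process_time() - start_cpu
    wall = time.perf_counter() - start_wall
    return result, cpu, wall


# Sleeping uses almost no CPU, so CPU time stays near zero while wall time does not.
result, cpu, wall = timed(time.sleep, 0.1)
print(f"CPU time: {cpu:.3f} s, wall time: {wall:.3f} s")
```

For multithreaded work the relationship flips: CPU time can exceed wall time because several threads compute at once.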

In [21]:
%%time

sentiment_analyzer = pipeline('sentiment-analysis',
                             model = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                             device = -1,
                             truncation = True)

df['Text'].apply(sentiment_analyzer)

CPU times: total: 14.8 s
Wall time: 4.28 s


0     [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1      [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2     [{'label': 'NEGATIVE', 'score': 0.6984887719154358}]
3     [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4     [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
5     [{'label': 'POSITIVE', 'score': 0.9994196891784668}]
6     [{'label': 'POSITIVE', 'score': 0.9992188215255737}]
7     [{'label': 'POSITIVE', 'score': 0.9969040751457214}]
8     [{'label': 'POSITIVE', 'score': 0.9894027709960938}]
9     [{'label': 'POSITIVE', 'score': 0.9991832375526428}]
10    [{'label': 'POSITIVE', 'score': 0.9994851350784302}]
11    [{'label': 'NEGATIVE', 'score': 0.7255957126617432}]
12    [{'label': 'POSITIVE', 'score': 0.9966173768043518}]
13    [{'label': 'POSITIVE', 'score': 0.9997195601463318}]
14    [{'label': 'POSITIVE', 'score': 0.8944368958473206}]
15    [{'label': 'POSITIVE', 'score': 0.9989368319511414}]
16    [{'label': 'POSITIVE', 'score': 0.9998534917831421

This means that about 14.8 seconds' worth of computation was spread across multiple threads, so we saw the output after about 4.3 seconds.

Note: CPU can perform multiple calculations in parallel.

We can further speed up the calculation using the GPU.

In [22]:
# %%time

# sentiment_analyzer = pipeline('sentiment-analysis',
#                            model = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
#                             device = 'mps',
#                             truncation = True)

# df['Text'].apply(sentiment_analyzer)

**Note**: To run the code above, we need to make sure PyTorch is set up to run on the GPU.

Now we clean up the output so that we can extract a sentiment score and compare it against the sentiment score from VADER.

In [23]:
sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

0    [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2    [{'label': 'NEGATIVE', 'score': 0.6984887719154358}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object

If the label is "positive", I want to see the score as it is, and if it is "negative", I want to see the score with a negative sign.

In [24]:
sentiment_scores[0]

[{'label': 'POSITIVE', 'score': 0.9935212731361389}]

In [25]:
sentiment_scores[0][0]['label']

'POSITIVE'

In [26]:
sentiment_scores[0][0]['score']

0.9935212731361389

I want to add this to my data frame by creating two new columns.

In [27]:
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269


In [28]:
### To apply this to each row, we use a lambda function.

df['Label_HF'] = sentiment_scores.apply(lambda x: x[0]['label'])
df.head(3)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE
2,23691,A30NYUHEDLWI0Y,5,High,Great Alternative to Potato Chips,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,NEGATIVE


In [29]:
df['Score_HF'] = sentiment_scores.apply(lambda x: x[0]['score'])
df.head(3)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605
2,23691,A30NYUHEDLWI0Y,5,High,Great Alternative to Potato Chips,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,NEGATIVE,0.698489


We are almost done, but we want the score to be negative when the label is negative.

We can also do this with a lambda function.

In [30]:
df.apply(lambda row: row['Score_HF'] if row['Label_HF'] == 'POSITIVE' else -row['Score_HF'], axis = 1)

0     0.993521
1     0.999605
2    -0.698489
3    -0.999631
4     0.999181
5     0.999420
6     0.999219
7     0.996904
8     0.989403
9     0.999183
10    0.999485
11   -0.725596
12    0.996617
13    0.999720
14    0.894437
15    0.998937
16    0.999853
17    0.966338
18   -0.942053
19    0.999761
20   -0.965379
21    0.945946
22    0.998185
23    0.999004
24   -0.752335
25    0.999222
26   -0.990390
27    0.999484
28    0.999874
29   -0.930707
dtype: float64

In [31]:
df['Sentiment_HF'] = df.apply(lambda row: row['Score_HF'] if row['Label_HF'] == 'POSITIVE' else -row['Score_HF'], axis = 1)
df.head()

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605
2,23691,A30NYUHEDLWI0Y,5,High,Great Alternative to Potato Chips,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,NEGATIVE,0.698489,-0.698489
3,23692,A2NU55U9LKTB5J,3,Low,Not somthing I would crave,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",0.8689,NEGATIVE,0.999631,-0.999631
4,23693,A225F7QFP5LIW2,5,High,healthy and delicious,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",0.9613,POSITIVE,0.999181,0.999181


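Note that VADER did better in some cases (for example, row 2, a 5-star review that the Hugging Face model labeled negative), and Hugging Face did better in others (for example, row 3, a 3-star review that says "Won't buy again" but received a VADER score of 0.8689).

Thus, which model or approach works best depends on the data.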
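
The label and score steps above can also be collapsed into one small helper. This is a minimal sketch: the function name `signed_sentiment` is hypothetical, and the two sample results are copied from the pipeline outputs shown earlier (rows 0 and 2).

```python
def signed_sentiment(result):
    """Turn [{'label': ..., 'score': ...}] into a single score in [-1, 1]."""
    top = result[0]
    return top["score"] if top["label"] == "POSITIVE" else -top["score"]


# Two results copied from the outputs above (rows 0 and 2).
rows = [
    [{"label": "POSITIVE", "score": 0.9935212731361389}],
    [{"label": "NEGATIVE", "score": 0.6984887719154358}],
]

print([round(signed_sentiment(r), 6) for r in rows])  # [0.993521, -0.698489]
```

With the data frame, this could be applied as `df['Text'].apply(sentiment_analyzer).apply(signed_sentiment)`, which avoids the intermediate label and score columns.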

#### Speeding up Transformers Code

- Use the GPU instead of the CPU (`device = 'mps'` on Apple Silicon, `device = 'cuda'` on NVIDIA GPUs)

- Even with the CPU, there are ways to improve the speed.
    - Use a lighter-weight model (a model with fewer parameters)
    - Try fast tokenization (`use_fast = True`)
    - Specify the number of threads to use (`torch.set_num_threads(1)`)
    - Disable gradients. For predictions only, we do not need to track gradients, since we do not plan to update parameters.

No optimization:

In [32]:
%%time

sentiment_analyzer = pipeline('sentiment-analysis',
                             model = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                             device = -1,
                             truncation = True)

df['Text'].apply(sentiment_analyzer)

CPU times: total: 14.8 s
Wall time: 4.06 s


0     [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1      [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2     [{'label': 'NEGATIVE', 'score': 0.6984887719154358}]
3     [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4     [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
5     [{'label': 'POSITIVE', 'score': 0.9994196891784668}]
6     [{'label': 'POSITIVE', 'score': 0.9992188215255737}]
7     [{'label': 'POSITIVE', 'score': 0.9969040751457214}]
8     [{'label': 'POSITIVE', 'score': 0.9894027709960938}]
9     [{'label': 'POSITIVE', 'score': 0.9991832375526428}]
10    [{'label': 'POSITIVE', 'score': 0.9994851350784302}]
11    [{'label': 'NEGATIVE', 'score': 0.7255957126617432}]
12    [{'label': 'POSITIVE', 'score': 0.9966173768043518}]
13    [{'label': 'POSITIVE', 'score': 0.9997195601463318}]
14    [{'label': 'POSITIVE', 'score': 0.8944368958473206}]
15    [{'label': 'POSITIVE', 'score': 0.9989368319511414}]
16    [{'label': 'POSITIVE', 'score': 0.9998534917831421

With optimization:

In [33]:
import torch

In [34]:
%%time

sentiment_analyzer = pipeline('sentiment-analysis',
                             model = 'distilbert-base-uncased-finetuned-sst-2-english',
                             use_fast = True,
                             device = -1,
                             truncation = True)

torch.set_num_threads(1)

with torch.no_grad():
    df['Text'].apply(sentiment_analyzer)

CPU times: total: 8 s
Wall time: 8.68 s


Note: In this run, these settings did not reduce the waiting time. Total CPU time fell from 14.8 s to 8 s, but wall time rose from 4.06 s to 8.68 s, because limiting PyTorch to a single thread removes the parallelism that helped the earlier run. It is worth timing each setting on your own machine and data.


Task: Perform sentiment analysis using Hugging Face and LLMs instead of VADER and rules. Then compare the results.

Also, what are the top 10 most feel-good movies and the top 10 darkest movies according to the data?

In [35]:
### Import Data

movies = pd.read_csv('Chapter6_movie_reviews_sentiment.csv')
movies.head(3)

### Note: The last column contains the sentiment score from VADER

Unnamed: 0,movie_title,rating,genre,in_theaters_date,movie_info,directors,director_gender,tomatometer_rating,audience_rating,critics_consensus,sentiment_vader
0,A Dog's Journey,PG,"Drama, Kids & Family",5/17/19,"Bailey (voiced again by Josh Gad) is living the good life on the Michigan farm of his ""boy,"" Ethan (Dennis Quaid) and Ethan's wife Hannah (Marg Helgenberger). He even has a new playmate: Ethan and Hannah's baby granddaughter, CJ. The problem is that CJ's mom, Gloria (Betty Gilpin), decides to take CJ away. As Bailey's soul prepares to leave this life for a new one, he makes a promise to Ethan to find CJ and protect her at any cost. Thus begins Bailey's adventure through multiple lives filled with love, friendship and devotion as he, CJ (Kathryn Prescott), and CJ's best friend Trent (Henry Lau) experience joy and heartbreak, music and laughter, and few really good belly rubs.",Gail Mancuso,female,50,92,"A Dog's Journey is as sentimental as one might expect, but even cynical viewers may find their ability to resist shedding a tear stretched to the puppermost limit.",0.9837
1,A Dog's Way Home,PG,Drama,1/11/19,"Separated from her owner, a dog sets off on an 400-mile journey to get back to the safety and security of the place she calls home. Along the way, she meets a series of new friends and manages to bring a little bit of comfort and joy to their lives.",Charles Martin Smith,male,60,71,"A Dog's Way Home may not quite be a family-friendly animal drama fan's best friend, but this canine adventure is no less heartwarming for its familiarity.",0.9237
2,A Tuba to Cuba,NR,"Documentary, Musical & Performing Arts",2/15/19,"The leader of New Orleans' famed Preservation Hall Jazz Band seeks to fulfill his late father's dream of retracing their musical roots to the shores of Cuba in search of the indigenous music that gave birth to New Orleans jazz. A TUBA TO CUBA celebrates the triumph of the human spirit expressed through the universal language of music and challenges us to resolve to build bridges, not walls.","Danny Clinch, T.G. Herrington",male,100,82,,0.936


In [36]:
### Create the sentiment analyzer:

sentiment_analyzer1 = pipeline('sentiment-analysis',
                             model = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                             device = -1)

In [37]:
movies['movie_info'].apply(sentiment_analyzer1)

0      [{'label': 'POSITIVE', 'score': 0.9982469081878662}]
1      [{'label': 'POSITIVE', 'score': 0.9995336532592773}]
2      [{'label': 'POSITIVE', 'score': 0.9994434714317322}]
3      [{'label': 'POSITIVE', 'score': 0.9994601607322693}]
4      [{'label': 'POSITIVE', 'score': 0.9972022771835327}]
                               ...                         
161    [{'label': 'POSITIVE', 'score': 0.9987725615501404}]
162    [{'label': 'POSITIVE', 'score': 0.9984967708587646}]
163    [{'label': 'POSITIVE', 'score': 0.9989098310470581}]
164    [{'label': 'POSITIVE', 'score': 0.9913573265075684}]
165    [{'label': 'NEGATIVE', 'score': 0.9984468817710876}]
Name: movie_info, Length: 166, dtype: object

In [38]:
sentiment_scores1 = movies['movie_info'].apply(sentiment_analyzer1)

movies['label_hf'] = sentiment_scores1.apply(lambda x: x[0]['label'])
movies['score_hf'] = sentiment_scores1.apply(lambda x: x[0]['score'])
movies['sentiment_hf'] = movies.apply(lambda row: row['score_hf'] if row['label_hf'] == 'POSITIVE' else -row['score_hf'], axis = 1)

In [39]:
movies.head(3)

Unnamed: 0,movie_title,rating,genre,in_theaters_date,movie_info,directors,director_gender,tomatometer_rating,audience_rating,critics_consensus,sentiment_vader,label_hf,score_hf,sentiment_hf
0,A Dog's Journey,PG,"Drama, Kids & Family",5/17/19,"Bailey (voiced again by Josh Gad) is living the good life on the Michigan farm of his ""boy,"" Ethan (Dennis Quaid) and Ethan's wife Hannah (Marg Helgenberger). He even has a new playmate: Ethan and Hannah's baby granddaughter, CJ. The problem is that CJ's mom, Gloria (Betty Gilpin), decides to take CJ away. As Bailey's soul prepares to leave this life for a new one, he makes a promise to Ethan to find CJ and protect her at any cost. Thus begins Bailey's adventure through multiple lives filled with love, friendship and devotion as he, CJ (Kathryn Prescott), and CJ's best friend Trent (Henry Lau) experience joy and heartbreak, music and laughter, and few really good belly rubs.",Gail Mancuso,female,50,92,"A Dog's Journey is as sentimental as one might expect, but even cynical viewers may find their ability to resist shedding a tear stretched to the puppermost limit.",0.9837,POSITIVE,0.998247,0.998247
1,A Dog's Way Home,PG,Drama,1/11/19,"Separated from her owner, a dog sets off on an 400-mile journey to get back to the safety and security of the place she calls home. Along the way, she meets a series of new friends and manages to bring a little bit of comfort and joy to their lives.",Charles Martin Smith,male,60,71,"A Dog's Way Home may not quite be a family-friendly animal drama fan's best friend, but this canine adventure is no less heartwarming for its familiarity.",0.9237,POSITIVE,0.999534,0.999534
2,A Tuba to Cuba,NR,"Documentary, Musical & Performing Arts",2/15/19,"The leader of New Orleans' famed Preservation Hall Jazz Band seeks to fulfill his late father's dream of retracing their musical roots to the shores of Cuba in search of the indigenous music that gave birth to New Orleans jazz. A TUBA TO CUBA celebrates the triumph of the human spirit expressed through the universal language of music and challenges us to resolve to build bridges, not walls.","Danny Clinch, T.G. Herrington",male,100,82,,0.936,POSITIVE,0.999443,0.999443


Now let's compare the results:

In [40]:
movies[['movie_title','movie_info','sentiment_vader','sentiment_hf']].head()

Unnamed: 0,movie_title,movie_info,sentiment_vader,sentiment_hf
0,A Dog's Journey,"Bailey (voiced again by Josh Gad) is living the good life on the Michigan farm of his ""boy,"" Ethan (Dennis Quaid) and Ethan's wife Hannah (Marg Helgenberger). He even has a new playmate: Ethan and Hannah's baby granddaughter, CJ. The problem is that CJ's mom, Gloria (Betty Gilpin), decides to take CJ away. As Bailey's soul prepares to leave this life for a new one, he makes a promise to Ethan to find CJ and protect her at any cost. Thus begins Bailey's adventure through multiple lives filled with love, friendship and devotion as he, CJ (Kathryn Prescott), and CJ's best friend Trent (Henry Lau) experience joy and heartbreak, music and laughter, and few really good belly rubs.",0.9837,0.998247
1,A Dog's Way Home,"Separated from her owner, a dog sets off on an 400-mile journey to get back to the safety and security of the place she calls home. Along the way, she meets a series of new friends and manages to bring a little bit of comfort and joy to their lives.",0.9237,0.999534
2,A Tuba to Cuba,"The leader of New Orleans' famed Preservation Hall Jazz Band seeks to fulfill his late father's dream of retracing their musical roots to the shores of Cuba in search of the indigenous music that gave birth to New Orleans jazz. A TUBA TO CUBA celebrates the triumph of the human spirit expressed through the universal language of music and challenges us to resolve to build bridges, not walls.",0.936,0.999443
3,A Vigilante,"A once abused woman, Sadie (Olivia Wilde), devotes herself to ridding victims of their domestic abusers while hunting down the husband she must kill to truly be free. A Vigilante is a thriller inspired by the strength and bravery of real domestic abuse survivors and the incredible obstacles to safety they face.",-0.0334,0.99946
4,After,"Based on Anna Todd's best-selling novel which became a publishing sensation on social storytelling platform Wattpad, AFTER follows Tessa (Langford), a dedicated student, dutiful daughter and loyal girlfriend to her high school sweetheart, as she enters her first semester in college. Armed with grand ambitions for her future, her guarded world opens up when she meets the dark and mysterious Hardin Scott (Tiffin), a magnetic, brooding rebel who makes her question all she thought she knew about herself and what she wants out of life.",0.9349,0.997202


Let's check a few movies with a negative vibe.

In [41]:
movies[['movie_title','movie_info','sentiment_vader','sentiment_hf']].sort_values('sentiment_hf').head()

Unnamed: 0,movie_title,movie_info,sentiment_vader,sentiment_hf
22,Braid,"Two wanted women decide to rob their wealthy yet mentally unstable friend who lives in a fantasy world they all created as children. To take her money, the girls must take part in a deadly and perverse game of make believe throughout a sprawling yet decaying estate. As things become increasingly violent and hallucinatory, they realize that obtaining the money may be the least of their concerns.",-0.8316,-0.999203
103,Spider-Man: Far From Home,"Peter Parker returns in Spider-Man: Far From Home, the next chapter of the Spider-Man: Homecoming series! Our friendly neighborhood Super Hero decides to join his best friends Ned, MJ, and the rest of the gang on a European vacation. However, Peter's plan to leave super heroics behind for a few weeks are quickly scrapped when he begrudgingly agrees to help Nick Fury uncover the mystery of several elemental creature attacks, creating havoc across the continent!",0.9722,-0.998805
34,Dragged Across Concrete,"DRAGGED ACROSS CONCRETE follows two police detectives who find themselves suspended when a video of their strong-arm tactics is leaked to the media. With little money and no options, the embittered policemen descend into the criminal underworld and find more than they wanted waiting in the shadows.",-0.9015,-0.998734
165,Yesterday,"Jack Malik (Himesh Patel, BBC's Eastenders) is a struggling singer-songwriter in a tiny English seaside town whose dreams of fame are rapidly fading, despite the fierce devotion and support of his childhood best friend, Ellie (Lily James, Mamma Mia! Here We Go Again). Then, after a freak bus accident during a mysterious global blackout, Jack wakes up to discover that The Beatles have never existed... and he finds himself with a very complicated problem, indeed.",0.1365,-0.998447
102,Skin,"A white supremacist reforms his life after falling in love but saying goodbye to his skinhead life isn't a clean process. He must betray his former gang and work alongside the FBI in order to remove the body ink that has represented his identity for so long, as well as the burden of the gang's crimes he has carried.",-0.8377,-0.996846


Note: VADER identifies the Spider-Man movie as positive (likely due to the "!" marks), while Hugging Face identifies it as negative (likely due to negative-type words).

#### Named Entity Recognition (NER)

NER is used to find and label important information (people, places, organizations, dates,...) in text.

We could perform this using spaCy, but it also works very well with Transformers.

Here, we will be using the default model for NER: BERT (encoder-only).

The implementation is very similar to sentiment analysis, except that we pass the task name `'ner'` to `pipeline`.

We also add `aggregation_strategy = 'SIMPLE'`. This tells the pipeline to group adjacent tokens that share an entity type, so in most cases the output shows whole words rather than subwords.

In the output, we can also see a confidence score for each entity group (whether it is an organization, a location, and so on).

In [42]:
from transformers import pipeline, logging

Since we are setting up a new analyzer, it is best to enable warnings.

In [43]:
logging.set_verbosity_warning()

In [44]:
ner_analyzer = pipeline('ner')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [45]:
### Let's hide the warnings again.

logging.set_verbosity_error()

In [46]:
ner_analyzer = pipeline('ner',
                       model = 'dbmdz/bert-large-cased-finetuned-conll03-english',
                       device = -1)

In [47]:
text4 = "I ordered an Arnold Palmer at Applebee's in Springfield."

In [48]:
ner_analyzer(text4)

[{'entity': 'I-MISC',
  'score': np.float32(0.9920421),
  'index': 4,
  'word': 'Arnold',
  'start': 13,
  'end': 19},
 {'entity': 'I-MISC',
  'score': np.float32(0.9907755),
  'index': 5,
  'word': 'Palmer',
  'start': 20,
  'end': 26},
 {'entity': 'I-ORG',
  'score': np.float32(0.9482757),
  'index': 7,
  'word': 'Apple',
  'start': 30,
  'end': 35},
 {'entity': 'I-ORG',
  'score': np.float32(0.9633053),
  'index': 8,
  'word': '##bee',
  'start': 35,
  'end': 38},
 {'entity': 'I-ORG',
  'score': np.float32(0.96600634),
  'index': 9,
  'word': "'",
  'start': 38,
  'end': 39},
 {'entity': 'I-ORG',
  'score': np.float32(0.8968691),
  'index': 10,
  'word': 's',
  'start': 39,
  'end': 40},
 {'entity': 'I-LOC',
  'score': np.float32(0.9780036),
  'index': 12,
  'word': 'Springfield',
  'start': 44,
  'end': 55}]

When you see "##" in front of a word, it is a subword. We can merge subwords back into whole words by adding an argument to the pipeline.

In [49]:
ner_analyzer = pipeline('ner',
                       model = 'dbmdz/bert-large-cased-finetuned-conll03-english',
                       device = -1,
                       aggregation_strategy = 'SIMPLE')

In [50]:
ner_analyzer(text4)

[{'entity_group': 'MISC',
  'score': np.float32(0.9914088),
  'word': 'Arnold Palmer',
  'start': 13,
  'end': 26},
 {'entity_group': 'ORG',
  'score': np.float32(0.9436141),
  'word': "Applebee ' s",
  'start': 30,
  'end': 40},
 {'entity_group': 'LOC',
  'score': np.float32(0.9780036),
  'word': 'Springfield',
  'start': 44,
  'end': 55}]

Here, 'Arnold Palmer' is labeled as miscellaneous (MISC), 'Applebee's' as an organization (ORG), and 'Springfield' as a location (LOC).
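To make the effect of `aggregation_strategy` more concrete, the sketch below re-creates the grouping by hand from the token-level output shown earlier. It is only an approximation of what the pipeline does internally. `group_tokens` is a hypothetical helper, not part of Transformers: it merges neighbouring tokens (consecutive `index` values) that share an entity type, averages their scores, and reads the word directly from the original text using `start` and `end`.

```python
text = "I ordered an Arnold Palmer at Applebee's in Springfield."

# Token-level predictions copied from the un-aggregated output above.
tokens = [
    {"entity": "I-MISC", "score": 0.9920421, "index": 4, "start": 13, "end": 19},
    {"entity": "I-MISC", "score": 0.9907755, "index": 5, "start": 20, "end": 26},
    {"entity": "I-ORG", "score": 0.9482757, "index": 7, "start": 30, "end": 35},
    {"entity": "I-ORG", "score": 0.9633053, "index": 8, "start": 35, "end": 38},
    {"entity": "I-ORG", "score": 0.96600634, "index": 9, "start": 38, "end": 39},
    {"entity": "I-ORG", "score": 0.8968691, "index": 10, "start": 39, "end": 40},
    {"entity": "I-LOC", "score": 0.9780036, "index": 12, "start": 44, "end": 55},
]


def group_tokens(tokens, text):
    """Hypothetical helper: merge neighbouring tokens with the same entity type."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        is_neighbour = last is not None and tok["index"] == last["last_index"] + 1
        if is_neighbour and last["entity_group"] == label:
            # Extend the current group with this token.
            last["scores"].append(tok["score"])
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            # Start a new group.
            groups.append({
                "entity_group": label,
                "scores": [tok["score"]],
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
            })
    return [
        {
            "entity_group": g["entity_group"],
            "score": round(sum(g["scores"]) / len(g["scores"]), 4),
            "word": text[g["start"]:g["end"]],
            "start": g["start"],
            "end": g["end"],
        }
        for g in groups
    ]


grouped = group_tokens(tokens, text)
for group in grouped:
    print(group)
```

The averaged scores (about 0.9914, 0.9436, and 0.978) line up with the aggregated output above. Slicing the original text gives `Applebee's`, whereas the pipeline prints `Applebee ' s` because it rebuilds the word from its tokens.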

Let's assume we are not happy with the outcome and we want to use a different model. For that, we can go to the Model Hub.

Link: https://huggingface.co/models

Then we can either look for a new model under Tasks (on the left side) or search with the keyword "ner".
Once you choose a model, copy the model name (a copy button should be next to the name at the top):
dslim/bert-base-NER

Note: When you specify a new model, it may take some time to download the model.

In [51]:
ner_analyzer1 = pipeline('ner',
                       model = 'dslim/bert-base-NER',
                       device = -1,
                       aggregation_strategy = 'SIMPLE')

In [52]:
ner_analyzer1(text4)

[{'entity_group': 'PER',
  'score': np.float32(0.876623),
  'word': 'Arnold Palmer',
  'start': 13,
  'end': 26},
 {'entity_group': 'ORG',
  'score': np.float32(0.70051384),
  'word': 'Applebee',
  'start': 30,
  'end': 38},
 {'entity_group': 'LOC',
  'score': np.float32(0.6289258),
  'word': "' s",
  'start': 38,
  'end': 40},
 {'entity_group': 'LOC',
  'score': np.float32(0.99173564),
  'word': 'Springfield',
  'start': 44,
  'end': 55}]

Note that here it identifies 'Arnold Palmer' as a person (PER).

**Clean NER Output:**

In [53]:
df.head(2)

### We want to extract entities from "Text". Then we want to create a column containing entities for each "Text".

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605


In [54]:
ner_analyzer3 = pipeline('ner',
                       model = 'dbmdz/bert-large-cased-finetuned-conll03-english',
                       device = -1,
                       aggregation_strategy = 'SIMPLE')

In [55]:
ner_analyzer3(df['Text'][0])

[]

This means that, for the first entry, there are no entities.

In [56]:
ner_analyzer3(df['Text'][1])

[{'entity_group': 'MISC',
  'score': np.float32(0.91492635),
  'word': 'Salt and Vinegar',
  'start': 99,
  'end': 115},
 {'entity_group': 'MISC',
  'score': np.float32(0.77423894),
  'word': 'Salt and Vinegar',
  'start': 392,
  'end': 408},
 {'entity_group': 'ORG',
  'score': np.float32(0.9589694),
  'word': 'S & V',
  'start': 450,
  'end': 453}]

This is nice, but we want a unique list of entities. As a first step, we can pull out just the entity words with a list comprehension.

In [57]:
[entity['word'] for entity in ner_analyzer(df['Text'][1])]

['Salt and Vinegar', 'Salt and Vinegar', 'S & V']

Now we can apply this to the entire column.

In [58]:
df['Text'].apply(lambda x: [entity['word'] for entity in ner_analyzer(x)])

0                                                                                 []
1                                        [Salt and Vinegar, Salt and Vinegar, S & V]
2                                                                           [Amazon]
3                                                                                 []
4                                                                                 []
5                                                                                 []
6                                                                 [Ch, ##ar, Amazon]
7                                                                                 []
8                          [Popchips, Miami, Amazon, General Mills, B, ##le, Amazon]
9                                                                           [Costco]
10                                                                                []
11                                             [PopChips, Amazon,

We are going to add this as a new column.

In [59]:
df['Named_Entities'] = df['Text'].apply(lambda x: [entity['word'] for entity in ner_analyzer(x)])
df.head(3)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF,Named_Entities
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521,[]
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605,"[Salt and Vinegar, Salt and Vinegar, S & V]"
2,23691,A30NYUHEDLWI0Y,5,High,Great Alternative to Potato Chips,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,NEGATIVE,0.698489,-0.698489,[Amazon]
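Calling the analyzer once per row through `apply` works, but the same extraction can also be written as a small function that takes a whole list of texts. The sketch below assumes that the analyzer, when given a list of strings, returns one list of entity dictionaries per text (which is how Transformers pipelines generally behave). `words_per_text` and `fake_analyzer` are hypothetical names; the fake analyzer is only a stand-in so the example runs without downloading a model.

```python
def words_per_text(analyzer, texts):
    """Run the analyzer on a list of texts and keep only the entity words."""
    results = analyzer(list(texts))  # assumed: one list of entity dicts per text
    return [[entity["word"] for entity in entities] for entities in results]


def fake_analyzer(texts):
    """Hypothetical stand-in for a real NER pipeline (simple keyword lookup)."""
    known = {"Amazon": "ORG", "Costco": "ORG"}
    return [
        [{"entity_group": label, "word": word}
         for word, label in known.items() if word in text]
        for text in texts
    ]


reviews = [
    "I reorder these through Amazon.",
    "Plain chips, nothing special.",
    "Cheaper at Costco than on Amazon.",
]
entity_words = words_per_text(fake_analyzer, reviews)
print(entity_words)  # [['Amazon'], [], ['Amazon', 'Costco']]
```

With the real pipeline, the same function could be given `ner_analyzer` and the `Text` column instead of the stand-ins.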


We are not done yet. We need a single list with all the unique entities.

In [60]:
df['Named_Entities'].explode()

0                  NaN
1     Salt and Vinegar
1     Salt and Vinegar
1                S & V
2               Amazon
            ...       
25                 ##D
26                 NaN
27                 NaN
28                Chip
29              Popchi
Name: Named_Entities, Length: 64, dtype: object

In [61]:
df['Named_Entities'].explode().dropna()

1          Salt and Vinegar
1          Salt and Vinegar
1                     S & V
2                    Amazon
6                        Ch
6                      ##ar
6                    Amazon
8                  Popchips
8                     Miami
8                    Amazon
8             General Mills
8                         B
8                      ##le
8                    Amazon
9                    Costco
11                 PopChips
11                   Amazon
11                 PopChips
11                  Cheetos
13            Stop and Shop
13                   Amazon
14                   Popchi
15                    Watch
15                 PopChips
16                        B
16                     Chip
20                   Amazon
21               Popchips B
21                   Popchi
21                   Amazon
21                   Amazon
23                   COSTCO
23                   Amazon
23                      com
24                 Pringles
24                  

I would like to see this as a list.

In [62]:
df['Named_Entities'].explode().dropna().tolist()

['Salt and Vinegar',
 'Salt and Vinegar',
 'S & V',
 'Amazon',
 'Ch',
 '##ar',
 'Amazon',
 'Popchips',
 'Miami',
 'Amazon',
 'General Mills',
 'B',
 '##le',
 'Amazon',
 'Costco',
 'PopChips',
 'Amazon',
 'PopChips',
 'Cheetos',
 'Stop and Shop',
 'Amazon',
 'Popchi',
 'Watch',
 'PopChips',
 'B',
 'Chip',
 'Amazon',
 'Popchips B',
 'Popchi',
 'Amazon',
 'Amazon',
 'COSTCO',
 'Amazon',
 'com',
 'Pringles',
 'Lays',
 'Pringles',
 'Pringles',
 'Salt',
 'and',
 "Vinegar Pirate ' s Bo",
 'S',
 '& V',
 'VA',
 '##RI',
 '##ING',
 '##FFIC',
 '##IANA',
 '##D',
 'Chip',
 'Popchi']

We still have duplicates. We can get rid of them using `set`.

In [63]:
list(set(df['Named_Entities'].explode().dropna().tolist()))

['com',
 'S & V',
 '##le',
 'Costco',
 '##RI',
 'Amazon',
 'Ch',
 "Vinegar Pirate ' s Bo",
 '##IANA',
 'PopChips',
 'B',
 '##FFIC',
 'Lays',
 '##D',
 'Salt',
 'S',
 'Watch',
 'Stop and Shop',
 '& V',
 'Pringles',
 '##ar',
 'Popchi',
 'General Mills',
 'Salt and Vinegar',
 'Miami',
 'Chip',
 'Cheetos',
 'Popchips',
 'VA',
 'and',
 '##ING',
 'COSTCO',
 'Popchips B']
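Note that `set` does not keep the original order, which is why the list above looks shuffled. If the order of first appearance matters, one alternative (a minimal sketch, not what the notebook uses) is `dict.fromkeys`, since dictionary keys are unique and keep insertion order. The sample values are the first few entries of the exploded list.

```python
# First few entities from the exploded list above.
entities = ['Salt and Vinegar', 'Salt and Vinegar', 'S & V',
            'Amazon', 'Ch', '##ar', 'Amazon']

# Dictionary keys are unique and keep insertion order.
unique_in_order = list(dict.fromkeys(entities))
print(unique_in_order)
# ['Salt and Vinegar', 'S & V', 'Amazon', 'Ch', '##ar']
```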

There are still some subwords (entries starting with "##"). We can now get rid of them.

In [64]:
named_entities = list(set(df['Named_Entities'].explode().dropna().tolist()))
named_entities[:5]

['com', 'S & V', '##le', 'Costco', '##RI']

In [65]:
[entity for entity in named_entities if '#' not in entity]

['com',
 'S & V',
 'Costco',
 'Amazon',
 'Ch',
 "Vinegar Pirate ' s Bo",
 'PopChips',
 'B',
 'Lays',
 'Salt',
 'S',
 'Watch',
 'Stop and Shop',
 '& V',
 'Pringles',
 'Popchi',
 'General Mills',
 'Salt and Vinegar',
 'Miami',
 'Chip',
 'Cheetos',
 'Popchips',
 'VA',
 'and',
 'COSTCO',
 'Popchips B']
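Even after removing the "##" entries, the list still holds fragments ('B', 'Ch', 'S', 'and') and case variants ('Costco' / 'COSTCO', 'PopChips' / 'Popchips'). As an optional extra step, the sketch below lowercases the entities, drops very short fragments, and counts how often each one appears. The helper `tidy`, the length cut-off, and the small stopword tuple are all assumptions chosen for illustration, and the input is a handful of values copied from the exploded list above.

```python
from collections import Counter

# A few entries copied from the exploded list above.
raw = ['Amazon', 'Popchips', 'PopChips', 'Costco', 'COSTCO',
       'B', 'Ch', 'and', 'Amazon', 'Pringles', 'S', '##le']


def tidy(entities, min_length=3, stopwords=("and", "com")):
    """Hypothetical helper: normalize case, drop fragments, count occurrences."""
    kept = []
    for entity in entities:
        too_short = len(entity) < min_length
        if "#" in entity or too_short or entity.lower() in stopwords:
            continue
        kept.append(entity.lower())
    return Counter(kept)


counts = tidy(raw)
print(dict(counts))
# {'amazon': 2, 'popchips': 2, 'costco': 2, 'pringles': 1}
```

A length cut-off like this is a blunt tool: it would also drop genuinely short names, so the threshold is worth checking against your own data.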

Task: We want to get a rough list of characters from the book collection.
Use NER to extract the named entities from the book descriptions, and then filter to keep only people.

In [66]:
### Read Data

books = pd.read_csv('Chapter2_childrens_books.csv')
books.head(3)

Unnamed: 0,Ranking,Title,Author,Year,Rating,Description
0,1,Where the Wild Things Are,Maurice Sendak,1963,4.25,"Where the Wild Things Are follows Max, a young boy who, after being sent to his room for misbehaving, imagines sailing to an island filled with wild creatures. As their king, Max tames the beasts and eventually returns home to find his supper waiting for him. This iconic book explores themes of imagination, adventure, and the complex emotions of childhood, all captured through Sendak's whimsical illustrations and story."
1,2,The Very Hungry Caterpillar,Eric Carle,1969,4.34,"The Very Hungry Caterpillar tells the story of a caterpillar who eats through a variety of foods before eventually becoming a butterfly. Eric Carle’s use of colorful collage illustrations and rhythmic text has made this book a beloved classic for young readers. The simple, engaging story introduces children to days of the week, counting, and the concept of metamorphosis. It’s a staple in early childhood education."
2,3,The Giving Tree,Shel Silverstein,1964,4.38,"The Giving Tree is a touching and bittersweet story about a tree that gives everything it has to a boy over the course of his life. As the boy grows up, he takes more from the tree, and the tree continues to give, even when it has little left. Silverstein’s minimalist text and illustrations convey deep themes of unconditional love, selflessness, and the passage of time. It has sparked much discussion about relationships and sacrifice."


In [67]:
### Apply NER to the Description column

ner_analyzer_books = pipeline('ner',
                       model = 'dbmdz/bert-large-cased-finetuned-conll03-english',
                       device = -1,
                       aggregation_strategy = 'SIMPLE')

In [68]:
### Let's obtain named entities for the first entry first.

ner_analyzer_books(books['Description'][0])

[{'entity_group': 'MISC',
  'score': np.float32(0.9462517),
  'word': 'Where the Wild Things Are',
  'start': 0,
  'end': 25},
 {'entity_group': 'PER',
  'score': np.float32(0.9990614),
  'word': 'Max',
  'start': 34,
  'end': 37},
 {'entity_group': 'PER',
  'score': np.float32(0.9984414),
  'word': 'Max',
  'start': 175,
  'end': 178},
 {'entity_group': 'PER',
  'score': np.float32(0.97894603),
  'word': 'Sendak',
  'start': 380,
  'end': 386}]

In [69]:
### From the output, let's keep only "word" for entities tagged as a person (PER).

[entity['word'] for entity in ner_analyzer_books(books['Description'][0]) if entity['entity_group'] == 'PER']

['Max', 'Max', 'Sendak']
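The same filter can be wrapped in a small function, which also makes it easy to require a minimum confidence score. This is a sketch rather than part of the original workflow: `pick_words` and its `min_score` parameter are hypothetical, and the input is the pipeline output for the first description copied from above (scores rounded).

```python
# Aggregated output for the first description, copied from above (scores rounded).
entities = [
    {"entity_group": "MISC", "score": 0.9463, "word": "Where the Wild Things Are"},
    {"entity_group": "PER", "score": 0.9991, "word": "Max"},
    {"entity_group": "PER", "score": 0.9984, "word": "Max"},
    {"entity_group": "PER", "score": 0.9789, "word": "Sendak"},
]


def pick_words(entities, group="PER", min_score=0.0):
    """Return words of entities in `group` whose score is at least `min_score`."""
    return [
        entity["word"]
        for entity in entities
        if entity["entity_group"] == group and entity["score"] >= min_score
    ]


people = pick_words(entities)
confident_people = pick_words(entities, min_score=0.99)
print(people)            # ['Max', 'Max', 'Sendak']
print(confident_people)  # ['Max', 'Max']
```

Raising `min_score` trades coverage for precision: here it keeps the character 'Max' but drops 'Sendak', which scored slightly lower.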

In [70]:
### Now we are choosing "word" for all the entries.

books['Description'].apply(lambda row: [entity['word'] for entity in ner_analyzer_books(row) if entity['entity_group'] == 'PER'])

0                  [Max, Max, Sendak]
1                  [##pi, Eric Carle]
2                       [Silverstein]
3           [Sam - I - Am, Dr. Seuss]
4                      [Clement Hurd]
                   ...               
95                [Jon J. Muth, Muth]
96    [Shel Silverstein, Silverstein]
97       [Harry, Sirius Black, Harry]
98      [Harry, Harry, Ron, Hermione]
99                          [Galdone]
Name: Description, Length: 100, dtype: object

In [76]:
### Create a list with named entities

named_entities = books['Description'].apply(lambda row: [entity['word'] for entity in ner_analyzer_books(row) if entity['entity_group'] == 'PER'])

named_entities.explode()

0            Max
0            Max
0         Sendak
1           ##pi
1     Eric Carle
         ...    
98         Harry
98         Harry
98           Ron
98      Hermione
99       Galdone
Name: Description, Length: 254, dtype: object

In [77]:
### At the moment, this is a Series. However, we would like a list. Also, there are duplicates.

updated_named_entities = list(set(named_entities.explode().tolist()))
updated_named_entities[:10]

['Eeyore',
 '##ki T',
 'Tigger',
 'Bilbo Baggins',
 'Harold',
 'Little Bear',
 '##y',
 'Burton',
 'Jess',
 'Bemelmans']

In [83]:
### If you still see some subwords, we can get rid of them using the following code:
### Note: Convert each entry to a string first, in case the list contains missing values (NaN), which are not strings

updated_named_entities = [entity for entity in updated_named_entities if '#' not in str(entity)]
updated_named_entities[:10]

['Eeyore',
 'Tigger',
 'Bilbo Baggins',
 'Harold',
 'Little Bear',
 'Burton',
 'Jess',
 'Bemelmans',
 'Silverstein',
 'Matthew Cuthbert']

In [84]:
len(updated_named_entities)

171
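One caveat about the `str(entity)` conversion: if a description produced no people, `explode()` leaves a NaN for that row (as in the earlier review example), and `str` turns it into the text `'nan'`, which then passes the `'#'` check and stays in the list. Whether that happens here depends on the data, so the sketch below uses a small made-up list (`names` is a stand-in, not the real result) to compare the two filters.

```python
# Stand-in for the de-duplicated list; float('nan') mimics a row without entities.
names = ['Eeyore', '##ki T', float('nan'), 'Tigger', '##y', 'Bilbo Baggins']

# Converting to str avoids a TypeError, but keeps the NaN entry.
with_str = [name for name in names if '#' not in str(name)]

# Keeping only real strings drops both the subwords and the NaN.
only_strings = [name for name in names if isinstance(name, str) and '#' not in name]

print(len(with_str), len(only_strings))  # 4 3
print(only_strings)  # ['Eeyore', 'Tigger', 'Bilbo Baggins']
```

Calling `.dropna()` before `.tolist()`, as in the review example earlier, is another way to get the same result.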