# Hugging Face Transformers

## 0. Read in Data

In [1]:
import pandas as pd

# modify the column width
pd.set_option('display.max_colwidth', None)

# look at a subset of the reviews
df = pd.read_excel('../Data/Popchip_Reviews_Sentiment.xlsx').head(30)
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269


In [2]:
# confirm the number of reviews
df.shape

(30, 7)

## 1. Sentiment Analysis

### a. Simple Example

In [3]:
# sentiment analysis with hugging face
from transformers import pipeline

sentiment_analyzer = pipeline("sentiment-analysis", # set the task to sentiment analysis
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", # specify the default distilbert model
                              device=-1) # use the computer's cpu

text1 = 'When life gives you lemons, make lemonade! 🙂'
text2 = 'A dozen lemons will make a gallon of lemonade.'
text3 = 'I didn\'t like the taste of that lemonade at all.'

Device set to use cpu


In [4]:
sentiment_analyzer(text1)

[{'label': 'POSITIVE', 'score': 0.996239423751831}]

In [5]:
sentiment_analyzer(text2)

[{'label': 'POSITIVE', 'score': 0.7781561613082886}]

In [6]:
sentiment_analyzer(text3)

[{'label': 'NEGATIVE', 'score': 0.9955588579177856}]

### b. Practical Example

In [7]:
# calculate the sentiment scores
sentiment_analyzer = pipeline("sentiment-analysis",
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
                              device=-1,
                              truncation=True) # adding truncation here to truncate text before analyzing sentiment

sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

Device set to use cpu


0    [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2    [{'label': 'NEGATIVE', 'score': 0.6984862685203552}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object

In [8]:
%%time

# add a timer and hide all non-critical warnings
from transformers import pipeline, logging

logging.set_verbosity_error()

sentiment_analyzer = pipeline("sentiment-analysis",
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
                              device=-1,
                              truncation=True)

sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

CPU times: user 1min 59s, sys: 8.91 s, total: 2min 8s
Wall time: 18 s


0    [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2    [{'label': 'NEGATIVE', 'score': 0.6984862685203552}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object

In [9]:
%%time

# use the Mac's Apple silicon GPU (the 'mps' device)
sentiment_analyzer = pipeline("sentiment-analysis",
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
                              device='mps', # update from -1 to mps
                              truncation=True)

sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

CPU times: user 926 ms, sys: 231 ms, total: 1.16 s
Wall time: 1.44 s


0    [{'label': 'POSITIVE', 'score': 0.9935213923454285}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2    [{'label': 'NEGATIVE', 'score': 0.6984829306602478}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object

In [10]:
# extract the label for a single review
sentiment_scores[0][0]['label']

'POSITIVE'

In [11]:
# extract the score for a single review
sentiment_scores[0][0]['score']

0.9935213923454285

In [12]:
# extract the label and score, then build a signed score for all reviews (negative when the label is NEGATIVE)
df['Label_HF'] = sentiment_scores.apply(lambda x: x[0]['label'])
df['Score_HF'] = sentiment_scores.apply(lambda x: x[0]['score'])
df['Sentiment_HF'] = df.apply(lambda row: row['Score_HF'] if row['Label_HF'] == 'POSITIVE' else -row['Score_HF'], axis=1)
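The label-to-sign logic above can also live in one small helper. This is a minimal sketch, not part of the original notebook: `signed_score` is a hypothetical name, and the sample results are copied from the outputs shown earlier.

```python
def signed_score(result):
    """Turn one sentiment result into a score between -1 and 1."""
    top = result[0]  # for a single text the pipeline returns a list holding one dict
    return top["score"] if top["label"] == "POSITIVE" else -top["score"]

# sample results copied from the outputs above
sample_results = [
    [{"label": "POSITIVE", "score": 0.9935213923454285}],
    [{"label": "NEGATIVE", "score": 0.6984829306602478}],
]

signed = [signed_score(r) for r in sample_results]
print(signed)  # [0.9935213923454285, -0.6984829306602478]
```

With this helper, `sentiment_scores.apply(signed_score)` should give the same values as the `Sentiment_HF` column.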

In [13]:
# view the calculations
df[['Rating', 'Text', 'Sentiment_VADER', 'Label_HF', 'Score_HF', 'Sentiment_HF']].head()

Unnamed: 0,Rating,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF
0,5,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521
1,5,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605
2,5,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,NEGATIVE,0.698483,-0.698483
3,3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",0.8689,NEGATIVE,0.999631,-0.999631
4,5,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",0.9613,POSITIVE,0.999181,0.999181


In [14]:
# view the most positive review
df.sort_values('Sentiment_HF', ascending=False).head(1).Text

28    These Pop Chips are incredible. They taste so much better than baked chips and the quantity you get for 2 points is so much more. I buy the variety case and love them all!
Name: Text, dtype: object

In [15]:
# view the most negative review
df.sort_values('Sentiment_HF', ascending=True).head(1).Text

3    These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day.  They were the bomb then, not so much now.  Won't buy again unless I get them for cheap or free.
Name: Text, dtype: object

### c. Speed Up Code

In [16]:
%%time

# no optimizations
from transformers import pipeline

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=-1, # running on CPU
    truncation=True
)

sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

CPU times: user 1min 34s, sys: 11.2 s, total: 1min 45s
Wall time: 14.9 s


0    [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2    [{'label': 'NEGATIVE', 'score': 0.6984862685203552}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object

In [17]:
%%time

# four things to try if you can't use a GPU
from transformers import pipeline

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english", # 1. smaller model
    device=-1, # running on CPU
    truncation=True,
    use_fast=True # 2. faster tokenization
)

import torch
torch.set_num_threads(1)  # 3. limit PyTorch to a single CPU thread (can reduce thread overhead on small inputs)

with torch.no_grad(): # 4. disable gradients
    sentiment_scores = df['Text'].apply(sentiment_analyzer)
sentiment_scores[:5]

CPU times: user 18.7 s, sys: 1.73 s, total: 20.4 s
Wall time: 2.89 s


0    [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2    [{'label': 'NEGATIVE', 'score': 0.6984862685203552}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object
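Another option is to pass several texts per call instead of one text at a time through `.apply`. The sketch below assumes the analyzer accepts a list of strings and returns one dict per text, which is how Hugging Face pipelines behave for list input as far as I know. `score_in_batches` and `fake_analyzer` are hypothetical names, and the stand-in analyzer exists only so the example runs without downloading a model.

```python
def score_in_batches(texts, analyzer, batch_size=8):
    """Call the analyzer on lists of texts rather than on one text at a time."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(analyzer(batch))  # one result per text in the batch
    return results

def fake_analyzer(batch):
    # hypothetical stand-in for a real pipeline: returns one dict per input text
    return [{"label": "POSITIVE", "score": 1.0} for _ in batch]

reviews = [f"review {i}" for i in range(30)]
scores = score_in_batches(reviews, fake_analyzer, batch_size=8)
print(len(scores))  # 30
```

With a real pipeline, `sentiment_analyzer(df['Text'].tolist(), batch_size=8)` is the built-in route. Whether batching helps on CPU depends on the hardware and the text lengths, so it is worth timing both versions.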

## 2. Named Entity Recognition

### a. Simple Example

In [18]:
# view warning options
logging.set_verbosity_warning() # view more warnings
logging.set_verbosity_error() # view fewer warnings

In [19]:
# ner with hugging face
ner_analyzer = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        device=-1,
                        aggregation_strategy='SIMPLE')

text4 = "I ordered an Arnold Palmer at Applebee's in Springfield."

In [20]:
ner_analyzer(text4)

[{'entity_group': 'MISC',
  'score': np.float32(0.9914088),
  'word': 'Arnold Palmer',
  'start': 13,
  'end': 26},
 {'entity_group': 'ORG',
  'score': np.float32(0.9436141),
  'word': "Applebee ' s",
  'start': 30,
  'end': 40},
 {'entity_group': 'LOC',
  'score': np.float32(0.9780036),
  'word': 'Springfield',
  'start': 44,
  'end': 55}]

In [21]:
# try a different model
ner_analyzer2 = pipeline("ner",
                        model="dslim/bert-base-NER",
                        device=-1,
                        aggregation_strategy='SIMPLE')

In [22]:
ner_analyzer2(text4)

[{'entity_group': 'PER',
  'score': np.float32(0.87662256),
  'word': 'Arnold Palmer',
  'start': 13,
  'end': 26},
 {'entity_group': 'ORG',
  'score': np.float32(0.70051384),
  'word': 'Applebee',
  'start': 30,
  'end': 38},
 {'entity_group': 'LOC',
  'score': np.float32(0.628926),
  'word': "' s",
  'start': 38,
  'end': 40},
 {'entity_group': 'LOC',
  'score': np.float32(0.99173564),
  'word': 'Springfield',
  'start': 44,
  'end': 55}]

### b. Practical Example

In [23]:
# find the named entities in each review
ner_analyzer = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        device='mps',
                        aggregation_strategy='SIMPLE')

In [24]:
# apply to one review
ner_analyzer(df.Text[1])

[{'entity_group': 'MISC',
  'score': np.float32(0.9149264),
  'word': 'Salt and Vinegar',
  'start': 99,
  'end': 115},
 {'entity_group': 'MISC',
  'score': np.float32(0.7742393),
  'word': 'Salt and Vinegar',
  'start': 392,
  'end': 408},
 {'entity_group': 'ORG',
  'score': np.float32(0.9589692),
  'word': 'S & V',
  'start': 450,
  'end': 453}]

In [25]:
# extract the words
[entity['word'] for entity in ner_analyzer(df.Text[1])]

['Salt and Vinegar', 'Salt and Vinegar', 'S & V']

In [26]:
# apply to all reviews
df['Named_Entities'] = df['Text'].apply(lambda x: [entity['word'] for entity in ner_analyzer(x)])

In [27]:
# view the named entities
df[['Text', 'Named_Entities']].head()

Unnamed: 0,Text,Named_Entities
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,[]
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.","[Salt and Vinegar, Salt and Vinegar, S & V]"
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",[Amazon]
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",[]
4,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",[]


In [28]:
# create a unique list of named entities
named_entities = list(set(df.Named_Entities.explode().dropna().tolist()))
named_entities[:10]

['Popchips',
 'Ch',
 '##ar',
 'General Mills',
 'Chip',
 '& V',
 'S & V',
 'Watch',
 'Costco',
 "Vinegar Pirate ' s Bo"]

In [29]:
# view the number of named entities found
len(named_entities)

33

In [30]:
# exclude subword fragments (word pieces marked with '#') from the list
named_entities_clean = [entity for entity in named_entities if '#' not in entity]
sorted(named_entities_clean)

['& V',
 'Amazon',
 'B',
 'COSTCO',
 'Ch',
 'Cheetos',
 'Chip',
 'Costco',
 'General Mills',
 'Lays',
 'Miami',
 'PopChips',
 'Popchi',
 'Popchips',
 'Popchips B',
 'Pringles',
 'S',
 'S & V',
 'Salt',
 'Salt and Vinegar',
 'Stop and Shop',
 'VA',
 "Vinegar Pirate ' s Bo",
 'Watch',
 'and',
 'com']

In [31]:
# view the number of named entities left after cleaning
len(named_entities_clean)

26
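The cleaned list still holds case variants ('Costco' and 'COSTCO') and short stubs ('B', 'S'). A minimal sketch of a stricter filter follows. `clean_entities` and the `min_length=3` cutoff are assumptions for illustration, and the cutoff also drops short but plausible entities such as 'VA'.

```python
def clean_entities(entities, min_length=3):
    """Drop word-piece fragments and merge entities that differ only by case."""
    seen = {}
    for entity in entities:
        if "#" in entity or len(entity) < min_length:
            continue  # skip pieces such as '##ar' and stubs such as 'B'
        seen.setdefault(entity.casefold(), entity)  # keep the first spelling seen
    return sorted(seen.values())

# a few entities copied from the lists above
sample = ["Costco", "COSTCO", "Popchips", "PopChips", "##ar", "B", "S", "VA", "Amazon"]
print(clean_entities(sample))  # ['Amazon', 'Costco', 'Popchips']
```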

## 3. Zero-Shot Classification

### a. Simple Example

In [32]:
# zero-shot classification with hugging face
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device=-1)

In [33]:
text1, text4

('When life gives you lemons, make lemonade! 🙂',
 "I ordered an Arnold Palmer at Applebee's in Springfield.")

In [34]:
classifier(text1, candidate_labels = ['quote', 'food & drinks', 'technology'])

{'sequence': 'When life gives you lemons, make lemonade! 🙂',
 'labels': ['quote', 'food & drinks', 'technology'],
 'scores': [0.9833195209503174, 0.011176466010510921, 0.005504013504832983]}

In [35]:
classifier(text4, candidate_labels = ['quote', 'food & drinks', 'technology'])

{'sequence': "I ordered an Arnold Palmer at Applebee's in Springfield.",
 'labels': ['food & drinks', 'quote', 'technology'],
 'scores': [0.5157099962234497, 0.44382426142692566, 0.040465742349624634]}

### b. Practical Example

In [36]:
# remember our topics from the machine learning section: 'order', 'taste & texture', 'good', 'flavor', 'health'
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device='mps')

In [37]:
# try on one review
classifier(df.Text[0], ['order', 'taste & texture', 'good', 'flavor', 'health'])

{'sequence': 'Popchips are the bomb!!  I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip.  My healthy eating program is saved.',
 'labels': ['health', 'good', 'flavor', 'taste & texture', 'order'],
 'scores': [0.2693711817264557,
  0.2510983645915985,
  0.24033670127391815,
  0.2068103849887848,
  0.03238330036401749]}

In [38]:
# try on another review
classifier(df.Text[1], ['order', 'taste & texture', 'good', 'flavor', 'health'])

{'sequence': 'I like the puffed nature of this chip that makes it more unique in the chip market.  I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever.  I have tried the cheddar and regular flavors as well.  The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular.  The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.',
 'labels': ['flavor', 'good', 'taste & texture', 'order', 'health'],
 'scores': [0.34008774161338806,
  0.27486875653266907,
  0.25607454776763916,
  0.11652418226003647,
  0.012444781139492989]}

In [39]:
# extract just the top label
classifier(df.Text[1], ['order', 'taste & texture', 'good', 'flavor', 'health'])['labels'][0]

'flavor'
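Taking the top label alone hides how close the scores can be: for the first review, the top four labels all score between 0.21 and 0.27. A minimal sketch of a guard follows. `top_label` and the 0.3 threshold are assumptions for illustration, and the result dicts are copied from the outputs above with the `sequence` key left out.

```python
def top_label(result, min_score=0.3):
    """Return the best label, or 'uncertain' when its score is below min_score."""
    label, score = result["labels"][0], result["scores"][0]
    return label if score >= min_score else "uncertain"

# classifier outputs copied from the two reviews above
review_0 = {
    "labels": ["health", "good", "flavor", "taste & texture", "order"],
    "scores": [0.2693711817264557, 0.2510983645915985, 0.24033670127391815,
               0.2068103849887848, 0.03238330036401749],
}
review_1 = {
    "labels": ["flavor", "good", "taste & texture", "order", "health"],
    "scores": [0.34008774161338806, 0.27486875653266907, 0.25607454776763916,
               0.11652418226003647, 0.012444781139492989],
}

print(top_label(review_0))  # uncertain
print(top_label(review_1))  # flavor
```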

In [40]:
# apply to all reviews
df['Category'] = df.Text.apply(lambda x: classifier(x, ['order', 'taste & texture', 'good', 'flavor', 'health'])['labels'][0])

In [41]:
# view the category labels
df[['Text', 'Category']].head()

Unnamed: 0,Text,Category
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,health
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",flavor
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",good
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",taste & texture
4,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",taste & texture


## 4. Text Summarization

### a. Simple Example

In [42]:
# text summarization with hugging face
summarizer = pipeline("summarization",
                      model="facebook/bart-large-cnn",
                      device=-1)

text5 = """
            The lemon tree produces a pointed oval yellow fruit. Botanically this is a hesperidium, 
            a modified berry with a tough, leathery rind. The rind is divided into an outer colored layer or zest, 
            which is aromatic with essential oils, and an inner layer of white spongy pith. 
            Inside are multiple carpels arranged as radial segments. The seeds develop inside the carpels. 
            The space inside each segment is a locule filled with juice vesicles. 
            Lemons contain many phytochemicals, including polyphenols, terpenes, and tannins.[3] 
            Their juice contains slightly more citric acid than lime juice (about 47 g/L), 
            nearly twice as much as grapefruit juice, and about five times as much as orange juice.[4]
        """

In [43]:
# try it with the default parameters
summarizer(text5)

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. The rind is divided into an outer colored layer or zest, and an inner layer of white spongy pith. Lemons contain many phytochemicals, including polyphenols, terpenes, and tannins.'}]

In [44]:
# specify the parameters
summarizer(text5, min_length=20, max_length=50)

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. The rind is divided into an outer colored layer or zest, and an inner layer of white spongy pith. Lemons contain many phytochemicals, including poly'}]

In [45]:
# rerun with the same parameters: the output is identical because sampling is off
summarizer(text5, min_length=20, max_length=50)

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. The rind is divided into an outer colored layer or zest, and an inner layer of white spongy pith. Lemons contain many phytochemicals, including poly'}]

In [46]:
# turn on sampling: the output can change from run to run
summarizer(text5, min_length=20, max_length=50, do_sample=True)

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. Lemons contain many phytochemicals, including polyphenols, terpenes, and tannins. Their juice contains slightly more citric acid than lime juice.'}]

In [47]:
# extract just the text portion
summarizer(text5, min_length=20, max_length=50)[0]['summary_text']

'The lemon tree produces a pointed oval yellow fruit. The rind is divided into an outer colored layer or zest, and an inner layer of white spongy pith. Lemons contain many phytochemicals, including poly'
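With `max_length=50` the summary stops mid-word ('including poly'). One way to tidy this is to cut the text back to its last complete sentence. This is a rough sketch: `trim_to_last_sentence` is a hypothetical helper, and it treats every '.', '!' or '?' as a sentence end, so abbreviations and decimals would confuse it.

```python
def trim_to_last_sentence(summary):
    """Cut a summary back to its last complete sentence, if it has one."""
    end = max(summary.rfind(mark) for mark in ".!?")
    return summary[:end + 1] if end != -1 else summary

# summary text copied from the output above
summary = ("The lemon tree produces a pointed oval yellow fruit. "
           "The rind is divided into an outer colored layer or zest, "
           "and an inner layer of white spongy pith. "
           "Lemons contain many phytochemicals, including poly")

trimmed = trim_to_last_sentence(summary)
print(trimmed)
```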

### b. Practical Example

In [48]:
# load pipelines
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device='mps')
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device='mps', truncation=True)

In [49]:
# step 1: summarize reviews
df['Summary'] = df['Text'].apply(lambda x: summarizer(x, min_length=20, max_length=50)[0]['summary_text'])
df[['Text', 'Summary']].head(2)

Unnamed: 0,Text,Summary
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.","I like the puffed nature of this chip that makes it more unique in the chip market. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come"


In [50]:
# step 2: find sentiment scores
sentiment_scores2 = df.Summary.apply(sentiment_analyzer)
sentiment_scores2[:5]

0    [{'label': 'POSITIVE', 'score': 0.9976533055305481}]
1    [{'label': 'POSITIVE', 'score': 0.9991886019706726}]
2    [{'label': 'NEGATIVE', 'score': 0.9929706454277039}]
3    [{'label': 'NEGATIVE', 'score': 0.9985463619232178}]
4    [{'label': 'POSITIVE', 'score': 0.9993218183517456}]
Name: Summary, dtype: object

In [51]:
# extract label and score and create a sentiment score
df['Label_HF2'] = sentiment_scores2.apply(lambda x: x[0]['label'])
df['Score_HF2'] = sentiment_scores2.apply(lambda x: x[0]['score'])
df['Sentiment_HF2'] = df.apply(lambda row: row['Score_HF2'] if row['Label_HF2'] == 'POSITIVE' else -row['Score_HF2'], axis=1)

In [52]:
# view the calculations
df[['Text', 'Label_HF2', 'Score_HF2', 'Sentiment_HF2']].head()

Unnamed: 0,Text,Label_HF2,Score_HF2,Sentiment_HF2
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,POSITIVE,0.997653,0.997653
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",POSITIVE,0.999189,0.999189
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",NEGATIVE,0.992971,-0.992971
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",NEGATIVE,0.998546,-0.998546
4,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",POSITIVE,0.999322,0.999322


In [53]:
# compare the sentiment scores
df[['Text', 'Sentiment_VADER', 'Sentiment_HF', 'Sentiment_HF2']].head()

Unnamed: 0,Text,Sentiment_VADER,Sentiment_HF,Sentiment_HF2
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,0.993521,0.997653
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,0.999605,0.999189
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,-0.698483,-0.992971
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",0.8689,-0.999631,-0.998546
4,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",0.9613,0.999181,0.999322
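To see where the methods disagree, it is enough to compare the signs of the scores. The sketch below uses the five rows shown above as plain tuples. `same_sign` is a hypothetical helper, and a zero score counts as non-negative here.

```python
# (Sentiment_VADER, Sentiment_HF, Sentiment_HF2) for the five rows above
rows = [
    (0.9244, 0.993521, 0.997653),
    (0.7269, 0.999605, 0.999189),
    (0.9790, -0.698483, -0.992971),
    (0.8689, -0.999631, -0.998546),
    (0.9613, 0.999181, 0.999322),
]

def same_sign(a, b):
    """True when both scores fall on the same side of zero."""
    return (a >= 0) == (b >= 0)

# row positions where VADER and the Hugging Face model point in opposite directions
disagreements = [i for i, (vader, hf, _) in enumerate(rows) if not same_sign(vader, hf)]
print(disagreements)  # [2, 3]
```

Rows 2 and 3 are the ones to read by hand: VADER scores both as positive, while the Hugging Face model scores them as negative on both the full text and the summary.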


## 5. Text Generation

In [54]:
# text generation with hugging face
generator = pipeline("text-generation", model="gpt2", max_length=20, device=-1)

prompt = "On a hot summer day, I love to drink cold lemonade because"

In [55]:
# set general parameters
generator(prompt, max_length=50, num_return_sequences=1, do_sample=False)

[{'generated_text': "On a hot summer day, I love to drink cold lemonade because it's so refreshing. I also love to drink cold lemonade because it's so refreshing.\n\nI love to drink cold lemonade because it's so refreshing. I also"}]

In [56]:
# get a more random output
generator(prompt, max_length=50, num_return_sequences=1, do_sample=True)

[{'generated_text': 'On a hot summer day, I love to drink cold lemonade because it makes my mind more flexible. It makes my brain feel better, that I think better of myself, and keeps my eye on my job and my goal of getting an OK.'}]

In [57]:
# run it again: with sampling on, the output differs from the previous run
generator(prompt, max_length=50, num_return_sequences=1, do_sample=True)

[{'generated_text': 'On a hot summer day, I love to drink cold lemonade because it gives me a better flow but only as warm as you want, just like I thought I would. I just had to mix my milk and butter and pour in it with a'}]
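The difference between `do_sample=False` and `do_sample=True` can be shown on a toy next-word distribution. This is a conceptual sketch only: the probabilities are made up, the function names are hypothetical, and real generation repeats this choice token by token over the model's full vocabulary.

```python
import random

# made-up next-word probabilities, for illustration only
next_word_probs = {"it's": 0.5, "it": 0.3, "of": 0.2}

def pick_greedy(probs):
    """Always take the most likely word (what do_sample=False amounts to)."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """Draw a word in proportion to its probability (what do_sample=True amounts to)."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]

rng = random.Random(0)  # a fixed seed makes the sampled draws repeatable
print(pick_greedy(next_word_probs))  # it's
print([pick_sampled(next_word_probs, rng) for _ in range(5)])
```

Greedy picking always returns the same word, which is one reason the `do_sample=False` output above falls into a loop. Sampling varies between runs unless the random seed is fixed.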

## 6. Document Similarity with Embeddings

### a. Simple Example

In [58]:
# feature extraction with hugging face
feature_extractor = pipeline("feature-extraction",
                             model="sentence-transformers/all-MiniLM-L6-v2",
                             device=-1)

In [59]:
# view the text
text1

'When life gives you lemons, make lemonade! 🙂'

In [60]:
# view the first 10 values of the embedding ([0][0] selects the first text, then its first token)
feature_extractor(text1)[0][0][:10]

[-0.2936323285102844,
 0.20775198936462402,
 0.11103478074073792,
 0.14668866991996765,
 0.39885425567626953,
 0.31434932351112366,
 0.4152563214302063,
 -0.19369427859783173,
 0.11604061722755432,
 -0.885179340839386]

In [61]:
# view the embedding length (number of dimensions)
len(feature_extractor(text1)[0][0])

384
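The `[0][0]` index keeps only the first token's vector. To my understanding, sentence-transformers models such as this one are normally used with mean pooling, which averages the vectors of all tokens. A minimal sketch on toy data follows. `mean_pool` is a hypothetical helper, and the 3 x 4 input stands in for the real (tokens x 384) output.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n_tokens = len(token_vectors)
    return [sum(values) / n_tokens for values in zip(*token_vectors)]

# toy stand-in: 3 tokens, 4 dimensions each
toy_tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 4.0],
    [2.0, 4.0, 1.0, 4.0],
]

sentence_vector = mean_pool(toy_tokens)
print(sentence_vector)  # [2.0, 2.0, 1.0, 4.0]
```

On the real output this would be `mean_pool(feature_extractor(text1)[0])`, assuming the pipeline returns nested lists shaped (1, tokens, 384). The similarity scores would then differ from the first-token results shown in this notebook.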

### b. Practical Example

#### Step 1: Extract Embeddings

In [62]:
# modify the column width
pd.set_option('display.max_colwidth', 50)

# read in the movies data
movies = pd.read_csv('../Data/movie_reviews.csv')
movies.head(2)

Unnamed: 0,movie_title,rating,genre,in_theaters_date,movie_info,directors,director_gender,tomatometer_rating,audience_rating,critics_consensus
0,A Dog's Journey,PG,"Drama, Kids & Family",5/17/19,Bailey (voiced again by Josh Gad) is living th...,Gail Mancuso,female,50,92,A Dog's Journey is as sentimental as one might...
1,A Dog's Way Home,PG,Drama,1/11/19,"Separated from her owner, a dog sets off on an...",Charles Martin Smith,male,60,71,A Dog's Way Home may not quite be a family-fri...


In [63]:
# extract the embedding representation for each movie description
feature_extractor = pipeline("feature-extraction",
                             model="sentence-transformers/all-MiniLM-L6-v2",
                             device='mps')

embeddings = movies['movie_info'].apply(lambda x: feature_extractor(x)[0][0])
embeddings.head(2)

0    [-0.3252973258495331, -0.07725150883197784, 0....
1    [0.20780158042907715, -0.06860747933387756, 0....
Name: movie_info, dtype: object

#### Step 2: Specify the Captain Marvel Embedding

In [64]:
# view one movie - Captain Marvel
movies[movies.movie_title == 'Captain Marvel']

Unnamed: 0,movie_title,rating,genre,in_theaters_date,movie_info,directors,director_gender,tomatometer_rating,audience_rating,critics_consensus
25,Captain Marvel,PG-13,"Action & Adventure, Science Fiction & Fantasy",3/8/19,The story follows Carol Danvers as she becomes...,"Anna Boden, Ryan Fleck",female,78,53,"Packed with action, humor, and visual thrills,..."


In [65]:
# save the embedding for that movie
import numpy as np

embedding_cm = np.array(embeddings[25]).reshape(1, -1)
embedding_cm.shape

(1, 384)

#### Step 3: Specify the Embeddings for All Movies

In [66]:
# save the embeddings for all movies
embeddings_movies = np.vstack(embeddings)
embeddings_movies.shape

(166, 384)

#### Step 4: Calculate Cosine Similarity Scores

In [67]:
# calculate the cosine similarity scores
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores_cm = cosine_similarity(embedding_cm, embeddings_movies)
similarity_scores_cm_series = pd.Series(similarity_scores_cm.flatten(), name='similarity_score')
similarity_scores_cm_series.head()

0    0.749577
1    0.684320
2    0.599276
3    0.673823
4    0.724890
Name: similarity_score, dtype: float64
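Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The NumPy sketch below computes that quantity for one query row against every row of a matrix. The function name `cosine_scores` and the toy vectors are assumptions for illustration, and the sketch does not guard against all-zero vectors.

```python
import numpy as np

def cosine_scores(query, matrix):
    """Cosine similarity between one query row (1, d) and every row of matrix (n, d)."""
    query = np.asarray(query, dtype=float).ravel()
    matrix = np.asarray(matrix, dtype=float)
    dots = matrix @ query                                            # dot product per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)   # product of lengths
    return dots / norms

toy_movies = np.array([[1.0, 0.0],     # same direction as the query
                       [0.0, 2.0],     # orthogonal to the query
                       [-3.0, 0.0]])   # opposite direction
toy_query = np.array([[2.0, 0.0]])

scores = cosine_scores(toy_query, toy_movies)
print(scores)  # values 1, 0 and -1
```

The three toy rows cover the range of possible scores: 1 for identical direction, 0 for unrelated, -1 for opposite.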

In [68]:
# combine movie titles, descriptions and scores
similarity_scores_cm_df = pd.concat([movies[['movie_title', 'movie_info']], similarity_scores_cm_series], axis=1)
similarity_scores_cm_df.head()

Unnamed: 0,movie_title,movie_info,similarity_score
0,A Dog's Journey,Bailey (voiced again by Josh Gad) is living th...,0.749577
1,A Dog's Way Home,"Separated from her owner, a dog sets off on an...",0.68432
2,A Tuba to Cuba,The leader of New Orleans' famed Preservation ...,0.599276
3,A Vigilante,"A once abused woman, Sadie (Olivia Wilde), dev...",0.673823
4,After,Based on Anna Todd's best-selling novel which ...,0.72489


In [69]:
# view the 5 highest scores (the first row is Captain Marvel itself)
similarity_scores_cm_df.sort_values('similarity_score', ascending=False).head()

Unnamed: 0,movie_title,movie_info,similarity_score
25,Captain Marvel,The story follows Carol Danvers as she becomes...,1.0
45,Fast & Furious Presents: Hobbs & Shaw,Ever since hulking lawman Hobbs (Dwayne Johnso...,0.819661
18,Avengers: Endgame,The grave course of events set in motion by Th...,0.794008
131,The LEGO Movie 2: The Second Part,The much-anticipated sequel to the critically ...,0.792253
6,Alita: Battle Angel,From visionary filmmakers James Cameron (AVATA...,0.789453
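The top row is always the query movie itself, with a score of 1.0. If only recommendations are wanted, that row can be dropped before sorting. A minimal sketch using a small made-up DataFrame as a stand-in for `similarity_scores_cm_df` (the titles and scores are hypothetical):

```python
import pandas as pd

# made-up stand-in for similarity_scores_cm_df
toy_scores = pd.DataFrame({
    'movie_title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'similarity_score': [0.70, 1.00, 0.82, 0.79],
})
query_index = 1  # row label of the query movie (25 for Captain Marvel in the notebook)

recommendations = (toy_scores
                   .drop(index=query_index)                        # remove the self-match
                   .sort_values('similarity_score', ascending=False)
                   .head(2))

print(recommendations['movie_title'].tolist())  # ['Movie C', 'Movie D']
```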


#### DEMO: Create a function to find the most similar movies

In [70]:
# step 1: specify our feature extraction model
feature_extractor = pipeline('feature-extraction',
                     model='sentence-transformers/all-MiniLM-L6-v2',
                     device='mps') # use the Apple Silicon gpu (set device=-1 to use the cpu)

In [71]:
# step 2: create a movies x embeddings array (166 x 384)
embeddings = movies.movie_info.apply(lambda row: feature_extractor(row)[0][0])
embeddings_movies = np.vstack(embeddings)
embeddings_movies.shape

(166, 384)

In [72]:
# step 3: create a get_similar_movies function with the inputs: embeddings, movie_index, movie_details, top_n
def get_similar_movies(embeddings, movie_index, movie_details, top_n=3):

    # create movie embedding for movie_index
    m_embedding = np.array(embeddings[movie_index]).reshape(1, -1)
    
    # calculate similarity scores
    similarity_scores = cosine_similarity(m_embedding, embeddings)
    similarity_scores_series = pd.Series(similarity_scores.flatten(), name='similarity_score')
    
    # bring in movie info
    movies_similarity_scores_df = pd.concat([movie_details, similarity_scores_series], axis=1)

    # return the movie itself (score 1.0) followed by its top_n most similar movies
    return movies_similarity_scores_df.sort_values('similarity_score', ascending=False).iloc[0:top_n+1]
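One thing to watch in `get_similar_movies`: `pd.concat(..., axis=1)` lines rows up by index label, not by position. That works when `movie_details` has the default 0, 1, 2, ... index, as it appears to here, but not after filtering or re-indexing. The sketch below uses made-up titles and scores to show the mismatch, and one possible way around it by reusing the DataFrame's index for the scores.

```python
import pandas as pd

# non-default index, e.g. what is left after filtering a larger DataFrame
details = pd.DataFrame({'movie_title': ['Movie A', 'Movie B', 'Movie C']},
                       index=[10, 11, 12])
scores = [0.9, 0.8, 0.7]

# the Series gets labels 0, 1, 2, which match none of 10, 11, 12
misaligned = pd.concat([details, pd.Series(scores, name='similarity_score')], axis=1)

# reusing the details index keeps each score next to its movie
aligned = pd.concat(
    [details, pd.Series(scores, index=details.index, name='similarity_score')],
    axis=1)

print(misaligned.shape)  # (6, 2)
print(aligned.shape)     # (3, 2)
```

In the misaligned result every row has a missing value on one side, which is easy to overlook once the output is sorted and truncated.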

In [73]:
# modify the column width
pd.set_option('display.max_colwidth', None)

In [74]:
# find movies similar to Captain Marvel
get_similar_movies(embeddings_movies, 25, movies[['movie_title', 'movie_info']])

Unnamed: 0,movie_title,movie_info,similarity_score
25,Captain Marvel,"The story follows Carol Danvers as she becomes one of the universe's most powerful heroes when Earth is caught in the middle of a galactic war between two alien races. Set in the 1990s, Captain Marvel is an all-new adventure from a previously unseen period in the history of the Marvel Cinematic Universe.",1.0
45,Fast & Furious Presents: Hobbs & Shaw,"Ever since hulking lawman Hobbs (Dwayne Johnson), a loyal agent of America's Diplomatic Security Service, and lawless outcast Shaw (Jason Statham), a former British military elite operative, first faced off in 2015's Furious 7, the duo have swapped smack talk and body blows as they've tried to take each other down. But when cyber-genetically enhanced anarchist Brixton (Idris Elba) gains control of an insidious bio-threat that could alter humanity forever--and bests a brilliant and fearless rogue MI6 agent (The Crown's Vanessa Kirby), who just happens to be Shaw's sister--these two sworn enemies will have to partner up to bring down the only guy who might be badder than themselves.",0.819661
18,Avengers: Endgame,"The grave course of events set in motion by Thanos that wiped out half the universe and fractured the Avengers ranks compels the remaining Avengers to take one final stand in Marvel Studios' grand conclusion to twenty-two films, ""Avengers: Endgame.""",0.794008
131,The LEGO Movie 2: The Second Part,"The much-anticipated sequel to the critically acclaimed, global box office phenomenon that started it all, ""The LEGO (R) Movie 2: The Second Part,"" reunites the heroes of Bricksburg in an all new action-packed adventure to save their beloved city. It's been five years since everything was awesome and the citizens are now facing a huge new threat: LEGO DUPLO (R) invaders from outer space, wrecking everything faster than it can be rebuilt. The battle to defeat the invaders and restore harmony to the LEGO universe will take Emmet (Chris Pratt), Lucy (Elizabeth Banks), Batman (Will Arnett) and their friends to faraway, unexplored worlds, including a strange galaxy where everything is a musical. It will test their courage, creativity and Master Building skills, and reveal just how special they really are.",0.792253


In [75]:
# find movies similar to The LEGO Movie 2
get_similar_movies(embeddings_movies, 131, movies[['movie_title', 'movie_info', 'rating']], top_n=5)

Unnamed: 0,movie_title,movie_info,rating,similarity_score
131,The LEGO Movie 2: The Second Part,"The much-anticipated sequel to the critically acclaimed, global box office phenomenon that started it all, ""The LEGO (R) Movie 2: The Second Part,"" reunites the heroes of Bricksburg in an all new action-packed adventure to save their beloved city. It's been five years since everything was awesome and the citizens are now facing a huge new threat: LEGO DUPLO (R) invaders from outer space, wrecking everything faster than it can be rebuilt. The battle to defeat the invaders and restore harmony to the LEGO universe will take Emmet (Chris Pratt), Lucy (Elizabeth Banks), Batman (Will Arnett) and their friends to faraway, unexplored worlds, including a strange galaxy where everything is a musical. It will test their courage, creativity and Master Building skills, and reveal just how special they really are.",PG,1.0
151,Toy Story 4,"Woody (voice of Tom Hanks) has always been confident about his place in the world, and that his priority is taking care of his kid, whether that's Andy or Bonnie. So when Bonnie's beloved new craft-project-turned-toy, Forky (voice of Tony Hale), declares himself as ""trash"" and not a toy, Woody takes it upon himself to show Forky why he should embrace being a toy. But when Bonnie takes the whole gang on her family's road trip excursion, Woody ends up on an unexpected detour that includes a reunion with his long-lost friend Bo Peep (voice of Annie Potts). After years of being on her own, Bo's adventurous spirit and life on the road belie her delicate porcelain exterior. As Woody and Bo realize they're worlds apart when it comes to life as a toy, they soon come to find that's the least of their worries.",G,0.83106
32,Dolemite Is My Name,"Stung by a string of showbiz failures, floundering comedian Rudy Ray Moore (Academy Award nominee Eddie Murphy) has an epiphany that turns him into a word-of-mouth sensation: step onstage as someone else. Borrowing from the street mythology of 1970s Los Angeles, Moore assumes the persona of Dolemite, a pimp with a cane and an arsenal of obscene fables. However, his ambitions exceed selling bootleg records deemed too racy for mainstream radio stations to play. Moore convinces a social justice-minded dramatist (Keegan-Michael Key) to write his alter ego a film, incorporating kung fu, car chases, and Lady Reed (Da'Vine Joy Randolph), an ex-backup singer who becomes his unexpected comedic foil. Despite clashing with his pretentious director, D'Urville Martin (Wesley Snipes), and countless production hurdles at their studio in the dilapidated Dunbar Hotel, Moore's Dolemite becomes a runaway box office smash and a defining movie of the Blaxploitation era.",R,0.826733
154,Triple Threat,"TRIPLE THREAT, the newest feature from Johnson, is an adrenaline fueled and gritty action thriller starring some of the biggest names in action today. Michael Jai White (BLACK DYNAMITE; UNDISPUTED 2: LAST MAN STANDING), Scott Adkins (Marvel's DOCTOR STRANGE; THE EXPENDABLES 2), Michael Bisping (xXx: RETURN OF XANDER CAGE) star as a group of professional assassins hired to take out a billionaire's daughter who is intent on bringing down a major crime syndicate. A down and out team of mercenaries, played by Tony Jaa (ONG BAK TRILOGY; xXx: RETURN OF XANDER CAGE), Iko Uwais (THE RAID 1 & 2; STAR WARS: THE FORCE AWAKENS) and Tiger Chen (MAN OF TAI CHI), must take on the assassins and stop them before they kill their target. The film co-stars JeeJa Yanin (CHOCOLATE) Michael Wong (Cold War) and Celina Jade (WOLF WARRIOR 2).",R,0.818383
36,Dumbo,"From Disney and visionary director Tim Burton, the all-new grand live-action adventure ""Dumbo"" expands on the beloved classic story where differences are celebrated, family is cherished and dreams take flight. Circus owner Max Medici (Danny DeVito) enlists former star Holt Farrier (Colin Farrell) and his children Milly (Nico Parker) and Joe (Finley Hobbins) to care for a newborn elephant whose oversized ears make him a laughingstock in an already struggling circus. But when they discover that Dumbo can fly, the circus makes an incredible comeback, attracting persuasive entrepreneur V.A. Vandevere (Michael Keaton), who recruits the peculiar pachyderm for his newest, larger-than-life entertainment venture, Dreamland. Dumbo soars to new heights alongside a charming and spectacular aerial artist, Colette Marchant (Eva Green), until Holt learns that beneath its shiny veneer, Dreamland is full of dark secrets.",PG,0.818338
45,Fast & Furious Presents: Hobbs & Shaw,"Ever since hulking lawman Hobbs (Dwayne Johnson), a loyal agent of America's Diplomatic Security Service, and lawless outcast Shaw (Jason Statham), a former British military elite operative, first faced off in 2015's Furious 7, the duo have swapped smack talk and body blows as they've tried to take each other down. But when cyber-genetically enhanced anarchist Brixton (Idris Elba) gains control of an insidious bio-threat that could alter humanity forever--and bests a brilliant and fearless rogue MI6 agent (The Crown's Vanessa Kirby), who just happens to be Shaw's sister--these two sworn enemies will have to partner up to bring down the only guy who might be badder than themselves.",PG-13,0.816315
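The calls above pass the row number (25, 131) by hand. A small lookup helper could find it from the title instead. `find_movie_index` is a hypothetical helper, not part of the notebook, and the toy DataFrame stands in for `movies`; it assumes the default 0, 1, 2, ... index so that the returned label also works as a position in the embeddings array.

```python
import pandas as pd

def find_movie_index(movies_df, title):
    """Return the row label of the first movie whose title matches exactly."""
    matches = movies_df.index[movies_df['movie_title'] == title]
    if len(matches) == 0:
        raise ValueError(f'No movie titled {title!r}')
    return matches[0]

# made-up stand-in for the movies DataFrame
toy_movies = pd.DataFrame({'movie_title': ['Movie A', 'Movie B', 'Movie C']})

print(find_movie_index(toy_movies, 'Movie B'))  # 1
```

With the real data, the result could be passed as `movie_index` to `get_similar_movies`.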
