In [1]:
from pathlib import Path
import numpy as np

## Sentiment analysis

Sentiment analysis is a natural language processing task that involves determining the sentiment or emotional tone expressed in a piece of text. The goal is to classify the text as positive, negative, or neutral, providing insights into the overall sentiment conveyed by the content. This task is widely used in various applications, such as social media monitoring, customer feedback analysis, and opinion mining.

In [2]:
! ls

1_image_classification.ipynb  3_finetune_llms.ipynb	  aclImdb
2_into_to_huggingface.ipynb   4_image_segmentation.ipynb  aclImdb_v1.tar.gz


To get the sentiment data: <br>
`wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

To unpack the archive, run: <br>
`! tar -xvf aclImdb_v1.tar.gz`

This is a dataset for binary sentiment classification (positive or negative). It consists of 50,000 movie reviews, split into 25,000 for training and 25,000 for testing.

In [3]:
! ls aclImdb

imdbEr.txt  imdb.vocab	README	test  train


In [4]:
def read_file(path):
    # Use a context manager so the file handle is closed after reading
    with open(path, "r") as f:
        return f.read()

In [5]:
test_path = Path("aclImdb/test")
# One file per review, grouped into pos/ and neg/ subdirectories
pos_paths = list((test_path/"pos").iterdir())
neg_paths = list((test_path/"neg").iterdir())

In [6]:
read_file(pos_paths[0])

"Previous reviewer Claudio Carvalho gave a much better recap of the film's plot details than I could. What I recall mostly is that it was just so beautiful, in every sense - emotionally, visually, editorially - just gorgeous.<br /><br />If you like movies that are wonderful to look at, and also have emotional content to which that beauty is relevant, I think you will be glad to have seen this extraordinary and unusual work of art.<br /><br />On a scale of 1 to 10, I'd give it about an 8.75. The only reason I shy away from 9 is that it is a mood piece. If you are in the mood for a really artistic, very romantic film, then it's a 10. I definitely think it's a must-see, but none of us can be in that mood all the time, so, overall, 8.75."

In [7]:
data1 = [read_file(pos_paths[i]) for i in range(10)]

In [8]:
data2 = [read_file(neg_paths[i]) for i in range(10)]
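The two cells above keep positive and negative reviews in separate lists. A minimal sketch, assuming the same `pos`/`neg` directory layout, that keeps each review together with its label instead. The function name `load_reviews`, the `n_per_class` parameter, and the `POSITIVE`/`NEGATIVE` label strings are choices made for this illustration, not part of the dataset.

```python
from pathlib import Path


def load_reviews(split_dir, n_per_class=10):
    """Return (text, label) pairs read from <split_dir>/pos and <split_dir>/neg."""
    split_dir = Path(split_dir)
    examples = []
    for label, sub in (("POSITIVE", "pos"), ("NEGATIVE", "neg")):
        # sorted() makes the selection reproducible; iterdir() order is arbitrary
        paths = sorted((split_dir / sub).glob("*.txt"))[:n_per_class]
        for path in paths:
            examples.append((path.read_text(), label))
    return examples
```

For example, `load_reviews("aclImdb/test")` would return up to 20 labeled reviews. Because the files are sorted, the selection may differ from the one produced by `iterdir()` above.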

Here is a [link](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=sentiment) to models that are useful for sentiment analysis.

In [11]:
# ! pip install transformers
from transformers import pipeline

In [12]:
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]

In [14]:
sentiment_pipeline(data1, max_length=512, truncation=True)

[{'label': 'POSITIVE', 'score': 0.9997010827064514},
 {'label': 'POSITIVE', 'score': 0.9957257509231567},
 {'label': 'POSITIVE', 'score': 0.9472860097885132},
 {'label': 'POSITIVE', 'score': 0.8518150448799133},
 {'label': 'POSITIVE', 'score': 0.9996123909950256},
 {'label': 'POSITIVE', 'score': 0.9388156533241272},
 {'label': 'POSITIVE', 'score': 0.9882730841636658},
 {'label': 'POSITIVE', 'score': 0.9941344857215881},
 {'label': 'POSITIVE', 'score': 0.6735564470291138},
 {'label': 'POSITIVE', 'score': 0.952999472618103}]

In [15]:
sentiment_pipeline(data2, max_length=512, truncation=True)

[{'label': 'NEGATIVE', 'score': 0.999616265296936},
 {'label': 'NEGATIVE', 'score': 0.6170626282691956},
 {'label': 'NEGATIVE', 'score': 0.9997100234031677},
 {'label': 'NEGATIVE', 'score': 0.995756208896637},
 {'label': 'POSITIVE', 'score': 0.996307373046875},
 {'label': 'NEGATIVE', 'score': 0.9966711401939392},
 {'label': 'NEGATIVE', 'score': 0.958416759967804},
 {'label': 'NEGATIVE', 'score': 0.9994093179702759},
 {'label': 'NEGATIVE', 'score': 0.9996923208236694},
 {'label': 'NEGATIVE', 'score': 0.99977046251297}]
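Since every review in `data2` comes from the `neg` folder, the output above can be scored directly. A small sketch, with `label_accuracy` as a helper name invented for this example; the labels are copied from the output above with the scores dropped.

```python
def label_accuracy(predictions, expected_label):
    """Fraction of pipeline outputs whose 'label' equals expected_label."""
    if not predictions:
        raise ValueError("predictions is empty")
    hits = sum(1 for p in predictions if p["label"] == expected_label)
    return hits / len(predictions)


# Labels copied from the negative-review output above (scores omitted).
neg_labels = ["NEGATIVE"] * 4 + ["POSITIVE"] + ["NEGATIVE"] * 5
neg_predictions = [{"label": lab} for lab in neg_labels]
print(label_accuracy(neg_predictions, "NEGATIVE"))  # 0.9
```

Ten reviews are far too few to rank models; the same helper can be applied to a larger sample, e.g. `label_accuracy(sentiment_pipeline(data2, max_length=512, truncation=True), "NEGATIVE")`.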

Here we specify the model explicitly instead of relying on the pipeline default, as the warning above recommends.

In [28]:
model_path = "siebert/sentiment-roberta-large-english"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)

Device set to use cuda:0


In [31]:
sentiment_task(data1, max_length=512, truncation=True)

[{'label': 'POSITIVE', 'score': 0.998914361000061},
 {'label': 'POSITIVE', 'score': 0.9988763928413391},
 {'label': 'POSITIVE', 'score': 0.9987799525260925},
 {'label': 'POSITIVE', 'score': 0.9989105463027954},
 {'label': 'POSITIVE', 'score': 0.9988929629325867},
 {'label': 'POSITIVE', 'score': 0.9843221306800842},
 {'label': 'POSITIVE', 'score': 0.9988069534301758},
 {'label': 'POSITIVE', 'score': 0.9989105463027954},
 {'label': 'POSITIVE', 'score': 0.9988324046134949},
 {'label': 'POSITIVE', 'score': 0.9989055395126343}]

In [32]:
sentiment_task(data2, max_length=512, truncation=True)

[{'label': 'NEGATIVE', 'score': 0.9995086193084717},
 {'label': 'POSITIVE', 'score': 0.9988320469856262},
 {'label': 'NEGATIVE', 'score': 0.9995156526565552},
 {'label': 'NEGATIVE', 'score': 0.9995115995407104},
 {'label': 'POSITIVE', 'score': 0.998916506767273},
 {'label': 'NEGATIVE', 'score': 0.9995063543319702},
 {'label': 'NEGATIVE', 'score': 0.9991637468338013},
 {'label': 'NEGATIVE', 'score': 0.9995145797729492},
 {'label': 'NEGATIVE', 'score': 0.999512791633606},
 {'label': 'NEGATIVE', 'score': 0.9995163679122925}]

## Text summarization
Text summarization is a natural language processing task aimed at condensing the content of a given text while retaining its key information and meaning. This task is crucial for information retrieval, document organization, and content consumption in various applications.

You can download a summarization dataset from [here](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail).

In [38]:
import pandas as pd
df = pd.read_csv("cnn_dailymail/train.csv") 

In [39]:
df.head()

|   | id | article | highlights |
|---|---|---|---|
| 0 | 0001d1afc246a7964130f43ae940af6bc6c57f01 | By . Associated Press . PUBLISHED: . 14:11 EST... | Bishop John Folda, of North Dakota, is taking ... |
| 1 | 0002095e55fcbd3a2f366d9bf92a95433dc305ef | (CNN) -- Ralph Mata was an internal affairs li... | Criminal complaint: Cop used his role to help ... |
| 2 | 00027e965c8264c35cc1bc55556db388da82b07f | A drunk driver who killed a young woman in a h... | Craig Eccleston-Todd, 27, had drunk at least t... |
| 3 | 0002c17436637c4fe1837c935c04de47adb18e9a | (CNN) -- With a breezy sweep of his pen Presid... | Nina dos Santos says Europe must be ready to a... |
| 4 | 0003ad6ef0c37534f80b55b4235108024b407f0b | Fleetwood are the only team still to have a 10... | Fleetwood top of League One after 2-0 win at S... |


In [40]:
df.iloc[0].article

"By . Associated Press . PUBLISHED: . 14:11 EST, 25 October 2013 . | . UPDATED: . 15:36 EST, 25 October 2013 . The bishop of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. Bishop John Folda (pictured) of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A . State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure. The diocese announced on Monday that Bishop John Folda is taking time off after being diagnosed with hepatitis A. The diocese says he contracted the infection through contaminated food while attending a conference for newly ordained 

In [41]:
df.iloc[0].highlights

'Bishop John Folda, of North Dakota, is taking time off after being diagnosed .\nHe contracted the infection through contaminated food in Italy .\nChurch members in Fargo, Grand Forks and Jamestown could have been exposed .'

In [42]:
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")

input_text = df.iloc[0].article

# Generate summary
summary = summarizer(input_text, max_length=150, min_length=50, length_penalty=2.0,
                     num_beams=4, early_stopping=True, truncation=True)

# Print the generated summary
print(summary[0]['summary_text'])

Device set to use cuda:0
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


the bishop of the fargo Catholic diocese in north . Dakota has exposed potentially hundreds of church members to the hepatitis . A virus in late September and early . October . the risk is low, but officials feel it's important to alert people to the possible exposure .


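### A different model
Let's try the `google/flan-t5-base` model. The summary looks much better: it keeps the original capitalization and reads as complete sentences, unlike the `t5-base` output above.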

In [43]:
summarizer = pipeline("summarization", model="google/flan-t5-base",
                      tokenizer="google/flan-t5-base")

input_text = df.iloc[0].article

# Generate summary
summary = summarizer(input_text, max_length=150, min_length=50, length_penalty=2.0,
                     num_beams=4, early_stopping=True, truncation=True)

# Print the generated summary
print(summary[0]['summary_text'])

Device set to use cuda:0


Bishop John Folda of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure.
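Judging summaries by eye only goes so far. One very crude way to put a number on it is to check how many words of the reference `highlights` show up in a generated summary. The sketch below is an illustration only: `unigram_recall` is a name made up here, and the measure is a simple stand-in for proper metrics such as ROUGE, not an implementation of them.

```python
import re


def _words(text):
    """Lowercased set of word tokens in text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def unigram_recall(candidate, reference):
    """Share of distinct reference words that also appear in the candidate."""
    ref_words = _words(reference)
    if not ref_words:
        return 0.0
    return len(ref_words & _words(candidate)) / len(ref_words)


# Toy strings for illustration; not actual model output.
reference = "Bishop John Folda is taking time off"
candidate = "Bishop John Folda has exposed church members"
print(unigram_recall(candidate, reference))
```

With the notebook's variables, this could be called as `unigram_recall(summary[0]['summary_text'], df.iloc[0].highlights)` after each summarizer run. Word overlap rewards copying from the article and ignores fluency, so it should be read alongside the text, not instead of it.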


## Zero-shot classification
Zero-shot classification is a natural language processing task in which a model assigns text to classes it was never explicitly trained to predict. The method relies on a pre-trained language model and can be seen as a form of transfer learning: a model trained for one task is applied to a different one. It is particularly useful when labeled data is scarce. <br>
In zero-shot classification, the model receives the text to classify together with a natural-language description of the task, such as a list of candidate labels, and no worked examples. This differs from one-shot or few-shot classification, where the prompt includes one or a few examples of the task.
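One common way to do this, and the one used with models trained on natural language inference (NLI), is to turn each candidate label into a hypothesis sentence, ask the model how strongly the text entails it, and normalize the entailment scores across labels. The sketch below shows only that bookkeeping. The logits are made-up numbers rather than model output, the function names are invented for this example, and the template string is assumed to match the default used by the Hugging Face zero-shot pipeline.

```python
import numpy as np


def labels_to_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by probability."""
    logits = np.asarray(entailment_logits, dtype=float)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()
    order = np.argsort(-probs)
    return [(labels[i], float(probs[i])) for i in order]


labels = ["politics", "sports", "cooking"]
print(labels_to_hypotheses(labels))

# Hypothetical entailment logits, one per label (not real model output).
made_up_logits = [2.0, -1.0, 0.5]
for label, prob in rank_labels(labels, made_up_logits):
    print(f"{label}: {prob:.3f}")
```

Because the scores are normalized across the candidate labels, they always sum to one, even when none of the labels fits the text well.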

Let's consider this news category dataset: https://www.kaggle.com/datasets/rmisra/news-category-dataset

In [47]:
df = pd.read_json('News_Category_Dataset_v3.json', lines=True)

In [48]:
df.head()

|   | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |


In [49]:
df.category.unique()

array(['U.S. NEWS', 'COMEDY', 'PARENTING', 'WORLD NEWS', 'CULTURE & ARTS',
       'TECH', 'SPORTS', 'ENTERTAINMENT', 'POLITICS', 'WEIRD NEWS',
       'ENVIRONMENT', 'EDUCATION', 'CRIME', 'SCIENCE', 'WELLNESS',
       'BUSINESS', 'STYLE & BEAUTY', 'FOOD & DRINK', 'MEDIA',
       'QUEER VOICES', 'HOME & LIVING', 'WOMEN', 'BLACK VOICES', 'TRAVEL',
       'MONEY', 'RELIGION', 'LATINO VOICES', 'IMPACT', 'WEDDINGS',
       'COLLEGE', 'PARENTS', 'ARTS & CULTURE', 'STYLE', 'GREEN', 'TASTE',
       'HEALTHY LIVING', 'THE WORLDPOST', 'GOOD NEWS', 'WORLDPOST',
       'FIFTY', 'ARTS', 'DIVORCE'], dtype=object)

In [50]:
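# Name the task explicitly instead of letting it be inferred from the model
pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")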

Device set to use cuda:0


In [52]:
cats = ['U.S. NEWS', 'COMEDY', 'PARENTING', 'WORLD NEWS']
cats = [x.lower() for x in cats]
cats

['u.s. news', 'comedy', 'parenting', 'world news']

In [53]:
i = 0
text = df.iloc[i].short_description
text

'Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.'

In [54]:
results = pipe(text, candidate_labels=cats)
results

{'sequence': 'Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.',
 'labels': ['u.s. news', 'world news', 'parenting', 'comedy'],
 'scores': [0.5229672789573669,
  0.39395207166671753,
  0.05158837139606476,
  0.03149222582578659]}

In [55]:
for i in range(10):
    text = df.iloc[i].short_description
    results = pipe(text, candidate_labels=cats)
    print("actual:", df.iloc[i].category, "predicted:", results['labels'][0])
    print(results)

actual: U.S. NEWS predicted: u.s. news
{'sequence': 'Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall.', 'labels': ['u.s. news', 'world news', 'parenting', 'comedy'], 'scores': [0.5229672789573669, 0.39395207166671753, 0.05158837139606476, 0.03149222582578659]}
actual: U.S. NEWS predicted: u.s. news
{'sequence': "He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles.", 'labels': ['u.s. news', 'world news', 'parenting', 'comedy'], 'scores': [0.5980039834976196, 0.26056456565856934, 0.10371331870555878, 0.03771822899580002]}
actual: COMEDY predicted: parenting
{'sequence': '"Until you have a dog you don\'t understand what could be eaten."', 'labels': ['parenting', 'comedy', 'u.s. news', 'world news'], 'scores': [0.5189522504806519, 0.28340235352516174, 0.1624489426612854, 0.0351964

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


actual: CULTURE & ARTS predicted: u.s. news
{'sequence': 'In "Mija," director Isabel Castro combined music documentaries with the style of "Euphoria" and "Clueless" to tell a more nuanced immigration story.', 'labels': ['u.s. news', 'comedy', 'world news', 'parenting'], 'scores': [0.5727845430374146, 0.17682603001594543, 0.12701761722564697, 0.12337180972099304]}
actual: WORLD NEWS predicted: world news
{'sequence': "White House officials say the crux of the president's visit to the U.N. this year will be a full-throated condemnation of Russia and its brutal war.", 'labels': ['world news', 'u.s. news', 'parenting', 'comedy'], 'scores': [0.5133635997772217, 0.4231444001197815, 0.04503404721617699, 0.018457990139722824]}
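The loop above also scores rows such as `CULTURE & ARTS`, whose true category is not among the four candidate labels, so those rows can never be predicted correctly. A minimal sketch of a fairer tally, which keeps only rows whose category is one of the candidates and passes all texts in a single call. `zero_shot_accuracy` and its `classify` argument are names made up for this example; `classify` is assumed to follow the calling convention of `pipe` used above.

```python
def zero_shot_accuracy(classify, texts, true_labels, candidate_labels):
    """Return (accuracy, n_scored) over rows whose true label is a candidate."""
    candidates = set(candidate_labels)
    # Lowercase the dataset labels to match the lowercased candidate labels
    kept = [(t, y.lower()) for t, y in zip(texts, true_labels)
            if y.lower() in candidates]
    if not kept:
        return None, 0
    results = classify([t for t, _ in kept],
                       candidate_labels=list(candidate_labels))
    if isinstance(results, dict):
        # Guard for a single result returned as a bare dict
        results = [results]
    # results[i]["labels"] is sorted by score, so index 0 is the prediction
    hits = sum(r["labels"][0] == y for r, (_, y) in zip(results, kept))
    return hits / len(kept), len(kept)
```

With the notebook's variables this could be called as `zero_shot_accuracy(pipe, df.short_description[:100].tolist(), df.category[:100].tolist(), cats)`. Reporting the number of scored rows alongside the accuracy makes it clear how many examples were dropped by the filter.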
