In [2]:
from pathlib import Path
import numpy as np

## Sentiment analysis

Sentiment analysis is a natural language processing task that involves determining the sentiment or emotional tone expressed in a piece of text. The goal is to classify the text as positive, negative, or neutral, providing insights into the overall sentiment conveyed by the content. This task is widely used in various applications, such as social media monitoring, customer feedback analysis, and opinion mining.

To get the sentiment data: <br>
`wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

To unzip run: <br>
`! tar -xvf aclImdb_v1.tar.gz`

This is a dataset for binary sentiment classification (positive or negative). It contains 25,000 movie reviews for training and 25,000 for testing.

In [3]:
! ls ~/data/aclImdb

README     imdb.vocab imdbEr.txt test       train


In [4]:
def read_file(path):
    # Use a context manager so the file handle is closed after reading.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

In [5]:
test_path = Path("~") / "data/aclImdb/test"
test_path = test_path.expanduser()
pos_paths = [f for f in (test_path/"pos").iterdir()]
neg_paths = [f for f in (test_path/"neg").iterdir()]

In [6]:
read_file(pos_paths[0])

'Based on an actual story, John Boorman shows the struggle of an American doctor, whose husband and son were murdered and she was continually plagued with her loss. A holiday to Burma with her sister seemed like a good idea to get away from it all, but when her passport was stolen in Rangoon, she could not leave the country with her sister, and was forced to stay back until she could get I.D. papers from the American embassy. To fill in a day before she could fly out, she took a trip into the countryside with a tour guide. "I tried finding something in those stone statues, but nothing stirred in me. I was stone myself." <br /><br />Suddenly all hell broke loose and she was caught in a political revolt. Just when it looked like she had escaped and safely boarded a train, she saw her tour guide get beaten and shot. In a split second she decided to jump from the moving train and try to rescue him, with no thought of herself. Continually her life was in danger. <br /><br />Here is a woman 

In [7]:
data1 = [read_file(pos_paths[i]) for i in range(10)]

In [8]:
data2 = [read_file(neg_paths[i]) for i in range(10)]
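
The cells above read positive and negative reviews into separate lists. A minimal sketch of a helper that returns `(text, label)` pairs instead is shown below. The name `load_labeled_reviews` and its `limit` parameter are hypothetical, and the sketch assumes the reviews are stored as `.txt` files under `pos/` and `neg/`. Sorting the paths makes the selection reproducible, since `iterdir()` does not guarantee any order.

```python
from pathlib import Path


def load_labeled_reviews(split_dir, limit=None):
    """Return (text, label) pairs read from split_dir/pos and split_dir/neg.

    Hypothetical helper: assumes each review is a .txt file.
    """
    examples = []
    for label in ("pos", "neg"):
        # Sort so the same files are picked on every run.
        paths = sorted((Path(split_dir) / label).glob("*.txt"))
        if limit is not None:
            paths = paths[:limit]
        for path in paths:
            examples.append((path.read_text(encoding="utf-8"), label))
    return examples
```

For example, `load_labeled_reviews(test_path, limit=10)` would return up to ten positive reviews followed by up to ten negative ones.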

Here is a [link](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=sentiment) to models that are useful for sentiment analysis.

In [9]:
from transformers import pipeline

In [10]:
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]

The default model accepts at most 512 tokens. Without `max_length`, reviews longer than that raise an error.

In [45]:
sentiment_pipeline(data1, max_length=512)

[{'label': 'POSITIVE', 'score': 0.9971122741699219},
 {'label': 'POSITIVE', 'score': 0.9975005984306335},
 {'label': 'POSITIVE', 'score': 0.9931706190109253},
 {'label': 'POSITIVE', 'score': 0.9990983009338379},
 {'label': 'POSITIVE', 'score': 0.9972705245018005},
 {'label': 'POSITIVE', 'score': 0.996355414390564},
 {'label': 'POSITIVE', 'score': 0.9971224665641785},
 {'label': 'POSITIVE', 'score': 0.9996683597564697},
 {'label': 'POSITIVE', 'score': 0.9987655878067017},
 {'label': 'POSITIVE', 'score': 0.9910690784454346}]

In [46]:
sentiment_pipeline(data2, max_length=512)

[{'label': 'POSITIVE', 'score': 0.993617057800293},
 {'label': 'NEGATIVE', 'score': 0.994662880897522},
 {'label': 'NEGATIVE', 'score': 0.9976724982261658},
 {'label': 'NEGATIVE', 'score': 0.9993439316749573},
 {'label': 'NEGATIVE', 'score': 0.9994088411331177},
 {'label': 'NEGATIVE', 'score': 0.9961174726486206},
 {'label': 'NEGATIVE', 'score': 0.9991970658302307},
 {'label': 'POSITIVE', 'score': 0.9972756505012512},
 {'label': 'NEGATIVE', 'score': 0.9799123406410217},
 {'label': 'NEGATIVE', 'score': 0.9983288645744324}]
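
Since every review in a list comes from the same folder, the expected label is known, and the predictions can be scored. The sketch below is one way to do that. The function name `accuracy` is hypothetical, and the labels are copied from the negative-review output above with the scores left out.

```python
def accuracy(predictions, expected_label):
    """Fraction of pipeline predictions whose label matches expected_label."""
    if not predictions:
        raise ValueError("predictions is empty")
    # Compare case-insensitively, since models differ in label casing.
    hits = sum(p["label"].lower() == expected_label.lower() for p in predictions)
    return hits / len(predictions)


# Labels copied from the negative-review output above (scores omitted).
neg_labels = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE",
              "NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
neg_predictions = [{"label": label} for label in neg_labels]

print(accuracy(neg_predictions, "NEGATIVE"))  # 0.8
```

Ten reviews per class is far too small a sample to rank models, so treat the number as a sanity check only.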

Here we specify the model explicitly. Unlike the default model, this one can also predict a `neutral` label.

In [47]:
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"

sentiment_task = pipeline("sentiment-analysis", model=model_name)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [50]:
sentiment_task(data1, max_length=512)

[{'label': 'negative', 'score': 0.546417772769928},
 {'label': 'positive', 'score': 0.779367208480835},
 {'label': 'positive', 'score': 0.9537304639816284},
 {'label': 'positive', 'score': 0.9705942273139954},
 {'label': 'positive', 'score': 0.8485249876976013},
 {'label': 'positive', 'score': 0.7092108726501465},
 {'label': 'positive', 'score': 0.6795801520347595},
 {'label': 'positive', 'score': 0.682655394077301},
 {'label': 'negative', 'score': 0.49398666620254517},
 {'label': 'positive', 'score': 0.8577772974967957}]

In [51]:
sentiment_task(data2, max_length=512)

[{'label': 'positive', 'score': 0.9468842148780823},
 {'label': 'negative', 'score': 0.8466248512268066},
 {'label': 'negative', 'score': 0.7134288549423218},
 {'label': 'negative', 'score': 0.6122548580169678},
 {'label': 'negative', 'score': 0.40970394015312195},
 {'label': 'neutral', 'score': 0.7669347524642944},
 {'label': 'negative', 'score': 0.7475892901420593},
 {'label': 'negative', 'score': 0.3856699764728546},
 {'label': 'negative', 'score': 0.6014850735664368},
 {'label': 'negative', 'score': 0.8043235540390015}]
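
Two things stand out in this output: a `neutral` prediction, and winning scores below 0.5. Assuming the score is a probability over three classes, the top class only needs to beat the other two, so it can win with a score just above one third. The sketch below tallies the labels and flags low-confidence predictions. The function name `summarize_predictions` and the 0.5 threshold are illustrative choices, and the scores are copied from the output above, rounded to four decimal places.

```python
from collections import Counter

# (label, score) pairs copied from the output above, scores rounded.
roberta_neg = [
    ("positive", 0.9469), ("negative", 0.8466), ("negative", 0.7134),
    ("negative", 0.6123), ("negative", 0.4097), ("neutral", 0.7669),
    ("negative", 0.7476), ("negative", 0.3857), ("negative", 0.6015),
    ("negative", 0.8043),
]
predictions = [{"label": label, "score": score} for label, score in roberta_neg]


def summarize_predictions(predictions, threshold=0.5):
    """Return label counts and the indices of predictions scoring below threshold."""
    counts = Counter(p["label"].lower() for p in predictions)
    low_confidence = [i for i, p in enumerate(predictions) if p["score"] < threshold]
    return counts, low_confidence


counts, low_confidence = summarize_predictions(predictions)
print(dict(counts))    # {'positive': 1, 'negative': 8, 'neutral': 1}
print(low_confidence)  # [4, 7]
```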

## Text summarization
Text summarization is a natural language processing task aimed at condensing the content of a given text while retaining its key information and meaning. This task is crucial for information retrieval, document organization, and content consumption in various applications.

You can download a summarization dataset from [here](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail).

In [11]:
import pandas as pd
df = pd.read_csv("~/data/cnn_dailymail/train.csv") 

In [12]:
df.head()

| | id | article | highlights |
|---|---|---|---|
| 0 | 0001d1afc246a7964130f43ae940af6bc6c57f01 | By . Associated Press . PUBLISHED: . 14:11 EST... | Bishop John Folda, of North Dakota, is taking ... |
| 1 | 0002095e55fcbd3a2f366d9bf92a95433dc305ef | (CNN) -- Ralph Mata was an internal affairs li... | Criminal complaint: Cop used his role to help ... |
| 2 | 00027e965c8264c35cc1bc55556db388da82b07f | A drunk driver who killed a young woman in a h... | Craig Eccleston-Todd, 27, had drunk at least t... |
| 3 | 0002c17436637c4fe1837c935c04de47adb18e9a | (CNN) -- With a breezy sweep of his pen Presid... | Nina dos Santos says Europe must be ready to a... |
| 4 | 0003ad6ef0c37534f80b55b4235108024b407f0b | Fleetwood are the only team still to have a 10... | Fleetwood top of League One after 2-0 win at S... |


In [13]:
df.iloc[0].article

"By . Associated Press . PUBLISHED: . 14:11 EST, 25 October 2013 . | . UPDATED: . 15:36 EST, 25 October 2013 . The bishop of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. Bishop John Folda (pictured) of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A . State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure. The diocese announced on Monday that Bishop John Folda is taking time off after being diagnosed with hepatitis A. The diocese says he contracted the infection through contaminated food while attending a conference for newly ordained 

In [14]:
df.iloc[0].highlights

'Bishop John Folda, of North Dakota, is taking time off after being diagnosed .\nHe contracted the infection through contaminated food in Italy .\nChurch members in Fargo, Grand Forks and Jamestown could have been exposed .'

In [16]:
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")

input_text = df.iloc[0].article

# Generate summary
summary = summarizer(input_text, max_length=150, min_length=50, length_penalty=2.0,
                     num_beams=4, early_stopping=True, truncation=True)

# Print the generated summary
print(summary[0]['summary_text'])

the bishop of the fargo Catholic diocese in north . Dakota has exposed potentially hundreds of church members to the hepatitis . A virus in late September and early . October . the risk is low, but officials feel it's important to alert people to the possible exposure .


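### A different model
Let's try the `google/flan-t5-base` model. Its summary reads much better: it keeps the original capitalization and avoids the stray periods that appear in the `t5-base` output.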

In [18]:
summarizer = pipeline("summarization", model="google/flan-t5-base",
                      tokenizer="google/flan-t5-base")

input_text = df.iloc[0].article

# Generate summary
summary = summarizer(input_text, max_length=150, min_length=50, length_penalty=2.0,
                     num_beams=4, early_stopping=True, truncation=True)

# Print the generated summary
print(summary[0]['summary_text'])

Bishop John Folda of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure.
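
Judging summaries by eye is subjective. One rough way to put a number on it is to measure how many words from the reference `highlights` also appear in the generated summary. The sketch below does this with a set-based unigram recall. It is a simplified stand-in for illustration, not the official ROUGE metric, and the function name `unigram_recall` is hypothetical.

```python
import re


def unigram_recall(candidate, reference):
    """Fraction of distinct reference words that also appear in the candidate."""

    def tokenize(text):
        # Lowercase and keep runs of letters, digits, and apostrophes.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    ref_words = tokenize(reference)
    if not ref_words:
        return 0.0
    return len(ref_words & tokenize(candidate)) / len(ref_words)
```

Calling `unigram_recall(summary[0]['summary_text'], df.iloc[0].highlights)` for each model would give a crude score between 0 and 1, where higher means more of the reference vocabulary was covered.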


## Topics