<a href="https://colab.research.google.com/github/rsadaphule/jhu-aaml/blob/main/Module_11_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 11 - Transformers
# Student - Ravindra Sadaphule

In [1]:
from google.colab import drive; drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


1. [5 pts] What is the most probable noun after the word united?
2. [5 pts] What is the most probable preposition before the word sea?
3. [5 pts] What is the most probable preposition after the verb studying?
4. [5 pts] What is the most probable verb before the word network?
5. [20 pts] Predict the sentiments in the IMDB movie reviews and compare to the original
labels. List the number of disagreements in each group.
6. [20 pts] List the movie names mentioned in the IMDB movie reviews and rank them in each
sentiment group.
7. [20 pts] Refer to the carroll-alice.txt in Gutenberg (nltk corpus) and answer the following
questions:
(a.) Who is Alice?
(b.) Where is Alice?
(c.) What is Alice?
(d.) How is Alice?
8. [20 pts] Summarize the story Alice in Wonderland. Comment about the results.


In [2]:
!pip install transformers



In [3]:
PATH_DATA = '/content/drive/My Drive/JHU/AAML/Assignments/data/imdb/'
FILE_NAME = "movie_data.csv"

In [4]:
from transformers import pipeline

# Initialize a pipeline for 'fill-mask' task using a pre-trained model
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Define the queries with the correct mask token
queries = [
    "united [MASK]",         # to find a noun after 'united'
    "[MASK] sea",            # to find a preposition before 'sea'
    "studying [MASK]",       # to find a preposition after 'studying'
    "[MASK] network"         # to find a verb before 'network'
]

# Find and print the most probable tokens for each query
for query in queries:
    result = fill_mask(query)[0]  # Get the most probable prediction
    print(f"Query: '{query}' -> Prediction: '{result['token_str']}' with score {result['score']}")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Query: 'united [MASK]' -> Prediction: '.' with score 0.6787757873535156
Query: '[MASK] sea' -> Prediction: 'mediterranean' with score 0.22027242183685303
Query: 'studying [MASK]' -> Prediction: '.' with score 0.9596723318099976
Query: '[MASK] network' -> Prediction: 'cartoon' with score 0.07166624069213867
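The top-1 predictions above are punctuation or modifiers rather than the requested parts of speech. One possible refinement is to ask the pipeline for more candidates (its `top_k` argument) and keep the first one that passes a part-of-speech filter. The sketch below is only an illustration: `first_matching`, `is_preposition`, and the hand-written `PREPOSITIONS` set are hypothetical helpers, and the toy predictions are made-up values standing in for real pipeline output.

```python
# Hypothetical helper: pick the best-scoring fill-mask candidate that passes a filter.
# `predictions` has the shape returned by a fill-mask pipeline call:
# a list of dicts with 'token_str' and 'score', sorted by descending score.

# Small hand-written set of common English prepositions (an assumption, not exhaustive)
PREPOSITIONS = {
    "in", "on", "at", "by", "for", "with", "about", "from", "to", "of",
    "under", "over", "across", "near", "into", "through", "beyond", "toward",
}

def is_preposition(token):
    """Return True if the token is in the hand-written preposition set."""
    return token in PREPOSITIONS

def first_matching(predictions, is_allowed):
    """Return the first prediction whose token passes is_allowed, or None."""
    for pred in predictions:
        token = pred["token_str"].strip().lower()
        if is_allowed(token):
            return pred
    return None

# Toy predictions (made-up scores) standing in for fill_mask("studying [MASK]", top_k=50)
toy_predictions = [
    {"token_str": ".", "score": 0.96},
    {"token_str": ";", "score": 0.01},
    {"token_str": "at", "score": 0.005},
    {"token_str": "in", "score": 0.004},
]
best = first_matching(toy_predictions, is_preposition)
print(best)
```

With the notebook's pipeline this would be called as `first_matching(fill_mask("studying [MASK]", top_k=50), is_preposition)`. For the noun and verb questions a real part-of-speech tagger would be needed in place of the word set, and tagging a single word out of context may be unreliable.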


In [5]:
import pandas as pd

5. [20 pts] Predict the sentiments in the IMDB movie reviews and compare to the original labels. List the number of disagreements in each group.

In [6]:
# Load and preprocess the dataset
# Read the CSV file
df = pd.read_csv(PATH_DATA + FILE_NAME)
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [7]:
import torch
from transformers import pipeline


# Check if GPU is available and use it if possible
device = 0 if torch.cuda.is_available() else -1

# Load the sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=device)

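# Function to predict sentiment with truncation
def predict_sentiment(text):
    # Keep only the first 512 characters of the review. This is a character cut,
    # not a token cut; in practice it keeps the input under the model's 512-token limit.
    result = sentiment_pipeline(text[:512])[0]
    return 1 if result['label'] == 'POSITIVE' else 0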

# Apply the prediction function to the review column
df['predicted_sentiment'] = df['review'].apply(predict_sentiment)

# Compare predictions with actual labels
df['disagreement'] = df['sentiment'] != df['predicted_sentiment']

# Count the number of disagreements
disagreements = df['disagreement'].sum()
print(f"Number of disagreements: {disagreements}")

# Optional: Save the results to a new CSV file
df.to_csv('sentiment_analysis_results.csv', index=False)




Number of disagreements: 8623
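Note that `text[:512]` keeps 512 characters, which is usually far fewer than 512 tokens, so the model sees only the start of each review. A possible alternative is to let the tokenizer truncate at the token level by passing `truncation=True` and `max_length` in the pipeline call, which the text-classification pipeline should forward to its tokenizer. The sketch below is an assumption-laden illustration: `predict_sentiment_tokens` is a hypothetical name, and `stub_classifier` is a stand-in so the example runs without downloading a model. Results with the real pipeline would differ from the count above.

```python
def predict_sentiment_tokens(classifier, text, max_tokens=512):
    """Classify text, letting the tokenizer truncate to max_tokens tokens."""
    # truncation / max_length are tokenizer arguments passed through the pipeline call
    result = classifier(text, truncation=True, max_length=max_tokens)[0]
    return 1 if result["label"] == "POSITIVE" else 0

def stub_classifier(text, **tokenizer_kwargs):
    """Stand-in for sentiment_pipeline (hypothetical): labels by a keyword."""
    label = "POSITIVE" if "good" in text else "NEGATIVE"
    return [{"label": label, "score": 1.0}]

print(predict_sentiment_tokens(stub_classifier, "a good film"))
print(predict_sentiment_tokens(stub_classifier, "a dull film"))
```

In the notebook, the call would be `predict_sentiment_tokens(sentiment_pipeline, review)`.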


In [8]:
# Count the number of disagreements
disagreements = df['disagreement'].sum()
total_reviews = len(df)

# Calculate the percentage of disagreements
disagreement_percentage = (disagreements / total_reviews) * 100

print(f"Percentage of disagreements: {disagreement_percentage:.2f}%")

Percentage of disagreements: 17.25%
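The question asks for the number of disagreements in each group, while the cells above report only the overall total. A minimal sketch of the per-group count follows; `disagreements_by_group` is a hypothetical helper and the toy labels are made up for illustration.

```python
from collections import Counter

def disagreements_by_group(labels, predictions):
    """Count label/prediction mismatches separately for each original label."""
    counts = Counter()
    for label, pred in zip(labels, predictions):
        if label != pred:
            counts[label] += 1
    return counts

# Toy data: 1 = positive, 0 = negative (made-up values)
labels      = [1, 1, 0, 0, 0, 1]
predictions = [1, 0, 0, 1, 1, 1]
by_group = disagreements_by_group(labels, predictions)
print(f"Positive reviews predicted negative: {by_group[1]}")
print(f"Negative reviews predicted positive: {by_group[0]}")
```

On the notebook's DataFrame this could be called as `disagreements_by_group(df['sentiment'], df['predicted_sentiment'])`; the pandas equivalent is `df.groupby('sentiment')['disagreement'].sum()`.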


6. [20 pts] List the movie names mentioned in the IMDB movie reviews and rank them in each
sentiment group.

In [9]:
import pandas as pd
import spacy
from collections import Counter


# Load a Spacy model for Named Entity Recognition
nlp = spacy.load('en_core_web_sm')

# Function to extract movie names using NER
def extract_movie_names(text):
    doc = nlp(text)
    # Extract entities that might be movie titles
    return [ent.text for ent in doc.ents if ent.label_ in ['WORK_OF_ART', 'ORG']]

# Extract movie names from reviews
df['movies'] = df['review'].apply(extract_movie_names)

# Flatten the list of movies and count occurrences
movie_counts = Counter([movie for sublist in df['movies'] for movie in sublist])

# Assuming 'sentiment' column exists with binary values (1 for positive, 0 for negative)
positive_reviews = df[df['sentiment'] == 1]
negative_reviews = df[df['sentiment'] == 0]

# Count mentions of each extracted name in positive and negative reviews
positive_movie_counts = Counter([movie for sublist in positive_reviews['movies'] for movie in sublist])
negative_movie_counts = Counter([movie for sublist in negative_reviews['movies'] for movie in sublist])

# Rank movies in each sentiment group
ranked_positive_movies = positive_movie_counts.most_common()
ranked_negative_movies = negative_movie_counts.most_common()

print("Top Positive Movies:")
for movie, count in ranked_positive_movies[:20]:  # Top 20 for example
    print(f"{movie}: {count} positive reviews")

print("\nTop Negative Movies:")
for movie, count in ranked_negative_movies[:20]:  # Top 20 for example
    print(f"{movie}: {count} negative reviews")


Top Positive Movies:
Disney: 561 positive reviews
VHS: 298 positive reviews
BBC: 227 positive reviews
MGM: 220 positive reviews
Hitchcock: 201 positive reviews
Batman: 201 positive reviews
Ford: 187 positive reviews
Bond: 179 positive reviews
HBO: 171 positive reviews
Love: 171 positive reviews
House: 164 positive reviews
Oscar: 163 positive reviews
FBI: 143 positive reviews
: 142 positive reviews
Stanwyck: 137 positive reviews
Fulci: 134 positive reviews
CIA: 125 positive reviews
Matthau: 123 positive reviews
Hamlet: 121 positive reviews
ABC: 117 positive reviews

Top Negative Movies:
Disney: 384 negative reviews
CGI: 243 negative reviews
FBI: 180 negative reviews
BBC: 160 negative reviews
: 139 negative reviews
VHS: 124 negative reviews
FX: 123 negative reviews
Bible: 118 negative reviews
House: 116 negative reviews
Batman: 112 negative reviews
REALLY: 111 negative reviews
un: 110 negative reviews
Seagal: 110 negative reviews
CIA: 107 negative reviews
MTV: 105 negative reviews
/>We

Analysis:
Extracting movie names from reviews is hard. We could extract a few movie names, such as Batman, but most of the other extracted entities are studios, networks, people, or franchises (for example Disney, BBC, Hitchcock, Bond) rather than movie titles.
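Two things in the output above suggest easy cleanups: the empty entry and `/>We` look like leftovers of `<br />` markup in the reviews, and the `ORG` label is what brings in studios and networks. A rough sketch of both cleanups follows. `strip_html` and `keep_title_candidates` are hypothetical helpers, the entity pairs are made-up stand-ins for spaCy output, and restricting to `WORK_OF_ART` is an assumption that may also drop some real titles.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Replace HTML tags such as <br /> with a space."""
    return TAG_RE.sub(" ", text)

def keep_title_candidates(entities, allowed_labels=("WORK_OF_ART",)):
    """Keep non-empty entity texts whose label is in allowed_labels.

    `entities` is an iterable of (text, label) pairs, e.g. built from
    [(ent.text, ent.label_) for ent in doc.ents] in spaCy.
    """
    kept = []
    for text, label in entities:
        text = text.strip()
        if text and label in allowed_labels:
            kept.append(text)
    return kept

# Toy example (made-up review and entities, not real spaCy output)
review = "I loved Hamlet.<br /><br />We watched it on the BBC."
entities = [("Hamlet", "WORK_OF_ART"), ("BBC", "ORG"), (" ", "WORK_OF_ART")]
print(strip_html(review))
print(Counter(keep_title_candidates(entities)).most_common())
```

In the notebook, `strip_html` would be applied to each review before calling `nlp`, and `keep_title_candidates` would replace the list comprehension inside `extract_movie_names`.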

7. [20 pts] Refer to the carroll-alice.txt in Gutenberg (nltk corpus) and answer the following questions: (a.) Who is Alice? (b.) Where is Alice? (c.) What is Alice? (d.) How is Alice?


In [10]:
import nltk
nltk.download('gutenberg')


[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [11]:
from nltk.corpus import gutenberg
alice_text = gutenberg.raw('carroll-alice.txt')


In [31]:
#print(alice_text)

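We'll use the distilbert-base-uncased-distilled-squad model, a DistilBERT model fine-tuned on the SQuAD dataset for question answering tasks (bert-large-uncased-whole-word-masking-finetuned-squad is a BERT alternative fine-tuned on the same dataset):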

In [22]:
#from transformers import pipeline


# Initialize question answering pipeline with a different model
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

# The context (passage of text) where we search for answers
context = """Alice's Adventures in Wonderland (commonly shortened to Alice in Wonderland) is an 1865 novel by Lewis Carroll. It tells of a young girl named Alice, who falls through a rabbit hole into a subterranean fantasy world populated by peculiar, anthropomorphic creatures."""
#context = alice_text

# Questions about Alice
questions = [
    "Who is Alice?",
    "Where is Alice?",
    "What is Alice?",
    "How is Alice?"
]

# Alternative question (disabled):
# questions = [
#     "What Character does Alice play in the story?"
# ]

# Find answers for each question from the context
for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"Question: {question}\nAnswer: {result['answer']}\n")


Question: Who is Alice?
Answer: a young girl

Question: Where is Alice?
Answer: Alice's Adventures in Wonderland

Question: What is Alice?
Answer: Alice's Adventures in Wonderland

Question: How is Alice?
Answer: fantasy world



Analysis:
The answers to the first and second questions are acceptable, but the third and fourth are wrong. We may need to expand the context to get better answers. I tried both BERT (uncased) and DistilBERT; DistilBERT seems to perform relatively better.
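One way to expand the context is to run the questions against the book text itself, split into overlapping windows, and keep the highest-scoring answer. The sketch below is only an illustration under assumptions: `word_windows` and `best_answer` are hypothetical helpers, the window sizes are arbitrary, and `stub_qa` is a stand-in so the example runs without a model. Running a real QA model over every window of the book would be slow.

```python
def word_windows(text, size=300, overlap=50):
    """Split text into overlapping windows of at most `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return windows

def best_answer(qa, question, text, size=300, overlap=50):
    """Ask `question` against every window; return the highest-scoring result or None."""
    results = [qa(question=question, context=w) for w in word_windows(text, size, overlap)]
    return max(results, key=lambda r: r["score"]) if results else None

def stub_qa(question, context):
    """Stand-in for qa_pipeline (hypothetical): scores a window by mentions of 'Alice'."""
    return {"answer": context, "score": context.count("Alice")}

print(word_windows("a b c d e f g", size=4, overlap=1))
print(best_answer(stub_qa, "Who is Alice?", "the cat sat there Alice and Alice ran", size=4, overlap=0))
```

In the notebook this would be called as `best_answer(qa_pipeline, question, alice_text)`, since the pipeline returns a dict with `answer` and `score` for a single question and context.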

8. [20 pts] Summarize the story Alice in Wonderland. Comment about the results.

To summarize "Alice in Wonderland" with the Hugging Face Transformers library, we can use a pre-trained model suited to summarization. Popular choices are BART and T5; T5 is trained on a mixture of text-to-text tasks that includes summarization.

In [30]:
#from transformers import pipeline

# Initialize the summarization pipeline
summarizer = pipeline("summarization", model="t5-large")

# Context: the full text of "Alice in Wonderland" from the NLTK Gutenberg corpus
#context = """Alice's Adventures in Wonderland (commonly shortened to Alice in Wonderland) is an 1865 novel by Lewis Carroll. It tells of a young girl named Alice, who falls through a rabbit hole into a subterranean fantasy world populated by peculiar, anthropomorphic creatures. The tale plays with logic, giving the story lasting popularity with adults as well as with children. It is considered to be one of the best examples of the literary nonsense genre."""
context = alice_text

# Summarize only the first 1024 characters of the text (the opening of the book)
summary = summarizer(context[:1024], max_length=130, min_length=30, do_sample=False)

# Print the summary
print("Summary:", summary[0]['summary_text'])


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Summary: Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do . suddenly a White Rabbit with pink eyes ran close by her . Alice thought it so VERY out of the way to hear the rabbit say to itself, 'Oh dear! Oh dear! I shall be late!'


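Analysis:
I tried different models: BART, T5-small, and T5-large. T5-large performs best but is computationally expensive. Because only the first 1024 characters of the text are passed to the model, the summary covers only the opening scene, not the whole story.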
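To cover more of the story than the opening, one common approach is to split the text into chunks, summarize each chunk, and join the partial summaries. The sketch below is a rough illustration: `pack_paragraphs` and `summarize_long` are hypothetical helpers, the 1024-character budget simply mirrors the slice used above, and the lambda passed in the demo is a stand-in for a real summarizer.

```python
def pack_paragraphs(text, max_chars=1024):
    """Group paragraphs into chunks of at most max_chars characters where possible.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = " ".join(para.split())  # collapse line breaks and repeated spaces
        if not para:
            continue
        if current and len(current) + 1 + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current} {para}".strip()
    if current:
        chunks.append(current)
    return chunks

def summarize_long(summarize, text, max_chars=1024):
    """Summarize each chunk with `summarize` and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in pack_paragraphs(text, max_chars))

# Demo with a stand-in summarizer that just upper-cases its input
print(pack_paragraphs("aaa\n\nbbb\n\nccc", max_chars=7))
print(summarize_long(lambda chunk: chunk.upper(), "aaa\n\nbbb\n\nccc", max_chars=7))
```

With the notebook's model, `summarize` could be `lambda c: summarizer(c, max_length=130, min_length=30, do_sample=False, truncation=True)[0]['summary_text']`. Over the whole book this means many calls to T5-large, so it would be considerably slower than the single call above.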