
In [1]:
!pip install transformers sentence-transformers faiss-cpu PyPDF2


# Hugging Face Transformers for Zero-shot Classification and Sentiment Analysis

In this notebook, we will:
1. Perform zero-shot classification using the `facebook/bart-large-mnli` model.
2. Explore sentiment analysis using pre-trained models.
3. Learn about tokenization and model inputs.
4. Use multiple models to compare sentiment analysis results.

---

## What is Sentiment Analysis?
Sentiment analysis is a Natural Language Processing (NLP) task where we determine the sentiment (e.g., positive, negative, or neutral) of a given text. This is widely used in areas like:
- Analyzing customer reviews.
- Monitoring social media sentiment.
- Understanding feedback in education or healthcare contexts.

### Sentiment Analysis with Pre-trained Models
We will use `nlptown/bert-base-multilingual-uncased-sentiment` for sentiment analysis on different texts.

In [5]:
from transformers import pipeline

# Perform sentiment analysis using a pre-trained model
get_sentiment = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

text = "The Medical reports reveal Blood sugar and Blood pressure are high"
print(text)
print(get_sentiment(text))

print("**********")
text = "The students from this school score very high marks"
print(text)
print(get_sentiment(text))

Device set to use cpu


The Medical reports reveal Blood sugar and Blood pressure are high
[{'label': '2 stars', 'score': 0.32587388157844543}]
**********
The students from this school score very high marks
[{'label': '1 star', 'score': 0.23965802788734436}]
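
The `nlptown` model returns star ratings rather than positive/negative labels, so its output is not directly comparable with binary sentiment models. The sketch below collapses a star label into a coarse polarity. The helper name `stars_to_polarity` and the cut-offs (1-2 stars negative, 3 neutral, 4-5 positive) are assumptions made for illustration and are not part of the model or the library.

```python
def stars_to_polarity(prediction):
    """Map a prediction such as {'label': '2 stars', 'score': ...} to a coarse polarity."""
    # The label text starts with the star count: "1 star", "2 stars", ..., "5 stars".
    stars = int(prediction["label"].split()[0])
    if stars <= 2:
        polarity = "negative"   # assumed cut-off
    elif stars == 3:
        polarity = "neutral"    # assumed cut-off
    else:
        polarity = "positive"   # assumed cut-off
    return polarity, stars, prediction["score"]


# Predictions copied from the cell output above.
outputs = [
    {"label": "2 stars", "score": 0.32587388157844543},
    {"label": "1 star", "score": 0.23965802788734436},
]
for out in outputs:
    print(stars_to_polarity(out))
```

Both scores are low for a five-class model, so the mapped polarity should be read as a weak signal rather than a firm verdict.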


### Custom Tokenizer and Model for Sentiment Analysis
Here we load the tokenizer and the model explicitly and pass both to the pipeline, so the tokenizer is also available for inspecting the model inputs in the next cell.

In [6]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model for sentiment analysis
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
sentiment_analyzer = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Analyze sentiment for a given text
text = "Most of my students have scored high marks in JEE"
print(sentiment_analyzer(text))

# Additional examples
text = "The Medical reports reveal Blood sugar and Blood pressure are high"
print(sentiment_analyzer(text))

text = "High Blood Pressure Levels"
print(sentiment_analyzer(text))

Device set to use cpu


[{'label': '5 stars', 'score': 0.36143529415130615}]
[{'label': '2 stars', 'score': 0.32587388157844543}]
[{'label': '4 stars', 'score': 0.25538644194602966}]


In [7]:
# Tokenize the text and inspect the model inputs
# Tokens are WordPiece units: whole words or sub-word pieces
text = "The Medical reports reveal Blood sugar and Blood pressure are high"
tokenized_output = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
tokens = tokenizer.tokenize(text)  # WordPiece tokens, without the special [CLS]/[SEP] tokens
input_ids = tokenized_output["input_ids"]  # Token IDs, including [CLS] (101) and [SEP] (102)
attention_mask = tokenized_output["attention_mask"]  # 1 for real tokens, 0 for padding

print("Tokens:", tokens)
print("Input IDs:", input_ids)
print("Attention Mask:", attention_mask)

Tokens: ['the', 'medical', 'reports', 'reveal', 'blood', 'sugar', 'and', 'blood', 'pressure', 'are', 'high']
Input IDs: tensor([[  101, 10103, 14336, 20336, 53468, 15465, 25238, 10110, 15465, 21686,
         10320, 11053,   102]])
Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
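
The output lists 11 tokens but 13 input IDs: the tokenizer wraps the sequence in a start token (ID 101) and an end token (ID 102). The attention mask is all ones because a single sentence needs no padding. The sketch below reproduces that bookkeeping in plain Python and adds padding to show where zeros would appear. The helper `build_inputs` is hypothetical, and the padding ID of 0 is an assumption (it is the usual value for BERT vocabularies).

```python
def build_inputs(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Wrap token IDs in [CLS]/[SEP], pad to max_len, and build the attention mask."""
    # Leave room for the two special tokens when truncating.
    ids = [cls_id] + list(token_ids)[: max_len - 2] + [sep_id]
    mask = [1] * len(ids)               # real tokens are attended to
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # padding is masked out


# The 11 WordPiece IDs from the output above, without the special tokens.
word_piece_ids = [10103, 14336, 20336, 53468, 15465, 25238,
                  10110, 15465, 21686, 10320, 11053]

ids, mask = build_inputs(word_piece_ids, max_len=16)
print(ids)
print(mask)
```

When several sentences of different lengths are batched together, the shorter ones are padded in this way so that the batch forms a rectangular tensor.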


### Sentiment Analysis with Multiple Models
Compare sentiment analysis results using different pre-trained models.

In [9]:
# Sentiment analysis with different models
get_sentiment_model1 = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")
get_sentiment_model2 = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

text = "The Medical reports reveal Blood sugar and Blood pressure are high"
print("************")
print(text)
print("bert-base-multilingual-uncased-sentiment", get_sentiment_model1(text))
print("distilbert-base-uncased-finetuned-sst-2-english",get_sentiment_model2(text))
print("************")



text = "The students from this school score very high marks"
print("************")
print(text)
print("bert-base-multilingual-uncased-sentiment", get_sentiment_model1(text))
print("distilbert-base-uncased-finetuned-sst-2-english",get_sentiment_model2(text))
print("************")


Device set to use cpu
Device set to use cpu


************
The Medical reports reveal Blood sugar and Blood pressure are high
bert-base-multilingual-uncased-sentiment [{'label': '2 stars', 'score': 0.32587388157844543}]
distilbert-base-uncased-finetuned-sst-2-english [{'label': 'NEGATIVE', 'score': 0.9767434597015381}]
************
************
The students from this school score very high marks
bert-base-multilingual-uncased-sentiment [{'label': '1 star', 'score': 0.23965802788734436}]
distilbert-base-uncased-finetuned-sst-2-english [{'label': 'POSITIVE', 'score': 0.9994359612464905}]
************
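
The raw scores of the two models are not on the same footing: a five-class model that guesses at random scores 0.2, while a two-class model scores 0.5. One simple way to compare them is to rescale each score so that 0 means chance level and 1 means certainty. This is an illustrative heuristic, not a library feature, and `confidence_over_chance` is a hypothetical helper.

```python
def confidence_over_chance(score, n_classes):
    """Rescale a top-class probability: 0.0 is chance level, 1.0 is certainty."""
    chance = 1.0 / n_classes
    return (score - chance) / (1.0 - chance)


# (model description, top score from the output above, number of classes)
results = [
    ("nlptown 5-star model", 0.32587388157844543, 5),
    ("distilbert SST-2 model", 0.9767434597015381, 2),
]
for name, score, n_classes in results:
    adjusted = confidence_over_chance(score, n_classes)
    print(f"{name}: raw={score:.3f}, over chance={adjusted:.3f}")
```

Seen this way, the star-rating model is barely above chance on the medical sentence, while the binary model is highly confident. Neither model was trained on clinical text, so confidence here does not imply correctness.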


# Sentiment Analysis and Zero-shot Classification

In this notebook, we explore the following tasks:
- Sentiment analysis using pre-trained Hugging Face models.
- Tokenization and model exploration.
- Zero-shot classification with the `facebook/bart-large-mnli` model.

### Zero-shot Classification
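We will use the `facebook/bart-large-mnli` model to perform zero-shot classification: the text is scored against candidate labels supplied at inference time, without any training on those specific labels. The model was trained on natural language inference (MNLI), and the pipeline treats each label as a hypothesis to be tested against the text.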

In [14]:
# Zero-shot classification using BART model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Alternative example text; uncomment to classify it instead of the football sentence
# text = "organizations continue to choose our ubiquitous computing fabric, from cloud to edge, to run their mission-critical applications"
text = "Manchester United is the biggest brand in the world"
candidate_labels = ["football","sport","cricket","education", "politics", "technology", "science", "cosmology"]
result = classifier(text, candidate_labels=candidate_labels)
print(result)

Device set to use cpu


{'sequence': 'Manchester United is the biggest brand in the world', 'labels': ['football', 'sport', 'technology', 'cosmology', 'cricket', 'education', 'science', 'politics'], 'scores': [0.5724889039993286, 0.32411059737205505, 0.024618325755000114, 0.022414034232497215, 0.015259054489433765, 0.014339434914290905, 0.013671555556356907, 0.013098129071295261]}
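
The eight scores above sum to 1, which is what we would expect if the pipeline applies a softmax across the labels' entailment logits. To our understanding, the pipeline also offers a `multi_label=True` mode in which each label is scored independently from its own entailment and contradiction logits. The sketch below contrasts the two calculations in plain Python. The logits are invented for illustration and do not come from the model.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_scores(entailment_logits):
    # Labels compete with each other: scores sum to 1.
    return softmax(entailment_logits)


def multi_label_scores(contradiction_logits, entailment_logits):
    # Each label is judged on its own: entailment vs. contradiction.
    return [softmax([c, e])[1] for c, e in zip(contradiction_logits, entailment_logits)]


labels = ["football", "sport", "politics"]
entail = [3.1, 2.5, -1.0]    # hypothetical entailment logits
contra = [-2.0, -1.5, 2.2]   # hypothetical contradiction logits

single = single_label_scores(entail)
multi = multi_label_scores(contra, entail)
for label, s, m in zip(labels, single, multi):
    print(f"{label}: single-label={s:.3f}, multi-label={m:.3f}")
```

In single-label mode, closely related labels such as "football" and "sport" split the probability mass between them. In multi-label mode both can score highly at the same time.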


# LANGUAGE MODELS

# Exploring Text Generation, Summarization, and Question Answering with Hugging Face Transformers

This notebook demonstrates the use of Hugging Face Transformers for:
1. Text generation using a GPT-2 model.
2. Text summarization with a BART model.
3. Question answering with different models and contexts.
4. Reading and processing PDF files for NLP tasks.

## 1. Text Generation
We use the `gpt2-large` model for generating text based on a given input prompt. The pipeline is initialized with the task `text-generation`.

In [16]:
from transformers import pipeline, set_seed

# Initialize the text generation pipeline
generator = pipeline("text-generation", model="openai-community/gpt2-large")

# Set a random seed for reproducibility
set_seed(4)

# Generate text based on a prompt
print(generator("The man worked as a", max_length=50, num_return_sequences=2, truncation=False))

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The man worked as a "receiver/transition" who was known to help others set up and maintain online relationships.\n\nDetectives suspect the man had some kind of role in the crime but, until last night, had not been named'}, {'generated_text': 'The man worked as a waiter at the restaurant and has been charged with sexual assault. He and a female customer are the only two reported victims and the restaurant has not released any information as to what happened or when or where the incident happened.\n\n'}]
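
The two returned sequences differ because generation here samples from the model's next-token distribution, and `set_seed` fixes the random state so that a rerun reproduces the same samples. The toy sketch below illustrates the idea with top-k sampling, one common sampling strategy. The vocabulary and probabilities are made up, and `sample_top_k` and `generate` are hypothetical helpers, not the library's implementation.

```python
import random


def sample_top_k(probs, k, rng):
    """Sample one token from the k most probable entries of a {token: prob} mapping."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    weights = [p for _, p in top]
    # random.choices renormalises the weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up next-token distribution after the prompt "The man worked as a".
next_token_probs = {"waiter": 0.30, "teacher": 0.25, "driver": 0.20,
                    "pilot": 0.15, "chef": 0.10}


def generate(seed, steps=5):
    rng = random.Random(seed)  # a fixed seed gives a fixed sequence of draws
    return [sample_top_k(next_token_probs, k=3, rng=rng) for _ in range(steps)]


print(generate(seed=4))
print(generate(seed=4))  # same seed, same sequence
```

Note that `max_length=50` in the cell above counts the prompt tokens as well as the newly generated ones.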



## 2. Text Summarization
We use the `facebook/bart-large-cnn` model to condense longer passages into short summaries.

In [17]:
# Initialize the summarization pipeline
summarize_model = pipeline("summarization", model="facebook/bart-large-cnn")

# Example text for summarization
txt = '''
Team India's below-par performance in the Border-Gavaskar Trophy could see big changes in the team and the leadership group. Rohit Sharma's captaincy is under the scanner and the selectors could take a call on him if India fail to reach the World Test Championship final. He has also struggled with the bat and only managed 31 runs in the ongoing series.
Amid India's poor performance in Australia, the Indian Express has reported that a senior player is portraying to be 'Mr Fix-it." The report states that the senior player is ready to project himself as an interim option for captaincy as he isn't convinced about the young players. The report doesn't mention the name of the senior player.
The report adds that Rohit may take a call about his career after the Border-Gavaskar Trophy. He made his ODI and T20I captaincy debut in 2007. Rohit made his Test debut in 2013.
'''

# Summarize the text
# print(summarize_model(txt, max_length=int(len(txt.split(" ")) / 4), do_sample=False))
print(summarize_model(txt,  do_sample=False))

Device set to use cpu


[{'summary_text': "Rohit Sharma's captaincy is under the scanner and the selectors could take a call on him if India fail to reach the World Test Championship final. He has also struggled with the bat and only managed 31 runs in the ongoing series. The report doesn't mention the name of the senior player."}]
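
The commented-out line in the cell above derives `max_length` from the input length. A slightly more careful version of that idea is sketched below: it counts words with `split()` (which also splits on newlines and ignores repeated spaces, unlike `split(" ")`) and returns both a lower and an upper bound. The helper `summary_length_bounds`, the 25% ratio, and the floor of 20 are assumptions for illustration. The pipeline measures length in tokens rather than words, so the result is only a rough guide.

```python
def summary_length_bounds(text, ratio=0.25, floor=20):
    """Return (min_length, max_length) for a summary, based on the input word count."""
    n_words = len(text.split())
    max_len = max(floor, int(n_words * ratio))  # never ask for fewer than `floor`
    min_len = max(1, max_len // 2)
    return min_len, max_len


sample = "word " * 400  # stand-in for a 400-word article
min_len, max_len = summary_length_bounds(sample)
print(min_len, max_len)
```

The bounds could then be passed to the pipeline, for example `summarize_model(txt, min_length=min_len, max_length=max_len, do_sample=False)`.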


In [19]:
# Initialize the summarization pipeline
summarize_model = pipeline("summarization", model="nsi319/legal-pegasus")

# Another example for summarization
txt = '''This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA
Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and
assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents
or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or
functionality.
'''

# Summarize the text
# print(summarize_model(txt, max_length=int(len(txt.split(" ")) / 4), do_sample=False))
print(summarize_model(txt, max_length=100, do_sample=False))

Device set to use cpu


[{'summary_text': 'This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein.'}]


## 3. Question Answering
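We demonstrate extractive question answering using three different models:
- A general-purpose model fine-tuned on SQuAD 2.0 (`deepset/roberta-base-squad2`).
- A BERT-large model fine-tuned for legal documents (`atharvamundada99/bert-large-question-answering-finetuned-legal`).
- A distilled version of BERT fine-tuned for QA tasks (`distilbert/distilbert-base-cased-distilled-squad`).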

In [20]:
# Initialize question answering pipelines
question_model = pipeline("question-answering", model="deepset/roberta-base-squad2")
question_model_legal = pipeline("question-answering", model="atharvamundada99/bert-large-question-answering-finetuned-legal")
question_model_bert = pipeline("question-answering", model="distilbert/distilbert-base-cased-distilled-squad")

# Define a query and context
query = "what are customer's responsibilities"
txt = '''This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA
Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and
assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents
or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or
functionality.'''

# Get answers from different models
print(question_model(question=query, context=txt, top_k=3))
print(question_model_legal(question=query, context=txt, top_k=3))
print(question_model_bert(question=query, context=txt, top_k=3))


Device set to use cpu


Device set to use cpu


Device set to use cpu


[{'score': 0.0008512305212207139, 'start': 371, 'end': 394, 'answer': 'errors contained herein'}, {'score': 0.0005820614751428366, 'start': 367, 'end': 394, 'answer': 'any errors contained herein'}, {'score': 0.0005411873571574688, 'start': 345, 'end': 394, 'answer': 'no responsibility for any errors contained herein'}]
[{'score': 0.23245160281658173, 'start': 337, 'end': 394, 'answer': 'assumes no responsibility for any errors contained herein'}, {'score': 0.1143946498632431, 'start': 345, 'end': 394, 'answer': 'no responsibility for any errors contained herein'}, {'score': 0.03920697793364525, 'start': 337, 'end': 395, 'answer': 'assumes no responsibility for any errors contained herein.'}]
[{'score': 0.03025309555232525, 'start': 345, 'end': 394, 'answer': 'no responsibility for any errors contained herein'}, {'score': 0.02294539101421833, 'start': 337, 'end': 394, 'answer': 'assumes no responsibility for any errors contained herein'}, {'score': 0.013406002894043922, 'start': 367, '
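
The context never mentions customer responsibilities, and the low scores reflect that: every model returns a span about NVIDIA's own disclaimers. Because extractive QA models always return some span, it helps to treat low-scoring answers as "no answer found". The sketch below picks the best candidate across models and rejects it if it falls under a threshold. The helper `best_answer`, the short model names used as dictionary keys, and the 0.1 threshold are assumptions for illustration.

```python
def best_answer(candidates_by_model, min_score=0.1):
    """Return (model_name, candidate) with the highest score, or None if all are below min_score."""
    best = None
    for model_name, candidates in candidates_by_model.items():
        for cand in candidates:
            if cand["score"] < min_score:
                continue  # too uncertain to report
            if best is None or cand["score"] > best[1]["score"]:
                best = (model_name, cand)
    return best


# Top candidate per model, copied from the output above.
candidates_by_model = {
    "roberta-base-squad2": [
        {"score": 0.0008512305212207139, "start": 371, "end": 394,
         "answer": "errors contained herein"},
    ],
    "bert-large-legal": [
        {"score": 0.23245160281658173, "start": 337, "end": 394,
         "answer": "assumes no responsibility for any errors contained herein"},
    ],
    "distilbert-squad": [
        {"score": 0.03025309555232525, "start": 345, "end": 394,
         "answer": "no responsibility for any errors contained herein"},
    ],
}

print(best_answer(candidates_by_model))
print(best_answer(candidates_by_model, min_score=0.5))  # nothing is confident enough
```

Even the answer that passes the 0.1 threshold does not actually address the question, so a threshold is a filter for obvious misses, not a guarantee of relevance.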

## 4. Reading PDFs for Question Answering
We use the PyPDF2 library to read a PDF file and extract its content for further processing.

In [22]:
from PyPDF2 import PdfReader

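# Read a PDF file and return the text of all pages, separated by newlines
def read_pdf(file_path):
    reader = PdfReader(file_path)
    content = ""
    for page in reader.pages:
        # Guard against pages with no extractable text (e.g. scanned images)
        content += (page.extract_text() or "") + "\n"
    return content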

# Path to the PDF file
file_path = "/content/sample_data/LLM.pdf"
pdf_content = read_pdf(file_path)

# Define a query and get answers from the PDF content
query = "explain LLM?"
print(question_model(question=query, context=pdf_content, top_k=3))
print(question_model_legal(question=query, context=pdf_content, top_k=3))
print(question_model_bert(question=query, context=pdf_content, top_k=3))

[{'score': 0.16235895454883575, 'start': 97, 'end': 118, 'answer': 'Emergent Capabilities'}, {'score': 0.02643592096865177, 'start': 1253, 'end': 1272, 'answer': '●Emergent Abilities'}, {'score': 0.023881059139966965, 'start': 1254, 'end': 1272, 'answer': 'Emergent Abilities'}]
[{'score': 0.03023272193968296, 'start': 4540, 'end': 4572, 'answer': 'content moderation, explanations'}, {'score': 0.02450510486960411, 'start': 1149, 'end': 1201, 'answer': 'can now use one single model to solve many NLP tasks'}, {'score': 0.005973827559500933, 'start': 904, 'end': 954, 'answer': 'Why LLMs? \n●Scaling Law for Neural Language Models'}]
[{'score': 0.03337723761796951, 'start': 4338, 'end': 4345, 'answer': 'misused'}, {'score': 0.007326561491936445, 'start': 4338, 'end': 4369, 'answer': 'misused \n(misinformation, spam)'}, {'score': 0.005947143770754337, 'start': 4331, 'end': 4345, 'answer': 'can be misused'}]
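
The extracted PDF text is far longer than a single model input, and the low scores above suggest that an open-ended query such as "explain LLM?" is a poor fit for extractive QA. To our understanding the pipeline already windows long contexts internally, but splitting the text ourselves makes it possible to query sections separately and to see which part of the document an answer came from. The sketch below is a minimal character-based splitter with overlap. The helper `chunk_text` and the default sizes are assumptions for illustration.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into overlapping windows; returns a list of (start_offset, chunk)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


demo_text = "abcdefghij" * 30  # 300 characters standing in for pdf_content
chunks = chunk_text(demo_text, size=100, overlap=20)
for start, chunk in chunks:
    print(start, len(chunk))
```

Keeping the start offset of each chunk means that an answer's `start` and `end` positions within a chunk can be mapped back to positions in the full document by adding the offset.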


In [23]:
pdf_content

'Large Language Models \nCSC413 Tutorial 9 \nYongchao Zhou \nOverview \n●What are LLMs? \n●Why LLMs? \n●Emergent Capabilities \n○Few-shot In-context Learning \n○Advanced Prompt Techniques \n●LLM Training \n○Architectures \n○Objectives \n●LLM Finetuning \n○Instruction ﬁnetuning \n○RLHF \n○Bootstrapping \n●LLM Risks \nWhat are Language Models? \n●Narrow Sense \n○A probabilistic model that assigns a probability to every ﬁnite sequence (grammatical or not) \n●Broad Sense \n○Decoder-only models (GPT-X, OPT, LLaMA, PaLM) \n○Encoder-only models (BERT, RoBERTa, ELECTRA) \n○Encoder-decoder models (T5, BART) \n\nLarge Language Models - Billions of Parameters  \nhttps://huggingface.co/blog/large-language-models  \nLarge Language Models - Hundreds of Billions of Tokens \nhttps://babylm.github.io/  \nLarge Language Models - yottaFlops of Compute \nhttps://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf  \nWhy LLMs? \n●Scaling Law for Neural Language Models \n○Performan
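
The extracted text contains typographic ligatures (for example the single character "ﬁ" in "ﬁnetuning"), trailing spaces before line breaks, and blank lines. These can interfere with tokenization and with string matching. The sketch below normalises the text using only the standard library. The helper `clean_pdf_text` is hypothetical, and the choice of clean-up rules is an assumption rather than a required preprocessing step.

```python
import re
import unicodedata


def clean_pdf_text(text):
    """Normalise ligatures and whitespace in text extracted from a PDF."""
    # NFKC maps compatibility characters such as the "fi" ligature (U+FB01) to plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Drop spaces and tabs that sit directly before a line break.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of blank lines into a single line break.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


# "\ufb01" is the "fi" ligature seen in the extracted text above.
sample = "Instruction \ufb01netuning \nRLHF \n\nLarge Language Models - Billions of Parameters  \n"
print(clean_pdf_text(sample))
```

The cleaned string could be passed to the question-answering pipelines in place of the raw `pdf_content`.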