# Transformers and Pretrained Language Models

## **Outline**


- NLP and text generation before the introduction of transformers.
- Overview of pre-trained models:
  - BERT
  - GPT
- Other notable transformers and their applications.
- **Hands-on Lab:** Introduction to BERT and GPT


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Transformers: Attention Is All You Need (Google) — 2017**

- Deep learning still fell short in natural language processing (NLP).
- NLP is not just about translation or classification.
- The challenge was holding coherent conversations with humans.

- **RNN (Recurrent Neural Network)**
  - A type of neural network designed for sequential data.
  - Processes data with loops, allowing information persistence.
    - **How it works:**
        - Takes input at each time step.
        - Maintains a hidden state that captures previous information.

- **LSTM (Long Short-Term Memory)**
  - A type of RNN designed to address the vanishing gradient problem.
  - Keeps long-term dependencies in sequential data.
    - **How it works:**
        - Similar to RNN but with specialized memory cells.
        - Has gates (input, forget, output) to control information flow.
        - Can store, read, and write information selectively.
    - **Advantages:**
        - Handles long-term dependencies effectively.
        - Better at capturing and retaining sequential patterns.
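
The recurrence described above (take an input at each time step, carry a hidden state forward) can be sketched in a few lines. This is a minimal illustration with scalar values, not the PyTorch model trained later in this notebook; the function names `rnn_step` and `run_rnn` and the weights `w_x`, `w_h` are assumptions chosen for readability.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar vanilla RNN: mix the new input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, h0=0.0):
    """Process a sequence one time step at a time, carrying the hidden state forward."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states

# Both sequences end with the same input (1.0), but the final hidden states differ
# because the hidden state still carries information about the earlier inputs.
states_a = run_rnn([1.0, 0.0, 1.0])
states_b = run_rnn([-1.0, 0.0, 1.0])
print(states_a[-1], states_b[-1])
```

An LSTM replaces the single `tanh` update with gated updates to a separate memory cell, but the step-by-step loop over the sequence stays the same.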


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/RNNLSTM.png?raw=1"  width="400" align="center"/>


- Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks were early staples in natural language processing and time series analysis.

**Limitations**
- They handled short sequences well but struggled with **longer text**.
- They could not capture complex ideas that span extended text.

## **Transformers**

- Google introduced the "Transformer" model in 2017.
- Presented in the groundbreaking paper "Attention Is All You Need."
- A milestone that revolutionized machine translation.


#### The Power of Attention

- "Attention" mechanism - a neural network game-changer.
- Lets the model look at the entire input sequence at once.
- Determines how relevant each input element is to each component of the output.
- Has transformed NLP and many other AI domains.
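
As a rough sketch of the idea, the scaled dot-product form of attention from "Attention Is All You Need" can be written in plain Python for a single query. The vectors below are made-up toy numbers, and the helper names `softmax` and `attention` are assumptions for illustration, not part of any library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Relevance of every input position to this query
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the relevance-weighted average of the value vectors
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return output, weights

# Toy 2-dimensional vectors for a three-token input (made-up numbers)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

output, weights = attention(query, keys, values)
print(weights)
```

The first and third tokens receive equal weight because their keys match the query equally well, while the second token, whose key is orthogonal to the query, receives less.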


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## Before Transformers:

In [1]:
#%pip install transformers

In [2]:
import torch
import torch.nn as nn
import numpy as np

import requests

# URL of the sample text file
file_url = 'https://github.com/wsko/Statistics/raw/main/hawking.txt'

# Download the file; fail loudly instead of continuing with `text` undefined
response = requests.get(file_url, timeout=30)
response.raise_for_status()

# Get the content of the response as a string
text = response.text

import string
import re

def process_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation using regular expressions
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)

    return text

text = process_text(text)
text[:200]


'stephen hawkings a brief history of time explores the profound questions about the universe beginning with the history of cosmological thought from aristotle and ptolemy’s geocentric models through co'

In [3]:
# Preprocess the text and create sequences
tokens = text.split()
# Build the vocabulary from unique words so each word gets exactly one index
vocab = sorted(set(tokens))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}
seq_length = 5

# Each training example is (5 consecutive words, the word that follows them)
data = []
for i in range(len(tokens) - seq_length):
    seq_in = tokens[i:i + seq_length]
    seq_out = tokens[i + seq_length]
    data.append((seq_in, seq_out))

# Define an RNN-based text generation model
class RNNTextGenerator(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden):
        x = self.embeddings(x)
        x, hidden = self.rnn(x, hidden)
        x = self.fc(x)
        return x, hidden

# Hyperparameters
vocab_size = len(vocab)  # number of unique words, not the total token count
embedding_dim = 10
hidden_dim = 50
learning_rate = 0.01
num_epochs = 25

# Create and train the RNN model
model_rnn = RNNTextGenerator(vocab_size, embedding_dim, hidden_dim)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_rnn.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    for seq_in, seq_out in data:
        seq_in_idx = torch.tensor([word_to_idx[word] for word in seq_in], dtype=torch.long)
        seq_out_idx = torch.tensor([word_to_idx[seq_out]], dtype=torch.long)

        optimizer.zero_grad()
        hidden = None
        for i in range(seq_length):
            output, hidden = model_rnn(seq_in_idx[i].view(1, -1), hidden)

        loss = criterion(output.view(1, -1), seq_out_idx)
        loss.backward()
        optimizer.step()
    print('epoch:', epoch, ', loss:', loss)



epoch: 0 , loss: tensor(10.8144, grad_fn=<NllLossBackward0>)
epoch: 1 , loss: tensor(6.8264, grad_fn=<NllLossBackward0>)
epoch: 2 , loss: tensor(1.6240, grad_fn=<NllLossBackward0>)
epoch: 3 , loss: tensor(1.9326, grad_fn=<NllLossBackward0>)
epoch: 4 , loss: tensor(0.8661, grad_fn=<NllLossBackward0>)
epoch: 5 , loss: tensor(0.8254, grad_fn=<NllLossBackward0>)
epoch: 6 , loss: tensor(0.7872, grad_fn=<NllLossBackward0>)
epoch: 7 , loss: tensor(0.2200, grad_fn=<NllLossBackward0>)
epoch: 8 , loss: tensor(0.0111, grad_fn=<NllLossBackward0>)
epoch: 9 , loss: tensor(2.1394, grad_fn=<NllLossBackward0>)
epoch: 10 , loss: tensor(0.0048, grad_fn=<NllLossBackward0>)
epoch: 11 , loss: tensor(0.0068, grad_fn=<NllLossBackward0>)
epoch: 12 , loss: tensor(0.0254, grad_fn=<NllLossBackward0>)
epoch: 13 , loss: tensor(0.0631, grad_fn=<NllLossBackward0>)
epoch: 14 , loss: tensor(0.0011, grad_fn=<NllLossBackward0>)
epoch: 15 , loss: tensor(0.0441, grad_fn=<NllLossBackward0>)
epoch: 16 , loss: tensor(0.0407, 

In [4]:
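# Generate text using the RNN model
seed_text = "the universe"
predicted_text = seed_text
hidden = None
with torch.no_grad():
    # Prime the hidden state with every seed word, not only the last one
    for word in seed_text.split():
        idx = torch.tensor([word_to_idx[word]], dtype=torch.long)
        output, hidden = model_rnn(idx.view(1, -1), hidden)
    for _ in range(5):
        # Greedy choice of the most likely next word
        predicted_word_idx = torch.argmax(output).item()
        predicted_word = idx_to_word[predicted_word_idx]
        predicted_text += " " + predicted_word
        # Feed the prediction back in to score the following word
        idx = torch.tensor([predicted_word_idx], dtype=torch.long)
        output, hidden = model_rnn(idx.view(1, -1), hidden)
print("********")
print(predicted_text)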


********
the universe the journey he uses of


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## With Transformers:

In [5]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the GPT-2 model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the device to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)




GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
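
The printed architecture is enough to estimate the size of the model. The sketch below counts parameters from the layer sizes shown above. The printout shows `Conv1D()` without shapes, so the projection sizes used here (768 to 3 x 768 for `c_attn`, 768 to 4 x 768 for `c_fc`) are assumptions based on the standard GPT-2 configuration, as is the assumption that `lm_head` shares its weights with `wte` and therefore adds no parameters.

```python
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 768, 12

def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model                      # scale and shift vectors
attn = linear_params(d_model, 3 * d_model)    # c_attn: joint query/key/value projection (assumed shape)
attn += linear_params(d_model, d_model)       # c_proj
mlp = linear_params(d_model, 4 * d_model)     # c_fc (assumed 4x expansion)
mlp += linear_params(4 * d_model, d_model)    # c_proj
per_block = 2 * layer_norm + attn + mlp       # ln_1 + ln_2 + attention + MLP

embeddings = vocab_size * d_model + n_positions * d_model   # wte + wpe
total = embeddings + n_layers * per_block + layer_norm      # final term is ln_f

print(f"{total:,} parameters")
```

Under these assumptions the count comes to roughly 124 million parameters, with almost a third of them in the token embedding table.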

In [6]:
# Generate text using the Transformer-based model
input_text = "the universe"
# Tokenize to get both input_ids and the attention_mask
inputs = tokenizer(input_text, return_tensors="pt").to(device)
output = model.generate(
    **inputs,
    max_length=10,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse end-of-text
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("********")
print(generated_text)


********
the universe is a very small place, and it


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Transformers Beyond Translation**

- State-of-the-art models for numerous NLP tasks.
- Recently, Transformers made waves in computer vision.

- Impacts on NLP
  - Fostered advancements in conversational AI.
  - Enabled applications in chatbots, virtual assistants, and more.


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/NLPEvolution.png?raw=1"  align="center"/>

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **BERT (Google) and GPT (OpenAI) — 2018**

- AI needed to understand **language beyond translation**.
- BERT and GPT addressed this crucial gap.

#### Introducing BERT

- BERT (Bidirectional Encoder Representations from Transformers).
- Google's approach to contextual language understanding.
- Trained on vast amounts of text to predict missing words.
- BERT's Impact
  - Achieved remarkable results in sentiment analysis, question answering, and more.
  - Contextual embeddings revolutionized language understanding.

#### GPT - A Different Approach

- GPT (Generative Pre-trained Transformer) by OpenAI.
- Focus on autoregressive language modeling.
  - Learning to generate text one word at a time.
- GPT's Language Generation
  - GPT-2's surprising ability to generate coherent text.
  - Human-like responses in chatbots and text generation.
  - Demonstrated the power of pre-trained models.


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/GPT.jpeg?raw=1" width="500" align="center"/>
---

#### Scaling Challenges

- Collecting quality training data remained a challenge.
- ImageNet required meticulous labeling of millions of images.
- Text datasets for language tasks were equally demanding.

#### GPT-4: Scaling New Heights

- OpenAI introduced GPT-4 in March 2023; its parameter count has not been publicly disclosed.
- At release, it was widely regarded as one of the most capable language models available.
- For comparison, GPT-3 has 175 billion parameters.

#### Fine Tuning - Customizing Models

- Fine tuning adapts large models to specific tasks.
- Cost-effective compared to training from scratch.
- Application in fields like healthcare, finance, and more.
- Examples
  - Fine-tuned models for medical document processing.
  - Improved accuracy in identifying medical conditions.
  - OpenAI's partnership with Microsoft for domain-specific AI.


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## Sentiment Analysis with and without BERT

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertForSequenceClassification, pipeline
import pandas as pd
from sklearn.model_selection import train_test_split


# Load the dataset from the CSV file
df = pd.read_csv('https://raw.githubusercontent.com/wsko/Statistics/main/movie_reviews.csv')

# Split the dataset into training and testing sets
X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Traditional Approach: TF-IDF + Logistic Regression
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

lr_classifier = LogisticRegression()
lr_classifier.fit(X_train_tfidf, y_train)

sample_review = ["The movie was disappointing. The acting was mediocre, and the plot lacked depth. I would not recommend it."]

# Transform the sample review using TF-IDF
sample_review_tfidf = tfidf_vectorizer.transform(sample_review)

# Predict the sentiment for the sample review
sample_predicted_sentiment_lr = lr_classifier.predict(sample_review_tfidf)
sample_sentiment = "Positive" if sample_predicted_sentiment_lr[0] == 'positive' else "Negative"
print(f"Sentiment Prediction (TF-IDF + Logistic Regression): {sample_sentiment}")



Sentiment Prediction (TF-IDF + Logistic Regression): Positive


In [8]:
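# BERT-based Approach
# Note: 'bert-base-uncased' is a pretrained checkpoint without a trained sentiment head.
# Its classification layer is randomly initialized here, so the predicted label is
# arbitrary until the model is fine-tuned on labeled reviews.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

results = nlp(sample_review)
predicted_sentiment_bert = results[0]['label']
print(f"Sentiment (BERT): {predicted_sentiment_bert}")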



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentiment (BERT): LABEL_0


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **The Challenge of Interaction**

- Early LLMs were trained only to predict the next word.
- Interacting with Large Language Models (LLMs) was challenging.
- They had difficulty following human instructions.

- **Instruction Tuning Unveiled**
  - Fine-tuning LLMs to follow human instructions
  - Enhanced interaction and task performance

**Benefits of Instruction Tuning**

- Increased accuracy and capabilities of LLMs.
- Alignment with human values.
- Prevention of undesired or dangerous content.


## **The Arrival of ChatGPT**

- ChatGPT: A milestone in Generative AI.
- Reorganized instruction tuning into a dialogue format.
- User-friendly interface for AI interaction.


## **Popular LLMs**
<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/GAI2.webp?raw=1" width="1000" align="center"/>


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **OpenAI’s GPT Models**

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/GAI3.webp?raw=1" width="1000" align="center"/>

## **Task specific**

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/GAI4.webp?raw=1" width="1000" align="center"/>


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/GAI5.webp?raw=1" width="1000" align="center"/>

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Google AI and the Pathways Language Model (PaLM)**

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/PALM.jpeg?raw=1" width="500" align="center"/>

- Google's largest publicly disclosed model at the time of its release.
- PaLM serves as a foundation model.
  - Used in various Google projects.
- Scales up to 540 billion parameters.
- Trained on 780 billion tokens.
- A substantial leap beyond GPT-3.

#### Training Data

- Self-supervised learning with a diverse text corpus.
- Multilingual web pages, books, code repositories, and more.

#### PaLM's Performance

- PaLM's exceptional few-shot performance.
- Outperformed prior large models such as GPT-3 on many few-shot benchmarks.


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **DeepMind’s Chinchilla Model**

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/Deepmind.webp?raw=1" width="500" align="center"/>

- DeepMind was founded in 2010.
- Acquired by Google in 2014, now a subsidiary of Alphabet Inc.
- Neural Turing Machine
  - Part of DeepMind's pursuit of replicating human short-term memory.
  - A step towards understanding memory in AI.

- AlphaZero
  - Competence achieved through reinforcement learning.

- AlphaFold's advances in protein folding.
  - Predicting over 200 million protein structures.
  - Revolutionizing the field of biology.

#### Flamingo - Describing Images


- In April 2022, DeepMind launched Flamingo.
  - A single visual language model capable of describing any picture.
  - Advancing AI's understanding of visual content.
  - https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model


#### Chinchilla AI - Outperforming GPT-3

- DeepMind introduced Chinchilla in March 2022.
- Outperforms GPT-3 on a wide range of tasks.
- How
  - Chinchilla has 70B parameters, fewer than half of GPT-3's 175B.
  - Trained on 1.4 trillion tokens, about 4.7x more than GPT-3.
- Significant benefits for inference costs.
  - The smaller model is cheaper to run yet outperforms larger language models.
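
The trade-off is easier to see as arithmetic. The short sketch below compares training tokens per parameter for the two models. Chinchilla was trained on 1.4 trillion tokens; the GPT-3 figure of 300 billion training tokens is an assumption taken from the GPT-3 paper rather than from this section.

```python
models = {
    # name: (parameters, training tokens)
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (params, tokens) in models.items():
    print(f"{name}: {tokens / params:.1f} training tokens per parameter")

# How much more text Chinchilla saw during training
token_ratio = models["Chinchilla"][1] / models["GPT-3"][1]
print(f"Chinchilla saw {token_ratio:.1f}x as many tokens as GPT-3")
```

Chinchilla ends up with about 20 training tokens per parameter against fewer than 2 for GPT-3, which is the sense in which a smaller model trained on more data can come out ahead.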


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Meta AI (formerly FAIR)**

- FAIR, or Facebook Artificial Intelligence Research.
- A laboratory focused on open-source AI frameworks.


#### PyText - Advancing NLP

- In 2018, FAIR released PyText.
- A modeling framework for NLP systems.


#### Galactica - Assisting Scientists

- November 2022: Meta's Galactica.
- Assists scientists with tasks like summarizing papers and annotating molecules.
- Bridging the gap between AI and scientific research.


## **LLaMA - Large Language Model Meta AI**

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/llama.jpeg?raw=1" width="500" align="center"/>

- Released in February 2023.
- A foundational transformer-based language model.
- Aimed at advancing AI research and academic exploration.
- Responsible AI
  - LLaMA models released under non-commercial licenses.
  - Preventing misuse while promoting responsible AI.
  - Access granted to select researchers and organizations.
- Parameters
  - Sizes range from 7 billion to 65 billion parameters.
  - LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B.
- Training Data
  - LLaMA models trained on 1.4 trillion tokens in 20 languages.
  - Leveraging publicly available unlabeled data.
  - Data sources include CCNet, GitHub, Wikipedia, ArXiv, Stack Exchange, and books.

- Challenges
  - LLaMA's performance varies across languages.
  - Challenges related to bias, toxicity, and hallucination.


## **Alpaca**

- **Model Origin**: Fine-tuned from LLaMA 7B
- **Training Data**: 52K instruction-following demonstrations
- **Comparison**: Similar behavior to OpenAI’s text-davinci-003
- **Cost**: <$600 for reproduction
- **Code**: [github.com/tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca)

- Powerful instruction-following models:
    - GPT-3.5 (text-davinci-003)
    - ChatGPT
    - Claude
    - Bing Chat
- **Challenges**:
    - Generation of false information
    - Propagation of stereotypes
    - Toxic language generation

#### Alpaca Model Details
- **Purpose**: Addressing deficiencies in instruction-following models
- **Base**: Meta’s LLaMA 7B model
- **Training Data**: 52K instructions generated using text-davinci-003
- **Behavior**: Similar to text-davinci-003
- **Cost**: Surprisingly low


#### Training Recipe
- **Challenges**:
    1. Pretrained language model quality
    2. High-quality instruction data
- **Solution**: Meta’s new LLaMA models & self-instruct method
- **Training Details**: Fine-tuned LLaMA 7B on 52K demonstrations from text-davinci-003
- **Data Cost**: <$500 using OpenAI API
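
To make "instruction-following demonstrations" concrete, the sketch below turns one demonstration into a training string. The prompt layout (`### Instruction:` / `### Input:` / `### Response:`) is in the style used by Alpaca-like projects but is shown here as an illustrative assumption, and the helper name `format_example` and the sample record are hypothetical.

```python
def format_example(instruction, response, context=""):
    """Turn one instruction-following demonstration into a prompt and a full training string."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if context:
        # Optional extra input the instruction refers to
        prompt += f"### Input:\n{context}\n\n"
    prompt += "### Response:\n"
    return prompt, prompt + response

# A made-up demonstration record
demo = {
    "instruction": "Give a synonym for the word 'happy'.",
    "response": "Joyful",
}
prompt, full_text = format_example(demo["instruction"], demo["response"])
print(full_text)
```

During fine-tuning the model sees `full_text`; at inference time it receives only `prompt` and is expected to continue with a response.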

#### Preliminary Evaluation
- **Method**: Human evaluation on self-instruct evaluation set
- **Comparison**: Blind pairwise comparison between text-davinci-003 & Alpaca 7B
- **Results**: Alpaca and text-davinci-003 had very similar performance
- **Demo**: Interactive testing of Alpaca model



<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/alpaca.jpeg?raw=1" width="800" align="center"/>

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Anthropic and the Claude Chatbot**

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/claude.png?raw=1" width="500" align="center"/>

- Anthropic: An AI startup and public benefit corporation.
- Founded in 2021 by Daniela Amodei and Dario Amodei, former OpenAI members.
- A focus on responsible AI and interpretability.


#### Claude Chatbot

- Introducing Claude, Anthropic's conversational large language model.
- Using **constitutional AI** for better alignment with human intentions.
- Claude Models
  - Claude comes in two versions: Claude-v1 and Claude Instant.
  - Claude-v1 for complex dialogues and creative content.
  - Claude Instant for casual conversations and summarization.

#### Limitations and Concerns

- Claude's limitations in math and programming.
- Occasional hallucinations and dubious instructions.
- Concerns about clever prompting bypassing safety features.

**Availability and Integration**

- Claude's media embargo lifted in January 2023.
- Integration with Discord Juni Tutor Bot and various platforms.

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Open Source Efforts in AI and Machine Learning**



| Model Family Name | Created By | Sizes | Focus | Foundation or Fine-Tuned | License | What’s Interesting | Architectural Notes |
|-------------------|------------|-------|-------|--------------------------|---------|-------------------|--------------------|
| LLaMA | Meta | 7B, 13B, 33B, 65B | Varied | Foundation | Non-commercial | Basis for numerous fine-tuned variants | SwiGLU activation instead of ReLU |
| LLaMA 2 | Meta with Microsoft | 7B, 13B, 70B | Chat | Foundation | Commercial | Balances safety and helpfulness better than OpenAI's models | SwiGLU activation, RoPE over traditional embeddings |
| Alpaca | Stanford’s CRFM | 7B | Instruction following | Fine-tuned LLaMA 7B | Non-commercial | Trained on text-davinci-003 examples | - |
| Vicuna | LMSYS | 7B, 13B | Chat | Fine-tuned LLaMA 13B | Non-commercial | Utilizes conversations from ShareGPT.com for training | - |
| Guanaco | KBlueLeaf | 7B | Instruction following | Fine-tuned LLaMA 7B (parameter efficient) | Non-commercial | Fine-tuned using QLoRA | - |
| RedPajama | Multiple collaborators | 3B, 7B | Chat, Instruction following | Foundation | Commercial | Uses the fully open RedPajama dataset following the LLaMA training recipe | Modifications on the Pythia architecture |
| Falcon | Technology Innovation Institute of UAE | 7B, 40B | Varied | Foundation | Commercial | Features a 2D parallelism strategy and ZeRO optimization for efficient training | FlashAttention and Multi-query Attention techniques |
| Flan-T5 | Google | Various, up to 11B | Varied | Foundation | Commercial | Trained on a massive collection of datasets, tasks, and task categories | Based on the T5 encoder-decoder structure |
| Stable Beluga 2 (Freewilly) | Stability AI | 70B | Varied | Fine-tuned LLaMA 2 70B | Non-commercial | Uses a modified Orca approach for high-quality example generation | - |
| MPT | MosaicML | Up to 30B | Varied including story writing | Foundation | Commercial | Capable of generating extremely long texts (up to 84k tokens) with specific configurations | Features FlashAttention |


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-1/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

# Transformer Timeline


![Embedding Classifier Example](https://github.com/wsko/Generative_AI/blob/main/Day-3/images/NLP2.svg?raw=1)

- June 2017
  - Introduction of the Transformer architecture

- June 2018
  - **GPT** (Generative Pretrained Transformer)
  - First pretrained Transformer model
  - Used for fine-tuning on NLP tasks
  - Achieved state-of-the-art results

- October 2018
  - **BERT** (Bidirectional Encoder Representations from Transformers)
  - Large pretrained model
  - Designed to produce better summaries of sentences

- February 2019
  - **GPT-2**
  - Improved and larger version of GPT
  - Delayed public release due to ethical concerns

- October 2019
  - **DistilBERT**
  - Distilled version of BERT
  - 60% faster, 40% lighter in memory
  - Still retains 97% of BERT's performance

  - **BART and T5**
    - Large pretrained models
    - Same architecture as the original Transformer

- May 2020
  - **GPT-3**
  - Even bigger than GPT-2
  - Performs well on various tasks without fine-tuning
  - Known for zero-shot learning

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Transformer Model Types**

- **GPT-like**
  - Also known as auto-regressive Transformer models

- **BERT-like**
  - Also known as auto-encoding Transformer models

- **BART/T5-like**
  - Also known as sequence-to-sequence Transformer models

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Pretrained Transformer Models**

- All mentioned models (GPT, BERT, BART, T5, etc.) are pretrained language models.
- Trained on large amounts of raw text data.
- Self-supervised learning: Objective computed automatically from inputs, no human labeling required.

- Limitations of Pretrained Models

    - Pretrained models have statistical language understanding.
    - Not directly useful for specific tasks.
    - Require transfer learning for practical applications.

- Transfer Learning

  - Transfer learning fine-tunes pretrained models for specific tasks.
  - Supervised learning with human-annotated labels.
  - Improves model performance and adaptability.

- Example Task:
  -  Causal Language Modeling

     - Task: Predict the next word in a sentence given n previous words.
     - Output depends on past and present inputs, not future ones.

![Embedding Classifier Example](https://github.com/wsko/Generative_AI/blob/main/Day-3/images/NLP3.svg?raw=1)


  - Another example is masked language modeling, in which the model predicts a masked word in the sentence.



![Embedding Classifier Example](https://github.com/wsko/Generative_AI/blob/main/Day-3/images/NLP4.svg?raw=1)
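
Both objectives are self-supervised: the labels come from the text itself. The sketch below builds training examples for each one from a plain list of words. The helper names `causal_lm_pairs` and `mask_tokens` are hypothetical, and the default 15% masking rate is an assumption (it is the rate used by BERT, not a figure stated in this section).

```python
import random

def causal_lm_pairs(tokens, n):
    """(context, target) pairs: predict each word from the n words before it."""
    return [(tokens[i - n:i], tokens[i]) for i in range(n, len(tokens))]

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of words; the hidden words become the labels."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels[position] = token   # remember what was hidden at this position
        else:
            masked.append(token)
    return masked, labels

sentence = "the universe is expanding faster than expected".split()

# Causal language modeling: only past words are visible
pairs = causal_lm_pairs(sentence, n=3)
print(pairs[0])

# Masked language modeling: words on both sides of a gap are visible
masked, labels = mask_tokens(sentence, mask_prob=0.3)
print(masked, labels)
```

In the causal case the context never contains words to the right of the target, whereas in the masked case the model can use context on both sides of each `[MASK]`.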


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Transformers are big models**

- Improving performance often involves:
  - Increasing model sizes
  - Expanding the pretraining data

![Embedding Classifier Example](https://github.com/wsko/Generative_AI/blob/main/Day-3/images/NLP5.png?raw=1)

- Size vs. Performance
  - Larger models tend to perform better.
  - But training large models is resource-intensive.


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>


## **General architecture**

- Two main components:
  - Encoder and decoder blocks
  - Attention layers

### **Encoder and Decoder**

![Transformer Architecture](https://github.com/wsko/Generative_AI/blob/main/Day-3/images/NLP10.svg?raw=1)


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

### **Attention Layers**

- Attention layers instruct the model to focus on specific words in a sentence while processing the representation of each word.
- They help the model pay attention to certain words while ignoring others.
- In the context of translation
  - Attention layers are crucial because they allow the model to consider adjacent words for proper translation.
  - For example
    - When translating from English to French, attention is needed for subjects and gender agreement.
- Attention layers ensure that words' meanings are deeply influenced by their surrounding context.

- They are essential for handling complex sentences and grammar rules in natural language processing tasks.

- Understanding attention layers is a foundation for comprehending the Transformer architecture.


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **The Transformer**

![Transformer Architecture](https://github.com/wsko/Generative_AI/blob/main/Day-3/images/NLP9.svg?raw=1)

- Transformer architecture designed for translation.
- **Encoder** processes inputs (sentences) in one language.
- **Decoder** generates translations in the target language.

-  **Attention Mechanism in Encoder**
   - Encoder uses attention layers.
   - Can attend to all words in a sentence.
   - Considers both preceding and following words.
  
- **Attention Mechanism in Decoder**

  - Decoder works sequentially.
  - Processes words one by one.
  - Limited to using words before the current word.

- Training Speed-up

  - During training, the decoder is fed the entire target sentence at once.
  - An attention mask stops it from using future words for prediction.
  - For example, when predicting the fourth word, it only has access to words 1 to 3.
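
The restriction on the decoder is usually implemented as a causal (lower-triangular) attention mask. The following is a minimal sketch in plain Python; `causal_mask` and `masked_softmax` are hypothetical helper names, and uniform raw scores are used so that only the effect of the mask is visible.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; disallowed positions get weight 0."""
    visible = [s for s, ok in zip(scores, allowed) if ok]
    m = max(visible)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
mask = causal_mask(n)
# Uniform raw scores, so every visible position gets the same weight
weights = [masked_softmax([0.0] * n, mask[i]) for i in range(n)]
for row in weights:
    print([round(w, 2) for w in row])
```

Each row spreads its attention over its own position and earlier ones, and no weight ever lands on a later position. Because the mask enforces this, all positions of the target sentence can be processed in parallel during training.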


<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Text Generation: OpenAI GPT Model**

In [9]:
from transformers import pipeline

model_name = 'openai-gpt'
#model_name = 'gpt2-medium'
#model_name = 'distilgpt2'

generator = pipeline('text-generation', model=model_name)

generator("Hello! I am a neural network, and I want to say that", max_length=100, num_return_sequences=5)

# Optional arguments for generator():
# max_length=50,
# min_length=10,
# do_sample=True,           # False = greedy decoding
# early_stopping=True,
# num_beams=5,
# temperature=0.7,
# top_k=50,
# top_p=0.95,
# repetition_penalty=1.2,
# length_penalty=1.0,
# num_return_sequences=3


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[{'generated_text': 'Hello! I am a neural network, and I want to say that these are an important and important people who have helped me in a great many parts of the world. many places that i have seen have a great deal of information of value. i believe that i owe quite a lot to you and to our allies. i believe that you have helped us in many areas of history. i look forward to seeing you again very soon. please write a letter to the public in your own words. this'},
 {'generated_text': 'Hello! I am a neural network, and I want to say that i have found a source of energy, and that i will not lose this war. at your peril i will make you the hero you are destined to be. " \n as he spoke, the door opened and two men appeared from behind it. \n " well, well, miss lu, what do you make of it? is it a success? " asked the taller, more attractive of the two. both had big, but'},
 {'generated_text': 'Hello! I am a neural network, and I want to say that you must be the greatest hacker on earth.

In [10]:
generator("Synonyms of a word cat:", max_length=20, num_return_sequences=5)

[{'generated_text': "Synonyms of a word cat:'i feel very uncomfortable ','i feel very stupid"},
 {'generated_text': 'Synonyms of a word cat: "\'kitten \'. it\'s a cat name ; it'},
 {'generated_text': 'Synonyms of a word cat: a " great, magical, " magical, magical, enchanted'},
 {'generated_text': 'Synonyms of a word cat: the cat you got when you were a kid, with a'},
 {'generated_text': "Synonyms of a word cat: that's cat. you go with the cat, and that"}]

In [11]:
generator("People who liked the movie The Matrix also liked ", max_length=40, num_return_sequences=5)

[{'generated_text': 'People who liked the movie The Matrix also liked  it like some old - time tv shows or whatever it was in his family. when it was the movie he had loved most, before he was forced to'},
 {'generated_text': 'People who liked the movie The Matrix also liked  it good. \n i was only seventeen so it was a bit too sudden and so i just thought it would be a fun, hot, cheesy movie.'},
 {'generated_text': "People who liked the movie The Matrix also liked , so they have a whole lot of credits as well. what a pity. we didn't know them when they were teenagers. they're just like us"},
 {'generated_text': 'People who liked the movie The Matrix also liked  movies the matrix. \n " what\'s this? " i asked, grabbing the remote. \n " the game, " said jericha. \n " no'},
 {'generated_text': 'People who liked the movie The Matrix also liked  it, especially the time the girl wore a pair of sunglasses. \n when the movie was over, abby sat on the couch to watch tv and eat some'}]

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

## **Text Sampling Strategies**

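The simplest decoding strategy is **greedy** search: at every step the generator selects the next word with the highest probability. The calls above did not set any decoding parameters, and the returned sequences differ from one another, which indicates that the pipeline's default configuration for this model already samples rather than decoding greedily. Here is a baseline call with the default settings: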

In [12]:
prompt = "It was early evening when I can back from work. I usually work late, but this time it was an exception. When I entered a room, I saw"
generator(prompt,max_length=100,num_return_sequences=5)

[{'generated_text': "It was early evening when I can back from work. I usually work late, but this time it was an exception. When I entered a room, I saw my father talking to someone. i didn't want to stop and say hello. i quickly turned around and approached him again. i was surprised to find that he was still talking with the man who took me to the hospital. a few moments later, a doctor and another doctor entered the room. then i heard mr. hunter's voice telling the"},
 {'generated_text': 'It was early evening when I can back from work. I usually work late, but this time it was an exception. When I entered a room, I saw a young man sitting on a metal exam table examining one of the various experiments that he had in mind. he had blond hair, and he was a little taller than me, though much broader - chested. \n " i want to know if any of this is necessary, " i announced. " i\'ve been hired to work from home.'},
 {'generated_text': "It was early evening when I can back from work. I usu

**Beam Search** lets the generator explore several directions (*beams*) of text generation in parallel and keep the ones with the highest overall score. You enable beam search by passing the `num_beams` parameter. You can also specify `no_repeat_ngram_size` to prevent the model from repeating any n-gram of the given size:

In [13]:
prompt = "It was early evening when I can back from work. I usually work late, but this time it was an exception. When I entered a room, I saw"
generator(prompt,max_length=100,num_return_sequences=5,num_beams=10,no_repeat_ngram_size=2)

[{'generated_text': 'It was early evening when I can back from work. I usually work late, but this time it was an exception. When I entered a room, I saw a woman sitting on the edge of her bed, reading a book. \n " hello, " i said. " can i help you? " \n she looked up from the book and smiled at me. she was in her late twenties, maybe early thirties. her hair was pulled back in a ponytail, and she wore a pair of jeans'},
 {'generated_text': 'It was early evening when I can back from work. I usually work late, but this time it was an exception. When I entered a room, I saw a woman sitting on the edge of her bed, reading a book. \n " hello, " i said. " can i help you? " \n she looked up from the book and smiled at me. she was in her late twenties, maybe early thirties. her hair was pulled back into a ponytail, and she wore a pair of jeans'},
 {'generated_text': 'It was early evening when I can back from work. I usually work late, but this time it was an exception. When I entered a room, 
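
To make the idea of beams concrete, here is a minimal sketch of beam search over a made-up next-word table. `TOY_MODEL` and `beam_search` are hypothetical names used only for illustration; a real language model would supply the probabilities.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams, max_steps=3):
    # Each beam is (cumulative log-probability, list of tokens)
    beams = [(0.0, ["<s>"])]
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished beam, keep as-is
                continue
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams


for width in (1, 2):
    score, tokens = beam_search(num_beams=width)[0]
    print(width, " ".join(tokens), round(math.exp(score), 2))
```

With a single beam the search is greedy and commits to "the" because it is the most probable first word. With two beams it also keeps "a" alive and ends up with a sequence whose overall probability is higher.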

**Sampling** selects the next word non-deterministically, drawing it from the probability distribution returned by the model. You turn on sampling with the `do_sample=True` parameter. You can also specify `temperature` to make the output more or less deterministic: values below 1 concentrate probability on the most likely words, and values above 1 flatten the distribution.

In [14]:
prompt = "It was early evening when I can back from work. I usually work late, but this time it was an exception. When I entered a room, I saw"
generator(prompt,max_length=100,do_sample=True,temperature=0.8)

[{'generated_text': 'It was early evening when I can back from work. I usually work late, but this time it was an exception. When I entered a room, I saw my mother sitting in one of the chairs, reading a magazine. \n " hey, " i said, sitting on the chair next to her. she looked up at me and smiled. \n " hey, " she answered. \n " so, you have any plans today? " i asked. \n " we\'re all heading to the beach'}]
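
The effect of `temperature` can be seen on a small example. This is a minimal sketch in plain Python with made-up scores; the function name `softmax_with_temperature` is hypothetical and only mirrors the usual definition (divide the logits by the temperature, then apply softmax).

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate words
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures push more probability onto the top-scoring word, while higher temperatures spread it more evenly across the candidates.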

We can also provide two additional parameters to sampling:
* `top_k` specifies how many of the most probable words to consider when sampling. This reduces the chance of getting weird (low-probability) words in our text.
* `top_p` is similar, but we choose the smallest subset of most probable words whose total probability is at least p.

Feel free to experiment with adding those parameters in.
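
As a rough illustration of the two filters, here is a minimal sketch that applies them to a made-up next-word distribution. The helper names `top_k_filter` and `top_p_filter` are hypothetical; library implementations work on logits and tensors, but the selection logic follows the same idea.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose total probability reaches p."""
    kept, cumulative = [], 0.0
    for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


# Made-up next-word distribution
probs = {"a": 0.5, "the": 0.3, "my": 0.15, "zebra": 0.05}
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
```

The next word is then sampled from the filtered, renormalized distribution, so the unlikely word "zebra" can no longer be chosen in either case.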

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

# Code Generation

In [15]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model and tokenizer
model_name = 'gpt2-medium'
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model.eval()

if torch.cuda.is_available():
    model.to('cuda')


In [16]:
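def generate_code(prompt, max_length=150, temperature=0.7, top_k=50):
    # Convert the prompt into a tensor of token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Move the input to the GPU when the model was moved there
    if torch.cuda.is_available():
        input_ids = input_ids.to('cuda')

    # Note: do_sample is not set, so decoding is greedy and temperature/top_k
    # have no effect; pass do_sample=True to generate() to enable sampling.
    with torch.no_grad():
        generated_ids = model.generate(input_ids, max_length=max_length, temperature=temperature, top_k=top_k)

    # Convert the generated token ids back into text
    generated_code = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return generated_code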


In [17]:
prompt = "Write a Python function that takes a list of numbers and returns their average:"
generated_function = generate_code(prompt)
print(generated_function)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Write a Python function that takes a list of numbers and returns their average:

>>> from math import average >>> average = average(1, 2, 3) >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average >>> print average


## Few Shot Learning

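Few-shot learning aims to make accurate predictions on tasks for which only a handful of labeled examples are available.


**How it Works:**

1. **Training Phase:** Train a model on a large and diverse dataset.
2. **Adaptation Phase:** Adapt the model to the specific task using a small number of examples, either by fine-tuning on a small task-specific dataset or, as in the cells below, by placing the examples directly in the prompt without updating any weights.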

**Benefits:**

- Overcomes the challenge of data scarcity.
- Adapts to new tasks without extensive retraining.
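
For language models, one common way to supply the few examples is to place them directly in the prompt. The following is a minimal sketch of how such a prompt could be assembled; the helper name `build_few_shot_prompt` is hypothetical and the layout simply follows a commented-example style.

```python
def build_few_shot_prompt(examples, task):
    """Join (description, code) examples and a final task into one prompt string."""
    parts = []
    for i, (description, code) in enumerate(examples, start=1):
        parts.append(f"# Example {i}:\n# {description}\n{code}\n")
    parts.append(f"# Task: {task}\n")
    return "\n".join(parts)


examples = [
    ("Function to add two numbers", "def add(a, b):\n    return a + b"),
    ("Function to check if a number is even", "def is_even(num):\n    return num % 2 == 0"),
]
prompt = build_few_shot_prompt(
    examples,
    "Write a Python function that takes a list of numbers and returns their sum:",
)
print(prompt)
```

Keeping the examples in a list makes it easy to vary their number and order and to observe how the model's completion changes.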



In [18]:
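def generate_code_with_examples(prompt, max_length=300, temperature=0.6, top_k=50):
    # Convert the prompt (including the examples) into a tensor of token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Move the input to the GPU when the model was moved there
    if torch.cuda.is_available():
        input_ids = input_ids.to('cuda')

    # Note: do_sample is not set, so decoding is greedy and temperature/top_k
    # have no effect; pass do_sample=True to generate() to enable sampling.
    with torch.no_grad():
        generated_ids = model.generate(input_ids, max_length=max_length, temperature=temperature, top_k=top_k)

    # Convert the generated token ids back into text
    generated_code = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return generated_code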


In [19]:
prompt_with_examples = """
# Example 1:
# Function to add two numbers
def add(a, b):
    return a + b

# Example 2:
# Function to check if a number is even
def is_even(num):
    return num % 2 == 0

# Task: Write a Python function that takes a list of numbers and returns their sum:
"""

generated_function_with_examples = generate_code_with_examples(prompt_with_examples)
print(generated_function_with_examples)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



# Example 1:
# Function to add two numbers
def add(a, b):
    return a + b

# Example 2:
# Function to check if a number is even
def is_even(num):
    return num % 2 == 0

# Task: Write a Python function that takes a list of numbers and returns their sum:

#

# def sum(list):

#  return sum(list)

#

# Example 1:

# Function to add two numbers

def add(a, b):

   return a + b

# Example 2:

# Function to check if a number is even

def is_even(num):

   return num % 2 == 0

# Task: Write a Python function that takes a list of numbers and returns their sum:

#

# def sum(list):

#  return sum(list)

#

# Example 1:

# Function to add two numbers

def add(a, b):

   return a + b

# Example 2:

# Function to check if a number is even

def is_even(num):

   return num % 2 == 0

# Task: Write a Python function that takes a list of numbers and


In [20]:
prompt = "Write a Python function that takes a list of numbers and returns their sum:"
generated_function = generate_code(prompt)
print("Without Examples:\n", generated_function)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Without Examples:
 Write a Python function that takes a list of numbers and returns their sum:

def sum ( n ): return n * n

This function takes a list of numbers and returns the sum of them.

The sum function is a very simple function. It takes a list of numbers and returns the sum of them.

The sum function is a very simple function. It takes a list of numbers and returns the sum of them.

The sum function is a very simple function. It takes a list of numbers and returns the sum of them.

The sum function is a very simple function. It takes a list of numbers and returns the sum of them.

The sum function is a very simple function. It


In [21]:
print("With Examples:\n", generated_function_with_examples)


With Examples:
 
# Example 1:
# Function to add two numbers
def add(a, b):
    return a + b

# Example 2:
# Function to check if a number is even
def is_even(num):
    return num % 2 == 0

# Task: Write a Python function that takes a list of numbers and returns their sum:

#

# def sum(list):

#  return sum(list)

#

# Example 1:

# Function to add two numbers

def add(a, b):

   return a + b

# Example 2:

# Function to check if a number is even

def is_even(num):

   return num % 2 == 0

# Task: Write a Python function that takes a list of numbers and returns their sum:

#

# def sum(list):

#  return sum(list)

#

# Example 1:

# Function to add two numbers

def add(a, b):

   return a + b

# Example 2:

# Function to check if a number is even

def is_even(num):

   return num % 2 == 0

# Task: Write a Python function that takes a list of numbers and


# **Lab:** Introduction to BERT and GPT

## 1. GPT
- Use a pre-trained GPT-2 model to generate text based on a given prompt.
- Become familiar with the key parameters for text generation

https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2LMHeadModel

## 2. BERT

- Use BERT to compute the similarity between sentences
- Suggestion: use the "Consumer Complaints" dataset for text examples


## Sample Solutions

In [22]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification, GPT2Tokenizer, GPT2LMHeadModel

In [23]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')


# Sample prompt
prompt = "Magnetic Resonance Imaging was invented"

# Tokenize the input
inputs = tokenizer(prompt, return_tensors='pt')


In [24]:
# Generate text
output = model.generate(**inputs, max_length=100, num_return_sequences=1)

# Decode the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Magnetic Resonance Imaging was invented by Dr. John D. D. Dolan, a professor of physics at the University of California, Berkeley. The study was funded by the National Science Foundation.

The study was published in the journal Nature Communications.

"This is a very exciting discovery," said Dr. Dolan, who is also a professor of physics at the University of California, Berkeley. "It's the first time that we've seen a magnetic resonance imaging study of a


In [25]:
output = model.generate(
    **inputs,
    max_length=100,
    num_return_sequences=3,
    temperature=0.7,
    top_k=50,
    top_p=0.92,
    do_sample=True,
    repetition_penalty=2.0,
    num_beams=5
)

# Decode and print the generated texts
for i, generated_seq in enumerate(output):
    print(f"Generated Sequence {i + 1}:")
    print(tokenizer.decode(generated_seq, skip_special_tokens=True))
    print()


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Sequence 1:
Magnetic Resonance Imaging was invented in the early 1960s. It is a type of magnetic resonance imaging (MRI) that can be used to detect changes in brain activity, such as changes in heart rate and blood pressure.

In this study, researchers from the University of California, San Diego, and the National Institute of Neurological Disorders and Stroke (NINDS) at the University of California, San Diego, performed magnetic resonance imaging (MRI) on subjects who had been diagnosed with

Generated Sequence 2:
Magnetic Resonance Imaging was invented in the early 1960s. It is a type of magnetic resonance imaging (MRI) that can be used to detect changes in brain activity, such as changes in heart rate and blood pressure.

In this study, researchers from the University of California, San Diego, and the National Institute of Neurological Disorders and Stroke (NINDS) at the University of California, San Diego, performed magnetic resonance imaging (MRI) on subjects who had bee

### Parameters for `model.generate`

1. **`max_length`**: The maximum length of the generated sequence in tokens, counting the prompt.
2. **`num_return_sequences`**: The number of sequences to return for each prompt.
3. **`temperature`**: Controls the randomness of predictions by scaling the logits before softmax is applied. Lower values sharpen the distribution and make the output more deterministic; higher values make it more random. It takes effect only when sampling is enabled.
4. **`top_k`**: The number of highest-probability vocabulary tokens to keep for top-k filtering. It reduces the randomness of the generated text.
5. **`top_p` (nucleus sampling)**: A cumulative probability threshold: only the smallest set of most probable tokens whose probabilities add up to at least `top_p` is kept.
6. **`do_sample`**: Whether to use sampling; otherwise greedy decoding is used (or plain beam search when `num_beams` is greater than 1).
7. **`repetition_penalty`**: The penalty applied to tokens that have already appeared. 1.0 means no penalty; values above 1.0 discourage repetition.
8. **`num_beams`**: The number of beams for beam search. 1 means no beam search.
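
To illustrate what a repetition penalty does to the scores, here is a minimal sketch in plain Python. It assumes the common formulation in which the scores of already-generated tokens are divided by the penalty when positive and multiplied by it when negative; the function name `apply_repetition_penalty` and the numbers are made up for illustration.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of logits with previously generated tokens made less likely.

    Positive scores are divided by the penalty and negative scores are
    multiplied by it, so both become less probable when penalty > 1.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [3.0, -1.0, 0.5, 2.0]   # made-up scores for a 4-token vocabulary
already_generated = [0, 1]       # token ids that appear in the text so far
print(apply_repetition_penalty(logits, already_generated, penalty=2.0))
```

Tokens that have not been generated yet keep their original scores, so they become relatively more attractive at the next step.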


In [26]:
import pandas as pd
complaint = list(pd.read_csv("https://github.com/wsko/Statistics/raw/main/complaints02.csv")['Consumer_complaint'].values)
complaint[:2]

['I have sent plenty of letters asking this debt collection agency to verify my debt they will not respond to me.',
 "Hi Cfpb, I noticed inquiries on my credit report that I am not familiar with. I was a victim of the XXXX inquiry issue. I spoke with XXXX and they are not helping at all. I know for a fact that no one has my SSN or personal information. I had over an XXXX credit score all my life and would like these inquiries removed since I do n't recognize them. \n\nBest Regards, XXXX XXXX XXXX XXXX, XXXX XXXX XXXX, XXXX Bank XXXX XXXX XXXX XXXX XXXX, XXXX XXXX XXXX, XXXX XXXX  XXXX XXXX XXXX XXXX, XXXX XXXX XXXX, XXXX XXXX XXXX XXXX, XXXX XXXX XXXX, XXXX XXXX  XXXX XXXX XXXX, XXXX XXXX XXXX, XXXX XXXX XXXX XXXX XXXX XXXX XXXX, XXXX XXXX XXXX, XXXX XXXX XXXX XXXX, XXXX XXXX XXXX, XXXX"]

In [27]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample sentences
text_1 = complaint[0]
text_2 = complaint[10]

print("Text 1", '\n', text_1, '\n', '\n', '\n', "Text 2", '\n', text_2)


Text 1 
 I have sent plenty of letters asking this debt collection agency to verify my debt they will not respond to me. 
 
 
 Text 2 
 I've been having issues with Navient for years. They are very rude and condescending when they're talking to me. I had an issue once with it stated on my account that I had a past due amount but had been making auto-payments- they had the incorrect date on their end of the due date and when they were taking the payment, so they were calling them late- not my fault. 

Six months ago I called with issues with making payments. I had been unemployed and had to move across country for a job opportunity. They said there was nothing they could do- I also had complained that my payment was never the same every month and they lied and said it was. I had my account open in front of me showing different amounts taken from my account every month- again the agent lied. I told them to un-enroll me that I would make payments on my own. He assured me that it had occur

In [28]:
# Tokenize and encode the sentences (truncate to BERT's 512-token input limit)
inputs_1 = tokenizer(text_1, return_tensors='pt', truncation=True, max_length=512)
inputs_2 = tokenizer(text_2, return_tensors='pt', truncation=True, max_length=512)

# Run BERT without tracking gradients to get the token embeddings
with torch.no_grad():
    outputs_1 = model(**inputs_1)
    outputs_2 = model(**inputs_2)

# Mean-pool the token embeddings into one vector per text
embedding_1 = outputs_1.last_hidden_state.mean(dim=1)
embedding_2 = outputs_2.last_hidden_state.mean(dim=1)

# Compute cosine similarity between the two text vectors
cosine_similarity = torch.nn.functional.cosine_similarity(embedding_1, embedding_2)
print(cosine_similarity.item())

0.8109659552574158
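
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following minimal sketch computes it in plain Python on made-up two-dimensional vectors; the function name is chosen here for illustration and is not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

The value ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), regardless of the vectors' lengths.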
