# Attention-Based Models and Transfer Learning

`1. What is BERT and how does it work?`

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained NLP model developed by Google. It is designed to understand the context of words in a sentence by considering both the words before and after a given word (bidirectional context).

How it works:

- Pre-training: BERT is trained on a large unlabeled corpus using two tasks:
  - Masked Language Model (MLM): Randomly masks some tokens in the input and trains the model to predict them from the surrounding context.
  - Next Sentence Prediction (NSP): Trains the model to predict whether the second of two sentences actually follows the first in the original text.
- Fine-tuning: The pre-trained model is fine-tuned for specific downstream tasks (e.g., sentiment analysis, question answering) with task-specific labeled data.
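
The masking step of MLM can be illustrated with a minimal sketch. It is a simplification: the helper name `mask_tokens`, the `mask_prob` parameter, and the toy sentence are assumptions made for illustration. The published BERT recipe masks about 15% of tokens and sometimes substitutes a random token or keeps the original instead of always inserting `[MASK]`.

```python
import random

MASK = "[MASK]"
IGNORE = None  # marks positions that do not contribute to the MLM loss


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a toy masked-language-model example."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)    # hide the token from the model
            labels.append(tok)     # the model must recover the original token
        else:
            masked.append(tok)     # token stays visible
            labels.append(IGNORE)  # nothing to predict at this position
    return masked, labels


tokens = "the cat sat on the mat because it was tired".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the model sees the unmasked tokens on both sides of each `[MASK]`, the prediction is conditioned on bidirectional context.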

`2. What are the main advantages of using the attention mechanism in neural networks?`

- Focus on Relevant Information: The attention mechanism allows the model to focus on the most relevant parts of the input while processing each output step.
- Handles Long-Range Dependencies: It effectively captures relationships between distant words in a sequence, which RNNs struggle with.
- Parallel Processing: Attention-based architectures, such as Transformers, enable efficient parallel computation compared to sequential models like RNNs.
- Improved Performance: Leads to state-of-the-art results in tasks such as translation, summarization, and question answering.

`4. What is the role of the decoder in a Seq2Seq model?`

The decoder in a Seq2Seq model:

- Receives Context: Takes the context vector from the encoder as input.
- Generates Output: Produces the output sequence one element at a time.
- Attention Mechanism (Optional): Aligns the decoder’s output with specific parts of the input sequence for improved accuracy.

`6. Why is the Transformer model considered more efficient than RNNs and LSTMs?`

- Parallelization: Processes the entire sequence simultaneously, unlike RNNs that process one token at a time.
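- Scalability: Because training parallelizes well on GPUs and TPUs, Transformers scale to larger models and datasets, although the cost of self-attention grows quadratically with sequence length.
- Long-Range Dependencies: Captures relationships between distant elements better than RNNs/LSTMs.
- Fewer Vanishing-Gradient Issues: Attention connects any two positions directly, so gradients do not have to pass through a long chain of recurrent time steps.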

`7. Explain how the attention mechanism works in a Transformer model.`

Inputs: Each word in the sequence is represented as an embedding.

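Query, Key, Value: Each word's embedding is projected into three vectors using learned weight matrices:
 - Query (Q): Represents what the current word is looking for in the other words.
 - Key (K): Represents what each word offers for matching against queries; every word has its own key.
 - Value (V): Contains the information of the word that is passed on to the output.

Attention Scores: The query of each word is compared with the keys of all words (dot product), scaled by the square root of the key dimension, and normalized with softmax to give attention weights.

Weighted Sum: The output for each word is the sum of all value vectors weighted by those attention weights: `Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V`.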

Multi-Head Attention: Uses multiple attention heads to capture different relationships in parallel.
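
A minimal single-head sketch of scaled dot-product attention follows, written in plain Python so each step is visible. The function names and the toy `Q`, `K`, `V` values are assumptions for illustration. In a real Transformer these matrices come from learned projections of the embeddings, and multi-head attention runs several such computations in parallel on different projections.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Relevance of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: 3 tokens with 2-dimensional vectors (values are arbitrary)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token draws from every token in the sequence, including itself.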

`9. What is the primary purpose of using the self-attention mechanism in Transformers?`

The self-attention mechanism in Transformers is designed to:

- Capture Relationships Between All Tokens: It enables the model to learn dependencies between all words in a sequence, regardless of their distance.
- Understand Context Better: By assigning attention scores to each word in relation to others, the model can understand the relevance of words in context.
- Replace Recurrence with Parallelization: Eliminates the need for sequential processing (as in RNNs), making computation faster and more efficient.

`10. How does the GPT-2 model generate text?`

The GPT-2 model generates text using a unidirectional Transformer decoder:

- Input Encoding: The input text is tokenized and transformed into embeddings.
- Causal Masking: Ensures that predictions for a word depend only on previous words (left-to-right context).
- Sequential Generation: At each step the model outputs a probability distribution over the vocabulary for the next token.
- Decoding Strategy: Uses a strategy such as greedy search, beam search, or nucleus (top-p) sampling to choose the next token from that distribution.
- Iterative Process: Continues generating tokens until an end-of-sequence token or a predefined length is reached.
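
The difference between greedy decoding and nucleus sampling can be sketched on a single next-token distribution. The probabilities below are made up for illustration and are not real GPT-2 output; the function names `greedy` and `nucleus_sample` are likewise assumptions.

```python
import random


def greedy(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)


def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample from the smallest set of top tokens whose total probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        total += p
        if total >= top_p:
            break
    tokens, weights = zip(*nucleus)
    # random.choices treats the weights as relative, so no renormalization is needed
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution
next_token_probs = {"galaxy": 0.5, "planet": 0.3, "star": 0.15, "banana": 0.05}

print(greedy(next_token_probs))  # galaxy
print(nucleus_sample(next_token_probs, top_p=0.75, rng=random.Random(0)))
```

With `top_p=0.75` only "galaxy" and "planet" remain in the nucleus, so unlikely tokens are never sampled while the output still varies between runs.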

`12. Explain the concept of “fine-tuning” in BERT.`

Fine-tuning in BERT refers to adapting the pre-trained BERT model to a specific downstream task.

Steps:

- Add Task-Specific Layers: Attach layers like classifiers or regressors on top of the BERT model.
- Task-Specific Training: Use labeled data from the target task to train the entire model or just the added layers.
- Leverage Pre-Trained Knowledge: Retains general language understanding from pre-training while specializing in the new task.

Fine-tuning makes BERT highly effective for tasks like sentiment analysis, question answering, and text classification.

`13. How does the attention mechanism handle long-range dependencies in sequences?`

The attention mechanism captures long-range dependencies by:

- Global Context Understanding: Every word/token attends to all other tokens in the sequence, regardless of distance.
- Weighted Relevance: Assigns importance scores (attention weights) to tokens based on their relevance to the current word.
- Parallel Computation: Processes all tokens simultaneously, avoiding sequential dependency issues found in RNNs.

This allows the model to retain information about distant relationships effectively.

`14. What is the core principle behind the Transformer architecture?`

The core principles of the Transformer architecture are:

- Self-Attention Mechanism: Enables efficient context learning across all tokens in a sequence.
- Parallel Processing: Eliminates sequential dependencies, allowing faster training and inference.
- Positional Encoding: Adds positional information to embeddings to preserve the order of words.
- Multi-Head Attention: Uses multiple attention heads to learn diverse aspects of relationships in parallel.
- Feedforward Layers: Applies dense layers to transform representations after attention.

This design makes Transformers highly effective for tasks involving sequential data, such as language modeling and translation.
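
As an illustration of positional encoding, here is a minimal sketch of the sinusoidal scheme from the original Transformer paper. The function name and the small sizes are assumptions for illustration. Models such as BERT and GPT-2 learn their position embeddings instead of using this fixed formula.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: one d_model-sized vector per position."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_len=4, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(x, 3) for x in row])
```

These vectors are added to the token embeddings, which gives the otherwise order-agnostic attention layers a signal about where each token sits in the sequence.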

`17. What does it mean when a model is described as “autoregressive” like GPT-2?`

An autoregressive model generates output sequentially, where each token is predicted based on previous tokens. In GPT-2:

- Unidirectional Context: The model processes input left-to-right, using past tokens to predict the next one.
- Causal Masking: Ensures the model cannot "see" future tokens during training or generation.
- Sequential Generation: Words are generated one at a time, and the next word depends on the sequence generated so far.

This approach makes GPT-2 effective for text generation tasks, as it learns to predict coherent and contextually relevant sequences.
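
Causal masking can be sketched as follows: before the softmax, every score that points at a future position is set to negative infinity, so its attention weight becomes zero. The function name and the all-zero toy scores are assumptions for illustration.

```python
import math


def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend only to positions 0..i; future positions get -inf
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Uniform scores make the effect of the mask easy to read
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

The lower-triangular pattern is what makes the model autoregressive: the first token sees only itself, while the last token sees the whole prefix.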

`18. How does BERT's bidirectional training improve its performance?`

BERT's bidirectional training improves performance through:

- Contextual Understanding: Processes both left and right context simultaneously, enabling deeper semantic understanding of words in relation to their surroundings.
- Masked Language Model (MLM): Trains by masking some tokens and predicting them using both preceding and succeeding words.
- Improved Accuracy: Captures richer representations of text, making it better suited for tasks like question answering, named entity recognition, and text classification.

This contrasts with unidirectional models (e.g., GPT-2), which only consider one direction.


`20. What is the attention mechanism’s impact on the performance of models like BERT and GPT-2?`

The attention mechanism significantly enhances the performance of models like BERT and GPT-2 by:

- Capturing Relationships Across Tokens: Models relationships between all words, regardless of their position in the sequence.
- Improving Context Understanding: Assigns relevance scores to words, helping the model focus on the most important parts of the input.
- Handling Long Sequences Effectively: Connects any two tokens directly, avoiding the long recurrent chains that cause vanishing gradients in RNNs.
- Enabling Scalability: Facilitates parallel processing, allowing efficient training on large datasets.

This mechanism is foundational to the success of Transformer-based models in various NLP tasks, from language modeling to machine translation.









# Practicals


`1. How to implement a simple text classification model using LSTM in Keras?`



In [2]:
!pip install tensorflow
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample data: four short reviews with binary labels (1 = positive, 0 = negative)
texts = ["I love this movie", "I hate this movie", "This was fantastic", "This was terrible"]
labels = [1, 0, 1, 0]

# Preprocess data: map words to integer ids, then pad every sequence to length 10
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=10)

# Model definition: embedding -> LSTM -> dropout -> sigmoid output
model = Sequential([
    Input(shape=(10,)),  # fixes the sequence length; Embedding's `input_length` is deprecated in Keras 3
    Embedding(input_dim=5000, output_dim=64),
    LSTM(64, return_sequences=False),  # keep only the final hidden state
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

# Train model (four samples only demonstrate the API; the accuracy is not meaningful)
model.fit(padded_sequences, np.array(labels), epochs=5, batch_size=2)





Epoch 1/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - accuracy: 0.8333 - loss: 0.6839
Epoch 2/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.5000 - loss: 0.6890
Epoch 3/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.5000 - loss: 0.6923
Epoch 4/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.3333 - loss: 0.6930
Epoch 5/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 1.0000 - loss: 0.6777



`2. How to generate sequences of text using a Recurrent Neural Network (RNN)?`



In [3]:
import numpy as np
from keras.models import Sequential
from keras.layers import Input, Embedding, SimpleRNN, Dense
from keras.utils import to_categorical

# Sample text and character vocabulary
text = "hello world"
chars = sorted(list(set(text)))
char_to_int = {c: i for i, c in enumerate(chars)}

# Prepare data: each window of 3 characters predicts the character that follows it
seq_length = 3
dataX, dataY = [], []
for i in range(len(text) - seq_length):
    seq_in = text[i:i + seq_length]
    seq_out = text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])

X = np.array(dataX)
y = to_categorical(dataY, num_classes=len(chars))

# Build model: embedding -> simple RNN -> softmax over the character vocabulary
model = Sequential([
    Input(shape=(seq_length,)),  # fixes the sequence length; Embedding's `input_length` is deprecated in Keras 3
    Embedding(input_dim=len(chars), output_dim=10),
    SimpleRNN(50, return_sequences=False),
    Dense(len(chars), activation='softmax')
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=100, batch_size=1)

# Generate text greedily, one character at a time
seed = "hel"
seed_seq = [char_to_int[char] for char in seed]
for _ in range(10):
    x = np.array(seed_seq).reshape(1, seq_length)
    probs = model.predict(x, verbose=0)
    next_id = int(np.argmax(probs, axis=-1)[0])  # most probable next character
    seed += chars[next_id]
    seed_seq = seed_seq[1:] + [next_id]  # slide the window forward by one character
print("Generated text:", seed)


Epoch 1/100
8/8 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.0622 - loss: 2.0881
Epoch 2/100
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5180 - loss: 2.0219
Epoch 3/100
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7450 - loss: 1.9714
Epoch 4/100
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.6394 - loss: 1.9461
Epoch 5/100
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6209 - loss: 1.8799
Epoch 6/100
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5839 - loss: 1.8185
Epoch 7/100
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.8601 - loss: 1.6639
Epoch 8/100
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7386 - loss: 1.5756
Epoch 9/100
8/8 ━━━━━━━━━━━━━━━━━━━━

`3. How to perform sentiment analysis using a simple CNN model?`



In [4]:
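import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample data: four short reviews with binary labels (1 = positive, 0 = negative)
texts = ["I love this movie", "I hate this movie", "This was fantastic", "This was terrible"]
labels = [1, 0, 1, 0]

# Preprocessing (same as LSTM example)
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences, maxlen=10)

# CNN model: embedding -> 1D convolution over 3-token windows -> max pooling -> sigmoid output
model = Sequential([
    Input(shape=(10,)),  # fixes the sequence length; Embedding's `input_length` is deprecated in Keras 3
    Embedding(input_dim=5000, output_dim=64),
    Conv1D(filters=128, kernel_size=3, activation='relu'),
    GlobalMaxPooling1D(),  # keep the strongest response of each filter
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(padded_sequences, np.array(labels), epochs=5, batch_size=2)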


Epoch 1/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - accuracy: 0.3333 - loss: 0.6955
Epoch 2/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 0.3333 - loss: 0.6925
Epoch 3/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.8333 - loss: 0.6751
Epoch 4/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.6667 - loss: 0.6747
Epoch 5/5
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 1.0000 - loss: 0.6662



`4. How to perform Named Entity Recognition (NER) using spaCy?`



In [5]:
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = "Barack Obama was born in Hawaii and served as the 44th President of the United States."
doc = nlp(text)

# Extract Entities
for ent in doc.ents:
    print(ent.text, ent.label_)


Barack Obama PERSON
Hawaii GPE
44th ORDINAL
the United States GPE


`6. How to generate text using a pre-trained transformer model (GPT-2)?`

In [6]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Once upon a time in a distant galaxy,"
output = generator(prompt, max_length=50, num_return_sequences=1)

print(output[0]['generated_text'])



Once upon a time in a distant galaxy, she and her companions discovered one place they could gather for a final search of the universe, in their last days. But then a race of superweapons, led by Queen Cate, were unleashed and killed
