In [None]:
1. What is BERT and how does it work?

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained, encoder-only Transformer model.
It reads text bidirectionally (left and right context together), so each token's representation reflects the whole sentence.
It is pre-trained with Masked Language Modeling (MLM), which predicts randomly masked tokens, and Next Sentence Prediction (NSP), which predicts whether one sentence follows another.
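
The masking step behind MLM can be sketched in a few lines. This is a simplified illustration, not BERT's actual preprocessing: it masks whitespace-separated words and always substitutes `[MASK]`, whereas BERT works on WordPiece tokens and sometimes substitutes a random token or keeps the original. The function name `mask_tokens` and its parameters are made up for this sketch.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so a model would only be scored on the positions it had to fill in.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")   # hide the token from the model
            labels.append(tok)        # remember what it should predict
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(labels)
```

Because the unmasked words on both sides stay visible, the model can use left and right context to recover each hidden word.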

In [None]:
2. What are the main advantages of using the attention mechanism in neural networks?


Captures long-range dependencies better than RNNs

Focuses on important words/tokens

Works in parallel, faster training

In [None]:
3. How does the self-attention mechanism differ from traditional attention mechanisms?

Traditional: the decoder attends to the encoder's hidden states (Seq2Seq).

Self-attention: each token attends to all tokens in the same sequence (within encoder or decoder).
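
The difference comes down to where the queries, keys, and values come from. A minimal NumPy sketch, assuming plain dot-product scoring and random stand-in states (real Transformers also apply learned projections first); `attend` is a helper name chosen for this example:

```python
import numpy as np

def attend(queries, keys, values):
    """Dot-product attention: returns one output row per query."""
    scores = queries @ keys.T                               # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over the keys
    return weights @ values

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # stand-in for 6 source tokens
decoder_states = rng.normal(size=(3, 8))   # stand-in for 3 target tokens

# Traditional attention: queries from the decoder, keys/values from the encoder.
cross = attend(decoder_states, encoder_states, encoder_states)

# Self-attention: queries, keys, and values all come from the same sequence.
self_attn = attend(encoder_states, encoder_states, encoder_states)

print(cross.shape, self_attn.shape)
```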

In [None]:
4. What is the role of the decoder in a Seq2Seq model?

The decoder generates the output sequence step by step. It takes the encoder’s context
(or attention over encoder outputs) plus its previously generated tokens to predict the next token.
It works autoregressively until the sequence ends.

In [None]:
5. What is the difference between GPT-2 and BERT models?

BERT: Bidirectional, trained with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP),
designed for understanding tasks (classification, Q&A).

GPT-2: Autoregressive, unidirectional (left-to-right), trained for generation tasks
(next-word prediction, text completion).

In [None]:
6. Why is the Transformer model considered more efficient than RNNs and LSTMs?

Processes all positions of a sequence in parallel during training (RNNs process tokens sequentially).

Captures long-range dependencies directly with attention.

Faster training using GPUs/TPUs.

In [None]:
7. Explain how the attention mechanism works in a Transformer model.

Each token forms Query (Q), Key (K), and Value (V) vectors.

Compute the similarity (scaled dot product) between each Query and all Keys -> gives attention scores.

Apply softmax -> weights.

Weighted sum of Values -> updated representation of the token.
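
The four steps above can be sketched in NumPy for a single attention head. The weight matrices here are random stand-ins for learned parameters, and the division by the square root of the key dimension is the scaling used in the original Transformer paper; the names `self_attention`, `w_q`, `w_k`, and `w_v` are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # step 1: Query, Key, Value per token
    scores = q @ k.T / np.sqrt(k.shape[-1])    # step 2: scaled similarity scores
    weights = softmax(scores)                  # step 3: softmax -> attention weights
    return weights @ v, weights                # step 4: weighted sum of Values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))        # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)               # one updated vector per token
print(weights.sum(axis=-1))    # each row of weights sums to 1
```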

In [None]:
8. What is the difference between an encoder and a decoder in a Seq2Seq model?

Encoder: Reads input sequence -> produces a context representation.

Decoder: Uses that context (and attention) to generate the output sequence step by step.

In [None]:
9. What is the primary purpose of using the self-attention mechanism in transformers?

To let each token attend to all other tokens in the same sequence,
capturing contextual relationships efficiently.

In [None]:
10. How does the GPT-2 model generate text?

It generates tokens autoregressively:

Predicts the next token from the context.

Appends it to the input.

Repeats until reaching end of sequence or stop condition.
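
The loop above can be sketched without a real model. In this illustration `toy_next_token` is a hypothetical stand-in for GPT-2 (a fixed lookup on the last token), and `generate` always takes the single predicted token, which corresponds to greedy decoding; real systems often sample instead.

```python
def generate(next_token, prompt, max_new_tokens=10, stop_token="<eos>"):
    """Autoregressive decoding: predict, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # predict the next token from the context
        if tok == stop_token:      # stop condition reached
            break
        tokens.append(tok)         # append it to the input and repeat
    return tokens

# Hypothetical stand-in for a language model: looks only at the last token.
BIGRAMS = {"once": "upon", "upon": "a", "a": "time"}

def toy_next_token(tokens):
    return BIGRAMS.get(tokens[-1], "<eos>")

print(generate(toy_next_token, ["once"]))   # ['once', 'upon', 'a', 'time']
```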

In [None]:
11. What is the main difference between the encoder-decoder architecture and a simple neural network?

Encoder-decoder: Handles variable-length input and output sequences (e.g., translation).

Simple NN: Works on fixed-size inputs and outputs, no sequence handling.

In [None]:
12. Explain the concept of “fine-tuning” in BERT.

Fine-tuning = taking pre-trained BERT weights and adapting them to a specific task by adding a small output
layer (e.g., classifier) and training on a labeled dataset.

In [None]:
13. How does the attention mechanism handle long-range dependencies in sequences?

Every token attends directly to every other token through weighted scores. Unlike RNNs,
attention does not pass information step by step, so dependencies between far-apart tokens
are captured in a single step.

In [None]:
14. What is the core principle behind the Transformer architecture?

Replace recurrence with self-attention and parallelization for efficiency and scalability.

In [None]:
15. What is the role of the position encoding in a Transformer model?

Self-attention has no recurrence and is by itself insensitive to token order, so positional encoding
adds order information (which token comes first, second, etc.) to the token embeddings.
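
One concrete scheme is the fixed sinusoidal encoding from the original Transformer paper; other models learn their position embeddings instead. A minimal NumPy sketch, assuming an even `d_model` (the function name `sinusoidal_positions` is chosen for this example):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Even columns hold sin(pos / 10000^(2i/d_model)), odd columns the matching cos.
    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # the 2i values, (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=5, d_model=8)
print(pe.shape)
print(pe[0])   # position 0: sin terms are 0, cos terms are 1
```

The result is added element-wise to the token embeddings, so two identical words at different positions enter the first layer as different vectors.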

In [None]:
16. How do Transformers use multiple layers of attention?

They stack multiple self-attention layers, and each layer refines the representations produced by the one below it.
Lower layers tend to capture more surface-level and syntactic patterns, higher layers more semantic ones.

In [None]:
17. What does it mean when a model is described as “autoregressive” like GPT-2?

It generates text token by token, predicting the next token using only past context (not future words).
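
Inside the model, "only past context" is typically enforced with a causal mask on the attention scores. The sketch below is a NumPy illustration under that assumption: the scores are placeholders (all zeros), and `causal_attention_weights` is a name chosen for this example.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over scores of shape (seq_len, seq_len), with future positions masked out."""
    seq_len = scores.shape[0]
    # True above the diagonal: position i must not look at any position j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(masked)                                # exp(-inf) = 0 for future tokens
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))   # placeholder scores: every token equally similar
weights = causal_attention_weights(scores)
print(weights)
# Row 0 attends only to token 0; row 2 spreads evenly over tokens 0, 1, and 2.
```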

In [None]:
18. How does BERT's bidirectional training improve its performance?

BERT sees both left and right context when predicting words, leading to deeper understanding and
better performance on comprehension tasks (Q&A, sentiment, classification).

In [None]:
19. What are the advantages of using the Transformer over RNN-based models in NLP?

Faster training (parallel processing).

Handles long-range context better.

Achieves state-of-the-art results on many NLP tasks.

In [None]:
20. What is the attention mechanism’s impact on the performance of models like BERT and GPT-2?

It enables deep contextual understanding and scalable learning, making these models powerful
for both understanding and generation tasks.

In [2]:
## Practical

In [None]:
1. How to implement a simple text classification model using LSTM in Keras?
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Inputs are integer-encoded texts, padded or truncated to a fixed length (e.g. 100 tokens)
model = Sequential([
    Embedding(input_dim=10000, output_dim=128),  # 10,000-word vocabulary, 128-dim vectors
    LSTM(128),                                   # final hidden state summarizes the sequence
    Dense(1, activation='sigmoid')               # probability of the positive class
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [None]:
2. How to generate sequences of text using a Recurrent Neural Network (RNN)?

Train an RNN (LSTM/GRU) on a text corpus for next-character or next-word prediction.
At inference:

Give a seed text.

Predict next token.

Append prediction to seed.

Repeat until desired length.


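import numpy as np

# Assumes `model` is a trained next-character model and that `char_to_idx` /
# `idx_to_char` are the (hypothetical) vocabulary lookups built during training.
seed = "Once upon"
for _ in range(50):
    x = np.array([[char_to_idx[c] for c in seed]])   # encode the text so far
    probs = model.predict(x, verbose=0)[0]           # distribution over the next character
    seed += idx_to_char[int(np.argmax(probs))]       # append the most likely character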

In [None]:
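3. How to perform sentiment analysis using a simple CNN model?

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

# Inputs are integer-encoded texts, padded or truncated to a fixed length (e.g. 100 tokens)
model = Sequential([
    Embedding(10000, 128),                 # 10,000-word vocabulary, 128-dim vectors
    Conv1D(128, 5, activation='relu'),     # 128 filters over 5-token windows
    GlobalMaxPooling1D(),                  # keep the strongest response of each filter
    Dense(1, activation='sigmoid')         # probability of positive sentiment
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])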


In [None]:
4. How to perform Named Entity Recognition (NER) using spaCy?

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.label_)


In [None]:
5. How to implement a simple Seq2Seq model for machine translation using LSTM in Keras?
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size = 10000  # assumed vocabulary size (shared by source and target here)

# Encoder: embed the source token IDs and keep only the final LSTM states
enc_inputs = Input(shape=(None,))
enc_emb = Embedding(vocab_size, 256)(enc_inputs)
enc_lstm = LSTM(256, return_state=True)
enc_outputs, state_h, state_c = enc_lstm(enc_emb)
enc_states = [state_h, state_c]

# Decoder: start from the encoder states and predict a token at every target position
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(vocab_size, 256)(dec_inputs)
dec_lstm = LSTM(256, return_sequences=True, return_state=True)
dec_outputs, _, _ = dec_lstm(dec_emb, initial_state=enc_states)
dec_dense = Dense(vocab_size, activation='softmax')
dec_outputs = dec_dense(dec_outputs)

model = Model([enc_inputs, dec_inputs], dec_outputs)


In [None]:
6. How to generate text using a pre-trained transformer model (GPT-2)?

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

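input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")
# GPT-2 has no padding token, so reuse the end-of-sequence token to avoid a warning
output = model.generate(input_ids, max_length=50, num_return_sequences=1,
                        pad_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(output[0], skip_special_tokens=True))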


In [None]:
7. How to apply data augmentation for text in NLP?
import nlpaug.augmenter.word as naw

aug = naw.SynonymAug(aug_p=0.3)
augmented_text = aug.augment("I love natural language processing")
print(augmented_text)


In [None]:
8. How can you add an Attention Mechanism to a Seq2Seq model?

At each decoder step, compute attention weights over encoder states.

Use weighted sum (context vector) along with decoder state.

Improves alignment in translation tasks.
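import numpy as np

decoder_hidden = np.random.rand(256)        # placeholder: current decoder state, shape (hidden,)
encoder_outputs = np.random.rand(10, 256)   # placeholder: one state per source token, shape (src_len, hidden)
prev_output = np.random.rand(128)           # placeholder: embedding of the previously generated token
score = encoder_outputs @ decoder_hidden                   # dot-product alignment scores
weights = np.exp(score - score.max())
weights /= weights.sum()                                   # softmax -> attention weights
context = weights @ encoder_outputs                        # context vector (weighted sum)
decoder_input = np.concatenate([context, prev_output])     # input to the decoder at this step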
