# **BOOTCAMP @ GIKI (Content designed by Usama Arshad) WEEK 5**

---



# Text Classification (Sentiment Analysis) using BERT

## Theory:

### Transformers:
Transformers are a type of deep learning model introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." They replace recurrence with attention, which made it practical to train much larger models on much more text. They are the foundation of many state-of-the-art NLP models, including BERT, GPT, and T5.

#### Key Concepts of Transformers:

**Attention Mechanism:**
- The core idea of transformers is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence.
- Self-attention helps the model focus on relevant parts of the input sequence, improving the ability to understand context and relationships between words.
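
The weighting described above is usually written as softmax(QKᵀ / √d_k)V. The following is a minimal pure-Python sketch of that computation, not a production implementation. It assumes the queries, keys, and values are all the raw token vectors; a real transformer first passes them through learned projection matrices. The toy vectors and the helper names `softmax` and `self_attention` are made up for illustration.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of vectors."""
    d_k = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        # Similarity between this token's query and every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the value vectors
        dim = len(values[0])
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values)) for i in range(dim)])
    return outputs, weights

# Three toy 2-dimensional token vectors (made-up numbers)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])  # one row of attention weights per token
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, including itself.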

**Encoder-Decoder Architecture:**
- The original transformer is composed of an encoder and a decoder; many later models keep only one of the two (BERT uses the encoder, GPT the decoder).
- The encoder processes the input sequence and generates a context-aware representation.
- The decoder uses this representation to generate the output sequence (e.g., translation, summary).

**Parallel Processing:**
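- Unlike RNNs, which process tokens one after another, transformers process all positions of a sequence in parallel during training.
- This makes much better use of GPUs and makes transformers considerably faster to train, although the cost of self-attention grows quadratically with sequence length.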

### BERT (Bidirectional Encoder Representations from Transformers):
BERT is a transformer-based model designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. This allows BERT to understand the context of a word from both directions, which improves performance on a wide range of NLP tasks.

#### Key Features of BERT:

**Bidirectional:**
- BERT does not read text in a single direction: every token attends to the tokens on both its left and its right at the same time, so each word's representation reflects its full context.

**Pre-training and Fine-tuning:**
- BERT is pre-trained on a large corpus of text using two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
- After pre-training, BERT can be fine-tuned on specific tasks (e.g., sentiment analysis, question answering) by adding a small task-specific output layer, so relatively few additional parameters are needed.
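
To make Masked Language Modeling concrete, here is a small sketch of the masking step. It follows the recipe from the BERT paper: about 15% of positions are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy sentence, and the whitespace tokenization are assumptions for illustration only.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return a corrupted copy of tokens plus the positions the model must predict."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}  # position -> original token
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, targets

tokens = "the movie was surprisingly good and the ending was great".split()
vocab = sorted(set(tokens))
masked, targets = mask_tokens(tokens, vocab, mask_prob=0.15, seed=0)
print(masked)
print(targets)
```

During pre-training the model sees `masked` as input and is trained to recover the tokens stored in `targets`.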

**Transformers Encoder:**
- BERT uses only the transformer encoder stack. It has no decoder, because its goal is to produce representations of the input text rather than to generate output text.

### BERT Tokenizer:
The BERT tokenizer is responsible for converting raw text into the input format required by BERT. It includes several steps:

**Tokenization:**
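- The text is split into tokens using WordPiece: common words stay whole, while rarer words are split into subword pieces (continuation pieces are prefixed with `##`).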
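
BERT's tokenizer uses the WordPiece scheme, which splits a word by repeatedly taking the longest piece found in the vocabulary. The sketch below illustrates that greedy longest-match idea with a tiny made-up vocabulary. The function name `wordpiece` and the vocabulary are hypothetical, and the real tokenizer also handles lowercasing, punctuation, and a vocabulary of about 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Tiny made-up vocabulary for illustration
toy_vocab = {"play", "##ing", "##ed", "un", "##believ", "##able", "movie"}

print(wordpiece("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```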

**Adding Special Tokens:**
- Special tokens are added to the tokenized text: [CLS] at the start of the sequence (its final hidden state is used for classification) and [SEP] to mark the end of each segment.

**Padding and Truncation:**
- Sequences are padded with [PAD] tokens to a common length, and sequences longer than the maximum length are truncated.

**Conversion to Input IDs and Attention Masks:**
- Tokens are converted into numerical IDs that BERT can process.
- Attention masks are created to indicate which tokens should be attended to (1) and which should be ignored (0).
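
The remaining steps (special tokens, truncation, padding, input IDs, and attention masks) can be sketched in a few lines. This is a toy illustration rather than the real tokenizer: the function name `encode` and the ID values in `toy_vocab` are made up, and the input is assumed to be already split into tokens.

```python
def encode(tokens, vocab, max_length=8):
    """Add special tokens, truncate, pad, and build input IDs plus attention mask."""
    # Reserve two positions for [CLS] and [SEP], truncating the text if needed
    tokens = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    attention_mask = [1] * len(tokens)  # 1 = real token
    pad = max_length - len(tokens)
    tokens = tokens + ["[PAD]"] * pad
    attention_mask = attention_mask + [0] * pad  # 0 = padding, to be ignored
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    return input_ids, attention_mask

# Made-up IDs; the real BERT vocabulary assigns different numbers
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "great": 4, "movie": 5}

ids, mask = encode(["great", "movie"], toy_vocab, max_length=8)
print(ids)   # [2, 4, 5, 3, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```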


### Sentiment Analysis with BERT:
BERT achieved state-of-the-art results on many NLP benchmarks when it was released, including text classification. Because it is pre-trained on a large corpus of text, it can be fine-tuned for a specific task such as sentiment analysis with a comparatively small labeled dataset. Sentiment analysis is the task of classifying the sentiment of a given text, such as determining whether a movie review is positive or negative.

## Steps:

### 1. Load and Preprocess the Dataset
We'll use the IMDB dataset, which contains 50,000 labeled movie reviews (25,000 for training and 25,000 for testing), each marked as positive or negative. The dataset ships with predefined training and test splits, so we load both and then tokenize and encode the text data.

### 2. Fine-tune the BERT Model
We'll use the pre-trained BERT model from Hugging Face's `transformers` library and fine-tune it on the IMDB dataset. Fine-tuning involves training the model on the specific task of sentiment analysis for a few epochs.
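
The training cell in this notebook compiles the model with sparse categorical cross-entropy computed from logits. As a rough illustration of what that loss measures, the sketch below computes it by hand for a single example with two classes. The helper names and the logit values are made up.

```python
import math

def softmax(logits):
    """Convert raw model outputs (logits) into class probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability assigned to the true class (an integer label)."""
    return -math.log(softmax(logits)[label])

# Made-up logits in the order [negative, positive]
logits = [-2.0, 3.0]
loss_if_positive = sparse_categorical_crossentropy(logits, label=1)
loss_if_negative = sparse_categorical_crossentropy(logits, label=0)

print(round(loss_if_positive, 4))  # small: the model is confident and correct
print(round(loss_if_negative, 4))  # large: the model is confident and wrong
```

Fine-tuning adjusts the weights so that this loss, averaged over the training reviews, goes down.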

### 3. Make Predictions and Evaluate the Model
We'll use the fine-tuned model to make predictions on the test data and evaluate its performance using accuracy, precision, recall, and F1 score, as reported by scikit-learn's `classification_report`.
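
The metrics in a classification report can be computed by hand on a tiny example. The sketch below turns logits into predicted labels with an argmax and then derives accuracy, precision, recall, and F1 for the positive class. The function names, logits, and labels are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def evaluate(logits, y_true, positive=1):
    """Accuracy plus precision, recall and F1 for the positive class."""
    y_pred = [argmax(row) for row in logits]
    pairs = list(zip(y_pred, y_true))
    accuracy = sum(p == t for p, t in pairs) / len(pairs)
    tp = sum(p == positive and t == positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up logits in the order [negative, positive] and their true labels
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [1.5, 0.2]]
y_true = [0, 1, 0, 0]

report = evaluate(logits, y_true)
print(report)  # accuracy 0.75, precision 0.5, recall 1.0
```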

### 4. Save and Load the Model
We'll save the fine-tuned model and demonstrate how to load it for inference. We'll also provide a function to predict the sentiment of new text inputs.



In [None]:
# Import necessary libraries
import numpy as np  # For numerical computations
import tensorflow as tf  # For building and training the BERT model
from sklearn.metrics import classification_report  # For evaluating the model's performance
from transformers import BertTokenizer, TFBertForSequenceClassification  # For BERT tokenizer and model
import tensorflow_datasets as tfds  # For loading the IMDB dataset
import ipywidgets as widgets  # For creating UI widgets
from IPython.display import display  # For displaying UI elements

# Load the IMDB dataset
data, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)  # Load IMDB dataset with info

# Select the predefined training and test splits
train_data, test_data = data['train'], data['test']  # 25,000 labeled reviews each

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # Use the pre-trained BERT tokenizer

# Function to tokenize and encode the data
def encode_data(dataset, max_length=128):
    input_ids, attention_masks, labels = [], [], []  # Initialize lists for input IDs, attention masks, and labels

    for text, label in tfds.as_numpy(dataset):  # Iterate through the dataset
        encoding = tokenizer.encode_plus(
            text.decode('utf-8'),  # Decode text from bytes to string
            max_length=max_length,  # Maximum length of the tokenized text
            truncation=True,  # Truncate texts longer than max length
            padding='max_length',  # Pad texts shorter than max length
            add_special_tokens=True,  # Add special tokens (e.g., [CLS], [SEP])
            return_attention_mask=True,  # Return attention masks
            return_tensors='tf'  # Return TensorFlow tensors
        )
        input_ids.append(encoding['input_ids'])  # Append input IDs
        attention_masks.append(encoding['attention_mask'])  # Append attention masks
        labels.append(label)  # Append labels

    return tf.concat(input_ids, axis=0), tf.concat(attention_masks, axis=0), tf.convert_to_tensor(labels)  # Concatenate and return tensors

# Encode the training and test data
X_train, X_train_masks, y_train = encode_data(train_data)  # Encode the training data
X_test, X_test_masks, y_test = encode_data(test_data)  # Encode the test data

# Initialize the BERT model for sequence classification
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # Load pre-trained BERT model for classification

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # Use Adam optimizer with a learning rate of 2e-5
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # Use sparse categorical cross-entropy loss
              metrics=['accuracy'])  # Track accuracy metric

# Train the model
history = model.fit([X_train, X_train_masks], y_train, epochs=3, batch_size=16, validation_split=0.1)  # Train the model for 3 epochs

# Make predictions on the test data
y_pred = model.predict([X_test, X_test_masks])  # Predict labels for the test data
y_pred_labels = np.argmax(y_pred.logits, axis=1)  # Get predicted labels

# Evaluate the model
print(classification_report(y_test.numpy(), y_pred_labels, target_names=['negative', 'positive']))  # Print per-class precision, recall, and F1 (label 0 = negative, 1 = positive)

# Save the model
model.save_pretrained('./sentiment-analysis-bert')  # Save the fine-tuned BERT model

# Load the model for inference
loaded_model = TFBertForSequenceClassification.from_pretrained('./sentiment-analysis-bert')  # Load the saved model

# Function to predict sentiment
def predict_sentiment(text):
    encoding = tokenizer.encode_plus(
        text,  # Text to be tokenized
        max_length=128,  # Maximum length of the tokenized text
        truncation=True,  # Truncate texts longer than max length
        padding='max_length',  # Pad texts shorter than max length
        add_special_tokens=True,  # Add special tokens (e.g., [CLS], [SEP])
        return_attention_mask=True,  # Return attention masks
        return_tensors='tf'  # Return TensorFlow tensors
    )
    input_ids = encoding['input_ids']  # Get input IDs
    attention_mask = encoding['attention_mask']  # Get attention masks
    logits = loaded_model(input_ids, attention_mask=attention_mask).logits  # Get logits from the model (shape: [1, 2])
    return int(np.argmax(logits, axis=1)[0])  # Return predicted label (0 = negative, 1 = positive)

# Test the prediction function
print(predict_sentiment("I loved this movie! It was amazing."))  # Predict sentiment for a positive review
print(predict_sentiment("I hated this movie. It was terrible."))  # Predict sentiment for a negative review

# UI Components

# Define widgets for user input and output
input_text = widgets.Textarea(
    value='I loved this movie! It was amazing.',
    placeholder='Enter text to analyze',
    description='Input:',
    layout=widgets.Layout(width='100%', height='100px')
)

output_text = widgets.Textarea(
    value='',
    placeholder='Predicted sentiment will appear here',
    description='Output:',
    layout=widgets.Layout(width='100%', height='100px'),
    disabled=True
)

analyze_button = widgets.Button(description='Analyze Sentiment')

# Define the button click event handler
def on_analyze_button_clicked(b):
    sentiment = predict_sentiment(input_text.value)
    sentiment_text = "Positive" if sentiment == 1 else "Negative"
    output_text.value = sentiment_text

# Attach the event handler to the button
analyze_button.on_click(on_analyze_button_clicked)

# Display the widgets
display(input_text, analyze_button, output_text)


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
 259/1407 [====>.........................] - ETA: 7:58:40 - loss: 0.4477 - accuracy: 0.7806

# Text Generation using GPT-2

## Theory:

### GPT (Generative Pre-trained Transformer):
GPT is a transformer-based model developed by OpenAI for generating text. Whereas BERT is built to produce representations for understanding tasks, GPT is trained to predict the next token, which lets it continue a prompt with coherent and contextually relevant text.

#### Key Concepts of GPT:

**Autoregressive Model:**
- GPT is an autoregressive model, which means it generates text one token at a time and each token is conditioned on the previously generated tokens.
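
The autoregressive loop can be sketched with a toy stand-in for the model. Here `toy_model` and `TOY_TABLE` are hypothetical and look only at the last token, whereas GPT conditions on all previous tokens. The loop itself (score the next token, append it, repeat) has the same shape, and this sketch always picks the highest-scoring token (greedy decoding).

```python
def generate(prompt, next_token_scores, max_new_tokens=5, eos="<eos>"):
    """Greedy autoregressive decoding: append the best next token, one at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)        # scores given everything so far
        next_token = max(scores, key=scores.get)  # pick the most likely token
        if next_token == eos:
            break
        tokens.append(next_token)                 # feed it back in on the next step
    return tokens

# Hypothetical next-token probabilities, keyed by the last token only
TOY_TABLE = {
    "once": {"upon": 0.9, "a": 0.1},
    "upon": {"a": 0.8, "time": 0.2},
    "a": {"time": 0.7, "upon": 0.3},
    "time": {"<eos>": 0.6, "a": 0.4},
}

def toy_model(tokens):
    return TOY_TABLE[tokens[-1]]

print(generate(["once"], toy_model))  # ['once', 'upon', 'a', 'time']
```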

**Transformer Decoder:**
- GPT uses only the transformer decoder stack, whereas BERT uses the encoder. In the decoder, each position can attend only to earlier positions (masked self-attention), which is what makes left-to-right generation possible.
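
What distinguishes decoder-style attention is a causal mask: position i may look at positions 0 to i but not at later ones. The sketch below builds such a mask and applies it to one row of made-up attention scores. The helper names are assumptions for illustration.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: row i has 1s for positions 0..i and 0s afterwards."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax in which masked-out positions receive exactly zero weight."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    peak = max(masked)  # finite, because each position can always see itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
print(mask)  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]

# Made-up attention scores of each token against all three tokens
scores = [0.5, 0.2, 0.9]
for i in range(3):
    print([round(w, 3) for w in masked_softmax(scores, mask[i])])
```

An encoder such as BERT uses no causal mask, so every row would attend to all three positions.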

**Pre-training and Fine-tuning:**
- GPT is pre-trained on a large corpus of text without labels, using next-token prediction as the training objective. It can then be fine-tuned on specific tasks such as text generation, machine translation, or question answering.

### GPT-2:
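GPT-2 is the successor to the original GPT, released by OpenAI in 2019 and trained on a much larger dataset of web text. Given a prompt, it generates a coherent and contextually relevant continuation. The `gpt2` checkpoint used here is the smallest released variant (about 124 million parameters).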

## Implementation:

### 1. Load and Prepare the GPT-2 Model
We'll use the pre-trained GPT-2 model from Hugging Face's `transformers` library.

### 2. Generate Text
We'll use the model to generate text based on a given prompt.
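
The generation cell in this section passes `temperature`, `top_k`, and `top_p` to `model.generate`. In Hugging Face `transformers`, these settings only take effect when sampling is enabled with `do_sample=True`. The sketch below shows, on a toy distribution, roughly what each setting does. It is a simplified illustration with made-up logits and hypothetical helper names, and library implementations differ in details.

```python
import math
import random

def filter_candidates(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return (token, probability) pairs left after temperature, top-k and top-p."""
    # Temperature: divide logits before the softmax (higher = flatter distribution)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)
    # Top-k: keep only the k most likely tokens, then renormalize
    if top_k > 0:
        ranked = ranked[:top_k]
        kept_total = sum(p for _, p in ranked)
        ranked = [(tok, p / kept_total) for tok, p in ranked]
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

def sample_next(logits, rng=random, **kwargs):
    """Sample one token from the filtered candidates."""
    kept = filter_candidates(logits, **kwargs)
    tokens = [tok for tok, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up next-token logits
toy_logits = {"time": 3.0, "day": 1.0, "night": 0.5, "banana": -2.0}

sharp = filter_candidates(toy_logits, temperature=0.5)
flat = filter_candidates(toy_logits, temperature=2.0)
print(round(sharp[0][1], 3), round(flat[0][1], 3))  # top token probability: sharp vs flat
print([tok for tok, _ in filter_candidates(toy_logits, top_k=2)])
print(sample_next(toy_logits, top_k=2, top_p=0.95))
```

A low temperature concentrates probability on the top token, while a high one spreads it out, which is why large temperature values tend to produce less coherent text.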


In [8]:
# Import necessary libraries
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import ipywidgets as widgets
from IPython.display import display

# Initialize the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load the pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')

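# Function to generate text
def generate_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors='pt')  # Encode the prompt (input IDs and attention mask)
    output = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],  # Pass the mask explicitly to avoid the attention-mask warning
        max_length=max_length,  # Total length in tokens, prompt included
        num_return_sequences=1,
        no_repeat_ngram_size=2,  # Never repeat the same 2-gram
        do_sample=True,  # Required for top_k, top_p, and temperature to take effect
        top_k=50,
        top_p=0.95,
        temperature=1.9,  # Values above 1.0 make the output more random; lower it for more coherent text
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token, so reuse the end-of-text token
    )
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)  # Decode the generated text
    return generated_text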

# Define widgets for user input and output
input_prompt = widgets.Textarea(
    value='Once upon a time',
    placeholder='Enter text prompt',
    description='Prompt:',
    layout=widgets.Layout(width='100%', height='100px')
)

output_text = widgets.Textarea(
    value='',
    placeholder='Generated text will appear here',
    description='Output:',
    layout=widgets.Layout(width='100%', height='200px'),
    disabled=True
)

generate_button = widgets.Button(description='Generate Text')

# Define the button click event handler
def on_generate_button_clicked(b):
    generated = generate_text(input_prompt.value)
    output_text.value = generated

# Attach the event handler to the button
generate_button.on_click(on_generate_button_clicked)

# Display the widgets
display(input_prompt, generate_button, output_text)





# Machine Translation using MarianMT

## Theory:

### MarianMT:
MarianMT is a family of sequence-to-sequence (encoder-decoder) transformer models trained specifically for machine translation. A checkpoint such as `Helsinki-NLP/opus-mt-en-fr` typically covers one translation direction, here English to French.

## Implementation:

### 1. Load and Prepare the MarianMT Model
We'll use the pre-trained MarianMT model from Hugging Face's `transformers` library.

### 2. Translate Text
We'll use the model to translate text from English to French.
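
The translation cell in this section calls `model.generate` with `num_beams=4`, which selects beam search instead of greedy decoding. The sketch below illustrates the idea on a toy example: several partial translations are kept at each step, and the one with the highest total log-probability wins. `TOY_TABLE` and `toy_model` are hypothetical stand-ins that look only at the last token, whereas a real translation model also conditions on the source sentence and the full prefix. Refinements such as length penalties are left out.

```python
import math

def beam_search(next_log_probs, start="<s>", eos="</s>", num_beams=2, max_len=5):
    """Keep the num_beams best partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:
                candidates.append((tokens, score))  # finished: carry over unchanged
                continue
            for tok, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + log_p))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:num_beams]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams[0][0]

# Hypothetical next-token probabilities, keyed by the last token only
TOY_TABLE = {
    "<s>": {"le": 0.6, "un": 0.4},
    "le": {"chat": 0.55, "chien": 0.45},
    "un": {"chat": 0.95, "chien": 0.05},
    "chat": {"</s>": 1.0},
    "chien": {"</s>": 1.0},
}

def toy_model(tokens):
    return {tok: math.log(p) for tok, p in TOY_TABLE[tokens[-1]].items()}

greedy = beam_search(toy_model, num_beams=1)  # follows the locally best token
best = beam_search(toy_model, num_beams=2)    # finds a higher-probability sequence
print(greedy)  # ['<s>', 'le', 'chat', '</s>']
print(best)    # ['<s>', 'un', 'chat', '</s>']
```

With one beam the search commits to "le" because it is the best first token, while two beams keep "un" alive long enough to discover that "un chat" has the higher overall probability (0.38 versus 0.33).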


In [9]:
# Import necessary libraries
from transformers import MarianMTModel, MarianTokenizer
import ipywidgets as widgets
from IPython.display import display

# Define the model name for English to French translation
model_name = 'Helsinki-NLP/opus-mt-en-fr'

# Initialize the MarianMT tokenizer
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Load the pre-trained MarianMT model
model = MarianMTModel.from_pretrained(model_name)

# Function to translate text
def translate_text(text, max_length=100):
    input_ids = tokenizer.encode(text, return_tensors='pt', max_length=max_length, truncation=True)  # Encode the input text
    output = model.generate(input_ids, max_length=max_length, num_beams=4, early_stopping=True)  # Generate translation
    translated_text = tokenizer.decode(output[0], skip_special_tokens=True)  # Decode the translated text
    return translated_text

# Define widgets for user input and output
input_text = widgets.Textarea(
    value='Hello, how are you?',
    placeholder='Enter text to translate',
    description='Input:',
    layout=widgets.Layout(width='100%', height='100px')
)

output_text = widgets.Textarea(
    value='',
    placeholder='Translated text will appear here',
    description='Output:',
    layout=widgets.Layout(width='100%', height='100px'),
    disabled=True
)

translate_button = widgets.Button(description='Translate')

# Define the button click event handler
def on_translate_button_clicked(b):
    translated = translate_text(input_text.value)
    output_text.value = translated

# Attach the event handler to the button
translate_button.on_click(on_translate_button_clicked)

# Display the widgets
display(input_text, translate_button, output_text)

