# Generative AI and Prompt Engineering
## A program by IISc and TalentSprint
### Mini-Project: Text Classification

## Problem Statement

Intent identification using BERT model

## Learning Objectives

At the end of the mini-project, you will be able to :

* Read the intent, questions and responses data
* Load a pre trained BERT model
* Fine-tune the BERT model
* Get the predictions for each question

## Overview

The intent identification problem is framed as a text classification task, where a BERT model is trained to classify intent. Once the model is fine-tuned, a conversation tool is set up. For each user question, the model first predicts the intent, and a response is selected from a predefined set of responses corresponding to the predicted intent as the answer to the input question.

## Dataset

Different classes of intent with a set of questions that fall into each intent and a pool of suitable responses for each intent.

## Grading = 10 Points

In [1]:
# prompt: Create a hidden code cell with @#title Download the Dataset. Data should be downloaded from the following link: https://cdn.exec.talentsprint.com/static/aimlops/c3/spam.csv

#@title Download the Dataset
!wget https://cdn.exec.talentsprint.com/static/aimlops/c3/Intent.json

--2024-09-22 04:23:47--  https://cdn.exec.talentsprint.com/static/aimlops/c3/Intent.json
Resolving cdn.exec.talentsprint.com (cdn.exec.talentsprint.com)... 172.105.52.210
Connecting to cdn.exec.talentsprint.com (cdn.exec.talentsprint.com)|172.105.52.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69866 (68K) [application/json]
Saving to: ‘Intent.json’


2024-09-22 04:23:48 (301 KB/s) - ‘Intent.json’ saved [69866/69866]



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Import Neccesary Packages

In [3]:
# Please feel free to add/remove installations here

# Initial Packages
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import json
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Read the Intent, Questions, and Response Data (1 point)

In [4]:
## Add your code here
def read_data(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)

    intent_data = []
    responses_dict = {}
    for intent in data['intents']:
        for text in intent['text']:
            intent_data.append({
                'intent': intent['intent'],
                'question': text
            })
        responses_dict[intent['intent']] = intent['responses']

    return pd.DataFrame(intent_data), responses_dict

intent_dataframe, responses_dict = read_data('/content/Intent.json')
intent_dataframe.head()

Unnamed: 0,intent,question
0,Greeting,Hi
1,Greeting,Hi there
2,Greeting,Hola
3,Greeting,Hello
4,Greeting,Hello there


In [5]:
responses_dict

{'Greeting': ['Hi human, please tell me your GeniSys user',
  'Hello human, please tell me your GeniSys user',
  'Hola human, please tell me your GeniSys user'],
 'GreetingResponse': ['Great! Hi <HUMAN>! How can I help?',
  'Good! Hi <HUMAN>, how can I help you?',
  'Cool! Hello <HUMAN>, what can I do for you?',
  'OK! Hola <HUMAN>, how can I help you?',
  'OK! hi <HUMAN>, what can I do for you?'],
 'CourtesyGreeting': ['Hello, I am great, how are you? Please tell me your GeniSys user',
  'Hello, how are you? I am great thanks! Please tell me your GeniSys user',
  'Hello, I am good thank you, how are you? Please tell me your GeniSys user',
  'Hi, I am great, how are you? Please tell me your GeniSys user',
  'Hi, how are you? I am great thanks! Please tell me your GeniSys user',
  'Hi, I am good thank you, how are you? Please tell me your GeniSys user',
  'Hi, good thank you, how are you? Please tell me your GeniSys user'],
 'CourtesyGreetingResponse': ['Great! Hi <HUMAN>! How can I hel

In [6]:
intent_dataframe.intent.value_counts()


Unnamed: 0_level_0,count
intent,Unnamed: 1_level_1
GreetingResponse,8
CourtesyGreetingResponse,8
Greeting,7
CourtesyGreeting,7
CurrentHumanQuery,7
RealNameQuery,7
PodBayDoor,7
TimeQuery,7
NotTalking2U,7
Shutup,7


### Tokenize the Questions (1 point)

In [7]:
## Add your code here
def tokenize_questions(texts, tokenizer, max_len):
    return tokenizer(
        texts,
        add_special_tokens=True,
        max_length=max_len,
        truncation=True,
        padding='max_length',
        return_attention_mask=True,
        return_tensors='pt'
    )

In [8]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encodings = tokenize_questions(intent_dataframe['question'].tolist(), tokenizer, max_len=128)

print(encodings)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

{'input_ids': tensor([[ 101, 7632,  102,  ...,    0,    0,    0],
        [ 101, 7632, 2045,  ...,    0,    0,    0],
        [ 101, 7570, 2721,  ...,    0,    0,    0],
        ...,
        [ 101, 2064, 2017,  ...,    0,    0,    0],
        [ 101, 2064, 2017,  ...,    0,    0,    0],
        [ 101, 6011, 2017,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}




### Create the Train Data with Tokenized Questions and Intent Labels (1 point)

In [10]:
## Add your code here
class IntentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

le = LabelEncoder()
intent_dataframe['intent_encoded'] = le.fit_transform(intent_dataframe['intent'])

dataset = IntentDataset(encodings, intent_dataframe['intent_encoded'].tolist())

<__main__.IntentDataset at 0x7dc914ec50c0>

## Preparing the dataset

In [12]:
def prepare_data(data, tokenizer, max_len=128):
    print(f"Shape of input data: {data.shape}")
    print(f"Columns in data: {data.columns}")
    print(f"Sample of data:\n{data.head()}")

    le = LabelEncoder()
    data['intent_encoded'] = le.fit_transform(data['intent'])

    texts = data['question'].tolist()
    labels = data['intent_encoded'].tolist()

    print(f"Number of texts: {len(texts)}")
    print(f"Number of labels: {len(labels)}")

    # Tokenize all texts at once using BERT Tokenizer
    encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_len, return_tensors='pt')

    print(f"Shape of input_ids: {encodings['input_ids'].shape}")
    print(f"Shape of attention_mask: {encodings['attention_mask'].shape}")

    # Convert encodings to lists for easier splitting
    input_ids = encodings['input_ids'].tolist()
    attention_mask = encodings['attention_mask'].tolist()

    # Split the data
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        texts,
        labels,
        test_size=0.2,
        random_state=42
    )

    print(f"Number of training samples: {len(train_texts)}")
    print(f"Number of validation samples: {len(val_texts)}")

    # Tokenize split data
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_len, return_tensors='pt')
    val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=max_len, return_tensors='pt')

    # Create datasets
    train_dataset = IntentDataset(train_encodings, train_labels)
    val_dataset = IntentDataset(val_encodings, val_labels)

    return train_dataset, val_dataset, le

In [13]:
train_dataset, val_dataset, le = prepare_data(intent_dataframe, tokenizer)

Shape of input data: (143, 3)
Columns in data: Index(['intent', 'question', 'intent_encoded'], dtype='object')
Sample of data:
     intent     question  intent_encoded
0  Greeting           Hi               7
1  Greeting     Hi there               7
2  Greeting         Hola               7
3  Greeting        Hello               7
4  Greeting  Hello there               7
Number of texts: 143
Number of labels: 143
Shape of input_ids: torch.Size([143, 11])
Shape of attention_mask: torch.Size([143, 11])
Number of training samples: 114
Number of validation samples: 29


### Load a Pre-Trained BERT Model (1 point)

In [14]:
## Add your code here
def load_bert_model(num_labels):
    return BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

In [15]:
model = load_bert_model(num_labels=len(le.classes_))

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Prepare the Model to Fine-Tune (1 point)

In [17]:
## Add your code here
def prepare_model_for_training(model, train_dataset, val_dataset):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=16)
    optimizer = AdamW(model.parameters(), lr=2e-5)
    return model, train_loader, val_loader, optimizer, device

In [18]:
model, train_loader, val_loader, optimizer, device = prepare_model_for_training(model, train_dataset, val_dataset)



### Train the Model using the Tokenized Questions and Intent Labels (1 point)

In [19]:
## Add your code here
def train_model(model, train_loader, val_loader, optimizer, device, epochs=30):
    best_accuracy = 0
    patience = 3
    no_improve = 0

    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for batch in train_loader:
            optimizer.zero_grad()
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            train_loss += loss.item()
            loss.backward()
            optimizer.step()

        # Validation
        model.eval()
        val_loss = 0
        correct = 0
        total = 0
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
                val_loss += outputs.loss.item()
                _, predicted = torch.max(outputs.logits, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = correct / total
        print(f'Epoch {epoch + 1}/{epochs}')
        print(f'Train Loss: {train_loss/len(train_loader):.4f}')
        print(f'Validation Loss: {val_loss/len(val_loader):.4f}')
        print(f'Validation Accuracy: {accuracy:.4f}')

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            no_improve = 0
            torch.save(model.state_dict(), 'best_model.pth')
        else:
            no_improve += 1

        if no_improve == patience:
            print("Early stopping")
            break

    # Load the best model
    model.load_state_dict(torch.load('best_model.pth'))
    return model

In [20]:
model = train_model(model, train_loader, val_loader, optimizer, device)

  item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


Epoch 1/30
Train Loss: 3.1004
Validation Loss: 3.0784
Validation Accuracy: 0.0000
Epoch 2/30
Train Loss: 2.9177
Validation Loss: 3.0249
Validation Accuracy: 0.1034
Epoch 3/30
Train Loss: 2.8794
Validation Loss: 2.8704
Validation Accuracy: 0.2414
Epoch 4/30
Train Loss: 2.7225
Validation Loss: 2.7569
Validation Accuracy: 0.3448
Epoch 5/30
Train Loss: 2.6358
Validation Loss: 2.6993
Validation Accuracy: 0.3103
Epoch 6/30
Train Loss: 2.5239
Validation Loss: 2.5727
Validation Accuracy: 0.3793
Epoch 7/30
Train Loss: 2.3671
Validation Loss: 2.4402
Validation Accuracy: 0.5172
Epoch 8/30
Train Loss: 2.2086
Validation Loss: 2.3574
Validation Accuracy: 0.4483
Epoch 9/30
Train Loss: 2.1788
Validation Loss: 2.2525
Validation Accuracy: 0.5172
Epoch 10/30
Train Loss: 2.0404
Validation Loss: 2.1652
Validation Accuracy: 0.5517
Epoch 11/30
Train Loss: 1.9111
Validation Loss: 2.0399
Validation Accuracy: 0.5862
Epoch 12/30
Train Loss: 1.7749
Validation Loss: 2.0204
Validation Accuracy: 0.5517
Epoch 13/30
T

  model.load_state_dict(torch.load('best_model.pth'))


## Evaluation of the model

In [21]:
def evaluate_model(model, val_loader, device):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            _, predicted = torch.max(outputs.logits, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = correct / total
    print(f'Model Accuracy: {accuracy:.4f}')
    return accuracy

In [22]:
accuracy = evaluate_model(model, val_loader, device)

Model Accuracy: 0.8621


  item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


### Create a Function to get the Predictions for each Question (1 point)

In [23]:
## Add your code here
def get_prediction(model, tokenizer, text, device):
    model.eval()
    encoding = tokenizer(text, return_tensors='pt', max_length=128, padding='max_length', truncation=True)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        _, prediction = torch.max(outputs.logits, dim=1)

    return prediction.item()

### Create a Function to Choose the Response based on the Intent Prediction (1 point)

In [24]:
## Add your code here
def choose_response(intent, responses_dict):
    if intent in responses_dict:
        return random.choice(responses_dict[intent])
    else:
        return "Kindly reframe your question; I'm not sure how to respond to that"

### Connect the above 2 Functions to take a Question from the User and Respond with Intent and the Answer (2 point)

In [25]:
## Add your code here
def chat_with_bot(model, tokenizer, label_encoder, responses_dict, device):
    while True:
        user_input = input("You: ")
        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("Bot: Goodbye!")
            break

        intent_id = get_prediction(model, tokenizer, user_input, device)
        intent = label_encoder.inverse_transform([intent_id])[0]
        response = choose_response(intent, responses_dict)

        print(f"Bot: (Intent: {intent}) {response}")

## Chat with the bot

In [26]:
print("Chat with the bot (type 'quit', 'exit', or 'bye' to end the conversation):")
chat_with_bot(model, tokenizer, le, responses_dict, device)

Chat with the bot (type 'quit', 'exit', or 'bye' to end the conversation):
You: Hello
Bot: (Intent: Greeting) Hola human, please tell me your GeniSys user
You: I am Adam. 
Bot: (Intent: GreetingResponse) Cool! Hello <HUMAN>, what can I do for you?
You: What do you call name? 
Bot: (Intent: CurrentHumanQuery) Your name is <HUMAN>, how can I help you?
You: What do you call me? 
Bot: (Intent: CurrentHumanQuery) You are <HUMAN>! How can I help?
You: My name is Praveen
Bot: (Intent: GreetingResponse) Cool! Hello <HUMAN>, what can I do for you?
You: I dont want to talk to you
Bot: (Intent: NameQuery) You can call me Geni
You: Hi Geni
Bot: (Intent: Greeting) Hello human, please tell me your GeniSys user
You: Tell me the cricket score
Bot: (Intent: SelfAware) That is an difficult question, can you prove that you are?
You: Tell a funny joke
Bot: (Intent: Jokes) A famous blues musician died. His tombstone bore the inscription, 'Didn't wake up this morning...'
You: quit 
Bot: (Intent: PodBayDoor)