<a href="https://colab.research.google.com/github/ravisankarg/notebooks/blob/main/ask.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import random
import csv

# Placeholder data for different contexts
placeholder_data = {
    "events": ["Music Concert", "Art Exhibition", "Tech Webinar", "Sports Match", "Food Festival"],
    "flights": ["AI-301", "SG-452", "IN-786"],
    "airlines": ["Air India", "IndiGo", "SpiceJet"],
    "stadiums": ["Wankhede Stadium", "Eden Gardens", "Narendra Modi Stadium"],
    "restaurants": ["The Spice House", "Ocean View Diner", "Royal Garden"],
    "hotels": ["Sunrise Hotel", "Lakeview Resort", "Hilltop Inn"],
    "locations": ["Bangalore", "Mumbai", "Delhi", "Chennai", "Pune"],
    "dates": ["2025-09-20", "2025-10-05", "2025-11-15", "2025-12-01"],
    "times": ["6 PM", "10 AM", "3 PM", "7:30 PM"],
    "coupon_codes": ["SAVE20", "FESTIVE50", "WELCOME10"],
    "users": ["John", "Alice", "Ravi", "Meera"],
    "transport_modes": ["bus", "train", "cab", "metro"]
}

# Templates for positive intent (questions)
question_templates = {
    "travel": {
        "what": [
            "What is the baggage allowance for flight {}?",
            "What is the refund policy for flight {}?",
            "What documents do I need for train travel?"
        ],
        "where": [
            "Where can I collect my boarding pass for flight {}?",
            "Where is platform 5 located at {} station?"
        ],
        "when": [
            "When should I arrive at the airport for flight {}?",
            "When does the train depart from {}?"
        ],
        "who": [
            "Who can help me with flight changes?",
            "Who is in charge of platform announcements?"
        ]
    },
    "sports_entertainment": {
        "what": [
            "What are the seating arrangements at stadium {}?",
            "What should I bring to the match?"
        ],
        "where": [
            "Where is the parking near stadium {}?",
            "Where can I collect tickets for {}?"
        ],
        "when": [
            "When will match tickets be available?",
            "When does the event {} start?"
        ],
        "who": [
            "Who is playing in the match?",
            "Who is the chief guest at the event?"
        ]
    },
    "hotel_restaurant": {
        "what": [
            "What are the check-in rules at hotel {}?",
            "What meal options are included in my package?"
        ],
        "where": [
            "Where is hotel {} located?",
            "Where should I park my vehicle near the restaurant {}?"
        ],
        "when": [
            "When should I arrive at hotel {}?",
            "When does the restaurant open for dinner?"
        ],
        "who": [
            "Who do I contact for room service?",
            "Who manages bookings at hotel {}?"
        ]
    },
    "coupon_discount": {
        "what": [
            "What discount does coupon {} offer?",
            "What are the terms and conditions for coupon {}?"
        ],
        "where": [
            "Where can I redeem coupon {}?",
            "Where do I enter the promo code when booking?"
        ],
        "when": [
            "When is the last date to redeem coupon {}?",
            "When will the offer expire?"
        ],
        "who": [
            "Who is eligible for the discount coupon {}?",
            "Who can help me apply the coupon?"
        ]
    },
    "event_webinar": {
        "what": [
            "What is the topic of the webinar {}?",
            "What platform will the session be streamed on?"
        ],
        "where": [
            "Where can I access the online event?",
            "Where is the venue for the seminar?"
        ],
        "when": [
            "When does the webinar start?",
            "When can I download the event materials?"
        ],
        "who": [
            "Who is the keynote speaker at the webinar?",
            "Who will moderate the panel discussion?"
        ]
    }
}

# Templates for negative intent (statements)
negative_templates = {
    "travel": [
        "Your flight {} is confirmed for {}.",
        "Check-in starts 3 hours before departure."
    ],
    "sports_entertainment": [
        "Your concert tickets for {} are booked.",
        "The event will be live-streamed online."
    ],
    "hotel_restaurant": [
        "Your hotel reservation at {} is confirmed.",
        "Dinner is included in your package."
    ],
    "coupon_discount": [
        "The coupon {} offers a 20% discount on selected items.",
        "Coupons cannot be combined with other offers."
    ],
    "event_webinar": [
        "A confirmation email has been sent to you.",
        "A reminder will be sent before the event starts."
    ]
}

def random_choice(key):
    return random.choice(placeholder_data[key])

def fill_template(template, placeholders):
    return template.format(*placeholders)

def generate_question():
    context = random.choice(list(question_templates.keys()))
    qtype = random.choice(list(question_templates[context].keys()))
    template = random.choice(question_templates[context][qtype])

    # Pick placeholders depending on template structure
    placeholders = []
    count_placeholders = template.count("{}")
    for _ in range(count_placeholders):
        # The source list is picked at random, so the value may not match the slot
        # (for example, a date can end up after the word "hotel")
        source = random.choice(["events", "flights", "stadiums", "restaurants", "hotels", "locations", "dates", "times", "coupon_codes", "users"])
        placeholders.append(random_choice(source))
    return fill_template(template, placeholders), "ask_info"

def generate_non_question():
    context = random.choice(list(negative_templates.keys()))
    template = random.choice(negative_templates[context])
    placeholders = []
    count_placeholders = template.count("{}")
    for _ in range(count_placeholders):
        source = random.choice(["events", "flights", "stadiums", "restaurants", "hotels", "locations", "dates", "times", "coupon_codes", "users"])
        placeholders.append(random_choice(source))
    return fill_template(template, placeholders), "not_ask_info"

def generate_dataset(num_samples=2000):
    data = []
    for _ in range(num_samples):
        if random.random() < 0.6:  # 60% questions
            text, intent = generate_question()
        else:  # 40% statements
            text, intent = generate_non_question()
        data.append({"text": text, "intent": intent})
    return data

def save_to_csv(data, filename="intent_dataset.csv"):
    with open(filename, mode="w", newline='', encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["text", "intent"])
        writer.writeheader()
        writer.writerows(data)
    print(f"Dataset saved to {filename}")

if __name__ == "__main__":
    dataset = generate_dataset(num_samples=3000)
    save_to_csv(dataset)


Dataset saved to intent_dataset.csv
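
Because each `{}` is filled from a randomly chosen list, the generated rows can read oddly (the sample output further down includes "hotel 2025-12-01"). One way to keep slots consistent is to pick the list from the word that precedes the placeholder. The sketch below is only an illustration: `SLOT_SOURCES`, `FALLBACK`, and `fill_by_keyword` are hypothetical names that are not part of the notebook, and the keyword mapping is an assumption about what each slot is meant to hold.

```python
import random

# Hypothetical mapping from the word right before "{}" to a value list.
# The values are copied from placeholder_data in the cell above.
SLOT_SOURCES = {
    "flight": ["AI-301", "SG-452", "IN-786"],
    "hotel": ["Sunrise Hotel", "Lakeview Resort", "Hilltop Inn"],
    "stadium": ["Wankhede Stadium", "Eden Gardens", "Narendra Modi Stadium"],
    "coupon": ["SAVE20", "FESTIVE50", "WELCOME10"],
}
# Assumed fallback when the preceding word is not a known keyword.
FALLBACK = ["Bangalore", "Mumbai", "Delhi", "Chennai", "Pune"]

def fill_by_keyword(template, rng=random):
    """Fill each "{}" using the word before it to choose the value list."""
    parts = template.split("{}")
    out = [parts[0]]
    for before, after in zip(parts, parts[1:]):
        words = before.split()
        keyword = words[-1].lower() if words else ""
        out.append(rng.choice(SLOT_SOURCES.get(keyword, FALLBACK)))
        out.append(after)
    return "".join(out)

print(fill_by_keyword("What is the baggage allowance for flight {}?"))
```

Templates without a placeholder pass through unchanged, so the function could stand in for `fill_template` without touching the rest of the generator.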


# Task
Train a text classifier using the "intent_dataset.csv" file and the "all-MiniLM-L12-v2" model.

## Install required libraries

### Subtask:
Install the `transformers` and `scikit-learn` libraries.


**Reasoning**:
The subtask is to install the required libraries. I will use pip to install `transformers` and `scikit-learn`.



In [2]:
%pip install transformers scikit-learn



## Load the dataset

### Subtask:
Load the `intent_dataset.csv` file into a pandas DataFrame.


**Reasoning**:
Load the data from the CSV file into a pandas DataFrame and display its head and info.



In [3]:
import pandas as pd

df = pd.read_csv('intent_dataset.csv')
display(df.head())
df.info()  # info() prints its report and returns None, so it needs no display()

Unnamed: 0,text,intent
0,What should I bring to the match?,ask_info
1,Your flight Delhi is confirmed for Wankhede St...,not_ask_info
2,Who manages bookings at hotel 2025-12-01?,ask_info
3,Check-in starts 3 hours before departure.,not_ask_info
4,What are the check-in rules at hotel Narendra ...,ask_info


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    4000 non-null   object
 1   intent  4000 non-null   object
dtypes: object(2)
memory usage: 62.6+ KB

## Data preprocessing

### Subtask:
Split the data into training and testing sets, tokenize the text data using the all-MiniLM-L12-v2 tokenizer, and encode the labels.


**Reasoning**:
Split the data, tokenize the text, and encode the labels as per the instructions.



**Reasoning**:
The previous command failed because the model name "all-MiniLM-L12-v2" was not found on Hugging Face under that bare name. I will use the full repository name "sentence-transformers/all-MiniLM-L12-v2" instead.



In [5]:
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from sklearn.preprocessing import LabelEncoder

# 1. Split the data into training and testing sets
X = df['text']
y = df['intent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Instantiate a tokenizer with the correct model name
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L12-v2")

# 3. Tokenize the text data
max_len = 128 # Define a max length, adjust as needed
X_train_tokenized = tokenizer(list(X_train), padding=True, truncation=True, max_length=max_len, return_tensors="pt")
X_test_tokenized = tokenizer(list(X_test), padding=True, truncation=True, max_length=max_len, return_tensors="pt")

# 4. Encode the target labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print("Data splitting, tokenization, and label encoding complete.")

Data splitting, tokenization, and label encoding complete.
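
To read the model's integer predictions later, it helps to know which id each intent received. `LabelEncoder` stores its classes in sorted order, so the mapping can be reproduced by hand. The sketch below mirrors that behaviour in plain Python; `fit_label_ids` is a hypothetical helper, not a scikit-learn function.

```python
def fit_label_ids(labels):
    """Assign integer ids to labels in sorted order, as LabelEncoder does."""
    classes = sorted(set(labels))
    return {name: idx for idx, name in enumerate(classes)}

mapping = fit_label_ids(["not_ask_info", "ask_info", "ask_info"])
print(mapping)  # {'ask_info': 0, 'not_ask_info': 1}
```

With the notebook's objects, the same information should be available from `label_encoder.classes_`, where the position of each name is its id.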


## Model loading

### Subtask:
Load the pre-trained all-MiniLM-L12-v2 model.


**Reasoning**:
Load the pre-trained all-MiniLM-L12-v2 model for sequence classification, setting the size of the classification head to the number of unique labels.



In [6]:
from transformers import AutoModelForSequenceClassification

# Get the number of unique labels from the label_encoder
num_labels = len(label_encoder.classes_)

# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained("sentence-transformers/all-MiniLM-L12-v2", num_labels=num_labels)

print("Pre-trained model loaded successfully.")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-MiniLM-L12-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Pre-trained model loaded successfully.


## Model training

### Subtask:
Fine-tune the pre-trained model on your dataset.


**Reasoning**:
Fine-tune the pre-trained model by defining training arguments, creating a custom dataset, instantiating a Trainer, and starting the training process as per the instructions.



**Reasoning**:
The traceback indicates that the `TrainingArguments` constructor received an unexpected keyword argument 'evaluation_strategy'. This points to a version incompatibility: recent `transformers` releases renamed the argument to `eval_strategy`. I will remove `evaluation_strategy` and try again, since evaluation during training is not strictly required for the training to proceed.
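
An alternative to dropping the argument is to check which spelling the installed version accepts before building the arguments. The sketch below uses only the standard library; `first_supported_kwarg` is a hypothetical helper, and the two stub functions merely stand in for older and newer constructor signatures.

```python
import inspect

def first_supported_kwarg(callable_obj, candidates):
    """Return the first name in candidates that callable_obj accepts, else None."""
    params = inspect.signature(callable_obj).parameters
    for name in candidates:
        if name in params:
            return name
    return None

# Stand-ins for an older and a newer constructor signature (illustration only).
def old_style(output_dir, evaluation_strategy="no"): ...
def new_style(output_dir, eval_strategy="no"): ...

names = ["eval_strategy", "evaluation_strategy"]
print(first_supported_kwarg(old_style, names))  # evaluation_strategy
print(first_supported_kwarg(new_style, names))  # eval_strategy

# Assumed usage with the real class (not executed here):
# from transformers import TrainingArguments
# key = first_supported_kwarg(TrainingArguments, names)
# extra = {key: "epoch"} if key else {}
# training_args = TrainingArguments(output_dir="./results", **extra)
```

Pinning the `transformers` version in the install cell would avoid the problem as well, at the cost of missing later fixes.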



In [8]:
import torch
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments

# Create a custom PyTorch dataset
class IntentDataset(Dataset):
    def __init__(self, tokenized_texts, labels):
        self.tokenized_texts = tokenized_texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.tokenized_texts.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

# Instantiate datasets
train_dataset = IntentDataset(X_train_tokenized, y_train_encoded)
test_dataset = IntentDataset(X_test_tokenized, y_test_encoded)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

# Instantiate Trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset            # evaluation dataset
)

# Start training
trainer.train()


Step,Training Loss
10,0.6977
20,0.6966
30,0.6924
40,0.6883
50,0.684
60,0.6793
70,0.665
80,0.6339
90,0.5653
100,0.4578


TrainOutput(global_step=600, training_loss=0.13360678139453133, metrics={'train_runtime': 245.7028, 'train_samples_per_second': 39.072, 'train_steps_per_second': 2.442, 'total_flos': 24702273792000.0, 'train_loss': 0.13360678139453133, 'epoch': 3.0})
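
The reported `global_step=600` can be checked by hand from the dataset size and the training arguments. The arithmetic below assumes a single device and no gradient accumulation, which is what the step count suggests; it also shows how much of the run the 500 warmup steps cover.

```python
import math

num_rows = 4000        # rows reported by df.info()
test_fraction = 0.2    # test_size passed to train_test_split
batch_size = 16        # per_device_train_batch_size
epochs = 3             # num_train_epochs
warmup_steps = 500     # warmup_steps

test_rows = round(num_rows * test_fraction)
train_rows = num_rows - test_rows
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(train_rows, steps_per_epoch, total_steps)  # 3200 200 600
print(round(warmup_steps / total_steps, 2))      # 0.83
```

Roughly 83% of the optimizer steps therefore fall inside the warmup phase, so a smaller `warmup_steps` value may be worth trying on a dataset of this size.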

## Model evaluation

### Subtask:
Evaluate the trained model's performance on the test set.


**Reasoning**:
Evaluate the trained model on the test dataset using the trainer object.



In [9]:
# Evaluate the model
eval_results = trainer.evaluate()

# Print the evaluation results
print(eval_results)

{'eval_loss': 0.0011684903874993324, 'eval_runtime': 0.5422, 'eval_samples_per_second': 1475.414, 'eval_steps_per_second': 23.975, 'epoch': 3.0}


## Summary:

### Data Analysis Key Findings

*   The dataset was successfully loaded and contains 4000 entries with 'text' and 'intent' columns. Both columns have no missing values and are of object data type.
*   The data was split into training and testing sets, with 80% for training and 20% for testing.
*   The text data was successfully tokenized using the "sentence-transformers/all-MiniLM-L12-v2" tokenizer with a maximum length of 128.
*   The intent labels were successfully encoded into numerical format.
*   The pre-trained "sentence-transformers/all-MiniLM-L12-v2" model was loaded for sequence classification, with the classifier layer weights newly initialized.
*   The model was successfully fine-tuned on the dataset for 3 epochs.
*   The evaluation on the test set resulted in a very low evaluation loss of approximately 0.00117.
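
One possible contributor to such a low loss is overlap between the splits: the generator draws from a small set of templates, several of which have no placeholder, so identical sentences can land in both training and test data. The sketch below measures that overlap; `split_overlap` is a hypothetical helper and the example sentences are only for illustration.

```python
def split_overlap(train_texts, test_texts):
    """Fraction of test texts that also appear verbatim in the training texts."""
    if not test_texts:
        return 0.0
    train_set = set(train_texts)
    repeated = sum(1 for text in test_texts if text in train_set)
    return repeated / len(test_texts)

train = ["When does the webinar start?", "Check-in starts 3 hours before departure."]
test = ["When does the webinar start?", "Where is my package?"]
print(split_overlap(train, test))  # 0.5

# Assumed usage with the notebook's split:
# split_overlap(list(X_train), list(X_test))
```

If the fraction is high, deduplicating the texts before splitting would likely give a more honest estimate of how the model handles unseen sentences.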

### Insights or Next Steps

*   While the evaluation loss is very low, it is recommended to calculate and report additional classification metrics such as precision, recall, and F1-score to get a more comprehensive understanding of the model's performance across different classes.
*   The model can be saved for future use and deployment to classify new text data based on the trained intents.
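
The per-class metrics suggested above can be sketched in plain Python. This is a minimal illustration with made-up labels rather than the notebook's results; `per_class_prf` is a hypothetical helper, and scikit-learn's `classification_report` offers an equivalent ready-made summary.

```python
def per_class_prf(y_true, y_pred, labels):
    """Compute precision, recall and F1 for each label."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Made-up example labels (0 = ask_info, 1 = not_ask_info).
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
report = per_class_prf(y_true, y_pred, labels=[0, 1])
print(report[0])  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}

# Assumed usage with the notebook's objects:
# preds = trainer.predict(test_dataset).predictions.argmax(axis=1)
# per_class_prf(list(y_test_encoded), list(preds), labels=[0, 1])
```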


## Display test set examples

### Subtask:
Show a few examples from the test split.

**Reasoning**:
Display the head of the `X_test` and `y_test` Series to show examples from the test split.

In [10]:
display(X_test.head())
display(y_test.head())

Unnamed: 0,text
555,Who can help me apply the coupon?
3491,How can I update my payment method?
527,When does the webinar start?
3925,Where is my package?
2989,The event will be live-streamed online.


Unnamed: 0,intent
555,ask_info
3491,not_ask_info
527,ask_info
3925,not_ask_info
2989,not_ask_info


# Task
Train a classifier using the "intent_dataset.csv" file, show examples from the test split, and create a Gradio interface to test the model with custom input.

## Install gradio

### Subtask:
Install the Gradio library.


**Reasoning**:
Install the gradio library using pip.



In [11]:
%pip install gradio



## Define prediction function

### Subtask:
Create a Python function that takes text as input, tokenizes it, makes a prediction using the trained model, and returns the predicted intent.


**Reasoning**:
Define a function to predict the intent of a given text using the trained model and tokenizer.



In [12]:
import torch
import torch.nn.functional as F

def predict_intent(text):
    """
    Predicts the intent of a given text using the trained model.

    Args:
        text (str): The input text to classify.

    Returns:
        str: The predicted intent label.
    """
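    # Tokenize the input text and move the tensors to the model's device
    # (the Trainer may have moved the model to a GPU during training)
    tokenized_text = tokenizer(text, padding=True, truncation=True, max_length=max_len, return_tensors="pt").to(model.device)

    # Make a prediction with the trained model in evaluation mode (dropout disabled)
    model.eval()
    with torch.no_grad():
        outputs = model(**tokenized_text)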

    # Apply softmax to get probabilities
    logits = outputs.logits
    probabilities = F.softmax(logits, dim=1)

    # Get the predicted class index
    predicted_class_index = torch.argmax(probabilities, dim=1).item()

    # Map the index back to the original label
    predicted_label = label_encoder.classes_[predicted_class_index]

    return predicted_label

print("Predict function defined.")

Predict function defined.
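
The softmax step in `predict_intent` does not change which class wins, but it does give a confidence value that could be shown next to the label. The sketch below reproduces the calculation in plain Python with made-up logits; the class order is assumed to be the sorted order stored by the label encoder.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["ask_info", "not_ask_info"]  # assumed encoder order
logits = [2.0, -1.0]                    # made-up model outputs

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(classes[best], round(probs[best], 3))  # ask_info 0.953
```

Returning the probability alongside the label makes it easier to spot inputs the model is unsure about.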


## Create gradio interface

### Subtask:
Build a Gradio interface with a text input field and a text output field, linking it to the prediction function.


**Reasoning**:
`gr.Interface` wraps `predict_intent`, so the text typed into the input box is passed to the function and the returned label is shown in the output box.



In [13]:
import gradio as gr

# Build the Gradio interface
interface = gr.Interface(
    fn=predict_intent,
    inputs=gr.Textbox(label="Enter text here"),
    outputs=gr.Textbox(label="Predicted Intent"),
    title="Intent Classifier",
    description="Enter a sentence to predict its intent (ask_info or not_ask_info)."
)

# Launch the interface (this will run until manually stopped)
# interface.launch() # Uncomment this line to launch the interface
print("Gradio interface created.")

Gradio interface created.


## Launch gradio app

### Subtask:
Launch the Gradio application to interact with the model.


**Reasoning**:
Launch the Gradio interface using the .launch() method of the interface object.



In [None]:
# Launch the interface (this will run until manually stopped)
interface.launch()

Rerunning server... use `close()` to stop if you need to change `launch()` parameters.
----
It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://e91375ccdc89b76ecb.gradio.live

