In [None]:
from google.colab import files
uploaded = files.upload()

Saving faq_dataset.csv to faq_dataset.csv


Once the file is uploaded, you can load it into a pandas DataFrame like this:

In [None]:
import pandas as pd
file_name = list(uploaded.keys())[0]
df = pd.read_csv(file_name)
display(df.head())

Unnamed: 0,Question,Answer
0,What are the admission requirements?,Admission requirements vary by program. Please...
1,How can I apply for admission?,You can apply online through our admissions po...
2,What is the application deadline?,The deadline for all undergraduate application...
3,How much is the admission fee?,The one-time admission processing fee is ₹5000...
4,Where can I check my application status?,You can track your application status by loggi...


# Task
Create a chatbot that answers student queries based on a provided FAQ dataset. The chatbot should preprocess the dataset and user queries, use TF-IDF for vectorization, and cosine similarity for matching to return the best answer. The solution should include the code for preprocessing, vectorization, similarity calculation, and a simple interactive chatbot interface.

## Preprocess the FAQ dataset

### Subtask:
Clean and prepare the 'Question' column of the dataset for analysis by performing tokenization, removing stopwords, and lemmatization.


**Reasoning**:
Import necessary libraries and download nltk resources.



In [None]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

**Reasoning**:
Define a preprocessing function and apply it to the 'Question' column. `nltk.word_tokenize` also needs the 'punkt_tab' resource, so download it before applying the function.



In [None]:
nltk.download('punkt_tab')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

df['Preprocessed_Question'] = df['Question'].apply(preprocess_text)
display(df[['Question', 'Preprocessed_Question']].head())

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Unnamed: 0,Question,Preprocessed_Question
0,What are the admission requirements?,admission requirement
1,How can I apply for admission?,apply admission
2,What is the application deadline?,application deadline
3,How much is the admission fee?,much admission fee
4,Where can I check my application status?,check application status
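The effect of the cleaning steps can be seen in isolation with a small standard-library sketch. It is not the notebook's function: whitespace splitting stands in for `nltk.word_tokenize`, lemmatization is omitted, and `DEMO_STOPWORDS` is a hypothetical, deliberately tiny stopword set used only for illustration.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = {"what", "are", "the", "is", "how", "can", "i", "for", "my", "where"}

def simple_preprocess(text):
    """Lowercase, strip non-letters, split on whitespace, drop stopwords."""
    text = text.lower()
    # Keeps only a-z and whitespace: digits, punctuation, and symbols such as ₹ are removed.
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = text.split()  # stands in for nltk.word_tokenize
    return " ".join(tok for tok in tokens if tok not in DEMO_STOPWORDS)

print(simple_preprocess("What are the admission requirements?"))  # admission requirements
print(simple_preprocess("Fee is ₹5000!"))                          # fee
```

Note that the regular expression also discards numbers, so amounts and dates in a question never reach the matching step.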


## Vectorize the preprocessed questions

### Subtask:
Convert the preprocessed questions into numerical representations using a technique like TF-IDF.


**Reasoning**:
Import the `TfidfVectorizer` and instantiate it, then fit and transform the preprocessed questions.



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Preprocessed_Question'])
print(tfidf_matrix.shape)

(23, 42)
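The shape `(23, 42)` means 23 FAQ questions described over a vocabulary of 42 terms. To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF with the smoothed idf formula that scikit-learn documents for its default settings, followed by L2 normalization. The helper name `tfidf_vectors` is hypothetical, and whitespace tokenization is a simplification of what `TfidfVectorizer` does.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

docs = ["admission requirement", "apply admission", "application deadline"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(w, 3) for w in vectors[0]])
```

In the first document, `requirement` receives a larger weight than `admission` because `admission` appears in two of the three documents.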


## Implement a function to preprocess user queries

### Subtask:
Create a function that takes a user's query as input and performs the same preprocessing steps as applied to the FAQ dataset.


**Reasoning**:
Define the preprocess_query function that applies the same preprocessing steps as used for the 'Preprocessed_Question' column to the input query string.



In [None]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_query(query):
    query = query.lower()
    query = re.sub(r'[^a-z\s]', '', query)
    tokens = nltk.word_tokenize(query)
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

# Example usage:
# preprocessed_user_query = preprocess_query("What is the deadline for applications?")
# print(preprocessed_user_query)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Implement a function to find the best matching answer

### Subtask:
This function will take a preprocessed user query, calculate its similarity with the vectorized FAQ questions, and return the answer corresponding to the most similar question.


**Reasoning**:
Define the `get_best_answer` function to calculate cosine similarity between the user query and the TF-IDF matrix, find the best match, and return the corresponding answer.



In [None]:
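from sklearn.metrics.pairwise import cosine_similarity

def get_best_answer(preprocessed_query, tfidf_vectorizer):
    # Project the query into the TF-IDF space fitted on the FAQ questions.
    query_vector = tfidf_vectorizer.transform([preprocessed_query])
    # One row of scores: similarity of the query to every FAQ question.
    similarity_scores = cosine_similarity(query_vector, tfidf_matrix)
    # argmax returns a row position, so index with iloc.
    # If no query word is in the vocabulary, every score is 0 and argmax returns 0.
    best_match_index = similarity_scores.argmax()
    best_answer = df.iloc[best_match_index]['Answer']
    return best_answer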

## Create a simple chatbot interface

### Subtask:
Implement a loop to interact with the user, taking their input queries and providing the best matching answers using the functions created in the previous steps.


**Reasoning**:
Implement an interactive loop to take user queries, preprocess them, find the best matching answer, and print the answer to the user, with an option to quit.



In [None]:
print("Hello! I am a chatbot here to answer your questions about the university. Type 'quit' to exit.")

while True:
    user_query = input("You: ")
    if user_query.lower() == 'quit':
        print("Chatbot: Goodbye!")
        break
    else:
        preprocessed_user_query = preprocess_query(user_query)
        best_answer = get_best_answer(preprocessed_user_query, tfidf_vectorizer)
        print(f"Chatbot: {best_answer}")

Hello! I am a chatbot here to answer your questions about the university. Type 'quit' to exit.
You: hi
Chatbot: Admission requirements vary by program. Please visit the admissions page at university.edu/admissions for detailed criteria for each course.
You: fees
Chatbot: Hostel fees are ₹80,000 per year, which includes lodging and meals. This can be paid in two installments.
You: course
Chatbot: Course registration is done online through the student portal. The registration period is from August 20th to August 28th.
You: bca
Chatbot: Admission requirements vary by program. Please visit the admissions page at university.edu/admissions for detailed criteria for each course.


KeyboardInterrupt: Interrupted by user
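In the session above, `hi` and `bca` both return the first FAQ answer. A likely explanation is that neither word occurs in the fitted vocabulary, so the query vector is all zeros, every cosine similarity is 0, and `argmax` returns index 0. One way to avoid answering in that case is a minimum-score check. The sketch below is an illustration, not part of the notebook: `pick_answer`, the `min_score` value of 0.2, and the fallback message are all assumptions to be tuned.

```python
def pick_answer(scores, answers, min_score=0.2,
                fallback="Sorry, I could not find a matching FAQ entry."):
    """Return the best-scoring answer, or a fallback when the best score is too low."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best_index] < min_score:
        return fallback
    return answers[best_index]

answers = ["Admission requirements vary by program.", "Hostel fees are paid yearly."]
print(pick_answer([0.0, 0.0], answers))  # no overlap with the FAQ: fallback message
print(pick_answer([0.1, 0.7], answers))  # clear match: second answer
```

With the notebook's objects, `scores` would correspond to the single row returned by `cosine_similarity` and `answers` to the `Answer` column.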

### Subtask:
Run the Streamlit application in Colab.

**Reasoning**:
Start the Streamlit app in the background with `streamlit run` and expose port 8501 through localtunnel, which prints a public URL for the app.

In [None]:
!streamlit run chatbot_app.py & npx localtunnel --port 8501

/bin/bash: line 1: streamlit: command not found
Need to install the following packages:
localtunnel@2.0.2
Ok to proceed? (y) y

your url is: https://rich-rings-show.loca.lt
/root/.npm/_npx/75ac80b86e83d4a2/node_modules/localtunnel/bin/lt.js:81
    throw err;
    ^

Error: connection refused: localtunnel.me:31849 (check your firewall settings)
    at Socket.<anonymous> (/root/.npm/_npx/75ac80b86e83d4a2

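### Subtask:
Re-run the Streamlit application in Colab.

**Reasoning**:
The first attempt failed with `streamlit: command not found`. With `streamlit` installed (`%pip install streamlit`), run the same command again: `streamlit run` starts the app and localtunnel prints a public URL for port 8501.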

In [None]:
!streamlit run chatbot_app.py & npx localtunnel --port 8501

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://35.229.160.200:8501

your url is: https://old-swans-behave.loca.lt
2025-10-27 15:09:18.467179: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1761577758.525296   11261 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1761577758.543664   11261 cuda_blas.cc

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Deploy on Streamlit

### Subtask:
Create the Streamlit application file.

**Reasoning**:
Create a Python file for the Streamlit app and add the necessary imports and basic structure, including the chatbot logic.

In [None]:
%%writefile chatbot_app.py
import streamlit as st
import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import BertModel, BertTokenizer
import torch
from sklearn.metrics.pairwise import cosine_similarity

# --- NLTK resources ---
# nltk.data.find raises LookupError when a resource is missing.
for resource_path, package in [
    ('tokenizers/punkt', 'punkt'),
    ('tokenizers/punkt_tab', 'punkt_tab'),
    ('corpora/stopwords', 'stopwords'),
    ('corpora/wordnet', 'wordnet'),
]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package)

# Streamlit runs this script in its own process, so variables from the Colab
# session (df, tokenizer, model) are not visible here. Everything is loaded below.
# Assumption: the FAQ file uploaded earlier is in the working directory.
FAQ_CSV_PATH = 'faq_dataset.csv'


# --- Chatbot Logic Functions ---

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

def preprocess_query(query):
    # Queries go through exactly the same steps as the FAQ questions.
    return preprocess_text(query)

def get_bert_embeddings(texts, tokenizer, model):
    # Handle single text input
    if isinstance(texts, str):
        texts = [texts]
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the embeddings of the [CLS] token as sentence representation
    return outputs.last_hidden_state[:, 0, :].numpy()

def find_best_answer_bert(user_query_embedding, faq_embeddings, df):
    similarity_scores = cosine_similarity(user_query_embedding.reshape(1, -1), faq_embeddings)
    # argmax returns a row position, so index with iloc.
    best_match_index = similarity_scores.argmax()
    best_answer = df.iloc[best_match_index]['Answer']
    return best_answer

@st.cache_resource
def load_resources(csv_path):
    # Cached so the model and the FAQ embeddings are built once, not on every rerun.
    df = pd.read_csv(csv_path)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()
    preprocessed_questions = df['Question'].apply(preprocess_text).tolist()
    faq_embeddings = get_bert_embeddings(preprocessed_questions, tokenizer, model)
    return df, tokenizer, model, faq_embeddings

# --- Streamlit App Interface ---

st.title("University FAQ Chatbot")

st.write("Ask me a question about the university.")

user_input = st.text_input("Your question:")

if user_input:
    try:
        df, tokenizer, model, faq_embeddings = load_resources(FAQ_CSV_PATH)
        preprocessed_user_query = preprocess_query(user_input)
        user_query_embedding = get_bert_embeddings(preprocessed_user_query, tokenizer, model)
        best_answer = find_best_answer_bert(user_query_embedding, faq_embeddings, df)
        st.write("Chatbot:", best_answer)
    except Exception as e:
        st.error(f"An error occurred: {e}")

Overwriting chatbot_app.py


In [None]:
%pip install streamlit

Collecting streamlit
  Downloading streamlit-1.50.0-py3-none-any.whl.metadata (9.5 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading streamlit-1.50.0-py3-none-any.whl (10.1 MB)
Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Installing collected packages: pydeck, streamlit
Successfully installed pydeck-0.9.1 streamlit-1.50.0


# Task
Create a chatbot that answers student queries based on a provided FAQ dataset. The chatbot should preprocess the FAQ data and user queries, use BERT embeddings for semantic matching, and return the best matching answer. The chatbot should also include a simple interactive interface.

## Preprocess the FAQ dataset

### Subtask:
Clean and prepare the 'Question' column of the dataset for analysis by performing tokenization, removing stopwords, and lemmatization.


**Reasoning**:
Define a preprocessing function and apply it to the 'Question' column.



In [None]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

df['Preprocessed_Question'] = df['Question'].apply(preprocess_text)
display(df[['Question', 'Preprocessed_Question']].head())

Unnamed: 0,Question,Preprocessed_Question
0,What are the admission requirements?,admission requirement
1,How can I apply for admission?,apply admission
2,What is the application deadline?,application deadline
3,How much is the admission fee?,much admission fee
4,Where can I check my application status?,check application status


## Generate BERT embeddings for the preprocessed questions

### Subtask:
Use a pre-trained BERT model to create numerical representations of the preprocessed questions.


**Reasoning**:
Install the `transformers` library to use BERT models.



In [None]:
%pip install transformers



**Reasoning**:
Import necessary classes from `transformers`, load a pre-trained BERT model and tokenizer, define a function to generate embeddings, and apply it to the preprocessed questions.



In [None]:
from transformers import BertModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embeddings(texts, tokenizer, model):
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the embeddings of the [CLS] token as sentence representation
    return outputs.last_hidden_state[:, 0, :].numpy()

df['BERT_Embeddings'] = get_bert_embeddings(df['Preprocessed_Question'].tolist(), tokenizer, model).tolist()
display(df[['Question', 'Preprocessed_Question', 'BERT_Embeddings']].head())

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Unnamed: 0,Question,Preprocessed_Question,BERT_Embeddings
0,What are the admission requirements?,admission requirement,"[-0.07260395586490631, 0.14130949974060059, -0..."
1,How can I apply for admission?,apply admission,"[0.07762237638235092, 0.42472389340400696, -0...."
2,What is the application deadline?,application deadline,"[-0.2308618128299713, 0.042568549513816833, -0..."
3,How much is the admission fee?,much admission fee,"[-0.5232292413711548, 0.2320321798324585, -0.2..."
4,Where can I check my application status?,check application status,"[0.027905788272619247, 0.28438663482666016, -0..."
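The cell above uses the `[CLS]` vector as the sentence representation. An alternative that is often used for sentence similarity is mean pooling: averaging the token vectors while ignoring padding positions. Whether it improves matching on this dataset is untested here. The sketch below shows the operation on plain Python lists; the helper name `mean_pool` is hypothetical, and with the tensors above the inputs would correspond to `outputs.last_hidden_state` and `inputs['attention_mask']`.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy sequence: three real tokens followed by one padding position.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]
print(mean_pool(token_vectors, attention_mask))  # [2.0, 2.0]
```

The padded position `[9.0, 9.0]` does not influence the result because its mask entry is 0.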


## Implement a function to preprocess user queries

### Subtask:
Create a function that takes a user's query as input and performs the same preprocessing steps as applied to the FAQ dataset.


**Reasoning**:
Define the preprocess_query function that applies the same preprocessing steps as used for the 'Preprocessed_Question' column to the input query string.



In [None]:
def preprocess_query(query):
    query = query.lower()
    query = re.sub(r'[^a-z\s]', '', query)
    tokens = nltk.word_tokenize(query)
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

# Example usage:
# preprocessed_user_query = preprocess_query("What is the deadline for applications?")
# print(preprocessed_user_query)

## Generate BERT embedding for user queries

### Subtask:
Use the same pre-trained BERT model to create a numerical representation of the preprocessed user query.


**Reasoning**:
Define the function to get the BERT embedding for the preprocessed user query.



In [None]:
def get_user_query_embedding(preprocessed_query, tokenizer, model):
    inputs = tokenizer(preprocessed_query, return_tensors='pt', padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the embeddings of the [CLS] token as sentence representation
    return outputs.last_hidden_state[:, 0, :].numpy()

# Example usage:
# preprocessed_user_query = preprocess_query("What is the deadline for applications?")
# user_query_embedding = get_user_query_embedding(preprocessed_user_query, tokenizer, model)
# print(user_query_embedding.shape)

## Implement a function to find the best matching answer

### Subtask:
This function will take the user query embedding, calculate its similarity with the BERT embeddings of the FAQ questions, and return the answer corresponding to the most similar question.


**Reasoning**:
Define the `find_best_answer_bert` function to calculate cosine similarity between the user query embedding and the BERT embeddings of FAQ questions, find the best match, and return the corresponding answer.



In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_best_answer_bert(user_query_embedding, faq_embeddings):
    # Compare the single query embedding against every FAQ embedding.
    similarity_scores = cosine_similarity(user_query_embedding.reshape(1, -1), faq_embeddings)
    # argmax returns a row position, so index with iloc.
    best_match_index = similarity_scores.argmax()
    best_answer = df.iloc[best_match_index]['Answer']
    return best_answer

# Example usage:
# preprocessed_user_query = preprocess_query("What is the deadline for applications?")
# user_query_embedding = get_user_query_embedding(preprocessed_user_query, tokenizer, model)
# faq_embeddings = np.array(df['BERT_Embeddings'].tolist())
# best_answer = find_best_answer_bert(user_query_embedding, faq_embeddings)
# print(best_answer)
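For reference, the cosine similarity between two vectors is their dot product divided by the product of their lengths. A minimal pure-Python version is sketched below; the function name `cosine` is hypothetical, and returning 0.0 for a zero vector is a convention assumed for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```

Because the score depends only on direction, scaling an embedding does not change which FAQ question ranks first.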

## Create a simple chatbot interface

### Subtask:
Implement a loop to interact with the user, taking their input queries and providing the best matching answers using the functions created in the previous steps.


**Reasoning**:
Implement an interactive loop to take user queries, preprocess them, find the best matching answer using BERT embeddings, and print the answer to the user, with an option to quit.



In [None]:
import numpy as np

# Build the FAQ embedding matrix once instead of on every query.
faq_embeddings = np.array(df['BERT_Embeddings'].tolist())

print("Hello! I am a chatbot here to answer your questions about the university. Type 'quit' to exit.")

while True:
    user_query = input("You: ")
    if user_query.lower() == 'quit':
        print("Chatbot: Goodbye!")
        break
    else:
        preprocessed_user_query = preprocess_query(user_query)
        user_query_embedding = get_user_query_embedding(preprocessed_user_query, tokenizer, model)
        best_answer = find_best_answer_bert(user_query_embedding, faq_embeddings)
        print(f"Chatbot: {best_answer}")

Hello! I am a chatbot here to answer your questions about the university. Type 'quit' to exit.
You: hi
Chatbot: The central library is open from 8:00 AM to 10:00 PM on weekdays and 10:00 AM to 6:00 PM on weekends.
You: fees
Chatbot: Fees can be paid online via the student portal, or through a bank transfer. Visit university.edu/payment-options for more details.
You: how much fees
Chatbot: The central library is open from 8:00 AM to 10:00 PM on weekdays and 10:00 AM to 6:00 PM on weekends.
You: hostel
Chatbot: Yes, all residents must abide by the hostel's code of conduct. You can find the rulebook on the hostel's website.


KeyboardInterrupt: Interrupted by user
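In the session above, `hi` and `how much fees` both return the library hours, which suggests the top match is not always meaningful. When diagnosing results like these, it can help to look at the few highest-scoring questions together with their scores instead of only the final answer. The sketch below is a hypothetical helper (`top_matches`) and the scores in the example are made up for illustration.

```python
def top_matches(scores, questions, k=3):
    """Return the k (score, question) pairs with the highest scores, best first."""
    ranked = sorted(zip(scores, questions), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]

questions = [
    "What are the admission requirements?",
    "How much is the admission fee?",
    "What is the application deadline?",
]
example_scores = [0.91, 0.93, 0.89]  # invented values, not measured output
for score, question in top_matches(example_scores, questions, k=2):
    print(f"{score:.2f}  {question}")
```

If the top candidates have nearly identical scores, the match is ambiguous and a fallback reply may be more useful than the first-ranked answer.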