[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/toobajaved/smuAIBot/blob/main/aibot.ipynb)


In [None]:
# Installing the dependencies
!pip install datasets faiss-cpu
!pip install openai==0.28
!pip install anvil-uplink

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_6

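The install cell pins `openai==0.28` because the chatbot cell calls `openai.ChatCompletion.create`, an interface that was removed in version 1.0 of the `openai` package. The following is a minimal sketch of a guard that could run before the chatbot cell; the helper name `uses_legacy_openai_api` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def uses_legacy_openai_api(version: str) -> bool:
    """Return True for pre-1.0 releases, which still expose openai.ChatCompletion."""
    # Hypothetical helper: only the major version number is inspected.
    return int(version.split(".")[0]) < 1


try:
    installed = metadata.version("openai")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("openai is not installed; run the install cell first.")
elif not uses_legacy_openai_api(installed):
    print(f"openai {installed} found; this notebook expects the 0.28 interface.")
else:
    print(f"openai {installed} is compatible with this notebook.")
```

If the check reports a 1.x release, re-running the install cell and restarting the runtime should restore the pinned version.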
In [None]:
# Import the required libraries
import os
import torch
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset
import faiss
import openai

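# Set the OpenAI API key: prefer the OPENAI_API_KEY environment variable over a hardcoded value
openai.api_key = os.environ.get('OPENAI_API_KEY', 'key-here')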

# Load the dataset from Hugging Face
dataset = load_dataset("tootooba/SMU_FAQDataset")['train']

# Extract questions and answers from the dataset
questions = dataset['question']
answers = dataset['answer']

# Initialize the tokenizer and model for embeddings.
# Model used: all-MiniLM-L6-v2, a lighter-weight model than DistilBERT
# that worked better for retrieval tasks on our dataset.
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Set the model to evaluation mode
model.eval()

# Mean pooling: average the token embeddings into one sentence embedding, ignoring padding tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of output contains token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, dim=1) / torch.clamp(
        input_mask_expanded.sum(dim=1), min=1e-9
    )

# Function to encode questions
def encode_questions(questions):
    encoded_input = tokenizer(questions, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return embeddings.numpy()

# Generate embeddings for the questions
question_embeddings = encode_questions(questions)

# Initialize FAISS index
embedding_dim = question_embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)

# Add question embeddings to the index
index.add(question_embeddings)

# Function to find the best answer using retrieval and generation
def find_best_answer(user_question, top_k=3):
    # Encode the user's question with the same pipeline used for the FAQ questions
    user_embedding = encode_questions([user_question])

    # Search for similar questions (FAISS pads with -1 when fewer than top_k vectors are indexed)
    _, indices = index.search(user_embedding, top_k)
    retrieved_answers = [answers[idx] for idx in indices[0] if idx != -1]

    # Combine retrieved answers into a single context
    context = "\n".join(retrieved_answers)

    # Generate a response using GPT-3.5
    prompt = f"User Question: {user_question}\n\nContext from FAQ:\n{context}\n\nAnswer:"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=150,
        n=1,
        stop=None,
        temperature=0.5,
    )
    generated_answer = response['choices'][0]['message']['content'].strip()
    return generated_answer

# Function to interact with the chatbot
def chatbot():
    print("Welcome to the SMU FAQ Chatbot! Type 'exit' to quit.")
    while True:
        user_question = input("You: ")
        if user_question.strip().lower() in ['exit', 'quit']:
            print("Chatbot: Goodbye!")
            break
        answer = find_best_answer(user_question)
        print(f"Chatbot: {answer}\n")

# Launch the chatbot
chatbot()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Welcome to the SMU FAQ Chatbot! Type 'exit' to quit.
You: what is smu academic calendar
Chatbot: The SMU academic calendar is a comprehensive document that outlines important information such as programs, admission requirements, schedules, costs, regulations, and student support services at Saint Mary's University. It also includes course descriptions, prerequisites, and other essential details for students. You can request a physical copy of the undergraduate calendar from Admissions or the Service Centre, while graduate calendars are available from the Faculty of Graduate Studies and Research. Additionally, a digital version of the current academic calendar can be accessed online.

You: what are you?
Chatbot: I am a helpful assistant here to provide information and assistance on various topics. If you have any questions or need help with anything specific, feel free to ask!

You: are you an assistant for SMU?
Chatbot: Yes, I am here to assist you with any questions you may have regar
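
In the transcript above, the reply to "what are you?" is generic because the system message sent to the model is only "You are a helpful assistant." and says nothing about SMU. The following is a minimal sketch of how the messages could be assembled so the model is told its role and asked to stay within the retrieved FAQ context. The function name `build_messages` and the prompt wording are assumptions, not part of the original notebook.

```python
def build_messages(user_question, retrieved_answers):
    """Assemble chat messages that ground the model in the retrieved FAQ answers."""
    # Number each retrieved answer so the separate FAQ entries stay distinguishable.
    context = "\n".join(
        f"[{i}] {answer}" for i, answer in enumerate(retrieved_answers, start=1)
    )
    # Hypothetical system prompt; adjust the wording to suit the deployment.
    system = (
        "You are the FAQ assistant for Saint Mary's University (SMU). "
        "Answer only from the FAQ context provided. "
        "If the context does not cover the question, say you do not know."
    )
    user = f"FAQ context:\n{context}\n\nQuestion: {user_question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_messages(
    "what is smu academic calendar",
    ["The academic calendar lists programs and regulations."],
)
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

The returned list could be passed as the `messages` argument of the existing `openai.ChatCompletion.create` call in `find_best_answer`, in place of the inline list.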

In [None]:
import anvil.server

anvil.server.connect("key-here")

@anvil.server.callable
def anvil_question(user_question):
    return find_best_answer(user_question)

# Keep the Anvil Uplink connection open so the app can keep calling anvil_question
anvil.server.wait_forever()

Disconnecting from previous connection first...
Connecting to wss://anvil.works/uplink
Anvil websocket open
Connected to "Main Environment" as SERVER
Reconnecting Anvil Uplink...
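
Because `anvil_question` passes whatever the web app sends straight to `find_best_answer`, an empty message or a failed OpenAI request would surface in the app as a server error. The following is a minimal sketch of a wrapper that validates the input and converts exceptions into a readable reply. The names `make_safe_handler` and `max_len`, and the reply strings, are hypothetical and not part of the original notebook.

```python
def make_safe_handler(answer_fn, max_len=500):
    """Wrap an answer function so bad input or upstream errors return a message."""

    def handler(user_question):
        # Reject non-string or blank input before doing any model work.
        if not isinstance(user_question, str) or not user_question.strip():
            return "Please enter a question."
        # Trim whitespace and cap the length (max_len is an arbitrary assumption).
        question = user_question.strip()[:max_len]
        try:
            return answer_fn(question)
        except Exception as exc:
            # Report only the error type so internal details are not exposed.
            return f"Sorry, something went wrong: {type(exc).__name__}"

    return handler


# Demonstration with a stand-in answer function instead of find_best_answer.
safe = make_safe_handler(lambda q: f"Echo: {q}")
print(safe("  what is smu academic calendar  "))
print(safe(""))
```

In the notebook, the body of `anvil_question` could return `make_safe_handler(find_best_answer)(user_question)` instead of calling `find_best_answer` directly.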
