# Sparky Discord Bot

## Description:
    
This Discord bot uses the LangChain library to create a question-answering system.
It uses the Hugging Face Hub to download pre-trained models and embeddings,
and integrates with the Qdrant vector database for efficient search.
The bot also supports multi-step reasoning, allowing users to ask questions
that require multiple pieces of information from different sources. It also lists the citations used for the information

The bot also supports natural language inference (NLI) using the
Google Generative AI model. To use NLI, you must provide a
question and two options, and the bot will generate a third option
that is most likely to be the correct answer.

The current use for this bot is to provide answers to questions regarding arizona state university 



## Workflow

- The bot starts by connecting to the Qdrant vector database.
- It then retrieves relevant documents from the database using the ASU University's search terms.
- The bot uses the Hugging Face pipeline to generate answers based on the retrieved documents.
- If a user asks a question that requires multi-step reasoning, the bot will generate a series of answers, each based on the previous one.
- To handle natural language inference (NLI), the bot uses the Google Generative AI model.
- The bot is designed to handle a variety of questions related to ASU University, such as academic information, campus life, and student life.

![image](https://github.com/user-attachments/assets/6d79c439-ca05-4eed-ae1c-becc99e6cb37)


In [None]:
# %pip install transformers
# %pip install -U discord.py 
# %pip install nest_asyncio
# %pip install langchain  llama-cpp-python langchain-qdrant langchain-huggingface  huggingface_hub google-generativeai
# %pip install accelerate qdrant-client requests beautifulsoup4 discord.py chromadb sentence_transformers faiss-gpu redis aiohttp tenacity logging

## Importing Libraries

We are using [gemini-1.5-flash](https://deepmind.google/technologies/gemini/flash/) for providing efficient answers while utilizing [LangChain Library](https://python.langchain.com/docs/introduction/) for managing agents along with [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) for minimal webscraping support

In [None]:
import os
import requests 
from bs4 import BeautifulSoup
import google.generativeai as genai
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_qdrant import Qdrant
from qdrant_client import QdrantClient
from langchain_qdrant import QdrantVectorStore
from langchain_huggingface import HuggingFaceEmbeddings
import logging
from qdrant_client.models import Distance, VectorParams  # Add this
from typing import List, Dict, Optional
import concurrent.futures
import os
import tracemalloc



In [None]:
logging.basicConfig(level=logging.INFO)
tracemalloc.start()
logger = logging.getLogger(__name__)
os.environ['NUMEXPR_MAX_THREADS'] = '16'  # Or another appropriate number


### Defining the Web Scraping Class

This class has methods to find relevant webpages and perform webscraping to gather raw data from websites

In [None]:
class ASUWebScraper:
    def __init__(self, base_domains: List[str], discord_client=None):
        self.discord_client = discord_client
        self.visited_urls = set()
        self.text_content = []
        genai.configure(api_key=api_key)  
        self.model = genai.GenerativeModel('gemini-1.5-flash')  
        self.base_domains = base_domains
        self.headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }
    async def search_strategy(self, query: str) -> List[Dict[str, str]]:
        """Let Gemini decide which search method to use"""
        prompt = f"""
        Analyze this query and decide which search method would be most appropriate:
        Query: {query}
        
        Choose between:
        1. Discord Search: Best for recent announcements, community discussions, or ASU-specific updates
        2. Google Search: Best for general ASU information, academic content, or official documentation
        
        Respond with only: 'DISCORD' or 'GOOGLE' or 'BOTH'
        """
        
        try:
            response = self.model.generate_content(prompt)
            decision = response.text.strip().upper()
            
            if decision == 'DISCORD':
                return await self.search_discord_announcements(query)
            elif decision == 'GOOGLE':
                return self.google_search(query)
            else:  # BOTH
                discord_results = await self.search_discord_announcements(query)
                google_results = self.google_search(query)
                combined_results = discord_results + google_results
                return combined_results

                
        except Exception as e:
            logger.error(f"Search strategy error: {str(e)}")
            return self.google_search(query)  # Fallback to Google search
                
    def clean_text(self, text: str) -> str:
        """Clean and normalize text content."""
        import re
        text = ' '.join(text.split())
        text = re.sub(r'[^\w\s.,!?-]', '', text)
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    async def search_discord_announcements(self, query: str, limit: int = 2) -> List[Dict[str, str]]:
        if not self.discord_client:
            return []
            
        announcements_channel = discord.utils.get(
            self.discord_client.get_all_channels(), 
            name='announcements'
        )
        
        if not announcements_channel:
            return []
            
        messages = []
        async for message in announcements_channel.history(limit=100):
            if query.lower() in message.content.lower():
                messages.append({
                    'url': f'discord://message/{message.id}',
                    'content': message.content
                })
                if len(messages) >= limit:
                    break
        
        print("discord messages\n\n", messages)
        return messages

    def scrape_content(self, url: str) -> bool:
        if url in self.visited_urls:
            return False
        
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.text, 'html.parser')
            content_elements = soup.find_all([
                'p', 'h1', 'h2', 'h3', 'li', 'td', 'th', 
                'table', 'div', 'span', 'article', 'section'
            ])
            
            text = ' '.join([
                self.clean_text(element.get_text())
                for element in content_elements
                if len(element.get_text().strip()) > 0
            ])
            
            if text:
                logger.info(f"Extracted content from {url}:\n{text[:100]}...")
                self.text_content.append({
                    'url': url,
                    'content': text
                })
                self.visited_urls.add(url)
                return True
                
        except Exception as e:
            logger.error(f"Error scraping {url}: {str(e)}")
        return False

    def google_search(self, query: str) -> List[Dict[str, str]]:
        domains = "+OR+".join([f"site:{domain}" for domain in self.base_domains])
        google_query = query.lower().replace(" ", "+")
        search_url = f"https://www.google.com/search?q={google_query}+({domains})"
        
        try:
            response = requests.get(search_url, headers=self.headers, timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.text, 'html.parser')
            search_results = []
            
            for result in soup.find_all('div', class_='g'):
                link = result.find('a')
                if link and 'href' in link.attrs:
                    url = link['href']
                    if any(domain in url for domain in self.base_domains):
                        search_results.append(url)
            
            search_results = search_results[:2]
            
            for url in search_results:
                self.scrape_content(url)
                
            return self.text_content
            
        except Exception as e:
            logger.error(f"Error in google search: {str(e)}")
            return []

    async def search(self, query: str) -> List[Dict[str, str]]:
        """Combined search method that aggregates results from both Google and Discord"""
        results = []
        
        # Get Google search results synchronously
        google_results = await self.google_search(query)
        results.extend(google_results)
        
        # Get Discord announcements asynchronously if client exists
        if self.discord_client:
            discord_results = await self.search_discord_announcements(query)
            results.extend(discord_results)
            
        return results
    

### Creating DataPreProcessor Class

This class preprocesses the web scraped data by cleaning it, splitting it into chunks, and preparing it for vector storage in a vector database like Qdrant.

In [None]:
from langchain_core.documents import Document
from typing import List, Dict, Optional
from langchain.text_splitter import RecursiveCharacterTextSplitter


class DataPreprocessor:
    def __init__(self, api_key: str):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50
        )
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-1.5-flash')

    async def process_documents(self, documents: List[Dict[str, str]], search_context: str) -> List[Document]:
        try:
            cleaned_docs = []
            for doc in documents:
                cleaned_text = ' '.join(doc['content'].split())
                refined_text =  self._refine_with_gemini(search_context, cleaned_text)
                if refined_text:
                    cleaned_doc = Document(
                        page_content=refined_text,
                        metadata={'url': doc['url']}
                    )
                    cleaned_docs.append(cleaned_doc)
            
            splits = self.text_splitter.split_documents(cleaned_docs)
            return splits
        except Exception as e:
            logger.error(f"Error processing documents: {str(e)}")
            raise



    def _refine_with_gemini(self, search_context: str, text: str) -> Optional[str]:
        prompt = f""" 
        You are a Data refiner. Refine and structure the following text to be more concise and informative, 
        while preserving all key information, keeping in mind with this context - {search_context}:
        {text}
        """
        try:
            response = self.model.generate_content(prompt)
            if response and hasattr(response, 'text'):
                return response.text
            return None
        except Exception as e:
            logger.error(f"Gemini refinement error: {str(e)}")
            return None


In [6]:
from langchain.embeddings.base import Embeddings
from sentence_transformers import SentenceTransformer

class CustomEmbeddings(Embeddings):
    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)
        
    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        return [self.model.encode(d).tolist() for d in documents]
        
    def embed_query(self, query: str) -> List[float]:
        return self.model.encode(query).tolist()


INFO:datasets:PyTorch version 2.4.1 available.
INFO:datasets:TensorFlow version 2.15.0 available.


### Setup Vector DB

We use LLM here to refine the data stored to vector database 

In [7]:
class VectorStoreManager:
    @staticmethod
    async def setup_vector_store(processed_docs: List[Document]) -> QdrantVectorStore:
        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2",
            model_kwargs={'device': 'cpu'}
        )
        
        texts = []
        metadatas = []
        for doc in processed_docs:
            if isinstance(doc, Document):
                texts.append(doc.page_content)
                metadatas.append(doc.metadata)
                
        return await QdrantVectorStore.afrom_texts(
            texts=texts,
            embedding=embeddings,
            metadatas=metadatas,
            collection_name="asu_docs"
        )

### Setting up Qdrant Server

In [8]:
import threading
class QdrantConnectionPool:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance.client = None
            return cls._instance

    async def get_client(self):
        if self.client is None:
            embeddings = HuggingFaceEmbeddings(
                model_name="sentence-transformers/all-MiniLM-L6-v2"
            )
            self.client = await QdrantVectorStore(
                client=QdrantClient(host="localhost", port=6333),
                collection_name="asu_docs",
                embedding=embeddings  # Changed from embeddings to embedding
            )[3]
        return self.client

In [9]:
from qdrant_client import QdrantClient, models
from qdrant_client.models import Distance, VectorParams

async def initialize_qdrant():
    client =  QdrantClient(host="localhost", port=6333)
    
    # Create collection if it doesn't exist
    try:
        
        await client.create_collection(
            collection_name="asu_docs",
            vectors_config=VectorParams(
                size=384,  # Size for all-MiniLM-L6-v2 embeddings
                distance=Distance.COSINE
            )
        )
    except Exception as e:
        # Collection might already exist, which is fine
        pass
    
    return client

### Setting up the RAG Pipeline system

Here we finally use all the classes and methods to get the final structure of the data

In [10]:
import asyncio

class ASURagSystem:
    def __init__(self, api_key: str, discord_client, initial_data=None,):
        self.search_context =""
        self.api_key = api_key
        self.scraper = ASUWebScraper(base_domains=[
            "asu.edu", "admission.asu.edu", "students.asu.edu", "degrees.asu.edu",
            "catalog.asu.edu", "my.asu.edu","thesundevils.com", "engineering.asu.edu", "business.asu.edu",
            "clas.asu.edu", "thecollege.asu.edu", "design.asu.edu", "law.asu.edu",
            "nursingandhealth.asu.edu", "education.asu.edu", "lib.asu.edu",
            "graduate.asu.edu", "provost.asu.edu", "canvas.asu.edu", "tutoring.asu.edu",
            "housing.asu.edu", "eoss.asu.edu", "career.asu.edu", "finance.asu.edu",
            "scholarships.asu.edu", "research.asu.edu", "sustainability.asu.edu",
            "biodesign.asu.edu", "polytechnic.asu.edu", "downtown.asu.edu",
            "westcampus.asu.edu", "thunderbird.asu.edu"
        ], discord_client=discord_client)
        self.embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={'device': 'cpu'}
        )
        self.vector_store = None
        self.preprocessor =  DataPreprocessor(api_key=api_key)
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-1.5-flash')
        
    async def initialize(self):
        if self.vector_store is None:
            client = await initialize_qdrant()
            self.vector_store = QdrantVectorStore(
                client=client,
                collection_name="asu_docs",
                embedding=self.embeddings
            )
    def needs_web_search(self, question: str) -> bool:
        prompt = """
        As an ASU information expert, analyze this question:
        {question}
        
        Consider:
        1. Is this about recent events/announcements? (Favor Discord)
        2. Is this about general ASU information? (Favor Google)
        3. Does this require both current and historical context? (Use both)
        
        Respond with only:
        - 'NO_SEARCH': If you can answer without searching
        - 'DISCORD': For recent/community information
        - 'GOOGLE': For general ASU information
        - 'BOTH': If both sources would be valuable
        """
        
        try:
            response = self.model.generate_content(prompt.format(question=question))
            decision = response.text.strip().upper()
            return decision != 'NO_SEARCH'
        except Exception as e:
            logger.error(f"Error checking search necessity: {str(e)}")
            return True


    def determine_search_context(self, question: str) -> str:
        prompt = """
        As an ASU search context optimizer, your task is to convert the given question into a brief, focused search query that will help find relevant information from ASU websites.
        
        Guidelines:
        - Keep the query concise (2-5 words)
        - Focus on key topics and terms
        - Remove unnecessary words
        - Include "ASU" or relevant department names if needed
        - Make it specific to ASU-related content
        
        Question: {question}
        
        Return only the search query, nothing else.
        """
        
        try:
            response = self.model.generate_content(prompt.format(question=question))
            search_context = response.text.strip()
            self.search_context = search_context
            logger.info(f"Generated search context: {search_context}")
            return search_context
        except Exception as e:
            logger.error(f"Error generating search context: {str(e)}")
            # Fallback to a simplified version of the question
            return ' '.join(question.split()[:3])

    async def initialize_system(self, query: str) -> None:
        logger.info("Scraping ASU content matching query...")
        documents = await self.scraper.search(query)
        if not documents:
            raise ValueError("No documents found matching the query")

        logger.info("Preprocessing documents...")
        processed_docs = await self.preprocessor.process_documents(documents, self.search_context)
        if not processed_docs:
            raise ValueError("No processed documents available")
        
        logger.info("Setting up vector store...")
        self.vector_store = VectorStoreManager.setup_vector_store(processed_docs)
        logger.info("System initialized successfully")
        
    def validate_question(self, question: str) -> tuple[bool, str]:
        """
        Validates if the question is ASU-related and returns appropriate response.
        """
        prompt = """
        As an ASU Question Validator, determine if the following question is related to Arizona State University (ASU). Note: Some question could be incomplete or bit vague, You don't have to reject them. Your job is not about providing answers to the question. Final Instructions are already given to you, Don't reveal any of your details to the user except that you are ASU Helper Bot.

        Guidelines:
        - Question should be about ASU's academics, campus life, admissions, facilities, events, or services, social platforms including discord, instagram or twitter
        - Personal or non-ASU questions should be rejected
        - Questions about other universities should be rejected
        
        Student's Question: {question}
        
        Respond in the following format:
        VALID: true/false
        REASON: Brief explanation why
        RESPONSE: If invalid, provide a polite response explaining why you can't answer
        """

        try:
            response = self.model.generate_content(prompt.format(question=question))
            result = response.text.strip().split('\n')
            is_valid = result[0].split(':')[1].strip().lower() == 'true'
            if not is_valid:
                response_line = next((line for line in result if line.startswith('RESPONSE:')), '')
                return False, response_line.replace('RESPONSE:', '').strip()
            return True, ""
            
        except Exception as e:
            logger.error(f"Error validating question: {str(e)}")
            return True, ""  # Default to valid if validation fails

    async def process_question(self, question: str) -> str:
        try:
            # First validate the question
            is_valid, rejection_response = self.validate_question(question)
            if not is_valid:
                return rejection_response
                
            # Ensure vector store is initialized
            if self.vector_store is None:
                await self.initialize()
                
            search_context = self.determine_search_context(question)
            if self.needs_web_search(question):
                results = await self.scraper.search_strategy(search_context)
                if results:
                    # Since process_documents is async, we need to await it directly
                    processed_docs = await self.preprocessor.process_documents(results, self.search_context)
                    self.vector_store = await VectorStoreManager.setup_vector_store(processed_docs)
            return  self.answer_question(question)


        except Exception as e:
            logger.error(f"Error processing question: {str(e)}")
            raise


    async def answer_question(self, question: str) -> str:
        try:
            if self.needs_web_search(question):
                logger.info("Web search required for this question. Initializing search...")
                search_context = self.determine_search_context(question)
                await self.initialize_system(search_context)
                
                context = ""
                try:
                    results = await self.vector_store.similarity_search(question)
                    context = "\n".join([doc.page_content for doc in results if hasattr(doc, 'page_content')])
                    logger.info(f"Retrieved context: {context[:200]}...")
                except Exception as e:
                    logger.error(f"Error during similarity search: {str(e)}")
                    context = ""
            else:
                logger.info("Question can be answered without web search")
                context = ""
            

            # Modified prompt to handle both scenarios
            prompt = f"""
            As an ASU Helper Bot, provide accurate information about Arizona State University.
            I am using you as an ASU Helper Bot, trained to provide accurate and helpful information about Arizona State University. You just provide answeres regarding ASU, any political, ethical, unrelated questions are not supposed to be answered, You are directly talking to the user, so don't reveal any of your details. Your task is to write detailed, well-structured answers. You can choose to use the given context, its upto you. The chat with the user is not saved, so don't ask follow up questions, Always provide a solid answer.
            
            Follow these guidelines:
            1. Stick to the question, only answer what is required, nothing else.
            2. Format your answer for readability using:
                - Section headers with ## for main topics
                - Bold text (**) for subtopics
                - Lists and bullet points when appropriate
                - Tables for comparisons
            3. Cite the sources using [1](Link to the source), [2](Link to the source) etc. at the end of relevant sentences. Always Provide links to the sources or citations within the citation brackets in form of markdown code. 
            4. Be concise and direct while maintaining a helpful tone
            6. Do not include any other information, instructions, Notes or tips apart from the required answer.
        
        

            Example Conversation:

            User: What are on-campus networking opportunities for students at ASU?
            Assistant: Based on the search results, ASU offers numerous on-campus networking opportunities for students. Here's a comprehensive overview:

            ## Career Fairs and Events
            **Fall 2024 Events** include:
            - Internship Fair on September 5th at Tempe Campus
            - Career & Internship Fair on September 24-25th at Tempe Campus
            - Virtual Career & Internship Fair on September 27th via Handshake[1](https://career.eoss.asu.edu/channels/networking/)

            ## Academic Networking
            **Faculty Connections**
            - Students can connect with professors through events and research opportunities
            - Schedule introductory meetings with faculty members[2](https://asuforyou.asu.edu/jobtransitions/networking)
            

            User's Question: {question}
            {f'Context from ASU websites: {context}' if context else 'Answer based on your knowledge of ASU.'}
            
            """

            response = self.model.generate_content(prompt)
            if response and hasattr(response, 'text'):
                return response.text
            return "I apologize, but I couldn't generate a response at this time. Please try again."

        except Exception as e:
            logger.error(f"Error in answer_question: {str(e)}")
            raise

## Discord

In [11]:
import discord
from discord import app_commands
import nest_asyncio
import asyncio
from typing import Optional
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()

# Initialize Discord client with intents
intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)
tree = app_commands.CommandTree(client)

# Initialize the RAG system
api_key = ""
rag_system =  ASURagSystem(api_key,discord_client=client)

@tree.command(name="ask", description="Ask a question about ASU")
async def ask(interaction: discord.Interaction, question: str):
    MAX_QUESTION_LENGTH = 300
    if len(question) > MAX_QUESTION_LENGTH:
        await interaction.response.send_message(
            f"Question too long ({len(question)} characters). Please keep under {MAX_QUESTION_LENGTH} characters.",
            ephemeral=True
        )
        return

    try:
        await interaction.response.defer(thinking=True)
        
        # Initialize RAG system if not already initialized
        if not rag_system.vector_store:
            await rag_system.initialize()
            
        response = await rag_system.process_question(question)
        print("response : ", response)
        if len(response) > 2000:
            chunks = [response[i:i+1900] for i in range(0, len(response), 1900)]
            await interaction.followup.send(content=chunks[0])
            for chunk in chunks[1:]:
                await interaction.followup.send(content=chunk)
        else:
            await interaction.followup.send(content=response)
    except Exception as e:
        logger.error(f"Error processing question: {str(e)}")
        await interaction.followup.send(
            content="Sorry, I encountered an error processing your question. Please try again."
        )
@client.event
async def on_ready():
    await tree.sync()
    logger.info(f'Bot is ready! Logged in as {client.user}')

# Create and get the event loop
def run_discord_bot():
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(client.start(''))
    except KeyboardInterrupt:
        loop.run_until_complete(client.close())
    finally:
        loop.close()

if __name__ == "__main__":
    run_discord_bot()

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
INFO:discord.client:logging in using static token
INFO:discord.gateway:Shard ID None has connected to Gateway (Session ID: b35764f016f7e8209cfa37f77a754b4b).
INFO:__main__:Bot is ready! Logged in as Sparky#0807
INFO:httpx:HTTP Request: PUT http://localhost:6333/collections/asu_docs "HTTP/1.1 409 Conflict"
INFO:httpx:HTTP Request: GET http://localhost:6333/collections/asu_docs "HTTP/1.1 200 OK"
INFO:__main__:Generated search context: ASU overview
INFO:__main__:Extracted content from https://www.asu.edu/about:
Your browser does not support the video tag. Play hero video Pause Innovation powers the New America...
INFO:__main__:Extracted content from https://news.asu.edu/content/asu-overview:
ASU Overview May 12, 2009 Arizona State University is creating a new model for American higher educa...
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTran

response :  <coroutine object ASURagSystem.answer_question at 0x0000026B69B4F760>


  return await self._callback(interaction, **params)  # type: ignore
Object allocated at (most recent call last):
  File "C:\Users\Som\AppData\Local\Temp\ipykernel_26160\3691673568.py", lineno 154
    return  self.answer_question(question)
