# Sparky Discord Bot

## Description:
    
This Discord bot uses the LangChain library to build a question-answering system.
It downloads pre-trained models and embeddings from the Hugging Face Hub
and integrates with the Qdrant vector database for efficient search.
The bot also supports multi-step reasoning, so users can ask questions
that require multiple pieces of information from different sources, and it lists the citations used for each answer.

The bot also supports natural language inference (NLI) using the
Google Generative AI model. To use NLI, you must provide a
question and two options, and the bot will generate a third option
that is most likely to be the correct answer.

The bot is currently used to answer questions about Arizona State University (ASU).



## Workflow

- The bot starts by connecting to the Qdrant vector database.
- It then retrieves relevant documents from the database using ASU-specific search terms.
- The bot uses the Hugging Face pipeline to generate answers based on the retrieved documents.
- If a user asks a question that requires multi-step reasoning, the bot generates a series of answers, each based on the previous one.
- To handle natural language inference (NLI), the bot uses the Google Generative AI model.
- The bot is designed to handle a variety of questions related to ASU, such as academic information, campus life, and student life.
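
The multi-step bullet above can be pictured with a small sketch. The code in this notebook does not show how the bot chains its answers, so everything here is an assumption for illustration: `answer_in_steps` and `echo_answer` are hypothetical names, and `echo_answer` stands in for a real model call.

```python
from typing import Callable, List


def answer_in_steps(
    sub_questions: List[str],
    answer_fn: Callable[[str, str], str],
) -> List[str]:
    """Answer each sub-question in order, passing the previous answer along as context."""
    answers: List[str] = []
    previous = ""
    for question in sub_questions:
        current = answer_fn(question, previous)
        answers.append(current)
        previous = current
    return answers


def echo_answer(question: str, previous: str) -> str:
    # Hypothetical stand-in for a model call: it only records what it was given.
    return f"{question} | given: {previous or 'nothing'}"


steps = answer_in_steps(
    ["Which college offers CS?", "What GPA does it require?"],
    echo_answer,
)
for step in steps:
    print(step)
```

Swapping `echo_answer` for a function that queries the retriever and the language model would give the chained behaviour described in the list, under the assumption that each step only needs the immediately preceding answer.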

![image](https://github.com/user-attachments/assets/6d79c439-ca05-4eed-ae1c-becc99e6cb37)


In [None]:
%pip install transformers
%pip install langchain langchain-community llama-cpp-python huggingface_hub google-generativeai
%pip install accelerate qdrant-client requests beautifulsoup4 chromadb sentence_transformers faiss-gpu redis aiohttp tenacity

### Importing Libraries

We use the [Llama 3.2 1B model](https://huggingface.co/meta-llama/Llama-3.2-1B) to generate answers efficiently, the [LangChain library](https://python.langchain.com/docs/introduction/) to manage agents, and [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) for minimal web scraping support.

In [60]:
import os
import requests
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from bs4 import BeautifulSoup
import google.generativeai as genai
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
from huggingface_hub import login
import warnings
warnings.filterwarnings('ignore')




os.environ["HUGGINGFACEHUB_API_TOKEN"] = ""
gemini_api_key = ""
login()



### Defining the Web Scraping Class

This class has methods to find relevant web pages and scrape them to gather raw text from websites.
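
One thing to watch in the scraper below: `find_all(['article', 'main', 'section', 'div'])` returns both a container and the elements nested inside it, so the same text can be collected more than once. The following is a minimal sketch of one way to drop such repeats after extraction. `drop_contained_fragments` is a hypothetical helper, not part of the notebook, and it works on plain strings rather than on Beautiful Soup elements.

```python
from typing import Dict, List


def drop_contained_fragments(fragments: List[str]) -> List[str]:
    """Keep a fragment only if it is not already contained in a longer kept fragment."""
    kept: List[str] = []
    # Visit the longest fragments first so that containers are kept
    # and the nested copies of their text are dropped.
    for fragment in sorted(set(fragments), key=len, reverse=True):
        if not any(fragment in other for other in kept):
            kept.append(fragment)

    # Restore the order in which the surviving fragments first appeared.
    first_seen: Dict[str, int] = {}
    for index, fragment in enumerate(fragments):
        first_seen.setdefault(fragment, index)
    return sorted(kept, key=first_seen.__getitem__)


fragments = [
    "Admission requirements: 3.00 GPA",
    "3.00 GPA",
    "Tuition information",
    "Admission requirements: 3.00 GPA",
]
unique_fragments = drop_contained_fragments(fragments)
print(unique_fragments)
```

This is a blunt filter: it also removes a short fragment that legitimately appears inside a longer, unrelated one, so it is a starting point rather than a complete solution.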

In [61]:
class ASUWebScraper:
    def __init__(self, base_domains):
        self.visited_urls = set()
        self.text_content = []
        self.base_domains = base_domains
    
    def clean_text(self, text):
        return ' '.join(text.split())
    
    def scrape_content(self, url):
        try:
            response = requests.get(url, timeout=10)  # Increased timeout
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                
                # More comprehensive content extraction
                content_elements = soup.find_all(['article', 'main', 'section', 'div']) 
                
                text = ' '.join([
                    self.clean_text(element.get_text())
                    for element in content_elements
                    if len(element.get_text().strip()) > 50  # Minimum content length
                ])
                
                if len(text) > 100:
                    print(text)
                    self.text_content.append({
                        'url': url,
                        'content': text
                    })
                    return True
            return False
        except Exception as e:
            print(f"Error scraping {url}: {str(e)}")
            return False
            
    def search(self, query):
        # Direct URL for the CS degree page, scraped before the domain search
        base_urls = [
            "https://degrees.apps.asu.edu/bachelors/major/ASU00/ESCSEBS/computer-science"
        ]
        
        for url in base_urls:
            self.scrape_content(url)
            
        # Then do the domain search
        matching_urls = [f"https://{domain}" for domain in self.base_domains 
                        if query.lower() in domain.lower()]
        for url in matching_urls:
            self.scrape_content(url)
            
        return self.text_content

### Creating the DataPreprocessor Class

This class preprocesses the web scraped data by cleaning it, splitting it into chunks, and preparing it for vector storage in a vector database like Qdrant.
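
To make the chunking step concrete, here is a simplified sketch of fixed-size splitting with overlap, using the same `chunk_size=500` and `chunk_overlap=50` values as the cell below. It is an assumption-laden illustration: `split_with_overlap` is a hypothetical function that slices on raw character positions, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Slice text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample)
print([len(chunk) for chunk in chunks])
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, which helps a retrieved chunk keep some of the context that would otherwise be cut at the boundary.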

In [62]:
class DataPreprocessor:
    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50
        )
        genai.configure(api_key="")
        self.model = genai.GenerativeModel('gemini-1.5-flash')

    def clean_text(self, text):
        return ' '.join(text.split())
    def refine_with_gemini(self, text):
        prompt = """
        Refine and structure the following text to be more concise and informative, 
        while preserving all key information:
        
        {text}
        """
        try:
            response = self.model.generate_content(prompt.format(text=text))
            return response.text
        except Exception as e:
            print(f"Gemini refinement error: {str(e)}")
            return text

    def process_documents(self, documents):
        cleaned_docs = []
        for doc in documents:
            cleaned_text = self.clean_text(doc['content'])
            # Add Gemini refinement step
            refined_text = self.refine_with_gemini(cleaned_text)
            if refined_text:
                cleaned_docs.append({
                    'content': refined_text,
                    'url': doc['url']
                })
                
        splits = []
        for doc in cleaned_docs:
            chunks = self.text_splitter.split_text(doc['content'])
            splits.extend([{'content': chunk, 'url': doc['url']} for chunk in chunks])
        return splits

def setup_vector_store(processed_docs):
    if not processed_docs:
        raise ValueError("No documents to process")
        
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    
    texts = [doc['content'] for doc in processed_docs]
    if not texts:
        raise ValueError("No text content found in documents")
        
    # Add error handling for embeddings
    try:
        vector_store = Qdrant.from_texts(
            texts=texts,
            embedding=embeddings,
            metadatas=[{'url': doc['url']} for doc in processed_docs],
            location=":memory:"
        )
        return vector_store
    except Exception as e:
        print(f"Error creating vector store: {str(e)}")
        raise
    
def setup_llm():
    model_id = "meta-llama/Llama-3.2-1B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    
    pipeline = HuggingFacePipeline(
        pipeline=transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            max_length=1000,
            temperature=0.7,
            top_p=0.95,
            repetition_penalty=1.15
        )
    )
    
    return pipeline
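
As a rough picture of what the vector store does when the retriever asks for the `k` nearest chunks, the sketch below ranks toy vectors by cosine similarity in plain Python. It assumes cosine similarity as the distance measure and uses made-up two-dimensional vectors; `cosine_similarity` and `top_k` are hypothetical helpers, and Qdrant's actual indexing is far more efficient than this linear scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
results = top_k([1.0, 0.1], doc_vectors, k=2)
print(results)
```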



### Creating the Gemini Response Formatter

Here we use Gemini to restructure the raw answer produced by the QA chain into a clearer, better-organized response.

In [63]:
class ASUResponseFormatter:
    def __init__(self):
        genai.configure(api_key="")
        self.model = genai.GenerativeModel('gemini-1.5-flash')
        
    def format_response(self, raw_response):
        prompt = """
        Improve and structure the following response to be more clear, concise and well-organized:
        
        {response}
        
        Format it with:
        1. Clear sections if applicable
        2. Bullet points for key information
        3. Proper grammar and professional tone
        """
        try:
            formatted = self.model.generate_content(prompt.format(response=raw_response))
            return formatted.text
        except Exception as e:
            print(f"Response formatting error: {str(e)}")
            return raw_response


### Setting up the RAG Pipeline system

Here we combine the classes and functions defined above into a single pipeline that scrapes, preprocesses, indexes, and answers questions.
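
The QA chain in the cell below is created with `chain_type="stuff"`, which in broad terms means the retrieved chunks are concatenated into one prompt. The sketch below shows that idea only. `build_stuff_prompt` is a hypothetical function and the prompt wording is invented for illustration; LangChain's own template differs.

```python
from typing import List


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Place every retrieved chunk into a single prompt ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    "What GPA is required?",
    ["3.00 GPA in competency courses.", "Top 25% of the high school class."],
)
print(prompt)
```

Because all chunks share one prompt, the total length grows with `k` and the chunk size, which is worth keeping in mind given the small context budget set by `max_length=1000` in `setup_llm`.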

In [64]:
class ASURagSystem:
    def __init__(self):
        self.scraper = ASUWebScraper(base_domains=[
            "asu.edu", "admission.asu.edu", "students.asu.edu", "degrees.asu.edu",
            "catalog.asu.edu", "my.asu.edu", "engineering.asu.edu", "business.asu.edu",
            "clas.asu.edu", "thecollege.asu.edu", "design.asu.edu", "law.asu.edu",
            "nursingandhealth.asu.edu", "education.asu.edu", "lib.asu.edu",
            "graduate.asu.edu", "provost.asu.edu", "canvas.asu.edu", "tutoring.asu.edu",
            "housing.asu.edu", "eoss.asu.edu", "career.asu.edu", "finance.asu.edu",
            "scholarships.asu.edu", "research.asu.edu", "sustainability.asu.edu",
            "biodesign.asu.edu", "polytechnic.asu.edu", "downtown.asu.edu",
            "westcampus.asu.edu", "thunderbird.asu.edu"
        ])
        self.response_formatter = ASUResponseFormatter()

        self.vector_store = None
        self.qa_chain = None
    
    def initialize_system(self, query):
        print("Scraping ASU content matching query...")
        documents = self.scraper.search(query)
        
        if not documents:
            raise ValueError("No documents found matching the query")
        
        print("Preprocessing documents...")
        processed_docs = DataPreprocessor().process_documents(documents)
        
        if not processed_docs:
            raise ValueError("No processed documents available")
        
        print("Setting up vector store...")
        self.vector_store = setup_vector_store(processed_docs)
        
        if not self.vector_store:
            raise ValueError("Failed to initialize vector store")
            
        print("Initializing QA chain...")
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=setup_llm(),
            chain_type="stuff",
            retriever=self.vector_store.as_retriever(search_kwargs={"k": 3})
        )
        print("QA chain initialized.")
    
    def answer_question(self, question):
        if not self.qa_chain:
            raise ValueError("System not initialized. Call initialize_system() first with a query.")
            
        # Get raw response from LLaMA
        raw_response = self.qa_chain.run(question)
        
        # Format response using Gemini
        formatted_response = self.response_formatter.format_response(raw_response)
        
        return formatted_response



### Creating instance of the system

In [65]:
rag_system = ASURagSystem()
initial_question = "What are the admission requirements for ASU's Computer Science program?"
rag_system.initialize_system(initial_question)


Scraping ASU content matching query...
Report an accessibility problemASU homeMy ASUColleges and SchoolsSign InSearch ASU ASU homeMy ASUColleges and SchoolsSign InSearch ASU ASU homeMy ASUColleges and SchoolsSign InSearch ASU Computer Science ,BS Computer Science, BS Academic programs / Undergraduate degrees / Computer Science Sign in to save your search results for later use. Loading... Click to save to my favorites Computer Science, BS ESCSEBS Program description At a glance Required courses (Major Map) Concurrent program options Accelerated program options Admission requirements Tuition information Change of Major requirements Attend online Transfer options Global opportunities Career opportunities Contact information Apply now Apply now Request info 2024 - 2025 Major Map Algorithms, Artificial Intelligence, Computer Programming, Cybersecurity, Data Structures, Database Administrator, Information Assurance, Networks, Operating Systems, Programming, Security, Software, approved for S

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


Setting up vector store...
Initializing QA chain...
QA chain initialized.


### Testing

In [66]:
# Example usage
question = "What are the admission requirements for ASU's Computer Science program?"
answer = rag_system.answer_question(question)
print(f"Question: {question}")
print(f"Answer: {answer}")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Question: What are the admission requirements for ASU's Computer Science program?
Answer: ## Arizona State University Computer Science Admissions Requirements

**Admissions Requirements vary depending on whether you are a first-year or transfer student.**

**First-Year Students:**

* **Minimum standardized test scores:**
    * 1210 SAT combined score (evidence-based reading & writing, and math)
    * 26 ACT combined score 
* **OR**
* **High school performance:**
    * 3.00 high school cumulative GPA in ASU competency courses
    * Ranking in the top 25% of the high school class
* **No high school math or science competency deficiencies allowed.**

**Transfer Students:**

* **General university admission requirements:** Must be met by all applicants.
* **Engineering major requirements:** Higher than general university admission standards. 
* **International students:** Must meet the same requirements as domestic applicants, potentially including a minimum English language proficiency te
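
The revised cell that follows asks the model to cite sources as [1], [2] and passes the retrieved documents into the prompt as they are. A possible refinement, sketched here under assumptions, is to number each chunk by its source URL before building the prompt. `number_sources` is a hypothetical helper that works on the `{'content': ..., 'url': ...}` dictionaries used elsewhere in this notebook, and the example URLs are placeholders.

```python
from typing import Dict, List, Tuple


def number_sources(docs: List[Dict[str, str]]) -> Tuple[str, str]:
    """Return (context, citations) where chunks from the same URL share one number."""
    url_to_number: Dict[str, int] = {}
    context_lines: List[str] = []
    for doc in docs:
        # A URL gets the next free number the first time it is seen.
        number = url_to_number.setdefault(doc["url"], len(url_to_number) + 1)
        context_lines.append(f"[{number}] {doc['content']}")
    citations = [f"[{number}] {url}" for url, number in url_to_number.items()]
    return "\n".join(context_lines), "\n".join(citations)


docs = [
    {"content": "Minimum 3.00 GPA.", "url": "https://example.edu/admission"},
    {"content": "Offered in Tempe or online.", "url": "https://example.edu/program"},
    {"content": "Top 25% of class.", "url": "https://example.edu/admission"},
]
context, citations = number_sources(docs)
print(context)
print(citations)
```

Giving the model pre-numbered context makes it easier to check that each bracketed citation in the answer maps to a URL that was actually retrieved.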

In [75]:
import os
import requests
from bs4 import BeautifulSoup
import google.generativeai as genai
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant
import logging
from typing import List, Dict, Optional
import concurrent.futures

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ASUWebScraper:
    def __init__(self, base_domains: List[str]):
        self.visited_urls = set()
        self.text_content = []
        self.base_domains = base_domains
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def clean_text(self, text: str) -> str:
        return ' '.join(text.split())

    def scrape_content(self, url: str) -> bool:
        if url in self.visited_urls:
            return False
        
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.text, 'html.parser')
            content_elements = soup.find_all(['article', 'main', 'section', 'div'])
            
            text = ' '.join([
                self.clean_text(element.get_text())
                for element in content_elements
                if len(element.get_text().strip()) > 50
            ])
            
            if len(text) > 100:
                self.text_content.append({
                    'url': url,
                    'content': text
                })
                self.visited_urls.add(url)
                return True
                
        except requests.RequestException as e:
            logger.error(f"Error scraping {url}: {str(e)}")
        return False

    def search(self, query: str) -> List[Dict[str, str]]:
        # Direct URL for the CS degree page, scraped before the domain search
        base_urls = [
            "https://degrees.apps.asu.edu/bachelors/major/ASU00/ESCSEBS/computer-science/"
        ]
        
        # Parallel scraping using ThreadPoolExecutor
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(self.scrape_content, base_urls)
            
        matching_urls = [f"https://{domain}" for domain in self.base_domains 
                        if query.lower() in domain.lower()]
        
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(self.scrape_content, matching_urls)
            
        return self.text_content

class DataPreprocessor:
    def __init__(self, api_key: str):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50
        )
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-1.5-flash')

    def process_documents(self, documents: List[Dict[str, str]]) -> List[Dict[str, str]]:
        try:
            cleaned_docs = []
            for doc in documents:
                cleaned_text = ' '.join(doc['content'].split())
                refined_text = self._refine_with_gemini(cleaned_text)
                if refined_text:
                    cleaned_docs.append({
                        'content': refined_text,
                        'url': doc['url']
                    })
            
            splits = []
            for doc in cleaned_docs:
                chunks = self.text_splitter.split_text(doc['content'])
                splits.extend([{'content': chunk, 'url': doc['url']} for chunk in chunks])
            return splits
            
        except Exception as e:
            logger.error(f"Error processing documents: {str(e)}")
            raise

    def _refine_with_gemini(self, text: str) -> Optional[str]:
        prompt = """
        Refine and structure the following text to be more concise and informative, 
        while preserving all key information:
        {text}
        """
        try:
            response = self.model.generate_content(prompt.format(text=text))
            return response.text
        except Exception as e:
            logger.error(f"Gemini refinement error: {str(e)}")
            return None

class VectorStoreManager:
    @staticmethod
    def setup_vector_store(processed_docs: List[Dict[str, str]]) -> Qdrant:
        if not processed_docs:
            raise ValueError("No documents to process")
        
        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        
        texts = [doc['content'] for doc in processed_docs]
        if not texts:
            raise ValueError("No text content found in documents")
        
        try:
            return Qdrant.from_texts(
                texts=texts,
                embedding=embeddings,
                metadatas=[{'url': doc['url']} for doc in processed_docs],
                location=":memory:"
            )
        except Exception as e:
            logger.error(f"Error creating vector store: {str(e)}")
            raise

class ASURagSystem:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.scraper = ASUWebScraper(base_domains=[
            "asu.edu", "admission.asu.edu", "students.asu.edu", "degrees.asu.edu",
            "catalog.asu.edu", "my.asu.edu", "engineering.asu.edu", "business.asu.edu",
            "clas.asu.edu", "thecollege.asu.edu", "design.asu.edu", "law.asu.edu",
            "nursingandhealth.asu.edu", "education.asu.edu", "lib.asu.edu",
            "graduate.asu.edu", "provost.asu.edu", "canvas.asu.edu", "tutoring.asu.edu",
            "housing.asu.edu", "eoss.asu.edu", "career.asu.edu", "finance.asu.edu",
            "scholarships.asu.edu", "research.asu.edu", "sustainability.asu.edu",
            "biodesign.asu.edu", "polytechnic.asu.edu", "downtown.asu.edu",
            "westcampus.asu.edu", "thunderbird.asu.edu"
        ])
        
        self.preprocessor = DataPreprocessor(api_key)
        self.vector_store = None
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-1.5-flash')

    def initialize_system(self, query: str) -> None:
        logger.info("Scraping ASU content matching query...")
        documents = self.scraper.search(query)
        if not documents:
            raise ValueError("No documents found matching the query")

        logger.info("Preprocessing documents...")
        processed_docs = self.preprocessor.process_documents(documents)
        if not processed_docs:
            raise ValueError("No processed documents available")
        
        print("\n\n\n Preprocessed Documents\n\n",processed_docs,"\n\n")
        logger.info("Setting up vector store...")
        self.vector_store = VectorStoreManager.setup_vector_store(processed_docs)
        logger.info("System initialized successfully")

    def answer_question(self, question: str) -> str:
        if not self.vector_store:
            raise ValueError("System not initialized. Call initialize_system() first")
        
        context = self.vector_store.similarity_search(question)
        prompt = f"""
        You are an ASU Counselor, trained to provide accurate and helpful information about Arizona State University. Your task is to write detailed, well-structured answers using only the provided context. If the context does not provide specific information, Just say you don't know. Do not refer to old knowledge to answer the question.
        
        Follow these guidelines:

        1. Base your answer solely on the given context, Do not refer to old knowledge to answer the question
        2. Format your answer for readability using:
            - Section headers with ## for main topics
            - Bold text (**) for subtopics
            - Lists and bullet points when appropriate
            - Tables for comparisons
        3. Cite sources using [1], [2] etc. at the end of relevant sentences
        4. Be concise and direct while maintaining a helpful tone
        5. If the context doesn't contain enough information, acknowledge the limitations
        6. Do not include any other information, instructions, Notes or tips.
        7. Always Provide links to the sources or citations.
        8. Stick to the question, only answer what is required, nothing else.

        Example Conversation:

        User: What are on-campus networking opportunities for students at ASU?
        Assistant: Based on the search results, ASU offers numerous on-campus networking opportunities for students. Here's a comprehensive overview:

        ## Career Fairs and Events
        **Fall 2024 Events** include:
        - Internship Fair on September 5th at Tempe Campus
        - Career & Internship Fair on September 24-25th at Tempe Campus
        - Virtual Career & Internship Fair on September 27th via Handshake[1]

        ## Academic Networking
        **Faculty Connections**
        - Students can connect with professors through events and research opportunities
        - Schedule introductory meetings with faculty members[2]

        ## Student Organizations
        **Campus Involvement**
        - Join student clubs and organizations
        - Participate in on-campus student employment
        - Engage with the Global Career Network[3]
        
        Citations:
        [1] https://career.eoss.asu.edu/channels/networking/
        [2] https://asuforyou.asu.edu/jobtransitions/networking
        [3] https://collegeofglobalfutures.asu.edu/student-life/mentorship-networking/
        [4] https://career.eoss.asu.edu/channels/fall-2024-career-internship-fairs/
        [5] https://career.eoss.asu.edu/organizations/global-career-network-at-arizona-state-university/
        [6] https://asuevents.asu.edu/event/virtual-career-and-internship-fair

        Current Question: {question}
        Context: {context}
        """
        print("\n\nGiven Prompt\n\n\n", prompt,"\n\n\n")
        
        try:
            response = self.model.generate_content(prompt)
            return response.text
        except Exception as e:
            logger.error(f"Error generating response: {str(e)}")
            raise

def main():
    api_key = ""
    rag_system = ASURagSystem(api_key)
    
    try:
        rag_system.initialize_system("computer science admission requirements")
        response = rag_system.answer_question(
            "What are the admission requirements for ASU's Bachelors Computer Science program?"
        )
        print(response)
    except Exception as e:
        logger.error(f"Error running RAG system: {str(e)}")

if __name__ == "__main__":
    main()

INFO:__main__:Scraping ASU content matching query...
INFO:__main__:Preprocessing documents...
INFO:__main__:Setting up vector store...
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2





 Preprocessed Documents

 [{'content': '## Arizona State University Bachelor of Science in Computer Science\n\n**This program is offered by the Ira A. Fulton Schools of Engineering, and can be completed on-campus in Tempe or online. **\n\n**Program Description:**', 'url': 'https://degrees.apps.asu.edu/bachelors/major/ASU00/ESCSEBS/computer-science/'}, {'content': '* The Computer Science (BS) program aims to prepare graduates for careers in various computing fields or further graduate studies. \n* Core courses focus on theoretical and practical aspects of computer science, building critical thinking, effective programming, and problem-solving skills across modern languages.\n* There is an emphasis on security and systems issues.', 'url': 'https://degrees.apps.asu.edu/bachelors/major/ASU00/ESCSEBS/computer-science/'}, {'content': '* The program is flexible, allowing students to pursue interests in subfields like artificial intelligence, machine learning, robotics, database systems, an

INFO:__main__:System initialized successfully




Given Prompt


 
        You are an ASU Counselor, trained to provide accurate and helpful information about Arizona State University. Your task is to write detailed, well-structured answers using only the provided context. If the context does not provide specific information, Just say you don't know. Do not refer to old knowledge to answer the question.
        
        Follow these guidelines:

        1. Base your answer solely on the given context, Do not refer to old knowledge to answer the question
        2. Format your answer for readability using:
            - Section headers with ## for main topics
            - Bold text (**) for subtopics
            - Lists and bullet points when appropriate
            - Tables for comparisons
        3. Cite sources using [1], [2] etc. at the end of relevant sentences
        4. Be concise and direct while maintaining a helpful tone
        5. If the context doesn't contain enough information, acknowledge the limitations
        6. Do