### Maths AI Agent Integrated With RAG Vector Database 

Objective: 

The Maths AI Agent Integrated With RAG Vector Database aims to enhance mathematical problem-solving capabilities using Retrieval-Augmented Generation (RAG) and vector-based knowledge retrieval.

Key Goals:

Accurate Mathematical Assistance – Provides precise solutions to complex equations, formulas, and problem-solving tasks.

Contextual & Adaptive Learning – Uses RAG and vector databases to dynamically retrieve relevant mathematical concepts and resources.

Enhanced Search & Retrieval – Leverages AI-driven queries for retrieving structured and unstructured mathematical data.

Automated Theorem & Proof Generation – Supports theorem verification, proof validation, and step-by-step solutions using AI models.

Efficient Knowledge Representation – Stores mathematical principles, equations, and examples in a vector database for rapid access.

Interactive AI Assistance – Engages users with natural language processing (NLP) for intuitive problem-solving and question-answering.


##### Importing all libraries

In [1]:
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document 
from langchain_community.vectorstores import FAISS
import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings
from selenium import webdriver 
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.common.keys import Keys
from langchain.tools import Tool
from langchain_community.tools import DuckDuckGoSearchResults
import json 
from langchain_community.document_loaders import UnstructuredURLLoader
import re
from langchain.agents import initialize_agent
from langchain.schema import SystemMessage, HumanMessage
from langchain.chains import create_retrieval_chain 
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import Ollama 

##### Reading "math_problems.parquet" dataset using pandas

In [2]:
mathsDF = pd.read_parquet("math_problems.parquet")

##### Using "head()" to display first few rows 

In [None]:
mathsDF.head() 

Unnamed: 0,problem,solution
0,Factor \( t^2 - 144 \).,1. Observe the expression \( t^2 - 144 \). It ...
1,Find all integer solutions to the equation \(3...,\nTo determine whether there are any integer s...
2,Eduardo is a teacher. He taught 3 classes last...,"To solve the problem, follow these steps:\n\n1..."
3,Free Christmas decorations are being given out...,Each box contains:\n- 4 pieces of tinsel\n- 1 ...
4,According to a report by People's Daily on May...,"The form of scientific notation is $a×10^n$, w..."


##### Checking duplicates in the dataset

In [4]:
mathsDF.duplicated().sum() 

0

There are no duplicates in the dataset.

##### Creating a new Series "mathsComDF" by combining the "problem" and "solution" columns from "mathsDF"

Putting each problem and its answer into a single text field makes the dataset easier to process and helps the LLM return appropriate answers.

In [27]:
mathsComDF = mathsDF["problem"] + " \nAnswer: " + mathsDF["solution"]

##### Using "tail()" to display last rows of the dataset

In [None]:
mathsComDF.tail(10) 

99990    calculate the time it will take for a full tan...
99991    Prove that if a convex polygon can be divided ...
99992    What is the smallest two-digit whole number, t...
99993    Given an arithmetic sequence $\{a_n\}$, where ...
99994    Show that \(a^{2}+b^{2}+2(a-1)(b-1) \geqslant ...
99995    A pirate ship spots a trading ship $15$ miles ...
99996    Show that the product of four consecutive inte...
99997    The sequence $\{b_n\}$ satisfies $b_1 = 1$ and...
99998    The length of the minute hand of a clock is so...
99999    A group of men decided to do a work in 50 days...
dtype: object

##### Using "RecursiveCharacterTextSplitter" to split large chunks of text into smaller, manageable pieces.

It is useful for document processing, retrieval-augmented generation (RAG), and LLM optimization.

In [29]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=50)
mathsDocs = text_splitter.split_text("\n".join(mathsComDF.tolist())) 


In [None]:
print("Total Number OF Documents :- ",len(mathsDocs)) 

Total Number OF Documents :-  422473


In [None]:
mathsDocs 

['Factor \\( t^2 - 144 \\). \nAnswer: 1. Observe the expression \\( t^2 - 144 \\). It resembles the difference of squares formula \\( a^2 - b^2 = (a-b)(a+b) \\).\n2. Identify \\( a = t \\) and \\( b = 12 \\) since \\( 144 = 12^2 \\).\n3. Apply the difference of squares formula:\n   \\[\n   t^2 - 144 = t^2 - 12^2 = (t - 12)(t + 12)\n   \\]\n4. Therefore, the factorization of \\( t^2 - 144 \\) is \\(\\boxed{(t-12)(t+12)}\\).\nFind all integer solutions to the equation \\(3x - 12y = 7\\). \nAnswer:',
 'Answer: \nTo determine whether there are any integer solutions to the equation \\(3x - 12y = 7\\), we need to consider the divisibility properties of the left-hand side and the right-hand side of the equation.',
 '1. **Analyzing Divisibility:**\n   - The left-hand side of the equation is \\(3x - 12y\\).\n   - Notice that both terms, \\(3x\\) and \\(12y\\), are divisible by 3.\n   - Therefore, the entire expression \\(3x - 12y\\) is divisible by 3.\n\n2. **Examining the Right-Hand Side:**\n 

In [30]:
mathsDocs[0]

'Factor \\( t^2 - 144 \\). \nAnswer: 1. Observe the expression \\( t^2 - 144 \\). It resembles the difference of squares formula \\( a^2 - b^2 = (a-b)(a+b) \\).\n2. Identify \\( a = t \\) and \\( b = 12 \\) since \\( 144 = 12^2 \\).\n3. Apply the difference of squares formula:\n   \\[\n   t^2 - 144 = t^2 - 12^2 = (t - 12)(t + 12)\n   \\]\n4. Therefore, the factorization of \\( t^2 - 144 \\) is \\(\\boxed{(t-12)(t+12)}\\).\nFind all integer solutions to the equation \\(3x - 12y = 7\\). \nAnswer:'
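The first chunk above ends with the opening of the next problem ("Find all integer solutions ...") because every record is joined into one long string before splitting. One possible alternative is to chunk each problem/solution record on its own, so that no chunk mixes two problems. The sketch below illustrates the idea in plain Python; `chunk_text` is a hypothetical fixed-width splitter written for this example and is much cruder than `RecursiveCharacterTextSplitter`, and the two sample records are made up.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Fixed-width character chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Hypothetical records in the same "problem \nAnswer: solution" layout as mathsComDF
records = [
    "Factor t^2 - 144. \nAnswer: (t - 12)(t + 12)",
    "Solve 3x + 7 = 22. \nAnswer: x = 5",
]

# Chunk every record separately and remember which record each piece came from
chunks = [
    (record_id, piece)
    for record_id, record in enumerate(records)
    for piece in chunk_text(record, chunk_size=30, chunk_overlap=5)
]

for record_id, piece in chunks:
    print(record_id, repr(piece))
```

With the LangChain splitter, the same effect can probably be obtained by passing the list of records to `text_splitter.create_documents(...)` instead of joining them first, which also returns `Document` objects directly.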

In [31]:
mathTextToDoc = [Document(page_content=doc) for doc in mathsDocs]

In [32]:
mathTextToDoc[0]

Document(metadata={}, page_content='Factor \\( t^2 - 144 \\). \nAnswer: 1. Observe the expression \\( t^2 - 144 \\). It resembles the difference of squares formula \\( a^2 - b^2 = (a-b)(a+b) \\).\n2. Identify \\( a = t \\) and \\( b = 12 \\) since \\( 144 = 12^2 \\).\n3. Apply the difference of squares formula:\n   \\[\n   t^2 - 144 = t^2 - 12^2 = (t - 12)(t + 12)\n   \\]\n4. Therefore, the factorization of \\( t^2 - 144 \\) is \\(\\boxed{(t-12)(t+12)}\\).\nFind all integer solutions to the equation \\(3x - 12y = 7\\). \nAnswer:')

In [None]:
type(mathTextToDoc) 

list

### Using "HuggingFaceEmbeddings" to generate vector embeddings for text data

In [35]:
embeddingModel = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

### Using "FAISS" to create vector database

In [None]:
vector_store = FAISS.from_documents(mathTextToDoc, embedding=embeddingModel) 

In [None]:
type(vector_store)

### Saving FAISS vector database into local machine

In [None]:
vector_store.save_local("faiss_dbStore")  

### Loading FAISS vector database from local machine

In [38]:
loaded_vector_store = FAISS.load_local("faiss_dbStore",embeddings=embeddingModel,allow_dangerous_deserialization=True)

##### Using "as_retriever" to search the vector database utilizing "mmr" search_type

In [39]:
retriever = loaded_vector_store.as_retriever(search_type="mmr",search_kwargs={"k":6})
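To give some intuition for `search_type="mmr"`: maximal marginal relevance picks results one at a time, rewarding similarity to the query and penalising similarity to results already chosen. The following is a minimal pure-Python sketch of that idea, not the code FAISS or LangChain actually runs; the toy 2-D vectors and the helper names `cosine` and `mmr_select` are assumptions made for illustration. The `lambda_mult` name mirrors the option that LangChain retrievers accept in `search_kwargs`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen greedily by maximal marginal relevance."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.11], [0.5, 0.9]]  # docs 0 and 1 are near-duplicates

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # pure relevance
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # favours diversity
```

With `lambda_mult=1.0` the two near-duplicate vectors are both returned, while a lower value swaps the second one for the more distinct vector. For a corpus with many near-identical problem statements, tuning this value may matter as much as `k`.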

In [40]:
retrieved_docs = retriever.invoke("Solve for x in 3x + 7 = 22") 

In [18]:
print(retrieved_docs[3].page_content) 

Therefore, the correct answer is $A: b < c < a$. So, encapsulating the final answer, we write:
\[\boxed{A}\]
When the value of $x$ is tripled and then this increased value is divided by 7, the result is 21. What is the value of $x$? 
Answer: 1. Set up the equation based on the problem statement:
   \[
   \frac{3x}{7} = 21
   \]

2. Solve for $x$:
   \[
   3x = 21 \times 7
   \]
   \[
   3x = 147
   \]
   \[
   x = \frac{147}{3}
   \]
   \[
   x = 49
   \]


### Using Ollama to load the "llama3.2" LLM

In [42]:
llm = Ollama(model="llama3.2") 

##### Using "system_prompt" to define the task, role, and goal for the LLM

In [43]:
# Adjacent string literals are concatenated as-is, so each sentence ends with a space
system_prompt = (
    "You are an assistant for mathematical question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise."
    "\n\n"
    "{context}"
)

### Using "ChatPromptTemplate" to ensure consistent prompt structuring for conversational AI

In [None]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system",system_prompt),
        ("human","{input}"),
    ]
)

### RAG Pipeline

In [None]:
question_answer_chain = create_stuff_documents_chain(llm,prompt)  
rag_chain = create_retrieval_chain(retriever,question_answer_chain)
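As a rough mental model of these two lines: the retriever fetches chunks for the input, the "stuff" chain pastes their text into the `{context}` slot of the system prompt, and the LLM sees the result together with the human message. The sketch below imitates only the prompt-assembly step in plain Python. `build_prompt`, the shortened template, and the sample chunks are hypothetical, and joining chunks with a blank line is an assumption about the default separator rather than something this notebook configures.

```python
# Shortened, hypothetical version of the notebook's system prompt
SYSTEM_TEMPLATE = (
    "You are an assistant for mathematical question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def build_prompt(question: str, retrieved_chunks: list[str]) -> list[tuple[str, str]]:
    """Stuff all retrieved chunks into the system message and append the user question."""
    context = "\n\n".join(retrieved_chunks)  # assumed separator between chunks
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


messages = build_prompt(
    "Factor t^2 - 144",
    ["chunk A: difference of squares", "chunk B: 144 = 12^2"],
)
for role, content in messages:
    print(f"[{role}]\n{content}\n")
```

Printing the assembled messages like this is a quick way to check what the model actually receives when an answer looks wrong.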

##### Using "rag_chain" to retrieve content from vector database and make response accordingly

In [None]:
response = rag_chain.invoke({"input":"t^2 - 144"})
print(response["answer"]) 

I don't have enough information to provide a specific response. The context provided seems incomplete or unclear, and I need more details to accurately answer your question. Please provide more context or clarify the question so I can assist you better.



### Selenium-Automated Web Search for the AI Agent

Configure a headless Chrome driver that the AI agent can use to run web searches.

In [None]:
options = Options()
options.add_argument("--headless") 
options.add_argument("--user-agent=Mozilla/5.0") 
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--enable-webgl")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
options.add_argument("start-maximized")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--window-size=1920,1080")

chromedriver_path = "C:/Users/Sambhu/Downloads/Compressed/chromedriver-win64/chromedriver.exe"
service = Service(chromedriver_path)

### Created function "duckduckgo_search"

This function runs a web search for a maths-related query and returns up to five result URLs, or an error string if the search fails.

In [None]:
def duckduckgo_search(query):
    """Return up to five result URLs for a maths query, or an error string."""
    blocked_domains = ["youtube.com", "microsoft.com", "google.com", "duckduckgo.com"]
    blocked_params = ["?q=", "&ia=", "maps", "images", "videos", "news", "chat", "settings"]
    driver = None
    try:
        driver = webdriver.Chrome(service=service,options=options)
        driver.get("https://www.duckduckgo.com/")
        search = driver.find_element(By.XPATH,"//*[@id='searchbox_input']")
        time.sleep(3)
        search.send_keys("Show only maths sites containing "+query)
        time.sleep(3)
        search.send_keys(Keys.ENTER)
        time.sleep(3)  # wait for the results page to load

        results = []
        for link in driver.find_elements(By.TAG_NAME, "a"):
            href = link.get_attribute("href")  # read the attribute once per link
            if (
                href
                and href.startswith("http")
                and not any(domain in href for domain in blocked_domains)
                and not any(param in href for param in blocked_params)
            ):
                results.append(href)

        return results[:5]

    except Exception as e:
        return f"Error: {e}"
    finally:
        if driver is not None:
            driver.quit()  # always release the browser, even when the search fails
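The link filter in "duckduckgo_search" matches plain substrings, so a word such as "news" or "chat" anywhere in a URL is enough to discard it. A stricter variant, sketched here under the assumption that only whole domains should be blocked, compares the parsed hostname instead. `BLOCKED_HOSTS` and `is_allowed` are hypothetical names for this example, and the sample URLs are made up.

```python
from urllib.parse import urlparse

# Domains to exclude, taken from the filter in duckduckgo_search
BLOCKED_HOSTS = {"youtube.com", "microsoft.com", "google.com", "duckduckgo.com"}


def is_allowed(url: str) -> bool:
    """Accept http(s) URLs whose host is not a blocked domain or one of its subdomains."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    return not any(host == blocked or host.endswith("." + blocked) for blocked in BLOCKED_HOSTS)


candidates = [
    "https://www.youtube.com/watch?v=abc",
    "https://example.org/maths/news-on-determinants.html",
    "https://duckduckgo.com/settings",
    "ftp://example.org/file.txt",
]
print([url for url in candidates if is_allowed(url)])
```

Only the second URL survives, even though its path contains "news", because the check looks at the host rather than the whole string.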
        

### Created function "fetch_page_content" 

This function leverages "UnstructuredURLLoader" to extract the text content of the URLs returned by "duckduckgo_search".

In [51]:
def fetch_page_content(urls):
    """Load each URL and return its text as a JSON list of {"raw_content": ...} records."""
    try:
        if not urls or not all(url.startswith("http") for url in urls):
            return json.dumps({"error": "Invalid URLs detected"})

        loader = UnstructuredURLLoader(urls=urls)
        docs = loader.load()

        structured_data = []
        for doc in docs:
            # Collapse newlines so each page becomes a single line of text
            clean_content = doc.page_content.strip().replace("\n"," ").strip()
            structured_data.append({"raw_content": clean_content})

        return json.dumps(structured_data,indent=4)
    except Exception as e:
        # Return JSON here as well so that callers can always json.loads the result
        return json.dumps({"error": str(e)})

### Created the function "process_extracted_content"

This function uses the LLM (llama3.2) to clean the content fetched by "fetch_page_content" and extract meaningful insights from it. Only the first chunk of the fetched text is sent to the LLM.

In [52]:
def process_extracted_content(llm_content,maths_question):
    """Summarise fetched page text into maths content relevant to the question."""
    try:
        parsed_data = json.loads(llm_content)
    except json.JSONDecodeError:
        return "Error: No Valid Content Found"
    parsed_splitter = RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=50)

    all_docs = []
    for entry in parsed_data:
        # Skip anything that is not a {"raw_content": ...} record (e.g. an error payload)
        if isinstance(entry, dict) and "raw_content" in entry:
            split_docs = parsed_splitter.split_text(entry["raw_content"])
            all_docs.extend(split_docs)
    
    if all_docs:
        # f-string so that {maths_question} is substituted into the prompt;
        # only the first chunk is sent to the LLM
        retrieved = llm.predict(f"""
            Given the maths question: "{maths_question}" , Extract and present essential mathematical concepts, problem-solving methods, and key theoretical principles that are directly relevant to solving the given query or queries. Retrieve all step-by-step methods, including numerical calculations, trigonometric computations (such as sine, cosine, tangent values), probability distributions, algebraic transformations, calculus operations (such as differentiation and integration), statistical methods, and any other mathematical calculations that contribute to solving the problem in a comprehensive manner.
            Ensure that all relevant computations are structured clearly, allowing the AI agent to generate a complete and precise solution. Focus on both fundamental principles and advanced problem-solving techniques.
            Avoid including unnecessary content, vague descriptions, repeated phrases, redundant jargon, or duplicate solutions. Filter out irrelevant details and ensure all retrieved information is **distinct, concise, and meaningful** to maintain a clean and efficient workflow.
            Prioritize structured responses with well-organized calculations, precise reasoning, and logically sequenced steps, enhancing clarity and accuracy in the final solution. :\n\n"""+ all_docs[0])
        
        return retrieved
    else : 
        return "Error: No Valid Content Found"


### Created function "search_and_extract"

This function acts as a tool for the AI agent to perform internet searches based on user input.

**Retrieving Search Results:** The results from "duckduckgo_search" are stored in "scraped_data". If the search failed, "scraped_data" is an error string, and the function returns it immediately.

**Extracting Page Content:** Otherwise, "fetch_page_content" processes the URLs in "scraped_data", and its output is stored in "extracted_content".

**Refining Content:** "process_extracted_content" function takes "extracted_content" and refines it into meaningful information, stored in "maths_content".

**Final Output:** The function "search_and_extract" ultimately returns "maths_content", ensuring relevant, structured results from the search.

In [53]:
def search_and_extract(query: str) -> str:
    scraped_data = duckduckgo_search(query)
    
    if isinstance(scraped_data, str):  # If it's an error string, return it directly
        return scraped_data
    
    extracted_content = fetch_page_content(scraped_data)
    maths_content = process_extracted_content(extracted_content,query)
    
    return maths_content

##### Created "search_extract_tool" Tool for the AI Agent

In [54]:
search_extract_tool = Tool(
    name="search_and_extract_tool",
    func=search_and_extract, 
    description="Fetches mathematical solutions from the web."
)

##### Created the AI Agent "answer_agent"

In [82]:
answer_agent = initialize_agent( 
    tools=[search_extract_tool],  
    llm=llm, 
    agent="zero-shot-react-description",  
    verbose=True,
    handle_parsing_errors=True
)
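A ReAct-style agent such as "zero-shot-react-description" relies on the LLM writing a bare tool name after `Action:` and the argument after `Action Input:`. The sketch below is a simplified stand-in for that parsing step, not LangChain's actual parser; `ACTION_RE`, `parse_action`, and the sample strings are assumptions made for illustration. It shows why an output like `Action: search_and_extract_tool(query="...")` is rejected as "not a valid tool" in the agent trace at the end of this notebook.

```python
import re

TOOLS = {"search_and_extract_tool"}

# Simplified pattern: tool name on the Action line, argument on the Action Input line
ACTION_RE = re.compile(r"Action\s*:\s*(.*?)\s*\n\s*Action Input\s*:\s*(.*)", re.DOTALL)


def parse_action(llm_output: str):
    """Return (tool_name, tool_input, is_known_tool), or None if the format is not matched."""
    match = ACTION_RE.search(llm_output)
    if match is None:
        return None
    tool = match.group(1).strip()
    tool_input = match.group(2).strip().strip('"')
    return tool, tool_input, tool in TOOLS


well_formed = 'Thought: I should search.\nAction: search_and_extract_tool\nAction Input: "t^2 - 144"'
call_style = 'Action: search_and_extract_tool(query="solve t^2 - 144")\nAction Input: "t^2 - 144"'

print(parse_action(well_formed))
print(parse_action(call_style))
```

Small local models often drift into the call-style form, so stating the expected format explicitly in the prompt or tool description may reduce these failures.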


##### Query for the AI Agent

In [62]:
math_query = "What is the determinant of the matrix ?" 

##### Using "SystemMessage" to give the AI Agent its task, purpose, goal, and role

In [84]:
answer_custom_prompt = SystemMessage(
    content = "You are a highly capable math expert. Use both your own mathematical knowledge and relevant extracted web content "
    "to solve the given math_query with a clear, step-by-step solution. "
    "If web content is available, incorporate useful insights, but prioritize accuracy and logical reasoning based on fundamental math principles. "
    "Ignore irrelevant information and ensure clarity. Format your response as follows:\n\n"
    "solution_steps\n"
    "Step 1: \n"
    "Step 2: \n"
    "Step 3: \n"
    "Step 4: \n"
    "Step 5: \n"
    "Continue step-by-step until the query is fully solved.\n\n"
    "If, after attempting all possible methods, you determine that the problem cannot be solved with the available information, clearly state:\n"
    "\"I am unable to solve this math problem with the given knowledge and resources.\" "
    "Provide reasoning for why the solution is not possible, such as missing variables, undefined operations, or limitations in known mathematical principles."
)


##### Running AI Agent

In [None]:
answer_response = answer_agent.run([
    answer_custom_prompt,  
    HumanMessage(content=math_query)]) 

'The given equation can be proved as follows:\n\n(sin α + cos α) (tan α + cot α)\n\nUsing the trigonometric identities:\nsin^2α + cos^2α = 1\ntanα = sinα / cosα\ncotα = cosα / sinα\n\nWe can expand the left-hand side of the equation:\n= sinα(tanα + cotα) + cosα(tanα + cotα)\n\n= sinα(sinα/cosα + cosα/sinα) + cosα(sinα/cosα + cosα/sinα)\n\n= sin^2α / cosα + cosα / sinα * sinα + sinα * cosα / cosα + cos^2α / sinα\n\nUsing the identity sin^2α + cos^2α = 1, we can simplify further:\n= (sin^2α + cos^2α) / cosα + sinα + cosα / sinα\n\n= 1 / cosα + sinα + cosα / sinα\n\nNow, using the definition of secant and cosecant functions:\nsecα = 1 / cosα\ncosecα = 1 / sinα\n\nWe can rewrite the equation as:\n= secα + cosecα\n\nTherefore, we have proved that (sin α + cos α) (tan α + cot α) = sec α + cosec α.'


### Created function "validation_agent_check" 

This function evaluates whether the response generated by the RAG model aligns with the user’s query.
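* If the response accurately corresponds to the query, the LLM is expected to answer "valid".

* If there is a mismatch between the response and the query, the LLM is expected to answer "invalid".

Only the first 200-character chunk of the response is shown to the LLM, and its reply is returned lowercased.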

In [85]:
def validation_agent_check(rag_response_text,maths_question):
    """Ask the LLM whether the RAG response answers the question; return its lowercased reply."""
    rag_parsed_splitter = RecursiveCharacterTextSplitter(chunk_size=200,chunk_overlap=20)
    text_chunks = rag_parsed_splitter.split_text(rag_response_text)
    if not text_chunks:  # an empty RAG response cannot be valid
        return "invalid"
    
    validation_prompt = f"""
    Evaluate the following AI response in relation to the question: "{maths_question}".

    Response: "{text_chunks[0]}"

    Does this response contain a valid mathematical explanation, equation, or solution?
    Answer only 'valid' or 'invalid' based on whether it directly solves the problem.
    """ 
    validation_result = llm.predict(validation_prompt)
    return validation_result.strip().lower()
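Because the function returns the model's reply as free text, a reply such as "Valid." or "'valid'" would not equal the string "valid" exactly. One way to make the verdict more robust, sketched here as an assumption rather than as part of the notebook, is to normalise the reply before comparing it. `normalize_verdict` is a hypothetical helper, and it deliberately treats anything ambiguous as "invalid" so that the fallback path is taken.

```python
import re


def normalize_verdict(raw_reply: str) -> str:
    """Map a free-text LLM reply to exactly 'valid' or 'invalid'."""
    words = re.findall(r"[a-z]+", raw_reply.lower())
    if "invalid" in words:
        return "invalid"
    if "valid" in words and "not" not in words:
        return "valid"
    return "invalid"  # ambiguous or empty replies fall back to 'invalid'


for reply in ["Valid.", "'invalid'", "The response is valid", "not valid", "I cannot tell"]:
    print(repr(reply), "->", normalize_verdict(reply))
```

Defaulting to "invalid" costs an extra agent call in doubtful cases, which seems a reasonable trade-off when the alternative is returning an unverified answer.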



### Created function "query_math_solution"

Using the fallback mechanism, this function first processes the user query through the RAG pipeline to generate a response. The generated response is then passed to the validation_agent_check function for verification.

* If validation_agent_check determines the response is valid, the function "query_math_solution" returns the RAG-generated response to the user.

* If validation_agent_check returns invalid, the AI Agent takes over and responds directly to the user's query.

In [86]:

def query_math_solution(question_text):
    response = rag_chain.invoke({"input": question_text})
    simpleText = response["answer"]
    validation_result = validation_agent_check(simpleText,question_text)

    if validation_result == "valid":
        return "Vector Response :- " + response["answer"]
    else :    
        answer_response = answer_agent.run([
            answer_custom_prompt, 
            HumanMessage(content=question_text)])
          
        return "AI Agent :- " + answer_response
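The control flow of "query_math_solution" can be exercised without an LLM by passing the three steps in as plain functions. The sketch below is a hypothetical refactoring for illustration only; `answer_with_fallback` and the stub functions are not part of the notebook, and the stubs return canned strings.

```python
from typing import Callable


def answer_with_fallback(
    question: str,
    primary: Callable[[str], str],
    validate: Callable[[str, str], str],
    fallback: Callable[[str], str],
) -> str:
    """Return the primary answer if it validates, otherwise the fallback answer."""
    draft = primary(question)
    if validate(draft, question) == "valid":
        return "Vector Response :- " + draft
    return "AI Agent :- " + fallback(question)


# Stubs standing in for rag_chain, validation_agent_check and answer_agent
def stub_rag(question: str) -> str:
    return "x = 5"


def stub_rag_unsure(question: str) -> str:
    return "I don't have enough information."


def stub_validate(draft: str, question: str) -> str:
    return "valid" if "=" in draft else "invalid"


def stub_agent(question: str) -> str:
    return "Step 1: subtract 7. Step 2: divide by 3. x = 5"


print(answer_with_fallback("Solve 3x + 7 = 22", stub_rag, stub_validate, stub_agent))
print(answer_with_fallback("Solve 3x + 7 = 22", stub_rag_unsure, stub_validate, stub_agent))
```

Keeping the steps injectable like this makes it possible to unit-test the routing logic separately from the model calls.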


In [87]:
query = "t^2 - 144" 
math_solution = query_math_solution(query)
print(math_solution)  



> Entering new AgentExecutor chain...
Question: Solve the math problem t^2 - 144
Thought: Since we have a quadratic equation in the form of t^2 - c, where c is a constant, I can try to factorize or use the square root method to solve for t.
Action: search_and_extract_tool(query="solve t^2 - 144")
Action Input: "t^2 - 144"
Observation: search_and_extract_tool(query="solve t^2 - 144") is not a valid tool, try one of [search_and_extract_tool].
Thought:Thought: Since we can't use the provided tool, I'll try using my own mathematical knowledge to solve this quadratic equation.
Action: search_and_extract_tool(query="factorize quadratic equations")
Action Input: "factorize quadratic equations"
Observation: search_and_extract_tool(query="factorize quadratic equations") is not a valid tool, try one of [search_and_extract_tool].
Thought:Question: How to solve the equation t^2 - 144
Thought: Since we have a quadratic equation in the form of