# Pratham.org Q&A Bot

## Objective
Build a question-answering bot that collects information from pratham.org and responds to user queries.

## Table of Contents
1. [Setup and Dependencies](#setup-and-dependencies)
2. [Data Collection](#data-collection)
3. [Data Processing](#data-processing)
4. [Knowledge Base Creation](#knowledge-base-creation)
5. [Q&A Bot Development](#qa-bot-development)
6. [Evaluation](#evaluation)
7. [Cost Analysis](#cost-analysis)

## 1. Setup and Dependencies <a name="setup-and-dependencies"></a>

In [3]:
# @markdown **First, let's install the necessary libraries:**


!pip install -q scrapy
!pip install -q aiohttp beautifulsoup4 pypdf
!pip install -q langchain langchain-core langchain-community langchain-openai
!pip install -q faiss-cpu tiktoken gradio

In [4]:
# @markdown **Now, let's import the required modules:**

import scrapy
from scrapy.crawler import CrawlerProcess
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # the OpenAI client class is imported from the openai package below
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.messages import SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_community.callbacks.manager import get_openai_callback
import gradio as gr
import os
import json
import aiohttp
import asyncio
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import re
import io
from pypdf import PdfReader
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Set up OpenAI API
from google.colab import userdata
from openai import OpenAI

client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
OPENAI_MODEL_NAME = "gpt-4-1106-preview"
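
The cell above depends on `google.colab.userdata`, so it only runs inside Colab. A minimal sketch of a fallback, assuming the key is otherwise exported as the `OPENAI_API_KEY` environment variable. The helper name `load_openai_api_key` is hypothetical and not part of the original notebook:

```python
import os


def load_openai_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from Colab secrets if available, else from the environment."""
    try:
        # This module is only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        return userdata.get(env_var)

    # Outside Colab (assumption): the key was exported in the shell beforehand.
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key
```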

## 2. Data Collection <a name="data-collection"></a>

In [5]:
# @markdown **We'll use an asynchronous web scraper to collect data from Pratham.org:**

class AsyncPrathamScraper:
    def __init__(self, start_url='https://www.pratham.org/'):
        self.start_url = start_url
        self.base_url = f"{urlparse(start_url).scheme}://{urlparse(start_url).netloc}"
        self.visited_urls = set()
        self.results = []
        self.max_retries = 3
        self.retry_delay = 5  # seconds

    async def fetch(self, session, url, retries=0):
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
                if response.status == 200:
                    if response.content_type == 'application/pdf':
                        return await response.read()
                    else:
                        return await response.text()
                else:
                    logger.warning(f"Non-200 status code for {url}: {response.status}")
                    return None
        except asyncio.TimeoutError:
            if retries < self.max_retries:
                logger.warning(f"Timeout for {url}. Retrying in {self.retry_delay} seconds...")
                await asyncio.sleep(self.retry_delay)
                return await self.fetch(session, url, retries + 1)
            else:
                logger.error(f"Max retries reached for {url}")
                return None
        except aiohttp.ClientError as e:
            logger.error(f"Error fetching {url}: {str(e)}")
            return None

    def extract_content(self, soup):
        content = []
        selectors = ['div.entry-content', 'main#main', 'div.content-area', 'article']

        for selector in selectors:
            content_element = soup.select_one(selector)
            if content_element:
                content = [text.strip() for text in content_element.stripped_strings]
                break

        if not content and soup.body:  # soup.body is None when the document has no <body>
            content = [text.strip() for text in soup.body.stripped_strings if text.strip()]

        content = ' '.join(content)
        content = re.sub(r'\s+', ' ', content)
        return content.strip()

    def extract_pdf_content(self, pdf_content):
        try:
            pdf = PdfReader(io.BytesIO(pdf_content))
            text = ""
            for page in pdf.pages:
                text += page.extract_text() + "\n"
            return text.strip()
        except Exception as e:
            logger.error(f"Error extracting PDF content: {str(e)}")
            return ""

    async def scrape_page(self, session, url):
        if url in self.visited_urls:
            return
        self.visited_urls.add(url)

        content = await self.fetch(session, url)
        if content is None:
            return

        if isinstance(content, bytes):  # PDF content
            extracted_content = self.extract_pdf_content(content)
            self.results.append({
                'url': url,
                'title': url.split('/')[-1],
                'content': extracted_content
            })
        else:  # HTML content
            soup = BeautifulSoup(content, 'html.parser')
            extracted_content = self.extract_content(soup)
            self.results.append({
                'url': url,
                'title': soup.title.string if soup.title else '',
                'content': extracted_content
            })

            internal_links = [
                urljoin(self.base_url, a['href'])
                for a in soup.find_all('a', href=True)
                if urlparse(a['href']).netloc == '' or urlparse(a['href']).netloc == urlparse(self.base_url).netloc
            ]

            tasks = [asyncio.create_task(self.scrape_page(session, link)) for link in internal_links]
            await asyncio.gather(*tasks)

    async def run(self):
        async with aiohttp.ClientSession() as session:
            await self.scrape_page(session, self.start_url)

        with open('pratham_org_data.json', 'w', encoding='utf-8') as f:
            json.dump(self.results, f, ensure_ascii=False, indent=4)
        logger.info("Scraping completed. Data saved to pratham_org_data.json")

# Run the scraper
async def main():
    scraper = AsyncPrathamScraper()
    await scraper.run()

await main()

print("Scraping process completed. Check the 'pratham_org_data.json' file for results.")

ERROR:__main__:Error fetching mailto:info@pratham.org: mailto:info@pratham.org
ERROR:__main__:Error fetching mailto:ece@pratham.org: mailto:ece@pratham.org
ERROR:__main__:Error fetching mailto:digital@pratham.org: mailto:digital@pratham.org
ERROR:__main__:Error fetching mailto:contact@asercentre.org: mailto:contact@asercentre.org
ERROR:__main__:Error fetching mailto:secondchance@pratham.org: mailto:secondchance@pratham.org
ERROR:__main__:Error fetching javascript:void(0): javascript:void(0)
ERROR:__main__:Error fetching tel:tel:11 46023612: tel:tel:11%2046023612
ERROR:__main__:Error fetching mailto:mailto: demo@example.com: mailto:mailto:%20demo@example.com
ERROR:__main__:Error fetching mailto:recruitment@pratham.org: mailto:recruitment@pratham.org
ERROR:__main__:Error fetching mailto:prmrecruitment@pratham.org: mailto:prmrecruitment@pratham.org
ERROR:__main__:Error fetching mailto:mme@pratham.org: mailto:mme@pratham.org
ERROR:__main__:Error fetching mailto:life.digitalcontent@pratham.

Scraping process completed. Check the 'pratham_org_data.json' file for results.
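
The `mailto:`, `tel:`, and `javascript:` errors in the log above appear because such links have an empty network location, so the `netloc == ''` check in `scrape_page` treats them as internal. A minimal sketch of a stricter filter, assuming only `http`/`https` links on the same host should be followed. `normalize_internal_link` is a hypothetical helper, not part of the scraper above:

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_internal_link(page_url, href, allowed_netloc):
    """Return an absolute same-host URL for href, or None if it should be skipped."""
    # Resolve relative links against the page they were found on,
    # and drop the #fragment so the same page is not visited twice.
    absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # drops mailto:, tel:, javascript:, ...
    if parsed.netloc != allowed_netloc:
        return None  # link points to an external site
    return absolute
```

In `scrape_page`, the list comprehension could call this helper for each `a['href']` and keep only the non-`None` results.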


## 3. Data Processing <a name="data-processing"></a>

In [6]:
# @markdown **Now, let's process and clean the scraped data:**

# @markdown - Load the scraped data
with open("pratham_org_data.json", 'r', encoding='utf-8') as infile:
    all_pages_json_data = json.load(infile)

# @markdown - Clean and filter the data
seen_duplicate_title = set()
clean_data = []
for idx, item in enumerate(all_pages_json_data):
    if 'url' not in item or 'title' not in item or 'content' not in item:
        logger.warning(f"Missing required fields in item {idx}")
        continue
    if len(item['content'].split()) <= 50 or item['title'] in seen_duplicate_title:
        logger.info(f"Skipping item {idx}: too short or duplicate title")
        continue

    clean_data.append(item)
    seen_duplicate_title.add(item['title'])

print(f"Cleaned data contains {len(clean_data)} items")

# @markdown - Save the cleaned data
with open("pratham_org_clean_data.json", 'w', encoding='utf-8') as outfile:
    json.dump(clean_data, outfile)

Cleaned data contains 211 items
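
The filter above keeps the first page seen for each title, so two distinct pages that happen to share a title collapse into one. A sketch of an alternative that deduplicates on a hash of the content instead, under the assumption that identical text is the duplication that matters. `dedupe_by_content` is a hypothetical helper, and its 50-word default mirrors the threshold used in the cell above:

```python
import hashlib


def dedupe_by_content(pages, min_words=50):
    """Keep pages with more than min_words words and previously unseen content."""
    seen_hashes = set()
    kept = []
    for page in pages:
        content = page.get("content", "")
        if len(content.split()) <= min_words:
            continue  # too short to be useful
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # identical text was already kept under another URL
        seen_hashes.add(digest)
        kept.append(page)
    return kept
```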


## 4. Knowledge Base Creation <a name="knowledge-base-creation"></a>

In [7]:
# @markdown **We'll create a knowledge base using the cleaned data:**

# @markdown - Create text chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=100,
    length_function=len,
)

docs = text_splitter.create_documents(
    texts=[str(item['content']) for item in clean_data],
    metadatas=[{'title': item['title'], 'url': item['url']} for item in clean_data]
)

print(f"Created {len(docs)} document chunks")

# @markdown - Create embeddings and vector store
embedding = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)
vectordb = FAISS.from_documents(documents=docs, embedding=embedding)

# @markdown - Save the vector store
vectordb.save_local("faiss_knowledge_base")
print("Vector store saved to 'faiss_knowledge_base'")

Created 2581 document chunks
Vector store saved to 'faiss_knowledge_base'
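
To illustrate what `chunk_size=1200` and `chunk_overlap=100` mean, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph breaks, newlines, and spaces). `sliding_chunks` is a hypothetical helper that only shows how consecutive chunks share characters:

```python
def sliding_chunks(text, chunk_size=1200, chunk_overlap=100):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With the defaults, each new chunk starts 1100 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next.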


## 5. Q&A Bot Development <a name="qa-bot-development"></a>

### Now, let's develop the Q&A bot using the created knowledge base:

In [8]:
# @markdown **Load the vector store**
vectordb = FAISS.load_local(
    folder_path="faiss_knowledge_base",
    embeddings=embedding,
    allow_dangerous_deserialization=True  # the saved index includes a pickle file; only load files you created yourself
)
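
Once loaded, the index is searched by vector similarity: a query embedding is compared against the stored vectors and the closest ones are returned. A brute-force sketch in plain Python, assuming Euclidean distance and toy 2-dimensional vectors. `top_k` is a hypothetical helper, and FAISS performs the equivalent search far more efficiently:

```python
import math


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors in doc_vecs closest to query_vec."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()  # smallest distance first
    return [idx for _dist, idx in distances[:k]]
```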

In [9]:
# @markdown **Set up the language model and retriever**
llm = ChatOpenAI(model_name=OPENAI_MODEL_NAME)
retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# @markdown - Define the chat prompt
system_message = '''You are an AI assistant for Pratham (https://pratham.org), an innovative learning organization improving education in India. Your knowledge covers Pratham's history, programs, and impact. Guidelines:

1. Provide accurate, relevant information based on Pratham's context.
2. State uncertainties clearly; avoid assumptions.
3. Be concise and informative.
4. Use a friendly, educational tone.
5. Suggest relevant Pratham resources when appropriate.
6. Don't fabricate information.

Assist users in understanding Pratham's educational work and impact in India.'''

user_message = '''Answer this question using the provided context only.

{question}

Context:
{context}'''

# @markdown - Set up the RAG chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_messages([
    SystemMessage(content=system_message),
    ("human", user_message),
])

rag_chain = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

retrieve_docs = (lambda x: x["input"]) | retriever
chain = RunnablePassthrough.assign(question=lambda x: x["input"], context=retrieve_docs).assign(
    answer=rag_chain
)
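
The chain passes a dictionary through each step, and every `assign` adds keys to it. A plain-Python sketch of that data flow, with stand-in functions in place of the retriever and the LLM. `fake_retrieve`, `fake_answer`, and `run_chain` are hypothetical placeholders, and no LangChain code is involved:

```python
def fake_retrieve(query):
    """Stand-in for the retriever: returns a list of context strings."""
    return [f"context about {query}"]


def fake_answer(state):
    """Stand-in for prompt | llm | parser: builds a string from the state."""
    return f"Answer to '{state['question']}' using {len(state['context'])} document(s)"


def run_chain(inputs):
    state = dict(inputs)
    # First assign: add 'question' and 'context', both derived from 'input'.
    state = {**state, "question": state["input"], "context": fake_retrieve(state["input"])}
    # Second assign: add 'answer', computed from the enriched state.
    state = {**state, "answer": fake_answer(state)}
    return state
```

The final dictionary therefore carries `input`, `question`, `context`, and `answer`, which is why downstream code can read both the sources and the generated text from the same result.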

In [10]:
# @markdown **Define the prediction function for Gradio**
def predict(message, history):
    partial_message = ''
    for chunk in chain.stream({"input": message}):
        if chunk.get('context'):
            partial_message += "**Sources:**\n" + " | ".join([f"[{item.metadata['title']}]({item.metadata['url']})" for item in chunk['context']]) + "\n\n"
            yield partial_message

        if chunk.get('answer'):
            partial_message += chunk['answer']
            yield partial_message

# @markdown - Create and launch the Gradio interface
gr.ChatInterface(predict).launch(share=True, debug=False)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://0671c04de29fb11308.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
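
The `predict` function is a generator: each `yield` hands Gradio the full message so far, so the reply grows on screen as chunks arrive. A sketch of the same accumulation logic driven by a hand-written list of chunks instead of `chain.stream`. The chunk contents and the `accumulate` name are made up for illustration, and plain dictionaries stand in for the retrieved documents:

```python
def accumulate(stream):
    """Yield the growing message as chunks arrive, sources first."""
    message = ""
    for chunk in stream:
        if chunk.get("context"):
            links = " | ".join(f"[{doc['title']}]({doc['url']})" for doc in chunk["context"])
            message += "**Sources:**\n" + links + "\n\n"
            yield message
        if chunk.get("answer"):
            message += chunk["answer"]
            yield message


# Hypothetical chunks in the order a streaming chain might emit them.
fake_stream = [
    {"input": "what is pratham"},
    {"context": [{"title": "Home", "url": "https://www.pratham.org/"}]},
    {"answer": "Pra"},
    {"answer": "tham"},
]
snapshots = list(accumulate(fake_stream))
```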




## 6. Evaluation <a name="evaluation"></a>

To evaluate the bot, we asked five specific questions about Pratham and recorded the responses:
- Qn1: what is pratham
- Qn2: can you tell me address of Pratham
- Qn3: how can i join pratham
- Qn4: can you give me contact number and official mail to contact with pratham?
- Qn5: who are you
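
A minimal sketch of how such an evaluation run could be recorded, assuming the bot is exposed as a function that maps a question to an answer string. `run_evaluation` and the optional output path are hypothetical, and no responses are reproduced here:

```python
import json

QUESTIONS = [
    "what is pratham",
    "can you tell me address of Pratham",
    "how can i join pratham",
    "can you give me contact number and official mail to contact with pratham?",
    "who are you",
]


def run_evaluation(questions, answer_fn, path=None):
    """Ask each question, collect the answers, and optionally save them as JSON."""
    records = [{"question": q, "answer": answer_fn(q)} for q in questions]
    if path is not None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False, indent=2)
    return records
```

With the chain from section 5, `answer_fn` could be `lambda q: chain.invoke({"input": q})["answer"]`.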



## 7. Cost Analysis <a name="cost-analysis"></a>

In [11]:
# @markdown **To estimate the cost for 1000 users asking 5 questions each per day:**

# @markdown - Estimate cost for a single query
with get_openai_callback() as cb:
    response = chain.invoke({"input": "What is Pratham's mission?"})
    print(f"Cost for a single query: ${cb.total_cost:.4f}")

# @markdown - Calculate daily and monthly costs
daily_queries = 1000 * 5
daily_cost = daily_queries * cb.total_cost
monthly_cost = daily_cost * 30

print(f"Estimated daily cost for {daily_queries} queries: ${daily_cost:.2f}")
print(f"Estimated monthly cost: ${monthly_cost:.2f}")

# @markdown - This analysis provides a rough estimate of the operational costs for the Q&A bot. Actual costs may vary based on query complexity and length of responses.


Cost for a single query: $0.0111
Estimated daily cost for 5000 queries: $55.35
Estimated monthly cost: $1660.50
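
The printed per-query cost is rounded to four decimals, which is why 5000 × $0.0111 does not equal $55.35 exactly. The unrounded value implied by the daily total is 55.35 / 5000 ≈ $0.01107. The sketch below reproduces the projection from that figure and shows how a per-query cost follows from token counts. The function names are hypothetical, and any token counts or per-1K-token prices passed to `query_cost` are caller-supplied assumptions, not values from the notebook run:

```python
def query_cost(prompt_tokens, completion_tokens, input_price_per_1k, output_price_per_1k):
    """Cost of one request given token counts and per-1K-token prices."""
    return (prompt_tokens / 1000) * input_price_per_1k + (completion_tokens / 1000) * output_price_per_1k


def project_costs(cost_per_query, users=1000, questions_per_user=5, days=30):
    """Scale a per-query cost to daily and monthly totals."""
    daily_queries = users * questions_per_user
    daily_cost = daily_queries * cost_per_query
    return daily_queries, daily_cost, daily_cost * days


# 0.01107 is the per-query cost implied by the $55.35 daily total above.
daily_queries, daily_cost, monthly_cost = project_costs(0.01107)
print(f"{daily_queries} queries/day -> ${daily_cost:.2f}/day, ${monthly_cost:.2f}/month")
```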


## Conclusion

This notebook walks through building a Q&A bot for Pratham.org, from data collection to deployment. The bot retrieves the most relevant chunks of scraped site content from a FAISS index built with OpenAI embeddings and passes them to a GPT-4 model, which is prompted to answer questions about Pratham's educational initiatives in India using only that context.