# Introduction


***RAG-Based Chatbot for Web Content Interaction*** simplifies data retrieval by scraping data from a given URL and converting it to a structured JSON format. It then answers user questions based on the extracted data.

In this notebook, we walk through sample code for a RAG-based chatbot built on scraped web content.

This project uses two agents:

**1. Scraping agent** - scrapes data from the URL (provided as input) and saves it as a JSON file named scraped_data.json.

**2. Chatbot with memory agent** - reads the JSON file, embeds the text, and saves (i.e., upserts) the vector embeddings in Pinecone (a vector database). A chatbot with memory then lets the user ask questions about the content of the URL.

## Functionality overview 

**Objective:** 
Design and implement a set of intelligent agents that work together to 

(1) scrape content from a specified website URL and save it in a structured format, and 

(2) utilize the scraped data to answer user queries about the website content.

**Input:**
- Task 1: URL of the webpage.
- Task 2: Any query regarding the scraped data.

**Output:**
- Task 1: JSON file with scraped data.
- Task 2: Answer to questions regarding the scraped web page.


## Approach 

PART 1 : SCRAPING AGENT

    1.1 Scrape data from the webpage (URL - input)

    1.2 Save the scraped data in JSON format

PART 2 : CHATBOT WITH MEMORY AGENT

    2.1 Retrieve JSON data and format it as a string

    2.2 Split the document into chunks (for creating and upserting embeddings)

    2.3 Text embedding using a sentence transformer model

    2.4 Upsert data to Pinecone

    2.5 Question & Answer chatbot with memory

# PART 1: SCRAPING AGENT

## 1.1 Scrape data from webpage

In [1]:
#Importing necessary packages
import warnings
warnings.filterwarnings("ignore")
import json
import requests
from bs4 import BeautifulSoup


In [2]:
URL = "https://en.wikipedia.org/wiki/Generative_artificial_intelligence"

In [3]:
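def scrape_website(url):
    """
    Fetch the page at `url` and return its title and paragraph texts as a dictionary.
    Returns None if the request does not succeed.

    """

    # A timeout keeps the call from hanging indefinitely on an unresponsive server
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')

        data = {
            # soup.title is None when the page has no <title> tag
            'title': soup.title.text.strip() if soup.title else '',
            'paragraphs': [p.text.strip() for p in soup.find_all('p')],
        }

        return data
    else:
        print(f"Error: Unable to fetch the content. Status code: {response.status_code}")
        return None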


In [4]:
scraped_data = scrape_website(URL)
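To see the shape of the dictionary that `scrape_website` produces without making a network call, the sketch below parses a small inline HTML string. It is only an illustration: it swaps BeautifulSoup for the standard-library `html.parser` so that it runs offline, and the names `TitleAndParagraphs`, `parse_html` and `sample_html` are hypothetical, not part of the notebook.

```python
from html.parser import HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag currently being captured ("title" or "p")
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> is captured as well
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if tag == "title":
                self.title = text
            else:
                self.paragraphs.append(text)
            self._current = None


def parse_html(html):
    """Return the same {'title', 'paragraphs'} structure used in this notebook."""
    parser = TitleAndParagraphs()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "paragraphs": parser.paragraphs}


# Hypothetical sample page used only for this illustration
sample_html = (
    "<html><head><title> Demo page </title></head>"
    "<body><p>First <b>bold</b> paragraph.</p><p>Second.</p></body></html>"
)
sample_data = parse_html(sample_html)
print(sample_data)
```

The result has one `title` string and a list of `paragraphs`, which is the structure the rest of the notebook relies on.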

## 1.2 Save the scraped data in JSON format

In [5]:
def save_as_json(scraped_data, filename='scraped_data.json'):
    """
    Save the scraped content as a JSON file (default name: "scraped_data.json") in the working directory

    """
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(scraped_data, f, ensure_ascii=False, indent=2)
    print(f'Data saved successfully as {filename}')
    

In [6]:
save_as_json(scraped_data)


Data saved successfully as scraped_data.json
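A quick way to check the saved file is to read it back and compare it with the original dictionary. The sketch below does this round trip in a temporary directory, using the same `json.dump` arguments as `save_as_json`. The `sample` dictionary is made-up illustration data, not scraped content.

```python
import json
import os
import tempfile

# Hypothetical sample data with one non-ASCII character ("é")
sample = {"title": "Demo page", "paragraphs": ["Caf\u00e9 culture", "Second paragraph."]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scraped_data.json")

    # Same arguments as save_as_json above
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)

    with open(path, "r", encoding="utf-8") as f:
        raw_text = f.read()

reloaded = json.loads(raw_text)

round_trip_ok = reloaded == sample          # data survives the save/load cycle
keeps_unicode = "Caf\u00e9" in raw_text     # ensure_ascii=False writes "é" literally
print(round_trip_ok, keeps_unicode)
```

With `ensure_ascii=False` the file keeps non-ASCII characters as they are instead of writing `\uXXXX` escapes, which is why the file is opened with `encoding='utf-8'`.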


# PART 2: CHATBOT WITH MEMORY AGENT

## 2.1 Retrieve the JSON

In [7]:
#Importing necessary packages

import json
import pickle
import os
import datetime
import numpy as np

#import sentence transformer for text embedding
from sentence_transformers import SentenceTransformer
os.environ["TOKENIZERS_PARALLELISM"] = "false"

#pinecone imports
from pinecone import Pinecone

#langchain imports
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAI
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain_core.output_parsers import StrOutputParser
from langchain.chains import ConversationChain

#colorama imports for colored console output
from colorama import Fore, Style, Back

In [8]:

def remove_non_ascii(text):
    """ 
    Remove any non ascii characters found in the text
    
    """
    return ''.join(char for char in text if ord(char) < 128)


def process_json_data(data):
    """ 
    Function to recursively remove non-ASCII characters from all strings in a nested data structure
    
    """      
    if isinstance(data, dict):  # If the data is a dictionary
        return {key: process_json_data(value) for key, value in data.items()}
    
    elif isinstance(data, list):  # If the data is a list
        return [process_json_data(item) for item in data]
    
    elif isinstance(data, str):  # If the data is a string
        return remove_non_ascii(data)
    
    else:  # If the data is anything else (e.g., numbers, booleans), return it unchanged
        return data

# Load JSON data from a file
path = 'scraped_data.json'
with open(path, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Process the JSON data to remove non-ASCII characters
processed_data = process_json_data(data)
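The character filter in `remove_non_ascii` can also be written with an encode/decode round trip. The sketch below is an alternative formulation, shown only to make the effect visible on a small string. `strip_non_ascii` and `sample` are hypothetical names.

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range (code points 0-127)."""
    return text.encode("ascii", errors="ignore").decode("ascii")


# "Naïve café – 100°" written with escapes so the source stays ASCII
sample = "Na\u00efve caf\u00e9 \u2013 100\u00b0"
cleaned = strip_non_ascii(sample)

# Same result as the character-by-character filter used in this notebook
same_as_filter = cleaned == "".join(ch for ch in sample if ord(ch) < 128)
print(repr(cleaned), same_as_filter)
```

Note that the characters are removed rather than transliterated, so accented letters and symbols such as dashes or degree signs are lost from the text.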

In [9]:
#content variable stores the scraped content as a single string
content=''
for text in processed_data['paragraphs']:
    content+=text
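Because the loop above appends each paragraph directly to the previous one, the last word of one paragraph and the first word of the next end up fused (the chunk output further down shows `characteristics.[5][6]Improvements`). The sketch below contrasts that with joining on a space. It uses two made-up paragraphs and is a suggestion only, not what the recorded outputs in this notebook were produced with.

```python
# Hypothetical paragraphs used only for this illustration
paragraphs = ["First paragraph ends here.", "Second paragraph starts here."]

# Equivalent to the loop in the cell above: no separator between paragraphs
concatenated = ""
for text in paragraphs:
    concatenated += text

# Joining with a space keeps paragraph boundaries readable
joined = " ".join(paragraphs)

print(concatenated)
print(joined)
```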

## 2.2 Splitting the document into chunks (for creating and upserting embeddings)


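 `chunk_size` sets the maximum number of characters in each chunk.
 
 `chunk_overlap` sets the number of characters that adjacent chunks may share.
 
 Length is measured in characters (`length_function=len`), and the separators are treated as literal strings rather than regular expressions (`is_separator_regex=False`).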
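To make chunk size and overlap concrete, here is a simplified sliding-window splitter in plain Python. It is a sketch of the idea only: `RecursiveCharacterTextSplitter` additionally tries to break at separators such as paragraph breaks, newlines and spaces, so its real chunks are often shorter than `chunk_size` and their overlaps vary. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role `chunk_overlap=20` plays for the 100-character chunks used in this notebook.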

In [10]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100, chunk_overlap=20, length_function=len, is_separator_regex=False)

texts = text_splitter.create_documents([content])

In [11]:
text_list = [] #each item in the list is a chunk

for i in range(0,len(texts)):
    text_list.append(texts[i].page_content)
 
print()
print("Let us try to print first 5 chunks after splitting the content to chunks")
print()

for item in text_list[:5]:
    print(item,end=',\n')


Let us try to print first 5 chunks after splitting the content to chunks

Generative artificial intelligence (generative AI, GenAI,[1] or GAI) is artificial intelligence,
intelligence capable of generating text, images or other data using generative models,[2] often in,
models,[2] often in response to prompts.[3][4] Generative AI models learn the patterns and,
the patterns and structure of their input training data and then generate new data that has similar,
that has similar characteristics.[5][6]Improvements in transformer-based deep neural networks,


Now, let us print the number of chunks the content was split into.

In [12]:
len(text_list)

290

## 2.3 Text Embedding using sentence transformer model

In [13]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

embeddings_list = []

#Getting the embedding list for upserting
for i, text in enumerate(text_list, start=1):
    embedding = model.encode(text)
    text_dict = {"id": text, "values": np.array(embedding)}
    embeddings_list.append(text_dict)

## 2.4 Upsert data to pinecone

Upsert is the term Pinecone uses for writing vectors into an index: a vector is inserted if its ID is new, and it overwrites the stored vector if the ID already exists.
We have to initialize the Pinecone client with an API key in order to access the vector database indexes.

Please Note: 

The free tier of Pinecone only allows us to initialize the index once. If new content needs to be upserted, you might have to manually delete the index on the Pinecone portal and create an index again.
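The "update or insert" behaviour can be sketched with a plain dictionary. This is an in-memory stand-in written for illustration, not the Pinecone client. The function `upsert_into` and the sample vectors are hypothetical.

```python
def upsert_into(store, vectors):
    """Write vectors into a dict keyed by ID: new IDs are inserted, existing IDs are overwritten."""
    inserted = 0
    updated = 0
    for item in vectors:
        if item["id"] in store:
            updated += 1
        else:
            inserted += 1
        store[item["id"]] = item["values"]
    return {"inserted": inserted, "updated": updated}


store = {}
first = upsert_into(store, [
    {"id": "a", "values": [0.1, 0.2]},
    {"id": "b", "values": [0.3, 0.4]},
])
second = upsert_into(store, [
    {"id": "b", "values": [0.9, 0.9]},  # same ID as before: replaces the old vector
    {"id": "c", "values": [0.5, 0.6]},  # new ID: adds a vector
])
print(first, second, len(store))
```

The second call writes two vectors but the store grows by only one, because re-upserting an existing ID replaces its vector rather than adding a new entry.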

In [14]:
"""
Initialize the Pinecone client with the API key stored in key/pinecone

"""
with open('key/pinecone', 'r') as f2:
    api_key = f2.read().strip()  # strip the trailing newline, if any

pc = Pinecone(api_key=api_key)
index_name = 'qa-bot'
index = pc.Index(index_name)

In [15]:
# index_name is the name of the index created from the Pinecone dashboard
index_name = 'qa-bot'
index = pc.Index(index_name)

In [16]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 434}},
 'total_vector_count': 434}

________________________________________________________________
Next, we upsert the list of embedding dictionaries, where each dictionary has:

- id : the chunk text itself
- values : the corresponding embedding vector
________________________________________________________________

In [17]:
index.upsert(embeddings_list)

{'upserted_count': 290}
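The 290 chunks here were sent in a single `upsert` call. For larger pages it can help to send the records in smaller batches instead. The sketch below shows one way to do that, under stated assumptions: the batch size of 100 is an arbitrary choice, `batched` is a hypothetical helper, and `RecordingIndex` is a stand-in that only records batch sizes so the example runs without a Pinecone account.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class RecordingIndex:
    """Hypothetical stand-in for a Pinecone index; it only records the size of each batch."""

    def __init__(self):
        self.batch_sizes = []

    def upsert(self, vectors):
        self.batch_sizes.append(len(vectors))
        return {"upserted_count": len(vectors)}


# Dummy records in the same {"id", "values"} shape used in this notebook
records = [{"id": f"chunk-{i}", "values": [0.0, 0.0, 0.0]} for i in range(290)]

demo_index = RecordingIndex()
total_upserted = 0
for batch in batched(records, batch_size=100):
    total_upserted += demo_index.upsert(vectors=batch)["upserted_count"]

print(demo_index.batch_sizes, total_upserted)
```

With the real index, the loop body would call `index.upsert(vectors=batch)` on the object created earlier.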

In [18]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 434}},
 'total_vector_count': 434}

## 2.5 Question & Answer Chatbot with memory

In [19]:
""" 
Read the OpenAI API key from the key/openai file and set it as an environment variable

"""
with open('key/openai', 'r') as f1:
    openai_api_key = f1.read().strip()  # strip the trailing newline, if any
    os.environ['OPENAI_API_KEY'] = openai_api_key

In [20]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
llm = OpenAI(temperature=0)

In [21]:
conversation_memory = ConversationBufferMemory()
query_response_pairs = {}

In [22]:
def answer_question(query, prompt_template, query_response_pairs):
    """
    Retrieve the chunks most similar to the query and use them as context
    to generate the most probable answer for the user

    """

    # Embed the query with the same model that embedded the chunks
    query_embedding = model.encode(query)

    # Pinecone expects the query vector as a flat list of floats
    results = index.query(vector=query_embedding.tolist(), top_k=5)

    # Each match ID is the chunk text itself (see the upsert step)
    retrieved_documents = '\n'.join(result['id'] for result in results['matches'])

    # Construct the prompt using the template, retrieved documents and chat history
    prompt = prompt_template.format(documents=retrieved_documents, question=query, history=query_response_pairs)

    # Generate response using the prompt
    output_parser = StrOutputParser()
    conversation_chain = ConversationChain(llm=llm, memory=conversation_memory, output_parser=output_parser, verbose=False)
    response = conversation_chain.invoke(prompt)
    query_response_pairs[query] = response['response']

    return response
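The retrieval step inside `answer_question` asks the index for the `top_k` stored vectors closest to the query embedding. The sketch below reproduces that idea in plain Python on three tiny made-up vectors. It assumes cosine similarity as the metric (the metric of the `qa-bot` index is not stated in this notebook), and `cosine_similarity`, `top_k_ids` and the sample records are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_ids(query_vector, records, k):
    """Return the IDs of the k records most similar to the query, best match first."""
    scored = [(cosine_similarity(query_vector, r["values"]), r["id"]) for r in records]
    scored.sort(reverse=True)
    return [record_id for _, record_id in scored[:k]]


# As in this notebook, the ID of each record is the chunk text itself
records = [
    {"id": "chunk about cats", "values": [1.0, 0.0, 0.0]},
    {"id": "chunk about dogs", "values": [0.9, 0.1, 0.0]},
    {"id": "chunk about cars", "values": [0.0, 0.0, 1.0]},
]
query_vector = [1.0, 0.0, 0.0]

matches = top_k_ids(query_vector, records, k=2)
print(matches)
```

A vector database does the same ranking with an index structure, so it does not have to compare the query against every stored vector.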

In [None]:
def output():
    """ 
    Function to print the output of model response and 
    user interaction with the chatbot
    
    """
    bot="AGENT"
    print(Fore.BLUE + bot+":")
    print(Fore.BLACK+"Hey! Ask me any question you have on the web URL you uploaded for the topic -")
    print(f"\033[1m{data['title']}\033[0m")

    # Run the loop for interactive conversation
    while True:
        print()
        print(Fore.GREEN+"USER:")
        user_input = input()
        print()

        prompt_template = """Answer the question precisely, based on the following context.
        Keep the sentence structure simple and be direct. Give just 1-3 sentences.
        Don't make up the answer; frame the answer only from the provided context.
        If it is not there, say you are sorry and that you don't have enough information about this question.

        Below is the context for answering the question
        {documents}

        Below is the chat history. Consider it as additional context while answering the question
        {history}

        Question: {question}
        """

        if user_input.lower()=='stop':
            user_input= None
            # Include the hour (%H) so files saved in different hours cannot share a name
            current_time = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
            filename = f"conv_{current_time}.pkl"
            with open(filename, "wb") as f:
                pickle.dump(query_response_pairs, f)

            print(Fore.BLUE +"AGENT     :")
            print(Fore.BLACK+"Thank You. Have a Nice Day")

            query_response_pairs.clear()
            break

        answer = answer_question(user_input,prompt_template,query_response_pairs)

        print()
        print(Fore.BLUE +"AGENT     :")
        print(Fore.BLACK+answer['response'])
        print()

        
output()


#  Hi, what exactly is generative AI
# how is it different from machine learning
#  by the way my name is sairam. 
# what was the previous questions that I was trying to ask

AGENT:
Hey! Ask me any question you have on the web URL you uploaded for the topic -
Generative artificial intelligence - Wikipedia

USER:
Hi, what exactly is generative AI

AGENT     :
Generative AI, also known as GenAI or GAI, is a type of artificial intelligence that uses generative models to create new data or content. It has a wide range of uses in various industries and there are ongoing discussions about how to regulate its use. Rules are being refined to ensure responsible use of generative AI.

USER:
how is it different from machine learning

AGENT     :
Generative AI is a subset of machine learning that focuses on creating new data or content, while traditional machine learning models are trained to make predictions or classifications based on existing data. Generative AI uses unsupervised or self-supervised learning, while traditional machine learning often uses supervised learning. Additionally, generative AI models are often more complex and difficult to train compared to traditional machine learning models.

USER:
by the way my name is sairam. 

AGENT     :
Nice to meet you, Sairam. Is there anything else you would like to know about generative AI?

USER:
what was the previous questions that I was trying to ask

AGENT     :
The previous questions you asked were about generative AI and how it differs from traditional machine learning. Is there anything else you would like to know?

USER:
ok. what was my name again?

AGENT     :
Your name is Sairam. Is there anything else you would like to know?

USER:
sorry. This is ram not sairam. what was sairam looking for?

AGENT     :
My apologies, Ram. I do not have enough information to answer that question. Is there anything else you would like to know about generative AI?

USER:


________________________________________________________________
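When the user types `stop`, `output()` pickles `query_response_pairs` to a `conv_<timestamp>.pkl` file. A minimal sketch of reading such a file back is shown below. The questions and answers are placeholders and the file name `conv_example.pkl` is hypothetical. The sketch writes its own file in a temporary directory so that it runs on its own.

```python
import os
import pickle
import tempfile

# Placeholder conversation in the same {question: answer} shape as query_response_pairs
saved_pairs = {
    "sample question 1": "sample answer 1",
    "sample question 2": "sample answer 2",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "conv_example.pkl")

    # Same call that output() uses when the conversation ends
    with open(filename, "wb") as f:
        pickle.dump(saved_pairs, f)

    # Reading the conversation back later
    with open(filename, "rb") as f:
        restored_pairs = pickle.load(f)

for question, answer in restored_pairs.items():
    print(f"Q: {question}")
    print(f"A: {answer}")
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.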