<a href="https://colab.research.google.com/github/surendiran-20cl/GenAI-Intellipaat/blob/main/30thApril_ChatBot_DevClass.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **How to create a chatbot with Private Knowledge-base with RAG**

### **What is it?**

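* **The chatbot can answer questions about a particular document, business, product, or domain**

* **Unlike a general-purpose GPT model, a private chatbot is not retrained. With RAG (Retrieval-Augmented Generation), the relevant passages are retrieved from your documents and passed to the LLM as context**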

### **How it can be done**

* **The user should be allowed to upload a document**
  * **The system should be able to read the document**
-----------------------
* **Stem and split all the data into chunks**
* **Each chunk will be converted to a numerical representation**
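
The last step above, turning each chunk into numbers, can be sketched in plain Python. This is only a rough illustration: the sample chunks and the helper name `to_count_vector` are assumptions made for this sketch, and the notebook itself uses TF-IDF from scikit-learn further down rather than raw counts.

```python
from collections import Counter

def to_count_vector(chunk, vocabulary):
    """Represent a chunk as word counts over a fixed, ordered vocabulary."""
    counts = Counter(chunk.lower().split())
    return [counts[word] for word in vocabulary]

# Two hypothetical chunks standing in for pieces of an uploaded document
chunks = ["the cat sat on the mat", "the dog sat on the log"]

# The vocabulary is every distinct word across all chunks, in sorted order
vocabulary = sorted({word for chunk in chunks for word in chunk.split()})

chunk_vectors = [to_count_vector(chunk, vocabulary) for chunk in chunks]

print(vocabulary)        # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
for vector in chunk_vectors:
    print(vector)
```

Each chunk becomes a list of the same length, so chunks and queries can be compared position by position.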

# **Step 1 - Requirement Phase**

In [None]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


### **Importing libraries**

In [None]:
import os
import nltk
from PyPDF2 import PdfReader
from bs4 import BeautifulSoup
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize

**Sentences usually end with (.), (?), or (!). The `punkt` and `punkt_tab` resources are what NLTK's sentence tokenizer uses to work out where each sentence ends.**


In [None]:
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True
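
To see why a trained tokenizer is used instead of a simple punctuation rule, here is a minimal sketch of a naive splitter. The function name `naive_sentence_split` and the sample sentences are assumptions for illustration only; this is not how `sent_tokenize` works internally.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '?' or '!' when followed by whitespace (deliberately simplistic)."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]

# Works on simple input
print(naive_sentence_split("It rained. Did you notice? Yes!"))
# ['It rained.', 'Did you notice?', 'Yes!']

# Breaks on abbreviations: 'Dr.' is wrongly treated as a sentence end
print(naive_sentence_split("Dr. Smith arrived. He was late."))
# ['Dr.', 'Smith arrived.', 'He was late.']
```

The second case is the kind of boundary that a punctuation-only rule gets wrong and that a trained sentence tokenizer is designed to handle.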

In [None]:
SEGMENT_SIZE = 99999   # Maximum characters per segment; at this size a short document fits in a single segment

### **Initialize the stemmer**

In [None]:
ps = PorterStemmer()

## **Function 1 - Preprocessing the information**

In [None]:
def process_text(text, segment_size = SEGMENT_SIZE):
  # Split the entire text into sentences
  sentences = sent_tokenize(text)

  original_text = []         # Raw segments, later used as context for the LLM
  processed_text = []        # Stemmed segments, later used for vectorization
  segments = ""              # Segment currently being built

  for statement in sentences:
    # Flush the current segment once adding this sentence would exceed the limit
    if segments and len(segments) + len(statement) > segment_size:
      original_text.append(segments)
      processed_text.append(" ".join([ps.stem(word) for word in segments.split()]))
      segments = statement
    else:
      # Avoid a leading space on the first sentence of a segment
      segments = segments + " " + statement if segments else statement

  # Handle the last segment
  if segments:
    original_text.append(segments)
    processed_text.append(" ".join([ps.stem(word) for word in segments.split()]))

  return original_text, processed_text
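
The segment-packing logic can be tried in isolation with a small sketch. It assumes the sentences are already split and leaves out stemming, so it runs without NLTK; the name `pack_sentences` and the sample sentences are hypothetical, and unlike the function above it also counts the joining space.

```python
def pack_sentences(sentences, segment_size):
    """Greedily pack sentences into segments of at most segment_size characters."""
    segments, current = [], ""
    for sentence in sentences:
        # +1 accounts for the space that joins the sentence to the segment
        if current and len(current) + 1 + len(sentence) > segment_size:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        segments.append(current)
    return segments

sentences = ["Coffee is popular.", "Decaf has little caffeine.", "Taste varies."]

print(pack_sentences(sentences, segment_size=50))
# ['Coffee is popular. Decaf has little caffeine.', 'Taste varies.']

print(len(pack_sentences(sentences, segment_size=99999)))
# 1
```

With a very large limit everything lands in one segment, so retrieval has nothing to choose between. A smaller limit gives the similarity search several segments to rank.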

In [None]:
# words = "The runner is currently working with Nike"
# " ".join([ps.stem(word) for word in words.split()])

## **Function 2 - Loading the files**

### **1. Load PDF**

In [None]:
def read_pdf(file_path):
  with open(file_path, "rb") as f:
    reader = PdfReader(f)
    text = ""
    for page in reader.pages:
      # Guard against pages that yield no text (for example, scanned images)
      text += page.extract_text() or ""
  return process_text(text)

### **2. Read the HTML**

In [None]:
def read_HTML(file_path):
  with open(file_path, "r", encoding = "utf-8") as f:
    data = BeautifulSoup(f, "html.parser")
    text = data.get_text()
    return process_text(text)

### **3. Read a text file**

In [None]:
def read_TXT(file_path):
  with open(file_path, "r", encoding = "utf-8") as f:
    text = f.read()
    return process_text(text)

## **Function 3 - Finding the similarity**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer()
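
To make the idea behind TF-IDF concrete, here is a minimal pure-Python sketch. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which to our understanding matches scikit-learn's defaults, but the tokenisation is a plain whitespace split and the helper names `tfidf_vectors` and `cosine` are assumptions for this sketch, so the numbers need not match `TfidfVectorizer` exactly.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenised for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of docs in which each term appears
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency: rare terms get a higher weight
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        row = [tf[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocabulary, rows

def cosine(u, v):
    """Dot product; equals cosine similarity because the rows have unit length."""
    return sum(a * b for a, b in zip(u, v))

docs = ["decaf coffee taste", "climate report authors", "coffee price"]
vocab, rows = tfidf_vectors(docs)

print(cosine(rows[0], rows[2]) > cosine(rows[0], rows[1]))  # True: both mention 'coffee'
```

Because every row is normalised to unit length, a plain dot product between a query vector and the document matrix already gives cosine similarity.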

In [None]:
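documents = []          # Stemmed segments; together with original_docs this acts as the database

original_docs = []      # Original (unstemmed) segments, index-aligned with documents

vectors = None          # TF-IDF matrix for documents, set once documents are added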

### **1. Add Documents**

In [None]:
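def add_documents(text):
  global vectors  # Update the module-level matrix that find_best_matches reads
  documents.extend(text)
  # Refit on the full corpus so the vocabulary covers every document added so far
  vectors = vectorizer.fit_transform(documents)
  return vectors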

### **2. Get the multi-document functionality**

In [None]:
def process_and_add_documents(file_path, file_type):
  if file_type == "pdf":
    original_data, processed_data = read_pdf(file_path = file_path)
  elif file_type == "html":
    original_data, processed_data = read_HTML(file_path = file_path)
  elif file_type == "txt":
    original_data, processed_data = read_TXT(file_path = file_path)
  else:
    raise ValueError("Unsupported file format provided, please check and ensure that the file provided is in correct format")

  original_docs.extend(original_data)
  vectors = add_documents(processed_data)

  return vectors

### **3. Similarity Matching**

In [None]:
NUMBER_OF_TOP_MATCHES = 3

In [None]:
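def find_best_matches(query, n_matches = NUMBER_OF_TOP_MATCHES):
  query_processed = process_text(query)[1]              # Stem the query the same way as the documents
  query_vector = vectorizer.transform(query_processed)  # TF-IDF vector for the query
  similarity = (query_vector * vectors.T).toarray()     # Dot product of the query with every segment

  # Indices of the n_matches highest scores for the first query row, best first
  best_matches = similarity.argsort()[0][-n_matches:][::-1]

  return [original_docs[i] for i in best_matches], [documents[i] for i in best_matches]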
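
The ranking step (`argsort`, keep the last `n`, then reverse) can be sketched without NumPy. The helper name `top_n_indices` and the sample scores are hypothetical; when scores tie, the order may differ from what `argsort` gives.

```python
def top_n_indices(scores, n):
    """Return the indices of the n highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# One similarity score per stored segment
scores = [0.10, 0.75, 0.00, 0.40]

print(top_n_indices(scores, 3))   # [1, 3, 0]
print(top_n_indices([0.5], 3))    # [0]
```

As the second call shows, asking for more matches than there are segments simply returns every segment.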

## **Construction of Prompt and LLM**

In [None]:
!pip install cohere



### **Engineer a Prompt**

In [None]:
import cohere
from google.colab import userdata

In [None]:
co = cohere.ClientV2(api_key = userdata.get("CohereKey"))

In [None]:
def get_resp(query, context):
  messages = [
      {"role": "system", "content": "You are an AI assistant. Use the provided context to answer the user's query accurately and precisely. Keep the answer concise."},
      {"role": "system", "content" : context},
      {"role": "user", "content": query}
  ]

  resp = co.chat(
      model="command-a-03-2025",
      messages = messages
  )

  return resp.message.content[0].text.strip()
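
The prompt structure can be built and inspected without calling the API, which helps when debugging what the model actually receives. This is a sketch only: `build_messages`, `SYSTEM_PROMPT`, and the `max_context_chars` cap are hypothetical names and values introduced here, not part of the notebook or of the Cohere SDK.

```python
SYSTEM_PROMPT = (
    "You are an AI assistant. Use the provided context to answer the "
    "user's query accurately and precisely. Keep the answer concise."
)

def build_messages(query, context, max_context_chars=4000):
    """Assemble the chat messages; max_context_chars is an assumed safety cap."""
    trimmed_context = context[:max_context_chars]
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "system", "content": trimmed_context},
        {"role": "user", "content": query},
    ]

messages = build_messages("Who wrote the report?", "Sample context. " * 3)
for message in messages:
    # Print only the start of each message to keep the output short
    print(message["role"], "->", message["content"][:40])
```

Keeping the message assembly in its own function also makes it easy to check the roles and the context length before spending an API call.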

## **Put all this together**

In [None]:
def reset_database():
  global documents, original_docs, vectors
  documents = []
  original_docs = []
  vectors = None

In [None]:
def initialize(file_name):
  file_type = file_name.split(".")[-1]
  return process_and_add_documents(file_path=file_name, file_type=file_type)
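
Splitting on `"."` is case-sensitive and does not notice a missing extension. A slightly more defensive variant is sketched below; the helper `detect_file_type` and the `SUPPORTED_TYPES` set are assumptions for illustration, not part of the notebook.

```python
import os

SUPPORTED_TYPES = {"pdf", "html", "txt"}

def detect_file_type(file_name):
    """Return the lower-cased extension, or raise if it is missing or unsupported."""
    extension = os.path.splitext(file_name)[1].lstrip(".").lower()
    if extension not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file format: {file_name!r}")
    return extension

print(detect_file_type("/content/climateChange.pdf"))  # pdf
print(detect_file_type("Report.PDF"))                  # pdf
```

A file name without an extension, such as `notes`, raises `ValueError` here instead of being passed along as a file type.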

In [None]:
def chat(user_query, is_debug = False):
  original_best_matches, processed_best_match = find_best_matches(user_query)
  context = "\n\n".join(original_best_matches)

  if is_debug:
    print(f"Context: {context}")

  resp = get_resp(user_query, context)
  return resp

## **Test**

In [None]:
import requests


def download_files():
  sample_files = [
      {
          "url" : "https://www.ipcc.ch/report/ar6/wg1/downloads/outreach/IPCC_AR6_WGI_SummaryForAll.pdf",
          "file_name":"climateChange.pdf"
      },
      {
          "url":"https://medium.com/illumination/i-tried-10-decaf-coffees-as-a-first-time-coffee-drinker-heres-what-i-found-a8c5fb93a40e",
          "file_name": "coffee.html"
      }
  ]

  for file_info in sample_files:
    resp = requests.get(file_info["url"], timeout = 30)
    resp.raise_for_status()  # Stop early if a download failed
    with open(file_info["file_name"], "wb") as f:
      f.write(resp.content)

  return [x["file_name"] for x in sample_files]

In [None]:
files_names = download_files()

for x in files_names:
  print(x)

climateChange.pdf
coffee.html


### **Reset the database**

In [None]:
reset_database()

### **Initializing the vectors**

In [None]:
vectors = initialize("/content/climateChange.pdf")

In [None]:
vectors

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1021 stored elements and shape (1, 1021)>

In [None]:
resp = chat("Who are the authors of the report?")

In [None]:
print(resp)

The authors of the report are not explicitly listed in the provided text. However, the text mentions that the summary was written and reviewed by several individuals, including members of the Working Group I Technical Support Unit (WGI TSU) and authors of the IPCC report. Here are the names of the individuals mentioned:

- Sarah Connors (WGI TSU)
- Sophie Berger (WGI TSU)
- Clotilde Péan (WGI TSU)
- Govindasamy Bala (Chapter 4 author)
- Nada Caud (WGI TSU)
- Deliang Chen (Chapter 1 author)
- Tamsin Edwards (Chapter 9 author)
- Sandro Fuzzi (Chapter 6 author)
- Thian Yew Gan (Chapter 8 author)
- Melissa Gomis (WGI TSU)
- Ed Hawkins (Chapter 1 author)
- Richard Jones (Atlas Chapter author)
- Robert Kopp (Chapter 9 author)
- Katherine Leitzell (WGI TSU)
- Elisabeth Lonnoy (WGI TSU)
- Douglas Maraun (Chapter 10 author)
- Valérie Masson-Delmotte (WGI Co-Chair)
- Tom Maycock (WGI TSU)
- Anna Pirani (WGI TSU)
- Roshanka Ranasinghe (Chapter 12 author)
- Joeri Rogelj (Chapter 5 author)
- Alex C

In [None]:
reset_database()

In [None]:
vectors = initialize("coffee.html")

In [None]:
while True:
  user_query = input("Hi, Please ask! (type 'quit' or 'exit' to stop): ")
  if user_query.lower() in ["quit", "exit"]:
    print("Thanks!")
    break

  print("=======================================================================")
  print(f"User: \"{user_query}\"")
  resp = chat(user_query)
  print("PrivateAI: ", resp, flush = True)

Hi, Please ask! (type 'quit' or 'exit' to stop): QUIT
User:"QUIT
PrivateAI:  It seems like you're looking for a concise answer to what makes a good or bad coffee based on the provided context. Here’s a summary:

**What makes a good or bad coffee?**  
According to Kory Becker, a first-time coffee drinker who tried 10 decaf coffees, the criteria for a good or bad coffee include:  
1. **Taste** – The flavor profile and overall enjoyment.  
2. **Bitterness** – The level of bitterness, which can be a positive or negative depending on preference.  
3. **Experience** – How the coffee makes the drinker feel (e.g., no jitters, no stomach upset).  
4. **Price** – The value for money compared to quality.  

Becker emphasizes personal sensitivity to caffeine and acidity, making decaf a preferred choice. The review ranks coffees based on these factors, highlighting that a "good" coffee is subjective and depends on individual preferences and priorities.
Hi, Please ask! (type 'quit' or 'exit' to stop

 Kory Becker, a first-time coffee drinker sensitive to caffeine, shares their experience trying 10 decaf coffee brands over several months. Initially avoiding coffee due to caffeine sensitivity and concerns about acidity, they now enjoy decaf coffee daily. The article ranks the coffees by taste, experience, and price, offering insights into what makes a coffee good or bad, focusing on factors like taste, bitterness, and overall experience.

# **Interface**

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output

# Create file upload widget
uploader = widgets.FileUpload(
    accept='.pdf,.html,.txt',  # Accept PDF, HTML, and text files
    multiple=False  # Allow only one file at a time
)

# Create text input widget for user queries
text_input = widgets.Text(placeholder='Ask your question here...')

# Create output widget to display chatbot responses
output = widgets.Output()

# Function to handle file upload
def on_file_upload(change):
  with output:
    clear_output()  # Clear previous output
    uploaded_file = list(change['new'].values())[0]
    file_name = uploaded_file['metadata']['name']
    with open(file_name, 'wb') as f:
      f.write(uploaded_file['content'])

    try:
      global vectors
      reset_database()
      vectors = initialize(file_name)
      print(f"File '{file_name}' uploaded successfully.")
      print("Ready for your questions!")
    except Exception as e:
      print(f"Error processing the uploaded file: {e}")


# Function to handle user queries
def on_submit(change):
  with output:
    clear_output(wait=True)  # Clear output and wait for new output
    user_query = text_input.value
    text_input.value = ''  # Clear the input field after submission

    if user_query.lower() in ["quit", "exit"]:
      print("Thanks, hope it helped you! (PrivateAI left the conversation.)")
      return

    print(f"User: \"{user_query}\"")
    resp = chat(user_query)
    print("PrivateAI: ", resp, flush=True)

# Attach event handlers
uploader.observe(on_file_upload, names='value')
text_input.on_submit(on_submit)

# Display the widgets
display(uploader)
display(text_input)
display(output)


FileUpload(value={}, accept='.pdf,.html,.txt', description='Upload')

Text(value='', placeholder='Ask your question here...')

Output()