##### Note:
###### - A real-time chatbot with a React UI could also be built on top of this work. As mentioned, Python files and the other file types needed for UI development could not be uploaded here. Such a chatbot can be hosted easily on almost any platform, but improving it further requires collecting a substantially larger dataset. The 2–3 day timeframe made it difficult to build one for this submission; with more time, I can deliver a polished UI-based chatbot, both for this assignment and for my portfolio.

###### If this submission falls short of what you were looking for, please see my GitHub for examples of fully deployed UI chatbots.

---
###### The FastAPI backend is deployed at https://qa-bot-ijyw.onrender/chat.com


# QA Bot of **Yardstick** Using RAG


### STEP 1

#### Setting up the data: from PDF to a Pinecone vector database

In [None]:
# !pip install PyPDF2 openai pinecone-client


In [None]:
# Extract the text from a local PDF file

from PyPDF2 import PdfReader


def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of every page in the PDF at pdf_path."""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        # Guard against pages for which no text is returned
        text += page.extract_text() or ""
    return text

pfd_txt = extract_text_from_pdf("/content/About Yardstick.pdf")


In [None]:
pfd_txt[:200]

'41 Essential Machine \nLearning Interview \nQuestions\nwww.springboard.com\n18 mins readM\nachine learning interview questions are an integral part \nof the data science interview and the path to becoming a'

In [None]:
# splitting the text into smaller chunks using the RecursiveCharacterTextSplitter

from langchain.text_splitter import RecursiveCharacterTextSplitter

def vector_txt(txt, chunk_size=1000, chunk_overlap=200):
  txt_splitter = RecursiveCharacterTextSplitter(
      chunk_size=chunk_size,
      chunk_overlap=chunk_overlap,
      # length_function=len
  )

  texts = txt_splitter.split_text(txt)
  return texts

vec = vector_txt(pfd_txt)
print(f'Number of chunks = {len(vec)}')
print(f"First chunk:\n{vec[0]}")

Number of chunks = 6
First chunk:
About
Yardstick
Who
and
Why
we
are?
Yardstick's
vision
is
to
make
learning
enriching
and
joyful
experience.
Yardstick
designs
and
implements
learning
programs
for
children,
engaging
their 
keen,
inquisitive
and
imaginative
minds
via
holistic
experiential
learning
modules.
Yardstick
provides
specific
services
to
all
the
stakeholders
in
a
child’ s
education 
–
from
parents,
teachers
and
administrators
to
the
students.
Our
activity-based 
curricula
mapped
to
the
syllabus
encourage
children
to
understand,
appreciate 
and
apply
the
subject
being
taught.
Our
team
attempts
to
give
personalized 
attention
to
every
child.
Yardstick
offers
outstanding,
highly
interactive,
hands
on
curriculum
that
enables 
mastery
of
core
concepts
and
skills
for
all
kinds
of
minds.
The
curriculum
focuses 
on
unleashing
creativity ,
real
life
application,
and
understanding
rather
than 
memorizing,
inquiry
based
hands
on
approach.
How
do
we
do
it?
Mission
and
Vision
What
do
we
drea
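
The first chunk above appears with one word per line, which suggests the PDF extraction placed a newline between words. Below is a minimal sketch of one way to collapse that whitespace before chunking, along with a simplified fixed-size window splitter that shows how `chunk_size` and `chunk_overlap` interact. The helper names `normalize_whitespace` and `chunk_text` are hypothetical, and this is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as newlines and spaces).

```python
from typing import List


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: each window starts
    (chunk_size - chunk_overlap) characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these helpers, something like `chunk_text(normalize_whitespace(pfd_txt))` would produce chunks of running prose rather than one word per line, which may also make the stored context easier for the model to read.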

In [None]:
# Initialize Pinecone and connect to the index

from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone with the new API
pc = Pinecone(
    api_key="-"
)

# Specify serverless environment
spec = ServerlessSpec(
    cloud="aws",
    region="us-east-1"
)

# Create or connect to the index
index_name = "yardstick-qa"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=spec
    )
else:
    print(f"Index '{index_name}' already exists.")

index = pc.Index(index_name)
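
The index is created with `metric="cosine"`, and its `dimension=1536` matches the vector size produced by `text-embedding-ada-002`. As a rough illustration of what the cosine metric measures, here is a small pure-Python sketch. Pinecone computes similarity on the server side; `cosine_similarity` below is a hypothetical helper for intuition only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because only the angle matters, two vectors pointing in the same direction score 1 regardless of their lengths, while perpendicular vectors score 0.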


In [None]:
# Embed each text chunk with OpenAI's text-embedding-ada-002 model, then upsert
# the embeddings into the Pinecone index.

from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-ada-002")

# Embed and upsert each chunk into Pinecone
for i, text in enumerate(vec):
    chunk_embedding = embedding.embed_query(text)
    index.upsert([(f"chunk-{i}", chunk_embedding, {"text": text})])
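
The loop above makes one upsert call per chunk. For larger documents it may be worth grouping the vectors and sending them in batches. The sketch below shows only the grouping logic in plain Python; `build_vectors`, `batched`, and the batch size of 100 are assumptions for illustration, not part of the Pinecone or LangChain APIs.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

# (id, embedding values, metadata), the same tuple shape used in the loop above
Vector = Tuple[str, List[float], Dict[str, str]]


def build_vectors(chunks: Sequence[str], embeddings: Sequence[List[float]]) -> List[Vector]:
    """Pair each chunk with its embedding and an id of the form 'chunk-<i>'."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        (f"chunk-{i}", emb, {"text": text})
        for i, (text, emb) in enumerate(zip(chunks, embeddings))
    ]


def batched(items: Sequence[Vector], batch_size: int = 100) -> Iterator[Sequence[Vector]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Possible usage with the objects defined in this notebook (not executed here):
#   vectors = build_vectors(vec, embedding.embed_documents(vec))
#   for batch in batched(vectors):
#       index.upsert(vectors=list(batch))
```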


Retrieve the stored data from Pinecone

In [None]:
# Wrap the Pinecone index in a LangChain vector store for retrieval.
# Note: this import shadows the `Pinecone` client class imported from `pinecone` earlier.

from langchain.vectorstores import Pinecone

retriever = Pinecone(
    index=index,
    embedding=embedding.embed_query,
    text_key='text'
)




### STEP 2

#### Building the RAG model using the custom dataset

In [None]:

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)

# Instead of directly passing 'retriever', use retriever.as_retriever()
rag_model = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever.as_retriever() # Call as_retriever() method
)

# Testing the model

In [None]:
ques = '''what is question number 1 in Essential Machine
Learning Interview
Questions'''
ans = rag_model.run(ques)
print(ans)


Question number 1 in the Essential Machine Learning Interview Questions is: "What’s the trade-off between bias and variance?"
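
Conceptually, the chain retrieves the chunks most similar to the question and places them in the prompt sent to the model. Here is a simplified sketch of that prompt assembly; the template wording and the `build_rag_prompt` helper are hypothetical and do not reproduce the prompt LangChain actually uses.

```python
from typing import List


def build_rag_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Combine retrieved text chunks and the user question into one prompt."""
    # Separate chunks with a blank line so their boundaries stay visible
    context = "\n\n".join(chunk.strip() for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Because the answer is grounded in the retrieved chunks, what the model can say depends on what was stored in the index.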


# Bot using only the base GPT model, without RAG

The response is completely different from the one produced by the RAG method above.

In [None]:
import os
from langchain.chat_models import ChatOpenAI

In [None]:
# !pip install -U langchain-openai
# !pip install langchain-community

In [None]:
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = '--'
chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)
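
The cell above writes the API key into the notebook itself. One possible alternative, sketched below, is to set the key outside the notebook and fail early when it is missing. `require_env` is a hypothetical helper, not part of `openai` or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage with the objects defined in this notebook (not executed here):
#   chat = ChatOpenAI(
#       openai_api_key=require_env("OPENAI_API_KEY"),
#       model="gpt-3.5-turbo",
#   )
```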



In [None]:
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)


In [None]:
mes = [
    HumanMessage(content= '''what is question number 1 in Essential Machine
Learning Interview
Questions''')
]

In [None]:
res = chat(mes)
print(res.content)

In [None]:
ans = []
ans.append(res.content)
print(ans)

['Yardstick is a technology and professional services company that specializes in assessment and credentialing solutions. They offer a range of services including exam development, psychometric analysis, test administration, and certification management. Yardstick works with a variety of industries and organizations to create customized assessment programs that meet their specific needs. They are known for their innovative approach to assessment and their commitment to providing reliable and valid results for their clients. Yardstick is based in Canada but serves clients around the world.']


# Fine-Tuned Model Developed Using FastAPI

# Method 1: OpenAI CLI tool

In [None]:
!openai tools fine_tunes.prepare_data -f/content/fine_tuned_dataset.jsonl

Analyzing...

- Your file contains 112 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 98 duplicated prompt-completion sets. These are rows: [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111]
- All prompts end with suffix `\n\n###\n\n`
- The completion should start with a whitespace character (` `). This tends to produce better results due 
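
The analysis above reports 98 duplicated prompt-completion sets among 112 pairs. Below is a minimal sketch of removing exact duplicates from such a JSONL file before training, assuming each line is a JSON object with `prompt` and `completion` keys. The function names and file paths are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def dedupe_pairs(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first occurrence of each (prompt, completion) pair, preserving order."""
    seen = set()
    unique: List[Dict[str, str]] = []
    for record in records:
        key = (record.get("prompt"), record.get("completion"))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def dedupe_jsonl(src_path: str, dst_path: str) -> int:
    """Write a de-duplicated copy of a JSONL file; return how many rows were dropped."""
    with open(src_path, encoding="utf-8") as src:
        records = [json.loads(line) for line in src if line.strip()]
    unique = dedupe_pairs(records)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in unique:
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records) - len(unique)
```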

In [None]:
!openai api fine_tunes.create -t "/content/fine_tuned_dataset_prepared.jsonl"


usage: openai api [-h]
                  {chat.completions.create,images.generate,images.edit,images.create_variation,audio.transcriptions.create,audio.translations.create,files.create,files.retrieve,files.delete,files.list,models.list,models.retrieve,models.delete,completions.create}
                  ...
openai api: error: argument {chat.completions.create,images.generate,images.edit,images.create_variation,audio.transcriptions.create,audio.translations.create,files.create,files.retrieve,files.delete,files.list,models.list,models.retrieve,models.delete,completions.create}: invalid choice: 'fine_tunes.create' (choose from 'chat.completions.create', 'images.generate', 'images.edit', 'images.create_variation', 'audio.transcriptions.create', 'audio.translations.create', 'files.create', 'files.retrieve', 'files.delete', 'files.list', 'models.list', 'models.retrieve', 'models.delete', 'completions.create')


In [None]:
# !pip install fastapi uvicorn openai langchain pinecone

# The CLI-based fine-tuning above failed: the installed `openai` CLI rejected `fine_tunes.create` as an invalid choice, so another approach is used instead.
The full error is shown in the output above.

# Method 2

## Directly using the prepared dataset for fine-tuning

# Using contextual prompt engineering

In [None]:
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from fastapi import FastAPI, HTTPException
import openai
from pinecone import Pinecone, ServerlessSpec

from pydantic import BaseModel

# Set API keys
openai.api_key = "=A"
pc = Pinecone(
    api_key="-"
)

# Specify serverless environment
spec = ServerlessSpec(
    cloud="aws",
    region="us-east-1"
)

# Create or connect to the index
index_name = "yardstick-qa"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=spec
    )
else:
    print(f"Index '{index_name}' already exists.")

index = pc.Index(index_name)
app = FastAPI()

# Define the input model
class QueryRequest(BaseModel):
    query: str

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

def extract_text_from_pdfs(pdf_paths):
    all_texts = []
    for pdf_path in pdf_paths:
        reader = PdfReader(pdf_path)
        text = ""
        for page in reader.pages:
            text += page.extract_text()

        # Splitting the text into smaller chunks using the RecursiveCharacterTextSplitter
        txt_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            # length_function=len
        )

        texts = txt_splitter.split_text(text)
        all_texts.extend(texts)

    return all_texts

from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-ada-002")

# Example usage
pdf_paths = ["/content/About Yardstick.pdf"]
all_texts = extract_text_from_pdfs(pdf_paths)

# Embed and upsert each chunk into Pinecone
for i, text in enumerate(all_texts):
    chunk_embedding = embedding.embed_query(text)
    index.upsert([(f"chunk-{i}", chunk_embedding, {"text": text})])

from langchain.vectorstores import Pinecone

retriever = Pinecone(
    index=index,
    embedding=embedding.embed_query,
    text_key='text'
)

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)

# Instead of directly passing 'retriever', use retriever.as_retriever()
rag_model = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever.as_retriever() # Call as_retriever() method
)

print(f'Number of chunks = {len(all_texts)}')
# print(f"First chunk:\n{all_texts[0]}")

@app.get('/')
def homePage():
    return {"message": "HomePage"}

@app.post('/chat')
def qa_chatbot(req: QueryRequest):
    ques = req.query
    if not ques:
        raise HTTPException(status_code=400, detail='Query failed')

    try:
        answer = rag_model.run(ques)
        return {"query": ques, "answer": answer}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


Index 'yardstick-qa' already exists.
Number of chunks = 6
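
The `/chat` endpoint above accepts a JSON body with a `query` field and returns the `query` together with the `answer`. Below is a minimal client sketch using only the standard library. The base URL `http://localhost:8000` assumes the app is served locally (for example with uvicorn on its default port), and `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(query: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the /chat endpoint with a JSON body."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, base_url: str = "http://localhost:8000", timeout: float = 30.0) -> dict:
    """Send the query to a running server and return the decoded JSON response."""
    request = build_chat_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `ask("What is Yardstick's vision?")` requires the FastAPI app to be running; `build_chat_request` can be inspected without any network access.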


