<a href="https://colab.research.google.com/github/samuel3347/Chatbot-with-RAG/blob/main/Chatbot_with_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing the required libraries


In [None]:
!pip install -qU \
    langchain==0.0.354 \
    openai==1.6.1 \
    datasets==2.10.1 \
    pinecone-client==3.0.0 \
    tiktoken==0.5.2

# Building a Chatbot

In [None]:
import os
from langchain.chat_models import ChatOpenAI

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or "YOUR_API_KEY"

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)
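The cell above falls back to the placeholder string "YOUR_API_KEY" when the environment variable is unset, so a missing key only surfaces as an authentication error on the first request. Below is a minimal sketch of a fail-fast check. `require_key` is a hypothetical helper written for illustration; it is not part of LangChain or the OpenAI client.

```python
import os


def require_key(name, env=None):
    """Return the value of `name` from `env`, or raise if it is missing.

    Hypothetical helper for illustration only. `env` defaults to the real
    process environment; passing a dict makes the check easy to try out.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    # Treat the notebook's placeholder the same as a missing key
    if not value or value == "YOUR_API_KEY":
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Example with an explicit mapping instead of the real environment
print(require_key("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```

The same check could be applied to `PINECONE_API_KEY` before creating the Pinecone client.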


In [None]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hello AI, how are you today?"),
    AIMessage(content="I'm great, thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand Moore's law.")
]

In [None]:
res = chat(messages)
res

In [None]:
print(res.content)

Moore's Law is a principle in the field of computing that was formulated by Gordon Moore, co-founder of Intel Corporation, in 1965. The law states that the number of transistors on a microchip doubles approximately every two years, leading to a continuous increase in computing power and performance while decreasing the cost of manufacturing. This trend has held true for several decades, driving rapid advancements in technology such as faster processors, increased memory capacity, and smaller devices. Moore's Law has been a driving force behind the exponential growth of the digital age and continues to shape the development of modern computing devices.
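The call above returns the reply but does not add it to `messages`, so a follow-up question would be answered without that reply in the history. The sketch below shows the append-then-ask pattern with stand-ins only: `Message` and `fake_chat` are hypothetical substitutes for the LangChain message classes and the `chat` model, used so the example runs without an API key.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for SystemMessage / HumanMessage / AIMessage."""
    role: str
    content: str


def fake_chat(history):
    """Stand-in for the chat model: reports how many messages it received."""
    return Message("ai", f"(reply based on {len(history)} messages)")


history = [
    Message("system", "You are a helpful assistant."),
    Message("human", "I'd like to understand Moore's law."),
]

reply = fake_chat(history)
history.append(reply)  # keep the reply so the next turn can refer to it
history.append(Message("human", "Who formulated it?"))  # illustrative follow-up

print(len(history))
print(fake_chat(history).content)
```

With the real objects, the equivalent step would be `messages.append(res)` before adding the next `HumanMessage`.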


# Loading the chunked data

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "jamescalam/ai-arxiv-chunked",
    split="train"
)

dataset



Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

In [None]:
dataset[0]

{'doi': '1910.01108',
 'chunk-id': '0',
 'chunk': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-speciﬁc\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof i

In [None]:
from pinecone import Pinecone

api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pc = Pinecone(api_key=api_key)

In [None]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# Creating the index and upserting vectors

In [None]:
import time

index_name = 'ai-rag'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if the index already exists (it shouldn't on the first run)
if index_name not in existing_indexes:
    # if it does not exist, create the index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='cosine',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()
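The index above is created with `metric='cosine'`, so matches are ranked by the angle between vectors rather than by their length. The following is a minimal pure-Python sketch of that measure on toy 2-dimensional vectors; it illustrates the formula only and makes no claim about how Pinecone computes it internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```

The third call scores as highly as the first (up to floating-point rounding), which is why scaling an embedding does not change its ranking under this metric.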

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

In [None]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed_model.embed_documents(texts)
len(res), len(res[0])

In [None]:
from tqdm.auto import tqdm  # for progress bar

data = dataset.to_pandas()  # this makes it easier to iterate over the dataset

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    # get text to embed
    texts = [x['chunk'] for _, x in batch.iterrows()]
    # embed text
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))
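The loop above walks the dataset in slices of `batch_size` rows and builds each vector ID from the `doi` and `chunk-id` fields. The sketch below isolates that slicing and ID logic so it can be checked without calling OpenAI or Pinecone. `batch_ranges` and `make_id` are hypothetical helper names, and the row count is the `num_rows` value shown in the dataset output earlier.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) index pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(n_rows, start + batch_size)


def make_id(doi, chunk_id):
    """Build the vector ID the same way the upsert loop does."""
    return f"{doi}-{chunk_id}"


ranges = list(batch_ranges(41584, 100))
print(len(ranges))                  # 416 upsert calls
print(ranges[-1])                   # (41500, 41584), the final shorter batch
print(make_id("1910.01108", "0"))   # 1910.01108-0
```

Because the ID is derived from the record itself, re-running the loop overwrites existing vectors instead of duplicating them.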

In [None]:
index.describe_index_stats()

# Retrieval Augmented Generation

In [None]:
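# alias the import so it does not shadow the Pinecone client class imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = PineconeVectorStore(
    index, embed_model.embed_query, text_field
)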



In [None]:
query = "What is so special about AI?"

vectorstore.similarity_search(query, k=3)

[Document(page_content='CodeNet: A Large-Scale AI for Code Dataset for\nLearning a Diversity of Coding Tasks\nRuchir Puri1, David S. Kung1, Geert Janssen1, Wei Zhang1,\nGiacomo Domeniconi1,Vladimir Zolotov1,Julian Dolby1,Jie Chen2,1,\nMihir Choudhury1,Lindsey Decker1,Veronika Thost2,1,Luca Buratti1,\nSaurabh Pujar1,Shyam Ramji1,Ulrich Finkler1,Susan Malaika3,Frederick Reiss1\n1IBM Research\n2MIT-IBM Watson AI Lab\n3IBM Worldwide Ecosystems\nAbstract\nOver the last several decades, software has been woven into the fabric of every\naspect of our society. As software development surges and code infrastructure of\nenterprise applications ages, it is now more critical than ever to increase software\ndevelopment productivity and modernize legacy applications. Advances in deep\nlearning and machine learning algorithms have enabled breakthroughs in computer\nvision, speech recognition, natural language processing and beyond, motivating\nresearchers to leverage AI techniques to improve software

In [None]:
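def augment_prompt(query: str) -> str:
    """Retrieve the top matching chunks for `query` and embed them in a prompt."""
    # fetch the 3 most similar chunks from the vector store
    results = vectorstore.similarity_search(query, k=3)
    # join the chunk texts into a single context block
    source_knowledge = "\n".join([x.page_content for x in results])
    # combine the contexts and the query into one prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""
    return augmented_prompt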
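To see what the model actually receives, the prompt assembly can be separated from retrieval. The sketch below is an illustration under that assumption: `build_prompt` is a hypothetical variant that takes the context strings directly, and the two contexts are made-up stand-ins for what `similarity_search` would return.

```python
def build_prompt(query, contexts):
    """Assemble an augmented prompt from a query and retrieved context strings."""
    source_knowledge = "\n".join(contexts)
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{source_knowledge}\n\n"
        f"Query: {query}"
    )


# Stand-in contexts; in the notebook these come from the vector store
contexts = ["First retrieved chunk.", "Second retrieved chunk."]
prompt_text = build_prompt("What is so special about AI?", contexts)
print(prompt_text)
```

Keeping the assembly free of network calls makes it straightforward to inspect or adjust the template before sending anything to the chat model.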

In [None]:
# create a new user prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)
# add to messages
messages.append(prompt)

res = chat(messages)

print(res.content)

AI is special because it has the potential to significantly impact society by improving software development efficiency, making AI more accessible with compute-efficient machine learning, and addressing issues such as data sovereignty and trust in AI.


In [None]:
# this question is sent as-is, without passing it through augment_prompt
prompt = HumanMessage(
    content="What safety measures are used to develop AI?"
)

res = chat(messages + [prompt])
print(res.content)

Safety measures used to develop AI include:

1. Data Privacy Protection: Ensuring that sensitive data used in AI systems is protected and only accessed by authorized personnel.

2. Bias Detection and Mitigation: Identifying and addressing biases in AI algorithms to prevent discriminatory outcomes.

3. Robustness Testing: Conducting rigorous testing to ensure that AI systems function as intended and can handle unexpected scenarios.

4. Explainability and Transparency: Making AI systems transparent and understandable so that their decisions can be explained to users and stakeholders.

5. Ethical Guidelines: Adhering to ethical principles and guidelines in the development and deployment of AI systems to ensure they are used responsibly.

These safety measures help mitigate risks associated with AI technology and promote the development of safe and reliable AI systems.


# Deleting the index to free resources

In [None]:
pc.delete_index(index_name)