# Part A: Build a code understanding model. Upload your own custom code files to the model and ask questions based on the code file as context.

### Prerequisites

Before we start building our chatbot, we need to install some Python libraries. Here's a brief overview of what each library does:

- **langchain**: A framework for building applications powered by language models. We'll use it to load and split our code files and to connect the embedding model and vector store to our chatbot.
- **openai**: This is the official OpenAI Python client. We'll use it to interact with the OpenAI API and generate responses for our chatbot.
- **datasets**: This library provides a vast array of datasets for machine learning. We'll use it to load our knowledge base for the chatbot.
- **pinecone-client**: This is the official Pinecone Python client. We'll use it to interact with the Pinecone API and store our chatbot's knowledge base in a vector database.

You can install these libraries using pip like so:

In [3]:
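!pip uninstall langchain langchain-core langchain-community langsmith fsspec gcsfs -y
# Quote specifiers that contain "<": unquoted, the shell reads "<" as input
# redirection, which is what causes the "No such file or directory" error
# recorded in the output below.
!pip install \
  langchain \
  "langchain-community==0.0.20" \
  "langchain-core<0.2" \
  "langsmith<0.1" \
  "fsspec==2024.10.0" \
  gcsfs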

Found existing installation: langchain 0.3.11
Uninstalling langchain-0.3.11:
  Successfully uninstalled langchain-0.3.11
Found existing installation: langchain-core 0.3.24
Uninstalling langchain-core-0.3.24:
  Successfully uninstalled langchain-core-0.3.24
Found existing installation: langchain-community 0.0.20
Uninstalling langchain-community-0.0.20:
  Successfully uninstalled langchain-community-0.0.20
Found existing installation: langsmith 0.2.2
Uninstalling langsmith-0.2.2:
  Successfully uninstalled langsmith-0.2.2
Found existing installation: fsspec 2024.9.0
Uninstalling fsspec-2024.9.0:
  Successfully uninstalled fsspec-2024.9.0
Found existing installation: gcsfs 2024.10.0
Uninstalling gcsfs-2024.10.0:
  Successfully uninstalled gcsfs-2024.10.0
/bin/bash: line 1: 0.2: No such file or directory


### Building a Chatbot (no RAG)

We will be relying heavily on the LangChain library to bring together the different components needed for our chatbot. To begin, we'll create a simple chatbot without any retrieval augmentation. In this first step we call the model directly through the official `OpenAI` client. For this we need an [OpenAI API key](https://platform.openai.com/account/api-keys).

In [4]:
import os
os.environ["OPENAI_API_KEY"] = 'your_api_key'

In [54]:
from openai import OpenAI
client = OpenAI(api_key="your_api_key")


In [36]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)


In [7]:
# Initialize the conversation messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi AI, how are you today?"},
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    {"role": "user", "content": "I'd like to understand string theory."}
]

In [8]:
# Make the first API call to get the assistant's response
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

# Add the latest assistant response to the messages
assistant_response = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_response})

# Create a new user prompt
new_user_prompt = "Why do physicists believe it can produce a 'unified theory'?"
messages.append({"role": "user", "content": new_user_prompt})

# Make the second API call with the updated messages
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

# Print the response
print(response.choices[0].message.content.strip())


Physicists over the years have been in search of a "unified theory" which is also known as the "Theory of Everything" (ToE). Essentially, they are trying to unify the four fundamental forces of nature: gravity, electromagnetism, and the strong and weak nuclear forces into a single theoretical framework.

Albert Einstein's general theory of relativity brilliantly explained gravity. Quantum mechanics, on the other hand, has been exceptional in explaining the behavior of the three other non-gravitational forces - electromagnetism, and the strong and weak nuclear forces.

The problem, however, is reconciling general relativity, which works on a macroscopic level (planets, galaxies, etc), with quantum mechanics, which operates at the minuscule level of particles that are subatomic. They simply do not fit together. General relativity portrays space as a smooth fabric, while quantum mechanics portrays it as a jittery, fluctuating froth at extremely small scales.

String theory, however, has t
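
The cell above appends each reply to `messages` by hand, which makes it easy to append a stale reply in a later cell. One way to keep the history consistent is to wrap a single turn in a helper. The following is a minimal sketch, not the notebook's own method: `chat_turn` and `echo_model` are hypothetical names, and `echo_model` is a stand-in so the example runs without an API key. With the real client, the `complete` argument could wrap `client.chat.completions.create`.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(
    messages: List[Message],
    user_text: str,
    complete: Callable[[List[Message]], str],
) -> str:
    """Append the user turn, get a reply, and record that reply in the history."""
    messages.append({"role": "user", "content": user_text})
    reply = complete(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


def echo_model(messages: List[Message]) -> str:
    """Hypothetical stand-in for an LLM call: reports how many messages it received."""
    return f"received {len(messages)} messages"


history = [{"role": "system", "content": "You are a helpful assistant."}]
first_reply = chat_turn(history, "Hi AI, how are you today?", echo_model)
second_reply = chat_turn(history, "I'd like to understand string theory.", echo_model)

print(first_reply)
print(second_reply)
print(len(history))
```

Because the helper always appends the reply it just received, the history never contains a reply from an earlier call in the wrong position.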

### Dealing with Hallucinations

We have our chatbot, but as mentioned, the knowledge of LLMs can be limited. The reason for this is that LLMs learn all they know during training. An LLM essentially compresses the "world" as seen in the training data into the internal parameters of the model. We call this knowledge the _parametric knowledge_ of the model.

By default, LLMs have no access to the external world.

The result of this is very clear when we ask LLMs about more recent information, such as the new (and very popular) Llama 2 LLM.

In [9]:
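# Add the latest AI response to messages
# (re-read it from `response`; `assistant_response` still holds the first reply)
assistant_response = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_response})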

# Create a new user prompt
prompt = {"role": "user", "content": "What is so special about Llama 2?"}

# Add to messages
messages.append(prompt)

# Send to OpenAI (chat-gpt equivalent)
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

# Print the response
print(response.choices[0].message.content.strip())


"Llama 2" generally refers to an antibody treatment developed for COVID-19. Allow me to explain:

A group of scientists discovered that llamas produce a unique kind of antibody, called a nanobody, that can tightly bind and neutralize SARS-CoV-2, the virus responsible for COVID-19. 

"Llama 2" is the name given to one of these nanobodies. These antibodies are much smaller than human antibodies. Because they're easily manipulated and stable, these "llama antibodies" could be stacked to neutralize the virus more effectively which could be used as a potential treatment or prevention for COVID-19. 

It's important to note that although this treatment showed promise in laboratory settings, more research and trials are needed to determine if it would be an effective treatment for humans infected with the virus.


Our chatbot can no longer help us because it doesn't contain the information we need to answer the question. It is clear from this answer that the LLM doesn't know the information, but sometimes an LLM may respond as if it _does_ know the answer, and this can be very hard to detect.

In other cases the model does acknowledge that it lacks the information, as we can see with the LangChain question below:

In [10]:
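# Add the latest AI response to messages
# (re-read it from `response` so the reply just printed is the one recorded)
assistant_response = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_response})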

# Create a new user prompt
prompt = {"role": "user", "content": "Can you tell me about the LLMChain in LangChain?"}

# Add to messages
messages.append(prompt)

# Send to OpenAI (chat-gpt equivalent)
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

# Print the response
print(response.choices[0].message.content.strip())


I'm sorry for the confusion, but it appears there is a misunderstanding. The details regarding "LLMChain" in "LangChain" are not clear since there isn't publicly available information about such terms in present databases or literature connected to technology, cryptocurrency, or linguistics. If these are related to a specific, narrowly-defined context or new technology, could you please provide more details or clarify?


There is another way of feeding knowledge into LLMs. It is called _source knowledge_ and it refers to any information fed into the LLM via the prompt. We can try that with the LLMChain question. We can take a description of this object from the LangChain documentation.

We can feed this additional knowledge into our prompt with some instructions telling the LLM how we'd like it to use this information alongside our original query.

Now we feed this into our chatbot as we did before.

In [11]:
# Define the LLMChain information
llmchain_information = [
    "A LLMChain is the most common type of chain. It consists of a PromptTemplate, a model (either an LLM or a ChatModel), and an optional output parser. This chain takes multiple input variables, uses the PromptTemplate to format them into a prompt. It then passes that to the model. Finally, it uses the OutputParser (if provided) to parse the output of the LLM into a final format.",
    "Chains is an incredibly generic concept which returns to a sequence of modular components (or other chains) combined in a particular way to accomplish a common use case.",
    "LangChain is a framework for developing applications powered by language models. We believe that the most powerful and differentiated applications will not only call out to a language model via an API, but will also: (1) Be data-aware: connect a language model to other sources of data, (2) Be agentic: Allow a language model to interact with its environment. As such, the LangChain framework is designed with the objective in mind to enable those types of applications."
]

# Combine the LLMChain information into a single string
source_knowledge = "\n".join(llmchain_information)

# Define the query
query = "Can you tell me about the LLMChain in LangChain?"

# Create the augmented prompt
augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

# Create a new user prompt
prompt = {"role": "user", "content": augmented_prompt}

# Start a fresh conversation so that earlier turns do not influence the answer
messages = [
    {"role": "system", "content": "You are a helpful assistant."}
]

# Add the new prompt to the conversation
messages.append(prompt)

# Send the messages to OpenAI
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

# Print the response
print(response.choices[0].message.content.strip())


The LLMChain in LangChain is a common type of chain that includes multiple components: a PromptTemplate, a model (either an LLM or a ChatModel), and an optional output parser. The function of the LLMChain is to handle multiple input variables. The Chain uses the PromptTemplate to shape these variables into a prompt that is then passed to the model. If an OutputParser is provided, it will interpret the output of the LLM, formatting it into a finalized version. In essence, the LLMChain is a key part of the LangChain framework, which aims to develop applications that utilize language models in an interactive and data-aware manner.


This answer is accurate and closely follows the contexts we supplied. This is made possible by augmenting our query with external knowledge (source knowledge). There's just one problem: how do we get this information in the first place?

We learned in the previous chapters about Pinecone and vector databases. Well, they can help us here too. But first, we'll need a dataset.

In [12]:
!pip install pypdf



In [14]:
pip install -U langchain-community

Collecting langchain-community
  Using cached langchain_community-0.3.11-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain<0.4.0,>=0.3.11 (from langchain-community)
  Using cached langchain-0.3.11-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.24 (from langchain-community)
  Using cached langchain_core-0.3.24-py3-none-any.whl.metadata (6.3 kB)
Collecting langsmith<0.3,>=0.1.125 (from langchain-community)
  Using cached langsmith-0.2.2-py3-none-any.whl.metadata (14 kB)
Using cached langchain_community-0.3.11-py3-none-any.whl (2.5 MB)
Using cached langchain-0.3.11-py3-none-any.whl (1.0 MB)
Using cached langchain_core-0.3.24-py3-none-any.whl (410 kB)
Using cached langsmith-0.2.2-py3-none-any.whl (320 kB)
Installing collected packages: langsmith, langchain-core, langchain, langchain-community
Successfully installed langchain-0.3.11 langchain-community-0.3.11 langchain-core-0.3.24 langsmith-0.2.2


## Upload your own custom code files to the model

In [15]:
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_text_splitters import Language
import os

# Define the uploaded file path (update with your file name if different)
uploaded_file_path = "code.ipynb"

# Loader to process the single `.ipynb` file
loader = GenericLoader.from_filesystem(
    uploaded_file_path,
    glob="*",
    suffixes=[".ipynb"],
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),  # Treat as Python
)

# Load the content
documents = loader.load()

# Check the number of documents loaded
print(f"Number of documents loaded: {len(documents)}")

# Print document details
for doc in documents:
    print("Document Metadata:", doc.metadata)
    print("Document Content:", doc.page_content[:500])  # Print the first 500 characters


Number of documents loaded: 1
Document Metadata: {'source': 'code.ipynb', 'content_type': 'simplified_code', 'language': <Language.PYTHON: 'python'>}
Document Content: {
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "## Installing libraries"
      ],
      "metadata": {
        "id": "4KUVqVe0fGxH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install request


In [16]:
documents[:1]
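
The loader output above shows that the notebook is loaded as raw JSON, so markdown cells, metadata, and cell outputs all end up in the knowledge base. If only the code is wanted, one option is to extract the code cells before splitting. The following is a minimal sketch using only the standard `json` module, assuming an nbformat 4 notebook; `extract_code_cells` and `sample_notebook` are hypothetical names used for illustration.

```python
import json
from typing import List


def extract_code_cells(notebook_json: str) -> List[str]:
    """Return the source of every non-empty code cell in an nbformat 4 notebook."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            sources.append(source)
    return sources


# Tiny in-memory notebook used only for illustration
sample_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "source": ["## Installing libraries"], "metadata": {}},
        {"cell_type": "code", "source": ["import os\n", "print(os.getcwd())"], "metadata": {}},
        {"cell_type": "code", "source": [], "metadata": {}},
    ],
})

code_cells = extract_code_cells(sample_notebook)
print(len(code_cells))
print(code_cells[0])
```

For a file on disk, the same function could be applied to the text read from `code.ipynb`.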



In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split text data into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=20)
text_chunks = text_splitter.split_documents(documents)
print(len(text_chunks))

31
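
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-size character chunker. It is only a sketch of the overlap idea and is not how `RecursiveCharacterTextSplitter` works internally, since that splitter prefers to break on separators such as blank lines and newlines. `chunk_text` is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two characters of the previous one, which is the role the 20-character overlap plays in the cell above.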


In [18]:
text_chunks[2]



In [19]:
# reformat chunks to improve vectorization; match 'jamescalam/llama-2-arxiv-papers-chunked' format sourced from Llama 2 ArXiv papers on huggingface
dataset = []

for i, chunk in enumerate(text_chunks):
    dataset.append({
        'doi': '',  # you can add a DOI here if available
        'chunk-id': str(i),
        'chunk': chunk,
        'id': '',  # you can add an ID here if available
        'title': '',  # you can add a title here if available
        'summary': '',  # you can add a summary here if available
        'source': '',  # you can add a source here if available
        'authors': [],  # you can add authors here if available
        'categories': [],  # you can add categories here if available
        'comment': '',  # you can add a comment here if available
        'journal_ref': None,  # you can add a journal reference here if available
        'primary_category': '',  # you can add a primary category here if available
        'published': '',  # you can add a published date here if available
        'updated': '',  # you can add an updated date here if available
        'references': []  # you can add references here if available
    })

print(dataset[3])



### Task 4: Building the Knowledge Base

We now have a dataset that can serve as our chatbot knowledge base. Our next task is to transform that dataset into the knowledge base that our chatbot can use. To do this we must use an embedding model and vector database.

We begin by initializing our connection to Pinecone. This requires a [free API key](https://app.pinecone.io).

In [20]:
pip install pinecone-client



In [21]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = "your-pinecone-code"

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [22]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

Then we initialize the index. We will be using OpenAI's `text-embedding-ada-002` model for creating the embeddings, so we set the `dimension` to `1536`.

In [23]:
import time

index_name = 'llama-2-rag'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Our index is now ready but it's empty. It is a vector index, so it needs vectors. As mentioned, to create these vector embeddings we will use OpenAI's `text-embedding-ada-002` model, which we can access via LangChain like so:

In [24]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=os.environ["OPENAI_API_KEY"]
)



Using this model we can create embeddings like so:

In [25]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed_model.embed_documents(texts)
print(len(res), len(res[0]))


2 1536


From this we get two (aligning to our two chunks of text) 1536-dimensional embeddings.
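
The index was created with `metric='dotproduct'`. For unit-length vectors the dot product equals the cosine similarity, so both metrics rank results the same way. The sketch below checks this on toy 3-dimensional vectors in plain Python; it assumes the embeddings are normalized to length 1, which OpenAI's documentation states for its embedding models. The helper names are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: the dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Toy vectors standing in for real 1536-dimensional embeddings
query_vec = normalize([1.0, 2.0, 2.0])
doc_vec = normalize([2.0, 1.0, 2.0])

print(round(dot(query_vec, doc_vec), 6))
print(round(cosine(query_vec, doc_vec), 6))
```

Both lines print the same value, which is why a dot-product index is a reasonable choice for normalized embeddings.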

We're now ready to embed and index all our data! We do this by looping through our dataset, embedding each batch, and inserting it into the index.

In [28]:
import pandas as pd
from tqdm.auto import tqdm  # for progress bar

data = pd.DataFrame(dataset) # this makes it easier to iterate over the dataset

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    # get text to embed (the chunk's text, not the string form of the Document object)
    texts = [x['chunk'].page_content for _, x in batch.iterrows()]

    # embed text
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'].page_content,
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone as a list of (id, values, metadata) tuples
    index.upsert(vectors=list(zip(ids, embeds, metadata)))

  0%|          | 0/1 [00:00<?, ?it/s]
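
The batching pattern in the cell above can also be written without pandas. The following is a minimal sketch under stated assumptions: `batched`, `build_vectors`, and `fake_embed` are hypothetical names, and `fake_embed` stands in for `embed_model.embed_documents` so the example runs offline. It produces the same `(id, values, metadata)` tuples that the upsert call receives.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Vector = Tuple[str, List[float], Dict[str, str]]


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for an embedding model: one 2-d vector per text."""
    return [[float(len(t)), 0.0] for t in texts]


def build_vectors(
    records: Sequence[Dict[str, str]],
    embed: Callable[[List[str]], List[List[float]]],
) -> List[Vector]:
    """Embed one batch and pair each vector with an id and its metadata."""
    texts = [r["text"] for r in records]
    embeds = embed(texts)
    return [
        (f"chunk-{r['chunk-id']}", vec, {"text": r["text"]})
        for r, vec in zip(records, embeds)
    ]


records = [{"chunk-id": str(i), "text": f"chunk number {i}"} for i in range(5)]

all_vectors: List[Vector] = []
for batch in batched(records, size=2):
    all_vectors.extend(build_vectors(batch, fake_embed))

print(len(all_vectors))
print(all_vectors[0])
```

Embedding one batch at a time keeps each request small, which is the reason for `batch_size` in the cell above.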

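We can check that the vector index has been populated using `describe_index_stats` like before. In the recorded output below, `total_vector_count` is still 0 even though the similarity search later in this section returns results; index statistics may take a moment to reflect recent upserts, so re-running the cell after a short wait may be needed.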

In [29]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
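
A count of 0 right after an upsert can mean either that the statistics have not refreshed yet or that the upsert did not succeed. One way to tell the two apart is to poll for a short time. The following is a minimal sketch: `wait_for_count` and `fake_count` are hypothetical names, and `fake_count` simulates a lagging counter so the example runs offline. With the real index, `get_count` could read the vector count from `index.describe_index_stats()`.

```python
import time
from typing import Callable


def wait_for_count(
    get_count: Callable[[], int],
    expected: int,
    timeout_s: float = 30.0,
    poll_s: float = 1.0,
) -> bool:
    """Poll get_count until it reaches `expected`; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_count() >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Simulated counter that lags behind for two reads, then reports 31 vectors
readings = iter([0, 0, 31])


def fake_count() -> int:
    return next(readings)


ready = wait_for_count(fake_count, expected=31, timeout_s=5.0, poll_s=0.01)
print(ready)
```

If the function returns `False` with the real index, the upsert itself is worth checking.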

#### Retrieval Augmented Generation

We've built a fully-fledged knowledge base. Now it's time to connect that knowledge base to our chatbot. To do that we'll be diving back into LangChain and reusing our template prompt from earlier.

To use LangChain here we need to load the LangChain abstraction for a vector index, called a `vectorstore`. We pass in our vector `index` to initialize the object.

In [30]:
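# Alias the LangChain class so it does not shadow the `Pinecone` client class imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = PineconeVectorStore(
    index, embed_model.embed_query, text_field
)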



Using this `vectorstore` we can already query the index and see if we have any relevant information given our question about the uploaded code.

In [38]:
query = "which language is used in code?"

vectorstore.similarity_search(query, k=3)

[Document(metadata={'source': '', 'title': ''}, page_content='"\\n",\n            "CEO Sundar Pichai attributed the company\'s success to their innovative initiatives in Artificial Intelligence (AI). The company\'s AI approach involves infrastructure investment, research, and customer experience. They\'ve seen growth in their Gemini models usage, and over a quarter of all new code at Google is now generated by AI. Google Search, Google Cloud, and YouTube sectors have significantly benefited from AI advancements. The company also acknowledged the contributions of their global employees and paid tribute to the late Susan Wojcicki, a former YouTube CEO.\\n"\n          ]\n        }\n      ]\n    },\n    {\n      "cell_type": "code",\n      "source": [],\n      "metadata": {\n        "id": "qYbteZ9AXV2O"\n      },\n      "execution_count": 44,\n      "outputs": []\n    }\n  ]\n}'),
 Document(metadata={'source': '', 'title': ''}, page_content='"\\n",\n            "Research-wise, Google\'s De

We return a lot of text here and it's not that clear what we need or what is relevant. Fortunately, our LLM will be able to parse this information much faster than us. All we need is to connect the output from our `vectorstore` to our chatbot. To do that we can use the same logic as we used earlier.

In [39]:
def augment_prompt(query: str):
    # get top 3 results from knowledge base
    results = vectorstore.similarity_search(query, k=3)
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt

Using this we produce an augmented prompt:

In [40]:
print(augment_prompt(query))

Using the contexts below, answer the query.

    Contexts:
    "\n",
            "CEO Sundar Pichai attributed the company's success to their innovative initiatives in Artificial Intelligence (AI). The company's AI approach involves infrastructure investment, research, and customer experience. They've seen growth in their Gemini models usage, and over a quarter of all new code at Google is now generated by AI. Google Search, Google Cloud, and YouTube sectors have significantly benefited from AI advancements. The company also acknowledged the contributions of their global employees and paid tribute to the late Susan Wojcicki, a former YouTube CEO.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "qYbteZ9AXV2O"
      },
      "execution_count": 44,
      "outputs": []
    }
  ]
}
"\n",
            "Research-wise, Google's DeepMind team, led by Nobel laureates Demis Hassabis and John Jumper, is pioneering AI adv

There is still a lot of text here, so let's pass it on to our chat model to see how it performs.

## Ask questions based on the code file as context

In [41]:
# Create a new user prompt
prompt = {"role": "user", "content": augment_prompt(query)}

# Add to messages
messages.append(prompt)

# Send to OpenAI (chat-gpt equivalent)
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

# Print the response
print(response.choices[0].message.content.strip())


The language used in the code, according to the context, is Python.


In [42]:
# Create a new user prompt
prompt = {"role": "user", "content": "which model is used in the code?"}

# Add to messages
messages.append(prompt)

# Send to OpenAI (chat-gpt equivalent)
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

# Print the response
print(response.choices[0].message.content.strip())


The model used in the provided code is "gpt-4".


In [49]:
# Create a new user prompt
prompt = {"role": "user", "content": "Tell me a brief of the code written"}

# Add to messages
messages.append(prompt)

# Send to OpenAI (chat-gpt equivalent)
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

# Print the response
print(response.choices[0].message.content.strip())


The provided code in the context is primarily about extracting content from a website, rephrasing it, and displaying the rephrased content. It uses the Python language and has several imported libraries such as OpenAI, requests, BeautifulSoup, and PyPDF2. 

Here is a brief description of the operations performed:

1. The code first defines the 'fetch_website_content' function, which fetches and returns the content from a given website. It uses the 'requests' library to make a HTTP request to the website's URL, and BeautifulSoup to parse the HTML content of the website. 

2. Then, it defines the 'rewrite_content' function that uses the OpenAI language model (specified by a model name) to paraphrase or rewrite the fetched content. To accomplish this, it uses the OpenAI API.

Finally, these functions are utilized to fetch the content from a website (specified by the 'website_url' variable), rewrite it, and print the rewritten text. 

Note: The actual API Keys and specific website URLs hav

In [50]:
# Create a new user prompt
prompt = {"role": "user", "content": "What libraries are imported?"}

# Add to messages
messages.append(prompt)

# Send to OpenAI (chat-gpt equivalent)
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

# Print the response
print(response.choices[0].message.content.strip())


The libraries that are imported in the code are openai, requests, BeautifulSoup from bs4, and PyPDF2.


The chatbot is able to answer these follow-up questions thanks to its conversational history stored in `messages`, which still contains the contexts retrieved for the first query. However, the follow-up prompts themselves were not passed through `augment_prompt`, so no new context was retrieved for them. Let's try one more question.

In [52]:
# Create a new user prompt
prompt = {"role": "user", "content": "What roles are used to specify the type of input in a conversation?"}

# Add to messages
messages.append(prompt)

# Send to OpenAI (chat-gpt equivalent)
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)

# Print the response
print(response.choices[0].message.content.strip())


The roles that are used to specify the type of input in a conversation are "system" and "user". The "system" role is typically used to set the behavior of the language model, while the "user" role provides the instruction that the model should complete.
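
The follow-up cells above send the bare question and rely on the contexts retrieved for the first query. To retrieve fresh context on every turn, the retrieval and chat steps can be combined in one helper. The following is a minimal sketch, not the notebook's own method: `ask_with_rag`, `build_augmented_prompt`, `fake_retrieve`, and `fake_complete` are hypothetical names, and the two stubs let the example run without Pinecone or OpenAI. With the real objects, `retrieve` could return the `page_content` of `vectorstore.similarity_search(query, k=3)` and `complete` could wrap `client.chat.completions.create`.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_augmented_prompt(query: str, contexts: List[str]) -> str:
    """Combine retrieved contexts and the query into one prompt."""
    source_knowledge = "\n".join(contexts)
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{source_knowledge}\n\n"
        f"Query: {query}"
    )


def ask_with_rag(
    query: str,
    messages: List[Message],
    retrieve: Callable[[str], List[str]],
    complete: Callable[[List[Message]], str],
) -> str:
    """Retrieve context for this query, send the augmented prompt, record the reply."""
    prompt = build_augmented_prompt(query, retrieve(query))
    messages.append({"role": "user", "content": prompt})
    reply = complete(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


def fake_retrieve(query: str) -> List[str]:
    """Hypothetical stand-in for a vector store lookup."""
    return ["import requests", "from bs4 import BeautifulSoup"]


def fake_complete(messages: List[Message]) -> str:
    """Hypothetical stand-in for an LLM call: reports whether context was supplied."""
    return "context seen" if "Contexts:" in messages[-1]["content"] else "no context"


conversation = [{"role": "system", "content": "You are a helpful assistant."}]
answer = ask_with_rag("What libraries are imported?", conversation, fake_retrieve, fake_complete)

print(answer)
print(len(conversation))
```

Retrieving on every turn costs one extra embedding call per question, but each answer is then grounded in contexts chosen for that specific question.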


Delete the index to save resources:

In [53]:
pc.delete_index(index_name)

---