# Run First

In [None]:
%load_ext autoreload
%autoreload 2

from IPython.display import Markdown, display

def display_md(content):
  display(Markdown(content))

# Naive Indexer Architecture
In this course, we index a small number of HTML files. In production, however, you're likely to run into factors that make accurate indexing significantly harder:
- PDFs contain much of the world's unstructured data
- Parsing PDFs with computer vision can require layout understanding, which is not a solved problem in general
- PDFs often contain tables, graphics, footnotes, equations, etc. that require special handling
- Many business cases require indexing highly heterogeneous document layouts

Because indexing is the first stage of your inference pipeline, mistakes made here carry through to every later stage. This is often one of the most consequential engineering problems to get right.

## loaders.py
Look at `./workshop_code/indexer_components/loaders.py`

The function here is pretty simple. It downloads a file at a specified URI, then saves it to a cache so it doesn't need to be downloaded on each subsequent run in the notebook. The preprocessor will consume this file in the next step. 
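If it helps to see the idea in isolation, here is a minimal sketch of a download-then-cache loader. It is not the workshop's `DocLoader`: the names `load_html_cached` and `fetch_over_http`, the SHA-256 file naming, and the injectable `fetch` argument are all assumptions made for illustration. The demo at the bottom swaps in a fake fetcher so it runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List
from urllib.request import urlopen


def fetch_over_http(uri: str) -> str:
    """Download a document and decode it as UTF-8 text."""
    with urlopen(uri) as response:
        return response.read().decode("utf-8")


def load_html_cached(
    uri: str,
    cache_dir: Path,
    fetch: Callable[[str], str] = fetch_over_http,
) -> str:
    """Return the document at `uri`, downloading it only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URI so that any URL maps to a safe, fixed-length file name.
    file_name = hashlib.sha256(uri.encode("utf-8")).hexdigest() + ".html"
    cache_file = cache_dir / file_name

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    html = fetch(uri)
    cache_file.write_text(html, encoding="utf-8")
    return html


# Demo with a fake fetcher (hypothetical) that records each download.
download_log: List[str] = []


def fake_fetch(uri: str) -> str:
    download_log.append(uri)
    return "<html><body>hello</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    first_read = load_html_cached("https://example.com/post", Path(tmp), fetch=fake_fetch)
    second_read = load_html_cached("https://example.com/post", Path(tmp), fetch=fake_fetch)

print(len(download_log))  # prints 1: the second read came from the cache
```

Note that this cache never expires, which is fine for a notebook but not for documents that change.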

### Task: Read the Code
This part isn't interesting, so just look at the code and understand what it does. If you have any questions, let one of us know.

Note that the proper way to do caching is by using the HTTP response's `ETag`, `Last-Modified`, and `Cache-Control` headers, but I didn't do that here. If you want extra credit, feel free to send me a pull request with the corrected code.
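As a starting point for that pull request, here is a rough sketch of the revalidation logic, assuming you keep the `ETag` and `Last-Modified` values next to the cached body. The `CachedResponse` class and both function names are hypothetical, and the sketch skips `Cache-Control` freshness checks entirely. It only builds the conditional request headers (`If-None-Match`, `If-Modified-Since`) and decides what to keep once the server answers.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass
class CachedResponse:
    """A cached body plus the validators the server sent with it."""
    body: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: Optional[CachedResponse]) -> Dict[str, str]:
    """Build the request headers that ask the server 'has this changed?'."""
    headers: Dict[str, str] = {}
    if cached is None:
        return headers
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def resolve(
    status: int,
    response_headers: Mapping[str, str],
    body: str,
    cached: Optional[CachedResponse],
) -> CachedResponse:
    """Keep the cached copy on 304 Not Modified, otherwise store the new response."""
    if status == 304 and cached is not None:
        return cached
    # Assumption: header names arrive in this exact casing. A real HTTP
    # client gives you a case-insensitive mapping instead.
    return CachedResponse(
        body=body,
        etag=response_headers.get("ETag"),
        last_modified=response_headers.get("Last-Modified"),
    )


# First request: nothing cached, so no conditional headers are sent.
first = resolve(
    200,
    {"ETag": '"abc123"', "Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"},
    "<html>v1</html>",
    None,
)

# Second request: send the validators; the server replies 304 with no body.
revalidation_headers = conditional_headers(first)
second = resolve(304, {}, "", first)
```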

## preprocessors.py
Look at `./workshop_code/indexer_components/preprocessors.py`

To reduce inference costs, we strip away all of the HTML syntax except the human-readable body text of the documents. This is a design decision. In real life, you might, for example, preserve more of the HTML markup to retain context about how the document is structured.
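To make "strip the markup, keep the body text" concrete, here is a small sketch. It deliberately uses the standard library's `html.parser` instead of Beautiful Soup so that it runs with no dependencies, and it is not the workshop's preprocessor. The tag sets and the names `BodyTextExtractor` and `html_to_text` are my own assumptions.

```python
from html.parser import HTMLParser
from typing import List

# Tags whose contents we treat as non-body text (an assumption; tune per site).
SKIPPED_TAGS = {"script", "style", "title", "nav", "header", "footer"}
# Tags that end a block of text, so we start a new line after them.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class BodyTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def get_text(self) -> str:
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    extractor = BodyTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.get_text()


sample_html = """
<html><head><title>Agents</title><style>p { color: red; }</style></head>
<body><nav>Home | Posts</nav>
<h1>LLM Powered Agents</h1>
<p>Planning &amp; memory are <b>key</b> components.</p>
<script>console.log("tracking");</script>
</body></html>
"""
sample_text = html_to_text(sample_html)
print(sample_text)
```

The navigation bar, title, styles, and script are dropped, and the inline `<b>` tag disappears without splitting the sentence. A Beautiful Soup version follows the same shape: remove the elements you don't want, then extract the text of what remains.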

### Easy Task: Configure Beautiful Soup for A Simply Structured Blog Post

The unprocessed HTML blog post looks like this:

In [None]:
from workshop_code.indexer_components.loaders import DocLoader

blog_post_uri = "https://lilianweng.github.io/posts/2023-06-23-agent/"
doc_content = DocLoader.load_html(blog_post_uri)
display_md(doc_content[3400:3800])

The processed HTML blog post should look like this:

In [None]:
from cheat_code.indexer_components.loaders       import DocLoader
from cheat_code.indexer_components.preprocessors import GithubBlogpostPreprocessor

blog_post_uri = "https://lilianweng.github.io/posts/2023-06-23-agent/"
blog_post_html = DocLoader.load_html(blog_post_uri)
preprocessor = GithubBlogpostPreprocessor()
cleaned_doc_content = preprocessor.get_text(blog_post_html)

display_md(cleaned_doc_content[0:1000])

Complete the method `GithubBlogpostPreprocessor.get_text()` in `./workshop_code/indexer_components/preprocessors.py` such that the `cleaned_text` looks like the above output.

In [None]:
from workshop_code.indexer_components.loaders       import DocLoader
from workshop_code.indexer_components.preprocessors import GithubBlogpostPreprocessor

blog_post_uri = "https://lilianweng.github.io/posts/2023-06-23-agent/"
blog_post_html = DocLoader.load_html(blog_post_uri)
preprocessor = GithubBlogpostPreprocessor()
cleaned_doc_content = preprocessor.get_text(blog_post_html)

display_md(cleaned_doc_content[0:1000])

### More Realistic Beautiful Soup Task: Copy or Write a Preprocessor for the RAG Survey Paper
The HTML structure of the arXiv paper is more complex than that of the blog post. You can try implementing some of `ArxivHtmlPaperPreprocessor` to see for yourself, but I suggest just copying the cheat code. Writing this whole preprocessor will take the rest of the session.

Here is the working implementation:

In [None]:
from cheat_code.indexer_components.loaders       import DocLoader
from cheat_code.indexer_components.preprocessors import ArxivHtmlPaperPreprocessor

rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
rag_survey_paper_html = DocLoader.load_html(rag_survey_paper_uri)
preprocessor = ArxivHtmlPaperPreprocessor()
cleaned_doc_content = preprocessor.get_text(rag_survey_paper_html)

display_md(cleaned_doc_content[0:2000])

And, here is the implementation for you to complete:

In [None]:
from workshop_code.indexer_components.loaders       import DocLoader
from workshop_code.indexer_components.preprocessors import ArxivHtmlPaperPreprocessor

rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
rag_survey_paper_html = DocLoader.load_html(rag_survey_paper_uri)
preprocessor = ArxivHtmlPaperPreprocessor()
cleaned_doc_content = preprocessor.get_text(rag_survey_paper_html)

display_md(cleaned_doc_content[0:2000])

### Hard Task, Not Recommended for Today: Write a Preprocessor for the PDF Version of the RAG Survey Paper
In production applications, you're likely to need to run inference on PDFs. Today, this is often a non-trivial task. The most popular open-source solution is Tesseract, an OCR engine. However, Tesseract often underperforms computer-vision-based services from vendors like Google Cloud and AWS.

## text_splitters.py
Because LLM context windows are limited, semantic indexing strategies rely on text splitting. In this tutorial, we use the most naive strategy: character text splitting. For inspiration or source code for other strategies, I look at LlamaIndex and LangChain. However, in some production situations, it will make sense to write a text splitter specific to your needs.
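For intuition, here is a minimal sketch of character splitting with overlap. It is not the workshop's `SimpleCharacterTextSplitter`; the function name `split_by_characters` and its validation rules are my own assumptions. Each chunk starts `chunk_size - overlap_size` characters after the previous one, so neighbouring chunks share `overlap_size` characters.

```python
from typing import List


def split_by_characters(text: str, chunk_size: int, overlap_size: int) -> List[str]:
    """Cut `text` into fixed-size chunks whose ends overlap by `overlap_size`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in the range [0, chunk_size)")

    step = chunk_size - overlap_size
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, so we don't emit a trailing
        # chunk that only repeats the overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


alphabet_chunks = split_by_characters("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap_size=3)
print(alphabet_chunks)
# prints ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The weakness of this strategy is visible even here: chunk boundaries fall wherever the character count says, with no regard for words, sentences, or sections.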

### Text Splitting Task #1: Examine and Copy the Code for the Text Splitter
Here is the working implementation:

In [None]:
from cheat_code.indexer_components.loaders        import DocLoader
from cheat_code.indexer_components.preprocessors  import ArxivHtmlPaperPreprocessor
from cheat_code.indexer_components.text_splitters import SimpleCharacterTextSplitter

CHUNK_SIZE = 250
OVERLAP_SIZE = 25
rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
preprocessor = ArxivHtmlPaperPreprocessor()
text_splitter = SimpleCharacterTextSplitter(CHUNK_SIZE, OVERLAP_SIZE)

rag_survey_paper_html = DocLoader.load_html(rag_survey_paper_uri)
cleaned_doc_content = preprocessor.get_text(rag_survey_paper_html)
text_splits = text_splitter.split_text(cleaned_doc_content)

display_md(text_splits[3])

Copy the code from `cheat_code/indexer_components/text_splitters.py` to `workshop_code/indexer_components/text_splitters.py` so that the code below works:

In [None]:
from workshop_code.indexer_components.loaders        import DocLoader
from workshop_code.indexer_components.preprocessors  import ArxivHtmlPaperPreprocessor
from workshop_code.indexer_components.text_splitters import SimpleCharacterTextSplitter

CHUNK_SIZE = 250
OVERLAP_SIZE = 25
rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
preprocessor = ArxivHtmlPaperPreprocessor()
text_splitter = SimpleCharacterTextSplitter(CHUNK_SIZE, OVERLAP_SIZE)

rag_survey_paper_html = DocLoader.load_html(rag_survey_paper_uri)
cleaned_doc_content = preprocessor.get_text(rag_survey_paper_html)
text_splits = text_splitter.split_text(cleaned_doc_content)

display_md(text_splits[3])

### Text Splitting Task #2: Look at Alternative Text Splitters
Make a mental note of the other text splitters available here:
- [LangChain: Text Splitters](https://python.langchain.com/v0.2/docs/how_to/#text-splitters)
- [LlamaIndex: Text Splitters](https://medium.com/@bavalpreetsinghh/llamaindex-chunking-strategies-for-large-language-models-part-1-ded1218cfd30)

## Embeddings: vectorizers.py
Embedding text is currently the most common way to prepare human-readable text so that passages can be compared to each other for relatedness. At the time of writing, OpenAI's embedding models rank among the highest performing, so we use theirs.
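"Compared for relatedness" usually means a vector similarity measure. Cosine similarity is a common choice, though this notebook doesn't say which measure the pipeline uses, so treat that as an assumption. The sketch below uses made-up three-dimensional vectors, not real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (illustrative values only).
query_vector = [1.0, 0.0, 1.0]
related_vector = [2.0, 0.0, 2.0]    # same direction, different length
unrelated_vector = [0.0, 1.0, 0.0]  # orthogonal to the query

related_score = cosine_similarity(query_vector, related_vector)
unrelated_score = cosine_similarity(query_vector, unrelated_vector)
print(related_score > unrelated_score)  # prints True
```

Because the measure depends only on direction, a vector and a scaled copy of it score as identical.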

OpenAI's text embedding models take up to 8191 tokens as input and convert them to a vector whose default dimension is 1536 for `text-embedding-3-small` or 3072 for `text-embedding-3-large`.

### Embeddings Task #1: Use OpenAI's Embeddings API
The embeddings code for your naive RAG pipeline should behave like this:

In [None]:
from cheat_code.common_components.vectorizers import Vectorizer

example_text_splits = ["Mary had a", "little lamb"]
vectorizer = Vectorizer()
embeddings_of_example_splits = vectorizer.vectorize_text_splits(example_text_splits)

rows = len(embeddings_of_example_splits)
columns = len(embeddings_of_example_splits[0])

print(f"Dimensions: {rows}x{columns}")
# Show only the first 10 values of the first embedding, not every full vector
print(embeddings_of_example_splits[0][0:10])

Implement `vectorize_text_splits()` in `workshop_code/common_components/vectorizers.py` by referencing the [OpenAI embedding API's documentation](https://platform.openai.com/docs/api-reference/embeddings).
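If you get stuck, here is a hedged sketch of the core call, written as a standalone function rather than as the workshop's `Vectorizer` method. It assumes the v1+ `openai` Python client, where `client.embeddings.create(model=..., input=[...])` returns a response whose `data` items carry `index` and `embedding`. The default model name and the fake client in the demo are my own choices, and the fake client exists only so the example runs without an API key.

```python
from types import SimpleNamespace
from typing import List, Sequence


def vectorize_text_splits(
    client,
    text_splits: Sequence[str],
    model: str = "text-embedding-3-small",
) -> List[List[float]]:
    """Embed every split in one request and return one vector per split."""
    response = client.embeddings.create(model=model, input=list(text_splits))
    # Sort by index so the vectors line up with the order of the inputs.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# With the real service you would build the client like this:
#   from openai import OpenAI
#   client = OpenAI()  # reads the OPENAI_API_KEY environment variable
#
# Fake client (hypothetical) that mimics the response shape with 2-d vectors.
def _fake_create(model: str, input: List[str]) -> SimpleNamespace:
    items = [
        SimpleNamespace(index=i, embedding=[float(len(text)), 0.0])
        for i, text in enumerate(input)
    ]
    return SimpleNamespace(data=items)


fake_client = SimpleNamespace(embeddings=SimpleNamespace(create=_fake_create))
fake_embeddings = vectorize_text_splits(fake_client, ["Mary had a", "little lamb"])
print(f"Dimensions: {len(fake_embeddings)}x{len(fake_embeddings[0])}")  # prints Dimensions: 2x2
```

Sending all splits in one request keeps the number of API calls down. For a long document you would need to batch the splits, which this sketch does not do.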

In [None]:
from workshop_code.common_components.vectorizers import Vectorizer

example_text_splits = ["Mary had a", "little lamb"]
vectorizer = Vectorizer()
embeddings_of_example_splits = vectorizer.vectorize_text_splits(example_text_splits)

rows = len(embeddings_of_example_splits)
columns = len(embeddings_of_example_splits[0])

print(f"Dimensions: {rows}x{columns}")
# Show only the first 10 values of the first embedding, not every full vector
print(embeddings_of_example_splits[0][0:10])

## Vector Database: vectordb_client_adapters.py
In production settings, you're likely to store your vectors in a database. In this tutorial, we are using Couchbase.

### Vector DB Task #1: Understand the indexer code
Open `indexers.py` in `./workshop_code/`. Look over how the `CouchbaseClientAdapter` is used, and look at how its methods are implemented. If something doesn't make sense, ask a question.
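To see the adapter idea without a database, here is a toy in-memory stand-in and the wiring around it. Apart from `count_entries()`, which the indexer cells in this notebook call, every name here (`InMemoryClientAdapter`, `upsert`, `index_text_splits`, the `uri#position` ID scheme) is hypothetical and not taken from `CouchbaseClientAdapter`.

```python
from typing import Callable, Dict, List, Sequence


class InMemoryClientAdapter:
    """Toy stand-in for a vector database adapter; keeps entries in a dict."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}

    def upsert(self, entry_id: str, text: str, embedding: List[float]) -> None:
        # Writing under the same ID overwrites the entry instead of duplicating it.
        self._entries[entry_id] = {"text": text, "embedding": embedding}

    def count_entries(self) -> int:
        return len(self._entries)


def index_text_splits(
    doc_uri: str,
    text_splits: Sequence[str],
    vectorize: Callable[[Sequence[str]], List[List[float]]],
    adapter: InMemoryClientAdapter,
) -> int:
    """Embed the splits, store one entry per split, and return the entry count."""
    embeddings = vectorize(text_splits)
    for position, (text, embedding) in enumerate(zip(text_splits, embeddings)):
        adapter.upsert(f"{doc_uri}#{position}", text, embedding)
    return adapter.count_entries()


def toy_vectorize(splits: Sequence[str]) -> List[List[float]]:
    """Fake embedding (hypothetical): a 1-d vector holding the text length."""
    return [[float(len(split))] for split in splits]


adapter = InMemoryClientAdapter()
count_after_first_run = index_text_splits(
    "https://example.com/doc", ["Mary had a", "little lamb"], toy_vectorize, adapter
)
count_after_second_run = index_text_splits(
    "https://example.com/doc", ["Mary had a", "little lamb"], toy_vectorize, adapter
)
print(count_after_first_run, count_after_second_run)  # prints 2 2
```

Deriving the ID from the document URI and chunk position means re-running the indexer on the same document replaces entries rather than piling up duplicates. Whether the workshop's adapter behaves this way is something to check when you read its code.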

## The Complete Indexer: indexers.py
### Indexer Task: Test That Your Indexer Works
Your indexer should give output like the cheat_code version below:

In [None]:
import time
from cheat_code.indexers import NaiveIndexer
from cheat_code.common_components.vectorizers import Vectorizer

rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
vectorizer = Vectorizer()
indexer = NaiveIndexer(vectorizer)
indexer.index(rag_survey_paper_uri)
time.sleep(5)
num_db_entries = indexer._client_adapter.count_entries()
print(f"Number of text chunks in db: {num_db_entries}")

Run your indexer below to see if it works the same way. If it doesn't, something is broken.

In [None]:
import time
from workshop_code.indexers import NaiveIndexer
from workshop_code.common_components.vectorizers import Vectorizer

rag_survey_paper_uri = "https://arxiv.org/html/2312.10997v5"
vectorizer = Vectorizer()
indexer = NaiveIndexer(vectorizer)
indexer.index(rag_survey_paper_uri)
time.sleep(5)
num_db_entries = indexer._client_adapter.count_entries()
print(f"Number of text chunks in db: {num_db_entries}")