# Doc2Cache w/ Llama3.1

This recipe demonstrates how to convert a PDF document into a set of pre-defined FAQs that can be used to populate an LLM [Semantic Cache](https://www.redisvl.com/user_guide/llmcache_03.html) using the Llama3.1 LLM.

## Motivation

As a framework, `Doc2Cache` solves 3 main problems faced by AI engineers optimizing RAG pipelines:

1. How do you get the benefits of semantic caching from day-1 without waiting for tons of production user traffic to accumulate?
2. How do you make sure that the semantic cache has valid/factual data in it?
3. How can you test the quality of a semantic cache without a bunch of "ground truth" (labeled) data?

## Architecture

`Doc2Cache` consists of an end-to-end workflow with a few stages:

- Smaller document chunks are extracted from knowledge base documents (PDFs)
- Each chunk is presented to the Llama3.1 LLM along with a specialized prompt to extract FAQs
- Generated FAQs are embedded using an embedding model
- FAQ embeddings are loaded into a Redis semantic cache instance

![doc2cache](../../assets/Doc2Cache.png)
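
The four stages can be outlined as plain Python functions. This is a toy sketch only: `split_into_chunks`, `extract_faqs_stub`, `embed_stub`, and the in-memory list that stands in for Redis are hypothetical placeholders, not the components used later in this recipe.

```python
from typing import Dict, List


def split_into_chunks(text: str, size: int) -> List[str]:
    """Stage 1 stand-in: cut the document into fixed-size pieces."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def extract_faqs_stub(chunk: str) -> List[Dict[str, str]]:
    """Stage 2 stand-in: a real run would prompt the LLM here."""
    return [{"question": f"What does this passage say? ({chunk[:20]}...)", "answer": chunk}]


def embed_stub(text: str) -> List[float]:
    """Stage 3 stand-in: a real run would call an embedding model here."""
    return [float(len(text)), float(sum(map(ord, text)) % 997)]


def doc_to_cache(text: str, size: int = 80) -> List[Dict]:
    """Stage 4 stand-in: collect vector/question/answer records in a list."""
    store = []
    for chunk in split_into_chunks(text, size):
        for faq in extract_faqs_stub(chunk):
            store.append({"vector": embed_stub(faq["question"]), **faq})
    return store


records = doc_to_cache("Amazon serves consumers, sellers, developers, and enterprises. " * 3)
print(len(records), "records ready to load")
```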

## 1. Setup
Before we begin, we must install some required libraries, initialize the LLM instance, create a Redis database, and initialize other required components.


### Install required libraries

In [None]:
# NBVAL_SKIP
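# Quote version specifiers and extras so the shell does not treat ">" as a redirect
!pip install "redisvl>=0.3.3" "unstructured[pdf]" sentence-transformers openai
!pip install langchain-core langchain-community langchain-text-splitters pypdf rapidocr-onnxruntime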

In [5]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Init Llama3.1 model with vLLM

In [2]:
# Set key variables necessary for downloading model weights

import os

HUGGING_FACE_HUB_TOKEN = os.getenv("HUGGING_FACE_HUB_TOKEN")
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"

Run the Llama3.1 model using the vLLM Docker container:

In [3]:
!docker run -d \
  --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
  -p 8000:8000 --ipc=host vllm/vllm-openai:latest \
  --model $MODEL_NAME \
  --gpu-memory-utilization 0.95 \
  --max-model-len 36640

89f939208d2851e56000957da56813a1d641bf5dd3197291c15f79d3bf51dc56
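
Because the container starts in the background (`-d`), the model may still be downloading or loading when the next cell runs. A minimal sketch for waiting until the server answers, assuming it exposes the OpenAI-compatible `/v1/models` route on the port mapped above; `http_ok` and `wait_for_server` are hypothetical helper names, not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(url: str, attempts: int = 60, delay: float = 10.0,
                    probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll `url` until `probe` succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False


# Usage (blocks until the server responds or roughly 10 minutes pass):
# wait_for_server("http://localhost:8000/v1/models")
```

The `probe` argument is injectable so the polling logic can be exercised without a running server.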


### Connect to LLM and vectorizer instances

In [6]:
from langchain_community.llms import VLLMOpenAI

# Create LLM instance
llama = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name=MODEL_NAME,
    temperature=0.1
)

In [15]:
from redisvl.utils.vectorize import HFTextVectorizer
from sentence_transformers import SentenceTransformer

# Ensure the tmp cache directory exists
os.makedirs('/tmp/huggingface', exist_ok=True)

class Vectorizer(HFTextVectorizer):
    def _initialize_client(self, model: str):
        """Setup the HuggingFace client"""
        # Dynamic import of the sentence-transformers module
        try:
            from sentence_transformers import SentenceTransformer
        except ImportError:
            raise ImportError(
                "HFTextVectorizer requires the sentence-transformers library. "
                "Please install with `pip install sentence-transformers`"
            )

        self._client = SentenceTransformer(model, cache_folder='/tmp/huggingface/transformers')


vectorizer = Vectorizer("sentence-transformers/all-mpnet-base-v2")

#### Run Redis locally
If you have a Redis db running elsewhere with [Redis Stack](https://redis.io/docs/about/about-stack/) installed, you don't need to run it on this machine. You can skip to the "Connect to Redis server" step.

In [10]:
!docker run -d --name my-redis-stack -p 6379:6379 redis/redis-stack-server:latest

Unable to find image 'redis/redis-stack-server:latest' locally
latest: Pulling from redis/redis-stack-server

Digest: sha256:887cf87cc744e4588ccade336d0dbb943e4e46330f738653ccb3a7a55df2f186
Status: Downloaded newer image for redis/redis-stack-server:latest
6ff8add913c50902aca6df15b28a53935eebbaf12a3ad3

## 2. Implement Doc2Cache workflow


In [16]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [17]:
doc_path = "../RAG/resources/amzn-10k-2023.pdf"

# set up the file loader/extractor and text splitter to create chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3200, chunk_overlap=50
)
loader = PyPDFLoader(doc_path, extract_images=True)

# extract, load, and make chunks
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc_path)

Done preprocessing. Created 123 chunks of the original pdf ../RAG/resources/amzn-10k-2023.pdf
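
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also prefers natural separators such as paragraph and line breaks, so its boundaries will differ); `sliding_chunks` is a hypothetical helper that only illustrates what the two numbers mean.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)
```

Each window repeats the last character of the previous one, which is the role the 50-character overlap plays in the configuration above.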


In [None]:
for chunk in chunks:
  print("############", "\n", chunk.page_content)

#### Extract FAQs with Llama3.1

First, we define a chain that prompts the LLM to extract FAQs from each chunk as a JSON object.

In [40]:
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel, Field
from typing import List


class QuestionAnswer(BaseModel):
    question: str = Field(description="Frequently asked question about information in the document.")
    answer: str = Field(description="Factual answer from the LLM related to the user question.")

class FAQs(BaseModel):
    faqs: List[QuestionAnswer] = Field(description="List of question/answer pairs extracted from the document")


# Set up a parser + inject instructions into the prompt template.
json_parser = JsonOutputParser(pydantic_object=FAQs)

In [41]:
prompt = PromptTemplate(
    template="""
    You are a document intelligence tool used to extract FAQs
    from portions of Amazon's 10-K SEC filing.

    For each small chunk from the doc, your task is to extract
    possible frequently asked questions derived straight from the content.
    Put yourself in the shoes of a potential human reader and anticipate what
    real-world questions they might have.
    
    You must reply with only a JSON object that captures the structured output
    according to the following string schema. No exceptions:
    
    {format_instructions}

    Document Chunk:\n{doc}\n
    
    FAQs JSON: 
    """,
    input_variables=["doc"],
    partial_variables={"format_instructions": json_parser.get_format_instructions()},
)

doc2cache = prompt | llama | json_parser

Let's test this out with a sample passage from Wikipedia first.

In [42]:
sample_doc = """Obi-Wan Kenobi (Ewan McGregor) is a young apprentice Jedi knight
under the tutelage of Qui-Gon Jinn (Liam Neeson) ; Anakin Skywalker (Jake Lloyd),
who will later father Luke Skywalker and become known as Darth Vader, is just
a 9-year-old boy. When the Trade Federation cuts off all routes to the planet
Naboo, Qui-Gon and Obi-Wan are assigned to settle the matter."""

In [43]:
faqs = doc2cache.invoke({"doc": sample_doc})

In [44]:
faqs

{'faqs': [{'question': 'Who is Obi-Wan Kenobi?',
   'answer': 'Obi-Wan Kenobi is a young apprentice Jedi knight.'},
  {'question': 'Who is Qui-Gon Jinn?',
   'answer': 'Qui-Gon Jinn is a Jedi knight.'},
  {'question': 'Who is Anakin Skywalker?',
   'answer': 'Anakin Skywalker is a 9-year-old boy who will later father Luke Skywalker and become known as Darth Vader.'}]}

Now we can apply this same logic to the chunks from our PDF document.

In [45]:
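def extract_faqs(chunks):
    """Run the doc2cache chain over every chunk and collect the FAQ pairs."""
    all_faqs = []
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1} of {len(chunks)}", flush=True)
        try:
            results = doc2cache.invoke({"doc": chunk.page_content})
        except Exception as e:
            # Skip this chunk; otherwise `results` would be undefined or
            # would still hold the previous chunk's FAQs.
            print("..Skipping chunk due to error decoding LLM response", str(e))
            continue
        if results and results.get("faqs"):
            all_faqs.extend(results["faqs"])
    return all_faqs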

In [46]:
faqs = extract_faqs(chunks)

Processing chunk 1 of 123
Processing chunk 2 of 123
Processing chunk 3 of 123
Processing chunk 4 of 123
Processing chunk 5 of 123
Processing chunk 6 of 123
Processing chunk 7 of 123
Processing chunk 8 of 123
Processing chunk 9 of 123
Processing chunk 10 of 123
..Skipping chunk due to error decoding LLM response Invalid json output: {"faqs": [
        {"question": "What are the challenges faced by Amazon in the competitive market?", "answer": "Amazon faces challenges such as competitors entering into business combinations or alliances, established companies expanding to new market segments, and new technologies increasing competition."},
        {"question": "What are the risks associated with Amazon's expansion into new products and services?", "answer": "Amazon's expansion into new products and services is subject to risks such as limited experience in new market segments, customer adoption, service disruptions, delays, setbacks, or failures or quality issues, and potential write-down
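
The response above stops in the middle of an answer, which is consistent with the completion being cut off before the JSON was closed. Instead of discarding the whole chunk, one option is to decode the array item by item and keep the pairs that arrived intact. The sketch below uses only the standard library; `salvage_faqs` is a hypothetical helper and not part of LangChain.

```python
import json
from typing import Dict, List


def salvage_faqs(raw: str) -> List[Dict[str, str]]:
    """Recover complete question/answer objects from possibly truncated JSON."""
    decoder = json.JSONDecoder()
    start = raw.find("[")  # opening bracket of the "faqs" array
    if start == -1:
        return []
    pos = start + 1
    recovered = []
    while pos < len(raw):
        # Skip whitespace and commas between array items.
        while pos < len(raw) and raw[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw) or raw[pos] != "{":
            break
        try:
            obj, pos = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            break  # the remaining text is an incomplete object
        if isinstance(obj, dict) and "question" in obj and "answer" in obj:
            recovered.append(obj)
    return recovered


truncated = (
    '{"faqs": [\n'
    '  {"question": "Q1?", "answer": "A1."},\n'
    '  {"question": "Q2?", "answer": "A2 is cut'
)
print(salvage_faqs(truncated))
```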

In [47]:
print("Generated", len(faqs), "frequently asked questions.")

Generated 534 frequently asked questions.


In [65]:
faqs[50:55]

[{'question': "What factors can cause demand for Amazon's products and services to fluctuate?",
  'answer': 'Seasonality, promotions, product launches, unforeseeable events such as recessionary fears, natural or human-caused disasters, extreme weather, or geopolitical events.'},
 {'question': "What are the potential consequences of Amazon's failure to stock or restock popular products?",
  'answer': 'Significant affect on revenue and future growth.'},
 {'question': 'What are the potential consequences of Amazon overstocking products?',
  'answer': 'Significant inventory markdowns or write-offs and commitment costs, which could materially reduce profitability.'},
 {'question': "What are the potential consequences of Amazon's websites experiencing system interruptions?",
  'answer': 'Reduced volume of goods offered or sold and the attractiveness of products and services.'},
 {}]
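
The slice above ends with an empty dict, so not every entry returned by the LLM matches the question/answer schema. A small cleaning pass before indexing is one way to handle this. In the sketch below, `clean_faqs` is a hypothetical helper, and dropping repeated questions (compared case-insensitively) is an extra assumption that the original workflow does not make.

```python
from typing import Dict, List


def clean_faqs(faqs: List[Dict]) -> List[Dict[str, str]]:
    """Keep entries with non-empty question and answer strings and
    drop questions that were already seen."""
    seen = set()
    cleaned = []
    for faq in faqs:
        if not isinstance(faq, dict):
            continue
        question = faq.get("question")
        answer = faq.get("answer")
        if not (isinstance(question, str) and isinstance(answer, str)):
            continue
        question, answer = question.strip(), answer.strip()
        if not question or not answer:
            continue
        key = question.lower()
        if key in seen:
            continue  # duplicate question
        seen.add(key)
        cleaned.append({"question": question, "answer": answer})
    return cleaned


sample = [
    {"question": "What is AWS?", "answer": "Amazon's cloud segment."},
    {},
    {"question": "what is aws?", "answer": "Duplicate question."},
    {"question": "Missing answer"},
]
print(clean_faqs(sample))
```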

#### Index FAQs into Redis

Now we will embed each FAQ question (the cache prompt) and load the question/answer pairs into Redis for our semantic cache.

In [59]:
from redisvl.extensions.llmcache import SemanticCache


def to_semantic_cache(faqs: list) -> SemanticCache:
    """Convert list of FAQs into a semantic cache instance."""
    cache = SemanticCache(
        name="amzn_10k_cache",
        redis_url="redis://localhost:6379", # point to your own Redis URL if necessary
        vectorizer=vectorizer,
        distance_threshold=0.2
    )
    for i, faq in enumerate(faqs):
        print(i)
        if faq and "question" in faq and "answer" in faq:
            cache.store(
                prompt=faq["question"],
                response=faq["answer"]
            )
    return cache

In [60]:
# load doc2cache outputs into Redis semantic cache
cache = to_semantic_cache(faqs)

18:22:34 redisvl.index.index INFO   Index already exists, not overwriting.
0
1
2
...
258

## 3. Test the semantic cache

In [61]:
cache.check("How many employees work at Amazon?")

[{'entry_id': 'ac61cfeee88b0468599ec8f79dc54b54a76defd4476a1b388c41c207d7b4e749',
  'prompt': 'How many employees does Amazon have?',
  'response': 'As of December 31, 2022, Amazon employed approximately 1,541,000 full-time and part-time employees.',
  'vector_distance': 0.0474983453751,
  'inserted_at': 1727288555.11,
  'updated_at': 1727288555.11,
  'key': 'amzn_10k_cache:ac61cfeee88b0468599ec8f79dc54b54a76defd4476a1b388c41c207d7b4e749'}]
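
The `vector_distance` in this result (about 0.047) is well below the `distance_threshold=0.2` set when the cache was created, which is why the entry is returned. The sketch below shows the arithmetic in pure Python, assuming the cache index compares embeddings by cosine distance (1 minus cosine similarity); `cosine_distance` and `is_cache_hit` are hypothetical helpers, not RedisVL functions.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def is_cache_hit(distance: float, threshold: float = 0.2) -> bool:
    """Treat a lookup as a hit when the distance is within the threshold."""
    return distance <= threshold


print(is_cache_hit(0.0474983453751))     # distance reported in the result above
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

A lower threshold makes the cache stricter (fewer but closer matches), while a higher one returns more hits at the risk of answering a different question.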

In [67]:
cache.check("What are Amazon's business principles?")

[{'entry_id': '969e0aa725337085711033c81202da3a0287ec2b736b17a026f0df79592bbbd0',
  'prompt': "What is Amazon's business principle?",
  'response': "Amazon's business principle is customer obsession rather than competitor focus, passion for invention, commitment to operational excellence, and long-term thinking.",
  'vector_distance': 0.0554879903793,
  'inserted_at': 1727288554.92,
  'updated_at': 1727288554.92,
  'key': 'amzn_10k_cache:969e0aa725337085711033c81202da3a0287ec2b736b17a026f0df79592bbbd0'}]
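
In an application, these lookups would typically sit in front of the LLM: answer from the cache on a hit and fall back to the model on a miss. A minimal sketch of that pattern, relying only on the `check` and `store` calls used in this recipe; `answer_with_cache`, `ExactMatchCache`, and `fake_llm` are hypothetical names, and the in-memory cache exists only so the example runs without Redis.

```python
from typing import Any, Callable, List


def answer_with_cache(question: str, cache: Any, llm_call: Callable[[str], str],
                      store_on_miss: bool = False) -> str:
    """Return the cached response when there is a hit, else call the LLM."""
    hits = cache.check(question)
    if hits:
        return hits[0]["response"]
    response = llm_call(question)
    if store_on_miss:
        # Off by default: unvetted LLM answers would dilute the curated FAQ entries.
        cache.store(prompt=question, response=response)
    return response


class ExactMatchCache:
    """Hypothetical in-memory stand-in so the sketch runs without Redis."""

    def __init__(self) -> None:
        self._data = {}

    def check(self, prompt: str) -> List[dict]:
        if prompt in self._data:
            return [{"prompt": prompt, "response": self._data[prompt]}]
        return []

    def store(self, prompt: str, response: str) -> None:
        self._data[prompt] = response


demo_cache = ExactMatchCache()
demo_cache.store(prompt="How many employees does Amazon have?", response="cached answer")

llm_calls = []

def fake_llm(question: str) -> str:
    llm_calls.append(question)
    return "fresh answer"

hit = answer_with_cache("How many employees does Amazon have?", demo_cache, fake_llm)
miss = answer_with_cache("What is AWS?", demo_cache, fake_llm)
print(hit, "|", miss, "|", len(llm_calls), "LLM call(s)")
```

With the real `SemanticCache`, `cache` would be the instance created above and `llm_call` could wrap `llama.invoke`.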

## Cleanup

In [68]:
cache.delete()