# Caching

- Author: [Joseph](https://github.com/XaviereKU)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)


## Overview

`LangChain` provides an optional caching layer for LLMs.

This is useful for two reasons:
- When requesting the same completions multiple times, it can **reduce the number of API calls** to the LLM provider and thus save costs.
- By **reducing the number of API calls** to the LLM provider, it can **improve the running time of the application.**

In this tutorial, we use OpenAI's `gpt-4o-mini` model through the API and two kinds of cache, **InMemoryCache** and **SQLite Cache**.  
At the end of each section, we compare wall times before and after caching.

Optionally, we will also use a local LLM served with vLLM.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [InMemoryCache](#inmemorycache)
- [SQLite Cache](#sqlite-cache)
- [(Optional) With local model](#optional-with-local-model)
- [(Optional) InMemoryCache + Local LLM](#optional-inmemorycache--local-llm)
- [(Optional) SQLite Cache + Local LLM](#optional-sqlite-cache--local-llm)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup tools, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
!pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain-anthropic",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
        # "vllm", # this is for optional section
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "Your API KEY",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Caching",
    }
)

Environment variables have been set successfully.


Alternatively, you can load environment variables from a `.env` file with `load_dotenv`.

In [3]:
from dotenv import load_dotenv


load_dotenv()

True
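
Whichever method you use, it can help to confirm that the required variables are actually set before creating the model. The following is a minimal sketch using only the standard library. `missing_env_vars` is a hypothetical helper, and the example mapping contains placeholder values rather than real keys.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Example with an explicit mapping so the check is easy to follow.
# The values are placeholders, not real credentials.
example_env = {"OPENAI_API_KEY": "sk-example", "LANGCHAIN_API_KEY": ""}
print(missing_env_vars(["OPENAI_API_KEY", "LANGCHAIN_API_KEY"], example_env))
```

Calling `missing_env_vars(["OPENAI_API_KEY"])` without the second argument checks the real process environment.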

In [4]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

# Create model
llm = ChatOpenAI(model_name="gpt-4o-mini")

# Generate prompt
prompt = PromptTemplate.from_template(
    "Summarize {country} in about 200 characters"
)

# Create chain
chain = prompt | llm

In [5]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)

South Korea is a vibrant nation known for its technological advancements, rich culture, and dynamic economy. It boasts K-pop, delicious cuisine, and historic sites, with Seoul as its bustling capital.
CPU times: user 17.1 ms, sys: 3.51 ms, total: 20.6 ms
Wall time: 2.64 s
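
The `%%time` magic used above is only available in IPython and Jupyter. If you run the tutorial as a plain Python script, a small context manager built on `time.perf_counter` can take a comparable wall-time measurement. This is a sketch under that assumption. `wall_time` is a hypothetical helper, and the `time.sleep` call stands in for `chain.invoke(...)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_time(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label}: {results[label] * 1000:.2f} ms")


timings = {}
with wall_time("sleep 20 ms", timings):
    time.sleep(0.02)  # stand-in for chain.invoke({"country": "South Korea"})
```

Storing the durations in a dictionary makes it easy to compare the uncached and cached runs afterwards.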


## InMemoryCache
First, we cache the answer to a question using `InMemoryCache`.

In [12]:
from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

# Set InMemoryCache
set_llm_cache(InMemoryCache())

In [13]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)

South Korea is a vibrant East Asian nation known for its rich history, advanced technology, K-pop culture, and delicious cuisine. It has a strong economy and is a global leader in innovation and education.
CPU times: user 9.66 ms, sys: 2.33 ms, total: 12 ms
Wall time: 1.32 s


Now we invoke the chain with the same question.

In [14]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)

South Korea is a vibrant East Asian nation known for its rich history, advanced technology, K-pop culture, and delicious cuisine. It has a strong economy and is a global leader in innovation and education.
CPU times: user 510 μs, sys: 32 μs, total: 542 μs
Wall time: 524 μs
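
To put the two measurements side by side, the short calculation below computes the speedup from the wall times recorded in the two runs above. These are single measurements from one machine, so treat the result as an order of magnitude rather than a benchmark.

```python
# Wall times copied from the two runs above.
uncached_seconds = 1.32    # first call, answered by the API
cached_seconds = 524e-6    # second call, answered from InMemoryCache

speedup = uncached_seconds / cached_seconds
print(f"cached call is roughly {speedup:.0f}x faster")
```

The cached call is on the order of a few thousand times faster, because it skips the network round trip entirely.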


Note that if we set `InMemoryCache` again, the new instance starts empty, so the cached answer is lost and the wall time increases.

In [15]:
set_llm_cache(InMemoryCache())

In [16]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)

South Korea, located on the Korean Peninsula, is known for its rich culture, advanced technology, and vibrant economy. Major cities include Seoul and Busan, famous for K-pop, cuisine, and historical sites.
CPU times: user 5.37 ms, sys: 1.81 ms, total: 7.18 ms
Wall time: 1.3 s
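
The cache is lost here because each `InMemoryCache()` call creates a new, empty object. The toy sketch below illustrates the difference between installing a fresh instance and reinstalling one you kept a reference to. It assumes that setting the cache simply swaps the active cache object. `ToyCache`, `set_cache`, and `get_cache` are hypothetical stand-ins, not LangChain APIs.

```python
class ToyCache:
    """Toy stand-in for an in-memory cache: entries live in a plain dict."""

    def __init__(self):
        self._store = {}

    def lookup(self, key):
        return self._store.get(key)

    def update(self, key, value):
        self._store[key] = value


_active_cache = None


def set_cache(cache):
    """Hypothetical analogue of set_llm_cache: swap the active cache object."""
    global _active_cache
    _active_cache = cache


def get_cache():
    return _active_cache


shared = ToyCache()
set_cache(shared)
get_cache().update("South Korea", "cached summary")

set_cache(ToyCache())  # fresh instance: its store is empty
after_new_instance = get_cache().lookup("South Korea")

set_cache(shared)  # same instance as before: entries are still there
after_reuse = get_cache().lookup("South Korea")

print(after_new_instance, after_reuse)
```

If you need to toggle caching during a session without losing entries, keeping a reference to the original cache object is one option.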


## SQLite Cache
Now, we cache the answer to the same question using `SQLiteCache`.

In [17]:
from langchain_community.cache import SQLiteCache
from langchain_core.globals import set_llm_cache
import os

# Create cache directory (no error if it already exists)
os.makedirs("cache", exist_ok=True)

# Set SQLiteCache
set_llm_cache(SQLiteCache(database_path="cache/llm_cache.db"))

In [18]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)

South Korea, located on the Korean Peninsula, is a vibrant nation known for its rich culture, advanced technology, and global influence in entertainment, particularly K-pop and cinema. Its capital is Seoul.
CPU times: user 12.5 ms, sys: 3 ms, total: 15.5 ms
Wall time: 1.35 s


Now we invoke the chain with the same question.

In [19]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)

South Korea, located on the Korean Peninsula, is a vibrant nation known for its rich culture, advanced technology, and global influence in entertainment, particularly K-pop and cinema. Its capital is Seoul.
CPU times: user 40.1 ms, sys: 24.7 ms, total: 64.8 ms
Wall time: 63.4 ms


Note that with SQLite Cache, setting the cache again does not delete the stored entries, because they persist in the database file.

In [20]:
set_llm_cache(SQLiteCache(database_path="cache/llm_cache.db"))

In [21]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response.content)

South Korea, located on the Korean Peninsula, is a vibrant nation known for its rich culture, advanced technology, and global influence in entertainment, particularly K-pop and cinema. Its capital is Seoul.
CPU times: user 3.61 ms, sys: 1.39 ms, total: 5 ms
Wall time: 3.86 ms
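
The reason the entries survive is that they are stored in a file on disk rather than in a Python object. The sketch below shows the same effect with the standard `sqlite3` module. The table name and schema are hypothetical and chosen for illustration; they are not the schema that `SQLiteCache` uses internally.

```python
import os
import sqlite3
import tempfile


def open_cache(path):
    """Open (or create) a toy cache database at `path`."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS toy_cache ("
        "prompt TEXT, llm TEXT, response TEXT, PRIMARY KEY (prompt, llm))"
    )
    return conn


def update(conn, prompt, llm, response):
    conn.execute(
        "INSERT OR REPLACE INTO toy_cache VALUES (?, ?, ?)",
        (prompt, llm, response),
    )
    conn.commit()


def lookup(conn, prompt, llm):
    row = conn.execute(
        "SELECT response FROM toy_cache WHERE prompt = ? AND llm = ?",
        (prompt, llm),
    ).fetchone()
    return row[0] if row else None


db_path = os.path.join(tempfile.mkdtemp(), "toy_cache.db")

# First connection writes an entry, then closes.
first = open_cache(db_path)
update(first, "South Korea", "toy-model", "cached summary")
first.close()

# A brand-new connection to the same file still finds the entry.
second = open_cache(db_path)
persisted = lookup(second, "South Korea", "toy-model")
second.close()

print(persisted)
```

The same property means a file-backed cache also survives a kernel restart, which an in-memory cache does not.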


## (Optional) With local model
In this optional section, we use `docker` to serve a local LLM.
Note that Miniconda was used to set up the environment.

### Device & Serving information
- CPU : AMD 5600X
- OS : Windows 10 Pro
- RAM : 32 GB
- GPU : NVIDIA 3080Ti, 12GB VRAM
- CUDA : 12.6
- Driver Version : 560.94
- docker image : nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
- model : Qwen/Qwen2.5-0.5B-Instruct
- Python version : 3.10
- docker run script :
    ```
    docker run -itd --name vllm --gpus all --entrypoint /bin/bash -p 6001:8888 nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04
    ```
- vllm serving script : 
    ```
    python3 -m vllm.entrypoints.openai.api_server --model='Qwen/Qwen2.5-0.5B-Instruct' --served-model-name 'qwen-2.5' --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.80 --max-model-len 4096 --swap-space 1 --dtype bfloat16 --tensor-parallel-size 1 
    ```
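
Before building the chain, you may want to confirm that the server is reachable from the host. The sketch below queries the model listing route of the OpenAI-compatible API through the mapped port 6001. It assumes the server exposes `/v1/models` and returns the usual OpenAI-style JSON with a `data` list. `models_url` and `list_served_models` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request


def models_url(api_base: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return api_base.rstrip("/") + "/models"


def list_served_models(api_base: str, timeout: float = 3.0):
    """Return the served model ids, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    # Assumes an OpenAI-style payload: {"data": [{"id": ...}, ...]}
    return [entry["id"] for entry in payload.get("data", [])]


# Uncomment once the container and the vLLM server are running:
# print(list_served_models("http://localhost:6001/v1"))
```

If the call returns `None`, check that the container is running and that the port mapping matches the `docker run` command above.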

In [18]:
from langchain_community.llms import VLLMOpenAI

# create model using OpenAI compatible class VLLMOpenAI
llm = VLLMOpenAI(
    model="qwen-2.5", openai_api_key="EMPTY", openai_api_base="http://localhost:6001/v1"
)

# Generate prompt
prompt = PromptTemplate.from_template(
    "Summarize {country} in about 200 characters"
)

# Create chain
chain = prompt | llm

## (Optional) InMemoryCache + Local LLM
As in the InMemoryCache section above, we set `InMemoryCache`.

In [22]:
from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache

# Set InMemoryCache
set_llm_cache(InMemoryCache())

Invoke the chain with the local LLM. Note that we print **response**, not **response.content**.

In [23]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response)

content='South Korea, located on the Korean Peninsula, is known for its vibrant culture, advanced technology, and strong economy. Major cities include Seoul, Busan, and Incheon, with influences from K-pop and traditional heritage.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 45, 'prompt_tokens': 19, 'total_tokens': 64, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_d02d531b47', 'finish_reason': 'stop', 'logprobs': None} id='run-95a28c46-5a1b-49ea-92ad-7e77809c452a-0' usage_metadata={'input_tokens': 19, 'output_tokens': 45, 'total_tokens': 64, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}
CPU times: user 9.55 ms, sys: 3.33 ms, total: 12.9 ms
Wall time: 1.4 s

Now we invoke the chain again with the same question.

In [24]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response)

content='South Korea, located on the Korean Peninsula, is known for its vibrant culture, advanced technology, and strong economy. Major cities include Seoul, Busan, and Incheon, with influences from K-pop and traditional heritage.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 45, 'prompt_tokens': 19, 'total_tokens': 64, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_d02d531b47', 'finish_reason': 'stop', 'logprobs': None} id='run-95a28c46-5a1b-49ea-92ad-7e77809c452a-0' usage_metadata={'input_tokens': 19, 'output_tokens': 45, 'total_tokens': 64, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}
CPU times: user 948 μs, sys: 88 μs, total: 1.04 ms
Wall time: 1.01 ms
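
Chat models such as `ChatOpenAI` return a message object whose text lives in `.content`, whereas completion-style LLM classes return a plain string. If the same printing code should work with either kind of model, a small helper can normalize the two. The sketch below is one way to do that. `response_text` and `ToyMessage` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyMessage:
    """Hypothetical stand-in for a chat-model message with a `content` field."""

    content: str


def response_text(response) -> str:
    """Return plain text from either a string or a message-like object."""
    if isinstance(response, str):
        return response
    return response.content


print(response_text("plain completion"))
print(response_text(ToyMessage(content="chat completion")))
```

With such a helper, `print(response_text(response))` works regardless of which model the chain was built with.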


## (Optional) SQLite Cache + Local LLM
As in the SQLite Cache section above, we set SQLite Cache.  
Note that we name the database **vllm_cache.db** to distinguish it from the cache used in the SQLite Cache section.

In [25]:
from langchain_community.cache import SQLiteCache
from langchain_core.globals import set_llm_cache
import os

# Create cache directory (no error if it already exists)
os.makedirs("cache", exist_ok=True)

# Set SQLiteCache
set_llm_cache(SQLiteCache(database_path="cache/vllm_cache.db"))

Invoke the chain with the local LLM. Again, note that we print **response**, not **response.content**.

In [26]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response)

content="South Korea, located in East Asia, is known for its vibrant culture, advanced technology, and economic strength. It's famous for K-pop, Korean cuisine, and historical sites, blending tradition with modernity." additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 42, 'prompt_tokens': 19, 'total_tokens': 61, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_0aa8d3e20b', 'finish_reason': 'stop', 'logprobs': None} id='run-2659da82-c36a-48b8-be70-3f163ef2312a-0' usage_metadata={'input_tokens': 19, 'output_tokens': 42, 'total_tokens': 61, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}
CPU times: user 13.9 ms, sys: 4.15 ms, total: 18 ms
Wall time: 1.74 s


Now we invoke the chain again with the same question.

In [27]:
%%time
# Invoke chain
response = chain.invoke({"country": "South Korea"})
print(response)

content="South Korea, located in East Asia, is known for its vibrant culture, advanced technology, and economic strength. It's famous for K-pop, Korean cuisine, and historical sites, blending tradition with modernity." additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 42, 'prompt_tokens': 19, 'total_tokens': 61, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_0aa8d3e20b', 'finish_reason': 'stop', 'logprobs': None} id='run-2659da82-c36a-48b8-be70-3f163ef2312a-0' usage_metadata={'input_tokens': 19, 'output_tokens': 42, 'total_tokens': 61, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}
CPU times: user 2.21 ms, sys: 800 μs, total: 3.01 ms
Wall time: 2.29 ms
