Update index.rst for changes in README #131

Merged 1 commit on Apr 7, 2023
69 changes: 40 additions & 29 deletions docs/index.rst
@@ -3,18 +3,23 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

GPTCache: A Library for Creating Semantic Cache for LLM Queries.
GPTCache: A Library for Creating Semantic Cache for LLM Queries
===========================================================================

Boost LLM API Speed by 100x ⚡, Slash Costs by 10x 💰

|
----

.. image:: https://img.shields.io/pypi/v/gptcache?label=Release&color
:width: 100
:alt: release
:target: https://pypi.org/project/gptcache/

.. image:: https://img.shields.io/pypi/dm/gptcache.svg?color=bright-green
:width: 100
:alt: pip_downloads
:target: https://pypi.org/project/gptcache/

.. image:: https://img.shields.io/badge/License-MIT-blue.svg
:width: 100
:alt: License
@@ -40,31 +45,8 @@ What is GPTCache?

ChatGPT and various large language models (LLMs) boast incredible versatility, enabling the development of a wide range of applications. However, as your application grows in popularity and encounters higher traffic levels, the expenses related to LLM API calls can become substantial. Additionally, LLM services might exhibit slow response times, especially when dealing with a significant number of requests.

To tackle this challenge, we have created GPTCache, a project dedicated to building a semantic cache for storing LLM responses. This library offers the following primary benefits:

- **Decreased expenses**: Most LLM services charge fees based on a combination of the number of requests and the `token count <https://openai.com/pricing>`_. By caching query results, GPTCache reduces both the number of requests and the number of tokens sent to the LLM service, which minimizes the overall cost of using the service.
- **Enhanced performance**: LLMs employ generative AI algorithms to generate responses in real-time, a process that can sometimes be time-consuming. However, when a similar query is cached, the response time significantly improves, as the result is fetched directly from the cache, eliminating the need to interact with the LLM service. In most situations, GPTCache can also provide superior query throughput compared to standard LLM services.
- **Improved scalability and availability**: LLM services frequently enforce `rate limits <https://platform.openai.com/docs/guides/rate-limits>`_, which are constraints that APIs place on the number of times a user or client can access the server within a given timeframe. Hitting a rate limit means that additional requests are blocked until a certain period has elapsed, leading to a service outage. With GPTCache, you can easily scale to accommodate an increasing volume of queries, ensuring consistent performance as your application's user base expands.
- **Flexible development environment**: When developing LLM applications, a connection to LLM APIs is required to prove concepts. GPTCache offers the same interface as LLM APIs and can store LLM-generated or mocked data, which lets you verify your application's features without connecting to the LLM APIs or even to the network.

How does it work?
--------------------

Online services often exhibit data locality, with users frequently accessing popular or trending content. Cache systems take advantage of this behavior by storing commonly accessed data, which in turn reduces data retrieval time, improves response times, and eases the burden on backend servers. Traditional cache systems typically utilize an exact match between a new query and a cached query to determine if the requested content is available in the cache before fetching the data.

However, using an exact match approach for LLM caches is less effective due to the complexity and variability of LLM queries, resulting in a low cache hit rate. To address this issue, GPTCache adopts alternative strategies like semantic caching. Semantic caching identifies and stores similar or related queries, thereby increasing cache hit probability and enhancing overall caching efficiency.

GPTCache employs embedding algorithms to convert queries into embeddings and uses a vector store for similarity search on these embeddings. This process allows GPTCache to identify and retrieve similar or related queries from the cache storage, as illustrated in the `Modules section <https://github.com/zilliztech/GPTCache#-modules>`_.

Featuring a modular design, GPTCache makes it easy for users to customize their own semantic cache. The system offers various implementations for each module, and users can even develop their own implementations to suit their specific needs.

In a semantic cache, false positives can occur during cache hits and false negatives during cache misses. GPTCache provides three metrics to evaluate its performance:

- Precision: the ratio of true positives to the total of true positives and false positives.
- Recall: the ratio of true positives to the total of true positives and false negatives.
- Latency: the time required for a query to be processed and the corresponding data to be fetched from the cache.
To tackle this challenge, we have created GPTCache, a project dedicated to building a semantic cache for storing LLM responses.

A `simple benchmark <https://github.com/zilliztech/gpt-cache/blob/main/examples/benchmark/benchmark_sqlite_faiss_onnx.py>`_ is included to help users start assessing the performance of their semantic cache.

Getting Started
--------------------
@@ -127,6 +109,35 @@ More Docs:
usage.md
feature.md

What can this help with?
-------------------------

GPTCache offers the following primary benefits:

- **Decreased expenses**: Most LLM services charge fees based on a combination of the number of requests and the `token count <https://openai.com/pricing>`_. By caching query results, GPTCache reduces both the number of requests and the number of tokens sent to the LLM service, which minimizes the overall cost of using the service.
- **Enhanced performance**: LLMs employ generative AI algorithms to generate responses in real-time, a process that can sometimes be time-consuming. However, when a similar query is cached, the response time significantly improves, as the result is fetched directly from the cache, eliminating the need to interact with the LLM service. In most situations, GPTCache can also provide superior query throughput compared to standard LLM services.
- **Improved scalability and availability**: LLM services frequently enforce `rate limits <https://platform.openai.com/docs/guides/rate-limits>`_, which are constraints that APIs place on the number of times a user or client can access the server within a given timeframe. Hitting a rate limit means that additional requests are blocked until a certain period has elapsed, leading to a service outage. With GPTCache, you can easily scale to accommodate an increasing volume of queries, ensuring consistent performance as your application's user base expands.
- **Flexible development environment**: When developing LLM applications, a connection to LLM APIs is required to prove concepts. GPTCache offers the same interface as LLM APIs and can store LLM-generated or mocked data, which lets you verify your application's features without connecting to the LLM APIs or even to the network, as the sketch after this list illustrates.
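
The snippet below is a minimal sketch of that drop-in usage pattern, following the example in the project README: the ``gptcache.adapter.openai`` module mirrors the ``openai`` client, so repeated or similar questions can be answered from the cache instead of the LLM service. Module and function names are taken from the README of this era and may differ in later releases.

.. code-block:: python

    # Minimal sketch of the drop-in usage pattern; names follow the GPTCache
    # README of this era and may differ between versions.
    from gptcache import cache
    from gptcache.adapter import openai  # mirrors the `openai` package interface

    cache.init()            # default cache configuration
    cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

    # Same call signature as openai.ChatCompletion.create; repeated questions
    # are served from the cache instead of a new LLM request.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "What is GPTCache?"}],
    )
    print(response["choices"][0]["message"]["content"])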

How does it work?
------------------

Online services often exhibit data locality, with users frequently accessing popular or trending content. Cache systems take advantage of this behavior by storing commonly accessed data, which in turn reduces data retrieval time, improves response times, and eases the burden on backend servers. Traditional cache systems typically utilize an exact match between a new query and a cached query to determine if the requested content is available in the cache before fetching the data.

However, using an exact match approach for LLM caches is less effective due to the complexity and variability of LLM queries, resulting in a low cache hit rate. To address this issue, GPTCache adopts alternative strategies like semantic caching. Semantic caching identifies and stores similar or related queries, thereby increasing cache hit probability and enhancing overall caching efficiency.

GPTCache employs embedding algorithms to convert queries into embeddings and uses a vector store for similarity search on these embeddings. This process allows GPTCache to identify and retrieve similar or related queries from the cache storage, as illustrated in the `Modules section <https://github.com/zilliztech/GPTCache#-modules>`_.
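
As a rough illustration of this lookup flow (not GPTCache's actual implementation), the sketch below embeds an incoming query, compares it against cached embeddings with cosine similarity, and returns the stored answer when the best match clears a threshold; the embedding function and the 0.9 threshold are placeholders.

.. code-block:: python

    # Illustrative semantic-cache lookup, not the GPTCache implementation.
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    class ToySemanticCache:
        def __init__(self, embed, threshold: float = 0.9):
            self.embed = embed          # any text -> np.ndarray embedding function
            self.threshold = threshold  # minimum similarity for a cache hit
            self.entries = []           # list of (embedding, cached answer) pairs

        def get(self, query: str):
            q = self.embed(query)
            scored = [(cosine(q, emb), answer) for emb, answer in self.entries]
            if scored:
                score, answer = max(scored, key=lambda pair: pair[0])
                if score >= self.threshold:
                    return answer  # cache hit on a similar query
            return None                # cache miss: caller falls back to the LLM

        def put(self, query: str, answer: str):
            self.entries.append((self.embed(query), answer))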

Featuring a modular design, GPTCache makes it easy for users to customize their own semantic cache. The system offers various implementations for each module, and users can even develop their own implementations to suit their specific needs.

In a semantic cache, false positives can occur during cache hits and false negatives during cache misses. GPTCache provides three metrics to evaluate its performance (a short worked example follows the list):

- Precision: the ratio of true positives to the total of true positives and false positives.
- Recall: the ratio of true positives to the total of true positives and false negatives.
- Latency: the time required for a query to be processed and the corresponding data to be fetched from the cache.
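
For example, precision and recall can be computed from a labeled set of cache decisions; the counts below are made up purely to show the arithmetic.

.. code-block:: python

    # Hypothetical evaluation counts, used only to illustrate the two formulas.
    true_positives = 80   # cache hits that returned the right answer
    false_positives = 20  # cache hits that returned an unrelated answer
    false_negatives = 10  # queries wrongly treated as cache misses

    precision = true_positives / (true_positives + false_positives)  # 0.80
    recall = true_positives / (true_positives + false_negatives)     # ~0.89
    print(f"precision={precision:.2f}, recall={recall:.2f}")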

A `sample benchmark <https://github.com/zilliztech/gpt-cache/blob/main/examples/benchmark/benchmark_sqlite_faiss_onnx.py>`_ is included to help users start assessing the performance of their semantic cache.



Modules
@@ -139,9 +150,9 @@ You can take a look at modules below to learn more about system design and architecture:
- `LLM Adapter <modules/llm_adapter.html>`_
- `Embedding Generator <modules/embedding_generator.html>`_
- `Cache Storage <modules/cache_storage.html>`_
- `Vector Store <modules/vector_store>`_
- `Cache Manager <modules/cache_manager>`_
- `Similarity Evaluator <modules/similarity_evaluator>`_
- `Vector Store <modules/vector_store.html>`_
- `Cache Manager <modules/cache_manager.html>`_
- `Similarity Evaluator <modules/similarity_evaluator.html>`_
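
Taken together, these modules can be wired into a working cache. The sketch below follows the configuration example from the project README (an ONNX embedding generator, an SQLite cache store, a FAISS vector store, and a distance-based similarity evaluator); treat the exact module paths and parameter names as assumptions, since they may differ between GPTCache versions.

.. code-block:: python

    # Sketch of composing GPTCache modules, adapted from the README example;
    # module paths and parameter names are assumptions, not a stable API reference.
    from gptcache import cache
    from gptcache.embedding import Onnx
    from gptcache.manager import CacheBase, VectorBase, get_data_manager
    from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

    onnx = Onnx()  # embedding generator
    data_manager = get_data_manager(
        CacheBase("sqlite"),                            # cache storage
        VectorBase("faiss", dimension=onnx.dimension),  # vector store
    )
    cache.init(
        embedding_func=onnx.to_embeddings,
        data_manager=data_manager,
        similarity_evaluation=SearchDistanceEvaluation(),  # similarity evaluator
    )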

.. toctree::
:maxdepth: 2
9 changes: 0 additions & 9 deletions docs/references/adapter.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/references/gptcache.rst
@@ -3,7 +3,7 @@ GPTCache

.. contents:: Index

gptcache.cache
gptcache.Cache
-------------------------
.. automodule:: gptcache.Cache
:members:
14 changes: 0 additions & 14 deletions docs/references/processor.rst

This file was deleted.

7 changes: 0 additions & 7 deletions docs/references/utils.rst

This file was deleted.