## App Quickstart

1. Install Python 3.10 if not already installed.

2. Clone the `chatgpt-retrieval-plugin` repository:

```
git clone git@github.com:openai/chatgpt-retrieval-plugin.git
```

_**Note**: To see how we setup the *hosted app* on DigitalOcean [refer to this video](https://youtu.be/hpePPqKxNq8), otherwise continue to setup the app locally by following the remaining steps._

3. Navigate to the app directory:

```
cd /path/to/chatgpt-retrieval-plugin
```

4. Install `poetry`:

```
pip install poetry
```

5. Create a new virtual environment:

```
poetry env use python3.10
```

6. Install the `retrieval-app` dependencies:

```
poetry install
```

7. Set app environment variables:

* `BEARER_TOKEN`: Secret token used by the app to authorize incoming requests. We will later include this in the request `headers`. The token can be generated however you prefer, such as using [jwt.io](https://jwt.io/).

* `OPENAI_API_KEY`: The OpenAI API key used for generating embeddings with the `text-embedding-ada-002` model. [Get an API key here](https://platform.openai.com/account/api-keys)!

8. Set Pinecone-specific environment variables:

* `DATASTORE`: set to `pinecone`.

* `PINECONE_API_KEY`: Set to your Pinecone API key. This requires a free Pinecone account and can be [found in the Pinecone console](https://app.pinecone.io/).

* `PINECONE_ENVIRONMENT`: Set to your Pinecone environment, looks like `us-east1-gcp`, `us-west1-aws`, and can be found next to your API key in the [Pinecone console](https://app.pinecone.io/).

* `PINECONE_INDEX`: Set this to your chosen index name. The name you choose is your choice, we just recommend setting it to something descriptive like `"openai-retrieval-app"`. *Note that index names are restricted to alphanumeric characters, `"-"`, and can contain a maximum of 45 characters.*

8. Run the app with:

```
poetry run start
```

If running the app locally you should see something like:

```
INFO:     Uvicorn running on http://0.0.0.0:8000
INFO:     Application startup complete.
```

In that case, the app has automatically connected to our index (specified by `PINECONE_INDEX`), if no index with that name existed beforehand, the app creates one for us.

Now we're ready to move on to populating our index with some data.

## Required Libraries

There are a few Python libraries we must `pip install` for this notebook to run, those are:

In [142]:
!pip install -qU langchain tiktoken tqdm

## Preparing Data

In [143]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, DirectoryLoader
import glob


loader = UnstructuredPDFLoader('data/dip.pdf')
docs = loader.load()
print(len(docs))
print(docs) # pdf uploader
# 많은 pdf를 올리면 다음과 같아진다. docs[0] 로 배열을 확인 

1
[Document(page_content='It is only after years of preparation that the young artist should touch color– not color used descriptively, that is, but as a means of personal expression.\n\nHenri Matisse\n\nFor a long time I limited myself to one color – as a form of discipline.\n\nPablo Picasso\n\nColor Image Processing\n\nChapter 6 어\n\nPreview } Motivation of using colors in image processing\n\n} A powerful descriptor that simplifies object identification } Thousands of color shades and intensities can be discerned by\n\nhumans\n\n} Two major areas in color image processing\n\n} Full\n\n\n\ncolor processing\n\n} images are acquired with a full-color sensor\n\n} Pseudo\n\n\n\ncolor processing\n\n} assigning a color to a particular monochrome intensity or\n\nrange of intensities\n\n} Full-color image processing techniques are used in broad\n\nrange of applications, as the hardware cost is getting reasonable\n\n2\n\n6. Color Image Processing\n\nSpring 2023\n\nPreview\n\n} This chapter cov

In [146]:
import pinecone
print(pinecone.describe_index('pdf-text-test'))

TypeError: expected string or bytes-like object

We access the plaintext page content like so:

In [88]:
print(docs[0].page_content)

It is only after years of preparation that the young artist should touch color– not color used descriptively, that is, but as a means of personal expression.

Henri Matisse

For a long time I limited myself to one color – as a form of discipline.

Pablo Picasso

Color Image Processing

Chapter 6 어

Preview } Motivation of using colors in image processing

} A powerful descriptor that simplifies object identification } Thousands of color shades and intensities can be discerned by

humans

} Two major areas in color image processing

} Full



color processing

} images are acquired with a full-color sensor

} Pseudo



color processing

} assigning a color to a particular monochrome intensity or

range of intensities

} Full-color image processing techniques are used in broad

range of applications, as the hardware cost is getting reasonable

2

6. Color Image Processing

Spring 2023

Preview

} This chapter covers } Color fundamentals } Color models } Pseudocolor image processing } Bas

Looks good, we need to also consider the length of each page with respect to the number of tokens that will reasonably fit within the window of a ChatGPT model. We will use `gpt-3.5-turbo` as the assumed model.

### Chunking the Text

At the time of writing, `gpt-3.5-turbo` supports a context window of 4096 tokens — that means that input tokens + generated ( / completion) output tokens, cannot total more than 4096 without hitting an error.

So we 100% need to keep below this. If we assume a very safe margin of ~2000 tokens for the input prompt into `gpt-3.5-turbo`, leaving ~2000 tokens for conversation history and completion.

With this ~2000 token limit we may want to include *five* snippets of relevant information, meaning each snippet can be no more than **400** token long.

To create these snippets we use the `RecursiveCharacterTextSplitter` from LangChain. To measure the length of snippets we also need a *length function*. This is a function that consumes text, counts the number of tokens within the text (after tokenization using the `gpt-3.5-turbo` tokenizer), and returns that number. We define it like so:

In [89]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

Note that for the tokenizer we defined the encoder as `"cl100k_base"`. This is a specific tiktoken encoder which is used by `gpt-3.5-turbo`. Other encoders exist and at the time of writing are summarized as:

| Encoder | Models |
| --- | --- |
| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002` |
| `p50k_base` | `text-davinci-003`, `code-davinci-002`, `code-cushman-002` |
| `r50k_base` | `text-davinci-001`, `davinci`, `text-similarity-davinci-001` |
| `gpt2` | `gpt2` |

You can find these details in the [Tiktoken `model.py` script](https://github.com/openai/tiktoken/blob/main/tiktoken/model.py), or using `tiktoken.encoding_for_model`:

In [90]:
tiktoken.encoding_for_model('gpt-4') # gpt4로 바꾸기 

<Encoding 'cl100k_base'>

With the length function defined we can initialize our `RecursiveCharacterTextSplitter` object like so:

In [123]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,  # number of tokens overlap between chunks
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)

Then we split the text for a document like so:

In [124]:
chunks = text_splitter.split_text(docs[0].page_content # 큰 pdf 를 분리하는 것이고 이는 토큰에 맞게 질문하기 위해서이다.
len(chunks)

21

In [93]:
tiktoken_len(chunks[0]), tiktoken_len(chunks[1])

(384, 373)

For `docs[5]` we created `2` chunks of token length `346` and `247`.

This is for a single document, we need to do this over all of our documents. While we iterate through the docs to create these chunks we will reformat them into the format required by our API app. This format needs to align to the `/upsert` endpoints required document format, which looks like this:

```json
[
    {
        "id": "abc",
        "text": "some important document text",
        "metadata": {
            "field1": "optional metadata goes here",
            "field2": 54
        }
    },
    {
        "id": "123",
        "text": "some other important text",
        "metadata": {
            "field1": "another metadata",
            "field2": 71,
            "field3": "not all metadatas need the same structure"
        }
    }
    ...
]
```

Every document *must* have a `"text"` field. The `"id"` and `"metadata"` fields are optional, however, we will include both.

The `"id"` will be created based on the URL of the text + it's chunk number.

In [125]:
import hashlib
m = hashlib.md5()  # this will convert URL into unique ID

url = docs[0].metadata['source'].replace('rtdocs/', 'https://')
print(url)

# convert URL to unique ID
m.update(url.encode('utf-8'))
uid = m.hexdigest()[:12]
print(uid)

data/dip.pdf
36a81d2f5472


Then use the `uid` alongside chunk number and actual `url` to create the format needed:

In [95]:
data = [
    {
        'id': f'{uid}-{i}',
        'text': chunk,
        'metadata': {'url': url}
    } for i, chunk in enumerate(chunks)
]
data
# 각 pdf를 split 해서 해당하는 곳으로 저장하낟. 

[{'id': '36a81d2f5472-0',
  'text': 'It is only after years of preparation that the young artist should touch color– not color used descriptively, that is, but as a means of personal expression.\n\nHenri Matisse\n\nFor a long time I limited myself to one color – as a form of discipline.\n\nPablo Picasso\n\nColor Image Processing\n\nChapter 6 어\n\nPreview } Motivation of using colors in image processing\n\n} A powerful descriptor that simplifies object identification } Thousands of color shades and intensities can be discerned by\n\nhumans\n\n} Two major areas in color image processing\n\n} Full\n\n\n\ncolor processing\n\n} images are acquired with a full-color sensor\n\n} Pseudo\n\n\n\ncolor processing\n\n} assigning a color to a particular monochrome intensity or\n\nrange of intensities\n\n} Full-color image processing techniques are used in broad\n\nrange of applications, as the hardware cost is getting reasonable\n\n2\n\n6. Color Image Processing\n\nSpring 2023\n\nPreview\n\n} This 

Now we repeat the same logic across our full dataset:

In [131]:
from tqdm.auto import tqdm

documents = []

for doc in tqdm(docs):
    url = doc.metadata['source']
    m.update(url.encode('utf-8'))
    uid = m.hexdigest()[:12]
    
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        documents.append({
            'id': f'{uid}-{i}',
            'text': chunk,
            'metadata': {'url': url}
        })

len(documents)
print(documents[0:2])
#총 21개 보유

100%|██████████| 1/1 [00:00<00:00, 38.62it/s]

[{'id': 'a2002610de6a-0', 'text': 'It is only after years of preparation that the young artist should touch color– not color used descriptively, that is, but as a means of personal expression.\n\nHenri Matisse\n\nFor a long time I limited myself to one color – as a form of discipline.\n\nPablo Picasso\n\nColor Image Processing\n\nChapter 6 어\n\nPreview } Motivation of using colors in image processing\n\n} A powerful descriptor that simplifies object identification } Thousands of color shades and intensities can be discerned by\n\nhumans\n\n} Two major areas in color image processing\n\n} Full\n\n\n\ncolor processing\n\n} images are acquired with a full-color sensor\n\n} Pseudo\n\n\n\ncolor processing\n\n} assigning a color to a particular monochrome intensity or\n\nrange of intensities\n\n} Full-color image processing techniques are used in broad\n\nrange of applications, as the hardware cost is getting reasonable\n\n2\n\n6. Color Image Processing\n\nSpring 2023\n\nPreview\n\n} This ch




---

In [132]:
documents[0:4]

[{'id': 'a2002610de6a-0',
  'text': 'It is only after years of preparation that the young artist should touch color– not color used descriptively, that is, but as a means of personal expression.\n\nHenri Matisse\n\nFor a long time I limited myself to one color – as a form of discipline.\n\nPablo Picasso\n\nColor Image Processing\n\nChapter 6 어\n\nPreview } Motivation of using colors in image processing\n\n} A powerful descriptor that simplifies object identification } Thousands of color shades and intensities can be discerned by\n\nhumans\n\n} Two major areas in color image processing\n\n} Full\n\n\n\ncolor processing\n\n} images are acquired with a full-color sensor\n\n} Pseudo\n\n\n\ncolor processing\n\n} assigning a color to a particular monochrome intensity or\n\nrange of intensities\n\n} Full-color image processing techniques are used in broad\n\nrange of applications, as the hardware cost is getting reasonable\n\n2\n\n6. Color Image Processing\n\nSpring 2023\n\nPreview\n\n} This 

---

### Indexing the Docs

We're now ready to begin indexing (or *upserting*) our `documents`. To make these requests to the retrieval app API, we will need to provide authorization in the form of the `BEARER_TOKEN` we set earlier. We do this below:

In [133]:
import os
#이건 사용자마자 jwt 토큰 받는거 이거를 헤더에 넣고 
BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.a-V-4lL-gcmOwXWPR9JCAVfyTmP5usWCHbDRVBAHA58"

Use the `BEARER_TOKEN` to create our authorization `headers`:

In [134]:
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}
print(BEARER_TOKEN)


eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.a-V-4lL-gcmOwXWPR9JCAVfyTmP5usWCHbDRVBAHA58


We'll perform the upsert in batches of `batch_size`. Make sure that the `endpoint_url` variable is set to the correct location for your running *retrieval-app* API.

In [135]:
import requests
from requests.adapters import HTTPAdapter, Retry
from tqdm.auto import tqdm

batch_size = 100
endpoint_url = "http://localhost:8000"
s = requests.Session()

# we setup a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )

100%|██████████| 1/1 [00:11<00:00, 11.94s/it]


With that our LangChain doc records have all been indexed and we can move on to querying.

### Making Queries

To query the datastore all we need to do is pass one or more queries to the `/query` endpoint. We can make a few questions related to LangChain and see if we return relevant info:

In [139]:
queries = [
    {'query': "What is the color image?"},
    {'query': "How do I use zzmap?"},
    {'query': "What is the difference image processing?"}
]

res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries
    }
)
res

# 200 - 성공했다는 뜻 

<Response [200]>

Now we can loop through the responses and see the results returned for each query:

In [140]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")

----------------------------------------------------------------------
What is the color image?

0.89: Color Image Processing
0.85: printing  } HSI (hue, saturation, intensity)  } corresponds to the way humans describe and interpret color } it decouples the color and gray-scale information in an image, making it suitable for many of the gray-scale techniques developed in this book  } Numerous color models  14  6. Color Image Processing  Spring 2023  6.2 Color models  } The RGB color model  } Based on a Cartesian coordinate system } The different colors are points on or inside the cube, and are  defined by vector extending from the origin  } Pixel depth: the number of bits used to represent each pixel  in RGB space } A depth of 24 bits: an RGB image in which each of the red, green, and  blue images is an 8-bit image (full-color: 16,777,216)  Figure 6.7 Schematic of the RGB color cube.
0.85: 2549 centered at (0.6863, 0.1608, 0.1922), and (b) reds within an RGB sphere of radius 0.1765 cen

The top results are all relevant as we would have hoped. With that we've finished. The retrieval app API can be shut down, and to save resources the Pinecone index can be deleted within the [Pinecone console](https://app.pinecone.io/).