# <a id='toc1_'></a>[Embedding Any articles for search](#toc0_)


**Table of contents**<a id='toc0_'></a>    
- [Embedding Any articles for search](#toc1_)    
  - [Prerequisites](#toc1_1_)    
    - [Import libraries](#toc1_1_1_)    
    - [Set API key (if needed)](#toc1_1_2_)    
  - [Collect documents](#toc1_2_)    
  - [Chunk documents](#toc1_3_)    
  - [Embed document chunks](#toc1_4_)    
  - [Store document chunks and embeddings](#toc1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->


This notebook shows how we prepared a dataset of Google Help articles for search, used in [Question_answering_using_embeddings.ipynb](Question_answering_using_embeddings.ipynb).

Procedure:

0. Prerequisites: Import libraries, set API key (if needed)
1. Collect: We load a few hundred help articles from CSV files
2. Chunk: Documents are split into short, semi-self-contained sections to be embedded
3. Embed: Each section is embedded with the OpenAI API
4. Store: Embeddings are saved in a CSV file (for large datasets, use a vector database)

## <a id='toc1_1_'></a>[Prerequisites](#toc0_)

### <a id='toc1_1_1_'></a>[Import libraries](#toc0_)

In [1]:
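import os  # for setting the OPENAI_API_KEY environment variable
import re  # for regex-based text cleanup
import time  # for pausing between retries after rate limit errors

import numpy as np  # for converting the DataFrame of rows into a list
import openai  # for generating embeddings
import pandas as pd  # for DataFrames to store article sections and embeddings
import tiktoken  # for counting tokens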


Install any missing libraries with `pip install` in your terminal. E.g.,

```zsh
pip install openai
```

(You can also do this in a notebook cell with `!pip install openai`.)

If you install any libraries, be sure to restart the notebook kernel.

### <a id='toc1_1_2_'></a>[Set API key (if needed)](#toc0_)

Note that the OpenAI library will try to read your API key from the `OPENAI_API_KEY` environment variable. If you haven't already, set this environment variable by following [these instructions](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).
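
A quick check that the key is visible to the notebook can save a confusing authentication error later. The sketch below uses only the standard library; the helper name `api_key_is_set` is hypothetical and is not part of the OpenAI library.

```python
import os


def api_key_is_set(env=None, var_name="OPENAI_API_KEY"):
    """Return True if the variable exists and is not blank."""
    # `env` defaults to the real environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    return bool(env.get(var_name, "").strip())


if not api_key_is_set():
    print("OPENAI_API_KEY is not set; export it before running the cells below.")
```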

## <a id='toc1_2_'></a>[Collect documents](#toc0_)

In this example, we'll load a few hundred Google Help articles from two CSV files and flatten each row into a single string.

In [2]:
from api_doc import *

# API configuration
openai.api_key = OPEN_AI_KEY

# for LangChain
os.environ["OPENAI_API_KEY"] = OPEN_AI_KEY

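def concat_rows(df):
    """Flatten each row of df into one 'column: value, column: value' string."""
    flattened_data = []
    for _, row in df.iterrows():
        # Join every column name with its value for this row
        flattened_row = ', '.join([f'{col}: {row[col]}' for col in df.columns])
        flattened_data.append(flattened_row)
    return flattened_data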

In [3]:

# Load each CSV, flatten every row into one string, and wrap the result in a single-column DataFrame

df1 = pd.read_csv('Google Hackathon - Doc Sheet.csv')
flat_df1 = concat_rows(df1)
flat_df1 = pd.DataFrame(flat_df1, columns=['Concatenated Row'])


df2 = pd.read_csv('Google Hackathon - Google Suite.csv', header=None)
df2.rename(columns={0: 'Title', 1: 'Description'}, inplace=True)
flat_df2 = concat_rows(df2)
flat_df2 = pd.DataFrame(flat_df2, columns=['Concatenated Row'])
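
To make the flattened format concrete, here is a small sketch that applies the same row-flattening logic to a toy DataFrame. The column names mirror the `Title` and `Description` names used above, but the rows are invented for illustration and do not come from the real CSV files.

```python
import pandas as pd


def concat_rows(df):
    # Same logic as the helper above: one "column: value" string per row.
    return [
        ", ".join(f"{col}: {row[col]}" for col in df.columns)
        for _, row in df.iterrows()
    ]


# Made-up rows for illustration only.
toy_df = pd.DataFrame(
    {
        "Title": ["Share a file", "Create a form"],
        "Description": ["Open the file and click Share.", "Go to Forms and click Blank."],
    }
)

flattened = concat_rows(toy_df)
for line in flattened:
    print(line)
# Title: Share a file, Description: Open the file and click Share.
# Title: Create a form, Description: Go to Forms and click Blank.
```

Keeping the column name next to each value gives the embedding model some context about what each field means.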




## <a id='toc1_3_'></a>[Chunk documents](#toc0_)

Now that we have our reference documents, we need to prepare them for search.

Because GPT can only read a limited amount of text at once, we'll split each document into chunks short enough to be read.

For a corpus of long articles, such as Wikipedia pages, a typical preparation would:
- Discard less relevant-looking sections like External Links and Footnotes
- Clean up the text by removing reference tags (e.g., `<ref>`), whitespace, and super short sections
- Split each article into sections
- Prepend titles and subtitles to each section's text, to help GPT understand the context
- Recursively split any long section (say, > 1,600 tokens) into smaller sections, trying to split along semantic boundaries like paragraphs

In this notebook, each CSV row has already been flattened into a single string, so each row is treated as one chunk.

In [4]:
# Parse Text Data

Next, we'll define helpers for splitting long sections into smaller sections.

There's no perfect recipe for splitting text into sections.

Some tradeoffs include:
- Longer sections may be better for questions that require more context
- Longer sections may be worse for retrieval, as they may have more topics muddled together
- Shorter sections are better for reducing costs (which are proportional to the number of tokens)
- Shorter sections allow more sections to be retrieved, which may help with recall
- Overlapping sections may help prevent answers from being cut by section boundaries
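
The last tradeoff, overlapping sections, can be sketched as a sliding window over a token list. This is a minimal illustration and not what this notebook does: the function name `overlapping_chunks` is hypothetical, and plain words stand in for real tokens.

```python
def overlapping_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in overlapping_chunks(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
# one two three four
# four five six seven
# seven eight nine ten
```

A larger overlap reduces the chance that an answer is cut at a boundary, at the cost of embedding more tokens overall.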

Here, we'll use a simple approach and limit sections to 1,600 tokens each, recursively halving any sections that are too long. To avoid cutting in the middle of useful sentences, we'll split along paragraph boundaries when possible.

In [5]:
GPT_MODEL = "gpt-3.5-turbo"  # only matters insofar as it selects which tokenizer to use
MAX_TOKENS = 1600



In [6]:


def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """
    Compute the number of tokens in a given text string for a specific model.
    
    Parameters:
    - text (str): The input text string.
    - model (str): Model identifier to be used for tokenization. Default is GPT_MODEL.
    
    Returns:
    - int: Number of tokens in the input text for the given model.
    """
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def halved_by_delimiter(string: str, delimiter: str = "\n") -> list[str]:
    """
    Splits a string into two parts based on a delimiter while trying to balance the tokens on each side.
    
    Parameters:
    - string (str): Input text string.
    - delimiter (str): Delimiter on which to split the string. Default is a newline character.
    
    Returns:
    - list[str]: A list of two strings, representing the split halves.
    """
    chunks = string.split(delimiter)
    if len(chunks) == 1:
        return [string, ""]  # no delimiter found
    elif len(chunks) == 2:
        return chunks  # no need to search for halfway point
    else:
        total_tokens = num_tokens(string)
        halfway = total_tokens // 2
        best_diff = halfway
        # Grow the left half one chunk at a time until it stops getting closer to halfway
        for i, chunk in enumerate(chunks):
            left = delimiter.join(chunks[: i + 1])
            left_tokens = num_tokens(left)
            diff = abs(halfway - left_tokens)
            if diff >= best_diff:
                break
            else:
                best_diff = diff
        i = max(i, 1)  # keep at least one chunk on the left so the right half is always shorter than the input
        left = delimiter.join(chunks[:i])
        right = delimiter.join(chunks[i:])
        return [left, right]


def truncated_string(
    string: str,
    model: str = GPT_MODEL,
    max_tokens: int = MAX_TOKENS,
    print_warning: bool = True,
) -> str:
    """
    Truncate a string to fit within a maximum number of tokens.
    
    Parameters:
    - string (str): Input text string.
    - model (str): Model identifier to be used for truncation.
    - max_tokens (int): Maximum number of tokens allowed.
    - print_warning (bool): Whether to print a warning if truncation occurs. Default is True.
    
    Returns:
    - str: The truncated string.
    """
    encoding = tiktoken.encoding_for_model(model)
    encoded_string = encoding.encode(string)
    truncated = encoding.decode(encoded_string[:max_tokens])  # local name avoids shadowing the function
    if print_warning and len(encoded_string) > max_tokens:
        print(f"Warning: Truncated string from {len(encoded_string)} tokens to {max_tokens} tokens.")
    return truncated


# sample = truncated_string(full_flattened_array[0])

# sample_again = halved_by_delimiter(sample)

# np.array(sample_again).shape
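
The cells above define halving and truncation helpers but do not show how they fit together. One possible way to combine the same ideas is a recursive splitter like the sketch below. To stay self-contained it counts whitespace-separated words instead of calling `tiktoken`, and the names `count_tokens`, `halve`, and `split_recursively` are hypothetical stand-ins, not the notebook's functions.

```python
def count_tokens(text):
    # Stand-in tokenizer (assumption): whitespace-separated words.
    return len(text.split())


def halve(text, delimiter):
    """Split text at the delimiter occurrence closest to the token midpoint."""
    parts = text.split(delimiter)
    if len(parts) < 2:
        return None  # delimiter not present
    target = count_tokens(text) / 2
    best_i, best_diff = 1, None
    for i in range(1, len(parts)):
        left_tokens = count_tokens(delimiter.join(parts[:i]))
        diff = abs(target - left_tokens)
        if best_diff is None or diff < best_diff:
            best_i, best_diff = i, diff
    # The delimiter at the split point is dropped, as with str.split.
    return delimiter.join(parts[:best_i]), delimiter.join(parts[best_i:])


def split_recursively(text, max_tokens, delimiters=("\n\n", "\n", ". ")):
    """Return pieces of text that each fit within max_tokens."""
    if count_tokens(text) <= max_tokens:
        return [text]
    for delimiter in delimiters:  # coarse boundaries first, finer ones later
        halves = halve(text, delimiter)
        if halves is None:
            continue
        left, right = halves
        if not left.strip() or not right.strip():
            continue  # degenerate split; try a finer delimiter
        return split_recursively(left, max_tokens, delimiters) + split_recursively(
            right, max_tokens, delimiters
        )
    # No usable delimiter: fall back to a hard cut, which discards the remainder.
    return [" ".join(text.split()[:max_tokens])]


text = "alpha beta gamma\n\ndelta epsilon zeta\n\neta theta iota"
pieces = split_recursively(text, max_tokens=4)
print(pieces)
# ['alpha beta gamma', 'delta epsilon zeta', 'eta theta iota']
```

Trying paragraph breaks before line breaks and sentence breaks keeps related text together where possible; the hard cut is only a last resort.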

## <a id='toc1_4_'></a>[Embed document chunks](#toc0_)

Now that we've split our library into shorter self-contained strings, we can compute embeddings for each.

(For large embedding jobs, use a script like [api_request_parallel_processor.py](api_request_parallel_processor.py) to parallelize requests while throttling to stay under rate limits.)

In [7]:


def compute_embeddings(full_flattened_array: list, EMBEDDING_MODEL: str = "text-embedding-ada-002", BATCH_SIZE: int = 1000) -> pd.DataFrame:
    """
    Computes embeddings for a given list of texts.
    
    Parameters:
    - full_flattened_array (list): List of texts to compute embeddings for.
    - EMBEDDING_MODEL (str): OpenAI model for embeddings. Default is "text-embedding-ada-002".
    - BATCH_SIZE (int): Number of texts to be processed in a batch. Default is 1000.
    
    Returns:
    - df (pd.DataFrame): DataFrame with original texts and their corresponding embeddings.
    """
    
    embeddings = []

    for batch_start in range(0, len(full_flattened_array), BATCH_SIZE):
        batch_end = batch_start + BATCH_SIZE
        batch = full_flattened_array[batch_start:batch_end]

        # Attempt to create embeddings, with retries on rate limit errors
        max_retries = 5
        for attempt in range(max_retries):
            try:
                print(f"Batch {batch_start} to {batch_end-1}")
                response = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch)
                for i, be in enumerate(response["data"]):
                    assert i == be["index"]  # double check embeddings are in same order as input
                batch_embeddings = [e["embedding"] for e in response["data"]]
                embeddings.extend(batch_embeddings)
                break  # Break out of the retry loop if successful
            except openai.error.RateLimitError:
                if attempt < max_retries - 1:  # i.e. not the last attempt
                    print("Rate limit reached. Waiting for 60 seconds before retrying...")
                    time.sleep(60)  # Wait for 60 seconds before the next attempt
                else:
                    raise  # If it's the last attempt, raise the exception to alert the user

    df = pd.DataFrame({"text": full_flattened_array, "embedding": embeddings})
    return df
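
The retry loop above waits a fixed 60 seconds after each rate limit error. A common alternative is exponential backoff with jitter, sketched below with a fake failing function so that it runs without any API calls. The helper `call_with_backoff` and its parameters are hypothetical; with the `openai` version used above, `retry_on` would be `openai.error.RateLimitError`.

```python
import random
import time


def call_with_backoff(func, retry_on, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on `retry_on` with exponentially growing waits."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Wait base_delay * 2^attempt seconds, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))


# Demo with a fake function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, retry_on=TimeoutError, sleep=waits.append)
print(result, len(waits))
```

Short early waits recover quickly from brief throttling, while the doubling keeps repeated failures from hammering the API.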



In [8]:

full_flattened_set = pd.concat([flat_df2, flat_df1])

full_flattened_array = np.array(full_flattened_set).squeeze().tolist()

full_flattened_set.shape



(531, 1)

In [9]:

# calculate embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023
BATCH_SIZE = 1000  # you can submit up to 2048 embedding inputs per request


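# pass the settings above explicitly so that changing them actually takes effect
result_df = compute_embeddings(full_flattened_array, EMBEDDING_MODEL, BATCH_SIZE)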



Batch 0 to 999


## <a id='toc1_5_'></a>[Store document chunks and embeddings](#toc0_)

Because this example only uses a few hundred strings (531 here), we'll store them in a CSV file.

(For larger datasets, use a vector database, which will be more performant.)

In [10]:
# save document chunks and embeddings

SAVE_PATH = "google_docs.csv"

result_df.to_csv(SAVE_PATH, index=False)
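
One thing to keep in mind when reading the CSV back: each embedding is written as the text form of a list, so it loads as a string. The sketch below shows one way to restore the lists with `ast.literal_eval`. It uses a toy two-dimensional stand-in for `result_df` and an in-memory buffer, both assumptions made so the example runs without the real data.

```python
import ast
import io

import pandas as pd

# Toy stand-in for result_df; real embeddings have many more dimensions.
toy_df = pd.DataFrame(
    {"text": ["first chunk", "second chunk"], "embedding": [[0.1, 0.2], [0.3, 0.4]]}
)

buffer = io.StringIO()  # in-memory file so the sketch writes nothing to disk
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(type(loaded["embedding"][0]))  # the list came back as a string

# Convert the string representation back into a Python list of floats.
loaded["embedding"] = loaded["embedding"].apply(ast.literal_eval)
print(loaded["embedding"][0])
```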
