-------------------------
#### Use of OpenAI embedding function

- against a pre-created dataset (with GPT vectors of dim 1536)

-----------------------------

In [1]:
#pip install wget

In [2]:
import openai
import pandas as pd
import os
import wget

from ast import literal_eval

In [3]:
# Chroma's client library for Python
import chromadb

In [4]:
EMBEDDING_MODEL = "text-embedding-3-small"

In [5]:
# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

#### load time

In [6]:
embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# # The file is ~700 MB so this will take some time ... 3 mins
wget.download(embeddings_url, 
              out=r'D:\AI-DATASETS\02-MISC-large\GenAI-LLMs\chromadb')

100% [......................................................................] 698933052 / 698933052

'D:\\AI-DATASETS\\02-MISC-large\\GenAI-LLMs\\chromadb/vector_database_wikipedia_articles_embedded (1).zip'

In [8]:
import zipfile

In [9]:
zip_location = r'D:\AI-DATASETS\02-MISC-large\GenAI-LLMs\chromadb\vector_database_wikipedia_articles_embedded.zip'

In [10]:
unzip_location = r'D:\AI-DATASETS\02-MISC-large\GenAI-LLMs\chromadb\data'

In [11]:
with zipfile.ZipFile(zip_location,"r") as zip_ref:
    zip_ref.extractall(unzip_location)

In [12]:
csv_location = r'D:\AI-DATASETS\02-MISC-large\GenAI-LLMs\chromadb\data\vector_database_wikipedia_articles_embedded.csv'

In [13]:
article_df = pd.read_csv(csv_location)

In [14]:
article_df.shape

(25000, 7)

In [15]:
article_df.head()

Unnamed: 0,id,url,title,text,title_vector,content_vector,vector_id
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,"[0.001009464613161981, -0.020700545981526375, ...","[-0.011253940872848034, -0.013491976074874401,...",0
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,"[0.0009286514250561595, 0.000820168002974242, ...","[0.0003609954728744924, 0.007262262050062418, ...",1
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,"[0.003393713850528002, 0.0061537534929811954, ...","[-0.004959689453244209, 0.015772193670272827, ...",2
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,"[0.0153952119871974, -0.013759135268628597, 0....","[0.024894846603274345, -0.022186409682035446, ...",3
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...,"[0.02224554680287838, -0.02044147066771984, -0...","[0.021524671465158463, 0.018522677943110466, -...",4


In [16]:
# Set the maximum column width to 100 characters
pd.set_option('display.max_colwidth', 100)

In [17]:
article_df[['title',	'text']].sample(10)

Unnamed: 0,title,text
10552,Heroin,Heroin is a drug. It is also known as Diacetylmorphine or Diamorphine. Heroin was originally a t...
19139,Lewis,"Lewis is the northern part of Lewis and Harris, the largest island in the Western Isles (or Oute..."
18407,Mark Wahlberg,"Mark Robert Michael Wahlberg (born June 5, 1971) is an American actor and television and movie p..."
13452,Passion bearer,"In Russian Orthodox Christianity, a passion bearer is someone who faces their death in a Christi..."
5977,Mauritania,"Mauritania is a country in northwest Africa. The capital city, which is also the biggest city in..."
16219,Tipi,A tipi (also called tepee or teepee) is a kind of tent. It is cone-shaped. They were made by Nat...
8090,Demeter,"Demeter (Attic Greek: Δημήτηρ, Dēmḗtēr; Doric: Δαμάτηρ, Dāmā́tēr) is the goddess of the harvest ..."
20702,42 (number),"Forty-two is a number. It comes between forty-one and forty-three, and is an even number. It is ..."
18231,Izumo Province,was an old province of Japan in the area of Shimane Prefecture on the island of Honshū. It was s...
19180,List of members of the Red Army Faction,This is a list of Members of the Red Army Faction. After Andreas Baader escaped from jail in 197...


In [18]:
article_df_1500 = article_df.sample(1500)

In [19]:
%%time
# Read vectors from strings back into a list
article_df_1500['title_vector']   = article_df_1500.title_vector.apply(literal_eval)
article_df_1500['content_vector'] = article_df_1500.content_vector.apply(literal_eval)

CPU times: total: 15.9 s
Wall time: 32.7 s


- Convert string representations of vectors back into Python list objects
- This is necessary because vectors saved as strings (e.g., in a CSV or database) need to be 
- converted back to lists to be used in calculations or machine learning models.

Explanation:
- literal_eval safely evaluates a string containing a Python expression (like a list) and converts it into a corresponding Python object.
- When vectors are stored in a DataFrame as strings (e.g., "[0.25, 0.89, 0.56]"), 
- we need to convert them back into actual lists (e.g., [0.25, 0.89, 0.56]) for further use.


In [20]:
len(article_df_1500.iloc[0]['title_vector']), len(article_df_1500.iloc[0]['content_vector'])

(1536, 1536)

In [21]:
# Set vector_id to be a string
article_df_1500['vector_id'] = article_df_1500['vector_id'].apply(str)

#### Using Chroma DB

- `Instantiate the Chroma Client`

    - We'll start by creating a client instance to interact with the Chroma database.
    
- `Create Collections for Each Class of Embedding`

    - Collections in Chroma allow you to group embeddings based on specific criteria or use cases. Each collection can store embeddings that correspond to a particular class or type, such as text embeddings, image embeddings, etc.
    
- `Query Each Collection`

    - Finally, we'll perform queries on each collection to find the most similar embeddings. This could be used for tasks such as document retrieval, semantic search, or recommendation systems.

#### Instantiate the Chroma client

Create the Chroma client. By default, Chroma is ephemeral and runs in memory. However, you can easily set up a persistent configuration which writes to disk.

In [22]:
chroma_client = chromadb.EphemeralClient() 

# Equivalent to chromadb.Client(), ephemeral.
# Uncomment for persistent client
# chroma_client = chromadb.PersistentClient()

In [23]:
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

In [24]:
# Load OpenAI API key from environment variables
openai_api_key = os.getenv('OPENAI_API_KEY')

In [25]:
embedding_function = OpenAIEmbeddingFunction(
    api_key   = openai_api_key, 
    model_name= EMBEDDING_MODEL)

In [26]:
wikipedia_content_collection = chroma_client.create_collection(name='wikipedia_content', 
                                                               embedding_function=embedding_function)

wikipedia_title_collection   = chroma_client.create_collection(name='wikipedia_titles', 
                                                               embedding_function=embedding_function)

In [27]:
# Add the content vectors
wikipedia_content_collection.add(
    ids       = article_df_1500.vector_id.tolist(),
    embeddings= article_df_1500.content_vector.tolist(),
)

# Add the title vectors
wikipedia_title_collection.add(
    ids       = article_df_1500.vector_id.tolist(),
    embeddings= article_df_1500.title_vector.tolist(),
)

In [28]:
def query_collection(collection, query, max_results, dataframe):
    
    results = collection.query(query_texts= query, 
                               n_results  = max_results, 
                               include    = ['distances']) 
    
    df = pd.DataFrame({
                'id':results['ids'][0], 
                'score':results['distances'][0],
                'title': dataframe[dataframe.vector_id.isin(results['ids'][0])]['title'],
                'content': dataframe[dataframe.vector_id.isin(results['ids'][0])]['text'],
                })
    
    return df

In [29]:
title_query_result = query_collection(
    collection = wikipedia_title_collection,
    query      = "modern art in Europe",
    max_results= 10,
    dataframe  = article_df_1500
)

title_query_result.head()

Unnamed: 0,id,score,title,content
12425,12599,1.943676,Joseph Barbera,"Joseph Roland ""Joe"" Barbera (March 24, 1911 – December 18, 2006) was an Italian American animato..."
12599,19368,1.947997,"The Good, the Bad and the Ugly","The Good, the Bad and the Ugly (1966) is a Spaghetti Western movie directed by Sergio Leone and ..."
24314,24878,1.964057,Pop rock,"Pop rock is a subgenre of pop music and rock music that uses catchy pop style, with light lyrics..."
23128,24314,1.96604,Eurofighter Typhoon,"The Eurofighter Typhoon is a jet fighter aircraft made by EADS, BAE Systems and Alenia Aeronauti..."
24878,5205,1.974909,Folk rock,Folk rock music is a mixture of folk music with modern rock music. Folk rock began in the mid 19...
