<a href = "https://www.pieriantraining.com"><img src="../PT Centered Purple.png"> </a>

<em style="text-align:center">Copyrighted by Pierian Training</em>

# Text Embedding API

Text embedding lets us convert text documents directly into vectors with a single call to the OpenAI API.

Keep in mind that, just like other OpenAI services, it is not free and has its own pricing structure (it's typically much cheaper than GPT on a token basis, since the processing is simpler). You can view the pricing here: https://openai.com/api/pricing/

## Imports

In [1]:
from openai import OpenAI
import pandas as pd
import tiktoken # https://github.com/openai/tiktoken

In [2]:
client = OpenAI()

### What happens when GPT doesn't know anything about a topic?

For example, we know GPT is limited by its training data, which is not up to date to the present day (though depending on the model, the cut-off can be quite recent). There are also limitations based on how esoteric the topic is.

Let's ask GPT about a "unicorn" company:

---

In [4]:
prompt = "What does the start-up company Pentera do and who invested in it?"

response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant"},
                    {"role": "user", "content": prompt},
                ],
                temperature=0,
                max_tokens=500,

)
print(response.choices[0].message.content.strip(" \n"))

Pentera is a start-up company that specializes in developing advanced technology solutions for the agriculture industry. They focus on creating innovative products and services that help farmers optimize their operations, increase productivity, and reduce environmental impact.

As for the investors, Pentera has received funding from several notable sources. One of their major investors is GreenTech Ventures, a venture capital firm that focuses on supporting sustainable and environmentally friendly technologies. Additionally, Pentera has also secured investments from agricultural industry leaders such as AgriCorp and FarmTech Solutions, who recognize the potential of their technology in revolutionizing the sector.


While this may sound somewhat plausible, the model is hallucinating! This is a common issue with LLMs: they are eager to please, and with enough context they can make stuff up that sounds right but actually isn't. The investors it names do not appear among those listed for Pentera in the unicorn data set used later in this notebook. Pentera also isn't an agriculture company; it is a penetration testing company that develops and provides an automated security validation platform to reduce cybersecurity risks.

We could try to alleviate this issue with some prompt engineering:

In [5]:
prompt = """Only answer the question below if you have 100% certainty of the facts.

Q: What does the start-up company Pentera do and who invested in it?
A:"""


response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant"},
                    {"role": "user", "content": prompt},
                ],
                temperature=0,
                max_tokens=500,

)
print(response.choices[0].message.content.strip(" \n"))

I'm sorry, but as an AI assistant, I don't have real-time access to current information or the ability to browse the internet. Therefore, I cannot provide you with the specific details about the start-up company Pentera and its investors. It's always best to refer to reliable sources or conduct your own research to get the most accurate and up-to-date information.


Alright, very interesting! How can we help the model? We could input some context from our own data. In fact, we have a data set about recent Unicorn companies.  

## Text Data

Let's grab some text data and send it to OpenAI to receive the embeddings back.
 


In [6]:
df = pd.read_csv("unicorns.csv") 

In [7]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",


Let's create a new column that summarizes each company with the information from the other columns

In [8]:
import ast

def summary(company,crunchbase_url,city,country,industry,investor_list):
    """Build a one-paragraph text description of a company from its row values."""
    investors = 'The investors in the company are'

    # The Investors column holds a stringified list, so parse it back into a list
    for investor in ast.literal_eval(investor_list):
        investors += f" {investor}, "

    text = f"{company} has headquarters in {city} in {country} and is in the field of {industry}. {investors}. You can find more information at {crunchbase_url}"

    return text

In [9]:
df['summary'] = df.apply(lambda row: summary(row['Company'],row['Crunchbase Url'],row['City'],row['Country'],row['Industry'],row['Investors']),axis=1)

In [10]:
df['summary'][0]

'Esusu has headquarters in New York in United States and is in the field of Fintech. The investors in the company are Next Play Ventures,  Zeal Capital Partners,  SoftBank Group, . You can find more information at https://www.cbinsights.com/company/esusu'
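The trailing `, .` in the output above comes from appending a comma after every investor, including the last one. As a minimal sketch of an alternative, the investor names could be joined instead. The name `summary_joined` is hypothetical and not part of the notebook, and switching to it would slightly change the summaries and token counts shown in the rest of this notebook.

```python
import ast

def summary_joined(company, crunchbase_url, city, country, industry, investor_list):
    """Hypothetical variant of summary() that joins investors with ', '."""
    # Parse the stringified list and join it, so no comma trails the last name
    investors = ", ".join(ast.literal_eval(investor_list))
    return (
        f"{company} has headquarters in {city} in {country} "
        f"and is in the field of {str(industry).strip()}. "
        f"The investors in the company are {investors}. "
        f"You can find more information at {crunchbase_url}"
    )

example = summary_joined(
    "Pentera",
    "https://www.cbinsights.com/company/pcysys",
    "Petah Tikva",
    "Israel",
    "Cybersecurity",
    '["AWZ Ventures","Blackstone","Insight Partners"]',
)
print(example)
```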

In [11]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...


### Token Count

If you are ever worried about how many tokens your text actually contains (to get an estimate of your costs), OpenAI has a library called "tiktoken" that counts tokens, which lets you estimate a cost before making any API calls.

Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.

**tiktoken** supports several encodings for OpenAI models, including:

* "r50k_base" (also available as "gpt2") for most gpt-3 models
* "p50k_base" for code models, and Davinci models, like "text-davinci-003"
* "cl100k_base" for text-embedding-ada-002 (and for chat models such as gpt-3.5-turbo)

In [12]:
import tiktoken

def num_tokens_from_string(string, encoding_name):
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

Let's run a quick example on some text:

In [13]:
num_tokens_from_string(df['summary'][0],encoding_name='cl100k_base')

58

Note how this is higher than the actual word count. OpenAI tokens are not the same as words: things like punctuation and word length come into play. As a rough estimate, 1000 tokens is about 750 words, but with the tool above you can check your real token count before sending text over to OpenAI. Let's get a cost estimate of vectorizing our entire data set:

In [14]:
df['token_count'] = df['summary'].apply(lambda text: num_tokens_from_string(text,'cl100k_base'))

In [15]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...,58
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...,60
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...,57
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...,62
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...,58
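The 750-words-per-1000-tokens rule of thumb mentioned earlier can be sketched as a quick estimator for when tiktoken is not at hand. This is only an approximation, and `approx_tokens_from_words` is a hypothetical helper; text heavy in names and URLs, like these summaries, may tokenize into more pieces than the rule suggests, so prefer the real count from tiktoken when costs matter.

```python
def approx_tokens_from_words(text):
    """Rough token estimate: about 4 tokens for every 3 words (1000 per 750)."""
    words = len(text.split())
    # Ceiling division, so a non-empty string never rounds down to zero tokens
    return -(-words * 4 // 3)

sample = "Pentera has headquarters in Petah Tikva in Israel"
print(approx_tokens_from_words(sample))
```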


### Estimating Embedding Costs

Let's now do a quick monetary estimate of how much this will all cost. At the time of writing, the ADA-002 embedding model costs $0.0004 / 1K tokens (check the pricing page for the current rate).

Pay careful attention: that isn't 4 cents per 1000 tokens (which would be $0.04) but 1/100 of that cost, so it is quite "inexpensive" depending on your document workload.

So, let's estimate the cost:

In [16]:
df['token_count'].sum() * 0.0004 / 1000

0.028168400000000003
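The same arithmetic can be wrapped in a small helper so the rate is easy to swap out. This is a sketch: `embedding_cost_usd` is a hypothetical name, the default rate is the one quoted in this notebook and may be out of date, and 70,421 is the token total implied by the output above.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens=0.0004):
    """Estimated cost in dollars for embedding total_tokens tokens."""
    return total_tokens * usd_per_1k_tokens / 1000

total_tokens = 70_421  # total implied by the cost output above
cost = embedding_cost_usd(total_tokens)
print(f"${cost:.4f} (about {cost * 100:.1f} cents)")
```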

Another thing to keep in mind is the size limit for embeddings. Currently the ADA 002 model's max token limit is 8191 tokens, so let's quickly check against this limit:

In [17]:
df[df['token_count'] > 8191]

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count


Looks like we're okay! It also looks like this will only cost us about 3 cents to embed, not too bad!
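None of our rows exceed the limit, but if one did, one option would be to split its tokens into pieces that each fit. A minimal sketch, assuming the token list comes from `encoding.encode(text)` and each piece is turned back into text with `encoding.decode(...)` before being embedded separately. The helper name `split_into_chunks` is hypothetical, and how to combine the resulting embeddings is a design choice not covered here.

```python
def split_into_chunks(tokens, max_tokens=8191):
    """Split a token list into consecutive pieces of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

fake_tokens = list(range(20_000))  # stand-in for encoding.encode(long_text)
chunks = split_into_chunks(fake_tokens)
print([len(chunk) for chunk in chunks])
```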

## Text Embedding

To begin, we'll create a simple function to grab the embedding. In our case, we'll specify the ADA 002 model:

In [20]:
def get_embedding(text):
    result = client.embeddings.create(
      model='text-embedding-ada-002',
      input=text
    )
    return result.data[0].embedding


## Create Embeddings

Now to create the embeddings, we can simply call this function, for example:

In [21]:
get_embedding(df['summary'][0])

[0.011977683752775192,
 -0.017708472907543182,
 -0.022319912910461426,
 -0.03442494943737984,
 -0.013767298310995102,
 0.010248392820358276,
 -0.016314314678311348,
 0.025470171123743057,
 0.00376355298794806,
 -0.021957969292998314,
 0.019330520182847977,
 -0.013143949210643768,
 0.005335330963134766,
 -0.016890745609998703,
 0.01069076918065548,
 -0.023700665682554245,
 -0.0007649429608136415,
 -0.023539800196886063,
 0.01706501469016075,
 0.00254199025221169,
 -0.023258289322257042,
 -0.015764696523547173,
 -0.009919961914420128,
 0.021421754732728004,
 -0.022092022001743317,
 -0.0008847533608786762,
 -0.011769900098443031,
 -0.0014218053547665477,
 -0.011568820104002953,
 -0.011428063735365868,
 0.011991089209914207,
 -0.0037065802607685328,
 -0.01900879107415676,
 -0.004142254125326872,
 -0.012862436473369598,
 0.0059653823263943195,
 -0.02236012928187847,
 0.014142648316919804,
 0.019343925639986992,
 -0.01185703556984663,
 0.03190474212169647,
 0.008653155528008938,
 -0.01398178

Let's embed the rest of the rows by applying the function to the whole column:

In [22]:
# This will take a while because it makes one API call per row.
# Expect about 0.5 seconds per row.
df['embedding'] = df['summary'].apply(get_embedding)
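One call per row is simple but slow. The embeddings endpoint also accepts a list of strings as `input`, so rows can be sent in batches. The sketch below shows only the batching logic and uses a fake embedding function so it runs without API calls. The names `batched`, `embed_in_batches`, and `fake_embed_batch` and the batch size of 100 are assumptions for illustration; the commented-out function shows how a real batch call might look with the `client` defined earlier.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts batch by batch, keeping the original order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

# Real use (makes API calls), assuming the client defined earlier:
# def embed_batch(batch):
#     result = client.embeddings.create(model='text-embedding-ada-002', input=batch)
#     ordered = sorted(result.data, key=lambda item: item.index)
#     return [item.embedding for item in ordered]

def fake_embed_batch(batch):
    # Stand-in for the API: a one-dimensional "embedding" per text
    return [[float(len(text))] for text in batch]

texts = [f"summary {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed_batch)
print(len(vectors))
```

With the real `embed_batch`, the result could be assigned back with `df['embedding'] = embed_in_batches(df['summary'].tolist(), embed_batch)`.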

In [23]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count,embedding
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York in United S...,58,"[0.011977683752775192, -0.017708472907543182, ..."
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York in Uni...,60,"[0.009109214879572392, 0.01313267182558775, -0..."
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto in United ...,57,"[0.001965112751349807, -0.03798539191484451, 0..."
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad in Ind...,62,"[-0.0024896422401070595, -0.024661261588335037..."
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...,58,"[0.011392408050596714, -0.011185649782419205, ..."


In [24]:
df.to_csv('unicorns_with_embeddings.csv',index=False)
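One thing to watch when reading this file back: a CSV stores each embedding list as text, so the column comes back as strings rather than lists of floats. The sketch below reproduces that round trip with the standard library on a tiny made-up row (the three numbers are placeholders, not real embedding values) and parses the string back with `ast.literal_eval`. With pandas, the same parsing could be done at load time via `pd.read_csv('unicorns_with_embeddings.csv', converters={'embedding': ast.literal_eval})`.

```python
import ast
import csv
import io

# Write a tiny CSV in memory, storing the list as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Company", "embedding"])
writer.writerow(["Pentera", str([0.011, -0.011, 0.5])])  # placeholder values

# Read it back: the embedding is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["embedding"]

# Parse the string back into a list of floats
parsed = ast.literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__)
```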

## Document Similarity 

We can now take a new string, embed it into a vector, and perform a cosine similarity search against all the vector embeddings in our DataFrame:

In [25]:
prompt = "What does the company Pentera do and who invested in it?"

In [26]:
prompt_embedding = get_embedding(prompt)

In [27]:
import numpy as np
# There are other services/programs for larger amount of vectors
# Take a look at vector search engines like Pinecone or Weaviate
def vector_similarity(vec1,vec2):
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(vec1), np.array(vec2))
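To make the docstring's claim concrete, here is a small sketch in plain Python (no NumPy, so the arithmetic is explicit) that computes the full cosine similarity and compares it with the dot product of the same vectors after scaling them to length 1. The helper names and the toy vectors are for illustration only.

```python
import math

def cosine_similarity(vec1, vec2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

def normalize(vec):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(a * a for a in vec))
    return [a / length for a in vec]

a, b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = normalize(a), normalize(b)

# For unit-length vectors, the plain dot product equals the cosine similarity
dot_of_units = sum(x * y for x, y in zip(unit_a, unit_b))
print(cosine_similarity(a, b), dot_of_units)
```

If embeddings come from a source that does not normalize them, the full `cosine_similarity` is the safer choice.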


In [28]:
df["prompt_similarity"] = df['embedding'].apply(lambda vector: vector_similarity(vector, prompt_embedding))

In [29]:
df.sort_values("prompt_similarity", ascending=False).head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count,embedding,prompt_similarity
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva in Isr...,58,"[0.011392408050596714, -0.011185649782419205, ...",0.883416
933,"10/31/2022, 2:34:02 AM",Pendo,https://www.cbinsights.com/company/pendoio,2.6,10/17/2019,2019,Raleigh,United States,Internet software & services,"[""Contour Venture Partners"",""Battery Ventures""...",,Pendo has headquarters in Raleigh in United St...,59,"[0.016953933984041214, -0.0028986947145313025,...",0.826123
61,"10/31/2022, 2:36:13 AM",Perimeter 81,https://www.cbinsights.com/company/perimeter-81,1.0,6/6/2022,2022,Tel Aviv,Israel,Cybersecurity,"[""Insight Partners"",""Toba Capital"",""Spring Ven...",,Perimeter 81 has headquarters in Tel Aviv in I...,57,"[0.007210950832813978, -0.0017711690161377192,...",0.819329
1183,"10/31/2022, 2:33:34 AM",Intarcia Therapeutics,https://www.cbinsights.com/company/intarcia-th...,3.8,4/1/2014,2014,Boston,United States,Health,"[""New Enterprise Associates"",""New Leaf Venture...",,Intarcia Therapeutics has headquarters in Bost...,62,"[0.01666351780295372, -0.001971115358173847, 0...",0.804185
988,"10/31/2022, 2:36:23 AM",Momenta,https://www.cbinsights.com/company/momenta,1.0,10/17/2018,2018,Beijing,China,Artificial intelligence,"[""Sinovation Ventures"",""Tencent Holdings"",""Seq...",,Momenta has headquarters in Beijing in China a...,55,"[0.006287400145083666, -0.03462117910385132, -...",0.803426


Now we can easily grab the summary for the most similar embedding, then insert that as context to our GPT request!

In [30]:
# Could also use sort_values() with ascending=False, but nlargest should be more performant
df.nlargest(1,'prompt_similarity').iloc[0]['summary'] 

'Pentera has headquarters in Petah Tikva in Israel and is in the field of Cybersecurity . The investors in the company are AWZ Ventures,  Blackstone,  Insight Partners, . You can find more information at https://www.cbinsights.com/company/pcysys'
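A single row is enough for this question, but some questions may need context from several rows. As a sketch, the top `k` summaries could be selected and joined into one context string. The scores below are taken from the table above, the summary strings are abbreviated placeholders, and `top_k_summaries` is a hypothetical helper. With the DataFrame, `df.nlargest(k, 'prompt_similarity')['summary']` would give the same selection.

```python
import heapq

def top_k_summaries(scored_rows, k=3):
    """Return the k summaries with the highest similarity, best first."""
    best = heapq.nlargest(k, scored_rows, key=lambda row: row[0])
    return [summary for _, summary in best]

# (similarity, summary) pairs; summaries shortened to placeholders
scored_rows = [
    (0.826123, "Pendo ..."),
    (0.883416, "Pentera ..."),
    (0.804185, "Intarcia Therapeutics ..."),
    (0.819329, "Perimeter 81 ..."),
]

context = "\n".join(top_k_summaries(scored_rows, k=2))
print(context)
```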

## Question Answering with Embeddings

Let's insert the summary into the prompt and see if it actually helps the model:

In [31]:
summary = df.nlargest(1,'prompt_similarity').iloc[0]['summary'] 

In [36]:
prompt = f"""Only answer the question below if you have 100% certainty of the facts, use the context below to answer.
Here is some context:
{summary}
Q: What does the start-up company Pentera do and who invested in it?
A:"""


response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant"},
                    {"role": "user", "content": prompt},
                ],
                temperature=0,
                max_tokens=500,

)
print(response.choices[0].message.content.strip(" \n"))

Pentera is a cybersecurity company based in Petah Tikva, Israel. The investors in the company are AWZ Ventures, Blackstone, and Insight Partners.


Nice! Let's clean this all up by creating a function that wraps all the steps together. This is pretty limited by our data, but hopefully you can see how it generalizes to your own data sets and prompts. Each situation will be different!

In [37]:
def embed_prompt_lookup():
    # initial question
    question = input("What question do you have about a Unicorn company? ")
    # Get embedding
    prompt_embedding = get_embedding(question)
    # Get prompt similarity with embeddings
    # Note how this will overwrite the prompt similarity column each time!
    df["prompt_similarity"] = df['embedding'].apply(lambda vector: vector_similarity(vector, prompt_embedding))

    # get most similar summary
    summary = df.nlargest(1,'prompt_similarity').iloc[0]['summary'] 

    prompt = f"""Only answer the question below if you have 100% certainty of the facts, use the context below to answer.
            Here is some context:
            {summary}
            Q: {question}
            A:"""


    response = client.chat.completions.create(
                    model="gpt-3.5-turbo",
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant"},
                        {"role": "user", "content": prompt},
                    ],
                    temperature=0,
                    max_tokens=500,

    )
    print(response.choices[0].message.content.strip(" \n"))

In [38]:
embed_prompt_lookup()

What question do you have about a Unicorn company? What does the startup Momenta do?
Momenta is a startup in the field of Artificial Intelligence.
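Note that the function above always inserts the most similar row as context, even when the question has nothing to do with any company in the data. One possible guard, sketched here under stated assumptions, is to skip the context when the best similarity falls below a cutoff. The name `pick_context` and the 0.85 cutoff are hypothetical placeholders; in the similarity table earlier, unrelated companies still scored around 0.80, so any cutoff would need tuning on your own data.

```python
def pick_context(scored_rows, min_similarity=0.85):
    """Return the best-matching summary, or None if nothing is similar enough."""
    if not scored_rows:
        return None
    score, summary = max(scored_rows, key=lambda row: row[0])
    return summary if score >= min_similarity else None

# (similarity, summary) pairs; summaries shortened to placeholders
print(pick_context([(0.883416, "Pentera ..."), (0.826123, "Pendo ...")]))
print(pick_context([(0.804185, "Intarcia Therapeutics ...")]))
```

When `pick_context` returns `None`, the calling code could tell the user that the data set has no relevant entry instead of sending the question to the model.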
