# Text Embedding API

Text embedding lets us convert text documents directly into vectors with a single call to the OpenAI API.

Keep in mind that, like other OpenAI services, the embedding API is not free and has its own pricing structure. It is typically much cheaper than GPT on a per-token basis, since the processing is simpler. You can view the pricing here: https://openai.com/api/pricing/
### Library Imports

In [22]:
import os
import ast
import tiktoken
import numpy as np
import pandas as pd
import openai  # the calls below use the pre-1.0 openai package interface (openai.Completion, openai.Embedding)

In [2]:
openai.api_key = os.getenv("OPENAI_API_KEY")

### What happens when GPT doesn't know anything about a topic?

GPT is limited by its training data, which stops at a cut-off date (depending on the model, the cut-off can be quite recent). It is also limited by how esoteric the topic is.

Let's ask GPT about a "unicorn" company:

In [3]:
prompt = "What does the start-up company Pentera do and who invested in it?"

response = openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=500,
    model="text-davinci-003"
)
print(response["choices"][0]["text"].strip(" \n"))

Pentera is a start-up company that provides software solutions to help organizations manage their employee benefits programs. The company has raised $3.5 million in seed funding from investors including Y Combinator, SV Angel, and Social Leverage.


The model is hallucinating here, which is a common issue with LLMs: they are eager to please, and with enough context they can make up details that sound right but aren't. From my own research, Y Combinator did NOT invest in Pentera. Pentera also isn't an HR company; it is a penetration testing company that develops and provides an automated security validation platform to reduce cybersecurity risks.

We could try to alleviate this issue with some prompt engineering:

In [4]:
prompt = """Only answer the question below if you have 100% certainty of the facts.

Q: What does the start-up company Pentera do and who invested in it? Give its financial statistics as well
A:"""


response = openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=500,
    model="text-davinci-003"
)
print(response["choices"][0]["text"].strip(" \n"))

I cannot answer this question with 100% certainty.


#### Alright, very interesting! How can we help the model? We could supply some context from our own data. In fact, we have a data set about recent unicorn companies.

### Text Data

Let's grab some text data and send it to OpenAI to receive the embeddings back.

In [5]:
df = pd.read_csv("/Volumes/Data/Datasets/genai_datasets/unicorns.csv")
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",


Now we will create a function that builds a short text summary for each company. We'll store it in a new summary column and later use it as context for the model.

In [6]:
def create_summary(company, crunchbase_url, city, country, industry, investor_list):
    investors = f"The investors in the company are "
    for investor in ast.literal_eval(investor_list):
        investors += f"{investor}, "
        
    text = f"{company} has headquarters in {city} and is in the field of {industry}, {investors}. More information can be found at {crunchbase_url}"
    
    return text

In [7]:
df['summary'] = df.apply(lambda row: create_summary(row['Company'], row['Crunchbase Url'], row['City'], row['Country'], row['Industry'], row['Investors']), axis=1)

In [8]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York and is in t...
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York and is...
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto and is in ...
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad and is...
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva and is...


In [9]:
df['summary'][0]

'Esusu has headquarters in New York and is in the field of Fintech, The investors in the company are Next Play Ventures, Zeal Capital Partners, SoftBank Group, . More information can be found at https://www.cbinsights.com/company/esusu'
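
The summary above ends its investor list with a dangling `, .` because the loop appends a comma after every name. A minimal sketch of one way to avoid that, assuming the same list-as-string format found in the `Investors` column, is to join the names instead. The function name `create_summary_joined` is hypothetical, and changing the summary text would also change the token counts and embeddings shown in the rest of this notebook.

```python
import ast

def create_summary_joined(company, crunchbase_url, city, industry, investor_list):
    """Build a company summary, joining investor names without a trailing comma."""
    # investor_list is a string such as '["Accel","14W"]', so parse it first
    investors = ", ".join(ast.literal_eval(investor_list))
    return (
        f"{company} has headquarters in {city} and is in the field of {industry}. "
        f"The investors in the company are {investors}. "
        f"More information can be found at {crunchbase_url}"
    )

example = create_summary_joined(
    "Pentera",
    "https://www.cbinsights.com/company/pcysys",
    "Petah Tikva",
    "Cybersecurity",
    '["AWZ Ventures","Blackstone","Insight Partners"]',
)
print(example)
```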

#### Token Count

If you want to know how many tokens your text contains (to estimate your costs), OpenAI provides a library called "tiktoken" that counts tokens locally, before you send anything to the API.

Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.

**tiktoken** provides several encodings for OpenAI models, including:

* "gpt2" for most GPT-3 models
* "p50k_base" for code models, and Davinci models, like "text-davinci-003"
* "cl100k_base" for text-embedding-ada-002

In [10]:
def num_tokens_from_string(string, encoding_name):
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [11]:
# Now we will run an example on a text from summary column
num_tokens_from_string(df['summary'][0], encoding_name='cl100k_base')

53

**Note that this is higher than the word count. OpenAI tokens are not the same as words: punctuation and word length come into play. As a rough estimate, 1,000 tokens is about 750 words. With the function above you can check your real token count before sending text to OpenAI. Let's get a cost estimate for vectorizing our entire data set:**

In [12]:
df['token_count'] = df['summary'].apply(lambda items: num_tokens_from_string(items, 'cl100k_base'))

# return the total tokens
print(f"Total tokens across all entries in the dataframe - {sum(df['token_count'])}")

Total tokens across all entries in the dataframe - 65109
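
As a cross-check on the tokenizer counts, the rule of thumb mentioned earlier (about 750 words per 1,000 tokens) can be turned into a quick estimate that needs no tokenizer. This is only a sketch: the helper name `estimate_tokens_from_words` is hypothetical, and the ratio is a heuristic, not something the API guarantees.

```python
import math

def estimate_tokens_from_words(text, words_per_token=0.75):
    """Rough token estimate from a whitespace word count (rule of thumb only)."""
    word_count = len(text.split())
    return math.ceil(word_count / words_per_token)

# The first summary from the DataFrame, copied from the output shown earlier
summary = (
    "Esusu has headquarters in New York and is in the field of Fintech, "
    "The investors in the company are Next Play Ventures, Zeal Capital Partners, "
    "SoftBank Group, . "
    "More information can be found at https://www.cbinsights.com/company/esusu"
)

estimated = estimate_tokens_from_words(summary)
print(estimated)  # 47, versus the 53 tokens counted by tiktoken
```

The estimate is lower than the real count here, most likely because the URL and punctuation split into more tokens than ordinary words do, so treat it as a sanity check rather than a replacement for tiktoken.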


#### Estimating Embedding Costs

Let's now do a quick monetary estimate of how much this will all cost. Currently, the ADA-002 embedding model costs $0.0004 / 1K tokens.

Pay careful attention: that isn't 4 cents per 1,000 tokens (that would be $0.04). It is 1/100 of that, so quite inexpensive, depending on your document workload.

In [13]:
# Let's estimate the cost in USD
df['token_count'].sum() * 0.0004 / 1000

0.0260436
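
The same arithmetic can be wrapped in a small helper so the price is easy to update. This is a sketch: the function name `embedding_cost_usd` is hypothetical, and the default price is simply the figure quoted above, which may no longer match the current pricing page.

```python
def embedding_cost_usd(total_tokens, price_per_1k_tokens=0.0004):
    """Estimated cost in USD for embedding `total_tokens` tokens."""
    return total_tokens / 1000 * price_per_1k_tokens

# 65109 is the total token count computed for the summary column
cost = embedding_cost_usd(65109)
print(f"${cost:.4f}")  # $0.0260, i.e. under three cents for the whole data set
```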

Another thing to keep in mind is the size limit for embeddings. Currently, the ADA 002 model accepts at most 8191 tokens per input. Let's quickly check whether any rows exceed this limit:

In [14]:
df[df['token_count'] > 8191]

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count
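
The check above returns no rows, so nothing in this data set needs splitting. For data sets that do contain longer documents, one option is to split the token sequence into chunks that each fit under the limit and embed the chunks separately. The sketch below works on any list of token ids; the helper name `chunk_tokens` is hypothetical, and how to combine the resulting chunk embeddings is left open.

```python
def chunk_tokens(tokens, max_tokens=8191):
    """Split a list of token ids into consecutive chunks of at most max_tokens."""
    return [tokens[start:start + max_tokens]
            for start in range(0, len(tokens), max_tokens)]

# Stand-in for the output of encoding.encode(long_text)
fake_tokens = list(range(20000))

chunks = chunk_tokens(fake_tokens)
print([len(chunk) for chunk in chunks])  # [8191, 8191, 3618]
```

With tiktoken, each chunk can be turned back into text with `encoding.decode(chunk)` before it is sent for embedding.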


### Text Embeddings

To begin, we'll create a simple function to fetch an embedding. In our case, we'll specify the ADA 002 model:

In [15]:
def get_embedding(text):
    # Note: this function assumes you have already set your OpenAI API key!
    result = openai.Embedding.create(
        model='text-embedding-ada-002',
        input=text
    )
    return result["data"][0]["embedding"]

In [16]:
# Call the function above to get the embedding for the first summary
get_embedding(df['summary'][0])

[0.011609838344156742,
 -0.018315808847546577,
 -0.022067401558160782,
 -0.03170095384120941,
 -0.016051456332206726,
 0.00987472664564848,
 -0.01946808397769928,
 0.024251364171504974,
 0.002823743037879467,
 -0.023166082799434662,
 0.020338989794254303,
 -0.0147517966106534,
 0.00715482234954834,
 -0.017351113259792328,
 0.011750523000955582,
 -0.022991901263594627,
 0.000826522649731487,
 -0.02159845270216465,
 0.0180344395339489,
 0.0051349918358027935,
 -0.023487647995352745,
 -0.016815172508358955,
 -0.00987472664564848,
 0.01932070031762123,
 -0.0200040265917778,
 -0.0021270187571644783,
 -0.013472235761582851,
 0.0004218447720631957,
 -0.012474044226109982,
 -0.006598782725632191,
 0.016198839992284775,
 -0.0018506738124415278,
 -0.019025932997465134,
 -0.0036176068242639303,
 -0.012942993082106113,
 0.010370472446084023,
 -0.022402364760637283,
 0.02070075087249279,
 0.0202987939119339,
 -0.005513500887900591,
 0.030897041782736778,
 0.009546462446451187,
 -0.01357272453606128,
 ...]

In [17]:
# Now let's build embeddings for all the rows (one API call per row)
df['embeddings'] = df['summary'].apply(get_embedding)
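
Applying `get_embedding` row by row makes one API request per company. The embeddings endpoint also accepts a list of inputs, so an alternative is to send the summaries in batches. The sketch below shows only the batching logic: `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the fake function stands in for a real API call so the example runs offline.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed `texts` by calling `embed_batch` once per batch, preserving order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

# Stand-in for the API: returns a one-dimensional "embedding" per text
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

With the package interface used in this notebook, `embed_batch` could wrap `openai.Embedding.create(model='text-embedding-ada-002', input=batch)` and return `[item["embedding"] for item in result["data"]]`. The batch size of 100 is an arbitrary choice for illustration, not a documented limit.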

In [18]:
df.head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count,embeddings
0,"10/31/2022, 2:37:05 AM",Esusu,https://www.cbinsights.com/company/esusu,1.0,1/27/2022,2022,New York,United States,Fintech,"[""Next Play Ventures"",""Zeal Capital Partners"",...",,Esusu has headquarters in New York and is in t...,53,"[0.011540870182216167, -0.018299279734492302, ..."
1,"10/31/2022, 2:37:05 AM",Fever Labs,https://www.cbinsights.com/company/fever-labs,1.0,1/26/2022,2022,New York,United States,Internet software & services,"[""Accel"",""14W"",""GS Growth""]",,Fever Labs has headquarters in New York and is...,55,"[0.009516256861388683, 0.012649181298911572, -..."
2,"10/31/2022, 2:37:04 AM",Minio,https://www.cbinsights.com/company/minio,1.0,1/26/2022,2022,Palo Alto,United States,Data management & analytics,"[""General Catalyst"",""Nexus Venture Partners"",""...",,Minio has headquarters in Palo Alto and is in ...,52,"[0.004657353740185499, -0.04018688574433327, 0..."
3,"10/31/2022, 2:37:04 AM",Darwinbox,https://www.cbinsights.com/company/darwinbox,1.0,1/25/2022,2022,Hyderabad,India,Internet software & services,"[""Lightspeed India Partners"",""Sequoia Capital ...",,Darwinbox has headquarters in Hyderabad and is...,58,"[-0.0016203665873035789, -0.025722471997141838..."
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva and is...,54,"[0.013372410088777542, -0.012894092127680779, ..."


In [19]:
# Save the whole DataFrame, including the embeddings, to a CSV file
df.to_csv('unicorns_with_embeddings.csv', index=False)
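
One thing to watch when reloading this file: CSV stores each embedding list as a plain string, so it has to be parsed back into a list of floats before any similarity math. The sketch below demonstrates the round trip with the standard library only, using an in-memory buffer and made-up embedding values in place of the real file.

```python
import ast
import csv
import io

# A tiny stand-in row; the embedding values are made up for illustration
rows = [{"Company": "Pentera", "embeddings": [0.0133, -0.0128, 0.5]}]

# Write to an in-memory CSV, just as to_csv would write to disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Company", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the embedding column is now a string, not a list
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["embeddings"]
restored = ast.literal_eval(raw)  # parse the string back into a list of floats

print(type(raw).__name__, type(restored).__name__)  # str list
```

With pandas, the same parsing can be applied at load time by passing `converters={"embeddings": ast.literal_eval}` to `pd.read_csv`.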

### Find Document Similarity

We can now take a new string, embed it into a vector, and perform a cosine similarity search against all the vector embeddings in our DataFrame:

In [20]:
prompt = "What dpes the company Pantera do and who invested on it ?"

In [21]:
prompt_embedding = get_embedding(prompt)

In [23]:
# Function to find cosine similarity
def cos_similarity(vector1, vector2):
    """
    Returns the similarity between two vectors.
    
    Because OpenAI embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(vector1), np.array(vector2))
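
The dot-product shortcut only works because the vectors have length 1. The sketch below spells out the general cosine similarity formula in plain Python and shows that it matches the dot product for a unit-length vector but not for an unnormalized one. The function names and the small example vectors are illustrative only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """General cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

unit = [0.6, 0.8]      # length 1
scaled = [3.0, 4.0]    # same direction, length 5
axis = [1.0, 0.0]      # length 1

# For unit-length vectors, the dot product equals the cosine similarity
print(math.isclose(dot(unit, axis), cosine_similarity(unit, axis)))      # True

# For an unnormalized vector, the dot product is inflated by its length
print(math.isclose(dot(scaled, axis), cosine_similarity(scaled, axis)))  # False
```

If embeddings come from a source that does not normalize them, dividing by the norms as in `cosine_similarity` keeps the ranking meaningful.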

In [26]:
# Create a column holding the similarity between each row's embedding and the prompt embedding
df["prompt_similarity"] = df['embeddings'].apply(lambda vector: cos_similarity(vector, prompt_embedding))

In [28]:
# Let's sort the values in descending order, with the most similar at the top
df.sort_values("prompt_similarity", ascending=False).head()

Unnamed: 0,Updated at,Company,Crunchbase Url,Last Valuation (Billion $),Date Joined,Year Joined,City,Country,Industry,Investors,Company Website,summary,token_count,embeddings,prompt_similarity
4,"10/31/2022, 2:37:04 AM",Pentera,https://www.cbinsights.com/company/pcysys,1.0,1/11/2022,2022,Petah Tikva,Israel,Cybersecurity,"[""AWZ Ventures"",""Blackstone"",""Insight Partners""]",,Pentera has headquarters in Petah Tikva and is...,54,"[0.013372410088777542, -0.012894092127680779, ...",0.809533
474,"10/31/2022, 2:35:26 AM",Panther Labs,https://www.cbinsights.com/company/panther,1.4,12/2/2021,2021,San Francisco,United States,Cybersecurity,"[""Innovation Endeavors"",""s28 Capital"",""Lightsp...",,Panther Labs has headquarters in San Francisco...,54,"[0.015146072022616863, -0.009487980045378208, ...",0.804182
332,"10/31/2022, 2:36:44 AM",Pantheon Systems,https://www.cbinsights.com/company/pantheon,1.0,7/13/2021,2021,San Francisco,United States,Internet software & services,"[""Foundry Group"",""Scale Venture Partners"",""Sof...",,Pantheon Systems has headquarters in San Franc...,55,"[0.020426444709300995, -0.015611733309924603, ...",0.80148
368,"10/31/2022, 2:36:35 AM",Injective Protocol,https://www.cbinsights.com/company/injective-p...,1.0,4/20/2021,2021,New York,United States,Fintech,"[""Pantera Capital"",""Cadenza Ventures"",""BlockTo...",,Injective Protocol has headquarters in New Yor...,56,"[-0.013752120546996593, -0.005053041502833366,...",0.7947
372,"10/31/2022, 2:36:34 AM",The Zebra,https://www.cbinsights.com/company/insurance-z...,1.0,4/12/2021,2021,Austin,United States,E-commerce & direct-to-consumer,"[""Silverton Partners"",""Accel"",""Ballast Point V...",,The Zebra has headquarters in Austin and is in...,59,"[0.0027228675317019224, -0.02102956548333168, ...",0.78957


In [29]:
# Extract the summary for the largest cosine similarity value
df.nlargest(1, 'prompt_similarity').iloc[0]['summary']

'Pentera has headquarters in Petah Tikva and is in the field of Cybersecurity , The investors in the company are AWZ Ventures, Blackstone, Insight Partners, . More information can be found at https://www.cbinsights.com/company/pcysys'

### Question Answering with Embeddings
#### Now we can easily grab the summary for the most similar embedding, then insert that as context to our GPT request!

Let's try inserting the summary to see if it actually helps the model:

In [30]:
summary = df.nlargest(1, 'prompt_similarity').iloc[0]['summary']

In [32]:
prompt = f"""Only answer the question below if you have 100% certainty of the facts, use the context below to answer..
        Here is some context - 
        {summary}
        
        Q: What does the start up company Pantera do and who invested on it?
        A: """

response = openai.Completion.create(
    model = "text-davinci-003",
    prompt = prompt,
    temperature = 0,
    max_tokens = 256
)

print(response['choices'][0]['text'].strip("\n"))

 Pentera is a start up company in the field of Cybersecurity. AWZ Ventures, Blackstone, and Insight Partners have invested in the company.


#### Let's clean this all up by creating a function that wraps all the steps together. Note that this is pretty limited by our data, but hopefully you can see how it generalizes to your own data sets and prompts. Each situation will be different!

In [33]:
def embed_lookup():
    # Initial Question
    question = input("What question do you have about a unicorn company? ")
    # Get embedding
    prompt_embedding = get_embedding(question)
    # Get prompt similarity with embeddings
    df['prompt_similarity'] = df['embeddings'].apply(lambda vector: cos_similarity(vector, prompt_embedding))
    
    # get most similar summary
    summary = df.nlargest(1, 'prompt_similarity').iloc[0]['summary']
    
    # creating the GPT prompt
    prompt = f"""Only answer the question if you have 100% certainty of the facts, use the context below to answer.
    Here is some context.
    {summary}
    Q: {question}
    A: """
    
    # make the api call
    response = openai.Completion.create(model='text-davinci-003',
                                        prompt = prompt,
                                        temperature=0,
                                        max_tokens=256)
    # display the response
    print(response['choices'][0]['text'])

In [34]:
# Make the function call

embed_lookup()

 Pantheon Systems is a company headquartered in San Francisco that specializes in Internet software & services. The investors in the company are Foundry Group, Scale Venture Partners, and SoftBank Group.
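
The lookup above passes only the single most similar summary to the model, and the top scores in the earlier ranking are close together (0.8095, 0.8042, 0.8015), so a near miss in retrieval can put the wrong company in the context. One possible variation is to pass the top few summaries instead. The sketch below uses the hypothetical helper `top_k_context`, with company names standing in for full summaries and the similarity values taken from the ranking shown earlier.

```python
def top_k_context(summaries, similarities, k=3):
    """Join the k summaries with the highest similarity scores, best first."""
    ranked = sorted(zip(similarities, summaries),
                    key=lambda pair: pair[0], reverse=True)
    return "\n".join(summary for _, summary in ranked[:k])

# Company names stand in for the full summary strings
summaries = ["The Zebra", "Pentera", "Injective Protocol",
             "Pantheon Systems", "Panther Labs"]
similarities = [0.78957, 0.809533, 0.7947, 0.80148, 0.804182]

context = top_k_context(summaries, similarities, k=3)
print(context)
```

With the DataFrame from this notebook, the equivalent selection would be `df.nlargest(3, 'prompt_similarity')['summary']`. More context costs more prompt tokens, so the value of `k` is a trade-off rather than a fixed recommendation.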
