<a href="https://colab.research.google.com/github/tadlebeck/tadlebeck/blob/main/Create_Cases_Pinecone_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/integrations/openai/semantic_search_openai.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/integrations/openai/semantic_search_openai.ipynb)

# Semantic Search with Pinecone and OpenAI

In this guide you will learn how to use the OpenAI Embedding API to generate language embeddings, and then index those embeddings in the Pinecone vector database for fast and scalable vector search.

This is a powerful and common combination for building semantic search, question-answering, threat-detection, and other applications that rely on NLP and search over a large corpus of text data.

The basic workflow looks like this:

**Embed and index**

* Use the OpenAI Embedding API to generate vector embeddings of your documents (or any text data).
* Upload those vector embeddings into Pinecone, which can store and index millions/billions of these vector embeddings, and search through them at ultra-low latencies.

**Search**

* Pass your query text or document through the OpenAI Embedding API again.
* Take the resulting vector embedding and send it as a query to Pinecone.
* Get back semantically similar documents, even if they don't share any keywords with the query.

![Architecture overview](https://files.readme.io/6a3ea5a-pinecone-openai-overview.png)
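The workflow above can be sketched end to end without any external service. This is only an illustration of the data flow: `toy_embed` is a hypothetical bag-of-words stand-in for the OpenAI Embedding API (so, unlike a real embedding model, it only matches shared keywords), and the plain dictionary stands in for the Pinecone index.

```python
import numpy as np

# Three made-up documents, keyed by ID.
documents = {
    "0": "driver ran a red light",
    "1": "slip and fall in a grocery store",
    "2": "dog bite on the sidewalk",
}

# Fixed vocabulary built from the documents (toy stand-in for a real model).
vocab = sorted({w for text in documents.values() for w in text.lower().split()})
word_to_col = {w: i for i, w in enumerate(vocab)}


def toy_embed(text):
    """Hypothetical embedding: unit-length word-count vector over `vocab`."""
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        col = word_to_col.get(word)
        if col is not None:  # words outside the vocabulary are ignored
            vec[col] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Embed and index: store one vector per document ID.
index = {doc_id: toy_embed(text) for doc_id, text in documents.items()}


def search(query, top_k=2):
    """Embed the query, then rank documents by cosine similarity."""
    xq = toy_embed(query)
    scores = {doc_id: float(vec @ xq) for doc_id, vec in index.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


results = search("car ran the red light")
for doc_id, score in results:
    print(f"{score:.2f}: {documents[doc_id]}")
```

Because the vectors are normalized to unit length, the dot product equals the cosine similarity, which is the kind of score a vector index returns for each match.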

Let's get started...

## Setup

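We first need to set up our environment and retrieve API keys for OpenAI and Pinecone. Let's start with the environment: we need the OpenAI and Pinecone clients, along with `tiktoken` for counting tokens (HuggingFace *Datasets* is installed as well, although this notebook reads its data from a CSV file):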

In [None]:
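# pin the pre-1.0 OpenAI client and the pre-3.0 Pinecone client, whose APIs this notebook uses
!pip install -qU "pinecone-client<3" "openai<1" datasets tiktoken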


### Creating Embeddings

Then we initialize our connection to OpenAI Embeddings *and* Pinecone vector DB. Sign up for an API key over at [OpenAI](https://beta.openai.com/signup) and [Pinecone](https://app.pinecone.io).

In [None]:
import openai
import tiktoken  # for counting tokens
import pandas as pd
import numpy as np
import os

openai.api_key = os.environ["OPENAI_API_KEY"]
# read the API key from the OPENAI_API_KEY environment variable rather than hard-coding it;
# create a key from the top-right dropdown on the OpenAI website

# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

from openai.embeddings_utils import get_embedding
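API keys written directly into a notebook are easy to leak when the notebook is shared. A minimal sketch of reading them from environment variables instead; `require_env` is a hypothetical helper, and the variable names in the comments are assumptions rather than anything the OpenAI or Pinecone clients require.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook")
    return value


# Example usage (variable names are assumptions):
# openai.api_key = require_env("OPENAI_API_KEY")
# pinecone_key = require_env("PINECONE_API_KEY")
```

Failing early with a named variable is usually easier to debug than an authentication error raised later by the first API call.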



Next, define a helper for cleaning up newlines, then load the `cases.csv` file containing the historical case information.

In [None]:

def remove_newlines(serie):
    serie = serie.str.replace('\n', ' ', regex=False)
    serie = serie.str.replace('\\n', ' ', regex=False)
    serie = serie.str.replace('  ',' ', regex=False)
    serie = serie.str.replace('  ',' ', regex=False)
    return serie
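To see what `remove_newlines` does, here is a small sketch that repeats the function on three made-up strings: one with a real newline, one with a literal backslash-n, and one with a run of spaces.

```python
import pandas as pd


def remove_newlines(serie):
    serie = serie.str.replace('\n', ' ', regex=False)   # real newline characters
    serie = serie.str.replace('\\n', ' ', regex=False)  # literal backslash followed by n
    serie = serie.str.replace('  ', ' ', regex=False)   # collapse double spaces
    serie = serie.str.replace('  ', ' ', regex=False)   # second pass for longer runs
    return serie


sample = pd.Series(["Settlement\n", "head\\nneck", "brain    damage"])
cleaned = remove_newlines(sample).tolist()
print(cleaned)
```

Note that a trailing newline becomes a trailing space rather than disappearing, and that two passes only fully collapse runs of up to four spaces.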


In [None]:
df = pd.read_csv('sample_data/cases.csv')
df = df.replace(np.nan, '')
df.tail()

Unnamed: 0.1,Unnamed: 0,Summary,Date,Amount,Type,State,Venue,Court,N_Injury_type,Injury_Type,N_Case_Type,Case_Type,Case Name,Party,Facts,Injury,Expert,Attorney,Judge,Insurance
995,995,Auto/Mower Accident - Highway - Brain Injury\n,"May 17, 2007\n",2100000.0,Verdict-Plaintiff\n,South Carolina\n,Richland County\n,"Richland County, United States District Court,...",3,leg\nhead - closed head injury\nfoot/heel - fo...,1,Motor Vehicle\n,"Chris Stuckslager, et al. v. E & H Mowers Cont...","E & H Mowers Contractors, Inc. and Gerald Davi...",A 39 year old man sustained permanent brain da...,Permanent brain injury with serious cognitive ...,"Joseph Healey MD , Luanne Ahrendt , Oliver Woo...","Eugene C. Fulton , J. Edward Bell III , Charle...",Hon Joseph F. Anderson\n,Auto Owners\n
996,996,Auto Accident - Motorcycle - Settlement\n,"Aug 01, 2009\n",2100000.0,Settlement\n,Georgia\n,Cherokee County\n,"Cherokee County, State Court, GA\n",4,back\nhead - closed head injury\nchest - fract...,1,Motor Vehicle\n,Gregory Rodriguez v. Anonymous Driver\n,"Anonymous Driver, Gregory Rodriguez\n",A personal injury suit was brought after a 53 ...,Fractured lumbar spine at T1-T5 requiring plac...,,"Matthew Dwyer , Withheld upon request of the c...",W. Alan Jordan\n,
997,997,Bus passenger claimed sudden stop caused fall\n,"Nov 27, 2017\n",2100000.0,Mediated Settlement\n,California\n,San Francisco County\n,"Superior Court of San Francisco County, San Fr...",10,back - stenosis\nhead\nneck - stenosis\nbrain ...,7,Motor Vehicle - Bus\nGovernment - Counties\nMo...,Luz Godizano and Robert Godizano v. City and C...,"Iris Ivette Monge,Daniel Antonio Monge Oliva,C...","On Nov. 13, 2015, plaintiff Luz Godizano, 59, ...","Godizano suffered a traumatic brain injury, in...","V. Paul Herbert C.P.S.A. , Dawn A. Osterweil P...","Jeffrey R. Windsor , Karen E. Kirby , Thomas J...",Rebecca J. Westerfield\n,American Automobile Association\n
998,998,Parties debated whether motor scooter equals '...,"Apr 09, 2010\n",2100000.0,Settlement\n,New York\n,Bronx County\n,"Bronx Supreme, NY\n",9,head\nhead - headaches\nbrain - brain damage\n...,2,Motor Vehicle - Crosswalk\nMotor Vehicle - Mot...,Olegario Batiz v. Jose H. Rivera & Professiona...,"Jose H. Rivera,Professional Charter Service In...","On May 27, 2008, plaintiff Olegario Batiz, 86,...","Batiz was knocked off of his scooter, and his ...","Narayan Sundaresan M.D. , Kuldip Kaur Sachdev ...","Jason Shapiro , Mary A. Jewels\n",Howard R. Silver\n,"Star Insurance Co. , Wesco Insurance Co.\n"
999,999,Octogenarian got record settlement after being...,"Nov 15, 2002\n",2100000.0,Mediated Settlement\n,California\n,Orange County\n,"Superior Court of Orange County, Santa Ana, CA\n",3,head\nbrain - brain damage\nother - hematoma\n,2,Motor Vehicle - Pedestrian\nMotor Vehicle - Ri...,Maria Rubio v. Jennifer Prairie and Michael Pr...,"Michael Prairie,Jennifer Prairie, Maria Rubio\n","On Jan. 14, 2002, plaintiff Maria Rubio, an 81...","After the accident, Rubio complained of pain t...","Wayne Lancaster , Kenneth Nudleman , Katherine...","Richard A. Cohn , Jay S. McClaugherty , Mark L...","Steven L. Perk , Jack Daniels\n",Allstate Insurance Co.\n


In [None]:
df["text"] = (
    "Type: " + df.Type.str.strip() + "; Facts: " + df.Facts.str.strip() \
    + "Case_Type: " + df.Case_Type.str.strip() + "Injurty_Type: " + df.Injury_Type.str.strip()
)
df['text'] = remove_newlines(df.text)

In [None]:
df.head()

In [None]:
from sys import getsizeof

too_big = []

for text in df['text'].tolist():
    if getsizeof(text) > 5000:
        too_big.append((text, getsizeof(text)))

print(f"{len(too_big)} / {len(df)} records are too big")

0 / 1000 records are too big
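`sys.getsizeof` reports the in-memory size of the Python string object, which includes interpreter overhead, so it only approximates how large the text is once serialized. A minimal sketch of measuring the UTF-8 encoded length instead; `utf8_size` and `find_oversized` are hypothetical helpers, and the limit of 5000 is simply copied from the cell above.

```python
import sys


def utf8_size(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def find_oversized(texts, limit):
    """Return the positions of texts whose UTF-8 size exceeds `limit` bytes."""
    return [i for i, text in enumerate(texts) if utf8_size(text) > limit]


sample = "Type: Settlement; Facts: A driver ran a red light."
print("object size :", sys.getsizeof(sample))
print("payload size:", utf8_size(sample))
print("oversized   :", find_oversized([sample, "x" * 6000], limit=5000))
```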


In [None]:
encoding = tiktoken.get_encoding(embedding_encoding)

# omit cases that are too long to embed
df["n_tokens"] = df.text.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].copy()  # copy so the column assignment below does not write to a view
print(len(df))

# one embedding request per case
df["embeddings"] = df.text.apply(lambda x: get_embedding(x, engine=embedding_model))



In [None]:
df.to_csv("sample_data/cases_with_embeddings.csv")
df.to_parquet('sample_data/cases_with_embeddings.parquet')
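The two output formats behave differently for the `embeddings` column: a CSV file stores each list as text, so it has to be parsed again after loading. A small sketch of that round trip, using an in-memory buffer and made-up two-dimensional vectors in place of the real file and embeddings.

```python
import ast
import io

import pandas as pd

df_demo = pd.DataFrame({"id": ["0", "1"], "embeddings": [[0.1, 0.2], [0.3, 0.4]]})

# Write to an in-memory CSV and read it back.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After the round trip each embedding is a string such as "[0.1, 0.2]".
first = loaded["embeddings"].iloc[0]
print(type(first).__name__)

# Parse the strings back into Python lists before using them as vectors.
loaded["embeddings"] = loaded["embeddings"].apply(ast.literal_eval)
print(loaded["embeddings"].iloc[0])
```

This is one reason the later cells reload the data from the parquet file rather than from the CSV.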


In [None]:
import os
import pinecone

index_name = 'cases-open-ai'

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],  # read the key from the environment rather than hard-coding it
    environment="asia-southeast1-gcp-free"  # find next to api key in console
)
# check if the index already exists (only create index if not)
if index_name not in pinecone.list_indexes():
    # use .iloc[0] because label 0 may have been filtered out above
    pinecone.create_index(index_name, dimension=len(df.embeddings.iloc[0]))
# connect to index
index = pinecone.Index(index_name)

from tqdm.autonotebook import tqdm


## Populating the Index

Now we take the 1,000 case records prepared above.

We already created a vector embedding for each case using OpenAI, so all that remains is to `upsert` the ID, vector embedding, and metadata (including the original text) for each case to Pinecone.

In [None]:
df['id'] = [str(i) for i in range(len(df))]
df.head()



In [None]:
df = pd.read_parquet('sample_data/cases_with_embeddings.parquet')
df['id'] = [str(i) for i in range(len(df))]

In [None]:
type(df.Expert[0])

str

In [None]:
from tqdm.auto import tqdm

batch_size = 32

for i in tqdm(range(0, len(df), batch_size)):
    i_end = min(i+batch_size, len(df))
    df_slice = df.iloc[i:i_end]
    to_upsert = [
        (
            row['id'],
            row['embeddings'].tolist(),
            {
                'text': row['text'],
                'Summary': row['Summary'],
                'Date': row['Date'],
                'Amount': row['Amount'],
                'Type': row['Type'],
                'State': row['State'],
                'Venue': row['Venue'],
                'Court': row['Court'],
                'Injury_Type': row['Injury_Type'],
                'Facts': row['Facts'],
                'Injury': row['Injury'],
                'Expert': row['Expert'],
                'Attorney': row['Attorney'],
                'Judge': row['Judge'],
                'Insurance': row['Insurance'],
                'Case_Type': row['Case_Type'],
                'n_tokens': row['n_tokens']
            }
        ) for _, row in df_slice.iterrows()
    ]
    index.upsert(vectors=to_upsert)
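The loop above sends the records in batches of 32 rather than one request per case. A minimal sketch of just the slicing arithmetic, with a hypothetical helper `batch_ranges` and the row count taken from this dataset:

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) positions covering n_rows in chunks of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


ranges = list(batch_ranges(1000, 32))
print(len(ranges), "batches")
print("first:", ranges[0])
print("last :", ranges[-1])
```

The `min(...)` keeps the final batch from running past the end of the DataFrame, so the last batch is simply shorter than the rest.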


In [None]:
len(df)

1000

In [None]:
mappings = {row['id']: row['text'] for _, row in df[['id', 'text']].iterrows()}

In [None]:
import json

with open('sample_data/mapping.json', 'w') as fp:
    json.dump(mappings, fp)

---

## Querying

With our data indexed, we're now ready to move on to performing searches. This follows a similar process to indexing. We start with a text `query` that we would like to use to find similar cases. As before, we encode it with the same OpenAI embedding model (`text-embedding-ada-002`) to create a *query vector* `xq`. We then use `xq` to query the Pinecone index.

In [None]:
query = "The driver of the car ran a red light and collided with a the defendants vehicle causing severe brain damage to the passengers in the defendants car"

xq = openai.Embedding.create(input=query, engine=embedding_model)['data'][0]['embedding']

Now query...

In [None]:
res = index.query([xq], top_k=5, include_metadata=True)
res

{'matches': [{'id': '183',
              'metadata': {'Amount': 15000000.0,
                           'Attorney': 'Edward C. Bassett Jr.\n',
                           'Case_Type': 'Motor Vehicle\n'
                                        'Motor Vehicle - Alcohol Involvement\n',
                           'Court': 'Worcester County, Superior Court, '
                                    'Worcester, MA\n',
                           'Date': datetime.datetime(1999, 6, 30, 0, 0),
                           'Expert': '',
                           'Facts': 'Plaintiff was operating her vehicle when '
                                    'she was involved in a collision with a '
                                    'vehicle being operated by defendant. '
                                    'Defendant, who was intoxicated, operating '
                                    'without his headlights, and speeding, ran '
                                    'a red light. Plaintiff was ejected from '
  

The response from Pinecone includes our original text in the `metadata` field. Let's print out the `top_k` most similar cases and their respective similarity scores.

In [None]:
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.89: Type: Verdict-Plaintiff; Facts: Plaintiff was operating her vehicle when she was involved in a collision with a vehicle being operated by defendant. Defendant, who was intoxicated, operating without his headlights, and speeding, ran a red light. Plaintiff was ejected from her vehicle and fractured her skull on the pavement resulting in permanent brain damage. While plaintiff was lying face down on the road, defendant got out o...Case_Type: Motor Vehicle Motor Vehicle - Alcohol InvolvementInjurty_Type: head - fracture, skull head - closed head injury
0.89: Type: Verdict-Plaintiff; Facts: Plaintiff, a fifteen year old girl, was a passenger in the back seat of an automobile operated by a third party. Defendant was traveling on an intersecting street. Defendant allegedly ran a red light at the intersection of the two streets and broadsided the vehicle in which plaintiff was a passenger. Defendant, who was under the legal drinking age, was intoxicated at the time of the collision...Ca
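Since each match carries its metadata, the response can also be flattened into a table for easier inspection. A rough sketch, assuming the response supports the same dictionary-style access used above; `matches_to_frame` is a hypothetical helper, and `mock_res` is a hand-written stand-in whose second entry is a placeholder, not real output.

```python
import pandas as pd


def matches_to_frame(res, fields=("Amount", "Type")):
    """Flatten res['matches'] into one row per match with selected metadata."""
    rows = []
    for match in res["matches"]:
        row = {"id": match["id"], "score": match["score"]}
        metadata = match["metadata"]
        for field in fields:
            value = metadata.get(field)
            if isinstance(value, str):
                value = value.strip()  # stored strings end with a newline
            row[field] = value
        rows.append(row)
    return pd.DataFrame(rows)


# Hand-written stand-in for a query response (second entry is a placeholder).
mock_res = {
    "matches": [
        {"id": "183", "score": 0.89,
         "metadata": {"Amount": 15000000.0, "Type": "Verdict-Plaintiff\n"}},
        {"id": "example-2", "score": 0.5,
         "metadata": {"Amount": 1.0, "Type": "Settlement\n"}},
    ]
}

frame = matches_to_frame(mock_res)
print(frame)
```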

Looks good. Next, let's rephrase the query and look at the award amounts of the ten most similar cases.

In [None]:
query = "The defendant was intoxicated and ran a red light, he broadsided the plaintifs car and caused severe brain trauma"

# create the query embedding
xq = openai.Embedding.create(input=query, engine=embedding_model)['data'][0]['embedding']

# query, returning the top 10 most similar results
res = index.query([xq], top_k=10, include_metadata=True)

amount = []


for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['Amount']}")
    amount.append(match['metadata']['Amount'])



0.92: 15000000.0
0.91: 19563203.0
0.89: 3200000.0
0.89: 4000000.0
0.88: 10200000.0
0.88: 5400000.0
0.88: 14000000.0
0.88: 20800000.0
0.87: 3900000.0
0.87: 6200000.0


In [None]:
import numpy as np

# summarize the award amounts of the retrieved cases
avg_amt = np.mean(amount)
max_amt = max(amount)
min_amt = min(amount)
print(f"Max ${max_amt:12.2f}")
print(f"Avg ${avg_amt:12.2f}")
print(f"Min ${min_amt:12.2f}")


Max $ 20800000.00
Avg $ 10226320.30
Min $  3200000.00
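With only ten amounts spanning a wide range, the mean is sensitive to the largest awards. As a possible extension, the sketch below recomputes the summary from the scores and amounts printed above and adds the median and a similarity-weighted average; weighting by score is an assumption about what might be useful, not something the original analysis does.

```python
import numpy as np

# Scores and amounts copied from the query output above.
scores = np.array([0.92, 0.91, 0.89, 0.89, 0.88, 0.88, 0.88, 0.88, 0.87, 0.87])
amounts = np.array([
    15000000.0, 19563203.0, 3200000.0, 4000000.0, 10200000.0,
    5400000.0, 14000000.0, 20800000.0, 3900000.0, 6200000.0,
])

mean_amt = amounts.mean()
median_amt = np.median(amounts)
# Each amount contributes in proportion to its similarity score.
weighted_amt = np.average(amounts, weights=scores)

print(f"Mean     ${mean_amt:12.2f}")
print(f"Median   ${median_amt:12.2f}")
print(f"Weighted ${weighted_amt:12.2f}")
```

Because the scores here are all within a few hundredths of each other, the weighted average stays close to the plain mean; the weighting would matter more for result sets with a wider spread of scores.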


And again...

In [None]:
query = "The car ran the red light and broasided another car causing severe bodily injury"

# create the query embedding
xq = openai.Embedding.create(input=query, engine=embedding_model)['data'][0]['embedding']

# query, returning the top 10 most similar results
res = index.query([xq], top_k=10, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.87: Type: Verdict-Plaintiff; Facts: Plaintiff, a fifteen year old girl, was a passenger in the back seat of an automobile operated by a third party. Defendant was traveling on an intersecting street. Defendant allegedly ran a red light at the intersection of the two streets and broadsided the vehicle in which plaintiff was a passenger. Defendant, who was under the legal drinking age, was intoxicated at the time of the collision...Case_Type: Motor Vehicle - Broadside Motor Vehicle - Passenger Motor Vehicle - Alcohol InvolvementInjurty_Type: head - closed head injury
0.86: Type: Verdict-Plaintiff; Facts: Plaintiff was operating her vehicle when she was involved in a collision with a vehicle being operated by defendant. Defendant, who was intoxicated, operating without his headlights, and speeding, ran a red light. Plaintiff was ejected from her vehicle and fractured her skull on the pavement resulting in permanent brain damage. While plaintiff was lying face down on the road, defendant

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Looks great. Our semantic search pipeline is able to capture the meaning of each query and return the most semantically similar cases from those already indexed.

Once we're finished with the index we delete it to save resources.
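A small sketch of that cleanup step, written as a hypothetical helper that takes the client as an argument so it can be tried without a live connection. It assumes the client exposes `list_indexes()` (used earlier in this notebook) and `delete_index(name)`, as the 2.x `pinecone` module does.

```python
def delete_index_if_exists(client, name):
    """Delete the named index if the client reports it; return True if deleted."""
    if name in client.list_indexes():
        client.delete_index(name)
        return True
    return False


# Example usage with the objects defined earlier in this notebook:
# delete_index_if_exists(pinecone, index_name)
```

Deleting an index removes all vectors stored in it, so it is worth keeping the parquet file of embeddings if the index may need to be rebuilt later.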

---