# Install Dependencies

In [None]:
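# The cells below use the legacy (pre-1.0) interface of the openai package, such as openai.Completion.
!pip install "openai<1.0"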

Run the following cell to load the Azure OpenAI API key from the Key Vault pre-configured in your environment.

In [None]:
from azureml.core import Workspace
ws = Workspace.from_config()
keyvault = ws.get_default_keyvault()
open_ai_api_key = keyvault.get_secret(name="open_ai_api_key")

# Summarization

Run the following cell to initialize the settings used to call the Azure OpenAI API.

In [None]:
import os
import openai
openai.api_type = "azure"
openai.api_base = "https://solliance-openai-01.openai.azure.com/"
openai.api_version = "2022-06-01-preview"
openai.api_key = open_ai_api_key

Edit the value of the `prompt` parameter with the text you would like to summarize, and then run the following cell to see the summary produced.

In [None]:
response = openai.Completion.create(
  engine="gpt3-text-ada-001",
  prompt="A neutron star is the collapsed core of a massive supergiant star, which had a total mass of between 10 and 25 solar masses, possibly more if the star was especially metal-rich.[1] Neutron stars are the smallest and densest stellar objects, excluding black holes and hypothetical white holes, quark stars, and strange stars.[2] Neutron stars have a radius on the order of 10 kilometres (6.2 mi) and a mass of about 1.4 solar masses.[3] They result from the supernova explosion of a massive star, combined with gravitational collapse, that compresses the core past white dwarf star density to that of atomic nuclei.\n\nTl;dr",
  temperature=0.7,
  max_tokens=60,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0,
  stop=None
)

response.choices[0].text
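
The prompt in the cell above is simply the text to summarize followed by a blank line and the cue `Tl;dr`. As a minimal sketch, a small helper can build that prompt for any text; the name `build_tldr_prompt` is hypothetical and not part of the `openai` package.

```python
def build_tldr_prompt(text):
    """Trim the text and append the "Tl;dr" cue used in the summarization cell."""
    cleaned = text.strip()  # drop leading and trailing whitespace
    return f"{cleaned}\n\nTl;dr"

# Made-up input for illustration.
prompt = build_tldr_prompt("  Neutron stars are the smallest and densest stellar objects.  ")
print(prompt)
```

The returned string could then be passed as the `prompt` argument of the completion call.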

# Introducing Embeddings

In the following cell, try entering any text as input (be careful to exclude newline characters) and observe the resulting embedding.

In [None]:
response = openai.Embedding.create(
  engine="gpt3-similarity-babbage-001",
  input="A neutron star is the collapsed core of a massive supergiant star."
)

In [None]:
response.data[0].embedding
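
To keep newline characters out of the input, as advised above, the text can be cleaned before it is sent. A minimal sketch, assuming that collapsing every run of whitespace into a single space is acceptable for your text; `prepare_embedding_input` is a hypothetical helper name.

```python
def prepare_embedding_input(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any whitespace, including \n and \r\n
    return " ".join(text.split())

# Made-up input containing line breaks.
raw = "A neutron star\nis the collapsed core\r\nof a massive supergiant star.\n"
print(prepare_embedding_input(raw))
```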

## Using Embeddings to cluster "similar" text and derive insights

The following cells download a sample dataset whose embeddings were precomputed with the `text-similarity-babbage` model, similar to how you called it in the previous cell.

In [None]:
import ast

import pandas as pd
import numpy as np

# for convenience, we precomputed the embeddings
datafile_path = "https://cdn.openai.com/API/examples/data/fine_food_reviews_with_embeddings_1k.csv"
df = pd.read_csv(datafile_path)
# parse each stringified embedding; literal_eval only accepts Python literals, unlike eval
df["babbage_similarity"] = df.babbage_similarity.apply(ast.literal_eval).apply(np.array)
matrix = np.vstack(df.babbage_similarity.values)
matrix.shape
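
Because the CSV file stores each embedding as text, the value has to be parsed back into numbers before it can be stacked into a matrix. The sketch below shows that step in isolation using `ast.literal_eval` from the standard library, which accepts only Python literals and never executes code. The sample string and the helper name `parse_embedding` are made up for illustration.

```python
import ast

# Hypothetical cell value: an embedding serialized as text in a CSV column.
raw_value = "[0.25, -0.5, 1.0]"

def parse_embedding(text):
    """Parse a stringified list of numbers into a list of floats."""
    values = ast.literal_eval(text)  # literals only; arbitrary expressions are rejected
    return [float(v) for v in values]

embedding = parse_embedding(raw_value)
print(len(embedding), embedding)
```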


In [None]:
df.head(5)

In [None]:
from sklearn.cluster import KMeans

n_clusters = 4

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(matrix)
labels = kmeans.labels_
df["Cluster"] = labels

df.groupby("Cluster").Score.mean().sort_values()
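
To build intuition for what the clustering step does, here is a minimal pure-Python sketch of the assign-and-update loop at the heart of k-means. It is not scikit-learn's implementation: the `k-means++` initialization and convergence checks are omitted, the starting centroids are passed in by hand, and the two-dimensional points are made up to stand in for real embeddings.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign/update rounds; return (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its centroid
        centroids = new_centroids
    return labels, centroids

# Toy 2-D "embeddings" forming two obvious groups (made-up values).
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

Each label plays the same role as a value in the `Cluster` column: it records which group a row was assigned to.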

In the following cell you will use t-SNE, a dimensionality reduction technique, to project the embeddings down to two dimensions so that they can be visualized easily in a scatter plot. Run the cell to see the clusters identified.

In [None]:
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt

tsne = TSNE(
    n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200
)
vis_dims2 = tsne.fit_transform(matrix)

x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]

for category, color in enumerate(["purple", "green", "red", "blue"]):
    xs = np.array(x)[df.Cluster == category]
    ys = np.array(y)[df.Cluster == category]
    plt.scatter(xs, ys, color=color, alpha=0.3)

    avg_x = xs.mean()
    avg_y = ys.mean()

    plt.scatter(avg_x, avg_y, marker="x", color=color, s=100)
plt.title("Clusters identified, visualized in 2D using t-SNE")

Run the following cell to see a few sample reviews from each cluster, along with a theme generated from them.

In [None]:
# Read a few sample reviews belonging to each cluster.
rev_per_cluster = 3

for i in range(n_clusters):
    print(f"Cluster {i} Theme:", end=" ")

    reviews = "\n".join(
        df[df.Cluster == i]
        .combined.str.replace("Title: ", "")
        .str.replace("\n\nContent: ", ":  ")
        .sample(rev_per_cluster, random_state=42)
        .values
    )
    response = openai.Completion.create(
        engine="gpt3-text-davinci-002",
        prompt=f'What do the following customer reviews have in common?\n\nCustomer reviews:\n"""\n{reviews}\n"""\n\nTheme:',
        temperature=0,
        max_tokens=64,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    print(response["choices"][0]["text"].replace("\n", ""))

    sample_cluster_rows = df[df.Cluster == i].sample(rev_per_cluster, random_state=42)
    for j in range(rev_per_cluster):
        print(sample_cluster_rows.Score.values[j], end=", ")
        print(sample_cluster_rows.Summary.values[j], end=":   ")
        print(sample_cluster_rows.Text.str[:70].values[j])

    print("-" * 100)

## Searching with embeddings

Get the embedding for the search text. For search, the query and the documents are embedded with different models, and matches are resolved by comparing the two kinds of embeddings:

- Query: For our search text, we need to use a *query* model (e.g., `text-search-babbage-query-001`)
- Documents: For the documents we want to search against, these embeddings need to be calculated with a *doc* model (e.g., `text-search-babbage-doc-001`)

These are two different engines in the parlance of Azure OpenAI.
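
One way to keep the two roles from being mixed up is to look the engine up by role. The sketch below uses the deployment names that appear in this notebook; your own deployment names may differ, and `engine_for` is a hypothetical helper, not part of the `openai` package.

```python
# Deployment names as they appear in this notebook; yours may differ.
SEARCH_ENGINES = {
    "query": "gpt3-search-babbage-query-001",
    "doc": "gpt3-search-babbage-doc-001",
}

def engine_for(role):
    """Return the engine to use for a search query ("query") or a document ("doc")."""
    if role not in SEARCH_ENGINES:
        raise ValueError(f"role must be one of {sorted(SEARCH_ENGINES)}, got {role!r}")
    return SEARCH_ENGINES[role]

print(engine_for("query"))
print(engine_for("doc"))
```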

In [None]:
from openai.embeddings_utils import get_embedding, cosine_similarity

Let's calculate the document embeddings for our corpus of documents that we can search against.

We'll start with getting the embedding for one document from our corpus so we can examine its shape. 

In [None]:
response = openai.Embedding.create(
  engine="gpt3-search-babbage-doc-001",
  input=df.head(1)['Text'][0]
)

doc_text_embedding = response.data[0].embedding

The embedding of the doc has the following number of dimensions (run the cell to find out):

In [None]:
len(doc_text_embedding)

So the document is embedded with 2,048 dimensions when using Babbage.

Now, get the embedding for each document in our corpus and store that value in the `babbage_search_2` field that we add to the dataframe.

NOTE: This will take about 2-3 minutes.

In [None]:
df["babbage_search_2"] = df.Text.apply(lambda x: get_embedding(x, engine="gpt3-search-babbage-doc-001"))


Next, recall that when you search you have to get the embedding of your search text using the `query` model. 

Let's start by getting the embedding for a search query and examine how many dimensions it has.

In [None]:
response = openai.Embedding.create(
  engine="gpt3-search-babbage-query-001",
  input="great beans" #try other inputs like "bad taste", "spoilt" or "pet food"
)

search_text_embedding = response.data[0].embedding

As before, the length of the array tells us the number of dimensions used by this embedding.

In [None]:
len(search_text_embedding)

Did you notice something interesting in comparing the number of dimensions between the query and the doc? 

Yes, that's right! They have the same dimensions, even though one is a short text string and another is a long document. In fact, for search to work, they MUST have the same dimensions. 
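
Since matching dimensions are a hard requirement, it can help to check them explicitly before comparing two embeddings. A minimal sketch with made-up three-element vectors standing in for the real embeddings; `check_same_dimensions` is a hypothetical helper name.

```python
def check_same_dimensions(doc_embedding, query_embedding):
    """Return the shared dimension count, or raise ValueError on a mismatch."""
    if len(doc_embedding) != len(query_embedding):
        raise ValueError(
            f"dimension mismatch: doc has {len(doc_embedding)}, "
            f"query has {len(query_embedding)}"
        )
    return len(doc_embedding)

# Made-up short vectors stand in for the real embeddings.
dims = check_same_dimensions([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
print(dims)
```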

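NOTE: This explains why we couldn't use the pre-computed embedding in the `babbage_search` field of the sample data file. If you run the following cell you'll see that its length does not match the query's. Keep in mind that `read_csv` loads this column as plain text and this notebook never parses it, so the number reported is the length of that string rather than a count of dimensions. If you were to try to compute the cosine similarity with this value, you would get an error.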

In [None]:
len(df.head(1)['babbage_search'][0])

Ok, now we get to the fun part. Let's compute just how close our query is to the first doc in the sample data set. 

In [None]:
cosine_similarity(df.head(1)['babbage_search_2'][0], search_text_embedding)

The `cosine_similarity` function returns a value between `-1.0` and `1.0`, where the closer the value is to `1.0`, the more similar the two embeddings are.
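
Cosine similarity is the dot product of the two vectors divided by the product of their lengths. The pure-Python sketch below illustrates the calculation on made-up two-dimensional vectors; the helper name `cosine` is illustrative and this is not the library's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine([1.0, 0.0], [2.0, 0.0])       # same direction
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])  # perpendicular
opposite = cosine([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Only the direction of the vectors matters: scaling either vector leaves the score unchanged.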

Run the next cell to compute the similarity between the query and each doc in the dataset. We'll then sort the data by similarity score so that higher scores appear first and pick the top N to see the best results.

In [None]:
df["similarities"] = df.babbage_search_2.apply(lambda x: cosine_similarity(x, search_text_embedding))

Let's take a look at the top 3 closest results.

In [None]:
pd.set_option('display.max_colwidth', 320)
pd.set_option('display.max_rows', None)

df.sort_values("similarities", ascending=False).head(n=3)[['Text', 'similarities']]

Do the results above look like a good match for your search query?
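
The score, sort, and take-the-top-N pattern used in this section can also be sketched without pandas. The example below is an illustration only: the document texts and their two-dimensional vectors are made up, and `cosine` and `top_n` are hypothetical helpers rather than functions from the `openai` package.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_n(query_embedding, docs, n=3):
    """Return the n (score, text) pairs most similar to the query, best first."""
    scored = [(cosine(emb, query_embedding), text) for text, emb in docs]
    return heapq.nlargest(n, scored)

# Made-up documents with toy 2-D vectors in place of real embeddings.
docs = [
    ("great coffee beans", [0.9, 0.1]),
    ("dog food my pet loves", [0.1, 0.9]),
    ("stale beans, bad taste", [0.6, 0.4]),
]
results = top_n([1.0, 0.0], docs, n=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```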