# The Embed Endpoint Colab

In this lab, we'll learn how to analyze a text dataset using Cohere's Embed endpoint. This colab accompanies the [Embed endpoint lesson](https://docs.cohere.com/docs/embed-endpoint/) of LLM University.

This is part of a bigger [colab](https://colab.research.google.com/github/cohere-ai/notebooks/blob/main/notebooks/Hello_World_Meet_Language_AI.ipynb#) that covers more endpoints. Feel free to check it out!

# Setting Up
The first step is to install the Cohere Python SDK. Next, create an API key, which you can generate from the Cohere [dashboard](https://os.cohere.ai/register) or [CLI tool](https://docs.cohere.ai/cli-key).

In [6]:
# Install the libraries
! pip install cohere altair umap-learn > /dev/null

In [7]:
# Import the libraries
import cohere
import pandas as pd
import numpy as np
import altair as alt
import textwrap as tr

# Set up the Cohere client
api_key = 'YOUR_API_KEY' # Paste your API key here. Remember not to share it publicly
co = cohere.Client(api_key)
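
A key pasted into a notebook cell is easy to leak when the notebook is shared. One alternative is to read the key from an environment variable. The following is a minimal sketch under that assumption; the variable name `COHERE_API_KEY` and the helper `load_api_key` are illustrative choices, not something the SDK requires.

```python
import os

def load_api_key(var_name="COHERE_API_KEY"):
    """Return the API key stored in an environment variable.

    var_name is a hypothetical name chosen for this sketch.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# Usage (assumes the variable was set before the notebook started):
# co = cohere.Client(load_api_key())
```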

# Analyzing Text

Cohere’s Embed endpoint takes a piece of text and turns it into a vector embedding. Embeddings represent text as a list of numbers that captures its meaning and context. This lets you turn unstructured text data into a structured form, which opens up ways to analyze it and extract insights from it.


## Get embeddings

Here we have a list of 50 top web search terms about Hello, World! taken from a keyword tool. Let’s look at a few examples:

In [8]:
# Get a list of texts and add to a dataframe
df = pd.read_csv("https://github.com/cohere-ai/notebooks/raw/main/notebooks/data/hello-world-kw.csv", names=["search_term"])
df.head()

|   | search_term |
|---|---|
| 0 | how to print hello world in python |
| 1 | what is hello world |
| 2 | how do you write hello world in an alert box |
| 3 | how to print hello world in java |
| 4 | how to write hello world in eclipse |


We use the Embed endpoint to get the embeddings for each of these search terms.

In [9]:
# A function that embeds a list of texts
def embed_text(texts):
  """
  Turns a list of texts into embeddings
  Arguments:
    texts(list[str]): the texts to be turned into embeddings
  Returns:
    embedding(list[list[float]]): one embedding per input text
  """
  # Embed text by calling the Embed endpoint
  output = co.embed(
                model="embed-english-v2.0",
                texts=texts)
  embedding = output.embeddings

  return embedding

In [10]:
# Get embeddings of all search terms
df["search_term_embeds"] = embed_text(df["search_term"].tolist())
embeds = np.array(df["search_term_embeds"].tolist())

### Semantic Search

**Note:** This semantic search section is not contained in the blog post, but we encourage you to check it out if you'd like to see how embeddings are used to search the dataset!

We’ll look at a couple of example applications. The first example is semantic search. Given a new query, our "search engine" must return the most similar FAQs, where the FAQs are the 50 search terms we loaded earlier.


In [11]:
# Add a new query
new_query = "what is the history of hello world"

# Get embeddings of the new query
new_query_embeds = embed_text([new_query])[0]

We use cosine similarity to compare the new query with each of the FAQs.
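
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. As a small illustration with toy 2D vectors (the helper `cosine` is written for this sketch and is not part of the notebook):

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine([1, 0], [2, 0]))   # same direction: 1.0
print(cosine([1, 0], [0, 3]))   # perpendicular: 0.0
print(cosine([1, 0], [-1, 0]))  # opposite direction: -1.0
```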

In [16]:
# Calculate cosine similarity

from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(target, candidates):
  """
  Computes the similarity between a target embedding and a list of candidate embeddings
  Arguments:
    target(list[float]): the embedding of the target text
    candidates(list[list[float]]): the embeddings of the other texts, or candidates
  Returns:
    sim(list[tuple]): candidate indices and similarity scores, sorted by descending similarity
  """
  # Turn list into array
  candidates = np.array(candidates)
  target = np.expand_dims(np.array(target),axis=0)
  print(candidates.shape)
  print(target.shape)

  # Calculate cosine similarity
  sim = cosine_similarity(target,candidates)
  sim = np.squeeze(sim).tolist()

  # Sort by descending order in similarity
  sim = list(enumerate(sim))
  sim = sorted(sim, key=lambda x:x[1], reverse=True)

  # Return similarity scores
  return sim

Finally, we display the top 5 FAQs that match the new query.

In [17]:
# Get the similarity between the new query and existing queries
similarity = get_similarity(new_query_embeds,embeds)

# Display the top 5 FAQs
print("New query:")
print(new_query,'\n')

print("Similar queries:")
for idx,score in similarity[:5]:
  print(f"Similarity: {score:.2f};", df.iloc[idx]["search_term"])

(50, 4096)
(1, 4096)
New query:
what is the history of hello world 

Similar queries:
Similarity: 0.91; how did hello world originate
Similarity: 0.88; where did hello world come from
Similarity: 0.86; what is hello world
Similarity: 0.77; why is hello world so famous
Similarity: 0.70; why hello world
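
Each call to `get_similarity` recomputes vector norms for every candidate. When many queries run against the same fixed set of embeddings, one option is to normalize the candidates once, so that each query reduces to a single matrix-vector product. The following is a sketch of that idea on toy data; `normalize_rows` and `top_k` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    matrix = np.asarray(matrix, dtype=float)
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

def top_k(query, unit_candidates, k=5):
    """Return (index, score) pairs for the k most similar candidate rows."""
    query = np.asarray(query, dtype=float)
    scores = unit_candidates @ (query / np.linalg.norm(query))
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy candidates standing in for the real embeddings
toy = normalize_rows([[1, 0], [0, 1], [1, 1]])
print(top_k([2, 0.1], toy, k=2))
```

With the notebook's data, the equivalent call would be `top_k(new_query_embeds, normalize_rows(embeds))`.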


### Semantic Exploration

In the second example, we take the same idea as semantic search and apply it more broadly: exploring large volumes of text and analyzing their semantic relationships.

We'll use the same 50 top web search terms about Hello, World! Several techniques can compress the embeddings down to just 2 dimensions while retaining as much information as possible. We'll use one called UMAP. Once the embeddings are down to 2 dimensions, we can plot them on a 2D chart.
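
To make "compressing to 2 dimensions" concrete, here is a minimal sketch using PCA, a simpler linear technique than UMAP, implemented with NumPy only. It is an illustration of the general idea rather than a substitute for the UMAP step that follows: PCA keeps the two directions of greatest variance, while UMAP tries to preserve neighborhood structure. The function `pca_2d` and the random toy data are assumptions made for this example.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their top two principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)  # center each dimension
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T     # shape: (n_rows, 2)

rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 16))  # stand-in for the (50, 4096) embeddings
print(pca_2d(toy).shape)         # (50, 2)
```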

In [18]:
# Reduce the embeddings' dimensions to 2 using UMAP
import umap
reducer = umap.UMAP(n_neighbors=49) 
umap_embeds = reducer.fit_transform(embeds)

# Add the 2 dimensions to the dataframe
df['x'] = umap_embeds[:,0]
df['y'] = umap_embeds[:,1]

In [19]:
# Plot the 2-dimension embeddings on a chart
chart = alt.Chart(df).mark_circle(size=500).encode(
  x=
  alt.X('x',
      scale=alt.Scale(zero=False),
      axis=alt.Axis(labels=False, ticks=False, domain=False)
  ),

  y=
  alt.Y('y',
      scale=alt.Scale(zero=False),
      axis=alt.Axis(labels=False, ticks=False, domain=False)
  ),
  
  tooltip=['search_term']
  )

text = chart.mark_text(align='left', dx=15, size=12, color='black'
          ).encode(text='search_term', color= alt.value('black'))

result = (chart + text).configure(background="#FDF7F0"
      ).properties(
      width=1000,
      height=700,
      title="2D Embeddings"
      )

result.interactive()