<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/embeddings/cohereai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CohereAI Embeddings

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

In [None]:
!pip install llama-index

In [1]:
# Initilise with your api key
import os
import json

In [2]:
def load_credentials(pth):
    """
    Loads API credential keys of different services 
    :param pth: Path to credentials file
    :return: Dictionary of API credentials
    """
    cred_dict = {}
    with open(pth, 'r') as f:
        cred_dict = json.load(f)
    return cred_dict

In [3]:
os.getcwd()
os.path.dirname(os.getcwd())

'/Users/karthiksubramanian/PycharmProjects/recommenderLLM'

In [3]:
cred_pth = os.path.join(os.path.dirname(os.getcwd()), 'cred', 'credentials.json')
cred_dict = load_credentials(cred_pth)
cohere_api_key = cred_dict.get("COHERE_API_KEY", '')
#print(cohere_api_key)
os.environ["COHERE_API_KEY"] = cohere_api_key

#### With latest `embed-english-v3.0` embeddings.

- input_type="search_document": Use this for texts (documents) you want to store in your vector database

- input_type="search_query": Use this for search queries to find the most relevant documents in your vector database

In [4]:
from llama_index.embeddings.cohereai import CohereEmbedding

In [4]:


# with input_typ='search_query'
embed_model = CohereEmbedding(
    cohere_api_key=cohere_api_key,
    model_name="embed-english-v3.0",
    input_type="search_query",
)

embeddings = embed_model.get_text_embedding("Hello CohereAI!")

print(len(embeddings))
print(embeddings[:5])

1024
[-0.041931152, -0.022384644, -0.07067871, -0.011886597, -0.019210815]


In [5]:
# with input_type = 'search_document'
embed_model = CohereEmbedding(
    cohere_api_key=cohere_api_key,
    model_name="embed-english-v3.0",
    input_type="search_document",
)

embeddings = embed_model.get_text_embedding("Hello CohereAI!")

print(len(embeddings))
print(embeddings[:5])

1024
[-0.03074646, -0.0029201508, -0.058044434, -0.015457153, -0.02331543]


#### With old `embed-english-v2.0` embeddings.

In [None]:
embed_model = CohereEmbedding(
    cohere_api_key=cohere_api_key, model_name="embed-english-v2.0"
)

embeddings = embed_model.get_text_embedding("Hello CohereAI!")

print(len(embeddings))
print(embeddings[:5])

4096
[0.65771484, 0.7998047, 2.3769531, -2.3105469, -1.6044922]


#### Now with latest `embed-english-v3.0` embeddings, 

let's use 
1. input_type=`search_document` to build index
2. input_type=`search_query` to retrive relevant context.

In [5]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
)

from llama_index.llms import LiteLLM
from llama_index.response.notebook_utils import display_source_node

from IPython.display import Markdown, display

#### Download Example Data for demo

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2023-11-03 03:14:50--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: 'data/paul_graham/paul_graham_essay.txt'


2023-11-03 03:14:50 (11.3 MB/s) - 'data/paul_graham/paul_graham_essay.txt' saved [75042/75042]


#### Load Example Data

In [None]:
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

#### Build index with input_type = 'search_document'

In [None]:
llm = LiteLLM("command-nightly")
embed_model = CohereEmbedding(
    cohere_api_key=cohere_api_key,
    model_name="embed-english-v3.0",
    input_type="search_document",
)

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model
)
index = VectorStoreIndex.from_documents(
    documents=documents, service_context=service_context
)

### Load WANDS Dataset

In [6]:
import pandas as pd
wands_data_pth = os.path.join(os.path.dirname(os.getcwd()), 'WANDS', 'dataset')

In [7]:
# get search queries
wands_query_df = pd.read_csv(os.path.join(wands_data_pth, "query.csv"), sep='\t')
wands_query_df.head(10)

Unnamed: 0,query_id,query,query_class
0,0,salon chair,Massage Chairs
1,1,smart coffee table,Coffee & Cocktail Tables
2,2,dinosaur,Kids Wall Décor
3,3,turquoise pillows,Accent Pillows
4,4,chair and a half recliner,Recliners
5,5,sofa with ottoman,Sectionals
6,6,acrylic clear chair,Dining Chairs
7,7,driftwood mirror,Wall & Accent Mirrors
8,8,home sweet home sign,Wall Décor
9,9,coffee table fire pit,Outdoor Fireplaces


In [8]:
# get products
wands_product_df = pd.read_csv(os.path.join(wands_data_pth, "product.csv"), sep='\t')
wands_product_df.head(10)

Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count
0,0,solid wood platform bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,"good , deep sleep can be quite difficult to ha...",overallwidth-sidetoside:64.7|dsprimaryproducts...,15.0,4.5,15.0
1,1,all-clad 7 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,"create delicious slow-cooked meals , from tend...",capacityquarts:7|producttype : slow cooker|pro...,100.0,2.0,98.0
2,2,all-clad electrics 6.5 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,prepare home-cooked meals on any schedule with...,features : keep warm setting|capacityquarts:6....,208.0,3.0,181.0
3,3,all-clad all professional tools pizza cutter,"Slicers, Peelers And Graters",Browse By Brand / All-Clad,this original stainless tool was designed to c...,overallwidth-sidetoside:3.5|warrantylength : l...,69.0,4.5,42.0
4,4,baldwin prestige alcott passage knob with roun...,Door Knobs,Home Improvement / Doors & Door Hardware / Doo...,the hardware has a rich heritage of delivering...,compatibledoorthickness:1.375 '' |countryofori...,70.0,5.0,42.0
5,5,vogan 33 '' single bathroom vanity,Vanities,Home Improvement / Bathroom Remodel & Bathroom...,the vogan vanity from our vogan series is a 33...,sinkmaterial : ceramic|overallwidth-sidetoside...,2.0,5.0,1.0
6,6,vogelsang 48 '' single bathroom vanity,Vanities,Home Improvement / Bathroom Remodel & Bathroom...,the vogelsang top vanity is a 48 '' wide singl...,dswoodtone : light wood|woodspecies : pine|ove...,1.0,5.0,1.0
7,7,36 '' single bathroom vanity,Vanities,Home Improvement / Bathroom Remodel & Bathroom...,vanity has an extra thick marble top with a bu...,whatisap-trap : a p-trap holds water to preven...,,,
8,8,erith obliqui urn,"Vases, Urns, Jars, & Bottles","Décor & Pillows / Home Accessories / Vases, Ur...","an erith obliqui urn , crushed rhinestone pear...",shape : cylinder|overallwidth-sidetoside:9.13|...,,,
9,9,vezina 65 '' rolled arm chesterfield loveseat,Sofas,Furniture / Living Room Furniture / Sofas,"in the endless world of sofa style options , t...",pattern : solid color|removablecushionlocation...,1.0,5.0,1.0


In [10]:
wands_product_df['product_name'].isnull().values.any()

False

In [12]:
wands_product_df['product_description'].isnull().values.any()

True

In [7]:
def combine_product_texts(product_name, product_description):
    if pd.notnull(product_name) and pd.notnull(product_description):
        return product_name + " " + product_description
    elif pd.notnull(product_name):
        return product_description
    else:
        return product_name

In [10]:
# get manually labeled groundtruth lables
wands_label_df = pd.read_csv(os.path.join(wands_data_pth, "label.csv"), sep='\t')
wands_label_df.head(10)

Unnamed: 0,id,query_id,product_id,label
0,0,0,25434,Exact
1,1,0,12088,Irrelevant
2,2,0,42931,Exact
3,3,0,2636,Exact
4,4,0,42923,Exact
5,5,0,41156,Exact
6,6,0,5938,Irrelevant
7,7,0,5937,Irrelevant
8,8,0,37072,Irrelevant
9,9,0,37071,Irrelevant


In [None]:
def assign_label_score(label):
    if pd.isna(label):
        # Assume not relevant if there is no human label assigned
        return 0
    elif label.lower() == "exact":
        return 1.0
    elif label.lower() == "irrelevant":
        return 0
    elif label.lower() == "partial":
        # Rate higher than 50% to make relevance matching higher quality
        return 0.6

In [None]:
wands_label_df['label_score'] = wands_label_df['label'].apply(lambda x: assign_label_score(x))

In [12]:
# Aggregate the text in products to ensure there are no NAs for each product
wands_product_df['product_text'] = wands_product_df['product_name'].combine(wands_product_df['product_description'],  lambda x, y: combine_product_texts(x, y))

### Load Amazon Product Reviews and QA Dataset

In [13]:
import gzip
import ast

def parse_amazon_data(pth):
  g = gzip.open(pth, 'rb')
  for l in g:
      try:
          yield json.loads(l)
      except ValueError as error:
          yield ast.literal_eval(json.loads(l))

def get_amazon_df(pth):
  i = 0
  df = {}
  for d in parse_amazon_data(pth):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

In [9]:
amazon_data_pth = os.path.join(os.path.dirname(os.getcwd()), 'Amazon', 'Dataset')

In [10]:
amazon_product_df = get_amazon_df(os.path.join(amazon_data_pth, "Musical_Instruments.json.gz"))
amazon_product_df.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,image
0,5.0,90.0,False,"08 9, 2004",AXHY24HWOF184,470536454,{'Format:': ' Paperback'},Bendy,Crocheting for Dummies by Karen Manthey & Susa...,Terrific Book for Learning the Art of Crochet,1092009600,
1,4.0,2.0,True,"04 6, 2017",A29OWR79AM796H,470536454,{'Format:': ' Hardcover'},Amazon Customer,Very helpful...,Four Stars,1491436800,
2,5.0,,True,"03 14, 2017",AUPWU27A7X5F6,470536454,{'Format:': ' Paperback'},Amazon Customer,EASY TO UNDERSTAND AND A PROMPT SERVICE TOO,Five Stars,1489449600,
3,4.0,,True,"02 14, 2017",A1N69A47D4JO6K,470536454,{'Format:': ' Paperback'},Christopher Burnett,My girlfriend use quite often,Four Stars,1487030400,
4,5.0,,True,"01 29, 2017",AHTIQUMVCGBFJ,470536454,{'Format:': ' Paperback'},Amazon Customer,Arrived as described. Very happy.,Very happy.,1485648000,


In [11]:
amazon_product_meta_df = get_amazon_df(os.path.join(amazon_data_pth, "meta_Musical_Instruments.json.gz"))
amazon_product_meta_df.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,"[Musical Instruments, Drums & Percussion, Hand...",,[Cricket Rubbing the spine with the wooden sti...,,Wooden Percussion 2 Piece Set of 3 Inch Cricke...,"[B00NP8GYVS, B00NP80XMO, B00NP8M098]",,WADSUWAN SHOP,"[Wood percussion, Owl whistle*, Includes woode...","[>#141,729 in Musical Instruments (See Top 100...",[],Musical Instruments,,"December 2, 2013",,989983,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
1,"[Musical Instruments, Drums & Percussion, Hand...",,[Frog - Rubbing its spine with the wooden stic...,,"Wooden Percussion 3 Piece Set Frog, Cricket an...","[B00NP8GYVS, B00NP80XMO, B01MY48HK5, B00AZZ1AJ...",,WADSUWAN SHOP,"[Wood percussion, Small 3 inches, Creates orig...","[>#1,622 in Musical Instruments (See Top 100 i...",[],Musical Instruments,,"December 2, 2013",$0.91,98906,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,"[Musical Instruments, Instrument Accessories, ...",,[Vivaldi's famous set of four violin concertos...,,Hal Leonard Vivaldi Four Seasons for Piano (Or...,[],,Hal Leonard,"[., ., .]","[>#330,653 in Musical Instruments (See Top 100...",[],Musical Instruments,,"May 10, 2011",$62.93,41291905,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
3,[],,"[The Turn of the Screw (op. 54) vocal score, p...",,The Turn of the Screw (vocal score),"[0486266842, 0793507669, 0393008789, 142341280...",,Boosey &amp; Hawkes,[],"[>#86,354 in Musical Instruments (See Top 100 ...",[],Musical Instruments,,"May 23, 2007",$107.79,60015500,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
4,[],,[],,Suite for Organ (including the Trumpet Volunta...,[],,,[],"[>#482,025 in Musical Instruments (See Top 100...",[],Musical Instruments,,"February 8, 2013",,193757710,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,


In [14]:
amazon_qa_df = get_amazon_df(os.path.join(amazon_data_pth, "QA_Musical_Instruments.json.gz"))
amazon_qa_df.head()

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

### Build index with input type as pandas dataframe

In [11]:
from llama_index.indices.struct_store import PandasIndex
import cohere
from tqdm import tqdm
# from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex

In [None]:
# Now we'll set up the cohere client.
co = cohere.Client(os.environ["COHERE_API_KEY"])

### WANDS example embeddings

In [13]:
wands_product_df['product_text'][:4]

0    solid wood platform bed good , deep sleep can ...
1    all-clad 7 qt . slow cooker create delicious s...
2    all-clad electrics 6.5 qt . slow cooker prepar...
3    all-clad all professional tools pizza cutter t...
Name: product_text, dtype: object

#### Cohere Embeddings

In [14]:
wands_embed_model = "embed-english-v3.0"
wands_embed_input_type = "search_document"
wands_sample_texts = list(wands_product_df['product_text'])[:4]
# Get the embeddings
wands_embeds_sample = co.embed(texts=wands_sample_texts,
                  model=wands_embed_model,
                  input_type=wands_embed_input_type).embeddings

#### Llama index embeddings

In [None]:
# llm = LiteLLM("command-nightly")
llm = LiteLLM("command")
embed_model = CohereEmbedding(
    cohere_api_key=cohere_api_key,
    model_name="embed-english-v3.0",
    input_type="search_document",
)

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model
)
index = VectorStoreIndex.from_documents(
    documents=documents, service_context=service_context
)

### Amazon Example Embeddings

In [14]:
amzn_embed_model = "embed-english-v3.0"
amzn_embed_input_type = "search_document"
amzn_sample_texts = list(wands_product_df['product_text'])[:4]
# Get the embeddings
amzn_embeds_sample = co.embed(texts=wands_sample_texts,
                  model=wands_embed_model,
                  input_type=wands_embed_input_type).embeddings

#### Build retriever with input_type = 'search_query'

In [None]:
embed_model = CohereEmbedding(
    cohere_api_key=cohere_api_key,
    model_name="embed-english-v3.0",
    input_type="search_query",
)

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model
)

search_query_retriever = index.as_retriever(service_context=service_context)

search_query_retrieved_nodes = search_query_retriever.retrieve(
    "What happened in the summer of 1995?"
)

In [None]:
for n in search_query_retrieved_nodes:
    display_source_node(n, source_length=2000)

**Node ID:** 1b0759b6-e6a1-4749-aeaa-1eafe14db055<br>**Similarity:** 0.3253174706260866<br>**Text:** That's not how they sell. I wrote some software to generate web sites for galleries, and Robert wrote some to resize images and set up an http server to serve the pages. Then we tried to sign up galleries. To call this a difficult sale would be an understatement. It was difficult to give away. A few galleries let us make sites for them for free, but none paid us.

Then some online stores started to appear, and I realized that except for the order buttons they were identical to the sites we'd been generating for galleries. This impressive-sounding thing called an "internet storefront" was something we already knew how to build.

So in the summer of 1995, after I submitted the camera-ready copy of ANSI Common Lisp to the publishers, we started trying to write software to build online stores. At first this was going to be normal desktop software, which in those days meant Windows software. That was an alarming prospect, because neither of us knew how to write Windows software or wanted to learn. We lived in the Unix world. But we decided we'd at least try writing a prototype store builder on Unix. Robert wrote a shopping cart, and I wrote a new site generator for stores — in Lisp, of course.

We were working out of Robert's apartment in Cambridge. His roommate was away for big chunks of time, during which I got to sleep in his room. For some reason there was no bed frame or sheets, just a mattress on the floor. One morning as I was lying on this mattress I had an idea that made me sit up like a capital L. What if we ran the software on the server, and let users control it by clicking on links? Then we'd never have to write anything to run on users' computers. We could generate the sites on the same server we'd serve them from. Users wouldn't need anything more than a browser.

This kind of software, known as a web app, is common now, but at the time it wasn't clear that it was even possible. To find out, we decided to try making a version of our store builder that y...<br>

**Node ID:** ab6c138d-a509-4894-9131-da145eb7a4b4<br>**Similarity:** 0.28713538838359537<br>**Text:** But once again, this was not due to any particular insight on our part. We didn't know how VC firms were organized. It never occurred to us to try to raise a fund, and if it had, we wouldn't have known where to start. [14]

The most distinctive thing about YC is the batch model: to fund a bunch of startups all at once, twice a year, and then to spend three months focusing intensively on trying to help them. That part we discovered by accident, not merely implicitly but explicitly due to our ignorance about investing. We needed to get experience as investors. What better way, we thought, than to fund a whole bunch of startups at once? We knew undergrads got temporary jobs at tech companies during the summer. Why not organize a summer program where they'd start startups instead? We wouldn't feel guilty for being in a sense fake investors, because they would in a similar sense be fake founders. So while we probably wouldn't make much money out of it, we'd at least get to practice being investors on them, and they for their part would probably have a more interesting summer than they would working at Microsoft.

We'd use the building I owned in Cambridge as our headquarters. We'd all have dinner there once a week — on tuesdays, since I was already cooking for the thursday diners on thursdays — and after dinner we'd bring in experts on startups to give talks.

We knew undergrads were deciding then about summer jobs, so in a matter of days we cooked up something we called the Summer Founders Program, and I posted an announcement on my site, inviting undergrads to apply. I had never imagined that writing essays would be a way to get "deal flow," as investors call it, but it turned out to be the perfect source. [15] We got 225 applications for the Summer Founders Program, and we were surprised to find that a lot of them were from people who'd already graduated, or were about to that spring. Already this SFP thing was starting to feel more serious than we'd intended.

We ...<br>