# Wikipedia Semantic Search with Cohere Embedding Archives
This notebook contains the starter code to do simple [semantic search](https://txt.cohere.ai/what-is-semantic-search/) on the [Wikipedia embeddings archives](https://txt.cohere.ai/embeddings-archives-wikipedia/) published by Cohere. These archives embed Wikipedia sites in multiple languages. In this example, we'll use [Wikipedia Simple English](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings). 

In [1]:
# Let's install cohere and HF datasets
!pip install cohere datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cohere
  Downloading cohere-4.5.1-py3-none-any.whl (33 kB)
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 474.6/474.6 kB 26.0 MB/s eta 0:00:00
Collecting aiohttp<4.0,>=3.0 (from cohere)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 45.9 MB/s eta 0:00:00
Collecting backoff<3.0,>=2.0 (from cohere)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.5/110.5 kB 7.1 MB/s eta 0:00:00
Collecting xxhash (from datasets)
  Downloading xx

# API

In [3]:
from datasets import load_dataset
import torch
import cohere

# Add your Cohere API key from www.cohere.com (never commit a real key to a shared notebook)
co = cohere.Client("YOUR_API_KEY")
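
A key pasted directly into a notebook is easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported as an environment variable beforehand; the helper `read_api_key` and the variable name `COHERE_API_KEY` are hypothetical choices for illustration, not something the Cohere SDK requires:

```python
import os


def read_api_key(env=os.environ, var_name="COHERE_API_KEY"):
    """Return the API key stored under `var_name`, or raise a clear error.

    Both this helper and the variable name are illustrative assumptions.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage (assumes the variable was set before the notebook started):
# co = cohere.Client(read_api_key())
```

Passing the environment as a parameter keeps the helper easy to check with a plain dictionary.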

# Let's now download 1,000 records from the Simple English Wikipedia embeddings archive so we can search it afterwards.

In [4]:
# Load at most 1,000 documents and their embeddings
max_docs = 1000
docs_stream = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = torch.tensor(doc_embeddings)

Downloading metadata:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/3.84k [00:00<?, ?B/s]

Now `doc_embeddings` holds the embeddings of the first 1,000 documents in the dataset. Each document is represented as an [embedding vector](https://txt.cohere.ai/sentence-word-embeddings/) of 768 values.

In [5]:
doc_embeddings.shape

torch.Size([1000, 768])

We can now search these vectors for any query we want. For this toy example, we'll ask a question about the United States, since we know the United States page is included in the first 1,000 documents we loaded here.

To search, we embed the query, then get the nearest neighbors to its embedding (using dot product).
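
The ranking step can be illustrated on toy data first. The sketch below uses made-up 3-dimensional vectors in plain Python instead of the real 768-dimensional embeddings; the `dot` helper and all sample values are illustrative assumptions, not part of the archive:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Made-up embeddings standing in for three documents and one query.
toy_doc_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query_embedding = [0.9, 0.1, 0.0]

# One score per document; a higher score means a closer match.
scores = [dot(toy_query_embedding, d) for d in toy_doc_embeddings]

# Index of the best-scoring document (the nearest neighbor).
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

The first toy document points in almost the same direction as the query, so it receives the highest score.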

In [8]:

# Get the query, then embed it
query = 'Who founded United States'
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings 
query_embedding = torch.tensor(query_embedding)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")


Query: Who founded United States
United States
Tensions between American colonials and the British during the rebel period of the 1760s and early 1770s led to the American Revolutionary War, fought from 1775 through 1781. On June 14, 1775, the Continental Congress, a meeting in Philadelphia, established a Continental Army led by George Washington.The Congress said that "all men are created equal" and are born with "certain natural rights," and adopted the Declaration of Independence, written mostly by Thomas Jefferson, on July 4, 1776. That date is now celebrated every year as America's Independence Day. In 1777, the Articles of Confederation established a weak federal government that operated until 1789. Morocco was the first country in the world to recognize America’s independence. 

United States
Arguments between the farming-based South and industrial North over the growth of the institution of slavery and states' rights began the American Civil War. In 1861 The southern states sep

These are the three passages most relevant to the query. We can retrieve more results by increasing the `k` value. The question in this simple demo is about the United States because we know that the United States page is part of the documents in this subset of the archive.
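
As a sketch of how `k` controls the number of results, the hypothetical helper `top_k_indices` below mimics in plain Python what `torch.topk` returns as indices for a single row of scores. Unlike `torch.topk`, it clamps `k` to the number of available scores, which is an assumption made here for convenience; the scores are made up:

```python
import heapq


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    k = min(k, len(scores))  # never ask for more results than exist
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Made-up similarity scores for five documents.
toy_scores = [0.12, 0.87, 0.45, 0.91, 0.30]

print(top_k_indices(toy_scores, k=3))
print(top_k_indices(toy_scores, k=5))
```

A larger `k` only extends the ranked list; the documents already returned for a smaller `k` keep their order.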