<!--<badge>--><a href="https://colab.research.google.com/github/startakovsky/pinecone-examples-fork/blob/may-2022-semantic-text-search-refresh/semantic_text_search/semantic_text_search_refresh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><!--</badge>-->

# Semantic Search With Pinecone

## Background

### What is Semantic Search and how will we use it?

_Semantic search_ is search that uses the _meaning_ of the query rather than keyword lookups. Neural networks pretrained on large sets of text data have been shown to be very effective at encoding the _meaning_ of a particular phrase, sentence, paragraph or long document into a data structure known as a [vector embedding](https://www.pinecone.io/learn/vector-embeddings/).

In this example, we are going to demonstrate Pinecone's semantic search capabilities with an off-the-shelf, pretrained NLP model. In the process we'll learn a few things.

### Learning Goals and Estimated Reading Time
_By the end of this 10-minute demo, you will have:_
 1. Learned how Pinecone addresses real-time semantic search requirements.
 2. Stored vectors in, and retrieved vectors from, your very own Pinecone Vector Database.
 3. Encoded news articles as 384-dimensional vectors using a pretrained, encoder-only model (i.e. no model training necessary).
 4. Queried Pinecone's Vector Database to find news articles similar to a given query.
 
Executing all the code in the notebook may take a few hours, but once all the data is encoded, queries to Pinecone are processed on the order of tens of milliseconds.

## Setup: Prerequisites and Data Preparation

### Python 3.7+

This code has been tested with Python 3.7. It is recommended to run this code in a virtual environment or Google Colab.

### Acquiring your Pinecone API Key

A Pinecone API key is required. You can obtain one for free on [our website](https://app.pinecone.io/). Either add `PINECONE_EXAMPLE_API_KEY` to your environment variables, or enter it manually after running the cell below: a prompt will request the API key and store it for the duration of this kernel session.

### Helper Module

In [1]:
# There is a helper module required for this notebook to run.
# When not present with this notebook, it will be streamed in from Pinecone's Example Repository.
# You can find the module at https://github.com/pinecone-io/examples/tree/master/semantic_text_search

import os
import httpimport

if os.path.isfile('helper.py'):
    import helper as h
else:
    print('importing `helper.py` from https://github.com/pinecone-io')
    with httpimport.github_repo(
        username='startakovsky', 
        repo='pinecone-examples-fork',
        module=['semantic_text_search'],
        branch='may-2022-semantic-text-search-refresh'):
        from semantic_text_search import helper as h

Extracting API Key from environmental variable `PINECONE_EXAMPLE_API_KEY`...

Pinecone API Key available at `h.pinecone_api_key`

### Installing and Importing Prerequisite Libraries:
Python libraries [pinecone-client](https://pypi.org/project/pinecone-client/), [sentence_transformers](https://pypi.org/project/sentence-transformers/), [datasets](https://pypi.org/project/datasets/), [pandas](https://pypi.org/project/pandas/), and [tqdm](https://pypi.org/project/tqdm/) are required for this notebook.

#### Installing via `pip`
The next cell runs `pip install pinecone-client sentence-transformers pandas tqdm datasets -qU` from within the notebook. The `-q` flag reduces pip's output and `-U` upgrades any of these packages that are already installed.

In [2]:
!pip install pinecone-client sentence-transformers pandas tqdm datasets -qU

#### Importing and Defining Constants

In [3]:
import collections

import tqdm.notebook  # explicit submodule import so that `tqdm.notebook.tqdm` is available below
import pinecone
import pandas as pd
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

INDEX_NAME, INDEX_DIMENSION = 'semantic-text-search', 384
MODEL_NAME = 'sentence-transformers/msmarco-MiniLM-L6-cos-v5'

### Downloading and Processing Data

#### Downloading data
To demonstrate semantic search using Pinecone, we will be using [a dataset](https://huggingface.co/datasets/cc_news) consisting of over 700,000 English language news articles. We will be downloading this dataset using the `datasets` library in the next cell.

In [4]:
dataset = load_dataset("cc_news")


#### Preprocessing data
The preprocessing step is self-explanatory and is defined in the helper module.

In [5]:
df = h.get_processed_df(dataset['train'].to_pandas())

#### Sample row from dataframe

In [6]:
pd.DataFrame(df.iloc[356155])

| | 356155 |
|---|---|
| title | Statue of Liberty evacuated as climber refuses to come down |
| text | Liberty Island has been evacuated because of a climber at the Statue of Liberty.\nA climber on the Statue of Liberty in New York. Pix 11 livestream screen capture from Facebook.\nNEW YORK — Liberty Island has been evacuated because of a climber at the Statue of Liberty.\nA person climbed the statue’s base on the Fourth of July shortly after several people were arrested after hanging a banner from the statue’s pedestal calling for abolishing Immigration and Customs Enforcement.\nNews helicopter video showed the climber sitting Wednesday by the bottom of the statue’s robes, about 100 feet (30 meters) aboveground. Police nearby tried to persuade the climber to descend.\nEarlier, National Park Service spokesman Jerry Willis said at least six people were taken into custody for the banner, which read “Abolish I.C.E.,” referring to part of the Department of Homeland Security.\nWillis says federal regulations prohibit hanging banners from the monument.\nBanner organizing group Rise and Resist says the climber isn’t connected to its demonstration. |
| domain | www.reviewjournal.com |
| date | 2018-07-04 20:43:12 |
| description | Liberty Island has been evacuated because of a climber at the Statue of Liberty. |
| url | https://www.reviewjournal.com/news/nation-and-world/statue-of-liberty-evacuated-as-climber-refuses-to-come-down/ |
| image_url | https://www.reviewjournal.com/wp-content/uploads/2018/07/10783746_web1_dewfdew.jpg |
| text_to_encode | Statue of Liberty evacuated as climber refuses to come down Liberty Island has been evacuated because of a climber at the Statue of Liberty. A climber on the Statue of Liberty in New York. Pix 11 livestream screen capture from Facebook. NEW YORK — Liberty Island has been evacuated because of a climber at the Statue of Liberty. A person climbed the statue’s base on the Fourth of July shortly after several people were arrested after hanging a banner from the statue’s pedestal calling for abolishing Immigration and Customs Enforcement. |
| year | 2018 |
| month | 7 |


### Creating your Pinecone Index
The process for creating a Pinecone Index requires your Pinecone API key, the name of your index, and the number of dimensions of each vector. As we will see below, the model we are using maps each piece of text to a 384-dimensional vector.

In [7]:
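# Connect to Pinecone, create the index only if it does not exist yet, then open a handle to it.
pinecone.init(api_key=h.pinecone_api_key, environment='us-west1-gcp')
if INDEX_NAME not in pinecone.list_indexes():
    pinecone.create_index(name=INDEX_NAME, dimension=INDEX_DIMENSION, metric='cosine')
index = pinecone.Index(index_name=INDEX_NAME)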

## Generate embeddings and send them to your Pinecone Index
This will all be done in batches: we compute embeddings one chunk of articles at a time, then send each chunk to Pinecone in smaller batches.

### Loading a Pretrained Encoder Model
We will generate embeddings by using [this Sentence Transformers model](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5). It is one of hundreds of encoder models available. Downloads happen automatically with SentenceTransformer, and may take up to a minute the first time. After this first download, the model is cached and available on the local machine.

In [8]:
h.printmd(f'Loading model from _Sentence Transformers_: `{MODEL_NAME}`...')
model = SentenceTransformer(MODEL_NAME)
h.printmd('Model loaded.')

Loading model from _Sentence Transformers_: `sentence-transformers/msmarco-MiniLM-L6-cos-v5`...

Model loaded.

### MSMARCO model v5 and Embeddings

In this example, we created an index with 384 dimensions that scores vectors by [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). Computing this score for two vectors is trivial; comparing a query vector against millions or billions of vectors and finding those with the highest similarity to the query vector is not.

#### On Embeddings

This model produces vectors from text, each a sequence of 384 floats. So, when a piece of text such as "A quick fox jumped around" gets encoded into a vector embedding, the result is a sequence of floats of length 384. The same is true for a long news article and a single word. 

#### On Comparing Embeddings aka _how_ Semantic Search works

Two 15-dimensional text embeddings might look something like: 
 - _\[-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02, 0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04\]_
 - _\[-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04, -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03\]_
 
Determining how [_similar_](https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d) two such vectors are takes a simple formula that is very fast to compute. Similarity scores are, in general, an excellent proxy for semantic similarity. A natural next step is to compare one vector against a handful of others and select the most similar.
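
As a minimal sketch of that formula, the snippet below computes the cosine similarity of the two illustrative 15-dimensional vectors listed above using only the standard library. The function name `cosine_similarity` is chosen here for illustration and is not part of the helper module.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# The two illustrative 15-dimensional embeddings from the text above.
first = [-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02,
         0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04]
second = [-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04,
          -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03]

similarity = cosine_similarity(first, second)
print(f"cosine similarity: {similarity:.2f}")
```

The score always falls between -1 and 1. For this pair it is negative, meaning the two vectors point in broadly different directions.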

### What is Pinecone for?
There is often a technical requirement to compare one vector against tens or hundreds of millions of vectors, or more, and to do so with low latency (less than 50ms) and high throughput. Pinecone solves this problem with its managed vector database service, and we will demonstrate this below.
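
To make the scaling problem concrete, here is a sketch of the naive alternative: an exhaustive search that scores every stored vector against the query and keeps the best `k`. This is an illustration only, not how Pinecone works internally. The vectors are random stand-ins, and the names `brute_force_top_k` and `vec-42` are hypothetical.

```python
import heapq
import math
import random


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def brute_force_top_k(query, vectors, k):
    """Score every stored vector against the query and keep the k best (score, id) pairs."""
    scored = ((cosine_similarity(query, values), vector_id)
              for vector_id, values in vectors.items())
    return heapq.nlargest(k, scored)


# Random stand-in data: 1,000 vectors of dimension 8 (real embeddings here have 384).
random.seed(0)
vectors = {f"vec-{i}": [random.uniform(-1, 1) for _ in range(8)] for i in range(1000)}

# Querying with a stored vector should return that same vector as the best match.
top = brute_force_top_k(vectors["vec-42"], vectors, k=3)
for score, vector_id in top:
    print(vector_id, round(score, 3))
```

Every query touches every stored vector, so the cost grows linearly with the size of the collection. That linear scan is what becomes impractical at hundreds of millions of vectors.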

### Components of a Pinecone vector embedding

There are three components to every Pinecone vector embedding:
 - a vector ID
 - a sequence of floats of a user-defined, fixed dimension
 - vector metadata (a key-value store)
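
The three components can be pictured as a simple tuple. The sketch below is illustrative: `make_vector` is a hypothetical helper, the values are zero placeholders rather than a real embedding, and the metadata values are made up for the example. It converts the ID to a string on the assumption that vector IDs are stored as strings.

```python
INDEX_DIMENSION = 384  # the dimension used for the index in this notebook


def make_vector(vector_id, values, metadata):
    """Bundle the three components into an (id, values, metadata) tuple."""
    if len(values) != INDEX_DIMENSION:
        raise ValueError(f"expected {INDEX_DIMENSION} values, got {len(values)}")
    return (str(vector_id), [float(x) for x in values], dict(metadata))


# Placeholder values and illustrative metadata, not output from the model or dataset.
example = make_vector(
    vector_id=356155,
    values=[0.0] * INDEX_DIMENSION,
    metadata={"year": 2018, "month": 7, "source": "example-source"},
)
vector_id, values, metadata = example
print(vector_id, len(values), metadata)
```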

### Prepare vector embeddings for upload

We will encode the news articles for upload to Pinecone. This may take a while depending on your machine: on a recent MacBook Pro or Google Colab, it may take an hour or longer. We will use the index of the pandas dataframe as the vector ID, the pretrained model to generate the sequence of 384 floats, and the year, month and article source as the metadata.

#### Prepare metadata

The function below creates metadata from a single row of the dataframe. This will be important further down in this notebook, where we may want to apply additional filters in our queries.

In [9]:
def get_vector_metadata_from_dataframe_row(df_row):
    """Return vector metadata."""
    vector_metadata = {
        'year': df_row['year'],
        'month': df_row['month'],
        'source': df_row['processed_domain']
    }
    return vector_metadata

#### Prepare all vector data for upload

The function below will take a portion of the dataframe and create the full vector data as Pinecone expects it for [upsert](https://www.pinecone.io/docs/insert-data/).

In [10]:
def get_vectors_to_upload_to_pinecone(df_chunk, model):
    """Return list of tuples like (vector_id, vector_values, vector_metadata)."""
    # create embeddings; a plain list avoids label-based lookups on the pandas Series
    texts = df_chunk['text_to_encode'].tolist()
    # optional multi-process alternative:
    # pool = model.start_multi_process_pool()
    # vector_values = model.encode_multi_process(texts, pool).tolist()
    # model.stop_multi_process_pool(pool)
    vector_values = model.encode(texts, show_progress_bar=True).tolist()
    # create vector ids and metadata
    vector_ids = df_chunk.index.tolist()
    vector_metadata = df_chunk.apply(get_vector_metadata_from_dataframe_row, axis=1).tolist()
    return list(zip(vector_ids, vector_values, vector_metadata))

### Upload data to Pinecone in asynchronous batches

The function below iterates through the dataframe in chunks, and for each of those chunks, will upload asynchronously in sub-chunks to your Pinecone Index.

In [11]:
def upload_dataframe_to_pinecone_in_chunks(
    dataframe, 
    pinecone_index, 
    model, 
    chunk_size=20000, 
    upsert_size=500):
    """Encode dataframe column `text_to_encode` to dense vector and upsert to Pinecone."""
    tqdm_kwargs = h.get_tqdm_kwargs(dataframe, chunk_size)
    async_results = collections.defaultdict(list)
    for df_chunk in tqdm.notebook.tqdm(h.chunks(dataframe, chunk_size), **tqdm_kwargs):
        start_index_chunk = df_chunk.index[0]
        vectors = get_vectors_to_upload_to_pinecone(df_chunk, model)
        # upload to Pinecone in batches of `upsert_size` without waiting for each response
        for vectors_chunk in h.chunks(vectors, upsert_size):
            async_result = pinecone_index.upsert(vectors_chunk, async_req=True)
            async_results[start_index_chunk].append(async_result)
        # wait for every upsert in this chunk to finish
        _ = [async_result.get() for async_result in async_results[start_index_chunk]]
        is_all_successful = all(map(lambda x: x.successful(), async_results[start_index_chunk]))
        # report chunk upload status
        print(
            f'All upserts in chunk successful with index starting with {start_index_chunk:>7}: '
            f'{is_all_successful}. Vectors uploaded: {len(vectors):>3}.'
        )
    return async_results
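
The two-level batching pattern used above can be sketched without Pinecone or a model. In this illustration, `chunks` is a hypothetical stand-in for `h.chunks`, `fake_upsert` stands in for the network call, and a standard-library thread pool plays the role of the asynchronous requests. Treat it as a sketch of the control flow under those assumptions, not the client's implementation.

```python
from multiprocessing.pool import ThreadPool


def chunks(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_upsert(batch):
    """Hypothetical stand-in for an upsert request; returns how many items it received."""
    return len(batch)


def upload_in_chunks(items, chunk_size, upsert_size):
    """Process `items` chunk by chunk, sending each chunk as concurrent sub-batches."""
    uploaded = 0
    with ThreadPool(processes=4) as pool:
        for chunk in chunks(items, chunk_size):
            # Dispatch every sub-batch of this chunk without waiting for a response.
            pending = [pool.apply_async(fake_upsert, (batch,))
                       for batch in chunks(chunk, upsert_size)]
            # Block until every sub-batch of this chunk has completed.
            uploaded += sum(result.get() for result in pending)
    return uploaded


total = upload_in_chunks(list(range(2050)), chunk_size=1000, upsert_size=100)
print(total)
```

Waiting at the end of each chunk, rather than after each request, lets the sub-batches overlap while still bounding how many results are outstanding at once.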

#### Asynchronous Upload
Computing the embeddings may take several hours depending on hardware capabilities. With [async](https://www.pinecone.io/docs/insert-data/#sending-upserts-in-parallel) requests, the Pinecone client returns right away instead of waiting for each upsert to complete. 

In [None]:
async_results = upload_dataframe_to_pinecone_in_chunks(df, index, model)


### Visualize the status of your upserts in the Pinecone Console

<img src='https://raw.githubusercontent.com/startakovsky/pinecone-examples-fork/may-2022-semantic-text-search-refresh/semantic_text_search/pinecone_console.png'>

## Querying Pinecone

Now that the embeddings of all the texts are stored in Pinecone, it's time to demonstrate Pinecone's lightning-fast semantic search query capabilities.

### Pinecone Example Usage

#### _**Show me news articles about "outdoor activities"!**_

In the example below, we query Pinecone's API with an embedding of a query term to return the vector embeddings that have the highest similarity score. Pinecone efficiently estimates which of the uploaded vector embeddings have the highest similarity to the query term's embedding, and the database scales to billions of embeddings while maintaining low latency and high throughput. In this example we have upserted over 700,000 embeddings. Our [starter plan](https://www.pinecone.io/pricing/) supports up to one million. 

#### Example: Pinecone API Request and Response

Let's find articles with a similar semantic meaning to the `query` variable.

In [None]:
query = "outdoor activities"
vector_embedding = model.encode(query).tolist()
response = index.query([vector_embedding], top_k=3, include_metadata=True)
h.printmd(f"#### A sample response from Pinecone \n ==============\n \n ```python\n{response}\n```")

In [None]:
vector_ids, scores = h.get_ids_scores(response)
h.printmd("#### Enriched Response \nTo show which articles we retrieved, "
          "the above response needs to be enriched using the original dataset.")
result = df.loc[vector_ids]
result['score'] = scores
result[['title', 'score', 'domain', 'date', 'description', 'url']].style.format(
    {
        'url': h.make_clickable, 
        'score': lambda x: round(x, 2)
    }
)

#### Are the results any good?

We invite the reader to explore various queries by running the code in the last two cells. Note that this is **not a keyword search** but rather a **search for semantically similar results**. The _score_ column shows each result's similarity score with the query; higher scores typically indicate greater semantic similarity.

### Pinecone Example Usage With [Metadata](https://www.pinecone.io/docs/metadata-filtering/)

Extensive predicate logic can be applied to metadata filtering, just like the [WHERE clause](https://www.pinecone.io/learn/vector-search-filtering/) in SQL! Pinecone's [metadata feature](https://www.pinecone.io/docs/metadata-filtering/) provides easy-to-implement filtering.
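
To illustrate what such a filter expresses, the sketch below evaluates a predicate against plain metadata dictionaries. It is a conceptual illustration only: `matches` is a hypothetical function, it handles just the `$and`, `$eq` and `$in` operators, and it says nothing about how Pinecone evaluates filters internally. The records are made-up examples.

```python
def matches(metadata, predicate):
    """Return True if `metadata` satisfies a filter built from $and, $eq and $in."""
    for key, condition in predicate.items():
        if key == "$and":
            # Every sub-predicate in the list must hold.
            if not all(matches(metadata, sub) for sub in condition):
                return False
            continue
        for operator, operand in condition.items():
            value = metadata.get(key)
            if operator == "$eq":
                if value != operand:
                    return False
            elif operator == "$in":
                if value not in operand:
                    return False
            else:
                raise ValueError(f"unsupported operator in this sketch: {operator}")
    return True


# Made-up metadata records for illustration.
records = [
    {"year": 2018, "source": "www.taiwannews.com.tw"},
    {"year": 2017, "source": "www.taiwannews.com.tw"},
    {"year": 2018, "source": "example.com"},
]
predicate = {
    "$and": [
        {"year": {"$eq": 2018}},
        {"source": {"$in": ["www.taiwannews.com.tw"]}},
    ]
}
kept = [record for record in records if matches(record, predicate)]
print(kept)
```

Only the first record satisfies both conditions, in the same way a SQL `WHERE year = 2018 AND source IN (...)` clause would keep it.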

The next cell lists the top 20 sources, with the rest grouped into the _other_ category. We will then filter results so that they come from one of the top 5 sources of articles or from https://www.taiwannews.com.tw, and were published in 2018.

In [None]:
sources = h.get_top_sources(df)
print(*sources, sep=', ')

In [None]:
response = index.query(
    [vector_embedding], 
    top_k=5, 
    filter={
        "$and": [
            {'year': {'$eq': 2018}},
            {'source': {'$in':  sources[:5] + ['www.taiwannews.com.tw']}}
        ]
    }
)
vector_ids, scores = h.get_ids_scores(response)
result = df.loc[vector_ids]
result['score'] = scores
result[['title', 'score', 'domain', 'date', 'description', 'url']].style.format(
    {
        'url': h.make_clickable, 
        'score': lambda x: round(x, 2)
    }
)

## Conclusion

In this example, we demonstrated how simple Pinecone makes semantic search: a pretrained transformer encoder model produces the embeddings, and Pinecone provides real-time similarity retrieval. We also demonstrated the use of metadata to filter query results.

### Like what you see? Explore our [community](https://www.pinecone.io/community/)
Learn more about semantic search and the rich, performant, and production-level feature set of Pinecone's Vector Database by visiting https://pinecone.io, connecting with us [here](https://www.pinecone.io/contact/) and [following us](https://www.linkedin.com/company/pinecone-io) on LinkedIn. If you are interested in some of the algorithms that allow for efficient estimation of similar vectors, visit the Algorithms and Libraries section of our [Learning Center](https://www.pinecone.io/learn/).