## Semantic Search - The Search for Meaning

Times are new, and so is search. Traditionally we relied on **lexical search**, where the search engine looks for literal matches of the query terms without understanding what the query means. **Semantic search**, on the other hand, seeks to improve search accuracy by understanding the searcher's intent and the contextual meaning of the query. This is done through Natural Language Processing (NLP) techniques such as Named Entity Recognition (NER) and, as in this notebook, text embeddings.

Semantic search lets domain knowledge be articulated at a high level of expressiveness and lets users specify their intent in more detail at query time. Companies use it strategically for knowledge management: it allows employees across teams and locations to share data meaningfully and seamlessly based on contextual references.

In this notebook, we'll work with the Text REtrieval Conference (TREC) Question Classification dataset, using Cohere embeddings together with Annoy (Approximate Nearest Neighbors Oh Yeah) to build a semantic search over a dataframe. Later we'll visualize it too! Buckle up!

In [1]:
#Install the necessary modules
! pip install cohere annoy umap-learn altair datasets

Collecting cohere
  Downloading cohere-3.9.0.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: cohere
  Building wheel for cohere (setup.py) ... done
  Created wheel for cohere: filename=cohere-3.9.0-cp37-cp37m-linux_x86_64.whl size=18280 sha256=c889f4ca936a7d78ad2961eda053412c281b6957204c5ddb4b47d6a592dd4298
  Stored in directory: /root/.cache/pip/wheels/ed/64/bf/cb54309c85925a52d0cd11eba3b8e36589d571a7c3368c1393
Successfully built cohere
Installing collected packages: cohere
Successfully installed cohere-3.9.0

### What is Cohere ?
There's been a lot of talk about NLP in recent times, mostly dominated by Google (Bard) and OpenAI (GPT, ChatGPT, etc.), but little of it has been about Cohere. It's the calm before the storm. Cohere lets you build NLP into your product: we can use the Cohere APIs to generate, categorize, organize (and, more recently, summarize) text at scale. In my personal opinion, it is very approachable and easy to use, plus the developer team is also pretty sweet. All in all, it's a great library to get started with NLP.

Read more about [cohere](https://cohere.ai/)

In [2]:
# Necessary imports
import re
import umap
import cohere
import numpy as np
import pandas as pd
import altair as alt

from annoy import AnnoyIndex
from datasets import load_dataset
from sklearn.metrics.pairwise import cosine_similarity


### API Access

Every Cohere account comes with a free test key. Sign up for Cohere, copy your key from the [API keys page](https://dashboard.cohere.ai/api-keys), and use it to explore the different tasks that can be done with Cohere.

In [3]:
API_KEY = ""  # add your API key here.
co_client = cohere.Client(API_KEY)
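
Pasting a key straight into a shared notebook makes it easy to leak. A minimal sketch of one alternative, assuming you export the key in an environment variable before starting the notebook. The variable name `COHERE_API_KEY` and the helper `load_api_key` are made up for this example; they are not something the cohere library requires.

```python
import os


def load_api_key(env_var="COHERE_API_KEY"):
    """Return the API key stored in `env_var` (a hypothetical variable name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (assumes the variable is set before the notebook starts):
# co_client = cohere.Client(load_api_key())
```

Failing loudly on a missing key is a deliberate choice here: an empty string would otherwise only surface later as an authentication error from the API.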

### TREC Dataset
The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labelled questions in the training set and another 500 in the test set. The dataset has 6 coarse class labels and 50 fine class labels. The average sentence length is 10 words, and the vocabulary size is about 8700.

The data for this dataset is collected from 4 sources:
- 4500 English questions published by USC.
- About 500 manually constructed questions for a few rare classes.
- 894 TREC 8 and TREC 9 questions.
- 500 questions from TREC 10, which serve as the test set.

The questions are manually labelled.

Read more about the dataset [here](http://huggingface.co/datasets/trec)

In [5]:
# Load the dataset
data = load_dataset('trec', split='train')

df = pd.DataFrame(data)[:1000]  #First, let's test with 1000 rows.

df.head()

| | label-coarse | label-fine | text |
|---|---|---|---|
| 0 | 0 | 0 | How did serfdom develop in and then leave Russ... |
| 1 | 1 | 1 | What films featured the character Popeye Doyle ? |
| 2 | 0 | 0 | How can I find a list of celebrities ' real na... |
| 3 | 1 | 2 | What fowl grabs the spotlight after the Chines... |
| 4 | 2 | 3 | What is the full form of .com ? |


### Word Embeddings. What are they?
In Natural Language Processing, a word embedding is a representation of a word. It is used in text analysis and is typically a real-valued vector that encodes the 'meaning' of the word in such a way that words that are closer in vector space are expected to be similar in meaning. The same idea extends to whole sentences, which is what we do here: each question gets one vector.

Cohere supports generating embeddings out of the box, which is pretty sick if you ask me.
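
To make "closer in vector space" concrete, here is a small sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors below are invented for illustration; real embeddings come from the model and have thousands of dimensions.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-dimensional "embeddings", invented for illustration only.
toy = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_sim(toy["cat"], toy["kitten"]))  # close to 1: similar direction
print(cosine_sim(toy["cat"], toy["car"]))     # much lower: different direction
```

The `cosine_similarity` function imported from scikit-learn at the top of the notebook computes the same quantity for whole matrices at once.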

In [9]:
embeds = co_client.embed(
    texts=list(df['text']),
    model='large',
    truncate='RIGHT'
).embeddings

embeds = np.array(embeds)
print(embeds.shape)  # debug: (number of texts, embedding dimension)

(1000, 4096)


### Don't ANNOY me. 
I love Spotify. It has become an indispensable part of my life. Apart from giving us such an amazing application, Spotify also contributes a lot to open source. One of their most popular open-source projects is Annoy (Approximate Nearest Neighbors Oh Yeah). Annoy is a library for finding points close to a given query point. It builds large, read-only, file-based data structures that are memory-mapped, so many processes can share the same data simultaneously.

There are other libraries too, such as ScaNN (Google), but I opted to go with Annoy. Within Spotify, Annoy is used for music recommendations: after running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space, and Annoy is used to search for similar users/items.

The best part - Erik Bernhardsson built Annoy in a couple of afternoons during Hack Week. Amazing.

Read more about Annoy [here](http://github.com/spotify/annoy) 

In [10]:
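# Index sized to the embedding dimension, using angular distance
search_index = AnnoyIndex(embeds.shape[1], 'angular')
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])  # item id i matches the row position in df

search_index.build(10)  # 10 trees; more trees give better accuracy but a bigger index
search_index.save('test.ann')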

True
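
For intuition about what the index approximates, here is a sketch of the exact, brute-force version of the same search in plain NumPy. It assumes Annoy's `'angular'` metric, which its README defines as `sqrt(2 * (1 - cos(u, v)))`. The helper names and the random stand-in data are invented for this example; this is not how Annoy works internally (it uses trees to avoid comparing against every item).

```python
import numpy as np


def angular_distances(query, vectors):
    """Annoy-style angular distance: sqrt(2 * (1 - cosine similarity))."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos = np.clip(v @ q, -1.0, 1.0)  # guard against tiny float overshoot
    return np.sqrt(2.0 * (1.0 - cos))


def exact_neighbors(query, vectors, k):
    """Return (indices, distances) of the k closest rows, nearest first."""
    dists = angular_distances(query, vectors)
    order = np.argsort(dists)[:k]
    return order, dists[order]


# Random stand-in data; in the notebook this would be `embeds`.
rng = np.random.default_rng(0)
fake_embeds = rng.normal(size=(100, 16))

ids, dists = exact_neighbors(fake_embeds[7], fake_embeds, k=3)
print(ids, dists)  # the query row itself comes back first, at distance ~0
```

Brute force is perfectly fine for 1000 rows. An approximate index like Annoy starts to pay off when the collection is large enough that scanning every vector per query becomes too slow.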

### Finding the neighbors of a user query

We can search for existing items as well as for a new query using Annoy. Just embed the query, run it through Annoy, and get the nearest neighbors. It's that easy. Change the query in the next cell and see for yourself.

In [15]:
EXAMPLE_ID = 69  # Nice
NUM_NEIGHBORS = 10
QUERY = "What is the oldest profession?"   #  Write your own query here(Eg: What is the oldest profession?).

query_embed = co_client.embed(
    texts = [QUERY],
    model = 'large',
    truncate='RIGHT'
).embeddings


similar_item_ids = search_index.get_nns_by_vector(
    query_embed[0], NUM_NEIGHBORS, include_distances=True)

results = pd.DataFrame(
    data={
        'texts': df.iloc[similar_item_ids[0]]['text'],
        'distance': similar_item_ids[1],
    }
)

print(f"The Question is : {QUERY}")
print('The nearest neighbors are : ')
results

The Question is : What is the oldest profession?
The nearest neighbors are : 


| | texts | distance |
|---|---|---|
| 7 | What is the oldest profession ? | 0.27414 |
| 852 | What is her profession ? | 1.02762 |
| 886 | What is the oldest website on the Internet ? | 1.064128 |
| 183 | What is the occupation of Nicholas Cage ? | 1.120292 |
| 100 | Who invented Make-up ? | 1.147484 |
| 320 | How did the tradition of best man start ? | 1.153791 |
| 403 | What is the longest-running television series ? | 1.15709 |
| 149 | Who earns their money the hard way ? | 1.157263 |
| 640 | What is the origin of the word , magic ? | 1.168651 |
| 843 | What is the longest English word ? | 1.171553 |
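
The distance column is easier to read once converted back to cosine similarity. Annoy's README defines the `'angular'` metric as `d = sqrt(2 * (1 - cos))`, so `cos = 1 - d^2 / 2`. A small sketch of that conversion, with a helper name chosen just for this example:

```python
def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


# Distances copied from the first two rows of the results table.
for d in (0.27414, 1.02762):
    print(round(angular_to_cosine(d), 3))
# prints 0.962, then 0.472
```

So the top hit is a near-duplicate of the query (it differs only in spacing before the question mark), while the runner-up is already much less similar.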


### Let's Visualize !!

It's probably the best part. Now, we'll visualize all 1000 questions we embedded. For this, we'll use the umap and altair modules.

- umap: Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that can be used for visualizing datasets, and also for general non-linear dimension reduction.

- altair: Altair is a declarative statistical visualization library for Python. Its API is simple, consistent, and built on top of the Vega-Lite JSON specification.

Read more about [umap](https://github.com/lmcinnes/umap) and [altair](https://altair-viz.github.io/).
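
UMAP itself is non-linear and too involved to reproduce in a few lines. To make "squeezing thousands of dimensions down to 2" concrete, here is a sketch of a simpler, linear technique, PCA, written in plain NumPy. This is a stand-in, not what the notebook uses, and it will generally give a different picture than UMAP. The function name and the random data are placeholders.

```python
import numpy as np


def pca_2d(vectors):
    """Project rows onto their top two principal components (PCA, not UMAP)."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
fake_embeds = rng.normal(size=(50, 32))  # stand-in for the real `embeds`
coords = pca_2d(fake_embeds)
print(coords.shape)  # (50, 2)
```

Each row of `coords` is an (x, y) point that could be plotted the same way as the UMAP output.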

In [19]:
# Let's visualize :

reducer = umap.UMAP(n_neighbors=20)
umap_embeds = reducer.fit_transform(embeds)

df_explore = pd.DataFrame(
    data = {
        'text': df['text']
    }
)

df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]

# Plot

chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x = alt.X(
        'x',
        scale=alt.Scale(zero=False)
    ),
    y = alt.Y(
        'y',
        scale=alt.Scale(zero=False)
    ),
    tooltip=['text'],
).properties(
    title="Nearest Neighbors for TREC dataset",
    width=700,
    height=400,)

chart.interactive()

## That's it folks. 
I've wanted to do an NLP project for a long time and have started studying the field in more depth. This notebook was a good starting point for what is about to come. Be ready, and thanks a lot for reading!