<a href="https://colab.research.google.com/github/rahiakela/nlp-research-and-practice/blob/main/ai-powered-search/13_2_semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In this notebook, we're going to install a transformer model, analyze its embedding output, and compare some vectors.

In [1]:
#outdoors
![ ! -d 'outdoors' ] && git clone --depth=1 https://github.com/ai-powered-search/outdoors.git
! cd outdoors && git pull
! cd outdoors && cat outdoors.tgz.part* > outdoors.tgz
! cd outdoors && mkdir -p '../data/outdoors/' && tar -xvf outdoors.tgz -C '../data/outdoors/'

Cloning into 'outdoors'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 25 (delta 0), reused 22 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (25/25), 491.39 MiB | 16.59 MiB/s, done.
Updating files: 100% (23/23), done.
Already up to date.
README.md
concepts.pickle
._guesses.csv
guesses.csv
._guesses_all.json
guesses_all.json
outdoors_concepts.pickle
outdoors_embeddings.pickle
._outdoors_golden_answers.csv
outdoors_golden_answers.csv
._outdoors_golden_answers.xlsx
outdoors_golden_answers.xlsx
._outdoors_golden_answers_20210130.csv
outdoors_golden_answers_20210130.csv
outdoors_labels.pickle
outdoors_question_answering_contexts.json
outdoors_questionanswering_test_set.json
outdoors_questionanswering_train_set.json
._posts.csv
posts.csv
predicates.pickle
pull_aips_dependency.py
._question-answer-seed-contexts.csv
question-answer-seed-contexts.csv
question-answer-sq

In [None]:
%%capture

!pip install nmslib

In [9]:
import sys
import os
sys.path.append("../..")
# from aips import *
import pandas as pd
import numpy as np
import pickle
import json
import tqdm

import nmslib
import sentence_transformers
from IPython.display import display, HTML

In [None]:
from sentence_transformers import SentenceTransformer
transformer = SentenceTransformer("roberta-base-nli-stsb-mean-tokens")

## Get embeddings

In [10]:
def get_embeddings(texts, model, cache_name, ignore_cache=False):
  # Cached embeddings are stored as a pickle under data/outdoors/
  cache_file_name = f"data/outdoors/{cache_name}.pickle"
  if ignore_cache or not os.path.isfile(cache_file_name):
    # No usable cache: encode the texts and save the result for next time
    embeddings = model.encode(texts)
    os.makedirs(os.path.dirname(cache_file_name), exist_ok=True)
    with open(cache_file_name, "wb") as cache_file:
      pickle.dump(embeddings, cache_file)
  else:
    # Cache hit: load the previously computed embeddings
    with open(cache_file_name, "rb") as cache_file:
      embeddings = pickle.load(cache_file)
  return embeddings
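
The caching pattern used by `get_embeddings` can be tried in isolation. The following is a minimal sketch, not the notebook's own code: `StubModel`, `cached_encode`, and the temporary cache path are hypothetical names introduced here so the example runs without downloading a transformer.

```python
import os
import pickle
import tempfile

import numpy as np


class StubModel:
    """Hypothetical stand-in for a SentenceTransformer; counts encode() calls."""

    def __init__(self):
        self.calls = 0

    def encode(self, texts):
        self.calls += 1
        # One fake 4-dimensional vector per input text
        return np.ones((len(texts), 4), dtype=np.float32)


def cached_encode(texts, model, cache_path, ignore_cache=False):
    # Recompute when asked to, or when no cache file exists yet
    if ignore_cache or not os.path.isfile(cache_path):
        embeddings = model.encode(texts)
        with open(cache_path, "wb") as cache_file:
            pickle.dump(embeddings, cache_file)
        return embeddings
    # Otherwise return whatever was stored under this path
    with open(cache_path, "rb") as cache_file:
        return pickle.load(cache_file)


model = StubModel()
cache_path = os.path.join(tempfile.mkdtemp(), "demo.pickle")
first = cached_encode(["tent", "stove"], model, cache_path)   # encodes and writes
second = cached_encode(["tent", "stove"], model, cache_path)  # reads from disk
third = cached_encode(["a", "b", "c"], model, cache_path)     # different texts, same cache
print(model.calls, second.shape, third.shape)
```

Note that the third call still returns the two cached rows: a cache keyed only by file name cannot tell that the input texts changed, so pass `ignore_cache=True` (or use a new cache name) whenever the texts differ from those that produced the cache.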

In [45]:
def normalize_embedding(embedding):
  normalized = np.divide(embedding, np.linalg.norm(embedding))
  return list(map(float, normalized))
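
Normalizing each embedding to unit length is what lets a plain dot product stand in for cosine similarity later on. A small sketch with two hand-picked vectors (the values are illustrative only) shows the two quantities agree:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity computed directly from the raw vectors
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dot product of the same vectors after scaling each to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = float(np.dot(a_unit, b_unit))

print(round(cosine, 6), round(dot_of_units, 6))
```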

In [46]:
def rank_similarities(phrases, similarities):
  a_phrases = []
  b_phrases = []
  scores = []
  for a in range(len(similarities) - 1):
    for b in range(a + 1, len(similarities)):
      a_phrases.append(phrases[a])
      b_phrases.append(phrases[b])
      scores.append(float(similarities[a][b]))
  dataframe = pd.DataFrame({
      "score": scores,
      "phrase a": a_phrases,
      "phrase b": b_phrases
  })
  dataframe["idx"] = range(len(dataframe))
  dataframe = dataframe.reindex(columns=["idx", "score", "phrase a", "phrase b"])
  return dataframe.sort_values(by=["score"], ascending=False, ignore_index=True)
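
The nested loops in `rank_similarities` visit every unordered pair exactly once and skip self-comparisons. A short sketch using `itertools.combinations`, which yields pairs in the same order as those loops, makes the pair count explicit; `demo_phrases` and `pair_count` are hypothetical names used only for illustration.

```python
from itertools import combinations


def pair_count(n):
    # Number of unordered pairs among n items: n choose 2
    return n * (n - 1) // 2


demo_phrases = ["tent", "stove", "rope", "map"]
pairs = [
    (demo_phrases[a], demo_phrases[b])
    for a, b in combinations(range(len(demo_phrases)), 2)
]
print(len(pairs), pairs[0], pairs[-1])
print(pair_count(100))
```

For a 100 x 100 similarity matrix this gives 4950 rows, so the `idx` column runs from 0 to 4949.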

In [35]:
outdoors_dataframe = pd.read_csv("data/outdoors/posts.csv")
# filter NaN title column
titles = outdoors_dataframe[outdoors_dataframe['title'].notna()]["title"]
# titles = list(filter(None, titles))
titles.head(10)

Unnamed: 0,title
0,How do I treat hot spots and blisters when I h...
1,Where in the Alps is it safe to drink the wate...
2,Is it legal to camp on private property in Rus...
3,What are the critical dimensions to a safe bea...
4,Can I sail a raft on a European river with com...
6,What is the safest way to purify water?
8,How can you navigate without a compass or GPS
9,What is the fastest method to 'break in' full ...
10,How do I know what size ice axe I should get?
12,What can I do to prevent altitude sickness?


In [42]:
# Encoding the titles into embeddings
outdoors_dataframe = pd.read_csv("data/outdoors/posts.csv")
titles = outdoors_dataframe[outdoors_dataframe['title'].notna()]["title"]
titles = list(filter(None, titles))

cache_name = "outdoors_embeddings"
embeddings = get_embeddings(titles, transformer, cache_name)

print(f"Number of embeddings: {len(embeddings)}")
print(f"Dimensions per embedding: {len(embeddings[0])}")

Number of embeddings: 12375
Dimensions per embedding: 768


In [47]:
# Explore the top similarities for the titles
normalized_embeddings = list(map(normalize_embedding, embeddings))
# Find the pairs with the highest dot product scores
similarities = sentence_transformers.util.dot_score(
    normalized_embeddings[0:100],
    normalized_embeddings[0:100]
)
comparisons = rank_similarities(titles, similarities)
display(HTML(comparisons[:10].to_html(index=False)))

idx,score,phrase a,phrase b
2632,0.833662,How can I acclimatize to cold?,How do I tie a sleeping bag to my backpack?
3113,0.815187,How to reduce the annoying sound of falling raindrops on a tent?,Serrated vs flat-edge knives
427,0.770643,Can I sail a raft on a European river with commercial traffic?,What can I do to prevent getting poison ivy?
1321,0.686263,How does one dry clothes in humid weather?,"What to look for in a durable, 3-season sleeping bag?"
4687,0.666397,How do I desalinate seawater?,"In rock-climbing, how do I safely belay another climber?"
435,0.641523,Can I sail a raft on a European river with commercial traffic?,How should I check that the anchor is secure when I anchor a small yacht off unfamiliar land?
3891,0.623302,"When stranded at sea, should I not ask for a tow?",How do I desalinate seawater?
12,0.585904,How do I treat hot spots and blisters when I have no moleskin?,How should I treat poison ivy?
3327,0.579908,What can I do to prevent getting poison ivy?,Is drinking urine safe?
3304,0.579183,What can I do to prevent getting poison ivy?,How should I check that the anchor is secure when I anchor a small yacht off unfamiliar land?


In [48]:
# Parentheses (not braces) so the cell evaluates to the plot itself and the notebook renders it
from plotnine import *
(
    ggplot(comparisons, aes("idx", "score")) +
    geom_point(alpha=.05)
)

In [49]:
# Parentheses (not braces) so the cell evaluates to the plot itself and the notebook renders it
from plotnine import *
(
    ggplot(comparisons, aes("idx", "score")) +
    geom_violin(color="blue") +
    scale_y_continuous(limits=[-0.4, 1.0], breaks=[-0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1.0])
)

## Searching ANN Index

In [None]:
# initialize a new index, using a HNSW index on Dot Product
concepts_index = nmslib.init(method='hnsw', space='negdotprod')
normalized_embeddings = list(map(normalize_embedding, embeddings))

# All the embeddings can be added in a single batch
concepts_index.addDataPointBatch(normalized_embeddings)
# Commits the index to memory. This must be done before you can query for nearest neighbors
concepts_index.createIndex(print_progress=True)
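
HNSW returns approximate neighbors, so it can help to compare it against an exact search on a small sample. The sketch below is a brute-force baseline in NumPy, not part of nmslib: `exact_top_k` is a hypothetical helper, and the random vectors stand in for real embeddings. It uses the same convention as the `negdotprod` space, where the distance is the negated dot product and smaller means closer.

```python
import numpy as np


def exact_top_k(query, vectors, k):
    # Negative dot product against every vector: smaller is closer
    distances = -(vectors @ query)
    # Indices of the k smallest distances, nearest first
    ids = np.argsort(distances)[:k]
    return ids, distances[ids]


rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 16))
# Scale every row to unit length, as normalize_embedding does per vector
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

ids, distances = exact_top_k(vectors[25], vectors, k=5)
print(ids, -distances)
```

Querying with a vector that is itself in the collection should return that vector first with a similarity of about 1.0, which is a quick sanity check for either the exact or the approximate index.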

In [None]:
with open("data/outdoors/outdoors_labels.pickle", "rb") as labels_file:
  labels = pickle.load(labels_file)

# NOTE: `phrases` is not defined earlier in this notebook; it is assumed to be the list of
# concept phrases in the same order as `embeddings`.
# Gets the top k nearest neighbors for the query term "bag" (embedding 25) in our embeddings
ids, _ = concepts_index.knnQuery(normalized_embeddings[25], k=10)
matches = [labels[phrases[i]].lower() for i in ids]
print(matches)

['bag', 'bag ratings', 'bag cover', 'bag liner', 'garbage bags', 'wag bags', 'bag cooking', 'airbag', 'paper bag', 'tea bags']


In [None]:
# Encode a query and return its k-nearest-neighbor concepts
def print_labels(query, matches):
  display(HTML(f"<h4>Results for: <em>{query}</em></h4>"))
  for (l, d) in matches:
    print(str(int(d * 1000) / 1000), "|", l)

def embedding_search(index, query, phrases, k=20, min_similarity=0.75):
  matches = []
  # Gets the embeddings for query
  query_embedding = transformer.encode(query)
  query_embedding = normalize_embedding(query_embedding)
  ids, distances = index.knnQuery(query_embedding, k=k)
  for i in range(len(ids)):
    # Converts negative dot product distance into a positive dot product
    similarity = distances[i] * -1
    if similarity >= min_similarity:
      matches.append((phrases[ids[i]], similarity))
  if not matches:
    # Nothing met the threshold: fall back to the single nearest neighbor
    matches.append((phrases[ids[0]], distances[0] * -1))
  return matches
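
The thresholding step in `embedding_search` can be looked at separately from the index. In this sketch, `filter_matches` is a hypothetical helper and the candidate list is hand-written, reusing the similarity values printed for "polar bear" further down. The fallback to the single nearest candidate when nothing passes the threshold is an assumption of this sketch.

```python
def filter_matches(candidates, min_similarity):
    # candidates: (phrase, negative dot product) pairs, nearest first
    matches = [
        (phrase, -distance)
        for phrase, distance in candidates
        if -distance >= min_similarity
    ]
    # Assumed fallback: keep the nearest candidate if nothing passed
    return matches or [(candidates[0][0], -candidates[0][1])]


candidates = [("polar bear", -1.0), ("polar", -0.804), ("polaris", -0.774)]
print(filter_matches(candidates, 0.75))
print(filter_matches(candidates, 0.80))
```

Raising `min_similarity` from 0.75 to 0.80 drops "polaris" here, which illustrates how the threshold trades recall for precision in the suggestions.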

In [None]:
def semantic_suggest(query, phrases):
  matches = embedding_search(concepts_index, query, phrases)
  print_labels(query, matches)

In [None]:
semantic_suggest("mountain hike", phrases)

1.0 | mountain hike
0.975 | mountain hiking
0.847 | mountain trail
0.787 | mountain guide
0.779 | mountain terrain
0.775 | mountain climbing
0.768 | mountain ridge
0.754 | winter hike


In [None]:
semantic_suggest("dehyd", phrases)

0.941 | dehydrate
0.931 | dehydration
0.852 | rehydration
0.851 | dehydrator
0.836 | hydration
0.835 | hydrating
0.822 | rehydrate
0.812 | hydrate
0.788 | hydration pack
0.776 | hydration system


In [None]:
semantic_suggest("polar bear", phrases)

1.0 | polar bear
0.804 | polar
0.774 | polaris


In [None]:
semantic_suggest("bear", phrases)

1.0 | bear
0.906 | bear territory
0.897 | bear country
0.896 | bear box
0.868 | bear attack
0.853 | bear population
0.851 | bear cub
0.84 | bear bag
0.834 | bear banger
0.817 | bear hang
0.816 | bear guard
0.81 | bear pole
0.805 | bear can
0.8 | bear bell
0.794 | bear encounter
0.789 | bear activity
0.778 | bear canister
0.771 | bear spray
0.765 | fred bear

