<a href="https://colab.research.google.com/github/winterForestStump/thesis/blob/main/notebooks/semantic_search_visualisation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Semantic Search
Language models give computers the ability to search by meaning and go beyond searching by matching keywords. This capability is called semantic search.


In this notebook, we'll build a simple semantic search engine. Semantic search is useful beyond web search: it can also power a private search engine for internal documents or records. The notebook covers four steps:

1. Get the documents
2. Embed the documents
3. Search using an index and nearest neighbor search
4. Visualize the archive based on the embeddings

In [1]:
%%capture
# Install LangChain for text splitting, sentence-transformers for the embedding model,
# UMAP to reduce the embeddings to 2 dimensions, Altair for visualization,
# and Annoy for approximate nearest neighbor search
!pip install umap-learn altair annoy tqdm langchain sentence-transformers

## 0. Getting set up

In [2]:
#@title Import libraries (Run this cell to execute required code) {display-mode: "form"}

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from langchain.text_splitter import RecursiveCharacterTextSplitter
import numpy as np
import re
import pandas as pd
import requests
from tqdm import tqdm
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

## 1. Get the documents
We'll use an example text file from the GitHub repository https://github.com/winterForestStump/thesis/tree/90161d198abf76582a8c309668d8a02b41ff234c/notebooks/example_texts and split it into overlapping chunks of at most 256 characters.

In [3]:
text = requests.get('https://raw.githubusercontent.com/winterForestStump/thesis/main/notebooks/example_texts/txt_examp_1.txt').text

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=32,
    length_function=len,
    is_separator_regex=False)

texts = text_splitter.create_documents([text])
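
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph breaks and spaces, so its chunks will not match these fixed windows. The function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text, chunk_size=256, chunk_overlap=32):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not the LangChain algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the same effect the overlap has on the report text: a sentence cut at a chunk boundary still appears with some of its context in the next chunk.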

In [4]:
data_list = [doc.page_content for doc in texts]
df = pd.DataFrame(data_list)
print(df.shape)

(2078, 1)


In [5]:
df.head(5)

Unnamed: 0,0
0,"Item 1. Business\r\nAs used in this report, unless the context indicates otherwise, the terms we, our, us, the Company, ""Air Products,"" or registrant include controlled subsidiaries and affiliates of Air Products.\r\nAbout Air Products"
1,"Air Products and Chemicals, Inc., a Delaware corporation originally founded in 1940, is a world-leading industrial gases company that has built a reputation for its innovative culture, operational excellence, and commitment to safety and the environment."
2,"to safety and the environment. Focused on serving energy, environmental, and emerging markets, we offer a portfolio of products and services that enables customers to improve their environmental performance, product quality, and productivity."
3,Air Products has a sustainability-driven two-pillar growth strategy consisting of the expansion and efficient operation of our core industrial gases business and the execution of projects that provide world-scale clean hydrogen. Our industrial gases
4,"hydrogen. Our industrial gases business provides essential gases, related equipment, and applications expertise to customers in dozens of industries, including refining, chemicals, metals, electronics, manufacturing, medical, and food. We also develop,"


## 2. Embed
The next step is to embed the text chunks, turning each one into a fixed-length vector.

In [6]:
model = SentenceTransformer('thenlper/gte-large')

In [29]:
embeds = model.encode(list(df[0]))
print(cos_sim(embeds[0], embeds[1]))

tensor([[0.8442]])
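
The value printed above is a cosine similarity: the dot product of two vectors divided by the product of their lengths. As a minimal sketch of what that computation does, here is a plain-Python version on toy 3-dimensional vectors. The helper name `plain_cos_sim` and the vectors are invented for illustration; the notebook itself relies on `cos_sim` from sentence-transformers.

```python
import math


def plain_cos_sim(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice as long
c = [3.0, 0.0, -1.0]  # orthogonal to a (dot product is 0)

print(plain_cos_sim(a, b))  # close to 1.0: same direction
print(plain_cos_sim(a, c))  # 0.0: unrelated directions
```

Because the measure ignores vector length, only the direction of an embedding matters when comparing two chunks.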


In [30]:
# Check the dimensions of the embeddings
embeds = np.array(embeds)
embeds.shape

(2078, 1024)

## 3. Search using an index and nearest neighbor search
Let's now use [Annoy](https://github.com/spotify/annoy) to build an index that stores the embeddings in a way that is optimized for fast search. This approach scales well to a large number of texts (other options include [Faiss](https://github.com/facebookresearch/faiss), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), and [PyNNDescent](https://github.com/lmcinnes/pynndescent)).

After building the index, we can use it to retrieve the nearest neighbors either of existing documents (section 3.1), or of new documents that we embed (section 3.2).
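
For comparison, the exact search that an approximate index stands in for can be sketched in a few lines: score every stored vector against the query and keep the top `k`. This is a toy baseline in plain Python, not how Annoy works internally; the function `exact_nearest` and the 2-dimensional vectors are assumptions made for the example.

```python
import math


def exact_nearest(query, vectors, k):
    """Return the k (index, cosine similarity) pairs most similar to query."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    scored = [(cos(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, sim) for sim, idx in scored[:k]]


toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.2]

top = exact_nearest(toy_query, toy_vectors, k=2)
print([idx for idx, _ in top])  # [1, 0]
```

This brute-force scan compares the query with every vector, so its cost grows linearly with the size of the collection. An index such as Annoy trades a little accuracy for much faster lookups on large collections.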

In [31]:
# Create the search index, pass the size of embedding
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('test.ann')

True

### 3.1. Find the neighbors of an example from the dataset
If we're only interested in the similarity between documents already in the dataset (no outside queries), we can look up an item's nearest neighbors directly by its ID in the index. For a small dataset, a simple alternative is to calculate the distance between every pair of embeddings.

In [32]:
# Choose an example (we'll retrieve others similar to it)
example_id = 567

# Retrieve nearest neighbors
similar_item_ids = search_index.get_nns_by_item(example_id,10,
                                                include_distances=True)
# Format the texts and distances, dropping the example itself
# if it appears among its own neighbors
results = pd.DataFrame(data={'texts': df[0].iloc[similar_item_ids[0]],
                             'distance': similar_item_ids[1]}).drop(example_id, errors='ignore')

print(f"Document:'{df[0].iloc[example_id]}'\nNearest neighbors:")
results

Document:'as volatility in equity and debt markets. Further, non-service related components are not indicative of our defined benefit plans future contribution needs due to the funded status of the plans.'
Nearest neighbors:


Unnamed: 0,texts,distance
566,"of our underlying business performance because these components are driven by factors that are unrelated to our operations, such as recent changes to the allocation of our pension plan assets associated with de-risking as well as volatility in equity and",0.463034
553,"For example, we exclude the impact of the non-service components of net periodic benefit/cost for our defined benefit pension plans as further discussed below. Additionally, we may exclude certain expenses associated with cost reduction actions,",0.464222
564,"plan assets, prior service cost amortization, actuarial loss amortization, as well as special termination benefits, curtailments, and settlements. The net impact of non-service related components is reflected within Other non-operating income (expense),",0.493863
842,"information on the significant assumptions, expense, and obligations associated with the defined benefit plans.",0.504534
1246,and other post-retirement plans is generally recognized over the employees service period. We use actuarial methods and assumptions in the valuation of defined benefit obligations and the determination of expense. Differences between actual and expected,0.510673
1696,"by contractual and regulatory requirements for funded plans and benefit payments for unfunded plans, which are dependent upon timing of retirements.",0.511543
706,"cost and lower expected returns on plan assets due to a smaller beginning balance of plan assets. The net impact of non-service related items are reflected within ""Other non-operating income (expense), net"" on our consolidated income statements.",0.514193
729,"plans, which are dependent upon the timing of retirements. Actual future contributions will depend on future funding legislation, discount rates, investment performance, plan design, and various other factors.",0.514947
723,"Pension funding includes both contributions to funded plans and benefit payments for unfunded plans, which are primarily non-qualified plans. With respect to funded plans, our funding policy is that contributions, combined with appreciation and earnings,",0.522806
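
The `distance` column is Annoy's angular distance, which its documentation defines as `sqrt(2 * (1 - cos(u, v)))`. A small helper (the name `angular_to_cosine` is ours, not part of Annoy) turns a distance back into the more familiar cosine similarity:

```python
import math


def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


# Distance of the closest neighbor in the table above
print(round(angular_to_cosine(0.463034), 4))  # 0.8928

# Sanity checks: distance 0 means identical direction,
# distance sqrt(2) means orthogonal vectors
print(angular_to_cosine(0.0))                     # 1.0
print(round(angular_to_cosine(math.sqrt(2)), 6))
```

Smaller distances therefore mean higher similarity, and the closest neighbor above has a cosine similarity of roughly 0.89 to the example.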


### 3.2. Find the neighbors of a user query
We're not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.

In [33]:
query = "What was the revenue (sales) of the company in 2023 and how it changed since 2021?"

# Get the query's embedding
query_embed = model.encode(query)

# Retrieve the nearest neighbors
similar_item_ids = search_index.get_nns_by_vector(query_embed,10,
                                                include_distances=True)
# Format the results
query_results = pd.DataFrame(data={'texts': df[0].iloc[similar_item_ids[0]],
                             'distance': similar_item_ids[1]})


print(f"Query:'{query}'\nNearest neighbors:")
query_results

Query:'What was the revenue (sales) of the company in 2023 and how it changed since 2021?'
Nearest neighbors:


Unnamed: 0,texts,distance
544,Corporate and other\r\nChange vs. Prior Year\r\nFiscal Year Ended 30 September 2023 2022 $ %\r\nSales $889.0 $970.8 ($81.8) (8 %)\r\nOperating loss (287.3) (184.7) (102.6) (56 %)\r\nAdjusted EBITDA (222.4) (130.7) (91.7) (70 %),0.463944
429,"Fiscal Year 2023 Highlights\r\nSales of $12.6 billion decreased 1%, or $98.6, as lower energy cost pass-through to customers of 6% and unfavorable currency of 3% were mostly offset by higher pricing of 5% and higher volumes of 3%.",0.47159
12,"sales in fiscal years 2023, 2022, and 2021, approximately half of which were attributable to atmospheric gases.",0.478526
419,"In fiscal year 2023, we achieved earnings growth through pricing discipline in our merchant business as well as improved on-site volumes, including higher demand for hydrogen, despite inflation, higher maintenance activities, and higher costs to support",0.481915
418,2023 IN SUMMARY,0.48548
1799,"A summary of the changes in common shares issued and outstanding in fiscal year 2023 is presented below:\r\nFiscal Year Ended 30 September 2023 2022 2021\r\nNumber of common shares, beginning of year 221,838,696 221,396,755 221,017,459",0.493965
457,Currency (3 %)\r\nTotal Consolidated Sales Change (1 %),0.499416
402,"Comparisons included in the discussion that follows are for fiscal year 2023 versus (""vs."") fiscal year 2022. A discussion of changes from fiscal year 2021 to fiscal year 2022 and other financial information related to fiscal year 2021 is available in",0.500653
511,"Americas\r\nChange vs. Prior Year\r\nFiscal Year Ended 30 September 2023 2022 $ %/bp\r\nSales $5,369.3 $5,368.9 $0.4 -%\r\nOperating income 1,439.7 1,174.4 265.3 23%\r\nOperating margin 26.8 % 21.9 % 490 bp\r\nEquity affiliates income $109.2 $98.2 $11.0 11%",0.507434
1370,"Our consolidated income statements include income from discontinued operations, net of tax, of $7.4, $12.6, and $70.3 in fiscal years 2023, 2022, and 2021, respectively, primarily from the release of unrecognized tax benefits on uncertain tax positions",0.50899


## 4. Visualizing the archive
Finally, let's plot all the text chunks on a 2D chart to visualize the semantic similarities in this dataset.

In [34]:
embeds.shape

(2078, 1024)

In [35]:
#@title Plot the archive {display-mode: "form"}

# UMAP reduces the embeddings from 1024 dimensions to 2 so that we can plot them
reducer = umap.UMAP(n_neighbors=20)
umap_embeds = reducer.fit_transform(embeds)
# Prepare the data for an interactive visualization
# using Altair
df_explore = pd.DataFrame(data={'text': df[0]})
df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]

# Plot
chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x=alt.X('x', scale=alt.Scale(zero=False)),
    y=alt.Y('y', scale=alt.Scale(zero=False)),
    tooltip=['text']
).properties(
    width=700,
    height=400
)
chart.interactive()

Hover over the points to read the text. Do you see patterns in the clustered points, such as passages that cover similar topics?

This concludes this introductory guide to semantic search using sentence embeddings. As you continue building a search product, additional considerations arise, such as dealing with long texts or fine-tuning the embedding model for a specific use case.


We can’t wait to see what you start building! Share your projects or find support on [Discord](https://discord.com/invite/co-mmunity).
