This notebook shows how to score the relevance of the documents retrieved for a query, using a cross-encoder to re-rank them.

Two approaches:


1) Re-ranking a long tail of results.

    a. For a query we retrieve a list of 10 documents.

    b. We then create pairs of query and each document.

    c. Then we ask the cross-encoder to score these pairs for relevance and re-rank them.



2) Re-ranking after query expansion.

    a. We expand the given query with a set of additional queries.

    b. We then retrieve 10 documents for each query.

    c. We remove duplicate documents.

    d. We then pair the original query with each of the remaining documents and score the pairs.
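
The two approaches above can be sketched as a pair of small functions. This is only an illustration of the control flow, not the notebook's implementation: `toy_retrieve` and `toy_score` are hypothetical stand-ins for the vector store query and the cross-encoder, and the three-sentence corpus is made up.

```python
def rerank(query, documents, score_pairs):
    """Approach 1: score (query, document) pairs and sort best-first."""
    pairs = [[query, doc] for doc in documents]
    scores = score_pairs(pairs)
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(documents[i], float(scores[i])) for i in order]


def rerank_with_expansion(original_query, expanded_queries, retrieve, score_pairs):
    """Approach 2: retrieve per query, deduplicate, score against the original query."""
    seen = {}
    for q in [original_query, *expanded_queries]:
        for doc in retrieve(q):
            seen.setdefault(doc, None)  # keeps first-seen order, drops repeats
    return rerank(original_query, list(seen), score_pairs)


# Hypothetical stand-ins so the sketch runs without a model or a database.
corpus = [
    "revenue grew on cloud demand",
    "inventories were raw materials",
    "cloud revenue growth drivers",
]


def toy_retrieve(query):
    """Return every corpus document sharing at least one word with the query."""
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.split())]


def toy_score(pairs):
    """Score a pair by the number of words the query and document share."""
    return [len(set(q.lower().split()) & set(d.split())) for q, d in pairs]


ranked = rerank_with_expansion(
    "what drove revenue growth", ["cloud demand drivers"], toy_retrieve, toy_score
)
for doc, score in ranked:
    print(score, doc)
```

In the notebook itself, the retrieval step is a Chroma query and the scoring step is `CrossEncoder.predict`; the functions only need something with the same shape.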


This notebook is adapted from:
https://learn.deeplearning.ai/advanced-retrieval-for-ai/lesson/5/cross-encoder-re-ranking

In [None]:
%pip install --upgrade --quiet pypdf python-dotenv\
    chromadb langchain sentence-transformers

In [4]:
from embedding_utils import load_chroma, word_wrap
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()

chroma_collection = load_chroma(filename='microsoft_annual_report_2022.pdf',
    collection_name='microsoft_annual_report_2022',
    embedding_function=embedding_function)

chroma_collection.count()

349

1. Re-ranking the long tail

In [6]:
query = "What has been the investment in research and development?"
results = chroma_collection.query(query_texts=query, n_results=10, include=['documents', 'embeddings'])

retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(word_wrap(document))
    print()

• operating expenses increased $ 1. 5 billion or 14 % driven by
investments in gaming, search and news advertising, and windows
marketing. operating expenses research and development ( in millions,
except percentages ) 2022 2021 percentage change research and
development $ 24, 512 $ 20, 716 18 % as a percent of revenue 12 % 12 %
0ppt research and development expenses include payroll, employee
benefits, stock - based compensation expense, and other headcount -
related expenses associated with product development. research and
development expenses also include third - party development and
programming costs, localization costs incurred to translate software
for international markets, and the amortization of purchased software
code and services content. research and development expenses increased
$ 3. 8 billion or 18 % driven by investments in cloud engineering,
gaming, and linkedin. sales and marketing

competitive in local markets and enables us to continue to attract top
talent from ac

In [7]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [8]:
pairs = [[query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)
print(scores)

[  0.98693573   2.6445804   -0.2680317  -10.731592    -7.7066045
  -5.646997    -4.2970366  -10.933232    -7.038429    -7.324694  ]
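
The scores printed above are unbounded and mostly negative, which suggests they are raw model outputs rather than probabilities. If values between 0 and 1 are easier to read, one option is to pass them through a sigmoid. The sketch below assumes that reading of the scores; it copies the ten values from the output above and does not change their order, since the sigmoid is monotonic.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Scores copied from the cell output above.
scores = [0.98693573, 2.6445804, -0.2680317, -10.731592, -7.7066045,
          -5.646997, -4.2970366, -10.933232, -7.038429, -7.324694]

normalized = [sigmoid(s) for s in scores]
for position, (raw, norm) in enumerate(zip(scores, normalized), start=1):
    print(f"{position:>2}  raw={raw:>10.4f}  sigmoid={norm:.4f}")
```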


Sort the (document, score) pairs from highest to lowest score. The 2nd retrieved document scores highest and the 8th scores lowest. What's more, the 4th document drops to second to last.

In [9]:
ranks = [(index+1, score) for index, score in enumerate(scores)]
ranks.sort(key=lambda p: p[1], reverse=True)
ranks

[(2, 2.6445804),
 (1, 0.98693573),
 (3, -0.2680317),
 (7, -4.2970366),
 (6, -5.646997),
 (9, -7.038429),
 (10, -7.324694),
 (5, -7.7066045),
 (4, -10.731592),
 (8, -10.933232)]
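
Once the ranking is known, a common next step is to keep only the best few documents. The sketch below shows one way to do that by sorting scores and documents together. The scores are copied from the output above; the document strings and `top_k` are placeholders chosen for illustration.

```python
# Scores copied from the cell output above.
scores = [0.98693573, 2.6445804, -0.2680317, -10.731592, -7.7066045,
          -5.646997, -4.2970366, -10.933232, -7.038429, -7.324694]

# Placeholder texts standing in for retrieved_documents.
documents = [f"document {i}" for i in range(1, len(scores) + 1)]

top_k = 3  # illustrative cutoff, not a value from the notebook

# Sort (score, document) pairs by score only, best first, then keep the top_k texts.
ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
top_documents = [doc for _, doc in ranked[:top_k]]
print(top_documents)
# prints ['document 2', 'document 1', 'document 3']
```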

2. Re-ranking with query expansion

In [10]:
original_query = "What were the most important factors that contributed to increases in revenue?"
generated_queries = [
    "What were the major drivers of revenue growth?",
    "Were there any new product launches that contributed to the increase in revenue?",
    "Did any changes in pricing or promotions impact the revenue growth?",
    "What were the key market trends that facilitated the increase in revenue?",
    "Did any acquisitions or partnerships contribute to the revenue growth?"
]

queries = [original_query] + generated_queries

results = chroma_collection.query(query_texts=queries, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents']

Deduplicate the documents, since the same document may have been retrieved for more than one query.

In [19]:
unique_docs = list(set(doc for docs in retrieved_documents for doc in docs))
print(f"unique docs = {len(unique_docs)}")

unique docs = 23
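
One caveat with `set`: the iteration order of a set of strings can change between Python processes, because string hashing is randomized unless `PYTHONHASHSEED` is fixed. The document indices in the ranking may therefore differ from run to run. If stable indices matter, an order-preserving alternative is `dict.fromkeys`, sketched here on made-up data.

```python
# Toy stand-in for results['documents']: one list of texts per query.
retrieved_documents = [
    ["a", "b", "c"],
    ["b", "d"],
    ["a", "e"],
]

# Flatten the per-query lists, then drop repeats while keeping first-seen order.
flat = [doc for docs in retrieved_documents for doc in docs]
unique_docs = list(dict.fromkeys(flat))

print(f"unique docs = {len(unique_docs)}")
print(unique_docs)
# prints ['a', 'b', 'c', 'd', 'e']
```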


In [12]:
pairs = [[original_query, doc] for doc in unique_docs]
scores = cross_encoder.predict(pairs)
ranks = [(index+1, score) for index, score in enumerate(scores)]
ranks.sort(key=lambda p: p[1], reverse=True)
ranks

[(6, -1.136996),
 (15, -3.7681513),
 (18, -3.7948644),
 (20, -4.341767),
 (5, -4.6518893),
 (4, -4.8184843),
 (23, -5.1418314),
 (17, -5.27475),
 (22, -6.9020915),
 (16, -7.4906545),
 (12, -7.7541003),
 (1, -7.917177),
 (7, -8.505109),
 (11, -9.357722),
 (19, -9.768025),
 (8, -9.807878),
 (2, -9.918429),
 (14, -10.000137),
 (21, -10.042843),
 (3, -10.0839405),
 (10, -10.148885),
 (13, -10.711211),
 (9, -11.079268)]

Top-ranked document

In [17]:
index = ranks[0][0] - 1
print(original_query)
print()
print(word_wrap(unique_docs[index]))

What were the most important factors that contributed to increases in revenue?

• windows revenue increased $ 2. 3 billion or 10 % driven by growth in
windows oem and windows commercial. windows oem revenue increased 11 %
driven by continued strength in the commercial pc market, which has
higher revenue per license. windows commercial products and cloud
services revenue increased 11 % driven by demand for microsoft 365.


Lowest-ranked document

In [18]:
index = ranks[-1][0] - 1
print(original_query)
print()
print(word_wrap(unique_docs[index]))

What were the most important factors that contributed to increases in revenue?

66 gains ( losses ), net of tax, on derivative instruments recognized
in our consolidated comprehensive income statements were as follows : (
in millions ) year ended june 30, 2022 2021 2020 designated as cash
flow hedging instruments foreign exchange contracts included in
effectiveness assessment $ ( 57 ) $ 34 $ ( 38 ) note 6 — inventories
the components of inventories were as follows : ( in millions ) june
30, 2022 2021 raw materials $ 1, 144 $ 1, 190 work in process 82 79
finished goods 2, 516 1, 367 total $ 3, 742 $ 2, 636 note 7 — property
and equipment the components of property and equipment were as follows
: ( in millions ) june 30, 2022 2021 land $ 4, 734 $ 3, 660 buildings
and improvements 55, 014 43, 928 leasehold improvements 7, 819 6, 884
computer equipment and software 60, 631 51, 250
