1. pick some text documents
2. pick three embedding models (all three embedding models require less than 1GB of memory, and are small enough to run locally)
3. put text and embeddings generated by each model into pixeltable

--- query time ---

4. create embeddings of text
5. get top 3 results from each embedding model
6. rank results based on smallest sum of ranking

# Step 1: Import libraries

For this step, we install three libraries:
- pixeltable: pixeltable is the backbone of this example, it helps handle data and lets us keep track of all of our data in one place, abstracting out the need to manage a database like postgres or a vector db while providing the functionality in a simple way
- tiktoken: tiktoken is necessary for encoding and parsing documents
- sentence-transformers: sentence-transformers is where we get all of our embedding models from

In [3]:
# ! pip install pixeltable tiktoken sentence-transformers

Our main imports here are:
- pixeltable, our data management tool
- sentence_transformer from pixeltable, this is how we'll use sentence transformers to create embeddings
- numpy

In [4]:
import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer
import numpy as np

# Step 2: Define Embedding Functions

There are many ways to do reranking. In this version, we compare the similarity rankings from three popular models that are all small enough to run locally.

Pixeltable provides a way to load, manipulate, and store data locally via user defined functions. In this section, we create three different embedding functions to turn text into vector embeddings.

Each of the functions below are defined as `expr_udf` functions, these are the expression user defined functions. Each of these functions takes a string and outputs a numpy array.

In [5]:
@pxt.expr_udf
def gist_embed(text: str) -> np.ndarray:
    return sentence_transformer(text, model_id='avsolatorio/GIST-Embedding-v0') #768 model from Aivin Solatorio

  from tqdm.autonotebook import tqdm, trange


In [6]:
@pxt.expr_udf
def bge_base_embed(text: str) -> np.ndarray:
    return sentence_transformer(text, model_id='BAAI/bge-base-en-v1.5') #768 model from BAAI

In [7]:
@pxt.expr_udf
def e5_embed(text: str) -> np.ndarray:
    return sentence_transformer(text, model_id='intfloat/e5-large-v2') #1024 model from Liang Wang

# Step 3: Create Table

Tables are a core way to organize data in Pixeltable. In this example, we create a table that we're sourcing from links.

For the drop table section, we're also going to drop the table that we create later on that contains the chunked documents.

When we `create_table`, we have to define the structure of the table. The first table we create just needs to have a document.

In [8]:
# drop both tables
pxt.drop_table('chunked_docs') 
pxt.drop_table('html_docs')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/yujian/.pixeltable/pgdata


In [9]:
html_docs = pxt.create_table('html_docs', {'document': pxt.Document})

Created table `html_docs`.


# Step 4: Import Documents

For this example, we are importing HTML documents by scraping them off of Wikipedia. We'll pick four cities and insert the URLs as Documents directly into Pixeltable.

In [10]:
base_url = "https://en.wikipedia.org/wiki/"

In [11]:
cities = [
    "Boston",
    "Seattle",
    "San_Francisco",
    "New_York_City"
]

In [12]:
city_urls = [base_url + cities[x] for x in range(len(cities))]

In [13]:
city_urls

['https://en.wikipedia.org/wiki/Boston',
 'https://en.wikipedia.org/wiki/Seattle',
 'https://en.wikipedia.org/wiki/San_Francisco',
 'https://en.wikipedia.org/wiki/New_York_City']

Here we `insert` rows into the table. Each row should define all of the necessary information in the table, in this case we only have to pass in a document in each insertion.

In [14]:
html_docs.insert(
    {'document': url} for url in city_urls
)

Inserting rows into `html_docs`: 4 rows [00:00, 2114.06 rows/s]███| 8/8 [00:01<00:00,  4.90 cells/s]
Computing cells: 100%|████████████████████████████████████████████| 8/8 [00:01<00:00,  4.89 cells/s]
Inserted 4 rows with 0 errors.


UpdateStatus(num_rows=4, num_computed_values=8, num_excs=0, updated_cols=[], cols_with_excs=[])

# Step 5: Chunk Documents

Now that we have the documents themselves, we have to take the next step in ingesting documents - we need to chunk them. We chunk documents so that we can use them in a more effective manner. Since embeddings are quantitative representations of the data, in this case text, chunking documents lets us create more granular embeddings. This makes them more useful for downstream applications.

To create a table of chunked documents, we use `DocumentSplitter` from Pixeltable and use the `create_view` function to create a new table. `create_view` needs a name for the view ("chunked docs"), the table to create the view from (`html_docs`), and what it's going to do with that table to create the view (`iterator` in this case), then it creates a new table. In this example, our iterator creates a set of documents that gets paragraphs from the existing documents. For a list of available `seperators` within `DocumentSplitter`, see [here](https://pixeltable.github.io/pixeltable/api/iterators/document-splitter/).

In [15]:
from pixeltable.iterators import DocumentSplitter
chunked_docs = pxt.create_view(
    "chunked_docs",
    html_docs,
    iterator=DocumentSplitter.create(
        document=html_docs.document,
        separators='paragraph'
    )
)

Inserting rows into `chunked_docs`: 766 rows [00:00, 33060.00 rows/s]
Created view `chunked_docs` with 766 rows, 0 exceptions.


In [16]:
chunked_docs.describe()

Column Name,Type,Computed With
pos,Required[Int],
text,Required[String],
document,Document,


# Step 6: Create Embedding Indices

Now that we have our chunked documents, we need to be able to access them via semantic search. In this step, we create our three embedding indices.

We use `add_embedding_index` to do this. We'll pass three parameters into this function:
- `col_name`: this tells pixeltable which column to build the embedding on, you have to pass an existing column
- `idx_name`: this tells pixeltable what to name this embedding, pick any name that makes sense
- `string_embed`: this picks the function that is used to create the embedding, you have to use one of the functions defined earlier

In [17]:
chunked_docs.add_embedding_index(col_name="text", idx_name="gist", string_embed=gist_embed)

Computing cells: 100%|████████████████████████████████████████| 766/766 [01:03<00:00, 12.11 cells/s]


In [18]:
chunked_docs.add_embedding_index(col_name="text", idx_name="bge", string_embed=bge_base_embed)

Computing cells: 100%|████████████████████████████████████████| 766/766 [00:52<00:00, 14.65 cells/s]


In [19]:
chunked_docs.add_embedding_index(col_name="text", idx_name="e5", string_embed=e5_embed)

Computing cells: 100%|████████████████████████████████████████| 766/766 [02:39<00:00,  4.80 cells/s]


# Step 7: Get Similarities

Now that everything is set up for retrieval, we can do the first ranking step before the re-ranking step.

In this step, we're creating a function to get the top results back from a query. This function takes three parameters:
- `query_text`: the text we want to find similar text to
- `idx_name`: the index that we want to compare with
- `top_k`: the number of results we want to return, by default we set this to return the top 5 results

Within the function itself, the first thing we do is get the similarities of the `query_text` and the existing documents. Then, we order the results by similarity score, taking the highest score first (`asc=False`), selecting only the text, limiting the results to the `top_k`, and then collecting the results in the form of a table.

In [20]:
def top_k(query_text: str, idx_name: str, top_k: int = 5):
    sim = chunked_docs.text.similarity(query_text, idx=idx_name)
    return chunked_docs.order_by(sim, asc=False).select(chunked_docs.text, sim=sim).limit(top_k).collect()

In [21]:
e5_sims = top_k("Population of Boston", idx_name="e5")
e5_sims

text,sim
Boston - Wikipedia Jump to content Search Search,0.853
"In 1822, [ 15 ] the citizens of Boston voted to change the official name from the ""Town of Boston"" to the ""City of Boston"", and on March 19, 1822, the people of Boston accepted the charter incorporating the city. [ 68 ] At the time Boston was chartered as a city, the population was about 46,226, while the area of the city was only 4.8 sq mi (12 km 2 ). [ 68 ]",0.849
"Boston [ a ] is the capital and most populous city in the Commonwealth of Massachusetts in the United States . The city serves as the cultural and financial center of the New England region of the Northeastern United States . It has an area of 48.4 sq mi (125 km 2 ) [ 9 ] and a population of 675,647 as of the 2020 census , making it the third-largest city in the Northeast after New York City and Philadelphia . [ 4 ] The larger Greater Boston metropolitan statistical area , which includes and surrounds the city, has a population of 4,919,179 as of 2023, making it the largest in New England and eleventh-largest in the country. [ 10 ] [ 11 ] [ 12 ]",0.839
"State capital in New England, United States Boston State capital Downtown from Boston Harbor Acorn Street on Beacon Hill Old State House Massachusetts State House Fenway Park during a Boston Red Sox game Back Bay from the Charles River Flag Seal Coat of arms Wordmark Nickname(s): Bean Town, Title Town, others Motto(s): Sicut patribus sit Deus nobis ( Latin ) 'As God was with our fathers, so may He be with us' Show Boston Show Suffolk County Show Massachusetts Show the United States Boston Sh ...... 6.3/sq mi (1,021.8/km 2 ) • Metro [ 6 ] 4,941,632 (US: 10th ) Demonym Bostonian GDP [ 7 ] • Metro \$571.6 billion (2022) Time zone UTC−5 ( EST ) • Summer ( DST ) UTC−4 ( EDT ) ZIP Codes 53 ZIP Codes [ 8 ] 02108–02137, 02163, 02196, 02199, 02201, 02203–02206, 02210–02212, 02215, 02217, 02222, 02126, 02228, 02241, 02266, 02283–02284, 02293, 02295, 02297–02298, 02467 (also includes parts of Newton and Brookline) Area codes 617 and 857 FIPS code 25-07000 GNIS feature ID 617565 Website boston .gov",0.837
"In 2020, Boston was estimated to have 691,531 residents living in 266,724 households [ 4 ] —a 12% population increase over 2010. The city is the third-most densely populated large U.S. city of over half a million residents, and the most densely populated state capital. Some 1.2 million persons may be within Boston's boundaries during work hours, and as many as 2 million during special events. This fluctuation of people is caused by hundreds of thousands of suburban residents who travel to the city for work, education, health care, and special events. [ 164 ]",0.834


In [22]:
gist_sims = top_k("Population of Boston", idx_name="gist")
gist_sims

text,sim
"In 2020, Boston was estimated to have 691,531 residents living in 266,724 households [ 4 ] —a 12% population increase over 2010. The city is the third-most densely populated large U.S. city of over half a million residents, and the most densely populated state capital. Some 1.2 million persons may be within Boston's boundaries during work hours, and as many as 2 million during special events. This fluctuation of people is caused by hundreds of thousands of suburban residents who travel to the city for work, education, health care, and special events. [ 164 ]",0.879
"Boston [ a ] is the capital and most populous city in the Commonwealth of Massachusetts in the United States . The city serves as the cultural and financial center of the New England region of the Northeastern United States . It has an area of 48.4 sq mi (125 km 2 ) [ 9 ] and a population of 675,647 as of the 2020 census , making it the third-largest city in the Northeast after New York City and Philadelphia . [ 4 ] The larger Greater Boston metropolitan statistical area , which includes and surrounds the city, has a population of 4,919,179 as of 2023, making it the largest in New England and eleventh-largest in the country. [ 10 ] [ 11 ] [ 12 ]",0.859
"Demographics [ edit ] See also: History of the Irish in Boston , History of Italian Americans in Boston , History of African Americans in Boston , Chinese Americans in Boston , Dominican-Americans in Boston , Vietnamese in Boston , and LGBT culture in Boston Historical population Year Pop. ±% 1680 4,500 — 1690 7,000 +55.6% 1700 6,700 −4.3% 1710 9,000 +34.3% 1722 10,567 +17.4% 1742 16,382 +55.0% 1765 15,520 −5.3% 1790 18,320 +18.0% 1800 24,937 +36.1% 1810 33,787 +35.5% 1820 43,298 +28.1% 1830 ...... ] [ 152 ] [ 153 ] [ 154 ] [ 155 ] [ 156 ] [ 157 ] [ 158 ] [ 159 ] 2010–2020 [ 4 ] Source: U.S. Decennial Census [ 160 ] Historical racial/ethnic composition Race/ethnicity 2020 [ 161 ] 2010 [ 162 ] 1990 [ 163 ] 1970 [ 163 ] 1940 [ 163 ] Non-Hispanic White 44.7% 47.0% 59.0% 79.5% [ f ] 96.6% Black 22.0% 24.4% 23.8% 16.3% 3.1% Hispanic or Latino (of any race) 19.5% 17.5% 10.8% 2.8% [ f ] 0.1% Asian 9.7% 8.9% 5.3% 1.3% 0.2% Two or more races 3.2% 3.9% – – – Native American 0.2% 0.4% 0.3% 0.2% –",0.84
"In 1822, [ 15 ] the citizens of Boston voted to change the official name from the ""Town of Boston"" to the ""City of Boston"", and on March 19, 1822, the people of Boston accepted the charter incorporating the city. [ 68 ] At the time Boston was chartered as a city, the population was about 46,226, while the area of the city was only 4.8 sq mi (12 km 2 ). [ 68 ]",0.816
Environment [ edit ] Population density and elevation above sea level in Greater Boston as of 2010,0.812


In [23]:
bge_sims = top_k("Population of Boston", idx_name="bge")
bge_sims

text,sim
"In 2020, Boston was estimated to have 691,531 residents living in 266,724 households [ 4 ] —a 12% population increase over 2010. The city is the third-most densely populated large U.S. city of over half a million residents, and the most densely populated state capital. Some 1.2 million persons may be within Boston's boundaries during work hours, and as many as 2 million during special events. This fluctuation of people is caused by hundreds of thousands of suburban residents who travel to the city for work, education, health care, and special events. [ 164 ]",0.79
"Boston [ a ] is the capital and most populous city in the Commonwealth of Massachusetts in the United States . The city serves as the cultural and financial center of the New England region of the Northeastern United States . It has an area of 48.4 sq mi (125 km 2 ) [ 9 ] and a population of 675,647 as of the 2020 census , making it the third-largest city in the Northeast after New York City and Philadelphia . [ 4 ] The larger Greater Boston metropolitan statistical area , which includes and surrounds the city, has a population of 4,919,179 as of 2023, making it the largest in New England and eleventh-largest in the country. [ 10 ] [ 11 ] [ 12 ]",0.751
"Demographics [ edit ] See also: History of the Irish in Boston , History of Italian Americans in Boston , History of African Americans in Boston , Chinese Americans in Boston , Dominican-Americans in Boston , Vietnamese in Boston , and LGBT culture in Boston Historical population Year Pop. ±% 1680 4,500 — 1690 7,000 +55.6% 1700 6,700 −4.3% 1710 9,000 +34.3% 1722 10,567 +17.4% 1742 16,382 +55.0% 1765 15,520 −5.3% 1790 18,320 +18.0% 1800 24,937 +36.1% 1810 33,787 +35.5% 1820 43,298 +28.1% 1830 ...... ] [ 152 ] [ 153 ] [ 154 ] [ 155 ] [ 156 ] [ 157 ] [ 158 ] [ 159 ] 2010–2020 [ 4 ] Source: U.S. Decennial Census [ 160 ] Historical racial/ethnic composition Race/ethnicity 2020 [ 161 ] 2010 [ 162 ] 1990 [ 163 ] 1970 [ 163 ] 1940 [ 163 ] Non-Hispanic White 44.7% 47.0% 59.0% 79.5% [ f ] 96.6% Black 22.0% 24.4% 23.8% 16.3% 3.1% Hispanic or Latino (of any race) 19.5% 17.5% 10.8% 2.8% [ f ] 0.1% Asian 9.7% 8.9% 5.3% 1.3% 0.2% Two or more races 3.2% 3.9% – – – Native American 0.2% 0.4% 0.3% 0.2% –",0.733
"The Boston metro area contained a Jewish population of approximately 248,000 as of 2015. [ 185 ] More than half the Jewish households in the Greater Boston area reside in the city itself, Brookline , Newton , Cambridge , Somerville , or adjacent towns. [ 185 ] A small minority practices Confucianism , and some practice Boston Confucianism , an American evolution of Confucianism adapted for Boston intellectuals. [ 186 ]",0.731
"In 1822, [ 15 ] the citizens of Boston voted to change the official name from the ""Town of Boston"" to the ""City of Boston"", and on March 19, 1822, the people of Boston accepted the charter incorporating the city. [ 68 ] At the time Boston was chartered as a city, the population was about 46,226, while the area of the city was only 4.8 sq mi (12 km 2 ). [ 68 ]",0.727


Looking at the results we got for each similarity index, we see that while there is some overlap, they are all different.

# Step 8: Rerank

Now we can rerank our results. There are many ways to perform reranking, for this example, we are going to do reranking based on a total score from each of the results. The underlying assumption is that each embedding encodes some sort of similarity and that a sum of the similarity scores across multiple embeddings gives a better idea of what is actually most similar.

The steps we take here are:
1. Create an empty dictionary
2. Search through the texts of each of the returned similarity tables
3. If a text appears in the dictionary, add to the similarity score to its current similarity score, if not, add it to the dictionary as a key with the matching similarity score as the value
4. Sort the resulting dictionary by similarity score

In [24]:
reranking_dict = {}

In [25]:
for index, text in enumerate(bge_sims["text"]):
    if text not in reranking_dict:
        reranking_dict[text] = bge_sims["sim"][index]
    elif text in reranking_dict:
        reranking_dict[text] += bge_sims["sim"][index]

In [26]:
for index, text in enumerate(e5_sims["text"]):
    if text not in reranking_dict:
        reranking_dict[text] = bge_sims["sim"][index]
    elif text in reranking_dict:
        reranking_dict[text] += bge_sims["sim"][index]

In [27]:
for index, text in enumerate(gist_sims["text"]):
    if text not in reranking_dict:
        reranking_dict[text] = bge_sims["sim"][index]
    elif text in reranking_dict:
        reranking_dict[text] += bge_sims["sim"][index]

In [28]:
reranking_dict

{"In 2020, Boston was estimated to have 691,531 residents living in 266,724 households [ 4 ] —a 12% population increase over 2010. The city is the third-most densely populated large U.S. city of over half a million residents, and the most densely populated state capital. Some 1.2\xa0million persons may be within Boston's boundaries during work hours, and as many as 2\xa0million during special events. This fluctuation of people is caused by hundreds of thousands of suburban residents who travel to the city for work, education, health care, and special events. [ 164 ]": 2.307768256513198,
 'Boston [ a ] is the capital and most populous city in the Commonwealth of Massachusetts in the United States . The city serves as the cultural and financial center of the New England region of the Northeastern United States . It has an area of 48.4\xa0sq\xa0mi (125\xa0km 2 ) [ 9 ] and a population of 675,647 as of the 2020 census , making it the third-largest city in the Northeast after New York City 

In [35]:
max(reranking_dict, key=reranking_dict.get)

"In 2020, Boston was estimated to have 691,531 residents living in 266,724 households [ 4 ] —a 12% population increase over 2010. The city is the third-most densely populated large U.S. city of over half a million residents, and the most densely populated state capital. Some 1.2\xa0million persons may be within Boston's boundaries during work hours, and as many as 2\xa0million during special events. This fluctuation of people is caused by hundreds of thousands of suburban residents who travel to the city for work, education, health care, and special events. [ 164 ]"