# Install dependencies

Install `txtai` and all dependencies.

In [79]:
%%capture
!pip install git+https://github.com/neuml/txtai

# Create an Embeddings instance

The Embeddings instance is the main entry point for txtai. It defines the method used to tokenize a text section and convert it into an embeddings vector.

In [80]:
%%capture

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/all-mpnet-base-v2"})

# Semantic search

Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms.
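
To make the limitation of lexical matching concrete, here is a minimal sketch of a keyword matcher in plain Python. The `tokenize` and `lexical_match` helpers and the sample sentences are hypothetical and not part of txtai; they only illustrate how exact-token matching misses a document that uses a related word.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_match(query, documents):
    """Return the documents that share at least one token with the query."""
    terms = tokenize(query)
    return [doc for doc in documents if terms & tokenize(doc)]

# Hypothetical sample documents, for illustration only
documents = [
    "The railway was completed in 1885",
    "Trains made settlement across the territory possible",
]

# Only documents containing the literal token "railway" are returned
print(lexical_match("railway", documents))
```

The second sentence is about the same topic but shares no token with the query, so a purely lexical matcher skips it. An embeddings model compares meaning rather than exact tokens, so it would be expected to score that sentence as related.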

# Running similarity queries

This section runs similarity searches for a list of different concepts.

In [81]:
from google.colab import files
uploaded = files.upload()

Saving canada.txt to canada (2).txt


In [82]:
listed = []
with open("/content/canada.txt", "r") as text:
    # Read the file line by line and split each line into sentences on periods
    for line in text:
        listed.extend(line.split("."))
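
The loop above keeps the leading spaces and empty strings produced by splitting on periods, which is why some matches later in this notebook start with a space. A possible variant, assuming cleaner input is preferred, strips whitespace and drops empty fragments. The `split_sentences` helper is hypothetical, and using it would change the indexed text slightly, so the lengths and scores shown in this notebook would differ.

```python
def split_sentences(text):
    """Split text on periods, strip whitespace and drop empty fragments."""
    return [part.strip() for part in text.split(".") if part.strip()]

# Sample text assembled from sentences that appear in this notebook's output
sample = (
    "During the 19th century, colonial dependence gave way to increasing "
    "autonomy for a growing Canada. July 1 will later become known as Canada Day.\n"
)

for sentence in split_sentences(sample):
    print(repr(sentence))
```

Note that splitting on periods is only a rough heuristic: it also breaks on abbreviations and decimal numbers.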

In [83]:
data = list(listed)

In [84]:
print("%-40s %s" % ("Query", "Best Match"))
print("-" * 90)

for query in ("Dominion","Hudson’s Bay Company","July 1", "British North America Act", "Railway", "1885","government"):
    # Get index of the section that best matches the query
    uid = embeddings.similarity(query, data)[0][0]

    print("%-40s %s" % (query, data[uid]))

Query                                    Best Match
------------------------------------------------------------------------------------------
Dominion                                 The autonomous Dominion of Canada, a confederation of Nova Scotia, New Brunswick, and the future provinces of Ontario and Quebec, is officially recognized by Great Britain with the passage of the British North America Act
Hudson’s Bay Company                      Two years later, Canada acquired the vast possessions of the Hudson’s Bay Company, and within a decade the provinces of Manitoba and Prince Edward Island had joined the Canadian federation
July 1                                    July 1 will later become known as Canada Day
British North America Act                The autonomous Dominion of Canada, a confederation of Nova Scotia, New Brunswick, and the future provinces of Ontario and Quebec, is officially recognized by Great Britain with the passage of the British North America Act
Railway      

# Building an Embeddings index

The index method builds and stores the text embeddings. Once the index is built, only the query is converted to an embeddings vector for each search.
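
The idea of indexing once and searching many times can be sketched without txtai. The toy below uses bag-of-words counts in place of a real model. `vectorize`, `cosine`, `ToyIndex` and the sample texts are all hypothetical and do not reflect txtai internals; they only show where the vectorization work happens.

```python
import math
from collections import Counter

def vectorize(text):
    """Toy stand-in for a model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyIndex:
    """Vectorizes documents once, then only vectorizes each query."""

    def __init__(self, texts):
        # Document vectors are computed a single time, when the index is built
        self.vectors = [vectorize(text) for text in texts]

    def search(self, query, limit=1):
        # Only the query is vectorized at search time
        q = vectorize(query)
        scores = [(uid, cosine(q, vector)) for uid, vector in enumerate(self.vectors)]
        return sorted(scores, key=lambda item: item[1], reverse=True)[:limit]

# Hypothetical sample texts
texts = ["the railway was completed", "canada day is july 1"]
index = ToyIndex(texts)

# Results use the same (uid, score) shape as the searches in this notebook
uid = index.search("railway", 1)[0][0]
print(texts[uid])
```

A real embeddings index stores dense model vectors and typically uses an approximate nearest neighbor structure instead of scoring every document, but the division of work is the same.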

In [85]:
# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-30s %s" % ("Query", "Best Match"))
print("-" * 10)

# Run an embeddings search for each query
for query in ("Hudson’s Bay Company","July 1", "British North America Act", "Railway", "1885","government"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-30s %s" % (query, data[uid]))


Query                          Best Match
----------
Hudson’s Bay Company            Two years later, Canada acquired the vast possessions of the Hudson’s Bay Company, and within a decade the provinces of Manitoba and Prince Edward Island had joined the Canadian federation
July 1                          July 1 will later become known as Canada Day
British North America Act      The autonomous Dominion of Canada, a confederation of Nova Scotia, New Brunswick, and the future provinces of Ontario and Quebec, is officially recognized by Great Britain with the passage of the British North America Act
Railway                         In 1885, the Canadian Pacific Railway was completed, making mass settlement across the vast territory of Canada possible
1885                            Later in the year, another conference was held in Quebec, and in 1866 Canadian representatives traveled to London to meet with the British government
government                      In the 1860s, a movement for 

# Embeddings load/save

Embeddings indexes can be saved to disk and reloaded.

In [86]:
embeddings.save("index")

embeddings = Embeddings()
embeddings.load("index")

uid = embeddings.search("British North America Act", 1)[0][0]
print(data[uid])

The autonomous Dominion of Canada, a confederation of Nova Scotia, New Brunswick, and the future provinces of Ontario and Quebec, is officially recognized by Great Britain with the passage of the British North America Act


# Embeddings index with content

In [87]:
# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/all-mpnet-base-v2", "content": True, "objects": True})

# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print(embeddings.search("British", 1)[0]["text"])

The autonomous Dominion of Canada, a confederation of Nova Scotia, New Brunswick, and the future provinces of Ontario and Quebec, is officially recognized by Great Britain with the passage of the British North America Act


## Query with SQL

When content is enabled, the entire dictionary will be stored and can be queried. In addition to similarity queries, txtai accepts SQL queries. This enables combined queries using both a similarity index and content stored in a database backend.
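
txtai parses these queries itself, and `similar()` is a txtai construct rather than standard SQL. As a rough illustration of the metadata filter and aggregate parts only, the sketch below runs comparable statements with Python's built-in `sqlite3` module. The `docs` table, its columns and the sample rows are hypothetical and are not txtai's actual storage schema.

```python
import sqlite3

# Hypothetical sample rows standing in for indexed sections
rows = ["July 1 will later become known as Canada Day", "Railway"]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, length INTEGER)")
connection.executemany(
    "INSERT INTO docs (id, text, length) VALUES (?, ?, ?)",
    [(uid, text, len(text)) for uid, text in enumerate(rows)],
)

# Filter by the metadata field 'length'
filtered = connection.execute(
    "SELECT text, length FROM docs WHERE length >= 40"
).fetchall()
print(filtered)

# Run an aggregate query over all rows
stats = connection.execute(
    "SELECT count(*), min(length), max(length), sum(length) FROM docs"
).fetchone()
print(stats)
```

In txtai, the same kind of filter is combined with a `similar()` clause, so results are restricted by both similarity score and stored fields.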

In [88]:
# Re-create the index, storing each text with a 'length' metadata field
embeddings.index([(uid, {"text": text, "length": len(text)}, None) for uid, text in enumerate(data)])

# Filter by score
print(embeddings.search("select text, score from txtai where similar('mass settlement') and score >= 0.15"))

# Filter by metadata field 'length'
print(embeddings.search("select text, length, score from txtai where similar('July 1') and score >= 0.05 and length >= 40"))

# Run aggregate queries
print(embeddings.search("select count(*), min(length), max(length), sum(length) from txtai"))

[{'text': ' In 1885, the Canadian Pacific Railway was completed, making mass settlement across the vast territory of Canada possible', 'score': 0.36653953790664673}, {'text': ' In the 1860s, a movement for a greater Canadian federation grew out of the need for a common defense, the desire for a national railroad system, and the necessity of finding a solution to the problem of French and British conflict', 'score': 0.2537963390350342}, {'text': 'During the 19th century, colonial dependence gave way to increasing autonomy for a growing Canada', 'score': 0.22447863221168518}]
[{'text': ' July 1 will later become known as Canada Day', 'length': 45, 'score': 0.4806599020957947}, {'text': 'On July 1, 1867, with passage of the British North America Act, the Dominion of Canada was officially established as a self-governing entity within the British Empire', 'length': 166, 'score': 0.23678340017795563}, {'text': ' Later in the year, another conference was held in Quebec, and in 1866 Canadian r