# Example - Airbnb financial data search

<a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/hybrid_search.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>

The code below is an example of hybrid search, a search technique that combines full-text search (FTS) and vector search in LanceDB.

Let's get started with an example. In this notebook we'll use Airbnb financial data documents to search for "the specific reasons for higher operating costs" in a particular year.

In [None]:
# Setup
!pip install lancedb pandas langchain langchain-community pypdf openai cohere tiktoken sentence_transformers tantivy==0.20.1

In [None]:
import os
import getpass

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = getpass.getpass()




In [None]:
def pretty_print(docs):
    for doc in docs:
        print(doc + "\n\n")

In [None]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load $ABNB's financial report. This may take 1-2 minutes since the PDF is large
sec_filing_pdf = "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001559720/8a9ebed0-815a-469a-87eb-1767d21d8cec.pdf"

# Create your PDF loader
loader = PyPDFLoader(sec_filing_pdf)

# Load the PDF document
documents = loader.load()

# Chunk the financial report
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

In [None]:
from langchain_community.vectorstores import LanceDB
from langchain.embeddings.openai import OpenAIEmbeddings
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import Vector, LanceModel

openai = get_registry().get("openai").create()

class Schema(LanceModel):
    text: str = openai.SourceField()
    vector: Vector(openai.ndims()) = openai.VectorField()

embedding_function = OpenAIEmbeddings()

db = lancedb.connect("~/langchain")
table = db.create_table(
    "airbnb",
    schema=Schema,
    mode="overwrite",
)

# Load the document into LanceDB
# Use a separate name so the lancedb connection in `db` is not shadowed
vector_store = LanceDB.from_documents(docs, embedding_function, connection=table)

[2024-02-12T20:00:04Z WARN  lance::dataset] No existing dataset at /Users/ayushchaurasia/langchain/airbnb.lance, it will be created


In [None]:
table.create_fts_index("text")

In [None]:
table.to_pandas().head()

|   | text | vector |
|---|------|--------|
| 0 | Table of Contents\nUNITED STATES\nSECURITIES A... | [-0.003405824, -0.03212391, 0.012812538, -0.02... |
| 1 | Class A common stock, par value $0.0001 per sh... | [-0.019193485, -0.02273649, 0.009623382, -0.02... |
| 2 | this chapter) during the preceding 12 months (... | [-0.020692078, -0.016187502, -0.008877442, -0.... |
| 3 | Indicate by check mark whether the registrant ... | [-0.019304628, -0.0034501317, -0.011525051, -0... |
| 4 | As of June 30, 2022, the aggregate market valu... | [-0.014594535, -0.011274607, -0.007967828, -0.... |


## Vector Search

Average latency - `3.48 ms ± 71.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)`
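The latency figure above has the shape of IPython's `%timeit` output. Outside a notebook, a similar measurement could be sketched with the standard-library `timeit` module. `run_query` and `measure_latency` are hypothetical names introduced for illustration; in this notebook `run_query` would wrap the `table.search(...)` call.

```python
import statistics
import timeit


def run_query():
    # Hypothetical stand-in for the real call, e.g.
    # table.search(query).limit(5).to_pandas()
    return sum(range(1_000))


def measure_latency(fn, runs=7, loops=100):
    """Return (mean, std dev) of the per-call latency in seconds."""
    # timeit.repeat returns one total per run, each covering `loops` calls.
    totals = timeit.repeat(fn, repeat=runs, number=loops)
    per_call = [total / loops for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


mean_s, std_s = measure_latency(run_query)
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e6:.1f} µs per loop")
```

The defaults of 7 runs and 100 loops mirror the numbers quoted in the latency line; slower rerankers further down were measured with 1 loop per run.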

In [None]:
query = "What are the specific factors contributing to Airbnb's increased operational expenses in the last fiscal year?"
docs = table.search(query).limit(5).to_pandas()["text"].to_list()

In [None]:
pretty_print(docs)

In addition, the number of listings on Airbnb may decline as a result of a number of other factors affecting Hosts, including: the COVID-19 pandemic; enforcement or threatened
enforcement of laws and regulations, including short-term occupancy and tax laws; private groups, such as homeowners, landlords, and condominium and neighborhood
associations, adopting and enforcing contracts that prohibit or restrict home sharing; leases, mortgages, and other agreements, or regulations that purport to ban or otherwise restrict
home sharing; Hosts opting for long-term rentals on other third-party platforms as an alternative to listing on our platform; economic, social, and political factors; perceptions of trust
and safety on and off our platform; negative experiences with guests, including guests who damage Host property, throw unauthorized parties, or engage in violent and unlawful


Made Possible by Hosts, Strangers, AirCover, Categories, and OMG marketing campaigns and launches, a $67.9 milli

## Hybrid Search
LanceDB supports hybrid search with custom Rerankers. Here's a summary of latency numbers for some of the available reranking methods:
![1_yWDh0Klw8Upsw1V54kkkdQ](https://github.com/AyushExel/assets/assets/15766192/a515fbf7-0553-437e-899e-67691eae3fef)

Let us now perform hybrid search by combining vector and FTS search results. First, we'll cover the default Reranker.

### Linear Combination Reranker
`LinearCombinationReranker(weight=0.7)` is the default reranker for hybrid search results when no reranker is specified explicitly.
The `weight` parameter controls the weight given to the vector search score; a weight of `1 - weight` is applied to the FTS score when reranking.

Latency - `71 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)`
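The weighting described above can be illustrated in plain Python. This is a minimal sketch of the idea, not LanceDB's implementation: it assumes both score maps are already normalized to the same range with higher meaning more relevant, and that a document missing from one result set scores `fill` there. The function name and the `fill` parameter are hypothetical.

```python
def linear_combination(vector_scores, fts_scores, weight=0.7, fill=0.0):
    """Blend two {doc_id: score} maps and return (doc_id, score) pairs, best first."""
    doc_ids = set(vector_scores) | set(fts_scores)
    combined = {
        doc_id: weight * vector_scores.get(doc_id, fill)
        + (1 - weight) * fts_scores.get(doc_id, fill)
        for doc_id in doc_ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Toy scores: "b" appears in both result sets, "a" and "c" in only one.
vector_scores = {"a": 1.0, "b": 0.2}
fts_scores = {"b": 1.0, "c": 0.5}
ranked = linear_combination(vector_scores, fts_scores, weight=0.7)
print([doc_id for doc_id, _ in ranked])
```

With `weight=0.7` the strong vector match "a" outranks "b"; lowering `weight` toward 0 shifts the ranking toward the FTS scores.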

In [None]:
docs = table.search(query, query_type="hybrid").limit(5).to_pandas()["text"].to_list()

In [None]:
pretty_print(docs)

In addition, the number of listings on Airbnb may decline as a result of a number of other factors affecting Hosts, including: the COVID-19 pandemic; enforcement or threatened
enforcement of laws and regulations, including short-term occupancy and tax laws; private groups, such as homeowners, landlords, and condominium and neighborhood
associations, adopting and enforcing contracts that prohibit or restrict home sharing; leases, mortgages, and other agreements, or regulations that purport to ban or otherwise restrict
home sharing; Hosts opting for long-term rentals on other third-party platforms as an alternative to listing on our platform; economic, social, and political factors; perceptions of trust
and safety on and off our platform; negative experiences with guests, including guests who damage Host property, throw unauthorized parties, or engage in violent and unlawful


(a) The Borrower may, at its election, deliver a Pricing Certificate to the Administrative Agent in respect of t

### Cohere Reranker
This uses Cohere's Reranking API to rerank the results. It accepts the reranking model name as a parameter. By default it uses the english-v3 model, but you can easily switch to a multilingual model.

Latency - `605 ms ± 78.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`

In [None]:
# Free API key
os.environ["COHERE_API_KEY"] = getpass.getpass()



In [None]:
from lancedb.rerankers import CohereReranker

reranker = CohereReranker()
docs = table.search(query, query_type="hybrid").limit(5).rerank(reranker=reranker).to_pandas()["text"].to_list()

In [None]:
pretty_print(docs)

Increased operating expenses, decreased revenue, negative publicity, negative reaction from our Hosts and guests and other stakeholders, or other adverse impacts from any of the
above factors or other risks related to our international operations could materially adversely affect our brand, reputation, business, results of operations, and financial condition.
In addition, we will continue to incur significant expenses to operate our outbound business in China, and we may never achieve profitability in that market. These factors, combined
with sentiment of the workforce in China, and China’s policy towards foreign direct investment may particularly impact our operations in China. In addition, we need to ensure that
our business practices in China are compliant with local laws and regulations, which may be interpreted and enforced in ways that are different from our interpretation, and/or create


Made Possible by Hosts, Strangers, AirCover, Categories, and OMG marketing campaigns and la

The relevance score is returned by the Cohere API and is independent of the individual FTS and vector search scores.

In [None]:
table.search(query, query_type="hybrid").limit(5).rerank(reranker=reranker).to_pandas()

|   | text | vector | _relevance_score |
|---|------|--------|------------------|
| 0 | Increased operating expenses, decreased revenu... | [0.0034929817, -0.024774546, 0.012623285, -0.0... | 0.985328 |
| 1 | Made Possible by Hosts, Strangers, AirCover, C... | [-0.0042489874, -0.005382498, 0.007190078, -0.... | 0.979036 |
| 2 | Table of Contents\nAirbnb, Inc.\nConsolidated ... | [-0.008569201, -0.019810658, 0.014144964, -0.0... | 0.696578 |
| 3 | Our success depends significantly on existing ... | [0.0027109187, -0.028220002, 0.022864284, -0.0... | 0.539923 |
| 4 | In addition, the number of listings on Airbnb ... | [0.0068983347, -0.0147690065, 0.042441186, -0.... | 0.460713 |


### ColBERT Reranker
The ColBERT Reranker is powered by the ColBERT model. It runs locally using the Hugging Face implementation.

Latency - `950 ms ± 5.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`

Note: The first query might be slow. It is recommended to reuse `Reranker` objects, as the models are cached; subsequent runs that reuse the same reranker object will be faster.
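One way to follow the reuse advice in application code is to build the reranker once and hand back the same object on every later call. The sketch below uses a hypothetical `SlowToLoadReranker` stand-in (not a LanceDB class) so that it runs without downloading a model; `get_reranker` and the `"colbert"` label are likewise illustrative names.

```python
from functools import lru_cache


class SlowToLoadReranker:
    """Hypothetical stand-in for a reranker that loads model weights."""

    load_count = 0

    def __init__(self, model_name):
        # A real reranker would load its model around here.
        SlowToLoadReranker.load_count += 1
        self.model_name = model_name


@lru_cache(maxsize=None)
def get_reranker(model_name="colbert"):
    """Build each reranker once and return the cached object afterwards."""
    return SlowToLoadReranker(model_name)


first = get_reranker()
second = get_reranker()
print(first is second, SlowToLoadReranker.load_count)  # True 1
```

In the notebook itself the same effect comes from creating `reranker` once in a cell and passing that object to every `.rerank(reranker=reranker)` call.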

In [None]:
from lancedb.rerankers import ColbertReranker

reranker = ColbertReranker()
docs = table.search(query, query_type="hybrid").limit(5).rerank(reranker=reranker).to_pandas()["text"].to_list()

In [None]:
pretty_print(docs)

Made Possible by Hosts, Strangers, AirCover, Categories, and OMG marketing campaigns and launches, a $67.9 million increase in our search engine marketing and advertising
spend, a $25.1 million increase in payroll-related expenses due to growth in headcount and increase in compensation costs, a $22.0 million increase in third-party service provider
expenses, and a $11.1 million increase in coupon expense in line with increase in revenue and launch of AirCover for guests, partially offset by a decrease of $22.9 million related to
the changes in the fair value of contingent consideration related to a 2019 acquisition.
General and Administrative
2021 2022 % Change
(in millions, except percentages)
General and administrative $ 836 $ 950 14 %
Percentage of revenue 14 % 11 %
General and administrative expense increased $114.0 million, or 14%, in 2022 compared to 2021, primarily due to an increase in other business and operational taxes of $41.3


Our future revenue growth depends on the grow

### Cross Encoder Reranker
Uses cross-encoder models as rerankers. It runs locally using the sentence-transformers implementation.

Latency - `1.38 s ± 64.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`

In [None]:
from lancedb.rerankers import CrossEncoderReranker

reranker = CrossEncoderReranker()
docs = table.search(query, query_type="hybrid").limit(5).rerank(reranker=reranker).to_pandas()["text"].to_list()

In [None]:
pretty_print(docs)

Table of Contents
Airbnb, Inc.
Consolidated Statements of Operations
(in millions, except per share amounts)
Year Ended December 31,
2020 2021 2022
Revenue $ 3,378 $ 5,992 $ 8,399 
Costs and expenses:
Cost of revenue 876 1,156 1,499 
Operations and support 878 847 1,041 
Product development 2,753 1,425 1,502 
Sales and marketing 1,175 1,186 1,516 
General and administrative 1,135 836 950 
Restructuring charges 151 113 89 
Total costs and expenses 6,968 5,563 6,597 
Income (loss) from operations (3,590) 429 1,802 
Interest income 27 13 186 
Interest expense (172) (438) (24)
Other income (expense), net (947) (304) 25 
Income (loss) before income taxes (4,682) (300) 1,989 
Provision for (benefit from) income taxes (97) 52 96 
Net income (loss) $ (4,585)$ (352)$ 1,893 
Net income (loss) per share attributable to Class A and Class B common stockholders:
Basic $ (16.12)$ (0.57)$ 2.97 
Diluted $ (16.12)$ (0.57)$ 2.79


Made Possible by Hosts, Strangers, AirCover, Categories, and OMG marketing

### (Experimental) OpenAI Reranker

This prompts a chat model to rerank the results; it is not a dedicated reranker model, so treat it as experimental. You might exceed the model's token limit, so set the search limit based on your token limit.
NOTE: It is recommended to use `gpt-4-turbo-preview`; older models might lead to bad behaviour.

Latency - can take tens of seconds when using a GPT-4 model
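To pick a search limit that respects a token budget, a rough estimate may be enough. The sketch below assumes about 4 characters per token, which is only a rule of thumb for English text, and ignores the tokens used by the reranking prompt itself. `limit_for_token_budget` is a hypothetical helper, not part of LanceDB; a tokenizer such as `tiktoken` would give exact counts.

```python
def limit_for_token_budget(docs, token_budget, chars_per_token=4):
    """Count how many leading docs fit within token_budget (rough estimate)."""
    used = 0
    for count, doc in enumerate(docs):
        # Ceiling division: a partial token still costs a whole token.
        estimated = -(-len(doc) // chars_per_token)
        if used + estimated > token_budget:
            return count
        used += estimated
    return len(docs)


# Three chunks of 400 characters are estimated at 100 tokens each.
chunks = ["a" * 400, "b" * 400, "c" * 400]
print(limit_for_token_budget(chunks, token_budget=250))  # 2
```

The returned count could then be passed to `.limit(...)` before `.rerank(...)`, leaving some headroom for the prompt and the model's response.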

In [None]:
from lancedb.rerankers import OpenaiReranker

reranker = OpenaiReranker(model_name="gpt-4-turbo-preview")
docs = table.search(query, query_type="hybrid").limit(5).rerank(reranker=reranker).to_pandas()["text"].to_list()

In [None]:
pretty_print(docs)

Made Possible by Hosts, Strangers, AirCover, Categories, and OMG marketing campaigns and launches, a $67.9 million increase in our search engine marketing and advertising
spend, a $25.1 million increase in payroll-related expenses due to growth in headcount and increase in compensation costs, a $22.0 million increase in third-party service provider
expenses, and a $11.1 million increase in coupon expense in line with increase in revenue and launch of AirCover for guests, partially offset by a decrease of $22.9 million related to
the changes in the fair value of contingent consideration related to a 2019 acquisition.
General and Administrative
2021 2022 % Change
(in millions, except percentages)
General and administrative $ 836 $ 950 14 %
Percentage of revenue 14 % 11 %
General and administrative expense increased $114.0 million, or 14%, in 2022 compared to 2021, primarily due to an increase in other business and operational taxes of $41.3


Table of Contents
Airbnb, Inc.
Consolidated S

## Use your custom Reranker
Hybrid search in LanceDB is designed to be very flexible. You can easily plug in your own reranking logic. To do so, you simply need to implement the base `Reranker` class.

In [None]:
from lancedb.rerankers import Reranker
import pyarrow as pa

class MyCustomReranker(Reranker):
    def rerank_hybrid(self, query: str, vector_results: pa.Table, fts_results: pa.Table) -> pa.Table:
        combined_results = self.merge(vector_results, fts_results) # Or custom merge algo
        # Custom Reranking logic here

        return combined_results
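As one illustration of what the custom reranking logic could be, reciprocal rank fusion (RRF) scores each document by summing `1 / (k + rank)` over the result lists it appears in. This sketch works on plain lists of document IDs rather than `pa.Table` objects, so it shows only the scoring step; the function name is hypothetical and `k=60` is a conventional default, not a value taken from this notebook.

```python
from collections import defaultdict


def reciprocal_rank_fusion(vector_ids, fts_ids, k=60):
    """Merge two ranked ID lists; IDs ranked high in both lists come first."""
    scores = defaultdict(float)
    for ranked_ids in (vector_ids, fts_ids):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# "b" is the only ID present in both lists, so it rises to the top.
merged = reciprocal_rank_fusion(["a", "b", "c"], ["b", "d"])
print(merged)  # ['b', 'a', 'd', 'c']
```

Inside a `rerank_hybrid` implementation, the same scores would be attached to the merged table as a relevance column before sorting.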

### Custom Reranker based on CohereReranker

For the sake of simplicity, let's build a custom reranker that enhances the Cohere Reranker by accepting a filter query and passing any other CohereReranker parameters through as kwargs.

For this toy example, let's say we want to get rid of docs that represent a table of contents, an appendix, etc. These chunks are semantically close to the query because they list costs, but they aren't what we're interested in: they state the costs without giving the specific reasons why operating costs were high.

In [None]:
from typing import List, Union
import pyarrow as pa
from lancedb.rerankers import CohereReranker

class ModifiedCohereReranker(CohereReranker):
    def __init__(self, filters: Union[str, List[str]], **kwargs):
        # Remaining kwargs are passed straight through to CohereReranker
        super().__init__(**kwargs)
        filters = filters if isinstance(filters, list) else [filters]
        self.filters = filters

    def rerank_hybrid(self, query: str, vector_results: pa.Table, fts_results: pa.Table) -> pa.Table:
        combined_result = super().rerank_hybrid(query, vector_results, fts_results)
        df = combined_result.to_pandas()
        # Drop every row whose text matches one of the filter strings
        for pattern in self.filters:
            df = df[~df["text"].str.contains(pattern)]

        # preserve_index=False keeps the filtered pandas index out of the result
        return pa.Table.from_pandas(df, preserve_index=False)

reranker = ModifiedCohereReranker(filters="Table of Contents")

In [None]:
docs = table.search(query, query_type="hybrid").limit(5).rerank(reranker=reranker).to_pandas()["text"].to_list()

In [None]:
pretty_print(docs)

Increased operating expenses, decreased revenue, negative publicity, negative reaction from our Hosts and guests and other stakeholders, or other adverse impacts from any of the
above factors or other risks related to our international operations could materially adversely affect our brand, reputation, business, results of operations, and financial condition.
In addition, we will continue to incur significant expenses to operate our outbound business in China, and we may never achieve profitability in that market. These factors, combined
with sentiment of the workforce in China, and China’s policy towards foreign direct investment may particularly impact our operations in China. In addition, we need to ensure that
our business practices in China are compliant with local laws and regulations, which may be interpreted and enforced in ways that are different from our interpretation, and/or create


Made Possible by Hosts, Strangers, AirCover, Categories, and OMG marketing campaigns and la

As you can see, the document containing the table of contents with the spending figures no longer shows up.