Run the [setup notebook](./00-setup.ipynb) first.

# Simple text retrieval with Haystack

In [1]:
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from datasets import imdb

## Connect to a document store

We use an in-memory document store here, but you can choose from various document store options as documented at: https://docs.haystack.deepset.ai/docs/document-store


In [2]:
# --- Create an in-memory document store ---
document_store = InMemoryDocumentStore()

## Load the IMDB data set

In [None]:
# --- Load the IMDB data set (1000 movies) ---
collection = imdb.load()


def trim(s: str, n: int) -> str:
    """Cut s to n characters and append an ellipsis if it was longer."""
    return s[:n] + "\u2026" if len(s) > n else s


def doc_format_imdb(doc: dict) -> str:
    """Format one movie record as a single fixed-width line."""
    title_ex = '{title_short} ({year}, {runtime}m, {rating})'.format(title_short=trim(doc['title'], 30), **doc)
    return '{title_ex:<50} {genre_short:<20} {summary} [{actors}]'.format(title_ex=title_ex, genre_short=trim(doc['genre'], 18), **doc)

for item in collection[:5]:
    print(doc_format_imdb(item))

collection[0]

The Shawshank Redemption (1994, 142m, 9.3)         Drama                Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency. [Tim Robbins Morgan Freeman Bob Gunton William Sadler]
The Godfather (1972, 175m, 9.2)                    Crime Drama          An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son. [Marlon Brando Al Pacino James Caan Diane Keaton]
The Dark Knight (2008, 152m, 9.0)                  Action Crime Drama   When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice. [Christian Bale Heath Ledger Aaron Eckhart Michael Caine]
The Godfather: Part II (1974, 202m, 9.0)           Crime Drama          The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands and tightens his 

{'title': 'The Shawshank Redemption',
 'year': 1994,
 'runtime': 142,
 'rating': 9.3,
 'genre': 'Drama',
 'actors': 'Tim Robbins Morgan Freeman Bob Gunton William Sadler',
 'summary': 'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.'}

## Load the data into the document store

In [None]:
# --- Assuming 'collection' is a list of dictionaries with keys: title, summary, year, rating

movies_documents = [
    Document(
        content=f"{m['title']} {m['summary']}",
        # 'idx' avoids shadowing the built-in id() function
        meta={"id": idx, "title": m['title'], "year": m['year'], "rating": m['rating']}
    )
    for idx, m in enumerate(collection)
]

document_store.write_documents(movies_documents)

1000

## Create a retriever for the chosen document store

We use an in-memory retriever implementing BM25. Document stores offer different types of retrievers that support keyword search or embedding search. For more details, see the documentation: https://docs.haystack.deepset.ai/docs/retrievers


In [None]:
# --- Create a BM25 retriever ---
retriever = InMemoryBM25Retriever(document_store=document_store, top_k=10)
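
To give a feel for what a BM25 retriever computes, here is a minimal sketch of the classic Okapi BM25 formula in plain Python. It is an illustration only: the function name `bm25_scores`, the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the three toy documents are all assumptions made for this example, and Haystack's in-memory store may use a different BM25 variant and tokenization.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic Okapi BM25 formula."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents, made up for this example (not taken from the IMDB collection).
toy_docs = [
    "star wars a farm boy joins the rebellion",
    "a faded film star plans her return",
    "two imprisoned men bond over a number of years",
]
scores = bm25_scores("star wars", toy_docs)
ranking = sorted(range(len(toy_docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Note that the second toy document shares only the term "star" with the query and still receives a positive score. This is one way loosely related titles can end up in a top-10 list.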

## Search for a movie

In [8]:
# --- Search for "Star Wars" before the year 2000 using BM25 over title & summary ---

results = retriever.run(
    query="Star Wars",
    filters={"field": "meta.year", "operator": "<", "value": 2000}
)

for r in results['documents']:
    print(f"{r.meta['title']} ({r.meta['year']}) - Rating: {r.meta['rating']}")


Star Wars: Episode VI - Return of the Jedi (1983) - Rating: 8.3
Star Wars (1977) - Rating: 8.6
Star Wars: Episode V - The Empire Strikes Back (1980) - Rating: 8.7
Sunset Blvd. (1950) - Rating: 8.4
Being John Malkovich (1999) - Rating: 7.7
What Ever Happened to Baby Jane? (1962) - Rating: 8.1
Strangers on a Train (1951) - Rating: 7.9
Pink Floyd: The Wall (1982) - Rating: 8.1
All About Eve (1950) - Rating: 8.2
Star Trek II: The Wrath of Khan (1982) - Rating: 7.7
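
The `filters` argument used above is a single comparison on a metadata field. The sketch below shows how such a condition could be checked against a plain metadata dictionary. It is not Haystack's implementation: the helper `matches`, the `OPS` table, and the handling of the `meta.` prefix are assumptions made for illustration, and only simple comparison operators are covered.

```python
import operator

# Comparison operators covered by this sketch (a subset chosen for illustration).
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(condition, meta):
    """Return True if the metadata dict satisfies one comparison condition."""
    # "meta.year" -> "year"; a bare "year" is left unchanged.
    field = condition["field"].split(".", 1)[-1]
    value = meta.get(field)
    if value is None:
        # A document without the field cannot satisfy the comparison.
        return False
    return OPS[condition["operator"]](value, condition["value"])


# Two records with titles and years as they appear in the outputs above.
movies = [
    {"title": "Star Wars", "year": 1977},
    {"title": "The Dark Knight", "year": 2008},
]
year_filter = {"field": "meta.year", "operator": "<", "value": 2000}
kept = [m["title"] for m in movies if matches(year_filter, m)]
print(kept)
```

With this condition only the 1977 record is kept, which matches the search results above: every listed title predates 2000.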


---