# Boolean Retrieval Model

The **Boolean Retrieval Model** is one of the simplest information retrieval models. It uses **Boolean logic** to determine whether a document matches a user's query. The main idea is to represent documents and queries as sets of terms and then evaluate the query using **logical operators**.

## Key Concepts

- **Terms**: Words or tokens in documents.
- **Document Representation**: Each document is represented as a set of terms (or a binary vector indicating presence/absence of each term).
- **Query Representation**: The user's query is expressed using **Boolean operators**:
  - `AND` – Returns documents that contain **all** specified terms.
  - `OR` – Returns documents that contain **at least one** of the specified terms.
  - `NOT` – Excludes documents that contain the specified term.

## Example

Suppose we have 3 documents:

| Document | Content                                   |
|----------|-------------------------------------------|
| doc1     | "information retrieval is essential"     |
| doc2     | "broad research is ongoing"              |
| doc3     | "topic modeling relates to information" |

A Boolean query:
information AND retrieval

- Matches `doc1` (contains both "information" and "retrieval").
- Does **not** match `doc2` or `doc3`.

Another query:


- Matches `doc1`, `doc2`, and `doc3` (all contain at least one of the terms).

## Advantages

- Simple and easy to implement.
- Clear, deterministic results.

## Limitations

- No ranking: All matching documents are treated equally.
- Strict matching: Cannot handle partial relevance or approximate matches.
- Queries must be precise, which can be difficult for users.




In [17]:
# Sample movie review documents
docs = {
    "doc1": "I loved this movie the plot was exciting and the characters were amazing",
    "doc2": "Terrible movie waste of time and I would not recommend it",
    "doc3": "An average film some good moments but overall it was predictable",
    "doc4": "Fantastic performance by the lead actor brilliant cinematography",
    "doc5": "Bad script poor direction not worth watching at all",
}

# Preprocessing: lowercase + split into tokens
def preprocess(text):
    return text.lower().split()

# Apply preprocessing to all documents
preprocessed_docs = {doc_id: preprocess(text) for doc_id, text in docs.items()}

for doc_id, tokens in preprocessed_docs.items():
    print(f"{doc_id}: {tokens}")


doc1: ['i', 'loved', 'this', 'movie', 'the', 'plot', 'was', 'exciting', 'and', 'the', 'characters', 'were', 'amazing']
doc2: ['terrible', 'movie', 'waste', 'of', 'time', 'and', 'i', 'would', 'not', 'recommend', 'it']
doc3: ['an', 'average', 'film', 'some', 'good', 'moments', 'but', 'overall', 'it', 'was', 'predictable']
doc4: ['fantastic', 'performance', 'by', 'the', 'lead', 'actor', 'brilliant', 'cinematography']
doc5: ['bad', 'script', 'poor', 'direction', 'not', 'worth', 'watching', 'at', 'all']


In [6]:
# Build term-document matrix (binary)
terms = {}
doc_list = list(docs.keys())

for i, (doc, text) in enumerate(docs.items()):
    words = set(preprocess(text))
    for word in words:
        if word not in terms:
            terms[word] = [0] * len(docs)
        terms[word][i] = 1

In [7]:
# Show matrix
print("\nTerm-Document Matrix:")
for term, vec in terms.items():
    print(f"{term:15}: {vec}")


# Evaluate boolean query (supports AND, OR, NOT)
def evaluate(query):
    query = query.lower().split()
    result = None
    operator = None
    i = 0

    while i < len(query):
        if query[i] == "not":
            word = query[i + 1]
            vec = terms.get(word, [0] * len(docs))
            vec = [1 - x for x in vec]
            i += 2
        elif query[i] in ["and", "or"]:
            operator = query[i]
            i += 1
            continue
        else:
            word = query[i]
            vec = terms.get(word, [0] * len(docs))
            i += 1

        if result is None:
            result = vec
        else:
            if operator == "and":
                result = [a & b for a, b in zip(result, vec)]
            elif operator == "or":
                result = [a | b for a, b in zip(result, vec)]

    return result


Term-Document Matrix:
and            : [1, 1, 0, 0, 0]
characters     : [1, 0, 0, 0, 0]
the            : [1, 0, 0, 1, 0]
movie          : [1, 1, 0, 0, 0]
exciting       : [1, 0, 0, 0, 0]
amazing        : [1, 0, 0, 0, 0]
loved          : [1, 0, 0, 0, 0]
was            : [1, 0, 1, 0, 0]
i              : [1, 1, 0, 0, 0]
were           : [1, 0, 0, 0, 0]
this           : [1, 0, 0, 0, 0]
plot           : [1, 0, 0, 0, 0]
terrible       : [0, 1, 0, 0, 0]
of             : [0, 1, 0, 0, 0]
not            : [0, 1, 0, 0, 1]
waste          : [0, 1, 0, 0, 0]
recommend      : [0, 1, 0, 0, 0]
it             : [0, 1, 1, 0, 0]
would          : [0, 1, 0, 0, 0]
time           : [0, 1, 0, 0, 0]
an             : [0, 0, 1, 0, 0]
some           : [0, 0, 1, 0, 0]
but            : [0, 0, 1, 0, 0]
predictable    : [0, 0, 1, 0, 0]
film           : [0, 0, 1, 0, 0]
moments        : [0, 0, 1, 0, 0]
good           : [0, 0, 1, 0, 0]
overall        : [0, 0, 1, 0, 0]
average        : [0, 0, 1, 0, 0]
lead           : [0,

### Entering movie AND exciting should return doc1 as matching document

In [8]:
# Get user query
query = input("\nEnter Boolean query (AND, OR, NOT): ")

# Evaluate
result_vec = evaluate(query)
matches = [doc_list[i] for i, v in enumerate(result_vec) if v == 1]

# Show result
print("\nMatching Documents:", matches if matches else "No match found.")


Matching Documents: ['doc1']


# Term Weighting Mechanisms

Term weighting is used in information retrieval to **measure the importance of a term in a document and across the corpus**. The most common weighting schemes are **TF**, **IDF**, and **TF-IDF**.

---

## 1. Term Frequency (TF)

**Term Frequency (TF)** measures how often a term appears in a document.  
- Higher frequency → term is more important in that document.  
- Formula:

\[
TF(t, d) = \text{Number of times term } t \text{ appears in document } d
\]

Example:

Document: `"apple banana apple"`  
- TF("apple") = 2  
- TF("banana") = 1  

---

## 2. Inverse Document Frequency (IDF)

**Inverse Document Frequency (IDF)** measures how important a term is across all documents.  
- Rare terms → higher IDF  
- Common terms → lower IDF  

Formula:

\[
IDF(t) = \log\frac{N}{df_t}
\]

Where:  
- \(N\) = total number of documents  
- \(df_t\) = number of documents containing term \(t\)

Example:  
- Term `"apple"` occurs in 3 out of 10 documents → IDF("apple") = log(10/3)

---

## 3. TF-IDF

**TF-IDF** combines TF and IDF to give a **weight indicating a term’s importance in a document relative to the corpus**.  

\[
TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)
\]

- High TF and high IDF → term is very important  
- Helps to **rank documents** based on relevance to a query  

Example:

Document: `"apple banana apple"`  
- TF("apple") = 2  
- IDF("apple") = 1.2  
- TF-IDF("apple") = 2 × 1.2 = 2.4  

---

### Advantages

- Balances term frequency within a document and rarity across corpus.
- Essential for **Vector Space Model** and relevance ranking.



In [9]:
import math

# Sample movie review documents
documents = {
    "doc1": "I loved this movie the plot was exciting and the characters were amazing",
    "doc2": "Terrible movie waste of time and I would not recommend it",
    "doc3": "An average film some good moments but overall it was predictable",
    "doc4": "Fantastic performance by the lead actor brilliant cinematography",
    "doc5": "Bad script poor direction not worth watching at all",
}

# Preprocessing: lowercase + split
def preprocess(text):
    return text.lower().split()

# Build vocabulary
vocab = set()
preprocessed_docs = {}

for name, text in documents.items():
    tokens = preprocess(text)
    preprocessed_docs[name] = tokens
    vocab.update(tokens)

print(f"Preprocessed docs:\n{preprocessed_docs}\n")
vocab = sorted(list(vocab))
print("Vocabulary:", vocab)


Preprocessed docs:
{'doc1': ['i', 'loved', 'this', 'movie', 'the', 'plot', 'was', 'exciting', 'and', 'the', 'characters', 'were', 'amazing'], 'doc2': ['terrible', 'movie', 'waste', 'of', 'time', 'and', 'i', 'would', 'not', 'recommend', 'it'], 'doc3': ['an', 'average', 'film', 'some', 'good', 'moments', 'but', 'overall', 'it', 'was', 'predictable'], 'doc4': ['fantastic', 'performance', 'by', 'the', 'lead', 'actor', 'brilliant', 'cinematography'], 'doc5': ['bad', 'script', 'poor', 'direction', 'not', 'worth', 'watching', 'at', 'all']}

Vocabulary: ['actor', 'all', 'amazing', 'an', 'and', 'at', 'average', 'bad', 'brilliant', 'but', 'by', 'characters', 'cinematography', 'direction', 'exciting', 'fantastic', 'film', 'good', 'i', 'it', 'lead', 'loved', 'moments', 'movie', 'not', 'of', 'overall', 'performance', 'plot', 'poor', 'predictable', 'recommend', 'script', 'some', 'terrible', 'the', 'this', 'time', 'was', 'waste', 'watching', 'were', 'worth', 'would']


### Term Frequency (TF)


In [10]:
def compute_tf(doc_tokens, vocab):
    tf = {}
    total_terms = len(doc_tokens)
    for term in vocab:
        tf[term] = doc_tokens.count(term) / total_terms
    return tf

### Inverse Document Frequency (IDF)

In [11]:
def compute_idf(all_docs, vocab):
    N = len(all_docs)
    idf = {}
    for term in vocab:
        df = sum(1 for doc in all_docs.values() if term in doc)
        idf[term] = math.log(N / (df + 1)) + 1  # +1 smoothing
    return idf

### TF-IDF

In [12]:
def compute_tfidf(tf, idf):
    tfidf = {}
    for term in tf:
        tfidf[term] = tf[term] * idf[term]
    return tfidf

In [13]:
# Compute IDF once for all docs
idf = compute_idf(preprocessed_docs, vocab)

# For each document, compute TF and TF-IDF
for name, tokens in preprocessed_docs.items():
    tf = compute_tf(tokens, vocab)
    tfidf = compute_tfidf(tf, idf)

    print(f"\nDocument: {name}")
    print("TF:     ", {k: round(v, 3) for k, v in tf.items()})
    print("TF-IDF: ", {k: round(v, 3) for k, v in tfidf.items()})


Document: doc1
TF:      {'actor': 0.0, 'all': 0.0, 'amazing': 0.077, 'an': 0.0, 'and': 0.077, 'at': 0.0, 'average': 0.0, 'bad': 0.0, 'brilliant': 0.0, 'but': 0.0, 'by': 0.0, 'characters': 0.077, 'cinematography': 0.0, 'direction': 0.0, 'exciting': 0.077, 'fantastic': 0.0, 'film': 0.0, 'good': 0.0, 'i': 0.077, 'it': 0.0, 'lead': 0.0, 'loved': 0.077, 'moments': 0.0, 'movie': 0.077, 'not': 0.0, 'of': 0.0, 'overall': 0.0, 'performance': 0.0, 'plot': 0.077, 'poor': 0.0, 'predictable': 0.0, 'recommend': 0.0, 'script': 0.0, 'some': 0.0, 'terrible': 0.0, 'the': 0.154, 'this': 0.077, 'time': 0.0, 'was': 0.077, 'waste': 0.0, 'watching': 0.0, 'were': 0.077, 'worth': 0.0, 'would': 0.0}
TF-IDF:  {'actor': 0.0, 'all': 0.0, 'amazing': 0.147, 'an': 0.0, 'and': 0.116, 'at': 0.0, 'average': 0.0, 'bad': 0.0, 'brilliant': 0.0, 'but': 0.0, 'by': 0.0, 'characters': 0.147, 'cinematography': 0.0, 'direction': 0.0, 'exciting': 0.147, 'fantastic': 0.0, 'film': 0.0, 'good': 0.0, 'i': 0.116, 'it': 0.0, 'lead': 0

### Cosine Similarity


In [14]:
# Sample movie review documents
documents = [
    "I loved this movie the plot was exciting and the characters were amazing",
    "Terrible movie waste of time and I would not recommend it",
    "An average film some good moments but overall it was predictable",
    "Fantastic performance by the lead actor brilliant cinematography",
    "Bad script poor direction not worth watching at all",
]

# Build vocabulary (unique words in all documents)
def build_vocab(docs):
    vocab_set = set()
    for doc in docs:
        vocab_set.update(doc.lower().split())  # lowercase + split
    return sorted(vocab_set)

vocab = build_vocab(documents)
print("Vocabulary:", vocab)


Vocabulary: ['actor', 'all', 'amazing', 'an', 'and', 'at', 'average', 'bad', 'brilliant', 'but', 'by', 'characters', 'cinematography', 'direction', 'exciting', 'fantastic', 'film', 'good', 'i', 'it', 'lead', 'loved', 'moments', 'movie', 'not', 'of', 'overall', 'performance', 'plot', 'poor', 'predictable', 'recommend', 'script', 'some', 'terrible', 'the', 'this', 'time', 'was', 'waste', 'watching', 'were', 'worth', 'would']


In [15]:
# Create Bag of Words vector for a document
def bow_vector(doc, vocab):
    words = doc.split()
    return [words.count(term) for term in vocab]

In [16]:
# Compute cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    dot_product = sum(a * b for a, b in zip(vec1, vec2))
    magnitude1 = math.sqrt(sum(a * a for a in vec1))
    magnitude2 = math.sqrt(sum(b * b for b in vec2))
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0
    return dot_product / (magnitude1 * magnitude2)


# Create BoW vectors for all documents
vectors = [bow_vector(doc, vocab) for doc in documents]

# Compute similarity between first and second documents
similarity = cosine_similarity(vectors[0], vectors[1])

print("\nBoW Vector for Document 1:", vectors[0])
print("BoW Vector for Document 2:", vectors[1])
print("Cosine Similarity between Doc1 and Doc2:", round(similarity, 4))


BoW Vector for Document 1: [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 1, 0, 0, 1, 0, 0]
BoW Vector for Document 2: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1]
Cosine Similarity between Doc1 and Doc2: 0.1782
