# Task 1 – Boolean Models


## Classic boolean model


Each term is binary (present or not).

Retrieve only documents that exactly satisfy the Boolean expression.

Operators: AND, OR, NOT.

Example query: q = (query AND reformulation) OR (Language AND model)

In [10]:
query = "(query AND reformulation) OR (language AND model)"

In [12]:
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# ---------- 1. Preprocessing ----------
def preprocess(text):
    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))
    tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    tokens = [stemmer.stem(t) for t in tokens if t not in stops]
    return tokens


# ---------- 2. Load Inverted Index ----------
def load_inverted_index(filepath):
    inverted = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 2:
                term, doc_id = parts[0], parts[1]
                inverted.setdefault(term, set()).add(doc_id)
    return inverted


# ---------- 3. Evaluate Boolean Query ----------
def evaluate_boolean_query(query, inverted_index, all_docs):
    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))

    # Extract all candidate words (ignoring AND/OR/NOT and parentheses)
    raw_tokens = re.findall(r'\b[a-zA-Z]+\b', query)
    unique_tokens = set(raw_tokens) - {"AND", "OR", "NOT"}

    # Start with the original expression
    expression = query

    # For each token, find its stemmed version and replace it
    for token in unique_tokens:
        if token.lower() in stops:
            continue
        stemmed = stemmer.stem(token.lower())
        docs = inverted_index.get(stemmed, set())
        expression = re.sub(rf'\b{token}\b', f"set({list(docs)})", expression, flags=re.IGNORECASE)

    # Replace logical operators with Python equivalents
    expression = re.sub(r"\bAND\b", "&", expression, flags=re.IGNORECASE)
    expression = re.sub(r"\bOR\b", "|", expression, flags=re.IGNORECASE)
    expression = re.sub(r"\bNOT\b", "all_docs -", expression, flags=re.IGNORECASE)

    # Evaluate expression safely
    try:
        result = eval(expression, {"__builtins__": None}, {"all_docs": all_docs, "set": set})
    except Exception as e:
        print("Error in query:", e)
        print("Expression after replacements:", expression)
        return set()

    return result


# ---------- 4. Example Run ----------
if __name__ == "__main__":
    inverted_index = load_inverted_index("results/inverted_index_weighted.txt")
    all_docs = {f"D{i}" for i in range(1, 7)}

    relevant_docs = evaluate_boolean_query(query, inverted_index, all_docs)

    print("\nClassic Boolean Model Results:")
    print("Query:", query)
    print("Retrieved documents:", sorted(relevant_docs))



Classic Boolean Model Results:
Query: (query AND reformulation) OR (language AND model)
Retrieved documents: ['D1', 'D2', 'D3', 'D4', 'D5', 'D6']


## Fuzzy boolean model


### 🔹 1. Concept Recap

The **Fuzzy Boolean Model** is a hybrid between:

- the **Boolean model** (logical operators `AND`, `OR`, `NOT`)  
- and the **Vector model** (graded, real-valued similarities instead of strict true/false).

Each term weight (e.g., **TF-IDF**) is treated as a **degree of membership** in the interval **[0, 1]**, not binary.

---

### 🔸 Core idea

For each query term \( t \) and document \( d \):

\[
w_{t,d} = \text{TF-IDF weight of term } t \text{ in document } d
\]

Each document’s relevance to a query is computed using **fuzzy logic operators**:

- **AND →** use `min`
- **OR →** use `max`
- **NOT →** use `1 - weight`

---

### 🔹 2. Fuzzy Boolean Operators

| Operator | Boolean | Fuzzy Equivalent | Formula |
|:---------:|:--------:|:----------------:|:--------:|
| **AND** | ∧ | min | \( S_{AND}(d,q) = \min(w_{t1,d}, w_{t2,d}) \) |
| **OR** | ∨ | max | \( S_{OR}(d,q) = \max(w_{t1,d}, w_{t2,d}) \) |
| **NOT** | ¬ | complement | \( S_{NOT}(d,q) = 1 - w_{t,d} \) |

---

✅ **Note:**  
For multi-term queries, you can combine these operators **recursively** to compute the final fuzzy relevance score.


In [20]:
import numpy as np

# Suppose df_tfidf is your TF-IDF matrix (pandas DataFrame)
# Rows = documents (e.g., D1, D2, …)
# Columns = terms
# Values = TF-IDF weights

def fuzzy_score(doc_weights, query_tokens, operator='AND'):
    """
    Compute fuzzy boolean similarity between a document and a query.
    
    Parameters:
        doc_weights : dict {term: weight}
        query_tokens : list of query terms (preprocessed)
        operator : 'AND' | 'OR'
    Returns:
        float : fuzzy similarity in [0, 1]
    """
    weights = []
    for t in query_tokens:
        weights.append(doc_weights.get(t, 0))
    
    if not weights:
        return 0.0

    if operator == 'AND':
        return np.min(weights)
    elif operator == 'OR':
        return np.max(weights)
    else:
        raise ValueError("Operator must be 'AND' or 'OR'")

# ----------------------------------------------------------
# Example usage
# ----------------------------------------------------------
query = "cat AND mat"
tokens = [t.lower() for t in query.split() if t.lower() not in ['and', 'or', 'not']]
operator = 'AND' if 'AND' in query else 'OR'

inverted_path = "results/inverted_index_weighted.txt"
df_tfidf = pd.read_csv(inverted_path, sep="\t", header=None, names=["term", "doc", "freq", "tfidf"])
results = {}
for doc in df_tfidf.index:
    doc_vector = df_tfidf.loc[doc].to_dict()
    score = fuzzy_score(doc_vector, tokens, operator)
    results[doc] = round(score, 4)

print("Fuzzy Boolean results:")
for doc, s in sorted(results.items(), key=lambda x: x[1], reverse=True):
    print(f"{doc}: {s}")


Fuzzy Boolean results:
0: 0
1: 0
2: 0
3: 0
4: 0
5: 0
6: 0
7: 0
8: 0
9: 0
10: 0
11: 0
12: 0
13: 0
14: 0
15: 0
16: 0
17: 0
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 0
28: 0
29: 0
30: 0
31: 0
32: 0
33: 0
34: 0
35: 0
36: 0
37: 0
38: 0
39: 0
40: 0
41: 0
42: 0
43: 0
44: 0
45: 0
46: 0
47: 0
48: 0
49: 0
50: 0
51: 0
52: 0
53: 0
54: 0
55: 0
56: 0
57: 0
58: 0
59: 0
60: 0
61: 0
62: 0
63: 0
64: 0
65: 0
66: 0
67: 0
68: 0
69: 0
70: 0
71: 0
72: 0
73: 0
74: 0
75: 0
76: 0
77: 0
78: 0
79: 0
80: 0
81: 0
82: 0
83: 0
84: 0
85: 0
86: 0
87: 0
88: 0
89: 0
90: 0
91: 0
92: 0
93: 0
94: 0
95: 0
96: 0
97: 0
98: 0
99: 0
100: 0
101: 0
102: 0
103: 0
104: 0
105: 0
106: 0
107: 0
108: 0
109: 0
110: 0
111: 0
112: 0
113: 0
114: 0
115: 0
116: 0
117: 0
118: 0
119: 0
120: 0
121: 0
122: 0
123: 0
124: 0
125: 0
126: 0
127: 0
128: 0
129: 0
130: 0
131: 0
132: 0
133: 0
134: 0
135: 0
136: 0
137: 0
138: 0
139: 0
140: 0
141: 0
142: 0
143: 0
144: 0
145: 0
146: 0
147: 0
148: 0
149: 0
150: 0
151: 0
152: 0
153: 0
154: 0
15

In [19]:
import pandas as pd
import numpy as np
from collections import defaultdict

# ------------------------------------------------------------
# 1. LOAD THE INVERTED INDEX FILE
# Format: <Term> <Doc> <Freq> <TF-IDF>
# ------------------------------------------------------------
inverted_path = "results/inverted_index_weighted.txt"

# Read tab-separated txt file 
data = pd.read_csv(inverted_path, sep="\t", header=None, names=["term", "doc", "freq", "tfidf"])

# Build a document-term matrix {doc: {term: tfidf}}
doc_dict = defaultdict(dict)
for _, row in data.iterrows():
    doc_dict[row["doc"]][str(row["term"]).lower()] = float(row["tfidf"])

# ------------------------------------------------------------
# 2. FUZZY BOOLEAN EVALUATION FUNCTIONS
# ------------------------------------------------------------
def fuzzy_and(a, b): return min(a, b)
def fuzzy_or(a, b): return max(a, b)
def fuzzy_not(a): return 1 - a

def get_weight(doc, term):
    """Return TF-IDF weight of term in doc, or 0 if missing."""
    return doc_dict[doc].get(term.lower(), 0.0)

def eval_fuzzy(query, doc):
    """Evaluate fuzzy boolean query for one document."""
    # Add spaces around parentheses
    query = query.replace("(", " ( ").replace(")", " ) ")
    tokens = query.split()
    
    def parse(tokens):
        stack = []
        i = 0
        while i < len(tokens):
            tok = tokens[i].lower()
            if tok == "(":
                # find matching parenthesis
                depth = 0
                j = i
                while j < len(tokens):
                    if tokens[j] == "(":
                        depth += 1
                    elif tokens[j] == ")":
                        depth -= 1
                        if depth == 0:
                            break
                    j += 1
                val = parse(tokens[i + 1:j])
                stack.append(val)
                i = j
            elif tok == "and":
                stack.append("AND")
            elif tok == "or":
                stack.append("OR")
            elif tok == "not":
                # next token is negated
                next_term = tokens[i + 1].lower()
                w = fuzzy_not(get_weight(doc, next_term))
                stack.append(w)
                i += 1
            else:
                stack.append(get_weight(doc, tok))
            i += 1
        
        # Evaluate AND first, then OR
        while "AND" in stack:
            idx = stack.index("AND")
            res = fuzzy_and(stack[idx - 1], stack[idx + 1])
            stack = stack[:idx - 1] + [res] + stack[idx + 2:]
        while "OR" in stack:
            idx = stack.index("OR")
            res = fuzzy_or(stack[idx - 1], stack[idx + 1])
            stack = stack[:idx - 1] + [res] + stack[idx + 2:]
        return stack[0]

    return parse(tokens)

# ------------------------------------------------------------
# 3. RUN A QUERY ON ALL DOCUMENTS
# ------------------------------------------------------------
def fuzzy_search(query):
    results = {}
    for doc in doc_dict.keys():
        score = eval_fuzzy(query, doc)
        results[doc] = round(score, 4)
    # Sort descending by score
    results = dict(sorted(results.items(), key=lambda x: x[1], reverse=True))
    return results

# ------------------------------------------------------------
# 4. TEST
# ------------------------------------------------------------
query = "(10% AND 12%) OR (175 AND NOT 2)"
results = fuzzy_search(query)

print(f"\n🔍 Query: {query}\n")
print("Doc\tScore")
for doc, score in results.items():
    print(f"{doc}\t{score}")



🔍 Query: (10% AND 12%) OR (175 AND NOT 2)

Doc	Score
D2	0.0715
D6	0.0
D4	0.0
D1	0.0
D5	0.0
D3	0.0


## Extended Boolean Model

## 🔹 1. Concept Recap

The **Extended Boolean Model** replaces the strict Boolean **AND / OR** logic with continuous functions that measure **how much a document satisfies a query**.

It uses the **p-norm operator**, where the parameter \( p \) controls how strict or relaxed the matching is:

| p value | Behavior |
|:--------:|:----------|
| \( p \to \infty \) | strict Boolean (perfect AND/OR) |
| \( p = 1 \) | loose, soft matching (closer to vector model) |
| **Typical value** | between 2 and 5 |

---

## 🧩 Formulas

Let \( w_{d_i} \) be the **TF-IDF weight** of term *i* in document *d*, normalized to [0, 1].

---

### 🔸 AND Query

\[
S_{AND}(d, q) = \left( \frac{1}{n} \sum_{i=1}^{n} (w_{d_i})^p \right)^{\frac{1}{p}}
\]

---

### 🔸 OR Query

\[
S_{OR}(d, q) = 1 - \left( \frac{1}{n} \sum_{i=1}^{n} (1 - w_{d_i})^p \right)^{\frac{1}{p}}
\]

---

where  
\( n \) = number of terms in the query,  
and \( p \) = the **p-norm parameter** controlling the **strictness of matching**.

---

✅ **Intuition:**
- When \( p \) is large → behavior approaches strict Boolean logic.  
- When \( p \) is small → more flexible, similar to vector-space similarity.


In [21]:
import pandas as pd
import numpy as np
from collections import defaultdict

# ------------------------------------------------------------
# 1. LOAD THE INVERTED INDEX
# ------------------------------------------------------------
inverted_path = "results/inverted_index_weighted.txt"

data = pd.read_csv(inverted_path, sep="\t", header=None, names=["term", "doc", "freq", "tfidf"])

# Build structure: {doc: {term: tfidf}}
doc_dict = defaultdict(dict)
for _, row in data.iterrows():
    doc_dict[row["doc"]][str(row["term"]).lower()] = float(row["tfidf"])

# ------------------------------------------------------------
# 2. EXTENDED BOOLEAN MODEL FUNCTIONS
# ------------------------------------------------------------
def extended_boolean_score(doc_weights, query_terms, operator="AND", p=2):
    """
    Compute the Extended Boolean model score for one document.
    - doc_weights: dict {term: weight}
    - query_terms: list of query tokens (strings)
    - operator: 'AND' or 'OR'
    - p: float, p-norm parameter
    """
    # Extract weights for query terms (default 0 if term not in doc)
    w = np.array([doc_weights.get(term.lower(), 0.0) for term in query_terms])
    n = len(w)
    if n == 0:
        return 0.0

    if operator.upper() == "AND":
        return (np.sum(w ** p) / n) ** (1 / p)
    elif operator.upper() == "OR":
        return 1 - ((np.sum((1 - w) ** p) / n) ** (1 / p))
    else:
        raise ValueError("Operator must be 'AND' or 'OR'")

# ------------------------------------------------------------
# 3. QUERY EXECUTION
# ------------------------------------------------------------
def extended_boolean_search(query, p=2):
    """
    Execute an AND/OR query across all documents using the Extended Boolean Model.
    """
    # Detect operator
    query_lower = query.lower()
    if " and " in query_lower:
        operator = "AND"
        terms = [t.strip() for t in query_lower.split("and")]
    elif " or " in query_lower:
        operator = "OR"
        terms = [t.strip() for t in query_lower.split("or")]
    else:
        operator = "AND"
        terms = [query_lower.strip()]
    
    # Compute scores for all docs
    results = {}
    for doc, weights in doc_dict.items():
        score = extended_boolean_score(weights, terms, operator=operator, p=p)
        results[doc] = round(score, 4)

    # Sort results by descending score
    results = dict(sorted(results.items(), key=lambda x: x[1], reverse=True))
    return results

# ------------------------------------------------------------
# 4. TEST
# ------------------------------------------------------------
query = "10% AND 12%"
p = 2  # try 1, 2, or higher for stricter matching

results = extended_boolean_search(query, p=p)

print(f"\n🔍 Extended Boolean Query: {query} (p={p})\n")
print("Doc\tScore")
for doc, score in results.items():
    print(f"{doc}\t{score}")



🔍 Extended Boolean Query: 10% AND 12% (p=2)

Doc	Score
D2	0.1011
D4	0.0531
D6	0.0
D1	0.0
D5	0.0
D3	0.0
