# Task 1 – Boolean Models


## Classic Boolean Model


Each index term is treated as binary: it is either present in a document or absent.

Only documents that satisfy the Boolean expression exactly are retrieved; there is no partial matching and no ranking.

Operators: AND, OR, NOT.

Example query: q = (query AND reformulation) OR (Language AND model)

In [None]:
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# ========== 1. Preprocessing (same as Lab 1) ==========
def preprocess(text):
    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))
    # Same regex as used before
    tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    tokens = [stemmer.stem(t) for t in tokens if t not in stops]
    return tokens

# ========== 2. Load Inverted Index ==========
# Expected format (whitespace-separated; any further columns are ignored):
# term  doc_id
def load_inverted_index(filepath):
    inverted = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines instead of raising IndexError
            term, doc_id = parts[0], parts[1]
            inverted.setdefault(term, set()).add(doc_id)
    return inverted


# ========== 3. Parse and Evaluate Boolean Query ==========
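def evaluate_boolean_query(query, inverted_index, all_docs):
    stemmer = PorterStemmer()
    # Operators must be written in upper case, as in the example query
    operators = {"AND": "&", "OR": "|", "NOT": "all_docs -"}

    def to_python(match):
        word = match.group(0)
        if word in operators:
            return operators[word]
        # Stem each term in place (instead of calling preprocess() on the whole
        # query) so the word keeps its position in the expression. Searching the
        # query for an already-stemmed token fails: "queri" never occurs in "query".
        # Stop words are not in the index, so they map to the empty set.
        docs = inverted_index.get(stemmer.stem(word.lower()), set())
        return f"set({sorted(docs)})"

    # Translate every word in a single pass, so text that has already been
    # substituted (such as document ids) is never rewritten a second time
    expression = re.sub(r'\b[a-zA-Z]+\b', to_python, query)

    # Disabling builtins limits what eval() can reach, but it is not a real
    # sandbox: only evaluate queries from a trusted source
    try:
        result = eval(expression, {"__builtins__": None}, {"all_docs": all_docs, "set": set})
    except Exception as e:
        print("Error in query:", e)
        return set()

    return result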


# ========== 4. Example Run ==========
if __name__ == "__main__":
    inverted_index = load_inverted_index("results/inverted_index.txt")
    all_docs = {f"D{i}.txt" for i in range(1, 7)}  # D1–D6

    query = "(query AND reformulation) OR (language AND model)"
    relevant_docs = evaluate_boolean_query(query, inverted_index, all_docs)

    print("\nClassic Boolean Model Results:")
    print("Query:", query)
    print("Retrieved documents:", sorted(relevant_docs))



Classic Boolean Model Results:
Query: (queri AND reformul) OR (languag AND model)
Retrieved documents: ['D1.txt', 'D4.txt']


In [None]:
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# ---------- 1. Preprocessing ----------
def preprocess(text):
    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))
    tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    tokens = [stemmer.stem(t) for t in tokens if t not in stops]
    return tokens


# ---------- 2. Load Inverted Index ----------
def load_inverted_index(filepath):
    inverted = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 2:
                term, doc_id = parts[0], parts[1]
                inverted.setdefault(term, set()).add(doc_id)
    return inverted


# ---------- 3. Evaluate Boolean Query ----------
def evaluate_boolean_query(query, inverted_index, all_docs):
    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))

    # Extract all candidate words (ignoring AND/OR/NOT and parentheses)
    raw_tokens = re.findall(r'\b[a-zA-Z]+\b', query)
    unique_tokens = set(raw_tokens) - {"AND", "OR", "NOT"}

    # Start with the original expression
    expression = query

    # For each token, find its stemmed version and replace it
    for token in unique_tokens:
        if token.lower() in stops:
            continue
        stemmed = stemmer.stem(token.lower())
        docs = inverted_index.get(stemmed, set())
        expression = re.sub(rf'\b{token}\b', f"set({list(docs)})", expression, flags=re.IGNORECASE)

    # Replace logical operators with Python equivalents
    expression = re.sub(r"\bAND\b", "&", expression, flags=re.IGNORECASE)
    expression = re.sub(r"\bOR\b", "|", expression, flags=re.IGNORECASE)
    expression = re.sub(r"\bNOT\b", "all_docs -", expression, flags=re.IGNORECASE)

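    # Evaluate the expression. Disabling builtins limits what eval() can reach,
    # but it is not a real sandbox: only evaluate queries from a trusted source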
    try:
        result = eval(expression, {"__builtins__": None}, {"all_docs": all_docs, "set": set})
    except Exception as e:
        print("Error in query:", e)
        print("Expression after replacements:", expression)
        return set()

    return result


# ---------- 4. Example Run ----------
if __name__ == "__main__":
    inverted_index = load_inverted_index("results/inverted_index.txt")
    all_docs = {f"D{i}.txt" for i in range(1, 7)}

    query = "(query AND reformulation) OR (language AND model)"
    relevant_docs = evaluate_boolean_query(query, inverted_index, all_docs)

    print("\nClassic Boolean Model Results:")
    print("Query:", query)
    print("Retrieved documents:", sorted(relevant_docs))



Classic Boolean Model Results:
Query: (query AND reformulation) OR (language AND model)
Retrieved documents: ['D1.txt', 'D4.txt']


## Fuzzy Boolean Model

Each term has a degree of membership between 0 and 1 in each document (derived from TF or TF-IDF, normalized to that range).

Logical operators are softened using fuzzy logic:

AND → min()

OR → max()

NOT → 1 − value

Evaluating the query with these operators gives a degree of relevance for each document.
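As a minimal sketch, the three fuzzy operators can be applied to the example query as follows. The membership degrees are invented for illustration (they are not taken from the lab's index), and the helper names `fuzzy_and`, `fuzzy_or`, `fuzzy_not` and `score` are hypothetical.

```python
# Hypothetical membership degrees, already scaled to [0, 1], keyed by stemmed term.
weights = {
    "D1": {"queri": 0.8, "reformul": 0.5, "languag": 0.0, "model": 0.3},
    "D2": {"queri": 0.1, "reformul": 0.0, "languag": 0.9, "model": 0.6},
}

def fuzzy_and(*values):
    return min(values)

def fuzzy_or(*values):
    return max(values)

def fuzzy_not(value):
    return 1.0 - value

def score(doc_weights):
    """Degree of relevance for (query AND reformulation) OR (language AND model)."""
    def w(term):
        return doc_weights.get(term, 0.0)  # a missing term has membership 0
    return fuzzy_or(
        fuzzy_and(w("queri"), w("reformul")),
        fuzzy_and(w("languag"), w("model")),
    )

scores = {doc: score(doc_weights) for doc, doc_weights in weights.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # {'D1': 0.5, 'D2': 0.6}
print(ranking)  # ['D2', 'D1']
```

With these made-up weights, D1 is limited by its weaker AND operand (0.5) while D2 reaches 0.6 through the second clause, so D2 ranks first.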

## Extended Boolean Model

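- **Combines** the Boolean and vector space models.
- **Allows partial matching** using *p-norms*. For a query with $m$ terms, where $w_{di} \in [0, 1]$ is the weight of term $i$ in document $d$ and $p \ge 1$:

#### For AND:
$$
S_{AND}(d, q) = 1 - \left( \frac{1}{m} \sum_{i=1}^{m} (1 - w_{di})^p \right)^{1/p}
$$

#### For OR:
$$
S_{OR}(d, q) = \left( \frac{1}{m} \sum_{i=1}^{m} w_{di}^p \right)^{1/p}
$$

*(This is the standard p-norm formulation; check it against the one used in your lecture notes.)*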

➡️ You’ll **rank documents** by their score.
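The p-norm scores could be computed along the following lines. This sketch assumes the normalized p-norm formulation of Salton, Fox and Wu (OR averages the weights under the p-norm; AND is one minus the p-norm distance from the all-ones point). If your lecture notes use a different variant, adapt the two functions. The names `pnorm_or` and `pnorm_and` and the weights are hypothetical.

```python
def pnorm_or(weights, p=2.0):
    """OR score: p-norm average of the term weights."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1.0 / p)

def pnorm_and(weights, p=2.0):
    """AND score: 1 minus the p-norm average of the complements (1 - w)."""
    m = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / m) ** (1.0 / p)

# Hypothetical weights of one document for the four stemmed query terms.
w = {"queri": 0.8, "reformul": 0.5, "languag": 0.0, "model": 0.3}

# (query AND reformulation) OR (language AND model), evaluated inside-out.
left = pnorm_and([w["queri"], w["reformul"]])
right = pnorm_and([w["languag"], w["model"]])
score = pnorm_or([left, right])
print(round(score, 4))
```

With `p = 1` both operators reduce to the plain average of the weights, and as `p` grows they approach the fuzzy `min` and `max`, which is what lets the model interpolate between vector-style and Boolean-style matching.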


# Task 2

Test query:

q = (query AND reformulation) OR (Language AND model)

Parentheses define precedence.

Expected steps:

1. Preprocess the query using the same pipeline as for documents (tokenization, stop word removal, and stemming).
2. Parse the Boolean expression into logical operations.
3. For the Classic Boolean Model: retrieve only the documents that strictly satisfy the Boolean condition.
4. For the Fuzzy and Extended Boolean Models:
   - Compute partial degrees of relevance for each document according to the model's equations given in the lecture notes.
   - Rank documents by their computed degree of match with the query.


### Preprocess the query using the same pipeline as for documents (tokenization, stop word removal, and stemming).


### Parse the Boolean expression into logical operations.
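One way to parse the expression without `eval()` is to convert it to postfix (reverse Polish) order with the shunting-yard algorithm. The sketch below is an illustration, not the lab's reference solution. It assumes operators are written in upper case, that `NOT` binds tighter than `AND`, which binds tighter than `OR`, and that terms are only lower-cased here (stemming would be applied to the term tokens in the same place). `tokenize` and `to_postfix` are hypothetical names.

```python
import re

PRECEDENCE = {"NOT": 3, "AND": 2, "OR": 1}

def tokenize(query):
    """Split a query into parentheses and alphabetic words."""
    return re.findall(r"\(|\)|[A-Za-z]+", query)

def to_postfix(query):
    """Convert an infix Boolean query to a postfix token list."""
    output, stack = [], []
    for tok in tokenize(query):
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
        elif tok in PRECEDENCE:
            # NOT is a prefix operator, so it never pops earlier operators
            while (tok != "NOT" and stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok.lower())  # a query term
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output

print(to_postfix("(query AND reformulation) OR (Language AND model)"))
# ['query', 'reformulation', 'AND', 'language', 'model', 'AND', 'OR']
```

The postfix list makes the order of operations explicit, so the same parsed query can be evaluated by any of the three models.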

### For the Classic Boolean Model: retrieve only the documents that strictly satisfy the Boolean condition.
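A postfix query can be evaluated against an inverted index with a small stack machine, where AND, OR and NOT become set intersection, union and complement. The sketch below assumes the query has already been stemmed and converted to postfix. The toy index and document ids are invented for illustration, and `evaluate_postfix` is a hypothetical name.

```python
def evaluate_postfix(postfix, index, all_docs):
    """Return the set of documents that strictly satisfy a postfix Boolean query."""
    stack = []
    for tok in postfix:
        if tok == "NOT":
            stack.append(all_docs - stack.pop())
        elif tok in ("AND", "OR"):
            right, left = stack.pop(), stack.pop()
            stack.append(left & right if tok == "AND" else left | right)
        else:
            # A term that is not in the index matches no document
            stack.append(index.get(tok, set()))
    if len(stack) != 1:
        raise ValueError("malformed query")
    return set(stack[0])  # copy, so callers cannot modify the index

# Hypothetical toy index keyed by stemmed terms.
index = {
    "queri": {"D1", "D2"},
    "reformul": {"D2", "D3"},
    "languag": {"D3", "D4"},
    "model": {"D4"},
}
all_docs = {"D1", "D2", "D3", "D4", "D5"}

postfix = ["queri", "reformul", "AND", "languag", "model", "AND", "OR"]
print(sorted(evaluate_postfix(postfix, index, all_docs)))  # ['D2', 'D4']
```

In this toy data D2 contains both terms of the first clause and D4 both terms of the second, so exactly those two documents are retrieved and no ranking is produced.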


### For the Fuzzy and Extended Boolean Models:
- Compute partial degrees of relevance for each document according to the model’s equations given in the lecture notes.
- Rank documents by their computed degree of match with the query.
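Both steps can be sketched with one recursive evaluator over a query tree, with the AND and OR combinators passed in so the same code serves the fuzzy and the extended model. Everything here is an assumption made for illustration: the raw term frequencies are invented, weights are obtained by dividing by each document's largest count (your notes may prescribe TF-IDF instead), the p-norm variant is the normalized one with `p = 2`, and all function names are hypothetical.

```python
def normalise_tf(tf_by_doc):
    """Scale raw term frequencies to [0, 1] using each document's largest count."""
    weights = {}
    for doc, counts in tf_by_doc.items():
        top = max(counts.values(), default=0)
        weights[doc] = {t: (c / top if top else 0.0) for t, c in counts.items()}
    return weights

def evaluate(node, doc_weights, and_fn, or_fn):
    """Score a query tree: a stemmed term (str) or a tuple (operator, child, ...)."""
    if isinstance(node, str):
        return doc_weights.get(node, 0.0)
    op, *children = node
    values = [evaluate(child, doc_weights, and_fn, or_fn) for child in children]
    if op == "NOT":
        return 1.0 - values[0]
    return and_fn(values) if op == "AND" else or_fn(values)

def rank(query, weights, and_fn, or_fn):
    """Return (doc, score) pairs, best first; ties are broken by document id."""
    scores = {d: evaluate(query, w, and_fn, or_fn) for d, w in weights.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

def pnorm_or(values, p=2.0):
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def pnorm_and(values, p=2.0):
    return 1.0 - (sum((1.0 - v) ** p for v in values) / len(values)) ** (1.0 / p)

# Hypothetical raw term frequencies keyed by stemmed term.
tf = {
    "D1": {"queri": 4, "reformul": 2, "model": 1},
    "D2": {"languag": 3, "model": 3},
    "D3": {"queri": 1, "languag": 2},
}
weights = normalise_tf(tf)
query = ("OR", ("AND", "queri", "reformul"), ("AND", "languag", "model"))

fuzzy_ranking = rank(query, weights, min, max)
extended_ranking = rank(query, weights, pnorm_and, pnorm_or)
print(fuzzy_ranking)
print(extended_ranking)
```

With this toy data both models order the documents D2, D1, D3, but they differ on D3: it satisfies neither AND clause fully, so the fuzzy model scores it 0, while the p-norm model still gives it a small positive score for its partial matches.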