# Task 1 – Boolean Models


## Classic Boolean Model


Each term is binary (present or not).

Retrieve only documents that exactly satisfy the Boolean expression.

Operators: AND, OR, NOT.

Example query: `q = (query AND reformulation) OR (language AND model)`

In [1]:
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# ---------- 1. Preprocessing ----------
def preprocess(text):
    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))
    tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    tokens = [stemmer.stem(t) for t in tokens if t not in stops]
    return tokens


# ---------- 2. Load Inverted Index ----------
def load_inverted_index(filepath):
    inverted = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 2:
                term, doc_id = parts[0], parts[1]
                inverted.setdefault(term, set()).add(doc_id)
    return inverted


# ---------- 3. Evaluate Boolean Query ----------
def evaluate_boolean_query(query, inverted_index, all_docs):
    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))
    operators = {"AND": "&", "OR": "|", "NOT": "all_docs -"}

    # Names visible to eval(): the collection plus one variable per query term
    env = {"all_docs": set(all_docs)}

    def substitute(match):
        word = match.group(0)
        if word.upper() in operators:
            return operators[word.upper()]
        # Stop words are not indexed, so they match no document
        if word.lower() in stops:
            docs = set()
        else:
            docs = inverted_index.get(stemmer.stem(word.lower()), set())
        name = f"term_{len(env)}"
        env[name] = set(docs)
        return name

    # Rewrite the query in a single pass: operators become Python set operators,
    # every other word becomes a variable bound to its posting set.
    # Doing this in one pass avoids re-matching text inserted by an earlier substitution.
    expression = re.sub(r'\b[a-zA-Z]+\b', substitute, query)

    # Evaluate with builtins disabled; only the names in env are available.
    # This limits what the expression can reference but is not a sandbox
    # for untrusted input.
    try:
        result = eval(expression, {"__builtins__": {}}, env)
    except Exception as e:
        print("Error in query:", e)
        print("Expression after replacements:", expression)
        return set()

    return result


# ---------- 4. Example Run ----------
if __name__ == "__main__":
    inverted_index = load_inverted_index("results/inverted_index.txt")
    all_docs = {f"D{i}.txt" for i in range(1, 7)}

    query = "(query AND reformulation) OR (language AND model)"
    relevant_docs = evaluate_boolean_query(query, inverted_index, all_docs)

    print("\nClassic Boolean Model Results:")
    print("Query:", query)
    print("Retrieved documents:", sorted(relevant_docs))



Classic Boolean Model Results:
Query: (query AND reformulation) OR (language AND model)
Retrieved documents: ['D1.txt', 'D2.txt', 'D3.txt', 'D4.txt', 'D5.txt', 'D6.txt']
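
The set algebra behind this query can also be checked by hand without the index file. The sketch below uses a small hypothetical index (`toy_index`, `toy_docs` and the helper `postings` are illustrative names, and the postings are made up, not taken from `results/inverted_index.txt`):

```python
# Hypothetical postings for illustration only; keys are Porter stems.
toy_index = {
    "queri": {"D1.txt", "D2.txt"},
    "reformul": {"D1.txt"},
    "languag": {"D2.txt", "D3.txt"},
    "model": {"D3.txt", "D4.txt"},
}
toy_docs = {"D1.txt", "D2.txt", "D3.txt", "D4.txt"}


def postings(term):
    """Return the set of documents containing the (already stemmed) term."""
    return toy_index.get(term, set())


# (query AND reformulation) OR (language AND model)
# AND is set intersection, OR is set union.
result = (postings("queri") & postings("reformul")) | (postings("languag") & postings("model"))
print(sorted(result))  # ['D1.txt', 'D3.txt']

# NOT is the complement with respect to the whole collection.
not_model = toy_docs - postings("model")
print(sorted(not_model))  # ['D1.txt', 'D2.txt']
```

A document is either in the result set or not; there is no ranking among the retrieved documents.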


## Fuzzy Boolean Model


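Each term gets a degree of membership between 0 and 1 in each document, derived from its TF or TF-IDF weight. The weights must be normalized to that range, otherwise `1 − value` can become negative.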

Logical operators are softened using fuzzy logic:

AND → min()

OR → max()

NOT → 1 − value

You then compute a degree of relevance for each document.
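
As a small worked example of these operators, the sketch below scores a single document for the example query. The membership degrees in `weights` are hypothetical values chosen for illustration, not weights read from the index.

```python
# Hypothetical membership degrees of one document for the four query terms.
weights = {"queri": 0.36, "reformul": 0.42, "languag": 0.10, "model": 0.08}

# (query AND reformulation) OR (language AND model)
# AND -> min, OR -> max
left = min(weights["queri"], weights["reformul"])    # 0.36
right = min(weights["languag"], weights["model"])    # 0.08
score = max(left, right)
print(score)  # 0.36

# NOT -> 1 - value, e.g. the degree to which the document is NOT about "model"
negated = 1.0 - weights["model"]
```

Unlike the classic model, the result is a score per document, so documents can be ranked.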

In [19]:
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# ---------- 1. Preprocessing ----------
def preprocess(text):
    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))
    tokens = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    tokens = [stemmer.stem(t) for t in tokens if t not in stops]
    return tokens


# ---------- 2. Load Fuzzy Index ----------
# Expected format: <term> <doc_id> <tf> <tfidf>
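def load_fuzzy_index(filepath):
    fuzzy = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            # Skip blank or malformed lines that lack the four expected fields
            if len(parts) < 4:
                continue
            # The fourth column (TF-IDF) is used as the membership degree
            term, doc_id, weight = parts[0], parts[1], float(parts[3])
            fuzzy.setdefault(term, {})[doc_id] = weight
    return fuzzy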


# ---------- 3. Fuzzy Operators ----------
def fuzzy_and(a, b):
    return min(a, b) 

def fuzzy_or(a, b):
    return max(a, b)  

def fuzzy_not(a):
    return 1 - a


# ---------- 4. Evaluate Fuzzy Boolean Query ----------
class Membership:
    """Membership degree that maps &, | and ~ to the fuzzy AND, OR and NOT."""

    def __init__(self, value):
        self.value = float(value)

    def __and__(self, other):
        return Membership(fuzzy_and(self.value, other.value))

    def __or__(self, other):
        return Membership(fuzzy_or(self.value, other.value))

    def __invert__(self):
        return Membership(fuzzy_not(self.value))


def evaluate_fuzzy_query(query, fuzzy_index, all_docs):
    stemmer = PorterStemmer()
    stops = set(stopwords.words('english'))
    # Python's precedence (~ before & before |) matches NOT > AND > OR
    operators = {"AND": "&", "OR": "|", "NOT": "~"}

    # Initialize document scores
    doc_scores = {doc: 0.0 for doc in all_docs}

    # Evaluate the query once per document: each document's membership
    # degree depends on its own term weights.
    for doc in all_docs:
        env = {}

        def substitute(match):
            word = match.group(0)
            if word.upper() in operators:
                return operators[word.upper()]
            # Stop words are not indexed, so their membership degree is 0
            if word.lower() in stops:
                w = 0.0
            else:
                w = fuzzy_index.get(stemmer.stem(word.lower()), {}).get(doc, 0.0)
            name = f"term_{len(env)}"
            env[name] = Membership(w)
            return name

        # (query AND reformulation) OR (language AND model)
        # → (term_0 & term_1) | (term_2 & term_3)
        # e.g. with weights 0.36, 0.42, 0.10, 0.08: max(min(0.36, 0.42), min(0.10, 0.08))
        expression = re.sub(r'\b[a-zA-Z]+\b', substitute, query)

        try:
            # Builtins are disabled; only the term variables in env are visible
            val = eval(expression, {"__builtins__": {}}, env)
            doc_scores[doc] = val.value
        except Exception as e:
            print(f"Error evaluating {doc}: {e}")

    # Sort by decreasing score
    ranked = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked


# ---------- 5. Example Run ----------
if __name__ == "__main__":

    # Example fuzzy index structure:
    #     fuzzy_index = {
    #     "queri": {"D1.txt": 0.36, "D2.txt": 0.20},
    #     "reformul": {"D1.txt": 0.42},
    #     ...
    #      }

    fuzzy_index = load_fuzzy_index("results/inverted_index_weighted.txt")  # file with TF-IDF weights
    all_docs = {f"D{i}.txt" for i in range(1, 7)}

    query = "(query AND reformulation) OR (language AND model)"
    ranked_docs = evaluate_fuzzy_query(query, fuzzy_index, all_docs)

    print("\nFuzzy Boolean Model Results:")
    for doc, score in ranked_docs:
        print(f"{doc}: {score:.3f}")


## Extended Boolean Model

- **Combines** the Boolean and Vector models.  
- **Allows partial matching** using *p-norms*:

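#### For AND:
$$
S_{AND}(d, q) = 1 - \left( \frac{1}{m} \sum_{i=1}^{m} (1 - w_{di})^p \right)^{1/p}
$$

#### For OR:
$$
S_{OR}(d, q) = \left( \frac{1}{m} \sum_{i=1}^{m} w_{di}^p \right)^{1/p}
$$

Here $w_{di} \in [0, 1]$ is the weight of query term $i$ in document $d$, $m$ is the number of terms in the clause, and $p \ge 1$. With $p = 1$ both scores reduce to the average weight (vector-like behaviour); as $p \to \infty$ they approach the fuzzy $\min$ and $\max$.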


➡️ You’ll **rank documents** by their score.
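
A minimal sketch of the p-norm scores, assuming the standard definitions: the OR score is the normalized p-norm of the term weights, and the AND score is one minus the normalized p-norm of their complements. The function names `pnorm_or` and `pnorm_and`, the default `p=2.0`, and the weights in `doc` are illustrative assumptions. Weights are assumed to lie in [0, 1].

```python
def pnorm_or(weights, p=2.0):
    """OR score: ((w_1^p + ... + w_m^p) / m) ** (1/p)."""
    m = len(weights)
    if m == 0:
        return 0.0
    return (sum(w ** p for w in weights) / m) ** (1.0 / p)


def pnorm_and(weights, p=2.0):
    """AND score: 1 - (((1-w_1)^p + ... + (1-w_m)^p) / m) ** (1/p)."""
    m = len(weights)
    if m == 0:
        return 0.0
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / m) ** (1.0 / p)


# Hypothetical weights of one document, all in [0, 1].
doc = {"queri": 0.36, "reformul": 0.42, "languag": 0.10, "model": 0.08}

# (query AND reformulation) OR (language AND model):
# score each AND clause first, then combine the two clause scores with OR.
left = pnorm_and([doc["queri"], doc["reformul"]])
right = pnorm_and([doc["languag"], doc["model"]])
score = pnorm_or([left, right])
print(f"{score:.3f}")
```

Small values of `p` reward partial matches, while large values push the scores toward the strict `min`/`max` behaviour of the fuzzy model.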


# Task 2

## Vector Space Model (VSM) for Information Retrieval

In this notebook, we implement the Vector Space Model using TF–IDF weighted document vectors and compute similarity between queries and documents using three measures:

- **Inner Product Similarity**
- **Cosine Similarity**
- **Jaccard Similarity**
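
For reference, the three measures can be written as follows, with $q_i$ the weight of term $i$ in the query and $w_{di}$ its TF–IDF weight in document $d$. The Jaccard measure is given in its weighted (extended) form, which is an assumption here; the set-based form counts shared terms over the union of terms instead.

$$
S_{IP}(d, q) = \sum_{i} q_i \, w_{di}
$$

$$
S_{COS}(d, q) = \frac{\sum_{i} q_i \, w_{di}}{\sqrt{\sum_{i} q_i^2} \, \sqrt{\sum_{i} w_{di}^2}}
$$

$$
S_{JAC}(d, q) = \frac{\sum_{i} q_i \, w_{di}}{\sum_{i} q_i^2 + \sum_{i} w_{di}^2 - \sum_{i} q_i \, w_{di}}
$$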

We test the model on the following queries:

1. `q1`: large language models for information retrieval and ranking  
2. `q2`: LLM for information retrieval and Ranking  
3. `q3`: query Reformulation in information retrieval  
4. `q4`: ranking Documents  
5. `q5`: Optimizing recommendation systems with LLMs by leveraging item metadata

The procedure is as follows:

1. **Load the Document–Term and Inverted Index files**  
   - Use the files generated in Lab 1 to access the TF–IDF weights of terms in each document.

2. **Preprocess each query**  
   - Tokenize the query text.  
   - Remove stop words.  
   - Apply stemming (e.g., Porter Stemmer).

3. **For each similarity measure (Inner Product, Cosine, Jaccard):**  
   - Compute the similarity score between the query and every document.  
   - Use a **binary weighting scheme** for queries:  
     - Weight = 1 if the term appears in the query.  
     - Weight = 0 if the term does not appear.  
   - Rank the documents in **descending order of similarity**.  
   - Display the ranked documents for each query.
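
The scoring and ranking steps could be sketched as below. This assumes the query has already been preprocessed into stems, that each document is a dictionary mapping stems to TF–IDF weights, and that Jaccard is taken in its weighted (extended) form. The function names and the vectors in `doc_vectors` are hypothetical; the real weights come from the Lab 1 files.

```python
import math


def inner_product(query_terms, doc_vec):
    """Sum of the document weights of the query terms (query weights are 1)."""
    return sum(doc_vec.get(t, 0.0) for t in set(query_terms))


def cosine(query_terms, doc_vec):
    q = set(query_terms)
    dot = inner_product(q, doc_vec)
    q_norm = math.sqrt(len(q))  # each query term has weight 1
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0


def jaccard(query_terms, doc_vec):
    q = set(query_terms)
    dot = inner_product(q, doc_vec)
    denom = len(q) + sum(w * w for w in doc_vec.values()) - dot
    return dot / denom if denom else 0.0


def rank(query_terms, doc_vectors, measure):
    """Return (doc, score) pairs sorted by decreasing score."""
    scores = {doc: measure(query_terms, vec) for doc, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical TF-IDF vectors for illustration only.
doc_vectors = {
    "D1.txt": {"rank": 0.5, "document": 0.5},
    "D2.txt": {"rank": 1.0, "model": 1.0},
    "D3.txt": {"model": 0.3},
}
query_terms = ["rank", "document"]  # q4 "ranking Documents" after stemming

for measure in (inner_product, cosine, jaccard):
    print(measure.__name__)
    for doc, score in rank(query_terms, doc_vectors, measure):
        print(f"  {doc}: {score:.3f}")
```

With binary query weights the inner product favours documents with large raw weights, whereas cosine and Jaccard normalize by vector length, so the three rankings can differ.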
