
# Assignment 1: Boolean Model, TF-IDF, and Data Retrieval vs. Information Retrieval Conceptual Questions

**Student names**: _Your_names_here_ <br>
**Group number**: _Your_group_here_ <br>
**Date**: _Submission Date_

## Important notes
Please carefully read the following notes and consider them for the assignment delivery. Submissions that do not fulfill these requirements will not be assessed and should be submitted again.
1. You may work in groups of maximum 2 students.
2. The assignment must be delivered in ipynb format.
3. The assignment must be typed. Handwritten assignments are not accepted.

**Due date**: 14.09.2025 23:59

In this assignment, you will:
- Implement a Boolean retrieval model
- Compute TF-IDF vectors for documents
- Run retrieval on queries
- Answer conceptual questions 

---
## Dataset

You will use the **Cranfield** dataset, provided in this file:

- `cran.all.1400`: The document collection (1400 documents)

**The code to parse the file is ready — just update the cran file path to match your own file location. Use the docs variable in your code for the parsed file**

### Load and parse documents (provided)

Run the cell to parse the Cranfield documents. Update the path so it points to your `cran.all.1400` file.


In [1]:

# Read 'cran.all.1400' and parse the documents into a suitable data structure

CRAN_PATH = r"cran.all.1400"  # <-- change this!

def parse_cranfield(path):
    docs = {}
    current_id = None
    current_field = None
    buffers = {"T": [], "A": [], "B": [], "W": []}
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(".I "):
                if current_id is not None:
                    docs[current_id] = {
                        "id": current_id,
                        "title": " ".join(buffers["T"]).strip(),
                        "abstract": " ".join(buffers["W"]).strip()
                    }
                current_id = int(line.split()[1])
                buffers = {k: [] for k in buffers}
                current_field = None
            elif line.startswith("."):
                tag = line[1:].strip()
                current_field = tag if tag in buffers else None
            else:
                if current_field is not None:
                    buffers[current_field].append(line)
    if current_id is not None:
        docs[current_id] = {
            "id": current_id,
            "title": " ".join(buffers["T"]).strip(),
            "abstract": " ".join(buffers["W"]).strip()
        }
    print(f"Parsed {len(docs)} documents.")
    return docs

docs = parse_cranfield(CRAN_PATH)



Parsed 1400 documents.


## Part 1.1 – Boolean Retrieval Model

### Tokenize documents

Implement tokenization using the given list of stopwords. Create a list of normalized terms per document (e.g., lowercase, remove punctuation/digits; drop stopwords). Store the token lists to use in later steps.

In [2]:
# TODO: Implement tokenization using the given list of stopwords, create list of terms per document

import re

STOPWORDS = set("""a about above after again against all am an and any are aren't as at be because been
before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down
during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers
herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most
mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she
she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's
these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're
we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't
you you'd you'll you're you've your yours yourself yourselves""".split())

def tokenize_document(text):
    """
    Tokenize a document by:
    1. Converting to lowercase
    2. Removing punctuation and digits
    3. Splitting into tokens
    4. Removing stopwords
    5. Returning list of normalized terms
    """
    if not text:
        return []
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation and digits, keep only letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    
    # Split into tokens and remove empty strings
    tokens = [token.strip() for token in text.split() if token.strip()]
    
    # Remove stopwords and return filtered tokens
    filtered_tokens = [token for token in tokens if token not in STOPWORDS]
    
    return filtered_tokens

# Tokenize all documents
tokenized_docs = {}
for doc_id, doc in docs.items():
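    # Combine title and abstract for tokenization.
    # Note: the example output below shows the abstract of document 1 already
    # starting with its title, so title terms can end up counted twice.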
    combined_text = f"{doc['title']} {doc['abstract']}"
    tokenized_docs[doc_id] = tokenize_document(combined_text)

# Display some examples
first_id = next(iter(docs))  # ID of the first parsed document
print(f"Tokenized {len(tokenized_docs)} documents")
print(f"\nExample tokenization for document {first_id}:")
print(f"Original title: {docs[first_id]['title']}")
print(f"Original abstract: {docs[first_id]['abstract'][:100]}...")
print(f"Tokenized terms: {tokenized_docs[first_id][:20]}...")
print(f"Total terms in this document: {len(tokenized_docs[first_id])}")

Tokenized 1400 documents

Example tokenization for document 1:
Original title: experimental investigation of the aerodynamics of a wing in a slipstream .
Original abstract: experimental investigation of the aerodynamics of a wing in a slipstream .   an experimental study o...
Tokenized terms: ['experimental', 'investigation', 'aerodynamics', 'wing', 'slipstream', 'experimental', 'investigation', 'aerodynamics', 'wing', 'slipstream', 'experimental', 'study', 'wing', 'propeller', 'slipstream', 'made', 'order', 'determine', 'spanwise', 'distribution']...
Total terms in this document: 84
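
As a quick sanity check of the normalization steps above, the following minimal sketch applies the same pipeline (lowercase, replace non-letters with spaces, split, drop stopwords) to a toy sentence. The tiny `TOY_STOPWORDS` set and the `toy_tokenize` name are assumptions made for illustration only; the notebook itself uses the full `STOPWORDS` list.

```python
import re

# Hypothetical, deliberately tiny stopword set (the notebook uses a much longer list)
TOY_STOPWORDS = {"a", "in", "of", "the"}

def toy_tokenize(text):
    """Lowercase, replace non-letters with spaces, split, and drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in TOY_STOPWORDS]

tokens = toy_tokenize("Investigation of the aerodynamics of a wing in a slipstream (1956).")
print(tokens)

# Apostrophes are replaced before the stopword check, so a contraction is split in two
print(toy_tokenize("isn't"))
```

One consequence of this ordering is that contraction entries in a stopword list (such as `isn't`) can never match, because the apostrophe has already been turned into a space. This is probably why the vocabulary statistics further down report one-letter terms.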


### Build vocabulary

Create a set (or list) of unique terms from all tokenized documents. Report the number of unique terms.


In [6]:
# TODO: Create a set or list of unique terms

# Create vocabulary from all tokenized documents
vocabulary = set()

# Collect all unique terms from all documents
for doc_id, terms in tokenized_docs.items():
    vocabulary.update(terms)

# Convert to sorted list for easier handling
vocabulary_list = sorted(list(vocabulary))

# Report: 
# - Number of unique terms
print(f"Number of unique terms in vocabulary: {len(vocabulary_list)}")
print(f"Total number of documents processed: {len(tokenized_docs)}")

# Show some statistics
total_terms = sum(len(terms) for terms in tokenized_docs.values())
avg_terms_per_doc = total_terms / len(tokenized_docs)

print(f"Total terms across all documents: {total_terms}")
print(f"Average terms per document: {avg_terms_per_doc:.2f}")

# Show first 20 terms as examples
print(f"\nFirst 20 terms in vocabulary (alphabetically):")
print(vocabulary_list[:20])

# Show some interesting statistics
print(f"\nVocabulary statistics:")
print(f"- Shortest term length: {min(len(term) for term in vocabulary_list)}")
print(f"- Longest term length: {max(len(term) for term in vocabulary_list)}")
print(f"- Average term length: {sum(len(term) for term in vocabulary_list) / len(vocabulary_list):.2f}")

Number of unique terms in vocabulary: 6934
Total number of documents processed: 1400
Total terms across all documents: 141060
Average terms per document: 100.76

First 20 terms in vocabulary (alphabetically):
['ab', 'abbreviated', 'ability', 'ablated', 'ablating', 'ablation', 'ablative', 'able', 'abrupt', 'abruptly', 'absence', 'absent', 'absolute', 'absorbed', 'absorbing', 'absorption', 'abstract', 'abundantly', 'academic', 'accelerated']

Vocabulary statistics:
- Shortest term length: 1
- Longest term length: 21
- Average term length: 7.91


### Build inverted index

For each term, store the list (or set) of document IDs where the term appears.


In [7]:

# TODO: For each term, store list of document IDs where the term appears

# Build inverted index
inverted_index = {}

# For each document, add its ID to the posting list of each term it contains
for doc_id, terms in tokenized_docs.items():
    for term in terms:
        if term not in inverted_index:
            inverted_index[term] = []
        inverted_index[term].append(doc_id)

# Sort document IDs for each term (for consistency and easier processing)
for term in inverted_index:
    inverted_index[term] = sorted(list(set(inverted_index[term])))

# Report statistics
print(f"Inverted index built successfully!")
print(f"Number of terms in inverted index: {len(inverted_index)}")
print(f"Total vocabulary size: {len(vocabulary_list)}")

# Show some examples
print(f"\nExample entries from inverted index:")
example_terms = list(inverted_index.keys())[:5]
for term in example_terms:
    doc_list = inverted_index[term]
    print(f"'{term}': appears in {len(doc_list)} documents -> {doc_list[:10]}{'...' if len(doc_list) > 10 else ''}")

# Find terms with highest and lowest document frequencies
term_frequencies = [(term, len(doc_list)) for term, doc_list in inverted_index.items()]
term_frequencies.sort(key=lambda x: x[1], reverse=True)

print(f"\nTerms with highest document frequency:")
for term, freq in term_frequencies[:5]:
    print(f"'{term}': appears in {freq} documents")

print(f"\nTerms with lowest document frequency (appearing in only 1 document):")
single_doc_terms = [term for term, freq in term_frequencies if freq == 1]
print(f"Number of terms appearing in only 1 document: {len(single_doc_terms)}")
if single_doc_terms:
    print(f"Examples: {single_doc_terms[:10]}")

# Calculate average document frequency
avg_doc_freq = sum(len(doc_list) for doc_list in inverted_index.values()) / len(inverted_index)
print(f"\nAverage document frequency: {avg_doc_freq:.2f}")


Inverted index built successfully!
Number of terms in inverted index: 6934
Total vocabulary size: 6934

Example entries from inverted index:
'experimental': appears in 318 documents -> [1, 11, 12, 17, 19, 25, 29, 30, 35, 41]...
'investigation': appears in 216 documents -> [1, 8, 9, 19, 29, 30, 44, 45, 50, 56]...
'aerodynamics': appears in 24 documents -> [1, 11, 33, 216, 225, 237, 244, 284, 289, 296]...
'wing': appears in 181 documents -> [1, 13, 14, 30, 31, 42, 52, 60, 69, 76]...
'slipstream': appears in 14 documents -> [1, 409, 453, 484, 1064, 1089, 1090, 1091, 1092, 1094]...

Terms with highest document frequency:
'flow': appears in 702 documents
'results': appears in 596 documents
'pressure': appears in 520 documents
'number': appears in 484 documents
'boundary': appears in 459 documents

Terms with lowest document frequency (appearing in only 1 document):
Number of terms appearing in only 1 document: 2669
Examples: ['libby', 'wassermann', 'contaminates', 'persist', 'ensuing', 'pho
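
The index construction above can also be sketched on a toy collection. This is a minimal illustration, assuming a hypothetical `toy_docs` mapping of document IDs to token lists; it collects postings in sets so repeated terms within a document are ignored from the start.

```python
from collections import defaultdict

# Hypothetical toy collection: doc_id -> list of already-tokenized terms
toy_docs = {
    1: ["wing", "slipstream", "wing"],
    2: ["boundary", "layer", "flow"],
    3: ["wing", "flow"],
}

# Map each term to the set of documents containing it; a set ignores repeats
toy_index = defaultdict(set)
for doc_id, terms in toy_docs.items():
    for term in terms:
        toy_index[term].add(doc_id)

# Freeze into sorted posting lists, mirroring the structure of the notebook's index
toy_postings = {term: sorted(ids) for term, ids in toy_index.items()}
print(toy_postings["wing"])

# The document frequency of a term is the length of its posting list
print(len(toy_postings["wing"]))
```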

### Retrieve documents for a Boolean query (AND/OR)

Create a function to retrieve documents for a Boolean query (AND/OR) with query terms.  


In [11]:
# TODO: Create a function for retrieving documents for a Boolean query (AND/OR) with query terms

def boolean_retrieve(query: str):
    """
    Retrieve documents for a Boolean query with AND/OR operators.

    Operators are applied strictly left to right (no precedence, no parentheses).
    An operand made of several words is treated as an implicit AND of its terms.

    Args:
        query: String containing terms and Boolean operators (AND/OR)
               Example: "gas AND pressure" or "structural AND aeroelastic OR flight"

    Returns:
        Sorted list of document IDs that match the query
    """
    # Split on AND/OR (case-insensitive); the capture group keeps the operators
    parts = re.split(r'\s+(AND|OR)\s+', query.strip(), flags=re.IGNORECASE)

    result_set = None  # running result; None until the first usable operand
    operator = None    # operator to apply to the next operand

    for part in parts:
        if part.upper() in ('AND', 'OR'):
            operator = part.upper()
            continue

        # Normalize the operand with the same tokenizer used for the documents
        terms = tokenize_document(part)
        if not terms:
            continue  # operand contained only stopwords or punctuation

        # Implicit AND between the words of a multi-word operand
        operand_set = set(inverted_index.get(terms[0], []))
        for term in terms[1:]:
            operand_set &= set(inverted_index.get(term, []))

        if result_set is None:
            result_set = operand_set
        elif operator == 'OR':
            result_set |= operand_set
        else:
            result_set &= operand_set

    # Convert back to a sorted list (empty if no usable operand was found)
    return sorted(result_set) if result_set is not None else []



In [9]:
# Do not change this code
boolean_queries = [
  "gas AND pressure",
  "structural AND aeroelastic AND flight AND high AND speed OR aircraft",
  "heat AND conduction AND composite AND slabs",
  "boundary AND layer AND control",
  "compressible AND flow AND nozzle",
  "combustion AND chamber AND injection",
  "laminar AND turbulent AND transition",
  "fatigue AND crack AND growth",
  "wing AND tip AND vortices",
  "propulsion AND efficiency"
]

In [12]:
# Run Boolean queries in batch, using the function you created
def run_batch_boolean(queries):
    results = {}
    for i, q in enumerate(queries, 1):
        res = boolean_retrieve(q)
        results[f"Q{i}"] = res
    return results

boolean_results = run_batch_boolean(boolean_queries)
for qid, res in boolean_results.items():
    print(qid, "=>", res[:5])


Q1 => [27, 49, 85, 101, 110]
Q2 => [12, 14, 29, 47, 51]
Q3 => [5, 399]
Q4 => [1, 61, 244, 265, 342]
Q5 => [118, 131]
Q6 => []
Q7 => [7, 9, 80, 89, 96]
Q8 => []
Q9 => [675]
Q10 => [968]
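
Q2 mixes `AND` and `OR`, and the retrieval function applies operators strictly left to right rather than giving `AND` higher precedence. The following minimal sketch, which uses hypothetical posting sets and a hypothetical `eval_left_to_right` helper, shows how the order of the operands can change the result under that rule.

```python
# Hypothetical posting sets for three terms
postings = {
    "speed": {1, 2},
    "flight": {2, 3},
    "aircraft": {4},
}

def eval_left_to_right(tokens, postings):
    """Evaluate [term, op, term, op, ...] strictly left to right."""
    result = set(postings.get(tokens[0], set()))
    for op, term in zip(tokens[1::2], tokens[2::2]):
        docs_for_term = postings.get(term, set())
        result = result & docs_for_term if op == "AND" else result | docs_for_term
    return sorted(result)

# Evaluated as (speed AND flight) OR aircraft
first = eval_left_to_right(["speed", "AND", "flight", "OR", "aircraft"], postings)
# Evaluated as (aircraft OR speed) AND flight: same terms, different grouping
second = eval_left_to_right(["aircraft", "OR", "speed", "AND", "flight"], postings)
print(first, second)
```

With conventional precedence (`AND` binding tighter than `OR`), both queries would return the same documents, so it is worth stating the evaluation order when reporting results for mixed queries.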


## Part 1.2 – TF-IDF Indexing


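$tf_{t,d} = \text{raw frequency of term } t \text{ in document } d$

$idf_t = \log\left(\frac{N}{df_t}\right)$, where $N$ is the number of documents in the collection and $df_t$ is the number of documents that contain term $t$.

The TF-IDF weight of term $t$ in document $d$ is the product $w_{t,d} = tf_{t,d} \cdot idf_t$.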
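
As a small worked example of the IDF formula, the sketch below plugs in document frequencies reported in the inverted-index output of this notebook. It assumes the natural logarithm (`math.log`); the `tf` value at the end is a hypothetical count chosen only to show the final multiplication.

```python
import math

N = 1400  # collection size reported by the parser

# Document frequencies taken from the inverted-index output in this notebook
doc_freq = {"flow": 702, "results": 596, "pressure": 520, "slipstream": 14}

# idf_t = log(N / df_t): frequent terms get small weights, rare terms large ones
idf = {term: math.log(N / df) for term, df in doc_freq.items()}
for term, value in idf.items():
    print(f"{term}: idf = {value:.4f}")

# Hypothetical raw frequency: a term occurring 3 times gets weight tf * idf
tf = 3
weight = tf * idf["slipstream"]
print(f"weight of 'slipstream' with tf={tf}: {weight:.4f}")
```

A term that appears in about half of the collection, such as `flow`, ends up with an IDF close to $\log 2$, while a term found in a single document reaches the maximum $\log N$.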

### Build document–term matrix (TF and IDF weights)

Compute tf and idf using the formulas above and store the weights in a document–term matrix (rows = documents, columns = terms).



In [13]:
# TODO: Calculate the weights for the documents and the terms using tf and idf weighting. Put these values into a document–term matrix (rows = documents, columns = terms).

import numpy as np
import math

# Get all document IDs and create mappings
doc_ids = sorted(list(docs.keys()))
term_ids = sorted(list(vocabulary_list))

# Create mappings for efficient lookup
doc_to_idx = {doc_id: idx for idx, doc_id in enumerate(doc_ids)}
term_to_idx = {term: idx for idx, term in enumerate(term_ids)}

print(f"Creating document-term matrix:")
print(f"- Number of documents: {len(doc_ids)}")
print(f"- Number of terms: {len(term_ids)}")
print(f"- Matrix size: {len(doc_ids)} x {len(term_ids)}")

# Initialize the document-term matrix
# Rows = documents, Columns = terms
doc_term_matrix = np.zeros((len(doc_ids), len(term_ids)))

# Calculate TF (Term Frequency) for each document-term pair
print("\nCalculating TF (Term Frequency)...")
for doc_id, terms in tokenized_docs.items():
    doc_idx = doc_to_idx[doc_id]
    
    # Count term frequencies in this document
    term_counts = {}
    for term in terms:
        term_counts[term] = term_counts.get(term, 0) + 1
    
    # Store TF values in matrix
    for term, count in term_counts.items():
        if term in term_to_idx:
            term_idx = term_to_idx[term]
            doc_term_matrix[doc_idx, term_idx] = count  # Raw frequency

# Calculate IDF (Inverse Document Frequency) for each term
print("Calculating IDF (Inverse Document Frequency)...")
N = len(doc_ids)  # Total number of documents
idf_values = {}

for term in term_ids:
    # Count how many documents contain this term
    df_t = len(inverted_index.get(term, []))  # Document frequency
    
    if df_t > 0:
        # IDF formula: log(N / df_t)
        idf_values[term] = math.log(N / df_t)
    else:
        idf_values[term] = 0  # Term doesn't appear in any document

# Calculate TF-IDF weights
print("Calculating TF-IDF weights...")
# Multiply every term column by its IDF (NumPy broadcasts the IDF vector across rows)
idf_vector = np.array([idf_values[term] for term in term_ids])
tfidf_matrix = doc_term_matrix * idf_vector

# Report statistics
print(f"\nDocument-term matrix created successfully!")
print(f"Matrix shape: {tfidf_matrix.shape}")

# Show some statistics
non_zero_entries = np.count_nonzero(tfidf_matrix)
total_entries = tfidf_matrix.size
sparsity = 1 - (non_zero_entries / total_entries)

print(f"Matrix statistics:")
print(f"- Non-zero entries: {non_zero_entries:,}")
print(f"- Total entries: {total_entries:,}")
print(f"- Sparsity: {sparsity:.4f} ({sparsity*100:.2f}% zeros)")

# Show TF-IDF statistics
print(f"\nTF-IDF statistics:")
print(f"- Min TF-IDF value: {np.min(tfidf_matrix):.4f}")
print(f"- Max TF-IDF value: {np.max(tfidf_matrix):.4f}")
print(f"- Mean TF-IDF value: {np.mean(tfidf_matrix):.4f}")

# Show IDF statistics
idf_list = list(idf_values.values())
print(f"\nIDF statistics:")
print(f"- Min IDF value: {min(idf_list):.4f}")
print(f"- Max IDF value: {max(idf_list):.4f}")
print(f"- Mean IDF value: {sum(idf_list)/len(idf_list):.4f}")

# Show examples of high and low IDF terms
sorted_idf = sorted(idf_values.items(), key=lambda x: x[1], reverse=True)
print(f"\nTerms with highest IDF (most discriminative):")
for term, idf in sorted_idf[:5]:
    print(f"'{term}': IDF = {idf:.4f}")

print(f"\nTerms with lowest IDF (most common):")
for term, idf in sorted_idf[-5:]:
    print(f"'{term}': IDF = {idf:.4f}")

print(f"\nDocument-term matrix ready for TF-IDF retrieval!")


Creating document-term matrix:
- Number of documents: 1400
- Number of terms: 6934
- Matrix size: 1400 x 6934

Calculating TF (Term Frequency)...
Calculating IDF (Inverse Document Frequency)...
Calculating TF-IDF weights...

Document-term matrix created successfully!
Matrix shape: (1400, 6934)
Matrix statistics:
- Non-zero entries: 90,604
- Total entries: 9,707,600
- Sparsity: 0.9907 (99.07% zeros)

TF-IDF statistics:
- Min TF-IDF value: 0.0000
- Max TF-IDF value: 104.4755
- Mean TF-IDF value: 0.0456

IDF statistics:
- Min IDF value: 0.6903
- Max IDF value: 7.2442
- Mean IDF value: 5.9705

Terms with highest IDF (most discriminative):
'ab': IDF = 7.2442
'abbreviated': IDF = 7.2442
'ablated': IDF = 7.2442
'ablative': IDF = 7.2442
'absent': IDF = 7.2442

Terms with lowest IDF (most common):
'boundary': IDF = 1.1152
'number': IDF = 1.0621
'pressure': IDF = 0.9904
'results': IDF = 0.8540
'flow': IDF = 0.6903

Document-term matrix ready for TF-IDF retrieval!
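
Because the output above reports that about 99% of the matrix entries are zero, a sparse layout is a natural alternative to the dense array. The following is a minimal sketch under stated assumptions: `toy_docs` and `sparse_tfidf` are hypothetical names, and each document is stored as a dictionary holding only the terms it contains.

```python
import math
from collections import Counter

# Hypothetical toy collection of tokenized documents
toy_docs = {
    1: ["wing", "slipstream", "wing"],
    2: ["boundary", "layer", "flow"],
    3: ["wing", "flow"],
}
N = len(toy_docs)

# Document frequency: number of documents containing each term
df = Counter(term for terms in toy_docs.values() for term in set(terms))

# Sparse TF-IDF: one dict per document, holding only the terms it contains
sparse_tfidf = {
    doc_id: {t: tf * math.log(N / df[t]) for t, tf in Counter(terms).items()}
    for doc_id, terms in toy_docs.items()
}

stored = sum(len(weights) for weights in sparse_tfidf.values())
dense_cells = N * len(df)
print(f"stored {stored} weights instead of {dense_cells} dense cells")
```

On a collection the size of Cranfield the dense matrix is still manageable, so this is a design option rather than a requirement.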


### Build TF–IDF document vectors

From the matrix, build a TF–IDF vector for each document (consider normalization if needed for cosine similarity).


In [14]:

# TODO: Build TF–IDF document vectors from the document–term matrix

# Build TF-IDF document vectors with normalization for cosine similarity
print("Building TF-IDF document vectors...")

# Create normalized TF-IDF vectors for each document
# Normalization is important for cosine similarity calculations
tfidf_vectors = {}

for doc_idx in range(len(doc_ids)):
    doc_id = doc_ids[doc_idx]
    
    # Get the TF-IDF vector for this document (row from the matrix)
    doc_vector = tfidf_matrix[doc_idx, :]
    
    # Calculate the L2 norm (Euclidean norm) for normalization
    l2_norm = np.linalg.norm(doc_vector)
    
    # Normalize the vector (unit vector for cosine similarity)
    if l2_norm > 0:
        normalized_vector = doc_vector / l2_norm
    else:
        # Document has no indexed terms: keep the zero vector to avoid dividing by zero
        # (the check below reports a minimum norm of 0, so this case does occur)
        normalized_vector = doc_vector
    
    tfidf_vectors[doc_id] = normalized_vector

print(f"Created {len(tfidf_vectors)} TF-IDF document vectors")

# Verify normalization (non-empty documents should have unit length; empty ones stay at 0)
print("\nVerifying vector normalization:")
norms = []
for doc_id, vector in tfidf_vectors.items():
    norm = np.linalg.norm(vector)
    norms.append(norm)

print(f"- Min vector norm: {min(norms):.6f}")
print(f"- Max vector norm: {max(norms):.6f}")
print(f"- Mean vector norm: {np.mean(norms):.6f}")

# Show some examples of document vectors
print(f"\nExample document vectors:")
example_docs = list(doc_ids)[:3]
for doc_id in example_docs:
    vector = tfidf_vectors[doc_id]
    non_zero_count = np.count_nonzero(vector)
    max_value = np.max(vector)
    max_term_idx = np.argmax(vector)
    max_term = term_ids[max_term_idx] if max_term_idx < len(term_ids) else "unknown"
    
    print(f"Document {doc_id}:")
    print(f"  - Non-zero elements: {non_zero_count}/{len(vector)}")
    print(f"  - Max TF-IDF value: {max_value:.4f}")
    print(f"  - Term with max value: '{max_term}'")
    print(f"  - Vector norm: {np.linalg.norm(vector):.6f}")

# Create a function to get TF-IDF vector for any document
def get_document_vector(doc_id):
    """
    Get the normalized TF-IDF vector for a document.
    
    Args:
        doc_id: Document ID
        
    Returns:
        Normalized TF-IDF vector (numpy array)
    """
    return tfidf_vectors.get(doc_id, np.zeros(len(term_ids)))

# Create a function to get TF-IDF vector for a query
def get_query_vector(query_terms):
    """
    Get the normalized TF-IDF vector for a query.
    
    Args:
        query_terms: List of query terms
        
    Returns:
        Normalized TF-IDF vector (numpy array)
    """
    # Initialize query vector
    query_vector = np.zeros(len(term_ids))
    
    # Count term frequencies in query
    term_counts = {}
    for term in query_terms:
        term_counts[term] = term_counts.get(term, 0) + 1
    
    # Calculate TF-IDF for query terms
    for term, count in term_counts.items():
        if term in term_to_idx:
            term_idx = term_to_idx[term]
            tf = count  # Raw frequency in query
            idf = idf_values.get(term, 0)  # IDF from collection
            query_vector[term_idx] = tf * idf
    
    # Normalize the query vector
    l2_norm = np.linalg.norm(query_vector)
    if l2_norm > 0:
        query_vector = query_vector / l2_norm
    
    return query_vector

# Test the query vector function
print(f"\nTesting query vector creation:")
test_query = ["aircraft", "flight", "pressure"]
query_vec = get_query_vector(test_query)
print(f"Query: {test_query}")
print(f"Query vector norm: {np.linalg.norm(query_vec):.6f}")
print(f"Non-zero elements: {np.count_nonzero(query_vec)}")

# Show which terms in the query have non-zero TF-IDF values
print(f"Query term TF-IDF values:")
for term in test_query:
    if term in term_to_idx:
        term_idx = term_to_idx[term]
        tfidf_val = query_vec[term_idx]
        idf_val = idf_values.get(term, 0)
        print(f"  '{term}': TF-IDF = {tfidf_val:.4f}, IDF = {idf_val:.4f}")

print(f"\nTF-IDF document vectors ready for cosine similarity calculations!")


Building TF-IDF document vectors...
Created 1400 TF-IDF document vectors

Verifying vector normalization:
- Min vector norm: 0.000000
- Max vector norm: 1.000000
- Mean vector norm: 0.998571

Example document vectors:
Document 1:
  - Non-zero elements: 60/6934
  - Max TF-IDF value: 0.5861
  - Term with max value: 'slipstream'
  - Vector norm: 1.000000
Document 2:
  - Non-zero elements: 71/6934
  - Max TF-IDF value: 0.3222
  - Term with max value: 'past'
  - Vector norm: 1.000000
Document 3:
  - Non-zero elements: 14/6934
  - Max TF-IDF value: 0.4357
  - Term with max value: 'past'
  - Vector norm: 1.000000

Testing query vector creation:
Query: ['aircraft', 'flight', 'pressure']
Query vector norm: 1.000000
Non-zero elements: 3
Query term TF-IDF values:
  'aircraft': TF-IDF = 0.7604, IDF = 2.9815
  'flight': TF-IDF = 0.5984, IDF = 2.3464
  'pressure': TF-IDF = 0.2526, IDF = 0.9904

TF-IDF document vectors ready for cosine similarity calculations!
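
The reason for normalizing is that the dot product of two unit vectors equals their cosine similarity. The sketch below checks this on two toy vectors using plain Python; `l2_normalize` and `cosine` are hypothetical helper names, not functions from the notebook.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(u, v):
    """Full cosine formula, valid for vectors of any length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

u, v = [3.0, 4.0, 0.0], [1.0, 0.0, 2.0]

# Dot product of the normalized vectors versus the full cosine formula
dot_of_units = sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))
print(abs(dot_of_units - cosine(u, v)) < 1e-12)
```

The zero-vector branch matters here: the minimum norm of 0 and the mean norm slightly below 1 in the output above indicate that a few documents have no indexed terms.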


### Implement cosine similarity

Implement a function to compute cosine similarity scores between a (tokenized) query and all documents.


In [15]:

# TODO: Create a function for calculating the similarity score of all the documents by their relevance to query terms

def cosine_similarity(vec1, vec2):
    """
    Calculate cosine similarity between two vectors.
    
    Args:
        vec1, vec2: Normalized vectors (numpy arrays)
        
    Returns:
        Cosine similarity score (between 0 and 1, since TF-IDF weights are non-negative)
    """
    # Since vectors are normalized, cosine similarity = dot product
    return np.dot(vec1, vec2)

def tfidf_retrieve(query: str):
    """
    Retrieve documents using TF-IDF and cosine similarity.
    
    Args:
        query: Query string
        
    Returns:
        List of document IDs sorted by relevance (highest similarity first)
    """
    # Tokenize the query
    query_terms = tokenize_document(query)
    
    if not query_terms:
        return []
    
    # Get the query vector
    query_vector = get_query_vector(query_terms)
    
    # If query vector is all zeros (no valid terms), return empty results
    if np.all(query_vector == 0):
        return []
    
    # Calculate cosine similarity between query and all documents
    similarities = []
    
    for doc_id in doc_ids:
        doc_vector = get_document_vector(doc_id)
        similarity = cosine_similarity(query_vector, doc_vector)
        similarities.append((doc_id, similarity))
    
    # Sort by similarity score (descending order)
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    # Return only document IDs (sorted by relevance)
    return [doc_id for doc_id, _ in similarities]

# Test the TF-IDF retrieval function
print("Testing TF-IDF retrieval function:")
print("=" * 50)

# Test with a simple query
test_query1 = "aircraft flight"
result1 = tfidf_retrieve(test_query1)
print(f"Query: '{test_query1}'")
print(f"Results: {len(result1)} documents")
print(f"Top 5 documents: {result1[:5]}")

# Show similarity scores for top results
query_vec1 = get_query_vector(tokenize_document(test_query1))
print(f"Top 5 similarity scores:")
for i, doc_id in enumerate(result1[:5]):
    doc_vec = get_document_vector(doc_id)
    sim_score = cosine_similarity(query_vec1, doc_vec)
    print(f"  Doc {doc_id}: {sim_score:.4f}")

# Test with another query
test_query2 = "pressure gas flow"
result2 = tfidf_retrieve(test_query2)
print(f"\nQuery: '{test_query2}'")
print(f"Results: {len(result2)} documents")
print(f"Top 5 documents: {result2[:5]}")

# Show similarity scores for top results
query_vec2 = get_query_vector(tokenize_document(test_query2))
print(f"Top 5 similarity scores:")
for i, doc_id in enumerate(result2[:5]):
    doc_vec = get_document_vector(doc_id)
    sim_score = cosine_similarity(query_vec2, doc_vec)
    print(f"  Doc {doc_id}: {sim_score:.4f}")

# Test with a more complex query
test_query3 = "structural aeroelastic analysis"
result3 = tfidf_retrieve(test_query3)
print(f"\nQuery: '{test_query3}'")
print(f"Results: {len(result3)} documents")
print(f"Top 5 documents: {result3[:5]}")

# Show similarity scores for top results
query_vec3 = get_query_vector(tokenize_document(test_query3))
print(f"Top 5 similarity scores:")
for i, doc_id in enumerate(result3[:5]):
    doc_vec = get_document_vector(doc_id)
    sim_score = cosine_similarity(query_vec3, doc_vec)
    print(f"  Doc {doc_id}: {sim_score:.4f}")

# Analyze the distribution of similarity scores
print(f"\nSimilarity score analysis for query '{test_query1}':")
all_scores = []
for doc_id in doc_ids:
    doc_vec = get_document_vector(doc_id)
    sim_score = cosine_similarity(query_vec1, doc_vec)
    all_scores.append(sim_score)

all_scores = np.array(all_scores)
print(f"- Min similarity: {np.min(all_scores):.4f}")
print(f"- Max similarity: {np.max(all_scores):.4f}")
print(f"- Mean similarity: {np.mean(all_scores):.4f}")
print(f"- Documents with similarity > 0: {np.count_nonzero(all_scores)}")
print(f"- Documents with similarity > 0.1: {np.sum(all_scores > 0.1)}")

print(f"\nTF-IDF retrieval function implemented successfully!")


Testing TF-IDF retrieval function:
Query: 'aircraft flight'
Results: 1400 documents
Top 5 documents: [51, 1169, 253, 810, 1163]
Top 5 similarity scores:
  Doc 51: 0.3891
  Doc 1169: 0.3851
  Doc 253: 0.3224
  Doc 810: 0.2853
  Doc 1163: 0.2660

Query: 'pressure gas flow'
Results: 1400 documents
Top 5 documents: [169, 167, 1286, 665, 166]
Top 5 similarity scores:
  Doc 169: 0.3194
  Doc 167: 0.3177
  Doc 1286: 0.2904
  Doc 665: 0.2878
  Doc 166: 0.2830

Query: 'structural aeroelastic analysis'
Results: 1400 documents
Top 5 documents: [875, 12, 184, 746, 781]
Top 5 similarity scores:
  Doc 875: 0.3613
  Doc 12: 0.3596
  Doc 184: 0.2800
  Doc 746: 0.2583
  Doc 781: 0.2513

Similarity score analysis for query 'aircraft flight':
- Min similarity: 0.0000
- Max similarity: 0.3891
- Mean similarity: 0.0102
- Documents with similarity > 0: 183
- Documents with similarity > 0.1: 47

TF-IDF retrieval function implemented successfully!
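
The output above lists 1400 results per query even though only 183 documents have a non-zero similarity for 'aircraft flight'. One possible refinement, sketched here with a hypothetical `rank_documents` helper and toy unit vectors, is to keep only documents with a positive score and return the top `k`.

```python
def rank_documents(query_vec, doc_vecs, k=5):
    """Return up to k (doc_id, score) pairs with score > 0, best first.

    Assumes all vectors are already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = {
        doc_id: sum(q * d for q, d in zip(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    }
    positive = [(doc_id, s) for doc_id, s in scores.items() if s > 0]
    # Sort by descending score; break ties by ascending doc_id for determinism
    positive.sort(key=lambda pair: (-pair[1], pair[0]))
    return positive[:k]

# Hypothetical unit vectors over a three-term vocabulary
toy_vecs = {
    1: [1.0, 0.0, 0.0],
    2: [0.6, 0.8, 0.0],
    3: [0.0, 0.0, 1.0],
}
ranking = rank_documents([1.0, 0.0, 0.0], toy_vecs)
print(ranking)
```

Document 3 shares no term with the query and is therefore left out of the ranking instead of being appended with a score of zero.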


In [16]:
# Do not change this code
tfidf_queries = [
  "gas pressure",
  "structural aeroelastic flight high speed aircraft",
  "heat conduction composite slabs",
  "boundary layer control",
  "compressible flow nozzle",
  "combustion chamber injection",
  "laminar turbulent transition",
  "fatigue crack growth",
  "wing tip vortices",
  "propulsion efficiency"
]

In [17]:
# Run TF-IDF queries in batch (print top-5 results for each), using the function you created
def run_batch_tfidf(queries):
    results = {}
    for i, q in enumerate(queries, 1):
        res = tfidf_retrieve(q)
        results[f"Q{i}"] = res
    return results

tfidf_results = run_batch_tfidf(tfidf_queries)

for qid, res in tfidf_results.items():
    print(qid, "=>", res[:5])


Q1 => [169, 1286, 167, 185, 1003]
Q2 => [12, 51, 746, 875, 884]
Q3 => [399, 144, 485, 5, 181]
Q4 => [368, 748, 638, 451, 1349]
Q5 => [389, 118, 1187, 172, 173]
Q6 => [974, 628, 397, 308, 635]
Q7 => [418, 1264, 315, 272, 9]
Q8 => [768, 726, 1196, 883, 884]
Q9 => [1284, 433, 675, 1271, 288]
Q10 => [968, 1328, 1380, 1092, 592]
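
To compare the two models, the sketch below intersects the Boolean results with the TF-IDF top-5 lists, using only values copied from the outputs in this notebook. It is restricted to the queries whose Boolean output showed fewer than five IDs, since only those printed lists are known to be complete; the variable names are hypothetical.

```python
# Boolean results copied from the batch output (lists with fewer than 5 IDs are complete)
boolean_shown = {
    "Q3": [5, 399],
    "Q5": [118, 131],
    "Q6": [],
    "Q8": [],
    "Q9": [675],
    "Q10": [968],
}

# TF-IDF top-5 results copied from the batch output
tfidf_top5 = {
    "Q3": [399, 144, 485, 5, 181],
    "Q5": [389, 118, 1187, 172, 173],
    "Q6": [974, 628, 397, 308, 635],
    "Q8": [768, 726, 1196, 883, 884],
    "Q9": [1284, 433, 675, 1271, 288],
    "Q10": [968, 1328, 1380, 1092, 592],
}

# Documents returned by both models for each query
overlap = {
    qid: sorted(set(boolean_shown[qid]) & set(tfidf_top5[qid]))
    for qid in boolean_shown
}
for qid, shared in overlap.items():
    print(f"{qid}: {len(boolean_shown[qid])} Boolean hits, "
          f"{len(shared)} also in the TF-IDF top 5 -> {shared}")
```

For Q6 and Q8 the strict `AND` query matches nothing, while the ranked model still returns candidates that contain some of the query terms.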



## Part 1.3 – Conceptual Questions

Answer the following questions:

**1. What is the difference between data retrieval and information retrieval?** <br>
*Your answer here*

**For the following scenarios, which approach would be suitable: data retrieval or information retrieval? Explain your reasoning.** <br>
1.a A clerk in a pharmacy uses the following query: Medicine_name = Ibuprofen_400mg <br>
*Your answer here*

1.b A clerk in a pharmacy uses the following query: An antibiotic medicine <br>
*Your answer here*

1.c Searching for the schedule of a flight using the following query: Flight_ID = ZEFV2 <br>
*Your answer here*

1.d Searching an e-commerce website for a specific shoe using the following query: Brooks Ghost 15 <br>
*Your answer here*

1.e Searching the same e-commerce website using the following query: Nice running shoes <br>
*Your answer here*
