# Week 3 Assignment: TF-IDF and Cosine Similarity

This notebook implements a simple information retrieval system that:
1. Loads news articles from text files
2. Loads search queries from a queries file
3. Computes TF-IDF weights for documents and queries
4. Calculates cosine similarity between queries and documents
5. Ranks documents by relevance to each query

## Step 1: Import Required Libraries

In [2]:
# Import necessary libraries
import os  # For working with files and directories
from sklearn.feature_extraction.text import TfidfVectorizer  # For computing TF-IDF
from sklearn.metrics.pairwise import cosine_similarity  # For computing cosine similarity
import numpy as np  # For numerical operations

## Step 2: Load Articles from Text Files

In [None]:
# Function to load all article files (Following Week 2 reference style)
def load_text_files(folder_path):
    """
    Load all text files from the specified folder.
    Returns:
    - data: dictionary with doc_id as key and content as value
    - doc_id_to_filename: dictionary mapping doc_id to filename
    """
    data = {}
    doc_id_to_filename = {}
    doc_id = 0

    print(f"Scanning folder: {folder_path}")
    # os.listdir() returns names in arbitrary order; sort so doc_ids are stable across runs
    for filename in sorted(os.listdir(folder_path)):
        print(f"Found file: {filename}")  
        if filename.endswith(".txt"):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                content = file.read()
                data[doc_id] = content
                doc_id_to_filename[doc_id] = filename
                print(f"Loaded doc_id {doc_id} -> {filename}")
                doc_id += 1

    print(f"Total files loaded: {len(data)}")
    return data, doc_id_to_filename

# Set the folder path (absolute path to the Week 3 folder; adjust for your machine)
folder_path = r"C:\Users\Swornim\Documents\College\Information Retrieval\W3"

# Load all articles
data, doc_id_to_filename = load_text_files(folder_path)

# Convert to lists for TF-IDF processing
documents = [data[doc_id] for doc_id in sorted(data.keys())]
document_names = [doc_id_to_filename[doc_id] for doc_id in sorted(data.keys())]

print("\nReady for TF-IDF processing!")

✓ Loaded article_1.txt
✓ Loaded article_2.txt
✓ Loaded article_3.txt
✓ Loaded article_4.txt
✓ Loaded article_5.txt
✓ Loaded article_6.txt
✓ Loaded article_7.txt
✓ Loaded article_8.txt

Total articles loaded: 8


## Step 3: Load Queries from File

In [None]:
# Function to load queries (Following Week 2 reference style)
def load_queries_from_file(filename='queries.txt'):
    """
    Load queries from a text file.
    Each line in the file is treated as a separate query.
    """
    queries = []
    
    # Check if queries file exists
    if os.path.exists(filename):
        with open(filename, 'r', encoding='utf-8') as file:
            # Read all lines and remove empty lines
            queries = [line.strip() for line in file.readlines() if line.strip()]
        print(f"✓ Loaded {len(queries)} queries from {filename}")
    else:
        print(f"✗ File {filename} not found")
    
    return queries

# Load queries
queries = load_queries_from_file()

# Display the queries
print("\nQueries:")
for i, query in enumerate(queries, 1):
    print(f"{i}. {query}")

✓ Loaded 5 queries from queries.txt

Queries:
1. fitness tracker technology
2. Apple Watch smartwatch features
3. health monitoring devices
4. sleep quality tracking
5. wearable technology trends


## Step 4: Compute TF-IDF Weights

**What is TF-IDF?**
- **TF** (Term Frequency): How often a word appears in a document
- **IDF** (Inverse Document Frequency): How rare a word is across all documents; the fewer documents contain it, the higher its IDF
- **TF-IDF**: The product of the two, so words that are frequent in one document but rare in the collection get the highest weight

We'll use `TfidfVectorizer` to automatically compute TF-IDF weights for all documents.
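
As a rough illustration of how TF and IDF combine, the sketch below recomputes the weights by hand on a small made-up corpus and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings (raw counts for TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The names `toy_docs`, `vectorizer_demo`, and `manual_tfidf` are hypothetical and not part of the assignment code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not the assignment's articles)
toy_docs = ["sleep tracker", "sleep sleep quality", "fitness tracker"]

# Reference result from scikit-learn with default settings
vectorizer_demo = TfidfVectorizer()
sk_tfidf = vectorizer_demo.fit_transform(toy_docs).toarray()

# Vocabulary terms ordered by their column index in the matrix
vocab = sorted(vectorizer_demo.vocabulary_, key=vectorizer_demo.vocabulary_.get)

# TF: raw count of each vocabulary term in each document
counts = np.array(
    [[doc.split().count(term) for term in vocab] for doc in toy_docs],
    dtype=float,
)

# IDF: smoothed inverse document frequency (scikit-learn's default formula)
n_docs = len(toy_docs)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF: multiply, then scale each document vector to unit length
weights = counts * idf
manual_tfidf = weights / np.linalg.norm(weights, axis=1, keepdims=True)

print(np.allclose(manual_tfidf, sk_tfidf))
```

If the two matrices agree, the final line prints `True`, which suggests the vectorizer is doing nothing more than count, weight, and normalize.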

In [5]:
# Create TF-IDF Vectorizer
# This will convert text documents into TF-IDF feature vectors
vectorizer = TfidfVectorizer(
    stop_words='english',  # Remove common English words like 'the', 'is', 'and'
    lowercase=True,        # Convert all text to lowercase
    max_features=1000      # Keep only the 1000 terms with the highest frequency across the corpus
)

# Fit the vectorizer on documents and transform them to TF-IDF vectors
# fit_transform learns the vocabulary and computes TF-IDF values
document_tfidf = vectorizer.fit_transform(documents)

print(f"TF-IDF matrix shape: {document_tfidf.shape}")
print(f"  - {document_tfidf.shape[0]} documents")
print(f"  - {document_tfidf.shape[1]} unique words (features)")
print("\nTF-IDF computation complete!")

TF-IDF matrix shape: (8, 1000)
  - 8 documents
  - 1000 unique words (features)

TF-IDF computation complete!


## Step 5: Transform Queries to TF-IDF Vectors

Now we need to convert our queries into the same TF-IDF format as documents.
We use `transform()` (not `fit_transform()`) because we want to use the same vocabulary learned from documents.
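
A minimal sketch of what this means in practice, using made-up documents and queries (`demo_docs`, `demo_queries`, and `demo_vectorizer` are hypothetical names): query words that never occurred in the documents have no column in the matrix, so they are silently ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data (not the assignment's files)
demo_docs = ["sleep tracker review", "fitness tracker battery"]
demo_queries = ["sleep tracker", "quantum computing"]

demo_vectorizer = TfidfVectorizer()

# fit_transform: learn the vocabulary and IDF values from the documents
doc_matrix = demo_vectorizer.fit_transform(demo_docs)

# transform: reuse that vocabulary and IDF; nothing new is learned
query_matrix = demo_vectorizer.transform(demo_queries)

# Both matrices have one column per vocabulary term
print(doc_matrix.shape, query_matrix.shape)

# Number of non-zero weights per query; the second query shares
# no word with the documents, so its row is entirely zero
print(query_matrix[0].nnz, query_matrix[1].nnz)
```

A query whose row is all zeros will score 0 against every document, which is one reason a query can return no useful ranking.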

In [6]:
# Transform queries to TF-IDF vectors using the same vectorizer
# This ensures queries and documents use the same vocabulary
query_tfidf = vectorizer.transform(queries)

print(f"Query TF-IDF matrix shape: {query_tfidf.shape}")
print(f"  - {query_tfidf.shape[0]} queries")
print(f"  - {query_tfidf.shape[1]} features (same as documents)")
print("\nQuery transformation complete!")

Query TF-IDF matrix shape: (5, 1000)
  - 5 queries
  - 1000 features (same as documents)

Query transformation complete!


## Step 6: Compute Cosine Similarity

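**What is Cosine Similarity?**
- Measures the cosine of the angle between two vectors, so it compares their direction and ignores their length
- TF-IDF weights are never negative, so values here range from 0 (no terms in common) to 1 (same direction, i.e. the same relative term weights)
- Higher value = more similar = more relevant document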

We'll compute cosine similarity between each query and all documents.
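
The formula behind the score is the dot product of the two vectors divided by the product of their lengths. The sketch below works it out with NumPy on made-up vectors; the helper `cosine` and the example vectors are hypothetical and only meant to illustrate the definition, not to replace `cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0:
        return 0.0
    return float(np.dot(u, v) / norm_product)

# Hypothetical 3-term vocabulary
query_vec = np.array([1.0, 1.0, 0.0])
doc_same_direction = np.array([2.0, 2.0, 0.0])  # same terms, twice the weight
doc_no_overlap = np.array([0.0, 0.0, 3.0])      # shares no term with the query

print(cosine(query_vec, doc_same_direction))  # approximately 1
print(cosine(query_vec, doc_no_overlap))      # 0.0
```

Note that the first document is not identical to the query, only proportional to it, yet it still scores (approximately) 1.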

In [7]:
# Compute cosine similarity between queries and documents
# Result: a matrix where each row is a query and each column is a document
similarity_matrix = cosine_similarity(query_tfidf, document_tfidf)

print(f"Similarity matrix shape: {similarity_matrix.shape}")
print(f"  - {similarity_matrix.shape[0]} queries")
print(f"  - {similarity_matrix.shape[1]} documents")
print("\nCosine similarity computation complete!")
print("\nSample similarity scores (Query 1 vs all documents):")
print(similarity_matrix[0])

Similarity matrix shape: (5, 8)
  - 5 queries
  - 8 documents

Cosine similarity computation complete!

Sample similarity scores (Query 1 vs all documents):
[0.13874896 0.02420311 0.         0.00992125 0.         0.
 0.         0.        ]


## Step 7: Rank Documents by Similarity

For each query, we'll rank documents from most relevant to least relevant based on cosine similarity scores.
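
One possible way to do the ranking, sketched here with `np.argsort` on a made-up score vector (`toy_scores` and `toy_names` are hypothetical): negating the scores turns NumPy's ascending sort into a descending one.

```python
import numpy as np

# Hypothetical similarity scores for one query against four documents
toy_scores = np.array([0.10, 0.00, 0.35, 0.20])
toy_names = ["doc_a.txt", "doc_b.txt", "doc_c.txt", "doc_d.txt"]

# Indices of the documents from highest to lowest score;
# a stable sort keeps tied documents in their original order
ranking = np.argsort(-toy_scores, kind="stable")

# Keep only the two best matches
top_k = [(toy_names[i], float(toy_scores[i])) for i in ranking[:2]]
print(top_k)
```

The `sorted(...)` approach on (index, score) pairs gives the same ordering and may be easier to read on small collections.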

In [8]:
# Function to display ranked results for each query
def display_ranked_results(queries, similarity_matrix, document_names):
    """
    For each query, display documents ranked by similarity score.
    """
    # Loop through each query
    for query_idx, query in enumerate(queries):
        print("=" * 80)
        print(f"QUERY {query_idx + 1}: {query}")
        print("=" * 80)
        
        # Get similarity scores for this query with all documents
        scores = similarity_matrix[query_idx]
        
        # Create pairs of (document_index, similarity_score)
        doc_scores = [(i, scores[i]) for i in range(len(scores))]
        
        # Sort by similarity score in descending order (highest first)
        # key=lambda x: x[1] means sort by the second element (the score)
        ranked_docs = sorted(doc_scores, key=lambda x: x[1], reverse=True)
        
        # Display ranked results
        print("\nRanked Documents (by relevance):\n")
        for rank, (doc_idx, score) in enumerate(ranked_docs, 1):
            # Create a visual bar to represent the similarity score
            bar_length = int(score * 50)  # Scale to 50 characters max
            bar = '█' * bar_length
            
            print(f"  Rank {rank}: {document_names[doc_idx]}")
            print(f"           Similarity: {score:.4f} {bar}")
            print()
        
        print()

# Display results
display_ranked_results(queries, similarity_matrix, document_names)

QUERY 1: fitness tracker technology

Ranked Documents (by relevance):

  Rank 1: article_1.txt
           Similarity: 0.1387 ██████

  Rank 2: article_2.txt
           Similarity: 0.0242 █

  Rank 3: article_4.txt
           Similarity: 0.0099 

  Rank 4: article_3.txt
           Similarity: 0.0000 

  Rank 5: article_5.txt
           Similarity: 0.0000 

  Rank 6: article_6.txt
           Similarity: 0.0000 

  Rank 7: article_7.txt
           Similarity: 0.0000 

  Rank 8: article_8.txt
           Similarity: 0.0000 


QUERY 2: Apple Watch smartwatch features

Ranked Documents (by relevance):

  Rank 1: article_1.txt
           Similarity: 0.2177 ██████████

  Rank 2: article_4.txt
           Similarity: 0.1476 ███████

  Rank 3: article_3.txt
           Similarity: 0.0215 █

  Rank 4: article_7.txt
           Similarity: 0.0086 

  Rank 5: article_2.txt
           Similarity: 0.0000 

  Rank 6: article_5.txt
           Similarity: 0.0000 

  Rank 7: article_6.txt
           Similari

## Step 8: Summary Statistics (Optional)

Let's see some overall statistics about our retrieval system.

In [9]:
# Display summary statistics
print("=" * 80)
print("SUMMARY STATISTICS")
print("=" * 80)
print(f"\nTotal Documents: {len(documents)}")
print(f"Total Queries: {len(queries)}")
print(f"Vocabulary Size: {len(vectorizer.vocabulary_)}")

print("\n\nAverage Similarity Score for Each Query:")
print("-" * 50)
for i, query in enumerate(queries):
    avg_similarity = np.mean(similarity_matrix[i])
    max_similarity = np.max(similarity_matrix[i])
    print(f"\nQuery {i+1}: {query}")
    print(f"  Average similarity: {avg_similarity:.4f}")
    print(f"  Max similarity: {max_similarity:.4f}")

print("\n\nMost Relevant Document-Query Pair:")
print("-" * 50)
# Find the (query, document) position of the highest score in the entire matrix
query_idx, doc_idx = np.unravel_index(np.argmax(similarity_matrix), similarity_matrix.shape)
max_score = similarity_matrix[query_idx, doc_idx]

print(f"Query: {queries[query_idx]}")
print(f"Document: {document_names[doc_idx]}")
print(f"Similarity Score: {max_score:.4f}")

SUMMARY STATISTICS

Total Documents: 8
Total Queries: 5
Vocabulary Size: 1000


Average Similarity Score for Each Query:
--------------------------------------------------

Query 1: fitness tracker technology
  Average similarity: 0.0216
  Max similarity: 0.1387

Query 2: Apple Watch smartwatch features
  Average similarity: 0.0494
  Max similarity: 0.2177

Query 3: health monitoring devices
  Average similarity: 0.0252
  Max similarity: 0.2016

Query 4: sleep quality tracking
  Average similarity: 0.0242
  Max similarity: 0.1783

Query 5: wearable technology trends
  Average similarity: 0.0084
  Max similarity: 0.0475


Most Relevant Document-Query Pair:
--------------------------------------------------
Query: Apple Watch smartwatch features
Document: article_1.txt
Similarity Score: 0.2177


## Explanation of the Code

### How it Works:

1. **Loading Data**: We read all article files and the queries file into memory
   
2. **TF-IDF Vectorization**: 
   - `TfidfVectorizer` converts text into numbers
   - Each document becomes a vector of TF-IDF weights
   - Common words (stop words) are removed
   
3. **Cosine Similarity**:
   - Measures the angle between query and document vectors
   - Score of 1 = same direction (same relative term weights), 0 = no terms in common
   - Higher score = more relevant
   
4. **Ranking**:
   - For each query, we sort documents by similarity score
   - The document with highest score is ranked #1

### Key Python Concepts Used:

- **Lists and dictionaries**: To store multiple items (articles, file names, queries)
- **Functions**: Reusable code blocks (`load_text_files()`, `load_queries_from_file()`, `display_ranked_results()`)
- **Loops**: `for` loops to process multiple files/queries
- **File I/O**: Reading text files with `open()`
- **Libraries**: sklearn for TF-IDF and cosine similarity calculations