# Part 2: Indexing and Evaluation
## Part 1: Indexing

This notebook implements:
1. **Inverted Index** - Build index structure from preprocessed corpus
2. **Conjunctive Query (AND)** - Search for documents containing ALL query terms
3. **TF-IDF Ranking** - Rank results by relevance
4. **5 Test Queries** - Evaluate search engine with custom queries
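
The first two items can be sketched on a toy corpus. This is a minimal illustration, not the code in `myapp.search.algorithms`, which is not shown in this notebook. The names `build_inverted_index`, `conjunctive_search`, and `toy_docs` are hypothetical, and documents are assumed to be plain token lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for pid, tokens in docs.items():
        for term in tokens:
            index[term].add(pid)
    return index


def conjunctive_search(query_terms, index):
    """Return the IDs of documents that contain every query term (AND)."""
    if not query_terms:
        return set()
    # Intersect the shortest postings first so intermediate sets stay small.
    postings = sorted((index.get(term, set()) for term in query_terms), key=len)
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= posting
        if not matches:
            break  # one empty intersection means no document can match
    return matches


# Toy corpus (made up for illustration): PID -> preprocessed tokens.
toy_docs = {
    "d1": ["cotton", "sweatshirt", "men"],
    "d2": ["cotton", "dress", "women"],
    "d3": ["leather", "shoe", "men"],
}
toy_index = build_inverted_index(toy_docs)
print(conjunctive_search(["cotton", "men"], toy_index))  # {'d1'}
```

The log lines later in this notebook report an indexing step for every query. If that reflects a full rebuild, constructing the index once and reusing it across queries, as `toy_index` is reused here, would avoid the repeated work.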

In [26]:
import os
import sys
import pandas as pd

# Add project root to path
sys.path.insert(0, os.path.abspath('../../'))

In [27]:
from myapp.search.load_corpus import load_preprocessed_corpus

corpus = load_preprocessed_corpus('../../data/processed/preprocessed_corpus.pkl')

print(f"Loaded {len(corpus)} documents")
print("\nExample document structure:")
example_pid = next(iter(corpus))  # first PID in insertion order
example_doc = corpus[example_pid]
print(f"  PID: {example_doc['pid']}")
print(f"  Tokens (first 10): {example_doc['searchable_text'][:10]}")
print(f"  Metadata: {example_doc['metadata']}")

Loaded 28080 documents

Example document structure:
  PID: TKPFCZ9EA7H5FYZH
  Tokens (first 10): ['solid', 'women', 'multicolor', 'track', 'pant', 'yorker', 'trackpant', 'made', '100', 'rich']
  Metadata: {'category': 'clothing and accessories', 'sub_category': 'bottomwear', 'brand': 'york', 'seller': 'shyam enterprises'}


In [28]:
TEST_QUERIES = [
    "women cotton dress summer",
    "men leather shoes formal",
    "kids blue jeans comfortable",
    "sports running shoes lightweight",
    "winter jacket warm waterproof"
]

In [29]:
from myapp.search.algorithms import search_in_corpus

# Test with a simple query
test_query = "cotton sweatshirt"
print(f"Testing query: '{test_query}'\n")

results = search_in_corpus(test_query, corpus, top_k=10)

print(f"\n{'='*60}")
print("Top 10 Results:")
print(f"{'='*60}")
for rank, (pid, score) in enumerate(results, 1):
    doc = corpus[pid]
    title = doc['original']['title']
    print(f"{rank:2d}. [{score:6.3f}] {pid} - {title[:50]}")

Testing query: 'cotton sweatshirt'

Processed query terms: ['cotton', 'sweatshirt']
Indexed 20906 unique terms from 28080 documents in 0.19s
Found 1025 documents matching all terms

Top 10 Results:
 1. [27.215] SWSFMUGJATWZTYEZ - Full Sleeve Solid Men Sweatshirt
 2. [24.229] SWSFNVDXSZTP8HRQ - Full Sleeve Color Block Men Sweatshirt
 3. [24.229] SWSFNVDXSNZ8UJVE - Full Sleeve Solid Men Sweatshirt
 4. [21.929] SWSFY4HPTTBCZZAJ - Full Sleeve Solid Women Sweatshirt
 5. [21.243] SWSFMJGT3MWCDDM7 - Full Sleeve Solid Men Sweatshirt
 6. [21.243] SWSFMTNHEM5SNHVZ - Full Sleeve Solid Women Sweatshirt
 7. [21.243] SWSFMJGS8HVBPEH6 - Full Sleeve Solid Women Sweatshirt
 8. [21.243] SWSFMJGTCUB3ZMVQ - Full Sleeve Solid Men Sweatshirt
 9. [21.243] SWSFMJGTQFMZGZQH - Full Sleeve Solid Women Sweatshirt
10. [21.243] SWSFMJGTVFW3NAXZ - Full Sleeve Solid Women Sweatshirt
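
The implementation of `search_in_corpus` is not shown in this notebook, so the exact weighting behind the scores above is unknown. As a rough sketch of TF-IDF ranking over conjunctive matches, the following assumes a log-scaled term frequency and `idf = log(N / df)`. `tfidf_rank` and `ranking_docs` are hypothetical names, and the scores it produces will not reproduce the numbers printed above.

```python
import math
from collections import Counter


def tfidf_rank(query_terms, docs, top_k=10):
    """Rank documents containing all query terms by summed TF-IDF weight."""
    if not query_terms:
        return []
    n_docs = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scored = []
    for pid, tokens in docs.items():
        tf = Counter(tokens)
        if not all(term in tf for term in query_terms):
            continue  # conjunctive (AND) filter
        score = sum(
            (1 + math.log(tf[term])) * math.log(n_docs / df[term])
            for term in query_terms
        )
        scored.append((pid, score))
    # Sort by score (descending), then by PID so ties have a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Toy corpus (made up for illustration): PID -> preprocessed tokens.
ranking_docs = {
    "d1": ["cotton", "sweatshirt", "cotton", "men"],
    "d2": ["cotton", "sweatshirt", "women"],
    "d3": ["leather", "shoe", "men"],
}
for pid, score in tfidf_rank(["cotton", "sweatshirt"], ranking_docs):
    print(f"{pid}: {score:.3f}")
```

Under this scheme, documents with identical counts for the query terms receive identical scores, which is one plausible reason for the runs of tied scores in the output above.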


In [30]:
# Run all test queries
all_results = {}

for query in TEST_QUERIES:
    print(f"\n{'='*80}")
    print(f"Query: '{query}'")
    print(f"{'='*80}")

    results = search_in_corpus(query, corpus, top_k=20)
    all_results[query] = results

    print("\nTop 20 Results:")
    if results:
        for rank, (pid, score) in enumerate(results, 1):
            doc = corpus[pid]
            title = doc['original']['title']
            brand = doc['original'].get('brand', 'N/A')
            price = doc['original'].get('selling_price', 'N/A')
            rating = doc['original'].get('average_rating', 'N/A')

            print(f"{rank:2d}. [Score: {score:6.3f}] {pid}")
            print(f"    Title: {title[:60]}")
            print(f"    Brand: {brand} | Price: {price} | Rating: {rating}")
    else:
        print("  No results found")



Query: 'women cotton dress summer'
Processed query terms: ['women', 'cotton', 'dress', 'summer']
Indexed 20906 unique terms from 28080 documents in 0.17s
Found 273 documents matching all terms

Top 20 Results:
 1. [Score: 13.672] TSHFWCRAKZUK5MUK
    Title: Solid Women Polo Neck Multicolor T-Shirt  (Pack of 3)
    Brand: Shoef | Price: 654 | Rating: 3.6
 2. [Score: 13.672] TSHFWCRYXXQMDRQP
    Title: Solid Women Polo Neck Multicolor T-Shirt  (Pack of 3)
    Brand: Shoef | Price: 654 | Rating: 3.6
 3. [Score: 13.672] TSHFSFVNZVVTB7FS
    Title: Solid Women Polo Neck Multicolor T-Shirt  (Pack of 3)
    Brand: Shoef | Price: 616 | Rating: 3.6
 4. [Score: 13.672] TSHFWCSUTZGJJDBK
    Title: Solid Women Polo Neck Multicolor T-Shirt  (Pack of 3)
    Brand: Shoef | Price: 654 | Rating: 3.6
 5. [Score: 13.672] TSHFSFVZT5Y4UYAH
    Title: Solid Women Polo Neck Multicolor T-Shirt  (Pack of 3)
    Brand: Shoef | Price: 616 | Rating: 3.6
 6. [Score: 13.672] TSHFSFVZPNAUGFWZ
    Title: Solid Women

In [31]:
# Create summary table
summary_data = []
for query, results in all_results.items():
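    # len(results) is capped at top_k (20 here): it counts the results returned,
    # not the number of documents that matched all query terms.
    num_results = len(results)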
    avg_score = sum(score for _, score in results) / num_results if num_results > 0 else 0
    max_score = max((score for _, score in results), default=0)
    min_score = min((score for _, score in results), default=0)

    summary_data.append({
        'Query': query,
        'Num Results': num_results,
        'Avg Score': f"{avg_score:.3f}",
        'Max Score': f"{max_score:.3f}",
        'Min Score': f"{min_score:.3f}"
    })

summary_df = pd.DataFrame(summary_data)
print("\n" + "="*80)
print("SUMMARY - Test Queries Performance")
print("="*80)
print(summary_df.to_string(index=False))



SUMMARY - Test Queries Performance
                           Query  Num Results Avg Score Max Score Min Score
       women cotton dress summer           20    13.672    13.672    13.672
        men leather shoes formal           15    26.121    44.504    15.411
     kids blue jeans comfortable            1    21.347    21.347    21.347
sports running shoes lightweight            3    30.068    32.287    25.630
   winter jacket warm waterproof            0     0.000     0.000     0.000
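
A conjunctive query returns nothing as soon as one term is absent from the corpus, or when every term occurs somewhere but no single document contains them all. A small diagnostic like the one below can help tell those cases apart. It is a sketch: `term_document_frequencies` and `toy_corpus` are hypothetical, the only assumption about the corpus is the `searchable_text` token list shown earlier, and the query terms must already be preprocessed the same way as the corpus.

```python
def term_document_frequencies(query_terms, corpus):
    """Count, for each query term, how many documents contain it."""
    counts = {}
    for term in query_terms:
        counts[term] = sum(
            1 for doc in corpus.values() if term in doc["searchable_text"]
        )
    return counts


# Toy corpus (made up for illustration), mirroring the 'searchable_text' field.
toy_corpus = {
    "p1": {"searchable_text": ["winter", "jacket", "warm"]},
    "p2": {"searchable_text": ["jacket", "waterproof"]},
    "p3": {"searchable_text": ["winter", "cap", "warm"]},
}
freqs = term_document_frequencies(
    ["winter", "jacket", "warm", "waterproof"], toy_corpus
)
print(freqs)  # {'winter': 2, 'jacket': 2, 'warm': 2, 'waterproof': 1}
```

In this toy case every term occurs in at least one document, yet no document contains all four, so an AND query over them would return no results.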


## Test Validation Queries

Test with the two predefined queries from `validation_labels.csv`:
- Query 1: "women full sleeve sweatshirt cotton"
- Query 2: "men slim jeans blue"
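
One common way to score a ranked list against relevance labels is precision at K. The sketch below is illustrative only: the column layout of `validation_labels.csv` is not shown in this notebook, so the labels are assumed to have already been reduced to a set of relevant PIDs per query. `precision_at_k` and the example PIDs are hypothetical.

```python
def precision_at_k(ranked_pids, relevant_pids, k):
    """Fraction of the top-k ranked PIDs that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_pids[:k]
    hits = sum(1 for pid in top_k if pid in relevant_pids)
    # Dividing by k (not len(top_k)) penalises lists shorter than k.
    return hits / k


# Made-up ranking and labels for illustration.
example_ranking = ["a", "b", "c", "d"]
example_relevant = {"a", "c", "x"}
print(precision_at_k(example_ranking, example_relevant, 2))  # 0.5
print(precision_at_k(example_ranking, example_relevant, 4))  # 0.5
```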


In [32]:
VALIDATION_QUERIES = [
    "women full sleeve sweatshirt cotton",
    "men slim jeans blue"
]

for i, query in enumerate(VALIDATION_QUERIES, 1):
    print(f"\n{'='*80}")
    print(f"Validation Query {i}: '{query}'")
    print(f"{'='*80}")

    results = search_in_corpus(query, corpus, top_k=20)

    print("\nTop 20 Results (for evaluation in Part 2):")
    if not results:
        print("  No results found")
    for rank, (pid, score) in enumerate(results, 1):
        doc = corpus[pid]
        title = doc['original']['title']
        print(f"{rank:2d}. [Score: {score:6.3f}] {pid} - {title[:50]}")



Validation Query 1: 'women full sleeve sweatshirt cotton'
Processed query terms: ['women', 'full', 'sleev', 'sweatshirt', 'cotton']
Indexed 20906 unique terms from 28080 documents in 0.17s
Found 500 documents matching all terms

Top 20 Results (for evaluation in Part 2):
 1. [Score: 26.444] SWSFFVK89VKGHBZN - Full Sleeve Graphic Print Women Sweatshirt
 2. [Score: 26.444] SWSFFVK8RYHBBFAH - Full Sleeve Graphic Print Women Sweatshirt
 3. [Score: 26.277] SWSFMJGS8HVBPEH6 - Full Sleeve Solid Women Sweatshirt
 4. [Score: 26.277] SWSFMJGTQFMZGZQH - Full Sleeve Solid Women Sweatshirt
 5. [Score: 26.277] SWSFMJGTVFW3NAXZ - Full Sleeve Solid Women Sweatshirt
 6. [Score: 26.277] SWSFMJGTGZMJXK5H - Full Sleeve Solid Women Sweatshirt
 7. [Score: 26.253] SWSFY4HPTTBCZZAJ - Full Sleeve Solid Women Sweatshirt
 8. [Score: 25.734] SWSFFVKBCQG5FHPF - Full Sleeve Printed Women Sweatshirt
 9. [Score: 25.734] SWSFFVKBH5YEGFFN - Full Sleeve Printed Women Sweatshirt
10. [Score: 25.734] SWSFFVKBB8FHEB7R - Fu