# Part 2: Indexing and Evaluation

In [1]:
import os
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 100)

from processing import build_terms
from indexing import Indexing
from ranking import Ranking
from evaluation import Evaluation

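# `display` is used in later cells; import it explicitly so the code also runs outside Jupyter.
from IPython.display import clear_output, display

clear_output()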

## Part I: Indexing

1. Build inverted index: After pre-processing the data, create the inverted index.

See [processing.py](processing.py) for the implementation of `build_terms`.<br>
See [indexing.py](indexing.py) for the implementation of the index.

Here we import our data.

In [2]:
input_directory = "../IRWA-2025-data/"
input_name = "fashion_products_dataset_processed.csv"
input_path = os.path.join(input_directory, input_name)
df_fashion_products = pd.read_csv(input_path, sep=",")

pd.set_option('display.max_colwidth', 15)
print("FASHION PRODUCTS DATAFRAME")
display(df_fashion_products.sample(5))
print()

print("FASHION PRODUCTS DATAFRAME: DOCUMENT COLUMN")
pd.set_option('display.max_colwidth', 100)
display(df_fashion_products.sample(5)["document"])

FASHION PRODUCTS DATAFRAME


| | pid | title | description | brand | category | sub_category | seller | out_of_stock | selling_price | discount | ... | product_details_Style Code | product_details_Pattern | product_details_Fabric Care | product_details_Suitable For | product_details_Sleeve | product_details_Pack of | product_details_Type | product_details_Ideal For | product_details_Fit | document |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13513 | TSHFV3B4YMM... | green women... | | Marca Disa | Clothing an... | Topwear | Marca Disati | False | 494.0 | 45.0 | ... | MD2BM655OLIVE | Printed | Regular Mac... | Western Wear | Full Sleeve | 1.0 | Round Neck | Women | Regular | green cotto... |
| 592 | TSHF3MN2SQA... | yellow wome... | | Fairdea | Clothing an... | Topwear | InstaHeaven | False | 284.0 | 52.0 | ... | FD0038_XL | Solid | Regular Mac... | Western Wear | Full Sleeve | 1.0 | Round or Crew | Women | Regular | yellow soli... |
| 27881 | TSHFT4FAMZV... | self collar... | | Oka | Clothing an... | Topwear | OKANE | False | 715.0 | 44.0 | ... | 11218105 | Self Design | Machine was... | Western Wear | Half Sleeve | 1.0 | Collared Neck | Men | Regular | collar clot... |
| 13835 | SWSF96MPKMG... | women sweat... | | Scott Inter... | Clothing an... | Winter Wear | switzinc | False | 549.0 | 81.0 | ... | AW18-SSL2 | Solid | Gentle Mach... | Western Wear | Full Sleeve | | | | | cotton soli... |
| 20004 | SHTFEUYGHSY... | fit collar ... | shop print ... | Adam Park | Clothing an... | Topwear | SouthIndiaS... | False | 1008.0 | 32.0 | ... | SH-VC-VC002... | Printed | Cold water ... | Western Wear | Full Sleeve | 1.0 | | | Slim | -mult-po co... |



FASHION PRODUCTS DATAFRAME: DOCUMENT COLUMN


6438     hh- cotton cloth wear stripe blend women slim -hsts-navi sleev wash polo topwear half machin nec...
9577     retailnet cotton solid henley rever cloth full dri wear regular ink blend women breakboun sleev ...
21502    natur collar green solid color cloth design arbo wear properli polycotton regular new fit sdm ar...
13014    aff cotton solid color henley amp cloth pack short wear regular blend women affgar multicolour a...
4887     long-last graphic cotton color gsm rever -cmb- cloth pack white design vestu dri wear regular te...
Name: document, dtype: object

Inverted index function.

In [3]:
indexing = Indexing(df_fashion_products, "pid", "document")
indexing.build_inverted_index()
indexing.print_inverted_index() 

great: ['TKPFCZ9EA7H5FYZH', 'TKPFCZ9EJZV2UVRZ', 'TKPFCZ9EHFCY5Z4Y', 'TKPFCZ9ESZZ7YWEF', 'TKPFCZ9EVXKBSUD7'] ...
combo: ['TKPFCZ9EA7H5FYZH', 'TKPFCZ9EHFCY5Z4Y', 'TKPFCZ9ESZZ7YWEF', 'TKPFCZ9EVXKBSUD7', 'TKPFCZ9EFK9DNWDA'] ...
skin: ['TKPFCZ9EA7H5FYZH', 'TKPFCZ9EJZV2UVRZ', 'TKPFCZ9EHFCY5Z4Y', 'TKPFCZ9ESZZ7YWEF', 'TKPFCZ9EVXKBSUD7'] ...
accessori: ['TKPFCZ9EA7H5FYZH', 'TKPFCZ9EJZV2UVRZ', 'TKPFCZ9EHFCY5Z4Y', 'TKPFCZ9ESZZ7YWEF', 'TKPFCZ9EVXKBSUD7'] ...
round: ['TKPFCZ9EA7H5FYZH', 'TKPFCZ9EJZV2UVRZ', 'TKPFCZ9EHFCY5Z4Y', 'TKPFCZ9ESZZ7YWEF', 'TKPFCZ9EVXKBSUD7'] ...


In [4]:
indexing.search_by_conjunctive_queries("women cotton dress")[0:5]

['DHTFZH9UAFE7MJBZ',
 'DHTFZH9UFKMXZ9EP',
 'ETHF25572QHXAGEQ',
 'ETHF2557CGF4CUMW',
 'ETHF2557EG7MDXES']
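
To make the idea behind the two cells above concrete, here is a minimal sketch of an inverted index with conjunctive (AND) search. It is not the code in `indexing.py`: the function names `build_inverted_index` and `search_conjunctive` and the toy documents are hypothetical, and the sketch assumes that both documents and queries are already whitespace-separated, pre-processed terms.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


def search_conjunctive(index, query):
    """Return the ids of the documents that contain every query term."""
    terms = query.split()
    if not terms:
        return []
    # One posting set per query term; an unknown term yields an empty set.
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical toy corpus, not taken from the dataset.
toy_documents = {
    "D1": "women cotton dress",
    "D2": "men cotton shirt",
    "D3": "women cotton dress print",
}
toy_index = build_inverted_index(toy_documents)
print(search_conjunctive(toy_index, "women cotton dress"))  # ['D1', 'D3']
print(search_conjunctive(toy_index, "adidas jacket"))       # []
```

Because every term must match, a single term that is absent from the index empties the whole result, which is the main weakness of purely conjunctive retrieval.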

2. Propose test queries: Define five queries that will be used to evaluate your search engine (be creative).

In [5]:
proposed_queries = pd.Series({
    1: "slim men dark blue jeans",
    2: "adidas jacket",
    3: "women cotton dress",
    4: "black leather bag",
    5: "discount sport shoes"
})
proposed_queries


1    slim men dark blue jeans
2               adidas jacket
3          women cotton dress
4           black leather bag
5        discount sport shoes
dtype: object

3. Rank your results: Implement the TF-IDF algorithm and provide ranking-based results.

See [ranking.py](ranking.py) file for implementation.
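
As a rough illustration of TF-IDF ranking, the sketch below scores documents by the cosine similarity between TF-IDF vectors of the query and of each document. The weighting scheme (log-scaled term frequency times `log(N / df)`), the function names, and the toy corpus are assumptions made for this example; `ranking.py` may use a different variant.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector per document plus the idf table."""
    n_docs = len(documents)
    tokenized = {doc_id: text.split() for doc_id, text in documents.items()}
    doc_freq = Counter(t for tokens in tokenized.values() for t in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = {
        doc_id: {t: (1 + math.log(c)) * idf[t] for t, c in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_tfidf(documents, query, top_k=5):
    """Return up to top_k (doc_id, score) pairs with a positive score, best first."""
    vectors, idf = tfidf_vectors(documents)
    # Query terms that never occur in the corpus carry no weight and are dropped.
    query_tf = Counter(t for t in query.split() if t in idf)
    query_vec = {t: (1 + math.log(c)) * idf[t] for t, c in query_tf.items()}
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


# Hypothetical toy corpus of already pre-processed documents.
toy_documents = {
    "D1": "slim dark blue jean men",
    "D2": "blue cotton shirt men",
    "D3": "black leather bag women",
}
for doc_id, score in rank_by_tfidf(toy_documents, "slim blue jean"):
    print(f"{doc_id}: {score:.4f}")
```

Unlike conjunctive search, a document that matches only some of the query terms can still receive a positive score here; it simply ranks lower.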

In [6]:
documents = df_fashion_products.set_index("pid")["document"]
ranking = Ranking(documents, proposed_queries)
ranking.print_rankings()


query=1, text=slim men dark blue jeans
01: score=0.5291, document=JEAFTGSEW5KDCDJ2, text=accessori blend women cotton slim dark bottomwear western cloth lev wear jean blue
02: score=0.5067, document=TSHFESTT8E2GVHAJ, text=cotton solid cloth wear blend slim bl per sleev wash topwear half machin neck accessori true round dark tag t-shirt western men blue
03: score=0.4872, document=TSHFR6YBTZ5NY2HQ, text=cotton solid cloth short wear slim per sleev wash polo topwear lev machin neck accessori tag dark t-shirt western pure men blue
04: score=0.4804, document=JEAFEN3WVGHTJPY9, text=retailnet accessori blend cotton slim dark bottomwear western cloth wear jean men blue g
05: score=0.4691, document=JEAFJV3HMBB9YA7G, text=accessori cotton slim dark bottomwear western cloth pure szto wear jean men blue
06: score=0.4533, document=JEAFEKQZNAWHW8PT, text=true accessori blend cotton slim dark bottomwear bl solid cloth western kapsonsretailpvtltd wear jean men blue
07: score=0.4531, document=JEAFJV9Y

## Part II: Evaluation

1.  Implement the following evaluation metrics to assess the effectiveness of your retrieval solutions. These metrics will help you measure how well your system retrieves relevant documents for each query:

    - i.  Precision@K (P@K)
    - ii.  Recall@K (R@K)
    - iii.  Average Precision@K (AP@K)
    - iv.  F1-Score@K
    - v.  Mean Average Precision (MAP)
    - vi.  Mean Reciprocal Rank (MRR)
    - vii.  Normalized Discounted Cumulative Gain (NDCG)

See [evaluation.py](evaluation.py) file for implementation.
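
For reference, the per-query metrics can be sketched as below. This is one possible set of definitions, not necessarily the one in `evaluation.py`: the input is assumed to be a list of binary labels already sorted by descending score, recall is computed against the relevant documents present in that list, and AP@K is normalized by the number of relevant documents found in the top K. All function names are hypothetical.

```python
def precision_at_k(labels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(labels[:k]) / k


def recall_at_k(labels, k):
    """Fraction of all relevant judged documents that appear in the top k."""
    total_relevant = sum(labels)
    # Guard against queries with no relevant document at all.
    return sum(labels[:k]) / total_relevant if total_relevant else 0.0


def f1_at_k(labels, k):
    """Harmonic mean of precision@k and recall@k."""
    precision = precision_at_k(labels, k)
    recall = recall_at_k(labels, k)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_precision_at_k(labels, k):
    """Mean of precision@i over the ranks i <= k that hold a relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(labels[:k], start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical ranked labels for one query (1 = relevant, 0 = not relevant).
ranked_labels = [1, 0, 1, 1, 0]
print(f"P@5  = {precision_at_k(ranked_labels, 5):.3f}")          # 0.600
print(f"R@5  = {recall_at_k(ranked_labels, 5):.3f}")             # 1.000
print(f"F1@5 = {f1_at_k(ranked_labels, 5):.3f}")                 # 0.750
print(f"AP@5 = {average_precision_at_k(ranked_labels, 5):.3f}")  # 0.806
```

Precision and recall ignore the order inside the top K, whereas average precision rewards placing relevant documents early.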

2.  Apply the evaluation metrics you have implemented to the search results and relevance judgments provided in validation_labels.csv for the predefined queries. When reporting evaluation results, provide only numeric values, rounded to three decimal places. Do not include textual explanations or additional statistics in this section. 
    - a.  Query 1: women full sleeve sweatshirt cotton 
    - b.  Query 2: men slim jeans blue 

We import the validation labels as a dataframe.

In [7]:
df_validation_labels = pd.read_csv(os.path.join(input_directory, "validation_labels.csv"), sep=",")
df_validation_labels["title"] = df_validation_labels["title"].apply(build_terms).apply(" ".join)
df_validation_labels.sample(5)

| | title | pid | query_id | labels |
|---|---|---|---|---|
| 36 | skinni jean super blue women | JEAFUZXSMRFFNGC2 | 2 | 1 |
| 8 | sweatshirt solid full women sleev | JCKF7M8DNBB6WZ8Y | 1 | 1 |
| 31 | dark jean blue slim women | JEAFQFGWXSVSMRWC | 2 | 0 |
| 17 | sweatshirt solid full women sleev | SWSFN8YZXFSUCAJX | 1 | 1 |
| 24 | jean men taper blue fit | JEAFUZXRDZWSYWGG | 2 | 1 |


We create the predicted labels dataframe using the documents from the Fashion Products Dataset and the queries from the Validation Labels.

In [8]:
documents = df_fashion_products.set_index("pid")["document"]
queries = pd.Series(df_validation_labels["title"].unique())

rank = Ranking(documents, queries)
df_predicted_labels = rank.rank_tfidf_dataframe(use_query_text=True, topK=3000)
df_predicted_labels.sample(5)

| | query | document | score |
|---|---|---|---|
| 21469 | round superhero pack t-shirt women neck multicolor | TSHFGFDZ7F5FVWS7 | 0.133653 |
| 19015 | color sweatshirt full block women sleev | SWSFJFV4CSN5TJDN | 0.193455 |
| 11659 | sweatshirt solid full women sleev | TSHFM8F7B3VWXGEZ | 0.06174 |
| 4179 | sweatshirt full stripe women sleev | SWSFWG9BDHFG2M9Z | 0.155717 |
| 31142 | skinni jean super blue women | TSHFSGN6BMQX9UKZ | 0.069471 |


We merge the validation labels dataset with the computed scores from the rank object.

In [9]:
renaming = {"document": "pid", "query": "title"}
df_predicted_labels = df_predicted_labels.rename(columns=renaming)

df_validation_predicted_labels = df_validation_labels.set_index(["title", "pid"]).join(
    df_predicted_labels.set_index(["title", "pid"]),
    how="left"
).reset_index()

df_validation_predicted_labels.sample(5)

| | title | pid | query_id | labels | score |
|---|---|---|---|---|---|
| 9 | graphic sweatshirt full print women sleev | SWSFWCSPEGMDH8FV | 1 | 1 | 0.173241 |
| 19 | round superhero pack t-shirt women neck multicolor | TSHFY24T4KGF2NYU | 1 | 0 | 0.25286 |
| 33 | jean men taper blue fit | JEAFUPSWVCTKXMHF | 2 | 1 | 0.422397 |
| 18 | sweatshirt solid full women sleev | SWSFMJGS8HVBPEH6 | 1 | 1 | 0.185621 |
| 39 | blue slim women jean | JEAEKZFJHSKJMV58 | 2 | 0 | 0.130395 |
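
One detail of the left join is worth spelling out: every judged row is kept, and a judged document that is missing from the ranked results ends up with a missing score. The sketch below illustrates this on hypothetical toy frames; filling the gap with 0.0 is an assumption that mirrors the fallback used later in Exercise 3b, not something the cell above does.

```python
import pandas as pd

# Hypothetical relevance judgments: three judged (title, pid) pairs.
judged = pd.DataFrame({
    "title": ["q a", "q a", "q b"],
    "pid": ["P1", "P2", "P3"],
    "labels": [1, 0, 1],
})
# Hypothetical ranking output: P2 was not retrieved for "q a".
scored = pd.DataFrame({
    "title": ["q a", "q b"],
    "pid": ["P1", "P3"],
    "score": [0.42, 0.17],
})

# A left join keeps all judged rows; unmatched rows get NaN in "score".
merged = judged.merge(scored, on=["title", "pid"], how="left")
n_missing = int(merged["score"].isna().sum())
print(f"Judged rows without a score: {n_missing}")

# Assumption: treat a document that was not retrieved as having score 0.0.
merged["score"] = merged["score"].fillna(0.0)
print(merged)
```

Checking the number of missing scores before evaluating helps confirm that `topK` was large enough to cover the judged documents.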


We print the evaluation metrics for each category of queries.

In [10]:
for query_id in df_validation_predicted_labels["query_id"].unique():
    evaluation = Evaluation(df_validation_predicted_labels, query_id)
    evaluation.print_evaluation(k=10)
    print()

Query: 1
1: Precision at K: 0.600
2: Recall at K: 0.462
3: Average Precision at K: 0.521
4: F1 score at K: 0.522
5: Mean Average Precision at K: 0.647
6: Mean Reciprocal Rank at K: 0.750
7: Normal Discount Cumulative Gain at K: 0.506

Query: 2
1: Precision at K: 0.700
2: Recall at K: 0.700
3: Average Precision at K: 0.774
4: F1 score at K: 0.700
5: Mean Average Precision at K: 0.647
6: Mean Reciprocal Rank at K: 0.750
7: Normal Discount Cumulative Gain at K: 0.718
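
In the output above, MAP and MRR take the same value for both queries, which is consistent with their being averaged over all queries rather than computed per query. The sketch below shows one way to compute these rank-aware metrics from binary labels sorted by descending score. The function names and toy labels are hypothetical, the DCG discount is assumed to be `1 / log2(rank + 1)`, and `evaluation.py` may define things differently.

```python
import math


def average_precision_at_k(labels, k):
    """Mean of precision@i over the ranks i <= k that hold a relevant document."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(labels[:k], start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def reciprocal_rank_at_k(labels, k):
    """1 / rank of the first relevant document in the top k, or 0 if none."""
    for rank, label in enumerate(labels[:k], start=1):
        if label:
            return 1.0 / rank
    return 0.0


def dcg_at_k(labels, k):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(label / math.log2(rank + 1)
               for rank, label in enumerate(labels[:k], start=1))


def ndcg_at_k(labels, k):
    """DCG divided by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal else 0.0


def mean_over_queries(metric, labels_by_query, k):
    """Average a per-query metric over all queries (gives MAP or MRR)."""
    values = [metric(labels, k) for labels in labels_by_query.values()]
    return sum(values) / len(values)


# Hypothetical ranked labels for two queries.
labels_by_query = {1: [0, 1, 0, 1], 2: [1, 0, 0, 0]}
map_score = mean_over_queries(average_precision_at_k, labels_by_query, 4)
mrr_score = mean_over_queries(reciprocal_rank_at_k, labels_by_query, 4)
print(f"MAP@4 = {map_score:.3f}")  # 0.750
print(f"MRR@4 = {mrr_score:.3f}")  # 0.750
for query_id, labels in labels_by_query.items():
    print(f"NDCG@4 for query {query_id} = {ndcg_at_k(labels, 4):.3f}")
```

MRR only looks at the first relevant hit, MAP accounts for the position of every relevant document, and NDCG compares the obtained ordering with the best possible one.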



3.  You will act as expert judges by establishing the ground truth for each document and query.
    - a.  For the test queries you defined in Part I, Step 2 during indexing, assign a binary relevance label to each document: 1 if the document is relevant to the query, or 0 if it is not.
    - b.  Comment on each of the evaluation metrics, stating how they differ and what information each of them provides. Analyze your results.
    - c.  Analyze the current search system and identify its main problems or limitations. For each issue you find, propose possible ways to resolve it. Consider aspects such as retrieval accuracy, ranking quality, handling of different field types, query formulation, and indexing strategies.

Exercise 3a: Create sample queries and perform search

In [11]:
print("EXERCISE 3a: SAMPLE QUERIES AND SEARCH")
print("=" * 50)

proposed_queries = {
    3: "slim men dark blue jeans",
    4: "adidas jacket",
    5: "women cotton dress",
    6: "black leather bag",
    7: "discount sport shoes"
}

print("Sample Queries:")
for qid, query in proposed_queries.items():
    print(f"  {qid}: {query}")

# Perform search
documents_series = df_fashion_products.set_index("pid")["document"]
ranking_system = Ranking(documents=documents_series)
all_search_results = []

for query_id, query_text in proposed_queries.items():
    print(f"\nSearch results for: '{query_text}'")
    ranking_system.set_queries(pd.Series({query_id: query_text}))
    search_results = ranking_system.rank_tfidf_dataframe(topK=5)
    
    for i, (_, row) in enumerate(search_results.iterrows(), 1):
        print(f"  {i}. Score: {row['score']:.4f} | Title: {row['document']}")
        pid = row['document']
        # Look up the product title; fall back to an empty string if the pid is not found.
        matches = df_fashion_products.loc[df_fashion_products['pid'] == pid, 'title']
        title = matches.iloc[0] if not matches.empty else ''
        
        all_search_results.append({
            'title': title,
            'pid': pid,
            'query_id': query_id,
            'labels': '' 
        })

print(f"\nTotal search results: {len(all_search_results)}")

# Save results to CSV file with the required columns
output_directory = input_directory
output_file = 'exercise_3a_search_results.csv'
output_path = os.path.join(output_directory, output_file)
results_df = pd.DataFrame(all_search_results)
results_df = results_df[['title', 'pid', 'query_id', 'labels']] 
results_df.to_csv(output_path, index=False)

print(f"\nSearch results saved to '{output_file}'")

EXERCISE 3a: SAMPLE QUERIES AND SEARCH
Sample Queries:
  3: slim men dark blue jeans
  4: adidas jacket
  5: women cotton dress
  6: black leather bag
  7: discount sport shoes

Search results for: 'slim men dark blue jeans'
  1. Score: 0.5291 | Title: JEAFTGSEW5KDCDJ2
  2. Score: 0.5067 | Title: TSHFESTT8E2GVHAJ
  3. Score: 0.4872 | Title: TSHFR6YBTZ5NY2HQ
  4. Score: 0.4804 | Title: JEAFEN3WVGHTJPY9
  5. Score: 0.4691 | Title: JEAFJV3HMBB9YA7G

Search results for: 'adidas jacket'
  1. Score: 0.4633 | Title: JCKFUFGWGPSMQ6FJ
  2. Score: 0.4619 | Title: JCKFMGFUMG2ZFYJG
  3. Score: 0.4537 | Title: JCKFQF3GRQYC4VVS
  4. Score: 0.4501 | Title: JCKFWZZMPCHQA4GB
  5. Score: 0.4441 | Title: JCKFRAJT3PCXS7T3

Search results for: 'women cotton dress'
  1. Score: 0.2618 | Title: KTAFPHHVKZZ5GEHY
  2. Score: 0.2505 | Title: KTAFPHHVSDSYVBV2
  3. Score: 0.2483 | Title: KTAFPHHVAYSHGU73
  4. Score: 0.2440 | Title: KTAFPHHVCARZAEZH
  5. Score: 0.2416 | Title: KTAFPHHV7GHGKP4C

Search results for: 

Exercise 3b: Load labeled results and evaluate search performance

In [12]:
print("EXERCISE 3b: EVALUATION OF LABELED RESULTS")
print("=" * 50)

# Load the labeled results
try:
    labeled_results_name = "exercise_3a_labels.csv"
    labeled_results_path = os.path.join(input_directory, labeled_results_name)
    labeled_results = pd.read_csv(labeled_results_path)
    print("Loaded labeled results successfully")
    print(f"Total labeled documents: {len(labeled_results)}")
    
    # Check if labels column exists and has values
    if 'labels' not in labeled_results.columns:
        print("ERROR: 'labels' column not found in the file")
        print("Please make sure the file has a 'labels' column with values 0 or 1")
    else:
        # Check for any missing labels
        missing_labels = labeled_results['labels'].isna().sum()
        if missing_labels > 0:
            print(f"Warning: {missing_labels} documents have missing labels")
            labeled_results['labels'] = labeled_results['labels'].fillna(0)
        
        print("Relevance distribution")
        print(f"Relevant documents (1): {sum(labeled_results['labels'] == 1)}")
        print(f"Non-relevant documents (0): {sum(labeled_results['labels'] == 0)}")
        
        # Prepare data for Evaluation class
        print("Recalculating scores for evaluation")
        documents_series = df_fashion_products.set_index("pid")["document"]
        ranking_system = Ranking(documents=documents_series)
        
        evaluation_data = []
        for query_id in labeled_results['query_id'].unique():
            query_data = labeled_results[labeled_results['query_id'] == query_id]
            query_text = proposed_queries[query_id]
            
            ranking_system.set_queries(pd.Series({query_id: query_text}))
            search_results = ranking_system.rank_tfidf_dataframe(topK=len(query_data))
            
            for _, row in query_data.iterrows():
                pid = row['pid']
                # Use the TF-IDF score if the document was retrieved, otherwise 0.0.
                matches = search_results.loc[search_results['document'] == pid, 'score']
                score = matches.iloc[0] if not matches.empty else 0.0
                
                evaluation_data.append({
                    'query_id': query_id,
                    'document': pid,
                    'score': score,
                    'labels': row['labels']
                })
        
        evaluation_df = pd.DataFrame(evaluation_data)
        
        # Evaluate each query using the Evaluation class
        print("Evaluation Metrics for all queries @ K=10:")
        print("-" * 45)
        
        for query_id in sorted(evaluation_df['query_id'].unique()):
            evaluator = Evaluation(evaluation_df, query_id, proposed_queries[query_id])
            evaluator.print_evaluation()
            print()
                
except FileNotFoundError:
    print("ERROR: 'exercise_3a_labels.csv' not found")
except Exception as e:
    print(f"Error: {e}")

EXERCISE 3b: EVALUATION OF LABELED RESULTS
Loaded labeled results successfully
Total labeled documents: 25
Relevance distribution
Relevant documents (1): 11
Non-relevant documents (0): 14
Recalculating scores for evaluation
Evaluation Metrics for all queries @ K=10:
---------------------------------------------
Query 3: slim men dark blue jeans
1: Precision at K: 0.300
2: Recall at K: 1.000
3: Average Precision at K: 0.700
4: F1 score at K: 0.462
5: Mean Average Precision at K: 0.513
6: Mean Reciprocal Rank at K: 0.600
7: Normal Discount Cumulative Gain at K: 0.853

Query 4: adidas jacket
1: Precision at K: 0.300
2: Recall at K: 1.000
3: Average Precision at K: 0.867
4: F1 score at K: 0.462
5: Mean Average Precision at K: 0.513
6: Mean Reciprocal Rank at K: 0.600
7: Normal Discount Cumulative Gain at K: 0.947

Query 5: women cotton dress
1: Precision at K: 0.500
2: Recall at K: 1.000
3: Average Precision at K: 1.000
4: F1 score at K: 0.667
5: Mean Average Precision at K: 0.513
6: Mean 

  return float(relevant) / total_relevant


Exercise 3c: Calculate system-level metrics and comprehensive analysis

In [13]:
# Exercise 3c: Calculate system-level metrics and comprehensive analysis
print("EXERCISE 3c: SYSTEM-LEVEL METRICS AND ANALYSIS")
print("=" * 50)

try:
    labeled_results_name = "exercise_3a_labels.csv"
    labeled_results_path = os.path.join(input_directory, labeled_results_name)
    labeled_results = pd.read_csv(labeled_results_path)
    
    documents_series = df_fashion_products.set_index("pid")["document"]
    ranking_system = Ranking(documents=documents_series)
    
    evaluation_data = []
    for query_id in labeled_results['query_id'].unique():
        query_data = labeled_results[labeled_results['query_id'] == query_id]
        query_text = proposed_queries[query_id]
        
        ranking_system.set_queries(pd.Series({query_id: query_text}))
        search_results = ranking_system.rank_tfidf_dataframe(topK=len(query_data))
        
        for _, row in query_data.iterrows():
            pid = row['pid']
            # Use the TF-IDF score if the document was retrieved, otherwise 0.0.
            matches = search_results.loc[search_results['document'] == pid, 'score']
            score = matches.iloc[0] if not matches.empty else 0.0
            
            evaluation_data.append({
                'query_id': query_id,
                'document': pid,
                'score': score,
                'labels': row['labels']
            })
    
    evaluation_df = pd.DataFrame(evaluation_data)
    
    evaluator_system = Evaluation(evaluation_df, evaluation_df['query_id'].iloc[0])
    
    map_score = evaluator_system.map_at_k(5)
    mrr_score = evaluator_system.mrr_at_k(5)
    
    print("SYSTEM-LEVEL EVALUATION METRICS:")
    print(f"Mean Average Precision (MAP@5): {map_score:.3f}")
    print(f"Mean Reciprocal Rank (MRR@5): {mrr_score:.3f}")
    
    print("COMPARISON WITH BASELINE METHODS:")
    print("-" * 40)
    
    indexing_system = Indexing(
        documents=df_fashion_products,
        identifier_column="pid", 
        text_column="document"
    )
    indexing_system.build_inverted_index()
    
    print("Inverted Index Results:")
    for query_id, query_text in proposed_queries.items():
        conjunctive_results = indexing_system.search_by_conjunctive_queries(query_text)
        print(f"Query {query_id}: {len(conjunctive_results)} documents")
    
    print("QUERY PERFORMANCE ANALYSIS:")
    print("-" * 30)
    
    query_performance = []
    for query_id in sorted(evaluation_df['query_id'].unique()):
        evaluator = Evaluation(evaluation_df, query_id)
        query_data = evaluation_df[evaluation_df['query_id'] == query_id]
        
        precision = evaluator.precision_at_k(5)
        num_relevant = sum(query_data['labels'] == 1)
        query_text = proposed_queries[query_id]
        
        query_performance.append({
            'query_id': query_id,
            'query_text': query_text,
            'precision@5': precision,
            'relevant_docs': num_relevant
        })
        
        print(f"Query {query_id}: '{query_text}'")
        print(f"Precision@5: {precision:.3f}")
        print(f"Relevant docs in pool: {num_relevant}")
        print()
    
    performance_df = pd.DataFrame(query_performance)
    avg_precision = performance_df['precision@5'].mean()
    
    print("SUMMARY STATISTICS:")
    print(f"Average Precision@5 across all queries: {avg_precision:.3f}")
    print(f"Total queries evaluated: {len(performance_df)}")
    print(f"Total relevant documents in pool: {sum(performance_df['relevant_docs'])}")
    
except FileNotFoundError:
    print("ERROR: Labeled results file not found")
    print("Please complete Exercise 3b first")
except Exception as e:
    print(f"Error in Exercise 3c: {e}")

EXERCISE 3c: SYSTEM-LEVEL METRICS AND ANALYSIS
SYSTEM-LEVEL EVALUATION METRICS:
Mean Average Precision (MAP@5): 0.513
Mean Reciprocal Rank (MRR@5): 0.600
COMPARISON WITH BASELINE METHODS:
----------------------------------------
Inverted Index Results:
Query 3: 44 documents
Query 4: 0 documents
Query 5: 468 documents
Query 6: 1 documents
Query 7: 0 documents
QUERY PERFORMANCE ANALYSIS:
------------------------------
Query 3: 'slim men dark blue jeans'
Precision@5: 0.600
Relevant docs in pool: 3

Query 4: 'adidas jacket'
Precision@5: 0.600
Relevant docs in pool: 3

Query 5: 'women cotton dress'
Precision@5: 1.000
Relevant docs in pool: 5

Query 6: 'black leather bag'
Precision@5: 0.000
Relevant docs in pool: 0

Query 7: 'discount sport shoes'
Precision@5: 0.000
Relevant docs in pool: 0

SUMMARY STATISTICS:
Average Precision@5 across all queries: 0.440
Total queries evaluated: 5
Total relevant documents in pool: 11
