# Project Part 1: Text Processing and Exploratory Data Analysis

You are provided with a document corpus, which is an e-commerce fashion products dataset.
You can see an example document in the appendix.

In [18]:
import pandas as pd
import json
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Usuari\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Usuari\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [41]:
FASHION_PRODUCTS_PATH = "../IRWA-2025-data/fashion_products_dataset.json"
df_fashion_products = pd.read_json(FASHION_PRODUCTS_PATH)
print(df_fashion_products.shape)
print(df_fashion_products.info())
display(df_fashion_products.head())

(28080, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28080 entries, 0 to 28079
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   _id              28080 non-null  object        
 1   actual_price     28080 non-null  object        
 2   average_rating   28080 non-null  object        
 3   brand            28080 non-null  object        
 4   category         28080 non-null  object        
 5   crawled_at       28080 non-null  datetime64[ns]
 6   description      28080 non-null  object        
 7   discount         28080 non-null  object        
 8   images           28080 non-null  object        
 9   out_of_stock     28080 non-null  bool          
 10  pid              28080 non-null  object        
 11  product_details  28080 non-null  object        
 12  seller           28080 non-null  object        
 13  selling_price    28080 non-null  object        
 14  sub_category     28080 non

Unnamed: 0,_id,actual_price,average_rating,brand,category,crawled_at,description,discount,images,out_of_stock,pid,product_details,seller,selling_price,sub_category,title,url
0,fa8e22d6-c0b6-5229-bb9e-ad52eda39a0a,2999,3.9,York,Clothing and Accessories,2021-02-10 20:11:51,Yorker trackpants made from 100% rich combed c...,69% off,[https://rukminim1.flixcart.com/image/128/128/...,False,TKPFCZ9EA7H5FYZH,"[{'Style Code': '1005COMBO2'}, {'Closure': 'El...",Shyam Enterprises,921,Bottomwear,Solid Women Multicolor Track Pants,https://www.flipkart.com/yorker-solid-men-mult...
1,893e6980-f2a0-531f-b056-34dd63fe912c,1499,3.9,York,Clothing and Accessories,2021-02-10 20:11:52,Yorker trackpants made from 100% rich combed c...,66% off,[https://rukminim1.flixcart.com/image/128/128/...,False,TKPFCZ9EJZV2UVRZ,"[{'Style Code': '1005BLUE'}, {'Closure': 'Draw...",Shyam Enterprises,499,Bottomwear,Solid Men Blue Track Pants,https://www.flipkart.com/yorker-solid-men-blue...
2,eb4c8eab-8206-59d0-bcd1-a724d96bf74f,2999,3.9,York,Clothing and Accessories,2021-02-10 20:11:52,Yorker trackpants made from 100% rich combed c...,68% off,[https://rukminim1.flixcart.com/image/128/128/...,False,TKPFCZ9EHFCY5Z4Y,"[{'Style Code': '1005COMBO4'}, {'Closure': 'El...",Shyam Enterprises,931,Bottomwear,Solid Men Multicolor Track Pants,https://www.flipkart.com/yorker-solid-men-mult...
3,3f3f97bb-5faf-57df-a9ff-1af24e2b1045,2999,3.9,York,Clothing and Accessories,2021-02-10 20:11:53,Yorker trackpants made from 100% rich combed c...,69% off,[https://rukminim1.flixcart.com/image/128/128/...,False,TKPFCZ9ESZZ7YWEF,"[{'Style Code': '1005COMBO3'}, {'Closure': 'El...",Shyam Enterprises,911,Bottomwear,Solid Women Multicolor Track Pants,https://www.flipkart.com/yorker-solid-men-mult...
4,750caa3d-6264-53ca-8ce1-94118a1d8951,2999,3.9,York,Clothing and Accessories,2021-02-10 20:11:53,Yorker trackpants made from 100% rich combed c...,68% off,[https://rukminim1.flixcart.com/image/128/128/...,False,TKPFCZ9EVXKBSUD7,"[{'Style Code': '1005COMBO1'}, {'Closure': 'Dr...",Shyam Enterprises,943,Bottomwear,"Solid Women Brown, Grey Track Pants",https://www.flipkart.com/yorker-solid-men-brow...


In [None]:
VALIDATION_LABELS_PATH = "../IRWA-2025-data/validation_labels.csv"
df_validation_labels = pd.read_csv(VALIDATION_LABELS_PATH)
print(df_validation_labels.shape)
print(df_validation_labels.info())
display(df_validation_labels.head())

(40, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     40 non-null     object
 1   pid       40 non-null     object
 2   query_id  40 non-null     int64 
 3   labels    40 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 1.4+ KB
None


Unnamed: 0,title,pid,query_id,labels
0,Full Sleeve Printed Women Sweatshirt,SWSFFVKBCQG5FHPF,1,1
1,Full Sleeve Striped Women Sweatshirt,SWSFJY5ZFHQ7HXKW,1,0
2,Full Sleeve Printed Women Sweatshirt,SWSFUY89NHMZHZPX,1,1
3,Full Sleeve Graphic Print Women Sweatshirt,SWSFXQ5YX6RZKHP4,1,1
4,Full Sleeve Solid Women Sweatshirt,JCKFTZBC3DMCVYXH,1,0


## PART 1: Data Preparation

1. As a first step, you must pre-process the documents. In particular, for the text fields (title,
description) you should apply:
- Stop-word removal
- Tokenization
- Punctuation removal
- Stemming
- Anything else you think is needed (bonus point)

In [None]:
def process_text(line: str) -> list[str]:
    """
    Preprocess a text: lowercase it, split it on single spaces,
    remove English stop words and stem the remaining tokens.

    Parameters:
        line (str): Text to be preprocessed.

    Returns:
        list[str]: Tokens corresponding to the input text after preprocessing.
    """

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))
    line = line.lower()
    line = line.split(" ")
    line = [word for word in line if word not in stop_words]
    line = [stemmer.stem(word) for word in line]
    return line
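
The function above splits on single spaces, so punctuation stays attached to tokens (the `100%` token in the processed descriptions is one example) and repeated spaces produce empty tokens. Below is a minimal sketch of a variant that tokenizes with a regular expression and drops punctuation in the same pass. The name `tokenize_without_punctuation`, the tiny `DEMO_STOP_WORDS` set and the optional `stem` callable are hypothetical and only keep the example self-contained; in this notebook the NLTK stop-word list and `PorterStemmer().stem` could presumably be passed instead.

```python
import re
from typing import Callable, Iterable, Optional

# Hypothetical miniature stop-word list, used only to keep the sketch self-contained.
DEMO_STOP_WORDS = frozenset({"a", "an", "and", "from", "of", "the", "with"})

# A token is a run of lowercase letters or digits; everything else acts as a separator.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize_without_punctuation(
    text: str,
    stop_words: Iterable[str] = DEMO_STOP_WORDS,
    stem: Optional[Callable[[str], str]] = None,
) -> list[str]:
    """Lowercase, keep alphanumeric runs only, drop stop words, optionally stem."""
    stop_words = set(stop_words)
    tokens = TOKEN_PATTERN.findall(text.lower())
    tokens = [token for token in tokens if token not in stop_words]
    if stem is not None:
        tokens = [stem(token) for token in tokens]
    return tokens


print(tokenize_without_punctuation("Trackpants made from 100% rich, combed cotton!"))
```

Whether `100%` should become `100` or be kept as a single token is a design choice; the pattern above is only one option.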

In [None]:
# WARNING: These lines are computationally expensive. Expected time: 1m 40s.
fashion_products_title_tokens = df_fashion_products["title"].apply(process_text)
fashion_products_description_tokens = df_fashion_products["description"].apply(process_text)

In [54]:
# Use .copy() so each stage is an independent frame; pd.DataFrame(df) does not guarantee a copy.
df_fashion_products_01 = df_fashion_products.copy()
df_fashion_products_01['title'] = fashion_products_title_tokens

df_fashion_products_02 = df_fashion_products_01.copy()
df_fashion_products_02['description'] = fashion_products_description_tokens

2. Take into account that for future queries, the final output must return (when present) the
following information for each of the selected documents:
pid, title, description, brand, category, sub_category, product_details, seller, out_of_stock,
selling_price, discount, actual_price, average_rating, url

In [90]:
selected_attributes = [
    "pid", "title", "description",  "brand",  "category", "sub_category", "product_details", 
    "seller", "out_of_stock", "selling_price", "discount", "actual_price", "average_rating", "url"
]

# Keep only the attributes required in the final output, as an independent frame.
df_fashion_products_03 = df_fashion_products_02[selected_attributes].copy()

In [94]:
display(df_fashion_products_03.head(3))

Unnamed: 0,pid,title,description,brand,category,sub_category,product_details,seller,out_of_stock,selling_price,discount,actual_price,average_rating,url
0,TKPFCZ9EA7H5FYZH,"[solid, women, multicolor, track, pant]","[yorker, trackpant, made, 100%, rich, comb, co...",York,Clothing and Accessories,Bottomwear,"[{'Style Code': '1005COMBO2'}, {'Closure': 'El...",Shyam Enterprises,False,921,69% off,2999,3.9,https://www.flipkart.com/yorker-solid-men-mult...
1,TKPFCZ9EJZV2UVRZ,"[solid, men, blue, track, pant]","[yorker, trackpant, made, 100%, rich, comb, co...",York,Clothing and Accessories,Bottomwear,"[{'Style Code': '1005BLUE'}, {'Closure': 'Draw...",Shyam Enterprises,False,499,66% off,1499,3.9,https://www.flipkart.com/yorker-solid-men-blue...
2,TKPFCZ9EHFCY5Z4Y,"[solid, men, multicolor, track, pant]","[yorker, trackpant, made, 100%, rich, comb, co...",York,Clothing and Accessories,Bottomwear,"[{'Style Code': '1005COMBO4'}, {'Closure': 'El...",Shyam Enterprises,False,931,68% off,2999,3.9,https://www.flipkart.com/yorker-solid-men-mult...


3. Decide how to handle the fields category, sub_category, brand, product_details, and seller 
during pre-processing. Should they be merged into a single text field, indexed as separate fields 
in the inverted index or any other alternative? Justify your choice, considering how their 
distinctiveness may affect retrieval effectiveness. What are the pros and cons of each approach?
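
One way to compare the two options named in the question is to build both kinds of index on a toy example. The sketch below assumes each document is a dict of already tokenized fields; `build_merged_index` and `build_fielded_index` are hypothetical helper names, and the second document (`DEMO_PID`) is invented purely for contrast.

```python
from collections import defaultdict

# Toy documents: the first mirrors a row shown above, the second is invented for contrast.
docs = {
    "TKPFCZ9EA7H5FYZH": {
        "brand": ["york"],
        "seller": ["shyam", "enterprises"],
        "sub_category": ["bottomwear"],
    },
    "DEMO_PID": {
        "brand": ["shyam"],
        "seller": ["york", "traders"],
        "sub_category": ["topwear"],
    },
}


def build_merged_index(docs):
    """Single text field: each term maps to the documents containing it in any field."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for tokens in fields.values():
            for token in tokens:
                index[token].add(doc_id)
    return index


def build_fielded_index(docs):
    """Separate fields: the posting key is (field, term), so field identity is preserved."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, tokens in fields.items():
            for token in tokens:
                index[(field, token)].add(doc_id)
    return index


merged = build_merged_index(docs)
fielded = build_fielded_index(docs)

print(sorted(merged["york"]))               # both documents match the bare term
print(sorted(fielded[("brand", "york")]))   # only the product whose brand is York
```

In this sketch the merged index is smaller and simpler to query, but it cannot tell a brand match from a seller match. The fielded index keeps that distinction, and would allow per-field weighting, at the cost of more posting lists.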

In [115]:
pre_processed_fields = ["category", "sub_category", "brand", "seller"]

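# Concatenate the four categorical fields into one space-separated string per product.
merged_fields = df_fashion_products_03[pre_processed_fields].apply(" ".join, axis=1)

# .copy() keeps df_fashion_products_03 unchanged when the new column is added.
df_fashion_products_04 = df_fashion_products_03.copy()
df_fashion_products_04["merged_category_brand_seller"] = merged_fields
df_fashion_products_04 = df_fashion_products_04.drop(columns=pre_processed_fields)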

In [116]:
display(df_fashion_products_04.head(3))

Unnamed: 0,pid,title,description,product_details,out_of_stock,selling_price,discount,actual_price,average_rating,url,merged_category_brand_seller
0,TKPFCZ9EA7H5FYZH,"[solid, women, multicolor, track, pant]","[yorker, trackpant, made, 100%, rich, comb, co...","[{'Style Code': '1005COMBO2'}, {'Closure': 'El...",False,921,69% off,2999,3.9,https://www.flipkart.com/yorker-solid-men-mult...,Clothing and Accessories Bottomwear York Shyam...
1,TKPFCZ9EJZV2UVRZ,"[solid, men, blue, track, pant]","[yorker, trackpant, made, 100%, rich, comb, co...","[{'Style Code': '1005BLUE'}, {'Closure': 'Draw...",False,499,66% off,1499,3.9,https://www.flipkart.com/yorker-solid-men-blue...,Clothing and Accessories Bottomwear York Shyam...
2,TKPFCZ9EHFCY5Z4Y,"[solid, men, multicolor, track, pant]","[yorker, trackpant, made, 100%, rich, comb, co...","[{'Style Code': '1005COMBO4'}, {'Closure': 'El...",False,931,68% off,2999,3.9,https://www.flipkart.com/yorker-solid-men-mult...,Clothing and Accessories Bottomwear York Shyam...


In [117]:
def flatten_details(details: list) -> dict:
    """
    Converts a list of dictionaries [{a: 1}, {b: 2}] into a single dictionary {a: 1, b: 2}.
    
    Parameters:
        details (list): A list containing atomic dictionaries.

    Returns:
        dict: A merged dict.
    """
    if not isinstance(details, list):
        return {}
    merged = {}
    for detail in details:
        if isinstance(detail, dict):
            merged.update(detail)
    return merged

In [None]:
# One column per detail key; products lacking a key get NaN in that column.
details_df = df_fashion_products_04["product_details"].apply(flatten_details).apply(pd.Series)

# Join only the values that are present: astype(str) alone would turn NaN into the token "nan".
merged_details = details_df.apply(
    lambda row: " ".join(str(value) for value in row.dropna()), axis=1
)

# Replace the nested product_details column with the merged text field.
df_fashion_products_05 = df_fashion_products_04.drop(columns=["product_details"])
df_fashion_products_05["merged_product_details"] = merged_details

In [102]:
display(df_fashion_products_05.head(3))

Unnamed: 0,pid,title,description,product_details,out_of_stock,selling_price,discount,actual_price,average_rating,url,...,Brand,Model Number,Shade,Thumb Hole,Length,Strap Material,Weave type,Fabric care,Coat Type,merged_product_details
0,TKPFCZ9EA7H5FYZH,"[solid, women, multicolor, track, pant]","[yorker, trackpant, made, 100%, rich, comb, co...","[{'Style Code': '1005COMBO2'}, {'Closure': 'El...",False,921,69% off,2999,3.9,https://www.flipkart.com/yorker-solid-men-mult...,...,,,,,,,,,,1005COMBO2 Elastic Side Pockets Cotton Blend S...
1,TKPFCZ9EJZV2UVRZ,"[solid, men, blue, track, pant]","[yorker, trackpant, made, 100%, rich, comb, co...","[{'Style Code': '1005BLUE'}, {'Closure': 'Draw...",False,499,66% off,1499,3.9,https://www.flipkart.com/yorker-solid-men-blue...,...,,,,,,,,,,"1005BLUE Drawstring, Elastic Side Pockets Cott..."
2,TKPFCZ9EHFCY5Z4Y,"[solid, men, multicolor, track, pant]","[yorker, trackpant, made, 100%, rich, comb, co...","[{'Style Code': '1005COMBO4'}, {'Closure': 'El...",False,931,68% off,2999,3.9,https://www.flipkart.com/yorker-solid-men-mult...,...,,,,,,,,,,1005COMBO4 Elastic Side Pockets Cotton Blend S...


4. Consider the fields out_of_stock, selling_price, discount, actual_price, and average_rating. 
Decide how these should be handled during pre-processing to use in further search. Should 
they be indexed as textual terms?
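
A hedged sketch for question 4: one option is to convert these fields to numeric values so they can be used for filtering and ranking rather than being indexed as text terms. It assumes that prices may arrive as strings with thousands separators, that discounts follow the `69% off` form seen above, and that missing values are empty strings; none of this is confirmed by the output shown (the `object` dtype only suggests strings). The helper names `parse_number` and `parse_discount` are hypothetical.

```python
import re
from typing import Optional

# Plain non-negative decimal number, e.g. "2999" or "3.9".
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def parse_number(value) -> Optional[float]:
    """Convert '2,999', '3.9' or 921 to a float; return None for anything else."""
    text = str(value).replace(",", "").strip()
    return float(text) if NUMBER_PATTERN.fullmatch(text) else None


def parse_discount(value) -> Optional[int]:
    """Extract the percentage from strings such as '69% off'; None if absent."""
    match = re.search(r"(\d+)\s*%", str(value))
    return int(match.group(1)) if match else None


# Example row; the comma in actual_price is an assumed input format.
row = {
    "actual_price": "2,999",
    "selling_price": "921",
    "discount": "69% off",
    "average_rating": "3.9",
    "out_of_stock": False,
}

parsed = {
    "actual_price": parse_number(row["actual_price"]),
    "selling_price": parse_number(row["selling_price"]),
    "discount": parse_discount(row["discount"]),
    "average_rating": parse_number(row["average_rating"]),
    "out_of_stock": bool(row["out_of_stock"]),  # already boolean in the dataset
}
print(parsed)
```

In the notebook these helpers could be applied column-wise, for example with `df["discount"].apply(parse_discount)`.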

## PART 2: Exploratory Data Analysis  

When working with data, it is important to have a better understanding of the content and
some statistics. Provide an exploratory data analysis to describe the dataset you are working on
in this project and explain the decisions made for the analysis. For example, word counting
distribution, average sentence length, vocabulary size, ranking of products based on rating,
price, discount, top sellers and brands, out_of_stock distribution, word clouds for the most
frequent words, and entity recognition. Feel free to do the exploratory analysis and report your
findings in the report.
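
A few of the statistics listed above (vocabulary size, average length, most frequent terms) can be sketched with the standard library alone. The example below runs on the three processed titles shown earlier; `corpus_statistics` is a hypothetical helper name, and on the full dataset the token lists would presumably come from the processed `title` or `description` columns.

```python
from collections import Counter
from statistics import mean

# The three processed titles displayed earlier in the notebook.
token_lists = [
    ["solid", "women", "multicolor", "track", "pant"],
    ["solid", "men", "blue", "track", "pant"],
    ["solid", "men", "multicolor", "track", "pant"],
]


def corpus_statistics(token_lists, top_n=3):
    """Return basic corpus statistics for a list of token lists."""
    term_counts = Counter(token for tokens in token_lists for token in tokens)
    lengths = [len(tokens) for tokens in token_lists]
    return {
        "documents": len(token_lists),
        "vocabulary_size": len(term_counts),   # number of distinct terms
        "average_length": mean(lengths),       # mean tokens per document
        "top_terms": term_counts.most_common(top_n),
    }


stats = corpus_statistics(token_lists)
print(stats)
```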