In [60]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.metrics import jaccard_score
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import pairwise_distances
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder
from nltk.metrics import edit_distance
# Load the dataset
df = pd.read_csv("dataset/Amazon-clothing-info.csv")

## **Study 1 – Similarity Measures on Amazon Apparel Metadata**

### **Objective**  
Explore several similarity measures across selected metadata attributes of the `Amazon-clothing-info.csv` dataset. Each measure identifies similar items from a different feature (such as price, brand, or title), which is a core component of content-based recommendation systems.

---

### **Selected Attributes and Corresponding Similarity Measures**

| Attribute              | Description                                  | Type       | Similarity Measure Used       |
|------------------------|----------------------------------------------|------------|-------------------------------|
| `formatted_price`      | Product price                                | Numerical  | Euclidean / Manhattan         |
| `color`                | Color name                                   | Categorical| Jaccard Similarity            |
| `brand`                | Brand name                                   | Categorical| Cosine Similarity (TF-IDF)    |
| `product_type_name`    | Product category/type                        | Categorical| Cosine Similarity (One-Hot)   |
| `title`                | Product title                                | Textual    | Edit Distance (Levenshtein)   |

---

### **Preprocessing Notes**

- `formatted_price`: Prices were cleaned and converted from string format (e.g., "$25.99") to float.
- Categorical columns like `color`, `brand`, and `product_type_name` were standardized (lowercase, trimmed spaces).
- `title`: Titles were lowercased and stripped of surrounding whitespace before computing edit distance.
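
A minimal sketch of the kind of string normalization described in these notes. `normalize_label` and `normalize_title` are hypothetical helpers introduced for illustration, not functions from the notebook, and the punctuation stripping is an optional assumption.

```python
import string

def normalize_label(value):
    """Lowercase and trim a categorical label; None becomes an empty string."""
    if value is None:
        return ""
    return str(value).strip().lower()

def normalize_title(value, strip_punct=False):
    """Lowercase and trim a title, optionally removing ASCII punctuation."""
    text = normalize_label(value)
    if strip_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse any runs of whitespace into single spaces
    return " ".join(text.split())

print(normalize_label("  NIKE "))
print(normalize_title("Oxford Shirt, Long-Sleeve!", strip_punct=True))
```

Normalizing before comparison keeps values such as `"NIKE"` and `"Nike"` from being treated as different categories.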

---

### **Simulation of 5 Example Similarity Queries**

Below are five requests simulating practical recommendation queries. For each, we compute pairwise similarity to a given item and show the **Top 10 most similar results** based on the selected metric:

1. **Show me clothing with similar price to an item priced at $9.98**  
   → *(Using Euclidean/Manhattan distance on `formatted_price`)*

2. **Show me clothing of the same color as "Pink"**  
   → *(Using Jaccard similarity on `color`)*

3. **Show me clothing from similar brands to "Nike"**  
   → *(Using Cosine similarity on TF-IDF vectors of `brand`)*

4. **Show me clothing in the same category as "pants"**  
   → *(Using Cosine similarity on one-hot encoded `product_type_name`)*

5. **Show me clothing with similar title to "Oxford Shirt"**  
   → *(Using Edit Distance on `title`)*

Each result set will include relevant metadata:  
**`asin`, `title`, `formatted_price`, `color`, `brand`, `product_type_name`**





---

### **Similarity 1: Price Similarity (Euclidean Distance)**

We measure price similarity with **Euclidean distance**. The `formatted_price` column stores prices as strings (e.g., "$25.99"), so we first clean them and convert them to floats.

Because price is a single number, the Euclidean distance between two items reduces to the absolute difference of their prices (and equals the Manhattan distance). Sorting by this distance returns the products closest in price to the query.
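
As a quick check of why Euclidean and Manhattan distance are listed together for price, the sketch below computes both on one-dimensional points. The prices are illustrative values, not rows from the dataset.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

target = [25.99]                                # query price as a 1-D vector
prices = [[9.98], [24.50], [25.99], [31.00]]    # illustrative prices

for p in prices:
    # With a single feature, both distances equal |price - target|
    print(p[0], round(euclidean(target, p), 2), round(manhattan(target, p), 2))
```

With only one feature the two metrics give the same ranking, so the choice between them matters only once several numeric attributes are combined.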




In [61]:
# --- Preprocessing: Clean and convert 'formatted_price' to numerical float ---
def clean_price(price_str):
    try:
        return float(str(price_str).replace("$", "").replace(",", "").strip())
    except ValueError:  # non-numeric strings become NaN
        return np.nan

df["formatted_price_clean"] = df["formatted_price"].apply(clean_price)

# Drop rows with missing prices
df_price = df.dropna(subset=["formatted_price_clean"]).copy()

# --- Function: Find Top 10 items with similar price ---
def find_similar_price_items(target_price, top_n=10):
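    # Absolute price difference (equals Euclidean distance in one dimension)
    distances = (df_price["formatted_price_clean"] - target_price).abs()
    # assign() returns a new frame, so df_price itself is left unchanged
    similar_items = df_price.assign(price_distance=distances).sort_values("price_distance").head(top_n)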

    return similar_items[["asin", "title", "formatted_price", "color", "brand", "product_type_name", "price_distance"]]

In [62]:
# Example query: Find items similar to $9.98
similar_price_df = find_similar_price_items(9.98)

# Display results
similar_price_df.reset_index(drop=True)


Unnamed: 0,asin,title,formatted_price,color,brand,product_type_name,price_distance
0,B00ZZR0UGC,Fire Women's Long-sleeve Print Chiffon Fashion...,$9.98,White,Fire,SHIRT,0.0
1,B01M111WN8,Pikolai Vintage Women Long Sleeve Cotton blend...,$9.98,Black,Pikolai,SHIRT,0.0
2,B01M0P91TT,Pikolai Vintage Women Long Sleeve Cotton blend...,$9.98,Red,Pikolai,SHIRT,0.0
3,B071WKG6G6,Mossimo Supply Co Gold Fish Orange Lace Tank/C...,$9.98,Orange,Mossimo Supply Co,SHIRT,0.0
4,B016L14J2G,"Badger Women's Performance Racerback Tank Top,...",$9.98,Electric Blue,Badger,SHIRT,0.0
5,B071X6ZF3G,Reel Legends Womens Freeline V-Neck Top Small ...,$9.98,Black,Reel Legends,SHIRT,0.0
6,B01LZQ4YLK,Pikolai Women Cotton blend Casual Loose Printi...,$9.98,Black,Pikolai,SHIRT,0.0
7,B016L14KY8,"Badger Women's Performance Racerback Tank Top,...",$9.98,Electric Blue,Badger,SHIRT,0.0
8,B0177DOFK8,Navy Strp Racerbck Snit Small Navy,$9.98,Navy,BCX/Byer California,SHIRT,0.0
9,B0721BBB3C,Reel Legends Womens Freeline V-Neck Top Small ...,$9.98,Pink Glow,Reel Legends,SHIRT,0.0



---

### **Similarity 2: Color Similarity (Jaccard Similarity)**

To compare the **`color`** attribute, we use **Jaccard similarity**. Since colors can contain multiple values (e.g., `"Red/Black"`), we tokenize them into sets of individual color tokens.

The **Jaccard similarity** is computed as:

> Jaccard(A, B) = |A ∩ B| / |A ∪ B|

This helps identify products that share similar color combinations.
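
The formula can be checked by hand on small sets. The sketch below is a set-based illustration under stated assumptions: `color_tokens` and `jaccard` are hypothetical helpers, and the lowercasing of tokens is an added assumption that the notebook cell does not apply.

```python
def color_tokens(color):
    """Split a color string such as "Red/Black" into a set of lowercase tokens."""
    return {t.strip().lower() for t in color.split("/") if t.strip()}

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

ref = color_tokens("Pink")
print(jaccard(ref, color_tokens("Pink")))        # 1.0 (identical sets)
print(jaccard(ref, color_tokens("Pink/White")))  # 0.5 (1 shared token out of 2)
print(jaccard(ref, color_tokens("Black")))       # 0.0 (no shared token)
```

A single-color query therefore scores 1.0 only against items with exactly that color, and progressively less as an item lists additional colors.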


In [63]:
# Clean and preprocess
df_color = df[["asin", "title", "color", "brand", "product_type_name"]].copy()
df_color = df_color.dropna(subset=["color"])
df_color = df_color.reset_index(drop=True)

# Convert color strings to sets (e.g., "Red/White" → {"Red", "White"})
df_color["color_set"] = df_color["color"].apply(lambda x: set(x.split("/")))

# Encode with MultiLabelBinarizer
mlb = MultiLabelBinarizer()
color_encoded = mlb.fit_transform(df_color["color_set"])

# Choose the reference item: the first row whose color contains "Pink" (case-insensitive)
reference_index = df_color[df_color["color"].str.contains("Pink", case=False)].index[0]
reference_vector = color_encoded[reference_index]

# Compute Jaccard similarity between reference and all others
similarities = []
for i in range(len(color_encoded)):
    if i == reference_index:
        similarities.append(0)  # skip self-comparison
    else:
        sim = jaccard_score(reference_vector, color_encoded[i])
        similarities.append(sim)

df_color["color_similarity"] = similarities

# Show top 10 most similar products by color
top10_color_similar = df_color.sort_values("color_similarity", ascending=False).head(10)
top10_color_similar[["asin", "title", "color", "brand", "product_type_name", "color_similarity"]]


Unnamed: 0,asin,title,color,brand,product_type_name,color_similarity
1840,B01E72S5CG,Anskan Women's Spartan Logo Red T-shirt XL Pink,Pink,Anskan,BOOKS_1973_AND_LATER,1.0
761,B01AAZN97K,Active Basic Womens Basic Deep Scoop Neck with...,Pink,Active Products,SHIRT,1.0
4932,B019X6XUTE,INC Womens Plus Embroidered Open Sleeve Peasan...,Pink,INC International Concepts,APPAREL,1.0
10065,B0719L6BN2,Wayf Women's Large Textured Cold-Shoulder Blou...,Pink,WAYF,SHIRT,1.0
7583,B01JZXBZDS,Lush Neon Small Junior Keyhole-Back High-Low B...,Pink,Lush Clothing,SHIRT,1.0
28357,B0759DZSK6,Theory Women's Medium Layered Surplice Blouse ...,Pink,Theory,SHIRT,1.0
7161,B01MTKEW91,Ro & De Womens Medium Scoop-Neck Open-Back Blo...,Pink,Rode,SHIRT,1.0
27662,B01NAWEG5M,TOOGOO(R) Women's spring autumn women long sle...,Pink,TOOGOO(R),OUTERWEAR,1.0
27688,B0759V9R3K,Tommy Hilfiger Plaid Women's Button Down Shirt...,Pink,Tommy Hilfiger,SHIRT,1.0
6939,B06XXSNGLZ,She'sModa Elegant Ruffles Bandage Pink Slim Pu...,Pink,She'sModa,SHIRT,1.0


### **Similarity 3 – Brand Similarity with TF-IDF and Cosine Similarity**

We want to find clothing items that come from **brands textually similar to "Nike"**. To do this, we:
- Apply **TF-IDF vectorization** on the `brand` attribute to convert brand names into numerical vectors.
- Compute **cosine similarity** between the TF-IDF vector of "Nike" and all other brand vectors.
- Display the top 10 closest matches based on highest cosine similarity.

**Cosine similarity** suits text-based features because it measures the angle between two vectors regardless of their length. Brand names that share a token with the query (e.g., `"Nike Golf"`, `"Nike Sportswear"`) receive a partial score between 0 and 1, while brands made up of the same tokens score 1 regardless of letter case.
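
To make the scoring concrete, here is a simplified pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is not the `TfidfVectorizer` implementation: the tokenizer, the smoothed IDF formula, and the L2 normalization are assumptions of this sketch, and the four brand strings are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens of two or more characters (assumption of this sketch)
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    """Return one L2-normalized {token: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        vec = {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({tok: w / norm for tok, w in vec.items()} if norm else {})
    return vectors

def cosine(u, v):
    # Both vectors are unit length, so the dot product is the cosine
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())

brands = ["Nike", "NIKE", "Nike Golf", "Adidas"]
vecs = tfidf_vectors(brands)
scores = [cosine(vecs[0], v) for v in vecs]
for brand, score in zip(brands, scores):
    print(f"{brand}: {score:.3f}")
```

Brands consisting only of the token `nike` match the query exactly, `"Nike Golf"` matches partially because the extra token `golf` adds an unshared dimension, and `"Adidas"` shares nothing with the query.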


In [64]:
# Fill missing brand values with empty string
df_brand = df[["brand", "title", "formatted_price"]].fillna("")

# TF-IDF vectorization on 'brand'
tfidf = TfidfVectorizer()
brand_tfidf = tfidf.fit_transform(df_brand["brand"])

# Row of the first item whose brand is exactly "nike" (label equals position with the default RangeIndex)
nike_index = df_brand[df_brand["brand"].str.lower() == "nike"].index[0]

# Compute cosine similarity
cosine_sim = cosine_similarity(brand_tfidf[nike_index], brand_tfidf).flatten()

# Get top 10 similar brands including the original
top_indices = cosine_sim.argsort()[::-1][:10]

# Display the results
df_brand.iloc[top_indices].assign(similarity=cosine_sim[top_indices])


Unnamed: 0,brand,title,formatted_price,similarity
10984,NIKE,Nike Ohio State Buckeyes 2017 Women's Medium P...,$50.00,1.0
8941,NIKE,Nike Womens S/S Polo II XX-Large Black/White,$14.95,1.0
8328,NIKE,Nike Women's Gung-Ho Polo Sky Blue/White XL (L),$29.99,1.0
7923,Nike,Nike Pro Fierce Lux Dot Bra Womens Style: 6583...,$34.77,1.0
6780,NIKE,Nike Women's Victory Golf Short Sleeve Polo Sh...,$49.99,1.0
5275,Nike,Nike Get Fit Lux Womens Tank Top Size XS,$23.91,1.0
3655,NIKE,Nike Victory Golf Polo 2015 Pewter Grey X-Large,$29.95,1.0
6385,Nike Golf,Women's Bold Stripe Polo,$20.00,0.672728
6232,Nike Golf,Nike Womens Dri-FIT Micro Pique Polo (354067) ...,$46.52,0.672728
5742,Nike Golf,Nike Golf Women's Sport Polo Two (Chlorine Blu...,$60.00,0.672728



---

### **Similarity 4 – Category Similarity with Cosine Similarity**

We want to find clothing items that belong to **the same category as "pants"**. To do this, we:
- Apply **one-hot encoding** on the `product_type_name` attribute to convert categories into binary vectors.
- Compute **cosine similarity** between the encoded vector for "pants" and all other category vectors.
- Display the top 10 closest matches based on the highest cosine similarity.

**Cosine similarity** measures the angle between two vectors. Each one-hot vector has exactly one non-zero entry, so the similarity is 1 when two items share a category and 0 otherwise. On a single categorical attribute the measure therefore acts as an exact-match filter, and all items in the matching category tie at 1.0.
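
The behavior of cosine similarity on one-hot vectors can be verified with a small sketch. The three-category vocabulary is illustrative, and `one_hot` and `cosine` are hypothetical helpers rather than the scikit-learn functions used in the notebook.

```python
import math

def one_hot(category, vocabulary):
    """Binary vector with a single 1.0 at the position of `category`."""
    return [1.0 if category == c else 0.0 for c in vocabulary]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vocab = ["PANTS", "SHIRT", "OUTERWEAR"]   # illustrative category vocabulary
ref = one_hot("PANTS", vocab)
sims = {c: cosine(ref, one_hot(c, vocab)) for c in vocab}
print(sims)  # {'PANTS': 1.0, 'SHIRT': 0.0, 'OUTERWEAR': 0.0}
```

Because the scores are binary, the order among the matching items is arbitrary; a graded ranking would need additional attributes in the vector.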

In [65]:
# Fill missing values
df_category = df[["product_type_name", "title", "formatted_price"]].fillna("")

# One-hot encode 'product_type_name'
encoder = OneHotEncoder(sparse_output=False)
category_encoded = encoder.fit_transform(df_category[["product_type_name"]])

# Get the index of the reference category
pants_index = df_category[df_category["product_type_name"].str.lower() == "pants"].index[0]

# Compute cosine similarity
category_sim = cosine_similarity([category_encoded[pants_index]], category_encoded).flatten()

# Get top 10 results (including 'pants' item itself)
top_indices = category_sim.argsort()[::-1][:10]

# Display results
df_category.iloc[top_indices].assign(similarity=category_sim[top_indices])


Unnamed: 0,product_type_name,title,formatted_price,similarity
24384,PANTS,GENERATION LOVE Womens Tribal Pattern Lace Ins...,$59.00,1.0
18343,PANTS,New 2 Pc Womens Racerback Tank Top Seamless Sl...,$19.56,1.0
17856,PANTS,Dickies - 1254 Women's Button-Down Oxford Shir...,$19.99,1.0
25120,PANTS,Victoria's Secret Pink NEW Bling Muscle Tee Ta...,$49.50,1.0
21065,PANTS,Cable & Gauge Womens Crossover Chiffon-Hem Hig...,$28.95,1.0
18428,PANTS,"Ideology Women's Graphic Training Leggings, No...",$49.50,1.0
24187,PANTS,Victoria's Secret Pink NEW Bling Muscle Tee Ta...,$49.50,1.0
26407,PANTS,"Tossed Pocket Print Top Size: Large, Color: Gemma",$22.99,1.0
19250,PANTS,Victoria's Secret Pink NEW Muscle tee Tank Col...,$39.50,1.0
20262,PANTS,New 3 Pc Womens Racerback Tank Top Seamless Sl...,$27.89,1.0


### **Similarity 5 – Title Similarity with Edit Distance**

We want to find clothing items with **titles similar to "Oxford Shirt"**. To achieve this, we:

- Preprocess the `title` column by converting all entries to lowercase and removing surrounding whitespace.
- Compute the **Levenshtein (edit) distance** between `"Oxford Shirt"` and each product title.
- Normalize the edit distance to a similarity score using the formula:
  
  $$
  \text{similarity} = 1 - \frac{\text{edit\_distance}}{\max(\text{len(title)}, \text{len(reference)})}
  $$


- Display the top 10 titles with the highest similarity scores.
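
The normalization formula can be illustrated with a self-contained Levenshtein implementation. This is a sketch using the standard dynamic-programming recurrence, not the `nltk` function called in the notebook, and the example titles are illustrative.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    """1 - distance / max length; 0.0 when both strings are empty."""
    longest = max(len(a), len(b))
    return 1 - levenshtein(a, b) / longest if longest else 0.0

reference = "oxford shirt"
for title in ["oxford shirt", "oxford shirts",
              "featherlite ladies long sleeve oxford shirt"]:
    print(f"{edit_similarity(reference, title):.3f}  {title}")
```

Since the distance is at least the difference in length, a long product title can only reach a low score against a short reference even when it contains the reference verbatim.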




In [67]:
# Define reference title
reference_title = "oxford shirt"

# Prepare and clean data
df_title = df[["title", "product_type_name", "formatted_price", "brand"]].copy()
df_title["title_clean"] = df_title["title"].astype(str).str.lower().str.strip()

# Compute edit distance similarity
similarities = []
for title in df_title["title_clean"]:
    dist = edit_distance(reference_title, title)
    max_len = max(len(reference_title), len(title))
    similarity = 1 - dist / max_len if max_len > 0 else 0
    similarities.append(similarity)

df_title["title_similarity"] = similarities

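# Rank all products by title similarity (most similar first)
top_similar = df_title.sort_values("title_similarity", ascending=False)

# The displayed table lists the first 10 rows whose title contains "oxford" (in dataset order, not by similarity)
df_title[df_title["title_clean"].str.contains("oxford")][["title", "product_type_name", "formatted_price", "brand"]].head(10)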



Unnamed: 0,title,product_type_name,formatted_price,brand
98,Fjallraven - Women's Ovik Foxford Shirt Longsl...,SHIRT,$88.00,Fjallraven
143,"FeatherLite Ladies Long Sleeve Oxford Shirt, W...",SHIRT,$21.65,FeatherLite
156,"FeatherLite Ladies Long Sleeve Oxford Shirt, L...",SHIRT,$21.78,FeatherLite
198,Fjallraven - Women's Ovik Foxford Shirt Longsl...,SHIRT,$88.00,Fjallraven
207,"FeatherLite Ladies Long Sleeve Oxford Shirt, F...",SHIRT,$21.84,FeatherLite
210,"FeatherLite Ladies Long Sleeve Oxford Shirt, L...",SHIRT,$21.65,FeatherLite
236,Fjallraven - Women's Ovik Foxford Shirt Longsl...,SHIRT,$88.00,Fjallraven
238,"FeatherLite Ladies Long Sleeve Oxford Shirt, L...",SHIRT,$21.84,FeatherLite
239,"FeatherLite Ladies Long Sleeve Oxford Shirt, F...",SHIRT,$21.65,FeatherLite
262,"FeatherLite Ladies Long Sleeve Oxford Shirt, F...",SHIRT,$21.78,FeatherLite
