In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics import jaccard_score
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import pairwise_distances

## **Study 1 – Similarity Measures on Amazon Apparel Metadata**

### **Objective**  
To explore various similarity measures across a selection of metadata attributes from the `Amazon-clothing-info.csv` dataset. These measures help in identifying similar items based on different features (such as price, brand, and title) — a critical component in building content-based recommendation systems.

---

### **Selected Attributes and Corresponding Similarity Measures**

| Attribute              | Description                                  | Type       | Similarity Measure Used       |
|------------------------|----------------------------------------------|------------|-------------------------------|
| `formatted_price`      | Product price                                | Numerical  | Euclidean / Manhattan         |
| `color`                | Color name                                   | Categorical| Jaccard Similarity            |
| `brand`                | Brand name                                   | Categorical| Hamming Distance              |
| `product_type_name`    | Product category/type                        | Categorical| Cosine Similarity (One-Hot)   |
| `title`                | Product title                                | Textual    | Edit Distance (Levenshtein)   |

---

### **Preprocessing Notes**

- `formatted_price`: Prices were cleaned and converted from string format (e.g., "$25.99") to float.
- Categorical columns like `color`, `brand`, and `product_type_name` were standardized (lowercase, trimmed spaces).
- `title`: Titles were normalized by lowering case and stripping punctuations for edit distance calculations.

---

### **Simulation of 5 Example Similarity Queries**

Below are five requests simulating practical recommendation queries. For each, we compute pairwise similarity to a given item and show the **Top 10 most similar results** based on the selected metric:

1. **Show me clothing with similar price to an item priced at $25.99**  
   → *(Using Euclidean/Manhattan distance on `formatted_price`)*

2. **Show me clothing of the same color as "Pink"**  
   → *(Using Jaccard similarity on `color`)*

3. **Show me clothing from similar brands to "Nike"**  
   → *(Using Hamming distance on one-hot encoded `brand`)*

4. **Show me clothing in the same category as "pants"**  
   → *(Using Cosine similarity on one-hot encoded `product_type_name`)*

5. **Show me clothing with similar title to "Denim Jeans"**  
   → *(Using Edit Distance on `title`)*

Each result set will include relevant metadata:  
**`asin`, `title`, `formatted_price`, `color`, `brand`, `product_type_name`**




### **Similarity 1: Price Similarity (Euclidean Distance)**

We explore similarity based on product prices using **Euclidean distance**. The `formatted_price` column contains prices in a string format (e.g., "$25.99"), so we first clean and convert these into numerical values.

Euclidean distance allows us to find products with the closest numeric prices.

---


In [11]:
# Load the dataset
df = pd.read_csv("dataset/Amazon-clothing-info.csv")

# --- Preprocessing: Clean and convert 'formatted_price' to numerical float ---
def clean_price(price_str):
    try:
        return float(str(price_str).replace("$", "").replace(",", "").strip())
    except:
        return np.nan

df["formatted_price_clean"] = df["formatted_price"].apply(clean_price)

# Drop rows with missing prices
df_price = df.dropna(subset=["formatted_price_clean"]).copy()

# --- Function: Find Top 10 items with similar price ---
def find_similar_price_items(target_price, top_n=10):
    # Create a temporary column for distance
    df_price["price_distance"] = (df_price["formatted_price_clean"] - target_price).abs()
    similar_items = df_price.sort_values("price_distance").head(top_n)

    return similar_items[["asin", "title", "formatted_price", "color", "brand", "product_type_name", "price_distance"]]

In [14]:
# Example query: Find items similar to $25.99
similar_price_df = find_similar_price_items(9.98)

# Display results
similar_price_df.reset_index(drop=True)


Unnamed: 0,asin,title,formatted_price,color,brand,product_type_name,price_distance
0,B00ZZR0UGC,Fire Women's Long-sleeve Print Chiffon Fashion...,$9.98,White,Fire,SHIRT,0.0
1,B01M111WN8,Pikolai Vintage Women Long Sleeve Cotton blend...,$9.98,Black,Pikolai,SHIRT,0.0
2,B01M0P91TT,Pikolai Vintage Women Long Sleeve Cotton blend...,$9.98,Red,Pikolai,SHIRT,0.0
3,B071WKG6G6,Mossimo Supply Co Gold Fish Orange Lace Tank/C...,$9.98,Orange,Mossimo Supply Co,SHIRT,0.0
4,B016L14J2G,"Badger Women's Performance Racerback Tank Top,...",$9.98,Electric Blue,Badger,SHIRT,0.0
5,B071X6ZF3G,Reel Legends Womens Freeline V-Neck Top Small ...,$9.98,Black,Reel Legends,SHIRT,0.0
6,B01LZQ4YLK,Pikolai Women Cotton blend Casual Loose Printi...,$9.98,Black,Pikolai,SHIRT,0.0
7,B016L14KY8,"Badger Women's Performance Racerback Tank Top,...",$9.98,Electric Blue,Badger,SHIRT,0.0
8,B0177DOFK8,Navy Strp Racerbck Snit Small Navy,$9.98,Navy,BCX/Byer California,SHIRT,0.0
9,B0721BBB3C,Reel Legends Womens Freeline V-Neck Top Small ...,$9.98,Pink Glow,Reel Legends,SHIRT,0.0



---

### **Similarity 2: Color Similarity (Jaccard Similarity)**

To compare the **`color`** attribute, we use **Jaccard similarity**. Since colors can contain multiple values (e.g., `"Red/Black"`), we tokenize them into sets of individual color tokens.

The **Jaccard similarity** is computed as:

> Jaccard(A, B) = |A ∩ B| / |A ∪ B|

This helps identify products that share similar color combinations.


In [16]:
# Clean and preprocess
df_color = df[["asin", "title", "color", "brand", "product_type_name"]].copy()
df_color = df_color.dropna(subset=["color"])
df_color = df_color.reset_index(drop=True)

# Convert color strings to sets (e.g., "Red/White" → {"Red", "White"})
df_color["color_set"] = df_color["color"].apply(lambda x: set(x.split("/")))

# Encode with MultiLabelBinarizer
mlb = MultiLabelBinarizer()
color_encoded = mlb.fit_transform(df_color["color_set"])

# Choose a new reference item (e.g., color = "Pink")
reference_index = df_color[df_color["color"].str.contains("Pink", case=False)].index[0]
reference_vector = color_encoded[reference_index]

# Compute Jaccard similarity between reference and all others
similarities = []
for i in range(len(color_encoded)):
    if i == reference_index:
        similarities.append(0)  # skip self-comparison
    else:
        sim = jaccard_score(reference_vector, color_encoded[i])
        similarities.append(sim)

df_color["color_similarity"] = similarities

# Show top 10 most similar products by color
top10_color_similar = df_color.sort_values("color_similarity", ascending=False).head(10)
top10_color_similar[["asin", "title", "color", "brand", "product_type_name", "color_similarity"]]


Unnamed: 0,asin,title,color,brand,product_type_name,color_similarity
1840,B01E72S5CG,Anskan Women's Spartan Logo Red T-shirt XL Pink,Pink,Anskan,BOOKS_1973_AND_LATER,1.0
761,B01AAZN97K,Active Basic Womens Basic Deep Scoop Neck with...,Pink,Active Products,SHIRT,1.0
4932,B019X6XUTE,INC Womens Plus Embroidered Open Sleeve Peasan...,Pink,INC International Concepts,APPAREL,1.0
10065,B0719L6BN2,Wayf Women's Large Textured Cold-Shoulder Blou...,Pink,WAYF,SHIRT,1.0
7583,B01JZXBZDS,Lush Neon Small Junior Keyhole-Back High-Low B...,Pink,Lush Clothing,SHIRT,1.0
28357,B0759DZSK6,Theory Women's Medium Layered Surplice Blouse ...,Pink,Theory,SHIRT,1.0
7161,B01MTKEW91,Ro & De Womens Medium Scoop-Neck Open-Back Blo...,Pink,Rode,SHIRT,1.0
27662,B01NAWEG5M,TOOGOO(R) Women's spring autumn women long sle...,Pink,TOOGOO(R),OUTERWEAR,1.0
27688,B0759V9R3K,Tommy Hilfiger Plaid Women's Button Down Shirt...,Pink,Tommy Hilfiger,SHIRT,1.0
6939,B06XXSNGLZ,She'sModa Elegant Ruffles Bandage Pink Slim Pu...,Pink,She'sModa,SHIRT,1.0


### **Similarity 3 – Brand Similarity with Hamming Distance**

We want to find clothing items that come from **brands similar to "Nike"**. To do this, we:
- One-hot encode the `brand` attribute.
- Compute Hamming distance between the encoded row for "Nike" and all others.
- Display the top 10 closest matches.

**Hamming distance** is suitable for one-hot encoded categorical data. A lower Hamming distance indicates a higher similarity.


In [18]:
from sklearn.metrics import pairwise_distances

# Drop rows where brand is missing
df_brand = df.dropna(subset=["brand"]).copy()

# One-hot encode brand column
brand_dummies = pd.get_dummies(df_brand["brand"])

# Find index of a product from brand "Nike"
nike_index = df_brand[df_brand["brand"].str.lower() == "nike"].index[0]

# Compute Hamming distances
hamming_distances = pairwise_distances(
    [brand_dummies.iloc[nike_index]], brand_dummies, metric="hamming"
).flatten()

# Attach distance to dataframe
df_brand["brand_similarity"] = 1 - hamming_distances  # higher = more similar

# Sort and show top 10 most similar (excluding Nike itself)
top10_brand_similar = df_brand[df_brand.index != nike_index].sort_values("brand_similarity", ascending=False).head(10)

# Display
top10_brand_similar[["brand", "title", "formatted_price", "brand_similarity"]]


Unnamed: 0,brand,title,formatted_price,brand_similarity
3664,Meilaier,Meilaier Womens Fashion Skull Blouse Knitted S...,$19.40,1.0
5431,Meilaier,Women's Wing Printing Racerback Vest Top,$9.98,1.0
10,Feel The Piece,Feel The Piece Sami Dip Dye Top One Size in Navy,$72.40,0.999451
11,FeatherLite,FeatherLite Ladies Long Sleeve Stain Resistant...,$22.78,0.999451
12,FARYSAYS,FARYSAYS Women's Sexy Cut-out Shoulder Blouse ...,$16.99,0.999451
28363,SODIAL(R),SODIAL(R) Womens Button Down Lapel Shirt Plaid...,$9.76,0.999451
28,FeatherLite,Featherlite Ladies' Silky Smooth Pique (Red) (2X),$15.74,0.999451
28379,TOOGOO(R),TOOGOO(R) Women's Tops Spring Autumn Casual Pu...,$14.58,0.999451
28378,Lauren by Ralph Lauren,Lauren Ralph Lauren White Women's Large Stripe...,$18.99,0.999451
13,FineBrandShop,Ladies Fuchsia Pink Seamless Stone Set Tube Top,$7.50,0.999451
