### Notes before use: 
Cells [1-11] contain code that needs to be run once when the server / machine initializes <br>
Cell [12] contains the search function, which can be called directly with a user search query <br>
Cells [13-14] contain two sample queries <br>
Cell [15] lists future improvements to the recommendation model

In [1]:
import pandas as pd
import re
import numpy as np

import spacy
nlp = spacy.load('en_core_web_md')

import pickle
def save_obj(obj, name):
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
        
import scipy.spatial  # load the submodule explicitly so scipy.spatial.distance is available
import nltk
import string
from fuzzywuzzy import fuzz 
from fuzzywuzzy import process
from sklearn.feature_extraction.text import TfidfVectorizer



### Part 1: Data Cleaning and Preprocessing
1. Keep only women's clothing items
2. Merge product data with outfit data
3. Concatenate text fields
4. Clean the text: replace null values with an unknown token, and remove punctuation, numbers, single characters, repeated spaces, stopwords, and special characters such as "\n"
5. Tokenization and lemmatization

In [2]:
# Read in product data
data = pd.read_excel("Behold+product+data+04262021.xlsx")
data.head()

Unnamed: 0,product_id,brand,brand_category,name,details,created_at,brand_canonical_url,description,brand_description,brand_name,product_active
0,01EX0PN4J9WRNZH5F93YEX6QAF,Two,Unknown,Khadi Stripe Shirt-our signature shirt,,2021-01-27 01:17:19.305 UTC,https://two-nyc.myshopify.com/products/white-k...,Our signature khadi shirt\navailable in black ...,Our signature khadi shirt\n\navailable in blac...,Khadi Stripe Shirt-our signature shirt,True
1,01F0C4SKZV6YXS3265JMC39NXW,Collina Strada,Unknown,RUFFLE MARKET DRESS LOOPY PINK SISTINE TOMATO,,2021-03-09 18:43:10.457 UTC,https://collina-strada-2.myshopify.com/product...,Mid-length dress with ruffles and adjustable s...,Mid-length dress with ruffles and adjustable s...,RUFFLE MARKET DRESS LOOPY PINK SISTINE TOMATO,True
2,01EY4Y1BW8VZW51BWG5VZY82XW,Cariuma,Unknown,IBI Slip On Raw Red Knit Sneaker Women,,2021-02-10 02:58:59.591 UTC,https://cariuma.myshopify.com/products/ibi-sli...,IBI Slip On Raw Red Knit Sneaker Women,IBI Slip On Raw Red Knit Sneaker Women,IBI Slip On Raw Red Knit Sneaker Women,False
3,01EY50E27A0P5V6KCW01XPDB43,Cariuma,Unknown,IBI Slip On Black Knit Sneaker Women,,2021-02-10 03:40:52.842 UTC,https://cariuma.myshopify.com/products/ibi-sli...,IBI Slip On Black Knit Sneaker Women,IBI Slip On Black Knit Sneaker Women,IBI Slip On Black Knit Sneaker Women,False
4,01EY6DWHC2W5HPNEGXKEJ4A1CX,Cariuma,Unknown,CATIBA PRO Skate Black Suede and Canvas Contra...,,2021-02-10 16:55:13.024 UTC,https://cariuma.myshopify.com/products/catiba-...,,,CATIBA PRO Skate Black Suede and Canvas Contra...,False


In [3]:
# Create an indicator column: 1 if any field of the row mentions a women's-wear keyword, else 0
# The \b word boundaries already cover possessives such as "women's"
womens_pattern = r'\b(?:woman|women|girls?|females?|lady|ladies|unisex)\b'
data['is_womens_clothing'] = data.astype(str).apply(
    lambda col: col.str.contains(womens_pattern, flags=re.IGNORECASE, regex=True)
).any(axis=1).astype(int)

# Keep only women's items, which are the focus of this project
data = data.loc[data.is_womens_clothing == 1, :]
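
The word boundaries in the keyword pattern decide which rows survive this filter. The check below is a minimal sketch on made-up strings (the samples and the helper name `mentions_womens_keyword` are illustrative and do not come from the dataset). It suggests that possessives and category strings are caught, while compounds such as "Womenswear" are not.

```python
import re

# Same keyword idea as the filter above, written with a non-capturing group.
WOMENS_PATTERN = re.compile(
    r"\b(?:woman|women|girls?|females?|lady|ladies|unisex)\b", re.IGNORECASE
)

def mentions_womens_keyword(text):
    """Return True if the text contains one of the keywords as a whole word."""
    return WOMENS_PATTERN.search(text) is not None

# Hypothetical sample strings, for illustration only.
samples = [
    "IBI Slip On Raw Red Knit Sneaker Women",  # keyword at the end of a name
    "women:SHOES:MULES",                       # category-style string
    "Women's Wool Coat",                       # possessive form
    "Womenswear capsule",                      # no word boundary after "women"
    "Men's Oxford Shirt",                      # no keyword at all
]
for s in samples:
    print(f"{s!r}: {mentions_womens_keyword(s)}")
```

If compounds like "womenswear" matter for the catalogue, the keyword list would need to be extended explicitly.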

In [4]:
# Merge outfit dataset with product dataset
# Note:
# Because this is an inner join, we can only recommend products that exist in both the product and outfit datasets
# Reasons for doing this:
# 1. We need the detailed product descriptions from the product dataset for a more accurate comparison
# 2. We want to recommend outfits based on the existing professional outfit combinations

outfit = pd.read_csv("outfit_combinations USC.csv")
fulldata = data.merge(outfit, left_on = "product_id", right_on = "product_id", how = "inner")
# drop_duplicates() returns a new DataFrame, so the result has to be assigned back
fulldata = fulldata.drop_duplicates().reset_index(drop=True)
fulldata.head()

Unnamed: 0,product_id,brand_x,brand_category,name,details,created_at,brand_canonical_url,description,brand_description,brand_name,product_active,is_womens_clothing,outfit_id,outfit_item_type,brand_y,product_full_name
0,01DVA59VHYAPT4PVX32NXW91G5,Tibi,women:SHOES:MULES,Juan Embossed Mules,\nAs seen on the Pre-Fall ‘19 runway\nHeel mea...,2019-12-05 04:32:46.134 UTC,https://pink.modaoperandi.com/tibi-pf19/juan-e...,Tibi's Juan embossed mules are made from shiny...,Tibi's Juan embossed mules are made from shiny...,Juan Embossed Mules,False,1,01DVA879D7TQ59VPTTGCMJWWSK,shoe,Tibi,Juan Embossed Mules
1,01DT2D39XSRFC204J231X3C7XK,Frame,women:CLOTHING:COATS,Belted Double-Faced Cotton Coat,\nBelted waist fastening \nComposition: cotton...,2019-11-19 17:59:22.799 UTC,https://pink.modaoperandi.com/frame-denim-fw19...,There are only a few great coats you need to b...,There are only a few great coats you need to b...,Belted Double-Faced Cotton Coat,False,1,01DVC571VTD70793BKGPVSTF2A,accessory2,FRAME,Belted Double-Faced Cotton Coat
2,01DT2D39XSRFC204J231X3C7XK,Frame,women:CLOTHING:COATS,Belted Double-Faced Cotton Coat,\nBelted waist fastening \nComposition: cotton...,2019-11-19 17:59:22.799 UTC,https://pink.modaoperandi.com/frame-denim-fw19...,There are only a few great coats you need to b...,There are only a few great coats you need to b...,Belted Double-Faced Cotton Coat,False,1,01DVC571VV0DYHPSK1GJDPQTQT,accessory2,FRAME,Belted Double-Faced Cotton Coat
3,01DT2D39XSRFC204J231X3C7XK,Frame,women:CLOTHING:COATS,Belted Double-Faced Cotton Coat,\nBelted waist fastening \nComposition: cotton...,2019-11-19 17:59:22.799 UTC,https://pink.modaoperandi.com/frame-denim-fw19...,There are only a few great coats you need to b...,There are only a few great coats you need to b...,Belted Double-Faced Cotton Coat,False,1,01DVC571VV2KR8G4TAZWZM0YQH,accessory2,FRAME,Belted Double-Faced Cotton Coat
4,01DT2D39XSRFC204J231X3C7XK,Frame,women:CLOTHING:COATS,Belted Double-Faced Cotton Coat,\nBelted waist fastening \nComposition: cotton...,2019-11-19 17:59:22.799 UTC,https://pink.modaoperandi.com/frame-denim-fw19...,There are only a few great coats you need to b...,There are only a few great coats you need to b...,Belted Double-Faced Cotton Coat,False,1,01DVC571VV8YNZS2NC6JCTADP0,accessory2,FRAME,Belted Double-Faced Cotton Coat
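
Because the inner join silently drops products that have no outfit, it can help to count what is lost before committing to it. The sketch below runs on toy tables with hypothetical IDs, not the real files. It uses an outer merge with pandas' `indicator=True` option, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Toy stand-ins for the product and outfit tables (hypothetical IDs).
products = pd.DataFrame({"product_id": ["A", "B", "C"],
                         "name": ["coat", "mule", "skirt"]})
outfits = pd.DataFrame({"product_id": ["A", "A", "B", "D"],
                        "outfit_id": ["o1", "o2", "o1", "o3"]})

# An outer merge with indicator=True labels each row as
# "left_only", "right_only", or "both" in a "_merge" column.
coverage = products.merge(outfits, on="product_id", how="outer", indicator=True)

# Count each product once, however many outfits it appears in.
counts = coverage.drop_duplicates("product_id")["_merge"].value_counts().to_dict()
print(counts)
```

Here `left_only` counts products that an inner join would discard, and `right_only` counts outfit items with no product description.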


In [5]:
# Drop duplicate columns and rename columns
fulldata = fulldata.loc[:,["product_id","outfit_id","outfit_item_type","brand_x","product_full_name","description","details"]]
fulldata.rename(columns={"brand_x": "brand"}, inplace=True)
fulldata.head()

Unnamed: 0,product_id,outfit_id,outfit_item_type,brand,product_full_name,description,details
0,01DVA59VHYAPT4PVX32NXW91G5,01DVA879D7TQ59VPTTGCMJWWSK,shoe,Tibi,Juan Embossed Mules,Tibi's Juan embossed mules are made from shiny...,\nAs seen on the Pre-Fall ‘19 runway\nHeel mea...
1,01DT2D39XSRFC204J231X3C7XK,01DVC571VTD70793BKGPVSTF2A,accessory2,Frame,Belted Double-Faced Cotton Coat,There are only a few great coats you need to b...,\nBelted waist fastening \nComposition: cotton...
2,01DT2D39XSRFC204J231X3C7XK,01DVC571VV0DYHPSK1GJDPQTQT,accessory2,Frame,Belted Double-Faced Cotton Coat,There are only a few great coats you need to b...,\nBelted waist fastening \nComposition: cotton...
3,01DT2D39XSRFC204J231X3C7XK,01DVC571VV2KR8G4TAZWZM0YQH,accessory2,Frame,Belted Double-Faced Cotton Coat,There are only a few great coats you need to b...,\nBelted waist fastening \nComposition: cotton...
4,01DT2D39XSRFC204J231X3C7XK,01DVC571VV8YNZS2NC6JCTADP0,accessory2,Frame,Belted Double-Faced Cotton Coat,There are only a few great coats you need to b...,\nBelted waist fastening \nComposition: cotton...


In [6]:
# Check for null values
fulldata.isnull().sum()

product_id            0
outfit_id             0
outfit_item_type      0
brand                 0
product_full_name     0
description          94
details              34
dtype: int64

In [7]:
# Replace null values with the placeholder "UNKNOWN_TOKEN"
fulldata = fulldata.fillna('UNKNOWN_TOKEN')

# Remove the special character "\n" from the "details" column
# Caution: deleting "\n" outright fuses adjacent words (see "runwayHeel" in the output);
# replacing it with a space would keep them apart
fulldata['details'] = fulldata['details'].str.replace("\n", "")

# Concatenate the text columns
fulldata['text'] = fulldata['outfit_item_type']+' '+fulldata['brand']+' '+fulldata['product_full_name']+' '+fulldata['description']+' '+fulldata['details']

fulldata.head()

Unnamed: 0,product_id,outfit_id,outfit_item_type,brand,product_full_name,description,details,text
0,01DVA59VHYAPT4PVX32NXW91G5,01DVA879D7TQ59VPTTGCMJWWSK,shoe,Tibi,Juan Embossed Mules,Tibi's Juan embossed mules are made from shiny...,As seen on the Pre-Fall ‘19 runwayHeel measure...,shoe Tibi Juan Embossed Mules Tibi's Juan embo...
1,01DT2D39XSRFC204J231X3C7XK,01DVC571VTD70793BKGPVSTF2A,accessory2,Frame,Belted Double-Faced Cotton Coat,There are only a few great coats you need to b...,Belted waist fastening Composition: cotton Dry...,accessory2 Frame Belted Double-Faced Cotton Co...
2,01DT2D39XSRFC204J231X3C7XK,01DVC571VV0DYHPSK1GJDPQTQT,accessory2,Frame,Belted Double-Faced Cotton Coat,There are only a few great coats you need to b...,Belted waist fastening Composition: cotton Dry...,accessory2 Frame Belted Double-Faced Cotton Co...
3,01DT2D39XSRFC204J231X3C7XK,01DVC571VV2KR8G4TAZWZM0YQH,accessory2,Frame,Belted Double-Faced Cotton Coat,There are only a few great coats you need to b...,Belted waist fastening Composition: cotton Dry...,accessory2 Frame Belted Double-Faced Cotton Co...
4,01DT2D39XSRFC204J231X3C7XK,01DVC571VV8YNZS2NC6JCTADP0,accessory2,Frame,Belted Double-Faced Cotton Coat,There are only a few great coats you need to b...,Belted waist fastening Composition: cotton Dry...,accessory2 Frame Belted Double-Faced Cotton Co...


In [8]:
# Write a function for further data cleaning
def preprocess_text(sen):
    # Remove punctuation and digits (anything that is not a letter)
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Remove single character
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Remove multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    
    #Remove stopwords and do lemmatization
    doc = nlp(sentence)
    tokens = [token.lemma_ for token in doc if not token.is_stop]
    
    return " ".join(tokens)

# Apply the cleaning function to the concatenated text
fulldata['text'] = fulldata['text'].apply(preprocess_text)
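
To see what the three regular-expression steps do before spaCy is involved, the sketch below applies them to a made-up product description. The helper name `strip_noise`, the sample string, and the final `.strip()` are illustrative assumptions. Stopword removal and lemmatization are deliberately left out.

```python
import re

def strip_noise(text):
    """Apply the three regex clean-up steps, without the spaCy stage."""
    text = re.sub(r"[^a-zA-Z]", " ", text)       # digits and punctuation -> spaces
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # drop isolated single letters
    text = re.sub(r"\s+", " ", text)             # collapse repeated whitespace
    return text.strip()                          # extra step, added for tidy output

# Hypothetical sample text, for illustration only.
sample = "Heel measures 2.5 inches; A-line fit, 100% cotton"
print(strip_noise(sample))
```

Note that "A-line" loses its leading letter: the hyphen becomes a space, which leaves "A" as an isolated character that the second step removes. Queries containing "a-line" are affected in the same way.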

### Part 2: TF-IDF Vectorization and Cosine Similarity Computation
1. Perform TF-IDF vectorization on the concatenated text field of the product data
2. Perform TF-IDF vectorization on the user search query, compute its cosine similarity with each product, and return the most similar product **(search function)**
3. Link the product back to the outfit combination data and return the entire outfit as the recommendation **(outfit recommendation function, called within the search function)**
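
The steps above can be sketched in plain Python on a three-document toy corpus. This is a simplified illustration, not the notebook's pipeline: it assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalisation, which mirror scikit-learn's defaults, and it omits the `min_df`, `max_df`, and `token_pattern` settings. All function names and documents are hypothetical.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Weight term counts by IDF and scale the vector to unit length."""
    counts = Counter(t for t in tokens if t in idf)  # ignore unseen terms
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def tfidf_vectors(docs):
    """Return (idf table, unit-length TF-IDF vectors) for tokenised docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [weigh(doc, idf) for doc in docs]

def cosine(u, v):
    """Dot product of two unit-length sparse vectors equals their cosine."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy "product texts", already cleaned and tokenised.
docs = [
    "cropped straight leg pant zipper".split(),
    "silk mini skirt zipper".split(),
    "leather ankle boot".split(),
]
idf, doc_vecs = tfidf_vectors(docs)

# The query is weighted with the IDF table learned from the products.
query_vec = weigh("straight leg pant".split(), idf)
scores = [cosine(query_vec, d) for d in doc_vecs]
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Documents that share no term with the query score exactly zero, which is why a query made up entirely of unseen words cannot be matched to any product.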

In [9]:
doc = list(fulldata['text'].values)

# Perform TF-IDF vectorization on the concatenated text field of the product data
vectorizer = TfidfVectorizer(lowercase =True,
                             stop_words='english',
                             min_df=3, 
                             max_df=0.9,
                             token_pattern=r'\w{3,}')
tfidf_vector = vectorizer.fit_transform(doc)
# get_feature_names_out() requires scikit-learn >= 1.0 (get_feature_names() was removed in 1.2)
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), columns=vectorizer.get_feature_names_out(), index=fulldata.product_id).reset_index()
tfidf_df.head()

Unnamed: 0,product_id,abbreviated,abloh,abstract,accent,accentuate,accessory,acetate,acler,acne,...,year,yoga,yoke,zebra,zelander,zeynep,zimmermann,zip,zipped,zoom
0,01DVA59VHYAPT4PVX32NXW91G5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,01DT2D39XSRFC204J231X3C7XK,0.0,0.0,0.0,0.0,0.0,0.058749,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,01DT2D39XSRFC204J231X3C7XK,0.0,0.0,0.0,0.0,0.0,0.058749,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,01DT2D39XSRFC204J231X3C7XK,0.0,0.0,0.0,0.0,0.0,0.058749,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,01DT2D39XSRFC204J231X3C7XK,0.0,0.0,0.0,0.0,0.0,0.058749,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
# Save the TF-IDF vectorizer
save_obj(vectorizer, "tfidf_vectorizer")
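
Saving the vectorizer is only useful if a later session can read it back. The notebook does not define a loader, so `load_obj` below is a hypothetical counterpart to `save_obj`. The sketch demonstrates the round trip on a plain dictionary in a temporary directory.

```python
import os
import pickle
import tempfile

def save_obj(obj, name):
    """Pickle obj to '<name>.pkl' (same helper as in cell 1)."""
    with open(name + ".pkl", "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name):
    """Hypothetical counterpart: read back an object written by save_obj."""
    with open(name + ".pkl", "rb") as f:
        return pickle.load(f)

# Round trip on a stand-in object inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_object")
    save_obj({"min_df": 3, "max_df": 0.9}, path)
    restored = load_obj(path)
print(restored)
```

Two cautions apply. Pickle files should only be loaded from trusted sources. A pickled scikit-learn object is best loaded with the same library version that saved it.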

In [11]:
# Return an entire outfit given a product_id
# This function is called inside the search function
def outfit_recommendation(product_id):

    # Use fuzzy matching to find the product_id most similar to the input (in case of a typo)
    found_id = process.extractOne(product_id, outfit['product_id'], scorer=fuzz.token_set_ratio)[0]

    # Get every outfit_id that includes the matched product
    outfits = outfit[outfit.product_id == found_id].outfit_id

    # Use the first outfit as the default
    products = outfit[outfit.outfit_id == outfits.values[0]]

    # Format the output as a dictionary: item type -> "product name (product_id)"
    outfit_dict = {}

    for i in products.index:
        key = products.loc[i, "outfit_item_type"]
        outfit_dict[key] = products.loc[i, "product_full_name"] + " (" + products.loc[i, "product_id"] + ")"

    return outfit_dict
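
The function above keeps only the first outfit that contains the matched product. A possible variant, sketched below on a toy table, returns every outfit instead so that a caller could choose among them. The helper name `all_outfits_for_product` and the data are hypothetical, and the sketch uses exact ID matching rather than fuzzy matching.

```python
import pandas as pd

def all_outfits_for_product(outfit_df, product_id):
    """Return {outfit_id: {item_type: product_full_name}} for one product."""
    # Every outfit that contains the product.
    outfit_ids = outfit_df.loc[outfit_df["product_id"] == product_id, "outfit_id"].unique()
    # All items belonging to those outfits.
    members = outfit_df[outfit_df["outfit_id"].isin(outfit_ids)]
    return {
        oid: dict(zip(group["outfit_item_type"], group["product_full_name"]))
        for oid, group in members.groupby("outfit_id")
    }

# Toy outfit table with the same column names as the outfit dataset.
toy = pd.DataFrame({
    "outfit_id": ["o1", "o1", "o2", "o2", "o3"],
    "product_id": ["P1", "P2", "P1", "P3", "P4"],
    "outfit_item_type": ["top", "bottom", "top", "shoe", "bottom"],
    "product_full_name": ["Blouse", "Pants", "Blouse", "Mules", "Skirt"],
})
result = all_outfits_for_product(toy, "P1")
print(result)
```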

In [12]:
# Search function
# Vectorizes the user search query with TF-IDF and computes its cosine similarity with every product
from sklearn.metrics.pairwise import cosine_similarity

def search(query):
    # Preprocess and clean the user search query
    query = preprocess_text(query)

    # Vectorize the query with the fitted TF-IDF vectorizer
    query_vector = vectorizer.transform([query])

    # Cosine similarity between the query and every product row
    similarities = cosine_similarity(query_vector, tfidf_vector).ravel()

    # Pick the most similar product; ties go to the first row
    best = similarities.argmax()
    if similarities[best] <= 0:
        # The query shares no vocabulary with any product, so there is nothing to recommend
        return {}
    return outfit_recommendation(tfidf_df.loc[best, "product_id"])

In [13]:
# Test query #1
q1 = "slim fitting, straight leg pant with a center back zipper and slightly cropped leg Reformation"
search(q1)

{'top': 'Double-Layer Paneled Blouse (01DS48PYMWDQ10H1SN6GJZMV2D)',
 'shoe': 'Heather C-Chain Leopard-Print Calf Hair & Leather Sandals (01DT0DJQQ3J6EAWSPMEYCY5AYB)',
 'bottom': 'Cropped Wool Straight-Leg Pants (01DT37DJVBDP6796WQP3EGAAJM)',
 'accessory1': 'Croc-effect leather belt bag (01DT515RAX3SAPGFWARM3X7993)',
 'accessory2': 'Cotton-blend twill trench coat (01DT518N0WCC9SBA7XZ7HYKF6S)'}

In [14]:
# Test query #2
q2 = "Sexy silky, a-line mini skirt zipper Benson skirt"
search(q2)

{'shoe': 'Love Lizard-Embossed Metallic Leather Pumps (01DT0DJ11WJ7J7SZMB81SB7YKT)',
 'onepiece': 'Paco Draped Metallic Georgette Mini Dress (01DT2D3D3Z5VE6WNAQNR3K8H7A)',
 'accessory1': 'Eos Acrylic Box Clutch (01DT2D3E08QCR1PRYV9J0A1E5Y)',
 'accessory2': 'Jerry Striped Faux Fur Coat (01DVRHDWGHX8FG6VH1WFKET1N5)'}

### Future Improvements on the Model
Based on test queries #1 and #2, TF-IDF vectorization with cosine similarity appears to give reasonably relevant outfit recommendations for a customer's search query. However, several areas remain for future improvement:
1. We currently have no way to measure recommendation performance. In the future we could develop accuracy measures, both quantitative (scores) and qualitative (customer surveys).
2. Because we inner-join the product data with the outfit data, we cannot recommend the best-matching product if it does not appear in the outfit data. We could address this either by developing our own outfit recommendation rules or algorithm, or by enriching the outfit combination database.
3. A product can belong to multiple outfits, but the model returns only the first outfit by default. We could pick the outfit combination more intelligently, for example by prompting the user for further preferences in a follow-up query.
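
As one possible starting point for the quantitative measures mentioned in item 1, the sketch below computes hit rate at k and mean reciprocal rank. It assumes two things that do not exist yet: a search variant that returns a ranked list of product IDs, and a small hand-labelled set of queries with one known correct product each. All IDs and names are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the relevant product, or 0.0 if it was not returned."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid == relevant_id:
            return 1.0 / rank
    return 0.0

def evaluate(results, k=3):
    """results: list of (ranked_ids, relevant_id) pairs, one per labelled query."""
    hits = sum(1 for ranked, rel in results if rel in ranked[:k])
    mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in results) / len(results)
    return hits / len(results), mrr

# Hypothetical labelled queries: (ranked search output, correct product).
labelled = [
    (["p1", "p2", "p3"], "p1"),        # correct product ranked first
    (["p4", "p5", "p6"], "p5"),        # ranked second
    (["p7", "p8", "p9"], "p0"),        # never returned
    (["p2", "p3", "p1", "p9"], "p9"),  # ranked fourth, outside the top 3
]
hit_rate, mrr = evaluate(labelled, k=3)
print(hit_rate, mrr)
```

Hit rate at k rewards any appearance in the top k, while mean reciprocal rank also credits how high the correct product is placed.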