# DSO 560 Project Part II

## Executive Summary
**We make outfit recommendations in two steps:**<br>
*  Given an input (a product ID or a product description), find the closest matching product in the dataset (call it product B).<br>
*  Recommend an outfit based on an existing outfit for product B. If product B has no existing outfit, form a new one by matching the closest product in each of the other item types.<br>

**We explored two methods:**<br>
* ID Matching: matching products based on the similarity of their IDs. We use Levenshtein distance, which tolerates mistyped IDs.
* Description Matching: matching products based on the similarity of their product descriptions. We use spaCy's en_core_web_md word embeddings weighted by TF-IDF to compute the similarity score.

## Contents
### [I. Load Data](#1)
### [II. Text Preprocessing](#2)
### [III. Modelling](#3)
[A. Match IDs](#31)
    
[B. Match Descriptions](#32)


In [1]:
# import all necessary libraries
import psycopg2
import pandas as pd
import numpy as np
import nltk
import re
import sklearn
import spacy
import gensim 
import warnings
warnings.simplefilter("ignore")
from collections import Counter
from keras.preprocessing.sequence import pad_sequences
from numpy import asarray
from numpy import zeros
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

Using TensorFlow backend.


<a id = '1'></a>

# I. Load Data

In [2]:
conn = psycopg2.connect(database="threadtogether", user="dso_student", password="-H4jgA`rn6w`=Jg(", host="threadtogether.ychennay.com", port="5432")

In [3]:
# View all the tables
cur = conn.cursor()
cur.execute("""select *  
                            from information_schema.tables
                            where table_schema='public'""")
data = cur.fetchall()

In [4]:
pd.DataFrame(data, columns=[desc[0] for desc in cur.description])

Unnamed: 0,table_catalog,table_schema,table_name,table_type,self_referencing_column_name,reference_generation,user_defined_type_catalog,user_defined_type_schema,user_defined_type_name,is_insertable_into,is_typed,commit_action
0,threadtogether,public,full_data,BASE TABLE,,,,,,YES,NO,
1,threadtogether,public,womens_clothing_reviews,BASE TABLE,,,,,,YES,NO,
2,threadtogether,public,tagged_product_attributes,BASE TABLE,,,,,,YES,NO,
3,threadtogether,public,categories,BASE TABLE,,,,,,YES,NO,
4,threadtogether,public,outfits,BASE TABLE,,,,,,YES,NO,
5,threadtogether,public,outfit_combinations,VIEW,,,,,,NO,NO,


In [5]:
# Extract data from full_data
cur = conn.cursor()
cur.execute("select *  from full_data")
data = cur.fetchall()

In [6]:
full_data = pd.DataFrame(data, columns=[desc[0] for desc in cur.description])
full_data.head()

Unnamed: 0,product_id,brand,mpn,product_full_name,description,brand_category,created_at,updated_at,deleted_at,brand_canonical_url,details,labels,bc_product_id
0,01DSE9TC2DQXDG6GWKW9NMJ416,Banana Republic,514683.0,Ankle-Strap Pump,"A modern pump, in a rounded silhouette with an...",Unknown,2019-11-11 22:37:15.719107+00,2019-12-19 20:40:30.786144+00,,https://bananarepublic.gap.com/browse/product....,"A modern pump, in a rounded silhouette with an...","{""Needs Review""}",
1,01DSE9SKM19XNA6SJP36JZC065,Banana Republic,526676.0,Petite Tie-Neck Top,Dress it down with jeans and sneakers or dress...,Unknown,2019-11-11 22:36:50.682513+00,2019-12-19 20:40:30.786144+00,,https://bananarepublic.gap.com/browse/product....,Dress it down with jeans and sneakers or dress...,"{""Needs Review""}",
2,01DSJX8GD4DSAP76SPR85HRCMN,Loewe,400100000000.0,52MM Padded Leather Round Sunglasses,Padded leather covers classic round sunglasses.,JewelryAccessories/SunglassesReaders/RoundOval...,2019-11-13 17:33:59.581661+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/loewe-52mm-pad...,100% UV protection\nCase and cleaning cloth in...,"{""Needs Review""}",
3,01DSJVKJNS6F4KQ1QM6YYK9AW2,Converse,400012000000.0,Baby's & Little Kid's All-Star Two-Tone Mid-To...,The iconic mid-top design gets an added dose o...,"JustKids/Shoes/Baby024Months/BabyGirl,JustKids...",2019-11-13 17:05:05.203733+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/converse-babys...,Canvas upper\nRound toe\nLace-up vamp\nSmartFO...,"{""Needs Review""}",
4,01DSK15ZD4D5A0QXA8NSD25YXE,Alexander McQueen,400011000000.0,64MM Rimless Sunglasses,Hexagonal shades offer a rimless view with int...,JewelryAccessories/SunglassesReaders/RoundOval,2019-11-13 18:42:30.941321+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/alexander-mcqu...,100% UV protection\nGradient lenses\nAdjustabl...,"{""Needs Review""}",


In [7]:
full_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48979 entries, 0 to 48978
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   product_id           48979 non-null  object
 1   brand                48979 non-null  object
 2   mpn                  48979 non-null  object
 3   product_full_name    48979 non-null  object
 4   description          41005 non-null  object
 5   brand_category       48741 non-null  object
 6   created_at           48979 non-null  object
 7   updated_at           48979 non-null  object
 8   deleted_at           45984 non-null  object
 9   brand_canonical_url  48967 non-null  object
 10  details              47950 non-null  object
 11  labels               48979 non-null  object
 12  bc_product_id        48979 non-null  object
dtypes: object(13)
memory usage: 4.9+ MB


In [8]:
# Extract data from outfit_combinations
cur = conn.cursor()
cur.execute("select *  from outfit_combinations")
data = cur.fetchall()

In [9]:
# View
pd.DataFrame(data, columns=[desc[0] for desc in cur.description]).head()

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name
0,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt
1,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2PEPWFTT7RMP5AA1T,top,Eileen Fisher,Rib Mock Neck Tank
2,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2S5T9W793F4CY41HE,accessory1,kate spade new york,medium margaux leather satchel
3,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2ZFDYRYY5TRQZJTBD,shoe,Tory Burch,Penelope Mid Cap Toe Pump
4,01DMHCX50CFX5YNG99F3Y65GQW,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt


In [10]:
outfit_comb = pd.DataFrame(data, columns=[desc[0] for desc in cur.description])

In [11]:
np.unique(outfit_comb['outfit_item_type'])

array(['accessory1', 'accessory2', 'accessory3', 'bottom', 'onepiece',
       'shoe', 'top'], dtype=object)

In [12]:
outfit_comb2 = outfit_comb.drop(['brand', 'product_full_name'], axis=1)

In [13]:
# for outfit_item_type, unify accessory1, accessory2 and accessory3 into "accessory"
outfit_comb2.loc[outfit_comb2['outfit_item_type'] ==   'accessory1', 'outfit_item_type'] =  'accessory'
outfit_comb2.loc[outfit_comb2['outfit_item_type'] ==   'accessory2', 'outfit_item_type'] =  'accessory'
outfit_comb2.loc[outfit_comb2['outfit_item_type'] ==   'accessory3', 'outfit_item_type'] =  'accessory'

In [14]:
# join outfit data with full_data
join_data = full_data.merge(outfit_comb2, on='product_id')

In [15]:
join_data.head(3)

Unnamed: 0,product_id,brand,mpn,product_full_name,description,brand_category,created_at,updated_at,deleted_at,brand_canonical_url,details,labels,bc_product_id,outfit_id,outfit_item_type
0,01DVA59VHYAPT4PVX32NXW91G5,Tibi,330527,Juan Embossed Mules,Tibi's Juan embossed mules are made from shiny...,women:SHOES:MULES,2019-12-05 04:32:46.134000+00:00,2020-04-08 01:25:26.119000+00:00,2020-04-08 01:25:26.119000+00:00,https://pink.modaoperandi.com/tibi-pf19/juan-e...,As seen on the Pre-Fall ‘19 runway\nHeel measu...,[],820.0,01DVA879D7TQ59VPTTGCMJWWSK,shoe
1,01DVA59VHYAPT4PVX32NXW91G5,Tibi,330527,Juan Embossed Mules,Tibi's Juan embossed mules are made from shiny...,women:SHOES:MULES,2019-12-05 04:32:46.134561+00,2019-12-19 20:40:30.786144+00,,https://pink.modaoperandi.com/tibi-pf19/juan-e...,As seen on the Pre-Fall ‘19 runway\nHeel measu...,"{""Needs Review""}",,01DVA879D7TQ59VPTTGCMJWWSK,shoe
2,01DVA4XY7A0QMMSK3V3SBR52J9,Alexandre Birman,329388,Clarita Bow-Embellished Suede Sandals,Alexandre Birman's 'Clarita' sandals have quic...,women:SHOES:SANDALS,2019-12-05 04:26:15.652000+00:00,2020-04-08 00:45:56.136000+00:00,2020-04-08 00:45:56.136000+00:00,https://pink.modaoperandi.com/alexandre-birman...,Heel height measures approximately 50mm / 2 in...,[],805.0,01DVA8GAYP45BCEMYGEK7FXGDQ,shoe


<a id = '2'></a>

# II. Text Preprocessing

In [16]:
# Replace NaN, "\n", "NULL" and "Unknown" with "" in join_data
join_data = join_data.replace(np.nan, "", regex=True)
join_data = join_data.replace("\n", "",regex=True)
join_data = join_data.replace(r"\b(NULL|Unknown)\b", "",regex=True) 
# Some of the records have "NULL" under the column of 'details' and "Unknown" under 'brand_category'
join_data.head(3)

Unnamed: 0,product_id,brand,mpn,product_full_name,description,brand_category,created_at,updated_at,deleted_at,brand_canonical_url,details,labels,bc_product_id,outfit_id,outfit_item_type
0,01DVA59VHYAPT4PVX32NXW91G5,Tibi,330527,Juan Embossed Mules,Tibi's Juan embossed mules are made from shiny...,women:SHOES:MULES,2019-12-05 04:32:46.134000+00:00,2020-04-08 01:25:26.119000+00:00,2020-04-08 01:25:26.119000+00:00,https://pink.modaoperandi.com/tibi-pf19/juan-e...,As seen on the Pre-Fall ‘19 runwayHeel measure...,[],820.0,01DVA879D7TQ59VPTTGCMJWWSK,shoe
1,01DVA59VHYAPT4PVX32NXW91G5,Tibi,330527,Juan Embossed Mules,Tibi's Juan embossed mules are made from shiny...,women:SHOES:MULES,2019-12-05 04:32:46.134561+00,2019-12-19 20:40:30.786144+00,,https://pink.modaoperandi.com/tibi-pf19/juan-e...,As seen on the Pre-Fall ‘19 runwayHeel measure...,"{""Needs Review""}",,01DVA879D7TQ59VPTTGCMJWWSK,shoe
2,01DVA4XY7A0QMMSK3V3SBR52J9,Alexandre Birman,329388,Clarita Bow-Embellished Suede Sandals,Alexandre Birman's 'Clarita' sandals have quic...,women:SHOES:SANDALS,2019-12-05 04:26:15.652000+00:00,2020-04-08 00:45:56.136000+00:00,2020-04-08 00:45:56.136000+00:00,https://pink.modaoperandi.com/alexandre-birman...,Heel height measures approximately 50mm / 2 in...,[],805.0,01DVA8GAYP45BCEMYGEK7FXGDQ,shoe


In [17]:
# select useful columns and drop duplicates
columns_list = ['outfit_id','product_id','outfit_item_type','brand','product_full_name','brand_category','details','description']
join_data = join_data[columns_list]
join_data.drop_duplicates(inplace = True)

In [18]:
set(join_data['outfit_item_type'])

{'accessory', 'bottom', 'onepiece', 'shoe', 'top'}

In [19]:
# Lower case some target texts
for column in [col for col in columns_list if col not in ['outfit_id','product_id','outfit_item_type']]:
    join_data[column] = join_data[column].str.lower()
join_data.drop_duplicates(inplace = True)
join_data.head(3)

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name,brand_category,details,description
0,01DVA879D7TQ59VPTTGCMJWWSK,01DVA59VHYAPT4PVX32NXW91G5,shoe,tibi,juan embossed mules,women:shoes:mules,as seen on the pre-fall ‘19 runwayheel measure...,tibi's juan embossed mules are made from shiny...
2,01DVA8GAYP45BCEMYGEK7FXGDQ,01DVA4XY7A0QMMSK3V3SBR52J9,shoe,alexandre birman,clarita bow-embellished suede sandals,women:shoes:sandals,heel height measures approximately 50mm / 2 in...,alexandre birman's 'clarita' sandals have quic...
3,01DWJE4FDNYRV6ZJBG25HJFYY2,01DVA4XY7A0QMMSK3V3SBR52J9,shoe,alexandre birman,clarita bow-embellished suede sandals,women:shoes:sandals,heel height measures approximately 50mm / 2 in...,alexandre birman's 'clarita' sandals have quic...


In [20]:
# Remove stopwords
from nltk.corpus import stopwords
from nltk import word_tokenize

columns_list2 = ['brand','product_full_name','description','brand_category','details']
nltk_stopwords = set(stopwords.words('english') + [".",",",":","''","'s","'","``","(", ")","]",
                                                   "-","!","/",">","<",";","#","...","..","?","--","[","&"])
for column in columns_list2:
    join_data[column] = join_data[column].apply(lambda x: ' '.join([word for word in word_tokenize(x) if word not in nltk_stopwords]))
    
join_data.head(3)

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name,brand_category,details,description
0,01DVA879D7TQ59VPTTGCMJWWSK,01DVA59VHYAPT4PVX32NXW91G5,shoe,tibi,juan embossed mules,women shoes mules,seen pre-fall ‘ 19 runwayheel measures approxi...,tibi juan embossed mules made shiny black leat...
2,01DVA8GAYP45BCEMYGEK7FXGDQ,01DVA4XY7A0QMMSK3V3SBR52J9,shoe,alexandre birman,clarita bow-embellished suede sandals,women shoes sandals,heel height measures approximately 50mm 2 inch...,alexandre birman 'clarita sandals quickly rise...
3,01DWJE4FDNYRV6ZJBG25HJFYY2,01DVA4XY7A0QMMSK3V3SBR52J9,shoe,alexandre birman,clarita bow-embellished suede sandals,women shoes sandals,heel height measures approximately 50mm 2 inch...,alexandre birman 'clarita sandals quickly rise...


In [21]:
# Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

for column in columns_list2:
    join_data[column] = join_data[column].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(x)]))

In [22]:
join_data.drop_duplicates(inplace=True)
print(len(join_data))

5244


In [23]:
# Sometimes the same product appears more than once in an outfit. We keep only the first occurrence.
join_data = join_data.groupby(['outfit_id', 'product_id']).first().reset_index()
print(len(join_data))

5199


<a id = '3'></a>

# III. Modelling

In [24]:
# Join all the target texts we need into a new column called 'tag_info'
join_data['tag_info'] = join_data[['brand','product_full_name','brand_category','details','description']].apply(lambda x: ' '.join(x), axis=1)
join_data['tag_info'][0]

'eileen fisher slim knit skirt apparel  nice skirt'

In [25]:
join_data

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name,brand_category,details,description,tag_info
0,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,eileen fisher,slim knit skirt,apparel,,nice skirt,eileen fisher slim knit skirt apparel nice skirt
1,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2PEPWFTT7RMP5AA1T,top,eileen fisher,rib mock neck tank,apparel,,nice tank,eileen fisher rib mock neck tank apparel nice...
2,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2S5T9W793F4CY41HE,accessory,kate spade new york,medium margaux leather satchel,bag,,nice bag,kate spade new york medium margaux leather sat...
3,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2ZFDYRYY5TRQZJTBD,shoe,tory burch,penelope mid cap toe pump,shoe,,nice shoe,tory burch penelope mid cap toe pump shoe nic...
4,01DMHCX50CFX5YNG99F3Y65GQW,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,eileen fisher,slim knit skirt,apparel,,nice skirt,eileen fisher slim knit skirt apparel nice skirt
...,...,...,...,...,...,...,...,...,...
5194,01E6MC52DNZE5AZ9MPZ2CYCXW9,01E5ZYHZA7186DVWEJ99Q4D2PM,accessory,sam edelman,65mm gradient oversize square sunglass,,58mm lens width 20mm bridge width 145mm temple...,bold angular frame enhance statement-making st...,sam edelman 65mm gradient oversize square sung...
5195,01E6MC52DPCTAAQVYNSKA7PER7,01E2P0SJSKFKNQJ5SVQ8MD1JZT,shoe,dr. marten,fenimore triple buckle boot,,true size woman size shown u.s. equivalent uni...,combat boot get extra dose attitude three thic...,dr. marten fenimore triple buckle boot true s...
5196,01E6MC52DPCTAAQVYNSKA7PER7,01E4RW25Y8ZF6WKZRE50Y6SKH5,onepiece,nili lotan,pamela dress,,model 5'10 wear size 4 size 4 measure approxim...,pamela dress designed elegant slim fit feature...,nili lotan pamela dress model 5'10 wear size ...
5197,01E6MC52DPCTAAQVYNSKA7PER7,01E5ZS3R9JD696YWGK9NSG56E1,accessory,mansur gavriel,leather circle crossbody bag,,8 w x 8 h x 2 ½ d. interior capacity small 3 s...,vintage nostalgia meet contemporary minimalism...,mansur gavriel leather circle crossbody bag 8...


<a id = '31'></a>

## A. Match IDs

For the ID-matching method, we use fuzzy matching based on Levenshtein distance. Our steps are: a) match the input product ID (call it ID-1) with an existing product ID (ID-2) in the database; b) recommend an outfit based on an existing outfit for ID-2, or, if there is no existing outfit, form a new one from the closest IDs in each of the other item types.
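To make the matching step concrete, here is a minimal pure-Python sketch of Levenshtein distance and a nearest-ID lookup. The helper names `levenshtein` and `closest_id` are hypothetical and not part of the notebook, and fuzzywuzzy's `fuzz.ratio` reports a normalized 0-100 similarity score rather than this raw edit count.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def closest_id(query: str, candidates):
    """Return the candidate with the smallest edit distance to query (hypothetical helper)."""
    return min(candidates, key=lambda candidate: levenshtein(query, candidate))


# Two product IDs taken from the dataset, used here as a tiny candidate pool
known_ids = ["01DMBRYVA2P5H24WK0HTK4R0A1", "01DMBRYVA2Q2ST7MNYR6EEY4TK"]
best = closest_id("01DMBRYVA2P5H24WK", known_ids)
```

A raw distance favors candidates of similar length, which is one reason a normalized score is more convenient when inputs may be truncated.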

In [26]:
# import necessary packages
import difflib
from fuzzywuzzy import process
from fuzzywuzzy import fuzz

In [27]:
IDs = set(join_data['product_id'])

**Try difflib**<br>
SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching".

In [28]:
# try difflib with a truncated ID
input_id = '01DMBRYVA2P5H24WK'
difflib_result = difflib.get_close_matches(input_id, IDs)
difflib_result

['01DMBRYVA2P5H24WK0HTK4R0A1', '01DMBRYVA2Q2ST7MNYR6EEY4TK']

In [29]:
# try difflib again with a changed ID (junk characters appended)
input_id = '01DMBRYVA2P5H24WKasdaaf231qtwrqdasadfvtgydfuynfgdfsbnaBSF'
difflib_result = difflib.get_close_matches(input_id, IDs)
difflib_result
# difflib sometimes returns no match at all

[]
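The empty result above is most likely a consequence of the default `cutoff=0.6` in `difflib.get_close_matches`: candidates whose `SequenceMatcher` ratio falls below the cutoff are dropped. A minimal sketch, assuming a small illustrative `candidates` list built from two IDs in the dataset:

```python
import difflib

candidates = ["01DMBRYVA2P5H24WK0HTK4R0A1", "01DMBRYVA2Q2ST7MNYR6EEY4TK"]
short_id = "01DMBRYVA2P5H24WK"
noisy_id = short_id + "asdaaf231qtwrqdasadfvtgydfuynfgdfsbnaBSF"

# ratio = 2 * matched_characters / total_characters_in_both_strings
short_ratio = difflib.SequenceMatcher(None, short_id, candidates[0]).ratio()

# The default cutoff (0.6) drops every candidate for the noisy ID
strict = difflib.get_close_matches(noisy_id, candidates)

# Lowering the cutoff keeps the best-scoring candidate regardless of its score
relaxed = difflib.get_close_matches(noisy_id, candidates, n=1, cutoff=0.0)
```

Lowering the cutoff makes difflib behave more like a ranked search, at the cost of returning weak matches that then need to be filtered by score.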

**Try Fuzzy Matching - Levenshtein Distance**

In [30]:
input_id = '01DMBRYVA2P5H24WK'
fuzzywuzzy_result = process.extractBests(input_id, IDs, scorer=fuzz.ratio)
fuzzywuzzy_result

[('01DMBRYVA2P5H24WK0HTK4R0A1', 79),
 ('01DMBRYVA2Q2ST7MNYR6EEY4TK', 60),
 ('01DMBRYVA2S5T9W793F4CY41HE', 56),
 ('01DMBRYVA2PEPWFTT7RMP5AA1T', 56),
 ('01DMBRYVA2ZFDYRYY5TRQZJTBD', 51)]

In [31]:
input_id = '01DMBRYVA2P5H24WKasdaaf231qtwrqdasadfvtgydfuynfgdfsbnaBSF'
fuzzywuzzy_result = process.extractBests(input_id, IDs, scorer=fuzz.ratio)
fuzzywuzzy_result
# fuzzywuzzy always returns the best-scoring IDs, even when the scores are low

[('01DMBRYVA2P5H24WK0HTK4R0A1', 46),
 ('01DMBRYVA2PEPWFTT7RMP5AA1T', 39),
 ('01DMBRYVA2ZFDYRYY5TRQZJTBD', 36),
 ('01DVP7RVX271PQ54TKKWQ8DGYD', 34),
 ('01DMBRYVA2Q2ST7MNYR6EEY4TK', 31)]

**fuzzywuzzy seems to work better: when a few characters of an ID are mistyped, fuzzy matching still returns the closest candidates.**

**There are two cases in the business logic:**
* If the matched product ID already belongs to one or more outfit IDs in the database, we return every outfit that contains it.
* If the matched product ID exists in the database but has not been assigned to any outfit ID, we build outfits from the products whose IDs are closest to the ID entered: ('accessory', 'onepiece', 'shoe'), ('accessory', 'bottom', 'shoe', 'top'), or both, depending on the input item type.
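The choice of outfit template in the second case can be sketched as a small lookup. The function name `outfit_templates` and the two constants are hypothetical; they follow the item-type combinations listed above and the input-type rule used by `outfit_matcher`.

```python
ONEPIECE_OUTFIT = ("onepiece", "shoe", "accessory")
SEPARATES_OUTFIT = ("top", "bottom", "shoe", "accessory")


def outfit_templates(input_type: str):
    """Return the item-type templates to fill for a product with no existing outfit."""
    if input_type == "onepiece":
        return [ONEPIECE_OUTFIT]
    if input_type in ("top", "bottom"):
        return [SEPARATES_OUTFIT]
    # Shoes and accessories can go with either kind of outfit, so offer both
    return [ONEPIECE_OUTFIT, SEPARATES_OUTFIT]


templates_for_shoe = outfit_templates("shoe")
```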

In [32]:
# get all unique product_ids for each outfit_item_type
accessory_IDs = set(join_data[join_data['outfit_item_type']=='accessory']['product_id'])
bottom_IDs = set(join_data[join_data['outfit_item_type']=='bottom']['product_id'])
onepiece_IDs = set(join_data[join_data['outfit_item_type']=='onepiece']['product_id'])
shoe_IDs = set(join_data[join_data['outfit_item_type']=='shoe']['product_id'])
top_IDs = set(join_data[join_data['outfit_item_type']=='top']['product_id'])

In [33]:
def outfit_matcher(input_type: str, input_ID: str):
    fuzzy_result = process.extractBests(input_ID, IDs, scorer=fuzz.ratio, limit=1)
    matched_id = fuzzy_result[0][0]
    score = fuzzy_result[0][1]
    print('The Match Score is:  ', score)
    
    # We assume that a product that has not been assigned an outfit has its outfit_id set to NaN.
    # In join_data every row has an outfit_id, but once join_data is replaced with the full database,
    # many products are likely to have no outfit_id.
    if (join_data['outfit_id'][join_data['product_id']==matched_id].isna().sum())==len(join_data['outfit_id'][join_data['product_id']==matched_id]):
        match_accessory = process.extractBests(input_ID, accessory_IDs, scorer=fuzz.ratio, limit=1)[0][0]
        match_bottom = process.extractBests(input_ID, bottom_IDs, scorer=fuzz.ratio, limit=1)[0][0]
        match_onepiece = process.extractBests(input_ID, onepiece_IDs, scorer=fuzz.ratio, limit=1)[0][0]
        match_shoe = process.extractBests(input_ID, shoe_IDs, scorer=fuzz.ratio, limit=1)[0][0]
        match_top = process.extractBests(input_ID, top_IDs, scorer=fuzz.ratio, limit=1)[0][0]
        accessory = join_data[join_data['product_id']==match_accessory].reset_index().loc[[0],['outfit_item_type', 'product_full_name', 'product_id']]
        bottom = join_data[join_data['product_id']==match_bottom].reset_index().loc[[0],['outfit_item_type', 'product_full_name', 'product_id']]
        onepiece = join_data[join_data['product_id']==match_onepiece].reset_index().loc[[0],['outfit_item_type', 'product_full_name', 'product_id']]
        shoe = join_data[join_data['product_id']==match_shoe].reset_index().loc[[0],['outfit_item_type', 'product_full_name', 'product_id']]
        top = join_data[join_data['product_id']==match_top].reset_index().loc[[0],['outfit_item_type', 'product_full_name', 'product_id']]
        # A product with the same ID may be labelled with different types (for example, a dress can be
        # both 'top' and 'onepiece'), so we set the type explicitly to the one it was matched under
        accessory['outfit_item_type'] = 'accessory'
        bottom['outfit_item_type'] = 'bottom'
        onepiece['outfit_item_type'] = 'onepiece'
        shoe['outfit_item_type'] = 'shoe'
        top['outfit_item_type'] = 'top'
        # Recommend different outfits depending on whether the input is a onepiece
        # If input onepiece, then only return outfit with onepiece
        if input_type == 'onepiece':
            result1 = pd.concat([onepiece, shoe, accessory])
            result2 = pd.DataFrame()
        # If input top or bottom, then only return outfit with top and bottom
        elif input_type in ['top', 'bottom']:
            result1 = pd.concat([top, bottom, shoe, accessory])
            result2 = pd.DataFrame()
        # If input others, then return 2 outfits: one with onepiece, one with top & bottom
        else:
            result1 = pd.concat([onepiece, shoe, accessory])
            result1 = result1.append(pd.Series(['----------','----------','----------'], \
                                                          index=['outfit_item_type', 'product_full_name', 'product_id'],  name='Split Line'))
            result2 = pd.concat([top, bottom, shoe, accessory])
        result = pd.concat([result1, result2])
    else:
        outfit_match = list(join_data['outfit_id'][join_data['product_id']==matched_id])
        result = pd.DataFrame()
        for one_outfit_id in outfit_match:
            if pd.notna(one_outfit_id):
                match_frame = join_data[join_data['outfit_id']==one_outfit_id].loc[:,['outfit_id', 'outfit_item_type', 'product_full_name', 'product_id']]
                match_frame = match_frame.append(pd.Series(['----------','----------','----------','----------'], \
                                                          index=['outfit_id', 'outfit_item_type', 'product_full_name', 'product_id'],  name='Split Line'))
                result = pd.concat([result, match_frame])
    return result

Test a product_id that has already been assigned to 2 outfits; both are returned.

In [34]:
outfit_matcher('shoe', '01DMBRYVA2P5H24WK')

The Match Score is:   79


Unnamed: 0,outfit_id,outfit_item_type,product_full_name,product_id
0,01DDBHC62ES5K80P0KYJ56AM2T,bottom,slim knit skirt,01DMBRYVA2P5H24WK0HTK4R0A1
1,01DDBHC62ES5K80P0KYJ56AM2T,top,rib mock neck tank,01DMBRYVA2PEPWFTT7RMP5AA1T
2,01DDBHC62ES5K80P0KYJ56AM2T,accessory,medium margaux leather satchel,01DMBRYVA2S5T9W793F4CY41HE
3,01DDBHC62ES5K80P0KYJ56AM2T,shoe,penelope mid cap toe pump,01DMBRYVA2ZFDYRYY5TRQZJTBD
Split Line,----------,----------,----------,----------
4,01DMHCX50CFX5YNG99F3Y65GQW,bottom,slim knit skirt,01DMBRYVA2P5H24WK0HTK4R0A1
5,01DMHCX50CFX5YNG99F3Y65GQW,top,rib mock neck tank,01DMBRYVA2PEPWFTT7RMP5AA1T
6,01DMHCX50CFX5YNG99F3Y65GQW,shoe,penelope mid cap toe pump,01DMBRYVA2ZFDYRYY5TRQZJTBD
7,01DMHCX50CFX5YNG99F3Y65GQW,accessory,crystal clutch,01DMHCNT41E14QWP503V7CT9G6
Split Line,----------,----------,----------,----------


Test the other branch: no outfit_id has been assigned, meaning outfit_id = NaN.

In [35]:
join_data = join_data.append(pd.Series([np.nan, 'ABCDEFG123', np.nan, \
                                        np.nan, np.nan, np.nan, np.nan, np.nan, np.nan], index=join_data.columns), ignore_index=True)

In [36]:
IDs = set(join_data['product_id'])

In [37]:
# if the input item type is top or bottom, we should get 1 recommended outfit without a onepiece
outfit_matcher('top', 'ABCDEFG123')

The Match Score is:   100


Unnamed: 0,outfit_item_type,product_full_name,product_id
0,top,stretch-cotton jersey hoodie,01DVA54B99J6F2SRCNMEDXF3XN
0,bottom,pleated stretch-denim wide-leg pant,01DVA4YVM475BFGHRYN11N4N20
0,shoe,nike® killshot 2 sneaker,01DPCYE6AKJFKJSNCBDTQFG52Y
0,accessory,pristine mini two-tone leather shoulder bag,01DTJ8BNRJCMD36E0NY19MKZD3


In [38]:
# if the input item type is onepiece, we should get 1 recommended outfit without a top or bottom
outfit_matcher('onepiece', 'ABCDEFG123')

The Match Score is:   100


Unnamed: 0,outfit_item_type,product_full_name,product_id
0,onepiece,ramie mini slip dress,01DTASNVCZQD122G9B5THFJSR2
0,shoe,nike® killshot 2 sneaker,01DPCYE6AKJFKJSNCBDTQFG52Y
0,accessory,pristine mini two-tone leather shoulder bag,01DTJ8BNRJCMD36E0NY19MKZD3


In [39]:
# if the input item type is shoe or accessory, we should get 2 recommended outfits:
# ('accessory', 'onepiece', 'shoe') and ('accessory', 'bottom', 'shoe', 'top')
outfit_matcher('accessory', 'ABCDEFG123')

The Match Score is:   100


Unnamed: 0,outfit_item_type,product_full_name,product_id
0,onepiece,ramie mini slip dress,01DTASNVCZQD122G9B5THFJSR2
0,shoe,nike® killshot 2 sneaker,01DPCYE6AKJFKJSNCBDTQFG52Y
0,accessory,pristine mini two-tone leather shoulder bag,01DTJ8BNRJCMD36E0NY19MKZD3
Split Line,----------,----------,----------
0,top,stretch-cotton jersey hoodie,01DVA54B99J6F2SRCNMEDXF3XN
0,bottom,pleated stretch-denim wide-leg pant,01DVA4YVM475BFGHRYN11N4N20
0,shoe,nike® killshot 2 sneaker,01DPCYE6AKJFKJSNCBDTQFG52Y
0,accessory,pristine mini two-tone leather shoulder bag,01DTJ8BNRJCMD36E0NY19MKZD3


<a id = '32'></a>

## B. Match Descriptions

We use pre-trained word embeddings (spaCy's en_core_web_md model) and TF-IDF scores to calculate similarities between product descriptions and give recommendations. <br>
Our steps are: 
* Match the input product (product A) description with an existing product (product B) description in the database.
* Recommend an outfit based on an existing outfit for product B. If there is no existing outfit, form a new one by matching the closest description in each of the other item types.
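The TF-IDF-weighted document embedding can be sketched without spaCy or scikit-learn. The 3-dimensional vectors and the weights below are made-up toy values (the real en_core_web_md vectors have 300 dimensions), and `weighted_embedding` and `cosine` are hypothetical helper names.

```python
import math


def weighted_embedding(tokens, vectors, weights):
    """Average the token vectors, weighting each token by its TF-IDF score."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token in vectors and token in weights:  # skip tokens missing either value
            weight = weights[token]
            total = [t + weight * v for t, v in zip(total, vectors[token])]
            weight_sum += weight
    if weight_sum == 0:
        return total  # no usable tokens: return the zero vector instead of dividing by zero
    return [t / weight_sum for t in total]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Toy inputs (assumed values, for illustration only)
vectors = {"knit": [1.0, 0.0, 0.0], "skirt": [0.0, 1.0, 0.0], "dress": [0.0, 1.0, 1.0]}
weights = {"knit": 0.2, "skirt": 0.8, "dress": 0.5}

doc_a = weighted_embedding(["knit", "skirt"], vectors, weights)
doc_b = weighted_embedding(["dress"], vectors, weights)
similarity = cosine(doc_a, doc_b)
```

The zero-weight guard is a design choice of this sketch: dividing by a total TF-IDF score of zero, which happens when no token in a document has both a vector and a TF-IDF entry, would otherwise produce an undefined embedding.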

In [40]:
import pandas as pd
import spacy

# load spacy en_core_web_md model
nlp = spacy.load("en_core_web_md")

In [41]:
join_data2 = join_data.iloc[0:5199,:] 
# Section A appended one extra test row, so we select every row except the last one

In [42]:
# This is to take a look at the sentence vectors
for idx, info in enumerate(join_data2['tag_info']):
    print(nlp(info))
    print(nlp(info).vector[:10]) 
    
    if idx == 3: # stop after the first 4 documents; this takes a long time!
        break

eileen fisher slim knit skirt apparel  nice skirt
[ 0.14618534  0.00274011 -0.21634111  0.28702915  0.11337754  0.16731776
  0.26781687 -0.15008447 -0.04747277  0.56070554]
eileen fisher rib mock neck tank apparel  nice tank
[ 0.0624612   0.10584851 -0.233849    0.25422123  0.24160144  0.084532
  0.08296299 -0.06675752  0.07328629  0.78760993]
kate spade new york medium margaux leather satchel bag  nice bag
[-0.12475392  0.0595475   0.0077775  -0.23935592  0.10415664  0.21869141
 -0.01632317 -0.34990403  0.01141775  0.59389067]
tory burch penelope mid cap toe pump shoe  nice shoe
[ 0.01899355  0.13973817  0.03686982 -0.18735638  0.23253345 -0.03653726
 -0.07757377 -0.04530009  0.17124364  0.64829963]


Next, we test whether word embeddings are a good fit for this problem. For each of the more than 1,000 outfit combinations, we calculate the pairwise similarities between the products in the same outfit. The minimum similarity between products in the same outfit is about 0.40 (40%), so we concluded that this is a reasonable approach.
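The check described above reduces to taking the minimum pairwise cosine similarity within each outfit. A minimal sketch on toy 2-dimensional vectors (the values and the helper name `min_pairwise_similarity` are illustrative assumptions):

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


def min_pairwise_similarity(embeddings):
    """Smallest cosine similarity over all distinct pairs of item embeddings."""
    return min(cosine(u, v) for u, v in combinations(embeddings, 2))


# One toy "outfit" with three item embeddings
outfit = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
lowest = min_pairwise_similarity(outfit)
```

Taking the minimum over every outfit gives a rough lower bound on how similar items that belong together tend to be.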

In [43]:
outfit_id_set = set(join_data2['outfit_id'])
len(outfit_id_set)

1137

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

test_similarity_min = []
for outfit_id in outfit_id_set:
    test_data = join_data2[join_data2['outfit_id']==outfit_id]
    
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(test_data['tag_info'])
    tf_idf_lookup_table = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
    DOCUMENT_SUM_COLUMN = "DOCUMENT_TF_IDF_SUM"
    # sum the tf idf scores for each document
    tf_idf_lookup_table[DOCUMENT_SUM_COLUMN] = tf_idf_lookup_table.sum(axis=1)
    available_tf_idf_scores = tf_idf_lookup_table.columns # a list of all the columns we have
    available_tf_idf_scores = list(map( lambda x: x.lower(), available_tf_idf_scores)) # lowercase everything
    
    tag_info_vectors = []
    for idx, info in enumerate(test_data['tag_info']): # iterate through each info
        tokens = nlp(info) # have spaCy tokenize the tag_info text

        # initially start a running total of tf-idf scores for a document
        total_tf_idf_score_per_document = 0

        # start a running total of all zeroes (300 is the dimension of the en_core_web_md word vectors)
        running_total_word_embedding = np.zeros(300) 
        for token in tokens: # iterate through each token

            # only use tokens that have a pretrained word embedding and appear in the TF-IDF vocabulary
            if token.has_vector and token.text.lower() in available_tf_idf_scores:

                tf_idf_score = tf_idf_lookup_table.loc[idx, token.text.lower()]
                #print(f"{token} has tf-idf score of {tf_idf_lookup_table.loc[idx, token.text.lower()]}")
                running_total_word_embedding += tf_idf_score * token.vector

                total_tf_idf_score_per_document += tf_idf_score

        # divide the total embedding by the total tf-idf score for each document
        document_embedding = running_total_word_embedding / total_tf_idf_score_per_document
        tag_info_vectors.append(document_embedding)
    
    similarities = pd.DataFrame(cosine_similarity(tag_info_vectors), columns=list(range(len(test_data['tag_info']))), index=list(range(len(test_data['tag_info']))))
    new_min_similarity = np.min(np.min(similarities))
    test_similarity_min.append(new_min_similarity)
    

In [45]:
# This is the minimum similarity in the test 
round(np.min(test_similarity_min),3)

0.403

**This is where our recommendation system starts! You could input the description for the target product here:**

The logic for choosing the recommended outfit:
- Find the product with the highest similarity to the target input product. If that product already has an outfit_id, return all the products under that outfit_id.
- If the product has not been assigned to an outfit_id, return the ('accessory', 'onepiece', 'shoe') or ('accessory', 'bottom', 'shoe', 'top') products that are most similar to the target input product.
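The selection rule above can be sketched on plain Python records. The rows, similarity scores and product IDs below are toy assumptions, not values from the dataset, and `best_per_type` is a hypothetical helper.

```python
def best_per_type(rows, wanted_types):
    """For each wanted item type, keep the row with the highest similarity score."""
    best = {}
    for row in rows:
        item_type = row["outfit_item_type"]
        if item_type in wanted_types:
            if item_type not in best or row["similarity"] > best[item_type]["similarity"]:
                best[item_type] = row
    return best


rows = [
    {"product_id": "A1", "outfit_item_type": "shoe", "similarity": 0.71},
    {"product_id": "A2", "outfit_item_type": "shoe", "similarity": 0.84},
    {"product_id": "B1", "outfit_item_type": "accessory", "similarity": 0.65},
    {"product_id": "C1", "outfit_item_type": "top", "similarity": 0.90},
]

# Overall closest product, then the closest product of each type for a onepiece outfit
closest = max(rows, key=lambda row: row["similarity"])
picks = best_per_type(rows, {"onepiece", "shoe", "accessory"})
```

A type with no candidate rows (here `onepiece`) is simply absent from the result, so the caller can decide whether to fall back to the other outfit template.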

In [46]:
print('Please enter your descriptions for target product:')
input_des = input()
input_des = ' '.join([word for word in word_tokenize(input_des) if word not in nltk_stopwords])
input_des = ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(input_des)])
all_data = join_data2['tag_info'].append(pd.Series(input_des)).reset_index(drop=True)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(all_data)
tf_idf_lookup_table = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())

DOCUMENT_SUM_COLUMN = "DOCUMENT_TF_IDF_SUM"
tf_idf_lookup_table[DOCUMENT_SUM_COLUMN] = tf_idf_lookup_table.sum(axis=1)
available_tf_idf_scores = tf_idf_lookup_table.columns # a list of all the columns we have
available_tf_idf_scores = list(map( lambda x: x.lower(), available_tf_idf_scores)) # lowercase everything

tag_info_vectors = []
for idx, info in enumerate(all_data): # iterate through each info
    tokens = nlp(info) # have spaCy tokenize the product description

    # initially start a running total of tf-idf scores for a document
    total_tf_idf_score_per_document = 0

    # start a running total of all zeros (300 is the dimensionality of the en_core_web_md word vectors)
    running_total_word_embedding = np.zeros(300) 
    for token in tokens: # iterate through each token

        # only use tokens that have both a pretrained word embedding and a tf-idf score
        if token.has_vector and token.text.lower() in available_tf_idf_scores:

            tf_idf_score = tf_idf_lookup_table.loc[idx, token.text.lower()]
            #print(f"{token} has tf-idf score of {tf_idf_lookup_table.loc[idx, token.text.lower()]}")
            running_total_word_embedding += tf_idf_score * token.vector

            total_tf_idf_score_per_document += tf_idf_score

    # divide the total embedding by the total tf-idf score for each document
    document_embedding = running_total_word_embedding / total_tf_idf_score_per_document
    tag_info_vectors.append(document_embedding)

from sklearn.metrics.pairwise import cosine_similarity
similarities = pd.DataFrame(cosine_similarity(tag_info_vectors), columns=list(range(len(all_data))), index=list(range(len(all_data))))
similarities = similarities.unstack().reset_index()
similarities.columns = ["info1", "info2", "similarity"]
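input_idx = len(all_data) - 1 # the user's description is the last row of all_data (index 5199 here)
similarity_score = similarities[similarities['info1']==input_idx].reset_index().loc[0:input_idx-1, ['similarity']]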
similarity_matrix = join_data2.merge(similarity_score, left_index=True, right_index=True)

best_match = similarity_matrix[similarity_matrix['similarity']==np.max(similarity_matrix['similarity'])]
if best_match['outfit_id'].isna().mean() < 1:
    similarity_mean = []
    for best_outfit_id in best_match['outfit_id']:
        try:
            new_match = similarity_matrix[similarity_matrix['outfit_id']==best_outfit_id]
            new_score = np.mean(new_match['similarity'])
            similarity_mean.append(new_score)
        except Exception:
            continue
    best_score = np.max(similarity_mean)
    best_index = similarity_mean.index(best_score)
    output_data = similarity_matrix[similarity_matrix['outfit_id']==best_match.iloc[best_index,0]][['outfit_item_type','product_full_name','product_id']]
else:
    # Use a regex to check whether the input describes a one-piece item.
    # If so, we only return ('accessory', 'onepiece', 'shoe'). If not, we return ('accessory', 'bottom', 'shoe', 'top').
    match = re.search(r"\b(one-? ?piece|dress(es)?|rompers?|jumpsuits?|gowns?)\b",input_des,flags=re.IGNORECASE)
    # keep the most similar product within each outfit_item_type
    similarity_result = similarity_matrix[similarity_matrix.groupby(['outfit_item_type'])['similarity'].transform('max')==similarity_matrix['similarity']]
    output_data = similarity_result[['outfit_item_type', 'product_full_name', 'product_id']].drop_duplicates()
    if match is not None:
        output_data = output_data[output_data.outfit_item_type != 'top']
        output_data = output_data[output_data.outfit_item_type != 'bottom']
    else:
        output_data = output_data[output_data.outfit_item_type != 'onepiece']

output_data

Please enter your descriptions for target product:
boho, dress night out dating


Unnamed: 0,outfit_item_type,product_full_name,product_id
90,onepiece,ida dress,01DPD4R5X5TQCWTVTC2AEAFC10
91,accessory,cassi belt bag,01DPEHS0XH9PDD1GH5ZE4P43A2
92,accessory,woman 2011 icon trench,01DPGV0TFFJ720BT3F8ADN4V7P
93,shoe,virginia boot,01DPKNCMSFAWF2HVQSRHHXDV0K
