In [1]:
%%html
<style>
div.output_subarea pre {
    white-space: pre;
}
</style>

# Unsupervised clustering approaches

Three unsupervised clustering approaches are potentially applicable here:
1. PCA + K-means
2. Graph-based embedding
3. LDA

**#1:** The features available for each product are mostly categorical rather than continuous numeric values. Since this is closer to a categorization problem, PCA may not yield the best result in this scenario.

**#2:** A graph-based embedding approach could be an interesting one. However, support for graph embedding in PySpark MLlib is lagging. `node2vec` may be possible with a workaround, while `graphSage` would require implementing almost everything from scratch. (If time allows, I may revisit this later.)

**#3:** Lastly, LDA, an NLP-based approach, where we treat the categorization as a topic clustering problem.

In the end I decided to use #3 due to time and effort constraints. The clustering experiment follows.
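
To make the "categorization as topic clustering" idea concrete, here is a minimal sketch of how one product record could be turned into a bag of tokens (a "document") for a topic model. The function name `product_to_tokens` and the choice of fields (product name, categories, brands) are assumptions for illustration; the notebook itself only loads precomputed TF-IDF features and does not show how they were built.

```python
import re


def product_to_tokens(product_name, categories, brands):
    """Turn one product record into a list of lowercase tokens (hypothetical helper)."""
    # Split the free-text product name into alphanumeric words.
    tokens = re.findall(r"[a-z0-9]+", product_name.lower())
    # Keep each category as a single token so multi-word values stay together.
    tokens += [c.lower().replace(" ", "_") for c in categories]
    # Brands are already slug-like in the sample output, so only lowercase them.
    tokens += [b.lower() for b in brands]
    return tokens


doc = product_to_tokens(
    "Brown Knitted Pique Polo", ["Polos", "Long Sleeve"], ["bottega-veneta"]
)
print(doc)
# ['brown', 'knitted', 'pique', 'polo', 'polos', 'long_sleeve', 'bottega-veneta']
```

Each product then plays the role of a document, and the topic model groups products whose tokens tend to co-occur.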

In [2]:
import findspark; findspark.init()
from os import environ, listdir, path, getcwd
from pathlib import Path
import numpy as np
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import Window
from pyspark.ml.clustering import LDA, LocalLDAModel, LDAModel
from clustering_ex import utils, udf

### Transformed data set based on different K values
Change the `K` value to see a different clustering output. I experimented with `K` values from 5 to 8 under different hyperparameter settings. The outputs can be found in `s3://zip-ex/output/`.

**So far, `K=6` produces the best clustering results.**

In [3]:
# Output prefix of the saved run for each K value
bucket_by_k = {
    '5': 'output/1622372691_550',
    '6': 'output/1622377560_65050',
    '7': 'output/1622375164_7-6956370388595972856',
}
K = '6'
bucket_prefix = bucket_by_k[K]

### Initial item data exploration and analysis (DEA)

What I learned from the DEA below: a product with the exact same `product name` can have multiple `item code`s and be sold under different `brands` and `categories` by different `retailers`.

**Here, I am making the assumption that products with the same `product_name` are the same product across the given dataset.**
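
The assumption above amounts to grouping raw rows by `product_name` and counting the distinct codes, brands, categories, and retailers seen for each name. A minimal pure-Python sketch follows; the raw field names (`item_code`, `brand`, `category`, `retailer`) and the sample rows are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def summarize_by_product_name(rows):
    """Group raw item rows by product_name and count distinct attribute values."""
    grouped = defaultdict(
        lambda: {"codes": set(), "brands": set(), "categories": set(), "retailers": set()}
    )
    for row in rows:
        entry = grouped[row["product_name"]]
        entry["codes"].add(row["item_code"])
        entry["brands"].add(row["brand"])
        entry["categories"].add(row["category"])
        entry["retailers"].add(row["retailer"])
    # Reduce each set to a count, similar in spirit to the *_count columns.
    return {
        name: {
            "code_count": len(e["codes"]),
            "brand_count": len(e["brands"]),
            "category_count": len(e["categories"]),
            "retailer_count": len(e["retailers"]),
        }
        for name, e in grouped.items()
    }


# Made-up rows, for illustration only.
rows = [
    {"product_name": "Example Polo", "item_code": "A1", "brand": "brand-x",
     "category": "Polos", "retailer": "shop-1"},
    {"product_name": "Example Polo", "item_code": "A2", "brand": "brand-x",
     "category": "Short Sleeve", "retailer": "shop-2"},
]
summary = summarize_by_product_name(rows)
print(summary["Example Polo"])
# {'code_count': 2, 'brand_count': 1, 'category_count': 2, 'retailer_count': 2}
```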

In [4]:
environ['PYSPARK_SUBMIT_ARGS'] = "--packages=com.amazonaws:aws-java-sdk:1.11.900,org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell"
environ['DEBUG'] = "1"
environ['PYSPARK_PYTHON']=f'{Path(getcwd())}/.tox/dev/bin/python'
session, logger, settings = utils.start_spark()

getting spark session
spark session created


In [5]:
item_df = session.read.parquet(f'{settings["base_bucket"]}/{bucket_prefix}/item_df.parquet')
tf_idf_feature_df = session.read.parquet(f'{settings["base_bucket"]}/{bucket_prefix}/item_tf_idf_df.parquet')
item_df.describe(['price','code_count','category_count','retrailer_count','brand_count']).show()

+-------+-----------------+------------------+------------------+-------------------+------------------+
|summary|            price|        code_count|    category_count|    retrailer_count|       brand_count|
+-------+-----------------+------------------+------------------+-------------------+------------------+
|  count|            21355|             21415|             21415|              21415|             21415|
|   mean|278.1060359519797| 1.252860144758347| 1.884753677328975| 1.0146626196591175|1.0285314032220407|
| stddev|1107.255797879404|2.2371720839549876|0.7366245146089858|0.15332252187577486|0.4777464588368348|
|    min|             1.45|                 1|                 1|                  1|                 1|
|    max|          78737.0|               186|                11|                  8|                45|
+-------+-----------------+------------------+------------------+-------------------+------------------+



In [6]:
item_overview = item_df.select(
    F.countDistinct('product_name').alias('product_distinct'),
    F.size(udf.merge_lists(F.collect_set('retailers'))).alias('retailers_distinct'), 
    F.size(udf.merge_lists(F.collect_set('categories'))).alias('categories_distinct'), 
    F.size(udf.merge_lists(F.collect_set('brands'))).alias('brands_distinct')
)
item_overview.show()

+----------------+------------------+-------------------+---------------+
|product_distinct|retailers_distinct|categories_distinct|brands_distinct|
+----------------+------------------+-------------------+---------------+
|           21415|                87|                212|           2178|
+----------------+------------------+-------------------+---------------+
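
The `udf.merge_lists` helper comes from the project's own `clustering_ex` package and its source is not shown here. Judging from how it is used with `F.collect_set` and `F.size`, it presumably flattens a list of lists into one list of distinct values. A pure-Python sketch of that assumed behaviour:

```python
def merge_lists(list_of_lists):
    """Flatten nested lists into one list of distinct values, keeping first-seen order.

    This is an assumed re-implementation for illustration, not the project's UDF.
    """
    seen = set()
    merged = []
    for inner in list_of_lists:
        for value in inner:
            if value not in seen:
                seen.add(value)
                merged.append(value)
    return merged


collected = [["Polos", "Short Sleeve"], ["Polos", "Long Sleeve"]]
distinct = merge_lists(collected)
print(distinct, len(distinct))
# ['Polos', 'Short Sleeve', 'Long Sleeve'] 3
```

Taking the size of the merged list is what yields counts such as `categories_distinct`.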



## LDA-based topic categories, findings and insights

Roughly, we can see that the LDA model classifies the dataset into 6 categories:

- **TOPIC 0,** Designer fashion brands clothing
- **TOPIC 1,** Jumpsuits, Rompers, women's fashion 
- **TOPIC 2,** Sportswear
- **TOPIC 3,** Fashion Accessories
- **TOPIC 4,** Face masks 
- **TOPIC 5,** Skincare, Cosmetics

Below are a couple of interesting findings from looking deeper into the item data in each category.

**1. Different "polo" brands are clustered in different topics**

Though "Polo" shows up a lot in both *Topic 0* and *Topic 2*, the brand information in each topic reveals the difference.

For example, luxury/designer brands such as "burberry", "givenchy", and "saint-laurent" are mostly clustered inside *Topic 0* for "polo"-related products.

Meanwhile, in *Topic 2*, the brands we see are mostly sports brands such as "nike", "adidas", and "puma".

**2. "Face Masks" means different things across topics**

"Face Masks" also shows up in both *Topic 4* and *Topic 5*.

However, when we look into the corresponding topics, we find that in *Topic 5* "face masks" means skincare products, such as "Forever Glow Anti-Aging Face Mask 5 Pack".

In *Topic 4*, by contrast, it means different types of protective/fashion face masks, driven by the recent COVID-19 pandemic.
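
One way to check findings like these is to count brands per topic for products whose name contains a keyword. The sketch below is a pure-Python illustration; `brands_by_topic` is a hypothetical helper, and the rows marked as made up (including their brands and topic assignments) are not taken from the dataset.

```python
from collections import Counter, defaultdict


def brands_by_topic(items, keyword):
    """Count brands per topic for items whose product name contains the keyword."""
    counts = defaultdict(Counter)
    for item in items:
        if keyword.lower() in item["product_name"].lower():
            counts[item["topic"]].update(item["brands"])
    return counts


items = [
    # These two rows appear in the sample output for Topic 0.
    {"topic": 0, "product_name": "Black Triptych Bolt Polo", "brands": ["neil-barrett"]},
    {"topic": 0, "product_name": "Brown Knitted Pique Polo", "brands": ["bottega-veneta"]},
    # Made-up rows, for illustration only.
    {"topic": 2, "product_name": "Example Training Polo", "brands": ["nike"]},
    {"topic": 5, "product_name": "Example Face Mask", "brands": ["brand-y"]},
]

counts = brands_by_topic(items, "polo")
for topic in sorted(counts):
    print(topic, dict(counts[topic]))
# 0 {'neil-barrett': 1, 'bottega-veneta': 1}
# 2 {'nike': 1}
```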

**Sample Top 10 products in each category topic**

In [7]:
item_topic_df = session.read.parquet(f'{settings["base_bucket"]}/{bucket_prefix}/item_topic_df.parquet')
# Rank items within each topic by descending topic score
windowSpec = Window.partitionBy("topic").orderBy(F.col('score').desc())
top_10_samples = (
    item_topic_df
    .withColumn('rank', F.row_number().over(windowSpec))
    .where('rank<=10')
    .sort('topic')
    .select('topic', 'product_name', 'categories', 'brands')
)
top_10_samples.show(60, False)

+-----+-----------------------------------------------------------------+--------------------------------------------------------------------------+-----------------------------------+
|topic|product_name                                                     |categories                                                                |brands                             |
+-----+-----------------------------------------------------------------+--------------------------------------------------------------------------+-----------------------------------+
|0    |Black Triptych Bolt Polo                                         |[Polos, Short Sleeve]                                                     |[neil-barrett]                     |
|0    |Brown Knitted Pique Polo                                         |[Polos, Long Sleeve]                                                      |[bottega-veneta]                   |
|0    |Black and Beige Merino Wool Blakeford Polo                       |[P
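
For readers less familiar with Spark window functions, the ranking in the cell above (partition by `topic`, order by `score` descending, keep `rank <= 10`) can be sketched in plain Python. The helper name `top_n_per_topic` and the sample scores are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def top_n_per_topic(rows, n):
    """Keep the n highest-scoring rows per topic and attach a 1-based rank."""
    # Sort by topic first, then by descending score within each topic.
    ordered = sorted(rows, key=lambda r: (r["topic"], -r["score"]))
    result = []
    for _, group in groupby(ordered, key=itemgetter("topic")):
        for rank, row in enumerate(group, start=1):
            if rank > n:
                break
            result.append({**row, "rank": rank})
    return result


# Made-up rows and scores, for illustration only.
rows = [
    {"topic": 0, "product_name": "a", "score": 0.7},
    {"topic": 0, "product_name": "b", "score": 0.9},
    {"topic": 0, "product_name": "c", "score": 0.8},
    {"topic": 1, "product_name": "d", "score": 0.6},
]
top = top_n_per_topic(rows, 2)
print([(r["topic"], r["product_name"], r["rank"]) for r in top])
# [(0, 'b', 1), (0, 'c', 2), (1, 'd', 1)]
```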

**Topic item samples mapped to product categories**

In [8]:
# Collect the distinct categories and brands seen in each topic's top-10 samples
sample_category_by_topic = top_10_samples.groupBy('topic').agg(
    udf.merge_lists(F.collect_set('categories')).alias('distinct_category'),
    udf.merge_lists(F.collect_set('brands')).alias('distinct_brands'),
)
sample_category_by_topic.sort('topic').show(truncate=False)

+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|topic|distinct_category                                                                                                                                                           |distinct_brands                                                                                                                        |
+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|0    |[Tennis, Long Sleeve, Polos, Short Sleeve]

## Data Visualization with LDAvis

**NOTE: the numbers inside the circles are not the same as the topic numbers used above. They are ranked by topic impact (a pyLDAvis implementation detail).**

In [9]:
import pyLDAvis

output_bucket = f"{settings['base_bucket']}/{bucket_prefix}"
# Load the precomputed inputs for pyLDAvis.prepare from S3
stats = utils.read_lda_vis_data_s3(settings['base_bucket'], f"{bucket_prefix}/lda_vis")
pyLDAvis.enable_notebook()
lda_vis = pyLDAvis.prepare(**stats)
pyLDAvis.display(lda_vis)
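
To relate the circle numbers back to the topic numbers used in the findings, the renumbering can be sketched as follows. This assumes pyLDAvis orders topics by their share of tokens, largest first, which is my understanding of its default behaviour. The helper `circle_to_topic` and the proportions below are made up for illustration and are not results from this model.

```python
def circle_to_topic(topic_proportions):
    """Map 1-based circle numbers to original 0-based topic ids, largest share first."""
    order = sorted(
        range(len(topic_proportions)),
        key=lambda t: topic_proportions[t],
        reverse=True,
    )
    return {circle: topic for circle, topic in enumerate(order, start=1)}


# Made-up proportions for six topics, for illustration only.
mapping = circle_to_topic([0.10, 0.25, 0.30, 0.05, 0.12, 0.18])
print(mapping)
# {1: 2, 2: 1, 3: 5, 4: 4, 5: 0, 6: 3}
```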