# Problem



1. Given a product description, can it recognize whether the product is in Salty Snack category. This classifier could be helpful to identify potential competitor products for an eCommerce seller or brand. This classifier could be extended to other category or a multiclasses classifier. However, for the simplicity of the task, I only train the model to perform a binary classifier task.

2. Given 2 product descriptions, the model need to classify whether they are the same product or not. This problem is a multi billions problem for a eCommerce player who sells its products across multiple platforms. Same products usually listed differently in different retailers, it becomes a challenges for seller to track the performances of the products across multiple platform with the lack of mapping tables. One of the usecase is to track the price and sales across multiple retailer so eCommerce seller can optimize their pricing strategy. Product matching could be a valuable tool for marketplace such as Amazon or Google shopping where one seller could launch multiple product listings with the exact same products and slightly different descriptions or title in order to get more market share. Human could easily tell 2 products identical even if there are slightly different phrasing or keywords in product description or differences in product pages. Therefore, a model that could detect identical products by comparing its product pages could save tons of mannual work



### Data

There are total 113788 product descriptions scraped from different eCommerce retailers such as Amazon.com, Walmart.com, Kroger.com, Target.com, Instacrt etc. In this dataset, there are 53,788 positive label and 60,000 negative label. Positive labels are corresponding to products in Salty Snack category and negative labels are products not belong to Salty Snack category. The only feature the model will be using is the product description

### Feature Engineering

Since there is only one feature available, product description, we need to firstly vectorized the text into model learnable features. In NLP area, such vectorization process is called Word2Vec. In this exercise, I used 2 methods to encode the product descriptions 
- TFIDF
- BERT Sentence Embedding


### Algorithms

In this problem, I will implement 5 algorithms to classify the product into Salty Snack Category and Non-Salty Snack.

- Decision Trees. 
    - Different Purning methods
    - How to splite and select the features
    

- Neural Networks. 
    - Layers and Neurons
    - Activation function
    - Batch size
    - Optimizers
    - Termination

- Boosting. 
    - Different trees
    - Purning
    


- Support Vector Machines.
    - kernel functions

This should be done in such a way that you can swap out kernel functions. I'd like to see at least two.

- k-Nearest Neighbors. 

    - Use different values of k.


### Experiments
### Metrics & Testing


- the training and testing error rates you obtained running the various learning algorithms on your problems. 
- Graph of perofrmance as function of training size At the very least you should include graphs that show performance on both training and test data as a function of training size (note
- Training time/ iterations and conversion
- Learning curvers


In [1]:
import pandas as pd
from pathlib import Path

DATAPATH = "/Users/surichen/Documents/Suri/GeogiaTech/Spring2023/data/project1"

data = pd.read_csv(Path(DATAPATH, "product_categories.csv"))

In [2]:
data.label.value_counts()

0    60000
1    53788
Name: label, dtype: int64

# Feature Engineering

Since we are dealing with text data, we need to firstly vectorized the text into features. There are multiple word2vec technique we could use. From simple bag of words to embedding. For the assignment purpose, I chose 2 methods
- TF-IDF
- Sentence-BERT embedding: Pre-trained sentence embeddings using Siameses BERT-Networks

Text cleaning

In [3]:
from product_matcher.processor import clean_string

In [4]:
data.head(1)

Unnamed: 0,TITLE,SHORT_DESCRIPTION,PRODUCT_DESCRIPTION,IRI_CATEGORY_NAME,label
0,KETO FLCK CHKN CHPS BBQ 3 OZ,,KETO FLCK CHKN CHPS BBQ 3 OZ,SALTY SNACKS,1


In [6]:
data['TITLE_CLEANED'] = data['TITLE'].apply(lambda x: clean_string(x))
data = data.dropna(subset=['TITLE_CLEANED']).reset_index(drop=True)
# shuffling data
data = data.sample(frac=1)
data.reset_index(inplace=True)
data.to_csv("../data/product_categories_cleaned.csv", index=False)

# Train - Test split

1. Shuffle data

In [22]:
split = int(0.75 * len(data))

train = data.loc[:split,]
test = data.loc[split:,]

In [24]:
train.to_csv('../data/problem1_training_data.csv', index=False)
test.to_csv('../data/problem1_testing_data.csv', index=False)

# TF-IDF

Note: to fit a tfidf vectorizer, we could only use training data

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = train.TITLE_CLEANED.tolist()

In [26]:
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(corpus)
X_tfidf.shape

(85332, 22966)

saving vectorizer and tf_idf features

In [27]:
import pickle
pickle.dump(vectorizer, open("../product_matcher/model/tfidfvectorizer.pickle", "wb"))
pickle.dump(X_tfidf, open("../data/problem1_tfidf.pickle", "wb"))

There are total 113,775 vocabularies in this corpus

# Sentence Bert

sentence embedding generates 768 dimensions of feature spaces

In [28]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')

In [29]:
X_bert_train = sentence_embeddings = model.encode(corpus)
pickle.dump(X_bert_train, open("../data/problem1_sentence_emb_train.pickle", "wb"))

In [35]:
corpus = test.TITLE_CLEANED.tolist()
X_bert_test =  model.encode(corpus)


In [38]:
pickle.dump(X_bert_test, open("../data/problem1_sentence_emb_test.pickle", "wb"))

# Building Model

experiments (for each model)
- Number of training samples
    - Accuracy
        - F score
        - Recalls
        - Precisions
    - Complexity
        - Training/Learning time
        - Prediction time
    - Convergence Plot

In [41]:
data.TITLE_CLEANED.tolist()

['21z stacy s simply naked cracker crisps',
 'yerbaã   yerbaã  orange vanilla dream yerba mate sparkling water  16 fl oz',
 'frito lay snacks mixed',
 '180 snacks skinny rice bar with himalayan salt 2 variety pack  total 14 bars',
 'kinder bueno mlk chcl hzl crm wfr indv wrpd in plst bg 75 oz',
 'tropicana pure premium healthy kids 100  orange juice no pulp 52 fl oz',
 '24125 oz  30ct ms vickies variety pk',
 'quaker baked squares strawberry 422z20pk',
 'mission  mission restaurant style tortilla strips fiesta size  20 oz',
 'bobcat water 1ltrplbott',
 'yodr hot cracklins 55 oz',
 'pasta roni butter  herb italiano 55 oz',
 'snyder s of hanover  snyder s of hanover pretzel pieces cheddar ale  10 oz',
 'raze  raze energy drink sour gummy worms  16 oz',
 'frito lay snacks mixed variety pack 78 oz  36 count',
 'entm bluebry min mffn 5ct 825 oz',
 'unknown',
 'cheetos crunchy cheese flavored snacks 1 oz',
 'pepsi diet 2ltr kfp',
 'diet candry gale 2ltrplsngl  8',
 'h  e  b select ingredient

In [62]:
data.iloc[4392, [1,6]].tolist()

['Ruffles Potato Chips Queso Cheese Flavored 2 1/8 Oz',
 'ruffles potato chips queso cheese flavored 2 18 oz']

In [64]:
print("""
Product Title: Lay's Potato Chips Queso Cheese Flavored 2-1/8 Oz (12Packs)
Cleaned Title: lays potato chips queso cheese flavored 2 18 oz 12packs
""")


Product Title: Lay's Potato Chips Queso Cheese Flavored 2-1/8 Oz (12Packs)
Cleaned Title: lays potato chips queso cheese flavored 2 18 oz 12packs

