NATURAL LANGUAGE PROCESSING FOR SHOPEE PRODUCT TITLES

- NAME: Surbhi Prasad, Karishma Chauhan
- GOAL: Find similar products from title information
- ALGORITHM: Logistic Regression


<b> RUNTIME Total </b> :
- Total: ~2 hours
- Cosine similarity: 10 mins (reduced from 48 mins to 10 mins by caching)
- Classification model: 1.92 hours

<b> AIM </b> : 
Find the closest products on the Shopee website from the e-commerce data provided. The data contains ~32.5K product images with their titles. The idea is to use the titles and descriptions of products to predict the closest other products for recommendation.

<b> METHOD </b> :

1. Data Import: Loaded the processed text file from MongoDB for all images with titles.
2. Label Group Encoding: Encoded the label group (the product category of an image) as a label code with StringIndexer.
3. Paired Dataset: Created all possible pairs of titles, keeping only label groups with at least 5 titles so that each label group has enough data for training.
4. Target Variable Creation (for Binary Classification): Pairs of titles from the same label group were labeled 1, and all remaining pairs from different label groups were labeled 0.
5. Negative Sampling to Balance Data: Since there are ~100 times more 0s than 1s, we downsampled the 0s, keeping as many negative samples as positive samples for a 1:1 ratio.
6. Embeddings Spark NLP Pipeline: Extracted two multilingual embeddings, since most of the products are from the Indonesian market: the last layer of XLM-RoBERTa (768-dimensional feature vector) and UniversalWord (500-feature vector) embeddings from the Spark NLP package (TensorFlow).
7. Embeddings Cosine Similarity: Calculated the cosine similarity (d) between the paired title embeddings, for both embedding types, to be fed into the later classification model.
8. Train-Test Split: Created training and validation data with a random 0.8/0.2 split and chose 'accuracy' as the tuning metric.
9. Classification Pipeline:
a) Used the cosine similarities (d) as "features" and the binary label as "label" for classification.
b) Assembled the data with VectorAssembler and ran a Logistic Regression pipeline and a Random Forest pipeline.
c) Ran 5-fold cross-validation with logistic regression, improving validation accuracy to 0.74.
10. Ensembled both model results.
11. Evaluated on test data.
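
The class imbalance mentioned in the negative-sampling step follows from simple counting. As a sketch, let n_g be the number of titles in label group g, N the total number of titles, P the number of ordered positive pairs, and Q the number of ordered negative pairs (self-pairs excluded). These symbols are introduced here for illustration only:

```latex
N = \sum_{g} n_g, \qquad
P = \sum_{g} n_g (n_g - 1), \qquad
Q = N(N-1) - P, \qquad
\frac{Q}{P} = \frac{N(N-1)}{\sum_{g} n_g (n_g - 1)} - 1
```

For a hypothetical catalogue of 100 groups with 10 titles each (not the actual data), N = 1000, P = 9000, Q = 990000, and Q/P = 110, which is the same order of magnitude as the ~100x imbalance described above.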

<b> RESULT </b>

Accuracy Achieved: 
- Train: 0.659
- Val: 0.657

Cluster: 
- Group-3 GPU: 64 GB, 16 cores, DBR 10.3 ML, Spark 3.2.1
- This cluster was used because the embeddings were computationally expensive to run (~800 features)

Main Packages Used: 
- Spark NLP version:  3.4.2
- Apache Spark version:  3.2.1

Future Scope:
1. Fine-tune the BERT embeddings with an ArcFace loss to improve model accuracy.
2. Create an ensemble model of the two classification methods for the final prediction.
3. Combine this model with image models in an ensemble.

In [0]:
!pip install spark-nlp #this sometimes does not work unless the "johnsnowlabs Spark NLP" library is also added to the cluster via Maven


In [0]:
import numpy as np
import tensorflow as tf
import torch

import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

from pyspark import SparkContext, SparkConf
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql import functions as f  # single alias for pyspark.sql.functions
from pyspark.sql.functions import col, explode, rand, shuffle, split
from pyspark.sql.types import *


In [0]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

In [0]:
# Note: the executor memory key must not contain a trailing space, or the setting does not apply.
spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.executor.memory", "16G")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.network.timeout", "36000000s")\
    .config("spark.executor.heartbeatInterval", "3600s")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1")\
    .getOrCreate()

In [0]:
spark.version

In [0]:
import sparknlp
spark = sparknlp.start(spark32=True)

In [0]:
database = 'Group3'
collection = 'train_csv'
user_name = 'root'
password = 'root'
address = 'cluster0.eq07a.mongodb.net'
connection_string = f"mongodb+srv://{user_name}:{password}@{address}/{database}.{collection}"
df=spark.read.format("mongo").option("uri",connection_string).load().cache()

LOAD PROCESSED DATA FILE FROM MONGODB FOR TRAINING MODEL

In [0]:

df.printSchema()

LABEL ENCODING OF NON-CONTINUOUS LABEL GROUP IDS FOR BETTER USAGE AS CONTINUOUS CATEGORIES

In [0]:
indexer = StringIndexer(inputCol="label_group", outputCol="label_code") 
indexed = indexer.setHandleInvalid("skip").fit(df).transform(df) 
indexed.show()

FILTERED DATA TO LABEL GROUPS WITH AT LEAST 5 IMAGES FOR TRAINING

In [0]:
# "At least 5" means count >= 5 (i.e. greater than 4, as the variable name suggests).
df_gt_4_cnt=indexed.groupby("label_code").count().filter("count>=5")

In [0]:
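# Keep only rows whose label_code passed the group-size filter.
# A left-semi join avoids collecting the label codes to the driver.
df_gt_4=indexed.join(df_gt_4_cnt.select("label_code"), on="label_code", how="left_semi")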
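
The group-size filter above can be illustrated on a handful of rows in plain Python. This is a minimal sketch rather than the Spark code itself. The toy rows, the `MIN_GROUP_SIZE` constant and the `filter_small_groups` helper are hypothetical names introduced for illustration.

```python
from collections import Counter

# Hypothetical toy rows of (title, label_group); not taken from the dataset.
rows = [("t%d" % i, "A") for i in range(5)] + [("u%d" % i, "B") for i in range(3)]

MIN_GROUP_SIZE = 5  # assumed threshold, mirroring the "at least 5" rule


def filter_small_groups(rows, min_size):
    """Keep only rows whose label group has at least `min_size` members."""
    counts = Counter(label for _, label in rows)
    return [(title, label) for title, label in rows if counts[label] >= min_size]


kept = filter_small_groups(rows, MIN_GROUP_SIZE)
print(len(kept), "rows kept")
```

Group "A" has exactly 5 titles and is kept, while group "B" has only 3 and is dropped.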

In [0]:
# Cross join the filtered data with itself to build every ordered pair of products, then shuffle.
df_filtered = df_gt_4.select('title','image','label_code') \
    .crossJoin(df_gt_4.select('title','image','label_code')) \
    .orderBy(rand())
newcolumns = ['title1', 'image1', 'label_code1', 'title2', 'image2', 'label_code2']
result_df=df_filtered.toDF(*newcolumns)

CREATED BINARY LABEL FOR ALL POSSIBLE PAIRS OF PRODUCTS.

In [0]:
result_df_label=result_df.withColumn('binary_label',f.when((result_df.label_code1== result_df.label_code2), '1').otherwise('0') )

In [0]:
result_df_label=result_df_label.filter("image1!=image2")
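
The pairing and labeling logic of the cells above (cross join, binary label, removal of self-pairs) can be sketched in plain Python on a tiny catalogue. The items and the `build_labeled_pairs` helper are hypothetical and only illustrate the idea.

```python
from itertools import product

# Hypothetical toy catalogue of (image_id, title, label_code).
items = [
    ("img1", "red shoe", 0),
    ("img2", "shoe red size 40", 0),
    ("img3", "phone case", 1),
]


def build_labeled_pairs(items):
    """Return ordered pairs (title1, title2, binary_label), skipping self-pairs."""
    pairs = []
    for (img1, t1, c1), (img2, t2, c2) in product(items, repeat=2):
        if img1 == img2:
            continue  # a product paired with itself carries no information
        pairs.append((t1, t2, 1 if c1 == c2 else 0))
    return pairs


pairs = build_labeled_pairs(items)
num_positive = sum(label for _, _, label in pairs)
print(len(pairs), "pairs,", num_positive, "positive")
```

Because the pairs are ordered, every positive pair appears twice, once as (a, b) and once as (b, a).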

In [0]:
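# The cross join yields each unordered pair twice, as (a, b) and (b, a); halving gives the number of unique positive pairs.
# df_total below keeps both orderings of the positives, so sampling negatives at this rate gives roughly one negative row per two positive rows.
cnt_1=(result_df_label.filter("binary_label==1").count())/2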

In [0]:
neg_frac=cnt_1/(result_df_label.filter("binary_label==0").count())

In [0]:
df_0=result_df_label.filter("binary_label==0").sample(fraction=neg_frac,seed=3)

In [0]:
df_total=df_0.union(result_df_label.filter("binary_label==1"))
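
As a minimal sketch of the balancing idea, the snippet below downsamples the negative class in plain Python. It assumes an exact sample of as many negatives as positives, whereas Spark's `sample(fraction=...)` used above is approximate. The toy pairs and the `balance_by_downsampling` helper are hypothetical.

```python
import random

# Hypothetical toy pairs of (title1, title2, binary_label).
toy_pairs = [("a", "b", 1), ("b", "a", 1)] + [("a", "x%d" % i, 0) for i in range(10)]


def balance_by_downsampling(pairs, seed=3):
    """Keep every positive pair and an equally sized random sample of negatives."""
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    rng = random.Random(seed)  # fixed seed for reproducibility
    k = min(len(positives), len(negatives))
    return positives + rng.sample(negatives, k)


balanced = balance_by_downsampling(toy_pairs)
n_pos = sum(1 for p in balanced if p[2] == 1)
n_neg = sum(1 for p in balanced if p[2] == 0)
print(n_pos, "positives,", n_neg, "negatives")
```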

In [0]:
df_total.printSchema()

In [0]:
df_total.cache()

In [0]:
print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

In [0]:
!java -version

Extract embeddings for all pairs using SPARK NLP pipeline

In [0]:
# Stage 1: wrap the raw title of the first product in a Spark NLP document.
document = DocumentAssembler()\
    .setInputCol("title1")\
    .setOutputCol("document")
# Stage 2: multilingual XLM-RoBERTa sentence embeddings (default pretrained model).
embeddings = XlmRoBertaSentenceEmbeddings.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")
# Defined but not added to the pipeline below; only the embeddings are extracted here.
classsifierdl = MultiClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label_code1")\
    .setBatchSize(128) \
    .setMaxEpochs(10) \
    .setLr(1e-3) \
    .setValidationSplit(0.1)
# Stage 3: convert the embedding annotations into Spark ML vectors.
embeddingsFinisher = EmbeddingsFinisher()\
    .setInputCols("sentence_embeddings")\
    .setOutputCols("finished_embeddings")\
    .setOutputAsVector(True)
pipeline = Pipeline().setStages([
    document,
    embeddings,
    embeddingsFinisher
])

In [0]:
pipelineModel = pipeline.fit(df_total)

In [0]:
result = pipelineModel.transform(df_total)

In [0]:
result.cache()

In [0]:
np.array(result.select("finished_embeddings").limit(1).collect()).shape

In [0]:

result_new=result.select(result.title1,result.image1,result.label_code1,result.title2,result.image2,result.label_code2,result.binary_label,explode(result.finished_embeddings))

In [0]:
result_new.cache()

In [0]:
# Same pipeline as for title1, now applied to the title of the second product.
document = DocumentAssembler()\
    .setInputCol("title2")\
    .setOutputCol("document")
embeddings = XlmRoBertaSentenceEmbeddings.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")
# Defined but not added to the pipeline below; only the embeddings are extracted here.
classsifierdl = MultiClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label_code2")\
    .setBatchSize(128) \
    .setMaxEpochs(10) \
    .setLr(1e-3) \
    .setValidationSplit(0.1)
embeddingsFinisher = EmbeddingsFinisher()\
    .setInputCols("sentence_embeddings")\
    .setOutputCols("finished_embeddings")\
    .setOutputAsVector(True)
pipeline = Pipeline().setStages([
    document,
    embeddings,
    embeddingsFinisher
])

In [0]:
pipelineModel = pipeline.fit(result_new)
result = pipelineModel.transform(result_new)

In [0]:

result_new=result.select(result.title1,result.image1,result.label_code1,result.title2,result.image2,result.label_code2,result.binary_label,result.col,explode(result.finished_embeddings))

In [0]:
result_new.cache()

In [0]:
result_new.printSchema()

In [0]:
newcolumns = [ 'title1', 
              'image1',
              'label_code1',
              'title2' ,    
              'image2',
              'label_code2',
              'binary_label', 
              'embed1',      
              'embed2']
result_new=result_new.toDF(*newcolumns)

In [0]:
result_new.printSchema()

In [0]:
# result_df_add is not defined anywhere in this notebook; count the negatives on result_new instead.
result_new.filter('binary_label == 0').count()

In [0]:
result_new.cache()

Cosine Similarity Calculation

In [0]:
def cosine_similarity(X, Y):
    """Cosine similarity of two Spark ML vectors; returns -1.0 if either vector has zero norm."""
    den = X.norm(2) * Y.norm(2)
    if den == 0.0:
        return -1.0
    else:
        return X.dot(Y) / float(den)
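
For readers without a Spark session, the same computation can be checked on plain Python lists. This is a sketch only: `cosine_similarity_py` is a hypothetical stand-in for the vector-based function above and keeps its convention of returning -1.0 for a zero-norm input.

```python
import math


def cosine_similarity_py(x, y):
    """Cosine similarity of two equal-length lists; -1.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if den == 0.0:
        return -1.0
    return dot / den


same_direction = cosine_similarity_py([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_py([1.0, 0.0], [0.0, 1.0])
zero_vector = cosine_similarity_py([0.0, 0.0], [1.0, 1.0])
print(same_direction, orthogonal, zero_vector)
```

Parallel vectors score close to 1, orthogonal vectors score 0, and the sentinel -1.0 marks pairs for which no similarity can be computed.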

In [0]:
result_new.printSchema()

In [0]:
# pairProdDF is an RDD of (binary_label, cosine similarity) tuples.
pairProdDF = result_new.rdd.map(lambda x: (x["binary_label"], cosine_similarity(x["embed1"], x["embed2"])))


In [0]:
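def toDoubleSafe(v):
    """Convert v to float, falling back to 0.0 for unparsable or missing values."""
    try:
        return float(v)
    except (ValueError, TypeError):  # float(None) raises TypeError, not ValueError
        return 0.0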

In [0]:
pairProdDF_rdd = pairProdDF.map(lambda x: (x[0],toDoubleSafe(x[1])))

In [0]:
pairProdDF_DF=pairProdDF_rdd.toDF()

In [0]:
pairProdDF_DF.cache()

Classification Model

In [0]:

input_cols=["_2"]

va = VectorAssembler(outputCol="features", inputCols=input_cols)
#lpoints - labeled data.
lpoints = va.transform(pairProdDF_DF).select("features", "_1").withColumnRenamed("_1", "label")

In [0]:
vlpoints=lpoints.sample(fraction=0.005,seed=3)

In [0]:
vlpoints.cache()

In [0]:
vlpoints

In [0]:
lpoints.show(2)

In [0]:
lpoints = lpoints.withColumn("label_int",col("label").cast(IntegerType())).drop('label').withColumnRenamed("label_int","label")

In [0]:
pendtsets = lpoints.randomSplit([0.8, 0.2], 1)
pendttrain = pendtsets[0].cache()
pendtvalid = pendtsets[1].cache()

In [0]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(regParam=0.01, maxIter=10, fitIntercept=True)
lrmodel = lr.fit(pendttrain)
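
To make concrete what a logistic regression on a single cosine-similarity feature learns, here is a minimal plain-Python sketch. It uses batch gradient descent with an L2 penalty, which is not the optimizer Spark ML uses, and the data, the `fit_logistic_1d` helper and its hyperparameters are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_1d(xs, ys, lr=1.0, reg=0.01, n_iter=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent with an L2 penalty on w."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + reg * w
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical cosine similarities and same-group labels.
similarities = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic_1d(similarities, labels)
prob_similar = sigmoid(w * 0.9 + b)
prob_dissimilar = sigmoid(w * 0.1 + b)
print(prob_similar, prob_dissimilar)
```

With one feature, the model reduces to a learned threshold on the similarity: pairs above it are predicted to belong to the same label group.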

In [0]:
lrmodel_predict=lrmodel.transform(pendttrain)

In [0]:
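# BinaryClassificationEvaluator defaults to areaUnderROC, so this value is an AUC rather than a classification accuracy.
train_accuracy=BinaryClassificationEvaluator().evaluate(lrmodel.transform(pendttrain))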

In [0]:
BinaryClassificationEvaluator().evaluate(lrmodel.transform(pendttrain))

In [0]:
BinaryClassificationEvaluator().evaluate(lrmodel.transform(pendtvalid))
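
Since the default metric of `BinaryClassificationEvaluator` is the area under the ROC curve, it can differ from accuracy on the same predictions. The plain-Python sketch below shows the difference on four hypothetical scores; the `accuracy` and `roc_auc` helpers are illustrative and not part of Spark.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
labels = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.55, 0.1]

acc = accuracy(labels, scores)
auc = roc_auc(labels, scores)
print("accuracy:", acc, "AUC:", auc)
```

Here every positive outranks every negative, so the AUC is 1.0, while one negative scored above the 0.5 threshold pulls accuracy down to 0.75.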