**Victoria_Wang_BrainStation_Capstone_Dec2024**

Part4. Victoria_Wang_Capstone_NewReviewSentimentPredictor

**1. Project Overview: Leveraging Sentiment Analysis and Similarity Search to Optimize Product Offerings and Success**

Over the past decade, online purchases and e-commerce platforms have grown exponentially. As a result, e-commerce platforms need to understand how customers feel about their brand and about the products and services they offer. Understanding the relationship between customer sentiment and the factors that influence it is valuable: sentiment analysis can improve product-customer fit, which in turn supports sales and profit. According to the Statista Research Department, revenue in the US e-commerce market is estimated to reach 1.9 trillion dollars by 2029. So despite fierce competition, the e-commerce market offers many opportunities. We want to take advantage of these opportunities by extracting data-driven insights from customer text reviews to iteratively improve product-customer fit.

According to a 2024 survey of the most profitable Amazon sellers worldwide by product category, conducted from December 2023 to January 2024, the beauty and personal care category topped the chart.

Hence, this project focuses on the beauty and personal care category, analyzing its text reviews to predict customer sentiment and product success.

**Project Goal:**

Our problem statement is: **How might we… leverage user text reviews to identify product issues and prioritize features that customers value the most?**

My solution is to use machine learning and NLP to analyze customer sentiment and extract insights. This will result in a Review Analyzer App for various stakeholders to enhance customer-product fit and satisfaction with data-driven product insights.

Of note, given the limited computational power of my personal computer, I will subset the dataset to 1% (165,674 rows, 16 columns) and use Google Colab to map the review text to 384-dimension embeddings with the Sentence Transformer model ('all-MiniLM-L6-v2').
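A 1% subset of this kind could be drawn along the following lines. This is only a minimal sketch and not necessarily how the subset file used here was produced: the toy DataFrame, the `random_state` value, and the choice of `DataFrame.sample` are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the full review table (hypothetical data, 1,000 rows).
full = pd.DataFrame({
    "text": [f"review {i}" for i in range(1000)],
    "rating": [(i % 5) + 1 for i in range(1000)],
})

# Draw a reproducible 1% random sample; random_state=42 is an arbitrary choice.
subset = full.sample(frac=0.01, random_state=42).reset_index(drop=True)

print(subset.shape)
```

Fixing `random_state` keeps the subset identical across runs, which matters when embeddings are computed separately (for example in Google Colab) and later matched back to the rows by position.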

**2. Table of Contents:**
1. Project Overview
2. Table of Contents
3. Data Source
4. Data Import Instructions
5. Importing relevant packages
6. EDA & Insights
7. Basic Time Series Analysis & Insights
8. Data is subsetted to 1% (165674, 16)
9. Text Analysis - Preprocessing using Sentence Transformer model ('all-MiniLM-L6-v2') for mapping text to 384-dimension embeddings and Modeling with Logistic Regression, XGBoost and Random Forest
10. Text Analysis - Preprocessing using CountVectorizer and Modeling with Logistic Regression, XGBoost and Random Forest
11. Text Analysis - Preprocessing using TF-IDF with SVD and Modeling with Logistic Regression, XGBoost and Random Forest
12. New Review Sentiment Predictor
13. Conclusions
14. Future Directions

**3. Data Source**

**Citation for the UCSD Amazon Reviews'23 dataset:**

@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}

Dataset: https://amazon-reviews-2023.github.io/index.html#

**Citation for the Sentence Transformer Model ('all-MiniLM-L6-v2'):**

https://www.sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode

**Citation for beauty and personal care category as the most profitable category:**

https://www.statista.com/statistics/1400287/amazon-most-profitable-sellers-category/

**Citation for Statista E-commerce Market Insights:**
https://www.statista.com/statistics/272391/us-retail-e-commerce-sales-forecast/

In [236]:
#importing relevant packages

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
pd.set_option('display.max_colwidth', None)

In [238]:
data = pd.read_csv("AmazonBP_all_selected_01subset.csv")

In [240]:
data.head(2)

Unnamed: 0,user_id,rating,title_x,text,timestamp,verified_purchase,helpful_vote,parent_asin,average_rating,price,rating_number,time,year,month,week_of_year,sentiment
0,AHTDSFK3OHCOYISQCNCZ4O4AUYIA,3,Not worth the money,"When I received this product, I thought, oh my, they only sent one bar……well there were actually 5 very small bars in one box ….the soap is nice but definitely not worth the money….",1650645000319,True,1,B09R3SGVGJ,4.4,23.25,588,2022-04-22 16:30:00.319,2022,4,16,1
1,AEW6PUQJFC4QRTW2CD6Q2G36ROVA,1,Wax warmer does not work,My wax warmer does not work. I have attempted to use and it does not heat up. it does not give me information to contact the seller to resolve my issue.,1632234980122,True,0,B07XFZ9NM4,4.3,37.99,8621,2021-09-21 14:36:20.122,2021,9,38,0


In [242]:
data = data[["text", "rating", "sentiment"]] #target variable = sentiment 
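The rows displayed in this notebook (rating 3 with sentiment 1, ratings 1 and 2 with sentiment 0, ratings 4 and 5 with sentiment 1) are consistent with a simple threshold on the star rating. The sketch below shows that rule; the cutoff of 3 is inferred from those rows and is an assumption, since the labeling step itself is not shown in this part.

```python
import numpy as np

# Star ratings 1 through 5, as they appear in the review data.
ratings = np.array([1, 2, 3, 4, 5])

# Assumed rule (inferred from the displayed rows, not confirmed in this part):
# ratings of 3 and above count as positive (1), lower ratings as negative (0).
sentiment = (ratings >= 3).astype(int)

print(sentiment.tolist())
```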

In [244]:
embeddings = pd.read_csv("embedding_X.csv")

In [246]:
embeddings

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,374,375,376,377,378,379,380,381,382,383
0,0.007690,-0.024506,0.034328,-0.028799,-0.019449,-0.028808,0.038582,-0.028599,0.026492,-0.001092,...,0.018177,-0.040543,-0.076314,-0.044289,0.149188,0.019562,-0.033702,0.039511,-0.048531,-0.005066
1,-0.037028,-0.034255,0.081530,-0.006164,-0.032405,-0.026308,-0.001707,-0.083984,-0.092233,-0.002425,...,-0.045218,-0.056729,0.040575,0.039373,0.018681,0.048034,0.051699,-0.123818,-0.039959,0.025758
2,0.020867,-0.017111,0.037967,0.020771,0.040526,0.016246,0.104589,0.019750,-0.025461,-0.023806,...,-0.050882,-0.004516,0.044491,0.042849,-0.053529,-0.030552,0.005018,-0.035845,-0.021226,0.001323
3,-0.027971,0.051233,0.086873,0.007926,-0.001492,-0.060656,-0.010778,0.019588,-0.055853,0.044437,...,0.000183,0.032754,0.028275,0.006578,-0.001095,0.078420,0.135342,-0.077515,-0.036457,0.058698
4,-0.148131,-0.038093,0.055804,0.016401,-0.085882,0.061092,0.092550,-0.024187,-0.035340,-0.019383,...,-0.044752,-0.041577,-0.066648,0.063983,-0.020987,0.006152,0.015749,-0.032523,-0.012122,0.001223
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165669,-0.069432,-0.041378,0.054979,0.042455,-0.045213,0.008689,0.077764,0.041951,-0.010031,0.021014,...,-0.018155,-0.088326,0.006628,0.014237,0.036244,0.046247,0.006656,0.065568,-0.017292,0.013564
165670,0.022814,0.034560,0.061283,0.068886,0.064795,0.016943,0.099068,0.018491,-0.018950,-0.009223,...,-0.005386,-0.002559,0.067101,0.007541,0.013052,0.012929,0.025601,-0.016452,-0.032590,0.034203
165671,-0.036473,-0.028978,-0.002595,-0.038476,-0.020702,0.016110,0.058352,0.000292,0.075078,-0.002597,...,0.044596,0.005243,0.001257,-0.044776,0.009232,0.042935,0.031706,0.023608,-0.011936,0.030196
165672,0.032948,0.030169,-0.035991,0.098402,-0.041057,0.004765,-0.070714,0.001973,-0.016665,0.000751,...,0.029399,-0.036036,0.027774,-0.051800,0.017403,0.009851,-0.003396,-0.022825,-0.116474,0.018763


In [248]:
from sentence_transformers import SentenceTransformer

In [250]:
model_name = 'all-MiniLM-L6-v2' #384 dimensions
model = SentenceTransformer(model_name)

In [252]:
import pickle

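# Load the previously saved logistic regression model; the context manager closes the file afterwards
with open("text_embedding_logreg.pkl", "rb") as f:
    logreg = pickle.load(f)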

Note on the security and maintainability limitations of pickled scikit-learn models: https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [270]:
new_phrases = [
    "I dislike the smell of this cream.",
]

In [272]:
new_phrases_embeddings = model.encode(new_phrases)

In [274]:
logreg.predict_proba(new_phrases_embeddings)[:,1] # column 1 = probability of the positive class; closer to 1 = positive sentiment

array([0.20190949])
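To turn a probability like the one above into a label, a cutoff is needed. The sketch below uses a hypothetical helper, `label_sentiment`, with the conventional 0.5 threshold; both the helper name and the threshold are assumptions for illustration and are not part of the saved model.

```python
import numpy as np

def label_sentiment(positive_probs, threshold=0.5):
    """Map P(positive) values to 'positive'/'negative' labels (hypothetical helper)."""
    probs = np.asarray(positive_probs, dtype=float)
    return np.where(probs >= threshold, "positive", "negative").tolist()

# Probability reported above for "I dislike the smell of this cream."
print(label_sentiment([0.20190949]))
```

A stricter or looser threshold could be chosen if one kind of misclassification is more costly than the other.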

In [275]:
new_phrases_embeddings.sum(axis=1)

array([0.22584647], dtype=float32)

In [276]:
# Compare the embedding of the newly entered phrase with the stored review embeddings (embedding_X.csv) via cosine similarity
similarity = cosine_similarity(new_phrases_embeddings, embeddings) 

In [277]:
# argsort is ascending, so reverse it to get the row positions of the 10 highest similarities
top_10_most_similar = np.argsort(similarity).squeeze()[::-1][:10]
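For reference, the cosine-similarity ranking used above can be written out by hand. This is a minimal sketch in plain NumPy on toy 2-D vectors; `top_k_similar` is a hypothetical helper, and the toy vectors stand in for the 384-dimension embeddings.

```python
import numpy as np

def top_k_similar(query, corpus, k=10):
    """Return indices of the k corpus rows most cosine-similar to query (hypothetical helper)."""
    query = np.asarray(query, dtype=float)
    corpus = np.asarray(corpus, dtype=float)
    # Cosine similarity = dot product divided by the product of the vector norms.
    # Assumes no all-zero vectors, which would cause a division by zero.
    sims = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so reverse it and keep the first k indices.
    return np.argsort(sims)[::-1][:k], sims

# Toy 2-D vectors standing in for the review embeddings.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])

idx, sims = top_k_similar(query, corpus, k=2)
print(idx)
```

Because cosine similarity depends only on direction, a review vector pointing the same way as the query scores near 1 regardless of its length, while one pointing the opposite way scores near -1.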

In [278]:
# Show the full text of the 10 reviews most similar to the newly entered phrase
data.iloc[top_10_most_similar] 

Unnamed: 0,text,rating,sentiment
21070,The smell of this hand cream is absolute awful.,2,0
150290,The smell is aweful. I didn't like it at all. I use it now as a foot cream!,2,0
147269,"Could not tolerate the odor of this cream, I tried to return it but couldn’t manage the process, I wound up tossing it in the garbage and it wasn’t exactly cheap.",1,0
84856,"This is the bad product I ever used, there are no smell on it, all the time I use this cream but this time I order it from amazon and I got it so bad condition",1,0
112107,I don't like the smell,5,1
84447,I don't like the smell,1,0
34016,I don’t like the smell,4,1
3982,I just hate the smell,5,1
114541,I don't like the smell.....,2,0
46512,I dont enjoy the smell of it,5,1


In [279]:
data.iloc[top_10_most_similar].sort_values(by='rating', ascending=True)

Unnamed: 0,text,rating,sentiment
147269,"Could not tolerate the odor of this cream, I tried to return it but couldn’t manage the process, I wound up tossing it in the garbage and it wasn’t exactly cheap.",1,0
84856,"This is the bad product I ever used, there are no smell on it, all the time I use this cream but this time I order it from amazon and I got it so bad condition",1,0
84447,I don't like the smell,1,0
21070,The smell of this hand cream is absolute awful.,2,0
150290,The smell is aweful. I didn't like it at all. I use it now as a foot cream!,2,0
114541,I don't like the smell.....,2,0
34016,I don’t like the smell,4,1
112107,I don't like the smell,5,1
3982,I just hate the smell,5,1
46512,I dont enjoy the smell of it,5,1
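One way to read the table above is to count how the ten neighbors split by their rating-derived label. The sketch below does this with the (rating, sentiment) pairs copied from that output; treating the neighbor split as a summary signal is an illustrative assumption, not a step taken in this notebook.

```python
from collections import Counter

# (rating, sentiment) pairs copied from the ten neighbors listed above.
neighbors = [(1, 0), (1, 0), (1, 0), (2, 0), (2, 0), (2, 0),
             (4, 1), (5, 1), (5, 1), (5, 1)]

# Count neighbors per sentiment label and compute the negative share.
counts = Counter(sentiment for _, sentiment in neighbors)
negative_share = counts[0] / len(neighbors)

print(counts[0], counts[1], negative_share)
```

Four of the ten neighbors carry a positive label even though their text reads as a complaint about the smell, which suggests that labels derived from star ratings can disagree with the wording of a review.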
