**Victoria_Wang_BrainStation_Capstone_Dec2024**

Part 3.Victoria_Wang_Capstone_01Ssubset_Embeddings

**1. Project Overview: Leveraging Sentiment Analysis and Similarity Search to Optimize Product Offerings and Success**

Over the past decade, online purchasing and e-commerce platforms have grown rapidly. As a result, e-commerce platforms need to understand how customers feel about their brand and about the products and services they offer. Understanding the relationship between customer sentiment and the factors that influence it is valuable: sentiment analysis can improve product-customer fit, which in turn supports sales and profit. According to the Statista Research Department, revenue in the US e-commerce market is estimated to reach 1.9 trillion dollars by 2029. Despite fierce competition, the market therefore offers many opportunities. We want to take advantage of them by extracting data-driven insights from customer text reviews to iteratively improve product-customer fit.

According to a 2024 survey of the most profitable Amazon sellers worldwide by product category, covering December 2023 to January 2024, the beauty and personal care category topped the chart.

Hence, this project focuses on the beauty and personal care category, analyzing its text reviews to predict customer sentiment and product success.

**Project Goal:**

Our problem statement is: **How might we… leverage user text reviews to identify product issues and prioritize features that customers value the most?**

My solution is to use machine learning and NLP to analyze customer sentiment and extract insights. This will result in a Review Analyzer App for various stakeholders to enhance customer-product fit and satisfaction with data-driven product insights.

Of note, given the limited computational power of my personal computer, I subset the dataset to 1% (165,674 rows × 16 columns) and use Google Colab to map the review text to 384-dimensional embeddings with the Sentence Transformer model ('all-MiniLM-L6-v2').
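
The notebook does not show how the 1% subset was drawn. A minimal sketch of one reproducible way to do it with pandas, assuming a simple random sample without replacement; the helper name `subset_reviews`, the seed, and the toy frame are illustrative and not part of the original work:

```python
import pandas as pd

def subset_reviews(df: pd.DataFrame, frac: float = 0.01, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible random sample of the rows (hypothetical helper)."""
    return df.sample(frac=frac, random_state=seed).reset_index(drop=True)

# Toy stand-in for the full review table (assumed columns only).
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(1000)],
    "rating": [i % 5 + 1 for i in range(1000)],
})
subset = subset_reviews(toy)
print(subset.shape)  # (10, 2)
```

Fixing the seed makes the subset repeatable, so the saved CSV could be regenerated from the full dataset if needed.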

**2. Table of Contents:**
1. Project Overview
2. Table of Contents
3. Data Source
4. Data Import Instructions
5. Importing relevant packages
6. EDA & Insights
7. Basic Time Series Analysis & Insights
8. Subsetting the data to 1% (165,674 rows × 16 columns)
9. Text Analysis - Preprocessing using Sentence Transformer model ('all-MiniLM-L6-v2') for mapping text to 384-dimension embeddings and Modeling with Logistic Regression, XGBoost and Random Forest
10. Text Analysis - Preprocessing using CountVectorizer and Modeling with Logistic Regression, XGBoost and Random Forest
11. Text Analysis - Preprocessing using TFIDF with SVD and Modeling with Logistic Regression, XGBoost and Random Forest
12. New Review Sentiment Predictor
13. Conclusions
14. Future Directions


**3. Data Source**

**Citation for the UCSD Amazon Reviews'23 dataset:**

@article{hou2024bridging, title={Bridging Language and Items for Retrieval and Recommendation}, author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian}, journal={arXiv preprint arXiv:2403.03952}, year={2024} }

Dataset: https://amazon-reviews-2023.github.io/index.html#

**Citation for the Sentence Transformer Model ('all-MiniLM-L6-v2'):**

https://www.sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode

**Citation for beauty and personal care category as the most profitable category:**

https://www.statista.com/statistics/1400287/amazon-most-profitable-sellers-category/#:~:text=A%202024%20survey%20found%20that,with%2027%20percent%20of%20sellers.)

**Citation for Statista E-commerce Market Insights:**
https://www.statista.com/statistics/272391/us-retail-e-commerce-sales-forecast/

In [1]:
!pip install sentence-transformers



In [2]:
#import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns

import statsmodels.api as sm
from scipy import stats
from scipy.stats import norm

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

from statsmodels.api import tsa # time series analysis

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import confusion_matrix

from sklearn.metrics import classification_report

from warnings import filterwarnings
filterwarnings(action='ignore')

In [4]:
import pandas as pd

data_subset = pd.read_csv("/content/AmazonBP_all_selected_01subset.csv")

In [5]:
data_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165674 entries, 0 to 165673
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   user_id            165674 non-null  object 
 1   rating             165674 non-null  int64  
 2   title_x            165674 non-null  object 
 3   text               165674 non-null  object 
 4   timestamp          165674 non-null  int64  
 5   verified_purchase  165674 non-null  bool   
 6   helpful_vote       165674 non-null  int64  
 7   parent_asin        165674 non-null  object 
 8   average_rating     165674 non-null  float64
 9   price              165674 non-null  float64
 10  rating_number      165674 non-null  int64  
 11  time               165674 non-null  object 
 12  year               165674 non-null  int64  
 13  month              165674 non-null  int64  
 14  week_of_year       165674 non-null  int64  
 15  sentiment          165674 non-null  int64  
dtypes:

In [6]:
data_subset.isnull().mean() #checking there's no null

Unnamed: 0,0
user_id,0.0
rating,0.0
title_x,0.0
text,0.0
timestamp,0.0
verified_purchase,0.0
helpful_vote,0.0
parent_asin,0.0
average_rating,0.0
price,0.0


In [7]:
data_subset['text']

Unnamed: 0,text
0,"When I received this product, I thought, oh my..."
1,My wax warmer does not work. I have attempted ...
2,Great for sensitive skin<br />A little goes a ...
3,Just had the best shave of my life!
4,This product is perfect for your lip glosses. ...
...,...
165669,Super moisturizing and they smell/taste so goo...
165670,I love this product! Easy to use and looks nat...
165671,Great clippers
165672,The barrel was too small and you can’t control...


In [8]:
data_subset['sentiment']

Unnamed: 0,sentiment
0,1
1,0
2,1
3,1
4,1
...,...
165669,1
165670,1
165671,1
165672,1


In [9]:
X = data_subset['text']
y = data_subset['sentiment'] #target variable

In [10]:
X.shape

(165674,)

In [11]:
y.shape

(165674,)
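
Before modeling, it can help to check how balanced the target is, because accuracy is only meaningful relative to the share of the most frequent class. A minimal sketch, assuming a binary 0/1 label like `sentiment`; the helper `majority_baseline` and the toy labels are made up for illustration and do not reflect the real class proportions:

```python
import pandas as pd

def majority_baseline(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return labels.value_counts(normalize=True).max()

# Toy labels standing in for data_subset['sentiment'] (values are made up).
toy_y = pd.Series([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
print(toy_y.value_counts(normalize=True))
print(f"Majority-class baseline: {majority_baseline(toy_y):.2f}")  # 0.80
```

On the real data, `majority_baseline(y)` would give the accuracy that any trained model needs to beat.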

In [12]:
from sentence_transformers import SentenceTransformer

In [13]:
model_name = 'all-MiniLM-L6-v2' #384 dimensions
model = SentenceTransformer(model_name, device='cuda')
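
The cell above hard-codes `device='cuda'`, which fails on a runtime without a GPU. A minimal sketch of a fallback, assuming PyTorch is the backend (as it is for sentence-transformers); the helper name `pick_device` is hypothetical:

```python
def pick_device() -> str:
    """Return 'cuda' when a GPU is visible to PyTorch, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:  # PyTorch not installed: fall back to CPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)

# The notebook's call would then read (not executed in this sketch):
# model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
```

Encoding on CPU works but is considerably slower, which is one reason the notebook runs on Google Colab.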

In [14]:
sentences = X.tolist()

In [15]:
embedding = model.encode(sentences, show_progress_bar=True, batch_size=256)

Batches:   0%|          | 0/648 [00:00<?, ?it/s]

In [16]:
embedding.shape

(165674, 384)
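
The 648 batches in the progress bar and the output shape follow directly from the subset size, the batch size of 256, and the 384-dimensional model output. A short arithmetic check; the memory estimate assumes 4 bytes per value (float32), which is an assumption about the returned array rather than something the notebook prints:

```python
import math

n_reviews, batch_size, dim = 165_674, 256, 384

# Number of batches needed to encode every review once.
n_batches = math.ceil(n_reviews / batch_size)
print(n_batches)  # 648

# Approximate in-memory size of the embedding matrix (assumes float32).
size_mb = n_reviews * dim * 4 / 1e6
print(f"{size_mb:.1f} MB")  # 254.5 MB
```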

In [17]:
embedding_df = pd.DataFrame(embedding)

In [18]:
embedding_df.to_csv("embedding_X.csv", index=False)
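
CSV stores every float as text, so a file of this size is large and slow to reload. As an alternative to the approach above, a minimal sketch of saving the array in NumPy's binary format; the random matrix and the `.npy` file name are stand-ins, not files from the notebook:

```python
import os
import tempfile

import numpy as np

# Random stand-in for the (n_reviews, 384) embedding matrix.
rng = np.random.default_rng(0)
toy_embedding = rng.random((100, 384), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embedding_X.npy")  # hypothetical file name
    np.save(path, toy_embedding)   # binary format, keeps dtype and shape
    restored = np.load(path)

print(restored.shape, restored.dtype)  # (100, 384) float32
```

The binary round trip preserves the values exactly, whereas a CSV round trip depends on how the floats are formatted as text.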

In [19]:
X = embedding_df
y = data_subset['sentiment']

In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [36]:
#saving the X y test train set
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

In [22]:
X_train.shape

(132539, 384)

In [23]:
X_test.shape

(33135, 384)

In [24]:
y_train.shape

(132539,)

In [25]:
y_test.shape

(33135,)
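
The shapes above match a 20% test split: 20% of 165,674 is 33,134.8, which scikit-learn rounds up to 33,135 test rows, leaving 132,539 for training. The split in this notebook is not stratified. A minimal sketch of a stratified variant on toy data, in case the class proportions should be kept equal in both parts; the data and proportions are made up:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 rows, 10% in class 0 (made-up proportions).
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 100 + [1] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)    # (800, 1) (200, 1)
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```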

In [26]:
from sklearn.linear_model import LogisticRegression

#Instantiate model (unregularized: with penalty=None scikit-learn ignores C, so C=0.5 is dropped)
text_embedding_logreg = LogisticRegression(penalty=None)

#Fitting the model
text_embedding_logreg.fit(X_train, y_train)

In [27]:
#Score the model
print(f'Score on train: {text_embedding_logreg.score(X_train, y_train)}')
print(f'Score on test: {text_embedding_logreg.score(X_test, y_test)}')

Score on train: 0.8932691509668852
Score on test: 0.8885770333484231
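
`score` reports plain accuracy, which does not show how each class is handled. The notebook imports `confusion_matrix` and `classification_report` but does not call them in this part. A minimal sketch of how they could be used, run here on synthetic data from `make_classification` rather than the review embeddings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the embedding features and labels.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, weights=[0.2, 0.8], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_a, y_a)
y_hat = clf.predict(X_b)

cm = confusion_matrix(y_b, y_hat)  # rows: true class, columns: predicted class
print(cm)
print(classification_report(y_b, y_hat))
```

With the notebook's objects, the equivalent call would pass `y_test` and `text_embedding_logreg.predict(X_test)`.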


In [28]:
from xgboost import XGBClassifier
#from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier

xgb_model = XGBClassifier(n_estimators=100, max_depth=8, learning_rate=0.1)
rf_model = RandomForestClassifier()

xgb_model.fit(X_train, y_train)
#ab_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)

print("Test Set Scores:")
#print(f"AdaBoost score: {ab_model.score(X_test, y_test)}")
print(f"XG Boost score: {xgb_model.score(X_test, y_test)}")
print(f"Random Forest score: {rf_model.score(X_test, y_test)}")

Test Set Scores:
XG Boost score: 0.8805190885770333
Random Forest score: 0.8557416628942206


In [34]:
import pickle

#xgb_model.fit(X_train, y_train) #XGBoost Model

with open('xgb_model.pkl', 'wb') as file:
    pickle.dump(xgb_model, file)

In [33]:
import pickle

#rf_model.fit(X_train, y_train) #RandomForest Model

with open('rf_model.pkl', 'wb') as file:
    pickle.dump(rf_model, file)

In [32]:
import pickle

#text_embedding_logreg.fit(X_train, y_train) #Logistic Regression Model

with open('text_embedding_logreg.pkl', 'wb') as file:
    pickle.dump(text_embedding_logreg, file)
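
The saved files can be loaded back with `pickle.load`. A minimal round-trip sketch on a tiny made-up model, writing to a temporary directory instead of the notebook's `.pkl` files:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, only to have a fitted model to save.
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([0, 0, 1, 1])
fitted = LogisticRegression().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(fitted, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

# The reloaded model should predict exactly like the original.
same = np.array_equal(fitted.predict(X_small), reloaded.predict(X_small))
print(same)  # True
```

Pickled scikit-learn models are best reloaded with the same library version that saved them, and pickle files should only be loaded from trusted sources.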

In [29]:
#Get AUC score for Logistic Regression:
from sklearn.metrics import roc_auc_score

# Uses the fitted text_embedding_logreg and the held-out test data (X_test, y_test)

# Get the predicted probabilities for the positive class
y_pred_proba= text_embedding_logreg.predict_proba(X_test)[:, 1]

# Calculate the AUC score
auc_score_logreg = roc_auc_score(y_test, y_pred_proba)

print("AUC Score Log Reg:", auc_score_logreg)


AUC Score Log Reg: 0.9182640396655221
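
ROC AUC can be read as the probability that a randomly chosen positive review receives a higher predicted score than a randomly chosen negative one. A small worked example on four made-up scores, comparing a manual pairwise count with `roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, t in zip(y_score, y_true) if t == 1]
negatives = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = [(p, n) for p in positives for n in negatives]

# Fraction of positive/negative pairs ranked correctly (ties count as half).
manual_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(manual_auc)                      # 0.75
print(roc_auc_score(y_true, y_score))  # 0.75
```

Three of the four pairs are ordered correctly, which gives 0.75. Unlike accuracy, this measure does not depend on the 0.5 decision threshold.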


In [30]:
#Get AUC score for XGBClassifier:
from sklearn.metrics import roc_auc_score

# Uses the fitted xgb_model and the held-out test data (X_test, y_test)

# Get the predicted probabilities for the positive class
y_pred_proba_xgb_model= xgb_model.predict_proba(X_test)[:, 1]

# Calculate the AUC score
auc_score_xgb_model = roc_auc_score(y_test, y_pred_proba_xgb_model)

print("AUC Score XGBClassifier:", auc_score_xgb_model)

AUC Score XGBClassifier: 0.9124848579239087


In [31]:
#Get AUC score for RandomForestClassifier:
from sklearn.metrics import roc_auc_score

# Uses the fitted rf_model and the held-out test data (X_test, y_test)

# Get the predicted probabilities for the positive class
y_pred_proba_rf_model= rf_model.predict_proba(X_test)[:, 1]

# Calculate the AUC score
auc_score_rf_model = roc_auc_score(y_test, y_pred_proba_rf_model)

print("AUC Score RandomForestClassifier:", auc_score_rf_model)

AUC Score RandomForestClassifier: 0.8852481016363634
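
The test-set results for the three embedding-based models can be collected into one table for comparison. A minimal sketch using the accuracy and AUC values printed in the cell outputs above, rounded to four decimals; the column names are illustrative:

```python
import pandas as pd

# Values copied from the notebook's printed outputs, rounded to four decimals.
results = pd.DataFrame(
    {
        "test_accuracy": [0.8886, 0.8805, 0.8557],
        "test_auc": [0.9183, 0.9125, 0.8852],
    },
    index=["Logistic Regression", "XGBoost", "Random Forest"],
)
print(results.sort_values("test_auc", ascending=False))

best = results["test_auc"].idxmax()
print(best)  # Logistic Regression
```

On this subset, logistic regression on the sentence embeddings has both the highest test accuracy and the highest AUC of the three models, with XGBoost close behind.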
