## Sentiment Analysis of the Sephora Skincare Review Data Set

Sentiment analysis is a natural language processing technique for identifying the sentiment or emotion expressed in a text. Applied to customer reviews or other text data, it can reveal the overall sentiment surrounding Sephora skincare products.

Here, the aim is to automatically categorize customer reviews of Sephora skincare products as positive, negative, or neutral. This helps show how consumers generally feel about particular skincare brands or products.

- Perform sentiment analysis on customer reviews to understand the overall sentiment (positive, negative, or neutral).
- Use natural language processing techniques to classify reviews and extract sentiment-related information.
- Visualize sentiment patterns using bar plots or word clouds to identify popular sentiment trends.

# Different Approaches for Sentiment Analysis
There are several approaches to sentiment analysis, each with its own benefits and drawbacks. Typical ones include:
- Lexicon-based Sentiment Analysis: This method uses sentiment lexicons, or dictionaries, that map words or phrases to sentiment scores. Words in the text that match lexicon entries receive those scores, which are then combined into an overall sentiment score.
- Rule-based Sentiment Analysis: Predefined rules or patterns are used to identify sentiment in text. The rules can be built from keywords, linguistic patterns, or grammatical structures; for example, specific words or phrases may be taken to express positive or negative sentiment.
- Machine Learning Classification: Machine learning algorithms can be trained on labeled datasets to classify text into positive, negative, or neutral sentiment categories. This method entails extracting features such as bag-of-words or word embeddings and training a classifier such as Naive Bayes, Support Vector Machines (SVM), or neural networks.
- Deep Learning: Transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and XLNet (Generalized Autoregressive Pretraining for Language Understanding) can be used for sentiment analysis. These models can learn complex patterns and relationships in text data and use them to classify sentiment.
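
To make the first two approaches concrete, here is a minimal sketch that combines a lexicon lookup with a single negation rule. The lexicon entries, their scores, the `threshold` value, and the function names are all hypothetical and chosen only for illustration; they are not taken from this notebook or from any published lexicon.

```python
import re

# Hypothetical toy lexicon; real lexicons contain thousands of scored entries.
LEXICON = {
    "love": 2.0, "great": 1.5, "gentle": 1.0,
    "sticky": -1.0, "irritated": -1.5, "awful": -2.0,
}
# Hypothetical rule: a word directly after one of these has its score flipped.
NEGATIONS = {"not", "no", "never"}


def lexicon_score(review):
    """Sum the lexicon scores of all words, flipping the sign after a negation."""
    tokens = re.findall(r"[a-z']+", review.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score


def classify(review, threshold=0.5):
    """Map the overall score to positive, negative, or neutral."""
    score = lexicon_score(review)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print(classify("I love this gentle cleanser"))  # positive
print(classify("Not great"))                    # negative
print(classify("It arrived on Tuesday"))        # neutral
```

A sketch like this is cheap and easy to inspect, but it cannot use context beyond the one negation rule, which is the gap that the machine learning and deep learning approaches aim to close.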






# Why BERT Is a Preferred Choice for Sentiment Analysis

BERT (Bidirectional Encoder Representations from Transformers) is a well-known and powerful model for natural language processing tasks, including sentiment analysis. As a **contextual language model**, BERT interprets each word in light of the words around it. It considers the full sentence, in both directions, and so can handle complex linguistic structures and word relationships when inferring meaning and sentiment.

One of BERT's primary strengths is its **pre-training** phase on a large text corpus (BooksCorpus and English Wikipedia). Pre-training gives BERT rich language representations that capture a general understanding of diverse linguistic nuances. BERT can then be fine-tuned for sentiment analysis, reusing the knowledge gathered during pre-training. Because of this transfer learning capability, BERT can produce promising results even with limited labeled data.

Overall, BERT's strength in sentiment analysis comes from its contextual awareness, pre-training on varied text sources, adaptability through **fine-tuning**, and use of **transfer learning**, which together help it generalize well.


In [33]:
# load libraries
import tensorflow as tf
import tensorflow_text as text
from transformers import BertTokenizer, TFBertForSequenceClassification
import pandas as pd
from multiprocessing import Pool
import pickle
import os
import time


In [6]:

# get helper functions from the Utility class
root=os.getcwd()
from utility import Utility
util=Utility()
util.read_all_data()
df=util.data_df.copy()
# get positive and negative reviews
# select labeled reviews: is_recommended == 1 means positive, 0 means negative
review_df=df.loc[~((df['review_text'].isnull())|(df['is_recommended'].isnull())),['review_text','is_recommended']]







['loves_count', 'sephora_exclusive', 'sale_price_usd', 'size', 'brand_id', 'ingredients', 'variation_desc', 'primary_category', 'secondary_category', 'child_max_price', 'child_min_price', 'value_price_usd', 'new', 'variation_value', 'child_count', 'limited_edition', 'online_only', 'variation_type', 'reviews', 'out_of_stock', 'highlights', 'tertiary_category', 'product_id']
(1307279, 40)


In [30]:
# check whether the positive and negative labeled data are balanced
index_pos=review_df.index[review_df['is_recommended']>0.9]
index_neg=review_df.index[review_df['is_recommended']<0.09]
print('positive_review_data:',len(index_pos)/(len(index_pos)+len(index_neg)),len(index_pos))
print('negative_review_data:',len(index_neg)/(len(index_pos)+len(index_neg)),len(index_neg))
# the labeled data are unbalanced, and the full labeled set is
# too large to train on in a reasonable amount of time,
# so we keep 600 reviews per class: 1000 cases for training and 200 for validation
index_all=index_pos[:600].union(index_neg[:600])
print(len(index_all))
# select the reviews to encode
review_short_df=review_df.loc[index_all,:].copy()



positive_review_data: 0.8395363015197921 928146
negative_review_data: 0.16046369848020797 177400
1200
600.0
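
The cell above takes the first 600 index entries of each class. If the reviews happen to be ordered (for example by product or date), a random draw per class may be more representative. The following is a minimal sketch of that idea in plain Python; `balanced_sample`, `toy_labels`, and the seed are hypothetical names and values, not part of the notebook.

```python
import random


def balanced_sample(labels, per_class, seed=0):
    """Return sorted positions of `per_class` randomly chosen items from each class."""
    rng = random.Random(seed)
    by_class = {}
    for position, label in enumerate(labels):
        by_class.setdefault(label, []).append(position)
    chosen = []
    for label, positions in by_class.items():
        if len(positions) < per_class:
            raise ValueError(f"class {label!r} has only {len(positions)} items")
        chosen.extend(rng.sample(positions, per_class))
    return sorted(chosen)


# Toy labels with roughly the 84/16 imbalance reported above (hypothetical data).
toy_labels = [1] * 84 + [0] * 16
picked = balanced_sample(toy_labels, per_class=10)
counts = {label: sum(1 for p in picked if toy_labels[p] == label) for label in (0, 1)}
print(counts)  # {0: 10, 1: 10}
```

With pandas, `review_df.groupby('is_recommended').sample(n=600, random_state=0)` should give a comparable per-class random draw, assuming a pandas version that provides `GroupBy.sample`.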


In [32]:
# tokenize the Sephora review text into BERT input IDs
def encode(review_text):
    return tokenizer.encode(review_text,
            add_special_tokens=True,
            truncation=True,
            max_length=128)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# cast reviews to strings so that the BERT tokenizer can encode every review as a list of token IDs
review_short_df['review_text']=review_short_df['review_text'].astype(str)

start_time = time.time()
texts=review_short_df['review_text'].values.tolist()
    
# Create a multiprocessing pool
with Pool() as pool:
    # Tokenize the texts in parallel
    encoded_texts = pool.map(encode, texts)
    
print("--- %s seconds ---" % (time.time() - start_time))


--- 2.0522215366363525 seconds ---


In [34]:
# train the model
max_length = max(len(tokens) for tokens in encoded_texts)
padded_texts = tf.keras.preprocessing.sequence.pad_sequences(encoded_texts, maxlen=max_length, padding='post')
# convert the labels to an int32 tensor
labels = tf.convert_to_tensor(review_short_df['is_recommended'].to_list())
labels = tf.dtypes.cast(labels, tf.int32)

# get training data set
indices = tf.where(tf.equal(labels, 0))
indices_0 = tf.squeeze(indices)
indices = tf.where(tf.equal(labels, 1))
indices_1 = tf.squeeze(indices)
num=500
indices_all=tf.concat([indices_0[:num], indices_1[:num]], axis=0)
# select labels and encoded texts for training
selected_labels = tf.gather(labels, indices_all)
selected_padded_texts = tf.gather(padded_texts, indices_all)
print('length of training labels:',len(selected_labels))
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((selected_padded_texts, selected_labels))
dataset = dataset.shuffle(len(selected_padded_texts)).batch(batch_size)
# Load the pre-trained BERT model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=['accuracy'])

# Train the model
print(len(dataset))
model.fit(dataset, epochs=5)
# model.save_weights(os.path.join(root, 'models','bertWeight'))

        



length of training labels: 1000


Downloading tf_model.h5:   0%|          | 0.00/536M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


32
Epoch 1/5


2023-06-02 16:12:39.113454: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_1' with dtype int32 and shape [1000]
	 [[{{node Placeholder/_1}}]]
2023-06-02 16:12:39.113918: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_1' with dtype int32 and shape [1000]
	 [[{{node Placeholder/_1}}]]


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fca205c6ca0>
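
The training cell above right-pads every encoded review with zeros up to the longest sequence and passes only the padded IDs to the model. BERT models can also take an attention mask that marks which positions are real tokens, so that padding is ignored. The sketch below shows how such a mask relates to the padded IDs; `pad_with_mask` is a hypothetical helper, and the token IDs are made up apart from 101, 102, and 0, which are the `[CLS]`, `[SEP]`, and `[PAD]` IDs in the `bert-base-uncased` vocabulary.

```python
def pad_with_mask(sequences, pad_id=0):
    """Right-pad token-ID lists to a common length and build matching attention masks."""
    max_length = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_length - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


# Hypothetical encoded reviews of different lengths.
ids, mask = pad_with_mask([[101, 2307, 102], [101, 2307, 4031, 999, 102]])
print(ids)   # [[101, 2307, 102, 0, 0], [101, 2307, 4031, 999, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

In practice, calling the tokenizer directly, for example `tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors='tf')`, returns both `input_ids` and `attention_mask`, which could then be fed to the model together. Whether this changes the accuracy reported here has not been checked.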

In [35]:
# validation of the model
indices_all_val=tf.concat([indices_0[500:600], indices_1[500:600]], axis=0)
selected_labels_val = tf.gather(labels, indices_all_val)
selected_padded_texts_val = tf.gather(padded_texts, indices_all_val)
validation_results = model.evaluate(selected_padded_texts_val, selected_labels_val)
print("Validation Loss:", validation_results[0])
print("Validation Accuracy:", validation_results[1])

Validation Loss: 0.39037904143333435
Validation Accuracy: 0.8999999761581421
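
Accuracy alone can hide differences between the two classes, which matters because the full dataset is strongly skewed towards positive reviews. As a minimal sketch, precision and recall for the positive class can be computed from true and predicted labels as shown below. `binary_report` and the label lists are hypothetical; in the notebook, the predicted labels would come from taking the argmax over the model's output logits.

```python
def binary_report(y_true, y_pred):
    """Accuracy plus precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: one positive review missed, one negative review misread.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
report = binary_report(y_true, y_pred)
print(report)  # {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```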


## Conclusions 
Trained on 5000 positive and 5000 negative labeled reviews, the model reached a training accuracy of about 99% by the fifth epoch, as the log below shows. This indicates that the model was able to fit the training data; training accuracy alone does not show how well it generalizes.

- Epoch 1/5
    313/313 [==============================] - 2777s 9s/step - loss: 0.3204 - accuracy: 0.8463
    Epoch 2/5
    313/313 [==============================] - 2790s 9s/step - loss: 0.1445 - accuracy: 0.9507
    Epoch 3/5
    313/313 [==============================] - 2798s 9s/step - loss: 0.0897 - accuracy: 0.9726
    Epoch 4/5
    313/313 [==============================] - 2761s 9s/step - loss: 0.0591 - accuracy: 0.9827
    Epoch 5/5
    313/313 [==============================] - 2742s 9s/step - loss: 0.0388 - accuracy: 0.9888
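
The step counts that Keras prints follow from the number of training examples and the batch size. As a quick check, assuming the batch size of 32 used in the training cell was kept for this larger run, 313 steps per epoch corresponds to 10,000 training examples, and the 32 steps of the smaller run correspond to 1000 examples. `steps_per_epoch` is a hypothetical helper written only for this check.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches per epoch when the last, partial batch is kept."""
    return math.ceil(num_examples / batch_size)


print(steps_per_epoch(1000, 32))   # 32
print(steps_per_epoch(10000, 32))  # 313
```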


A separate set of 500 positive and negative reviews was used to validate the model, which reached an accuracy of 92% on it. (The smaller run shown in the cells above, with 500 training and 100 validation reviews per class, reached a validation accuracy of 90%.) This suggests that the model performs well on unseen data and generalizes beyond the training set.

Based on these findings, it is reasonable to expect that training on a larger dataset containing all labeled Sephora reviews would further improve predictive accuracy, although this has not been tested here. More labeled data would give the model more varied examples to learn from, improving its ability to interpret and predict sentiment across a broader range of reviews.

With the whole labeled dataset of Sephora reviews, the model can therefore realistically be expected to show high predictive ability on sentiment analysis tasks.


