<a href="https://colab.research.google.com/github/wberilo/weaklySupervisedLearning/blob/main/weaklySupervisedLearning_yelp_PPGTI3102.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis of yelp reviews

Using the Yelp review dataset and the stars field as a label for sentiment analysis, let's create a pipeline that employs weakly supervised learning to handle labeling the Yelp review texts as positive or negative. First, we will generate initial labels using regex-based functions (weak labels). Next, we will refine these functions to improve their accuracy. We will then compare the performance of our models on ambiguous data (2, 3, 4-star reviews) versus less ambiguous data (extremes, 1 and 5-star reviews). This comparison will help us understand how the classification of our data affects the outcome, depending on how we set the threshold for positive or negative sentiment. Additionally, we will analyze how the nature of our data influences these results.

import our dataset https://huggingface.co/datasets/Yelp/yelp_review_full

for our sentiment analysis of Yelp reviews

In [None]:
!pip install snorkel



In [None]:
import pandas as pd

splits = {'train': 'yelp_review_full/train-00000-of-00001.parquet', 'test': 'yelp_review_full/test-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/Yelp/yelp_review_full/" + splits["train"])
df.head()

Unnamed: 0,label,text
0,4,dr. goldberg offers everything i look for in a...
1,1,"Unfortunately, the frustration of being Dr. Go..."
2,3,Been going to Dr. Goldberg for over 10 years. ...
3,3,Got a letter in the mail last week that said D...
4,0,I don't know what Dr. Goldberg was like before...


In [None]:
import re
import unicodedata

def remove_excessive_spaces_in_text(text: str) -> str:
    return re.sub(r'\s+', ' ', text).strip()

def remove_quotes_dots(text: str) -> str:
    return re.sub(r'[.,`"]', '', text).strip()

def remove_quotes_single(text: str) -> str:
    return re.sub(r"'", "", text).strip()

def remove_repeated_non_word_characters(text: str) -> str:
    return re.sub(r'(\W)\1+', r'\1', text).strip()

def remove_repeated_letters_in_text(text: str, n_repeat: int = 4) -> str:
    return re.sub(r'([a-z])\1{'+str(n_repeat)+',}', r'\1', text).strip()

def remove_accents_from_text(text: str) -> str:
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

def remove_non_alphanumeric_characters(text: str) -> str:
    return re.sub(r'[^a-zA-Z0-9\s]', '', text).strip()

def to_lower(text: str) -> str:
    return text.lower()



# Data cleanup

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


pipeline_clean_text = Pipeline([
    ('remove_accents_from_text', FunctionTransformer(remove_accents_from_text)),
    ('remove_excessive_spaces_in_text', FunctionTransformer(remove_excessive_spaces_in_text)),
    ('remove_repeated_letters_in_text', FunctionTransformer(remove_repeated_letters_in_text)),
    ('remove_repeated_non_word_characters', FunctionTransformer(remove_repeated_non_word_characters)),
    ('to_lower', FunctionTransformer(to_lower)),
    ('remove_quotes_dots', FunctionTransformer(remove_quotes_dots)),
    ('remove_quotes_single', FunctionTransformer(remove_quotes_single)),
    ('remove_non_alphanumeric_characters', FunctionTransformer(remove_non_alphanumeric_characters))
])

df.dropna(subset=['text'], inplace=True)
df.drop_duplicates(subset=['text'], inplace=True)

In [None]:
cleanDf = df.copy()
cleanDf['text'] = df['text'].apply(pipeline_clean_text.transform)
cleanDf.head()
cleanDf.to_csv('data.csv', index=False)

# Define our labels

In [None]:
YES = 1
NO = 0
ABSTAIN = -1

Because we're dealing with star reviews, we'll convert the existing star labels to either positive or negative

In this case, we are setting the star-to-sentiment threshold as 4, meaning that anything that is not a 5 star-review is negative.

In [None]:
def convert_stars_to_labels(stars):
  if stars > 3:
    return YES
  else:
    return NO

cleanDf['label'] = cleanDf['label'].apply(convert_stars_to_labels)

cleanDf.head()


Unnamed: 0,label,text
0,1,dr goldberg offers everything i look for in a ...
1,0,unfortunately the frustration of being dr gold...
2,0,been going to dr goldberg for over 10 years i ...
3,0,got a letter in the mail last week that said d...
4,0,i dont know what dr goldberg was like before m...


Lets split our test data after cleaning our dataset

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(cleanDf['text'], cleanDf['label'], test_size=0.3, random_state=42)

Lets create and declare some regex functions to label our texts

In [None]:
regex_c = re.compile(r"\bdidnt enjoy\b", re.IGNORECASE)
regex_d = re.compile(r'\bbogus\b', re.IGNORECASE)
regex_g = re.compile(r"\bdont recommend\b", re.IGNORECASE)
regex_m = re.compile(r'\bdisappoint(?:ed|ing)?\b', re.IGNORECASE)
regex_b = re.compile(r'\bunpleasant|disgusting\b', re.IGNORECASE)
regex_a = re.compile(r'\bhate|bad|horrible\b', re.IGNORECASE)
regex_p = re.compile(r'\bunacceptable\b', re.IGNORECASE)
regex_a_not_good = re.compile(r'\bnot good|not very good\b', re.IGNORECASE)
regex_o = re.compile(r'\bbad quality\b', re.IGNORECASE)
regex_i = re.compile(r'\b(?:terrible|awful)\b', re.IGNORECASE)

# Positive phrases
regex_f = re.compile(r'\brecommend\b', re.IGNORECASE)
regex_j = re.compile(r'\bgenial|genius\b', re.IGNORECASE)
regex_n = re.compile(r'\badmirable\b', re.IGNORECASE)
regex_a_good = re.compile(r'\b(?:very good|amazing|(?<!not )good)\b', re.IGNORECASE)
regex_o = re.compile(r'\b(?<!bad )quality\b', re.IGNORECASE)
regex_q = re.compile(r'\bnice\b', re.IGNORECASE)
regex_r = re.compile(r'\b(?:great|awesome)\b', re.IGNORECASE)
regex_t = re.compile(r'\b(?:excellent|wonderful)\b', re.IGNORECASE)
regex_u = re.compile(r'\b(?:perfect|outstanding)\b', re.IGNORECASE)
regex_v = re.compile(r'\b(?:superb|fabulous)\b', re.IGNORECASE)
regex_w = re.compile(r'\b(?:brilliant|amazing)\b', re.IGNORECASE)


In [None]:
import re
from snorkel.labeling import labeling_function


# Negative phrases
@labeling_function()
def lf_regex_b(x):
    return NO if regex_c.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_d(x):
    return NO if regex_d.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_g(x):
    return NO if regex_g.search(x.text) else ABSTAIN


@labeling_function()
def lf_regex_m(x):
    return NO if regex_m.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_b(x):
    return NO if regex_a.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_a(x):
    return NO if regex_p.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_a_not_good(x):
    return NO if regex_a_not_good.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_p(x):
    return NO if regex_o.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_i(x):
    return NO if regex_i.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_o(x):
    return NO if regex_i.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_f(x):
    return YES if regex_f.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_j(x):
    return YES if regex_j.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_n(x):
    return YES if regex_n.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_a_good(x):
    return YES if regex_a_good.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_o(x):
    return YES if regex_o.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_q(x):
    return YES if regex_q.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_r(x):
    return YES if regex_r.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_t(x):
    return YES if regex_t.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_u(x):
    return YES if regex_u.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_v(x):
    return YES if regex_v.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_w(x):
    return YES if regex_w.search(x.text) else ABSTAIN




Lets apply our regex functions to our training data

In [None]:
from snorkel.labeling import PandasLFApplier

lfs = [lf_regex_d, lf_regex_g, lf_regex_m, lf_regex_b, lf_regex_a,
       lf_regex_a_not_good, lf_regex_p, lf_regex_i,
       lf_regex_f, lf_regex_j, lf_regex_n, lf_regex_a_good, lf_regex_o,
       lf_regex_q, lf_regex_r, lf_regex_t, lf_regex_u, lf_regex_v, lf_regex_w]

applier = PandasLFApplier(lfs=lfs)

x_train_df = pd.DataFrame(X_train, columns=['text'])

x_train_df.head()


Unnamed: 0,text
224334,i went here with a friend for dinner the other...
638856,this is not the place to go for mediterranean ...
300721,have again 3 stars for the cobb salad which wa...
65864,boba places are a dime a dozen but for some re...
604081,i walked in to the store and went straight to ...


In [None]:
L_train = applier.apply(df=x_train_df)

100%|██████████| 455000/455000 [06:28<00:00, 1170.45it/s]


Use those labels to train a model

In [None]:
from snorkel.labeling.model.label_model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=100, seed=11)


100%|██████████| 500/500 [00:00<00:00, 627.49epoch/s]


In [None]:
Weak_labels = label_model.predict(L=L_train, tie_break_policy="abstain")

In [None]:
Weak_labels

array([-1,  1,  1, ...,  1, -1,  1])

Lets get the count of labels per label type for our expected test labels and compare them with our weak labels

In [None]:
import collections
counter = collections.Counter(Weak_labels)
print(counter)

counter_labels = collections.Counter(Y_train)
print(counter_labels)

Counter({1: 339834, -1: 107465, 0: 7701})
Counter({0: 363689, 1: 91311})


The resulting labels from our regex functions seem pretty skewed towards positive labels.

In [None]:
# build the final df from weak labels

df_final = X_train.to_frame(name='text')
df_final['label'] = Weak_labels
df_final.head()


Unnamed: 0,text,label
224334,i went here with a friend for dinner the other...,-1
638856,this is not the place to go for mediterranean ...,1
300721,have again 3 stars for the cobb salad which wa...,1
65864,boba places are a dime a dozen but for some re...,1
604081,i walked in to the store and went straight to ...,1


In [None]:
# drop abstain values

df_final = df_final[df_final['label'] != -1]

df_final.shape


(347535, 2)

In [None]:
# vectorize text values
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df_final['text'])

train model and test accuracy

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the data into training and testing sets
X_train_model, X_test_model, y_train_model, y_test_model = train_test_split(X_tfidf, df_final['label'], test_size=0.25, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train_model, y_train_model)

# Make predictions on the test set
y_pred = model.predict(X_test_model)

# Calculate the accuracy of the model on its own generated weak labels
accuracy = accuracy_score(y_test_model, y_pred)
print("Accuracy:", accuracy)




Accuracy: 0.9857971548271258


In [None]:
# Calculate the accuracy of the model on real (star reviews) labels

X_test_tfidf = vectorizer.transform(X_test)
y_pred_test = model.predict(X_test_tfidf)

accuracy = accuracy_score(Y_test, y_pred_test)
print("Accuracy:", accuracy)

Accuracy: 0.20756410256410257


# Simple regex functions result

Abysmal result (0.2), probably consequence of poor regex functions, bad coverage of the picked words and maybe ambiguity of middle stars (2, 3 star reviews)

# Better list of words

Instead of using the regex functions that we build, lets try to use a much larger dataset of positive and negative words, with ~6800 words to try and do a more complete search, we'll create two functions and one tiebreaker function. We'll train a model and evaluate it as well

# Quality over quanitity

Because these functions take much longer to run, we'll cut down our testing data, but now that we have ~6800 words, we can take a better look at the effects of quality over quantity of our labels

Import a much more complete lexicon of positive and negative words

https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar

In [None]:
# Read the positive words from the file
with open('./positive-words.txt', 'r') as f:
  positive_words = [line.strip() for line in f]

# Define a function to check if any positive words exist in a string
def has_positive_words_amount(x):
  amount_positive_words = 0
  for word in positive_words:
    escaped_word = re.escape(word)
    if re.search(r'\b{}\b'.format(escaped_word), x.text):
      amount_positive_words += 1
  return amount_positive_words

@labeling_function()
def has_positive_words(x):
  for word in positive_words:
    escaped_word = re.escape(word)
    if re.search(r'\b{}\b'.format(escaped_word), x.text):
      return YES
  return ABSTAIN


In [None]:
# Read the negative words from the file
with open('/content/negative-words.txt', 'r') as f:
  negative_words = [line.strip() for line in f]

# Define a function to check if any negative words exist in a string
def has_negative_words_amount(x):
  amount_negative_words = 0
  for word in negative_words:
    escaped_word = re.escape(word)
    if re.search(r'\b{}\b'.format(escaped_word), x.text):
      amount_negative_words += 1
  return amount_negative_words

@labeling_function()
def has_negative_words(x):
  for word in negative_words:
    escaped_word = re.escape(word)
    if re.search(r'\b{}\b'.format(escaped_word), x.text):
      return NO
  return ABSTAIN

In [None]:
@labeling_function()
def has_sentiment_words(x):
  positive = has_positive_words_amount(x)
  negative = has_negative_words_amount(x)
  if(negative > positive):
    return NO
  if(negative < positive):
    return YES
  else:
    return ABSTAIN

In [None]:
from snorkel.labeling import PandasLFApplier

limited_x = X_train[:1000]
limited_y = Y_train[:1000]

lfs = [has_positive_words, has_negative_words, has_sentiment_words]

applier = PandasLFApplier(lfs=lfs)

x_train_df_again = pd.DataFrame(limited_x, columns=['text'])

L_train_again = applier.apply(df=x_train_df_again)

100%|██████████| 1000/1000 [13:32<00:00,  1.23it/s]


In [None]:
from snorkel.labeling.model.label_model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train_again, n_epochs=500, log_freq=100, seed=11)

Weak_labels_again = label_model.predict(L=L_train_again, tie_break_policy="abstain")



100%|██████████| 500/500 [00:00<00:00, 713.94epoch/s]


In [None]:
df_final = limited_x.to_frame(name='text')
df_final['label'] = Weak_labels_again
df_final.head()

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df_final['text'])

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the data into training and testing sets
X_train_model, X_test_model, y_train_model, y_test_model = train_test_split(X_tfidf, df_final['label'], test_size=0.25, random_state=42)

# Train a Logistic Regression model
model2 = LogisticRegression()
model2.fit(X_train_model, y_train_model)

# Make predictions on the test set
y_pred = model2.predict(X_test_model)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test_model, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.748


In [None]:
X_test_tfidf = vectorizer.transform(X_test)
y_pred_test = model2.predict(X_test_tfidf)

accuracy = accuracy_score(Y_test, y_pred_test)
print("Accuracy:", accuracy)

Accuracy: 0.24658461538461537


# Lexicon of 6800 words result

Slightly better, we might want to focus on data and how we are classificating positive and negative, instead we'll try:

# Changing testing data window

Instead, of dealing with mild sentiment reviews or reviews that can be neutral and induce confusion, lets change the testing data to deal with the extreme cases, with one or five stars and try to evaluate our model based on that data instead.

In [None]:
extremes_df = df[(df['label'] == 0) | (df['label'] == 4)].dropna()
extremes_df.head()

clean_extremesDf = extremes_df.copy()
clean_extremesDf['text'] = clean_extremesDf['text'].apply(pipeline_clean_text.transform)
clean_extremesDf.head()


Unnamed: 0,label,text
0,4,dr goldberg offers everything i look for in a ...
4,0,i dont know what dr goldberg was like before m...
5,4,top notch doctor in a top notch practice cant ...
6,4,dr eric goldberg is a fantastic doctor who has...
7,0,im writing this review to give you a heads up ...


In [None]:
clean_extremesDf['label'] = clean_extremesDf['label'].apply(convert_stars_to_labels)


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(clean_extremesDf['text'], clean_extremesDf['label'], test_size=0.3, random_state=42)

In [None]:
X_test_tfidf = vectorizer.transform(X_test)
y_pred_test = model2.predict(X_test_tfidf)

accuracy = accuracy_score(Y_test, y_pred_test)
print("Accuracy:", accuracy)

Accuracy: 0.5938076923076923


# Results

Not great, but much better! (0.59) our functions can more easily distinguish between positive and negative emotions in the more extreme cases.

In conclusion, this shows us that the better quality of our functions and also better handling and knowledge of the data are essential to develop a weakly supervised learning pipeline, and that those two facts are correlated as once we have a better knowledge of our data, we can also apply that knowledge into developing a better way to create those weakly originated labels in the first place.

Some other methods might be more effective when dealing with this dataset in order to improve our model that weren't applied here, such as:


* better data cleanup and knowledge
  * handling and removal of common non-sentiment word features
  * analysis and removal of ambiguous or wrong labels
* better training
  * evaluation of multiple models
  * talking to an expert
  * better handling of abstain or neutral labels
  * better analysis of our functions to avoid skewing
* better analysis
  * iterate the star-to-sentiment threshold from 1-5 and see how that affects results