MODEL BUILDING

The models I built are:
1. RoBERTa
2. VADER

1. RoBERTa:
RoBERTa, developed by Facebook AI, improves on BERT through a more robust pretraining procedure and more training data. It achieved state-of-the-art results on several NLP benchmarks when it was released.
The model has proven effective in a wide range of applications, from text classification to question answering. RoBERTa's training process removes BERT's next sentence prediction task and uses larger batch sizes. These changes contribute to its stronger performance over BERT on many natural language processing benchmarks.

Advantages of RoBERTa:
RoBERTa is an improved NLP model from Facebook AI that outperforms BERT thanks to more robust pretraining and more data. It offers broader language understanding and handles informal language and domain-specific jargon better. It generalizes well across tasks, which has advanced NLP research and enabled practical applications.

2. VADER:
VADER, or Valence Aware Dictionary and sEntiment Reasoner, is a lexicon-based sentiment analysis tool. It is well suited to social media content because it accounts for nuances and informal language.
VADER uses a human-annotated lexicon to assign sentiment scores to words, and it adjusts those scores for intensity and negation. Its design makes it a good fit for short texts such as tweets and posts. VADER's effectiveness and simplicity make it a popular choice for social media monitoring and opinion mining.
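
How a lexicon-based scorer combines word valences, negation, and intensifiers can be illustrated with a toy sketch. This is not VADER's implementation: the mini-lexicon, the word lists, and the function name toy_compound are made up for illustration. The negation factor (-0.74), the booster increment (0.293), and the normalisation x / sqrt(x^2 + 15) are, to the best of my knowledge, the constants used in the open-source VADER code, but treat them as assumptions here.

```python
import math
import re

# Hypothetical mini-lexicon for illustration; the real lexicon has thousands of rated entries.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very", "really", "extremely"}
BOOST = 0.293      # assumed booster increment
NEGATE = -0.74     # assumed negation factor


def toy_compound(text: str, alpha: float = 15.0) -> float:
    """Sum word valences with simple booster/negation rules, then squash into [-1, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word)
        if valence is None:
            continue  # words outside the lexicon carry no sentiment
        prev = words[i - 1] if i > 0 else ""
        prev2 = words[i - 2] if i > 1 else ""
        # A booster directly before the word strengthens it in its own direction.
        if prev in BOOSTERS:
            valence += math.copysign(BOOST, valence)
        # A negation within the two preceding words flips and dampens the valence.
        if prev in NEGATIONS or prev2 in NEGATIONS:
            valence *= NEGATE
        total += valence
    # Normalise the unbounded sum into the range [-1, 1].
    return total / math.sqrt(total * total + alpha)


for phrase in ["good", "very good", "not good", "the table"]:
    print(phrase, round(toy_compound(phrase), 3))
```

The sketch ignores punctuation, capitalisation, and emoticons, all of which the real tool also takes into account.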

Advantages of VADER:
VADER is a lexicon-based sentiment analysis tool designed for social media text, and it captures nuances effectively. It considers word polarity, intensity, negation, and punctuation when scoring sentiment. It is fast, simple, and versatile, and it is widely used in sentiment analysis research and industry applications.

I built a few models and compared them; RoBERTa and VADER were the best-performing ones. I therefore suggest VADER and RoBERTa as the models to use in the implementation of the WebHelpers project.

Link for the dataset: https://drive.google.com/file/d/1CZpQwEXjFiqq4YBwz8QDYsuV18b5moaZ/view?usp=sharing

The dataset is very large, so I have taken only a few rows for model building and training.

To avoid errors, make sure all the packages used in this code are installed in your environment.

In [None]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

import nltk

In [None]:
df = pd.read_csv('Reviews.csv')
print(df.shape)
# The full dataset is very large: about 500,000 (5 lakh) records.
# To reduce the runtime, change the number of rows kept below
# depending on the compute (e.g. GPU) you have available.
df = df.head(500)
print(df.shape)
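
Loading the whole file and then calling head() works, but the subset can also be taken while reading or drawn at random. The sketch below shows both options on a small in-memory CSV; the CSV content is a made-up stand-in for Reviews.csv, and the row counts are arbitrary.

```python
import io

import pandas as pd

# Made-up stand-in for Reviews.csv so the sketch runs on its own.
csv_text = "Id,Score,Text\n" + "\n".join(
    f"{i},{(i % 5) + 1},review text {i}" for i in range(1, 21)
)

# Option 1: parse only the first rows instead of loading everything first.
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=5)

# Option 2: load everything, then draw a reproducible random sample.
full = pd.read_csv(io.StringIO(csv_text))
sampled = full.sample(n=5, random_state=42)

print(first_rows.shape, sampled.shape)
```

Reading with nrows keeps memory use low, while a random sample is less likely to be biased by the order of rows in the file.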

In [None]:
df.head()

In [None]:
ax = df['Score'].value_counts().sort_index() \
    .plot(kind='bar',
          title='Count of Reviews by Stars',
          figsize=(10, 5))
ax.set_xlabel('Review Stars')
plt.show()

In [None]:
example = df['Text'][50]
print(example)

In [None]:

# Tokenizer data used by word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

tokens = nltk.word_tokenize(example)
tokens[:10]


In [None]:
nltk.download('averaged_perceptron_tagger')

tagged = nltk.pos_tag(tokens)
tagged[:10]

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

entities = nltk.chunk.ne_chunk(tagged)
entities.pprint()

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores('I am so happy!')


In [None]:
sia.polarity_scores('This is the worst thing ever.')


In [None]:
sia.polarity_scores(example)
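
polarity_scores returns a dictionary with the keys neg, neu, pos, and compound. A minimal sketch of turning the compound value into a label follows, assuming the commonly used cutoffs of +0.05 and -0.05. The function name label_from_compound and the sample dictionary are made up for illustration and are not real model output.

```python
def label_from_compound(scores: dict, threshold: float = 0.05) -> str:
    """Map a VADER-style score dictionary to 'positive', 'negative' or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up scores in the same shape as the dictionary returned by polarity_scores.
sample_scores = {"neg": 0.0, "neu": 0.3, "pos": 0.7, "compound": 0.65}
print(label_from_compound(sample_scores))
```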


In [None]:
# Run the polarity score on the entire dataset
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)

In [None]:
vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index': 'Id'})
vaders = vaders.merge(df, how='left')
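
The reshaping step above turns a dictionary of per-review score dictionaries into one row per review and joins it back to the reviews. The sketch below shows the same idea on made-up data, with the join key named explicitly and a validation check added; the variable names and values are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for `df` and `res`.
reviews = pd.DataFrame({"Id": [1, 2, 3], "Score": [5, 1, 4], "Text": ["a", "b", "c"]})
res = {
    1: {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6},
    2: {"neg": 0.6, "neu": 0.4, "pos": 0.0, "compound": -0.7},
    3: {"neg": 0.1, "neu": 0.7, "pos": 0.2, "compound": 0.2},
}

# orient="index" makes each outer key a row, like pd.DataFrame(res).T
scores = pd.DataFrame.from_dict(res, orient="index")
scores = scores.reset_index().rename(columns={"index": "Id"})

# Naming the key and validating guard against accidental joins on other
# shared columns and against duplicated ids.
merged = scores.merge(reviews, on="Id", how="left", validate="one_to_one")
print(merged)
```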

In [None]:
# Now we have sentiment score and metadata
vaders.head()

In [None]:
ax = sns.barplot(data=vaders, x='Score', y='compound')
ax.set_title('Compound Score by Amazon Star Review')
plt.show()

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(12, 3))
sns.barplot(data=vaders, x='Score', y='pos', ax=axs[0])
sns.barplot(data=vaders, x='Score', y='neu', ax=axs[1])
sns.barplot(data=vaders, x='Score', y='neg', ax=axs[2])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')
plt.tight_layout()
plt.show()

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

In [None]:
import torch

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
example = df['Text'][50]
print(example)
encoded_text = tokenizer(example, return_tensors='pt')
input_ids = encoded_text['input_ids']
attention_mask = encoded_text['attention_mask']

with torch.no_grad():
    output = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = output.logits

scores = torch.softmax(logits, dim=1)
scores_dict = {
    'roberta_neg': scores[0, 0].item(),
    'roberta_neu': scores[0, 1].item(),
    'roberta_pos': scores[0, 2].item()
}
print(scores_dict)
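
The softmax step converts the three raw logits into probabilities that sum to 1. A minimal NumPy sketch of that calculation is shown below; the logits are made-up numbers, not output from the model, and softmax_1d is a hypothetical helper name.

```python
import numpy as np


def softmax_1d(logits):
    """Numerically stable softmax for a 1-D sequence of logits."""
    values = np.asarray(logits, dtype=float)
    shifted = values - values.max()  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()


logits = [-1.2, 0.3, 2.1]  # made-up logits in the order negative, neutral, positive
probs = softmax_1d(logits)
labels = ["roberta_neg", "roberta_neu", "roberta_pos"]
print({label: round(float(p), 3) for label, p in zip(labels, probs)})
```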



In [None]:
# VADER results on example
print(example)
sia.polarity_scores(example)

In [None]:
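def polarity_scores_roberta(example):
    """Return RoBERTa negative/neutral/positive probabilities for one text."""
    encoded_text = tokenizer(example, return_tensors='pt')
    with torch.no_grad():  # inference only, so skip gradient tracking
        output = model(**encoded_text)
    # output[0] holds the logits; take the first (and only) row of the batch
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2]
    }
    return scores_dict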

In [None]:
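# Score every review with both models and keep the results side by side
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    try:
        text = row['Text']
        myid = row['Id']
        vader_result = sia.polarity_scores(text)
        # Prefix the VADER keys so they do not clash with the RoBERTa keys
        vader_result_rename = {}
        for key, value in vader_result.items():
            vader_result_rename[f"vader_{key}"] = value
        roberta_result = polarity_scores_roberta(text)
        both = {**vader_result_rename, **roberta_result}
        res[myid] = both
    except RuntimeError:
        # Most likely the review is longer than the model's maximum input length
        print(f'Broke for id {myid}')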
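
The loop above prints the id of each review that fails and moves on. One possible variation is to collect the failed ids so they can be inspected or retried later. The sketch below shows that pattern with a stand-in scorer; score_all and fake_scorer are hypothetical names, and the word-count limit only imitates a model that rejects long inputs.

```python
def score_all(rows, scorer):
    """Score (id, text) pairs and return (results, ids that raised RuntimeError)."""
    results, failed = {}, []
    for row_id, text in rows:
        try:
            results[row_id] = scorer(text)
        except RuntimeError:
            failed.append(row_id)  # remember the id instead of only printing it
    return results, failed


def fake_scorer(text):
    """Hypothetical stand-in for polarity_scores_roberta that rejects long texts."""
    if len(text.split()) > 5:
        raise RuntimeError("input too long")
    return {"roberta_pos": 0.5}


rows = [(1, "short text"), (2, "this text has far too many words in it")]
results, failed = score_all(rows, fake_scorer)
print(results, failed)
```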

In [None]:
results_df = pd.DataFrame(res).T
results_df = results_df.reset_index().rename(columns={'index': 'Id'})
results_df = results_df.merge(df, how='left')

In [None]:
# Compare scores between models
results_df.columns

In [None]:
# Combine and compare
sns.pairplot(data=results_df,
             vars=['vader_neg', 'vader_neu', 'vader_pos',
                  'roberta_neg', 'roberta_neu', 'roberta_pos'],
            hue='Score',
            palette='tab10')
plt.show()
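
Alongside the pair plot, a correlation table gives a compact numeric view of how the two models relate to the star rating. The sketch below uses a small made-up table with the same column names as results_df; the values are illustrative only and are not results from this notebook, and roberta_net is a hypothetical helper column.

```python
import pandas as pd

# Made-up values that only mimic the column layout of results_df.
toy = pd.DataFrame({
    "Score": [1, 2, 3, 4, 5],
    "vader_compound": [-0.6, -0.2, 0.1, 0.5, 0.8],
    "roberta_neg": [0.80, 0.60, 0.30, 0.10, 0.05],
    "roberta_pos": [0.05, 0.10, 0.30, 0.70, 0.90],
})

# Collapse the RoBERTa probabilities into one signed value, comparable to VADER's compound.
toy["roberta_net"] = toy["roberta_pos"] - toy["roberta_neg"]

# Pearson correlation (the pandas default) between the rating and each model's score.
corr = toy[["Score", "vader_compound", "roberta_net"]].corr()
print(corr.round(2))
```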



In [None]:
# Review examples: the 1-star review that RoBERTa scores as most positive
results_df.query('Score == 1') \
    .sort_values('roberta_pos', ascending=False)['Text'].values[0]

In [None]:
# The 1-star review that VADER scores as most positive
results_df.query('Score == 1') \
    .sort_values('vader_pos', ascending=False)['Text'].values[0]

In [None]:
# The 5-star review that RoBERTa scores as most negative
results_df.query('Score == 5') \
    .sort_values('roberta_neg', ascending=False)['Text'].values[0]

In [None]:
# The 5-star review that VADER scores as most negative
results_df.query('Score == 5') \
    .sort_values('vader_neg', ascending=False)['Text'].values[0]
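
The four lookups above follow one pattern: filter by star rating, then pick the review with the highest value in a model column. A small helper can express that once. The sketch below runs on made-up data; most_surprising is a hypothetical function name and the texts and scores are invented.

```python
import pandas as pd


def most_surprising(results, star, column, n=1):
    """Return the n reviews with the given star rating that score highest in `column`."""
    subset = results[results["Score"] == star]
    return subset.nlargest(n, column)[["Id", column, "Text"]]


# Made-up rows that only mimic the layout of results_df.
toy_results = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Score": [1, 1, 5, 5],
    "roberta_pos": [0.10, 0.85, 0.95, 0.20],
    "roberta_neg": [0.80, 0.05, 0.01, 0.70],
    "Text": ["awful", "great, if you like stale food", "love it", "hard to put down"],
})

print(most_surprising(toy_results, star=1, column="roberta_pos"))
print(most_surprising(toy_results, star=5, column="roberta_neg"))
```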