# Sentiment Analysis Using Python
### I will perform sentiment analysis in Python using three techniques:

* VADER (Valence Aware Dictionary and sEntiment Reasoner) - bag-of-words approach
* RoBERTa pretrained model from 🤗 Hugging Face
* Hugging Face pipeline

### Importing Necessary Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
plt.style.use('ggplot')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Loading The Data

In [3]:
df = pd.read_csv("/kaggle/input/amazon-fine-food-reviews/Reviews.csv")

In [4]:
# Use the last 1000 rows; .copy() gives an independent DataFrame
sa = df.tail(1000).copy()
sa

### QUICK EDA

In [47]:
scores = sa['Score'].value_counts().sort_index()
ax = scores.plot(kind='bar',
                 title='Count by Stars',
                 figsize=(10, 6),
                 color=['darkred', 'red', 'orange', 'yellow', 'green'])
ax.set_xlabel("Review Stars")
plt.show()

In [6]:
# Relabel the index so it runs from 1 to len(sa)
sa.index = np.arange(1, len(sa) + 1)

### Basic NLTK

In [7]:
eg = sa['Text'][26]
print(eg)

In [8]:
tokens = nltk.word_tokenize(eg)
tokens[:10]

In [9]:
tagged = nltk.pos_tag(tokens)
tagged[:10]

In [10]:
entities = nltk.chunk.ne_chunk(tagged)
entities.pprint()

### VADER Sentiment Scoring
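#### I will use NLTK's SentimentIntensityAnalyzer to get the neg/neu/pos scores of the text.

This uses a "bag of words" approach:

* Words without an entry in VADER's sentiment lexicon, such as most stop words, add nothing to the sentiment total.
* Each remaining word is scored individually, and the scores are combined into a total score.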
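
To make the bag-of-words idea concrete, here is a minimal sketch of a lexicon-based scorer. The mini-lexicon `TOY_LEXICON` and the function `toy_bag_of_words_score` are hypothetical and only for illustration; VADER's real lexicon is far larger and it applies extra rules that this sketch leaves out. The final squashing step follows the same form VADER uses to normalize its compound score.

```python
import math

# Hypothetical mini-lexicon: made-up valence values, not VADER's real ones.
TOY_LEXICON = {"happy": 2.0, "great": 3.0, "good": 2.0, "bad": -2.5, "awful": -3.0}


def toy_bag_of_words_score(text, alpha=15.0):
    """Score text by summing per-word valences, ignoring word order."""
    tokens = text.lower().split()
    # Words that are missing from the lexicon contribute 0.
    total = sum(TOY_LEXICON.get(token, 0.0) for token in tokens)
    # Squash the unbounded sum into the open interval (-1, 1).
    return total / math.sqrt(total * total + alpha)


print(toy_bag_of_words_score("i am so happy"))       # positive value
print(toy_bag_of_words_score("the food was awful"))  # negative value
print(toy_bag_of_words_score("the box arrived"))     # 0.0, no lexicon words
```

Because only the set of words matters, shuffling the words of a sentence leaves this toy score unchanged.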

In [11]:
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm

sia = SentimentIntensityAnalyzer()

In [12]:
sia.polarity_scores("i am so happy")

In [13]:
sia.polarity_scores(eg)

In [14]:
# Run VADER on every review and store the scores keyed by review Id
res = {}
for i, row in tqdm(sa.iterrows(), total=len(sa)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)

In [15]:
res[567455]

In [16]:
vader = pd.DataFrame(res).T
vader = vader.reset_index().rename(columns={'index':'Id'})

In [17]:
# Join the VADER scores to the review data on the shared 'Id' column
vaders = vader.merge(sa, how='left', on='Id')

In [18]:
vaders.head()

### Plot VADER results

In [19]:
ax = sns.barplot(data=vaders, x='Score', y='compound')
ax.set_title('Compound Score by Amazon Star Review')
plt.show()

In [20]:
fig, axes = plt.subplots(1,3,figsize=(10,3))
sns.barplot(data = vaders,x='Score',y='pos',ax = axes[0])
sns.barplot(data = vaders,x='Score',y='neg',ax = axes[1])
sns.barplot(data = vaders,x='Score',y='neu',ax = axes[2])
axes[0].set_title("Positive")
axes[1].set_title("Negative")
axes[2].set_title("Neutral")
plt.tight_layout()
plt.show()

### RoBERTa Pretrained Model
* Uses a model trained on a large corpus of data.
* A transformer model accounts not only for the individual words but also for their context relative to other words.
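
A small sketch of why context matters: two sentences with opposite meanings can have exactly the same bag of words. The helper `bag_of_words` and the two sentences are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts, discarding word order and commas."""
    return Counter(text.lower().replace(",", "").split())


sentence_a = "The food was good, not bad"
sentence_b = "The food was bad, not good"

# The sentences differ in meaning, but their word counts are identical.
print(bag_of_words(sentence_a) == bag_of_words(sentence_b))  # True
```

A model that only sees word counts cannot tell these two reviews apart, whereas a model that reads words in context can.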

In [21]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

In [22]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [23]:
# vader result example
print(eg)
sia.polarity_scores(eg)

In [24]:
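# Run the same example through the RoBERTa model
encoded_text = tokenizer(eg, return_tensors='pt')
output = model(**encoded_text)
scores = output[0][0].detach().numpy()  # raw logits for the single input
scores = softmax(scores)  # convert logits to probabilities that sum to 1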
scores_dict = {
    'roberta_neg' : scores[0],
    'roberta_neu' : scores[1],
    'roberta_pos' : scores[2]
}
print(scores_dict)
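
The `softmax` call above turns the model's raw logits into three probabilities. As a rough sketch of what it computes, here is a plain-Python version. The name `softmax_plain` and the logit values are hypothetical; the notebook itself uses `scipy.special.softmax`.

```python
import math


def softmax_plain(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum avoids overflow and does not change the result.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order negative, neutral, positive.
example_logits = [-1.0, 0.5, 2.0]
probs = softmax_plain(example_logits)
print(probs)
print(sum(probs))
```

The largest logit always gets the largest probability, so the ranking of the three classes is preserved.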

In [25]:
def polarity_scores_roberta(text):
    """Return RoBERTa negative/neutral/positive probabilities for one text."""
    encoded_text = tokenizer(text, return_tensors='pt')
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        'roberta_neg' : scores[0],
        'roberta_neu' : scores[1],
        'roberta_pos' : scores[2]
    }
    return scores_dict

In [26]:
res = {}
for i, row in tqdm(sa.iterrows(), total=len(sa)):
    try:
        text = row['Text']
        myid = row['Id']
        vader_result = sia.polarity_scores(text)
        vader_result_rename = {}
        for key, value in vader_result.items():
            vader_result_rename[f"vader_{key}"] = value
        roberta_result = polarity_scores_roberta(text)
        both = {**vader_result_rename, **roberta_result}
        res[myid] = both
    except RuntimeError:
        # Typically raised when a review is longer than the model's maximum input length; skip it
        print(f'Broke for id {myid}')

In [32]:
# Combine the scores from both models with the original columns of 'sa'
result = pd.DataFrame(res).T
result = result.reset_index().rename(columns={'index':'Id'})
result = result.merge(sa, how='left', on='Id')

In [33]:
result.head()

In [34]:
result.columns

### Comparing Both Models

In [36]:
sns.pairplot(data=result,
             vars=['vader_neg', 'vader_neu', 'vader_pos',
                  'roberta_neg', 'roberta_neu', 'roberta_pos'],
            hue='Score',
            palette='tab10')
plt.show()

### Review Examples:
**Positive 1-Star and Negative 5-Star Reviews**

**Let's look at some examples where the model score and the review score differ the most.**

#### Positive sentiment 1-star review

In [37]:
result.query('Score == 1') \
    .sort_values('roberta_pos', ascending=False)['Text'].values[0]

In [39]:
result.query('Score == 1') \
    .sort_values('vader_pos', ascending=False)['Text'].values[0]

#### Negative sentiment 5-star review

In [41]:
result.query('Score == 5') \
    .sort_values('roberta_neg', ascending=False)['Text'].values[0]

In [43]:
result.query('Score == 5') \
    .sort_values('vader_neg', ascending=False)['Text'].values[0]
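
The four queries above follow one pattern: filter by star rating, then take the row with the highest value in one model column. A minimal sketch of that pattern as a reusable helper, run on a tiny made-up table; `most_mismatched` and the values in `toy` are hypothetical and are not results from the dataset.

```python
import pandas as pd

# Made-up scores for illustration only.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Score": [1, 1, 5, 5],
    "roberta_pos": [0.05, 0.90, 0.95, 0.10],
    "roberta_neg": [0.90, 0.05, 0.02, 0.80],
})


def most_mismatched(frame, star, column, n=1):
    """Return the n rows with the given star rating that score highest in column."""
    subset = frame[frame["Score"] == star]
    return subset.nlargest(n, column)


positive_one_star = most_mismatched(toy, star=1, column="roberta_pos")
negative_five_star = most_mismatched(toy, star=5, column="roberta_neg")
print(positive_one_star["Id"].tolist())   # [2]
print(negative_five_star["Id"].tolist())  # [4]
```

Raising `n` returns several of the most mismatched reviews at once instead of only the top one.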

### The Transformers Pipeline
* A quick and easy way to run sentiment predictions

In [44]:
from transformers import pipeline

sent_pipeline = pipeline("sentiment-analysis")

In [45]:
sent_pipeline("i am good but i want be better")

In [46]:
sent_pipeline("i don think i can complete this task but i have made advanced project")
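
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. Here is a small sketch of how that output could be folded into one signed number per text. The helper `signed_score` and the values in `sample_output` are hypothetical, and the `POSITIVE`/`NEGATIVE` label names are assumed from the default sentiment model, so other models may use different labels.

```python
def signed_score(prediction):
    """Map {'label': ..., 'score': ...} to a value in [-1, 1]."""
    # Assumes the labels are 'POSITIVE' and 'NEGATIVE'.
    sign = 1.0 if prediction["label"].upper() == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Hypothetical pipeline-style output, not real model results.
sample_output = [
    {"label": "POSITIVE", "score": 0.9},
    {"label": "NEGATIVE", "score": 0.75},
]
signed = [signed_score(p) for p in sample_output]
print(signed)  # [0.9, -0.75]
```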

# THE END