<h1>Sentiment Analysis</h1>

1. Lexicon - A dictionary that maps words to sentiment scores, which are combined to classify a sentence. A threshold on the resulting score decides the sentiment label.<br>
We will use the VADER lexicon today.
<pre>
{VADER} = -1 to +1 (compound score)
          >= 0.05  (+ve)
          <= -0.05 (-ve)
          otherwise (neutral)
</pre>
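The thresholds above can be sketched as a small helper. This is a minimal illustration; the function name `label_from_compound` and the `threshold` parameter are assumptions made for this example, not part of VADER itself.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"  # scores strictly between -threshold and +threshold


for score in (0.9259, 0.0, -0.4):
    print(score, label_from_compound(score))
```

Keeping the cut-off in one place makes it harder for the positive and negative branches to drift apart.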


Identify Noise

Remove Noise
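What counts as noise depends on the data source. As a rough sketch, assuming noise here means HTML markup, URLs, and stray whitespace (the helper name `remove_noise` is hypothetical), a regex-based cleanup could look like this:

```python
import html
import re


def remove_noise(text):
    """Strip markup, links, and extra whitespace (one possible notion of noise)."""
    text = html.unescape(text)                 # turn entities such as &amp; into characters
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


print(remove_noise("<p>Great app &amp; fun!</p>  See https://example.com/x now"))
```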


Character Normalization (transforming text data):
1. Lowercasing.
2. Converting special characters.
3. Handling encoding issues.<br>
  a.  To increase recall: variants such as "Café" and "cafe" then match the same token.
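The three normalization steps can be sketched with the standard library alone. This is one possible interpretation, assuming that "special characters" means accents, ligatures, and typographic quotes; the helper name `normalize_chars` is hypothetical.

```python
import unicodedata


def normalize_chars(text):
    """Lowercase, fold compatibility characters, and strip accents."""
    # NFKD decomposes characters: an accented "e" becomes "e" plus a combining
    # accent, and the "fi" ligature becomes the two letters "f" and "i".
    text = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left behind by the decomposition.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace typographic quotes with their plain ASCII counterparts.
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    return text.lower()


print(normalize_chars("Caf\u00e9 \u201cGREAT\u201d, it\u2019s \ufb01ne"))
```

After this step, differently encoded spellings of the same word compare equal, which is what raises recall in later matching.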

Data masking - Hiding sensitive information
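As a minimal sketch of data masking, assuming the sensitive items are e-mail addresses and phone numbers, regular expressions can replace them with placeholders. The patterns and the `[EMAIL]` / `[PHONE]` placeholders are illustrative choices and will not catch every real-world format.

```python
import re

# Simplified patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_sensitive(text):
    """Replace e-mail addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(mask_sensitive("Mail jane.doe@example.com or call +1 555-123-4567."))
```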


After these steps, we have clean text.


Linguistic processing -
1. Tokenization
2. POS (part-of-speech) tagging
3. Lemmatization
4. NER (named entity recognition)
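The first and third steps can be illustrated with a toy sketch that needs no downloads. It is a simplification: the `LEMMAS` table is a hypothetical stand-in for a full dictionary such as the one behind `WordNetLemmatizer`, and the regex tokenizer keeps contractions whole, unlike `word_tokenize`. POS tagging and NER rely on trained models and are not shown here.

```python
import re

# Hypothetical mini lemma table; a real lemmatizer uses a full dictionary.
LEMMAS = {"works": "work", "apps": "app", "games": "game", "was": "be", "is": "be"}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def lemmatize(tokens):
    """Lowercase each token and replace it by its dictionary form when known."""
    return [LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]


tokens = tokenize("It's amazing and works perfectly.")
print(tokens)
print(lemmatize(tokens))
```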

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer

!pip install vaderSentiment

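# Download only the resources this notebook uses ("all" would fetch every NLTK package).
# "punkt_tab" is what newer NLTK releases look for; older ones use "punkt".
resources = ["punkt", "punkt_tab", "stopwords", "vader_lexicon", "wordnet"]
for name in resources:
  nltk.download(name)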

In [12]:
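def preprocess(txt):
  # Tokenize, lowercase, and drop English stop words.
  # Caveat: NLTK's stop list contains negations such as "not", so this step
  # can remove words that VADER relies on to detect negated sentiment.
  tokenize = word_tokenize(txt)
  stop_words = set(stopwords.words("english"))
  filtered_tokens = [token.lower() for token in tokenize if token.lower() not in stop_words]
  return filtered_tokens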

def analyze_sentiment(txt):
  preprocessed = ' '.join(preprocess(txt))
  sia = SentimentIntensityAnalyzer()
  sentiment_scores = sia.polarity_scores(preprocessed)
  if sentiment_scores['compound'] >= 0.05:
    sentiment = "positive"
  elif sentiment_scores['compound'] <= -0.05:
    sentiment = "negative"
  else:
    sentiment = "neutral"

  return sentiment, sentiment_scores

txt = "I love this product! It's amazing and works perfectly."
sent, sent_score = analyze_sentiment(txt)
print(f"Sentiment: {sent}\nScores: {sent_score}")


Sentiment: positive
Scores: {'neg': 0.0, 'neu': 0.194, 'pos': 0.806, 'compound': 0.9259}


In [16]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NLP/text.csv")

analyzer = SentimentIntensityAnalyzer()
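def get_sentiment(txt):
  scores = analyzer.polarity_scores(txt)
  # Label a review 1 (positive) as soon as any positive-scored word is present;
  # negative words are ignored here, unlike a threshold on the compound score.
  sentiment = 1 if scores["pos"] > 0 else 0
  return sentiment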

df["sentiment"] = df["reviewText"].apply(get_sentiment)
df

Unnamed: 0,reviewText,Positive,sentiment
0,This is a one of the best apps acording to a b...,1,1
1,This is a pretty good version of the game for ...,1,1
2,this is a really cool game. there are a bunch ...,1,1
3,"This is a silly game and can be frustrating, b...",1,1
4,This is a terrific game on any pad. Hrs of fun...,1,1
...,...,...,...
19995,this app is fricken stupid.it froze on the kin...,0,0
19996,Please add me!!!!! I need neighbors! Ginger101...,1,1
19997,love it! this game. is awesome. wish it had m...,1,1
19998,I love love love this app on my side of fashio...,1,1


In [17]:
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(df["Positive"], df["sentiment"]))

[[ 1377  3390]
 [  620 14613]]


In [18]:
print(classification_report(df["Positive"], df["sentiment"]))

              precision    recall  f1-score   support

           0       0.69      0.29      0.41      4767
           1       0.81      0.96      0.88     15233

    accuracy                           0.80     20000
   macro avg       0.75      0.62      0.64     20000
weighted avg       0.78      0.80      0.77     20000
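The figures in the report follow directly from the confusion matrix. The sketch below recomputes them by hand, using scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are chosen for this example.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label.
cm = [[1377, 3390],
      [620, 14613]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

precision_pos = tp / (tp + fp)  # of all reviews predicted 1, how many are truly 1
recall_pos = tp / (tp + fn)     # of all true 1 reviews, how many were found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
accuracy = (tp + tn) / total

print(f"class 1: precision={precision_pos:.2f} recall={recall_pos:.2f}")
print(f"class 0: precision={precision_neg:.2f} recall={recall_neg:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The low recall for class 0 comes from the 3390 negative reviews that were labeled positive.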



In [20]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def sentiment_scores(sentence):
  sid_obj = SentimentIntensityAnalyzer()
  sentiment_dict = sid_obj.polarity_scores(sentence)
  print("Overall Sentiment dictionary is:", sentiment_dict)
  em = ["neg", "neu", "pos"]
  for e in em:
    print(f"{sentiment_dict[e]*100}% {e}")
  s = sentiment_dict["compound"]
  if s >= 0.05:
    sent = "positive"
  elif s <= -0.05:
    sent = "negative"
  else:
    sent = "neutral"
  print(f"Overall Sentiment: {sent}")

In [22]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

# Load dataset
data = df
X = data['reviewText']
y = data['Positive']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Naive Bayes Classifier
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_counts, y_train)
nb_predicted = nb_classifier.predict(X_test_counts)

# LSTM Model
max_words = 1000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
maxlen = 100  # assuming a maximum length of 100 words per sentence
X_train_pad = pad_sequences(X_train_seq, padding='post', maxlen=maxlen)
X_test_pad = pad_sequences(X_test_seq, padding='post', maxlen=maxlen)

lstm_model = Sequential()
# No mask_zero=True here, so the LSTM also reads the trailing zeros added by padding='post'.
lstm_model.add(Embedding(max_words, 50, input_length=maxlen))
lstm_model.add(LSTM(64))
lstm_model.add(Dense(1, activation='sigmoid'))
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
lstm_model.fit(X_train_pad, y_train, epochs=5, batch_size=32)
lstm_predicted = (lstm_model.predict(X_test_pad) > 0.5).astype('int')

# VADER Sentiment Analysis
sid = SentimentIntensityAnalyzer()
vader_predicted = []
for sentence in X_test:
    ss = sid.polarity_scores(sentence)
    if ss['compound'] >= 0.05:
        vader_predicted.append(1)  # Positive
    elif ss['compound'] <= -0.05:
        vader_predicted.append(0)  # Negative
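    else:
        # Neutral: label 2 never occurs in y_test (0/1 only), so these rows
        # always count as errors and the report gains an empty class 2.
        vader_predicted.append(2)  # Neutral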

# Evaluate performance
print("Naive Bayes Accuracy:", accuracy_score(y_test, nb_predicted))
print("Naive Bayes Report:", classification_report(y_test, nb_predicted))

print("LSTM Accuracy:", accuracy_score(y_test, lstm_predicted))
print("LSTM Report:", classification_report(y_test, lstm_predicted))

print("VADER Accuracy:", accuracy_score(y_test, vader_predicted))
print("VADER Report:", classification_report(y_test, vader_predicted))



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Naive Bayes Accuracy: 0.901
Naive Bayes Report:               precision    recall  f1-score   support

           0       0.82      0.76      0.79       958
           1       0.93      0.95      0.94      3042

    accuracy                           0.90      4000
   macro avg       0.87      0.85      0.86      4000
weighted avg       0.90      0.90      0.90      4000

LSTM Accuracy: 0.7605
LSTM Report:               precision    recall  f1-score   support

           0       0.00      0.00      0.00       958
           1       0.76      1.00      0.86      3042

    accuracy                           0.76      4000
   macro avg       0.38      0.50      0.43      4000
weighted avg       0.58      0.76      0.66      4000

VADER Accuracy: 0.7975
VADER Report:               precision    recall  f1-score   support

           0       0.65      0.52      0.58       958
           1       0.87      0.89      0.88      3042
           2 

  _warn_prf(average, modifier, msg_start, len(result))
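Two sanity checks help when reading these reports. First, the LSTM's recall of 1.00 on class 1 and 0.00 on class 0 means it predicts 1 for every review, so its accuracy should equal the majority-class baseline. Second, VADER's neutral label has no counterpart in the binary ground truth. The sketch below computes the baseline from the support column and shows one way to fold neutral scores into the two classes; splitting at 0 is an assumption made for this example, and `vader_to_binary` is a hypothetical helper.

```python
# Class counts taken from the support column of the test-set reports above.
support = {0: 958, 1: 3042}

# Accuracy obtained by always predicting the most frequent class.
majority_accuracy = max(support.values()) / sum(support.values())
print(f"Majority-class baseline: {majority_accuracy:.4f}")


def vader_to_binary(compound):
    """Fold the neutral band into the two labels by splitting at 0 (an assumption)."""
    return 1 if compound >= 0 else 0


print([vader_to_binary(c) for c in (0.93, 0.02, -0.01, -0.6)])
```

A model is only informative on this test set if it beats the baseline printed by the first check.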
