# Sentiment Analysis
## Lexicon-based analysis
Lexicon-based analysis, as implemented by the NLTK VADER sentiment analyzer, uses a predefined sentiment lexicon together with a set of rules and heuristics to determine the sentiment of a piece of text. These rules are typically based on lexical and syntactic features of the text, such as the presence of positive or negative words and phrases.

While lexicon-based analysis is relatively simple to implement and interpret, it may not be as accurate as ML-based or transformer-based approaches, especially on complex or ambiguous text.
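To make the idea concrete, here is a minimal sketch of a lexicon-based scorer. It is not VADER: the `LEXICON` entries, the `NEGATIONS` set, and the `lexicon_score` function are hypothetical names and values chosen only for illustration, and the single negation rule is far simpler than the heuristics a real analyzer applies.

```python
# Toy lexicon mapping words to valence scores.
# These entries are illustrative assumptions, not VADER's actual lexicon.
LEXICON = {"good": 1.0, "great": 2.0, "pleasant": 1.0,
           "bad": -1.0, "awful": -2.0, "mess": -1.0}
NEGATIONS = {"not", "never", "no"}

def lexicon_score(text):
    """Sum word valences, flipping the sign of the word right after a negation."""
    score = 0.0
    negate = False
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        valence = LEXICON.get(word, 0.0)
        score += -valence if negate else valence
        negate = False  # a negation only affects the next word
    return score

print(lexicon_score("A pleasant surprise, really great!"))  # 3.0
print(lexicon_score("Not good, an awful mess."))            # -4.0
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment. How the score is turned into a label is a separate design decision.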

In [1]:
! pip install nltk






In [2]:
# import libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# download all NLTK data (first run only; this is a large download)
nltk.download('all')




[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\nvish\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\nvish\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\nvish\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\nvish\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\nvish\AppData\Roaming\nltk_data...

True

In [3]:
# load IMDB review data
import pandas as pd
df = pd.read_csv(r'Datasets\aclImdb_data_50000.csv')
df

Unnamed: 0,text,label
0,In a college dorm a guy is killed by somebody ...,neg
1,The production year says it all. The movie is ...,neg
2,A pleasant surprise! I expected a further down...,pos
3,"The ""math"" aspect to this is merely a gimmick ...",neg
4,Some of the greatest and most loved horror mov...,neg
...,...,...
49995,I found this gem in a rack the local video ren...,neg
49996,If we consider three films with a similar subj...,pos
49997,King of Masks (Bian Lian in China) is a shocki...,pos
49998,It's hard to know what was going through Per K...,neg


## preprocess_text

In [4]:
# create preprocess_text function
import re

# Build the stop word set and the lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Remove special characters (keeps letters, digits, and whitespace)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', str(text))
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    # Join the tokens back into a string
    return ' '.join(lemmatized_tokens)

# apply the function to the text column
df['text'] = df['text'].apply(preprocess_text)
df

Unnamed: 0,text,label
0,college dorm guy killed somebody scythe girlfr...,neg
1,production year say movie marauding mess polit...,neg
2,pleasant surprise expected downgrade along lin...,pos
3,math aspect merely gimmick try set tv show apa...,neg
4,greatest loved horror movie wicked sense humou...,neg
...,...,...
49995,found gem rack local video rental store tape e...,neg
49996,consider three film similar subject one made 1...,pos
49997,king mask bian lian china shockingly beautiful...,pos
49998,hard know going per kristensen morten lindberg...,neg
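The effect of this preprocessing on negation is worth seeing on a single sentence. The sketch below reuses the same regular expression as `preprocess_text`, but replaces tokenization with a plain `split()` and uses `STOP_SUBSET`, a small hand-picked set assumed to mirror a few entries of NLTK's English stop word list. Both the set and the `strip_and_filter` helper are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical hand-picked subset standing in for NLTK's English stop word list
STOP_SUBSET = {"this", "not", "at", "all"}

def strip_and_filter(text):
    # Same character filter as preprocess_text: drop everything except
    # letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Simplified tokenization: lowercase and split on whitespace
    tokens = text.lower().split()
    return ' '.join(t for t in tokens if t not in STOP_SUBSET)

result = strip_and_filter("This movie wasn't good, not at all!")
print(result)  # movie wasnt good
```

In this example the word "not" and the apostrophe in "wasn't" are both removed, so a scorer that relies on negation cues may see only the positive word "good".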


In [5]:
df.to_csv('preprocess_text.csv', index=False)

In [7]:
# initialize NLTK sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# create get_sentiment function
def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    # scores['pos'] is the proportion of the text rated positive, so this rule
    # returns 'pos' whenever any positive-valence term is found
    sentiment = 'pos' if scores['pos'] > 0 else 'neg'
    return sentiment

# apply get_sentiment function
df['sentiment'] = df['text'].apply(get_sentiment)
df

Unnamed: 0,text,label,sentiment
0,college dorm guy killed somebody scythe girlfr...,neg,pos
1,production year say movie marauding mess polit...,neg,pos
2,pleasant surprise expected downgrade along lin...,pos,pos
3,math aspect merely gimmick try set tv show apa...,neg,pos
4,greatest loved horror movie wicked sense humou...,neg,pos
...,...,...,...
49995,found gem rack local video rental store tape e...,neg,pos
49996,consider three film similar subject one made 1...,pos,pos
49997,king mask bian lian china shockingly beautiful...,pos,pos
49998,hard know going per kristensen morten lindberg...,neg,pos


In [8]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(df['label'], df['sentiment']))

[[  154 24846]
 [   40 24960]]


In [9]:
from sklearn.metrics import classification_report

print(classification_report(df['label'], df['sentiment']))

              precision    recall  f1-score   support

         neg       0.79      0.01      0.01     25000
         pos       0.50      1.00      0.67     25000

    accuracy                           0.50     50000
   macro avg       0.65      0.50      0.34     50000
weighted avg       0.65      0.50      0.34     50000
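
The figures in the classification report can be reproduced by hand from the confusion matrix printed earlier, where rows are true labels and columns are predicted labels, in the order `neg`, `pos`. The helper name `precision_recall_f1` is an assumption introduced for this sketch.

```python
# Confusion matrix values from the notebook output
# rows = true label (neg, pos), columns = predicted label (neg, pos)
neg_as_neg, neg_as_pos = 154, 24846
pos_as_neg, pos_as_pos = 40, 24960

def precision_recall_f1(correct, predicted_total, actual_total):
    """Per-class precision, recall, and F1 from raw counts."""
    precision = correct / predicted_total
    recall = correct / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

neg = precision_recall_f1(neg_as_neg, neg_as_neg + pos_as_neg, neg_as_neg + neg_as_pos)
pos = precision_recall_f1(pos_as_pos, pos_as_pos + neg_as_pos, pos_as_pos + pos_as_neg)
total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos
accuracy = (neg_as_neg + pos_as_pos) / total

print([round(v, 2) for v in neg])  # [0.79, 0.01, 0.01]
print([round(v, 2) for v in pos])  # [0.5, 1.0, 0.67]
print(round(accuracy, 2))          # 0.5
```

The `neg` precision of 0.79 rests on only 194 reviews predicted as `neg`, of which 154 are correct.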



### Conclusion
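The classifier performs poorly. The dataset is balanced (25,000 reviews per class), yet 24,846 of the 25,000 negative reviews are predicted as "pos", giving the "neg" class a recall of 0.01 and an f1-score of 0.01. The "pos" class reaches a recall of 1.00 and an f1-score of 0.67 only because almost every review is labeled "pos", and overall accuracy stays at 0.50, which is no better than always predicting one class. This is consistent with the decision rule in `get_sentiment`: `scores['pos'] > 0` holds for any review that contains at least one positive term, regardless of how negative the rest of the text is.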
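
One possible alternative decision rule, sketched here under stated assumptions, labels a review by VADER's `compound` score, which `polarity_scores` returns alongside `pos`, `neu`, and `neg`. The function name `get_sentiment_compound`, its `score_fn` parameter, and the threshold of 0.0 are assumptions made for this sketch and have not been evaluated on this dataset. The scorer is passed in as an argument so the rule can be demonstrated with a stand-in scorer.

```python
def get_sentiment_compound(text, score_fn, threshold=0.0):
    """Label text 'pos' when its compound score is at or above the threshold."""
    compound = score_fn(text)['compound']
    return 'pos' if compound >= threshold else 'neg'

# In the notebook this could be applied with the existing analyzer, e.g.:
# df['sentiment'] = df['text'].apply(
#     lambda t: get_sentiment_compound(t, analyzer.polarity_scores))

# Stand-in scorer used only to demonstrate the rule without NLTK data
def stub_scores(text):
    return {'compound': -0.6 if 'awful' in text else 0.4}

print(get_sentiment_compound("awful marauding mess", stub_scores))  # neg
print(get_sentiment_compound("pleasant surprise", stub_scores))     # pos
```

Whether this rule improves on the results above would need to be checked by rerunning the confusion matrix and classification report.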