# Ngram-Based Sentiment Analysis using TextBlob and NLTK

This project implements sentiment analysis using bigrams and the Naïve Bayes classifier from TextBlob. 

It uses NLTK for text preprocessing, including stopword removal, tokenization, and n-gram generation.

## How the Model is Trained:

1. A dataset of 10,000 positive and negative sentences is used for training.
2. Stopwords are removed using NLTK's stopword list.
3. Sentences are tokenized and converted into bigrams (2-word sequences).
4. Each sentence is labeled as positive (pos) or negative (neg) using TextBlob's sentiment polarity.
5. The Naïve Bayes classifier from TextBlob is trained using these bigram-based labeled sentences.
6. The classifier is then used to predict sentiment for new text inputs.



In [1]:
# Imports the Natural Language Toolkit library, providing tools for working with human language data.
import nltk

# Downloads the WordNet lexical database, used for tasks like finding synonyms and understanding word meanings.
nltk.download('wordnet')

# Downloads the Brown Corpus, a collection of American English text used for training and testing language models.
nltk.download('brown')

# Downloads the Punkt Sentence Tokenizer, used to split text into individual sentences.
nltk.download('punkt')

# Downloads the Punkt tokenizer data in tabular (non-pickle) format, which recent NLTK releases load in place of 'punkt'.
nltk.download('punkt_tab')

# Downloads the Averaged Perceptron Tagger for English, used for Part-of-Speech tagging.
nltk.download('averaged_perceptron_tagger_eng')

# Downloads the stop word lists (including English), used later to filter out common words.
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sreer\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\sreer\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sreer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\sreer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\sreer\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sreer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True

In [2]:
from textblob import TextBlob

# Sample text
text = "I love programming in Python"

# Create a TextBlob object
blob = TextBlob(text)

# Generate bigrams (2-grams)
bigrams = blob.ngrams(n=2)
print(bigrams)


[WordList(['I', 'love']), WordList(['love', 'programming']), WordList(['programming', 'in']), WordList(['in', 'Python'])]


In [3]:
trigrams = blob.ngrams(n=3)
print(trigrams)


[WordList(['I', 'love', 'programming']), WordList(['love', 'programming', 'in']), WordList(['programming', 'in', 'Python'])]


# Ngrams from NLTK

In [4]:
# Bigrams with NLTK: tokenize the sentence, then pair each token with the one that follows it
import nltk
word_data = "The best performance can bring in sky high success."
nltk_tokens = nltk.word_tokenize(word_data)
print(list(nltk.bigrams(nltk_tokens)))


[('The', 'best'), ('best', 'performance'), ('performance', 'can'), ('can', 'bring'), ('bring', 'in'), ('in', 'sky'), ('sky', 'high'), ('high', 'success'), ('success', '.')]
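
To make the sliding-window idea behind `nltk.bigrams` and `blob.ngrams` concrete, here is a minimal pure-Python sketch. The helper name `make_ngrams` is hypothetical and is not part of NLTK or TextBlob; it only illustrates how consecutive tokens are grouped.

```python
def make_ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    # zip stops at the shortest view, so incomplete windows are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["The", "best", "performance", "can", "bring"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, so a list shorter than n yields none.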


# Sentiment Model using Ngrams

In [5]:
# Imports the TextBlob class for text processing.
from textblob import TextBlob
# Imports the NaiveBayesClassifier for text classification.
from textblob.classifiers import NaiveBayesClassifier
# Imports the Counter class for counting item frequencies.
from collections import Counter
# Imports the NLTK library for various NLP tasks.
import nltk
# Imports the stopwords module for handling common words.
from nltk.corpus import stopwords
# Imports the ngrams function for creating n-gram sequences.
from nltk.util import ngrams

In [6]:
# File path
file_path = r"C:/DATASCIENCE/NLP/positive_negative_sentences_10000.txt"

In [7]:
# Load stop words
stop_words = set(stopwords.words('english'))

In [8]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [9]:
# Function to remove stop words
def remove_stopwords(text):
    words = nltk.word_tokenize(text)
    filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
    return filtered_words


In [10]:
# Function to generate n-grams
def generate_ngrams(text, n=2):
    words = remove_stopwords(text)  
    return [" ".join(gram) for gram in ngrams(words, n)]
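
The sketch below traces what these two functions do to a single sentence, and why the preprocessed strings shown later contain repeated words. It is a simplified stand-in, not the notebook's code: `DEMO_STOP_WORDS` is a tiny hypothetical subset of NLTK's list, and `str.split()` replaces `nltk.word_tokenize`, so punctuation handling is cruder.

```python
# Hypothetical stand-in for NLTK's English stop word list (tiny subset).
DEMO_STOP_WORDS = {"i", "am", "for", "the", "in", "my"}


def demo_remove_stopwords(text):
    # Strip trailing punctuation and lowercase each whitespace-separated word.
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if w.isalnum() and w not in DEMO_STOP_WORDS]


def demo_generate_ngrams(text, n=2):
    words = demo_remove_stopwords(text)
    # Slide a window of n words across the filtered list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = demo_generate_ngrams("I am grateful for the good things in my life.")
joined = " ".join(bigrams)
print(bigrams)
print(joined)

# A sentence with fewer than two words left after filtering yields no bigrams.
short = " ".join(demo_generate_ngrams("I am proud."))
print(repr(short))
```

Because adjacent bigrams overlap by one word, joining them back into a string repeats every interior word, and a sentence that keeps only one word collapses to an empty string.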

In [11]:
# Function to determine the sentiment label from TextBlob's built-in polarity score.
# Note: a polarity of exactly 0 (neutral text, or no words TextBlob recognises) is labeled "neg".
def get_sentiment_label(text):
    polarity = TextBlob(text).sentiment.polarity
    return "pos" if polarity > 0 else "neg"
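
Since any polarity that is not strictly positive becomes "neg", neutral inputs end up in the negative class (the labeled training data shown further down tags 'feel blessed blessed friends' as 'neg', for example). One possible alternative, sketched here as an assumption rather than as the notebook's method, is a three-way label with a small neutral band. The function name `label_from_polarity` and the `threshold` value of 0.05 are hypothetical choices.

```python
def label_from_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to 'pos', 'neg', or 'neu'."""
    if polarity > threshold:
        return "pos"
    if polarity < -threshold:
        return "neg"
    # Scores inside the band around zero are treated as neutral.
    return "neu"


for score in (0.6, 0.0, -0.7):
    print(score, label_from_polarity(score))
```

Neutral items could then be dropped from the training set or kept as a third class, depending on what the classifier is meant to distinguish.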

In [12]:
# Read and preprocess sentences.
# For each non-empty line: strip whitespace, remove stop words, build bigrams,
# and join the bigrams back into a single space-separated string.
# A line with fewer than two words left after stop word removal yields an empty string.
with open(file_path, "r", encoding="utf-8") as f:  # Open the file in read mode with UTF-8 encoding
    sentences = [" ".join(generate_ngrams(line.strip())) for line in f if line.strip()]

In [13]:
sentences

['grateful good good things things life',
 'feel confident confident decisions',
 'success way',
 'weather beautiful beautiful today',
 'feel blessed blessed friends',
 'feeling great great today',
 'weather beautiful beautiful today',
 '',
 'proud achievements',
 'embrace change change open open arms',
 'life beautiful beautiful full full possibilities',
 'love spending spending time time family',
 'feel blessed blessed friends',
 'weather beautiful beautiful today',
 'excited future',
 'wonderful experience',
 'enjoy learning learning new new things',
 'everything going going perfectly',
 'life beautiful beautiful full full possibilities',
 'grateful good good things things life',
 'excited future',
 'great opportunity',
 'constantly growing growing improving',
 'feel confident confident decisions',
 'life beautiful beautiful full full possibilities',
 'weather beautiful beautiful today',
 'love spending spending time time family',
 'feeling great great today',
 'feel empowered empow

In [14]:
# Create labeled training data
train_data = [(sentence, get_sentiment_label(sentence)) for sentence in sentences]

In [15]:
train_data

[('grateful good good things things life', 'pos'),
 ('feel confident confident decisions', 'pos'),
 ('success way', 'pos'),
 ('weather beautiful beautiful today', 'pos'),
 ('feel blessed blessed friends', 'neg'),
 ('feeling great great today', 'pos'),
 ('weather beautiful beautiful today', 'pos'),
 ('', 'neg'),
 ('proud achievements', 'pos'),
 ('embrace change change open open arms', 'neg'),
 ('life beautiful beautiful full full possibilities', 'pos'),
 ('love spending spending time time family', 'pos'),
 ('feel blessed blessed friends', 'neg'),
 ('weather beautiful beautiful today', 'pos'),
 ('excited future', 'pos'),
 ('wonderful experience', 'pos'),
 ('enjoy learning learning new new things', 'pos'),
 ('everything going going perfectly', 'pos'),
 ('life beautiful beautiful full full possibilities', 'pos'),
 ('grateful good good things things life', 'pos'),
 ('excited future', 'pos'),
 ('great opportunity', 'pos'),
 ('constantly growing growing improving', 'neg'),
 ('feel confident c

In [16]:
# Train the Naive Bayes classifier
classifier = NaiveBayesClassifier(train_data)
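
As far as I can tell, when no `feature_extractor` is passed, TextBlob's `NaiveBayesClassifier` builds its features from the individual words of each training string, so the joined-bigram strings above are effectively treated as bags of words. If the intent is for each bigram to be a feature in its own right, one option is a custom extractor. The sketch below is an assumption, not the notebook's method: `bigram_features` is a hypothetical helper, and it expects the stop-word-filtered words joined by spaces rather than the already-joined bigrams.

```python
def bigram_features(document):
    """Return one boolean feature per pair of adjacent words in the document."""
    words = document.split()
    return {f"has({a} {b})": True for a, b in zip(words, words[1:])}


features = bigram_features("weather beautiful today")
print(features)

# Assumed usage, with train_data holding (filtered words joined by spaces, label) pairs:
# classifier = NaiveBayesClassifier(train_data, feature_extractor=bigram_features)
```

With this extractor, "weather beautiful" and "beautiful today" are counted as two distinct features instead of three separate words.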

In [17]:
# Test the classifier
test_sentence = "I am feeling amazing today!"
processed_sentence = " ".join(generate_ngrams(test_sentence))
print(f"Sentence: {test_sentence}")
print(f"Predicted Sentiment: {classifier.classify(processed_sentence)}")

Sentence: I am feeling amazing today!
Predicted Sentiment: pos


In [18]:
# Show the classifier accuracy on the same data it was trained on
accuracy = classifier.accuracy(train_data)
print(f"Training Accuracy: {accuracy:.2f}")

Training Accuracy: 0.98
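
Accuracy measured on the training data tends to be optimistic, because the classifier has already seen every example. A common remedy is to hold out part of the data before training. The following is a minimal sketch under assumptions: `split_train_test`, the 20% test fraction, and the fixed seed are hypothetical choices, and `demo_data` is made-up placeholder data rather than the notebook's dataset.

```python
import random


def split_train_test(data, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data and split it into train and test parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder (text, label) pairs standing in for train_data.
demo_data = [(f"sentence {i}", "pos" if i % 2 else "neg") for i in range(10)]
train_part, test_part = split_train_test(demo_data)
print(len(train_part), len(test_part))
```

The classifier would then be trained on `train_part` only and scored with `classifier.accuracy(test_part)`.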


In [19]:
# Sentiment analysis details
blob = TextBlob(processed_sentence)
sentiment = classifier.classify(processed_sentence)
polarity = blob.sentiment.polarity
subjectivity = blob.sentiment.subjectivity
print(f"Sentence: \"{test_sentence}\" → Sentiment: {sentiment}, Polarity: {polarity:.2f}, Subjectivity: {subjectivity:.2f}")

Sentence: "I am feeling amazing today!" → Sentiment: pos, Polarity: 0.60, Subjectivity: 0.90


In [20]:
# Classify additional sentences
test_sentences = [
    "I had a bad day!",
    "This movie was the worst.",
    "The service was okay, but the food was terrible.",
    "We had a fantastic and funny experience.",
    "Not bad, but could be better."
]

In [21]:
# Remove stop words, build bigrams, and classify
for sentence in test_sentences:
    processed = " ".join(generate_ngrams(sentence))
    print(f"Sentence: \"{sentence}\" → Sentiment: {classifier.classify(processed)}")

Sentence: "I had a bad day!" → Sentiment: neg
Sentence: "This movie was the worst." → Sentiment: neg
Sentence: "The service was okay, but the food was terrible." → Sentiment: neg
Sentence: "We had a fantastic and funny experience." → Sentiment: neg
Sentence: "Not bad, but could be better." → Sentiment: neg


In [22]:
# Evaluate accuracy on the additional sentences.
# Each test item pairs the preprocessed sentence (stop words removed, bigrams joined
# into one string) with a reference label from get_sentiment_label, i.e. TextBlob's
# built-in polarity on the original sentence ("pos" or "neg").
# The score therefore measures agreement with TextBlob's polarity, not with human-annotated labels.
test_data = [(" ".join(generate_ngrams(sentence)), get_sentiment_label(sentence)) for sentence in test_sentences]

# accuracy() returns the proportion of test items whose predicted label matches the reference label.
accuracy = classifier.accuracy(test_data)

# Print the accuracy, formatted to two decimal places.
print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 0.60
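
For reference, the accuracy figure is just the fraction of predictions that match the reference labels. The sketch below spells out that calculation. The helper name `agreement` is hypothetical, and the `reference` list is made up for illustration: the notebook does not show which labels TextBlob assigned to the five test sentences.

```python
def agreement(predicted, reference):
    """Return the fraction of positions where the two label lists match."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


predicted = ["neg", "neg", "neg", "neg", "neg"]   # the classifier's outputs above
reference = ["neg", "neg", "neg", "pos", "pos"]   # hypothetical reference labels
score = agreement(predicted, reference)
print(f"{score:.2f}")
```

With only five test sentences, each disagreement moves the score by 0.20, so the figure is a rough indication at best.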


In [23]:
# Word frequency analysis
all_words = [word for sentence in sentences for word in sentence.split()]
word_freq = Counter(all_words)

In [24]:
word_freq.most_common(10)

[('feel', 1814),
 ('challenges', 1069),
 ('full', 1056),
 ('beautiful', 1042),
 ('life', 1000),
 ('everything', 986),
 ('going', 976),
 ('today', 964),
 ('spending', 950),
 ('time', 950)]
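
One caveat about these counts: each entry in `sentences` is a string of joined bigrams, so every interior word of a sentence appears twice and is counted twice. A possible alternative, sketched here under assumptions, is to count the bigrams themselves. The helper `split_bigrams` is hypothetical and relies on each string being a sequence of two-word bigrams, and `demo_sentences` reuses two strings from the output above.

```python
from collections import Counter

demo_sentences = [
    "love spending spending time time family",
    "weather beautiful beautiful today",
]


def split_bigrams(joined):
    """Recover the original bigrams from a string of space-joined bigrams."""
    words = joined.split()
    # Each bigram occupies two consecutive positions: (0, 1), (2, 3), ...
    return [" ".join(words[i:i + 2]) for i in range(0, len(words) - 1, 2)]


word_freq_demo = Counter(w for s in demo_sentences for w in s.split())
bigram_freq = Counter(bg for s in demo_sentences for bg in split_bigrams(s))
print(word_freq_demo.most_common(3))
print(bigram_freq.most_common(3))
```

In the word counts, "spending" and "time" score 2 while "love" and "family" score 1, even though each word occurs once in the original sentence; the bigram counts do not have this distortion.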