A1. Data Loading
Load the dataset into a Pandas DataFrame. Extract the text column and treat each entry as a separate document.

In [12]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('test.csv', encoding='latin1')

# Keep only the text and sentiment columns
df = df[['text', 'sentiment']]

# Clean the data: drop rows with missing values, keep the first 100 rows,
# and rename the target column
df = df.dropna()
df = df.head(100)
df = df.rename(columns={'sentiment': 'label'})

print(f"Dataset loaded: {len(df)} documents")
print(f"Columns: {df.columns.tolist()}")
print("\nFirst few rows:")
print(df.head())

# Remove all rows whose text contains a link
df = df[~df['text'].str.contains('http')]
print(f"\nDataset after removing links: {len(df)} documents")
print(df.head())

Dataset loaded: 100 documents
Columns: ['text', 'label']

First few rows:
                                                text     label
0  Last session of the day  http://twitpic.com/67ezh   neutral
1   Shanghai is also really exciting (precisely -...  positive
2  Recession hit Veronique Branquinho, she has to...  negative
3                                        happy bday!  positive
4             http://twitpic.com/4w75p - I like it!!  positive

Dataset after removing links: 96 documents
                                                text     label
1   Shanghai is also really exciting (precisely -...  positive
2  Recession hit Veronique Branquinho, she has to...  negative
3                                        happy bday!  positive
5                    that`s great!! weee!! visitors!  positive
6            I THINK EVERYONE HATES ME ON HERE   lol  negative


A2. Tokenization
Tokenize all documents using the NLTK library and store the tokens for each document.

In [21]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab', quiet=True)

def swati_tokenise(text):
    """Simple alternative tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()

df['tokens'] = df['text'].apply(lambda x: word_tokenize(str(x).lower()))

# Alternative: use the custom whitespace tokenizer instead of NLTK
# df['tokens'] = df['text'].apply(lambda x: swati_tokenise(str(x)))

print(f"Tokenization completed for {len(df)} documents")
print("\nExample - First document:")
print(f"Original text: {df['text'].iloc[0]}")
print(f"Tokens: {df['tokens'].iloc[0]}")

Tokenization completed for 96 documents

Example - First document:
Original text:  Shanghai is also really exciting (precisely -- skyscrapers galore). Good tweeps in China:  (SH)  (BJ).
Tokens: ['shanghai', 'is', 'also', 'really', 'exciting', '(', 'precisely', '--', 'skyscrapers', 'galore', ')', '.', 'good', 'tweeps', 'in', 'china', ':', '(', 'sh', ')', '(', 'bj', ')', '.']
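
The difference between whitespace splitting and punctuation-aware tokenization can be illustrated with a small sketch. The regular expression below is only a rough stand-in for `word_tokenize` (an assumption made for illustration, not NLTK's actual algorithm; for example, it would split `--` into two tokens), and the function names are hypothetical.

```python
import re

def whitespace_tokenize(text):
    """Lowercase and split on whitespace only; punctuation stays attached."""
    return text.lower().split()

def regex_tokenize(text):
    """Lowercase, then emit runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

sample = "Good tweeps in China: (SH) (BJ)."

ws_tokens = whitespace_tokenize(sample)
rx_tokens = regex_tokenize(sample)

print(ws_tokens)  # punctuation glued to words, e.g. 'china:' and '(sh)'
print(rx_tokens)  # punctuation separated into its own tokens
```

With whitespace splitting, `china:` and `china` would count as different vocabulary entries, which is one reason a punctuation-aware tokenizer is usually preferred before building a vocabulary.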


A3. Token Population
Merge the tokens obtained from all documents and create a master list of distinct tokens present across the entire dataset.

In [22]:
# Create token population - distinct tokens across all documents
all_tokens = []
for tokens in df['tokens']:
    all_tokens.extend(tokens)

# Get unique tokens
token_population = list(set(all_tokens))

print(f"Total tokens (with repetitions): {len(all_tokens)}")
print(f"Unique tokens in population: {len(token_population)}")
print(f"\nSample tokens: {token_population[:20]}")

Total tokens (with repetitions): 1390
Unique tokens in population: 585

Sample tokens: ['anything', 'come', '...', 'oscillate', 'huh', 'marley', 'peoples', 'bought', 'airport', 'later', 'faux', 'system', 'about', 'stupid', 'appearances', 'well', 'spend', 'storm', '(', 'is']


A4. Stop-words
Study the concept of stop-words and identify why they are removed in text analysis. Load and examine the English stop-words list using NLTK.

In [23]:
from nltk.corpus import stopwords

# Download stopwords
nltk.download('stopwords', quiet=True)

# Load English stop-words
stop_words = set(stopwords.words('english'))

print(f"Number of stop-words: {len(stop_words)}")
print(f"\nSample stop-words: {list(stop_words)[:30]}")
print("\n--- Analysis of Stop-words ---")
print("Stop-words are commonly occurring words like 'the', 'is', 'at', 'which', 'on', etc.")

Number of stop-words: 198

Sample stop-words: ['more', "i'm", 'i', 'herself', 'does', 'few', 'd', "i'll", 'again', 'weren', 'own', 'being', "won't", 'over', 'she', 'after', 'both', "haven't", 'between', 'theirs', 'them', 'ourselves', 'was', 'about', 'each', 'itself', 'll', 'needn', "aren't", 'been']

--- Analysis of Stop-words ---
Stop-words are commonly occurring words like 'the', 'is', 'at', 'which', 'on', etc.


- Stop-words are very common function words (pronouns, auxiliaries, prepositions, and contractions) that provide grammatical structure but carry little semantic meaning on their own.
- They occur frequently in almost every text, so they do not help distinguish topics or content.
- Removing them shrinks the vocabulary and reduces noise in many NLP tasks. Note that the NLTK list also contains negated contractions (for example "won't" and "aren't" in the sample above), which can carry sentiment.
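
A minimal sketch of stop-word filtering on toy input follows. The stop-word set here is a small hand-picked list used purely for illustration (an assumption, not the NLTK list loaded above), and `remove_stop_words` is a hypothetical helper.

```python
# Hypothetical mini stop-word list for illustration only;
# the notebook itself uses NLTK's English list.
demo_stop_words = {"i", "is", "the", "a", "to", "on", "me", "not"}

def remove_stop_words(tokens, stop_words):
    """Keep alphabetic tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words and t.isalpha()]

tokens = ["i", "think", "everyone", "hates", "me", "on", "here", "lol"]
kept = remove_stop_words(tokens, demo_stop_words)
print(kept)

# A negated sentence loses its negation when "not" is treated as a stop-word.
negated = ["the", "movie", "is", "not", "good"]
kept_negated = remove_stop_words(negated, demo_stop_words)
print(kept_negated)
```

The second example shows a possible downside for sentiment tasks: once "not" is filtered out, "not good" and "good" produce the same tokens.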

A5. Bag-of-Words Construction
Remove stop-words from the token population and construct a Bag-of-Words with unique, meaningful tokens.

In [24]:
# Remove stop-words and non-alphabetic tokens from token population
bag_of_words = [token for token in token_population 
                if token not in stop_words and token.isalpha()]

# Sort for consistency
bag_of_words = sorted(bag_of_words)

print(f"Token population before filtering: {len(token_population)}")
print(f"Bag-of-Words size (after removing stop-words): {len(bag_of_words)}")
print(f"\nSample BoW tokens: {bag_of_words[:30]}")

Token population before filtering: 585
Bag-of-Words size (after removing stop-words): 455

Sample BoW tokens: ['able', 'ac', 'achy', 'acoustic', 'acum', 'adopted', 'afternoon', 'agree', 'ahhh', 'airport', 'almost', 'alone', 'alright', 'also', 'always', 'anime', 'another', 'answer', 'antibacterial', 'anything', 'appearances', 'apple', 'argh', 'armpit', 'around', 'ask', 'attention', 'bad', 'bank', 'bday']


A6. Document Vectorization
Create a Bag-of-Words feature vector for each document. Each dimension holds the number of times one vocabulary word occurs in that document.

In [25]:
from collections import Counter

vocab = {word: idx for idx, word in enumerate(bag_of_words)}

# Vectorize each document
def vectorize_document(tokens):
    filtered_tokens = [token for token in tokens 
                      if token not in stop_words and token.isalpha()]
    
    # Count occurrences
    token_counts = Counter(filtered_tokens)
    
    # Create vector
    vector = np.zeros(len(bag_of_words))
    for token, count in token_counts.items():
        if token in vocab:
            vector[vocab[token]] = count
    
    return vector

# Apply vectorization to all documents
document_vectors = np.array([vectorize_document(tokens) for tokens in df['tokens']])

print(f"Document vectors shape: {document_vectors.shape}")
print(f"(Number of documents: {document_vectors.shape[0]}, Vocabulary size: {document_vectors.shape[1]})")
print(f"\nFirst document vector (first 20 dimensions): {document_vectors[0][:20]}")

Document vectors shape: (96, 455)
(Number of documents: 96, Vocabulary size: 455)

First document vector (first 20 dimensions): [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
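
To make the vector layout concrete, here is a small self-contained sketch of the same counting idea on two toy documents. The documents and the names `toy_docs`, `toy_vocab`, and `to_vector` are made up for illustration and are not part of the dataset or the notebook.

```python
from collections import Counter

# Two toy, already-tokenized documents (illustrative only).
toy_docs = [
    ["happy", "bday", "happy"],
    ["great", "visitors", "great", "day"],
]

# Sorted vocabulary and word -> column index mapping.
toy_vocab = sorted({token for doc in toy_docs for token in doc})
toy_index = {word: i for i, word in enumerate(toy_vocab)}

def to_vector(tokens):
    """Return a list of counts, one entry per vocabulary word."""
    vector = [0] * len(toy_vocab)
    for token, count in Counter(tokens).items():
        vector[toy_index[token]] = count
    return vector

toy_vectors = [to_vector(doc) for doc in toy_docs]

print(toy_vocab)
for vec in toy_vectors:
    print(vec)
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary grows, as in the 455-dimensional vectors above.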


A7. Dataset Preparation for Classification
Combine the document vectors with their sentiment labels and split into training and testing sets.

In [26]:
from sklearn.model_selection import train_test_split

X = document_vectors
y = df['label'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} documents")
print(f"Testing set: {X_test.shape[0]} documents")
print(f"\nLabel distribution in training set:")
print(pd.Series(y_train).value_counts())
print(f"\nLabel distribution in testing set:")
print(pd.Series(y_test).value_counts())

Training set: 67 documents
Testing set: 29 documents

Label distribution in training set:
neutral     30
negative    20
positive    17
Name: count, dtype: int64

Label distribution in testing set:
neutral     13
negative     9
positive     7
Name: count, dtype: int64
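
As a quick sanity check on the stratified split, the class proportions can be computed from the counts printed above. The sketch below uses those counts directly; the helper name `proportions` is hypothetical. It also computes the accuracy of always predicting the most frequent test class, which is a useful reference point for any classifier trained on this split.

```python
# Label counts copied from the output above.
train_counts = {"neutral": 30, "negative": 20, "positive": 17}
test_counts = {"neutral": 13, "negative": 9, "positive": 7}

def proportions(counts):
    """Convert raw label counts to proportions rounded to 3 decimals."""
    total = sum(counts.values())
    return {label: round(n / total, 3) for label, n in counts.items()}

train_props = proportions(train_counts)
test_props = proportions(test_counts)

# Accuracy obtained by always predicting the most frequent test class.
majority_baseline = max(test_counts.values()) / sum(test_counts.values())

print("Train proportions:", train_props)
print("Test proportions: ", test_props)
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

The proportions in the two sets are close, which is what `stratify=y` is intended to achieve.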


A8. Sentiment Classification
Train a machine learning classifier using the Bag-of-Words vectors and predict sentiment labels for test data.

In [27]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)

lr_classifier = LogisticRegression(max_iter=1000, random_state=42)
lr_classifier.fit(X_train, y_train)
y_pred_lr = lr_classifier.predict(X_test)

print("Training completed for both classifiers!")
print(f"\nNaive Bayes predictions: {y_pred_nb[:10]}")
print(f"Logistic Regression predictions: {y_pred_lr[:10]}")
print(f"Actual labels: {y_test[:10]}")

Training completed for both classifiers!

Naive Bayes predictions: ['neutral' 'negative' 'positive' 'negative' 'neutral' 'positive' 'neutral'
 'negative' 'positive' 'neutral']
Logistic Regression predictions: ['neutral' 'negative' 'positive' 'neutral' 'neutral' 'neutral' 'neutral'
 'neutral' 'positive' 'neutral']
Actual labels: ['negative' 'positive' 'positive' 'neutral' 'negative' 'positive'
 'neutral' 'negative' 'positive' 'negative']


A9. Model Evaluation

Evaluate the performance of the classifiers using metrics such as accuracy, precision, recall, and the confusion matrix.

In [29]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

print("="*60)
print("NAIVE BAYES CLASSIFIER RESULTS")
print("="*60)

# Calculate metrics for Naive Bayes
nb_accuracy = accuracy_score(y_test, y_pred_nb)
nb_precision = precision_score(y_test, y_pred_nb, average='weighted', zero_division=0)
nb_recall = recall_score(y_test, y_pred_nb, average='weighted', zero_division=0)
nb_f1 = f1_score(y_test, y_pred_nb, average='weighted', zero_division=0)

print(f"Accuracy:  {nb_accuracy:.4f}")
print(f"Precision: {nb_precision:.4f}")
print(f"Recall:    {nb_recall:.4f}")
print(f"F1-Score:  {nb_f1:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_nb))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_nb, zero_division=0))

print("\n" + "="*60)
print("LOGISTIC REGRESSION CLASSIFIER RESULTS")
print("="*60)

# Calculate metrics for Logistic Regression
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr, average='weighted', zero_division=0)
lr_recall = recall_score(y_test, y_pred_lr, average='weighted', zero_division=0)
lr_f1 = f1_score(y_test, y_pred_lr, average='weighted', zero_division=0)

print(f"Accuracy:  {lr_accuracy:.4f}")
print(f"Precision: {lr_precision:.4f}")
print(f"Recall:    {lr_recall:.4f}")
print(f"F1-Score:  {lr_f1:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, zero_division=0))

print("\n" + "="*60)
print("COMPARISON & ANALYSIS")
print("="*60)
print(f"Naive Bayes Accuracy:       {nb_accuracy:.2%}")
print(f"Logistic Regression Accuracy: {lr_accuracy:.2%}")
print(f"\nPerformance Difference:     {abs(lr_accuracy - nb_accuracy):.2%}")

NAIVE BAYES CLASSIFIER RESULTS
Accuracy:  0.4483
Precision: 0.4574
Recall:    0.4483
F1-Score:  0.4473

Confusion Matrix:
[[3 3 3]
 [4 6 3]
 [1 2 4]]

Classification Report:
              precision    recall  f1-score   support

    negative       0.38      0.33      0.35         9
     neutral       0.55      0.46      0.50        13
    positive       0.40      0.57      0.47         7

    accuracy                           0.45        29
   macro avg       0.44      0.46      0.44        29
weighted avg       0.46      0.45      0.45        29


LOGISTIC REGRESSION CLASSIFIER RESULTS
Accuracy:  0.5517
Precision: 0.5589
Recall:    0.5517
F1-Score:  0.4680

Confusion Matrix:
[[ 1  7  1]
 [ 0 13  0]
 [ 1  4  2]]

Classification Report:
              precision    recall  f1-score   support

    negative       0.50      0.11      0.18         9
     neutral       0.54      1.00      0.70        13
    positive       0.67      0.29      0.40         7

    accuracy                           0.55        29

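- Logistic Regression (55.17% accuracy) outperformed Naive Bayes (44.83%) by 10.34 percentage points, but its weighted F1-score (0.4680) is only slightly higher than that of Naive Bayes (0.4473).
- Logistic Regression is heavily biased toward the "neutral" class: it identifies all 13 neutral samples correctly, but it predicts "neutral" for 24 of the 29 test samples.
- As a result, Logistic Regression struggles with negative sentiment (11% recall, 1 of 9) and positive sentiment (29% recall, 2 of 7).
- The small dataset (67 training and 29 test documents) and the class imbalance are major limiting factors.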
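
The figures quoted in these notes can be re-derived from the Logistic Regression confusion matrix shown above. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predicted labels, classes in alphabetical order); the helper names are hypothetical.

```python
labels = ["negative", "neutral", "positive"]

# Logistic Regression confusion matrix copied from the output above.
lr_cm = [
    [1, 7, 1],
    [0, 13, 0],
    [1, 4, 2],
]

def per_class_recall(cm):
    """Recall per class: correct predictions divided by the row total."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

def predicted_counts(cm):
    """How often each class was predicted: the column totals."""
    return [sum(row[j] for row in cm) for j in range(len(cm))]

def accuracy(cm):
    """Overall accuracy: diagonal sum divided by the total sample count."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

lr_recall = per_class_recall(lr_cm)
lr_predicted = predicted_counts(lr_cm)
lr_acc = accuracy(lr_cm)

for label, rec, n_pred in zip(labels, lr_recall, lr_predicted):
    print(f"{label:>8}: recall={rec:.2f}, predicted {n_pred} times")
print(f"Accuracy: {lr_acc:.4f}")
```

The column totals make the bias toward "neutral" visible directly, which the accuracy figure alone does not show.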