1. Load the dataset into a Pandas DataFrame and extract the text and label columns.

In [1]:
# The 'text' column holds the review; the 'sentiment' column holds the label (positive, negative, or neutral).

print("Load the given CSV file containing text and label columns into a Pandas DataFrame.")

import pandas as pd

# https://www.kaggle.com/code/akanksha10/sentiment-analysis-dataset/input
df = pd.read_csv('test.csv', encoding='latin1')
 
# Keep only the review text and its sentiment label.
df = df[['text', 'sentiment']]

# Drop rows with missing values, keep the first 100 rows, and rename the label column.
df = df.dropna()
df = df.head(100)
df = df.rename(columns={'sentiment': 'label'})

Load the given CSV file containing text and label columns into a Pandas DataFrame.


2. Reuse the preprocessed text obtained in Lab-3 (after tokenization, case folding, stop-word removal, and joining tokens).

In [2]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab', quiet=True)

df['tokens'] = df['text'].apply(lambda x: word_tokenize(str(x).lower()))
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word.isalpha()])

print(f"Tokenization completed for {len(df)} documents")
print(f"Original text: {df['text'].iloc[0]}")
print(f"Tokens: {df['tokens'].iloc[0]}")

Tokenization completed for 100 documents
Original text: Last session of the day  http://twitpic.com/67ezh
Tokens: ['last', 'session', 'of', 'the', 'day', 'http']


In [3]:
for i in range(3):
    print(f"Example - Document {i+1}:")
    print(f"Original text: {df['text'].iloc[i]}")
    print(f"Tokens: {df['tokens'].iloc[i]}\n")
print("Tokens are lowercase because .lower() is applied before word_tokenize; NLTK tokenization itself preserves case")

Example - Document 1:
Original text: Last session of the day  http://twitpic.com/67ezh
Tokens: ['last', 'session', 'of', 'the', 'day', 'http']

Example - Document 2:
Original text:  Shanghai is also really exciting (precisely -- skyscrapers galore). Good tweeps in China:  (SH)  (BJ).
Tokens: ['shanghai', 'is', 'also', 'really', 'exciting', 'precisely', 'skyscrapers', 'galore', 'good', 'tweeps', 'in', 'china', 'sh', 'bj']

Example - Document 3:
Original text: Recession hit Veronique Branquinho, she has to quit her company, such a shame!
Tokens: ['recession', 'hit', 'veronique', 'branquinho', 'she', 'has', 'to', 'quit', 'her', 'company', 'such', 'a', 'shame']

Tokens are lowercase because .lower() is applied before word_tokenize; NLTK tokenization itself preserves case


In [4]:
from nltk.corpus import stopwords

# Download stopwords
nltk.download('stopwords', quiet=True)

# Load English stop-words
stop_words = set(stopwords.words('english'))

df['tokens_without_stopwords'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df.head())
print(f"Stop-word removal completed.")

                                                text     label  \
0  Last session of the day  http://twitpic.com/67ezh   neutral   
1   Shanghai is also really exciting (precisely -...  positive   
2  Recession hit Veronique Branquinho, she has to...  negative   
3                                        happy bday!  positive   
4             http://twitpic.com/4w75p - I like it!!  positive   

                                              tokens  \
0                [last, session, of, the, day, http]   
1  [shanghai, is, also, really, exciting, precise...   
2  [recession, hit, veronique, branquinho, she, h...   
3                                      [happy, bday]   
4                                [http, i, like, it]   

                            tokens_without_stopwords  
0                         [last, session, day, http]  
1  [shanghai, also, really, exciting, precisely, ...  
2  [recession, hit, veronique, branquinho, quit, ...  
3                                      [happy,

In [5]:
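from nltk.stem import WordNetLemmatizer

# WordNet data required by the lemmatizer (nltk is already imported above)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

lemmatizer = WordNetLemmatizer()

# lemmatize() defaults to pos='n' (noun), so verb forms such as 'watching' are left unchanged.
df['lemmatized_tokens'] = df['tokens_without_stopwords'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])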
print(df.head())
print(f"Lemmatization completed.")

                                                text     label  \
0  Last session of the day  http://twitpic.com/67ezh   neutral   
1   Shanghai is also really exciting (precisely -...  positive   
2  Recession hit Veronique Branquinho, she has to...  negative   
3                                        happy bday!  positive   
4             http://twitpic.com/4w75p - I like it!!  positive   

                                              tokens  \
0                [last, session, of, the, day, http]   
1  [shanghai, is, also, really, exciting, precise...   
2  [recession, hit, veronique, branquinho, she, h...   
3                                      [happy, bday]   
4                                [http, i, like, it]   

                            tokens_without_stopwords  \
0                         [last, session, day, http]   
1  [shanghai, also, really, exciting, precisely, ...   
2  [recession, hit, veronique, branquinho, quit, ...   
3                                      [ha

In [6]:
df['final_tokens'] = df['lemmatized_tokens'].apply(lambda x: ' '.join(x))
print(df[['text', 'final_tokens']].head())

                                                text  \
0  Last session of the day  http://twitpic.com/67ezh   
1   Shanghai is also really exciting (precisely -...   
2  Recession hit Veronique Branquinho, she has to...   
3                                        happy bday!   
4             http://twitpic.com/4w75p - I like it!!   

                                        final_tokens  
0                              last session day http  
1  shanghai also really exciting precisely skyscr...  
2  recession hit veronique branquinho quit compan...  
3                                         happy bday  
4                                          http like  
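The outputs above show `http` surviving as a token: the `isalpha()` filter discards the rest of each URL but keeps that fragment. A minimal sketch of one way to avoid this, assuming URLs carry no sentiment signal here, is to strip them before tokenization. The helper `strip_urls` and the regular expression are illustrative and not part of the original notebook.

```python
import re

# Hypothetical helper (not part of the original notebook): remove URLs
# before tokenization so that fragments such as "http" never become tokens.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Return text with URLs removed and runs of whitespace collapsed."""
    without_urls = URL_PATTERN.sub(" ", str(text))
    return " ".join(without_urls.split())

examples = [
    "Last session of the day  http://twitpic.com/67ezh",
    "http://twitpic.com/4w75p - I like it!!",
]
cleaned = [strip_urls(t) for t in examples]
for original, result in zip(examples, cleaned):
    print(f"{original!r} -> {result!r}")
```

In the notebook this could be applied as `df['text'].apply(strip_urls)` before the `word_tokenize` step.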


3. Load the previously constructed Bag-of-Words document–term matrix from Lab-3.

In [7]:
token_population_lemmatized = [ltoken for ltokens in df['lemmatized_tokens'] for ltoken in ltokens]

# Create unique vocabulary
vocab_lemma = list(set(token_population_lemmatized))

print(f"\nLemmatization-based Vocabulary size: {len(vocab_lemma)}")
print(f"Sample BoW tokens: {vocab_lemma[:30]}")


Lemmatization-based Vocabulary size: 453
Sample BoW tokens: ['underwire', 'cant', 'realise', 'day', 'split', 'coding', 'much', 'sign', 'let', 'guy', 'hey', 'bore', 'breaky', 'salvation', 'jimmy', 'snicker', 'haaaw', 'come', 'able', 'bike', 'almost', 'hole', 'link', 'known', 'dawson', 'shop', 'uk', 'lie', 'like', 'ramen']


In [8]:
from collections import Counter
import numpy as np

# Reuse vocab_lemma from the previous cell and map each term to its column index
vocab_dict_lemma = {word: idx for idx, word in enumerate(vocab_lemma)}

# Vectorize each document
def vectorize_document_lemma(tokens):
    vector = np.zeros(len(vocab_lemma))
    token_freq = Counter(tokens)
    for token, count in token_freq.items():
        if token in vocab_dict_lemma:
            vector[vocab_dict_lemma[token]] = count
    return vector

# Create DTM for lemmatized tokens
dtm_lemma = np.array([vectorize_document_lemma(tokens) for tokens in df['lemmatized_tokens']])

print(f"DTM shape: {dtm_lemma.shape}")
print(f"(Number of documents: {dtm_lemma.shape[0]}, Vocabulary size: {dtm_lemma.shape[1]})")
print(f"\nFirst document vector (first 20 dimensions):\n{dtm_lemma[0][:20]}")

DTM shape: (100, 453)
(Number of documents: 100, Vocabulary size: 453)

First document vector (first 20 dimensions):
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
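The vocabulary above is built with `list(set(...))`, whose iteration order for strings can differ between Python processes because of hash randomization, so the column order of the matrix is not guaranteed to be reproducible across runs. A minimal sketch of a deterministic variant, using toy token lists in place of `df['lemmatized_tokens']`, sorts the vocabulary before assigning column indices.

```python
from collections import Counter

# Toy token lists standing in for df['lemmatized_tokens'] (illustrative data).
docs = [
    ["last", "session", "day", "http"],
    ["happy", "bday"],
    ["happy", "day", "day"],
]

# sorted() fixes the column order; list(set(...)) alone does not guarantee one.
vocab = sorted({token for tokens in docs for token in tokens})
index = {word: i for i, word in enumerate(vocab)}

def vectorize(tokens):
    """Return the raw-count vector of one document over the sorted vocabulary."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokens).items():
        vector[index[token]] = count
    return vector

dtm = [vectorize(tokens) for tokens in docs]
print(vocab)
print(dtm)
```

Sorting also matches the alphabetical feature order that scikit-learn's vectorizers return from `get_feature_names_out()`, which makes the two matrices easier to compare column by column.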


4. State the vocabulary size and dimensionality of the BoW matrix.

In [9]:
print(f"DTM shape: {dtm_lemma.shape}")
print(f"(Number of documents: {dtm_lemma.shape[0]}, Vocabulary size: {dtm_lemma.shape[1]})")

DTM shape: (100, 453)
(Number of documents: 100, Vocabulary size: 453)


5. Construct TF-IDF feature vectors using the same preprocessed documents.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the preprocessed text
tfidf_matrix = tfidf_vectorizer.fit_transform(df['final_tokens'])

# Get feature names (vocabulary)
tfidf_vocab = tfidf_vectorizer.get_feature_names_out()

print(f"TF-IDF matrix constructed successfully")
print(f"Shape: {tfidf_matrix.shape}")
print(f"Vocabulary size: {len(tfidf_vocab)}")

TF-IDF matrix constructed successfully
Shape: (100, 450)
Vocabulary size: 450


6. Display the TF-IDF document–term matrix.

In [11]:
import pandas as pd

# Convert sparse matrix to dense for display
tfidf_dense = tfidf_matrix.toarray()

# Create DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_dense, columns=tfidf_vocab)

print("TF-IDF Document-Term Matrix:")
print(f"Shape: {tfidf_df.shape}")
print("\nFirst 5 documents, first 10 features:")
print(tfidf_df.iloc[:5, :10])
print("\nSample of non-zero values from first document:")
first_doc_nonzero = tfidf_df.iloc[0][tfidf_df.iloc[0] > 0].head(10)
print(first_doc_nonzero)

TF-IDF Document-Term Matrix:
Shape: (100, 450)

First 5 documents, first 10 features:
   able   ac  achy  acoustic  acum  adopted  afternoon  agree  ahhh  airport
0   0.0  0.0   0.0       0.0   0.0      0.0        0.0    0.0   0.0      0.0
1   0.0  0.0   0.0       0.0   0.0      0.0        0.0    0.0   0.0      0.0
2   0.0  0.0   0.0       0.0   0.0      0.0        0.0    0.0   0.0      0.0
3   0.0  0.0   0.0       0.0   0.0      0.0        0.0    0.0   0.0      0.0
4   0.0  0.0   0.0       0.0   0.0      0.0        0.0    0.0   0.0      0.0

Sample of non-zero values from first document:
day        0.388706
http       0.483967
last       0.510927
session    0.594674
Name: 0, dtype: float64


7. State the dimensionality of the TF-IDF matrix and compare it with the BoW matrix.

In [13]:
print("Dimensionality Comparison:\n")
print(f"BoW Matrix Shape: {dtm_lemma.shape}")
print(f"Documents: {dtm_lemma.shape[0]}")
print(f"Vocabulary size: {dtm_lemma.shape[1]}")
print()
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")
print(f"Documents: {tfidf_matrix.shape[0]}")
print(f"Vocabulary size: {tfidf_matrix.shape[1]}")

Dimensionality Comparison:

BoW Matrix Shape: (100, 453)
Documents: 100
Vocabulary size: 453

TF-IDF Matrix Shape: (100, 450)
Documents: 100
Vocabulary size: 450


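TF-IDF has the same number of documents (100) but three fewer features than BoW (450 vs. 453). The weighting itself does not remove terms; the gap most likely comes from TfidfVectorizer's default token pattern, which keeps only tokens of two or more word characters, so single-character tokens in the manually built vocabulary are dropped.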
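One plausible source of the three-column gap can be checked with the regular expression that scikit-learn's `TfidfVectorizer` uses as its default `token_pattern`. The sketch below applies that pattern to a toy vocabulary; the single-character entries are assumptions for illustration, not terms confirmed to be in the dataset.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer / CountVectorizer:
# a token must contain at least two word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Illustrative vocabulary; the single-character entries are assumptions,
# not terms confirmed to be in the notebook's data.
manual_vocab = ["day", "happy", "u", "session", "b", "x"]

kept = [w for w in manual_vocab if DEFAULT_TOKEN_PATTERN.fullmatch(w)]
dropped = [w for w in manual_vocab if w not in kept]

print(f"Kept by the default pattern: {kept}")
print(f"Dropped by the default pattern: {dropped}")
```

In the notebook itself, `set(vocab_lemma) - set(tfidf_vocab)` would list the actual terms missing from the TF-IDF vocabulary.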

8. Select any one document and display:
- its BoW frequency vector
- its TF-IDF weighted vector

In [14]:
# Select document 5 for comparison
doc_index = 5

print(f"Document {doc_index}:")
print(f"Original text: {df['text'].iloc[doc_index][:100]}...")
print(f"Preprocessed: {df['final_tokens'].iloc[doc_index][:100]}...")
print()

# BoW vector
bow_vector = dtm_lemma[doc_index]
bow_nonzero_indices = np.where(bow_vector > 0)[0]
print(f"BoW Frequency Vector (non-zero elements only):")
print(f"Total vocabulary size: {len(bow_vector)}")
print(f"Non-zero elements: {len(bow_nonzero_indices)}")
for idx in bow_nonzero_indices[:15]:
    print(f"  {vocab_lemma[idx]}: {bow_vector[idx]}")
print()

# TF-IDF vector
tfidf_vector = tfidf_dense[doc_index]
tfidf_nonzero_indices = np.where(tfidf_vector > 0)[0]
print(f"TF-IDF Weighted Vector (non-zero elements only):")
print(f"Total vocabulary size: {len(tfidf_vector)}")
print(f"Non-zero elements: {len(tfidf_nonzero_indices)}")
for idx in tfidf_nonzero_indices[:15]:
    print(f"  {tfidf_vocab[idx]}: {tfidf_vector[idx]:.4f}")

Document 5:
Original text:  that`s great!! weee!! visitors!...
Preprocessed: great weee visitor...

BoW Frequency Vector (non-zero elements only):
Total vocabulary size: 453
Non-zero elements: 3
  great: 1.0
  visitor: 1.0
  weee: 1.0

TF-IDF Weighted Vector (non-zero elements only):
Total vocabulary size: 450
Non-zero elements: 3
  great: 0.5443
  visitor: 0.5932
  weee: 0.5932


9. Comment on the difference in feature representation.

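Both representations have the same three non-zero terms for this document, but they weight them differently. BoW assigns each term its raw count (1.0 for all three), whereas TF-IDF scales each count by the term's rarity across the corpus and normalizes the vector: 'great' receives 0.5443, lower than 'visitor' and 'weee' (0.5932 each), which indicates that 'great' occurs in more documents.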
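The TF-IDF weights printed for document 5 can be reproduced by hand. The sketch below uses the smoothed IDF and L2 normalization that `TfidfVectorizer` applies by default. The document frequencies (2 for 'great', 1 for 'visitor' and 'weee') are an assumption inferred from the printed weights; the notebook does not display them.

```python
import math

n_docs = 100  # corpus size reported by the notebook

# Document frequencies: an assumption inferred from the printed weights,
# not values shown in the notebook output.
doc_freq = {"great": 2, "visitor": 1, "weee": 1}
term_counts = {"great": 1, "visitor": 1, "weee": 1}  # BoW counts for document 5

def smoothed_idf(df, n):
    """IDF as computed by TfidfVectorizer with its default smooth_idf=True."""
    return math.log((1 + n) / (1 + df)) + 1

# Raw tf * idf products, then L2 normalization (the default norm='l2').
raw = {t: term_counts[t] * smoothed_idf(doc_freq[t], n_docs) for t in term_counts}
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: v / norm for t, v in raw.items()}

for term, weight in weights.items():
    print(f"{term}: {weight:.4f}")
```

Under these assumed document frequencies the computed weights round to the values in the cell output, which suggests that a single extra occurrence elsewhere in the corpus is enough to lower the weight of 'great'.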

10. Identify words that have high TF-IDF scores but low raw frequency in BoW.

In [16]:
# Calculate average TF-IDF score and average BoW frequency for each word
tfidf_avg = np.mean(tfidf_dense, axis=0)

# Create a mapping from tfidf vocab to bow vocab for comparison
word_comparison = []

for i, word in enumerate(tfidf_vocab):
    tfidf_score = tfidf_avg[i]
    
    # Find corresponding BoW frequency
    if word in vocab_dict_lemma:
        bow_idx = vocab_dict_lemma[word]
        bow_freq = np.mean(dtm_lemma[:, bow_idx])
    else:
        bow_freq = 0
    
    # Calculate ratio to find words where TF-IDF emphasizes more than raw frequency
    if bow_freq > 0:
        ratio = tfidf_score / bow_freq
    else:
        ratio = 0
    
    word_comparison.append({
        'word': word,
        'avg_tfidf': tfidf_score,
        'avg_bow': bow_freq,
        'tfidf_to_bow_ratio': ratio
    })

# Sort by TF-IDF score to find important words
word_comparison.sort(key=lambda x: x['avg_tfidf'], reverse=True)

# Get top TF-IDF words
top_tfidf_words = [w for w in word_comparison if w['avg_tfidf'] > 0][:20]

# Find words with relatively high TF-IDF but low absolute frequency
# These are distinctive words that appear rarely but are important
high_tfidf_low_bow = []
for item in word_comparison:
    if item['avg_tfidf'] > 0.05 and item['avg_bow'] < 0.5:
        high_tfidf_low_bow.append(item)

# Sort by TF-IDF score
high_tfidf_low_bow.sort(key=lambda x: x['avg_tfidf'], reverse=True)

print("Words with High TF-IDF scores but Low BoW frequency:\n")
print(f"{'Word':<20} {'Avg TF-IDF':<12} {'Avg BoW Freq':<12} {'Ratio':<10}")
print("-" * 60)

if len(high_tfidf_low_bow) > 0:
    for item in high_tfidf_low_bow[:15]:
        print(f"{item['word']:<20} {item['avg_tfidf']:<12.4f} {item['avg_bow']:<12.2f} {item['tfidf_to_bow_ratio']:<10.2f}")
    print(f"\nTotal words identified: {len(high_tfidf_low_bow)}")
else:
    # If no words found with strict criteria, show top TF-IDF words
    print("Showing top distinctive words by TF-IDF score:")
    print(f"\n{'Word':<20} {'Avg TF-IDF':<12} {'Avg BoW Freq':<12}")
    print("-" * 50)
    for item in top_tfidf_words[:15]:
        print(f"{item['word']:<20} {item['avg_tfidf']:<12.4f} {item['avg_bow']:<12.2f}")
    print(f"\nThese words have high TF-IDF weights, indicating they are")
    print("distinctive and important for classification, even if they")
    print("don't appear very frequently in raw counts.")

Words with High TF-IDF scores but Low BoW frequency:

Word                 Avg TF-IDF   Avg BoW Freq Ratio     
------------------------------------------------------------
Showing top distinctive words by TF-IDF score:

Word                 Avg TF-IDF   Avg BoW Freq
--------------------------------------------------
happy                0.0333       0.07        
day                  0.0298       0.10        
go                   0.0278       0.06        
http                 0.0210       0.04        
like                 0.0197       0.06        
know                 0.0197       0.05        
time                 0.0163       0.06        
im                   0.0163       0.06        
miss                 0.0162       0.03        
sorry                0.0140       0.04        
need                 0.0134       0.05        
think                0.0130       0.03        
make                 0.0127       0.04        
watching             0.0127       0.04        
got                  0.

11. Explain why TF-IDF assigns higher importance to these words.

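TF-IDF multiplies a term's count in a document by its inverse document frequency, so a word that occurs in few documents receives a larger weight in the documents where it does occur, whereas BoW records only the raw count and cannot express how distinctive a term is. In the output above, no word met the avg_tfidf > 0.05 threshold, so the fallback list is shown; because those scores are averaged over all 100 documents, the list is led by words that occur in several documents (happy, day, go) rather than by the rarest ones.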

12. Split the dataset into training and testing sets.

In [17]:
from sklearn.model_selection import train_test_split

# Prepare labels
y = df['label']

# Split for BoW
X_train_bow, X_test_bow, y_train, y_test = train_test_split(
    dtm_lemma, y, test_size=0.2, random_state=42, stratify=y
)

# Split for TF-IDF
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(
    tfidf_dense, y, test_size=0.2, random_state=42, stratify=y
)

print("Dataset Split:")
print(f"Total samples: {len(df)}")
print(f"Training samples: {len(X_train_bow)}")
print(f"Testing samples: {len(X_test_bow)}")
print()
print("Label distribution in training set:")
print(y_train.value_counts())
print()
print("Label distribution in testing set:")
print(y_test.value_counts())

Dataset Split:
Total samples: 100
Training samples: 80
Testing samples: 20

Label distribution in training set:
label
neutral     35
negative    24
positive    21
Name: count, dtype: int64

Label distribution in testing set:
label
neutral     9
negative    6
positive    5
Name: count, dtype: int64
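In the cells above, the TF-IDF matrix is fitted on all 100 documents before the split, so the IDF statistics also reflect the test documents. A minimal sketch of a stricter protocol, assuming the raw text is split first, fits the vectorizer on the training portion only. The toy corpus and labels below are illustrative stand-ins for `df['final_tokens']` and `df['label']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus standing in for df['final_tokens'] and df['label'] (illustrative data).
texts = [
    "happy bday", "great weee visitor", "last session day", "miss sorry",
    "like happy day", "terrible day", "know time", "need think",
]
labels = ["positive", "positive", "neutral", "negative",
          "positive", "negative", "neutral", "neutral"]

# Split the raw text first, so the test documents stay unseen.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # vocabulary and IDF from training data only
X_test = vectorizer.transform(test_texts)        # reuses the training vocabulary and IDF

print(X_train.shape, X_test.shape)
```

Words that appear only in the test documents are ignored by `transform`, which mirrors what a deployed model would face with unseen text. The toy split omits `stratify` only because the example corpus is too small for it.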


13. Train the same classifiers (used in Lab-3) using TF-IDF features and record the classification accuracy.

In [21]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train classifiers with BoW features
print("="*60)
print("TRAINING CLASSIFIERS WITH BOW FEATURES")
print("="*60)

# Naive Bayes with BoW
nb_bow = MultinomialNB()
nb_bow.fit(X_train_bow, y_train)
y_pred_nb_bow = nb_bow.predict(X_test_bow)
acc_nb_bow = accuracy_score(y_test, y_pred_nb_bow)
print(f"\n1. Naive Bayes (BoW) Accuracy: {acc_nb_bow:.4f}")
print(classification_report(y_test, y_pred_nb_bow))

# Logistic Regression with BoW
lr_bow = LogisticRegression(max_iter=1000, random_state=42)
lr_bow.fit(X_train_bow, y_train)
y_pred_lr_bow = lr_bow.predict(X_test_bow)
acc_lr_bow = accuracy_score(y_test, y_pred_lr_bow)
print(f"\n2. Logistic Regression (BoW) Accuracy: {acc_lr_bow:.4f}")
print(classification_report(y_test, y_pred_lr_bow))

print("\n" + "="*60)
print("TRAINING CLASSIFIERS WITH TF-IDF FEATURES")
print("="*60)

# Naive Bayes with TF-IDF
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train_tfidf)
y_pred_nb_tfidf = nb_tfidf.predict(X_test_tfidf)
acc_nb_tfidf = accuracy_score(y_test_tfidf, y_pred_nb_tfidf)
print(f"\n1. Naive Bayes (TF-IDF) Accuracy: {acc_nb_tfidf:.4f}")
print(classification_report(y_test_tfidf, y_pred_nb_tfidf))

# Logistic Regression with TF-IDF
lr_tfidf = LogisticRegression(max_iter=1000, random_state=42)
lr_tfidf.fit(X_train_tfidf, y_train_tfidf)
y_pred_lr_tfidf = lr_tfidf.predict(X_test_tfidf)
acc_lr_tfidf = accuracy_score(y_test_tfidf, y_pred_lr_tfidf)
print(f"\n2. Logistic Regression (TF-IDF) Accuracy: {acc_lr_tfidf:.4f}")
print(classification_report(y_test_tfidf, y_pred_lr_tfidf))

TRAINING CLASSIFIERS WITH BOW FEATURES

1. Naive Bayes (BoW) Accuracy: 0.3500
              precision    recall  f1-score   support

    negative       0.50      0.33      0.40         6
     neutral       0.33      0.33      0.33         9
    positive       0.29      0.40      0.33         5

    accuracy                           0.35        20
   macro avg       0.37      0.36      0.36        20
weighted avg       0.37      0.35      0.35        20


2. Logistic Regression (BoW) Accuracy: 0.5500
              precision    recall  f1-score   support

    negative       0.50      0.17      0.25         6
     neutral       0.50      0.89      0.64         9
    positive       1.00      0.40      0.57         5

    accuracy                           0.55        20
   macro avg       0.67      0.49      0.49        20
weighted avg       0.62      0.55      0.51        20


TRAINING CLASSIFIERS WITH TF-IDF FEATURES

1. Naive Bayes (TF-IDF) Accuracy: 0.6000
              precision    r

14. Compare both models (the ones using BOW and TF-IDF) in terms of:
- accuracy 
- types of misclassification

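TF-IDF improves Naive Bayes accuracy from 0.35 (BoW) to 0.60 on the 20-document test set, consistent with its down-weighting of common terms relative to rare, informative ones.

Logistic Regression reaches 0.55 with BoW, and BoW suits it better in this case, possibly due to the small dataset and class imbalance. Its BoW errors are mostly negative and positive documents predicted as neutral (neutral recall 0.89, negative recall 0.17, positive recall 0.40).

TF-IDF can over-penalize frequent sentiment words, which may lead to missed classifications in linear models. Because the test set has only 20 documents, each document shifts accuracy by 0.05, so these differences should be read with caution.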
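The types of misclassification can be made explicit by counting each (true label, predicted label) pair. The sketch below uses hypothetical labels for illustration; in the notebook the inputs would be `y_test` and one of the prediction arrays, for example `y_pred_lr_bow`.

```python
from collections import Counter

# Hypothetical labels for illustration; in the notebook these would be
# y_test and one of the prediction arrays, e.g. y_pred_lr_bow.
y_true = ["negative", "negative", "neutral", "neutral", "positive", "positive"]
y_pred = ["neutral", "negative", "neutral", "neutral", "neutral", "positive"]

# Count each (true label, predicted label) pair.
pair_counts = Counter(zip(y_true, y_pred))

# Keep only the off-diagonal pairs, i.e. the misclassifications.
errors = {pair: n for pair, n in pair_counts.items() if pair[0] != pair[1]}

for (true_label, predicted), n in sorted(errors.items()):
    print(f"true={true_label:<9} predicted={predicted:<9} count={n}")
```

scikit-learn's `confusion_matrix`, which the notebook already imports, returns the same counts arranged as a matrix.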

15. Identify at least two documents:
- misclassified using BoW 
- correctly classified using TF-IDF

In [26]:
# Get test indices
test_indices = y_test.index.tolist()

# Find misclassifications with BoW but correct with TF-IDF (using Naive Bayes)
bow_wrong = (y_pred_nb_bow != y_test)
tfidf_correct = (y_pred_nb_tfidf == y_test_tfidf)

# Find indices where BoW failed but TF-IDF succeeded
improved_mask = bow_wrong & tfidf_correct
improved_indices_local = np.where(improved_mask)[0]

if len(improved_indices_local) >= 2:
    # Get actual dataframe indices
    improved_indices = [test_indices[i] for i in improved_indices_local[:2]]
    
    print("Documents misclassified by BoW but correctly classified by TF-IDF (Naive Bayes):")
    print("="*80)
    
    for idx in improved_indices:
        # Get position in test set
        test_pos = test_indices.index(idx)
        
        print(f"\nDocument Index: {idx}")
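        # idx is an index label taken from y_test.index, so use label-based .loc rather than positional .iloc
        print(f"Original Text: {df['text'].loc[idx]}")
        print(f"Preprocessed: {df['final_tokens'].loc[idx]}")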
        print(f"\nTrue Label: {y_test.iloc[test_pos]}")
        print(f"BoW Prediction (NB): {y_pred_nb_bow[test_pos]}")
        print(f"TF-IDF Prediction (NB): {y_pred_nb_tfidf[test_pos]}")
        
        # Show top weighted words
        bow_vec = X_test_bow[test_pos]
        tfidf_vec = X_test_tfidf[test_pos]
        
        # Top BoW words
        bow_top_indices = np.argsort(bow_vec)[-10:][::-1]
        print(f"\nTop BoW features:")
        for i in bow_top_indices:
            if bow_vec[i] > 0:
                print(f"  {vocab_lemma[i]}: {bow_vec[i]}")
        
        # Top TF-IDF words
        tfidf_top_indices = np.argsort(tfidf_vec)[-10:][::-1]
        print(f"\nTop TF-IDF features:")
        for i in tfidf_top_indices:
            if tfidf_vec[i] > 0:
                print(f"  {tfidf_vocab[i]}: {tfidf_vec[i]:.4f}")
        
        print("-"*80)
else:
    print(f"Found {len(improved_indices_local)} documents where TF-IDF (Naive Bayes) improved over BoW")
    
    # Show alternative comparison - documents where at least one improved
    print("\n" + "="*80)
    print("KEY OBSERVATION:")
    print("="*80)
    print("\nTF-IDF enhances probabilistic models like Naive Bayes by:")
    print("  • Weighting informative terms that are rare but meaningful")
    print("  • Reducing the influence of common words across all documents")
    print("  • Providing better probability estimates for class discrimination")
    print("\nBoW provides more stable features for linear classifiers like Logistic Regression")
    print("on small datasets because:")
    print("  • Raw counts preserve frequency information that may be important")
    print("  • No additional weighting that could introduce variance on limited data")
    print("  • Linear models can learn appropriate weights during training")

Documents misclassified by BoW but correctly classified by TF-IDF (Naive Bayes):

Document Index: 95
Original Text: was so excited to eat the wartermelon i bought the other day and it was terrible and not sweet
Preprocessed: excited eat wartermelon bought day terrible sweet

True Label: neutral
BoW Prediction (NB): positive
TF-IDF Prediction (NB): neutral

Top BoW features:
  terrible: 1.0
  day: 1.0
  bought: 1.0
  sweet: 1.0
  wartermelon: 1.0
  eat: 1.0
  excited: 1.0

Top TF-IDF features:
  sweet: 0.3994
  wartermelon: 0.3994
  terrible: 0.3994
  bought: 0.3994
  excited: 0.3994
  eat: 0.3665
  day: 0.2611
--------------------------------------------------------------------------------

Document Index: 85
Original Text:  lol man i got 2 1 /2 hrs an iont how i woulda made it wit out my ramen noodles and t.v. Time
Preprocessed: lol man got hr iont woulda made wit ramen noodle time

True Label: neutral
BoW Prediction (NB): negative
TF-IDF Prediction (NB): neutral

Top BoW features:
  

16. Analyze possible reasons based on word weighting.

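In document 95, BoW gives all seven terms the same count (1.0), so every term contributes equally regardless of how common it is in the corpus. TF-IDF instead weights terms by corpus-level rarity: 'day' drops to 0.2611 and 'eat' to 0.3665, while the rarer terms share 0.3994. This reweighting changes the Naive Bayes likelihoods and may explain why both example documents, whose true label is neutral, are classified correctly with TF-IDF but not with BoW.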

17. Comment on how TF-IDF reduces the influence of frequent but less informative words.

- TF-IDF multiplies term frequency (TF) by inverse document frequency (IDF).

- Words that appear in many documents receive a low IDF score, even if they occur frequently within a document.

- As a result, common words such as fillers, generic verbs, or widely used sentiment terms contribute less to the final feature vector.

- This allows the model to focus on rare but informative words that better distinguish documents.
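
The points above can be written out as a formula. The version below is the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default, where n is the number of documents and df(t) is the number of documents containing term t; each document vector is then scaled to unit L2 length.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot \mathrm{idf}(t)
```

With n = 100 as in this notebook, a term found in 1 document has an IDF of about 4.92, a term found in 10 documents about 3.22, a term found in 50 documents about 1.68, and a term found in every document exactly 1.00.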

18. Discuss situations where Bag-of-Words may still outperform TF-IDF.

1. Very Small Datasets
- IDF estimates are unreliable with limited data.
- BoW avoids over-penalizing words due to inaccurate document frequency statistics.

2. Tasks Where Frequency Itself Is Informative
- In sentiment analysis, repeated use of words like “good” or “bad” can strongly indicate sentiment.
- BoW preserves this repetition effect, while TF-IDF may down-weight it too much.

3. Linear Models with Limited Training Data
- Models like Logistic Regression may learn better with raw counts when feature space is small.
- TF-IDF can sometimes dilute discriminative signals in such cases.

4. Domain-Specific or Controlled Vocabulary
- When the vocabulary is already well-curated (e.g., medical codes, command logs), term frequency alone can be sufficient and more stable.

19. Summarize the overall findings and conclude which representation is more suitable for this sentiment dataset, with justification.

TF-IDF is the more suitable feature representation for this sentiment dataset, particularly when used with Naive Bayes, where it raises test accuracy from 0.35 (BoW) to 0.60 and correctly classifies neutral documents that BoW mislabels as positive or negative. With only 20 test documents, this conclusion is tentative.

- The dataset is small, sparse, and noisy, containing informal language and short texts.
- TF-IDF:
    - Suppresses globally frequent but weakly informative words
    - Highlights rare, context-specific terms
    - Produces better probabilistic estimates for Naive Bayes
    
- BoW performs reasonably well with Logistic Regression (0.55 accuracy), but its raw counts give common and rare words equal weight, which may hurt generalization in this setting.