# References:
- Research paper on feature engineering for sentiment analysis (Bag of Words, TF-IDF, word embedding, NLP based preprocessing): https://www.sciencedirect.com/science/article/pii/S1877050919306593

- Research paper on sentiment analysis techniques (SVMs, Logistic Regression, TF-IDF): https://www.irjmets.com/uploadedfiles/paper//issue_10_october_2023/45265/final/fin_irjmets1697386365.pdf

In [1]:
import zipfile
import pandas as pd


# Unzip the file
with zipfile.ZipFile('sentiments.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

In [2]:
# Load the CSV file into a DataFrame
df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1')

In [3]:
df.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [4]:
df.tail()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
1599998,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity...


This dataset has six columns (the target plus five descriptive fields):

1. target: Polarity of the tweet
    - 0 = Negative
    - 2 = Neutral
    - 4 = Positive

2. ids: The id of the tweet

3. date: The date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)

4. flag: The query (e.g. lyx). If there is no query, then this value is NO_QUERY

5. user: The user that tweeted

6. text: The text of the tweet (e.g. Lyx is cool)

In [5]:
# Label our features and target variable
df.columns = ['target', 'ids', 'date', 'flag', 'user', 'text']

# Double check that it worked
df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
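
The header row shown by `df.head()` before relabeling is itself a tweet: the CSV has no header line, so `pd.read_csv` with default settings uses the first record as column names, and that record is lost as data. A minimal sketch of loading with `header=None` and `names=` instead is shown below. The two rows are made-up stand-ins in the same six-column layout, not lines from the real file.

```python
import io
import pandas as pd

# Two made-up rows in the same six-column layout as the tweet file.
raw = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","first tweet"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","second tweet"\n'
)
columns = ['target', 'ids', 'date', 'flag', 'user', 'text']

# Default behaviour: the first row is consumed as the header.
default_df = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data; names= labels the columns at load time.
full_df = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(len(default_df), len(full_df))
```

Applied to the real file, the same two arguments would go into the existing `pd.read_csv(..., encoding='latin-1')` call, and the separate `df.columns = [...]` step would no longer be needed.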


# Data Cleaning

In [6]:
# Ensure there are no null values
df.isnull().sum()

target    0
ids       0
date      0
flag      0
user      0
text      0
dtype: int64

In [7]:
import re

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove user mentions
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

In [8]:
# Clean the text
df['text'] = df['text'].apply(clean_text)

In [9]:
# Verify that it worked
df['text']

0          is upset that he can't update his facebook by ...
1          i dived many times for the ball. managed to sa...
2             my whole body feels itchy and like its on fire
3          no, it's not behaving at all. i'm mad. why am ...
4                                         not the whole crew
                                 ...                        
1599994    just woke up. having no school is the best fee...
1599995    thewdb.com - very cool to hear old walt interv...
1599996    are you ready for your mojo makeover? ask me f...
1599997    happy 38th birthday to my boo of alll time!!! ...
1599998                                                happy
Name: text, Length: 1599999, dtype: object

# Data Analysis

In [10]:
# Check the target variable values
target_counts = df['target'].value_counts()
print("Distribution of sentiment labels:")
print(target_counts)

Distribution of sentiment labels:
target
4    800000
0    799999
Name: count, dtype: int64


Note: The two classes are almost perfectly balanced (800,000 positive vs. 799,999 negative), and no neutral (2) labels are present.

In [11]:
# Create our training dataframe with the text and labels, renaming
# target to sentiment. rename() returns a new DataFrame, which avoids
# pandas' SettingWithCopyWarning on the column slice.
train_df = df[['target', 'text']].rename(columns={'target': 'sentiment'})


In [12]:
# Change sentiment encodings as follows
#   0: negative
#   1: positive
# assign() builds a new DataFrame, so no SettingWithCopyWarning is raised.
train_df = train_df.assign(sentiment=train_df['sentiment'].map({0: 0, 4: 1}))


In [13]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   sentiment  1599999 non-null  int64 
 1   text       1599999 non-null  object
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


In [14]:
# Check for empty strings after cleaning
print(f"Empty texts: {(train_df['text'].str.strip() == '').sum()}")
print(f"Very short text: {(train_df['text'].str.len() < 3).sum()}")

Empty texts: 3039
Very short text: 3619


In [15]:
# Remove empty and very short strings (keep only texts longer than 3 characters)
train_df = train_df[train_df['text'].str.len() > 3]

train_df = train_df.reset_index(drop=True)

print(f"Empty texts: {(train_df['text'].str.strip() == '').sum()}")
print(f"Very short text: {(train_df['text'].str.len() <= 3).sum()}")

Empty texts: 0
Very short text: 0


In [16]:
print("Random positive samples")
print(train_df[train_df['sentiment'] == 1]['text'].sample(5).values)

print("\nRandom negative samples")
print(train_df[train_df['sentiment'] == 0]['text'].sample(5).values)

Random positive samples
['so we have a new fresh look! supportguy was fixing it during the nite. i think it was worth it'
 'hello twitters'
 "hee hee, i'm going to go to sleep now and think of all those fun things and smile throughout my sleep love your guts xo"
 "im so happy with life i feel like i haven't lost anything god, fuck yeah!!"
 'sparkpeople has a twitter group . . . you should check that out to find sparkpeeps!']

Random negative samples
['ahhww thats mean and horrible did the old man see you write that?'
 'needs to buy more csi cause he basically seen all of these like 50 times or more.'
 "11th june, how shit. i assume you finish tomoreee? jealous how come you aren't coming on friday btw? miss yooou!"
 'sooooo bored. broke a string practicing fml'
 'uughieess is taking forevvv my poor lido eye lids r gonna give out haha']


Note: Duplicates may cause data leakage if the same tweet appears in both the training and validation sets. This can make held-out performance look better than it really is, i.e. it leads to an underestimate of the true held-out error.

In [17]:
# Check for duplicate tweets (like retweets etc)
print(f"Duplicates before: {train_df.duplicated(subset=['text']).sum()}")
train_df = train_df.drop_duplicates(subset=['text'])
print(f"Duplicates after: {train_df.duplicated(subset=['text']).sum()}")

Duplicates before: 55535
Duplicates after: 0
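
One detail worth checking: `drop_duplicates(subset=['text'])` keeps the first occurrence by default, so if the same text appears with both labels, the surviving label depends on row order. Whether that happens in this dataset is not verified here. The sketch below uses a made-up toy frame to show one possible way to find such conflicts and drop them entirely before de-duplicating.

```python
import pandas as pd

# Toy data: 'good morning' appears with both labels, 'love it' is a clean duplicate.
toy = pd.DataFrame({
    'sentiment': [1, 0, 1, 1, 0],
    'text': ['good morning', 'good morning', 'love it', 'love it', 'so tired'],
})

# Number of distinct labels attached to each unique text.
labels_per_text = toy.groupby('text')['sentiment'].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

# Drop every copy of a conflicting text, then de-duplicate the rest.
deduped = (
    toy[~toy['text'].isin(conflicting)]
    .drop_duplicates(subset=['text'])
    .reset_index(drop=True)
)

print(conflicting)
print(deduped)
```

Dropping conflicting texts is only one option. Keeping the majority label would be another, and which is better depends on how noisy the labels are.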


In [18]:
from collections import Counter

all_words = ' '.join(train_df['text']).split()
word_freq = Counter(all_words)

print('Most common words overall:')
print(word_freq.most_common(20))

positive_words = ' '.join(train_df[train_df['sentiment']==1]['text']).split()
negative_words = ' '.join(train_df[train_df['sentiment']==0]['text']).split()

print('\nMost common positive words')
print(Counter(positive_words).most_common(10))

print('\nMost common negative words')
print(Counter(negative_words).most_common(10))

Most common words overall:
[('i', 735077), ('to', 553465), ('the', 511526), ('a', 374782), ('my', 308947), ('and', 293685), ('you', 228314), ('is', 227970), ('for', 210461), ('in', 208056), ('it', 189202), ('of', 181075), ('on', 158373), ('so', 143201), ('have', 141608), ('that', 126691), ("i'm", 126276), ('me', 125535), ('but', 123798), ('just', 123103)]

Most common positive words
[('i', 282435), ('the', 257515), ('to', 246219), ('a', 194226), ('you', 147104), ('and', 144370), ('my', 123030), ('for', 113147), ('is', 104030), ('in', 96629)]

Most common negative words
[('i', 452642), ('to', 307246), ('the', 254011), ('my', 185917), ('a', 180556), ('and', 149315), ('is', 123940), ('in', 111427), ('it', 100258), ('for', 97314)]
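
The lists above are dominated by function words such as 'i', 'to' and 'the' in both classes, so they say little about sentiment on their own. A minimal sketch of filtering them out before counting is shown below. The `STOP_WORDS` set is an illustrative assumption built from the top-20 list above, not a standard stop-word list, and the three texts are made up.

```python
from collections import Counter

# Illustrative stop list taken from the 20 most common words printed above.
STOP_WORDS = {
    'i', 'to', 'the', 'a', 'my', 'and', 'you', 'is', 'for', 'in',
    'it', 'of', 'on', 'so', 'have', 'that', "i'm", 'me', 'but', 'just',
}

# Made-up cleaned tweets standing in for train_df['text'].
texts = [
    "i love the sunshine",
    "i miss you so much",
    "the sunshine is just lovely",
]

content_words = [
    word
    for text in texts
    for word in text.split()
    if word not in STOP_WORDS
]
content_freq = Counter(content_words)

print(content_freq.most_common(3))
```

Running the same filter separately on the positive and negative subsets would make the per-class lists easier to compare.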


In [19]:
# Analyze text length (in characters) and word count
train_df['text_length'] = train_df['text'].str.len()
train_df['word_count'] = train_df['text'].str.split().str.len()

print(train_df[['text_length', 'word_count']].describe())

print(train_df.groupby('sentiment')[['text_length', 'word_count']].mean())

        text_length    word_count
count  1.539821e+06  1.539821e+06
mean   6.703064e+01  1.292963e+01
std    3.503266e+01  6.821159e+00
min    4.000000e+00  1.000000e+00
25%    3.800000e+01  7.000000e+00
50%    6.200000e+01  1.200000e+01
75%    9.500000e+01  1.800000e+01
max    3.600000e+02  6.400000e+01
           text_length  word_count
sentiment                         
0            68.741518   13.420274
1            65.294334   12.431695


# TF-IDF Fitting

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Split the data
X = train_df['text']
y = train_df['sentiment']

# Split with stratification to maintain class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Create the TF-IDF vectorizer
tfidf = TfidfVectorizer(
    min_df=5,            # Ignore terms that appear in fewer than 5 tweets
    max_features=10000,  # Keep only the 10,000 most frequent terms
    ngram_range=(1, 2)   # Use single words and pairs of adjacent words
)

# Fit on the training set only, then transform both sets
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)  # Don't fit to the test set

# Print shape of the transformed data
print(f"Training set shape: {X_train_tfidf.shape}")
print(f"Test set shape: {X_test_tfidf.shape}")

Training set shape: (1231856, 10000)
Test set shape: (307965, 10000)
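
For intuition about what the numbers in these matrices mean, here is a small hand computation of TF-IDF weights in plain Python. It follows what I understand to be scikit-learn's default settings (raw term counts, smoothed `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). The three documents are made up, and `min_df`, `max_features` and bigrams are left out to keep the arithmetic visible.

```python
import math
from collections import Counter

# Three made-up, already tokenized documents.
docs = [
    "good day".split(),
    "bad day".split(),
    "good good movie".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF weights of one document as a dict."""
    raw = {t: count * idf[t] for t, count in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

vectors = [tfidf_vector(doc) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second document, 'bad' gets a higher weight than 'day' because it occurs in fewer documents, which is the effect that lets rarer, more informative terms stand out.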


# Baseline Models

Establishing baseline performance using:
1. Logistic Regression - Simple but effective for text classification
2. Linear SVM - Generally performs well on high-dimensional sparse data like TF-IDF vectors
3. Naive Bayes - Designed for discrete count features, but in practice it also handles sparse data like TF-IDF vectors effectively
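
Since all three baselines consume the same TF-IDF features, one optional way to keep the vectorizer and a classifier together is a scikit-learn `Pipeline`, which refits the vectorizer only on whatever data is passed to `fit`. The sketch below is an illustration on made-up toy texts, not the setup used in this notebook. The texts, labels and variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data: 1 = positive, 0 = negative.
train_texts = [
    "what a great day", "love this so much", "really happy today",
    "this is awful", "so sad and tired", "i hate waiting",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The vectorizer is fitted inside the pipeline, on the training texts only.
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipe.fit(train_texts, train_labels)

# Raw strings go in; the pipeline vectorizes them before predicting.
preds = pipe.predict(["what a great day", "this is awful"])
print(preds)
```

Swapping `LogisticRegression` for `LinearSVC` or `MultinomialNB` in `make_pipeline` would give the other two baselines with the same preprocessing.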

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

# Initialize models
lr_model = LogisticRegression(max_iter=1000, random_state=42)
svm_model = LinearSVC(max_iter=1000, random_state=42)
nb_model = MultinomialNB()

# Train and evaluate Logistic Regression
print("Training Logistic Regression...")
lr_model.fit(X_train_tfidf, y_train)
lr_pred = lr_model.predict(X_test_tfidf)

print("\nLogistic Regression Results:")
print("Accuracy:", accuracy_score(y_test, lr_pred))
print("\nDetailed Classification Report:")
print(classification_report(y_test, lr_pred))

# Train and evaluate SVM
svm_model.fit(X_train_tfidf, y_train)
svm_pred = svm_model.predict(X_test_tfidf)

print("\nSVM Results:")
print("Accuracy:", accuracy_score(y_test, svm_pred))
print("\nDetailed Classification Report:")
print(classification_report(y_test, svm_pred))

# Train and evaluate Naive Bayes
print("\nTraining Naive Bayes...")
nb_model.fit(X_train_tfidf, y_train)
nb_pred = nb_model.predict(X_test_tfidf)

print("\nNaive Bayes Results:")
print("Accuracy: ", accuracy_score(y_test, nb_pred))
print("\nDetailed Classification Report:")
print(classification_report(y_test, nb_pred))

Training Logistic Regression...

Logistic Regression Results:
Accuracy: 0.8017501988862371

Detailed Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.79      0.80    155119
           1       0.79      0.81      0.80    152846

    accuracy                           0.80    307965
   macro avg       0.80      0.80      0.80    307965
weighted avg       0.80      0.80      0.80    307965


SVM Results:
Accuracy: 0.8019774974428913

Detailed Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.79      0.80    155119
           1       0.79      0.82      0.80    152846

    accuracy                           0.80    307965
   macro avg       0.80      0.80      0.80    307965
weighted avg       0.80      0.80      0.80    307965


Training Naive Bayes...

Naive Bayes Results:
Accuracy:  0.7790300845875343

Detailed Classification Report:
              precision    recall  f

In [22]:
# Show the most influential features for Logistic Regression
def print_top_features(vectorizer, model, n=10):
    feature_names = vectorizer.get_feature_names_out()
    coef = model.coef_[0]
    top_positive = np.argsort(coef)[-n:]  # indices of the n largest coefficients
    top_negative = np.argsort(coef)[:n]   # indices of the n smallest coefficients

    print("\nTop positive features:")
    for idx in reversed(top_positive):
        print(f"{feature_names[idx]}: {coef[idx]:.4f}")

    print("\nTop negative features:")
    for idx in top_negative:
        print(f"{feature_names[idx]}: {coef[idx]:.4f}")

print("\nMost influential features in Logistic Regression:")
print_top_features(tfidf, lr_model)


Most influential features in Logistic Regression:

Top positive features:
cant wait: 7.5503
not bad: 6.8374
smile: 5.9716
no problem: 5.8096
thanks: 5.7104
happy: 5.3644
can wait: 4.7939
congratulations: 4.6870
proud: 4.5489
smiling: 4.5387

Top negative features:
sad: -13.6319
sadly: -8.7481
poor: -8.6485
miss: -8.4607
unfortunately: -8.1018
disappointed: -7.5717
died: -7.5531
missing: -7.5036
not happy: -7.1715
sick: -7.1222
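
The positive feature `can wait` looks odd until tokenization is taken into account. `TfidfVectorizer`'s default token pattern keeps only runs of two or more word characters, so the apostrophe splits `can't` and the lone `t` is dropped. The sketch below reproduces that behaviour with the standard `re` module. The helper `unigrams_and_bigrams` is a hypothetical stand-in for the vectorizer's analyzer, not part of scikit-learn.

```python
import re

# Same regular expression as TfidfVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def unigrams_and_bigrams(text):
    """Lowercase, tokenize, and return unigrams followed by adjacent-word bigrams."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(unigrams_and_bigrams("can't wait"))     # ['can', 'wait', 'can wait']
print(unigrams_and_bigrams("cant wait"))      # ['cant', 'wait', 'cant wait']
print(unigrams_and_bigrams("i'm not happy"))  # ['not', 'happy', 'not happy']
```

So `can wait` most likely stands for "can't wait", and bigrams such as `not bad` and `not happy` are how this bag-of-words model picks up simple negation.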


In [None]:
# from transformers import (
#     DistilBertTokenizer,
#     DistilBertForSequenceClassification,
#     TrainingArguments,
#     Trainer
# )

# import torch
# from datasets import Dataset
# from sklearn.metrics import accuracy_score, classification_report
# import numpy as np

# # Load tokenizer and model
# tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# model = DistilBertForSequenceClassification.from_pretrained(
#     'distilbert-base-uncased', 
#     num_labels=2
# )

# # Prepare dataset
# def tokenize_data(examples):
#     return tokenizer(
#         examples['text'],
#         padding='max_length',
#         truncation=True,
#         max_length=128
#     )

# # Build the training set from the training split only, so that no test
# # tweets leak into fine-tuning; the label column is named 'labels'
# train_dataset = Dataset.from_pandas(
#     pd.DataFrame({'text': X_train, 'labels': y_train})
# )
# test_dataset = Dataset.from_pandas(
#     pd.DataFrame({'text': X_test, 'labels': y_test})
# )

# # Tokenize datasets
# train_encoded = train_dataset.map(tokenize_data, batched=True)
# test_encoded = test_dataset.map(tokenize_data, batched=True)

# # Set format to PyTorch tensors
# train_encoded.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
# test_encoded.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# # Add compute_metrics
# def compute_metrics(pred):
#     labels = pred.label_ids
#     preds = pred.predictions.argmax(-1)
#     acc = accuracy_score(labels, preds)
#     return {'accuracy': acc}

# # Update training arguments
# training_args = TrainingArguments(
#     output_dir="./distilbert_results",
#     num_train_epochs=3,
#     per_device_train_batch_size=16,
#     per_device_eval_batch_size=32,
#     warmup_steps=500,
#     weight_decay=0.01,
#     logging_dir='./logs',
#     logging_steps=500,  
#     eval_strategy="epoch",  # Evaluate at the end of every epoch
#     save_strategy="epoch",  # Save a checkpoint at the end of every epoch
# )

# # Create Trainer
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_encoded,
#     eval_dataset=test_encoded,
#     compute_metrics=compute_metrics
# )

# # Train the model
# print("Starting training...")
# trainer.train()

# # Evaluate
# eval_results = trainer.evaluate()
# print(f"\nEvaluation Results: {eval_results}")

# # Make predictions
# predictions = trainer.predict(test_encoded)
# preds = np.argmax(predictions.predictions, axis=1)

# # Print classification report
# print("\nDistilBERT Results:")
# print(classification_report(y_test, preds))

  from .autonotebook import tqdm as notebook_tqdm
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1539821/1539821 [11:02<00:00, 2323.31 examples/s]
Map: 100%|██████████| 307965/307965 [01:58<00:00, 2591.70 examples/s]


Starting training...




Epoch,Training Loss,Validation Loss
