<a href="https://colab.research.google.com/github/samObot19/sentiment-analysis/blob/main/ml_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 Sentiment Analysis on Twitter Data (Multi-Class Classification)

## 📝 Introduction

In this notebook, we aim to build a **multi-class sentiment analysis model** using labeled Twitter data. Unlike binary sentiment classification (positive vs. negative), our dataset includes four sentiment categories:

- **Positive**
- **Negative**
- **Neutral**
- **Irrelevant**

This makes the task more complex and realistic, reflecting the wide range of emotions and opinions people express on social media.

We will apply both **traditional machine learning** and **deep learning** techniques to classify tweets accurately based on their sentiment. The project involves the following steps:

- Preprocessing the raw tweet text (cleaning, tokenization, lemmatization)
- Converting text into numerical form using **TF-IDF** and **word embeddings**
- Training and evaluating models including:
  - **Logistic Regression** (machine learning)
  - **LSTM (Long Short-Term Memory)** network (deep learning)
- Comparing model performance using **precision**, **recall**, **F1-score**, and overall accuracy
- Performing **error analysis** to understand misclassifications and suggest improvements

This project not only demonstrates practical NLP skills, but also highlights the challenges and considerations when dealing with **noisy, real-world text data** from social platforms like Twitter.

---


## Traditional machine learning(Logistic Regression)

# Load the datasets

In [1]:
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/datasets/twitter_training.csv'

# Load dataset
df = pd.read_csv(file_path, header=None, names=["ID", "Topic", "Sentiment", "Tweet"])
df["Tweet"] = df["Tweet"].str.strip()

df.head()


Mounted at /content/drive


Unnamed: 0,ID,Topic,Sentiment,Tweet
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...


# Drop unnecessary columns (like ID and Topic)


In [2]:
df = df.drop(columns=["ID", "Topic"])
print(df.columns)


Index(['Sentiment', 'Tweet'], dtype='object')


# Text preprocessing

In [3]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = str(text)
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)
    text = re.sub(r'\@w+|\#', '', text)  # remove @mentions and hashtags
    text = re.sub(r"[^a-zA-Z]", " ", text)  # remove numbers/symbols
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words and len(w) > 2]
    return " ".join(tokens)

df['Cleaned_Tweet'] = df['Tweet'].apply(preprocess)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [4]:
df.head()

Unnamed: 0,Sentiment,Tweet,Cleaned_Tweet
0,Positive,im getting on borderlands and i will murder yo...,getting borderland murder
1,Positive,I am coming to the borders and I will kill you...,coming border kill
2,Positive,im getting on borderlands and i will kill you ...,getting borderland kill
3,Positive,im coming on borderlands and i will murder you...,coming borderland murder
4,Positive,im getting on borderlands 2 and i will murder ...,getting borderland murder


# Encode Sentiment Labels

In [5]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Label'] = le.fit_transform(df['Sentiment'])  # Positive = 1, Negative = 0, etc.
df.head(20)

Unnamed: 0,Sentiment,Tweet,Cleaned_Tweet,Label
0,Positive,im getting on borderlands and i will murder yo...,getting borderland murder,3
1,Positive,I am coming to the borders and I will kill you...,coming border kill,3
2,Positive,im getting on borderlands and i will kill you ...,getting borderland kill,3
3,Positive,im coming on borderlands and i will murder you...,coming borderland murder,3
4,Positive,im getting on borderlands 2 and i will murder ...,getting borderland murder,3
5,Positive,im getting into borderlands and i can murder y...,getting borderland murder,3
6,Positive,So I spent a few hours making something for fu...,spent hour making something fun know huge bord...,3
7,Positive,So I spent a couple of hours doing something f...,spent couple hour something fun know huge bord...,3
8,Positive,So I spent a few hours doing something for fun...,spent hour something fun know huge borderland ...,3
9,Positive,So I spent a few hours making something for fu...,spent hour making something fun know huge rhan...,3


# TF-IDF Vectorization

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['Cleaned_Tweet']).toarray()
y = df['Label']


# Split Data

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#  ML Model (Logistic Regression)

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler



# Set l1_ratio to a value between 0 and 1 when using elasticnet penalty
lr = LogisticRegression(max_iter=400)
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)

print("Logistic Regression:")
print(classification_report(y_test, y_pred_lr, target_names=le.classes_))


Logistic Regression:
              precision    recall  f1-score   support

  Irrelevant       0.67      0.52      0.58      2592
    Negative       0.73      0.77      0.75      4519
     Neutral       0.67      0.63      0.65      3596
    Positive       0.67      0.75      0.71      4230

    accuracy                           0.69     14937
   macro avg       0.68      0.67      0.67     14937
weighted avg       0.69      0.69      0.68     14937



## Deep learning techniques

# LSTM (Long Short-Term Memory) network (deep learning)

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Tokenization
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df['Cleaned_Tweet'])
X_seq = tokenizer.texts_to_sequences(df['Cleaned_Tweet'])
X_pad = pad_sequences(X_seq, maxlen=100)

# Label Encoding for multi-class classification
le = LabelEncoder()
y_encoded = le.fit_transform(y)  # Converts labels to integers (e.g., 0, 1, 2, ...)

# Train-test split
X_train_dl, X_test_dl, y_train_dl, y_test_dl = train_test_split(X_pad, y_encoded, test_size=0.2, random_state=42)

# Number of classes
num_classes = len(set(y_encoded))

# Build LSTM model
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=100))
model.add(LSTM(64, return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))  # Multi-class output

# Compile model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summary
model.summary()

# Train model
model.fit(X_train_dl, y_train_dl, epochs=5, batch_size=64, validation_data=(X_test_dl, y_test_dl))




Epoch 1/5
[1m934/934[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m148s[0m 156ms/step - accuracy: 0.5138 - loss: 1.1275 - val_accuracy: 0.6801 - val_loss: 0.7907
Epoch 2/5
[1m934/934[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m210s[0m 165ms/step - accuracy: 0.7294 - loss: 0.6964 - val_accuracy: 0.7321 - val_loss: 0.6901
Epoch 3/5
[1m934/934[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m200s[0m 162ms/step - accuracy: 0.7927 - loss: 0.5392 - val_accuracy: 0.7593 - val_loss: 0.6318
Epoch 4/5
[1m934/934[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m207s[0m 168ms/step - accuracy: 0.8219 - loss: 0.4655 - val_accuracy: 0.7710 - val_loss: 0.6151
Epoch 5/5
[1m934/934[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m196s[0m 162ms/step - accuracy: 0.8451 - loss: 0.4020 - val_accuracy: 0.7843 - val_loss: 0.6043


<keras.src.callbacks.history.History at 0x7d7762dca310>

In [14]:
from sklearn.metrics import classification_report
import numpy as np

y_pred_probs = model.predict(X_test_dl)


y_pred = np.argmax(y_pred_probs, axis=1)

sentiment_labels = df['Sentiment'].unique()


print("Classification Report:\n")
print(classification_report(y_test_dl, y_pred, target_names=sentiment_labels))


[1m467/467[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 23ms/step
Classification Report:

              precision    recall  f1-score   support

    Positive       0.76      0.71      0.73      2592
     Neutral       0.82      0.83      0.83      4519
    Negative       0.79      0.75      0.77      3596
  Irrelevant       0.75      0.81      0.78      4230

    accuracy                           0.78     14937
   macro avg       0.78      0.78      0.78     14937
weighted avg       0.78      0.78      0.78     14937

