<a href="https://colab.research.google.com/github/rishisg/ChatGPT/blob/main/RNN_Sentiment_Analysis_Twitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

📌 RNN for Sentiment Analysis

✅ Dataset: Twitter Sentiment Analysis
✅ Goal: Classify tweets as Positive, Negative, or Neutral
✅ Approach: Build RNN-based deep learning models using TensorFlow/Keras
✅ Evaluation: Accuracy, Precision, Recall, F1-score, Model Optimization

1️⃣ Setup & Import Libraries

💡 Explanation
✔ numpy & pandas → Handle numerical & text data
✔ tensorflow & keras → Build RNN models
✔ Tokenizer & pad_sequences → Convert text to numerical format for deep learning
✔ SimpleRNN, LSTM, GRU → RNN architectures for sentiment analysis

In [4]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, GRU, Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt


2️⃣ Load & Explore the Dataset

💡 Explanation
✔ Loads the Twitter training CSV file with explicit UTF-8 encoding
✔ Prints the shape, the first rows, and the column names to check the dataset structure
✔ Strips column names to remove extra spaces
✔ Verifies whether a 'sentiment' column exists and, if so, prints the class distribution to show how balanced the labels are
✔ Provides manual header renaming options if needed


In [5]:
# Import necessary library
import pandas as pd

# Load dataset (Ensuring correct encoding)
df = pd.read_csv("twitter_training.csv", encoding="utf-8")

# Check dataset structure
print("Dataset Shape:", df.shape)
print("First few rows:\n", df.head())
print("Column Names:", df.columns)

# Fix potential column name issues (removing extra spaces)
df.columns = df.columns.str.strip()

# Verify if 'sentiment' column exists
if 'sentiment' in df.columns:
    print("Sentiment Column Found!")
    print(df["sentiment"].value_counts())  # Check class distribution
else:
    print("Error: Column 'sentiment' not found!")
    print("Available Columns:", df.columns)

# Optional fix: if the first data row was read as the header, reload the
# file without a header row and name the columns in the order they appear.
# The names are placeholders; the file itself does not define any.
# df = pd.read_csv("twitter_training.csv", header=None,
#                  names=["id", "entity", "sentiment", "tweet"])
# print("Head after loading without header:\n", df.head())


Dataset Shape: (74681, 4)
First few rows:
    2401  Borderlands  Positive  \
0  2401  Borderlands  Positive   
1  2401  Borderlands  Positive   
2  2401  Borderlands  Positive   
3  2401  Borderlands  Positive   
4  2401  Borderlands  Positive   

  im getting on borderlands and i will murder you all ,  
0  I am coming to the borders and I will kill you...     
1  im getting on borderlands and i will kill you ...     
2  im coming on borderlands and i will murder you...     
3  im getting on borderlands 2 and i will murder ...     
4  im getting into borderlands and i can murder y...     
Column Names: Index(['2401', 'Borderlands', 'Positive',
       'im getting on borderlands and i will murder you all ,'],
      dtype='object')
Error: Column 'sentiment' not found!
Available Columns: Index(['2401', 'Borderlands', 'Positive',
       'im getting on borderlands and i will murder you all ,'],
      dtype='object')
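
The column names printed above are the values of the first tweet, which suggests the CSV has no header row and the default load consumed one data row as the header. Below is a minimal standard-library sketch of that effect on a tiny in-memory sample. The column names `id`, `entity`, `sentiment`, and `tweet` are assumptions, not names defined by the file. With pandas, the equivalent is to pass `header=None` and `names=[...]` to `pd.read_csv`.

```python
import csv
import io

# In-memory stand-in for a headerless CSV; rows mirror the output shown above.
SAMPLE = (
    '2401,Borderlands,Positive,"im getting on borderlands and i will murder you all ,"\n'
    '2401,Borderlands,Positive,"I am coming to the borders and I will kill you..."\n'
)

# Assumed column names; the file does not provide any.
COLUMNS = ["id", "entity", "sentiment", "tweet"]


def load_with_implicit_header(raw):
    """Treat the first row as a header, as a default CSV load does."""
    return list(csv.DictReader(io.StringIO(raw)))


def load_with_explicit_names(raw, names):
    """Supply the column names ourselves so every row is kept as data."""
    return list(csv.DictReader(io.StringIO(raw), fieldnames=names))


lost = load_with_implicit_header(SAMPLE)
kept = load_with_explicit_names(SAMPLE, COLUMNS)

print(len(lost), "data row(s) when the first row is consumed as a header")
print(len(kept), "data row(s) when names are supplied explicitly")
print("First label:", kept[0]["sentiment"])
```

Naming the columns up front also removes the need to rename a column called "Positive" later on.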


3️⃣ Data Preprocessing & Feature Engineering

We need to convert text into tokens and encode labels as numerical values.

💡 Explanation
✔ Maps sentiment labels → Converts "Positive", "Neutral", "Negative" into numbers (2, 1, 0).
✔ Tokenizes text → Transforms words into integer indices the model can consume.
✔ Pads sequences → Standardizes input length so tweets can be batched.
✔ Prepares features (X) & labels (y) → Ensures data is ready for training.
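
To make the tokenize-and-pad step concrete, here is a simplified plain-Python stand-in for what `Tokenizer` and `pad_sequences` do. It is only a sketch: the helper names are hypothetical, and the real Keras classes differ in details such as punctuation filtering and how ties between equally frequent words are broken.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
    return {word: i for i, word in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace known words by their index; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]


def pad_post(sequences, maxlen):
    """Right-pad with zeros; assumes maxlen is at least the longest sequence."""
    return [seq + [0] * (maxlen - len(seq)) for seq in sequences]


tweets = ["i love this game", "i hate this"]
word_index = build_word_index(tweets)
sequences = texts_to_sequences(tweets, word_index)
padded = pad_post(sequences, maxlen=max(len(s) for s in sequences))

print(word_index)
print(padded)
```

Every row of `padded` has the same length, which is what lets the tweets be stacked into one array `X`.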

In [6]:
# Import necessary libraries
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load dataset safely
try:
    df = pd.read_csv("twitter_training.csv", encoding="utf-8")
except FileNotFoundError as err:
    # Raise instead of calling exit(), which can shut down the notebook kernel
    raise FileNotFoundError(
        "File 'twitter_training.csv' not found. Ensure the file path is correct."
    ) from err

# Display dataset structure
print("\nDataset Shape:", df.shape)
print("\nFirst few rows:\n", df.head())
print("\nColumn Names:", df.columns)

# Fix potential column name issues (removing extra spaces)
df.columns = df.columns.str.strip()

# Identify correct sentiment column
for col in df.columns:
    print(f"Unique values in column '{col}':")
    print(df[col].unique())  # Print unique values to find sentiment data

# Rename the correct column containing sentiment values
df.rename(columns={"Positive": "sentiment"}, inplace=True)  # Adjust based on actual data

# Ensure sentiment column now exists
if 'sentiment' not in df.columns:
    # Stop this cell with a clear error instead of exit(), which can shut down the kernel
    raise KeyError(f"Sentiment column not found. Available columns: {list(df.columns)}")

# Handle missing sentiment values
df = df.dropna(subset=["sentiment"])  # Remove missing values

# Verify unique sentiment values before encoding
print("\nSentiment values before encoding:", df["sentiment"].unique())

# Encode sentiment labels (labels missing from the mapping become NaN)
sentiment_mapping = {"Positive": 2, "Neutral": 1, "Negative": 0}
df["sentiment"] = df["sentiment"].map(sentiment_mapping)

# Drop rows where encoding failed, then restore integer class ids
# (map() yields floats as soon as any NaN is present)
df = df.dropna(subset=["sentiment"]).astype({"sentiment": int})

# Recheck sentiment value counts after encoding
print("\nSentiment Value Counts:\n", df["sentiment"].value_counts())

# Identify the correct column containing tweet text (rename if necessary)
text_column = "im getting on borderlands and i will murder you all ,"  # Adjust if needed

# Ensure all text values are strings (fixing float issue)
df[text_column] = df[text_column].astype(str)

# Tokenize tweet text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df[text_column])

# Convert tweets to sequences
sequences = tokenizer.texts_to_sequences(df[text_column])

# Pad sequences for uniform length
max_length = max([len(seq) for seq in sequences])
X = pad_sequences(sequences, maxlen=max_length, padding="post")

# Prepare labels (target variable)
y = df["sentiment"].values

# Final check
print("\nX shape:", X.shape)
print("y shape:", y.shape)
print("\nData preprocessing completed successfully!")



Dataset Shape: (74681, 4)

First few rows:
    2401  Borderlands  Positive  \
0  2401  Borderlands  Positive   
1  2401  Borderlands  Positive   
2  2401  Borderlands  Positive   
3  2401  Borderlands  Positive   
4  2401  Borderlands  Positive   

  im getting on borderlands and i will murder you all ,  
0  I am coming to the borders and I will kill you...     
1  im getting on borderlands and i will kill you ...     
2  im coming on borderlands and i will murder you...     
3  im getting on borderlands 2 and i will murder ...     
4  im getting into borderlands and i can murder y...     

Column Names: Index(['2401', 'Borderlands', 'Positive',
       'im getting on borderlands and i will murder you all ,'],
      dtype='object')
Unique values in column '2401':
[2401 2402 2403 ... 9198 9199 9200]
Unique values in column 'Borderlands':
['Borderlands' 'CallOfDutyBlackopsColdWar' 'Amazon' 'Overwatch'
 'Xbox(Xseries)' 'NBA2K' 'Dota2' 'PlayStation5(PS5)' 'WorldOfCraft'
 'CS-GO' 'Google' '

4️⃣ Split Dataset into Training & Testing Sets

💡 Explanation
✔ Splits dataset (80% training, 20% testing)
✔ Shuffles the rows before splitting; random_state=42 makes the split reproducible

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
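
`train_test_split` shuffles by default, but it only preserves class proportions when its `stratify` argument is given (for example `stratify=y`). The plain-Python sketch below illustrates what a stratified 80/20 split does. The function name and the toy labels are hypothetical, and this is not how scikit-learn implements it internally.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with similar class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # randomize within each class
        n_test = round(len(indices) * test_size)   # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 50 negative (0), 30 neutral (1), 20 positive (2)
labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(labels)

print("train:", len(train_idx), "test:", len(test_idx))
print("test class counts:", Counter(labels[i] for i in test_idx))
```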


5️⃣ Define RNN Model Architecture

We'll start with a simple RNN model.

💡 Explanation
✔ Embedding layer → Converts words into dense vector representations
✔ SimpleRNN layer → Processes text sequences
✔ Dropout → Helps prevent overfitting
✔ Dense output layer (softmax) → Multi-class classification


In [8]:
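model_rnn = Sequential([
    # Map each word index to a 128-dimensional vector
    # (input_length is deprecated in Keras 3 and can be omitted there)
    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=128, input_length=max_length),
    # 100 recurrent units; return only the final hidden state
    SimpleRNN(100, return_sequences=False),
    # Randomly zero 30% of activations during training
    Dropout(0.3),
    Dense(3, activation="softmax")  # 3 classes (Positive, Neutral, Negative)
])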

# Compile the model
model_rnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model_rnn.summary()
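
As a rough cross-check for `model_rnn.summary()`, the trainable parameter counts of these layers can be worked out by hand. The sketch below uses the standard formulas for embedding, simple recurrent, LSTM, and dense layers. `VOCAB_SIZE` is a hypothetical placeholder, since the real value depends on `len(tokenizer.word_index) + 1`.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim


def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + bias for each unit."""
    return units * (input_dim + units + 1)


def lstm_params(input_dim, units):
    """Four gate/candidate blocks, each shaped like a simple recurrent cell."""
    return 4 * simple_rnn_params(input_dim, units)


def dense_params(input_dim, units):
    """Weights + bias for each output unit."""
    return units * (input_dim + 1)


EMBED_DIM, UNITS, CLASSES = 128, 100, 3
VOCAB_SIZE = 30_000  # hypothetical; use len(tokenizer.word_index) + 1 in the notebook

print("Embedding:", embedding_params(VOCAB_SIZE, EMBED_DIM))
print("SimpleRNN:", simple_rnn_params(EMBED_DIM, UNITS))
print("LSTM (for comparison):", lstm_params(EMBED_DIM, UNITS))
print("Dense:", dense_params(UNITS, CLASSES))
```

With a vocabulary of this size the embedding matrix dominates the parameter count, and an LSTM layer of the same width has four times the recurrent parameters of the SimpleRNN.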




6️⃣ Train the RNN Model

💡 Explanation
✔ Trains model for 10 epochs
✔ Uses batch size of 32
✔ Validation split (20%) holds out part of the training data to monitor generalization

In [9]:
history_rnn = model_rnn.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)


Epoch 1/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 165s 131ms/step - accuracy: 0.3425 - loss: 1.1557 - val_accuracy: 0.3641 - val_loss: 1.0982
Epoch 2/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 169s 105ms/step - accuracy: 0.3419 - loss: 1.1094 - val_accuracy: 0.3646 - val_loss: 1.0948
Epoch 3/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 131s 106ms/step - accuracy: 0.3529 - loss: 1.1001 - val_accuracy: 0.3353 - val_loss: 1.1026
Epoch 4/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 143s 107ms/step - accuracy: 0.3564 - loss: 1.1000 - val_accuracy: 0.3637 - val_loss: 1.0941
Epoch 5/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 142s 108ms/step - accuracy: 0.3527 - loss: 1.0998 - val_accuracy: 0.3298 - val_loss: 1.1042
Epoch 6/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 139s 105ms/step - accuracy: 0.3535 - loss: 1.1021 - val_accuracy: 0.3750 - val_loss:
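
A useful reference point for reading this log: a model that assigns equal probability to all three classes has a cross-entropy loss of ln 3 ≈ 1.0986. The short sketch below compares the validation losses logged above with that baseline. The helper name is hypothetical, and the values are copied from epochs 1 to 5 of the log.

```python
import math


def cross_entropy(probabilities, true_class):
    """Sparse categorical cross-entropy for a single example."""
    return -math.log(probabilities[true_class])


# Loss of a uniform guess over three classes (same for any true class)
uniform = [1 / 3, 1 / 3, 1 / 3]
baseline = cross_entropy(uniform, true_class=0)

# val_loss values for epochs 1-5, copied from the training log above
logged_val_losses = [1.0982, 1.0948, 1.1026, 1.0941, 1.1042]

print(f"uniform-guess baseline: {baseline:.4f}")
for epoch, loss in enumerate(logged_val_losses, start=1):
    print(f"epoch {epoch}: val_loss={loss:.4f}, gap to baseline={loss - baseline:+.4f}")
```

Losses that stay this close to the baseline, together with accuracy near one in three, indicate the network has not yet learned anything beyond guessing.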

7️⃣ Evaluate the RNN Model

💡 Explanation
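✔ Converts model predictions to sentiment labels
✔ Evaluates accuracy, precision, recall & F1-score
✔ Uses weighted averages, so each class counts in proportion to its share of the test set
✔ For three classes, scores near 0.33 are roughly what random guessing would give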

In [10]:
y_pred = model_rnn.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)

accuracy = accuracy_score(y_test, y_pred_classes)
precision = precision_score(y_test, y_pred_classes, average="weighted")
recall = recall_score(y_test, y_pred_classes, average="weighted")
f1 = f1_score(y_test, y_pred_classes, average="weighted")

print(f"Test Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")


[1m386/386[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 26ms/step
Test Accuracy: 0.36
Precision: 0.24
Recall: 0.36
F1 Score: 0.28


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
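
Two details of this output can be reproduced by hand. Weighted recall always equals accuracy, which is why both print 0.36. And the `_warn_prf` fragment comes from scikit-learn's warning for an undefined metric, which typically appears when at least one class is never predicted, in which case its precision is counted as 0. The sketch below is a plain-Python illustration on made-up labels, not scikit-learn's implementation.

```python
def weighted_metrics(y_true, y_pred, classes):
    """Return (accuracy, weighted precision, weighted recall)."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        support = sum(t == c for t in y_true)      # true examples of class c
        predicted = sum(p == c for p in y_pred)    # predictions of class c
        weight = support / n
        # Precision is undefined for a class that is never predicted; count it as 0
        precision += weight * (tp / predicted if predicted else 0.0)
        recall += weight * (tp / support if support else 0.0)
    return accuracy, precision, recall


# Made-up example: a model that always predicts class 0
y_true = [0] * 4 + [1] * 3 + [2] * 3
y_pred = [0] * 10

accuracy, precision, recall = weighted_metrics(y_true, y_pred, classes=[0, 1, 2])
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

A per-class breakdown (for example a confusion matrix) would show directly which classes the model is ignoring.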


8️⃣ Hyperparameter Tuning & Optimization

We can experiment with different architectures.

💡 Explanation
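✔ Replaces SimpleRNN with LSTM, whose gates help retain information over longer sequences
✔ Keeps the embedding, dropout, and output layers unchanged
✔ Reports test accuracy for comparison with the SimpleRNN model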

In [11]:
# Optimize using LSTM instead of SimpleRNN
model_lstm = Sequential([
    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=128, input_length=max_length),
    LSTM(100, return_sequences=False),
    Dropout(0.3),
    Dense(3, activation="softmax")
])

model_lstm.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

history_lstm = model_lstm.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

loss_lstm, accuracy_lstm = model_lstm.evaluate(X_test, y_test)
print(f"LSTM Model Accuracy: {accuracy_lstm:.2f}")


Epoch 1/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 304s 243ms/step - accuracy: 0.3522 - loss: 1.0972 - val_accuracy: 0.3706 - val_loss: 1.0951
Epoch 2/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 312s 253ms/step - accuracy: 0.3631 - loss: 1.0959 - val_accuracy: 0.3706 - val_loss: 1.0950
Epoch 3/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 336s 265ms/step - accuracy: 0.3607 - loss: 1.0956 - val_accuracy: 0.3706 - val_loss: 1.0944
Epoch 4/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 363s 249ms/step - accuracy: 0.3647 - loss: 1.0949 - val_accuracy: 0.3706 - val_loss: 1.0952
Epoch 5/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 308s 249ms/step - accuracy: 0.3646 - loss: 1.0954 - val_accuracy: 0.3706 - val_loss: 1.0944
Epoch 6/10
1234/1234 ━━━━━━━━━━━━━━━━━━━━ 322s 249ms/step - accuracy: 0.3656 - loss: 1.0948 - val_accuracy: 0.3706 - val_loss: 1.0944
Epo
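
Validation accuracy stays at exactly 0.3706 in every epoch, so the LSTM is not learning either. One plausible contributor, which the notebook does not confirm, is the combination of `padding="post"`, a `max_length` set to the longest tweet, and no masking: the recurrent layer reads the real words first and then a long run of padding, which can wash the tweet out of the final hidden state. The toy one-unit RNN below illustrates the effect. The weights and inputs are made up, and in the real model the padding index passes through the embedding layer rather than being an exact zero.

```python
import math


def final_state(inputs, w_in=1.0, w_rec=0.5):
    """Run a one-unit tanh RNN over the inputs and return the last hidden state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
    return h


signal = [1.0, -0.5, 0.8]   # stand-in for the embedded words of a short tweet
padding = [0.0] * 50        # stand-in for padding up to a long max_length

unpadded = final_state(signal)
post_padded = final_state(signal + padding)   # words first, then padding
pre_padded = final_state(padding + signal)    # padding first, then words

print(f"no padding:   {unpadded:.6f}")
print(f"post-padding: {post_padded:.6f}")
print(f"pre-padding:  {pre_padded:.6f}")
```

Options worth trying in Keras, each an experiment rather than a guaranteed fix: set `mask_zero=True` on the `Embedding` layer, use `padding="pre"` in `pad_sequences`, or cap `maxlen` at a value closer to the typical tweet length.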

9️⃣ Predict Sentiment for New Tweets

Let's predict sentiment for an input tweet.

💡 Explanation
✔ Takes a new tweet and predicts its sentiment
✔ Uses trained LSTM model for classification
✔ Maps numeric predictions back to sentiment labels

In [12]:
def predict_sentiment(model, tokenizer, text, max_length):
    sequence = tokenizer.texts_to_sequences([text])
    sequence = pad_sequences(sequence, maxlen=max_length, padding="post")
    predicted_index = np.argmax(model.predict(sequence))
    sentiment_label = {0: "Negative", 1: "Neutral", 2: "Positive"}
    return sentiment_label[predicted_index]

input_text = "I love this product!"
print("Predicted Sentiment:", predict_sentiment(model_lstm, tokenizer, input_text, max_length))


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 415ms/step
Predicted Sentiment: Negative
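
Taking the `argmax` alone hides how confident the model is, and with accuracy near chance the three probabilities are likely to be close together. A small sketch of a decoding step that also reports the winning probability is shown below. The function name, the `min_confidence` threshold, and the example probabilities are all hypothetical; in the notebook the input would be one row of `model.predict(sequence)`.

```python
SENTIMENT_LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}


def decode_prediction(probabilities, min_confidence=0.5):
    """Return (label, confidence); the label is 'Uncertain' below min_confidence."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = probabilities[best]
    label = SENTIMENT_LABELS[best] if confidence >= min_confidence else "Uncertain"
    return label, confidence


# Made-up softmax outputs: one near-uniform, one confident
near_uniform = decode_prediction([0.35, 0.33, 0.32])
confident = decode_prediction([0.05, 0.10, 0.85])

print(near_uniform)
print(confident)
```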
