# Real-Time Sentiment Analysis Project (Improved)

This notebook walks through building a sentiment analysis model end to end: data loading, cleaning, feature extraction, training, and saving. **Improvements in this version: bigrams (`ngram_range=(1, 2)`) and negation handling.**

## 1. Environment Setup

In [5]:
import pandas as pd
import numpy as np
import re
import nltk
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Download NLTK data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')


## 2. Data Loading

In [6]:
col_names = ['target', 'ids', 'date', 'flag', 'user', 'text']
df = pd.read_csv('./../datasets/training.1600000.processed.noemoticon.csv', 
                 encoding='ISO-8859-1', 
                 header=None, 
                 names=col_names)

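# Keep only the label and the tweet text; .copy() avoids a possible SettingWithCopyWarning below
df = df[['target', 'text']].copy()
# Map the numeric labels (0 and 4) to readable class names
df['sentiment'] = df['target'].map({0: 'negative', 4: 'positive'})
df = df.drop('target', axis=1)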

## 3. Improved Data Cleaning (Negation Handling)

We refine the stopword list to **keep negation words** like 'not', 'no', and 'never'. This prevents 'not happy' from being cleaned into just 'happy'.

In [7]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# --- IMPROVEMENT: Refined Stopwords ---
# Built once here rather than inside clean_text(), which runs for every row.
# Note: punctuation is stripped before tokenizing, so "don't" reaches the filter
# as "dont" and the apostrophe entries below never match a token.
negation_words = {'not', 'no', 'never', 'neither', 'nor', 'none', "n't", 'cannot',
                  "don't", "doesn't", "didn't", "won't", "shouldn't", "couldn't",
                  "wasn't", "weren't", "isn't", "aren't"}
final_stopwords = set(stopwords.words('english')) - negation_words
stemmer = PorterStemmer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # URLs (http\S+ also covers https)
    text = re.sub(r'@\w+', '', text)            # @mentions
    text = re.sub(r'#(\w+)', r'\1', text)       # keep the hashtag word, drop '#'
    text = re.sub(r'[^\w\s]', '', text)         # punctuation
    text = re.sub(r'\d+', '', text)             # digits
    text = re.sub(r'\s+', ' ', text).strip()    # collapse whitespace

    tokens = nltk.word_tokenize(text)
    # Negation words are no longer in final_stopwords, so they survive this filter
    tokens = [stemmer.stem(word) for word in tokens if word not in final_stopwords]

    return ' '.join(tokens)

print("Cleaning text data with negation handling...")
df['cleaned_text'] = df['text'].apply(clean_text)

df.to_csv('./../datasets/cleaned_sentiment_data.csv', index=False)
print("Cleaned data saved to CSV.")

Cleaning text data with negation handling...
Cleaned data saved to CSV.
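
To make the effect of negation handling concrete, here is a minimal, dependency-free sketch of the filtering step. The `TOY_STOPWORDS` set and the `filter_tokens` helper are assumptions made up for illustration (they are not NLTK's list or part of this notebook), tokenization is a plain `split()`, and stemming is left out.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption for illustration).
TOY_STOPWORDS = {"i", "am", "the", "is", "a", "not", "no", "never"}
NEGATIONS = {"not", "no", "never"}

def filter_tokens(text, keep_negations):
    """Lowercase, strip punctuation, and drop stopwords from a whitespace-split text."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    stop = TOY_STOPWORDS - NEGATIONS if keep_negations else TOY_STOPWORDS
    return [tok for tok in text.split() if tok not in stop]

plain = filter_tokens("I am not happy!", keep_negations=False)
negation_aware = filter_tokens("I am not happy!", keep_negations=True)
contraction = filter_tokens("I don't like it", keep_negations=True)

print(plain)           # ['happy']
print(negation_aware)  # ['not', 'happy']
print(contraction)     # ['dont', 'like', 'it']
```

The last line shows that stripping punctuation turns "don't" into "dont", so the token reaches the stopword filter without its apostrophe.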


## 4. Improved Feature Extraction (Bigrams)

We update the `TfidfVectorizer` to extract **bigrams** in addition to single words (`ngram_range=(1, 2)`). This lets the model treat a phrase like 'not good' as one feature instead of two unrelated words. `max_features=20000` caps the vocabulary size.

In [8]:
# --- IMPROVEMENT: Bigrams (1, 2) ---
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))

# cleaned_df = pd.read_csv("./../datasets/cleaned_sentiment_data.csv")

X = df['cleaned_text']
y = df['sentiment']
X_tfidf = vectorizer.fit_transform(X)

print(f"Feature matrix shape: {X_tfidf.shape}")

Feature matrix shape: (1600000, 20000)
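
As a rough illustration of what a word-level `ngram_range=(1, 2)` adds, the sketch below lists the unigrams and bigrams of a short token sequence in plain Python. `word_ngrams` is a hypothetical helper written for this example; it does not reproduce scikit-learn's tokenization, TF-IDF weighting, or `max_features` pruning.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return every n-gram of the token list for n in the inclusive ngram_range."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "film not good".split()
unigrams_only = word_ngrams(tokens, ngram_range=(1, 1))
with_bigrams = word_ngrams(tokens, ngram_range=(1, 2))

print(unigrams_only)  # ['film', 'not', 'good']
print(with_bigrams)   # ['film', 'not', 'good', 'film not', 'not good']
```

With unigrams only, 'not' and 'good' are scored independently; the bigram 'not good' gives the classifier a feature it can weight as negative on its own.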


## 5. Model Training

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

print("Training Logistic Regression model...")
log_reg_model = LogisticRegression(random_state=42, max_iter=1000)
log_reg_model.fit(X_train, y_train)

y_pred = log_reg_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print(classification_report(y_test, y_pred))

Training Logistic Regression model...
Accuracy: 79.82%
              precision    recall  f1-score   support

    negative       0.81      0.78      0.79    159494
    positive       0.79      0.82      0.80    160506

    accuracy                           0.80    320000
   macro avg       0.80      0.80      0.80    320000
weighted avg       0.80      0.80      0.80    320000
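
One caveat about the evaluation: the cells above call `fit_transform` on the full dataset before `train_test_split`, so the vocabulary and IDF weights are computed with the test tweets included. A minimal sketch of one way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The toy `texts` and `labels` below are assumed stand-ins for `df['cleaned_text']` and `df['sentiment']`, and the effect on the reported accuracy has not been measured here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for the cleaned tweets and their labels (assumed data, not the real corpus).
texts = [
    "not good", "realli bad day", "hate thi", "never again",
    "love thi", "great day", "so happi", "realli good",
]
labels = ["negative"] * 4 + ["positive"] * 4

# Split the raw text first so the test rows stay unseen by the vectorizer.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, random_state=42),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_
print(len(predictions), len(vocabulary))
```

A pipeline also keeps the vectorizer and the classifier in a single object, which can simplify saving and loading them together.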



## 6. Saving the Improved Model

In [10]:
import os
os.makedirs('./../models', exist_ok=True)
joblib.dump(log_reg_model, './../models/sentiment_model_v3.pkl')
joblib.dump(vectorizer, './../models/vectorizer_v3.pkl')
print("Improved model and vectorizer saved.")

Improved model and vectorizer saved.
