# Real-Time Sentiment Analysis Project

This notebook demonstrates the end-to-end process of building a sentiment analysis model. We will cover data loading, exploratory data analysis, text pre-processing using NLP techniques, model training, and evaluation.

## 1. Environment Setup

First, we import the necessary libraries and download the required NLTK data for text processing.

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Download NLTK data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

## 2. Data Loading and Initial Exploration

We load the Sentiment140 dataset and perform an initial inspection of the data structure.

In [None]:
# Load the dataset
# Note: The dataset doesn't have headers, so we define them manually
col_names = ['target', 'ids', 'date', 'flag', 'user', 'text']
df = pd.read_csv('./../datasets/training.1600000.processed.noemoticon.csv', 
                 encoding='ISO-8859-1', 
                 header=None, 
                 names=col_names)

# Display the first few rows
print("First 5 rows of the dataset:")
print(df.head())

# Display dataset information
print("\nDataset Info:")
df.info()  # info() prints its summary directly and returns None, so no print() is needed

### 2.1 Data Selection and Label Mapping

We select only the relevant columns (`target` and `text`) and map the numerical target values to human-readable labels: `0` for negative and `4` for positive.

In [None]:
# Select relevant columns; .copy() avoids a SettingWithCopyWarning when a column is added below
df = df[['target', 'text']].copy()

# Map target values to sentiment labels
df['sentiment'] = df['target'].map({0: 'negative', 4: 'positive'})
df = df.drop('target', axis=1)

# Check the distribution of sentiments
print("Sentiment Distribution:")
print(df['sentiment'].value_counts())

print("\nUpdated DataFrame:")
print(df.head())

## 3. Data Cleaning and Pre-processing (NLP)

Raw tweets are noisy. We define a cleaning function that lowercases the text, removes URLs and mentions, strips the `#` from hashtags while keeping the word, drops punctuation and digits, and then applies tokenization, stop word removal, and stemming.

In [None]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Build these once; recreating them for every tweet is slow on 1.6M rows
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def clean_text(text):
    # 1. Convert to lowercase
    text = text.lower()

    # 2. Remove URLs and mentions; keep hashtag words but drop the '#'
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#(\w+)', r'\1', text)

    # 3. Remove punctuation and numbers
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)

    # 4. Collapse repeated whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()

    # 5. Tokenization, stop word removal, and stemming
    tokens = nltk.word_tokenize(text)
    tokens = [STEMMER.stem(word) for word in tokens if word not in STOP_WORDS]

    return ' '.join(tokens)

# Apply the cleaning function (This may take some time)
print("Cleaning text data...")
df['cleaned_text'] = df['text'].apply(clean_text)

# Save the cleaned data for future use
df[['sentiment', 'cleaned_text']].to_csv('cleaned_tweets.csv', index=False)
print("Cleaned data saved to 'cleaned_tweets.csv'")
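
To see what the regex steps do before NLTK gets involved, here is a minimal sketch that reproduces only the noise-removal part of the cleaning function on a single made-up tweet. The helper name `strip_noise` and the sample text are hypothetical and exist only for this illustration.

```python
import re

def strip_noise(text):
    """Regex-only portion of the cleaning step (no NLTK needed)."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs
    text = re.sub(r'@\w+', '', text)             # drop @mentions
    text = re.sub(r'#(\w+)', r'\1', text)        # keep the hashtag word, drop '#'
    text = re.sub(r'[^\w\s]', '', text)          # drop punctuation
    text = re.sub(r'\d+', '', text)              # drop digits
    return re.sub(r'\s+', ' ', text).strip()     # collapse and trim whitespace

sample = "@bob I LOVED the #Concert!!! 10/10 http://t.co/abc123"
cleaned = strip_noise(sample)
print(cleaned)  # i loved the concert
```

Because punctuation is removed before tokenization, contractions lose their apostrophe (`don't` becomes `dont`). It is worth checking whether the stop word list still matches such tokens after this step.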

## 4. Feature Extraction (Vectorization)

We use TF-IDF (Term Frequency-Inverse Document Frequency) to convert the cleaned text into numerical features for our machine learning models.

In [None]:
# Initialize the TF-IDF Vectorizer
# max_features=3000 keeps only the 3,000 most frequent terms in the corpus, which bounds the matrix size
vectorizer = TfidfVectorizer(max_features=3000)

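# Fit and transform the cleaned text
# Note: fitting on the full dataset exposes the vocabulary and IDF weights to the test rows;
# for a stricter evaluation, fit on the training split only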
X = df['cleaned_text']
y = df['sentiment']
X_tfidf = vectorizer.fit_transform(X)

print(f"Feature matrix shape: {X_tfidf.shape}")
print("Sample features learned:")
print(vectorizer.get_feature_names_out()[:50])
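
The weighting itself can be worked through by hand. The sketch below computes TF-IDF in plain Python on three toy documents, assuming the scikit-learn defaults as documented: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The documents and the names `df_counts` and `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good day"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df_counts = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for doc, row in zip(docs, matrix):
    print(f"{doc!r}: {[round(v, 3) for v in row]}")
```

Terms that appear in fewer documents (`day`, `bad`) receive a larger IDF than terms shared across documents (`good`, `movie`), which is why rare but distinctive words carry more weight in the feature matrix.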

## 5. Model Training and Evaluation

We split the data into training and testing sets, then train and compare two models: Logistic Regression (Baseline) and Random Forest (Advanced).

In [None]:
# Split the data (80% training, 20% testing); stratify keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42, stratify=y)

# --- Model A: Logistic Regression ---
print("Training Logistic Regression model...")
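# max_iter is raised from the default of 100 to reduce the chance of a ConvergenceWarning
log_reg_model = LogisticRegression(max_iter=1000, random_state=42)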
log_reg_model.fit(X_train, y_train)

# Evaluate Logistic Regression
y_pred_log_reg = log_reg_model.predict(X_test)
print(f"\nLogistic Regression Accuracy: {accuracy_score(y_test, y_pred_log_reg) * 100:.2f}%")
print("Classification Report:")
print(classification_report(y_test, y_pred_log_reg))

# --- Model B: Random Forest ---
print("\nTraining Random Forest model... (This may take several minutes)")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Evaluate Random Forest
y_pred_rf = rf_model.predict(X_test)
print(f"\nRandom Forest Accuracy: {accuracy_score(y_test, y_pred_rf) * 100:.2f}%")
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))
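
To make the numbers in the classification report easier to read, here is a small sketch that derives accuracy, precision, recall, and F1 for one class by hand. The function name `binary_report` and the five example labels are hypothetical; the real report computes the same quantities for every class.

```python
def binary_report(y_true, y_pred, positive="positive"):
    """Compute accuracy plus precision/recall/F1 for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

report = binary_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Here three of five predictions are correct (accuracy 0.6), and the positive class has two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.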

## 6. Model Saving

We select a model based on its accuracy and its training and inference cost, then save it together with the fitted vectorizer, since both are needed to score new text at deployment.

In [None]:
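import os

# We keep Logistic Regression: it typically trains and predicts much faster than the Random Forest.
# Confirm this choice against the accuracy figures printed above before deploying.
best_model = log_reg_model

# Define file paths
os.makedirs('./models', exist_ok=True)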
model_path = './models/sentiment_model.pkl'
vectorizer_path = './models/vectorizer.pkl'

# Save the model and vectorizer
joblib.dump(best_model, model_path)
joblib.dump(vectorizer, vectorizer_path)

print(f"Model saved to {model_path}")
print(f"Vectorizer saved to {vectorizer_path}")
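
As one possible alternative to saving two separate files, the vectorizer and classifier can be wrapped in a scikit-learn `Pipeline` and persisted as a single object. The sketch below is an assumption-laden illustration, not the notebook's method: the toy texts, labels, and the file name `sentiment_pipeline.pkl` are hypothetical, and the inputs are assumed to have already passed through `clean_text`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy, already-cleaned examples (hypothetical data, for illustration only)
texts = ["love great day", "happi fun love", "hate bad day", "sad aw hate"]
labels = ["positive", "positive", "negative", "negative"]

# The pipeline fits the vectorizer and the classifier on the same training data
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

# Round-trip through disk, as a deployment step would
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_pipeline.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

new_texts = ["love fun", "bad sad"]
original_pred = pipeline.predict(new_texts)
restored_pred = restored.predict(new_texts)
print(list(restored_pred))
```

The restored pipeline accepts cleaned strings directly, so the vectorizer and model cannot drift out of sync. New text still needs the same cleaning function applied before prediction.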