# Toxicity & Sentiment Analyzer Documentation

This notebook documents a text analysis system that combines multi-label toxicity detection with sentiment analysis. The system checks input text for six categories of toxic content and classifies its overall sentiment, which makes it suitable for content moderation and similar text analysis applications.

## Overview

The system consists of:
- A `ToxicityAnalyzer` class that handles both toxicity and sentiment analysis
- A machine learning pipeline using TF-IDF vectorization and logistic regression
- Sentiment analysis using a pre-trained RoBERTa model
- A Gradio-based web interface for easy interaction

Let's break down each component and understand how they work together.

## Required Libraries

The code begins by importing necessary libraries:
- Core data processing: `numpy`, `pandas`, `scipy.sparse`
- Machine learning: `sklearn` components for text vectorization and classification
- Deep learning: `transformers` for sentiment analysis
- UI: `gradio` for the web interface
- Utilities: `os`, `json`, `re`, `joblib`, `tqdm`

In [None]:
import os
import json
import numpy as np
import pandas as pd
import scipy.sparse as sp
from transformers import pipeline
import gradio as gr
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split
import joblib
from tqdm import tqdm

## ToxicityAnalyzer Class

### Initialization

The `ToxicityAnalyzer` class is the core component of the system. The initialization method sets up:

1. **Model Directory Structure**:
   - Creates a directory for storing model artifacts
   - Sets up paths for model, vectorizer, and sentiment cache files

2. **Toxicity Categories**:
   - Defines six categories: toxic, severe_toxic, obscene, threat, insult, identity_hate

3. **Sentiment Analysis Pipeline**:
   - Initializes the RoBERTa-based sentiment analyzer
   - Sets up sentiment caching for performance optimization

4. **Component Initialization**:
   - Loads the saved model and vectorizer if both files exist
   - Otherwise trains a new model from `data/train.csv` and saves it

In [None]:
class ToxicityAnalyzer:
    def __init__(self, model_dir='model_artifacts'):
        print("Initializing ToxicityAnalyzer...")
        self.model_dir = model_dir
        os.makedirs(model_dir, exist_ok=True)
        
        self.model_path = os.path.join(model_dir, 'logistic_model.joblib')
        self.vectorizer_path = os.path.join(model_dir, 'tfidf_vectorizer.joblib')
        self.sentiment_cache_path = os.path.join(model_dir, 'sentiment_cache.json')
        
        # Label columns in order
        self.label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
        
        print("Loading sentiment analyzer...")
        self.sentiment_analyzer = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-roberta-base-sentiment"
        )
        
        # Initialize components
        self.sentiment_cache = self._load_sentiment_cache()
        self.vectorizer = None
        self.model = None
        
        # Load or train the model
        self._initialize_model()

### Sentiment Cache Management

The sentiment cache system improves performance by storing previously computed sentiment analyses:

- `_load_sentiment_cache()`: Loads existing cache from disk
- `_save_sentiment_cache()`: Persists current cache to disk

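The cache is keyed by the exact (cleaned) text, so it helps only when the same string is analyzed more than once, for example when a dataset contains duplicate comments or the same input is submitted repeatedly.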

In [None]:
    def _load_sentiment_cache(self):
        if os.path.exists(self.sentiment_cache_path):
            print("Loading sentiment cache...")
            with open(self.sentiment_cache_path, 'r') as f:
                return json.load(f)
        return {}

    def _save_sentiment_cache(self):
        with open(self.sentiment_cache_path, 'w') as f:
            json.dump(self.sentiment_cache, f)
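
As a standalone illustration of the load/save round trip, the sketch below uses a temporary directory instead of `model_artifacts`. The free functions `load_cache` and `save_cache` are hypothetical stand-ins for the two methods above, and the cached entry is made-up sample data.

```python
import json
import os
import tempfile

def load_cache(path):
    """Return the cached dict, or an empty dict if the file does not exist."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    return {}

def save_cache(cache, path):
    """Write the cache to disk as JSON."""
    with open(path, "w") as f:
        json.dump(cache, f)

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "sentiment_cache.json")
    first_load = load_cache(cache_path)   # no file yet, so this is {}
    save_cache({"great work": [0, 0, 1]}, cache_path)
    second_load = load_cache(cache_path)  # the saved entry comes back unchanged
```

Because the one-hot vectors are plain lists of integers, they survive the JSON round trip without any conversion.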

### Text Processing and Sentiment Analysis

Two key methods handle text preprocessing and sentiment analysis:

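1. **clean_text()**:
   - Converts text to lowercase
   - Removes every character that is not an ASCII letter or whitespace, so digits, punctuation, and accented letters are dropped
   - Produces a consistent form for both the vectorizer and the cache key

2. **get_sentiment()**:
   - Checks cache for existing results
   - Applies the RoBERTa model, truncating input to 128 tokens
   - Maps the predicted label to a one-hot vector of length three, ordered [negative, neutral, positive]; the model's confidence score is discarded
   - Caches results for future use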

In [None]:
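    def clean_text(self, text):
        # Lowercase, then drop everything except ASCII letters and whitespace
        text = str(text).lower()
        return re.sub(r'[^a-zA-Z\s]', '', text)

    def get_sentiment(self, text):
        # Reuse a cached result when this exact text has been seen before
        if text in self.sentiment_cache:
            return self.sentiment_cache[text]
        
        # Inputs longer than 128 tokens are truncated before classification
        result = self.sentiment_analyzer(text, truncation=True, max_length=128)
        # One-hot encoding in the order [negative, neutral, positive]
        sentiment_map = {
            'LABEL_0': [1, 0, 0],  # Negative
            'LABEL_1': [0, 1, 0],  # Neutral
            'LABEL_2': [0, 0, 1]   # Positive
        }
        sentiment = sentiment_map[result[0]['label']]
        
        self.sentiment_cache[text] = sentiment
        return sentiment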
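
To make the effect of these two steps concrete, here is a small standalone sketch. The module-level `clean_text` mirrors the method above, and `fake_result` is a hypothetical stand-in for the pipeline output, so nothing here loads the RoBERTa model.

```python
import re

def clean_text(text):
    """Lowercase, then keep only ASCII letters and whitespace."""
    text = str(text).lower()
    return re.sub(r'[^a-zA-Z\s]', '', text)

# Map the model's label names to one-hot [negative, neutral, positive] vectors
SENTIMENT_MAP = {
    'LABEL_0': [1, 0, 0],
    'LABEL_1': [0, 1, 0],
    'LABEL_2': [0, 0, 1],
}

cleaned = clean_text("You're #1!!")      # apostrophe, '#', digit and '!' are dropped
accented = clean_text("Caf\u00e9 2024")  # the accented letter and the digits are dropped

# Hypothetical pipeline output: only the label is used, the score is ignored
fake_result = [{'label': 'LABEL_2', 'score': 0.98}]
one_hot = SENTIMENT_MAP[fake_result[0]['label']]
```

Note that cleaning can leave trailing whitespace (`"youre "`) and silently shortens non-English words (`"caf "`), which matters if the input is not plain English text.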

### Data Preparation Pipeline

The `prepare_data` method implements a comprehensive feature engineering pipeline:

1. **Text Preprocessing**:
   - Cleans and normalizes input text
   - Handles training vs. inference modes

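2. **Feature Extraction**:
   - Builds TF-IDF features over unigrams and bigrams (vocabulary capped at 50,000 terms, ignoring terms that appear in fewer than 5 documents)
   - Extracts the three one-hot sentiment features for each text
   - Appends the sentiment columns to the sparse TF-IDF matrix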

3. **Training Mode Features**:
   - Calculates label distribution
   - Provides detailed progress information
   - Returns both features and labels

In [None]:
    def prepare_data(self, df, training=False):
        print(f"\nPreparing dataset with {len(df)} samples...")
        
        # Extract features and labels
        X = df['comment_text'].apply(self.clean_text)
        if training:
            y = df[self.label_columns].values
            print("Label distribution:")
            for col in self.label_columns:
                positive_count = df[col].sum()
                print(f"{col}: {positive_count} positive samples ({positive_count/len(df)*100:.2f}%)")
        
        print("\nCreating TF-IDF features...")
        if training:
            self.vectorizer = TfidfVectorizer(
                max_features=50000,
                ngram_range=(1, 2),
                strip_accents='unicode',
                min_df=5
            )
            X_tfidf = self.vectorizer.fit_transform(X)
        else:
            X_tfidf = self.vectorizer.transform(X)
        
        print("Getting sentiment features...")
        sentiment_features = []
        for text in tqdm(X, desc="Processing sentiments"):
            sentiment = self.get_sentiment(text)
            sentiment_features.append(sentiment)
        
        # Persist the cache so later runs can reuse these results
        self._save_sentiment_cache()
            
        sentiment_features = np.array(sentiment_features)
        
        # Combine features
        X_combined = sp.hstack([X_tfidf, sp.csr_matrix(sentiment_features)])
        
        if training:
            return X_combined, y
        return X_combined
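
The feature combination step can be tried in isolation on a toy corpus. This is a minimal sketch with made-up texts and sentiment rows; it uses `min_df=1` because a three-document corpus cannot satisfy the `min_df=5` setting used above.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-cleaned comments
texts = ["you are great", "you are awful", "great work team"]

# Unigrams and bigrams, as in prepare_data, but with min_df=1 for this tiny corpus
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_tfidf = vectorizer.fit_transform(texts)

# Made-up one-hot sentiment rows: [negative, neutral, positive]
sentiment = np.array([[0, 0, 1], [1, 0, 0], [0, 0, 1]])

# Append the sentiment columns to the right of the TF-IDF columns
X_combined = sp.hstack([X_tfidf, sp.csr_matrix(sentiment)], format="csr")

n_tfidf = X_tfidf.shape[1]
last_three = X_combined[:, n_tfidf:].toarray()  # the sentiment block, recovered
```

The combined matrix stays sparse, and the last three columns are always the sentiment features, so the same vectorizer must be reused at inference time to keep the column layout identical.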

### Model Initialization and Training

The `_initialize_model` method handles the complete model lifecycle:

1. **Data Loading**:
   - Reads the training data from `data/train.csv`
   - Loads the saved model and vectorizer when both files exist, skipping training

2. **Model Training**:
   - Prepares features and labels
   - Holds out 20% of the data as a test set, stratified on the `toxic` label only
   - Trains a separate class-weighted logistic regression classifier for each toxicity category

3. **Model Evaluation**:
   - Reports test-set accuracy for each classifier
   - Saves the trained model and vectorizer for future use
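
One caveat about per-label accuracy: on rare labels it can look strong even when the classifier finds nothing. The sketch below uses made-up labels (3 positives in 100 samples) and a degenerate predictor to show the gap between accuracy and recall; it is an illustration, not a measurement of this system.

```python
# Made-up labels for a rare category: 3 positives out of 100 samples
y_true = [1] * 3 + [0] * 97
# A degenerate classifier that never flags anything
y_pred = [0] * 100

# Fraction of samples where prediction and truth agree
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of true positives that were actually detected
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Here the predictor reaches 97% accuracy while detecting none of the positive samples, so reporting precision and recall alongside accuracy may give a clearer picture for the rarer categories.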

In [None]:
    def _initialize_model(self):
        if os.path.exists(self.model_path) and os.path.exists(self.vectorizer_path):
            print("Loading existing model and vectorizer...")
            self.vectorizer = joblib.load(self.vectorizer_path)
            self.model = joblib.load(self.model_path)
        else:
            # Read the training data only when a model actually has to be trained
            print("\nLoading data...")
            df = pd.read_csv('data/train.csv')
            print(f"Dataset size: {len(df)} samples")
            
            print("Training new model...")
            X_combined, y = self.prepare_data(df, training=True)
            
            print("\nSplitting dataset...")
            X_train, X_test, y_train, y_test = train_test_split(
                X_combined, y, test_size=0.2, random_state=42, 
                stratify=y[:, 0]  # Stratify on toxic label
            )
            
            print("\nTraining models for each label...")
            estimators = []
            for i, label in enumerate(tqdm(self.label_columns)):
                print(f"\nTraining classifier for {label}...")
                clf = LogisticRegression(
                    C=1.0,
                    max_iter=200,
                    class_weight='balanced',
                    verbose=1
                )
                clf.fit(X_train, y_train[:, i])
                
                # Evaluate on test set
                score = clf.score(X_test, y_test[:, i])
                print(f"{label} classifier accuracy: {score:.4f}")
                estimators.append(clf)
            
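            # Wrap the fitted classifiers in one object for saving;
            # only `estimators_` is used at prediction time
            self.model = MultiOutputClassifier(LogisticRegression())
            self.model.estimators_ = estimators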
            
            print("\nSaving models...")
            joblib.dump(self.model, self.model_path)
            joblib.dump(self.vectorizer, self.vectorizer_path)
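
The `stratify=y[:, 0]` argument in the split above preserves the positive rate of the first label only. A minimal sketch with made-up labels (two columns standing in for the six real ones) shows what that guarantees and what it does not:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, two label columns
X = np.arange(100).reshape(-1, 1)
y = np.zeros((100, 2), dtype=int)
y[:20, 0] = 1   # first label: 20% positive
y[::25, 1] = 1  # second label: 4% positive (rare)

# Same split settings as in _initialize_model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y[:, 0]
)

first_label_rate = y_test[:, 0].mean()        # matches the overall 20% rate
second_label_count = int(y_test[:, 1].sum())  # not controlled by the split
```

The first label keeps its 20% rate in the test set, while the number of positives for the second label depends on the shuffle. For very rare categories such as `threat`, the test set may therefore contain few positive samples.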

### Text Analysis and Prediction

The `score_comment` method is the main interface for analyzing individual texts. It combines all the preprocessing, feature extraction, and prediction steps into a single workflow:

1. **Text Processing**:
   - Takes raw text input
   - Applies cleaning and normalization

2. **Feature Generation**:
   - Creates TF-IDF features from processed text
   - Extracts sentiment features
   - Combines features for prediction

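3. **Multi-label Classification**:
   - Determines the dominant sentiment
   - Reports toxicity labels only when the dominant sentiment is Negative; Neutral and Positive inputs always return "No toxic labels detected"
   - Formats detected labels as title-cased names, one per line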

In [None]:
    def score_comment(self, comment):
        cleaned_text = self.clean_text(comment)
        
        # Get features
        X_tfidf = self.vectorizer.transform([cleaned_text])
        sentiment_features = np.array([self.get_sentiment(cleaned_text)])
        X_combined = sp.hstack([X_tfidf, sp.csr_matrix(sentiment_features)])
        
        # Format sentiment output
        sentiment_labels = ['Negative', 'Neutral', 'Positive']
        dominant_sentiment = sentiment_labels[np.argmax(sentiment_features[0])]
        
        # Toxicity labels are reported only for negative-sentiment text
        if dominant_sentiment != 'Negative':
            return dominant_sentiment, "No toxic labels detected"
        
        # Run each per-label classifier once and collect the positive labels
        predictions = [clf.predict(X_combined)[0] for clf in self.model.estimators_]
        detected_labels = [
            label.replace('_', ' ').title()
            for label, pred in zip(self.label_columns, predictions)
            if pred == 1
        ]
        return dominant_sentiment, "\n".join(detected_labels) if detected_labels else "No toxic labels detected"
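
The output formatting can be exercised without a trained model. In this sketch, `format_result` is a hypothetical helper that reproduces only the sentiment gate and label formatting from `score_comment`; the sentiment vectors and predictions passed to it are made up.

```python
LABEL_COLUMNS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
SENTIMENT_LABELS = ['Negative', 'Neutral', 'Positive']
NO_LABELS = "No toxic labels detected"

def format_result(sentiment_vector, predictions):
    """Return (dominant sentiment, newline-joined label names)."""
    dominant = SENTIMENT_LABELS[sentiment_vector.index(max(sentiment_vector))]
    # Toxicity labels are only reported for negative-sentiment text
    if dominant != 'Negative':
        return dominant, NO_LABELS
    detected = [
        label.replace('_', ' ').title()
        for label, pred in zip(LABEL_COLUMNS, predictions)
        if pred == 1
    ]
    return dominant, "\n".join(detected) if detected else NO_LABELS

negative_case = format_result([1, 0, 0], [1, 1, 0, 0, 0, 0])
positive_case = format_result([0, 0, 1], [1, 0, 0, 0, 0, 0])
```

The second call shows the consequence of the gate: a positive prediction for `toxic` is not reported because the sentiment is not Negative.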

## Web Interface Implementation

The final component of the system is a web-based user interface built using Gradio. This interface makes the toxicity analyzer accessible through a browser, with the following features:

1. **Input Interface**:
   - A three-line text box labeled "Input Text"
   - Placeholder text prompting the user to type

2. **Output Display**:
   - A single-line "Sentiment" field
   - A six-line "Detected Labels" field, one label per line

3. **Styling and Theme**:
   - Gradio's `Base` theme with a blue primary hue and slate neutral hue
   - Custom CSS that sets a dark (`#1f1f1f`) container background

In [None]:
def create_interface():
    analyzer = ToxicityAnalyzer()
    
    interface = gr.Interface(
        fn=analyzer.score_comment,
        inputs=gr.Textbox(
            lines=3, 
            placeholder='Type your text here...',
            label="Input Text"
        ),
        outputs=[
            gr.Textbox(label="Sentiment", lines=1),
            gr.Textbox(label="Detected Labels", lines=6)
        ],
        title="Toxicity & Sentiment Analyzer",
        description="Analysis of text for toxicity and sentiment",
        theme=gr.themes.Base(primary_hue="blue", neutral_hue="slate"),
        css="""
            .gradio-container {background-color: #1f1f1f}
        """
    )
    return interface

if __name__ == "__main__":
    print("Starting Toxicity Analyzer...")
    interface = create_interface()
    interface.launch()

## Usage Example

To use the Toxicity Analyzer:

1. Ensure all required libraries are installed
2. Run the script to start the web interface
3. Enter text in the input box
4. View sentiment and toxicity results
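
For step 1, a quick way to check the environment is to test whether each imported package can be found. This is a minimal sketch; the names in `REQUIRED` are the import names used in this notebook (`sklearn` is installed as `scikit-learn`), and the `transformers` pipeline additionally needs a deep learning backend such as PyTorch, which this check does not cover.

```python
import importlib.util

# Import names used by the notebook
REQUIRED = ["numpy", "pandas", "scipy", "sklearn", "transformers", "gradio", "joblib", "tqdm"]

# find_spec returns None when a top-level package cannot be located
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```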

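On the first run, the analyzer trains the model from `data/train.csv` and saves it under `model_artifacts/`, which can take a while because every training comment is passed through the sentiment model. Later runs load the saved model and vectorizer, and each submitted text is scored on demand.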