# Part 1: Data Collection & Exploration

**Goal**: Load and understand the Toxic Comment Classification dataset

## Dataset: Kaggle Toxic Comment Classification Challenge

### Why This Dataset?

1. **Real-world data**: Wikipedia talk page comments
2. **Multi-label**: 6 toxicity categories
3. **Scale**: 159,571 labeled comments
4. **Well-documented**: Used in research and industry
5. **Publicly available**: Can be used in a public portfolio project

### Toxicity Categories
- `toxic`: Rude, disrespectful comments
- `severe_toxic`: Very hateful, aggressive, disrespectful comments
- `obscene`: Vulgar or profane language
- `threat`: Threatening language toward an individual or group
- `insult`: Insulting or demeaning comments directed at a person
- `identity_hate`: Hate speech targeting identity (race, religion, gender, etc.)
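
To make the multi-label setup concrete, the sketch below represents each comment as six independent 0/1 flags and lists the active ones. The two rows are invented for illustration, and `active_labels` is a hypothetical helper, not part of the dataset or this notebook.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def active_labels(row):
    """Return the names of all labels set to 1 for one comment."""
    return [name for name in LABELS if row[name] == 1]


# Invented examples: one clean comment, one with two labels at once.
clean = {"comment_text": "Thanks for the fix!", **dict.fromkeys(LABELS, 0)}
multi = {**clean, "comment_text": "You are a worthless idiot", "toxic": 1, "insult": 1}

for row in (clean, multi):
    print(row["comment_text"], "->", active_labels(row))
```

A comment with no active flags is treated as clean; a single comment may activate any combination of the six flags.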

In [None]:
# Diagnostic cell - Run this first to verify your environment
import sys
import os

print("=" * 80)
print("ENVIRONMENT CHECK")
print("=" * 80)
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")
print(f"Current working directory: {os.getcwd()}")
print(f"Parent directory: {os.path.dirname(os.getcwd())}")  # dirname of the cwd is its parent, not the notebook folder

# Check if data file exists
data_path = '../data/raw/train.csv'
if os.path.exists(data_path):
    print(f"\n✓ Data file found: {data_path}")
    file_size = os.path.getsize(data_path) / (1024 * 1024)  # Size in MB
    print(f"  File size: {file_size:.2f} MB")
else:
    print(f"\n✗ Data file NOT found: {data_path}")
    print("  Please ensure the dataset is in the correct location")

# Check if results directory exists
results_dir = '../results/figures'
if not os.path.exists(results_dir):
    print(f"\n⚠ Creating results directory: {results_dir}")
    os.makedirs(results_dir, exist_ok=True)
else:
    print(f"\n✓ Results directory exists: {results_dir}")

print("\n" + "=" * 80)
print("If you see this output, your notebook is running correctly!")
print("=" * 80)


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Import display for Jupyter notebooks
try:
    from IPython.display import display
except ImportError:
    # Fallback if not in Jupyter environment
    display = print

# Set visualization style
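try:
    plt.style.use('seaborn-v0_8-darkgrid')
except OSError:
    # Matplotlib versions before 3.6 ship this style under its older name
    plt.style.use('seaborn-darkgrid')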
sns.set_palette("husl")

print("Libraries imported successfully!")

## Task 1a: Load Dataset

**Note**: Download the dataset from Kaggle:
1. Go to: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
2. Download `train.csv` and `test.csv`
3. Place in `../data/raw/` directory

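Alternatively, use the Kaggle API (this requires an API token in `~/.kaggle/kaggle.json` and accepting the competition rules on Kaggle; the download is a zip archive that must be extracted into `../data/raw/`):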
```bash
pip install kaggle
kaggle competitions download -c jigsaw-toxic-comment-classification-challenge
```
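
The downloaded archive can be unpacked with the standard library. The following is a minimal sketch; the archive file name and the possibility of nested `.zip` members are assumptions, and `extract_archive` is a hypothetical helper.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a zip archive into dest_dir, then unpack any nested .zip files.

    Returns the sorted names of the CSV files found in dest_dir afterwards.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
    # Some archives wrap each CSV in its own zip (assumption); unpack those too.
    for nested in dest.glob("*.zip"):
        with zipfile.ZipFile(nested) as zf:
            zf.extractall(dest)
    return sorted(p.name for p in dest.glob("*.csv"))


# Assumed archive name; adjust to whatever the CLI actually wrote.
archive = Path("jigsaw-toxic-comment-classification-challenge.zip")
if archive.exists():
    print(extract_archive(archive, "../data/raw"))
```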

In [None]:
# Load the dataset
# If you don't have the dataset yet, we'll create a sample for demonstration

try:
    # Try to load the actual dataset
    df = pd.read_csv('../data/raw/train.csv')
    print("✓ Loaded actual Kaggle dataset")
except FileNotFoundError:
    print("Dataset not found. Creating sample dataset for demonstration...")
    # Create a sample dataset for demonstration
    sample_data = {
        'id': range(1000),
        'comment_text': [
            "This is a great article, thank you!",
            "You're an idiot and nobody likes you",
            "I disagree with this perspective",
            "This is complete garbage written by morons",
            "Interesting point, could you elaborate?",
        ] * 200,  # Repeat to get 1000 samples
        'toxic': [0, 1, 0, 1, 0] * 200,
        'severe_toxic': [0, 0, 0, 1, 0] * 200,
        'obscene': [0, 1, 0, 1, 0] * 200,
        'threat': [0, 0, 0, 0, 0] * 200,
        'insult': [0, 1, 0, 1, 0] * 200,
        'identity_hate': [0, 0, 0, 0, 0] * 200,
    }
    df = pd.DataFrame(sample_data)
    print("✓ Created sample dataset for demonstration")
    print("⚠ Replace with actual Kaggle dataset for complete analysis")

print(f"\nDataset shape: {df.shape}")
print(f"Number of comments: {len(df):,}")
print(f"Number of features: {df.shape[1]}")

In [None]:
# Display basic information
print("=" * 80)
print("DATASET OVERVIEW")
print("=" * 80)

print("\nFirst 5 rows:")
display(df.head())

print("\nColumn names:")
print(df.columns.tolist())

print("\nData types:")
print(df.dtypes)

print("\nBasic statistics:")
print(df.describe())

## Task 1b: Initial Data Exploration

In [None]:
# Check for missing values
print("Missing Values:")
print("-" * 40)
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})
print(missing_df[missing_df['Missing Count'] > 0])

if missing_df['Missing Count'].sum() == 0:
    print("\n✓ No missing values found!")
else:
    print(f"\n⚠ Total missing values: {missing_df['Missing Count'].sum()}")

In [None]:
# Analyze text characteristics
print("TEXT CHARACTERISTICS")
print("=" * 80)

# Calculate text lengths
df['text_length'] = df['comment_text'].astype(str).apply(len)
df['word_count'] = df['comment_text'].astype(str).apply(lambda x: len(x.split()))

print(f"\nComment Length Statistics:")
print(f"  Average characters: {df['text_length'].mean():.0f}")
print(f"  Median characters: {df['text_length'].median():.0f}")
print(f"  Min characters: {df['text_length'].min()}")
print(f"  Max characters: {df['text_length'].max()}")

print(f"\nWord Count Statistics:")
print(f"  Average words: {df['word_count'].mean():.1f}")
print(f"  Median words: {df['word_count'].median():.0f}")
print(f"  Min words: {df['word_count'].min()}")
print(f"  Max words: {df['word_count'].max()}")

In [None]:
# Analyze class distribution
print("\nCLASS DISTRIBUTION")
print("=" * 80)

# Define toxicity columns
toxicity_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Calculate distribution for each category
print("\nToxicity by Category:")
print("-" * 40)
for col in toxicity_cols:
    count = df[col].sum()
    pct = (count / len(df)) * 100
    print(f"{col:20s}: {count:6d} ({pct:5.2f}%)")

# Check if any comment is toxic in any category
df['any_toxic'] = (df[toxicity_cols].sum(axis=1) > 0).astype(int)
toxic_count = df['any_toxic'].sum()
clean_count = len(df) - toxic_count

print("\nOverall Distribution:")
print("-" * 40)
print(f"Clean comments:        {clean_count:6d} ({(clean_count/len(df)*100):5.2f}%)")
print(f"Toxic comments:        {toxic_count:6d} ({(toxic_count/len(df)*100):5.2f}%)")
print(f"\nClass imbalance ratio: {clean_count/toxic_count:.2f}:1 (clean:toxic)")

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Overall toxic vs clean
ax1 = axes[0, 0]
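# Fix the order to [clean, toxic]; value_counts() sorts by frequency, which could swap the pie labels
counts = df['any_toxic'].value_counts().reindex([0, 1], fill_value=0)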
colors = ['#2ecc71', '#e74c3c']
ax1.pie(counts, labels=['Clean', 'Toxic'], autopct='%1.1f%%', 
        colors=colors, startangle=90)
ax1.set_title('Overall Distribution: Clean vs Toxic', fontsize=14, fontweight='bold')

# 2. Bar chart of each category
ax2 = axes[0, 1]
category_counts = df[toxicity_cols].sum().sort_values(ascending=True)
ax2.barh(category_counts.index, category_counts.values, color='coral')
ax2.set_xlabel('Number of Comments', fontsize=11)
ax2.set_title('Toxicity by Category', fontsize=14, fontweight='bold')
ax2.grid(axis='x', alpha=0.3)

# 3. Percentage by category
ax3 = axes[1, 0]
category_pct = (df[toxicity_cols].sum() / len(df) * 100).sort_values(ascending=True)
ax3.barh(category_pct.index, category_pct.values, color='steelblue')
ax3.set_xlabel('Percentage (%)', fontsize=11)
ax3.set_title('Toxicity Rate by Category', fontsize=14, fontweight='bold')
ax3.grid(axis='x', alpha=0.3)

# 4. Multi-label distribution
ax4 = axes[1, 1]
label_counts = df[toxicity_cols].sum(axis=1).value_counts().sort_index()
ax4.bar(label_counts.index, label_counts.values, color='mediumpurple', alpha=0.7)
ax4.set_xlabel('Number of Toxic Labels', fontsize=11)
ax4.set_ylabel('Number of Comments', fontsize=11)
ax4.set_title('Multi-Label Distribution', fontsize=14, fontweight='bold')
ax4.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../results/figures/class_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Visualization saved to: results/figures/class_distribution.png")

## Task 1c: Understand Abuse Categories

In [None]:
# Analyze multi-label patterns
print("MULTI-LABEL ANALYSIS")
print("=" * 80)

# How many labels does each toxic comment have?
toxic_df = df[df['any_toxic'] == 1].copy()
toxic_df['num_labels'] = toxic_df[toxicity_cols].sum(axis=1)

print("\nLabel Count Distribution (for toxic comments):")
print("-" * 40)
label_dist = toxic_df['num_labels'].value_counts().sort_index()
for num_labels, count in label_dist.items():
    pct = (count / len(toxic_df)) * 100
    print(f"  {num_labels} label(s): {count:5d} ({pct:5.2f}%)")

print(f"\nAverage labels per toxic comment: {toxic_df['num_labels'].mean():.2f}")

In [None]:
# Analyze label correlations
print("\nLABEL CORRELATIONS")
print("=" * 80)

# Calculate correlation matrix
corr_matrix = df[toxicity_cols].corr()

# Visualize correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='RdYlGn_r', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Between Toxicity Categories', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('../results/figures/toxicity_correlations.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nKey Correlations (top 5):")
print("-" * 40)
# Get upper triangle of correlation matrix
corr_pairs = []
for i in range(len(toxicity_cols)):
    for j in range(i+1, len(toxicity_cols)):
        corr_val = corr_matrix.iloc[i, j]
        if pd.notna(corr_val):  # constant columns (e.g., in the sample data) yield NaN
            corr_pairs.append((toxicity_cols[i], toxicity_cols[j], corr_val))

# Sort by correlation value
corr_pairs_sorted = sorted(corr_pairs, key=lambda x: abs(x[2]), reverse=True)

for label1, label2, corr_val in corr_pairs_sorted[:5]:
    print(f"  {label1:15s} <-> {label2:15s}: {corr_val:.3f}")

print("\n✓ Correlation heatmap saved to: results/figures/toxicity_correlations.png")

In [None]:
# Sample comments from each category
print("\nSAMPLE COMMENTS BY CATEGORY")
print("=" * 80)

for col in toxicity_cols:
    print(f"\n{col.upper()}:")
    print("-" * 40)
    # Get comments that are ONLY this category (if possible)
    single_label = df[(df[col] == 1) & (df[toxicity_cols].sum(axis=1) == 1)]
    
    if len(single_label) > 0:
        sample = single_label.sample(min(2, len(single_label)))
    else:
        # If no single-label examples, get any examples
        sample = df[df[col] == 1].sample(min(2, df[col].sum()))
    
    for idx, row in sample.iterrows():
        comment = str(row['comment_text'])  # guard against non-string values
        text = comment[:150]  # Truncate for readability
        if len(comment) > 150:
            text += "..."
        print(f"  • {text}")
        print()

In [None]:
# Data quality check
print("\nDATA QUALITY ASSESSMENT")
print("=" * 80)

# Check for duplicates
duplicates = df.duplicated(subset=['comment_text']).sum()
print(f"\nDuplicate comments: {duplicates} ({(duplicates/len(df)*100):.2f}%)")

# Check for empty comments
empty_comments = (df['comment_text'].astype(str).str.strip() == '').sum()
print(f"Empty comments: {empty_comments}")

# Check for very short comments (< 10 characters)
very_short = (df['text_length'] < 10).sum()
print(f"Very short comments (<10 chars): {very_short} ({(very_short/len(df)*100):.2f}%)")

# Check for very long comments (> 1000 characters)
very_long = (df['text_length'] > 1000).sum()
print(f"Very long comments (>1000 chars): {very_long} ({(very_long/len(df)*100):.2f}%)")

print("\n" + "=" * 80)
print("DATA QUALITY SUMMARY")
print("=" * 80)
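total_missing = int(df.isnull().sum().sum())
print("✓ Dataset is ready for preprocessing")
print(f"✓ Total samples: {len(df):,}")
print(f"✓ Toxic samples: {toxic_count:,} ({(toxic_count/len(df)*100):.1f}%)")
print(f"✓ Clean samples: {clean_count:,} ({(clean_count/len(df)*100):.1f}%)")
print(f"✓ Categories: {len(toxicity_cols)}")
# Report the measured value instead of assuming there are no gaps
if total_missing == 0:
    print("✓ No missing values")
else:
    print(f"⚠ Missing values: {total_missing:,}")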

In [None]:
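# Save processed dataframe for next notebook
import os
processed_dir = '../data/processed'
os.makedirs(processed_dir, exist_ok=True)  # to_csv does not create missing directories
df.to_csv(os.path.join(processed_dir, 'data_explored.csv'), index=False)
print("\n✓ Processed data saved to: data/processed/data_explored.csv")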

## Key Findings from Data Exploration

### 1. Dataset Characteristics
- **Size**: 159,571 comments in the full Kaggle training set (1,000 if the sample fallback was used)
- **Source**: Wikipedia talk page discussions
- **Quality**: No missing values, minimal data quality issues

### 2. Class Distribution
- **Imbalanced dataset**: ~90% clean, ~10% toxic (typical for real-world content)
- **Most common**: 'toxic' and 'obscene' categories
- **Least common**: 'threat' and 'identity_hate' categories
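
One common way to quantify this imbalance per label is the ratio of negatives to positives, which some training setups later use as a positive-class weight. The sketch below is an illustration under that assumption, not a step this notebook prescribes; `positive_class_weights` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd


def positive_class_weights(labels: pd.DataFrame) -> pd.Series:
    """Return negatives / positives per label column (NaN if a label never occurs)."""
    positives = labels.sum()
    negatives = len(labels) - positives
    # where() masks zero counts as NaN so the division cannot blow up
    return (negatives / positives.where(positives > 0)).round(2)


# Invented 10-row example: a common label, a rare label, and an absent label.
toy = pd.DataFrame({
    "toxic":         [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    "threat":        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "identity_hate": [0] * 10,
})
weights = positive_class_weights(toy)
print(weights)
```

On the real data this would be called as `positive_class_weights(df[toxicity_cols])`; rarer labels get larger ratios.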

### 3. Multi-Label Nature
- Comments can have multiple toxicity types
- Toxic comments often carry more than one label (the exact average is printed in Task 1c)
- Strong correlations between certain categories (e.g., obscene + insult)
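
Correlation coefficients can be complemented with raw co-occurrence counts, which show how many comments share two labels. A minimal sketch, assuming the label columns are 0/1 integers; `label_cooccurrence` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd


def label_cooccurrence(labels: pd.DataFrame) -> pd.DataFrame:
    """Count rows where both labels are 1; the diagonal holds per-label totals."""
    return labels.T.dot(labels)


# Invented 4-row example.
toy = pd.DataFrame({
    "toxic":   [1, 1, 1, 0],
    "obscene": [1, 1, 0, 0],
    "insult":  [1, 0, 0, 0],
})
co = label_cooccurrence(toy)
print(co)
```

On the real data this would be `label_cooccurrence(df[toxicity_cols])`.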

### 4. Text Characteristics
- The cells above report overall length statistics only; whether toxic comments are longer or shorter than clean ones requires a per-class comparison
- Wide range of comment lengths (from a few words to paragraphs)
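
A per-class comparison of comment length can be sketched as follows. The column names match the ones created earlier in this notebook; `length_by_class` is a hypothetical helper, and the toy frame is invented, so it says nothing about the real dataset.

```python
import pandas as pd


def length_by_class(df, text_col="comment_text", flag_col="any_toxic"):
    """Summarize character length separately for each value of flag_col."""
    lengths = df[text_col].astype(str).str.len()
    return lengths.groupby(df[flag_col]).agg(["count", "mean", "median"])


# Invented example: two clean (0) and two toxic (1) comments.
toy = pd.DataFrame({
    "comment_text": ["ok", "fine then", "you fool", "abcd"],
    "any_toxic": [0, 0, 1, 1],
})
stats = length_by_class(toy)
print(stats)
```

Comparing medians as well as means helps here because a few very long comments can dominate the mean.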

### Next Steps
1. Preprocess text (cleaning, tokenization)
2. Extract features (TF-IDF, word embeddings)
3. Conduct deeper exploratory analysis on the processed text
4. Build detection framework
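
As a starting point for step 1, a minimal cleaning and tokenization sketch is shown below. The specific rules (lowercasing, dropping URLs, keeping only letters, digits, and apostrophes) are assumptions for illustration, not the pipeline used later in this project, and `clean_and_tokenize` is a hypothetical helper.

```python
import re

_URL = re.compile(r"https?://\S+")
_NON_WORD = re.compile(r"[^a-z0-9'\s]")


def clean_and_tokenize(text):
    """Lowercase, remove URLs and punctuation, and split on whitespace."""
    text = _URL.sub(" ", str(text).lower())
    text = _NON_WORD.sub(" ", text)
    return text.split()


tokens = clean_and_tokenize("Check http://example.com NOW!!  It's   great.")
print(tokens)
```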