# Part 2: 20_Newsgroups Multi-class Text Classification
## Complete ML Pipeline with Feature Engineering and Selection

**Author:** Samidur Rahman  
**Course:** CMT122 - Machine Learning for NLP  
**Academic Year:** 2025/2026

---

## Table of Contents

1. [Introduction & Objectives](#section1)
2. [Imports & Environment Setup](#section2)
3. [Data Loading & Exploration](#section3)
4. [Text Preprocessing](#section4)
5. [Dataset Partitioning (Train/Dev/Test)](#section5)
6. [Feature Engineering](#section6)
   - 6.1. [TF-IDF Features](#section6_1)
   - 6.2. [Statistical Text Features](#section6_2)
   - 6.3. [Feature Combination](#section6_3)
7. [Feature Selection (Development Set)](#section7)
8. [Model Selection (Development Set)](#section8)
9. [Final Evaluation (Test Set)](#section9)
10. [Detailed Performance Analysis](#section10)

---

## 1. Introduction & Objectives <a id='section1'></a>

### Objective 

This notebook implements a complete machine learning pipeline for **multi-class text classification** on the 20_Newsgroups dataset. The task is to automatically categorise news articles into 6 predefined categories.

### Dataset Overview

- **Source**: Modified 20_Newsgroups dataset
- **Size**: 3,416 articles
- **Classes**: 6 categories (class-1 through class-6)
- **Format**: CSV with columns `text` (article content) and `label` (category)
- **Task Type**: Multi-class classification (6 classes)
---

## 2. Imports & Environment Setup <a id='section2'></a>

### Required Libraries

**Library Purposes:**
- **numpy, pandas**: Data manipulation, array operations, DataFrames
- **nltk**: Natural language toolkit for text preprocessing resources
- **re**: Regular expressions for pattern-based text cleaning
- **sklearn**: Complete ML pipeline (preprocessing, models, evaluation)
- **scipy.sparse**: Efficient sparse matrix operations for text data

**Note on Packages:**  
All packages used here are standard sklearn/scipy components introduced in CMT122 labs. No external packages beyond course materials are used.

In [1]:
# Core libraries
import numpy as np
import pandas as pd
import nltk
import re
import warnings

# Scikit-learn: Data splitting and preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

# Scikit-learn: Models
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Scikit-learn: Evaluation metrics
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report
)

# Scipy: Sparse matrix operations
from scipy.sparse import hstack, csr_matrix, vstack

# Configuration
warnings.filterwarnings('ignore')

# Download NLTK data (if needed)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)


True

## 3. Data Loading & Exploration <a id='section3'></a>

### Loading the Dataset

The 20_Newsgroups dataset contains news articles from 6 different categories. I load it as a CSV file with two columns:
- `text`: Raw article content (may include headers, metadata)
- `label`: Category label (class-1 through class-6)

**Expected Structure:**
```
text,label
"From: user@domain.com Subject: ...",class-1
"From: another@email.com ...",class-2
...
```

In [None]:
# IMPORTANT: Update this path to match your file location
data_path = '/home/samidunix/projects/CMT122/20_Newsgroups.csv'

# Load CSV file
df = pd.read_csv(data_path)

print(f"\n Dataset loaded successfully")
print(f"\nDataset shape: {df.shape}")
print(f"  - Total samples: {len(df):,}")
print(f"  - Columns: {list(df.columns)}")

# Check for missing values
print(f"\nMissing values:")
print(df.isnull().sum())

print("\n" + "=" * 70)
print("CLASS DISTRIBUTION")

# Display class distribution
class_counts = df['label'].value_counts().sort_index()
print(f"\n{class_counts}")

print(f"\nNumber of classes: {df['label'].nunique()}")
print(f"\nClass balance:")
for label in class_counts.index:
    count = class_counts[label]
    percentage = 100 * count / len(df)
    print(f"  {label}: {count:4} samples ({percentage:5.2f}%)")

# Check if balanced
min_class = class_counts.min()
max_class = class_counts.max()
imbalance_ratio = max_class / min_class
print(f"\nImbalance ratio: {imbalance_ratio:.2f}x")
if imbalance_ratio < 1.5:
    print("  → Classes are relatively balanced ")
else:
    print("  → Significant class imbalance (consider class_weight='balanced')")

print("\n" + "=" * 70)

DATASET LOADING

✓ Dataset loaded successfully

Dataset shape: (3416, 2)
  - Total samples: 3,416
  - Columns: ['text', 'label']

Missing values:
text     0
label    0
dtype: int64

CLASS DISTRIBUTION

label
class-1    480
class-2    584
class-3    591
class-4    590
class-5    578
class-6    593
Name: count, dtype: int64

Number of classes: 6

Class balance:
  class-1:  480 samples (14.05%)
  class-2:  584 samples (17.10%)
  class-3:  591 samples (17.30%)
  class-4:  590 samples (17.27%)
  class-5:  578 samples (16.92%)
  class-6:  593 samples (17.36%)

Imbalance ratio: 1.24x
  → Classes are relatively balanced ✓



### Sample Data Inspection

In [3]:
print("=" * 70)
print("SAMPLE ARTICLES (First 100 characters)")
print("=" * 70)

for label in df['label'].unique()[:3]:
    sample = df[df['label'] == label].iloc[0]
    print(f"\n{label}:")
    print(f"  {sample['text'][:100]}...")

print("\n" + "=" * 70)

SAMPLE ARTICLES (First 100 characters)

class-5:
  from : matt harrop@magic-bbs.corp.apple.com distribution : na organization : macintosh awareness gro...

class-4:
  from : dannyb@panix.com ( daniel burstein ) subject : re : ( q ) conner hd specs organization : pani...

class-1:
  subject : re : contradictions from : kmr4@po.cwru.edu ( keith m. ryan ) organization : case western ...



## 4. Text Preprocessing <a id='section4'></a>

### Preprocessing Strategy

Newsgroup articles contain:
- Email headers (From:, Subject:, Organization:)
- Email addresses
- Special characters and punctuation
- Mixed case text

**My cleaning approach:**
1. **Lowercase conversion**: Treat "Computer" and "computer" as the same
2. **Email removal**: Addresses don't help classification
3. **Header removal**: "From:", "Subject:", etc. are noise
4. **Non-alphabetic removal**: Keep only letters and spaces
5. **Whitespace normalization**: Collapse multiple spaces

**Why this approach?**
- Focuses on actual content words
- Removes metadata that might cause overfitting
- Standardizes text format for vectorization

**Trade-off**: This loses some potentially useful information (e.g., header fields) but gains cleaner, more general features.

In [None]:
def clean_newsgroup_text(text):
    """
    Clean newsgroup article text for classification.
    
    Args:
        text (str): Raw article text with headers and metadata
        
    Returns:
        str: Cleaned text containing only lowercase alphabetic words
    
    Example:
        Input:  "From: user@email.com\nSubject: Computer Science\nThis is great!"
        Output: "computer science this is great"
    """
    # Handle missing values
    if pd.isna(text):
        return ""
    
    # Convert to string and lowercase
    text = str(text).lower()
    
    # Step 1: Remove email addresses
    # Pattern \S+@\S+ matches user@domain.com
    text = re.sub(r'\S+@\S+', ' ', text)
    
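    # Step 2: Remove common header fields
    # These appear in newsgroup posts but don't help classification
    # Note: the pattern expects the colon directly after the field name ("from:");
    # the pre-tokenized "from :" form seen in the sample output is not matched
    header_pattern = r'(from|subject|organization|lines|distribution|reply-to):'
    text = re.sub(header_pattern, ' ', text)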
    
    # Step 3: Keep only letters and spaces
    # [^a-z\s] means "not a letter or space"
    text = re.sub(r'[^a-z\s]', ' ', text)
    
    # Step 4: Collapse multiple spaces into one
    # \s+ matches one or more whitespace characters
    text = re.sub(r'\s+', ' ', text)
    
    # Step 5: Remove leading/trailing whitespace
    text = text.strip()
    
    return text


print("=" * 70)
print("TEXT PREPROCESSING")
print("=" * 70)

# Apply cleaning function to all articles
print("\nCleaning text...")
df['cleaned_text'] = df['text'].apply(clean_newsgroup_text)

# Filter out very short texts (likely empty after cleaning)
# Threshold: 50 characters ensures we have meaningful content
min_length = 50
n_before = len(df)  # Row count before filtering, so the CSV is not re-read
df = df[df['cleaned_text'].str.len() > min_length].reset_index(drop=True)

print(f" Text cleaning complete")
print(f"\nDataset size after cleaning: {len(df):,} samples")
print(f"  (Removed {n_before - len(df)} articles shorter than {min_length} chars)")

# Show before/after examples
print("\n" + "=" * 70)
print("BEFORE/AFTER CLEANING EXAMPLES")
print("=" * 70)

for idx in [0, 100, 200]:
    if idx < len(df):
        print(f"\n[Example {idx+1}] Label: {df.iloc[idx]['label']}")
        print(f"BEFORE: {df.iloc[idx]['text'][:80]}...")
        print(f"AFTER:  {df.iloc[idx]['cleaned_text'][:80]}...")

print("\n" + "=" * 70)

TEXT PREPROCESSING

Cleaning text... (this may take a moment)
✓ Text cleaning complete

Dataset size after cleaning: 3,416 samples
  (Removed 0 articles shorter than 50 chars)

BEFORE/AFTER CLEANING EXAMPLES

[Example 1] Label: class-5
BEFORE: from : matt harrop@magic-bbs.corp.apple.com distribution : na organization : mac...
AFTER:  from matt distribution na organization macintosh awareness group in canada subje...

[Example 101] Label: class-6
BEFORE: from : donaldlf@k9.rose-hulman.edu ( leslie f. donaldson ) subject : problems us...
AFTER:  from leslie f donaldson subject problems using graphic context with athena widge...

[Example 201] Label: class-6
BEFORE: from : raney@teal.csn.org ( scott raney ) subject : re : hypercard for unix nntp...
AFTER:  from scott raney subject re hypercard for unix nntp posting host teal csn org or...
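
In the before/after output above, field names such as `from`, `subject` and `organization` survive cleaning, because the articles write them with a space before the colon (`from :`) while the header pattern expects `from:`. The sketch below shows a whitespace-tolerant variant of that pattern. It is an assumption for illustration only: the name `HEADER_PATTERN_TOLERANT` and the helper `strip_headers` are hypothetical, and this variant is not what produced the results reported in this notebook.

```python
import re

# Hypothetical variant of the header pattern: \s* tolerates a space before the colon
HEADER_PATTERN_TOLERANT = re.compile(
    r'\b(from|subject|organization|lines|distribution|reply-to)\s*:'
)

def strip_headers(text):
    """Remove header field names written as 'from:' or 'from :'."""
    return HEADER_PATTERN_TOLERANT.sub(' ', text.lower())

# Made-up sample in the same pre-tokenized style as the dataset
sample = "from : someone subject : re : hypercard for unix"
cleaned = re.sub(r'\s+', ' ', strip_headers(sample)).strip()
print(cleaned)
```

Switching to this pattern would change the cleaned text, and therefore the vocabulary and every downstream number, so all later cells would need to be re-run.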



## 5. Dataset Partitioning (Train/Dev/Test) <a id='section5'></a>

### Evaluation Protocol: 60/20/20 Split

**Three-way split strategy:**
1. **Training set (60%)**: Used to fit model parameters
2. **Development set (20%)**: Used for hyperparameter tuning
3. **Test set (20%)**: Used only for final evaluation

**Why this approach?**
- **Training set**: Large enough to learn patterns (60% ≈ 2,049 samples)
- **Development set**: Allows fair comparison of feature/model choices
- **Test set**: Provides unbiased performance estimate (never seen during development)

**Alternative approaches considered:**
- **80/20 split**: No development set → can't tune hyperparameters fairly
- **Cross-validation**: More robust but computationally expensive
- **50/25/25 split**: Less training data → potentially worse models

**Stratification:**  
I use stratified sampling to maintain the class distribution across all three sets. This ensures each set is representative of the overall population.

**Reproducibility:**  
`random_state=42` ensures the same split every time (important for fair comparison).

In [None]:
print("=" * 70)
print("DATASET PARTITIONING")
print("=" * 70)

# First split: 60% train, 40% temp (will become dev + test)
df_train, df_temp = train_test_split(
    df,
    test_size=0.4,           # 40% for temp (dev + test)
    random_state=42,          # Reproducibility
    stratify=df['label']      # Maintain class distribution
)

# Second split: Split temp into 50% dev, 50% test
# This gives the 20% dev and 20% test of original dataset
df_dev, df_test = train_test_split(
    df_temp,
    test_size=0.5,            # 50% of temp = 20% of original
    random_state=42,
    stratify=df_temp['label']
)

print(f"\nTotal dataset: {len(df):,} samples")
print(f"\nSplit sizes:")
print(f"  Training:   {len(df_train):,} samples ({100*len(df_train)/len(df):5.1f}%)")
print(f"  Development: {len(df_dev):,} samples ({100*len(df_dev)/len(df):5.1f}%)")
print(f"  Test:        {len(df_test):,} samples ({100*len(df_test)/len(df):5.1f}%)")

# Verify stratification worked
print("\n" + "-" * 70)
print("Class distribution verification:")
print("-" * 70)
print(f"\n{'Class':<10} {'Train':>10} {'Dev':>10} {'Test':>10} {'Total':>10}")
print("-" * 54)

for label in sorted(df['label'].unique()):
    train_count = (df_train['label'] == label).sum()
    dev_count = (df_dev['label'] == label).sum()
    test_count = (df_test['label'] == label).sum()
    total = train_count + dev_count + test_count
    print(f"{label:<10} {train_count:>10} {dev_count:>10} {test_count:>10} {total:>10}")

print("\n Stratified split successful - class proportions maintained")
print("\n" + "=" * 70)

DATASET PARTITIONING

Total dataset: 3,416 samples

Split sizes:
  Training:   2,049 samples ( 60.0%)
  Development: 683 samples ( 20.0%)
  Test:        684 samples ( 20.0%)

----------------------------------------------------------------------
Class distribution verification:
----------------------------------------------------------------------

Class           Train        Dev       Test      Total
------------------------------------------------------
class-1           288         96         96        480
class-2           350        117        117        584
class-3           354        118        119        591
class-4           354        118        118        590
class-5           347        116        115        578
class-6           356        118        119        593

✓ Stratified split successful - class proportions maintained



### Prepare Labels for Modeling

In [None]:
# Extract labels as numpy arrays
y_train = df_train['label'].values
y_dev = df_dev['label'].values
y_test = df_test['label'].values

print(" Labels extracted:")
print(f"  Training labels: {y_train.shape}")
print(f"  Development labels: {y_dev.shape}")
print(f"  Test labels: {y_test.shape}")
print(f"\n  Unique classes: {sorted(np.unique(y_train))}")

✓ Labels extracted:
  Training labels: (2049,)
  Development labels: (683,)
  Test labels: (684,)

  Unique classes: ['class-1', 'class-2', 'class-3', 'class-4', 'class-5', 'class-6']


## 6. Feature Engineering <a id='section6'></a>

### Feature Engineering Overview

I extract **four distinct features** from the text:

1. **Feature 1: TF-IDF (Primary)** - Word frequency information
2. **Feature 2: Word Count** - Document length
3. **Feature 3: Average Word Length** - Vocabulary complexity
4. **Feature 4: Lexical Diversity** - Vocabulary richness

These features capture different aspects of text:
- **TF-IDF**: *What* words are used (content)
- **Word Count**: *How much* text (length)
- **Avg Word Length**: *How complex* vocabulary (sophistication)
- **Lexical Diversity**: *How varied* vocabulary (richness)

---

### 6.1. Feature 1: TF-IDF Vectorization <a id='section6_1'></a>

#### What is TF-IDF?

**TF-IDF** (Term Frequency-Inverse Document Frequency) converts text to numerical vectors.

**Formula:**  
TF-IDF(word, doc) = TF(word, doc) × IDF(word)

Where:
- **TF (Term Frequency)**: How often word appears in document
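- **IDF (Inverse Document Frequency)**: log(total docs / docs containing word). scikit-learn's default uses a smoothed variant, ln((1 + total docs) / (1 + docs containing word)) + 1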

**Effect:**
- Common words ("the", "is") → Low TF-IDF (appear in many documents)
- Distinctive words ("graphics", "hockey") → High TF-IDF (appear in few documents)
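
As a rough illustration of the IDF effect described above, the following sketch computes both the textbook IDF and the smoothed form that scikit-learn applies by default (`smooth_idf=True`). The three-document corpus and the helper names are made up for this example and are not part of the pipeline.

```python
import math

# Toy corpus (illustrative only, not taken from the dataset)
docs = [
    "hockey game tonight",
    "graphics card driver",
    "hockey team wins game",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def doc_freq(term):
    """Number of documents that contain the term at least once."""
    return sum(term in tokens for tokens in tokenized)

def idf_textbook(term):
    """IDF as written above: ln(total docs / docs containing term)."""
    return math.log(n_docs / doc_freq(term))

def idf_smoothed(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq(term))) + 1

for term in ("hockey", "graphics"):
    print(f"{term:<10} df={doc_freq(term)}  "
          f"textbook={idf_textbook(term):.3f}  smoothed={idf_smoothed(term):.3f}")
```

In this toy corpus "graphics" appears in one document and "hockey" in two, so "graphics" receives the higher IDF under both formulas.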

#### Parameter Choices

1. **`max_features=20000`**: Vocabulary size
   - **Why?** Captures diverse newsgroup terminology
   - **Trade-off**: More features = more memory, but better coverage

2. **`ngram_range=(1, 2)`**: Unigrams and bigrams
   - **Unigrams**: "computer", "science"
   - **Bigrams**: "computer science", "machine learning"
   - **Why?** Bigrams capture multi-word concepts

3. **`stop_words='english'`**: Remove common words
   - Filters: "the", "is", "at", "on", etc.
   - **Why?** These don't distinguish between categories

4. **`min_df=3`**: Minimum document frequency
   - Word must appear in at least 3 documents
   - **Why?** Removes typos and extremely rare words

5. **`max_df=0.75`**: Maximum document frequency
   - Word must appear in at most 75% of documents
   - **Why?** Removes overly common words

6. **`sublinear_tf=True`**: Logarithmic term frequency
   - Uses 1 + log(TF) instead of raw TF
   - **Why?** Reduces impact of very frequent words within a document
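
To make the dampening effect of `sublinear_tf=True` concrete, here is a minimal sketch of the 1 + ln(TF) transform applied to a few raw counts. The counts are made-up values and the function name is hypothetical.

```python
import math

def sublinear_tf(raw_count):
    """Return 1 + ln(tf) for a positive raw term count, and 0 for an absent term."""
    return 1.0 + math.log(raw_count) if raw_count > 0 else 0.0

for raw in (1, 10, 100):
    print(f"raw tf = {raw:>3} -> sublinear tf = {sublinear_tf(raw):.2f}")
```

A term repeated 100 times ends up weighted only a few times more than a term that occurs once, rather than 100 times more.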

#### Why TF-IDF for Text Classification?

 **Advantages:**
- Emphasizes distinctive words
- Handles variable-length documents
- Sparse representation (memory efficient)
- Works well with linear models

 **Limitations:**
- Ignores word order ("not good" vs "good not")
- Treats words independently
- No semantic understanding ("car" ≠ "automobile")

In [None]:
print("=" * 70)
print("FEATURE 1: TF-IDF VECTORIZATION")
print("=" * 70)

# Initialize TF-IDF vectorizer with the parameters described above
tfidf_vectorizer = TfidfVectorizer(
    max_features=20000,       # Keep at most 20,000 terms, ranked by corpus frequency
    ngram_range=(1, 2),       # Unigrams and bigrams
    stop_words='english',     # Remove English stopwords
    min_df=3,                 # Term must appear in ≥3 documents
    max_df=0.75,              # Term must appear in ≤75% of documents
    sublinear_tf=True         # Use log scaling for term frequency
)

print("\nExtracting TF-IDF features...")
print(f"  Parameters:")
print(f"    - max_features: 20,000")
print(f"    - ngram_range: (1, 2)")
print(f"    - min_df: 3")
print(f"    - max_df: 0.75")

# fit_transform on training only
# This learns vocabulary from training set
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train['cleaned_text'])

# transform (not fit_transform) on dev and test
# This uses vocabulary learned from training set
# Prevents data leakage from dev/test into training
X_dev_tfidf = tfidf_vectorizer.transform(df_dev['cleaned_text'])
X_test_tfidf = tfidf_vectorizer.transform(df_test['cleaned_text'])

print(f"\n TF-IDF extraction complete")
print(f"\n  Training matrix: {X_train_tfidf.shape}")
print(f"    - Samples: {X_train_tfidf.shape[0]:,}")
print(f"    - Features: {X_train_tfidf.shape[1]:,}")
print(f"    - Sparsity: {100 * (1 - X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1])):.2f}%")
print(f"    - Non-zero elements: {X_train_tfidf.nnz:,}")

print(f"\n  Dev matrix: {X_dev_tfidf.shape}")
print(f"  Test matrix: {X_test_tfidf.shape}")

# Show sample features
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"\n  Sample features (first 20):")
print(f"    {list(feature_names[:20])}")

print("\n" + "=" * 70)

FEATURE 1: TF-IDF VECTORIZATION

Extracting TF-IDF features...
  Parameters:
    - max_features: 20,000
    - ngram_range: (1, 2)
    - min_df: 3
    - max_df: 0.75

✓ TF-IDF extraction complete

  Training matrix: (2049, 19226)
    - Samples: 2,049
    - Features: 19,226
    - Sparsity: 99.49%
    - Non-zero elements: 199,608

  Dev matrix: (683, 19226)
  Test matrix: (684, 19226)

  Sample features (first 20):
    ['aa', 'aaron', 'ab', 'abandon', 'abc', 'abilities', 'ability', 'ability use', 'able', 'able approach', 'able boot', 'able handle', 'able help', 'able locate', 'able run', 'able tell', 'able telnet', 'able unexpected', 'able use', 'able work']



### 6.2. Features 2-4: Statistical Text Features <a id='section6_2'></a>

#### Why Statistical Features?

While TF-IDF captures *what* words are used, statistical features capture document-level properties:
- **Different signal**: Complements TF-IDF with meta-information
- **Robust**: Less prone to overfitting than individual word features
- **Interpretable**: Easy to understand and explain

#### Feature Descriptions

**Feature 2: Word Count**
- **Definition**: Total number of words in document
- **Formula**: `len(text.split())`
- **Intuition**: Technical posts might be longer than casual discussions
- **Example**: "hello world" → 2, "this is a test" → 4

**Feature 3: Average Word Length**
- **Definition**: Mean number of characters per word
- **Formula**: `sum(len(word) for word in words) / len(words)`
- **Intuition**: Technical topics use longer words ("algorithm" vs "hi")
- **Example**: "hi bye" → 2.5, "computer science" → 7.5

**Feature 4: Lexical Diversity (Type-Token Ratio)**
- **Definition**: Ratio of unique words to total words
- **Formula**: `len(set(words)) / len(words)`
- **Intuition**: Repetitive posts have lower diversity
- **Range**: [0, 1] where 1 = all words unique
- **Example**: "the the the cat" → 0.5 (2 unique words / 4 total), "a quick brown fox" → 1.0
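
The three definitions can be checked by hand on short strings. The sketch below is a simplified single-document helper (the name `text_stats` is hypothetical) that applies the same formulas using whitespace tokenization.

```python
def text_stats(text):
    """Return (word_count, avg_word_length, lexical_diversity) for one document."""
    words = text.split()
    if not words:
        # Empty text: avoid division by zero
        return 0, 0.0, 0.0
    word_count = len(words)
    avg_word_length = sum(len(w) for w in words) / word_count
    lexical_diversity = len(set(words)) / word_count
    return word_count, avg_word_length, lexical_diversity

for example in ("hi bye", "computer science", "the the the cat", "a quick brown fox"):
    print(f"{example!r} -> {text_stats(example)}")
```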


In [None]:
def extract_statistical_features(text_series):
    """
    Extract three statistical features from a series of text documents.
    
    Args:
        text_series (pd.Series): Series of text strings
        
    Returns:
        np.ndarray: Array of shape (n_samples, 3) containing:
            - Column 0: Word count
            - Column 1: Average word length
            - Column 2: Lexical diversity (unique/total ratio)
            
    Note:
        Handles edge cases:
        - Empty text → [0, 0, 0]
        - Single word → word_count=1, diversity=1.0
    """
    features = []
    
    for text in text_series:
        # Tokenize by splitting on whitespace
        words = text.split()
        
        # Feature 2: Word count
        word_count = len(words)
        
        # Feature 3: Average word length
        # Handle empty text to avoid division by zero
        if words:
            avg_word_length = np.mean([len(w) for w in words])
        else:
            avg_word_length = 0
        
        # Feature 4: Lexical diversity (type-token ratio)
        # Ratio of unique words to total words
        if words:
            unique_words = len(set(words))
            lexical_diversity = unique_words / len(words)
        else:
            lexical_diversity = 0
        
        features.append([word_count, avg_word_length, lexical_diversity])
    
    return np.array(features)


print("=" * 70)
print("FEATURES 2-4: STATISTICAL TEXT FEATURES")
print("=" * 70)

print("\nExtracting statistical features...")

# Extract features for all three sets
X_train_stats = extract_statistical_features(df_train['cleaned_text'])
X_dev_stats = extract_statistical_features(df_dev['cleaned_text'])
X_test_stats = extract_statistical_features(df_test['cleaned_text'])

print(f"\n Statistical features extracted")
print(f"\n  Feature shapes:")
print(f"    Training: {X_train_stats.shape}")
print(f"    Dev:      {X_dev_stats.shape}")
print(f"    Test:     {X_test_stats.shape}")

# Display statistics about the features
print(f"\n  Feature statistics (training set):")
feature_names_stats = ['Word Count', 'Avg Word Length', 'Lexical Diversity']
print(f"\n  {'Feature':<20} {'Min':>10} {'Mean':>10} {'Max':>10} {'Std':>10}")
print("  " + "-" * 64)
for i, name in enumerate(feature_names_stats):
    col = X_train_stats[:, i]
    print(f"  {name:<20} {col.min():>10.2f} {col.mean():>10.2f} {col.max():>10.2f} {col.std():>10.2f}")

print("\n" + "=" * 70)

FEATURES 2-4: STATISTICAL TEXT FEATURES

Extracting statistical features...

✓ Statistical features extracted

  Feature shapes:
    Training: (2049, 3)
    Dev:      (683, 3)
    Test:     (684, 3)

  Feature statistics (training set):

  Feature                     Min       Mean        Max        Std
  ----------------------------------------------------------------
  Word Count                11.00     301.86   15252.00    1070.77
  Avg Word Length            1.61       4.49       6.60       0.43
  Lexical Diversity          0.01       0.67       1.00       0.13



#### Feature Scaling for Statistical Features

**Why Scale?**

My three statistical features have very different ranges (observed on the training set):
- Word Count: [11, 15,252]
- Avg Word Length: [1.61, 6.60]
- Lexical Diversity: [0.01, 1.00]

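**Problem without scaling:**
- Chi-squared feature selection **requires non-negative features**, so any scaling must keep values ≥ 0
- Features with larger ranges (here, word count) dominate distance calculations
- Some models (SVMs) are sensitive to feature scale

**My solution: MinMaxScaler**
- **Formula**: `(x - min) / (max - min)`
- **Result**: All training features scaled to [0, 1] range
- **Advantage**: Output is non-negative on the training data, as chi-squared requires. Note that it shifts each feature by its minimum, so zeros are not preserved in general, and dev/test values outside the training range can fall slightly outside [0, 1]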

**Alternative: StandardScaler**
- Formula: `(x - mean) / std`
- Result: Mean=0, Std=1
- **Why not used**: Can produce negative values → incompatible with chi-squared

**Fit the scaler on the training set only, then transform all sets:**
- **fit()** on training: Learns min/max values
- **transform()** on dev/test: Applies training min/max
- This prevents data leakage from dev/test into training
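
The fit-on-train, transform-everywhere rule can be sketched with plain NumPy, applying the min-max formula by hand. The numbers are made up (loosely modelled on the word-count column) and the helper name `scale` is hypothetical.

```python
import numpy as np

# Made-up word counts: three training documents, two dev documents
train = np.array([[11.0], [300.0], [15252.0]])
dev = np.array([[5.0], [20000.0]])

# "fit": learn min and max from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def scale(X):
    """'transform': apply the training min/max to any split."""
    return (X - col_min) / (col_max - col_min)

train_scaled = scale(train)
dev_scaled = scale(dev)

print("train:", train_scaled.ravel())
print("dev:  ", dev_scaled.ravel())
```

The training values land exactly in [0, 1], while a dev document shorter or longer than anything seen in training maps below 0 or above 1. If that matters downstream, `MinMaxScaler(clip=True)` is one option in recent scikit-learn versions, though it is not used in this notebook.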

In [None]:
print("=" * 70)
print("FEATURE SCALING")
print("=" * 70)

print("\nScaling statistical features to [0, 1] range...")
print("  Method: MinMaxScaler")
print("  Reason: Required for chi-squared feature selection")

# Initialize scaler
scaler = MinMaxScaler()

# CRITICAL: Fit on training data only
X_train_stats_scaled = scaler.fit_transform(X_train_stats)

# Transform dev and test using training statistics
X_dev_stats_scaled = scaler.transform(X_dev_stats)
X_test_stats_scaled = scaler.transform(X_test_stats)

print(f"\n Scaling complete")

# Verify scaling
print(f"\n  Scaled feature ranges (training):")
print(f"\n  {'Feature':<20} {'Min':>10} {'Mean':>10} {'Max':>10}")
print("  " + "-" * 46)
for i, name in enumerate(feature_names_stats):
    col = X_train_stats_scaled[:, i]
    print(f"  {name:<20} {col.min():>10.4f} {col.mean():>10.4f} {col.max():>10.4f}")

print("\n   All features now in [0, 1] range")
print("\n" + "=" * 70)

FEATURE SCALING

Scaling statistical features to [0, 1] range...
  Method: MinMaxScaler
  Reason: Required for chi-squared feature selection

✓ Scaling complete

  Scaled feature ranges (training):

  Feature                     Min       Mean        Max
  ----------------------------------------------
  Word Count               0.0000     0.0191     1.0000
  Avg Word Length          0.0000     0.5763     1.0000
  Lexical Diversity        0.0000     0.6710     1.0000

  ✓ All features now in [0, 1] range



### 6.3. Feature Combination <a id='section6_3'></a>

#### Combining Sparse and Dense Features

There are now two feature types:
1. **TF-IDF features**: Sparse matrix (many zeros) - 19,226 features
2. **Statistical features**: Dense array (nearly all values non-zero) - 3 features

**Problem**: How to combine sparse and dense data efficiently?

**Solution**: Use `scipy.sparse.hstack()`
- Horizontally stacks (concatenates) feature matrices
- Maintains sparse format for efficiency
- Result: Single sparse matrix with all features

**Process**:
1. Convert dense statistical features to sparse format (`csr_matrix`)
2. Horizontally stack: [TF-IDF features | Statistical features]
3. Result: Combined sparse matrix

**Memory efficiency**:
- Sparse format stores only non-zero values
- TF-IDF is ~99.5% sparse → huge memory savings
- Critical for large feature spaces (19,229 features)

**Final feature vector for each document**:
```
[word_1_tfidf, word_2_tfidf, ..., word_19226_tfidf, word_count, avg_word_len, lexical_div]
```
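
A minimal sketch of the stacking step on toy matrices, assuming two documents, four TF-IDF columns and three statistical columns (all values invented for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy TF-IDF block: mostly zeros, stored sparsely
tfidf_like = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.7],
]))

# Toy statistical block: dense, already scaled
stats_dense = np.array([
    [0.02, 0.58, 0.67],
    [0.10, 0.40, 1.00],
])

# Convert the dense block to sparse, then concatenate column-wise
combined = hstack([tfidf_like, csr_matrix(stats_dense)], format='csr')

print(combined.shape)   # rows stay the same, columns add up
print(combined.nnz)     # non-zero entries from both blocks
```

Passing `format='csr'` is optional. Without it `hstack` returns a COO matrix, which scikit-learn estimators generally convert internally.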

In [None]:
print("=" * 70)
print("FEATURE COMBINATION")
print("=" * 70)

print("\nCombining TF-IDF and statistical features...")

# Convert dense statistical features to sparse format
# This is memory-efficient and allows combining with TF-IDF
X_train_stats_sparse = csr_matrix(X_train_stats_scaled)
X_dev_stats_sparse = csr_matrix(X_dev_stats_scaled)
X_test_stats_sparse = csr_matrix(X_test_stats_scaled)

print(f"\n  Converting statistical features to sparse format...")
print(f"    Training stats: {X_train_stats_scaled.shape} (dense) → {X_train_stats_sparse.shape} (sparse)")

# Horizontally stack features
# Result: [TF-IDF features | Statistical features]
X_train_combined = hstack([X_train_tfidf, X_train_stats_sparse])
X_dev_combined = hstack([X_dev_tfidf, X_dev_stats_sparse])
X_test_combined = hstack([X_test_tfidf, X_test_stats_sparse])

print(f"\n Feature combination complete")

print(f"\n  Combined feature matrix:")
print(f"    Training:   {X_train_combined.shape}")
print(f"    Dev:        {X_dev_combined.shape}")
print(f"    Test:       {X_test_combined.shape}")

print(f"\n  Feature breakdown:")
print(f"    TF-IDF features:      {X_train_tfidf.shape[1]:,}")
print(f"    Statistical features:      {X_train_stats_sparse.shape[1]}")
print(f"    " + "-" * 35)
print(f"    Total features:       {X_train_combined.shape[1]:,}")

# Memory efficiency check
sparsity = 100 * (1 - X_train_combined.nnz / (X_train_combined.shape[0] * X_train_combined.shape[1]))
print(f"\n  Memory efficiency:")
print(f"    Sparsity: {sparsity:.2f}%")
print(f"    Non-zero elements: {X_train_combined.nnz:,} / {X_train_combined.shape[0] * X_train_combined.shape[1]:,}")

print("\n" + "=" * 70)

FEATURE COMBINATION

Combining TF-IDF and statistical features...

  Converting statistical features to sparse format...
    Training stats: (2049, 3) (dense) → (2049, 3) (sparse)

✓ Feature combination complete

  Combined feature matrix:
    Training:   (2049, 19229)
    Dev:        (683, 19229)
    Test:       (684, 19229)

  Feature breakdown:
    TF-IDF features:      19,226
    Statistical features:      3
    -----------------------------------
    Total features:       19,229

  Memory efficiency:
    Sparsity: 99.48%
    Non-zero elements: 205,752 / 39,400,221



## 7. Feature Selection (Development Set) <a id='section7'></a>

### Why Feature Selection?

The combined matrix has 19,229 features, which creates several problems:
1. **Curse of dimensionality**: High dimensions → sparse data → poor generalization
2. **Computational cost**: More features = slower training
3. **Overfitting risk**: Model may learn noise instead of signal
4. **Noise**: Many features are irrelevant or redundant

**Solution**: Use feature selection to keep only the most informative features.

### Chi-Squared (χ²) Feature Selection

**What it measures:**  
Chi-squared tests the independence between each feature and the class labels.

**Formula:**  
χ² = Σ (Observed - Expected)² / Expected

**Intuition:**
- **High χ²**: Feature strongly associated with certain classes
- **Low χ²**: Feature appears randomly across classes

**Example:**
- Word "hockey" appears frequently in sports class → High χ²
- Word "the" appears equally in all classes → Low χ²

**Why chi-squared for text?**
- Fast computation
- Works well with sparse data
- Interpretable results
- Commonly used for text classification

**Requirement**: Features must be non-negative (hence my MinMaxScaler choice)
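
The hockey/the example can be worked through numerically. The sketch below computes χ² by hand with NumPy on a four-document toy matrix. It follows, to my understanding, the same observed-versus-expected scheme that `sklearn.feature_selection.chi2` uses, but the data and variable names are invented for illustration.

```python
import numpy as np

# Toy feature matrix: columns are ["hockey", "the"], one row per document
X = np.array([
    [1.0, 1.0],
    [1.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],
])
y = np.array(['sports', 'sports', 'other', 'other'])

classes = np.unique(y)
# One-hot class membership, shape (n_samples, n_classes)
Y = (y[:, None] == classes[None, :]).astype(float)

observed = Y.T @ X                      # per-class sum of each feature
class_prob = Y.mean(axis=0)             # share of documents in each class
feature_total = X.sum(axis=0)           # total mass of each feature
expected = np.outer(class_prob, feature_total)

chi2_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
print(dict(zip(["hockey", "the"], chi2_scores)))
```

"hockey" occurs only in the sports documents, so its observed counts differ from the expected ones and it scores above zero. "the" is spread evenly across classes, so observed equals expected and its score is zero.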

### Development Set Experiments

I tested several values of k (number of features to keep):
- **k=1,000**: Very aggressive reduction (~95% features removed)
- **k=5,000**: Moderate reduction (74% features removed)
- **k=10,000**: Conservative reduction (48% features removed)
- **k=15,000**: Minimal reduction (22% features removed)
- **k=all (19,229)**: No reduction (baseline)

**Why use development set?**
- Simulates test set performance
- Allows fair comparison of different k values
- Prevents overfitting to test set

In [None]:
print("=" * 70)
print("FEATURE SELECTION: DEVELOPMENT SET EXPERIMENTS")
print("=" * 70)

print("\nTesting different feature set sizes...")
print("Method: Chi-squared (χ²) statistical test")
print("Model: LinearSVC (for speed)")

# Test different k values
# Note: Adapted to my feature space size (19,229)
k_values = [1000, 5000, 10000, 15000, X_train_combined.shape[1]]
dev_results = []

print(f"\nTotal features available: {X_train_combined.shape[1]:,}")
print(f"\nTesting k values: {[f'{k:,}' for k in k_values]}")
print("\n" + "-" * 70)

for k in k_values:
    # Handle case where k >= total features
    if k >= X_train_combined.shape[1]:
        k_actual = X_train_combined.shape[1]
        X_train_selected = X_train_combined
        X_dev_selected = X_dev_combined
        print(f"\nk={k_actual:,} (all features - no selection)")
    else:
        k_actual = k
        print(f"\nk={k_actual:,} ({100*k_actual/X_train_combined.shape[1]:.1f}% of features)")
        
        # Perform feature selection
        # fit() on training: Learn which features are most informative
        # transform() on dev: Apply same selection
        selector = SelectKBest(chi2, k=k_actual)
        X_train_selected = selector.fit_transform(X_train_combined, y_train)
        X_dev_selected = selector.transform(X_dev_combined)
    
    # Train model on selected features
    # Using LinearSVC for speed (feature selection experiments take time)
    model = LinearSVC(C=1.0, max_iter=2000, random_state=42)
    model.fit(X_train_selected, y_train)
    
    # Evaluate on development set
    y_pred_dev = model.predict(X_dev_selected)
    acc = accuracy_score(y_dev, y_pred_dev)
    
    # Store results
    dev_results.append({
        'k': k_actual,
        'dev_accuracy': acc,
        'reduction': 100 * (1 - k_actual / X_train_combined.shape[1])
    })
    
    print(f"  Reduction: {dev_results[-1]['reduction']:5.1f}% features removed")
    print(f"  Dev Accuracy: {acc:.4f} ({acc*100:.2f}%)")

# Find best k
best_result = max(dev_results, key=lambda x: x['dev_accuracy'])
best_k = best_result['k']

print("\n" + "=" * 70)
print("FEATURE SELECTION RESULTS SUMMARY")
print("=" * 70)

print(f"\n{'k':<10} {'% Kept':>10} {'Dev Accuracy':>15} {'Status'}")
print("-" * 60)
for result in dev_results:
    pct_kept = 100 * result['k'] / X_train_combined.shape[1]
    marker = " ★ BEST" if result['k'] == best_k else ""
    print(f"{result['k']:<10,} {pct_kept:>9.1f}% {result['dev_accuracy']:>14.4f}{marker}")

print("\n" + "=" * 70)
print(f"✓ BEST k: {best_k:,} features")
print(f"  Dev Accuracy: {best_result['dev_accuracy']:.4f} ({best_result['dev_accuracy']*100:.2f}%)")
print(f"  Reduction: {best_result['reduction']:.1f}% features removed")
print("=" * 70)

FEATURE SELECTION: DEVELOPMENT SET EXPERIMENTS

Testing different feature set sizes...
Method: Chi-squared (χ²) statistical test
Model: LinearSVC (for speed)

Total features available: 19,229

Testing k values: ['1,000', '5,000', '10,000', '15,000', '19,229']

----------------------------------------------------------------------

k=1,000 (5.2% of features)
  Reduction:  94.8% features removed
  Dev Accuracy: 0.8463 (84.63%)

k=5,000 (26.0% of features)
  Reduction:  74.0% features removed
  Dev Accuracy: 0.8917 (89.17%)

k=10,000 (52.0% of features)
  Reduction:  48.0% features removed
  Dev Accuracy: 0.9004 (90.04%)

k=15,000 (78.0% of features)
  Reduction:  22.0% features removed
  Dev Accuracy: 0.9034 (90.34%)

k=19,229 (all features - no selection)
  Reduction:   0.0% features removed
  Dev Accuracy: 0.8975 (89.75%)

FEATURE SELECTION RESULTS SUMMARY

k              % Kept    Dev Accuracy Status
------------------------------------------------------------
1,000            5.2%   

### Apply Best Feature Selection

In [None]:
print("\n" + "=" * 70)
print("APPLYING BEST FEATURE SELECTION")
print("=" * 70)

if best_k >= X_train_combined.shape[1]:
    # Use all features
    print(f"\nUsing all {X_train_combined.shape[1]:,} features (no selection needed)")
    X_train_final = X_train_combined
    X_dev_final = X_dev_combined
    X_test_final = X_test_combined
else:
    # Apply feature selection with best k
    print(f"\nSelecting top {best_k:,} features using chi-squared...")
    final_selector = SelectKBest(chi2, k=best_k)
    
    # Fit on training and transform all sets
    X_train_final = final_selector.fit_transform(X_train_combined, y_train)
    X_dev_final = final_selector.transform(X_dev_combined)
    X_test_final = final_selector.transform(X_test_combined)
    
    print("✓ Feature selection applied")

print(f"\nFinal feature matrix shapes:")
print(f"  Training:   {X_train_final.shape}")
print(f"  Dev:        {X_dev_final.shape}")
print(f"  Test:       {X_test_final.shape}")

print("\n" + "=" * 70)


APPLYING BEST FEATURE SELECTION

Selecting top 15,000 features using chi-squared...
✓ Feature selection applied

Final feature matrix shapes:
  Training:   (2049, 15000)
  Dev:        (683, 15000)
  Test:       (684, 15000)



## 8. Model Selection (Development Set) <a id='section8'></a>

### Model Comparison Strategy

Comparing multiple models to find the best classifier for our task:

**Models Tested:**

1. **LinearSVC (C=1.0)** - Linear Support Vector Machine

2. **LinearSVC (C=0.5)** - LinearSVC with stronger regularization

3. **Logistic Regression** - Probabilistic linear classifier

4. **Multinomial Naive Bayes** - Probabilistic classifier

### Why These Models?

All are **linear models** suited for high-dimensional text:
- High-dimensional, sparse text features are often close to linearly separable
- Non-linear models (trees, neural nets) typically need more data than the 2,049 training samples available here
- Linear models are interpretable (can examine feature weights)
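
As a rough illustration of the interpretability point, the sketch below ranks features by their weight for a single class. In scikit-learn, a fitted `LinearSVC` exposes such weights in `coef_` (one row per class for a multi-class problem). The feature names and weight values here are hypothetical, and `top_features` is a helper invented for this example.

```python
def top_features(weights, feature_names, n=3):
    """Return the n feature names with the largest weights for one class."""
    ranked = sorted(zip(weights, feature_names), reverse=True)
    return [name for _, name in ranked[:n]]


# Hypothetical weights for one class (e.g. one row of a coefficient matrix)
feature_names = ["hockey", "goal", "the", "senate", "season"]
sports_weights = [1.9, 1.2, 0.01, -1.4, 0.8]

top_sports = top_features(sports_weights, feature_names)
print(top_sports)  # ['hockey', 'goal', 'season']
```

Strongly positive weights push a document towards the class, strongly negative weights push it away, and weights near zero contribute little.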


In [None]:
print("=" * 70)
print("MODEL SELECTION: DEVELOPMENT SET EXPERIMENTS")
print("=" * 70)

print("\nTesting multiple classification models...")
print(f"Feature dimensions: {X_train_final.shape[1]:,}")
print(f"\n" + "-" * 70)

# Dictionary to store results
model_results = []

# Model 1: LinearSVC with C=1.0
print("\n[1/4] LinearSVC (C=1.0)")
print("  Description: Linear SVM with standard regularization")
lsvc_10 = LinearSVC(C=1.0, max_iter=2000, random_state=42)
lsvc_10.fit(X_train_final, y_train)
y_pred_lsvc10 = lsvc_10.predict(X_dev_final)
acc_lsvc10 = accuracy_score(y_dev, y_pred_lsvc10)
model_results.append({
    'name': 'LinearSVC (C=1.0)',
    'model': lsvc_10,
    'dev_accuracy': acc_lsvc10
})
print(f"  Dev Accuracy: {acc_lsvc10:.4f} ({acc_lsvc10*100:.2f}%)")

# Model 2: LinearSVC with C=0.5
print("\n[2/4] LinearSVC (C=0.5)")
print("  Description: Linear SVM with stronger regularization")
lsvc_05 = LinearSVC(C=0.5, max_iter=2000, random_state=42)
lsvc_05.fit(X_train_final, y_train)
y_pred_lsvc05 = lsvc_05.predict(X_dev_final)
acc_lsvc05 = accuracy_score(y_dev, y_pred_lsvc05)
model_results.append({
    'name': 'LinearSVC (C=0.5)',
    'model': lsvc_05,
    'dev_accuracy': acc_lsvc05
})
print(f"  Dev Accuracy: {acc_lsvc05:.4f} ({acc_lsvc05*100:.2f}%)")

# Model 3: Logistic Regression
print("\n[3/4] Logistic Regression")
print("  Description: Probabilistic linear classifier")
lr = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
lr.fit(X_train_final, y_train)
y_pred_lr = lr.predict(X_dev_final)
acc_lr = accuracy_score(y_dev, y_pred_lr)
model_results.append({
    'name': 'Logistic Regression',
    'model': lr,
    'dev_accuracy': acc_lr
})
print(f"  Dev Accuracy: {acc_lr:.4f} ({acc_lr*100:.2f}%)")

# Model 4: Multinomial Naive Bayes
print("\n[4/4] Multinomial Naive Bayes")
print("  Description: Fast probabilistic classifier")
mnb = MultinomialNB(alpha=0.1)
mnb.fit(X_train_final, y_train)
y_pred_mnb = mnb.predict(X_dev_final)
acc_mnb = accuracy_score(y_dev, y_pred_mnb)
model_results.append({
    'name': 'Multinomial NB',
    'model': mnb,
    'dev_accuracy': acc_mnb
})
print(f"  Dev Accuracy: {acc_mnb:.4f} ({acc_mnb*100:.2f}%)")

# Find best model
best_model_result = max(model_results, key=lambda x: x['dev_accuracy'])
final_model = best_model_result['model']
final_model_name = best_model_result['name']

print("\n" + "=" * 70)
print("MODEL SELECTION RESULTS")
print("=" * 70)

# Sort by accuracy
sorted_results = sorted(model_results, key=lambda x: x['dev_accuracy'], reverse=True)

print(f"\n{'Rank':<6} {'Model':<25} {'Dev Accuracy':>15}")
print("-" * 50)
for rank, result in enumerate(sorted_results, 1):
    marker = " ★ BEST" if rank == 1 else ""
    print(f"{rank:<6} {result['name']:<25} {result['dev_accuracy']:>14.4f}{marker}")

print("\n" + "=" * 70)
print(f"✓ BEST MODEL: {final_model_name}")
print(f"  Dev Accuracy: {best_model_result['dev_accuracy']:.4f} ({best_model_result['dev_accuracy']*100:.2f}%)")
print("=" * 70)

MODEL SELECTION: DEVELOPMENT SET EXPERIMENTS

Testing multiple classification models...
Feature dimensions: 15,000

----------------------------------------------------------------------

[1/4] LinearSVC (C=1.0)
  Description: Linear SVM with standard regularization
  Dev Accuracy: 0.9034 (90.34%)

[2/4] LinearSVC (C=0.5)
  Description: Linear SVM with stronger regularization
  Dev Accuracy: 0.9019 (90.19%)

[3/4] Logistic Regression
  Description: Probabilistic linear classifier
  Dev Accuracy: 0.8799 (87.99%)

[4/4] Multinomial Naive Bayes
  Description: Fast probabilistic classifier
  Dev Accuracy: 0.8858 (88.58%)

MODEL SELECTION RESULTS

Rank   Model                        Dev Accuracy
--------------------------------------------------
1      LinearSVC (C=1.0)                 0.9034 ★ BEST
2      LinearSVC (C=0.5)                 0.9019
3      Multinomial NB                    0.8858
4      Logistic Regression               0.8799

✓ BEST MODEL: LinearSVC (C=1.0)
  Dev Accuracy: 0

## 9. Final Evaluation (Test Set) <a id='section9'></a>

### Final Model Training

Now that we've selected:
- Best number of features (k from feature selection)
- Best model (from model comparison)

We train the final model on **training + development** data:

**Why train on train+dev?**
- Development set was only used for tuning, not training
- More training data → better model

**Note**: We use `vstack` (vertical stacking) to combine the samples:
- Adds more ROWS (samples), not columns (features)
- Train: 2,049 samples + Dev: 683 samples = 2,732 samples

### Test Set Evaluation

**Note**: This is the first time the test set is used for prediction and scoring!

The test set provides an **unbiased estimate** of performance:
- Never seen during development
- Not used for any decisions
- Represents "real-world" performance

### Evaluation Metrics

We compute comprehensive metrics:

1. **Accuracy**: Overall correctness
   - Formula: (Correct predictions) / (Total predictions)

2. **Macro-averaged Precision**: Average precision across all classes
   - Formula: (Σ Precision per class) / (Number of classes)

3. **Macro-averaged Recall**: Average recall across all classes
   - Formula: (Σ Recall per class) / (Number of classes)

4. **Macro-averaged F1-score**: Average of the per-class F1 scores
   - Per-class F1: 2 × (Precision × Recall) / (Precision + Recall)
   - Formula: (Σ F1 per class) / (Number of classes)

**Why macro-averaging?**
- Assignment requirement
- Treats all classes equally (doesn't favor majority class)
- Better for class-imbalanced datasets
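
The following sketch computes these metrics by hand from a small confusion matrix (rows are true labels, columns are predicted labels). The 3-class matrix is made up for illustration. Note that the macro F1 here is the mean of the per-class F1 scores, which is how scikit-learn's `average='macro'` behaves. It is generally not equal to the harmonic mean of macro precision and macro recall, and the example prints both to show the difference.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) for each class of a square confusion matrix."""
    n = len(cm)
    metrics = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column total
        actual = sum(cm[i])                          # row total
        # Fall back to 0.0 on empty denominators (like zero_division=0)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Hypothetical 3-class confusion matrix
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 1, 9]]

metrics = per_class_metrics(cm)
n_classes = len(cm)

accuracy = sum(cm[i][i] for i in range(n_classes)) / sum(map(sum, cm))
macro_precision = sum(m[0] for m in metrics) / n_classes
macro_recall = sum(m[1] for m in metrics) / n_classes
macro_f1 = sum(m[2] for m in metrics) / n_classes

# For comparison only: harmonic mean of the two macro averages
harmonic_of_macros = (2 * macro_precision * macro_recall
                      / (macro_precision + macro_recall))

print(f"Accuracy:        {accuracy:.4f}")         # 0.7667
print(f"Macro precision: {macro_precision:.4f}")  # 0.7667
print(f"Macro recall:    {macro_recall:.4f}")     # 0.7667
print(f"Macro F1:        {macro_f1:.4f}")         # 0.7616
print(f"Harmonic mean of macro P and R: {harmonic_of_macros:.4f}")  # 0.7667
```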

In [None]:
print("=" * 70)
print("FINAL MODEL TRAINING (Train + Dev)")
print("=" * 70)

print(f"\nCombining training and development sets...")

# Vertically stack training and dev sets
# vstack adds more SAMPLES (rows), not features (columns)
X_traindev = vstack([X_train_final, X_dev_final])
y_traindev = np.concatenate([y_train, y_dev])

print(f"  Training set:     {X_train_final.shape}")
print(f"  Development set:  {X_dev_final.shape}")
print(f"  Combined set:     {X_traindev.shape}")
print(f"\n  Labels: {len(y_traindev):,} samples")

print(f"\nTraining final model: {final_model_name}...")
final_model.fit(X_traindev, y_traindev)
print("✓ Final model trained")

print("\n" + "=" * 70)
print("TEST SET EVALUATION (FINAL PERFORMANCE)")
print("=" * 70)

print("\nEvaluating on held-out test set...")
print("(First time touching test set!)")

# Predict on test set
y_pred_test = final_model.predict(X_test_final)

# Calculate all required metrics
test_accuracy = accuracy_score(y_test, y_pred_test)
test_precision = precision_score(y_test, y_pred_test, average='macro', zero_division=0)
test_recall = recall_score(y_test, y_pred_test, average='macro', zero_division=0)
test_f1 = f1_score(y_test, y_pred_test, average='macro', zero_division=0)

print("\n" + "=" * 70)
print("FINAL TEST SET PERFORMANCE")
print("=" * 70)

print(f"\nModel: {final_model_name}")
print(f"Features: {X_test_final.shape[1]:,}")
print(f"Test samples: {len(y_test):,}")

print(f"\n{'Metric':<30} {'Value':>10}")
print("-" * 42)
print(f"{'Accuracy':<30} {test_accuracy:>10.4f} ({test_accuracy*100:.2f}%)")
print(f"{'Macro-averaged Precision':<30} {test_precision:>10.4f}")
print(f"{'Macro-averaged Recall':<30} {test_recall:>10.4f}")
print(f"{'Macro-averaged F1-score':<30} {test_f1:>10.4f}")

print("\n" + "=" * 70)

# Check if requirement met
if test_accuracy >= 0.65:
    print(f"✓ SUCCESS: {test_accuracy*100:.2f}% exceeds 65% requirement!")
    margin = (test_accuracy - 0.65) * 100
    print(f"  Margin above requirement: +{margin:.2f} percentage points")
else:
    print(f"⚠ WARNING: {test_accuracy*100:.2f}% below 65% requirement")
    shortfall = (0.65 - test_accuracy) * 100
    print(f"  Shortfall: -{shortfall:.2f} percentage points")

print("=" * 70)

FINAL MODEL TRAINING (Train + Dev)

Combining training and development sets...
  Training set:     (2049, 15000)
  Development set:  (683, 15000)
  Combined set:     (2732, 15000)

  Labels: 2,732 samples

Training final model: LinearSVC (C=1.0)...
✓ Final model trained

TEST SET EVALUATION (FINAL PERFORMANCE)

Evaluating on held-out test set...
(First time touching test set!)

FINAL TEST SET PERFORMANCE

Model: LinearSVC (C=1.0)
Features: 15,000
Test samples: 684

Metric                              Value
------------------------------------------
Accuracy                           0.8874 (88.74%)
Macro-averaged Precision           0.8909
Macro-averaged Recall              0.8906
Macro-averaged F1-score            0.8906

✓ SUCCESS: 88.74% exceeds 65% requirement!
  Margin above requirement: +23.74 percentage points


## 10. Detailed Performance Analysis <a id='section10'></a>

### Per-Class Performance

Understanding which classes the model handles well (and which it struggles with) is crucial for:
- Identifying model weaknesses
- Guiding future improvements
- Understanding dataset characteristics
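
Per-class recall can be read directly off a confusion matrix row: the diagonal entry divided by the row total. As a small sanity-check sketch, the snippet below does this for the two complete rows (class-1 and class-2) reported in this section's confusion matrix output. The variable names are illustrative.

```python
# Column order of the confusion matrix
class_order = ["class-1", "class-2", "class-3", "class-4", "class-5", "class-6"]

# Rows copied from the reported confusion matrix (true label -> predicted counts)
rows = {
    "class-1": [95, 0, 0, 1, 0, 0],
    "class-2": [0, 101, 4, 5, 2, 5],
}

recalls = {}
for label, row in rows.items():
    correct = row[class_order.index(label)]  # diagonal entry
    recalls[label] = correct / sum(row)      # row total = support
    print(f"{label}: recall = {recalls[label]:.4f} (support {sum(row)})")
# class-1: recall = 0.9896 (support 96)
# class-2: recall = 0.8632 (support 117)
```

Both values agree with the recall column of the classification report, which is a quick way to confirm that rows correspond to true labels.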

In [15]:
print("\n" + "=" * 70)
print("DETAILED CLASSIFICATION REPORT")
print("=" * 70)

print(f"\nModel: {final_model_name}")
print(f"Test samples: {len(y_test):,}")
print("\n" + "-" * 70)

# Generate detailed classification report
print(classification_report(y_test, y_pred_test, zero_division=0, digits=4))

print("\n" + "=" * 70)
print("CONFUSION MATRIX")
print("=" * 70)

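# Calculate confusion matrix with an explicit label order so that its
# rows/columns always line up with the DataFrame index and columns below
classes = sorted(set(y_test) | set(y_pred_test))
cm = confusion_matrix(y_test, y_pred_test, labels=classes)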

# Display as DataFrame for readability
cm_df = pd.DataFrame(cm, index=classes, columns=classes)
print("\nRows: True labels, Columns: Predicted labels")
print(f"\n{cm_df}")

# Identify common misclassifications
print("\n" + "-" * 70)
print("Most Common Misclassifications:")
print("-" * 70)

misclassifications = []
for i in range(len(classes)):
    for j in range(len(classes)):
        if i != j and cm[i, j] > 0:  # Off-diagonal elements
            misclassifications.append((
                classes[i],  # True class
                classes[j],  # Predicted class
                cm[i, j]     # Count
            ))

# Sort by count
misclassifications.sort(key=lambda x: x[2], reverse=True)

print(f"\n{'True Class':<15} {'Predicted As':<15} {'Count':>10}")
print("-" * 42)
for true_class, pred_class, count in misclassifications[:10]:  # Top 10
    print(f"{true_class:<15} {pred_class:<15} {count:>10}")

print("\n" + "=" * 70)


DETAILED CLASSIFICATION REPORT

Model: LinearSVC (C=1.0)
Test samples: 684

----------------------------------------------------------------------
              precision    recall  f1-score   support

     class-1     0.9896    0.9896    0.9896        96
     class-2     0.8347    0.8632    0.8487       117
     class-3     0.8583    0.8655    0.8619       119
     class-4     0.8407    0.8051    0.8225       118
     class-5     0.9286    0.9043    0.9163       115
     class-6     0.8934    0.9160    0.9046       119

    accuracy                         0.8874       684
   macro avg     0.8909    0.8906    0.8906       684
weighted avg     0.8876    0.8874    0.8873       684


CONFUSION MATRIX

Rows: True labels, Columns: Predicted labels

         class-1  class-2  class-3  class-4  class-5  class-6
class-1       95        0        0        1        0        0
class-2        0      101        4        5        2        5
class-3        0        8      103        3        0      