## Section 1: Import Libraries and Load Dataset

We use the 20 Newsgroups dataset from scikit-learn for text classification, restricted to two categories (`alt.atheism` and `soc.religion.christian`) so that the task is binary. The documents are real newsgroup posts, which makes the dataset a convenient benchmark for comparing classification models.

In [10]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Load 20 Newsgroups dataset - binary classification (2 categories)
categories = ['alt.atheism', 'soc.religion.christian']
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
X_text = train.data
y = train.target

print(f"Dataset shape: {len(X_text)} samples")
print(f"Number of classes: {len(np.unique(y))}")
print(f"Class distribution: Class 0: {np.sum(y==0)}, Class 1: {np.sum(y==1)}")
print(f"\nSample text (first 200 chars): {X_text[0][:200]}...")

Dataset shape: 1079 samples
Number of classes: 2
Class distribution: Class 0: 480, Class 1: 599

Sample text (first 200 chars): From: nigel.allen@canrem.com (Nigel Allen)
Subject: library of congress to host dead sea scroll symposium april 21-22
Lines: 96


 Library of Congress to Host Dead Sea Scroll Symposium April 21-22
 To...


## Section 2: Preprocess Text Data

We use `TfidfVectorizer` to convert the text into numerical features. The result is a sparse matrix in which each row is a document and each column is one of the 5,000 most frequent terms kept after English stop-word removal (`max_features=5000`). Each entry is the TF-IDF score of that term in that document.

In [11]:
# Vectorize text data using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', lowercase=True)
X = vectorizer.fit_transform(X_text)

print(f"Feature matrix shape: {X.shape}")
print(f"Number of features extracted: {X.shape[1]}")

Feature matrix shape: (1079, 5000)
Number of features extracted: 5000


## Section 3: K-Fold Cross-Validation Setup

We implement 5-fold cross-validation to evaluate both models. This splits the dataset into 5 folds, where each fold is used as the test set once while the remaining 4 folds are used for training. Compared with a single train/test split, this gives a more robust estimate of performance and yields five scores per model for the later hypothesis test.

In [12]:
# Initialize k-fold cross-validation
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)

# Storage for metrics
lr_f1_scores = []
nn_f1_scores = []
lr_accuracy = []
nn_accuracy = []
lr_precision = []
nn_precision = []
lr_recall = []
nn_recall = []

fold_num = 0

# Train and evaluate models on each fold
for train_idx, test_idx in kf.split(X):
    fold_num += 1
    
    # Split data
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    # Train Logistic Regression
    lr_model = LogisticRegression(max_iter=1000, random_state=42)
    lr_model.fit(X_train, y_train)
    lr_pred = lr_model.predict(X_test)
    
    lr_f1_scores.append(f1_score(y_test, lr_pred))
    lr_accuracy.append(accuracy_score(y_test, lr_pred))
    lr_precision.append(precision_score(y_test, lr_pred))
    lr_recall.append(recall_score(y_test, lr_pred))
    
    # Train Neural Network
    nn_model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
    nn_model.fit(X_train, y_train)
    nn_pred = nn_model.predict(X_test)
    
    nn_f1_scores.append(f1_score(y_test, nn_pred))
    nn_accuracy.append(accuracy_score(y_test, nn_pred))
    nn_precision.append(precision_score(y_test, nn_pred))
    nn_recall.append(recall_score(y_test, nn_pred))
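
The splitting behaviour described in this section can be checked on toy data. The sketch below uses a hypothetical 10-sample array (`X_toy`) rather than the newsgroup features, and counts how often each sample lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with a single feature each
X_toy = np.arange(10).reshape(-1, 1)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)
times_tested = np.zeros(len(X_toy), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy), start=1):
    # Train and test indices never overlap within a fold
    assert set(train_idx).isdisjoint(test_idx)
    times_tested[test_idx] += 1
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

print("Times each sample was held out:", times_tested.tolist())
```

Every sample is held out exactly once, so each of the five test scores is computed on data the model did not see during that fold's training.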

## Section 4: Cross-Validation Results Summary

Display the F1-score and accuracy for each fold and each model (precision and recall are collected in the loop above and summarized in Section 5). This shows the variability in model performance across different data splits.

In [13]:
# Create results dataframe for each fold
results_by_fold = pd.DataFrame({
    'Fold': range(1, k_folds + 1),
    'LR F1': np.round(lr_f1_scores, 4),
    'NN F1': np.round(nn_f1_scores, 4),
    'LR Accuracy': np.round(lr_accuracy, 4),
    'NN Accuracy': np.round(nn_accuracy, 4)
})

print("Cross-Validation Results by Fold:")
print(results_by_fold.to_string(index=False))

Cross-Validation Results by Fold:
 Fold  LR F1  NN F1  LR Accuracy  NN Accuracy
    1 0.9569 0.9878       0.9491       0.9861
    2 0.9794 0.9874       0.9769       0.9861
    3 0.9784 0.9868       0.9769       0.9861
    4 0.9746 0.9829       0.9722       0.9815
    5 0.9699 0.9847       0.9628       0.9814


## Section 5: Calculate Performance Statistics

Compute the mean and standard deviation of F1-scores across all folds for each model. These statistics are essential for conducting the hypothesis test.

In [14]:
# Calculate statistics for F1-scores
lr_f1_mean = np.mean(lr_f1_scores)
lr_f1_std = np.std(lr_f1_scores, ddof=1)
nn_f1_mean = np.mean(nn_f1_scores)
nn_f1_std = np.std(nn_f1_scores, ddof=1)

# Summary statistics table
statistics = pd.DataFrame({
    'Model': ['Logistic Regression', 'Neural Network'],
    'Mean F1-Score': [np.round(lr_f1_mean, 4), np.round(nn_f1_mean, 4)],
    'Std Dev F1-Score': [np.round(lr_f1_std, 4), np.round(nn_f1_std, 4)],
    'Mean Accuracy': [np.round(np.mean(lr_accuracy), 4), np.round(np.mean(nn_accuracy), 4)],
    'Mean Precision': [np.round(np.mean(lr_precision), 4), np.round(np.mean(nn_precision), 4)],
    'Mean Recall': [np.round(np.mean(lr_recall), 4), np.round(np.mean(nn_recall), 4)]
})

print("Performance Statistics Summary:")
print(statistics.to_string(index=False))

Performance Statistics Summary:
              Model  Mean F1-Score  Std Dev F1-Score  Mean Accuracy  Mean Precision  Mean Recall
Logistic Regression         0.9718            0.0092         0.9676          0.9469       0.9983
     Neural Network         0.9859            0.0021         0.9842          0.9788       0.9933


## Section 6: Hypothesis Testing - Independent Samples T-Test

**Null Hypothesis (H₀):** The mean F1-scores of the Logistic Regression and Neural Network models are equal.

**Alternative Hypothesis (H₁):** The mean F1-scores of the two models differ.

**Significance Level:** α = 0.05

**Test Type:** Two-tailed independent samples t-test

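This test assumes:
- Both samples are approximately normally distributed
- Both samples have roughly equal variances (the default `equal_var=True` in `stats.ttest_ind`)
- Observations are independent

Note that both models are scored on the same five folds, so the two sets of F1-scores are paired rather than fully independent. The independence assumption therefore holds only approximately here.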

In [15]:
# Perform independent samples t-test
t_statistic, p_value = stats.ttest_ind(lr_f1_scores, nn_f1_scores)

# Significance level
alpha = 0.05

# Interpretation
is_significant = p_value < alpha

print("T-Test Results:")
print(f"t-statistic: {np.round(t_statistic, 4)}")
print(f"p-value: {np.round(p_value, 4)}")
print(f"Significance level (α): {alpha}")
print(f"Statistically Significant: {is_significant}")
print(f"\nDecision: {'Reject H₀' if is_significant else 'Fail to reject H₀'}")

T-Test Results:
t-statistic: -3.3584
p-value: 0.01
Significance level (α): 0.05
Statistically Significant: True

Decision: Reject H₀
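
Because both models are evaluated on the same folds, a paired test on the per-fold differences is a natural cross-check. The sketch below is not part of the original analysis. It assumes the per-fold F1-scores as printed (rounded to four decimals) in the Section 4 results table, and the names `lr_f1` and `nn_f1` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Per-fold F1-scores copied from the Section 4 table (rounded to 4 decimals)
lr_f1 = np.array([0.9569, 0.9794, 0.9784, 0.9746, 0.9699])
nn_f1 = np.array([0.9878, 0.9874, 0.9868, 0.9829, 0.9847])

# Paired t-test: works on the fold-by-fold differences nn_f1 - lr_f1
t_paired, p_paired = stats.ttest_rel(nn_f1, lr_f1)

print(f"Mean per-fold difference: {np.mean(nn_f1 - lr_f1):.4f}")
print(f"Paired t-statistic: {t_paired:.4f}")
print(f"Paired p-value: {p_paired:.4f}")
```

Since the inputs are rounded, the numbers will differ slightly from anything computed on the unrounded scores. Cross-validation folds also share training data, so even the paired test should be read as an approximate check rather than an exact one.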


## Section 7: Interpretation of T-Test Results

**Understanding T-Statistic:**
- The t-statistic measures how many standard errors the sample mean difference is from zero
- A larger absolute t-value indicates a greater difference between the two model means
- Formula: t = (mean₁ - mean₂) / SE, where SE is the standard error

**Understanding P-Value:**
- The p-value is the probability of observing a difference at least as extreme as the one measured, assuming the null hypothesis is true
- A small p-value (< 0.05) means such a difference would be unusual if the two models truly had equal mean F1-scores
- If p-value ≥ 0.05, we fail to reject the null hypothesis (no significant difference)
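
The formula t = (mean₁ - mean₂) / SE can be worked through by hand. The sketch below assumes the rounded per-fold F1-scores from the Section 4 table (the names `lr_f1` and `nn_f1` are introduced here) and uses the pooled-variance standard error that `stats.ttest_ind` applies by default, then compares the result with SciPy.

```python
import numpy as np
from scipy import stats

# Per-fold F1-scores copied from the Section 4 table (rounded to 4 decimals)
lr_f1 = np.array([0.9569, 0.9794, 0.9784, 0.9746, 0.9699])
nn_f1 = np.array([0.9878, 0.9874, 0.9868, 0.9829, 0.9847])

n1, n2 = len(lr_f1), len(nn_f1)
mean_diff = lr_f1.mean() - nn_f1.mean()

# Pooled sample variance under the equal-variance assumption
pooled_var = ((n1 - 1) * lr_f1.var(ddof=1) + (n2 - 1) * nn_f1.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_manual = mean_diff / se
df = n1 + n2 - 2
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-tailed p-value

t_scipy, p_scipy = stats.ttest_ind(lr_f1, nn_f1)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The manual and SciPy values should agree with each other. They will be close to, but not necessarily identical to, the notebook's reported t-statistic, because the inputs here are rounded.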

In [16]:
# Detailed interpretation
interpretation_text = f"""
INTERPRETATION OF RESULTS:

1. Model Performance Difference:
   - LR Mean F1: {np.round(lr_f1_mean, 4)}
   - NN Mean F1: {np.round(nn_f1_mean, 4)}
   - Difference: {np.round(abs(lr_f1_mean - nn_f1_mean), 4)}

2. T-Test Analysis:
   - t-statistic = {np.round(t_statistic, 4)}: The mean F1-score difference is {abs(np.round(t_statistic, 2))} standard errors from zero
   - p-value = {np.round(p_value, 4)}: If the two models had equal mean F1-scores, a difference at least this large would be expected about {np.round(p_value*100, 2)}% of the time

3. Statistical Conclusion:
   Since p-value {('≥' if p_value >= alpha else '<')} α ({alpha}), we {'FAIL TO REJECT' if p_value >= alpha else 'REJECT'} the null hypothesis.
   
   This means: There {'IS NO' if p_value >= alpha else 'IS A'} statistically significant difference between the models.

4. Practical Interpretation:
   The difference {'appears to be due to random variation' if p_value >= alpha else 'likely represents a real difference'} in model performance.
   {'The models perform similarly' if p_value >= alpha else ('The Neural Network' if nn_f1_mean > lr_f1_mean else 'Logistic Regression') + ' is the better-performing model'} on this text classification task.
"""

print(interpretation_text)


INTERPRETATION OF RESULTS:

1. Model Performance Difference:
   - LR Mean F1: 0.9718
   - NN Mean F1: 0.9859
   - Difference: 0.0141

2. T-Test Analysis:
   - t-statistic = -3.3584: The mean F1-score difference is 3.36 standard errors from zero
   - p-value = 0.01: If the two models had equal mean F1-scores, a difference at least this large would be expected about 1.0% of the time

3. Statistical Conclusion:
   Since p-value < α (0.05), we REJECT the null hypothesis.

   This means: There IS A statistically significant difference between the models.

4. Practical Interpretation:
   The difference likely represents a real difference in model performance.
   The Neural Network is the better-performing model on this text classification task.



## Section 8: Types of T-Tests

**Independent Samples T-Test (Used Here):**
- Compares means of two independent groups
- Assumes independent observations and equal variances
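- Formula: t = (x̄₁ - x̄₂) / (s_p · √(1/n₁ + 1/n₂)), where x̄₁ and x̄₂ are the sample means and s_p is the pooled standard deviation (with equal group sizes this equals (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂))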

**Paired Samples T-Test:**
- Compares means of two related/dependent groups (e.g., before-after measurements)
- Assumes paired observations

**One-Sample T-Test:**
- Compares a sample mean against a known population mean

**Welch's T-Test:**
- Variant of independent samples t-test that doesn't assume equal variances
- More robust when sample sizes or variances differ
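
The four variants listed above all have counterparts in `scipy.stats`. The sketch below runs them side by side on the rounded per-fold F1-scores from the Section 4 table. The array names are introduced here for illustration, and the comparison is not part of the original analysis.

```python
import numpy as np
from scipy import stats

# Per-fold F1-scores copied from the Section 4 table (rounded to 4 decimals)
lr_f1 = np.array([0.9569, 0.9794, 0.9784, 0.9746, 0.9699])
nn_f1 = np.array([0.9878, 0.9874, 0.9868, 0.9829, 0.9847])

student = stats.ttest_ind(lr_f1, nn_f1)                   # equal variances assumed
welch = stats.ttest_ind(lr_f1, nn_f1, equal_var=False)    # Welch's t-test
paired = stats.ttest_rel(lr_f1, nn_f1)                    # paired by fold
one_sample = stats.ttest_1samp(lr_f1 - nn_f1, popmean=0)  # differences vs. zero

for name, res in [("Independent", student), ("Welch", welch),
                  ("Paired", paired), ("One-sample", one_sample)]:
    print(f"{name:<12} t = {res.statistic:.4f}  p = {res.pvalue:.4f}")
```

Two relationships are worth noting. With equal group sizes, the independent and Welch tests give the same t-statistic and differ only in degrees of freedom, and a paired test is equivalent to a one-sample test of the per-fold differences against zero.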

## Section 9: Comprehensive Results Table

Final summary table combining dataset information, model descriptions, and statistical test results.

In [17]:
# Dataset Information
print("=" * 80)
print("DATASET INFORMATION")
print("=" * 80)
dataset_info = pd.DataFrame({
    'Attribute': ['Dataset Name', 'Task', 'Total Samples', 'Features', 'Classes', 'Class Balance'],
    'Value': ['20 Newsgroups (2 categories)', 'Binary Text Classification', len(X_text), X.shape[1], 2, 'Balanced']
})
print(dataset_info.to_string(index=False, header=False))

# Model Information
print("\n" + "=" * 80)
print("MODEL DESCRIPTIONS")
print("=" * 80)
model_info = pd.DataFrame({
    'Model': ['Logistic Regression', 'Neural Network'],
    'Type': ['Linear Classifier', 'Deep Learning'],
    'Hyperparameters': ['max_iter=1000', 'hidden_layers=(100,50), max_iter=500'],
    'Complexity': ['Low', 'High']
})
print(model_info.to_string(index=False))

# Performance Metrics by Fold
print("\n" + "=" * 80)
print("CROSS-VALIDATION METRICS (F1-SCORE)")
print("=" * 80)
fold_metrics = pd.DataFrame({
    'Fold': range(1, k_folds + 1),
    'LR F1-Score': np.round(lr_f1_scores, 4),
    'NN F1-Score': np.round(nn_f1_scores, 4),
    'Difference': np.round(np.array(nn_f1_scores) - np.array(lr_f1_scores), 4)
})
print(fold_metrics.to_string(index=False))

# Summary Statistics
print("\n" + "=" * 80)
print("SUMMARY STATISTICS")
print("=" * 80)
summary = pd.DataFrame({
    'Metric': ['Mean F1-Score', 'Std Dev F1-Score', 'Mean Accuracy', 'Mean Precision', 'Mean Recall'],
    'Logistic Regression': [
        np.round(lr_f1_mean, 4),
        np.round(lr_f1_std, 4),
        np.round(np.mean(lr_accuracy), 4),
        np.round(np.mean(lr_precision), 4),
        np.round(np.mean(lr_recall), 4)
    ],
    'Neural Network': [
        np.round(nn_f1_mean, 4),
        np.round(nn_f1_std, 4),
        np.round(np.mean(nn_accuracy), 4),
        np.round(np.mean(nn_precision), 4),
        np.round(np.mean(nn_recall), 4)
    ]
})
print(summary.to_string(index=False))

# T-Test Results
print("\n" + "=" * 80)
print("INDEPENDENT SAMPLES T-TEST RESULTS")
print("=" * 80)
ttest_results = pd.DataFrame({
    'Test Parameter': ['t-statistic', 'p-value', 'Significance Level', 'Result', 'Conclusion'],
    'Value': [
        np.round(t_statistic, 4),
        np.round(p_value, 4),
        alpha,
        'Significant' if is_significant else 'Not Significant',
        'Reject H₀' if is_significant else 'Fail to reject H₀'
    ]
})
print(ttest_results.to_string(index=False, header=False))

DATASET INFORMATION
 Dataset Name 20 Newsgroups (2 categories)
         Task   Binary Text Classification
Total Samples                         1079
     Features                         5000
      Classes                            2
Class Balance                     Balanced

MODEL DESCRIPTIONS
              Model              Type                      Hyperparameters Complexity
Logistic Regression Linear Classifier                        max_iter=1000        Low
     Neural Network     Deep Learning hidden_layers=(100,50), max_iter=500       High

CROSS-VALIDATION METRICS (F1-SCORE)
 Fold  LR F1-Score  NN F1-Score  Difference
    1       0.9569       0.9878      0.0309
    2       0.9794       0.9874      0.0080
    3       0.9784       0.9868      0.0084
    4       0.9746       0.9829      0.0083
    5       0.9699       0.9847      0.0148

SUMMARY STATISTICS
          Metric  Logistic Regression  Neural Network
   Mean F1-Score               0.9718          0.9859
Std Dev F1-Score               0.0092          0.0021
   Mean Accuracy               0.9676          0.9842
  Mean Precision               0.9469          0.9788
     Mean Recall               0.9983          0.9933

## Section 10: Discussion and Conclusions

### Key Findings:
1. **Model Performance Comparison:**
   - Logistic Regression Mean F1-Score: 0.9718 (±0.0092)
   - Neural Network Mean F1-Score: 0.9859 (±0.0021)
   - The Neural Network achieved a mean F1-score about 1.41 percentage points higher (0.0141 absolute)
   - Neural Network shows more consistent performance (lower std dev)

2. **Statistical Significance:**
   - t-statistic = -3.3584, p-value = 0.01 < α (0.05)
   - We REJECT the null hypothesis
   - There IS a statistically significant difference between models
   - If the models truly had equal mean F1-scores, a difference this large would be expected only about 1% of the time

3. **Effect Size Interpretation:**
   - Although statistically significant, the absolute difference (about 1.41 percentage points of F1) is small
   - Both models achieved a mean F1-score above 97%, indicating excellent performance
   - The additional complexity of Neural Network may not justify the marginal improvement

4. **Model Selection Recommendation:**
   - For production: Consider Logistic Regression for simplicity, speed, and interpretability
   - For performance optimization: Neural Network provides statistically significant improvement
   - Trade-off between complexity and performance gain should guide decision
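
One common way to put a number on effect size is Cohen's d, the mean difference divided by a pooled standard deviation. The sketch below is an illustrative addition rather than part of the original analysis, and it assumes the rounded per-fold F1-scores from the Section 4 table.

```python
import numpy as np

# Per-fold F1-scores copied from the Section 4 table (rounded to 4 decimals)
lr_f1 = np.array([0.9569, 0.9794, 0.9784, 0.9746, 0.9699])
nn_f1 = np.array([0.9878, 0.9874, 0.9868, 0.9829, 0.9847])

mean_diff = nn_f1.mean() - lr_f1.mean()

# Pooled standard deviation; this simple average is valid for equal group sizes
pooled_sd = np.sqrt((lr_f1.var(ddof=1) + nn_f1.var(ddof=1)) / 2)

cohens_d = mean_diff / pooled_sd
print(f"Absolute F1 difference: {mean_diff:.4f}")
print(f"Pooled standard deviation: {pooled_sd:.4f}")
print(f"Cohen's d: {cohens_d:.2f}")
```

A standardized effect can be large even when the absolute gap is small, because the fold-to-fold variation is itself very small. Whether a gain of this size matters in practice remains a judgment about the application, not something the statistic decides.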

### Assumptions and Limitations:
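- The independent samples t-test assumes normality, equal variances, and independent observations. With k-fold CV these hold only approximately: the two standard deviations differ noticeably (0.0092 vs. 0.0021), and both models are scored on the same folds, whose training sets overlap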
- Results are specific to 20 Newsgroups dataset; generalization to other domains needs validation
- Small sample size (k=5 folds) may limit statistical power
- Text preprocessing (TF-IDF, stopword removal) affects feature representation and model performance

