# Conditional Independence and Naive Bayes Experiments

This notebook explores the concept of conditional independence and its practical application in Naive Bayes classifiers. We'll conduct several experiments to understand:

1. **Conditional Independence Theory**: What it means and how to test it
2. **Graphical Models**: Visualizing independence relationships
3. **Naive Bayes Implementation**: From scratch and using scikit-learn
4. **Real-world Applications**: Text classification and medical diagnosis examples
5. **Performance Analysis**: When Naive Bayes works well despite violated assumptions

Let's start by importing the necessary libraries and setting up our environment.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
import warnings
warnings.filterwarnings('ignore')

# For machine learning
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.datasets import make_classification, fetch_20newsgroups

# For graphical models visualization
import networkx as nx

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (10, 6)

print("All libraries imported successfully!")
print("Numpy version:", np.__version__)
print("Pandas version:", pd.__version__)

## 1. Understanding Conditional Independence

Two random variables A and B are conditionally independent given a third variable C if, for every value of C with nonzero probability:

$$P(A, B | C) = P(A | C) \cdot P(B | C)$$

Or equivalently, whenever $P(B, C) > 0$:
$$P(A | B, C) = P(A | C)$$
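
As a quick numeric check of the definition, the sketch below builds a joint distribution over three binary variables from assumed values of P(C), P(A | C), and P(B | C). The numbers and the helper names `bernoulli` and `prob` are illustrative only. It then confirms that the factorization holds within each value of C but fails once C is marginalized out.

```python
from itertools import product

# Illustrative (assumed) parameters for three binary variables.
p_c = {0: 0.7, 1: 0.3}            # P(C = c)
p_a_given_c = {0: 0.1, 1: 0.9}    # P(A = 1 | C = c)
p_b_given_c = {0: 0.05, 1: 0.95}  # P(B = 1 | C = c)

def bernoulli(p_one, value):
    """P(X = value) for a binary X with P(X = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

# Joint distribution constructed so that A and B are independent given C.
joint = {
    (a, b, c): p_c[c] * bernoulli(p_a_given_c[c], a) * bernoulli(p_b_given_c[c], b)
    for a, b, c in product([0, 1], repeat=3)
}

def prob(**fixed):
    """Sum the joint over all cells consistent with the fixed values."""
    total = 0.0
    for (a, b, c), p in joint.items():
        cell = {"a": a, "b": b, "c": c}
        if all(cell[name] == value for name, value in fixed.items()):
            total += p
    return total

# Conditional on C: P(A=1, B=1 | C=c) equals P(A=1 | C=c) * P(B=1 | C=c).
for c in (0, 1):
    lhs = prob(a=1, b=1, c=c) / prob(c=c)
    rhs = (prob(a=1, c=c) / prob(c=c)) * (prob(b=1, c=c) / prob(c=c))
    print(f"C={c}: P(A,B|C)={lhs:.4f}  P(A|C)P(B|C)={rhs:.4f}")

# Marginally: P(A=1, B=1) differs from P(A=1) * P(B=1).
marginal_joint = prob(a=1, b=1)
marginal_product = prob(a=1) * prob(b=1)
print(f"P(A,B)={marginal_joint:.4f}  P(A)P(B)={marginal_product:.4f}")
```

With these assumed numbers the marginal joint is 0.26 while the product of marginals is 0.1088, so A and B are dependent until C is known.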

Let's create a simple example to demonstrate this concept with synthetic data.

In [None]:
def generate_medical_data(n_samples=1000):
    """
    Generate synthetic medical data to demonstrate conditional independence.
    
    Scenario: Patient diagnosis
    - C: Has infection (hidden cause)
    - A: Has fever (symptom)
    - B: Lab test positive (diagnostic test)
    
    Both fever and lab test depend on infection status, but are
    conditionally independent given the infection status.
    """
    # Generate infection status (30% of patients have infection)
    infection = np.random.binomial(1, 0.3, n_samples)
    
    # Generate fever based on infection (90% with infection have fever, 10% without infection have fever)
    fever_prob = np.where(infection == 1, 0.9, 0.1)
    fever = np.random.binomial(1, fever_prob)
    
    # Generate lab test results based on infection (95% with infection test positive, 5% without infection test positive)
    lab_prob = np.where(infection == 1, 0.95, 0.05)
    lab_positive = np.random.binomial(1, lab_prob)
    
    return pd.DataFrame({
        'infection': infection,
        'fever': fever,
        'lab_positive': lab_positive
    })

# Generate the data
medical_data = generate_medical_data(5000)
print("Medical data sample:")
print(medical_data.head(10))
print(f"\nData shape: {medical_data.shape}")
print(f"\nInfection rate: {medical_data['infection'].mean():.3f}")
print(f"Fever rate: {medical_data['fever'].mean():.3f}")
print(f"Lab positive rate: {medical_data['lab_positive'].mean():.3f}")

In [None]:
def test_conditional_independence(data, var_a, var_b, condition_var):
    """
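    Test conditional independence using chi-squared tests.
    
    Tests whether var_a and var_b are conditionally independent given
    condition_var by running one test per value of condition_var.
    A p-value >= 0.05 means independence was not rejected; it does not prove it.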
    """
    results = {}
    
    # Test overall dependence (without conditioning)
    contingency_table = pd.crosstab(data[var_a], data[var_b])
    chi2, p_val_overall, dof, expected = chi2_contingency(contingency_table)
    results['overall'] = {
        'chi2': chi2,
        'p_value': p_val_overall,
        'dependent': p_val_overall < 0.05
    }
    
    # Test conditional independence for each value of conditioning variable
    condition_tests = {}
    for condition_value in data[condition_var].unique():
        subset = data[data[condition_var] == condition_value]
        if len(subset) > 10:  # Need sufficient data
            try:
                cont_table = pd.crosstab(subset[var_a], subset[var_b])
                if cont_table.shape == (2, 2):  # Only for 2x2 tables
                    chi2_cond, p_val_cond, _, _ = chi2_contingency(cont_table)
                    condition_tests[condition_value] = {
                        'chi2': chi2_cond,
                        'p_value': p_val_cond,
                        'dependent': p_val_cond < 0.05,
                        'sample_size': len(subset)
                    }
            except ValueError:
                # Handle cases where chi2 test can't be performed
                condition_tests[condition_value] = {
                    'chi2': None,
                    'p_value': None,
                    'dependent': None,
                    'sample_size': len(subset)
                }
    
    results['conditional'] = condition_tests
    return results

# Test conditional independence in our medical data
ci_results = test_conditional_independence(medical_data, 'fever', 'lab_positive', 'infection')

print("CONDITIONAL INDEPENDENCE TEST RESULTS")
print("="*50)
print(f"Overall dependence between fever and lab_positive:")
print(f"  Chi-squared: {ci_results['overall']['chi2']:.4f}")
print(f"  P-value: {ci_results['overall']['p_value']:.6f}")
print(f"  Dependent: {ci_results['overall']['dependent']}")
print()

print("Conditional independence tests:")
for condition, result in ci_results['conditional'].items():
    condition_name = "No Infection" if condition == 0 else "Has Infection"
    print(f"  Given {condition_name} (n={result['sample_size']}):")
    if result['p_value'] is not None:
        print(f"    Chi-squared: {result['chi2']:.4f}")
        print(f"    P-value: {result['p_value']:.6f}")
        print(f"    Dependent: {result['dependent']}")
    else:
        print(f"    Could not perform test (insufficient data)")
    print()

In [None]:
# Visualize the conditional independence relationships
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Joint distribution heatmap
joint_prob = pd.crosstab(medical_data['fever'], medical_data['lab_positive'], normalize='all')
sns.heatmap(joint_prob, annot=True, fmt='.3f', cmap='Blues', ax=axes[0,0])
axes[0,0].set_title('Joint P(Fever, Lab+)')
axes[0,0].set_xlabel('Lab Positive')
axes[0,0].set_ylabel('Fever')

# 2. Conditional distributions given infection status

# For no infection group
no_infection = medical_data[medical_data['infection'] == 0]
if len(no_infection) > 0:
    cond_prob_no_inf = pd.crosstab(no_infection['fever'], no_infection['lab_positive'], normalize='all')
    sns.heatmap(cond_prob_no_inf, annot=True, fmt='.3f', cmap='Reds', ax=axes[0,1])
    axes[0,1].set_title('P(Fever, Lab+ | No Infection)')
    axes[0,1].set_xlabel('Lab Positive')
    axes[0,1].set_ylabel('Fever')

# For infection group
has_infection = medical_data[medical_data['infection'] == 1]
if len(has_infection) > 0:
    cond_prob_inf = pd.crosstab(has_infection['fever'], has_infection['lab_positive'], normalize='all')
    sns.heatmap(cond_prob_inf, annot=True, fmt='.3f', cmap='Greens', ax=axes[1,0])
    axes[1,0].set_title('P(Fever, Lab+ | Has Infection)')
    axes[1,0].set_xlabel('Lab Positive')
    axes[1,0].set_ylabel('Fever')

# 3. Correlation analysis
correlations = []
for infection_status in [0, 1]:
    subset = medical_data[medical_data['infection'] == infection_status]
    if len(subset) > 1:
        corr = subset[['fever', 'lab_positive']].corr().iloc[0, 1]
        correlations.append(corr)
    else:
        correlations.append(0)

axes[1,1].bar(['No Infection', 'Has Infection'], correlations, 
              color=['red', 'green'], alpha=0.7)
axes[1,1].set_title('Correlation between Fever and Lab+ by Infection Status')
axes[1,1].set_ylabel('Correlation Coefficient')
axes[1,1].axhline(y=0, color='black', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

# Print conditional probabilities
print("CONDITIONAL PROBABILITY ANALYSIS")
print("="*40)
for infection_status in [0, 1]:
    status_name = "No Infection" if infection_status == 0 else "Has Infection"
    subset = medical_data[medical_data['infection'] == infection_status]
    
    print(f"\nGiven {status_name}:")
    # P(Fever=1, Lab+=1 | Infection)
    prob_both = len(subset[(subset['fever'] == 1) & (subset['lab_positive'] == 1)]) / len(subset)
    # P(Fever=1 | Infection) * P(Lab+=1 | Infection)
    prob_fever = subset['fever'].mean()
    prob_lab = subset['lab_positive'].mean()
    prob_product = prob_fever * prob_lab
    
    print(f"  P(Fever=1, Lab+=1 | {status_name}) = {prob_both:.4f}")
    print(f"  P(Fever=1 | {status_name}) * P(Lab+=1 | {status_name}) = {prob_fever:.4f} * {prob_lab:.4f} = {prob_product:.4f}")
    print(f"  Difference: {abs(prob_both - prob_product):.4f}")

## 2. Graphical Models and Independence Patterns

Graphical models help us visualize and understand independence relationships. There are three fundamental patterns:

1. **Common Cause (Tail-to-tail)**: A ← C → B
2. **Chain (Head-to-tail)**: A → C → B  
3. **Common Effect (Head-to-head)**: A → C ← B

Let's visualize these patterns and understand their independence implications.
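
The common-effect pattern is the least intuitive of the three, so a small simulation may help. The sketch below assumes a toy mechanism that is not part of the notebook's data: A and B are independent fair coins and C = A OR B. It estimates P(A = 1) with and without conditioning on C.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed toy mechanism: A and B are independent fair coins, C = A OR B.
samples = []
for _ in range(20000):
    a = rng.randint(0, 1)
    b = rng.randint(0, 1)
    samples.append((a, b, a | b))

def p_a_given(rows):
    """Empirical P(A = 1) within the given rows."""
    return sum(a for a, _, _ in rows) / len(rows)

# Unconditionally, knowing B tells us (almost) nothing about A.
p_a_b0 = p_a_given([r for r in samples if r[1] == 0])
p_a_b1 = p_a_given([r for r in samples if r[1] == 1])
print(f"P(A=1 | B=0) = {p_a_b0:.3f},  P(A=1 | B=1) = {p_a_b1:.3f}")

# Once C = 1 is observed, B = 0 forces A = 1 ("explaining away").
c1 = [r for r in samples if r[2] == 1]
p_a_b0_c1 = p_a_given([r for r in c1 if r[1] == 0])
p_a_b1_c1 = p_a_given([r for r in c1 if r[1] == 1])
print(f"P(A=1 | B=0, C=1) = {p_a_b0_c1:.3f},  P(A=1 | B=1, C=1) = {p_a_b1_c1:.3f}")
```

In this toy setup, observing the common effect makes the two causes informative about each other, which is the opposite of what conditioning does in the common-cause and chain patterns.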

In [None]:
def draw_graphical_model(edges, title, pos=None):
    """Draw a simple graphical model."""
    G = nx.DiGraph()
    G.add_edges_from(edges)
    
    if pos is None:
        pos = nx.spring_layout(G, seed=42)
    
    plt.figure(figsize=(8, 6))
    nx.draw(G, pos, with_labels=True, node_color='lightblue', 
            node_size=2000, font_size=16, font_weight='bold',
            arrows=True, arrowsize=20, edge_color='gray', arrowstyle='->')
    plt.title(title, fontsize=14, fontweight='bold')
    plt.axis('off')
    plt.show()

# Draw the three fundamental patterns
print("Three Fundamental Graphical Model Patterns:")
print("="*50)

# Pattern 1: Common Cause (our medical example)
print("\n1. Common Cause Pattern (Tail-to-tail): Infection → Fever, Infection → Lab+")
edges1 = [('Infection', 'Fever'), ('Infection', 'Lab+')]
pos1 = {'Infection': (0.5, 1), 'Fever': (0, 0), 'Lab+': (1, 0)}
draw_graphical_model(edges1, "Common Cause: A ← C → B\n(Fever ← Infection → Lab+)", pos1)

print("Independence: Fever ⊥ Lab+ | Infection")
print("Fever and Lab+ are conditionally independent given Infection status")

# Pattern 2: Chain
print("\n2. Chain Pattern (Head-to-tail): Weather → Umbrella Sales → Store Revenue")
edges2 = [('Weather', 'Umbrella Sales'), ('Umbrella Sales', 'Store Revenue')]
pos2 = {'Weather': (0, 0), 'Umbrella Sales': (0.5, 0), 'Store Revenue': (1, 0)}
draw_graphical_model(edges2, "Chain: A → C → B\n(Weather → Umbrella Sales → Store Revenue)", pos2)

print("Independence: Weather ⊥ Store Revenue | Umbrella Sales")
print("Weather and Store Revenue are conditionally independent given Umbrella Sales")

# Pattern 3: Common Effect
print("\n3. Common Effect Pattern (Head-to-head): Exercise → Health ← Diet")
edges3 = [('Exercise', 'Health'), ('Diet', 'Health')]
pos3 = {'Exercise': (0, 0), 'Diet': (1, 0), 'Health': (0.5, -0.5)}
draw_graphical_model(edges3, "Common Effect: A → C ← B\n(Exercise → Health ← Diet)", pos3)

print("Independence: Exercise ⊥ Diet (unconditionally)")
print("But Exercise ⊥̸ Diet | Health (dependent when Health is observed)")

## 3. Naive Bayes Classifier from Scratch

The Naive Bayes classifier makes the "naive" assumption that all features are conditionally independent given the class label:

$$P(x_1, x_2, ..., x_n | y) = \prod_{i=1}^{n} P(x_i | y)$$

Using Bayes' theorem:
$$P(y | x_1, x_2, ..., x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i | y)}{P(x_1, x_2, ..., x_n)}$$
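
As a worked instance of the formula, the sketch below computes the posterior by hand for two binary features, using the same rates as the synthetic medical generator in Section 1 (prior 0.3; fever rates 0.9 and 0.1; lab rates 0.95 and 0.05). The helper names `likelihood` and `posterior` are illustrative. The denominator is obtained by summing the numerators over both classes.

```python
# Rates taken from the synthetic medical generator in Section 1.
prior = {"infection": 0.3, "no_infection": 0.7}
p_fever = {"infection": 0.9, "no_infection": 0.1}    # P(fever = 1 | y)
p_lab = {"infection": 0.95, "no_infection": 0.05}    # P(lab+ = 1 | y)

def likelihood(p_one, observed):
    """P(x = observed | y) for a binary feature with P(x = 1 | y) = p_one."""
    return p_one if observed == 1 else 1.0 - p_one

def posterior(fever, lab):
    """P(y | fever, lab) under the naive factorization."""
    unnormalized = {
        y: prior[y] * likelihood(p_fever[y], fever) * likelihood(p_lab[y], lab)
        for y in prior
    }
    evidence = sum(unnormalized.values())  # P(x_1, x_2), the denominator
    return {y: value / evidence for y, value in unnormalized.items()}

for fever, lab in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    post = posterior(fever, lab)
    print(f"fever={fever}, lab+={lab}: P(infection | x) = {post['infection']:.4f}")
```

With both findings positive the posterior probability of infection is about 0.987; a fever with a negative lab test leaves it near 0.169.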

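Let's implement a Gaussian Naive Bayes classifier from scratch and test it on our medical data. Because `fever` and `lab_positive` are binary, the Gaussian likelihood is only an approximation here; a Bernoulli likelihood would match the data more closely.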

In [None]:
class NaiveBayesFromScratch:
    """
    Gaussian Naive Bayes classifier implemented from scratch.
    """
    
    def __init__(self):
        self.class_priors = {}
        self.feature_stats = {}  # Will store mean and std for each feature per class
        self.classes = None
        
    def fit(self, X, y):
        """
        Train the Naive Bayes classifier.
        
        Parameters:
        X: Feature matrix (n_samples, n_features)
        y: Target labels (n_samples,)
        """
        X = np.array(X)
        y = np.array(y)
        self.classes = np.unique(y)
        n_samples = len(y)
        
        # Calculate class priors P(y)
        for class_label in self.classes:
            class_count = np.sum(y == class_label)
            self.class_priors[class_label] = class_count / n_samples
        
        # Calculate feature statistics for each class
        self.feature_stats = {}
        for class_label in self.classes:
            # Get samples for this class
            class_mask = (y == class_label)
            class_features = X[class_mask]
            
            # Calculate mean and std for each feature
            self.feature_stats[class_label] = {
                'mean': np.mean(class_features, axis=0),
                'std': np.std(class_features, axis=0) + 1e-6  # Add small value to avoid division by zero
            }
    
    def _log_gaussian_density(self, x, mean, std):
        """Log of the Gaussian probability density (a density, not a probability)."""
        return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2
    
    def _joint_log_likelihood(self, X):
        """Return log P(y) + sum_i log p(x_i | y) with shape (n_samples, n_classes)."""
        X = np.asarray(X, dtype=float)
        columns = []
        for class_label in self.classes:
            stats = self.feature_stats[class_label]
            # Sum log densities over features (naive assumption). Working in log
            # space avoids the underflow a long product of small values causes.
            log_likelihood = self._log_gaussian_density(
                X, stats['mean'], stats['std']
            ).sum(axis=1)
            columns.append(np.log(self.class_priors[class_label]) + log_likelihood)
        return np.column_stack(columns)
    
    def predict(self, X):
        """Predict classes for multiple samples."""
        # Return the class with the highest joint log-likelihood for each sample
        return self.classes[np.argmax(self._joint_log_likelihood(X), axis=1)]
    
    def predict_proba(self, X):
        """Predict class probabilities for multiple samples."""
        jll = self._joint_log_likelihood(X)
        # Subtract the row-wise maximum before exponentiating so that at least
        # one term per row is exp(0) = 1 and the normalizer is never zero
        jll = jll - jll.max(axis=1, keepdims=True)
        probabilities = np.exp(jll)
        return probabilities / probabilities.sum(axis=1, keepdims=True)

# Test our implementation on medical data
print("Testing Naive Bayes from Scratch on Medical Data")
print("="*50)

# Prepare data
X = medical_data[['fever', 'lab_positive']].values
y = medical_data['infection'].values

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train our model
nb_scratch = NaiveBayesFromScratch()
nb_scratch.fit(X_train, y_train)

# Make predictions
y_pred_scratch = nb_scratch.predict(X_test)
y_proba_scratch = nb_scratch.predict_proba(X_test)

# Calculate accuracy
accuracy_scratch = accuracy_score(y_test, y_pred_scratch)
print(f"Accuracy (from scratch): {accuracy_scratch:.4f}")

# Compare with scikit-learn
nb_sklearn = GaussianNB()
nb_sklearn.fit(X_train, y_train)
y_pred_sklearn = nb_sklearn.predict(X_test)
accuracy_sklearn = accuracy_score(y_test, y_pred_sklearn)
print(f"Accuracy (scikit-learn): {accuracy_sklearn:.4f}")

print(f"\nClass priors learned by our model:")
for class_label, prior in nb_scratch.class_priors.items():
    class_name = "No Infection" if class_label == 0 else "Has Infection"
    print(f"  {class_name}: {prior:.4f}")

print(f"\nFeature statistics learned by our model:")
for class_label in nb_scratch.classes:
    class_name = "No Infection" if class_label == 0 else "Has Infection"
    stats = nb_scratch.feature_stats[class_label]
    print(f"  {class_name}:")
    print(f"    Fever - Mean: {stats['mean'][0]:.4f}, Std: {stats['std'][0]:.4f}")
    print(f"    Lab+ - Mean: {stats['mean'][1]:.4f}, Std: {stats['std'][1]:.4f}")

## 4. Text Classification with Naive Bayes

Naive Bayes is particularly effective for text classification. Let's implement a sentiment analysis system and explore why the "naive" assumption works well despite being violated in practice.

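We'll use a small, hand-written movie review dataset (30 short reviews) and compare different variants of Naive Bayes. With so few examples, the reported accuracies are illustrative only.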
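
Before turning to scikit-learn, it may help to see the multinomial model written out by hand. The sketch below uses a tiny assumed corpus (four made-up documents, not the dataset used in this notebook) and hypothetical helpers `log_likelihood` and `classify`. Each class score is the log prior plus the summed log-likelihoods of the words, with add-one (Laplace) smoothing so unseen words do not zero out a class.

```python
import math
from collections import Counter

# Tiny assumed corpus, for illustration only.
train = [
    ("amazing brilliant film", "pos"),
    ("brilliant acting", "pos"),
    ("terrible boring film", "neg"),
    ("awful acting", "neg"),
]

vocab = sorted({w for text, _ in train for w in text.split()})
word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def log_likelihood(word, label, alpha=1.0):
    """log P(word | label) with add-alpha (Laplace) smoothing."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))

def classify(text):
    """Return the best label and all scores (log prior + summed word log-likelihoods)."""
    scores = {}
    for label in word_counts:
        score = math.log(doc_counts[label] / len(train))
        # Words outside the training vocabulary are skipped.
        score += sum(log_likelihood(w, label) for w in text.split() if w in vocab)
        scores[label] = score
    return max(scores, key=scores.get), scores

label, scores = classify("brilliant film")
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Summing log-likelihoods word by word is exactly the naive independence assumption: each word contributes to the score without regard to which other words appear alongside it.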

In [None]:
# Create a simple movie review dataset
def create_movie_review_data():
    """Create synthetic movie review data for sentiment analysis."""
    
    positive_reviews = [
        "This movie was absolutely fantastic and amazing",
        "I loved every minute of this incredible film",
        "Outstanding performance and brilliant storytelling",
        "A masterpiece of cinema with excellent acting",
        "Wonderful plot and spectacular visuals",
        "This film exceeded all my expectations perfectly",
        "Brilliant direction and amazing cinematography",
        "Absolutely loved the characters and story",
        "An incredible journey with perfect ending",
        "Fantastic movie with outstanding performances",
        "Amazing plot twists and excellent writing",
        "Perfect blend of action and emotion",
        "Wonderful acting and brilliant dialogue",
        "This movie was truly spectacular",
        "Excellent film with amazing visual effects"
    ]
    
    negative_reviews = [
        "This movie was terrible and completely boring",
        "I hated every minute of this awful film",
        "Poor performance and terrible storytelling",
        "A disaster of cinema with horrible acting",
        "Awful plot and terrible visuals",
        "This film disappointed me completely",
        "Poor direction and terrible cinematography",
        "Absolutely hated the characters and story",
        "A boring journey with awful ending",
        "Terrible movie with poor performances",
        "Awful plot twists and poor writing",
        "Horrible blend of action and emotion",
        "Poor acting and terrible dialogue",
        "This movie was truly awful",
        "Terrible film with poor visual effects"
    ]
    
    # Create labels
    labels = [1] * len(positive_reviews) + [0] * len(negative_reviews)
    texts = positive_reviews + negative_reviews
    
    return texts, labels

# Generate the dataset
texts, labels = create_movie_review_data()
print(f"Created dataset with {len(texts)} reviews")
print(f"Positive reviews: {sum(labels)}")
print(f"Negative reviews: {len(labels) - sum(labels)}")

print("\nSample reviews:")
print("Positive:", texts[0])
print("Negative:", texts[15])

# Prepare data for machine learning
X_train_text, X_test_text, y_train_text, y_test_text = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

print(f"\nTraining set: {len(X_train_text)} reviews")
print(f"Test set: {len(X_test_text)} reviews")

In [None]:
# Compare different Naive Bayes variants for text classification
def compare_naive_bayes_variants(X_train, X_test, y_train, y_test):
    """Compare different Naive Bayes variants on text data."""
    
    results = {}
    
    # 1. Multinomial Naive Bayes with Count Vectorizer
    count_vectorizer = CountVectorizer(stop_words='english', lowercase=True)
    X_train_count = count_vectorizer.fit_transform(X_train)
    X_test_count = count_vectorizer.transform(X_test)
    
    mnb = MultinomialNB()
    mnb.fit(X_train_count, y_train)
    mnb_pred = mnb.predict(X_test_count)
    mnb_accuracy = accuracy_score(y_test, mnb_pred)
    results['Multinomial NB (Count)'] = mnb_accuracy
    
    # 2. Multinomial Naive Bayes with TF-IDF
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    
    mnb_tfidf = MultinomialNB()
    mnb_tfidf.fit(X_train_tfidf, y_train)
    mnb_tfidf_pred = mnb_tfidf.predict(X_test_tfidf)
    mnb_tfidf_accuracy = accuracy_score(y_test, mnb_tfidf_pred)
    results['Multinomial NB (TF-IDF)'] = mnb_tfidf_accuracy
    
    # 3. Bernoulli Naive Bayes
    bnb = BernoulliNB()
    bnb.fit(X_train_count, y_train)
    bnb_pred = bnb.predict(X_test_count)
    bnb_accuracy = accuracy_score(y_test, bnb_pred)
    results['Bernoulli NB'] = bnb_accuracy
    
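    # Return the count-based Multinomial NB for the feature analysis below
    # (it is returned unconditionally, not selected by test accuracy)
    return results, count_vectorizer, mnb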

# Test different variants
nb_results, vectorizer, best_model = compare_naive_bayes_variants(
    X_train_text, X_test_text, y_train_text, y_test_text
)

print("NAIVE BAYES VARIANTS COMPARISON")
print("="*40)
for model_name, accuracy in nb_results.items():
    print(f"{model_name}: {accuracy:.4f}")

# Analyze the learned vocabulary
feature_names = vectorizer.get_feature_names_out()
print(f"\nVocabulary size: {len(feature_names)}")
print(f"Sample features: {feature_names[:10]}")

# Get most informative features
def get_most_informative_features(model, vectorizer, n_features=10):
    """Get the most informative features for each class."""
    feature_names = vectorizer.get_feature_names_out()
    
    # Get log probabilities for each class
    class_0_log_probs = model.feature_log_prob_[0]  # Negative class
    class_1_log_probs = model.feature_log_prob_[1]  # Positive class
    
    # Log-probability ratio: log P(word | positive) - log P(word | negative)
    log_prob_ratio = class_1_log_probs - class_0_log_probs
    
    # Get top features for positive class
    positive_indices = np.argsort(log_prob_ratio)[-n_features:][::-1]
    positive_features = [(feature_names[i], log_prob_ratio[i]) for i in positive_indices]
    
    # Get top features for negative class
    negative_indices = np.argsort(log_prob_ratio)[:n_features]
    negative_features = [(feature_names[i], log_prob_ratio[i]) for i in negative_indices]
    
    return positive_features, negative_features

positive_features, negative_features = get_most_informative_features(best_model, vectorizer)

print("\nMOST INFORMATIVE FEATURES")
print("="*30)
print("Top features for POSITIVE sentiment:")
for feature, ratio in positive_features:
    print(f"  {feature}: {ratio:.4f}")

print("\nTop features for NEGATIVE sentiment:")
for feature, ratio in negative_features:
    print(f"  {feature}: {ratio:.4f}")

# Test predictions on new examples
test_examples = [
    "This movie was absolutely fantastic",
    "Terrible film with awful acting",
    "Amazing story and brilliant performance",
    "Boring and disappointing movie"
]

X_test_examples = vectorizer.transform(test_examples)
predictions = best_model.predict(X_test_examples)
probabilities = best_model.predict_proba(X_test_examples)

print("\nPREDICTIONS ON NEW EXAMPLES")
print("="*35)
for text, pred, prob in zip(test_examples, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    confidence = max(prob)
    print(f"Text: '{text}'")
    print(f"Prediction: {sentiment} (confidence: {confidence:.4f})")
    print()

## 5. Analyzing the "Naive" Assumption

The "naive" assumption in Naive Bayes is that features are conditionally independent given the class. In reality, this assumption is often violated. Let's explore:

1. How to detect when the assumption is violated
2. Why Naive Bayes still works well despite violated assumptions
3. When the assumption matters most
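
A simple way to see both the damage and the robustness is to take the violation to its extreme: count the same feature several times, so the copies are perfectly dependent. The sketch below assumes a one-feature problem with equal priors and unit-variance Gaussian class conditionals centered at +1 and -1; the setup and the helper `naive_posterior` are illustrative, not part of the notebook's data.

```python
import math

def log_gaussian(x, mean, std):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def naive_posterior(x, copies):
    """P(y = 1 | x) when the same feature is counted `copies` times.

    Assumed setup: equal priors, x | y=1 ~ N(+1, 1), x | y=0 ~ N(-1, 1).
    """
    # Naive Bayes treats each copy as fresh evidence, so the log-odds scale
    # linearly with the number of copies.
    log_odds = copies * (log_gaussian(x, 1.0, 1.0) - log_gaussian(x, -1.0, 1.0))
    return 1.0 / (1.0 + math.exp(-log_odds))

x = 0.5
for copies in (1, 2, 3):
    print(f"copies={copies}: P(y=1 | x) = {naive_posterior(x, copies):.4f}")
```

In this setup the posterior grows more extreme with every duplicate even though no new information was added, yet it stays on the same side of 0.5, so the predicted class does not change. The probabilities become overconfident while the classification stays the same.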

In [None]:
def analyze_feature_dependencies(X, y, feature_names, n_features=5):
    """
    Analyze dependencies between features within each class.
    """
    
    results = {}
    
    for class_label in np.unique(y):
        print(f"\nANALYZING CLASS {class_label}")
        print("="*30)
        
        # Get data for this class
        class_mask = (y == class_label)
        X_class = X[class_mask]
        
        if hasattr(X_class, 'toarray'):  # Handle sparse matrices
            X_class = X_class.toarray()
        
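        # Calculate correlation matrix. Features that are constant within this
        # class produce NaN entries, which the threshold check below skips.
        correlations = np.corrcoef(X_class.T)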
        
        # Find highly correlated feature pairs
        high_corr_pairs = []
        for i in range(len(feature_names)):
            for j in range(i+1, len(feature_names)):
                if abs(correlations[i, j]) > 0.3:  # Threshold for high correlation
                    high_corr_pairs.append((
                        feature_names[i], 
                        feature_names[j], 
                        correlations[i, j]
                    ))
        
        # Sort by correlation strength
        high_corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
        
        print(f"Highly correlated feature pairs (|r| > 0.3):")
        for feat1, feat2, corr in high_corr_pairs[:n_features]:
            print(f"  {feat1} <-> {feat2}: {corr:.4f}")
        
        if not high_corr_pairs:
            print("  No highly correlated feature pairs found")
        
        results[class_label] = high_corr_pairs
    
    return results

# Analyze feature dependencies in our text data
print("FEATURE DEPENDENCY ANALYSIS")
print("="*40)

# Get feature matrix and analyze
X_count = vectorizer.transform(X_train_text + X_test_text)
y_all = np.array(y_train_text + y_test_text)

# Select top features to make analysis manageable
feature_names = vectorizer.get_feature_names_out()
top_features_idx = np.argsort(np.array(X_count.sum(axis=0)).flatten())[-20:]  # Top 20 most frequent
top_feature_names = feature_names[top_features_idx]
X_count_top = X_count[:, top_features_idx]

dependencies = analyze_feature_dependencies(X_count_top, y_all, top_feature_names)

# Create a synthetic dataset where independence assumption is clearly violated
def create_dependent_features_dataset():
    """Create a dataset where features are clearly dependent."""
    np.random.seed(42)
    n_samples = 1000
    
    # Class labels
    y = np.random.binomial(1, 0.5, n_samples)
    
    # Feature 1: depends on class
    f1 = np.where(y == 1, 
                  np.random.normal(2, 1, n_samples), 
                  np.random.normal(-2, 1, n_samples))
    
    # Feature 2: depends on both class AND feature 1 (violates independence)
    f2 = 0.8 * f1 + np.where(y == 1, 
                             np.random.normal(1, 0.5, n_samples),
                             np.random.normal(-1, 0.5, n_samples))
    
    # Feature 3: independent given class (follows naive assumption)
    f3 = np.where(y == 1,
                  np.random.normal(1, 1, n_samples),
                  np.random.normal(-1, 1, n_samples))
    
    X = np.column_stack([f1, f2, f3])
    return X, y

# Test Naive Bayes on dependent features
print("\n\nTESTING ON DATASET WITH DEPENDENT FEATURES")
print("="*50)

X_dep, y_dep = create_dependent_features_dataset()
X_train_dep, X_test_dep, y_train_dep, y_test_dep = train_test_split(
    X_dep, y_dep, test_size=0.3, random_state=42
)

# Train Naive Bayes
nb_dep = GaussianNB()
nb_dep.fit(X_train_dep, y_train_dep)
y_pred_dep = nb_dep.predict(X_test_dep)
accuracy_dep = accuracy_score(y_test_dep, y_pred_dep)

print(f"Naive Bayes accuracy on dependent features: {accuracy_dep:.4f}")

# Analyze feature correlations
print("\nFeature correlations within each class:")
for class_label in [0, 1]:
    class_data = X_dep[y_dep == class_label]
    corr_matrix = np.corrcoef(class_data.T)
    print(f"\nClass {class_label}:")
    print(f"  Feature 1 vs Feature 2: {corr_matrix[0, 1]:.4f}")
    print(f"  Feature 1 vs Feature 3: {corr_matrix[0, 2]:.4f}")
    print(f"  Feature 2 vs Feature 3: {corr_matrix[1, 2]:.4f}")

# Compare with a discriminative model that does not assume conditional independence
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42)
lr.fit(X_train_dep, y_train_dep)
y_pred_lr = lr.predict(X_test_dep)
accuracy_lr = accuracy_score(y_test_dep, y_pred_lr)

print(f"\nLogistic Regression accuracy: {accuracy_lr:.4f}")
print(f"Performance difference: {accuracy_lr - accuracy_dep:.4f}")

# Visualize the dependent features
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot relationships between features colored by class
colors = ['red', 'blue']
class_names = ['Class 0', 'Class 1']

for i, class_label in enumerate([0, 1]):
    mask = y_dep == class_label
    axes[0].scatter(X_dep[mask, 0], X_dep[mask, 1], 
                   c=colors[i], alpha=0.6, label=class_names[i])
    axes[1].scatter(X_dep[mask, 0], X_dep[mask, 2], 
                   c=colors[i], alpha=0.6, label=class_names[i])
    axes[2].scatter(X_dep[mask, 1], X_dep[mask, 2], 
                   c=colors[i], alpha=0.6, label=class_names[i])

axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2 (Dependent on F1)')
axes[0].set_title('F1 vs F2: Clear Dependence')
axes[0].legend()

axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 3 (Independent)')
axes[1].set_title('F1 vs F3: Independence')
axes[1].legend()

axes[2].set_xlabel('Feature 2')
axes[2].set_ylabel('Feature 3')
axes[2].set_title('F2 vs F3')
axes[2].legend()

plt.tight_layout()
plt.show()

# Explain why Naive Bayes still works reasonably well
print("\nWHY NAIVE BAYES WORKS DESPITE VIOLATED ASSUMPTIONS:")
print("="*55)
print("1. Classification only needs the correct class to score highest, not exact probabilities")
print("2. The decision boundary often remains reasonable even when probabilities are miscalibrated")
print("3. Errors in probability estimates may cancel out")
print("4. Only strong independence violations tend to change the predicted class")
print("5. Few parameters mean low variance; more data does not remove the bias from violated assumptions")

## 6. Real-world Application: Document Classification

Let's apply our understanding to a more realistic scenario using four categories of the 20 Newsgroups dataset. This demonstrates how Naive Bayes performs on real text, where word occurrences are clearly not independent given the topic.

In [None]:
# Load a subset of the 20 newsgroups dataset
try:
    # Select a few categories for faster processing
    categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
    
    newsgroups_train = fetch_20newsgroups(
        subset='train',
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')  # Remove metadata that could leak information
    )
    
    newsgroups_test = fetch_20newsgroups(
        subset='test',
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=('headers', 'footers', 'quotes')
    )
    
    print("REAL-WORLD DOCUMENT CLASSIFICATION")
    print("="*40)
    print(f"Training documents: {len(newsgroups_train.data)}")
    print(f"Test documents: {len(newsgroups_test.data)}")
    print(f"Categories: {newsgroups_train.target_names}")
    
    # Show sample documents
    print("\nSample documents:")
    for i in range(len(categories)):
        idx = np.where(newsgroups_train.target == i)[0][0]
        print(f"\nCategory: {newsgroups_train.target_names[i]}")
        print(f"Text (first 200 chars): {newsgroups_train.data[idx][:200]}...")
    
    # Vectorize the data
    print("\nVectorizing documents...")
    tfidf = TfidfVectorizer(
        max_features=1000,  # Limit features for faster processing
        stop_words='english',
        max_df=0.95,  # Remove very common words
        min_df=2      # Remove very rare words
    )
    
    X_train_news = tfidf.fit_transform(newsgroups_train.data)
    X_test_news = tfidf.transform(newsgroups_test.data)
    y_train_news = newsgroups_train.target
    y_test_news = newsgroups_test.target
    
    print(f"Feature matrix shape: {X_train_news.shape}")
    print(f"Vocabulary size: {len(tfidf.get_feature_names_out())}")
    
    # Train Naive Bayes
    print("\nTraining Naive Bayes classifier...")
    nb_news = MultinomialNB(alpha=1.0)  # Laplace smoothing
    nb_news.fit(X_train_news, y_train_news)
    
    # Make predictions
    y_pred_news = nb_news.predict(X_test_news)
    accuracy_news = accuracy_score(y_test_news, y_pred_news)
    
    print(f"Accuracy: {accuracy_news:.4f}")
    
    # Detailed classification report
    print("\nDetailed Classification Report:")
    print(classification_report(y_test_news, y_pred_news, 
                              target_names=newsgroups_test.target_names))
    
    # Confusion matrix
    cm = confusion_matrix(y_test_news, y_pred_news)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=newsgroups_test.target_names,
                yticklabels=newsgroups_test.target_names)
    plt.title('Confusion Matrix - Newsgroup Classification')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()
    
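    # Find the most discriminative features for each category
    print("\nMost discriminative features by category:")
    feature_names = tfidf.get_feature_names_out()
    
    for i, category in enumerate(newsgroups_test.target_names):
        # feature_log_prob_[i] is log P(feature | class i); on its own it ranks
        # globally frequent words first, so compare it with the other classes
        class_log_probs = nb_news.feature_log_prob_[i]
        other_log_probs = np.delete(nb_news.feature_log_prob_, i, axis=0).mean(axis=0)
        log_ratio = class_log_probs - other_log_probs
        
        # Get the features with the largest log-ratio for this class
        top_indices = np.argsort(log_ratio)[-10:][::-1]
        top_features = [feature_names[idx] for idx in top_indices]
        
        print(f"\n{category}:")
        print(f"  {', '.join(top_features)}")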
    
    # Test on custom examples
    test_docs = [
        "I don't believe in God and think religion is harmful",
        "Computer graphics and image processing algorithms",
        "Patient symptoms and medical diagnosis procedures",
        "Jesus Christ is our savior and lord"
    ]
    
    X_custom = tfidf.transform(test_docs)
    predictions = nb_news.predict(X_custom)
    probabilities = nb_news.predict_proba(X_custom)
    
    print("\nPredictions on custom documents:")
    print("="*40)
    for doc, pred, probs in zip(test_docs, predictions, probabilities):
        predicted_category = newsgroups_test.target_names[pred]
        # Naive Bayes posteriors tend to be overconfident; read this as a score
        confidence = np.max(probs)
        print(f"\nDocument: '{doc[:50]}...'")
        print(f"Predicted: {predicted_category} (confidence: {confidence:.4f})")
        print("All probabilities:")
        for j, prob in enumerate(probs):
            print(f"  {newsgroups_test.target_names[j]}: {prob:.4f}")

except Exception as e:
    # This handler catches any failure in the cell, not only download errors
    print(f"Newsgroup experiment could not be completed: {e}")
    print("If the dataset failed to download, check network connectivity.")
    print("The core concepts have been demonstrated with our synthetic examples.")

## 7. Conclusions and Key Takeaways

### Summary of Experiments

Through our experiments, we've demonstrated several key concepts:

1. **Conditional Independence**: We showed how to test for conditional independence and visualized the concept with medical diagnosis data.

2. **Graphical Models**: We explored the three fundamental patterns (common cause, chain, common effect) and their independence implications.

3. **Naive Bayes Implementation**: We built a Naive Bayes classifier from scratch and compared it with scikit-learn implementations.

4. **Text Classification**: We applied Naive Bayes to sentiment analysis and document classification, showing its effectiveness despite violated assumptions.

5. **Assumption Analysis**: We created datasets with known dependencies and showed that Naive Bayes can still perform well even when the independence assumption is violated.

### Why Naive Bayes Works Despite "Naive" Assumptions

1. **Decision-focused**: We only need correct classification, not exact probability estimates
2. **Robust to violations**: Small correlations between features don't significantly impact performance
3. **Large sample benefits**: More data stabilizes the per-feature estimates, although it cannot remove the bias introduced by dependent features
4. **Balanced errors**: Estimation errors in different features may cancel out
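
The first point can be checked with a deliberately extreme violation of independence. This is a minimal sketch, assuming synthetic Gaussian data with balanced classes; the data and the names `X_demo`, `X_dup`, `nb_plain`, and `nb_dup` are illustrative and not part of the earlier experiments. Every feature is copied three times, so the copies are perfectly dependent.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic, balanced two-class data (an assumption made for this sketch)
n_per_class = 500
y_demo = np.repeat([0, 1], n_per_class)
X_demo = rng.normal(loc=y_demo[:, None], scale=1.0, size=(2 * n_per_class, 2))

# Repeat every feature three times: the copies are perfectly dependent
X_dup = np.hstack([X_demo, X_demo, X_demo])

nb_plain = GaussianNB().fit(X_demo, y_demo)
nb_dup = GaussianNB().fit(X_dup, y_demo)

# Compare predicted labels and average top-class probability
agreement = np.mean(nb_plain.predict(X_demo) == nb_dup.predict(X_dup))
conf_plain = nb_plain.predict_proba(X_demo).max(axis=1).mean()
conf_dup = nb_dup.predict_proba(X_dup).max(axis=1).mean()

print(f"Prediction agreement: {agreement:.3f}")
print(f"Mean top-class probability, original features:   {conf_plain:.3f}")
print(f"Mean top-class probability, duplicated features: {conf_dup:.3f}")
```

With equal class priors, duplicating features scales the log-odds without changing its sign, so the predicted labels should agree while the reported probabilities are pushed toward 0 or 1. With unequal priors the labels can also shift, because the likelihood is counted several times while the prior is counted once.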

### When to Use Naive Bayes

**Good for:**
- Text classification and NLP tasks
- Small to medium datasets
- High-dimensional data
- Real-time applications (fast prediction)
- Baseline models
- Features that are close to conditionally independent given the class

**Consider alternatives when:**
- Features are strongly correlated
- The dataset is very large and more complex models are feasible
- Well-calibrated probability estimates are needed
- Non-linear relationships are important
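
One way to decide whether an alternative is worth it is to cross-validate Naive Bayes against a model that does not assume independence. The sketch below is only an illustration under stated assumptions: the data come from `make_classification` with redundant (linearly dependent) features, logistic regression is an arbitrary choice of comparison model, and the parameter values are not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features; all parameter values are illustrative
X_cmp, y_cmp = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=5,
    random_state=42,
)

candidates = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# 5-fold cross-validated accuracy for each candidate model
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_cmp, y_cmp, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The gap between the two scores, rather than either number alone, indicates how much the dependence between features costs on a given dataset.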

### Best Practices

1. **Choose the right variant**: Gaussian for continuous features, Multinomial for count data, Bernoulli for binary features
2. **Use appropriate smoothing**: Laplace (additive) smoothing keeps feature values never seen with a class from producing zero probabilities
3. **Preprocess text properly**: Remove stop words, handle rare words
4. **Validate assumptions**: Test for conditional independence given the class when possible
5. **Compare with other methods**: Don't assume Naive Bayes is always the best choice
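
The smoothing advice can be made concrete with a small worked example. This is a sketch of the standard additive-smoothing estimate, (count + alpha) / (class total + alpha * number of features), which is the role the `alpha` parameter plays in `MultinomialNB`. The helper `smoothed_feature_probs` and the count matrix are hypothetical and exist only for this illustration.

```python
import numpy as np

def smoothed_feature_probs(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (class total + alpha * n_features)."""
    counts = np.asarray(counts, dtype=float)
    n_features = counts.shape[1]
    totals = counts.sum(axis=1, keepdims=True)
    return (counts + alpha) / (totals + alpha * n_features)

# Hypothetical per-class word counts: rows are classes, columns are words
word_counts = np.array([
    [5, 1, 0],   # class 0 never saw word 2
    [0, 3, 7],   # class 1 never saw word 0
])

unsmoothed = word_counts / word_counts.sum(axis=1, keepdims=True)
smoothed = smoothed_feature_probs(word_counts, alpha=1.0)

print("Unsmoothed:\n", unsmoothed.round(3))
print("Smoothed (alpha=1):\n", smoothed.round(3))
```

Without smoothing, a single unseen word drives the whole class likelihood to zero no matter what the other words say. With alpha = 1 every entry stays strictly positive and each row still sums to one.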

The "naive" assumption in Naive Bayes is often violated in practice, but the algorithm remains surprisingly effective across many domains, especially text classification. Understanding when and why it works helps us apply it more effectively.