In [None]:
'''
 * Copyright (c) 2004 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Estimating and Comparing Classifiers

There are at least two reasons for wanting to know the generalization rate of a classifier on a given problem. One is to see if the classifier performs well enough to be useful; another is to compare its performance with that of a competing design. Estimating the final generalization performance invariably requires making assumptions about the classifier or the problem or both, and can fail if the assumptions are not valid. We should stress, then, that all the following methods are heuristic. Indeed, if there were a foolproof method for choosing which of two classifiers would generalize better on an arbitrary new problem, we could incorporate such a method into the learning and violate the **No Free Lunch Theorem**. Occasionally, our assumptions are explicit (as in parametric models), but more often than not they are implicit and difficult to identify or relate to the final estimation (as in empirical methods).

## Parametric Models

One approach to estimating the generalization rate is to compute it from the assumed parametric model. For example, in the two-class multivariate normal case, we might estimate the probability of error using the **Bhattacharyya** or **Chernoff bounds**, substituting estimates of the means and the covariance matrix for the unknown parameters. However, there are three problems with this approach:

1. **Overly Optimistic Error Estimates**: Such an error estimate is often overly optimistic; characteristics that make the training samples peculiar or unrepresentative will not be revealed.
2. **Suspecting the Model**: We should always suspect the validity of an assumed parametric model; a performance evaluation based on the same model cannot be believed unless the evaluation is unfavorable.
3. **Intractable Error Computation**: In more general situations, where the distributions are not simple, it is very difficult to compute the error rate exactly, even if the probabilistic structure is known completely.
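
As a rough illustration of the parametric approach, the sketch below evaluates the Bhattacharyya bound for two univariate normal classes, plugging sample means and variances in for the unknown parameters. The function name `bhattacharyya_bound_1d`, the toy samples, and the equal priors are assumptions made for this example only; the multivariate case replaces the variances with covariance matrices.

```python
import math
import statistics

def bhattacharyya_bound_1d(sample1, sample2, prior1=0.5):
    """Upper bound on the Bayes error for two univariate normal classes."""
    mu1, mu2 = statistics.fmean(sample1), statistics.fmean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    # Bhattacharyya exponent k(1/2) specialised to one dimension
    k_half = ((mu2 - mu1) ** 2 / (4.0 * (var1 + var2))
              + 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2))))
    return math.sqrt(prior1 * (1.0 - prior1)) * math.exp(-k_half)

# Hypothetical training samples for the two classes
class_a = [0.1, -0.4, 0.3, 0.0, -0.2, 0.5]
class_b = [2.1, 1.7, 2.4, 1.9, 2.6, 2.2]

bound = bhattacharyya_bound_1d(class_a, class_b)
print(f"Bhattacharyya bound on P(error): {bound:.4f}")
```

Because the bound is computed from the same samples that define the model, it inherits the optimism described in the first point above.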

![image-2.png](attachment:image-2.png)

Fig.9: In cross validation, the data set $D$ is split into two parts. The first (e.g., 90% of the patterns) is used as a standard training set for setting free parameters in the classifier model; the other (e.g., 10%) is the validation set and is meant to represent the full generalization task. For most problems, the training error decreases monotonically during training, as shown in black. Typically, the error on the validation set decreases, but then increases, an indication that the classifier may be overfitting the training data. In cross validation, training or parameter adjustment is stopped at the first minimum of the validation error.

##  Cross Validation

In cross validation, we randomly split the set of labeled training samples $D$ into two parts. One is used as the traditional training set for adjusting model parameters in the classifier. The other, the **validation set**, is used to estimate the generalization error. Since our ultimate goal is low generalization error, we train the classifier until we reach a minimum of this validation error, as shown in the figure below.

**Fig.9**: Cross-validation process. The dataset $D$ is split into two parts: the training set (used to adjust model parameters) and the validation set (used to estimate generalization error).

                    Training Set (90%)
                        |
                        v
               +---------------------+
               |     Model Training  |
               +---------------------+
                        |
                        v
            Validation Set (10%)
                        |
                        v
               +---------------------+
               | Estimate Validation |
               |        Error        |
               +---------------------+

It is essential that the validation (or test) set not include points used for training the parameters in the classifier — a methodological error known as “testing on the training set.”
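
The stopping rule described above (halt at the first minimum of the validation error) can be sketched as follows. The helper name `first_validation_minimum` and the error values are hypothetical; in practice the list would hold the validation error measured on the held-out set after each epoch.

```python
def first_validation_minimum(val_errors):
    """Return the index of the first local minimum of the validation error."""
    for epoch in range(len(val_errors) - 1):
        # The first time the error goes up, the previous epoch was the minimum
        if val_errors[epoch + 1] > val_errors[epoch]:
            return epoch
    # The error never increased, so the last epoch is the best seen so far
    return len(val_errors) - 1

# Hypothetical validation errors recorded after each training epoch
val_errors = [0.42, 0.35, 0.30, 0.28, 0.29, 0.27, 0.33]
stop_epoch = first_validation_minimum(val_errors)
print(f"Stop after epoch index {stop_epoch} (validation error {val_errors[stop_epoch]})")
```

In this made-up sequence a lower error appears later (0.27), which the rule never sees. This is one way in which the heuristic can fall short of the best achievable result.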

Cross validation can be applied to virtually every classification method. For example:
- In **neural networks** with a fixed topology, the amount of training corresponds to the number of epochs or presentations of the training set.
- In **k-nearest neighbor classifiers**, the optimal value of $k$ can be set by cross-validation.
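
A minimal sketch of the second example, assuming one-dimensional features and a simple majority-vote rule: the value of $k$ is chosen as the one with the lowest error on a held-out validation set. The toy data, the candidate values of $k$, and the helper names are assumptions made for illustration.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D features)."""
    neighbors = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def validation_error(train, val, k):
    """Fraction of validation points misclassified by k-NN trained on `train`."""
    wrong = sum(knn_predict(train, x, k) != label for x, label in val)
    return wrong / len(val)

rng = random.Random(0)
# Hypothetical toy data: class 0 centered at 0, class 1 centered at 2
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
        + [(rng.gauss(2.0, 1.0), 1) for _ in range(100)])
rng.shuffle(data)

n_val = int(0.1 * len(data))            # hold out 10% of D for validation
val, train = data[:n_val], data[n_val:]  # the two parts are disjoint

# Odd k avoids ties in a two-class vote
errors = {k: validation_error(train, val, k) for k in (1, 3, 5, 7, 9)}
best_k = min(errors, key=errors.get)
print(errors)
print(f"Selected k = {best_k}")
```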

### Heuristic for Validation Set Proportion

There are several heuristics for choosing the portion $\gamma$ of $D$ to be used as a validation set, where $0 < \gamma < 1$. Typically, a smaller portion of the data is used as the validation set, $ \gamma < 0.5 $, because the validation set is used to set a global property of the classifier. A traditional default is to split the data with $\gamma = 0.1$.

A simple generalization of the above method is **m-fold cross validation**. Here, the training set is randomly divided into $m$ disjoint sets of equal size $n/m$, where $n$ is the total number of patterns in $D$. The classifier is trained $m$ times, each time with a different set held out as a validation set. The estimated performance is the mean of these $m$ errors.

In the limit where $m = n$, the method is effectively the **leave-one-out** approach.

Cross-validation is a heuristic and is not guaranteed to work on every problem. In fact, there are problems where **anti-cross validation** (halting the adjustment of parameters when the validation error reaches its first local maximum) might be more effective.

![image.png](attachment:image.png)

Fig.10: The 95% confidence intervals for a given estimated error probability $\hat{p}$ can be derived from a binomial distribution of Eq. 38. For each value of $\hat{p}$, the true probability has a 95% chance of lying between the curves marked by the number of test samples $n$. The larger the number of test samples, the more precise the estimate of the true probability and hence the smaller the 95% confidence interval.


### Estimation of Generalization Error

Once we train a classifier using cross-validation, the validation error gives an estimate of the accuracy of the final classifier on the unknown test set. If the true but unknown error rate of the classifier is $p$, and if $k$ of the $n$ independent, randomly drawn test samples are misclassified, then $k$ has the **binomial distribution**:

$$
P(k) = \binom{n}{k} p^k (1 - p)^{n-k}
$$

Thus, the fraction of test samples misclassified is exactly the maximum-likelihood estimate for $p$:

$$
\hat{p} = \frac{k}{n}
$$
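
A short derivation, added here as a check, shows why $k/n$ maximizes the binomial likelihood. Taking the logarithm of $P(k)$ and setting its derivative with respect to $p$ to zero gives:

$$
\ln P(k) = \ln \binom{n}{k} + k \ln p + (n - k) \ln (1 - p)
$$

$$
\frac{\partial \ln P(k)}{\partial p} = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \quad \Longrightarrow \quad \hat{p} = \frac{k}{n}
$$

The estimate is unbiased, with variance $p(1 - p)/n$, which is why its precision improves as the number of test samples grows.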

The properties of this estimate for the parameter $p$ of a binomial distribution are well known. **Fig.10** shows 95% confidence intervals as a function of $\hat{p}$ and $n$.

### 95% Confidence Intervals for the Error Rate

For a given value of $\hat{p}$, the true value of $p$ has a 95% chance of lying between the curves marked by the number of test samples $n$. The larger the number of test samples, the more precise the estimate of the true probability, and hence the smaller the 95% confidence interval.

**Fig.10**: The 95% confidence intervals for a given estimated error probability $\hat{p}$, derived from a binomial distribution. The larger the number of test samples, the narrower the confidence interval.

                0.8 ─┐                                  .
                     │                               .
                     │                            .
                     │                        .
                     │                    .
                0.7 ─┤                .
                     │            .
                     │        .
                0.6 ─┤    .
                     │ .
                     └──────────────────────────────────
                       0.1    0.2    0.3    0.4    0.5
                                 p̂ (estimated error rate)

The larger the test set $n$, the more confident we can be in the estimate of the true error rate.
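
The curves in Fig.10 come from the binomial distribution itself. As a sketch of how such limits can be obtained without a normal approximation, the code below computes an exact (Clopper-Pearson) interval by bisection on the binomial tail probabilities. The choice of the Clopper-Pearson construction and all function names are assumptions for this example; the source does not say which construction the figure uses.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(K <= k) for K ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def bisect_root(f, lo, hi, tol=1e-10):
    """Find p in [lo, hi] with f(p) = 0, assuming f changes sign on the interval."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == f_lo_positive:
            lo = mid   # root lies in the upper half
        else:
            hi = mid   # root lies in the lower half
    return 0.5 * (lo + hi)

def exact_interval(k, n, alpha=0.05):
    """Clopper-Pearson interval for the true error rate, given k errors in n tests."""
    # Lower limit: the p at which P(K >= k) equals alpha / 2
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper limit: the p at which P(K <= k) equals alpha / 2
    upper = 1.0 if k == n else bisect_root(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper

lower, upper = exact_interval(10, 100)
print(f"Exact 95% interval for 10 errors in 100 tests: ({lower:.4f}, {upper:.4f})")
```

Unlike the normal approximation, this interval is not symmetric about $\hat{p}$ and never extends below 0 or above 1.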


In [3]:
import random

# Sample dataset (replace this with your actual data)
X = [[random.random(), random.random()] for _ in range(100)]  # 100 samples with 2 features each
y = [random.choice([0, 1]) for _ in range(100)]  # Binary labels

# Function to split data into k disjoint folds
def k_fold_split(X, y, k=5):
    data = list(zip(X, y))
    random.shuffle(data)  # Shuffle the dataset
    # Striding keeps every sample, even when len(data) is not divisible by k
    folds = [data[i::k] for i in range(k)]
    return folds

# Function for a simple classifier: thresholding based on the sum of features
def simple_classifier(x):
    return 1 if sum(x) > 1 else 0

# Cross-validation function
def cross_validation(X, y, k=5):
    folds = k_fold_split(X, y, k)
    errors = []

    for i in range(k):
        # Split into training and validation sets
        train_data = [item for j, fold in enumerate(folds) if j != i for item in fold]
        val_data = folds[i]

        # The threshold classifier has no free parameters, so train_data is unused here;
        # a real model would be fit on train_data before predicting on val_data
        predictions = [simple_classifier(x) for x, _ in val_data]
        true_labels = [label for _, label in val_data]

        # Calculate the error rate
        error = sum(p != t for p, t in zip(predictions, true_labels)) / len(val_data)
        errors.append(error)

    avg_error = sum(errors) / k
    return avg_error

# Perform k-fold cross-validation
k = 5
avg_error = cross_validation(X, y, k)
print(f"Average Cross-Validation Error: {avg_error:.4f}")

# Binomial distribution error estimation (95% confidence intervals)
def binomial_confidence_interval(k, n, p_hat):
    # Normal (Wald) approximation to the binomial; k is kept for reference,
    # only n and p_hat = k / n enter the formula
    z = 1.96  # 95% confidence
    standard_error = (p_hat * (1 - p_hat) / n) ** 0.5
    # Clip to [0, 1], since the approximation can overshoot for small n or extreme p_hat
    lower_bound = max(0.0, p_hat - z * standard_error)
    upper_bound = min(1.0, p_hat + z * standard_error)
    return lower_bound, upper_bound

# Example usage for 100 test samples and 0.1 error rate (hypothetical)
k = 10  # Number of errors
n = 100  # Number of test samples
p_hat = k / n  # Estimated error rate

lower, upper = binomial_confidence_interval(k, n, p_hat)
print(f"95% Confidence Interval for Error Rate: ({lower:.4f}, {upper:.4f})")


Average Cross-Validation Error: 0.5700
95% Confidence Interval for Error Rate: (0.0412, 0.1588)


## Jackknife and Bootstrap Estimation of Classification Accuracy

### Jackknife Method

The jackknife is closely related to cross-validation but works by training the classifier $ n $ separate times, each time deleting one different training point from the dataset. The accuracy for each of these classifiers is computed, and the jackknife estimate of the accuracy is the mean of these leave-one-out accuracies.

Formally, for a dataset $ D $ with $ n $ training points:

1. Train the classifier $ n $ times, each time leaving out one data point.
2. For each leave-one-out classifier, evaluate its performance on the deleted point.
3. The jackknife estimate of the accuracy is the average of these leave-one-out accuracies.

This method is particularly useful because each classifier is quite similar to the one being tested, differing only by one data point. However, the computational cost can be high, especially for large datasets.

Additionally, the jackknife estimate of the variance of the accuracies can be calculated, allowing us to determine the statistical significance of differences between classifiers. For example, if classifier $ C_1 $ has an accuracy of 80% and $ C_2 $ has 85%, we can assess whether this difference is statistically significant.

#### Hypothesis Testing

To compare two classifiers $ C_1 $ and $ C_2 $, we calculate the jackknife estimate of their variances and use hypothesis testing to check if the difference in their accuracies is significant.
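
The text does not name a specific test, so the sketch below uses one simple choice as an assumption: a paired z statistic on the per-sample leave-one-out outcomes of the two classifiers. For a mean of $n$ outcomes, the jackknife variance estimate reduces to the familiar $s^2/n$, which is what the code computes. The correctness lists are made up to match the 80% versus 85% example, and treating the leave-one-out outcomes as independent is a further assumption.

```python
import math

def paired_difference(correct_1, correct_2):
    """Mean and standard error of the per-sample difference in correctness."""
    n = len(correct_1)
    diffs = [a - b for a, b in zip(correct_1, correct_2)]
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the differences
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    # Jackknife variance of a mean equals s^2 / n
    std_err = math.sqrt(var_diff / n)
    return mean_diff, std_err

# Hypothetical leave-one-out outcomes (1 = correct) on the same 20 points
correct_c1 = [1] * 16 + [0] * 4                   # 80% accuracy
correct_c2 = [1] * 15 + [0] + [1, 1] + [0, 0]     # 85% accuracy

mean_diff, std_err = paired_difference(correct_c1, correct_c2)
z = mean_diff / std_err
print(f"Mean accuracy difference: {mean_diff:.3f}, z statistic: {z:.3f}")
```

With only 20 points, the statistic is far smaller in magnitude than 1.96, so a five-point accuracy gap of this kind would not be judged significant at the 5% level under these assumptions.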

### Bootstrap Method

The bootstrap method involves training $ B $ classifiers on different bootstrap samples (random samples drawn with replacement) and then testing the classifiers on other bootstrap data. The bootstrap estimate of classification accuracy is simply the mean of the accuracies from these classifiers.

Though useful, the bootstrap estimation can be computationally expensive, and in practice, it often doesn't provide substantial improvements over other methods like cross-validation.
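
A minimal sketch of the bootstrap procedure follows. It makes two assumptions not fixed by the text: each classifier is tested on the points left out of its own bootstrap sample (so training and test points never overlap), and a hypothetical nearest-mean classifier on one-dimensional toy data stands in for the classifier under study.

```python
import random

def nearest_mean_fit(train):
    """Return per-class means of 1-D features (a hypothetical stand-in classifier)."""
    means = {}
    for label in {lab for _, lab in train}:
        values = [x for x, lab in train if lab == label]
        means[label] = sum(values) / len(values)
    return means

def nearest_mean_predict(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))

def bootstrap_accuracy(data, n_boot=200, seed=0):
    """Mean accuracy over n_boot bootstrap classifiers, each tested on left-out points."""
    rng = random.Random(seed)
    n = len(data)
    accuracies = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # draw n indices with replacement
        chosen = set(idx)
        held_out = [data[i] for i in range(n) if i not in chosen]
        if not held_out:                             # extremely unlikely, but be safe
            continue
        means = nearest_mean_fit([data[i] for i in idx])
        correct = sum(nearest_mean_predict(means, x) == lab for x, lab in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)

rng = random.Random(0)
# Hypothetical toy data: two well-separated classes
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(50)]
        + [(rng.gauss(3.0, 1.0), 1) for _ in range(50)])

acc = bootstrap_accuracy(data)
print(f"Bootstrap estimate of accuracy: {acc:.3f}")
```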

### Maximum-Likelihood Model Comparison

Maximum-likelihood model comparison, or ML-II, is a generalization of maximum-likelihood estimation for choosing between candidate models based on their ability to explain the training data. The goal is to select the model that best explains the observed data.

Given a model with unknown parameter vector $ \theta $, we maximize the likelihood of the training data $ p(D|\hat{\theta}) $, where $ D $ represents the training data. The selection of the best model is done using the posterior probability of each hypothesis $ h_i $, calculated via Bayes' rule:

$$
P(h_i | D) \propto P(D | h_i) P(h_i)
$$

Where:

- $ P(D | h_i) $ is the likelihood (evidence) of the data given the model.
- $ P(h_i) $ is the prior probability of the model.
- $ P(h_i | D) $ is the posterior probability of the model given the data.

In many cases, the prior $ P(h_i) $ is neglected, and the model selection is based solely on the likelihood $ P(D | h_i) $. The model with the highest likelihood is chosen as the best model for the data.
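
To make the selection rule concrete, the sketch below compares two candidate density models on the same data by their maximized log-likelihoods. The two candidates (a Gaussian and a Laplace density, each with a location and a scale parameter) and the synthetic data are assumptions chosen because both have closed-form maximum-likelihood solutions.

```python
import math
import random
import statistics

def gaussian_max_loglik(data):
    """Log-likelihood of a Gaussian evaluated at its ML mean and variance."""
    n = len(data)
    mu = statistics.fmean(data)
    var = sum((x - mu) ** 2 for x in data) / n      # ML (biased) variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def laplace_max_loglik(data):
    """Log-likelihood of a Laplace density evaluated at its ML location and scale."""
    n = len(data)
    med = statistics.median(data)                   # ML location is the median
    b = sum(abs(x - med) for x in data) / n         # ML scale
    return -n * (math.log(2 * b) + 1)

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]    # hypothetical training data

scores = {
    "h1: Gaussian": gaussian_max_loglik(data),
    "h2: Laplace": laplace_max_loglik(data),
}
best = max(scores, key=scores.get)
for name, loglik in scores.items():
    print(f"{name}: max log-likelihood = {loglik:.2f}")
print(f"Selected model: {best}")
```

Both candidates have the same number of parameters here. When one model is strictly more flexible than the other, its maximized likelihood can never be lower, so the raw likelihood alone tends to favor the more complex model.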

### Model Comparison Example

Consider three candidate models $ h_1 $, $ h_2 $, and $ h_3 $ of varying complexity. If the observed data is $ D_0 $, then the model $ h_2 $ is chosen if its likelihood $ P(D_0 | h_2) $ is the largest of the three likelihoods (see Fig.12):

$$
P(D | h_1), P(D | h_2), P(D | h_3)
$$

- Model $ h_1 $: Most expressive, can fit a wide range of data sets.
- Model $ h_3 $: Most restrictive, less flexible.
- Model $ h_2 $: Best matches the observed data $ D_0 $.

Thus, maximum-likelihood model selection suggests choosing $ h_2 $ as it explains the data $ D_0 $ better than the other models.

### Conclusion

- **Jackknife** provides a good estimate of classification accuracy, especially in small datasets, by leaving out one point at a time.
- **Bootstrap** involves resampling the data and can estimate the model accuracy but can be computationally expensive.
- **Maximum-likelihood model comparison** selects the model that best explains the training data, based on its likelihood.

These methods are useful for understanding the performance and reliability of classifiers, and they can be applied for model comparison, accuracy estimation, and hypothesis testing.


In [6]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample Data Generation
X, y = np.random.rand(100, 2), np.random.randint(0, 2, 100)

# Jackknife function
def jackknife_estimate(X, y, model, metric=accuracy_score):
    n = len(X)
    accuracies = []
    
    # Loop through each data point
    for i in range(n):
        # Create training set excluding the i-th sample
        X_train = np.delete(X, i, axis=0)
        y_train = np.delete(y, i)
        
        # Create test set with just the i-th sample
        X_test = X[i].reshape(1, -1)
        y_test = y[i]
        
        # Train model on the modified training set
        model.fit(X_train, y_train)
        
        # Predict on the single test point
        y_pred = model.predict(X_test)
        
        # Calculate the accuracy and store it
        acc = metric([y_test], y_pred)
        accuracies.append(acc)
    
    # Return the mean of accuracies (Jackknife estimate)
    return np.mean(accuracies)

# Initialize model
model = LogisticRegression()

# Perform Jackknife estimation
jackknife_accuracy = jackknife_estimate(X, y, model)
print(f"Jackknife Estimated Accuracy: {jackknife_accuracy}")


Jackknife Estimated Accuracy: 0.55


In [7]:
from sklearn.utils import resample

# Bootstrap function
def bootstrap_estimate(X, y, model, n_iterations=100, metric=accuracy_score):
    accuracies = []
    
    # Perform Bootstrap resampling
    for _ in range(n_iterations):
        # Sample the data with replacement
        X_resampled, y_resampled = resample(X, y, n_samples=len(X), random_state=None)
        
        # Train the model on the bootstrap sample
        model.fit(X_resampled, y_resampled)
        
        # Test the model on the original data (which overlaps the bootstrap sample,
        # so this estimate is optimistically biased)
        y_pred = model.predict(X)
        
        # Calculate the accuracy and store it
        acc = metric(y, y_pred)
        accuracies.append(acc)
    
    # Return the mean of the accuracies (Bootstrap estimate)
    return np.mean(accuracies)

# Perform Bootstrap estimation
bootstrap_accuracy = bootstrap_estimate(X, y, model)
print(f"Bootstrap Estimated Accuracy: {bootstrap_accuracy}")


Bootstrap Estimated Accuracy: 0.597


In [8]:
# Import required libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.utils import resample
import numpy as np

# Sample Data Generation
X, y = np.random.rand(100, 2), np.random.randint(0, 2, 100)

# Jackknife function
def jackknife_estimate(X, y, model, metric=accuracy_score):
    n = len(X)
    accuracies = []
    
    # Loop through each data point
    for i in range(n):
        # Create training set excluding the i-th sample
        X_train = np.delete(X, i, axis=0)
        y_train = np.delete(y, i)
        
        # Create test set with just the i-th sample
        X_test = X[i].reshape(1, -1)
        y_test = y[i]
        
        # Train model on the modified training set
        model.fit(X_train, y_train)
        
        # Predict on the single test point
        y_pred = model.predict(X_test)
        
        # Calculate the accuracy and store it
        acc = metric([y_test], y_pred)
        accuracies.append(acc)
    
    # Return the mean of accuracies (Jackknife estimate)
    return np.mean(accuracies)

# Bootstrap function
def bootstrap_estimate(X, y, model, n_iterations=100, metric=accuracy_score):
    accuracies = []
    
    # Perform Bootstrap resampling
    for _ in range(n_iterations):
        # Sample the data with replacement
        X_resampled, y_resampled = resample(X, y, n_samples=len(X), random_state=None)
        
        # Train the model on the bootstrap sample
        model.fit(X_resampled, y_resampled)
        
        # Test the model on the original data (which overlaps the bootstrap sample,
        # so this estimate is optimistically biased)
        y_pred = model.predict(X)
        
        # Calculate the accuracy and store it
        acc = metric(y, y_pred)
        accuracies.append(acc)
    
    # Return the mean of the accuracies (Bootstrap estimate)
    return np.mean(accuracies)

# Define multiple classifiers for comparison
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(n_estimators=100)
}

# Model comparison function (training accuracy is a crude stand-in for likelihood)
def model_comparison(X, y, models):
    model_likelihoods = {}
    
    for name, model in models.items():
        model.fit(X, y)
        y_pred = model.predict(X)
        # Accuracy on the training data itself: flexible models such as a random
        # forest can score near 1.0 here without generalizing any better
        accuracy = accuracy_score(y, y_pred)
        
        # In a real-world scenario, you might calculate log-likelihood instead of accuracy
        model_likelihoods[name] = accuracy
    
    return model_likelihoods

# Perform Jackknife estimation
model = LogisticRegression()
jackknife_accuracy = jackknife_estimate(X, y, model)
print(f"Jackknife Estimated Accuracy: {jackknife_accuracy}")

# Perform Bootstrap estimation
bootstrap_accuracy = bootstrap_estimate(X, y, model)
print(f"Bootstrap Estimated Accuracy: {bootstrap_accuracy}")

# Perform model comparison
model_likelihoods = model_comparison(X, y, models)
print("Model Comparison (Likelihood):")
for model, likelihood in model_likelihoods.items():
    print(f"{model}: {likelihood}")


Jackknife Estimated Accuracy: 0.54
Bootstrap Estimated Accuracy: 0.5583
Model Comparison (Likelihood):
Logistic Regression: 0.59
Random Forest: 1.0


##  Bayesian Model Comparison

Bayesian model comparison uses the full information over priors when computing the posterior probabilities $P(h_i | D) \propto P(D | h_i) P(h_i)$. In particular, the evidence for a particular hypothesis is an integral over the prior on the parameters:

$$ P(D | h_i) = \int p(D | \theta, h_i) \, p(\theta | h_i) \, d\theta, $$

where, as before, $\theta$ describes the parameters in the candidate model. It is common for the posterior $p(\theta | D, h_i)$ to be peaked at $\hat{\theta}$, and thus the evidence integral can often be approximated as:

$$ P(D | h_i) \approx p(D | \hat{\theta}, h_i) p(\hat{\theta} | h_i) \Delta \theta. $$

### Occam Factor and Model Complexity

Before the data arrive, model $h_i$ has some broad range of model parameters, denoted by $\Delta_0 \theta$ and shown in Fig.13. After the data arrive, a smaller range is commensurate or compatible with $D$, denoted $\Delta \theta$. The **Occam factor**, which is the term $p(\hat{\theta} | h_i) \Delta \theta$ in the approximation above, is the ratio of two volumes in parameter space:

$$ \text{Occam factor} = \frac{\Delta \theta}{\Delta_0 \theta}, $$

where:
- $\Delta_0 \theta$: The prior volume, accessible to the model without regard to $D$.
- $\Delta \theta$: The volume commensurate with the data $D$.

The Occam factor has magnitude less than 1.0 and simply measures the fractional decrease in the volume of the model’s parameter space due to the presence of training data. The more the training data, the smaller the range of parameters that are commensurate with it, and thus the greater this collapse in the parameter space and the smaller the Occam factor (that is, the stronger the penalty on the evidence).
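
As a small numerical illustration of the ratio $\Delta \theta / \Delta_0 \theta$, consider estimating the mean of a normal distribution with known standard deviation under a uniform prior. The sketch assumes the posterior width is taken as $\sqrt{2\pi}\,\sigma/\sqrt{n}$ (the Gaussian volume for one parameter); the prior width of 20 and the function name are hypothetical.

```python
import math

def occam_factor(n, sigma=1.0, prior_width=20.0):
    """Ratio of posterior to prior parameter volume for a normal mean, known sigma."""
    # Width of the region compatible with n samples (one-parameter Gaussian volume)
    posterior_width = math.sqrt(2 * math.pi) * sigma / math.sqrt(n)
    return posterior_width / prior_width

for n in (10, 100, 1000):
    print(f"n = {n:4d}: Occam factor = {occam_factor(n):.5f}")
```

Under these assumptions the ratio falls in proportion to $1/\sqrt{n}$ as more samples arrive.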

$$ p(\theta | D, h_i) \Delta \theta \quad \text{vs.} \quad p(\theta | h_i) \Delta_0 \theta $$

In practice, the Occam factor can be calculated fairly easily if the evidence is approximated as a $k$-dimensional Gaussian, centered on the maximum-likelihood value $\hat{\theta}$.

![image.png](attachment:image.png)

Fig.12: The evidence (i.e., probability of generating different data sets given a model) is shown for three models of different expressive power or complexity. Model $h_1$ is the most expressive, since with different values of its parameters the model can fit a wide range of data sets. Model $h_3$ is the most restrictive of the three. If the actual data observed is $D_0$, then maximum-likelihood model selection states that we should choose $h_2$, which has the highest evidence. Model $h_2$ “matches” this particular data set better than do the other two models, and should be selected.


### Approximate Evidence

In the case where the posterior can be assumed to be a Gaussian, the evidence can be calculated directly, yielding:

$$ P(D | h_i) \approx p(D | \hat{\theta}, h_i) \, p(\hat{\theta} | h_i) \, (2\pi)^{k/2} |H|^{-1/2}, $$

where $H$ is the Hessian matrix of the negative log-posterior, evaluated at $\hat{\theta}$:

$$ H = -\left. \frac{\partial^2 \ln p(\theta | D, h_i)}{\partial \theta^2} \right|_{\theta = \hat{\theta}}, $$

and $k$ is the number of parameters in the model. The Hessian matrix measures how "peaked" the posterior is around the value $\hat{\theta}$.
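
The Gaussian approximation can be checked numerically in a one-parameter case. The sketch below assumes unit-variance normal data with unknown mean $\theta$ and a uniform prior on $[-a, a]$, so that the curvature of the log-posterior at $\hat{\theta}$ is simply $n$ and the Gaussian volume is $\sqrt{2\pi/n}$. It compares the approximate evidence with a direct numerical integral of likelihood times prior. The data, the prior half-width, and the function names are assumptions for this example.

```python
import math
import random

def log_likelihood(theta, data):
    """Log-likelihood of unit-variance normal data with mean theta."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2 for x in data)

def evidence_numeric(data, half_width, steps=20000):
    """Trapezoid-rule integral of likelihood times a uniform prior on [-a, a]."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        theta = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(log_likelihood(theta, data))
    return total * h / (2 * half_width)   # prior density is 1 / (2a)

def evidence_gaussian(data, half_width):
    """Best-fit likelihood times prior density times the Gaussian volume factor."""
    n = len(data)
    theta_hat = sum(data) / n
    curvature = n                          # minus the second derivative of the log-posterior
    volume = math.sqrt(2 * math.pi / curvature)
    return math.exp(log_likelihood(theta_hat, data)) * volume / (2 * half_width)

rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(20)]   # hypothetical training data

numeric = evidence_numeric(data, half_width=10.0)
approx = evidence_gaussian(data, half_width=10.0)
print(f"Numerical evidence:   {numeric:.6e}")
print(f"Gaussian approximation: {approx:.6e}")
```

The two values agree closely here because the likelihood is exactly Gaussian in $\theta$; for models whose posterior is skewed or multimodal the approximation would be rougher.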

### Degeneracies and Scaling

There may be a problem due to degeneracies in a model: several parameters could be relabeled and leave the classification rule (and hence the likelihood) unchanged. The resulting degeneracy leads, in essence, to an “overcounting” which alters the effective volume in parameter space.

For such cases, we must multiply the right-hand side of the approximate evidence, $P(D | h_i) \approx p(D | \hat{\theta}, h_i) \, p(\hat{\theta} | h_i) \Delta \theta$, by the degeneracy of $\hat{\theta}$ in order to scale the Occam factor and obtain the proper estimate of the evidence.

![image-2.png](attachment:image-2.png)

Fig.13: In the absence of training data, a particular model $h$ has available a large range of possible values of its parameters, denoted $\Delta_0 \theta$. In the presence of a particular training set $D$, a smaller range is available. The Occam factor, $\Delta \theta / \Delta_0 \theta$, measures the fractional decrease in the volume of the model’s parameter space due to the presence of training data $D$. In practice, the Occam factor can be calculated fairly easily if the evidence is approximated as a $k$-dimensional Gaussian, centered on the maximum-likelihood value $\hat{\theta}$.

### Bayesian Model Selection and the No Free Lunch Theorem

There seems to be a fundamental contradiction between two deep ideas in statistical pattern recognition:

1. The **No Free Lunch Theorem** states that in the absence of prior information about the problem, there is no reason to prefer one classification algorithm over another.
2. Bayesian model selection seems to show how to reliably choose the better of two algorithms.

Consider two “composite” algorithms — algorithm A and algorithm B — each of which employs two others (algorithm 1 and algorithm 2). For any problem:
- Algorithm A uses Bayesian model selection and applies the “better” of algorithm 1 and algorithm 2.
- Algorithm B uses anti-Bayesian model selection and applies the “worse” of algorithm 1 and algorithm 2.

It appears that algorithm A will reliably outperform algorithm B throughout the full class of problems — in contradiction with the No Free Lunch Theorem.

#### Resolution of the Contradiction

In Bayesian model selection, we ignore the prior over the space of models, $\mathcal{H}$, effectively assuming it is uniform. This assumption does not take into account how those models correspond to underlying target functions, i.e., mappings from input to category labels. Accordingly, Bayesian model selection usually corresponds to a **non-uniform prior** over target functions.

In fact, the non-uniform prior varies depending on the choice of model. Therefore, Bayesian model selection usually applies a non-uniform prior that seems to match many important real-world problems.

The **No Free Lunch Theorem** allows that for some particular non-uniform prior, there may be a learning algorithm that gives better-than-chance or even optimal results. This shows that Bayesian model selection, despite appearing to contradict the No Free Lunch Theorem, is consistent with many real-world learning scenarios.


In [9]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# Generate synthetic data
np.random.seed(42)
n_samples = 100
x = np.linspace(0, 10, n_samples)
y = 2 * x + 1 + np.random.normal(scale=2, size=n_samples)  # Linear relationship with noise

x = x[:, np.newaxis]  # Reshape for model compatibility

# Define two models for comparison
def fit_model(x, y, degree):
    """Fit a polynomial model of a given degree."""
    model = Pipeline([
        ("poly_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("regression", LinearRegression())
    ])
    model.fit(x, y)
    return model

# Evidence approximation (Gaussian assumption)
def compute_evidence(model, x, y):
    """Compute a heuristic log-evidence score under a Gaussian noise assumption."""
    y_pred = model.predict(x)
    mse = mean_squared_error(y, y_pred)
    n_params = len(model.named_steps["regression"].coef_) + 1  # Coefficients + intercept

    # Crude stand-in for the Hessian determinant (not the true curvature of the posterior)
    hessian_det = (1 / mse) ** n_params

    # Fit term plus the -0.5 * log|H| term.
    # Note: when mse > 1 the second term grows with n_params, so this score
    # does not penalize extra parameters the way a true Occam factor would.
    log_evidence = -0.5 * len(y) * np.log(2 * np.pi * mse) - 0.5 * np.log(hessian_det)
    return log_evidence

# Fit two models
model1 = fit_model(x, y, degree=1)  # Linear model
model2 = fit_model(x, y, degree=2)  # Quadratic model

# Compute evidences
log_evidence1 = compute_evidence(model1, x, y)
log_evidence2 = compute_evidence(model2, x, y)

# Compare models
print(f"Log Evidence for Model 1 (Linear): {log_evidence1}")
print(f"Log Evidence for Model 2 (Quadratic): {log_evidence2}")

if log_evidence1 > log_evidence2:
    print("Model 1 (Linear) is preferred based on Bayesian evidence.")
else:
    print("Model 2 (Quadratic) is preferred based on Bayesian evidence.")


Log Evidence for Model 1 (Linear): -149.7927566773839
Log Evidence for Model 2 (Quadratic): -149.01673960682518
Model 2 (Quadratic) is preferred based on Bayesian evidence.


In [10]:
# Generate synthetic data
import math
import random

def generate_data(n_samples, slope, intercept, noise_level):
    x = [i for i in range(n_samples)]
    # Uniform noise in [-noise_level, noise_level]; the math module has no random(),
    # so the random module is used instead
    y = [slope * xi + intercept + noise_level * (2 * (random.random() - 0.5)) for xi in x]
    return x, y

# Polynomial model
def fit_polynomial(x, y, degree):
    # Solve for polynomial coefficients using normal equations
    X = [[xi**d for d in range(degree + 1)] for xi in x]
    X_T = transpose(X)
    XTX = matmul(X_T, X)
    # y is a flat list, so form X^T y with row-wise dot products
    XTy = [sum(a * b for a, b in zip(row, y)) for row in X_T]
    coeffs = solve(XTX, XTy)
    return coeffs

def predict_polynomial(coeffs, x):
    return sum(c * (x**i) for i, c in enumerate(coeffs))

# Matrix operations
def transpose(matrix):
    return list(map(list, zip(*matrix)))

def matmul(A, B):
    return [[sum(a * b for a, b in zip(A_row, B_col)) for B_col in zip(*B)] for A_row in A]

def solve(A, b):
    # Gaussian elimination with partial pivoting (modifies A and b in place)
    n = len(b)
    for i in range(n):
        max_row = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[max_row] = A[max_row], A[i]
        b[i], b[max_row] = b[max_row], b[i]
        for j in range(i + 1, n):
            ratio = A[j][i] / A[i][i]
            for k in range(i, n):
                A[j][k] -= ratio * A[i][k]
            b[j] -= ratio * b[i]
    # Back substitution
    x = [0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

# Bayesian evidence calculation
def compute_log_evidence(x, y, coeffs):
    predictions = [predict_polynomial(coeffs, xi) for xi in x]
    residuals = [yi - pred for yi, pred in zip(y, predictions)]
    mse = sum(r**2 for r in residuals) / len(y)
    n_params = len(coeffs)
    
    # Heuristic Gaussian approximation of the evidence; hessian_det is a crude
    # stand-in for the Hessian determinant, not the true posterior curvature
    hessian_det = (1 / mse) ** n_params
    log_evidence = -0.5 * len(y) * math.log(2 * math.pi * mse) - 0.5 * math.log(hessian_det)
    return log_evidence

# Main comparison
n_samples = 100
x, y = generate_data(n_samples, slope=2, intercept=1, noise_level=2)

# Fit models
coeffs1 = fit_polynomial(x, y, degree=1)
coeffs2 = fit_polynomial(x, y, degree=2)

# Compute evidences
log_evidence1 = compute_log_evidence(x, y, coeffs1)
log_evidence2 = compute_log_evidence(x, y, coeffs2)

# Compare models
print(f"Log Evidence for Model 1 (Linear): {log_evidence1}")
print(f"Log Evidence for Model 2 (Quadratic): {log_evidence2}")

if log_evidence1 > log_evidence2:
    print("Model 1 (Linear) is preferred based on Bayesian evidence.")
else:
    print("Model 2 (Quadratic) is preferred based on Bayesian evidence.")

In [15]:
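# Generate synthetic data
import math    # used by compute_log_evidence below
import random

def generate_data(n_samples, slope, intercept, noise_level):
    """Returns points on a line with uniform noise in [-noise_level, noise_level)."""
    x = list(range(n_samples))
    y = [slope * xi + intercept + noise_level * (2 * (random.random() - 0.5)) for xi in x]
    return x, y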

# Polynomial model
def fit_polynomial(x, y, degree):
    # Construct the design matrix
    X = [[xi**d for d in range(degree + 1)] for xi in x]
    X_T = transpose(X)
    XTX = matmul(X_T, X)
    
    # Convert y to a column vector
    y_column = [[yi] for yi in y]
    XTy = matmul(X_T, y_column)
    
    # Solve for polynomial coefficients
    coeffs_column = solve(XTX, XTy)
    # Flatten the coefficients column vector
    coeffs = [c[0] for c in coeffs_column]
    return coeffs

def predict_polynomial(coeffs, x):
    return sum(c * (x**i) for i, c in enumerate(coeffs))

# Matrix operations
def transpose(matrix):
    return list(map(list, zip(*matrix)))

def matmul(A, B):
    return [[sum(a * b for a, b in zip(A_row, B_col)) for B_col in zip(*B)] for A_row in A]

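def solve(A, b):
    """Solves the linear system Ax = b using Gaussian elimination with partial pivoting."""
    n = len(A)
    
    # Work on copies so the caller's matrix and right-hand side are not modified
    A = [row[:] for row in A]
    # Flatten b if it's a column vector
    b = [row[0] for row in b] if isinstance(b[0], list) else list(b)
    
    for i in range(n):
        # Partial pivoting: swap in the row with the largest entry in column i
        max_row = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[max_row] = A[max_row], A[i]
        b[i], b[max_row] = b[max_row], b[i]
        if A[i][i] == 0:
            raise ValueError("Matrix is singular; the system has no unique solution.")
        
        # Make the diagonal element 1 and scale the row
        diag = A[i][i]
        for j in range(i, n):
            A[i][j] /= diag
        b[i] /= diag
        
        # Make the elements below the pivot in column i zero
        for j in range(i + 1, n):
            ratio = A[j][i]
            for k in range(i, n):
                A[j][k] -= ratio * A[i][k]
            b[j] -= ratio * b[i]
    
    # Back substitution (the diagonal is already 1)
    x = [0] * n
    for i in range(n - 1, -1, -1):
        x[i] = b[i]
        for j in range(i + 1, n):
            x[i] -= A[i][j] * x[j]
    
    return [[xi] for xi in x]  # Return result as a column vector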


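# Bayesian evidence calculation
def compute_log_evidence(x, y, coeffs):
    """Returns a rough Gaussian approximation of the log evidence for a fitted polynomial."""
    predictions = [predict_polynomial(coeffs, xi) for xi in x]
    residuals = [yi - pred for yi, pred in zip(y, predictions)]
    mse = sum(r**2 for r in residuals) / len(y)  # also used as the noise variance estimate
    n_params = len(coeffs)
    
    # Compute Gaussian approximation of evidence.
    # The Hessian determinant is approximated as (1 / mse) ** n_params, i.e. the
    # Hessian is treated as (1 / mse) * I. This ignores X^T X and any prior over
    # the coefficients, so the result is only a crude proxy for the evidence.
    hessian_det = (1 / mse) ** n_params
    log_evidence = -0.5 * len(y) * math.log(2 * math.pi * mse) - 0.5 * math.log(hessian_det)
    return log_evidence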

# Main comparison
n_samples = 100
x, y = generate_data(n_samples, slope=2, intercept=1, noise_level=2)

# Fit models
coeffs1 = fit_polynomial(x, y, degree=1)
coeffs2 = fit_polynomial(x, y, degree=2)

# Compute evidences
log_evidence1 = compute_log_evidence(x, y, coeffs1)
log_evidence2 = compute_log_evidence(x, y, coeffs2)

# Compare models
print(f"Log Evidence for Model 1 (Linear): {log_evidence1}")
print(f"Log Evidence for Model 2 (Quadratic): {log_evidence2}")

if log_evidence1 > log_evidence2:
    print("Model 1 (Linear) is preferred based on Bayesian evidence.")
else:
    print("Model 2 (Quadratic) is preferred based on Bayesian evidence.")


Log Evidence for Model 1 (Linear): -96.07438815735479
Log Evidence for Model 2 (Quadratic): -94.46296967242567
Model 2 (Quadratic) is preferred based on Bayesian evidence.
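
The evidence above uses a very rough stand-in for the Hessian, so it is worth cross-checking against a more standard approximation. The sketch below is not the calculation used in the cell above; it swaps in the Bayesian information criterion (BIC) form, log p(D|M) ≈ log L_max − (k/2) log n, assuming Gaussian noise whose variance is estimated as RSS/n. The function names and the residual values are hypothetical and chosen only for illustration; the noise variance is not counted as a parameter because both models share it.

```python
import math

def gaussian_max_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, with the variance set to its MLE (RSS / n)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def bic_log_evidence(residuals, n_params):
    """BIC-style approximation to the log evidence: log L_max - (k / 2) * log(n)."""
    n = len(residuals)
    return gaussian_max_log_likelihood(residuals) - 0.5 * n_params * math.log(n)

# Hypothetical residuals (illustrative values only, not taken from the notebook's data).
residuals = [0.5, -1.0, 0.25, 1.5, -0.75, -0.5, 1.0, -1.25]

# Two models that fit equally well: the one with fewer parameters scores higher.
score_linear = bic_log_evidence(residuals, n_params=2)
score_quadratic = bic_log_evidence(residuals, n_params=3)
print(score_linear > score_quadratic)
```

With the notebook's own functions, the residuals for each model could be formed as `yi - predict_polynomial(coeffs, xi)` and passed in with `n_params=len(coeffs)`. Under this criterion an extra parameter has to reduce the residual error enough to offset a penalty of 0.5 * log(n).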


In [16]:
def plot_graph(x, y, coeffs1, coeffs2):
    """Plots the data points and the fitted models."""
    import turtle
    
    # Set up the drawing area
    screen = turtle.Screen()
    screen.setup(width=800, height=600)
    screen.setworldcoordinates(-10, -10, 110, 250)  # Adjust axes as per data
    turtle.speed(0)

    # Plot data points
    turtle.penup()
    turtle.color("black")
    for xi, yi in zip(x, y):
        turtle.goto(xi, yi)
        turtle.dot(3, "black")  # Data points as small black dots
    
    # Plot each fitted model as a connected curve (linear in blue, quadratic in red)
    for color, coeffs in (("blue", coeffs1), ("red", coeffs2)):
        turtle.color(color)
        turtle.penup()
        turtle.goto(x[0], predict_polynomial(coeffs, x[0]))
        turtle.pendown()
        for xi in x:
            turtle.goto(xi, predict_polynomial(coeffs, xi))

    # Keep the window open until clicked
    turtle.done()

# Generate data
n_samples = 100
x, y = generate_data(n_samples, slope=2, intercept=1, noise_level=2)

# Fit models
coeffs1 = fit_polynomial(x, y, degree=1)  # Linear model
coeffs2 = fit_polynomial(x, y, degree=2)  # Quadratic model

# Compute Bayesian evidences
log_evidence1 = compute_log_evidence(x, y, coeffs1)
log_evidence2 = compute_log_evidence(x, y, coeffs2)

# Model comparison
print(f"Log Evidence for Model 1 (Linear): {log_evidence1}")
print(f"Log Evidence for Model 2 (Quadratic): {log_evidence2}")

if log_evidence1 > log_evidence2:
    print("Model 1 (Linear) is preferred based on Bayesian evidence.")
else:
    print("Model 2 (Quadratic) is preferred based on Bayesian evidence.")

# Plot the graphs
plot_graph(x, y, coeffs1, coeffs2)


Log Evidence for Model 1 (Linear): -99.43789980253715
Log Evidence for Model 2 (Quadratic): -99.35820045210853
Model 2 (Quadratic) is preferred based on Bayesian evidence.
