# A10.2 Bayes’ Theorem as Inference

# 1. Bayes’ Theorem: Single-Feature Inference
Bayes’ theorem comes from probability theory and uses probability notation. Suppose it is April 2020 and we want to estimate the likelihood that someone in the Emergency Room (ER) has COVID. We will use this example to understand Bayes’ theorem, and then use Naïve Bayes to extend it to other factors, such as whether the person has a fever or has lost their sense of taste. We start by focusing on just two events:
- $C$ = event C: has COVID
- $E$ = event E: ER visit
We want the conditional probability:
## 1.1 Conditional Probability Notation
$$
P(C \mid E)
$$
 * | stands for "given" or "conditioned on"
 * $P(C \mid E)$ asks: what is the probability of $C$, given that $E$ has occurred?
 * In set terms, $P(A \mid B)$ can be read as "the probability of set $A$ inside set $B$"

> That is, what is the probability a person has COVID if they have visited an emergency room?

## 1.2 Computing a Conditional Probability from data 

Suppose 2% of a city's population has COVID and 20% of those people visit the ER, while 5% of the total population visits the ER during the month of April. This gives the three probabilities we will use in our calculations.
  - $P(C) = 0.02$ - prior (baseline) probability that someone in the city has COVID
  - $P(E) = 0.05$ - probability that someone in the city goes to the ER for any reason
  - $P(E \mid C) = 0.20$ - probability that someone with COVID goes to the ER

If the city has 100,000 people, then
  * 2,000 have COVID (2% of 100,000)
  * 5,000 visited the ER (5% of 100,000)
  * 400 have COVID *and* visited the ER (20% of 2,000)

We can think of the 400 people as those for whom *both* conditions are true. There are two valid probabilistic ways to express the fraction of the population represented by those people:
1. Start from people with COVID, then take the fraction of them who visited the ER:


$$
\begin{aligned}
P(E \mid C)
&= \frac{P(C \cap E)}{P(C)} = \frac{\text{fraction of people who are both } C \text{ and } E}
        {\text{fraction of people who are } C} \\
&\therefore \\ 
P(C \cap E) 
&= P(E \mid C)\,P(C)
\end{aligned}
$$

2. Start from people who visited the ER, then take the fraction of them who had COVID:

$$\begin{aligned}
P(C \mid E)
&= \frac{P(E \cap C)}{P(E)}=\frac{\text{fraction of people who are both } E \text{ and } C}
{\text{fraction of people who are } E} \\
&\therefore \\ 
P(C \cap E)&= P(C \mid E)P(E)
\end{aligned}
$$
Equating the two expressions for $P(C \cap E)$ gives

$$
P(C \mid E)P(E) = P(E \mid C)P(C)
$$
and solving for $P(C \mid E)$ gives Bayes’ theorem:

$$\boxed{
P(C \mid E) = \frac{P(E \mid C)\,P(C)}{P(E)}
}
$$

Substituting in the values:

$$
P(C \mid E)= \frac{(0.20)(0.02)}{0.05}= 0.08
$$
So the probability that someone in the ER has COVID is 8%.
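
The same calculation can be checked numerically. The sketch below is a minimal illustration only; the function name `bayes_posterior` and the variable names are our own choices, not part of any library. It computes the posterior from the three probabilities and then cross-checks it by counting people in the 100,000-person city.

```python
def bayes_posterior(prior, likelihood, evidence):
    """Return P(C | E) = P(E | C) * P(C) / P(E)."""
    return likelihood * prior / evidence

p_c = 0.02          # P(C): prior probability of COVID
p_e = 0.05          # P(E): probability of an ER visit
p_e_given_c = 0.20  # P(E | C): probability of an ER visit given COVID

posterior = bayes_posterior(p_c, p_e_given_c, p_e)
print(f"P(C | E) = {posterior:.2f}")

# Cross-check by counting people in a city of 100,000.
population = 100_000
n_covid = population * p_c        # people with COVID
n_er = population * p_e           # people who visited the ER
n_both = n_covid * p_e_given_c    # people with COVID who visited the ER
print(f"Counting check: {n_both:.0f} / {n_er:.0f} = {n_both / n_er:.2f}")
```

Both routes give the same fraction, because dividing the count of people who satisfy both conditions by the count of ER visitors is exactly what $P(C \mid E)$ means.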

<div class="alert alert-block alert-info">
<strong>Check your understanding</strong>
<p>In our model we had:
<ol>
  <li>$P(C) = 0.02$</li>
  <li>$P(E)= 0.05$ </li>
<li>$P(E \mid C) = 0.20$</li>
</ol>    
For each scenario below, decide how the inputs above change, and then predict the probability that someone in the emergency room has COVID.
</p>
<p>Scenario 1: Change the prior prevalence. Does the probability that someone in the ER has COVID increase or decrease if 10% of the population has COVID? Calculate the new probability.</p>
  <div style="
    background-color: #efffff;
    color: #000000;
    padding: 10px;
    border-radius: 4px;
    border: 1px solid #dddddd;
    margin-top: 10px;
  ">
<details>
    <summary>Answer</summary>

- $P(C) = 0.10$  
- $P(E) = 0.05$  
- $P(E \mid C) = 0.20$

It increases from 8% to 40%:
$P(C \mid E) = \dfrac{(0.20)(0.10)}{0.05} = 0.40$
</details>
</div>


<p>Scenario 2: Change disease severity. What if the disease became worse and 50% of the people who have COVID go to the emergency room? What is the new probability that someone in the ER has COVID, and does it increase or decrease?</p>

  <div style="
    background-color: #efffff;
    color: #000000;
    padding: 10px;
    border-radius: 4px;
    border: 1px solid #dddddd;
    margin-top: 10px;
  ">
<details>
    <summary>Answer</summary>
- $P(C) = 0.02$  
- $P(E) = 0.05$  
- $P(E \mid C) = 0.50$  
It increases from 8% to 20%
</details>
</div>
<p>Scenario 3: Change the evidence rate. What if there was an earthquake and 15% of the population ends up in the ER? What is the new probability that someone in the ER has COVID, and did it increase or decrease from the initial conditions?</p>

  <div style="
    background-color: #efffff;
    color: #000000;
    padding: 10px;
    border-radius: 4px;
    border: 1px solid #dddddd;
    margin-top: 10px;
  ">
<details>
    <summary>Answer</summary>
- $P(C) = 0.02$  
- $P(E) = 0.15$  
- $P(E \mid C) = 0.20$  
It drops from 8% to about 2.7%, since $(0.20)(0.02)/0.15 \approx 0.027$.
</details>
</div>
</div>


## 1.3 From Inference to Binary Classification

Up to this point, we have used Bayes’ theorem to perform **inference**. That is, we answered questions of the form: "Given that a person visited the emergency room, what is the probability that they have COVID?"
$$
P(C \mid E)
$$
This result is a probability, not yet a decision. In our original example, we found:
$$
P(C \mid E) = 0.08
$$
That is, there is an 8% chance that a randomly selected emergency room patient has COVID.
### 1.3.1 Inference vs. classification
It is important to distinguish between two related but different tasks:
* **Inference**: estimating probabilities
* **Classification**: making a decision based on those probabilities

Bayes’ theorem gives us inference.
Classification requires an additional step.
### 1.3.2 Binary classification in this example

In this appendix, we are working with a binary outcome:

* $C$: the person has COVID (1)
* $\neg C$: the person does not have COVID (0)

Because these are the only two possibilities,

$$
P(C \mid E) + P(\neg C \mid E) = 1
$$

So if:

$$
P(C \mid E) = 0.08
$$

then:

$$
P(\neg C \mid E) = 0.92
$$

Bayes’ theorem has therefore told us **how plausible each explanation is**, given the evidence.



## 1.4 Decision (classification) rule

To turn inference into classification, we need a **decision rule**. The simplest possible rule is:

 > **Assign the class with the larger posterior probability.**

In this case:

* $P(C \mid E) = 0.08$
* $P(\neg C \mid E) = 0.92$

Since $0.92 > 0.08$, the classification decision would be:

> **This person is classified as not having COVID.**

## 1.5 What Bayes’ theorem does, and does not, tell us

Bayes’ theorem itself:

* does not decide what threshold to use
* does not say whether 8% is "high" or "low"
* does not encode policy, risk, or consequences

It only tells us:

> Given the assumptions and the data, how probable each outcome is.

The **classification decision** depends on context.

For example:

* A hospital screening system might flag patients even at low probabilities
* A public health study might use a different cutoff
* A diagnostic test might require very high confidence

Those decisions come *after* inference.
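
The separation between inference and decision can be sketched in a few lines. This is an illustrative sketch, not a prescribed method: the function name `classify` and the 5% screening threshold are hypothetical choices made only for this example.

```python
def classify(posterior, threshold=0.5):
    """Return 1 (COVID) if the posterior meets the threshold, else 0."""
    return 1 if posterior >= threshold else 0

p_c_given_e = 0.08  # posterior from the worked example

# With two classes, a threshold of 0.5 matches the "larger posterior" rule.
default_label = classify(p_c_given_e)

# A hypothetical screening policy that flags anyone at or above 5%.
screening_label = classify(p_c_given_e, threshold=0.05)

print(f"Max-posterior rule: {default_label}")
print(f"Screening rule (threshold 0.05): {screening_label}")
```

The posterior is identical in both calls; only the decision rule changes, which is why the same 8% can lead to different classifications in different contexts.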

# 2. Naïve Bayes

In the previous section, we used Bayes’ theorem with a single observed feature: whether a person visited the emergency room. Now we extend that idea by adding additional binary (yes/no) features that may also provide evidence.

We will continue to use:
* $C$ = the event that a person has COVID  

and introduce the following features:
* $E$ = ER visit
* $F$ = fever present
* $T$ = loss of taste
* $S$ = shortness of breath

---

## 2.0 From one feature to many features

With a single feature, Bayes’ theorem gives:

$$
P(C \mid E) = \frac{P(E \mid C)\,P(C)}{P(E)}
$$

When multiple features are observed simultaneously, the conditional probability becomes:

$$
P(C \mid E,F,T,S) = \frac{P(E,F,T,S \mid C)\,P(C)}{P(E,F,T,S)}
$$

Here:

* $P(E,F,T,S \mid C)$ is a **joint conditional probability**
* it represents the probability that all four features occur together, given that the person has COVID.

This joint probability is the main challenge in extending Bayes’ theorem to multiple features.

## 2.1 Joint probability vs. conditional probability

Before introducing any assumptions, it is important to clarify the notation. The comma in a probability expression means "and", not multiplication.

| Expression          | Meaning (plain English)                                            |
| ------------------- | ------------------------------------------------------------------ |
| $P(E)$              | probability of an ER visit                                         |
| $P(F)$              | probability of fever                                               |
| $P(E,F)$            | probability of an ER visit and fever                           |
| $P(E,F,T,S)$        | probability that all four features occur together                  |
| $P(E,F,T,S \mid C)$ | probability that all four features occur together among people with COVID |

So:

* $P(E,F,T,S)$ refers to how common that entire combination is in the population
* $P(E,F,T,S \mid C)$ refers to how common that combination is among people with COVID

The difficulty is that for $n$ binary features there are $2^n$ possible feature combinations. As the number of features increases, many of these combinations will have few or zero observations in a finite dataset.

For example, a MACCS fingerprint contains 166 binary features. This means there are:

$$
2^{166} \approx 9.35 \times 10^{49}
$$

possible feature combinations. In practice, only an extremely small fraction of these combinations will ever appear in real data.
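
To get a feel for the scale, the short sketch below evaluates $2^{166}$ exactly and compares it with a dataset size. The figure of one million molecules is a hypothetical number chosen only for illustration.

```python
n_features = 166                   # bits in a MACCS fingerprint
n_combinations = 2 ** n_features   # exact integer in Python
n_molecules = 1_000_000            # hypothetical dataset size

print(f"Possible fingerprints: {n_combinations:.2e}")
print(f"Largest fraction a dataset of {n_molecules:,} molecules could cover: "
      f"{n_molecules / n_combinations:.1e}")
```

Even if every molecule had a distinct fingerprint, the dataset would touch a vanishingly small share of the possible combinations, so estimating a separate probability for each combination is not feasible.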


## 2.2 Naïve Bayes inference with multiple features (worked example)

In this section, we use the Naïve Bayes assumption to compute an updated probability that a person has COVID given several observed features. Our goal is not classification or decision-making, but to see how multiple pieces of evidence combine to update belief.

We continue to use:

* $C$ = the event that a person has COVID

and the observed features:

* $E$ = ER visit
* $F$ = fever present
* $T$ = loss of taste
* $S$ = shortness of breath

### 2.2.1 Worked Example
To focus on the computation, we assume the following probabilities are known.

#### 2.2.1.1 Prior probability data

| Quantity | Meaning                                 | Value |
| -------- | --------------------------------------- | ----- |
| $P(C)$   | baseline probability a person has COVID | 0.02  |

---

#### 2.2.1.2 Feature likelihoods given COVID

These describe how common each feature is **among people who have COVID**.

| Feature | Meaning             | $P(x \mid C)$ |
| ------- | ------------------- | ------------- |
| $E$     | ER visit            | 0.20          |
| $F$     | fever               | 0.70          |
| $T$     | loss of taste       | 0.60          |
| $S$     | shortness of breath | 0.50          |

> These values are **assumed** for instructional purposes. They are not real epidemiological data.

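##### Step 1: Write the Naïve Bayes inference model

The Naïve Bayes assumption treats the features as conditionally independent given the class, so the joint likelihood $P(E,F,T,S \mid C)$ discussed in Section 2.1 is approximated by a product of single-feature likelihoods:

$$
P(C \mid E,F,T,S)
\approx
\frac{
P(C)\,P(E \mid C)\,P(F \mid C)\,P(T \mid C)\,P(S \mid C)
}{
P(E,F,T,S)
}
$$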

At this stage, we are interested in how the **numerator** changes as features are added.

##### Step 2: Compute the Naïve Bayes numerator

Substitute the assumed values and multiply:

$$
\begin{aligned}
\text{Numerator}
&=
P(C)\,P(E \mid C)\,P(F \mid C)\,P(T \mid C)\,P(S \mid C) \\
&=
(0.02)(0.20)(0.70)(0.60)(0.50)
\end{aligned}
$$
Evaluating this product:
$$
\text{Numerator} = 0.00084
$$

This value represents a score that measures how consistent the observed features are with COVID (sometimes called the unnormalized posterior). Larger values indicate stronger support for COVID given this combination of features.
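
The multiplication can also be followed one feature at a time. The sketch below is a minimal illustration using the assumed values from the tables above; the variable names are our own.

```python
from itertools import accumulate
from operator import mul

prior = 0.02
likelihoods = {"E": 0.20, "F": 0.70, "T": 0.60, "S": 0.50}  # P(x | C)

# Running product: prior, then prior * P(E|C), then * P(F|C), and so on.
running = list(accumulate([prior, *likelihoods.values()], mul))

print(f"Prior only: {running[0]:.5f}")
for name, score in zip(likelihoods, running[1:]):
    print(f"After adding {name}: {score:.5f}")
```

The last value in `running` is the numerator computed above, and every intermediate value is smaller than the one before it because each factor is less than one.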

**Conceptual takeaway:** Naïve Bayes allows us to replace a difficult joint probability,

$$
P(E,F,T,S \mid C)
$$

with a product of simpler conditional probabilities,

$$
P(E \mid C)\,P(F \mid C)\,P(T \mid C)\,P(S \mid C)
$$

making inference feasible even when many features are present.



### 2.2.3 Interpreting the Naïve Bayes inference score

The Naïve Bayes inference score computed above is not a probability that can be interpreted on its own. Two important observations follow directly from the calculation:

1. As more features are added, the numerical value always decreases. This happens because Naïve Bayes multiplies together probabilities that are all less than one. Each additional feature further narrows the set of cases consistent with COVID, so the product becomes smaller. This behavior is expected and does not indicate weaker evidence.
2. The magnitude of this value by itself has no meaning. The value does not answer the question "Does this person have COVID?" It only measures how consistent this particular combination of features is with COVID. Without a point of comparison, the number cannot support a decision.

#### 2.2.3.1 Comparing COVID vs. not-COVID Inferences
The following code cell reproduces the Naïve Bayes multiplication using Python as a calculator. It also computes the same product using the complement of each feature likelihood. This allows us to compare how strongly the same features support COVID versus not COVID. To make the comparison easier to see, the output reports ratios rather than raw probabilities. The absolute numbers are less important than how they compare.

**Run the cell below and examine how the ratio changes when we move from a single feature to multiple features.**

For this demonstration, the not-COVID feature likelihoods are approximated using the binary complements of the COVID likelihoods. That is, for each feature $x$ we use
$$P(x=0 \mid C) = 1 - P(x=1 \mid C)$$
where
 * $P(x=1 \mid C)$ is the probability of having the feature given COVID, and
 * $P(x=0 \mid C)$ is the probability of not having the feature given COVID.

This approximation is used only to illustrate how relative support changes when additional features are included. In a full classification model, the not-COVID likelihoods would be estimated separately as $P(x \mid \neg C)$ from data.

In [1]:
def naive_bayes_numerator(prior, likelihoods):
    """Multiply the prior by each feature likelihood in turn."""
    value = prior
    for p in likelihoods:
        value *= p
    return value

prior = 0.02
likelihoods = [0.20, 0.70, 0.60, 0.50]     # P(x=1 | C) for E, F, T, S
unlikelihoods = [0.80, 0.30, 0.40, 0.50]   # complements, 1 - P(x=1 | C)
onefeatlikelihood = [0.20]                 # ER visit only
onefeatunlikelihood = [0.80]

like_1 = naive_bayes_numerator(prior, onefeatlikelihood)
unlike_1 = naive_bayes_numerator(prior, onefeatunlikelihood)
print(f"Likelihood of having COVID with 1 feature: {like_1:.5f}")
print(f"Unlikelihood of having COVID with 1 feature: {unlike_1:.5f}")
print(f"Ratio of likelihood to unlikelihood for one feature = {like_1 / unlike_1:.3f}")

like_4 = naive_bayes_numerator(prior, likelihoods)
unlike_4 = naive_bayes_numerator(prior, unlikelihoods)
print(f"Likelihood of having COVID with 4 features: {like_4:.5f}")
print(f"Unlikelihood of having COVID with 4 features: {unlike_4:.5f}")
print(f"Ratio of likelihood to unlikelihood for four features = {like_4 / unlike_4:.3f}")


Likelihood of having COVID with 1 feature: 0.00400
Unlikelihood of having COVID with 1 feature: 0.01600
Ratio of likelihood to unlikelihood for one feature = 0.250
Likelihood of having COVID with 4 features: 0.00084
Unlikelihood of having COVID with 4 features: 0.00096
Ratio of likelihood to unlikelihood for four features = 0.875


<div class="alert alert-block alert-info">
<strong>Check your understanding</strong>
<p>
The code above computes Naïve Bayes inference scores for having COVID and not having COVID,
and then compares them by taking their ratio.
Explain what changes when additional features are included.
</p>

<div style="
  background-color: #efffff;
  color: #000000;
  padding: 10px;
  border-radius: 4px;
  border: 1px solid #dddddd;
  margin-top: 10px;
  ">
<details>
  <summary>Answer</summary>

When only one feature is used, the ratio of the COVID inference score to the not-COVID inference score is 0.25, meaning the evidence favors not having COVID by a factor of four.

When all four features are included, the absolute inference score for COVID becomes smaller, but the ratio increases to 0.875. This means that, relative to the alternative hypothesis, the additional features provide stronger support for COVID.

Although multiplying more probabilities always reduces the raw score, what matters for interpretation is the comparison between competing hypotheses. Additional features can increase relative support for one hypothesis even as the numerical score itself decreases.

</details>
</div>
</div>
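
For comparison, the sketch below shows what the calculation looks like when separate not-COVID likelihoods $P(x \mid \neg C)$ are available, and how normalizing over both classes turns the two scores into a posterior probability. The value of $P(E \mid \neg C)$ is derived from the Section 1.2 numbers using the law of total probability; the other three not-COVID likelihoods are hypothetical placeholders invented for this illustration, not data.

```python
from math import prod

p_c = 0.02
p_not_c = 1 - p_c

# P(x | C) from the worked example, in the order E, F, T, S.
lik_c = [0.20, 0.70, 0.60, 0.50]

# P(E | not C) from P(E) = P(E|C)P(C) + P(E|not C)P(not C), with P(E) = 0.05.
p_e_not_c = (0.05 - 0.20 * p_c) / p_not_c

# The remaining three values are HYPOTHETICAL placeholders, not data.
lik_not_c = [p_e_not_c, 0.05, 0.01, 0.05]

score_c = p_c * prod(lik_c)
score_not_c = p_not_c * prod(lik_not_c)

# Normalizing over both classes turns the scores into posteriors.
posterior_c = score_c / (score_c + score_not_c)
print(f"Score for COVID:     {score_c:.3e}")
print(f"Score for not COVID: {score_not_c:.3e}")
print(f"Normalized P(C | E,F,T,S) = {posterior_c:.3f}")
```

Because the denominator $P(E,F,T,S)$ is the same for both classes, it never has to be computed directly: dividing each score by the sum of the two scores is enough. The resulting number depends entirely on the placeholder likelihoods and on the Naïve Bayes independence assumption, so it should be read as a demonstration of the mechanics rather than as an estimate.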



### 2.2.4 From inference to models: placing Naïve Bayes in the bigger picture

Up to this point, we have used Naïve Bayes strictly as a tool for probabilistic inference. Given a single individual and a small number of observed features, we computed how consistent those features are with a particular hypothesis (having COVID).

#### 2.2.4.1 What is a feature? (binary case)

In this appendix, a feature is a measurable property associated with an individual case that can take one of two values. In our COVID example, all features are binary, meaning they represent a yes/no condition:
  * fever present or not
  * loss of taste or not
  * shortness of breath or not
      
Each feature answers a simple question about the individual and is encoded as either present (1) or absent (0). In cheminformatics, the same idea appears in a familiar form:

* each bit in a fingerprint (for example, a MACCS key) is a binary feature,
* each feature records the presence or absence of a specific structural pattern,
* each molecule is represented by a vector of many such binary features.

Because each feature is binary, molecular fingerprints naturally fit the assumptions of *Bernoulli Naïve Bayes*, which will be the first machine learning algorithm we use in Module 10.2.

#### 2.2.4.2 Why inference alone is not enough

The worked example in Section 2.2.1 showed that:
* adding features always reduces the raw inference score,
* the score only becomes meaningful when it is compared to an alternative hypothesis.

This comparison step, deciding which hypothesis is better supported by the data, is the conceptual bridge from inference to classification. Once we move from a single individual to:

* many individuals (patients, molecules),
* many features per individual,
* repeated predictions across a dataset,

we are no longer performing isolated inference. Instead, we are applying the same inference logic repeatedly using a shared set of learned probabilities. At that point, we are no longer working with individual calculations; we are using a model.

### 2.2.5 What is a model?

A model is not a single probability calculation. It is a data-derived object that stores the probabilities needed to apply Naïve Bayes inference consistently and automatically:

* across many individuals,
* across many feature vectors,
* across entire datasets.

In this appendix, we assumed probability values by hand to make the mechanics of Naïve Bayes explicit. In practice, these probabilities are estimated from data, not chosen. That transition, from hand-specified probabilities to data-derived probabilities, is what distinguishes inference from model building.


# 3. From probabilistic inference to machine-learning models

Up to this point, we have used Naïve Bayes to reason about a single individual at a time, using a small number of features and assumed probabilities. This allowed us to focus on how probabilistic inference works and how evidence combines. Machine learning operates at a different level. Instead of reasoning about one case, we work with datasets, and instead of manually specifying probabilities, we learn them from data. This section explains how that transition happens conceptually, and how it is implemented in practice using the `scikit-learn` machine-learning library. No models are built here; the goal is to establish a shared framework that later modules will reuse.

## 3.1 Introducing scikit-learn

Up to this point, we have used Naïve Bayes as a mathematical tool for inference. To apply these ideas to real datasets, we need software that can estimate probabilities from data and reuse them to make predictions. In this course, that role is played by scikit-learn. Scikit-learn provides implementations of many machine-learning algorithms, including Naïve Bayes, in a consistent framework. Rather than computing probabilities by hand, we give scikit-learn:

* a **feature matrix** $X$ (many individuals, many features),
* a **label vector** $y$ (the outcome for each individual).

By convention:

- feature matrices are written as uppercase $X$,
- label vectors are written as lowercase $y$.

This notation reflects their different roles:

- $X$ represents many features per entity,
- $y$ represents a single outcome per entity.

Together, $(X, y)$ define a supervised learning problem. Given both, scikit-learn constructs a model that captures the relationships between features and labels.

At a conceptual level, scikit-learn can be used in two closely related ways:

1. **Model-focused use**, where we explicitly build and inspect a single model.
   This is the focus of Module 10.2, where we connect the mathematics of Naïve Bayes to concrete code.
2. **Pipeline-focused use**, where data preparation, modeling, and prediction are combined into a reusable workflow.
   This is the focus of Module 10.3, and becomes important when models are applied repeatedly or compared systematically.

Both approaches rely on the same underlying models. Pipelines do not replace models; they organize how models are applied. In this appendix, our goal is only to introduce the ideas that make model-based workflows possible. The actual construction and use of models is developed in the modules that follow.

## 3.2 Feature matrices ($X$)

In earlier sections, we reasoned about a single individual described by a small set of features. In practice, machine learning operates on datasets containing many individuals, each described by the same set of features. This requires a structured representation. Machine-learning datasets organize features into a feature matrix, conventionally denoted $X$:
* each row corresponds to one entity (e.g., a molecule),
* each column corresponds to one feature,
* each entry $x_{ij}$ records the value of feature $j$ for entity $i$.

For MACCS fingerprints:
* features are binary (0 or 1),
* each column represents a specific structural pattern,
* each row is a fingerprint describing one molecule.

This matrix representation generalizes the feature lists used earlier:
* the four COVID features become four columns,
* thousands of molecules become thousands of rows.

In practice, features are often prepared and inspected using Pandas DataFrames, which preserve column names and metadata. Before modeling, these features are converted into NumPy arrays, which is the format expected by scikit-learn algorithms. This separation allows us to maintain interpretability during data preparation while using efficient numeric representations during model training. Once features are arranged in this way, they can be passed directly to machine-learning algorithms.



## 3.3 Label vectors ($y$)

In supervised machine learning, features alone are not sufficient. Each entity in the dataset must also have an associated outcome or target value that the model is trying to predict. These outcomes are organized into a label vector, conventionally denoted $y$. 

### 3.3.1 Structure of the label vector

The label vector $y$ is a one-dimensional array with one entry per entity:
* each entry $y_i$ corresponds to the outcome for row $i$ of the feature matrix $X$,
* the ordering of $y$ must match the ordering of rows in $X$.

In the COVID example, $y$ would indicate whether each individual has COVID or not.
In cheminformatics, $y$ typically represents an experimental outcome, such as:

* active vs inactive in a bioassay,
* toxic vs non-toxic,
* binder vs non-binder.

Because we are performing binary classification, each label takes one of two values (for example, 1 or 0).

### 3.3.2 Binary labels and interpretation

In this appendix and the following modules, labels are binary:

* $y = 1$ indicates membership in the positive class (e.g., active, COVID),
* $y = 0$ indicates membership in the negative class (e.g., inactive, not COVID).

This binary structure aligns naturally with the probabilistic framework developed earlier, where we compared:

* $P(C \mid X)$ versus $P(\neg C \mid X)$.

Later modules will show how this same structure generalizes to multi-class problems, but the binary case provides the clearest starting point.
## 3.4 Role of NumPy and Pandas

During data preparation, labels are often stored in Pandas DataFrames alongside feature columns. This makes it easy to inspect, filter, and align labels with their corresponding features. Before modeling, labels are converted into NumPy arrays, which is the format expected by scikit-learn algorithms. NumPy arrays provide:

* consistent numeric typing,
* efficient computation,
* compatibility with scikit-learn’s API.

As with features, Pandas supports interpretability during preparation, while NumPy supports computation during modeling.
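
The hand-off from Pandas to NumPy can be sketched as follows. The tiny table is hypothetical toy data invented for illustration, and the column name `covid` is our own choice; `DataFrame.to_numpy()` and `Series.to_numpy()` are the standard Pandas conversion methods.

```python
import pandas as pd

# Hypothetical toy data: four binary features plus a COVID outcome column.
df = pd.DataFrame(
    {
        "E": [1, 0, 1, 0],
        "F": [1, 1, 0, 0],
        "T": [1, 0, 0, 0],
        "S": [0, 1, 1, 0],
        "covid": [1, 0, 1, 0],
    },
    index=["patient_1", "patient_2", "patient_3", "patient_4"],
)

feature_columns = ["E", "F", "T", "S"]
X = df[feature_columns].to_numpy()   # 2-D feature matrix, one row per entity
y = df["covid"].to_numpy()           # 1-D label vector, same row order as X

print(X.shape, y.shape)
```

Because `X` and `y` are taken from the same DataFrame without reordering, row $i$ of `X` and entry $i$ of `y` describe the same individual. The row names (`patient_1`, ...) and column names stay behind in the DataFrame; the NumPy arrays hold only the numbers.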

**A note on the word "label"**

The word *label* appears in two different contexts that students often encounter together, which can be confusing.
* In machine learning, a label refers to the outcome or target value we want to predict. This is the vector $y$.
* In Pandas, a label refers to the name of a row or column used for indexing and data access.

Although these meanings often appear together in practice (for example, machine-learning labels are frequently stored as a Pandas column), they represent different concepts. In this appendix and throughout the machine-learning modules, the term label always refers to the machine-learning meaning (the target vector $y$), unless explicitly stated otherwise.

## 3.5 Extending Naïve Bayes to multicomponent systems
With this data representation in place, we can now write Naïve Bayes in its dataset-level form. For a single entity with feature vector $X = (x_1, x_2, \dots, x_n)$ and class label $y$, the Naïve Bayes model assumes:

$$
P(y \mid X) \propto P(y)\prod_{i=1}^{n} P(x_i \mid y)
$$

This equation generalizes everything you have seen so far:

* $x_i$ are features (columns of the feature matrix),
* $y$ is the class label (from the label vector),
* $P(x_i \mid y)$ are feature-conditional probabilities,
* the product replaces a joint probability over many features.

The key difference from earlier sections is how these probabilities are obtained.


## 3.6 What it means to train and use a model

In the worked examples earlier in this appendix, probabilities were assumed so that the mechanics of Naïve Bayes were clear. In machine learning, those probabilities are learned from data during training.

Training a Naïve Bayes model means estimating:

* the class prior $P(y)$ from label frequencies,
* the feature-conditional probabilities $P(x_i \mid y)$ from feature counts.
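
The two estimation steps above amount to counting. The sketch below illustrates this on a hypothetical six-row toy dataset using raw frequencies; it is a simplified illustration, not how any particular library is implemented.

```python
import numpy as np

# Hypothetical toy dataset: 6 entities, 3 binary features.
X = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
])
y = np.array([1, 1, 1, 0, 0, 0])

priors = {}
conditionals = {}
for label in (0, 1):
    rows = X[y == label]                      # entities belonging to this class
    priors[label] = len(rows) / len(X)        # P(y) from label frequencies
    conditionals[label] = rows.mean(axis=0)   # P(x_i = 1 | y) from feature counts
    print(f"y={label}: P(y)={priors[label]:.2f}, "
          f"P(x_i=1|y)={conditionals[label].round(2)}")
```

Notice that raw frequencies can come out as exactly 0 or 1 on small datasets, which would zero out the whole product for some feature vectors. This is why library implementations typically add a small smoothing count; scikit-learn's `BernoulliNB`, for example, exposes this as its `alpha` parameter.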

Once this estimation is complete, the model becomes a data-derived object that can be reused to make predictions on new, unseen cases. At this point, the model can be used in two distinct but related ways.

### 3.6.1 Producing class predictions (decisions)

The most basic use of a trained model is to produce a class label for each new entity. Conceptually, this corresponds to:  

$$
\hat{y} = \arg\max_{y} \; P(y \mid X)
$$
This expression means: choose the class label $y$ that has the highest predicted probability given the observed features. In other words, the model compares the probabilities of each class and returns the most likely class.

In scikit-learn, this behavior is exposed through methods such as:

* `.predict()`

The output is a vector of predicted labels (e.g., 0 or 1). These predictions are what we use to compute:
* confusion matrices,
* accuracy,
* precision and recall,
* F1 scores.

All of these metrics depend only on discrete class assignments, not on probabilities.

### 3.6.2 Producing class probabilities (confidence)

Many models, including Naïve Bayes, can also report how confident they are in each prediction. Instead of returning just a class label, the model returns estimated probabilities such as:

$$
P(y=1 \mid X), \quad P(y=0 \mid X)
$$
In scikit-learn, this behavior is exposed through methods such as:

* `.predict_proba()`

The output is a probability distribution over classes for each entity. These probabilities are essential for:

* ROC curves,
* AUC (area under the curve),
* threshold-based decision analysis,
* ranking and prioritization of predictions.

Importantly, a meaningful ROC curve or AUC cannot be computed from class labels alone; both require continuous scores such as predicted probabilities.
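
The difference between label-based and score-based metrics can be seen with a few hand-written numbers. In the sketch below, the true labels and predicted probabilities are hypothetical values made up for illustration; `accuracy_score` and `roc_auc_score` are functions from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical true labels and predicted P(y=1 | X) for six entities.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

# Thresholding at 0.5 turns the scores into discrete class labels,
# mimicking the most-likely-class rule for a binary problem.
y_pred = (y_score >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)   # uses discrete labels only
auc = roc_auc_score(y_true, y_score)        # uses the continuous scores
print(f"accuracy = {accuracy:.3f}, AUC = {auc:.3f}")
```

Accuracy only sees which side of the threshold each score falls on, whereas AUC reflects how well the scores rank positive cases above negative ones across all possible thresholds.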



## 3.7 Bernoulli Naïve Bayes and binary features
When features are binary, as they are for fingerprints, the appropriate Naïve Bayes variant is Bernoulli Naïve Bayes.

This model:
* treats each feature as present or absent,
* estimates probabilities based on binary counts,
* aligns naturally with fingerprint-based representations.
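
For reference, the standard Bernoulli form of each feature term can be written as follows, where $\theta_{iy}$ is shorthand introduced here (it is not used elsewhere in this appendix) for $P(x_i = 1 \mid y)$:

$$
P(x_i \mid y) = \theta_{iy}^{\,x_i}\,(1-\theta_{iy})^{\,1-x_i}, \qquad x_i \in \{0, 1\}
$$

Substituting this into the general Naïve Bayes expression gives:

$$
P(y \mid X) \propto P(y)\prod_{i=1}^{n} \theta_{iy}^{\,x_i}\,(1-\theta_{iy})^{\,1-x_i}
$$

When $x_i = 1$ the factor reduces to $\theta_{iy}$, and when $x_i = 0$ it reduces to $1-\theta_{iy}$, so the absence of a feature also contributes evidence.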

In scikit-learn, this model is implemented as `BernoulliNB`. You will build and evaluate such a model in Module 10.2, using the feature matrices created in Module 10.1.
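
As a brief preview only, the sketch below shows the general shape of the `BernoulliNB` workflow on a hypothetical six-row toy dataset. The fingerprints and labels are invented for illustration; the full treatment with real data belongs to Module 10.2.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy fingerprints: 6 molecules, 4 binary features.
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
])
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = active, 0 = inactive

model = BernoulliNB()   # default additive smoothing (alpha=1.0)
model.fit(X, y)         # estimates P(y) and P(x_i | y) from counts

X_new = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
labels = model.predict(X_new)                # discrete class labels
probabilities = model.predict_proba(X_new)   # columns follow model.classes_

print(labels)
print(probabilities.round(3))
```

The same fitted object answers both kinds of question: `.predict()` returns a decision for each new row, and `.predict_proba()` returns the class probabilities behind that decision.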


## 3.8 Generality beyond Naïve Bayes

Although this appendix uses Naïve Bayes as its running example, the data organization and workflow introduced here apply broadly. Many machine-learning algorithms share the same structure:

* feature matrix $X$,
* label vector $y$,
* a training step producing a reusable model.

Naïve Bayes is simply the first algorithm you encounter, not the last.


## 3.9 From Inference to Models: Key Takeaways

This appendix established the conceptual framework behind supervised machine learning:

* how features become a matrix $X$,
* how outcomes become a label vector $y$,
* how probabilistic inference becomes a trained model,
* and how that model can be used for decisions or probability-based evaluation.

In Module 10.2, these ideas are put into practice. You will construct, train, and evaluate a Bernoulli Naïve Bayes model using real chemical fingerprints and experimental labels. No new concepts are introduced there, only concrete implementations of the framework developed here.


<div class="alert alert-block alert-info">
<strong>Check Your Understanding</strong>
<p>1. What role does the feature matrix $X$ play in supervised learning?
</p>
  <div style="
    background-color: #efffff;
    color: #000000;
    padding: 10px;
    border-radius: 4px;
    border: 1px solid #dddddd;
    margin-top: 10px;
  ">
<details>
    <summary>Answer</summary>
The feature matrix $X$ stores the measured properties of each entity in the dataset. Each row represents one entity (such as a molecule), and each column represents one feature. The model uses $X$ as the input information from which it learns patterns.
</details>
</div>
<p>2. What information is stored in the label vector $y$?</p>
  <div style="
    background-color: #efffff;
    color: #000000;
    padding: 10px;
    border-radius: 4px;
    border: 1px solid #dddddd;
    margin-top: 10px;
  ">
<details>
    <summary>Answer</summary>
The label vector $y$ stores the known outcomes associated with each entity. Each entry in $y$ corresponds to a row in $X$ and indicates the class or value the model is trying to predict (for example, active vs inactive).
</details>
</div>
<p>3. Why is a trained model more than a single probability calculation?</p>
  <div style="
    background-color: #efffff;
    color: #000000;
    padding: 10px;
    border-radius: 4px;
    border: 1px solid #dddddd;
    margin-top: 10px;
  ">
<details>
    <summary>Answer</summary>
<p>A trained model captures relationships learned from an entire dataset and can be reused to make predictions on new, unseen data. Unlike a single probability calculation, a model stores learned parameters and applies them consistently across many predictions.
</p>
</details>
</div>
</div>

In cheminformatics, molecular fingerprints provide the features, experimental outcomes provide the labels, and machine-learning models learn how structure relates to activity across entire datasets.
