# Using Conditional Probability to Determine the Effect of Processing on Fracture
- A student performs fracture measurements on 100 different oxide ceramic samples. Of the 100 samples, 50 were annealed at high temperature for 1 hour. In total, 65 samples fractured, and 45 of the annealed samples fractured.

Explanation:
- Event A (Condition): The ceramic was annealed at high temperature for 1 hour (this is the "given" information).
- Event F: The ceramic fractures under a specific load.
- Let's create a Venn diagram illustrating the data.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
from matplotlib_venn import venn3, venn3_circles
font = {'family' : 'sans',
        'weight' : 'bold',
        'size'   : 10}
matplotlib.rc('font', **font)

# Create 3 sets for materials with different properties
set_materials = {i for i in range(100)}     # 100 materials
set_annealed = {i for i in range(50)}       # 50 annealed materials
set_fractured = {i for i in range (5, 70)}  # 65 fractured materials
# The number of annealed materials that show fracture is 45 (IDs 5 to 49)

# Draw the Venn diagram and enlarge the set labels
venn = venn3([set_materials, set_annealed, set_fractured], ('100 Materials', '50 Annealed', '65 Fractured'),
       set_colors=('white', 'orange', 'blue'), alpha=1)
for text in venn.set_labels:
    text.set_fontsize(20)

venn3_circles([set_materials, set_annealed, set_fractured], linewidth=1, color='k')

plt.show()

> ### Assignment 
> - Question 1: What is the probability that a ceramic fractured, given that it was annealed?

The percentage of ceramics fractured, given that they were annealed, equals the conditional probability
$$
P(F|A) = \frac{P(F \cap A)}{P(A)} = \frac{0.45}{0.50} = 0.90 = 90\%
$$
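
The same number can be checked with plain Python sets. This is a minimal sketch that assumes the same arbitrary sample numbering as the Venn diagram code (samples 0 to 49 annealed, samples 5 to 69 fractured); the helper name `conditional_probability` is hypothetical.

```python
# Sample IDs follow the arbitrary numbering used for the Venn diagram
annealed = set(range(50))       # 50 annealed samples
fractured = set(range(5, 70))   # 65 fractured samples


def conditional_probability(event, given):
    """Return P(event | given), assuming every sample is equally likely."""
    return len(event & given) / len(given)


p_f_given_a = conditional_probability(fractured, annealed)
print(f"P(F|A) = {p_f_given_a:.2f}")
```

Counting inside the conditioning set (the annealed samples) is equivalent to dividing the joint probability by $P(A)$, because the total sample count cancels.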

> ### Assignment
> - Question 2: What fraction of ceramic samples fractured, given that they were not annealed?

The number of samples that fractured but were not annealed is $65 - 45 = 20$. The number of samples that were not annealed is $100 - 50 = 50$. Hence, the conditional probability is
$$
P(F|\neg A) = \frac{P(F \cap \neg A)}{P(\neg A)} = \frac{0.20}{0.50} = 0.40 = 40\%.
$$

In this exercise, we used conditional probabilities to quantify how processing affects a material property. In this data set, the annealed samples were more likely to fracture (90%) than the non-annealed ones (40%).
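
As a consistency check, the two conditional probabilities should reproduce the overall fracture fraction through the law of total probability:
$$
P(F) = P(F|A)\,P(A) + P(F|\neg A)\,P(\neg A) = 0.90 \times 0.50 + 0.40 \times 0.50 = 0.65,
$$
which matches the 65 fractured samples out of 100.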

---


# Using Bayes' Theorem for Rare Event Detection in Materials Science

<img src="https://upload.wikimedia.org/wikipedia/commons/a/ab/Levitation_of_a_magnet_on_a_superconductor.jpg" alt="Superconductor image from Wikipedia" align="right" style="width: 200px;float: right;"/>

## Scenario: Predicting Superconducting Materials
We are developing a machine learning model to predict whether a given material will be a superconductor based on its features (e.g., composition, structure, and electronic properties). Superconductors are rare, and only a small fraction of materials exhibit this property.

### Definitions
- **Positive prediction (E)**: The model predicts a material is a superconductor.
- **Actual positive (H)**: The material is indeed a superconductor.
- **Prior probability (P(H))**: The fraction of all materials that are superconductors, e.g., around 2%.
- **False positive rate (FPR)**: Probability that a non-superconductor is predicted as a superconductor.
- **True positive rate (TPR)**: Probability that a superconductor is correctly identified.

---

## Confusion matrix: the single table that contains *all* outcomes

Let:

- $H$: the material **is** a superconductor (ground truth is positive)
- $\neg H$: the material is **not** a superconductor (ground truth is negative)
- $E$: the model **predicts** superconductor (prediction is positive)
- $\neg E$: the model predicts **not** superconductor (prediction is negative)

A **confusion matrix** counts what happens when we compare predictions to ground truth:

|                       | Actually $H$ | Actually $\neg H$ |
|---|---:|---:|
| Predicted $E$         | True Positive (TP)  | False Positive (FP) |
| Predicted $\neg E$   | False Negative (FN) | True Negative (TN) |
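
A minimal sketch of how the four cells are counted from paired labels. The two lists are made-up toy data, not results from any model, and the variable names are illustrative.

```python
# Toy data (hypothetical): True = superconductor, False = not a superconductor
actual    = [True, True, False, False, False, True, False, False]
predicted = [True, False, True, False, False, True, False, False]

pairs = list(zip(actual, predicted))
tp = sum(a and p for a, p in pairs)            # H and E
fn = sum(a and not p for a, p in pairs)        # H and not E
fp = sum(not a and p for a, p in pairs)        # not H and E
tn = sum(not a and not p for a, p in pairs)    # not H and not E

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```

Every sample falls into exactly one cell, so the four counts always sum to the number of samples.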

### Precision and recall as conditional probabilities

From this table, two metrics answer two different *materials-discovery* questions:

**Recall (a.k.a. sensitivity / true positive rate)**  
> “Of all true superconductors, what fraction did we catch?”

$$
\text{Recall} = P(E\mid H) = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

Interpretation:
- High recall → good for discovery science
- Low recall → model misses promising materials


**Precision (a.k.a. positive predictive value)**  
> “Of all candidates we send to synthesis/measurement, what fraction are truly superconductors?”

$$
\text{Precision} = P(H\mid E) = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$

Interpretation:
- High precision → efficient use of synthesis time
- Low precision → many wasted experiments

**Important:** recall conditions on the *truth*; precision conditions on the *prediction*. They are different questions.
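
The two definitions translate directly into code. The counts below are hypothetical placeholders, chosen only to show that the two metrics can differ widely for the same model.

```python
def recall(tp, fn):
    """P(E|H): fraction of true superconductors that were flagged."""
    return tp / (tp + fn)


def precision(tp, fp):
    """P(H|E): fraction of flagged materials that are superconductors."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only
tp, fn, fp = 8, 2, 32
print(f"recall    = {recall(tp, fn):.2f}")
print(f"precision = {precision(tp, fp):.2f}")
```

Note that recall never looks at false positives and precision never looks at false negatives, which is why one can be high while the other is low.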

---

## Bayes’ theorem: why rare-event discovery is hard

Bayes’ theorem converts a quantity that is often easier to estimate from labeled data (**recall**, $P(E\mid H)$) into the quantity that experimentalists often care about (**precision**, $P(H\mid E)$):

$$
P(H\mid E) = \frac{P(E\mid H)\,P(H)}{P(E)}
$$

and

$$
P(E) = P(E\mid H)P(H) + P(E\mid \neg H)P(\neg H)
$$

where:

- $P(H)$ is the **base rate / prevalence** of superconductors in the population (typically small),
- $P(E\mid \neg H)$ is the **false positive rate (FPR)**.

Putting this together gives a precision formula that makes the “rare-event penalty” explicit:

$$
\text{Precision} = P(H\mid E)
= \frac{\text{TPR}\,P(H)}{\text{TPR}\,P(H) + \text{FPR}\,P(\neg H)}
$$

This is the main practical lesson for materials screening: even with high recall, precision can remain low if $P(H)$ is small and/or the FPR is not very small.
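
To see the rare-event penalty numerically, the sketch below evaluates the formula for several base rates. The function name `precision_from_rates` and the TPR, FPR, and prevalence values are illustrative assumptions, not values from a real model.

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision P(H|E) from Bayes' theorem."""
    true_pos = tpr * prevalence          # P(E|H) P(H)
    false_pos = fpr * (1 - prevalence)   # P(E|not H) P(not H)
    return true_pos / (true_pos + false_pos)


# Same hypothetical classifier, increasingly rare positives
for prevalence in (0.5, 0.1, 0.01, 0.001):
    p = precision_from_rates(tpr=0.95, fpr=0.05, prevalence=prevalence)
    print(f"P(H) = {prevalence:<6} precision = {p:.3f}")
```

The classifier itself is unchanged across the loop; only the base rate differs, and precision drops with it.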



> ### Assignment
> Calculate the probability that a material is a superconductor given that the machine learning model made a positive prediction (accuracy of positive predictions) using Bayes' Theorem:
> - $P(H) = 0.02$ (rare event).
> - $P(\neg H) = 1 - P(H) = 0.98$.
> - Model's **Recall** TPR = 90%
> - Model's **False Positive Rate** FPR = 20%
>
> For the confusion-matrix method, assume you screen **$N=10{,}000$** candidate materials.

We calculate the model's **precision** (accuracy of positive predictions) using Bayes' Theorem:
$$
P(H|E) = \frac{P(E|H)P(H)}{P(E)}
$$
where:
$$
P(E) = P(E|H)P(H) + P(E|\neg H)P(\neg H)
$$


In [None]:
# Parameters (probabilities)
P_H = 0.02         # P(H): Prevalence of superconductors in the candidate pool
                   #       Probability of a material being a superconductor
P_not_H = 1 - P_H  # P(¬H): Probability of a material NOT being a superconductor
TPR = 0.9          # P(E|H): True positive rate = recall
FPR = 0.2          # P(E|¬H): False positive rate
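
One way to complete the calculation is sketched below, using the parameter values from the assignment (repeated so the block runs on its own). The variable names `P_E` and `P_H_given_E` are chosen here for illustration.

```python
# Assignment values, repeated so this block is self-contained
P_H = 0.02
P_not_H = 1 - P_H
TPR = 0.9
FPR = 0.2

# Total probability of a positive prediction, P(E)
P_E = TPR * P_H + FPR * P_not_H

# Bayes' theorem: precision = P(H|E)
P_H_given_E = TPR * P_H / P_E

print(f"P(E)   = {P_E:.4f}")
print(f"P(H|E) = {P_H_given_E:.4f}")
```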



In [None]:
# Hypothetical screening size
N = 10_000
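
The confusion-matrix method can be sketched the same way: convert the rates into expected counts for the hypothetical pool of $N$ candidates, then read precision off the table. Values are repeated from the assignment so the block is self-contained; the count variable names are illustrative.

```python
N = 10_000
P_H, TPR, FPR = 0.02, 0.9, 0.2

n_H = N * P_H           # actual superconductors
n_not_H = N - n_H       # actual non-superconductors

TP = TPR * n_H          # superconductors correctly flagged
FN = n_H - TP           # superconductors missed
FP = FPR * n_not_H      # non-superconductors wrongly flagged
TN = n_not_H - FP       # non-superconductors correctly rejected

precision = TP / (TP + FP)

print(f"TP={round(TP)}, FN={round(FN)}, FP={round(FP)}, TN={round(TN)}")
print(f"precision = {precision:.4f}")
```

Both methods give the same precision because the counts are just the probabilities scaled by $N$.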



### Analysis
Even with a high recall (90%) and a seemingly moderate false positive rate (20%), the **precision** is only 8.41%. Superconductors are so rare that false positives dominate the positive predictions: among $N = 10{,}000$ candidates, about 180 true positives compete with about 1,960 false positives.


### Insights
1. **Precision is crucial for rare event detection**: High recall ensures that most superconductors are identified, but low precision means that most of the predicted superconductors are actually non-superconductors.
2. **Improving the model**:
   - Use more features or better descriptors to reduce false positives.
   - Increase the threshold for positive classification.
   - Incorporate prior knowledge (e.g., physics-based constraints) to refine predictions.
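
To gauge how far the false positive rate would have to drop, the precision formula can be solved for FPR. The helper name `fpr_for_target_precision` and the 50% target are assumptions for illustration; TPR and $P(H)$ are the assignment values.

```python
def fpr_for_target_precision(target, tpr, prevalence):
    """Largest FPR that still reaches the target precision.

    Obtained by solving precision = tpr*p / (tpr*p + fpr*(1 - p)) for fpr.
    """
    return tpr * prevalence * (1 - target) / (target * (1 - prevalence))


TPR = 0.9    # recall from the assignment
P_H = 0.02   # prevalence from the assignment

required_fpr = fpr_for_target_precision(target=0.5, tpr=TPR, prevalence=P_H)
print(f"FPR needed for 50% precision: {required_fpr:.4f}")
```

With these inputs the FPR would have to fall from 20% to roughly 1.8%, assuming recall stays at 90%.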
   
---

## Estimating a Confusion Matrix for a Superconductivity Model

### Background
A machine learning model has been trained to predict whether materials are superconductive or not. You are tasked with estimating a confusion matrix for this model based on the following known performance metrics:

- **Total Number of Materials:** 250,000
- **Precision:** 90% (0.90)
- **Recall:** 70% (0.70)
- **Class Distribution:**
  - 2% of the materials are superconductive (positive class).
  - 98% of the materials are non-superconductive (negative class).

### Tasks:
1. **Define Key Parameters**
   - Calculate the total number of **superconductive materials** and **non-superconductive materials** in the dataset.
   - Use the definitions of precision and recall to guide the estimation of the confusion matrix.

2. **Estimate the Confusion Matrix**
   - Start with the definitions of precision and recall:
     - **Precision:**  
       $$
       \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
       $$
     - **Recall:**  
       $$
       \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
       $$
   - Estimate the values for:
     - **True Positives (TP)**
     - **False Positives (FP)**
     - **True Negatives (TN)**
     - **False Negatives (FN)**
   - Ensure the total counts in your confusion matrix equal the dataset size (250,000).

---

## Solution

### Given:
- **Total Materials:** 250,000
- **Precision:** 90% (0.90)
- **Recall:** 70% (0.70)
- **Class Distribution:**
  - **Superconductive (Positive Class):** $ P = 2\% \times 250,000 = 5,000 $
  - **Non-Superconductive (Negative Class):** $ N = 98\% \times 250,000 = 245,000 $

---


### Step 1: Calculate True Positives (TP) and False Negatives (FN)

From recall:
$$
\text{Recall} = \frac{\text{TP}}{P} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$
Since $P = \text{TP} + \text{FN}$, this rearranges to:
$$
\text{TP} = \text{Recall} \times P, \qquad \text{FN} = P - \text{TP}
$$

Substitute $ \text{Recall} = 0.70 $ and $ P = 5,000 $:
$$
\text{TP} = \text{Recall} \times P = 0.70 \times 5,000 = 3,500
$$
$$
\text{FN} = P - \text{TP} = 5,000 - 3,500 = 1,500
$$

---


### Step 2: Calculate False Positives (FP)

From precision:
$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$
Rearranging:
$$
\text{FP} = \frac{\text{TP}}{\text{Precision}} - \text{TP}
$$

Substitute $ \text{TP} = 3,500 $ and $ \text{Precision} = 0.90 $:
$$
\text{FP} = \frac{3,500}{0.90} - 3,500 \approx 3,889 - 3,500 = 389
$$

---


### Step 3: Calculate True Negatives (TN)

Using total negatives:
$$
\text{TN} = N - \text{FP} = 245,000 - 389 = 244,611
$$

---


### Final Confusion Matrix:

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| **Actual Positive** | **3,500**            | **1,500**            |
| **Actual Negative** | **389**              | **244,611**          |
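
A quick sanity check, using the counts from the table: the matrix should sum to the dataset size and reproduce the stated precision and recall (up to the rounding of FP). The implied false positive rate is included as an extra derived quantity; it is not one of the given metrics.

```python
# Counts taken from the final confusion matrix
tp, fn = 3_500, 1_500
fp, tn = 389, 244_611

total = tp + fn + fp + tn
precision = tp / (tp + fp)
recall = tp / (tp + fn)
fpr = fp / (fp + tn)    # derived: fraction of negatives that were flagged

print(f"total     = {total}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"FPR       = {fpr:.5f}")
```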

---

### Step 4: Calculate Accuracy

$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{Total Materials}}
$$

Substitute values:
$$
\text{Accuracy} = \frac{3,500 + 244,611}{250,000} = \frac{248,111}{250,000} = 0.9924 \, (99.24\%)
$$
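
Accuracy should be read against the class imbalance. As a point of comparison (an added illustration, not part of the original task), the sketch below computes the accuracy of a trivial classifier that labels every material as non-superconductive.

```python
total = 250_000
positives = 5_000
negatives = total - positives

# Trivial baseline: predict "not superconductive" for every material,
# so all negatives are correct and all positives are missed
baseline_accuracy = negatives / total

# Model accuracy from the estimated confusion matrix (TP + TN)
model_accuracy = (3_500 + 244_611) / total

print(f"baseline accuracy = {baseline_accuracy:.4f}")
print(f"model accuracy    = {model_accuracy:.4f}")
```

The baseline already reaches 98% accuracy while finding no superconductors at all, so the model's gain in accuracy is small even though its recall and precision are far better.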

---

### Analysis

- **Precision ($90\%$)**: Indicates the model is very selective when predicting superconductivity, resulting in very few false positives ($389$ out of $245,000$ negatives).
- **Recall ($70\%$)**: The model captures $70\%$ of the actual superconductors but misses the remaining $30\%$ ($1,500$).
- **Accuracy ($99.24\%$)**: High, but largely driven by the $98\%$ negative class, so it should be interpreted alongside precision and recall rather than on its own.

**Trade-offs:** The model prioritizes precision, leading to fewer false positives, which is important for applications where false alarms are costly. However, the lower recall means some true superconductors are missed, which could be problematic depending on the application's goals.

--- 


In [None]:
# Given parameters
total_materials = 250000
positive_rate = 0.02
precision = 0.90
recall = 0.70

# Calculate total positive and negative samples
positives = total_materials * positive_rate
negatives = total_materials - positives

# Calculate True Positives (TP) and False Negatives (FN)
tp = recall * positives
fn = positives - tp

# Calculate False Positives (FP) and True Negatives (TN)
fp = (tp / precision) - tp
tn = negatives - fp

# Round values for confusion matrix
tp, fn, fp, tn = round(tp), round(fn), round(fp), round(tn)

# Display confusion matrix
confusion_matrix = {
    "True Positives (TP)": tp,
    "False Positives (FP)": fp,
    "True Negatives (TN)": tn,
    "False Negatives (FN)": fn
}
confusion_matrix

In [None]:
# Plot the confusion matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Rows are actual classes, columns are predicted classes
confusion_matrix_values = np.array([
    [confusion_matrix["True Positives (TP)"], confusion_matrix["False Negatives (FN)"]],
    [confusion_matrix["False Positives (FP)"], confusion_matrix["True Negatives (TN)"]],
])
sns.heatmap(confusion_matrix_values, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=["Predicted Positive", "Predicted Negative"],
            yticklabels=["Actual Positive", "Actual Negative"])
plt.title("Confusion Matrix")
plt.show()