# Homework Set 1
## Problem 1. Confusion Matrix + Bayes for a Materials Screening Model
(30 points)

### Background
A machine-learning classifier is used to **screen candidate solid-state electrolyte materials**.  
The model predicts whether a material is a **fast Li-ion conductor** (positive class) or **not** (negative class), based on structure-derived features.

You are given the following information about a screening workflow:

- **Total number of candidate materials screened:** 200,000  
- **Prevalence (ground truth):**  
  - 1% of candidates are truly **fast conductors** (positive class).  
  - 99% are **not** fast conductors (negative class).  
- **Model performance (measured on a representative validation set):**
  - **True-positive rate** (Sensitivity): 80% (0.80)  
  - **True-negative rate** (Specificity): 95% (0.95)

Your goal is to estimate the **confusion matrix** for the full screening campaign, and then use **Bayes’ theorem** to answer practical questions about what a model prediction means.
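
For reference, Bayes’ theorem for this kind of binary screening problem can be written in terms of the quantities listed above. The shorthand symbols $\pi$ (prevalence), $\text{Se}$ (sensitivity), and $\text{Sp}$ (specificity) are introduced here for compactness and are not part of the original problem statement:

$$
P(\text{truly fast} \mid \text{predicted fast}) = \frac{\text{Se}\,\pi}{\text{Se}\,\pi + (1 - \text{Sp})(1 - \pi)}
$$

The denominator is the total probability of a positive prediction (true positives plus false positives). The analogous expression for a negative prediction is

$$
P(\text{truly fast} \mid \text{predicted not fast}) = \frac{(1 - \text{Se})\,\pi}{(1 - \text{Se})\,\pi + \text{Sp}\,(1 - \pi)}
$$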

---

### Tasks

1. **Define key counts from prevalence**
   - Compute how many materials in the 200,000-candidate pool are truly:
     - **Fast conductors** (positive)
     - **Not fast conductors** (negative)

2. **Estimate the confusion matrix**
   - Use the definitions below to estimate the confusion-matrix entries:
     - **Sensitivity (TPR):**
       $$
       \text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}
       $$
     - **Specificity (TNR):**
       $$
       \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}
       $$
   - Solve for:
     - **True Positives (TP)**
     - **False Negatives (FN)**
     - **True Negatives (TN)**
     - **False Positives (FP)**
   - Check that your totals satisfy:
     $$
     \text{TP} + \text{FN} + \text{TN} + \text{FP} = 200{,}000.
     $$

3. **Answer two questions about “What does a prediction mean?” using Bayes’ theorem.**

   Use your confusion-matrix numbers (or the equivalent conditional probabilities) to answer:

   a) If the model predicts **“fast conductor”**, what fraction of those predicted-fast materials are truly fast conductors?

   b) If the model predicts **“not fast conductor”**, what fraction of those predicted-not-fast materials are *actually* fast conductors?

   *Hint:* Both questions are conditional probabilities of the form $P(\text{Truth} \mid \text{Prediction})$. You may compute them directly from the confusion matrix, or by writing Bayes’ theorem explicitly.

4. **Plot the confusion matrix**
   - Create a labeled confusion-matrix plot (counts, not just normalized fractions).
   - You may use `seaborn.heatmap` or any other plotting method you prefer.
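
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference solution: the function name `expected_confusion_matrix` is hypothetical, and the numbers are toy values chosen to be deliberately different from those in the assignment.

```python
def expected_confusion_matrix(n_total, prevalence, sensitivity, specificity):
    """Return expected (TP, FN, TN, FP) counts for a screened population."""
    n_pos = n_total * prevalence          # truly positive items
    n_neg = n_total - n_pos               # truly negative items
    tp = sensitivity * n_pos              # from Se = TP / (TP + FN)
    fn = n_pos - tp
    tn = specificity * n_neg              # from Sp = TN / (TN + FP)
    fp = n_neg - tn
    return tp, fn, tn, fp


# Hypothetical toy numbers, NOT the ones given in the problem statement.
tp, fn, tn, fp = expected_confusion_matrix(
    n_total=1_000, prevalence=0.10, sensitivity=0.90, specificity=0.80
)

# Fraction of predicted positives that are truly positive: P(truth=+ | pred=+).
precision = tp / (tp + fp)
# Fraction of predicted negatives that are truly positive: P(truth=+ | pred=-).
false_omission_rate = fn / (fn + tn)

print(f"TP={tp:.0f} FN={fn:.0f} TN={tn:.0f} FP={fp:.0f}")
print(f"P(+ | predicted +) = {precision:.3f}")
print(f"P(+ | predicted -) = {false_omission_rate:.3f}")
```

With these toy values only about one third of the predicted positives are true positives, even though the sensitivity is 90%. This is the kind of base-rate effect the Bayes questions are meant to expose.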

---
---

## Problem 2. One-Hot Encoding for Chemical Species in Small Molecules (70 points)

### Objective:
Learn how to implement one-hot encoding for representing the chemical species in small molecules. This assignment will teach you how categorical data (e.g., atom types) can be transformed into numerical representations suitable for machine learning applications.

---

### What is One-Hot Encoding?

**Definition:**
One-hot encoding is a method for converting categorical data (data that can take on a limited number of distinct values) into a numerical format that machine learning models can understand. Each unique category is represented as a binary vector whose length equals the number of categories. In this vector, the position corresponding to the category is set to `1`, and all other positions are set to `0`.

---

### Example:
For the categories `['H', 'C', 'O', 'N']`:
- `H` → `[1, 0, 0, 0]`
- `C` → `[0, 1, 0, 0]`
- `O` → `[0, 0, 1, 0]`
- `N` → `[0, 0, 0, 1]`
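
The mapping above can be generated programmatically. The following is a minimal sketch in plain Python; the helper name `one_hot` is an illustrative choice, not a required interface.

```python
def one_hot(category, categories):
    """Return a one-hot list for `category` given an ordered list of categories."""
    if category not in categories:
        raise ValueError(f"Unknown category: {category!r}")
    return [1 if c == category else 0 for c in categories]


categories = ["H", "C", "O", "N"]  # same ordering as the example above
encoding = {c: one_hot(c, categories) for c in categories}

for symbol, vector in encoding.items():
    print(symbol, vector)
```

Note that the vectors depend on the ordering of `categories`: a different ordering gives a different, but equally valid, encoding.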

---

### Why not encode species as a single number?

A tempting alternative is **label encoding**, e.g. `H=1, C=2, N=3, O=4`. The problem is that many models (and essentially all distance-based reasoning) will treat these numbers as if they carry *meaningful magnitudes*. For instance, with label encoding:
- `|C − H| = |2 − 1| = 1` suggests **H is “closer” to C** than to O,
- `|O − H| = |4 − 1| = 3` suggests **H is “far” from O**.

But element identities are *categories*, not points on a number line. One-hot encoding avoids introducing this artificial “closeness” or ordering: all different species are equally distinct unless the data (or a learned embedding) provides a reason otherwise.
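
The difference can be checked numerically. The sketch below compares pairwise distances under the label encoding from the text with pairwise distances between one-hot vectors. Euclidean distance is used as one possible metric; that choice is an assumption made here for illustration.

```python
import math
from itertools import combinations

species = ["H", "C", "N", "O"]
label = {"H": 1, "C": 2, "N": 3, "O": 4}   # label encoding from the text
one_hot = {s: [1 if s == t else 0 for t in species] for s in species}

label_dist = {}
one_hot_dist = {}
for a, b in combinations(species, 2):
    label_dist[(a, b)] = abs(label[a] - label[b])
    one_hot_dist[(a, b)] = math.dist(one_hot[a], one_hot[b])  # Euclidean distance

print(label_dist)     # distances vary from pair to pair
print(one_hot_dist)   # every pair is the same distance apart
```

Under label encoding the pair distances differ depending on the arbitrary numbering, whereas every pair of distinct one-hot vectors is separated by the same distance, $\sqrt{2}$.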

---

### Purpose in Machine Learning

1. **Handling Categorical Data:**
   Machine learning algorithms typically work with numerical data. One-hot encoding converts non-numeric categories into numbers **without implying any ranking** among categories.

2. **Preventing Misinterpretation:**
   Unlike label encoding, one-hot encoding does not smuggle in an unintended notion of distance or order between chemical species.

3. **Enabling Compatibility:**
   Many machine learning models and libraries (e.g., neural networks, or decision trees as implemented in scikit-learn) require fixed-size numerical inputs and cannot process raw categorical data directly.

4. **Avoiding Bias:**
   One-hot encoding treats all categories symmetrically, preventing the model from assuming that some species are “greater than” others.

---

### Why Use One-Hot Encoding for Molecules?

In cheminformatics and materials science, molecules often contain categorical data like atom types. Using one-hot encoding:
- Ensures that all atom types (e.g., H, C, O, N) are treated as distinct entities.
- Prepares molecular data for machine learning models that predict properties such as reactivity, stability, or other materials-relevant behavior.
- Captures molecular composition in a structured and interpretable format.

---

By applying one-hot encoding to molecules, we can convert molecular structures into a numerical representation suitable for machine learning workflows while avoiding unintended numerical assumptions.



### Problem Description:
You are provided with a small dataset of molecules. Each molecule is described by a list of its atoms, given by type (e.g., H, C, O, N). You may represent the dataset as a Python dictionary mapping molecule names to lists of atom symbols (e.g., `{'Molecule 1': ['H','H','O'], ...}`), or as an equivalent list of lists.

Your tasks are:

1. **Identify Unique Chemical Species**:
   Extract all unique atom types across the dataset.

2. **Create One-Hot Encodings**:
   Assign a binary vector to each unique atom type.

3. **Encode Molecules Using One-Hot Representations**:
   Convert the list of atoms for each molecule into their corresponding one-hot encoded matrix.

4. **Composition of Molecule**:
   - Summarize each molecule by the total count of each species (e.g., [1, 2, 0, 0] for 1 C, 2 H, 0 N, and 0 O).
   - Visualize the one-hot encoded data using a heatmap.
     - For the heatmap, use the **per-molecule composition vectors** (rows = molecules, columns = species in your chosen ordering).

---

### Dataset Example:

| Molecule Name | Atoms        |
|---------------|--------------|
| Molecule 1    | H, H, O      |
| Molecule 2    | C, H, H, O   |
| Molecule 3    | N, H, H, C, O |
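
As a starting point, the table can be written as a Python dictionary, as suggested in the problem description. The sketch below also shows one way to obtain a reproducible species ordering. The variable names `molecules`, `species`, and `index_of` are illustrative choices, not required names.

```python
# Dataset from the table above, keyed by molecule name.
molecules = {
    "Molecule 1": ["H", "H", "O"],
    "Molecule 2": ["C", "H", "H", "O"],
    "Molecule 3": ["N", "H", "H", "C", "O"],
}

# A set removes duplicates; sorted() then fixes a reproducible (alphabetical)
# order, since Python sets have no guaranteed iteration order.
species = sorted({atom for atoms in molecules.values() for atom in atoms})

# Map each symbol to its column index in the one-hot vectors.
index_of = {symbol: i for i, symbol in enumerate(species)}

print(species)
print(index_of)
```

Sorting matters here: relying on the iteration order of a bare `set` could assign different columns to the same species from one run to the next.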

---

### Tasks:

Use **alphabetical ordering** of species (e.g., C, H, N, O) when constructing one-hot vectors so that all answers are consistent.

1. Extract the unique species from the dataset (e.g., $\{C, H, N, O\}$).
2. Create one-hot encodings for these species:
   - Example (species order C, H, N, O):
     - $C: [1, 0, 0, 0]$
     - $H: [0, 1, 0, 0]$
     - $N: [0, 0, 1, 0]$
     - $O: [0, 0, 0, 1]$
3. Convert each molecule into a one-hot encoded matrix:
   - Example for Molecule 1 $(H, H, O)$:
     $$
     \begin{bmatrix}
     0 & 1 & 0 & 0 \\
     0 & 1 & 0 & 0 \\
     0 & 0 & 0 & 1
     \end{bmatrix}
     $$
4. (Optional) Summarize each molecule by counting the total occurrences of each species:
   - Example (species order C, H, N, O):
     - Molecule 1: $[0, 2, 0, 1]$
     - Molecule 2: $[1, 2, 0, 1]$
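
The relationship between the one-hot matrix and the composition vector can be sketched as follows. The molecule used here is hypothetical and deliberately not taken from the dataset. The species list is hard-coded in alphabetical order for brevity.

```python
from collections import Counter

species = ["C", "H", "N", "O"]   # alphabetical ordering, as required above
atoms = ["O", "C", "O"]          # hypothetical molecule, not in the dataset

# One row per atom, one column per species.
matrix = [[1 if atom == s else 0 for s in species] for atom in atoms]

# Composition vector: the column sums of the one-hot matrix ...
composition = [sum(column) for column in zip(*matrix)]

# ... which must agree with simply counting the symbols.
counts = Counter(atoms)
assert composition == [counts[s] for s in species]

for row in matrix:
    print(row)
print("composition:", composition)
```

Stacking the composition vectors of all molecules row by row gives a molecules-by-species table of counts, which is the natural input for a heatmap.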

---

### Deliverables:
1. Python code that implements the above tasks.
2. A report explaining your implementation and showing the results (encoded matrices for each molecule).
3. A visualization of the one-hot encoded data.

---

### Hints:
- Use Python’s `set()` to extract unique atom types.
- Use libraries like `NumPy` or `pandas` for matrix manipulations.
- Use the seaborn library to create a heatmap of the one-hot encoding.
---
---

## Problem 3. Generalizing One-Hot Encoding for Molecules
For Graduate Students or Extra Credit for Undergraduate Students (30 points)

### Objective:
Write a Python program that generalizes the one-hot encoding process to work for a set of molecules given as XYZ files in a folder called `molecules`.

### Instructions:

1. **Folder Structure**:
   - Use the provided folder named `molecules` containing XYZ files. Each XYZ file represents a molecule with atomic coordinates.

2. **Reading XYZ Files**:
   - Write a function `read_xyz(file_path)` that reads an XYZ file and returns a list of atoms in the molecule.
   - Assume standard XYZ format: line 1 is the number of atoms, line 2 is a comment. Read the element symbol from each remaining line and ignore any extra columns beyond `Element x y z`.

3. **One-Hot Encoding**:
   - Build the **global species list** from *all* XYZ files in the folder first (or use a two-pass approach), then apply one-hot encoding using that shared ordering.
   - Implement a function `one_hot_encode_atoms(atom_list)` that takes a list of atoms and returns a one-hot encoded representation.
   - The one-hot encoding should create a binary vector for each atom type present in the dataset. For example, if the dataset contains Hydrogen (H), Carbon (C), and Oxygen (O), the one-hot encoding for H would be `[1, 0, 0]`, for C would be `[0, 1, 0]`, and for O would be `[0, 0, 1]`.

4. **Processing All Molecules**:
   - Write a function `process_molecules(folder_path)` that processes all XYZ files in the `molecules` folder, applies one-hot encoding to each molecule, and stores the results in a dictionary where the keys are the file names and the values are the one-hot encoded representations.

5. **Output**:
   - Print and visualize the one-hot encoded representations for each molecule.

6. **Testing**:
   - Make sure your program is general and reads all files in a given folder. We will test your program on a folder with a different set of molecules.
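
The two-pass idea from step 3 can be illustrated without any file handling. In the sketch below, the dictionary `all_atoms` is a hypothetical stand-in for the atom lists that would be read from XYZ files, and the helper names `build_species` and `encode` are illustrative only (they are not the function names required above). Sorting the species is an assumption made here for reproducibility; the assignment's own example lists them in the order H, C, O.

```python
def build_species(all_atoms):
    """Pass 1: collect every symbol seen in any molecule, in sorted order."""
    return sorted({atom for atoms in all_atoms.values() for atom in atoms})


def encode(atoms, species):
    """Pass 2: one-hot encode one molecule against the shared species list."""
    index_of = {s: i for i, s in enumerate(species)}
    rows = []
    for atom in atoms:
        row = [0] * len(species)
        row[index_of[atom]] = 1
        rows.append(row)
    return rows


# Hypothetical stand-in for the atom lists read from two XYZ files.
all_atoms = {
    "a.xyz": ["H", "H", "O"],
    "b.xyz": ["C", "O"],
}

species = build_species(all_atoms)
encoded = {name: encode(atoms, species) for name, atoms in all_atoms.items()}

print(species)
for name, rows in encoded.items():
    print(name, rows)
```

Because the species list is built from all molecules before any encoding happens, every row has the same length and a given column always refers to the same element, even for molecules that do not contain every species.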

**Submission note:** You may submit either a `.py` script or a notebook export, as long as the required functions are clearly defined and runnable.


#### Example XYZ File Content:


```
5
Comment line
H 0.0 0.0 0.0  #Atom type  x  y  z
C 0.0 0.0 1.0
O 0.0 1.0 0.0
H 1.0 0.0 0.0
C 1.0 1.0 1.0
```
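
The parsing rules described in the instructions (atom count on line 1, comment on line 2, element symbol as the first token of each remaining line) can be sketched on an in-memory string. The helper name `parse_xyz_text` is hypothetical. It operates on text rather than a file path, so it is an illustration of the format, not the required `read_xyz` function.

```python
def parse_xyz_text(text):
    """Return the element symbols from the contents of one XYZ file."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])     # line 1: number of atoms
    atom_lines = lines[2:2 + n_atoms]      # line 2 is a comment, so skip it
    if len(atom_lines) != n_atoms:
        raise ValueError(f"Expected {n_atoms} atom lines, found {len(atom_lines)}")
    # The symbol is the first whitespace-separated token; later columns are ignored.
    return [line.split()[0] for line in atom_lines]


# Same content as the example file shown above.
example = """5
Comment line
H 0.0 0.0 0.0  #Atom type  x  y  z
C 0.0 0.0 1.0
O 0.0 1.0 0.0
H 1.0 0.0 0.0
C 1.0 1.0 1.0
"""

atoms = parse_xyz_text(example)
print(atoms)
```

Taking only the first token of each atom line means that trailing columns, such as the annotation on the first atom line of the example, do not affect the result.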



#### Example Output:


```
molecule1.xyz: [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
molecule2.xyz: [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```



#### Submission:
Submit your Jupyter Notebook (or `.py` script) containing the functions and the main program. Ensure that your code is well documented and follows best practices for readability and maintainability.