# Homework Set 1
## Problem 1. Estimating a Confusion Matrix for a Superconductivity Model (30 points)

### Background
A machine learning model has been trained to predict whether materials are superconductive or not. The model takes as input the crystal structure and a coarse estimate of the phonon density of states of a material. The model, called BEEnet, uses a crystal-graph convolutional neural network.

You are tasked with determining the confusion matrix for this model based on the following known performance metrics:

- **Total Number of Materials:** 250,000
- **Class Distribution:**
  - 2% of the materials are superconductive (positive class).
  - 98% of the materials are non-superconductive (negative class).
- **Precision of BEEnet:** 90% (0.90)
- **Recall of BEEnet:** 70% (0.70)

---

### Tasks:
1. **Define Key Parameters**
   - Calculate the total number of **superconducting materials** and **non-superconducting materials** in the dataset.

2. **Estimate the Confusion Matrix**
   - Use the definitions of precision and recall to estimate the confusion matrix.
   - Start with the definitions of precision and recall:
     - **Precision:**  
       $$
       \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
       $$
     - **Recall:**  
       $$
       \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
       $$
   - Estimate the values for:
     - **True Positives (TP)**
     - **False Positives (FP)**
     - **True Negatives (TN)**
     - **False Negatives (FN)**

3. **Create a plot of the confusion matrix**
   - You can use the seaborn package, whose `heatmap` function is well suited to plotting confusion matrices.
   - Ensure the total counts in your confusion matrix equal the dataset size (250,000).
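
The algebra behind Task 2 can be sketched in a few lines. The numbers below are hypothetical (1,000 samples, 10% positives, precision 0.80, recall 0.60), chosen so that the counts come out as whole numbers; they are not the BEEnet values. The function name `confusion_from_metrics` is an illustrative assumption, not part of any library.

```python
def confusion_from_metrics(n_total, positive_fraction, precision, recall):
    """Estimate (TP, FP, TN, FN) from dataset size, class balance, precision, and recall."""
    n_pos = n_total * positive_fraction   # actual positives = TP + FN
    n_neg = n_total - n_pos               # actual negatives = TN + FP
    tp = recall * n_pos                   # from recall = TP / (TP + FN)
    fn = n_pos - tp
    fp = tp / precision - tp              # from precision = TP / (TP + FP)
    tn = n_neg - fp
    # Counts must be whole numbers, so each estimate is rounded.
    return tuple(round(x) for x in (tp, fp, tn, fn))


# Hypothetical example values (not the homework numbers).
tp, fp, tn, fn = confusion_from_metrics(1000, 0.10, 0.80, 0.60)
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}, total={tp + fp + tn + fn}")
```

Because each count is rounded independently, the total can be off by one for other inputs, so it is worth checking the sum against the dataset size. For plotting, one common layout is `[[tn, fp], [fn, tp]]` (rows are actual classes, columns are predicted classes), passed to `seaborn.heatmap` with `annot=True` and `fmt="d"`.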

---

## Problem 2. One-Hot Encoding for Chemical Species in Small Molecules (70 points)

### Objective:
Learn how to implement one-hot encoding for representing the chemical species in small molecules. This assignment will teach you how categorical data (e.g., atom types) can be transformed into numerical representations suitable for machine learning applications.

---

### What is One-Hot Encoding?

**Definition:**
One-hot encoding is a method for converting categorical data (data that can take on a limited number of distinct values) into a numerical format that machine learning models can understand. Each unique category is represented as a binary vector with a length equal to the number of categories. In this vector, one position corresponding to the category is marked as `1`, and all other positions are marked as `0`.

---

### Example:
For the categories `['H', 'C', 'O', 'N']`:
- `H` → `[1, 0, 0, 0]`
- `C` → `[0, 1, 0, 0]`
- `O` → `[0, 0, 1, 0]`
- `N` → `[0, 0, 0, 1]`
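
The mapping above can be built programmatically. This is a minimal pure-Python sketch that assumes the category order shown in the example; the variable names are illustrative.

```python
species = ["H", "C", "O", "N"]  # fixed category order from the example above

# Row i of the identity matrix is the one-hot vector for species[i].
one_hot = {
    s: [1 if i == j else 0 for j in range(len(species))]
    for i, s in enumerate(species)
}

for s, vec in one_hot.items():
    print(s, vec)
```

The position of the `1` depends only on the order of `species`, so the same list must be reused wherever vectors are compared.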

---

### Purpose in Machine Learning

1. **Handling Categorical Data:**
   Machine learning algorithms typically work with numerical data. One-hot encoding converts non-numeric categorical features into a numerical format without introducing any ordinality (unlike label encoding, which can mistakenly imply a ranking among categories).

2. **Preventing Misinterpretation:**
   For example, assigning integer labels such as `H` = 0 and `O` = 2 would suggest an ordering and a spacing between Hydrogen and Oxygen that carry no chemical meaning. Using one-hot encoding ensures that no such unintended relationships or biases are introduced.

3. **Enabling Compatibility:**
   Many machine learning models (e.g., neural networks, decision trees) require consistent input shapes and cannot process raw categorical data directly.

4. **Avoiding Bias:**
   One-hot encoding ensures all categories are treated equally, preventing the model from assuming that some categories are "greater than" others.
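
The difference between label encoding and one-hot encoding can be made concrete by comparing distances between encoded species. The sketch below assumes an arbitrary integer labeling (0 to 3 in list order), which is an illustrative choice rather than a standard.

```python
import math

species = ["H", "C", "O", "N"]

# Label encoding: arbitrary integers 0..3, which imply an order and a spacing.
label = {s: i for i, s in enumerate(species)}

# One-hot encoding: one dimension per species.
one_hot = {
    s: [int(i == j) for j in range(len(species))]
    for i, s in enumerate(species)
}

for a, b in [("H", "C"), ("H", "N")]:
    d_label = abs(label[a] - label[b])
    d_onehot = math.dist(one_hot[a], one_hot[b])
    print(f"{a}-{b}: label distance = {d_label}, one-hot distance = {d_onehot:.3f}")
```

With integer labels, `H` appears three times farther from `N` than from `C`. With one-hot vectors, every pair of distinct species is the same distance apart.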

---

### Why Use One-Hot Encoding for Molecules?

In cheminformatics and materials science, molecules often consist of categorical data like atom types. Using one-hot encoding:
- Ensures that all atom types (e.g., H, C, O, N) are treated as distinct entities.
- Prepares molecular data for machine learning models that predict properties such as reactivity, toxicity, or material behavior.
- Captures the molecular composition in a structured and interpretable format.

---

By applying one-hot encoding to molecules, we can convert molecular structures into a numerical representation suitable for machine learning workflows, ensuring compatibility and preventing bias in the data.


## Problem Description:
You are provided with a small dataset of molecules represented by their chemical formulas. Each molecule is described by a list of atoms and their types (e.g., H, C, O, N). Your tasks are:

1. **Identify Unique Chemical Species**:
   Extract all unique atom types across the dataset.

2. **Create One-Hot Encodings**:
   Assign a binary vector to each unique atom type.

3. **Encode Molecules Using One-Hot Representations**:
   Convert the list of atoms for each molecule into their corresponding one-hot encoded matrix.

4. **Composition of Molecule**:
   - Summarize each molecule by the total count of each species (e.g., [2, 1, 0, 0] for 2 H, 1 C, 0 O, and 0 N).
   - Visualize the one-hot encoded data using a heatmap.

---

## Dataset Example:

| Molecule Name | Atoms        |
|---------------|--------------|
| Molecule 1    | H, H, O      |
| Molecule 2    | C, H, H, O   |
| Molecule 3    | N, H, H, C, O |

---

## Tasks:

1. Extract the unique species from the dataset (e.g., \(\{H, C, O, N\}\)).
2. Create one-hot encodings for these species:
   - Example:
     - \(H: [1, 0, 0, 0]\)
     - \(C: [0, 1, 0, 0]\)
     - \(O: [0, 0, 1, 0]\)
     - \(N: [0, 0, 0, 1]\)
3. Convert each molecule into a one-hot encoded matrix:
   - Example for Molecule 1 (\(H, H, O\)):
     \[
     \begin{bmatrix}
     1 & 0 & 0 & 0 \\
     1 & 0 & 0 & 0 \\
     0 & 0 & 1 & 0 \\
     \end{bmatrix}
     \]
4. (Optional) Summarize each molecule by counting the total occurrences of each species:
   - Example:
     - Molecule 1: \([2, 0, 1, 0]\)
     - Molecule 2: \([2, 1, 1, 0]\)
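
Tasks 3 and 4 can be sketched as follows. The helper names `encode_molecule` and `composition` are illustrative assumptions, and the species order is fixed by hand to match the examples above rather than derived from the data.

```python
SPECIES = ["H", "C", "O", "N"]  # column order used in the examples above


def encode_molecule(atoms, species=SPECIES):
    """Return one row per atom; each row is the one-hot vector of that atom's species."""
    index = {s: i for i, s in enumerate(species)}
    matrix = []
    for atom in atoms:
        row = [0] * len(species)
        row[index[atom]] = 1  # raises KeyError for a species not in the list
        matrix.append(row)
    return matrix


def composition(matrix):
    """Sum the one-hot matrix column by column to get per-species counts."""
    return [sum(col) for col in zip(*matrix)]


molecules = {
    "Molecule 1": ["H", "H", "O"],
    "Molecule 2": ["C", "H", "H", "O"],
}
for name, atoms in molecules.items():
    encoded = encode_molecule(atoms)
    print(name, encoded, composition(encoded))
```

Note that Python's `set()` does not guarantee an order, and sorting the set alphabetically would give `['C', 'H', 'N', 'O']`, a different column order from the one used in the examples. Whichever order is chosen, it should be stated and kept fixed for all molecules.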

---

## Deliverables:
1. Python code that implements the above tasks.
2. A report explaining your implementation and showing the results (encoded matrices for each molecule).
3. A visualization of the one-hot encoded data.

---

## Hints:
- Use Python’s `set()` to extract unique atom types.
- Use libraries like `NumPy` or `pandas` for matrix manipulations.
- Use the seaborn library to create a heatmap of the one-hot encoding.
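
A heatmap of a single encoded molecule might look like the sketch below. It assumes that `seaborn` and `matplotlib` are installed; the function name `plot_one_hot` and the styling choices are illustrative, not requirements of the assignment.

```python
def plot_one_hot(matrix, atoms, species, title="One-hot encoding"):
    """Draw a one-hot matrix as a heatmap: rows are atoms, columns are species."""
    # Imported inside the function so the rest of a script runs without the plotting stack.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(
        matrix,
        annot=True,            # print the 0/1 value in every cell
        cbar=False,            # a color bar adds little for binary data
        cmap="Blues",
        xticklabels=species,
        yticklabels=atoms,
        linewidths=0.5,
    )
    ax.set_xlabel("Species")
    ax.set_ylabel("Atom in molecule")
    ax.set_title(title)
    plt.show()
    return ax


# Molecule 1 from the dataset example, encoded with the order H, C, O, N.
species = ["H", "C", "O", "N"]
atoms = ["H", "H", "O"]
matrix = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
```

Calling `plot_one_hot(matrix, atoms, species, title="Molecule 1")` would then display the figure. Labeling the rows with the atom symbols makes it easy to check each row against the molecule by eye.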



## Problem 3: Generalizing One-Hot Encoding for Molecules
For Graduate Students or Extra Credit for Undergraduate Students (30 points)

### Objective:
Write a Python program that generalizes the one-hot encoding process to work for a set of molecules given as XYZ files in a folder called `molecules`.

### Instructions:

1. **Folder Structure**:
   - Use the provided folder named `molecules` containing XYZ files. Each XYZ file represents a molecule with atomic coordinates.

2. **Reading XYZ Files**:
   - Write a function `read_xyz(file_path)` that reads an XYZ file and returns a list of atoms in the molecule.

3. **One-Hot Encoding**:
   - Implement a function `one_hot_encode_atoms(atom_list)` that takes a list of atoms and returns a one-hot encoded representation.
   - The one-hot encoding should create a binary vector for each atom type present in the dataset. For example, if the dataset contains Hydrogen (H), Carbon (C), and Oxygen (O), the one-hot encoding for H would be `[1, 0, 0]`, for C would be `[0, 1, 0]`, and for O would be `[0, 0, 1]`.

4. **Processing All Molecules**:
   - Write a function `process_molecules(folder_path)` that processes all XYZ files in the `molecules` folder, applies one-hot encoding to each molecule, and stores the results in a dictionary where the keys are the file names and the values are the one-hot encoded representations.

5. **Output**:
   - Print and visualize the one-hot encoded representations for each molecule.

6. **Testing**:
   - Make sure your program is general and reads all files in a given folder. We will test your program on a folder with a different set of molecules.
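
One possible starting point for the file handling is sketched below. It assumes the plain XYZ layout shown in the example file (atom count on the first line, a comment on the second, then one atom per line). The helper `collect_species` is a hypothetical addition that is not named in the instructions, and the temporary folder and its two files exist only to make the sketch self-contained.

```python
import tempfile
from pathlib import Path


def read_xyz(file_path):
    """Return the list of element symbols in an XYZ file, in file order."""
    lines = Path(file_path).read_text().splitlines()
    n_atoms = int(lines[0].split()[0])     # line 1: number of atoms
    atoms = []
    for line in lines[2:2 + n_atoms]:      # line 2 is a comment; atoms follow
        fields = line.split()
        if fields:                         # tolerate blank lines
            atoms.append(fields[0])        # first field is the element symbol
    return atoms


def collect_species(folder_path):
    """Return a sorted list of every species found in the folder's .xyz files."""
    found = set()
    for path in sorted(Path(folder_path).glob("*.xyz")):
        found.update(read_xyz(path))
    return sorted(found)


# Self-contained demonstration using a temporary folder with two made-up files.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "water.xyz").write_text(
        "3\nwater\nO 0.0 0.0 0.0\nH 0.0 0.0 1.0\nH 0.0 1.0 0.0\n"
    )
    Path(folder, "co.xyz").write_text("2\ncarbon monoxide\nC 0.0 0.0 0.0\nO 0.0 0.0 1.1\n")
    water_atoms = read_xyz(Path(folder, "water.xyz"))
    all_species = collect_species(folder)

print(water_atoms)
print(all_species)
```

Collecting the species across all files before encoding keeps the vector length and column order identical for every molecule. Sorting gives alphabetical columns (`C`, `H`, `O` here), which differs from the `H`, `C`, `O` order in the example in item 3, so the chosen order should be stated alongside the output.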

#### Example XYZ File Content:


5
Comment line
H 0.0 0.0 0.0  #Atom type  x  y  z
C 0.0 0.0 1.0
O 0.0 1.0 0.0
H 1.0 0.0 0.0
C 1.0 1.0 1.0



#### Example Output:


molecule1.xyz: [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
molecule2.xyz: [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]



#### Submission:
Submit your Python script file containing the functions and the main program. Ensure that your code is well-documented and follows best practices for readability and maintainability.