# Homework Assignment: One-Hot Encoding for Chemical Species in Small Molecules

## Objective:
Learn how to implement one-hot encoding for representing the chemical species in small molecules. This assignment will teach you how categorical data (e.g., atom types) can be transformed into numerical representations suitable for machine learning applications.

---

### What is One-Hot Encoding?

**Definition:**
One-hot encoding is a method for converting categorical data (data that can take on a limited number of distinct values) into a numerical format that machine learning models can process. Each unique category is represented as a binary vector whose length equals the number of categories. In this vector, the single position assigned to that category is set to `1`, and all other positions are set to `0`.

---

### Example:
For the categories `['H', 'C', 'O', 'N']`:
- `H` → `[1, 0, 0, 0]`
- `C` → `[0, 1, 0, 0]`
- `O` → `[0, 0, 1, 0]`
- `N` → `[0, 0, 0, 1]`
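
The mapping above can be sketched in a few lines of plain Python. This is a minimal illustration, not a required implementation; the names `categories`, `one_hot`, and `encoding` are hypothetical and chosen only for this example.

```python
# The order of this list decides which vector position belongs to which atom type.
categories = ["H", "C", "O", "N"]

def one_hot(symbol, categories):
    """Return a list with 1 at the index of `symbol` and 0 elsewhere."""
    index = categories.index(symbol)  # raises ValueError for an unknown symbol
    return [1 if i == index else 0 for i in range(len(categories))]

# Build the full lookup table: atom type -> one-hot vector.
encoding = {symbol: one_hot(symbol, categories) for symbol in categories}

for symbol, vector in encoding.items():
    print(symbol, vector)
```

Changing the order of `categories` changes which position is "hot" for each atom type, so the same order must be used everywhere in a project.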

---

### Purpose in Machine Learning

1. **Handling Categorical Data:**
   Machine learning algorithms typically work with numerical data. One-hot encoding converts non-numeric categorical features into a numerical format without introducing any ordinality (unlike label encoding, which can mistakenly imply a ranking among categories).

2. **Preventing Misinterpretation:**
   For example, label-encoding `H` as 0 and `O` as 2 would suggest that oxygen is "two steps away" from hydrogen, a relationship with no chemical meaning. One-hot encoding avoids introducing such unintended relationships.

3. **Enabling Compatibility:**
   Many machine learning models (e.g., neural networks, and most decision-tree implementations) require fixed-size numeric inputs and cannot process raw categorical symbols directly.

4. **Avoiding Bias:**
   Every pair of distinct one-hot vectors is the same distance apart, so all categories are treated symmetrically and the model has no basis for assuming that some categories are "greater than" others.
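
To make the contrast with label encoding concrete, the sketch below compares pairwise distances under both schemes. The integer labels (0 to 3) and the helper `squared_distance` are assumptions made for illustration only.

```python
from itertools import combinations

categories = ["H", "C", "O", "N"]

# Label encoding: each category becomes its integer index (H=0, C=1, O=2, N=3).
label = {symbol: i for i, symbol in enumerate(categories)}

# One-hot encoding: each category becomes a binary vector.
one_hot = {
    symbol: [1 if i == j else 0 for j in range(len(categories))]
    for i, symbol in enumerate(categories)
}

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Distances between every pair of atom types under each scheme.
label_distances = {
    (a, b): abs(label[a] - label[b]) for a, b in combinations(categories, 2)
}
one_hot_distances = {
    (a, b): squared_distance(one_hot[a], one_hot[b])
    for a, b in combinations(categories, 2)
}

print(label_distances)    # distances vary from pair to pair
print(one_hot_distances)  # every pair has the same distance
```

Under label encoding, `H` appears three times farther from `N` than from `C`, purely as a side effect of the chosen numbering. Under one-hot encoding, every pair is equally far apart.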

---

### Why Use One-Hot Encoding for Molecules?

In cheminformatics and materials science, molecular descriptions often include categorical data such as atom types. Using one-hot encoding:
- Ensures that all atom types (e.g., H, C, O, N) are treated as distinct entities.
- Prepares molecular data for machine learning models that predict properties such as reactivity, toxicity, or material behavior.
- Captures the molecular composition in a structured and interpretable format.

---

Applying one-hot encoding to molecules converts their atom lists into numerical representations suitable for machine learning workflows, without imposing an artificial ordering on the atom types.


## Problem Description:
You are provided with a small dataset of molecules. Each molecule is described by a list of the types of its atoms (e.g., H, C, O, N). Your tasks are:

1. **Identify Unique Chemical Species**:
   Extract all unique atom types across the dataset.

2. **Create One-Hot Encodings**:
   Assign a binary vector to each unique atom type.

3. **Encode Molecules Using One-Hot Representations**:
   Convert the list of atoms for each molecule into their corresponding one-hot encoded matrix.

4. **Optional (Extra Credit)**:
   - Summarize each molecule by the total count of each species (e.g., [2, 1, 0, 0] for 2 H, 1 C, 0 O, and 0 N).
   - Visualize the one-hot encoded data using a heatmap.

---

## Dataset Example:

| Molecule Name | Atoms        |
|---------------|--------------|
| Molecule 1    | H, H, O      |
| Molecule 2    | C, H, H, O   |
| Molecule 3    | N, H, H, C, O |
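
One possible way to hold this dataset in Python and collect its unique species is sketched below. The dictionary layout and the names `molecules`, `unique_species`, and `species` are assumptions for illustration; any equivalent structure works.

```python
# The dataset from the table, stored as molecule name -> list of atom types.
molecules = {
    "Molecule 1": ["H", "H", "O"],
    "Molecule 2": ["C", "H", "H", "O"],
    "Molecule 3": ["N", "H", "H", "C", "O"],
}

# Collect every atom type that appears anywhere in the dataset.
unique_species = set()
for atoms in molecules.values():
    unique_species.update(atoms)

# Sets are unordered, so sort to obtain a reproducible column order.
species = sorted(unique_species)
print(species)  # ['C', 'H', 'N', 'O']
```

Sorting yields alphabetical order, which differs from the `H, C, O, N` order used in this assignment's examples. Either order is valid as long as it is fixed once and used consistently.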

---

## Tasks:

1. Extract the unique species from the dataset (e.g., \(\{H, C, O, N\}\)).
2. Create one-hot encodings for these species:
   - Example:
     - \(H: [1, 0, 0, 0]\)
     - \(C: [0, 1, 0, 0]\)
     - \(O: [0, 0, 1, 0]\)
     - \(N: [0, 0, 0, 1]\)
3. Convert each molecule into a one-hot encoded matrix:
   - Example for Molecule 1 (\(H, H, O\)):
     \[
     \begin{bmatrix}
     1 & 0 & 0 & 0 \\
     1 & 0 & 0 & 0 \\
     0 & 0 & 1 & 0 \\
     \end{bmatrix}
     \]
4. (Optional) Summarize each molecule by counting the total occurrences of each species, listed in the column order H, C, O, N:
   - Example:
     - Molecule 1: \([2, 0, 1, 0]\)
     - Molecule 2: \([2, 1, 1, 0]\)
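
Tasks 3 and 4 can be sketched together in plain Python, since the count summary is the column-wise sum of the one-hot matrix. This is one possible approach under the column order `H, C, O, N`; the function names `encode_molecule` and `count_species` are hypothetical.

```python
species = ["H", "C", "O", "N"]  # column order used in the examples above
index = {symbol: i for i, symbol in enumerate(species)}

molecules = {
    "Molecule 1": ["H", "H", "O"],
    "Molecule 2": ["C", "H", "H", "O"],
    "Molecule 3": ["N", "H", "H", "C", "O"],
}

def encode_molecule(atoms):
    """Return one one-hot row per atom, in the order the atoms are listed."""
    matrix = []
    for atom in atoms:
        row = [0] * len(species)
        row[index[atom]] = 1  # KeyError if the atom type is not in `species`
        matrix.append(row)
    return matrix

def count_species(matrix):
    """Sum each column of a one-hot matrix to get per-species counts."""
    return [sum(column) for column in zip(*matrix)]

encoded = {name: encode_molecule(atoms) for name, atoms in molecules.items()}
counts = {name: count_species(matrix) for name, matrix in encoded.items()}

for name in molecules:
    print(name, encoded[name], counts[name])
```

Each matrix has one row per atom and one column per species, so molecules with different numbers of atoms produce matrices of different heights, while the count vectors always have the same length.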

---

## Deliverables:
1. Python code that implements the above tasks.
2. A report explaining your implementation and showing the results (encoded matrices for each molecule).
3. (Optional) A visualization of the one-hot encoded data.

---

## Hints:
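- Use Python’s `set()` to extract unique atom types. A set is unordered, so convert it to a list with a fixed order (e.g., with `sorted()`, or an explicit list such as `['H', 'C', 'O', 'N']`) before assigning vector positions.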
- Use libraries like `NumPy` or `pandas` for matrix manipulations.
