## Loss Functions for Classification

Once the model predicts $\hat{y}$, a **loss function** measures how far that prediction is from the actual output $y$. For classification, we use **cross-entropy loss**, which comes in three variants depending on the problem.

---

### 1. Binary Cross Entropy (BCE)

**Use when:** Only **two classes** (0 or 1)

**Formula:**

$$
\text{Loss} = -\left[y \cdot \log(\hat{y}) + (1-y) \cdot \log(1-\hat{y})\right]
$$



🔹 Output layer uses **Sigmoid activation**
🔹 $\hat{y}$ is a probability between 0 and 1
🔹 Works for problems like spam detection, fraud detection, yes/no classification

**Examples:**

* Email: spam (1) / not spam (0)
* Disease: present (1) / absent (0)
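
As a rough illustration of the BCE formula above, here is a minimal pure-Python sketch for a single sample. The function names (`sigmoid`, `binary_cross_entropy`) and the `eps` clipping value are assumptions made for this example, not part of any particular library.

```python
import math

def sigmoid(z):
    """Squash a raw score (logit) into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """BCE for one sample: y_true is 0 or 1, y_pred is the sigmoid output."""
    # Clip the prediction so log(0) never occurs (eps is an assumed value).
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# True label is spam (1) in both cases.
confident_right = binary_cross_entropy(1, 0.9)
confident_wrong = binary_cross_entropy(1, 0.1)

print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

The loss stays small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong.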

---

### 2. Categorical Cross Entropy (CCE)

**Use when:** More than 2 classes **AND** target is **one-hot encoded**

Example target:

```
good → [1, 0, 0]
bad → [0, 1, 0]
neutral → [0, 0, 1]
```

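**Formula:**

$$
\text{Loss} = -\sum_{j=1}^{C} y_{ij} \cdot \log(\hat{y}_{ij})
$$

Here $i$ is the sample, $j$ runs over the $C$ classes, $y_{ij}$ is the one-hot target, and $\hat{y}_{ij}$ is the predicted probability of class $j$. The formula gives the loss for sample $i$.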

🔹 Output layer uses **Softmax**
🔹 Produces **probability for each class**
🔹 Retains info about how likely each class is

**Examples:**

* Sentiment: positive / neutral / negative
* Digit recognition (0–9)
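
The CCE formula above can be sketched in pure Python for one sample. The helper names (`softmax`, `categorical_cross_entropy`) and the `eps` value are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    """CCE for one sample: y_onehot is the one-hot target, y_prob the softmax output."""
    # max(p, eps) guards against log(0); eps is an assumed value.
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, y_prob))

probs = [0.2, 0.3, 0.5]   # softmax output for good / bad / neutral
target = [0, 0, 1]        # one-hot target: neutral
loss = categorical_cross_entropy(target, probs)

print(round(loss, 4))  # 0.6931
```

Because the target is one-hot, every term of the sum is zero except the one for the true class, so the loss here equals $-\log(0.5)$.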

---

### 3. Sparse Categorical Cross Entropy (Sparse CCE)

**Use when:** More than 2 classes **AND** target is given as a **single integer label**, not one-hot

Example target:

```
good → 0  
bad  → 1  
neutral → 2
```

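The underlying formula is the same as CCE. Because a one-hot target is zero everywhere except at the true class, the sum reduces to a single term, so the integer label $c_i$ can be used directly as an index:

$$
\text{Loss} = -\log(\hat{y}_{i c_i})
$$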

🔹 Still uses **Softmax**
🔹 Only reads the predicted probability at the correct index
🔹 Does **not** need the target expanded into a one-hot vector

**Output Example (Softmax):**

```
[0.2, 0.3, 0.5]
```

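Sparse CCE does not pick the index with the highest probability. It reads the probability at the **true label's** index. If the true label is 2 (neutral), the loss is $-\log(0.5) \approx 0.693$.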

✅ Good when the number of classes is large and one-hot encoding is memory-heavy
❌ Cannot represent soft targets (for example, label smoothing), since each label is a single integer
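
To make the relationship between the two multi-class losses concrete, here is a minimal pure-Python sketch that computes both on the softmax output `[0.2, 0.3, 0.5]`. The function names and the `eps` value are assumptions for this example.

```python
import math

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    """CCE for one sample with a one-hot target."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, y_prob))

def sparse_categorical_cross_entropy(label, y_prob, eps=1e-12):
    """Sparse CCE for one sample: label is the integer index of the true class."""
    # Index straight into the probabilities instead of building a one-hot vector.
    return -math.log(max(y_prob[label], eps))

probs = [0.2, 0.3, 0.5]  # softmax output for good / bad / neutral

# Same sample, true class "bad", encoded two ways.
sparse_loss = sparse_categorical_cross_entropy(1, probs)
onehot_loss = categorical_cross_entropy([0, 1, 0], probs)

print(round(sparse_loss, 4))  # 1.204
print(round(onehot_loss, 4))  # 1.204
```

For a hard label, both encodings give the same loss value. The difference lies in how the target is stored, not in what the model learns.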

---

### 🔍 When to Use What?

| Problem Type                 | Loss Function         | Target Format | Activation |
| ---------------------------- | --------------------- | ------------- | ---------- |
| Binary Classification        | Binary Cross Entropy  | 0/1           | Sigmoid    |
| Multi-class + One-hot labels | Categorical CE        | [0, 1, 0, 0, ...] | Softmax    |
| Multi-class + Integer labels | Sparse Categorical CE | 2, 4, 1, ...  | Softmax    |

---

**Key Takeaway**

* **Binary → BCE + Sigmoid**
* **Multi-class (one-hot) → CCE + Softmax**
* **Multi-class (integer labels) → Sparse CCE + Softmax**
* Use **CCE** when targets are one-hot vectors or soft probability distributions
* Use **Sparse CCE** when each target is a single integer class index; for the same hard label it gives the same loss value as CCE
