
# Workflows

## Problem Definition

* Define the task (e.g., spam detection, sentiment analysis, medical diagnosis).
* Decide what the **input features** are (words, pixel values, categorical attributes).
* Decide what the **target labels** are (spam/ham, positive/negative, disease/healthy).

---

## Data Preparation

* Collect labeled data.
* Preprocess features:

  * For **text data** → tokenization, stopword removal, vectorization (Bag of Words, TF-IDF).
  * For **categorical data** → encode each category as a discrete value so that per-class counts can be tallied.
  * For **continuous data** → keep the raw values and model each feature as Gaussian within each class (Gaussian Naïve Bayes).
* Split dataset into **train/test (or validation)** sets.
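
The text preprocessing and split steps above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the toy corpus, the tiny stopword list, and the helper names `tokenize`, `bag_of_words`, and `train_test_split` are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical toy corpus; texts and labels are made up for illustration.
DATA = [
    ("win a free prize now", "spam"),
    ("free money win now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch with the team at noon", "ham"),
    ("claim your free prize", "spam"),
    ("project update for the team", "ham"),
]

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "at", "the", "with", "for", "your"}


def tokenize(text):
    """Lowercase, keep runs of letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def bag_of_words(text):
    """Map a document to its word counts (Bag of Words)."""
    return Counter(tokenize(text))


def train_test_split(rows, test_fraction=0.33, seed=0):
    """Shuffle a copy of the rows and split off a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


train, test = train_test_split(DATA)
features = [(bag_of_words(text), label) for text, label in train]
```

Fixing the shuffle seed keeps the split reproducible, which makes later evaluation runs comparable.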

---

## Training (Fit)

Naïve Bayes learns probabilities from data:

1. Compute **prior probabilities** for each class:

   $$
   P(y=c) = \frac{\text{count of class } c}{\text{total samples}}
   $$
2. Compute **likelihoods** for each feature given class:

   * For categorical:

     $$
     P(x_i \mid y=c) = \frac{\text{count}(x_i, y=c)}{\text{count}(y=c)}
     $$
   * For text: per-class word frequencies, with Laplace (add-one) smoothing so that a word never seen in a class does not get zero probability.
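   * For continuous: use a Gaussian distribution, with the mean $\mu_{i,c}$ and variance $\sigma_{i,c}^2$ of feature $i$ estimated from the samples of class $c$:

     $$
     P(x_i \mid y=c) = \frac{1}{\sqrt{2\pi\sigma_{i,c}^2}} e^{-\frac{(x_i - \mu_{i,c})^2}{2\sigma_{i,c}^2}}
     $$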
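
The estimates above can be sketched as a few small functions. This is an illustrative sketch under stated assumptions: the function names are hypothetical, `alpha` is the add-alpha (Laplace) smoothing strength applied to the categorical counts, and the small variance floor in `fit_gaussian` is an added safeguard that the formulas above do not include.

```python
import math
from collections import Counter, defaultdict


def fit_priors(labels):
    """P(y=c) = count of class c / total samples."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}


def fit_categorical(values, labels, alpha=1.0):
    """P(x_i = v | y = c) for one categorical feature, with add-alpha smoothing."""
    vocab = sorted(set(values))
    class_counts = Counter(labels)
    joint = Counter(zip(values, labels))
    result = {}
    for c in class_counts:
        denom = class_counts[c] + alpha * len(vocab)
        result[c] = {v: (joint[(v, c)] + alpha) / denom for v in vocab}
    return result


def fit_gaussian(values, labels):
    """Per-class (mean, variance) for one continuous feature."""
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        by_class[c].append(v)
    params = {}
    for c, xs in by_class.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        # Assumed safeguard: a small floor avoids division by zero
        # when all values in a class are identical.
        params[c] = (mu, var + 1e-9)
    return params


def gaussian_pdf(x, mu, var):
    """Gaussian density with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
```

With `alpha=0` the categorical estimate reduces to the plain count ratio shown in the formula; `alpha=1` gives add-one smoothing.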

---

## Prediction (Inference)

Given a new instance $X = (x_1, x_2, \dots, x_n)$:

1. Apply Bayes' theorem, assuming the features are conditionally independent given the class:

   $$
   P(y \mid X) \propto P(y) \cdot \prod_{i=1}^n P(x_i \mid y)
   $$
2. Select the class with **maximum posterior probability**:

   $$
   \hat{y} = \arg\max_y P(y) \prod_i P(x_i \mid y)
   $$
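
A possible sketch of the decision rule follows. Note that it sums logarithms instead of multiplying probabilities: this is a common numerical safeguard against underflow with many features, not something the formulas above require, and it leaves the argmax unchanged. The nested `likelihoods[c][i][v]` layout and the function names are assumptions of this example.

```python
import math


def predict(priors, likelihoods, x):
    """Return (best_class, log_scores) for the feature tuple x.

    likelihoods[c][i][v] holds P(x_i = v | y = c); this layout is
    an assumption of the sketch.
    """
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for i, v in enumerate(x):
            score += math.log(likelihoods[c][i][v])
        scores[c] = score
    return max(scores, key=scores.get), scores


def normalize(log_scores):
    """Convert unnormalized log scores into posteriors that sum to 1."""
    m = max(log_scores.values())
    exp = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}


# Made-up probabilities for two binary features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.7, "no": 0.3}],
    "ham": [{"yes": 0.1, "no": 0.9}, {"yes": 0.3, "no": 0.7}],
}

label, log_scores = predict(priors, likelihoods, ("yes", "yes"))
posteriors = normalize(log_scores)
```

Here the unnormalized scores are 0.4 · 0.8 · 0.7 = 0.224 for spam and 0.6 · 0.1 · 0.3 = 0.018 for ham, so spam is selected with a posterior of 0.224 / 0.242.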

---

## Evaluation

* Use metrics depending on task:

  * **Classification**: Accuracy, Precision, Recall, F1-score.
  * **Probabilistic predictions**: Log-loss, ROC-AUC.
* Use cross-validation for a more robust performance estimate.
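
As a rough illustration, the classification metrics and log-loss listed above can be computed by hand for a binary task. The function names and the choice to return `0.0` when a denominator is zero are assumptions of this sketch; libraries may handle those edge cases differently.

```python
import math


def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 for a binary task."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def log_loss(y_true, p_positive, positive, eps=1e-15):
    """Mean negative log-likelihood of the true labels.

    p_positive[k] is the predicted probability of the positive class;
    probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for t, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p) if t == positive else -math.log(1 - p)
    return total / len(y_true)


y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
metrics = classification_metrics(y_true, y_pred, positive="spam")
```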

---

## Deployment

* Save the trained model (priors + likelihoods).
* For new, unseen data, apply the same preprocessing used in training, then run prediction.
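
One simple way to persist the learned parameters is plain JSON, since a trained model of this kind is just a set of priors and likelihood tables. The sketch below is one option among several; the model layout, the probability values, and the file name `nb_model.json` are hypothetical.

```python
import json
import os
import tempfile


def save_model(model, path):
    """Write the priors and likelihoods to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f)


def load_model(path):
    """Read the priors and likelihoods back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical trained parameters (values are made up).
model = {
    "priors": {"spam": 0.4, "ham": 0.6},
    "likelihoods": {
        "spam": {"free": 0.3, "meeting": 0.05},
        "ham": {"free": 0.05, "meeting": 0.25},
    },
}

# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nb_model.json")
    save_model(model, path)
    restored = load_model(path)
```

JSON keeps the saved model human-readable and independent of the Python version; any vocabulary or stopword list used in preprocessing would need to be saved alongside it so that inference sees the same features.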

---

## Example: Spam Classification Workflow

1. **Data**: Emails labeled as spam/ham.
2. **Preprocessing**: Tokenize words → convert to TF-IDF features.
3. **Training**:

   * $P(\text{spam})$, $P(\text{ham})$ (priors).
   * $P(\text{word} \mid \text{spam})$, $P(\text{word} \mid \text{ham})$.
4. **Prediction**: For a new email, multiply the prior of each class by the likelihoods of the email's words under that class, then choose the class with the higher posterior.
5. **Evaluation**: Check accuracy, precision, recall on test emails.
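
The spam workflow can be sketched end to end in a few lines. Two simplifications are assumed here: the sketch uses raw word counts instead of TF-IDF, and it scores in log space to avoid underflow. The five training emails and the helper names `train` and `classify` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical labeled emails.
TRAIN = [
    ("win free prize now", "spam"),
    ("free money now", "spam"),
    ("meeting at noon", "ham"),
    ("team lunch at noon", "ham"),
    ("project meeting notes", "ham"),
]


def train(rows, alpha=1.0):
    """Estimate class priors and smoothed word likelihoods."""
    class_docs = Counter(label for _, label in rows)
    word_counts = {c: Counter() for c in class_docs}
    for text, label in rows:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {c: n / len(rows) for c, n in class_docs.items()}
    likelihoods = {}
    for c, counts in word_counts.items():
        # Laplace smoothing: add alpha to every word count.
        denom = sum(counts.values()) + alpha * len(vocab)
        likelihoods[c] = {w: (counts[w] + alpha) / denom for w in vocab}
    return priors, likelihoods


def classify(text, priors, likelihoods):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in text.split():
            if w in likelihoods[c]:  # skip words never seen in training
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)


priors, likelihoods = train(TRAIN)
print(classify("free prize now", priors, likelihoods))   # spam
print(classify("meeting at noon", priors, likelihoods))  # ham
```

With this toy data, $P(\text{spam}) = 0.4$ and the smoothed $P(\text{free} \mid \text{spam}) = 3/19$, since "free" appears twice among 7 spam tokens and the vocabulary has 12 words.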

---

**Summary Workflow**
👉 Define Problem → Preprocess Data → Train (estimate probabilities) → Predict (posterior) → Evaluate → Deploy

