# Machine Learning

::::{grid} 1 1 2 2
:gutter: 3

:::{grid-item-card}
:class-header: bg-grid-header
:class-body: grid-center bg-grid-body

<span class="grid-title">Supervised – Regression</span>
^^^

* [Linear Regression](machine_learning/supervised_learning/Linear_Regression/overview)
* [Support Vector Machine](machine_learning/supervised_learning/Support_Vector_Machine/overview)
* [SVR](machine_learning/supervised_learning/Support_Vector_Machine/SVR/overview)
* [KNN](machine_learning/supervised_learning/KNN/overview)
* [Decision Tree](machine_learning/supervised_learning/Decision_Tree/overview)
* [DTR](machine_learning/supervised_learning/Decision_Tree/DTR/overview)
* [Random Forest](machine_learning/supervised_learning/Random_Forest/overview)
* [RFR](machine_learning/supervised_learning/Random_Forest/RFR)
* [AdaBoost](machine_learning/supervised_learning/AdaBoost/overview)
* [AdaBoost Regression](machine_learning/supervised_learning/AdaBoost/06_part)
* [Gradient Boosting](machine_learning/supervised_learning/Gradient_boosting/overview)
* [Gradient Boosting Regressor](machine_learning/supervised_learning/Gradient_boosting/05_part)
* [XGBoost](machine_learning/supervised_learning/XGBoost/overview)
* [XGBoost Regressor](machine_learning/supervised_learning/XGBoost/04_part)
  
:::

:::{grid-item-card}
:class-header: bg-grid-header
:class-body: grid-center bg-grid-body


<span class="grid-title">Supervised – Classification</span>
^^^

* [Logistic Regression](machine_learning/supervised_learning/Logistic_Regression/overview)
* [Support Vector Machine](machine_learning/supervised_learning/Support_Vector_Machine/overview)
* [SVC](machine_learning/supervised_learning/Support_Vector_Machine/SVC/overview)
* [Naive Bayes](machine_learning/supervised_learning/Naive_Bayes/overview)
* [KNN](machine_learning/supervised_learning/KNN/overview)
* [Decision Tree](machine_learning/supervised_learning/Decision_Tree/overview)
* [DTC](machine_learning/supervised_learning/Decision_Tree/DTC/overview)
* [Random Forest](machine_learning/supervised_learning/Random_Forest/overview)
* [RFC](machine_learning/supervised_learning/Random_Forest/RFC)
* [AdaBoost](machine_learning/supervised_learning/AdaBoost/overview)
* [AdaBoost Classification](machine_learning/supervised_learning/AdaBoost/08_part)
* [Gradient Boosting](machine_learning/supervised_learning/Gradient_boosting/overview)
* [Gradient Boosting Classifier](machine_learning/supervised_learning/Gradient_boosting/04_part)
* [XGBoost](machine_learning/supervised_learning/XGBoost/overview)
* [XGBoost Classifier](machine_learning/supervised_learning/XGBoost/05_part)

:::

:::{grid-item-card}
:class-header: bg-grid-header
:class-body: grid-center bg-grid-body

<span class="grid-title">Unsupervised – Clustering</span>
^^^
* [K-Means](machine_learning/unsupervised_learning/K_Means/overview)
* [Hierarchical Clustering](machine_learning/unsupervised_learning/Hierarchical_Clustering/overview)
* [DBSCAN](machine_learning/unsupervised_learning/DBSCAN/overview)
* [Anomaly Detection](machine_learning/unsupervised_learning/Anamoly_Detection/overview)
:::

:::{grid-item-card}
:class-header: bg-grid-header
:class-body: grid-center bg-grid-body

<span class="grid-title">Unsupervised – Dimensionality Reduction</span>
^^^
* [Principal Component Analysis (PCA)](machine_learning/unsupervised_learning/PCA/overview)
* [t-SNE](machine_learning/unsupervised_learning/T_SNE/overview)

:::
::::

````{dropdown} Click here for Contents
```{contents}
```
````

Machine Learning is a **subset of Artificial Intelligence (AI)** that focuses on teaching computers to **learn patterns from data** and **make predictions or decisions** without being explicitly programmed with fixed rules.

👉 Instead of writing step-by-step instructions, we provide **examples (data)**, and the algorithm learns the hidden relationships.

---

## Example to Understand ML

* Traditional programming:

  * Rules (explicitly coded) + Data → Output
* Machine Learning:

  * Data + Output (examples) → Algorithm learns rules → Predict new output

✨ Example: Predicting house prices

* Input: Size, Location, Number of rooms
* Output: House Price
* ML learns the mapping function:

  $$
  \text{Price} = f(\text{Size}, \text{Location}, \text{Rooms})
  $$
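
As a minimal sketch of what "learning the mapping function" can mean, the snippet below fits a straight line `price ≈ w * size + b` by ordinary least squares. The sizes and prices are made-up illustrative numbers, and using a single feature (size) is a simplifying assumption; `fit_line` is a hypothetical helper, not part of any library.

```python
# Hypothetical toy data: house size (square metres) and price (in thousands).
sizes = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]   # constructed as 3 * size for illustration


def fit_line(xs, ys):
    """Ordinary least squares for y ≈ w * x + b (closed form, one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


w, b = fit_line(sizes, prices)   # "training": the rule is learned from examples
predicted = w * 90.0 + b         # "prediction" for an unseen 90 m² house
print(f"w={w:.2f}, b={b:.2f}, predicted price={predicted:.1f}")
```

Nobody wrote the rule "price is three times the size" into the program; it is recovered from the example pairs.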

---

## Types of Machine Learning

1. **Supervised Learning**

   * Learn from labeled data (input + correct output given).
   * Task: Prediction.
   * Examples:

     * Regression (predict numbers, e.g., house prices).
     * Classification (predict categories, e.g., spam vs not spam).

2. **Unsupervised Learning**

   * Learn from unlabeled data (only input, no output given).
   * Task: Discover patterns.
   * Examples:

     * Clustering (grouping customers by purchase behavior).
     * Dimensionality reduction (compressing features for visualization).

3. **Reinforcement Learning**

   * Learn by interacting with the environment (trial and error).
   * Task: Decision making.
   * Examples:

     * Teaching a robot to walk.
     * AlphaGo beating humans in Go.

---





## Key Components of ML

1. **Dataset** → Collection of examples (features, plus labels in supervised learning).
2. **Model** → Mathematical representation that makes predictions.
3. **Training** → Process of learning patterns (adjusting model parameters).
4. **Evaluation** → Measuring performance (accuracy, error, etc.).
5. **Prediction** → Using the trained model on unseen data.
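
The five components can be sketched end to end with a deliberately tiny model. The example below assumes a one-dimensional toy dataset and a nearest-centroid classifier; the data and the helper names `train_centroids` and `predict` are hypothetical and chosen only to keep each step visible.

```python
# Dataset: hypothetical (feature, label) pairs, already split into train and test.
train = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
test = [(1.2, "low"), (8.5, "high"), (7.0, "high")]


def train_centroids(examples):
    """Training: the model parameters are one mean (centroid) per label."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, x):
    """Prediction: return the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


model = train_centroids(train)                            # model + training
correct = sum(predict(model, x) == y for x, y in test)
accuracy = correct / len(test)                            # evaluation
print(model, accuracy, predict(model, 3.0))               # prediction on new input
```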

---

## Why is ML important?

* Handles **large, complex data** humans cannot analyze manually.
* **Automates tasks** (spam filtering, recommendation systems, fraud detection).
* Can improve over time as the model is retrained on more data.


## Learning Approach Variants

### Instance-based learning

* Learns by **memorizing training examples**.
* No explicit model is built.
* Prediction is made by comparing a new instance with stored instances.
* Uses a **similarity (distance) measure** to find closest examples.

**Examples:**

* k-Nearest Neighbors (kNN)
* Locally Weighted Regression

**Pros:**

* Simple, flexible.
* Works well if decision boundary is irregular.

**Cons:**

* Expensive at prediction time (must compare with many stored examples).
* Sensitive to noise and irrelevant features.
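
A minimal k-nearest-neighbours sketch makes the "memorize, then compare" idea concrete. It assumes Euclidean distance and simple majority voting; the stored points and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stored training instances: ((feature_1, feature_2), label).
stored = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]


def knn_predict(query, instances, k=3):
    """Classify `query` by majority vote among its k nearest stored instances."""
    # "Training" was just storing the data; all the work happens at query time.
    by_distance = sorted(instances, key=lambda item: math.dist(query, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


print(knn_predict((1.1, 0.9), stored))
print(knn_predict((4.9, 5.0), stored))
```

Every prediction sorts all stored instances, which is where the prediction-time cost listed under the cons comes from.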

---

### Model-based learning

* Learns a **general model** from training data.
* The model captures underlying relationships, then is used for prediction.
* Parameters are estimated during training.

**Examples:**

* Linear Regression
* Logistic Regression
* Neural Networks
* Decision Trees

**Pros:**

* Fast prediction once model is trained.
* Generalizes well if model is appropriate.

**Cons:**

* Training can be computationally heavy.
* If model is too simple, it underfits; if too complex, it overfits.

---

**Key Difference**

* **Instance-based**: “Remember examples, predict by similarity.”
* **Model-based**: “Learn rules (parameters), predict by applying model.”


---

## List of Machine Learning Algorithms

| **Category**                 | **Sub-type**                 | **Algorithms**                                                                                                                                                                                                                                                                                                               |
| ---------------------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Supervised Learning**      | **Regression**               | Linear Regression, Polynomial Regression, Ridge, Lasso, Elastic Net, SVR, Decision Tree Regression, Random Forest Regression, Gradient Boosting (XGBoost, LightGBM, CatBoost), kNN Regression, Bayesian Regression, Neural Networks                                                                                          |
|                              | **Classification**           | Logistic Regression, kNN, SVM, Decision Trees (CART, ID3, C4.5), Random Forest, Gradient Boosting (XGBoost, LightGBM, CatBoost), Naive Bayes (Gaussian, Multinomial, Bernoulli), Perceptron, Multi-layer Perceptrons, Ensemble Methods (Bagging, Stacking, Voting), Probabilistic Graphical Models (Bayesian Networks, CRFs) |
| **Unsupervised Learning**    | **Clustering**               | k-Means, Hierarchical Clustering, DBSCAN, OPTICS, Gaussian Mixture Models, Mean-Shift, Spectral Clustering, BIRCH, Affinity Propagation                                                                                                                                                                                      |
|                              | **Dimensionality Reduction** | PCA, Kernel PCA, ICA, SVD, Factor Analysis, t-SNE, UMAP, Autoencoders                                                                                                                                                                                                                                                        |
|                              | **Association Rules**        | Apriori, Eclat, FP-Growth                                                                                                                                                                                                                                                                                                    |
|                              | **Density Estimation**       | KDE, Expectation-Maximization (EM), Hidden Markov Models (unsupervised setting)                                                                                                                                                                                                                                              |
| **Semi-Supervised Learning** | —                            | Self-training, Co-training, Label Propagation/Spreading, Semi-supervised SVM, Graph-based methods, Semi-supervised Deep Learning (Consistency Regularization, Pseudo-labeling)                                                                                                                                               |
| **Reinforcement Learning**   | **Value-based**              | Q-Learning, SARSA, Deep Q-Networks (DQN)                                                                                                                                                                                                                                                                                     |
|                              | **Policy-based**             | Policy Gradient (REINFORCE), Actor–Critic (A2C, A3C), Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO)                                                                                                                                                                                            |
|                              | **Model-based / Advanced**   | DDPG, TD3, SAC, Monte Carlo Tree Search, Multi-agent RL                                                                                                                                                                                                                                                                      |
| **Other Methods**            | **Ensemble Methods**         | Bagging, Boosting (AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost), Stacking, Blending, Voting Classifier                                                                                                                                                                                                          |
|                              | **Probabilistic / Bayesian** | Naive Bayes, Bayesian Networks, Gaussian Processes, HMMs, Markov Random Fields                                                                                                                                                                                                                                               |
|                              | **Deep Learning**            | Feedforward NN, CNN, RNN, LSTM, GRU, Transformers (BERT, GPT), Variational Autoencoders (VAE), Generative Adversarial Networks (GANs)                                                                                                                                                                                        |



## Common ML Pitfalls & How to Prevent Them

---

### 1. Data Leakage

* **What it is:** Information from test/future data sneaks into training.
* **Example:** Scaling before splitting, or using “future” features.
* ✅ **Prevention:**

  * Always split before preprocessing.
  * Use scikit-learn **pipelines**.
  * In time-series, only use **past data** for training.


---


#### Definition

Data leakage happens when **information that would not be available at prediction time** is used (directly or indirectly) during training.

👉 This gives the model **unfair hints**, making it look very accurate on validation/test data but fail on real-world unseen data.

---

#### Why It’s Dangerous

* Inflates model performance (fake high accuracy).
* Leads to overconfidence in the model.
* Deployment disaster: model fails when such information isn’t available.

It’s like *cheating in an exam with leaked answers* → perfect marks in practice, but no real skill.

---

#### Types of Data Leakage

##### A. Target Leakage

* Features include data that would only be available *after* the prediction is made.
* Example:

  * Predicting if a patient has diabetes.
  * Including “insulin prescribed” as a feature.
  * Problem: prescription decisions depend on knowing the patient has diabetes.

---

##### B. Train-Test Contamination

* Test data information accidentally influences training.
* Example:

  * Scaling or feature selection done **before splitting** dataset into train/test.
  * The test data indirectly shapes the training process.
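
To see the mechanism in isolation, the sketch below standardizes a single feature by hand. The numbers are hypothetical and the last two values are assumed to be the test set; the point is only which rows are allowed to contribute to the mean and standard deviation.

```python
import statistics

# Hypothetical feature values; the last two play the role of the test set.
values = [10.0, 12.0, 11.0, 13.0, 50.0, 55.0]
train, test = values[:4], values[4:]       # split FIRST

# Fit the scaling statistics on the training portion only...
train_mean = statistics.mean(train)
train_std = statistics.pstdev(train)

# ...then apply those same statistics to both portions.
train_scaled = [(v - train_mean) / train_std for v in train]
test_scaled = [(v - train_mean) / train_std for v in test]

# Leaky variant, shown only for contrast: the mean is computed on ALL rows,
# so the test values influence how the training data is transformed.
leaky_mean = statistics.mean(values)

print(train_mean, leaky_mean)
```

The two means differ noticeably here because the held-out values are much larger than the training values, which is exactly the information that should stay hidden until evaluation.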

---

##### C. Temporal Leakage

* In time-series data, using **future information** to predict the past.
* Example:

  * Predicting stock price at $t$.
  * Accidentally including features from $t+1$ or later.
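
One simple guard is to split by time instead of at random. The sketch below assumes each observation carries a sortable time index; the data and the helper name `chronological_split` are hypothetical.

```python
# Hypothetical daily observations as (day_index, value) pairs, deliberately unordered.
observations = [(3, 10.4), (1, 10.0), (5, 10.9), (2, 10.1), (4, 10.6), (6, 11.2)]


def chronological_split(rows, train_fraction=0.67):
    """Sort by time, then cut once: everything before the cut is training data."""
    ordered = sorted(rows, key=lambda row: row[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


train, test = chronological_split(observations)
latest_train_day = max(day for day, _ in train)
earliest_test_day = min(day for day, _ in test)
print(latest_train_day, earliest_test_day)
```

Because the latest training day precedes the earliest test day, no training row can contain information from the evaluation period. Features derived from later days would still need a separate audit.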

---

##### D. Indirect / Proxy Leakage

* When a feature is a disguised form of the target.
* Example:

  * Predicting whether a customer churns.
  * Including “last month’s customer support ticket closure” → which directly correlates with churn.

---

#### Causes of Data Leakage

* Preprocessing the entire dataset before splitting.
* Poor feature engineering (using outcome-related variables).
* Mismanaged cross-validation (e.g., same patient’s data across train & test).
* Temporal misalignment in time-series datasets.

---

#### Real-World Examples

* **Healthcare:** Using "hospital billing code" as a feature when predicting disease → billing code assigned *after* diagnosis.
* **Finance:** Predicting loan defaults using “late payment flag” → this flag only appears after default happens.
* **E-commerce:** Predicting purchase likelihood using “discount applied” → but discount decisions happen *after* purchase intent.

---

#### How to Detect Data Leakage

* Too-good-to-be-true model performance.
* Validation accuracy much higher than real-world deployment.
* Suspicious features that seem too correlated with the target.
* A single feature dominates the **feature importance** ranking, which often signals a leaked proxy for the target.

---

#### How to Prevent Data Leakage

* Best Practices:

1. **Split first, preprocess later**

   * Do train/test split before scaling, imputing, or feature selection.
2. **Pipelines**

   * Use scikit-learn's `Pipeline` so that preprocessing steps are fit on the training data only and then applied unchanged to the test data.
3. **Audit features**

   * Check: *Would I have this feature at prediction time?*
4. **Careful with time-series**

   * Always split chronologically, not randomly.
5. **Cross-validation grouping**

   * Ensure related samples (same patient, same user) are not split across train/test.
6. **Domain expertise**

   * Work with subject experts to identify hidden leakage features.
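
The first two practices can be combined in a few lines of scikit-learn. This is a sketch under assumptions: synthetic data from `make_classification` stands in for a real dataset, and the choice of `StandardScaler` with `LogisticRegression` is only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Split first, so the held-out rows never influence any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The scaler lives inside the pipeline, so it is re-fit on the training
# portion of every cross-validation fold and only applied to the rest.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

model.fit(X_train, y_train)        # scaler statistics come from X_train only
test_accuracy = model.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```

For grouped data (several rows per patient or user), scikit-learn also provides group-aware splitters such as `GroupKFold`, which can be passed as the `cv` argument together with a `groups` array.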

---

#### Analogy

* Training with leakage = **student cheating with leaked exam answers**.
* Deployment = **real exam without leaks** → the student (model) fails badly.

---

**In summary:**
Data leakage = using future or unavailable information in training.
It’s subtle, dangerous, and often the reason behind “amazing models that collapse in production.”


---

### 2. Overfitting

* **What it is:** Model memorizes noise in training data → poor generalization.
* **Example:** Deep tree that perfectly fits training but fails on test.
* ✅ **Prevention:**

  * Use **regularization** (L1, L2, dropout).
  * Collect more data.
  * Use **cross-validation**.
  * Limit model complexity (e.g., cap the max depth of decision trees or prune them).

---

### 3. Underfitting

* **What it is:** Model too simple → misses important patterns.
* **Example:** Using linear regression on complex nonlinear data.
* ✅ **Prevention:**

  * Use more expressive models.
  * Add features or polynomial terms.
  * Reduce regularization strength.

---

### 4. Class Imbalance

* **What it is:** One class dominates (e.g., 99% normal, 1% fraud).
* **Example:** Classifier predicts “normal” always → high accuracy but useless.
* ✅ **Prevention:**

  * Resample (oversample minority, undersample majority).
  * Use **SMOTE** (synthetic data generation).
  * Choose **balanced metrics** (F1, ROC-AUC, Precision-Recall).
  * Apply **class weights** in algorithms.
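
The "always predict normal" example can be checked with plain arithmetic. The label counts below are hypothetical (990 normal, 10 fraud), and the metrics are computed by hand so that no library is assumed.

```python
# Hypothetical labels: 990 "normal" (0) and 10 "fraud" (1) transactions.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000                      # a classifier that always says "normal"

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when there are no positive predictions.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.99 recall=0.00 f1=0.00
```

Accuracy looks excellent while recall and F1 show that not a single fraud case was caught.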

---

### 5. Data Drift & Concept Drift

* **What it is:** Data or relationships change over time.
* **Example:** Customer behavior before vs after COVID.
* ✅ **Prevention:**

  * Monitor model performance regularly.
  * Retrain periodically.
  * Use **online learning** for streaming data.

---

### 6. Multicollinearity

* **What it is:** Features highly correlated → unstable coefficients.
* **Example:** Predicting salary with both “years of experience” and “months of experience”.
* ✅ **Prevention:**

  * Remove redundant features.
  * Use **regularization (Ridge/Lasso)**.
  * Apply **PCA** for dimensionality reduction.
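
A quick correlation check is often enough to spot this kind of redundancy. The sketch below assumes five hypothetical employees and computes the Pearson correlation by hand; `pearson` is an illustrative helper.

```python
import math

# Hypothetical employees: the second feature is just the first in other units.
years = [1.0, 3.0, 5.0, 8.0, 12.0]
months = [12.0 * y for y in years]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


r = pearson(years, months)
print(round(r, 6))   # a value of (almost exactly) 1 flags a redundant feature
```

With a correlation this close to 1, one of the two columns can be dropped without losing information.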

---

### 7. Curse of Dimensionality

* **What it is:** As the number of features grows, data becomes sparse → distances between points become less informative.
* **Example:** kNN performs poorly in 1000 dimensions.
* ✅ **Prevention:**

  * Use **feature selection**.
  * Apply dimensionality reduction (PCA; t-SNE and UMAP are used mainly for visualization).
  * Gather more data.
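
The effect on distances can be observed with a small simulation. The sketch assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, `relative_contrast`, that compares the farthest and nearest neighbour of one random query point.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    distances = [math.dist(query, p) for p in points]
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

The printed contrast typically shrinks as the dimension grows: the nearest and farthest points end up at almost the same distance, so "nearest neighbour" carries less and less meaning.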

---

### 8. Sampling Bias

* **What it is:** Training data doesn’t represent real-world distribution.
* **Example:** Training only on urban customers → fails on rural customers.
* ✅ **Prevention:**

  * Ensure **stratified sampling**.
  * Collect **representative datasets**.
  * Be cautious with web-scraped or convenience samples.

---

### 9. Scaling & Normalization Issues

* **What it is:** Features on very different scales can mislead distance-based and gradient-based algorithms.
* **Example:** kNN treating “income (\$)” as more important than “age (years)”.
* ✅ **Prevention:**

  * Normalize/standardize features.
  * Use pipelines to prevent leakage.
  * Choose scale-invariant models if possible (trees).
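
The income-versus-age example can be made concrete with three hypothetical customers. The sketch compares Euclidean distances before and after standardization; `standardize` is an illustrative helper that, for brevity, is fit on all three rows (in a real workflow it would be fit on training data only).

```python
import math
import statistics

# Hypothetical customers as (income in dollars, age in years).
customers = [(30_000.0, 25.0), (32_000.0, 60.0), (90_000.0, 26.0)]


def standardize(rows):
    """Rescale every column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


a, b, c = customers
# Raw distances: the income column (tens of thousands) swamps the age gap.
raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)

sa, sb, sc = standardize(customers)
scaled_ab, scaled_ac = math.dist(sa, sb), math.dist(sa, sc)
print(raw_ab, raw_ac, scaled_ab, scaled_ac)
```

On the raw data, customer `b` (35 years older, similar income) looks far closer to `a` than customer `c` does. After standardization the two distances are of similar size, so age is no longer ignored.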

---

### 10. Evaluation Pitfalls

* **What it is:** Using the wrong metric for the problem.
* **Example:** Accuracy in fraud detection (useless if data is imbalanced).
* ✅ **Prevention:**

  * Choose metrics suited to task (F1 for imbalance, RMSE for regression).
  * Use **cross-validation**.
  * Avoid test set reuse (keep a final hold-out set).


| Operator / Function | Definition | Usage / Intuition | Example |
|-------------------|------------|-----------------|---------|
| $\min_x f(x)$ | Minimum value of a function | Find smallest value of objective | $ \min_x (x-3)^2 = 0 $ |
| $ \max_x f(x) $ | Maximum value of a function | Find largest value of objective | $ \max_x -(x-3)^2 = 0 $ |
| $ \arg\min_x f(x) $ | Input where function is minimized | Optimization to find best parameters | $ \arg\min_x (x-3)^2 = 3 $ |
| $ \arg\max_x f(x) $ | Input where function is maximized | Find best parameter location | $ \arg\max_x -(x-3)^2 = 3 $ |
| $ \frac{d}{dx} f(x) $ | Derivative w.r.t scalar | Slope / rate of change | $ \frac{d}{dx} (x^2) = 2x $ |
| $ \frac{\partial f}{\partial x_i} $ | Partial derivative | Multivariate rate of change | $ \frac{\partial}{\partial x} (x^2 + y^2) = 2x $ |
| $ \nabla f(x) $ | Gradient vector | Direction of steepest ascent | $ \nabla (x^2 + y^2) = [2x,2y] $ |
| $ \theta_{t+1} = \theta_t - \eta \nabla_\theta L $ | Gradient Descent update | Iteratively minimize loss | Linear regression update |
| $ \langle u, v \rangle $ | Dot product / inner product | Similarity / projection | $ \langle [1,2],[3,4] \rangle = 11 $ |
| $ \|x\|_2 $ | L2 norm (Euclidean) | Magnitude of vector | $ \|[3,4]\|_2 = 5 $ |
| $ \|x\|_1 $ | L1 norm (Manhattan) | Sum of absolute values | $ \|[3,-4]\|_1 = 7 $ |
| $ A^\top $ | Matrix transpose | Switch rows ↔ columns | $ [[1,2],[3,4]]^\top = [[1,3],[2,4]] $ |
| $ \text{Tr}(A) $ | Trace of a matrix | Sum of diagonal | $ \text{Tr}([[1,2],[3,4]]) = 5 $ |
| $ \det(A) $ | Determinant | Scaling factor of matrix | $ \det([[1,2],[3,4]])=-2 $ |
| $ \mathbb{E}[X] $ | Expectation / mean | Average value | $ \mathbb{E}[X] = \sum x_i P(x_i) $ |
| $ \text{Var}(X) $ | Variance | Spread of X | $ \text{Var}([1,2,3]) = 2/3 $ |
| $ \text{Cov}(X,Y) $ | Covariance | How two variables vary together | $ \text{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] $ |
| $ \mathbb{P}(A) $ | Probability | Chance of event | $ \mathbb{P}(X>0) $ |
| $ L(y,\hat{y}) $ | Loss function | Measures prediction error | `MSE, Cross-Entropy` |
| $ r_{im} = - \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}} $ | Pseudo-residuals (Boosting) | Direction to reduce loss | Gradient Boosting step |
| $ F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x) $ | Boosted model update | Add tree’s contribution | Gradient Boosting |
| $ \text{sign}(x) $ | Sign function | Direction of number | $ \text{sign}(-5)=-1 $ |
| $ \mathbf{1}_{\{\text{condition}\}} $ | Indicator function | 1 if true, 0 if false | $ \mathbf{1}_{x>0} $ |
| $ \sigma(x) $ | Sigmoid function | Map to probability in $(0,1)$ | $ \sigma(0)=0.5 $ |
| $ \text{softmax}(z_i) $ | Softmax function | Multi-class probability | $ \text{softmax}([1,2,3])_i $ |
| $ \text{ReLU}(x) $ | Rectified Linear Unit | Nonlinear activation | $ \text{ReLU}(-2)=0, \text{ReLU}(3)=3 $ |
| $ \hat{y} = F_M(x) $ | Regression prediction | Final model output | Gradient Boosting Regressor |
| $ \hat{y} = \mathbf{1}[\sigma(F_M(x))>0.5] $ | Binary classification prediction | Threshold probability | Gradient Boosting Classifier |
| $ \hat{y}_i = \text{softmax}(F_M(x))_i $ | Multi-class classification | Probability per class | Gradient Boosting Multi-class |
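
A few rows of the table can be checked numerically. The sketch below uses only the Python standard library; the function names and the step size `eta=0.1` are illustrative choices, not prescribed values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(zs):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def relu(x):
    return max(0.0, x)


def gradient_descent(grad, theta, eta=0.1, steps=200):
    """Repeat theta <- theta - eta * grad(theta) a fixed number of times."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


# f(x) = (x - 3)^2 has derivative 2 * (x - 3); its arg min is x = 3.
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), theta=0.0)

probs = softmax([1.0, 2.0, 3.0])
print(round(x_star, 6), sigmoid(0.0), relu(-2.0), relu(3.0))
print([round(p, 3) for p in probs])
```

Gradient descent returns the arg min (the input, here close to 3), whereas the min itself is the function value at that point, here 0.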