# Machine Learning

::::{grid} 1 1 2 3
:gutter: 3

:::{grid-item-card} 
:link: machine_learning/supervised_learning/Linear_Regression/overview
:link-type: doc
:class-header: bg-grid-header
:class-body: grid-center bg-grid-body

<span class="grid-title">Linear Regression</span>
^^^

Predicts a continuous outcome by modeling the relationship between input features and the target as a straight line.
:::

:::{grid-item-card} 
:link: machine_learning/supervised_learning/Logistic_Regression/overview
:link-type: doc
:class-header: bg-grid-header
:class-body: grid-center bg-grid-body

<span class="grid-title">Logistic Regression</span>
^^^

Predicts the probability of a categorical outcome (usually binary) using a logistic (sigmoid) function.
:::

:::{grid-item-card} 
:link: machine_learning/supervised_learning/Support_Vector_Machine/overview
:link-type: doc
:class-header: bg-grid-header
:class-body: grid-center bg-grid-body

<span class="grid-title">Support Vector Machine</span>
^^^

Finds the optimal boundary (hyperplane) that best separates classes in the feature space for classification tasks.
:::
::::

````{dropdown} Click here for Contents
```{contents}
```
````

Machine Learning is a **subset of Artificial Intelligence (AI)** that focuses on teaching computers to **learn patterns from data** and **make predictions or decisions** without being explicitly programmed with fixed rules.

👉 Instead of writing step-by-step instructions, we provide **examples (data)**, and the algorithm learns the hidden relationships.

---

## Example to Understand ML

* Traditional programming:

  * Rules (explicitly coded) + Data → Output
* Machine Learning:

  * Data + Output (examples) → Algorithm learns rules → Predict new output

✨ Example: Predicting house prices

* Input: Size, Location, Number of rooms
* Output: House Price
* ML learns the mapping function:

  $$
  Price = f(Size, Location, Rooms)
  $$
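
As a rough illustration of "learning the mapping function", the sketch below fits a straight line to a handful of made-up houses using ordinary least squares. It is deliberately simplified: it uses only size as an input (location and rooms are left out), and the data, `fit_line`, and `predict_price` are hypothetical names chosen for this example.

```python
# Hypothetical toy data: house size in square metres, price in thousands.
sizes = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 200.0, 250.0, 300.0]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": the rule (slope and intercept) is learned from examples.
slope, intercept = fit_line(sizes, prices)


def predict_price(size):
    """Apply the learned rule to a new, unseen house size."""
    return slope * size + intercept


print(predict_price(100.0))
```

Nobody wrote the pricing rule by hand here; the slope and intercept come entirely from the example pairs.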

---

## Types of Machine Learning

1. **Supervised Learning**

   * Learn from labeled data (input + correct output given).
   * Task: Prediction.
   * Examples:

     * Regression (predict numbers, e.g., house prices).
     * Classification (predict categories, e.g., spam vs not spam).

2. **Unsupervised Learning**

   * Learn from unlabeled data (only input, no output given).
   * Task: Discover patterns.
   * Examples:

     * Clustering (grouping customers by purchase behavior).
     * Dimensionality reduction (compressing features for visualization).

3. **Reinforcement Learning**

   * Learn by interacting with the environment (trial and error).
   * Task: Decision making.
   * Example:

     * Teaching a robot to walk.
     * AlphaGo beating humans in Go.

---





## Key Components of ML

1. **Dataset** → Collection of examples (features + labels).
2. **Model** → Mathematical representation that makes predictions.
3. **Training** → Process of learning patterns (adjusting model parameters).
4. **Evaluation** → Measuring performance (accuracy, error, etc.).
5. **Prediction** → Using the trained model on unseen data.

---

## Why is ML important?

* Handles **large, complex data** humans cannot analyze manually.
* **Automates tasks** (spam filtering, recommendation systems, fraud detection).
* Improves over time as it sees more data.


## Learning Approach Variants

### Instance-based learning

* Learns by **memorizing training examples**.
* No explicit model is built.
* Prediction is made by comparing a new instance with stored instances.
* Uses a **similarity (distance) measure** to find closest examples.

**Examples:**

* k-Nearest Neighbors (kNN)
* Locally Weighted Regression

**Pros:**

* Simple, flexible.
* Works well if decision boundary is irregular.

**Cons:**

* Expensive at prediction time (must compare with many stored examples).
* Sensitive to noise and irrelevant features.
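
To make the "store examples, compare by distance" idea concrete, here is a minimal k-Nearest Neighbors sketch in plain Python. The data points and the `knn_predict` helper are illustrative assumptions, and Euclidean distance is just one possible similarity measure.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest stored examples."""
    # There is no training step: the stored data *is* the model,
    # so every prediction scans all stored examples.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D examples forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(points, labels, (1.2, 1.4))
near_b = knn_predict(points, labels, (6.4, 6.2))
print(near_a, near_b)
```

The full scan inside `knn_predict` is the prediction-time cost listed under the cons above.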

---

### Model-based learning

* Learns a **general model** from training data.
* The model captures underlying relationships, then is used for prediction.
* Parameters are estimated during training.

**Examples:**

* Linear Regression
* Logistic Regression
* Neural Networks
* Decision Trees

**Pros:**

* Fast prediction once model is trained.
* Generalizes well if model is appropriate.

**Cons:**

* Training can be computationally heavy.
* If model is too simple, it underfits; if too complex, it overfits.
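
For contrast with the instance-based approach, the following sketch estimates the two parameters of a one-feature logistic regression with batch gradient descent. The data, learning rate, and epoch count are assumptions picked for illustration, not tuned values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data: hours studied -> passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(hours, passed)

# After training, only w and b are needed; the training rows can be discarded.
p_low = sigmoid(w * 1.0 + b)
p_high = sigmoid(w * 4.0 + b)
print(p_low, p_high)
```

Training loops over the data many times, but each prediction afterwards is a single multiply, add, and sigmoid.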

---

**Key Difference**

* **Instance-based**: “Remember examples, predict by similarity.”
* **Model-based**: “Learn rules (parameters), predict by applying model.”


-------------------

## **List of Machine Learning Algorithms**

| **Category**                 | **Sub-type**                 | **Algorithms**                                                                                                                                                                                                                                                                                                               |
| ---------------------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Supervised Learning**      | **Regression**               | Linear Regression, Polynomial Regression, Ridge, Lasso, Elastic Net, SVR, Decision Tree Regression, Random Forest Regression, Gradient Boosting (XGBoost, LightGBM, CatBoost), kNN Regression, Bayesian Regression, Neural Networks                                                                                          |
|                              | **Classification**           | Logistic Regression, kNN, SVM, Decision Trees (CART, ID3, C4.5), Random Forest, Gradient Boosting (XGBoost, LightGBM, CatBoost), Naive Bayes (Gaussian, Multinomial, Bernoulli), Perceptron, Multi-layer Perceptrons, Ensemble Methods (Bagging, Stacking, Voting), Probabilistic Graphical Models (Bayesian Networks, CRFs) |
| **Unsupervised Learning**    | **Clustering**               | k-Means, Hierarchical Clustering, DBSCAN, OPTICS, Gaussian Mixture Models, Mean-Shift, Spectral Clustering, BIRCH, Affinity Propagation                                                                                                                                                                                      |
|                              | **Dimensionality Reduction** | PCA, Kernel PCA, ICA, SVD, Factor Analysis, t-SNE, UMAP, Autoencoders                                                                                                                                                                                                                                                        |
|                              | **Association Rules**        | Apriori, Eclat, FP-Growth                                                                                                                                                                                                                                                                                                    |
|                              | **Density Estimation**       | KDE, Expectation-Maximization (EM), Hidden Markov Models (unsupervised setting)                                                                                                                                                                                                                                              |
| **Semi-Supervised Learning** | -                            | Self-training, Co-training, Label Propagation/Spreading, Semi-supervised SVM, Graph-based methods, Semi-supervised Deep Learning (Consistency Regularization, Pseudo-labeling)                                                                                                                                               |
| **Reinforcement Learning**   | **Value-based**              | Q-Learning, SARSA, Deep Q-Networks (DQN)                                                                                                                                                                                                                                                                                     |
|                              | **Policy-based**             | Policy Gradient (REINFORCE), Actor‚ÄìCritic (A2C, A3C), Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO)                                                                                                                                                                                            |
|                              | **Model-based / Advanced**   | DDPG, TD3, SAC, Monte Carlo Tree Search, Multi-agent RL                                                                                                                                                                                                                                                                      |
| **Other Methods**            | **Ensemble Methods**         | Bagging, Boosting (AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost), Stacking, Blending, Voting Classifier                                                                                                                                                                                                          |
|                              | **Probabilistic / Bayesian** | Naive Bayes, Bayesian Networks, Gaussian Processes, HMMs, Markov Random Fields                                                                                                                                                                                                                                               |
|                              | **Deep Learning**            | Feedforward NN, CNN, RNN, LSTM, GRU, Transformers (BERT, GPT), Variational Autoencoders (VAE), Generative Adversarial Networks (GANs)                                                                                                                                                                                        |



## Common ML Pitfalls & How to Prevent Them

---

### 1. Data Leakage

* **What it is:** Information from test/future data sneaks into training.
* **Example:** Scaling before splitting, or using “future” features.
* ✅ **Prevention:**

  * Always split before preprocessing.
  * Use scikit-learn **pipelines**.
  * In time-series, only use **past data** for training.

Data leakage is one of the trickiest yet most common mistakes in machine learning, so the rest of this subsection looks at it in depth.

---


#### Definition

Data leakage happens when **information that would not be available at prediction time** is used (directly or indirectly) during training.

👉 This gives the model **unfair hints**, making it look very accurate on validation/test data but fail on real-world unseen data.

---

#### Why It’s Dangerous

* Inflates model performance (fake high accuracy).
* Leads to overconfidence in the model.
* Deployment disaster: model fails when such information isn’t available.

It’s like *cheating in an exam with leaked answers* → perfect marks in practice, but no real skill.

---

#### Types of Data Leakage

##### A. Target Leakage

* Features include data that would only be available *after* the prediction is made.
* Example:

  * Predicting if a patient has diabetes.
  * Including “insulin prescribed” as a feature.
  * Problem: prescription decisions depend on knowing the patient has diabetes.

---

##### B. Train-Test Contamination

* Test data information accidentally influences training.
* Example:

  * Scaling or feature selection done **before splitting** dataset into train/test.
  * The test data indirectly shapes the training process.
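
The contamination described above can be avoided by learning preprocessing statistics from the training split only. The sketch below shows this for standardization in plain Python; the income values and helper names (`train_test_split`, `fit_scaler`, `transform`) are hypothetical and local to this example.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out the last part as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def fit_scaler(values):
    """Learn standardization statistics from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


# Hypothetical incomes in thousands, with one extreme value.
incomes = [30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 500.0]

# Step 1: split first.
train, test = train_test_split(incomes)

# Step 2: fit the scaler on the training split only, then reuse it on the test split.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# Leaky variant, shown only for contrast: the statistics also see the test rows.
leaky_mean, leaky_std = fit_scaler(incomes)
print(mean, leaky_mean)
```

The two means differ because the leaky version has already seen the held-out rows.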

---

##### C. Temporal Leakage

* In time-series data, using **future information** to predict the past.
* Example:

  * Predicting stock price at $t$.
  * Accidentally including features from $t+1$ or later.
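
One way to guard against temporal leakage is to build features only from strictly earlier time steps and to split by time rather than at random. The sketch below assumes a simple univariate series; the prices and function names are made up for illustration.

```python
def make_lag_features(series, n_lags=2):
    """Build (features, target) rows where features are strictly earlier values."""
    rows = []
    for t in range(n_lags, len(series)):
        features = series[t - n_lags:t]  # values at t-n_lags .. t-1, never t or later
        target = series[t]
        rows.append((features, target))
    return rows


def chronological_split(rows, train_fraction=0.75):
    """Keep time order: the earliest rows train, the latest rows test."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical daily closing prices.
prices = [100, 101, 103, 102, 105, 107, 106, 110, 111, 113]

rows = make_lag_features(prices, n_lags=2)
train_rows, test_rows = chronological_split(rows)

print(rows[0])
print(len(train_rows), len(test_rows))
```

Every test row lies later in time than every training row, so the model is always evaluated on data it could not have seen.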

---

##### D. Indirect / Proxy Leakage

* When a feature is a disguised form of the target.
* Example:

  * Predicting whether a customer churns.
  * Including “last month’s customer support ticket closure”, which directly correlates with churn.

---

#### Causes of Data Leakage

* Preprocessing the entire dataset before splitting.
* Poor feature engineering (using outcome-related variables).
* Mismanaged cross-validation (e.g., same patient’s data across train & test).
* Temporal misalignment in time-series datasets.

---

#### Real-World Examples

* **Healthcare:** Using “hospital billing code” as a feature when predicting disease → billing code assigned *after* diagnosis.
* **Finance:** Predicting loan defaults using “late payment flag” → this flag only appears after default happens.
* **E-commerce:** Predicting purchase likelihood using “discount applied” → but discount decisions happen *after* purchase intent.

---

#### How to Detect Data Leakage

* Too-good-to-be-true model performance.
* Validation accuracy much higher than real-world deployment.
* Suspicious features that seem too correlated with the target.
* A single feature that dominates **feature importance** analysis without a plausible causal explanation.

---

#### How to Prevent Data Leakage

* Best Practices:

1. **Split first, preprocess later**

   * Do train/test split before scaling, imputing, or feature selection.
2. **Pipelines**

   * Use scikit-learn’s `Pipeline` so that preprocessing is fit on the training data only and then applied unchanged to the test data.
3. **Audit features**

   * Check: *Would I have this feature at prediction time?*
4. **Careful with time-series**

   * Always split chronologically, not randomly.
5. **Cross-validation grouping**

   * Ensure related samples (same patient, same user) are not split across train/test.
6. **Domain expertise**

   * Work with subject experts to identify hidden leakage features.
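
The grouping rule in point 5 can be sketched as a split over group identifiers rather than over individual rows. This is a minimal plain-Python illustration with made-up patient records; `group_train_test_split` is a hypothetical helper, not a library function.

```python
import random


def group_train_test_split(records, group_key, test_fraction=0.34, seed=0):
    """Split so that every group (e.g. one patient) lands entirely on one side."""
    groups = sorted({record[group_key] for record in records})
    random.Random(seed).shuffle(groups)
    n_test_groups = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test_groups])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


# Hypothetical repeated measurements per patient.
records = [
    {"patient_id": "p1", "reading": 5.1},
    {"patient_id": "p1", "reading": 5.3},
    {"patient_id": "p2", "reading": 7.8},
    {"patient_id": "p2", "reading": 7.6},
    {"patient_id": "p3", "reading": 6.0},
    {"patient_id": "p3", "reading": 6.2},
]

train, test = group_train_test_split(records, "patient_id")
train_ids = {r["patient_id"] for r in train}
test_ids = {r["patient_id"] for r in test}
print(train_ids, test_ids)
```

Because the shuffle operates on patient IDs, no patient can contribute rows to both sides of the split.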

---

#### Analogy

* Training with leakage = **student cheating with leaked exam answers**.
* Deployment = **real exam without leaks** → the student (model) fails badly.

---

**In summary:**
Data leakage = using future or unavailable information in training.
It’s subtle, dangerous, and often the reason behind “amazing models that collapse in production.”


---

### 2. Overfitting

* **What it is:** Model memorizes noise in training data → poor generalization.
* **Example:** Deep tree that perfectly fits training but fails on test.
* ✅ **Prevention:**

  * Use **regularization** (L1, L2, dropout).
  * Collect more data.
  * Use **cross-validation**.
  * Prune complexity (e.g., max depth in decision trees).
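
Cross-validation, mentioned in the list above, rotates which part of the data is held out so that every sample is used for validation exactly once. The sketch below only generates the index splits; it assumes the rows are already shuffled, and `k_fold_indices` is an illustrative name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(val_idx)
```

A model would be trained on `train_idx` and scored on `val_idx` for each fold; a large gap between training and validation scores is a typical sign of overfitting.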

---

### 3. Underfitting

* **What it is:** Model too simple → misses important patterns.
* **Example:** Using linear regression on complex nonlinear data.
* ✅ **Prevention:**

  * Use more expressive models.
  * Add features or polynomial terms.
  * Reduce regularization strength.

---

### 4. Class Imbalance

* **What it is:** One class dominates (e.g., 99% normal, 1% fraud).
* **Example:** Classifier predicts “normal” always → high accuracy but useless.
* ✅ **Prevention:**

  * Resample (oversample minority, undersample majority).
  * Use **SMOTE** (synthetic data generation).
  * Choose **balanced metrics** (F1, ROC-AUC, Precision-Recall).
  * Apply **class weights** in algorithms.
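
The "high accuracy but useless" case can be reproduced in a few lines. The sketch below scores an always-"normal" classifier on a hypothetical 99:1 dataset and also computes inverse-frequency class weights using the common heuristic `n_samples / (n_classes * class_count)`. All function names are local to this example.

```python
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Define each metric as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical labels: 99 normal (0), 1 fraud (1).
y_true = [0] * 99 + [1]
always_normal = [0] * 100

acc = accuracy(y_true, always_normal)
precision, recall, f1 = precision_recall_f1(y_true, always_normal)
weights = balanced_class_weights(y_true)
print(acc, f1, weights)
```

Accuracy looks excellent while F1 for the fraud class is zero, which is why the metric choice matters on imbalanced data.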

---

### 5. Data Drift & Concept Drift

* **What it is:** Data or relationships change over time.
* **Example:** Customer behavior before vs after COVID.
* ✅ **Prevention:**

  * Monitor model performance regularly.
  * Retrain periodically.
  * Use **online learning** for streaming data.
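
Monitoring can start very simply. The sketch below flags a shift in a single feature by comparing the mean of a recent window against a reference window, measured in reference standard deviations. The threshold of 3.0 and the sample values are arbitrary assumptions, and a check like this only looks at input (data) drift, not at changes in the input-output relationship.

```python
import statistics


def mean_shift_score(reference, recent):
    """How many reference standard deviations the recent mean has moved."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(recent) - ref_mean) / ref_std


def has_drifted(reference, recent, threshold=3.0):
    return mean_shift_score(reference, recent) > threshold


# Hypothetical values of one feature (e.g. average basket size).
reference = [20.0, 22.0, 19.0, 21.0, 20.0, 18.0, 21.0, 19.0]
stable_window = [20.0, 21.0, 19.0, 20.0]
shifted_window = [35.0, 38.0, 36.0, 37.0]

print(has_drifted(reference, stable_window))
print(has_drifted(reference, shifted_window))
```

In practice the same comparison would run on a schedule, alongside tracking of the model's actual prediction quality.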

---

### 6. Multicollinearity

* **What it is:** Features highly correlated → unstable coefficients.
* **Example:** Predicting salary with both “years of experience” and “months of experience”.
* ✅ **Prevention:**

  * Remove redundant features.
  * Use **regularization (Ridge/Lasso)**.
  * Apply **PCA** for dimensionality reduction.
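
A quick way to spot redundant features is to compute pairwise Pearson correlations and flag pairs above a cutoff. The sketch below uses the years/months example from above plus an unrelated made-up feature; the 0.95 cutoff and the helper names are assumptions.

```python
import math
from itertools import combinations


def pearson_correlation(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def find_redundant_pairs(features, threshold=0.95):
    """Return feature-name pairs whose absolute correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(features, 2)
        if abs(pearson_correlation(features[a], features[b])) > threshold
    ]


years = [1.0, 3.0, 5.0, 8.0, 12.0]
features = {
    "years_experience": years,
    "months_experience": [12.0 * y for y in years],  # same information, different unit
    "commute_km": [12.0, 3.0, 25.0, 7.0, 15.0],      # hypothetical unrelated feature
}

redundant = find_redundant_pairs(features)
print(redundant)
```

Dropping one feature from each flagged pair is the simplest of the remedies listed above.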

---

### 7. Curse of Dimensionality

* **What it is:** As features grow, data becomes sparse → distance metrics fail.
* **Example:** kNN performs poorly in 1000 dimensions.
* ✅ **Prevention:**

  * Use **feature selection**.
  * Apply dimensionality reduction (PCA, t-SNE, UMAP).
  * Gather more data.
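
The claim that distances lose meaning in high dimensions can be checked empirically. The sketch below draws uniform random points and measures the relative contrast between the farthest and nearest neighbor of a query. The point count, seed, and dimensions are arbitrary choices for illustration.

```python
import math
import random


def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


contrast_2d = relative_contrast(2)
contrast_1000d = relative_contrast(1000)
print(contrast_2d, contrast_1000d)
```

In two dimensions the nearest point is far closer than the farthest one; in 1000 dimensions all points sit at nearly the same distance, so "nearest" carries little information for kNN.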

---

### 8. Sampling Bias

* **What it is:** Training data doesn’t represent real-world distribution.
* **Example:** Training only on urban customers → fails on rural customers.
* ✅ **Prevention:**

  * Ensure **stratified sampling**.
  * Collect **representative datasets**.
  * Be cautious with web-scraped or convenience samples.
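
Stratified sampling can be sketched as "sample the same fraction from every group". The example below uses a hypothetical customer list with an 80/20 urban/rural mix; `stratified_sample` is an illustrative helper. Note that stratification preserves proportions already present in the source data and cannot add groups that were never collected.

```python
import random
from collections import defaultdict


def stratified_sample(records, strata_key, fraction, seed=0):
    """Sample the same fraction from every stratum so proportions are preserved."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[record[strata_key]].append(record)
    sample = []
    for stratum_records in by_stratum.values():
        n_take = max(1, round(len(stratum_records) * fraction))
        sample.extend(rng.sample(stratum_records, n_take))
    return sample


# Hypothetical population: 80 urban and 20 rural customers.
customers = [{"id": i, "region": "urban"} for i in range(80)]
customers += [{"id": i, "region": "rural"} for i in range(80, 100)]

sample = stratified_sample(customers, "region", fraction=0.1)
n_urban = sum(1 for c in sample if c["region"] == "urban")
n_rural = sum(1 for c in sample if c["region"] == "rural")
print(n_urban, n_rural)
```

A plain random sample of 10 could easily contain no rural customers at all; the stratified version keeps the 80/20 ratio.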

---

### 9. Scaling & Normalization Issues

* **What it is:** Using features with different scales can mislead algorithms.
* **Example:** kNN treating “income (\$)” as more important than “age (years)”.
* ✅ **Prevention:**

  * Normalize/standardize features.
  * Use pipelines to prevent leakage.
  * Choose scale-invariant models if possible (trees).
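
The income-versus-age example can be made concrete by comparing Euclidean distances before and after standardization. The four people below are invented, and for brevity the statistics are computed on all rows; in a real workflow they would come from the training split only.

```python
import math
import statistics


def standardize_columns(rows):
    """Rescale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]


# Hypothetical people as (income in $, age in years).
people = [
    (50_000.0, 25.0),  # A: the query person
    (50_500.0, 65.0),  # B: almost the same income, 40 years older
    (60_000.0, 26.0),  # C: almost the same age, higher income
    (90_000.0, 45.0),  # D: extra row so the column statistics are less degenerate
]

# Raw distances: income differences swamp age differences.
raw_to_b = math.dist(people[0], people[1])
raw_to_c = math.dist(people[0], people[2])

scaled = standardize_columns(people)
scaled_to_b = math.dist(scaled[0], scaled[1])
scaled_to_c = math.dist(scaled[0], scaled[2])

print(raw_to_b < raw_to_c, scaled_to_c < scaled_to_b)
```

On raw values B looks like A's nearest neighbor purely because dollars span a larger numeric range than years; after standardization the 40-year age gap counts and C becomes nearer.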

---

### 10. Evaluation Pitfalls

* **What it is:** Using the wrong metric for the problem.
* **Example:** Accuracy in fraud detection (useless if data is imbalanced).
* ✅ **Prevention:**

  * Choose metrics suited to task (F1 for imbalance, RMSE for regression).
  * Use **cross-validation**.
  * Avoid test set reuse (keep a final hold-out set).
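
For regression, the choice between error metrics also changes what gets penalized. The sketch below computes RMSE and MAE on made-up predictions, one of which is badly off.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical house prices in thousands.
actual = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 300.0, 390.0]

model_rmse = rmse(actual, predicted)
model_mae = mae(actual, predicted)
print(model_rmse, model_mae)
```

RMSE comes out larger than MAE here because the single 40-unit miss is squared, so RMSE is the stricter choice when large errors are especially costly.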
