# **Introduction to Statistical Modeling:**

## **Vocabulary:**

${Y}$ = True Response Value 

$\hat{Y}$  = Model Prediction Value

$\varepsilon$ = Error Term

$\mathbb{E}$ = Expected Value or Average Value

$\beta_n$ = Model Parameter or *Weight*

## **Inference Vs Prediction Modeling:**

### Inference Modeling:

Understanding the Association between a Response $Y$ and its Predictors $X_n$.

- Which Predictors are Associated with the Response?
- What is the Relationship between the Response and **Each** Predictor?
- How Complicated is the Relationship? (Linear or Non-Linear) (Model Bias)

### Prediction Modeling:

Understanding How Well We Can Predict $Y$ from $X_n$

- How accurately can we predict $Y$ from new observations of $X_n$?
- What combination of predictors gives the best prediction?
- How well does the model generalize to unseen data?
- What is the trade-off between bias and variance?

## **Reducible vs Irreducible Error:**


$$
\mathbb{E}[(Y - \hat{Y})^2] = \underbrace{\mathbb{E}[(f(X) - \hat{f}(X))^2]}_{\text{Reducible error}} + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{Irreducible error}}
$$

The first term is the reducible error: it shrinks as $\hat{f}$ captures the underlying relationship $f$ more accurately, which is what choosing a better-suited model achieves. The part of this error that comes from the model's assumed form being unable to represent the true relationship is the model bias.

However, even if $\hat{f}$ perfectly estimated $f$, there would still be error due to the inherent variability in $Y$ ($\varepsilon$). This is the irreducible error.
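The split between the two error types can be checked with a small simulation. This is only a sketch under assumed values: the "true" function `f`, the deliberately poor estimate `f_hat_poor`, and the noise level are all made up for illustration.

```python
import random

random.seed(0)

NOISE_SD = 1.0  # assumed standard deviation of epsilon, so Var(epsilon) = 1.0

def f(x):
    # Hypothetical "true" relationship, chosen only for illustration.
    return 2.0 * x + 1.0

def f_hat_poor(x):
    # A deliberately poor estimate of f (wrong slope).
    return 1.5 * x + 1.0

# Simulate Y = f(X) + epsilon.
xs = [random.uniform(0, 4) for _ in range(20000)]
ys = [f(x) + random.gauss(0, NOISE_SD) for x in xs]

def mse(predict):
    # Average squared difference between Y and the prediction.
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

mse_perfect = mse(f)        # f_hat = f: only irreducible error remains
mse_poor = mse(f_hat_poor)  # irreducible + reducible error
reducible = sum((f(x) - f_hat_poor(x)) ** 2 for x in xs) / len(xs)

print(f"MSE with perfect f_hat: {mse_perfect:.3f}")
print(f"MSE with poor f_hat:    {mse_poor:.3f}")
print(f"Reducible part:         {reducible:.3f}")
```

With a perfect estimate the MSE should land near $\operatorname{Var}(\varepsilon) = 1$, and the MSE of the poor estimate should be roughly the sum of the reducible part and that same noise variance.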

## **Parametric Vs Non-Parametric Models:**

### **Parametric Methods:**

#### Assume the Form of the Function ${f}$ and Summarize the Data using a Fixed Number of Parameters

---

#### When to Use:

- When you have **strong prior knowledge** or assumptions about the data structure.
- When **interpretability** and **efficiency** are important.
- With **small to moderate** data where overfitting is a concern.

---

#### Example Models:

- Linear/Logistic Regression
- Ridge/Lasso Regression
- Naive Bayes (With Fixed Distribution)
- ARIMA (in Time Series)

### **Non-Parametric Methods:**

#### Make no Strong Assumption about the Form of ${f}$ and let the Data Determine the Model Complexity (Parameters Grow with Data)

---

#### When to Use:

- When the true relationship is **unknown or highly complex**.
- When you have **large datasets** that can support more **flexible** models.
- When **prediction accuracy** is more important than interpretability.

---

#### Example Models:
- K Nearest Neighbors
- Support Vector Machines (with Non-Linear Kernels)
- Splines
- Random Forests
- Neural Networks / Deep Learning
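To make the contrast concrete, here is a minimal sketch that fits both kinds of model to the same made-up data. The quadratic truth, the noise level, and the helper names `predict_linear` and `predict_knn` are assumptions for illustration only.

```python
import random

random.seed(1)

# Hypothetical data: a non-linear truth (x squared) plus noise.
xs = [random.uniform(-3, 3) for _ in range(200)]
ys = [x ** 2 + random.gauss(0, 0.5) for x in xs]

# Parametric: assume f(x) = b0 + b1 * x, so the whole fit is two numbers.
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
b1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
      / sum((x - x_mean) ** 2 for x in xs))
b0 = y_mean - b1 * x_mean

def predict_linear(x):
    return b0 + b1 * x

# Non-parametric: no assumed form; the "model" is the stored training data.
def predict_knn(x, k=10):
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

x_new = 2.5  # the true value here is 2.5 ** 2 = 6.25
print(f"linear: {predict_linear(x_new):.2f}  kNN: {predict_knn(x_new):.2f}")
```

Because the assumed linear form is wrong for this data, the two-parameter fit should miss badly at `x_new`, while the k-NN average should track the curve. On truly linear data the parametric fit would likely be the better choice.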

## **Supervised vs Unsupervised Learning:**

### **Supervised Learning:**

#### What Makes It Supervised?

- The model is trained on a labeled dataset, where each input $X$ is paired with a known output $Y$.
- The goal is to **learn a mapping** from inputs to outputs: $f: X \rightarrow Y$.
- Performance is measured using the difference between predicted and true labels (e.g., accuracy, MSE).

---
#### Industry Applications & Problems

1. **Healthcare**: Predicting disease risk (e.g., diabetes or cancer diagnosis from patient data)
2. **Finance**: Credit scoring and loan default prediction
3. **Retail & Commerce**: Forecasting sales or predicting customer churn

---
#### Common Supervised Learning Algorithms

1. **Linear Regression** – for continuous outputs
2. **Logistic Regression** – for binary classification
3. **Support Vector Machines (SVM)**
4. **Random Forests**
5. **Neural Networks** (for regression and classification)

---

### **Unsupervised Learning:**

#### What Makes It Unsupervised?

- The model is trained on data **without labels** — only input $X$ is observed.
- The goal is to **discover hidden patterns, structure, or groupings** in the data.
- There is no explicit “correct” output to compare to — evaluation is task-specific.

---

#### Industry Applications & Problems

1. **Banking**: Anomaly detection for fraud or unusual transactions
2. **E-commerce**: Identifying consumer behavior segments for targeted marketing
3. **Cybersecurity**: Detecting unusual network traffic or attacks

---

#### Common Unsupervised Learning Algorithms

1. **K-Means Clustering**
2. **Hierarchical Clustering**
3. **Principal Component Analysis (PCA)**
4. **Autoencoders**
5. **GMM (Gaussian Mixture Models)**

## **Regression Vs. Classification:**

### **Regression:**

#### What Makes It Regression?

- The **output variable $Y$ is continuous** — it can take any real value within a range.
- The goal is to **predict a numeric quantity** based on input features $X$.
- Evaluation metrics include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and $R^2$ score.
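The three metrics listed above can be computed directly from their definitions. This is a minimal sketch; the function name `regression_metrics` and the sample values are made up for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    # Sum of squared residuals between true and predicted values.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    mse = ss_res / n
    rmse = math.sqrt(mse)  # same units as Y
    # Total variation of Y around its mean.
    y_mean = sum(y_true) / n
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    r2 = 1.0 - ss_res / ss_tot  # share of variation explained
    return mse, rmse, r2

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 6.5]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(round(mse, 3), round(rmse, 3), round(r2, 3))  # 0.375 0.612 0.898
```

RMSE is often easier to interpret than MSE because it is on the same scale as $Y$, while $R^2$ is unitless.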

---

#### Industry Applications & Problems

1. **Real Estate**: Predicting house prices based on location, size, and features  
2. **Finance**: Forecasting stock prices or market returns  
3. **Healthcare**: Estimating patient survival time or disease progression  

---

#### Common Regression Algorithms

1. Linear Regression  
2. Ridge / Lasso Regression  
3. Decision Trees for Regression  
4. Random Forest Regressor  
5. Gradient Boosting Regressor (e.g., XGBoost, LightGBM)  
6. Support Vector Regression (SVR)  
7. k-Nearest Neighbors (kNN) Regressor  
8. Neural Networks (for continuous output)

### **Classification:**

#### What Makes It Classification?

- The **output variable $Y$ is categorical** — it belongs to a finite set of classes or labels.
- The goal is to **assign input $X$ to one of the predefined classes**.
- Evaluation metrics include Accuracy, Precision, Recall, F1 Score, and AUC-ROC.
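For a binary problem, the first four of these metrics follow from counting true positives, false positives, and false negatives. The sketch below assumes made-up labels and a hypothetical helper name `classification_metrics`; AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # (0.75, 0.75, 0.75, 0.75)
```

The four values coincide here only because the toy labels contain one false positive and one false negative; on imbalanced data they usually differ.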

---

#### Industry Applications & Problems

1. **Healthcare**: Diagnosing disease (e.g., predicting if a tumor is malignant or benign)  
2. **Email Filtering**: Classifying spam vs. non-spam messages  
3. **Customer Analytics**: Predicting if a user will churn or renew  

---

#### Common Classification Algorithms

1. Logistic Regression  
2. Decision Trees for Classification  
3. Random Forest Classifier  
4. Support Vector Machines (SVM)  
5. k-Nearest Neighbors (kNN) Classifier  
6. Naive Bayes  
7. Gradient Boosting Classifier (e.g., XGBoost, LightGBM)  
8. Neural Networks (for classification tasks)

# **Measuring the Quality of a Model's Fit**

## **Mean Squared Error (MSE)**

The **Mean Squared Error (MSE)** is a common loss function used to evaluate regression models. It measures the average of the squared differences between the actual values $Y$ and the predicted values $\hat{Y}$.

The formula is:

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

- $y_i$ is the true value for the $i^{th}$ data point  
- $\hat{y}_i$ is the predicted value  
- $n$ is the total number of observations
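As a quick arithmetic check of the formula, take three made-up observations (illustrative values, not from any dataset) with $y = (3, 5, 2)$ and $\hat{y} = (2.5, 5, 4)$:

$$
\text{MSE} = \frac{(3 - 2.5)^2 + (5 - 5)^2 + (2 - 4)^2}{3} = \frac{0.25 + 0 + 4}{3} \approx 1.417
$$

The single large miss on the third point contributes almost all of the total, which shows how squaring penalizes large errors more heavily than small ones.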

---

#### How MSE Relates to Overfitting

- A **low training MSE** means the model fits the training data well.
- However, if the **test MSE starts increasing while training MSE continues to decrease**, the model is likely **overfitting** — it’s learning patterns and noise specific to the training set that do not generalize to new data.
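This pattern can be reproduced with a small simulation. The sketch below uses k-NN regression as the flexible model, where a smaller $k$ means more flexibility. The sine-shaped truth, the noise level, and the sample sizes are assumptions chosen only for illustration.

```python
import math
import random

random.seed(2)

def make_data(n):
    # Hypothetical truth: y = sin(x) + noise.
    xs = [random.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, 0.5) for x in xs]
    return xs, ys

x_train, y_train = make_data(200)
x_test, y_test = make_data(1000)

def knn_predict(x, k):
    # Average the responses of the k training points closest to x.
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return sum(y_train[i] for i in order[:k]) / k

def mse(xs, ys, k):
    return sum((y - knn_predict(x, k)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Smaller k = more flexible model.
results = {k: (mse(x_train, y_train, k), mse(x_test, y_test, k))
           for k in (25, 5, 1)}
for k, (train_mse, test_mse) in results.items():
    print(f"k={k:>2}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With $k = 1$ every training point is its own nearest neighbor, so the training MSE is zero, yet the test MSE should come out worse than for $k = 25$ because the model has memorized the noise.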

## **Bias-Variance Trade Off**

In supervised learning, the expected squared prediction error at a given input $X$ can be decomposed into three components:

$$
\mathbb{E}[(Y - \hat{f}(X))^2] = \underbrace{[\text{Bias}(\hat{f}(X))]^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat{f}(X))}_{\text{Variance}} + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{Irreducible error}}
$$

- **Bias** measures the error introduced by approximating a complex real-world problem with a simplified model. It reflects how far the average prediction $\mathbb{E}[\hat{f}(X)]$ is from the true function $f(X)$.
  
- **Variance** measures how much the model's prediction $\hat{f}(X)$ would change if it were trained on a different dataset. High variance means the model is sensitive to fluctuations in the training data.

- **Irreducible error** ($\operatorname{Var}(\varepsilon)$) represents the inherent noise in the data — variation in $Y$ that cannot be explained even if we knew the true function $f(X)$ perfectly.

The trade-off lies in managing bias and variance: improving one often worsens the other. A good model finds the right balance to minimize the overall prediction error.
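One way to see the trade-off is to retrain the same model on many freshly drawn training sets and look at its predictions at a single input. The sketch below does this for k-NN regression with a small and a large $k$. The true function, noise level, evaluation point `X0`, and sample sizes are all assumptions for illustration.

```python
import math
import random

random.seed(3)

NOISE_SD = 0.5       # assumed standard deviation of epsilon
X0 = math.pi / 2     # the input at which the error is decomposed
N_TRAIN = 50

def f(x):
    return math.sin(x)  # hypothetical true function

def fit_and_predict(k):
    # Draw a fresh training set, then predict at X0 with k-NN regression.
    xs = [random.uniform(0, math.pi) for _ in range(N_TRAIN)]
    ys = [f(x) + random.gauss(0, NOISE_SD) for x in xs]
    order = sorted(range(N_TRAIN), key=lambda i: abs(xs[i] - X0))
    return sum(ys[i] for i in order[:k]) / k

def decompose(k, n_repeats=500):
    preds = [fit_and_predict(k) for _ in range(n_repeats)]
    mean_pred = sum(preds) / n_repeats
    bias_sq = (mean_pred - f(X0)) ** 2                       # squared bias
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_repeats
    return bias_sq, variance

decomposition = {k: decompose(k) for k in (1, 30)}
for k, (bias_sq, variance) in decomposition.items():
    total = bias_sq + variance + NOISE_SD ** 2
    print(f"k={k:>2}  bias^2={bias_sq:.3f}  variance={variance:.3f}  "
          f"expected error={total:.3f}")
```

The flexible model ($k = 1$) should show almost no bias but a large variance, while the smoother model ($k = 30$) trades a little bias for a much smaller variance. The irreducible term is the same for both.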


## **Bayes Classifier**

The **Bayes classifier** is a theoretical model that assigns a new observation $X = x$ to the class $k$ that has the **highest probability of being correct**, given the observed input:

$$
\text{Class}(x) = \arg\max_k \ \mathbb{P}(Y = k \mid X = x)
$$

This probability — $\mathbb{P}(Y = k \mid X = x)$ — is called the **posterior probability**, meaning:
> "Given what I've observed (input $x$), how likely is it that this data point belongs to class $k$?"

---

### Bayes’ Theorem Behind the Scenes

The posterior is computed using **Bayes’ Theorem**:

$$
\mathbb{P}(Y = k \mid X = x) = \frac{\mathbb{P}(X = x \mid Y = k) \cdot \mathbb{P}(Y = k)}{\mathbb{P}(X = x)}
$$

Where:
- **Posterior**: $\mathbb{P}(Y = k \mid X = x)$ — the probability of class $k$ given input $x$
- **Likelihood**: $\mathbb{P}(X = x \mid Y = k)$ — how likely input $x$ is under class $k$
- **Prior**: $\mathbb{P}(Y = k)$ — how common class $k$ is before seeing any data
- **Evidence**: $\mathbb{P}(X = x)$ — the overall probability of observing $x$ (used for normalization)

Think of it like this:
- The **prior** is your starting belief about which class is likely, *before* you look at the data.
- The **likelihood** tells you how well the observed data $x$ fits each possible class.
- The **posterior** combines both: it updates your belief about the class *after* seeing the data.
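The update from prior to posterior is a few lines of arithmetic. In this sketch the class names and all probabilities are made-up numbers, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    # priors[k] = P(Y = k); likelihoods[k] = P(X = x | Y = k) for one observed x.
    joint = {k: likelihoods[k] * priors[k] for k in priors}
    evidence = sum(joint.values())  # P(X = x), the normalizing constant
    return {k: joint[k] / evidence for k in joint}

# Made-up example: an email contains a particular word.
priors = {"spam": 0.2, "not_spam": 0.8}
likelihoods = {"spam": 0.6, "not_spam": 0.05}  # P(word appears | class)

post = posterior(priors, likelihoods)
bayes_class = max(post, key=post.get)  # arg max over classes

print({k: round(v, 3) for k, v in post.items()})  # {'spam': 0.75, 'not_spam': 0.25}
print(bayes_class)  # spam
```

Even though spam is the less common class a priori, the observed word is so much more likely under spam that the posterior favors it.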

---

### Why the Bayes Classifier Matters — and Why It’s Hard to Use

- It is **theoretically optimal**: no classifier can achieve a lower expected error rate than the Bayes classifier, which is defined using the true conditional distribution of $Y$ given $X$. Its error rate is known as the Bayes error rate.
- But it's **not practical in real life**, because:
  - We **don’t know** the true distributions $\mathbb{P}(X \mid Y)$.
  - Estimating these distributions perfectly would require **infinite data**, or unrealistic assumptions.
  - Computing exact probabilities in high-dimensional space is **computationally expensive or impossible**.

---

### What We Do Instead

In practice, we approximate the Bayes classifier using simpler or more flexible models:
- **Naive Bayes** assumes features are conditionally independent
- **Linear Discriminant Analysis (LDA)** assumes the features are normally distributed within each class, with a covariance matrix shared across classes
- **Non-parametric methods** like k-NN use data points directly to estimate probabilities

The Bayes classifier acts as a **gold standard** — a benchmark for how well our models could perform in the best-case scenario.


## **K-Nearest Neighbors:**

**k-NN** is a non-parametric method used for classification and regression. It makes predictions based on the $k$ closest training points to a new input $x$.

---

#### Classification Prediction Rule

$$
\hat{y} = \arg\max_{j \in \mathcal{C}} \sum_{i \in \mathcal{N}_k(x)} \mathbb{I}(y_i = j)
$$

- $\mathcal{C}$: set of possible classes  
- $\mathcal{N}_k(x)$: indices of the $k$ nearest training points to $x$  
- $\mathbb{I}(y_i = j)$: Indicator Function (1 if the $i$-th neighbor belongs to class $j$, 0 otherwise)

---

#### Class Probability Estimate

The estimated probability that $Y = j$ given input $X = x$ is:

$$
\mathbb{P}(Y = j \mid X = x) \approx \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} \mathbb{I}(y_i = j)
$$

This is the **proportion of neighbors** among the $k$ nearest that belong to class $j$.

---

#### How It Works

1. Compute distances from $x$ to all training points (e.g., Euclidean).
2. Identify the $k$ closest points.
3. Return the **majority class** among those neighbors, which is also the **class with the highest estimated probability**.
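The three steps can be sketched in plain Python as follows. The function names and the toy training points are assumptions for illustration, and ties in the vote are broken arbitrarily in favor of the first class encountered.

```python
import math
from collections import Counter

def knn_class_probabilities(X_train, y_train, x, k):
    # Step 1: Euclidean distance from x to every training point.
    distances = [math.dist(x, xi) for xi in X_train]
    # Step 2: indices of the k closest points, i.e. N_k(x).
    neighbors = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    # Proportion of the k neighbors that fall in each class.
    counts = Counter(y_train[i] for i in neighbors)
    return {label: count / k for label, count in counts.items()}

def knn_classify(X_train, y_train, x, k):
    probs = knn_class_probabilities(X_train, y_train, x, k)
    # Step 3: class with the highest estimated probability (majority vote).
    return max(probs, key=probs.get)

# Toy data: two well-separated groups.
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
y_train = ["A", "A", "A", "B", "B", "B"]
x_new = (3.0, 3.0)

print(knn_classify(X_train, y_train, x_new, k=3))             # A
print(knn_class_probabilities(X_train, y_train, x_new, k=5))  # {'A': 0.6, 'B': 0.4}
```

Sorting all distances is the simplest approach and is fine for small datasets; larger ones typically rely on spatial index structures instead.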

---

#### Decision Boundary

- The **decision boundary** is where the predicted class label changes.
- For small $k$, the boundary is highly sensitive to local data and may be irregular.
- Larger $k$ smooths the boundary by averaging over more neighbors.

k-NN does not fit an explicit function during training; it stores the training data and does all of its computation at prediction time.


# Python Notes:

The Python Tips in This Section Revolve around x y and z