<img src="./images/banner.png" width="800">

# Parameter Estimation in Machine Learning

Parameter estimation stands at the very heart of machine learning, serving as the cornerstone upon which most algorithms are built. At its core, machine learning is fundamentally about discovering the optimal parameters for functions that can accurately describe and predict real-world phenomena.


From classic algorithms like Linear Regression and Naive Bayes to cutting-edge Deep Neural Networks, the fundamental goal remains consistent: estimating parameters that best fit the observed data and capture underlying patterns.


In our previous studies of probability theory, we encountered various distributions for random variables. Each of these distributions was characterized by specific parameters - the numerical inputs that define the behavior of the random variable. Up until now, we've worked with scenarios where these parameters were either explicitly provided or could be inferred from our understanding of the underlying process.


However, real-world machine learning problems often present a different challenge:

- We may not know the values of the parameters.
- We can't estimate them solely from expert knowledge.
- Instead of knowing the distribution behind the random variables, we only have a dataset generated from an unknown underlying distribution.


This is where parameter estimation comes into play. In this chapter, we will explore formal methods for estimating parameters from data, bridging the gap between theoretical probability models and practical machine learning applications.


Let's consider a simple example to illustrate the concept of parameter estimation. Imagine we're trying to predict a person's weight based on their height using linear regression. Our model might look like this:

$$ \text{Weight} = \beta_0 + \beta_1 \cdot \text{Height} + \epsilon $$

Where:
- $\beta_0$ is the y-intercept (base weight)
- $\beta_1$ is the slope (weight increase per unit of height)
- $\epsilon$ is the error term


In this case, $\beta_0$ and $\beta_1$ are our parameters. We don't know their true values, but we can estimate them from a dataset of height-weight measurements. The process of finding the best values for $\beta_0$ and $\beta_1$ that fit our data is parameter estimation.
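As a rough illustration of this idea, the sketch below simulates height-weight data and estimates $\beta_0$ and $\beta_1$ by ordinary least squares. The "true" values, the noise level, and the variable names are assumptions made up for the simulation, not measurements from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, used only to simulate data (assumed values).
true_beta0, true_beta1 = -100.0, 1.0   # kg, and kg per cm

height = rng.uniform(150, 200, size=200)              # heights in cm
noise = rng.normal(0.0, 5.0, size=height.shape)       # the error term epsilon
weight = true_beta0 + true_beta1 * height + noise     # weights in kg

# Design matrix with a column of ones so that the intercept is estimated too.
X = np.column_stack([np.ones_like(height), height])

# Ordinary least squares: choose beta to minimize ||X @ beta - weight||^2.
beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)
beta0_hat, beta1_hat = beta_hat

print(f"estimated intercept: {beta0_hat:.2f}")
print(f"estimated slope:     {beta1_hat:.3f}")
```

The estimates will not match the simulated values exactly, because the noise term makes every finite sample slightly different.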


By mastering parameter estimation techniques, you'll gain the ability to:

1. **Extract meaningful information** from complex datasets
2. **Fit models** that accurately represent underlying patterns
3. **Make predictions** based on learned parameters
4. **Understand the uncertainty** associated with your estimates


As we dive deeper into this crucial topic, you'll see how parameter estimation forms the foundation for many of the machine learning concepts we'll explore in future lectures. Let's embark on this journey to unravel the power of parameter estimation in machine learning!

**Table of contents**<a id='toc0_'></a>    
- [What is Parameter Estimation?](#toc1_)    
  - [Definition](#toc1_1_)    
  - [Key Components](#toc1_2_)    
  - [Example: Gaussian Distribution](#toc1_3_)    
  - [The Estimation Process](#toc1_4_)    
  - [Why It Matters](#toc1_5_)    
- [Types of Parameters in Machine Learning Models](#toc2_)    
  - [Model Parameters](#toc2_1_)    
  - [Hyperparameters](#toc2_2_)    
  - [Latent Parameters](#toc2_3_)    
  - [Fixed Parameters](#toc2_4_)    
  - [Structural Parameters](#toc2_5_)    
  - [Regularization Parameters](#toc2_6_)    
- [Overview of Parameter Estimation Techniques](#toc3_)    
  - [Maximum Likelihood Estimation (MLE)](#toc3_1_)    
  - [Bayesian Estimation](#toc3_2_)    
  - [Maximum A Posteriori (MAP) Estimation](#toc3_3_)    
  - [Method of Moments](#toc3_4_)    
- [Conclusion](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[What is Parameter Estimation?](#toc0_)

Parameter estimation is a fundamental process in statistics and machine learning where we attempt to determine the most likely values of parameters for a given model based on observed data. In essence, it's the bridge that connects our theoretical models to real-world observations.


### <a id='toc1_1_'></a>[Definition](#toc0_)


Formally, parameter estimation can be defined as:

> The process of using sample data to calculate a numerical value (an estimate) for an unknown population parameter.


In machine learning contexts, we can expand this definition:

> The task of finding the best set of parameters for a model that maximizes its ability to explain or predict the observed data.


### <a id='toc1_2_'></a>[Key Components](#toc0_)


1. **Model**: A mathematical representation of the relationship between variables in our data.
2. **Parameters**: The unknown values in our model that we need to estimate.
3. **Data**: The observed samples from which we estimate the parameters.
4. **Estimation Method**: The algorithm or approach used to find the best parameter values.


### <a id='toc1_3_'></a>[Example: Gaussian Distribution](#toc0_)


Let's consider a concrete example to illustrate parameter estimation. Suppose we have a dataset that we believe follows a Gaussian (Normal) distribution. The Gaussian distribution is defined by two parameters:

- $\mu$ (mu): the mean
- $\sigma$ (sigma): the standard deviation


The probability density function of a Gaussian distribution is:

$$ f(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2} $$


In this case, parameter estimation involves finding the values of $\mu$ and $\sigma$ that best describe our observed data.


### <a id='toc1_4_'></a>[The Estimation Process](#toc0_)


1. **Collect Data**: Gather a sample of observations.
2. **Choose a Model**: In this case, we've chosen the Gaussian distribution.
3. **Select an Estimation Method**: We might use Maximum Likelihood Estimation (MLE) or Method of Moments.
4. **Estimate Parameters**: Apply the chosen method to find $\hat{\mu}$ and $\hat{\sigma}$ (the "hat" notation denotes estimates).
5. **Evaluate**: Assess how well our estimated parameters fit the data.
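The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data: the values `loc=5.0` and `scale=2.0` are assumptions chosen for the simulation, and `gaussian_logpdf` is a helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: collect data. Here it is simulated; mu=5 and sigma=2 are assumed values.
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Steps 2-4: under a Gaussian model, the MLE of mu is the sample mean, and the
# MLE of sigma is the square root of the average squared deviation (ddof=0).
mu_hat = data.mean()
sigma_hat = data.std(ddof=0)

# Step 5: evaluate the fit using the average log-density of the data.
def gaussian_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

avg_loglik = gaussian_logpdf(data, mu_hat, sigma_hat).mean()
print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}, avg log-lik={avg_loglik:.3f}")
```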


### <a id='toc1_5_'></a>[Why It Matters](#toc0_)


Parameter estimation is crucial because it allows us to:

- **Understand Data**: By estimating parameters, we gain insights into the underlying structure of our data.
- **Make Predictions**: Once we have estimated parameters, we can use our model to make predictions on new, unseen data.
- **Quantify Uncertainty**: Many estimation methods also provide measures of uncertainty around our estimates.


As we progress through this course, you'll see how parameter estimation forms the backbone of many machine learning algorithms, from simple linear regression to complex neural networks. Understanding this concept is key to mastering the art and science of machine learning.

## <a id='toc2_'></a>[Types of Parameters in Machine Learning Models](#toc0_)

In machine learning, we encounter various types of parameters across different models. Understanding these parameter types is crucial for effective model design, training, and interpretation. Let's explore the main categories:


### <a id='toc2_1_'></a>[Model Parameters](#toc0_)


These are the parameters learned from the data during the training process. They define the model's behavior and are typically updated iteratively to minimize the loss function, although some models, such as ordinary least squares regression, admit a closed-form solution. Model parameters are specific to the chosen algorithm and represent the learned patterns in the data.


- **Characteristics**:
  - Estimated from the training data
  - Define the model's learned behavior
  - Often updated iteratively during training


- **Examples**:
  - Weights and biases in neural networks
  - Coefficients in linear regression
  - Mean and variance in Gaussian Naive Bayes


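```python
# Example: linear regression, y = β₀ + β₁x₁ + ... + βₙxₙ (values below are illustrative)
import numpy as np
beta_0, beta, x = 1.0, np.array([2.0, 3.0]), np.array([0.5, 0.2])  # β₀, (β₁, β₂), (x₁, x₂)
y = beta_0 + beta @ x  # intercept plus the weighted sum of the features
```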


Here, β₀, β₁, β₂, ..., βₙ are model parameters.


### <a id='toc2_2_'></a>[Hyperparameters](#toc0_)


These are configuration variables that are set before the learning process begins. Hyperparameters control the learning process and model behavior, impacting performance and generalization. They are not learned from the data but are tuned based on validation results or domain knowledge.


- **Characteristics**:
  - Not learned from the data
  - Control the learning process
  - Often tuned using techniques like cross-validation


- **Examples**:
  - Learning rate in gradient descent
  - Number of hidden layers in a neural network
  - Regularization strength in regularized models


```python
# Example: Hyperparameter in scikit-learn
from sklearn.svm import SVC

svm = SVC(C=1.0, kernel='rbf')  # C and kernel are hyperparameters
```
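To make the distinction between a hyperparameter and a model parameter concrete, here is a small sketch of gradient descent on a toy problem. The data, the function name `fit_weight`, and the two learning rates are assumptions chosen for illustration: the weight `w` is learned from the data, while the learning rate is fixed by us before training starts.

```python
import numpy as np

# Toy data generated by y = 3x, so the best weight is w = 3 (assumed setup).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

def fit_weight(learning_rate, steps=50):
    """Gradient descent on mean squared error for the model y ≈ w * x."""
    w = 0.0  # model parameter: learned from the data
    for _ in range(steps):
        grad = -2.0 * np.mean((y - w * x) * x)  # derivative of mean((y - w*x)^2) w.r.t. w
        w -= learning_rate * grad
    return w

w_good = fit_weight(learning_rate=0.05)  # hyperparameter chosen by us: converges
w_bad = fit_weight(learning_rate=0.2)    # too large: each update overshoots and diverges

print(w_good, w_bad)
```

The same model and data give a usable or a useless weight depending only on the learning rate, which is why hyperparameters are usually tuned on validation data rather than guessed.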


### <a id='toc2_3_'></a>[Latent Parameters](#toc0_)


These are hidden or unobserved parameters that the model infers from the data. Latent parameters capture underlying patterns or structures that are not directly observed but are essential for modeling complex relationships. They are often used in probabilistic models to represent hidden variables.


- **Characteristics**:
  - Not directly observed in the data
  - Inferred during the learning process
  - Often represent underlying structure or factors


- **Examples**:
  - Topic distributions in Latent Dirichlet Allocation (LDA)
  - Hidden states in Hidden Markov Models (HMM)
  - Latent factors in matrix factorization


### <a id='toc2_4_'></a>[Fixed Parameters](#toc0_)


These are parameters that are fixed and not learned during training. Fixed parameters are predefined and remain constant throughout the learning process. They are often based on prior knowledge, constraints, or design choices. Fixed parameters can significantly impact the model's architecture and behavior.


- **Characteristics**:
  - Predetermined and constant
  - Not updated during training
  - Often based on domain knowledge or constraints


- **Examples**:
  - Convolutional filter sizes in CNNs
  - Activation function choices in neural networks


### <a id='toc2_5_'></a>[Structural Parameters](#toc0_)


These define the structure or architecture of the model. Structural parameters determine the model's capacity and form, influencing its complexity and representational power. They are set before training and can have a significant impact on the model's performance. Structural parameters are often chosen based on the problem domain and computational constraints.


- **Characteristics**:
  - Determine the model's capacity and form
  - Usually set before training
  - Can significantly impact model complexity


- **Examples**:
  - Number of neurons per layer in a neural network
  - Degree of a polynomial regression model


```python
# Example: Structural parameter in Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu', input_shape=(10,)),  # 64 is a structural parameter
    Dense(1)
])
```


### <a id='toc2_6_'></a>[Regularization Parameters](#toc0_)


These control the model's complexity and prevent overfitting. Regularization parameters are used to penalize large weights or complex models, encouraging simpler solutions that generalize better. They are crucial for balancing model fit and complexity, improving performance on unseen data. Regularization parameters are often tuned through cross-validation.


- **Characteristics**:
  - Influence the trade-off between model fit and simplicity
  - Often treated as hyperparameters


- **Examples**:
  - λ in L1 (Lasso) or L2 (Ridge) regularization
  - Dropout rate in neural networks


```python
# Example: Regularization parameter in scikit-learn
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # alpha is a regularization parameter
```
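The effect of a regularization parameter can be seen with a closed-form ridge fit. The sketch below is a simplified stand-in rather than scikit-learn's implementation: it omits the intercept, uses simulated data with assumed coefficients, and `ridge_fit` is a helper defined here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated regression data; the true coefficients are assumed for illustration.
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

for alpha in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_fit(X, y, alpha)
    print(f"alpha={alpha:>6}: ||w|| = {np.linalg.norm(w):.3f}")
```

As `alpha` grows, the norm of the coefficient vector shrinks toward zero, which is the trade-off between fit and simplicity described above.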


Understanding these different types of parameters is essential for:
- Properly designing and implementing machine learning models
- Effectively tuning and optimizing model performance
- Interpreting model results and understanding their limitations


As you work with various machine learning algorithms, you'll encounter these different parameter types, and knowing how to handle each type will be crucial for successful model development and deployment.

## <a id='toc3_'></a>[Overview of Parameter Estimation Techniques](#toc0_)

It turns out there isn't just one way to estimate the value of parameters. There are two main schools of thought: frequentist and Bayesian. In the frequentist approach, we treat parameters as fixed but unknown values that we aim to estimate. In the Bayesian approach, we treat parameters as random variables with their own distributions. In this chapter, the methods from both schools assume that your data are independent and identically distributed (i.i.d.). This means that each data point is drawn from the same distribution and is independent of the others.

Let's look more closely at four fundamental parameter estimation techniques, with an intuitive explanation of each and the statistical philosophy behind it.


### <a id='toc3_1_'></a>[Maximum Likelihood Estimation (MLE)](#toc0_)


**Statistical Philosophy**: Frequentist


**Intuitive Explanation**: 
Imagine you're trying to guess the rules of a game by watching many rounds. MLE is like choosing the set of rules that would make the observed outcomes most likely. It asks, "What parameter values would make our observed data most probable?" It's a frequentist approach that treats parameters as fixed but unknown values.


MLE seeks to find the parameter values that maximize the probability of observing the given data. It's based on the likelihood function, which expresses how likely the observed data is as a function of the parameters.


For a dataset X = {x₁, x₂, ..., xₙ} and parameters θ, the likelihood function is:

$$ L(\theta|X) = P(X|\theta) = \prod_{i=1}^n P(x_i|\theta) $$


We often work with the log-likelihood for computational convenience:

$$ \ell(\theta|X) = \log L(\theta|X) = \sum_{i=1}^n \log P(x_i|\theta) $$


MLE finds the parameters that maximize this function:

$$ \hat{\theta}_{MLE} = \arg\max_{\theta} \ell(\theta|X) $$
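The computational convenience of the log-likelihood is easy to demonstrate. In the sketch below (simulated standard-normal data, with `normal_pdf` defined here as a helper), the raw likelihood is a product of thousands of numbers smaller than one and underflows in double precision, while the sum of logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=2000)  # simulated standard-normal sample

def normal_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

densities = normal_pdf(data)

likelihood = np.prod(densities)             # product of 2000 numbers below 1
log_likelihood = np.sum(np.log(densities))  # same information, expressed as a sum

print(likelihood)       # underflows to 0.0 in double precision
print(log_likelihood)   # a finite negative number
```

Because the logarithm is strictly increasing, maximizing the log-likelihood yields the same parameter values as maximizing the likelihood itself.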


**Example**: 
In a coin flipping experiment, MLE would estimate the probability of heads by choosing the value that makes the observed sequence of flips most likely.
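A minimal sketch of this coin example follows. The sequence of flips is made up for illustration. The code evaluates the Bernoulli log-likelihood on a grid of candidate probabilities and compares the maximizer with the closed-form answer, which is the fraction of heads.

```python
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # 1 = heads; made-up data
n_heads = flips.sum()
n_tails = len(flips) - n_heads

# Bernoulli log-likelihood evaluated on a grid of candidate values for p.
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = n_heads * np.log(p_grid) + n_tails * np.log(1 - p_grid)

p_hat_grid = p_grid[np.argmax(log_lik)]   # numerical maximizer on the grid
p_hat_closed = n_heads / len(flips)       # closed-form MLE: fraction of heads

print(p_hat_grid, p_hat_closed)
```

Both approaches should agree up to the grid resolution; the grid search is only there to show that the closed-form estimate really does sit at the peak of the likelihood.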


<img src="./images/prob-likelihood.webp" width="800">

**Pros**:
- Often leads to consistent and efficient estimators
- Widely applicable and computationally tractable

**Cons**:
- Can be biased for small samples
- Doesn't incorporate prior knowledge


### <a id='toc3_2_'></a>[Bayesian Estimation](#toc0_)


**Statistical Philosophy**: Bayesian


**Intuitive Explanation**: 
Imagine you're a detective solving a case. You start with some initial hunches (prior beliefs) about the culprit. As you gather evidence (data), you update your suspicions. Bayesian estimation works similarly, starting with prior beliefs about parameters and updating them based on observed data. 


**Detailed Explanation**:
Bayesian estimation uses Bayes' theorem to combine prior beliefs about parameters with observed data to produce a posterior distribution over the parameters.

Bayes' theorem states:

$$ P(\theta|X) = \frac{P(X|\theta)P(\theta)}{P(X)} $$

Where:
- $P(\theta|X)$ is the posterior distribution (updated beliefs about parameters)
- $P(X|\theta)$ is the likelihood (as in MLE)
- $P(\theta)$ is the prior distribution (initial beliefs about parameters)
- $P(X)$ is the evidence (a normalizing constant)


Instead of a point estimate, Bayesian estimation provides a full distribution over possible parameter values.
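One simple way to see Bayes' theorem produce a full distribution is a grid approximation. The sketch below assumes a coin with 7 heads and 3 tails (made-up counts) and a uniform prior. It evaluates prior times likelihood on a grid of $\theta$ values and normalizes, which plays the role of dividing by $P(X)$.

```python
import numpy as np

n_heads, n_tails = 7, 3  # made-up observations

theta = np.linspace(0.0, 1.0, 1001)  # grid of candidate parameter values
d_theta = theta[1] - theta[0]

prior = np.ones_like(theta)                                # uniform prior P(theta)
likelihood = theta**n_heads * (1 - theta)**n_tails         # P(X | theta), up to a constant
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * d_theta)  # normalize: divide by P(X)

# Summaries of the posterior distribution.
posterior_mean = (theta * posterior).sum() * d_theta
cdf = np.cumsum(posterior) * d_theta
lower = theta[np.searchsorted(cdf, 0.025)]
upper = theta[np.searchsorted(cdf, 0.975)]

print(f"posterior mean = {posterior_mean:.3f}")
print(f"95% credible interval = [{lower:.3f}, {upper:.3f}]")
```

Grid approximation only scales to one or two parameters; for larger models, sampling methods such as MCMC are the usual choice.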


**Example**:
In estimating the skill of a chess player, we might start with a prior belief based on their official rating. After observing several games, we update this belief to form a posterior distribution of their true skill.


**Pros**:
- Incorporates prior knowledge
- Provides uncertainty quantification
- Works well with small sample sizes

**Cons**:
- Can be computationally intensive
- Choice of prior can be subjective


### <a id='toc3_3_'></a>[Maximum A Posteriori (MAP) Estimation](#toc0_)


**Statistical Philosophy:** Bayesian


MAP estimation is a Bayesian approach to parameter estimation, combining prior knowledge with observed data. It seeks to find the parameter values that maximize the posterior probability of the parameters given the data.


Imagine you're a detective trying to solve a case. You start with some initial hunches (prior beliefs) about the culprit. As you gather evidence (data), you update your suspicions. MAP estimation is like choosing the suspect that is most likely given both your initial hunches and the new evidence. It's a balance between prior beliefs and observed data.


**How MAP Works:**
1. **Start with a Prior**: Begin with a prior probability distribution over the parameters, P(θ).

2. **Incorporate Data**: Use the likelihood of the data given the parameters, P(X|θ).

3. **Apply Bayes' Theorem**: Combine prior and likelihood to get the posterior probability:

   P(θ|X) ∝ P(X|θ) * P(θ)

4. **Find the Maximum**: The MAP estimate is the value of θ that maximizes this posterior probability.


MAP estimation finds:

$$ \theta_{MAP} = \arg\max_{\theta} P(\theta|X) = \arg\max_{\theta} [P(X|\theta) \cdot P(\theta)] $$

Often, we work with the log posterior for computational convenience:

$$ \theta_{MAP} = \arg\max_{\theta} [\log P(X|\theta) + \log P(\theta)] $$




**Comparison with Other Methods:**
1. **vs. Maximum Likelihood Estimation (MLE)**:
   - MLE maximizes only the likelihood: θ_MLE = argmax_θ P(X|θ)
   - MAP includes the prior: θ_MAP = argmax_θ [P(X|θ) * P(θ)]
   - When the prior is uniform, MAP and MLE give the same result.

2. **vs. Full Bayesian Estimation**:
   - Full Bayesian estimation provides the entire posterior distribution.
   - MAP gives a point estimate (the mode of the posterior).
   - MAP can be seen as an approximation to full Bayesian inference.


**Pros of MAP Estimation:**
1. **Incorporates Prior Knowledge**: Useful when you have reliable prior information.
2. **Regularization Effect**: The prior can help prevent overfitting, especially with small datasets.
3. **Point Estimate**: Provides a single "best" estimate, which can be more practical in some applications.

**Cons of MAP Estimation:**
1. **Sensitivity to Prior**: Results can be heavily influenced by the choice of prior.
2. **Point Estimate Only**: Doesn't provide full uncertainty quantification like full Bayesian methods.


<img src="./images/map-mle.png" width="800">

**Example**:
Suppose we're estimating the probability of a coin landing heads. We might use a Beta prior (which is conjugate to the Binomial likelihood for a coin flip):

- Prior: $\theta \sim \text{Beta}(\alpha, \beta)$
- Likelihood: $X \sim \text{Binomial}(n, \theta)$
- Posterior: $\theta|X \sim \text{Beta}(\alpha + \text{heads}, \beta + \text{tails})$


The MAP estimate would be the mode of this Beta posterior distribution.
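For a Beta$(a, b)$ distribution with $a > 1$ and $b > 1$, the mode is $\frac{a - 1}{a + b - 2}$, so the MAP estimate has a closed form. The sketch below uses made-up counts and an assumed Beta(2, 2) prior; `beta_mode` is a helper defined here.

```python
def beta_mode(a, b):
    """Mode of a Beta(a, b) distribution; this formula requires a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)

heads, tails = 7, 3       # made-up observations
alpha, beta = 2.0, 2.0    # assumed prior: a mild belief that the coin is fair

theta_mle = heads / (heads + tails)
theta_map = beta_mode(alpha + heads, beta + tails)  # mode of the Beta posterior

# With a uniform prior, Beta(1, 1), the MAP estimate coincides with the MLE.
theta_map_uniform = beta_mode(1.0 + heads, 1.0 + tails)

print(theta_mle, theta_map, theta_map_uniform)
```

The Beta(2, 2) prior pulls the estimate from the MLE toward 0.5, which is the regularization effect of the prior in its simplest form.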


MAP estimation bridges the gap between frequentist methods like MLE and full Bayesian approaches. It incorporates prior knowledge while still providing a point estimate, making it a valuable tool in many machine learning and statistical inference tasks.


### <a id='toc3_4_'></a>[Method of Moments](#toc0_)


**Statistical Philosophy:** Frequentist


The Method of Moments is a parameter estimation technique that belongs to the frequentist school of thought in statistics. It's based on the idea of using sample moments to estimate population parameters.


Imagine you're trying to guess the recipe for a cake by tasting it. The Method of Moments is like matching the characteristics of your cake (sweetness, texture) to the known effects of different ingredients. In statistical terms, you're matching the "moments" of your sample data to the theoretical moments of the distribution you believe the data follows.


**What are Moments?**
1. **Population Moments**: These are expected values of different powers of a random variable.
   - First moment: E[X] (mean)
   - Second moment: E[X²]
   - Third moment: E[X³]
   - And so on...

2. **Sample Moments**: These are the observed equivalents in your data.
   - First sample moment: x̄ (sample mean)
   - Second sample moment: average of x²
   - And so on...


**How does the Method of Moments work?**
1. **Express Population Moments in Terms of Parameters**:
   For a given distribution, express the theoretical moments in terms of the parameters you want to estimate.

   $$ E[X] = f(\text{parameters}), \qquad E[X^2] = g(\text{parameters}), \qquad \ldots $$

2. **Equate Sample Moments to Population Moments**:
   Set up equations where sample moments equal their theoretical counterparts.

   $$ \bar{x} \approx E[X] = f(\text{parameters}), \qquad \frac{1}{n}\sum_{i=1}^n x_i^2 \approx E[X^2] = g(\text{parameters}), \qquad \ldots $$

3. **Solve for Parameters**:
   Solve these equations to get estimates for your parameters.


**Example:** Uniform Distribution
Let's estimate parameters $a$ and $b$ for a uniform distribution on interval $[a, b]$.

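1. **Theoretical Moments**:

   $$ E[X] = \frac{a + b}{2}, \qquad E[X^2] = \frac{a^2 + ab + b^2}{3} $$

2. **Equate to Sample Moments**:

   $$ \bar{x} = \frac{a + b}{2}, \qquad \frac{1}{n}\sum_{i=1}^n x_i^2 = \frac{a^2 + ab + b^2}{3} $$

3. **Solve**:
   Since $E[X^2] - (E[X])^2 = \frac{(b - a)^2}{12}$, writing $s^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}^2$ gives the estimates

   $$ \hat{a} = \bar{x} - \sqrt{3}\,s, \qquad \hat{b} = \bar{x} + \sqrt{3}\,s $$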
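The uniform example can be checked numerically. Solving the two moment equations gives $\hat{a}, \hat{b} = \bar{x} \mp \sqrt{3(m_2 - \bar{x}^2)}$, where $m_2$ is the second sample moment. The sketch below applies this to simulated data; the endpoints 2 and 10 are assumptions used only to generate the sample.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.uniform(2.0, 10.0, size=5000)  # simulated; a=2 and b=10 are assumed

m1 = data.mean()        # first sample moment
m2 = np.mean(data**2)   # second sample moment

# Solving the two moment equations gives a, b = m1 -/+ sqrt(3 * (m2 - m1^2)).
spread = np.sqrt(3.0 * (m2 - m1**2))
a_hat, b_hat = m1 - spread, m1 + spread

print(f"a_hat={a_hat:.3f}, b_hat={b_hat:.3f}")
```

Nothing in these equations forces the estimated interval to contain every observation, so `a_hat` can end up above `data.min()` or `b_hat` below `data.max()`. That is one concrete form of the estimation issues listed among the cons of this method.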


A key advantage of the Method of Moments is its simplicity and universality. It can be applied even when likelihood functions are hard to work with, making it a valuable tool in the statistician's toolkit.


**Pros**:
1. **Simplicity**: Often leads to straightforward calculations.
2. **Universality**: Can be applied even when the likelihood function is hard to work with.

**Cons**:
1. **Efficiency**: Not always the most efficient estimator, especially for small samples.
2. **Potential Issues**: Can sometimes produce estimates outside the parameter space.


Here's how the Method of Moments compares with the other approaches:
- **vs. MLE**: Generally simpler but often less efficient.
- **vs. Bayesian**: Doesn't incorporate prior information or provide a full posterior distribution.

The choice between them often depends on the specific problem, available data, computational resources, and philosophical preferences. In practice, statisticians and data scientists may use a combination of these methods or choose based on the particular requirements of their analysis. Here's a comparison table of the parameter estimation methods we've discussed:

| Aspect | Maximum Likelihood Estimation (MLE) | Bayesian Estimation | Method of Moments (MoM) | Maximum A Posteriori (MAP) |
|--------|-------------------------------------|---------------------|-------------------------|----------------------------|
| **Philosophy** | Frequentist | Bayesian | Frequentist | Bayesian |
| **Core Idea** | Maximize the likelihood of observed data | Update prior beliefs with observed data | Match sample moments to theoretical moments | Maximize the posterior probability |
| **Prior Information** | Not used | Incorporated | Not used | Incorporated |
| **Result** | Point estimate | Full posterior distribution | Point estimate | Point estimate |
| **Objective Function** | Likelihood: P(X\|θ) | Posterior: P(θ\|X) | Moment equations | Posterior: P(θ\|X) |
| **Computation** | Often analytically tractable | Can be computationally intensive | Usually simple | Often simpler than full Bayesian |
| **Uncertainty Quantification** | Requires additional methods (e.g., bootstrapping) | Natural part of the method | Not directly provided | Not directly provided |
| **Sample Size Sensitivity** | Performs well with large samples | Can work well with small samples if prior is informative | Can be unreliable with small samples | Can work well with small samples if prior is informative |
| **Efficiency** | Often most efficient (asymptotically) | Can be more efficient with informative priors | Generally less efficient than MLE | Often more efficient than MLE with informative priors |
| **Bias** | Can be biased for small samples | Can be biased if prior is poorly chosen | Can be biased | Can be biased if prior is poorly chosen |
| **Consistency** | Typically consistent | Typically consistent | Typically consistent | Typically consistent |
| **Handling Complex Models** | Can be challenging for very complex models | Can handle complex models via MCMC methods | Often simpler for complex models | Can handle moderately complex models |
| **Main Advantage** | Efficiency and wide applicability | Full uncertainty quantification | Simplicity and universality | Incorporates prior knowledge with point estimate |
| **Main Limitation** | Doesn't use prior information | Can be computationally intensive | Not always efficient | Sensitive to prior choice |

## <a id='toc4_'></a>[Conclusion](#toc0_)

Parameter estimation is a fundamental task in statistics and machine learning, serving as the bridge between theoretical models and real-world data. As we've explored, there are several approaches to this crucial task, each with its own strengths, limitations, and philosophical underpinnings. Let's summarize the key takeaways from this chapter:

1. **Diversity of Methods**: From the simplicity of the Method of Moments to the comprehensive uncertainty quantification of full Bayesian estimation, each approach offers unique advantages.

2. **Philosophical Divide**: The frequentist (MLE, MoM) and Bayesian (MAP, full Bayesian) approaches represent different ways of thinking about probability and inference.

3. **Trade-offs**: Each method involves trade-offs between computational complexity, incorporation of prior knowledge, efficiency, and interpretability.

4. **Contextual Choice**: The best method often depends on the specific context, including sample size, model complexity, available prior information, and computational resources.

5. **Complementary Nature**: These methods are not always competing; they can be complementary. For instance, MLE estimates might serve as starting points for Bayesian analyses.


When choosing a parameter estimation method, consider the following factors:
- **Small Samples**: Bayesian methods (including MAP) can be particularly valuable when dealing with small sample sizes, especially if informative priors are available.
- **Large Datasets**: MLE often shines with large datasets, providing efficient and consistent estimates.
- **Computational Constraints**: Method of Moments might be preferred when quick, rough estimates are needed, or when dealing with very complex models where other methods are intractable.
- **Uncertainty Quantification**: Full Bayesian estimation is unparalleled when a complete picture of parameter uncertainty is required.


As machine learning and statistical methods continue to evolve, we're likely to see:
- More sophisticated hybrid approaches that combine the strengths of different estimation techniques.
- Increased use of approximate Bayesian methods to balance the benefits of Bayesian inference with computational efficiency.
- Greater emphasis on interpretable and explainable models, where the choice of estimation method plays a crucial role in model interpretation.


Mastering these parameter estimation techniques equips data scientists and statisticians with a versatile toolkit for tackling a wide range of real-world problems. The ability to choose and apply the appropriate method for a given scenario is a hallmark of expertise in the field.


As you continue your journey in machine learning and statistics, remember that these methods are not just theoretical constructs but practical tools that form the foundation of data-driven decision-making across numerous domains. By understanding their strengths and limitations, you'll be well-prepared to extract meaningful insights from data and build robust, reliable models.