<img src="./images/banner.png" width="800">

# Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a cornerstone method in statistical inference and machine learning for estimating the parameters of a probabilistic model. It provides a principled approach to fitting models to data, based on the idea of maximizing the likelihood of observing the given data under the model.


The concept of maximum likelihood was first developed by R. A. Fisher in the 1920s, although similar ideas had been used earlier by Gauss and Laplace. Fisher's work formalized the method and demonstrated its wide applicability, laying the foundation for much of modern statistical theory.


At its heart, MLE asks a simple question: "Given our observed data, what parameter values would make this data most likely to occur?" This intuitive idea is then formalized into a rigorous mathematical framework.


Mathematically, we can express the MLE estimate as:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta|X) = \arg\max_{\theta} P(X|\theta)$$


Where:
- $\hat{\theta}_{MLE}$ is the MLE estimate of the parameter(s)
- $L(\theta|X)$ is the likelihood function
- $X$ is the observed data
- $P(X|\theta)$ is the probability of observing $X$ given parameters $\theta$


💡 **Tip**: The 'arg max' notation means we're finding the value of θ that maximizes the function.

MLE plays a crucial role in many machine learning algorithms:

1. **Model Fitting**: It provides a principled way to fit models to data.
2. **Probabilistic Interpretations**: It allows for probabilistic interpretations of model outputs.
3. **Theoretical Guarantees**: MLE estimators often have desirable theoretical properties like consistency and efficiency.
4. **Foundation for Advanced Methods**: Many advanced techniques, like the Expectation-Maximization (EM) algorithm, are based on MLE principles.


> 💡 **Note**: Understanding MLE will give you insights into many ML algorithms' inner workings.


In this lecture, we'll dive deeper into the mathematical foundations of MLE, explore its properties and limitations, and see how it's applied in various machine learning contexts. By the end, you'll have a solid understanding of this powerful estimation technique and be able to apply it in your own data analysis and model building tasks.


🚀 **Learning Goal**: By the end of this lecture, you'll be able to apply MLE to various probabilistic models and understand its role in machine learning algorithms.


Understanding MLE is not just about learning a technique; it's about grasping a fundamental principle in statistical learning that will enhance your ability to work with probabilistic models and make informed decisions in data science and machine learning projects.

**Table of contents**<a id='toc0_'></a>    
- [Fundamental Concept of Likelihood](#toc1_)    
  - [Likelihood vs. Probability](#toc1_1_)    
  - [Example: Coin Flipping](#toc1_2_)    
  - [Properties of Likelihood](#toc1_3_)    
  - [Likelihood Function](#toc1_4_)    
- [Mathematical Formulation of MLE](#toc2_)    
  - [Log-Likelihood Function](#toc2_1_)    
  - [MLE Objective](#toc2_2_)    
  - [Finding the Maximum](#toc2_3_)    
  - [Example: Bernoulli Distribution](#toc2_4_)    
  - [Numerical Methods](#toc2_5_)    
- [Step-by-Step Process of MLE](#toc3_)    
  - [Identify the Probability Distribution](#toc3_1_)    
  - [Write the Probability Function](#toc3_2_)    
  - [Construct the Likelihood Function](#toc3_3_)    
  - [Take the Logarithm](#toc3_4_)    
  - [Find the Maximum](#toc3_5_)    
  - [Solve for the Parameters](#toc3_6_)    
  - [Interpret the Results](#toc3_7_)    
  - [Example: Normal Distribution](#toc3_8_)    
  - [Practical Considerations](#toc3_9_)    
- [MLE in Common Probability Distributions](#toc4_)    
  - [Bernoulli Distribution](#toc4_1_)    
  - [Binomial Distribution](#toc4_2_)    
  - [Normal (Gaussian) Distribution](#toc4_3_)    
  - [Poisson Distribution](#toc4_4_)    
  - [Exponential Distribution](#toc4_5_)    
  - [Uniform Distribution](#toc4_6_)    
  - [Practical Example: Normal Distribution](#toc4_7_)    
  - [Important Considerations](#toc4_8_)    
- [Summary](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Fundamental Concept of Likelihood](#toc0_)

Understanding the concept of likelihood is crucial for grasping Maximum Likelihood Estimation. Let's dive into what likelihood means and how it differs from probability.


Likelihood is a function of the parameters of a statistical model, given some observed data. It measures how well a particular set of parameter values explains the observed data.


🔑 **Key Takeaway**: Likelihood is about parameters, given fixed data.


The likelihood is defined as the probability of observing the data given a specific set of parameter values, viewed as a function of the parameters. Mathematically, for a parameter $\theta$ and observed data $X$, the likelihood $L(\theta|X)$ is proportional to the probability of observing $X$ given $\theta$:

$$L(\theta|X) \propto P(X|\theta)$$


### <a id='toc1_1_'></a>[Likelihood vs. Probability](#toc0_)


While likelihood and probability are related, they have distinct meanings:

1. **Probability**:
   - Describes the chance of observing data, given fixed parameters.
   - Sums/integrates to 1 over all possible data outcomes.

2. **Likelihood**:
   - Describes the plausibility of parameter values, given fixed data.
   - Does not sum/integrate to 1 over parameter values.


💡 **Pro Tip**: Think of likelihood as "reverse probability" – it's about parameters, not data.


### <a id='toc1_2_'></a>[Example: Coin Flipping](#toc0_)


Let's illustrate with a coin-flipping example:

- Parameter: p (probability of heads)
- Observed data: 3 heads out of 5 flips


- Probability: P(3 heads | p = 0.5) = C(5,3) * 0.5³ * 0.5² = 0.3125
- Likelihood: L(p = 0.5 | 3 heads) ∝ 0.5³ * 0.5² = 0.03125


So likelihood is proportional to the probability, but we're now considering p as variable and the data as fixed.
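The two numbers above can be reproduced with a short sketch. The function names `binomial_probability` and `likelihood_kernel` are illustrative choices, not part of any library, and the grid search is just one simple way to see where the likelihood peaks.

```python
from math import comb

heads, flips = 3, 5  # observed data from the example above

def binomial_probability(p):
    """P(3 heads in 5 flips | p), including the binomial coefficient."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

def likelihood_kernel(p):
    """Likelihood of p up to a constant factor (coefficient dropped)."""
    return p**heads * (1 - p)**(flips - heads)

print(binomial_probability(0.5))  # 0.3125
print(likelihood_kernel(0.5))     # 0.03125

# Treat the data as fixed and evaluate the likelihood on a grid of p values.
grid = [i / 100 for i in range(1, 100)]
best_p = max(grid, key=likelihood_kernel)
print(best_p)  # 0.6
```

Dropping the constant `C(5,3)` rescales every likelihood value by the same factor, so the maximizing value of p (here 3/5) is unchanged.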


### <a id='toc1_3_'></a>[Properties of Likelihood](#toc0_)


1. **Non-negative**: Likelihood is always non-negative.
2. **Relative**: Only relative values of likelihood matter, not absolute values.
3. **Not a Probability**: Likelihood doesn't sum or integrate to 1 over parameter space.


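💡 **Note**: Because only relative values matter, constant factors can be dropped without changing the maximizer. The logarithm is strictly increasing, so it also preserves the location of the maximum, which is why we often work with the log-likelihood for computational convenience.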


### <a id='toc1_4_'></a>[Likelihood Function](#toc0_)


The likelihood function is central to MLE. It's a function of the parameters, treating the observed data as fixed:

$$L(\theta) = P(X|\theta)$$


For independent and identically distributed (i.i.d.) observations, the likelihood is the product of individual probabilities:

$$L(\theta) = \prod_{i=1}^n P(x_i|\theta)$$


**IID stands for "Independent and Identically Distributed."** This is a fundamental concept in probability theory and statistics, often used when describing random variables or data points. Let's break it down:

- Independent:
    - This means that the occurrence or value of one event or variable does not affect the probability of another.
    - In other words, knowing the outcome of one event provides no information about the outcome of another event.

- Identically Distributed:
    - This means that all the random variables or data points in a set follow the same probability distribution.
    - They all have the same underlying statistical properties (like mean and variance).

When we say a set of random variables or observations is IID, we mean:
- Each observation is independent of the others.
- All observations come from the same probability distribution.
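To make the product form concrete, here is a minimal sketch for Bernoulli data. The sample `observations` and the helper names are hypothetical and chosen only for illustration.

```python
from math import prod

# Hypothetical i.i.d. coin flips (1 = heads, 0 = tails); illustrative data only.
observations = [1, 0, 1, 1, 0]

def bernoulli_pmf(x, p):
    """P(x | p) for a single Bernoulli observation."""
    return p if x == 1 else 1 - p

def likelihood(p, data):
    """Joint likelihood under i.i.d.: product of per-observation probabilities."""
    return prod(bernoulli_pmf(x, p) for x in data)

# Compare how well different candidate values of p explain the same data.
for p in (0.2, 0.5, 0.6, 0.8):
    print(p, likelihood(p, observations))
```

Independence is what justifies multiplying the individual probabilities, and identical distribution is what lets every factor share the same parameter `p`.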

🚀 **Learning Goal**: Understand how to construct likelihood functions for different probabilistic models.


In MLE, we seek to find the parameter values that maximize this likelihood function. This leads to the parameter estimates that make the observed data most probable under the model.


🔑 **Key Takeaway**: MLE finds the parameters that make the data most likely.


Understanding likelihood is fundamental to grasping how MLE works and why it's such a powerful method in statistics and machine learning. It provides a principled way to connect our models with observed data, forming the basis for parameter estimation and model fitting.

## <a id='toc2_'></a>[Mathematical Formulation of MLE](#toc0_)

The mathematical formulation of Maximum Likelihood Estimation provides a rigorous framework for finding the best parameter estimates given observed data. Let's break down this formulation step by step.


We start with the likelihood function, which expresses the probability of observing the data given the parameters:

$$L(\theta|X) = P(X|\theta)$$


Where:
- $L(\theta|X)$ is the likelihood function
- $\theta$ represents the parameter(s) we want to estimate
- $X$ is the observed data


For independent and identically distributed (i.i.d.) observations, the likelihood is the product of individual probabilities:

$$L(\theta|X) = \prod_{i=1}^n P(x_i|\theta)$$


### <a id='toc2_1_'></a>[Log-Likelihood Function](#toc0_)


In practice, we often work with the log-likelihood function. This is because:
1. It converts products to sums, which is computationally easier to handle.
2. It doesn't change the location of the maximum.


The log-likelihood function is defined as:

$$\ell(\theta|X) = \log L(\theta|X) = \sum_{i=1}^n \log P(x_i|\theta)$$


🔑 **Key Takeaway**: The log-likelihood is often easier to work with mathematically and computationally.
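One computational reason for preferring the log-likelihood is numerical underflow. The sketch below uses a hypothetical setup (2000 observations, each with probability 0.5) to show the raw product collapsing to zero in double precision while the sum of logs stays finite.

```python
import math

# Hypothetical setup: 2000 observations that each have probability 0.5.
probabilities = [0.5] * 2000

# The raw product underflows to exactly 0.0 in double precision.
raw_likelihood = math.prod(probabilities)

# The sum of logs remains an ordinary finite number.
log_likelihood = sum(math.log(q) for q in probabilities)

print(raw_likelihood)  # 0.0
print(log_likelihood)  # about -1386.29
```

Once the product has underflowed, every candidate parameter value looks equally bad, so comparing likelihoods directly becomes impossible; the log-likelihoods can still be compared.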


### <a id='toc2_2_'></a>[MLE Objective](#toc0_)


The goal of MLE is to find the parameter values that maximize the likelihood (or log-likelihood) function:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta|X) = \arg\max_{\theta} \ell(\theta|X)$$

Where $\hat{\theta}_{MLE}$ is the MLE estimate of the parameter(s).


### <a id='toc2_3_'></a>[Finding the Maximum](#toc0_)


To find the maximum, we typically follow these steps:

1. Take the derivative of the log-likelihood with respect to each parameter and set it to zero:

   $$\frac{\partial \ell(\theta|X)}{\partial \theta_j} = 0$$

2. Solve the resulting equation(s), known as the likelihood equations:

   $$\sum_{i=1}^n \frac{\partial \log P(x_i|\theta)}{\partial \theta_j} = 0$$

3. Check the second derivative to ensure it's a maximum, not a minimum (with several parameters, the Hessian matrix of second derivatives should be negative definite):

   $$\frac{\partial^2 \ell(\theta|X)}{\partial \theta_j^2} < 0$$


### <a id='toc2_4_'></a>[Example: Bernoulli Distribution](#toc0_)


Let's apply this to a Bernoulli distribution with parameter $p$:

1. Likelihood: $L(p|X) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}$

2. Log-likelihood: $\ell(p|X) = \sum_{i=1}^n [x_i \log p + (1-x_i) \log (1-p)]$

3. Derivative: $\frac{\partial \ell}{\partial p} = \sum_{i=1}^n [\frac{x_i}{p} - \frac{1-x_i}{1-p}] = 0$

4. Solve: $\hat{p}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i$


This gives us the intuitive result that the MLE estimate for $p$ is the sample proportion of successes.
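The closed-form result can be sanity-checked numerically. In this sketch the binary sample `x` is hypothetical, and the grid search is only a brute-force cross-check, not a recommended estimation method.

```python
import math

# Hypothetical binary sample (7 successes out of 10); illustrative only.
x = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
n = len(x)

def log_likelihood(p):
    """Bernoulli log-likelihood of the sample for a candidate p."""
    return sum(xi * math.log(p) + (1 - xi) * math.log(1 - p) for xi in x)

def score(p):
    """Derivative of the log-likelihood with respect to p."""
    return sum(xi / p - (1 - xi) / (1 - p) for xi in x)

p_hat = sum(x) / n  # closed-form MLE: the sample proportion

# Brute-force check on a fine grid of candidate values.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=log_likelihood)

print(p_hat, p_grid, score(p_hat))
```

The grid maximizer agrees with the sample proportion, and the derivative evaluated at `p_hat` is zero up to floating-point rounding.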


### <a id='toc2_5_'></a>[Numerical Methods](#toc0_)


For more complex models, closed-form solutions may not exist. In such cases, numerical optimization methods like gradient ascent (equivalently, gradient descent on the negative log-likelihood) or Newton-Raphson are used to find the MLE estimates.
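As a rough illustration of Newton-Raphson, the sketch below applies it to the rate of an exponential distribution. This model does have a closed-form MLE, and it is chosen deliberately so the iterative answer can be checked. The data, the starting guess, and the stopping tolerance are all assumptions made for the example.

```python
# Hypothetical waiting times, assumed exponentially distributed.
x = [0.5, 1.2, 0.3, 2.1, 0.9]
n, total = len(x), sum(x)

def score(lam):
    """First derivative of l(lam) = n*log(lam) - lam*sum(x)."""
    return n / lam - total

def curvature(lam):
    """Second derivative of the log-likelihood (always negative here)."""
    return -n / lam**2

lam = 0.5  # starting guess (an assumption; Newton's method needs a reasonable start)
for _ in range(50):
    step = score(lam) / curvature(lam)
    lam -= step  # Newton-Raphson update: lam_new = lam - l'(lam) / l''(lam)
    if abs(step) < 1e-12:
        break

print(lam, n / total)  # Newton-Raphson result vs. closed form n / sum(x)
```

For models without a closed form the same loop applies, but convergence then depends on the starting point and on the shape of the log-likelihood.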


Understanding this mathematical formulation is crucial for applying MLE in various contexts and for grasping more advanced statistical and machine learning concepts that build upon these foundations.

## <a id='toc3_'></a>[Step-by-Step Process of MLE](#toc0_)

Applying Maximum Likelihood Estimation involves a systematic process. Let's break it down into clear, actionable steps that you can follow for various estimation problems.


### <a id='toc3_1_'></a>[Identify the Probability Distribution](#toc0_)


First, determine the probability distribution that best describes your data. This could be based on:
- The nature of your data (e.g., binary, count, continuous)
- Domain knowledge or theoretical considerations
- Exploratory data analysis


Example: For binary outcomes, you might choose a Bernoulli distribution.


### <a id='toc3_2_'></a>[Write the Probability Function](#toc0_)


Express the probability of observing a single data point given the parameters:

$$P(x_i|\theta)$$


Where $x_i$ is a single observation and $\theta$ represents the parameter(s) to be estimated.


### <a id='toc3_3_'></a>[Construct the Likelihood Function](#toc0_)


For independent observations, the likelihood is the product of individual probabilities:

$$L(\theta|X) = \prod_{i=1}^n P(x_i|\theta)$$


### <a id='toc3_4_'></a>[Take the Logarithm](#toc0_)


Convert the likelihood to log-likelihood for computational ease:

$$\ell(\theta|X) = \log L(\theta|X) = \sum_{i=1}^n \log P(x_i|\theta)$$


🔑 **Key Takeaway**: The log-likelihood converts products to sums, simplifying calculations.


### <a id='toc3_5_'></a>[Find the Maximum](#toc0_)


To find the parameter values that maximize the log-likelihood:

a) Take the derivative with respect to each parameter and set it to zero:

   $$\frac{\partial \ell(\theta|X)}{\partial \theta_j} = 0$$

b) Solve the resulting equation(s):

   $$\sum_{i=1}^n \frac{\partial \log P(x_i|\theta)}{\partial \theta_j} = 0$$

c) Check the second derivative to ensure it's a maximum:

   $$\frac{\partial^2 \ell(\theta|X)}{\partial \theta_j^2} < 0$$


### <a id='toc3_6_'></a>[Solve for the Parameters](#toc0_)


Depending on the complexity of the equations:
- For simple cases, solve analytically.
- For complex cases, use numerical methods like gradient ascent (or gradient descent on the negative log-likelihood) or Newton-Raphson.


### <a id='toc3_7_'></a>[Interpret the Results](#toc0_)


Once you have the MLE estimates, interpret them in the context of your problem.


### <a id='toc3_8_'></a>[Example: Normal Distribution](#toc0_)


Let's apply this process to estimating the mean (μ) and variance (σ²) of a normal distribution:

1. Probability function:
   $$P(x_i|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

2. Log-likelihood:
   $$\ell(\mu,\sigma^2|X) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2$$

3. Derivatives:
   $$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = 0$$
   $$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i-\mu)^2 = 0$$

4. Solve:
   $$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i$$
   $$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i-\hat{\mu})^2$$


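This process gives us the sample mean and the sample variance computed with a $1/n$ divisor as the MLE estimates for a normal distribution. Note that this variance estimate is biased in small samples; the familiar unbiased estimator divides by $n-1$ instead.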
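As a quick numerical check of these closed-form results, the sketch below computes both estimates by hand and compares the variance with Python's `statistics` module. The measurements in `x` are hypothetical.

```python
import statistics

# Hypothetical measurements; illustrative only.
x = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0]
n = len(x)

mu_hat = sum(x) / n
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # MLE divides by n

print(mu_hat, sigma2_hat)
print(statistics.pvariance(x))  # population variance: same 1/n formula
print(statistics.variance(x))   # sample variance: divides by n - 1 instead
```

The MLE matches `statistics.pvariance`, while `statistics.variance` is larger by a factor of n / (n - 1).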


### <a id='toc3_9_'></a>[Practical Considerations](#toc0_)


- For complex models, software packages often implement MLE algorithms.
- In some cases, constraints on parameters may require constrained optimization techniques.
- Always check for global vs. local maxima, especially in multi-parameter problems.
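One common way to handle a parameter constraint is to reparameterize so the optimizer works on an unconstrained quantity. The sketch below is an illustration under stated assumptions rather than a general recipe: it estimates a Bernoulli probability by gradient ascent on `z`, with `p = sigmoid(z)` guaranteed to stay in (0, 1). The data, the learning rate, and the iteration count are hypothetical choices.

```python
import math

x = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # hypothetical binary sample
n, k = len(x), sum(x)

def sigmoid(z):
    """Map any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

z = 0.0  # unconstrained parameter; p = sigmoid(z) always lies in (0, 1)
learning_rate = 1.0
for _ in range(200):
    p = sigmoid(z)
    gradient = (k - n * p) / n     # d/dz of the average log-likelihood
    z += learning_rate * gradient  # ascent step: move uphill

p_hat = sigmoid(z)
print(p_hat, k / n)  # iterative estimate vs. the sample proportion
```

Because the Bernoulli log-likelihood has a single maximum in `z`, the starting point does not matter here; in multi-modal problems it is safer to run several starts and compare the resulting log-likelihood values.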


By following this step-by-step process, you can apply MLE to a wide range of statistical and machine learning problems, from simple distribution fitting to complex model parameter estimation.

## <a id='toc4_'></a>[MLE in Common Probability Distributions](#toc0_)

Understanding how to apply MLE to common probability distributions is crucial for practical applications in statistics and machine learning. Let's explore MLE for several key distributions.


### <a id='toc4_1_'></a>[Bernoulli Distribution](#toc0_)


Used for binary outcomes (success/failure).

- Parameter: p (probability of success)
- Probability mass function: P(X = x) = p^x * (1-p)^(1-x), x ∈ {0,1}


MLE solution:
$$\hat{p}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i$$


🔑 **Key Takeaway**: The MLE for p is simply the sample proportion of successes.


### <a id='toc4_2_'></a>[Binomial Distribution](#toc0_)


Extension of Bernoulli for n independent trials.

- Parameters: n (number of trials, assumed known here), p (probability of success)
- Probability mass function: P(X = k) = C(n,k) * p^k * (1-p)^(n-k)


MLE solution:
$$\hat{p}_{MLE} = \frac{\sum_{i=1}^m k_i}{nm}$$
where $m$ is the number of observations and $k_i$ is the number of successes in the $i$-th observation.
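The formula amounts to total successes divided by total trials. A minimal sketch, assuming a known number of trials per observation and hypothetical success counts, with a grid search as a cross-check:

```python
import math

n_trials = 10                # trials per observation, assumed known
successes = [3, 5, 4, 6, 2]  # hypothetical k_i values
m = len(successes)

def log_likelihood(p):
    """Sum of binomial log-probabilities over the m observations."""
    return sum(
        math.log(math.comb(n_trials, k))
        + k * math.log(p)
        + (n_trials - k) * math.log(1 - p)
        for k in successes
    )

p_hat = sum(successes) / (n_trials * m)  # total successes / total trials

# Brute-force check on a grid of candidate values.
grid = [i / 100 for i in range(1, 100)]
p_grid = max(grid, key=log_likelihood)

print(p_hat, p_grid)
```

The binomial coefficients do not depend on p, so they shift the log-likelihood by a constant and have no effect on the maximizer.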


### <a id='toc4_3_'></a>[Normal (Gaussian) Distribution](#toc0_)


Fundamental for continuous data.

- Parameters: μ (mean), σ² (variance)
- Probability density function: f(x) = (1 / √(2πσ²)) * e^(-(x-μ)²/(2σ²))


MLE solutions:
$$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i$$
$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i-\hat{\mu})^2$$


### <a id='toc4_4_'></a>[Poisson Distribution](#toc0_)


Used for count data.

- Parameter: λ (rate)
- Probability mass function: P(X = k) = (e^(-λ) * λ^k) / k!


MLE solution:
$$\hat{\lambda}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i$$
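For reference, here is a sketch of the derivation behind this result, following the same steps as the Bernoulli example (here $x_i$ denotes the observed counts):

$$\ell(\lambda|X) = \sum_{i=1}^n \left[ x_i \log \lambda - \lambda - \log(x_i!) \right] = \log \lambda \sum_{i=1}^n x_i - n\lambda - \sum_{i=1}^n \log(x_i!)$$

$$\frac{\partial \ell}{\partial \lambda} = \frac{1}{\lambda}\sum_{i=1}^n x_i - n = 0 \quad\Rightarrow\quad \hat{\lambda}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i$$

$$\frac{\partial^2 \ell}{\partial \lambda^2} = -\frac{1}{\lambda^2}\sum_{i=1}^n x_i < 0 \quad \text{(whenever at least one count is positive)}$$

The term $\sum_{i=1}^n \log(x_i!)$ does not involve $\lambda$, so it drops out when differentiating.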


### <a id='toc4_5_'></a>[Exponential Distribution](#toc0_)


Often used for modeling time between events.

- Parameter: λ (rate)
- Probability density function: f(x) = λe^(-λx), x ≥ 0


MLE solution:
$$\hat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^n x_i}$$
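The estimate is the reciprocal of the sample mean. The sketch below, using hypothetical waiting times, evaluates the log-likelihood $\ell(\lambda) = n \log \lambda - \lambda \sum x_i$ at the closed-form estimate and at two other rates to confirm that the estimate scores highest.

```python
import math

x = [0.8, 1.5, 0.4, 2.3, 1.0]  # hypothetical waiting times
n = len(x)

def log_likelihood(lam):
    """Exponential log-likelihood: n*log(lam) - lam*sum(x)."""
    return n * math.log(lam) - lam * sum(x)

lam_hat = n / sum(x)  # closed-form MLE: reciprocal of the sample mean
print(lam_hat)

# The log-likelihood drops when the rate is halved or doubled.
for lam in (0.5 * lam_hat, lam_hat, 2.0 * lam_hat):
    print(lam, log_likelihood(lam))
```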


### <a id='toc4_6_'></a>[Uniform Distribution](#toc0_)


For data equally likely to occur in an interval [a,b].

- Parameters: a (minimum), b (maximum)
- Probability density function: f(x) = 1/(b-a), a ≤ x ≤ b


MLE solutions:
$$\hat{a}_{MLE} = \min(x_1, ..., x_n)$$
$$\hat{b}_{MLE} = \max(x_1, ..., x_n)$$
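Unlike the other distributions, this result does not come from setting a derivative to zero. The likelihood is $(b-a)^{-n}$ when every observation lies in $[a,b]$ and zero otherwise, so it is maximized by the narrowest interval that still covers the data. A small sketch with hypothetical observations:

```python
import math

x = [2.7, 3.9, 3.1, 4.6, 3.3]  # hypothetical observations

def log_likelihood(a, b):
    """Uniform[a, b] log-likelihood; -inf if any point falls outside [a, b]."""
    if a >= b or min(x) < a or max(x) > b:
        return -math.inf
    return -len(x) * math.log(b - a)

a_hat, b_hat = min(x), max(x)

print(log_likelihood(a_hat, b_hat))              # tightest interval covering the data
print(log_likelihood(a_hat - 0.5, b_hat + 0.5))  # wider interval: lower likelihood
print(log_likelihood(a_hat + 0.1, b_hat))        # excludes a data point: -inf
```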


### <a id='toc4_7_'></a>[Practical Example: Normal Distribution](#toc0_)


Let's work through a small example for the normal distribution:

Given data: 2, 3, 5, 7, 11

1. Calculate $\hat{\mu}_{MLE}$:
   $$\hat{\mu}_{MLE} = \frac{2 + 3 + 5 + 7 + 11}{5} = 5.6$$

2. Calculate $\hat{\sigma}^2_{MLE}$:
   $$\hat{\sigma}^2_{MLE} = \frac{(2-5.6)^2 + (3-5.6)^2 + (5-5.6)^2 + (7-5.6)^2 + (11-5.6)^2}{5} = \frac{51.2}{5} = 10.24$$


Therefore, the MLE estimates for this data are μ = 5.6 and σ² = 10.24.
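The arithmetic in this example can be reproduced in a few lines, which is a useful habit for catching slips in hand calculations. The variable names are illustrative.

```python
data = [2, 3, 5, 7, 11]  # the data from the example above
n = len(data)

mu_hat = sum(data) / n

# Squared deviations from the estimated mean, then the 1/n average (MLE).
squared_deviations = [(x - mu_hat) ** 2 for x in data]
sigma2_hat = sum(squared_deviations) / n

print(mu_hat)
print(round(sigma2_hat, 4))
```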


### <a id='toc4_8_'></a>[Important Considerations](#toc0_)


1. **Sample Size**: MLE estimates can be biased or highly variable in small samples and become more reliable as the sample size grows.
2. **Computational Aspects**: For some distributions, numerical methods may be needed to find MLEs.
3. **Assumptions**: Ensure your data reasonably fits the assumed distribution.
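The sample-size point can be illustrated with a small simulation. This is a sketch under stated assumptions: the "true" parameter `true_p`, the seed, and the sample sizes are arbitrary choices made only to generate data.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
true_p = 0.3    # hypothetical "true" parameter used only to generate data

errors = {}
for n in (10, 1_000, 100_000):
    # Draw n Bernoulli(true_p) observations and compute the MLE.
    sample = [1 if random.random() < true_p else 0 for _ in range(n)]
    p_hat = sum(sample) / n
    errors[n] = abs(p_hat - true_p)
    print(n, p_hat, errors[n])
```

The error typically shrinks as n grows, roughly in proportion to $1/\sqrt{n}$, although any single small sample can land close to the truth by chance.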


Understanding how to apply MLE to these common distributions provides a solid foundation for more complex modeling tasks in statistics and machine learning. It allows you to estimate parameters efficiently and make informed decisions based on your data's underlying probabilistic structure.

## <a id='toc5_'></a>[Summary](#toc0_)

As we conclude our exploration of Maximum Likelihood Estimation, let's recap the main points and highlight the key takeaways from this lecture. Understanding these concepts will help you apply MLE effectively in your data analysis and machine learning projects.

1. **Concept of Likelihood**
   - Likelihood measures how well parameters explain observed data.
   - It's different from probability: likelihood is about parameters given fixed data.

2. **Mathematical Formulation**
   - MLE finds parameters that maximize the likelihood function.
   - We often work with log-likelihood for computational convenience.

3. **Step-by-Step Process**
   - Identify the probability distribution.
   - Construct the likelihood function.
   - Take the logarithm.
   - Find the maximum through differentiation or numerical methods.

4. **Properties of MLE** (under standard regularity conditions)
   - Consistency: Converges to true parameter values as sample size increases.
   - Asymptotic normality: Estimates are approximately normally distributed for large samples.
   - Efficiency: Achieves the Cramér-Rao lower bound asymptotically.

5. **Applications in Machine Learning**
   - Fundamental to many algorithms, including linear and logistic regression.
   - Forms the basis for more advanced techniques like the EM algorithm.


Understanding MLE lays a strong foundation for more advanced topics in statistical learning and machine learning. It will help you in:

- Grasping more complex estimation techniques.
- Understanding the principles behind many machine learning algorithms.
- Developing your own models and estimation procedures.


🚀 **Final Thought**: MLE is not just a technique, but a fundamental principle in statistical inference. Mastering it will significantly enhance your ability to work with data and build effective models in various domains of data science and machine learning.


By internalizing these concepts and practices, you're well-equipped to apply MLE in your work and to delve deeper into the world of statistical estimation and machine learning algorithms.