$$\newcommand{\F}{\mathbb{F}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\v}{\mathbf{v}}
\newcommand{\a}{\mathbf{a}}
\newcommand{\b}{\mathbf{b}}
\newcommand{\c}{\mathbf{c}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\u}{\mathbf{u}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\z}{\mathbf{z}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\1}{\mathbf{1}}
\newcommand{\A}{\mathbf{A}}
\newcommand{\B}{\mathbf{B}}
\newcommand{\rank}{\textbf{rank}}
\newcommand{\P}{\mathcal{P}}
\newcommand{\C}{\mathbf{C}}$$

Density estimation is the problem of estimating the probability distribution for a sample of
observations from a problem domain. There are many techniques for solving density estimation,
although a common framework used throughout the field of machine learning is maximum
likelihood estimation. Maximum likelihood estimation involves defining a likelihood function
for calculating the conditional probability of observing the data sample given a probability
distribution and distribution parameters. This approach can be used to search a space of possible
distributions and parameters. This flexible probabilistic framework also provides the foundation
for many machine learning algorithms, including important methods such as linear regression
and logistic regression for predicting numeric values and class labels respectively, but also more
generally for deep learning artificial neural networks. In this tutorial, you will discover a gentle
introduction to maximum likelihood estimation. After reading this tutorial, you will know:

- Maximum Likelihood Estimation is a probabilistic framework for solving the problem of
density estimation.
- It involves maximizing a likelihood function in order to find the probability distribution
and parameters that best explain the observed data.
- It provides a framework for predictive modeling in machine learning where finding model
parameters can be framed as an optimization problem.

Let’s get started.


Probability for Machine Learning - Jason Brownlee

### Intuition for Probability Distribution Function



First, you need to be very clear about what a **probability distribution** is. Suppose we have 10 students and we model their marks on a test where the maximum mark is 16.

- Define a random variable $X$ that represents the mark of a student.
- Assume further that this **random variable** $X$ follows a **normal distribution** with $\mu = 11$ and $\sigma = 3$. Can we find the probability distribution for the marks of the whole cohort (10 students)? Yes, because we have the parameters of the distribution. If you are not sure what a parameter is, please review it first[^statistical_parameter]; it is very important to understand that the **parameters** determine the probability distribution of a model.
- Recall that the general formula for the **PDF** of the normal distribution is

    $$f(X = x) = \dfrac{e^{-(x-\mu)^2/(2\sigma^2)}}{\sigma \sqrt{2\pi}}$$
    
- For a normal distribution, once we have the mean and standard deviation, we can recover the whole PDF, hence the mean and standard deviation are our parameters. So if we want to find $P(11 < X < 13~|~\mu = 11, \sigma = 3)$, we can easily compute it to be around $0.25 = 25\%$. We can find any such probability **as long as we are given the parameters**. So, **we must have the correct mindset that** probability density functions (and pmfs alike) are legitimate functions that take in any $X = x$ and output how probable that $x$ is. (Strictly speaking, for a continuous distribution the PDF returns a density and probabilities only come from ranges of $x$, but for the purpose of intuition we do not need to be so pedantic.)
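
The interval probability in the last bullet can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `normal_cdf` is introduced here for illustration, and it relies on the standard identity that expresses the normal CDF through the error function.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Parameters from the running example: mu = 11, sigma = 3.
mu, sigma = 11.0, 3.0

# P(11 < X < 13) is the difference of the CDF at the two endpoints.
prob = normal_cdf(13.0, mu, sigma) - normal_cdf(11.0, mu, sigma)
print(f"P(11 < X < 13 | mu=11, sigma=3) = {prob:.4f}")  # approximately 0.2475
```

Changing `mu` or `sigma` changes the answer, which is exactly the point: the parameters pin down every probability we might ask about.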

---

For readers who prefer a more formal statement, here is a more precise version of what I described above:

Suppose a random variable $X$ arises from a parameterized distribution $\P(X; \theta)$, where $\theta$ is the parameter characterizing the distribution $\P$. Then, with $\theta$ known, the probability of $X = x$ is $P(X = x~|~\theta) = \P(x; \theta)$.

[^statistical_parameter]: https://en.wikipedia.org/wiki/Statistical_parameter

### Likelihood Function

However, in a real-world setting, more often than not we only have the data $X$. For example, we conveniently have the scores of the 10 students above, which **could be a random sample taken** from the whole school's population. Now we are tasked with finding the probability distribution of the whole population (say 10,000 students), and we could have calculated it easily **if we knew what the parameters were!** Unfortunately, we do not have the true parameters.

Our main motivation now is to find the **parameter**, because without it we cannot complete the task of finding the distribution of the population. We can never know the true parameter $\theta = (\mu, \sigma)$, but we can obtain a good estimate of it by making use of the data that we do have! In this scenario we are given 10 data points (in real life there are usually many more), say $$\mathbf{X} = [3,9,4,10,12,16,5,11,9,9]$$

So we do a sleight of hand using our original **probability density function**, $P(X = x~|~ \theta)$. Instead of treating it as a function of $X = x$ where $\theta$ is known, we let $X = x$ be fixed and let $\theta$ be the variable. The idea is that this function is now **NO LONGER a function of** $X=x$; it is instead a function of $\theta$, which takes in any possible value of $\theta$ and outputs a value called the **likelihood value**. So now, more formally, our new function looks like $$P(\mathbf{X} = [3,9,4,10,12,16,5,11,9,9]~|~ \theta)$$ and it answers the question: **what is the probability of OBSERVING these data points, given different values of $\theta$?** Plotting the likelihood against $\theta$ gives a better idea of its shape (see the Wikipedia article on the likelihood function).
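
To see the "fix the data, vary $\theta$" idea in numbers, here is a small sketch that evaluates the likelihood of the 10 marks for a few candidate values of $\mu$. It rests on assumptions that are not stated above: the marks are treated as independent draws from one normal distribution (so the joint density is a product of individual densities), $\sigma$ is held at 3 for simplicity, and the candidate values and helper names are chosen purely for illustration.

```python
import math

# The 10 observed marks from the text; these stay fixed throughout.
data = [3, 9, 4, 10, 12, 16, 5, 11, 9, 9]

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(mu, sigma, sample):
    """Likelihood of theta = (mu, sigma), assuming independent observations."""
    value = 1.0
    for x in sample:
        value *= normal_pdf(x, mu, sigma)
    return value

# Illustrative candidates for mu (8.8 is the sample mean); sigma is fixed at 3.
candidate_mus = [5.0, 8.8, 11.0, 14.0]
likelihoods = {m: likelihood(m, 3.0, data) for m in candidate_mus}

for m, value in likelihoods.items():
    print(f"mu = {m:5.1f}  ->  likelihood = {value:.3e}")
```

The individual values are tiny because ten densities are multiplied together; what matters is how they compare across candidate values of $\mu$.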


    

So imagine that our function (likelihood value plotted against the parameter) has a maximum; that maximum is what we are ultimately looking for. It is reasonable to believe that the **parameter** that maximizes $P(\mathbf{X} = [3,9,4,10,12,16,5,11,9,9]~|~ \theta)$ is the best explanation of the data: **given these 10 data points**, this $\theta$ gives the highest **likelihood/probability** that these 10 points are actually observed.
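
The search for that maximum can be sketched with a brute-force grid search. This is only an illustration, not the standard way to fit a normal distribution: it assumes independent normal observations, works with the log of the likelihood (the logarithm is increasing, so the maximizer is unchanged, and it avoids multiplying many tiny numbers), and the grid ranges and step size of 0.1 are arbitrary choices. For comparison, the sketch also computes the well-known closed-form maximizers for the normal case: the sample mean and the standard deviation with divisor $n$.

```python
import math

data = [3, 9, 4, 10, 12, 16, 5, 11, 9, 9]

def log_likelihood(mu, sigma, sample):
    """Log-likelihood of (mu, sigma) for independent normal observations."""
    n = len(sample)
    squared_error = sum((x - mu) ** 2 for x in sample)
    return (-n * math.log(sigma)
            - 0.5 * n * math.log(2 * math.pi)
            - squared_error / (2 * sigma ** 2))

# Arbitrary illustrative grids: mu in [0, 16], sigma in [0.5, 8], step 0.1.
mu_grid = [i * 0.1 for i in range(161)]
sigma_grid = [0.5 + j * 0.1 for j in range(76)]

# Pick the grid point with the largest log-likelihood.
best_mu, best_sigma = max(
    ((m, s) for m in mu_grid for s in sigma_grid),
    key=lambda theta: log_likelihood(theta[0], theta[1], data),
)

# Closed-form maximizers for the normal model, for comparison.
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))

print(f"grid search : mu = {best_mu:.1f}, sigma = {best_sigma:.1f}")
print(f"closed form : mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
```

The grid result can only be as precise as the grid spacing, so it should land within one step of the closed-form values.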


We formally define this function to be $$\mathcal{L}(\theta~|~ X = x) = P(X = x~|~\theta) $$
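
For the running example, this definition can be written out explicitly. The following is a sketch under an assumption not stated above, namely that the $n$ observations $x_1, \dots, x_n$ are independent draws from the same normal distribution, so that the joint density factorizes into a product of normal PDFs:

$$\mathcal{L}(\mu, \sigma~|~x_1, \dots, x_n) = \prod_{i=1}^{n} \dfrac{e^{-(x_i-\mu)^2/(2\sigma^2)}}{\sigma \sqrt{2\pi}}$$

Taking logarithms turns the product into a sum, which is usually easier to work with and has the same maximizer:

$$\log \mathcal{L}(\mu, \sigma~|~x_1, \dots, x_n) = -n \log \sigma - \dfrac{n}{2}\log(2\pi) - \dfrac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$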


I cannot emphasize enough that even though the likelihood function $\mathcal{L}$ and the probability function $P$ have exactly the same form, they are fundamentally different: one is a function of the parameter $\theta$, and the other is a function of the data points $X = x$.
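
One way to make this distinction tangible is to wrap the same formula in two different functions. In the sketch below, the names `density` and `likelihood`, the single observation $x = 9$, and the fixed $\sigma = 3$ are all illustrative choices, not part of the definitions above.

```python
import math

def normal_pdf(x, mu, sigma):
    """The shared formula: normal density at x for parameters (mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def density(x):
    """View 1: parameters fixed at mu = 11, sigma = 3; the input is the data x."""
    return normal_pdf(x, mu=11.0, sigma=3.0)

def likelihood(mu):
    """View 2: data fixed at x = 9 (sigma = 3); the input is the parameter mu."""
    return normal_pdf(9.0, mu=mu, sigma=3.0)

# Both views agree when they describe the same (x, mu, sigma) triple...
print(density(9.0) == likelihood(11.0))

# ...but they answer different questions when their input changes.
print(f"density at x = 13, with mu = 11 known : {density(13.0):.4f}")
print(f"likelihood of mu = 9, with x = 9 seen : {likelihood(9.0):.4f}")
```

Same arithmetic, different variable: `density` sweeps over possible data for a known $\theta$, while `likelihood` sweeps over possible $\theta$ for observed data.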