# Introduction
The Gaussian distribution, also known as the normal distribution, is a fundamental concept in statistics and machine learning.

### The bell curve
Imagine a line representing all possible values of some variable, such as height, weight, or examination scores. The height of the curve above each point indicates how likely values near that point are. The peak of the bell curve sits at the mean, which is the most likely value. Values further away from the mean become progressively less likely, and the decline is symmetrical on both sides.

### Data with 2 peaks (bimodal datasets)
Two peaks suggest that there are two distinct sub-populations within the dataset. For example, if a dataset contains the age information of a population, one peak may represent the younger population and the other peak the older population.
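As a rough illustration of the age example, the sketch below builds a two-peaked curve by adding two bell curves together and then locates the peaks on a grid of ages. The means (25 and 65), the shared standard deviation (5), and the equal weighting are hypothetical values chosen only for this example.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def bimodal_density(age):
    # Hypothetical sub-populations: younger (mean 25) and older (mean 65),
    # equally weighted, each with a standard deviation of 5 years.
    return 0.5 * gaussian_pdf(age, 25.0, 5.0) + 0.5 * gaussian_pdf(age, 65.0, 5.0)

ages = list(range(0, 101))
density = [bimodal_density(a) for a in ages]

# A peak is a grid point whose density exceeds both of its neighbours.
peaks = [
    ages[i]
    for i in range(1, len(ages) - 1)
    if density[i] > density[i - 1] and density[i] > density[i + 1]
]
print(peaks)
```

With these assumed parameters the two peaks sit at the two sub-population means, and the dip between them marks the ages that are uncommon in both groups.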

### Probability density function (PDF)
The PDF, denoted by $f(x)$, mathematically describes the shape of the Gaussian curve. It gives the relative likelihood of encountering a value $x$, given the mean $\mu$ and the standard deviation $\sigma$ of the distribution. The formula is:

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

Where:
- $x$ = Value at which the density is evaluated.
- $\mu$ = Mean of the distribution, representing the most likely value.
- $\sigma$ = Standard deviation, which controls the spread of the curve. A smaller $\sigma$ leads to a narrower, taller curve with probability concentrated around the mean. A larger $\sigma$ results in a wider, flatter curve with more probability spread further from the mean.
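The formula translates almost directly into code. The following is a minimal sketch using only Python's standard library; the function name `gaussian_pdf` and the example values of $\mu$ and $\sigma$ are illustrative choices, not part of the original text.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Evaluate f(x) for a Gaussian with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Density at the mean for three spreads (mu = 0 in each case).
peak_narrow = gaussian_pdf(0.0, 0.0, 0.5)    # small sigma: tall, narrow curve
peak_standard = gaussian_pdf(0.0, 0.0, 1.0)  # standard normal
peak_wide = gaussian_pdf(0.0, 0.0, 2.0)      # large sigma: low, wide curve

print(peak_narrow, peak_standard, peak_wide)
```

Comparing the three values shows the effect of $\sigma$ described above: the peak height shrinks as the spread grows, because the total area under the curve must stay equal to 1.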

NOTE: Continuous PDFs like the Gaussian do not assign probabilities to specific points; the probability of any single exact value is zero. Instead, probabilities are obtained for intervals, as areas under the curve.
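To make the note concrete, the area under a Gaussian curve between two values can be computed from the cumulative distribution function, which Python's standard library exposes indirectly through `math.erf`. This is a sketch under assumed values: the height distribution (mean 170, standard deviation 10) is hypothetical.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Probability that a Gaussian variable is less than or equal to x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_probability(a, b, mu, sigma):
    """Area under the Gaussian curve between a and b (with a <= b)."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)

# Hypothetical heights in cm: mean 170, standard deviation 10.
mu, sigma = 170.0, 10.0

p_exact_point = interval_probability(170.0, 170.0, mu, sigma)  # zero-width interval
p_one_sigma = interval_probability(160.0, 180.0, mu, sigma)    # within one sigma

print(p_exact_point)          # 0.0
print(round(p_one_sigma, 4))  # 0.6827
```

The zero-width interval has probability 0 even though the density at the mean is at its highest, while the interval within one standard deviation of the mean captures roughly 68% of the probability.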

### Multi-modal curves
Data can exhibit even more complex structures. A multi-modal curve has multiple peaks, each representing a distinct cluster or sub-population within the dataset. Imagine a distribution of customer income with one peak for low-income earners, another for middle-income earners, and a third for high-income earners.

### Inferring probabilities for complex datasets
In multi-modal distributions, the regions between the peaks contain points with non-zero probabilities of belonging to more than one cluster. For such points, the probability of belonging to each sub-population can be calculated. Suppose a data point has a likelihood of 0.2 under sub-population 1 and a likelihood of 0.6 under sub-population 2. Normalizing these values (dividing each by their sum, 0.8) gives the final probabilities: a 25% chance of belonging to sub-population 1 and a 75% chance of belonging to sub-population 2.
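The normalization step in this example is small enough to check by hand, but a short sketch shows how it generalizes to any number of sub-populations. The helper name `normalize` is a hypothetical choice for illustration.

```python
def normalize(scores):
    """Divide each score by the total so the results sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]

# Values from the example: 0.2 for sub-population 1, 0.6 for sub-population 2.
memberships = normalize([0.2, 0.6])
print([round(m, 2) for m in memberships])
```

The same function works unchanged for three or more sub-populations, such as the three income groups mentioned earlier.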

# Idea Behind Gaussian Mixture Models (GMMs)
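In the real world, data rarely falls into neat, well-separated categories. GMMs provide a softer approach to clustering: each data point receives a probability of belonging to each category rather than a single label. Because each cluster is described by its own Gaussian, GMMs can also handle complex, non-spherical clusters that K-Means might struggle with.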

### The core idea: A mixture of Gaussians
Imagine a dataset with 2 distinct groups: onions (represented by crosses) and potatoes (represented by triangles). GMMs assume that this data can be modeled as a mixture of 2 Gaussian distributions. Each Gaussian distribution represents one cluster (onions or potatoes) and has its own:
1. Mean ($\mu$): The centre or most likely value within the cluster.
2. Standard deviation ($\sigma$): The spread of the data points around the mean, controlling how tightly or loosely the data points are grouped.
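One way to picture the mixture assumption is to generate data from it: pick a cluster, then draw a value from that cluster's Gaussian. The sketch below does this with the standard library. The cluster parameters (weights in grams) and the equal chance of picking either cluster are assumptions made purely for illustration.

```python
import random

# Hypothetical cluster parameters (weight in grams); not taken from real data.
CLUSTERS = {
    "onion": {"mu": 100.0, "sigma": 15.0},
    "potato": {"mu": 200.0, "sigma": 30.0},
}

def sample_mixture(n, rng):
    """Draw n labelled values from the two-Gaussian mixture."""
    names = list(CLUSTERS)
    samples = []
    for _ in range(n):
        label = rng.choice(names)  # assumed: both clusters equally likely
        params = CLUSTERS[label]
        samples.append((label, rng.gauss(params["mu"], params["sigma"])))
    return samples

rng = random.Random(0)  # fixed seed so the run is repeatable
data = sample_mixture(2000, rng)

onion_values = [value for label, value in data if label == "onion"]
potato_values = [value for label, value in data if label == "potato"]
onion_mean = sum(onion_values) / len(onion_values)
potato_mean = sum(potato_values) / len(potato_values)
print(round(onion_mean, 1), round(potato_mean, 1))
```

The sample means land close to the assumed $\mu$ of each cluster. A GMM works in the opposite direction: it starts from unlabelled values like these and estimates each cluster's $\mu$ and $\sigma$.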

### The advantage
Unlike K-Means, which assigns each data point to a single, "hard" cluster, GMMs provide a softer approach. For a new data point, GMMs calculate the probability of it belonging to each Gaussian distribution in the mixture. This accommodates data points that fall on the border between clusters, reflecting the natural messiness of real-world data.

Once the parameters ($\mu$ and $\sigma$) of each Gaussian distribution in the mixture are determined, a wealth of information about the underlying data becomes available:
- Ranges: The likely range of values within each cluster can be identified.
- Probabilities: The probability of a new data point belonging to a specific cluster or falling within a particular range can be calculated.

This is the advantage of the Gaussian distribution: with just the mean and standard deviation, the entire distribution's behavior can be understood.
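Both kinds of information listed above can be read off a single cluster's mean and standard deviation. The sketch below uses `statistics.NormalDist` from Python's standard library (available since Python 3.8). The onion parameters are hypothetical, and the choice of a central 95% range is an assumption about what "likely range" means here.

```python
from statistics import NormalDist

# Hypothetical onion cluster: mean 100 g, standard deviation 15 g.
onions = NormalDist(mu=100.0, sigma=15.0)

# Ranges: the central interval that contains 95% of the cluster's values.
low = onions.inv_cdf(0.025)
high = onions.inv_cdf(0.975)

# Probabilities: chance that an onion weighs more than 130 g.
p_heavy = 1.0 - onions.cdf(130.0)

print(round(low, 1), round(high, 1))
print(round(p_heavy, 4))
```

No data points are needed for either calculation; the two parameters alone determine the result.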

### Making decisions with soft probabilities
In GMMs, instead of a hard assignment (either onion or potato), probabilities are obtained. Suppose a new data point has a 70% chance of belonging to cluster A (onions) and a 30% chance of belonging to cluster B (potatoes). Based on these probabilities, it can be inferred that the point is more likely to belong to cluster A (onions).
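A minimal sketch of how such soft probabilities might be computed and then turned into a decision: evaluate each cluster's density at the new point, normalize, and pick the largest. The cluster parameters and the assumption that both clusters are equally common are hypothetical, and the function name `membership_probabilities` is an illustrative choice.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def membership_probabilities(x, clusters):
    """Soft assignment: normalized density of x under each cluster."""
    # Assumes every cluster is equally common (equal mixing proportions).
    densities = {
        name: gaussian_pdf(x, mu, sigma) for name, (mu, sigma) in clusters.items()
    }
    total = sum(densities.values())
    return {name: d / total for name, d in densities.items()}

# Hypothetical (mu, sigma) per cluster, e.g. weight in grams.
clusters = {"onion": (100.0, 15.0), "potato": (200.0, 30.0)}

probs = membership_probabilities(130.0, clusters)
decision = max(probs, key=probs.get)  # hard label, only if one is required

print({name: round(p, 2) for name, p in probs.items()})
print(decision)
```

Keeping `probs` alongside `decision` preserves the uncertainty: a point with a 51% to 49% split gets the same label as one with a 99% to 1% split, but the two deserve very different levels of trust.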

### Why Gaussians?
- Flexibility: Gaussian distributions are versatile. Adjusting the mean (centre) shifts the curve, and adjusting the standard deviation (spread) widens or narrows it.
- Mathematical convenience: Gaussians have well-defined mathematical properties, making calculations and analysis within the model easier.

### Benefits of using a mixture
- Modeling non-spherical clusters: By combining multiple Gaussians, GMMs can model clusters that are elongated, crescent-shaped, or otherwise non-spherical.
- Soft clustering: Rather than assigning each data point to a single "hard" cluster as K-Means does, GMMs calculate the probability of a data point belonging to each Gaussian distribution in the mixture, so points on the border between clusters are handled naturally.