___

# PROBABILITY DISTRIBUTION

Statistical analysis deals with two main kinds of variables: categorical and numerical.

## Categorical Variables
+ nominal: cannot be sorted (flavors of ice cream).
+ ordinal: can be sorted, but the distance between values is not meaningful (survey responses).


## Numerical Variables
+ discrete: can only take certain values (dice roll, population).
+ continuous: can take any value within a range (age, height).


The [probability distribution](https://en.wikipedia.org/wiki/Probability_distribution) of a random variable X tells what the possible values of X are and what probability each value has.


## PMF and PDF

For numerical variables:
+ **discrete**: each possible value has a probability of being taken exactly. The sum of all probabilities is one. The function that gives these probabilities is called the **Probability Mass Function** (PMF).
+ **continuous**: the probability of any single exact value is zero, so probabilities are assigned to intervals of values: the probability of falling within an interval is the area under the probability curve over that interval. The total area under the curve is equal to one. The curve is called the **Probability Density Function** (PDF).
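As an illustration of both cases, here is a minimal sketch using only the Python standard library. The choice of examples (the sum of two fair dice for the PMF, a standard normal curve for the PDF) and the helper names `normal_pdf` and `integrate` are assumptions made for this example.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import product

# Discrete case: PMF of the sum of two fair six-sided dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

print(pmf[7])             # probability of rolling exactly 7
print(sum(pmf.values()))  # the probabilities sum to one


# Continuous case: PDF of a normal distribution.
def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the area under f between a and b."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# The interval [-8, 8] holds practically all of the probability mass.
area = integrate(normal_pdf, -8, 8)
# Probability of falling within one standard deviation of the mean.
p_within_one_sigma = integrate(normal_pdf, -1, 1)

print(round(area, 6), round(p_within_one_sigma, 4))
```

The PMF is read off directly (one probability per value), whereas the PDF only yields probabilities once it is integrated over an interval.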


## CDF

The [Cumulative Distribution Function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) is the probability that a random variable X takes a value less than or equal to x. It is the cumulative sum (or integral) of the probability distribution.
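For a discrete variable, the cumulative sum can be sketched in a few lines. The fair six-sided die is an assumed example, not something the text prescribes.

```python
from fractions import Fraction
from itertools import accumulate

# PMF of a fair six-sided die: each face has probability 1/6.
values = range(1, 7)
pmf = [Fraction(1, 6)] * 6

# The CDF is the running total of the PMF: cdf[x] = P(X <= x).
cdf = dict(zip(values, accumulate(pmf)))

print(cdf[4])  # P(X <= 4)
print(cdf[6])  # the CDF reaches one at the largest possible value
```

Because probabilities are never negative, the CDF can only stay flat or increase as x grows.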


___

# PARAMETERS OF INTEREST

It is often interesting to summarize the **probability distribution** with a few numerical features of interest: the population **parameters**. 

Numerical data are typically associated with their:

+ **Central Tendency**: typical or central value of the variable.
+ **Dispersion**: spread from the average.

There are three main ways to measure central tendency and dispersion:
+ mode (central tendency only).
+ mean and standard deviation.
+ median and interquartile range.

<img src="https://sebastienplat.s3.amazonaws.com/e4b73954b4944120752f1082b78ed3cc1489605907921" alt="drawing" width="600px"/>


## Mode

The mode of a distribution is the value that appears most often in a set of data values. For discrete variables, this is the most likely value to be sampled, i.e. where the PMF takes its maximal value.

For continuous variables, sampled values are almost never exactly equal, so the mode cannot be found by counting repeated values; it is the value where the PDF takes its maximal value. Histograms or kernel density estimates, covered in the next section, can provide an estimate of the mode.
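For the discrete case, a minimal sketch with the standard `statistics` module (the sample `rolls` is made up for illustration):

```python
from statistics import mode, multimode

# Hypothetical sample of ten die rolls.
rolls = [2, 3, 3, 5, 3, 6, 2, 1, 3, 2]

# The value 3 appears four times, more than any other value.
most_common = mode(rolls)
print(most_common)

# When several values tie for the highest count, multimode returns all of them.
tied = multimode([1, 1, 2, 2, 3])
print(tied)
```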


## Mean and Standard Deviation

+ mean: average value; sensitive to outliers.
+ variance: average squared distance from the mean; squaring gives more weight to large deviations.
+ standard deviation: square root of variance; same unit as the measure itself.
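These three quantities can be computed with the standard `statistics` module. The sketch below uses made-up data and the population versions (`pvariance`, `pstdev`), which divide by the number of values; treating the data as a whole population is an assumption of this example.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical data set, treated as a whole population.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(data)        # average value
var = pvariance(data)  # average squared distance from the mean
sd = pstdev(data)      # square root of the variance, same unit as the data
print(mu, var, sd)

# A single extreme value pulls the mean far away from most of the data.
with_outlier = data + [100]
print(mean(with_outlier))
```

For a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.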


## Median and IQR

+ median: middle value of the ordered data; insensitive to outliers. Half of the values lie below it and half above it.
+ range: maximum - minimum; extremely sensitive to outliers.
+ interquartile range (IQR): difference between the third and first quartile.

**Quartiles** (Q1, Q2, Q3) are the three cut points that divide a rank-ordered data set into four parts of equal size; a value has a 25% probability of falling into each part. Q2 is the median. The **Interquartile range (IQR)** is equal to Q3 minus Q1, so it spans the central half of the data. This is why it is sometimes called the midspread or **middle 50%**.

_Note: The median is not skewed much by extremely large or small values. So it may give a better idea of a 'typical' value than the mean. This is also why it is considered a [**robust statistic**](https://en.wikipedia.org/wiki/Robust_statistics)._
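The robustness described above can be checked with a small sketch. The data are made up, and `statistics.quantiles` is used with its default method; other quartile conventions can give slightly different cut points.

```python
from statistics import mean, median, quantiles

# Hypothetical data set, and a copy where the largest value is replaced
# by an extreme outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
with_outlier = data[:-1] + [1000]


def summarize(values):
    q1, q2, q3 = quantiles(values, n=4)  # the three quartiles
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
    }


clean = summarize(data)
dirty = summarize(with_outlier)
print(clean)
print(dirty)
```

The median and the IQR are identical for both data sets, while the mean and the range change drastically.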


___

# MODES, SKEWNESS, KURTOSIS

A few terms exist to describe the overall shape of a probability distribution: modes, skewness and kurtosis.

## Unimodality

The peaks of a probability distribution (i.e. local maxima) are called modes (not to be confused with the mode defined in the previous section). A variable with only one mode (or peak) is called unimodal. If the probability distribution has more than one mode, the variable is called multimodal. 

Here is an example of bimodal distribution; the largest peak is the mode of the probability distribution, i.e. the value most likely to occur. 

<img src="https://upload.wikimedia.org/wikipedia/commons/b/bc/Bimodal_geological.PNG" alt="drawing" width="300px"/>


## Skewness

Skewness measures the asymmetry of a probability distribution about its mean. For a unimodal distribution:
+ negative skew (left-skewed): the tail is on the left side of the distribution; usually appears as a right-leaning curve.
+ positive skew (right-skewed): the tail is on the right.

![missing](https://upload.wikimedia.org/wikipedia/commons/f/f8/Negative_and_positive_skew_diagrams_%28English%29.svg)

Note that a symmetric distribution has zero skewness, but the converse is not always true: a distribution with zero skewness is not necessarily symmetric (an example of such distribution can be found [here](https://en.wikipedia.org/wiki/Skewness#/media/File:Asymmetric_Distribution_with_Zero_Skewness.jpg)).
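One common way to quantify skewness is the third standardized moment. The sketch below assumes that definition and computes it for the empirical distribution of a few made-up samples; the function name `skewness` is chosen for this example.

```python
def skewness(data):
    """Third standardized moment of the empirical distribution of `data`."""
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n  # variance
    m3 = sum((x - mu) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]   # long tail on the right
left_skewed = [-x for x in right_skewed]   # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
print(skewness(symmetric))     # zero
```

Mirroring the data flips the sign of the skewness without changing its magnitude.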


## Kurtosis

The kurtosis measures how prone a distribution is to producing outliers:
+ low kurtosis: the distribution produces fewer and less extreme outliers.
+ high kurtosis: the distribution produces more and more extreme outliers.

A distribution with high kurtosis has [heavier tails](https://en.wikipedia.org/wiki/Heavy-tailed_distribution) than a normal distribution (whose kurtosis is 3): more of its variance comes from rare, extreme deviations than from frequent, moderate ones.

_Note: for a given distribution, the kurtosis can never be smaller than the squared skewness plus one._

![missing](../../img/kurtosis.png)
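The kurtosis can be computed as the fourth standardized moment. The sketch below assumes that definition (the plain kurtosis, not the "excess" kurtosis that subtracts 3) and checks the bound from the note on a few made-up samples; all function and sample names are chosen for this example.

```python
import math


def standardized_moment(data, k):
    """k-th standardized moment of the empirical distribution of `data`."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum(((x - mu) / sigma) ** k for x in data) / n


def skewness(data):
    return standardized_moment(data, 3)


def kurtosis(data):
    return standardized_moment(data, 4)


samples = {
    "coin flips": [0, 1],
    "evenly spread": [1, 2, 3, 4, 5],
    "heavy tails": [-10, -1, 0, 0, 0, 0, 1, 10],
    "right-skewed": [1, 2, 2, 3, 3, 3, 4, 10],
}

for name, data in samples.items():
    s, k = skewness(data), kurtosis(data)
    # The last column is the lower bound: squared skewness plus one.
    print(f"{name:>13}: skewness={s:+.2f}  kurtosis={k:.2f}  bound={s**2 + 1:.2f}")
```

The two-valued "coin flips" sample sits exactly on the bound (kurtosis of 1 with zero skewness), while the sample with a few extreme values has a kurtosis above 3.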
