# Random Variables

A random variable is a variable whose value is a numerical outcome of a random phenomenon:

+ an experiment: result of a coin toss, etc. 
+ an industrial process: defect rate in manufacturing, etc.
+ a naturally occurring event: people's height, etc.
+ ...

## Numerical Variables
+ discrete: can only take certain values (dice roll, population).
+ continuous: can take any value within a range (age, height).

## Categorical Variables
+ nominal: cannot be sorted (flavors of ice cream).
+ ordinal: can be sorted, but the distance between levels is not defined (survey responses).


___

# Probability Distribution

The probability distribution of a random variable X describes the possible values of X and how probability is spread across them.

## Probability Mass Function

The Probability Mass Function (PMF) of a discrete or ordinal variable gives the probability that the variable is exactly equal to a given value. These probabilities sum to one.
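As a minimal sketch, the PMF of the sum of two fair dice can be built by counting equally likely outcomes. The two-dice example and the names `counts` and `pmf` are assumptions chosen for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two fair dice
# and count how many of them produce each possible sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# The PMF maps each possible sum to its exact probability.
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

print(pmf[7])             # 1/6
print(sum(pmf.values()))  # 1
```

Exact fractions are used here so that the "sums to one" property can be checked without rounding error.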


## Categorical Distribution

The Categorical Distribution of a nominal variable is similar to a PMF, with the difference that there is no innate underlying ordering of the outcomes.
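The sketch below samples from a categorical distribution over ice cream flavors. The flavor names and their probabilities are hypothetical values, picked only to illustrate the idea.

```python
import math
import random
from collections import Counter

# Hypothetical flavor probabilities, chosen only for illustration.
flavor_probs = {"vanilla": 0.5, "chocolate": 0.3, "strawberry": 0.2}
assert math.isclose(sum(flavor_probs.values()), 1.0)

# Draw 10,000 samples with a seeded generator so the run is repeatable.
rng = random.Random(0)
draws = rng.choices(
    list(flavor_probs), weights=list(flavor_probs.values()), k=10_000
)

# Observed relative frequencies should be close to the stated probabilities.
frequencies = {flavor: n / len(draws) for flavor, n in Counter(draws).items()}
print(frequencies)
```

Because the outcomes have no ordering, a statement like "less than or equal to chocolate" has no meaning here.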


## Probability Density Function

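The Probability Density Function (PDF) of a continuous variable describes the relative likelihood of each value. The probability of falling within an interval is the area under the PDF over that interval; the probability of any single exact value is zero. The total area under the curve is one.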
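To make the "area under the curve" idea concrete, here is a rough sketch that approximates an interval probability for a standard normal variable. The choice of distribution, the interval [-1, 1], and the helper `area_under_pdf` are assumptions made for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def area_under_pdf(dist, a, b, steps=10_000):
    """Approximate P(a <= X <= b) with a midpoint Riemann sum of the PDF."""
    width = (b - a) / steps
    return sum(dist.pdf(a + (i + 0.5) * width) for i in range(steps)) * width

# Probability of falling within one standard deviation of the mean.
p_interval = area_under_pdf(std_normal, -1.0, 1.0)

# The same probability obtained from the CDF, for comparison.
p_exact = std_normal.cdf(1.0) - std_normal.cdf(-1.0)

print(round(p_interval, 4))  # 0.6827
```

Note that `std_normal.pdf(0.0)` is a density, not a probability: only areas under the curve are probabilities.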


## Cumulative Distribution Function 

The Cumulative Distribution Function (CDF) is the probability that a numerical random variable takes a value less than or equal to x. It is the cumulative sum (or integral) of the probability distribution.
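For a discrete variable, the CDF can be sketched as a running total of the PMF. The fair-die example and the variable names below are illustrative assumptions.

```python
from fractions import Fraction
from itertools import accumulate

# PMF of a fair six-sided die.
values = list(range(1, 7))
pmf = [Fraction(1, 6)] * 6

# The CDF at x is the running total of the PMF up to and including x.
cdf = dict(zip(values, accumulate(pmf)))

print(cdf[4])  # 2/3
print(cdf[6])  # 1
```

The CDF never decreases and reaches one at the largest possible value.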

___

# Parameters of Interest

It is often interesting to summarize the **probability distribution** with a few numerical features of interest: the population **parameters**. 

Numerical data are typically associated with their:

+ **Central Tendency**: average value a variable can take.
+ **Dispersion**: spread from the average.


## Measures of Central Tendency

There are three main methods to measure central tendency: mode, mean, and median.

<img src="https://sebastienplat.s3.amazonaws.com/e4b73954b4944120752f1082b78ed3cc1489605907921" alt="drawing" width="600px"/>


### Mode

The mode of a distribution is the value that appears most often in a set of data values. For discrete variables, this is the most likely value to be sampled, i.e. where the PMF takes its maximal value.

For continuous variables, no two sampled values will be exactly the same, so the mode cannot be found by counting repeated values; it is instead the value where the PDF peaks. Histograms or kernel density estimates, covered in the next chapter, can provide an estimate of the mode.
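The following sketch finds the mode of discrete data by counting, then estimates a modal bin for continuous data. The sample values and the 5 cm bin width are assumptions for illustration; the binning is a crude stand-in for a proper histogram.

```python
from collections import Counter
from statistics import mode, multimode

# Discrete data: the mode is the most frequent value.
rolls = [2, 3, 3, 5, 3, 6, 2, 1]
print(mode(rolls))                 # 3

# multimode returns every value tied for the highest count.
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]

# Continuous data: group values into bins, then take the fullest bin.
heights_cm = [161.2, 172.5, 168.9, 174.1, 171.3, 180.4, 166.0, 173.8]
bin_width = 5
bins = Counter(int(h // bin_width) * bin_width for h in heights_cm)
modal_bin, _ = bins.most_common(1)[0]
print(modal_bin)  # 170, i.e. the bin covering 170 to 175 cm
```

The estimated mode depends on the bin width, which is one reason smoother estimates are often preferred.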


### Median

The median is the middle value of the ordered data: half of the values are on its left and half are on its right.

The median is insensitive to outliers. Because it is not skewed much by extremely large or small values, it may give a better idea of a 'typical' value than the mean. This is also why it is considered a [**robust statistic**](https://en.wikipedia.org/wiki/Robust_statistics).
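A short sketch of this robustness, using made-up income figures (the numbers are hypothetical):

```python
from statistics import median

incomes = [28.0, 31.0, 35.0, 40.0, 46.0]
print(median(incomes))  # 35.0

# Replacing the largest value with an extreme one leaves the median unchanged.
incomes_with_outlier = [28.0, 31.0, 35.0, 40.0, 900.0]
print(median(incomes_with_outlier))  # 35.0

# With an even number of values, the median is the mean of the two middle ones.
print(median([28.0, 31.0, 35.0, 40.0]))  # 33.0
```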


### Mean

The mean is the average value of the data: the sum of all values divided by the number of observations. It is very sensitive to outliers.

To lessen the effect of outliers, the most extreme values (such as 5% of all observations) can be excluded from the calculation. The resulting number is called the trimmed mean.
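One possible way to compute a trimmed mean is sketched below. The helper `trimmed_mean` is hypothetical, and it assumes the stated proportion is removed from each end of the sorted data; other conventions exist.

```python
from statistics import mean

def trimmed_mean(data, proportion=0.05):
    """Mean after dropping `proportion` of the observations from each end."""
    ordered = sorted(data)
    k = int(len(ordered) * proportion)  # number of values dropped per tail
    kept = ordered[k:len(ordered) - k]
    return mean(kept)

data = [12.0, 14.0, 15.0, 15.0, 16.0, 17.0, 18.0, 19.0, 21.0, 250.0]

print(mean(data))                # 39.7, pulled up by the single outlier
print(trimmed_mean(data, 0.10))  # 16.875, after dropping 12.0 and 250.0
```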


## Measures of Dispersion

There are three main methods to measure spread: range, inter-quartile range, and standard deviation from the mean.

### Range

The range is simply the largest value minus the smallest value. It is extremely sensitive to outliers.
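A minimal sketch with made-up measurements shows how a single extreme value changes the range (the helper name `value_range` is an assumption):

```python
def value_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)

measurements = [4.0, 7.0, 9.0, 10.0, 12.0]

print(value_range(measurements))           # 8.0
print(value_range(measurements + [95.0]))  # 91.0
```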


### Inter-Quartile Range

Quartiles divide a rank-ordered data set into four parts of equal size; each value has a 25% probability of falling into any given part. The Inter-Quartile Range (IQR) is equal to Q3 minus Q1. It spans the central half of the data, which is why it is sometimes called the midspread or middle 50%.

The IQR is used in conjunction with the median, which is by definition equal to Q2.
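The quartiles and IQR can be sketched with Python's `statistics.quantiles`. The data set is made up, and the function's default interpolation method is assumed; other methods can give slightly different quartiles on small samples.

```python
from statistics import median, quantiles

data = [1.0, 3.0, 4.0, 6.0, 7.0, 8.0, 10.0, 12.0, 15.0, 18.0, 40.0]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3)  # 4.0 8.0 15.0
print(iqr)         # 11.0
```

The outlier at 40.0 has no effect on the IQR here, whereas it dominates the range.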


### Standard Deviation

The variance is the average squared distance from the mean; squaring puts a greater emphasis on larger distances. The higher the variance, the further from the mean each data point tends to be.

The standard deviation is the square root of the variance. It has the benefit of having the same unit as the variable itself.
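The definitions above can be sketched directly. The data values are illustrative, and the code uses the population form (dividing by the number of observations), which matches "average squared distance"; `statistics.variance` and `statistics.stdev` divide by n - 1 instead.

```python
from statistics import fmean, pstdev, pvariance

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Variance: the average of the squared distances from the mean.
mu = fmean(data)
variance = fmean([(x - mu) ** 2 for x in data])

# Standard deviation: square root of the variance, in the data's own unit.
std_dev = variance ** 0.5

print(variance)  # 4.0
print(std_dev)   # 2.0
```

The manual results agree with `pvariance(data)` and `pstdev(data)` from the standard library.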



___

# Modes, Skewness, Kurtosis

A few terms exist to describe the overall shape of a probability distribution: modes, skewness and kurtosis.

## Unimodality

The peaks of a probability distribution (i.e. local maxima) are called modes (not to be confused with the mode defined in the previous section). A variable with only one mode (or peak) is called unimodal. If the probability distribution has more than one mode, the variable is called multimodal. 

Here is an example of a bimodal distribution; the largest peak is the mode of the probability distribution, i.e. the value most likely to occur.

<img src="https://upload.wikimedia.org/wikipedia/commons/b/bc/Bimodal_geological.PNG" alt="drawing" width="300px"/>

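As a rough sketch, the modes of a discrete distribution can be located by looking for local maxima of its PMF. The helper `find_modes` and the PMF values are hypothetical; the helper only detects strict peaks and ignores plateaus.

```python
def find_modes(pmf):
    """Return the indices where the PMF is strictly higher than both neighbours."""
    modes = []
    for i, p in enumerate(pmf):
        # Treat positions outside the list as infinitely low.
        left = pmf[i - 1] if i > 0 else float("-inf")
        right = pmf[i + 1] if i < len(pmf) - 1 else float("-inf")
        if p > left and p > right:
            modes.append(i)
    return modes

# Hypothetical PMF over the values 0..6, with two peaks.
pmf = [0.05, 0.20, 0.10, 0.05, 0.15, 0.30, 0.15]

peaks = find_modes(pmf)
print(peaks)                             # [1, 5] -> bimodal
print(max(peaks, key=lambda i: pmf[i]))  # 5 -> the mode of the distribution
```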

## Skewness

Skewness measures the asymmetry of a probability distribution about its mean. For a unimodal distribution:
+ negative skew (left-skewed): the tail is on the left side of the distribution; usually appears as a right-leaning curve.
+ positive skew (right-skewed): the tail is on the right.

![missing](https://upload.wikimedia.org/wikipedia/commons/f/f8/Negative_and_positive_skew_diagrams_%28English%29.svg)

It is important to note that a distribution with zero skewness is not always symmetrical (an example of such distribution can be found [here](https://en.wikipedia.org/wiki/Skewness#/media/File:Asymmetric_Distribution_with_Zero_Skewness.jpg)).
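One common way to quantify skewness is the moment coefficient: the third central moment divided by the variance raised to the power 1.5. The sketch below assumes that definition and uses made-up data; the helper `skewness` is not from any library.

```python
from statistics import fmean

def skewness(data):
    """Moment coefficient of skewness: third central moment over variance**1.5."""
    mu = fmean(data)
    m2 = fmean([(x - mu) ** 2 for x in data])
    m3 = fmean([(x - mu) ** 3 for x in data])
    return m3 / m2 ** 1.5

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0]
left_skewed = [-x for x in right_skewed]  # mirror image: tail on the left

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
```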


## Kurtosis

The kurtosis measures the likelihood of outliers:
+ low kurtosis: the distribution produces fewer and less extreme outliers.
+ high kurtosis: the distribution produces more, and more extreme, outliers.

A distribution with high kurtosis has [heavier tails](https://en.wikipedia.org/wiki/Heavy-tailed_distribution): more of its variance comes from rare, extreme deviations than from frequent, moderate ones.

_Note: for a given distribution, the (non-excess) kurtosis can never be smaller than the squared skewness plus one._

![missing](../../img/kurtosis.png)
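The sketch below computes kurtosis as the fourth central moment divided by the squared variance, and checks the bound from the note above. It assumes the non-excess definition (a normal distribution scores 3); the helpers and data are illustrative only.

```python
from statistics import fmean

def central_moment(data, k):
    """k-th central moment: average of (x - mean)**k."""
    mu = fmean(data)
    return fmean([(x - mu) ** k for x in data])

def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Non-excess kurtosis: fourth central moment over variance squared."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

flat = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outliers = [0.0] * 8 + [-5.0, 5.0]

print(round(kurtosis(flat), 2))  # 1.7
print(kurtosis(with_outliers))   # 5.0

# The bound holds for both samples: kurtosis >= skewness**2 + 1.
for sample in (flat, with_outliers):
    assert kurtosis(sample) >= skewness(sample) ** 2 + 1
```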


## Mean vs Median

Central tendency and dispersion are usually reported as one of two pairs of measures. Which pair is the most appropriate depends on the shape of the distribution:
+ the mean and standard deviation are well-suited for unimodal and symmetrical distributions with few outliers.
+ the median and interquartile range are better-suited otherwise.
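To illustrate the choice, the sketch below computes both pairs of summaries on a right-skewed sample. The waiting times are hypothetical, and quartiles use the default method of `statistics.quantiles`.

```python
from statistics import mean, median, pstdev, quantiles

# Hypothetical right-skewed sample: most values are small, one is very large.
waiting_times = [2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 60.0]

# Pair 1: mean and standard deviation, both inflated by the outlier.
avg = mean(waiting_times)
spread = pstdev(waiting_times)

# Pair 2: median and interquartile range, barely affected by it.
mid = median(waiting_times)
q1, _, q3 = quantiles(waiting_times, n=4)
iqr = q3 - q1

print(round(avg, 2), round(spread, 2))
print(mid, iqr)  # 5.0 4.0
```

Here the mean lands above the third quartile, so it describes a value larger than most observations, while the median and IQR still reflect the bulk of the data.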
