## Key definitions

### Average

For a data set, the arithmetic mean, also known as arithmetic average, is a central value of a finite set of numbers: specifically, the sum of the values divided by the number of values.

The formula for average is:

${\displaystyle \bar{a}={\frac {1}{n}}\sum _{i=1}^{n}a_{i}={\frac {a_{1}+a_{2}+\cdots +a_{n}}{n}}}$
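
As a minimal sketch of the formula above, the mean can be computed by hand and checked against NumPy. The `values` list is hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data set, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Sum of the values divided by the number of values.
mean_by_hand = sum(values) / len(values)

# np.mean implements the same arithmetic mean.
mean_numpy = np.mean(values)

print(mean_by_hand, mean_numpy)
```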

### Variance

Variance measures how far the values in a data set lie from the mean: it is the average of the squared deviations from the mean.

The variance grows with the spread of the data. It is larger when the values are widely dispersed and smaller when they cluster tightly around the mean.

The formula for population variance is:

${\displaystyle Variance(A)={\frac {1}{n}}\sum _{i=1}^{n}(a_{i} - \bar{a})^2}$

Note that the variance is never negative because it is a sum of squares; it is zero only when all values are equal.
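
The population variance formula can be sketched as follows. The `values` list is hypothetical; `np.var` divides by `n` by default (`ddof=0`) and by `n - 1` when `ddof=1` is passed, which gives the sample variance.

```python
import numpy as np

# Hypothetical data set, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n

# Population variance: average of the squared deviations from the mean.
variance_by_hand = sum((a - mean) ** 2 for a in values) / n

# np.var uses ddof=0 by default, which matches the population formula.
variance_numpy = np.var(values)

# Passing ddof=1 divides by n - 1 instead (sample variance).
sample_variance = np.var(values, ddof=1)

print(variance_by_hand, variance_numpy, sample_variance)
```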

### Standard deviation

Standard deviation is a statistical measurement of how spread out a group of numbers is around the mean. Roughly speaking, it tells you how far a typical value lies from the mean.

It is calculated as the square root of the variance, so you first compute each data point's deviation from the mean. Unlike the variance, the standard deviation is expressed in the same units as the data.

The formula for population standard deviation is:

$ {\displaystyle \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (a_i - \bar{a})^2}} $
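
A short sketch of the population standard deviation, assuming a hypothetical `values` list: take the square root of the population variance and compare with `np.std`, which also uses `ddof=0` by default.

```python
import math

import numpy as np

# Hypothetical data set, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n

# Population variance, then its square root.
variance = sum((a - mean) ** 2 for a in values) / n
std_by_hand = math.sqrt(variance)

# np.std with the default ddof=0 matches the population formula.
std_numpy = np.std(values)

print(std_by_hand, std_numpy)
```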

### Covariance

Covariance is a measure of how much two random variables vary together. It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.

The formula for sample covariance, which divides by $n - 1$ rather than $n$, is:

$ {\displaystyle cov_{xy} = \frac{\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})}{n-1}} $
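
As an illustration, the sample covariance (dividing by `n - 1`, as `np.cov` does by default) can be computed by hand and compared with NumPy. The `x` and `y` lists are hypothetical.

```python
import numpy as np

# Hypothetical paired observations, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Sample covariance: sum of the products of deviations, divided by n - 1.
cov_by_hand = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)

# np.cov returns a 2x2 covariance matrix; entry [0, 1] is cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]

print(cov_by_hand, cov_numpy)
```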

### Correlation

When two variables each have variation, you may want to measure to what extent this variation overlaps.
A correlation measures to what extent two variables have their variation in common.

To define a correlation you need:
* The mean of each variable
* The standard deviation of each variable
* The sample covariance of the two variables

Correlation can be understood as a scale-free measurement of the strength of the linear relationship between two variables. It always lies in the range `[-1, 1]`: the sign gives the direction of the relationship and the magnitude gives its strength.

The formula for correlation is:

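$ {\displaystyle r_{xy} = \frac{cov_{xy}}{\sqrt{Variance(x) \cdot Variance(y)}}} $

The covariance and both variances must use the same denominator (all $n$ or all $n - 1$). The factor then cancels, so the population and sample versions give the same $r_{xy}$.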
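
The correlation formula can be sketched with NumPy as follows. The `x` and `y` arrays are hypothetical; the covariance and variances all use `ddof=1` so that the normalization is consistent.

```python
import numpy as np

# Hypothetical paired observations, chosen only for illustration.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Covariance and variances with the same denominator (n - 1).
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
var_y = np.var(y, ddof=1)

# Correlation: covariance scaled by the product of the standard deviations.
r_by_hand = cov_xy / np.sqrt(var_x * var_y)

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r_xy.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_by_hand, r_numpy)
```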


### T-test

A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

In the example below, a t-test is used to determine whether the number of penalty kicks given to the home team is significantly larger than the number given to the away team.

The t-test assumes your data:
1. Are independent
2. Are (approximately) normally distributed
3. Have a similar amount of variance within each group being compared (a.k.a. homogeneity of variance)
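
Two of these assumptions can be checked roughly with SciPy, as sketched below. The samples are hypothetical and randomly generated; the choice of the Shapiro-Wilk and Levene tests is an assumption of this sketch, not something prescribed above. Independence depends on how the data were collected and cannot be checked this way.

```python
import numpy as np
from scipy import stats

# Hypothetical samples drawn from normal distributions, for illustration only.
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed.
_, normality_p_a = stats.shapiro(group_a)
_, normality_p_b = stats.shapiro(group_b)

# Levene's test: the null hypothesis is that the groups have equal variances.
_, equal_var_p = stats.levene(group_a, group_b)

# A small p-value is evidence against the corresponding assumption.
print("Shapiro p-values:", normality_p_a, normality_p_b)
print("Levene p-value:", equal_var_p)
```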

There are three main types of t-test:

1. An independent samples t-test compares the means of two independent groups
2. A paired sample t-test compares means from the same group at different times (say, one year apart)
3. A one sample t-test tests the mean of a single group against a known mean
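
The three types map onto three functions in `scipy.stats`, sketched here with hypothetical measurements. The variable names and the reference mean of `5.0` are illustrative assumptions.

```python
from scipy import stats

# Hypothetical measurements, for illustration only.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
group_b = [6.1, 5.9, 6.5, 6.8, 6.2, 6.4]
before = [72, 75, 71, 78, 74, 70]
after = [70, 74, 69, 75, 73, 69]

# 1. Independent samples: two unrelated groups.
independent = stats.ttest_ind(group_a, group_b)

# 2. Paired sample: the same subjects measured twice.
paired = stats.ttest_rel(before, after)

# 3. One sample: a single group against a known (here assumed) mean.
one_sample = stats.ttest_1samp(group_a, popmean=5.0)

for name, result in [("independent", independent), ("paired", paired), ("one-sample", one_sample)]:
    print(name, result.statistic, result.pvalue)
```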


### Week 2 Example 1

Let's explore a question using the statistics defined above: do home teams get more penalty kicks than away teams?

import numpy as np
from scipy import stats

"""
Start by creating a data set of penalty kick observations awarded by referees. Entry i of
row 0 is the number of penalty kicks referee i awarded to home teams; entry i of row 1 is
the number awarded to away teams.
"""
pks = np.array([
    [53, 42, 24, 20, 20, 41, 33, 6, 37, 10, 5, 53, 0, 11, 9],
    [31, 31, 10, 12, 2, 18, 24, 1, 0, 4, 6, 28, 2, 4, 7]
], np.int32)

"""
Now let's calculate a benchmark value like the mean penalty kicks home and away teams each get.
"""
home_avg = np.average(pks[0])
away_avg = np.average(pks[1])

"""
We can also calculate the sample variance of these observations. This allows us to measure the
variation around a benchmark value like the average.
"""
home_var = np.var(pks[0], ddof=1)
away_var = np.var(pks[1], ddof=1)

"""
Next, calculate the sample standard deviation of these observations. This value indicates
roughly how far a typical observation lies from the mean.
"""
home_std = np.std(pks[0], ddof=1)
away_std = np.std(pks[1], ddof=1)

"""
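Do the same to calculate the sample covariance. np.cov returns a 2x2 covariance matrix; the
off-diagonal entry [0, 1] is the covariance between home and away penalty kicks.
"""
pks_cov = np.cov(pks, ddof=1)[0, 1]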

"""
Finally, calculate the correlation. The correlation is close to 1 (0.809), which indicates that
referees who give more penalty kicks to home teams also tend to give more to away teams.
"""
pks_corr = np.corrcoef(pks)[0, 1]

"""
Run a paired t-test, since each referee contributes one home count and one away count. Because
the p-value is small (~0.0006), we can reject the null hypothesis that the mean difference is
zero and conclude that home teams receive significantly more penalty kicks.
"""
t_test = stats.ttest_rel(pks[0], pks[1])

"""
The same result follows from a one-sample t-test on the per-referee differences, which is
equivalent to the paired test above. The differences between home and away penalties do not
have mean zero. In other words, home teams seem to be in a favorable position.
"""
diffs = pks[0] - pks[1]
t_test_of_diffs = stats.ttest_1samp(diffs, popmean=0)
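
To make the paired t-test less of a black box, the sketch below recomputes its statistic by hand from the same penalty kick data and compares it with `stats.ttest_rel`. The one-sided p-value at the end is an addition of this sketch: it assumes the question of interest is specifically whether home teams get more penalty kicks.

```python
import numpy as np
from scipy import stats

# Penalty kick data from the example above (home, away), one entry per referee.
home = np.array([53, 42, 24, 20, 20, 41, 33, 6, 37, 10, 5, 53, 0, 11, 9])
away = np.array([31, 31, 10, 12, 2, 18, 24, 1, 0, 4, 6, 28, 2, 4, 7])

diffs = home - away
n = len(diffs)

# Paired t statistic: mean difference divided by its standard error.
standard_error = np.std(diffs, ddof=1) / np.sqrt(n)
t_by_hand = np.mean(diffs) / standard_error

# Two-sided result from SciPy for comparison.
scipy_result = stats.ttest_rel(home, away)

# One-sided p-value (home > away) from the t distribution with n - 1 degrees of freedom.
p_one_sided = stats.t.sf(t_by_hand, df=n - 1)

print("t by hand:", t_by_hand, "t from SciPy:", scipy_result.statistic)
print("two-sided p:", scipy_result.pvalue, "one-sided p:", p_one_sided)
```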

### Week 2 Example 2

Does the relative age effect exist in football?

import numpy as np
from matplotlib import pyplot as plt

"""
Start by creating a data set of age observations. Row 0 holds the number of players born in each
month during the 2018 Champions League season; row 1 holds the month number (1 = January).
"""
ages = np.array([
    [88, 90, 72, 60, 69, 64, 63, 69, 58, 45, 52, 47],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
])

"""
Given the strong negative correlation, the data suggest that birth months earlier in the year
contain more birthdays of Champions League players. This pattern is also visible in the plot.
"""
ages_cor = np.corrcoef(ages)
print("Correlation:", ages_cor[1][0])

plt.title("Relative age effect")
plt.xlabel("Month")
plt.ylabel("Number of players")
plt.bar(ages[1], ages[0])
plt.show()
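
As a hedged follow-up, `scipy.stats.pearsonr` reports the same correlation together with a p-value for the null hypothesis of no linear correlation. Using it here is an addition of this sketch, not part of the original example; the data are the birth-month counts from above.

```python
import numpy as np
from scipy import stats

# Birth-month counts from the example above.
players = np.array([88, 90, 72, 60, 69, 64, 63, 69, 58, 45, 52, 47])
months = np.arange(1, 13)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = stats.pearsonr(months, players)

print("Correlation:", r)
print("p-value:", p_value)
```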