# Week 2: Descriptive statistics

## Introduction to descriptive statistics
Descriptive statistics are numbers used to describe and summarize a dataset.

### Measures of central tendency:
-   mean
-   median  
-   mode

### Measures of variability or dispersion:
-   variance or standard deviation
-   coefficient of variation
-   minimum and maximum values 
-   IQR (Interquartile Range) 
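-   skewness (strictly a measure of shape rather than dispersion)
-   kurtosis (strictly a measure of shape rather than dispersion)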


### Measures of dispersion or variability
#### Variance
- Variance measures the dispersion of a set of data points around their mean value.

- Variance gives results in the original units squared.

#### Standard deviation
- Standard deviation is the most commonly used measure of variability.

- It is the square-root of the variance.

#### Skewness
- Skewness measures a distribution's lack of symmetry.

- It describes how far, and in which direction, the data depart from symmetry around the mean.

#### Rule of thumb for skewness values:
- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.

- If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.

- If the skewness is less than -1 or greater than 1, the data are highly skewed.
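
The rule of thumb above can be sketched as a small helper. The function name `classify_skewness` is hypothetical, and treating the boundary values ±0.5 and ±1 as belonging to the milder category is an assumption, since the rule does not say which side they fall on.

```python
def classify_skewness(skew):
    """Label a skewness value using the rule of thumb (boundary handling is an assumption)."""
    magnitude = abs(skew)
    if magnitude <= 0.5:
        return "fairly symmetrical"
    if magnitude <= 1:
        return "moderately skewed"
    return "highly skewed"


# Illustrative values only
for value in (-0.44, 0.8, 2.6):
    print(value, "->", classify_skewness(value))
```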


![skew](../skew.png)

#### Kurtosis
- Kurtosis is often described as the degree of peakedness of a distribution; it also reflects how heavy the tails are relative to a normal distribution.

##### Mesokurtic curve: 
- A distribution with kurtosis exactly 3 (excess kurtosis exactly 0), such as the normal distribution, is called mesokurtic.

##### Platykurtic curve: 
- A distribution with kurtosis < 3 (excess kurtosis < 0) is called platykurtic.

- As compared to a normal distribution, its central peak is lower and broader, and its tails are shorter and thinner.

##### Leptokurtic curve: 
- A distribution with kurtosis > 3 (excess kurtosis > 0) is called leptokurtic.

- As compared to a normal distribution, its central peak is higher and sharper, and its tails are longer and fatter.
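
As a minimal sketch, the three categories can be expressed in terms of excess kurtosis. The function name `classify_kurtosis` and the tolerance parameter `tol` are hypothetical; a tolerance is included because a sample estimate is almost never exactly 0.

```python
def classify_kurtosis(excess_kurtosis, tol=0.0):
    """Classify a distribution by its excess kurtosis (kurtosis minus 3)."""
    if excess_kurtosis > tol:
        return "leptokurtic"
    if excess_kurtosis < -tol:
        return "platykurtic"
    return "mesokurtic"


# Illustrative values only
for value in (-0.97, 0.0, 3.8):
    print(value, "->", classify_kurtosis(value))
```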



![kurt](../kurt.png)

## Import Python libraries

In [15]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [14]:
# Sample data 

data = [12, 15, 16, 18, 20, 21, 22, 23, 24, 25]

Given a sample of observations $y_t, t = 1, \cdots, T$, the mean $\mu$ is calculated as:


$\mu = \frac{\sum_{t=1}^T {y_t}}{T}$.

1. **Mean (Average):**

   Mean (μ) is calculated as the sum of all data points divided by the number of data points.

In [17]:
# Manual Calculation

Mean = (12 + 15 + 16 + 18 + 20 + 21 + 22 + 23 + 24 + 25) / 10 
Mean

19.6

In [19]:
# Calculation using Python

Mean_p = np.mean(data)
Mean_p

19.6
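
The other two measures of central tendency listed in the introduction, the median and the mode, can be computed in a similar way. This is a sketch using `np.median` and the standard library's `statistics.multimode`; note that every value in the sample data occurs exactly once, so the mode is not informative for this sample.

```python
import numpy as np
from statistics import multimode

data = [12, 15, 16, 18, 20, 21, 22, 23, 24, 25]

# With an even number of observations, the median is the average
# of the two middle values of the sorted data: (20 + 21) / 2
median = np.median(data)

# multimode returns every value tied for the highest frequency
modes = multimode(data)

print(median)
print(modes)
```

Because all ten values are distinct, `multimode` returns the whole sample here. On a list with a repeated value, such as `[1, 2, 2, 3]`, it returns `[2]`.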

#### Variance
The sample variance measures how far the data points spread around the mean.

The variance is calculated as:

 $V(y) = \frac{\sum_{t=1}^T {(y_t - \mu)^2}}{T-1}$.

In [20]:
# Manual Calculation

Variance = ((12 - 19.6)**2 + (15 - 19.6)**2 + (16 - 19.6)**2 + (18 - 19.6)**2 + (20 - 19.6)**2 + (21 - 19.6)**2 + (22 - 19.6)**2 + (23 - 19.6)**2 + (24 - 19.6)**2 + (25 - 19.6)**2) / (10 - 1)
Variance

18.044444444444444

In [21]:
# Python implementation of Variance

variance_p = np.var(data, ddof=1)  # ddof=1 for sample variance
variance_p

18.044444444444444

The standard deviation is the square root of the variance $V$:

$\sigma = \sqrt{V}$.

In [22]:
# Manual Calculation

std_dev = np.sqrt(Variance)
std_dev

4.247875285886398

In [23]:
# Standard Deviation
std_dev_p = np.std(data, ddof=1)  # ddof=1 for sample standard deviation
std_dev_p

4.247875285886398
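
The remaining dispersion measures listed in the introduction (coefficient of variation, minimum, maximum and IQR) can be sketched as follows. The coefficient of variation is taken here as the sample standard deviation divided by the mean, and the quartiles use NumPy's default linear interpolation; other quartile conventions give slightly different IQR values.

```python
import numpy as np

data = [12, 15, 16, 18, 20, 21, 22, 23, 24, 25]

mean = np.mean(data)
std = np.std(data, ddof=1)           # sample standard deviation

cv = std / mean                      # coefficient of variation (unit-free)
minimum, maximum = np.min(data), np.max(data)

# First and third quartiles (NumPy default: linear interpolation)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(cv, minimum, maximum, q1, q3, iqr)
```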

If we let 

$$
z_t = \frac{y_t - \mu}{s}
$$ 

where $s$ is the standard deviation, we can calculate the coefficient of skewness as 

$$
skew = \frac{\sum_{t=1}^T {(z_t)^3}}{T}
$$

In [24]:
# Manual calculation. Note: std_dev here is the sample standard deviation (ddof=1).

Skewness = (pow((12 - 19.6),3) + pow((15 - 19.6),3) + pow((16 - 19.6),3) + pow((18 - 19.6),3) + pow((20 - 19.6),3) + pow((21 - 19.6),3) + pow((22 - 19.6),3) + pow((23 - 19.6),3) + pow(24 - 19.6,3) + pow((25 - 19.6),3)) / (10 * pow(std_dev,3))
Skewness

-0.37635737968447946

In [25]:
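# Skewness with SciPy. By default stats.skew standardizes with the population
# standard deviation (ddof=0), which is why it differs from the manual value above.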

skew = stats.skew(data)
skew

-0.44079501259842147
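
The manual result and the SciPy result differ because they standardize with different standard deviations. The sketch below recomputes the skewness both ways: with the sample standard deviation (`ddof=1`, as in the manual cell) and with the population standard deviation (`ddof=0`, which is what `stats.skew` uses by default).

```python
import numpy as np
import scipy.stats as stats

data = np.array([12, 15, 16, 18, 20, 21, 22, 23, 24, 25], dtype=float)

dev = data - data.mean()
m3 = np.mean(dev**3)                              # third central moment

skew_sample_sd = m3 / np.std(data, ddof=1)**3     # reproduces the manual value
skew_pop_sd = m3 / np.std(data, ddof=0)**3        # reproduces stats.skew

print(skew_sample_sd)
print(skew_pop_sd, stats.skew(data))
```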

We can calculate the coefficient of kurtosis as: 

$$
kurt = \frac{\sum_{t=1}^T {(z_t)^4}}{T}
$$

Using the biased estimator of the standard deviation $\sigma$ (divisor $T$), we can write the expression above as:

$$
kurt = T^{-1} \frac{\sum_{t=1}^T {(y_t - \mu)^4}}{(\sigma^2)^2}
$$

In [26]:
# Manual calculation of raw kurtosis. Note: std_dev here is the sample standard deviation (ddof=1).

Kurtosis = (pow((12 - 19.6),4) + pow((15 - 19.6),4) + pow((16 - 19.6),4) + pow((18 - 19.6),4) + pow((20 - 19.6),4) + pow((21 - 19.6),4) + pow((22 - 19.6),4) + pow((23 - 19.6),4) + pow((24 - 19.6),4) + pow((25 - 19.6),4)) / (10 * pow(std_dev,4))
Kurtosis

1.6444224562595553

In [27]:
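# Kurtosis with SciPy. By default stats.kurtosis returns excess kurtosis (raw kurtosis minus 3)
# and standardizes with the population variance (ddof=0), so it differs from the manual value above.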

kurtosis = stats.kurtosis(data)
kurtosis

-0.9698488194326482
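
Two things separate the manual value from the SciPy value: the manual cell divides by the sample standard deviation, and `stats.kurtosis` reports excess kurtosis by default. The sketch below applies the biased-estimator formula directly and compares it with SciPy, using `fisher=False` to request raw kurtosis.

```python
import numpy as np
import scipy.stats as stats

data = np.array([12, 15, 16, 18, 20, 21, 22, 23, 24, 25], dtype=float)

dev = data - data.mean()
m4 = np.mean(dev**4)                    # fourth central moment
var_pop = np.var(data, ddof=0)          # biased (population) variance

kurt_raw = m4 / var_pop**2              # raw kurtosis (normal distribution: 3)
kurt_excess = kurt_raw - 3              # excess kurtosis (normal distribution: 0)

print(kurt_raw, stats.kurtosis(data, fisher=False))
print(kurt_excess, stats.kurtosis(data))
```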

The covariance $Cov(x,y)$ between $x_t$ and $y_t$ is calculated as:

$$
Cov(x,y) = \frac{\sum_{t=1}^T {(y_t - \mu_y)(x_t - \mu_x)}}{T}
$$


The correlation $\rho$ between $x_t$ and $y_t$ is then:

$$
\rho(x,y) = \frac{Cov(x,y)}{S(x)S(y)}
$$

where $S(x)$ and $S(y)$ are the standard deviations of $x$ and $y$.
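
A minimal sketch of these two formulas follows. The series `x` is a hypothetical second variable invented for illustration; `y` is the sample data used earlier. The covariance uses divisor $T$ as in the formula, so the standard deviations use `ddof=0` as well; the divisor must be the same in the numerator and denominator for the ratio to equal the usual correlation coefficient.

```python
import numpy as np

# x is a hypothetical series for illustration; y is the sample data
x = np.arange(1, 11, dtype=float)
y = np.array([12, 15, 16, 18, 20, 21, 22, 23, 24, 25], dtype=float)

# Covariance with divisor T, as in the formula above
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation: covariance scaled by the (population) standard deviations
rho = cov_xy / (np.std(x) * np.std(y))

print(cov_xy, np.cov(x, y, ddof=0)[0, 1])
print(rho, np.corrcoef(x, y)[0, 1])
```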



Q2. 

In a speech, `Why Banks failed the stress test`, February 2009, Andrew Haldane of the Bank of England provides the following summary statistics for the "golden era" 1998-2007 and for a long period. Growth is annual percent GDP growth, inflation is annual percent change in the RPI and for both the long period is 1857-2007. FTSE is the monthly percent change in the all share index and the long period is 1693-2007.

|                 |  Growth |        | Inflation |        |   FTSE  |        |
|-----------------|:-------:|:------:|:---------:|:------:|:-------:|:------:|
|                 |  98-07  |  long  |   98-07   |  long  |  98-07  |  long  |
| Mean            |   2.9   |   2.0  |    2.8    |   3.1  |   0.2   |   0.2  |
|  SD             |   0.6   |   2.7  |    0.9    |   5.9  |   4.1   |   4.1  |
| Skew            |   0.2   |  -0.8  |    0.0    |   1.2  |   -0.8  |   2.6  |
| Excess Kurtosis |   -0.8  |   2.2  |    -0.3   |   3.0  |   3.8   |  62.3  |

(a) Explain how the mean, standard deviation (SD), coefficient of skewness and coefficient of kurtosis are calculated, and what they measure.

(b) What values for the coefficients of skewness and kurtosis would you expect from a normal distribution? Which of the series shows the least evidence of normality?

(c) Haldane says "these distributions suggest that the Golden Era distributions have a much smaller variance and slimmer tails" and "many risk management models developed within the private sector during the golden decade were, in effect, pre-programmed to induce disaster myopia." Explain what he means using these statistics.

Q3. 

Consider the set of observations on a variable $x_i, i = 1, 2, ... , T$ and 


$w_i = a + b x_i$.



$\mu = \frac{\sum_{i=1}^T {x_i}}{T}$.


Q4. 

Consider the set of observations on variables $x_i, y_i$,   $i = 1, 2, ... , T$. 

Let: 

$$
z(y_i) = \frac{y_i - \mu_y}{S(y)}
$$

and 

$$
z(x_i) = \frac{x_i - \mu_x}{S(x)}
$$

(i) 

Show that each of the following four expressions is true:

$$
 \frac{\sum_{i=1}^T z(y_i)}{T} =0
$$
$$
\frac{\sum_{i=1}^T [z(y_i)]^2}{T} =1
$$

$$
\rho(x,y) = \frac{\sum_{i=1}^T z(y_i)z(x_i)}{T}
$$

$$
\rho(x,x) = 1
$$
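
A numerical check is not a proof, but it can help confirm what the four expressions claim. In the sketch below, `x` is a hypothetical series and the standard deviations use divisor $T$ (`ddof=0`), which is an assumption about $S(\cdot)$; the second identity does not hold exactly with the $T-1$ divisor.

```python
import numpy as np

# x is a hypothetical series for illustration; y is the sample data
x = np.arange(1, 11, dtype=float)
y = np.array([12, 15, 16, 18, 20, 21, 22, 23, 24, 25], dtype=float)

# Standardized values, using the population standard deviation (ddof=0)
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

mean_z = zy.mean()                 # should be 0 up to rounding error
mean_z_sq = np.mean(zy**2)         # should be 1
rho_xy = np.mean(zx * zy)          # should equal the correlation of x and y
rho_xx = np.mean(zx * zx)          # should be 1

print(mean_z, mean_z_sq, rho_xy, rho_xx)
```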




(ii) 

Show that the variance equals the mean of the squares minus the square of the mean:

$$
 \frac{\sum_{i=1}^T (y_i - \mu_y)^2}{T} = \frac{\sum_{i=1}^T (y_i)^2}{T} - (\mu_y)^2
$$
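
The identity can be checked numerically on the sample data before attempting the algebra. This sketch computes both sides directly; it illustrates the result and is not a substitute for the derivation.

```python
import numpy as np

y = np.array([12, 15, 16, 18, 20, 21, 22, 23, 24, 25], dtype=float)

# Left-hand side: mean squared deviation from the mean
lhs = np.mean((y - y.mean())**2)

# Right-hand side: mean of the squares minus the square of the mean
rhs = np.mean(y**2) - y.mean()**2

print(lhs, rhs)
```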


| 2 + data |
|----------|
| 2+12     |
| 2+15     |
| 2+16     |
| 2+18     |
| 2+20     |
| 2+21     |
| 2+22     |
| 2+23     |
| 2+24     |
| 2+25     |
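
The table above adds 2 to every observation. Assuming it is meant to illustrate the transformation $w_i = a + b x_i$ with $a = 2$ and $b = 1$ (an assumption, since the text does not say so), the sketch below shows that the mean shifts by $a$ and scales by $b$ while the standard deviation only scales by $|b|$.

```python
import numpy as np

x = np.array([12, 15, 16, 18, 20, 21, 22, 23, 24, 25], dtype=float)

# Assumed values matching the "2 + data" table
a, b = 2.0, 1.0
w = a + b * x

# Mean of the transformed data versus a + b * mean(x)
print(w.mean(), a + b * x.mean())

# Standard deviation of the transformed data versus |b| * std(x)
print(w.std(ddof=1), abs(b) * x.std(ddof=1))
```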

In [4]:
%matplotlib inline
import os
from datetime import datetime
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import mplfinance as mpf
import seaborn as sns

### FRED

In [13]:
start = datetime(2010, 1, 1)

end = datetime(2023, 1, 27)

gdp = web.DataReader('GDP', 'fred', start, end)

gdp.describe()

|       | GDP          |
|-------|-------------:|
| count | 53.0         |
| mean  | 19444.975698 |
| std   | 3255.623892  |
| min   | 14764.61     |
| 25%   | 16728.687    |
| 50%   | 18892.639    |
| 75%   | 21647.64     |
| max   | 26813.601    |


In [None]:
inflation = web.DataReader(['CPIAUCSL', 'CPILFESL'], 'fred', start, end)
inflation.info()

### World Bank

In [1]:
from pandas_datareader import wb
gdp_variables = wb.search('gdp.*capita.*const')
gdp_variables.head()

|       | id | name | unit | source | sourceNote | sourceOrganization | topics |
|-------|----|------|------|--------|------------|--------------------|--------|
| 691   | 6.0.GDPpc_constant | GDP per capita, PPP (constant 2011 internation... |  | LAC Equity Lab | GDP per capita based on purchasing power parit... | b'World Development Indicators (World Bank)' | Economy & Growth |
| 10978 | NY.GDP.PCAP.KD | GDP per capita (constant 2015 US$) |  | World Development Indicators | GDP per capita is gross domestic product divid... | b'World Bank national accounts data, and OECD ... | Economy & Growth |
| 10980 | NY.GDP.PCAP.KN | GDP per capita (constant LCU) |  | World Development Indicators | GDP per capita is gross domestic product divid... | b'World Bank national accounts data, and OECD ... | Economy & Growth |
| 10982 | NY.GDP.PCAP.PP.KD | GDP per capita, PPP (constant 2017 internation... |  | World Development Indicators | GDP per capita based on purchasing power parit... | b'International Comparison Program, World Bank... | Economy & Growth |
| 10983 | NY.GDP.PCAP.PP.KD.87 | GDP per capita, PPP (constant 1987 internation... |  | WDI Database Archives |  | b'' |  |


In [10]:
wb_data = wb.download(indicator='NY.GDP.PCAP.KD', 
                      country=['US', 'CA', 'MX'], 
                      start=1990, 
                      end=2019)
wb_data.head()

| country | year | NY.GDP.PCAP.KD |
|---------|------|---------------:|
| Canada  | 2019 | 45113.066282   |
| Canada  | 2018 | 44917.483728   |
| Canada  | 2017 | 44325.488337   |
| Canada  | 2016 | 43536.913403   |
| Canada  | 2015 | 43596.135537   |


### OECD

In [7]:
df = web.DataReader('TUD', 'oecd', start='2010', end='2019')
# df[['Japan', 'United States']]
df

All columns: Frequency = Annual, Measure = Percentage of employees.

| Time | Australia | Austria | Belgium | Canada | Czech Republic | Denmark | Finland | France | Germany | Greece | ... | United States | OECD - Total | Chile | Colombia | Costa Rica | Estonia | Israel | Latvia | Lithuania | Slovenia |
|------|-----------|---------|---------|--------|----------------|---------|---------|--------|---------|--------|-----|---------------|--------------|-------|----------|------------|---------|--------|--------|-----------|----------|
| 2010-01-01 | 18.4 | 28.9 | 53.0 | 27.200001 | 16.1 | 68.099998 | 71.400002 | 10.8 | 18.9 | 22.200001 | ... | 11.4 | 17.799999 | 13.9 | 9.2 | 12.9 | 8.2 | NaN | 15.1 | 10.1 | 32.599998 |
| 2011-01-01 | 18.4 | 28.299999 | 54.200001 | 26.9 | 15.4 | 68.699997 | 69.599998 | NaN | 18.4 | NaN | ... | 11.3 | 17.700001 | 13.8 | 9.1 | 13.4 | 7.0 | NaN | 13.7 | 9.7 | 36.700001 |
| 2012-01-01 | 18.200001 | 28.0 | 54.099998 | 27.200001 | 14.8 | 69.0 | 69.199997 | NaN | 18.299999 | NaN | ... | 10.8 | 17.299999 | 14.4 | 9.1 | 13.1 | 6.0 | 22.799999 | 13.2 | 9.0 | 26.799999 |
| 2013-01-01 | 17.0 | 27.799999 | 53.299999 | 27.1 | 13.6 | 68.800003 | 67.5 | 11.0 | 18.0 | 23.1 | ... | 10.8 | 17.1 | 14.1 | 9.7 | 13.7 | 5.6 | NaN | 12.9 | 8.4 | 26.200001 |
| 2014-01-01 | 15.1 | 27.700001 | 52.900002 | 26.4 | 12.9 | 68.5 | 67.800003 | NaN | 17.700001 | NaN | ... | 10.7 | 16.799999 | 14.6 | 9.6 | 12.4 | 5.3 | NaN | 12.8 | 8.1 | 29.4 |
| 2015-01-01 | 14.6 | 26.9 | 51.599998 | 26.299999 | 11.9 | 67.400002 | 65.699997 | 10.8 | 17.0 | 19.0 | ... | 10.3 | 16.200001 | 16.9 | 9.5 | 19.299999 | 5.0 | NaN | 12.4 | 7.7 | NaN |
| 2016-01-01 | 13.7 | 26.299999 | 50.0 | 25.9 | 11.4 | 67.5 | 60.0 | NaN | 16.6 | NaN | ... | 10.1 | 15.9 | 16.6 | NaN | 19.4 | 5.9 | NaN | 11.6 | 7.1 | NaN |
| 2017-01-01 | NaN | 27.4 | 52.299999 | 26.5 | 11.9 | 68.199997 | 67.5 | NaN | 17.6 | NaN | ... | 10.6 | 16.5 | 15.3 | 9.4 | 18.6 | 4.5 | NaN | 12.7 | 7.9 | 23.799999 |
| 2018-01-01 | NaN | 26.700001 | 50.700001 | 26.299999 | 11.7 | 66.699997 | 62.900002 | NaN | 16.700001 | NaN | ... | 10.3 | 16.0 | 17.0 | 9.5 | 19.299999 | 4.7 | 25.0 | 12.3 | 7.7 | NaN |
| 2019-01-01 | NaN | 26.299999 | 49.099998 | 26.1 | NaN | 67.0 | 58.799999 | NaN | 16.299999 | NaN | ... | 9.9 | 15.8 | NaN | NaN | 20.5 | 6.0 | NaN | NaN | 7.4 | NaN |
