# Section 2 - Descriptive Statistics and Basic Jupyter
You should also have downloaded:
- stocks.csv
- Mortality_Hypertension_America.csv

In [None]:
# Module for arrays.
import numpy as np
# Module for data frames.
import pandas as pd

## 0 Load and display

Load data on monthly stock returns from 1926 to 2021 ([source](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html)) and store it as a pandas DataFrame.
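
As a rough illustration of how pandas turns CSV text into a DataFrame, the sketch below parses a tiny in-memory table. The column layout and the numbers are made up for the example; the real `stocks.csv` may look different.

```python
import io
import pandas as pd

# Hypothetical stand-in for a returns file (made-up numbers, assumed layout).
toy_csv = io.StringIO(
    "Date,Mkt-RF\n"
    "200001,1.0\n"
    "200002,-2.0\n"
    "200003,0.5\n"
)

# read_csv accepts a file path or a file-like object and returns a DataFrame.
df_toy = pd.read_csv(toy_csv)

print(df_toy.shape)   # (number of rows, number of columns)
print(df_toy.dtypes)  # column types inferred by pandas
```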

In [None]:
# Load stock data as pandas DataFrame.
df_stocks = None            # TODO
# Display DataFrame.
display(df_stocks)

In [None]:
# Print the type of the "df_stocks" variable.
type(df_stocks)

Define `stock_returns` as the column 'Mkt-RF'.

In [None]:
# Choose the stock returns column
stock_returns = None            # TODO
stock_returns

In [None]:
# Print the type of the "stock_returns" variable.
type(stock_returns)

Define `ret` as the stock returns column, converted to a NumPy array.
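
One possible way to move between the two container types is the `to_numpy()` method of a pandas Series. The sketch below uses a made-up series (`toy_series` is a hypothetical name, not part of the exercise) to show how the type changes.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a column of returns.
toy_series = pd.Series([1.0, -2.0, 0.5], name="Mkt-RF")

# to_numpy() returns the underlying values as a NumPy array,
# dropping the index and the name.
toy_array = toy_series.to_numpy()

print(type(toy_series).__name__)  # Series
print(type(toy_array).__name__)   # ndarray
```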

In [None]:
# convert pandas column/series into a numpy array
ret = None            # TODO
ret

In [None]:
# Print the type of the "ret" variable.
type(ret)

## 1 Compute Statistics (manual implementation)
**Task:** implement your own functions for these descriptive statistics below. 
- mean
- variance
- standard deviation
- skewness
- kurtosis

The functions should be designed to operate on a one-dimensional `ndarray`. You may use other SciPy and NumPy functions in your implementations. Apply your functions to the `ret` data to test them.

In [None]:
# Sample mean.
# use len() and np.sum() 
def my_mean(arr):
    # Sample size (length of array).
    n = len(arr)
    # Sum of array elements divided by sample size.
    return None         # TODO

In [None]:
# Unbiased sample variance.
# use len(), my_mean(), and np.sum() 
# tip: make a variable for each meaningful term in the formula. makes it easier to debug
def my_var(arr):
    # Sample size.
    n = len(arr)
    # Array of deviations of array elements from sample mean.
    # Note that "array - scalar" will broadcast the subtraction.
    dev = arr - my_mean(arr)
    
    return None         # TODO

In [None]:
# Sample standard deviation derived from unbiased sample variance.
# use np.sqrt() and my_var()
def my_std(arr):
    return None         # TODO

Let $n$ be the number of samples, $x_1, \dots, x_n$ be the data, and $\overline{x}$ be the sample mean. Define the moments
$$m_k = \frac{1}{n} \sum_i (x_i - \overline{x})^k.$$

The adjusted (unbiased) Fisher-Pearson coefficient of **skewness** is
$$
 \frac{\sqrt{n(n-1)}}{n-2} \frac{m_3}{m_2^{3/2}}.
$$
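
To make the formula concrete, here is a small worked example on a made-up array with mean 4, so the arithmetic can be followed by hand: $m_2 = 10$, $m_3 = 36$, and the adjusted coefficient is $1.2\sqrt{2}$. The comparison with `scipy.stats.skew(..., bias=False)` is only a sanity check on this toy input, not a statement about the stock data.

```python
import numpy as np
from scipy.stats import skew

# Toy data chosen so that the mean is exactly 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
n = len(x)

# Deviations from the sample mean: [-3, -2, -1, 0, 6].
dev = x - x.mean()

# Central moments m_k = (1/n) * sum(dev**k).
m2 = np.sum(dev**2) / n   # 50 / 5 = 10
m3 = np.sum(dev**3) / n   # 180 / 5 = 36

# Adjusted Fisher-Pearson coefficient of skewness.
G1 = np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2**1.5

print(G1, skew(x, bias=False))
```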


In [None]:
# Adjusted (unbiased) Fisher-Pearson coefficient of skewness.
# See scipy.stats.skew documentation for the formula. Remember to center the data with the mean before using the formula
# use len(), my_mean(), np.sum(), np.sqrt()
# tip: as before, make variables for each meaningful term
def my_skew(arr):
    # Sample size.
    n = len(arr)
    # Centered data.
    arr_ctd = arr - my_mean(arr)
    # Third central sample moment.
    m3 = np.sum(arr_ctd**3) / n
    # Second central sample moment.
    m2 = np.sum(arr_ctd**2) / n
    # Bias-adjustment cofactor.
    adj = np.sqrt(n*(n-1)) / (n-2)
    return None         # TODO

The unbiased Fisher coefficient of **excess kurtosis** is
$$
 \frac{n-1}{(n-2)(n-3)} \left[\frac{(n+1) m_4}{m_2^2} - 3(n-1)\right].
$$
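
The same kind of hand-checkable example works for the kurtosis formula. With the made-up array below, $m_2 = 10$ and $m_4 = 278.8$, so the bracket evaluates to $6 \cdot 2.788 - 12 = 4.728$ and the coefficient to $\tfrac{2}{3} \cdot 4.728 = 3.152$. The SciPy call is included only as a cross-check on this toy input.

```python
import numpy as np
from scipy.stats import kurtosis

# Toy data with mean exactly 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
n = len(x)

# Deviations from the sample mean: [-3, -2, -1, 0, 6].
dev = x - x.mean()

# Second and fourth central moments.
m2 = np.sum(dev**2) / n   # 10
m4 = np.sum(dev**4) / n   # 1394 / 5 = 278.8

# Bias-adjusted excess kurtosis, term by term as in the formula above.
adj = (n - 1) / ((n - 2) * (n - 3))
G2 = adj * ((n + 1) * m4 / m2**2 - 3 * (n - 1))

print(G2, kurtosis(x, fisher=True, bias=False))
```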

In [None]:
# Adjusted Fisher coefficient of excess kurtosis.
# See "Sample kurtosis > Standard unbiased estimator" on the Wikipedia page for Kurtosis. This matches what scipy.stats.kurtosis computes, although its documentation unfortunately does not show the formula.
# use len(), my_mean(), np.sum()
# tip: as before, make variables for each meaningful term
def my_kurt(arr):
    # Sample size.
    n = len(arr)
    # Centered data.
    arr_ctd = arr - my_mean(arr)
    # Fourth central sample moment.
    m4 = np.sum(arr_ctd**4) / n
    # Second central sample moment.
    m2 = np.sum(arr_ctd**2) / n
    # Bias-adjustment cofactor.
    adj = (n-1) / ((n-2)*(n-3))
    return None         # TODO

Written for you is the code to print all these results out. 
- Take some time to understand what each line of code is doing. Presenting numbers with easy-to-read prints will be a helpful skill.
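
The format specification can also be tried in isolation. The sketch below uses made-up labels and values, and adds a precision specifier (`.4f`) as one optional way to make the numbers easier to read; the f-string line is an equivalent spelling of the same thing.

```python
# Left-align the label in a field of 12 characters, then print the value.
row = "{:<12}|{}".format("Mean", 0.5)
print(row)  # "Mean", 8 padding spaces, then "|0.5"

# The same row written as an f-string.
row_f = f"{'Mean':<12}|{0.5}"

# Right-align a number in 10 characters with 4 digits after the decimal point.
row_num = "{:<12}{:>10.4f}".format("Mean", 0.123456)
print(row_num)
```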

In [None]:
# List of labels to print.
label = ["Mean", "Variance", "Standard Deviation", "Skewness", "Kurtosis"]
# List of statistics.
my_value = [my_mean(ret), my_var(ret), my_std(ret), my_skew(ret), my_kurt(ret)]
# Print table.
print("*** My Functions ***")
for i in range(len(label)):
    # Print results in two columns.
    # "{}" is a placeholder for a value (any object, not only strings).
    # ":" indicates start of format specification.
    # "<" indicates left alignment.
    # "25" indicates field width of 25 characters.
    print("{:<25} {}".format(label[i], my_value[i]))

## 2 Compare against SciPy and NumPy

**Task:** 
- Compute the same quantities using the corresponding SciPy or NumPy functions with default arguments. 

In [None]:
# The point of this exercise:
# There are different definitions of the above statistics.
# Some implementations provide parameters that let you choose versions.

# Import specific functions from SciPy.
from scipy.stats import skew, kurtosis

# List of labels.
label = ["Mean", "Variance", "Standard Deviation", "Skewness", "Kurtosis"]
# List of statistics.
package_value = [None, None, None, None, None]        # TODO

# Print table.
print("{:<25} {:<20} {:<20}".format('', 'my functions', 'scipy/numpy'))
for i in range(len(label)):
    print("{:<25} {:<20} {:<20}".format(label[i], my_value[i], package_value[i]))

**Discuss:** Do your functions give different results from those of the numpy/scipy implementations?

If they are different:
- are your implementations incorrect, or 
- can you adjust some parameters of the SciPy or NumPy functions to get the same results as your manual implementations?
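
As a small illustration of how a default argument changes the answer, the sketch below compares `np.var` with and without `ddof=1`, and `scipy.stats.skew` with and without `bias=False`, on a made-up array. Here the sum of squared deviations is 50, so dividing by $n = 5$ gives 10 and dividing by $n - 1 = 4$ gives 12.5.

```python
import numpy as np
from scipy.stats import skew

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Default ddof=0 divides by n; ddof=1 divides by n - 1.
var_default = np.var(x)           # 10.0
var_unbiased = np.var(x, ddof=1)  # 12.5

# Default bias=True returns m3 / m2**1.5 without the sample-size adjustment.
skew_default = skew(x)
skew_adjusted = skew(x, bias=False)

print(var_default, var_unbiased)
print(skew_default, skew_adjusted)
```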

In [None]:
# Results differ for variance, standard deviation, skewness, and kurtosis.
# This is because the default parameters set the functions to compute alternative versions of the statistics.

# We can adjust the parameters to match our versions of the statistics.
# Note the extra arguments to the functions.
label = ["Mean", "Variance", "Standard Deviation", "Skewness", "Kurtosis"]
package_value_adj = [None, None, None, None, None] # TODO

# Print table.
print("{:<25} {:<20} {:<20}".format('', 'my functions', 'scipy/numpy'))
for i in range(len(label)):
    print("{:<25} {:<20} {:<20}".format(label[i], my_value[i], package_value_adj[i]))

# Results are now within a roundoff error of each other.
# Roundoff error is due to alternative arithmetic implementation.

## 3 Are stock returns normally distributed? Visual check through histograms.

The normal distribution is commonly used to model natural and social phenomena. Investigate whether the hypothesis that stock returns are normally distributed is plausible.

**Task:**
- Simulate draws from a normal distribution having 
    - mean equal to the sample mean of the stock returns and 
    - standard deviation equal to the sample standard deviation of the stock returns. 
    - Use a random state of 0, and 
    - for the number of draws, use the number of stock return observations.
- (Done for you) Compare histograms of the simulated data and the actual stock returns.
    - **Discuss:** Note that both histograms use the same bin edges (`bins=mybins`). If you changed `bins=mybins` to `bins=100`, what difference would you see?

        **Ans:** 
        - The returns exhibit fatter tails in comparison to the simulated normal data. 
        - This is called "excess kurtosis" or "leptokurtosis." It indicates that extreme outcomes are more probable than under a normal distribution.
        - Excess kurtosis is a commonly observed feature of asset returns. The hypothesis that the historical stock return is normally distributed is implausible based on differing kurtosis.
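
The effect of sharing bin edges can be seen without plotting. The sketch below uses `np.histogram` on two made-up samples (a standard normal and a heavier-tailed Student t with 3 degrees of freedom, both chosen only for illustration). Binning each sample with `bins=100` gives each one its own bin width, while reusing the edges from one sample puts both sets of counts on the same grid.

```python
import numpy as np

rng = np.random.default_rng(0)
narrow = rng.normal(loc=0.0, scale=1.0, size=1000)  # light tails
wide = rng.standard_t(df=3, size=1000)              # heavy tails

# bins=100 splits each sample's own range into 100 equal-width bins.
_, edges_wide = np.histogram(wide, bins=100)
_, edges_narrow = np.histogram(narrow, bins=100)
width_wide = edges_wide[1] - edges_wide[0]
width_narrow = edges_narrow[1] - edges_narrow[0]
print(width_wide, width_narrow)  # different widths, so bar heights are not comparable

# Reusing one set of edges makes the counts directly comparable.
# Values falling outside the shared edges are not counted.
counts_wide, _ = np.histogram(wide, bins=edges_wide)
counts_narrow, _ = np.histogram(narrow, bins=edges_wide)
print(counts_wide.sum(), counts_narrow.sum())
```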

In [None]:
# Import the normal distribution object from SciPy.
# We import it under an alias because "norm" is also a common name for vector norms.
from scipy.stats import norm as gaussian

# Simulate normal random variates
sim_data = gaussian.rvs(loc=None, scale=None, size=None, random_state=0) # TODO
sim_data

In [None]:
# Import module for plotting.
import matplotlib.pyplot as plt

# Plot frequency histograms.
_, mybins, _ = plt.hist(ret, bins=100, label="Actual Returns")
plt.hist(sim_data, bins=mybins, alpha=0.66, label="Simulated Returns")

plt.legend()
plt.show()

# Optional, BME: More ways to check normality of data
## 4.1 Box Plot

The box plot is a visualization method that can help detect when a sample does not follow a normal distribution. First, we define the quartiles:
- first quartile, Q1
- third quartile, Q3
- interquartile range, IQR = Q3 - Q1


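A box plot in the matplotlib library displays the following summary statistics of the data:
- Upper whisker: the largest data point at or below Q3 + 1.5*IQR
- Q3
- Median (which is also Q2)
- Q1
- Lower whisker: the smallest data point at or above Q1 - 1.5*IQR

Points beyond the whiskers are drawn individually as outliers.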

**Task:** 
- Generate two random samples of 300 draws each from the following normal distributions:
    1. $\mathcal{N}(70,40)$
    2. $\mathcal{N}(100,10)$
- Generate box plots of these samples and compare them. (Use a random seed of 0.)
- **Discuss:** which features of the box plot reflect the summary statistics stated above?
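
The quantities behind a box plot can be computed directly, which may help when reading the figure. The sketch below is only an illustration on a made-up standard normal sample (not the samples requested in the task) and assumes the default whisker length of 1.5 times the IQR.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=300)

# Quartiles and interquartile range.
q1, med, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that lie inside the fences.
whisker_lo = sample[sample >= lower_fence].min()
whisker_hi = sample[sample <= upper_fence].max()

# Anything outside the fences would be drawn as an individual outlier.
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(q1, med, q3, iqr)
print(whisker_lo, whisker_hi, len(outliers))
```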

In [None]:
# Create two normally distributed random samples
D1 = gaussian.rvs(loc=None, scale=None, size=None, random_state=0) # TODO
D2 = gaussian.rvs(loc=None, scale=None, size=None, random_state=0) # TODO
Data = [D1, D2]

fig = plt.figure(figsize =(5, 4))

ax = fig.add_axes([0, 0, 1, 1]) # Creating axes instance
bp = ax.boxplot(Data)           # Creating plot
ax.set_xticklabels(['data_1', 'data_2']) # x-axis labels

plt.show()

Load the mortality hypertension data, which is obtained from the World Health Organization (WHO) website.

In [None]:
df_mortality_raw = pd.read_csv('Mortality_Hypertension_America.csv', skiprows=2)
df_mortality_raw

**Task:**
- Restrict the data to the years 1951-2019.
- Keep only the rows where the 'Sex' column is 'Male' or 'Female' (not 'All').
- Keep only the rows where the 'Age Group' column is '[All]'.
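
Row filters like these are usually written as boolean masks combined with `&`. The sketch below shows the pattern on a made-up table. The column names 'Year' and 'Number' and all values are assumptions for illustration, so check them against the actual file before reusing the idea.

```python
import pandas as pd

# Made-up table; 'Year' and 'Number' are hypothetical column names.
toy = pd.DataFrame({
    "Year": [1950, 1951, 1951, 1951, 2019, 2020],
    "Sex": ["Male", "Male", "Female", "All", "Female", "Male"],
    "Age Group": ["[All]", "[All]", "[0]", "[All]", "[All]", "[All]"],
    "Number": [10.0, 11.0, 3.0, 20.0, 15.0, 16.0],
})

# Each condition yields a boolean Series; & keeps rows where all hold.
mask = (
    toy["Year"].between(1951, 2019)          # inclusive on both ends
    & toy["Sex"].isin(["Male", "Female"])    # drops the 'All' rows
    & (toy["Age Group"] == "[All]")
)

filtered = toy[mask].reset_index(drop=True)
print(filtered)
```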

In [None]:
df_mortality = None             # TODO
df_mortality

**Task:**
- Make a boxplot of the number of mortalities, by Sex. This can be done using the pandas boxplot method.
- Discuss if the distribution looks normal, based on the previous boxplot visualization of random normal data.

In [None]:
# TODO

## 4.2 Violin Plot

Violin plots provide additional information by adding a density trace.

**Task:**
- Define `male_mortality` and `female_mortality`, which are numpy arrays containing the number of mortalities for the corresponding sex. The arrays should be extracted from the `df_mortality` dataframe.
- Use these arrays to plot side-by-side a violin plot and a box plot. 
- How do you interpret your result?

In [None]:
male_mortality = None             # TODO
female_mortality = None             # TODO

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols=2, figsize=(12,8))

# Combine the male and female arrays; they should have a float dtype rather than object.
Data = [male_mortality, female_mortality]

# plot Violin plot
axes[0].violinplot(Data, showmeans=False, showmedians=True)
axes[0].set_title('violin plot')
axes[0].set_xticks([1,2])
axes[0].set_xticklabels(['Male', 'Female'])
axes[0].set_ylabel('Mortality')

# plot Box plot
axes[1].boxplot(Data)
axes[1].set_title('box plot')
axes[1].set_xticklabels(['Male', 'Female'])
axes[1].set_ylabel('Mortality')

plt.show()

## 4.3 Q-Q plot

A Q-Q (quantile-quantile) plot is another visualization tool for judging whether a sample was drawn from a pre-defined theoretical distribution (e.g., the normal distribution). In Python, we can use the **probplot** function from scipy.stats to generate Q-Q plots.

- First, generate a Q-Q plot for 1000 samples from a $\mathcal{N}(50,10)$ distribution. (Use a seed value of zero.)
- Then generate Q-Q plots for the male and female mortality samples. Make sure to standardize your data first.

How do you interpret your result?
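
When called without a `plot` argument, `probplot` returns the plotted points together with a least-squares line fit, and the correlation coefficient of that fit gives a rough numeric summary of how straight the Q-Q plot is. The sketch below standardizes two made-up samples (the `standardize` helper and both samples are illustrative assumptions) and compares that coefficient for a normal and a skewed sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_like = rng.normal(loc=5.0, scale=2.0, size=1000)
skewed = rng.exponential(scale=2.0, size=1000)

def standardize(arr):
    """Subtract the sample mean and divide by the sample standard deviation."""
    return (arr - arr.mean()) / arr.std(ddof=1)

z_normal = standardize(normal_like)
z_skewed = standardize(skewed)

# probplot returns ((theoretical quantiles, ordered data), (slope, intercept, r)).
_, (slope_n, intercept_n, r_n) = stats.probplot(z_normal, dist="norm")
_, (slope_s, intercept_s, r_s) = stats.probplot(z_skewed, dist="norm")

# r close to 1 means the points lie close to a straight line.
print(r_n, r_s)
```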

In [None]:
male_mortality_normalized = None             # TODO
female_mortality_normalized = None             # TODO

In [None]:
import scipy.stats as stats

fig = plt.figure(figsize =(15, 5))

# generate 1000 samples from the normal distribution specified in the task above
normal_data = gaussian.rvs(loc=None, scale=None, size=None, random_state=0) # TODO

plt.subplot(1, 3, 1)
stats.probplot(normal_data, dist="norm", plot=plt)
plt.title('Normal distribution')

# plot standardized male population mortality samples against normal distribution
plt.subplot(1, 3, 2)
stats.probplot(male_mortality_normalized, dist="norm", plot=plt)
plt.title('Male mortality data')

# plot standardized female population mortality samples against normal distribution
plt.subplot(1, 3, 3)
stats.probplot(female_mortality_normalized, dist="norm", plot=plt)
plt.title('Female mortality data')

plt.show()