Sometimes mean and variance are not enough to describe a distribution. <font color='blue'>When we calculate variance, we square the deviations around the mean, which discards their sign.

<font color='blue'>As a result, when deviations are large, the variance alone does not tell us whether they are more likely to be positive or negative. This is where the skewness and symmetry of a distribution come in.

<font color='blue'>A distribution is symmetric if the parts on either side of the mean are mirror images of each other.</font> For example, the normal distribution is symmetric.

In [None]:
# Plot a standard normal distribution (mean = 0, standard deviation = 1, the scipy defaults)
xs = np.linspace(-6, 6, 300)
normal = stats.norm.pdf(xs)
plt.plot(xs, normal);

<font color='blue'>A distribution which is not symmetric is called skewed</font>. For instance, a distribution can have many small positive and a few large negative values (negatively skewed) or vice versa (positively skewed), and still have a mean of 0.

<font color='blue'>A symmetric distribution has skewness 0.

<font color='blue'>Positively skewed unimodal (one mode) distributions typically have the property that mean > median > mode.

<font color='blue'>Negatively skewed unimodal distributions are typically the reverse, with mean < median < mode.

<font color='blue'>All three are equal for a symmetric unimodal distribution.
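
As a quick illustration of this ordering, the sketch below uses a small made-up, positively skewed data set (the values are an assumption chosen only for illustration) and Python's standard `statistics` module.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, a few are large
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

print("mean = {}, median = {}, mode = {}".format(mean, median, mode))
# Check the ordering mean > median > mode for this sample
print(mean > median > mode)
```

The few large values pull the mean upward, while the median and mode stay near the bulk of the data.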

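The explicit formula for the skewness of a sample of size $n$, with $\mu$ and $\sigma$ taken to be the sample mean and sample standard deviation, is: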
<font color='blue'>$ S_K = \frac{n}{(n-1)(n-2)} \frac{\sum_{i=1}^n (X_i - \mu)^3}{\sigma^3} $. 

<font color='blue'>The sign of this quantity describes the direction of the skew as described above. 

<font color='blue'>A negative skew typically indicates that the tail is fatter on the left, while a positive skew indicates that the tail is fatter on the right.
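
To make the formula and its sign concrete, here is a minimal sketch that evaluates it directly. It assumes that $\sigma$ in the formula is the sample standard deviation (dividing by $n - 1$). The function name `sample_skewness` and the two data sets are hypothetical, made up for illustration. Note that `stats.skew` uses `bias=True` by default, so it will only match this formula when called with `bias=False`.

```python
import math

def sample_skewness(xs):
    """Sample skewness: n / ((n-1)(n-2)) * sum((x - mean)^3) / s^3."""
    n = len(xs)
    mean = sum(xs) / float(n)
    # Sample standard deviation (divides by n - 1); an assumption about sigma
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    third = sum((x - mean) ** 3 for x in xs)
    return n / float((n - 1) * (n - 2)) * third / s ** 3

# Hypothetical samples, both with mean 0
left_tailed = [1] * 9 + [-9]    # many small positive values, one large negative
right_tailed = [-1] * 9 + [9]   # the mirror image

skew_left = sample_skewness(left_tailed)
skew_right = sample_skewness(right_tailed)
print("left-tailed skew:  {:.4f}".format(skew_left))
print("right-tailed skew: {:.4f}".format(skew_right))
```

Both samples have the same mean and variance, so only the sign of the skewness distinguishes the direction of the tail.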

Although skew is less obvious when graphing discrete data sets, we can still compute it. 

In [None]:
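# Fetch daily SPY prices from the start of 2012 to the start of 2015
start = '2012-01-01'
end = '2015-01-01'
pricing = get_pricing('SPY', fields='price', start_date=start, end_date=end)
# Daily percent returns; drop the first entry, which is NaN
returns = pricing.pct_change()[1:]
print('Skew: {}'.format(stats.skew(returns)))
print('Mean: {}'.format(np.mean(returns)))
print('Median: {}'.format(np.median(returns)))
plt.hist(returns, 30);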

Note that the skew is negative and the mean is less than the median, as we would expect for a negatively skewed distribution.

<font color='blue'>Kurtosis attempts to measure the shape of the deviation from the mean. 

Generally, it describes <font color='blue'>how peaked a distribution is compared to the normal distribution, which is called mesokurtic.

<font color='blue'> All normal distributions, regardless of mean and variance, have a kurtosis of 3.

<font color='blue'> A leptokurtic distribution (kurtosis > 3) is highly peaked and has fat tails, while a platykurtic distribution (kurtosis < 3) is broad.

<font color='blue'>Sometimes, however, kurtosis in excess of the normal distribution (kurtosis - 3) is used instead, and this is the default in scipy.

<font color='blue'>A leptokurtic distribution has more frequent large jumps away from the mean than a normal distribution does, while a platykurtic distribution has fewer.
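
The claim that every normal distribution has a kurtosis of 3 can be checked numerically. The following is a rough sketch that approximates $E[(X - \mu)^4] / \sigma^4$ with a simple Riemann sum. The helper names, the integration range of 12 standard deviations, and the step count are all assumptions made for this illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_kurtosis(mu, sigma, steps=20000):
    """Approximate E[(X - mu)^4] / sigma^4 with a Riemann sum over mu +/- 12 sigma."""
    lo = mu - 12 * sigma
    h = 24.0 * sigma / steps
    fourth_moment = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        fourth_moment += (x - mu) ** 4 * normal_pdf(x, mu, sigma) * h
    return fourth_moment / sigma ** 4

# Different means and standard deviations should all give a kurtosis near 3
for mu, sigma in [(0, 1), (5, 2), (-3, 0.5)]:
    k = normal_kurtosis(mu, sigma)
    print("mu = {}, sigma = {}: kurtosis = {:.4f}".format(mu, sigma, k))
```

Subtracting 3 from each of these values gives the excess kurtosis, which is why the normal distribution serves as the zero point of that scale.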

In [None]:
# Plot some example distributions and print the theoretical excess kurtosis of each
plt.plot(xs, stats.laplace.pdf(xs), label='Leptokurtic')
print('Excess kurtosis of leptokurtic distribution: {}'.format(stats.laplace.stats(moments='k')))
plt.plot(xs, normal, label='Mesokurtic (normal)')
print('Excess kurtosis of mesokurtic distribution: {}'.format(stats.norm.stats(moments='k')))
plt.plot(xs, stats.cosine.pdf(xs), label='Platykurtic')
print('Excess kurtosis of platykurtic distribution: {}'.format(stats.cosine.stats(moments='k')))
plt.legend();

The formula for kurtosis is
<font color='blue'>$$ K = \left ( \frac{n(n+1)}{(n-1)(n-2)(n-3)} \frac{\sum_{i=1}^n (X_i - \mu)^4}{\sigma^4} \right ) $$

while <font color='blue'>excess kurtosis is given by
$$ K_E = \left ( \frac{n(n+1)}{(n-1)(n-2)(n-3)} \frac{\sum_{i=1}^n (X_i - \mu)^4}{\sigma^4} \right ) - \frac{3(n-1)^2}{(n-2)(n-3)} $$

<font color='blue'>For a large number of samples, the excess kurtosis becomes approximately

<font color='blue'>$$ K_E \approx \frac{1}{n} \frac{\sum_{i=1}^n (X_i - \mu)^4}{\sigma^4} - 3 $$
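
To see how quickly the approximation takes over, the sketch below evaluates both expressions on evenly spaced points in $[0, 1]$, for which the excess kurtosis should be close to the uniform distribution's value of $-1.2$. It assumes $\sigma$ is the sample standard deviation (dividing by $n - 1$) in both formulas. The function names and the test data are hypothetical.

```python
def excess_kurtosis_exact(xs):
    """Small-sample excess kurtosis, following the formula for K_E."""
    n = len(xs)
    mean = sum(xs) / float(n)
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)    # sample variance
    ratio = sum((x - mean) ** 4 for x in xs) / s2 ** 2
    lead = n * (n + 1) / float((n - 1) * (n - 2) * (n - 3))
    correction = 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * ratio - correction

def excess_kurtosis_approx(xs):
    """Large-sample approximation: (1/n) * sum((x - mean)^4) / s^4 - 3."""
    n = len(xs)
    mean = sum(xs) / float(n)
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return sum((x - mean) ** 4 for x in xs) / s2 ** 2 / n - 3.0

gaps = []
for n in [10, 100, 1000]:
    grid = [i / float(n - 1) for i in range(n)]   # evenly spaced points on [0, 1]
    exact = excess_kurtosis_exact(grid)
    approx = excess_kurtosis_approx(grid)
    gaps.append(abs(exact - approx))
    print("n = {:4d}: exact = {:.4f}, approx = {:.4f}".format(n, exact, approx))
```

The gap between the two values shrinks as $n$ grows, which is the sense in which the approximation is a large-sample result.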

We can use scipy to find the excess kurtosis of the S&P 500 returns from before.

In [None]:
print('Excess kurtosis of returns: {}'.format(stats.kurtosis(returns)))

<font color='blue'>It's no coincidence that the variance, skewness, and kurtosis take similar forms. They are all built from the central moments $E[(X - E[X])^k]$; dividing by $\sigma^k$ gives the standardized moments, of which the $k$th has the form
$ \frac{E[(X - E[X])^k]}{\sigma^k} $

<font color='blue'>The first standardized moment is always 0 $(E[X - E[X]] = E[X] - E[E[X]] = 0)$ and the second is always 1 (the variance divided by $\sigma^2$), so it is the third (skewness) and fourth (kurtosis) that describe the shape. All of the standardized moments are dimensionless numbers which describe the distribution, and in particular can be used to quantify how close to normal (having standardized moments $0, 1, 0, 3$) a distribution is.
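
As a small worked example, the sketch below computes the first four standardized moments of a made-up sample. It uses the population standard deviation (dividing by $n$) so that the moments follow the definition exactly. The function name and the data are hypothetical.

```python
import math

def standardized_moment(xs, k):
    """k-th standardized moment: mean((x - mean)^k) / sigma^k, population sigma."""
    n = float(len(xs))
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum((x - mean) ** k for x in xs) / n / sigma ** k

# Hypothetical sample with a long right tail
sample = [2.0, 3.5, 3.5, 4.0, 4.5, 5.0, 7.0, 12.0]

moments = [standardized_moment(sample, k) for k in range(1, 5)]
for k, m in enumerate(moments, start=1):
    print("standardized moment {}: {:.4f}".format(k, m))
```

Whatever the data, the first value comes out as 0 and the second as 1 by construction (up to floating-point rounding), while the third and fourth are the skewness and kurtosis.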

<font color='blue'>The Jarque-Bera test is a common statistical test that compares whether sample data has skewness and kurtosis similar to a normal distribution.

We can run it here on the S&P 500 returns to find the p-value for them coming from a normal distribution.

<font color='blue'>The Jarque-Bera test's null hypothesis is that the data came from a normal distribution. Because of this, it can err on the side of not catching a non-normal process if you use a low p-value cutoff. To be safe, it can be good to increase your cutoff when using the test.
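
To show what the test measures, here is a from-scratch sketch of the usual form of the statistic, $JB = \frac{n}{6}\left(S^2 + \frac{(K - 3)^2}{4}\right)$, where $S$ and $K$ are the sample skewness and kurtosis computed from moments that divide by $n$. The p-value comes from a chi-squared distribution with 2 degrees of freedom, whose survival function is $e^{-x/2}$. The function name and the two idealized samples (built from evenly spaced quantiles rather than random draws) are assumptions for illustration; a library implementation should be preferred in practice.

```python
import math
from statistics import NormalDist

def jarque_bera_by_hand(xs):
    """Return (JB statistic, p-value) with JB = n/6 * (S^2 + (K - 3)^2 / 4)."""
    n = len(xs)
    mean = sum(xs) / float(n)
    m2 = sum((x - mean) ** 2 for x in xs) / float(n)
    m3 = sum((x - mean) ** 3 for x in xs) / float(n)
    m4 = sum((x - mean) ** 4 for x in xs) / float(n)
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Chi-squared survival function with 2 degrees of freedom is exp(-x / 2)
    return jb, math.exp(-jb / 2.0)

n = 1000
probs = [(i + 0.5) / n for i in range(n)]
normal_like = [NormalDist().inv_cdf(p) for p in probs]   # idealized normal sample
skewed = [-math.log(1.0 - p) for p in probs]             # idealized exponential sample

jb_normal, p_normal = jarque_bera_by_hand(normal_like)
jb_skewed, p_skewed = jarque_bera_by_hand(skewed)
print("normal-like sample: JB = {:.3f}, p-value = {:.3f}".format(jb_normal, p_normal))
print("skewed sample:      JB = {:.3f}, p-value = {:.3g}".format(jb_skewed, p_skewed))
```

A large statistic, and therefore a small p-value, means the sample's skewness and kurtosis are far from the normal values of 0 and 3, so the null hypothesis of normality is rejected.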

<font color='blue'>Remember to treat p-values as binary and not try to read into them or compare them.</font> We'll use a cutoff of 0.05 for our p-value.

Remember that each test is written a little differently across different programming languages. You might not know whether it's the null or alternative hypothesis that the tested data comes from a normal distribution.

We recommend using the ? notation and online searching to find documentation on the test. It is also often a good idea to calibrate a test by checking it on simulated data and making sure it gives the right answer.

In [None]:
from statsmodels.stats.stattools import jarque_bera

N = 1000  # number of simulated data sets
M = 1000  # observations per data set
pvalues = np.empty(N)
for i in range(N):
    # Draw M samples from a standard normal distribution
    X = np.random.normal(0, 1, M)
    _, pvalue, _, _ = jarque_bera(X)
    pvalues[i] = pvalue

# Fraction of p-values below our default 0.05 cutoff; expected to be roughly 0.05 for normal data
num_significant = np.sum(pvalues < 0.05)
print(float(num_significant) / N)

In [None]:
_, pvalue, _, _ = jarque_bera(returns)
if pvalue > 0.05:
    print('The returns are likely normal.')
else:
    print('The returns are likely not normal.')

This tells us that the S&P 500 returns likely do not follow a normal distribution.