# Project - Testing Normality of Stock Market Returns

In this chapter, we examine the daily returns of Microsoft's stock to determine whether they follow a normal distribution.

### Getting Microsoft stock price data

Twenty years of Microsoft stock price data is stored in the `msft20.csv` file in the `stocks` directory. Let's read it in now, setting the `date` column as the index.

In [None]:
import pandas as pd
msft = pd.read_csv('../data/stocks/msft20.csv', parse_dates=['date'], index_col='date')
msft.head()

### Select the closing price
For this problem, we are only interested in the closing price. Select the `adjusted_close` as this is adjusted for any stock splits that may have occurred.

In [None]:
close = msft['adjusted_close']
close.head(3)

### Daily percent change
pandas Series have a method called `pct_change` which returns the percentage difference between the current and previous elements. Let's use it to calculate the daily return.

In [None]:
close_change = close.pct_change()
close_change.head(3)
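To make the calculation concrete, here is a minimal sketch on a tiny Series of made-up prices (the values are hypothetical and are not taken from `msft20.csv`). It compares `pct_change` with the same quantity computed by hand using `shift`.

```python
import pandas as pd

# Hypothetical toy prices, used only for illustration
prices = pd.Series([100.0, 102.0, 99.96])

# Percent change by hand: current price divided by previous price, minus 1
manual = prices / prices.shift(1) - 1

# The built-in method should give the same values
builtin = prices.pct_change()

print(builtin)
print(manual)
```

The first element of both results is `NaN` because the first price has no previous value to compare against.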

### Handling missing values
The first date has a missing value because there is no previous price to compare it against. The `dropna` method can be used to remove any `NaN` elements.

In [None]:
close_change = close_change.dropna()
close_change.head(3)

### Checking for Normality

There are formal statistical tests for normality that can be used. Instead we will focus on simple data exploration to give us insight.

### Plotting the returns

The main plotting library in Python is matplotlib which will be covered in greater detail in the **Visualization** part. To output plots directly into the notebook, the magic command `%matplotlib inline` must be executed first. 

In [None]:
%matplotlib inline

pandas objects have hooks into matplotlib so it's not necessary to import matplotlib directly. The `plot` method can be used to create a number of different kinds of plots with the `kind` parameter. We pass it the string `hist` to create a histogram, along with a histogram-specific argument `bins` to control the number of bins which is also the number of bars plotted. We also change the size of the figure (in inches) by passing a tuple to the `figsize` parameter.

In [None]:
close_change.plot(kind='hist', bins=40, figsize=(8, 3));

### Use boolean selection to check for normality

The plot above is symmetrical and somewhat bell-shaped, so it could possibly represent a normal distribution. To check more rigorously, we can count the number of observations that fall within 1, 2, and 3 standard deviations of the mean. The [68-95-99.7 rule](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule) states that for normally distributed data, approximately 68%, 95%, and 99.7% of the values fall within those ranges, so it can be used to determine if the data is approximately normal. We first need to calculate the mean and standard deviation.

In [None]:
mean = close_change.mean()
std = close_change.std()
mean, std

### Absolute number of standard deviations from the mean

To standardize our results we can find the number of standard deviations away from the mean each daily return is. To do this, we subtract the mean from the entire Series. We then divide by the standard deviation. This quantity is referred to as the **z-score**.

In [None]:
z_score = (close_change - mean) / std

To help make calculations easier, we will use the absolute value of the z-score.

In [None]:
z_score_abs = z_score.abs()
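As a quick sanity check on the formula, the sketch below standardizes a handful of made-up returns (hypothetical values, not Microsoft data) and confirms that the resulting z-scores have a mean of 0 and a standard deviation of 1.

```python
import pandas as pd

# Hypothetical daily returns, used only for illustration
returns = pd.Series([0.01, -0.02, 0.03, 0.00, -0.01])

# Subtract the mean, then divide by the standard deviation
z = (returns - returns.mean()) / returns.std()

# After standardizing, the mean is (numerically) 0 and the std is 1
print(z.mean(), z.std())
```

Note that the pandas `std` method computes the sample standard deviation (`ddof=1`) by default.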

### Find the percentage by taking the mean
Let's find the percentage of returns within 1, 2, and 3 standard deviations of the mean. Since `True` evaluates as 1 and `False` as 0, taking the mean of a boolean Series gives the fraction of values that are `True`.

In [None]:
pct_within1 = (z_score_abs < 1).mean().round(3)
pct_within2 = (z_score_abs < 2).mean().round(3)
pct_within3 = (z_score_abs < 3).mean().round(3)
pct_within1, pct_within2, pct_within3
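For comparison, it can help to run the same calculation on data that is known to be normal. The sketch below is an illustration only: it draws synthetic values with NumPy (the seed, mean, scale, and sample size are arbitrary assumptions) and applies the same boolean-mean technique.

```python
import numpy as np
import pandas as pd

# Synthetic, normally distributed "returns"; all parameters are arbitrary
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=0.001, scale=0.02, size=100_000))

# Absolute z-score, computed the same way as above
z_abs = ((sample - sample.mean()) / sample.std()).abs()

# Fraction of values within 1, 2, and 3 standard deviations of the mean
within = [(z_abs < k).mean() for k in (1, 2, 3)]
print([round(p, 3) for p in within])
```

For a sample this large, the three fractions should land close to 0.68, 0.95, and 0.997, which gives a baseline to compare the stock returns against.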

### Results Discussion
The percentages of returns within 1, 2, and 3 standard deviations differ noticeably from the 68-95-99.7 rule. Much more of the data is concentrated within 1 standard deviation of the mean. At the same time, a much greater percentage of the returns lies more than 3 standard deviations from the mean than the 0.3% predicted by the rule. This strongly suggests that a normal distribution is not a good fit for this type of data.

### Using the percentile to check for normality
Alternatively, we can work 'backwards' and find the absolute z-scores at the 68th, 95th, and 99.7th percentiles of the distribution. For normally distributed data, we would expect these to be approximately 1, 2, and 3 respectively. The `quantile` method completes this operation for us. Note how far off the 68th and 99.7th percentiles are.

In [None]:
z_score_abs.quantile([.68, .95, .997]).round(2)
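The same baseline idea applies here. The sketch below is an illustration on synthetic normal data (seed and sample size are arbitrary assumptions), showing what `quantile` returns for absolute z-scores when the underlying data really is normal.

```python
import numpy as np
import pandas as pd

# Synthetic standard normal values; seed and size are arbitrary
rng = np.random.default_rng(1)
sample = pd.Series(rng.normal(size=100_000))

# Absolute z-scores of the synthetic sample
z_abs = ((sample - sample.mean()) / sample.std()).abs()

# Absolute z-scores at the 68th, 95th, and 99.7th percentiles
q = z_abs.quantile([.68, .95, .997])
print(q.round(2))
```

With normal data these quantiles should come out near 1, 2, and 3, so large departures in the stock data are evidence against normality.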

### Check that all Series values are `True`
Let's say we wanted to check if all the stock price returns were within 4 standard deviations from the mean. For boolean Series, the `all` method returns `True` if all values are `True` and `False` otherwise.

In [None]:
criteria = z_score_abs < 4
criteria.head(3)

In [None]:
criteria.all()

We can answer the same question from the opposite direction with the `any` method, which returns `True` if one or more values in the Series are `True`. Here, we check if any of the returns are 4 or more standard deviations from the mean. The result is the logical opposite of the `all` check above.

In [None]:
criteria = z_score_abs >= 4
criteria.any()
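The relationship between the two checks can be seen on a small example. This is a minimal sketch using made-up absolute z-scores (hypothetical values), one of which is at least 4.

```python
import pandas as pd

# Hypothetical absolute z-scores; exactly one value is 4 or greater
z_abs = pd.Series([0.3, 1.2, 4.5, 2.1])

# True only if every value is below 4
all_below_4 = (z_abs < 4).all()

# True if at least one value is 4 or greater
any_at_least_4 = (z_abs >= 4).any()

# The two results are logical opposites of each other
print(all_below_4, any_at_least_4)
```

This equivalence assumes the Series has no missing values, since a comparison with `NaN` is `False` in both directions.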

## Exercises

Execute the cells below to read in 20 years of Apple (AAPL) data as a Series, then use it to answer the exercises that follow.

In [None]:
stocks = pd.read_csv('../data/stocks/stocks10.csv', index_col='date', parse_dates=['date'])
stocks.head(3)

In [None]:
aapl = stocks['AAPL']
aapl.head()

### Exercise 1

<span  style="color:green; font-size:16px">Use one line of code to find the daily percentage returns of AAPL and drop any missing values. Save the result to `aapl_change`.</span>

### Exercise 2

<span  style="color:green; font-size:16px">Find the mean daily return for Apple, the first and last closing prices, and the number of trading days. Store all four of these values into separate variables.</span>

### Exercise 3

<span  style="color:green; font-size:16px">If Apple returned its mean percentage return every single day since the first day you have data, what would its last closing price be? Is it the same as the actual last closing price? You need to use all the variables calculated from Exercise 2.</span>

### Exercise 4

<span  style="color:green; font-size:16px">Find the z-score for the Apple daily returns. Save this to a variable `z_score_raw`. What are the maximum and minimum z-scores?</span>

### Exercise 5

<span  style="color:green; font-size:16px">What percentage did Apple stock increase on the day it had its maximum raw z-score?</span>

### Exercise 6

<span  style="color:green; font-size:16px">Create a function that accepts a Series of stock closing prices. Have it return the percentage of daily returns within 1, 2, and 3 standard deviations from the mean. Use your function to return results for different stocks found in the `stocks` DataFrame.</span>

### Exercise 7

<span  style="color:green; font-size:16px"> How many days did Apple close above 100 and below 120?</span>

### Exercise 8

<span  style="color:green; font-size:16px"> How many days did Apple close below 50 or above 150?</span>

### Exercise 9

<span  style="color:green; font-size:16px"> Look up the definition for interquartile range and slice Apple closing prices so it contains just the interquartile range. There are multiple ways to do this. Check the `quantile` method.</span>

### Exercise 10

<span  style="color:green; font-size:16px">Find the date of the highest closing price. Find out how many trading days it has been since Apple recorded its highest closing price.</span>