# Project 2: Breakout Strategy
## Instructions
Each problem consists of a function to implement and instructions on how to implement the function.  The parts of the function that need to be implemented are marked with a `# TODO` comment. After implementing the function, run the cell to test it against the unit tests we've provided. For each problem, we provide one or more unit tests from our `project_tests` package. These unit tests won't tell you if your answer is correct, but will warn you of any major errors. Your code will be checked for the correct solution when you submit it to Udacity.

## Packages
When you implement the functions, you'll only need to use the [Pandas](https://pandas.pydata.org/) and [NumPy](http://www.numpy.org/) packages. Don't import any other packages; otherwise, the grader will not be able to run your code.

The other packages that we're importing are `helper`, `project_helper`, and `project_tests`. These are custom packages built to help you solve the problems.  The `helper` and `project_helper` modules contain utility functions and graph functions. The `project_tests` module contains the unit tests for all the problems.
### Install Packages

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

### Load Packages

In [None]:
import pandas as pd
import numpy as np
import helper
import project_helper
import project_tests

## Market Data
The data source we'll be using is the [Wiki End of Day data](https://www.quandl.com/databases/WIKIP) hosted at [Quandl](https://www.quandl.com). This contains data for many stocks, but we'll just be looking at the S&P 500 stocks. We'll also make things a little easier to solve by narrowing the time range to the `start_date` and `end_date` set in the download cell below (2013-07-01 to 2017-06-30).
### Set API Key
Set the `quandl_api_key` variable to your Quandl API key. You can find your Quandl API key [here](https://www.quandl.com/account/api).

In [None]:
# TODO: Add your Quandl API Key
quandl_api_key = ''

### Download Data

In [None]:
import os

snp500_file_path = 'data/tickers_SnP500.txt'
wiki_file_path = 'data/WIKI_PRICES.csv'
start_date, end_date = '2013-07-01', '2017-06-30'
use_columns = ['date', 'ticker', 'adj_open', 'adj_close', 'adj_high', 'adj_low']

if not os.path.exists(wiki_file_path):
    with open(snp500_file_path) as f:
        tickers = f.read().split()
    
    helper.download_quandl_dataset(quandl_api_key, 'WIKI', 'PRICES', wiki_file_path, use_columns, tickers, start_date, end_date)
else:
    print('Data already downloaded')

### Load Data
While using real data will give you hands-on experience, it doesn't cover all the topics we try to condense into one project. We'll solve this by creating new stocks. We've created a scenario where companies mining [Terbium](https://en.wikipedia.org/wiki/Terbium) are making huge profits. All the companies in this sector of the market are made up. They represent a sector with large growth that will be used for demonstration later in this project.

In [None]:
df_original = pd.read_csv(wiki_file_path, parse_dates=['date'], index_col=False)

# Add TB sector to the market
df = df_original
df = pd.concat([df] + project_helper.generate_tb_sector(df[df['ticker'] == 'AAPL']['date']), ignore_index=True)

print('Loaded Dataframe')

### View Data
In this project, you won't be building any charts.  We will provide all the code to plot or graph the data using our `project_helper` package. These charts will help you understand the data that you're working with.

Let's see what a single stock looks like from the DataFrame we loaded, called `df`. For this example and future display examples, we'll use Apple's stock "AAPL". Run the code below to view a candlestick chart of Apple stock.

In [None]:
apple_ticker = 'AAPL'
project_helper.plot_stock(df[df['ticker'] == apple_ticker], '{} Stock'.format(apple_ticker))

## The Alpha Research Process

In this project you will code and evaluate a "breakout" signal. It is important to understand where these steps fit in the alpha research workflow. The signal-to-noise ratio in trading signals is very low and, as such, it is very easy to fall into the trap of _overfitting_ to noise. It is therefore inadvisable to jump right into signal coding. To help mitigate overfitting, it is best to start with a general observation and hypothesis; i.e., you should be able to answer the following question _before_ you touch any data:

> What feature of markets or investor behaviour would lead to a persistent anomaly that my signal will try to exploit?

Ideally the assumptions behind the hypothesis will be testable _before_ you actually code and evaluate the signal itself. The workflow therefore is as follows:

![image](images/alpha_steps.png)

In this project, we assume that the first three steps are done ("observe & research", "form hypothesis", "validate hypothesis"). The hypothesis you'll be using for this project is the following:
- In the absence of news or significant investor trading interest, stocks oscillate in a range.
- Traders seek to capitalize on this range-bound behaviour periodically by selling/shorting at the top of the range and buying/covering at the bottom of the range. This behaviour reinforces the existence of the range.
- When stocks break out of the range, due to, e.g., a significant news release or from market pressure from a large investor:
    - the liquidity traders who have been providing liquidity at the bounds of the range seek to cover their positions to mitigate losses, thus magnifying the move out of the range, _and_
    - the move out of the range attracts other investor interest; these investors, due to the behavioural bias of _herding_ (e.g., [Herd Behavior](https://www.investopedia.com/university/behavioral_finance/behavioral8.asp)) build positions which favor continuation of the trend.


Using this hypothesis, let's start coding.
## Compute the Highs and Lows in a Window
You'll use highs and lows as an indicator for the breakout strategy. In this section, implement `get_high_lows_lookback` to get the maximum high price and minimum low price over a window of days. The variable `lookback_days` contains the number of days to look in the past. Make sure the window doesn't include the current day. The implementation should return the maximum and minimum of the prices as a tuple of Pandas Series, where the maximum high prices are one Series and the minimum low prices are another.

After implementing the function, run the cell to execute unit tests against your implementation.  If you pass the unit tests, it will display "Tests Passed". These tests are already added into the cells of each problem. In this problem, it's the line `project_tests.test_get_high_lows_lookback(get_high_lows_lookback)`.

*Note: Any time we talk about closing prices, open prices, etc., you can assume we're talking about the adjusted prices.*
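The windowing idea can be sketched on a toy single-stock series. This is only an illustration under stated assumptions, not the graded solution: the values and the name `toy_high` are made up. Shifting by one day before taking the rolling maximum keeps the current day out of the window.

```python
import pandas as pd

# Hypothetical daily highs for one stock (illustrative values only).
toy_high = pd.Series([10.0, 12.0, 11.0, 15.0, 13.0])
lookback_days = 2

# shift(1) drops the current day, so each window holds only the
# previous `lookback_days` days.
lookback_high = toy_high.shift(1).rolling(window=lookback_days).max()

print(lookback_high.tolist())  # [nan, nan, 12.0, 12.0, 15.0]
```

The first `lookback_days` entries are `NaN` because a full window of past days isn't available yet. The minimum low works the same way with `.min()`. Because `df` holds many tickers, the real computation would presumably need to be applied per ticker (for example with a `groupby` on the ticker column) so that windows never span two stocks.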

In [None]:
def get_high_lows_lookback(df, lookback_days):
    """
    Get the high and low in a lookback window.
    
    Parameters
    ----------
    df : DataFrame
        Stock prices with dates and ticker symbols
    lookback_days : int
        The number of days to look back
    
    Returns
    -------
    high_lows_lookback : (Pandas Series, Pandas Series)
        (High Lookbacks, Low Lookbacks)
    """
    #TODO: Implement function
    
    return None

project_tests.test_get_high_lows_lookback(get_high_lows_lookback)

### View Data
Let's use your implementation of `get_high_lows_lookback` to get the highs and lows for the past 50 days and compare them to their respective stock.  Just like last time, we'll use Apple's stock as the example to look at.

In [None]:
lookback_days = 50
df['lookback_high'], df['lookback_low'] = get_high_lows_lookback(df, lookback_days)
project_helper.plot_high_low(df[df['ticker'] == apple_ticker], 'High and Low of {} Stock'.format(apple_ticker))

## Compute Long and Short Signals
Using the generated indicator of highs and lows (columns "lookback_low" and "lookback_high" in `df`), create long and short signals using a breakout strategy. Implement `get_long_short` to generate the following signals:

| Signal | Condition |
|----|------|
| -1 | Low > Close Price |
| 1  | High < Close Price |
| 0  | Otherwise |

*Note: The **Close Price** is the adjusted close price. **Low** and **High** are the values from the columns "lookback_low" and "lookback_high" in `df` respectively.*
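As a rough illustration of the table above, here is a sketch on a made-up four-row frame. The values are hypothetical and the approach (subtracting two boolean masks cast to integers) is just one possible way to encode the three cases.

```python
import pandas as pd

# Hypothetical single-stock rows (illustrative values only).
toy = pd.DataFrame({
    'adj_close':     [10.0, 14.0, 12.0, 8.0],
    'lookback_high': [13.0, 13.0, 13.0, 13.0],
    'lookback_low':  [9.0, 9.0, 9.0, 9.0],
})

# Each comparison gives a boolean mask; casting to int gives 1 or 0.
long_signal = (toy['lookback_high'] < toy['adj_close']).astype(int)
short_signal = (toy['lookback_low'] > toy['adj_close']).astype(int)

# +1 for a break above the high, -1 for a break below the low, 0 otherwise.
signal = long_signal - short_signal

print(signal.tolist())  # [0, 1, 0, -1]
```

Comparisons against `NaN` evaluate to `False`, so rows that don't yet have a lookback value end up with a signal of 0 under this encoding.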

In [None]:
def get_long_short(df):
    """
    Generate the signals long, short, and do nothing.
    
    Parameters
    ----------
    df : DataFrame
        Stock prices with dates and ticker symbols
    
    Returns
    -------
    long_short : Pandas Series
        The long, short, and do nothing signals
    """
    #TODO: Implement function
    
    return None

project_tests.test_get_long_short(get_long_short)

### View Data
Let's compare the signals you generated against the Apple stock. This chart will show a lot of signals. Too many in fact. We'll talk about filtering the redundant signals in the next problem. 

In [None]:
df['signal'] = get_long_short(df)
project_helper.plot_signal(
    df[df['ticker'] == apple_ticker],
    'Long and Short of {} Stock'.format(apple_ticker),
    'signal')

## Filter Signals
That was a lot of repeated signals! If we're already shorting a stock, having an additional signal to short a stock isn't helpful for this strategy. This also applies to additional long signals when the last signal was long.

Implement `filter_signals` to filter out repeated long or short signals within the `lookahead_days`. If the previous signal was the same, change the signal to `0` (do nothing signal). For example, say you have a single stock time series that is

`[1,NAN,1,NAN,1,NAN,-1,-1]`

Running `filter_signals` with a lookahead of 3 days should turn those signals into

`[1, 0, 0, 0, 1, 0, -1, 0]`

To help you implement the function, we have provided you with the `clear_signals` function. This will remove all signals within a window after the last signal. For example, say you're using a window size of 3 with `clear_signals`. It would turn the Series of long signals

`[0, 1, 0, 0, 1, 1, 0, 1, 0]`

into

`[0, 1, 0, 0, 0, 1, 0, 0, 0]`

Note: it only takes a Series of the same type of signals, where `1` is the signal and `0` is no signal. It can't take a mix of long and short signals. Using this function, implement `filter_signals`.
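The split-and-recombine idea can be sketched on the single-stock example from above. To keep the snippet self-contained it uses `toy_clear`, a hypothetical stand-in written for this illustration that follows the same rule as `clear_signals`; the missing values from the example are written as `0`.

```python
import pandas as pd

def toy_clear(signals, window_size):
    """Hypothetical stand-in for `clear_signals`: keep a 1 only if no 1
    was kept in the previous `window_size` entries."""
    kept = []
    for current in signals:
        recent = kept[-window_size:] if window_size > 0 else []
        kept.append(0 if any(recent) else int(current))
    return pd.Series(kept, index=signals.index)

# The example series from the text, with missing values written as 0.
toy_signal = pd.Series([1, 0, 1, 0, 1, 0, -1, -1])

# One Series per signal type, each using 1 for "signal" and 0 for "none".
long_only = (toy_signal == 1).astype(int)
short_only = (toy_signal == -1).astype(int)

# Clear each type separately, then recombine into a single -1/0/1 Series.
filtered = toy_clear(long_only, 3) - toy_clear(short_only, 3)

print(filtered.tolist())  # [1, 0, 0, 0, 1, 0, -1, 0]
```

This handles one stock only. With many tickers in `df`, the clearing would presumably need to run per ticker so that one stock's signals don't suppress another's.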

In [None]:
def clear_signals(signals, window_size):
    """
    Clear out signals in a Series of just long or short signals.
    
    Reduce the signals to at most one within each window-size time period.
    
    Parameters
    ----------
    signals : Pandas Series
        The long, short, or do nothing signals
    window_size : int
        The number of days to have a single signal       
    
    Returns
    -------
    signals : Pandas Series
        Signals with the repeated signals inside each window removed
    """
    # Start with buffer of window size
    # This handles the edge case of calculating past_signal in the beginning
    clean_signals = [0]*window_size
    
    for signal_i, current_signal in enumerate(signals):
        # Check if there was a signal in the past window_size of days
        has_past_signal = bool(sum(clean_signals[signal_i:signal_i+window_size]))
        # Use the current signal if there's no past signal, else 0/False
        clean_signals.append(not has_past_signal and current_signal)
        
    # Remove buffer
    clean_signals = clean_signals[window_size:]

    # Return the signals as a Series of Ints
    # (`np.int` was removed from recent NumPy releases; the builtin `int` is equivalent)
    return pd.Series(np.array(clean_signals).astype(int))


def filter_signals(df, signal_column, lookahead_days):
    """
    Filter out signals in a DataFrame.
    
    Parameters
    ----------
    df : DataFrame
        Stock prices with dates and ticker symbols
    signal_column : str
        The column with the signals in `df`
    lookahead_days : int
        The number of days to look ahead
    
    Returns
    -------
    signals : Pandas Series
        Filtered signals
    """
    #TODO: Implement function
    
    return None


project_tests.test_filter_signals(filter_signals)

### View Data
Let's view the same chart as before, but with the redundant signals removed.

In [None]:
df['signal_5'] = filter_signals(df, 'signal', 5)
df['signal_10'] = filter_signals(df, 'signal', 10)
df['signal_20'] = filter_signals(df, 'signal', 20)
for signal_days in [5, 10, 20]:
    signal_column = 'signal_{}'.format(signal_days)
    project_helper.plot_signal(
        df[df['ticker'] == apple_ticker],
        'Long and Short of {} Stock with {} day signal window'.format(apple_ticker, signal_days),
        signal_column)

## Lookahead Close Prices
With the trading signal done, we can start working on evaluating how many days to short or long the stocks. In this problem, implement `get_lookahead_prices` to get the close price `lookahead_days` days ahead in time. We'll use the lookahead prices to calculate future returns in another problem.
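On a toy single-stock series, looking ahead amounts to shifting the closing prices backwards in time. The values below are made up, and this is only a sketch of the idea rather than the graded solution.

```python
import pandas as pd

# Hypothetical daily closes for one stock (illustrative values only).
toy_close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])
lookahead_days = 2

# A negative shift pulls future values back to the current row.
lookahead_close = toy_close.shift(-lookahead_days)

print(lookahead_close.tolist())  # [12.0, 13.0, 14.0, nan, nan]
```

The last `lookahead_days` rows are `NaN` because no future price exists for them. As with the lookback window, the shift would presumably be applied per ticker on the full DataFrame.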

In [None]:
def get_lookahead_prices(df, lookahead_days):
    """
    Get the lookahead prices for `lookahead_days` days.
    
    Parameters
    ----------
    df : DataFrame
        Stock prices with dates and ticker symbols
    lookahead_days : int
        The number of days to look ahead
    
    Returns
    -------
    lookahead_prices : Pandas Series
        The lookahead prices
    """
    #TODO: Implement function
    
    return None

project_tests.test_get_lookahead_prices(get_lookahead_prices)

### View Data
Using the `get_lookahead_prices` function, let's generate lookahead closing prices for 5, 10, and 20 days.

Let's also chart a subsection of a few months of the Apple stock instead of years. This will allow you to view the differences between the 5, 10, and 20 day lookaheads. Otherwise, they will mesh together when looking at a chart that is zoomed out.

In [None]:
df['lookahead_5'] = get_lookahead_prices(df, 5)
df['lookahead_10'] = get_lookahead_prices(df, 10)
df['lookahead_20'] = get_lookahead_prices(df, 20)
project_helper.plot_lookahead_prices(
    df[df['ticker'] == apple_ticker].iloc[150:250],
    ['lookahead_5', 'lookahead_10', 'lookahead_20'],
    '5, 10, and 20 day Lookahead Prices for Slice of {} Stock'.format(apple_ticker))

## Lookahead Price Returns
Implement `get_return_lookahead` to generate the log price return between the closing price and the lookahead price. The lookahead prices are located in the column provided by the `lookahead_column` variable.
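For reference, the log return between today's close $P_t$ and the lookahead price $P_{t+n}$ is $\ln(P_{t+n}) - \ln(P_t) = \ln(P_{t+n} / P_t)$. A small sketch on made-up numbers follows; the column name `lookahead_2` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical closes and lookahead prices (illustrative values only).
toy = pd.DataFrame({
    'adj_close':   [100.0, 100.0, 50.0],
    'lookahead_2': [110.0, 100.0, 25.0],
})

# Log return: ln(future price) - ln(current price).
log_return = np.log(toy['lookahead_2']) - np.log(toy['adj_close'])

print(log_return.round(4).tolist())  # [0.0953, 0.0, -0.6931]
```

A 10% rise gives a log return of about 0.0953, an unchanged price gives 0, and a halving gives about -0.6931.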

In [None]:
def get_return_lookahead(df, lookahead_column):
    """
    Calculate the log price return between the signal day's close and the lookahead price.
    
    Parameters
    ----------
    df : DataFrame
        Stock prices with dates and ticker symbols
    lookahead_column : str
        The column with the lookahead prices in `df`
    
    Returns
    -------
    return_lookahead : Pandas Series
        The lookahead price returns
    """
    #TODO: Implement function
    
    return None

project_tests.test_get_return_lookahead(get_return_lookahead)

### View Data
Using the same lookahead prices and same subsection of the Apple stock from the previous problem, we'll view the lookahead returns.

In order to view price returns on the same chart as the stock, a second y-axis will be added. When viewing this chart, the axis for the price of the stock will be on the left side, like previous charts. The axis for price returns will be located on the right side.

In [None]:
df['priceReturn_5'] = get_return_lookahead(df, 'lookahead_5')
df['priceReturn_10'] = get_return_lookahead(df, 'lookahead_10')
df['priceReturn_20'] = get_return_lookahead(df, 'lookahead_20')
project_helper.plot_price_returns(
    df[df['ticker'] == apple_ticker].iloc[150:250],
    ['priceReturn_5', 'priceReturn_10', 'priceReturn_20'],
    '5, 10, and 20 day Lookahead Returns for Slice of {} Stock'.format(apple_ticker))

## Compute the Signal Return
Using the price returns from the column provided by `return_column` and the signals from the column provided by `signal_column`, generate the signal returns.
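One natural reading, assumed here rather than confirmed by the project, is that the signal return is the signal multiplied by the price return. Under that reading a short signal (`-1`) followed by a price drop yields a positive signal return. The toy values below are made up.

```python
import pandas as pd

# Hypothetical signals and price returns (illustrative values only).
toy = pd.DataFrame({
    'signal':      [1, -1, 0, -1],
    'priceReturn': [0.02, -0.03, 0.05, 0.01],
})

# Assumed definition: signal return = signal * price return.
signal_return = toy['signal'] * toy['priceReturn']

print(signal_return.tolist())  # [0.02, 0.03, 0.0, -0.01]
```

Rows with no signal contribute a signal return of 0 regardless of how the price moved.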

In [None]:
def get_signal_return(df, return_column, signal_column):
    """
    Compute the signal returns.
    
    Parameters
    ----------
    df : DataFrame
        Stock prices with dates and ticker symbols
    return_column : str
        The column with the returns in `df`
    signal_column : str
        The column with the signals in `df`
    
    Returns
    -------
    signal_return : Pandas Series
        Signal returns
    """
    #TODO: Implement function
    
    return None

project_tests.test_get_signal_return(get_signal_return)

### View Data
Let's continue using the previous lookahead prices to view the signal returns. Just like before, the axis for the signal returns is on the right side of the chart.

In [None]:
title_string = '{} day Lookahead Signal Returns for {} Stock'
df['signalReturn_5'] = get_signal_return(df, 'priceReturn_5', 'signal_5')
df['signalReturn_10'] = get_signal_return(df, 'priceReturn_10', 'signal_10')
df['signalReturn_20'] = get_signal_return(df, 'priceReturn_20', 'signal_20')
project_helper.plot_signal_returns(
    df[df['ticker'] == apple_ticker],
    ['signalReturn_5', 'signalReturn_10', 'signalReturn_20'],
    ['signal_5', 'signal_10', 'signal_20'],
    [title_string.format(5, apple_ticker), title_string.format(10, apple_ticker), title_string.format(20, apple_ticker)])

## Test for Significance
### Histogram
Let's plot a histogram of the signal return values.

In [None]:
project_helper.plot_series_histograms(
    [df['signalReturn_5'], df['signalReturn_10'], df['signalReturn_20']],
    'Signal Return',
    ('5 Days', '10 Days', '20 Days'))

### Question: What do the histograms tell you about the signal?

*#TODO: Put Answer In this Cell*

### P-Value
Let's calculate the P-Value from the signal return.

In [None]:
pval_5 = project_helper.get_signal_return_pval(df['signalReturn_5'])
print('5  Day P-value: {}'.format(pval_5))
pval_10 = project_helper.get_signal_return_pval(df['signalReturn_10'])
print('10 Day P-value: {}'.format(pval_10))
pval_20 = project_helper.get_signal_return_pval(df['signalReturn_20'])
print('20 Day P-value: {}'.format(pval_20))

### Question: What do the p-values tell you about the signal?

*#TODO: Put Answer In this Cell*

## Outliers
You might have noticed the outliers in the 10 and 20 day histograms. To better visualize the outliers, let's compare the 5, 10, and 20 day signal returns to normal distributions with the same mean and standard deviation as each signal return distribution.

In [None]:
project_helper.plot_series_to_normal_histograms(
    [df['signalReturn_5'], df['signalReturn_10'], df['signalReturn_20']],
    'Signal Return',
    ('5 Days', '10 Days', '20 Days'))

## Find Outliers
While you can see the outliers in the histogram, we need to find the stocks that are causing these outlying returns.

Implement the function `find_outliers` to run a Kolmogorov-Smirnov test (KS test) between a normal distribution and each stock's signal returns, in the following order:
- Ignore rows without signals in `signal_column`. This will better fit the normal distribution and remove false positives.
- Run the KS test comparing a normal distribution that has the same mean and standard deviation as all the signal returns in `signal_return_column` against each stock's signal returns in `signal_return_column`. You can use `kstest` or `ks_2samp` to perform the KS test.
- Ignore any stock whose p-value is greater than `pvalue_threshold`. The test fails to reject the null hypothesis for these stocks, so you can consider them not outliers.
- Return all stock tickers with a KS statistic above `ks_threshold`.
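To get a feel for what the KS statistic and p-value look like, here is a sketch on synthetic data using `scipy.stats.kstest`. All of the samples are randomly generated for illustration; the names `pooled`, `typical_stock`, and `extreme_stock` are hypothetical and nothing here comes from the project data.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)

# Synthetic signal returns: a pooled sample, one stock drawn from the
# same distribution, and one stock shifted far to the right.
pooled = rng.normal(loc=0.0, scale=0.02, size=2000)
typical_stock = rng.normal(loc=0.0, scale=0.02, size=200)
extreme_stock = rng.normal(loc=0.10, scale=0.02, size=200)

# Parameters of the reference normal distribution.
mean, std = pooled.mean(), pooled.std()

# kstest compares each sample with the normal CDF that has this mean and std.
ks_typical, p_typical = kstest(typical_stock, 'norm', args=(mean, std))
ks_extreme, p_extreme = kstest(extreme_stock, 'norm', args=(mean, std))

print('typical: ks={:.3f}'.format(ks_typical))
print('extreme: ks={:.3f}, p={:.3g}'.format(ks_extreme, p_extreme))
```

The shifted sample should produce a KS statistic close to 1 with a tiny p-value, while the sample drawn from the reference distribution should produce a small KS statistic. In the project, the same comparison would be repeated for each ticker's signal returns.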

In [None]:
from scipy.stats import kstest, ks_2samp, norm

def find_outliers(df, signal_column, signal_return_column, ks_threshold, pvalue_threshold=0.05):
    """
    Find stock outliers in `df` using Kolmogorov-Smirnov test against a normal distribution.
    
    Ignore stocks with a p-value from the Kolmogorov-Smirnov test greater than `pvalue_threshold`.
    Ignore stocks with a KS statistic lower than `ks_threshold`.
    
    Parameters
    ----------
    df : DataFrame
        Stock prices with dates and ticker symbols
    signal_column : str
        The column with the signals in `df`
    signal_return_column : str
        The column with the signal returns in `df`
    ks_threshold : float
        The threshold for the KS statistic
    pvalue_threshold : float
        The threshold for the p-value
    
    Returns
    -------
    outliers : list of str
        Symbols that are outliers
    """
    #TODO: Implement function
    
    return None

project_tests.test_find_outliers(find_outliers)

### View Data
Using the `find_outliers` function you implemented, let's see what we found.

In [None]:
outlier_tickers = []
ks_threshold = 0.7

outlier_tickers.extend(find_outliers(df, 'signal_5', 'signalReturn_5', ks_threshold))
outlier_tickers.extend(find_outliers(df, 'signal_10', 'signalReturn_10', ks_threshold))
outlier_tickers.extend(find_outliers(df, 'signal_20', 'signalReturn_20', ks_threshold))
outlier_tickers = list(set(outlier_tickers))
print('{} Outliers Found:\n{}'.format(len(outlier_tickers), ', '.join(outlier_tickers)))

## Remove outliers
You might be asking yourself, "Why are we removing perfectly good data? The outliers we found just have a huge amount of growth". That's because we want our returns to reflect future returns. Our signal having high returns for a sector that is dramatically outperforming the market doesn't reflect how good our signal is. We want to evaluate how a signal will perform in the future, not in the past. These stocks will be removed from our signal analysis and while trading. That doesn't mean those returns will go unnoticed. If this were our job, we would look into ways of better trading those stocks after we finished this analysis.

Implement `remove_outliers` to remove the outliers (`outlier_symbols`) from the DataFrame. Return the new DataFrame without the outlier stocks.
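Row filtering by membership can be sketched with `Series.isin` on a toy frame. The tickers and values below are made up, and this is one possible approach rather than the required one.

```python
import pandas as pd

# Hypothetical rows for three made-up tickers.
toy = pd.DataFrame({
    'ticker':    ['AAA', 'BBB', 'AAA', 'CCC'],
    'adj_close': [1.0, 2.0, 3.0, 4.0],
})
toy_outliers = ['BBB']

# `isin` marks rows whose ticker is in the list; `~` inverts the mask.
kept = toy[~toy['ticker'].isin(toy_outliers)]

print(kept['ticker'].tolist())  # ['AAA', 'AAA', 'CCC']
```

Boolean indexing returns a new DataFrame, so the original `toy` frame is left unchanged.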

In [None]:
def remove_outliers(df, outlier_symbols):
    """
    Remove the outlier stocks from `df`.
    
    Parameters
    ----------
    df : DataFrame
        Stock prices with dates and ticker symbols
    outlier_symbols : list of str
        The outlier stocks to remove from `df`
    
    Returns
    -------
    outliers : DataFrame
        `df` with the outliers removed
    """
    #TODO: Implement function
    
    return None

project_tests.test_remove_outliers(remove_outliers)

### Show Significance without Outliers
Let's compare the 5, 10, and 20 day signal returns without outliers to normal distributions. Also, let's see how the P-Value has changed with the outliers removed.

In [None]:
# Remove outliers
outliers_removed_df = remove_outliers(df, list(set(outlier_tickers)))

project_helper.plot_series_to_normal_histograms(
    [outliers_removed_df['signalReturn_5'], outliers_removed_df['signalReturn_10'], outliers_removed_df['signalReturn_20']],
    'Signal Return Without Outliers',
    ('5 Days', '10 Days', '20 Days'))

outliers_removed_pval_5 = project_helper.get_signal_return_pval(outliers_removed_df['signalReturn_5'])
outliers_removed_pval_10 = project_helper.get_signal_return_pval(outliers_removed_df['signalReturn_10'])
outliers_removed_pval_20 = project_helper.get_signal_return_pval(outliers_removed_df['signalReturn_20'])

print('5  Day P-value (with outliers):    {}'.format(pval_5))
print('5  Day P-value (without outliers): {}'.format(outliers_removed_pval_5))
print('')
print('10 Day P-value (with outliers):    {}'.format(pval_10))
print('10 Day P-value (without outliers): {}'.format(outliers_removed_pval_10))
print('')
print('20 Day P-value (with outliers):    {}'.format(pval_20))
print('20 Day P-value (without outliers): {}'.format(outliers_removed_pval_20))

That's more like it! The returns are closer to a normal distribution. You have finished the research phase of a Breakout Strategy. You can now submit your project.
## Submission
Now that you're done with the project, it's time to submit it. Click the submit button in the bottom right. One of our reviewers will give you feedback on your project with a pass or not passed grade. You can continue to the next section while you wait for feedback.