### The below technical indicators are based on some of the technical indicators used in Kara, Yakup, Melek Acar Boyacioglu, and Ömer Kaan Baykan. "Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul Stock Exchange." Expert Systems with Applications 38.5 (2011): 5311-5319.

In [2]:
import numpy as np
import pandas as pd
import math as m
import os
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from scipy import stats
from statsmodels.graphics.tsaplots import plot_acf

### Moving averages can help smooth out these erratic movements by removing day-to-day fluctuations and make trends easier to spot. Since they take the average of past price movements, moving averages are better for accurately reading past price movements rather than predicting future past movements. (To learn more, read the Moving Averages tutorial).

#### Simple Moving Average
##### The most common type of moving average is the simple moving average, which simply takes the sum of all of the past closing prices over a time period and divides the result by the total number of prices used in the calculation. For example, a 10-day simple moving average takes the last ten closing prices and divides them by ten.

Read more: Technical Analysis: Moving Averages https://www.investopedia.com/university/technical/techanalysis9.asp#ixzz5V69dzQGf 


In [87]:
#Moving Average  
def MA(df, n):  
    r = df.rolling(window=n)
    MA = pd.Series(r['Close'].mean(), name = 'MA_' + str(n))  
    df = df.join(MA)  
    return df


#### Exponential moving average
##### The exponential moving average leverages a more complex calculation to smooth data and place a higher weight on more recent data points. While the calculation is beyond the scope of this tutorial, traders should remember that the EMA is more responsive to new information relative to the simple moving average. This makes it the moving average of choice for many technical traders.

Read more: Technical Analysis: Moving Averages https://www.investopedia.com/university/technical/techanalysis9.asp#ixzz5V6AIP1DM 

In [80]:

#Exponential Moving Average  
def EMA(df, n):  
    
    EMA = pd.Series(df['Close'].ewm(span=n).mean(), name = 'EMA_' + str(n))  
    df = df.join(EMA)  
    return df


#### Momentum is the measurement of the speed or velocity of price changes. In "Technical Analysis of the Financial Markets," John J. Murphy explains:

##### Momentum measures the rate of the rise or fall in stock prices. From the standpoint of trending, momentum is a very useful indicator of strength or weakness in the issue's price. History has shown us that momentum is far more useful during rising markets than during falling markets; the fact that markets rise more often than they fall is the reason for this. In other words, bull markets tend to last longer than bear markets. (To learn more, see: Banking Profits in Bull and Bear Markets.)

Read more: Understanding Momentum Indicators and RSI | Investopedia https://www.investopedia.com/investing/momentum-and-relative-strength-index/#ixzz5V6E9QXN2 

In [81]:

#Momentum  
def MOM(df, n):  
    M = pd.Series(df['Close'].diff(n), name = 'Momentum_' + str(n))  
    df = df.join(M)  
    return df


#### Relative Strength Index
##### The relative strength index (RSI) is another well known momentum indicators that’s widely used in technical analysis. The indicator is commonly used to identify overbought and oversold conditions in a security with a range between 0 (oversold) and 100 (overbought).

##### A reading above 70 suggests that a security is overbought, while a reading below 30 suggests that a security is oversold. Often times, the indicator is used by traders to determine if the price has been pushed to unreasonably higher or low levels after a snap reaction to news.

Read more: Technical Analysis: Indicators And Oscillators https://www.investopedia.com/university/technical/techanalysis10.asp#ixzz5V6CkCxSp 


In [139]:

#Relative Strength Index  
def RSI(df, n):  
    i = 0  
    UpI = [0]  
    DoI = [0]  
    while i  < df['Close'].count()-1:
        UpMove = df['High'].iat[i + 1] - df['High'].iat[i]
        DoMove = df['Low'].iat[i] - df['Low'].iat[i + 1]  
        if UpMove > DoMove and UpMove > 0:  
            UpD = UpMove  
        else: UpD = 0  
        UpI.append(UpD)  
        if DoMove > UpMove and DoMove > 0:  
            DoD = DoMove  
        else: DoD = 0  
        DoI.append(DoD)  
        i = i + 1  
    UpI = pd.Series(UpI)  
    DoI = pd.Series(DoI) 
    
    PosDI = pd.Series(UpI.ewm(span=n,min_periods = n - 1).mean())  
    NegDI = pd.Series(DoI.ewm(span=n,min_periods = n - 1).mean())  
    RSI = pd.Series(PosDI / (PosDI + NegDI), name = 'RSI_' + str(n))  
    df = df.join(RSI)  
    
    return df


#### Moving Average Convergence/Divergence
##### The moving average convergence-divergence (MACD) is one of the most powerful and well-known indicators in technical analysis. The indicator is comprised of two exponential moving averages that help measure momentum in a security. The MACD is simply the difference between these two moving averages plotted against a centerline, where the centerline is the point at which the two moving averages are equal. The exponential moving average of the MACD line itself is also plotted on the chart.

##### The MACD compares short-term momentum and long-term momentum to signal the current direction of momentum rather than the direction of price. Traders can think of it as the ‘derivative’ of price-based moving averages.

##### When the MACD is positive, it signals that the short-term moving average is above the long-term moving average and the security’s momentum is upward. The opposite is true when the MACD is negative, which signals that the short-term moving average is below the longter term average and suggests downward momentum.

Read more: Technical Analysis: Indicators And Oscillators https://www.investopedia.com/university/technical/techanalysis10.asp#ixzz5V6D4s1q3 


In [140]:

#MACD, MACD Signal and MACD difference  
def MACD(df, n_fast, n_slow):  
    EMAfast = pd.Series(df['Close'].ewm(span = n_fast, min_periods = n_slow - 1).mean())  
    EMAslow = pd.Series(df['Close'].ewm(span = n_slow, min_periods = n_slow - 1).mean())  
    MACD = pd.Series(EMAfast - EMAslow, name = 'MACD_' + str(n_fast) + '_' + str(n_slow))  
    MACDsign = pd.Series(MACD.ewm( span = 9, min_periods = 8).mean(), name = 'MACDsign_' + str(n_fast) + '_' + str(n_slow))  
    MACDdiff = pd.Series(MACD - MACDsign, name = 'MACDdiff_' + str(n_fast) + '_' + str(n_slow))  
    df = df.join(MACD)  
    df = df.join(MACDsign)  
    df = df.join(MACDdiff)  
    return df



#### The 'Commodity Channel Index - CCI'
##### The Commodity Channel Index (CCI) is a momentum-based technical trading tool used most often to help determine when an investment vehicle is reaching a condition of being overbought or oversold. As the price of an investment moves continually in one direction, these indicators help traders to determine when institutional conviction may be changing, and a pause or pullback in the market price may be coming. This information can permit traders to take profit or add to an existing position following a price pullback.

##### First developed by Donald Lambert, the CCI is a stochastic oscillator that measures the change in an instrument's price relative to a pre-defined moving average (MA) of the price divided by 1.5% of a normal deviation (D) from that average. Oscillating indicators in general are technical trading tools whose calculated values move back and forth between two pre-determined levels, the top level indicating a market that is in the condition of being overbought and the bottom one indicating a market that is in the condition of being oversold.

Read more: Commodity Channel Index (CCI) https://www.investopedia.com/terms/c/commoditychannelindex.asp#ixzz5V6EZT5GH 


In [141]:

#Commodity Channel Index  
def CCI(df, n):  
    
    PP = (df['High'] + df['Low'] + df['Close']) / 3  
    
    r = PP.rolling(window=n)
    CCI = pd.Series((PP - r.mean()) / r.std(), name = 'CCI_' + str(n))  
    df = df.join(CCI)  
    return df

#### Using the Dollar Bars produced from part 1 of the project, PCA analysis is used to determine which technical indicators induce the most variance


In [158]:
data = pd.read_csv('volume_bars.csv')
data['Close'] = data['close']
data['Open'] = data['open']
data['High'] = data['high']
data['Low'] = data['low']
#data['time_stamp'] = pd.to_datetime(data['time_stamp'])
data['idx_col'] = range(0, len(data) )
data.index=data['idx_col'] 
print(data)

ma = MA(data, 20)
#print(ma)

ema = EMA(ma, 20)
#print(ema)

mom = MOM(ema,20)
#print(mom)

cci = CCI(mom,20)
#print(cci)

macd = MACD(cci,10,20)
#print('with MACD',macd)

rsi = RSI(macd,20)
#print('with RSI',rsi)

final_df [['Close', 'MA','EMA', 'MOM', 'CCI', 'MACD','RSI']] = rsi[['Close',rsi.columns[15],rsi.columns[16],rsi.columns[17],rsi.columns[18],rsi.columns[19],rsi.columns[20]]]
#remove NaN rows
final_df = final_df.iloc[25:]
print(final_df)

from sklearn.decomposition import PCA
# Instantiate and fit PCA model
pca = PCA(n_components=7)
pca.fit(final_df)

print("Percentage of variance explained by each of the selected components:")
print(pca.explained_variance_ratio_) 


                  time_stamp     open     high      low    close     volume  \
idx_col                                                                       
0        2018-07-28 00:03:28  8177.72  8213.50  8177.72  8213.50  40.711876   
1        2018-07-28 00:04:00  8216.74  8223.04  8209.89  8221.40  57.380037   
2        2018-07-28 00:04:29  8221.30  8236.95  8221.30  8236.95  48.506734   
3        2018-07-28 00:10:13  8234.03  8238.87  8196.71  8204.35  53.305443   
4        2018-07-28 00:26:48  8204.35  8204.35  8172.09  8172.09  49.848752   
5        2018-07-28 01:08:04  8172.10  8183.49  8152.11  8160.50  47.563584   
6        2018-07-28 02:09:10  8160.77  8193.53  8151.85  8168.06  52.661836   
7        2018-07-28 03:04:01  8168.05  8195.00  8154.03  8160.10  49.638359   
8        2018-07-28 04:23:22  8161.15  8182.80  8154.65  8178.48  50.378057   
9        2018-07-28 04:39:58  8178.47  8198.73  8175.09  8180.61  49.004957   
10       2018-07-28 04:48:25  8180.43  8220.94  8178

           Close      MA       EMA        MOM        CCI       MACD       RSI
idx_col                                                                      
50       8169.07   -1.08  0.981691   9.912290   1.698165   8.214125  0.553920
51       8193.21   62.73  1.288159  11.237314   3.606963   7.630351  0.606750
52       8169.19   26.13  0.747382   9.955233   4.877132   5.078101  0.560901
53       8230.05  120.19  1.469748  14.077082   6.717719   7.359363  0.661636
54       8254.07  111.99  1.942918  18.954902   9.165791   9.789111  0.707902
55       8239.91  139.91  1.599413  21.016465  11.536419   9.480047  0.692778
56       8284.99  192.14  1.805516  26.067389  14.443096  11.624293  0.735886
57       8279.46  168.62  1.639002  28.877762  17.330413  11.547350  0.735902
58       8299.00  207.00  1.748186  32.144814  20.293608  11.851206  0.759661
59       8291.23  221.23  1.592426  33.335733  22.902255  10.433478  0.759661
60       8269.79  218.36  1.262450  31.788339  24.679593   7.108

### Cross-validation 
Cross-validation is a technique that allows us to make more efficient use of the data we have. 

#### K-fold cross-validation 
##### K-fold cross-validation splits the data into K equally (or as close to equal as possible) sized blocks, illustrated in the figure below. Each block takes its turn as a validation set for a training set comprised of the other K- 1 blocks. Averaging over the resulting K loss values gives us our final loss value. An extreme case of K-fold cross-validation is where K = N, the number of observations in our dataset: each data observation is held out in turn and used to test a model trained on the other
<img src="capture.png">

Reference: A first course in Machine Learning, Second Edition




### WHY K-FOLD CV FAILS IN FINANCE
#### One reason k-fold CV fails in finance is because observations cannot be assumed to be drawn from an IID process. A second reason for CV’s failure is that the testing set is used multiple times in the process of developing a model,leading to multiple testing and selection bias. 

##### By placing t and t + 1 in different sets, information is leaked. When a classifier is first trained on (Xt, Yt), and then it is asked to predict E[Yt+1|Xt+1] based on an observed Xt+1, this classifier is more likely to achieve Yt+1 = E[Yt+1|Xt+1] even if X is an irrelevant feature. If X is a predictive feature, leakage will enhance the performance of an already valuable strategy. The problem is leakage in the presence of irrelevant features, as this leads to false discoveries. 

#### There are at least two ways to reduce the likelihood of leakage:
1. Drop from the training set any observation i where Yi is a function of information
used to determine Yj, and j belongs to the testing set.
(a) For example, Yi and Yj should not span overlapping periods 
2. Avoid overfitting the classifier. In this way, even if some leakage occurs, the
classifier will not be able to profit from it. Use:
(a) Early stopping of the base estimators 
(b) Bagging of classifiers, while controlling for oversampling on redundant
examples, so that the individual classifiers are as diverse as possible.
i. Set max_samples to the average uniqueness.
ii. Apply sequential bootstrap 
Consider the case where Xi and Xj are formed on overlapping information, where
i belongs to the training set and j belongs to the testing set. Is this a case of informational
leakage? Not necessarily, as long as Yi and Yj are independent. For leakage to
take place, it must occur that (Xi, Yi) ≈ (Xj, Yj), and it does not suffice that Xi ≈ Xj or
even Yi ≈ Yj.

#### A SOLUTION: PURGED K-FOLD CV
##### One way to reduce leakage is to purge from the training set all observations whose labels overlapped in time with those labels included in the testing set. In addition, since financial features often incorporate series that exhibit serial correlation (like ARMA processes), we should eliminate from the training set observations that immediately follow an observation in the testing set.
#### Purging the Training Set
##### Suppose a testing observation whose label Yj is decided based on the information set Φj. In order to prevent the type of leakage described in the previous section, we would like to purge from the training set any observation whose label Yi is decided based on the information set



Reference: Advances in Financial Machine Learning, Marcos Lopez De Prado 