<a href="https://colab.research.google.com/github/an1ke7/StockMarket_prediction/blob/main/PDC_man_who_solved_the_market_and_can_you.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The man who solved the market and can you?



Author: Aniket Shirsat

LinkedIn profile: https://www.linkedin.com/in/aniketshirsat/

This notebook was developed for presentation at the Phoenix Data Conference 2020.

Breakdown of the code: 

- Download the SP500 symbols
- Download the SP500 stocks data (OHLCV) 
- Using the TA-Lib library, create momentum indicators as features:
    - RSI
    - MFI
    - MACD
- Stitch the data together
- Develop a buying strategy for swing trading (define stoploss % and target %)
- Train ML model 
- Validate ML Model / Backtesting  

# Environment and External Dependencies setup 

This step is specifically designed for Colab notebooks.

In [1]:
# download TA-Lib 
!wget http://prdownloads.sourceforge.net/ta-lib/ta-lib-0.4.0-src.tar.gz 

--2020-11-20 01:59:22--  http://prdownloads.sourceforge.net/ta-lib/ta-lib-0.4.0-src.tar.gz
Resolving prdownloads.sourceforge.net (prdownloads.sourceforge.net)... 216.105.38.13
Connecting to prdownloads.sourceforge.net (prdownloads.sourceforge.net)|216.105.38.13|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.sourceforge.net/project/ta-lib/ta-lib/0.4.0/ta-lib-0.4.0-src.tar.gz [following]
--2020-11-20 01:59:22--  http://downloads.sourceforge.net/project/ta-lib/ta-lib/0.4.0/ta-lib-0.4.0-src.tar.gz
Resolving downloads.sourceforge.net (downloads.sourceforge.net)... 216.105.38.13
Reusing existing connection to prdownloads.sourceforge.net:80.
HTTP request sent, awaiting response... 302 Found
Location: https://phoenixnap.dl.sourceforge.net/project/ta-lib/ta-lib/0.4.0/ta-lib-0.4.0-src.tar.gz [following]
--2020-11-20 01:59:22--  https://phoenixnap.dl.sourceforge.net/project/ta-lib/ta-lib/0.4.0/ta-lib-0.4.0-src.tar.gz
Resolving phoenixnap

In [None]:
# Extract and compile TA-Lib
!tar xvzf ta-lib-0.4.0-src.tar.gz
import os
os.chdir('ta-lib') # Can't use !cd in Colab
!./configure --prefix=/usr
!make
!make install
# wait ~ 30s

In [3]:
# Installing TA-lib
os.chdir('../')
!pip install TA-Lib

Collecting TA-Lib
  Downloading https://files.pythonhosted.org/packages/ac/cf/681911aa31e04ba171ab4d523a412f4a746e30d3eacb1738799d181e028b/TA-Lib-0.4.19.tar.gz (267kB)

# Importing Python Dependencies

In [4]:
#importing python libraries

## general python dependencies
import pandas as pd
import numpy as np
import datetime as dt

## for fetching stock market data
import pandas_datareader as pdr

## for technical analysis
import talib as ta

## for plotting data
import plotly.graph_objects as go
import plotly.express as px

# Extracting SP500 symbols

In [5]:
#Download SP500 stocks symbols
table=pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
sp500 = table[0]
sp500.head()

Unnamed: 0,Symbol,Security,SEC filings,GICS Sector,GICS Sub-Industry,Headquarters Location,Date first added,CIK,Founded
0,MMM,3M Company,reports,Industrials,Industrial Conglomerates,"St. Paul, Minnesota",1976-08-09,66740,1902
1,ABT,Abbott Laboratories,reports,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,1800,1888
2,ABBV,AbbVie Inc.,reports,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
3,ABMD,ABIOMED Inc,reports,Health Care,Health Care Equipment,"Danvers, Massachusetts",2018-05-31,815094,1981
4,ACN,Accenture plc,reports,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989


In [6]:
# create a list of SP500 symbols
tickers = list(sp500['Symbol'])

# Extracting data from Yahoo Finance API

In [7]:
# Extracting data from Yahoo Finance API

no_data = []
frames = []

# Downloading data for each symbol separately
for i in tickers:
    try:
        # extracting data from 2010 to today
        i_data = pdr.get_data_yahoo(i, start = dt.datetime(2010,1,1), end = dt.date.today())
        i_data['symbol'] = i
        frames.append(i_data)
    except Exception:
        # if data extraction fails, save the symbol into a list to check later
        no_data.append(i)

# combine the data together (DataFrame.append was removed in pandas 2.0)
all_data = pd.concat(frames) if frames else pd.DataFrame()

#Creating Return column
all_data['return'] = all_data.groupby('symbol')['Close'].pct_change() 
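
The `return` column is the day-over-day percentage change of `Close`, computed within each symbol so that the first row of one stock is never compared with the last row of another. A minimal plain-Python sketch of that calculation follows; the helper name `pct_change_by_symbol` is illustrative and not part of the notebook, and the prices are the first three MMM closes shown in the `all_data.head()` output further down.

```python
def pct_change_by_symbol(rows):
    """rows: list of (symbol, close) in date order.

    Returns one value per row: close / previous close - 1 within the same
    symbol, or None for the first row of each symbol.
    """
    last_close = {}
    out = []
    for symbol, close in rows:
        prev = last_close.get(symbol)
        out.append(None if prev is None else close / prev - 1.0)
        last_close[symbol] = close
    return out


rows = [
    ("MMM", 83.019997),
    ("MMM", 82.5),
    ("MMM", 83.669998),
    ("ABT", 50.0),  # made-up value, only to show the per-symbol reset
]
returns = pct_change_by_symbol(rows)
```

The second and third values come out at roughly -0.006264 and 0.014182, in line with the `return` column in the notebook output, and the first row of each symbol has no return.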

In [8]:
#take a backup of all_data
all_data_bkup = all_data.copy()  # copy, so the backup is independent of later changes

In [71]:
#restore the backup
all_data = all_data_bkup.copy()

# Visualizing Data

In [9]:
# Plotting OHLC data  


## Select your stock ticker
stock = "AAPL"

stock_data = all_data[all_data['symbol'] == stock]

trace1 = {
    'x': stock_data.index,
    'open': stock_data['Open'],
    'close': stock_data['Close'],
    'high': stock_data['High'],
    'low': stock_data['Low'],
    'type': 'candlestick',
    'name': stock,
    'showlegend': False
}

data = [trace1]
# Config graph layout
layout = go.Layout({
    'title': {
        'text': stock,
        'font': {
            'size': 15
        }
    }
})

fig = go.Figure(data=data, layout=layout)
fig.show()

In [10]:
all_data.head()

Date,High,Low,Open,Close,Volume,Adj Close,symbol,return
2010-01-04,83.449997,82.669998,83.089996,83.019997,3043700.0,61.793381,MMM,
2010-01-05,83.230003,81.699997,82.800003,82.5,2847000.0,61.406334,MMM,-0.006264
2010-01-06,84.599998,83.510002,83.879997,83.669998,5268500.0,62.277203,MMM,0.014182
2010-01-07,83.760002,82.120003,83.32,83.730003,4470100.0,62.321846,MMM,0.000717
2010-01-08,84.32,83.300003,83.690002,84.32,3405800.0,62.761005,MMM,0.007046


# Feature Modelling 

In [None]:
# We use 3 momentum indicators for our demo

# RSI  - Relative Strength Index
# MACD - Moving Average Convergence Divergence
# MFI  - Money Flow Index

# We use TA-lib functions to generate the indicator values. 
# all indicator hyperparameters are set to default settings 

# A separate table is generated with an index of symbol and date, so that it can be easily joined with the main data. 

In [11]:
def myMACDp1(df):
    v1,v2,v3 = ta.MACD(df['Close'], fastperiod=12, slowperiod=26, signalperiod=9)
    return v1
MACD1 = all_data.groupby('symbol').apply(myMACDp1)
MACD1 = MACD1.fillna(0)
MACD1 = pd.DataFrame(MACD1)
MACD1.columns = ['MACD']
MACD1

symbol,Date,MACD
A,2010-01-04,0.000000
A,2010-01-05,0.000000
A,2010-01-06,0.000000
A,2010-01-07,0.000000
A,2010-01-08,0.000000
...,...,...
ZTS,2020-11-13,1.096271
ZTS,2020-11-16,1.028770
ZTS,2020-11-17,0.975329
ZTS,2020-11-18,0.768385


In [12]:
def myMACDp2(df):
    v1,v2,v3 = ta.MACD(df['Close'], fastperiod=12, slowperiod=26, signalperiod=9)
    return v2
MACD2 = all_data.groupby('symbol').apply(myMACDp2)
MACD2 = MACD2.fillna(0)
MACD2 = pd.DataFrame(MACD2)
MACD2.columns = ['MACD']
MACD2

symbol,Date,MACD
A,2010-01-04,0.000000
A,2010-01-05,0.000000
A,2010-01-06,0.000000
A,2010-01-07,0.000000
A,2010-01-08,0.000000
...,...,...
ZTS,2020-11-13,1.082181
ZTS,2020-11-16,1.071499
ZTS,2020-11-17,1.052265
ZTS,2020-11-18,0.995489


In [13]:
def myMACDp3(df):
    v1,v2,v3 = ta.MACD(df['Close'], fastperiod=12, slowperiod=26, signalperiod=9)
    return v3
MACD3 = all_data.groupby('symbol').apply(myMACDp3)
MACD3 = MACD3.fillna(0)
MACD3 = pd.DataFrame(MACD3)
MACD3.columns = ['MACD']
MACD3

symbol,Date,MACD
A,2010-01-04,0.000000
A,2010-01-05,0.000000
A,2010-01-06,0.000000
A,2010-01-07,0.000000
A,2010-01-08,0.000000
...,...,...
ZTS,2020-11-13,0.014091
ZTS,2020-11-16,-0.042728
ZTS,2020-11-17,-0.076935
ZTS,2020-11-18,-0.227104
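
The three cells above each keep one of the three series that `ta.MACD` returns: the MACD line, the signal line and the histogram, in that order. The histogram is the MACD line minus the signal line, which the output confirms (for ZTS on 2020-11-13, 1.096271 - 1.082181 ≈ 0.014091). The sketch below shows how the three series relate, in plain Python. It is an approximation under a stated assumption: the EMA here is seeded with the first price, whereas TA-Lib initialises its averages differently, so early values will not match TA-Lib exactly. The function names `ema` and `macd` are illustrative.

```python
def ema(values, period):
    """Exponential moving average, seeded with the first value (an assumption)."""
    alpha = 2.0 / (period + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line, histogram) for a list of closing prices."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


closes = [100.0 + 0.5 * i for i in range(60)]  # made-up, steadily rising prices
macd_line, signal_line, histogram = macd(closes)
```

With steadily rising prices the fast average sits above the slow one, so the MACD line is positive; with flat prices all three series stay at zero.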


In [14]:
def myRSI(df):
    return ta.RSI(df['Close'], timeperiod=14)
RSI = all_data.groupby('symbol').apply(myRSI)
RSI = RSI.fillna(0)
RSI = pd.DataFrame(RSI)
RSI.columns = ['RSI']
RSI

symbol,Date,RSI
A,2010-01-04,0.000000
A,2010-01-05,0.000000
A,2010-01-06,0.000000
A,2010-01-07,0.000000
A,2010-01-08,0.000000
...,...,...
ZTS,2020-11-13,52.944533
ZTS,2020-11-16,52.124412
ZTS,2020-11-17,52.351511
ZTS,2020-11-18,48.907300


In [15]:
def myMFI(df):
    return ta.MFI(df['High'], df['Low'], df['Close'], df['Volume'], timeperiod=14)
MFI = all_data.groupby('symbol').apply(myMFI)
MFI = MFI.fillna(0)
MFI = pd.DataFrame(MFI)
MFI.columns = ['MFI']
MFI

symbol,Date,MFI
A,2010-01-04,0.000000
A,2010-01-05,0.000000
A,2010-01-06,0.000000
A,2010-01-07,0.000000
A,2010-01-08,0.000000
...,...,...
ZTS,2020-11-13,59.743375
ZTS,2020-11-16,59.364550
ZTS,2020-11-17,61.265156
ZTS,2020-11-18,55.860088


In [16]:
# Merging the features with main data

# The index of the main data must match that of the momentum indicator tables. 

# indexes: symbol and Date

# then we can perform inner join to combine the data together
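
An inner join keeps only the (symbol, Date) pairs that are present in both tables and places their columns side by side. A minimal plain-Python sketch of that idea, using made-up dictionaries in place of DataFrames (the helper `inner_join` and all values are illustrative):

```python
# Hypothetical miniature tables keyed by (symbol, date)
prices = {
    ("AAA", "2020-01-02"): {"Close": 10.0},
    ("AAA", "2020-01-03"): {"Close": 10.5},
    ("BBB", "2020-01-02"): {"Close": 20.0},
}
rsi = {
    ("AAA", "2020-01-02"): {"RSI": 0.0},
    ("AAA", "2020-01-03"): {"RSI": 55.0},
    ("CCC", "2020-01-02"): {"RSI": 40.0},
}


def inner_join(left, right):
    """Keep only the keys present in both tables and combine their columns."""
    return {key: {**left[key], **right[key]} for key in left if key in right}


merged = inner_join(prices, rsi)
```

Only the two `AAA` rows survive, because `BBB` has no indicator row and `CCC` has no price row. In the notebook every indicator table is derived from the main data, so the inner join is not expected to drop rows.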

In [17]:
all_data = all_data.set_index(['symbol',all_data.index])
all_data.head()

symbol,Date,High,Low,Open,Close,Volume,Adj Close,return
MMM,2010-01-04,83.449997,82.669998,83.089996,83.019997,3043700.0,61.793381,
MMM,2010-01-05,83.230003,81.699997,82.800003,82.5,2847000.0,61.406334,-0.006264
MMM,2010-01-06,84.599998,83.510002,83.879997,83.669998,5268500.0,62.277203,0.014182
MMM,2010-01-07,83.760002,82.120003,83.32,83.730003,4470100.0,62.321846,0.000717
MMM,2010-01-08,84.32,83.300003,83.690002,84.32,3405800.0,62.761005,0.007046


In [18]:
# Inner-join each indicator table on the shared (symbol, Date) index.
# MACD1, MACD2 and MACD3 all name their column 'MACD', so after merging they appear as
# MACD_x (MACD line), MACD_y (signal line) and MACD (histogram).
for indicator in [MFI, RSI, MACD1, MACD2, MACD3]:
    all_data = all_data.merge(indicator, how='inner', left_index=True, right_index=True)

In [19]:
# remove the index to create a simple structure
all_data = all_data.reset_index()

In [20]:
all_data.tail()

Unnamed: 0,symbol,Date,High,Low,Open,Close,Volume,Adj Close,return,MFI,RSI,MACD_x,MACD_y,MACD
1488887,ZTS,2020-11-13,167.029999,164.509995,165.460007,165.779999,1666600.0,165.779999,0.003572,59.743375,52.944533,1.096271,1.082181,0.014091
1488888,ZTS,2020-11-16,168.520004,164.600006,166.369995,165.289993,1548000.0,165.289993,-0.002956,59.36455,52.124412,1.02877,1.071499,-0.042728
1488889,ZTS,2020-11-17,166.110001,164.070007,164.490005,165.429993,1214300.0,165.429993,0.000847,61.265156,52.351511,0.975329,1.052265,-0.076935
1488890,ZTS,2020-11-18,166.330002,163.419998,165.100006,163.5,1457800.0,163.5,-0.011667,55.860088,48.9073,0.768385,0.995489,-0.227104
1488891,ZTS,2020-11-19,166.940002,163.539993,164.070007,166.309998,1329244.0,166.309998,0.017187,62.28755,53.684983,0.821652,0.960721,-0.139069


In [21]:
# To create a label, we need to look at forward values to create a decision variable. 
# Note: These forward values are used only to create the label and should then be discarded.
#       We cannot provide forward values to the model because, at model execution time,
#       these values will not be available. 

# Creating forward data columns to define the forward goal. We use a 10-day goal. 

In [22]:
# n_day_value_low = the Low of the same symbol n trading days ahead (n = 1..10)
for n in range(1, 11):
    all_data[f'{n}_day_value_low'] = all_data.groupby(['symbol'])['Low'].shift(-n)

In [23]:
# dropping NAs as we cannot determine labels for most recent days. 

all_data.dropna(inplace = True)
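
Shifting by a negative number of rows within each symbol pulls a future `Low` onto the current row, so the last rows of every symbol have no forward value and are removed by the `dropna` call. A plain-Python sketch of the look-ahead on made-up lows (the helper `forward_values` is illustrative):

```python
def forward_values(values, n):
    """Value n steps ahead of each position; None where the future is not known yet."""
    return [values[i + n] if i + n < len(values) else None for i in range(len(values))]


lows = [21.0, 20.5, 20.8, 21.2, 21.1]  # made-up daily lows for one symbol
one_day_ahead = forward_values(lows, 1)   # [20.5, 20.8, 21.2, 21.1, None]
two_days_ahead = forward_values(lows, 2)  # [20.8, 21.2, 21.1, None, None]
```

With a 10-day horizon, the 10 most recent rows of each symbol end up with at least one missing forward value, which is why labels cannot be determined for them.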

In [24]:
# Define the label creating function 

# hyperparameters that can be tuned as per user preference: 
#   - stop loss value - usually 2%
#   - target value    - usually 4 to 8% and always greater than the stop loss

# we want the label to be 
# 1 = if the stop loss has not been hit in the next 10 days (the lowest low of the 10 forward days
#     is compared with the buy value, assumed to be the highest value on the current day)
#     and the 10th-day low is above the target
# 0 = otherwise


def day10_buy(df):
    buy = df['High']
    day10 = df['10_day_value_low']
    
    # all forward-looking low columns (1_day_value_low ... 10_day_value_low)
    low_cols = [i for i in df.columns if i.endswith("value_low")]
    low_value = df[low_cols].min(axis=1)
    is_not_stop_loss = low_value > (buy*0.98)   # 2% stop loss never breached
    is_target = day10 > (buy*1.08)              # 8% target reached on day 10
    return  (is_not_stop_loss & is_target)*1

In [25]:
all_data = all_data.reset_index()
day10_buy_signal = all_data.groupby('symbol').apply(day10_buy)
day10_buy_signal = pd.DataFrame(day10_buy_signal)
day10_buy_signal.columns = ["buy10_day"]
day10_buy_signal

symbol,,buy10_day
A,0,0
A,1,0
A,2,0
A,3,0
A,4,0
...,...,...
ZTS,1483323,0
ZTS,1483324,0
ZTS,1483325,0
ZTS,1483326,0
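
To make the labelling rule concrete, here is a scalar version applied to three made-up price paths. It assumes the same thresholds as `day10_buy` (2% stop loss, 8% target, buy price equal to the current day's high); the function name `buy_label` and the price paths are illustrative.

```python
def buy_label(high, future_lows, stop_loss=0.02, target=0.08):
    """Return 1 if no future low breaches the stop loss and the last low beats the target."""
    stop_price = high * (1 - stop_loss)
    target_price = high * (1 + target)
    not_stopped = min(future_lows) > stop_price
    hit_target = future_lows[-1] > target_price
    return int(not_stopped and hit_target)


# Ten forward lows for a stock bought at 100.0
winner = buy_label(100.0, [99.0, 100.5, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0])
stopped = buy_label(100.0, [99.0, 97.5, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0])
missed = buy_label(100.0, [99.0, 100.5, 102.0, 103.0, 104.0, 105.0, 106.0, 106.5, 107.0, 107.0])
```

Only the first path is labelled 1. The second dips below 98.0 on day 2, and the third never breaches the stop loss but ends at 107.0, short of the 108.0 target.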


In [26]:
# Check the number of opportunities we have in the complete data

day10_buy_signal['buy10_day'].sum()

18913

In [27]:
# reindex the data to merge with labels

all_data = all_data.set_index(['symbol',all_data.index])
all_data.head()

symbol,,index,Date,High,Low,Open,Close,Volume,Adj Close,return,MFI,RSI,MACD_x,MACD_y,MACD,1_day_value_low,2_day_value_low,3_day_value_low,4_day_value_low,5_day_value_low,6_day_value_low,7_day_value_low,8_day_value_low,9_day_value_low,10_day_value_low
A,0,1,2010-01-05,22.331903,22.002861,22.324749,22.145924,4186000.0,20.355618,-0.010863,0.0,0.0,0.0,0.0,0.0,22.002861,21.816881,21.74535,21.938484,21.616594,21.494993,21.816881,21.695278,21.709585,21.595137
A,1,2,2010-01-06,22.174536,22.002861,22.06724,22.06724,3243700.0,20.283297,-0.003553,0.0,0.0,0.0,0.0,0.0,21.816881,21.74535,21.938484,21.616594,21.494993,21.816881,21.695278,21.709585,21.595137,21.587982
A,2,3,2010-01-07,22.04578,21.816881,22.017168,22.038626,3095100.0,20.257004,-0.001297,0.0,0.0,0.0,0.0,0.0,21.74535,21.938484,21.616594,21.494993,21.816881,21.695278,21.709585,21.595137,21.587982,20.808298
A,3,4,2010-01-08,22.06724,21.74535,21.917025,22.031473,3733900.0,20.250422,-0.000325,0.0,0.0,0.0,0.0,0.0,21.938484,21.616594,21.494993,21.816881,21.695278,21.709585,21.595137,21.587982,20.808298,20.908442
A,4,5,2010-01-11,22.2103,21.938484,22.088697,22.04578,4781500.0,20.263571,0.000649,0.0,0.0,0.0,0.0,0.0,21.616594,21.494993,21.816881,21.695278,21.709585,21.595137,21.587982,20.808298,20.908442,20.729614


In [28]:
# Merging with labels 

all_data = all_data.merge(day10_buy_signal, how='inner', left_index=True, right_index=True)

In [29]:
# Check which stocks have the most positive labels (most buying opportunities)

all_data[all_data['buy10_day'] == 1].reset_index()['symbol'].value_counts()

TSLA    135
MU      120
NFLX    115
AMD     112
NVDA    111
       ... 
ATO       1
K         1
TTWO      1
WEC       1
OTIS      1
Name: symbol, Length: 500, dtype: int64

# Machine Learning

In [30]:
# import machine learning dependencies

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import sklearn
from sklearn.ensemble import RandomForestClassifier

In [68]:
# move index to main data
data = all_data.reset_index()

# some stock data has anomalies; remove rows where the volume is zero (i.e. missing)
data = data[data['Volume'] != 0]

# keep a single stock's data to pass to the ML model
data = data[(data['symbol'] == "NFLX" ) ] 


# to include multiple stocks, add OR conditions to the filter above, e.g.
#| (data['symbol'] == "AAPL" )

In [69]:
data.tail()

Unnamed: 0,symbol,level_1,index,Date,High,Low,Open,Close,Volume,Adj Close,return,MFI,RSI,MACD_x,MACD_y,MACD,1_day_value_low,2_day_value_low,3_day_value_low,4_day_value_low,5_day_value_low,6_day_value_low,7_day_value_low,8_day_value_low,9_day_value_low,10_day_value_low,buy10_day
851140,NFLX,851140,854727,2020-10-30,505.880005,472.209991,502.01001,475.73999,7807900.0,475.73999,-0.056465,35.832083,39.734107,-5.799264,-0.675199,-5.124065,475.0,478.76001,493.980011,503.450012,502.51001,467.26001,463.410004,478.26001,480.429993,477.799988,0
851141,NFLX,851141,854728,2020-11-02,486.299988,475.0,478.869995,484.119995,4408200.0,484.119995,0.017615,30.796581,43.118789,-6.539016,-1.847963,-4.691054,478.76001,493.980011,503.450012,502.51001,467.26001,463.410004,478.26001,480.429993,477.799988,477.299988,0
851142,NFLX,851142,854729,2020-11-03,495.309998,478.76001,484.929993,487.220001,3690200.0,487.220001,0.006403,26.108232,44.363614,-6.796782,-2.837726,-3.959055,493.980011,503.450012,502.51001,467.26001,463.410004,478.26001,480.429993,477.799988,477.299988,478.850006,0
851143,NFLX,851143,854730,2020-11-04,507.730011,493.980011,495.359985,496.950012,5137300.0,496.950012,0.01997,31.277908,48.195745,-6.145095,-3.4992,-2.645895,503.450012,502.51001,467.26001,463.410004,478.26001,480.429993,477.799988,477.299988,478.850006,477.720001,0
851144,NFLX,851144,854731,2020-11-05,518.72998,503.450012,506.559998,513.76001,5372800.0,513.76001,0.033826,37.192821,54.08037,-4.223516,-3.644063,-0.579453,502.51001,467.26001,463.410004,478.26001,480.429993,477.799988,477.299988,478.850006,477.720001,480.470001,0


In [70]:
#test on 2019
data_test = data[(data['Date']>dt.datetime(2018,12,31)) & (data['Date']<dt.datetime(2020,1,1))]   

#validate on 2018
data_validation = data[(data['Date']>dt.datetime(2017,12,31)) & (data['Date']<dt.datetime(2019,1,1))]

#train on data up to the end of 2017
data_train = data[data['Date']<dt.datetime(2018,1,1)]
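
The split above is chronological rather than random, so the model is never trained on data dated after the period it is evaluated on. A plain-Python sketch of the same boundaries on made-up rows (the helper `split_by_date` is illustrative; for daily data, `> 2018-12-31` and `>= 2019-01-01` select the same rows):

```python
import datetime as dt


def split_by_date(rows):
    """rows: list of dicts with a 'Date' key. Returns (train, validation, test)."""
    train = [r for r in rows if r["Date"] < dt.datetime(2018, 1, 1)]
    validation = [
        r for r in rows if dt.datetime(2018, 1, 1) <= r["Date"] < dt.datetime(2019, 1, 1)
    ]
    test = [
        r for r in rows if dt.datetime(2019, 1, 1) <= r["Date"] < dt.datetime(2020, 1, 1)
    ]
    return train, validation, test


rows = [
    {"Date": dt.datetime(2017, 6, 1)},
    {"Date": dt.datetime(2018, 6, 1)},
    {"Date": dt.datetime(2019, 6, 1)},
    {"Date": dt.datetime(2020, 6, 1)},
]
train, validation, test = split_by_date(rows)
```

Each of the first three rows lands in exactly one set. The 2020 row falls in none of them, which mirrors the notebook: data from 2020 is not used for training, validation or testing.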

In [71]:
data_cols = ['RSI','MFI','MACD','MACD_x','MACD_y']

X_train = data_train[data_cols]
y_train = data_train['buy10_day']

X_test = data_test[data_cols]
y_test = data_test['buy10_day']

X_val = data_validation[data_cols]
y_val = data_validation['buy10_day']

In [72]:
# Define the XGBoost Classifier model parameters

model = XGBClassifier(booster='gbtree',seed=0,nthread=-1,
                       gamma=0,learning_rate=0.001,n_estimators=40,
                      max_depth=5,objective='binary:logistic',subsample=1,scale_pos_weight=200)

# Train the model
model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.001, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=None, n_estimators=40, n_jobs=1,
              nthread=-1, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=200, seed=0,
              silent=None, subsample=1, verbosity=1)
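
`scale_pos_weight` increases the weight of the positive (buy) class, which matters here because buy labels are rare. The XGBoost parameter documentation mentions the ratio of negative to positive instances as a typical value to consider. The sketch below computes that ratio. Purely as an illustration it uses the class counts from the validation report further down (238 negatives, 13 positives); in practice the ratio would be computed on the training labels, whose balance is not shown in this notebook. The helper name `balance_ratio` is hypothetical.

```python
def balance_ratio(labels):
    """Number of negative labels divided by number of positive labels."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    if positives == 0:
        raise ValueError("no positive labels, ratio is undefined")
    return negatives / positives


labels = [0] * 238 + [1] * 13  # illustrative: class counts from the validation report
ratio = balance_ratio(labels)
```

For these counts the ratio is about 18.3. The value of 200 used in the model above weights the positive class far more heavily than that.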

In [73]:
# Notes on Model Validation and Backtesting

# Accuracy is not an important metric here because the data is imbalanced. 

# We need to focus on precision. Why? 
# Precision is, intuitively, the ability of the classifier not to label a negative sample as positive. 

# There might be 1000 opportunities out there. If our model identifies 100 positive labels, 
# we care about how many of those 100 are correct, 
# not about the fact that we identified only 100 out of 1000. 



In [74]:
# Test Accuracy

y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))


Accuracy: 77.38%


In [75]:
# Train Accuracy

y_pred_train = model.predict(X_train)
predictions = [round(value) for value in y_pred_train]
accuracy = accuracy_score(y_train, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 55.57%


In [76]:
# Validation Accuracy

y_pred_val = model.predict(X_val)
predictions_val = [round(value) for value in y_pred_val]
accuracy = accuracy_score(y_val, predictions_val)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 84.06%


In [77]:
from sklearn.metrics import classification_report
print(classification_report(y_val, predictions_val ))


              precision    recall  f1-score   support

           0       0.96      0.87      0.91       238
           1       0.11      0.31      0.17        13

    accuracy                           0.84       251
   macro avg       0.54      0.59      0.54       251
weighted avg       0.91      0.84      0.87       251



In [78]:
# In the validation set, how many buy signals were predicted correctly (true positives).

(((y_val==1) & (y_val==predictions_val))*1).sum()

4

In [79]:
# In the validation set, how many buy signals the model predicted. 

sum(predictions_val)

35
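
The two counts above, together with the 13 actual buy days and 251 rows listed in the validation classification report, are enough to reproduce the reported metrics by hand. The variable names below are illustrative; the numbers are taken from the notebook output.

```python
true_positives = 4        # correctly predicted buy days
predicted_positives = 35  # all predicted buy days
actual_positives = 13     # support of class 1 in the classification report
total = 251               # rows in the validation set

false_positives = predicted_positives - true_positives
false_negatives = actual_positives - true_positives
true_negatives = total - true_positives - false_positives - false_negatives

precision = true_positives / predicted_positives
recall = true_positives / actual_positives
accuracy = (true_positives + true_negatives) / total

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.4f}")
# prints: precision=0.11 recall=0.31 accuracy=0.8406
```

Only 4 of the 35 predicted buy days were real opportunities, so the 84% accuracy comes almost entirely from the 207 correctly predicted non-buy days. This is the reason the notes above focus on precision rather than accuracy.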

In [None]:
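# save the model output locally
# `run_time` is not defined above; assumed here to be the validation rows plus predictions
run_time = data_validation.assign(model_op_buy=predictions_val)
run_time[run_time['model_op_buy'] == 1].to_csv("model_op.csv")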