# Capstone Project: Machine Learning and Financial Trading

## Overview

The idea of machine learning (ML) has been present for well over [60 years](https://en.wikipedia.org/wiki/Machine_learning#:~:text=The%20term%20machine%20learning%20was,computer%20gaming%20and%20artificial%20intelligence.). However, it's only recently _(over the last decade)_ that it has garnered a lot of attention _(from self-driving cars, to fraud detection, to product recommendations on online shopping platforms)_. There is little to no doubt about ML's contributions to solving real-world problems.

Yet, in the area of trading and investing, there are mixed feelings about ML's usefulness. Although institutions use [ML to gain an advantage in the markets](https://robusttechhouse.com/list-of-funds-or-trading-firms-using-artificial-intelligence-or-machine-learning/), many [retail traders](https://www.investopedia.com/articles/active-trading/030515/what-difference-between-institutional-traders-and-retail-traders.asp#:~:text=Retail%20traders%20typically%20invest%20in,of%20shares%20at%20a%20time.) _(individuals who trade their own money via discount brokers)_ have not experienced the same benefits, for two main reasons:

1. Lack of knowledge _(Some believe ML is difficult to understand, or that it simply offers no value)_

2. Lack of resources _(Institutions have millions of dollars to invest in network infrastructure and hundreds of PhDs to help gain an advantage in the markets)_


## Problem Statement

Can machine learning enhance a retail trader's trading performance?

## Import Libraries

In [78]:
# Import standard data analysis and manipulation libraries
import pandas as pd
import numpy as np

# Import yfinance. Used to download financial data
import yfinance as yf

# Technical Analysis Library (TA-Lib)
import talib as ta

## Data Collection

The data used in this project will be Gold ETF (`GLD`) price data downloaded from [Yahoo Finance](https://finance.yahoo.com/) using the [`yfinance`](https://anaconda.org/ranaroussi/yfinance) Python package.

- Get roughly eleven years of daily GLD pricing data (January 2011 through December 2021).

In [79]:
gld_data = yf.download('GLD', start="2011-01-01", end="2021-12-21", auto_adjust=True)

[*********************100%***********************]  1 of 1 completed


- Check dataframe information

In [80]:
gld_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2761 entries, 2011-01-03 to 2021-12-20
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Open    2761 non-null   float64
 1   High    2761 non-null   float64
 2   Low     2761 non-null   float64
 3   Close   2761 non-null   float64
 4   Volume  2761 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 129.4 KB


- Check first five rows of data

In [81]:
gld_data.head()

Date,Open,High,Low,Close,Volume
2011-01-03,138.669998,139.0,137.880005,138.0,11510200
2011-01-04,136.240005,136.279999,134.160004,134.75,26154300
2011-01-05,133.5,134.679993,133.100006,134.369995,16700900
2011-01-06,134.050003,134.380005,133.139999,133.830002,15965300
2011-01-07,133.380005,134.610001,133.179993,133.580002,16761400


- Check shape of dataframe

In [82]:
gld_data.shape

(2761, 5)

## Data Cleaning and Manipulation

In this section, the data is checked for nulls and duplicates. After these checks, additional columns are added for the [technical indicators](https://www.investopedia.com/terms/t/technicalindicator.asp).

### Check for Nulls and Duplicates

- Check for nulls

In [83]:
gld_data.isnull().sum()

Open      0
High      0
Low       0
Close     0
Volume    0
dtype: int64

- Check for duplicates

Duplicates can distort the accuracy of trading signals and should be removed.

In [84]:
gld_data1 = gld_data[gld_data.duplicated(keep=False)]

The above result indicates that there are no duplicate rows in the data.
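`DataFrame.duplicated` compares row values, so a date that appears twice in the index with different prices would not be flagged. The sketch below uses a small hypothetical frame (`sample`, not the real `gld_data`) to show both the row-value check and an index check side by side.

```python
import pandas as pd

# Hypothetical toy frame standing in for gld_data: 2011-01-04 appears twice
sample = pd.DataFrame(
    {"Close": [138.00, 134.75, 134.80]},
    index=pd.to_datetime(["2011-01-03", "2011-01-04", "2011-01-04"]),
)

# Rows whose values repeat exactly (what DataFrame.duplicated checks)
duplicate_rows = int(sample.duplicated(keep=False).sum())

# Dates that occur more than once in the index
duplicate_dates = int(sample.index.duplicated(keep=False).sum())

print(duplicate_rows, duplicate_dates)
```

In this toy case the row-value check finds nothing while the index check flags both entries for the repeated date, so running both checks may be worthwhile on time-series data.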

- Drop the `Volume` column

Since `Volume` data will not be used in this project, it can be dropped.

In [85]:
gld_data.drop(columns = 'Volume', inplace=True)

### Calculate Technical Indicators

Technical indicators are used to help traders determine when to enter and exit a trade.

In this section, several technical indicators will be calculated and added to the `gld_data` dataframe.

Instead of performing manual calculations to calculate technical indicators, the [TA-Lib](https://mrjbq7.github.io/ta-lib/) Python package will be used.



- Calculate RSI

According to Investopedia:

_"The [relative strength index](https://www.investopedia.com/terms/r/rsi.asp) (RSI) is a momentum indicator used in technical analysis that measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset."_

In [86]:
# Calculate the RSI
gld_data['RSI'] = ta.RSI(gld_data['Close'], timeperiod=14)
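To make the RSI definition more concrete, here is a minimal pandas sketch of the calculation. It assumes Wilder-style smoothing, implemented as exponential smoothing with `alpha = 1 / period`. The function name `rsi_sketch` is hypothetical, and the values may differ slightly from `ta.RSI` because the initial averages are seeded differently.

```python
import pandas as pd

def rsi_sketch(close: pd.Series, period: int = 14) -> pd.Series:
    """Illustrative RSI using Wilder-style smoothing (not TA-Lib's exact code)."""
    delta = close.diff()
    gains = delta.clip(lower=0)       # positive price changes, else 0
    losses = (-delta).clip(lower=0)   # magnitude of negative changes, else 0
    # Exponential smoothing with alpha = 1/period; the first `period` values stay NaN
    avg_gain = gains.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = losses.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Hypothetical toy series: steadily rising and steadily falling prices
rising = pd.Series([float(p) for p in range(1, 31)])
falling = rising[::-1].reset_index(drop=True)
rsi_rising = rsi_sketch(rising)
rsi_falling = rsi_sketch(falling)
print(rsi_rising.iloc[-1], rsi_falling.iloc[-1])
```

A series that only rises pushes the sketch toward the top of the 0 to 100 range, and a series that only falls pushes it toward the bottom, which matches the overbought/oversold reading described in the quote.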

- Calculate Parabolic SAR

According to Investopedia:

_"The [parabolic SAR](https://www.investopedia.com/ask/answers/06/parabolicsar.asp), or parabolic stop and reverse, is a popular indicator that is mainly used by traders to determine the future short-term momentum of a given asset."_

In [87]:
# Calculate pSAR
gld_data['pSAR'] = ta.SAR(gld_data['High'].values, 
                          gld_data['Low'].values,
                          acceleration=0.02, 
                          maximum=0.2)

- Calculate Triple Exponential Moving Average (TEMA)

According to Investopedia:

_"The [triple exponential moving average (TEMA)](https://www.investopedia.com/terms/t/triple-exponential-moving-average.asp) was designed to smooth price fluctuations, thereby making it easier to identify trends without the lag associated with traditional moving averages (MA)"_

In [88]:
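# Calculate the moving average with TA-Lib's T3 function; the result is stored in the 'TEMA' column
# (TA-Lib also provides a separate ta.TEMA function)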
gld_data['TEMA'] = ta.T3(gld_data['Close'], timeperiod=5, vfactor=0)
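The TEMA described in the quote is commonly defined as `3*EMA1 - 3*EMA2 + EMA3`, where each EMA is taken of the previous one. The following is a minimal pandas sketch of that formula, not the notebook's TA-Lib call. The function name `tema_sketch` and the toy `ramp` series are hypothetical, and the EMA seeding differs from TA-Lib's, so early values will not match.

```python
import pandas as pd

def tema_sketch(close: pd.Series, period: int = 5) -> pd.Series:
    """Illustrative TEMA: 3*EMA1 - 3*EMA2 + EMA3 (seeding differs from TA-Lib)."""
    ema1 = close.ewm(span=period, adjust=False).mean()
    ema2 = ema1.ewm(span=period, adjust=False).mean()
    ema3 = ema2.ewm(span=period, adjust=False).mean()
    return 3 * ema1 - 3 * ema2 + ema3

# Hypothetical toy input: a steadily rising price, where lag is easy to see
ramp = pd.Series([float(i) for i in range(100)])
single_ema = ramp.ewm(span=5, adjust=False).mean()
tema = tema_sketch(ramp, period=5)

# Distance between the latest price and each average
ema_lag = ramp.iloc[-1] - single_ema.iloc[-1]
tema_lag = ramp.iloc[-1] - tema.iloc[-1]
print(ema_lag, tema_lag)
```

On a steady trend the single EMA trails the price by a fixed amount, while the combined term cancels most of that lag, which is the property the definition emphasizes.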

- Calculate ADX (Average Directional Index)

According to Investopedia:

_"The [ADX](https://www.investopedia.com/articles/trading/07/adx-trend-indicator.asp) is used to quantify trend strength. ADX calculations are based on a moving average of price range expansion over a given period of time."_


In [89]:
# Calculate ADX
gld_data['ADX'] = ta.ADX(gld_data['High'], 
                         gld_data['Low'], 
                         gld_data['Close'], timeperiod=14)

- Inspect data

In [90]:
gld_data

Date,Open,High,Low,Close,RSI,pSAR,TEMA,ADX
2011-01-03,138.669998,139.000000,137.880005,138.000000,,,,
2011-01-04,136.240005,136.279999,134.160004,134.750000,,139.000000,,
2011-01-05,133.500000,134.679993,133.100006,134.369995,,138.903200,,
2011-01-06,134.050003,134.380005,133.139999,133.830002,,138.671072,,
2011-01-07,133.380005,134.610001,133.179993,133.580002,,138.448230,,
...,...,...,...,...,...,...,...,...
2021-12-14,165.339996,166.139999,165.160004,165.440002,40.924239,167.686717,166.638833,11.175713
2021-12-15,165.270004,166.399994,163.800003,166.149994,44.258064,167.381046,166.549802,11.945480
2021-12-16,167.009995,168.179993,166.940002,168.160004,52.440786,163.800003,166.544353,11.471833
2021-12-17,168.729996,169.130005,167.779999,167.800003,50.996847,163.800003,166.611587,10.819805


### Drop Added Nulls

The calculation of technical indicators generated nulls that must be deleted. 

In [91]:
gld_data.dropna(inplace=True)
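The nulls arise because each indicator needs a warm-up window before it can produce a value, and `dropna` removes every row in which any column is still null. A minimal sketch of this effect follows, using simple rolling means on a hypothetical `toy` frame as stand-ins for the real indicators.

```python
import pandas as pd

# Hypothetical toy prices; the real notebook uses gld_data['Close']
toy = pd.DataFrame({"Close": [float(p) for p in range(100, 110)]})

# Two rolling features with different lookback windows (stand-ins for indicators)
toy["fast"] = toy["Close"].rolling(window=3).mean()  # first 2 rows are NaN
toy["slow"] = toy["Close"].rolling(window=5).mean()  # first 4 rows are NaN

nulls_per_column = toy.isnull().sum()
cleaned = toy.dropna()

# The feature with the longest warm-up decides how many rows are lost
rows_lost = len(toy) - len(cleaned)
print(nulls_per_column.to_dict(), rows_lost)
```

The same logic suggests that the indicator with the longest warm-up period determines the first date retained in `gld_data`.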

In [92]:
gld_data.isnull().sum()

Open     0
High     0
Low      0
Close    0
RSI      0
pSAR     0
TEMA     0
ADX      0
dtype: int64

In [93]:
gld_data

Date,Open,High,Low,Close,RSI,pSAR,TEMA,ADX
2011-02-10,132.110001,133.309998,132.000000,132.850006,46.092442,128.876541,131.462458,33.107729
2011-02-11,133.009995,133.440002,132.089996,132.320007,44.099798,129.153949,131.675929,31.074322
2011-02-14,132.949997,133.380005,132.699997,132.949997,47.031123,129.414712,131.874588,29.186158
2011-02-15,133.860001,134.169998,133.630005,133.970001,51.468512,129.659829,132.094962,27.407409
2011-02-16,134.229996,134.860001,133.449997,134.100006,52.020217,130.020643,132.338897,26.263413
...,...,...,...,...,...,...,...,...
2021-12-14,165.339996,166.139999,165.160004,165.440002,40.924239,167.686717,166.638833,11.175713
2021-12-15,165.270004,166.399994,163.800003,166.149994,44.258064,167.381046,166.549802,11.945480
2021-12-16,167.009995,168.179993,166.940002,168.160004,52.440786,163.800003,166.544353,11.471833
2021-12-17,168.729996,169.130005,167.779999,167.800003,50.996847,163.800003,166.611587,10.819805


### Export Data to CSV For Analysis and Modeling

In [94]:
gld_data.to_csv('./data/gld_data.csv')
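`to_csv` does not create missing directories, and the date index comes back as plain strings unless it is parsed on reload. The sketch below shows one way to handle both. It writes a hypothetical two-row `sample` frame to a temporary directory rather than the notebook's `./data` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame standing in for gld_data
sample = pd.DataFrame(
    {"Close": [132.85, 132.32]},
    index=pd.DatetimeIndex(["2011-02-10", "2011-02-11"], name="Date"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists
    csv_path = out_dir / "gld_data.csv"
    sample.to_csv(csv_path)
    # Restore the DatetimeIndex when reading the file back
    reloaded = pd.read_csv(csv_path, index_col="Date", parse_dates=True)

print(reloaded.index.dtype)
```

Reading the file back with `index_col="Date"` and `parse_dates=True` should give the analysis and modeling notebooks the same date-indexed frame that was exported.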