<h1> <center> Predicting Day Trade Return by Deep Learning </center> </h1>

The aim of this project is to predict the outcome of a day trade by training a deep learning model on images of 22-day candlestick charts with financial indicators drawn on them (Bollinger Bands and the 20-day moving average for now). 

- **Data Scraping**: 

	For 242 stocks listed in the S&P 500 index, the historical price data for the last ten years is scraped. 

- **Creating .png images**:

	For every 22-day interval, I draw the candlestick chart of the data along with some financial indicators (Bollinger Bands for now). The resulting images look like this:
    <img src="sample_image.png" style="height:256px" >
	For each image file, I create the day_trade_precentage feature, which is the percentage return of buying the stock at the Close price of the 22nd day (the last day included in the candlestick chart) and selling it at the next day's Close price. I then discretize the percentage returns into N categories (3 and 14 tried so far) and label each image with its category. 
	The image files are saved in the directory images/label, where label is the image's category. 

- **Preparing Data Directory for flow_from_directory**: 

	In order to use the flow_from_directory method of Keras, the data is split into three directories under the images_separated directory: train_data, validation_data, and test_data. The directory structure is as follows:

	```text
    images_separated/
		train_data/
			label_1/
				train1_image_1.png
				train1_image_2.png
				...
			label_2/
				train2_image_1.png
				train2_image_2.png
				...
			...
		validation_data/
			label_1/
				validation1_image_1.png
				validation1_image_2.png
				...
			label_2/
				validation2_image_1.png
				validation2_image_2.png
				...
			...
		test_data/
			label_1/
				test1_image_1.png
				test1_image_2.png
				...
			label_2/
				test2_image_1.png
				test2_image_2.png
				...
			...
    ```

- **Train CNN model**: 

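	The CNN has three convolution layers (5x5 kernels, ReLU activation), each followed by 3x3 max pooling, then two dense layers with 64 and 32 neurons and a 3-way softmax output. The full definition is given in the Deep Learning Model section.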


- **Results**: 

The table below shows how the accuracy changes with the number of categories used to discretize the percentage returns:

|  num_cat |   14   |    3   |
|----------|--------|--------|
| accuracy |  0.18  | 0.3488 |


## Option to skip some processes 

The steps before modelling take a long time and do not need to be repeated on every run, so I added an option to skip each of them individually. However, if you decide not to skip the data scraping step, you must not skip any of the other steps either. 

In [1]:
is_skip_scrape = input("Do you want to skip data scraping? (y / n): ")
is_skip_image_create    = input("Do you want to skip creating candle stick charts? (y / n): ")
is_skip_directory_split = input("Do you want to skip splitting the image directory into train/validation/test sub-directories? (y / n): ")


Do you want to skip data scraping? (y / n): y
Do you want to skip creating candle stick charts? (y / n): y
Do you want to skip splitting the image directory into train/validation/test sub-directories? (y / n): n
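
The rule that a skipped step needs every earlier step to be skipped too can also be enforced in code instead of being left to the user. The following is only a minimal sketch: `resolve_skips` is a hypothetical helper that is not part of the functions folder, and it assumes that redrawing the images also makes the old train/validation/test split stale.

```python
def resolve_skips(skip_scrape, skip_image_create, skip_directory_split):
    """Return three booleans; a step is skipped only if every earlier step is skipped too.

    Hypothetical helper, not part of the project's functions folder.
    The arguments are the raw answers typed at the prompts, e.g. 'y' or 'n'.
    """
    def is_yes(answer):
        return answer.strip().lower() == 'y'

    skip_scrape = is_yes(skip_scrape)
    # Freshly scraped prices make the old images stale, so they must be redrawn.
    skip_image_create = skip_scrape and is_yes(skip_image_create)
    # Assumption: freshly drawn images make the old directory split stale as well.
    skip_directory_split = skip_image_create and is_yes(skip_directory_split)
    return skip_scrape, skip_image_create, skip_directory_split


# Scraping runs here, so no later step is skipped, whatever was answered.
print(resolve_skips('n', 'y', 'y'))
# The answers given in the session above.
print(resolve_skips('y', 'y', 'n'))
```

With this helper the later cells would test the returned booleans rather than comparing the raw answers with 'y'.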


In [2]:
# Load the required packages
import plotly.graph_objects as go
import pandas as pd
import os
import shutil
import numpy as np
import matplotlib.pyplot as plt
import math 

## Data Scraping

In [3]:
list_stocks = [
"MSFT", "AAPL", "AMZN", "GOOG", "GOOGL", "FB", "BRK.B", "V", "WMT", "JPM", "PG", 
"MA", "UNH", "INTC", "VZ", "T", "HD", "BAC", "MRK", "DIS", "PFE", "PEP", "CSCO", 
"CMCSA", "ORCL", "NFLX", "XOM", "NVDA", "ADBE", "ABT", "CRM", "NKE", "CVX", "LLY", "COST", 
"WFC", "MCD", "MDT", "BMY", "AMGN", "NEE", "PYPL", "TMO", "PM", "ABBV", "ACN", "CHTR", 
"LMT", "DHR", "UNP", "IBM", "TXN", "HON", "AVGO", "GILD", "C", "BA", "LIN", "UTX", 
"UPS", "SBUX", "MMM", "CVS", "QCOM", "FIS", "AXP", "TMUS", "MDLZ", "MO", "BLK", "LOW", "GE", 
"FISV", "CME", "D", "CI", "INTU", "SYK", "SO", "BDX", "PLD", "CAT", "EL", "SPGI", 
"ISRG", "CCI", "AGN", "TJX", "ADP", "VRTX", "ANTM", "CL", "GS", "AMD", "USB", "ZTS", "NOC", 
"MS", "NOW", "BIIB", "BKNG", "EQIX", "REGN", "CB", "MU", "TGT", "ITW", "ECL", "TFC", 
"ATVI", "CSX", "GPN", "SCHW", "MMC", "PGR", "PNC", "BSX", "KMB", "APD", "DE", "SHW", "AMAT", 
"AEP", "MCO", "EW", "WM", "BAX", "LHX", "NSC", "ILMN", "RTN", "HUM", "WBA", "SPG",  
"GD", "NEM", "DG", "SRE", "LRCX", "EXC", "DLR", "PSA", "ADI", "ROP", "CNC", "LVS", "COP", 
"FDX", "GIS", "KMI", "ADSK", "XEL", "ETN", "GM", "MNST", "ROST", "KHC", "HCA", "SBAC", "BK", 
"MET", "WEC", "ALL", "EMR", "STZ", "EA", "HSY", "ES", "ED", "SYY", "CTSH", "AFL", 
"MAR", "TRV", "COF", "DD", "HRL", "HPQ", "RSG", "EBAY", "INFO", "MSCI", "EQR", "ORLY", "MSI", 
"TROW", "KR", "PSX", "VFC", "AVB", "PEG", "VRSK", "KLAC", "AIG", "MCK", "APH", "A", "AWK", 
"CLX", "PAYX", "WLTW", "DOW", "PRU", "TEL", "BLL", "EOG", "FE", "IQV", "YUM", "PCAR", "F", 
"RMD", "WELL", "K", "VRSN", "EIX", "PPG", "AZO", "JCI", "TWTR", "CMI", "IDXX", "TT", "ZBH", 
"O", "PPL", "ETR", "HLT", "ANSS", "SLB", "DAL", "CTAS", "LUV", "DTE", "XLNX", "SNPS", 
"ADM", "ALXN", "VLO", "AEE", "CERN", "DLTR"
]

In [4]:
# Import the data scraping tool from functions folder
from functions.Scrape_Historical_Data import scrape_historical_data

if is_skip_scrape != 'y':
    for stock_code in list_stocks: 
        scrape_historical_data(stock_code)
    print('\nData scraping is complete.')
else:
    print('Using the previously scraped data located at /data_folder/.')

Using the previously scraped data located at /data_folder/.


## Preprocessing Data and Creating Image Files 

Now that the historical price data is stored in .csv files, we can clean the data, create labels based on the outcome of a day trade, and draw Bollinger Bands on top of the 22-day candlestick charts. As a result, we will have around 550,000 images that look like this: 
<img src="sample_image.png" style="height:256px" >

In [5]:
if is_skip_image_create != 'y':
    # Import the necessary tools to clean the data set and create the 
    # candlestick charts with the bollinger bands from functions folder
    from functions.DataFrame_Preprocessors import cleaner, calculate_return, categorizer 
    from functions.Bollinger_Bands import bollinger_bands 
    from functions.Image_Creator import image_creator 

    time_interval = 22
    categories = (-1, 0, 1)

    for sub_dir in categories:
        images_dir = 'images/{}'.format(sub_dir)
        if not os.path.exists(images_dir):
            os.makedirs(images_dir)

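    for stock_name in os.listdir('historical_price_data'):    
        data_path = os.path.join('historical_price_data', stock_name)

        # Skip files that are empty or hold only a few bytes
        if os.stat(data_path).st_size <= 5:
            continue  

        stock_price = pd.read_csv(data_path)

        # Skip stocks with fewer than 200 rows of price history
        if len(stock_price) < 200:
            continue 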

        stock_price = cleaner(stock_price)
        stock_price = bollinger_bands(stock_price)
        stock_price = calculate_return(stock_price)
        stock_price = categorizer(stock_price)

        for start in range(len(stock_price) - time_interval):
            end = start + time_interval
            sub_stock_price = stock_price[start: end] 
            file_name = '{}_{}'.format(stock_name[:-4], start)

            image_creator(df = sub_stock_price, file_name = file_name)
else:
    print('Using the previously prepared .png files located at /images/.')    

Using the previously prepared .png files located at /images/.
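
The helpers imported from the functions folder are not shown in this notebook. As a rough illustration of what the labelling and indicator steps compute, here is a hedged pandas sketch. The function names, the `Close` column name, the two-standard-deviation band width and the use of the sample standard deviation are all assumptions; only the 20-day window and the ±0.5% thresholds come from this document.

```python
import numpy as np
import pandas as pd


def add_bollinger_bands(df, window=20, num_std=2.0):
    """Add a moving average and upper/lower bands computed on the Close column."""
    out = df.copy()
    rolling = out['Close'].rolling(window)
    out['MA'] = rolling.mean()
    # Assumption: bands sit num_std sample standard deviations around the average.
    out['Upper'] = out['MA'] + num_std * rolling.std()
    out['Lower'] = out['MA'] - num_std * rolling.std()
    return out


def add_day_trade_label(df, threshold=0.5):
    """Label each day by the next day's close-to-close return, in percent."""
    out = df.copy()
    # Return of buying at today's Close and selling at tomorrow's Close.
    out['pct_return'] = 100.0 * (out['Close'].shift(-1) / out['Close'] - 1.0)
    out['label'] = np.select(
        [out['pct_return'] > threshold, out['pct_return'] < -threshold],
        [1, -1],
        default=0,
    )
    # The last row has no next day, so its return is NaN; drop it.
    return out.dropna(subset=['pct_return'])


prices = pd.DataFrame({'Close': [100.0, 101.0, 101.2, 100.0, 100.0]})
labelled = add_day_trade_label(prices)
print(labelled[['Close', 'pct_return', 'label']])
```

In this scheme, the label of a 22-day chart is the label of its last day, since that is the day on which the trade is opened.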


## Splitting Data into Train / Validation / Test Directories 

Using the *train_test_directory_split* tool in the functions folder, we split the data into three subsets: train (60%), validation (20%), and test (20%). Once *train_test_directory_split()* has done its job, we will have a folder, images_separated, in the following format, which the *flow_from_directory* method of the Keras *ImageDataGenerator* class can read directly. 

```text
images_separated/
	train_data/
		label_1/
			train1_image_1.png
			train1_image_2.png
			...
		label_2/
			train2_image_1.png
			...
		...
	validation_data/
		label_1/
			validation1_image_1.png
			...
		label_2/
			validation2_image_1.png
			...
		...
	test_data/
		label_1/
			test1_image_1.png
			...
		label_2/
			test2_image_1.png
			...
		...
```

In [7]:
if is_skip_directory_split != 'y':
    # Import the directory splitting tool from functions folder
    from functions.Train_Test_Directory_Split import train_test_directory_split
    
    categories = (-1, 0, 1)
    # Prepare the data directory for the flow_from_directory method 
    train_test_directory_split(classes=categories)
else:
    print('Using the previously split data located at /images_separated/.') 


Total images in class -1 is 164605
	98763 copied to ../training/-1
	32921 copied to ../validation/-1
	32921 copied to ../testing/-1

Total images in class 0 is 189794
	113876 copied to ../training/0
	37959 copied to ../validation/0
	37959 copied to ../testing/0

Total images in class 1 is 189666
	113799 copied to ../training/1
	37933 copied to ../validation/1
	37934 copied to ../testing/1
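
The source of *train_test_directory_split* is not shown here, so the following is only a sketch of how such a 60/20/20 split could work. The function names, the shuffling and the rounding rule are assumptions. The rounding rule (60% rounded down for training, half of the remainder rounded down for validation, the rest for testing) was chosen because it reproduces the counts printed above.

```python
import os
import random
import shutil


def split_counts(n_images):
    """Return (train, validation, test) sizes for a 60/20/20 split of n_images."""
    n_train = n_images * 60 // 100                # 60%, rounded down
    n_validation = (n_images - n_train) // 2      # half of the remainder, rounded down
    n_test = n_images - n_train - n_validation    # whatever is left
    return n_train, n_validation, n_test


def split_class_directory(source_dir, target_root, label, seed=0):
    """Copy the images of one class into train/validation/test sub-directories."""
    file_names = sorted(os.listdir(source_dir))
    random.Random(seed).shuffle(file_names)       # reproducible shuffle
    n_train, n_validation, _ = split_counts(len(file_names))
    subsets = {
        'train_data': file_names[:n_train],
        'validation_data': file_names[n_train:n_train + n_validation],
        'test_data': file_names[n_train + n_validation:],
    }
    for subset, names in subsets.items():
        destination = os.path.join(target_root, subset, str(label))
        os.makedirs(destination, exist_ok=True)
        for name in names:
            shutil.copy(os.path.join(source_dir, name), destination)


# Class sizes taken from the output above.
for n_images in (164605, 189794, 189666):
    print(n_images, split_counts(n_images))
```

Calling `split_class_directory('images/-1', 'images_separated', -1)` would then fill the `-1` sub-directory of each subset.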


## Deep Learning Model 

Now I will train a CNN model with three convolution layers that use 5x5 kernels and ReLU activation functions, each followed by a 3x3 max pooling layer. After flattening, I add two dense layers with 64 and 32 neurons and a softmax output layer with one neuron per category. Once the model is compiled, the following code also prints the summary of the model. 

In [8]:
from keras.layers import Conv2D, AveragePooling2D, Dense, Flatten, MaxPooling2D
from keras.models import Sequential 

cnn = Sequential() 

cnn.add(Conv2D(256, kernel_size=(5,5), input_shape=(256, 256, 4), padding='same', activation='relu')) 
cnn.add(MaxPooling2D(pool_size=(3,3))) 

cnn.add(Conv2D(128, kernel_size=(5,5), padding='same', activation='relu')) 
cnn.add(MaxPooling2D(pool_size=(3,3))) 

cnn.add(Conv2D(64,  kernel_size=(5,5), padding='same', activation='relu')) 
cnn.add(MaxPooling2D(pool_size=(3,3))) 

cnn.add(Flatten()) 
cnn.add(Dense(64, activation = 'relu')) 
cnn.add(Dense(32, activation = 'relu')) 
cnn.add(Dense(3, activation = 'softmax')) 

cnn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) 
cnn.summary()  # summary() prints the table itself and returns None

Using TensorFlow backend.


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 256, 256, 256)     25856     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 85, 85, 256)       0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 85, 85, 128)       819328    
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 28, 28, 128)       0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 28, 28, 64)        204864    
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 9, 9, 64)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 5184)             
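
The numbers in the summary can be checked by hand. A Conv2D layer has one weight per kernel cell and input channel plus one bias for each filter, and a 3x3 max pooling layer with default strides divides each spatial dimension by 3, rounding down. The short sketch below redoes this arithmetic; the helper names are made up for the illustration.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Number of weights plus one bias per filter for a Conv2D layer."""
    kernel_h, kernel_w = kernel_size
    return (kernel_h * kernel_w * in_channels + 1) * filters


def max_pool_output(size, pool_size):
    """Spatial size after MaxPooling2D with default strides and 'valid' padding."""
    return size // pool_size


size, channels = 256, 4               # 256x256 RGBA input
for filters in (256, 128, 64):        # the three Conv2D layers of the model
    params = conv2d_params((5, 5), channels, filters)
    size = max_pool_output(size, 3)   # padding='same' keeps the size until pooling
    channels = filters
    print(filters, params, size)

print(size * size * channels)         # length of the flattened vector
```

The loop reproduces the parameter counts 25856, 819328 and 204864, the pooled sizes 85, 28 and 9, and the flattened length 5184 shown in the summary.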

Using the *flow_from_directory* method of the Keras API, I feed the train and validation data into the cnn model in batches of size 5. The images are 256x256, which is also the default target size of *flow_from_directory*, so they are not resized here. I handle normalization in the same step by rescaling the pixel values by 1/255. Once the train and validation data generators are created, I pass them to the cnn model to train it. 

In [14]:
from keras.preprocessing.image import ImageDataGenerator
# Create train data generator
train_datagen = ImageDataGenerator(rescale=1./255) 
train_generator = train_datagen.flow_from_directory(
        'images_separated/train_data',
        color_mode = 'rgba', 
        shuffle = True, 
        batch_size=5) 

# Create validation data generator
validation_datagen = ImageDataGenerator(rescale=1./255)  
validation_generator = validation_datagen.flow_from_directory(
        'images_separated/validation_data',
        color_mode = 'rgba', 
        shuffle = True, 
        batch_size=5) 

# Fit the model
history  = cnn.fit_generator(
        train_generator,
        epochs= 6,
        validation_data=validation_generator
        ) 

Found 326438 images belonging to 3 classes.
Found 108813 images belonging to 3 classes.
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


Now that the model is trained, we can test it on the test data, produced by the following test data generator. 

In [15]:
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory(
        'images_separated/test_data',
        color_mode = 'rgba', 
        # Keep the file order fixed so that predictions line up with test_generator.classes
        shuffle = False, 
        batch_size = 5)

# Calculate the accuracy of the model's predictions on the test data
test_loss, test_acc = cnn.evaluate_generator(test_generator, verbose=2)
print("Test Accuracy is {} %".format(round(100*test_acc,2)))

Found 108814 images belonging to 3 classes.
Test Accuracy is 34.88 %


The accuracy is quite low. Note that the same accuracy could have been achieved by assigning all the test data to a single category, since each category holds roughly 1/3 of the data. How about the precision and the recall of our model? 

In [16]:
labels_true = test_generator.classes
labels_pred_probs = cnn.predict_generator(test_generator) 
labels_pred = np.argmax(labels_pred_probs, axis=1)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(labels_true, labels_pred)
print(cm) 

[[    0 32921     0]
 [    0 37959     0]
 [    0 37934     0]]
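
Precision and recall can be read off the confusion matrix directly. The sketch below recomputes them with NumPy from the matrix printed above; rows are the true classes and columns the predicted classes, in the order -1, 0, 1 (the row sums match the test set sizes of these classes). Reporting an undefined precision as 0 is a convention chosen for this sketch.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[0, 32921, 0],
               [0, 37959, 0],
               [0, 37934, 0]])

true_positives = np.diag(cm).astype(float)
predicted_per_class = cm.sum(axis=0)
actual_per_class = cm.sum(axis=1)

# Precision is undefined for a class that is never predicted; report it as 0 here.
precision = np.divide(true_positives, predicted_per_class,
                      out=np.zeros_like(true_positives),
                      where=predicted_per_class > 0)
recall = true_positives / actual_per_class
accuracy = true_positives.sum() / cm.sum()

print('precision:', precision.round(4))
print('recall:   ', recall.round(4))
print('accuracy: ', round(100 * accuracy, 2), '%')
```

Only category 0 has a non-zero precision (about 0.3488) and recall (1.0), because it is the only class the model ever predicts.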


## Results 

Here we tried to classify the 22-day candlestick chart of a stock, drawn with its Bollinger Bands, into three categories based on the next day's percentage return: 

- category  1: percentage return > 0.5% 
- category  0: percentage return between -0.5% and +0.5% 
- category -1: percentage return < -0.5% 

The resulting test accuracy is 34.88%. The confusion matrix shows that the model has learned nothing useful: it assigns every test image to category 0, so its accuracy is simply the share of category 0 in the test set (37959 / 108814 ≈ 34.88%). 