<p style="font-family: Arial; font-size:3em;color:purple; font-weight:bold"><br>
Final Project - Machine Learning Short Term Stock<br><br> Prices with A2M.AX</p><br>
<br>
The aim of this project is to answer the following question:

**With the help of daily stock data and machine learning tools, are we able to predict short term future prices of stock?**

The reason for this investigation is that I've always been curious about whether future stock prices can be predicted with any accuracy whatsoever using historical time series data. 

The dataset comes from Yahoo Finance. It contains 1,180 samples of daily data for the stock A2M.AX, listed on the ASX, covering 31 March 2015 to 22 November 2019.

You can obtain similar data yourself by creating a profile on Yahoo Finance, adding whichever stock or stocks you're interested in to your portfolio, and then exporting the data in CSV format.

Location: https://finance.yahoo.com/

In [57]:
%config IPCompleter.greedy=True

# Widen the Jupyter notebook so that the full dataset, with all variables,
# can be seen without needing to use the scrollbar.

# IPython.core.display is deprecated for these names; import from IPython.display.
from IPython.display import display, HTML
display(HTML("<style>.container { width:70% !important; }</style>"))

In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

Let's load our data into the notebook and check a few key features to get an idea of its shape and contents.

In [58]:
data = pd.read_csv('A2M-ASX.csv')
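
Since this is time series data, it can help to parse the `date` column into real dates when loading. Below is a minimal sketch, assuming the dates are stored as text in month/day/year form (as in `3/31/2015`). The small in-memory CSV and the name `prices` are only for illustration and are not part of the original notebook.

```python
import io

import pandas as pd

# A few rows in the same layout as the real file (date and close only).
csv_text = """date,close
3/31/2015,0.565
4/1/2015,0.565
4/2/2015,0.555
4/7/2015,0.545
"""

prices = pd.read_csv(io.StringIO(csv_text))

# Convert the text dates to datetime64 values, assuming month/day/year order.
prices["date"] = pd.to_datetime(prices["date"], format="%m/%d/%Y")

# Index by date and make sure the rows are in chronological order.
prices = prices.set_index("date").sort_index()
print(prices.head())
```

With a datetime index, date-based slicing such as `prices.loc["2015-04"]` becomes available.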

In [66]:
print(data.shape)
list(data.columns)

(1180, 26)


['date',
 'open',
 'high',
 'low',
 'close',
 'adj_close',
 'range',
 'range_change',
 'range_move',
 'price_change',
 'price_move',
 'volume',
 'qty_change',
 'qty_move',
 '5d_avg_range',
 'change_in_5d_avg_range',
 '5d_avg_range_move',
 '5d_avg_px',
 'change_in_5d_avg_px',
 '5d_avg_px_move',
 '5d_avg_qty',
 'change_in_5d_avg_qty',
 '5d_avg_qty_move',
 '10d_avg_px',
 'change_in_10d_avg_px',
 '10d_avg_px_move']

Off the bat we can see how many features each data sample has, and what these features are. As can be seen, it's a mixture of price and volume data for daily intervals.
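
Several of these columns look like they were derived from the raw open/high/low/close figures. The sketch below shows one way such features could be built with pandas. The formulas are my inference from the first ten rows of the file, not a confirmed description of how the dataset was produced, and the names `sample` and `features` are hypothetical.

```python
import pandas as pd

# High, low and close for the first ten rows of the dataset.
sample = pd.DataFrame({
    "high":  [0.595, 0.58, 0.565, 0.55, 0.545, 0.54, 0.535, 0.54, 0.54, 0.55],
    "low":   [0.53, 0.555, 0.535, 0.54, 0.53, 0.532, 0.53, 0.53, 0.535, 0.54],
    "close": [0.565, 0.565, 0.555, 0.545, 0.54, 0.535, 0.535, 0.54, 0.54, 0.54],
})

features = pd.DataFrame({
    # Assumed: daily range is the high minus the low.
    "range": sample["high"] - sample["low"],
    # Assumed: price change is today's close minus the previous close.
    "price_change": sample["close"].diff(),
    # Assumed: the averages are simple rolling means of the close.
    "5d_avg_px": sample["close"].rolling(window=5).mean(),
    "10d_avg_px": sample["close"].rolling(window=10).mean(),
})

# Day-on-day changes of the derived columns.
features["range_change"] = features["range"].diff()
features["change_in_5d_avg_px"] = features["5d_avg_px"].diff()

print(features.round(4))
```

Rolling means are undefined until a full window of days is available, so the first few rows of the averaged columns come out as NaN.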

Let's view some of the data.

In [67]:
data[:10]

Unnamed: 0,date,open,high,low,close,adj_close,range,range_change,range_move,price_change,...,5d_avg_range_move,5d_avg_px,change_in_5d_avg_px,5d_avg_px_move,5d_avg_qty,change_in_5d_avg_qty,5d_avg_qty_move,10d_avg_px,change_in_10d_avg_px,10d_avg_px_move
0,3/31/2015,0.555,0.595,0.53,0.565,0.565,0.065,,Flat,0.01,...,,,,,,,,,,
1,4/1/2015,0.575,0.58,0.555,0.565,0.565,0.025,-0.04,Down,0.0,...,,,,,,,,,,
2,4/2/2015,0.56,0.565,0.535,0.555,0.555,0.03,0.005,Up,-0.01,...,,,,,,,,,,
3,4/7/2015,0.545,0.55,0.54,0.545,0.545,0.01,-0.02,Down,-0.01,...,,,,,,,,,,
4,4/8/2015,0.545,0.545,0.53,0.54,0.54,0.015,0.005,Up,-0.005,...,,0.554,,,2606643.8,,,,,
5,4/9/2015,0.54,0.54,0.532,0.535,0.535,0.008,-0.007,Down,-0.005,...,Down,0.548,-0.006,Down,1784754.2,-821889.6,Down,,,
6,4/10/2015,0.53,0.535,0.53,0.535,0.535,0.005,-0.003,Down,0.0,...,Down,0.542,-0.006,Down,944443.0,-840311.2,Down,,,
7,4/13/2015,0.535,0.54,0.53,0.54,0.54,0.01,0.005,Up,0.005,...,Down,0.539,-0.003,Down,501676.6,-442766.4,Down,,,
8,4/14/2015,0.54,0.54,0.535,0.54,0.54,0.005,-0.005,Down,0.0,...,Down,0.538,-0.001,Down,497519.4,-4157.2,Down,,,
9,4/15/2015,0.54,0.55,0.54,0.54,0.54,0.01,0.005,Up,0.0,...,Down,0.538,0.0,Flat,548980.0,51460.6,Up,0.546,,


Immediately we can see a number of NaN values, which will need to be dealt with.

Let's view them all first to see why they are there, and whether we should simply remove them or replace them with another value.

In [68]:
nulls = data.isnull().any(axis=1)
data[nulls]

Unnamed: 0,date,open,high,low,close,adj_close,range,range_change,range_move,price_change,...,5d_avg_range_move,5d_avg_px,change_in_5d_avg_px,5d_avg_px_move,5d_avg_qty,change_in_5d_avg_qty,5d_avg_qty_move,10d_avg_px,change_in_10d_avg_px,10d_avg_px_move
0,3/31/2015,0.555,0.595,0.53,0.565,0.565,0.065,,Flat,0.01,...,,,,,,,,,,
1,4/1/2015,0.575,0.58,0.555,0.565,0.565,0.025,-0.04,Down,0.0,...,,,,,,,,,,
2,4/2/2015,0.56,0.565,0.535,0.555,0.555,0.03,0.005,Up,-0.01,...,,,,,,,,,,
3,4/7/2015,0.545,0.55,0.54,0.545,0.545,0.01,-0.02,Down,-0.01,...,,,,,,,,,,
4,4/8/2015,0.545,0.545,0.53,0.54,0.54,0.015,0.005,Up,-0.005,...,,0.554,,,2606643.8,,,,,
5,4/9/2015,0.54,0.54,0.532,0.535,0.535,0.008,-0.007,Down,-0.005,...,Down,0.548,-0.006,Down,1784754.2,-821889.6,Down,,,
6,4/10/2015,0.53,0.535,0.53,0.535,0.535,0.005,-0.003,Down,0.0,...,Down,0.542,-0.006,Down,944443.0,-840311.2,Down,,,
7,4/13/2015,0.535,0.54,0.53,0.54,0.54,0.01,0.005,Up,0.005,...,Down,0.539,-0.003,Down,501676.6,-442766.4,Down,,,
8,4/14/2015,0.54,0.54,0.535,0.54,0.54,0.005,-0.005,Down,0.0,...,Down,0.538,-0.001,Down,497519.4,-4157.2,Down,,,
9,4/15/2015,0.54,0.55,0.54,0.54,0.54,0.01,0.005,Up,0.0,...,Down,0.538,0.0,Flat,548980.0,51460.6,Up,0.546,,


We see that the null values occur only in the columns that average over preceding days, and only in the earliest rows, which don't have enough previous days to average.
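
To see why exactly ten rows are affected, here is a small sketch on made-up prices. It assumes the averaged columns are simple rolling means and that the `change_in_...` columns are day-on-day differences of them; the series `close` and the frame `frame` are hypothetical.

```python
import pandas as pd

# Twenty hypothetical closing prices: 1.0, 2.0, ..., 20.0.
close = pd.Series(range(1, 21), dtype=float)

frame = pd.DataFrame({
    "close": close,
    "5d_avg_px": close.rolling(window=5).mean(),    # needs 5 days of history
    "10d_avg_px": close.rolling(window=10).mean(),  # needs 10 days of history
})
# The change in the 10-day average needs one further day on top of that.
frame["change_in_10d_avg_px"] = frame["10d_avg_px"].diff()

# NaN count per column, and the number of rows with at least one NaN.
nan_counts = frame.isnull().sum()
rows_with_nan = int(frame.isnull().any(axis=1).sum())

print(nan_counts)
print("rows with at least one NaN:", rows_with_nan)
```

The column with the longest warm-up period sets how many leading rows are incomplete, which fits the ten rows shown above.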

Let's remove these rows. They are only 10 of the 1,180 samples, so dropping them shouldn't affect our end results much, and there's no way to know what the missing values should be without digging for more data.

In [69]:
data = data.dropna()
print(data.shape)
data.head()

(1170, 26)


Unnamed: 0,date,open,high,low,close,adj_close,range,range_change,range_move,price_change,...,5d_avg_range_move,5d_avg_px,change_in_5d_avg_px,5d_avg_px_move,5d_avg_qty,change_in_5d_avg_qty,5d_avg_qty_move,10d_avg_px,change_in_10d_avg_px,10d_avg_px_move
10,4/16/2015,0.545,0.545,0.525,0.53,0.53,0.02,0.01,Up,-0.01,...,Up,0.537,-0.001,Down,526133.4,-22846.6,Down,0.5425,-0.0035,Down
11,4/17/2015,0.525,0.54,0.52,0.54,0.54,0.02,0.0,Flat,0.01,...,Up,0.538,0.001,Up,600670.6,74537.2,Up,0.54,-0.0025,Down
12,4/20/2015,0.535,0.535,0.52,0.52,0.52,0.015,-0.005,Down,-0.02,...,Up,0.534,-0.004,Down,562333.0,-38337.6,Down,0.5365,-0.0035,Down
13,4/21/2015,0.525,0.53,0.52,0.52,0.52,0.01,-0.005,Down,0.0,...,Up,0.53,-0.004,Down,618294.0,55961.0,Up,0.534,-0.0025,Down
14,4/22/2015,0.52,0.53,0.52,0.52,0.52,0.01,0.0,Flat,0.0,...,Flat,0.526,-0.004,Down,511622.2,-106671.8,Down,0.532,-0.002,Down


Now that our data is clean, let's find some more interesting aspects of it.

Using the pandas `.describe()` method we can analyse basic statistical properties of the dataset. Note that by default it summarises only the numeric columns.

Given the length of the data, and the change of A2M.AX over time, this is not very useful for basic price data, but is more interesting for range and volume data, as well as averaged price data.

In [70]:
data.describe()

Unnamed: 0,open,high,low,close,adj_close,range,range_change,price_change,volume,qty_change,5d_avg_range,change_in_5d_avg_range,5d_avg_px,change_in_5d_avg_px,5d_avg_qty,change_in_5d_avg_qty,10d_avg_px,change_in_10d_avg_px
count,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0,1170.0
mean,6.322292,6.418415,6.22862,6.320603,6.320603,0.189796,0.0001752137,0.011496,4851845.0,4765.803,0.189169,0.000383,6.29785,0.011002,4839798.0,6276.96,6.272018,0.010297
std,5.064178,5.133915,4.997526,5.062496,5.062496,0.187427,0.1421932,0.217534,5401034.0,4971388.0,0.159886,0.035292,5.054732,0.098545,4053856.0,1280930.0,5.048993,0.070383
min,0.47,0.475,0.455,0.465,0.465,0.0,-1.42,-2.11,0.0,-41366400.0,0.0086,-0.369,0.471,-0.508,265498.2,-11352420.0,0.473,-0.268
25%,1.74,1.7605,1.715,1.735,1.735,0.045,-0.04,-0.05,2226296.0,-1185753.0,0.0497,-0.01,1.734,-0.016,2653897.0,-321196.8,1.721875,-0.01
50%,4.23,4.35,4.1925,4.305,4.305,0.13,-3.33067e-16,0.0,3506864.0,-113329.0,0.136,0.0,4.18,0.004,3752113.0,-11859.4,4.037,0.004
75%,10.94,11.1625,10.8075,10.97,10.97,0.29,0.04,0.06,5370262.0,1030598.0,0.3055,0.01,10.982,0.038,5862383.0,305162.6,10.92675,0.03
max,17.200001,17.299999,17.030001,17.129999,17.129999,2.085,1.205,2.59,61397300.0,41677410.0,1.062,0.393,17.01,0.726,33433660.0,9973982.0,16.778,0.397
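
The text-valued `..._move` columns are left out of the summary above. As a minimal sketch, they can be summarised by asking `.describe()` to exclude the numeric columns instead. The small frame `moves` below is for illustration only; its `range_move` values are the first ten from the dataset.

```python
import numpy as np
import pandas as pd

moves = pd.DataFrame({
    "range": [0.065, 0.025, 0.03, 0.01, 0.015, 0.008, 0.005, 0.01, 0.005, 0.01],
    "range_move": ["Flat", "Down", "Up", "Down", "Up",
                   "Down", "Down", "Up", "Down", "Up"],
})

# Excluding numeric columns leaves only the text column, for which
# describe() reports count, unique, top (most common value) and freq.
summary = moves.describe(exclude=[np.number])
print(summary)

# value_counts() gives the full breakdown for a single column.
print(moves["range_move"].value_counts())
```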


Let's tidy up the rounding, as not every feature needs six decimal places.