This is a program that use supervised machine learning to predict stock prices. 

The script uses regularized linear regression that takes a series of continuous stock prices (Highest in the day) to predict stock prices (highest in the day) given immediately after. The data used is Tesla's historical stock prices downloaded from https://finance.yahoo.com/quote/TSLA/history?p=TSLA

In [25]:
import numpy
import csv
from sklearn.utils import shuffle

In [26]:
# General Parameters:

filename = 'TSLA.csv'
lag = 15 # Using the past 15 prices to predict the next
iterations = 300 # Train with 300 iterations
threshold = 9e9 # Stop training when cost exceeds the threshold

alpha_list = [1, 0.1, 0.01] # Training step size
lam_list = [0.01, 0.001, 0.0001] # Regularier parameter

First, read the data from csv sheets. If using lag=15, the script organizes prices of 15 continuous dates into a training example, and uses the 16th price as label. 

In [27]:
# Data Management:
X = []
Y = []
with open(filename, 'r') as djiFile:
    dji = csv.reader(djiFile)
    next(dji)
    row_buff = []
    
    for row in dji:
        if len(row_buff) < lag:
            row_buff.append(float(row[2]))
        else:
            X.append(row_buff)
            row_buff = row_buff[1:]
            row_buff.append(float(row[2]))
            Y.append(numpy.array(float(row[2])))
    
X = numpy.array(X)
Y = numpy.array(Y)
num_examples = numpy.size(X,0)
num_features = numpy.size(X,1)

print("Dataset size: ")
print(numpy.size(X,0), numpy.size(X,1))
print(numpy.size(Y))


Dataset size: 
993 15
993


Since the stock prices dealing with can sometimes be very high. If using DJIA it could be around 20,000, the script use mean normalization to preprocess the data. A feature of 1.0 is also added to all examples.

The normalized data is then shuffled, in case the training is influenced by potential trends over larger time periods. 

The script then separates all examples into training set, validation set, and test set, in the ratio of 60%:20%:20%. 

In [28]:
# Mean Normalization:
X_norm = numpy.zeros((num_examples, num_features), dtype=float)
X_mean = numpy.zeros((1, num_features), dtype=float)
X_std = numpy.zeros((1, num_features), dtype=float)

for j in range(0, num_features):
    X_mean[0,j] = numpy.mean(X[:,j])
    X_std[0,j] = numpy.std(X[:,j])
    X_norm[:,j] = (X[:,j] - X_mean[0,j]) / X_std[0,j]
    
Y_norm = (Y - numpy.mean(Y)) / numpy.std(Y)

# Add first feature as ones:
col_first = numpy.ones([numpy.size(X,0), 1])
X_norm = numpy.hstack((col_first, X_norm))
num_features += 1

# Shuffle data:
X_norm, Y_norm = shuffle(X_norm, Y_norm)

In [29]:
# Data separated into sets:
Xtrain = X_norm[0 : round(num_examples*0.6)]
Xval = X_norm[round(num_examples*0.6) : round(num_examples*0.8)]
Xtest = X_norm[round(num_examples*0.8) :]

Ytrain = Y_norm[0 : round(num_examples*0.6)]
Yval = Y_norm[round(num_examples*0.6) : round(num_examples*0.8)]
Ytest = Y_norm[round(num_examples*0.8) :]

size_train = numpy.size(Xtrain, 0)
size_val = numpy.size(Xval, 0)
size_test = numpy.size(Xtest, 0)

print("Training set size = ", numpy.size(Xtrain,0), numpy.size(Xtrain,1))
print("Validation set size = ", numpy.size(Xval,0))
print("Test set size = ", numpy.size(Xtest,0))
print("\nFirst three training examples: ")
print(Xtrain[0:3])
print("\nFirst three labels: ")
print(Ytrain[0:3])

Training set size =  596 16
Validation set size =  198
Test set size =  199

First three training examples: 
[[ 1.         -0.11067669 -0.11079889 -0.14252215 -0.04347066 -0.0491921
  -0.09453543 -0.13594171 -0.16781906 -0.20456711 -0.22963817 -0.20249216
  -0.22260569 -0.26033394 -0.27914084 -0.29571303]
 [ 1.          0.17928269  0.14595811  0.12855523  0.08158042  0.00643161
  -0.03381927 -0.03968856 -0.09010073 -0.0855334  -0.09533464 -0.07678908
  -0.03454301 -0.04611758 -0.03963776 -0.08295353]
 [ 1.         -0.01739461  0.03383042  0.01982154  0.02756989 -0.02457886
  -0.0104995  -0.00813766 -0.00387645 -0.01683906 -0.04259179 -0.03986282
  -0.01384663 -0.04746255 -0.06495192 -0.09101934]]

First three labels: 
[-0.45539701 -0.07032387 -0.12714381]


The training set will output a parameter for every alpha and lambda. The training iterations for certain alpha and lambda will be skipped if the cost increases exceeding the threshold.

The validation set is used to find optimal alpha and lambda. 

In [30]:
# Validation set up:
param_all = numpy.zeros((num_features, len(alpha_list), len(lam_list)))
    # Collection of all parameters learned for every alpha and lambda combination
cost_opt = 999999
alpha_opt_index = 0;
lam_opt_index = 0;

In [31]:
# Train the model:

for a in range(0, len(alpha_list)):
    alpha = alpha_list[a]
    
    for l in range(0, len(lam_list)):
        lam = lam_list[l]
        
        param = numpy.zeros((num_features, 1), dtype=float)
        cost = 0
        print("Training with alpha = ", alpha, " and lambda = ", lam)
        
        for r in range(0, iterations):
            param_temp = numpy.zeros(numpy.size(param))

            for j in range(0, num_features): 

                grad = 0
                for i in range(0, size_train):
                    grad += (Xtrain[i]@param - Ytrain[i]) * Xtrain[i,j]
                grad = (1/size_train) * grad
                if j != 0:
                    grad += (lam/size_train) * param[j]  
                param_temp[j] = param[j] - alpha * grad

            for j in range(0, num_features): 
                param[j] = param_temp[j]

            # Cost:
            sumSqrError = 0
            for i in range(0, size_train):
                sumSqrError += (Xtrain[i] @ param - Ytrain[i]) ** 2
            sumSqrParam = 0
            for j in range(1, num_features):
                sumSqrParam += param[j] ** 2
            cost = (1/(2*size_train)) * sumSqrError + lam * sumSqrParam
            #print("Iteration ", r+1, ", cost = ", cost, end='\n')
            
            if cost > threshold:
                print(" Skipped")
                break
            
        print(" cost = ", cost, end='\n')
        for j in range(0, num_features):
            param_all[j,a,l] = param[j]

Training with alpha =  1  and lambda =  0.01
 Skipped
 cost =  [2.56281837e+11]
Training with alpha =  1  and lambda =  0.001
 Skipped
 cost =  [2.55988226e+11]
Training with alpha =  1  and lambda =  0.0001
 Skipped
 cost =  [2.55958865e+11]
Training with alpha =  0.1  and lambda =  0.01
 cost =  [0.00932134]
Training with alpha =  0.1  and lambda =  0.001
 cost =  [0.00658298]
Training with alpha =  0.1  and lambda =  0.0001
 cost =  [0.00630908]
Training with alpha =  0.01  and lambda =  0.01
 cost =  [0.01410568]
Training with alpha =  0.01  and lambda =  0.001
 cost =  [0.01326499]
Training with alpha =  0.01  and lambda =  0.0001
 cost =  [0.01318091]


In [32]:
# Use validation set to find optimal alpha and lambda:

for a in range(0, len(alpha_list)):
    for l in range(0, len(lam_list)):
        sumSqrError = 0
        sumSqrParam = 0
        for i in range(0, size_val):
            sumSqrError += (Xval[i] @ param_all[:,a,l] - Yval[i]) ** 2
        for j in range(1, num_features):
            sumSqrParam += param[j] ** 2
        cost = (1/(2*size_val)) * sumSqrError + lam_list[l] * sumSqrParam
        
        if cost < cost_opt:
            cost_opt = cost
            alpha_opt_index = a
            lam_opt_index = l

param = param_all[:, alpha_opt_index, lam_opt_index]
print("Validation set has cost = ", cost_opt)
print("alpha = ", alpha_list[alpha_opt_index])
print("lambda = ", lam_list[lam_opt_index])
#print("Parameters: ", param)

Validation set has cost =  [0.00725882]
alpha =  0.1
lambda =  0.0001


The test set examines the results of learning. 

In [33]:
# Test on the test set:

print("Test set results:")

data = numpy.zeros((numpy.size(Xtest,0), numpy.size(Xtest,1)))
label = numpy.zeros((numpy.size(Ytest)))

sumSqrError = 0
for i in range(0, size_test):
    sumSqrError += (Xtest[i] @ param - Ytest[i]) ** 2
    
    # Visualize result:
    # Reverse normalization: 
    label[i] = Ytest[i] * numpy.std(Y) + numpy.mean(Y) 
    data[i,0] = 1.0
    for j in range(1, num_features):
        data[i,j] = Xtest[i,j] * X_std[0,j-1] + X_mean[0,j-1]  
    # Output predicted stock prices of test set:
    print("Prediction = ", round(data[i]@param, 2), ", Y = ", round(label[i], 2))

# Cost calculation:
sumSqrParam = 0
for j in range(1, num_features):
    sumSqrParam += param[j] ** 2

cost = (1/(2*size_test)) * sumSqrError + lam_list[lam_opt_index] * sumSqrParam

print("Test set has cost = ", cost, end='\n')

Test set results:
Prediction =  316.02 , Y =  318.55
Prediction =  254.88 , Y =  264.78
Prediction =  207.24 , Y =  213.32
Prediction =  295.41 , Y =  301.94
Prediction =  563.4 , Y =  573.86
Prediction =  374.91 , Y =  371.74
Prediction =  189.36 , Y =  188.88
Prediction =  339.38 , Y =  349.2
Prediction =  191.65 , Y =  199.35
Prediction =  295.93 , Y =  292.17
Prediction =  229.49 , Y =  231.0
Prediction =  204.6 , Y =  204.98
Prediction =  335.19 , Y =  347.84
Prediction =  336.48 , Y =  333.93
Prediction =  339.77 , Y =  352.3
Prediction =  313.85 , Y =  336.28
Prediction =  291.61 , Y =  304.88
Prediction =  369.15 , Y =  367.01
Prediction =  904.57 , Y =  954.44
Prediction =  348.97 , Y =  342.75
Prediction =  345.39 , Y =  357.44
Prediction =  194.18 , Y =  191.47
Prediction =  339.64 , Y =  339.6
Prediction =  219.86 , Y =  227.48
Prediction =  336.45 , Y =  324.45
Prediction =  603.72 , Y =  653.0
Prediction =  196.72 , Y =  195.0
Prediction =  304.41 , Y =  306.5
Prediction 

In [34]:
# Predict tomorrow's price:
recent = numpy.zeros((1, num_features), dtype=float) 
    # The most recent "lag" amount of prices
    
for j in range(0, num_features-1):
    recent[0,j] = X[-1,j]

recent[0,0] = 1.0
recent[0,-1] = Y[-1] # The last feature is the current price
print(recent)

prediction = recent @ param
print("On July 27th 2020, TSLA highest price will be ", prediction)

[[1.00000000e+00 1.37779004e+03 1.42950000e+03 1.41726001e+03
  1.40856006e+03 1.54892004e+03 1.79498999e+03 1.59000000e+03
  1.55000000e+03 1.53170996e+03 1.53751001e+03 1.65000000e+03
  1.67500000e+03 1.62642004e+03 1.68900000e+03 1.46500000e+03]]
On July 27th 2020, TSLA highest price will be  [1551.89882971]
