# Acquiring the data 

In this tutorial, we are going to use stock data from Quandl API free edition, to estimate whether the specific target is worth to invest or not. 

The benefit for using this API is that the raw data you get is saved in pandas data type, so you can implement the data right away without parsing. 

Document for Quandl API.
https://www.quandl.com/docs/api#introduction

In order to use the Quandl API, first you need to download the Quandl liberary. Use pip install quandl to install the liberary. Make sure you already install Numby, Scipy and Pandas liberary in your python.

First thing you need to do is import the liberaries. 

In [99]:
import quandl
import pandas
import scipy
import numpy as np

The Quandl API is free. If you would like to make more than 50 calls a day, however, you will need to create a free Quandl account, set your API key and authenticate you key.

In [2]:
quandl.ApiConfig.api_key = 'JYRNJTevRtk46AwBNXDN'


There's lots of database you can call in quandl, we're going to use the simplest dataset.

Take Apple.inc for example.

In [3]:
data = quandl.get("WIKI/AAPL")

In [4]:
print data.head()

             Open   High    Low  Close     Volume  Ex-Dividend  Split Ratio  \
Date                                                                          
1980-12-12  28.75  28.87  28.75  28.75  2093900.0          0.0          1.0   
1980-12-15  27.38  27.38  27.25  27.25   785200.0          0.0          1.0   
1980-12-16  25.37  25.37  25.25  25.25   472000.0          0.0          1.0   
1980-12-17  25.87  26.00  25.87  25.87   385900.0          0.0          1.0   
1980-12-18  26.63  26.75  26.63  26.63   327900.0          0.0          1.0   

            Adj. Open  Adj. High  Adj. Low  Adj. Close  Adj. Volume  
Date                                                                 
1980-12-12   0.427992   0.429779  0.427992    0.427992  117258400.0  
1980-12-15   0.407597   0.407597  0.405662    0.405662   43971200.0  
1980-12-16   0.377675   0.377675  0.375889    0.375889   26432000.0  
1980-12-17   0.385119   0.387054  0.385119    0.385119   21610400.0  
1980-12-18   0.396432   0.

The dataset include 12 feature we can utilize from 1980 to now. For this kind of predict problem, we can use the basic machine learning Artificial Neural Networks to solve this problem. 

# Artificial Neural Networks

Artificial Neural Networks is a comman machine learning method people are useing nowadays. Unlike the SVM model, ANN can creat a non linear model used for regression or classifications. The structure of ANN looks like below.

<img src="files/ann.png">

The number of node and layers is the variable we can control to improve the result. In every node there is a weight link to another node. Our goal is to find the best weight set that fit our data the most. 

A widely used type of composition is the nonlinear weighted sum, which is sum of the input times their weight and pass through a either sigmoid for step function.

<img src="files/ff.png">

Most of the algorithms used in training artificial neural networks is gradient descent and using backpropagation to compute the actual gradients. After we found the local error minimum, We can use this trained model to predict our data.

We want to predict how today's feature is going to influence tomorrow's prize. Therefore, our input will be today's feature, and output will be tomorrow's close price minus today's close price (To figure out how much it increase or decrease.) 

There are lots of ANN liberary we can utilize. We are going to use PyBrain in this tutorial.

Document for PyBrain.
http://pybrain.org/docs/index.html

First, we build a empty Feed Forward Networks structure with 12 input, 20 hidden layer and 1 output. 

In [102]:
from pybrain.structure import FeedForwardNetwork
from pybrain.structure import LinearLayer, SigmoidLayer
from pybrain.structure import FullConnection

n = FeedForwardNetwork()

inLayer = LinearLayer(12)
hiddenLayer = SigmoidLayer(20)
outLayer = LinearLayer(1)

In order to use them, we have to add them to the network.

In [6]:
n.addInputModule(inLayer)
n.addModule(hiddenLayer)
n.addOutputModule(outLayer)

And connect all the nural to each others.

In [35]:
in_to_hidden = FullConnection(inLayer, hiddenLayer)
hidden_to_out = FullConnection(hiddenLayer, outLayer)

n.addConnection(in_to_hidden)
n.addConnection(hidden_to_out)
n.sortModules()

print n

FeedForwardNetwork-6
   Modules:
    [<LinearLayer 'LinearLayer-3'>, <SigmoidLayer 'SigmoidLayer-7'>, <LinearLayer 'LinearLayer-8'>]
   Connections:
    [<FullConnection 'FullConnection-18': 'SigmoidLayer-7' -> 'LinearLayer-8'>, <FullConnection 'FullConnection-19': 'LinearLayer-3' -> 'SigmoidLayer-7'>, <FullConnection 'FullConnection-4': 'SigmoidLayer-7' -> 'LinearLayer-8'>, <FullConnection 'FullConnection-5': 'LinearLayer-3' -> 'SigmoidLayer-7'>]



Input a simple dataset to see whether it works or not.

In [36]:
 n.activate([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

array([ 0.51195787])

Next, we want to create the Dataset so we can train our model.


In [37]:
inputData = []

i = 0
while i < len(data):
    d = data.iloc[i,:].tolist()
    inputData.append(d)
    i = i + 1

In [38]:
outputData = []

i = 0
while i < len(data) - 1:
    d = data.iloc[i+1,3] - data.iloc[i,3]
    outputData.append(d)
    i = i + 1



Notice that length of inputData and outputData are not the same because of outputData is subtraction of tomorrow's price and today's price.

In [39]:
print len(data)
print len(inputData)
print len(outputData)

9053
9053
9052


Delete the last data (Today's feature)

In [40]:
del inputData[-1]
print len(inputData)



9052


In this case we are going to use supervised regression training. We create the Dataset to store our previous input and output data.

In [41]:
from pybrain.datasets import SupervisedDataSet
DS = SupervisedDataSet(12, 1)

i = 0
while i < len(inputData):
    DS.appendLinked(inputData[i], outputData[i])
    i = i + 1
    
len(DS)

9052

After create the ANN model and dataset, we can now train our data by creating a trainer by PyBrain.

In [42]:
from pybrain.supervised.trainers import BackpropTrainer
trainer = BackpropTrainer(n, DS)

Use .train() operator to train our data. The trainer now knows about the network and the dataset and we can train the net on the data. This call trains the net for one full epoch and returns a double proportional to the error.

In [72]:
trainer.train()

25.02740855737029

Because we only run one epoch, therefor the error is still pretty big. We can you another function .trainUntilConvergence() to train the model until it converge. We can input some variable such as max epochs, proportion of the validation data to improve our training method.

In [73]:
trainer.trainUntilConvergence(verbose = True, validationProportion = 0.15, maxEpochs = 100, continueEpochs = 10 )

train-errors: [  28.553605  28.450876  28.534078  28.395738  28.573129  28.522702  28.466523  28.442355  28.427197  28.607837  28.460477  28.503947  28.295400  28.448472  28.496585  28.389732  28.400720  28.491007  28.462959  28.314853  28.484003  28.466464  28.491665  28.447130  28.365838  28.526775  28.411707  28.494318  28.460812  28.399632  28.474114  28.463323  28.434803  28.432820  28.473464  28.397497  28.489896  28.529773  28.384167  28.535530  28.524136  28.429077  28.465292  28.537875  28.523297  28.344236  28.514974  28.487402  28.454055  28.521658  28.544148  28.447536  28.555750  28.461761  28.525229  28.435509  28.543606  28.623802  28.357472  28.434045  28.358381  28.607963  28.492648  28.505051  28.497292  28.440902  28.427037  28.419809  28.426526  28.417991  28.488478  28.281443  28.495775  28.434601  28.457321  28.331872  28.354853  28.463425  28.539600  28.409772  28.426272  28.438169  28.425740  28.415118  28.475638  28.513107  28.444468  28.451692  28.452744  28.4

([28.553605395034666,
  28.450876302543222,
  28.534078419492648,
  28.395738262865347,
  28.573129046916396,
  28.522702311341831,
  28.466522859307432,
  28.442355463529573,
  28.427197250143962,
  28.607836531603567,
  28.460476891745515,
  28.503947362154147,
  28.29540025270488,
  28.448472294385557,
  28.496585045133912,
  28.389732455887213,
  28.400720477568491,
  28.491006539850503,
  28.462958532501929,
  28.314853061052272,
  28.484002961051239,
  28.466464429819592,
  28.491664735955901,
  28.447130410145093,
  28.365838018546114,
  28.526775181724812,
  28.411706918037151,
  28.494318450611313,
  28.460811684885435,
  28.399632221765064,
  28.474114291187519,
  28.463323014616048,
  28.434803028866678,
  28.432820361523127,
  28.473464203009311,
  28.397497261251871,
  28.489896265626882,
  28.52977268991043,
  28.384167407969183,
  28.535529955046197,
  28.524135516295217,
  28.429077493953198,
  28.465291917155032,
  28.537874849818625,
  28.523297414701389,
  28.3442362

# Revisit the model

Because the model is relatively complex. A classification would be easier to train and better in result. We create 6 classifier in this model, (+ 0% ~ 5%), (+ 5% ~ 10%), (+ 10% ~ ) and negitive one. 

Now we need to revisit our output data. The out put data now become a classifier with 6 variable. For example, (1,0,0,0,0,0) for (+ 10% ~ ), (0,1,0,0,0,0) for (+ 5% ~ 10%) and so on. 

After the data run through our model, the out put will be a list with 6 number. We chose the higher one to be our result. For example, (0.7, 0.2, 0.5, 0.4, 0.3, 0.1) we will classify it to (1,0,0,0,0,0) which is (+ 10% ~ ).

In [44]:
from pybrain.datasets import ClassificationDataSet
from pybrain.tools.shortcuts import buildNetwork

Classify the previous output data to fit the classfier we defined.

In [45]:
outputDataClass = []

i = 0
while i < len(data) - 1:
    d = (data.iloc[i+1,3] - data.iloc[i,3])/data.iloc[i,3]
    if d > 0.1:
        d = 0
    elif d <= 0.1 and d > 0.05:
        d = 1
    elif d <= 0.05 and d > 0:
        d = 2
    elif d <= 0 and d > - 0.05:
        d = 3
    elif d <= -0.05 and d > - 0.1:
        d = 4
    elif d <= -0.1:
        d = 5
    outputDataClass.append(d)
    i = i + 1

Check the lengh of our output is correct to make sure we didn't miss any of them.

In [46]:
print len(outputDataClass)

9052


Create a classification dataset, it got 12 input features, 1 output class and total classes number of 6.

Split the data into 75% for training 25% for validation.

In [47]:
DSC = ClassificationDataSet(12, 1, nb_classes=6)

i = 0

while i < len(inputData):
    DSC.addSample(inputData[i], outputDataClass[i])
    i = i + 1

In [48]:
tstdata, trndata = DSC.splitWithProportion( 0.25 )

In [49]:
trndata._convertToOneOfMany( )
tstdata._convertToOneOfMany( )

In [50]:
print "Number of training patterns: ", len(trndata)
print "Input and output dimensions: ", trndata.indim, trndata.outdim
print "First sample (input, target, class):"
print trndata['input'][0], trndata['target'][0], trndata['class'][0]

Number of training patterns:  6789
Input and output dimensions:  12 6
First sample (input, target, class):
[  2.87500000e+01   2.88700000e+01   2.87500000e+01   2.87500000e+01
   2.09390000e+06   0.00000000e+00   1.00000000e+00   4.27992245e-01
   4.29778648e-01   4.27992245e-01   4.27992245e-01   1.17258400e+08] [0 0 0 0 1 0] [ 4.]


The number of our training data is 6789. And the input and output dimensions is correct.

Print out the first sample in out traing dataset. 

Looks wonderful, we are good to go.

Create a simple network called fnn. Specify the input dimensions, numbers of hidden node, out put dimensions and linked the nodes together.

Again create a trainer and to train our fnn network, this time we add some momentum to increase its training speed.

In [81]:
fnn = buildNetwork( trndata.indim, 20, trndata.outdim)

fnn = FeedForwardNetwork()

inLayer = LinearLayer(12)
hiddenLayer = SigmoidLayer(20)
outLayer = LinearLayer(1)

fnn.addInputModule(inLayer)
fnn.addModule(hiddenLayer)
fnn.addOutputModule(outLayer)

in_to_hidden = FullConnection(inLayer, hiddenLayer)
hidden_to_out = FullConnection(hiddenLayer, outLayer)

fnn.addConnection(in_to_hidden)
fnn.addConnection(hidden_to_out)
fnn.sortModules()


classTrainer = BackpropTrainer( fnn, dataset=trndata, momentum=0.1, verbose=True, weightdecay=0.01)

In [82]:
classTrainer.trainUntilConvergence( verbose = True, validationProportion = 0.15, maxEpochs = 100,  continueEpochs = 10 )

Total error: 0.0638048225759
Total error: 0.0509010353693
Total error: 0.0507027125006
Total error: 0.0507605724702
Total error: 0.0508695266371
Total error: 0.0507172102764
Total error: 0.0509452069536
Total error: 0.0511356610802
Total error: 0.0507979775721
Total error: 0.0508791874927
Total error: 0.0505876462502
Total error: 0.0506849220118
Total error: 0.0505828725658
Total error: 0.0507852382341
Total error: 0.0508089991174
Total error: 0.0504770965099
Total error: 0.0511221468637
Total error: 0.0513101172487
Total error: 0.0515267950013
Total error: 0.0515438999045
Total error: 0.0508353630618
Total error: 0.0506452406477
Total error: 0.0510346286347
Total error: 0.0509282289389
Total error: 0.0510165845963
Total error: 0.0516498835615
Total error: 0.0519517015875
Total error: 0.0520851928759
Total error: 0.051717467268
Total error: 0.0522422933662
Total error: 0.0514518442668
Total error: 0.0514376221099
Total error: 0.0511513491329
Total error: 0.0511788942963
Total error: 0.

([0.063804822575856998,
  0.0509010353692535,
  0.050702712500644835,
  0.050760572470228769,
  0.050869526637057377,
  0.050717210276405779,
  0.050945206953577994,
  0.051135661080247655,
  0.050797977572075756,
  0.050879187492698957,
  0.050587646250246081,
  0.050684922011788404,
  0.05058287256582935,
  0.050785238234051955,
  0.050808999117377629,
  0.050477096509889387,
  0.051122146863684229,
  0.051310117248672085,
  0.05152679500129307,
  0.051543899904514308,
  0.050835363061766566,
  0.05064524064770521,
  0.051034628634669336,
  0.050928228938866969,
  0.051016584596283873,
  0.051649883561489598,
  0.051951701587489893,
  0.052085192875887858,
  0.051717467268018028,
  0.052242293366197304,
  0.051451844266802382,
  0.051437622109888892,
  0.051151349132916843,
  0.051178894296255772,
  0.051195671786575775,
  0.051740967748494432,
  0.051792598980151539,
  0.052377606082964087,
  0.052202891604446591,
  0.051959911690378766,
  0.051615191788989052,
  0.05129659148635069

The error of the classifier is much smaller then regression. Now we can use this ANN to predic our data.

In [90]:
pre = fnn.activateOnDataset( trndata )

Notice that the output of this prediction is still looks like [0.7, 0.2, 0.5, 0.4, 0.3, 0.1]. The highest output activation gives the class.

In [101]:
preClass = np.argmax(pre, axis=1)

print preClass[0]

3


By our definition, class 3 means that the stock price will decrease about 0 ~ 0.05 percent. We can now successfully predict the value of stock price.



# Discussion

ANN is a very good model to make your prediction. Befor you train your data, like other machine learning model, you should check you input features to see whether it is reasonable or not. There are lots of variable you can adjust to make your model better. In this tutorial, we only consider one hidden layer and 20 hiden node. Those are the variables that you can change to improve your model. 

In reality, useing only ANN to predict the financial problem is not wisely. ANN can not point out the potention risk behind the data, which may cause a big loss in stock market. However, It can be a useful tool for you do decide which company that is worth to invest.