# Machine Learning on Quantopian

In Python, the scikit-learn package provides machine learning tools for running learning algorithms on imported data. Machine learning is commonly divided into two camps: unsupervised learning and supervised learning. We'll focus on supervised learning here.

## Supervised Learning

In supervised learning, users teach a machine how to learn. Users provide a volume of data, marked with features and labels, to train the machine (in this case, the machine is essentially an algorithm, e.g. a classification algorithm, in a program). If the user judges the training performance acceptable, a separate set of similar data marked with just its features (the labels are hidden from the machine but not from the user) is sent to the machine, which predicts the labels for this set. Comparing those predictions with the hidden labels measures the machine's accuracy.
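As a minimal sketch of this train-then-test workflow, the example below fits a classifier on labeled data and then scores it on a held-out set whose labels the classifier never sees. The synthetic data, the 150/50 split, and the choice of LogisticRegression are assumptions made for illustration; they are not part of the Quantopian example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): two features per sample,
# labeled 1 when their sum is positive and -1 otherwise.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Training set: features and labels are both shown to the classifier.
X_train, y_train = X[:150], y[:150]
# Test set: only the features are shown; the labels are kept for scoring.
X_test, y_test = X[150:], y[150:]

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict the hidden labels, then compare against them to measure accuracy.
predictions = clf.predict(X_test)
accuracy = np.mean(predictions == y_test)
print('Accuracy on held-out data:', accuracy)
```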

Below is a simple example coded on Quantopian that applies the scikit-learn module to stock price data. The features are price movements and the labels are their future outcomes ("up" or "down").

Let's first focus on the initialize() function.

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from collections import Counter
import numpy as np

def initialize(context):

    context.stocks = symbols('XLY',  # XLY Consumer Discretionary SPDR Fund
                             'XLF',  # XLF Financial SPDR Fund
                             'XLK',  # XLK Technology SPDR Fund
                             'XLE',  # XLE Energy SPDR Fund
                             'XLV',  # XLV Health Care SPDR Fund
                             'XLI',  # XLI Industrial SPDR Fund
                             'XLP',  # XLP Consumer Staples SPDR Fund
                             'XLB',  # XLB Materials SPDR Fund
                             'XLU')  # XLU Utilities SPDR Fund
    
    context.historical_bars = 100
    context.feature_window = 10

Note the different imports required in this program. LogisticRegression, SVC, LinearSVC, NuSVC, and RandomForestClassifier are classifiers imported for learning. We also import preprocessing for normalizing data, Counter for counting occurrences, and NumPy for numerical analysis.

In initialize(), we assign the stock universe to context.stocks, the number of historical bars to request to context.historical_bars, and the number of prices used to build each feature set to context.feature_window (a window of 10 prices yields 9 percent changes).

Now, let's move to the handle_data() function.

In [4]:
def handle_data(context, data):
    prices = history(bar_count = context.historical_bars, frequency='1d', field='price')

    for stock in context.stocks:
        try:
            # Simple grab of the short and long moving average
            ma1 = data[stock].mavg(50)
            ma2 = data[stock].mavg(200)

            start_bar = context.feature_window
            price_list = prices[stock].tolist()
            
            # X holds the feature sets and y holds the labels.
            X = []
            y = []

            bar = start_bar

            # feature creation
            while bar < len(price_list)-1:
                try:
                    end_price = price_list[bar+1]
                    begin_price = price_list[bar]

                    pricing_list = []
                    xx = 0
                    for _ in range(context.feature_window):
                        price = price_list[bar-(context.feature_window-xx)]
                        pricing_list.append(price)
                        xx += 1

                    features = np.around(np.diff(pricing_list) / pricing_list[:-1] * 100.0, 1)


                    #print(features)

                    # Classify current feature sets according to relative values of end_price and begin_price
                    if end_price > begin_price:
                        label = 1
                    else:
                        label = -1

                    bar += 1
                    X.append(features)
                    y.append(label)

                except Exception as e:
                    bar += 1
                    print(('feature creation',str(e)))

            #Create classifier
            clf = RandomForestClassifier()

            #Grab current feature set, normalize, and test label prediction with classifier
            last_prices = price_list[-context.feature_window:]
            current_features = np.around(np.diff(last_prices) / last_prices[:-1] * 100.0, 1)

            # Append current feature set to the container of all feature sets, then use preprocessing.scale to standardize
            # each feature column to zero mean and unit variance. It's a common standardization technique in machine learning.
            X.append(current_features)
            X = preprocessing.scale(X)

            # Separate data, where current_features is the current feature set, and X is the set of feature sets with known
            # labels.
            current_features = X[-1]
            X = X[:-1]

            # We now train the classifier with fit() function, and then perform prediction on current feature set
            clf.fit(X,y)
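            # predict() expects a 2-D array, so reshape the single sample to shape (1, n_features)
            p = clf.predict(current_features.reshape(1, -1))[0]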

            print(('Prediction',p))
            # To test our performance, we pass to order_target_percent
            if p == 1:
                order_target_percent(stock,0.11)
            elif p == -1:
                order_target_percent(stock,-0.11)            

        except Exception as e:
            print(str(e))
            
            
    record('ma1',ma1)
    record('ma2',ma2)
    record('Leverage',context.account.leverage)

The first thing handle_data() does is grab historical daily pricing for the stock universe defined in initialize(). In the for loop, we iterate through each stock in the universe. For each stock, we first compute the short-term (50-bar) and long-term (200-bar) moving averages.
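For reference, a simple moving average over the last n bars is the mean of the most recent n prices. The sketch below uses a hypothetical helper, simple_moving_average, on made-up prices. It illustrates the calculation under the assumption that mavg() computes a plain average; it is not Quantopian's implementation.

```python
import numpy as np

def simple_moving_average(prices, window):
    """Mean of the most recent `window` prices (hypothetical helper)."""
    if len(prices) < window:
        raise ValueError('need at least `window` prices')
    return float(np.mean(prices[-window:]))

prices = [10.0, 11.0, 12.0, 13.0, 14.0]      # made-up prices, oldest first
short_ma = simple_moving_average(prices, 2)  # mean of 13 and 14
long_ma = simple_moving_average(prices, 4)   # mean of 11, 12, 13 and 14
print(short_ma, long_ma)
```

A short average above the long average, as in this made-up series, is the condition the example later treats as an upward trend.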

Next, to prepare for the classifier (in this case, the random forest classifier), we create a variable X to hold the array of feature sets and a variable y to hold the label associated with each feature set in X. The while loop performs the feature creation. end_price and begin_price are assigned the next day's and the current day's prices, respectively; they are used later in the loop to label the feature set. In the nested for loop, we populate the current price window (pricing_list). Outside the for loop, we use NumPy to convert the window to percent changes, rounded to one decimal place, and assign the result to features. Finally, we label the current feature set based on the relative values of end_price and begin_price: 1 if the price rose, -1 otherwise.
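The feature creation described above can be sketched as a standalone function. build_training_set is a hypothetical helper and the prices are made up; the windowing and labeling are meant to mirror the while loop in handle_data(), with slicing in place of the inner for loop.

```python
import numpy as np

def build_training_set(price_list, feature_window):
    """Return (X, y) built like the while loop in handle_data() (hypothetical helper)."""
    X, y = [], []
    for bar in range(feature_window, len(price_list) - 1):
        # The feature_window prices that precede `bar`
        window = np.array(price_list[bar - feature_window:bar], dtype=float)
        # Percent change between consecutive prices, rounded to one decimal place
        features = np.around(np.diff(window) / window[:-1] * 100.0, 1)
        # Label: 1 if the next bar's price is higher than this bar's, else -1
        label = 1 if price_list[bar + 1] > price_list[bar] else -1
        X.append(features)
        y.append(label)
    return np.array(X), np.array(y)

prices = [100.0, 101.0, 99.0, 102.0, 103.0, 101.0]  # made-up prices
X, y = build_training_set(prices, feature_window=3)
print(X)
print(y)
```

With a window of 3 prices, each feature set holds 2 percent changes, and the six made-up prices produce two labeled feature sets.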

Outside the while loop, we set up the classifier and feed it the training feature sets and their associated labels. After training, we feed it the current feature set and predict its label. We then test the prediction by trading on it.
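The scaling, training, and prediction steps can be tried outside Quantopian with a sketch like the one below. The feature sets and labels are random, made-up data, and n_estimators and random_state are set only to keep the run small and repeatable. Note that preprocessing.scale standardizes each column to zero mean and unit variance; it does not confine values to a fixed range.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.5, scale=2.0, size=(40, 9))  # made-up feature sets
y_train = np.where(X_train[:, -1] > 0.5, 1, -1)          # made-up labels
current = rng.normal(loc=0.5, scale=2.0, size=9)         # made-up current feature set

# Scale the training rows and the current row together, as the example does
scaled = preprocessing.scale(np.vstack([X_train, current]))
print(scaled.mean(axis=0).round(6))  # column means are approximately 0
print(scaled.std(axis=0).round(6))   # column standard deviations are approximately 1

# Split the current feature set back off from the training feature sets
X_scaled, current_scaled = scaled[:-1], scaled[-1]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_scaled, y_train)

# predict() takes a 2-D array with one row per sample
p = clf.predict(current_scaled.reshape(1, -1))[0]
print('Prediction', p)
```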

The backtest plot shows that performance for this classifier was not good.

Now suppose we also require the moving averages to agree with the prediction. We get different performance results (shown below).

    if p == 1 and ma1 > ma2:
        order_target_percent(stock,0.11)
    elif p == -1 and ma1 < ma2:
        order_target_percent(stock,-0.11) 

Compare this to using just the moving averages:

    if ma1 > ma2:
        order_target_percent(stock,0.11)
    elif ma1 < ma2:
        order_target_percent(stock,-0.11)

As you can see, adding the random forest classifier improves our returns by 1% and raises the Sharpe ratio by 0.5. This is not much of an improvement over using the moving averages alone.

However, we can also make predictions with multiple classifiers. One approach acts only when all the classifiers agree with each other. Another takes the mode of the classifiers' predictions. We'll attempt the former.

            clf1 = RandomForestClassifier()
            clf2 = LinearSVC()
            clf3 = NuSVC()
            clf4 = LogisticRegression()

            last_prices = price_list[-context.feature_window:]
            current_features = np.around(np.diff(last_prices) / last_prices[:-1] * 100.0, 1)

            X.append(current_features)
            X = preprocessing.scale(X)

            current_features = X[-1]
            X = X[:-1]

            clf1.fit(X,y)
            clf2.fit(X,y)
            clf3.fit(X,y)
            clf4.fit(X,y)

            # predict() expects a 2-D array, so reshape the single sample to shape (1, n_features)
            sample = current_features.reshape(1, -1)
            p1 = clf1.predict(sample)[0]
            p2 = clf2.predict(sample)[0]
            p3 = clf3.predict(sample)[0]
            p4 = clf4.predict(sample)[0]
            
            
            if Counter([p1,p2,p3,p4]).most_common(1)[0][1] >= 4:
                p = Counter([p1,p2,p3,p4]).most_common(1)[0][0]
                
            else:
                p = 0
                
            print(('Prediction',p))


            if p == 1 and ma1 > ma2:
                order_target_percent(stock,0.11)
            elif p == -1 and ma1 < ma2:
                order_target_percent(stock,-0.11)

Four classifiers are now used: RandomForestClassifier, LinearSVC, NuSVC, and LogisticRegression. When the four do not all agree, p is set to 0 and no order is placed. With these four classifiers, we see improvement in our performance (shown below): a 3.8% increase in return and a 0.75 increase in the Sharpe ratio compared to the performance with just moving averages.
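The two ways of combining classifiers mentioned earlier, unanimous agreement and the mode of the predictions, can be sketched with Counter. Both helpers are hypothetical, and returning 0 on a tie in mode_vote is an assumption made for this illustration; the example above only implements the unanimous case.

```python
from collections import Counter

def unanimous_vote(predictions, no_signal=0):
    """Return the shared label if every classifier agrees, else `no_signal`."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count == len(predictions) else no_signal

def mode_vote(predictions, no_signal=0):
    """Return the most common label, or `no_signal` when the top two tie."""
    ranked = Counter(predictions).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return no_signal
    return ranked[0][0]

print(unanimous_vote([1, 1, 1, 1]))   # 1
print(unanimous_vote([1, 1, 1, -1]))  # 0
print(mode_vote([1, 1, 1, -1]))       # 1
print(mode_vote([1, 1, -1, -1]))      # 0
```

The unanimous rule trades less often than the mode rule, since a single dissenting classifier is enough to suppress the signal.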