# Machine Learning in Technical Analysis
---

What is machine learning? 
>According to MIT, machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior (e.g. recognize a visual scene, understand a text written in natural language, or recognize a voice).
>
From my personal experience:
>Machine learning is about letting machines find patterns in existing data and make decisions (predictions, classifications, etc.) based on them.


As a result, there are two main parts we need to focus on:
>1. Prepare data
>>1.1 Split data into features (x, independent variables) and labels (y, dependent variables)
<br>1.2 Perform feature scaling so that all features are on the same scale
<br>1.3 Split data into training, validation, and testing sets (train a model on the training data, then test it on the testing data to check its performance)
<br>1.4 Handle missing values
<br>1.5 Convert non-numerical values to numerical values (machines only understand numbers)
2. Learn data
>>2.1 [`Choose the correct model`](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) (think of a model as a way of thinking)
<br>2.2 Fit training data to the model
<br>2.3 Test the model on testing data and evaluate its performance
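
The steps above can be sketched end to end on a small synthetic dataset. This is only an illustration under stated assumptions: the column names (`size`, `age`, `kind`, `label`) and the data are made up, and `LinearRegression` is just one possible model. The order differs slightly from the numbering: missing values and encoding are handled first, and the scaler is fitted after the split so that the testing data does not influence it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features, one categorical feature, one label
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "size": rng.normal(50, 10, n),
    "age": rng.normal(5, 2, n),
    "kind": rng.choice(["a", "b"], n),
})
toy["label"] = 3 * toy["size"] - 2 * toy["age"] + rng.normal(0, 1, n)
toy.loc[::25, "age"] = np.nan  # inject a few missing values

# 1.4 handle missing values (here: drop incomplete rows)
toy = toy.dropna()
# 1.5 convert non-numerical values to numbers (one-hot encoding)
toy = pd.get_dummies(toy, columns=["kind"], dtype=float)
# 1.1 split data into features and labels
X = toy.drop(columns="label")
y = toy["label"]
# 1.3 split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# 1.2 scale features, learning the scaling parameters from the training set only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 2.1 choose a model, 2.2 fit it on training data, 2.3 evaluate on testing data
model = LinearRegression().fit(X_train_s, y_train)
r2 = model.score(X_test_s, y_test)
print(f"R-squared on the test set: {r2:.3f}")
```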

## A linear regression model
My mother-in-law also invests in the stock market, and someone she knows says that you can tell the future trend just by looking at the trading volume and closing price.
<br>I am not a fan of this strategy, although my mother-in-law says it's pretty accurate, so I want to test how well it works.
<br>Suppose we want to predict the stock price in 5 days using today's price and volume data.

In [3]:
import pandas as pd
import numpy as np
import sklearn

In [4]:
data_path = 'stock_data/'
stock_code = '600000_浦发银行.csv'
df = pd.read_csv(data_path + stock_code)
df['Adj Close T+5'] = df['Adj Close'].shift(-5)
df.head(10)

| Unnamed: 0 | Date | High | Low | Open | Close | Volume | Adj Close | Adj Close T+5 |
|---|---|---|---|---|---|---|---|---|
| 0 | 4/01/2000 | 3.005759 | 2.885668 | 2.912484 | 2.981274 | 38562183 | 1.814433 | 1.859138 |
| 1 | 5/01/2000 | 3.029077 | 2.932305 | 2.981274 | 2.947462 | 45052693 | 1.793855 | 1.782502 |
| 2 | 6/01/2000 | 3.066387 | 2.920646 | 2.935803 | 3.030243 | 53430896 | 1.844237 | 1.76689 |
| 3 | 7/01/2000 | 3.206298 | 3.0454 | 3.066387 | 3.136342 | 183161852 | 1.908809 | 1.717219 |
| 4 | 10/01/2000 | 3.247105 | 3.11419 | 3.148002 | 3.17715 | 141859094 | 1.933645 | 1.731411 |
| 5 | 11/01/2000 | 3.182979 | 3.0454 | 3.17715 | 3.054728 | 80543448 | 1.859138 | 1.712962 |
| 6 | 12/01/2000 | 3.031409 | 2.891498 | 3.031409 | 2.928807 | 302548283 | 1.782502 | 1.712252 |
| 7 | 13/01/2000 | 2.943964 | 2.891498 | 2.914816 | 2.903157 | 68406039 | 1.76689 | 1.734249 |
| 8 | 14/01/2000 | 2.914816 | 2.78773 | 2.900825 | 2.821542 | 153198958 | 1.717219 | 1.722186 |
| 9 | 17/01/2000 | 2.849524 | 2.769075 | 2.807551 | 2.844861 | 69485695 | 1.731411 | 1.7158 |


In [57]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# handle missing data by dropping nan rows
df.dropna(inplace=True)

# split data into features and labels
X = df[['Volume','Adj Close']]
y = df['Adj Close T+5']

# perform feature scaling, so that Volume and Adj Close are on the same scale
X = scale(X)

# split data into training and testing sets
# keep shuffle=False because this is time series data and the rows must stay in chronological order
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,shuffle=False)
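
One caveat about the cell above: `scale(X)` standardizes with the mean and standard deviation of the whole dataset, so statistics from the test period influence the training features. A possible variant, sketched here on made-up arrays (`X_demo` and `y_demo` are hypothetical stand-ins for the real data), splits first and fits a `StandardScaler` on the training rows only. It also shows that `shuffle=False` places the last 20% of rows in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical ordered data: 10 rows, 2 features; y_demo doubles as a row counter
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10)

# with shuffle=False the split is chronological: first 80% train, last 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)
print(y_te)  # the two most recent rows

# fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)
```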



In [58]:
# choose a model
from sklearn import linear_model
reg = linear_model.Ridge(alpha=.5)

# train the model
reg.fit(X_train,y_train)

# test the model
reg.score(X_test,y_test)


0.8690433529582906

For a regression model such as `Ridge`, `score` returns R-squared, the proportion of the variance in the labels that the model's predictions explain. A score of about 0.87 means roughly 87% of the variance is explained, which is relatively high.
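
To make the definition concrete, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers (`y_true` and `y_hat` are hypothetical, not taken from the stock data) and checks the result against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# residual sum of squares: squared error left after the model's predictions
ss_res = np.sum((y_true - y_hat) ** 2)
# total sum of squares: squared error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))  # the two values should agree
```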
We can also print the details of how the model predicts each test sample, and compare it with its true value.

In [59]:
# display results
# use zip() to yield one (x, y) pair per iteration
for each_x, each_y in zip(X_test, y_test):
    # reshape each sample to shape (1, 2): 1 row, 2 columns
    y_pred = reg.predict(each_x.reshape(1,-1))
    print(f'Predicted y is {y_pred}, true y is {each_y}')

Predicted y is [9.84437372], true y is 9.973309517
Predicted y is [9.87638989], true y is 9.920680046
Predicted y is [9.87670689], true y is 9.854890823
Predicted y is [9.98284204], true y is 9.815420151
Predicted y is [9.98168484], true y is 9.775947571
Predicted y is [9.94841658], true y is 9.710161209
Predicted y is [9.89625524], true y is 9.611480713
Predicted y is [9.83156438], true y is 9.703583717
Predicted y is [9.79559019], true y is 10.0062027
Predicted y is [9.75681473], true y is 10.04567432
Predicted y is [9.68735322], true y is 10.04567432
Predicted y is [9.59052834], true y is 10.0062027
Predicted y is [9.68210657], true y is 9.874629021
Predicted y is [9.98652967], true y is 9.881207466
Predicted y is [10.02330367], true y is 9.894365311
Predicted y is [10.02248201], true y is 10.15093422
Predicted y is [9.98454444], true y is 10.17724895
Predicted y is [9.85303095], true y is 11.20297241
Predicted y is [9.85925351], true y is 11.12499332
Predicted y is [9.87262146], tr

Predicted y is [9.10654506], true y is 9.334929466
Predicted y is [9.03386214], true y is 9.325901985
Predicted y is [9.06961653], true y is 9.578684807
Predicted y is [9.06075755], true y is 9.596741676
Predicted y is [9.06076928], true y is 9.443265915
Predicted y is [9.31671878], true y is 9.380069733
Predicted y is [9.30510379], true y is 9.425209045
Predicted y is [9.55763436], true y is 9.416181564
Predicted y is [9.57510619], true y is 9.334929466
Predicted y is [9.42352358], true y is 9.371041298
Predicted y is [9.3586884], true y is 9.298818588
Predicted y is [9.40299723], true y is 9.280761719
Predicted y is [9.39407227], true y is 9.31687355
Predicted y is [9.31287751], true y is 9.343958855
Predicted y is [9.34837407], true y is 9.398125648
Predicted y is [9.27646989], true y is 9.325901985
Predicted y is [9.25821725], true y is 9.154370308
Predicted y is [9.29442822], true y is 9.271734238
Predicted y is [9.3210241], true y is 9.271734238
Predicted y is [9.37590903], true 

As we can see from the details above, the model predicts prices it has not seen before reasonably well. You could perhaps use this model to predict the stock price 5 days ahead, but there are heaps of limitations:
<br>1. Since even a one-penny change in price matters in stock investment, the model's accuracy is not yet up to our standard.
<br>2. What we really need is for the model to tell us how the price will change compared with today, so that we can take action (buy/sell/hold). Instead of using volume and price to make predictions, we could probably use percent change to predict percent change.
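
The second point could look roughly like the sketch below, which builds a past percent change as the feature and a future percent change as the label. The prices, the `horizon` of 2 days, and the column names are all made up for illustration.

```python
import pandas as pd

# Hypothetical prices, for illustration only
prices = pd.Series([10.0, 11.0, 12.0, 11.0, 10.0, 12.0, 13.0, 12.0])
horizon = 2  # hypothetical look-back and look-ahead window, in days

frame = pd.DataFrame({"price": prices})
# feature: percent change over the past `horizon` days
frame["past_change"] = frame["price"].pct_change(horizon)
# label: percent change over the next `horizon` days
frame["future_change"] = frame["price"].shift(-horizon) / frame["price"] - 1

# the first rows lack a past value and the last rows lack a future value
frame = frame.dropna()
print(frame)
```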

## A classification model
In this section we want to combine the algo trading work we did earlier with machine learning.
<br>Basically, we want to test whether those indicators help predict whether the stock price will go up in 3 days.

In [65]:
def mtm_func(stock_price,time_frame=10):
    '''
    mtm(momentum)
    
    (current stock price/stock price n days ago) - 1
    
    '''
    mtm = stock_price / (stock_price.shift(time_frame)) - 1
    
    return mtm

def boll_func(stock_price,time_frame=10):
    '''
    boll(bollinger bands)
    
    mid line is n-day SMA
    upper line is SMA+2*std(std of the stock price in the past n days)
    lower line is SMA-2*std
    
    once the stock price goes beyond the upper line (or lower line), it signifies a sell (or buy) opportunity
    
    '''
    sma = stock_price.rolling(time_frame).mean()
    sigma = stock_price.rolling(time_frame).std()
    upper = sma + 2*sigma
    lower = sma - 2*sigma
    bolli = (stock_price - sma) / (2*sigma)
    
    return bolli, sma, upper, lower

def macd_func(stock_price,slow=26,fast=12):
    '''
    macd(moving average convergence divergence)
    
    diff = emafast - emaslow
    dea = ema(diff,9)
    histogram = diff - dea
    
    when histogram>0, diff is greater than dea (past 9-day diff average), fast and slow lines are diverging, signifying an upward trend in stock price
    or vice versa
    
    '''
    emaSlow = stock_price.ewm(span=slow,adjust=True).mean()
    emaFast = stock_price.ewm(span=fast,adjust=True).mean()
    diff = emaFast-emaSlow
    dea = diff.ewm(span=9,adjust=True).mean()
    macdi = diff - dea
    
    return macdi, emaSlow, emaFast, diff, dea
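
To see what the momentum and Bollinger formulas above produce, here is a small worked sketch on a made-up price series with a short window of 3 days (the notebook's functions default to 10). The names `toy_price` and `n` are hypothetical; the formulas are the same ones used in `mtm_func` and `boll_func`.

```python
import pandas as pd

# Hypothetical prices and a short window, for illustration only
toy_price = pd.Series([10.0, 11.0, 12.0, 11.0, 10.0, 12.0])
n = 3

# momentum: (current price / price n days ago) - 1
momentum = toy_price / toy_price.shift(n) - 1

# Bollinger position: distance from the n-day SMA in units of 2 standard deviations
sma = toy_price.rolling(n).mean()
sigma = toy_price.rolling(n).std()
boll_position = (toy_price - sma) / (2 * sigma)

# values above 1 (or below -1) would mean the price is beyond the upper (or lower) band
print(pd.DataFrame({"price": toy_price, "mtm": momentum, "bolli": boll_position}))
```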


In [66]:
data_path = 'stock_data/'
stock_code = '600000_浦发银行.csv'
df = pd.read_csv(data_path + stock_code)
df['Adj Close T+3'] = df['Adj Close'].shift(-3)
df.head(10)

| Unnamed: 0 | Date | High | Low | Open | Close | Volume | Adj Close | Adj Close T+3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 4/01/2000 | 3.005759 | 2.885668 | 2.912484 | 2.981274 | 38562183 | 1.814433 | 1.908809 |
| 1 | 5/01/2000 | 3.029077 | 2.932305 | 2.981274 | 2.947462 | 45052693 | 1.793855 | 1.933645 |
| 2 | 6/01/2000 | 3.066387 | 2.920646 | 2.935803 | 3.030243 | 53430896 | 1.844237 | 1.859138 |
| 3 | 7/01/2000 | 3.206298 | 3.0454 | 3.066387 | 3.136342 | 183161852 | 1.908809 | 1.782502 |
| 4 | 10/01/2000 | 3.247105 | 3.11419 | 3.148002 | 3.17715 | 141859094 | 1.933645 | 1.76689 |
| 5 | 11/01/2000 | 3.182979 | 3.0454 | 3.17715 | 3.054728 | 80543448 | 1.859138 | 1.717219 |
| 6 | 12/01/2000 | 3.031409 | 2.891498 | 3.031409 | 2.928807 | 302548283 | 1.782502 | 1.731411 |
| 7 | 13/01/2000 | 2.943964 | 2.891498 | 2.914816 | 2.903157 | 68406039 | 1.76689 | 1.712962 |
| 8 | 14/01/2000 | 2.914816 | 2.78773 | 2.900825 | 2.821542 | 153198958 | 1.717219 | 1.712252 |
| 9 | 17/01/2000 | 2.849524 | 2.769075 | 2.807551 | 2.844861 | 69485695 | 1.731411 | 1.734249 |


In [67]:
# construct new columns of data
df['price_change'] = (df['Adj Close T+3'] - df['Adj Close'])/ df['Adj Close']
df['mtm'] = mtm_func(df['Adj Close'])
df['bolli'],*k = boll_func(df['Adj Close']) 
df['macdi'],*k = macd_func(df['Adj Close'])
df['smai'] = df['Adj Close'].rolling(5).mean() - df['Adj Close'].rolling(10).mean()
# label: 1 if the price rose over the next 3 days, else 0 (vectorized comparison)
df['up'] = (df['price_change'] > 0).astype(int)


In [70]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# handle missing data by dropping nan rows
df.dropna(inplace=True)

# split data into features and labels
X = df[['bolli','macdi','mtm']]
y = df['up']

# perform feature scaling, so that bolli, macdi, and mtm are on the same scale
X = scale(X)

# split data into training and testing sets
# keep shuffle=False because this is time series data and the rows must stay in chronological order
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,shuffle=False)

In [71]:
# choose a model
from sklearn import svm
clf_svm = svm.SVC()

# train the model
clf_svm.fit(X_train,y_train)

# test the model
clf_svm.score(X_test,y_test)


0.5590243902439025
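
For a classifier such as `SVC`, `score` returns the mean accuracy, so the value above means about 56% of the test samples were labelled correctly. One way to put such a number in context is to compare it with a baseline that always predicts the most frequent class in the training data. The sketch below shows the idea with scikit-learn's `DummyClassifier` on made-up labels; the `*_demo` arrays are hypothetical and do not reproduce the stock data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical up/down labels, for illustration only
rng = np.random.default_rng(0)
y_train_demo = rng.integers(0, 2, 400)
y_test_demo = rng.integers(0, 2, 100)
# the baseline ignores the features, so placeholders are enough
X_train_demo = np.zeros((400, 1))
X_test_demo = np.zeros((100, 1))

# always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
baseline_acc = baseline.score(X_test_demo, y_test_demo)
print(f"Baseline accuracy: {baseline_acc:.3f}")
```

A model is only adding information if its test accuracy is clearly above this kind of baseline.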