**BEAT THE STOCK MARKET - THE LAZY STRATEGY**

Is it possible to predict which stocks will increase in value in year *t* using only the financial data available for year *t-1*?

The objective of this kernel is to show that machine learning models can learn to recognize stocks that will grow in value after being trained on a set of financial indicators, like those that can be found in the 10-K filings.

* **ABOUT THE DATA**

    I already prepared and made available 2 example datasets:

    1. `Example_DATASET.csv` contains a table where each row is a US stock from the technology sector and each column is a financial indicator. The last column is the class (binary, IGNORE=0, BUY=1).
    2. `Example_2019_price_var.csv` contains a table where each row is a US stock from the technology sector and the single column is the stock's percent price variation for the year 2019 (from the first trading day of January 2019 to the last trading day of December 2019).

  I will later share a kernel where I will show how you can build these types of datasets.

  From a general point of view, I will be considering the financial data collected in the `Example_DATASET.csv` as features. The data collected there are from the 10-K filings of 2018 (end of 2018). The class of each sample has been assigned by looking at the performance of the stock in the year 2019 (`Example_2019_price_var.csv`), that is:
    1. Class = 1 if 2019 price variation is positive;
    2. Class = 0 if 2019 price variation is negative.
    
  **So, this is a binary classification problem and the objective is to BUY all the stocks that will be classified as `1`, and to IGNORE all those stocks that will be classified as `0`.**
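
As a minimal sketch of the labeling rule above, the class can be derived from the price variation column with a single comparison. The tickers and values below are made up for illustration; the column name follows the one used later in this kernel. The rule does not say what happens when the variation is exactly zero, so sending that case to class `0` is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical tickers and price variations, for illustration only
pvar = pd.DataFrame(
    {'2019 PRICE VAR [%]': [12.5, -3.2, 48.0, -20.1]},
    index=['AAA', 'BBB', 'CCC', 'DDD'],
)

# Class = 1 (BUY) if the 2019 price variation is positive, 0 (IGNORE) otherwise.
# Assumption: a variation of exactly 0 falls into class 0.
pvar['class'] = (pvar['2019 PRICE VAR [%]'] > 0).astype(int)

print(pvar)
```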

* **ABOUT THE MODELS**

    The machine learning models that I will be using are:
    1. Support Vector Machine (`SVC`, scikit-learn);
    2. Random Forest (`RandomForestClassifier`, scikit-learn);
    3. Extreme Gradient Boosting (`xgb`, xgboost);
    4. Multi-Layer Perceptron (`MLPClassifier`, scikit-learn).

  Nothing too fancy: I am more interested in the proof of concept right now (if the results are satisfactory, I will implement more sophisticated neural networks via Keras or PyTorch). As for training, I will run a `GridSearchCV` with `cv=5` and `scoring='precision_weighted'` for each model.
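
To make the `precision_weighted` score more concrete, here is a small sketch on made-up labels: it computes the precision of each class and then averages them, weighting each class by its number of true samples. The arrays are hypothetical and only serve to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical true and predicted classes (1 = BUY, 0 = IGNORE)
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0])

# Precision of each class taken on its own
p_buy = precision_score(y_true, y_pred, pos_label=1)     # 3 correct out of 4 predicted BUY
p_ignore = precision_score(y_true, y_pred, pos_label=0)  # 1 correct out of 2 predicted IGNORE

# Weighted precision: per-class precision weighted by the class support (4 BUY, 2 IGNORE)
weighted = precision_score(y_true, y_pred, average='weighted')
manual = (4 * p_buy + 2 * p_ignore) / 6

print(p_buy, p_ignore, weighted, manual)
```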
    
* **ABOUT THE LAZY WAY**

    Compared to other algorithmic trading strategies, this kernel implements a **lazy strategy**: buy the stocks that the machine learning models identify as buy-worthy on the first trading day of the year, and sell them on the last trading day of the year. The rationale is the common argument that buying and holding a stock often yields higher returns than trying to time the market.
    
    As of now, this method has only been backtested with the year 2019, but I am planning to perform more backtesting with older data.
    
    
* **ABOUT THE MODEL'S EVALUATION**

    The evaluation of the machine learning models' performance is a two-step process:
    
    1. An analysis of the `classification_report` is performed in order to evaluate the models according to commonly used metrics.
    2. An analysis of the machine learning models' Return On Investment (ROI) is performed in order to evaluate their performance from a financial point of view. Furthermore, the ROI of S&P 500 index (^GSPC), Dow Jones index (^DJI) and the ROI of the Technology sector are used as benchmarks to evaluate the performance of the **lazy strategy**.
   

Let's start by importing the required libraries. This kernel leverages the power of:
* `numpy`, `pandas` to manipulate data
* `pandas_datareader` to pull historical price data (using Yahoo Finance)
* `sklearn`, `xgboost` for everything machine learning related

In [None]:
import numpy as np
import pandas as pd
from pandas_datareader import data

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
from sklearn.metrics import classification_report

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

We need a function that allows us to pull the historical price data for a given symbol. This will be used to compute the ROI of S&P 500 index (^GSPC) and Dow Jones index (^DJI).

In [None]:
# Custom function to pull prices of stocks
def get_price_var(symbol):
    '''
    Get historical price data for a given symbol leveraging the power of pandas_datareader and Yahoo.
    Compute the difference between first and last available time-steps in terms of Adjusted Close price.
    
    Input: ticker symbol
    Output: percent price variation 
    '''
    # read data
    prices = data.DataReader(symbol, 'yahoo', '2019-01-01', '2019-12-31')['Adj Close']

    # get all timestamps for specific lookups
    today = prices.index[-1]
    start = prices.index[0]

    # calculate percentage price variation
    price_var = ((prices[today] - prices[start]) / prices[start]) * 100
    return price_var
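
The same computation can be checked offline on a synthetic price series, without calling Yahoo. This is only a sketch: the helper name `price_var_from_series` and the prices are hypothetical and are not part of the kernel.

```python
import pandas as pd

def price_var_from_series(prices):
    """Percent variation between the first and the last value of a price Series."""
    first, last = prices.iloc[0], prices.iloc[-1]
    return (last - first) / first * 100

# Made-up adjusted close prices on three 2019 dates
dates = pd.to_datetime(['2019-01-02', '2019-06-28', '2019-12-31'])
prices = pd.Series([100.0, 110.0, 125.0], index=dates)

print(price_var_from_series(prices))  # from 100 to 125, i.e. +25%
```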

**LOAD DATA**

Now we are ready to import the required data. As explained, I prepared the two datasets beforehand, and I will upload a separate kernel that shows how. Briefly:

* `PVAR` is a dataframe where each row is a stock from the US stock market Technology sector, and the column is the 2019 percent price variation for each stock.
* `DATA` is a dataframe where each row is a stock from the US stock market Technology sector, and the columns are financial indicators that have been scraped thanks to Financial Modeling Prep API (https://financialmodelingprep.com/developer/docs/). The financial indicators refer to the end of 2018. The last column of `DATA` corresponds to the binary classification of each stock, according to the rules explained above.

In [None]:
# Load 2019 percent price variation for all tech stocks
PVAR = pd.read_csv('../input/Example_2019_price_var.csv', index_col=0)

# Load dataset with all financial indicators (referring to end of 2018)
DATA = pd.read_csv('../input/Example_DATASET.csv', index_col=0)

**TRAIN TEST SPLIT**

Once the dataset is loaded, we must split it into training and testing sets. 20% of the data will be used to test the ML models. Note the parameter `stratify`, which keeps the same class ratios in the training and testing splits.
From `train_split` and `test_split` we extract the input data `X_train`, `X_test` and the target data `y_train`, `y_test`.
A sanity check on the resulting shapes is performed afterwards.

In [None]:
# Divide data in train and testing splits
train_split, test_split = train_test_split(DATA, test_size=0.2, random_state=1, stratify=DATA['class'])
X_train = train_split.iloc[:, :-1].values
y_train = train_split.iloc[:, -1].values
X_test = test_split.iloc[:, :-1].values
y_test = test_split.iloc[:, -1].values

print(f'Total number of samples: {DATA.shape[0]}')
print()
print(f'Number of training samples: {X_train.shape[0]}')
print()
print(f'Number of testing samples: {X_test.shape[0]}')
print()
print(f'Number of features: {X_train.shape[1]}')
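
To see what `stratify` does, the sketch below splits a synthetic, imbalanced dataset and compares the share of class `1` in the two splits. The dataframe `toy` is made up for illustration; the same check could be run on `train_split['class']` and `test_split['class']`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced dataset: 70 samples of class 1 and 30 of class 0
toy = pd.DataFrame({'feature': np.arange(100), 'class': [1] * 70 + [0] * 30})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=1, stratify=toy['class'])

# With stratification, the share of class 1 is preserved in both splits
print(f"Share of class 1 in train: {toy_train['class'].mean():.2f}")
print(f"Share of class 1 in test:  {toy_test['class'].mean():.2f}")
```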

**DATA STANDARDIZATION**

The next step is the standardization of the data. We leverage the `StandardScaler()` available from `scikit-learn`. It is important to use the same coefficients when standardizing both training and testing data: for this reason we first fit the scaler to `X_train` only, and then apply it to both `X_train` and `X_test` via the method `.transform()`.

In [None]:
# Standardize input data
scaler = StandardScaler()
scaler.fit(X_train)  # learn mean and standard deviation from the training split only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)  # reuse the training coefficients, do not refit on the test data
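
The effect of fitting the scaler on the training split only can be checked on random data. In this sketch (all names and values are hypothetical), the standardized training set has zero mean and unit variance by construction, while the test set is transformed with the training coefficients and therefore generally does not.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
toy_X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
toy_X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit on the training data only, then apply the same coefficients to both splits
toy_scaler = StandardScaler().fit(toy_X_train)
toy_X_train_std = toy_scaler.transform(toy_X_train)
toy_X_test_std = toy_scaler.transform(toy_X_test)

print(toy_X_train_std.mean(axis=0).round(3))  # zero mean on the training split
print(toy_X_test_std.mean(axis=0).round(3))   # close to zero, but generally not exactly zero
```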

**ML MODEL I: SUPPORT VECTOR MACHINE**

The first classification model we will run is the support vector machine. `GridSearchCV` is used to tune some hyper-parameters (`kernel`, `gamma`, `C`) with 5-fold cross-validation. We want to achieve maximum weighted precision.

In [None]:
# Parameter grid to be tuned
tuned_parameters = [{'kernel': ['rbf', 'linear'], 'gamma': [1e-3, 1e-4], 'C': [0.01, 0.1, 1, 10, 100]}]

clf1 = GridSearchCV(SVC(random_state=1),
                    tuned_parameters,
                    n_jobs=4,
                    scoring='precision_weighted',
                    cv=5)

clf1.fit(X_train, y_train)

print('Best score and parameters found on development set:')
print()
print('%0.3f for %r' % (clf1.best_score_, clf1.best_params_))
print()

**ML MODEL II: RANDOM FOREST**

The second classification model we will run is the random forest. `GridSearchCV` is used to tune some hyper-parameters (`n_estimators`, `max_features`, `max_depth`, `criterion`) with 5-fold cross-validation. We want to achieve maximum weighted precision.

In [None]:
# Parameter grid to be tuned
tuned_parameters = {'n_estimators': [1024, 4096],
                    'max_features': ['sqrt'],  # 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
                    'max_depth': [4, 6, 8],
                    'criterion': ['gini', 'entropy']}

clf2 = GridSearchCV(RandomForestClassifier(random_state=1),
                    tuned_parameters,
                    n_jobs=4,
                    scoring='precision_weighted',
                    cv=5)

clf2.fit(X_train, y_train)

print('Best score and parameters found on development set:')
print()
print('%0.3f for %r' % (clf2.best_score_, clf2.best_params_))
print()

**ML MODEL III: EXTREME GRADIENT BOOSTING**

The third classification model we will run is extreme gradient boosting (XGBoost). `GridSearchCV` is used to tune some hyper-parameters (`learning_rate`, `max_depth`, `n_estimators`) with 5-fold cross-validation. We want to achieve maximum weighted precision.

In [None]:
# Parameter grid to be tuned
tuned_parameters = {'learning_rate': [0.01, 0.001],
                    'max_depth': [4, 6, 8],
                    'n_estimators': [512, 1024]}

clf3 = GridSearchCV(xgb.XGBClassifier(random_state=1),
                   tuned_parameters,
                   n_jobs=4,
                   scoring='precision_weighted', 
                   cv=5)

clf3.fit(X_train, y_train)

print('Best score and parameters found on development set:')
print()
print('%0.3f for %r' % (clf3.best_score_, clf3.best_params_))
print()

**ML MODEL IV: MULTI-LAYER PERCEPTRON**

The fourth classification model we will run is the multi-layer perceptron (feed-forward neural network). `GridSearchCV` is used to tune some hyper-parameters (`hidden_layer_sizes`, `activation`, `solver`) with 5-fold cross-validation. We want to achieve maximum weighted precision.

In [None]:
# Parameter grid to be tuned
tuned_parameters = {'hidden_layer_sizes': [(32,), (64,), (32, 64, 32)],
                    'activation': ['tanh', 'relu'],
                    'solver': ['lbfgs', 'adam']}

clf4 = GridSearchCV(MLPClassifier(random_state=1, batch_size=4, early_stopping=True), 
                    tuned_parameters,
                    n_jobs=4,
                    scoring='precision_weighted',
                    cv=5)

clf4.fit(X_train, y_train)

print('Best score and parameters found on development set:')
print()
print('%0.3f for %r' % (clf4.best_score_, clf4.best_params_))
print()

**EVALUATE THE MODELS**

Now that 4 classification models have been trained, we must test them and compare their performance with respect to each other and to the benchmarks (S&P 500, DOW JONES, sector).

In [None]:
# Get 2019 price variations ONLY for the stocks in testing split
pvar_test = PVAR.loc[test_split.index.values, :]

Now we build a new dataframe `df1` in which, for each tested stock, we collect the predicted class from each model (recall that the two classes are `0` = IGNORE, `1` = BUY).
If a model predicts class `1`, we buy 100 USD worth of that stock; otherwise, we ignore the stock.

In [None]:
# Initial investment is $100 for each stock whose predicted class = 1
buy_amount = 100

# In new dataframe df1, store each model's predicted class and the related gain/loss in USD
df1 = pd.DataFrame(y_test, index=test_split.index.values, columns=['ACTUAL']) # first column is the true class (BUY/IGNORE)

df1['SVM'] = clf1.predict(X_test) # predict class for testing dataset
df1['VALUE START SVM [$]'] = df1['SVM'] * buy_amount # if class = 1 --> buy $100 of that stock
df1['VAR SVM [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START SVM [$]'] # compute price variation in $
df1['VALUE END SVM [$]'] = df1['VALUE START SVM [$]'] + df1['VAR SVM [$]'] # compute final value

df1['RF'] = clf2.predict(X_test)
df1['VALUE START RF [$]'] = df1['RF'] * buy_amount
df1['VAR RF [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START RF [$]']
df1['VALUE END RF [$]'] = df1['VALUE START RF [$]'] + df1['VAR RF [$]']

df1['XGB'] = clf3.predict(X_test)
df1['VALUE START XGB [$]'] = df1['XGB'] * buy_amount
df1['VAR XGB [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START XGB [$]']
df1['VALUE END XGB [$]'] = df1['VALUE START XGB [$]'] + df1['VAR XGB [$]']

df1['MLP'] = clf4.predict(X_test)
df1['VALUE START MLP [$]'] = df1['MLP'] * buy_amount
df1['VAR MLP [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START MLP [$]']
df1['VALUE END MLP [$]'] = df1['VALUE START MLP [$]'] + df1['VAR MLP [$]']

# Show dataframe df1
df1.head()
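
The four blocks above repeat the same three steps for each model, so they could also be written as a loop. The sketch below shows the idea on a made-up dataframe with two hypothetical prediction columns; it is an alternative layout, not the code used to produce the results in this kernel.

```python
import pandas as pd

buy_amount = 100

# Hypothetical predictions (1 = BUY, 0 = IGNORE) and price variations for three stocks
toy_df = pd.DataFrame(
    {'SVM': [1, 1, 0],
     'RF': [1, 0, 0],
     '2019 PRICE VAR [%]': [10.0, -20.0, 50.0]},
    index=['AAA', 'BBB', 'CCC'],
)

for name in ['SVM', 'RF']:
    start_col = f'VALUE START {name} [$]'
    var_col = f'VAR {name} [$]'
    end_col = f'VALUE END {name} [$]'
    toy_df[start_col] = toy_df[name] * buy_amount                          # buy $100 only if class = 1
    toy_df[var_col] = toy_df['2019 PRICE VAR [%]'] / 100 * toy_df[start_col]  # gain/loss in $
    toy_df[end_col] = toy_df[start_col] + toy_df[var_col]                  # value at year end

print(toy_df.filter(like='VALUE END'))
```

Here the hypothetical SVM buys `AAA` and `BBB` and ends the year with 110 USD and 80 USD respectively, while the hypothetical RF buys only `AAA`.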

Finally, we build a compact dataframe `MODELS_COMPARISON` in which we collect the main information required to perform the comparison between the classification models and the benchmarks (S&P 500, DOW JONES, sector).
Leveraging the dataframe `df1`, we can easily compute gains and losses for each model (`net_gain_*`, `percent_gain_*`).

Since we still lack the benchmark data, we use the custom function `get_price_var` to get the percent price variation of both the S&P 500 (^GSPC) and the DOW JONES (^DJI) for the year 2019. The 2019 percent price variation of the Technology sector is computed by calling `.mean()` on the whole column of the dataset `PVAR`.

In [None]:
# Create a new, compact, dataframe in order to show gain/loss for each model
start_value_svm = df1['VALUE START SVM [$]'].sum()
final_value_svm = df1['VALUE END SVM [$]'].sum()
net_gain_svm = final_value_svm - start_value_svm
percent_gain_svm = (net_gain_svm / start_value_svm) * 100

start_value_rf = df1['VALUE START RF [$]'].sum()
final_value_rf = df1['VALUE END RF [$]'].sum()
net_gain_rf = final_value_rf - start_value_rf
percent_gain_rf = (net_gain_rf / start_value_rf) * 100

start_value_xgb = df1['VALUE START XGB [$]'].sum()
final_value_xgb = df1['VALUE END XGB [$]'].sum()
net_gain_xgb = final_value_xgb - start_value_xgb
percent_gain_xgb = (net_gain_xgb / start_value_xgb) * 100

start_value_mlp = df1['VALUE START MLP [$]'].sum()
final_value_mlp = df1['VALUE END MLP [$]'].sum()
net_gain_mlp = final_value_mlp - start_value_mlp
percent_gain_mlp = (net_gain_mlp / start_value_mlp) * 100

percent_gain_sp500 = get_price_var('^GSPC') # get percent gain of S&P500 index
percent_gain_dj = get_price_var('^DJI') # get percent gain of DOW JONES index
percent_gain_sector = PVAR['2019 PRICE VAR [%]'].mean()

MODELS_COMPARISON = pd.DataFrame([start_value_svm, final_value_svm, net_gain_svm, percent_gain_svm],
                    index=['INITIAL COST [USD]', 'FINAL VALUE [USD]', '[USD] GAIN/LOSS', 'ROI'], columns=['SVM'])
MODELS_COMPARISON['RF'] = [start_value_rf, final_value_rf, net_gain_rf, percent_gain_rf]
MODELS_COMPARISON['XGB'] = [start_value_xgb, final_value_xgb, net_gain_xgb, percent_gain_xgb]
MODELS_COMPARISON['MLP'] = [start_value_mlp, final_value_mlp, net_gain_mlp, percent_gain_mlp]
MODELS_COMPARISON['S&P 500'] = ['', '', '', percent_gain_sp500]
MODELS_COMPARISON['DOW JONES'] = ['', '', '', percent_gain_dj]
MODELS_COMPARISON['TECH SECTOR'] = ['', '', '', percent_gain_sector]

# Show the dataframe
MODELS_COMPARISON

From the dataframe `MODELS_COMPARISON`, it is possible to see that:

* RF and XGB are the ML models that yield the highest ROI, 32.5% and 34.9% respectively
* RF and XGB outperform the S&P 500 and the Tech sector by about 4-6 p.p., while they outperform the DOW JONES by more than 10 p.p.
* MLP and SVM are closely matched, and yield an ROI of 28.3% and 27.6% respectively
* MLP and SVM perform similarly to the S&P 500, while they both outperform the DOW JONES
* the SVM leads to the highest net gains, at about 3400 USD; however, it also has the highest initial investment cost at 12300 USD
* the RF leads to the lowest net gains, at about 2570 USD; however, it also has the lowest initial investment cost at 7900 USD
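
As a quick consistency check, the ROI is the net gain divided by the initial cost. Using the rounded figures quoted above for SVM and RF reproduces the reported percentages:

```python
# Rounded figures quoted in the list above: (initial cost, net gain) in USD
reported = {'SVM': (12300, 3400), 'RF': (7900, 2570)}

roi = {}
for model, (cost, gain) in reported.items():
    roi[model] = gain / cost * 100  # ROI in percent
    print(f'{model}: ROI of about {roi[model]:.1f}%')
```

This gives about 27.6% for the SVM and about 32.5% for the RF, in line with the values in `MODELS_COMPARISON`.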

So, this kernel shows that, at least as a proof of concept, it is possible to find useful information in the 10-K filings that publicly traded companies release. The financial information can be used to train machine learning models that learn to recognize buy-worthy stocks. That said, the ROIs generated are only a few percentage points higher than buying and holding the S&P 500 index.

For a more traditional comparison between the performance of the ML models, we can analyze the `classification_report`.

In [None]:
print(53 * '=')
print(15 * ' ' + 'SUPPORT VECTOR MACHINE')
print(53 * '-')
print(classification_report(y_test, clf1.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')
print(53 * '=')
print(20 * ' ' + 'RANDOM FOREST')
print(53 * '-')
print(classification_report(y_test, clf2.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')
print(53 * '=')
print(14 * ' ' + 'EXTREME GRADIENT BOOSTING')
print(53 * '-')
print(classification_report(y_test, clf3.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')
print(53 * '=')
print(15 * ' ' + 'MULTI-LAYER PERCEPTRON')
print(53 * '-')
print(classification_report(y_test, clf4.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')

Looking carefully, it is possible to see that the `classification_report` confirms what has already been noted in the analysis of `MODELS_COMPARISON`. Indeed:

* XGB has an excellent balance between *precision* and *recall* for the `BUY` class, which is highlighted by the highest *weighted average f1-score*
* XGB and RF share the highest *precision* for the `BUY` class (80%). High precision means few false positives, that is, few stocks that we buy and that then lose value in 2019
* XGB has a slightly higher *f1-score* for the `IGNORE` class
* SVM has the lowest *f1-score* for the `IGNORE` class (*recall* is 5%). This means that most of its predictions are `BUY` (which explains the highest initial cost of 12300 USD), and that it does not recognize the stocks to `IGNORE`
* MLP has lower *accuracy* than the SVM, but thanks to the higher *weighted average f1-score* it is able to generate a similar ROI with a lower initial cost
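
To illustrate how a very low *recall* on the `IGNORE` class translates into buying almost everything, here is a sketch on made-up labels. The class sizes are hypothetical and chosen only so that the `IGNORE` recall comes out at 5%, as in the SVM case.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical test set: 20 stocks to IGNORE (0) and 80 stocks to BUY (1)
y_true = np.array([0] * 20 + [1] * 80)

# A model that almost always says BUY: it flags only 1 of the 20 IGNORE stocks
y_pred = np.array([0] * 1 + [1] * 19 + [1] * 80)

recall_ignore = recall_score(y_true, y_pred, pos_label=0)     # 1 / 20
precision_buy = precision_score(y_true, y_pred, pos_label=1)  # 80 / 99
n_bought = int(y_pred.sum())                                  # number of stocks bought

print(f'IGNORE recall: {recall_ignore:.2f}')
print(f'BUY precision: {precision_buy:.2f}')
print(f'Stocks bought: {n_bought} out of {len(y_pred)}')
```

In this toy case the model buys 99 of the 100 stocks, so its initial cost is close to the cost of buying the whole test set, and its ROI is driven mostly by the overall market rather than by its selections.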