# Fundamental Analysis With Machine Learning in Python

### By: Ari Silburt

### Date: January 20th, 2017

The full code for this analysis can be found at https://github.com/silburt/Machine_Learning/tree/master/Fundamental_Analysis

The Efficient Market Hypothesis (EMH) is a financial theory stating that current asset prices reflect all available information. A direct consequence is that no trading strategy can consistently beat the market, and future prices cannot be predicted by analyzing past prices. It is a hotly debated theory, supported by many, such as Burton Malkiel (author of _A Random Walk Down Wall Street_) and Eugene Fama (credited with founding the theory), and contested by many others, such as Andrew Lo and Craig MacKinlay (co-authors of the paper _Stock Market Prices Do Not Follow Random Walks_, with >4000 citations).

Many believe that the EMH is false, since that would mean money can be made by carefully selecting stocks. Surely you know of someone, or some entity, that has boasted about analyzing market trends, selecting a few stocks, and earning huge profits. However, this by itself doesn't falsify the EMH. For example, if I have 1,000 people each flip a coin 10 times, the expected number of people who flip 10 heads in a row is about one (1000/1024), and there is roughly a 62% chance that at least one person does so. That person may think they are particularly amazing, when really it's just random processes and statistics at play. In addition, a somewhat odd contradiction is that many believe the EMH is false and yet Markov processes are an industry standard in finance these days (e.g. Black-Scholes).
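The coin-flip argument can be checked with a few lines of arithmetic. This is a minimal sketch assuming fair, independent flips; the variable names are only illustrative.

```python
# How likely is it that somebody in a crowd flips an unbroken run of heads?
n_people = 1000
n_flips = 10

p_single = 0.5 ** n_flips                         # one person flips all heads: 1/1024
p_at_least_one = 1 - (1 - p_single) ** n_people   # complement of "nobody manages it"
expected_count = n_people * p_single              # expected number of such people

print(f"P(at least one person flips {n_flips} heads): {p_at_least_one:.3f}")
print(f"Expected number of such people: {expected_count:.3f}")
```

So a "lucky streak" somewhere in the crowd is more likely than not, even though every individual flip is pure chance.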

Regardless of theory, it is an interesting exercise to see whether Machine Learning + Fundamental Analysis can be used to empirically predict future stock prices. Specifically, I will take all the stocks from the Wilshire 5000 index, get their stock prices and fundamental quantities at years $t$ and $t-1$, and see if the stock price at year $t+1$ can be accurately predicted. For concreteness, I will choose $t=2015$, but the code is generalizable to any $t$. I will also cast this problem as a classification problem instead of a regression problem. This means that if a stock increased between years $t$ and $t+1$ the machine learning algorithm should predict $1$, and $0$ otherwise. In contrast, casting this as a regression problem would mean I want to predict _how much_ a stock increased or decreased between $t$ and $t+1$. As you can imagine, this is much more difficult.

## Setup
### Getting Stock Prices
First we need some stocks. From [here](http://www.beatthemarketanalyzer.com/blog/wilshire-5000-stock-tickers-list/) I got an Excel file containing all the stocks from the Wilshire 5000 index and converted it to a CSV, which can then be easily loaded with pandas:

```python
import pandas as pd
tickers = pd.read_csv('fundamental_analysis/wilshire5000.csv', delimiter=",")
tickers.head()
```

| | Symbol | Company |
|---|---|---|
| 0 | A | Agilent Technologies |
| 1 | AA | Alcoa Inc |
| 2 | AACC | Asset Accep Cap Corp |
| 3 | AAI | Airtran Hldgs Inc |
| 4 | AAII | Alabama Aircraft Ind In |


Next, we need the prices for each stock, which can be obtained fairly easily using the Yahoo! Finance reader in `pandas_datareader` (along with the `fix_yahoo_finance` fix, since Yahoo! Finance decommissioned their historical data API). The following code gets the daily prices for Agilent Technologies between Jan 1st and 8th, 2015.

```python
from pandas_datareader import data as pdr
import fix_yahoo_finance
pdr.get_data_yahoo(tickers['Symbol'][0], '2015-01-01', '2015-01-08')
```

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2015-01-02 | 41.18 | 41.310001 | 40.369999 | 40.560001 | 39.430428 | 1529200 |
| 2015-01-05 | 40.32 | 40.459999 | 39.700001 | 39.799999 | 38.691589 | 2041800 |
| 2015-01-06 | 39.810001 | 40.02 | 39.02 | 39.18 | 38.088852 | 2080600 |
| 2015-01-07 | 39.52 | 39.810001 | 39.290001 | 39.700001 | 38.594372 | 3359700 |
| 2015-01-08 | 40.240002 | 40.98 | 40.18 | 40.889999 | 39.75124 | 2116300 |


Choosing $t=2015$, we will take the mean Adjusted Close price in January 2015 and January 2014, and ultimately try to predict the mean Adjusted Close price in January 2016. 
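As a minimal sketch of that averaging step, the snippet below computes the mean Adjusted Close for one month of a price table. The helper `mean_adj_close` is hypothetical (it is not from the original code), and the data are just the five Agilent rows shown above, so the result covers only the first week of January 2015 rather than the full month.

```python
import pandas as pd

# Adjusted close prices for ticker A, copied from the table above
prices = pd.DataFrame(
    {"Adj Close": [39.430428, 38.691589, 38.088852, 38.594372, 39.75124]},
    index=pd.to_datetime(
        ["2015-01-02", "2015-01-05", "2015-01-06", "2015-01-07", "2015-01-08"]
    ),
)

def mean_adj_close(df, year, month=1):
    """Mean 'Adj Close' over the rows that fall in the given year and month."""
    mask = (df.index.year == year) & (df.index.month == month)
    return df.loc[mask, "Adj Close"].mean()

jan_2015 = mean_adj_close(prices, 2015)
print(jan_2015)
```

Calling the same helper with `year=2014` or `year=2016` on a longer price history would give the other two values needed for each stock.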

### Getting Financial Data

Next, we need to scrape all the relevant fundamental quantities for each stock and process them down into features. [morningstar.com](http://www.morningstar.com/) has an extensive list of "Key Ratios" and "Financials" for each stock:<br>
<img src="fundamental_analysis/morningstar.png" width="700">

Highlighted in purple is the export button that normal muggles might use to manually download financial data for all 5000 stocks, one by one. However, there's a faster, more efficient solution. We can scrape each stock's financial data into a csv file using the following code snippet:
***
```python
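from pattern.web import URL

path = '.'  # output directory (placeholder; point it at wherever the CSVs should be saved)
for stock in tickers['Symbol']:
    # Morningstar "Key Ratios" CSV export URL, one request per ticker
    webpage = "http://financials.morningstar.com/ajax/exportKR2CSV.html?t=%s&culture=en-CA&region=USA&order=asc&r=314562" % stock
    url = URL(webpage)
    # Write the downloaded bytes to <path>/<ticker>_keyratios.csv
    with open('%s/%s_keyratios.csv' % (path, stock), 'wb') as f:
        f.write(url.download())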
```
***
The `webpage` URL was obtained by:
- Opening the developer tools in the Chrome web browser and clicking the Network tab to monitor requests under the ALL tab.
- Pressing the export button for a single stock and noting the corresponding URL request that appears in the ALL tab.

### Preparing Data Arrays
Now comes the most difficult part. We need to process all the financial data into `X` and `y` data arrays that a machine learning algorithm can use. In principle it's not difficult, but there's a lot of subtle cleaning and processing that has to happen, like:
- Converting financial data that are cast in other currencies to USD (you'd think all stocks from the Wilshire 5000 would be in USD already...).
- Removing features that are sparsely filled.
- Filling NaN values with median feature values.
- Taking Year-Over-Year (YOY) features whenever applicable.
- Generating new features that are not in Morningstar, like Debt/Equity, Price/Book, etc.
- Casting 
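
A few of the cleaning steps listed above can be sketched with pandas on a toy table. Everything here is illustrative: the column names, values, the 50% sparsity threshold, and the definition of YOY as the ratio of year $t$ to year $t-1$ are assumptions, not the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy feature table; names and values are made up for illustration.
raw = pd.DataFrame({
    "Revenue":      [100.0, np.nan, 300.0, 400.0],
    "Total Debt":   [50.0, 80.0, np.nan, 120.0],
    "Total Equity": [100.0, 160.0, 200.0, 240.0],
    "Mostly Empty": [np.nan, np.nan, np.nan, 1.0],
}, index=["AAA", "BBB", "CCC", "DDD"])

# 1. Remove sparsely filled features (assumed rule: keep columns at least 50% filled).
min_count = int(np.ceil(0.5 * len(raw)))
dense = raw.dropna(axis=1, thresh=min_count)

# 2. Fill the remaining NaN values with each feature's median.
filled = dense.fillna(dense.median())

# 3. Generate a new feature that is not in the raw data.
filled["D/E Ratio"] = filled["Total Debt"] / filled["Total Equity"]

# 4. Year-over-year feature, taken here as value(t) / value(t-1) (an assumption).
revenue_prev = pd.Series([80.0, 250.0, 300.0, 500.0], index=filled.index)
filled["Revenue YOY"] = filled["Revenue"] / revenue_prev

print(filled)
```

Median filling is used rather than mean filling because financial quantities tend to have heavy-tailed outliers that would drag a mean far from typical values.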

Performing all these tasks leads to a data array that looks like:

```python
X = pd.read_csv('fundamental_analysis/X.csv')
X.head()
```

| | Stock | Asset Turnover | Asset Turnover (Average) | Asset Turnover (Average) YOY | Asset Turnover YOY | Book Value Per Share * USD | Book Value Per Share * USD YOY | Cap Ex as a % of Sales | Cap Spending USD Mil | Cash & Short-Term Investments | ... | Total Stockholders' Equity | Total Stockholders' Equity YOY | Working Capital Ratio | P/E Ratio | P/B Ratio | D/E Ratio | Working Capital Ratio YOY | P/E Ratio YOY | P/B Ratio YOY | D/E Ratio YOY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 0.44 | 0.44 | 0.676923 | 0.676923 | 12.36 | 0.734403 | 2.43 | -98.0 | 26.78 | ... | 55.72 | 1.139002 | 2.258356 | 29.630282 | 2.876726 | 0.794688 | 1.153568 | 1.268805 | 1.279754 | 0.761083 |
| 1 | AAI | 0.38 | 0.38 | 0.974359 | 0.974359 | 11.4 | 1.01806 | 27.02 | -18.0 | 8.04 | ... | 100.0 | 1.546073 | 1.644196 | 0.126457 | 0.011404 | 0.6082 | 0.992956 | 1.116002 | 4.256461 | 0.651388 |
| 2 | AAME | 0.53 | 0.53 | 1.019231 | 1.019231 | 5.05 | 1.018145 | 0.19 | -0.0 | 4.97 | ... | 32.58 | 0.991177 | 1.483239 | 25.086183 | 0.943837 | 2.069368 | 0.995699 | 1.23839 | 1.21632 | 1.01326 |
| 3 | AAN | 1.24 | 1.24 | 0.976378 | 0.976378 | 18.5 | 1.116476 | 1.9 | -61.0 | 1.4 | ... | 51.4 | 1.032129 | 2.057613 | 12.284312 | 1.235071 | 0.945525 | 1.032922 | 0.466568 | 0.719705 | 0.937991 |
| 4 | AAON | 1.54 | 1.54 | 0.968553 | 0.968553 | 3.74 | 1.129909 | 5.85 | -21.0 | 8.74 | ... | 76.84 | 1.029061 | 4.317789 | 27.779792 | 6.239312 | 0.301406 | 1.093696 | 1.047728 | 0.973631 | 0.88851 |


The corresponding target, `y`, is an array of 1s/0s indicating whether each stock did/didn't increase between $t$ and $t+1$.
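
A minimal sketch of how such a target could be built, assuming the mean January prices at $t$ and $t+1$ are held in two pandas Series indexed by ticker. The tickers and prices below are made up, and treating a flat price as "no increase" is an assumption.

```python
import pandas as pd

# Hypothetical mean January prices per stock at years t and t+1
price_t = pd.Series({"AAA": 10.0, "BBB": 25.0, "CCC": 7.5})
price_t1 = pd.Series({"AAA": 12.0, "BBB": 20.0, "CCC": 7.5})

# 1 if the price increased between t and t+1, 0 otherwise (a flat price counts as 0)
y = (price_t1 > price_t).astype(int)
print(y)
```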

### Machine Learning Time

An article by [Forbes](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#32d83d1d6f63) reported that data scientists spend 60% of their time cleaning and organizing data, and this project is no exception. After a lot of scraping, cleaning and preparing, we are finally ready to do some machine learning. Here we will use XGBoost (eXtreme Gradient Boosting), a popular and powerful gradient-boosted decision tree classifier.

Now we split our data into train and test sets, weight the positive class via `scale_pos_weight` to offset any class imbalance, and perform a randomized search over a hyperparameter grid, tuning on the training set with cross-validation:
***
```python
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
scale_pos_weight = len(y_train[y_train==0])/float(len(y_train[y_train==1]))

model = xgb.XGBClassifier(scale_pos_weight=scale_pos_weight)
n_cv = 4        #number of cross validation folds per search
n_iter = 20     #number of RandomizedSearchCV search iterations
param_grid={
    'learning_rate': [0.1],
    'max_depth': [2,4,8,16],
    'min_child_weight': [0.05,0.1,0.2,0.5,1,3],
    'max_delta_step': [0,1,5,10],
    'colsample_bytree': [0.1,0.5,1],
    'gamma': [0,0.2,0.4,0.8],
    'n_estimators':[1000],
}

grid = RandomizedSearchCV(model, param_distributions=param_grid, n_iter=n_iter, cv=n_cv, scoring='roc_auc')
grid.fit(X_train,y_train)
```
***

We can then take the best model and look at the results on the test set:
<img src="fundamental_analysis/results.png" width="700">
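
One common way to score a tuned model on held-out data is ROC AUC, the same metric passed as `scoring` in the search. The sketch below is not the code behind the figure: it uses synthetic data from `make_classification`, and scikit-learn's `GradientBoostingClassifier` stands in for `xgb.XGBClassifier` so that it runs without XGBoost installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data; the real X and y come from the fundamentals pipeline.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Stand-in model and a deliberately tiny search space to keep the sketch fast.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions={"max_depth": [2, 4], "n_estimators": [25, 50]},
    n_iter=3, cv=4, scoring="roc_auc", random_state=42,
)
search.fit(X_train, y_train)

# The search refits the best parameter set on all training data by default;
# score that model on the untouched test set using predicted probabilities.
best_model = search.best_estimator_
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(search.best_params_, round(test_auc, 3))
```

Comparing `search.best_score_` (cross-validated AUC on the training set) with `test_auc` gives a quick check on whether the tuned model generalizes or has overfit the training folds.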

## Improvements
- Perform the analysis on individual sectors rather than the whole Wilshire 5000, which contains all sectors.
- Filter by market cap.
- Cast the problem as a regression problem instead of classification (i.e. how much did the stock increase/decrease by).
- Check robustness across years: these results aren't good for all possible years/combinations.

I believe that the EMH is half true: in a vacuum, fundamental analysis can accurately predict the future performance of a stock, but as these measurements are mixed in with external random and chaotic events, the predictability decreases, and it decreases further as the chaos and randomness increase.