# Neural Net vs. Goldman
Can an LSTM neural net trained on fundamentals extracted from EDGAR XBRL filings for the S&P 500 pick the same long and short list as Goldman's hedge fund meta list? (Apologies to my high school English teacher for the run-on sentence.)

## Data Sources
Download price and fundamental data for the S&P 500 using [pystock-crawler](https://github.com/eliangcs/pystock-crawler):  

`pystock-crawler reports ../tickers.csv -o ../reports.csv -l ../reports.log`  
`pystock-crawler prices ../tickers.csv -o ../prices.csv -l ../prices.log`

REMIND: Use `pystock-crawler symbols` to get symbols for training input...

X = 4 quarters of fundamental data, plus whether the stock was up or down from the prior quarter  
y = whether the stock was up or down x days after the period covered by X
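
The X/y definition above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: `make_windows`, `up_or_down`, and every number are hypothetical, and each report is reduced to a (revenue, close) pair.

```python
def make_windows(quarters, window_size=4, step=1):
    """Return every run of `window_size` consecutive quarters, `step` apart."""
    n_windows = len(quarters) - window_size + 1
    return [quarters[i:i + window_size] for i in range(0, n_windows, step)]


def up_or_down(close_at_report, close_later):
    """Label 1 if the price rose after the report, else 0."""
    return 1 if close_later > close_at_report else 0


# Hypothetical data: six quarterly reports, each reduced to (revenue, close).
quarters = [(10.0, 20.0), (11.0, 21.0), (12.0, 19.0),
            (13.0, 22.0), (14.0, 25.0), (15.0, 24.0)]
# Hypothetical closing prices some days after each of the six reports.
later_close = [20.5, 20.0, 19.5, 23.0, 24.0, 26.0]

window_size = 4
X = make_windows(quarters, window_size)
# Each window is labelled by what happened after its LAST report.
y = [up_or_down(window[-1][1], later_close[start + window_size - 1])
     for start, window in enumerate(X)]
```

Six reports with a window of four give three overlapping windows, each labelled by the price move that follows its final quarter.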

In [1]:
import pandas
prices = pandas.read_csv('prices.csv', parse_dates=['date'], index_col=1)
reports = pandas.read_csv('reports.csv', parse_dates=['date'], index_col=1)
reports = reports[reports.amend == False]
symbols = pandas.read_csv('symbols.csv').sort('symbol')

In [2]:
def features(symbol, window_size=4, overlap=1):
    num_windows = len(reports[reports.symbol == symbol]) - window_size + 1
    # Create a set of sequences of reports returning X, y
    r = reports[reports.symbol == symbol].sort(ascending=True)
    p = prices[prices.symbol == symbol].sort(ascending=True)
    print "Found", len(r), "reports for", symbol

    # Add closing stock price to each report, taken from the first
    # trading day on or after the report date
    r['close'] = r.index.map(lambda x: p.ix[p.index[p.index.searchsorted(x)]]['close'])

    # Fixup annual 10-k numbers by subtracting the prior 3 quarters
    # REMIND: Go back and verify the adjustments
    for c in ['revenues', 'op_income', 'net_income',
              'eps_basic', 'eps_diluted', 'dividend',
              'cash_flow_op', 'cash_flow_inv', 'cash_flow_fin']:
        r[c + '_adj'] = r[c] - r[c].shift(1) - r[c].shift(2) - r[c].shift(3)
        r.ix[r.period_focus == 'FY', c] = r[r.period_focus == 'FY'][c + '_adj']
        
    # Keep only the numeric columns: drop the first 5 (non-numeric metadata)
    # and the last 9 (the temporary _adj columns added above)
    r = r.ix[:,5:-9]
    
    # Divide into overlapping windows; `overlap` is the step between window starts
    X = [r[i:i + window_size] for i in range(0, num_windows, overlap)]

    # Calculate the fractional rise in stock price days_after trading days
    # after the last report in the window (positive = up, negative = down)
    days_after = 10
    y = [p.ix[p.index[p.index.searchsorted(x.index[-1]) + days_after]].close / x.iloc[-1].close - 1 for x in X]
    
    return [x.values for x in X], y
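
The 10-K fixup in `features` can be illustrated in isolation. The sketch below is a simplified, hypothetical version (`quarterize` and the revenue figures are not from the notebook). It assumes each 10-Q value covers a single quarter, which is worth verifying for cash-flow items, since those are often reported year-to-date.

```python
def quarterize(reports, column):
    """Return per-quarter values, converting FY totals to fourth-quarter values.

    `reports` is a chronologically sorted list of dicts with a
    'period_focus' key ('Q1', 'Q2', 'Q3' or 'FY') and a numeric `column`.
    """
    values = []
    for i, report in enumerate(reports):
        value = report[column]
        if report['period_focus'] == 'FY':
            prior = values[max(i - 3, 0):i]
            if len(prior) < 3 or None in prior:
                # Not enough history to back out the fourth quarter.
                value = None
            else:
                # Annual total minus the three preceding quarters.
                value = value - sum(prior)
        values.append(value)
    return values


# Hypothetical revenue figures: three quarters, the annual total, then Q1.
reports = [
    {'period_focus': 'Q1', 'revenues': 100.0},
    {'period_focus': 'Q2', 'revenues': 110.0},
    {'period_focus': 'Q3', 'revenues': 120.0},
    {'period_focus': 'FY', 'revenues': 450.0},
    {'period_focus': 'Q1', 'revenues': 130.0},
]
quarterly = quarterize(reports, 'revenues')
```

Here the annual 450.0 becomes a fourth-quarter 120.0. An FY report with fewer than three earlier quarters is marked `None` instead of being adjusted with missing data.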

In [3]:
data = {s: features(s, window_size=4, overlap=1) 
        for s in symbols['symbol'][0:200:10] if len(reports[reports.symbol == s]) >= 4}
print "Generated features vectors for", len(data),"stocks"

Found 22 reports for A
Found 25 reports for ADBE
Found 24 reports for AET
Found 25 reports for AMAT
Found 24 reports for APA
Found 22 reports for AZO
Found 24 reports for BWA
Found 24 reports for CBS
Found 20 reports for CINF
Found 25 reports for CNX
Found 23 reports for CSCO
Found 20 reports for DAL
Found 25 reports for DTV
Found 19 reports for EL
Found 13 reports for ESRX
Found 24 reports for F
Found 24 reports for FITB
Found 14 reports for GAS
Generated features vectors for 18 stocks


In [4]:
# Split into train/test sets
# Should use sklearn.cross_validation.StratifiedShuffleSplit to try and maintain industry sector % in each
# Or bin by financial size http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/
# s = XY['symbol'].unique()
from sklearn.cross_validation import train_test_split
train_symbols, test_symbols = train_test_split(data.keys(), test_size = 0.2)
print "num train symbols:", len(train_symbols), "num test symbols:", len(test_symbols)
X_train = [data[s][0] for s in train_symbols]
y_train = [data[s][1] for s in train_symbols]
X_test = [data[s][0] for s in test_symbols]
y_test = [data[s][1] for s in test_symbols]
print "test/train", 1.0 * len(X_test)/len(X_train)

num train symbols: 14 num test symbols: 4
test/train 0.285714285714
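
Splitting by symbol, as above, keeps every overlapping window of a given stock on one side of the split. A dependency-free sketch of the same idea follows; `split_symbols` and its seed are hypothetical and it is not a reimplementation of `train_test_split`. Rounding the test count up is consistent with the 14/4 split printed above.

```python
import math
import random


def split_symbols(symbols, test_size=0.2, seed=0):
    """Shuffle symbols reproducibly and return (train_symbols, test_symbols)."""
    shuffled = sorted(symbols)
    random.Random(seed).shuffle(shuffled)
    # Round the test count up so small symbol lists still get a test set.
    n_test = int(math.ceil(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# The 18 symbols listed in the output of the feature-generation cell.
symbols = ['A', 'ADBE', 'AET', 'AMAT', 'APA', 'AZO', 'BWA', 'CBS', 'CINF',
           'CNX', 'CSCO', 'DAL', 'DTV', 'EL', 'ESRX', 'F', 'FITB', 'GAS']
train_symbols, test_symbols = split_symbols(symbols)
```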


In [11]:
def flatten(l):
    return [item for sublist in l for item in sublist]
[flatten(x) for x in flatten(X_test)][0]

[3006300000.0,
 242200000.0,
 15400000.0,
 0.02,
 0.02,
 0.050000000000000003,
 26025900000.0,
 4711100000.0,
 4136900000.0,
 341500000.0,
 8610100000.0,
 395300000.0,
 -185800000.0,
 -287500000.0,
 6.9199999999999999,
 3350000000.0,
 418200000.0,
 207600000.0,
 0.31,
 0.29999999999999999,
 0.050000000000000003,
 26560500000.0,
 5101300000.0,
 4220600000.0,
 473800000.0,
 8846200000.0,
 567900000.0,
 -185500000.0,
 -328100000.0,
 12.050000000000001,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 26962000000.0,
 5636900000.0,
 4746500000.0,
 716700000.0,
 9019400000.0,
 nan,
 nan,
 nan,
 14.050000000000001,
 3530900000.0,
 153400000.0,
 -26200000.0,
 -0.040000000000000001,
 -0.040000000000000001,
 0.050000000000000003,
 26756100000.0,
 5705200000.0,
 4712300000.0,
 872700000.0,
 9046100000.0,
 700700000.0,
 -73600000.0,
 -471100000.0,
 13.94]
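
The flattened window above contains `nan` entries for one quarter. One possible way to handle them before training is to drop any window that contains a missing value. This is only a sketch: `has_nan` and `drop_nan_windows` are hypothetical helpers, and imputing the gaps instead may be preferable.

```python
import math


def has_nan(window):
    """True if any value in a window (a list of per-quarter rows) is NaN."""
    return any(math.isnan(value) for row in window for value in row)


def drop_nan_windows(X, y):
    """Keep only the (window, target) pairs whose window has no NaN."""
    kept = [(w, t) for w, t in zip(X, y) if not has_nan(w)]
    return [w for w, _ in kept], [t for _, t in kept]


# Hypothetical data: two windows of two quarters each; the second has a gap.
nan = float('nan')
X = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, nan], [7.0, 8.0]]]
y = [0.02, -0.01]
X_clean, y_clean = drop_nan_windows(X, y)
```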

In [12]:
import numpy
with open("train_test.npz", "wb") as f:
    numpy.savez(f, X_train=flatten(X_train), y_train=flatten(y_train), X_test=flatten(X_test), y_test=flatten(y_test))
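
To confirm what `numpy.savez` stores, the archive can be read back with `numpy.load` and its shapes checked. The sketch below uses small placeholder arrays and an in-memory buffer rather than `train_test.npz`; the window count and column count are assumptions chosen for illustration.

```python
import io

import numpy

# Placeholder data: 3 windows of 4 quarters with 2 features each.
X_train = [numpy.zeros((4, 2)) for _ in range(3)]
y_train = [0.01, -0.02, 0.03]

# Write to an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
numpy.savez(buf, X_train=X_train, y_train=y_train)
buf.seek(0)

loaded = numpy.load(buf)
X_loaded = loaded['X_train']  # shape: (windows, quarters, features)
y_loaded = loaded['y_train']  # shape: (windows,)
```

A list of equally shaped windows is stored as a single 3-D array, which is the (samples, timesteps, features) layout recurrent layers typically expect.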