## Introduction   
This notebook generates stock price predictions for a machine learning competition sponsored by Kaggle and Two Sigma, a hedge fund that uses artificial intelligence in its trading strategies.  The aim of the competition was to use financial and news data to predict how roughly 1,800 stocks would perform relative to their benchmarks 10 days into the future.  

Selected financial and news data dating back to 2008 was supplied by the sponsors.   Participants were restricted to using this data; no additional data could be imported. 

Daily predictions for each asset were made by assigning a confidence value indicating whether a stock price was likely to rise.  In this notebook, the model first produces a probability between 0 and 1.  A prediction of 0.99 indicated very strong certainty that a stock would rise in value 10 days out relative to its benchmark.  Similarly, a prediction of 0.01 indicated near certainty that the stock would underperform its benchmark, and a 0.50 prediction meant equal odds of rising or falling.  These probabilities are rescaled to the range -1 to 1 before submission (see the final section).  

Final scores were calculated by summing the products of the confidence values times the actual relative returns for each stock for each day studied, and then dividing by the standard deviation of the daily scores.  In this way, the scoring was similar to the Sharpe ratio, a widely used measure of risk-adjusted financial returns.  
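
The scoring rule described above can be sketched roughly as follows.  This is an illustration, not the official scorer: it assumes the numerator is the mean of the daily scores (as in a Sharpe ratio), and the function name `sharpe_like_score` and the toy arrays are hypothetical.  The toy confidence values are signed, matching the -1 to 1 rescaling applied at the end of this notebook.

```python
import numpy as np


def sharpe_like_score(confidence, returns, day_ids):
    """Mean of the daily scores divided by their standard deviation.

    Assumption: a daily score is the sum over assets of confidence * return.
    """
    confidence = np.asarray(confidence, dtype=float)
    returns = np.asarray(returns, dtype=float)
    day_ids = np.asarray(day_ids)
    daily = np.array([
        np.sum(confidence[day_ids == d] * returns[day_ids == d])
        for d in np.unique(day_ids)
    ])
    std = daily.std()
    # Assumption: return 0 when the daily scores have no spread
    return daily.mean() / std if std > 0 else 0.0


# Two days, two assets per day (made-up numbers)
conf = [1.0, -1.0, 0.5, -0.5]
rets = [0.02, -0.01, 0.04, 0.02]
days = [0, 0, 1, 1]
score = sharpe_like_score(conf, rets, days)
print(score)
```

A confident call in the right direction adds to the daily score, while a confident call in the wrong direction subtracts from it; dividing by the standard deviation rewards consistency across days.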

In the first phase of the contest, prediction models were scored based solely on historical data.  In the second phase of the contest, to take place in early 2019, the models will be re-scored against future outcomes. 

To ensure a level playing field, all models had to be run online on Kaggle kernels that were subject to memory and time restrictions. 

Complete information about the contest, including the data sources, can be found  at:   

https://www.kaggle.com/c/two-sigma-financial-news

## My Approach 
Using the supplied data, I generated a variety of features, including rolling averages of financial returns and news story sentiment.  I used a LightGBM model (a gradient boosted decision tree) as a baseline prediction model and to generate information on feature importances.  I reduced the features further using principal component analysis.  In the end, I found that a neural network model outperformed the LightGBM model as well as various ensemble models.  Accordingly, I used the neural network for final predictions.  Further details are covered below. 

At the end of the first phase of the competition, I ranked in the top 10% of nearly 2,900 teams/participants.  Final results will be based on prediction success against actual stock returns in early 2019.

### Import packages

In [None]:
import numpy as np
import pandas as pd
import datetime as dt
from datetime import datetime
from kaggle.competitions import twosigmanews
import time 
import gc  

### Load Data

In [None]:
# Data available only via this Kernel-only environment 
env = twosigmanews.make_env()
(market_train_df, news_train_df) = env.get_training_data()

In [None]:
# Translate time values to datetime objects for easier filtering 
market_train_df.time=pd.to_datetime(market_train_df['time'], utc=True)
news_train_df.time=pd.to_datetime(news_train_df['time'], utc=True)

In [None]:
# Reduce the number of observations used, due to time/memory constraints 
sample = True
if sample:
    market_train_df = market_train_df.loc[market_train_df.time >= '2009-1-1 00:00:00']
    news_train_df = news_train_df.loc[news_train_df.time >= '2009-1-1 00:00:00']

In [None]:
print(market_train_df.shape)
print(news_train_df.shape)

In [None]:
# Seed dataframes to be used later in final predictions; 
# provides lagging rolling averages leading up to the initial prediction days 
dates=market_train_df.time.unique()
cutoff=dates[-20] # Identifies the last n days' cut-off date
market_sliding_df = market_train_df.loc[market_train_df.time >= cutoff].copy().reset_index(drop=True)
news_sliding_df = news_train_df.loc[news_train_df.time >= cutoff].copy().reset_index(drop=True)

In [None]:
del dates, cutoff

### Prep Market Data

Many different data combinations and transformations were attempted, including different time windows and various moving average convergence/divergence analyses.  In the end, given the time/memory constraints imposed by the contest sponsors and the results of testing feature importances on a LightGBM model, the following features were used.

In [None]:
def prep_market_data(mkt_trn_df, obs=False):
    '''Processes the market data'''
    # Trim the datetime object to just the date
    mkt_trn_df.time = mkt_trn_df.time.dt.date

    # Split the asset code into two parts -- main identifier and the suffix
    asset_parts = mkt_trn_df['assetCode'].str.split('.', n=1, expand=True)
    mkt_trn_df['assetCodeID'] = asset_parts[0]
    mkt_trn_df['assetCodeclass'] = asset_parts[1]

    # Fill NaNs for selected columns; the market_obs df does not have a 'returnsOpenNextMktres10' (target) column, thus the
    # use of the "obs" boolean
    if obs:
        fill_cols = [
            'returnsClosePrevMktres1', 'returnsOpenPrevMktres1',
            'returnsClosePrevMktres10', 'returnsOpenPrevMktres10'
        ]
    else:
        fill_cols = [
            'returnsClosePrevMktres1', 'returnsOpenPrevMktres1',
            'returnsClosePrevMktres10', 'returnsOpenPrevMktres10',
            'returnsOpenNextMktres10'
        ]
    mkt_trn_df[fill_cols] = mkt_trn_df[fill_cols].fillna(0)

    # Get performance sign (positive or negative against benchmark) of backward-looking 1 and 10 day adjusted trend
    mkt_trn_df['returnsOpenPrevMktres1Sign'] = np.sign(
        mkt_trn_df.returnsOpenPrevMktres1)
    mkt_trn_df['returnsOpenPrevMktres10Sign'] = np.sign(
        mkt_trn_df.returnsOpenPrevMktres10)

    # Add rolling stats, basic and exponential; due to memory constraints, using just one time window with final model
    windows = [15]
    for i in range(len(windows)):
        mkt_trn_df[
            'rollingOpenAdjustedMean_' + str(windows[i])] = mkt_trn_df.groupby(
                "assetCode"
            )['returnsOpenPrevMktres1'].apply(
                lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        mkt_trn_df['rollingOpenAdjustedMedian_' + str(
            windows[i]
        )] = mkt_trn_df.groupby("assetCode")['returnsOpenPrevMktres1'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).median())
        mkt_trn_df['rollingOpenAdjustedSTD_' + str(
            windows[i]
        )] = mkt_trn_df.groupby("assetCode")['returnsOpenPrevMktres1'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).std(ddof=0))
        mkt_trn_df['rollingCloseAdjustedMean_' + str(
            windows[i]
        )] = mkt_trn_df.groupby("assetCode")['returnsClosePrevMktres1'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        mkt_trn_df['rollingCloseAdjustedMedian_' + str(
            windows[i]
        )] = mkt_trn_df.groupby("assetCode")['returnsClosePrevMktres1'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).median())
        mkt_trn_df['rollingCloseAdjustedSTD_' + str(
            windows[i]
        )] = mkt_trn_df.groupby("assetCode")['returnsClosePrevMktres1'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).std(ddof=0))
        mkt_trn_df['rollingOpenPrevMktres1Sign_' + str(
            windows[i])] = mkt_trn_df.groupby(
                "assetCode"
            )['returnsOpenPrevMktres1Sign'].apply(
                lambda x: x.rolling(window=windows[i], min_periods=1).sum())
        mkt_trn_df['rollingOpenPrevMktres10Sign_' + str(
            windows[i])] = mkt_trn_df.groupby(
                "assetCode"
            )['returnsOpenPrevMktres10Sign'].apply(
                lambda x: x.rolling(window=windows[i], min_periods=1).sum())
        # exponentially weighted moving averages
        mkt_trn_df['xrollingOpenAdjustedMean_' + str(
            windows[i])] = mkt_trn_df.groupby(
                "assetCode")['returnsOpenPrevMktres1'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
        mkt_trn_df['xrollingCloseAdjustedMean_' + str(
            windows[i])] = mkt_trn_df.groupby(
                "assetCode")['returnsClosePrevMktres1'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
    # Rolling vs daily avg trading volume by asset
    grouped = mkt_trn_df.groupby(['assetCode'])['volume'].apply(
        lambda x: x.rolling(window=20, min_periods=1).mean()).to_frame()
    grouped.rename(columns={'volume': 'meanRolling20Volume'}, inplace=True)
    mkt_trn_df['meanRolling20Volume'] = grouped['meanRolling20Volume']
    mkt_trn_df[
        'dailyVolVSRollingAvg'] = mkt_trn_df.volume / mkt_trn_df.meanRolling20Volume
    # Filling small number of observations that had a zero divisor with 1
    mkt_trn_df.dailyVolVSRollingAvg.fillna(1, inplace=True)

    return mkt_trn_df
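
To make the rolling features concrete, here is a small sketch on toy data.  It uses `groupby(...).transform(...)`, a swapped-in alternative to the `apply` calls above that keeps the result aligned with the original rows.  The frame `toy`, its values, and the window of 2 are made up for illustration.

```python
import pandas as pd

# Toy market data: two assets, rows already in date order within each asset
toy = pd.DataFrame({
    'assetCode': ['A.N', 'A.N', 'A.N', 'B.O', 'B.O'],
    'returnsOpenPrevMktres1': [0.01, 0.03, -0.01, 0.02, 0.04],
})

window = 2
by_asset = toy.groupby('assetCode')['returnsOpenPrevMktres1']

# Simple rolling mean, computed separately for each asset
toy['rollingOpenAdjustedMean_2'] = by_asset.transform(
    lambda x: x.rolling(window=window, min_periods=1).mean())

# Exponentially weighted mean, which weights recent days more heavily
toy['xrollingOpenAdjustedMean_2'] = by_asset.transform(
    lambda x: x.ewm(span=window, min_periods=1).mean())

print(toy)
```

Because `min_periods=1`, the first row of each asset simply repeats that day's value rather than producing a NaN.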

In [None]:
# Prep the market data 
market_train_df=prep_market_data(market_train_df)
market_train_df.shape

In [None]:
market_train_df.tail(2)

In [None]:
gc.collect()

### Prep news data 

Features derived from news data had a far smaller impact on model performance than the financial data.  Other contest participants echoed this finding in online discussions.  After experimentation with many different feature options, the most important features aligned with the findings of a 2015 paper entitled ["Novel and Topical Business News and Their Impact on Stock Market Activity"](https://link.springer.com/article/10.1140/epjds/s13688-017-0123-7).  This paper found that only novel and unanticipated news had a meaningful impact on stock price performance.  My features likewise tended to focus on "flash" news stories, rather than general company commentary.  The only keywords that I found meaningful in terms of model performance were "upgrade" and "downgrade."    

In [None]:
# Utility function
def get_assetCodeID(row):
    '''For each row, removes the braces and extra quotation marks, and returns the ID (the part before the first ".") of the first asset code'''
    return row.replace("{", '').replace("}", '').replace("'", '').split(
        ".", 1)[0]

In [None]:
from sklearn import preprocessing


def prep_news_data(news_trn_df):
    '''Processes the news data'''
    # Trim the datetime object to just the date
    date_cols = ['time']
    for col in date_cols:
        news_trn_df[col] = news_trn_df[col].dt.date

    # Select only Reuters stories; breaking alerts (not general stories and articles); and only where company
    # is in first sentence (ie, the clear subject)
    news_trn_df = news_trn_df.loc[
        (news_trn_df.provider == 'RTRS') & (news_trn_df.urgency <= 2)
        & (news_trn_df.firstMentionSentence == 1)].copy()

    # Split the asset code
    news_trn_df['assetCodeID'] = news_trn_df['assetCodes'].apply(
        get_assetCodeID)

    # Series of steps to identify key topics (e.g., upgrades and downgrades)
    # Group versions of articles that were actually updates of one article
    # Create a label (category) encoder object
    le = preprocessing.LabelEncoder()
    # Fit the encoder to the pandas column
    le.fit(news_trn_df.sourceId)
    # Apply the fitted encoder to the pandas column
    news_trn_df.sourceId = le.transform(news_trn_df.sourceId)
    #  Group updated articles as one
    news_trn_df = news_trn_df.groupby('sourceId', as_index=False).tail(1)
    # Identify key topics
    keywords = [['UPGRADE'], ['DOWNGRADE']]
    col = 'headline'

    # Put within broader function to be able to use local variables
    def topic_check(row):
        """Assigns result to topic column; 1 for yes (keyword appears in headline), 0 for no"""
        if any([word in row[col] for word in keywords[i]]):
            return 1
        else:
            return 0

    # Loop through the list of keywords
    for i in range(len(keywords)):
        news_trn_df['headline_' + keywords[i][0]] = news_trn_df.apply(
            topic_check, axis=1)

    # Remove unused columns
    droplist = [
        'sourceTimestamp', 'firstCreated', 'sourceId', 'headline',
        'takeSequence', 'assetName', 'assetCodes', 'wordCount',
        'sentimentWordCount'
    ]
    news_trn_df.drop(droplist, axis=1, inplace=True)

    # Group news by assetCodeID and date and generate summary stats
    grouped = news_trn_df.groupby(['time', 'assetCodeID'])
    # Dictionary with aggregations
    d = {
        'urgency':
        ['count'],  # will use the count for article count (see further below)
        'sentimentClass': ['mean', 'sum'],
        'sentimentNegative': ['mean', 'sum'],
        'sentimentNeutral': ['mean', 'sum'],
        'sentimentPositive': ['mean', 'sum'],
        'noveltyCount24H': ['mean', 'sum'],
        'noveltyCount7D': ['mean', 'sum'],
        'volumeCounts24H': ['mean', 'sum'],
        'volumeCounts7D': ['mean', 'sum'],
        'headline_UPGRADE': ['sum'],
        'headline_DOWNGRADE': ['sum']
    }
    rev = grouped.agg(d).reset_index()
    rev.columns = ['_'.join(col) for col in rev.columns.values]
    rev.rename(
        columns={
            "urgency_count": 'articleCountDay',
            'time_': 'time',
            'assetCodeID_': 'assetCodeID'
        },
        inplace=True)  # check column name for assetCodeID

    return rev
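
The keyword flags and the per-day aggregation can be illustrated on a few made-up headlines.  This sketch swaps in the vectorized `str.contains` for the row-wise `topic_check` helper; the frame `toy_news` and its contents are hypothetical.

```python
import pandas as pd

# Made-up alerts for one hypothetical company over two days
toy_news = pd.DataFrame({
    'time': ['2016-01-04', '2016-01-04', '2016-01-05'],
    'assetCodeID': ['XYZ', 'XYZ', 'XYZ'],
    'headline': [
        'BROKER UPGRADES XYZ CORP TO BUY',
        'XYZ CORP Q4 PROFIT BEATS ESTIMATES',
        'BROKER DOWNGRADES XYZ CORP TO HOLD',
    ],
    'sentimentPositive': [0.8, 0.6, 0.1],
})

# 1 if the keyword appears in the headline, else 0
for keyword in ['UPGRADE', 'DOWNGRADE']:
    toy_news['headline_' + keyword] = toy_news['headline'].str.contains(
        keyword, regex=False).astype(int)

# One row per (date, asset) with summary statistics
daily = toy_news.groupby(['time', 'assetCodeID']).agg({
    'headline': ['count'],
    'sentimentPositive': ['mean', 'sum'],
    'headline_UPGRADE': ['sum'],
    'headline_DOWNGRADE': ['sum'],
}).reset_index()
# Flatten the two-level column names, e.g. ('sentimentPositive', 'mean')
daily.columns = ['_'.join(c).rstrip('_') for c in daily.columns.values]
print(daily)
```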

In [None]:
# Prep news data 
news_train_df=prep_news_data(news_train_df)
news_train_df.shape

### Join Market and News Data 

In [None]:
def prep_combined(mkt_trn_df, news_trn_df):
    '''Combines the market and news data and creates new combined data features'''
    # Merge the dataframes
    combined_df = mkt_trn_df.merge(
        news_trn_df, on=['time', 'assetCodeID'], how='left')

    # Perform rolling average/exp average calculations
    windows = [15]
    for i in range(len(windows)):
        combined_df['rollingArticleCountDayMean_' + str(
            windows[i]
        )] = combined_df.groupby("assetCodeID")['articleCountDay'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        combined_df['rollingsentimentClass_sum_' + str(
            windows[i]
        )] = combined_df.groupby("assetCodeID")['sentimentClass_sum'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        combined_df['rollingsentimentNegative_sum_' + str(
            windows[i]
        )] = combined_df.groupby("assetCodeID")['sentimentNegative_sum'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        combined_df['rollingsentimentPositive_sum_' + str(
            windows[i]
        )] = combined_df.groupby("assetCodeID")['sentimentPositive_sum'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        combined_df['rollingnoveltyCount24H_sum_' + str(
            windows[i]
        )] = combined_df.groupby("assetCodeID")['noveltyCount24H_sum'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        combined_df['rollingnoveltyCount7D_sum_' + str(
            windows[i]
        )] = combined_df.groupby("assetCodeID")['noveltyCount7D_sum'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        combined_df['rollingvolumeCounts24H_sum_' + str(
            windows[i]
        )] = combined_df.groupby("assetCodeID")['volumeCounts24H_sum'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        combined_df['rollingvolumeCounts7D_sum_' + str(
            windows[i]
        )] = combined_df.groupby("assetCodeID")['volumeCounts7D_sum'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        combined_df['rollingheadlineUpgrade_meanofsum_' + str(
            windows[i]
        )] = combined_df.groupby("assetCodeID")['headline_UPGRADE_sum'].apply(
            lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        combined_df['rollingheadlineDowngrade_meanofsum_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID"
            )['headline_DOWNGRADE_sum'].apply(
                lambda x: x.rolling(window=windows[i], min_periods=1).mean())
        # exponential rolling average
        combined_df['xrollingArticleCountDayMean_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID")['articleCountDay'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
        combined_df['xrollingsentimentClass_sum_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID")['sentimentClass_sum'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
        combined_df['xrollingsentimentNegative_sum_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID")['sentimentNegative_sum'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
        combined_df['xrollingsentimentPositive_sum_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID")['sentimentPositive_sum'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
        combined_df['xrollingnoveltyCount24H_sum_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID")['noveltyCount24H_sum'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
        combined_df['xrollingnoveltyCount7D_sum_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID")['noveltyCount7D_sum'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
        combined_df['xrollingvolumeCounts24H_sum_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID")['volumeCounts24H_sum'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
        combined_df['xrollingvolumeCounts7D_sum_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID")['volumeCounts7D_sum'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
        combined_df['xrollingheadlineUpgrade_meanofsum_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID")['headline_UPGRADE_sum'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())
        combined_df['xrollingheadlineDowngrade_meanofsum_' + str(
            windows[i])] = combined_df.groupby(
                "assetCodeID")['headline_DOWNGRADE_sum'].apply(
                    lambda x: x.ewm(span=windows[i], min_periods=1).mean())

    # Fill NaNs
    non_cat_cols = [i for i in combined_df.columns if i != 'assetName']
    for col in non_cat_cols:
        combined_df[col].fillna(value=0, inplace=True)

    return combined_df

In [None]:
# Join news and market data  
combined_df=prep_combined(market_train_df, news_train_df)
combined_df.shape

In [None]:
combined_df.tail(2)

In [None]:
del news_train_df, market_train_df

In [None]:
gc.collect()

### Split into train/test datasets

In [None]:
# For classifying if stock went up or not(T/F)
up = combined_df.returnsOpenNextMktres10 >= 0
# List of features
fcol = [
    c for c in combined_df.columns
    if c not in [
        'time', 'assetCode', 'assetName', 'assetCodeID', 'assetCodeclass',
        'volume', 'close', 'open', 'universe', 'meanRolling20Volume',
        'returnsOpenNextMktres10', 'provider', 'urgency'
    ]
]

In [None]:
X = combined_df[fcol].values
up = up.values  # True or False
r = combined_df.returnsOpenNextMktres10.values  # Actual return values, not just up/down
# Check 
assert X.shape[0] == up.shape[0] == r.shape[0]

In [None]:
# Scale the X values to 0-1
mins = np.min(X, axis=0)
maxs = np.max(X, axis=0)
rng = maxs - mins
X = 1 - ((maxs - X) / rng)
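
The expression above is algebraically the same as `(X - mins) / rng`.  One caveat is that a column with a constant value has a range of zero and would produce NaNs.  The sketch below shows one way to guard against that; mapping constant columns to 0 is an assumption made here, and the helper names `minmax_fit` and `minmax_apply` are hypothetical.

```python
import numpy as np


def minmax_fit(X):
    """Return per-column minimums and ranges, with zero ranges replaced by 1."""
    mins = X.min(axis=0)
    rng = X.max(axis=0) - mins
    safe_rng = np.where(rng == 0, 1.0, rng)  # avoids division by zero
    return mins, safe_rng


def minmax_apply(X, mins, safe_rng):
    """Scale X with statistics learned from the training data."""
    return (X - mins) / safe_rng


X_toy = np.array([[1.0, 5.0, 3.0],
                  [2.0, 5.0, 7.0],
                  [3.0, 5.0, 5.0]])
mins_toy, rng_toy = minmax_fit(X_toy)
X_scaled = minmax_apply(X_toy, mins_toy, rng_toy)
print(X_scaled)
```

When the same statistics are reused on later data, as in the prediction loop at the end of this notebook, scaled values can fall outside 0-1 if a new observation lies outside the training range.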

In [None]:
from sklearn.decomposition import PCA
# Use to select the optimal number of pc's
pca = PCA(n_components=35)
pca.fit(X)
# The amount of variance that each PC explains
var= pca.explained_variance_ratio_
print(var)
# Cumulative variance explained (in percent)
var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
print(var1)

In [None]:
# Create the principal components
N_COMP=15  # Use number as determined above 
pca = PCA(n_components=N_COMP)
pca.fit(X)
X1=pca.transform(X) 
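
The number of components above was chosen by inspecting the cumulative variance printout.  As a rough sketch of how that choice could be automated, the NumPy-only example below picks the smallest number of components that reaches a target share of variance.  The 95% target, the helper name `n_components_for`, and the synthetic data are assumptions for illustration; a singular value decomposition stands in for scikit-learn's `PCA` here.

```python
import numpy as np


def n_components_for(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    Xc = X - X.mean(axis=0)                    # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    ratio = s ** 2 / np.sum(s ** 2)            # explained-variance ratios
    cum = np.cumsum(ratio)
    n = int(np.searchsorted(cum, target)) + 1  # first index reaching target
    return min(n, len(cum)), cum


# Synthetic data: 5 observed columns driven by 2 latent factors plus small noise
gen = np.random.default_rng(0)
latent = gen.normal(size=(200, 2))
W = np.array([[1.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0, 0.0]])
X_toy = latent @ W + 0.01 * gen.normal(size=(200, 5))

n_comp, cum_var = n_components_for(X_toy, target=0.95)
print(n_comp, np.round(cum_var, 4))
```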

In [None]:
# Split train/test to avoid any data leakage related to the availability of the leading 10-day target variable;
# separate datasets by date, and leave a gap between train and test
test_slice = combined_df.loc[combined_df.time >= dt.date(
    year=2016, month=7, day=1)].shape[0]  # Last n mos of data
dropout_slice = combined_df.loc[combined_df.time >= dt.date(
    year=2016, month=6,
    day=1)].shape[0] - test_slice  # One month drop-out interval
train_slice = X1.shape[0] - test_slice - dropout_slice  # Everything up to dropout interval
X_train = X1[:train_slice]
X_test = X1[-test_slice:]
up_train = up[:train_slice]
up_test = up[-test_slice:]
r_train = r[:train_slice]
r_test = r[-test_slice:]
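
The idea behind the gap is that a training row dated just before the test period has a 10-day-ahead target that overlaps with the test period.  The sketch below shows the same kind of split using boolean masks on the dates, which does not rely on the rows being sorted by time.  The helper name `split_with_gap` and the toy dates are hypothetical.

```python
import datetime as dt

import numpy as np


def split_with_gap(dates, gap_start, test_start):
    """Return (train_idx, test_idx); rows dated in [gap_start, test_start)
    are dropped so that forward-looking targets cannot leak into the test set."""
    is_train = np.array([d < gap_start for d in dates])
    is_test = np.array([d >= test_start for d in dates])
    return np.flatnonzero(is_train), np.flatnonzero(is_test)


# 40 consecutive toy dates starting 2016-05-30
first_day = dt.date(2016, 5, 30)
toy_dates = [first_day + dt.timedelta(days=i) for i in range(40)]

train_idx, test_idx = split_with_gap(
    toy_dates, gap_start=dt.date(2016, 6, 1), test_start=dt.date(2016, 7, 1))
print(len(train_idx), len(test_idx))
```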

In [None]:
del combined_df

### Neural Network Model

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras import regularizers
from keras.wrappers.scikit_learn import KerasClassifier
from keras import optimizers
from keras.regularizers import l2, l1
from sklearn.metrics import mean_absolute_error, log_loss

In [None]:
Xs=X_train
ys=up_train.astype(int) # Translate T/F into 1/0
Xs_val=X_test
ys_val=up_test.astype(int)

In [None]:
del X_test, X_train, r, r_test, r_train, up, up_train, up_test

In [None]:
# Use for optimizing the neural network
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
callbacks_list = [
    EarlyStopping(
        patience=10, verbose=True, monitor='val_mean_absolute_error'),
    ModelCheckpoint('model.hdf5', verbose=True, save_best_only=True)
]

In [None]:
def build_model():
    '''Builds a keras/neural network model'''
    model = Sequential()
    model.add(Dense(N_COMP, input_dim=Xs.shape[1],
                    activation='relu')) 
    model.add(Dense(48, activation='relu'))
    model.add(Dense(48, activation='relu'))
    model.add(Dense(48, activation='relu'))
    model.add(Dense(48, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adadelta', metrics=['mae']) 
    return model

In [None]:
## Train model
seed = 7
BATCH_SIZE = 128  # 32 is the default
EPOCHS = 200
np.random.seed(seed)

model = build_model()
history = model.fit(
    Xs,
    ys,
    epochs=EPOCHS,
    callbacks=callbacks_list,
    batch_size=BATCH_SIZE,
    validation_data=(Xs_val, ys_val))

In [None]:
# Plot training and validation log loss 
import matplotlib.pyplot as plt
%matplotlib inline

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training log loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation log loss')
plt.title('Training and validation log loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
# Retrieve train MAE (mean absolute error) and test MAE
mae = history_dict['mean_absolute_error']
val_mae = history_dict['val_mean_absolute_error']
pd.options.display.max_rows = 999
val_mae=[round(elem, 4) for elem in val_mae]
val_mae_df=pd.DataFrame(val_mae)
val_mae_df

In [None]:
# Plot train MAE and test MAE
plt.clf()
plt.plot(epochs, mae, 'bo', label='Training MAE')
plt.plot(epochs, val_mae, 'b', label='Validation MAE')
plt.title('Training and validation MAE')
plt.xlabel('Epochs')
plt.ylabel('MAE')
plt.legend()
plt.show()

In [None]:
# Plot validation MAE by epoch
plt.plot(range(1, len(loss_values) + 1), val_mae)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

In [None]:
# Plot smoothed test MAE points 
def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor  + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

# To exclude the first n epochs from the plot, pass a slice instead, e.g. val_mae[10:]
smooth_mae_history = smooth_curve(val_mae)

plt.plot(smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Smoothed Validation MAE')
plt.show()

In [None]:
# Create dataframe with smoothed points 
smval_mae=[round(elem, 4) for elem in smooth_mae_history]
smval_mae_df=pd.DataFrame(smval_mae)
smval_mae_df

In [None]:
model.load_weights('model.hdf5')

The model below proved best after experimentation with deeper and shallower networks and with dropout and L1/L2 regularization. 

In [None]:
seed = 7
BATCH_SIZE = 128  # 32 is the default
EPOCHS = 85
np.random.seed(seed)

model = build_model()
model.fit(Xs, ys, epochs=EPOCHS, batch_size=BATCH_SIZE) 

### Make and Submit Predictions

In [None]:
# Delete columns not used in market_obs df and add a column for identifying which rows will be used for prediction
market_sliding_df.drop(columns=['universe','returnsOpenNextMktres10'], inplace=True)
market_sliding_df['makePrediction']=0

In [None]:
# Use for limiting the size of the sliding prediction dfs(due to memory constraints)
market_sliding_dflimit=market_sliding_df.shape[0] 
news_sliding_dflimit=news_sliding_df.shape[0] 

In [None]:
# Make and submit prediction within Kaggle environment
days = env.get_prediction_days()
n_days = 0

# Loop through the generator object, which delivers a day's worth of market and news data, plus a prediction template
for (market_obs_df, news_obs_df, predictions_template_df) in days:
    n_days += 1
    print("Processing day: ", n_days)

    # Pre-prep the data
    market_obs_df.time = pd.to_datetime(market_obs_df['time'], utc=True)
    market_obs_df['makePrediction'] = 1
    market_sliding_df = pd.concat(
        [market_sliding_df, market_obs_df], ignore_index=True, sort=True)
    market_sliding_df_copy = market_sliding_df.copy()

    news_obs_df.time = pd.to_datetime(news_obs_df['time'], utc=True)
    news_sliding_df = pd.concat(
        [news_sliding_df, news_obs_df], ignore_index=True, sort=True)
    news_sliding_df_copy = news_sliding_df.copy()

    # Prep and combine data
    market_sliding_prepped_df = prep_market_data(
        market_sliding_df_copy,
        obs=True)  # Need to use prep without the target value
    news_sliding_prepped_df = prep_news_data(news_sliding_df_copy)
    combined_sliding_df = prep_combined(market_sliding_prepped_df,
                                        news_sliding_prepped_df)

    # Extract the current day's rows for prediction
    combined_sliding_df = combined_sliding_df.loc[
        combined_sliding_df.makePrediction == 1].copy()

    # Use only assetCodes in the prediction template
    combined_obs_df = combined_sliding_df[combined_sliding_df.assetCode.isin(
        predictions_template_df.assetCode)].copy()

    # Limit the sliding dfs to their original size (for memory purposes) and prep them for the next round
    market_sliding_df = market_sliding_df.tail(market_sliding_dflimit).copy()
    news_sliding_df = news_sliding_df.tail(news_sliding_dflimit).copy()
    market_sliding_df['makePrediction'] = 0

    # X data for each day
    X_live = combined_obs_df[fcol].values
    # Scale X data (based on earlier range values)
    X_live = 1 - ((maxs - X_live) / rng)
    # PCA-transformed
    X_live = pca.transform(X_live)

    # Make predictions
    lp = model.predict(X_live).flatten()

    # assign the probabilities to a confidence variable
    confidence = lp
    # Normalize confidence from -1 to 1
    confidence = (confidence - confidence.min()) / (
        confidence.max() - confidence.min())
    confidence = confidence * 2 - 1
    # Put data into the predictions df
    preds = pd.DataFrame({
        'assetCode': combined_obs_df['assetCode'],
        'confidence': confidence
    })
    predictions_template_df = predictions_template_df.merge(
        preds, how='left').drop(
            'confidenceValue',
            axis=1).fillna(0).rename(columns={
                'confidence': 'confidenceValue'
            })

    # Adjustments to predictions
    # Mean center (as benchmarks center to 0 each day)
    predictions_template_df['confidenceValue'] = (
        predictions_template_df.confidenceValue -
        predictions_template_df.confidenceValue.mean())
    # Clip values in case mean centering shifted values above/below 1 or -1
    predictions_template_df['confidenceValue'] = (
        predictions_template_df.confidenceValue.clip(-1.0, 1.0))

    env.predict(predictions_template_df)

env.write_submission_file()
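
The post-processing inside the loop (rescale to -1 to 1, mean center, clip) can be sketched in isolation as follows.  This is a simplified illustration: the helper name `to_confidence` is hypothetical, returning zeros when all probabilities are identical is an assumption made here to avoid a division by zero, and the notebook itself mean centers only after merging into the prediction template.

```python
import numpy as np


def to_confidence(probs):
    """Map model probabilities to confidence values in [-1, 1]."""
    probs = np.asarray(probs, dtype=float)
    spread = probs.max() - probs.min()
    if spread == 0:
        # Assumption: no ranking information, so express no view
        return np.zeros_like(probs)
    conf = (probs - probs.min()) / spread  # rescale to 0..1
    conf = conf * 2 - 1                    # rescale to -1..1
    conf = conf - conf.mean()              # mean center for the day
    return np.clip(conf, -1.0, 1.0)        # keep within the allowed range


conf_toy = to_confidence([0.2, 0.5, 0.6, 0.9])
print(conf_toy)
```

Note that the min-max step turns the day's predictions into a relative ranking: the most bullish asset always receives the highest confidence for the day, regardless of how high its raw probability was.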