# Project: predict stock return movements based on news data


The goal of this project is to predict how confident we are in a positive or negative return of an asset. The predicted value must lie in the interval [-1, 1]: a large negative value indicates that we are very confident in a negative return, and a large positive value indicates that we are very confident in a positive return. Therefore the models predict the return, and the prediction is then clipped into the required interval.
<br>
<br>
This notebook contains predictive models based on random forests and some experiments with XGBoost.
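
The clipping step can be sketched as follows. The array values are made-up raw return predictions, used only for illustration.

```python
import numpy as np

# Hypothetical raw return predictions from a regression model.
raw_predictions = np.array([-1.7, -0.03, 0.0, 0.25, 3.2])

# Clip into the required confidence interval [-1, 1].
confidence = np.clip(raw_predictions, -1, 1)
print(confidence)
```

Values already inside the interval are left unchanged; only the two extreme predictions are moved to the boundary.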

## Set up an environment

In [11]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
print(os.listdir("../input"))

['marketdata_sample.csv', 'news_sample.csv']


In [12]:
# Custom module kaggle.competitions.twosigmanews
from kaggle.competitions import twosigmanews
try:
    env = twosigmanews.make_env()
except Exception as e:
    print("Error")
    pass

Error


In [13]:
# Import data
(market_train_df, news_train_df) = env.get_training_data()
market_train_df['date'] = market_train_df['time'].dt.strftime('%Y-%m-%d')

In [14]:
# Inspect the shapes of both datasets
display(market_train_df.shape)
news_train_df.shape

(4072956, 17)

(9328750, 35)

The market dataset contains over 4 million rows. Unfortunately, because of the competition rules, everything has to run in a Kaggle kernel. When we tried to merge the two datasets, the kernel could not handle it and died every time. There also seem to be many errors in the older market data. For these reasons, we used only the data starting from 2013.

In [15]:
# Use only the last years of data (2013 onward).
market_train_df=market_train_df.loc[market_train_df['time']>"2013-01-01",:]
news_train_df=news_train_df.loc[news_train_df['time']>"2013-01-01",:]


## Cleaning market data

<br>Used kernel: https://www.kaggle.com/danielson/cleaning-up-market-data-errors-and-stock-splits

In [16]:
market_train_df.describe().round(3)

Unnamed: 0,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe
count,1718492.0,1718492.0,1718492.0,1718492.0,1718492.0,1711821.0,1711813.0,1718492.0,1718492.0,1682789.0,1682745.0,1718492.0,1718492.0
mean,2429024.0,47.221,47.211,0.001,0.001,0.0,0.0,0.006,0.006,-0.001,-0.0,-0.001,0.582
std,5769091.0,53.248,53.247,0.043,0.023,0.041,0.021,0.081,0.072,0.074,0.065,0.066,0.493
min,0.0,0.461,0.462,-0.978,-0.862,-1.236,-0.773,-0.977,-0.857,-3.343,-1.225,-1.232,0.0
25%,477739.0,19.54,19.54,-0.009,-0.009,-0.008,-0.008,-0.028,-0.028,-0.028,-0.028,-0.028,0.0
50%,976715.0,35.36,35.35,0.001,0.001,-0.0,0.0,0.005,0.005,-0.0,0.0,-0.0,1.0
75%,2265774.0,59.19,59.16,0.01,0.01,0.007,0.009,0.038,0.038,0.026,0.027,0.027,1.0
max,618237600.0,1578.13,1584.44,45.592,3.868,45.122,3.782,46.672,4.247,46.25,4.028,4.028,1.0


**1. Fix the errors in returnsClosePrevRaw1**  <br>
The returnsClosePrevRaw1 column holds the one-day change in closing price. Its maximum shows that the biggest rise was 45.59 (4559%!), and its minimum shows that one stock lost almost 100% of its value in one day (-0.978 = -98%).

In [17]:
# Look at the rows with a drop of more than 70% in one day
market_train_df[market_train_df['returnsClosePrevRaw1'] < -.7] 

# Four of these rows share the same date, 2016-07-07; next, look more closely at the surrounding dates for these assets

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe,date
3474114,2015-09-09 22:00:00+00:00,TTPH.O,Tetraphase Pharmaceuticals Inc,23468076.0,9.49,9.64,-0.788075,-0.77834,-0.755343,-0.729875,-0.753826,-0.752821,-0.717449,-0.675231,0.064186,0.0,2015-09-09
3607400,2015-12-28 22:00:00+00:00,CMRX.O,Chimerix Inc,26705567.0,6.62,7.86,-0.813888,-0.778279,-0.811852,-0.773448,-0.812305,-0.780263,-0.794776,-0.784926,-0.163051,0.0,2015-12-28
3847265,2016-07-07 22:00:00+00:00,FLEX.O,Flex Ltd,4481469.0,11.8,11.81,-0.904415,-0.28163,-0.886907,-0.273703,-0.097859,-0.091538,-0.223552,-0.09363,-0.011186,1.0,2016-07-07
3847633,2016-07-07 22:00:00+00:00,MAT.O,Mattel Inc,2091099.0,32.34,32.14,-0.738032,0.492108,-0.731417,0.463413,0.006536,-0.001243,-0.040354,-0.003972,-0.077818,1.0,2016-07-07
3848074,2016-07-07 22:00:00+00:00,SHLD.O,Sears Holdings Corp,497204.0,13.4,13.27,-0.891472,0.501131,-0.875653,0.480682,-0.022611,-0.058865,-0.133526,-0.057434,0.032036,0.0,2016-07-07
3848433,2016-07-07 22:00:00+00:00,ZNGA.O,Zynga Inc,34888980.0,2.76,2.73,-0.977646,-0.252055,-0.899473,-0.242279,0.086614,0.05814,-0.614571,0.047817,-0.045813,0.0,2016-07-07
3938226,2016-09-16 22:00:00+00:00,NVAX.O,Novavax Inc,242232485.0,1.29,1.17,-0.845324,-0.862028,-0.835574,-0.763006,-0.81085,-0.829197,-0.832264,-0.868582,0.497091,0.0,2016-09-16


In [18]:
someAssetsWithBadData = ['FLEX.O','MAT.O','SHLD.O','ZNGA.O']
someMarketData = market_train_df[(market_train_df['assetCode'].isin(someAssetsWithBadData)) 
                & (market_train_df['time'] >= '2016-07-05')
                & (market_train_df['time'] < '2016-07-08')].sort_values('assetCode')
someMarketData

# These assets have nearly the same close on 2016-07-06 (123.45 or 123.47), which looks like an input error.
# Next, check all rows between these dates where close - open >= 10.

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe,date
3843668,2016-07-05 22:00:00+00:00,FLEX.O,Flex Ltd,3839393.0,11.66,11.7,-0.010187,-0.005102,-0.000815,-0.003561,-0.101695,-0.090909,-0.103166,-0.097041,-0.247207,1.0,2016-07-05
3845467,2016-07-06 22:00:00+00:00,FLEX.O,Flex Ltd,175451.0,123.45,16.44,9.587479,0.405128,9.482848,0.404713,8.503464,0.254962,8.425788,0.2552,0.087592,1.0,2016-07-06
3847265,2016-07-07 22:00:00+00:00,FLEX.O,Flex Ltd,4481469.0,11.8,11.81,-0.904415,-0.28163,-0.886907,-0.273703,-0.097859,-0.091538,-0.223552,-0.09363,-0.011186,1.0,2016-07-07
3844037,2016-07-05 22:00:00+00:00,MAT.O,Mattel Inc,3333108.0,31.62,31.46,0.002219,0.005433,0.014722,0.008522,-0.024676,-0.019021,-0.027185,-0.029008,0.396756,1.0,2016-07-05
3845835,2016-07-06 22:00:00+00:00,MAT.O,Mattel Inc,56994.0,123.45,21.54,2.904175,-0.315321,2.864919,-0.30448,2.842204,-0.334363,2.812233,-0.334071,-0.069237,1.0,2016-07-06
3847633,2016-07-07 22:00:00+00:00,MAT.O,Mattel Inc,2091099.0,32.34,32.14,-0.738032,0.492108,-0.731417,0.463413,0.006536,-0.001243,-0.040354,-0.003972,-0.077818,1.0,2016-07-07
3844479,2016-07-05 22:00:00+00:00,SHLD.O,Sears Holdings Corp,388228.0,12.98,13.63,-0.065515,0.00963,-0.054871,0.010488,-0.070201,-0.046853,-0.072216,-0.058148,0.541448,0.0,2016-07-05
3846276,2016-07-06 22:00:00+00:00,SHLD.O,Sears Holdings Corp,80940.0,123.47,8.84,8.512327,-0.351431,8.417827,-0.345786,7.78165,-0.370819,7.710716,-0.37063,0.059298,0.0,2016-07-06
3848074,2016-07-07 22:00:00+00:00,SHLD.O,Sears Holdings Corp,497204.0,13.4,13.27,-0.891472,0.501131,-0.875653,0.480682,-0.022611,-0.058865,-0.133526,-0.057434,0.032036,0.0,2016-07-07
3844838,2016-07-05 22:00:00+00:00,ZNGA.O,Zynga Inc,9732445.0,2.65,2.6,0.039216,0.044177,0.048598,0.045064,0.031128,0.044177,0.026816,0.029699,-0.306158,0.0,2016-07-05


In [19]:
# Rows where the close exceeds the open by at least 10 (change within one day)
someMarketData2 = market_train_df[(market_train_df['time'] >= '2016-07-05') & (market_train_df['time'] < '2016-07-08') 
                                  & (market_train_df['close'] - market_train_df['open'] >=10)].sort_values('assetCode')
AssetsWithBadData=someMarketData2[(someMarketData2['close'] == 123.45)|(someMarketData2['close'] == 123.47)]['assetCode']
someMarketData2

Unnamed: 0,time,assetCode,assetName,volume,close,open,returnsClosePrevRaw1,returnsOpenPrevRaw1,returnsClosePrevMktres1,returnsOpenPrevMktres1,returnsClosePrevRaw10,returnsOpenPrevRaw10,returnsClosePrevMktres10,returnsOpenPrevMktres10,returnsOpenNextMktres10,universe,date
3844944,2016-07-06 22:00:00+00:00,AMZN.O,Amazon.com Inc,3938249.0,737.61,725.71,0.013061,0.004026,0.008342,0.004942,0.030441,0.013958,0.024262,0.014334,-0.020081,1.0,2016-07-06
3845015,2016-07-06 22:00:00+00:00,BBBY.O,Bed Bath & Beyond Inc,50303.0,123.45,30.68,1.921202,-0.295522,1.897847,-0.29161,1.849065,-0.303044,1.826249,-0.302549,-0.047101,1.0,2016-07-06
3845309,2016-07-06 22:00:00+00:00,DISH.O,DISH Network Corp,87466.0,123.47,63.29,1.430033,0.206903,1.407708,0.208864,1.325673,0.187207,1.295536,0.188725,-0.030583,1.0,2016-07-06
3845467,2016-07-06 22:00:00+00:00,FLEX.O,Flex Ltd,175451.0,123.45,16.44,9.587479,0.405128,9.482848,0.404713,8.503464,0.254962,8.425788,0.2552,0.087592,1.0,2016-07-06
3845835,2016-07-06 22:00:00+00:00,MAT.O,Mattel Inc,56994.0,123.45,21.54,2.904175,-0.315321,2.864919,-0.30448,2.842204,-0.334363,2.812233,-0.334071,-0.069237,1.0,2016-07-06
3845946,2016-07-06 22:00:00+00:00,NDAQ.O,Nasdaq Inc,113350.0,123.47,71.51,0.911596,0.122254,0.897776,0.122543,0.960775,0.135801,0.946862,0.13617,0.001852,1.0,2016-07-06
3846067,2016-07-06 22:00:00+00:00,PCAR.O,Paccar Inc,63374.0,123.45,66.41,1.445523,0.285521,1.420271,0.288671,1.261818,0.218979,1.242209,0.219653,0.030409,1.0,2016-07-06
3844271,2016-07-05 22:00:00+00:00,PCLN.O,Booking Holdings Inc,568815.0,1275.03,1259.56,0.006044,0.008447,0.021392,0.010721,-0.049875,-0.048412,-0.054914,-0.06466,-0.023824,1.0,2016-07-05
3846069,2016-07-06 22:00:00+00:00,PCLN.O,Booking Holdings Inc,562266.0,1292.04,1269.86,0.013341,0.008177,0.001566,0.013722,-0.037981,-0.05541,-0.047257,-0.054742,-0.018944,1.0,2016-07-06
3846151,2016-07-06 22:00:00+00:00,PZZA.O,Papa John's International Inc,25050.0,123.45,71.89,0.81758,0.056429,0.805544,0.056785,0.857229,0.076198,0.84787,0.076377,0.054807,0.0,2016-07-06


It turns out that 9 assets share the same error. Because the error also affects the returns computed over longer periods, correcting it would mean repairing several columns on several dates, so we delete the affected rows instead.

Assets: BBBY.O, DISH.O, FLEX.O, MAT.O, NDAQ.O, PCAR.O, PZZA.O, SHLD.O, ZNGA.O
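
To see why one bad closing price damages more than one row, the sketch below recomputes the raw one-day return as `close_t / close_{t-1} - 1` from the three FLEX.O closing prices listed above. That `returnsClosePrevRaw1` was built with exactly this formula is an assumption, although it reproduces the values in the table.

```python
import pandas as pd

# FLEX.O closing prices from the table above (2016-07-06 is the bad value).
closes = pd.Series(
    [11.66, 123.45, 11.80],
    index=pd.to_datetime(['2016-07-05', '2016-07-06', '2016-07-07']),
)

# One-day raw return: today's close divided by the previous close, minus 1.
raw_return_1d = closes / closes.shift(1) - 1
print(raw_return_1d.round(6))
```

The bad close enters the 2016-07-06 return as the numerator and the 2016-07-07 return as the denominator. Under the same assumption, any 10-day return whose window starts or ends on the bad day is distorted too, which is a reason to remove a window of rows around the error rather than a single row.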

In [20]:
assets=['ZNGA.O','FLEX.O','SHLD.O','MAT.O','BBBY.O','DISH.O','NDAQ.O', 'PCAR.O', 'PZZA.O']
for asset in assets:
    market_train_df = market_train_df[~((market_train_df['assetCode'] == asset)
                                  & (market_train_df['time'] >= '2016-05-21')
                                  & (market_train_df['time'] <= '2016-08-21'))]

**2. One asset has two asset codes, "TW.N" and "WW.N"; the latter is erroneous.**

In [21]:
# Fix asset code "WW.N" and remove the observations with erroneous return data
market_train_df.loc[market_train_df['assetCode'] == 'WW.N','assetCode'] = 'TW.N'
market_train_df = market_train_df[~((market_train_df['assetCode'] == 'TW.N')
                                  & (market_train_df['time'] >= '2009-12-16')
                                  & (market_train_df['time'] < '2010-01-08'))]

**3. Drop some more rows with erroneous data.**

In [22]:
# dropping Qorvo data through 2015-02-13
market_train_df = market_train_df[~((market_train_df['assetCode'] == 'QRVO.O')
                                  & (market_train_df['time'] < '2015-02-14'))]

In [23]:
# dropping TECD.O data from 2015-01-30 through 2015-04-30
market_train_df = market_train_df[~((market_train_df['assetCode'] == 'TECD.O')
                                  & (market_train_df['time'] >= '2015-01-30')
                                  & (market_train_df['time'] <= '2015-04-30'))]

In [24]:
# dropping EBR.N data from October 2016 onward
market_train_df = market_train_df[~((market_train_df['assetCode'] == 'EBR.N')
                                  & (market_train_df['time'] >= '2016-10-01'))]

In [25]:
# dropping HGSI.O data before 2009-04-01
market_train_df = market_train_df[~((market_train_df['assetCode'] == 'HGSI.O')
                                  & (market_train_df['time'] < '2009-04-01'))]
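
The cleaning cells above all follow the same pattern: build a boolean mask for one asset and a date window, then keep everything else. A minimal sketch of that pattern as a reusable helper is shown below; `drop_asset_window` is a hypothetical name, not part of the notebook, and the toy frame uses timezone-naive timestamps for simplicity.

```python
import pandas as pd

def drop_asset_window(df, asset_code, start=None, end=None):
    """Return df without the rows of asset_code whose 'time' lies in [start, end]."""
    mask = df['assetCode'] == asset_code
    if start is not None:
        mask &= df['time'] >= start
    if end is not None:
        mask &= df['time'] <= end
    # ~ inverts the mask, so only the rows outside the window are kept.
    return df[~mask]

# Toy data for illustration only.
toy = pd.DataFrame({
    'assetCode': ['FLEX.O', 'FLEX.O', 'MAT.O'],
    'time': pd.to_datetime(['2016-07-05', '2016-07-06', '2016-07-06']),
    'close': [11.66, 123.45, 123.45],
})
cleaned = drop_asset_window(
    toy, 'FLEX.O',
    start=pd.Timestamp('2016-07-06'), end=pd.Timestamp('2016-07-06'),
)
print(cleaned)
```

Only the FLEX.O row inside the window is removed; the MAT.O row on the same date is untouched because the mask is specific to one asset code.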

## Merging two datasets together


Most of the code is copied from the public kernel https://www.kaggle.com/bguberfain/a-simple-model-using-the-market-and-news-data
and then modified to fit the set-up of this project.

In [26]:
# Metrics calculated per asset per day; they aggregate the multiple articles about an asset on one day.
news_cols_agg = {
    'urgency': ['min', 'count'],
    'takeSequence': ['max'],
    'bodySize': ['min', 'max', 'mean', 'std'],
    'wordCount': ['min', 'max', 'mean', 'std'],
    'sentenceCount': ['min', 'max', 'mean', 'std'],
    'companyCount': ['min', 'max', 'mean', 'std'],
    'marketCommentary': ['min', 'max', 'mean', 'std'],
    'relevance': ['min', 'max', 'mean', 'std'],
    'sentimentNegative': ['min', 'max', 'mean', 'std'],
    'sentimentNeutral': ['min', 'max', 'mean', 'std'],
    'sentimentPositive': ['min', 'max', 'mean', 'std'],
    'sentimentWordCount': ['min', 'max', 'mean', 'std'],
    'noveltyCount12H': ['min', 'max', 'mean', 'std'],
    'noveltyCount24H': ['min', 'max', 'mean', 'std'],
    'noveltyCount3D': ['min', 'max', 'mean', 'std'],
    'noveltyCount5D': ['min', 'max', 'mean', 'std'],
    'noveltyCount7D': ['min', 'max', 'mean', 'std'],
    'volumeCounts12H': ['min', 'max', 'mean', 'std'],
    'volumeCounts24H': ['min', 'max', 'mean', 'std'],
    'volumeCounts3D': ['min', 'max', 'mean', 'std'],
    'volumeCounts5D': ['min', 'max', 'mean', 'std'],
    'volumeCounts7D': ['min', 'max', 'mean', 'std']
}
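
How such a dictionary is applied can be shown on a toy example. The sketch below uses a reduced, made-up news table with only two of the columns; it groups by day and asset, applies the listed statistics, and flattens the resulting two-level column names into names such as `urgency_min`.

```python
import pandas as pd

# Toy news table for illustration only: two articles about FLEX.O on one day.
news = pd.DataFrame({
    'time': pd.to_datetime(['2016-07-06', '2016-07-06', '2016-07-07']),
    'assetCode': ['FLEX.O', 'FLEX.O', 'MAT.O'],
    'urgency': [1, 3, 3],
    'relevance': [1.0, 0.5, 0.2],
})

agg_spec = {
    'urgency': ['min', 'count'],
    'relevance': ['min', 'max', 'mean', 'std'],
}

daily = news.groupby(['time', 'assetCode']).agg(agg_spec)

# The columns are now (column, statistic) pairs; join them into flat names.
daily.columns = ['_'.join(col) for col in daily.columns]
print(daily)
```

A group with a single article has no defined standard deviation, so its `_std` columns are NaN.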


In [27]:
def join_market_news(market_train_df, news_train_df):
    # Fix asset codes (str -> list)
    # Raw string: the pattern contains backslashes and has no placeholders to format
    news_train_df['assetCodes2'] = news_train_df['assetCodes'].str.findall(r"'([\w\./]+)'")
    
    # Expand assetCodes
    assetCodes_expanded = list(chain(*news_train_df['assetCodes2']))
    assetCodes_index = news_train_df.index.repeat( news_train_df['assetCodes2'].apply(len) )

    assert len(assetCodes_index) == len(assetCodes_expanded)
    df_assetCodes = pd.DataFrame({'level_0': assetCodes_index, 'assetCode': assetCodes_expanded})

    # Create expanded news (repeats each news row once per asset code it mentions)
    news_cols = ['time', 'assetCodes2'] + sorted(news_cols_agg.keys())
    news_train_df_expanded = pd.merge(df_assetCodes, news_train_df[news_cols], left_on='level_0', right_index=True, suffixes=(['','_old']))

    # Free memory
    del news_train_df, df_assetCodes

    # Aggregate numerical news features
    news_train_df_aggregated = news_train_df_expanded.groupby(['time', 'assetCode']).agg(news_cols_agg)
    
    # Free memory
    del news_train_df_expanded

    # Convert to float32 to save memory
    news_train_df_aggregated = news_train_df_aggregated.apply(np.float32)

    # Flat columns
    news_train_df_aggregated.columns = ['_'.join(col).strip() for col in news_train_df_aggregated.columns.values]

    # Join with train
    market_train_df = market_train_df.join(news_train_df_aggregated, on=['time', 'assetCode'])

    # Free memory
    del news_train_df_aggregated
    
    return market_train_df
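
The expansion step in `join_market_news` turns one news row that mentions several assets into one row per asset. The sketch below shows the same idea on made-up data, using `DataFrame.explode` as a shorter alternative to the `chain`/`repeat` approach above. The string format of `assetCodes` is assumed from the regular expression, and `headline_id` is a hypothetical column.

```python
import pandas as pd

# Toy news rows for illustration only.
news = pd.DataFrame({
    'headline_id': [101, 102],
    'assetCodes': ["{'FLEX.O'}", "{'MAT.O', 'MAT.N'}"],
})

# Extract every quoted code from the string (str -> list of codes).
news['assetCode'] = news['assetCodes'].str.findall(r"'([\w\./]+)'")

# One output row per (news row, asset code) pair.
expanded = news.explode('assetCode').reset_index(drop=True)
print(expanded[['headline_id', 'assetCode']])
```

The second article appears twice in the result, once for each of its asset codes, so it can later be grouped under both.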

In [28]:
def get_xy(market_train_df, news_train_df, le=None):
    # Pass the label encoders through so that existing encoders can be reused
    x, le = get_x(market_train_df, news_train_df, le)
    y = market_train_df['returnsOpenNextMktres10'].clip(-1, 1)
    return x, y, le


def label_encode(series, min_count):
    vc = series.value_counts()
    le = {c:i for i, c in enumerate(vc.index[vc >= min_count])}
    return le


def get_x(market_train_df, news_train_df, le=None):
    # Split date into before and after 22h (the time used in train data)
    # E.g: 2007-03-07 23:26:39+00:00 -> 2007-03-08 00:00:00+00:00 (next day)
    #      2009-02-25 21:00:50+00:00 -> 2009-02-25 00:00:00+00:00 (current day)
    news_train_df['time'] = (news_train_df['time'] - np.timedelta64(22,'h')).dt.ceil('1D')

    # Round time of market_train_df down to 0h of the current day
    market_train_df['time'] = market_train_df['time'].dt.floor('1D')

    # Join market and news
    x = join_market_news(market_train_df, news_train_df)
    
    # If not label-encoder... encode assetCode
    if le is None:
        le_assetCode = label_encode(x['assetCode'], min_count=1)
        le_assetName = label_encode(x['assetName'], min_count=5)
    else:
        # 'unpack' label encoders
        le_assetCode, le_assetName = le
        
    x['assetCode'] = x['assetCode'].map(le_assetCode).fillna(-1).astype(int)
    x['assetName'] = x['assetName'].map(le_assetName).fillna(-1).astype(int)
    
    # Drop the target and universe columns if present (they are absent at prediction time)
    x.drop(columns=['returnsOpenNextMktres10', 'universe'], errors='ignore', inplace=True)
    x['dayofweek'], x['month'] = x.time.dt.dayofweek, x.time.dt.month
    x.drop(columns='time', inplace=True)
#    x.fillna(-1000,inplace=True)

    # Fix some mixed-type columns
    for bogus_col in ['marketCommentary_min', 'marketCommentary_max']:
        x[bogus_col] = x[bogus_col].astype(float)
    
    return x, (le_assetCode, le_assetName)
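
The time alignment at the start of `get_x` can be checked on the two timestamps from its comments. The sketch below is a standalone illustration with made-up Series names; it assigns news published after 22:00 UTC to the next market day and floors the market timestamp to midnight.

```python
import pandas as pd

news_time = pd.Series(pd.to_datetime(
    ['2007-03-07 23:26:39', '2009-02-25 21:00:50'], utc=True))

# Shift back by 22 hours, then round up to the next midnight.
aligned = (news_time - pd.Timedelta(hours=22)).dt.ceil('1D')

# Market rows are stamped 22:00; floor them to 0h of the same day.
market_time = pd.Series(pd.to_datetime(['2009-02-25 22:00:00'], utc=True)).dt.floor('1D')

print(aligned)
print(market_time)
```

After this step both tables carry a midnight timestamp, so the news published before 22:00 on 2009-02-25 joins the market row of the same day.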

In [29]:
from itertools import chain
X, y, le = get_xy(market_train_df, news_train_df)

In [30]:
# Save universe data for later use
universe = market_train_df['universe']

In [31]:
n_train = int(X.shape[0] * 0.8)

X_train, y_train = X.iloc[:n_train], y.iloc[:n_train]
X_valid, y_valid = X.iloc[n_train:], y.iloc[n_train:]

In [32]:
# For valid data, keep only those with universe > 0. This will help calculate the metric
u_valid = (universe.iloc[n_train:] > 0)
X_valid = X_valid[u_valid]
y_valid = y_valid[u_valid]

del u_valid

In [33]:
# Feature columns: u_train_cols for rows with news data, i_train_cols for rows without
#train_cols = X.columns.tolist()
u_train_cols = [
 'volume', 'returnsOpenPrevMktres1', 'returnsOpenPrevMktres10',
 'urgency_min', 'urgency_count', 'takeSequence_max',
 'bodySize_min', 'bodySize_max','bodySize_mean', 'bodySize_std',
 'wordCount_min', 'wordCount_max', 'wordCount_mean', 'wordCount_std',
 'sentenceCount_min', 'sentenceCount_max', 'sentenceCount_mean', 'sentenceCount_std',
 'companyCount_min', 'companyCount_max', 'companyCount_mean', 'companyCount_std',
 'marketCommentary_min', 'marketCommentary_max', 'marketCommentary_mean', 'marketCommentary_std',
 'relevance_min', 'relevance_max', 'relevance_mean', 'relevance_std',
 'sentimentNegative_min', 'sentimentNegative_max', 'sentimentNegative_mean', 'sentimentNegative_std',
 'sentimentNeutral_min', 'sentimentNeutral_max', 'sentimentNeutral_mean', 'sentimentNeutral_std',
 'sentimentPositive_min', 'sentimentPositive_max', 'sentimentPositive_mean', 'sentimentPositive_std',
 'sentimentWordCount_min', 'sentimentWordCount_max', 'sentimentWordCount_mean', 'sentimentWordCount_std',
 'noveltyCount12H_min', 'noveltyCount12H_max', 'noveltyCount12H_mean', 'noveltyCount12H_std',
 'noveltyCount24H_min', 'noveltyCount24H_max', 'noveltyCount24H_mean', 'noveltyCount24H_std',
 'noveltyCount3D_min', 'noveltyCount3D_max', 'noveltyCount3D_mean', 'noveltyCount3D_std',
 'noveltyCount5D_min', 'noveltyCount5D_max', 'noveltyCount5D_mean', 'noveltyCount5D_std',
 'noveltyCount7D_min', 'noveltyCount7D_max', 'noveltyCount7D_mean', 'noveltyCount7D_std',
 'volumeCounts12H_min', 'volumeCounts12H_max', 'volumeCounts12H_mean', 'volumeCounts12H_std',
 'volumeCounts24H_min', 'volumeCounts24H_max', 'volumeCounts24H_mean', 'volumeCounts24H_std',
 'volumeCounts3D_min', 'volumeCounts3D_max', 'volumeCounts3D_mean', 'volumeCounts3D_std',
 'volumeCounts5D_min', 'volumeCounts5D_max', 'volumeCounts5D_mean', 'volumeCounts5D_std',
 'volumeCounts7D_min', 'volumeCounts7D_max', 'volumeCounts7D_mean', 'volumeCounts7D_std',
 'dayofweek', 'month'
]

i_train_cols = [
 'dayofweek', 'month',
 'volume',
 'returnsOpenPrevMktres1',
 'returnsOpenPrevMktres10']

## Fitting models

In the merged dataset, each row describes one asset on one date. Every row contains market data (returns, volumes) and, if there was some media coverage, aggregated information from the news data.

Because not every row contains news data, we train two models: one for the rows with news data and one for the rows without.  

The model for the rows without news data is a linear regression. A random forest was also tried for this, but linear regression gave better scores.

In [34]:
# Function to separate rows with news data (suffix _u) from rows without news data (suffix _i)
def separate_rows_with_news(X, y, col, y_also=True):
    NA_indx=X[col].isnull()
    X_i=X.loc[NA_indx,:]
    # ~ inverts the boolean mask (unary minus on booleans fails in recent NumPy)
    X_u=X.loc[~NA_indx,:]
    X_u=X_u.fillna(0)
    X_i=X_i.fillna(0)
    if y_also:
        y_i=y.loc[NA_indx]
        y_u=y.loc[~NA_indx]
        return X_u, X_i, y_u, y_i
    else:
        return X_u, X_i

X_u, X_i, y_u, y_i=separate_rows_with_news(X_train, y_train, 'urgency_min')
X_valid_u, X_valid_i, y_valid_u, y_valid_i=separate_rows_with_news(X_valid, y_valid, 'urgency_min')

### Functions for modelling

In [35]:
# Evaluation of score
def score(X_valid_u, X_valid_i, y_valid_u, y_valid_i,pred_u, pred_i):
    resid_u=pd.concat([X_valid_u['date'],pred_u*y_valid_u],axis=1)
    resid_i=pd.concat([X_valid_i['date'], pred_i*y_valid_i],axis=1)
    xts=[]
    for dt in X_valid_i['date'].unique():
        xt=resid_u.loc[resid_u['date']==dt,'returnsOpenNextMktres10'].sum()+resid_i.loc[resid_i['date']==dt,'returnsOpenNextMktres10'].sum()
        xts.extend([xt])
    return np.mean(xts)/np.std(xts)

# Train two models - one for the part of data with some news about it, and other for the assets without news on certain date
def train_models(X_u,y_u, X_i, y_i, m_u, m_i, u_train_cols=u_train_cols, i_train_cols=i_train_cols):
    fitted_u=m_u.fit(X_u[u_train_cols], y_u)
    fitted_i=m_i.fit(X_i[i_train_cols], y_i)
    return fitted_u, fitted_i

# Predict from two models 
def predict_from_fitted(X_u, X_i, fitted_u, fitted_i, u_train_cols=u_train_cols, i_train_cols=i_train_cols):
    pred_i=np.clip(fitted_i.predict(X_i[i_train_cols]),-1,1)
    pred_u=np.clip(fitted_u.predict(X_u[u_train_cols]),-1,1)
    return pred_u, pred_i
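
The two-model idea behind `train_models` and `predict_from_fitted` can be tried end to end on synthetic data. The sketch below is an illustration only: the data is random, the two feature names are borrowed from the notebook, and the variable names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'returnsOpenPrevMktres10': rng.normal(0, 0.05, n),
    'sentimentPositive_mean': rng.uniform(0, 1, n),
})
# Pretend that every second row has no news coverage.
toy.loc[toy.index % 2 == 0, 'sentimentPositive_mean'] = np.nan
target = 0.5 * toy['returnsOpenPrevMktres10'] + rng.normal(0, 0.01, n)

has_news = toy['sentimentPositive_mean'].notnull()
news_cols = ['returnsOpenPrevMktres10', 'sentimentPositive_mean']
no_news_cols = ['returnsOpenPrevMktres10']

# One model per subset, each with its own feature list.
model_news = LinearRegression().fit(toy.loc[has_news, news_cols], target[has_news])
model_no_news = LinearRegression().fit(toy.loc[~has_news, no_news_cols], target[~has_news])

# Predict each subset with its own model and clip into [-1, 1].
pred = pd.Series(index=toy.index, dtype=float)
pred[has_news] = np.clip(model_news.predict(toy.loc[has_news, news_cols]), -1, 1)
pred[~has_news] = np.clip(model_no_news.predict(toy.loc[~has_news, no_news_cols]), -1, 1)
print(pred.describe())
```

Splitting the rows avoids imputing news features for assets that had no coverage, at the cost of training each model on less data.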


### Model for the assets that did not have any news about them

In [36]:
from sklearn.linear_model import LinearRegression

m_i=LinearRegression()

### Linear regression model

This model was constructed only as a baseline for comparison.

In [37]:
m_u=LinearRegression()

In [38]:
fitted_u, fitted_i=train_models(X_u,y_u, X_i, y_i, m_u, m_i, u_train_cols=u_train_cols, i_train_cols=i_train_cols)
pred_u, pred_i=predict_from_fitted(X_valid_u, X_valid_i, fitted_u, fitted_i, u_train_cols=u_train_cols, i_train_cols=i_train_cols)
score(X_valid_u, X_valid_i, y_valid_u, y_valid_i,pred_u, pred_i)

0.28683255252882184
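
The `score` function used for this value multiplies each prediction by the realised return, sums the products per day, and divides the mean of the daily sums by their standard deviation. A compact sketch of the same calculation is given below; `daily_score` is a hypothetical name and the numbers are made up.

```python
import numpy as np
import pandas as pd

def daily_score(dates, preds, returns):
    """Mean of the daily sums of prediction * return, divided by their std."""
    frame = pd.DataFrame({
        'date': dates,
        'x': np.asarray(preds) * np.asarray(returns),
    })
    daily = frame.groupby('date')['x'].sum()
    # ddof=0 matches np.std, which the score function above uses.
    return daily.mean() / daily.std(ddof=0)

# Two days with two assets each (illustrative values).
example = daily_score(
    dates=['2016-07-05', '2016-07-05', '2016-07-06', '2016-07-06'],
    preds=[1.0, -1.0, 0.5, 1.0],
    returns=[0.02, -0.01, 0.04, -0.01],
)
print(example)
```

This version groups over every date in the input, whereas `score` loops over the dates found in `X_valid_i`; the two agree whenever every date has at least one asset without news.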

In [39]:
# RESULTS: 
results=pd.DataFrame(columns=['Model','Kernel test score','Submission public score'])
results.loc[results.shape[0],:]=['LinearRegression()', 0.28683, 0.51676 ]
results

Unnamed: 0,Model,Kernel test score,Submission public score
0,LinearRegression(),0.28683,0.51676


### Random forest

The main hyperparameters tuned were the number of trees, the maximum depth of a tree, and the number of features considered when choosing the best split. 

Unfortunately the Kaggle kernel could not handle automatic tuning very well, because the session has limits on RAM and running time. Many times the kernel simply died or timed out, so automatic tuning was not used. For the same reasons, the trained forests were not very big.
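
Manual tuning of this kind amounts to fitting one configuration at a time and recording its validation result. The sketch below shows such a loop on small synthetic data; the grid values are arbitrary, and mean squared error stands in for the competition score used in this notebook.

```python
import numpy as np
from itertools import product
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=300)
X_tr, X_va, y_tr, y_va = X[:240], X[240:], y[:240], y[240:]

rows = []
for n_estimators, max_depth, max_features in product([20, 50], [3, 6], [None, 'sqrt']):
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        max_features=max_features,
        random_state=1,
    )
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_va, model.predict(X_va))
    rows.append((n_estimators, max_depth, max_features, mse))

# Configuration with the lowest validation error.
best = min(rows, key=lambda row: row[3])
print(best)
```

Only one forest is held in memory at a time, which is the main advantage of such a loop under tight RAM limits.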

In [40]:
from sklearn.ensemble import RandomForestRegressor
# criterion and min_impurity_split are left at their defaults (squared error,
# no threshold): 'mse' and min_impurity_split were removed from recent scikit-learn.
m_u=RandomForestRegressor(n_estimators=100, 
                      max_depth=12, 
                      min_samples_split=2, 
                      min_samples_leaf=10, 
                      min_weight_fraction_leaf=0.0, 
                      max_features=None, 
                      max_leaf_nodes=None, 
                      min_impurity_decrease=0.0, 
                      bootstrap=True, 
                      oob_score=False, 
                      n_jobs=-1, 
                      random_state=1, 
                      verbose=0, 
                      warm_start=True)

In [41]:
# Fitting is commented out because it runs for a very long time; one model at a time was fitted.

#fitted_u, fitted_i=train_models(X_u,y_u, X_i, y_i, m_u, m_i, u_train_cols=u_train_cols, i_train_cols=i_train_cols)
#pred_u, pred_i=predict_from_fitted(X_valid_u, X_valid_i, fitted_u, fitted_i, u_train_cols=u_train_cols, i_train_cols=i_train_cols)
#score(X_valid_u, X_valid_i, y_valid_u, y_valid_i,pred_u, pred_i)

In [42]:
# Saved results from different runs.
results.loc[results.shape[0],:]=['RandomForestRegressor(n_estimators=100, max_depth=6)', 0.30670, np.nan ]
results.loc[results.shape[0],:]=['RandomForestRegressor(n_estimators=250, max_depth=6)', 0.30679, np.nan ]
results.loc[results.shape[0],:]=['RandomForestRegressor(n_estimators=100, max_depth=6, max_features=sqrt)', 0.26570, np.nan ]
results.loc[results.shape[0],:]=['RandomForestRegressor(n_estimators=100, max_depth=6, max_features=0.75)', 0.30836, np.nan ]
results.loc[results.shape[0],:]=['RandomForestRegressor(n_estimators=100, max_depth=12)', 0.31331, 0.52355 ]
results.loc[results.shape[0],:]=['RandomForestRegressor(n_estimators=200, max_depth=20)', 0.31361, np.nan ]
results

Unnamed: 0,Model,Kernel test score,Submission public score
0,LinearRegression(),0.28683,0.51676
1,"RandomForestRegressor(n_estimators=100, max_de...",0.3067,
2,"RandomForestRegressor(n_estimators=250, max_de...",0.30679,
3,"RandomForestRegressor(n_estimators=100, max_de...",0.2657,
4,"RandomForestRegressor(n_estimators=100, max_de...",0.30836,
5,"RandomForestRegressor(n_estimators=100, max_de...",0.31331,0.52355
6,"RandomForestRegressor(n_estimators=200, max_de...",0.31361,


### eXtreme Gradient Boosting (XGBoost)

Many of the public kernels used models based on gradient boosting, and XGBoost has become very popular in Kaggle competitions. Gradient boosting methods fit each new tree to "the unlearned", that is, to the errors left by the previously fitted trees. This is a short trial of the method. Boosting methods usually build shallow trees, so max_depth was set to 4.  
Two parameters were varied: eta (the learning rate) and the number of trees. When the number of trees is high, eta should be low to avoid overfitting, and vice versa.
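
The idea of learning "the unlearned" can be illustrated with a hand-written boosting loop for squared error. This is a simplified sketch on synthetic data, not XGBoost's actual algorithm, which adds regularisation and other refinements; `eta` here plays the role of the learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

eta = 0.3       # learning rate: how much of each tree's correction is applied
n_trees = 50
prediction = np.zeros_like(y)
train_mse = []

for _ in range(n_trees):
    # The residual is what the trees fitted so far have not yet learned.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, residual)
    prediction += eta * tree.predict(X)
    train_mse.append(np.mean((y - prediction) ** 2))

print(train_mse[0], train_mse[-1])
```

With a small `eta` each tree corrects only part of the remaining error, so more trees are needed to reach the same training fit; this is the trade-off between eta and the number of trees described above.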

In [43]:
from xgboost import XGBRegressor

m_u=XGBRegressor(n_estimators=500, max_depth=4, eta=0.7)

In [44]:
#fitted_u, fitted_i=train_models(X_u,y_u, X_i, y_i, m_u, m_i, u_train_cols=u_train_cols, i_train_cols=i_train_cols)
#pred_u, pred_i=predict_from_fitted(X_valid_u, X_valid_i, fitted_u, fitted_i, u_train_cols=u_train_cols, i_train_cols=i_train_cols)
#score(X_valid_u, X_valid_i, y_valid_u, y_valid_i,pred_u, pred_i)


In [45]:
results.loc[results.shape[0],:]=['XGBRegressor(n_estimators=500, max_depth=4, eta=0.7)', 0.31331, 0.50125 ]
results.loc[results.shape[0],:]=['XGBRegressor(n_estimators=1000, max_depth=4, eta=0.4)', 0.31361, np.nan ]
results


Unnamed: 0,Model,Kernel test score,Submission public score
0,LinearRegression(),0.28683,0.51676
1,"RandomForestRegressor(n_estimators=100, max_de...",0.3067,
2,"RandomForestRegressor(n_estimators=250, max_de...",0.30679,
3,"RandomForestRegressor(n_estimators=100, max_de...",0.2657,
4,"RandomForestRegressor(n_estimators=100, max_de...",0.30836,
5,"RandomForestRegressor(n_estimators=100, max_de...",0.31331,0.52355
6,"RandomForestRegressor(n_estimators=200, max_de...",0.31361,
7,"XGBRegressor(n_estimators=500, max_depth=4, et...",0.31331,0.50125
8,"XGBRegressor(n_estimators=1000, max_depth=4, e...",0.31361,


## Writing a submission file

In [46]:
# For merging together two sets of predictions
def merge_predictions(X_valid_u, pred_u, X_valid_i, pred_i,le):
    le_assetCode, le_assetName = le
    assets=pd.DataFrame.from_dict(le_assetCode, orient='index').reset_index()
    assets.columns=['character','assetCode']
    result=pd.concat([pd.concat([X_valid_u['assetCode'].reset_index(),pd.DataFrame(pred_u)],axis=1),pd.concat([X_valid_i['assetCode'].reset_index(),pd.DataFrame(pred_i)],axis=1)],axis=0)
    result=result.merge(assets,on='assetCode')
    result=result.drop(['assetCode', 'index'], axis=1)
    result.columns=['preds', 'assetCode']
    result=result[['assetCode', 'preds']]
    return result 

def make_predictions(predictions_template_df, market_obs_df, news_obs_df):
    x, le = get_x(market_obs_df, news_obs_df)
    X_u, X_i=separate_rows_with_news(x, y_train, 'urgency_min', y_also=False)
    pred_u, pred_i=predict_from_fitted(X_u, X_i, fitted_u, fitted_i, u_train_cols=u_train_cols, i_train_cols=i_train_cols)
    
    preds=merge_predictions(X_u, pred_u, X_i, pred_i,le)
    
    predictions_template_df=predictions_template_df.merge(preds, on='assetCode')
    predictions_template_df=predictions_template_df.drop('confidenceValue', axis=1)
    predictions_template_df.columns=['assetCode','confidenceValue']
    return predictions_template_df

In [47]:
# The Main loop - to predict future dates:

days = env.get_prediction_days()

for (market_obs_df, news_obs_df, predictions_template_df) in days:
    predictions=make_predictions(predictions_template_df, market_obs_df, news_obs_df)
    env.predict(predictions)
print('Done!')


Done!


In [48]:
env.write_submission_file()

Your submission file has been saved. Once you `Commit` your Kernel and it finishes running, you can submit the file to the competition from the Kernel Viewer `Output` tab.


## Conclusion

This work produced working models, but compared with the leaderboard our scores were not impressive. More advanced methods or more efficient hyperparameter tuning would probably be needed to improve them.

### Graph used in poster

In [51]:
# Use only the data from 2015 onward.
market_train_df=market_train_df.loc[market_train_df['time']>"2015-01-01",:]

#market_train_df.nlargest(100, 'volume')['assetName'].unique()[0:10]
# The 10 asset names with the highest average trading volume
market_train_df.groupby(by='assetName')['volume'].mean().nlargest(10).index


CategoricalIndex(['Bank of America Corp', 'Apple Inc', 'General Electric Co',
                  'Freeport-McMoRan Inc', 'Sirius XM Holdings Inc',
                  'Chesapeake Energy Corp', 'Microsoft Corp', 'Ford Motor Co',
                  'Pfizer Inc', 'Vale SA'],
                 categories=['21Vianet Group Inc', '2U Inc', '3Com Corp', '3D Systems Corp', '3M Co', '500.Com Ltd', '51job Inc', '58.com Inc', ...], ordered=False, name='assetName', dtype='category')

In [52]:
import matplotlib.pyplot as plt
%matplotlib inline


import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

assets=['Bank of America Corp', 'Apple Inc', 'General Electric Co',
                  'Freeport-McMoRan Inc', 'Sirius XM Holdings Inc',
                  'Chesapeake Energy Corp', 'Microsoft Corp', 'Ford Motor Co',
                  'Pfizer Inc', 'Vale SA']
data = []
for asset in assets:#np.random.choice(market_train_df['assetName'].unique(), 10):
    asset_df = market_train_df[(market_train_df['assetName'] == asset)]

    data.append(go.Scatter(
        x = asset_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = asset_df['close'].values,
        name = asset
    ))
layout = go.Layout(dict(title = "Closing prices of 10 assets with highest avg. trading volumes in 2015-2017",
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')