# Using Market State and Financial News to Predict Stock Movements
## 1.  Introduction
In this [Kaggle competition](https://www.kaggle.com/c/two-sigma-financial-news) we predict how stock prices will change based on the market state and news articles. We loop through a long series of trading days; for each day, we receive an updated state of the market and the news articles published since the last trading day, along with the impacted stocks and sentiment analysis. We use this information to predict whether each stock will have increased or decreased ten trading days into the future.

**Evaluation details**: we must predict a signed confidence value, $\hat{y}_{ti} \in [-1,1]$, which is multiplied by the market-adjusted return of a given assetCode over a ten-day window. If we expect a stock to have a large positive return, compared to the broad market, over the next ten days, we might assign it a large, positive confidenceValue (near 1.0). If we expect a stock to have a negative return, we might assign it a large, negative confidenceValue (near -1.0). If unsure, we might assign it a value near zero. For each day $t$, the daily value $x_t$ is computed as:

$$x_t = \sum_i \hat{y}_{ti}  r_{ti}  u_{ti},$$

where $r_{ti}$ is the 10-day market-adjusted leading return for day $t$ for instrument $i$, and $u_{ti}$ is a 0/1 universe variable that controls whether a particular asset is included in scoring on a particular day.

The submission score is then calculated as the mean divided by the standard deviation of the daily $x_t$ values:
$$\text{score} = \frac{\bar{x}_t}{\sigma(x_t)}.$$
If the standard deviation of predictions is 0, the score is defined as 0.
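
The scoring rule above can be sketched in a few lines of pandas. This is only an illustration on a toy frame: the column names `returns` and `universe`, the toy values, and the use of the sample standard deviation (pandas' default) are assumptions, not details confirmed by the competition.

```python
import numpy as np
import pandas as pd


def competition_score(df: pd.DataFrame) -> float:
    """Mean of the daily x_t values divided by their standard deviation."""
    # x_t = sum over assets of confidence * return * universe, one value per day
    per_asset = df["confidenceValue"] * df["returns"] * df["universe"]
    daily_x = per_asset.groupby(df["time"]).sum()
    std = daily_x.std()  # assumption: sample standard deviation (ddof=1)
    # The score is defined as 0 when the standard deviation is 0;
    # treating NaN (a single day) the same way is an assumption of this sketch
    if std == 0 or np.isnan(std):
        return 0.0
    return float(daily_x.mean() / std)


# Hypothetical toy data: two days, two assets per day
toy = pd.DataFrame({
    "time": ["d1", "d1", "d2", "d2"],
    "confidenceValue": [1.0, -1.0, 0.5, 0.0],
    "returns": [0.02, -0.01, 0.04, 0.10],
    "universe": [1, 1, 1, 0],
})
print(competition_score(toy))
```

Because the score divides by the standard deviation, consistent small gains across days are rewarded more than occasional large ones.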



## 2. Load and explore the data 
First, let's import the modules and create an environment. According to the competition rules, we must use the custom `kaggle.competitions.twosigmanews` Python module to import the market and news data.

In [None]:
# Import all the needed packages
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
stop = set(stopwords.words('english'))

from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.metrics import accuracy_score


## Import and explore the training data
In accordance with the competition rules, we import the data with the custom module `twosigmanews`. There are 4,072,956 rows and 16 columns in the market training set and 9,328,750 rows and 35 columns in the news training set.

In [None]:
import numpy as np
import pandas as pd
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()
(market_train_df, news_train_df) = env.get_training_data()

In [None]:
market_train_df.info()

In [None]:
news_train_df.info()

In [None]:
print(market_train_df.time.min())
market_train_df_5years=market_train_df[market_train_df.time>'2011-12-30 22:00:00+0000']


In [None]:
market_train_df_5years.to_csv("market_2011_2016.csv",index=False)

In [None]:
import os
print(os.listdir("../input"))

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

Check for missing values. Four features in the market data have missing values.

In [None]:
# Handle the missing values: fill each affected column with its mean
missing_cols=['returnsClosePrevMktres1',
              'returnsOpenPrevMktres1',
              'returnsClosePrevMktres10',
              'returnsOpenPrevMktres10']

for col in missing_cols:
    market_train_df[col]=market_train_df[col].fillna(market_train_df[col].mean())

In [None]:
import datetime
print(f'The market data start from {market_train_df["time"].min()}, and end on {market_train_df["time"].max()}')

In [None]:
# Clip every return column to the range [-1, 1]
return_terms=[
             "returnsClosePrevRaw1",
             "returnsOpenPrevRaw1",
             "returnsClosePrevMktres1",
             "returnsOpenPrevMktres1",
             "returnsClosePrevRaw10",
             "returnsOpenPrevRaw10",
             "returnsClosePrevMktres10",
             "returnsOpenPrevMktres10",
             "returnsOpenNextMktres10"
             ] 
for  return_term in return_terms:
    market_train_df[return_term]=market_train_df[return_term].clip(-1,1)


There are 9,328,750 rows with 35 features in the news training set.

Next, we check whether there are any missing values in news_train_df. No missing values are found in the news training data.
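
The check itself is not shown in this notebook, so here is a minimal sketch of how it might be done. The helper `missing_report` and the toy frame are hypothetical; on the real data the call would be `missing_report(news_train_df)`.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values for each column that has any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Hypothetical toy frame: column "a" has one missing value, column "b" has none
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})
report = missing_report(toy)
print(report)
```

Note that `isna()` only counts NaN, None and NaT. An empty string in a text column would not be reported as missing.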

In [None]:
print(f'The news data start from {news_train_df["time"].min()}, and end on {news_train_df["time"].max()}')


In [None]:
headlineTag_dic = {k: v for v, k in enumerate(news_train_df['headlineTag'].unique())}
provider_dic = {k: v for v, k in enumerate(news_train_df['provider'].unique())}
marketCommentary_map = {False:0,True:1}
news_drop_col=['time','sourceTimestamp','assetName','headline','subjects','audiences']
#news_train_df=news_train_df.sample(6000000)

In [None]:
import numpy as np
import pandas as pd
# Expand each row into multiple rows, one per asset code listed in expand_column.
def expand(df, expand_column):
    lens = []
    d = {}
    expands_list=[]
    for i,item in df[expand_column].items():
        expand=list(item[2:-2].split("\', \'"))
        lens.append(len(expand))
        expands_list.extend(expand)        
    d[expand_column] = expands_list
    
    #print(len(d[expand_column]))
    #print(np.mean(lens))
    for col in df.columns.values:
        if col != expand_column:
            d[col] = np.repeat(df[col].values, lens)
    return pd.DataFrame(d)
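
To illustrate what this expansion produces, the sketch below does the same thing on toy data with a regular expression and `DataFrame.explode` (available from pandas 0.25). It is an alternative shown for illustration, not the function used in this notebook, and the string format `"{'CODE1', 'CODE2'}"` is an assumption inferred from the slicing and splitting above.

```python
import pandas as pd

# Hypothetical toy news rows; assetCodes is a string that looks like a Python set
toy = pd.DataFrame({
    "assetCodes": ["{'AAPL.O', 'AAPL.OQ'}", "{'MSFT.O'}"],
    "relevance": [1.0, 0.5],
})

# Extract every quoted code into a list, then emit one row per code
toy["assetCodes"] = toy["assetCodes"].str.findall(r"'([^']+)'")
expanded = toy.explode("assetCodes").reset_index(drop=True)
print(expanded)
```

Each news row is repeated once per asset code, so the other columns (here `relevance`) are duplicated across the new rows.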


## 2. Build a model without the content of the news as the baseline. 

**2.1  Feature engineering**

In [None]:
def prep_market(market_df):    
    #print("Deal with the market data")
    market_df=market_df.drop(['assetName'],axis=1)
    market_df["time"]=pd.to_datetime(market_df["time"])
    market_df['time'] = market_df.time.dt.date
    market_df['close_to_open'] = market_df['close'] / market_df['open'] 
    #print(f'The shape of market data is {market_df.shape}')
    return market_df
    
def prep_news(news_df):    
    #print("Deal with the news data")    
    news_df['firstCreated']=pd.to_datetime(news_df['firstCreated'])
    news_df=news_df.drop(news_drop_col,axis=1)
    news_df['headlineTagT'] = news_df['headlineTag'].map(headlineTag_dic)    
    news_df['provider'] = news_df['provider'].map(provider_dic)    
    news_df['marketCommentary'] = news_df['marketCommentary'].map(marketCommentary_map)
    #print('expand')
    news_df=expand(news_df, "assetCodes")
    #print('group')
    #news_df = news_df.groupby(['firstCreated', 'assetCodes'], as_index=False).mean()
    #print(f'The shape of news data is {news_df.shape}')
    return news_df

def group_news(news_df):
    #print('group')
    news_df['firstCreated']=news_df['firstCreated'].dt.date
    news_df = news_df.groupby(['firstCreated', 'assetCodes'], as_index=False).mean()
    return news_df

def merge_market_news(market_df,news_df):    
    market_and_news_df = pd.merge(market_df, news_df, left_on=['time', 'assetCode'], right_on=['firstCreated', 'assetCodes'])   
    return market_and_news_df
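
`pd.merge` performs an inner join by default, so `merge_market_news` keeps only the (day, asset) pairs that have at least one news item. The toy sketch below, with made-up values, contrasts this with a left join. It may help explain why a separate market-only model is trained later for assets without news.

```python
import pandas as pd

# Hypothetical toy data: two assets traded on day d1, news only for asset A
market = pd.DataFrame({"time": ["d1", "d1"], "assetCode": ["A", "B"], "close": [10.0, 20.0]})
news = pd.DataFrame({"firstCreated": ["d1"], "assetCodes": ["A"], "relevance": [0.9]})

keys = dict(left_on=["time", "assetCode"], right_on=["firstCreated", "assetCodes"])

# Default inner join: asset B is dropped because it has no news
inner = pd.merge(market, news, **keys)
# Left join: asset B is kept, with NaN in the news columns
left = pd.merge(market, news, how="left", **keys)

print(len(inner), len(left))
```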

**2.2  Training and testing data splitting**

In [None]:

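# Parse, encode and expand the news so that each row refers to a single asset code
news_train_df = prep_news(news_train_df)
# Split the news into three date ranges so that the grouping runs on smaller chunks
news1 = news_train_df[news_train_df['firstCreated'] <= "2009-12-31"]
news2 = news_train_df[(news_train_df['firstCreated'] <= "2012-12-31") & (news_train_df['firstCreated'] > "2009-12-31")]
news3 = news_train_df[news_train_df['firstCreated'] > "2012-12-31"]
del news_train_df
# Average the news features per day and asset code within each chunk
news1 = group_news(news1)
news2 = group_news(news2)
news3 = group_news(news3)
# sort must be the boolean False; the string 'False' is truthy
news_train_df = pd.concat([news1, news2, news3], sort=False)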


In [None]:
market_train_df=prep_market(market_train_df)
market_news_train_df = merge_market_news(market_train_df, news_train_df)


In [None]:
features_col=['volume', 'close', 'open', 'returnsClosePrevRaw1','returnsOpenPrevRaw1', 
              'returnsClosePrevMktres1','returnsOpenPrevMktres1', 'returnsClosePrevRaw10',
              'returnsOpenPrevRaw10', 'returnsClosePrevMktres10','returnsOpenPrevMktres10', 
              'close_to_open', 'urgency', 'takeSequence', 'provider',
              'bodySize', 'companyCount', 'marketCommentary', 'sentenceCount','wordCount', 
              'firstMentionSentence', 'relevance', 'sentimentClass','sentimentNegative', 
              'sentimentNeutral', 'sentimentPositive','sentimentWordCount', 'noveltyCount12H', 
              'noveltyCount24H','noveltyCount3D', 'noveltyCount5D', 'noveltyCount7D', 'volumeCounts12H',
              'volumeCounts24H', 'volumeCounts3D', 'volumeCounts5D', 'volumeCounts7D','headlineTagT']

features_col_market=['volume', 'close', 'open', 'returnsClosePrevRaw1','returnsOpenPrevRaw1', 
                     'returnsClosePrevMktres1','returnsOpenPrevMktres1', 'returnsClosePrevRaw10',
                     'returnsOpenPrevRaw10', 'returnsClosePrevMktres10','returnsOpenPrevMktres10', 'close_to_open']

In [None]:
UpOrDown = market_news_train_df.returnsOpenNextMktres10 >= 0
UpOrDown = UpOrDown.values
returns = market_news_train_df.returnsOpenNextMktres10.values
X=market_news_train_df[features_col].values
mins_X = np.min(X, axis=0)
maxs_X = np.max(X, axis=0)
range_X = maxs_X - mins_X
X = 1 - ((maxs_X - X) / range_X)
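
The expression above is ordinary min-max scaling, since $1 - \frac{\max - x}{\max - \min} = \frac{x - \min}{\max - \min}$. The sketch below checks this on a made-up array. It also assumes every column has a non-zero range and contains no NaN, because `np.min` and `np.max` propagate NaN.

```python
import numpy as np

# Hypothetical toy feature matrix: three rows, two features
X_toy = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
mins = X_toy.min(axis=0)
maxs = X_toy.max(axis=0)
value_range = maxs - mins

scaled_notebook = 1 - (maxs - X_toy) / value_range  # form used in this notebook
scaled_standard = (X_toy - mins) / value_range      # textbook min-max form

print(np.allclose(scaled_notebook, scaled_standard))
```

Keeping `maxs_X` and `range_X` from the training data matters because the same values are reused to scale the live observations at prediction time.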


In [None]:
UpOrDown_market = market_train_df.returnsOpenNextMktres10 >= 0
UpOrDown_market = UpOrDown_market.values
returns_market = market_train_df.returnsOpenNextMktres10.values
X_market=market_train_df[features_col_market].values
mins_X_market = np.min(X_market, axis=0)
maxs_X_market = np.max(X_market, axis=0)
range_X_market = maxs_X_market - mins_X_market
X_market = 1 - ((maxs_X_market - X_market) / range_X_market)



**2.3  Building an estimator with market and news data**

In [None]:
import lightgbm as lgb
from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, UpOrDown_train, UpOrDown_test, returns_train, returns_test = model_selection.train_test_split(X, UpOrDown, returns, test_size=0.1, random_state=99)

In [None]:
params = {'learning_rate': 0.05, 'max_depth': 5, 'boosting': 'gbdt', 'objective': 'binary', 'metric': 'auc', 'is_training_metric': True, 'seed': 42}
model = lgb.train(params, train_set=lgb.Dataset(X_train, label=UpOrDown_train), num_boost_round=2000,
                  valid_sets=[lgb.Dataset(X_train, label=UpOrDown_train), lgb.Dataset(X_test, label=UpOrDown_test)],
                  verbose_eval=50, early_stopping_rounds=30)


**2.4  Building an estimator with only market data**

In [None]:
X_train_market, X_test_market, UpOrDown_train_market, UpOrDown_test_market, returns_train_market, returns_test_market = model_selection.train_test_split(X_market, UpOrDown_market, returns_market, test_size=0.1, random_state=99)

params = {'learning_rate': 0.05, 'max_depth': 5, 'boosting': 'gbdt', 'objective': 'binary', 'metric': 'auc', 'is_training_metric': True, 'seed': 42}
model_market = lgb.train(params, train_set=lgb.Dataset(X_train_market, label=UpOrDown_train_market), num_boost_round=2000,
                  valid_sets=[lgb.Dataset(X_train_market, label=UpOrDown_train_market), lgb.Dataset(X_test_market, label=UpOrDown_test_market)],
                  verbose_eval=50, early_stopping_rounds=30)

In [None]:
df = pd.DataFrame({'imp': model.feature_importance(), 'col':features_col})
df = df.sort_values(['imp','col'], ascending=[True, False])
df.plot.bar(x='col',y='imp')

In [None]:
df_market = pd.DataFrame({'imp': model_market.feature_importance(), 'col':features_col_market})
df_market = df_market.sort_values(['imp','col'], ascending=[True, False])
df_market.plot.bar(x='col',y='imp')

In [None]:
# Predict using only the market data (disabled; kept for reference)
'''
days = env.get_prediction_days()
import time
n_days = 0
for (market_obs_df, news_obs_df, predictions_template_df) in days:
    n_days +=1
    if n_days % 50 == 0:
        print(n_days,end=' ')    
    market_obs_df=prep_market(market_obs_df)
    market_obs_df = market_obs_df[market_obs_df.assetCode.isin(predictions_template_df.assetCode)]
    X_live_market=market_obs_df[features_col_market].values
    X_live_market=1-((maxs_X_market - X_live_market) / range_X_market)
    lp_market = model_market.predict(X_live_market)
    confidence_market = 2 * lp_market -1
    preds_market= pd.DataFrame({'assetCode':market_obs_df['assetCode'],'confidence':confidence_market})
    predictions_template_df = predictions_template_df.merge(preds_market,how='left').drop('confidenceValue',axis=1).fillna(0).rename(columns={'confidence':'confidenceValue'})
    env.predict(predictions_template_df)

env.write_submission_file()
'''

In [None]:

days = env.get_prediction_days()
import time

n_days = 0


for (market_obs_df, news_obs_df, predictions_template_df) in days:
    n_days +=1
    if n_days % 50 == 0:
        print(n_days,end=' ')
    market_obs_df = market_obs_df[market_obs_df.assetCode.isin(predictions_template_df.assetCode)]
    market_obs_df=prep_market(market_obs_df)

    news_obs_df=prep_news(news_obs_df)
    news_obs_df=group_news(news_obs_df)
    market_news_obs_df = merge_market_news(market_obs_df, news_obs_df)
    assetcode_set=set(market_news_obs_df['assetCode'].values)
    X_live = market_news_obs_df[features_col].values
    X_live = 1 - ((maxs_X - X_live) / range_X)        
    lp = model.predict(X_live)
    confidence = 2 * lp -1
    preds = pd.DataFrame({'assetCode':market_news_obs_df['assetCode'],'confidence':confidence})
    
    market_only_obs_df=market_obs_df[~market_obs_df['assetCode'].isin(assetcode_set)]
    X_live_market_only=market_only_obs_df[features_col_market].values
    X_live_market_only=1-((maxs_X_market - X_live_market_only) / range_X_market)
    lp_market_only = model_market.predict(X_live_market_only)
    confidence_market_only = 2 * lp_market_only -1
    preds_market_only= pd.DataFrame({'assetCode':market_only_obs_df['assetCode'],'confidence':confidence_market_only})
    
    preds_all=pd.concat([preds,preds_market_only],sort=False)
    
    predictions_template_df = predictions_template_df.merge(preds_all,how='left').drop('confidenceValue',axis=1).fillna(0).rename(columns={'confidence':'confidenceValue'})
    env.predict(predictions_template_df)
    
env.write_submission_file()
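
In the loop above, the binary model outputs the probability that the return is non-negative, and `2 * p - 1` maps it linearly from [0, 1] to the required confidence range [-1, 1]. A minimal sketch of that mapping follows; the helper name `to_confidence` and the extra clipping are assumptions added for illustration.

```python
import numpy as np


def to_confidence(prob_up):
    """Map probabilities in [0, 1] to signed confidence values in [-1, 1]."""
    confidence = 2 * np.asarray(prob_up, dtype=float) - 1
    # Clipping is a safeguard against values slightly outside the valid range
    return np.clip(confidence, -1.0, 1.0)


print(to_confidence([0.0, 0.5, 0.9]))
```

A probability of 0.5 gives a confidence of 0, which matches the advice to stay near zero when unsure.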

In [None]:
# We've got a submission file!
import os
print([filename for filename in os.listdir('.') if '.csv' in filename])

