# Two Sigma Financial News Competition Official Getting Started Kernel
## Introduction
In this competition you will predict how stocks will change based on the market state and news articles.  You will loop through a long series of trading days; for each day, you'll receive an updated state of the market, and a series of news articles which were published since the last trading day, along with impacted stocks and sentiment analysis.  You'll use this information to predict whether each stock will have increased or decreased ten trading days into the future.  Once you make these predictions, you can move on to the next trading day. 

This competition is different from most Kaggle Competitions in that:
* You can only submit from Kaggle Kernels, and you may not use other data sources, GPU, or internet access.
* This is a **two-stage competition**.  In Stage One you can edit your Kernels and improve your model, where Public Leaderboard scores are based on their predictions relative to past market data.  At the beginning of Stage Two, your Kernels are locked, and we will re-run your Kernels over the next six months, scoring them based on their predictions relative to live data as those six months unfold.
* You must use our custom **`kaggle.competitions.twosigmanews`** Python module.  The purpose of this module is to control the flow of information to ensure that you are not using future data to make predictions for the current trading day.

In this Starter Kernel, we'll show how to use the **`twosigmanews`** module to get the training data, get test features and make predictions, and write the submission file.
## TL;DR: End-to-End Usage Example
```
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

(market_train_df, news_train_df) = env.get_training_data()
train_my_model(market_train_df, news_train_df)

for (market_obs_df, news_obs_df, predictions_template_df) in env.get_prediction_days():
  predictions_df = make_my_predictions(market_obs_df, news_obs_df, predictions_template_df)
  env.predict(predictions_df)
  
env.write_submission_file()
```
Note that `train_my_model` and `make_my_predictions` are functions you need to write for the above example to work.
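
As a rough illustration of what those two functions could look like, here is a minimal baseline sketch. It is not part of the `twosigmanews` module: the module-level dictionary `_asset_mean_return` and the toy DataFrames are hypothetical stand-ins, and the "model" simply remembers the sign of each asset's mean `returnsOpenNextMktres10`.

```python
import numpy as np
import pandas as pd

# Hypothetical module-level "model" state; a real model would replace this.
_asset_mean_return = {}

def train_my_model(market_train_df, news_train_df):
    # Baseline only: remember the mean target return per asset; news is ignored.
    means = market_train_df.groupby("assetCode")["returnsOpenNextMktres10"].mean()
    _asset_mean_return.clear()
    _asset_mean_return.update(means.to_dict())

def make_my_predictions(market_obs_df, news_obs_df, predictions_template_df):
    predictions_df = predictions_template_df.copy()
    mean_return = predictions_df["assetCode"].map(_asset_mean_return).fillna(0.0)
    # The sign keeps every value inside [-1, 1]; unseen assets get 0.
    predictions_df["confidenceValue"] = np.sign(mean_return.astype(float))
    return predictions_df

# Toy data standing in for the competition DataFrames.
toy_market = pd.DataFrame({
    "assetCode": ["A.N", "A.N", "B.O"],
    "returnsOpenNextMktres10": [0.02, 0.04, -0.01],
})
toy_template = pd.DataFrame({"assetCode": ["A.N", "B.O", "C.N"],
                             "confidenceValue": 0.0})

train_my_model(toy_market, news_train_df=None)
toy_predictions = make_my_predictions(None, None, toy_template)
print(toy_predictions)
```

Because the signatures match the TL;DR loop, these placeholders can be swapped for a real model without changing the loop itself.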

## In-depth Introduction
First let's import the module and create an environment.

In [None]:
import pandas as pd
import gc
import re
import numpy as np

In [None]:
from kaggle.competitions import twosigmanews
# You can only call make_env() once, so don't lose it!
env = twosigmanews.make_env()

## **`get_training_data`** function

Returns the training data DataFrames as a tuple of:
* `market_train_df`: DataFrame with market training data
* `news_train_df`: DataFrame with news training data

These DataFrames contain all market and news data from February 2007 to December 2016.  See the [competition's Data tab](https://www.kaggle.com/c/two-sigma-financial-news/data) for more information on what columns are included in each DataFrame.

In [None]:
(market_train_df, news_train_df) = env.get_training_data()

In [None]:
market_train_df.shape

In [None]:
market_train_df.head()

In [None]:
market_train_df.tail()

In [None]:
market_train_df.columns

In [None]:
market_train_df.dtypes

In [None]:
def add_id(df):
    df["id"] = df.index + 1

In [None]:
add_id(market_train_df)

In [None]:
market_train_df[:1]

In [None]:
def add_confidence(df):
    # TODO change confidence by return proportion
    df["confidence"] = (df["returnsOpenNextMktres10"] > 0).astype(int)

In [None]:
add_confidence(market_train_df)

In [None]:
market_train_df[:1]
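
The TODO in `add_confidence` mentions making the label proportional to the return. One possible way to do that, sketched on toy data, is to divide the return by a scale and clip the result. The function name `add_scaled_confidence`, the column `confidence_scaled`, and the default `scale=0.1` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def add_scaled_confidence(df, scale=0.1):
    # Hypothetical alternative label: returns of +/- `scale` or more map to +/-1,
    # smaller returns map linearly into (-1, 1).
    df["confidence_scaled"] = (df["returnsOpenNextMktres10"] / scale).clip(-1.0, 1.0)

toy = pd.DataFrame({"returnsOpenNextMktres10": [-0.5, -0.05, 0.0, 0.02, 0.3]})
add_scaled_confidence(toy)
print(toy)
```

Unlike the binary label above, this variant lives in the same `[-1.0, 1.0]` range that `predict` expects for `confidenceValue`.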

# plot data distribution

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt

In [None]:
import seaborn as sns

## market train

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=7)

In [None]:
import itertools

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=7, sharex=False, sharey=False, figsize=(20, 60))
market_columns = market_train_df.columns
for idx, ax in enumerate(itertools.chain.from_iterable(axes)):
    if idx == 13:
        break
    col = market_columns[idx+3]
    ax.set_title(col)
    market_train_df.hist(column=col, ax=ax, bins=500)

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=7, sharex=False, sharey=False, figsize=(20, 60))
market_columns = market_train_df.columns
for idx, ax in enumerate(itertools.chain.from_iterable(axes)):
    if idx == 13:
        break
    col = market_columns[idx+3]
    ax.set_title(col)
    market_train_df.boxplot(column=col, ax=ax)

There are many outliers in every feature except `universe`.
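
One common way to tame outliers like these is to clip each column to a pair of quantiles. The sketch below uses a hypothetical helper `clip_outliers` on a toy `volume` column; the 1% and 99% cut-offs are arbitrary choices, not values taken from this kernel.

```python
import numpy as np
import pandas as pd

def clip_outliers(df, columns, lower_q=0.01, upper_q=0.99):
    # Return a copy in which each listed column is clipped to its own quantiles.
    clipped = df.copy()
    for col in columns:
        lower, upper = clipped[col].quantile([lower_q, upper_q])
        clipped[col] = clipped[col].clip(lower, upper)
    return clipped

# Toy column: 99 ordinary values plus one extreme value.
toy = pd.DataFrame({"volume": list(range(1, 100)) + [10_000]})
clipped = clip_outliers(toy, ["volume"])
print(toy["volume"].max(), clipped["volume"].max())
```

Clipping keeps every row, which matters here because each row in the prediction template needs a prediction.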

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=7, sharex=False, sharey=False, figsize=(20, 60))
market_columns = market_train_df.columns
for idx, ax in enumerate(itertools.chain.from_iterable(axes)):
    col = market_columns[idx+3]
    if idx == 13:
        break
    ax.set_title(col)
    market_train_df[col].apply(np.log1p).hist(ax=ax, bins=100)

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=7, sharex=False, sharey=False, figsize=(20, 60))
market_columns = market_train_df.columns
for idx, ax in enumerate(itertools.chain.from_iterable(axes)):
    col = market_columns[idx+3]
    if "returns" not in str(col):
        continue
    if idx == 13:
        break
    ax.set_title(col)
    market_train_df[col].apply(np.log1p).hist(ax=ax, bins=500)

In [None]:
fig, axes = plt.subplots(ncols=2, nrows=7, sharex=False, sharey=False, figsize=(20, 60))
market_columns = market_train_df.columns
for idx, ax in enumerate(itertools.chain.from_iterable(axes)):
    col = market_columns[idx+3]
    if "returns" not in str(col):
        continue
    if idx == 13:
        break
    ax.set_title(col)
    seq = market_train_df[col].apply(lambda x: np.log1p(x))
    seq = (seq - seq.min()) / (seq.max() - seq.min())
    seq.hist(ax=ax, bins=500)
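
Note that `np.log1p` is undefined for inputs below -1, so any such values in the return columns would turn into NaN and drop out of the histograms above. A sign-preserving variant avoids this. The helper name `signed_log1p` is hypothetical, and the toy values are only there to show the symmetry.

```python
import numpy as np
import pandas as pd

def signed_log1p(series):
    # Compress the magnitude with log1p but keep the original sign,
    # so negative inputs stay finite and ordered.
    return np.sign(series) * np.log1p(series.abs())

toy_returns = pd.Series([-3.0, -0.5, 0.0, 0.5, 3.0])
transformed = signed_log1p(toy_returns)
print(transformed)
```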

# NEWS

In [None]:
news_train_df.head()

In [None]:
news_train_df.tail()

In [None]:
news_train_df.columns

In [None]:
news_train_df.dtypes

In [None]:
news_train_df.shape

In [None]:
NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS = [
       'urgency', 'takeSequence', 'provider',
       'bodySize', 'companyCount', 'headlineTag', 'marketCommentary',
       'sentenceCount', 'wordCount', 'assetName',
       'firstMentionSentence', 'relevance', 'sentimentClass',
       'sentimentNegative', 'sentimentNeutral', 'sentimentPositive',
       'sentimentWordCount', 'noveltyCount12H', 'noveltyCount24H',
       'noveltyCount3D', 'noveltyCount5D', 'noveltyCount7D', 'volumeCounts12H',
       'volumeCounts24H', 'volumeCounts3D', 'volumeCounts5D',
       'volumeCounts7D']

In [None]:
news_train_df.headlineTag = news_train_df.headlineTag.astype("category")

In [None]:
news_train_df.sentimentClass = news_train_df.sentimentClass.astype("category")

In [None]:
import seaborn as sns

In [None]:
fig, axes = plt.subplots(ncols=2, nrows= len(NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS) // 2 + int(bool(len(NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS) % 2)), sharex=False, sharey=False, figsize=(20, 60))
for idx, ax in enumerate(itertools.chain.from_iterable(axes)):
    if idx >= len(NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS):
        break
    col = NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS[idx]
    ax.set_title(col)
    seq = news_train_df[col] 
    if seq.dtype.name == 'category':
        seq = seq.value_counts()
    if seq.dtype.name == 'bool':
        seq = seq.astype("int")
    
    sns.distplot(seq, ax=ax)

In [None]:
fig, axes = plt.subplots(ncols=2, nrows= len(NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS) // 2 + int(bool(len(NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS) % 2)), sharex=False, sharey=False, figsize=(20, 60))
for idx, ax in enumerate(itertools.chain.from_iterable(axes)):
    if idx >= len(NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS):
        break
    col = NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS[idx]
    ax.set_title(col)
    seq = news_train_df[col] 
    if seq.dtype.name == 'category' or seq.dtype.name == 'bool':
        continue
    seq = seq.apply(np.log1p)
    sns.distplot(seq, ax=ax)

In [None]:
fig, axes = plt.subplots(ncols=2, nrows= len(NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS) // 2 + int(bool(len(NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS) % 2)), sharex=False, sharey=False, figsize=(20, 60))
for idx, ax in enumerate(itertools.chain.from_iterable(axes)):
    if idx >= len(NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS):
        break
    col = NEWS_NUMERIC_AND_CATEGORICAL_COLUMNS[idx]
    ax.set_title(col)
    seq = news_train_df[col] 
    if seq.dtype.name == 'category' or seq.dtype.name == 'bool':
        continue
    seq = seq.apply(lambda x: np.log1p(np.log1p(x)))
    sns.distplot(seq, ax=ax)

## headline

In [None]:
print("headline max length: {}".format(news_train_df.headline.apply(len).max()))

In [None]:
news_train_df.headline.apply(len).plot.hist()

# fill null values




In [None]:
def replace_null_news(df):
    df.provider = df.provider.cat.add_categories(["UNKNOWN"])
    df.provider.fillna("UNKNOWN", inplace=True)
    df.audiences = df.audiences.cat.add_categories(["UNKNOWN"])
    df.audiences.fillna("UNKNOWN", inplace=True)
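
The `add_categories` call matters because a categorical column only accepts values that are already among its categories. The toy example below shows the same two steps on a small frame; the provider values are made up for illustration.

```python
import pandas as pd

# Toy categorical column with one missing value.
toy_news = pd.DataFrame({"provider": pd.Categorical(["RTRS", None, "BSW"])})

# Register the placeholder as a category first, then fill the gaps with it.
toy_news["provider"] = (
    toy_news["provider"].cat.add_categories(["UNKNOWN"]).fillna("UNKNOWN")
)
print(toy_news["provider"].tolist())
```

Assigning the result back to the column, as done here, avoids relying on `inplace=True` through attribute access.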

# Feature Extraction

# Remove unnecessary columns

In [None]:
def remove_unnecessary_columns(market_df, news_df):
    news_df.drop(['time', 'sourceId', 'sourceTimestamp', 'headline', 
                   "subjects", "audiences"], axis=1, inplace=True)

In [None]:
remove_unnecessary_columns(market_train_df, news_train_df)

# link data and news 

## Check assetName links

In [None]:
#market_train_df.duplicated(subset=["time", "assetName"]).sum()

In [None]:
# asset_names_in_markets = set(market_df.assetName.unique().tolist())
# print("asset : {}".format(len(asset_names_in_markets)))
# asset_names_in_news = set(news_df.assetName.unique().tolist())
# asset_names_in_news_size = len(asset_names_in_news)
# len(asset_names_in_news - asset_names_in_markets)
# asset_names_not_in_news = asset_names_in_markets - asset_names_in_news
# asset_names_not_in_news
# len(asset_names_not_in_news)
# There are 73 assets in the market data that have no news in the news df.
# asset_names_not_in_market = asset_names_in_news - asset_names_in_markets
# list(asset_names_not_in_market)[:100]
# len(asset_names_not_in_market)
# asset_names_in_both_market_and_news = asset_names_in_news & asset_names_in_markets
# len(asset_names_in_both_market_and_news)
# Only 3483 of the 8902 asset names in the news data have linked market information.
# TODO drop if assetCode is null

In [None]:
MAX_DAY_DIFF = 3
# Strips the braces and quotes from the assetCodes strings before splitting them.
MULTIPLE_CODES_PATTERN = re.compile(r"[{}']")
import itertools
def link_data_and_news(market_df, news_df):
    assetCodes_in_markets = market_df.assetCode.unique()
    print("assetCodes pattern in markets: {}".format(len(assetCodes_in_markets)))
    assetCodes_in_news = news_df.assetCodes.unique()
    assetCodes_in_news_size = len(assetCodes_in_news)
    print("assetCodes pattern in news: {}".format(assetCodes_in_news_size))
    parse_multiple_codes = lambda codes: MULTIPLE_CODES_PATTERN.sub("", str(codes)).split(", ")
    parsed_assetCodes_in_news = [parse_multiple_codes(codes) for codes in assetCodes_in_news]
    # len(max(parsed_assetCodes_in_news, key=lambda x: len(x)))
    # Flatten the parsed lists (not the raw strings) to get the distinct codes.
    all_assetCode_type_in_news = list(set(itertools.chain.from_iterable(parsed_assetCodes_in_news)))
    # check linking: pair each raw news assetCodes string with every market assetCode it contains
    links_assetCodes = [[[raw_codes, market_assetCode] for parsed_codes, raw_codes in zip(parsed_assetCodes_in_news, assetCodes_in_news) if str(market_assetCode) in parsed_codes] for market_assetCode in assetCodes_in_markets]
    links_assetCodes = list(itertools.chain.from_iterable(links_assetCodes))
    print("links for assetCodes: {}".format(len(links_assetCodes)))
    links_assetCodes = pd.DataFrame(links_assetCodes, columns=["newsAssetCodes", "marketAssetCode"], dtype='category')

    ## check date linking
    news_df["firstCreatedDate"] = news_df.firstCreated.dt.date
    market_df["date"] = market_df.time.dt.date

    working_dates = news_df.firstCreatedDate.unique().astype(np.datetime64)
    working_dates.sort()
    market_dates = market_df.date.unique().astype(np.datetime64)
    market_dates.sort()


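    def find_prev_date(date):
        # Walk back up to MAX_DAY_DIFF days until we reach a day that has news.
        for diff_day in range(1, MAX_DAY_DIFF + 1):
            prev_date = date - np.timedelta64(diff_day, 'D')
            # searchsorted returns an insertion index, so confirm an exact match.
            pos = np.searchsorted(working_dates, prev_date)
            if pos < len(working_dates) and working_dates[pos] == prev_date:
                return prev_date
        return np.datetime64('NaT')

    # Apply the lookup per market date (apply_along_axis would pass the whole array at once).
    prev_news_days_for_market_day = np.array([find_prev_date(d) for d in market_dates], dtype=market_dates.dtype)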

    prev_news_days_for_market_day[:10]

    prev_news_days_for_market_day[-10:]

    date_df = pd.DataFrame(columns=["date", "prevDate"])

    date_df.date = market_dates

    date_df.prevDate = prev_news_days_for_market_day

    date_df[:10]

    date_df.dtypes

    market_df.date = market_df.date.astype(np.datetime64)

    market_df.dtypes

    market_df.date.dtype

    market_df = market_df.merge(date_df, left_on="date", right_on="date")

    market_df[:10]

    del date_df
    gc.collect()

    ## merge assetCodes links

    market_df = market_df.merge(links_assetCodes, left_on="assetCode", right_on="marketAssetCode")

    market_df[:10]

    market_df.drop(["marketAssetCode"], axis=1, inplace=True)

    del links_assetCodes
    gc.collect()
    ## merge market and news

    news_df.firstCreatedDate = news_df.firstCreatedDate.astype(np.datetime64)

    news_df.columns

    news_df[:10]

    #news_time_link_df = news_df[["sourceId", "assetCodes", "firstCreated", "firstCreatedDate"]]

    #news_time_link_df.loc[:, "market_id"] = None

    #news_time_link_df[:1]

    #market_df.columns

    #market_time_link_df = market_df[["id", "time", "assetCode", "date", "prevDate"]]

    # type(news_time_link_df.index.values.tolist()[0])


    gc.collect()


    market_df_today_news = market_df.merge(news_df, left_on=["newsAssetCodes", "date"], 
                                           right_on=["assetCodes", "firstCreatedDate"])

    market_df_today_news.shape

    market_df_today_news[:1]

    # remove news after market obs
    market_df_today_news = market_df_today_news[market_df_today_news["time"] > market_df_today_news["firstCreated"]]

    market_df_today_news.shape

    market_df_today_news.sort_values(by=["firstCreated"], inplace=True)

    market_df_today_news[:10]

    # only leave latest news
    market_df_today_news.drop_duplicates(subset=["id"], keep="last", inplace=True)

    market_df_today_news[:10]

    market_df_today_news.shape

    gc.collect()

    market_df_prev_day_news = market_df.merge(news_df, left_on=["newsAssetCodes", "prevDate"], 
                                           right_on=["assetCodes", "firstCreatedDate"])

    market_df_prev_day_news.shape

    market_df_prev_day_news.sort_values(by=["firstCreated"], inplace=True)

    # only leave latest news
    market_df_prev_day_news.drop_duplicates(subset=["id"], keep="last", inplace=True)

    market_df_prev_day_news.shape

    del market_df

    gc.collect()

    market_df = pd.concat([market_df_prev_day_news, market_df_today_news]).sort_values(["firstCreated"])

    market_df.shape

    del market_df_prev_day_news

    del market_df_today_news

    gc.collect()

    market_df.drop_duplicates(subset=["id"], keep="last", inplace=True)

    return market_df

In [None]:
market_train_df = link_data_and_news(market_train_df, news_train_df)
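
For comparison, the core of the asset-code linking step can be sketched more compactly with `DataFrame.explode`. This is an alternative illustration on toy data, not the method used above: the helper `explode_asset_codes` and the asset codes are hypothetical, and it assumes `assetCodes` holds strings that the same regex cleanup can split.

```python
import re
import pandas as pd

def explode_asset_codes(news_df):
    # One row per (news item, asset code), using the same string cleanup as above.
    exploded = news_df.copy()
    exploded["assetCode"] = exploded["assetCodes"].astype(str).apply(
        lambda codes: re.sub(r"[{}']", "", codes).split(", "))
    return exploded.explode("assetCode")

# Toy frames standing in for the news and market data.
toy_news = pd.DataFrame({
    "assetCodes": ["{'AAA.N', 'AAA.O'}", "{'BBB.N'}"],
    "firstCreated": pd.to_datetime(["2016-01-04 09:00", "2016-01-04 10:00"]),
})
toy_market = pd.DataFrame({
    "assetCode": ["AAA.N", "BBB.N", "CCC.N"],
    "time": pd.to_datetime(["2016-01-04 22:00"] * 3),
})

# A left merge keeps market rows that have no news at all.
linked = toy_market.merge(explode_asset_codes(toy_news), on="assetCode", how="left")
print(linked)
```

The merges in `link_data_and_news` use the default inner join, so market rows without any linked news are dropped there; whether that is desirable depends on how the rows are used later.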

In [None]:
del news_train_df

In [None]:
gc.collect()

In [None]:
# this code is very slow
# def link_latest_news(row, date_col):
#     predict_time = row["time"]
#     predict_date = row[date_col]
#     asset_code = row["assetCode"]
#     market_id = row["id"]
    
#     latest_news_df = news_time_link_df[news_time_link_df["firstCreatedDate"] == predict_date][["assetCodes"]]
#     latest_news_df = latest_news_df[latest_news_df.assetCodes.apply(lambda codes: asset_code in re.sub(r"[{}'']", "", str(codes)).split(", "))]
#     news_time_link_df.iloc[latest_news_df.index.values.tolist(), -1] = market_id

In [None]:
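# Intentional stop so that "Run All" halts here (`Error` is not a defined name).
raise RuntimeError("intentional stop")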

## `get_prediction_days` function

Generator which loops through each "prediction day" (trading day) and provides all market and news observations which occurred since the last data you've received.  Once you call **`predict`** to make your future predictions, you can continue on to the next prediction day.

Yields:
* While there are more prediction day(s) and `predict` was called successfully since the last yield, yields a tuple of:
    * `market_observations_df`: DataFrame with market observations for the next prediction day.
    * `news_observations_df`: DataFrame with news observations for the next prediction day.
    * `predictions_template_df`: DataFrame with `assetCode` and `confidenceValue` columns, prefilled with `confidenceValue = 0`, to be filled in and passed back to the `predict` function.
* If `predict` has not been called since the last yield, yields `None`.

In [None]:
# You can only iterate through a result from `get_prediction_days()` once
# so be careful not to lose it once you start iterating.
days = env.get_prediction_days()

In [None]:
(market_obs_df, news_obs_df, predictions_template_df) = next(days)

In [None]:
market_obs_df.head()

In [None]:
market_obs_df.columns

In [None]:
market_obs_df.time.max()

In [None]:
market_obs_df.time.min()

In [None]:
news_obs_df.head()

In [None]:
news_obs_df.time.max()

In [None]:
news_obs_df.time.min()

In [None]:
predictions_template_df.head()

Note that we'll get an error if we try to continue on to the next prediction day without making our predictions for the current day.

In [None]:
next(days)

### **`predict`** function
Stores your predictions for the current prediction day.  Expects the same format as you saw in `predictions_template_df` returned from `get_prediction_days`.

Args:
* `predictions_df`: DataFrame which must have the following columns:
    * `assetCode`: The market asset.
    * `confidenceValue`: Your confidence whether the asset will increase or decrease in 10 trading days.  All values must be in the range `[-1.0, 1.0]`.

The `predictions_df` you send **must** contain the exact set of rows which were given to you in the `predictions_template_df` returned from `get_prediction_days`.  The `predict` function does not validate this, but if you are missing any `assetCode`s or add any extraneous `assetCode`s, then your submission will fail.
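
Since `predict` does not validate the rows, a small local check before calling it can catch mistakes early. The helper below is a hypothetical sketch, not part of the `twosigmanews` module; it only relies on the `assetCode` and `confidenceValue` columns described above.

```python
import pandas as pd

def check_predictions(predictions_df, predictions_template_df):
    # Return a list of human-readable problems; an empty list means the frame looks fine.
    problems = []
    expected = set(predictions_template_df["assetCode"])
    actual = set(predictions_df["assetCode"])
    if actual != expected:
        problems.append("assetCode set differs from the template")
    if predictions_df["assetCode"].duplicated().any():
        problems.append("duplicate assetCode rows")
    values = predictions_df["confidenceValue"]
    if values.isna().any() or not values.between(-1.0, 1.0).all():
        problems.append("confidenceValue missing or outside [-1.0, 1.0]")
    return problems

# Toy template and two candidate prediction frames.
toy_template = pd.DataFrame({"assetCode": ["A.N", "B.O"], "confidenceValue": 0.0})
good = toy_template.assign(confidenceValue=[0.3, -1.0])
bad = pd.DataFrame({"assetCode": ["A.N", "C.N"], "confidenceValue": [0.3, 1.5]})

print(check_predictions(good, toy_template))
print(check_predictions(bad, toy_template))
```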

Let's make random predictions for the first day:

In [None]:
raise ValueError()

In [None]:
import numpy as np
def make_random_predictions(predictions_df):
    predictions_df.confidenceValue = 2.0 * np.random.rand(len(predictions_df)) - 1.0

In [None]:
make_random_predictions(predictions_template_df)
env.predict(predictions_template_df)

Now we can continue on to the next prediction day and make another round of random predictions for it:

In [None]:
(market_obs_df, news_obs_df, predictions_template_df) = next(days)

In [None]:
market_obs_df.head()

In [None]:
news_obs_df.time.max()

In [None]:
news_obs_df.head()

In [None]:
news_obs_df.time.max()

In [None]:
news_obs_df.time.min()

In [None]:
predictions_template_df.head()

In [None]:
make_random_predictions(predictions_template_df)
env.predict(predictions_template_df)

In [None]:
(market_obs_df, news_obs_df, predictions_template_df) = next(days)

In [None]:
market_obs_df.time.max()

In [None]:
market_obs_df.time.min()

In [None]:
news_obs_df.head()

In [None]:
news_obs_df.time.max()

In [None]:
news_obs_df.time.min()

In [None]:
make_random_predictions(predictions_template_df)
env.predict(predictions_template_df)

In [None]:
(market_obs_df, news_obs_df, predictions_template_df) = next(days)

In [None]:
market_obs_df.time.max()

In [None]:
market_obs_df.time.min()

In [None]:
news_obs_df.head()

In [None]:
news_obs_df.time.max()

In [None]:
news_obs_df.time.min()

In [None]:
make_random_predictions(predictions_template_df)
env.predict(predictions_template_df)

In [None]:
(market_obs_df, news_obs_df, predictions_template_df) = next(days)

In [None]:
market_obs_df.time.max()

In [None]:
market_obs_df.time.min()

In [None]:
news_obs_df.head()

In [None]:
news_obs_df.time.max()

In [None]:
news_obs_df.time.min()

## Main Loop
Let's loop through all the days and make our random predictions.  The `days` generator (returned from `get_prediction_days`) will simply stop returning values once you've reached the end.

In [None]:
# for (market_obs_df, news_obs_df, predictions_template_df) in days:
#     make_random_predictions(predictions_template_df)
#     env.predict(predictions_template_df)
# print('Done!')

## **`write_submission_file`** function

Writes your predictions to a CSV file (`submission.csv`) in the current working directory.

In [None]:
#env.write_submission_file()

In [None]:
# We've got a submission file!
import os
print([filename for filename in os.listdir('.') if '.csv' in filename])

As indicated by the helper message, calling `write_submission_file` on its own does **not** make a submission to the competition.  It merely tells the module to write the `submission.csv` file as part of the Kernel's output.  To make a submission to the competition, you'll have to **Commit** your Kernel and find the generated `submission.csv` file in that Kernel Version's Output tab (note this is _outside_ of the Kernel Editor), then click "Submit to Competition".  When we re-run your Kernel during Stage Two, we will run the Kernel Version (generated when you hit "Commit") linked to your chosen Submission.

## Restart the Kernel to run your code again
In order to combat cheating, you are only allowed to call `make_env` or iterate through `get_prediction_days` once per Kernel run.  However, while you're iterating on your model it's reasonable to try something out, change the model a bit, and try it again.  Unfortunately, if you try to simply re-run the code, or even refresh the browser page, you'll still be running on the same Kernel execution session you had been running before, and the `twosigmanews` module will still throw errors.  To get around this, you need to explicitly restart your Kernel execution session, which you can do by pressing the Restart button in the Kernel Editor's bottom Console tab:
![Restart button](https://i.imgur.com/hudu8jF.png)