# Two Sigma Financial News Competition Official Getting Started Kernel
## Introduction
In this competition you will predict how stocks will change based on the market state and news articles.  You will loop through a long series of trading days; for each day, you'll receive an updated state of the market, and a series of news articles which were published since the last trading day, along with impacted stocks and sentiment analysis.  You'll use this information to predict whether each stock will have increased or decreased ten trading days into the future.  Once you make these predictions, you can move on to the next trading day. 

This competition is different from most Kaggle Competitions in that:
* You can only submit from Kaggle Kernels, and you may not use other data sources, GPU, or internet access.
* This is a **two-stage competition**.  In Stage One you can edit your Kernels and improve your model, where Public Leaderboard scores are based on their predictions relative to past market data.  At the beginning of Stage Two, your Kernels are locked, and we will re-run your Kernels over the next six months, scoring them based on their predictions relative to live data as those six months unfold.
* You must use our custom **`kaggle.competitions.twosigmanews`** Python module.  The purpose of this module is to control the flow of information to ensure that you are not using future data to make predictions for the current trading day.

In this Starter Kernel, we'll show how to use the **`twosigmanews`** module to get the training data, get test features and make predictions, and write the submission file.
## TL;DR: End-to-End Usage Example
```
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

(market_train_df, news_train_df) = env.get_training_data()
train_my_model(market_train_df, news_train_df)

for (market_obs_df, news_obs_df, predictions_template_df) in env.get_prediction_days():
  predictions_df = make_my_predictions(market_obs_df, news_obs_df, predictions_template_df)
  env.predict(predictions_df)
  
env.write_submission_file()
```
Note that `train_my_model` and `make_my_predictions` are functions you need to write for the above example to work.
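As a rough illustration of what those two functions might look like, here is a minimal sketch on toy data. It is not part of the competition API: the "model" is just a scale factor, the rule (rescale and clip the previous-day return) is an assumption chosen for brevity, and `make_my_predictions` takes an extra `model` argument that the snippet above does not have. Only the column names `assetCode`, `confidenceValue`, and `returnsClosePrevMktres1` come from the competition data.

```python
import numpy as np
import pandas as pd

def train_my_model(market_train_df, news_train_df):
    """Toy 'model': one scale factor so that typical returns map into [-1, 1]."""
    scale = market_train_df["returnsClosePrevMktres1"].abs().quantile(0.95)
    return {"scale": float(scale) if scale > 0 else 1.0}

def make_my_predictions(model, market_obs_df, news_obs_df, predictions_template_df):
    """Fill confidenceValue with a rescaled, clipped previous-day return."""
    signal = market_obs_df.set_index("assetCode")["returnsClosePrevMktres1"]
    predictions_df = predictions_template_df.copy()
    # Align the signal to the template rows; missing values become a neutral 0
    aligned = signal.reindex(predictions_df["assetCode"]).fillna(0.0).to_numpy()
    predictions_df["confidenceValue"] = np.clip(aligned / model["scale"], -1.0, 1.0)
    return predictions_df

# Hypothetical toy observations standing in for one prediction day
market = pd.DataFrame({
    "assetCode": ["A.N", "B.N", "C.N"],
    "returnsClosePrevMktres1": [0.02, -0.05, np.nan],
})
news = pd.DataFrame()  # unused by this toy rule
template = pd.DataFrame({"assetCode": ["A.N", "B.N", "C.N"], "confidenceValue": 0.0})

model = train_my_model(market, news)
preds = make_my_predictions(model, market, news, template)
print(preds)
```

The returned frame keeps the template's rows and order, which is what `env.predict` expects.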

In [None]:
import pandas as pd
import gc
import re
import numpy as np

In [None]:
from kaggle.competitions import twosigmanews
# You can only call make_env() once, so don't lose it!
env = twosigmanews.make_env()

## **`get_training_data`** function

Returns the training data DataFrames as a tuple of:
* `market_train_df`: DataFrame with market training data
* `news_train_df`: DataFrame with news training data

These DataFrames contain all market and news data from February 2007 to December 2016.  See the [competition's Data tab](https://www.kaggle.com/c/two-sigma-financial-news/data) for more information on what columns are included in each DataFrame.

In [None]:
(market_train_df, news_train_df) = env.get_training_data()

In [None]:
market_train_df.shape

In [None]:
market_train_df.head()

In [None]:
market_train_df.tail()

In [None]:
market_train_df.describe()

In [None]:
news_train_df.head()

In [None]:
news_train_df.tail()

In [None]:
def add_id(df, id_name):
    df[id_name] = df.index.astype("int32") + 1

## compress dtypes

In [None]:
def compress_dtypes(news_df):
    for col, dtype in zip(news_df.columns, news_df.dtypes):
        if dtype == np.dtype('float64'):
            news_df[col] = news_df[col].astype("float32")
        if dtype == np.dtype('int64'):
            news_df[col] = news_df[col].astype("int32")
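To see what this downcasting buys, the sketch below applies the same idea to a small made-up DataFrame and compares memory usage before and after. The toy column values are assumptions for illustration. Note the trade-off: `float32` keeps roughly seven significant digits and `int32` overflows beyond about 2.1 billion, so this is only safe when the data fits those limits.

```python
import numpy as np
import pandas as pd

def compress_dtypes(df):
    """Downcast 64-bit numeric columns to 32-bit, in place."""
    for col, dtype in zip(df.columns, df.dtypes):
        if dtype == np.dtype("float64"):
            df[col] = df[col].astype("float32")
        elif dtype == np.dtype("int64"):
            df[col] = df[col].astype("int32")

# Toy frame with one float64, one int64, and one text column
toy_df = pd.DataFrame({
    "sentimentNegative": np.random.rand(1000),
    "wordCount": np.arange(1000, dtype="int64"),
    "headline": ["x"] * 1000,
})

before = toy_df.memory_usage(deep=True).sum()
compress_dtypes(toy_df)
after = toy_df.memory_usage(deep=True).sum()
print("bytes before: {}, after: {}".format(before, after))
```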

In [None]:
compress_dtypes(news_train_df)

In [None]:
news_train_df.dtypes

In [None]:
news_train_df.tail()

# add necessary info

In [None]:
MARKET_ID = "id"
NEWS_ID = "news_id"

In [None]:
def add_ids(market_df, news_df):
    add_id(market_df, MARKET_ID)
    add_id(news_df, NEWS_ID)

In [None]:
add_ids(market_train_df, news_train_df)

In [None]:
market_train_df["id"].max()

In [None]:
news_train_df["news_id"].max()

In [None]:
market_train_df[:1]

In [None]:
def add_confidence(df):
    # TODO change confidence by return proportion
    df["confidence"] = df["returnsOpenNextMktres10"] >= 0

In [None]:
add_confidence(market_train_df)

In [None]:
market_train_df[:10]

In [None]:
market_train_df[:1]

In [None]:
market_train_df.shape

In [None]:
market_train_df.id.tail()

# fill missing values

In [None]:
market_train_df.isnull().sum(axis=0)

`returnsClosePrevMktres1`, `returnsOpenPrevMktres1`, `returnsClosePrevMktres10`, and `returnsOpenPrevMktres10` aren't used by this model, so I ignore their missing values.

In [None]:
news_train_df.isnull().sum(axis=0)

In [None]:
# empty string check
for col, dtype in zip(news_train_df.columns, news_train_df.dtypes):
    if dtype == np.dtype('O'):
        n_empty = (news_train_df[col]=="").sum()
        print("empty value in {}: {}".format(col, n_empty))

In [None]:
news_train_df.headlineTag.value_counts()

The `headlineTag` values look categorical, so I convert the column to a categorical dtype.

In [None]:
def fill_missing_value_news_df(news_df):
    # Assign back instead of using inplace on an attribute, which may act on a copy
    news_df["headlineTag"] = news_df["headlineTag"].replace("", "UNKNOWN")

In [None]:
fill_missing_value_news_df(news_df=news_train_df)

In [None]:
def to_category_news_df(news_df):
    news_df.headlineTag = news_df.headlineTag.astype('category')

In [None]:
to_category_news_df(news_train_df)

In [None]:
news_train_df.dtypes

# Feature Extraction

## news

In [None]:
categorical_features = ["provider", "subjects", "audiences", "headlineTag"]

In [None]:
def encode_categorical_fields(news_df):
    categories = []
    for cat_column in categorical_features:
        categories.append(news_df[cat_column].cat.categories)
        news_df[cat_column] = news_df[cat_column].cat.codes
    return categories

In [None]:
news_categories = encode_categorical_fields(news_train_df)

In [None]:
news_categories
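The encoding above replaces each categorical column with its integer codes and keeps the category lists so the same mapping can be reapplied to later data. The sketch below shows that round trip on a made-up tag column (the tag values are illustrative, not taken from the dataset). A value that was not seen during training has no category and is encoded as `-1`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Training side: fill empty tags, convert to category, keep the category list
train_tags = (
    pd.Series(["", "BRIEF", "UPDATE 1", "BRIEF"])
    .replace("", "UNKNOWN")
    .astype("category")
)
categories = train_tags.cat.categories      # sorted unique values
train_codes = train_tags.cat.codes          # integer code per row

# Prediction side: reuse the training categories so codes stay consistent
new_tags = pd.Series(["UPDATE 1", "FEATURE", "BRIEF"])
new_codes = new_tags.astype(CategoricalDtype(categories=categories)).cat.codes

print(list(categories))
print(train_codes.tolist())
print(new_codes.tolist())   # the unseen "FEATURE" tag becomes -1
```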

### headline

In [None]:
#from gensim.sklearn_api import D2VTransformer

In [None]:
RANDOM_SEED = 10

In [None]:
#head_line_d2vec_model = D2VTransformer(min_count=5, size=50, workers=2, seed=RANDOM_SEED)

In [None]:
# %%time
# # more sophisticated cleaning
# head_line_d2vec_feature = head_line_d2vec_model.fit_transform([text.split() for text in news_train_df.headline.tolist()])

In [None]:
#head_line_d2vec_feature[:10]

In [None]:
import joblib

In [None]:
#HEADLINE_DOC2VEC_MODEL = "./headline_doc2vec_model.pickle"

#joblib.dump(head_line_d2vec_model, HEADLINE_DOC2VEC_MODEL)

#del head_line_d2vec_model

In [None]:
gc.collect()

In [None]:
import fastText

In [None]:
FASTTEXT_MODEL_PATH = "./fasttext_headline.model"

from pathlib import Path

TEMP_HEAD_LINE_FILE = Path("./headline.txt")

news_train_df.headline.to_csv(TEMP_HEAD_LINE_FILE, index=False, encoding="utf-8")

In [None]:
%%time
head_line_fastText_model = fastText.train_unsupervised(str(TEMP_HEAD_LINE_FILE), dim=50)

In [None]:
TEMP_HEAD_LINE_FILE.unlink()

In [None]:
def extract_headline_fastText(news_df, fastText_model):
    feature = news_df.headline.apply(fastText_model.get_sentence_vector)
    return np.vstack(feature).astype("float16")

In [None]:
%%time
head_line_fastText_feature = extract_headline_fastText(news_train_df, head_line_fastText_model)

In [None]:
head_line_fastText_feature.shape

In [None]:
head_line_fastText_model.save_model(FASTTEXT_MODEL_PATH)

In [None]:
del head_line_fastText_model

In [None]:
gc.collect()

In [None]:
NEWS_FEATURE_SAVE_FILE = "news_features.npz"

In [None]:
np.savez_compressed(NEWS_FEATURE_SAVE_FILE, headline_fastText=head_line_fastText_feature)

In [None]:
del head_line_fastText_feature

In [None]:
gc.collect()

# remove unnecessary columns

In [None]:
def remove_unnecessary_columns(market_df, news_df):
    #market_df.drop(["returnsOpenNextMktres10", "universe"], axis=1, inplace=True)
    news_df.drop(['time', 'sourceId', 'sourceTimestamp', 'headline'], axis=1, inplace=True)

In [None]:
def remove_unnecessary_columns_train(market_df, news_df):
    market_df.drop(["returnsOpenNextMktres10", "universe"], axis=1, inplace=True)
    remove_unnecessary_columns(market_df, news_df)

In [None]:
remove_unnecessary_columns_train(market_train_df, news_train_df)

In [None]:
gc.collect()

# link data and news 

## check assetCode links

In [None]:
MAX_DAY_DIFF = 3
MULTIPLE_CODES_PATTERN = re.compile(r"[{}'']")
import itertools
def link_data_and_news(market_df, news_df):
    assetCodes_in_markests = market_df.assetCode.unique()
    print("assetCodes pattern in markets: {}".format(len(assetCodes_in_markests)))
    assetCodes_in_news = news_df.assetCodes.unique()
    assetCodes_in_news_size = len(assetCodes_in_news)
    print("assetCodes pattern in news: {}".format(assetCodes_in_news_size))
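    # Reuse the compiled pattern to strip braces and quotes from the raw assetCodes string
    parse_multiple_codes = lambda codes: MULTIPLE_CODES_PATTERN.sub("", str(codes)).split(", ")
    parsed_assetCodes_in_news = [parse_multiple_codes(codes) for codes in assetCodes_in_news]
    # len(max(parsed_assetCodes_in_news, key=lambda x: len(x)))
    # Chain over the parsed lists (chaining the raw strings would yield single characters)
    all_assetCode_type_in_news = list(set(itertools.chain.from_iterable(parsed_assetCodes_in_news)))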
    # check linking
    links_assetCodes = [[[raw_codes, market_assetCode] for parsed_codes, raw_codes in zip(parsed_assetCodes_in_news, assetCodes_in_news) if str(market_assetCode) in parsed_codes] for market_assetCode in assetCodes_in_markests]
    links_assetCodes = list(itertools.chain.from_iterable(links_assetCodes))
    print("links for assetCodes: {}".format(len(links_assetCodes)))
    links_assetCodes = pd.DataFrame(links_assetCodes, columns=["newsAssetCodes", "marketAssetCode"], dtype='category')

    ## check date linking
    news_df["firstCreatedDate"] = news_df.firstCreated.dt.date
    market_df["date"] = market_df.time.dt.date

    working_dates = news_df.firstCreatedDate.unique().astype(np.datetime64)
    working_dates.sort()
    market_dates = market_df.date.unique().astype(np.datetime64)
    market_dates.sort()


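    def find_prev_date(date):
        # Closest earlier date (within MAX_DAY_DIFF days) that has news, else NaT
        for diff_day in range(1, MAX_DAY_DIFF + 1):
            prev_date = date - np.timedelta64(diff_day, 'D')
            # working_dates is sorted, so searchsorted gives the candidate position
            idx = np.searchsorted(working_dates, prev_date)
            if idx < len(working_dates) and working_dates[idx] == prev_date:
                return prev_date
        return np.datetime64("NaT")

    # Apply per date; apply_along_axis on a 1-D array would pass the whole array at once
    prev_news_days_for_market_day = np.array([find_prev_date(date) for date in market_dates])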

    date_df = pd.DataFrame(columns=["date", "prevDate"])
    date_df.date = market_dates
    
    date_df.prevDate = prev_news_days_for_market_day

    market_df.date = market_df.date.astype(np.datetime64)

    market_df = market_df.merge(date_df, left_on="date", right_on="date")

    market_df[:10]

    del date_df
    gc.collect()

    ## merge assetCodes links

    market_df = market_df.merge(links_assetCodes, left_on="assetCode", right_on="marketAssetCode")

    market_df[:10]

    market_df.drop(["marketAssetCode"], axis=1, inplace=True)

    del links_assetCodes
    gc.collect()
    ## merge market and news

    news_df.firstCreatedDate = news_df.firstCreatedDate.astype(np.datetime64)


    market_df_today_news = market_df.merge(news_df, left_on=["newsAssetCodes", "date"], 
                                           right_on=["assetCodes", "firstCreatedDate"])

    # remove news after market obs
    market_df_today_news = market_df_today_news[market_df_today_news["time"] > market_df_today_news["firstCreated"]]

    market_df_today_news.shape

    market_df_today_news.sort_values(by=["firstCreated"], inplace=True)

    # only leave latest news
    market_df_today_news.drop_duplicates(subset=["id"], keep="last", inplace=True)

    gc.collect()

    market_df_prev_day_news = market_df.merge(news_df, left_on=["newsAssetCodes", "prevDate"], 
                                           right_on=["assetCodes", "firstCreatedDate"])

    market_df_prev_day_news.sort_values(by=["firstCreated"], inplace=True)

    # only leave latest news
    market_df_prev_day_news.drop_duplicates(subset=["id"], keep="last", inplace=True)

    del market_df
    del news_df

    gc.collect()

    market_df = pd.concat([market_df_prev_day_news, market_df_today_news]).sort_values(["firstCreated"])

    del market_df_prev_day_news

    del market_df_today_news

    gc.collect()

    market_df.drop_duplicates(subset=["id"], keep="last", inplace=True)
    market_df.drop(["assetCode", "date", "prevDate", "newsAssetCodes", "assetName_x", "assetCodes", "assetName_y", 
                     "firstCreated", "firstCreatedDate"], axis=1, inplace=True)
    gc.collect()
    print("linking done")
    return market_df

In [None]:
%%time
market_train_df = link_data_and_news(market_train_df, news_train_df)
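The asset-code part of this linking can be hard to follow inside the long function, so here is a small standalone sketch of the idea on made-up data. It is not the function's exact method: it swaps the nested list comprehension for `DataFrame.explode`, and the helper name `parse_asset_codes` and the toy rows are assumptions for illustration. Each raw `assetCodes` string from the news data is split into individual codes, and a market row is linked to a news row when its `assetCode` appears among them.

```python
import re
import pandas as pd

CODES_PATTERN = re.compile(r"[{}']")

def parse_asset_codes(raw):
    """Turn "{'AAPL.O', 'AAPL.OQ'}" into ['AAPL.O', 'AAPL.OQ']."""
    return CODES_PATTERN.sub("", str(raw)).split(", ")

# Toy stand-ins for the news and market frames
news = pd.DataFrame({
    "assetCodes": ["{'AAPL.O', 'AAPL.OQ'}", "{'MSFT.O'}", "{'XYZ.N'}"],
    "headline": ["h1", "h2", "h3"],
})
market = pd.DataFrame({"assetCode": ["AAPL.O", "MSFT.O"], "close": [100.0, 50.0]})

# One row per (raw assetCodes string, individual code)
links = (
    news[["assetCodes"]]
    .drop_duplicates()
    .assign(marketAssetCode=lambda d: d["assetCodes"].map(parse_asset_codes))
    .explode("marketAssetCode")
)

# Market row -> raw assetCodes string -> news row
linked = (
    market.merge(links, left_on="assetCode", right_on="marketAssetCode")
    .merge(news, on="assetCodes")
)
print(linked[["assetCode", "headline"]])
```

News about `XYZ.N` is dropped because no market row carries that code, which mirrors the inner joins used in `link_data_and_news`.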

In [None]:
del news_train_df

In [None]:
gc.collect()

In [None]:
market_train_df.columns

In [None]:
market_train_df.sort_values(by=["time"], inplace=True)

# convert into trainable form

In [None]:
def to_Y(train_df):
    return np.asarray(train_df.confidence)

In [None]:
train_Y = to_Y(train_df=market_train_df)

In [None]:
market_train_df.drop(["confidence"], axis=1, inplace=True)

In [None]:
def to_X(df, news_features, news_feature_names):
    market_obs_ids = df.id
    news_obs_ids = df.news_id
    market_obs_times = df.time
    df.drop(["id", "news_id", "time"], axis=1, inplace=True)
    X = df.values.astype("float32")
    feature_names = df.columns.tolist()
    feature_names.extend(news_feature_names)
    del df
    gc.collect()
    row_indices = [market_id - 1 for market_id in news_obs_ids.tolist()]
    news_features = news_features[row_indices]
    X = np.hstack([X, news_features])
    del news_features
    return X, market_obs_ids, news_obs_ids, market_obs_times, feature_names

In [None]:
news_features = [ np.load("news_features.npz")["headline_fastText"] ] 

In [None]:
news_feature_names = [ "headline_fastText"]


In [None]:
news_feature_names = [["{}_{}".format(name, i) for i in range(feature.shape[1])]
                      for feature, name in zip(news_features, news_feature_names)]

In [None]:
# Materialize as a list: a bare chain iterator would be exhausted after its first use
news_feature_names = list(itertools.chain.from_iterable(news_feature_names))

In [None]:
news_features = np.hstack(news_features)

In [None]:
gc.collect()

In [None]:
%%time
X, market_train_obs_ids, news_train_obs_ids, market_train_obs_times, feature_names = to_X(
    market_train_df, news_features, news_feature_names
)

In [None]:
del news_features

In [None]:
gc.collect()

In [None]:
type(feature_names[0])

In [None]:
len(feature_names)

# create validation data

# train model

In [None]:
import lightgbm as lgb

In [None]:
train_size = X.shape[0] // 5 * 4

In [None]:
train_size

In [None]:
X, valid_X, train_Y, valid_Y = (X[:train_size], X[train_size:],
                               train_Y[:train_size], train_Y[train_size:])

In [None]:
X.shape

In [None]:
valid_X.shape
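Because the rows were sorted by `time` before the arrays were built, splitting by position holds out the most recent 20% of observations for validation rather than a random sample. A minimal sketch of that split on toy arrays follows; the helper name `chronological_split` is hypothetical.

```python
import numpy as np

def chronological_split(X, y):
    """Keep the first 4/5 of the (time-ordered) rows for training, the rest for validation."""
    n_train = len(X) // 5 * 4
    # Plain slices return views, so no extra copy of the feature matrix is made
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Toy data: 10 rows assumed to be already sorted by time
X_toy = np.arange(20, dtype="float32").reshape(10, 2)
y_toy = np.arange(10) % 2 == 0

X_tr, X_va, y_tr, y_va = chronological_split(X_toy, y_toy)
print(X_tr.shape, X_va.shape)
```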

In [None]:
feature_names

In [None]:
X = lgb.Dataset(X, label=train_Y, feature_name=feature_names, categorical_feature=categorical_features, free_raw_data=False)

In [None]:
valid_X = X.create_valid(valid_X, label=valid_Y)

In [None]:
gc.collect()

## train

In [None]:
hyper_params = {"objective": "binary", "boosting":"gbdt", "num_iterations": 100, 
               "learning_rate": 0.02, "num_leaves": 31, "num_threads": 2,
                "seed": RANDOM_SEED, "early_stopping_round": 10
               }

In [None]:
model = lgb.train(params=hyper_params, train_set=X, valid_sets=[valid_X])

In [None]:
for feature, imp in zip(model.feature_name(), model.feature_importance()):
    print("{}: {}".format(feature, imp))

In [None]:
del X

In [None]:
del valid_X

In [None]:
gc.collect()

In [None]:
import plotly.plotly as plotly

In [None]:
import seaborn as sns

In [None]:
import plotly.graph_objs as go

In [None]:
# bar_data = [go.Bar(x=X.feature_name, y=model.feature_importance())]

# plotly.iplot(bar_data, filename="feature_importance")

In [None]:
sns.set()

In [None]:
sns.set_context("notebook")

In [None]:
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

In [None]:
sns.barplot(x=model.feature_name(), y=model.feature_importance(), 
            ax=plt.subplots(figsize=(20, 10))[1])

## `get_prediction_days` function

Generator which loops through each "prediction day" (trading day) and provides all market and news observations which occurred since the last data you've received.  Once you call **`predict`** to make your future predictions, you can continue on to the next prediction day.

Yields:
* While there are more prediction day(s) and `predict` was called successfully since the last yield, yields a tuple of:
    * `market_observations_df`: DataFrame with market observations for the next prediction day.
    * `news_observations_df`: DataFrame with news observations for the next prediction day.
    * `predictions_template_df`: DataFrame with `assetCode` and `confidenceValue` columns, prefilled with `confidenceValue = 0`, to be filled in and passed back to the `predict` function.
* If `predict` has not been called since the last yield, yields `None`.

### **`predict`** function
Stores your predictions for the current prediction day.  Expects the same format as you saw in `predictions_template_df` returned from `get_prediction_days`.

Args:
* `predictions_df`: DataFrame which must have the following columns:
    * `assetCode`: The market asset.
    * `confidenceValue`: Your confidence whether the asset will increase or decrease in 10 trading days.  All values must be in the range `[-1.0, 1.0]`.

The `predictions_df` you send **must** contain the exact set of rows which were given to you in the `predictions_template_df` returned from `get_prediction_days`.  The `predict` function does not validate this, but if you are missing any `assetCode`s or add any extraneous `assetCode`s, then your submission will fail.
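Since `predict` does not validate the rows itself, it can help to check a frame before passing it in. The sketch below is one possible check, not part of the `twosigmanews` module: the helper name `check_predictions` is hypothetical, and treating NaN as invalid is an assumption based on the stated `[-1.0, 1.0]` range.

```python
import numpy as np
import pandas as pd

def check_predictions(predictions_df, predictions_template_df):
    """Return a list of problems; an empty list means the frame looks submittable."""
    problems = []
    expected = predictions_template_df["assetCode"].tolist()
    actual = predictions_df["assetCode"].tolist()
    if sorted(actual) != sorted(expected):
        problems.append("assetCode rows differ from the template")
    values = predictions_df["confidenceValue"].to_numpy(dtype="float64")
    if np.isnan(values).any():
        problems.append("confidenceValue contains NaN")
    if ((values < -1.0) | (values > 1.0)).any():
        problems.append("confidenceValue outside [-1.0, 1.0]")
    return problems

# Toy template and two candidate prediction frames
template = pd.DataFrame({"assetCode": ["A.N", "B.N"], "confidenceValue": [0.0, 0.0]})
good = pd.DataFrame({"assetCode": ["A.N", "B.N"], "confidenceValue": [0.3, -1.0]})
bad = pd.DataFrame({"assetCode": ["A.N", "C.N"], "confidenceValue": [1.5, np.nan]})

good_problems = check_predictions(good, template)
bad_problems = check_predictions(bad, template)
print(good_problems)
print(bad_problems)
```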

Let's define the helpers that turn each day's observations into model predictions:

In [None]:
from pandas.api.types import CategoricalDtype

In [None]:
def to_category_type(df, category_columns, categories_list):
    for col, categories in zip(category_columns, categories_list):
        cat_type = CategoricalDtype(categories=categories)
        df[col] = df[col].astype(cat_type)

In [None]:
headline_fastText_model = fastText.load_model(FASTTEXT_MODEL_PATH)

In [None]:
def extract_features(news_df):
    return extract_headline_fastText(news_df, headline_fastText_model)

In [None]:
def make_predictions(market_obs_df, news_obs_df, predictions_df):
    add_ids(market_obs_df, news_obs_df)
    fill_missing_value_news_df(news_obs_df)
    to_category_type(news_obs_df, category_columns=categorical_features, 
                     categories_list= news_categories)
    encode_categorical_fields(news_df=news_obs_df)
    news_features = extract_features(news_df=news_obs_df)
    remove_unnecessary_columns(market_obs_df, news_obs_df)
    market_obs_df = link_data_and_news(market_obs_df, news_obs_df)
    X, market_train_obs_ids, news_train_obs_ids, market_train_obs_times, feature_names = to_X(market_obs_df, news_features, news_feature_names)
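    # Map P(up) in [0, 1] to a confidence in [-1, 1]; ids are 1-based row labels.
    # Use .loc to avoid chained assignment, which may write to a temporary copy.
    predictions_df.loc[market_train_obs_ids.values - 1, "confidenceValue"] = model.predict(X) * 2 - 1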
    

In [None]:
days = env.get_prediction_days()

In [None]:
# random prediction for debug
# def make_random_predictions(predictions_df):
#     predictions_df.confidenceValue = 2.0 * np.random.rand(len(predictions_df)) - 1.0
# make_random_predictions(predictions_template_df)
# env.predict(predictions_template_df)

## Main Loop
Let's loop through all the days and make our predictions.  The `days` generator (returned from `get_prediction_days`) will simply stop returning values once you've reached the end.

In [None]:
from tqdm import tqdm

In [None]:
%%time
for (market_obs_df, news_obs_df, predictions_template_df) in tqdm(days):
    make_predictions(market_obs_df, news_obs_df, predictions_template_df)
    env.predict(predictions_template_df)
print('Done!')

## **`write_submission_file`** function

Writes your predictions to a CSV file (`submission.csv`) in the current working directory.

In [None]:
env.write_submission_file()

In [None]:
# We've got a submission file!
import os
print([filename for filename in os.listdir('.') if '.csv' in filename])

As indicated by the helper message, calling `write_submission_file` on its own does **not** make a submission to the competition.  It merely tells the module to write the `submission.csv` file as part of the Kernel's output.  To make a submission to the competition, you'll have to **Commit** your Kernel and find the generated `submission.csv` file in that Kernel Version's Output tab (note this is _outside_ of the Kernel Editor), then click "Submit to Competition".  When we re-run your Kernel during Stage Two, we will run the Kernel Version (generated when you hit "Commit") linked to your chosen Submission.

## Restart the Kernel to run your code again
In order to combat cheating, you are only allowed to call `make_env` or iterate through `get_prediction_days` once per Kernel run.  However, while you're iterating on your model it's reasonable to try something out, change the model a bit, and try it again.  Unfortunately, if you try to simply re-run the code, or even refresh the browser page, you'll still be running on the same Kernel execution session you had been running before, and the `twosigmanews` module will still throw errors.  To get around this, you need to explicitly restart your Kernel execution session, which you can do by pressing the Restart button in the Kernel Editor's bottom Console tab:
![Restart button](https://i.imgur.com/hudu8jF.png)