# Wall_Street_Winners Competition Kernel
## Introduction
This kernel is based on the official Kaggle Getting Started kernel and on a kernel developed by Bruno G. do Amaral ("A simple model - using the market and news data"). The competition asks you to predict how stocks will change based on the market state and news articles. You will loop through a long series of trading days; for each day, you'll receive an updated state of the market and the news articles published since the last trading day, along with the impacted stocks and sentiment analysis. You'll use this information to predict whether each stock will have increased or decreased ten trading days into the future. Once you make these predictions, you can move on to the next trading day.

This competition is different from most Kaggle Competitions in that:
* You can only submit from Kaggle Kernels, and you may not use other data sources, GPU, or internet access.
* This is a **two-stage competition**. In Stage One you can edit your Kernels and improve your model, and Public Leaderboard scores are based on your Kernels' predictions relative to past market data. At the beginning of Stage Two, your Kernels are locked, and we will re-run them over the next six months, scoring them based on their predictions relative to live data as those six months unfold.
* You must use our custom **`kaggle.competitions.twosigmanews`** Python module.  The purpose of this module is to control the flow of information to ensure that you are not using future data to make predictions for the current trading day.

Based on the starter kernel, this kernel uses the **`twosigmanews`** module to get the training data, get the test features, make predictions, and write the submission file:

```python
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()

(market_train_df, news_train_df) = env.get_training_data()
train_my_model(market_train_df, news_train_df)

for (market_obs_df, news_obs_df, predictions_template_df) in env.get_prediction_days():
  predictions_df = make_my_predictions(market_obs_df, news_obs_df, predictions_template_df)
  env.predict(predictions_df)

env.write_submission_file()
```
Note that `train_my_model` and `make_my_predictions` are functions you need to write for the above example to work.
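
As a rough illustration of what those two placeholders could look like, the sketch below fits nothing and simply bets that the previous ten-day return keeps its sign. The toy dataframes, the use of `returnsOpenPrevMktres10` as the signal, and the assumption that the prediction template carries an `assetCode` column next to `confidenceValue` are all illustrative choices, not the model built later in this kernel.

```python
import numpy as np
import pandas as pd

def train_my_model(market_train_df, news_train_df):
    """Placeholder: a real model would be fitted here."""
    return None

def make_my_predictions(market_obs_df, news_obs_df, predictions_template_df):
    """Hypothetical baseline: confidence is the sign of the last 10-day return."""
    signal = market_obs_df.set_index('assetCode')['returnsOpenPrevMktres10']
    predictions_df = predictions_template_df.copy()
    # Assets without a signal get a neutral confidence of 0
    predictions_df['confidenceValue'] = (
        predictions_df['assetCode'].map(signal).fillna(0.0).pipe(np.sign)
    )
    return predictions_df

# Toy stand-ins for one trading day (not real competition data)
market_obs_df = pd.DataFrame({
    'assetCode': ['AAA.N', 'BBB.O'],
    'returnsOpenPrevMktres10': [0.03, -0.01],
})
news_obs_df = pd.DataFrame()
predictions_template_df = pd.DataFrame({
    'assetCode': ['AAA.N', 'BBB.O', 'CCC.N'],
    'confidenceValue': [0.0, 0.0, 0.0],
})

predictions_df = make_my_predictions(market_obs_df, news_obs_df, predictions_template_df)
print(predictions_df)
```

Whatever model is used, the confidence values should stay within the range -1 to 1, which is why the kernel clips its predictions before submitting them.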


## **Contents**   
[1. Import Modules & Create Environment](#1)   
[2. Examine Data Sets](#2)  
&nbsp;&nbsp;&nbsp;&nbsp; [2.1 Basic dataframe info structure](#2.1)  
&nbsp;&nbsp;&nbsp;&nbsp; [2.2 Preview 5 rows of each data frame](#2.2)  
&nbsp;&nbsp;&nbsp;&nbsp; [2.3  Counts of unique values in market_train_df](#2.3)  
&nbsp;&nbsp;&nbsp;&nbsp; [2.4  Counts of unique values in news_train_df](#2.4)  
&nbsp;&nbsp;&nbsp;&nbsp; [2.5  Distribution graphs of market_train_df numerics ](#2.5)  
[3. Create and test 1st-run data set](#3)  
&nbsp;&nbsp;&nbsp;&nbsp; [3.1 Create aggregated list of numeric news data](#3.1)  
&nbsp;&nbsp;&nbsp;&nbsp; [3.2 Prep market data and join with news data](#3.2)  
&nbsp;&nbsp;&nbsp;&nbsp; [3.3 Create lgb datasets](#3.3)  
&nbsp;&nbsp;&nbsp;&nbsp; [3.4 Set lgb parameters](#3.4)  
&nbsp;&nbsp;&nbsp;&nbsp; [3.5 Fit lgb model and plot sigma score](#3.5)  
&nbsp;&nbsp;&nbsp;&nbsp; [3.6 Plot feature importance](#3.6)  
[4. Train full lgb model & write submission file](#4)  

## <a id="1">1. Import Modules & Create Environment</a>  

In [None]:
#### JHL: Should we set this to False and run a model on all the data before submitting? It would be interesting to see how it affects our ranking.

toy = True    # if True, the dataframes are truncated to fewer rows for quick testing

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy.stats

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from itertools import chain

## JHL's function to one-hot encode a categorical variable
    # Takes as parameters 1) a dataframe 2) a string with the name of the column to encode
    # Leaves intact the original column that was encoded
def HotC(dframe,col):   # Function to one-hot encode a categorical variable
    if not(isinstance(dframe,pd.DataFrame)):
        print('!!ERROR!! The first argument of the HotC function must be a dataframe')
        return
    if not(isinstance(col,str)):
        print('!!ERROR!! The second argument of the HotC function must be a string naming a column in the dataframe')
        return
    #df2=pd.DataFrame(dframe[col].str.get_dummies())
    # One indicator column per category, named '<col>_<category>'
    df2=pd.get_dummies(dframe[col],prefix=col)
    # Append the indicator columns to the original dataframe
    df3=pd.concat([dframe,df2],axis=1)
    return df3

# Import competition data
from kaggle.competitions import twosigmanews
# You can only call make_env() once per session, so don't lose it!
# ours is <kaggle.competitions.twosigmanews.env.TwoSigmaNewsEnv object at 0x7ff474b06278>
# however, it looks like it has to be created again every time a new session starts.
# The guard below creates the environment only if it does not exist yet, so this cell can be re-run safely
if 'env' not in globals():
    env = twosigmanews.make_env()
# Set up dataframes; keep only the most recent rows for memory reasons (fewer rows if toy=True)
(market_train_df, news_train_df) = env.get_training_data()
if toy:
    market_train_df = market_train_df.tail(500_000)
    news_train_df = news_train_df.tail(800_000)
else:
    market_train_df = market_train_df.tail(3_000_000)
    news_train_df = news_train_df.tail(6_000_000)
print ('env value:', env)
print('Environment set & Data imported!')
print('Dataframe shapes of market_train_df.shape, news_train_df.shape')
print(market_train_df.shape, news_train_df.shape)
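
To make the effect of the `HotC` helper concrete, here is a minimal sketch of the same two steps (`pd.get_dummies` followed by `pd.concat`) on a toy frame. The `exchange` column is made up for illustration and is not a column of the competition data.

```python
import pandas as pd

# Toy frame; 'exchange' is a hypothetical categorical column
toy_df = pd.DataFrame({'asset': ['A', 'B', 'C'], 'exchange': ['N', 'O', 'N']})

# Same two steps as HotC: build indicator columns, then append them to the frame
dummies = pd.get_dummies(toy_df['exchange'], prefix='exchange')
encoded_df = pd.concat([toy_df, dummies], axis=1)

print(encoded_df.columns.tolist())
# ['asset', 'exchange', 'exchange_N', 'exchange_O']
```

The original `exchange` column is kept alongside the new indicator columns, which matches the "leaves the original column intact" behaviour described for `HotC`.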


## <a id="2">2. Examine Data Sets</a>  
### &nbsp;&nbsp;  <a id="2.1">2.1  Basic data frame info structure </a>

In [None]:
#### Right now I don't believe any validation is being done at this point. Should we take out 20% of the data to create a validation set? (Section 3.2 holds out the last 20% of rows.)
# List basic structure of each data frame
print ('<STRUCTURE: market_train_df.info>')
print(market_train_df.info())
print ('\n','='*80,'\n','<STRUCTURE:news_train_df.info>')
print(news_train_df.info())

### &nbsp;&nbsp;  <a id="2.2">2.2  Preview 5 rows of each data frame </a>

In [None]:
## Basic Data Frame Description
# First widen display
pd.options.display.max_columns=35   ## Force number of columns to show
pd.options.display.max_rows=1000     ## Force number of rows to show
# Now show the first and last 5 rows of each data set
print ('<CONTENT OF market_train_df.head>')
print(market_train_df.head(5))
print ('______________________________________________________________')
print ('\n<CONTENT OF market_train_df.tail>')
print (market_train_df.tail(5))
print ('______________________________________________________________')
print('\n<CONTENT OF news_train_df.head>')
print(news_train_df.head(5))
print ('______________________________________________________________')
print('\n<CONTENT OF news_train_df.tail>')
print(news_train_df.tail(5))

### &nbsp;&nbsp;  <a id="2.3">2.3  Counts of unique values in market_train_df </a>

In [None]:
## Value Counts for market_train_df
print('Value Counts for market_train_df Features','\n')
for elem in market_train_df.columns.values:
    print(elem)
    n_unique = market_train_df[elem].nunique()
    n_nan = market_train_df[elem].isna().sum()
    if n_unique < 8:
        # Few distinct values: show the full frequency table
        # (the original binned branch required more than 7 unique values, so it could never run here)
        print(market_train_df[elem].value_counts(),'\nNaN Count:',n_nan)
    else:
        # Many distinct values: only report how many there are
        print('Unique values:',n_unique,'\nNaN Count:',n_nan)
    print('\n')

### &nbsp;&nbsp;  <a id="2.4">2.4  Counts of unique values in news_train_df </a>

In [None]:
## Value Counts for news_train_df
print('Value Counts for news_train_df Features','\n')
for elem in news_train_df.columns.values:
    print(elem)
    n_unique = news_train_df[elem].nunique()
    n_nan = news_train_df[elem].isna().sum()
    if n_unique < 8:
        # Few distinct values: show the full frequency table
        # (the original binned branch required more than 7 unique values, so it could never run here)
        print(news_train_df[elem].value_counts(),'\nNaN Count:',n_nan)
    else:
        # Many distinct values: only report how many there are
        print('Unique values:',n_unique,'\nNaN Count:',n_nan)
    print('\n')

### &nbsp;&nbsp;  <a id="2.5">2.5  Distribution graphs of market_train_df numerics </a>

In [None]:
#### JHL: Tried a few different bin sizes, but some graphs not looking right. How to fix?
market_train_df.hist( bins = 1000, figsize =( 16,15))
plt.show()
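
One possible reason some panels look wrong is that a few extreme values stretch the x-axis, so almost every observation lands in a single bin. The sketch below shows one common workaround on synthetic data: clip each column to a quantile range before plotting. The helper name `clip_to_quantiles` and the 1%/99% cut-offs are assumptions for illustration, not part of the original kernel.

```python
import numpy as np
import pandas as pd

def clip_to_quantiles(series, lower=0.01, upper=0.99):
    """Clip a numeric Series to its [lower, upper] quantile range."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lo, hi)

rng = np.random.default_rng(0)
# Synthetic returns with two extreme outliers appended
returns = pd.Series(np.append(rng.normal(0, 0.02, 10_000), [50.0, -40.0]))

clipped = clip_to_quantiles(returns)
counts_raw, _ = np.histogram(returns, bins=50)
counts_clipped, _ = np.histogram(clipped, bins=50)

# With the outliers, nearly every value falls into one bin;
# after clipping, the observations spread across many bins
print(counts_raw.max(), counts_clipped.max())
# To plot the clipped data: clipped.hist(bins=50)
```

Clipping changes only what is displayed here; whether to clip the training data itself is a separate modelling decision.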

## &nbsp;&nbsp;  <a id="3">3.  Create and test 1st-run data set </a>  
### &nbsp;&nbsp;  <a id="3.1">3.1  Create aggregated list of numeric news data </a>  

In [None]:
#### It's unclear to me what role these aggregates play and how they factor into the model.
# Prep list of news columns that are numeric
news_cols_agg = {
    'urgency': ['min', 'count'],
    'takeSequence': ['max'],
    'bodySize': ['min', 'max', 'mean', 'std'],
    'wordCount': ['min', 'max', 'mean', 'std'],
    'sentenceCount': ['min', 'max', 'mean', 'std'],
    'companyCount': ['min', 'max', 'mean', 'std'],
    'marketCommentary': ['min', 'max', 'mean', 'std'],
    'relevance': ['min', 'max', 'mean', 'std'],
    'sentimentNegative': ['min', 'max', 'mean', 'std'],
    'sentimentNeutral': ['min', 'max', 'mean', 'std'],
    'sentimentPositive': ['min', 'max', 'mean', 'std'],
    'sentimentWordCount': ['min', 'max', 'mean', 'std'],
    'noveltyCount12H': ['min', 'max', 'mean', 'std'],
    'noveltyCount24H': ['min', 'max', 'mean', 'std'],
    'noveltyCount3D': ['min', 'max', 'mean', 'std'],
    'noveltyCount5D': ['min', 'max', 'mean', 'std'],
    'noveltyCount7D': ['min', 'max', 'mean', 'std'],
    'volumeCounts12H': ['min', 'max', 'mean', 'std'],
    'volumeCounts24H': ['min', 'max', 'mean', 'std'],
    'volumeCounts3D': ['min', 'max', 'mean', 'std'],
    'volumeCounts5D': ['min', 'max', 'mean', 'std'],
    'volumeCounts7D': ['min', 'max', 'mean', 'std']
}
print(news_cols_agg)
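
To show what role these aggregates play: several news items can mention the same asset on the same day, so the dictionary tells pandas how to collapse them into one row of summary features per (day, asset) pair. A minimal sketch on toy rows follows; the values and the shortened `toy_agg` dictionary are made up for illustration.

```python
import pandas as pd

# Toy news rows: two articles about asset A on the same day, one about asset B
toy_news = pd.DataFrame({
    'time': pd.to_datetime(['2016-01-04', '2016-01-04', '2016-01-04']),
    'assetCode': ['A', 'A', 'B'],
    'relevance': [0.2, 1.0, 0.5],
    'urgency': [3, 1, 3],
})
toy_agg = {'relevance': ['min', 'max', 'mean'], 'urgency': ['min', 'count']}

# One row per (time, assetCode), one column per (feature, statistic) pair
aggregated = toy_news.groupby(['time', 'assetCode']).agg(toy_agg)

# Flatten the two-level column index, e.g. ('relevance', 'mean') -> 'relevance_mean'
aggregated.columns = ['_'.join(col) for col in aggregated.columns]
print(aggregated)
```

Each flattened column (for example `relevance_mean` or `urgency_count`) then becomes one input feature of the model after the join with the market data.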

### &nbsp;&nbsp;  <a id="3.2">3.2  Prep market data and join with news data </a>  

In [None]:

## Functions to use in combining data sets

def join_market_news(market_train_df, news_train_df):
    # Fix asset codes (str -> list)
    news_train_df['assetCodes'] = news_train_df['assetCodes'].str.findall(r"'([\w\./]+)'")
    
    # Expand assetCodes
    assetCodes_expanded = list(chain(*news_train_df['assetCodes']))
    assetCodes_index = news_train_df.index.repeat( news_train_df['assetCodes'].apply(len) )

    assert len(assetCodes_index) == len(assetCodes_expanded)
    df_assetCodes = pd.DataFrame({'level_0': assetCodes_index, 'assetCode': assetCodes_expanded})

    # Create expanded news (one row per asset code mentioned in each news item)
    news_cols = ['time', 'assetCodes'] + sorted(news_cols_agg.keys())
    news_train_df_expanded = pd.merge(df_assetCodes, news_train_df[news_cols], left_on='level_0', right_index=True, suffixes=('', '_old'))

    # Free memory
    del news_train_df, df_assetCodes

    # Aggregate numerical news features
    news_train_df_aggregated = news_train_df_expanded.groupby(['time', 'assetCode']).agg(news_cols_agg)
    
    # Free memory
    del news_train_df_expanded

    # Convert to float32 to save memory
    news_train_df_aggregated = news_train_df_aggregated.astype(np.float32)

    # Flat columns
    news_train_df_aggregated.columns = ['_'.join(col).strip() for col in news_train_df_aggregated.columns.values]

    # Join with train
    market_train_df = market_train_df.join(news_train_df_aggregated, on=['time', 'assetCode'])

    # Free memory
    del news_train_df_aggregated
    
    return market_train_df

print('Functions created to combine market and news data')
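
The expansion step inside `join_market_news` can be hard to picture, so here is a small sketch of it on two toy news rows. The string format of `assetCodes` (a set literal written as text) is an assumption inferred from the regular expression used above, and the asset codes are made up.

```python
from itertools import chain
import pandas as pd

toy_news = pd.DataFrame({
    'assetCodes': ["{'AAA.N', 'AAA.O'}", "{'BBB.N'}"],
    'relevance': [1.0, 0.4],
}, index=[10, 11])

# str -> list of codes, e.g. "{'AAA.N', 'AAA.O'}" -> ['AAA.N', 'AAA.O']
codes = toy_news['assetCodes'].str.findall(r"'([\w\./]+)'")

# Flatten all codes and repeat each row index once per code it mentions
expanded_codes = list(chain(*codes))
expanded_index = toy_news.index.repeat(codes.apply(len))

df_codes = pd.DataFrame({'level_0': expanded_index, 'assetCode': expanded_codes})

# Attach the news features to every (news row, asset code) pair
expanded = pd.merge(df_codes, toy_news[['relevance']], left_on='level_0', right_index=True)
print(expanded)
```

The first news row mentions two codes and therefore appears twice in `expanded`, once per code, which is what allows the later `groupby(['time', 'assetCode'])` to work per asset.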

In [None]:
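def get_xy(market_train_df, news_train_df, le=None):
    # Pass any existing label encoders through so that assets keep the same integer ids
    x, le = get_x(market_train_df, news_train_df, le)
    # Target: the 10-day market-residualized return, clipped to [-1, 1]
    y = market_train_df['returnsOpenNextMktres10'].clip(-1, 1)
    return x, y, le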


def label_encode(series, min_count):
    vc = series.value_counts()
    le = {c:i for i, c in enumerate(vc.index[vc >= min_count])}
    return le


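# `le` is a pair of label encoders: plain dicts (built by label_encode above) that map each
# assetCode / assetName to an integer id. Values that occur fewer than min_count times get no
# entry and are encoded as -1 further down.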
def get_x(market_train_df, news_train_df, le=None):
    # Split date into before and after 22h (the time used in train data)
    # E.g: 2007-03-07 23:26:39+00:00 -> 2007-03-08 00:00:00+00:00 (next day)
    #      2009-02-25 21:00:50+00:00 -> 2009-02-25 00:00:00+00:00 (current day)
    news_train_df['time'] = (news_train_df['time'] - np.timedelta64(22,'h')).dt.ceil('1D')

    # Round time of market_train_df down to 0h of the current day
    market_train_df['time'] = market_train_df['time'].dt.floor('1D')

    # Join market and news
    x = join_market_news(market_train_df, news_train_df)
    
    # If no label encoders were passed in, build them from this data
    if le is None:
        le_assetCode = label_encode(x['assetCode'], min_count=10)
        le_assetName = label_encode(x['assetName'], min_count=5)
    else:
        # 'unpack' label encoders
        le_assetCode, le_assetName = le
        
    x['assetCode'] = x['assetCode'].map(le_assetCode).fillna(-1).astype(int)
    x['assetName'] = x['assetName'].map(le_assetName).fillna(-1).astype(int)
    
    # Drop the target and universe columns when present (errors='ignore' skips missing ones)
    x.drop(columns=['returnsOpenNextMktres10', 'universe'], errors='ignore', inplace=True)
    x['dayofweek'], x['month'] = x.time.dt.dayofweek, x.time.dt.month
    x.drop(columns='time', inplace=True)
#    x.fillna(-1000,inplace=True)

    # Fix some mixed-type columns
    for bogus_col in ['marketCommentary_min', 'marketCommentary_max']:
        x[bogus_col] = x[bogus_col].astype(float)
    
    return x, (le_assetCode, le_assetName)

print('Additional functions created to help on join')
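
The 22h shift in `get_x` is easier to follow on the two timestamps quoted in its comments. The sketch below reproduces them; it uses `pd.Timedelta(hours=22)` in place of the kernel's `np.timedelta64(22, 'h')`, which is an equivalent substitution made here for illustration.

```python
import pandas as pd

news_time = pd.Series(pd.to_datetime([
    '2007-03-07 23:26:39+00:00',  # published after 22:00 -> assigned to the next day
    '2009-02-25 21:00:50+00:00',  # published before 22:00 -> assigned to the same day
]))

# Shift back by 22 hours, then round up to the next midnight
aligned = (news_time - pd.Timedelta(hours=22)).dt.ceil('1D')
print(aligned.tolist())
```

After this step the news timestamps are midnight values that can be matched against the market timestamps, which `get_x` floors to midnight of the same day.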

In [None]:
%%time
####? What does a line of these variables look like?
# This will take some time...
X, y, le = get_xy(market_train_df, news_train_df)
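
For a feel of what `le` contains, the sketch below runs the same `label_encode` logic on a toy series. The codes `A`, `B`, `C` and the threshold of 2 are made up for illustration.

```python
import pandas as pd

def label_encode(series, min_count):
    # Map each value that occurs at least min_count times to an integer id,
    # with the most frequent value getting id 0
    vc = series.value_counts()
    return {c: i for i, c in enumerate(vc.index[vc >= min_count])}

codes = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])
le_codes = label_encode(codes, min_count=2)

# Values without an entry in the encoder (here 'C') become -1
encoded = codes.map(le_codes).fillna(-1).astype(int)

print(le_codes)          # {'A': 0, 'B': 1}
print(encoded.tolist())  # [0, 0, 0, 1, 1, -1]
```

Reusing the same dictionaries at prediction time keeps the integer ids consistent between training and prediction, and any asset not covered by the encoder falls back to -1.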


In [None]:
# Save universe data for later use
universe = market_train_df['universe']
time = market_train_df['time']

# Free memory
del market_train_df, news_train_df
X_ = X
print('universe saved')

In [None]:
# Keep all columns for now (the commented-out slice would keep only the news-derived columns)
X = X_#.iloc[:, X.columns.get_loc('urgency_min'):X.columns.get_loc('dayofweek')]
X.tail()

n_train = int(X.shape[0] * 0.8)

X_train, y_train = X.iloc[:n_train], y.iloc[:n_train]
X_valid, y_valid = X.iloc[n_train:], y.iloc[n_train:]

print('Sample of X_train rows')
print(X_train.head(5))

In [None]:
# For valid data, keep only those with universe > 0. This will help calculate the metric
u_valid = (universe.iloc[n_train:] > 0)
t_valid = time.iloc[n_train:]

X_valid = X_valid[u_valid]
y_valid = y_valid[u_valid]
t_valid = t_valid[u_valid]
del u_valid
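
The split above is positional rather than random: the first 80% of rows are used for training and the last 20% for validation, and the validation rows are then restricted to those with `universe > 0`. A minimal sketch on a toy frame, assuming the rows are already in chronological order:

```python
import pandas as pd

toy = pd.DataFrame({
    'feature': range(10),
    'universe': [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
})

# First 80% of rows for training, last 20% for validation
n_train = int(len(toy) * 0.8)
train, valid = toy.iloc[:n_train], toy.iloc[n_train:]

# Keep only validation rows that count toward the score
valid = valid[valid['universe'] > 0]

print(len(train), valid.index.tolist())
```

Under that ordering assumption, validating on the most recent rows mimics how the model is used, since it is always asked about days that come after its training data.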

### &nbsp;&nbsp;  <a id="3.3">3.3  Create Lgb datasets </a>  

In [None]:
# Create lgb datasets
train_cols = X.columns.tolist()
categorical_cols = [] # ['assetCode', 'assetName', 'dayofweek', 'month']
# zzz=train_cols
# train_cols = ['volume','close','open','returnsClosePrevMktres1','returnsOpenPrevMktres1','returnsClosePrevMktres10','returnsOpenPrevMktres10','urgency_min','ugency_count','sentimentNegative_mean','sentimentPositive_mean','noveltyCount12H_mean','volumeCounts12H_mean','dayofweek','month']
# print ('train_cols'+train_cols)

# Note: y data is expected to be a pandas Series, as we will use its group_by function in `sigma_score`
dtrain = lgb.Dataset(X_train.values, y_train, feature_name=train_cols, categorical_feature=categorical_cols, free_raw_data=False)
dvalid = lgb.Dataset(X_valid.values, y_valid, feature_name=train_cols, categorical_feature=categorical_cols, free_raw_data=False)


### &nbsp;&nbsp;  <a id="3.4">3.4  Set lgb parameters </a>  

In [None]:
# We will 'inject' an extra parameter in order to have access to df_valid['time'] inside sigma_score without globals

dvalid.params = {
    'extra_time': t_valid.factorize()[0]
}

lgb_params = dict(
    objective = 'regression_l1',
    learning_rate = 0.1,
    num_leaves = 127,
    max_depth = -1,
#     min_data_in_leaf = 1000,
#     min_sum_hessian_in_leaf = 10,
    bagging_fraction = 0.75,
    bagging_freq = 2,
    feature_fraction = 0.5,
    lambda_l1 = 0.0,
    lambda_l2 = 1.0,
    metric = 'None', # Skip the built-in metric for the loss objective and use sigma_score (passed as feval) instead
    seed = 42 # Change for better luck! :)
)

def sigma_score(preds, valid_data):
    df_time = valid_data.params['extra_time']
    labels = valid_data.get_label()

#    assert len(labels) == len(df_time)

    # Wrap the products in a pd.Series so that groupby also works when get_label() returns a NumPy array.
    # The 'universe' term is left out because only rows with universe > 0 were kept in the validation set.
    x_t = pd.Series(preds * np.asarray(labels))

    # Sum the products for each day, then divide the mean of the daily sums by their standard deviation
    x_t_sum = x_t.groupby(df_time).sum()
    score = x_t_sum.mean() / x_t_sum.std()

    # The third element tells LightGBM that a higher score is better
    return 'sigma_score', score, True
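
The metric computed by `sigma_score` can be checked by hand on a tiny example: multiply each prediction by its label, sum the products per day, and divide the mean of the daily sums by their standard deviation. The sketch below does this on made-up numbers for three days; the helper name `sigma_score_from_arrays` is hypothetical and takes plain arrays instead of a LightGBM dataset.

```python
import numpy as np
import pandas as pd

def sigma_score_from_arrays(preds, labels, day_ids):
    """Mean of the daily sums of preds * labels, divided by their standard deviation."""
    x_t = pd.Series(np.asarray(preds) * np.asarray(labels))
    daily = x_t.groupby(np.asarray(day_ids)).sum()
    return daily.mean() / daily.std()

# Two assets on each of three days (toy values)
preds = [1.0, -1.0, 0.5, 1.0, -0.5, 1.0]
labels = [0.02, -0.01, 0.04, 0.01, 0.02, 0.03]
day_ids = [0, 0, 1, 1, 2, 2]

# Daily sums are 0.03, 0.03 and 0.02
score = sigma_score_from_arrays(preds, labels, day_ids)
print(round(score, 4))
```

A prediction with the same sign as its label adds to that day's sum and one with the opposite sign subtracts from it, so the score rewards being right consistently across days rather than only on average.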



### &nbsp;&nbsp;  <a id="3.5">3.5  Fit lgb model and plot sigma score </a>  

In [None]:
# Fit model
evals_result = {}
m = lgb.train(lgb_params, dtrain, num_boost_round=1000, valid_sets=(dvalid,), valid_names=('valid',), verbose_eval=25,
              early_stopping_rounds=100, feval=sigma_score, evals_result=evals_result)
df_result = pd.DataFrame(evals_result['valid'])
# Plot sigma score per boosting round and mark the best round
ax = df_result.plot(figsize=(12, 8))
ax.scatter(df_result['sigma_score'].idxmax(), df_result['sigma_score'].max(), marker='+', color='red')
num_boost_round, valid_score = df_result['sigma_score'].idxmax()+1, df_result['sigma_score'].max()
print(lgb_params)
print(f'Best score was {valid_score:.5f} on round {num_boost_round}')

### &nbsp;&nbsp;  <a id="3.6">3.6  Plot feature importance </a> 

In [None]:
# plot importance figures
fig, ax = plt.subplots(1, 2, figsize=(14, 14))
lgb.plot_importance(m, ax=ax[0])
lgb.plot_importance(m, ax=ax[1], importance_type='gain')
fig.tight_layout()

## &nbsp;&nbsp;  <a id="4">4 Train full lgb model & write submission file </a> 

In [None]:
# Train full model with num_boost_round found in validation
dtrain_full = lgb.Dataset(X, y, feature_name=train_cols, categorical_feature=categorical_cols)

# Fit on the full dataset (dtrain_full) rather than on the 80% training split (dtrain)
model = lgb.train(lgb_params, dtrain_full, num_boost_round=num_boost_round)

# Generate predictions; the template dataframe is filled in place, so nothing is returned
def make_predictions(predictions_template_df, market_obs_df, news_obs_df, le):
    x, _ = get_x(market_obs_df, news_obs_df, le)
    predictions_template_df.confidenceValue = np.clip(model.predict(x), -1, 1)
    
days = env.get_prediction_days()

for (market_obs_df, news_obs_df, predictions_template_df) in days:
    make_predictions(predictions_template_df, market_obs_df, news_obs_df, le)
    env.predict(predictions_template_df)
print('Model fit and predictions made!')



In [None]:
# Write submission file
env.write_submission_file()
print('Submission file written')
