# Two Sigma: Using News to Predict Stock Movements
In stage 1, the task is to combine the news data provided by Thomson Reuters with the market data provided by Intrinio to "predict" historical stock movements.  
To evaluate the predictive ability of the model, I submit a signed confidence value, $\hat{y}_{ti} \in [-1, 1]$, for each asset. That is to say, if you expect a stock to have a large positive return (compared to the broad market) over the next ten days, you might assign it a large positive confidence value (near 1.0), and vice versa. If unsure, you might assign it a value near zero.  
To simplify the problem, I started with one stock with no missing values and used its closing price as the outcome.  
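
As a minimal sketch of what a signed confidence value can look like, the snippet below maps a predicted probability of a positive return to the interval $[-1, 1]$. The helper `to_confidence` and the linear mapping `2p - 1` are assumptions made for illustration; they are not prescribed by the competition.

```python
import numpy as np

def to_confidence(prob_up):
    """Map P(return > 0), in [0, 1], to a signed confidence in [-1, 1]."""
    prob_up = np.asarray(prob_up, dtype=float)
    # 0.5 (unsure) maps to 0; 1.0 maps to +1; 0.0 maps to -1
    return np.clip(2.0 * prob_up - 1.0, -1.0, 1.0)

# hypothetical model outputs for three assets on one day
probs = np.array([0.95, 0.50, 0.10])
confidence = to_confidence(probs)
print(confidence)
```

A probability near 0.5 gives a confidence near zero, which matches the "if unsure" case described above.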

## Packages for Market data and News data
### A Python rookie, using references
>1. [Andrew Lukyanenko](https://www.kaggle.com/artgor/eda-feature-engineering-and-everything)  
>2. [duvallwh](https://www.kaggle.com/duvallwh/finding-and-removing-bad-open-values)
>3. [Bruno G. do Amaral](https://www.kaggle.com/bguberfain/a-simple-model-using-the-market-and-news-data)  
>4. [Peter](https://www.kaggle.com/pestipeti/simple-eda-two-sigma)  
>5. [Ashish Patel(阿希什)](https://www.kaggle.com/ashishpatel26/bird-eye-view-of-two-sigma-nn-approach) 
>6. [Aguiar](https://www.kaggle.com/jsaguiar/baseline-with-news)  
>7. [skmisc.loess.loess](https://has2k1.github.io/scikit-misc/generated/skmisc.loess.loess.html)
>8. [Locally Weighted Linear Regression (Loess)](https://xavierbourretsicotte.github.io/loess.html)

In [None]:
import gc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from datetime import datetime, timedelta
import lightgbm as lgb 
from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split

# interactive plot by plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

# for news data
from wordcloud import WordCloud

# loess/lowess 
# from skmisc import loess as skmloess
from scipy.interpolate import interp1d
import statsmodels.api as sm

## try pyGAM

# packages I do not fully understand yet
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler




from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.metrics import accuracy_score

# We've got a submission file!
import os

## Import data, Get Market and News training dataset 

In [None]:
from kaggle.competitions import twosigmanews
# You can only call make_env() once, so don't lose it!
env = twosigmanews.make_env()


## EDA and Preliminary Model Fitting


### Market Data
#### Introduction of Market Data
This dataset contains stock market performance over the past decade, including open and close prices, daily trading volume, and so on.  
According to the data description:
> The data is stored and retrieved as Pandas dataframes in the Kernels environment. Columns types are optimized to minimize space in memory.  
> The `returnsOpenNextMktres10` (float64): the next 10 days, **market-residualized** return, meaning that the movement of the market as a whole has been accounted for, leaving only movements inherent to the instrument. This is the target variable used in competition scoring.
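
To make "market-residualized" more concrete, here is a rough sketch on synthetic data: regress an asset's returns on the market's returns and keep the residual. This is only an illustration of the idea. The exact procedure used to build `returnsOpenNextMktres10` is not given in the data description, and every name and number below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical daily market returns and an asset that partly follows the market
market_ret = rng.normal(0.0, 0.01, size=250)
asset_ret = 1.2 * market_ret + rng.normal(0.0, 0.005, size=250)

# ordinary least squares: asset_ret ~ alpha + beta * market_ret
beta, alpha = np.polyfit(market_ret, asset_ret, deg=1)

# the residual is the part of the asset's return not explained by the market
residual_ret = asset_ret - (alpha + beta * market_ret)
print(f"estimated beta: {beta:.2f}")
```

By construction, the residual series is uncorrelated with the market series, which is the sense in which the market movement has been "accounted for".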


In [None]:
# Get the market data
(market_train_df, news_train_df) = env.get_training_data()
market_train_df.head()

#### Descriptive of Market Data

In [None]:
# variables Market
print("Within Market data: time is always 22:00 UTC \n assetCode is the unique ID \n assetName is not unique and can be Unknown \n ")
print(f'{market_train_df.shape[0]} samples and {market_train_df.shape[1]} features in the training market dataset.')

In [None]:
# summary statistics of market data
market_train_df.describe(include = 'all')

In [None]:
# missing data
market_train_df.isna().sum()

`returnsOpenNextMktres10` has been filtered so that it is never null.  
This variable is directly tied to the outcome: the submitted **confidenceValue** is scored against the market-residualized return over the next 10 days.

In [None]:
# Summary of raw target variable.
target = market_train_df['returnsOpenNextMktres10']
print(f'The range of target: {target.min()}, {target.max()} ')

#### Take a glimpse of "close price"
20 stocks were chosen at random by `assetCode`.

In [None]:
# 
np.random.seed(1024)
# count all unique assetName
print(market_train_df['assetName'].value_counts().head() )
# whole unique name series 3511 
print(market_train_df['assetName'].nunique())
print(market_train_df['assetCode'].nunique())

# 20 random stocks from market data
print("Assets that appear in the news data always have an assetName; 'Unknown' suggests the asset may have no news")
# asset codes of the pilot stocks (replace=False avoids drawing the same asset twice)
pilot = np.random.choice(market_train_df['assetCode'].unique(), 20, replace=False)
print(pilot) # numpy.ndarray

# select big company pilot, based on quantiles of volume and close price 
# .groupby('a')['b'].mean()
# market_train_df['assetName'].unique()
# [market_train_df.groupby('assetName')['close'].mean() >=50]
print("groupby assetName has dimension1 3780")
pilot_df = market_train_df[(market_train_df['assetCode'].isin(pilot))]

# empty lists to collect the plot traces of the pilot data
data = []
data_big = []



##### Plot the closing price of 20 random stocks. 

In [None]:
# create trace plot for big company 

# create trace plot for whole pilot 
for asset in pilot:
    asset_df = market_train_df[(market_train_df['assetCode'] == asset)]

    data.append(go.Scatter(
        x = asset_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = asset_df['close'].values,
        name = asset
    ))
layout = go.Layout(dict(title = "Closing prices of 20 random assets",
                  xaxis = dict(title = 'Year'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')


#### Missingness in the Market Data
1. Drop-out: instruments leave this subset of the data.   
2. Left-censoring: instruments enter this subset of the data after the start of the period. 
3. The lines for `MTD.N` and `ARTC.O` contain abnormally smooth, straight segments. This suggests intermittent missing data: the instruments entered, left, and entered again. 
4. There may have been many changes in the set of instruments around 2008 to 2009.  
5. `HES.N` was chosen because it has no obvious missing values. 
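
The intermittent gaps in point 3 can be checked directly by comparing an asset's dates with the full trading calendar. The sketch below uses a toy calendar and two made-up asset codes (`FULL.N`, `GAP.O`); the helper `missing_days` is hypothetical and not part of the competition API.

```python
import pandas as pd

# toy trading calendar of 10 business days
calendar = pd.date_range("2010-01-04", periods=10, freq="B")
toy = pd.DataFrame({
    # FULL.N trades every day; GAP.O trades on days 0-2 and 7-9 only
    "time": list(calendar) + list(calendar[:3]) + list(calendar[7:]),
    "assetCode": ["FULL.N"] * 10 + ["GAP.O"] * 6,
})

def missing_days(df, asset, calendar):
    """Calendar days between the asset's first and last row that have no row."""
    seen = pd.DatetimeIndex(df.loc[df["assetCode"] == asset, "time"])
    expected = calendar[(calendar >= seen.min()) & (calendar <= seen.max())]
    return expected.difference(seen)

gaps_full = missing_days(toy, "FULL.N", calendar)
gaps_gap = missing_days(toy, "GAP.O", calendar)
print(len(gaps_full), len(gaps_gap))
```

On the real data, the calendar could be taken as the sorted unique values of `market_train_df['time']`, so that days with no rows for an asset show up as gaps rather than as straight line segments in the plot.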


In [None]:
# how many days in total in this dataset
print(market_train_df['time'].nunique())
print("2498 days of market data")

# check MTD.N, ARTC.O
# the filtered Series keeps the original row labels, so use .iloc for positional access
MTDN = market_train_df[(market_train_df['assetCode'] == 'MTD.N')]['time']
print(MTDN.size)
print(MTDN.nunique())
print(MTDN.iloc[0])   # first date
print(MTDN.iloc[-1])  # last date
# for ARTCO
ARTCO = market_train_df[(market_train_df['assetCode'] == 'ARTC.O')]['time']
print(ARTCO.size)
print(ARTCO.nunique())
print(ARTCO.iloc[0])   # first date
print(ARTCO.iloc[-1])  # last date

# this is what I mentioned as 3. Intermittent missing data

# check the missing of HES.N
HESN = market_train_df[(market_train_df['assetCode'] == 'HES.N')]['time']
print(HESN.size)

#### Model a stock close price with LOESS
![HESS](https://i.ibb.co/yyjhPc2/hess.jpg)
Locally weighted polynomial regression (LOESS) is used to fit the HES.N closing price.   
>1. LOESS uses a **kd tree** to divide the box (also called the initial cell or bucket) enclosing all the predictor data points into rectangular cells. The vertices of these cells are the points at which local least squares fitting is done.   
>2. By default, LOESS uses **least-squares fitting**, which assumes a **Gaussian** distribution of residuals.  
>3. **frac** is the tuning parameter (the smoothing factor): the fraction of the data points used for each local fit. It should be in the range (0, 1]. The stock has about 10 years of data. A smoothing factor of 1/20 was chosen, meaning that about half a year of data is used to fit each point. (It was compared with 1/5, 1/10, and 1/40.)
>4. By default, skmisc `loess` uses **locally-quadratic fitting** (a polynomial of degree at most 2). The statsmodels `lowess` used in the code below fits local lines instead.
>5. **p** is the number of features. For this implementation in Python, p = 1.  
>6. Features should be numerical.


In [None]:
# LOWESS smoother from statsmodels (the skmisc loess version is commented out in the next cell)
x_hess = range(HESN.size)
y_hess = market_train_df[(market_train_df['assetCode'] == 'HES.N')]['close']
loess_sm = sm.nonparametric.lowess
# hess_loess = loess_sm(y_hess, x_hess,frac=1/20)
hess_loess_1 = loess_sm(y_hess, x_hess,frac=1/5, it = 3, return_sorted = False)
hess_loess_2 = loess_sm(y_hess, x_hess,frac=1/10, it = 3, return_sorted = False)
hess_loess_3 = loess_sm(y_hess, x_hess,frac=1/20, it = 3, return_sorted = False)
hess_loess_4 = loess_sm(y_hess, x_hess,frac=1/40, it = 3, return_sorted = False)
# TRY TO MAKE prediction with loess
# unpack the lowess smoothed points to their values
## lowess_x = list(zip(*hess_loess))[0]
### lowess_y = list(zip(*hess_loess))[1]

# run scipy's interpolation. There is also extrapolation I believe
## f = interp1d(lowess_x, lowess_y, bounds_error=False)
## x_hess_new = range(HESN.size, (HESN.size + 10))
## y_hess_new = f(xnew)

# plot the loess fitting
plt.figure(figsize=(12,6))
plt.scatter(x_hess,y_hess, facecolors = 'none', edgecolor = 'lightblue', label = 'HESS Close Price')
plt.plot(x_hess,hess_loess_1,color = 'magenta', label = 'Loess, 0.2: statsmodel')
plt.plot(x_hess,hess_loess_2,color = 'green', label = 'Loess, 0.1: statsmodel')
plt.plot(x_hess,hess_loess_3,color = 'red', label = 'Loess, 0.05: statsmodel')
plt.plot(x_hess,hess_loess_4,color = 'darkblue', label = 'Loess, 0.025: statsmodel')
## plt.plot(xnew, ynew, color = 'red', label = 'Loess: Prediction')
plt.legend()
plt.title('HESS STOCK 2007 - 2016: Loess Regression')
plt.show()




In [None]:
### loess with skmloess
## hess_loess_skm = skmloess.loess(x_hess, y_hess, weights = None, p = 1, family='gaussian', span = 0.1, degree=2)
# hess_loess_fit = skmloess.loess.fit(hess_loess_skm)
# print(hess_loess_skm)
# plot
# plt.figure(figsize=(12,6))
# plt.scatter(x_hess,y_hess, facecolors = 'none', edgecolor = 'lightblue', label = 'HESS Close Price')
# plt.plot(x_hess,hess_loess_fit,color = 'magenta', label = 'Loess: statsmodel')
# plt.legend()
# plt.title('HESS STOCK 2007 - 2016: Loess regression comparisons')
# plt.show()


In [None]:
# outliers in Market data 

In [None]:
# choose 

### News data
#### Introduction of News Data
The news data contains information at both the news-article level and the asset level (in other words, the table is intentionally not normalized). Columns include timestamps, news id, headline, urgency, companyCount, assetCodes, assetName, relevance (a decimal number from 0 to 1 indicating how relevant the news item is to the asset), sentimentClass (-1, 0, 1), sentimentNegative, sentimentNeutral, and sentimentPositive. 

In [None]:
news_train_df.head()

In [None]:
news_train_df.describe()

In [None]:
print(f'{news_train_df.shape[0]} samples and {news_train_df.shape[1]} features in the training news dataset.')

#### Word cloud of headlines of News Data
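The file is too large to work with the full text directly.  
100,000 out of 9,328,750 headlines were chosen randomly.  
`np.random.seed` fixes only the headline sample. The layout of the cloud is drawn from WordCloud's own random generator, so the plot is reproducible only if `random_state` is also set on `WordCloud`. 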

In [None]:
# variables News
# The file is too huge to work with text directly
stop = set(stopwords.words('english'))
np.random.seed(1024)
# 
text = ' '.join(np.random.choice(news_train_df['headline'], 100000))
# text = ' '.join(news_train_df['headline'].str.lower().values[-1000000:])
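# random_state fixes the word layout; np.random.seed above only fixes the headline sample
wordcloud = WordCloud(max_font_size=None, stopwords=stop, background_color='white',
                      width=1200, height=1000, random_state=1024).generate(text)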
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud)
plt.title('Top words in randomly selected headlines')
plt.axis("off")
plt.show()


#### Missing Data in News Data

In [None]:
# missing data, no missing data
news_train_df.isna().sum()

Asset codes were taken from the first few rows to test whether they can be matched to the market data. 

In [None]:
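# Check match of news data and market data
# `x in series` tests the index labels, not the values, so use isin() instead
print(market_train_df['assetCode'].isin(["CHDN.OQ"]).any())
print(market_train_df['assetCode'].isin(["0857.HK"]).any())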


The news data appears to have no missing values. However, news is clearly not reported every day for every asset, so asset-days without news are simply absent rather than marked as missing. In addition, the asset codes in the news data do not fully match those in the market data.  
Intuitively, the news data should be very important for the prediction. However, in the baseline results of others, the news features were much less important than the previous trend of the stock. This suggests we might need to impute the news data for asset-days without news: if there was no news, opinions on the stock are likely to be neutral.  
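
One way to express "no news means neutral" is a left merge from market to news, followed by filling the missing sentiment with a neutral value. The sketch below uses tiny made-up frames. The `hasNews` flag and the choice of 0 as the neutral fill are assumptions made for illustration, not a tested imputation strategy.

```python
import pandas as pd

# toy market rows: one row per asset per day
market = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-04"]),
    "assetCode": ["AAA.N", "AAA.N", "BBB.O"],
    "close": [10.0, 10.5, 20.0],
})
# toy news rows: only one asset-day has news
news = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-04"]),
    "assetCode": ["AAA.N"],
    "sentimentClass": [1.0],
    "relevance": [0.8],
})

# the left merge keeps every market row; asset-days without news become NaN
merged = market.merge(news, how="left", on=["date", "assetCode"])
# keep a flag so a model can tell "neutral news" apart from "no news"
merged["hasNews"] = merged["sentimentClass"].notna()
news_cols = ["sentimentClass", "relevance"]
merged[news_cols] = merged[news_cols].fillna(0.0)
print(merged)
```

Keeping the `hasNews` indicator avoids conflating imputed neutral values with genuinely neutral articles.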

#### Process the News Data and Merge it to the Market Data
Remove some columns.  
Unstack the news: we need to merge with the market data, which has one asset code per row. Therefore, each asset code in `assetCodes` is unstacked into its own row while keeping the original index.  
There can be many news items on a single date for the same asset, so the data must be grouped by date and asset.  
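
The unstack-then-group steps above can be sketched with `explode` and `groupby` on a toy frame. The string format of `assetCodes` (a set written out as text) and the use of the mean as the daily aggregate are assumptions here; the real pipeline may need different parsing and aggregation.

```python
import pandas as pd

news = pd.DataFrame({
    "time": pd.to_datetime(
        ["2016-01-04 09:00", "2016-01-04 15:30", "2016-01-05 10:00"], utc=True),
    # assumed format: a set of codes stored as a string
    "assetCodes": ["{'AAA.N', 'AAA.O'}", "{'AAA.N'}", "{'BBB.O'}"],
    "sentimentNegative": [0.2, 0.6, 0.1],
})

news["date"] = news["time"].dt.date
# pull every quoted code out of the string, giving a list per row
news["assetCode"] = news["assetCodes"].str.findall(r"'([^']+)'")
# one row per (news item, asset code); the original index is repeated, not reset
long_news = news.explode("assetCode")

# several news items per asset per day collapse to one row
daily = (long_news
         .groupby(["date", "assetCode"], as_index=False)["sentimentNegative"]
         .mean())
print(daily)
```

The resulting `daily` frame has one row per date and asset code, which is the shape needed for merging with the market data.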

In [None]:
# transform the date
# note: this is a reference, not a copy, so the new column is also added to news_train_df
news_test = news_train_df
news_test['date'] = news_test.time.dt.date  # Add date column
print(news_test.head(3))

## check unique days of news data
print(news_test['date'].nunique())

## check days of news for one asset

