In [None]:
import numpy as np
import pandas as pd
import random
from kaggle.competitions import twosigmanews
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
env = twosigmanews.make_env()
(market_train_df, news_train_df) = env.get_training_data()

The market data contains a variety of returns calculated over different timespans. All of the returns in this set share these properties:

* Returns are always calculated either open-to-open (from the opening time of one trading day to the open of another) or close-to-close (from the closing time of one trading day to the close of another).

* Returns are either raw, meaning that the data is not adjusted against any benchmark, or market-residualized (Mktres), meaning that the movement of the market as a whole has been accounted for, leaving only movements inherent to the instrument.

* Returns can be calculated over any arbitrary interval. Provided here are 1 day and 10 day horizons.
* Returns are tagged with 'Prev' if they are backwards looking in time, or 'Next' if forwards looking.
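
The naming scheme above can be read mechanically from a column name. The following is a minimal sketch, assuming every return column follows the pattern `returns` + `Open`/`Close` + `Prev`/`Next` + `Raw`/`Mktres` + horizon in days; `describe_return_column` is a hypothetical helper introduced here for illustration, not part of the competition API.

```python
import re

# Assumed pattern: returns + Open|Close + Prev|Next + Raw|Mktres + horizon in days
RETURN_COLUMN = re.compile(r"^returns(Open|Close)(Prev|Next)(Raw|Mktres)(\d+)$")

def describe_return_column(name):
    """Split a return column name into its four components (hypothetical helper)."""
    match = RETURN_COLUMN.match(name)
    if match is None:
        raise ValueError(f"not a return column: {name!r}")
    price, direction, adjustment, days = match.groups()
    return {
        "price": "open-to-open" if price == "Open" else "close-to-close",
        "direction": "backward-looking" if direction == "Prev" else "forward-looking",
        "adjustment": "raw" if adjustment == "Raw" else "market-residualized",
        "horizon_days": int(days),
    }

# Decode the target column of the competition
target = describe_return_column("returnsOpenNextMktres10")
print(target)
```

Applied to `returnsOpenNextMktres10`, the helper reports an open-to-open, forward-looking, market-residualized return over a 10 day horizon.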

| #  	| column                            	| description 	|
|----	|-----------------------------------	|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------	|
| 1  	| time(datetime64[ns, UTC])         	| the current time (in marketdata, all rows are taken at 22:00 UTC)                                                                                                                                                                                                                                                                                                                            	|
| 2  	| assetCode(object)                 	| a unique id of an asset                                                                                                                                                                                                                                                                                                                                                                      	|
| 3  	| assetName(category)               	| the name that corresponds to a group of assetCodes. These may be "Unknown" if the corresponding assetCode does not have any rows in the news data.                                                                                                                                                                                                                                           	|
| 4  	| universe(float64)                 	| a boolean indicating whether or not the instrument on that day will be included in scoring. This value is not provided outside of the training data time period. The trading universe on a given date is the set of instruments that are available for trading (the scoring function will not consider instruments that are not in the trading universe). The trading universe changes daily. 	|
| 5  	| volume(float64)                   	| trading volume in shares for the day                                                                                                                                                                                                                                                                                                                                                         	|
| 6  	| close(float64)                    	| the close price for the day (not adjusted for splits or dividends)                                                                                                                                                                                                                                                                                                                           	|
| 7  	| open(float64)                     	| the open price for the day (not adjusted for splits or dividends)                                                                                                                                                                                                                                                                                                                            	|
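| 8  	| returnsClosePrevRaw1(float64)     	| 1 day, raw, close-to-close return, backward-looking 	|
| 9  	| returnsOpenPrevRaw1(float64)      	| 1 day, raw, open-to-open return, backward-looking 	|
| 10 	| returnsClosePrevMktres1(float64)  	| 1 day, market-residualized, close-to-close return, backward-looking 	|
| 11 	| returnsOpenPrevMktres1(float64)   	| 1 day, market-residualized, open-to-open return, backward-looking 	|
| 12 	| returnsClosePrevRaw10(float64)    	| 10 day, raw, close-to-close return, backward-looking 	|
| 13 	| returnsOpenPrevRaw10(float64)     	| 10 day, raw, open-to-open return, backward-looking 	|
| 14 	| returnsClosePrevMktres10(float64) 	| 10 day, market-residualized, close-to-close return, backward-looking 	|
| 15 	| returnsOpenPrevMktres10(float64)  	| 10 day, market-residualized, open-to-open return, backward-looking 	|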
| 16 	| returnsOpenNextMktres10(float64)  	| 10 day, market-residualized return. This is the target variable used in competition scoring. The market data has been filtered such that returnsOpenNextMktres10 is always not null.                                                                                                                                                                                                         	|
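
To make the raw, backward-looking returns concrete, here is a small sketch on made-up prices. It assumes a raw 1 day close-to-close return is the simple percentage change of `close` between consecutive trading days of the same asset. The data description does not state the exact formula, and the prices are not adjusted for splits or dividends, so values computed this way may not match the provided columns exactly. The asset codes, prices, and the `closePrevRaw1_sketch` column name are invented for illustration.

```python
import pandas as pd

# Made-up prices for two hypothetical assets over three trading days
toy = pd.DataFrame({
    "time": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"] * 2, utc=True),
    "assetCode": ["AAA.N"] * 3 + ["BBB.N"] * 3,
    "close": [100.0, 110.0, 99.0, 50.0, 50.0, 55.0],
})

# Order rows by asset and time so each row follows the previous trading day
toy = toy.sort_values(["assetCode", "time"])

# Assumption: raw 1 day return = close_t / close_(t-1) - 1, computed per asset
toy["closePrevRaw1_sketch"] = toy.groupby("assetCode")["close"].pct_change()

print(toy)
```

The first row of each asset has no previous day and is therefore missing. The grouping by `assetCode` keeps one asset's prices from leaking into another's return.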

In [None]:
market_train_df.columns

In [None]:
# Intraday price change: close minus open (positive when the asset closed above its open)
market_train_df['Value'] = market_train_df['close'] - market_train_df['open']

In [None]:
market_train_df[['time', 'assetCode', 'volume', 'close', 'open',
       'returnsClosePrevRaw1', 'returnsOpenPrevRaw1',
       'returnsClosePrevMktres1', 'returnsOpenPrevMktres1',
       'returnsClosePrevRaw10', 'returnsOpenPrevRaw10',
       'returnsClosePrevMktres10', 'returnsOpenPrevMktres10',
       'returnsOpenNextMktres10', 'universe']]

In [None]:
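# Inspect three randomly chosen assets.
# Note: random.choices samples with replacement, so the same asset may appear twice.
for company in random.choices(market_train_df['assetCode'].unique(), k=3):
    print(market_train_df[market_train_df['assetCode'] == company])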

In [None]:
# List the distinct asset names, then count them
asset_names = pd.Series(market_train_df['assetName'].unique())
print(asset_names)
print(len(asset_names))

In [None]:
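# Plot the intraday price change (close - open) of asset A.N against time
asset_df = market_train_df[market_train_df['assetCode'] == 'A.N']
plt.figure(figsize=(20,10))
plt.plot(asset_df['time'], asset_df['Value'])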

In [None]:
# Count the distinct values in every market data column
for column in market_train_df.columns:
    print(f'number of unique values in {column}: {market_train_df[column].nunique()}')



| #  	| column                               	| description 	|
|----	|--------------------------------------	|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------	|
| 1  	| time(datetime64[ns, UTC])            	| UTC timestamp showing when the data was available on the feed (second precision)                                                                                                                                                                                                                                                                                     	|
| 2  	| sourceTimestamp(datetime64[ns, UTC]) 	| UTC timestamp of this news item when it was created                                                                                                                                                                                                                                                                                                                  	|
| 3  	| firstCreated(datetime64[ns, UTC])    	| UTC timestamp for the first version of the item                                                                                                                                                                                                                                                                                                                      	|
| 4  	| sourceId(object)                     	| an Id for each news item                                                                                                                                                                                                                                                                                                                                             	|
| 5  	| headline(object)                     	| the item's headline                                                                                                                                                                                                                                                                                                                                                  	|
| 6  	| urgency(int8)                        	| differentiates story types (1: alert, 3: article)                                                                                                                                                                                                                                                                                                                    	|
| 7  	| takeSequence(int16)                  	| the take sequence number of the news item, starting at 1. For a given story, alerts and articles have separate sequences.                                                                                                                                                                                                                                            	|
| 8  	| provider(category)                   	| identifier for the organization which provided the news item (e.g. RTRS for Reuters News, BSW for Business Wire)                                                                                                                                                                                                                                                     	|
| 9  	| subjects(category)                   	| topic codes and company identifiers that relate to this news item. Topic codes describe the news item's subject matter. These can cover asset classes, geographies, events, industries/sectors, and other types.                                                                                                                                                     	|
| 10 	| audiences(category)                  	| identifies which desktop news product(s) the news item belongs to. They are typically tailored to specific audiences. (e.g. "M" for Money International News Service and "FB" for French General News Service)                                                                                                                                                       	|
| 11 	| bodySize(int32)                      	| the size of the current version of the story body in characters                                                                                                                                                                                                                                                                                                      	|
| 12 	| companyCount(int8)                   	| the number of companies explicitly listed in the news item in the subjects field                                                                                                                                                                                                                                                                                     	|
| 13 	| headlineTag(object)                  	| the Thomson Reuters headline tag for the news item                                                                                                                                                                                                                                                                                                                   	|
| 14 	| marketCommentary(bool)               	| boolean indicator that the item is discussing general market conditions, such as "After the Bell" summaries                                                                                                                                                                                                                                                          	|
| 15 	| sentenceCount(int16)                 	| the total number of sentences in the news item. Can be used in conjunction with firstMentionSentence to determine the relative position of the first mention in the item.                                                                                                                                                                                            	|
| 16 	| wordCount(int32)                     	| the total number of lexical tokens (words and punctuation) in the news item                                                                                                                                                                                                                                                                                          	|
| 17 	| assetCodes(category)                 	| list of assets mentioned in the item                                                                                                                                                                                                                                                                                                                                 	|
| 18 	| assetName(category)                  	| name of the asset                                                                                                                                                                                                                                                                                                                                                    	|
| 19 	| firstMentionSentence(int16)          	| the first sentence, starting with the headline, in which the scored asset is mentioned. 1: headline; 2: first sentence of the story body; 3: second sentence of the body, etc.; 0: the asset being scored was not found in the news item's headline or body text. As a result, the entire news item's text (headline + body) will be used to determine the sentiment score. 	|
| 20 	| relevance(float32)                   	| a decimal number indicating the relevance of the news item to the asset. It ranges from 0 to 1. If the asset is mentioned in the headline, the relevance is set to 1. When the item is an alert (urgency == 1), relevance should be gauged by firstMentionSentence instead.                                                                                          	|
| 21 	| sentimentClass(int8)                 	| indicates the predominant sentiment class for this news item with respect to the asset. The indicated class is the one with the highest probability.                                                                                                                                                                                                                 	|
| 22 	| sentimentNegative(float32)           	| probability that the sentiment of the news item was negative for the asset                                                                                                                                                                                                                                                                                           	|
| 23 	| sentimentNeutral(float32)            	| probability that the sentiment of the news item was neutral for the asset                                                                                                                                                                                                                                                                                            	|
| 24 	| sentimentPositive(float32)           	| probability that the sentiment of the news item was positive for the asset                                                                                                                                                                                                                                                                                           	|
| 25 	| sentimentWordCount(int32)            	| the number of lexical tokens in the sections of the item text that are deemed relevant to the asset. This can be used in conjunction with wordCount to determine the proportion of the news item discussing the asset.                                                                                                                                               	|
| 26 	| noveltyCount12H(int16)               	| The 12 hour novelty of the content within a news item on a particular asset. It is calculated by comparing it with the asset-specific text over a cache of previous news items that contain the asset.                                                                                                                                                               	|
| 27 	| noveltyCount24H(int16)               	| same as above, but for 24 hours                                                                                                                                                                                                                                                                                                                                      	|
| 28 	| noveltyCount3D(int16)                	| same as above, but for 3 days                                                                                                                                                                                                                                                                                                                                        	|
| 29 	| noveltyCount5D(int16)                	| same as above, but for 5 days                                                                                                                                                                                                                                                                                                                                        	|
| 30 	| noveltyCount7D(int16)                	| same as above, but for 7 days                                                                                                                                                                                                                                                                                                                                        	|
| 31 	| volumeCounts12H(int16)               	| the 12 hour volume of news for each asset. A cache of previous news items is maintained and the number of news items that mention the asset within each of five historical periods is calculated.                                                                                                                                                                    	|
| 32 	| volumeCounts24H(int16)               	| same as above, but for 24 hours                                                                                                                                                                                                                                                                                                                                      	|
| 33 	| volumeCounts3D(int16)                	| same as above, but for 3 days                                                                                                                                                                                                                                                                                                                                        	|
| 34 	| volumeCounts5D(int16)                	| same as above, but for 5 days                                                                                                                                                                                                                                                                                                                                        	|
| 35 	| volumeCounts7D(int16)                	| same as above, but for 7 days                                                                                                                                                                                                                                                                                                                                        	|
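
The table suggests two derived quantities: the relative position of the first mention (`firstMentionSentence` with `sentenceCount`) and the share of the item that discusses the asset (`sentimentWordCount` with `wordCount`). The sketch below computes both on made-up rows. The feature names `firstMentionPosition` and `assetWordShare` are hypothetical, and treating `firstMentionSentence == 0` (asset not found) as missing is an assumption rather than something the data description prescribes.

```python
import pandas as pd

# Made-up news rows; only the four columns needed for the derived features
toy_news = pd.DataFrame({
    "sentenceCount": [10, 4, 20],
    "firstMentionSentence": [1, 0, 5],
    "wordCount": [200, 80, 400],
    "sentimentWordCount": [50, 80, 100],
})

# Relative position of the first mention within the item.
# Assumption: 0 means the asset was not found, so the position is left missing.
mentioned = toy_news["firstMentionSentence"] > 0
toy_news["firstMentionPosition"] = (
    toy_news["firstMentionSentence"] / toy_news["sentenceCount"]
).where(mentioned)

# Proportion of the item's tokens deemed relevant to the asset
toy_news["assetWordShare"] = toy_news["sentimentWordCount"] / toy_news["wordCount"]

print(toy_news)
```

Because `firstMentionSentence` counts the headline as sentence 1, the ratio may exceed 1 if `sentenceCount` excludes the headline; that is worth checking against the real data before relying on the feature.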

In [None]:
market_train_df.shape

In [None]:
news_train_df.head()

In [None]:
news_train_df.nunique()

TBC