# Notes for user:

*   Although textual sources such as company reports and social media posts are ubiquitous, they are rarely included in time-series prediction algorithms, despite the relevant information they may contain.
*   In this model, daily news headlines and tweets, filtered for mentions of the USA, Britain and COVID-19, are leveraged to predict the USD/GBP time series with a single pipeline.
*   Two methods of representing text numerically are considered: count vectorization (`CountVectorizer`) and Term Frequency - Inverse Document Frequency (TF-IDF, `TfidfVectorizer`).
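
To make the difference between the two representations concrete, here is a minimal sketch on a two-headline toy corpus. The corpus is invented for illustration and is not part of the dataset used below. Count vectorization gives every occurring word the same raw count, while TF-IDF assigns a lower weight to a word shared by both headlines than to a word unique to one of them.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-headline corpus, invented for illustration only.
toy_corpus = [
    "coronavirus closes two schools",
    "coronavirus disrupts everyday life",
]

# Raw term counts per document.
count_matrix = CountVectorizer().fit_transform(toy_corpus).toarray()

# TF-IDF weights per document (scikit-learn defaults: smoothed IDF, L2 norm).
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(toy_corpus).toarray()

# Both vectorizers sort the same vocabulary alphabetically,
# so the column indices looked up here apply to both matrices.
vocabulary = tfidf_vectorizer.vocabulary_
shared = vocabulary["coronavirus"]  # appears in both headlines
unique = vocabulary["schools"]      # appears only in the first headline

# Counts are identical for the two words in the first headline...
print(count_matrix[0, shared], count_matrix[0, unique])
# ...but TF-IDF weights the shared word lower than the unique one.
print(tfidf_matrix[0, shared] < tfidf_matrix[0, unique])
```

The effect only appears when the vectorizer is fitted on more than one document, because the IDF part compares documents with each other.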



# Importing News Headlines via API


In [94]:
import json

import OpenBlender
import pandas as pd

action = 'API_getOpenTextData'
parameters = {
    'token': 'TOKEN HERE',
    'consumption_confirmation': 'on',
    'date_filter': {"start_date": "2020-01-01T06:00:00.000Z",
                    "end_date": "2020-03-10T06:00:00.000Z"},
    'sources': [
        # Wall Street Journal
        {'id_dataset': '5e2ef74e9516294390e810a9',
         'features': ['text']},
        # ABC News Headlines
        {'id_dataset': "5d8848e59516294231c59581",
         'features': ["headline", "title"]},
        # USA Today Twitter
        {'id_dataset': "5e32fd289516291e346c1726",
         'features': ["text"]},
        # CNN News
        {'id_dataset': "5d571b9e9516293a12ad4f5c",
         'features': ["headline", "title"]}
    ],
    # Aggregate the text into daily intervals (60 * 60 * 24 seconds)
    'aggregate_in_time_interval': {
        'time_interval_size': 60 * 60 * 24
    },
    'text_filter_search': ['covid', 'coronavirus', 'ncov', 'America',
                           'United States', 'USA', 'United States of America'],
    'add_date': 'date'
}
# Call the API, load the returned sample into a DataFrame, newest day first
response = OpenBlender.call(action, parameters)
covid_news_headlines = pd.read_json(json.dumps(response['sample']), convert_dates=False, convert_axes=False).sort_values('timestamp', ascending=False)
covid_news_headlines.reset_index(drop=True, inplace=True)

Task ID: '5f1dce7e90eaa07497e356c6'.
Total estimated consumption: 47617.04 processing units.
Continue?  [y] yes 	 [n] noy
Task confirmed. Starting download..
CSV will be stored in: 1595788928.969431.csv
100.0 % completed.


In [95]:
covid_news_headlines.head()

Unnamed: 0,source,timestamp.date,timestamp,source_lst
0,were not rational especially during times...,2020-03-10,1583798400,[we dont want those kids in our classrooms: pa...
1,twenty-three people remain trapped in the ...,2020-03-09,1583712000,[new york (cnn business)airline passenger traf...
2,"""the simple truth is none of us knew what w...",2020-03-08,1583625600,[countries in asia and europe are reporting ri...
3,the case of a seriously ill father with roo...,2020-03-07,1583539200,[coronavirus update: catch up on all fridays n...
4,a new york city public-school teacher showe...,2020-03-06,1583452800,[the package will replace the initial white ho...


In [96]:
covid_news_headlines.tail()

Unnamed: 0,source,timestamp.date,timestamp,source_lst
65,the transformation of the american family d...,2020-01-05,1578182400,[thousands marched in baghdad in a funeral pro...
66,american airlines is negotiating with boe...,2020-01-04,1578096000,"[motorcycle sales, particularly in the united ..."
67,"a sinking jakarta, already 40% below sea le...",2020-01-03,1578009600,"[at 11 years running, the united states is exp..."
68,an attempt by supporters of iran-backed mil...,2020-01-02,1577923200,[an attempt by supporters of iran-backed milit...
69,from wsjopinion the lesson from 2019 is tha...,2020-01-01,1577836800,[sen susan collins is one of the senators who ...
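
The `timestamp` column holds Unix time in seconds, while `timestamp.date` holds the matching calendar date. As a small sketch of how the two relate, the snippet below converts two timestamp values copied from the tables above with `pd.to_datetime`. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Two timestamp values copied from the head and tail tables above.
timestamps = pd.Series([1583798400, 1577836800])

# unit="s" tells pandas the integers are seconds since the Unix epoch (UTC).
dates = pd.to_datetime(timestamps, unit="s")
print(dates.dt.strftime("%Y-%m-%d").tolist())  # ['2020-03-10', '2020-01-01']

# Consecutive daily rows differ by the aggregation interval of 60 * 60 * 24 seconds.
print(60 * 60 * 24)  # 86400
```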


In [125]:
import OpenBlender
import pandas as pd
import json


action = 'API_getObservationsFromDataset'

# ANCHOR: 'USD/GBP'
parameters = {
    'token': 'TOKEN HERE',
    'id_dataset': '5f1db9989516291eeabd3c71',
    'consumption_confirmation': 'on',
    'blends': []
}

us_gbp = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']), convert_dates=False, convert_axes=False).sort_values('timestamp', ascending=False)
us_gbp.reset_index(drop=True, inplace=True)
us_gbp.head()  # not displayed: a notebook cell only shows its last expression
us_gbp.tail()

Task ID: '5f1dd3c10895fafb4a9d8d4b'.
Total estimated consumption: 20376.99 processing units.
Continue?  [y] yes 	 [n] noy
Task confirmed. Starting download..
CSV will be stored in: 1595790275.3442318.csv
100.0 % completed.


Unnamed: 0,timestamp,spot_price
369,1549411200,1.2963
370,1549324800,1.294
371,1549238400,1.3085
372,1548979200,1.3093
373,1548892800,1.3152


# Cleaning & Collating Data

In [108]:
# Merge headlines and exchange rates on the shared timestamp column.
# The default inner join keeps only days present in both tables.
dataset = pd.merge(left=covid_news_headlines, right=us_gbp, on='timestamp')
dataset.tail()

Unnamed: 0,source,timestamp.date,timestamp,source_lst,spot_price
44,workers have access to thousands of educati...,2020-01-08,1578441600,[the commander in chief took out the worlds mo...,1.3099
45,us foreign policy does not exist to dispe...,2020-01-07,1578355200,[us foreign policy does not exist to dispense ...,1.3122
46,thousands gathered in baghdad as part of a ...,2020-01-06,1578268800,"[the patriots dynasty, the dominance that is u...",1.3159
47,"a sinking jakarta, already 40% below sea le...",2020-01-03,1578009600,"[at 11 years running, the united states is exp...",1.3072
48,an attempt by supporters of iran-backed mil...,2020-01-02,1577923200,[an attempt by supporters of iran-backed milit...,1.3189
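
The headline table has 70 daily rows (index 0 to 69), but the merged table has only 49 (index 0 to 48). Dates such as 2020-01-04 and 2020-01-05 are missing, which is consistent with an inner join dropping days that have no exchange-rate observation. The sketch below reproduces that behaviour on miniature tables. The headline strings are placeholders, and the `how="left"` variant is shown only for comparison, not as something the notebook does.

```python
import pandas as pd

# Miniature stand-ins for the two tables; headline text is placeholder data.
headlines = pd.DataFrame({
    "timestamp": [1578009600, 1578096000, 1578182400, 1578268800],
    "source": ["fri headline", "sat headline", "sun headline", "mon headline"],
})
prices = pd.DataFrame({
    "timestamp": [1578009600, 1578268800],  # no weekend observations
    "spot_price": [1.3072, 1.3159],
})

# Default how="inner": only timestamps present in both tables survive.
inner = pd.merge(headlines, prices, on="timestamp")

# how="left": every headline row is kept, missing prices become NaN.
left = pd.merge(headlines, prices, on="timestamp", how="left")

print(len(inner), len(left))             # 2 4
print(left["spot_price"].isna().sum())   # 2
```

With the inner join, headlines published on days without a price are discarded. Whether that is desirable depends on how the prediction target is defined.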


In [111]:
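import numpy as np

# Headlines need to be filtered so that only those about America and COVID-19 remain.
# Each entry of source_lst is one day's list of headlines, so news[day][i] is the
# i-th headline of that day (an object array of lists rather than a true 2-D array).
news = np.array(dataset['source_lst'])
print(news[0][0])   # first headline of the most recent day
print(news[1][5])   # sixth headline of the second most recent day
print(news[48][0])  # first headline of the oldest day in the merged data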


we dont want those kids in our classrooms: parents warned as coronavirus closes two schools
coronavirus is officially declared a pandemic but health authorities in australia are calling for calm, reiterating that a runny nose or scratchy throat does not mean you have covid-19_
an attempt by supporters of iran-backed militias to storm the u_s_ embassy in baghdad was called off, as protesters withdrew from the area after their leadership ordered the suspension of a violent challenge to american troop presence in iraq
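
As a sketch of how this structure can be handled, the snippet below builds a small stand-in for `dataset['source_lst']` and shows that converting it with `np.array` yields one entry per day, each holding a list of headlines. It then joins each day's headlines into one string, which is one possible way to get a single document per day. The toy data and the names `toy_dataset`, `toy_news` and `daily_documents` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dataset['source_lst']: one list of headlines per day.
toy_dataset = pd.DataFrame({
    "source_lst": [
        ["coronavirus closes two schools", "markets fall on virus fears"],
        ["embassy protest called off"],
    ]
})
toy_news = np.array(toy_dataset["source_lst"])

print(toy_news.shape)   # one entry per day
print(toy_news[0][1])   # second headline of the first day

# Join each day's headlines so that every day becomes a single document.
daily_documents = [" ".join(headlines) for headlines in toy_news]
print(daily_documents[0])
```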


# Count Vectorize Breakdown

In [112]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer 
from pandas import DataFrame

In [114]:
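# Count Vectorize
# Takes a single message (one string); it is wrapped in a list because
# the vectorizer expects an iterable of documents
# The vectorizer converts text to numbers
def create_document_term_matrix(message, vectorizer):
  # Wrap the single message so it is treated as one document
  documents = [message]
  # Learn the vocabulary and transform the document into a row of numbers
  doc_term_matrix = vectorizer.fit_transform(documents)
  # Convert the sparse matrix to a dense array and return it as a DataFrame
  # (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
  return DataFrame(doc_term_matrix.toarray(),
                   columns=vectorizer.get_feature_names_out())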

# Method #1 Count Vectorize 

In [115]:

msg_1 = news[0][0]

In [116]:
# Bag of words approach where each word in document is separated into tokens
count_vect = CountVectorizer()

In [117]:
# Each term of the headline appears exactly once, so every count in the matrix is 1;
# a term that did not occur in a document would get 0
# For a classification task, this matrix can be used as the feature matrix (X) for training
# Some words carry little signal for a classification task, such as "is" and "to"
# Either remove these words from the corpus or give them a lower weight than the important words
# The lower weighting is done with TF-IDF: Term Frequency and Inverse Document Frequency
create_document_term_matrix(msg_1, count_vect)

Unnamed: 0,as,classrooms,closes,coronavirus,dont,in,kids,our,parents,schools,those,two,want,warned,we
0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
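
One way to remove the low-signal words mentioned in the comments above is scikit-learn's built-in English stop-word list, enabled with `stop_words="english"`. The sketch below applies it to the same headline. It is offered as an option under that assumption, not as a step the original notebook performs.

```python
from sklearn.feature_extraction.text import CountVectorizer

headline = ("we dont want those kids in our classrooms: parents warned "
            "as coronavirus closes two schools")

# Vocabulary with the default settings (no stop-word removal).
default_vocab = set(CountVectorizer().fit([headline]).vocabulary_)

# Vocabulary after dropping scikit-learn's built-in English stop words.
filtered_vocab = set(
    CountVectorizer(stop_words="english").fit([headline]).vocabulary_
)

# Words removed by the stop-word filter.
print(sorted(default_vocab - filtered_vocab))
# Content words such as "coronavirus" are kept.
print("coronavirus" in filtered_vocab)
```

The built-in list is generic, so a domain-specific list may work better for news headlines.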


# Method #2 TF-IDF Vectorizer

In [118]:
# TF-IDF weights how often a word occurs in a document (TF) by how rare it is across documents (IDF).
# If a word occurs multiple times in a document, we expect its value to rise.
# Here the input contains a single document
msg_2 = news[0][0]
print(msg_2)

we dont want those kids in our classrooms: parents warned as coronavirus closes two schools


In [119]:
# We create an instance of the tfidf vectoriser
tfidf_vect = TfidfVectorizer()


In [120]:

create_document_term_matrix(msg_2, tfidf_vect)

Unnamed: 0,as,classrooms,closes,coronavirus,dont,in,kids,our,parents,schools,those,two,want,warned,we
0,0.258199,0.258199,0.258199,0.258199,0.258199,0.258199,0.258199,0.258199,0.258199,0.258199,0.258199,0.258199,0.258199,0.258199,0.258199


# Method #2 TF-IDF Vectorizer, Example 2

In [121]:
msg_3 = news[0][2]

In [122]:
create_document_term_matrix(msg_3, tfidf_vect)

Unnamed: 0,all,coronavirus,disrupts,everyday,for,italians,life
0,0.377964,0.377964,0.377964,0.377964,0.377964,0.377964,0.377964


# Method #2 TF-IDF Vectorizer, Example 3

In [123]:
msg_4 = news[0][3]

In [124]:
create_document_term_matrix(msg_4, tfidf_vect)

Unnamed: 0,and,come,coronavirus,crisis,during,emergency,especially,frydenberg,grips,ian,josh,morrison,need,not,of,rational,response,scott,something,thats,the,they,times,to,unveil,verrender,were,when,with,writes
0,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078,0.348155,0.174078,0.174078,0.174078,0.174078,0.174078,0.174078
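
The values in the three TF-IDF tables above can be checked by hand. Each call fits the vectorizer on a single document, so every term gets the same IDF, and with the default L2 normalisation each weight reduces to the term's count divided by the square root of the sum of squared counts. The sketch below does that arithmetic with a hypothetical helper, `single_document_tfidf`, which is not part of scikit-learn. The count of 2 for "to" is inferred from its weight being double the others.

```python
import math

def single_document_tfidf(term_counts):
    """L2-normalised weights for one document, where IDF is equal for all terms."""
    norm = math.sqrt(sum(count ** 2 for count in term_counts))
    return [round(count / norm, 6) for count in term_counts]

# 15 distinct terms, each appearing once (msg_2).
print(single_document_tfidf([1] * 15)[0])          # 0.258199
# 7 distinct terms, each appearing once (msg_3).
print(single_document_tfidf([1] * 7)[0])           # 0.377964
# 30 distinct terms, 29 appearing once and "to" appearing twice (msg_4).
print(single_document_tfidf([1] * 29 + [2])[-2:])  # [0.174078, 0.348155]
```

In this single-document setting TF-IDF carries no more information than normalised counts. The IDF part only starts to matter once the vectorizer is fitted on several documents together.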
