# Economic NLP

### Table of Contents

* [Aim](#aim)
* [Approach](#approach)
    * [Global Knowledge Graph](#gkg)
* [Phase 1](#phase1)
    * [Tone of Articles](#tone_of_articles)
    * [Consumer Sentiment](#consumer_sentiment)
* [Phase 2](#phase2)
    * [Correlation](#correlation)
    * [Topic Modeling](#topic_modeling)
* [Phase 3](#phase3)
    * [Macroeconomic Indicator Forecast](#gdp_forecast)

## Aim <a class="anchor" id="aim"></a>
The aim of this project is to explore how alternative data could help our economic forecasts. One idea under consideration is scraping newspapers, blogs, etc. for attitudes towards things that change the sentiment of the populace and subsequently, and indirectly, the economy.

## Approach <a class="anchor" id="approach"></a>
To test this theory, the team conducted some analysis to explore topics around COVID-19. In particular, we have access to a public dataset courtesy of https://www.gdeltproject.org/. The GDELT database monitors print, broadcast, and web news media in over 100 languages from across every country in the world to keep continually updated on breaking developments anywhere on the planet. Its historical archives stretch back to January 1, 1979 and update every 15 minutes. Through its ability to leverage the world's collective news media, GDELT moves beyond the focus of the Western media towards a far more global perspective on what's happening and how the world is feeling about it.

### Global Knowledge Graph <a class="anchor" id="gkg"></a>
One dataset of particular interest within GDELT is the Global Knowledge Graph (GKG). Not only do we have the links to the articles, but we also have various useful enrichments:

Overview

- Around 1bn rows

- Updated every 15 mins

- Each entry is a news article

- Detailed information on every person, location, number and theme mentioned in an article

You can explore the GDELT database in Google BigQuery here: https://console.cloud.google.com/bigquery?authuser=1&project=goldenfleece

## Phase 1 <a class="anchor" id="phase1"></a>

### Tone of Articles <a class="anchor" id="tone_of_articles"></a>

We extracted the tone data from GKG to understand the general sentiment of all the articles across the year.

![Tone of Articles](img/articles_tone.png)
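As a rough sketch of how such a tone series can be derived: in the GKG 2.x format, the `V2Tone` field is a comma-separated string whose first value is the average tone of the document. The snippet below assumes the rows have already been pulled (for example from BigQuery) as `(timestamp, V2Tone)` pairs. The sample rows and the function names are made up for illustration and are not the project's actual code.

```python
from collections import defaultdict
from statistics import mean


def parse_tone(v2tone):
    """Return the average tone, i.e. the first comma-separated value of V2Tone."""
    return float(v2tone.split(",")[0])


def daily_average_tone(rows):
    """Group (timestamp, v2tone) rows by day (YYYYMMDD) and average the tone."""
    by_day = defaultdict(list)
    for timestamp, v2tone in rows:
        day = str(timestamp)[:8]  # GKG timestamps look like YYYYMMDDHHMMSS
        by_day[day].append(parse_tone(v2tone))
    return {day: mean(tones) for day, tones in sorted(by_day.items())}


# Made-up sample rows, for illustration only
sample_rows = [
    (20200401120000, "-2.5,1.5,4.0,5.5,20.0,0.5,400"),
    (20200401130000, "-1.5,2.0,3.5,5.5,18.0,0.0,350"),
    (20200402090000, "1.0,3.0,2.0,5.0,22.0,1.0,500"),
]
print(daily_average_tone(sample_rows))
```

Averaging per day keeps the series comparable with the daily Reddit sentiment scores used later in this phase.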

The task is to mine the GDELT database for all news articles that mention “vaccine” and analyse whether the sentiment is positive (vaccine will work, life back to normal, etc.) or negative (side effects will be awful, not proven, not safe, etc.), to see how feasible this kind of analysis is. This could be explored via a multitude of approaches, such as bag-of-words models, unsupervised LDA topic modelling and associated sentiment analysis, to understand the context of the articles that display positive vs negative sentiment. Sentiment can also change over time, so there is an opportunity to explore this as well.

Link to dashboard: https://datastudio.google.com/reporting/50b45093-253e-4be0-b917-b5c9bf2d0cef/page/vIRqB

### Consumer Sentiment <a class="anchor" id="consumer_sentiment"></a>

So far we have conducted a lot of research around the supply side of news, but what about the response to it?

#### Reddit Comments
We obtained access to Reddit comments for the same time period, this time focusing on April and November, courtesy of https://pushshift.io/

Sentiment was calculated as subjectivity * polarity, where subjectivity ranges from 0 to 1 and polarity from -1 to +1. Essentially, very personal positive comments tend towards a score of +1. The sentiment score for a day is the average over the comments on that day.
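The scoring rule can be sketched as follows. This is a minimal illustration that assumes each comment already carries a polarity and a subjectivity value from whichever sentiment tool is used; the function names and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def comment_score(polarity, subjectivity):
    """Sentiment score of one comment: subjectivity (0..1) * polarity (-1..+1)."""
    return subjectivity * polarity


def daily_sentiment(comments):
    """Average the comment scores per day.

    `comments` is an iterable of (day, polarity, subjectivity) tuples.
    """
    by_day = defaultdict(list)
    for day, polarity, subjectivity in comments:
        by_day[day].append(comment_score(polarity, subjectivity))
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


# Made-up values, for illustration only
sample_comments = [
    ("2020-04-01", -0.5, 1.0),
    ("2020-04-01", 0.5, 0.5),
    ("2020-11-01", 1.0, 0.5),
]
print(daily_sentiment(sample_comments))
```

Multiplying by subjectivity shrinks the score of factual, impersonal comments towards 0, so only opinionated comments move the daily average much.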

![Consumer Sentiment](img/consumer_sentiment.png)

Looks like the overall sentiment for November is generally slightly higher!

 
Looking at the most negative comments in April:

![Negative comment 1](img/negative_comment1.png)

![Negative comment 2](img/negative_comment2.png)





Then looking at some positive comments in November: some are politically linked, others comment on shares.

![Positive comment 1](img/positive_comment1.png)

![Positive comment 2](img/positive_comment2.png)

![Positive comment 3](img/positive_comment3.png)

## Phase 2 <a class="anchor" id="phase2"></a>

### Stock Market Correlation <a class="anchor" id="correlation"></a>

OK, so we know how to mine data for sentiment, but what about trying to get a signal from it? We explore ways in which we can use topic models to gain a better understanding of the articles, and attempt to correlate the results with the stock market.
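One simple way to look for such a signal is a (lagged) Pearson correlation between a daily sentiment series and daily market returns. The sketch below is an assumption about how this could be done, not a description of the analysis actually run; `lagged_correlation` and the sample numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant series has zero variance and raises ZeroDivisionError
    return cov / sqrt(var_x * var_y)


def lagged_correlation(sentiment, returns, lag=0):
    """Correlate sentiment on day t with returns on day t + lag (lag >= 0)."""
    if lag:
        sentiment, returns = sentiment[:-lag], returns[lag:]
    return pearson(sentiment, returns)


# Made-up series, for illustration only
sentiment = [0.10, 0.05, -0.02, 0.08, 0.12]
returns = [0.4, 0.1, -0.3, 0.2, 0.5]
for lag in (0, 1):
    print(lag, round(lagged_correlation(sentiment, returns, lag), 3))
```

A positive lag asks whether sentiment leads the market, which is the direction that matters for forecasting. Correlation on its own does not establish that.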

### Topic Modeling <a class="anchor" id="topic_modeling"></a>

Each news article in the GDELT dataset is tagged with a variety of themes. The themes are taxonomy classifications from a number of different organisations, including the UN and the World Bank.

>There are thousands of different GKG themes, but no complete list seems to be provided anywhere. Probably the most comprehensive list of themes can be found within the documentation section of the GDELT project website. The list currently includes over 2,500 different themes. Among these themes are around 2,200 themes from the World Bank Taxonomy.
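To get a feel for which themes occur, the theme field can be parsed and counted. In the GKG 2.x format, `V2Themes` is a semicolon-separated list of `THEME,character_offset` entries. The sketch below assumes the field has already been extracted as one string per article; the function names and sample strings are made up for illustration.

```python
from collections import Counter


def parse_themes(v2themes):
    """Return the theme names from a V2Themes string, dropping the offsets."""
    themes = []
    for entry in v2themes.split(";"):
        if entry:  # the field usually ends with a trailing ';'
            themes.append(entry.split(",")[0])
    return themes


def count_themes(records):
    """Count in how many articles each theme appears (once per article)."""
    counts = Counter()
    for v2themes in records:
        counts.update(set(parse_themes(v2themes)))
    return counts


# Made-up sample records, for illustration only
sample_records = [
    "TAX_DISEASE_CORONAVIRUS,120;EPU_ECONOMY,560;EPU_ECONOMY,900;",
    "EPU_ECONOMY,45;UNEMPLOYMENT,300",
]
print(count_themes(sample_records).most_common())
```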

We can manually sample some news articles and extract the economy themes, like the ones below, but this is not really efficient.

- TAX_DISEASE_CORONAVIRUS
- HEALTH_VACCINATIONS
- EPU_GOVERNMENT_POLICY
- EPU_POLICY_BUDGET
- EPU_ECONOMY
- EPU_ECONOMY_DEBT
- ECON_STOCKMARKET
- ECON_TAXATION
- UNEMPLOYMENT
- PROTEST

Therefore, we apply Latent Dirichlet Allocation (LDA) topic modeling, an unsupervised learning technique that identifies topics in a set of documents.
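To illustrate what LDA does, here is a toy from-scratch sketch using collapsed Gibbs sampling. It is not the implementation used in this project (in practice a library such as gensim or scikit-learn would be the natural choice); the hyperparameters `alpha` and `beta`, the function names and the sample documents are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenised docs; return (doc_topic, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    n_dk = [[0] * num_topics for _ in docs]  # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics  # total tokens per topic
    assignments = []

    # Start from a random topic for every token
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts
                n_dk[d][k] -= 1
                n_kw[k][wid] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic, up to a constant
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][wid] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wid] += 1
                n_k[k] += 1

    # Smoothed estimates of the document-topic and topic-word distributions
    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n most probable words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Made-up toy corpus, for illustration only
docs = [
    "vaccine trial results vaccine approval".split(),
    "stock market rally stock prices".split(),
    "vaccine rollout approval trial".split(),
    "market prices rally investors".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
print(top_words(topic_word, vocab))
```

Each document ends up with a distribution over topics and each topic with a distribution over words, which is what lets topics stand in for the hand-picked themes.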

## Phase 3 <a class="anchor" id="phase3"></a>

### Macroeconomic Indicator Forecast <a class="anchor" id="gdp_forecast"></a>

So, we had some success with topic modelling, but what else can we do?

The Bank of England released a paper called *Making text count: economic forecasting using newspaper text*, which uses text drawn from three popular UK newspapers that collectively represent UK newspaper readership in terms of political perspective and editorial style.

The authors found that incorporating text into forecasts by combining counts of terms with supervised machine learning delivers the best forecast improvements, both in marginal terms and relative to existing text-based methods. These improvements are most pronounced during periods of economic stress when, arguably, forecasts matter most.

Steps:

1. We constructed a term frequency model using word lists such as this: https://sraf.nd.edu/textual-analysis/resources/

2. We built an n-gram dataset covering all of the speech-to-text transcriptions of BBC News, Bloomberg, CNBC and CNN over the last 10 years

3. We then transformed the dataset and aligned it to time series of various economic indicators
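
The steps above can be sketched roughly as follows. The word lists here are tiny hypothetical stand-ins for a real sentiment dictionary, the period labels and indicator values are made up, and `term_score` and `align` are illustrative names rather than the project's actual functions.

```python
from collections import Counter

# Tiny hypothetical word lists, stand-ins for a real sentiment dictionary
POSITIVE = {"growth", "gain", "improve"}
NEGATIVE = {"loss", "recession", "decline"}


def term_score(tokens):
    """Net share of positive minus negative terms in a list of tokens."""
    counts = Counter(t.lower() for t in tokens)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    total = sum(counts.values())
    return (pos - neg) / total if total else 0.0


def align(text_by_period, indicator_by_period):
    """Pair the text score with the indicator for periods present in both."""
    shared = sorted(text_by_period.keys() & indicator_by_period.keys())
    return [
        (period, term_score(text_by_period[period]), indicator_by_period[period])
        for period in shared
    ]


# Made-up transcripts and indicator values, for illustration only
text_by_period = {
    "P1": "markets fear recession and loss".split(),
    "P2": "growth and gain".split(),
    "P3": "no indicator for this period".split(),
}
indicator_by_period = {"P1": -1.0, "P2": 2.0}

for row in align(text_by_period, indicator_by_period):
    print(row)
```

The aligned rows (period, text score, indicator) are the kind of table a supervised forecasting model could then be trained on.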