## Pre-defined terms in BOE working paper

[BOE working paper](https://www.bankofengland.co.uk/-/media/boe/files/working-paper/2020/making-text-count-economic-forecasting-using-newspaper-text.pdf?la=en&hash=E81EC91956CEA4FC6F63C4DC5942F0E9D4580558)

We use 9660 terms, including n-grams of up to three words. The pre-defined list of terms used to construct the term
frequency matrix is the union of several dictionaries, namely those found in:
- Nyman et al. (2018)
- Loughran and McDonald (2013)
- Nielsen (2011)
- Hu and Liu (2004)
- Hu et al. (2017), and 
- Correa et al. (2017). 

We add to this a collection of words related to economics and finance (most of these come from https://home.ubalt.edu/ntsbarsh/stat-data/KeywordsPhra.htm and http://home.ubalt.edu/ntsbarsh/Business-stat/stat-data/KeysPhrasFinance.htm) and the Harvard IV psychological dictionary used by Tetlock (2007).

We use n-grams up to trigrams only if they already exist individually in these dictionaries. For example, “interest rate risk” is a trigram
inherited from one of the component dictionaries. This gives 9660 unique terms, of which 8030 appear in our corpus.
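The union step described above can be sketched as follows. The dictionary names and their contents are tiny hypothetical stand-ins, not the actual word lists; the sketch only illustrates merging several dictionaries into one set of unique terms and checking how many words each term contains.

```python
from collections import Counter

# Hypothetical miniature dictionaries; the real lists are far larger.
dictionaries = {
    "sentiment_a": {"abandon", "uncertainty", "almost"},
    "sentiment_b": {"almost", "apparently"},
    "economics_finance": {"interest rate", "interest rate risk"},
}

# Union of all dictionaries, lowercased so the same word is not counted twice.
vocabulary = sorted({term.lower() for terms in dictionaries.values() for term in terms})

# Count terms by number of words (1 = unigram, 2 = bigram, 3 = trigram).
by_length = Counter(len(term.split()) for term in vocabulary)

print(f"Unique terms: {len(vocabulary)}")
print(f"Terms by n-gram length: {dict(by_length)}")
```

Multi-word terms enter the vocabulary only because a component dictionary already lists them; nothing here generates new n-grams from text.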

In [1]:
from google.cloud import bigquery
import pandas as pd
import numpy as np
import seaborn as sns
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
import matplotlib.pyplot as plt
sns.set()

In [2]:
client = bigquery.Client()

query = """
select * from goldenfleece.final_task.tone
"""
vocab_df = client.query(query).to_dataframe()
vocab_df.head(3)

Unnamed: 0,tone_lmcd_negative,tone_lmcd_positive,tone_lmcd_uncertainty,tone_lmcd_litigious,tone_lmcd_strong_modal,tone_lmcd_weak_modal,tone_lmcd_constraining,tone_rid_anxiety
0,ABANDON,ABLE,ABEYANCE,ABOVEMENTIONED,ALWAYS,ALMOST,ABIDE,TREMOR
1,ABANDONED,ABUNDANCE,ABEYANCES,ABROGATE,BEST,APPARENTLY,ABIDING,AFRAID
2,ABANDONING,ABUNDANT,ALMOST,ABROGATED,CLEARLY,APPEARED,BOUND,AGHAST


In [8]:
term_list = [term.lower() for row in vocab_df.values.tolist() for term in row if term is not None]

print(f"Number of terms: {len(term_list)}")

Number of terms: 4189
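Because the same word can sit in more than one tone column (in the preview above, ALMOST appears under both `tone_lmcd_uncertainty` and `tone_lmcd_weak_modal`), `term_list` may contain repeats, so its length is not necessarily the number of unique terms. A minimal sketch of de-duplicating while preserving order, using a small hypothetical stand-in for `term_list`:

```python
# Hypothetical stand-in for the term_list built above.
sample_terms = ["abandon", "able", "abeyance", "almost", "almost", "tremor"]

# dict.fromkeys keeps the first occurrence of each term, in original order.
unique_terms = list(dict.fromkeys(sample_terms))

# Map each unique term to a contiguous column index, the same shape as
# my_vocabulary_dict in the CountVectorizer example.
unique_vocabulary = {term: i for i, term in enumerate(unique_terms)}

print(f"Number of terms: {len(sample_terms)}")
print(f"Number of unique terms: {len(unique_terms)}")
```

De-duplicating before building the index mapping keeps the indices contiguous, which is likely to matter if the list is later passed to a vectorizer as a fixed vocabulary.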


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

Doc1 = 'Wimbledon is one of the four Grand Slam tennis tournaments, the others being the Australian Open, the French Open and the US Open.'
Doc2 = 'Since the Australian Open shifted to hardcourt in 1988, Wimbledon is the only major still played on grass'
doc_set = [Doc1, Doc2]

my_vocabulary = ['grand slam', 'australian open', 'french open', 'us open']
my_vocabulary_dict = {term:i for i, term in enumerate(my_vocabulary)}

In [None]:
my_vocabulary_dict

In [None]:
vectorizer = CountVectorizer(ngram_range=(2, 2), vocabulary=my_vocabulary_dict)

In [None]:
term_count = vectorizer.transform(doc_set)
term_count.toarray()

In [None]:
# inspect how the default analyzer in CountVectorizer tokenizes a document
analyze = vectorizer.build_analyzer()

# first 20 n-grams produced by the analyzer (bigrams here, since ngram_range=(2, 2))
analyze(Doc1)[:20]
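The toy example above uses `ngram_range=(2, 2)` because its vocabulary contains only bigrams. A vocabulary that mixes unigrams, bigrams and trigrams, like the term list described at the top, would presumably need `ngram_range=(1, 3)`. Otherwise terms outside the range are never generated by the analyzer and their counts stay at zero. A minimal sketch, with hypothetical documents and terms, that also lists which vocabulary terms actually occur in the corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus and mixed-length vocabulary.
docs = [
    "Interest rate risk rose as the interest rate outlook became uncertain.",
    "Markets were calm.",
]
mixed_vocabulary = ["uncertain", "interest rate", "interest rate risk", "abandon"]

# ngram_range=(1, 3) makes the analyzer emit unigrams, bigrams and trigrams,
# so every term length in the vocabulary can be matched.
mixed_vectorizer = CountVectorizer(ngram_range=(1, 3), vocabulary=mixed_vocabulary)
counts = mixed_vectorizer.transform(docs).toarray()

# Vocabulary terms that occur at least once anywhere in the corpus.
found = [term for term, total in zip(mixed_vocabulary, counts.sum(axis=0)) if total > 0]

print(counts)
print(f"{len(found)} of {len(mixed_vocabulary)} terms appear in the corpus: {found}")
```

Columns of `counts` follow the order of `mixed_vocabulary`, so the column sums can be zipped with the term list directly.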