This notebook introduces basic ideas in text pre-processing, using minutes from the Bank of England's Monetary Policy Committee (MPC) meetings as an example, and then illustrates information retrieval via dictionary methods.  For a more extensive discussion of pre-processing, see the notebook in the "text-mining-tutorial" repository.

Apart from the standard numpy and pandas packages, the tutorial also relies on the topicmodels package, which can be installed using ```pip install topic-modelling-tools```.  See https://github.com/sekhansen/text-mining-tutorial for more details.

In [1]:
import pandas as pd
import numpy as np
import topicmodels

We begin by loading the minutes data.

In [2]:
data = pd.read_table("mpc_minutes.txt", encoding="utf-8")

In [3]:
data.shape

(7277, 2)

In [4]:
data.columns

Index([u'year', u'minutes'], dtype='object')

The data contain 7,277 paragraphs, each paired with a meeting identifier stored in the `year` column.

In [5]:
data.year.values[0] # first meeting in sample (identifier looks like YYYYMM)

199706

In [6]:
data.year.values[-1] # last meeting in sample (identifier looks like YYYYMM)

201410

In [7]:
data.minutes.values[1] # second paragraph in sample

u'  The 12-month growth rate of notes and coins had fallen back since January, when it was 7.1%. It had fallen to 6.3% in April and the provisional estimate for May was 6.1%. It was not yet clear whether the fall simply reflected a deceleration in demand for cash following the recent fall in retail price inflation, or whether it had implications for future spending.'

For the purposes of this notebook, we aggregate the data to the full meeting rather than paragraph level.

In [8]:
data_agg = data.groupby('year').agg(lambda x: ' '.join(x))
data_agg.shape[0] # total number of meetings in data

209

# Example of Pre-Processing


The first step in pre-processing is to tokenize the data.  Tokenization breaks a raw character string into individual 'tokens' based on some pre-defined rule.

In [29]:
docsobj = topicmodels.RawDocs(data_agg.minutes, "long") # creates object for pre-processing
docsobj.tokens[1][1:30] # tokens 2 to 30 of the second meeting (index 0 is skipped)

[u'section',
 u'i',
 u'of',
 u'this',
 u'minute',
 u'summarises',
 u'the',
 u'analysis',
 u'presented',
 u'to',
 u'the',
 u'mpc',
 u'by',
 u'bank',
 u'staff',
 u',',
 u'and',
 u'also',
 u'incorporates',
 u'information',
 u'that',
 u'became',
 u'available',
 u'to',
 u'the',
 u'committee',
 u'after',
 u'the',
 u'presentation']
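
To make the idea of a tokenization rule concrete, here is a minimal sketch that does not use topicmodels.  The rule it applies (lowercase the text, keep runs of letters or digits as tokens, and treat every other non-space character as its own token) is an assumption chosen for illustration, and `simple_tokenize` is a hypothetical helper; the rule used inside `RawDocs` may differ.

```python
import re

# Assumed rule for illustration only: runs of letters/digits are tokens,
# and each remaining non-space character is a token of its own.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+|[^a-z0-9\s]")

def simple_tokenize(text):
    """Split a raw string into lowercase word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

example = "Section I of this minute summarises the analysis presented to the MPC by Bank staff, and also"
example_tokens = simple_tokenize(example)
print(example_tokens)
```

Under this rule the comma after "staff" becomes a separate token, which is why a cleaning step is needed afterwards.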

The second step in pre-processing is to remove all non-alphabetic tokens and all tokens of length one.

In [30]:
docsobj.token_clean(1)
docsobj.tokens[1][1:30] # tokens 2 to 30 of the second meeting (index 0 is skipped)

[u'of',
 u'this',
 u'minute',
 u'summarises',
 u'the',
 u'analysis',
 u'presented',
 u'to',
 u'the',
 u'mpc',
 u'by',
 u'bank',
 u'staff',
 u'and',
 u'also',
 u'incorporates',
 u'information',
 u'that',
 u'became',
 u'available',
 u'to',
 u'the',
 u'committee',
 u'after',
 u'the',
 u'presentation',
 u'section',
 u'ii',
 u'summarises']
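
The cleaning rule described above can be sketched in plain Python.  `clean_tokens` and its `min_length` parameter are hypothetical names; the sketch mirrors the stated rule (drop non-alphabetic tokens and tokens of length one) and is not taken from the topicmodels source.

```python
def clean_tokens(tokens, min_length=1):
    """Keep purely alphabetic tokens that are longer than min_length characters."""
    return [t for t in tokens if t.isalpha() and len(t) > min_length]

raw = ["section", "i", "of", "this", "minute", ",", "6", "%", "staff"]
cleaned = clean_tokens(raw)
print(cleaned)
```

As in the output above, the one-letter token "i" and the punctuation disappear while "section" and "of" survive.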

It is instructive to keep track of the dimensionality of the data as we go through different pre-processing steps.

In [31]:
all_tokens = [t for d in docsobj.tokens for t in d] # flatten tokens across all meetings
print("number of unique tokens = %d" % len(set(all_tokens)))
print("number of total tokens = %d" % len(all_tokens))

number of unique tokens = 8967
number of total tokens = 1115996


The next step in pre-processing is to remove stopwords, which here have been defined by the "long" argument to RawDocs above.

In [32]:
docsobj.stopwords # the stopwords removed in this example

{u'a',
 u'about',
 u'above',
 u'after',
 u'again',
 u'against',
 u'all',
 u'also',
 u'am',
 u'an',
 u'and',
 u'another',
 u'any',
 u'are',
 u'as',
 u'at',
 u'back',
 u'be',
 u'because',
 u'been',
 u'before',
 u'being',
 u'below',
 u'between',
 u'both',
 u'but',
 u'by',
 u'could',
 u'did',
 u'do',
 u'does',
 u'doing',
 u'down',
 u'during',
 u'each',
 u'even',
 u'ever',
 u'every',
 u'few',
 u'first',
 u'five',
 u'for',
 u'four',
 u'from',
 u'further',
 u'get',
 u'go',
 u'goes',
 u'had',
 u'has',
 u'have',
 u'having',
 u'he',
 u'her',
 u'here',
 u'hers',
 u'herself',
 u'high',
 u'him',
 u'himself',
 u'his',
 u'how',
 u'however',
 u'i',
 u'if',
 u'in',
 u'into',
 u'is',
 u'it',
 u'its',
 u'itself',
 u'just',
 u'least',
 u'less',
 u'like',
 u'long',
 u'made',
 u'make',
 u'many',
 u'me',
 u'more',
 u'most',
 u'my',
 u'myself',
 u'never',
 u'new',
 u'no',
 u'nor',
 u'not',
 u'now',
 u'of',
 u'off',
 u'old',
 u'on',
 u'once',
 u'one',
 u'only',
 u'or',
 u'other',
 u'ought',
 u'our',
 u'ours',


In [33]:
docsobj.stopword_remove("tokens")

all_tokens = [t for d in docsobj.tokens for t in d] # flatten tokens across all meetings
print("number of unique tokens = %d" % len(set(all_tokens)))
print("number of total tokens = %d" % len(all_tokens))

number of unique tokens = 8818
number of total tokens = 613560


Removing only 149 unique tokens (8,967 down to 8,818) cuts the total token count from 1,115,996 to 613,560, a reduction of about 45%.  Stopwords are few in number but account for a large share of the text.
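
The same effect shows up on a toy sentence.  The sketch below is illustrative only: `STOPWORDS` is a hypothetical four-word list (not the "long" list used above), and `remove_stopwords` is a hypothetical helper rather than part of topicmodels.

```python
# Hypothetical short stopword list, for illustration only.
STOPWORDS = {"the", "of", "and", "that"}

def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = "the committee noted that the growth of output and the rate of inflation had fallen".split()
kept = remove_stopwords(tokens, STOPWORDS)

print("unique before = %d, total before = %d" % (len(set(tokens)), len(tokens)))
print("unique after  = %d, total after  = %d" % (len(set(kept)), len(kept)))
```

Four unique words are removed, yet they account for seven of the fifteen tokens in the sentence.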

The final pre-processing step in this example is stemming, which removes suffixes from words in order to map tokens with different grammatical forms into a single linguistic root.

In [34]:
docsobj.stem()

all_stems = [s for d in docsobj.stems for s in d]
print("number of unique stems = %d" % len(set(all_stems)))
print("number of total stems = %d" % len(all_stems))

docsobj.stopword_remove("stems") # remove stems that are on the stopword list

number of unique stems = 5550
number of total stems = 613560


Here the total number of terms has stayed the same at 613,560, but the number of unique terms has fallen from 8,818 to 5,550.
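
A toy example shows why stemming shrinks the vocabulary without changing the total count.  The suffix list and `toy_stem` below are deliberately crude and hypothetical; they are not the stemming algorithm used by topicmodels, and a real stemmer applies far more careful rules.

```python
# Hypothetical, deliberately tiny suffix list; checked in this order.
SUFFIXES = ("ing", "ed", "es", "e", "s")

def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

words = ["increase", "increased", "increases", "increasing",
         "expand", "expands", "expanded"]
stems = [toy_stem(w) for w in words]

print("unique words = %d, unique stems = %d" % (len(set(words)), len(set(stems))))
print("total words = %d, total stems = %d" % (len(words), len(stems)))
```

Each word maps to exactly one stem, so the total is unchanged, while the different grammatical forms collapse onto a shared root.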

# Example of Dictionary Methods

For this example, we use the monetary policy sentiment dictionaries from Apel and Blix-Grimaldi (2012) to characterize the sentiment of each MPC meeting.  We then compare measured sentiment with UK GDP growth as reported by the Office for National Statistics (ONS).

In [35]:
bowobj = topicmodels.BOW(docsobj.stems) # create an object for bag-of-words analysis

In [36]:
topicmodels.bow_data.pos_dict # the positive sentiment words

{u'accelerate',
 u'accelerated',
 u'accelerates',
 u'accelerating',
 u'expand',
 u'expanded',
 u'expanding',
 u'expands',
 u'fast',
 u'faster',
 u'fastest',
 u'gain',
 u'gained',
 u'gaining',
 u'gains',
 u'high',
 u'higher',
 u'highest',
 u'increase',
 u'increased',
 u'increases',
 u'increasing',
 u'strong',
 u'stronger',
 u'strongest'}

In [37]:
topicmodels.bow_data.neg_dict # the negative sentiment words

{u'contract',
 u'contracted',
 u'contracting',
 u'contracts',
 u'decelerate',
 u'decelerated',
 u'decelerates',
 u'decelerating',
 u'decrease',
 u'decreased',
 u'decreases',
 u'decreasing',
 u'lose',
 u'losing',
 u'loss',
 u'losses',
 u'lost',
 u'low',
 u'lower',
 u'lowest',
 u'slow',
 u'slower',
 u'slowest',
 u'weak',
 u'weaker',
 u'weakest'}

The overall sentiment indicator is the net count of positive words (positive minus negative) divided by the total number of sentiment words (positive plus negative), so it lies between -1 and 1.  All sentiment words are stemmed in order to match the data we formed in pre-processing above.

In [23]:
data_agg['pos'] = bowobj.pos_count('stems')
data_agg['neg'] = bowobj.neg_count('stems')
data_agg['sentiment'] = (data_agg.pos - data_agg.neg) /\
                        (data_agg.pos + data_agg.neg)
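
The indicator can also be written out for a single document in plain Python.  This is a sketch under stated assumptions: the `POS` and `NEG` sets are hypothetical stemmed terms, `net_sentiment` is a hypothetical helper, and returning NaN when a document has no sentiment words is a design choice made here rather than something taken from topicmodels.

```python
from collections import Counter

POS = {"increas", "strong", "high"}  # hypothetical stemmed positive terms
NEG = {"decreas", "weak", "low"}     # hypothetical stemmed negative terms

def net_sentiment(stems, pos_terms, neg_terms):
    """Return (pos - neg) / (pos + neg), or NaN if no sentiment words occur."""
    counts = Counter(stems)
    pos = sum(counts[t] for t in pos_terms)
    neg = sum(counts[t] for t in neg_terms)
    total = pos + neg
    if total == 0:
        return float("nan")
    return (pos - neg) / total

doc = ["growth", "strong", "increas", "weak", "demand", "increas"]
print(net_sentiment(doc, POS, NEG))
```

With three positive and one negative term, the indicator is (3 - 1) / (3 + 1) = 0.5.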

Next we add quarterly GDP data collected from the ONS website.

In [24]:
ons = pd.read_csv('ons_quarterly_gdp.csv')
data_agg['gdp_growth'] = ons.gdp_growth.values
data_agg['quarter'] = ons.quarter.values

Finally, we compute the average MPC minutes sentiment per quarter and correlate it with GDP growth.

In [25]:
temp = data_agg.groupby('quarter').agg(np.mean)
print(temp.corr())
temp['quarter'] = sorted(set(ons.label))
temp[['quarter', 'sentiment', 'gdp_growth']].to_csv('output.csv', index=False)

                 pos       neg  sentiment  gdp_growth
pos         1.000000  0.536057   0.572704    0.382814
neg         0.536057  1.000000  -0.289257    0.153855
sentiment   0.572704 -0.289257   1.000000    0.409506
gdp_growth  0.382814  0.153855   0.409506    1.000000
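
For readers who want to see what the group-and-correlate step does, here is a plain-Python sketch of the same two operations: average within each quarter, then take the Pearson correlation across quarters.  The records in `meetings` are made-up numbers for illustration, not MPC or ONS data, and `mean` and `pearson` are hypothetical helpers.

```python
from collections import defaultdict
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up (quarter, sentiment, gdp_growth) records, one per meeting.
meetings = [
    ("2000Q1", 0.2, 0.9), ("2000Q1", 0.4, 0.9),
    ("2000Q2", 0.0, 0.4), ("2000Q2", -0.2, 0.4),
    ("2000Q3", 0.1, 0.6), ("2000Q3", 0.3, 0.6),
]

# Group meetings by quarter, then average each series within the quarter.
by_quarter = defaultdict(list)
for quarter, sentiment, gdp in meetings:
    by_quarter[quarter].append((sentiment, gdp))

quarters = sorted(by_quarter)
avg_sentiment = [mean([s for s, _ in by_quarter[q]]) for q in quarters]
avg_gdp = [mean([g for _, g in by_quarter[q]]) for q in quarters]

corr = pearson(avg_sentiment, avg_gdp)
print("correlation across %d quarters = %.3f" % (len(quarters), corr))
```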


Despite their arguable lack of subtlety, dictionary methods have here produced a sentiment indicator that correlates positively with real activity: its correlation with GDP growth is about 0.41.  Further exploration can be done with the output file written above.

# Bigram Feature Space

Above we used stem counts as a feature space, but we could alternatively have used bigram counts.

In [39]:
docsobj.bigram('stems')

all_bigrams = [s for d in docsobj.bigrams for s in d]
print("number of unique bigrams = %d" % len(set(all_bigrams)))
print("number of total bigrams = %d" % len(all_bigrams))

number of unique bigrams = 172016
number of total bigrams = 610013
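
The construction of bigrams can be sketched in a few lines.  `make_bigrams` is a hypothetical helper that represents each bigram as a tuple of adjacent stems; how topicmodels represents bigrams internally is not shown here and may differ.

```python
def make_bigrams(stems):
    """Pair each stem with the stem that follows it."""
    return list(zip(stems, stems[1:]))

doc_stems = ["monetari", "polici", "committe", "vote", "monetari", "polici"]
doc_bigrams = make_bigrams(doc_stems)

print(doc_bigrams)
print("unique stems = %d, unique bigrams = %d" % (len(set(doc_stems)), len(set(doc_bigrams))))
```

A document with n stems yields n - 1 bigrams, so the total count barely changes, but the number of distinct pairs grows much faster than the number of distinct stems, which is the pattern in the counts above.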


In [48]:
docsobj.term_rank('stems') # rank stems by document frequency and tf-idf