This notebook takes news articles focused on a specific country as input and computes the sentiment of each article.

Our dataset of news comes from Factiva.com. Articles are indexed using region and subject tags. Each article is annotated with topics and geographic tags generated by Factiva using a proprietary algorithm. Note that an article can be tagged with multiple locations and topics. 

We focus on English-language articles published by Reuters between 1991 and 2015 and tagged with either `economic news` or `financial market news` as well as with one of the 25 countries in our sample (9 advanced economies and 16 emerging markets). Overall, our dataset covers a wide range of economic topics (e.g., economic policy, government finance), financial topics (e.g., commodity markets, equity markets, forex), as well as corporate and political news.

Sentiment is measured using a simple dictionary approach based on Loughran and McDonald 2011. To measure news sentiment, we use a `bag-of-words` model, allowing us to reduce complex and multi-dimensional text data into a single number. 

First, we combine existing lists of positive and negative words found in financial texts by Loughran and McDonald 2011 and in texts related to economic policy by Young and Soroka 2012. We then expand these lists with the inflections of each word: for example, because `lose` belongs to the negative list, we also include `losing`, `loser`, `lost`, `loss`, etc. The final lists contain 7,217 negative words and 3,250 positive words.

Next, we define the sentiment of an article $j$ as:
$$ s_{j} = \dfrac{\sum_{i} w_{ij} p_{ij} - \sum_{i} w_{ij} n_{ij}}{\sum_{i} w_{ij} t_{ij}} $$
where $p_{ij}$ is the number of occurrences of positive word $i$ in article $j$, $n_{ij}$ is the number of occurrences of negative word $i$ in article $j$, $t_{ij}$ is the number of occurrences of word $i$ in article $j$, and $w_{ij}$ is the weight associated with word $i$ in article $j$. 
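
As an illustration, the formula above can be sketched in a few lines of Python. This is a minimal sketch, not the code used later in this notebook: the function name `article_sentiment`, the tiny word lists, and the example sentence are all hypothetical, and every weight defaults to 1 unless a `weights` dictionary is supplied.

```python
import re
from collections import Counter

def article_sentiment(text, positive, negative, weights=None):
    """Net sentiment s_j of one article; `weights` maps word -> w_ij (default 1)."""
    weights = weights or {}
    # Lowercase, replace non-letters with spaces, split into words
    tokens = re.sub("[^a-zA-Z]", " ", text.lower()).split()
    counts = Counter(tokens)

    def w(word):
        return weights.get(word, 1.0)

    pos = sum(w(x) * c for x, c in counts.items() if x in positive)
    neg = sum(w(x) * c for x, c in counts.items() if x in negative)
    total = sum(w(x) * c for x, c in counts.items())
    # An empty article has no words, so we return 0 by convention (an assumption)
    return (pos - neg) / total if total else 0.0

# Hypothetical word lists and text, for illustration only
positive = {"gain", "gains", "strong"}
negative = {"loss", "losses", "weak"}
score = article_sentiment("Strong gains offset earlier losses", positive, negative)
print(score)
```

Here two of the five words are positive and one is negative, so the score is (2 - 1) / 5.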

In our baseline estimates, we take $w_{ij} = 1$, allowing each word to contribute to the sentiment measure proportionally to its frequency of occurrence. In a robustness check, we let each word contribute to the sentiment measure proportionally to its `Term Frequency–Inverse Document Frequency` (TF-IDF, Manning 2010) by taking:
$$ w_{ij} = \log\left(\frac{N}{N_{i}}\right) $$
where $N$ is the number of articles in the corpus and $N_{i}$ is the number of articles in which word $i$ is present. This weighting smooths out differences in word frequency that occur naturally in the English language by giving more weight to words that appear in fewer documents. It is well established that the distribution of words in the English language follows a power law. For a broader discussion of power laws in economics, see Gabaix 2016.
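
The weights $w_{ij} = \log(N/N_{i})$ could be computed from a corpus roughly as follows. This is a sketch under stated assumptions: the helper `idf_weights` and the three-sentence corpus are hypothetical, and the notebook itself loads precomputed IDF values from `tone2keywords.pkl` rather than computing them this way.

```python
import math
import re
from collections import Counter

def idf_weights(articles):
    """Map each word to log(N / N_i), where N_i counts articles containing it."""
    n_articles = len(articles)
    doc_freq = Counter()
    for text in articles:
        # Use a set so each word is counted at most once per article
        doc_freq.update(set(re.sub("[^a-zA-Z]", " ", text.lower()).split()))
    return {word: math.log(n_articles / n_i) for word, n_i in doc_freq.items()}

# Hypothetical toy corpus
corpus = [
    "markets rally on strong data",
    "markets fall on weak data",
    "central bank holds rates",
]
weights = idf_weights(corpus)
print(weights["markets"], weights["rally"])
```

A word that appears in two of the three articles (`markets`) receives a smaller weight than a word that appears in only one (`rally`).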


In [1]:
import os
import pandas as pd
from timeit import default_timer as timer
from datetime import datetime
import re
import sys
import numpy as np
import multiprocessing as mp
import collections
import nltk
from nltk.corpus import stopwords
from operator import itemgetter
import pickle
import more_itertools as mit

In [2]:
path_to_data='/scratch/spf248/news/data'

filenames_snapshots=[
'eeocwhc7sy.pkl.xz', # 2016_2019_25_countries
'vynzboapen.pkl.xz', # 2019_2019_25_countries
'c35plzarij.pkl.xz', # 1991_2019_remaining_countries
'644qmwwrwf.pkl.xz', # 2019_2020
'lnl7ntre52.pkl.xz', # 2020_2020
]

filenames_bulk=[
'reuters-econ-fin-mkt-25-countries-1991-1996.pkl',
'reuters-econ-fin-mkt-25-countries-1991-2015.pkl.xz',
]

# Load snapshot dataset

In [3]:
dfs={}

In [4]:
print('Import Data...\n')
start = timer()

for filename in filenames_snapshots:
    print(filename)
    dfs[filename]=pd.read_pickle(os.path.join(path_to_data,filename))
    
print("Done in", round(timer()-start), "sec")

Import Data...

eeocwhc7sy.pkl.xz
vynzboapen.pkl.xz
c35plzarij.pkl.xz
644qmwwrwf.pkl.xz
lnl7ntre52.pkl.xz
Done in 118 sec


In [5]:
print('Clean Snapshot Data...\n')
start = timer()

def clean_snapshot_data(df):
    df.rename(columns={'region_codes_to_label':'regions',
                       'subject_codes_to_label':'subjects',
                       'publication_datetime':'date'},inplace=True)
    df['full_text']=\
    df['title'].replace(np.nan,'')+' '+\
    df['snippet'].replace(np.nan,'')+' '+\
    df['body'].replace(np.nan,'')
    df.reset_index(inplace=True)
    # Keep only the columns used downstream (axis passed by keyword for recent pandas versions)
    df.drop([x for x in df.columns if x not in ['an','date','regions','subjects','full_text']],axis=1,inplace=True)

for filename in filenames_snapshots:
    print(filename)
    clean_snapshot_data(dfs[filename])
        
print("Done in", round(timer()-start), "sec")

Clean Snapshot Data...

eeocwhc7sy.pkl.xz
vynzboapen.pkl.xz
c35plzarij.pkl.xz
644qmwwrwf.pkl.xz
lnl7ntre52.pkl.xz
Done in 14 sec


# Load bulk dataset

In [6]:
print('Import Data...\n')
start = timer()

for filename in filenames_bulk:
    print(filename)
    dfs[filename]=pd.read_pickle(os.path.join(path_to_data,filename))
    
print("Done in", round(timer()-start), "sec")

Import Data...

reuters-econ-fin-mkt-25-countries-1991-1996.pkl
reuters-econ-fin-mkt-25-countries-1991-2015.pkl.xz
Done in 182 sec


In [14]:
print('Clean Bulk Data...\n')
start = timer()

def clean_bulk_data(df_old):
    
    df=df_old.copy()
    
    df.rename(columns={'id':'an'},inplace=True)

    df['full_text']=\
    df['headline'].replace(np.nan,'')+' '+\
    df['leading paragraph'].replace(np.nan,'')+' '+\
    df['text'].replace(np.nan,'')

    if 'hour ET' in df.columns and 'minute' in df.columns:
        df['date']=pd.NaT
        idx1=df[['year','month','day','hour ET','minute']].dropna().index
        df.loc[idx1,'date']=pd.to_datetime(df.loc[idx1,['year','month','day','hour ET','minute']].rename(columns={'hour ET':'hour'}))
        idx2=df.loc[~df.index.isin(idx1)].index
        df.loc[idx2,'date']=pd.to_datetime(df.loc[idx2,['year','month','day']])
    else:
        df['date']=pd.to_datetime(df[['year','month','day']])

    for tags in ['regions','subjects']:
        df = df.loc[(~df[tags].isnull())].copy()
        # Tags look like 'code:label'; keep the label only
        df[tags]=df[tags].apply(lambda x:[y.split(':')[1].strip() for y in x])

    df.drop([x for x in df.columns if x not in ['an','date','regions','subjects','full_text']],axis=1,inplace=True)
    
    return df
    
for filename in filenames_bulk:
    print(filename)
    dfs[filename]=clean_bulk_data(dfs[filename])
        
print("Done in", round(timer()-start), "sec")

Clean Bulk Data...

reuters-econ-fin-mkt-25-countries-1991-1996.pkl
reuters-econ-fin-mkt-25-countries-1991-2015.pkl.xz
Done in 63 sec


In [18]:
print('Merge...\n')
start = timer()

df = pd.concat([v for k,v in dfs.items()],sort=True)
del dfs
print('# Articles:', df.shape[0])

df.drop_duplicates('an',inplace=True)
df.set_index('an',inplace=True)
df.sort_values(by='date',inplace=True)
print('# Articles:', df.shape[0])

print("Done in", round(timer()-start), "sec")

Merge...

# Articles: 7683019
# Articles: 6751733
Done in 13 sec


In [23]:
df.count()

date         6751733
full_text    6751733
regions      6751733
subjects     6751733
dtype: int64

In [24]:
df.tail()

an,date,full_text,regions,subjects
LBA0000020201115egbf00kr9,2020-11-15 14:23:17,IMF MISSION WILL CONTINUE VIRTUAL WORKING MEET...,"[Argentina, Developing Economies, Latin Americ...","[Top Wire News, Government Borrowing, Governme..."
LBA0000020201115egbf00lj1,2020-11-15 14:48:31,Member of IMF team in Argentina tests positive...,"[Argentina, Buenos Aires, Developing Economies...","[Top Wire News, Government Borrowing, Outbreak..."
LBA0000020201115egbf00rut,2020-11-15 18:30:13,Asia Morning Call-Global Markets Nov 16 ...,"[China, Japan, Dalian, South Korea, United Sta...","[SARS/MERS Viruses, Outbreaks/Epidemics, Equit..."
LBA0000020201115egbf00wvd,2020-11-15 20:50:19,UPDATE 1-Asia Morning Call-Global Markets ...,"[China, Japan, South Korea, Dalian, United Sta...","[SARS/MERS Viruses, Outbreaks/Epidemics, Forei..."
LBA0000020201115egbf00x99,2020-11-15 21:00:00,"Risk of German recession this winter rises, st...","[Germany, DACH Countries, European Union Count...","[Economic News, Facility Openings, SARS/MERS V..."


# Compute sentiment

In [28]:
with open(os.path.join(path_to_data,'tone2keywords.pkl'),'rb') as f:
    tone2keywords = pickle.load(f)
    
for name in tone2keywords:
    print(name)
    tone2keywords[name] = tone2keywords[name].set_index('word')['IDF'].to_dict()

strong
positive
negative
uncertainty
weak


In [29]:
def get_counts(idx,data=df['full_text'],tones=list(tone2keywords)):
    
    # Split into words and remove non-letter characters
    tokens = re.sub("[^a-zA-Z]"," ", data.loc[idx].lower()).split()
    
    # Return Words and Their Count
    counter = collections.Counter(tokens)

    # Word Count
    T = sum(counter.values())

    values = [T]
    index  = ['# words']

    for tone in sorted(tones):

        # Tonal Words In the Text
        words = list(set(counter.keys())&set(tone2keywords[tone].keys()))

        if words:

            # Tonal Words Counts
            counts = itemgetter(*words)(counter)

            # Tonal Words IDFs
            idfs = itemgetter(*words)(tone2keywords[tone])

            if len(words) > 1:
                tf = sum(counts)/T
            else:
                tf = counts/T
                
            tfidf = np.dot(counts,idfs)/T

        else:

            tf = 0
            tfidf  = 0

        values.append(tf)
        index.append('% '+tone)

        values.append(tfidf)
        index.append('% '+tone+' tfidf')
        
    return pd.Series(values,index=index,name=idx)

In [19]:
print("Compute Sentiment...\n")
start = timer()

with mp.Pool() as pool:
    sentiments = pd.DataFrame(pool.map(get_counts, df.index))
    
end = timer()
print("Done In", round(end - start),"Sec")

Compute Sentiment...

Done In 147 Sec


In [29]:
sentiments.head()

Unnamed: 0,# words,% negative,% negative tfidf,% positive,% positive tfidf,% strong,% strong tfidf,% uncertainty,% uncertainty tfidf,% weak,% weak tfidf
lba0000020011204dj5r010vk,54.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
lba0000020011204dj5t011pc,76.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.041433,0.013158,0.041433
lba0000020011204dj6100vhg,188.0,0.031915,0.164103,0.015957,0.080317,0.0,0.0,0.0,0.0,0.0,0.0
lba0000020011204dj6100vdc,223.0,0.022422,0.159844,0.013453,0.05111,0.004484,0.020037,0.0,0.0,0.0,0.0
lba0000020011204dj6100vcl,88.0,0.045455,0.254071,0.034091,0.144083,0.0,0.0,0.0,0.0,0.0,0.0
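
With unit weights, the net sentiment $s_{j}$ defined at the top of this notebook reduces to the difference between the `% positive` and `% negative` columns, since both are already divided by the word count. A minimal sketch, using three rows copied from the output above; the column name `net sentiment` is hypothetical and is not created anywhere else in this notebook:

```python
import pandas as pd

# Three rows copied from the sentiments.head() output above
toy = pd.DataFrame(
    {
        "% negative": [0.0, 0.031915, 0.045455],
        "% positive": [0.0, 0.015957, 0.034091],
    },
    index=[
        "lba0000020011204dj5r010vk",
        "lba0000020011204dj6100vhg",
        "lba0000020011204dj6100vcl",
    ],
)

# s_j = share of positive words minus share of negative words (w_ij = 1)
toy["net sentiment"] = toy["% positive"] - toy["% negative"]
print(toy)
```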


# Geocode regions

In [83]:
print('Regions...\n')
start = timer()

regions = df['regions'].explode().value_counts()
# Name the index explicitly so the column is called 'region' regardless of pandas version
regions = regions.rename_axis('region').rename('n_obs').reset_index()
regions.to_csv(os.path.join(path_to_data,'regions_'+datetime.today().strftime('%m%Y')+'.csv'))

print("Done in", round(timer()-start), "sec")

Regions...

Done in 17 sec


In [100]:
regions.head()

Unnamed: 0,region,n_obs
0,Europe,2661597
1,Emerging Market Countries,2329458
2,North America,2210742
3,Western Europe,2051720
4,United States,2032726


In [106]:
regions_geocoded = pd.read_csv(os.path.join(path_to_data,'regions_geocoded_'+datetime.today().strftime('%m%Y')+'.csv'),index_col=0)

In [109]:
regions_geocoded.head()

Unnamed: 0,region,n_obs,country,country_code
0,Europe,2661597,,0
1,Emerging Market Countries,2329458,,0
2,North America,2210742,,0
3,Western Europe,2051720,,0
4,United States,2032726,United States,1


In [117]:
print("Regions...\n")
start = timer()

# List of country tags
regions2countries = (
    df['regions'].explode().rename('region').reset_index()
    .merge(regions_geocoded[['region','country']])
    .dropna()
    .groupby('an')['country']
    .apply(lambda x: sorted(set(x)))
)
print('# Articles with country tags:', regions2countries.shape[0])

end = timer()
print("Done In", round(end - start),"Sec")

Regions...

# Articles: 6751733
Done In 15 Sec
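
To make the explode/merge/group steps in the cell above concrete, here is the same chain applied to a toy example. The three articles and the lookup table are hypothetical stand-ins for `df` and `regions_geocoded`; only the sequence of pandas calls mirrors the cell.

```python
import pandas as pd

# Hypothetical articles, each with a list of region tags
articles = pd.DataFrame(
    {"regions": [["Europe", "Germany"], ["United States", "North America"], ["Europe"]]},
    index=pd.Index(["a1", "a2", "a3"], name="an"),
)

# Hypothetical lookup: aggregate regions have no country
lookup = pd.DataFrame(
    {
        "region": ["Europe", "North America", "Germany", "United States"],
        "country": [None, None, "Germany", "United States"],
    }
)

countries = (
    articles["regions"].explode().rename("region").reset_index()  # one row per (article, tag)
    .merge(lookup)                                                # attach country, if any
    .dropna()                                                     # drop aggregate regions
    .groupby("an")["country"]
    .apply(lambda x: sorted(set(x)))                              # unique countries per article
)
print(countries)
```

In this toy example, article `a3` is tagged only with an aggregate region, so it drops out of the result entirely.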


# Combine news features

In [128]:
features = pd.concat([df[['date','subjects']],regions2countries,sentiments[['# words','% negative','% positive']]],axis=1).sort_values(by='date')

In [132]:
features.count()

date          6751733
country       6493433
subjects      6751733
# words       6751733
% negative    6751733
% positive    6751733
dtype: int64

In [129]:
features.head()

Unnamed: 0,date,country,subjects,# words,% negative,% positive
lba0000020011204dj5r010vk,1987-05-27,[Taiwan],"[Output/Production, Marketing, Corporate/Indus...",54.0,0.0,0.0
lba0000020011204dj5t011pc,1987-05-29,[Taiwan],"[Economic Performance/Indicators, Public Secto...",76.0,0.0,0.0
lba0000020011204dj6100vf7,1987-06-01,[Singapore],"[Debt/Bond Markets, Commodity/Financial Market...",128.0,0.0,0.015625
lba0000020011204dj6100vd5,1987-06-01,[Australia],"[Economic Performance/Indicators, Money Supply...",96.0,0.010417,0.010417
lba0000020011204dj6100vdn,1987-06-01,[Sri Lanka],"[Economic Performance/Indicators, Public Secto...",32.0,0.0,0.03125


In [130]:
print("Save...\n")
start = timer()

features.to_pickle(os.path.join(path_to_data,'news_features_'+datetime.today().strftime('%m%Y')+'.pkl'))

end = timer()
print("Done In", round(end - start),"Sec")

Save...

Done In 20 Sec
