# Sentiment extraction scripts

This notebook contains scripts for calculating a sentiment score for raw text. The text can come from various sources, such as articles scraped from GDELT links, raw text files from Factiva, or tweets from Twitter. The scripts assume that the data is already in raw text format, so you need to load the data with some other function before calling the sentiment methods presented here.

The method presented in this notebook is simple yet efficient. It uses one set of keywords for negative sentiment and one for positive sentiment, and counts how often those words occur among the lemmatized unigrams of the text (i.e. a bag of words).
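
As a minimal sketch of the frequency idea, the raw keyword counts can be divided by the number of words in the text. The helper name `sentiment_frequencies` and the net score (positive share minus negative share) are illustrative assumptions, not something the scripts in this notebook define.

```python
def sentiment_frequencies(num_words, positive_count, negative_count):
    """Turn raw keyword counts into relative frequencies.

    Hypothetical helper: takes a (num_words, positive, negative) triple
    and returns (positive share, negative share, net score).
    """
    if num_words == 0:
        # An empty text carries no sentiment signal.
        return (0.0, 0.0, 0.0)
    positive_share = positive_count / num_words
    negative_share = negative_count / num_words
    # Net score as the difference of the shares: one possible convention.
    return (positive_share, negative_share, positive_share - negative_share)


print(sentiment_frequencies(200, 6, 2))
```

Normalizing by text length keeps long and short articles comparable; other conventions, such as the ratio of positive to negative counts, are equally possible.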

**Advice for calculating a time series of sentiment**
* Calculate each time window separately and save the intermediate results in a temporary file. This helps to avoid losing computation time in case there are crashes or other limitations
* Use Unix time epochs for each day or other unit of time
* CSV and JSON are good formats for storing the data
* Using notebooks for the analysis is recommended, but for pure calculations the script format is more efficient
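
A minimal sketch of the first three points, assuming daily windows identified by `YYYY-MM-DD` strings and a CSV file for the intermediate results. The function names, the column names and the file name are hypothetical.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def day_to_epoch(day):
    """Convert a 'YYYY-MM-DD' string to the Unix epoch of that day's UTC midnight."""
    parsed = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def append_window_result(path, day, counts):
    """Append one time window's counts to a CSV file as soon as they are ready."""
    is_new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf8") as handle:
        writer = csv.writer(handle)
        if is_new_file:
            writer.writerow(["epoch", "num_words", "positive", "negative"])
        writer.writerow([day_to_epoch(day), *counts])


def finished_epochs(path):
    """Return the epochs already stored, so a restarted run can skip them."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf8") as handle:
        return {int(row["epoch"]) for row in csv.DictReader(handle)}


# Demonstration in a temporary directory with made-up counts.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sentiment_intermediate.csv")
    append_window_result(path, "2020-01-01", (96, 0, 0))
    print(finished_epochs(path))
```

After a crash, a rerun can call `finished_epochs` first and only compute the windows that are still missing.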

An example for loading GDELT data in memory and calculating the sentiment is presented at the end of this notebook.

Note that these scripts are meant as advice and inspiration. Running them requires modifications and adjustments for the system you are using.

In [13]:
import re
import sys
import csv
import os
import string
import time
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
import urllib.request

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tuomastakko/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


## Constructing the sentiment dictionaries


In [4]:
csv.field_size_limit(sys.maxsize)
negative_words = ('jitter','threatening','distrusted','jeopardized','jitters','hurdles','fears','feared','traumatic','fail','erodes','uneasy','distressed','unease','disquieted','perils','traumas','alarm','distrusting','doubtable','terrors','worries','panics','eroding','terrifying','doubt','traumatised','panic','imperils','mistrusts','failings','nervousness','conflicted','reject','doubting','fearing','dreads','distrust','disquiet','questioned')
positive_words=('excited','incredible','ideal','attract','tremendous','satisfactorily','brilliant','meritorious','superbly','satisfied','perfect','win','amazes','energizing','gush','wonderful','attracts','enthusiastically','exceptionally','encouraged','excels','impressively','encouraging','impress','favoured','enjoy','pleasures','positive','unique','impressed','enhances','delighted','energise','spectacular','enjoyed','enthusiastic','inspiration','galvanized','amaze','excelling')
ecount = 0 
tte=0
acount = 0 
tta=0


In [5]:
'''
Scraping tool for getting synonyms from Merriam Webster

'''
def scrapewebster(word):
    # specify which URL/web page we are going to be scraping
    url = "https://www.merriam-webster.com/thesaurus/"+(word.replace(' ','_'))

    page = urllib.request.urlopen(url)

    soup = BeautifulSoup(page, "lxml")

    infobox = []
    for row in soup.findAll('div', class_='thes-list-content synonyms_list'):
        keys=row.findAll('li')
        k = None
        for i in keys:
            k = i.find(text=True)
            tmp = k.replace('(','').replace(')','')
            if len(tmp)>1:
                infobox.append(tmp)

    return infobox

In [6]:
def scrapethesaur(entity):
    # specify which URL/web page we are going to be scraping
    url = "https://www.thesaurus.com/browse/"+(entity.replace(' ','_'))

    page = urllib.request.urlopen(url)

    soup = BeautifulSoup(page, "lxml")

    infobox = []
    
    interpret = soup.findAll('div', class_='css-kv266z e1qo4u830')[0]
    
    #for interpret in interprets:
    #    print(interpret)
    for row in interpret.findAll('ul', class_='css-1ytlws2 et6tpn80'):
        keys=row.findAll('li')
        k = None
        for i in keys:
            k = i.find(text=True)
            tmp = k.replace('(','').replace(')','')
            if len(tmp)>1:
                infobox.append(tmp)
            #for j in k.find(text=True):
            #    infobox.append(j)
    return infobox

In [8]:
'''
Function for calculating the occurrence of positive and negative words in the string parameter body.

Note that this does not use lemmatization and is not as accurate.

'''
def body_keyword_count(body, key_positive, key_negative):
    counter_p = 0 
    counter_n = 0
    for word in body.split():
        if word in key_positive:
            counter_p += 1
        if word in key_negative:
            counter_n += 1
    
    return (len(body.split()), counter_p, counter_n)
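
The exact match above compares raw whitespace-separated tokens, so a token such as `Win,` or `Fears.` would not match a lowercase dictionary entry. A minimal sketch of one way to normalize the tokens first; `normalized_tokens` and `keyword_count_normalized` are illustrative names, not part of the original scripts.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalized_tokens(body):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    return body.lower().translate(_PUNCTUATION_TABLE).split()


def keyword_count_normalized(body, key_positive, key_negative):
    """Exact-match counting on normalized tokens (no lemmatization)."""
    positive = set(key_positive)  # sets give constant-time membership tests
    negative = set(key_negative)
    tokens = normalized_tokens(body)
    counter_p = sum(1 for token in tokens if token in positive)
    counter_n = sum(1 for token in tokens if token in negative)
    return (len(tokens), counter_p, counter_n)


print(keyword_count_normalized("A Brilliant win, but fears remain.",
                               ["brilliant", "win"], ["fears"]))
# prints (6, 2, 1)
```

One caveat of this sketch: deleting punctuation merges hyphenated words, so `decision-making` becomes `decisionmaking`.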

In [9]:
'''
Function for calculating the occurrence of positive and negative words in the string parameter body.

Uses lemmatization for the dictionary words and compares the lemmas to the words in the text.

'''
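def body_keyword_count_lemma(body, key_positive, key_negative):
    lemmatizer = WordNetLemmatizer()
    # lemmatize the dictionary words once, using the lexicons passed as arguments
    lemmas_p = [lemmatizer.lemmatize(key_p) for key_p in key_positive]
    lemmas_n = [lemmatizer.lemmatize(key_n) for key_n in key_negative]
    counter_p = 0
    counter_n = 0
    words = body.split()
    for word in words:
        # substring match: the lemma 'fear' also matches 'fears' and 'fearing'
        if any(lemma in word for lemma in lemmas_p):
            counter_p += 1
        if any(lemma in word for lemma in lemmas_n):
            counter_n += 1

    return (len(words), counter_p, counter_n)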

In [15]:
'''
Scrape Google Translate
'''

def translate_token(word, from_lang, to_lang):
    url = 'https://translate.google.com/#view=home&op=translate&sl='+from_lang+'&tl='+to_lang+'&text='+word
    page = urllib.request.urlopen(url)

    soup = BeautifulSoup(page, "lxml")

    infobox = []
    for row in soup.findAll('div', class_='result-shield-container tlid-copy-target'):
        keys=row.findAll('span')
        k = None
        for i in keys:
            k = i.find(text=True)
            tmp = k.replace('(','').replace(')','')
            if len(tmp)>1:
                infobox.append(tmp)
    return infobox

translate_token('trust', 'en', 'fi')

HTTPError: HTTP Error 403: Forbidden

## How to use different lexicons for different sentiments?

Using the functions described above, one can count the keyword occurrences for each article.

The process is the following:

1. Decide which words and their synonyms you want to include in the bag of words (BOW).
2. Use scrapewebster and scrapethesaur to get the sets of words (with antonyms as well), e.g. scrapethesaur('trust') and scrapethesaur('distrust').
3. Iterate through your local files containing the text.
4. For each text body, run body_keyword_count_lemma(text, lex1, lex2).
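
The first two steps can be sketched as follows. `build_lexicon` is a hypothetical helper, and `fake_thesaurus` is an offline stand-in for a scraper such as scrapewebster or scrapethesaur, used here only so that the example runs without network access.

```python
def build_lexicon(seed_words, fetch_synonyms):
    """Collect a deduplicated, lowercased lexicon for a list of seed words.

    fetch_synonyms is any callable mapping a word to a list of synonyms,
    for example scrapewebster or scrapethesaur from this notebook.
    """
    lexicon = set()
    for seed in seed_words:
        lexicon.add(seed.lower())
        try:
            synonyms = fetch_synonyms(seed)
        except Exception as error:
            # Scrapers break when a site changes or blocks the request;
            # keep the seed word and continue with the next one.
            print("Could not fetch synonyms for {!r}: {}".format(seed, error))
            continue
        lexicon.update(word.strip().lower() for word in synonyms if word.strip())
    return sorted(lexicon)


# Offline stand-in for a scraper, used only to demonstrate the call.
def fake_thesaurus(word):
    table = {"trust": ["Faith", "confidence "], "distrust": ["suspicion", "doubt"]}
    return table[word]


lex1 = build_lexicon(["trust"], fake_thesaurus)
lex2 = build_lexicon(["distrust", "unknown"], fake_thesaurus)
print(lex1)
print(lex2)
```

Storing the resulting lists in a file after scraping avoids repeating the requests on every run.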

In [14]:
'''
Example

'''

data = ['The authors show that sentiments from newspaper articles can explain and predict movements in the term structure of U.S. government bonds. This effect is stronger at the short end of the curve, coinciding with greater volatility and investors need to continually reassess the Feds reaction function. Facing such uncertainty, market participants rely on news and sentiment as a central element in their decision-making process. Considering this dependence, the authors propose a new yield curve factor—news sentiment—that is distinct from the 3 established yield curve factors (level, slope, and curvature) as well as from fundamental macroeconomic variables.']
trust = scrapethesaur('trust')
distrust = scrapethesaur('distrust')

for body in data:
    print(body_keyword_count_lemma(body, trust, distrust))


(96, 0, 0)


## Running the analysis

To run the analysis, the textual data should be in a readable format (i.e. cleaned of markup tags etc.).
The following is a simple example of how one would run the script if the text files were stored in a local folder and each filename contained the date.

In [None]:
import glob

filelocation = 'location/of/files/'
dailysentiments = {}
for textfile in glob.glob(filelocation+'*.txt'):
    # the filename without the '.txt' extension is used as the date key
    filedate = textfile.split('/')[-1][:-4]
    with open(textfile, 'r') as readfile:
        textdata = readfile.read()
    dailysentiments[filedate] = body_keyword_count_lemma(textdata, positive_words, negative_words)

## GDELT example

This example from last semester's project course shows a way of filtering relevant GDELT news using the precalculated tags. One can use the precalculated sentiment (V2Tone) in parallel with the manually calculated bag-of-words sentiment. The topics in this example cover financial institutions; changing the listed topics and entities changes the results.

This particular example uses the csv format. If the json format is more familiar to you, check the other notebook provided.
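
To compare the precalculated tone with the bag-of-words counts, the tone value has to be read from each line. A minimal sketch under two assumptions that should be checked against the GDELT documentation for the file version in use: the tone field sits at column index 7 of the tab-separated line, and it holds comma-separated numbers with the average tone first. `parse_tone` is a hypothetical helper and the sample line is synthetic.

```python
def parse_tone(line, tone_column=7):
    """Return the first value of the tone field of a tab-separated GKG line.

    Assumptions: the tone field sits at index `tone_column` and holds
    comma-separated numbers with the average tone first. Returns None
    when the field is missing or not numeric.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= tone_column:
        return None
    try:
        return float(fields[tone_column].split(",")[0])
    except ValueError:
        return None


# Synthetic line with 11 tab-separated fields, not real GDELT data.
sample = "\t".join(["20200101", "1", "", "ECON_DEBT", "", "", "metlife",
                    "-2.5,1.0,3.5,4.5,20.0,0.5", "", "example.com",
                    "http://example.com/a"])
print(parse_tone(sample))
```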

### Collecting the GKG dataset

In [None]:
import requests
import lxml.html as lh
import sys 
import os.path
import urllib.request
import zipfile
import glob
import operator

'''
Here we construct the system for downloading the csv files for GDELT GKG

'''


infilecounter = 0
outfilecounter = 0
 

gdelt_base_url = 'http://data.gdeltproject.org/gkg/' 

# get the list of all the links on the gdelt file page
page = requests.get(gdelt_base_url+'index.html')
doc = lh.fromstring(page.content)
link_list = doc.xpath("//*/ul/li/a/@href")

# keep the links that begin with four digits, excluding the gkgcounts files
file_list = [x for x in link_list if str.isdigit(x[0:4]) and x[9:17] != 'gkgcount']

# comment out this line to run the full crawl
file_list = file_list[0:3]


In [None]:
local_path = 'yourpath' # Change this path according to the system you are using!!

debug_condition_3 = False

'''
The following functions test whether the topics or entities are present in a GKG entry,
i.e. they filter the relevant news pieces.

'''
# returns True if at least one key is in list_of_items
def themesInEntry(list_of_keys, list_of_items):
    for key in list_of_keys:
        if key in list_of_items:
            return True
    return False

def organizationInEntry(orgs, column):
    for entry in column:
        if entry in orgs:
            return True
    return False

file_is_empty = True

In [None]:
fips_country_code = 'US'
fips_country_code_hash = '#US#'
themes = ['WB_1234_BANKING_INSTITUTIONS','WB_1236_COMMERCIAL_BANKING','WB_1256_CREDIT_UNIONS','ECON_DEBT','ECON_STOCK_MARKET','WB_1234_BANKING_INSTITUTIONS','WB_1920_FINANCIAL_SECTOR_DEVELOPMENT','ECON_CENTRALBANK','WB_318_FINANCIAL_ARCHITECTURE_AND_BANKING','WB_332_CAPITAL_MARKETS','WB_611_PENSION_FUNDS','WB_971_BANKING_REGULATION']
organization = ['1347 property insurance', 	'acmat corporation', 	'aflac', 	'alleghany', 	'allstate', 	'ambac financial group', 	'american financial group', 	'american international group', 	'amerisafe', 	'arthur j gallagher', 	'assurant', 	'atlas financial holdings', 	'berkshire hathaway', 	'blue water global', 	'brown brown', 	'brp group', 	'cincinnati financial corporation', 	'cna financial corporation', 	'cno financial group', 	'conifer holdings', 	'corvel corporation', 	'crawford company', 	'donegal group', 	'ehealth', 	'employers holdings', 	'equitable holdings', 	'erie indemnity company', 	'fednat holding co', 	'fidelity national financial', 	'first acceptance corporation', 	'first american financial corporation', 	'gainsco', 	'goosehead insurance', 	'grand havana', 	'hallmark financial services', 	'hanover insurance group', 	'hartford financial services group', 	'hci group', 	'health insurance innovations', 	'heritage insurance holdings', 	'hilltop holdings', 	'horace mann educators corporation', 	'huize holding', 	'icc holdings', 	'inspro technologies corporation', 	'investors title company', 	'kemper corporation', 	'kingstone companies', 	'kinsale capital group', 	'loews corporation', 	'markel corporation', 	'marsh mclennan companies', 	'mbia', 	'mercury general corporation', 	'mgic investment corporation', 	'national general holdings corp', 	'ni holdings', 	'nmi holdings', 	'old republic international corporation', 	'pacific ventures groupinc', 	'palomar holdings inc', 	'pmi group inc', 	'positive physicians holdingsinc', 	'principal financial group', 	'proassurance corporation', 	'progressive corp', 	'prosight global inc', 	'protective insurance corp', 	'qbe insurance group limited - adr', 	'radian group', 	'reinsurance group of america inc', 	'rli corp', 	'safety insurance group inc', 	'selective insurance group', 	'state auto financial corp', 	'stewart information services corporation', 	'sundance strategies', 	'travelers companies', 	'triad 
guaranty', 	'unico american corporation', 	'united fire casualty co', 	'united insurance holdings corp', 	'universal insurance holdings', 	'unum group', 	'voya financial', 	'w r berkley corp', 	'american equity investment life holding', 	'american national insurance', 	'atlantic american corporation ', 	'brighthouse financial', 	'citizens financial corp', 	'citizens inc', 	'emergent capital inc ', 	'fbl financial group', 	'federal life group', 	'galaxy next generation', 	'genworth financial', 	'globe life inc', 	'gwg holdings', 	'independence holding company ', 	'kansas city life insurance company ', 	'lincoln national', 	'metlife', 	'national security group', 	'national western life group', 	'primerica', 	'prudential', 	'sanlam limited', 	'utg Inc', 	'vericity']

merge_results = False  # not defined in the original notebook; set to True to delete empty output files

for compressed_file in file_list[infilecounter:]:
    print(compressed_file)
    # if we don't have the compressed file stored locally, go get it. Keep trying if necessary.
    while not os.path.isfile(local_path+compressed_file):
        urllib.request.urlretrieve(url=gdelt_base_url+compressed_file, filename=local_path+compressed_file)
    # extract the contents of the compressed file to a temporary directory
    print('extracting')
    with zipfile.ZipFile(file=local_path+compressed_file, mode='r') as z:
        z.extractall(path=local_path+'tmp/')
    # parse each of the csv files in the working directory
    print('parsing')
    # make sure the output directory exists
    os.makedirs(local_path+'country/', exist_ok=True)
    for infile_name in glob.glob(local_path+'tmp/*'):
        outfile_name = local_path+'country/'+fips_country_code+'%04i.csv'%outfilecounter
        file_is_empty = True
        # open the infile and outfile
        with open(infile_name, mode='r', encoding="utf8") as infile, open(outfile_name, mode='w', encoding="utf8") as outfile:
            for line in infile:
                # skip the header row
                if line[0:4] == 'DATE':
                    continue
                fields = line.split('\t')
                organizations = fields[6].split(';')
                # keep lines that mention our country, at least one theme and at least one organization
                if fips_country_code_hash in fields[4] and themesInEntry(themes, fields[3].split(';')) and themesInEntry(organization, organizations):
                    if debug_condition_3:
                        for key in organization:
                            for item in organizations:
                                if key in item:
                                    print("'{}' IN '{}'".format(key, item))
                    file_is_empty = False
                    outfile.write(line)
        outfilecounter += 1
        # delete the temporary file
        os.remove(infile_name)
        # optionally delete output files that received no matching lines
        if file_is_empty and merge_results:
            os.remove(outfile_name)
    # delete the downloaded archive once all of its files are parsed
    os.remove(local_path+compressed_file)
    infilecounter += 1
    print('done')
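
The filter condition in the script above can also be written as a small standalone function, which makes it easier to test on a few lines before running a full crawl. This is a sketch: `line_matches` is a hypothetical name, the column indices are the ones used in the script above, and the sample line is synthetic.

```python
def line_matches(line, country_hash, wanted_themes, wanted_organizations):
    """Return True if a tab-separated GKG line passes all three filters.

    Column indices (3 = themes, 4 = locations, 6 = organizations) follow
    the script above.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        return False  # malformed or truncated line
    themes_in_line = set(fields[3].split(";"))
    organizations_in_line = set(fields[6].split(";"))
    return (
        country_hash in fields[4]
        and not themes_in_line.isdisjoint(wanted_themes)
        and not organizations_in_line.isdisjoint(wanted_organizations)
    )


# Synthetic line, not real GDELT data.
sample = "\t".join(["20200101", "1", "", "ECON_DEBT;TAX",
                    "1#United States#US##38#-97#US", "",
                    "metlife;some bank", "0", "", "", ""])
print(line_matches(sample, "#US#", {"ECON_DEBT"}, {"metlife"}))
# prints True
```

Set intersection tests replace the nested loops, which matters when the theme and organization lists are long.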

### Scraping a sample of the articles using BeautifulSoup

Saving the raw texts into separate files is once again recommended.

In [None]:
# import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import time, os, sys


# change these two lines to scrape different files
folder = 'mutuals'
input_csv = 'full_shuffled.csv'

local_path = 'path to your dir'

'''
This script used only a small sample of the available data. 
Remove these target variables if you want the full set of data.
'''
start = 0
end = 30000
target_lines = 6000

output_csv = 'scraped_articles_'+str(target_lines)+'.csv'
infile_name = local_path+input_csv
outfile_name = local_path+output_csv

write_on_file = True
# open the collated csv
infile = open(infile_name, mode='r', encoding="utf8")


# start a for loop that reads one line at a time

counter = 0

if write_on_file:            
    outfile = open(outfile_name, mode ='w+', encoding="utf8")
    
if os.path.isdir(outfile_name):
    print("ERROR file is not there")
    sys.exit()
    
    
counter_written = 0
for line in infile:
    counter += 1
    # for each line get the url entry
    url = line.split('\t')[10]
    try:
        page = urlopen(url, timeout=10)
    except:
        print("Link {} FAILED".format(counter))
        continue

    print("Link {} OK".format(counter))
    # parse the html using beautiful soup and store in variable `soup`
    soup = BeautifulSoup(page, 'html.parser')


    # extract the title 
    try:
        title = soup.title.string
    except:
        title = "NoTitle"
        
    if title is None:
        title = "NoTitle"
        
    # extract the body of the entry 
    # Take out the <div> of name and get its value
    #content = soup.find('div', {"class": "story-body sp-story-body gel-body-copy"})    
    content = soup.body

    if content is not None:
        print("Link {} has Content".format(counter))
        article = ''
        for i in content.findAll('p'):
            new_text = i.text.replace('"',' ')
            article = article + '~' +  new_text
        # save the title and body on file 
        # Saving the scraped text
        if len(article) > 0:
            print("Link {} has Text [{}] \n".format(counter, len(content.findAll('p'))))
            #print(line.split('\t')[0]+'\t'+url+'\t'+title+'\t'+article+'\n')
            #prepare strings for writing on file
            date = line.split('\t')[0]
            date = date.replace('\n',' ') 
            
            #url = str(url)
            url = url.replace('\n',' ')
            url = url.replace('\t',' ')
            
            #title = str(title)
            # str.replace returns a new string, so the result has to be reassigned
            title = title.replace('\n',' ')
            title = title.replace('\t',' ')
            
            #article = str(article)
            article = article.replace('\n',' ')
            article = article.replace('\t','~')
            delimiter = '\t'
            #print("{}  {} {} {} \n".format(counter, date, title, article[:10]))
            # named output_line so that it does not shadow the imported string module
            output_line = folder+delimiter+date+delimiter+url+delimiter+title+delimiter+article+'\n'
            #print(output_line)
            if write_on_file:            
                counter_written += 1
                written_bites = outfile.write(output_line)
                print("counter = {} written_bites = {}, url = {}".format(counter_written, written_bites, url))
                
                if counter_written == target_lines:
                    break      
        #sys.exit()
print("closing files")
infile.close()
if write_on_file:
    outfile.close()
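
Because the output is tab-separated with one article per line, every field has to be free of tabs and line breaks before writing. A minimal sketch of a helper that does this in one place; `clean_field` and `make_row` are hypothetical names and the sample values are made up.

```python
def clean_field(value, replacement=" "):
    """Make a string safe for one tab-separated field on a single line."""
    if value is None:
        return ""
    # str.replace returns a new string, so the result must be reassigned.
    for character in ("\t", "\r", "\n"):
        value = value.replace(character, replacement)
    return value.strip()


def make_row(*fields):
    """Join cleaned fields with tabs and terminate the row with a newline."""
    return "\t".join(clean_field(field) for field in fields) + "\n"


row = make_row("mutuals", "20200101", "http://example.com/a\n",
               "A\ttitle", "~first paragraph~second\nparagraph")
print(repr(row))
```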

### Calculating the sentiment for the downloaded GDELT files

The following part produces the output for the downloaded csv files. It is recommended to split the work into smaller subsets, as crashes or errors might otherwise result in loss of data.

In [None]:
folder = 'Sentiment'
input_csv = 'input.csv'
output_csv = "result.csv"

local_path = 'yourpath/'

input_file_path = local_path + input_csv
output_file_path = local_path + output_csv

In [None]:
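import string  # re-import in case an earlier cell rebound the name string

delimiter  = '\t'

with open(input_file_path,"r",encoding="utf8") as inputfile, open(output_file_path,'w',encoding="utf8") as outputfile: #copy to a new file
    counter_row = 0
    for row in inputfile:
        row_split = row.rstrip('\n').split('\t')

        article_type = row_split[0]
        date = row_split[1]
        url = row_split[2]
        title = row_split[3]
        # lowercase only the body, so that the url and title are written out unchanged
        body = row_split[4].lower()
        #print(row_split[:3])

        # '~' separates paragraphs; replace it before stripping punctuation so that words do not get merged
        body = body.replace('~', ' ')
        body = body.translate(str.maketrans('','',string.punctuation))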
        
        # 1-to-1 match
        #num_words, p_count, n_count  = body_keyword_count(body, positive_words, negative_words)

        # lemmatized match
        num_words, p_count, n_count  = body_keyword_count_lemma(body, positive_words, negative_words)
        
        
        #time.sleep(1)
        print(num_words, p_count, n_count)
        

        line = article_type + delimiter + date + delimiter + str(num_words) + delimiter + str(p_count) + delimiter + str(n_count) + delimiter + url + delimiter + title + '\n'
        
        written_bites = outputfile.write(line)
        #print(line)
        counter_row += 1
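
Once the per-article counts are written, they can be summed per date to obtain a daily series. A minimal sketch, assuming the column layout written above (article type, date, number of words, positive count, negative count, url, title); `aggregate_by_date` is a hypothetical helper and the sample rows are synthetic.

```python
import csv
import io
from collections import defaultdict


def aggregate_by_date(rows):
    """Sum word and keyword counts per date.

    Each row is expected in the layout written above:
    article_type, date, num_words, p_count, n_count, url, title.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for row in rows:
        if len(row) < 5:
            continue  # skip malformed rows
        date = row[1]
        for index, value in enumerate(row[2:5]):
            totals[date][index] += int(value)
    return dict(sorted(totals.items()))


# Synthetic rows in the output layout, not real results.
sample = io.StringIO(
    "Sentiment\t20200101\t100\t3\t1\thttp://example.com/a\tTitle A\n"
    "Sentiment\t20200101\t50\t0\t2\thttp://example.com/b\tTitle B\n"
    "Sentiment\t20200102\t80\t1\t0\thttp://example.com/c\tTitle C\n"
)
daily = aggregate_by_date(csv.reader(sample, delimiter="\t", quoting=csv.QUOTE_NONE))
for date, (num_words, positive, negative) in daily.items():
    print(date, num_words, positive, negative)
```

For a real run, the `io.StringIO` object would be replaced by the opened result file.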