#### Question 

Can a program accurately analyze the sentiment of textual news and predict future returns based on the sentiment of market-moving news?

#### Goals 

1. Programmatically analyze the sentiment of market-moving news (new news) and return a positive/negative signal.
2. Implement a trading strategy on out-of-sample data that goes long a stock if the sentiment analysis is positive (for x number of news reports) and goes short if the sentiment is negative (for x number of news reports).
3. Evaluate the performance of the sentiment analysis program by calculating the returns of the strategy. 
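
A minimal sketch of the long/short rule in goal 2. The goal only says "for x number of news reports", so the specific rule used here (the last x reports must all agree) and the function name `trade_signal` are assumptions made for illustration.

```python
from typing import Sequence


def trade_signal(labels: Sequence[int], x: int) -> int:
    """Return +1 (go long), -1 (go short), or 0 (no position).

    labels holds one sentiment sign per report (+1 positive, -1 negative),
    oldest first. Assumed rule: the last x reports must all agree.
    """
    if x <= 0 or len(labels) < x:
        return 0
    recent = labels[-x:]
    if all(label > 0 for label in recent):
        return 1
    if all(label < 0 for label in recent):
        return -1
    return 0


print(trade_signal([1, -1, 1, 1], x=2))   # last two positive
print(trade_signal([1, -1, -1], x=2))     # last two negative
print(trade_signal([1, -1], x=2))         # mixed, so no position
```

Other rules, such as a majority vote over the last x reports, would fit the same interface.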

#### Data (must be "new" news) 

I have decided to avoid social media and traditional news reports because it would be difficult to determine whether the news is "new" or has already been reported by other sources, in which case it may not be market-moving. This project will focus on corporate releases, which tend to contain "new" and material information as of the release date.

##### Earnings Call Transcripts

These transcripts will be the primary data source used for training and tweaking the sentiment analysis program. I downloaded 9973 earnings call transcripts from 2002 - 2005. The files are available online here:

https://www.dropbox.com/sh/2tjbi4iika8z2kc/AAD_AoogWEpTjwXpsvjs0Z8Sa

I will attempt to use the Dropbox or Google Drive API to read in the online files. The process seems quite complex, so for now I have downloaded the files and am reading them from my computer until I figure out the API code.

##### Earnings releases vs analyst expectations 

I am considering using analyst expectations as a secondary data source to add context to the analysis. While a company may perform well and have a positive earnings call, if analysts already expected good performance and the market factored that expectation into the stock price, the earnings call may not move the stock price up further.

I will focus on earnings per share expectations vs actuals data, which I do not anticipate difficulty accessing. Since this is a secondary aim of the project, I will only explore it after completing the primary goal of sentiment analysis with the earnings call transcripts. 
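
One way the expectations data could be combined with the sentiment signal is sketched below. The function names, the relative-surprise formula, and the rule of keeping a signal only when the surprise agrees with it are all assumptions for illustration, and the EPS figures in the usage lines are made up.

```python
def eps_surprise(actual: float, expected: float) -> float:
    """Relative EPS surprise: (actual - expected) / |expected|."""
    if expected == 0:
        raise ValueError("expected EPS of zero makes the ratio undefined")
    return (actual - expected) / abs(expected)


def adjusted_signal(sentiment_sign: int, surprise: float, threshold: float = 0.0) -> int:
    """Keep the sentiment signal only when the surprise points the same way."""
    if sentiment_sign > 0 and surprise > threshold:
        return 1
    if sentiment_sign < 0 and surprise < -threshold:
        return -1
    return 0


# Hypothetical numbers: a positive call, but earnings only matched expectations.
print(adjusted_signal(1, eps_surprise(actual=0.10, expected=0.10)))
# Hypothetical numbers: a positive call with earnings above expectations.
print(adjusted_signal(1, eps_surprise(actual=0.12, expected=0.10)))
```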

#### Methodology:

##### 1. Read in transcripts 

See code below.

##### 2. Remove stop words

Stop words are generally words that are not considered to add information content to the question at hand. The Notre Dame Software Repository for Accounting and Finance (https://sraf.nd.edu/textual-analysis/resources/#Master%20Dictionary) provides a generic stop word list based on the stop word list used by Python's Natural Language Toolkit (NLTK):

https://drive.google.com/file/d/1BWI-95c1gQ1WL1UeV7jKGc6-GhXubbVf/view?usp=sharing

##### 3. Use a dictionary to calculate sentiment indicators

From SRAF: https://drive.google.com/file/d/13OVoxlfr0_xZ1Did4FRZGU8fVAcHPthM/view?usp=sharing

The dictionary reports counts, proportion of total, average proportion per document, standard deviation of proportion per document, document count (i.e., number of documents containing at least one occurrence of the word), eight sentiment category identifiers, Harvard Word List identifier, number of syllables, and source for each word.  The sentiment categories are: negative, positive, uncertainty, litigious, modal, constraining.   Modal words are flagged as 1, 2 or 3, with 1 = Strong Modal, 2 = Moderate Modal, and 3 = Weak Modal.  The other sentiment words are flagged with a number indicating the year in which they were added to the list. 
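
To make the flag scheme concrete, here is a small sketch that reads a few dictionary-style rows and builds word sets per category. The rows in `SAMPLE` are illustrative stand-ins rather than lines copied from the SRAF file, and only four of the columns are shown; the helper name `words_in_category` is hypothetical.

```python
import csv
import io

# Illustrative rows only: a non-zero year flags membership in a sentiment
# category, and Modal uses 1 (strong), 2 (moderate), 3 (weak).
SAMPLE = """Word,Negative,Positive,Modal
ABANDON,2009,0,0
ABLE,0,2009,0
COULD,0,0,3
TABLE,0,0,0
"""


def words_in_category(rows, category):
    """Lower-cased words whose flag in `category` is non-zero."""
    return {row["Word"].lower() for row in rows if int(row[category]) != 0}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
positive = words_in_category(rows, "Positive")
negative = words_in_category(rows, "Negative")
weak_modal = {row["Word"].lower() for row in rows if int(row["Modal"]) == 3}

print(positive, negative, weak_modal)
```

Because the flag is a year rather than a 0/1 indicator, the membership test is "non-zero" rather than "equal to 1".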
    
##### 4. Compare sentiment with future returns 

Hypothesis: A positive earnings call should result in positive future returns (if positive performance is not already baked into the current share price) and vice versa.

While I do not anticipate that getting the price data will be a significant problem, since the data is publicly available on Yahoo Finance, I am still working on code to read price data for the appropriate historical period. I found code on Kaggle that successfully downloads the price data of all companies in the S&P over the last 5 years (see code below). However, I have been unable to adapt the code to download prices from 2002 - 2005 (the dates of the transcripts), presumably because the companies in the S&P at the time do not match the current list used in the code.
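
Once prices are available, the comparison needs a forward return measured from each call date. The sketch below is one possible definition, not a settled choice: it assumes the call takes place after the US market close, so the entry price is the last close on or before the call date and the exit is `horizon` trading days later. The function name `forward_return` and the event date in the usage line are hypothetical; the closes are AAPL prices from early January 2002.

```python
from bisect import bisect_right
from datetime import date


def forward_return(dates, closes, event_day, horizon=1):
    """Close-to-close return over `horizon` trading days after an event.

    dates must be sorted ascending and aligned with closes. Returns None when
    the event precedes the data or the exit day falls beyond it.
    """
    entry = bisect_right(dates, event_day) - 1  # last trading day <= event_day
    exit_ = entry + horizon
    if entry < 0 or exit_ >= len(dates):
        return None
    return closes[exit_] / closes[entry] - 1.0


dates = [date(2002, 1, 2), date(2002, 1, 3), date(2002, 1, 4),
         date(2002, 1, 7), date(2002, 1, 8)]
closes = [23.30, 23.58, 23.69, 22.90, 22.61]

# Hypothetical event on 2002-01-04: return from that close to the next close.
print(forward_return(dates, closes, date(2002, 1, 4)))
```

An event dated on a weekend falls back to the preceding Friday's close under this definition.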


# 1) Read in Data

I have the AAPL folder downloaded on my desktop. Using the larger Transcripts zip takes too long.
TBD: code to access files online via dropbox api.

In [83]:
import pandas as pd
import zipfile
import chardet

path = "/Users/yusef/Desktop"
file = "AAPL.zip"
file_name = path + "/" + file

zf = zipfile.ZipFile(file_name) 

file_list = zf.namelist()    
del file_list[0]

file_list

['AAPL-Transcript-2004-01-14T22_00.txt',
 'AAPL-Transcript-2004-04-14T21_00.txt',
 'AAPL-Transcript-2003-04-16T21_00.txt',
 'AAPL-Transcript-2004-07-14T21_00.txt',
 'AAPL-Transcript-2005-01-12T22_00.txt',
 'AAPL-Transcript-2002-10-16T21_00.txt',
 'AAPL-Transcript-2005-10-11T21_00.txt',
 'AAPL-Transcript-2002-06-18T21_00.txt',
 'AAPL-Transcript-2002-04-17T21_30.txt',
 'AAPL-Transcript-2005-04-13T21_00.txt',
 'AAPL-Transcript-2004-10-13T21_00.txt',
 'AAPL-Transcript-2003-07-16T21_30.txt',
 'AAPL-Transcript-2002-07-16T21_00.txt',
 'AAPL-Transcript-2003-01-15T22_00.txt',
 'AAPL-Transcript-2005-07-13T21_00.txt',
 'AAPL-Transcript-2003-10-15T21_30.txt']

In [84]:
indices = (0,2,3)
for file in range(len(file_list)):
    string = file_list[file].split(sep="-")
    name = [string[i] for i in indices]
    print(name)

['AAPL', '2004', '01']
['AAPL', '2004', '04']
['AAPL', '2003', '04']
['AAPL', '2004', '07']
['AAPL', '2005', '01']
['AAPL', '2002', '10']
['AAPL', '2005', '10']
['AAPL', '2002', '06']
['AAPL', '2002', '04']
['AAPL', '2005', '04']
['AAPL', '2004', '10']
['AAPL', '2003', '07']
['AAPL', '2002', '07']
['AAPL', '2003', '01']
['AAPL', '2005', '07']
['AAPL', '2003', '10']
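
The file names also carry the day and time of each call, which will matter when lining transcripts up with price data. A minimal sketch of parsing the full timestamp, assuming every name follows the `TICKER-Transcript-YYYY-MM-DDTHH_MM.txt` pattern seen above (a ticker containing a hyphen would break it); `parse_transcript_name` is a hypothetical helper.

```python
from datetime import datetime


def parse_transcript_name(name):
    """Split 'AAPL-Transcript-2004-01-14T22_00.txt' into (ticker, timestamp)."""
    stem = name.rsplit(".", 1)[0]            # drop the .txt extension
    ticker, _, stamp = stem.split("-", 2)    # ticker, 'Transcript', date-time part
    return ticker, datetime.strptime(stamp, "%Y-%m-%dT%H_%M")


ticker, when = parse_transcript_name("AAPL-Transcript-2004-01-14T22_00.txt")
print(ticker, when.date(), when.time())
```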


Storing all the words in each transcript as a list (word_list), which is then added to a dictionary (data_dict) under a key tuple (ticker, year, month).

In [85]:
data_dict = {}

for file in file_list:

    print(file)

    # Key each transcript by (ticker, year, month), taken from the file name.
    try:
        indices = (0,2,3)
        string = file.split(sep="-")
        name = [string[i] for i in indices]
        key = tuple(name)
    except IndexError:
        print("An error occurred naming " + file)
        continue  # skip this file rather than reuse the previous key

    with zf.open(file,"r") as f:
        lines = f.readlines()
        word_list = []
        for line in lines:
            try:
                # chardet can return None for the encoding; fall back to UTF-8.
                encoding = chardet.detect(line)['encoding'] or "utf-8"
                word_list.extend(line.decode(encoding).split(sep= " "))
            except UnicodeDecodeError:
                print("An error occurred decoding " + file)
        data_dict[key] = word_list
        print("Added " + file)

AAPL-Transcript-2004-01-14T22_00.txt
Added AAPL-Transcript-2004-01-14T22_00.txt
AAPL-Transcript-2004-04-14T21_00.txt
Added AAPL-Transcript-2004-04-14T21_00.txt
AAPL-Transcript-2003-04-16T21_00.txt
Added AAPL-Transcript-2003-04-16T21_00.txt
AAPL-Transcript-2004-07-14T21_00.txt
Added AAPL-Transcript-2004-07-14T21_00.txt
AAPL-Transcript-2005-01-12T22_00.txt
Added AAPL-Transcript-2005-01-12T22_00.txt
AAPL-Transcript-2002-10-16T21_00.txt
Added AAPL-Transcript-2002-10-16T21_00.txt
AAPL-Transcript-2005-10-11T21_00.txt
Added AAPL-Transcript-2005-10-11T21_00.txt
AAPL-Transcript-2002-06-18T21_00.txt
Added AAPL-Transcript-2002-06-18T21_00.txt
AAPL-Transcript-2002-04-17T21_30.txt
Added AAPL-Transcript-2002-04-17T21_30.txt
AAPL-Transcript-2005-04-13T21_00.txt
Added AAPL-Transcript-2005-04-13T21_00.txt
AAPL-Transcript-2004-10-13T21_00.txt
Added AAPL-Transcript-2004-10-13T21_00.txt
AAPL-Transcript-2003-07-16T21_30.txt
Added AAPL-Transcript-2003-07-16T21_30.txt
AAPL-Transcript-2002-07-16T21_00.txt
Add
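
Detecting the encoding line by line is slow, and short lines give a detector little to work with. As an alternative sketch (a swapped-in technique, not the chardet approach used above), the whole file can be decoded once with a short list of candidate encodings. The candidate list and the helper name `decode_bytes` are assumptions, and the sample bytes are made up.

```python
def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Decode a whole file's bytes, trying each encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never raises.
    return raw.decode("latin-1")


# Made-up sample: byte 0x96 is a dash in cp1252 but is invalid as UTF-8.
raw = b"Thomson Reuters\r\nQ2  2002 Earnings Call \x96 Final\r\n"

# split() with no argument splits on any run of whitespace, so no empty
# strings or '\r\n' fragments end up in the word list.
words = decode_bytes(raw).split()
print(words)
```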

In [86]:
len(data_dict)

16

In [87]:
data_dict.keys()

dict_keys([('AAPL', '2004', '01'), ('AAPL', '2004', '04'), ('AAPL', '2003', '04'), ('AAPL', '2004', '07'), ('AAPL', '2005', '01'), ('AAPL', '2002', '10'), ('AAPL', '2005', '10'), ('AAPL', '2002', '06'), ('AAPL', '2002', '04'), ('AAPL', '2005', '04'), ('AAPL', '2004', '10'), ('AAPL', '2003', '07'), ('AAPL', '2002', '07'), ('AAPL', '2003', '01'), ('AAPL', '2005', '07'), ('AAPL', '2003', '10')])

In [88]:
data_dict.get(('AAPL', '2002', '04'))


['\r\n',
 '\r\n',
 '\r\n',
 'Thomson',
 'Reuters',
 'StreetEvents',
 'Event',
 'Transcript\r\n',
 'F',
 'I',
 'N',
 'A',
 'L',
 '',
 '',
 'V',
 'E',
 'R',
 'S',
 'I',
 'O',
 'N\r\n',
 '\r\n',
 'AAPL',
 '-',
 'Apple',
 'Inc.\r\n',
 'Q2',
 '2002',
 'Apple',
 'Computer',
 'Earnings',
 'Conference',
 'Call\r\n',
 'Apr',
 '17,',
 '2002',
 '/',
 '09:30PM',
 '',
 'GMT',
 '\r\n',
 '\r\n',
 'Conference',
 'Call',
 'Participants\r\n',
 '',
 '',
 '',
 '*',
 '',
 'Nancy',
 'Paxton\r\n',
 '',
 '',
 '',
 '',
 '',
 '',
 'Apple',
 'Computer',
 '-',
 'Director',
 'of',
 'Investor',
 'Relations',
 'and',
 'Corporate',
 'Finance\r\n',
 '',
 '',
 '',
 '*',
 '',
 'Peter',
 'Oppenheimer\r\n',
 '',
 '',
 '',
 '',
 '',
 '',
 'Apple',
 'Computer',
 '-',
 'Senior',
 'Vice',
 'President',
 'of',
 'Finance\r\n',
 '',
 '',
 '',
 '*',
 '',
 'Fred',
 'D.',
 'Anderson\r\n',
 '',
 '',
 '',
 '',
 '',
 '',
 'Apple',
 'Computer',
 '-',
 'Chief',
 'Financial',
 'Officer\r\n',
 '',
 '',
 '',
 '*',
 '',
 'Kimberly',
 'Alexy

# 2) Remove stop words

1. Read in stop word list from https://sraf.nd.edu/textual-analysis/resources/#Master%20Dictionary
 (The file is a Google Drive link; it is downloaded to the desktop for now. TBD: learn to use the Google Drive API.)
2. Remove "up", "down", and "above" from the list, since they may carry meaning in this context
3. Remove stop words from the transcript word lists

In [89]:
path = "/Users/yusef/Desktop"
file = "StopWords_GenericLong.txt"
file_name = path + "/" + file

stop_words = pd.read_table(file_name)
stop_words = stop_words["Words"].tolist()
# Empty strings and stray symbols are left over from splitting on single spaces.
stop_words.extend(["","/","*","-"])
print(len(stop_words))
stop_words

575


['a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'c

In [90]:
removeL = ["up","down","above"]
for word in removeL:
    if word in stop_words:
        stop_words.remove(word) #these words may contain meaning in this context
print(len(stop_words))

572


In [91]:
wordsbefore = len(data_dict[('AAPL', '2004', '01')])
print(wordsbefore)

12160


In [92]:
stop_set = set(stop_words)  # set membership is much faster than scanning a list
for key in data_dict.keys():
    data_dict[key] = [word for word in data_dict[key] if word not in stop_set]

In [93]:
wordsafter = len(data_dict[('AAPL', '2004', '01')])
print(str((wordsbefore-wordsafter)) + " words removed")

5619 words removed
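
Note that this filter runs on the raw tokens, so capitalized words such as 'A' or 'The' and tokens with '\r\n' attached do not match the lower-case stop word list. One possible ordering, sketched below with a hypothetical helper `clean_tokens` and a shortened stop word list, is to lower-case and tokenize first and filter afterwards. The regular expression keeps apostrophes because the stop word list contains entries such as "ain't".

```python
import re

TOKEN = re.compile(r"[a-z0-9']+")


def clean_tokens(text, stop_words):
    """Lower-case, tokenize, then drop stop words."""
    stop = set(stop_words)
    return [tok for tok in TOKEN.findall(text.lower()) if tok not in stop]


sample = "The margins were UP, and we don't expect them to come down."
short_stop_list = ["the", "were", "and", "we", "don't", "them", "to"]
print(clean_tokens(sample, short_stop_list))
# ['margins', 'up', 'expect', 'come', 'down']
```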


In [95]:
# Strip HTML tags, remove punctuation, and convert to lower case.
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

for key in data_dict.keys():
    print(key)
    tempList = []
    for word in data_dict[key]:
        # Tokenize each word once and keep only its first token.
        tokens = tokenizer.tokenize(BeautifulSoup(word.lower(), "lxml").get_text())
        if tokens:
            tempList.append(tokens[0])
    data_dict[key] = tempList

('AAPL', '2004', '01')


('AAPL', '2004', '04')


('AAPL', '2003', '04')
('AAPL', '2004', '07')
('AAPL', '2005', '01')
('AAPL', '2002', '10')
('AAPL', '2005', '10')
('AAPL', '2002', '06')


('AAPL', '2002', '04')
('AAPL', '2005', '04')
('AAPL', '2004', '10')
('AAPL', '2003', '07')


('AAPL', '2002', '07')
('AAPL', '2003', '01')
('AAPL', '2005', '07')
('AAPL', '2003', '10')


In [76]:
a = []
not a

True

In [96]:
data_dict[('AAPL', '2002', '04')] #proof of cleanliness ! 

['thomson',
 'reuters',
 'streetevents',
 'event',
 'transcript',
 'f',
 'i',
 'n',
 'a',
 'l',
 'v',
 'e',
 'r',
 's',
 'i',
 'o',
 'n',
 'aapl',
 'apple',
 'inc',
 'q2',
 '2002',
 'apple',
 'computer',
 'earnings',
 'conference',
 'call',
 'apr',
 '17',
 '2002',
 '09',
 'gmt',
 'conference',
 'call',
 'participants',
 'nancy',
 'paxton',
 'apple',
 'computer',
 'director',
 'investor',
 'relations',
 'corporate',
 'finance',
 'peter',
 'oppenheimer',
 'apple',
 'computer',
 'senior',
 'vice',
 'president',
 'finance',
 'fred',
 'd',
 'anderson',
 'apple',
 'computer',
 'chief',
 'financial',
 'officer',
 'kimberly',
 'alexy',
 'prudential',
 'securities',
 'daniel',
 'niles',
 'lehman',
 'brothers',
 'daniel',
 'kunstler',
 'j',
 'morgan',
 'richard',
 'gardner',
 'salomon',
 'smith',
 'barney',
 'andrew',
 'neff',
 'bear',
 'stearns',
 'david',
 'bailey',
 'gerard',
 'klauer',
 'mattison',
 'charles',
 'wolf',
 'needham',
 'company',
 'don',
 'young',
 'ubs',
 'warburg',
 'brett',
 

# 3) Use a dictionary to calculate sentiment indicator

In [80]:
file = "LoughranMcDonald_MasterDictionary_2016.csv"
file_name = path + "/" + file
SentimentDict = pd.read_csv(file_name)

In [81]:
#Creating list of pos and neg words in lower case

PosWords = SentimentDict[SentimentDict['Positive'] != 0]
PosWordsL = PosWords["Word"].str.lower().tolist()

NegWords = SentimentDict[SentimentDict['Negative'] != 0]
NegWordsL = NegWords["Word"].str.lower().tolist()

PosWordsL

['able',
 'abundance',
 'abundant',
 'acclaimed',
 'accomplish',
 'accomplished',
 'accomplishes',
 'accomplishing',
 'accomplishment',
 'accomplishments',
 'achieve',
 'achieved',
 'achievement',
 'achievements',
 'achieves',
 'achieving',
 'adequately',
 'advancement',
 'advancements',
 'advances',
 'advancing',
 'advantage',
 'advantaged',
 'advantageous',
 'advantageously',
 'advantages',
 'alliance',
 'alliances',
 'assure',
 'assured',
 'assures',
 'assuring',
 'attain',
 'attained',
 'attaining',
 'attainment',
 'attainments',
 'attains',
 'attractive',
 'attractiveness',
 'beautiful',
 'beautifully',
 'beneficial',
 'beneficially',
 'benefit',
 'benefited',
 'benefiting',
 'benefitted',
 'benefitting',
 'best',
 'better',
 'bolstered',
 'bolstering',
 'bolsters',
 'boom',
 'booming',
 'boost',
 'boosted',
 'breakthrough',
 'breakthroughs',
 'brilliant',
 'charitable',
 'collaborate',
 'collaborated',
 'collaborates',
 'collaborating',
 'collaboration',
 'collaborations',
 'coll

In [86]:
list(data_dict.keys())

[('AAPL', '2004', '01'),
 ('AAPL', '2004', '04'),
 ('AAPL', '2003', '04'),
 ('AAPL', '2004', '07'),
 ('AAPL', '2005', '01'),
 ('AAPL', '2002', '10'),
 ('AAPL', '2005', '10'),
 ('AAPL', '2002', '06'),
 ('AAPL', '2002', '04'),
 ('AAPL', '2005', '04'),
 ('AAPL', '2004', '10'),
 ('AAPL', '2003', '07'),
 ('AAPL', '2002', '07'),
 ('AAPL', '2003', '01'),
 ('AAPL', '2005', '07'),
 ('AAPL', '2003', '10')]

In [101]:
ReportSentiments = {}
PosSet = set(PosWordsL)  # sets give constant-time membership checks
NegSet = set(NegWordsL)
for key in data_dict.keys():
    print(key)
    PosCount = 0
    NegCount = 0
    for word in data_dict[key]:
        if word in PosSet:
            PosCount += 1
        elif word in NegSet:
            NegCount += 1
    ReportSentiments[key] = [PosCount, NegCount]

('AAPL', '2004', '01')
('AAPL', '2004', '04')
('AAPL', '2003', '04')
('AAPL', '2004', '07')
('AAPL', '2005', '01')
('AAPL', '2002', '10')
('AAPL', '2005', '10')
('AAPL', '2002', '06')
('AAPL', '2002', '04')
('AAPL', '2005', '04')
('AAPL', '2004', '10')
('AAPL', '2003', '07')
('AAPL', '2002', '07')
('AAPL', '2003', '01')
('AAPL', '2005', '07')
('AAPL', '2003', '10')


In [102]:
ReportSentiments 
# First value is the count of positive words, second value is the count of negative words.
# These are raw counts, so they can be used to compute whatever ratios/metrics we want.

{('AAPL', '2002', '04'): [95, 88],
 ('AAPL', '2002', '06'): [28, 66],
 ('AAPL', '2002', '07'): [72, 116],
 ('AAPL', '2002', '10'): [60, 67],
 ('AAPL', '2003', '01'): [96, 105],
 ('AAPL', '2003', '04'): [65, 76],
 ('AAPL', '2003', '07'): [69, 95],
 ('AAPL', '2003', '10'): [97, 81],
 ('AAPL', '2004', '01'): [106, 91],
 ('AAPL', '2004', '04'): [102, 111],
 ('AAPL', '2004', '07'): [91, 127],
 ('AAPL', '2004', '10'): [128, 115],
 ('AAPL', '2005', '01'): [94, 93],
 ('AAPL', '2005', '04'): [105, 109],
 ('AAPL', '2005', '07'): [98, 118],
 ('AAPL', '2005', '10'): [68, 88]}
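
As one example of a metric built from these raw counts, the sketch below computes a net sentiment score, (positive - negative) / (positive + negative), and turns its sign into a signal. The choice of this ratio and the zero cut-off are assumptions; the three entries are copied from the output above.

```python
def net_sentiment(pos, neg):
    """Net score in [-1, 1]; 0.0 when the transcript has no sentiment words."""
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


counts = {
    ("AAPL", "2002", "04"): [95, 88],
    ("AAPL", "2002", "06"): [28, 66],
    ("AAPL", "2004", "10"): [128, 115],
}

scores = {key: net_sentiment(*value) for key, value in counts.items()}
signals = {key: (1 if s > 0 else -1 if s < 0 else 0) for key, s in scores.items()}

for key in counts:
    print(key, round(scores[key], 3), signals[key])
```

Most of the scores are close to zero, so a cut-off band around zero (rather than the bare sign) may be worth testing.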

# 4) Compare sentiment with returns 

In [103]:
# code from https://chrisconlan.com/download-historical-stock-data-google-r-python/

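import os
import quandl
import datetime
 
# Read the API key from the environment rather than hard-coding it in the notebook.
# QUANDL_API_KEY is an assumed variable name; set it before running this cell.
quandl.ApiConfig.api_key = os.environ["QUANDL_API_KEY"]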
 
def quandl_stocks(symbol, start_date=(2002, 1, 1), end_date=(2005, 1, 1)):
    """
    symbol is a string representing a stock symbol, e.g. 'AAPL'
 
    start_date and end_date are tuples of integers representing the year, month,
    and day
 
    end_date defaults to the current date when None
    """
 
    query_list = ['WIKI' + '/' + symbol + '.' + str(k) for k in range(1, 13)]
 
    start_date = datetime.date(*start_date)
 
    if end_date:
        end_date = datetime.date(*end_date)
    else:
        end_date = datetime.date.today()
 
    return quandl.get(query_list, 
            returns='pandas', 
            start_date=start_date,
            end_date=end_date,
            collapse='daily',
            order='asc'
            )
 
#if __name__ == '__main__':
 
apple_data = quandl_stocks('AAPL')
apple_data.drop(apple_data.columns[4:12],axis=1,inplace = True)

In [104]:
apple_data

| Date | WIKI/AAPL - Open | WIKI/AAPL - High | WIKI/AAPL - Low | WIKI/AAPL - Close |
|---|---|---|---|---|
| 2002-01-02 | 22.050 | 23.30 | 21.96 | 23.300 |
| 2002-01-03 | 23.750 | 23.75 | 22.77 | 23.580 |
| 2002-01-04 | 23.340 | 23.95 | 22.99 | 23.690 |
| 2002-01-07 | 23.720 | 24.00 | 22.75 | 22.900 |
| 2002-01-08 | 22.750 | 23.05 | 22.46 | 22.610 |
| 2002-01-09 | 22.800 | 22.93 | 21.28 | 21.650 |
| 2002-01-10 | 21.220 | 21.47 | 20.26 | 21.230 |
| 2002-01-11 | 21.390 | 21.84 | 20.60 | 21.050 |
| 2002-01-14 | 21.010 | 21.40 | 20.90 | 21.150 |
| 2002-01-15 | 21.320 | 21.76 | 21.21 | 21.700 |
