## CS410 Text Information Systems Technology Review

### Topic: Sentiment Analysis using Python nltk for Financial Reports


### Introduction

In recent years, the analysis of textual information for financial decision making has become increasingly popular.
For instance, Li (2010) uses it to predict future earnings, Kravet and Muslu (2013) use textual analysis to predict stock prices, and Humpherys et al. (2011) employ it to detect corporate fraud.

Academic research aside, the use of textual materials for conducting sentiment analysis is also important from an industry perspective. It is not difficult to see that effective sentiment analysis can be converted into potential alpha signals for quantitative fund managers.  

A common method for discerning sentiment about a company is to run textual analysis on the Management's Discussion and Analysis (MD&A) section of its Form 10-Q and 10-K filings. Sentiment can be classified into positive, negative or neutral tones, which can then be used to generate a signal for earnings or stock price prediction.

In this note, we review the technologies used by quantitative fund managers and academics to determine stock / company sentiment via textual analysis. We also provide a basic tutorial in Python showing how this works in practice.



### Literature Review


The table below gives an overview of some academic papers on text analysis in the accounting and finance field.

| Authors                      | Data                    | Method                                         |
|------------------------------|-------------------------|------------------------------------------------|
| Antweiler and Frank (2004)   | Yahoo! Finance Postings | single-label classifier                        |
| Li (2008)                    | 10-K                    | content analysis: readability                  |
| Tetlock et al. (2008)        | Financial News Articles | word categorization: positive / negative words |
| Loughran and McDonald (2009) | 10-K                    | word categorization: positive / negative words |
| Feldman et al. (2009)        | 10-K MD&A               | word categorization: positive / negative words |
| Hanley and Hoberg (2010)     | IPO Prospectus          | single-label classifier                        |
| Li (2010)                    | 10-K MD&A               | single-label classifier                        |
| Huang and Li (2011)          | 10-K Risk Factors       | multi-label classifier                         |


Li (2008) examines the relationship between corporate earnings and 10-K filing readability.
To measure readability, the author uses two basic metrics:

* The Fog Index: based on average sentence length and the share of complex words (words with 3 or more syllables).
$$
0.4 \times \left( \frac{\text{words}}{\text{sentences}} + 100 \times \frac{\text{complex words}}{\text{words}} \right)
$$

* Length: the total length of the 10-K filing

Li (2008) found that companies with longer and less readable 10-K filings (higher Fog Index) had lower earnings and were generally in poorer financial condition.
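The Fog Index formula above can be sketched in a few lines of Python. This is a minimal illustration rather than Li's actual procedure: the function names are hypothetical, sentences are split on `.`, `!` and `?`, and syllables are estimated with a crude heuristic (counting groups of consecutive vowels), which is an assumption and will miscount some words.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (heuristic)."""
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))

def fog_index(text):
    """Fog Index: 0.4 * (words / sentences + 100 * complex_words / words)."""
    # Split on sentence-ending punctuation and drop empty fragments
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r'[A-Za-z]+', text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are those with 3 or more (estimated) syllables
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

sample = "The company reported higher revenue. Management anticipates additional volatility."
print(round(fog_index(sample), 2))
```

A higher value indicates longer sentences and a larger share of complex words, that is, less readable text.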

Tetlock et al. (2008) examine whether sentiment in the Wall Street Journal has any impact on the stock market.
Sentiment is measured by the number of positive and negative words, as categorized by the General Inquirer (GI), a popular dictionary used by psychologists. Tetlock et al. (2008) find that media pessimism puts downward pressure on market prices.
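Word categorization of this kind amounts to counting dictionary hits. The sketch below is illustrative only: the two word lists are tiny hypothetical stand-ins (not the General Inquirer categories), and the `net_tone` measure, positive minus negative hits over total words, is an assumed summary statistic rather than the one used in the paper.

```python
import re
from collections import Counter

# Hypothetical mini word lists for illustration; NOT the General Inquirer lists
POSITIVE_WORDS = {"gain", "growth", "improve", "strong", "profit"}
NEGATIVE_WORDS = {"loss", "decline", "weak", "risk", "impairment"}

def tone_counts(text):
    """Return (positive hits, negative hits, total words) for a text."""
    tokens = re.findall(r'[a-z]+', text.lower())
    counts = Counter(tokens)
    n_pos = sum(counts[w] for w in POSITIVE_WORDS)
    n_neg = sum(counts[w] for w in NEGATIVE_WORDS)
    return n_pos, n_neg, len(tokens)

def net_tone(text):
    """(positive - negative) / total words; 0.0 for empty text."""
    n_pos, n_neg, n_total = tone_counts(text)
    return (n_pos - n_neg) / n_total if n_total else 0.0

sample = "Strong growth offset the loss from impairment and the decline in margins."
print(tone_counts(sample), net_tone(sample))
```

In practice the word lists would be replaced by a full dictionary, and the choice of dictionary matters a great deal for financial text.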
  

Feldman et al. (2009) were the first to focus specifically on the management's discussion and analysis (MD&A) section. This is because the MD&A is the most subjective part of a 10-K filing, and is therefore the best suited to sentiment analysis.
Their analysis remained relatively simple and focused on tone change measured via word categorization.
Their results suggested that stock market reactions around SEC filings were positively related to tone change in the MD&A section.






### Tutorial Part 1: Getting MD&A Textual Data

In this section, we will run through how to download 10-K / 10-Q reports from the EDGAR database managed by the SEC.
This is only for US listed stocks. 


First, we need to download the master index file from EDGAR.
For simplicity, and without loss of generality, we look only at the first quarter of 2018.
Therefore, we will be downloading links to company 10-Q reports in this particular case.
The index will be saved in a file named `master.idx`.


In [None]:
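import shutil
import urllib.request

idx_url = 'https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/master.idx'
# SEC asks automated clients to identify themselves; the value below is a placeholder
headers = {'User-Agent': 'Your Name your.email@example.com'}
request = urllib.request.Request(idx_url, headers=headers)

# Stream the index to disk; the with-block closes both handles even on error
with urllib.request.urlopen(request) as response, open('master.idx', 'wb') as out_file:
    shutil.copyfileobj(response, out_file)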


If you open the master file and have a peek inside, you will see that there are a lot of different types of forms.
We're only interested in 10-Q files.
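Since only the 10-Q rows are of interest, the index can be filtered by form type. The sketch below assumes the pipe-delimited layout `CIK|Company Name|Form Type|Date Filed|Filename` of `master.idx`; the function name `parse_index_lines` is hypothetical and the sample rows are made up, not real filings.

```python
def parse_index_lines(lines, form_type="10-Q"):
    """Yield dicts for pipe-delimited index rows whose form type matches."""
    for line in lines:
        parts = line.strip().split("|")
        # Data rows have exactly five fields; preamble and separator lines do not
        if len(parts) != 5 or parts[0] == "CIK":
            continue
        cik, company, form, date_filed, filename = parts
        if form == form_type:
            yield {"cik": cik, "company": company, "form": form,
                   "date_filed": date_filed, "filename": filename}

# Made-up rows in the master.idx layout (not real filings)
sample_lines = [
    "CIK|Company Name|Form Type|Date Filed|Filename\n",
    "--------------------------------------------\n",
    "1234567|EXAMPLE CORP|10-Q|2018-02-01|edgar/data/1234567/0000000000-18-000001.txt\n",
    "7654321|SAMPLE INC|8-K|2018-02-02|edgar/data/7654321/0000000000-18-000002.txt\n",
]
rows = list(parse_index_lines(sample_lines))
print(rows[0]["filename"])
```

The same function can be given an open file handle for the downloaded `master.idx` in place of `sample_lines`, since it only iterates over lines.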

In [None]:
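# Read the first 15 lines: the header block followed by the first few index rows
with open("master.idx") as file_in:
    head = [next(file_in) for _ in range(15)]

# Each row is pipe-delimited: CIK|Company Name|Form Type|Date Filed|Filename
print(head[12])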

As you can see, we have the filing URL for Nicholas Financial Inc's 10-Q report.
Now we can proceed to download the file.


In [2]:
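from bs4 import BeautifulSoup
import re
import urllib.request

filing_url = "edgar/data/1000045/0001193125-18-037381.txt"
# SEC asks automated clients to identify themselves; the value below is a placeholder
request = urllib.request.Request('https://www.sec.gov/Archives/' + filing_url,
                                 headers={'User-Agent': 'Your Name your.email@example.com'})
with urllib.request.urlopen(request) as response:
    raw_text = response.read()
soup = BeautifulSoup(raw_text, 'html.parser')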

If you would like to view the 10-Q file in HTML format, you can use the prettify method and write to your local drive.

In [None]:
html = soup.prettify("utf-8")
with open("output_10Q.html", "wb") as file:
    file.write(html)

Now we can proceed to clean the text, removing HTML tags, non-ASCII characters and other formatting quirks using regular expressions.

In [None]:
text_clean = re.sub(r'[^\x00-\x7F]+|\W{2,}', ' ', soup.document.get_text())
text_clean = re.sub('\n', ' ', text_clean)
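To see what the two substitutions above do, the sketch below applies the same patterns to a small made-up string. The helper name `clean_text` and the sample are assumptions for illustration.

```python
import re

def clean_text(raw):
    # Replace runs of non-ASCII characters, or runs of two or more
    # non-word characters, with a single space
    cleaned = re.sub(r'[^\x00-\x7F]+|\W{2,}', ' ', raw)
    # Replace any remaining single newlines with spaces
    return re.sub('\n', ' ', cleaned)

# Made-up snippet with a curly apostrophe, a blank line and a double dash
sample = "Item 2.\n\nManagement\u2019s Discussion\nand Analysis -- Overview"
print(clean_text(sample))
```

Note that the curly apostrophe is non-ASCII and is therefore replaced by a space, so the heading comes out as `Management s Discussion and Analysis`. Any later pattern that searches for this heading has to tolerate the missing apostrophe.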

In theory, with `text_clean`, we are now able to run sentiment analysis. 
However, if we want to isolate the management discussion and analysis section, we need to do some more work.

First, we need to identify where `management's discussion and analysis` first appears in the text.
Any text before this point can be trimmed, as it is not needed.
We can use the `re.search()` function for this (`re.I` is short for `re.IGNORECASE` and `re.M` is short for `re.MULTILINE`).

However, the first occurrence of the MD&A heading is usually in the table of contents. Therefore, the actual section begins at the second occurrence of `management's discussion and analysis`.

In a standard 10-Q report, the management's discussion and analysis section appears directly before the
`Quantitative and Qualitative Disclosures about Market Risk` section.




In [None]:
# Skip past the first match, which is usually the table of contents entry
trim_beginning = re.search(r'management[\s\']*s discussion and analysis', text_clean, re.M | re.I)
text_tmp = text_clean[trim_beginning.end():]
# Trim all the text before the second match, where the MD&A section itself begins
trim_beginning = re.search(r'management[\s\']*s discussion and analysis', text_tmp, re.M | re.I)
text_tmp = text_tmp[trim_beginning.start():]

# Trim all the text after the management's discussion and analysis section
trim_end = re.search(r'quantitative and qualitative', text_tmp, re.M | re.I)
text_mda = text_tmp[:trim_end.start()]

# Resulting management's discussion and analysis section after trim
print(text_mda[0:100])
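The trimming steps above assume that every search succeeds; if a filing lacks one of the headings, `re.search()` returns `None` and the slicing raises an `AttributeError`. A minimal sketch of a guarded version follows. The function name `extract_mda` and the sample text are hypothetical, and the heading patterns are the same ones used above.

```python
import re

MDA_START = re.compile(r"management[\s']*s discussion and analysis", re.I)
MDA_END = re.compile(r"quantitative and qualitative", re.I)

def extract_mda(text):
    """Return the MD&A section, or None if the expected headings are missing."""
    matches = list(MDA_START.finditer(text))
    if len(matches) < 2:
        return None  # no second occurrence, so the section start is unknown
    # The second occurrence is taken as the start of the section itself
    start = matches[1].start()
    end_match = MDA_END.search(text, start)
    if end_match is None:
        return None
    return text[start:end_match.start()]

# Made-up text: a contents entry followed by the section itself
sample = ("Contents: Management's Discussion and Analysis ... "
          "Quantitative and Qualitative Disclosures "
          "Item 2. Management s Discussion and Analysis of results. Revenue rose. "
          "Item 3. Quantitative and Qualitative Disclosures about Market Risk")
print(extract_mda(sample))
```

Returning `None` lets the caller skip filings with an unusual layout instead of stopping a batch run.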


### Tutorial Part 2: Conducting Sentiment Analysis




### References

Huang, K.-W. and Z. Li. 2011. "A Multilabel Text Classification Algorithm for Labeling Risk Factors in SEC Form 10-K," ACM Transactions on MIS (2:3), Article 18. 

Humpherys, S. L., K. C. Moffitt, et al. 2011. "Identification of fraudulent financial statements using linguistic credibility analysis," Decision Support Systems (50:3), pp. 585–594.

Kravet, T. and V. Muslu. 2013. "Textual Risk Disclosures and Investors' Risk Perceptions," Review of Accounting Studies.

Li, F. 2010. "The Information Content of Forward-Looking Statements in Corporate Filings—A Naive Bayesian Machine Learning Approach," Journal of Accounting Research (48:5), pp.  1049-1102. 

