## CS410 Text Information Systems Technology Review

### Topic: Sentiment Analysis using Python nltk for Financial Reports


### Introduction

In recent years, the analysis of textual information for financial decision making has become increasingly popular.
For instance, Li (2010) uses it to predict future earnings, Kravet and Muslu (2013) use textual analysis to predict stock prices, and Humpherys et al. (2011) employ it for detecting corporate fraud.

Academic research aside, sentiment analysis of textual materials also matters from an industry perspective: effective sentiment analysis can be converted into potential alpha signals for quantitative fund managers.

A common method for gauging sentiment about a company is to run textual analysis on the Management Discussion and Analysis (MD&A) section of its Form 10-Q and 10-K filings. The text can be classified as positive, negative or neutral in tone, and that label can then be used to generate a signal for earnings or stock price prediction.
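
As a rough illustration of this kind of tone classification, the sketch below counts positive and negative words in a passage and maps the net count to a label. The word lists, the `tone` function name, and the zero threshold are all assumptions made for this example; a real study would rely on a finance-specific dictionary and a proper tokenizer such as the ones in nltk.

```python
import re

# Tiny illustrative word lists (assumed for this sketch, not a published dictionary)
POSITIVE = {"growth", "improved", "strong", "gain", "profit"}
NEGATIVE = {"loss", "decline", "weak", "impairment", "litigation"}


def tone(text):
    """Return (label, net_score) for a passage using simple word counts."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # Scale the net count by passage length so long and short texts are comparable
    score = (pos - neg) / len(words) if words else 0.0
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, score


print(tone("Revenue growth was strong and margins improved despite a small loss."))
```

Scaling by the number of words is one possible design choice; raw counts or a ratio of positive to negative words would work equally well for a first pass.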

In this note, we review the technologies used by quantitative fund managers and academics to determine stock / company sentiment via textual analysis. We also provide a basic tutorial in Python showing how this works in practice.



### Literature Review





| Authors                      | Data                    | Method                                         |
|------------------------------|-------------------------|------------------------------------------------|
| Loughran and McDonald (2009) | 10-K                    | word categorization: positive / negative words |
| Tetlock et al. (2008)        | Financial News Articles | word categorization: positive / negative words |
| Antweiler and Frank (2004)   | Yahoo! Finance Postings | single-label classifier                        |
| Hanley and Hoberg (2010)     | IPO Prospectus          | single-label classifier                        |
| Feldman et al. (2009)        | 10-K MD&A               | word categorization: positive / negative words |
| Li (2010)                    | 10-K MD&A               | single-label classifier                        |
| Li (2008)                    | 10-K                    | content analysis: readability                  |
| Huang and Li (2011)          | 10-K Risk Factors       | multi-label classifier                         |






### Tutorial: Getting MD&A Textual Data

In this section, we will run through how to download 10-K / 10-Q reports from the EDGAR database managed by the SEC.
This applies only to US-listed stocks.


First, we need to download the master index file from EDGAR.
For simplicity, and without loss of generality, we look only at the first quarter of 2018.
The index lists the filings made in that quarter, and we will use it to find links to company 10-Q reports.
It will be saved in a file named `master.idx`.


In [None]:
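import shutil
import urllib.request

idx_url = 'https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/master.idx'

# The SEC asks automated clients to identify themselves via a User-Agent header;
# the value below is a placeholder, so replace it with your own name and email.
request = urllib.request.Request(
    idx_url, headers={'User-Agent': 'Your Name your.email@example.com'}
)

# Stream the response straight to disk; the context managers close both handles
with urllib.request.urlopen(request) as response, open('master.idx', 'wb') as out_file:
    shutil.copyfileobj(response, out_file)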


If you open up the master file and have a peek inside, you will see that there are many different types of forms.
We're only interested in 10-Q files.

In [11]:
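# latin-1 maps every byte to a character, so decoding cannot fail
with open("master.idx", encoding="latin-1") as file_in:
    head = [next(file_in) for _ in range(15)]

# Find the column-header row, then show it with the three lines that follow
start = next(i for i, line in enumerate(head) if line.startswith("CIK|"))
print(head[start:start + 4])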

['CIK|Company Name|Form Type|Date Filed|Filename\n', '--------------------------------------------------------------------------------\n', '1000032|BINCH JAMES G|4|2018-02-16|edgar/data/1000032/0000913165-18-000034.txt\n', '1000045|NICHOLAS FINANCIAL INC|10-Q|2018-02-09|edgar/data/1000045/0001193125-18-037381.txt\n']
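
Since each data row is pipe-delimited, picking out the 10-Q filings can be sketched as follows. The helper name `select_forms` and the constant `ARCHIVE_ROOT` are hypothetical names introduced for this example, and the sample rows are simply the four lines printed above; the same function could be fed the lines of the full `master.idx` file.

```python
ARCHIVE_ROOT = "https://www.sec.gov/Archives/"


def select_forms(lines, form_type="10-Q"):
    """Yield (company, date_filed, url) for rows whose form type matches exactly."""
    for line in lines:
        parts = line.rstrip("\n").split("|")
        # Data rows have five fields and start with a numeric CIK;
        # this skips the description block, the header row and the dashed rule
        if len(parts) != 5 or not parts[0].isdigit():
            continue
        cik, company, form, date_filed, filename = parts
        if form == form_type:
            yield company, date_filed, ARCHIVE_ROOT + filename


sample = [
    "CIK|Company Name|Form Type|Date Filed|Filename\n",
    "--------------------------------------------------------------------------------\n",
    "1000032|BINCH JAMES G|4|2018-02-16|edgar/data/1000032/0000913165-18-000034.txt\n",
    "1000045|NICHOLAS FINANCIAL INC|10-Q|2018-02-09|edgar/data/1000045/0001193125-18-037381.txt\n",
]

for row in select_forms(sample):
    print(row)
```

Matching the form type exactly is a deliberate choice in this sketch: it leaves out amended filings such as `10-Q/A`, which may or may not be what a given study wants.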


As you can see, we have the path to Nicholas Financial Inc's 10-Q report, given relative to the EDGAR archive root.
Now, we can proceed to download the file.


In [None]:
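filing_url = "edgar/data/1000045/0001193125-18-037381.txt"
# The SEC asks automated clients to send a User-Agent; this value is a placeholder
request = urllib.request.Request('https://www.sec.gov/Archives/' + filing_url, headers={'User-Agent': 'Your Name your.email@example.com'})
file = urllib.request.urlopen(request)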
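
Once the filing has been read and decoded into a string, the MD&A section still has to be located inside it. The sketch below is one possible heuristic, not a definitive parser: it assumes the HTML markup has already been stripped, and that the MD&A appears between a heading starting with "Item 2." and the next heading starting with "Item 3.", as is usual in a 10-Q. The function name `extract_mdna` and both patterns are assumptions for illustration.

```python
import re

# Assumed heading patterns: "Item 2." opens the MD&A in a 10-Q and "Item 3." follows it
START = re.compile(r"item\s+2\s*\.", re.IGNORECASE)
END = re.compile(r"item\s+3\s*\.", re.IGNORECASE)


def extract_mdna(text):
    """Return the longest span between an Item 2 heading and the next Item 3 heading."""
    best = ""
    for start in START.finditer(text):
        end = END.search(text, start.end())
        if end is None:
            continue
        candidate = text[start.start():end.start()].strip()
        # The table of contents lists the same items, so keep the longest span
        if len(candidate) > len(best):
            best = candidate
    return best


sample = (
    "Item 2. Management's Discussion and Analysis 12\n"
    "Item 3. Quantitative and Qualitative Disclosures 20\n"
    "Item 2. Management's Discussion and Analysis\n"
    "Revenue increased compared with the prior quarter.\n"
    "Item 3. Quantitative and Qualitative Disclosures\n"
)

print(extract_mdna(sample))
```

Keeping the longest candidate is a simple way to skip the table of contents, but real filings vary in layout, so the output should be spot-checked before it is fed into any sentiment model.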



### References

Huang, K.-W. and Z. Li. 2011. "A Multilabel Text Classification Algorithm for Labeling Risk Factors in SEC Form 10-K," ACM Transactions on MIS (2:3), Article 18. 

Humpherys, S. L., K. C. Moffitt, et al. 2011. "Identification of fraudulent financial statements using linguistic credibility analysis," Decision Support Systems (50:3), pp. 585–594.

Kravet, T. and V. Muslu. 2013. "Textual Risk Disclosures and Investors' Risk Perceptions," Review of Accounting Studies.

Li, F. 2010. "The Information Content of Forward-Looking Statements in Corporate Filings—A Naive Bayesian Machine Learning Approach," Journal of Accounting Research (48:5), pp.  1049-1102. 

