# Sentiment Analysis using NLTK

The objective of this project is **to extract selected sections (listed below) from SEC / EDGAR financial reports** and **perform text analysis to compute variables**.
  
The project is divided into two tasks:

 (1) **Data Extraction** and (2) **Sentiment Analysis**


(1) **Sections to extract data from :**

1 : Management's Discussion and Analysis of Financial Condition and Results of Operations

2 : Quantitative and Qualitative Disclosures about Market Risk

3 : Risk Factors

(2) **Variables to be computed from extracted data:**

1 : All polarity scores (6)

2 : Word count, Sentence count, Average Sentence Length

3 : Fog index

Therefore, we need 11 x 3 = **33 variables**.

## Import required Python Libraries :
We need the following libraries and functions for data parsing, extraction, and text analysis (NLTK).

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import string
import xlwt

In [2]:
from nltk.tokenize.regexp import WhitespaceTokenizer
from nltk.tokenize import regexp_tokenize
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

## Read the excel file, complete the links column :
We use the pandas function for reading an Excel file stored in the project directory.
The Excel file consists of various columns such as CIK Number, Form, SECFNAME, etc. The SECFNAME column holds incomplete HTML links to our financial reports. We need to prefix every cell in the column with **https://www.sec.gov/Archives/** to form a complete, accessible link.
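The prefixing step can be sketched with the standard library alone. This is a minimal sketch, not the notebook's own code: `complete_link` is a hypothetical helper, and the sample path is made up for illustration.

```python
from urllib.parse import urljoin

# Base of the EDGAR archive, as given in the text above.
ARCHIVE_BASE = "https://www.sec.gov/Archives/"


def complete_link(partial_path):
    """Return a full archive URL for a partial SECFNAME entry.

    `complete_link` is a hypothetical helper, not part of the notebook.
    """
    partial_path = str(partial_path).strip()
    # Leave entries alone if they are already absolute URLs.
    if partial_path.startswith(("http://", "https://")):
        return partial_path
    # Drop a leading slash so urljoin keeps the /Archives/ prefix.
    return urljoin(ARCHIVE_BASE, partial_path.lstrip("/"))


# Made-up path for illustration only.
print(complete_link("edgar/data/0000/example.txt"))
```

With the real data, something like `data['SECFNAME'].map(complete_link)` could apply the helper to the whole column, and running it twice would not double the prefix.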

In [3]:
# Raw string so the backslashes in the Windows path are not read as escapes
data = pd.read_excel(r'C:\Saurav\Work\projects\Sentiment Analysis\cik_list.xlsx')

data['SECFNAME'] = 'https://www.sec.gov/Archives/' + data['SECFNAME'].astype(str)
data.head()

Unnamed: 0,CIK,CONAME,FYRMO,FDATE,FORM,SECFNAME
0,3662,SUNBEAM CORP/FL/,199803,1998-03-06,10-K405,https://www.sec.gov/Archives/edgar/data/3662/0...
1,3662,SUNBEAM CORP/FL/,199805,1998-05-15,10-Q,https://www.sec.gov/Archives/edgar/data/3662/0...
2,3662,SUNBEAM CORP/FL/,199808,1998-08-13,NT 10-Q,https://www.sec.gov/Archives/edgar/data/3662/0...
3,3662,SUNBEAM CORP/FL/,199811,1998-11-12,10-K/A,https://www.sec.gov/Archives/edgar/data/3662/0...
4,3662,SUNBEAM CORP/FL/,199811,1998-11-16,NT 10-Q,https://www.sec.gov/Archives/edgar/data/3662/0...


## Grab the content by iterating over each link in the SECFNAME column :
The following code is the main iteration loop: it selects each link, parses the HTML content, and extracts the text under the three given sections. I have used **BeautifulSoup** for parsing the HTML content of my links.

I will be using **regexp_tokenize** from the **nltk.tokenize** package for grabbing the text under the three different sections. If a particular report does not include one of the sections, the corresponding columns are simply left empty.
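The idea of capturing everything between a section heading and the next " ITEM " or " PART " marker can be illustrated on a small string. This sketch uses Python's built-in `re` module instead of the NLTK tokenizer; `extract_section` and the sample text are assumptions made for illustration.

```python
import re

# Any of these markers is treated as the start of the next section,
# mirroring the " ITEM | PART | Item | Part " terminator used in the loop.
SECTION_END = r"(?: ITEM | PART | Item | Part )"


def extract_section(text, heading):
    """Return the text between `heading` and the next section marker.

    Returns an empty string when the heading is not found.
    """
    # Allow any run of whitespace (including newlines) between heading words.
    heading_pattern = r"\s+".join(re.escape(word) for word in heading.split())
    pattern = heading_pattern + r"(.*?)" + SECTION_END
    # DOTALL lets "." match newlines, so a section can span several lines.
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


sample = (
    "ITEM 1A. RISK FACTORS Demand for our products may fall.\n"
    "Costs may rise. ITEM 2. PROPERTIES We lease one office."
)
print(extract_section(sample, "RISK FACTORS"))
print(repr(extract_section(sample, "LEGAL PROCEEDINGS")))
```

Joining the heading words with `\s+` avoids hard-coding the exact line breaks and indentation of a particular filing into the pattern.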

In [4]:
result_df = pd.read_excel(r'C:\Saurav\Work\projects\Sentiment Analysis\Output Data Structure.xlsx')
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
result_df = pd.concat([result_df, data])
result_df.head()

Unnamed: 0,CIK,CONAME,FYRMO,FDATE,FORM,SECFNAME,mda_negative_score,mda_neutral_score,mda_positive_score,mda_compound_score,...,qqdmr_sent_count,qqdmr_fog_index,rf_negative_score,rf_neutral_score,rf_compound_score,rf_subjectivity_score,rf_average_sentence_length,rf_word_count,rf_sent_count,rf_fog_index
0,3662,SUNBEAM CORP/FL/,199803,1998-03-06,10-K405,https://www.sec.gov/Archives/edgar/data/3662/0...,,,,,...,,,,,,,,,,
1,3662,SUNBEAM CORP/FL/,199805,1998-05-15,10-Q,https://www.sec.gov/Archives/edgar/data/3662/0...,,,,,...,,,,,,,,,,
2,3662,SUNBEAM CORP/FL/,199808,1998-08-13,NT 10-Q,https://www.sec.gov/Archives/edgar/data/3662/0...,,,,,...,,,,,,,,,,
3,3662,SUNBEAM CORP/FL/,199811,1998-11-12,10-K/A,https://www.sec.gov/Archives/edgar/data/3662/0...,,,,,...,,,,,,,,,,
4,3662,SUNBEAM CORP/FL/,199811,1998-11-16,NT 10-Q,https://www.sec.gov/Archives/edgar/data/3662/0...,,,,,...,,,,,,,,,,


In [16]:
result_df.shape

(152, 32)

In [6]:
result_df.columns

Index(['CIK', 'CONAME', 'FYRMO', 'FDATE', 'FORM', 'SECFNAME',
       'mda_negative_score', 'mda_neutral_score', 'mda_positive_score',
       'mda_compound_score', 'mda_subjectivity_score',
       'mda_average_sentence_length', 'mda_word_count', 'mda_sent_count',
       'mda_fog_index', 'qqdmr_negative_score', 'qqdmr_neutral_score',
       'qqdmr_positive_score', 'qqdmr_compound_score',
       'qqdmr_subjectivity_score', 'qqdmr_average_sentence_length',
       'qqdmr_word_count', 'qqdmr_sent_count', 'qqdmr_fog_index',
       'rf_negative_score', 'rf_neutral_score', 'rf_compound_score',
       'rf_subjectivity_score', 'rf_average_sentence_length', 'rf_word_count',
       'rf_sent_count', 'rf_fog_index'],
      dtype='object')

In [17]:
result_df.iloc[0,0]

3662

In [18]:
row = 0
for link in result_df['SECFNAME']:
    
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'lxml')
    link_data = soup.text

    link_results = []

    sec1 = regexp_tokenize(link_data,"(MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS\n        OF OPERATION)(.*?)( ITEM | PART | Item | Part )")
    #print("\033[1m" + "Management's Discussion and Analysis Section:" + "\033[0m")

    if sec1:
        sec1_result = text_analyzer(sec1[0][1])
        result_df.iloc[row,6:14] = sec1_result

    sec2 = regexp_tokenize(link_data,"(QUANTITATIVE AND\nQUALITATIVE DISCLOSURES ABOUT MARKET RISK)(.*?)( ITEM | PART | Item | Part )")
    #print("\033[1m" + "Quantitative and Qualitative Disclosures about Market Risk Section:" + "\033[0m")

    if sec2:
        sec2_result = text_analyzer(sec2[0][1])
        result_df.iloc[row,15:23] = sec2_result

    sec3 = regexp_tokenize(link_data,"(RISK FACTORS)(.*?)( ITEM | PART | Item | Part )")
    #print("\033[1m" + "Risk Factors Section:" + "\033[0m")

    if sec3:
        sec3_result = text_analyzer(sec3[0][1])
        result_df.iloc[row,24:32] = sec3_result

    row+=1
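The loop above writes results through hard-coded positional slices such as `6:14`, which are easy to get out of step with the column layout. As a hedged alternative, the positions could be looked up from the column names by prefix. `section_columns` is a hypothetical helper, and the column list here is a shortened, illustrative one.

```python
def section_columns(columns, prefix):
    """Return the positions of all columns whose name starts with `prefix`."""
    return [i for i, name in enumerate(columns) if name.startswith(prefix)]


# Shortened, illustrative column list in the same naming style as result_df.
columns = [
    "CIK", "SECFNAME",
    "mda_negative_score", "mda_word_count",
    "qqdmr_negative_score", "qqdmr_word_count",
    "rf_negative_score", "rf_word_count",
]

for prefix in ("mda_", "qqdmr_", "rf_"):
    print(prefix, section_columns(columns, prefix))
```

With the real frame, the positions could be computed once from `result_df.columns` before the loop starts.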

In [20]:
result_df.iloc[10]

CIK                                                                           3662
CONAME                                                            SUNBEAM CORP/FL/
FYRMO                                                                       199905
FDATE                                                          1999-05-17 00:00:00
FORM                                                                       NT 10-Q
SECFNAME                         https://www.sec.gov/Archives/edgar/data/3662/0...
mda_negative_score                                                             NaN
mda_neutral_score                                                              NaN
mda_positive_score                                                             NaN
mda_compound_score                                                             NaN
mda_subjectivity_score                                                         NaN
mda_average_sentence_length                                                    NaN
mda_

## Total Analysis Function :

In [9]:
def text_analyzer(text):
    # Tokens with stopwords and punctuation removed (see Cleaning)
    filtered_data = Cleaning(text)

    # Sentiment scores; an empty dict if nothing is left after cleaning
    if filtered_data:
        sentiment_score = SentimentAnalyzer(filtered_data)
    else:
        sentiment_score = {}

    results = list(sentiment_score.values())

    # Counts are computed on the raw (uncleaned) text
    tokens = WhitespaceTokenizer().tokenize(text)
    word_count = len(tokens)
    sent_count = len(sent_tokenize(text))
    # Guard against division by zero when the text has no sentences
    avg_sent_length = round(word_count / sent_count, ndigits=0) if sent_count else 0.0

    results.extend((avg_sent_length, word_count, sent_count))

    return results  # returning a list

### Sentiment Score function :

In [10]:
def SentimentAnalyzer(content):
    # content is a list of cleaned tokens; a dict of scores is returned
    scores = {}
    sid = SentimentIntensityAnalyzer()

    if len(content) != 0:
        # VADER returns the keys 'neg', 'neu', 'pos' and 'compound'
        scores = sid.polarity_scores(" ".join(content))
        # The small constant keeps the denominator away from zero
        subjectivity_score = (scores['neg'] + scores['pos']) / (len(content) + 0.000001)

        scores.update(sub=subjectivity_score)

    return scores
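The subjectivity formula used in the function, (neg + pos) / (token count + 0.000001), can be checked in isolation. The helper name `subjectivity_score` and the input numbers are made up for illustration and do not come from a real filing.

```python
def subjectivity_score(neg, pos, token_count, eps=0.000001):
    """(neg + pos) / (token_count + eps), the formula used in SentimentAnalyzer."""
    return (neg + pos) / (token_count + eps)


# Illustrative numbers only, not taken from a real filing.
score = subjectivity_score(neg=0.10, pos=0.30, token_count=4)
print(round(score, 4))

# With zero tokens the epsilon avoids a ZeroDivisionError.
print(subjectivity_score(neg=0.0, pos=0.0, token_count=0))
```

Since VADER's `neg` and `pos` values are already proportions between 0 and 1, this score shrinks as the section gets longer, which is worth keeping in mind when comparing sections of very different lengths.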

### Function for Analysis of Readability :

In [11]:
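def Readibility(text):
    # Placeholder: so far only the whitespace-token word count is returned
    word_count = len(WhitespaceTokenizer().tokenize(text))
    return word_count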
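The Fog index listed among the target variables is not computed in the function above. A minimal sketch of the standard Gunning fog formula, 0.4 × (average sentence length + percentage of complex words), might look like this. The vowel-group syllable counter and the naive sentence splitting are rough assumptions, and `fog_index` and `count_syllables` are hypothetical names.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of consecutive-vowel groups (min 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fog_index(text):
    """Gunning fog index: 0.4 * (avg sentence length + % complex words)."""
    # Naive splitting on ., ! and ?; NLTK's sent_tokenize would be more robust.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are taken to be those with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)


sample = "The company reported higher revenue. Costs fell."
print(round(fog_index(sample), 2))
```

The syllable heuristic over-counts some words (silent "e", for example), so the result should be read as an approximation.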

### Text Cleaning function (Filtering the stopwords and punctuation) :

In [12]:
def Cleaning(content):
    # Cleaning text function: returns a list of filtered tokens

    # A set gives fast membership tests
    stop_words = set(stopwords.words('english'))

    # removing stopwords
    word_tokens = [w for w in word_tokenize(content, "english") if w not in stop_words]

    # removing punctuation from the word_tokens list
    filtered_sentence = [w for w in word_tokens if w not in string.punctuation]

    return filtered_sentence
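Two details of the filtering are easy to miss: the NLTK stopword list is lowercase, so a capitalised token such as "The" is kept, and `w not in string.punctuation` is a substring test, so a multi-character token such as "--" is kept as well. The sketch below shows one possible stricter variant. `clean_tokens` is a hypothetical helper, and the tiny stopword set is a stand-in for NLTK's list.

```python
import string

# Tiny stand-in for NLTK's English stopword list (assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}


def clean_tokens(tokens):
    """Drop stopwords (case-insensitively) and pure-punctuation tokens."""
    kept = []
    for token in tokens:
        if token.lower() in STOP_WORDS:
            continue
        # A token made only of punctuation characters carries no words.
        if all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept


tokens = ["The", "results", "of", "operations", "are", "strong", ".", "--"]
print(clean_tokens(tokens))  # ['results', 'operations', 'strong']
```

Whether to lowercase before the stopword test is a design choice: VADER treats capitalisation as emphasis, so the kept tokens are left in their original case here.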