Skip to content

The analysis of financial documents available on the sec website(https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt) and returns the polarity, complex words, positive score, negative score and several other attributes in an excel sheet

Notifications You must be signed in to change notification settings

shivkantsharma33/Data-Extraction-and-Text-Analysis

Repository files navigation

Data-Extraction-and-Text-Analysis

The analysis of financial documents available on the sec website(https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt) and returns the polarity, complex words, positive score, negative score and several other attributes in an excel sheet

Objective of this assignment is to extract some sections (which are mentioned below) from SEC / EDGAR financial reports and perform text analysis to compute variables those are explained below. Link to SEC / EDGAR financial reports are given in excel spreadsheet “cik_list.xlsx”. Please add https://www.sec.gov/Archives/ to every cells of column F (cik_list.xlsx) to access link to the financial report. Example: Row 2, column F contains edgar/data/3662/0000950170-98-000413.txt Add https://www.sec.gov/Archives/ to form financial report link i.e. https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt

Variables: “Text Analysis.docx” you need to compute following: Section 1.1: Positive score, negative score, polarity score Section 2: Average Sentence Length, percentage of complex words, fog index Section 4: Complex word count Section 5: Word count

In addition to these eight variables, compute two more items: “uncertainty” and “constraining”. These variables are calculated similar to the ones in Section 1.1 or Section 4. Attached the lists of words that are classified as uncertain or constraining.

For uncertainty: “uncertainty_dictionary.xlsx” For constraining: “constraining_dictionary.xlsx”

That means you need to collect/compute 10 variables in total.

Sections: For each report (financial reports, links available in excel, cik list), we would like these 10 variables calculated for three sections. These are “Management's Discussion and Analysis”, “Quantitative and Qualitative Disclosures about Market Risk”, and “Risk Factors”. If a report does not include any of these sections, leave those fields blank.

In other words, we need 10 x 3 = 30 variables. Attached the spreadsheet “cik_list.xlsx”, which also contains the links to reports. It would be ideal if you could add 30 columns to each row, so that we would have the # rows unchanged after your data collection.

Additional Variables: positive/negative and uncertainty/constraining word proportion The absolute values of “Positive/Negative Scores” are equal to the number of positive/negative words in each section of 10-Q/K; so the (Loughran-McDonald) positive/negative word proportion can be simply calculated as “Positive/Negative Scores divided by Word Count – compute these measure in addition to Polarity Score. And, the “uncertainty score” and “constraining score” will be also just equal to the number of corresponding words and you can calculate the portion of these words as same as above.

Additional Variable: Constraining words for whole report Add one variable to the mix, which will be calculated only once for the whole report (i.e., not three times). It’s the number of “constraining” words over the whole report rather than in any specific section.

Output Data Structure Notations: “Management's Discussion and Analysis”: MDA “Quantitative and Qualitative Disclosures about Market Risk”: QQDMR “Risk Factors”: RF

Output Variables: All input variables in “cik_list.xlsx” mda_positive_score mda_negative_score mda_polarity_score mda_average_sentence_length mda_percentage_of_complex_words mda_fog_index mda_complex_word_count mda_word_count mda_uncertainty_score mda_constraining_score mda_positive_word_proportion mda_negative_word_proportion mda_uncertainty_word_proportion mda_constraining_word_proportion qqdmr_positive_score qqdmr_negative_score qqdmr_polarity_score qqdmr_average_sentence_length qqdmr_percentage_of_complex_words qqdmr_fog_index qqdmr_complex_word_count qqdmr_word_count qqdmr_uncertainty_score qqdmr_constraining_score qqdmr_positive_word_proportion qqdmr_negative_word_proportion qqdmr_uncertainty_word_proportion qqdmr_constraining_word_proportion rf_positive_score rf_negative_score rf_polarity_score rf_average_sentence_length rf_percentage_of_complex_words rf_fog_index rf_complex_word_count rf_word_count rf_uncertainty_score rf_constraining_score rf_positive_word_proportion rf_negative_word_proportion rf_uncertainty_word_proportion rf_constraining_word_proportion constraining_words_whole_report

About

The analysis of financial documents available on the sec website(https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt) and returns the polarity, complex words, positive score, negative score and several other attributes in an excel sheet

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages