# Tutorial - Downloading and Extracting MD&A Text

This tutorial shows how to download master indexes from the EDGAR system, extract the Form 10-Q and Form 10-K entries, isolate the MD&A text of each filing, and save it to disk.

> **Note:**
>
> These scripts are designed to be run in sequential order.

### Step 1. Downloading a Master Index

Use ```download_master.py``` to download the quarterly master indexes for a given range of years (inclusive). Each index is in CSV format and is roughly 20 MB.

**Syntax:**

```
$ python download_master.py <start_year> <end_year>
```

**Example:**

```
$ python download_master.py 2016 2017
```

**Output:**

```
Downloading 2016 Q1 master index.
Downloading 2016 Q2 master index.
Downloading 2016 Q3 master index.
Downloading 2016 Q4 master index.
Downloading 2017 Q1 master index.
Downloading 2017 Q2 master index.
Downloading 2017 Q3 master index.
Downloading 2017 Q4 master index.
```
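As a rough illustration of what a download loop like this involves, the sketch below builds one URL per quarter for the requested years. The function name `master_index_urls` is hypothetical, and the URL layout is an assumption based on EDGAR's public full-index directory; it is not taken from `download_master.py` itself. No network request is made here.

```python
def master_index_urls(start_year, end_year):
    """Yield (year, quarter, url) for every quarter in the inclusive year range."""
    # Assumed location of EDGAR's quarterly master indexes
    base = "https://www.sec.gov/Archives/edgar/full-index"
    for year in range(start_year, end_year + 1):
        for quarter in range(1, 5):
            yield year, quarter, f"{base}/{year}/QTR{quarter}/master.idx"


for year, quarter, url in master_index_urls(2016, 2017):
    print(f"Downloading {year} Q{quarter} master index.")
    # A real download would fetch `url` at this point, for example with
    # urllib.request, and write the response to disk.
```

Keeping URL construction separate from the download itself makes the year and quarter handling easy to check without touching the network.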

### Step 2. Extracting Form 10-Q and Form 10-K Filings

The ```extract_10x_from_master.py``` script iterates through the master indexes downloaded in Step 1 for a given range of years and extracts the Form 10-Q and Form 10-K lines. These are saved to a new file, ```Data/10X-<start_year>-<end_year>.idx```.

**Syntax:**

```
$ python extract_10x_from_master.py <start_year> <end_year>
```

**Example:**

```
$ python extract_10x_from_master.py 2016 2017
```

**Output:**

```
Extracting filing data:
    2016 Q1... Found 3798 filings
    2016 Q2... Found 7416 filings
    2016 Q3... Found 6870 filings
    2016 Q4... Found 6798 filings
    2017 Q1... Found 6819 filings
    2017 Q2... Found 7129 filings
    2017 Q3... Found 6549 filings
    2017 Q4... Found 6369 filings
Saving to Data/10X-2016-2017.idx
```
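The filtering step can be sketched as follows. This is a minimal example, assuming each index line holds five pipe-delimited fields (CIK, company name, form type, filing date, file path), which matches the tuple shown in the Step 3 output. The function name `extract_10x` and the exact-match rule on the form type are assumptions, not the script's confirmed behavior.

```python
WANTED_FORMS = {"10-Q", "10-K"}


def extract_10x(lines):
    """Return (cik, company, form, date, path) tuples for 10-Q and 10-K rows."""
    rows = []
    for line in lines:
        fields = line.strip().split("|")
        # Skip malformed lines; the header row is dropped because its
        # form type field is not in WANTED_FORMS.
        if len(fields) == 5 and fields[2] in WANTED_FORMS:
            rows.append(tuple(fields))
    return rows


sample = [
    "CIK|Company Name|Form Type|Date Filed|Filename",
    "1002517|Nuance Communications, Inc.|10-Q|2016-02-09|edgar/data/1002517/0001002517-16-000049.txt",
    # Made-up row, included only to show a non-matching form type
    "1000045|EXAMPLE CORP|8-K|2016-01-15|edgar/data/1000045/0000000000-16-000001.txt",
]
print(extract_10x(sample))
```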

### Step 3. Building MD&A Filing Objects

To extract the MD&A section from Form 10-Q and Form 10-K documents, each raw text filing must be parsed carefully to remove financial terms and dollar amounts.

First, for each entry in the index created in Step 2, the raw text filing is downloaded from EDGAR and any HTML is removed using BeautifulSoup. Excess whitespace and Unicode characters are then removed using a series of regular expressions. The remaining text is passed through a large stopword list to help isolate unique terms. The final, refined text is saved to disk for later processing by topic modeling libraries.
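The cleanup stage might look something like the sketch below. It assumes the HTML has already been stripped and covers only the whitespace, non-ASCII, and stopword steps. The function name `clean_text`, the specific regular expression, and the three-word stopword set are illustrative placeholders rather than the script's actual rules.

```python
import re


def clean_text(text, stopwords):
    """Drop non-ASCII characters, collapse whitespace, and remove stopwords."""
    # Discard any character that cannot be represented in ASCII
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse runs of spaces, tabs, and newlines into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    kept = [word for word in text.split(" ") if word.lower() not in stopwords]
    return " ".join(kept)


# Placeholder stopword set; the real stopword file is much larger
stopwords = {"the", "of", "and"}
print(clean_text("Results  of\n\tthe \u201cQuarter\u201d and outlook", stopwords))
```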

As the filing counts in Step 2 show, this script works through thousands of filings and could take several hours to download and analyze them.

**Syntax:**

```
$ python build_edgar_mda_files.py <index_filename> [cik_maskfile] [max_count]
```

**Parameters:**

- ```<index_filename>``` - The index created in Step 2 containing only Form 10-Q and Form 10-K filing references.
- ```[cik_maskfile]``` - An optional reference to a text file containing a list of CIK numbers. Only filings for companies matching these CIK numbers will be downloaded and processed.
- ```[max_count]``` - An optional parameter limiting the number of filings to be processed.
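To make the effect of the two optional parameters concrete, here is a small sketch of how index entries could be narrowed down. The function name `select_entries` is hypothetical, and it assumes the mask file has already been read into a set of CIK strings; the mask file's on-disk format is not specified here.

```python
def select_entries(entries, cik_mask=None, max_count=None):
    """Keep entries whose CIK is in cik_mask, stopping once max_count are kept."""
    selected = []
    for entry in entries:
        if max_count is not None and len(selected) >= max_count:
            break
        # entry[0] is the CIK; with no mask, every entry qualifies
        if cik_mask is not None and entry[0] not in cik_mask:
            continue
        selected.append(entry)
    return selected


entries = [
    # Made-up entry, included only to show the mask filtering it out
    ("1000045", "EXAMPLE CORP", "10-K", "2016-01-15", "edgar/data/1000045/0000000000-16-000001.txt"),
    ("1002517", "Nuance Communications, Inc.", "10-Q", "2016-02-09", "edgar/data/1002517/0001002517-16-000049.txt"),
]
print(select_entries(entries, cik_mask={"1002517"}, max_count=1))
```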

**Example:**

```
$ python build_edgar_mda_files.py ./Data/10X-2016-2017.idx ./Data/737x.csv 1
```

**Output:**

```
Loading index ./Data/10X-2016-2017.idx.
Loading 757 stopwords.
----------------------------------------------------------------
1. Processing ('1002517', 'Nuance Communications, Inc.', '10-Q', '2016-02-09', 'edgar/data/1002517/0001002517-16-000049.txt')
Downloading raw filing text
Searching for SIC code
	Found 7372
Searching for Accession Number... 
	Found 000100251716000049
Searching for MD&A text
	Trimming MD&A START to 2978.
	Trimming MD&A START to 66595.
	Trimming MD&A END to 63566.
	Success!
Removing stopwords and garbage.
Saving MD&A text to ./Data/MDA/mda_7372_2016-02-09_1002517_000100251716000049.txt
================================================================
Done. 1 filing(s) loaded.
```

After the index and stopword file are loaded, each index entry is processed as described above. When the first mention of the MD&A section is reached, the program trims the text to that point. It then continues seeking through the text, trimming along the way, until it reaches the last occurrence of the MD&A start header. This is necessary because the MD&A section is frequently referenced in other parts of the filing. The end of the MD&A section is then located, which allows the desired section to be extracted. Finally, stopwords and other undesired text are removed and the MD&A text is saved to a file.
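The "last start header, then first end header" idea can be sketched as below. The two regular expressions and the function name `extract_mda` are assumptions chosen for illustration; real filings vary widely in wording and layout, and the patterns used by `build_edgar_mda_files.py` may differ.

```python
import re

# Hypothetical header patterns (Item 2 / Item 7 for the start, Item 3 / Item 7A for the end)
MDA_START = re.compile(r"item\s+\d+\.\s*management'?s\s+discussion\s+and\s+analysis", re.I)
MDA_END = re.compile(r"item\s+\d+a?\.\s*quantitative\s+and\s+qualitative", re.I)


def extract_mda(text):
    """Return the text from the last MD&A start header up to the next end header."""
    starts = [match.start() for match in MDA_START.finditer(text)]
    if not starts:
        return None
    start = starts[-1]  # earlier matches are treated as cross-references
    end_match = MDA_END.search(text, start)
    end = end_match.start() if end_match else len(text)
    return text[start:end]


sample = (
    "Table of contents: Item 2. Management's Discussion and Analysis ... page 20. "
    "Item 1. Financial Statements. "
    "Item 2. Management's Discussion and Analysis of Financial Condition. Revenue grew. "
    "Item 3. Quantitative and Qualitative Disclosures About Market Risk."
)
print(extract_mda(sample))
```

Taking the last start match is a simple heuristic: it skips earlier mentions of the header, but it would pick the wrong position if a later section refers back to the MD&A heading.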

Now that the data sets have been prepared, we can begin the topic modeling and visualization process.

### Other Tutorials

- [Topic Modeling and Visualizations Using Latent Dirichlet Allocation](Visualizations.ipynb)
- [Project Overview](README.md)