# Homework exercise 1
## Deadline: upload to Moodle by 17 May 18:00 h

__Please submit your homework either as a Jupyter Notebook or using .py files.__

If you use .py files, please also include a PDF containing the output of your code and your explanations. Either way, the code needs to be in a form that can be easily run on another computer.

__Name:__Evamaria Hammerschmid


The name of the file that you upload should be named *Homework1_YourLastName_YourStudentID*.

Reminder: you are required to attend class on 18 May to earn points for this homework exercise unless you have a valid reason for your absence.

You are expected to work on this exercise individually. If any part of the questions is unclear, please ask on the Moodle forum.

__SEC EDGAR__

Filings made by companies to the regulator are another very useful source of text data. The most important source in this regard is the US Securities and Exchange Commission (SEC).

The SEC provides information on how to access their filings here: https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm

Please write a function that

* downloads index files sorted by form type for a particular day or a list of days
* then downloads the _HTML versions_ of the filings made on that day (or each day in the list), with an optional argument that can specify the form type if you want to access only files of one such form type. Note that you can identify the file containing the main filing, which is the file to be downloaded, by considering the column 'Type' in the table, e.g., here: https://www.sec.gov/Archives/edgar/data/946644/0001493152-21-005524-index.htm

Please write another function that 
* downloads the HTML versions of the files of form type 10-Q file on a given day
* removes all tables and images from the files if there are any
* returns a DataFrame in which the columns correspond to the different parts/items of the form and the content of each filing is written to one row of the DataFrame. Item here is a technical term here as you will see when looking at such filings, e.g., here: https://www.sec.gov/Archives/edgar/data/1530425/000147793221001290/arrt_10q.htm ;  the items are numbered and items with the same number that are contained in the same part of the filing always have the same name.

Please test your code for days comprising a total of at least 10 filings.

In [252]:
from bs4 import BeautifulSoup, SoupStrainer
import re
import requests
import pandas as pd
import datetime
import time


In [253]:
def getIndexFile(dates, form_type = ""): #by default takes all form types if not stated otherwise
    base_url = "https://www.sec.gov/Archives/edgar/daily-index/"
    index_dfs = []
    all_filings = []
    for date in dates:
        filings = []
        year = str(date.year)
        month = str(date.month)
        day = str(date.day)
        if len(day) == 1:
            day = "0" + day
        if len(month) == 1:
            month = "0" + month
        quarter = str(date.quarter)
        url = base_url + year + "/QTR" + quarter + "/master."+year+month+day+".idx"
        res = requests.get(url)
        time.sleep(1)
        table = BeautifulSoup(res.text).body.p.text.split('\n')[5:]   #List with length of rows
        table_df = pd.DataFrame([x.split('|') for x in table[2:]], columns = table[0].split('|'))
        table_df.dropna(inplace=True)
        table_df.sort_values(by="Form Type", inplace=True)
        index_dfs.append(table_df.copy())  #in index_dfs --> list of Dataframes for chosen dates
        if form_type != "":
            table_df = table_df[table_df['Form Type'] == form_type]
        print("Total number of files: "+str(len(table_df['File Name'])))
        count = 0
        for file_name in table_df['File Name']:
            filing_url = "https://www.sec.gov/Archives/"+file_name[:-4]+"-index.html"
            filing_res = requests.get(filing_url)
            time.sleep(1)
            soup = BeautifulSoup(filing_res.text)
            t = soup.table
            if t:
                main_file_url = "https://www.sec.gov"+t.find_all('tr')[1].find('a')['href']
                main_file_res = requests.get(main_file_url)
                filings.append(main_file_res.text)
                time.sleep(1)
            count += 1
#             if count > 20:
#                 break
        print("Download done.")
        all_filings.append(filings)
    return index_dfs, all_filings

    

In [92]:
index_dfs, filings = getIndexFile([pd.Timestamp("2020-01-02")], form_type="10-Q")



Total number of files: 6
Download done.


In [54]:
#print(index_dfs)

[          CIK                                   Company Name  Form Type  \
2910  1745449                          Elegance Brands, Inc.    1-A POS   
3252  1790320                        Mystic Holdings Inc./NV      1-A/A   
2723  1706656  Fundrise National For-Sale Housing eFund, LLC        1-U   
2424  1661023       Fundrise Midland Opportunistic REIT, LLC        1-U   
3052  1768726                Fundrise Growth eREIT 2019, LLC        1-U   
...       ...                                            ...        ...   
4019   837465                               CALLAWAY GOLF CO     UPLOAD   
2908  1745240                           BEESPOKE CAPITAL LLC    X-17A-5   
1864  1527312          GEORGE K. BAUM CAPITAL ADVISORS, INC.  X-17A-5/A   
2907  1745240                           BEESPOKE CAPITAL LLC  X-17A-5/A   
4313   922113                       GEORGE K. BAUM & COMPANY  X-17A-5/A   

     Date Filed                                    File Name  
2910   20200102  edgar/data/1745449

In [254]:

def get10QFile(date):
    index_dfs, filings = getIndexFile([date], form_type="10-Q")
    html_files = filings[0]
    output = []
    item1=[]
    item2=[]
    item3=[]
    item4=[]
    for file in html_files:
        file = BeautifulSoup(file, 'html.parser')
        for tbl in file.find_all('table'):
            tbl.decompose()
        for img in file.find_all('img'):
            img.decompose()
        filetext = str(file.body.get_text())
        re_object1=re.compile(r'Item\w*\s*1\.', re.I)
        search_test1 = re_object1.search(filetext)

        re_object2=re.compile(r'Item\w*\s*2\.', re.I)
        search_test2 = re_object2.search(filetext)

        re_object3=re.compile(r'Item\w*\s*3\.', re.I)
        search_test3 = re_object3.search(filetext)

        re_object4=re.compile(r'Item\w*\s*4\.', re.I)
        search_test4 = re_object4.search(filetext)

        item1.append(filenum[search_test1.end():search_test2.start()] if (search_test1 and search_test2) else None)
        item2.append(filenum[search_test2.end():search_test3.start()] if (search_test2 and search_test3) else None)
        item3.append(filenum[search_test3.end():search_test4.start()] if (search_test3 and search_test4) else None)
        item4.append(filenum[search_test4.end():] if (search_test3 and search_test4) else None)
    Q10 = pd.DataFrame({'Item 1': item1, 'Item 2': item2, 'Item 3': item3,"Item 4": item4})
        
        
    return Q10

In [269]:
df = get10QFile(pd.Timestamp("2021-02-10"))

Total number of files: 25
Download done.


In [270]:
df

Unnamed: 0,Item 1,Item 2,Item 3,Item 4
0,"company, which specializes in acquiring\nstud...","ng, or operating\nlease and requires lessees t...",m time to time in other reports we file with\n...,"rly\nReport on Form 10-Q include, but are not ..."
1,E YEARS:\n \n\n\n\n\n\n\n\n Ind...,,,
2,all documents and reports required to be file...,,,
3,,,,
4,HUBILU\nVENTURE CORPORATION\n \n\n\n\...,,,
5,,,,
6,,,,
7,,,,
8,DURING THE PRECEDING FIVE YEARS:\n \n\n...,TIONS\n \n\n\n\n\n\n\n\n\n\n\n ...,"rmation. Certain risks, uncertainties or other...",ng statements in this Quarterly\nReport on For...
9,HUBILU\nVENTURE CORPORATION\n ...,,,


In [133]:
# test = BeautifulSoup(files[2], 'html.parser')
# html = test
# html = html.prettify("utf-8")

In [134]:
# with open("test_output2.html", "wb") as file:
#     file.write(html)