Reproducible steps to create a corpus from [FinnHub](https://finnhub.io/).

# Pseudocode

The steps below give a 10,000-foot view of the process.
The implementation itself is commented in more detail.

1. Get the company ticker list from the [SEC](https://www.sec.gov/file/company-tickers).
2. From the retrieved data, extract the ticker of every publicly traded stock in the U.S. market.

In [1]:
from pathlib import Path

tickers_url = 'https://www.sec.gov/files/company_tickers.json'
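# Sent as the User-Agent header on the SEC request. The SEC asks automated
# clients to identify themselves, so this value may need contact details added.
user_agent = 'FinnHub'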
limit = 20

data_folder = Path('./data/')
tickers_file = data_folder / 'tickers.csv'
raw_folder = data_folder / 'raw'
corpus_folder = data_folder / 'corpus'

# Step 1

1. Get the list of tickers from the SEC.
2. Convert the tickers into a list, then sort it by ticker symbol.
3. Save the tickers to a CSV file.
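
The three steps above can be tried offline on a small in-memory sample before touching the network. The sketch below is only an illustration: `sample_payload`, `payload_to_rows` and `rows_to_csv` are hypothetical names, the payload shape (a dict keyed by row index, each value holding `cik_str`, `ticker` and `title`) is inferred from the fields the notebook code reads, and the standard-library `csv` module stands in for pandas so that the example has no dependencies.

```python
import csv
import io

# Hypothetical sample: a dict keyed by row index whose values carry
# the cik_str, ticker and title fields that the notebook code reads.
sample_payload = {
    "0": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"},
    "1": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "2": {"cik_str": 1045810, "ticker": "NVDA", "title": "NVIDIA CORP"},
}

def payload_to_rows(payload):
    """Steps 1-2: drop the index keys, then sort the records by ticker."""
    records = sorted(payload.values(), key=lambda record: record["ticker"])
    return [(r["cik_str"], r["ticker"], r["title"]) for r in records]

def rows_to_csv(rows):
    """Step 3: serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["CIK", "Ticker", "Name"])
    writer.writerows(rows)
    return buffer.getvalue()

rows = payload_to_rows(sample_payload)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The header row uses the same `CIK`, `Ticker`, `Name` column names as the notebook, so the text produced here has the same layout as the cached `tickers.csv`.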

In [4]:
import requests
import pandas as pd
import json

def get_tickers(tickers_file: Path, tickers_url: str, user_agent: str) -> pd.DataFrame:
    """Return the SEC ticker list, downloading and caching it as a CSV on first use."""
    if not tickers_file.exists():
        with requests.Session() as session:
            session.headers['User-Agent'] = user_agent
            with session.get(tickers_url) as result:
                if result.status_code != 200:
                    raise RuntimeError(f'Error retrieving tickers (HTTP {result.status_code})')
                # The payload is a dict of records; keep only the records.
                records = list(json.loads(result.text).values())
        # Sort by ticker symbol and keep the three fields we need.
        records.sort(key = lambda record: record['ticker'])
        tickers = [(x['cik_str'], x['ticker'], x['title']) for x in records]
        df = pd.DataFrame(tickers, columns = ['CIK', 'Ticker', 'Name'])
        # Create the data folder if needed before writing the cache file.
        tickers_file.parent.mkdir(parents = True, exist_ok = True)
        df.to_csv(tickers_file, index = False)
    # Always read back from the CSV so cached and fresh runs return the same frame.
    return pd.read_csv(tickers_file) # type: ignore

tickers = get_tickers(tickers_file, tickers_url, user_agent)