Reproducible steps to create a corpus from [FinnHub](https://finnhub.io/).

# Pseudocode

The steps below give a 10,000-foot view of the process.
The implementation is commented in more detail.

1. Get the company tickers from the [SEC](https://www.sec.gov/file/company-tickers).
2. Using the retrieved tickers, download the congressional trading data for each ticker from FinnHub.

In [1]:
from pathlib import Path
import requests
import pandas as pd
import json
import csv
from tqdm.notebook import tqdm

In [2]:
tickers_url = 'https://www.sec.gov/files/company_tickers.json'
user_agent = 'FinnHub-Data-Ingestion'
limit = 20

data_folder = Path('./data/')
tickers_file = data_folder.joinpath('./tickers.csv')
raw_folder = data_folder.joinpath('./raw')
corpus_folder = data_folder.joinpath('./corpus')

# Step 1

1. Get the list of tickers from the SEC.
2. Convert the tickers into a list and sort it by ticker symbol.
3. Save the tickers to a CSV file.
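
To make the conversion and sorting concrete, here is a minimal offline sketch. It assumes the SEC file is a JSON object whose values are records with `cik_str`, `ticker`, and `title` fields, which are the fields the implementation in this notebook reads. The sample payload and the `to_sorted_rows` helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical sample mimicking the assumed layout of the SEC tickers file:
# a JSON object keyed by row index, each value holding one company record.
sample_payload = json.dumps({
    "0": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"},
    "1": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "2": {"cik_str": 1090872, "ticker": "A", "title": "AGILENT TECHNOLOGIES, INC."},
})


def to_sorted_rows(payload: str) -> list[tuple[int, str, str]]:
    """Flatten the keyed records into (CIK, ticker, name) rows sorted by ticker."""
    records = json.loads(payload).values()
    ordered = sorted(records, key=lambda rec: rec["ticker"])
    return [(rec["cik_str"], rec["ticker"], rec["title"]) for rec in ordered]


rows = to_sorted_rows(sample_payload)
for row in rows:
    print(row)
```

The row keys (`"0"`, `"1"`, ...) carry no information of their own, so only the values are kept before sorting.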

In [3]:
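def get_tickers(tickers_file: Path, tickers_url: str, user_agent: str) -> pd.DataFrame:
    """Return the SEC ticker list, downloading and caching it as a CSV on first use."""
    if not tickers_file.exists():
        with requests.Session() as session:
            # Identify the client; the SEC expects automated requests to declare a User-Agent.
            session.headers['User-Agent'] = user_agent
            with session.get(tickers_url) as result:
                if result.status_code != 200:
                    raise RuntimeError(f'Error retrieving tickers (status code {result.status_code})')
                # The payload is a dict of records keyed by index; only the records matter.
                records = list(result.json().values())
        # Sort by ticker symbol and keep the CIK, ticker, and company name.
        records.sort(key=lambda rec: rec['ticker'])
        tickers = [(rec['cik_str'], rec['ticker'], rec['title']) for rec in records]
        df = pd.DataFrame(tickers, columns=['CIK', 'Ticker', 'Name'])
        tickers_file.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(tickers_file, index=False)
    # Read tickers as plain strings so that a symbol such as 'NA' is not parsed as a missing value.
    return pd.read_csv(tickers_file, dtype={'Ticker': str}, keep_default_na=False)  # type: ignore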

tickers_df = get_tickers(tickers_file, tickers_url, user_agent)

# Step 2

1. Iterate through each ticker in the `tickers.csv` file and download all available congressional trading data from FinnHub.
2. Save the data for each ticker to its own JSON file.
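
With roughly ten thousand tickers, the download can be interrupted or throttled partway through. The following is a minimal sketch of a resumable variant of the two steps above, not the notebook's implementation. The `download_all` function and its `fetch` parameter are hypothetical: `fetch` is any callable that returns a `(status_code, body)` pair for a ticker. Treating HTTP 429 as a rate-limit signal worth retrying is an assumption about how the API behaves under load.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, Tuple


def download_all(
    tickers: Iterable[str],
    fetch: Callable[[str], Tuple[int, str]],
    out_dir: Path,
    max_retries: int = 3,
    wait_seconds: float = 1.0,
) -> dict:
    """Save one JSON file per ticker and return a summary of what happened.

    `fetch` is a hypothetical callable returning (status_code, body) for a ticker.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    summary = {"saved": [], "skipped": [], "failed": []}
    for ticker in tickers:
        target = out_dir / f"{ticker}.json"
        if target.exists():
            # Already downloaded in an earlier run, so do not spend a request on it.
            summary["skipped"].append(ticker)
            continue
        status, body = None, ""
        for attempt in range(max_retries):
            status, body = fetch(ticker)
            if status != 429:
                break
            # Assumption: 429 means "too many requests"; wait a little longer each time.
            time.sleep(wait_seconds * (attempt + 1))
        if status == 200:
            target.write_text(body)
            summary["saved"].append(ticker)
        else:
            summary["failed"].append((ticker, status))
    return summary
```

In this notebook, `fetch` could wrap `requests.get` and return `(response.status_code, response.text)`. Passing the fetcher in as a parameter also makes it possible to exercise the loop without network access.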

In [4]:
base_url = 'https://finnhub.io/api/v1/stock/congressional-trading?symbol='

headers = {
    'Accept': '*/*',
    'User-Agent': 'Thunder Client (https://www.thunderclient.com)',
    'X-FinnHub-Token': 'YOUR_API_KEY',  # Replace with your actual API key
}

tickers = tickers_df['Ticker']

In [5]:
# Make sure the output folder exists before writing to it.
raw_folder.mkdir(parents=True, exist_ok=True)

for ticker in tqdm(tickers):
    url = base_url + ticker
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        # Store the raw response body unchanged, one file per ticker.
        filename = raw_folder.joinpath(f'{ticker}.json')
        with open(filename, 'w') as jsonfile:
            jsonfile.write(response.text)
        print(f'Response for {ticker} saved to {filename}')
    else:
        print(f'Failed to retrieve data for {ticker}. Status code: {response.status_code}')

  0%|          | 0/10442 [00:00<?, ?it/s]

Response for A saved to data/A.json
Response for AA saved to data/AA.json
Response for AAAU saved to data/AAAU.json
Response for AACG saved to data/AACG.json
Response for AACI saved to data/AACI.json
Response for AACIU saved to data/AACIU.json
Response for AACIW saved to data/AACIW.json
Response for AACT saved to data/AACT.json
Response for AACT-UN saved to data/AACT-UN.json
Response for AACT-WT saved to data/AACT-WT.json
Response for AADI saved to data/AADI.json
Response for AAGH saved to data/AAGH.json
Response for AAGR saved to data/AAGR.json
Response for AAGRW saved to data/AAGRW.json
Failed to retrieve data for AAIDX. Status code: 403
Response for AAL saved to data/AAL.json
Response for AAMC saved to data/AAMC.json
Response for AAME saved to data/AAME.json
Response for AAN saved to data/AAN.json
Response for AAOI saved to data/AAOI.json
Response for AAON saved to data/AAON.json
Response for AAP saved to data/AAP.json
Response for AAPI saved to data/AAPI.json
Response for AAPL save

KeyboardInterrupt: 
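
The run above stopped with a `KeyboardInterrupt`, so only part of the ticker list was downloaded. As a minimal sketch, the tickers that still lack a saved file can be found by comparing the ticker list against the JSON files on disk. The `pending_tickers` helper is hypothetical and assumes each file is named `<ticker>.json`, as in the download loop.

```python
from pathlib import Path
from typing import Iterable, List


def pending_tickers(tickers: Iterable[str], raw_dir: Path) -> List[str]:
    """Return the tickers that have no saved JSON file yet, in their original order."""
    # The file stem is the ticker symbol, e.g. 'AACT-UN.json' -> 'AACT-UN'.
    done = {path.stem for path in raw_dir.glob("*.json")}
    return [ticker for ticker in tickers if ticker not in done]
```

Tickers whose request failed, such as the one that returned status code 403, never get a file, so they also appear in the result and would be requested again on the next pass.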