### Scraping using XBRL.us API

In [1]:
import numpy as np
import pandas as pd 
import requests
import json
import io
import lxml
from bs4 import BeautifulSoup as bs
import matplotlib.pyplot as plt
import random
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import re
from XBRL import xbrl_login
import urllib
from urllib.parse import urlencode
from datetime import datetime

### Access Limits

| Membership Type    	| Records per query 	| Record offset limit 	|
|--------------------	|:-----------------:	|:-------------------:	|
| non-Member *       	|        100        	|        1,000        	|
| Basic Individual * 	|        500        	|        2,000        	|
| All Other Members  	|       2,000       	|      unlimited      	|
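
The table above implies a cap on how many paged requests each membership tier can make before the offset limit is hit. A minimal sketch of that arithmetic follows; the dictionary keys and the `max_requests` helper are illustrative names, not part of the XBRL US API.

```python
# Limits copied from the Access Limits table above.
# Key names and the helper are hypothetical, for illustration only.
ACCESS_LIMITS = {
    "non-Member": {"per_query": 100, "offset_limit": 1000},
    "Basic Individual": {"per_query": 500, "offset_limit": 2000},
    "All Other Members": {"per_query": 2000, "offset_limit": None},  # unlimited
}


def max_requests(membership):
    """Return how many full pages fit under the offset limit, or None if unlimited."""
    limits = ACCESS_LIMITS[membership]
    if limits["offset_limit"] is None:
        return None
    return limits["offset_limit"] // limits["per_query"]


print(max_requests("non-Member"))        # 10
print(max_requests("Basic Individual"))  # 4
```

These are the same factors (10 and 4) that the query loop later in this notebook uses to decide when to stop paging.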

### Resources

#### Sample Jupyter notebook

https://mybinder.org/v2/gh/xbrlus/xbrl-api-ipynb/python?filepath=xbrl_us_api.ipynb

#### Simple tutorial

https://www.youtube.com/watch?v=3PWypzk2Yac&ab_channel=XBRLUS

## Create login details

In [2]:
email = xbrl_login.email
password = xbrl_login.password
clientid = xbrl_login.Client_ID
secret = xbrl_login.Client_Secret

body_auth = {'username' : ''.join(email), 
            'client_id': ''.join(clientid), 
            'client_secret' : ''.join(secret), 
            'password' : ''.join(password), 
            'grant_type' : 'password', 
            'platform' : 'ipynb' }

payload = urlencode(body_auth)
url = 'https://api.xbrl.us/oauth2/token'
headers = {"Content-Type": "application/x-www-form-urlencoded"}

res = requests.request("POST", url, data=payload, headers=headers)
auth_json = res.json()

if 'error' in auth_json:
    print ("\n\nThere was a problem generating an access token with these credentials. Run the first cell again to enter credentials.")
else:
    # Read the tokens only on success; an error response carries no token keys
    access_token = auth_json['access_token']
    refresh_token = auth_json['refresh_token']
    print ("\n\nYour access token expires in 60 minutes. After it expires, run the cell immediately below this one to generate a new token and continue to use the query cell. \n\nFor now, skip ahead to the section 'Make a Query'.")
newaccess = ''
newrefresh = ''
#print('access token: ' + access_token + ' refresh token: ' + refresh_token)



Your access token expires in 60 minutes. After it expires, run the cell immediately below this one to generate a new token and continue to use the query cell. 

For now, skip ahead to the section 'Make a Query'.


### Refresh your token

- Tokens need to be refreshed every 60 minutes

In [17]:
# Use a newer refresh token if one has been stored, otherwise the one from login
token = newrefresh if newrefresh != '' else refresh_token

refresh_auth = {'client_id': ''.join(clientid), 
            'client_secret' : ''.join(secret), 
            'grant_type' : 'refresh_token', 
            'platform' : 'ipynb', 
            'refresh_token' : ''.join(token) }
refreshres = requests.post(url, data=refresh_auth)
refresh_json = refreshres.json()
access_token = refresh_json['access_token']
refresh_token = refresh_json['refresh_token']
#print('access token: ' + access_token + 'refresh token: ' + refresh_token)
print('Your access token is refreshed for 60 minutes. If it expires again, run this cell to generate a new token and continue to use the query cells below.')
print(access_token)

Your access token is refreshed for 60 minutes. If it expires again, run this cell to generate a new token and continue to use the query cells below.
0ace9372-8657-46b4-b07b-f12424a933fa
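
Because tokens last 60 minutes, it can help to check a token's age before each query rather than wait for a failed request. The sketch below only does the time arithmetic; `needs_refresh` and the five-minute safety margin are assumptions for illustration, not something the API prescribes.

```python
from datetime import datetime, timedelta

TOKEN_LIFETIME = timedelta(minutes=60)  # lifetime stated in this notebook
SAFETY_MARGIN = timedelta(minutes=5)    # assumption: refresh slightly early


def needs_refresh(issued_at, now):
    """Return True once the token is within the safety margin of expiring."""
    return now - issued_at >= TOKEN_LIFETIME - SAFETY_MARGIN


issued = datetime(2022, 5, 19, 13, 0)
print(needs_refresh(issued, datetime(2022, 5, 19, 13, 30)))  # False
print(needs_refresh(issued, datetime(2022, 5, 19, 13, 56)))  # True
```

In practice you would record `datetime.now()` when the login or refresh cell succeeds and run the refresh cell whenever this check returns `True`.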


### Make a query
After the access token confirmation appears above, you can modify the query below, then use the Cell >> Run menu option from the cell immediately below this text to run the entire query for results.

The original sample query returns 10+ years of data for every company in an SIC code, and may take several minutes to recreate. To get results quickly, this notebook has

- report.sic-code 

commented out and filters on 

- entity.cik and 
- period.fiscal-year 

so the search runs for two companies only.

Refer to XBRL API documentation at https://xbrlus.github.io/xbrl-api/#/Facts/getFactDetails for other endpoints and parameters to filter and return.

### Define the parameters for the filter and fields to be returned

- **offset_value** - Record offset at which the query starts. It begins at 0 and the query loop advances it as each page of results is retrieved.
- **XBRL_Elements** - These are the real metrics that you want to track, e.g. `Assets`, `Liabilities`, etc.
- **sic_code** - Standard Industrial Classification codes. We can look these up at https://siccode.com/sic-code-lookup-directory, but they are not really necessary as long as we have CIK codes for each company.
- **periods** - Year or quarter or both, e.g. ["Y"]
- **years** - List of years for which you want to retrieve data. e.g. [2021, 2020, 2019, 2018, 2017]
- **companies_cik** - The Central Index Key (CIK) is used on the SEC's computer systems to identify corporations and individual people who have filed disclosure with the SEC. It can be looked up at https://www.sec.gov/edgar/searchedgar/cik.htm, e.g. '0000789019' for Microsoft (MSFT).
- **fields** - Data fields to return for each fact; these become the column names of the results DataFrame. A `.sort(ASC)` or `.sort(DESC)` suffix sorts the results on that field.
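
The CIKs in this notebook are written as 10-character, zero-padded strings (e.g. '0000789019'). If your CIKs come from a spreadsheet or another API as plain integers, a small helper can put them in that form. This is a sketch under the assumption that the zero-padded form shown in this notebook is what you want to send; `normalize_cik` is a hypothetical name.

```python
def normalize_cik(cik):
    """Return a CIK as a 10-character, zero-padded string of digits."""
    digits = str(cik).strip()
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError("not a valid CIK: {!r}".format(cik))
    return digits.zfill(10)


print(normalize_cik(789019))     # 0000789019
print(normalize_cik("1018724"))  # 0001018724
```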

In [4]:
offset_value = 0
res_df = []

In [21]:
### These are the real metrics that you want to track

##For a full list refer to https://xbrl.us/data-rule/dqc_0015-le/

XBRL_Elements = ["Assets",
                 "AssetsCurrent",
                 "Liabilities",
                 "LiabilitiesAndStockholdersEquity",
                 "CashCashEquivalentsAndShortTermInvestments"]


#sic_code = [2080]

periods = ['Y']

# This query filters on companies_cik and years; the sic_code filter is
# commented out. To search a whole industry instead, uncomment report.sic-code
# and comment out entity.cik and period.fiscal-year in params below.

years = list(range(2021, 2000, -1))

## Microsoft (MSFT) and Amazon (AMZN)
companies_cik = ['0000789019', '0001018724'] 
                 
# Other CIKs that could be added:
# '0000320193', ## Apple (AAPL)
# '0000051143' ## IBM (IBM)

# Define data fields to return (multi-sort based on order)

# this is the list of the characteristics of the data being returned by the query
fields = ['period.fiscal-year.sort(DESC)',
         'entity.name.sort(ASC)',
         'concept.local-name.sort(ASC)',
         'fact.value',
         'unit',
         'fact.decimals',
         'report.filing-date']
                 
#'report.sic-code'                 

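# Convert numeric filters to strings so they can be comma-joined
# (loop variables renamed so they no longer shadow the built-in int)
#string_sic = [str(code) for code in sic_code]
string_years = [str(year) for year in years]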

params = { # this is the list of what's being queried against the search endpoint
         'concept.local-name': ','.join(XBRL_Elements),
         #'report.sic-code': ','.join(string_sic),
         'entity.cik': ','.join(companies_cik),
         'period.fiscal-year': ','.join(string_years),
         'period.fiscal-period': ','.join(periods),  
         'fact.ultimus': 'TRUE', # return only the latest occurrence of a specific fact (eg. 2018 revenues)
         'fact.has-dimensions': 'FALSE', # generally, 'FALSE' will return face financial data only
         'fields': ','.join(fields)
         }

In [19]:
params

{'concept.local-name': 'Assets,AssetsCurrent,Liabilities,LiabilitiesAndStockholdersEquity,CashCashEquivalentsAndShortTermInvestments',
 'entity.cik': '0000789019,0001018724',
 'period.fiscal-year': '2021,2020,2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001',
 'period.fiscal-period': 'Q',
 'fact.ultimus': 'TRUE',
 'fact.has-dimensions': 'FALSE',
 'fields': 'period.fiscal-year.sort(DESC),entity.name.sort(ASC),concept.local-name.sort(ASC),fact.value,unit,fact.decimals,report.filing-date'}

In [22]:
# Execute the query with loop for all results

search_endpoint = 'https://api.xbrl.us/api/v1/fact/search'
orig_fields = params['fields']

count = 0
query_start = datetime.now()
printed = False
while True:
    if not printed:
        print("On", query_start.strftime("%c"), email, "(client ID:", str(clientid.split('-')[0]), "...) started the query and")
        printed = True
    res = requests.get(search_endpoint, params=params, headers={'Authorization' : 'Bearer {}'.format(access_token)})
    res_json = res.json()
    if 'error' in res_json:
        print('There was an error: {}'.format(res_json['error_description']))
        break

    print("up to", str(offset_value + res_json['paging']['limit']), "records are found so far ...")

    res_df += res_json['data']

    if res_json['paging']['count'] < res_json['paging']['limit']:
        print(" - this set contained fewer than the", res_json['paging']['limit'], "possible, only", str(res_json['paging']['count']), "records.")
        break
    else: 
        offset_value += res_json['paging']['limit'] 
        # Request the next page by appending the new offset to the fields list
        params['fields'] = orig_fields + ',fact.offset({})'.format(offset_value)
        # Stop at the record offset limit: 1,000 (non-Member) or 2,000 (Basic Individual)
        if res_json['paging']['limit'] == 100 and offset_value == 10 * res_json['paging']['limit']:
            break
        elif res_json['paging']['limit'] == 500 and offset_value == 4 * res_json['paging']['limit']:
            break

if 'error' not in res_json:
    current_datetime = datetime.now().replace(microsecond=0)
    time_taken = current_datetime - query_start
    # Each element of res_df is one returned fact, so its length is the row count
    total_rows = len(res_df)
    your_limit = res_json['paging']['limit']
    limit_message = "If the results below match the limit noted above, you might not be seeing all rows, and should consider upgrading (https://xbrl.us/access-token).\n"
    
    if your_limit == 100:
        print("\nThis non-Member account has a limit of " , 10 * your_limit, " rows per query from our Public Filings Database. " + limit_message)
    elif your_limit == 500:
        print("\nThis Basic Individual Member account has a limit of ", 4 * your_limit, " rows per query from our Public Filings Database. " + limit_message)
    
    print("\nAt " + current_datetime.strftime("%c") +  ", the query finished with  ", str(total_rows), "  rows returned in " + str(time_taken) + " for \n" +  urllib.parse.unquote(res.url))
    
    # the format truncates the HTML display of numerical values to two decimals; .csv data is unaffected
    pd.options.display.float_format = '{:,.2f}'.format
    # Build the DataFrame in one step instead of concatenating one row at a time
    my_df = pd.DataFrame(res_df)
    



On Thu May 19 13:49:55 2022 aptsearchatl@gmail.com (client ID: 03279424 ...) started the query and
up to 100 records are found so far ...
up to 200 records are found so far ...
 - this set contained fewer than the 100 possible, only 13 records.

This non-Member account has a limit of  1000  rows per query from our Public Filings Database. If the results below match the limit noted above, you might not be seeing all rows, and should consider upgrading (https://xbrl.us/access-token).


At Thu May 19 13:49:56 2022, the query finished with   113   rows returned in 0:00:00.456241 for 
https://api.xbrl.us/api/v1/fact/search?concept.local-name=Assets,AssetsCurrent,Liabilities,LiabilitiesAndStockholdersEquity,CashCashEquivalentsAndShortTermInvestments&entity.cik=0000789019,0001018724&period.fiscal-year=2021,2020,2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001&period.fiscal-period=Y&fact.ultimus=TRUE&fact.has-dimensions=FALSE&fields=period.fiscal-y
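
The paging logic in the loop above can be separated from the HTTP call, which makes it easier to test without credentials. The following is a sketch only: `iter_pages`, `fake_fetch`, and the `offset_limit` parameter are hypothetical names, and `fake_fetch` stands in for the real request by returning the same `paging`/`data` shape seen in this notebook's responses.

```python
def iter_pages(fetch, offset_limit=None):
    """Yield each page of records until a short page or the offset limit is reached.

    fetch(offset) must return a dict with 'paging' (limit, count) and 'data' keys.
    """
    offset = 0
    while True:
        page = fetch(offset)
        paging = page["paging"]
        yield page["data"]
        if paging["count"] < paging["limit"]:
            break  # a short page means nothing is left to fetch
        offset += paging["limit"]
        if offset_limit is not None and offset >= offset_limit:
            break  # membership offset limit reached


def fake_fetch(offset, total=213, limit=100):
    """Stand-in for the API call: serves `total` fake records, `limit` at a time."""
    rows = [{"id": i} for i in range(offset, min(offset + limit, total))]
    return {"paging": {"limit": limit, "offset": offset, "count": len(rows)},
            "data": rows}


records = [row for page in iter_pages(fake_fetch) for row in page]
print(len(records))  # 213
```

With the real API, `fetch` would wrap the `requests.get` call and set `fact.offset(...)` in the `fields` parameter, as the loop above does.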

In [23]:
res.json()

{'paging': {'limit': 100, 'offset': 100, 'count': 13},
 'data': [{'period.fiscal-year': 2010,
   'entity.name': 'MICROSOFT CORP',
   'concept.local-name': 'CashCashEquivalentsAndShortTermInvestments',
   'fact.value': '36788000000',
   'unit': 'USD',
   'fact.decimals': '-6',
   'report.filing-date': '2011-07-28'},
  {'period.fiscal-year': 2010,
   'entity.name': 'MICROSOFT CORP',
   'concept.local-name': 'Liabilities',
   'fact.value': '39938000000',
   'unit': 'USD',
   'fact.decimals': '-6',
   'report.filing-date': '2011-07-28'},
  {'period.fiscal-year': 2010,
   'entity.name': 'MICROSOFT CORP',
   'concept.local-name': 'LiabilitiesAndStockholdersEquity',
   'fact.value': '86113000000',
   'unit': 'USD',
   'fact.decimals': '-6',
   'report.filing-date': '2011-07-28'},
  {'period.fiscal-year': 2009,
   'entity.name': 'AMAZON COM INC',
   'concept.local-name': 'Assets',
   'fact.value': '13813000000',
   'unit': 'USD',
   'fact.decimals': '-6',
   'report.filing-date': '2011-01-28'}

In [25]:
my_df.reset_index(drop = True, inplace = True)
my_df

Unnamed: 0,period.fiscal-year,entity.name,concept.local-name,fact.value,unit,fact.decimals,report.filing-date
0,2021,"AMAZON.COM, INC.",Assets,420549000000,USD,-6,2022-04-29
1,2021,"AMAZON.COM, INC.",AssetsCurrent,161580000000,USD,-6,2022-04-29
2,2021,"AMAZON.COM, INC.",LiabilitiesAndStockholdersEquity,420549000000,USD,-6,2022-04-29
3,2021,MICROSOFT CORPORATION,Assets,333779000000,USD,-6,2022-04-26
4,2021,MICROSOFT CORPORATION,AssetsCurrent,184406000000,USD,-6,2022-04-26
...,...,...,...,...,...,...,...
108,2009,MICROSOFT CORP,CashCashEquivalentsAndShortTermInvestments,31447000000,USD,-6,2010-07-30
109,2009,MICROSOFT CORP,LiabilitiesAndStockholdersEquity,77888000000,USD,-6,2010-07-30
110,2008,AMAZON COM INC,Assets,8314000000,USD,-6,2010-01-29
111,2008,AMAZON COM INC,AssetsCurrent,6157000000,USD,-6,2010-01-29


In [29]:
my_df.groupby(["entity.name", "concept.local-name", "period.fiscal-year"]).min()

entity.name,concept.local-name,period.fiscal-year,fact.value,unit,fact.decimals,report.filing-date
AMAZON COM INC,Assets,2008,8314000000,USD,-6,2010-01-29
AMAZON COM INC,Assets,2009,13813000000,USD,-6,2011-01-28
AMAZON COM INC,Assets,2010,18797000000,USD,-6,2012-02-01
AMAZON COM INC,Assets,2011,25278000000,USD,-6,2013-01-30
AMAZON COM INC,Assets,2012,32555000000,USD,-6,2014-01-31
...,...,...,...,...,...,...
MICROSOFT CORPORATION,LiabilitiesAndStockholdersEquity,2017,250312000000,USD,-6,2018-08-03
MICROSOFT CORPORATION,LiabilitiesAndStockholdersEquity,2018,258848000000,USD,-6,2019-08-01
MICROSOFT CORPORATION,LiabilitiesAndStockholdersEquity,2019,286556000000,USD,-6,2020-07-31
MICROSOFT CORPORATION,LiabilitiesAndStockholdersEquity,2020,301311000000,USD,-6,2021-07-29
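
Note that `fact.value` comes back from the API as a string (see the JSON output earlier), so `min()` on it compares text rather than numbers. A minimal sketch of converting the column to numeric and pivoting the concepts into columns is shown below, using four rows copied from the results table in this notebook; the variable names `sample` and `wide` are illustrative.

```python
import pandas as pd

# Four rows copied from the query results shown earlier in this notebook
sample = pd.DataFrame([
    {"period.fiscal-year": 2021, "entity.name": "AMAZON.COM, INC.",
     "concept.local-name": "Assets", "fact.value": "420549000000"},
    {"period.fiscal-year": 2021, "entity.name": "AMAZON.COM, INC.",
     "concept.local-name": "AssetsCurrent", "fact.value": "161580000000"},
    {"period.fiscal-year": 2021, "entity.name": "MICROSOFT CORPORATION",
     "concept.local-name": "Assets", "fact.value": "333779000000"},
    {"period.fiscal-year": 2021, "entity.name": "MICROSOFT CORPORATION",
     "concept.local-name": "AssetsCurrent", "fact.value": "184406000000"},
])

# Strings compare character by character, so "9" sorts after "10"
print(min(["9", "10"]))  # 10

# Convert to numbers, then spread the concepts across columns
sample["fact.value"] = pd.to_numeric(sample["fact.value"], errors="coerce")
wide = sample.pivot_table(index=["entity.name", "period.fiscal-year"],
                          columns="concept.local-name",
                          values="fact.value",
                          aggfunc="first")
print(wide)
```

The wide layout puts one row per company and fiscal year, which is usually easier to plot or to use for ratios such as current assets over total assets.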


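As an alternative bulk source, the SEC publishes Financial Statement Data Sets; the cell below reads a locally downloaded `num.txt` file from one of them: https://www.sec.gov/dera/data/financial-statement-data-sets.html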

In [31]:
new_df = pd.read_csv(r"C:\Users\samit\OneDrive\Desktop\2022q1\num.txt", sep = "\t")
new_df.head()



Unnamed: 0,adsh,tag,version,coreg,ddate,qtrs,uom,value,footnote
0,0000038777-22-000013,NetIncomeLoss,us-gaap/2020,,20201231,1,USD,345300000.0,
1,0000038777-22-000013,NetIncomeLoss,us-gaap/2020,,20211231,1,USD,453200000.0,
2,0000022444-22-000014,NetIncomeLoss,us-gaap/2020,,20201130,1,USD,64093000.0,
3,0000022444-22-000014,NetIncomeLoss,us-gaap/2020,,20211130,1,USD,232889000.0,
4,0000096223-22-000006,NetIncomeLoss,us-gaap/2020,,20211130,4,USD,1667403000.0,
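
Once `num.txt` is loaded, a typical next step is to parse the `ddate` column into real dates and filter to one tag. The sketch below uses three rows copied from the output above in place of the full file. It assumes that `qtrs` is the number of quarters the value covers (so `4` marks a full-year figure); check the data set's readme before relying on that.

```python
import io
import pandas as pd

# Three rows copied from the new_df.head() output above, tab-separated like num.txt
sample_tsv = (
    "adsh\ttag\tversion\tcoreg\tddate\tqtrs\tuom\tvalue\tfootnote\n"
    "0000038777-22-000013\tNetIncomeLoss\tus-gaap/2020\t\t20201231\t1\tUSD\t345300000.0\t\n"
    "0000038777-22-000013\tNetIncomeLoss\tus-gaap/2020\t\t20211231\t1\tUSD\t453200000.0\t\n"
    "0000096223-22-000006\tNetIncomeLoss\tus-gaap/2020\t\t20211130\t4\tUSD\t1667403000.0\t\n"
)

num = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# ddate is stored as an integer like 20211231; parse it into a timestamp
num["ddate"] = pd.to_datetime(num["ddate"].astype(str), format="%Y%m%d")

# Assumption: qtrs == 4 means the value covers four quarters (a full year)
annual = num[(num["tag"] == "NetIncomeLoss") & (num["qtrs"] == 4)]
print(annual[["adsh", "ddate", "value"]])
```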
