# Companies House API
### Document download example
---
How to download PDF documents from Companies House via its API.  
The example retrieves Tesco's most recent annual report (PDF format).

In [2]:
# First, enter your Companies House REST API key here; it is used for authorisation.

api_key = ''


Below we download the top (that is, the most recent) record in Tesco's filing history for the **Accounts** category.  

Note that the authorisation method in the request header is `"Authorization": api_key`, so we need to supply the key set above. A different authorisation method is used further down.  

We also use `items_per_page` in the _URL query string_ to limit the number of records retrieved to just one (by default, the most recent). 

In [None]:
# base modules
import requests
import pandas as pd

# essential variables
company_number = '00445790'  # Tesco company number
category = 'accounts'
items_per_page = 1

# url and header variables
url = f"https://api.company-information.service.gov.uk/company/{company_number}/filing-history?items_per_page={items_per_page}&category={category}"
fh_headers = {"Authorization": api_key, "Accept": "application/json"}  # note the auth method; fh stands for filing history

# response processing
response = requests.get(url, headers=fh_headers)
response.raise_for_status()  # stop here if the request failed
response_json = response.json()
response_table = pd.json_normalize(response_json.get("items", []))  # transform the JSON data into a table (pandas DataFrame)

# show the response as a table, without truncating long cell values
pd.set_option('display.max_colwidth', None)
response_table
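
The query string above is assembled by hand inside an f-string. As a minimal sketch, the same URL can be built from a dictionary of parameters with the standard library, which also takes care of encoding. The helper name `build_filing_history_url` is hypothetical and not part of any Companies House client; the host and parameter names are the ones already used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.company-information.service.gov.uk"  # same host as in the cell above


def build_filing_history_url(company_number, category, items_per_page=1):
    """Hypothetical helper: build the filing-history URL used in this notebook."""
    query = urlencode({"items_per_page": items_per_page, "category": category})
    return f"{BASE_URL}/company/{company_number}/filing-history?{query}"


url = build_filing_history_url("00445790", "accounts")
print(url)
```

`requests.get` also accepts a `params=` dictionary, which achieves the same result without building the string yourself.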



The field `links.document_metadata` is what matters most.  
It holds a link for retrieving the *document object metadata*, which in turn includes the __link to the PDF file__.  

As documented in the [API reference](https://developer-specs.company-information.service.gov.uk/), the correct way to handle the _document object metadata_ is through another Companies House API product called the [Document API](https://developer-specs.company-information.service.gov.uk/document-api/reference).  
Until now, we have used the **Companies House Public Data API**, whose goal is to show public company data, not documents.

In [None]:
# Let's store the link to the document metadata in a variable
document_metadata_link = response_table["links.document_metadata"][0]

# show document_metadata_link
document_metadata_link
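
`pd.json_normalize` flattens the nested `links` object of each filing into dotted column names, which is why the column is called `links.document_metadata`. The sketch below reads the same value straight from the parsed JSON, without pandas. The sample payload is made up for illustration (including the placeholder link), and `first_document_metadata_link` is a hypothetical helper name.

```python
def first_document_metadata_link(filing_history):
    """Return links.document_metadata of the first item, or None if it is absent."""
    items = filing_history.get("items") or []
    if not items:
        return None
    return items[0].get("links", {}).get("document_metadata")


# Made-up payload with the same shape as the fields used in this notebook.
sample = {
    "items": [
        {
            "category": "accounts",
            "links": {"document_metadata": "https://example.invalid/document/abc123"},
        }
    ]
}

first_link = first_document_metadata_link(sample)
print(first_link)
print(first_document_metadata_link({"items": []}))
```

Returning `None` instead of raising is a design choice here, so the caller can decide what to do when a record carries no such link.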

Now we will use the **Document API** to download the document.  

<span style="color:red">Note that the following `requests.get` calls do not use the _API_KEY_ authentication method. Whilst many examples on the internet use the _API_KEY_ method for the Public Data API, for the **Document API** call we **must** use the _USER_ authentication method.</span>
    
<span style="color:red">Normally the _USER_ authentication method expects a _username_ and a _password_, but the [authorisation reference](https://developer-specs.company-information.service.gov.uk/guides/authorisation) says to supply the API key as the _username_ and leave the _password_ blank. See the comments for the variable `user_authorisation` below. The call's headers will also be different, as we do not set an `Authorization` header ourselves.</span>

<span style="color:red">To get to the actual document, there is an additional step. We need to extract the PDF download link from the _document object_ and then start the download.</span>
    
<span style="color:red">To do so, we make a second call using the download link. Note that the header for this call is set to accept a PDF document rather than a JSON object as before.</span>
    
<span style="color:red">The document is held in a variable and subsequently written to a file on disk.</span>
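
As an illustration of the _USER_ method described above: passing `auth=(api_key, "")` makes `requests` use HTTP Basic authentication, which sends an `Authorization` header containing the Base64 encoding of `username:password`. The sketch below reproduces that encoding with the standard library only. `basic_auth_header` is a hypothetical helper and `"my_api_key"` is a placeholder, not a real key; the sketch assumes the key contains ASCII characters only.

```python
import base64


def basic_auth_header(username, password=""):
    """Build the value of an HTTP Basic 'Authorization' header."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header_value = basic_auth_header("my_api_key")  # placeholder key, blank password
print(header_value)

# Decoding the part after "Basic " shows the key followed by a colon and an empty password.
decoded = base64.b64decode(header_value.split(" ", 1)[1]).decode("utf-8")
print(decoded)
```

This is why no `Authorization` entry appears in the headers dictionary of the next cell: the `auth` argument produces it.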

In [None]:
# the following is a tuple with this structure:
# (username, password).
# It is filled in with api_key as the username and "" as the password.
user_authorisation = (api_key, "")

# retrieve the document object (metadata)
document_object_response = requests.get(document_metadata_link, auth=user_authorisation)
document_object_response.raise_for_status()

# retrieve the download link for the pdf file from document_object_response
document_download_link = document_object_response.json()["links"]["document"]

# note the different request header: we now ask for a PDF rather than JSON
pdf_headers = {"Accept": "application/pdf"}

# store the pdf response in a variable
pdf_document = requests.get(document_download_link, auth=user_authorisation, headers=pdf_headers)
pdf_document.raise_for_status()  # make sure the download succeeded before reporting success

# when the download is complete, print the following message:
print('All done. Tesco document ready to be saved to a system file.')


Save the document to "tesco.pdf", in the same folder as this notebook.

In [None]:
with open('tesco.pdf', 'wb') as f:
    f.write(pdf_document.content)  # the with block closes the file automatically

print('tesco.pdf saved to file.')
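
Because the response body is written to disk as-is, an error page would also end up saved under a `.pdf` name. As a small optional safeguard, the sketch below checks that the bytes start with the `%PDF` marker that PDF files normally begin with before writing them. `save_pdf` is a hypothetical helper, and the demo uses made-up bytes rather than a real download.

```python
from pathlib import Path
import tempfile


def save_pdf(content, path):
    """Write content to path only if it looks like a PDF; return bytes written."""
    if not content.startswith(b"%PDF"):
        raise ValueError("content does not start with the %PDF marker")
    return Path(path).write_bytes(content)


# Demo with made-up bytes; in the notebook you would pass pdf_document.content.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "tesco.pdf"
    written = save_pdf(b"%PDF-1.7 fake body", target)
    saved_ok = target.read_bytes().startswith(b"%PDF")

# Anything that is not a PDF is rejected before a file is created.
try:
    save_pdf(b"<html>error</html>", "not_written.pdf")
    rejected = False
except ValueError:
    rejected = True

print(written, saved_ok, rejected)
```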