# Accessing PHE Covid Data

[Public Health England](https://www.gov.uk/government/organisations/public-health-england) (PHE) is one of many institutions worldwide running a Coronavirus [dashboard](https://coronavirus.data.gov.uk/) with current statistics on the pandemic. In this series of notebooks, we will guide you through creating your own simple dashboard based on PHE data and putting it online as a [Binder](https://mybinder.org/).

Start by clicking on *Developer's<sup>1</sup> Guide* in the dashboard menu, which will take you straight to the [API documentation](https://coronavirus.data.gov.uk/details/developers-guide/main-api). This will be your main source of information for this project as far as interaction with the PHE servers is concerned.

<sup>1</sup> *Developer*: that's you!

## The web-based API and the SDK

Many websites support access to their underlying data through a web-based [Application Programming Interface](https://en.wikipedia.org/wiki/API) (API for short). This is often based on the *HTTP* protocol, and may involve the exchange of information in [JSON format](https://en.wikipedia.org/wiki/JSON). Specifically, using a web-based API typically involves sending *HTTP* requests with parameters conforming to a given schema to a dedicated URL (the API *endpoint*), to which the server responds with JSON content. All this is specified in the PHE [API documentation](https://coronavirus.data.gov.uk/details/developers-guide/main-api).
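
To make this concrete, the sketch below builds (but does not send) the kind of URL such a request might use. The endpoint address, the parameter names `filters` and `structure`, and the semicolon-separated filter syntax are assumptions based on my reading of the API documentation; `build_query_url` is a hypothetical helper, not part of any PHE library.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint; check the API documentation for the current address.
ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"

def build_query_url(filters, structure, endpoint=ENDPOINT):
    """Return a GET URL carrying the filters and structure as query parameters."""
    params = {
        # assumed convention: metric=value pairs joined by semicolons
        "filters": ";".join(filters),
        # the structure travels as a compact JSON string
        "structure": json.dumps(structure, separators=(",", ":")),
    }
    # urlencode percent-escapes characters that are not allowed in a URL
    return endpoint + "?" + urlencode(params)

url = build_query_url(
    ["areaType=nation", "areaName=England"],
    {"date": "date", "cases": "newCasesBySpecimenDateRollingRate"},
)
print(url)

# Decoding the query string recovers the original parameters
query = parse_qs(urlsplit(url).query)
print(query["filters"][0])
print(json.loads(query["structure"][0]))
```

Nothing is sent over the network here; the point is only to show that a request is, in the end, a URL with carefully encoded parameters.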

As you can see from the documentation, PHE also offers a Python Software Development Kit (SDK). This is a Python wrapper that facilitates access to the API by building the requests for you, forwarding them to the API endpoint and packaging the response into some convenient format. The PHE [Python SDK](https://pypi.org/project/uk-covid19/) is found on the standard Python package repository, PyPI.

The first step, therefore, consists in installing the Python SDK. This will depend on how you are accessing these materials.

* If you are working in a Binder: the SDK is already installed for you.
* On the EECS JupyterHub: Save your notebooks and close them. Open a terminal and enter the following command (note the double minus ```--```):
```
pip install --user uk-covid19
```
You can then close the terminal (type ```exit``` and close the tab). Finally, go to *Control Panel > Stop my Server*, start the server again and reopen your notebooks.
* On your local machine: this depends on your setup. Entering this line in a command shell will hopefully work:
```
pip install uk-covid19
```
If it doesn't, try searching the documentation of your Python distribution or Google for instructions on how to install PyPI packages on your machine.

If you have successfully installed the SDK in your environment, the following cell should work:

In [62]:
# note the name of the module has an underscore in place of the -
from uk_covid19 import Cov19API
import json

## Accessing the API through the SDK

We are now ready to download data from the server via the SDK. The SDK documentation is currently (Oct 2023) unavailable; refer to the general [API documentation](https://coronavirus.data.gov.uk/details/developers-guide/main-api) for the meaning of the main parameters. The examples below follow the documentation closely.

According to the documentation, the first step is defining a *filter* - this is a **list** specifying an *areaType* parameter and optional *areaName*, *areaCode* and *date* parameters. In the jargon of the documentation, these parameters are called *metrics* - check out the metrics allowed for filters [here](https://coronavirus.data.gov.uk/details/developers-guide/main-api#params-filters).

In [63]:
filters = [
    'areaType=nation', # note each metric-value pair is inside one string
    'areaName=England'
]
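
If you find yourself writing many such lists, a small helper can assemble the strings for you. The sketch below is only an illustration: `make_filters` is a hypothetical function of my own, not part of the SDK, and it simply enforces the rule stated above that *areaType* is required.

```python
def make_filters(**criteria):
    """Turn keyword arguments into the list of 'metric=value' strings used as filters."""
    if "areaType" not in criteria:
        raise ValueError("a filter needs at least an areaType")
    # skip entries left as None, so optional parameters can simply be omitted
    return [f"{metric}={value}" for metric, value in criteria.items() if value is not None]

filters_example = make_filters(areaType="nation", areaName="England")
print(filters_example)  # ['areaType=nation', 'areaName=England']
```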

Next, you want to define a *structure*. According to the SDK documentation, this is a **dictionary** that specifies which data fields you want to request. In fact, looking at the API documentation for [structures](https://coronavirus.data.gov.uk/details/developers-guide/main-api#params-structure) shows that they do more: they also specify the "format" in which you want to receive the response. Read the documentation carefully, including the list of valid *metrics* for a *structure* (which is in fact the most important part).

The structure below selects the metrics (that is, the PHE database fields) given as values of the dictionary, and at the same time instructs the API to rename the fields to the simpler names given as keys before serving them to us.

In [64]:
# values here are the names of the PHE metrics
structure = {
    "date": "date",
    "cases": "newCasesBySpecimenDateRollingRate",
    "hospital": "newAdmissionsRollingRate",
    "deaths": "newDailyNsoDeathsByDeathDate" 
}

**NOTE:** As of version 1.2.0 of the SDK, keys in the ```structure``` dictionary are apparently restricted to alphabet letters. Spaces, digits and underscores seem to cause the query to fail and should be avoided.
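
A quick check before sending the query can flag keys that might cause trouble. This is a minimal sketch based only on the observation above; `risky_keys` is a hypothetical helper, and it is deliberately conservative, since the restriction is inferred from experiments rather than documented.

```python
def risky_keys(structure):
    """Return the keys that contain anything other than plain ASCII letters."""
    return [key for key in structure if not (key.isascii() and key.isalpha())]

candidate = {
    "date": "date",
    "new_cases": "newCasesBySpecimenDateRollingRate",  # underscore
    "tests2": "newTestsByPublishDate",                  # digit
}
print(risky_keys(candidate))  # ['new_cases', 'tests2']
```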

Accessing the API is easy at this point. You just need to create a ```Cov19API``` object by passing the filters and structure to its constructor, as follows:

In [65]:
api = Cov19API(filters=filters, structure=structure)

Finally, calling the ```get_json()``` method of the ```api``` object actually sends the request to the API and retrieves the response (theoretically) in JSON format.

In [66]:
# NOTE: this call polls the server. It may fail in case of connectivity problems or if the data
# are not available for any reason. It will also fail if the metrics in the structure are not compatible
# with the filters (e.g. they are not defined at the national or local level).
timeseries = api.get_json()

In [67]:
# you may want to collapse/clear the output of this cell after viewing it, see the Edit and View menus
print(timeseries)

{'data': [{'date': '2023-12-02', 'cases': 7.3, 'hospital': None, 'deaths': None}, {'date': '2023-12-01', 'cases': 7.2, 'hospital': 3.6, 'deaths': None}, {'date': '2023-11-30', 'cases': 7.3, 'hospital': 3.6, 'deaths': None}, {'date': '2023-11-29', 'cases': 7.1, 'hospital': 3.6, 'deaths': None}, {'date': '2023-11-28', 'cases': 7.0, 'hospital': 3.5, 'deaths': None}, {'date': '2023-11-27', 'cases': 6.8, 'hospital': 3.4, 'deaths': None}, {'date': '2023-11-26', 'cases': 6.9, 'hospital': 3.4, 'deaths': None}, {'date': '2023-11-25', 'cases': 6.9, 'hospital': 3.3, 'deaths': None}, {'date': '2023-11-24', 'cases': 7.0, 'hospital': 3.3, 'deaths': None}, {'date': '2023-11-23', 'cases': 6.9, 'hospital': 3.3, 'deaths': None}, {'date': '2023-11-22', 'cases': 7.1, 'hospital': 3.3, 'deaths': None}, {'date': '2023-11-21', 'cases': 7.1, 'hospital': 3.4, 'deaths': None}, {'date': '2023-11-20', 'cases': 7.5, 'hospital': 3.4, 'deaths': None}, {'date': '2023-11-19', 'cases': 7.5, 'hospital': 3.5, 'deaths': No

In [68]:
print(type(timeseries)) # hold on, this is not JSON!

<class 'dict'>


As you can see, ```get_json()``` is a bit of a misnomer - the function actually returns a dictionary containing nested lists and dictionaries. As we will see, this is not too different from what a JSON string looks like, but technically it is something different. All the better for us: we don't need to decode it - but we may actually want to encode it, see below.

Also, some entries contain a value of ```None```; we will assume that this stands for a ```0``` rather than for "not available". In any case, by and large we got our data, so this is a success.
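
Should you want to make that assumption explicit in code, the following sketch replaces every ```None``` with a default value. The helper `fill_missing` is hypothetical, and the sample records are copied from the output above.

```python
def fill_missing(records, default=0):
    """Return a copy of the records with every None value replaced by default."""
    return [
        {field: (default if value is None else value) for field, value in record.items()}
        for record in records
    ]

sample = {'data': [
    {'date': '2023-12-02', 'cases': 7.3, 'hospital': None, 'deaths': None},
    {'date': '2023-12-01', 'cases': 7.2, 'hospital': 3.6, 'deaths': None},
]}
cleaned = fill_missing(sample['data'])
print(cleaned[0])  # {'date': '2023-12-02', 'cases': 7.3, 'hospital': 0, 'deaths': 0}
```

The original records are left untouched, so you can still tell missing values from genuine zeros if you change your mind later.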

### Another example: Cases by gender and age

The example above lends itself to visualisation as a plot of cases, hospital admissions and fatalities vs time. In this example, instead, we investigate the distribution of cases by sex and age bands; eventually, we will plot this as a bar chart.

Again we define our *filters* and *structure*, but with different *metrics*, as follows: 

In [69]:
filters = [
    'areaType=nation',
    'areaName=England'
]


# values here are the names of the PHE metrics
structure = {
    "males": "maleCases",
    "females": "femaleCases"
}

The next two steps are standard:

In [70]:
api = Cov19API(filters=filters, structure=structure)

In [71]:
# NOTE: this call polls the server. It may fail in case of connectivity problems or if the data
# are not available for any reason. It will also fail if the metrics in the structure are not compatible
# with the filters (e.g. they are not defined at the national or local level).
agedistribution = api.get_json()

In [72]:
# you may want to collapse/clear the output of this cell after viewing it, see the Edit and View menus
print(agedistribution)

{'data': [{'males': [{'age': '50_to_54', 'rate': 33505.8, 'value': 640403}, {'age': '85_to_89', 'rate': 29871.5, 'value': 106397}, {'age': '70_to_74', 'rate': 20954.7, 'value': 281616}, {'age': '55_to_59', 'rate': 30693.8, 'value': 568631}, {'age': '5_to_9', 'rate': 32167.4, 'value': 583633}, {'age': '30_to_34', 'rate': 41152.1, 'value': 788643}, {'age': '60_to_64', 'rate': 28358.5, 'value': 444800}, {'age': '10_to_14', 'rate': 46560.2, 'value': 820332}, {'age': '20_to_24', 'rate': 38503.7, 'value': 689872}, {'age': '25_to_29', 'rate': 39891.3, 'value': 767675}, {'age': '0_to_4', 'rate': 14321.7, 'value': 238068}, {'age': '75_to_79', 'rate': 23572.4, 'value': 220184}, {'age': '45_to_49', 'rate': 36269.6, 'value': 654016}, {'age': '90+', 'rate': 41046.8, 'value': 69594}, {'age': '80_to_84', 'rate': 23267.9, 'value': 148919}, {'age': '15_to_19', 'rate': 40088.7, 'value': 642002}, {'age': '35_to_39', 'rate': 40039.2, 'value': 741914}, {'age': '40_to_44', 'rate': 41158.1, 'value': 712145},

Again this is a dictionary and it does contain the data we expected. Note however how the formatting and the type of data depend, maybe unsurprisingly, on the specific *metric* we queried.
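
Notice also that the age bands do not arrive in any particular order. One possible way of sorting them, sketched below, is to use the lower bound of each band as the sort key; `band_start` is a hypothetical helper, and the sample entries are a few of those printed above.

```python
def band_start(band):
    """Lower bound of an age band label such as '5_to_9' or '90+'."""
    return int(band.replace("+", "").split("_")[0])

sample_males = [
    {'age': '50_to_54', 'rate': 33505.8, 'value': 640403},
    {'age': '5_to_9', 'rate': 32167.4, 'value': 583633},
    {'age': '90+', 'rate': 41046.8, 'value': 69594},
    {'age': '0_to_4', 'rate': 14321.7, 'value': 238068},
]
ordered = sorted(sample_males, key=lambda entry: band_start(entry['age']))
print([entry['age'] for entry in ordered])  # ['0_to_4', '5_to_9', '50_to_54', '90+']
```

Sorting the labels as plain strings would not work here, since '50_to_54' would come before '5_to_9'.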

## Saving the data in JSON format

At this point, we want to save the result of our API queries in order to 
* have something definite to work on in the other notebooks
* eventually, give our dashboard some starting data.

The question is how to save these dictionaries to disk. Luckily we do not have to save them in a bespoke way at this stage - we can use the [json module](https://docs.python.org/3/library/json.html) in the standard library to dump them as they are in [JSON format](https://en.wikipedia.org/wiki/JSON). This is straightforward:

In [73]:
import json

In [74]:
with open("timeseries.json", "wt") as OUTF:
    json.dump(timeseries, OUTF)

In [75]:
with open("agedistribution.json", "wt") as OUTF:
    json.dump(agedistribution, OUTF)

If you now open the files with a text editor (or the Jupyter Notebook interface), you will see that their content closely resembles the tangle of dictionaries and lists we have seen above. Technically, however, these are no longer Python dictionaries and lists but their JSON representation, which could be opened by a program written in another language and mapped to an equivalent data structure (whichever that language provides).
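
The difference is easiest to see on a small example. The sketch below writes a made-up one-record dictionary to a temporary file (so that none of your data files are touched), then reads it back both as raw text and as decoded Python objects.

```python
import json
import os
import tempfile

record = {'data': [{'date': '2023-12-02', 'cases': 7.3, 'hospital': None}]}

# work in a temporary directory so that no real data file is overwritten
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.json")
    with open(path, "wt") as outf:
        json.dump(record, outf)
    with open(path, "rt") as inf:
        raw_text = inf.read()  # the JSON text exactly as stored on disk

restored = json.loads(raw_text)  # decoded back into Python objects

print(raw_text)            # {"data": [{"date": "2023-12-02", "cases": 7.3, "hospital": null}]}
print(restored == record)  # True
```

Note the double quotes and the ```null``` in the file: that is JSON syntax, not Python, yet decoding it gives back a dictionary equal to the one we started from.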

## Your turn

Explore the various *metrics* available for [*filters*](https://coronavirus.data.gov.uk/details/developers-guide/main-api#filters-multiple_params) and [*structures*](https://coronavirus.data.gov.uk/details/developers-guide/main-api#structure-metrics) and think of a query that may be of interest to you, and how you might then want to visualise the data. You can modify either the *structure*, in order to select different types of data, or the *filters*, to specify a different granularity (national or local level, specific dates, etc). Possible graphs of interest might be:
* a comparison of the number of tests carried out with planned testing capacity;
* a comparison of the number of new cases with the number of tests;
* the above, broken down by region;
* a comparison of the number of cases with hospital admissions;
* a comparison of hospital admissions with ventilator bed occupancy;
* new cases in major cities as a fraction of the population of those cities;
* a comparison of the age distribution of new cases at different times (this will require more than one API access)

Please keep in mind the following points:
* Not all metrics are available for all dates, or at all levels of granularity; querying for data that's unavailable will result in an error.
* Documentation is poor - welcome to the real world. A BSc in Reverse Engineering would come in handy.
* Experimenting is fine. However, avoid flooding the server with multiple queries at machine speed - if you use a ```for``` loop to generate API accesses, use the ```sleep()``` function from the ```time``` module to introduce a 1 second delay between one query and the next (see [here](https://docs.python.org/3/library/time.html#time.sleep) for details). The last thing you want is for PHE to ban you.
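
The last point can be sketched as follows. Here `fetch_all` and `fake_fetch` are hypothetical names: the stand-in function lets the example run without contacting the server, and the short delay is only there to keep the demonstration quick. With the real SDK you would pass a function that builds a ```Cov19API``` object and calls ```get_json()```, and keep the delay at 1 second or more.

```python
import time

def fetch_all(queries, fetch, delay=1.0):
    """Call fetch(query) for each query, pausing delay seconds between calls."""
    results = []
    for i, query in enumerate(queries):
        if i > 0:
            time.sleep(delay)  # be polite: never fire requests back to back
        results.append(fetch(query))
    return results

# Stand-in for a real API call, so the sketch runs without touching the server
def fake_fetch(area_name):
    return {"areaName": area_name, "data": []}

results = fetch_all(["England", "Scotland", "Wales"], fake_fetch, delay=0.1)
print([r["areaName"] for r in results])  # ['England', 'Scotland', 'Wales']
```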

Once you succeed in retrieving the data you want, save them in JSON format and move on to the next stage - visualisation.


**NOTE:** In the visualisation stage, we are going to load the data into a ```pandas.DataFrame``` structure. As you will see from the documentation, the SDK contains a method ```get_dataframe()``` that allows you to retrieve the data in the form of a ```DataFrame``` directly. From my tests, however, this method only works with certain *metrics*, and returns gibberish in other cases. I would therefore advise you to stick with JSON. In any case, JSON is a de-facto standard for APIs, and some familiarity with this format is a valuable skill in itself.

**NOTE:** As of 2023, a variety of *metrics* have been added to *structure*, but not all of them are defined for all area types, and some are no longer updated. You can check when a metric was last updated via a [latest_by](https://coronavirus.data.gov.uk/details/developers-guide/main-api#params-latestby) request as below:

In [76]:
filt = ['areaType=nation', 'areaName=England']
struct = {'query': 'newAdmissionsRollingRate'} # the metric you are querying must be in the structure
Cov19API(filters=filt, structure=struct, latest_by=struct['query']).get_json()

{'data': [{'query': 3.6}],
 'lastUpdate': '2023-12-07T18:00:02.000000Z',
 'length': 1,
 'totalPages': None}
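
The ```lastUpdate``` field is a plain string. If you want to compute with it, for example to see how stale the data are, it can be converted to a ```datetime``` object. The sketch below assumes the timestamp format seen in the output above and uses an arbitrary reference date for illustration.

```python
from datetime import datetime, timezone

response = {'data': [{'query': 3.6}],
            'lastUpdate': '2023-12-07T18:00:02.000000Z',
            'length': 1,
            'totalPages': None}

# The trailing 'Z' marks UTC; it is matched literally and the timezone is set by hand
last_update = datetime.strptime(
    response['lastUpdate'], "%Y-%m-%dT%H:%M:%S.%fZ"
).replace(tzinfo=timezone.utc)

reference = datetime(2023, 12, 31, tzinfo=timezone.utc)  # arbitrary date, for illustration
print(last_update.date(), (reference - last_update).days)  # 2023-12-07 23
```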

I would like to compare the number of new COVID-19 cases with the number of tests conducted in England, using the data provided. First, referring to the data source at https://coronavirus.data.gov.uk/, we can find metrics related to "the number of new COVID-19 cases" and "the number of tests conducted in England".

In [77]:
# Import necessary libraries
from uk_covid19 import Cov19API

# Define filters and structure for new cases and tests
comparison_filters = [
    'areaType=nation',
    'areaName=England'
]

comparison_structure = {
    "date": "date",
    "new_cases": "newCasesBySpecimenDateRollingRate",
    "tests": "newTestsByPublishDate"
}

# Access the API for the comparison
comparison_api = Cov19API(filters=comparison_filters, structure=comparison_structure)
comparison_data = comparison_api.get_json()

# Extract relevant data
dates = [entry['date'] for entry in comparison_data['data']]
new_cases = [entry['new_cases'] for entry in comparison_data['data']]
tests = [entry['tests'] for entry in comparison_data['data']]

# Calculate ratio, handling None values
for date, new_case, test in zip(dates, new_cases, tests):
    if new_case is not None and test is not None and test != 0:
        ratio = new_case / test
        print(f"Date: {date}, New Cases: {new_case}, Tests: {test}, Ratio: {ratio}")
    else:
        print(f"Date: {date}, New Cases: {new_case}, Tests: {test}, Ratio: N/A (Data not available)")


Date: 2023-12-02, New Cases: 7.3, Tests: None, Ratio: N/A (Data not available)
Date: 2023-12-01, New Cases: 7.2, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-30, New Cases: 7.3, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-29, New Cases: 7.1, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-28, New Cases: 7.0, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-27, New Cases: 6.8, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-26, New Cases: 6.9, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-25, New Cases: 6.9, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-24, New Cases: 7.0, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-23, New Cases: 6.9, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-22, New Cases: 7.1, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-21, New Cases: 7.1, Tests: None, Ratio: N/A (Data not available)
Date: 2023-11-20, New Cases: 7.5, Tests: None, Ratio

In [78]:
import json


# Define the filename for the JSON file
json_filename = "comparison_data.json"

# Save data as JSON
with open(json_filename, 'w') as json_file:
    json.dump(comparison_data, json_file)

print(f"Data successfully saved to {json_filename}")

Data successfully saved to comparison_data.json


**(C) 2020,2023 Fabrizio Smeraldi** ([f.smeraldi@qmul.ac.uk](mailto:f.smeraldi@qmul.ac.uk) - [web](http://www.eecs.qmul.ac.uk/~fabri/)). This notebook is released under the [GNU GPLv3.0 or later](https://www.gnu.org/licenses/).