# Reuse of Curated Chemical Data

```{dropdown} About this interactive ![icons](../static/img/rocket.png) recipe
- Author(s): [Stuart Chalk](https://orcid.org/0000-0002-0703-7776)
- Reviewer(s): 
- Topic(s): Data reuse, 
- Format(s): Interactive Jupyter Notebook (Python)
- Scenario(s): How can I work with open data to reuse it in my research?
- Skill(s): You should be familiar with
    - [GitHub](https://docs.github.com/en/get-started/quickstart/hello-world)
    - [CSV Files](https://b-greve.gitbook.io/beginners-guide-to-clean-data/common-csv-problems)
    - [Django Framework (Python)](https://docs.djangoproject.com/en/5.0/)
- Learning outcomes: After completing this example you should understand:
    - How to judge the quality of open data
    - How to work with CSV files to import data into Python
    - How to save data in a database
    - How to use the Django web framework to create a data website
- Citation: 'Reuse of Curated Chemical Data', The IUPAC FAIR Chemistry Cookbook, 
- Reuse: This notebook is made available under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
```

One of the advantages of FAIR data is that researchers are now making more data available, and in simpler, more usable formats than ever before.  
As the chemistry community sees more data becoming available, there will start to be an expectation that researchers make their own data 
available. This enables reuse of the data and may provide insights that otherwise might never be found (especially in large aggregated sets).

This menu (multistep workflow) is presented in two sections.  The first describes the creation of the IUPAC Dissociation Constant (IDC) dataset found at
[https://github.com/IUPAC/Dissociation-Constants](https://github.com/IUPAC/Dissociation-Constants).  The second takes the data found in the IDC dataset 
and reuses it to create a data website using the Django (Python) web framework.  The idea is to show what it took to create and publish the IDC and 
subsequently what can be done with the data to reuse it.

## The IUPAC Dissociation Constant Dataset

```{note}
The description of how the IDC was created is taken from the excellent discussion that Jonathan Zheng, the data creator, provides on the readme page of the 
repository where the data is available.  This author greatly appreciates the detail that is provided in the readme.  Also, thanks to Ye Li at MIT who 
provided information about the timeframe and workflow of the project.
```

The IUPAC Dissociation Constant (IDC) dataset was initiated as a project when [Jonathan Zheng](https://greengroup.mit.edu/jonathan-zheng),
a graduate student in chemical engineering at MIT, contacted [Ye Li](https://libguides.mit.edu/profiles/yel), an MIT librarian and past chair of the American Chemical Society (ACS) [Division of Chemical Information](https://www.acscinf.org/).  Subsequently, [Leah McEwen](https://iupac.org/member/leah-r-mcewen/), Chair of the IUPAC Committee on Publications and Cheminformatics Data Standards ([CPCDS](https://iupac.org/body/024/)), was brought in to give guidance from the IUPAC perspective.

### Original Source Documents
In the 1960s IUPAC started to publish compilations of chemical data, which eventually led to the development of the "[IUPAC Chemical Data Series](https://doi.org/10.1351/pac197749010125)".  
Some of these volumes contain acid and base dissociation constant data and, as a result, became of interest to Jonathan Zheng as part of his Ph.D.
Jonathan received permission from IUPAC to digitize three volumes (below) in December 2021, and less than a year later the aggregated dataset was made 
available on [GitHub](https://github.com/IUPAC/Dissociation-Constants).

- IUPAC, D.D. Perrin. Dissociation Constants of Organic Bases in Aqueous Solution; Butterworths, 1965; ISBN: 9780408891714
- IUPAC, D.D. Perrin. Dissociation Constants of Organic Bases in Aqueous Solution, Supplement; Butterworths, 1972; ISBN: 9780408704083
- IUPAC, E.P. Serjeant, B. Dempsey. Ionisation Constants of Organic Acids in Aqueous Solution; Oxford/Pergamon, 1979 (v23, IUPAC Chemical Data Series); ISBN: 9780080223391

### Extracting the Data
With manual extraction of the dissociation constant data being impractical, Jonathan turned to the established technology of optical character recognition
([OCR](https://en.wikipedia.org/wiki/Optical_character_recognition)). Jonathan used the paid Amazon Web Services (AWS) [Textract](https://aws.amazon.com/pm/textract/) platform to convert the scanned pages into text and subsequently data tables, and additional data processing was done in Python using unpaper, OCRmyPDF, camelot, and tabula. Note: If users do not have access to AWS Textract, there are a number of [OCR software packages/services](https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software), including the well-regarded free [Tesseract](https://github.com/tesseract-ocr/tesseract) software, which can be used from Python via the [pytesseract](https://github.com/madmaze/pytesseract) package. 
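
As a rough illustration of the free route mentioned in the note above, the sketch below runs Tesseract on a scanned page and splits a line of the resulting text into table cells. It is not the workflow used for the IDC (which relied on AWS Textract). The function names `ocr_page` and `split_columns` and the sample line are hypothetical, and the code assumes the Tesseract binary plus the `pytesseract` and `Pillow` packages are installed.

```python
import re


def ocr_page(image_path):
    """Run Tesseract OCR on one scanned page image and return its text.

    Assumes the Tesseract binary, pytesseract and Pillow are installed.
    They are imported here so that split_columns() works without them.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def split_columns(line):
    """Split one line of OCR text into cells on runs of two or more spaces."""
    cells = re.split(r"\s{2,}", line.strip())
    return [cell.strip() for cell in cells if cell.strip()]


# Hypothetical line of OCR output from a tabulated page
sample = "Acetic acid    25    4.756"
print(split_columns(sample))
```

Splitting on runs of whitespace is only a starting point. Real scanned tables usually need the kind of checks described under Data Quality Assurance below before the cells can be trusted.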

Chemaxon [Molconvert](https://docs.chemaxon.com/display/docs/molconvert.md) was used to translate names of chemical substances into chemical structures, a very important step to provide an accurate dataset.  In addition, IUPAC name to SMILES translation was aided by [OPSIN](https://opsin.ch.cam.ac.uk), [PubChem](https://pubchem.ncbi.nlm.nih.gov/), and the [Chemical Identifier Resolver](https://cactus.nci.nih.gov/chemical/structure). More details can be found [here](https://github.com/IUPAC/Dissociation-Constants#methodological-information).
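
To give a feel for name-to-structure translation, here is a minimal sketch that asks the Chemical Identifier Resolver for the SMILES of a chemical name. It is an illustration only, not the script used in the IDC project. The helper names are hypothetical, the `/<name>/smiles` URL pattern is assumed from the resolver's public interface, and the lookup itself needs network access.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL of the Chemical Identifier Resolver (linked in the text above)
CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def cir_smiles_url(name):
    """Build the resolver URL that returns SMILES for a chemical name."""
    # safe="" also percent-encodes "/" so the name stays one path segment
    return f"{CIR_BASE}/{quote(name, safe='')}/smiles"


def name_to_smiles(name, timeout=10):
    """Fetch the SMILES string for a name (network access required).

    urlopen raises urllib.error.HTTPError if the name cannot be resolved.
    """
    with urlopen(cir_smiles_url(name), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


# Only the URL is built here, so this line runs without a network connection
print(cir_smiles_url("acetic acid"))
```

Because different resolvers can disagree on a name, it is worth comparing the output of two or more services before accepting a structure.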

### Data Quality Assurance
If automation (via web services or programming languages) is used to process data, it is important to create a set of quality control criteria that can be implemented in scripts to apply common-sense chemical criteria and boundaries and so detect inconsistencies in the data.  This is especially needed when the data formats (data models) of the sources are not the same (as in this case).  Criteria generally revolve around making specific columns of data 'normalized', that is, self-consistent as far as possible. This might involve checking for the presence/absence of specific characters, identifying values in numeric columns that are not written as numbers, cleaning chemical names and normalizing them to the IUPAC name (or the InChI/InChIKey), or checking for a specific format of data in a column.  Many if not all of these criteria can be implemented using the [regular expression (regex)](https://www.regular-expressions.info/) functionality built into most programming languages.  Learning regex is an important part of doing data science and producing high quality data, and a great way to work with your data is to develop regex strings using the [https://regex101.com/](https://regex101.com/) website. The QA that was done in this work is described [here](https://github.com/IUPAC/Dissociation-Constants#quality-assurance).
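
Two of the checks described above can be sketched with Python's built-in `re` module: one flags values in a numeric column that are not written as numbers, the other checks that an InChIKey has the standard layout (14 letters, a hyphen, 10 letters, a hyphen, 1 letter). The column names `pka_value` and `inchikey` and the sample rows are assumptions made for this illustration. They are not the actual IDC field names or QA rules.

```python
import re

# A plain decimal number, optionally negative (e.g. "4.756" or "-1.3")
NUMBER = re.compile(r"-?\d+(\.\d+)?")
# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter
INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def check_row(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    if not NUMBER.fullmatch(row["pka_value"].strip()):
        problems.append("pka_value is not written as a number")
    if not INCHIKEY.fullmatch(row["inchikey"].strip()):
        problems.append("inchikey does not have the standard layout")
    return problems


# Hypothetical rows; the second mimics an OCR slip ("S" read for "5")
rows = [
    {"pka_value": "4.756", "inchikey": "QTBSBXVTEAMEQO-UHFFFAOYSA-N"},
    {"pka_value": "4.7S6", "inchikey": "QTBSBXVTEAMEQO-UHFFFAOYSA"},
]
for row in rows:
    print(check_row(row))
```

Using `fullmatch` rather than `match` means the whole cell must fit the pattern, so trailing debris is caught as well as leading debris.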

### Describing the Dataset
When deciding to make a dataset available to the community, one area that must be addressed is how you are going to provide documentation about the data so that users can understand it (and subsequently reuse it).  This is best done by describing each column of data in terms of what the data is (and is not), what datatype it is (strings, numbers, dates, etc.), its cardinality, whether values come from a controlled vocabulary (e.g. an enumerated list) or are free form, whether the entries are unique ids, and whether there are any range limitations. In addition, if there are errors with the data they should be captured in a separate column, and if the values could have different units, a separate unit column is needed.  There is a detailed description of the fields of this dataset [here](https://github.com/IUPAC/Dissociation-Constants#data-specific-information).
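
One way to make such a column description actionable is to write it down as a small machine-readable data dictionary and check records against it. The sketch below shows the idea under stated assumptions: the column names, the controlled vocabulary, and the numeric ranges are all hypothetical and are not taken from the IDC documentation.

```python
# Hypothetical data dictionary: one entry per column, recording datatype,
# whether a value is required, any controlled vocabulary, and any range.
DATA_DICTIONARY = {
    "compound_id": {"type": str, "required": True},
    "pka_type": {"type": str, "required": True, "allowed": {"pKa", "pKb"}},
    "pka_value": {"type": float, "required": True, "range": (-20.0, 40.0)},
    "temperature_c": {"type": float, "required": False, "range": (-50.0, 200.0)},
}


def validate(record):
    """Return a list of messages describing how a record breaks the rules."""
    errors = []
    for column, rule in DATA_DICTIONARY.items():
        value = record.get(column)
        if value is None:
            if rule["required"]:
                errors.append(f"{column}: missing required value")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{column}: expected {rule['type'].__name__}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{column}: {value!r} not in controlled vocabulary")
        if "range" in rule:
            low, high = rule["range"]
            if not low <= value <= high:
                errors.append(f"{column}: {value} outside {low} to {high}")
    return errors


print(validate({"compound_id": "c1", "pka_type": "pKa", "pka_value": 4.756}))
print(validate({"compound_id": "c1", "pka_type": "pKx", "pka_value": 99.0}))
```

Keeping the rules in one structure means the same dictionary can drive both the human-readable documentation and the automated checks.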

### Making the Dataset Available
Although GitHub is better known as a code repository platform (built around the git version control software), it is increasingly used as a 
collaborative open platform for data science, as any file can be version controlled and annotation of a repository's 
content can easily be added using [Markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) files.  In addition, inclusion of Jupyter Notebooks (as this file is) allows readers to see how code works right in a browser (see the next section for examples).  All of these features made it the obvious choice for making the IDC dataset openly available to the research community.

### Sharing the Research Process
In the context of FAIR and research data, it is one thing to make data available for the community to reuse, but it is another to show how you did all or part(s) of the research by making available code that you may have developed in your project.  This has now become much easier with the development of [Jupyter notebooks](https://realpython.com/jupyter-notebook-introduction/) that allow you to run Python code in a web browser.  Sharing one or more Jupyter Notebooks (.ipynb files) can thus allow the research community to run code that was used in a project to see how things were done.  This not only enables other researchers to run the same workflow, but also allows communities to collaboratively develop workflows for community use, thus improving the reproducibility of research data. Jonathan has made three Jupyter notebooks available in the [IDC repository](https://github.com/IUPAC/Dissociation-Constants), and you can click on the 'Binder' button on the page to launch the notebooks in your browser in the [Binder service](https://mybinder.org) and run the code within.

### Announcing the Availability of the Dataset
It's one thing to create a repository on GitHub and another to get the word out about where it is and what it contains.  Just like publishing a paper, it is important to think about how you are going to make it findable (the F in FAIR).  For data, just as for a paper, it is a good idea to get a DOI.  As the data is on GitHub, this can easily be done through GitHub's [Zenodo integration](https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content).  The DOI that is created is provided by [DataCite](http://datacite.org/), a DOI authority focusing on code and data resources.  This integration makes it easy to add a button to the readme of the repository (code is available to the owner, on Zenodo) that links to content on Zenodo, and conveniently you can automatically create a new DOI for each version you publish on your repository. Of course, it doesn't hurt to then write a paper about your data/code ([FAIR datasets for acid dissociation constants](https://doi.org/10.1515/ci-2023-0310)), and the DOIs can be linked together as they refer to the same resource.

Additionally, there are resources available where datasets can be indexed, which further improves discoverability, and the IDC has been added to two of the most important:
- [Google Dataset Search](https://datasetsearch.research.google.com/), which indexes datasets available on the Internet by harvesting [specific metadata](https://developers.google.com/search/docs/appearance/structured-data/dataset).  This metadata is at the bottom of the [readme file](https://github.com/IUPAC/Dissociation-Constants#dataset-metadata) for the IDC repository. [[IDC Entry](https://datasetsearch.research.google.com/search?docid=L2cvMTF0d3NuZjkyaA%3D%3D)]
- [FAIRsharing.org](https://fairsharing.org), which collects metadata about FAIR resources and provides search functionality to find resources in its database [[IDC Entry]()]
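
The metadata that Google Dataset Search harvests follows the schema.org `Dataset` type, usually written as JSON-LD. The sketch below builds a minimal record of that kind in Python. The property values are illustrative placeholders written for this example and are not a copy of the metadata in the IDC readme. A real record would normally carry more properties (creator, license, identifier, and so on).

```python
import json

# Minimal schema.org Dataset description; values are illustrative only
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "IUPAC Dissociation Constants",
    "description": (
        "Acid and base dissociation constants digitized from "
        "three IUPAC reference volumes."
    ),
    "url": "https://github.com/IUPAC/Dissociation-Constants",
}

# Serialise as JSON-LD and wrap it in the script tag used to embed it in a page
json_ld = json.dumps(dataset_metadata, indent=2)
print(f'<script type="application/ld+json">\n{json_ld}\n</script>')
```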

## Reusing The IUPAC Dissociation Constant Dataset
The idea of making research data available is that it can and will get reused by the community.  As has been discussed above, researchers who find research data available will need a lot of information about the data so that they can understand how it can be integrated with other data they (or others) have.  This section describes the scenario where the IDC dataset is reused to create a website where users can search for the data based on chemical substance, experimental conditions (e.g. temperature or ionic strength), or research paper, and obtain the data in other open formats like XML or JSON.
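
Before any web framework is involved, the core of this scenario (import CSV data, search it, and return it as JSON) can be sketched with the Python standard library alone. The CSV excerpt, the column names, and the helper functions `load_rows` and `search` below are hypothetical. The real IDC file has its own columns, described in the repository readme.

```python
import csv
import io
import json

# Hypothetical CSV excerpt standing in for a downloaded data file
csv_text = """compound,temperature_c,pka_value
acetic acid,25,4.756
benzoic acid,25,4.204
"""


def load_rows(handle):
    """Read CSV rows into dictionaries, converting numeric columns to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["temperature_c"] = float(row["temperature_c"])
        row["pka_value"] = float(row["pka_value"])
        rows.append(row)
    return rows


def search(rows, compound=None, temperature_c=None):
    """Return the rows matching every search criterion that was supplied."""
    hits = []
    for row in rows:
        if compound is not None and row["compound"] != compound:
            continue
        if temperature_c is not None and row["temperature_c"] != temperature_c:
            continue
        hits.append(row)
    return hits


rows = load_rows(io.StringIO(csv_text))
hits = search(rows, compound="acetic acid")
print(json.dumps(hits, indent=2))
```

With a real file, `io.StringIO(csv_text)` would be replaced by `open(path, newline="")`. In a website the same search would be run as a database query rather than a loop over rows in memory.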

### Extracting the 