# Chemistry example: Cement (Possible Solution)

This example is about chemical components. The idea is to take natural language text that describes chemical components as input. Using Natural Language Processing (NLP), the relevant component names are extracted and the corresponding entities are looked up in the Wikidata knowledge graph.

See also:
[Entity Linking](https://rq.bitplan.com/index.php/Entity_Linking)

## Example
Input as natural text: "Polylactic acid (PLA) and Acrylonitrile butadiene styrene (ABS) are 3D printing filaments."

Extraction of component names:

* PLA - Polylactic acid
* ABS - Acrylonitrile butadiene styrene

Lookup in Wikidata as:
* [PLA - Q413769](https://www.wikidata.org/wiki/Q413769) (C₃H₆O₃)x
* [ABS - Q143496](https://www.wikidata.org/wiki/Q143496) C₁₅H₁₇N

# Prerequisites
Please run the prerequisite cells before trying the examples.

## Install python module dependencies

In [None]:
import sys
def installModule(projectName:str, moduleName:str=None):
    '''Installs and loads the given module if not already installed'''
    if moduleName is None:
        moduleName=projectName
    !python -m pip install -U --no-input $projectName
    print(f'{projectName} installed')
    %reload_ext $moduleName

installModule('jupyter-xml')
installModule('SPARQLWrapper')
installModule('tabulate')
installModule('spacy==3.4', 'spacy')
installModule('newspaper3k', 'newspaper')
installModule("pylodstorage", "lodstorage")

## Download Models

In [None]:
!python -m spacy download en_core_web_sm    # efficient

# Chemistry Example Wikidata Query
See the [pyLodStorage Random Substances with CAS number example](http://wiki.bitplan.com/index.php/PyLoDStorage#15_Random_substances_with_CAS_number).

## Extract text from website
We take input from a [Penn State University text about the composition of cement](https://www.engr.psu.edu/ce/courses/ce584/concrete/library/construction/curing/Composition%20of%20cement.htm),
which mentions compounds such as "Silicon dioxide". We would like to look up the corresponding Wikidata entry [Silicon dioxide: Q116269](https://www.wikidata.org/wiki/Q116269).

In [3]:
from newspaper import Article
url="https://www.engr.psu.edu/ce/courses/ce584/concrete/library/construction/curing/Composition%20of%20cement.htm"
article = Article(url)
article.download()
article.parse()
text=article.text
print(text[:149],"...")

Composition of cement

Introduction

Portland cement gets its strength from chemical reactions between the cement and water. The process is known as  ...


# NLP with spaCy
Try to identify [chemical compounds](https://www.wikidata.org/wiki/Q11173) with the natural language processing library [spaCy](https://spacy.io/).

In [4]:
import spacy
from tabulate import tabulate
# Load English tokenizer, tagger, parser and NER
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
nouns=[chunk.text for chunk in doc.noun_chunks]

print(f"Found nouns:\n {nouns}")

foundEntities=[{"Text":entity.text, "Entity Tag":entity.label_} for entity in doc.ents]
print(tabulate(foundEntities, headers="keys"))

Found nouns:
 ['Composition', 'cement\n\nIntroduction\n\nPortland cement', 'its strength', 'chemical reactions', 'the cement', 'water', 'The process', 'hydration', 'This', 'a complex process', 'that', 'the chemical composition', 'cement', 'Manufacture', 'cement\n\nPortland cement', 'milling', 'the following materials', 'CaO', 'limestone', 'chalk', 'shells', 'sand', 'clay', 'argillaceous rock', 'sand', ', old bottles', 'clay', 'argillaceous rock\n\nAlumina', 'Al', 'O', 'O', 'clay\n\nIron', '2 O', 'clay', 'iron ore', 'scrap iron', 'ash', 'O', 'clay', 'iron ore', 'scrap iron', 'Gypsum', 'CaSO', '.2H', 'limestone\n\nChemical shorthand', 'the complex chemical nature', 'cement', 'a shorthand form', 'the chemical compounds', 'The shorthand', 'the basic compounds', 'Compound Formula Shorthand', 'Calcium oxide', 'lime', 'Ca0 C Silicon dioxide', 'silica', '2 S Aluminum oxide', 'alumina', 'Al', 'O', 'O', '2 O H Sulfate', '3 S\n\nCompound Formula Shorthand', '%', 'Tricalcium aluminate', 'Ca', 'O',

# Query Wikidata for mentioned chemical compounds
The Named Entity Recognition (NER) of spaCy does not seem to recognize chemical compounds as a known category.
As a workaround, we use the noun chunks found by spaCy to query Wikidata for the referenced compounds.
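
Before querying, the noun chunks can be normalized so that fewer noisy candidates reach the endpoint. The following is a minimal sketch and not part of the original notebook: the helper name `clean_nouns`, the determiner list and the minimum length of three characters are assumptions chosen for illustration.

```python
import re

# Hypothetical helper: the determiner list and min_length are illustrative assumptions.
DETERMINERS = ("the ", "a ", "an ", "this ", "these ", "its ")

def clean_nouns(nouns, min_length=3):
    """Normalize noun chunks: collapse whitespace, strip a leading determiner,
    drop very short candidates and remove duplicates while keeping order."""
    cleaned = []
    seen = set()
    for noun in nouns:
        # collapse line breaks and repeated blanks into single spaces
        candidate = re.sub(r"\s+", " ", noun).strip()
        lowered = candidate.lower()
        for det in DETERMINERS:
            if lowered.startswith(det):
                candidate = candidate[len(det):]
                break
        # very short chunks such as 'O' or 'Al' are ambiguous as labels
        if len(candidate) < min_length:
            continue
        key = candidate.lower()
        if key not in seen:
            seen.add(key)
            cleaned.append(candidate)
    return cleaned

sample = ['the cement', 'water', 'O', 'Al', 'clay\n\nIron', 'Calcium oxide', 'water']
print(clean_nouns(sample))
# prints ['cement', 'water', 'clay Iron', 'Calcium oxide']
```

The length threshold is a trade-off: it removes ambiguous fragments, but it would also drop short formulas that are legitimate mentions.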

## Setting up the [wikidata query endpoint](https://query.wikidata.org/)

In [5]:
from lodstorage.query import QueryManager, Query
from lodstorage.sparql import SPARQL
from IPython.display import display, Markdown
endpoint=SPARQL("https://query.wikidata.org/sparql")

## Get matching entities
The noun chunks found by spaCy can be matched against the labels of chemical compounds.

A Wikidata entity such as [Silicon dioxide: Q116269](https://www.wikidata.org/wiki/Q116269) has an [rdfs:label](https://www.w3.org/TR/rdf-schema/#ch_label) for each supported language, marked with a language tag (e.g. @en).

To also query the names that are defined in the __Also known as__ column of the entity page the [skos:altLabel](https://www.w3.org/2009/08/skos-reference/skos.html#altLabel) property has to be queried. Searching for the label in both properties can be accomplished by using the [alternative path](https://www.w3.org/TR/sparql11-query/#propertypaths) feature of SPARQL.

Note: The Wikidata endpoint has a limit on the length of a single line, so the values in the [VALUES clause](https://www.w3.org/TR/sparql11-query/#inline-data) are spread over multiple lines.
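
The chunking described in the note can be sketched as a small helper. This is an illustration under assumptions: the function name `build_values_lines` is hypothetical, the default chunk size of 50 simply mirrors the cell below, and the escaping of backslashes and double quotes is an addition that the notebook's own query builder does not perform.

```python
def build_values_lines(labels, chunk_size=50, lang="en"):
    """Return the body of a SPARQL VALUES clause with at most
    chunk_size language-tagged literals per line."""
    def literal(label):
        # escape backslashes and double quotes so the literal stays valid
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"@{lang}'

    lines = []
    for start in range(0, len(labels), chunk_size):
        chunk = labels[start:start + chunk_size]
        lines.append(" ".join(literal(label) for label in chunk))
    return "\n".join(lines)

print(build_values_lines(["water", "Calcium oxide", "lime"], chunk_size=2))
# prints two lines:
# "water"@en "Calcium oxide"@en
# "lime"@en
```

The returned string can be placed between the braces of `VALUES ?substanceLabel { ... }`.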


In [7]:
# normalize whitespace: noun chunks may contain line breaks
nouns=[noun.replace("\n"," ").strip() for noun in nouns]
# building the query string: 50 language-tagged nouns per line of the VALUES clause
values=[' '.join([f'"{noun}"@en' for noun in nouns[n*50:n*50+50]]) for n in range((len(nouns)//50)+1)]
substanceQueryStr = """
SELECT DISTINCT ?substance
WHERE { 
  VALUES ?substanceLabel {%s}
  ?substance wdt:P31 wd:Q11173;
             rdfs:label|skos:altLabel ?substanceLabel.
}
LIMIT 50
""" 
substanceQueryStr = substanceQueryStr % ("\n".join(values))
# executing the query
substanceQuery=Query(query=substanceQueryStr,
            name="Recognized chemical compounds",
            lang="sparql")
substanceQueryResLoD=endpoint.queryAsListOfDicts(substanceQuery.query)

substances=[record.get('substance') for record in substanceQueryResLoD]

# pretty printout of the result
doc=substanceQuery.documentQueryResult(substanceQueryResLoD, tablefmt="github",floatfmt=".0f",tryItUrl="https://query.wikidata.org")
display(Markdown(str(doc)))

## Recognized chemical compounds

### query
```sparql

SELECT DISTINCT ?substance
WHERE { 
  VALUES ?substanceLabel {"Composition"@en "cement  Introduction  Portland cement"@en "its strength"@en "chemical reactions"@en "the cement"@en "water"@en "The process"@en "hydration"@en "This"@en "a complex process"@en "that"@en "the chemical composition"@en "cement"@en "Manufacture"@en "cement  Portland cement"@en "milling"@en "the following materials"@en "CaO"@en "limestone"@en "chalk"@en "shells"@en "sand"@en "clay"@en "argillaceous rock"@en "sand"@en ", old bottles"@en "clay"@en "argillaceous rock  Alumina"@en "Al"@en "O"@en "O"@en "clay  Iron"@en "2 O"@en "clay"@en "iron ore"@en "scrap iron"@en "ash"@en "O"@en "clay"@en "iron ore"@en "scrap iron"@en "Gypsum"@en "CaSO"@en ".2H"@en "limestone  Chemical shorthand"@en "the complex chemical nature"@en "cement"@en "a shorthand form"@en "the chemical compounds"@en "The shorthand"@en
"the basic compounds"@en "Compound Formula Shorthand"@en "Calcium oxide"@en "lime"@en "Ca0 C Silicon dioxide"@en "silica"@en "2 S Aluminum oxide"@en "alumina"@en "Al"@en "O"@en "O"@en "2 O H Sulfate"@en "3 S  Compound Formula Shorthand"@en "%"@en "Tricalcium aluminate"@en "Ca"@en "O"@en "6 C"@en "A 10 Tetracalcium aluminoferrite"@en "Ca"@en "Al 2 Fe"@en "O 10 C 4 AF 8 Belite or dicalcium silicate"@en "Ca 2 SiO 5 C 2 S 20 Alite"@en "tricalcium"@en "Ca 3 SiO 4 C 3 S 55 Sodium oxide"@en "Na"@en "Up to 2 Potassium oxide"@en "K"@en "O K Gypsum CaSO"@en ".2H"@en "2 O C S H"@en "2 5  Representative"@en "Actual weight"@en "type"@en "cement"@en "Source"@en "Mindess"@en "Young  Properties"@en "cement"@en "These compounds"@en "the properties"@en "cement"@en "different ways"@en "Tricalcium aluminate"@en "C"@en "3  Tricalcium silicate"@en "C 3 S:-  Dicalcium silicate"@en "C 2 S"@en "Ferrite"@en "C"@en
"4 AF"@en "these compounds"@en "manufacturers"@en "different types"@en "cement"@en "several construction environments"@en "References"@en "Sidney Mindess"@en "J. Francis Young"@en "Concrete"@en "Prentice-Hall, Inc."@en "Englewood Cliffs"@en "pp"@en "Steve Kosmatka"@en "William Panarese"@en "Concrete Mixes"@en "Portland Cement Association"@en "pp"@en "Michael Mamlouk"@en "John Zaniewski"@en "Materials"@en "Civil and Construction Engineers"@en "Addison Wesley Longman, Inc."@en}
  ?substance wdt:P31 wd:Q11173;
             rdfs:label|skos:altLabel ?substanceLabel.
}
LIMIT 50

```
[try it!](https://query.wikidata.org/#%0ASELECT%20DISTINCT%20%3Fsubstance%0AWHERE%20%7B%20%0A%20%20VALUES%20%3FsubstanceLabel%20%7B%22Composition%22%40en%20%22cement%20%20Introduction%20%20Portland%20cement%22%40en%20%22its%20strength%22%40en%20%22chemical%20reactions%22%40en%20%22the%20cement%22%40en%20%22water%22%40en%20%22The%20process%22%40en%20%22hydration%22%40en%20%22This%22%40en%20%22a%20complex%20process%22%40en%20%22that%22%40en%20%22the%20chemical%20composition%22%40en%20%22cement%22%40en%20%22Manufacture%22%40en%20%22cement%20%20Portland%20cement%22%40en%20%22milling%22%40en%20%22the%20following%20materials%22%40en%20%22CaO%22%40en%20%22limestone%22%40en%20%22chalk%22%40en%20%22shells%22%40en%20%22sand%22%40en%20%22clay%22%40en%20%22argillaceous%20rock%22%40en%20%22sand%22%40en%20%22%2C%20old%20bottles%22%40en%20%22clay%22%40en%20%22argillaceous%20rock%20%20Alumina%22%40en%20%22Al%22%40en%20%22O%22%40en%20%22O%22%40en%20%22clay%20%20Iron%22%40en%20%222%20O%22%40en%20%22clay%22%40en%20%22iron%20ore%22%40en%20%22scrap%20iron%22%40en%20%22ash%22%40en%20%22O%22%40en%20%22clay%22%40en%20%22iron%20ore%22%40en%20%22scrap%20iron%22%40en%20%22Gypsum%22%40en%20%22CaSO%22%40en%20%22.2H%22%40en%20%22limestone%20%20Chemical%20shorthand%22%40en%20%22the%20complex%20chemical%20nature%22%40en%20%22cement%22%40en%20%22a%20shorthand%20form%22%40en%20%22the%20chemical%20compounds%22%40en%20%22The%20shorthand%22%40en%0A%22the%20basic%20compounds%22%40en%20%22Compound%20Formula%20Shorthand%22%40en%20%22Calcium%20oxide%22%40en%20%22lime%22%40en%20%22Ca0%20C%20Silicon%20dioxide%22%40en%20%22silica%22%40en%20%222%20S%20Aluminum%20oxide%22%40en%20%22alumina%22%40en%20%22Al%22%40en%20%22O%22%40en%20%22O%22%40en%20%222%20O%20H%20Sulfate%22%40en%20%223%20S%20%20Compound%20Formula%20Shorthand%22%40en%20%22%25%22%40en%20%22Tricalcium%20aluminate%22%40en%20%22Ca%22%40en%20%22O%22%40en%20%226%20C%22%40en%20%22A%2010%20Tetracalcium%20aluminoferrite%22%40en%20%22Ca%22%40en%20%22Al%202%20
Fe%22%40en%20%22O%2010%20C%204%20AF%208%20Belite%20or%20dicalcium%20silicate%22%40en%20%22Ca%202%20SiO%205%20C%202%20S%2020%20Alite%22%40en%20%22tricalcium%22%40en%20%22Ca%203%20SiO%204%20C%203%20S%2055%20Sodium%20oxide%22%40en%20%22Na%22%40en%20%22Up%20to%202%20Potassium%20oxide%22%40en%20%22K%22%40en%20%22O%20K%20Gypsum%20CaSO%22%40en%20%22.2H%22%40en%20%222%20O%20C%20S%20H%22%40en%20%222%205%20%20Representative%22%40en%20%22Actual%20weight%22%40en%20%22type%22%40en%20%22cement%22%40en%20%22Source%22%40en%20%22Mindess%22%40en%20%22Young%20%20Properties%22%40en%20%22cement%22%40en%20%22These%20compounds%22%40en%20%22the%20properties%22%40en%20%22cement%22%40en%20%22different%20ways%22%40en%20%22Tricalcium%20aluminate%22%40en%20%22C%22%40en%20%223%20%20Tricalcium%20silicate%22%40en%20%22C%203%20S%3A-%20%20Dicalcium%20silicate%22%40en%20%22C%202%20S%22%40en%20%22Ferrite%22%40en%20%22C%22%40en%0A%224%20AF%22%40en%20%22these%20compounds%22%40en%20%22manufacturers%22%40en%20%22different%20types%22%40en%20%22cement%22%40en%20%22several%20construction%20environments%22%40en%20%22References%22%40en%20%22Sidney%20Mindess%22%40en%20%22J.%20Francis%20Young%22%40en%20%22Concrete%22%40en%20%22Prentice-Hall%2C%20Inc.%22%40en%20%22Englewood%20Cliffs%22%40en%20%22pp%22%40en%20%22Steve%20Kosmatka%22%40en%20%22William%20Panarese%22%40en%20%22Concrete%20Mixes%22%40en%20%22Portland%20Cement%20Association%22%40en%20%22pp%22%40en%20%22Michael%20Mamlouk%22%40en%20%22John%20Zaniewski%22%40en%20%22Materials%22%40en%20%22Civil%20and%20Construction%20Engineers%22%40en%20%22Addison%20Wesley%20Longman%2C%20Inc.%22%40en%7D%0A%20%20%3Fsubstance%20wdt%3AP31%20wd%3AQ11173%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20rdfs%3Alabel%7Cskos%3AaltLabel%20%3FsubstanceLabel.%0A%7D%0ALIMIT%2050%0A)
## result
| substance                                |
|------------------------------------------|
| http://www.wikidata.org/entity/Q283      |
| http://www.wikidata.org/entity/Q185006   |
| http://www.wikidata.org/entity/Q191924   |
| http://www.wikidata.org/entity/Q10860582 |
| http://www.wikidata.org/entity/Q186474   |
| http://www.wikidata.org/entity/Q416265   |
| http://www.wikidata.org/entity/Q20816880 |

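The result lists only entity URIs, so it does not show which noun led to which match. A minimal sketch of how matches could be made traceable, assuming the query is changed to `SELECT DISTINCT ?substance ?substanceLabel` so that each record also carries the matched label. The helper name `group_by_matched_label` is hypothetical, and the sample records are illustrative, not actual query output.

```python
from collections import defaultdict

def group_by_matched_label(records):
    """Map each matched label to the list of entity URIs it resolved to."""
    grouped = defaultdict(list)
    for record in records:
        label = record.get("substanceLabel")
        uri = record.get("substance")
        # skip incomplete records
        if label is None or uri is None:
            continue
        if uri not in grouped[label]:
            grouped[label].append(uri)
    return dict(grouped)

# Illustrative records only (assumed shape: a list of dicts)
sample_records = [
    {"substance": "http://www.wikidata.org/entity/Q283", "substanceLabel": "water"},
    {"substance": "http://www.wikidata.org/entity/Q185006", "substanceLabel": "Calcium oxide"},
    {"substance": "http://www.wikidata.org/entity/Q185006", "substanceLabel": "CaO"},
]
for label, uris in group_by_matched_label(sample_records).items():
    print(label, uris)
```

Such a grouping makes it easier to spot entities that were matched only through a short or ambiguous alternative label.
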
## Extract additional information about the entities
The queried substances can now be used to query additional information about them.
Since some information might not be available for a substance, the properties are queried in an [OPTIONAL clause](https://www.w3.org/TR/sparql11-query/#optionals) so that the entity is included even if the triple pattern does not match.
You might want to extract the chemical formula, structure and CAS id.

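Because of the OPTIONAL clauses, a result record may lack some of the requested fields. A minimal sketch of how such records could be completed before further processing. The helper name `fill_missing` and the placeholder are hypothetical, and whether a missing value shows up as an absent key or as `None` is an assumption, so the sketch handles both cases.

```python
FIELDS = ("substanceLabel", "formula", "structure", "CAS")

def fill_missing(records, placeholder="n/a"):
    """Return copies of the records in which absent or None-valued
    optional fields are replaced by a placeholder."""
    completed = []
    for record in records:
        row = dict(record)  # copy, so the input stays unchanged
        for field in FIELDS:
            if row.get(field) is None:
                row[field] = placeholder
        completed.append(row)
    return completed

# Two records shaped like rows of the detail query;
# the values are taken from the result table of this notebook.
records = [
    {"substance": "http://www.wikidata.org/entity/Q185006",
     "substanceLabel": "calcium oxide", "formula": "CaO", "CAS": "1305-78-8"},
    {"substance": "http://www.wikidata.org/entity/Q10860582",
     "substanceLabel": "Tricalcium aluminate", "CAS": "12042-78-3"},
]
for row in fill_missing(records):
    print(row["substanceLabel"], row["formula"], row["structure"], row["CAS"])
```
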
In [8]:
substanceDetailQueryStr = """
SELECT DISTINCT ?substance ?substanceLabel ?formula ?structure ?CAS
WHERE { 
  VALUES ?substance { %s }
  ?substance wdt:P31 wd:Q11173.
  OPTIONAL{
      ?substance wdt:P231 ?CAS.
  }
  OPTIONAL{
      ?substance wdt:P274 ?formula.
  }
  OPTIONAL{
      ?substance wdt:P117  ?structure.
  }
  OPTIONAL{
      ?substance rdfs:label ?substanceLabel.
      FILTER(lang(?substanceLabel)="en")
  }                    
}
LIMIT 50

""" % " ".join([f"<{substance}>" for substance in substances])

substanceDetailQuery=Query(query=substanceDetailQueryStr,
            name="Chemical compound details",
            lang="sparql")
substanceDetailQueryResLoD=endpoint.queryAsListOfDicts(substanceDetailQuery.query)

# pretty printout of the result
doc=substanceDetailQuery.documentQueryResult(substanceDetailQueryResLoD, tablefmt="github",floatfmt=".0f",tryItUrl="https://query.wikidata.org")
display(Markdown(str(doc)))

## Chemical compound details

### query
```sparql

SELECT DISTINCT ?substance ?substanceLabel ?formula ?structure ?CAS
WHERE { 
  VALUES ?substance { <http://www.wikidata.org/entity/Q283> <http://www.wikidata.org/entity/Q185006> <http://www.wikidata.org/entity/Q191924> <http://www.wikidata.org/entity/Q10860582> <http://www.wikidata.org/entity/Q186474> <http://www.wikidata.org/entity/Q416265> <http://www.wikidata.org/entity/Q20816880> }
  ?substance wdt:P31 wd:Q11173.
  OPTIONAL{
      ?substance wdt:P231 ?CAS.
  }
  OPTIONAL{
      ?substance wdt:P274 ?formula.
  }
  OPTIONAL{
      ?substance wdt:P117  ?structure.
  }
  OPTIONAL{
      ?substance rdfs:label ?substanceLabel.
      FILTER(lang(?substanceLabel)="en")
  }                    
}
LIMIT 50


```
[try it!](https://query.wikidata.org/#%0ASELECT%20DISTINCT%20%3Fsubstance%20%3FsubstanceLabel%20%3Fformula%20%3Fstructure%20%3FCAS%0AWHERE%20%7B%20%0A%20%20VALUES%20%3Fsubstance%20%7B%20%3Chttp%3A//www.wikidata.org/entity/Q283%3E%20%3Chttp%3A//www.wikidata.org/entity/Q185006%3E%20%3Chttp%3A//www.wikidata.org/entity/Q191924%3E%20%3Chttp%3A//www.wikidata.org/entity/Q10860582%3E%20%3Chttp%3A//www.wikidata.org/entity/Q186474%3E%20%3Chttp%3A//www.wikidata.org/entity/Q416265%3E%20%3Chttp%3A//www.wikidata.org/entity/Q20816880%3E%20%7D%0A%20%20%3Fsubstance%20wdt%3AP31%20wd%3AQ11173.%0A%20%20OPTIONAL%7B%0A%20%20%20%20%20%20%3Fsubstance%20wdt%3AP231%20%3FCAS.%0A%20%20%7D%0A%20%20OPTIONAL%7B%0A%20%20%20%20%20%20%3Fsubstance%20wdt%3AP274%20%3Fformula.%0A%20%20%7D%0A%20%20OPTIONAL%7B%0A%20%20%20%20%20%20%3Fsubstance%20wdt%3AP117%20%20%3Fstructure.%0A%20%20%7D%0A%20%20OPTIONAL%7B%0A%20%20%20%20%20%20%3Fsubstance%20rdfs%3Alabel%20%3FsubstanceLabel.%0A%20%20%20%20%20%20FILTER%28lang%28%3FsubstanceLabel%29%3D%22en%22%29%0A%20%20%7D%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%0A%7D%0ALIMIT%2050%0A%0A)
## result
| substance                                | substanceLabel       | formula   | structure                                                                         | CAS        |
|------------------------------------------|----------------------|-----------|-----------------------------------------------------------------------------------|------------|
| http://www.wikidata.org/entity/Q283      | water                | H₂O       | http://commons.wikimedia.org/wiki/Special:FilePath/H2O%202D%20labelled.svg        | 7732-18-5  |
| http://www.wikidata.org/entity/Q185006   | calcium oxide        | CaO       |                                                                                   | 1305-78-8  |
| http://www.wikidata.org/entity/Q20816880 | L-lysine             | C₆H₁₄N₂O₂ |                                                                                   | 56-87-1    |
| http://www.wikidata.org/entity/Q416265   | magnesium iodide     | I₂Mg      |                                                                                   | 10377-58-9 |
| http://www.wikidata.org/entity/Q186474   | L-Cysteine           | C₃H₇NO₂S  | http://commons.wikimedia.org/wiki/Special:FilePath/L-Cystein%20-%20L-Cysteine.svg | 52-90-4    |
| http://www.wikidata.org/entity/Q191924   | D-methamphetamine    | C₁₀H₁₅N   | http://commons.wikimedia.org/wiki/Special:FilePath/N-Methyl-D-amphetamin.svg      | 537-46-2   |
| http://www.wikidata.org/entity/Q10860582 | Tricalcium aluminate |           |                                                                                   | 12042-78-3 |