# Chemistry example: Cement

This example is about chemical components. The idea is to take a natural language text that describes chemical components, extract the relevant component names with Natural Language Processing (NLP), and look up the corresponding entries in the Wikidata knowledge graph.

See also:
[Entity Linking](https://rq.bitplan.com/index.php/Entity_Linking)

## Example
Input as natural text: "Polylactic acid (PLA) and Acrylonitrile butadiene styrene (ABS) are 3D printing filaments."

Extraction of component names:

* PLA - Polylactic acid
* ABS - Acrylonitrile butadiene styrene

Lookup in Wikidata as:
* [PLA - Q413769](https://www.wikidata.org/wiki/Q413769) (C₃H₆O₃)x
* [ABS - Q143496](https://www.wikidata.org/wiki/Q143496) C₁₅H₁₇N
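
The extraction step in this example can be sketched with a regular expression. This is only an illustration for the sample sentence, not the approach the notebook takes further down; the helper name `extract_abbreviations` is hypothetical, and the pattern assumes a capitalized long name followed by an uppercase abbreviation in parentheses.

```python
import re

# Hypothetical helper, not part of the notebook: finds "Long name (ABBR)" patterns.
# Assumption: the long name starts with a capitalized word followed by lowercase
# words, and the abbreviation is two or more uppercase letters in parentheses.
PATTERN = re.compile(r"([A-Z][a-z]+(?: [a-z]+)*) \(([A-Z]{2,})\)")

def extract_abbreviations(text: str) -> dict:
    """Map each abbreviation to the long name that precedes it."""
    return {abbr: name for name, abbr in PATTERN.findall(text)}

sentence = ("Polylactic acid (PLA) and Acrylonitrile butadiene styrene (ABS) "
            "are 3D printing filaments.")
pairs = extract_abbreviations(sentence)
for abbr, name in pairs.items():
    print(f"{abbr} - {name}")
```

A pattern like this is brittle for real text (hyphenated names, digits, lowercase abbreviations), which is one reason to try an NLP library instead.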

# Prerequisites
Please run the prerequisite cells before trying the examples.

## Install python module dependencies

In [None]:
import sys
def installModule(projectName:str, moduleName:str=None):
    '''Installs and loads the given module if not already installed'''
    if moduleName is None:
        moduleName=projectName
    !python -m pip install -U --no-input $projectName
    print(f'{projectName} installed')
    %reload_ext $moduleName

installModule('jupyter-xml')
installModule('SPARQLWrapper')
installModule('tabulate')
installModule('spacy==3.1.3', 'spacy')  # pip pins versions with '==', not '='
installModule('newspaper3k', 'newspaper')  # provides the newspaper module used to fetch the article text

## Download Models

In [None]:
!python -m spacy download en_core_web_sm    # efficient

# Chemistry Example Wikidata Query
See the [pyLodStorage Random Substances with CAS number example](http://wiki.bitplan.com/index.php/PyLoDStorage#15_Random_substances_with_CAS_number).

## Extract text from website
We take input from a
[Penn State University text about the composition of Cement](https://www.engr.psu.edu/ce/courses/ce584/concrete/library/construction/curing/Composition%20of%20cement.htm),
which mentions compounds such as "Silicon dioxide". We'd like to look up the corresponding Wikidata entry [Silicon dioxide: Q116269](https://www.wikidata.org/wiki/Q116269).

In [None]:
from newspaper import Article
url="https://www.engr.psu.edu/ce/courses/ce584/concrete/library/construction/curing/Composition%20of%20cement.htm"
article = Article(url)
article.download()
article.parse()
text=article.text
print(text[:149],"...")

# NLP with spacy
Try to identify [chemical compounds](https://www.wikidata.org/wiki/Q11173) with the natural language processing library [spaCy](https://spacy.io/).

In [None]:
import spacy
from tabulate import tabulate
# Load English tokenizer, tagger, parser and NER
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
nouns=[chunk.text for chunk in doc.noun_chunks]

print(f"Found nouns:\n {nouns}")

foundEntities=[{"Text":entity.text, "Entity Tag":entity.label_} for entity in doc.ents]
print(tabulate(foundEntities, headers="keys"))

# Query Wikidata for mentioned Chemical Compounds
The NER (Named Entity Recognition) of spaCy does not seem to detect chemical compounds as a known category.
To work around this limitation, we use the noun chunks found above as candidate labels and query Wikidata for the referenced compounds.
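
Since the query compares the noun chunks against Wikidata labels as exact strings, it can help to clean the chunks first. The following is a minimal sketch of such a step, not something the notebook does; the helper `candidate_labels` and the determiner list are assumptions made for illustration.

```python
import re

# Assumption: leading articles and similar determiners are noise for a label lookup.
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}

def candidate_labels(chunks):
    """Turn noun chunks into de-duplicated candidate labels for a lookup."""
    seen = set()
    labels = []
    for chunk in chunks:
        # Collapse runs of whitespace (including newlines) into single spaces
        words = re.sub(r"\s+", " ", chunk).strip().split(" ")
        # Drop a leading determiner such as "the" or "a"
        if words and words[0].lower() in DETERMINERS:
            words = words[1:]
        label = " ".join(words)
        # Emit the original spelling and a lowercase variant, each only once
        for variant in (label, label.lower()):
            if variant and variant not in seen:
                seen.add(variant)
                labels.append(variant)
    return labels

sample_chunks = ["The Silicon dioxide", "silicon dioxide", "a  lime\n", "Cement"]
print(candidate_labels(sample_chunks))
```

The lowercase variant is included because an exact string comparison would otherwise miss a label that differs from the text only in capitalization.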

<div class="alert alert-block alert-warning">
<b>ToDo:</b> Optimize the query: some of the referenced components do not have all of the queried properties, so they are missing from the result. Making the properties OPTIONAL results in a timeout.
</div>
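
One possible direction for the ToDo above is to bind the candidate labels with a `VALUES` clause instead of filtering on `str(?substanceLabel)`, and only then mark the properties as `OPTIONAL`. The sketch below just builds such a query string. The helper `build_query` is hypothetical, the sample labels are illustrative, and whether this variant actually avoids the timeout has not been verified here.

```python
import json

def build_query(labels, lang="en", limit=50):
    """Build a SPARQL query that binds the labels with VALUES instead of FILTER."""
    # json.dumps yields a double-quoted literal with quotes and newlines escaped
    literals = " ".join(f"{json.dumps(label)}@{lang}" for label in labels)
    return f"""
SELECT DISTINCT ?substance ?substanceLabel ?formula ?structure ?CAS
WHERE {{
  VALUES ?substanceLabel {{ {literals} }}
  ?substance rdfs:label ?substanceLabel ;
             wdt:P31 wd:Q11173 .
  OPTIONAL {{ ?substance wdt:P231 ?CAS }}
  OPTIONAL {{ ?substance wdt:P274 ?formula }}
  OPTIONAL {{ ?substance wdt:P117 ?structure }}
}}
LIMIT {limit}
"""

# Illustrative labels; in the notebook they would come from the noun chunks
query = build_query(["silicon dioxide", "calcium oxide"])
print(query)
```

Note that language-tagged literals such as `"silicon dioxide"@en` only match labels stored with that exact language tag and spelling.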

In [None]:
import json
from SPARQLWrapper import SPARQLWrapper, JSON
from tabulate import tabulate

# Quote every distinct, non-empty noun chunk as a SPARQL string literal;
# json.dumps escapes embedded quotes, backslashes and newlines
labels = list(dict.fromkeys(noun.strip() for noun in nouns if noun.strip()))
values = ',\n    '.join(json.dumps(label) for label in labels)
q = """
SELECT DISTINCT ?substance ?substanceLabel ?formula ?structure ?CAS
WHERE { 
  ?substance wdt:P31 wd:Q11173.
  ?substance wdt:P231 ?CAS.
  ?substance wdt:P274 ?formula.
  ?substance wdt:P117  ?structure.
  ?substance rdfs:label ?substanceLabel
  FILTER(str(?substanceLabel) in ( %s ))
}
LIMIT 50
""" % values
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery(q)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
# Read the columns in SELECT order so that every row is aligned
columns = results["head"]["vars"]
table = [[result[column]["value"] for column in columns] for result in results["results"]["bindings"]]
print(tabulate(table, headers=columns))

<div class="alert alert-block alert-warning">
<b>ToDo:</b> Decide which SPARQL query framework should be used, and simplify the access and interface for use in Jupyter notebooks.
</div>
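
Independent of which framework is chosen, a small helper that flattens the JSON result into aligned rows would already simplify notebook usage. The following is a minimal sketch under that assumption; `bindings_to_rows` is a hypothetical name, and the sample result is hand-written in the standard SPARQL JSON results layout rather than fetched from Wikidata.

```python
def bindings_to_rows(results, columns):
    """Flatten SPARQL JSON results into rows with a fixed column order."""
    rows = []
    for binding in results["results"]["bindings"]:
        # Variables left unbound (e.g. by OPTIONAL) are absent from the binding
        rows.append([binding.get(col, {}).get("value", "") for col in columns])
    return rows

# Hand-written sample; the CAS variable is deliberately left unbound
sample = {
    "head": {"vars": ["substance", "substanceLabel", "CAS"]},
    "results": {"bindings": [
        {"substance": {"type": "uri",
                       "value": "http://www.wikidata.org/entity/Q116269"},
         "substanceLabel": {"type": "literal", "xml:lang": "en",
                            "value": "silicon dioxide"}},
    ]},
}
rows = bindings_to_rows(sample, sample["head"]["vars"])
print(rows)
```

The resulting rows can be passed to `tabulate` together with the column names as headers, and an unbound variable shows up as an empty cell instead of shifting the columns.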