# Chemistry example: Cement

This example deals with chemical compounds. The idea is to take a natural language text that mentions chemical compounds, extract the relevant compound names with Natural Language Processing (NLP), and look up the corresponding entries in the Wikidata knowledge graph.

See also:
[Entity Linking](https://rq.bitplan.com/index.php/Entity_Linking)

## Example
Input as natural text: "Polylactic acid (PLA) and Acrylonitrile butadiene styrene (ABS) are 3D printing filaments."

Extraction of compound names:

* PLA - Polylactic acid
* ABS - Acrylonitrile butadiene styrene

Lookup in Wikidata as:
* [PLA - Q413769](https://www.wikidata.org/wiki/Q413769) (C₃H₆O₃)x
* [ABS - Q143496](https://www.wikidata.org/wiki/Q143496) C₁₅H₁₇N

# Prerequisites
Please run the prerequisite cells before trying the examples.

## Install python module dependencies

In [None]:
import sys
def installModule(projectName:str, moduleName:str=None):
    '''Installs and loads the given module if not already installed'''
    if moduleName is None:
        moduleName=projectName
    !python -m pip install -U --no-input $projectName
    print(f'{projectName} installed')
    %reload_ext $moduleName

installModule('jupyter-xml')
installModule('SPARQLWrapper')
installModule('tabulate')
installModule('spacy==3.1.3', 'spacy')
installModule('newspaper3k', 'newspaper')
installModule("pylodstorage", "lodstorage")

## Download Models

In [None]:
!python -m spacy download en_core_web_sm    # small, efficient model

# Chemistry Example Wikidata Query
See the [pyLodStorage Random Substances with CAS number example](http://wiki.bitplan.com/index.php/PyLoDStorage#15_Random_substances_with_CAS_number).

## Extract text from website
We take input from a
[Penn State University text about the composition of Cement](https://www.engr.psu.edu/ce/courses/ce584/concrete/library/construction/curing/Composition%20of%20cement.htm)
which mentions compounds such as "Silicon dioxide". We'd like to look up the corresponding Wikidata entry [Silicon dioxide: Q116269](https://www.wikidata.org/wiki/Q116269).

In [None]:
from newspaper import Article
url="https://www.engr.psu.edu/ce/courses/ce584/concrete/library/construction/curing/Composition%20of%20cement.htm"
article = Article(url)
article.download()
article.parse()
text=article.text
print(text[:149],"...")

# NLP with spacy
Try to identify [chemical compounds](https://www.wikidata.org/wiki/Q11173) with the natural language processing library [spaCy](https://spacy.io/).

In [None]:
import spacy
from tabulate import tabulate
# Load English tokenizer, tagger, parser and NER
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
nouns=[chunk.text for chunk in doc.noun_chunks]

print(f"Found nouns:\n {nouns}")

foundEntities=[{"Text":entity.text, "Entity Tag":entity.label_} for entity in doc.ents]
print(tabulate(foundEntities, headers="keys"))

# Query Wikidata for mentioned Chemical Compounds
The NER (Named Entity Recognition) of spaCy does not seem to recognize chemical compounds as a known category.
We work around this limitation by using the noun chunks found by spaCy to query Wikidata for the referenced compounds.
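
Noun chunks often carry line breaks, leading articles, or duplicates, which lowers the chance of an exact label match. The following is a minimal sketch of a cleanup step that could be applied before querying. The function name `clean_noun_chunks` and the choice to strip English articles are assumptions for illustration and are not part of the original notebook.

```python
def clean_noun_chunks(chunks):
    """Collapse whitespace, strip a leading English article, and drop
    empty strings and exact duplicates while keeping the original order.

    Hypothetical helper, not part of spaCy or pyLodStorage.
    """
    articles = ("the ", "a ", "an ")
    seen = set()
    cleaned = []
    for chunk in chunks:
        # collapse line breaks and repeated blanks into single spaces
        text = " ".join(chunk.split())
        lowered = text.lower()
        for article in articles:
            if lowered.startswith(article):
                text = text[len(article):]
                break
        # keep the original casing, since the label match is exact
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


print(clean_noun_chunks(["the lime", "lime", "Silicon\ndioxide", " ", "The cement"]))
```

Casing is deliberately left untouched here, because the later label comparison is an exact string match.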

## Setting up the [Wikidata query endpoint](https://query.wikidata.org/)

In [None]:
from lodstorage.query import QueryManager, Query
from lodstorage.sparql import SPARQL
from IPython.display import display, Markdown
endpoint=SPARQL("https://query.wikidata.org/sparql")

## Get matching entities
The nouns found by spaCy can be matched against the labels of chemical compounds.

A Wikidata entity such as [Silicon dioxide: Q116269](https://www.wikidata.org/wiki/Q116269) has an [rdfs:label](https://www.w3.org/TR/rdf-schema/#ch_label) for each language, marked with a language tag (e.g. @en).

To also match the names listed in the __Also known as__ column of the entity page, the [skos:altLabel](https://www.w3.org/2009/08/skos-reference/skos.html#altLabel) property has to be queried as well. Searching both properties at once can be accomplished with the [alternative path](https://www.w3.org/TR/sparql11-query/#propertypaths) feature of SPARQL.

Note: The Wikidata endpoint limits the length of a single query line, so the values in the [VALUES clause](https://www.w3.org/TR/sparql11-query/#inline-data) are spread over multiple lines.
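
The chunking described in the note can be sketched as a small standalone helper. The name `build_values_lines` and its parameters are hypothetical. The chunk size of 50 mirrors the value used in this notebook and is not a documented limit of the endpoint.

```python
def build_values_lines(labels, lang="en", chunk_size=50):
    """Return language-tagged string literals for a SPARQL VALUES clause,
    with at most chunk_size literals per line.

    Hypothetical helper for illustration only.
    """
    literals = []
    for label in labels:
        # escape backslashes and double quotes so each literal stays valid
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        literals.append(f'"{escaped}"@{lang}')
    lines = [
        " ".join(literals[start:start + chunk_size])
        for start in range(0, len(literals), chunk_size)
    ]
    return "\n".join(lines)


# two literals per line to keep the example output short
print(build_values_lines(["Silicon dioxide", "lime", "alumina"], chunk_size=2))
```

Stepping through the list with `range(0, len(literals), chunk_size)` avoids an empty trailing line when the number of labels is an exact multiple of the chunk size.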


In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON, CSV
from tabulate import tabulate

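# normalize whitespace in the noun chunks found by spaCy
nouns=[noun.replace("\n"," ").strip() for noun in nouns]
# building the query string: at most 50 labels per line of the VALUES clause;
# backslashes and double quotes are escaped to keep the string literals valid
escapedNouns=[noun.replace('\\','\\\\').replace('"','\\"') for noun in nouns]
values=[' '.join([f'"{noun}"@en' for noun in escapedNouns[n:n+50]]) for n in range(0,len(escapedNouns),50)]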
substanceQueryStr = """
SELECT DISTINCT ?substance
WHERE { 
  VALUES ?substanceLabel {%s}
  ?substance wdt:P31 wd:Q11173;
             rdfs:label|skos:altLabel ?substanceLabel.
}
LIMIT 50
""" 
substanceQueryStr = substanceQueryStr % ("\n".join(values))
# executing the query
substanceQuery=Query(query=substanceQueryStr,
            name="Recognized chemical compounds",
            lang="sparql")
substanceQueryResLoD=endpoint.queryAsListOfDicts(substanceQuery.query)

substances=[record.get('substance') for record in substanceQueryResLoD]

# pretty printout of the result
doc=substanceQuery.documentQueryResult(substanceQueryResLoD, tablefmt="github",floatfmt=".0f",tryItUrl="https://query.wikidata.org")
display(Markdown(str(doc)))

## Extract additional information about the entities
The substances found can now be used to query additional information about them. Since some information might not be available for a substance, the properties are queried in an [OPTIONAL clause](https://www.w3.org/TR/sparql11-query/#optionals), so that an entity is still included even if one of the triple patterns does not match.
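
Before the entity URIs are placed into a VALUES clause, it can help to check that they really are Wikidata entity URIs. The sketch below converts them to the compact `wd:Q…` form, which relies on the `wd:` prefix that the Wikidata query service predefines. The helper name `to_values_entries` and the URI pattern are assumptions for illustration.

```python
import re

# assumed pattern for Wikidata item URIs as returned by the query service
WIKIDATA_ENTITY = re.compile(r"^https?://www\.wikidata\.org/entity/(Q\d+)$")


def to_values_entries(uris):
    """Convert Wikidata entity URIs to space-separated wd:Q... entries,
    silently skipping None values and anything that is not an item URI.

    Hypothetical helper for illustration only.
    """
    entries = []
    for uri in uris:
        match = WIKIDATA_ENTITY.match(uri or "")
        if match:
            entries.append(f"wd:{match.group(1)}")
    return " ".join(entries)


print(to_values_entries(["http://www.wikidata.org/entity/Q116269", None, "not a uri"]))
```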

In [None]:
substanceDetailQueryStr = """
SELECT DISTINCT ?substance ?substanceLabel ?formula ?structure ?CAS
WHERE { 
  VALUES ?substance { %s }
  ?substance wdt:P31 wd:Q11173.
  OPTIONAL{
      ?substance wdt:P231 ?CAS.
  }
  OPTIONAL{
      ?substance wdt:P274 ?formula.
  }
  OPTIONAL{
      ?substance wdt:P117  ?structure.
  }
  OPTIONAL{
      ?substance rdfs:label ?substanceLabel.
      FILTER(lang(?substanceLabel)="en")
  }                    
}
LIMIT 50

""" % " ".join([f"<{substance}>" for substance in substances])

substanceDetailQuery=Query(query=substanceDetailQueryStr,
            name="Chemical compound details",
            lang="sparql")
substanceDetailQueryResLoD=endpoint.queryAsListOfDicts(substanceDetailQuery.query)

# pretty printout of the result
doc=substanceDetailQuery.documentQueryResult(substanceDetailQueryResLoD, tablefmt="github",floatfmt=".0f",tryItUrl="https://query.wikidata.org")
display(Markdown(str(doc)))
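
Because each OPTIONAL property may have several values, the same substance can show up in more than one result row. The following sketch merges such rows into one record per substance. It assumes the result is a list of dicts keyed by the SELECT variable names. The helper name and the sample rows are made up for illustration (the CAS values are placeholders, not real registry numbers).

```python
from collections import defaultdict


def group_rows_by_substance(rows, key="substance"):
    """Merge rows that share the same substance URI and collect the
    distinct values of every other column in a sorted list.

    Hypothetical helper for illustration only.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for row in rows:
        # touch the entry first so substances without optional values are kept
        entry = grouped[row[key]]
        for column, value in row.items():
            if column != key and value is not None:
                entry[column].add(value)
    return {
        substance: {column: sorted(values) for column, values in columns.items()}
        for substance, columns in grouped.items()
    }


# illustrative rows, not actual query output
sample_rows = [
    {"substance": "http://www.wikidata.org/entity/Q116269",
     "substanceLabel": "silicon dioxide", "CAS": "CAS-A"},
    {"substance": "http://www.wikidata.org/entity/Q116269",
     "substanceLabel": "silicon dioxide", "CAS": "CAS-B"},
]
print(group_rows_by_substance(sample_rows))
```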