# HERCULES-EXTRACTION

HERCULES-EXTRACTION is a tool for extracting named entities from a text. It relies on existing tools and APIs that are built for English texts, so we first [translate](#Translation) the text to English. We then [extract entities](#Entity-Extraction) from the translated text using different tools and APIs, and use a [coreference resolution](#Coreference-Resolution) approach to filter the extracted entities. Finally, we [translate the entities back](#Translation-Back) and [export](#Export) the triples to an RDF format.

## Set Up

In [None]:
import os
from pathlib import Path
import requests
import subprocess 
import sys
import zipfile

Set up the notebook.

In [None]:
setup_path = Path('setup')

**Prerequisites**:
- Java 8

Install the requirements.

In [None]:
!{sys.executable} -m pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Set up all the components.

**AzureTranslator**

Set the `AZURE_TOKEN` environment variable to an Azure Text API key. 

In [None]:
os.environ['AZURE_TOKEN'] = ''

**GoogleCloudTranslator** and **GoogleEntityExtractor**

Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to a Google service account JSON keyfile.

In [None]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = ''

**MyMemoryTranslator**

Set the `MYMEMORY_TOKEN` environment variable to a MyMemory API key.

In [None]:
os.environ['MYMEMORY_TOKEN'] = ''

**DandelionEntityExtractor**

Set the `DANDELION_TOKEN` environment variable to a Dandelion API key.

In [None]:
os.environ['DANDELION_TOKEN'] = ''
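Each component above reads its key from the environment, so an empty value only shows up later as a failed API call. The following is a minimal sketch of an early check; `REQUIRED_VARS` and `missing_env_vars` are hypothetical helpers and not part of the project.

```python
import os

# Variable names taken from the setup cells above.
REQUIRED_VARS = [
    'AZURE_TOKEN',
    'GOOGLE_APPLICATION_CREDENTIALS',
    'MYMEMORY_TOKEN',
    'DANDELION_TOKEN',
]


def missing_env_vars(names, environ=os.environ):
    """Return the names whose value is unset or blank."""
    return [name for name in names if not environ.get(name, '').strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print('Not configured:', ', '.join(missing))
```

Only the variables for the components you actually run need to be set, so the list can be trimmed accordingly.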

**StanfordCoreferenceResolver**

Download the Stanford CoreNLP server.

In [None]:
corenlp_zip_path = setup_path / 'stanford-corenlp-full-2018-10-05.zip'
corenlp_dir_path = setup_path / 'stanford-corenlp-full-2018-10-05'
corenlp_url = 'http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip'

setup_path.mkdir(parents=True, exist_ok=True) 

if not corenlp_zip_path.is_file():
    response = requests.get(corenlp_url)
    response.raise_for_status()  # do not save an HTTP error page as the zip
    with corenlp_zip_path.open('wb') as f:
        f.write(response.content)

if not corenlp_dir_path.is_dir():
    with zipfile.ZipFile(corenlp_zip_path, 'r') as zip_ref:
        zip_ref.extractall(setup_path)
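The CoreNLP archive is large, and the cell above holds the whole response in memory before writing it. As an alternative, here is a sketch of a streaming download that uses only the standard library (`urllib` rather than the `requests` call used in the notebook). `download_file` is a hypothetical helper.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, destination, chunk_size=1024 * 1024):
    """Stream url to destination without holding the whole body in memory."""
    destination = Path(destination)
    partial = destination.with_name(destination.name + '.part')
    # urlopen raises an error for HTTP failures such as 404.
    with urllib.request.urlopen(url) as response, partial.open('wb') as f:
        shutil.copyfileobj(response, f, chunk_size)
    # Rename only after a complete download, so an interrupted run
    # does not leave a truncated file that looks finished.
    partial.replace(destination)
    return destination
```

With this helper, the download step could be written as `download_file(corenlp_url, corenlp_zip_path)`.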

Start the Stanford CoreNLP server.

In [None]:
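corenlp_server = subprocess.Popen([
    'java', '-Xmx5G', '-cp', str(corenlp_dir_path.resolve() / '*'),
    'edu.stanford.nlp.pipeline.StanfordCoreNLPServer',
    # Each flag and its value must be separate list items; a combined
    # item such as '-port 9000' reaches Java as one unrecognized argument.
    '-port', '9000', '-timeout', '60000', '-threads', '5',
    '-maxCharLength', '100000', '-quiet', 'True',
    '-preload', 'tokenize,ssplit,pos,lemma,ner,parse,coref',
])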
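`subprocess.Popen` returns immediately, before the server is listening. Here is a sketch of a readiness check that polls the TCP port; `wait_for_port` is a hypothetical helper. An open port only shows that the server accepts connections, not that every annotator model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            time.sleep(interval)
    return False
```

In this notebook the call would be `wait_for_port('localhost', 9000)`, matching the port passed to the server.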

Set the `CORENLP_HOME` environment variable to the path of the Stanford CoreNLP directory.

In [None]:
os.environ['CORENLP_HOME'] = str(corenlp_dir_path.resolve())

## Initial Configuration

In [None]:
import translation
import extraction
import coreference
import export

Read the text from a file. For this example we picked a text about Notre-Dame Basilica.

In [None]:
text_language = 'fr'
extraction_language = 'en'

text_path = Path('sample', 'default', 'text.txt')
text = text_path.read_text(encoding='utf-8')
print(text)

## Translation

Translate the text to English.

**AzureTranslator**

This translator uses the Azure Text API.

In [None]:
azure_translator = translation.AzureTranslator()
azure_translated_text = azure_translator.translate(text, text_language, extraction_language)
print(azure_translated_text)

**GoogleCloudTranslator**

This translator uses the Google Cloud Translation API.

In [None]:
google_translator = translation.GoogleCloudTranslator()
google_translated_text = google_translator.translate(text, text_language, extraction_language)
print(google_translated_text)

**GoogletransTranslator**

This translator uses the Google Translate website.

In [None]:
googletrans_translator = translation.GoogletransTranslator()
googletrans_translated_text = googletrans_translator.translate(text, text_language, extraction_language)
print(googletrans_translated_text)

**MyMemoryTranslator**

This translator uses the MyMemory API.

In [None]:
my_memory_translator = translation.MyMemoryTranslator()
my_memory_translated_text = my_memory_translator.translate(text, text_language, extraction_language)
print(my_memory_translated_text)

Let's take the MyMemory translated text for the following tasks.

In [None]:
translator = my_memory_translator
translated_text = my_memory_translated_text
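All four translators are called the same way, `translate(text, source, target)`, so they can be swapped freely. Here is a sketch of choosing one automatically by trying them in order. `translate_with_fallback` is a hypothetical helper, and it assumes that a failing translator raises an exception.

```python
def translate_with_fallback(translators, text, source_language, target_language):
    """Try each translator in order and return (translator, translated_text)
    from the first one that does not raise."""
    errors = []
    for translator in translators:
        try:
            return translator, translator.translate(text, source_language, target_language)
        except Exception as error:  # assumption: failures surface as exceptions
            errors.append(f'{type(translator).__name__}: {error}')
    raise RuntimeError('All translators failed: ' + '; '.join(errors))
```

Returning the translator along with the text matters here because the same translator is reused later to translate the entities back.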

## Entity Extraction

Extract the entities from the translated text.

**DandelionEntityExtractor**

This entity extractor uses the Dandelion API.

In [None]:
dandelion_entity_extractor = extraction.DandelionEntityExtractor()
dandelion_entities = dandelion_entity_extractor.extract_entities(translated_text)
for entity in dandelion_entities:
    print(entity)

**GoogleEntityExtractor**

This entity extractor uses the Google Natural Language API.

In [None]:
google_entity_extractor = extraction.GoogleEntityExtractor()
google_entities = google_entity_extractor.extract_entities(translated_text)
for entity in google_entities:
    print(entity)

Let's take the Dandelion entities for the following tasks.

In [None]:
extracted_entities = dandelion_entities
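Instead of picking a single extractor, the two result lists could be combined. The sketch below merges lists and removes duplicates by case-insensitive name. `Entity` is a stand-in for `extraction.Entity` that carries only the `name` and `entity_type` attributes used elsewhere in this notebook, and `merge_entities` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for extraction.Entity; only `name` and `entity_type` are relied on.
Entity = namedtuple('Entity', ['name', 'entity_type'])


def merge_entities(*entity_lists):
    """Concatenate entity lists, keeping the first entity seen for each name."""
    merged = {}
    for entities in entity_lists:
        for entity in entities:
            # casefold() gives a case-insensitive comparison key.
            merged.setdefault(entity.name.casefold(), entity)
    return list(merged.values())
```

Because the first occurrence wins, the order of the arguments decides which extractor's type is kept when both report the same name.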

## Coreference Resolution

In [None]:
skip_coreference = False

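Filter the previously extracted entities using coreference resolution. Each mention groups the entities that refer to the same thing, and we keep one representative entity per mention.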

In [None]:
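def get_relevant_entity_from_mention(mention):
    """Pick one representative entity from a group of coreferent entities."""
    if not mention:
        return None
    # Prefer the first entity with a specific type over a generic THING.
    for entity in mention:
        if entity.entity_type != extraction.EntityType.THING:
            return entity
    # Every entity is a THING: fall back to the first one.
    return mention[0]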
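To illustrate the selection rule, here is a self-contained example with stand-in types. `EntityType.THING` comes from the notebook; the `PLACE` member and the `Entity` tuple are assumptions made only for this illustration.

```python
from collections import namedtuple
from enum import Enum


class EntityType(Enum):
    # THING appears in the notebook; PLACE is assumed for illustration.
    THING = 'thing'
    PLACE = 'place'


Entity = namedtuple('Entity', ['name', 'entity_type'])


def get_relevant_entity_from_mention(mention):
    # Same logic as the cell above, against the stand-in EntityType.
    if not mention:
        return None
    for entity in mention:
        if entity.entity_type != EntityType.THING:
            return entity
    return mention[0]


mention = [
    Entity('the basilica', EntityType.THING),
    Entity('Notre-Dame Basilica', EntityType.PLACE),
]
print(get_relevant_entity_from_mention(mention).name)  # prints: Notre-Dame Basilica
```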

**StanfordCoreferenceResolver**

This coreference resolver uses a local instance of the Stanford CoreNLP server. If you get a `Read timed out` error, you can skip this step by setting `skip_coreference` to `True`.

In [None]:
if not skip_coreference:
    stanford_coreference_resolver = coreference.StanfordCoreferenceResolver(start_server=False, endpoint='http://localhost:9000')
    stanford_mentions = stanford_coreference_resolver.resolve_coreferences(translated_text, extracted_entities)

    stanford_filtered_entities = []
    for mention in stanford_mentions:
        entity = get_relevant_entity_from_mention(mention)
        if entity is not None:
            stanford_filtered_entities.append(entity)

    for entity in stanford_filtered_entities:
        print(entity)

Let's take the Stanford filtered entities for the following tasks, or the unfiltered entities if coreference resolution was skipped.

In [None]:
if skip_coreference:
    filtered_entities = extracted_entities
else:
    filtered_entities = stanford_filtered_entities

## Translation Back

Translate the filtered entities back to the language of the original text.

In [None]:
translated_back_entities = []
for entity in filtered_entities:
    entity_name = translator.translate(entity.name, extraction_language, text_language)
    translated_back_entity = extraction.Entity(entity_name, entity.entity_type, None, None)
    translated_back_entities.append(translated_back_entity)
    print(translated_back_entity)
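The loop above makes one API call per entity, even when the same name occurs several times. Here is a sketch of memoizing the translation with `functools.lru_cache`; `make_cached_translate` is a hypothetical helper, and it assumes the translator returns the same result for the same input.

```python
from functools import lru_cache


def make_cached_translate(translator, source_language, target_language, maxsize=1024):
    """Return a function that translates each distinct name only once."""
    @lru_cache(maxsize=maxsize)
    def translate_name(name):
        return translator.translate(name, source_language, target_language)
    return translate_name
```

In the loop, `translator.translate(entity.name, extraction_language, text_language)` would then become `translate_name(entity.name)`, where `translate_name = make_cached_translate(translator, extraction_language, text_language)`.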

## Export

Export the translated entities to an RDF file.

In [None]:
export_path = Path('notebook-export')
export_path.mkdir(parents=True, exist_ok=True) 

**CIDOCCRMExporter**

This exporter is specifically crafted to work with the [CIDOC CRM ontology](http://www.cidoc-crm.org/).

In [None]:
export_language = 'turtle'
entity_namespace = 'http://culture.gouv.qc.ca/entity/'
ontology_namespace = 'http://www.cidoc-crm.org/cidoc-crm/'
cidoccrm_export_path = export_path / 'cidoccrm.ttl'

cidoccrm_exporter = export.CIDOCCRMExporter()
cidoccrm_export = cidoccrm_exporter.export(translated_back_entities, entity_namespace, ontology_namespace, export_language)

cidoccrm_export_path.write_text(cidoccrm_export, encoding='utf-8')
print(cidoccrm_export)
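For readers unfamiliar with Turtle, the sketch below shows roughly what typed entities look like once serialized. It is not the `CIDOCCRMExporter` implementation: the type-to-class mapping, the IRI scheme, and the `to_turtle` helper are all assumptions, and entity names are assumed to contain no line breaks. `E21_Person`, `E53_Place`, and `E1_CRM_Entity` are CIDOC CRM class names.

```python
from urllib.parse import quote

# Assumed mapping from entity type names to CIDOC CRM classes; the
# real CIDOCCRMExporter may map types differently.
CRM_CLASSES = {'PERSON': 'E21_Person', 'PLACE': 'E53_Place'}
DEFAULT_CLASS = 'E1_CRM_Entity'


def to_turtle(entities, entity_namespace, ontology_namespace):
    """Serialize (name, type_name) pairs as minimal Turtle."""
    lines = [
        f'@prefix crm: <{ontology_namespace}> .',
        '@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .',
        '',
    ]
    for name, type_name in entities:
        # Percent-encode the name so the IRI contains no illegal characters.
        iri = entity_namespace + quote(name.replace(' ', '_'), safe='')
        crm_class = CRM_CLASSES.get(type_name, DEFAULT_CLASS)
        # Escape backslashes and double quotes inside the string literal.
        label = name.replace('\\', '\\\\').replace('"', '\\"')
        lines.append(f'<{iri}> a crm:{crm_class} ;')
        lines.append(f'    rdfs:label "{label}" .')
    return '\n'.join(lines) + '\n'
```

A library such as rdflib would be a more robust choice for real output, since it handles escaping and serialization formats itself.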

## Clean Up

Kill the CoreNLP server spawned by this notebook.

In [None]:
corenlp_server.kill()
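`kill()` stops the process immediately and does not wait for it to exit. As a gentler alternative, here is a sketch that asks the process to terminate first and force-kills it only after a timeout. `stop_process` is a hypothetical helper.

```python
import subprocess


def stop_process(process, timeout=10.0):
    """Ask process to exit, then force-kill it if it has not exited in time."""
    if process.poll() is not None:
        return process.returncode  # already exited
    process.terminate()
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        return process.wait()
```

Here the call would be `stop_process(corenlp_server)`.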