# Exploring Morphological data after a major refactoring of the French extraction program

Before using this notebook, run the new version of the extraction program and compute the diffs against the previous extraction, especially for the morphology data.

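A local Fuseki server instance must be installed and accessible through the http://localhost:3030 URL. The code cells below address it as `http://host.docker.internal:3030`, the name under which a Docker container reaches its host; use `http://localhost:3030` instead if the notebook runs directly on the host.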

A new TDB2 dataset called `fr-morpho` must be created, containing three named graphs loaded with the following data:
* `http://kaiko.getalp.org/fra/lost` contains the lost morphological data
* `http://kaiko.getalp.org/fra/gain` contains the gained morphological data
* `http://kaiko.getalp.org/fra/` contains the new (complete) morphological data
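
The setup described above could be scripted roughly as follows. This is only a sketch under several assumptions: the file names are hypothetical placeholders for the actual diff outputs, the dataset is created through Fuseki's HTTP admin endpoint (`/$/datasets`), and the graphs are loaded through the Graph Store Protocol service under its usual default name (`/data`). For a graph as large as the complete one, a bulk loader is probably a better choice than an HTTP upload.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

FUSEKI = "http://localhost:3030"  # server URL given above
DATASET = "fr-morpho"

# Hypothetical file names: adapt them to the actual extraction and diff outputs.
GRAPHS = {
    "http://kaiko.getalp.org/fra/lost": "lost.ttl",
    "http://kaiko.getalp.org/fra/gain": "gain.ttl",
    "http://kaiko.getalp.org/fra/": "fr_morphology.ttl",
}

def create_dataset_request(base=FUSEKI, name=DATASET):
    """Build the admin request that creates a TDB2 dataset."""
    body = urlencode({"dbName": name, "dbType": "tdb2"}).encode()
    return Request(base + "/$/datasets", data=body, method="POST")

def upload_request(graph_uri, turtle_bytes, base=FUSEKI, name=DATASET):
    """Build a Graph Store Protocol PUT that replaces one named graph."""
    url = "%s/%s/data?graph=%s" % (base, name, quote(graph_uri, safe=""))
    return Request(url, data=turtle_bytes, method="PUT",
                   headers={"Content-Type": "text/turtle"})

def load_all():
    """Send the requests; needs a running Fuseki server and the files above."""
    urlopen(create_dataset_request())
    for graph_uri, path in GRAPHS.items():
        with open(path, "rb") as f:
            urlopen(upload_request(graph_uri, f.read()))
```

Nothing is sent until `load_all()` is called, so the request builders can be inspected first (for instance `upload_request(...).full_url`).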

This is a Python notebook that runs SPARQL queries, following the approach originally explained by Bob DuCharme. A SPARQL kernel also exists for Jupyter, but it is poorly suited to my use case: for instance, it cannot display the result of a DESCRIBE query in Turtle, and it seems to be unusable with certain configurations for accessing a Fuseki server.

In [34]:
import rdflib
from IPython.display import display, HTML
from SPARQLWrapper import SPARQLWrapper  # used by getDescription below

def queryResultToHTMLTable(queryResult):
    HTMLResult = '<table><tr style="color:white;background-color:gray;font-weight:bold">'
    # print variable names
    for varName in queryResult.vars:
        HTMLResult = HTMLResult + '<td>' + varName + '</td>'
    HTMLResult = HTMLResult + '</tr>'
    # print values from each row
    for row in queryResult:
        HTMLResult = HTMLResult + '<tr>'
        for column in row:
            # unbound variables come back as None: show them as empty cells
            HTMLResult = HTMLResult + '<td>' + ('' if column is None else str(column)) + '</td>'
        HTMLResult = HTMLResult + '</tr>'
    HTMLResult = HTMLResult + '</table>'
    display(HTML(HTMLResult))

def asTable(queryResult):
    HTMLResult = '<table><tr style="color:white;background-color:gray;font-weight:bold">'
    # print variable names
    for varName in queryResult["head"]["vars"]:
        HTMLResult = HTMLResult + '<td>' + varName + '</td>'
    HTMLResult = HTMLResult + '</tr>'
    # print values from each row
    for row in queryResult["results"]["bindings"]:
        HTMLResult = HTMLResult + '<tr>'
        for varName in queryResult["head"]["vars"]:
            # a variable left unbound in this row is absent from the binding
            HTMLResult = HTMLResult + '<td>' + row.get(varName, {}).get("value", "") + '</td>'
        HTMLResult = HTMLResult + '</tr>'
    HTMLResult = HTMLResult + '</table>'
    display(HTML(HTMLResult))

import html

def raw(data):
    HTMLResult = '<pre><code>'
    dataAsString = data if isinstance(data, str) else (data.decode() if isinstance(data, bytes) else "")
    HTMLResult += html.escape(dataAsString)
    HTMLResult += '</code></pre>'
    return HTMLResult

def asRaw(data):
    display(HTML(raw(data)))

# Utility functions to get the description of every URI in a set of bindings
def splitPrefixes(turtleCode):
    prefixes = {}
    content = ''
    inprefix = True
    for line in turtleCode.splitlines():
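        if inprefix and line.strip().lower().startswith("@prefix "):
            # "@prefix ns: <iri> ." -> key "ns", value "<iri>" (trailing dot dropped)
            k, v = line.strip()[8:].strip().split(":", 1)
            prefixes[k.strip()] = v.strip().rstrip('.').strip()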
        else:
            inprefix = False
            content += line + '\n'
    return content, prefixes
            
def toSPARQLPrefixString(prefixes):
    prefString = ''
    for prefix in prefixes:
        prefString += "PREFIX %s: %s\n" % (prefix, prefixes[prefix])
    return prefString

def getDescription(uri, prefixes):
    if uri.startswith('http'):
        uri = '<' + uri + '>'
    queryText = toSPARQLPrefixString(prefixes) + """DESCRIBE %s""" % uri
    connection = SPARQLWrapper("http://host.docker.internal:3030/fr-morpho")
    connection.setQuery(queryText)
    g = connection.query().convert()
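    turtle = g.serialize(format='turtle')
    # rdflib < 6 returns bytes here, rdflib >= 6 returns str
    if isinstance(turtle, bytes):
        turtle = turtle.decode()
    return splitPrefixes(turtle)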


def describeUris(smartWrapper, prefixes):
    detailled = []
    for binding in smartWrapper.bindings :
        detrow = []
        for var in smartWrapper.variables:
            if binding[var].type == u"uri":
                detrow.append(getDescription(binding[var].value, prefixes))
            else:
                detrow.append(binding[var].value)
        detailled.append(detrow)
    return detailled

def detailledTableHtml(smartWrapper, prefixes):
    HTMLResult = '<table><tr style="color:white;background-color:gray;font-weight:bold">'
    # print variable names
    for varName in smartWrapper.variables:
        HTMLResult = HTMLResult + '<td>' + varName + '</td>'
    HTMLResult = HTMLResult + '</tr>'
    # print values from each row
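    for row in describeUris(smartWrapper, prefixes):
        HTMLResult = HTMLResult + '<tr>'
        for val in row:
            # URIs come back as (turtle, prefixes) tuples, other values as plain strings
            content = val[0] if isinstance(val, tuple) else val
            HTMLResult = HTMLResult + '<td>' + raw(content) + '</td>'
        HTMLResult = HTMLResult + '</tr>'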
    HTMLResult = HTMLResult + '</table>'
    return HTMLResult

def asDetailledTable(smartWrapper, prefixes):
    display(HTML(detailledTableHtml(smartWrapper, prefixes)))
    
# simple functions
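def describe(uri, prefixMap=None):
    # the global `prefixes` is looked up at call time, so it may be defined in a later cell
    turtle, _ = getDescription(uri, prefixes if prefixMap is None else prefixMap)
    asRaw(turtle)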
    

In the preceding code, the set of predefined prefixes is given by the variable `prefixes`, a dictionary that maps each prefix to its namespace IRI written between angle brackets.


In [9]:
prefixes = {
    'ontolex': '<http://www.w3.org/ns/lemon/ontolex#>',
    'lexinfo': '<http://www.lexinfo.net/ontology/2.0/lexinfo#>',
    'fra': '<http://kaiko.getalp.org/dbnary/fra/>'
}
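
To make the role of this dictionary concrete, here is a small self-contained sketch of the query text that the helpers build from it. `toSPARQLPrefixString` is copied from the helper cell, `example_prefixes` simply repeats the values above, and the entry URI is the one described at the end of this notebook.

```python
def toSPARQLPrefixString(prefixes):
    # Same logic as the helper defined earlier: one PREFIX line per entry
    prefString = ''
    for prefix in prefixes:
        prefString += "PREFIX %s: %s\n" % (prefix, prefixes[prefix])
    return prefString

example_prefixes = {
    'ontolex': '<http://www.w3.org/ns/lemon/ontolex#>',
    'lexinfo': '<http://www.lexinfo.net/ontology/2.0/lexinfo#>',
    'fra': '<http://kaiko.getalp.org/dbnary/fra/>'
}

header = toSPARQLPrefixString(example_prefixes)
query = header + "DESCRIBE %s" % 'fra:post-modernisme__nom__1'
print(query)
# PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
# PREFIX lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#>
# PREFIX fra: <http://kaiko.getalp.org/dbnary/fra/>
# DESCRIBE fra:post-modernisme__nom__1
```

Because the values already carry their angle brackets, they can be pasted into the `PREFIX` lines unchanged.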


First, let's check that the Fuseki server has the required graphs and count the triples in each of them:

In [3]:
from SPARQLWrapper import SPARQLWrapper, JSON

queryString = """SELECT ?g (count(?s) as ?count)
WHERE {
    GRAPH ?g {?s ?p ?o}
} GROUP BY ?g"""
sparql = SPARQLWrapper("http://host.docker.internal:3030/fr-morpho",returnFormat=JSON)

sparql.setQuery(queryString)

# ret will be converted depending on the kind of result (here, we asked for a JSON result) 
ret = sparql.query().convert()

In [4]:
ret

{'head': {'vars': ['g', 'count']},
 'results': {'bindings': [{'g': {'type': 'uri',
     'value': 'http://kaiko.getalp.org/fra/lost'},
    'count': {'type': 'literal',
     'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
     'value': '943852'}},
   {'g': {'type': 'uri', 'value': 'http://kaiko.getalp.org/fra/'},
    'count': {'type': 'literal',
     'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
     'value': '19515113'}},
   {'g': {'type': 'uri', 'value': 'http://kaiko.getalp.org/fra/gain'},
    'count': {'type': 'literal',
     'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
     'value': '108377'}}]}}

In [5]:
asTable(ret)

| g | count |
| --- | --- |
| http://kaiko.getalp.org/fra/lost | 943852 |
| http://kaiko.getalp.org/fra/ | 19515113 |
| http://kaiko.getalp.org/fra/gain | 108377 |
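
The same check can be done programmatically instead of by eye. The sketch below is one possible way to do it: `sample` reproduces the structure of the JSON result shown above (datatypes omitted for brevity), and `graph_counts` and `required` are names introduced here for illustration only.

```python
# Structure copied from the JSON result above (datatype fields omitted)
sample = {
    'head': {'vars': ['g', 'count']},
    'results': {'bindings': [
        {'g': {'type': 'uri', 'value': 'http://kaiko.getalp.org/fra/lost'},
         'count': {'type': 'literal', 'value': '943852'}},
        {'g': {'type': 'uri', 'value': 'http://kaiko.getalp.org/fra/'},
         'count': {'type': 'literal', 'value': '19515113'}},
        {'g': {'type': 'uri', 'value': 'http://kaiko.getalp.org/fra/gain'},
         'count': {'type': 'literal', 'value': '108377'}},
    ]},
}

def graph_counts(result):
    """Map each graph URI to its triple count, converted to int."""
    return {b['g']['value']: int(b['count']['value'])
            for b in result['results']['bindings']}

required = [
    'http://kaiko.getalp.org/fra/lost',
    'http://kaiko.getalp.org/fra/gain',
    'http://kaiko.getalp.org/fra/',
]

counts = graph_counts(sample)
missing = [g for g in required if g not in counts]
print(missing)                                      # [] when all three graphs exist
print(counts['http://kaiko.getalp.org/fra/lost'])   # 943852
```

In the notebook itself, `graph_counts(ret)` would be applied to the actual query result rather than to `sample`.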


In [10]:
from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper2

changesWithDifferentWrittenRep = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT ?nwf ?owf
FROM <http://kaiko.getalp.org/fra/gain>
FROM NAMED <http://kaiko.getalp.org/fra/lost>
WHERE {
  ?le ontolex:otherForm ?nwf . ?nwf ontolex:writtenRep ?wr.
  GRAPH <http://kaiko.getalp.org/fra/lost> {
   ?le ontolex:otherForm ?owf . 
   ?owf ontolex:writtenRep ?wr ; 
        ontolex:phoneticRep ?pr.
   FILTER (! contains(str(?pr), ' ou '))
  }
}
LIMIT 50"""
sparql = SPARQLWrapper2("http://host.docker.internal:3030/fr-morpho")
sparql.setQuery(changesWithDifferentWrittenRep)

differentWrittenRep = sparql.query().convert()
differentWrittenRep

<SPARQLWrapper.SmartWrapper.Bindings at 0xffff8128d550>

In [15]:
asDetailledTable(differentWrittenRep, prefixes)

| nwf | owf |
| --- | --- |
| `fra:__wf_meU6Mw--_qi_gong__nom__1 a ontolex:Form ; lexinfo:gender lexinfo:masculine ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "tʃi ɡɔ̃ɡ"@fr-fonipa, "ʃi ɡɔ̃ɡ"@fr-fonipa ; ontolex:writtenRep "qi gongs"@fr .` | `fra:__wf_HLKxQg--_qi_gong__nom__1 a ontolex:Form ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "ʃi ɡɔ̃ɡ"@fr-fonipa ; ontolex:writtenRep "qi gongs"@fr .` |
| `fra:__wf_dOI7WQ--_pré-cognition__nom__1 a ontolex:Form ; lexinfo:gender lexinfo:feminine ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "pʁe.kɔ.ni.sjɔ̃"@fr-fonipa, "pʁe.kɔɡ.ni.sjɔ̃"@fr-fonipa ; ontolex:writtenRep "pré-cognitions"@fr .` | `fra:__wf_BFncfw--_pré-cognition__nom__1 a ontolex:Form ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "pʁe.kɔɡ.ni.sjɔ̃"@fr-fonipa ; ontolex:writtenRep "pré-cognitions"@fr .` |
| `fra:__wf_ejvv7w--_Ahuilléenne__nom__1 a ontolex:Form ; lexinfo:gender lexinfo:feminine ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "a.u.je.ɛn"@fr-fonipa, "a.ɥi.le.ɛn"@fr-fonipa ; ontolex:writtenRep "Ahuilléennes"@fr .` | `fra:__wf_zBCDfw--_Ahuilléenne__nom__1 a ontolex:Form ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "a.ɥi.le.ɛn"@fr-fonipa ; ontolex:writtenRep "Ahuilléennes"@fr .` |
| `fra:__wf_GrwevA--_conseiller__nom__1 a ontolex:Form ; lexinfo:gender lexinfo:masculine ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "kɔ̃.se.je"@fr-fonipa, "kɔ̃.sɛ.je"@fr-fonipa ; ontolex:writtenRep "conseillers"@fr .` | `fra:__wf_sBK97w--_conseiller__nom__1 a ontolex:Form ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "kɔ̃.sɛ.je"@fr-fonipa ; ontolex:writtenRep "conseillers"@fr .` |
| `fra:__wf_UwNQQg--_aiguillon__nom__1 a ontolex:Form ; lexinfo:gender lexinfo:masculine ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "e.ɡɥi.jɔ̃"@fr-fonipa, "ɛ.ɡɥi.jɔ̃"@fr-fonipa ; ontolex:writtenRep "aiguillons"@fr .` | `fra:__wf_JNVmzw--_aiguillon__nom__1 a ontolex:Form ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "e.ɡɥi.jɔ̃"@fr-fonipa ; ontolex:writtenRep "aiguillons"@fr .` |
| `fra:__wf_xeKwsw--_hindī__nom__1 a ontolex:Form ; lexinfo:gender lexinfo:masculine ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "in.di"@fr-fonipa ; ontolex:writtenRep "hindīs"@fr .` | `fra:__wf_l9Ofyg--_hindī__nom__1 a ontolex:Form ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "in.di"@fr-fonipa ; ontolex:writtenRep "hindīs"@fr .` |
| `fra:__wf_v7bouQ--_tue-chien__nom__1 a ontolex:Form ; lexinfo:gender lexinfo:masculine ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "ty.ʃjɛ̃"@fr-fonipa ; ontolex:writtenRep "tue-chiens"@fr .` | `fra:__wf_kafX0A--_tue-chien__nom__1 a ontolex:Form ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "ty.ʃjɛ̃"@fr-fonipa ; ontolex:writtenRep "tue-chiens"@fr .` |
| `fra:__wf_shSlJA--_paan__nom__1 a ontolex:Form ; lexinfo:gender lexinfo:masculine ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "pan"@fr-fonipa, "pɑːn"@fr-fonipa ; ontolex:writtenRep "paans"@fr .` | `fra:__wf_AeJuuw--_paan__nom__1 a ontolex:Form ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "pan"@fr-fonipa ; ontolex:writtenRep "paans"@fr .` |
| `fra:__wf_up6-dQ--_enculé_de_ta_race__nom__1 a ontolex:Form ; lexinfo:gender lexinfo:feminine ; lexinfo:number lexinfo:plural ; ontolex:phoneticRep "ɑ̃.ky.ˌle.də.lœʁ.ˈʁas"@fr-fonipa ; ontolex:writtenRep "enculées de leur race"@fr .` | `fra:__wf_QS5LYQ--_enculé_de_ta_race__nom__1 a ontolex:Form ; lexinfo:gender lexinfo:feminine ; lexinfo:number lexinfo:plural ; lexinfo:person lexinfo:thirdPerson ; ontolex:phoneticRep "ɑ̃.ky.ˌle.də.lœʁ.ˈʁas"@fr-fonipa ; ontolex:writtenRep "enculées de leur race"@fr .` |


In [35]:
describe('fra:post-modernisme__nom__1', prefixes)