
# Export It-Sr-NER to NIF

LDL conference 2024: **Ranka Stanković**, **Milica Ikonić Nešić**

open data: use case for 2 languages

Last update 04.03.2024.


**Description**: The source files are the It-Sr-NER corpus, https://github.com/jerteh/It-Sr-NER/tree/main/corpus/monolingual/ (POS-tagged, lemmatised, with annotated named entities that are also linked to Wikidata). For this purpose, 1000 sentences of parallel text are used.

## install & import part

In [None]:
#!python.exe -m pip install --upgrade pip

In [None]:
# for importing/cloning the repository with novels
#!pip install gitpython
#!pip install rdflib
#!pip install pydotplus
#!pip install mkwikidata

In [None]:
#!pip3 install rdflib sparqlwrapper pydotplus graphviz

In [None]:
# os,sys,glob
import os
import os.path
from os import path
import sys
import glob
import locale
import spacy
import string

# rdf
import rdflib
from rdflib import Graph
from rdflib.namespace import RDF, RDFS, XSD, OWL, DCAT
from rdflib import URIRef, BNode, Literal
import networkx as nx
import io
import pydotplus
from IPython.display import display, Image
from rdflib.tools.rdf2dot import rdf2dot
import mkwikidata
import re

ITSRDF=rdflib.Namespace("http://www.w3.org/2005/11/its/rdf#")
NIF = rdflib.Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
# NERD= rdflib.Namespace("http://nerd.eurecom.fr/ontology#")
DC = rdflib.Namespace("http://purl.org/dc/elements/1.1/")
DCT = rdflib.Namespace("http://purl.org/dc/terms/")
FOAF = rdflib.Namespace("http://xmlns.com/foaf/0.1/")
MS = rdflib.Namespace("http://w3id.org/meta-share/meta-share/")
WD = rdflib.Namespace("http://www.wikidata.org/entity/")
#WD = rdflib.Namespace("http://www.wikidata.org/wiki/")
WDT = rdflib.Namespace("http://www.wikidata.org/prop/direct/")
DBO = rdflib.Namespace("https://dbpedia.org/ontology/")
OLIA = rdflib.Namespace("http://purl.org/olia/discourse/olia_discourse.owl#")
SKOS = rdflib.Namespace("http://www.w3.org/2004/02/skos/core#")


# Metadata description

Metadata sources analysed: https://link.springer.com/chapter/10.1007/978-3-319-18818-8_20, LexMeta: https://lexbib.elex.is/wiki/LexMeta,
https://github.com/pennyl67/LexMeta/blob/main/lexmeta.ttl

Metadata for collections described in Wikidata: https://www.wikidata.org/wiki/Wikidata:WikiProject_ELTeC (first edition, printed edition, ELTeC edition)

Key values prepared for this edition:
* ID - dct:identifier
* title - dct:title (from teiHeader : titleStmt)
* author (name) - ms:author
* authorQID - dc:creator
* novelQID - linked as edition of novel
* publisher - dct:publisher (foaf:name)
* licence - ms:LicenceTerms
* year - ms:publicationDate
* language - dc:Language, ms:Language
* collection - wdt:P1433
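
The key list above can be read as a lookup from metadata field to RDF predicate. The following is a minimal sketch of that reading, not the notebook's own code: the names `PREFIXES`, `METADATA_KEYS` and `expand` are hypothetical, the prefix URIs are the ones declared in the import cell, and only the first predicate is kept where the list gives two (language).

```python
# Prefix URIs copied from the namespace declarations in the import cell.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "ms": "http://w3id.org/meta-share/meta-share/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}

# Hypothetical lookup table: metadata field -> prefixed predicate name.
METADATA_KEYS = {
    "ID": "dct:identifier",
    "title": "dct:title",
    "author": "ms:author",
    "authorQID": "dc:creator",
    "publisher": "dct:publisher",
    "licence": "ms:LicenceTerms",
    "year": "ms:publicationDate",
    "language": "dc:Language",
    "collection": "wdt:P1433",
}


def expand(curie):
    """Expand a prefixed name such as 'dct:title' into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


for field, curie in METADATA_KEYS.items():
    print(field, "->", expand(curie))
```

Keeping the mapping in one table makes it easier to compare against the predicates actually emitted in `Corpusm.init_gcorpus_lng`.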


# Ontology mapping

Initially used:
https://github.com/NERD-project/nerd-ontology/blob/master/nerd.owl  but replaced with:
http://purl.org/olia/discourse/olia_discourse.owl

* http://nerd.eurecom.fr/ontology#Person  (mapped to http://purl.org/olia/discourse/olia_discourse.owl#Person)
* http://nerd.eurecom.fr/ontology#Location (mapped to http://purl.org/olia/discourse/olia_discourse.owl#Space)
* http://nerd.eurecom.fr/ontology#Organization  (mapped to http://purl.org/olia/discourse/olia_discourse.owl#Organization)
* http://nerd.eurecom.fr/ontology#Event  (mapped to http://purl.org/olia/discourse/olia_discourse.owl#Event)
* ROLE - Names of posts and job titles (profession, nobility, office, military)
* DEMO - Demonyms, names of kinds of people: national, regional, political (Frenchwoman; German; Parisiens;...)
* WORK - titles of books, songs, plays, newspaper, paintings, sculptures  and other creations

Additional:
* https://dbpedia.org/ontology/Person, https://www.wikidata.org/wiki/Q5, https://schema.org/Person
* https://dbpedia.org/ontology/Place, https://www.wikidata.org/wiki/Q7884789, https://schema.org/Place
* https://dbpedia.org/ontology/Organisation, https://www.wikidata.org/wiki/Q43229, https://schema.org/Organization
* https://dbpedia.org/ontology/Event, https://www.wikidata.org/wiki/Q1656682, https://schema.org/Event
* https://dbpedia.org/ontology/Profession, https://www.wikidata.org/wiki/Q28640, https://schema.org/Occupation
* demonym as property only: https://dbpedia.org/ontology/demonym, https://www.wikidata.org/wiki/Q217438
* https://dbpedia.org/ontology/Work, https://www.wikidata.org/wiki/Q386724, https://schema.org/CreativeWork,
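
The mapping lists above can be condensed into a single lookup table. This is a sketch under assumptions: `NE_CLASS_MAP` and `classes_for` are hypothetical names, Wikidata classes are written in the `http://www.wikidata.org/entity/` namespace (as the `WD` namespace in the import cell does) rather than the `/wiki/` page URLs listed above, and the schema.org equivalents are left out.

```python
OLIA_NS = "http://purl.org/olia/discourse/olia_discourse.owl#"
WD_NS = "http://www.wikidata.org/entity/"
DBO_NS = "https://dbpedia.org/ontology/"

# Hypothetical table: corpus NE tag -> class URIs taken from the lists above.
NE_CLASS_MAP = {
    "PERS": [OLIA_NS + "Person", WD_NS + "Q5", DBO_NS + "Person"],
    "LOC": [OLIA_NS + "Space", WD_NS + "Q7884789", DBO_NS + "Place"],
    "ORG": [OLIA_NS + "Organization", WD_NS + "Q43229", DBO_NS + "Organisation"],
    "EVENT": [OLIA_NS + "Event", WD_NS + "Q1656682", DBO_NS + "Event"],
    "ROLE": [WD_NS + "Q28640", DBO_NS + "Profession"],
    "DEMO": [WD_NS + "Q217438", DBO_NS + "demonym"],
    "WORK": [WD_NS + "Q386724", DBO_NS + "Work"],
}


def classes_for(ne_type):
    """Return the class URIs for a tag, ignoring a span suffix such as '[3]'."""
    base = ne_type.split("[")[0]
    return NE_CLASS_MAP.get(base, [])


print(classes_for("PERS[3]"))
```

An unknown tag yields an empty list here; the `NamedEntity` class in this notebook instead falls back to a string literal for unexpected types.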


Rizzo, Giuseppe, Raphaël Troncy, Sebastian Hellmann, and Martin Bruemmer. "NERD meets NIF: Lifting NLP extraction results to the linked data cloud." In LDOW. 2012. http://ceur-ws.org/Vol-937/ldow2012-paper-02.pdf

# General classes: Token, Sentence, NamedEntity

TEI+RDF+LLOD literature

https://content.iospress.com/articles/semantic-web/sw222859

* P. Ruiz Fabo, H. Bermúdez Sabel, C. Martínez Cantón and E. González-Blanco, The diachronic Spanish sonnet corpus: TEI and linked open data encoding, data distribution, and metrical findings, Digital Scholarship in the Humanities (2020). doi:10.1093/llc/fqaa035.


* S. Tittel, H. Bermúdez-Sabel and C. Chiarcos, Using RDFa to link text and dictionary data for medieval French, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), European Language Resources Association (ELRA), 2018.

* Khan, Anas Fahad, Christian Chiarcos, Thierry Declerck, Daniela Gifu, Elena González-Blanco García, Jorge Gracia, Maxim Ionov et al. "When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Data." (2021).


In [None]:
# read token properties from a token row (TSV line) and generate graph triples
class Token:

    def __init__(self, token, lng): #token_row
        # 1-1	0-5	Veoma	_	_	ADV	veoma ##example of token
        self.lng = lng  # language code, added 17.10
        tokens = token.split('\t')
        self.text =  tokens[2]
        # print(self.text)
        self.pos = tokens[5]
        self.lemma = "" if self.pos == "PUNCT" else tokens[6].rstrip('\n')  # strip only the trailing newline
        self.id = tokens[0].split('-')[0]

        self.index_start=int(tokens[1].split('-')[0])
        self.index_end=int(tokens[1].split('-')[1]) # start and end positions refer to the whole text

        #2-17	241-247	Čerulo	http://www.wikidata.org/entity/Q122388880[3]	PERS[3]	PROPN	Čerulo
        # for NER (we assume that wherever there is a NER tag there is also a Wikidata link)
        self.NER = False
        if  tokens[4] != "_":
            self.NERtype = tokens[4] #PERS[3]
            self.NERtext = tokens[2]

            wiki = tokens[3]

            if '_' not in wiki:
                if '[' in wiki:
                    self.Wiki = wiki.split('/')[4].split('[')[0]
                else:
                    self.Wiki = wiki.split('/')[4]
            else:
                self.Wiki = ""
            self.NER = True


    # create graph triples for this token
    def init_gtoken(self,g,base_url):
        gtoken= URIRef(base_url+"#char={0},{1}".format(int(self.index_start),int(self.index_end)))
        g.add((gtoken, RDF.type, NIF.String )) # added after validation 26-5-2023, https://core.ac.uk/download/pdf/226128976.pdf
        g.add( (gtoken, RDF.type, NIF.Word ) )
        g.add( (gtoken, RDF.type, NIF.RFC5147String  ) )
   #   g.add( (gtoken, RDF.type, NIF.Phrase) )
    # g.add( (gtoken,RDF.type, NIF.OffsetBasedString) ) Christian's suggestion to remove
        g.add( (gtoken, NIF.anchorOf, Literal(self.text, datatype=XSD.string)) )
        g.add( (gtoken, NIF.beginIndex, Literal(self.index_start, datatype=XSD.nonNegativeInteger)) ) #
        g.add( (gtoken, NIF.endIndex, Literal(self.index_end, datatype=XSD.nonNegativeInteger)) ) #
        g.add( (gtoken, NIF.posTag, Literal(self.pos, datatype=XSD.string)))
        g.add((gtoken, NIF.oliaCategory , self.get_olia_postag(self.pos)))
        g.add((gtoken,DC.Language, Literal(self.lng,datatype=XSD.string)))

        if self.lemma!="":
            g.add( (gtoken, NIF.lemma, Literal(self.lemma, datatype=XSD.string)))

        return gtoken

    # olia postag
    def get_olia_postag(self,ud_postag):
        olia_pos = {    "ADJ": OLIA.Adjective, # adjective
                        "ADP": OLIA.Adposition, # adposition
                        "ADV": OLIA.Adverb, # adverb
                        "AUX": OLIA.AuxiliaryVerb, # auxiliary
                        "CCONJ": OLIA.CoordinatingConjunction, # coordinating conjunction
                        "DET": OLIA.Determiner, # determiner
                        "INTJ": OLIA.Interjection, # interjection
                        "NOUN": OLIA.CommonNoun, # noun
                        "NUM": OLIA.Numeral, # numeral
                        "PART": OLIA.Particle, # particle
                        "PRON": OLIA.Pronoun, # pronoun
                        "PROPN": OLIA.ProperNoun, # proper noun
                        "PUNCT": OLIA.Punctuation, # punctuation
                        "SCONJ": OLIA.SubordinatingConjunction, # subordinating conjunction
                        "SYM": OLIA.Symbol, # symbol
                        "VERB": OLIA.Verb, # verb
                        "X": OLIA.Residual # other
             }

        return olia_pos[ud_postag] if (ud_postag in olia_pos.keys()) else OLIA.Residual
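
To make the row layout that `Token.__init__` relies on explicit, here is a minimal standalone sketch that splits the example row from the comments above into named fields. The function name `parse_token_row` and the dictionary keys are hypothetical, and the column order (token id, offsets, text, Wikidata link, NE tag, POS, lemma) is an assumption read off the two example rows in the class.

```python
def parse_token_row(row):
    """Split one tab-separated token row into named fields."""
    cols = row.rstrip("\n").split("\t")
    start, end = (int(x) for x in cols[1].split("-"))
    wiki = cols[3]
    # Keep only the QID: drop the URL prefix and a span suffix such as "[3]".
    qid = "" if wiki == "_" else wiki.rsplit("/", 1)[-1].split("[")[0]
    return {
        "sentence_id": cols[0].split("-")[0],
        "start": start,
        "end": end,
        "text": cols[2],
        "qid": qid,
        "ne_type": cols[4],
        "pos": cols[5],
        "lemma": cols[6],
    }


# Example row taken from the comment in the Token class.
row = "2-17\t241-247\tČerulo\thttp://www.wikidata.org/entity/Q122388880[3]\tPERS[3]\tPROPN\tČerulo\n"
parsed = parse_token_row(row)
print(parsed["sentence_id"], parsed["start"], parsed["end"], parsed["qid"], parsed["ne_type"])
```

The bracketed number in `PERS[3]` is kept on purpose: tokens sharing the same bracketed tag within a sentence are later merged into one multi-token entity.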


In [None]:
# read sentence properties from a sentence (text plus token rows) and generate graph triples
class Sentence:

    def __init__(self, sent,lng): # sent is a (sentence text, list of token rows) pair
        self.lng = lng # language code, added 17.10
        self.tokens = sent[1] # list of token rows (TSV lines)
      #  self.id = sent[1][0].split('\t')[0].split('-')[0] # sentence id
        self.text = sent[0]
       # self.id = get_key(self.text)
        #self.index_start=cur_index
        #self.index_start=int(sent[0].split('\t')[1].split('-')[0])

        self.otokens = [] # list of Token objects
        # loop all tokens in sentence
        for i in range(len(self.tokens)):
            otoken = Token(self.tokens[i],lng)
            self.otokens.append(otoken)
            if i == 0:
                self.id = otoken.id # take the sentence ID from the first token
                self.index_start = otoken.index_start
            if i == len(self.tokens)-1:
                self.index_end = otoken.index_end
       # print(self.id)
        #for ner
        ner_list = []
        for otoken in self.otokens:
            if otoken.NER and '[' in otoken.NERtype:
                ner_list.append(otoken.NERtype)

        nertypes = list(set(ner_list)) # all distinct types that appear, e.g. PERS[1], ORG[12], and so on
      #  print(nertypes)

        self.NER = []
        ner_text = []

        for ner_type in nertypes: # go through all types and collect their token texts
            for otoken in self.otokens:
                if otoken.NER:
                    if otoken.NERtype == ner_type:
                        ner_text.append((otoken.NERtext, otoken.index_start, otoken.index_end, otoken.Wiki))


            self.NER.append([ner_text, ner_type]) #[(otoken.text1, otoken.index_start, otoken.index_end),(otoken.text2, otoken.index_start, otoken.index_end),....,], ner_type]
            ner_text = []

        for otoken in self.otokens:
            if otoken.NER and '[' not in otoken.NERtype:
                self.NER.append([(otoken.NERtext, otoken.index_start, otoken.index_end, otoken.Wiki) , otoken.NERtype])
      #  print(self.NER)  # debug output, restore if needed
    # create graph triples for this sentence


    def init_gsent(self,g,base_url):
        gsent= URIRef(base_url+"#char={0},{1}".format(self.index_start,self.index_end)) # self.index_start+len(self.text)
        g.add( (gsent, RDF.type, NIF.Sentence) )
        g.add( (gsent, DCT.identifier, Literal(self.id , datatype=XSD.nonNegativeInteger)))
  #    g.add( (gsent, RDF.type,NIF.Context) ) sentence is not context
        g.add( (gsent, RDF.type, NIF.RFC5147String) )
        g.add( (gsent, RDF.type, NIF.String) )
      # NIF.anchorOf, NIF.endIndex later
        g.add( (gsent, NIF.beginIndex, Literal(self.index_start , datatype=XSD.nonNegativeInteger))) #
        g.add((gsent,DC.Language, Literal(self.lng,datatype=XSD.string)))


        return gsent

In [None]:
def get_NerWiki(list_ner):
    #[(otoken.text1, otoken.index_start, otoken.index_end),(otoken.text2, otoken.index_start, otoken.index_end), ner_type]
    # [[[('Trgu', 135, 139), ('mučenika', 140, 148)], 'LOC'], [[('Stefana', 24, 31), ('Karačija', 32, 40)], 'PERS']]
    #  [[[('Lina', 236, 240), ('Čerulo', 241, 247)], 'PERS'], [('Elena', 151, 156), 'PERS'], [('Lila', 269, 273), 'PERS']]
    list_text = list_ner[0]
    ne_type = list_ner[1]
    ne_text =""
    if '[' in ne_type:
        for i in range(len(list_text)):
            if i==0:
                ind_start = list_text[i][1]
                ne_wiki = list_text[i][3]
            if i == len(list_text)-1:
                ind_end = list_text[i][2]
            if i<len(list_text) - 1:
                ne_text += list_text[i][0]+" "
            else:
                ne_text +=list_text[i][0]
        ne_type = ne_type.split('[')[0]
    else:
        ne_text = list_text[0]
        ind_start = list_text[1]
        ind_end = list_text[2]
        ne_wiki = list_text[3]
    return ne_text, ne_type, ind_start, ind_end, ne_wiki
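
The multi-token branch of `get_NerWiki` can be summarised as: join the token texts with spaces, take the start offset of the first token, the end offset of the last token, and the Wikidata id of the first token. A compact sketch of that idea follows; `merge_span` is a hypothetical helper and the QID attached to both tokens is illustrative (the offsets come from the comment above).

```python
def merge_span(parts):
    """Merge (text, start, end, qid) tuples of one entity into a single span."""
    text = " ".join(part[0] for part in parts)
    start = parts[0][1]   # begin offset of the first token
    end = parts[-1][2]    # end offset of the last token
    qid = parts[0][3]     # Wikidata id is read from the first token
    return text, start, end, qid


parts = [("Lina", 236, 240, "Q122388880"), ("Čerulo", 241, 247, "Q122388880")]
merged = merge_span(parts)
print(merged)
```

The sketch assumes the tuples arrive in text order, which holds when they are collected by iterating over the sentence tokens from left to right.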

In [None]:
# manage NE named entities: PERS LOC ORG EVENT ROLE DEMO WORK
class NamedEntity:

    def __init__(self, list_ner,lng):
        self.text, self.ne_type, self.index_start, self.index_end, self.wiki=  get_NerWiki(list_ner)
        self.lng = lng # language code

  # create graph triples for this named entity
    def init_gne(self,g, base_url):
        gne= URIRef(base_url+"#char={0},{1}".format(self.index_start,self.index_end) )
        g.add( (gne, RDF.type, NIF.String) )
        g.add( (gne, RDF.type, NIF.Phrase) )
        g.add( (gne, RDF.type, NIF.RFC5147String) )
    # g.add( (gne, RDF.type, NIF.OffsetBasedString) )
     # g.add((gne, RDF.type, NIF.EntityOccurrence )) not in NIF 2.0; FK and CC suggested to eliminate it and search by properties
        g.add( (gne, NIF.anchorOf,Literal(self.text, datatype=XSD.string)))
        g.add( (gne, NIF.beginIndex,Literal(self.index_start, datatype=XSD.nonNegativeInteger)))  #
        g.add( (gne, NIF.endIndex,Literal(self.index_end , datatype=XSD.nonNegativeInteger)))  #
        g.add((gne,DC.Language, Literal(self.lng,datatype=XSD.string))) # language of the entity


        if self.ne_type in ['PERS', 'PERSON']:
            g.add( (gne, ITSRDF.taClassRef, OLIA.Person))
            g.add( (gne, ITSRDF.taClassRef, WD.Q5))
            g.add( (gne, ITSRDF.taClassRef, DBO.Person))
        elif self.ne_type in ['LOC', 'LOCATION', 'PLACE', 'GPE']:
            # g.add( (gne, ITSRDF.taClassRef, NERD.Location))
            g.add( (gne, ITSRDF.taClassRef, OLIA.Space))
            g.add( (gne, ITSRDF.taClassRef, WD.Q7884789))
            g.add( (gne, ITSRDF.taClassRef, DBO.Place))
        elif self.ne_type in ['ORG', 'ORGANISATION']:
            g.add( (gne, ITSRDF.taClassRef, OLIA.Organization))
            g.add( (gne, ITSRDF.taClassRef, WD.Q43229))
            g.add( (gne, ITSRDF.taClassRef, DBO.Organisation))
        elif self.ne_type in ['EVENT']:
            g.add( (gne, ITSRDF.taClassRef, OLIA.Event))
            g.add( (gne, ITSRDF.taClassRef, WD.Q1656682))
            g.add( (gne, ITSRDF.taClassRef, DBO.Event))
        elif self.ne_type in ['ROLE']:
            # g.add( (gne, ITSRDF.taClassRef, Literal("<" +self.ne_type+">", datatype=XSD.string ))) # just string
            g.add( (gne, ITSRDF.taClassRef, WD.Q28640))    # profession, not exact, it could be title as well
            g.add( (gne, ITSRDF.taClassRef, DBO.Profession))
        elif self.ne_type in ['DEMO']:
            # g.add( (gne, ITSRDF.taClassRef, Literal("<" +self.ne_type+">", datatype=XSD.string ))) # just string
            g.add( (gne, ITSRDF.taClassRef, WD.Q217438 ))
            g.add( (gne, ITSRDF.taClassRef, DBO.demonym)) # check this
        elif self.ne_type in ['WORK']:
            # g.add( (gne, ITSRDF.taClassRef, Literal("<" +self.ne_type+">", datatype=XSD.string ))) # just string
            g.add( (gne, ITSRDF.taClassRef, WD.Q386724 ))
            g.add( (gne, ITSRDF.taClassRef, DBO.Work))
        else :  # something not expected
            g.add( (gne, ITSRDF.taClassRef, Literal("<" +self.ne_type+">", datatype=XSD.string )))

        # for Wiki
        if self.ne_type != '_':
            if self.wiki !="":
                wikiQID = self.wiki
                g.add( (gne, ITSRDF.taIdentRef, URIRef("http://www.wikidata.org/entity/"+wikiQID) ))
        return gne

# Class Corpusm

In [None]:
def read_sentences(file_path_name):
    # read the whole TSV file; the context manager closes it afterwards
    with open(file_path_name, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    Dict = {}  # key: sentence number, value: sentence text from its '#Text=' line
    id_text = 1
    for line in lines:
        if '#Text' in line:
            Dict[id_text] = line[6:-1]  # drop the '#Text=' prefix and the trailing newline
            id_text += 1
    sentences = []
    for id_rec in range(1, len(Dict) + 1):  # 1000 sentences in the corpus used here
        # token rows of one sentence: their first column starts with '<sentence id>-'
        sent = [line for line in lines if str(id_rec) == line.split('\t')[0].split('-')[0]]
        sentences.append((Dict[id_rec], sent))
    return sentences
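
`read_sentences` scans all lines once per sentence. A single-pass alternative is sketched below for comparison; it is not the notebook's method. `group_sentences` is a hypothetical name, it works on a list of lines rather than a file path, and it assumes every sentence starts with a `#Text=` line and that sentence ids count up from 1. Apart from the first token row, the sample rows are made up for illustration.

```python
from collections import defaultdict


def group_sentences(lines):
    """Group token rows by sentence id in one pass over the lines."""
    texts = []
    rows = defaultdict(list)
    for line in lines:
        if line.startswith("#Text="):
            texts.append(line[len("#Text="):].rstrip("\n"))
        elif "\t" in line:
            sent_id = line.split("\t")[0].split("-")[0]
            rows[sent_id].append(line)
    # Pair the n-th '#Text=' line with the rows whose id is n.
    return [(text, rows[str(i)]) for i, text in enumerate(texts, start=1)]


sample = [
    "#Text=Veoma dobro.\n",
    "1-1\t0-5\tVeoma\t_\t_\tADV\tveoma\n",
    "1-2\t6-11\tdobro\t_\t_\tADV\tdobro\n",
    "1-3\t11-12\t.\t_\t_\tPUNCT\t.\n",
    "\n",
    "#Text=Da.\n",
    "2-1\t13-15\tDa\t_\t_\tPART\tda\n",
    "2-2\t15-16\t.\t_\t_\tPUNCT\t.\n",
]
result = group_sentences(sample)
print([(text, len(token_rows)) for text, token_rows in result])
```

The return value has the same shape as that of `read_sentences`: a list of (sentence text, token rows) pairs.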

In [None]:
# manage reading monolingual corpus and generate graph triples
class Corpusm:

    def __init__(self, file_path_name, lng):
        self.id = "it1"

       # read the sentences
        self.sentences = read_sentences(file_path_name) # returns the text and token rows of every sentence
        self.text=""
        self.index_start=0
        self.lng=lng
      #  self.file_name =os.path.basename(file_path_name).replace(".txt","_tekst.txt") # txt file used to resolve references
        self.url="http://llod.jerteh.rs/ItSrNIF/" +self.id +".txt"  # base URL of the corpus text

       #  metadata
       # author
        self.author= "Ranka Stanković, Milica Ikonić Nešić"
        self.publicationStmt="ItSrNER1000"
        self.publisher = "JeRTeh"

    # graph initialisation for monolingual corpus
    # for graph
    def  init_gcorpus_lng(self,g):
        gcorpus_lng= URIRef(self.url) # +"{0}_{1}".format(0,) # TODO: change gnovel
        g.add( (gcorpus_lng, RDF.type, NIF.String) )
        g.add( (gcorpus_lng, RDF.type, NIF.Context  ) )
        g.add( (gcorpus_lng, RDF.type, NIF.RFC5147String) )
        g.add( (gcorpus_lng, NIF.beginIndex, Literal(self.index_start,datatype=XSD.nonNegativeInteger) ))
        g.add( (gcorpus_lng, DCT.identifier,  Literal(self.id,datatype=XSD.string)    ))

   #     g.add( (gcorpus_lng, DCT.title, Literal(self.title.text,datatype=XSD.string)) )
       # language
        g.add((gcorpus_lng,DC.Language, Literal(self.lng,datatype=XSD.string)))
        g.add((gcorpus_lng,MS.Language, Literal(self.lng,datatype=XSD.string)))

        g.add((gcorpus_lng,MS.author,Literal(self.author,datatype=XSD.string)))
        g.add((gcorpus_lng,MS.publisher, Literal(self.publisher,datatype=XSD.string)) )

        return gcorpus_lng

# Initialisation and metadata

# Main function to write ttl file

Guidelines for developing NIF-based NLP services
https://www.w3.org/2015/09/bpmlod-reports/nif-based-nlp-webservices/
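
The classes above identify every string by appending a `#char=begin,end` fragment to the context URL, with `nif:beginIndex` and `nif:endIndex` carrying the same offsets. The sketch below shows that URI scheme and a consistency check that the offsets really select the anchor text inside the context string. The helper names `char_uri` and `anchor_matches` are hypothetical, and the context string is a made-up example built around the token row `0-5 Veoma`.

```python
def char_uri(base_url, start, end):
    """Build the offset-based URI used for tokens, sentences and entities."""
    return "{0}#char={1},{2}".format(base_url, start, end)


def anchor_matches(context, start, end, anchor):
    """Check that the offsets select exactly the anchor text in the context."""
    return context[start:end] == anchor


base_url = "http://llod.jerteh.rs/ItSrNIF/it1.txt"
context = "Veoma dobro."  # illustrative context string

uri = char_uri(base_url, 0, 5)
print(uri)
print(anchor_matches(context, 0, 5, "Veoma"))
```

A check like this may be worth running on the generated text file, because the context string is rebuilt by concatenating sentence texts while the offsets are read from the TSV file.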

In [None]:
# create graph, ...
def write_gcorpusm(file_path_name,lng,sent_num):
    g = Graph()
    g.bind('itsrdf', ITSRDF)
    g.bind('nif', NIF)
    g.bind('olia', OLIA)
    g.bind('dc',DC)
    g.bind('dct',DCT)
    g.bind('ms',MS)
    g.bind('wd', WD)
    g.bind('wdt', WDT)
    g.bind('dbo', DBO)
 # g.bind('eltec', ELTEC)

    ocorpusm = Corpusm(file_path_name,lng)

  # insert initial triples for monolingual corpus
    gcorpusm = ocorpusm.init_gcorpus_lng(g)
    gsent_before=None
    one=None

    scount=0 # sentence count just for testing, remove later

    for sent in ocorpusm.sentences: # loop all sentences in one corpusm
        scount+=1
        if scount > sent_num:
            break    # break after few sentences for test
        osent=Sentence(sent,lng)  # the sentence id is set inside the Sentence class
      #  print(osent.text)
        gsent = osent.init_gsent(g,ocorpusm.url)  # objekat grafa
        if gsent_before is not None:
            g.add( (gsent, NIF.previousSentence, gsent_before ) )
            g.add( (gsent_before, NIF.nextSentence, gsent ) )
        g.add( (gsent, NIF.referenceContext, gcorpusm))
        gtoken_before=None
        # loop all tokens in sentence
        for otoken in osent.otokens:
            gtoken = otoken.init_gtoken(g,ocorpusm.url)
            g.add( (gtoken, NIF.referenceContext ,gcorpusm))
            g.add( (gtoken, NIF.sentence, gsent))
            g.add( (gsent, NIF.word, gtoken))
           # relate token: previous, next
            if gtoken_before is not None:
                g.add((gtoken_before,NIF.nextWord,gtoken))
                g.add((gtoken, NIF.previousWord, gtoken_before))
            gtoken_before =gtoken
           # end of token
        for ners in osent.NER:
            one=NamedEntity(ners,lng)
            gne = one.init_gne(g,ocorpusm.url)
            g.add( (gne, NIF.referenceContext ,gcorpusm))
            g.add( (gne, NIF.sentence, gsent)) # added: link the entity to its sentence

        # finish sentence graph  sid,len(tokens),sent_text
        # anchorOf / isString
        g.add( (gsent, NIF.anchorOf  , Literal(osent.text, datatype=XSD.string)) )
        g.add( (gsent, NIF.endIndex  , Literal(osent.index_end, datatype=XSD.nonNegativeInteger)) )

        # concatenate to monolingual corpus
        ocorpusm.text+= osent.text+" "
      #  print(ocorpusm.text)  # debug output, restore if needed
        #cur_index+=1
        gsent_before = gsent
        # end of sentence


      # finish monolingual corpus
    g.add( (gcorpusm, NIF.endIndex  , Literal(len(ocorpusm.text),datatype=XSD.nonNegativeInteger))) #
      # Do we use NIF.anchorOf or NIF.isString
    g.add( (gcorpusm, NIF.isString  , Literal(ocorpusm.text, datatype=XSD.string)) ) #Corpusm
      #fn=file_path_name.replace(".xml",".txt").replace("\\","/").replace("/level2/","/NIF/")
    fn= file_path_name.replace(".tsv", ".txt")
    print(file_path_name)
    print(fn)
    with open(fn, "w", encoding="utf-8") as f_txt:
        f_txt.write(ocorpusm.text)
      # write RDF file
      #file_path_name_out=file_path_name.replace(".xml",".ttl").replace("\\","/").replace("/level2/","/NIF/")
    file_path_name_out=fn.replace(".txt",".ttl")
    g.serialize(destination=file_path_name_out)

    return ocorpusm
 #   return ocorpusm

### Testing

In [None]:
# test on small portion of one file or list of files
lng="it"
#ELTEC = rdflib.Namespace("http://llod.jerteh.rs/IT-SR-NER/NIF/")
# file_name='D:/Korpusi/ELTEC/ELTeC-'+lng+'-master/level2/ROM096-L2.xml' # 96,100
p='C:/Users/38164/Desktop/LREC_23/NIFrev/' #LREC23
#l=["POR0100_LucCor_Duquesa"]
#l=["ENG18471_Bronte"]
l=["it1000_final.tsv"]
#l=["FRA00101_Adam"]
for f in l:
    file_name=p+f
    print(file_name)
    write_gcorpusm(file_name,lng,1000)

C:/Users/38164/Desktop/LREC_23/NIFrev/it1000_final.tsv
C:/Users/38164/Desktop/LREC_23/NIFrev/it1000_final.tsv
C:/Users/38164/Desktop/LREC_23/NIFrev/it1000_final.txt
