# Parse development notebook

### Notebook purpose
This notebook is a development space for a Python replacement and upgrade of parse.ts.
It reads the specified Google Sheets and outputs an actants.json file, which can be imported to inkVisitor RethinkDB.py

### Prerequisites
 * generated JSON schemas for all used objects (run generate-json-schemas.py)

### JSON schemas for the actants
...
...

### The import tables:
 * Texts
 * Manuscripts (must be done alongside T, David thinks) = O of class defined by col. class_id
 * Resources
 * C
 * A

### Input variables

In [21]:
#                  sheet_name,  code, header_in_row
input_sheets = {
    "texts" : ("Texts","13eVorFf7J9R8YzO7TmJRVLzIIwRJS737r7eFbH1boyE", 5), #https://docs.google.com/spreadsheets/d/13eVorFf7J9R8YzO7TmJRVLzIIwRJS737r7eFbH1boyE/edit#gid=2056508047
    "manuscripts" : ("Manuscripts", "13eVorFf7J9R8YzO7TmJRVLzIIwRJS737r7eFbH1boyE", 4),
    "resources" : ("Resources", "13eVorFf7J9R8YzO7TmJRVLzIIwRJS737r7eFbH1boyE", 4),
    "actions" :  ("Statements","1vzY6opQeR9hZVW6fmuZu2sgy_izF8vqGGhBQDxqT_eQ", 4), # https://docs.google.com/spreadsheets/d/1vzY6opQeR9hZVW6fmuZu2sgy_izF8vqGGhBQDxqT_eQ/edit#gid=0
    "concepts" : ("Concepts","1nSqnN6cjtdWK-y6iKZlJv4iGdhgtqkRPus8StVgExP4", 4) # https://docs.google.com/spreadsheets/d/1nSqnN6cjtdWK-y6iKZlJv4iGdhgtqkRPus8StVgExP4/edit#gid=0
}

root_sheet_url = "https://docs.google.com/spreadsheets/d/"
google_api_dotenv_path = "../env/.env.googleapi"  # contains google api specs for sheet access with Dator
schema_path = '../schemas/' # path to the directory with schemas
json_schemas = {}  # holder for schemas, so they can be used for jsonschema validate
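
Each `input_sheets` entry is a positional tuple of sheet name, spreadsheet code, and header row, and the loading cell later concatenates `root_sheet_url` with the code. The sketch below shows one way to give those positions names. `SheetSpec` and `sheet_url` are hypothetical helpers introduced here for illustration; they are not part of the notebook or of Dator.

```python
from typing import NamedTuple


class SheetSpec(NamedTuple):
    """Hypothetical named version of one input_sheets entry."""
    sheet_name: str
    code: str
    header_in_row: int


root_sheet_url = "https://docs.google.com/spreadsheets/d/"


def sheet_url(spec: SheetSpec, root: str = root_sheet_url) -> str:
    """Build the spreadsheet URL the same way the loading cell does: root + code."""
    return root + spec.code


# Values copied from the "texts" entry of input_sheets
texts = SheetSpec("Texts", "13eVorFf7J9R8YzO7TmJRVLzIIwRJS737r7eFbH1boyE", 5)
print(sheet_url(texts))
```

Because a `NamedTuple` is still a tuple, index access such as `sheet[0]` would keep working if the dictionary values were swapped for `SheetSpec` instances.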

### Libraries

In [22]:
import os, warlock, json
from jsonschema import validate
import dissinetpytools.dator as dator
from dotenv import load_dotenv
import pandas as pd

### Initialisation

In [23]:
load_dotenv(google_api_dotenv_path) # fills os.environ['GDRIVE_API_CREDENTIALS']
d = dator.Dator(loglevel=10, print_log_online=True, cache=True, project_name="inkvisitor-import") # expects 'GDRIVE_API_CREDENTIALS' in the global system variables (os.environ)
d.google_authenticate()
logger = d.logger

20 2022-02-14 21:31:30 : Google authentification start
20 2022-02-14 21:31:30 : Google authentification end
20 2022-02-14 21:31:30 : Dator initiation succesfull end


In [24]:
# read all schemas in schema_path and build a warlock model class for each
schema_filenames = os.listdir(schema_path)
for schema in schema_filenames:
    name = schema.split(".")[0]  # class name = file name up to the first dot
    with open(os.path.join(schema_path, schema), "r") as file_handler:  # file is closed after reading
        schema_json = json.load(file_handler)
    json_schemas[name] = schema_json  # kept for jsonschema validate
    globals()[name] = warlock.model_factory(schema_json)
    logger.info("Class " + name + " available.")

2022-02-14 21:31:30,331 INFO Class IActant available.
2022-02-14 21:31:30,333 INFO Class IAction available.
2022-02-14 21:31:30,335 INFO Class IEntity available.
2022-02-14 21:31:30,336 INFO Class ILabel available.
2022-02-14 21:31:30,337 INFO Class IProp available.
2022-02-14 21:31:30,339 INFO Class IResource available.
2022-02-14 21:31:30,340 INFO Class IStatement available.
2022-02-14 21:31:30,342 INFO Class ITerritory available.
2022-02-14 21:31:30,343 INFO Class IUser available.
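
The loading cell above takes every entry returned by `os.listdir`, so any stray non-JSON file in the schema directory would be passed to `json.load`. A minimal sketch of a stricter variant, using only the standard library, is given below. `load_schemas` is a hypothetical helper, and the demonstration directory and its files are made up for the example. Note that `Path.stem` drops only the last extension, which differs from `split(".")[0]` for file names containing several dots.

```python
import json
import tempfile
from pathlib import Path


def load_schemas(schema_dir):
    """Return {file stem: parsed schema} for every *.json file in schema_dir."""
    schemas = {}
    for path in sorted(Path(schema_dir).glob("*.json")):
        with path.open("r", encoding="utf-8") as handle:
            schemas[path.stem] = json.load(handle)
    return schemas


# Demonstration on a throwaway directory with one made-up schema file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "IActant.json").write_text('{"type": "object"}', encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("not a schema", encoding="utf-8")
    demo_schemas = load_schemas(tmp)

print(sorted(demo_schemas))
```

The returned dictionary has the same shape as `json_schemas`, so it could feed the same validation and model-building steps.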


In [25]:
# load all input tables
tables = {}
header_infos = {}
for key, (sheet_name, code, header_row) in input_sheets.items():
    logger.info(f"Calling for {key} with sheet_name {sheet_name}.")
    tables[key], header_infos[key] = d.load_df_from_gsheet(
        sheet_name, root_sheet_url + code, sheet_name,
        fromCache=True, header_in_row=header_row,
        clean=True, fillna=True, cleanByColumn="label",  # alternative: cleanByColumn='id'
    )

2022-02-14 21:31:30,365 INFO Calling for texts with sheet_name Texts.


20 2022-02-14 21:31:31 : Loading dataset Texts
20 2022-02-14 21:31:31 : Opting for variant header at row 5.
20 2022-02-14 21:31:32 : Dropping empty columns in the dataset Texts : (1011, 92)
20 2022-02-14 21:31:32 : Deleted 869 empty rows by label.
20 2022-02-14 21:31:32 : Loaded and prepared dataset Texts : (142, 92)


2022-02-14 21:31:32,903 INFO Calling for manuscripts with sheet_name Manuscripts.


20 2022-02-14 21:31:32 : Making pickle cache of  Texts : (142, 92)
20 2022-02-14 21:31:33 : Loading dataset Manuscripts
20 2022-02-14 21:31:33 : Opting for variant header at row 4.


2022-02-14 21:31:34,420 INFO Calling for resources with sheet_name Resources.


20 2022-02-14 21:31:34 : Dropping empty columns in the dataset Manuscripts : (999, 43)
20 2022-02-14 21:31:34 : Deleted 860 empty rows by label.
20 2022-02-14 21:31:34 : Loaded and prepared dataset Manuscripts : (139, 43)
20 2022-02-14 21:31:34 : Making pickle cache of  Manuscripts : (139, 43)
20 2022-02-14 21:31:34 : Loading dataset Resources
20 2022-02-14 21:31:34 : Opting for variant header at row 4.


2022-02-14 21:31:35,974 INFO Calling for actions with sheet_name Statements.


20 2022-02-14 21:31:35 : Dropping empty columns in the dataset Resources : (1000, 20)
20 2022-02-14 21:31:35 : Deleted 934 empty rows by label.
20 2022-02-14 21:31:35 : Loaded and prepared dataset Resources : (66, 20)
20 2022-02-14 21:31:35 : Making pickle cache of  Resources : (66, 20)
20 2022-02-14 21:31:36 : Loading dataset Statements
20 2022-02-14 21:31:36 : Opting for variant header at row 4.


2022-02-14 21:31:38,335 INFO Calling for concepts with sheet_name Concepts.


20 2022-02-14 21:31:38 : Dropping empty columns in the dataset Statements : (1030, 73)
20 2022-02-14 21:31:38 : Deleted 588 empty rows by label.
20 2022-02-14 21:31:38 : Loaded and prepared dataset Statements : (442, 73)
20 2022-02-14 21:31:38 : Making pickle cache of  Statements : (442, 73)
20 2022-02-14 21:31:39 : Loading dataset Concepts
20 2022-02-14 21:31:39 : Opting for variant header at row 4.
20 2022-02-14 21:31:43 : Dropping empty columns in the dataset Concepts : (3019, 57)
20 2022-02-14 21:31:43 : Deleted 724 empty rows by label.
20 2022-02-14 21:31:43 : Loaded and prepared dataset Concepts : (2295, 57)
20 2022-02-14 21:31:43 : Making pickle cache of  Concepts : (2295, 57)
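
The log reports, for each sheet, a shape at the column-dropping step, the number of rows deleted for having an empty label, and the final shape. As a quick sanity check, the sketch below confirms that these figures are mutually consistent. The numbers are copied from the log above; reading the first shape as the row count before row cleaning is an assumption, and `inconsistent_sheets` is a hypothetical helper.

```python
# (rows before row cleaning, rows deleted by empty label, rows after), copied from the log
load_log = {
    "Texts": (1011, 869, 142),
    "Manuscripts": (999, 860, 139),
    "Resources": (1000, 934, 66),
    "Statements": (1030, 588, 442),
    "Concepts": (3019, 724, 2295),
}


def inconsistent_sheets(log):
    """Return the names of sheets where before - deleted does not equal after."""
    return [
        name
        for name, (before, deleted, after) in log.items()
        if before - deleted != after
    ]


print(inconsistent_sheets(load_log))  # → []
```

An empty list means every sheet satisfies `before - deleted == after`, so no rows were lost outside the label-based cleaning step.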


In [26]:
tables['texts']


Unnamed: 0,id,label,language,label_short,text_name_original,detail,region_covered,microregion_covered,author_label,language_id,...,dissinet_coding_priority,dissinet_person,number_defendants,number_persons,persons_index_link,places_index_link,old_genre_general,old_genre_label,note,parsing_rows_explained
0,T1,Process against Bernard Niort and his family,English,,,Early 1234.,Languedoc,,,C0938,...,,,,,,,,deposition,,
1,T2,Sentences of William Arnold and Stephen of Sai...,English,,,,Languedoc,Toulousain #Lauragais,William Arnold #Stephen of Saint-Thibéry,C0938,...,,RS?,,,,,register,sentence,End-folio sometimes cited as 184v (e.g. Roche...,
2,T3,Peter Seila’s Register of Penances,English,Seila,Penitenciae fratris Petri Sellani,Penitenciae fratris Petri Sellani. Register of...,Languedoc,Quercy (west),,C0938,...,1,RS,,,,,register,sentence #culpa,,
3,T4,Register FFF of the Carcassonne inquisition,English,FFF,,,Languedoc,Montségur #Lauragais #Cabardès #Quercy (east) ...,Ferrer #William Raymond #Pons Gary #Peter Durand,C0938,...,,,,,,,register,deposition,,
4,T5,Confirmation of depositions before Ferrer and ...,English,,,,Languedoc,,Ferrer #Pons Gary,C0938,...,,,,,,,register,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,T139,letter of Evervin of Steinfeld to Bernard of C...,English,,,,,,,C0938,...,,,,,,,,,,
138,T140,letter from Liège to pope,English,,,,,,,C0938,...,,,,,,,,,,
139,T141,Annales Aquenses,Latin,,,,,,,C0938,...,,,,,,,,,,
140,T142,Annales Rodenses,Latin,,,,,,,C0938,...,,,,,,,,,,
