### External modules used in this documentation

In this short code section we import all the modules needed for this documentation.

In [33]:
from pandas import *
import json

___

# Software documentation

1. [Introduction](#introduction) 
2. [Data description](#data_description)
3. [Software organization](#software_organization)
    - [requirements.txt](#requirements.txt)
    - [const.py](#const.py)
    - [URIs.py](#URIs.py)
    - [data_model.py](#data_model.py)
    - [common_utils.py](#common_utils.py)
    - [main.py](#main.py)
    - [data_processors.py](#data_processors.py)
    - [relational_processor.py](#relational_processor.py)
    - [triplestore_processor.py](#triplestore_processor.py)
    - [query_processors.py](#query_processors.py)
    - [queries.py](#queries.py)
    
4. [Functioning](#functioning)

<div id=introduction> </div>

## 1. Introduction

### Project goal
Our goal is to build software that populates two kinds of databases, a relational database and a graph database, and queries them simultaneously.  
In this project we deal with *structured data* coming from CSV and JSON files. To process the data in Python we use [***pandas***](https://pandas.pydata.org/), a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

After briefly analysing the provided data and their characteristics, we developed the processors that extract the data from the provided CSV and JSON files in order to populate our structured collection of data. We then upload the records created into our databases, and provide a way to query both databases simultaneously and return specific Python objects.

### Design and syntax choices

#### Type annotation
An important characteristic of the Python language is the consistency of its object model: each object has an associated type (e.g., integer, string, or function) and internal data. For better readability, we chose to make the data types of the input parameters (arguments) and of the return value of each function explicit through type annotations.
E.g.  

    def set_df(self, _df: DataFrame) -> None:

This means that `set_df` takes a DataFrame as input and returns nothing, i.e. `None`. Note, though, that annotations are not enforced at run time: Python will not raise an error if an argument of a different data type (e.g. string, integer, list,…) is passed.
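A minimal sketch of this behaviour, using a hypothetical `scale` function that is not part of the project: the annotations document the expected types, but the interpreter does not check them.

```python
def scale(_value: int, _factor: int = 2) -> int:
    # The annotations describe the intended types; they are not checked at run time.
    return _value * _factor


print(scale(3))               # prints 6
print(scale("ab"))            # a str is accepted too, no error is raised: prints abab
print(scale.__annotations__)  # the annotations are only stored as metadata
```

A static type checker can flag the second call, but plain Python executes it.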
    
#### Try/except statement
We made use of the `try/except` statement in the code to handle errors. The `try/except` statement works as follows:

- First, the try clause (the statement(s) between the try and except keywords) is executed.

- If no exception occurs, the except clause is skipped and execution of the try statement is finished.

- If an exception occurs during the execution of the try clause, the rest of the clause is skipped. Then, if its type matches the exception named after the except keyword, the except clause is executed, and then execution continues after the try/except block.

It is also possible to catch any Python error in the except clause with the built-in class `Exception`, bind it to a variable, and print it in the terminal.
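The steps above can be sketched as follows. `safe_divide` is a hypothetical function used only for illustration; the message format follows the terminal messages used in the project.

```python
def safe_divide(_a: float, _b: float):
    try:
        # If this line raises, the rest of the try clause is skipped.
        result = _a / _b
    except Exception as error:
        # `error` is the exception object raised inside the try clause.
        print(f"-- ERR: {error}")
        return None
    # Reached only when no exception occurred.
    return result


print(safe_divide(10, 2))  # prints 5.0
print(safe_divide(10, 0))  # prints the error message, then None
```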

#### Terminal messages
To know which step the program has reached during execution, we made use of terminal messages. We divided these messages into three main categories: `INFO` messages update the user about the step currently starting or just finished; `ERR` messages are alerts that report where the app failed and stopped; `WARN` messages indicate that a query returned zero results.
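As an illustration only, the three categories could be produced by a small helper like the one below. The `log` function is hypothetical (the project may simply call `print` directly); the `-- LEVEL: message` layout mirrors the messages shown later in this documentation.

```python
def log(_level: str, _message: str) -> str:
    # Hypothetical helper: builds and prints a message such as "-- INFO: ...".
    if _level not in ("INFO", "WARN", "ERR"):
        raise ValueError(f"Unknown message level: {_level}")
    line = f"-- {_level}: {_message}"
    print(line)
    return line


log("INFO", "Uploading data")
log("WARN", "The query returned zero results")
log("ERR", "The upload failed")
```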

#### Case formats
We used different case formats to distinguish the different parts of our code. In more detail, we used:
- Pascal case (PascalCase) for naming classes (e.g. `IdentifiableEntity`, `Person`, etc…)
- Camel case (camelCase) for the class methods (e.g. `getIds`, `getGivenName`, etc…)
- Snake case (snake_case), with an additional initial "_", for the arguments (e.g. `_id_list`, `_url`, etc…)

#### Constants names
To make it easy to see when a constant is used, we write constants in all capital letters.
E.g. we write the dataset paths as:

    GRAPH_CSV_FILE; GRAPH_JSON_FILE; RELATIONAL_CSV_FILE; RELATIONAL_JSON_FILE

###  Let's start
We started by analysing the data provided in order to understand the different cases we had to handle and their characteristics. We then created two data processors that extract data from the provided datasets and process them, one for the relational database and one for the graph database.

---

<a id='data_description'></a>

## 2. Data description

We started by analysing the sample JSON and CSV files that we were given to test the software.
The CSV files, both *relational_publications.csv* and *graph_publications.csv*, are composed of the following columns:

    id, title, type, publication_year, issue, volume, chapter, publication_venue, venue_type, publisher, event

In [30]:
relational_csv = read_csv("data/relational_publications.csv")
for columns in relational_csv:
    print(columns)

id
title
type
publication_year
issue
volume
chapter
publication_venue
venue_type
publisher
event


In [41]:
graph_csv = read_csv("data/graph_publications.csv")
for columns in graph_csv:
    print(columns)

id
title
type
publication_year
issue
volume
chapter
publication_venue
venue_type
publisher
event


Each row defines a publication entity.
As defined in the given UML, we have three types of publications: journal articles, book chapters and proceedings papers.
Journal articles can also have issue and volume specified, while book chapters must have a chapter number.

![datamodel](software-documentation/img/datamodel.png)

In the JSON files, *relational_other_data.json* and *graph_other_data.json*, we find additional information about the publications and their related classes. In particular, the JSON files are structured around 4 main keys:

    authors, venues_id, references, publishers

The first three sections contain additional information about the authors, the venues and the citations of other publications; each of these three macro "dictionaries" uses the publications' unique identifiers (DOI) as sub-keys. The fourth key gives further information about the publishers, which can be connected to the information in our CSV through their crossref identifier, used as the key inside the JSON.
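A small sketch of this layout and of how a CSV row can be connected to it. The DOI and the crossref identifier are taken from the sample CSV; every other value (author name, ISSN, publisher name) is a placeholder invented for the example.

```python
import json

# Illustrative data only: names, ISSN and publisher name are placeholders.
raw = """
{
  "authors": {"doi:10.1162/qss_a_00023": [{"family": "Doe", "given": "Jane"}]},
  "venues_id": {"doi:10.1162/qss_a_00023": ["issn:0000-0000"]},
  "references": {"doi:10.1162/qss_a_00023": []},
  "publishers": {"crossref:281": {"id": "crossref:281", "name": "Example Publisher"}}
}
"""
other_data = json.loads(raw)

# A CSV row reduced to the two columns used as keys in the JSON.
csv_row = {"id": "doi:10.1162/qss_a_00023", "publisher": "crossref:281"}

# The DOI is the sub-key of the first three sections ...
authors = other_data["authors"][csv_row["id"]]
# ... while the crossref identifier is the key of the publishers section.
publisher = other_data["publishers"][csv_row["publisher"]]

print([author["family"] for author in authors])
print(publisher["name"])
```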

The files were analysed both manually and by means of Python in order to better understand both the quantitative and the qualitative characteristics of the data.

In [38]:
# `load` lives in the json module; the with statement closes the file afterwards.
with open("data/relational_other_data.json", 'r', encoding='utf-8') as relational_json:
    relational_json_df = json.load(relational_json)
for columns in relational_json_df:
    print(columns)

authors
venues_id
references
publishers


In [39]:
# `load` lives in the json module; the with statement closes the file afterwards.
with open("data/graph_other_data.json", 'r', encoding='utf-8') as graph_json:
    graph_json_df = json.load(graph_json)
for columns in graph_json_df:
    print(columns)

authors
venues_id
references
publishers


### Quantitative characteristics of datasets

#### CSV

We can inspect how the datasets are composed through the pandas *info* method after reading the CSVs into Python. In addition, we can use the *head* method to look at the first 5 rows of our CSV tables.

In [18]:
relational_publications = read_csv("data/relational_publications.csv")
relational_publications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 500 non-null    object 
 1   title              500 non-null    object 
 2   type               500 non-null    object 
 3   publication_year   500 non-null    int64  
 4   issue              347 non-null    object 
 5   volume             443 non-null    object 
 6   chapter            22 non-null     float64
 7   publication_venue  498 non-null    object 
 8   venue_type         498 non-null    object 
 9   publisher          498 non-null    object 
 10  event              0 non-null      float64
dtypes: float64(2), int64(1), object(8)
memory usage: 43.1+ KB


In [19]:
relational_publications.head()

| | id | title | type | publication_year | issue | volume | chapter | publication_venue | venue_type | publisher | event |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | doi:10.1162/qss_a_00023 | Opencitations, An Infrastructure Organization ... | journal-article | 2020 | 1 | 1 | | Quantitative Science Studies | journal | crossref:281 | |
| 1 | doi:10.1007/s11192-019-03217-6 | Software Review: Coci, The Opencitations Index... | journal-article | 2019 | 2 | 121 | | Scientometrics | journal | crossref:297 | |
| 2 | doi:10.1007/s11192-019-03311-9 | Nine Million Book Items And Eleven Million Cit... | journal-article | 2019 | 2 | 122 | | Scientometrics | journal | crossref:297 | |
| 3 | doi:10.1038/sdata.2016.18 | The Fair Guiding Principles For Scientific Dat... | journal-article | 2016 | 1 | 3 | | Scientific Data | journal | crossref:297 | |
| 4 | doi:10.1371/journal.pbio.3000385 | The Nih Open Citation Collection: A Public Acc... | journal-article | 2019 | 10 | 17 | | Plos Biology | journal | crossref:340 | |


In [20]:
graph_publication = read_csv("data/graph_publications.csv")
graph_publication.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 500 non-null    object 
 1   title              500 non-null    object 
 2   type               500 non-null    object 
 3   publication_year   500 non-null    int64  
 4   issue              303 non-null    object 
 5   volume             391 non-null    object 
 6   chapter            93 non-null     float64
 7   publication_venue  486 non-null    object 
 8   venue_type         486 non-null    object 
 9   publisher          486 non-null    object 
 10  event              0 non-null      float64
dtypes: float64(2), int64(1), object(8)
memory usage: 43.1+ KB


In [21]:
graph_publication.head()

| | id | title | type | publication_year | issue | volume | chapter | publication_venue | venue_type | publisher | event |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | doi:10.1016/j.websem.2021.100655 | Crossing The Chasm Between Ontology Engineerin... | journal-article | 2021 | | 70 | | Journal Of Web Semantics | journal | crossref:78 | |
| 1 | doi:10.1007/s10115-017-1100-y | Core Techniques Of Question Answering Systems ... | journal-article | 2017 | 3 | 55 | | Knowledge And Information Systems | journal | crossref:297 | |
| 2 | doi:10.1016/j.websem.2014.03.003 | Api-Centric Linked Data Integration: The Open ... | journal-article | 2014 | | 29 | | Journal Of Web Semantics | journal | crossref:78 | |
| 3 | doi:10.1093/nar/gkz997 | The Monarch Initiative In 2019: An Integrative... | journal-article | 2019 | D1 | 48 | | Nucleic Acids Research | journal | crossref:286 | |
| 4 | doi:10.3390/publications7030050 | Dras-Tic Linked Data: Evenly Distributing The ... | journal-article | 2019 | 3 | 7 | | Publications | journal | crossref:1968 | |


This first exploration of the two CSVs already shows some quantitative differences between the two datasets. This leads to the conclusion that the two databases we create could hold different information about different publications.

#### JSON

In [22]:
relational_other_data = read_json("data/relational_other_data.json")
relational_other_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 540 entries, doi:10.1162/qss_a_00023 to crossref:301
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   authors     508 non-null    object
 1   venues_id   498 non-null    object
 2   references  500 non-null    object
 3   publishers  32 non-null     object
dtypes: object(4)
memory usage: 21.1+ KB


In [23]:
relational_other_data.describe(include="all")

| | authors | venues_id | references | publishers |
|---|---|---|---|---|
| count | 508 | 498 | 500 | 32 |
| unique | 486 | 297 | 99 | 32 |
| top | `[{'family': 'Leydesdorff', 'given': 'Loet', 'o...` | `[issn:0138-9130, issn:1588-2861]` | `[]` | `{'id': 'crossref:6228', 'name': 'Codon Publica...` |
| freq | 4 | 50 | 366 | 1 |


In [24]:
graph_other_data = read_json("./data/graph_other_data.json")
graph_other_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 563 entries, doi:10.1016/j.websem.2021.100655 to crossref:4443
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   authors     526 non-null    object
 1   venues_id   486 non-null    object
 2   references  500 non-null    object
 3   publishers  37 non-null     object
dtypes: object(4)
memory usage: 22.0+ KB


In [25]:
graph_other_data.describe(include="all")

| | authors | venues_id | references | publishers |
|---|---|---|---|---|
| count | 526 | 486 | 500 | 37 |
| unique | 492 | 309 | 101 | 37 |
| top | `[{'family': 'Pal', 'given': 'Kamalendu', 'orci...` | `[issn:2076-3417]` | `[]` | `{'id': 'crossref:735', 'name': 'Thomas Telford...` |
| freq | 4 | 15 | 331 | 1 |


As for the CSVs, we also found some quantitative differences in the data provided inside the JSONs.

Since the final outputs of the software must be Python objects that reflect the data stored in both databases, we have to check for common entries and different ones. We kept this in mind while creating the generic query processor.
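One simple way to reason about common and different entries is to compare the identifiers found in the two datasets with set operations. This is only a sketch: `compare_ids` is a hypothetical helper and the identifiers below are placeholders, not values from the datasets.

```python
def compare_ids(_relational_ids, _graph_ids) -> dict:
    # Sets make intersection and difference one-line operations.
    relational, graph = set(_relational_ids), set(_graph_ids)
    return {
        "common": relational & graph,
        "only_relational": relational - graph,
        "only_graph": graph - relational,
    }


overlap = compare_ids(["doi:a", "doi:b"], ["doi:b", "doi:c"])
for name, ids in overlap.items():
    print(name, sorted(ids))
```

With real data, the two arguments could be the `id` columns of the two publications DataFrames.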

<a id='software_organization'></a>

## 3. Software organization

We tried to organize the software to be scalable and accessible.
It is scalable since we kept the code as general-purpose as possible. The idea is to have a basic structure that can be adapted to different datasets or to future implementations of different kinds of databases.
It is accessible since we designed an **entry point, `main.py`,** to easily access the software. This feature was also important in the testing and debugging phases.

Let's take a look at all the files contained in the program.

<div id="requirements.txt"></div>

## requirements.txt

Python requirements files are a great way to keep track of the Python modules. It is a simple text file that saves a list of the modules and packages required by your project. By creating a Python requirements.txt file, you save yourself the hassle of having to track down and install all of the required modules manually.

It makes it easy to share your project with others. They install the same Python modules you have listed in your requirements file and run your project without any problems.

In case you ever need to update or add a Python module to your project, you simply update the requirements file rather than having to search through all of your code for every reference to the old module.

<div id="const.py"></div>

## const.py

In this file we store all the constants needed in our project. This makes it easy to change the value of elements widely used during execution. For example, you will find the local path of the data sources, the base URL of the RDF resources we create, the path where the relational database file is stored, and so on.

Regarding the triplestore database, we also decided to store here the queries used by the `TriplestoreQueryProcessor`, which keeps the code of the query process cleaner.

<div id="URIs.py"></div>

## URIs.py

The [*Resource Description Framework* (RDF)](https://www.w3.org/RDF/) allows users to describe both Web documents and concepts from the real world—people, organisations, topics, things—in a computer-processable way. Publishing such descriptions on the Web creates the Semantic Web. [URIs (*Uniform Resource Identifiers*)](https://www.w3.org/Addressing/URL/uri-spec.html) are very important, providing both the core of the framework itself and the link between RDF and the Web.

To have them all in one place we created a dedicated file. All the classes of resources, and the properties that relate them, as defined by the provided UML, are declared here.

<div id="data_model.py"></div>

## data_model.py

The structure presented in UML is translated in this file.

![datamodel](software-documentation/img/datamodel.png)

We define all the Python classes and their sub-classes.
This step is important in order to return Python objects from the queries.

<div id="common_utils.py"></div>

## common_utils.py

Here you can find the general-purpose functions used in the program. They address specific tasks we need to perform during execution. You can find in this file custom functions like `csv_to_df`, `json_to_df` or `blazegraph_instance_is_active`, which checks through an HTTP request whether the Blazegraph service is active.
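A minimal sketch of such a check is shown below. It is not the project's implementation: it uses the standard library's `urllib` so that the example has no external dependencies, and the `_url` and `_timeout` parameters are assumptions made for the example.

```python
from urllib.error import URLError
from urllib.request import urlopen


def blazegraph_instance_is_active(_url: str, _timeout: float = 2.0) -> bool:
    try:
        with urlopen(_url, timeout=_timeout) as response:
            # The service is considered active only on an HTTP 200 response.
            return response.status == 200
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error or malformed URL: not active.
        return False
```

The function would be called with the URL of the running Blazegraph instance and returns `False` whenever the service cannot be reached.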

<div id="main.py"></div>

## main.py

This is the entry point of our program: here we launch all the processes we need to obtain our outputs, all wrapped in an `app` function.

A key aspect of developing a complex program like this one was being able to run it in debug mode.
Since we used Visual Studio Code as our editor, we took advantage of one of its key features, its debugging support. VS Code's built-in debugger helps accelerate the edit, compile, and debug loop.
To do so we set up the `launch.json` file (contained in the `.vscode` folder) as reported in [this guide by VS Code](https://code.visualstudio.com/docs/editor/debugging).



<div id="data_processors.py"></div>

## data_processors.py

In this file we process all the data provided and build the [pandas DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) needed to populate our databases.

We created a custom class `DataProcessor` whose attributes are all the DataFrames we create. As in other cases, when we need to write or read the attributes of a class, we use the respective *set* and *get* methods for each attribute.

You can also find the custom sub-class of `DataProcessor` we created, `GraphDataProcessor`, where we compose all the triples of our graph database. As properties of this class we set all the dictionaries used in the process to create the relations between the different entities we describe.

In the original design of the program we had planned to create the tables for the relational database in this file as well. In the final design presented here, we instead moved the creation of those tables to another file, which we will see later.

<div id="relational_processor.py"></div>

## relational_processor.py

#### Relational Data Processor
The upload function of the relational data processor is divided into 2 parts. When we upload the CSV data, we create all the tables required for both the CSV and the JSON data. From the CSV data we populate these tables: Book, BookChapter, Journal, JournalArticles, PublicationID. Proceeding and ProceedingPaper are empty according to our data.

We also create these empty tables, so that they are ready for the JSON data in the next step:
Author, Cites, Organization, OrgID, Person, PersonID, and VenueID.

With the JSON data, references are linked to publications and venue ids are linked to venues. Additionally, authors and organizations are linked to publications and venues.

To create the tables for each publication type we merge citations, authors and venues.

In order to link publications to authors, we create the Author table with the DOIs, and the PersonID table.

<div id="triplestore_processor.py"></div>

## triplestore_processor.py

After the creation of the triples in `data_processors.py`, we upload them to our web service, Blazegraph, to store them and to have a queryable endpoint.

The base class `TriplestoreProcessor` stores in the variable `endpointUrl` the URL of the SPARQL endpoint of the triplestore, initially set to an empty string and updated with the method `setEndpointUrl`.

The sub-class `TriplestoreDataProcessor`, with its method `uploadData`, uploads the collection of data specified in the input file path into the database. We check whether the input file is a CSV or a JSON and launch the methods, defined in the `GraphDataProcessor`, that create the different DataFrames. Before this step we check whether the data have already been uploaded in a previous execution.
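The relation between the two classes can be sketched as follows. This is a simplified illustration, not the project's code: `getEndpointUrl`, the return type of `setEndpointUrl` and the body of `uploadData` are assumptions, and only the file-type check is shown, while the building and uploading of the triples is omitted.

```python
class TriplestoreProcessor(object):
    def __init__(self) -> None:
        # The endpoint URL starts as an empty string.
        self.endpointUrl: str = ""

    def getEndpointUrl(self) -> str:
        return self.endpointUrl

    def setEndpointUrl(self, _url: str) -> None:
        self.endpointUrl = _url


class TriplestoreDataProcessor(TriplestoreProcessor):
    def __init__(self) -> None:
        # Run the base class initializer so that endpointUrl exists.
        super().__init__()

    def uploadData(self, _path: str) -> bool:
        # Only the dispatch on the file type is sketched here.
        extension = _path.lower().rsplit(".", 1)[-1]
        if extension == "csv":
            print("-- INFO: building the DataFrames that come from the CSV")
        elif extension == "json":
            print("-- INFO: building the DataFrames that come from the JSON")
        else:
            print(f"-- ERR: unsupported file type: {_path}")
            return False
        return True


processor = TriplestoreDataProcessor()
processor.setEndpointUrl("http://example.org/sparql")  # placeholder URL
print(processor.getEndpointUrl())
print(processor.uploadData("data/graph_publications.csv"))
```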

<div id="query_processors.py"></div>

## query_processors.py

Two main classes will be contained in this file: `RelationalQueryProcessor` and `TriplestoreQueryProcessor`. Both classes will have as methods the queries required in the project guidelines. As designed in the "UML of additional classes", these two classes will be sub-classes of the respective processor classes (`RelationalProcessor` and `TriplestoreProcessor`) and also both sub-classes of a generic `QueryProcessor`.

In this same file we clean the DataFrames returning from each query processor and we combine the two in one DataFrame with the [pandas concat method.](https://pandas.pydata.org/docs/reference/api/pandas.concat.html?highlight=concat#pandas.concat)
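A minimal sketch of this combination step, with two small placeholder DataFrames standing in for the results of the two query processors. Removing duplicates on the `id` column is an assumption about what "cleaning" involves, not a description of the project's code.

```python
from pandas import DataFrame, concat

# Placeholder results: "doi:b" is returned by both processors.
relational_df = DataFrame({"id": ["doi:a", "doi:b"], "title": ["A", "B"]})
graph_df = DataFrame({"id": ["doi:b", "doi:c"], "title": ["B", "C"]})

combined_df = (
    concat([relational_df, graph_df], ignore_index=True)  # stack the rows
    .drop_duplicates(subset="id")                          # keep one row per id
    .reset_index(drop=True)                                # renumber the rows
)
print(combined_df)
```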

<div id="queries.py"></div>

## queries.py

We created this custom file to manage the execution of all the queries. From `main.py` we launch the functions it contains. Each of them takes as input the list of processors in our project (which makes it possible to add another data processor in the future, for example a NoSQL processor). After the execution of the generic query, we process the results and translate them into the Python objects defined in `data_model.py`. Finally, we write a *txt* file for each query; these files are stored in the `queries-results` folder.

<div id="functioning"></div>

## 4. Functioning

The execution of our program starts from the entry point set in the `launch.json` file, `main.py`. The app starts running and a first message appears in the console to confirm it.

The `app` function contains the whole program. As already said, this design helps to create a clearer sequence of the steps that the program needs to execute to achieve our final goal.

Since we use a few external libraries to handle specific tasks, we wrap the execution of `app` in a *try/except* statement. We do that to be able to understand if and when an error is produced during the execution, above all when the error occurs in code outside our program, for example in an external library.

First of all we see a flag, `data_has_been_uploaded`, set to *False* by default. We will need it later on to check whether the data have been correctly uploaded to Blazegraph.

`main.py` is the entry point of our program. The `if __name__ == "__main__":` guard starts the execution and calls the `app` function, which contains the whole program: all of the execution happens inside it. Besides following standard Python practice, this helps to create a clear execution flow, which makes the program easier to understand and to debug. The entry point itself is configured in `launch.json`.

Since we use functions from pandas, an external library, which can fail for several reasons (wrong parameters, etc.), the first thing we do is wrap the execution in a `try` statement. It is always good practice to protect the execution of a program when external libraries are used, because it makes it possible to identify clearly where the program fails. If even one of the operations inside the `try` clause fails, the `except` clause catches the error and stops the execution. The `finally` clause is used for the actions that must be performed even if the program fails. `Exception` is the Python class that represents the error encountered: the `Exception` object we receive is bound to a variable, `error`, which we then print in the terminal.

As is customary, all constants are written in capital letters, which makes them easy to spot.

The first flag we meet is `data_has_been_uploaded`. For now it is only defined and set to `False`; we will need it later to check that the data have been uploaded correctly.

Blazegraph is the web service we used to host our graph database.

Blazegraph is a web service, so we included starting it among the program's actions and launch it automatically. The output of the check function is a boolean: if the service is not active, the function `start_blazegraph_server()` starts it. Functions of this kind are collected in the common utilities. Looking inside, this function, again within a `try` statement, takes the path of our `blazegraph.jar` file, which provides the connection to the Blazegraph web service. Whether the service responds correctly is checked in `blazegraph_instance_is_active()`, using the `requests` library, which makes HTTP calls easy. If the response code is 200, we proceed with the execution of the software.

After the `time.sleep(3)` we check again whether Blazegraph is active. If it is not, we set the flag back to `False` for tidiness, generate the error ourselves, and stop the execution of the program with `raise`. In this case the `try` statement moves to the `except` clause and then to the `finally` clause, which concludes the execution and reports the error.
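The flow described above can be sketched as follows. The real checks and the server start are replaced by callables passed as parameters (`_is_active`, `_start`), and `ensure_service` is a hypothetical helper; both are assumptions made so that the example can run on its own.

```python
import time


def ensure_service(_is_active, _start, _wait_seconds: float = 3.0) -> bool:
    # Start the service once if needed, then check again after a pause.
    if _is_active():
        return True
    _start()
    time.sleep(_wait_seconds)
    if not _is_active():
        # We generate the error ourselves to stop the execution.
        raise RuntimeError("Blazegraph did not start")
    return True


def app(_is_active, _start, _wait_seconds: float = 3.0) -> bool:
    data_has_been_uploaded = False
    try:
        ensure_service(_is_active, _start, _wait_seconds)
        print("-- INFO: Blazegraph is active")
        data_has_been_uploaded = True  # placeholder for the real upload step
    except Exception as error:
        print(f"-- ERR: {error}")
    finally:
        # Executed whether or not an error occurred.
        print("-- INFO: execution finished")
    return data_has_been_uploaded
```

For instance, `app(lambda: False, lambda: None, 0)` prints the error and returns `False`, while `app(lambda: True, lambda: None, 0)` returns `True`.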

If the service has started, we begin with our processors, starting from the triplestore. Since we noticed that reloading the triples also produced some overwrites, we do not upload them again if they are already there. In `blazegraph_instance_is_empty()` we use a `get` from the SPARQL library: `df_sparql` takes three parameters, which we set, and we run a generic query over the whole database to check whether it contains anything. `empty`, applied to the resulting DataFrame, tells us whether the DataFrame is empty. If it is empty we start processing; otherwise we write in the console that the database has already been populated and skip all the actions needed to populate it. When it is empty, we instantiate the `TriplestoreDataProcessor()` class, which gives us access to its methods. In this case `setEndpointUrl` is inherited from the superclass `TriplestoreProcessor()`. The `processor` (a property of the class) lets us build all the DataFrames we need. For this purpose we created the `GraphDataProcessor()` class: the `processor` property of `TriplestoreDataProcessor` is populated with a `GraphDataProcessor` instance.


### main.py

In the main file, we launch all the processes we need to obtain our outputs, all wrapped in an `app` function. 

After checking the connection with [Blazegraph](https://blazegraph.com/), an ultra high-performance graph database supporting RDF/SPARQL APIs, where our triplestore database is stored, we instantiate the `TriplestoreDataProcessor` class. The method `setEndpointUrl` writes the URL into the corresponding attribute of the `TriplestoreDataProcessor` class.
After the triplestore, we set the path where the relational database will be stored and also instantiate the `RelationalDataProcessor` class. As for the triplestore, the attribute `dbPath` of the class `RelationalProcessor` is written by the method `setDbPath`.

Both databases need a basic class that sets and gets the relative paths where they are stored. The `RelationalProcessor()` class and the `TriplestoreProcessor()` class will take as attribute this path. Through the set method the path will be written in the class attribute, while with the get method it will be read from the class attribute. 

In the next step, we create the query processors for both databases, instantiating `TriplestoreQueryProcessor` and `RelationalQueryProcessor`. Both query processors are appended to a list of processors, which is passed as an attribute to the `GenericQueryProcessor` class.
The variable `queryProcessor` contains this list of *QueryProcessor* objects to involve when one of the *get* methods (the actual queries) is executed. In practice, every time a *get* method is executed, it calls the related method on all the `QueryProcessor` objects included in the variable *queryProcessor*, before combining the results and returning the requested object.
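The delegation pattern can be sketched in plain Python as below. Everything except the `queryProcessor` attribute is an assumption made for illustration: `StubQueryProcessor` stands in for the two real query processors, the method names `addQueryProcessor` and `getPublicationsPublishedInYear` are hypothetical, and lists of dictionaries replace the DataFrames and data model objects used by the project.

```python
class StubQueryProcessor:
    # Hypothetical stand-in for RelationalQueryProcessor / TriplestoreQueryProcessor.
    def __init__(self, _rows: list) -> None:
        self.rows = _rows

    def getPublicationsPublishedInYear(self, _year: int) -> list:
        return [row for row in self.rows if row["publication_year"] == _year]


class GenericQueryProcessor:
    def __init__(self) -> None:
        self.queryProcessor: list = []

    def addQueryProcessor(self, _processor) -> bool:
        self.queryProcessor.append(_processor)
        return True

    def getPublicationsPublishedInYear(self, _year: int) -> list:
        combined = {}
        # Call the same method on every processor, then merge by identifier.
        for processor in self.queryProcessor:
            for row in processor.getPublicationsPublishedInYear(_year):
                combined.setdefault(row["id"], row)
        return list(combined.values())


# Placeholder rows: "doi:b" is stored in both databases.
relational = StubQueryProcessor([
    {"id": "doi:a", "publication_year": 2019},
    {"id": "doi:b", "publication_year": 2020},
])
graph = StubQueryProcessor([
    {"id": "doi:b", "publication_year": 2020},
    {"id": "doi:c", "publication_year": 2020},
])

generic = GenericQueryProcessor()
generic.addQueryProcessor(relational)
generic.addQueryProcessor(graph)
print(generic.getPublicationsPublishedInYear(2020))
```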

After that the query methods are launched, asking about data in both databases. After checking that both query processors are added we run all the generic query functions and we write the records in respective text files.

### data_processors.py

We read the datasets provided and we build all the DataFrames we need for further operations. The `DataProcessor()` class is the first one of the additional classes we created, even if it wasn't required by the UML diagram provided.

We shaped this main class to store the DataFrames we need as attributes. We then use "checking" set and get methods to write and read all the DataFrames. This design breaks the code into a more understandable series of instructions, and it also adds a level of safety by checking the DataFrames that we build.

`DataProcessor()` will be shaped like this:

    class DataProcessor(object):
        def __init__(self) -> None:
            
            self.publications_df: DataFrame = DataFrame()
            self.authors_df: DataFrame      = DataFrame()
            self.references_df: DataFrame   = DataFrame()
            self.venues_df: DataFrame       = DataFrame()
            self.publishers_df: DataFrame   = DataFrame()

Where the attributes will contain the DataFrames we will build later on.

In order to be able to write these attributes we will use a set function for all of them. We show here an example of how we handle the Publications DataFrame:

    def set_publications_df(self, _publications_df):
        # Check if the DataFrame is empty.
        if len(_publications_df) < 1 :
            print('-- WARN: Publications Data Frame is empty :(')

        self.publications_df = _publications_df

    def get_publications_df(self) -> DataFrame:
        return self.publications_df

For each DataFrame a builder function will take as input the original dataset path (CSV or JSON depending on where the data are stored) and will write the relative attribute of the class.
Continuing with the Publications DataFrame example, the method will look like this:

    def publicationsDfBuilder(self, _csv_f_path: str) -> None:

        # The keys must match the column names of the CSV header exactly.
        dtype = {
            'id'                : 'string',
            'title'             : 'string',
            'type'              : 'string',
            'publication_year'  : 'int',
            'issue'             : 'string',
            'volume'            : 'string',
            'chapter'           : 'string',
            'publication_venue' : 'string',
            'venue_type'        : 'string',
            'publisher'         : 'string',
            'event'             : 'string'
        }

        # csv_to_df (from common_utils.py) reads the CSV with the given dtypes.
        publications_df = csv_to_df(_csv_f_path, dtype)

        self.set_publications_df(publications_df)

Here we provide a list of all the `DataProcessor()` methods:

    set_publications_df(self, _publications_df: DataFrame) -> None:
    get_publications_df(self) -> DataFrame:
    set_authors_df(self, _publications_df: DataFrame) -> None:
    get_authors_df(self) -> DataFrame:
    set_publishers_df(self, _publications_df: DataFrame) -> None:
    get_publishers_df(self) -> DataFrame:
    set_references_df(self, _publications_df: DataFrame) -> None:
    get_references_df(self) -> DataFrame:
    set_venues_df(self, _publications_df: DataFrame) -> None:
    get_venues_df(self) -> DataFrame:

    def data_frames_has_been_built(self) -> bool:

    def publicationsDfBuilder(self, _csv_f_path: str) -> None:
    def referncesDfBuilder(self, _json_f_path: str) -> None:
    def venuesDfBuilder(self, _json_f_path: str) -> None:
    def authorsDfBuilder(self, _json_f_path: str) -> None:
    def publishersDfBuilder(self, _json_f_path: str) -> None:

 

In `TriplestoreDataProcessor()`, the special method `__init__(self)` is one of Python's built-in special methods. In a sub-class, the only thing `__init__` does here is call the initializer of the related superclass and populate its own property: `self.processor` declares the property of the class. This `processor` is neither a constant nor an ordinary variable, but a property of the class. Once this class, which is in charge of uploading the data, has been instantiated in `main.py`, we call the method `setEndpointUrl()`, passing it `BASE_URL`; inside `setEndpointUrl` this value is received as the parameter `_url` and is later used to upload the data to the database.

The endpoint URL is the base URL of the resources (triples) that we upload to the graph.

`uploadData` returns a boolean, which we use to update the `data_has_been_uploaded` flag introduced earlier. If the files are valid (`files_are_valid`), we check whether each file is a CSV or a JSON. For a CSV, the processor we initialised earlier builds the publications data frame (a method of the `GraphDataProcessor` class), constructing the generic data frame of publications from the path we passed in. Rather than rewriting each time the steps that read the CSV and produce the required data frame, we wrapped them in a function in our common utilities, `csv_to_df`, which builds the data frame from a dictionary of dtypes (the columns we set, for example, in `publicationsDfBuilder`).
`publicationsDfBuilder` stores in the `DataProcessor` class, as properties, the data frames that we create and that we need to populate the database.

Each data frame that is built is then written to the properties of the class. At the end, `data_frames_has_been_built` checks whether all the data frames were built correctly; if so, we build the triples by launching `GraphBuilder`. The `do_..._triples` methods read the data frames they need.

The triplestore deploy initialises the connection with `.open`. We pass the endpoint twice: the first value is used to create the connection and the second to allow communication with the store, that is, to update it. This is described in the library's documentation.

Each triple in the graph is added to the online database (`triplestore.add(triple)`). In any case, once the deploy is finished, we close the triplestore connection with `triplestore.close()`.

At this point, if there were no errors, we set `data_has_been_uploaded` to `True`. `uploadData` returns this value, as required by the specifications given by the professor. This happens in the `finally` block, which returns the `data_has_been_uploaded` variable.
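The deploy steps above (open, add every triple, close, report the flag) can be sketched as follows. This is an illustration under stated assumptions: `FakeTriplestore` is a hypothetical in-memory stand-in for the real store, `upload_data` is a simplified free function rather than the project's `uploadData` method, and the endpoint URL is a placeholder. Here the connection is closed in `finally` and the flag is returned right after it, a small variation that keeps the same behaviour.

```python
class FakeTriplestore:
    """Hypothetical stand-in for the real store: it only records what it is asked to do."""

    def __init__(self) -> None:
        self.is_open = False
        self.triples = []

    def open(self, endpoints: tuple) -> None:
        # The same endpoint is passed twice: once for the connection, once for updates.
        self.query_endpoint, self.update_endpoint = endpoints
        self.is_open = True

    def add(self, triple: tuple) -> None:
        if not self.is_open:
            raise RuntimeError("store is not open")
        self.triples.append(triple)

    def close(self) -> None:
        self.is_open = False


def upload_data(graph: list, endpoint: str, triplestore: FakeTriplestore) -> bool:
    data_has_been_uploaded = False
    try:
        triplestore.open((endpoint, endpoint))
        for triple in graph:
            triplestore.add(triple)
        # Reached only if every triple was added without errors.
        data_has_been_uploaded = True
    except Exception as error:
        print(f"Upload failed: {error}")
    finally:
        # The connection is closed whether or not the upload succeeded.
        triplestore.close()
    return data_has_been_uploaded


store = FakeTriplestore()
graph = [("ex:doi1", "ex:title", "A title"), ("ex:doi1", "ex:year", "2020")]
uploaded = upload_data(graph, "http://example.org/sparql", store)
```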

At this point we can run the queries.

Init processors: we initialise the query processors with their methods and put the instances of these two classes in a list. We then instantiate the generic query processor and add to it the processors in the list, using a method of the `GenericQueryProcessor` class (`addQueryProcessor`) that returns `True` or `False`.
If this succeeds, the `GenericQueryProcessor` holds both processors. If we had more than two kinds of database, the `GenericQueryProcessor` container could hold them all without any problem, because `addQueryProcessor` writes each processor into the class's own property (the list). In other words, the generic processor has as a property the list of all the query processors we have added. `get.publicationByauthorId` is defined in the generic processor and not in the individual processors, because the queries are the same even though they are written in different languages (SQL and SPARQL). Each query is called only once, but it does different things depending on the type of processor and therefore on the type of database.

The two methods for the relational and the graph database each return a data frame of results. Once the two result data frames have been merged, we build a list from them and thus return a Python object.
The same logic applies to all the queries.
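The dispatch logic described above can be sketched as follows. This is a simplified illustration and not the project's code: the two processor classes return hard-coded rows instead of running SQL or SPARQL, plain lists of dictionaries stand in for data frames, the method name `publications_by_author_id` is hypothetical, and removing duplicates by `doi` is an assumption about how the two result sets are merged.

```python
class RelationalQueryProcessor:
    """Hypothetical stand-in: the real class would run an SQL query."""

    def publications_by_author_id(self, author_id: str) -> list:
        return [{"doi": "doi:1", "author": author_id}]


class TriplestoreQueryProcessor:
    """Hypothetical stand-in: the real class would run a SPARQL query."""

    def publications_by_author_id(self, author_id: str) -> list:
        return [
            {"doi": "doi:1", "author": author_id},
            {"doi": "doi:2", "author": author_id},
        ]


class GenericQueryProcessor:
    def __init__(self) -> None:
        # The list of query processors is a property of the class.
        self.query_processors = []

    def addQueryProcessor(self, processor) -> bool:
        if processor is None or processor in self.query_processors:
            return False
        self.query_processors.append(processor)
        return True

    def publications_by_author_id(self, author_id: str) -> list:
        # One call from the user; each processor answers in its own way.
        merged = []
        seen = set()
        for processor in self.query_processors:
            for row in processor.publications_by_author_id(author_id):
                if row["doi"] not in seen:  # assumed de-duplication key
                    seen.add(row["doi"])
                    merged.append(row)
        return merged


generic = GenericQueryProcessor()
added = [
    generic.addQueryProcessor(processor)
    for processor in (RelationalQueryProcessor(), TriplestoreQueryProcessor())
]
results = generic.publications_by_author_id("0000-0001")
```

Because the generic processor only iterates over its list, a third kind of database would need a new processor class but no change to the generic one.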

The set and get methods keep the properties "safe". If a property must be a string, for example, the check is performed in the set method. This adds a level of safety and prevents, for example, a number from being written where a string is expected. We took the opportunity to use these dedicated methods to protect the properties of our classes. In `GraphDataProcessor`, for example, `setGraph` already includes a check that the graph is not empty.
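A guarded setter of this kind might look like the sketch below. The class `GraphHolder`, the use of a plain list as the graph, and the boolean return value are assumptions made for the example; only the idea of rejecting an empty graph comes from the description above.

```python
class GraphHolder:
    """Hypothetical class showing a setter that protects its property."""

    def __init__(self) -> None:
        self.graph = []

    def setGraph(self, _graph: list) -> bool:
        # Reject values of the wrong type and empty graphs.
        if not isinstance(_graph, list) or len(_graph) == 0:
            return False
        self.graph = _graph
        return True

    def getGraph(self) -> list:
        return self.graph


holder = GraphHolder()
rejected = holder.setGraph([])  # empty graph: the property is left untouched
accepted = holder.setGraph([("ex:doi1", "ex:title", "A title")])
```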

`TriplestoreDataProcessor` contains a property that is the actual data processor, an instance of a further class, `GraphDataProcessor`. This also justifies the existence of the `TriplestoreDataProcessor` class, which effectively processes the data before uploading them.
