# Software documentation

1. Introduction
2. Data description
3. Software organization
4. Functioning

## 1. Introduction

### Project goal
Our goal is to build software that can populate and query two kinds of databases: a relational database and a graph database.  
In this project we deal with *structured data* coming from CSV and JSON files. To process the data in Python we use ***pandas***, which provides high-level data structures and functions designed to make working with structured or tabular data intuitive and flexible.

We then upload the resulting records into our databases and make it possible to query both databases simultaneously, returning specific Python objects.

An important characteristic of the Python language is the consistency of its object model. Every number, string, data structure, function, class, module, and so on exists in the Python interpreter in its own “box”, which is referred to as a Python object. Each object has an associated type (e.g., integer, string, or function) and internal data. For better readability, our class methods use type annotations to make the data types of the input parameters (arguments) and of the return value explicit. For example:  

    def set_df(self, _df: DataFrame) -> None:

This means that `set_df` takes a DataFrame as input and returns nothing, that is, `None`. Note, though, that annotations are hints only: Python does not check them at run time and will not raise an error if an argument has a different type (e.g. string, integer, list, …).
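
The following minimal sketch illustrates that annotations are not checked at run time. The `Holder` class is a hypothetical stand-in and is not part of the project; only the `set_df` signature comes from the example above.

```python
from pandas import DataFrame


class Holder:
    """Hypothetical container used only to illustrate type annotations."""

    def __init__(self) -> None:
        self.df: DataFrame = DataFrame()

    def set_df(self, _df: DataFrame) -> None:
        # The annotation documents the expected type but does not enforce it.
        self.df = _df


holder = Holder()
holder.set_df(DataFrame({"id": ["a", "b"]}))  # matches the annotation
holder.set_df("not a DataFrame")              # also accepted at run time
print(type(holder.df).__name__)               # prints "str"

# The annotations stay available to readers and to external tools.
print(Holder.set_df.__annotations__)
```

A static type checker run over this code can report the second call, even though the interpreter executes it without complaint.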

### Let's start
We started by analysing the provided data in order to understand the different cases we had to handle and their characteristics. We then created two data processors that extract data from the provided datasets and process them, one for the relational database and one for the graph database.

---


## 2. Data description

We started by analysing the sample JSON and CSV files provided to test the software.
The CSV files, *relational_publications.csv* and *graph_publications.csv*, are both composed of the following columns:

    id, title, type, publication_year, issue, volume, chapter, publication_venue, venue_type, publisher, event

Each row defines a publication entity.
As defined in the given UML, there are three types of publications: journal articles, book chapters and proceedings papers.
Journal articles can also have an issue and a volume specified, while book chapters must have a chapter number.

In the JSON files, *relational_other_data.json* and *graph_other_data.json*, we find additional information about the publications and their related classes. In particular, the JSON files are structured around four main keys:

    authors, venues_id, references, publishers

The first three sections contain additional information about the authors, venues and citations of the publications; each of these three macro "dictionaries" uses the publications' unique identifiers (DOIs) as sub-keys. The fourth key gives further information about the publishers, which can be connected to the information in our CSV through their Crossref identifier, used as the key inside the JSON.
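
As a rough illustration of this layout, the sketch below builds a tiny document with the same four top-level keys and flattens two of them into DataFrames. Every value is a made-up placeholder, and the variable names (`sample`, `authors_df`, `publishers_df`) are assumptions for the example, not the project's code.

```python
import json

from pandas import DataFrame

# Tiny made-up document that mirrors the four top-level keys.
# All values below are placeholders, not records from the real files.
sample = json.loads("""
{
  "authors": {
    "doi:10.0000/example": [
      {"family": "Doe", "given": "Jane"},
      {"family": "Roe", "given": "Richard"}
    ]
  },
  "venues_id": {"doi:10.0000/example": ["issn:0000-0000"]},
  "references": {"doi:10.0000/example": []},
  "publishers": {
    "crossref:0": {"id": "crossref:0", "name": "Example Publisher"}
  }
}
""")

# Flatten the DOI-keyed authors section into one row per (publication, author).
author_rows = [
    {"doi": doi, "family": author["family"], "given": author["given"]}
    for doi, authors in sample["authors"].items()
    for author in authors
]
authors_df = DataFrame(author_rows)

# Publishers are keyed by Crossref identifier, so each value is one record.
publishers_df = DataFrame(list(sample["publishers"].values()))

print(authors_df)
print(publishers_df)
```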

The files have been analysed both manually and by means of Python in order to better understand both the quantitative and the qualitative characteristics of the data.

### Quantitative characteristics of the datasets

#### CSV

After reading the CSVs into Python, we can inspect how the datasets are composed through the pandas *info* method. In addition, we can use the *head* method to look at the first 5 rows of our CSV tables.

    from pandas import read_csv, read_json

    relational_publications = read_csv("data/relational_publications.csv")
    relational_publications.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 500 entries, 0 to 499
    Data columns (total 11 columns):
     #   Column             Non-Null Count  Dtype
    ---  ------             --------------  -----
     0   id                 500 non-null    object
     1   title              500 non-null    object
     2   type               500 non-null    object
     3   publication_year   500 non-null    int64
     4   issue              347 non-null    object
     5   volume             443 non-null    object
     6   chapter            22 non-null     float64
     7   publication_venue  498 non-null    object
     8   venue_type         498 non-null    object
     9   publisher          498 non-null    object
     10  event              0 non-null      float64
    dtypes: float64(2), int64(1), object(8)
    memory usage: 43.1+ KB


    relational_publications.head()

Output:

| | id | title | type | publication_year | issue | volume | chapter | publication_venue | venue_type | publisher | event |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | doi:10.1162/qss_a_00023 | Opencitations, An Infrastructure Organization ... | journal-article | 2020 | 1 | 1 | | Quantitative Science Studies | journal | crossref:281 | |
| 1 | doi:10.1007/s11192-019-03217-6 | Software Review: Coci, The Opencitations Index... | journal-article | 2019 | 2 | 121 | | Scientometrics | journal | crossref:297 | |
| 2 | doi:10.1007/s11192-019-03311-9 | Nine Million Book Items And Eleven Million Cit... | journal-article | 2019 | 2 | 122 | | Scientometrics | journal | crossref:297 | |
| 3 | doi:10.1038/sdata.2016.18 | The Fair Guiding Principles For Scientific Dat... | journal-article | 2016 | 1 | 3 | | Scientific Data | journal | crossref:297 | |
| 4 | doi:10.1371/journal.pbio.3000385 | The Nih Open Citation Collection: A Public Acc... | journal-article | 2019 | 10 | 17 | | Plos Biology | journal | crossref:340 | |


    graph_publication = read_csv("data/graph_publications.csv")
    graph_publication.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 500 entries, 0 to 499
    Data columns (total 11 columns):
     #   Column             Non-Null Count  Dtype
    ---  ------             --------------  -----
     0   id                 500 non-null    object
     1   title              500 non-null    object
     2   type               500 non-null    object
     3   publication_year   500 non-null    int64
     4   issue              303 non-null    object
     5   volume             391 non-null    object
     6   chapter            93 non-null     float64
     7   publication_venue  486 non-null    object
     8   venue_type         486 non-null    object
     9   publisher          486 non-null    object
     10  event              0 non-null      float64
    dtypes: float64(2), int64(1), object(8)
    memory usage: 43.1+ KB


    graph_publication.head()

Output:

| | id | title | type | publication_year | issue | volume | chapter | publication_venue | venue_type | publisher | event |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | doi:10.1016/j.websem.2021.100655 | Crossing The Chasm Between Ontology Engineerin... | journal-article | 2021 | | 70 | | Journal Of Web Semantics | journal | crossref:78 | |
| 1 | doi:10.1007/s10115-017-1100-y | Core Techniques Of Question Answering Systems ... | journal-article | 2017 | 3 | 55 | | Knowledge And Information Systems | journal | crossref:297 | |
| 2 | doi:10.1016/j.websem.2014.03.003 | Api-Centric Linked Data Integration: The Open ... | journal-article | 2014 | | 29 | | Journal Of Web Semantics | journal | crossref:78 | |
| 3 | doi:10.1093/nar/gkz997 | The Monarch Initiative In 2019: An Integrative... | journal-article | 2019 | D1 | 48 | | Nucleic Acids Research | journal | crossref:286 | |
| 4 | doi:10.3390/publications7030050 | Dras-Tic Linked Data: Evenly Distributing The ... | journal-article | 2019 | 3 | 7 | | Publications | journal | crossref:1968 | |


This first exploration of the two CSV files already shows some quantitative differences between the two datasets. This leads to the conclusion that the two databases we create could hold different information about different publications.

#### JSON

    relational_other_data = read_json("data/relational_other_data.json")
    relational_other_data.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    Index: 540 entries, doi:10.1162/qss_a_00023 to crossref:301
    Data columns (total 4 columns):
     #   Column      Non-Null Count  Dtype
    ---  ------      --------------  -----
     0   authors     508 non-null    object
     1   venues_id   498 non-null    object
     2   references  500 non-null    object
     3   publishers  32 non-null     object
    dtypes: object(4)
    memory usage: 21.1+ KB


    relational_other_data.describe(include="all")

Output:

| | authors | venues_id | references | publishers |
|---|---|---|---|---|
| count | 508 | 498 | 500 | 32 |
| unique | 486 | 297 | 99 | 32 |
| top | [{'family': 'Leydesdorff', 'given': 'Loet', 'o... | [issn:0138-9130, issn:1588-2861] | [] | {'id': 'crossref:6228', 'name': 'Codon Publica... |
| freq | 4 | 50 | 366 | 1 |


    graph_other_data = read_json("./data/graph_other_data.json")
    graph_other_data.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    Index: 563 entries, doi:10.1016/j.websem.2021.100655 to crossref:4443
    Data columns (total 4 columns):
     #   Column      Non-Null Count  Dtype
    ---  ------      --------------  -----
     0   authors     526 non-null    object
     1   venues_id   486 non-null    object
     2   references  500 non-null    object
     3   publishers  37 non-null     object
    dtypes: object(4)
    memory usage: 22.0+ KB


    graph_other_data.describe(include="all")

Output:

| | authors | venues_id | references | publishers |
|---|---|---|---|---|
| count | 526 | 486 | 500 | 37 |
| unique | 492 | 309 | 101 | 37 |
| top | [{'family': 'Pal', 'given': 'Kamalendu', 'orci... | [issn:2076-3417] | [] | {'id': 'crossref:735', 'name': 'Thomas Telford... |
| freq | 4 | 15 | 331 | 1 |


As with the CSVs, we also found some quantitative differences in the data provided inside the JSONs.

Since the final outputs of the software must be Python objects that reflect the data stored in both databases, we have to check for common entries and different ones. We kept this in mind while creating the generic query processor.
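
One possible way to separate common entries from different ones is an outer merge on the identifier column. This is only a sketch with placeholder identifiers; it does not claim to reproduce how the generic query processor actually combines its results.

```python
from pandas import DataFrame, merge

# Placeholder identifiers, not taken from the real datasets.
relational_ids = DataFrame({"id": ["doi:a", "doi:b", "doi:c"]})
graph_ids = DataFrame({"id": ["doi:b", "doi:c", "doi:d"]})

# An outer merge keeps every identifier; the indicator column records
# whether it came from the left frame, the right frame, or both.
comparison = merge(relational_ids, graph_ids, on="id", how="outer", indicator=True)

common = comparison.loc[comparison["_merge"] == "both", "id"].tolist()
only_relational = comparison.loc[comparison["_merge"] == "left_only", "id"].tolist()
only_graph = comparison.loc[comparison["_merge"] == "right_only", "id"].tolist()

print(common, only_relational, only_graph)
```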

## 3. Software organization

We tried to organize the software to be scalable and accessible.
It is scalable since we tried to keep the code as general-purpose as possible. The idea is to have a basic structure that can be adapted to different datasets.
It is accessible since we designed an **entry point, `main.py`,** for easy access to the software. This feature was also important in the testing and debugging phases.

### main.py

In the main file, we launch all the processes we need to obtain our outputs, all wrapped in an `app` function. 

Already here, in the entry point, you can see how we made use of the `try/except` statement in the code to handle errors. The `try/except` statement works as follows:

- First, the try clause (the statement(s) between the try and except keywords) is executed.

- If no exception occurs, the except clause is skipped and execution of the try statement is finished.

- If an exception occurs during the execution of the try clause, the rest of the clause is skipped. Then, if its type matches the exception named after the except keyword, the except clause is executed, and then execution continues after the try/except block.

It is also possible to catch the specific Python error in the except clause with the built-in class `Exception` and print it in the terminal. In the following example, the loop keeps asking for input until you enter a valid integer. If the input cannot be converted, `int()` raises an error, its message (for example `invalid literal for int() with base 10: 'abc'`) is printed, and the prompt is shown again.

    while True:
        try:
            x = int(input("Please enter a number: "))
            break
        except Exception as error:
            print(error)
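
Since an except clause only runs when the exception type matches, the handler can also name a narrower class than `Exception`. The sketch below uses a hypothetical helper, `parse_int`, that handles only `ValueError`; it is an illustration and not part of `main.py`.

```python
from typing import Optional


def parse_int(text: str) -> Optional[int]:
    """Return the integer value of text, or None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError as error:
        # Only ValueError is handled here; any other exception would propagate.
        print(error)
        return None


print(parse_int("42"))   # prints 42
print(parse_int("abc"))  # prints the error message, then None
```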

Moreover, after checking that the connection with Blazegraph works (Blazegraph is a high-performance graph database supporting RDF/SPARQL APIs, where our triplestore database will be stored), we instantiate the `TriplestoreDataProcessor` class.
Afterwards we set the relational database path and also instantiate the `RelationalDataProcessor` class.

Both databases need a base class that sets and gets the path where they are stored. The `RelationalProcessor()` class and the `TriplestoreProcessor()` class hold this path as an attribute. The set method writes the path to the class attribute and the get method reads it back.
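
A minimal sketch of such a base class could look as follows. The attribute name `db_path` and the method names `set_db_path` and `get_db_path` are assumptions made for this example; the actual names in the project may differ.

```python
class RelationalProcessor:
    """Sketch of a base class that stores the path of a relational database."""

    def __init__(self) -> None:
        self.db_path: str = ""

    def set_db_path(self, _db_path: str) -> None:
        # Hypothetical setter name: writes the path to the attribute.
        self.db_path = _db_path

    def get_db_path(self) -> str:
        # Hypothetical getter name: reads the path back.
        return self.db_path


processor = RelationalProcessor()
processor.set_db_path("publications.db")  # placeholder path
print(processor.get_db_path())            # prints "publications.db"
```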

In the next step, we create the query processors for both databases, using the related classes.

Finally, we create a generic query processor for querying the data. After checking that both query processors have been added, we run all the generic query functions and write the records to their respective text files.

### data_processors.py

We read the provided datasets and build all the DataFrames we need for further operations. The `DataProcessor()` class is the first of the additional classes we created, even though it was not required by the provided UML diagram.

We shaped this main class to store the DataFrames we need as attributes. We then use "checking" methods to set and get each DataFrame. This design breaks the code into a more understandable series of instructions and adds a safety layer that checks the DataFrames we build, for example by warning when one of them is empty.

`DataProcessor()` will be shaped like this:

    class DataProcessor(object):
        def __init__(self) -> None:
            
            self.publications_df: DataFrame = DataFrame()
            self.authors_df: DataFrame      = DataFrame()
            self.references_df: DataFrame   = DataFrame()
            self.venues_df: DataFrame       = DataFrame()
            self.publishers_df: DataFrame   = DataFrame()

The attributes will contain the DataFrames we build later on.

In order to write these attributes we use a set function for each of them. Here is an example of how we handle the Publications DataFrame:

    def set_publications_df(self, _publications_df: DataFrame) -> None:
        # Warn if the DataFrame is empty; the value is stored either way.
        if len(_publications_df) < 1:
            print('-- WARN: Publications Data Frame is empty :(')

        self.publications_df = _publications_df

    def get_publications_df(self) -> DataFrame:
        return self.publications_df

For each DataFrame, a builder function takes as input the path of the original dataset (CSV or JSON, depending on where the data are stored) and writes the corresponding attribute of the class.
Continuing with the Publications DataFrame example, the method looks like this:

    def publicationsDfBuilder(self, _csv_f_path: str) -> None:
        # Expected data type of each CSV column.
        dtype = {
            'id'                : 'string',
            'title'             : 'string',
            'type'              : 'string',
            'publication_year'  : 'int',
            'issue'             : 'string',
            'volume'            : 'string',
            'chapter'           : 'string',
            'publication_venue' : 'string',
            'venue_type'        : 'string',
            'publisher'         : 'string',
            'event'             : 'string'
        }

        # Read the CSV into a DataFrame using the dtypes above.
        publications_df = csv_to_df(_csv_f_path, dtype)

        self.set_publications_df(publications_df)
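
The `csv_to_df` helper is not shown in this documentation. A minimal sketch, assuming it simply wraps `pandas.read_csv` and forwards the dtype mapping, might look like this; the parameter names and the temporary sample file are illustrative only.

```python
import os
import tempfile

from pandas import DataFrame, read_csv


def csv_to_df(_csv_f_path: str, _dtype: dict) -> DataFrame:
    """Assumed behaviour: read a CSV file into a DataFrame with the given dtypes."""
    return read_csv(_csv_f_path, dtype=_dtype, encoding="utf-8")


# Write a one-row placeholder CSV to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write("id,publication_year,issue\n")
        handle.write("doi:10.0000/example,2020,\n")

    sample_df = csv_to_df(
        csv_path,
        {"id": "string", "publication_year": "int", "issue": "string"},
    )

print(sample_df.dtypes)
```

Reading `issue` and `volume` as strings rather than numbers keeps values such as `D1` intact.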

Here we provide a list of all the `DataProcessor()` methods:

    def set_publications_df(self, _publications_df: DataFrame) -> None:
    def get_publications_df(self) -> DataFrame:
    def set_authors_df(self, _publications_df: DataFrame) -> None:
    def get_authors_df(self) -> DataFrame:
    def set_publishers_df(self, _publications_df: DataFrame) -> None:
    def get_publishers_df(self) -> DataFrame:
    def set_references_df(self, _publications_df: DataFrame) -> None:
    def get_references_df(self) -> DataFrame:
    def set_venues_df(self, _publications_df: DataFrame) -> None:
    def get_venues_df(self) -> DataFrame:

    def data_frames_has_been_built(self) -> bool:

    def publicationsDfBuilder(self, _csv_f_path: str) -> None:
    def referncesDfBuilder(self, _json_f_path: str) -> None:
    def venuesDfBuilder(self, _json_f_path: str) -> None:
    def authorsDfBuilder(self, _json_f_path: str) -> None:
    def publishersDfBuilder(self, _json_f_path: str) -> None:
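
The body of `data_frames_has_been_built` is not shown here. As a sketch only, and assuming the check means "every stored DataFrame contains at least one row", it could be written as below. `DataProcessorSketch` is a reduced, hypothetical stand-in with two of the five DataFrames.

```python
from pandas import DataFrame


class DataProcessorSketch:
    """Reduced stand-in for DataProcessor with only two of its DataFrames."""

    def __init__(self) -> None:
        self.publications_df: DataFrame = DataFrame()
        self.authors_df: DataFrame = DataFrame()

    def data_frames_has_been_built(self) -> bool:
        # Assumed rule: every stored DataFrame must contain at least one row.
        frames = [self.publications_df, self.authors_df]
        return all(not frame.empty for frame in frames)


processor = DataProcessorSketch()
print(processor.data_frames_has_been_built())  # prints False

processor.publications_df = DataFrame({"id": ["doi:10.0000/example"]})
processor.authors_df = DataFrame({"family": ["Doe"]})
print(processor.data_frames_has_been_built())  # prints True
```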