# Example #2: Using the `Database` class

This notebook demonstrates how to use the `Database` class to create a new database or to load and save an existing one. The `Database` class has the following public attributes:
* `db` [pd.DataFrame]: The database object as a Pandas DataFrame.
* `creation_time` [datetime.datetime]: Denotes when the database was created.
* `description` [str]: A description of the database. For example: "Active inference articles from ArXiv between July 2024 and December 2024."
* `tag_version` [int]: A hash of the tag file that serves as the tag file version.

Note that some attributes remain `None` until certain methods have been called (for example, creating, loading, or saving a database).
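
The `tag_version` attribute is described above as an integer hash of the tag file. How `Database` computes it is not shown in this notebook. As a rough illustration only, one way to derive a stable integer from a file's contents is with `hashlib`. The helper name `file_version`, the 8-byte truncation, and the example tag file contents are all assumptions made for this sketch. Python's built-in `hash()` is avoided here because, by default, it is randomized per process for strings and bytes.

```python
import hashlib
import tempfile
from pathlib import Path


def file_version(path: Path) -> int:
    """Return a stable integer derived from the file's bytes (hypothetical helper)."""
    digest = hashlib.sha256(path.read_bytes()).digest()
    # Keep the first 8 bytes so the result fits in an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], byteorder="big")


with tempfile.TemporaryDirectory() as tmp:
    # Made-up tag file contents, used only to have something to hash.
    tag_file = Path(tmp) / "tags.yaml"
    tag_file.write_text("10.1192/bjb.2021.6: [psychoanalysis, psychotherapy]\n", encoding="utf-8")
    version_before = file_version(tag_file)
    version_again = file_version(tag_file)

    # Editing the tag file changes the bytes, and therefore the version number.
    tag_file.write_text("10.1192/bjb.2021.6: [editorial]\n", encoding="utf-8")
    version_after = file_version(tag_file)

print(version_before == version_again, version_before == version_after)
```

A version number of this kind lets a saved database record which state of the tag file it was built against.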

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
os.chdir("..")

from src.db import Database

TABLE_DIR = "data/tables/2023_09_14"
TAG_PATH = "data/tags/tags.yaml"
# TAG_PATH = "data/tags/tag_test.yaml"

## 2.1 Creating a new database

A "database" in this context is not a proper database but a Pandas dataframe. A true database is overkill for the small size of the tables involved.

A database is created from existing tables. Most commonly these tables are scraped from online paper archives such as PubMed or ArXiv (see the notebook `1_scrapers.ipynb` for more detail). However, you can also load your own tables, as long as they contain the expected columns (DOI, title, authors, year, where_published).

Building the database from tables ensures reproducibility and lets every paper be traced back to the scrape it came from.
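
For a hand-built table, the following minimal sketch writes a one-row table with the expected columns and then reads the header back to check it. It assumes that tables are stored as CSV files inside the table directory and that the columns use the lower-case names seen in this notebook's output. The file name `my_papers.csv` and the temporary directory are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Column names taken from the database output shown in this notebook.
EXPECTED_COLUMNS = ["title", "authors", "where_published", "year", "doi"]

rows = [
    {
        "title": "Friston's theory of everything",
        "authors": "McCrone J.",
        "where_published": "Lancet Neurol",
        "year": 2022,
        "doi": "10.1016/S1474-4422(22)00137-5",
    },
]

with tempfile.TemporaryDirectory() as table_dir:
    table_path = Path(table_dir) / "my_papers.csv"  # hypothetical file name
    with table_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=EXPECTED_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)

    # Read the header back to confirm the table has every expected column.
    with table_path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle))

missing = [column for column in EXPECTED_COLUMNS if column not in header]
print(missing)
```

An empty `missing` list means the table carries every column the database expects.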

In [3]:
database = Database()
database.create(table_dir=TABLE_DIR)
database.db

INFO:src.db:Loading tables...
INFO:src.db:Database created at 2025-04-07 15:37:59.026888.


Unnamed: 0,title,authors,where_published,year,doi
0,Friston's free energy principle: new life for ...,Holmes J.,BJPsych Bull,2022,10.1192/bjb.2021.6
1,Friston's theory of everything,McCrone J.,Lancet Neurol,2022,10.1016/S1474-4422(22)00137-5
2,Voxel-based morphometry--the methods,"Ashburner J, Friston KJ.",Neuroimage,2000,10.1006/nimg.2000.0582
3,Scientific realism about Friston blankets with...,"Kiverstein J, Kirchhoff M.",Behav Brain Sci,2022,10.1017/S0140525X22000267
4,Structural and functional brain networks: from...,"Park HJ, Friston K.",Science,2013,10.1126/science.1238411
...,...,...,...,...,...
618,Changes in both top-down and bottom-up effecti...,"Thomas GEC, Zeidman P, Sultana T, Zarkali A, R...",Brain Commun,2022,10.1093/braincomms/fcac329
619,Spectral-temporal EEG dynamics of speech discr...,"Gilley PM, Uhler K, Watson K, Yoshinaga-Itano C.",BMC Neurosci,2017,10.1186/s12868-017-0353-4
620,A robot or a dumper truck? Facilitating play-b...,"Paldam E, Roepstorff A, Steensgaard R, Lundsga...",Autism Dev Lang Impair,2022,10.1177/23969415221086714
621,The neurophenomenology of early psychosis: An ...,"Nelson B, Lavoie S, Gawęda Ł, Li E, Sass LA, K...",Conscious Cogn,2020,10.1016/j.concog.2019.102845


## 2.2 Attaching tags to a database

After the database is created, the goal is to tag the papers in it. Tagging is done separately using the `Tag` class (see the notebook `3_tags.ipynb` for more detail). Once the tag file is created and saved, it can be attached to the database by joining on the DOIs. Any papers without tags are automatically given the tag "untagged". Note that if a tag file is already loaded, calling `attach_tags()` will not overwrite it unless `overwrite=True` is set.

In [4]:
database.attach_tags(tag_path=TAG_PATH, overwrite=False)
database.db

INFO:src.db:Loading tags...
INFO:src.tags:YAML tag file successfully loaded from data/tags/tags.yaml.
INFO:src.db:Adding tags to database...
INFO:src.db:0 papers are currently untagged.


Unnamed: 0,title,authors,where_published,year,doi,tag
0,Friston's free energy principle: new life for ...,Holmes J.,BJPsych Bull,2022.0,10.1192/bjb.2021.6,"[psychoanalysis, psychotherapy]"
1,Friston's theory of everything,McCrone J.,Lancet Neurol,2022.0,10.1016/S1474-4422(22)00137-5,[editorial]
2,Voxel-based morphometry--the methods,"Ashburner J, Friston KJ.",Neuroimage,2000.0,10.1006/nimg.2000.0582,"[review, neuroimaging]"
3,Scientific realism about Friston blankets with...,"Kiverstein J, Kirchhoff M.",Behav Brain Sci,2022.0,10.1017/S0140525X22000267,"[Markov blankets, philosophy, comment / response]"
4,Structural and functional brain networks: from...,"Park HJ, Friston K.",Science,2013.0,10.1126/science.1238411,"[review, network analysis]"
...,...,...,...,...,...,...
3580,Changes in both top-down and bottom-up effecti...,"Thomas GEC, Zeidman P, Sultana T, Zarkali A, R...",Brain Commun,2022.0,10.1093/braincomms/fcac329,[predictive processing]
3581,Spectral-temporal EEG dynamics of speech discr...,"Gilley PM, Uhler K, Watson K, Yoshinaga-Itano C.",BMC Neurosci,2017.0,10.1186/s12868-017-0353-4,[predictive processing]
3582,A robot or a dumper truck? Facilitating play-b...,"Paldam E, Roepstorff A, Steensgaard R, Lundsga...",Autism Dev Lang Impair,2022.0,10.1177/23969415221086714,[predictive processing]
3583,The neurophenomenology of early psychosis: An ...,"Nelson B, Lavoie S, Gawęda Ł, Li E, Sass LA, K...",Conscious Cogn,2020.0,10.1016/j.concog.2019.102845,[predictive processing]
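
The join on DOIs described above can be sketched in plain Python. This is not the `attach_tags()` implementation, which is not shown in this notebook. It only illustrates the logic of a left join with an "untagged" fallback. The `tags_by_doi` mapping is an assumed in-memory form of the tag file.

```python
papers = [
    {"doi": "10.1192/bjb.2021.6"},
    {"doi": "10.48550/arXiv.2409.15532"},
]

# Tag file contents keyed by DOI (the structure is an assumption for this sketch).
tags_by_doi = {
    "10.1192/bjb.2021.6": ["psychoanalysis", "psychotherapy"],
}

# Left join on DOI: every paper is kept, and papers with no tag entry get ["untagged"].
tagged = [
    {**paper, "tag": tags_by_doi.get(paper["doi"], ["untagged"])}
    for paper in papers
]

for paper in tagged:
    print(paper["doi"], paper["tag"])
```

Because the join keeps every paper, attaching tags never changes the number of rows in the database.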


## 2.3 Saving a database

After creating a database and attaching tags, you can save the database using the `save()` method. Saving a database pickles the database alongside some metadata which includes the database description, creation time, and hash identifier for the tag file. If no save path is specified (`outpath` parameter is `None`) then a filename will be created automatically using a timestamp.

In [5]:
database.save(database_description="Active inference papers")

INFO:src.db:Database saved to data/databases/database__2025-04-07__15:37:59.026888.pkl.
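
As a rough sketch of what pickling a table together with its metadata can look like, the snippet below bundles everything in one dictionary, builds a default file name from the creation timestamp, and reads the file back. The dictionary key names and the use of a list of records in place of a DataFrame are assumptions. The real `save()` method may be organised differently.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path

creation_time = datetime(2025, 4, 7, 15, 37, 59, 26888)

# Bundle the table with its metadata (key names are assumptions for this sketch).
payload = {
    "db": [{"doi": "10.1192/bjb.2021.6", "tag": ["psychoanalysis", "psychotherapy"]}],
    "creation_time": creation_time,
    "description": "Active inference papers",
    "tag_version": 6537033589147705874,
}

# Default file name built from the timestamp. The notebook's log shows colons in
# the time part; hyphens are used here because ":" is not allowed in Windows file names.
stamp = creation_time.strftime("%Y-%m-%d__%H-%M-%S.%f")
filename = f"database__{stamp}.pkl"

with tempfile.TemporaryDirectory() as out_dir:
    outpath = Path(out_dir) / filename
    with outpath.open("wb") as handle:
        pickle.dump(payload, handle)

    # Reading the file back restores the table and its metadata together.
    with outpath.open("rb") as handle:
        restored = pickle.load(handle)

print(filename)
print(restored["description"], restored["creation_time"])
```

Since unpickling can execute arbitrary code, pickled databases should only be loaded from sources you trust.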


## 2.4 Loading a database

If we wish to reload a previously saved database we can use the `load()` method. The load method takes a path to a pickled database file (`.pkl` extension) and unpacks it into a Pandas DataFrame stored in the `db` attribute of the `Database` class.

In [6]:
database = Database()
database.load("data/databases/database__2025-04-07__15:37:59.026888.pkl")
database.db

INFO:src.db:Database loaded from data/databases/database__2025-04-07__15:37:59.026888.pkl.


Unnamed: 0,title,authors,where_published,year,doi,tag
0,Friston's free energy principle: new life for ...,Holmes J.,BJPsych Bull,2022.0,10.1192/bjb.2021.6,"[psychoanalysis, psychotherapy]"
1,Friston's theory of everything,McCrone J.,Lancet Neurol,2022.0,10.1016/S1474-4422(22)00137-5,[editorial]
2,Voxel-based morphometry--the methods,"Ashburner J, Friston KJ.",Neuroimage,2000.0,10.1006/nimg.2000.0582,"[review, neuroimaging]"
3,Scientific realism about Friston blankets with...,"Kiverstein J, Kirchhoff M.",Behav Brain Sci,2022.0,10.1017/S0140525X22000267,"[Markov blankets, philosophy, comment / response]"
4,Structural and functional brain networks: from...,"Park HJ, Friston K.",Science,2013.0,10.1126/science.1238411,"[review, network analysis]"
...,...,...,...,...,...,...
3580,Changes in both top-down and bottom-up effecti...,"Thomas GEC, Zeidman P, Sultana T, Zarkali A, R...",Brain Commun,2022.0,10.1093/braincomms/fcac329,[predictive processing]
3581,Spectral-temporal EEG dynamics of speech discr...,"Gilley PM, Uhler K, Watson K, Yoshinaga-Itano C.",BMC Neurosci,2017.0,10.1186/s12868-017-0353-4,[predictive processing]
3582,A robot or a dumper truck? Facilitating play-b...,"Paldam E, Roepstorff A, Steensgaard R, Lundsga...",Autism Dev Lang Impair,2022.0,10.1177/23969415221086714,[predictive processing]
3583,The neurophenomenology of early psychosis: An ...,"Nelson B, Lavoie S, Gawęda Ł, Li E, Sass LA, K...",Conscious Cogn,2020.0,10.1016/j.concog.2019.102845,[predictive processing]


We can view the other `Database` attributes to confirm the loaded database details.

In [7]:
print(f"Creation time: {database.creation_time}.")
print(f"Description: {database.description}.")
print(f"Tag version: {database.tag_version}")

Creation time: 2025-04-07 15:37:59.026888.
Description: Active inference papers.
Tag version: 6537033589147705874


## 2.5 Removing papers from a database

To remove papers from a database, use the `remove()` method. Papers are identified by their DOI and must be passed as a list of DOI strings, even when removing a single paper.

Below we filter for two DOIs of interest we wish to remove. Then we remove the papers and filter the DataFrame again to show that they have indeed been removed.

In [8]:
papers_to_remove = ["10.1192/bjb.2021.6", "10.1016/S1474-4422(22)00137-5"]
database.db[database.db["doi"].isin(papers_to_remove)]

Unnamed: 0,title,authors,where_published,year,doi,tag
0,Friston's free energy principle: new life for ...,Holmes J.,BJPsych Bull,2022.0,10.1192/bjb.2021.6,"[psychoanalysis, psychotherapy]"
1,Friston's theory of everything,McCrone J.,Lancet Neurol,2022.0,10.1016/S1474-4422(22)00137-5,[editorial]


In [9]:
database.remove(doi_list=papers_to_remove)

INFO:src.db:Specified papers have been dropped from the database.


In [10]:
database.db[database.db["doi"].isin(papers_to_remove)]

Unnamed: 0,title,authors,where_published,year,doi,tag
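
One reason a list is required even for a single DOI is that a bare string is itself iterable: `set("10.1192/bjb.2021.6")`, for instance, is a set of single characters rather than a set containing one DOI. How `remove()` reacts to a bare string is not shown in this notebook. The sketch below uses a hypothetical helper, `as_doi_list`, to normalise the input before filtering a plain list of records.

```python
def as_doi_list(dois):
    """Wrap a single DOI string in a list; return other iterables as a list (hypothetical helper)."""
    if isinstance(dois, str):
        return [dois]
    return list(dois)


records = [
    {"doi": "10.1192/bjb.2021.6"},
    {"doi": "10.1016/S1474-4422(22)00137-5"},
    {"doi": "10.1006/nimg.2000.0582"},
]

to_remove = set(as_doi_list("10.1192/bjb.2021.6"))

# Keep only the records whose DOI is not marked for removal.
remaining = [record for record in records if record["doi"] not in to_remove]

print([record["doi"] for record in remaining])
```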


## 2.6 Updating a database from dicts

New papers may be added to the database using a list of dictionaries. Each dictionary must have the required fields: DOI, authors, where_published, year, and title.

In [11]:
papers = [{
    "doi"    : "https://doi.org/10.48550/arXiv.2409.15532",
    "authors": [
        "Lancelot Da Costa", "Nathael Da Costa", "Conor Heins", "Johan Medrano", "Grigorios A. Pavliotis", "Thomas Parr", "Ajith Anil Meera", "Karl Friston"],
    "where_published": "ArXiv",
    "year": 2024,
    "title": "A theory of generalised coordinates for stochastic differential equations"
},
{
    "doi" : "https://doi.org/10.48550/arXiv.2503.13223",
    "authors" : ["Allahkaram Shafiei", "Hozefa Jesawada", "Karl Friston", "Giovanni Russo"],
    "where_published": "ArXiv",
    "year": 2025,
    "title": "Robust Decision-Making Via Free Energy Minimization"
}]

Examining the last five papers shows that the two we added are now part of the database.

In [12]:
database.update_from_dicts_list(entries=papers)
database.db.iloc[-5:]

INFO:src.db:Successfully added new papers to database.


Unnamed: 0,title,authors,where_published,year,doi,tag
3536,A robot or a dumper truck? Facilitating play-b...,"Paldam E, Roepstorff A, Steensgaard R, Lundsga...",Autism Dev Lang Impair,2022.0,10.1177/23969415221086714,[predictive processing]
3537,The neurophenomenology of early psychosis: An ...,"Nelson B, Lavoie S, Gawęda Ł, Li E, Sass LA, K...",Conscious Cogn,2020.0,10.1016/j.concog.2019.102845,[predictive processing]
3538,Noradrenergic deficits contribute to apathy in...,"Hezemans FH, Wolpe N, O'Callaghan C, Ye R, Rua...",PLoS Comput Biol,2022.0,10.1371/journal.pcbi.1010079,[predictive processing]
3539,A theory of generalised coordinates for stocha...,"[Lancelot Da Costa, Nathael Da Costa, Conor He...",ArXiv,2024.0,10.48550/arXiv.2409.15532,[untagged]
3540,Robust Decision-Making Via Free Energy Minimiz...,"[Allahkaram Shafiei, Hozefa Jesawada, Karl Fri...",ArXiv,2025.0,10.48550/arXiv.2503.13223,[untagged]


Note that the new papers are untagged and so they are marked as such in the tag column. 
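
The output above also shows the new DOIs stored without the `https://doi.org/` prefix that was supplied in the dictionaries. Before passing entries to the database, it can be convenient to check them yourself. The helper below is a hypothetical pre-check, not part of the `Database` class: it verifies that the required fields are present and strips the resolver prefix from the DOI.

```python
REQUIRED_FIELDS = {"doi", "authors", "where_published", "year", "title"}
DOI_URL_PREFIX = "https://doi.org/"


def prepare_entry(entry: dict) -> dict:
    """Check required fields and strip the resolver prefix from the DOI (hypothetical helper)."""
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        raise ValueError(f"Entry is missing required fields: {sorted(missing)}")
    cleaned = dict(entry)  # copy, so the caller's dictionary is left unchanged
    if cleaned["doi"].startswith(DOI_URL_PREFIX):
        cleaned["doi"] = cleaned["doi"][len(DOI_URL_PREFIX):]
    return cleaned


entry = {
    "doi": "https://doi.org/10.48550/arXiv.2503.13223",
    "authors": ["Allahkaram Shafiei", "Hozefa Jesawada", "Karl Friston", "Giovanni Russo"],
    "where_published": "ArXiv",
    "year": 2025,
    "title": "Robust Decision-Making Via Free Energy Minimization",
}

print(prepare_entry(entry)["doi"])
```

Failing early with a clear error message is easier to debug than a partially filled row appearing in the database.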

The database may also be updated from a CSV file, which must have the required fields: DOI, authors, where_published, and title. To do so, run `database.update_from_CSV(csv_path=PATH_TO_CSV)`.



## 2.7 Detaching a database

To clear a database completely, so that the `Database` object returns to its initial empty state (i.e. as if `Database()` had just been called), use the `detach()` method.

In [13]:
database.detach()

INFO:src.db:Database detached.


In [15]:
print(f"Database: {database.db}.")
print(f"Creation time: {database.creation_time}.")
print(f"Description: {database.description}.")
print(f"Tag version: {database.tag_version}")

Database: None.
Creation time: None.
Description: None.
Tag version: None
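
The behaviour shown above can be pictured with a small stand-in class. `MiniDatabase` is hypothetical and only mirrors the four public attributes listed at the top of this notebook. It is a sketch of the reset pattern, not the actual `Database` implementation.

```python
class MiniDatabase:
    """Hypothetical stand-in that holds the four public attributes of `Database`."""

    def __init__(self):
        # A fresh object starts in the same state that detach() restores.
        self.detach()

    def detach(self):
        # Reset every public attribute to its initial, empty state.
        self.db = None
        self.creation_time = None
        self.description = None
        self.tag_version = None


mini = MiniDatabase()
mini.description = "Active inference papers"
mini.detach()

print(mini.db, mini.creation_time, mini.description, mini.tag_version)
```

Routing both construction and `detach()` through the same reset code keeps the two states from drifting apart as attributes are added.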
