
2. PhenoDB overview

Johan Källberg Zvrskovec edited this page Jul 3, 2024 · 17 revisions

Schematic overview of entities (slightly simplified)

[Figure: schematic overview of entities]

Schema description

  • met Metadata schema. Holds metadata for the individual-level data in the coh and sec schemas, and for the summary-level data in the sum schema.
  • coh Cohort data schema. Holds individual-level data. Every imported entry is retained over time, so older versions of the data are preserved when a new import supersedes an earlier one.
  • sec Secure cohort data schema. Holds more sensitive individual-level data that is always hidden from a standard extraction. This can include personally identifying information such as study participant IDs and personal contact information.
  • sum Summary data schema. Holds non-individual-level data and aggregated summary statistics, such as phenotype population prevalences and data specific to genotype association datasets.

Standardised table description

The Metadata Schema 'met'

Tables and views in this schema are generally easy to read and interact with manually, in contrast to the tables in the coh schema. The tables are named and structured to model most entities and relations of the database. The drawback is that old versions of the data in these tables are not saved.

  • assessment Assessments, characterised by a code and version code. Each corresponds to a typical standardised questionnaire, but can capture any comparable kind of assessment. Different types of assessments are modelled by the assessment_type. Each version of an assessment has its own row and identifier.
  • assessment_item Assessment items, characterised by an assessment, an assessment_item_type, and an item code. An assessment item groups assessment_item_variables into a unit that generally corresponds to an item in a questionnaire, or to its equivalent for other assessment_types: something with its own question text or description, but possibly multiple pieces of data.
  • assessment_item_type Assessment item types, an entity created to characterise assessment_items so that one assessment can contain items of several types. Characterised by an assessment_type (the assessment type that this item type is typically associated with) and a code. An example would be an assessment that combines questionnaire-type items with interview or imaging items. A typical use is filtering items by type across assessments.
  • assessment_item_variable Assessment item variables, characterised by an assessment_item and a code. This entity models the most granular piece of individual cohort data in the database. Each assessment item variable should correspond to a column in a table under the 'coh' schema.
  • assessment_type Assessment types, characterised by a code, describe the top-level category of an assessment. An assessment can have only one assessment type, so it may be more useful to create detailed assessment item types instead. The default assessment types (which were rather arbitrarily created and named) are:
    • Questionnaire - A questionnaire type of assessment, either on paper or digitally distributed.
    • Interview - An interview type of assessment.
    • Imaging - An imaging type assessment.
    • Biological sample - An assessment made on a biological sample.
    • Cognitive test - A cognitive test performed using either a technical platform or other means to assess the result.
    • Probe - Any kind of non-imaging technical measurement.
  • cohort Cohorts, characterised by a code. An entity to model a cohort. All individual level cohort data are sorted under a cohortinstance of a cohort entity.
  • cohortinstance An instance of a cohort, characterised by a cohort and a code. Created to hold different versions, iterations, or sub-cohorts sorted under one cohort entity. This can for example be used to separate data extractions with different compositions of assessments, participants, or data.
  • cohortstage A stage of a cohort (study), characterised by a cohort and a code. Stages were created to represent timepoints or parallel stages of a cohort study.
  • country Countries in the world. Used for annotating other entities with countries.
  • phenotype Phenotypes, that is, measurements of a commonly shared trait of an organism, characterised by a code. Historically used to describe summary-type data such as GWAS summary statistics or population prevalences. Planned to also be linked to assessments, although it has not yet been decided how (i.e. 1:many or many:many).
  • phenotype_assessment_type Assessment type for phenotypes, characterised by a code. Used to describe how a measurement was obtained for summary-type data. Could in principle be better harmonised with assessments and assessment_types.
  • phenotype_category Category for phenotypes, characterised by a code. Used for searching and sorting in a repository of phenotypes. Multiple categories can be assigned to each phenotype.
  • phenotype_phenotype_category Link table to link phenotypes with multiple phenotype categories.
  • phenotype_type Phenotype type, characterised by a code.
  • population An entity to model ancestry for GWAS or similar. Used for summary type data.
  • reference An external reference to a publication.
  • summary Summary-level data entity, characterised by a sort code, a sort counter (an integer), and a summary_type. Each row is referenced by rows in the tables holding summary-level data under the 'sum' schema.
  • summary_type The type of summary level data. Characterised by a code.
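
The "characterised by" relationships in the list above can be read as natural keys. The following is a minimal Python sketch of the assessment hierarchy under that reading. The class names, field names, and example values are illustrative assumptions and do not reflect the actual column names of the met tables.

```python
from dataclasses import dataclass


# Illustrative only: class and field names are assumptions, not the real
# column names of the met tables.
@dataclass(frozen=True)
class Assessment:
    code: str
    version_code: str


@dataclass(frozen=True)
class AssessmentItem:
    assessment: Assessment
    item_type_code: str  # stands in for the assessment_item_type
    code: str


@dataclass(frozen=True)
class AssessmentItemVariable:
    item: AssessmentItem
    code: str


# Hypothetical example values.
questionnaire = Assessment(code="exampleq", version_code="1")
item = AssessmentItem(questionnaire, item_type_code="questionnaire", code="item1")
variable = AssessmentItemVariable(item, code="score")

# Frozen dataclasses compare by value, so two variables with the same
# assessment, item, and code are treated as the same entity.
same = AssessmentItemVariable(item, code="score")
```

Because each level holds a reference to its parent, a variable carries enough information to locate its assessment and version, which is what the mapping to the 'coh' tables relies on.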

The Cohort Schema 'coh'

The tables under the 'coh' schema hold the individual-level cohort data across multiple import instances. This means that there may be multiple rows that contain data for the same variable. To read the latest version of each variable from the cohort data tables, you can use the coh.create_current_assessment_item_variable_tview function to create a temporary view of your selection from which to read, as described in the code templates.
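
Conceptually, reading the latest version means keeping only the most recently imported value for each individual and variable. The sketch below illustrates that idea in plain Python on in-memory rows. It is not the implementation of coh.create_current_assessment_item_variable_tview, and the row layout and field names (individual, variable, imported_at, value) are hypothetical.

```python
from datetime import datetime


def latest_rows(rows):
    """Keep the most recently imported row per (individual, variable)."""
    latest = {}
    for row in rows:
        key = (row["individual"], row["variable"])
        if key not in latest or row["imported_at"] > latest[key]["imported_at"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical rows: two imports of the same variable for individual "a".
rows = [
    {"individual": "a", "variable": "item1_score",
     "imported_at": datetime(2023, 1, 1), "value": 3},
    {"individual": "a", "variable": "item1_score",
     "imported_at": datetime(2024, 1, 1), "value": 4},
    {"individual": "b", "variable": "item1_score",
     "imported_at": datetime(2023, 1, 1), "value": 5},
]
current = latest_rows(rows)
```

In the database itself this selection is done for you by the temporary view, so the sketch is only meant to show why a plain read of a cohort data table can return more than one row per individual.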

The cohort data tables are mapped to the corresponding metadata and named following the convention:

  [cohort code]_[cohortinstance code]_[assessment code]_[assessment version code]_[table index]

The naming convention links information about cohorts, cohortinstances, and assessments with the right table for insertion and extraction. The table index works as an array index to reference multiple tables when the number of variables for a given cohortinstance/assessment combination exceeds the maximum number of columns in a PostgreSQL table.
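
As a minimal sketch of the table naming convention, the helpers below join the five parts with underscores and split a list of variables into chunks, one per table index. The function names and example codes are hypothetical, and whether the table index starts at 0 or 1 is an assumption. PostgreSQL documents a limit of 1600 columns per table; the usable number of data columns per table would be lower, since the metadata columns count towards it as well.

```python
def cohort_table_name(cohort, cohortinstance, assessment, assessment_version, table_index):
    """Join the five parts of the naming convention with underscores."""
    parts = [cohort, cohortinstance, assessment, assessment_version, str(table_index)]
    return "_".join(parts)


def split_into_tables(variables, max_columns):
    """Split a list of variable names into chunks of at most max_columns."""
    return [variables[i:i + max_columns] for i in range(0, len(variables), max_columns)]


# Hypothetical codes; starting the table index at 0 is an assumption.
name = cohort_table_name("mycohort", "2024", "exampleq", "1", 0)

# Five variables with room for two per table need three tables.
chunks = split_into_tables(["v1", "v2", "v3", "v4", "v5"], max_columns=2)
names = [
    cohort_table_name("mycohort", "2024", "exampleq", "1", index)
    for index in range(len(chunks))
]
```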

Data columns in the cohort data tables similarly follow the naming convention:

  [item code]_[variable code]

Standardised metadata columns are prefixed with an underscore and have a specific function or meaning.
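
The column convention can be sketched in the same way. The helper names and the example column names below, including the metadata column "_id", are hypothetical. Only the underscore prefix rule and the [item code]_[variable code] pattern come from the description above.

```python
def data_column_name(item_code, variable_code):
    """Build a data column name from an item code and a variable code."""
    return f"{item_code}_{variable_code}"


def is_metadata_column(column_name):
    """Standardised metadata columns are prefixed with an underscore."""
    return column_name.startswith("_")


# Hypothetical column names from a cohort data table.
columns = ["_id", data_column_name("item1", "score"), data_column_name("item1", "text")]
data_columns = [c for c in columns if not is_metadata_column(c)]
```

Note that splitting a data column name back into item and variable codes is ambiguous if either code contains an underscore, so the metadata tables remain the reliable source for that mapping.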

There should be a cohort data table holding each item and variable defined by the metadata. In the current live version, tables for older cohortinstances (representing experimental data imports) may have been removed.

The Secure Schema 'sec'

The secure schema holds two tables:

  • individual Individuals. Each row holds information on one individual, which theoretically can be shared across cohorts.
  • individual_cohortinstance_identifier Cohortinstance-specific data about an individual. Characterised by an individual, an internal unique UUID, and a string meant to hold the study/cohort participant ID.

The general idea of these tables is that the individual table can be rather static and rarely needs updating, while each cohortinstance may update the data on each individual chronologically. Each cohortinstance generates new UUID identifiers for its participants, so that these are not shared across cohorts or cohortinstances. This allows the UUID identifiers, which are easily regenerated with a new cohortinstance, to be used in place of the study/cohort participant ID, which may be globally identifiable.
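
The effect of regenerating identifiers per cohortinstance can be illustrated with a short Python sketch. The function name and participant IDs are hypothetical, and the use of random (version 4) UUIDs generated on the client side is an assumption; the database may generate its identifiers differently.

```python
import uuid


def new_cohortinstance_identifiers(participant_ids):
    """Map each study participant ID to a freshly generated random UUID."""
    return {pid: uuid.uuid4() for pid in participant_ids}


# Hypothetical participant IDs, registered under two cohortinstances.
participants = ["P001", "P002"]
first_instance = new_cohortinstance_identifiers(participants)
second_instance = new_cohortinstance_identifiers(participants)
```

The same participant receives a different UUID in each cohortinstance, so data extracted under one cohortinstance cannot be joined to another by UUID alone.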

The Summary Data Schema 'sum'

Each table in the 'sum' schema represents one summary_type and holds the data specific to that type. Data common to all summary types is held in the met.summary table. The summary data functionality is still a work in progress, and some relations and conventions are not fully defined.