# Data set selection
This notebook should serve as a basis for selecting the data sets for our bachelor thesis.

## Data sets
The Open Graph Benchmark (OGB) data sets for link prediction and node classification are suitable candidates.

Link prediction:
* [ogbl-ppa](https://ogb.stanford.edu/docs/linkprop/#ogbl-ppa)
* [ogbl-collab](https://ogb.stanford.edu/docs/linkprop/#ogbl-collab)
* [ogbl-ddi](https://ogb.stanford.edu/docs/linkprop/#ogbl-ddi)
* [ogbl-citation2](https://ogb.stanford.edu/docs/linkprop/#ogbl-citation2)
* [ogbl-wikikg2](https://ogb.stanford.edu/docs/linkprop/#ogbl-wikikg2)
* [ogbl-biokg](https://ogb.stanford.edu/docs/linkprop/#ogbl-biokg)
* [ogbl-vessel](https://ogb.stanford.edu/docs/linkprop/#ogbl-vessel)

Node classification:
* [ogbn-products](https://ogb.stanford.edu/docs/nodeprop/#ogbn-products)
* [ogbn-proteins](https://ogb.stanford.edu/docs/nodeprop/#ogbn-proteins)
* [ogbn-arxiv](https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv)
* [ogbn-papers100M](https://ogb.stanford.edu/docs/nodeprop/#ogbn-papers100M)
* [ogbn-mag](https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag)


## Dataset Transfer Learning requirement criteria

### Same domain
Below are our requirement criteria for two datasets for transfer learning within the domain, ordered by priority.
1. minimum size (> 100'000 nodes) and maximum size (<= 3'000'000 nodes)
2. two **similar** data sets, similarity defined via domain, features
3. feature interpretation would be nice to have
4. if possible undirected (complexity)

### Other domain
Below are our requirement criteria for a dataset for transfer learning outside the domain.
1. minimum size (> 100'000 nodes) and maximum size (<= 3'000'000 nodes)
2. a **different domain** than the data sets chosen for the same-domain case
3. feature interpretation would be nice to have

## Criteria
In the next cells, we will evaluate the data sets based on the above criteria and, if necessary, eliminate them from the selection.

### Same domain - criterion 1 (size)
For the first criterion, we eliminate all data sets that have fewer than **100K** nodes.

This concerns the following data sets:
* ogbl-ddi
* ogbl-biokg

These data sets are therefore eliminated from the selection.
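
The size filter can be sketched in a few lines of Python. This is only an illustration: the node counts are copied by hand from the OGB documentation for four of the data sets and should be re-checked against the current release, and the names `NODE_COUNTS`, `MIN_NODES` and `split_by_min_size` are our own. Only the lower bound is applied here, as in the text above.

```python
# Node counts copied by hand from the OGB documentation (assumption: verify
# them against the current OGB release before relying on them).
NODE_COUNTS = {
    "ogbl-ddi": 4_267,
    "ogbl-biokg": 93_773,
    "ogbn-arxiv": 169_343,
    "ogbl-collab": 235_868,
}

MIN_NODES = 100_000  # lower bound from criterion 1 (> 100'000 nodes)


def split_by_min_size(node_counts, min_nodes=MIN_NODES):
    """Return (kept, eliminated) data set names, each sorted alphabetically."""
    kept = sorted(name for name, n in node_counts.items() if n > min_nodes)
    eliminated = sorted(name for name, n in node_counts.items() if n <= min_nodes)
    return kept, eliminated


kept, eliminated = split_by_min_size(NODE_COUNTS)
print("kept:", kept)
print("eliminated:", eliminated)
```

With these counts, ogbl-ddi and ogbl-biokg fall below the threshold, which matches the list above.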

### Same domain - criterion 2 (similarity)
The second criterion is always evaluated on pairs of data sets. A data set without a partner in the same domain is eliminated.

| ID | Data set | Domain | Partner data sets | Remarks |
|---- |---------------- |----------------- |------------------------- |------------------------------------------------------------------------ |
| A | ogbn-mag | citations, authors | B, C, D, I | must be converted from a two-mode network to a one-mode network |
| B | ogbn-arxiv | citations | A, D, I | can be a subset of "ogbn-mag" |
| C | ogbl-collab | authors | A | can be a subset of "ogbn-mag" |
| D | ogbl-citation2 | citations | A, B, I | can be a subset of "ogbn-mag" |
| E | ogbl-ppa | proteins | F | |
| F | ogbn-proteins | proteins | E | |
| G | ogbn-products | products | | |
| I | ogbn-papers100M | citations | A, B, D | |
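
The pairing rule can be checked mechanically. The following is a minimal sketch, assuming the partner lists are transcribed from the table above as ID sets; the entry for I (ogbn-papers100M) is derived from the rows that list I as a partner. The names `PARTNERS`, `candidate_pairs` and `without_partner` are hypothetical helpers, not part of any library.

```python
from itertools import combinations

# Partner lists transcribed from the table above (IDs only).
# The entry for "I" is derived from the rows that name I as a partner.
PARTNERS = {
    "A": {"B", "C", "D", "I"},
    "B": {"A", "D", "I"},
    "C": {"A"},
    "D": {"A", "B", "I"},
    "E": {"F"},
    "F": {"E"},
    "G": set(),
    "I": {"A", "B", "D"},
}


def candidate_pairs(partners):
    """Return sorted pairs (x, y) in which each side lists the other as a partner."""
    return [
        (x, y)
        for x, y in combinations(sorted(partners), 2)
        if y in partners[x] and x in partners[y]
    ]


def without_partner(partners):
    """Return the IDs that have no partner and are therefore eliminated."""
    return sorted(name for name, p in partners.items() if not p)


print("pairs:", candidate_pairs(PARTNERS))
print("eliminated:", without_partner(PARTNERS))
```

For the table above, this yields seven citation and author pairs plus the protein pair E-F, and it flags G (ogbn-products) as having no partner.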

### Same domain - criterion 3 (feature interpretation)
The third criterion defines whether the features of the nodes and edges can be interpreted directly or only after processing.
An example of a directly interpretable feature would be a continuous variable from everyday life, for example the age or height of a "person" node. Embedding vectors are an example of a feature that cannot be interpreted directly.


| ID | Data set | Features (nodes) | Features (edges) | Interpretation <br>direct | Interpretation <br>with processing | Remarks |
|---- |---------------- |--------------------------------------------- |------------------------------------- |--------------------------- |------------------------------------- |------------------------------------------------------------------------------------------------------------------------------------- |
| A | ogbn-mag | 128-dimensional <br>word2vec feature vector | none | ❌ | ✅ | 128-dimensional feature vector obtained by <br>averaging the embeddings of words <br>in the paper's title and abstract. |
| B | ogbn-arxiv | 128-dimensional <br>word2vec feature vector | none | ❌ | ✅ | 128-dimensional feature vector obtained by <br>averaging the embeddings of words <br>in the paper's title and abstract. |
| C | ogbl-collab | 128-dimensional <br>word2vec feature vector | publication year,<br>authors | ❌ | ✅ | 128-dimensional feature vector obtained by <br>averaging the word embeddings of papers <br>published by the authors. |
| D | ogbl-citation2 | 128-dimensional <br>word2vec feature vector | none | ❌ | ✅ | 128-dimensional feature vector summarizing<br> the title and abstract of the paper. |
| E | ogbl-ppa | species vector | none | ✅ | ✅ | The species vector indicates the species <br>that the corresponding protein comes from. |
| F | ogbn-proteins | species vector | none | ✅ | ✅ | Presumably species, but this is not entirely clear;<br>the species vector indicates the species <br>that the corresponding protein comes from. |
| I | ogbn-papers100M | 128-dimensional <br>word2vec feature vector | none | ❌ | ✅ | 128-dimensional feature vector obtained by <br>averaging the embeddings of words <br>in the paper's title and abstract. |

There is a clear distinction between the features of the citation data sets (embedding-based) and those of the protein data sets, which keep the species as a feature.
Since the species features only define the node types, the embedding features suit us better because they carry more information (128 dimensions versus 1 dimension).

We therefore eliminate **ogbn-proteins** and **ogbl-ppa** from the selection.
However, they remain good candidates for the data set from a different domain.

### Same domain - criterion 4 (direction)
The fourth criterion determines whether the edges are directed or undirected. Undirected edges reduce the complexity of the prediction and are therefore preferred.

| ID | Data set | Undirected |
|---- |---------------- |------------- |
| A | ogbn-mag | ❌ |
| B | ogbn-arxiv | ❌ |
| C | ogbl-collab | ✅ |
| D | ogbl-citation2 | ❌ |
| I | ogbn-papers100M | ❌ |

Since we need a pair and only **ogbl-collab** is undirected, this criterion cannot be used to eliminate any pairs.

### Same domain - final choice

Possible pairs are as follows:

| data set 1 | data set 2 |
|--------|--------|
| A | B |
| A | C |
| A | D |
| A | I |
| B | D |
| **B** | **I** |
| D | I |

**ogbn-papers100M** (I) is a comprehensive dataset that is highly compatible with other datasets due to its thematic focus. It is particularly suitable for pre-training purposes, as it covers a large number of scientific publications and various disciplines. A broad spectrum of training data can thus be obtained. **ogbn-arxiv** (B) then enables evaluation and fine-tuning on a comparatively smaller data set. In addition to scientific publications, ogbn-arxiv also contains a proportion of financial and economic publications.

We therefore select the pair **ogbn-papers100M** and **ogbn-arxiv**.

### Other domain - Criterion 1 (size)
For the first criterion, we eliminate all data sets that have fewer than **100K** nodes.

This applies to the following data sets:
* ogbl-ddi
* ogbl-biokg

These data sets are therefore eliminated from the selection.

### Other domain - Criterion 2 (Other domain)
The second criterion determines whether a data set belongs to a domain other than **citations or authors**.

| ID | Data set | Domain | Other domain |
|---- |---------------- |----------------- |------------------------- |
| A | ogbn-mag | citations, authors | ❌ |
| B | ogbn-arxiv | citations | ❌ |
| C | ogbl-collab | authors | ❌ |
| D | ogbl-citation2 | citations | ❌ |
| E | ogbl-ppa | proteins | ✅ |
| F | ogbn-proteins | proteins | ✅ |
| G | ogbn-products | products | ✅ |
| H | ogbl-wikikg2 | Wikidata | ✅ |
| I | ogbn-papers100M | citations | ❌ |
| J | ogbl-vessel | brain vessels of a mouse | ✅ |
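
The same check can be written as a small set operation. This is a sketch under the assumption that each data set is tagged with the domain labels from the table above, using the data set names from the list at the top of the notebook; `DOMAINS`, `SAME_DOMAIN` and `other_domain` are hypothetical names.

```python
# Domain labels transcribed from the table above.
DOMAINS = {
    "ogbn-mag": {"citations", "authors"},
    "ogbn-arxiv": {"citations"},
    "ogbl-collab": {"authors"},
    "ogbl-citation2": {"citations"},
    "ogbl-ppa": {"proteins"},
    "ogbn-proteins": {"proteins"},
    "ogbn-products": {"products"},
    "ogbl-wikikg2": {"wikidata"},
    "ogbn-papers100M": {"citations"},
    "ogbl-vessel": {"brain vessels"},
}

# Domains already covered by the same-domain selection.
SAME_DOMAIN = {"citations", "authors"}


def other_domain(domains, excluded=SAME_DOMAIN):
    """Return data sets whose domains do not overlap with the excluded ones."""
    return sorted(name for name, d in domains.items() if d.isdisjoint(excluded))


print(other_domain(DOMAINS))
```

Using `isdisjoint` means that a data set with several domains, such as ogbn-mag, is excluded as soon as one of its domains overlaps with the same-domain selection.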

### Other domain - Criterion 3 (feature interpretation)
The third criterion defines whether the features of the nodes and edges can be interpreted directly or only after processing.
An example of a directly interpretable feature would be a continuous variable from everyday life, for example the trunk length or trunk diameter of a "tree" node. Embedding vectors are an example of a feature that cannot be interpreted directly.


| ID | Data set | Features (nodes) | Features (edges) | Interpretation <br>direct | Interpretation <br>with processing | Remarks |
|---- |---------------- |--------------------------------------------- |------------------------------------- |--------------------------- |------------------------------------- |------------------------------------------------------------------------------------------------------------------------------------- |
| E | ogbl-ppa | species vector | none | ✅ | ✅ | The species vector indicates the species <br>that the corresponding protein comes from. |
| F | ogbn-proteins | species vector | none | ✅ | ✅ | Presumably species, but this is not entirely clear;<br>the species vector indicates the species <br>that the corresponding protein comes from. |
| G | ogbn-products | bag-of-words features | none | ❌ | ✅ | Features are derived from the product descriptions. |
| H | ogbl-wikikg2 | none | none | ❌ | ❌ | Features not described in detail; knowledge graph and triple edge prediction. |
| J | ogbl-vessel | spatial (x, y, z) in Allen Brain Atlas reference space | none | ❌ | ❌ | Special case for link prediction due to physical conditions in the brain. |


Since the features of **ogbl-wikikg2** are not described in more detail, we eliminate this data set. The bag-of-words features of **ogbn-products** can be interpreted after processing, but the two protein data sets have features that are easier to interpret. We therefore also eliminate **ogbn-products**.
**ogbl-vessel** is a special case: the model architectures used for it are trained on physical objects. This data set is therefore not useful for our general transfer learning setting and is also eliminated.



### Other domain - final choice

Only **ogbl-ppa** (E) and **ogbn-proteins** (F) remain to choose from.
**ogbn-proteins** belongs to the *node classification* benchmark and is therefore dropped.

We therefore select the **ogbl-ppa** data set.