# Obtaining the Dataset: TUDataset


The most common task for graph classification is **molecular property prediction**, in which molecules are represented as graphs.

TU Dortmund University has collected a wide range of graph classification datasets, known as the `TUDatasets`, which are available in PyTorch Geometric through `torch_geometric.datasets.TUDataset`.

Let's load and inspect one of the smaller ones, the `MUTAG` dataset. It contains **188 graphs**, and the task is to classify each graph into **one of two classes**.

In [2]:
import torch
from torch_geometric.datasets import TUDataset

dataset = TUDataset(root='data/TUDataset', name='MUTAG')

print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of classes: {dataset.num_classes}')  # number of graph classes (labels are per graph, not per node)

Dataset: MUTAG(188):
Number of graphs: 188
Number of classes: 2


In [52]:
for i, graph in enumerate(dataset):
    print(f'{i}-th graph: {graph}')
    print('------------')
    print(f'Number of nodes: {graph.num_nodes}')
    print(f'Number of node features: {graph.num_node_features}')
    print(f'Number of edges: {graph.num_edges}')
    print(f'Number of edge features: {graph.num_edge_features}')
    print(f'Average node degree: {graph.num_edges / graph.num_nodes:.2f}')
    # Graph-level datasets such as MUTAG carry no train/val/test node masks,
    # so the mask statistics used for node classification do not apply here.
    print(f'Has isolated nodes(nodes without edges): {graph.has_isolated_nodes()}')
    print(f'Has self-loops: {graph.has_self_loops()}')
    print(f'Is undirected: {graph.is_undirected()}')
    print(f'y: {graph.y}')                    # graph-level label
    print(f'x: {graph.x}')                    # node feature matrix
    print(f'edge_index: {graph.edge_index}')  # edges as [source, target] index pairs
    print(f'edge_attr: {graph.edge_attr}')    # edge feature matrix
    print('------------')

0-th graph: Data(edge_index=[2, 38], x=[17, 7], edge_attr=[38, 4], y=[1])
------------
Number of nodes: 17
Number of node features: 7
Number of edges: 38
Number of edge features: 4
Average node degree: 2.24
Has isolated nodes(nodes without edges): False
Has self-loops: False
Is undirected: True
y: tensor([1])
x: tensor([[1., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0.]])
edge_index: tensor([[ 0,  0,  1,  1,  2

edge_attr: tensor([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
 

**NOTE:** As you can see, each graph in this dataset has a single value for the label *y*. That's because this is a graph classification dataset, so *y* is the label of the whole graph rather than of its individual nodes.
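
The per-graph statistics printed above follow directly from the edge list. As a rough illustration in plain Python (no PyG required), the sketch below recomputes a few of them for a small hypothetical path graph, not real MUTAG data. The checks are simplified stand-ins and are not meant to mirror PyG's internal implementations.

```python
# Hypothetical toy graph (not taken from MUTAG): a path 0 - 1 - 2 - 3.
# Like PyG's edge_index, each undirected edge is stored in both directions.
edge_index = [
    [0, 1, 1, 2, 2, 3],  # source nodes
    [1, 0, 2, 1, 3, 2],  # target nodes
]
num_nodes = 4

num_edges = len(edge_index[0])  # counts directed edges
directed = set(zip(edge_index[0], edge_index[1]))

# Undirected: for every (u, v) the reverse (v, u) is also present.
is_undirected = all((v, u) in directed for (u, v) in directed)
# Self-loop: any edge whose source equals its target.
has_self_loops = any(u == v for (u, v) in directed)
# Isolated node: a node that appears in no edge at all.
touched = set(edge_index[0]) | set(edge_index[1])
has_isolated_nodes = len(touched) < num_nodes

avg_degree = num_edges / num_nodes

print(f'Number of edges: {num_edges}')
print(f'Average node degree: {avg_degree:.2f}')
print(f'Is undirected: {is_undirected}')
```

By the same reasoning, the 38 edges reported for the first MUTAG graph correspond to 19 undirected edges, and 38 / 17 ≈ 2.24 is the average node degree shown in the output.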

## Splitting the dataset into training and test sets

In [28]:
torch.manual_seed(123456)
dataset = dataset.shuffle()

train_dataset = dataset[:151]
test_dataset = dataset[151:]
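
The slice indices above yield 151 training graphs and 37 test graphs, roughly an 80/20 split. A minimal sketch of the same shuffle-then-slice pattern in plain Python follows. The `train_fraction` variable and the use of `random.Random` are illustrative assumptions; this is not what `dataset.shuffle()` does internally, and the resulting order will differ from PyG's.

```python
import random

num_graphs = 188      # size of MUTAG, as printed earlier
train_fraction = 0.8  # assumed ratio; the notebook hard-codes index 151 instead

# Shuffle graph indices reproducibly with a fixed seed.
rng = random.Random(123456)
indices = list(range(num_graphs))
rng.shuffle(indices)

# Slice the shuffled indices into two disjoint parts.
split = int(train_fraction * num_graphs)
train_idx, test_idx = indices[:split], indices[split:]

print(f'Number of training graphs: {len(train_idx)}')
print(f'Number of test graphs: {len(test_idx)}')
```

Shuffling before slicing matters because the graphs may be stored in an order that correlates with their labels.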

Neural networks are usually trained in a batch-wise fashion. Because the graphs in graph classification datasets are usually small, multiple graphs are grouped into a mini-batch before being passed to a GNN, which improves GPU utilization through parallelization across graphs. PyG achieves this by stacking the adjacency matrices diagonally (creating one giant graph that holds multiple isolated subgraphs) and concatenating the node features along the node dimension. This procedure has two crucial advantages over other batching procedures:

1. GNN operators that rely on a message passing scheme do not need to be modified since messages are not exchanged between two nodes that belong to different graphs.

2. There is no computational or memory overhead since adjacency matrices are saved in a sparse fashion holding only non-zero entries, *i.e.*, the edges.
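
The diagonal stacking described above can be sketched in plain Python. This is a simplified illustration on two hypothetical toy graphs, not PyG's actual implementation: node features are concatenated, and each graph's edge indices are shifted by the number of nodes placed before it.

```python
# Two hypothetical toy graphs (not MUTAG data), each given by per-node
# feature rows and an edge list in [sources, targets] form.
graphs = [
    {'x': [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],  # 3 nodes
     'edge_index': [[0, 1, 1, 2], [1, 0, 2, 1]]},
    {'x': [[0.0, 1.0], [1.0, 0.0]],              # 2 nodes
     'edge_index': [[0, 1], [1, 0]]},
]

x, sources, targets, batch = [], [], [], []
offset = 0  # number of nodes already placed in the giant graph

for graph_id, g in enumerate(graphs):
    # Node features are concatenated along the node dimension.
    x.extend(g['x'])
    # Shifting node indices by the offset places this graph's adjacency
    # matrix in its own diagonal block, so no edge crosses graphs.
    sources.extend(s + offset for s in g['edge_index'][0])
    targets.extend(t + offset for t in g['edge_index'][1])
    # The batch vector records which graph each node came from.
    batch.extend([graph_id] * len(g['x']))
    offset += len(g['x'])

edge_index = [sources, targets]
print(f'edge_index: {edge_index}')
print(f'batch vector: {batch}')
```

Because only the existing edges are stored, the empty off-diagonal blocks of the combined adjacency matrix cost no memory, and every edge still connects two nodes of the same graph.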

PyTorch Geometric automatically takes care of **batching multiple graphs into a single giant graph** with the help of the `torch_geometric.loader.DataLoader` class (older releases exposed it as `torch_geometric.data.DataLoader`).

<span style="color:red">     **DOUBT:** How is batching done for node classification datasets? There, the training samples are nodes that are linked together in a graph.    </span>

In [29]:
from torch_geometric.loader import DataLoader

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
print(train_loader)

<torch_geometric.loader.dataloader.DataLoader object at 0x7f3e72345b80>


In [54]:
for step, batch in enumerate(train_loader):
    print(f'Batch {step + 1}:')
    print(f'=======')
    print(f'number of graphs in the current batch: {batch.num_graphs}')
    print(f'feature vector(x) of nodes: {batch.x}')
    print(f'number of feature vectors(x): {len(batch.x)}')
    print(f'label(y) of graphs: {batch.y}')
    print(f'number of graph labels(y): {len(batch.y)}')
    print(f'length of batch vector = total number of nodes in this batch: {len(batch.batch)}')
    print(f'edge index of graphs: {batch.edge_index}')
    print(f'edge attributes of graphs: {batch.edge_attr}')
    print(f'batch vector: {batch.batch}')
    print(batch)  # y here holds the labels of graphs, not of nodes
    print()  # adds a blank line between batches

Batch 1:
number of graphs in the current batch: 64
feature vector(x) of nodes: tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.]])
number of feature vectors(x): 1143
label(y) of graphs: tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
        1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1,
        1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0])
number of graph labels(y): 64
length of batch vector = total number of nodes in this batch: 1143
edge index of graphs: tensor([[   0,    0,    1,  ..., 1140, 1141, 1142],
        [   1,    9,    0,  ..., 1142, 1140, 1140]])
edge attributes of graphs: tensor([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        ...,
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
 

**NOTE:** 

1. Each `Batch` object is equipped with a **`batch` vector**, which maps each node to its respective graph in the batch:

$$
\textrm{batch} = [ 0, \ldots, 0, 1, \ldots, 1, 2, \ldots ]
$$

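2. In `DataBatch(edge_index=[2, 930], x=[417, 7], edge_attr=[930, 4], y=[23], batch=[417], ptr=[24])`, *y* holds the labels of graphs, not of nodes: it has 23 entries, one per graph, whereas `x` and `batch` have 417 entries, one per node. This is the last, smaller batch, since the 151 training graphs split into batches of 64, 64, and 23.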
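
To make the role of the `batch` vector concrete, here is a small sketch in plain Python on a hypothetical batch vector (not taken from the output above). It recovers the number of nodes per graph and the cumulative node offsets, which is the kind of information `ptr` stores and the reason `ptr` has one more entry than there are graphs.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical batch vector for 3 small graphs with 3, 2, and 4 nodes.
batch = [0, 0, 0, 1, 1, 2, 2, 2, 2]

num_graphs = max(batch) + 1
counts = Counter(batch)
nodes_per_graph = [counts[g] for g in range(num_graphs)]

# Cumulative offsets: graph g owns nodes ptr[g] .. ptr[g + 1] - 1.
ptr = [0] + list(accumulate(nodes_per_graph))

print(f'nodes per graph: {nodes_per_graph}')
print(f'ptr: {ptr}')
```

The same mapping is what lets a model aggregate node-level outputs into one prediction per graph.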