# Custom Dataset

You are provided with 50 pairs of raw data files, one pair per graph (`i` runs from 0 to 49).

* `node-{i}.csv` -- a TSV file containing the node ID and a comma-separated node feature vector of size 10.
```
0	0.38250,0.55505,0.32324,0.69098,0.97875,0.50953,0.59311,0.93023,0.52275,0.21924
```
* `edge-{i}.csv` -- a TSV file with pairs of node IDs that are connected.
```
0	277
0	445
```
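
As a minimal sketch of how these two line formats can be parsed (the helper names `parse_node_line` and `parse_edge_line` are hypothetical and not part of PyG):

```python
def parse_node_line(line):
  """Split a 'node_id TAB f1,f2,...' line into an int ID and a list of floats."""
  node_id, feats = line.rstrip("\n").split("\t")
  return int(node_id), [float(x) for x in feats.split(",")]


def parse_edge_line(line):
  """Split a 'src TAB dst' line into a pair of int node IDs."""
  src, dst = line.rstrip("\n").split("\t")
  return int(src), int(dst)


# Shortened sample lines in the formats shown above.
node_id, feats = parse_node_line("0\t0.38250,0.55505,0.32324\n")
edge = parse_edge_line("0\t277\n")
print(node_id, len(feats), edge)
```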

Running the cell under the __Create Raw Data__ section writes the raw files to `random/raw`.

Construct a custom dataset that extends `torch_geometric.data.Dataset`. Unlike the CORA dataset, which holds a single graph, this dataset holds 50 graphs. Its `process` method should read the raw files and write one `Data` object per graph to `random/processed`.

Instantiate the custom dataset; new files should then appear in the `random/processed` folder.
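
One way to confirm this is to list which expected files are absent. The sketch below assumes the processed files are named `random-{i}.pt`; that naming is an assumption and has to match whatever the dataset's `processed_file_names` returns. `missing_processed_files` is a hypothetical helper.

```python
import os


def missing_processed_files(root, num_graphs=50):
  """Return the expected processed file names that do not exist under root/processed."""
  processed_dir = os.path.join(root, "processed")
  # Assumed naming scheme: random-0.pt, random-1.pt, ...
  expected = ["random-{:d}.pt".format(i) for i in range(num_graphs)]
  return [name for name in expected
          if not os.path.exists(os.path.join(processed_dir, name))]
```

After the dataset has been instantiated, `missing_processed_files("random")` should return an empty list.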

Verify that the length of the dataset is 50 and the number of node features is 10.

## Setup

In [None]:
import torch
torch.__version__

'1.9.0+cu111'

In [None]:
%%capture
!pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html
!pip install -q torch-sparse -f https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html
!pip install -q torch-geometric

## Create Raw Data

In [None]:
import numpy as np
import os
import shutil
import torch_geometric.utils as pyg_utils

raw_dir = "random/raw"

# start from a clean folder; ignore_errors avoids a failure on the first run
shutil.rmtree("random", ignore_errors=True)
os.makedirs(raw_dir)
for i in range(50):
  # one random Barabasi-Albert graph with 100 nodes and 10 random features per node
  edge_index = pyg_utils.barabasi_albert_graph(100, 50)
  node_features = torch.rand((100, 10), dtype=torch.float32)
  # node file: one "<node_id>\t<comma-separated features>" line per node
  with open(os.path.join(raw_dir, "node-{:d}.csv".format(i)), "w") as fnode:
    for j in range(node_features.size(0)):
      features = node_features[j].numpy().tolist()
      features_str = ",".join(["{:.5f}".format(feat) for feat in features])
      fnode.write("{:d}\t{:s}\n".format(j, features_str))
  # edge file: one "<src>\t<dst>" line per edge
  with open(os.path.join(raw_dir, "edge-{:d}.csv".format(i)), "w") as fedge:
    for j in range(edge_index.size(1)):
      src, dst = edge_index[0, j].item(), edge_index[1, j].item()
      fedge.write("{:d}\t{:d}\n".format(src, dst))

## Custom Dataset

Extend the `torch_geometric.data.Dataset` class to create a PyG Dataset.

Use the code in the PyTorch Geometric documentation page [Creating your own Dataset](https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html) and/or the [YouTube video on Creating a Custom Dataset in Pytorch Geometric](https://www.youtube.com/watch?v=QLIkOtKS4os) by DeepFindr as examples.


In [None]:
import torch_geometric.data as pyg_data

In [None]:
# shutil.rmtree("random/processed", ignore_errors=True)

In [None]:
class RandomDataset(pyg_data.Dataset):
  def __init__(self, root, transform=None, pre_transform=None):
    super().__init__(root, transform, pre_transform)

  @property
  def raw_file_names(self):
    raw_files = []
    for i in range(50):
      raw_files.append("node-{:d}.csv".format(i))
      raw_files.append("edge-{:d}.csv".format(i))
    return raw_files

  @property
  def processed_file_names(self):
    processed_files = []
    for i in range(50):
      processed_files.append("random-{:d}.pt".format(i))
    return processed_files

  def download(self):
    pass

  def process(self):
    # convert each (node file, edge file) pair into one Data object
    for i in range(50):
      features, edge_index = [], []
      # node file: one "<node_id>\t<comma-separated features>" line per node,
      # assumed to be listed in node ID order
      with open(os.path.join(self.raw_dir, "node-{:d}.csv".format(i)), "r") as fnode:
        for line in fnode:
          node_id, node_feats = line.strip().split('\t')
          features.append(np.array([float(x) for x in node_feats.split(',')]))
      features = torch.tensor(np.array(features), dtype=torch.float32)
      # edge file: one "<src>\t<dst>" line per edge; store both directions
      with open(os.path.join(self.raw_dir, "edge-{:d}.csv".format(i)), "r") as fedge:
        for line in fedge:
          src, dst = line.strip().split('\t')
          edge_index.append((int(src), int(dst)))
          edge_index.append((int(dst), int(src)))
      # (num_edges, 2) -> (2, num_edges), the layout PyG expects
      edge_index = torch.tensor(edge_index).t().to(torch.long)
      # random binary graph-level label used as a placeholder target
      labels = torch.randint(low=0, high=2, size=(1,))
      data = pyg_data.Data(x=features,
                          edge_index=edge_index,
                          y=labels)
      torch.save(data, os.path.join(
          self.processed_dir, "random-{:d}.pt".format(i)))

  def len(self):
    return len(self.processed_file_names)

  def get(self, idx):
    data = torch.load(os.path.join(self.processed_dir, 'random-{:d}.pt'.format(idx)))
    return data
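
Note that `process` appends every edge in both directions. If the raw edge file already lists both directions, which appears to be the case for the output of `barabasi_albert_graph`, each edge ends up stored twice. A minimal sketch of one way to drop such duplicates before building the tensor, using plain Python sets (`unique_undirected_edges` is a hypothetical helper, not part of PyG):

```python
def unique_undirected_edges(pairs):
  """Return a sorted list holding each edge once per direction, without repeats."""
  seen = set()
  for src, dst in pairs:
    seen.add((src, dst))
    seen.add((dst, src))
  return sorted(seen)


# (0, 1) is given in both directions, (0, 2) in only one.
edges = unique_undirected_edges([(0, 1), (1, 0), (0, 2)])
print(edges)  # [(0, 1), (0, 2), (1, 0), (2, 0)]
```

Using this in `process` would reduce the edge counts reported in the outputs of this notebook, so it is shown here only as an option.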


## Test Custom Dataset

You should be able to instantiate your new custom Dataset and verify its properties.

* `num_features` -- should be 10
* `len()` -- should be 50

In [None]:
dataset = RandomDataset("random")
dataset

Processing...
Done!


RandomDataset(50)

In [None]:
len(dataset)

50

In [None]:
dataset.num_features

10

In [None]:
dataset[0]

Data(x=[100, 10], edge_index=[2, 5536], y=[1])

## Wrap in DataLoader

Try wrapping your custom Dataset into a PyG DataLoader and iterate through one batch. Print out the batch and verify that it looks right (for example, the number of labels should be the same as your batch size).

In [None]:
from torch_geometric.loader import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
  print(batch)
  break

DataBatch(x=[3200, 10], edge_index=[2, 173920], y=[32], batch=[3200], ptr=[33])
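
The shapes in this output follow from how PyG collates graphs into one large disconnected graph: node features are concatenated, each graph's edge indices are shifted by the number of nodes that precede it, `batch` maps every node to its graph, and `ptr` holds the cumulative node counts. A minimal plain-Python sketch of that bookkeeping on two toy graphs (`collate_toy` is a hypothetical helper for illustration, not the PyG implementation):

```python
def collate_toy(graphs):
  """graphs: list of (num_nodes, edge_list) tuples using per-graph node IDs."""
  edges, batch, ptr = [], [], [0]
  for graph_id, (num_nodes, edge_list) in enumerate(graphs):
    offset = ptr[-1]  # number of nodes in all earlier graphs
    edges.extend((src + offset, dst + offset) for src, dst in edge_list)
    batch.extend([graph_id] * num_nodes)
    ptr.append(offset + num_nodes)
  return edges, batch, ptr


edges, batch, ptr = collate_toy([(3, [(0, 1), (1, 2)]), (2, [(0, 1)])])
print(edges)  # [(0, 1), (1, 2), (3, 4)]
print(batch)  # [0, 0, 0, 1, 1]
print(ptr)    # [0, 3, 5]
```

With 32 graphs of 100 nodes each, the same bookkeeping gives 32 x 100 = 3200 rows in `x` and `batch`, and 33 entries in `ptr`, which matches the output above.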
