# Creating graph datasets

## Creating in-memory datasets

We need four fundamental methods:
- `raw_file_names`: list of file names in `raw_dir` that must be present to skip the download
- `processed_file_names`: list of file names in `processed_dir` that must be present to skip the processing
- `download()`: downloads the raw data into `raw_dir`; don't implement it if no download is necessary
- `process()`: processes the raw data and saves it into `processed_dir`

The `process()` method is the most important one. It creates a list of `Data` objects, which are collated into one giant `Data` object and then saved into `processed_dir`.
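
To make the collation step concrete, here is a minimal pure-Python sketch of the idea: per-graph values are concatenated into one long sequence, and a `slices` index records where each graph starts and ends. The names `collate` and `get_graph` and the plain-list storage are assumptions made for illustration; PyTorch Geometric does the equivalent with tensors inside `InMemoryDataset`.

```python
def collate(graphs):
    """Concatenate per-graph node features and record the slice boundaries.

    `graphs` is a list of lists of node features (a stand-in for `Data.x`).
    """
    merged = []
    slices = [0]
    for nodes in graphs:
        merged.extend(nodes)
        slices.append(len(merged))
    return merged, slices


def get_graph(merged, slices, idx):
    """Recover the idx-th graph from the merged storage."""
    return merged[slices[idx]:slices[idx + 1]]


graphs = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
merged, slices = collate(graphs)
print(slices)                        # [0, 2, 3, 6]
print(get_graph(merged, slices, 2))  # [4.0, 5.0, 6.0]
```

Storing one big object plus an index is what makes access to a single graph cheap once everything is in memory.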

In [None]:
import torch 
from torch_geometric.data import InMemoryDataset, download_url 

In [None]:
class MyOwnDataset(InMemoryDataset): 
    def __init__(self, root, transform=None, pre_transform=None, pre_filter=None):
        super().__init__(root, transform, pre_transform, pre_filter)
        self.load(self.processed_paths[0])

    @property
    def raw_file_names(self): 
        return ['data1.pt', 'data2.pt']
    
    @property
    def processed_file_names(self): 
        return ['data.pt']
    
    def download(self):
        url = 'https://example.com/data.zip'
        download_url(url, self.raw_dir)

    def process(self): 
        # Read the raw files into a list of `Data` objects.
        data_list = [...]

        if self.pre_filter is not None:
            data_list = [d for d in data_list if self.pre_filter(d)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(d) for d in data_list]

        self.save(data_list, self.processed_paths[0])

In my case, I would need to use HDF5 or something similar. Zarr could work, but Arrow is ill-suited for this purpose.

# Creating 'larger' Datasets

If the data does not fit into memory, we can use the `Dataset` class, which closely follows the concept of the torchvision datasets. It expects the methods `len()` and `get()` to be implemented: `get()` implements the logic to load a single graph, and `len()` returns the number of examples in the dataset. This works in much the same way as the Julia datasets we already have.

In [None]:
import os.path as osp

import torch
from torch_geometric.data import Data, Dataset, download_url

class MyOwnOnDiskDataset(Dataset): 
    def __init__(self, root, transform=None, pre_transform=None, pre_filter=None): 
        super().__init__(root, transform, pre_transform, pre_filter)
    
    @property 
    def raw_file_names(self): 
        return ['data1.pt', 'data2.pt']
    
    @property 
    def processed_file_names(self):
        # One file per graph; the names must match what process() writes.
        return ['data_0.pt', 'data_1.pt']
    
    def download(self):
        url = 'https://example.com/data.zip'
        download_url(url, self.raw_dir)

    def process(self): 
        idx = 0
        for raw_path in self.raw_paths:
            # Read data from `raw_path`.
            data = Data(...)  # TODO

            if self.pre_filter is not None and not self.pre_filter(data): 
                continue 

            if self.pre_transform is not None:
                data = self.pre_transform(data)

            torch.save(data, osp.join(self.processed_dir, f'data_{idx}.pt'))
            idx += 1

    def len(self):
        return len(self.processed_file_names)
    
    def get(self, idx):
        data = torch.load(osp.join(self.processed_dir, f'data_{idx}.pt'))
        return data

Here, each graph `Data` object is saved individually in `process()` and loaded manually in `get()`. We might want to cache some of them for ease of use.
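
One possible way to do the caching is sketched below, using only the standard library. The `load_graph` helper, the pickle files, and the temporary directory are hypothetical stand-ins for `get()`, the `.pt` files, and `processed_dir`; the same pattern could presumably be wrapped around the `torch.load` call.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path

# Hypothetical stand-in for `processed_dir`, filled with three small "graphs".
processed_dir = Path(tempfile.mkdtemp())
for i in range(3):
    with open(processed_dir / f"data_{i}.pkl", "wb") as f:
        pickle.dump({"idx": i, "x": [i, i + 1]}, f)


@lru_cache(maxsize=2)  # keep at most the two most recently used graphs in memory
def load_graph(idx):
    """Load one processed graph from disk; repeated calls hit the cache."""
    with open(processed_dir / f"data_{idx}.pkl", "rb") as f:
        return pickle.load(f)


load_graph(0)  # read from disk
load_graph(0)  # served from the cache
load_graph(1)  # read from disk
info = load_graph.cache_info()
print(info.hits, info.misses)  # 1 2
```

Because the cache hands back the same object on every hit, an in-place transform would modify the cached copy; copying the object before transforming it avoids that.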

Use HDF5 or Zarr if possible.

# Loading Graphs from CSV

In [None]:
from torch_geometric.data import download_url, extract_zip

In [None]:
url = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'

In [None]:
extract_zip(download_url(url, '.'), './')

In [None]:
movie_path = './ml-latest-small/movies.csv'
rating_path = './ml-latest-small/ratings.csv'

In [None]:
import pandas as pd
print(pd.read_csv(movie_path).head())


In [None]:
print(pd.read_csv(rating_path).head())
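
As a rough sketch of a possible next step, the ratings rows can be turned into an edge list by mapping the raw user and movie IDs to consecutive indices. The inline sample below is made up and only mirrors the column layout of `ratings.csv`; the names `user_index`, `movie_index`, `edge_index`, and `edge_rating` are chosen for illustration, and plain lists stand in for tensors.

```python
import csv
import io

# Tiny made-up sample in the same column layout as ratings.csv.
ratings_csv = """userId,movieId,rating,timestamp
1,10,4.0,964982703
1,30,5.0,964981247
7,10,3.5,964982224
"""

rows = list(csv.DictReader(io.StringIO(ratings_csv)))

# Map raw IDs to consecutive indices, in order of first appearance.
user_index = {}
movie_index = {}
for row in rows:
    user_index.setdefault(row["userId"], len(user_index))
    movie_index.setdefault(row["movieId"], len(movie_index))

# Edge list in [2, num_edges] layout: row 0 holds users, row 1 holds movies.
edge_index = [
    [user_index[row["userId"]] for row in rows],
    [movie_index[row["movieId"]] for row in rows],
]
edge_rating = [float(row["rating"]) for row in rows]

print(edge_index)   # [[0, 0, 1], [0, 1, 0]]
print(edge_rating)  # [4.0, 5.0, 3.5]
```

The remapping matters because the raw IDs in the file are not guaranteed to be contiguous, while node indices in a graph need to be.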