Training with a huge dataset class #9227
Unanswered · Charles-Ca asked this question in Q&A · Replies: 1 comment, 2 replies
-
In general this is a valid approach. Two things:
-
Hello,
Thank you for the amazing framework; I really love it. I am currently trying to scale up a GNN on a huge dataset. I have a single graph with approximately 700 million edges and a very large number of nodes, and I am trying to train a GraphSAGE model on it. My graph is heterogeneous with two node types.
The graph can't fit into memory, so I wrote my own dataset class, which works great. My dataset contains approximately 150 subgraphs, each with about 5 million edges. The data is similar to the AMiner dataset (authors write papers), except that my nodes have features, which is why I am trying to use GraphSAGE for a link prediction task.
It should be okay to "break" my graph into pieces like this, which is why I used the dataset class, but maybe I am wrong here as well?
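For illustration, the "dataset of ~150 subgraphs" idea can be sketched with a minimal lazy-loading class. This is a hedged stand-in, not PyG's actual `Dataset` API: the `subgraph_*.pkl` file naming and the pickle storage format are assumptions made for the example; in practice one would subclass `torch_geometric.data.Dataset` and store `Data`/`HeteroData` objects. The key property is the same, though: only one subgraph is ever resident in memory.

```python
import os
import pickle
import tempfile

class SubgraphDataset:
    """Minimal stand-in for a lazy graph dataset: one pre-saved
    subgraph per index, loaded on demand so that only a single
    subgraph (not the full 700M-edge graph) is held in memory.
    File names and the pickle format are illustrative assumptions."""

    def __init__(self, root):
        self.root = root
        # One file per subgraph, e.g. subgraph_0.pkl, subgraph_1.pkl, ...
        self.files = sorted(
            f for f in os.listdir(root) if f.startswith("subgraph_")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Load on demand; nothing is cached, so memory stays bounded.
        with open(os.path.join(self.root, self.files[idx]), "rb") as fh:
            return pickle.load(fh)

# Tiny demonstration with three fake "subgraphs".
with tempfile.TemporaryDirectory() as root:
    for i in range(3):
        with open(os.path.join(root, f"subgraph_{i}.pkl"), "wb") as fh:
            pickle.dump({"num_edges": 5_000_000, "id": i}, fh)
    ds = SubgraphDataset(root)
    print(len(ds), ds[1]["id"])  # → 3 1
```

A real PyG `Dataset` subclass would implement `len()` and `get()` instead of the dunder methods, but the access pattern is the same.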
I'm performing the following operations:
```python
data = Dataset_Large(root=os.path.join(path, 'data'))
graph_loader = DataLoader(data, batch_size=1, shuffle=True)

for epoch in range(1, 15):
    total_loss = total_examples = 0
    for batch_graph in graph_loader:
        ...
```
I am performing a kind of "batch of batches" (iterating over subgraphs, then over mini-batches within each subgraph), and I am wondering whether what I'm doing is correct, or whether it may be inefficient in memory or compute.
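The "batch of batches" accumulation can be sketched as follows. This is a hedged simplification: `train_epoch` and the `(loss, num_examples)` tuples are hypothetical stand-ins for real training steps (with PyG, the inner batches would typically come from a `LinkNeighborLoader` built per subgraph). The point it illustrates is that the epoch loss should be averaged over *examples*, not over subgraphs, so unevenly sized subgraphs are weighted correctly.

```python
def train_epoch(subgraph_batches):
    """Average a loss over nested batches, weighting by example count.

    subgraph_batches: for each subgraph (outer loop), a list of
    (loss, num_examples) pairs for its mini-batches (inner loop).
    These numbers stand in for real forward/backward passes.
    """
    total_loss = total_examples = 0
    for batches in subgraph_batches:          # outer: one subgraph at a time
        for loss, num_examples in batches:    # inner: edge mini-batches
            total_loss += loss * num_examples
            total_examples += num_examples
    return total_loss / total_examples

# Two fake subgraphs with different numbers of mini-batches:
fake = [
    [(0.5, 10), (0.25, 40)],  # subgraph A: 50 examples
    [(1.0, 50)],              # subgraph B: 50 examples
]
print(train_epoch(fake))  # → 0.65
```

Weighting by `num_examples` matters here because the last mini-batch of each subgraph is usually smaller than the rest; a plain mean over batches would bias the epoch loss.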
Thank you for your advice.