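torch_geometric: the main package <br>
torch_geometric.nn: layers for building graph neural networks <br>
torch_geometric.data: representations of graph-structured data <br>
torch_geometric.loader: data loading and mini-batching <br>
torch_geometric.datasets: common benchmark datasets for graph neural networks <br>
torch_geometric.transforms: data transforms <br>
torch_geometric.utils: common utilities <br>
torch_geometric.graphgym: GraphGym, a platform for designing and evaluating graph neural network models <br>
torch_geometric.profile: profiling of model training (runtime and memory) <br>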

## Working with graph data

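PyG stores graph-structured data in torch_geometric.data.Data. A data object has the following attributes:
+ data.x: node feature matrix, with shape [num_nodes, num_node_features]
+ data.edge_index: graph connectivity in COO format, with type torch.long and shape [2, num_edges]
+ data.edge_attr: edge feature matrix, with shape [num_edges, num_edge_features]
+ data.y: the targets to train against; the shape depends on the task, for example [num_nodes, *] for node-level targets or [1, *] for graph-level targets
+ data.pos: node position matrix, with shape [num_nodes, num_dimensions] (commonly used to visualize the graph)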

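None of these attributes is required, and a Data object can be extended with further attributes. For example, data.face can store the triangle connectivity of a 3D mesh.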
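The COO layout used by data.edge_index can be illustrated without PyG. The sketch below is a dependency-free illustration, not PyG code: the helper name `to_coo` is hypothetical, and plain Python lists stand in for tensors. It lists every undirected edge once per direction, which is what the [2, num_edges] layout expects.

```python
def to_coo(undirected_edges):
    """Return [sources, targets], listing each undirected edge in both directions."""
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]  # u -> v, then v -> u
        targets += [v, u]
    return [sources, targets]


# A path graph 0 - 1 - 2 with two undirected edges
edge_index = to_coo([(0, 1), (1, 2)])
num_edges = len(edge_index[0])

print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(num_edges)   # 4
```

Two undirected edges become four directed entries, so the second dimension of the COO layout counts directed edges.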

The following graph is represented in PyG:

![graph1](https://tva1.sinaimg.cn/large/006C3FgEgy1gxgy50z26ej30ni09yt94.jpg)

In [1]:
import torch
import torch_geometric
from torch_geometric.data import Data

In [3]:
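# Edge connectivity
# Note: each edge of an undirected graph must be defined twice
edge_index = torch.tensor([
    # Nodes 0 and 1 are connected; because the graph is undirected,
    # the reverse edge from 1 to 0 is listed as well
    [0, 1, 1, 2],
    [1, 0, 2, 1],
], # set the data type explicitly
    dtype=torch.long)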

In [4]:
# Node features
x = torch.tensor([
    # Three nodes,
    # each with a feature vector of dimension 1
    [-1],
    [0],
    [1],
])

In [5]:
data = Data(x=x, edge_index=edge_index)

In [6]:
data

Data(edge_index=[2, 4], x=[3, 1])

In [8]:
print(data) # inspect the graph data

Data(edge_index=[2, 4], x=[3, 1])


In [10]:
print(data.keys) # attributes stored in the graph (newer PyG releases expose this as a method, data.keys())

['x', 'edge_index']


In [12]:
data.x # node features

tensor([[-1],
        [ 0],
        [ 1]])

In [13]:
data['x']

tensor([[-1],
        [ 0],
        [ 1]])

In [15]:
data.num_nodes # number of nodes

3

In [16]:
data.num_edges # number of edges (each direction counts separately)

4

In [18]:
data.num_node_features # dimension of the node feature vectors

1

In [27]:
print(data.has_isolated_nodes()) # whether the graph has isolated nodes (formerly contains_isolated_nodes)

False


In [28]:
data.has_self_loops() # whether the graph has self-loops (formerly contains_self_loops)

False

In [29]:
data.is_directed() # whether the graph is directed

False
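To make the three structural checks above concrete, here is a dependency-free sketch of what they test on a COO edge list. The function names are hypothetical and the logic is a simplified reading of the checks, not PyG's actual implementation.

```python
def has_isolated_nodes(edge_index, num_nodes):
    """A node is isolated if no edge (ignoring self-loops) touches it."""
    touched = set()
    for u, v in zip(*edge_index):
        if u != v:
            touched.update((u, v))
    return len(touched) < num_nodes


def has_self_loops(edge_index):
    """A self-loop is an edge whose source equals its target."""
    return any(u == v for u, v in zip(*edge_index))


def is_directed(edge_index):
    """The graph is directed if some edge lacks its reverse edge."""
    edges = set(zip(*edge_index))
    return any((v, u) not in edges for u, v in edges)


# The three-node example graph from this section
edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
print(has_isolated_nodes(edge_index, num_nodes=3))  # False
print(has_self_loops(edge_index))                   # False
print(is_directed(edge_index))                      # False
```

Dropping one direction of an edge, for example keeping only 0 -> 1, makes the same graph count as directed and leaves node 2 isolated.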

In [18]:
# dir(data)

In [31]:
from torch_geometric.datasets import TUDataset

dataset = TUDataset(
    root = './data/ENZYMES',
    name = 'ENZYMES',
)


Downloading https://www.chrsmrrs.com/graphkerneldatasets/ENZYMES.zip
Extracting data/ENZYMES/ENZYMES/ENZYMES.zip
Processing...
Done!


In [38]:
len(dataset) # total number of graphs

600

In [19]:
# dir(dataset)

In [35]:
dataset.num_classes # number of classes

6

In [36]:
dataset.num_node_features # dimension of the node features

3

In [37]:
data = dataset[0] # select the first graph

In [39]:
data

Data(edge_index=[2, 168], x=[37, 3], y=[1])

In [40]:
dataset = dataset.shuffle()

In [41]:
dataset[0]

Data(edge_index=[2, 100], x=[26, 3], y=[1])

### Loading datasets

When training a graph neural network, the data is usually loaded into memory one portion at a time; each portion is called a batch.

![graph2](https://tva1.sinaimg.cn/large/006C3FgEgy1gxgy510inyj30u0078ab7.jpg)
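PyG forms a batch by joining several graphs into one large disconnected graph: node indices of later graphs are shifted by the number of nodes that came before, a `batch` vector records which graph each node belongs to, and `ptr` marks where each graph starts. The following is a dependency-free sketch of that idea, not PyG's implementation; the `collate` helper and the list-based graph representation are assumptions made for illustration.

```python
def collate(graphs):
    """Merge graphs given as (num_nodes, edge_index) pairs into one batch."""
    src, dst, batch, ptr = [], [], [], [0]
    for graph_id, (num_nodes, edge_index) in enumerate(graphs):
        offset = ptr[-1]  # number of nodes already in the batch
        src += [u + offset for u in edge_index[0]]
        dst += [v + offset for v in edge_index[1]]
        batch += [graph_id] * num_nodes  # node -> graph assignment
        ptr.append(offset + num_nodes)   # cumulative node counts
    return [src, dst], batch, ptr


g1 = (3, [[0, 1, 1, 2], [1, 0, 2, 1]])  # path graph 0 - 1 - 2
g2 = (2, [[0, 1], [1, 0]])              # single edge 0 - 1
edge_index, batch, ptr = collate([g1, g2])

print(edge_index)  # [[0, 1, 1, 2, 3, 4], [1, 0, 2, 1, 4, 3]]
print(batch)       # [0, 0, 0, 1, 1]
print(ptr)         # [0, 3, 5]
```

A batch of n graphs yields a `ptr` of length n + 1, because it stores one boundary more than there are graphs.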

In [16]:
from torch_geometric.loader import DataLoader
from torch_geometric.datasets import TUDataset

dataset = TUDataset(
    root = './data/ENZYMES',
    name = 'ENZYMES',
    use_node_attr=True,
)

loader = DataLoader(
    # the dataset to load
    dataset=dataset,
    # load 32 graphs per batch, out of 600 graphs in total
    batch_size=32,
    # whether to reshuffle the data at every epoch
    shuffle=True,
)

In [17]:
for batch in loader:
    print(batch)
    print(batch.num_graphs)

DataBatch(edge_index=[2, 3996], x=[1053, 21], y=[32], batch=[1053], ptr=[33])
32
DataBatch(edge_index=[2, 3884], x=[977, 21], y=[32], batch=[977], ptr=[33])
32
DataBatch(edge_index=[2, 3856], x=[995, 21], y=[32], batch=[995], ptr=[33])
32
DataBatch(edge_index=[2, 3818], x=[963, 21], y=[32], batch=[963], ptr=[33])
32
DataBatch(edge_index=[2, 4330], x=[1136, 21], y=[32], batch=[1136], ptr=[33])
32
DataBatch(edge_index=[2, 3580], x=[993, 21], y=[32], batch=[993], ptr=[33])
32
DataBatch(edge_index=[2, 4122], x=[1144, 21], y=[32], batch=[1144], ptr=[33])
32
DataBatch(edge_index=[2, 4010], x=[1048, 21], y=[32], batch=[1048], ptr=[33])
32
DataBatch(edge_index=[2, 4110], x=[1199, 21], y=[32], batch=[1199], ptr=[33])
32
DataBatch(edge_index=[2, 4022], x=[1064, 21], y=[32], batch=[1064], ptr=[33])
32
DataBatch(edge_index=[2, 4048], x=[1049, 21], y=[32], batch=[1049], ptr=[33])
32
DataBatch(edge_index=[2, 3970], x=[1053, 21], y=[32], batch=[1053], ptr=[33])
32
...

In [15]:
len(dataset)

600

In [11]:
# dir(torch_geometric)

In [9]:
# torch_geometric.__version__

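### Building spatial graph convolutions

#### Implementing GCN

![GCN formula](https://tva1.sinaimg.cn/large/006C3FgEgy1gxgy512kcoj30u00580ue.jpg)

Here $\theta$ is a learnable parameter matrix. The transformed neighbor features are normalized by the node degrees, and all the resulting messages are summed to give the new feature representation of the current node.
$\gamma$ is a summation function, and $\phi$ is a linear transformation followed by normalization.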
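As a worked illustration, assume the pictured formula is the standard GCN layer, $x_i' = \sum_{j \in N(i) \cup \{i\}} \frac{1}{\sqrt{\deg(i)}\sqrt{\deg(j)}} \, \theta^{\top} x_j$, where degrees are counted after adding self-loops. The sketch below implements that rule with plain Python lists. The function `gcn_layer` is a hypothetical helper, not PyG's GCNConv, and it omits the bias term.

```python
import math


def gcn_layer(x, edge_index, theta):
    """One GCN propagation step on list-based inputs.

    x: node features, one list per node
    edge_index: COO connectivity [sources, targets]
    theta: weight matrix with shape [in_features][out_features]
    """
    num_nodes, out_dim = len(x), len(theta[0])
    # 1. Add a self-loop to every node
    src = list(edge_index[0]) + list(range(num_nodes))
    dst = list(edge_index[1]) + list(range(num_nodes))
    # 2. Node degrees, self-loops included
    deg = [0] * num_nodes
    for i in dst:
        deg[i] += 1
    # 3. Linear transformation of every node's features
    h = [[sum(xi[k] * theta[k][c] for k in range(len(xi))) for c in range(out_dim)]
         for xi in x]
    # 4. Normalize each message by the degrees, then sum per target node
    out = [[0.0] * out_dim for _ in range(num_nodes)]
    for j, i in zip(src, dst):
        norm = 1.0 / math.sqrt(deg[i] * deg[j])
        for c in range(out_dim):
            out[i][c] += norm * h[j][c]
    return out


# The three-node path graph used earlier, with an identity weight
x = [[-1.0], [0.0], [1.0]]
edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
out = gcn_layer(x, edge_index, theta=[[1.0]])
print(out)
```

With self-loops the degrees are 2, 3, and 2, so node 0 keeps half of its own feature and receives nothing from node 1, whose feature is zero.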

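## Creating your own dataset

PyG organizes a dataset into two folders: raw_dir and processed_dir.
+ raw_dir holds the original (raw) dataset files
+ processed_dir holds the data after PyG has processed it

PyG provides three hooks for transforming and filtering a dataset: transform, pre_transform, and pre_filter
+ transform: applied to each data object every time it is accessed
+ pre_transform: applied once to the whole dataset, after which the transformed data is saved to disk; pre_filter works the same way, but decides which data objects are kept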
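The difference between the three hooks is mainly about when they run. The sketch below is a dependency-free illustration with hypothetical `process` and `get` helpers and integers standing in for graphs: pre_filter and pre_transform run once during processing, while transform runs on every access.

```python
def process(raw_items, pre_filter=None, pre_transform=None):
    """Runs once; the result is what would be saved to processed_dir."""
    items = list(raw_items)
    if pre_filter is not None:
        items = [item for item in items if pre_filter(item)]
    if pre_transform is not None:
        items = [pre_transform(item) for item in items]
    return items


def get(processed, idx, transform=None):
    """Runs on every access; transform is applied to the loaded item."""
    item = processed[idx]
    return item if transform is None else transform(item)


processed = process(
    [1, 2, 3, 4],
    pre_filter=lambda item: item % 2 == 0,  # keep even items only
    pre_transform=lambda item: item * 10,   # stored in transformed form
)
first = get(processed, 0, transform=lambda item: item + 1)

print(processed)  # [20, 40]
print(first)      # 21
```

Because the result of pre_transform is stored, changing it means the dataset has to be processed again, whereas transform can be swapped freely between runs.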

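PyG distinguishes two types of dataset:
+ torch_geometric.data.InMemoryDataset: for datasets that fit entirely in memory
+ torch_geometric.data.Dataset: for datasets that do not fit entirely in memory

1. Creating a graph dataset that fits entirely in memory takes four steps
+ Implement torch_geometric.data.InMemoryDataset.raw_file_names(): tells PyG which files make up the raw dataset
+ Implement torch_geometric.data.InMemoryDataset.processed_file_names(): tells PyG which files hold the dataset once it has been processed
+ Implement torch_geometric.data.InMemoryDataset.download(): tells PyG where to get the dataset from
+ Implement torch_geometric.data.InMemoryDataset.process(): tells PyG how to process your dataset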

In [24]:
# General template

import torch
from torch_geometric.data import InMemoryDataset, download_url


# General template for an in-memory dataset
class MyDataset(InMemoryDataset):
    # Initialization
    def __init__(self, root, transform=None, pre_transform=None):
        # root is the root directory of the dataset
        super(MyDataset, self).__init__(root, transform, pre_transform)
        # Load the processed dataset
        self.data, self.slices = torch.load(self.processed_paths[0])

    # -> Union[str, List[str], Tuple]
    @property
    def raw_file_names(self):
        return ['file_1', 'file_2', ...]

    # -> Union[str, List[str], Tuple]
    @property
    def processed_file_names(self):
        return ['data.pt']

    def download(self):
        # Download the dataset into the raw_dir folder
        # (url is a placeholder for the dataset's download address)
        download_url(url, self.raw_dir)

    def process(self):
        data_list = [...]
        # Filter the data
        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]
        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]
        # self.collate merges all data objects into one, which speeds up storage
        # data is the merged object
        # slices records the split points, telling PyG how to recover the original objects from data
        data, slices = self.collate(data_list)
        # Save the data
        torch.save((data, slices), self.processed_paths[0])
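The role of `slices` can be shown with a small dependency-free sketch. It assumes each sample is just a list of node values and uses hypothetical `collate` and `get` helpers; PyG's own collate handles every tensor attribute of a Data object, but the bookkeeping idea is the same: store one concatenated array plus the boundaries between samples.

```python
def collate(data_list):
    """Concatenate all samples and record where each one starts and ends."""
    data, slices = [], [0]
    for item in data_list:
        data.extend(item)
        slices.append(len(data))  # end of this sample, start of the next
    return data, slices


def get(data, slices, idx):
    """Recover sample idx from the concatenated storage."""
    return data[slices[idx]:slices[idx + 1]]


samples = [[1, 2, 3], [4], [5, 6]]
data, slices = collate(samples)

print(data)                   # [1, 2, 3, 4, 5, 6]
print(slices)                 # [0, 3, 4, 6]
print(get(data, slices, 1))   # [4]
```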

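2. Creating a dataset that does not fit entirely in memory

This works much like Dataset in PyTorch. In addition to the steps listed above, you also need to:
+ Implement torch_geometric.data.Dataset.len(): tells PyG how large the dataset is
+ Implement torch_geometric.data.Dataset.get(): tells PyG how to fetch a single data object from the dataset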

In [25]:
# General template
import os.path as osp
import torch
from torch_geometric.data import Data, Dataset, download_url

class MyDataset(Dataset):
    # Initialization
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyDataset, self).__init__(root, transform, pre_transform)

    # -> Union[str, List[str], Tuple]
    @property
    def raw_file_names(self):
        return ['file_1', 'file_2', ...]

    # -> Union[str, List[str], Tuple]
    @property
    def processed_file_names(self):
        return ['data_1.pt', ...]

    def download(self):
        # url is a placeholder for the dataset's download address
        path = download_url(url, self.raw_dir)

    def process(self):
        i = 0
        for raw_path in self.raw_paths:
            # Read the data from raw_path
            data = Data(...)
            # Skip data objects rejected by pre_filter
            if self.pre_filter is not None and not self.pre_filter(data):
                continue
            if self.pre_transform is not None:
                data = self.pre_transform(data)
            # Save each data object to its own file
            torch.save(data, osp.join(self.processed_dir, 'data_{}.pt'.format(i)))
            i += 1

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        data = torch.load(osp.join(self.processed_dir, 'data_{}.pt'.format(idx)))
        return data
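The template above keeps one file per sample so that only the requested sample is read from disk. A dependency-free sketch of the same len/get pattern follows; the class `OnDiskDataset` is hypothetical and stores JSON files in a temporary directory instead of `.pt` files in processed_dir.

```python
import json
import os
import tempfile


class OnDiskDataset:
    """Stores each sample in its own file and reads it back on demand."""

    def __init__(self, root, samples):
        self.root = root
        os.makedirs(root, exist_ok=True)
        # "process" step: write one file per sample
        for i, sample in enumerate(samples):
            with open(self._path(i), "w") as f:
                json.dump(sample, f)
        self._size = len(samples)

    def _path(self, idx):
        return os.path.join(self.root, f"data_{idx}.json")

    def len(self):
        return self._size

    def get(self, idx):
        # Only the requested sample is loaded into memory
        with open(self._path(idx)) as f:
            return json.load(f)


with tempfile.TemporaryDirectory() as root:
    dataset = OnDiskDataset(root, [{"x": [1, 2]}, {"x": [3]}])
    size = dataset.len()
    second = dataset.get(1)

print(size)    # 2
print(second)  # {'x': [3]}
```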

## Mini-batching

In [3]:
from typing import Any
import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader


# Define the graph data object
class PairData(Data):
    def __init__(self, edge_index_s=None, x_s=None, edge_index_t=None, x_t=None):
        # Each sample contains two graphs, s and t
        """
        :param edge_index_s: connectivity of graph s
        :param x_s: node feature matrix of graph s
        :param edge_index_t: connectivity of graph t
        :param x_t: node feature matrix of graph t
        """
        super(PairData, self).__init__()
        self.edge_index_s = edge_index_s
        self.x_s = x_s
        self.edge_index_t = edge_index_t
        self.x_t = x_t

    def __inc__(self, key: str, value: Any, *args, **kwargs) -> Any:
        # When batching the edges of graph s,
        # shift the indices by the number of nodes in graph s
        if key == 'edge_index_s':
            return self.x_s.size(0)
        # When batching the edges of graph t,
        # shift the indices by the number of nodes in graph t
        if key == 'edge_index_t':
            return self.x_t.size(0)
        # Otherwise fall back to the default behavior
        else:
            return super().__inc__(key, value, *args, **kwargs)

In [4]:
# Define graph s
edge_index_s = torch.tensor([
    [0, 0, 0, 0],
    [1, 2, 3, 4],
])
x_s = torch.randn(5, 16)

In [5]:
# Define graph t
edge_index_t = torch.tensor([
    [0, 0, 0],
    [1, 2, 3],
])
x_t = torch.randn(4, 16)

In [9]:
data = PairData(edge_index_s, x_s, edge_index_t, x_t)

In [10]:
data

PairData(edge_index_s=[2, 4], x_s=[5, 16], edge_index_t=[2, 3], x_t=[4, 16])

In [11]:
data_list = [data, data]

In [13]:
# follow_batch creates a batch assignment vector for each listed node attribute
loader = DataLoader(data_list, batch_size=2, follow_batch=['x_s', 'x_t'])
batch = next(iter(loader))

In [14]:
batch

PairDataBatch(edge_index_s=[2, 8], x_s=[10, 16], x_s_batch=[10], edge_index_t=[2, 6], x_t=[8, 16], x_t_batch=[8])

In [15]:
batch.edge_index_s # inspect graph s in the batch

tensor([[0, 0, 0, 0, 5, 5, 5, 5],
        [1, 2, 3, 4, 6, 7, 8, 9]])

In [16]:
batch.edge_index_t # inspect graph t in the batch

tensor([[0, 0, 0, 4, 4, 4],
        [1, 2, 3, 5, 6, 7]])
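The two outputs above follow directly from the increments returned by `__inc__`. The sketch below reproduces them with plain lists; `batch_edge_index` is a hypothetical helper, and the increments 5 and 4 are the node counts of graphs s and t in this example.

```python
def batch_edge_index(edge_indices, increments):
    """Concatenate edge lists, shifting each one by the increments accumulated so far."""
    src, dst, offset = [], [], 0
    for edge_index, inc in zip(edge_indices, increments):
        src += [u + offset for u in edge_index[0]]
        dst += [v + offset for v in edge_index[1]]
        offset += inc  # the value __inc__ would return for this sample
    return [src, dst]


edge_index_s = [[0, 0, 0, 0], [1, 2, 3, 4]]  # graph s has 5 nodes
edge_index_t = [[0, 0, 0], [1, 2, 3]]        # graph t has 4 nodes

batched_s = batch_edge_index([edge_index_s, edge_index_s], [5, 5])
batched_t = batch_edge_index([edge_index_t, edge_index_t], [4, 4])

print(batched_s)  # [[0, 0, 0, 0, 5, 5, 5, 5], [1, 2, 3, 4, 6, 7, 8, 9]]
print(batched_t)  # [[0, 0, 0, 4, 4, 4], [1, 2, 3, 5, 6, 7]]
```

Without the custom increments, both edge lists would be shifted by the same amount, so the indices in one of them would point at the wrong nodes.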

In [18]:
import torch  
from torch_geometric.data import Data  
from torch_geometric.loader import DataLoader 


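# Define the bipartite graph structure
class BipartiteData(Data):
    def __init__(self, edge_index=None, x_s=None, x_t=None):
        super().__init__()
        # One set of edges
        # and two sets of nodes
        self.edge_index = edge_index
        self.x_s = x_s
        self.x_t = x_t

    # Define how each attribute is merged into a batch
    def __inc__(self, key, value, *args, **kwargs):
        # When merging the edge connectivity of two graphs
        if key == 'edge_index':
            # Source indices (first row of edge_index) are shifted by the size of the first node set
            # Target indices (second row of edge_index) are shifted by the size of the second node set
            return torch.tensor([[self.x_s.size(0)], [self.x_t.size(0)]])
        else:
            return super().__inc__(key, value, *args, **kwargs)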

In [20]:
edge_index = torch.tensor([
    [0, 0, 1, 1],
    [0, 1, 1, 2],
])
x_s = torch.randn(2, 16)
x_t = torch.randn(3, 16)


data = BipartiteData(edge_index, x_s, x_t)
data_list = [data, data]
loader = DataLoader(data_list, batch_size=2)

batch = next(iter(loader))
print(batch)
print(batch.edge_index)

BipartiteDataBatch(edge_index=[2, 8], x_s=[4, 16], x_t=[6, 16], batch=[6], ptr=[3])
tensor([[0, 0, 1, 1, 2, 2, 3, 3],
        [0, 1, 1, 2, 3, 4, 4, 5]])
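In the bipartite case the two rows of edge_index refer to different node sets, so each row gets its own offset. A dependency-free sketch of that per-row shifting, using a hypothetical `batch_bipartite` helper, reproduces the output above:

```python
def batch_bipartite(edge_indices, sizes):
    """Shift source and target rows independently.

    sizes: one (num_source_nodes, num_target_nodes) pair per sample
    """
    src, dst = [], []
    src_offset = dst_offset = 0
    for edge_index, (num_s, num_t) in zip(edge_indices, sizes):
        src += [u + src_offset for u in edge_index[0]]
        dst += [v + dst_offset for v in edge_index[1]]
        src_offset += num_s  # first row advances by the source node count
        dst_offset += num_t  # second row advances by the target node count
    return [src, dst]


edge_index = [[0, 0, 1, 1], [0, 1, 1, 2]]  # 2 source nodes, 3 target nodes
batched = batch_bipartite([edge_index, edge_index], [(2, 3), (2, 3)])

print(batched)  # [[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 1, 2, 3, 4, 4, 5]]
```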




## Building a heterogeneous graph

In [12]:
import os.path as osp
import torch
import pandas as pd
# from sentence_transformers import SentenceTransformer

In [3]:
!pip install -U sentence-transformers



In [13]:
from torch_geometric.data import HeteroData, download_url, extract_zip
from torch_geometric.transforms import ToUndirected, RandomLinkSplit


url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
root = "./data/MovieLens"
extract_zip(download_url(url, root), root)

Using existing file ml-latest-small.zip
Extracting ./data/MovieLens/ml-latest-small.zip


In [15]:
movie_path = osp.join(root, 'ml-latest-small', 'movies.csv')
rating_path = osp.join(root, 'ml-latest-small', 'ratings.csv')

In [18]:
pd.read_csv(movie_path).head() # basic information about each movie: unique ID, title, and genres

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [17]:
pd.read_csv(rating_path).head() # ratings that users gave to movies

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
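In a heterogeneous graph, each rating can be treated as an edge from a user node to a movie node. The raw IDs are not consecutive (movieId jumps from 6 to 47), so they first need to be mapped to consecutive node indices. The following is a dependency-free sketch under stated assumptions: it uses only the five rating rows shown above, the helper `build_index` is hypothetical, and in a full pipeline the movie mapping would more likely be built from movies.csv so that unrated movies also get an index.

```python
# (userId, movieId, rating) rows copied from the ratings preview above
ratings = [
    (1, 1, 4.0),
    (1, 3, 4.0),
    (1, 6, 4.0),
    (1, 47, 5.0),
    (1, 50, 5.0),
]


def build_index(ids):
    """Map each distinct raw ID to a consecutive integer, in order of first appearance."""
    mapping = {}
    for raw_id in ids:
        mapping.setdefault(raw_id, len(mapping))
    return mapping


user_index = build_index(u for u, _, _ in ratings)
movie_index = build_index(m for _, m, _ in ratings)

# COO connectivity for the (user, rates, movie) edge type
edge_index = [
    [user_index[u] for u, _, _ in ratings],
    [movie_index[m] for _, m, _ in ratings],
]
edge_label = [r for _, _, r in ratings]

print(movie_index)  # {1: 0, 3: 1, 6: 2, 47: 3, 50: 4}
print(edge_index)   # [[0, 0, 0, 0, 0], [0, 1, 2, 3, 4]]
print(edge_label)   # [4.0, 4.0, 4.0, 5.0, 5.0]
```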
