## Creating Your Own Datasets

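&emsp;&emsp;`PyTorch Geometric` provides two abstract classes for building datasets:

1. `torch_geometric.data.Dataset`

2. `torch_geometric.data.InMemoryDataset`

&emsp;&emsp;The former is meant for large datasets that cannot be loaded into memory all at once; the latter is meant for small datasets that fit entirely in memory.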

## In-Memory Datasets

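&emsp;&emsp;`torch_geometric.data.InMemoryDataset` takes four optional arguments:

1. `root(string, optional)`: the root directory in which the dataset is stored. Every dataset is passed a root folder that indicates where it should be stored. The root folder is split into two subfolders: the unprocessed dataset is saved in `raw_dir`, and the processed dataset is saved in `processed_dir`.

2. `transform(callable, optional)`: a function that takes a data object and returns a transformed version; it is applied every time a data object is accessed.

3. `pre_transform(callable, optional)`: a function that takes a data object and returns a transformed version; it is applied once, before the data object is saved to disk.

4. `pre_filter(callable, optional)`: a function that takes a data object and returns a boolean indicating whether the data object should be kept in the final dataset.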
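
The difference between the three callables is mostly a matter of when each one runs. The following is a minimal pure-Python sketch of that timing, not PyG's implementation; `ToyDataset` and its integer samples are hypothetical stand-ins for a real dataset and its `Data` objects.

```python
class ToyDataset:
    """Hypothetical stand-in that mimics when each callable runs."""

    def __init__(self, raw_samples, transform=None, pre_transform=None, pre_filter=None):
        self.transform = transform
        samples = list(raw_samples)
        # pre_filter runs once, during processing: drop unwanted samples.
        if pre_filter is not None:
            samples = [s for s in samples if pre_filter(s)]
        # pre_transform also runs once; its result is what gets stored.
        if pre_transform is not None:
            samples = [pre_transform(s) for s in samples]
        self._stored = samples

    def __len__(self):
        return len(self._stored)

    def __getitem__(self, idx):
        sample = self._stored[idx]
        # transform runs on every access and never changes the stored copy.
        return sample if self.transform is None else self.transform(sample)


dataset = ToyDataset(
    raw_samples=[1, 2, 3, 4, 5, 6],
    pre_filter=lambda s: s % 2 == 0,   # keep even samples only
    pre_transform=lambda s: s * 10,    # stored as 20, 40, 60
    transform=lambda s: s + 1,         # applied at access time
)
print(len(dataset), dataset[0], dataset._stored)  # 3 21 [20, 40, 60]
```

Because `transform` is applied on access, it is the natural place for random augmentation, while anything expensive and deterministic is usually better placed in `pre_transform`.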

&emsp;&emsp;To create a `torch_geometric.data.InMemoryDataset`, you need to implement four basic methods:

1. `torch_geometric.data.InMemoryDataset.raw_file_names()`: the list of files that must be found in `raw_dir` for the download to be skipped.

2. `torch_geometric.data.InMemoryDataset.processed_file_names()`: the list of files that must be found in `processed_dir` for processing to be skipped.

3. `torch_geometric.data.InMemoryDataset.download()`: downloads the raw data into `raw_dir`.

4. `torch_geometric.data.InMemoryDataset.process()`: processes the raw data and saves it into `processed_dir`.
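
The skip logic described by the four methods can be sketched in plain Python. This is an illustration under stated assumptions, not PyG's code: `files_exist`, `prepare`, `toy_download`, and `toy_process` are hypothetical helpers, and the text files stand in for real raw and processed data.

```python
import os
import tempfile


def files_exist(directory, file_names):
    """Return True only if every expected file is present in `directory`."""
    return all(os.path.exists(os.path.join(directory, name)) for name in file_names)


def prepare(root, raw_file_names, processed_file_names, download, process):
    """Run `download` and `process` only when their expected files are missing."""
    raw_dir = os.path.join(root, "raw")
    processed_dir = os.path.join(root, "processed")
    os.makedirs(raw_dir, exist_ok=True)
    os.makedirs(processed_dir, exist_ok=True)

    steps_run = []
    if not files_exist(raw_dir, raw_file_names):
        download(raw_dir)
        steps_run.append("download")
    if not files_exist(processed_dir, processed_file_names):
        process(raw_dir, processed_dir)
        steps_run.append("process")
    return steps_run


def toy_download(raw_dir):
    # Stand-in for a real download: write a small raw file.
    with open(os.path.join(raw_dir, "some_file_1"), "w") as f:
        f.write("1 2 3\n")


def toy_process(raw_dir, processed_dir):
    # Stand-in for real processing: reformat the raw file.
    with open(os.path.join(raw_dir, "some_file_1")) as src:
        numbers = src.read().split()
    with open(os.path.join(processed_dir, "data.pt"), "w") as dst:
        dst.write(",".join(numbers))


with tempfile.TemporaryDirectory() as root:
    first = prepare(root, ["some_file_1"], ["data.pt"], toy_download, toy_process)
    second = prepare(root, ["some_file_1"], ["data.pt"], toy_download, toy_process)

print(first)   # ['download', 'process']
print(second)  # []
```

The second call does nothing because both file lists are already satisfied, which is why the two `*_file_names` properties should list exactly the files the other two methods produce.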

In [1]:
import torch
from torch_geometric.data import InMemoryDataset, download_url


class MyOwnDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None, pre_filter=None):
        super().__init__(root, transform, pre_transform, pre_filter)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self):
        return ['data.pt']

    def download(self):
        # Download to `self.raw_dir`. `url` is a placeholder for the dataset's download address.
        download_url(url, self.raw_dir)
        ...

    def process(self):
        # Read data into huge `Data` list.
        data_list = [...]

        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
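
`self.collate` packs the list of `Data` objects into one large object plus a `slices` lookup that is used to recover individual examples. The underlying idea can be sketched in plain Python; the `collate` and `get` helpers below are hypothetical illustrations on lists of integers, not PyG's implementation, which works on tensors attribute by attribute.

```python
def collate(samples):
    """Concatenate variable-length samples and record where each one starts and ends."""
    flat, slices = [], [0]
    for sample in samples:
        flat.extend(sample)
        slices.append(len(flat))
    return flat, slices


def get(flat, slices, idx):
    """Recover sample `idx` from the concatenated storage."""
    return flat[slices[idx]:slices[idx + 1]]


samples = [[1, 2, 3], [4], [5, 6]]
flat, slices = collate(samples)
print(flat)                   # [1, 2, 3, 4, 5, 6]
print(slices)                 # [0, 3, 4, 6]
print(get(flat, slices, 2))   # [5, 6]
```

Storing one concatenated object instead of many small ones keeps saving and loading to a single file, at the cost of holding everything in memory.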

## Creating Larger Datasets

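&emsp;&emsp;For large datasets that do not fit entirely in memory, use `torch_geometric.data.Dataset`.

&emsp;&emsp;`torch_geometric.data.Dataset` takes the same arguments as `torch_geometric.data.InMemoryDataset`. In addition to the four methods above, a subclass implements the following two:

1. `len()`: returns the number of examples in the dataset.

2. `get(idx)`: returns the data object at index `idx`.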

In [2]:
import os.path as osp

import torch
from torch_geometric.data import Data, Dataset, download_url


class MyOwnDataset(Dataset):
    def __init__(self, root, transform=None, pre_transform=None, pre_filter=None):
        super().__init__(root, transform, pre_transform, pre_filter)

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self):
        return ['data_1.pt', 'data_2.pt', ...]

    def download(self):
        # Download to `self.raw_dir`. `url` is a placeholder for the dataset's download address.
        path = download_url(url, self.raw_dir)
        ...

    def process(self):
        i = 0
        for raw_path in self.raw_paths:
            # Read data from `raw_path`.
            data = Data(...)

            if self.pre_filter is not None and not self.pre_filter(data):
                continue

            if self.pre_transform is not None:
                data = self.pre_transform(data)

            torch.save(data, osp.join(self.processed_dir, 'data_{}.pt'.format(i)))
            i += 1

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        data = torch.load(osp.join(self.processed_dir, 'data_{}.pt'.format(idx)))
        return data
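
The `len`/`get` pair above keeps only one example in memory at a time by storing each example in its own file. A minimal sketch of that storage pattern, assuming `pickle` files in place of `torch.save`; `ToyDiskDataset` is a hypothetical class, not part of PyG.

```python
import os
import pickle
import tempfile


class ToyDiskDataset:
    """Hypothetical stand-in: one file per example, loaded only on access."""

    def __init__(self, processed_dir, samples):
        self.processed_dir = processed_dir
        # Write each example to its own file, numbered contiguously from 0.
        for i, sample in enumerate(samples):
            with open(self._path(i), "wb") as f:
                pickle.dump(sample, f)
        self._count = len(samples)

    def _path(self, idx):
        return os.path.join(self.processed_dir, f"data_{idx}.pkl")

    def len(self):
        return self._count

    def get(self, idx):
        # Only the requested example is read from disk.
        with open(self._path(idx), "rb") as f:
            return pickle.load(f)


samples = [{"x": [0]}, {"x": [1, 2]}, {"x": [3]}]
with tempfile.TemporaryDirectory() as processed_dir:
    dataset = ToyDiskDataset(processed_dir, samples)
    size = dataset.len()
    second = dataset.get(1)

print(size, second)  # 3 {'x': [1, 2]}
```

As in `process` above, the file index has to stay contiguous: the counter `i` there advances only for examples that pass `pre_filter`, so `get(idx)` never points at a missing file.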

## Building Your Own Dataset from the RecSys Challenge 2015 Data

In [7]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df = pd.read_csv('../../data/yoochoose-data/yoochoose-clicks.dat', header=None)
df.columns=['session_id','timestamp','item_id','category']
print(df.head())

buy_df = pd.read_csv('../../data/yoochoose-data/yoochoose-buys.dat', header=None)
buy_df.columns=['session_id','timestamp','item_id','price','quantity']
print(buy_df.head())

   session_id                 timestamp    item_id category
0           1  2014-04-07T10:51:09.277Z  214536502        0
1           1  2014-04-07T10:54:09.868Z  214536500        0
2           1  2014-04-07T10:54:46.998Z  214536506        0
3           1  2014-04-07T10:57:00.306Z  214577561        0
4           2  2014-04-07T13:56:37.614Z  214662742        0
   session_id                 timestamp    item_id  price  quantity
0      420374  2014-04-06T18:44:58.314Z  214537888  12462         1
1      420374  2014-04-06T18:44:58.325Z  214537850  10471         1
2      281626  2014-04-06T09:40:13.032Z  214535653   1883         1
3      420368  2014-04-04T06:13:28.848Z  214530572   6073         1
4      420368  2014-04-04T06:13:28.858Z  214835025   2617         1


In [8]:
item_encoder = LabelEncoder()
df['item_id'] = item_encoder.fit_transform(df.item_id)
df.head()

   session_id                 timestamp  item_id category
0           1  2014-04-07T10:51:09.277Z     2053        0
1           1  2014-04-07T10:54:09.868Z     2052        0
2           1  2014-04-07T10:54:46.998Z     2054        0
3           1  2014-04-07T10:57:00.306Z     9876        0
4           2  2014-04-07T13:56:37.614Z    19448        0
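
The `process` method later in this section reads `group.label`, but none of the cells shown creates a `label` column. A reasonable assumption, given the two files loaded above, is that a session is labeled positive when its `session_id` also appears in the buy data; with the DataFrames above that would be roughly `df['label'] = df.session_id.isin(buy_df.session_id)`. The pure-Python sketch below applies the same rule to a few hypothetical rows.

```python
# Hypothetical click and buy rows; only `session_id` matters for the label.
clicks = [
    {"session_id": 1, "item_id": 2053},
    {"session_id": 1, "item_id": 2052},
    {"session_id": 2, "item_id": 19448},
    {"session_id": 3, "item_id": 2054},
]
buys = [{"session_id": 2, "item_id": 19448}]

# Assumed labeling rule: a click row is positive if its session contains a purchase.
buy_sessions = {row["session_id"] for row in buys}
for row in clicks:
    row["label"] = row["session_id"] in buy_sessions

labels = [row["label"] for row in clicks]
print(labels)  # [False, False, True, False]
```

Every row of a session gets the same label, which is why reading the first value of the group is enough to label the whole graph.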


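### Creating the Dataset

Here we turn the preprocessed data into a `Dataset` object. Within each `session`, every item is treated as a node, so all the items in one `session` together form one graph.

First, we group the data by `session_id`. During grouping, `item_id` has to be re-encoded as well, because the node `index` of every graph should start from `0`: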

In [9]:
import torch
from torch_geometric.data import Data, InMemoryDataset
from tqdm import tqdm

class YooChooseBinaryDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(YooChooseBinaryDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return []
    @property
    def processed_file_names(self):
        return ['../../data/yoochoose_click_binary_1M_sess.dataset']

    def download(self):
        pass
    
    def process(self):
        
        data_list = []

        # process by session_id
        grouped = df.groupby('session_id')
        for session_id, group in tqdm(grouped):
            sess_item_id = LabelEncoder().fit_transform(group.item_id)
            group = group.reset_index(drop=True)
            group['sess_item_id'] = sess_item_id
            node_features = group.loc[group.session_id==session_id,['sess_item_id','item_id']].sort_values('sess_item_id').item_id.drop_duplicates().values

            node_features = torch.LongTensor(node_features).unsqueeze(1)
            target_nodes = group.sess_item_id.values[1:]
            source_nodes = group.sess_item_id.values[:-1]

            edge_index = torch.tensor([source_nodes, target_nodes], dtype=torch.long)
            x = node_features

            y = torch.FloatTensor([group.label.values[0]])

            data = Data(x=x, edge_index=edge_index, y=y)
            data_list.append(data)
        
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
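
The per-session encoding and edge construction in `process` can be traced by hand on a single session. The sketch below uses plain lists in place of pandas and tensors; `encode_session` and `consecutive_edges` are hypothetical helpers that mimic `LabelEncoder().fit_transform` (which numbers values in sorted order) and the `[:-1]` / `[1:]` slicing. The input is session 1 from the encoded output shown earlier.

```python
def encode_session(item_ids):
    """Map item ids to 0-based node indices, ordered by sorted unique id."""
    unique_sorted = sorted(set(item_ids))
    index_of = {item: i for i, item in enumerate(unique_sorted)}
    return [index_of[item] for item in item_ids], unique_sorted


def consecutive_edges(node_indices):
    """Connect each click to the next click in the same session."""
    return [node_indices[:-1], node_indices[1:]]


clicks = [2053, 2052, 2054, 9876]  # item_id values of session 1, in click order
sess_item_id, node_features = encode_session(clicks)
edge_index = consecutive_edges(sess_item_id)

print(sess_item_id)   # [1, 0, 2, 3]
print(node_features)  # [2052, 2053, 2054, 9876]
print(edge_index)     # [[1, 0, 2], [0, 2, 3]]
```

Row `i` of `node_features` holds the global `item_id` of node `i`, and each column of `edge_index` is one directed edge from a click to the click that follows it.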

Next, we shuffle the dataset and split it into `training`, `validation`, and `testing` subsets:

In [10]:
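# Instantiate the dataset first; the `root` value here is an assumption, so point it at your own data directory.
dataset = YooChooseBinaryDataset(root='data/')
dataset = dataset.shuffle()
train_dataset = dataset[:800000]
val_dataset = dataset[800000:900000]
test_dataset = dataset[900000:]
len(train_dataset), len(val_dataset), len(test_dataset)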
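
The slice boundaries above presuppose a dataset of about one million graphs (as the `1M_sess` file name suggests), which gives roughly an 80/10/10 split. The same shuffle-then-slice pattern can be sketched on indices alone; `shuffle_and_split` is a hypothetical helper, and the fixed `seed` is an added assumption to make the split reproducible.

```python
import random


def shuffle_and_split(num_examples, num_train, num_val, seed=0):
    """Return shuffled index lists for the train, validation, and test subsets."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    train = indices[:num_train]
    val = indices[num_train:num_train + num_val]
    test = indices[num_train + num_val:]
    return train, val, test


train_idx, val_idx, test_idx = shuffle_and_split(10, num_train=8, num_val=1)
print(len(train_idx), len(val_idx), len(test_idx))  # 8 1 1
```

Shuffling before slicing matters here because the sessions are otherwise ordered by `session_id`, so an unshuffled split would separate early sessions from late ones.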

https://www.pytorchtutorial.com/pytorch-geometric-for-gnn/