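## Description:
This notebook handles data preprocessing. The raw data is stored in a sparse format (one rating record per line), and the code below converts it into matrix form. The data used here has already been preprocessed. It can be turned into implicit feedback, for example by treating ratings above 3 as positive (interested) examples and everything else as negative (not interested) examples. In this notebook, every user-item pair that has a rating is marked 1 and every pair without a rating is marked 0. The data then ends up as the three structures below: a training matrix and two test lists.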
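
The two labelling schemes mentioned above can be sketched in plain Python as follows. The sample triples and the helper name `to_implicit` are made up for illustration; the threshold of 3 comes from the example in the text.

```python
# Hypothetical (user, item, rating) triples; not taken from the real dataset.
ratings = [(0, 10, 5.0), (0, 11, 2.0), (1, 10, 4.0), (1, 12, 3.0)]

def to_implicit(triples, threshold=None):
    """Return the set of (user, item) pairs labelled 1.

    threshold=None : every rated pair is a positive (the scheme used in this notebook).
    threshold=3    : only ratings strictly above 3 count as positives.
    """
    if threshold is None:
        return {(u, i) for u, i, _ in triples}
    return {(u, i) for u, i, r in triples if r > threshold}

rated_as_positive = to_implicit(ratings)
liked_as_positive = to_implicit(ratings, threshold=3)
print(sorted(rated_as_positive))
print(sorted(liked_as_positive))
```

Any pair missing from the returned set is implicitly labelled 0.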

In [12]:
import scipy.sparse as sp
import numpy as np

In [13]:
# filename is the test.rating file; this is roughly how the test set is built
def load_rating_file_as_list(filename):
    ratingList = []
    with open(filename, "r") as f:
        line = f.readline()
        while line is not None and line != "":
            arr = line.split("\t")
            user, item = int(arr[0]), int(arr[1])
            ratingList.append([user, item])     # user ID, movie (item) ID
            line = f.readline()
    return ratingList
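
For reference, the loader above expects one tab-separated record per line, with the user ID and item ID in the first two columns. Below is a minimal sketch of the same parsing on a throwaway file. The sample rows, the assumed `user, item, rating, timestamp` column layout, and the name `read_user_item_pairs` are all illustrative. Iterating over the file object replaces the explicit `readline()` loop, and blank lines are skipped rather than treated as the end of the file.

```python
import os
import tempfile

def read_user_item_pairs(filename):
    """Parse the first two tab-separated columns of each line as [user, item]."""
    pairs = []
    with open(filename, "r") as f:
        for line in f:
            if not line.strip():      # skip blank lines
                continue
            arr = line.split("\t")
            pairs.append([int(arr[0]), int(arr[1])])
    return pairs

# Made-up rows in an assumed "user, item, rating, timestamp" layout.
sample = "0\t25\t5\t978824351\n1\t133\t3\t978300174\n"
with tempfile.NamedTemporaryFile("w", suffix=".rating", delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

pairs = read_user_item_pairs(tmp_path)
os.remove(tmp_path)
print(pairs)
```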

In [17]:
# test.negative
def load_negative_file(filename):
    negativeList = []
    with open(filename, "r") as f:
        line = f.readline()
        while line is not None and line != "":
            arr = line.split("\t")
            negatives = []
            for x in arr[1:]:
                negatives.append(int(x))
            negativeList.append(negatives)
            line = f.readline()
    return negativeList
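
The parser above skips the first column of each line and reads the remaining columns as negative item IDs. In the referenced NCF repository that first column appears to hold the `(user,item)` test pair; treat that layout as an assumption here. The sketch below keeps the first column so it can be inspected next to the negatives. The name `read_negatives` and the sample lines are invented.

```python
import os
import tempfile

def read_negatives(filename):
    """Return (keys, negatives): the raw first column and the item IDs after it."""
    keys, negatives = [], []
    with open(filename, "r") as f:
        for line in f:
            if not line.strip():      # skip blank lines
                continue
            arr = line.rstrip("\n").split("\t")
            keys.append(arr[0])
            negatives.append([int(x) for x in arr[1:]])
    return keys, negatives

# Made-up lines; the notebook reports 99 negatives per line, three are used here.
sample = "(0,25)\t1064\t174\t2791\n(1,133)\t1072\t3154\t3368\n"
with tempfile.NamedTemporaryFile("w", suffix=".negative", delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

keys, negatives = read_negatives(tmp_path)
os.remove(tmp_path)
print(keys)
print(negatives)
```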

In [16]:
def load_rating_file_as_matrix(filename):
    """
    Read .rating file and Return dok matrix.
    The first line of .rating file is: num_users\t num_items
    """
    # Get number of users and items
    num_users, num_items = 0, 0   # track the largest user ID and item ID seen
    with open(filename, "r") as f:
        line = f.readline()
        while line is not None and line != "":
            arr = line.split("\t")
            u, i = int(arr[0]), int(arr[1])
            num_users = max(num_users, u)
            num_items = max(num_items, i)
            line = f.readline()
    # Construct matrix
    # dok_matrix builds a sparse matrix efficiently one entry at a time; storage stays sparse, and toarray() returns a dense copy
    mat = sp.dok_matrix((num_users + 1, num_items + 1), dtype=np.float32)
    with open(filename, "r") as f:
        line = f.readline()
        while line is not None and line != "":
            arr = line.split("\t")
            user, item, rating = int(arr[0]), int(arr[1]), float(arr[2])
            if rating > 0:
                mat[user, item] = 1.0
            line = f.readline()
    return mat   # 0/1 matrix: 1 if the user rated the item, otherwise 0
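
To make the storage idea concrete: a dictionary-of-keys (DOK) matrix keeps only the nonzero cells, keyed by (row, column). The sketch below mimics that idea with a plain `dict`. It illustrates the concept only and is not SciPy's actual implementation; the sample pairs and the helper `to_dense` are invented.

```python
# Invented (user, item) interactions.
interactions = [(0, 0), (0, 2), (2, 1)]

# Sparse storage: only cells that hold a 1 get an entry.
cells = {}
for user, item in interactions:
    cells[(user, item)] = 1.0

# Shape follows the notebook's rule: largest ID + 1 in each dimension.
num_rows = max(u for u, _ in interactions) + 1
num_cols = max(i for _, i in interactions) + 1

def to_dense(cells, num_rows, num_cols):
    """Expand the sparse dict into a list of lists, filling missing cells with 0."""
    return [[cells.get((r, c), 0.0) for c in range(num_cols)]
            for r in range(num_rows)]

dense = to_dense(cells, num_rows, num_cols)
for row in dense:
    print(row)
print(len(cells), "stored cells out of", num_rows * num_cols)
```

User 1 has no interactions in this toy data, so its row is all zeros even though nothing is stored for it.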

In [15]:
class Dataset():
    def __init__(self, path):
        self.trainMatrix = load_rating_file_as_matrix(path+'.train.rating')
        self.testRatings = load_rating_file_as_list(path+'.test.rating')
        self.testNegatives = load_negative_file(path+'.test.negative')
        assert len(self.testRatings) == len(self.testNegatives)
    
    def Getdataset(self):
        return (self.trainMatrix, self.testRatings, self.testNegatives)

In [18]:
# Load the raw data and process it
path = 'Data/ml-1m'
dataset = Dataset(path)

In [20]:
train, testRatings, testNegatives = dataset.Getdataset()

In [22]:
train.toarray()   # rows are users, columns are items (movies); 1 means the user is interested in (has rated) that movie

array([[1., 1., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [29]:
len(testRatings)   # 6040 entries, each of the form [userID, itemID]

6040

In [32]:
testNegatives  # 6040 lists, each apparently 99 items long; aligned with testRatings above, so each test user has one positive item and 99 negative items

[[1064,
  174,
  2791,
  3373,
  269,
  2678,
  1902,
  3641,
  1216,
  915,
  3672,
  2803,
  2344,
  986,
  3217,
  2824,
  2598,
  464,
  2340,
  1952,
  1855,
  1353,
  1547,
  3487,
  3293,
  1541,
  2414,
  2728,
  340,
  1421,
  1963,
  2545,
  972,
  487,
  3463,
  2727,
  1135,
  3135,
  128,
  175,
  2423,
  1974,
  2515,
  3278,
  3079,
  1527,
  2182,
  1018,
  2800,
  1830,
  1539,
  617,
  247,
  3448,
  1699,
  1420,
  2487,
  198,
  811,
  1010,
  1423,
  2840,
  1770,
  881,
  1913,
  1803,
  1734,
  3326,
  1617,
  224,
  3352,
  1869,
  1182,
  1331,
  336,
  2517,
  1721,
  3512,
  3656,
  273,
  1026,
  1991,
  2190,
  998,
  3386,
  3369,
  185,
  2822,
  864,
  2854,
  3067,
  58,
  2551,
  2333,
  2688,
  3703,
  1300,
  1924,
  3118],
 [1072,
  3154,
  3368,
  3644,
  549,
  1810,
  937,
  1514,
  1713,
  2186,
  660,
  2303,
  2416,
  670,
  1176,
  788,
  889,
  3120,
  2344,
  2525,
  3301,
  2055,
  1436,
  2630,
  11,
  2773,
  2176,
  1847,
  740,
  2332,

In [33]:
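# Save the three structures above; note that they must be saved in NumPy format
# train is a SciPy sparse matrix, so np.save stores it as a pickled object; reloading it requires np.load(..., allow_pickle=True)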
np.save('ProcessedData/train.npy', train)
np.save('ProcessedData/testRatings.npy', np.array(testRatings))
np.save('ProcessedData/testNegatives.npy', np.array(testNegatives))
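
A quick way to sanity-check the saved test arrays is to load them back and look at their shapes. The sketch below does a round trip on small invented lists in a temporary directory instead of `ProcessedData/`. It assumes every inner list has the same length, which is what lets `np.array` produce a regular 2-D array.

```python
import os
import tempfile
import numpy as np

# Invented stand-ins for testRatings and testNegatives.
test_ratings = [[0, 25], [1, 133]]
test_negatives = [[1064, 174, 2791], [1072, 3154, 3368]]

with tempfile.TemporaryDirectory() as tmp_dir:
    ratings_path = os.path.join(tmp_dir, "testRatings.npy")
    negatives_path = os.path.join(tmp_dir, "testNegatives.npy")
    np.save(ratings_path, np.array(test_ratings))
    np.save(negatives_path, np.array(test_negatives))

    # Load the arrays back before the temporary directory is removed.
    loaded_ratings = np.load(ratings_path)
    loaded_negatives = np.load(negatives_path)

print(loaded_ratings.shape)
print(loaded_negatives.shape)
```

With the real data, the comments in this notebook suggest shapes of roughly (6040, 2) and (6040, 99).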

For details on what these three variables mean and how they are processed, see [https://github.com/hexiangnan/neural_collaborative_filtering](https://github.com/hexiangnan/neural_collaborative_filtering)
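
As a rough illustration of how `testRatings` and `testNegatives` fit together: in leave-one-out evaluation, each held-out positive item is ranked against that user's sampled negatives, and a hit is counted when it lands in the top K. The sketch below is an outline under those assumptions, not the referenced repository's code. `score` is a hypothetical stand-in for a trained model, and the data is invented.

```python
def hit_ratio_at_k(test_ratings, test_negatives, score, k=10):
    """Fraction of test pairs whose positive item ranks in the top k candidates."""
    hits = 0
    for (user, pos_item), negatives in zip(test_ratings, test_negatives):
        candidates = negatives + [pos_item]
        ranked = sorted(candidates, key=lambda item: score(user, item), reverse=True)
        if pos_item in ranked[:k]:
            hits += 1
    return hits / len(test_ratings)

# Invented data: two users, one positive and three negatives each.
test_ratings = [[0, 25], [1, 133]]
test_negatives = [[1064, 174, 2791], [1072, 3154, 3368]]

def score(user, item):
    """Hypothetical scorer: prefers smaller item IDs, purely for demonstration."""
    return -item

hr = hit_ratio_at_k(test_ratings, test_negatives, score, k=1)
print(hr)
```

Because the i-th entry of `testNegatives` is paired with the i-th entry of `testRatings`, the two lists must stay in the same order, which is what the `assert` in `Dataset.__init__` partially guards.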