Evaluating recommendation performance with Bayesian Personalized Ranking Matrix Factorization in Kakao Buffalo
========================

In [1]:
%pip install buffalo

Collecting buffalo
  Downloading buffalo-2.0.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (96.4 MB)
Collecting numpy~=1.24.3 (from buffalo)
  Downloading numpy-1.24.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting scipy~=1.10.1 (from buffalo)
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Installing collected packages: numpy, scipy, buffalo
  Attempting uninstall: numpy
    Found existing installation: numpy 1.25.2
    Uninstalling numpy-1.25.2:
      Successfully uninstalled numpy-1.25.2
  Attempting uninstall: scipy
    F

In [None]:
# Kept for the record - commented out
#%%bash
#!$PATH
#export PATH=$PATH:/usr/local/lib
#export PATH=$PATH:/usr/lib
#!sudo apt-get install libopenblas-dev

bash: line 1: !/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin: No such file or directory


#0. Data transform for Buffalo
--------------------------------------------------

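Internally, Buffalo takes input in either Matrix Market format or Stream format and converts it into a database class that stores the raw data using h5py.

2-1. Stream data format: consists of the following two files.

- main
  - Assuming the data is the reading history of a blog service, each line is the reading history of the user on the corresponding line of the UID file.
  - Items in a history are separated by spaces; the leftmost is the oldest and the rightmost is the most recent.
  - For example, A B C D D E means the user read the content in the order A B C D D E.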

A B C D D E

F G H I

J K L M N O P

Q R S

T U V W X Y Z

- UID
  - Each line is the user key (such as a user name) for the corresponding line of the main file.
  - Lines must not contain spaces.

user1

user2

user3

user4

user5
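The pairing of the two stream-format files can be sketched as follows. This is a minimal illustration, not part of Buffalo: `write_stream_files` and the sample `histories` are hypothetical names, and the files go to a temporary directory.

```python
import os
import tempfile

def write_stream_files(histories, main_path, uid_path):
    """Write a user -> ordered item list mapping as paired main/uid files."""
    with open(main_path, "w") as main_f, open(uid_path, "w") as uid_f:
        for user, items in histories.items():
            if any(ch.isspace() for ch in user):
                raise ValueError(f"user key must not contain whitespace: {user!r}")
            # Oldest item on the left, most recent on the right.
            main_f.write(" ".join(items) + "\n")
            # Line n of the uid file names the user of line n in the main file.
            uid_f.write(user + "\n")

# Hypothetical sample data mirroring the example above.
histories = {
    "user1": ["A", "B", "C", "D", "D", "E"],
    "user2": ["F", "G", "H", "I"],
}
workdir = tempfile.mkdtemp()
main_path = os.path.join(workdir, "main")
uid_path = os.path.join(workdir, "uid")
write_stream_files(histories, main_path, uid_path)
```

Writing both files in a single loop keeps the line order of main and uid aligned by construction.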

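2-2. Matrix Market data format
- main
  - The Matrix Market data file.
- UID
  - Each line is the actual user key for the corresponding row ID of the MM file.
- IID
  - Each line is the actual item key for the corresponding column ID of the MM file.

UID and IID are needed only for human-readable results and are optional.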
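To make the layout concrete, here is a small sketch that writes a coordinate-format Matrix Market file together with matching uid and iid files, using only the standard library. The helper names and the sample keys are hypothetical; in this notebook the actual file is produced with `scipy.io.mmwrite`.

```python
import os
import tempfile

def write_matrix_market(triples, num_rows, num_cols, path):
    """Write 0-based (row, col, value) triples as a coordinate .mtx file."""
    with open(path, "w") as f:
        f.write("%%MatrixMarket matrix coordinate integer general\n")
        # Size line: number of rows, number of columns, number of stored entries.
        f.write(f"{num_rows} {num_cols} {len(triples)}\n")
        for row, col, value in triples:
            # Matrix Market indices are 1-based.
            f.write(f"{row + 1} {col + 1} {value}\n")

def write_keys(keys, path):
    """Write one key per line; line n labels row (or column) n of the matrix."""
    with open(path, "w") as f:
        for key in keys:
            f.write(f"{key}\n")

users = ["alice", "bob"]              # hypothetical user keys, in row order
items = ["cafe", "bakery", "diner"]   # hypothetical item keys, in column order
triples = [(0, 0, 5), (0, 2, 3), (1, 1, 4)]

workdir = tempfile.mkdtemp()
mtx_path = os.path.join(workdir, "main.mtx")
write_matrix_market(triples, len(users), len(items), mtx_path)
write_keys(users, os.path.join(workdir, "uid"))
write_keys(items, os.path.join(workdir, "iid"))
```

The uid file has as many lines as the matrix has rows and the iid file as many lines as it has columns, which is what ties the keys to the row and column IDs.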

-----------------------------------------------------------
Data generation: use Case 1 or Case 2, depending on the form you need

In [21]:
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.io import mmwrite
from sklearn.model_selection import train_test_split

def csv_to_mtx(df, mtx_file):
    # Takes an already-loaded DataFrame (the CSV is read below with pd.read_csv)

    # Select the 'newUserId', 'newBusinessId', 'stars' columns
    selected_columns = ['newUserId', 'newBusinessId', 'stars']
    df_selected = df[selected_columns]

    # Drop duplicate rows
    df_unique = df_selected.drop_duplicates()

    # Collect user and business IDs
    user_ids = df_unique['newUserId'].tolist()
    business_ids = df_unique['newBusinessId'].tolist()

    # Determine the matrix size
    num_users = len(set(user_ids))
    num_businesses = len(set(business_ids))

    # Map user and business IDs to indices.
    # Note: the maps cover only the IDs in this DataFrame, so two DataFrames
    # converted separately do not share row/column indices.
    user_id_index_map = {id: index for index, id in enumerate(set(user_ids))}
    business_id_index_map = {id: index for index, id in enumerate(set(business_ids))}

    # Build the sparse matrix
    data = df_unique['stars'].astype(int).tolist()
    row_indices = [user_id_index_map[id] for id in df_unique['newUserId'].astype(int)]
    col_indices = [business_id_index_map[id] for id in df_unique['newBusinessId'].astype(int)]
    matrix = csr_matrix((data, (row_indices, col_indices)), shape=(num_users, num_businesses))

    # Save as a Matrix Market .mtx file
    mmwrite(mtx_file, matrix)


# CSV file to read and .mtx files to write
csv_file = '/content/drive/MyDrive/philadelphia_rating.csv'
mtx_file = '/content/drive/MyDrive/output.mtx'
mtx_file1 = '/content/drive/MyDrive/trainset.mtx'
mtx_file2 = '/content/drive/MyDrive/testset.mtx'
df = pd.read_csv(csv_file)

In [22]:
#Case 1. Train on the full data
csv_to_mtx(df, mtx_file)

In [35]:
#Case 2. Split to create an evaluation dataset
trainset, testset = train_test_split(df, test_size=0.2, random_state=42)

# Call the function
csv_to_mtx(trainset, mtx_file1)
csv_to_mtx(testset, mtx_file2)
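Because `csv_to_mtx` derives its index maps from whichever DataFrame it receives, trainset.mtx and testset.mtx are indexed independently: row 5 in one file need not be the same user as row 5 in the other. One possible way around this, sketched below with plain Python and hypothetical sample ratings, is to build the maps once from the full data and reuse them for both splits. `build_index` and `to_triples` are illustrative helpers, not part of this notebook.

```python
def build_index(ids):
    """Map each distinct ID to a 0-based index, in sorted order."""
    return {key: idx for idx, key in enumerate(sorted(set(ids)))}

def to_triples(rows, user_index, item_index):
    """Convert (user, item, stars) rows to (row, col, value) triples."""
    return [(user_index[u], item_index[i], s) for u, i, s in rows]

# Hypothetical ratings: (newUserId, newBusinessId, stars)
ratings = [(10, 7, 5), (10, 9, 2), (42, 7, 4), (42, 8, 1), (99, 9, 3)]
train_rows, test_rows = ratings[:3], ratings[3:]

# Build the maps once from the full data, before splitting.
user_index = build_index(u for u, _, _ in ratings)
item_index = build_index(i for _, i, _ in ratings)
shape = (len(user_index), len(item_index))

# Both splits now refer to the same rows and columns.
train_triples = to_triples(train_rows, user_index, item_index)
test_triples = to_triples(test_rows, user_index, item_index)
```

The triples and `shape` could then be passed to `csr_matrix((data, (row_indices, col_indices)), shape=shape)` as in `csv_to_mtx`, so that both matrices have identical dimensions.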

Create the UID/IID files (if you need codes instead of integers, modify this so that those codes are written)

In [28]:
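#Case 1. Train on the full data
from collections import OrderedDict
unique_rows = OrderedDict()
unique_cols = OrderedDict()

# Collect distinct IDs in order of first appearance.
# Note: csv_to_mtx assigns indices by iterating over a set, so check that
# this order matches the matrix row/column order before relying on these files.
for row in df['newUserId']:
    unique_rows[row] = None
for col in df['newBusinessId']:
    unique_cols[col] = None

uid=unique_rows
iid=unique_cols

with open("/content/drive/MyDrive/uid", "w") as f:
    for val in uid:
        print(val, file=f)
with open("/content/drive/MyDrive/iid", "w") as f:
    for val in iid:
        print(val, file=f)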

In [36]:
#Case 2. Split to create an evaluation dataset
from collections import OrderedDict
unique_rows = OrderedDict()
unique_cols = OrderedDict()

for row in trainset['newUserId']:
    unique_rows[row] = None
for col in trainset['newBusinessId']:
    unique_cols[col] = None

uid=unique_rows
iid=unique_cols

with open("/content/drive/MyDrive/uid1", "w") as f:
    for val in uid:
        print(val, file=f)
with open("/content/drive/MyDrive/iid1", "w") as f:
    for val in iid:
        print(val, file=f)

# Start from empty dicts so that uid2/iid2 list only the test-set IDs
unique_rows = OrderedDict()
unique_cols = OrderedDict()

for row in testset['newUserId']:
    unique_rows[row] = None
for col in testset['newBusinessId']:
    unique_cols[col] = None

uid=unique_rows
iid=unique_cols

with open("/content/drive/MyDrive/uid2", "w") as f:
    for val in uid:
        print(val, file=f)
with open("/content/drive/MyDrive/iid2", "w") as f:
    for val in iid:
        print(val, file=f)

#1. Create and train a model with the BPRMF recommender
--------------------------------------

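※ Importing buffalo.data triggers a dependency problem related to the ALS library, which is not used here.

As a workaround, comment out the import EALS part on line 8 of /usr/local/lib/python3.10/dist-packages/buffalo/algo/eals.py so that it is not loaded.

 #1-1. Train the model using the training set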

In [24]:
import buffalo
import buffalo.data as bdt
#from buffalo import ALS
from buffalo import  BPRMF
from buffalo import aux, log
from buffalo import ALSOption
from buffalo import BPRMFOption
from buffalo import MatrixMarketOptions
#log.set_log_level(1)


MODEL_TO_USE = "BPR"


if MODEL_TO_USE == "ALS":
    #opt = ALSOption().get_default_option()
    print("NO ALS")
elif MODEL_TO_USE == "BPR":
    opt = BPRMFOption().get_default_option()

In [25]:
opt.evaluation_on_learning =  True
opt.validation = aux.Option({'topk': 10})


opt

{'evaluation_on_learning': True,
 'compute_loss_on_training': True,
 'early_stopping_rounds': 0,
 'save_best': False,
 'evaluation_period': 100,
 'save_period': 10,
 'random_seed': 0,
 'validation': {'topk': 10},
 'accelerator': False,
 'use_bias': True,
 'num_workers': 1,
 'hyper_threads': 256,
 'num_iters': 100,
 'd': 20,
 'update_i': True,
 'update_j': True,
 'reg_u': 0.025,
 'reg_i': 0.025,
 'reg_j': 0.025,
 'reg_b': 0.025,
 'optimizer': 'sgd',
 'lr': 0.002,
 'min_lr': 0.0001,
 'beta1': 0.9,
 'beta2': 0.999,
 'eps': 1e-10,
 'per_coordinate_normalize': False,
 'num_negative_samples': 1,
 'sampling_power': 0.0,
 'verify_neg': True,
 'random_positive': False,
 'model_path': '',
 'data_opt': {}}
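Several of the options above (`d`, `lr`, `reg_u`, `reg_i`, `reg_j`) correspond to quantities in the standard BPR objective, which pushes the score of an observed item above that of a sampled negative item for the same user. The following is a generic textbook sketch of one SGD step, not Buffalo's implementation; it uses a single `reg` value, omits the bias terms, and all names are illustrative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bpr_step(p_u, q_i, q_j, lr, reg):
    """One in-place SGD step on -ln sigmoid(x_ui - x_uj) with L2 regularization."""
    x_uij = dot(p_u, q_i) - dot(p_u, q_j)
    g = sigmoid(-x_uij)  # close to 1 when the pair is ranked wrongly, close to 0 otherwise
    for f in range(len(p_u)):
        pu, qi, qj = p_u[f], q_i[f], q_j[f]
        p_u[f] += lr * (g * (qi - qj) - reg * pu)
        q_i[f] += lr * (g * pu - reg * qi)
        q_j[f] += lr * (-g * pu - reg * qj)

rng = random.Random(0)
d = 4  # latent dimension (illustrative; the options above use d = 20)
p_u = [rng.uniform(-0.1, 0.1) for _ in range(d)]    # user factors
q_pos = [rng.uniform(-0.1, 0.1) for _ in range(d)]  # observed (positive) item
q_neg = [rng.uniform(-0.1, 0.1) for _ in range(d)]  # sampled negative item

before = dot(p_u, q_pos) - dot(p_u, q_neg)
for _ in range(500):
    bpr_step(p_u, q_pos, q_neg, lr=0.05, reg=0.025)
after = dot(p_u, q_pos) - dot(p_u, q_neg)
```

After repeated steps on the same triple, `after` exceeds `before`: the model learns only the relative order of the two items, so the resulting scores are not on the star-rating scale.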

Training set input

Modify to match your data

In [33]:
data_opt = MatrixMarketOptions().get_default_option()

#data_opt.input.main = '/content/drive/MyDrive/output.mtx'  # use when training Case 1
data_opt.input.main = '/content/drive/MyDrive/trainset.mtx'  # use when training Case 2
data_opt.input.iid = '/content/drive/MyDrive/iid1' #Case 1 : iid / Case 2 : iid1
data_opt.input.uid = '/content/drive/MyDrive/uid1' #Case 1 : uid / Case 2 : uid1

data_opt

{'type': 'matrix_market',
 'input': {'main': '/content/drive/MyDrive/trainset.mtx',
  'uid': '/content/drive/MyDrive/uid1',
  'iid': '/content/drive/MyDrive/iid1'},
 'data': {'internal_data_type': 'matrix',
  'validation': {'name': 'sample', 'p': 0.01, 'max_samples': 500},
  'batch_mb': 1024,
  'use_cache': False,
  'tmp_dir': '/tmp/',
  'path': './mm.h5py',
  'disk_based': False}}

In [37]:
# Load the data
data = bdt.load(data_opt)
data.create()

# Create the model
if MODEL_TO_USE == "ALS":
    #model = ALS(opt, data=data)
    print("NO ALS")
elif MODEL_TO_USE == "BPR":
    model = BPRMF(opt, data=data)
model.initialize()

# Train the model
val_res = model.train()

val_res

# Save the model
!mkdir model

#model.save("model/model-mymodel") # use when training Case 1
model.save("model/model-mymodel2") # use when training Case 2
del model

[INFO    ] 2024-04-11 14:35:11 [mm.py:247] Create the database from matrix market file.
INFO:MatrixMarket:Create the database from matrix market file.
[INFO    ] 2024-04-11 14:35:12 [base.py:179] File ./mm.h5py exists. To build new database, existing file ./mm.h5py will be deleted.
INFO:MatrixMarket:File ./mm.h5py exists. To build new database, existing file ./mm.h5py will be deleted.
[INFO    ] 2024-04-11 14:35:12 [mm.py:260] Creating working data...
INFO:MatrixMarket:Creating working data...
[PROGRESS] 0.00% 0.0/0.0secs 0.00it/s
[PROGRESS] 100.00% 1.6/1.6secs 4,676,744.21it/s
[PROGRESS] 100.00% 1.6/1.6secs 4,676,744.21it/s

[INFO    ] 2024-04-11 14:35:13 [mm.py:265] Building data part...
INFO:MatrixMarket:Building data part...
[INFO    ] 2024-04-11 14:35:13 [base.py:417] Building compressed triplets for rowwise...
INFO:MatrixMarket:Building compressed triplets for rowwise...
[INFO    ] 2024-04-11 14:35:13 [base.py:418] Preprocessing...
INFO:MatrixMarket:Preprocessing...
[INFO    ] 20

mkdir: cannot create directory ‘model’: File exists


 #1-2. Load the model and predict (evaluated using Case 1)



In [30]:
# Load the model
if MODEL_TO_USE == "ALS":
    #model = ALS()
    print("NO ALS")
elif MODEL_TO_USE == "BPR":
    model = BPRMF()
model.load("model/model-mymodel")

[INFO    ] 2024-04-11 14:25:01 [bpr.py:55] BPRMF({
  "evaluation_on_learning": true,
  "compute_loss_on_training": true,
  "early_stopping_rounds": 0,
  "save_best": false,
  "evaluation_period": 100,
  "save_period": 10,
  "random_seed": 0,
  "validation": {},
  "accelerator": false,
  "use_bias": true,
  "num_workers": 1,
  "hyper_threads": 256,
  "num_iters": 100,
  "d": 20,
  "update_i": true,
  "update_j": true,
  "reg_u": 0.025,
  "reg_i": 0.025,
  "reg_j": 0.025,
  "reg_b": 0.025,
  "optimizer": "sgd",
  "lr": 0.002,
  "min_lr": 0.0001,
  "beta1": 0.9,
  "beta2": 0.999,
  "eps": 1e-10,
  "per_coordinate_normalize": false,
  "num_negative_samples": 1,
  "sampling_power": 0.0,
  "verify_neg": true,
  "random_positive": false,
  "model_path": "",
  "data_opt": {}
})
INFO:BPRMF:BPRMF({
  "evaluation_on_learning": true,
  "compute_loss_on_training": true,
  "early_stopping_rounds": 0,
  "save_best": false,
  "evaluation_period": 100,
  "save_period": 10,
  "random_seed": 0,
  "valida

In [31]:
# Print the model's evaluation results
val_res

{'train_loss': 0.0,
 'val_ndcg': 0.026281085292861574,
 'val_map': 0.020018770147304854,
 'val_accuracy': 0.04627249357326478,
 'val_auc': 0.5224451547008032,
 'val_rmse': 2.5956369990270796,
 'val_error': 2.3246547355651854}

*Experiment result at the time of writing: RMSE = 2.5956369990270796
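For readers unfamiliar with the ranking metrics in `val_res`, the sketch below shows one common definition of NDCG@k with binary relevance. It is an assumption that this matches the general idea behind `val_ndcg`; Buffalo's exact computation may differ, and `ndcg_at_k` is a hypothetical helper.

```python
import math

def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@k with binary relevance: 1 if an item is relevant, else 0."""
    relevant = set(relevant_items)
    # Discounted gain of the hits in the top-k list (rank is 0-based).
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k]) if item in relevant)
    # Best achievable DCG: all relevant items placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k(["a", "b", "c"], ["a"], k=3)   # relevant item ranked first
second = ndcg_at_k(["b", "a"], ["a"], k=2)         # relevant item ranked second
```

A relevant item at the top gives 1.0, and the value shrinks as the relevant item moves down the list.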

In [32]:
uids = [str(x) for x in range(61, 70)]
uids.append(str(216944))
recommendation_result = model.topk_recommendation(uids, topk=5)
print(dir(recommendation_result))
for uid, iids in recommendation_result.items():
  print(f"for user {uid}, recommendations are ", f"\nitems {iids}.\n")

['__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__ior__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__or__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__ror__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']
for user 61, recommendations are  
items ['1252', '3252', '574', '7074', '3474'].

for user 62, recommendations are  
items ['1252', '3252', '574', '7074', '3474'].

for user 63, recommendations are  
items ['1252', '3252', '574', '7074', '3474'].

for user 64, recommendations are  
items ['1252', '3252', '574', '7074', '3474'].

for user 65, recommendations are  
items ['1252', '3252', '574', '7074', '3474'].

for user 66, recommenda

#2. Precision/Recall test using scikit-learn

2-1. Run predictions on the test data with the previously trained model (trained on the split from Case 2)

In [60]:
def parse_matrix_market(file_path):
    user_item_pairs = []
    with open(file_path, 'r') as file:
        for line in file:
            if line.startswith('%'):
                continue  # skip comment lines
            row, col, _ = map(int, line.split())  # extract row, column, and value
            user_item_pairs.append((row, col))
    user_item_pairs = user_item_pairs[1:] # drop the size line (rows, columns, entry count)
    return user_item_pairs

# Extract user-item pairs from the Matrix Market file
row_col_pairs = parse_matrix_market('/content/drive/MyDrive/testset.mtx')

row_col_pairs

[(1, 2750),
 (2, 2750),
 (2, 3609),
 (2, 4073),
 (3, 218),
 (3, 732),
 (3, 1401),
 (3, 2183),
 (3, 5482),
 (3, 5542),
 (3, 6210),
 (3, 6395),
 (4, 2750),
 (5, 5589),
 (6, 1),
 (6, 6077),
 (7, 1),
 (8, 2381),
 (8, 4865),
 (9, 2750),
 (10, 340),
 (10, 431),
 (10, 3377),
 (10, 3618),
 (10, 3635),
 (10, 6010),
 (10, 6180),
 (11, 2750),
 (12, 3102),
 (12, 3403),
 (12, 6425),
 (13, 1),
 (14, 2750),
 (15, 6231),
 (16, 2750),
 (17, 4882),
 (18, 1401),
 (19, 2750),
 (20, 1),
 (21, 2750),
 (22, 6183),
 (23, 1),
 (24, 1),
 (25, 98),
 (25, 103),
 (25, 203),
 (25, 2074),
 (25, 2095),
 (25, 2229),
 (25, 2730),
 (25, 3434),
 (25, 4867),
 (25, 5521),
 (25, 6083),
 (26, 2750),
 (27, 2817),
 (27, 5729),
 (28, 4759),
 (29, 1493),
 (29, 4071),
 (29, 4082),
 (30, 1),
 (31, 4115),
 (31, 5665),
 (32, 136),
 (32, 6226),
 (33, 1),
 (34, 2750),
 (35, 521),
 (35, 1647),
 (35, 4022),
 (35, 4175),
 (35, 5649),
 (36, 763),
 (36, 4248),
 (37, 85),
 (37, 121),
 (37, 136),
 (37, 701),
 (37, 773),
 (37, 812),
 (37, 103

In [61]:
from buffalo.algo.bpr import BPRMF
from buffalo.data.mm import MatrixMarketOptions, MatrixMarket

#data_opt2 = MatrixMarketOptions().get_default_option()
#data_opt2.input.main = '/content/drive/MyDrive/testset.mtx'
#data_opt2.input.iid = '/content/drive/MyDrive/iid2'
#data_opt2.input.uid = '/content/drive/MyDrive/uid2'
#test_data = bdt.load(data_opt2)
trainmodel = BPRMF()
trainmodel.load("model/model-mymodel2")
predictions = trainmodel.get_scores(row_col_pairs)
predictions # (user, business): predicted score

[INFO    ] 2024-04-11 15:00:30 [bpr.py:55] BPRMF({
  "evaluation_on_learning": true,
  "compute_loss_on_training": true,
  "early_stopping_rounds": 0,
  "save_best": false,
  "evaluation_period": 100,
  "save_period": 10,
  "random_seed": 0,
  "validation": {},
  "accelerator": false,
  "use_bias": true,
  "num_workers": 1,
  "hyper_threads": 256,
  "num_iters": 100,
  "d": 20,
  "update_i": true,
  "update_j": true,
  "reg_u": 0.025,
  "reg_i": 0.025,
  "reg_j": 0.025,
  "reg_b": 0.025,
  "optimizer": "sgd",
  "lr": 0.002,
  "min_lr": 0.0001,
  "beta1": 0.9,
  "beta2": 0.999,
  "eps": 1e-10,
  "per_coordinate_normalize": false,
  "num_negative_samples": 1,
  "sampling_power": 0.0,
  "verify_neg": true,
  "random_positive": false,
  "model_path": "",
  "data_opt": {}
})
INFO:BPRMF:BPRMF({
  "evaluation_on_learning": true,
  "compute_loss_on_training": true,
  "early_stopping_rounds": 0,
  "save_best": false,
  "evaluation_period": 100,
  "save_period": 10,
  "random_seed": 0,
  "valida

{(1, 2750): -0.0113541,
 (2, 2750): -0.011333206,
 (2, 3609): 1.2038256,
 (2, 4073): -0.9928506,
 (3, 218): 0.21496233,
 (3, 732): -1.0642723,
 (3, 1401): -0.86731166,
 (3, 2183): 1.4515904,
 (3, 5482): -0.67486966,
 (3, 5542): -0.9739953,
 (3, 6210): -1.0464691,
 (3, 6395): 2.0985107,
 (4, 2750): -0.011313326,
 (5, 5589): -0.93180746,
 (6, 1): 1.2277808,
 (6, 6077): -0.88481015,
 (7, 1): 1.227736,
 (8, 2381): -0.67704886,
 (8, 4865): 0.5998954,
 (9, 2750): -0.011344503,
 (10, 340): -0.5551434,
 (10, 431): -0.15053849,
 (10, 3377): -0.7407474,
 (10, 3618): 1.2009802,
 (10, 3635): 1.2900177,
 (10, 6010): 1.1037229,
 (10, 6180): -1.0715265,
 (11, 2750): -0.011338337,
 (12, 3102): -0.9215661,
 (12, 3403): -1.0874438,
 (12, 6425): -0.5913112,
 (13, 1): 1.2277875,
 (14, 2750): -0.011335997,
 (15, 6231): -0.83772457,
 (16, 2750): -0.011352236,
 (17, 4882): -1.0193403,
 (18, 1401): -0.86730045,
 (19, 2750): -0.011348537,
 (20, 1): 1.2277391,
 (21, 2750): -0.011334216,
 (22, 6183): -0.87564814

In [62]:
len(predictions)

146699

In [64]:
from sklearn.metrics import precision_score, recall_score
import numpy as np
import scipy.io

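# Load the ground-truth values
mat = scipy.io.mmread('/content/drive/MyDrive/testset.mtx')
mytestset = mat.tocsr()  # convert to a CSR sparse matrix

# Binarize predictions and ground truth for comparison
# Note: `predictions` holds BPRMF ranking scores, not star ratings, so a
# rating-scale threshold of 3 may leave no positive predictions.
y_true = []
y_pred = []
threshold = 3  # threshold (values above it are labeled 1)
for row_col_pair, value in predictions.items():
    row, col = row_col_pair
    y_pred.append(1 if value > threshold else 0)
    # .mtx indices are 1-based, hence the -1
    y_true.append(1 if mytestset[row-1, col-1] > threshold else 0)

# Compute precision and recall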
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print("Precision:", precision)
print("Recall:", recall)

Precision: 0.0
Recall: 0.0


  _warn_prf(average, modifier, msg_start, len(result))
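
The predicted scores printed earlier lie roughly between -1 and 2, so none of them pass a threshold of 3, which is consistent with the 0.0 results. Since BPRMF produces ranking scores rather than ratings, one alternative is to measure precision and recall on each user's top-k items. The sketch below is a plain-Python illustration with hypothetical data; `precision_recall_at_k` is not a scikit-learn or Buffalo function.

```python
def precision_recall_at_k(scores, relevant, k):
    """scores: {user: {item: score}}; relevant: {user: set of held-out items}."""
    precisions, recalls = [], []
    for user, item_scores in scores.items():
        held_out = relevant.get(user, set())
        if not held_out:
            continue  # no ground truth for this user
        # Rank this user's candidate items by score, highest first.
        top_k = sorted(item_scores, key=item_scores.get, reverse=True)[:k]
        hits = len(set(top_k) & held_out)
        precisions.append(hits / k)
        recalls.append(hits / len(held_out))
    n = len(precisions)
    return (sum(precisions) / n, sum(recalls) / n) if n else (0.0, 0.0)

# Hypothetical scores and held-out items for two users.
scores = {
    "u1": {"a": 2.1, "b": -0.9, "c": 1.2, "d": -0.01},
    "u2": {"a": -1.0, "b": 0.6, "c": -0.7, "d": 1.3},
}
relevant = {"u1": {"a", "d"}, "u2": {"c"}}
precision, recall = precision_recall_at_k(scores, relevant, k=2)
```

This depends only on the order of the scores, not their magnitude, so it does not need a rating-scale threshold.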
