# Exercise

With the dataset `data_01.xls`, the code above converts the original rating matrix to unary form by replacing `NaN` values with 0 and all other values with 1.

**Requirements:**
- Rebuild the unary matrix by replacing ratings >= 3 with 1 and everything else with 0, where 1 means the user likes the item.
- Use the Apriori algorithm and association rules to recommend items for user 89, with min_support = 0.7, min_conf = 0.7, max_length = 3.
- Compare with the results of the example above.

Note: the dataset is no longer split into liked and disliked subsets in the `class Recommender()` part.
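
As a reminder of what the two thresholds measure, here is a minimal sketch on a toy unary matrix. The data and the helper names `support` and `confidence` are made up for illustration and are not part of the notebook's code.

```python
import numpy as np

# Toy unary matrix (hypothetical data): rows are users, columns are items A, B, C.
toy = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
])

def support(data, items):
    """Fraction of rows in which every column listed in `items` equals 1."""
    return float(np.mean(data[:, items].all(axis=1)))

def confidence(data, left, right):
    """Confidence of the rule left -> right: supp(left + right) / supp(left)."""
    return support(data, left + right) / support(data, left)

supp_a = support(toy, [0])               # 3 of 4 users like A
supp_ab = support(toy, [0, 1])           # 2 of 4 users like both A and B
conf_a_to_b = confidence(toy, [0], [1])  # 0.5 / 0.75
print(supp_a, supp_ab, conf_a_to_b)
```

An itemset is kept only if its support reaches min_support, and a rule is kept only if its confidence reaches min_conf.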


In [238]:
import numpy as np
import pandas as pd
from itertools import combinations

df = pd.read_excel('data_01.xls')
df.rename(columns={"Unnamed: 0": "userId"}, inplace=True)
df.set_index('userId', inplace = True)
df.head()

userId,11: Star Wars: Episode IV - A New Hope (1977),12: Finding Nemo (2003),13: Forrest Gump (1994),14: American Beauty (1999),22: Pirates of the Caribbean: The Curse of the Black Pearl (2003),24: Kill Bill: Vol. 1 (2003),38: Eternal Sunshine of the Spotless Mind (2004),63: Twelve Monkeys (a.k.a. 12 Monkeys) (1995),77: Memento (2000),85: Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981),...,8467: Dumb & Dumber (1994),8587: The Lion King (1994),9331: Clear and Present Danger (1994),9741: Unbreakable (2000),9802: The Rock (1996),9806: The Incredibles (2004),10020: Beauty and the Beast (1991),36657: X-Men (2000),36658: X2: X-Men United (2003),36955: True Lies (1994)
1648,,,,,4.0,3.0,,,,,...,,4.0,,,5.0,3.5,3.0,,3.5,
5136,4.5,5.0,5.0,4.0,5.0,5.0,5.0,3.0,,5.0,...,1.0,5.0,,,,5.0,5.0,4.5,4.0,
918,5.0,5.0,4.5,,3.0,,5.0,,5.0,,...,,5.0,,,,3.5,,,,
2824,4.5,,5.0,,4.5,4.0,,,5.0,,...,,3.5,,,,,,,,
3867,4.0,4.0,4.5,,4.0,3.0,,,,4.5,...,1.0,4.0,,,,3.0,4.0,4.0,3.5,3.0


In [239]:
def unary_transform(df):
  # ratings >= 3 become 1 (liked); lower ratings and NaN (unrated) become 0
  # NaN >= 3 evaluates to False, so no separate fillna step is needed
  return (df >= 3).astype(float)

In [263]:
class AssociationRule:
  def __init__(self, dataframe, min_supp = 0.5, min_conf = 0.5, max_depth = 3):
    self.data = dataframe.to_numpy() # transform data frame to numpy array
    self.item_index, self.item_name = self.item_retrive(dataframe) # get name of items with corresponding index in numpy
    self.min_supp = min_supp
    self.min_conf = min_conf
    self.max_depth = max_depth

  def item_retrive(self, dataframe):
    index = np.arange(len(dataframe.columns))
    item = np.array(dataframe.columns)
    return index, item

  def can_merge(self, itemset_1, itemset_2):
    # check if two (k-1)-itemsets can merge together to create k-itemset
    return np.array_equal(itemset_1[:-1], itemset_2[:-1])

  def itemset_generator(self, old_itemset):
    new_itemset = []
    for i in np.arange(len(old_itemset)):
      for j in np.arange(i + 1, len(old_itemset)):
        # get every possible itemset pair to merge
        if self.can_merge(old_itemset[i], old_itemset[j]): # check if a pair can merge
          merge_itemset = old_itemset[i].copy()
          merge_itemset.append(old_itemset[j][-1]) # the merged itemset is itemset 1 plus the last item of itemset 2
          new_itemset.append(merge_itemset) # append new itemset to new itemset list
    return new_itemset

  def estimate_rules(self, k_itemset, support_data, rules):
    left_hand_index = []
    right_hand_index = []
    k = len(k_itemset[0])

    # get possible position combinations of left hand items
    for left_len in np.arange(1, k): # get length of left hand items
      pos_arr = np.arange(k) # position array used make combinations
      for index in combinations(pos_arr, left_len):
        left_hand_index.append(list(index))

    # get position of right hand items based on left hand items
    for left_index in left_hand_index:
      right_hand_index.append([i for i in pos_arr if i not in left_index])

    # Precomputing the position combinations once keeps the nested loop below simple

    for item in k_itemset:
      itemset_supp = support_data['supp'][support_data['item'].index(item)] # get support score of the itemset(X, Y)
      for left_index, right_index in zip(left_hand_index, right_hand_index):
        left_hand = list(np.array(item)[left_index]) # X in X -> Y; convert to a numpy array to index several elements at once
        right_hand = list(np.array(item)[right_index]) # Y in X -> Y; convert to a numpy array to index several elements at once

        left_hand_supp = support_data['supp'][support_data['item'].index(left_hand)] # get support score of X
        right_hand_supp = support_data['supp'][support_data['item'].index(right_hand)] # get support score of Y

        conf = itemset_supp / left_hand_supp # calculate X -> Y confidence
        if conf < self.min_conf: # skip only this rule; other splits of the same itemset may still qualify
          continue

        # if qualified, add to rules database
        rules['left hand'].append(left_hand)
        rules['right hand'].append(right_hand)
        rules['left supp'].append(left_hand_supp)
        rules['right supp'].append(right_hand_supp)
        rules['set supp'].append(itemset_supp)
        rules['confidence'].append(conf)
    return rules

  def item_translate(self, list_set):
    # this function turns list of multiple item_index lists to list of multiple item_name lists
    new_itemset = []
    for itemset in list_set:
      new_itemset.append(self.item_name[itemset])
    return new_itemset

  def apriori(self):
    support_data = {'item' : [], 'supp' : []}
    rules = {'left hand' : [], 'right hand' : [], 'left supp' : [], 'right supp' : [], 'set supp' : [], 'confidence' : []}
    k = 1 # k = 1 in k-itemset
    candidate_itemset = [[i] for i in self.item_index] # list of 1-itemset candidates

    while True:
      k_itemset = [] # list of support-qualified k-itemset
      for item in candidate_itemset: # loop through every k-itemset candidate
        supp = np.sum(np.sum(self.data[:,item], axis = 1) == k) / self.data.shape[0] # calculate support of k-itemset candidate
        if supp >= self.min_supp: # if qualified, append to support database and append to list of qualified k-itemset
          support_data['item'].append(item)
          support_data['supp'].append(supp)
          k_itemset.append(item)
      if k > 1 and k_itemset: # estimate rules only for k >= 2 and only when some k-itemset qualified
        rules = self.estimate_rules(k_itemset, support_data, rules)
      if k == self.max_depth or not k_itemset: # stop when k reaches max_depth or no k-itemset qualified
        break
      # generate (k+1)-itemset candidates based on qualified k-itemset
      candidate_itemset = self.itemset_generator(k_itemset)
      k += 1
    rules['right hand'] = self.item_translate(rules['right hand']) # turn items to readable form
    rules['left hand'] = self.item_translate(rules['left hand']) # turn items to readable form

    return pd.DataFrame(rules).sort_values(by = 'confidence', ascending = False)
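
The candidate-generation step of the class (`can_merge` together with `itemset_generator`) can be checked in isolation. The sketch below re-implements only that join on plain Python lists. The function name `merge_candidates` and the input itemsets are hypothetical.

```python
def merge_candidates(frequent):
    """Join sorted (k-1)-itemsets that share their first k-2 items into k-itemset candidates."""
    candidates = []
    for i in range(len(frequent)):
        for j in range(i + 1, len(frequent)):
            a, b = frequent[i], frequent[j]
            if a[:-1] == b[:-1]:  # same prefix, so only the last item differs
                candidates.append(a + [b[-1]])
    return candidates

pairs = [[0, 1], [0, 2], [1, 2], [1, 3]]  # hypothetical frequent 2-itemsets
triples = merge_candidates(pairs)
print(triples)
```

Like the class above, this sketch does not prune candidates that contain an infrequent subset: `[1, 2, 3]` is generated even though `[2, 3]` is not in the input. Such candidates are only discarded later, when their support is checked against `min_supp`.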


In [266]:
def recommender(df, target_user, rules):
  user_rating = df.loc[target_user] # unary ratings of the target user
  # split the items into liked (1) and not liked or unrated (0)
  liked_item = set(df.columns[user_rating == 1])
  undefine_item = set(df.columns[user_rating == 0])

  result = []
  for left, right in zip(rules['left hand'], rules['right hand']): # consider every rule in the rules database
    if set(left).issubset(liked_item): # the rule applies if the user likes every item on its left-hand side
      # add right-hand items that are not liked yet and not already recommended
      add_item = [item for item in right if item in undefine_item and item not in result]
      result.extend(add_item)

  return pd.DataFrame(result, columns = ['item'])

In [242]:
df = unary_transform(df)
df.head()

userId,11: Star Wars: Episode IV - A New Hope (1977),12: Finding Nemo (2003),13: Forrest Gump (1994),14: American Beauty (1999),22: Pirates of the Caribbean: The Curse of the Black Pearl (2003),24: Kill Bill: Vol. 1 (2003),38: Eternal Sunshine of the Spotless Mind (2004),63: Twelve Monkeys (a.k.a. 12 Monkeys) (1995),77: Memento (2000),85: Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981),...,8467: Dumb & Dumber (1994),8587: The Lion King (1994),9331: Clear and Present Danger (1994),9741: Unbreakable (2000),9802: The Rock (1996),9806: The Incredibles (2004),10020: Beauty and the Beast (1991),36657: X-Men (2000),36658: X2: X-Men United (2003),36955: True Lies (1994)
1648,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0
5136,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0
918,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2824,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3867,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0


In [269]:
rule = AssociationRule(df, max_depth = 3, min_supp = 0.7, min_conf = 0.7)
asso_rules = rule.apriori()
asso_rules.tail(10)

Unnamed: 0,left hand,right hand,left supp,right supp,set supp,confidence
69,[603: The Matrix (1999)],[238: The Godfather (1972)],0.96,0.76,0.72,0.75
292,[603: The Matrix (1999)],"[155: The Dark Knight (2008), 393: Kill Bill: ...",0.96,0.76,0.72,0.75
244,[603: The Matrix (1999)],"[24: Kill Bill: Vol. 1 (2003), 155: The Dark K...",0.96,0.76,0.72,0.75
220,[603: The Matrix (1999)],"[13: Forrest Gump (1994), 453: A Beautiful Min...",0.96,0.76,0.72,0.75
208,[603: The Matrix (1999)],"[13: Forrest Gump (1994), 393: Kill Bill: Vol....",0.96,0.76,0.72,0.75
118,[603: The Matrix (1999)],"[13: Forrest Gump (1994), 24: Kill Bill: Vol. ...",0.96,0.76,0.72,0.75
100,[603: The Matrix (1999)],[8587: The Lion King (1994)],0.96,0.76,0.72,0.75
286,[603: The Matrix (1999)],"[155: The Dark Knight (2008), 272: Batman Begi...",0.96,0.76,0.72,0.75
98,[603: The Matrix (1999)],[862: Toy Story (1995)],0.96,0.72,0.72,0.75
92,[603: The Matrix (1999)],[607: Men in Black (a.k.a. MIB) (1997)],0.96,0.76,0.72,0.75


In [270]:
recommender(df, 89, asso_rules)

Unnamed: 0,item
0,98: Gladiator (2000)
1,597: Titanic (1997)
2,77: Memento (2000)
3,8587: The Lion King (1994)


We can see that changing the unary transformation makes the evaluation of an item stricter. Instead of setting every non-NaN (already rated) item to 1, only liked items (rating >= 3) are set to 1. The stricter criterion reduces the number of recommended items even though min support and min confidence were also lowered, but it makes the recommendations a more reliable indication of what the customer actually likes.
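
The effect of the two transformations on support can be illustrated on a small, made-up rating table. The values and the column names `item_a` and `item_b` are hypothetical and do not come from `data_01.xls`.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for illustration only; NaN means "not rated".
ratings = pd.DataFrame(
    {"item_a": [4.0, 2.0, np.nan, 5.0],
     "item_b": [1.5, 3.0, 2.5, np.nan]},
    index=[1, 2, 3, 4],
)

rated = ratings.notna().astype(int)  # earlier example: any rating -> 1
liked = (ratings >= 3).astype(int)   # this exercise: rating >= 3 -> 1

rated_support = rated.mean()  # share of users who rated each item
liked_support = liked.mean()  # share of users who liked each item
print(rated_support.to_dict())
print(liked_support.to_dict())
```

Every 1 in the thresholded matrix is also a 1 in the "rated" matrix, so an item's support can only stay the same or drop. That is why fewer itemsets and rules pass the same thresholds.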