# Experiment 2: Frequent Pattern and Association Rule Mining

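This experiment performs frequent pattern and association rule mining on the Microsoft news recommendation dataset. The workflow is: preprocess the data, mine the frequent patterns, name the patterns, analyze the mining results, and finally visualize them.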

In [3]:
import pandas as pd
import numpy as np

## 1. Data Preprocessing

First, read the data and convert it into a transaction table.

In [67]:
with open("data/news.tsv", encoding="utf-8") as f:
    df_news = pd.read_csv(f, delimiter='\t', header=None)

df_news = df_news.set_index(0)
print(df_news[1][:10])

with open("data/behaviors.tsv", encoding="utf-8") as f:
    df_behav = pd.read_csv(f, delimiter='\t', header=None)

print(df_behav[4][:10])

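from tqdm import tqdm
import sys

# Build one transaction per impression log: the categories of the clicked news.
data_list = []
for _, row in tqdm(df_behav.iterrows(), total=df_behav.shape[0], file=sys.stdout):
    data_row = []
    # Column 4 holds space-separated "<news id>-<click label>" pairs.
    for item in row[4].split(" "):
        name, clicked = item.split("-")
        if name not in df_news.index:
            continue
        # Column 1 of news.tsv is the news category.
        news_type = df_news.loc[name, 1]
        # Keep only the news the user actually clicked (label 1).
        if int(clicked) == 1:
            data_row.append(news_type)
    # Skip impressions with fewer than two clicks.
    if len(data_row) < 2:
        continue
    # Deduplicate categories; a transaction may end up with a single category.
    data_list.append(list(set(data_row)))

for row in data_list:
    print(row)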

0
N55528    lifestyle
N18955       health
N61837         news
N53526       health
N38324       health
N2073        sports
N11429         news
N49186      weather
N2131        health
N59295         news
Name: 1, dtype: object
0    N28682-0 N48740-0 N31958-1 N34130-0 N6916-0 N5...
1    N20036-0 N23513-1 N32536-0 N46976-0 N35216-0 N...
2    N36779-0 N62365-0 N58098-0 N5472-0 N13408-0 N5...
3    N6950-0 N60215-0 N6074-0 N11930-0 N6916-0 N248...
4    N5940-1 N23513-0 N49285-0 N23355-0 N19990-0 N3...
5    N29862-0 N48740-0 N11390-0 N5472-0 N53572-0 N2...
6    N42767-1 N30290-0 N36779-0 N20036-0 N32536-0 N...
7    N53687-0 N31289-0 N37458-0 N8455-0 N56211-0 N5...
8                  N45612-0 N60939-1 N33397-0 N19685-0
9    N29091-0 N60762-0 N29862-0 N512-0 N48740-0 N60...
Name: 4, dtype: object
100%|██████████| 73152/73152 [00:42<00:00, 1730.05it/s]
['sports', 'lifestyle']
['sports', 'news']
['sports']
['news']
['sports', 'news']
['weather', 'news']
['finance']
['sports', 'foodanddrink']
['hea

## 2. Frequent Pattern Mining

Mine the frequent patterns with the Apriori algorithm, using the `apyori` package.

In [82]:
from apyori import apriori

frequent_itemsets = list(apriori(data_list, min_support=0.1))

for itemset in frequent_itemsets:
    print(itemset.items)

frozenset({'finance'})
frozenset({'foodanddrink'})
frozenset({'health'})
frozenset({'lifestyle'})
frozenset({'news'})
frozenset({'sports'})
frozenset({'tv'})
frozenset({'news', 'lifestyle'})
frozenset({'sports', 'lifestyle'})
frozenset({'sports', 'news'})
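
To make the `min_support=0.1` threshold concrete, the sketch below computes support by hand on a small toy dataset. The transactions, the `support` and `frequent_itemsets` helpers, and the `max_len` parameter are all illustrative assumptions, not part of the experiment. The enumeration is brute force rather than Apriori itself; Apriori should give the same itemsets while pruning candidates whose subsets are already infrequent.

```python
from itertools import combinations

# Hypothetical toy transactions, not taken from the dataset.
transactions = [
    {"sports", "news"},
    {"sports", "lifestyle"},
    {"news", "lifestyle"},
    {"sports", "news", "lifestyle"},
    {"finance"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support, max_len=2):
    """Brute-force enumeration of all itemsets with up to `max_len` items."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_len + 1):
        for candidate in combinations(items, k):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


toy_result = frequent_itemsets(transactions, min_support=0.4)
for itemset, s in toy_result.items():
    print(sorted(itemset), s)
```

In this toy data `finance` appears in only one of five transactions, so it falls below the 0.4 threshold, while each of the three pairs appears in two transactions and is kept.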


## 3. Naming the Frequent Patterns

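The mining step above found 10 frequent patterns. In the order printed, they are named as follows:
Finance, Food and Drink, Health, Lifestyle, News, Sports, TV, Health News (news + lifestyle), Sports Health (sports + lifestyle), Sports News (sports + news)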

In [80]:
# Mine association rules
association_rules = list(apriori(data_list, min_support=0.1, min_confidence=0.1))

# Print the raw rule records
# for rule in association_rules:
#     print(rule)

# Print each rule as "antecedent -> consequent", skipping rules with an empty antecedent
for record in association_rules:
    for stat in record.ordered_statistics:
        base = ",".join(stat.items_base)
        add = ",".join(stat.items_add)
        if base != "":
            print(base + " -> " + add)

lifestyle -> news
news -> lifestyle
lifestyle -> sports
sports -> lifestyle
news -> sports
sports -> news
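
The rules above are printed without their strength. As a minimal sketch, the code below shows how confidence and lift are defined for a rule, again on hypothetical toy transactions with illustrative helper names (`support`, `rule_metrics`). One point worth noting: a rule's confidence is never lower than the support of its full itemset, so with `min_support=0.1` a `min_confidence` of 0.1 filters nothing out.

```python
# Hypothetical toy transactions, not taken from the dataset.
transactions = [
    {"sports", "news"},
    {"sports", "news"},
    {"sports", "lifestyle"},
    {"news"},
    {"news", "lifestyle"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(base, add, transactions):
    """Return (confidence, lift) for the rule base -> add."""
    support_base = support(base, transactions)
    support_add = support(add, transactions)
    support_both = support(set(base) | set(add), transactions)
    # confidence: how often `add` appears among transactions containing `base`
    confidence = support_both / support_base
    # lift: confidence relative to the baseline frequency of `add`
    lift = confidence / support_add
    return confidence, lift


conf_sn, lift_sn = rule_metrics({"sports"}, {"news"}, transactions)
conf_ns, lift_ns = rule_metrics({"news"}, {"sports"}, transactions)
print(f"sports -> news: confidence={conf_sn:.3f}, lift={lift_sn:.3f}")
print(f"news -> sports: confidence={conf_ns:.3f}, lift={lift_ns:.3f}")
```

Confidence depends on the direction of the rule, while lift is symmetric; a lift below 1 suggests the two categories co-occur less often than they would if they were independent.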
