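## Background
This dataset comes from Kaggle and holds a little over 20,000 transaction records from a bakery. It can be used for business analysis, product recommendation, and similar work. No task is attached to it, so the question to answer is usually one you define yourself. https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery  

Each record has a date, a time, a transaction ID, and an item. The item is a single product such as Tea or Bread, so each row records when one product was sold.
The focus here is learning association rules, following https://www.kaggle.com/bbhatt001/bakery-business-model-association-rules 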
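Association rules are usually judged by support, confidence, and lift. The sketch below computes the three measures in plain Python on a handful of made-up baskets (they are not taken from the bakery data), and the helper names `support`, `confidence`, and `lift` are chosen here for illustration only.

```python
# Toy baskets, invented for illustration (not from the bakery dataset).
baskets = [
    {"Coffee", "Bread"},
    {"Coffee", "Cake"},
    {"Coffee", "Bread", "Cake"},
    {"Tea"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

print(support({"Coffee", "Bread"}, baskets))
print(confidence({"Bread"}, {"Coffee"}, baskets))
print(lift({"Bread"}, {"Coffee"}, baskets))
```

A lift above 1 suggests the two sides appear together more often than they would if they were independent.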

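This notebook mainly tries out pyspark and networkx.
ml is Spark's machine learning package, similar to MLlib. The difference is that ml is a higher-level wrapper built on the DataFrame structure, which is closer to a data science workflow.
MLlib is based on RDDs and sits closer to the low-level API.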

In [21]:
#_*_coding:utf-8_*_
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
# Create the Spark session and context
spark=SparkSession.builder.appName("FPgrowth").getOrCreate()
sc = SparkContext.getOrCreate()
# Read the data
lines = sc.textFile('input/BreadBasket_DMS.csv')
# If the CSV file has a header row, it has to be skipped when processing the lines
header = lines.first() 

# First line
print(header)

 

Date,Time,Transaction,Item


In [22]:
# FP-growth from the ml package

import datetime

if __name__ == "__main__":
    t1=datetime.datetime.now()
    # An explicit schema (defined for reference; the read below uses inferSchema instead):
    schema = StructType([
        # The third argument is nullable; True means the field may contain null
        StructField("Date", StringType(), True),
        StructField("Time", StringType(), True),
        StructField("Transaction", StringType(), True),
        StructField("Item",StringType(),True)
        ]
    )
    
    #data = spark.read.csv(r"hdfs://my_master:8020/user/root/data_spark.csv", encoding='gbk', header=True, inferSchema=True) 
    # header: whether the first line holds the column names; inferSchema: infer the schema automatically, set to True when no schema is supplied
    data = spark.read.csv(r"input/BreadBasket_DMS.csv",header=True,inferSchema=True)  # returns a DataFrame directly
    
    
    print(data.head())

Row(Date=datetime.datetime(2016, 10, 30, 0, 0), Time='09:58:11', Transaction=1, Item='Bread')


In [3]:
# Print the schema as a tree
data.printSchema()

root
 |-- Date: timestamp (nullable = true)
 |-- Time: string (nullable = true)
 |-- Transaction: integer (nullable = true)
 |-- Item: string (nullable = true)



In [4]:
data.show()
data.show(10)

+-------------------+--------+-----------+-------------+
|               Date|    Time|Transaction|         Item|
+-------------------+--------+-----------+-------------+
|2016-10-30 00:00:00|09:58:11|          1|        Bread|
|2016-10-30 00:00:00|10:05:34|          2| Scandinavian|
|2016-10-30 00:00:00|10:05:34|          2| Scandinavian|
|2016-10-30 00:00:00|10:07:57|          3|Hot chocolate|
|2016-10-30 00:00:00|10:07:57|          3|          Jam|
|2016-10-30 00:00:00|10:07:57|          3|      Cookies|
|2016-10-30 00:00:00|10:08:41|          4|       Muffin|
|2016-10-30 00:00:00|10:13:03|          5|       Coffee|
|2016-10-30 00:00:00|10:13:03|          5|       Pastry|
|2016-10-30 00:00:00|10:13:03|          5|        Bread|
|2016-10-30 00:00:00|10:16:55|          6|    Medialuna|
|2016-10-30 00:00:00|10:16:55|          6|       Pastry|
|2016-10-30 00:00:00|10:16:55|          6|       Muffin|
|2016-10-30 00:00:00|10:19:12|          7|    Medialuna|
|2016-10-30 00:00:00|10:19:12| 

In [5]:
# Count the invalid rows (Item is 'NONE')
data[data['Item']=='NONE'].count()

786

In [23]:
# Drop the rows whose Item is 'NONE'
data_drop = data[data['Item']!='NONE'] 
data_drop.count()

20507

Look at the ten best-selling items in the dataset.

In [7]:
data_drop.groupby('Item').count().sort('count',ascending=False).show(10)

+-------------+-----+
|         Item|count|
+-------------+-----+
|       Coffee| 5471|
|        Bread| 3325|
|          Tea| 1435|
|         Cake| 1025|
|       Pastry|  856|
|     Sandwich|  771|
|    Medialuna|  616|
|Hot chocolate|  590|
|      Cookies|  540|
|      Brownie|  379|
+-------------+-----+
only showing top 10 rows



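The counts show that the bakery's best sellers are coffee, bread, tea, and cake, which is hardly surprising. A deeper analysis is needed: which items sell well in which date ranges and at which times of day, and which items sell well together, in other words which combinations to put out at which times. This is the analysis a store manager needs in order to plan sales and promotions around the actual situation.  
We start by month and look at what sells well in each one, which requires extracting the month and day from the Date column. Beyond the month, we also care about sales by day of the week, for example which items sell most on which weekday, so the day of the week is extracted as a new feature too.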
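The date features described above can be checked on a single value with the standard library. This is a sketch only: `spark_dayofweek` is a hypothetical helper that imitates the numbering of Spark's `dayofweek` (1 = Sunday through 7 = Saturday), and the date and time are the first values shown in the data.

```python
from datetime import date, datetime

def spark_dayofweek(d):
    """Mimic the numbering of Spark's dayofweek: 1 = Sunday, ..., 7 = Saturday."""
    # isoweekday() returns 1 for Monday through 7 for Sunday.
    return d.isoweekday() % 7 + 1

first_day = date(2016, 10, 30)   # first date in the dataset
features = {
    "year": first_day.year,
    "month": first_day.month,
    "day": first_day.day,
    "weekday": spark_dayofweek(first_day),
}
print(features)

# The Time column is a string such as '09:58:11'; parse it to get the hour.
sale_time = datetime.strptime("09:58:11", "%H:%M:%S")
print(sale_time.hour, sale_time.minute)
```

Note that Python's own `weekday()` and `isoweekday()` start the week on Monday, so the numbers differ from Spark's unless they are converted as above.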

In [8]:
from pyspark.sql.functions import to_date, to_timestamp
from pyspark.sql.functions import year, month, dayofmonth,dayofweek
from pyspark.sql.functions import hour, minute, second

data_drop.select(to_date(data_drop.Date).alias('date')).show()


data_drop.select(year(data_drop.Date).alias('year'), 
          month(data_drop.Date).alias('month'),
          dayofmonth(data_drop.Date).alias('day')
     ).show()

data_drop.select(hour(to_timestamp(data_drop.Time)).alias('hour'),
          minute(to_timestamp(data_drop.Time)).alias('minute'),
          second(to_timestamp(data_drop.Time)).alias('second')
          ).show()




+----------+
|      date|
+----------+
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
|2016-10-30|
+----------+
only showing top 20 rows

+----+-----+---+
|year|month|day|
+----+-----+---+
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
|2016|   10| 30|
+----+-----+---+
only showing top 20 rows

+----+------+------+
|hour|minute|second|
+----+------+------+
|   9|    58|    11|
|  10|     5|    34|
|  10|     5|    34|
|  10|     7|    57|
|  10|     7|    57|
|  10|     7|    57|
|  10|     8|    41|
|  10|    13|     

In [9]:
# Add a month column
data_drop1 = data_drop.withColumn('month',month(data_drop.Date))
data_drop1.head()

Row(Date=datetime.datetime(2016, 10, 30, 0, 0), Time='09:58:11', Transaction=1, Item='Bread', month=10)

In [10]:
# Add a year column
data_drop2 = data_drop1.withColumn('year',year(data_drop.Date))
data_drop2.head()

Row(Date=datetime.datetime(2016, 10, 30, 0, 0), Time='09:58:11', Transaction=1, Item='Bread', month=10, year=2016)

In [11]:
# Add a day-of-week column (Spark's dayofweek: 1 = Sunday, ..., 7 = Saturday)
data_drop3 = data_drop2.withColumn('weekday',dayofweek(data_drop.Date))
data_drop3.head()

Row(Date=datetime.datetime(2016, 10, 30, 0, 0), Time='09:58:11', Transaction=1, Item='Bread', month=10, year=2016, weekday=1)

In [12]:
# Add a day-of-month column
data_drop4 = data_drop3.withColumn('day',dayofmonth(data_drop.Date))
data_drop4.head()

Row(Date=datetime.datetime(2016, 10, 30, 0, 0), Time='09:58:11', Transaction=1, Item='Bread', month=10, year=2016, weekday=1, day=30)

In [13]:
data_drop3.show(20)

+-------------------+--------+-----------+-------------+-----+----+-------+
|               Date|    Time|Transaction|         Item|month|year|weekday|
+-------------------+--------+-----------+-------------+-----+----+-------+
|2016-10-30 00:00:00|09:58:11|          1|        Bread|   10|2016|      1|
|2016-10-30 00:00:00|10:05:34|          2| Scandinavian|   10|2016|      1|
|2016-10-30 00:00:00|10:05:34|          2| Scandinavian|   10|2016|      1|
|2016-10-30 00:00:00|10:07:57|          3|Hot chocolate|   10|2016|      1|
|2016-10-30 00:00:00|10:07:57|          3|          Jam|   10|2016|      1|
|2016-10-30 00:00:00|10:07:57|          3|      Cookies|   10|2016|      1|
|2016-10-30 00:00:00|10:08:41|          4|       Muffin|   10|2016|      1|
|2016-10-30 00:00:00|10:13:03|          5|       Coffee|   10|2016|      1|
|2016-10-30 00:00:00|10:13:03|          5|       Pastry|   10|2016|      1|
|2016-10-30 00:00:00|10:13:03|          5|        Bread|   10|2016|      1|
|2016-10-30 

In [14]:
# Extract the hour
data_drop5 = data_drop4.withColumn('hour',hour(data_drop.Time))
data_drop5.head()

Row(Date=datetime.datetime(2016, 10, 30, 0, 0), Time='09:58:11', Transaction=1, Item='Bread', month=10, year=2016, weekday=1, day=30, hour=9)

In [15]:
# Extract the minute
data_drop6 = data_drop5.withColumn('minute',minute(data_drop.Time))
data_drop6.head()

Row(Date=datetime.datetime(2016, 10, 30, 0, 0), Time='09:58:11', Transaction=1, Item='Bread', month=10, year=2016, weekday=1, day=30, hour=9, minute=58)

Convert the data into itemsets. First, get the distinct transaction IDs.

In [16]:
[i.Transaction for i in data_drop.select('Transaction').distinct().collect()]

[148, 463, 471, 833, 1088, 1238, 1342, 1580, 1591, 1645, 1829, 1959, 2122, 2142, 2366, 2659, 2866, 3175, 3749, 3794, 3918, 3997, 4101, 4519, 4818, 4900, 4935, 5156, 5300, 5518, 5803, 6336, 6357, 6397, 6466, 6620, 6654, 6658, 7240, 7253, 7340, 7554, 7754, 7833, 7880, 7982, 7993, 8086, 8389, 8592, 8638, 9376, 9427, 9465, 243, 392, 540, 623, 737, 858, 897, 1025, 1084, 1127, 1395, 1460, 1483, 1507, 1522, 1721, 1896, 1990, 2235, 2387, 2563, 2580, 2811, 3179, 3226, 3475, 3698, 4158, 4161, 4190, 4219, 4929, 5614, 6393, 6623, 6773, 7168, 7417, 8803, 8928, 8932, 9182, 9564, 31, 516, 1139, 1143, 1270, 1303, 1322, 1339, 1352, 1618, 1650, 1699, 1903, 2393, 2572, 2711, 2776, 2821, 2996, 3000, 3213, 3352, 3488, 3704, 3761, 4391, 4489, 5071, 5117, 5173, 5287, 5345, 5984, 6266, 6482, 6559, 6622, 6825, 7281, 7387, 7850, 7879, 8222, 8257, 8407, 8924, 9162, 9383, 9454, 9517, 9558, 85, 137, 251, 451, 580, 808, 1265, 1975, 2025, 2231, 2259, 2443, 2488, 2525, 2721, 2748, 2923, 3089, 3098, 3220, 3490, 3796, 

In [24]:
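#data_drop['Transaction']
transid = [i.Transaction for i in data_drop.select('Transaction').distinct().collect()]
shopping_list = []
for item in transid:
    # Note: this runs one Spark job per transaction ID, which is slow on large data
    lst2 = list(set(data_drop.filter(data_drop['Transaction']==item).select('Item').collect()))   
    # collect() pulls the data from all workers back to the driver and returns a list of Row objects, which uses driver memory
    # Items within one transaction should not repeat, so deduplicate with set()
    if len(lst2)>0:
        shopping_list.append(lst2)
        
print(shopping_list[0:3])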



[[Row(Item='Coffee'), Row(Item='Fudge')], [Row(Item='Tea')], [Row(Item='Coffee')]]


In [18]:
# Convert each basket to a list of strings
shop_list  = []  

for items in shopping_list:
    #b = sc.parallelize([items])
    lst=[]
    for x in items:
        for y in x:
            lst.append(str(y))
    shop_list.append(lst)
print(shop_list[0:10])    


[['Coffee', 'Fudge'], ['Tea'], ['Coffee'], ['Coffee'], ['Sandwich', 'Alfajores', 'Juice'], ['Bread', 'Brownie'], ['Coffee', 'Medialuna'], ['Coffee', 'Tea'], ['Soup', 'Sandwich', 'Tea', 'Bread'], ['Coffee', 'Bread', 'Keeping It Local', 'Medialuna']]
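The grouping done by the loops above can also be written as a single pass over (Transaction, Item) pairs. The following is a minimal plain-Python sketch; `build_baskets` is a hypothetical helper, and the sample rows are the first rows shown in the data earlier.

```python
from collections import defaultdict

# (Transaction, Item) pairs copied from the first rows of the dataset.
rows = [
    (1, "Bread"),
    (2, "Scandinavian"),
    (2, "Scandinavian"),
    (3, "Hot chocolate"),
    (3, "Jam"),
    (3, "Cookies"),
]

def build_baskets(rows):
    """Group items by transaction ID, dropping repeats within a basket."""
    grouped = defaultdict(set)
    for transaction_id, item in rows:
        grouped[transaction_id].add(item)
    # Sort the items so the result is deterministic.
    return {tid: sorted(items) for tid, items in grouped.items()}

baskets = build_baskets(rows)
print(baskets)
```

In PySpark the same grouping is usually expressed as `data_drop.groupBy('Transaction').agg(collect_set('Item').alias('items'))`, with `collect_set` imported from `pyspark.sql.functions`, which avoids filtering the DataFrame once per transaction.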


In [25]:
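from pyspark.sql import *
from pyspark.ml.fpm import FPGrowth
from pyspark import SparkContext
sqlContext = SQLContext(sc)


# Caution: this wraps the slice of shopping_list (lists of Row objects, not the
# string lists built above) in one outer list, so createDataFrame receives a
# single row rather than one row per transaction
shop_list = [shopping_list[0:5000]]
# Convert to a Spark DataFrame
list_df = sqlContext.createDataFrame(shop_list,["items"]) 

# Build the model
fp = FPGrowth(minSupport=0.03, minConfidence=0.05)

# Fit the model
fpm  = fp.fit(list_df)

# Show the first five frequent itemsets
fpm.freqItemsets.show(5)

# Strong association rules (printing the DataFrame shows only its schema)
association_rule=fpm.associationRules
print(association_rule)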

+-------------------+----+
|              items|freq|
+-------------------+----+
|          [[Fudge]]|   1|
|         [[Coffee]]|   1|
|[[Coffee], [Fudge]]|   1|
+-------------------+----+

DataFrame[antecedent: array<struct<Item:string>>, consequent: array<struct<Item:string>>, confidence: double, lift: double]
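The output above lists only the items of one basket, each with freq 1, which suggests the DataFrame held a single row rather than one row per transaction. To show what a frequent-itemset count looks like when every basket is its own record, here is a plain-Python sketch. It uses brute-force enumeration, not the FP-growth algorithm, and `frequent_itemsets` and its `max_size` parameter are hypothetical names introduced for this example.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, min_support, max_size=2):
    """Count itemsets of up to `max_size` items and keep the frequent ones.

    Brute-force enumeration for illustration only; FP-growth finds the
    frequent itemsets without enumerating every combination.
    """
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    total = len(baskets)
    return {itemset: n for itemset, n in counts.items() if n / total >= min_support}

# First five baskets from the printed shop_list.
baskets = [
    ["Coffee", "Fudge"],
    ["Tea"],
    ["Coffee"],
    ["Coffee"],
    ["Sandwich", "Alfajores", "Juice"],
]
result = frequent_itemsets(baskets, min_support=0.4)
print(result)
```

For Spark itself, one way to get one row per basket would be `spark.createDataFrame([(b,) for b in shop_list], ["items"])`, assuming `shop_list` is the list of string lists built earlier.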


In [27]:
import networkx as nx
import matplotlib.pyplot as plt

#assRuleDf=association_rule.toPandas()  # convert to a pandas DataFrame
#print('Strong association rules:\n',assRuleDf)
 
new_data = spark.createDataFrame([(["Tea", "Coffee"], )], ["items"])  # new antecedent data
print('Predicted consequents:\n',fpm.transform(new_data).first().prediction)  # predict the consequents
spark.stop()  # stop Spark
#fig, ax=plt.subplots(figsize=(10,4))
#GA = nx.from_pandas_edgelist(assRuleDf,source='antecedent',target='consequent')
#nx.draw(GA,with_labels=True)
#plt.show()

Predicted consequents:
 []
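The empty prediction is easier to read with the rule-matching logic in mind: a rule applies when the basket contains all of its antecedent items, and its consequents are then added to the prediction unless the basket already holds them. The sketch below imitates that behaviour in plain Python. `predict_consequents` is a hypothetical helper and the rules are invented for illustration; they are not results from the bakery data.

```python
def predict_consequents(basket, rules):
    """Collect consequents of every rule whose antecedent is contained in `basket`."""
    basket = set(basket)
    predicted = []
    for antecedent, consequent in rules:
        if set(antecedent) <= basket:
            for item in consequent:
                # Skip items already in the basket or already predicted.
                if item not in basket and item not in predicted:
                    predicted.append(item)
    return predicted

# Hypothetical rules as (antecedent, consequent) pairs.
rules = [
    (["Tea"], ["Cake"]),
    (["Coffee", "Tea"], ["Sandwich"]),
    (["Bread"], ["Coffee"]),
]
print(predict_consequents(["Tea", "Coffee"], rules))
print(predict_consequents(["Tea", "Coffee"], []))
```

With no applicable rule, as in the second call, the prediction is an empty list, which matches the shape of the output above.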
