Mining association rules with the Apriori algorithm

Import the libraries

In [1]:
import numpy as np
import pandas as pd
from apyori import apriori

Read the data

In [2]:
# A forward slash avoids the invalid '\B' escape sequence in the path
df = pd.read_excel('datasets/Big_retail_dataset.xlsx')
print("Number of rows: %d" % df.shape[0])
print("Number of columns: %d" % df.shape[1])

Number of rows: 541909
Number of columns: 8


In [3]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


Normalize the date (drop the time component)

In [4]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate']).dt.normalize()
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01,3.39,17850.0,United Kingdom
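
To illustrate what `dt.normalize()` does, here is a minimal sketch on a hypothetical two-row Series (the timestamps are made up for illustration and are not taken from the dataset). The time part is reset to midnight while the values stay datetimes, so rows can still be compared and counted per calendar day.

```python
import pandas as pd

# Hypothetical timestamps, used only for illustration.
stamps = pd.Series(pd.to_datetime(['2010-12-01 08:26:00',
                                   '2010-12-01 17:45:00']))

# normalize() sets the time component to 00:00:00 and keeps the datetime dtype.
days = stamps.dt.normalize()

# Both rows now fall on the same calendar day.
print(days.nunique())
```

This matters later in the notebook, where the number of distinct values in 'InvoiceDate' is used as the number of trading days.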


Print the number of rows with an empty 'Description' attribute

In [5]:
print(len(df[df['Description'].isnull()]))

1454


Drop the rows with an empty 'Description' attribute

In [6]:
df = df[df['Description'].notnull()]
print("Number of rows: %d" % df.shape[0])
print("Number of columns: %d" % df.shape[1])

Number of rows: 540455
Number of columns: 8
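
As a side note, the count-then-filter steps above can also be written with `isnull().sum()` and `dropna(subset=...)`. The sketch below uses a hypothetical three-row frame, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame with one missing description.
toy = pd.DataFrame({'Description': ['MUG', None, 'PLATE'],
                    'Quantity': [1, 2, 3]})

# Number of rows whose 'Description' is missing.
missing = toy['Description'].isnull().sum()

# Keep only the rows where 'Description' is present.
cleaned = toy.dropna(subset=['Description'])

print(missing, len(cleaned))
```

Both forms select the same rows; `dropna` only states the intent more directly.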


In [7]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01,3.39,17850.0,United Kingdom


Group the data by invoice number

In [8]:
gp_InvoiceNo = df.groupby('InvoiceNo')

Build the list of transactions, where each entry is the list of items bought on one invoice

In [9]:
transactions = []
for name,group in gp_InvoiceNo:
    transactions.append(list(group['Description'].map(str)))
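
The same list-of-lists shape can be produced without an explicit loop. The following is a minimal sketch on hypothetical toy invoices (the item names are invented), assuming the descriptions are already strings.

```python
import pandas as pd

# Hypothetical toy invoices, used only to show the shape of the result.
toy = pd.DataFrame({
    'InvoiceNo': [1, 1, 2],
    'Description': ['MUG', 'PLATE', 'MUG'],
})

# One list of item descriptions per invoice, ordered by invoice number.
toy_transactions = (toy.groupby('InvoiceNo')['Description']
                       .agg(list)
                       .tolist())

print(toy_transactions)
```

Each inner list is one basket, which is the input format that `apriori` from `apyori` iterates over.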

Build a DataFrame from the list of transactions

In [10]:
transactions_df = pd.DataFrame(transactions)
print("Number of rows: %d" % transactions_df.shape[0])
print("Number of columns: %d" % transactions_df.shape[1])

Number of rows: 24446
Number of columns: 1114


In [11]:
transactions_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1104,1105,1106,1107,1108,1109,1110,1111,1112,1113
0,WHITE HANGING HEART T-LIGHT HOLDER,WHITE METAL LANTERN,CREAM CUPID HEARTS COAT HANGER,KNITTED UNION FLAG HOT WATER BOTTLE,RED WOOLLY HOTTIE WHITE HEART.,SET 7 BABUSHKA NESTING BOXES,GLASS STAR FROSTED T-LIGHT HOLDER,,,,...,,,,,,,,,,
1,HAND WARMER UNION JACK,HAND WARMER RED POLKA DOT,,,,,,,,,...,,,,,,,,,,
2,ASSORTED COLOUR BIRD ORNAMENT,POPPY'S PLAYHOUSE BEDROOM,POPPY'S PLAYHOUSE KITCHEN,FELTCRAFT PRINCESS CHARLOTTE DOLL,IVORY KNITTED MUG COSY,BOX OF 6 ASSORTED COLOUR TEASPOONS,BOX OF VINTAGE JIGSAW BLOCKS,BOX OF VINTAGE ALPHABET BLOCKS,HOME BUILDING BLOCK WORD,LOVE BUILDING BLOCK WORD,...,,,,,,,,,,
3,JAM MAKING SET WITH JARS,RED COAT RACK PARIS FASHION,YELLOW COAT RACK PARIS FASHION,BLUE COAT RACK PARIS FASHION,,,,,,,...,,,,,,,,,,
4,BATH BUILDING BLOCK WORD,,,,,,,,,,...,,,,,,,,,,


Compute the support at which an item occurs, on average, once per day

In [12]:
support=len(df['InvoiceDate'].unique())/transactions_df.shape[0]
print(support)

0.012476478769532848
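
The threshold is simply the number of distinct days divided by the number of transactions: an itemset with at least this support appears in at least as many invoices as there are days. The sketch below reproduces the arithmetic; the helper name `daily_support` is hypothetical, and the figure of 305 days is inferred from the printed support and the 24446 transactions rather than read from the data.

```python
def daily_support(n_days, n_transactions):
    """Support of an itemset that occurs in n_days transactions in total,
    i.e. on average once per day of the observed period."""
    return n_days / n_transactions

# 24446 is the number of transactions printed earlier;
# 305 days is an inferred value (assumption), see the lead-in.
threshold = daily_support(305, 24446)
print(threshold)
```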


Use the apriori function to find the list of association rules

In [13]:
association_results = list(apriori(transactions,
                                   min_support = support,
                                   min_confidence = 0.5,
                                   min_lift = 3,
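                                   # apyori reads max_length but not min_length,
                                   # so this argument appears to be ignored;
                                   # min_lift = 3 already excludes rules with
                                   # an empty left part (their lift is 1)
                                   min_length = 2))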
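
For reference, the three thresholds passed above correspond to the standard definitions of support, confidence and lift. The following is a minimal pure-Python sketch of those definitions on a hypothetical four-basket list; the helper names are invented for illustration and are not part of `apyori`.

```python
def itemset_support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(basket) for basket in baskets) / len(baskets)

def rule_confidence(baskets, left, right):
    """Share of baskets containing `left` that also contain `right`."""
    both = set(left) | set(right)
    return itemset_support(baskets, both) / itemset_support(baskets, left)

def rule_lift(baskets, left, right):
    """Confidence relative to how often `right` occurs on its own."""
    return rule_confidence(baskets, left, right) / itemset_support(baskets, right)

# Hypothetical baskets: A and B are always bought together.
toy_baskets = [['A', 'B'], ['A', 'B'], ['C'], ['C']]

print(itemset_support(toy_baskets, ['A', 'B']))
print(rule_confidence(toy_baskets, ['A'], ['B']))
print(rule_lift(toy_baskets, ['A'], ['B']))
```

A lift above 1 means the right part is bought more often together with the left part than on its own, which is why the rules are later ranked by lift.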

Build a DataFrame for the list of rules

In [14]:
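# Empty table with one row per rule; the values are filled in afterwards.
# (Replaces a frame of random integers that served only as a placeholder.)
association_results_table = pd.DataFrame(
    index=range(len(association_results)),
    columns=['GeneralRules',
             'LeftPart',
             'RightPart',
             'Support',
             'Confidence',
             'Lift'])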



Fill the DataFrame with data

In [15]:
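# Each result is a record (items, support, ordered_statistics);
# only the first ordered statistic of every record is used here.
for index, (items, rule_support, stats) in enumerate(association_results):
    rule = stats[0]
    association_results_table.iloc[index] = [' | & | '.join(items),
                                             ' | & | '.join(rule.items_base),
                                             ' | & | '.join(rule.items_add),
                                             rule_support,
                                             rule.confidence,
                                             rule.lift]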
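
The loop above keeps only the first rule of each itemset. As a hedged alternative, the sketch below flattens every ordered statistic into its own row. The two namedtuples are stand-ins that mirror the field layout of the `apyori` result records so that the example runs without the library; `flatten_rules` and the sample record are hypothetical.

```python
from collections import namedtuple

# Stand-ins mirroring the fields of apyori's result records (assumption:
# defined here only so the sketch is self-contained).
RelationRecord = namedtuple(
    'RelationRecord', ['items', 'support', 'ordered_statistics'])
OrderedStatistic = namedtuple(
    'OrderedStatistic', ['items_base', 'items_add', 'confidence', 'lift'])

def flatten_rules(records, sep=' | & | '):
    """Return one dict per rule, covering all ordered statistics."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                'GeneralRules': sep.join(sorted(record.items)),
                'LeftPart': sep.join(sorted(stat.items_base)),
                'RightPart': sep.join(sorted(stat.items_add)),
                'Support': record.support,
                'Confidence': stat.confidence,
                'Lift': stat.lift,
            })
    return rows

# Hypothetical record with two rules derived from one itemset.
sample = [RelationRecord(
    items=frozenset({'CUP', 'PLATE'}),
    support=0.02,
    ordered_statistics=[
        OrderedStatistic(frozenset({'CUP'}), frozenset({'PLATE'}), 0.8, 40.0),
        OrderedStatistic(frozenset({'PLATE'}), frozenset({'CUP'}), 0.6, 40.0),
    ])]

rows = flatten_rules(sample)
print(len(rows))
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`, which avoids pre-allocating a table and filling it row by row.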

Sort the rules by the 'Lift' attribute in descending order

In [16]:
association_results_table = \
    association_results_table.sort_values('Lift', ascending=False)

Display the rules as a table

In [17]:
print("Number of rows: %d" % association_results_table.shape[0])
print("Number of columns: %d" % association_results_table.shape[1])

Number of rows: 108
Number of columns: 6


In [18]:
association_results_table.head()

Unnamed: 0,GeneralRules,LeftPart,RightPart,Support,Confidence,Lift
71,REGENCY TEA PLATE GREEN | & | REGENCY TEA PLA...,REGENCY TEA PLATE GREEN,REGENCY TEA PLATE ROSES,0.013172,0.834197,44.623145
67,POPPY'S PLAYHOUSE BEDROOM | & | POPPY'S PLAYH...,POPPY'S PLAYHOUSE BEDROOM,POPPY'S PLAYHOUSE KITCHEN,0.012845,0.737089,40.952006
75,SET/6 RED SPOTTY PAPER PLATES | & | SET/6 RED ...,SET/6 RED SPOTTY PAPER CUPS,SET/6 RED SPOTTY PAPER PLATES,0.014317,0.817757,37.933374
76,SMALL DOLLY MIX DESIGN ORANGE BOWL | & | SMALL...,SMALL DOLLY MIX DESIGN ORANGE BOWL,SMALL MARSHMALLOWS PINK BOWL,0.014317,0.662879,36.171283
80,WOODEN HEART CHRISTMAS SCANDINAVIAN | & | WOOD...,WOODEN HEART CHRISTMAS SCANDINAVIAN,WOODEN STAR CHRISTMAS SCANDINAVIAN,0.015913,0.72037,34.194513


Print the top 10 rules as a list

In [19]:
count=1
for i, d in association_results_table.head(10).iterrows():
    print('Rule #'+str(count)+':')
    print(d['LeftPart'])
    print('=> '+d['RightPart'])
    print('Support: '+str(d['Support']))
    print('Confidence: '+str(d['Confidence']))
    print('Lift: '+str(d['Lift']))
    print('==========================================================')
    count=count+1

Rule #1:
REGENCY TEA PLATE GREEN 
=> REGENCY TEA PLATE ROSES 
Support: 0.01317188906160517
Confidence: 0.8341968911917097
Lift: 44.62314486230314
Rule #2:
POPPY'S PLAYHOUSE BEDROOM 
=> POPPY'S PLAYHOUSE KITCHEN
Support: 0.01284463715945349
Confidence: 0.7370892018779343
Lift: 40.95200597524541
Rule #3:
SET/6 RED SPOTTY PAPER CUPS
=> SET/6 RED SPOTTY PAPER PLATES
Support: 0.014317270719136055
Confidence: 0.8177570093457943
Lift: 37.93337353029846
Rule #4:
SMALL DOLLY MIX DESIGN ORANGE BOWL
=> SMALL MARSHMALLOWS PINK BOWL
Support: 0.014317270719136055
Confidence: 0.6628787878787878
Lift: 36.17128314393939
Rule #5:
WOODEN HEART CHRISTMAS SCANDINAVIAN
=> WOODEN STAR CHRISTMAS SCANDINAVIAN
Support: 0.0159126237421255
Confidence: 0.7203703703703703
Lift: 34.19451276519237
Rule #6:
SET OF 12 FAIRY CAKE BAKING CASES
=> SET OF 12 MINI LOAF BAKING CASES
Support: 0.012558291745070768
Confidence: 0.5511669658886894
Lift: 30.0085248287637
Rule #7:
JUMBO BAG VINTAGE CHRISTMAS 
=> JUMBO BAG 50'S CHRI
