https://campus.datacamp.com/courses/market-basket-analysis-in-python/

## Introduction to Market Basket Analysis

In this chapter, you’ll learn the basics of Market Basket Analysis: association rules, metrics, and pruning. You’ll then apply these concepts to help a small grocery store improve its promotional and product placement efforts.

In [5]:
# Import pandas under the alias pd
import pandas as pd

# Load transactions with pandas
groceries = pd.read_csv('bookstore_transactions.csv')

# Split transaction strings into lists
transactions = groceries['Transaction'].apply(lambda t: t.split(','))

# Convert DataFrame column into a list of lists
transactions = list(transactions)

# Print the list of transactions
print(transactions[0])

['History', 'Bookmark']


In [13]:
groceries = pd.read_csv('Groceries_dataset.csv')
groceries.head()

# Split transaction strings into lists
transactions = groceries['itemDescription'].apply(lambda t: t.split(','))

# Convert DataFrame column into a list of lists
transactions = list(transactions)

# Print the list of transactions
print(transactions[0])

['tropical fruit']


In [23]:
transactions.count(['jam'])

34

In [14]:
# Import permutations from the itertools module
from itertools import permutations

# Define the set of groceries
flattened = [i for t in transactions for i in t]
groceries = list(set(flattened))

# Generate all possible rules from groceries list
rules = list(permutations(groceries, 2))

# Print the set of rules
print(rules)

# Print the number of rules
print(len(rules))

[('vinegar', 'frozen chicken'), ('vinegar', 'liquor'), ('vinegar', 'frozen meals'), ('vinegar', 'white wine'), ('vinegar', 'sweet spreads'), ('vinegar', 'photo/film'), ('vinegar', 'mayonnaise'), ('vinegar', 'root vegetables'), ('vinegar', 'ready soups'), ('vinegar', 'bags'), ('vinegar', 'other vegetables'), ('vinegar', 'ham'), ('vinegar', 'shopping bags'), ('vinegar', 'abrasive cleaner'), ('vinegar', 'dessert'), ('vinegar', 'pasta'), ('vinegar', 'flour'), ('vinegar', 'seasonal products'), ('vinegar', 'herbs'), ('vinegar', 'canned beer'), ('vinegar', 'dog food'), ('vinegar', 'ice cream'), ('vinegar', 'beef'), ('vinegar', 'candy'), ('vinegar', 'whisky'), ('vinegar', 'nut snack'), ('vinegar', 'bottled beer'), ('vinegar', 'snack products'), ('vinegar', 'hygiene articles'), ('vinegar', 'brown bread'), ('vinegar', 'soups'), ('vinegar', 'preservation products'), ('vinegar', 'pot plants'), ('vinegar', 'turkey'), ('vinegar', 'specialty chocolate'), ('vinegar', 'pork'), ('vinegar', 'pastry'), ('
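The number of candidate rules grows quickly with the number of unique items: n distinct items yield n × (n − 1) ordered pairs. A minimal sketch on a made-up three-item list (the item names are illustrative, not taken from the dataset):

```python
from itertools import permutations

# Hypothetical item list, for illustration only
items = ['bread', 'jam', 'milk']

# Each ordered pair (antecedent, consequent) is one candidate rule
candidate_rules = list(permutations(items, 2))

# n items give n * (n - 1) ordered pairs
n = len(items)
print(len(candidate_rules), n * (n - 1))
```

Because the pairs are ordered, `('bread', 'jam')` and `('jam', 'bread')` count as two different rules.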

In [16]:
!pip install mlxtend

Collecting mlxtend
  Downloading mlxtend-0.20.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.20.0


In [17]:
# Import the transaction encoder function from mlxtend
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# Instantiate transaction encoder and identify unique items in transactions
encoder = TransactionEncoder().fit(transactions)

# One-hot encode transactions
onehot = encoder.transform(transactions)

# Convert one-hot encoded data to DataFrame
onehot = pd.DataFrame(onehot, columns = encoder.columns_)

# Print the one-hot encoded transaction dataset
print(onehot)

       Instant food products  UHT-milk  abrasive cleaner  artif. sweetener  \
0                      False     False             False             False   
1                      False     False             False             False   
2                      False     False             False             False   
3                      False     False             False             False   
4                      False     False             False             False   
...                      ...       ...               ...               ...   
38760                  False     False             False             False   
38761                  False     False             False             False   
38762                  False     False             False             False   
38763                  False     False             False             False   
38764                  False     False             False             False   

       baby cosmetics   bags  baking powder  bathroom cleaner  

In [25]:
onehot.nunique().head(50)

Instant food products    2
UHT-milk                 2
abrasive cleaner         2
artif. sweetener         2
baby cosmetics           2
bags                     2
baking powder            2
bathroom cleaner         2
beef                     2
berries                  2
beverages                2
bottled beer             2
bottled water            2
brandy                   2
brown bread              2
butter                   2
butter milk              2
cake bar                 2
candles                  2
candy                    2
canned beer              2
canned fish              2
canned fruit             2
canned vegetables        2
cat food                 2
cereals                  2
chewing gum              2
chicken                  2
chocolate                2
chocolate marshmallow    2
citrus fruit             2
cleaner                  2
cling film/bags          2
cocoa drinks             2
coffee                   2
condensed milk           2
cooking chocolate        2
c

In [18]:
# Compute the support
support = onehot.mean()

# Print the support
print(support)

Instant food products    0.001548
UHT-milk                 0.008332
abrasive cleaner         0.000568
artif. sweetener         0.000748
baby cosmetics           0.000077
                           ...   
white bread              0.009338
white wine               0.004540
whole milk               0.064543
yogurt                   0.034412
zwieback                 0.001548
Length: 167, dtype: float64


In [27]:
import numpy as np 

# Add a jam+brown bread column to the DataFrame onehot
onehot['jam+brown bread'] = np.logical_and(onehot['jam'], onehot['brown bread'])

# Compute the support
support = onehot.mean()

# Print the support values
print(support)

Instant food products    0.001548
UHT-milk                 0.008332
abrasive cleaner         0.000568
artif. sweetener         0.000748
baby cosmetics           0.000077
                           ...   
whole milk               0.064543
yogurt                   0.034412
zwieback                 0.001548
jam+butter               0.000000
jam+brown bread          0.000000
Length: 169, dtype: float64
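Both joint supports above are exactly zero, which is what one would expect if every row of the file holds a single item rather than a whole basket (the first transaction printed earlier is `['tropical fruit']`). One possible fix, sketched below, is to group rows into baskets before encoding. The `Member_number` and `Date` column names are assumptions about the file's layout; only `itemDescription` appears in the code above. The small DataFrame is made-up data standing in for the real file.

```python
import pandas as pd

# Stand-in for Groceries_dataset.csv; Member_number and Date are assumed column names
raw = pd.DataFrame({
    'Member_number': [1, 1, 2, 2, 2],
    'Date': ['01-01-2015', '01-01-2015', '02-01-2015', '02-01-2015', '03-01-2015'],
    'itemDescription': ['jam', 'brown bread', 'whole milk', 'jam', 'yogurt'],
})

# Treat all items bought by one member on one date as a single basket
baskets = (raw.groupby(['Member_number', 'Date'])['itemDescription']
              .apply(list)
              .tolist())

print(baskets)
```

The resulting list of lists can be passed to `TransactionEncoder` in place of the one-item transactions.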


## Association Rules

Association rules tell us that two or more items are related. Metrics allow us to quantify the usefulness of those relationships. In this chapter, you’ll apply six metrics to evaluate association rules: support, confidence, lift, conviction, leverage, and Zhang's metric. You’ll then use association rules and metrics to assist a library and an e-book seller.
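As a quick orientation, the sketch below computes four of these metrics by hand for a single rule on made-up baskets. The basket contents and the `support` helper are illustrative assumptions, not part of the course data.

```python
# Hypothetical baskets, for illustration only
baskets = [
    {'bread', 'jam'},
    {'bread', 'jam', 'milk'},
    {'bread'},
    {'milk'},
]

def support(*items):
    """Share of baskets containing every item in `items`."""
    return sum(set(items) <= b for b in baskets) / len(baskets)

# Rule: bread -> jam
s_a = support('bread')          # antecedent support
s_c = support('jam')            # consequent support
s_ac = support('bread', 'jam')  # joint support

confidence = s_ac / s_a         # share of bread baskets that also contain jam
lift = s_ac / (s_a * s_c)       # above 1: more co-occurrence than independence predicts
leverage = s_ac - s_a * s_c     # same comparison as lift, expressed as a difference

print(round(confidence, 3), round(lift, 3), round(leverage, 3))
```

Lift and leverage both compare the joint support with the product of the individual supports; lift does so as a ratio and leverage as a difference.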

### Confidence and Lift

In [29]:
books = pd.read_csv('goodbooks-10k/books.csv')

In [None]:
books = pd.read_csv('goodbooks-10k/books.csv')
books.head()

In [31]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         10000 non-null  int64  
 1   book_id                    10000 non-null  int64  
 2   best_book_id               10000 non-null  int64  
 3   work_id                    10000 non-null  int64  
 4   books_count                10000 non-null  int64  
 5   isbn                       9300 non-null   object 
 6   isbn13                     9415 non-null   float64
 7   authors                    10000 non-null  object 
 8   original_publication_year  9979 non-null   float64
 9   original_title             9415 non-null   object 
 10  title                      10000 non-null  object 
 11  language_code              8916 non-null   object 
 12  average_rating             10000 non-null  float64
 13  ratings_count              10000 non-null  int6

In [None]:
## dataset GoodBooks-10k

# Compute support for Hunger and Potter
supportHP = np.logical_and(books['Hunger'], books['Potter']).mean()

# Compute support for Hunger and Twilight
supportHT = np.logical_and(books['Hunger'], books['Twilight']).mean()

# Compute support for Potter and Twilight
supportPT = np.logical_and(books['Potter'], books['Twilight']).mean()

# Print support values
print("Hunger Games and Harry Potter: %.2f" % supportHP)
print("Hunger Games and Twilight: %.2f" % supportHT)
print("Harry Potter and Twilight: %.2f" % supportPT)

In [None]:
# Compute support for Potter and Twilight
supportPT = np.logical_and(books['Potter'], books['Twilight']).mean()

# Compute support for Potter
supportP = books['Potter'].mean()

# Compute support for Twilight
supportT = books['Twilight'].mean()

# Compute confidence for both rules
confidencePT = supportPT / supportP
confidenceTP = supportPT / supportT

# Print results
print('{0:.2f}, {1:.2f}'.format(confidencePT, confidenceTP))

In [None]:
"""
Compute the support of {Potter, Twilight}.
Compute the support of {Potter}.
Compute the support of {Twilight}.
Compute the lift of {Potter} -> {Twilight}.
"""

# Compute support for Potter and Twilight
supportPT = np.logical_and(books['Potter'], books['Twilight']).mean()

# Compute support for Potter
supportP = books['Potter'].mean()

# Compute support for Twilight
supportT = books['Twilight'].mean()

# Compute lift
lift = supportPT / (supportP * supportT)

# Print lift
print("Lift: %.2f" % lift)

# As it turns out, lift is greater than 1.0.
# This could give us some confidence that the association rule we recommended did not arise by random chance.

### Leverage and conviction

In [None]:
"""
Compute the support for {Potter} and assign it to supportP.
Compute the support for NOT {Hunger}.
Compute the support for {Potter} and NOT {Hunger}.
Complete the expression for the conviction metric in the return statement.
"""

In [None]:
# Compute support for Potter AND Hunger
supportPH = np.logical_and(books['Potter'], books['Hunger']).mean()

# Compute support for Potter
supportP = books['Potter'].mean()

# Compute support for NOT Hunger
supportnH = 1.0 - books['Hunger'].mean()

# Compute support for Potter and NOT Hunger
supportPnH = supportP - supportPH

# Compute and print conviction for Potter -> Hunger
conviction = supportP * supportnH / supportPnH
print("Conviction: %.2f" % conviction)

### Notice that the value of conviction was less than 1, suggesting that the rule "if Potter then Hunger" is not supported.

In [None]:
# Computing conviction with a function

def conviction(antecedent, consequent):
    # Compute support for antecedent AND consequent
    supportAC = np.logical_and(antecedent, consequent).mean()

    # Compute support for antecedent
    supportA = antecedent.mean()

    # Compute support for NOT consequent
    supportnC = 1.0 - consequent.mean()

    # Compute support for antecedent and NOT consequent
    supportAnC = supportA - supportAC

    # Return conviction
    return supportA * supportnC / supportAnC
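One edge case worth noting: when the antecedent never appears without the consequent, `supportAnC` is zero and the final division has a zero denominator. A hedged variant, with the hypothetical name `safe_conviction`, returns infinity explicitly in that case. The boolean lists are made-up data.

```python
import numpy as np

def safe_conviction(antecedent, consequent):
    """Conviction of antecedent -> consequent; hypothetical helper with a zero guard."""
    antecedent = np.asarray(antecedent, dtype=bool)
    consequent = np.asarray(consequent, dtype=bool)

    support_a = antecedent.mean()
    support_not_c = 1.0 - consequent.mean()
    # Share of transactions with the antecedent but without the consequent
    support_a_not_c = np.logical_and(antecedent, ~consequent).mean()

    if support_a_not_c == 0:
        # The rule is never violated in the data
        return float('inf')
    return float(support_a * support_not_c / support_a_not_c)

# Made-up purchase indicators for two titles
potter = [True, True, True, False, False]
hunger = [True, False, False, True, False]

print(safe_conviction(potter, hunger))   # rule violated in two transactions
print(safe_conviction(hunger, hunger))   # rule never violated
```

Returning infinity is one possible convention; raising an error or returning `nan` would be equally defensible depending on how the values are filtered later.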


In [None]:
# Promoting ebooks with conviction

# Compute conviction for twilight -> potter and potter -> twilight
convictionTP = conviction(twilight, potter)
convictionPT = conviction(potter, twilight)

# Compute conviction for twilight -> hunger and hunger -> twilight
convictionTH = conviction(twilight, hunger)
convictionHT = conviction(hunger, twilight)

# Compute conviction for potter -> hunger and hunger -> potter
convictionPH = conviction(potter, hunger)
convictionHP = conviction(hunger, potter)

# Print results
print('Harry Potter -> Twilight: ', convictionPT)
print('Twilight -> Harry Potter: ', convictionTP)

# Notice that the conviction values for "if Potter then Twilight" and "if Twilight then Potter" are both above 1,
# indicating that they are both viable rules.

### Association and dissociation

In [None]:
"""
Compute the support of {Twilight} and the support of {Potter}.
Compute the support of {Twilight, Potter}.
Complete the expression for the denominator.
Compute Zhang's metric for {Twilight} -> {Potter}.
"""

In [None]:
# Compute the support of Twilight and Harry Potter
supportT = books['Twilight'].mean()
supportP = books['Potter'].mean()

# Compute the support of both books
supportTP = np.logical_and(books['Twilight'], books['Potter']).mean()

# Complete the expressions for the numerator and denominator
numerator = supportTP - supportT*supportP
denominator = max(supportTP*(1-supportT), supportT*(supportP-supportTP))

# Compute and print Zhang's metric
zhang = numerator / denominator
print(zhang)

In [None]:
# Defining Zhang's metric

# Define a function to compute Zhang's metric
def zhang(antecedent, consequent):
	# Compute the support of each book
	supportA = antecedent.mean()
	supportC = consequent.mean()

	# Compute the support of both books
	supportAC = np.logical_and(antecedent, consequent).mean()

	# Complete the expressions for the numerator and denominator
	numerator = supportAC - supportA*supportC
	denominator = max(supportAC*(1-supportA), supportA*(supportC-supportAC))

	# Return Zhang's metric
	return numerator / denominator
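To get a feel for the scale of Zhang's metric, the sketch below restates the same formula and applies it to made-up indicator arrays: one pair of items that always co-occur and one pair that never do. The arrays are illustrative assumptions.

```python
import numpy as np

def zhang(antecedent, consequent):
    # Same formula as above, restated so this block runs on its own
    support_a = antecedent.mean()
    support_c = consequent.mean()
    support_ac = np.logical_and(antecedent, consequent).mean()
    numerator = support_ac - support_a * support_c
    denominator = max(support_ac * (1 - support_a),
                      support_a * (support_c - support_ac))
    return numerator / denominator

# Made-up indicators: a and b always co-occur, a and c never do
a = np.array([True, True, False, False])
b = np.array([True, True, False, False])
c = np.array([False, False, True, True])

print(zhang(a, b))  # perfect association
print(zhang(a, c))  # perfect dissociation
```

The two calls land at the extremes of the metric's range, +1 and -1; values near zero indicate little association either way.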

In [None]:
# Applying Zhang's metric
# for selecting website layout

# Define an empty list for Zhang's metric
zhangs_metric = []

# Loop over lists in itemsets
for itemset in itemsets:
    # Extract the antecedent and consequent columns
    antecedent = books[itemset[0]]
    consequent = books[itemset[1]]

    # Compute Zhang's metric and append it to the list
    zhangs_metric.append(zhang(antecedent, consequent))
    
# Print results
rules['zhang'] = zhangs_metric
print(rules)

# Notice that most of the items were dissociated, which suggests that they would have been a poor choice 
# to pair together for promotional purposes.


### Advanced rules

In [None]:
# Multi-metric filtering

#-------------------------------------
# Filtering with support and conviction

# Preview the rules DataFrame using the .head() method
print(rules.head())

# Select the subset of rules with antecedent support greater than 0.05
rules = rules[rules['antecedent support'] > 0.05]

# Select the subset of rules with a consequent support greater than 0.02
rules = rules[rules['consequent support'] > 0.02]

# Select the subset of rules with a conviction greater than 1.01
rules = rules[rules['conviction'] > 1.01]

# Print remaining rules
print(rules)
#--------------------------------------------

# Using multi-metric filtering to cross-promote books

# Set the lift threshold to 1.5
rules = rules[rules['lift'] > 1.5]

# Set the conviction threshold to 1.0
rules = rules[rules['conviction'] > 1.0]

# Set the threshold for Zhang's rule to 0.65
rules = rules[rules['zhang'] > 0.65]

# Print rule
print(rules[['antecedents','consequents']])

## Aggregation and Pruning

The fundamental problem of Market Basket Analysis is determining how to translate vast amounts of customer decisions into a small number of useful rules. This process typically starts with the application of the Apriori algorithm and involves the use of additional strategies, such as pruning and aggregation. In this chapter, you’ll learn how to use these methods and will ultimately apply them in exercises where you assist a retailer in selecting a physical store layout and performing product cross-promotions.

In [None]:
# Aggregation


"""
Select the subset of the DataFrame's columns that contain the string sign.
Print the support for signs.
"""
# Select the column headers for sign items
sign_headers = [i for i in onehot.columns if i.lower().find('sign')>=0]

# Select columns of sign items using sign_headers
sign_columns = onehot[sign_headers]

# Perform aggregation of sign items into sign category
signs = sign_columns.sum(axis = 1) >= 1.0

# Print support for signs
print('Share of Signs: %.2f' % signs.mean())
# If you look at the printed statement, you'll notice that support for signs is 0.10,
# which suggests that signs are an important category of items for the retailer.

In [None]:
# Defining an aggregation function

def aggregate(item):
    # Select the column headers in onehot that contain the item string
    item_headers = [i for i in onehot.columns if i.lower().find(item)>=0]

    # Select the matching item columns
    item_columns = onehot[item_headers]

    # Return category of aggregated items
    return item_columns.sum(axis = 1) >= 1.0

# Aggregate items for the bags, boxes, and candles categories  
bags = aggregate('bag')
boxes = aggregate('boxes')
candles = aggregate('candles')
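A caveat with substring matching is that a short search string can pick up unrelated columns. The sketch below contrasts the plain substring test with a word-boundary regular expression on hypothetical column names; neither the names nor the pattern come from the retailer's data.

```python
import re

# Hypothetical column names, for illustration only
columns = ['shopping bags', 'cling film/bags', 'cabbage', 'bag']

# Plain substring match, as in the aggregate function above
substring_hits = [c for c in columns if c.lower().find('bag') >= 0]

# Word-boundary match: 'bag' or 'bags' as a whole word
pattern = re.compile(r'\bbags?\b')
word_hits = [c for c in columns if pattern.search(c.lower())]

print(substring_hits)
print(word_hits)
```

Here the substring test also selects `cabbage`, while the word-boundary pattern does not. Whether this matters depends on the item names in the dataset at hand.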

In [None]:
# The Apriori algorithm

# Identifying frequent itemsets with Apriori
# Import apriori from mlxtend
from mlxtend.frequent_patterns import apriori

# Compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(onehot, 
                            min_support = 0.006, 
                            max_len = 3, 
                            use_colnames = True)

# Print a preview of the frequent itemsets
print(frequent_itemsets.head())

# Selecting a support threshold
# Import apriori from mlxtend
from mlxtend.frequent_patterns import apriori

# Compute frequent itemsets using a support of 0.003 and length of 3
frequent_itemsets_1 = apriori(onehot, min_support = 0.003, 
                            max_len = 3, use_colnames = True)

# Compute frequent itemsets using a support of 0.001 and length of 3
frequent_itemsets_2 = apriori(onehot, min_support = 0.001, 
                            max_len = 3, use_colnames = True)

# Print the number of frequent itemsets
print(len(frequent_itemsets_1), len(frequent_itemsets_2))
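The idea behind Apriori's pruning is that an itemset can only be frequent if all of its subsets are frequent, so candidates are built level by level from itemsets that already passed the threshold. The following is a simplified pure-Python sketch of that principle on made-up baskets. It is not mlxtend's implementation, and `frequent_itemsets_sketch` is a hypothetical name.

```python
from itertools import combinations

def frequent_itemsets_sketch(baskets, min_support, max_len):
    """Level-wise search: size-k candidates come only from frequent (k-1)-itemsets."""
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    items = sorted({i for b in baskets for i in b})
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {s: support(s) for s in current}

    for k in range(2, max_len + 1):
        # Join frequent (k-1)-itemsets into size-k candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Drop candidates with any infrequent (k-1)-subset, then check support
        current = [c for c in candidates
                   if all(frozenset(s) in result for s in combinations(c, k - 1))
                   and support(c) >= min_support]
        result.update({c: support(c) for c in current})
    return result

# Made-up baskets, for illustration only
baskets = [{'bread', 'jam'}, {'bread', 'jam', 'milk'}, {'bread', 'milk'}, {'milk'}]
found = frequent_itemsets_sketch(baskets, min_support=0.5, max_len=2)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Lowering `min_support` lets more itemsets survive each level, which is why the two thresholds compared above can produce very different counts.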

In [None]:
# Basic Apriori results pruning
#----------------------------------
# Generating association rules

# Import the association rule function from mlxtend
from mlxtend.frequent_patterns import association_rules

# Compute all association rules for frequent_itemsets_1
rules_1 = association_rules(frequent_itemsets_1,
                            metric = "support",
                            min_threshold = 0.0015)

# Compute all association rules for frequent_itemsets_2
rules_2 = association_rules(frequent_itemsets_2,
                            metric = "support",
                            min_threshold = 0.0015)

# Print the number of association rules generated
print(len(rules_1), len(rules_2))

# Excellent work! Notice that rules_1 generated no association rules; whereas rules_2, which used a lower support threshold for the Apriori algorithm, generated two.

In [None]:
# Import the association rules function
from mlxtend.frequent_patterns import association_rules

# Compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(onehot, min_support = 0.001, 
                            max_len = 2, use_colnames = True)

# Compute all association rules for frequent_itemsets
rules = association_rules(frequent_itemsets,
                          metric = "lift",
                          min_threshold = 1.0)

# Print association rules
print(rules)


# It looks like you've ended up with two association rules once again, both with lift values greater than 1.0.

In [None]:
# Import the association rules function
from mlxtend.frequent_patterns import association_rules

# Compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(onehot, min_support = 0.0015, 
                            max_len = 2, use_colnames = True)

# Compute all association rules using confidence
rules = association_rules(frequent_itemsets,
                          metric = "confidence",
                          min_threshold = 0.5)

# Print association rules
print(rules)
# Notice that we have narrowed things down to just a single rule.

In [None]:
# Advanced Apriori results pruning
# Applications - cross-promotion (to sell a targeted consequent) & aggregation (to select a layout for a new store)

# Apply the apriori algorithm with a minimum support of 0.0001
frequent_itemsets = apriori(aggregated, min_support = 0.0001, use_colnames = True)

# Generate the initial set of rules using a minimum support of 0.0001
rules = association_rules(frequent_itemsets, 
                          metric = "support", min_threshold = 0.0001)

# Set minimum antecedent support to 0.35
rules = rules[rules['antecedent support'] > 0.35]

# Set maximum consequent support to 0.35
rules = rules[rules['consequent support'] < 0.35]

# Print the remaining rules
print(rules)

# If you look at the list of remaining association rules, you'll find both bag -> box and sign -> candles. 
# We can tell the store manager that the original proposal is also acceptable under this new criterion.


In [None]:
# Applying Zhang's rule

# Generate the initial set of rules using a minimum lift of 1.00
rules = association_rules(frequent_itemsets, metric = "lift", min_threshold = 1.00)

# Set antecedent support to 0.005
rules = rules[rules['antecedent support'] > 0.005]

# Set consequent support to 0.005
rules = rules[rules['consequent support'] > 0.005]

# Compute Zhang's rule
rules['zhang'] = zhangs_rule(rules)

# Set the lower bound for Zhang's rule to 0.98
rules = rules[rules['zhang'] > 0.98]
print(rules[['antecedents', 'consequents']])
# Notice that 10 items had a Zhang's metric value of over 0.98, which suggests that the items are nearly perfectly associated 
# in the data. In general, when we see such strong associations, we'll want to think carefully about what explains them. 
# We might, for instance, investigate whether the items can be purchased separately or whether they are bundled
# in a way that prevents this.
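`zhangs_rule` is not defined in this notebook and is presumably a helper supplied by the exercise environment. One possible definition, assuming `rules` has the `support`, `antecedent support`, and `consequent support` columns used elsewhere in this chapter, applies the same formula as the `zhang` function column-wise. The two-row DataFrame is made-up data.

```python
import numpy as np
import pandas as pd

def zhangs_rule(rules):
    """Zhang's metric for each row of a rules DataFrame (hypothetical helper)."""
    support_a = rules['antecedent support']
    support_c = rules['consequent support']
    support_ac = rules['support']

    numerator = support_ac - support_a * support_c
    # Row-wise maximum of the two candidate denominators
    denominator = np.maximum(support_ac * (1 - support_a),
                             support_a * (support_c - support_ac))
    return numerator / denominator

# Made-up rules: one perfectly associated pair, one perfectly dissociated pair
rules = pd.DataFrame({
    'antecedent support': [0.5, 0.5],
    'consequent support': [0.5, 0.5],
    'support': [0.5, 0.0],
})
print(zhangs_rule(rules))
```

Working on whole columns avoids looping over rules and returns a Series that can be assigned directly to `rules['zhang']`.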

In [None]:
# Advanced filtering with multiple metrics

# Note that the data has been loaded, preprocessed, and one-hot encoded, and is available as onehot
# Apply the Apriori algorithm with a minimum support threshold of 0.001
frequent_itemsets = apriori(onehot, min_support = 0.001, use_colnames = True)

# Recover association rules using a minimum support threshold of 0.001
rules = association_rules(frequent_itemsets, metric = 'support', min_threshold = 0.001)

# Apply a 0.002 antecedent support threshold, a 0.01 consequent support threshold,
# a 0.60 confidence threshold, and a 2.50 lift threshold
filtered_rules = rules[(rules['antecedent support'] > 0.002) &
                       (rules['consequent support'] > 0.01) &
                       (rules['confidence'] > 0.6) &
                       (rules['lift'] > 2.50)]

# Print remaining rule
print(filtered_rules[['antecedents','consequents']])
# Your use of the Apriori algorithm and advanced filtering revealed a potentially useful rule: 
# when people purchase a birthday card, they often pair it with a red jumbo bag, presumably as a gift.

## Visualizing Rules

In this final chapter, you’ll learn how visualizations are used to guide the pruning process and summarize final results, which will typically take the form of itemsets or rules. You’ll master the three most useful visualizations (heatmaps, scatterplots, and parallel coordinates plots) and will apply them to assist a movie streaming service.

### Heatmaps

In [None]:
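# Visualizing itemset support

# Import seaborn and matplotlib under their standard aliases
import seaborn as sns
import matplotlib.pyplot as plt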

# Compute frequent itemsets using a minimum support of 0.07
frequent_itemsets = apriori(onehot, min_support = 0.07, 
                            use_colnames = True, max_len = 2)

# Compute the association rules
rules = association_rules(frequent_itemsets, metric = 'support', 
                          min_threshold = 0.0)

# Replace frozen sets with strings
rules['antecedents'] = rules['antecedents'].apply(lambda a: ','.join(list(a)))
rules['consequents'] = rules['consequents'].apply(lambda a: ','.join(list(a)))

# Transform data to matrix format and generate heatmap
pivot = rules.pivot(index='consequents', columns='antecedents', values='support')
sns.heatmap(pivot)

# Format and display plot
plt.yticks(rotation=0)
plt.show()

In [None]:
# Heatmaps with lift

# Import seaborn under its standard alias
import seaborn as sns

# Transform the DataFrame of rules into a matrix using the lift metric
pivot = rules.pivot(index = 'consequents', 
                   columns = 'antecedents', values= 'lift')

# Generate a heatmap with annotations on and the colorbar off
sns.heatmap(pivot, annot = True, cbar = False)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()

### Scatterplots

In [None]:
# Pruning with scatterplots

# Import seaborn under its standard alias
import seaborn as sns

# Apply the Apriori algorithm with a support value of 0.0075
frequent_itemsets = apriori(onehot, min_support = 0.0075, 
                            use_colnames = True, max_len = 2)

# Generate association rules without performing additional pruning
rules = association_rules(frequent_itemsets, metric = 'support', 
                          min_threshold = 0.0)

# Generate scatterplot using support and confidence
sns.scatterplot(x = "support", y = "confidence", data = rules)
plt.show()

# Notice that the confidence-support border roughly forms a triangle. 
# This suggests that throwing out some low support rules would also mean that we would discard rules
# that are strong according to many common metrics.

In [None]:
# Optimality of the support-confidence border

# Import seaborn under its standard alias
import seaborn as sns

# Apply the Apriori algorithm with a support value of 0.0075
frequent_itemsets = apriori(onehot, min_support = 0.0075, 
                         use_colnames = True, max_len = 2)

# Generate association rules without performing additional pruning
rules = association_rules(frequent_itemsets, metric = "support", 
                          min_threshold = 0.0)

# Generate scatterplot using support and confidence
sns.scatterplot(x = "support", y = "confidence", 
                size = "lift", data = rules)
plt.show()
# If you look at the plot carefully, you'll notice that the highest values of lift 
# are always at the support-confidence border for any given value of support or confidence.

### Parallel coordinates plot

In [None]:
# Using parallel coordinates to visualize rules

"""
Your visual demonstration in the previous exercise convinced the founder that the support-confidence border is worthy of 
further exploration. She now suggests that you extract part of the border and visualize it. 
Since the rules that fall on the border are strong with respect to most common metrics, she argues that
you should simply visualize whether a rule exists, rather than the intensity of the rule according to some metric.
"""
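# Import the parallel coordinates plotting function
from pandas.plotting import parallel_coordinates

# Compute the frequent itemsets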
frequent_itemsets = apriori(onehot, min_support = 0.05, 
                         use_colnames = True, max_len = 2)

# Compute rules from the frequent itemsets with the confidence metric
rules = association_rules(frequent_itemsets, metric = 'confidence', 
                          min_threshold = 0.50)

# Convert rules into coordinates suitable for use in a parallel coordinates plot
coords = rules_to_coordinates(rules)

# Generate parallel coordinates plot
parallel_coordinates(coords, 'rule')
plt.legend([])
plt.show()
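`rules_to_coordinates` is not defined in this notebook and is presumably provided by the exercise environment. A plausible sketch, assuming every rule has a single-item antecedent and consequent (as `max_len = 2` implies), extracts the item names and uses the row index as the rule label. The two-row DataFrame and the movie titles are made-up data.

```python
import pandas as pd

def rules_to_coordinates(rules):
    """Flatten single-item rules into antecedent/consequent/rule columns (hypothetical helper)."""
    coords = pd.DataFrame({
        # Each frozenset holds exactly one item when max_len = 2
        'antecedent': rules['antecedents'].apply(lambda a: next(iter(a))),
        'consequent': rules['consequents'].apply(lambda c: next(iter(c))),
    })
    # One label per rule, used as the class column by parallel_coordinates
    coords['rule'] = rules.index
    return coords

# Made-up rules with frozenset-valued antecedents and consequents
rules = pd.DataFrame({
    'antecedents': [frozenset(['Movie A']), frozenset(['Movie B'])],
    'consequents': [frozenset(['Movie B']), frozenset(['Movie A'])],
})
coords = rules_to_coordinates(rules)
print(coords)
```

Each row of `coords` then corresponds to one line in the plot, running from the antecedent axis to the consequent axis.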

In [None]:
# Refining a parallel coordinates plot

# Import the parallel coordinates plotting function
from pandas.plotting import parallel_coordinates

# Convert rules into coordinates suitable for use in a parallel coordinates plot
coords = rules_to_coordinates(rules)

# Generate parallel coordinates plot
parallel_coordinates(coords, 'rule', colormap = 'ocean')
plt.legend([])
plt.show()