# 1. Introduction

The data set is a record of transactions, perhaps from a supermarket or grocery store.

Each row represents a purchase, which is a single transaction involving one or more items.

In [None]:
import pandas as pd
import matplotlib
import seaborn as sns

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10, 6)
sns.set(style='whitegrid', palette='autumn')

In [None]:
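# The `prefix` argument was removed from read_csv in pandas 2.0,
# so the columns are named with add_prefix instead
data = pd.read_csv('../input/association-rule-learningapriori/Market_Basket_Optimisation.csv',
                   header=None).add_prefix('item_')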
data.head()

In [None]:
data.info()

In [None]:
# Some items, e.g. asparagus, appear in several different forms due to stray whitespace
sorted(data.melt()['value'].dropna().unique())[:5]

In [None]:
# remove trailing and leading whitespace e.g. ' asparagus' to 'asparagus'
for col in data.columns:
    data[col] = data[col].str.strip()
    
sorted(data.melt()['value'].dropna().unique())[:5]

# 2. Analysing Transactions

## 2.1 Summary

In [None]:
print(f'There were a total of {data.shape[0]:,} transactions, each involving between'
      + f' {data.notna().sum(axis=1).min()} and {data.notna().sum(axis=1).max()} items.')

## 2.2 Product types

In [None]:
all_items = data.melt()['value'].dropna()
print(f'There were {all_items.nunique()} different products:\n')
print(sorted(all_items.unique()))

## 2.3 Most purchased product

Assuming that only one unit of each item was bought in each transaction, **mineral water** is the best-selling product.

In [None]:
ax = all_items.value_counts().head(15).plot(kind='bar')
ax.set_title('Most Frequently Purchased Items', size=20)
ax.set_ylabel('Count')

## 2.4 Least purchased product

Assuming that only one unit of each item was bought in each transaction, **water spray** is the poorest-performing item.

In [None]:
ax = all_items.value_counts().nsmallest(15).plot(kind='bar')
ax.set_title('Least Sold Items', size=20)
ax.set_ylabel('Count')

## 2.5 Basket sizes
A "basket" in this context refers to the set of unique items purchased in a single transaction.

A large share of the transactions involved just a single item. In general, the number of transactions decreased as basket size increased.
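
As a small illustration of how basket sizes fall out of this row layout, the sketch below uses a made-up three-row frame (the `toy` data and its items are assumptions, not taken from the data set). Each row is padded with missing values, so counting the non-missing cells per row gives the basket size.

```python
import pandas as pd

# Hypothetical toy data: each row is one transaction, padded with None
toy = pd.DataFrame(
    [['bread', 'cheese', 'lettuce'],
     ['cookies', None, None],
     ['bread', 'milk', None]],
    columns=['item_0', 'item_1', 'item_2'],
)

# Count the non-missing cells in each row to get the basket size
toy_sizes = toy.notna().sum(axis=1)
print(toy_sizes.tolist())  # [3, 1, 2]

# Number of transactions for each basket size, ordered by size
size_counts = toy_sizes.value_counts().sort_index()
print(size_counts.to_dict())  # {1: 1, 2: 1, 3: 1}
```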

In [None]:
basket_sizes = data.notna().sum(axis=1)

# Sort by basket size so the x-axis runs from the smallest to the largest basket
ax = basket_sizes.value_counts().sort_index().plot.bar()
ax.set_title('Distribution of Basket Sizes', size=20)
ax.set_ylabel('Count')
ax.set_xlabel('Number of items in a single transaction')

## 2.6 What's in the largest transactions?

*Mineral water, olive oil, salmon, vegetables mix, frozen smoothie* and *honey* each appear in three of the four largest transactions.

In [None]:
basket_sizes.nlargest(10)

In [None]:
largest_transactions = data[basket_sizes > 15]
items_in_largest_transactions = largest_transactions.melt()['value'].dropna()

ax = items_in_largest_transactions.value_counts().head(15).plot.bar()
ax.set_title('Appearances in the 8 Largest Transactions', size=20)
ax.set_ylabel('Number of transactions (out of 8)')

In [None]:
pie_data = items_in_largest_transactions.value_counts()
ax = pie_data.plot.pie(cmap='autumn', explode=[0.2] * len(pie_data), figsize=(12, 12),
                       autopct=lambda pct: f' {pct * 0.01 * pie_data.sum():.0f} of 8'
)
ax.set_title('Appearances in the 8 Largest Transactions', size=20, pad=25)
ax.set_ylabel('')
ax.figure.tight_layout()

## 2.7 What's in the smallest transactions?

*Cookies* were the item most often bought alone.

In [None]:
single_items = data[basket_sizes == 1]['item_0'].value_counts()
ax = single_items.head(15).plot.bar()
ax.set_title('Items Commonly Bought Alone', size=20)
ax.set_ylabel('Number of times bought alone')

# 3. Association Rule Mining

## 3.1 Basic Concepts

### 3.1.1 Association Rule

A statement of the form:
$$ {\{ \text{antecedent(s)} \}} \Rightarrow {\{ \text{consequent} \}} $$

e.g. $ {\{ \text{bread}, \text{cheese} \}} \Rightarrow {\{ \text{lettuce} \}} $, implying that customers who purchase *bread* together with *cheese* are likely to also buy *lettuce*.

### 3.1.2 Support

Support measures how frequently a set of items appears in the transaction data. A rule with a support of 0.1 indicates that the items in question appear together in 10% of the transactions.
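
To make the definition concrete, here is a minimal sketch that computes support by hand. The `toy_baskets` list and the `support` helper are hypothetical names introduced for illustration; they are not part of the data set or of any library.

```python
# Hypothetical toy baskets, not taken from the data set
toy_baskets = [
    {'bread', 'cheese', 'lettuce'},
    {'bread', 'cheese'},
    {'bread', 'milk'},
    {'cookies', 'milk'},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    # `itemset <= basket` is True when itemset is a subset of the basket
    return sum(itemset <= basket for basket in baskets) / len(baskets)

print(support({'bread'}, toy_baskets))            # 0.75
print(support({'bread', 'cheese'}, toy_baskets))  # 0.5
```

Here *bread* appears in three of the four baskets, while *bread* and *cheese* appear together in two of them.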

### 3.1.3 Confidence

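Confidence measures how often an association rule has proved to be true: of the transactions that contain the antecedent(s), it is the proportion that also contain the consequent.

[Learn more here][1].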
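
One common way to compute confidence is to divide the number of transactions containing both sides of the rule by the number containing the antecedent(s). The sketch below does this by hand; `example_baskets`, `count_containing` and `confidence` are hypothetical names used only for illustration.

```python
# Hypothetical example baskets, not taken from the data set
example_baskets = [
    {'bread', 'cheese', 'lettuce'},
    {'bread', 'cheese'},
    {'bread', 'milk'},
    {'cookies', 'milk'},
]

def count_containing(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing `antecedent` that also contain `consequent`.

    Assumes the antecedent occurs in at least one basket.
    """
    both = count_containing(set(antecedent) | set(consequent), baskets)
    return both / count_containing(antecedent, baskets)

print(confidence({'bread', 'cheese'}, {'lettuce'}, example_baskets))  # 0.5
print(confidence({'cheese'}, {'bread'}, example_baskets))             # 1.0
```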

In this example, we'll be using the [apyori][2] Python package to implement the [*Apriori algorithm*][3].

[1]: https://en.wikipedia.org/wiki/Association_rule_learning
[2]: https://pypi.org/project/apyori/
[3]: https://en.wikipedia.org/wiki/Apriori_algorithm

In [None]:
!pip install apyori

## 3.2 Preprocessing

The `apyori.apriori` function expects its input as a list of 'baskets', i.e. collections of items.

Since a rule relates at least two items, only baskets containing more than one item are kept.


In [None]:
baskets = [set(row.dropna()) for _, row in data.iterrows() if row.dropna().size > 1]
baskets[:5]

## 3.3 Association rules

In [None]:
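import apyori

# apyori.apriori returns a generator of RelationRecord objects
association_rules = apyori.apriori(baskets, min_support=0.03, min_confidence=0.3)

for record in association_rules:
    # A record can hold several rules; items_base is the antecedent
    # and items_add the consequent of each one
    for stat in record.ordered_statistics:
        if not stat.items_base:
            continue  # skip rules with an empty antecedent
        print(f"{sorted(stat.items_base)}  --> {sorted(stat.items_add)}"
              + f"     Support: {record.support:g}  Confidence: {stat.confidence:g}")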
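
Because a single result record can carry more than one rule in its `ordered_statistics`, it can be handy to flatten the records into one row per rule and sort them by confidence. The sketch below is one possible way to do that. The `flatten_rules` helper and `demo_records` are hypothetical, and the two namedtuples are stand-ins that mirror the field names of apyori's records so that the example runs without the package; the numbers are made up and are not results from the data set.

```python
from collections import namedtuple

# Stand-ins mirroring the field names of apyori's result records
# (lift is part of the record but is not used in this sketch)
OrderedStatistic = namedtuple('OrderedStatistic', 'items_base items_add confidence lift')
RelationRecord = namedtuple('RelationRecord', 'items support ordered_statistics')

def flatten_rules(records):
    """Return (antecedent, consequent, support, confidence) tuples,
    sorted by descending confidence."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            if not stat.items_base:
                continue  # skip rules with an empty antecedent
            rows.append((sorted(stat.items_base), sorted(stat.items_add),
                         record.support, stat.confidence))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Made-up records for illustration only
demo_records = [
    RelationRecord(
        items=frozenset({'bread', 'cheese'}),
        support=0.5,
        ordered_statistics=[
            OrderedStatistic(frozenset({'bread'}), frozenset({'cheese'}), 2 / 3, 4 / 3),
            OrderedStatistic(frozenset({'cheese'}), frozenset({'bread'}), 1.0, 4 / 3),
        ],
    ),
]

for antecedent, consequent, supp, conf in flatten_rules(demo_records):
    print(f'{antecedent} --> {consequent}  Support: {supp:g}  Confidence: {conf:g}')
```

With real output, the same helper could be called as `flatten_rules(apyori.apriori(baskets, min_support=0.03, min_confidence=0.3))`.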