# Lab 9: Association Analysis using APRIORI Algorithm


## Association Analysis

**Association analysis** is a data mining technique used to discover interesting relationships, patterns, or correlations among a set of items in large datasets. It is commonly applied in market basket analysis to identify sets of products frequently purchased together.

## Support

**Support** is a measure that indicates how frequently a particular item or item set appears in a dataset. It is defined as the proportion of transactions that contain the item or item set. In the context of association rules, support helps to determine the significance of an item set.
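The definition above can be sketched in a few lines of Python. The `support` helper and the toy `baskets` list are illustrative assumptions, not part of the lab dataset.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Hypothetical toy baskets, not the lab dataset.
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Butter"],
    ["Milk", "Butter", "Bread"],
    ["Milk"],
]

print(support({"Bread"}, baskets))          # 0.75 (3 of 4 baskets)
print(support({"Bread", "Milk"}, baskets))  # 0.5  (2 of 4 baskets)
```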

## Confidence

**Confidence** is a measure of the reliability of an association rule. It is defined as the proportion of transactions containing the antecedent (the "if" part of the rule) that also contain the consequent (the "then" part of the rule). For a rule **X ⇒ Y**, this is the support of the combined item set **X ∪ Y** divided by the support of **X**. A high confidence means the consequent appears in most transactions that contain the antecedent.
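As a minimal sketch, confidence can be computed as the ratio of two supports. The helper names and the toy `baskets` below are assumptions made for illustration. Note that confidence is not symmetric: swapping antecedent and consequent generally changes the value.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical toy baskets, not the lab dataset.
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Butter"],
    ["Milk", "Butter", "Bread"],
    ["Milk"],
]

# Bread appears in 3 baskets, Butter in 2, and both together in 2.
print(round(confidence({"Bread"}, {"Butter"}, baskets), 3))  # 0.667
print(confidence({"Butter"}, {"Bread"}, baskets))            # 1.0
```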

## The Apriori Principle

The **Apriori principle** states that if an item set is frequent (meets the minimum support threshold), then all of its subsets must also be frequent. Equivalently, if an item set is infrequent, every superset of it is infrequent as well. This second form is what makes pruning possible: any candidate that contains an infrequent subset can be discarded without counting it, which reduces the number of candidate item sets that need to be examined.
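The pruning test implied by this principle might look like the sketch below. The function name `has_infrequent_subset` and the example item sets are hypothetical.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_smaller):
    """True if some subset one item smaller than `candidate` is not frequent."""
    k = len(candidate) - 1
    return any(frozenset(s) not in frequent_smaller
               for s in combinations(candidate, k))

# Hypothetical frequent 2-item sets: {Milk, Butter} is deliberately missing.
frequent_2 = {frozenset({"Bread", "Milk"}), frozenset({"Bread", "Butter"})}

candidate = frozenset({"Bread", "Milk", "Butter"})
# The candidate can be discarded without scanning the database,
# because its subset {Milk, Butter} is not frequent.
print(has_infrequent_subset(candidate, frequent_2))  # True
```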

## Candidate Set

A **candidate set** is a collection of item sets that are being considered for frequent item set generation. The candidate set is generated based on previous frequent item sets, and it is evaluated against the minimum support threshold to determine which item sets are frequent.
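One simple way to build a candidate set from the previous level is to take pairwise unions of frequent item sets and keep those that are exactly one item larger. This is a sketch of that idea, not necessarily the join used by any particular library; `join_step` and the single-letter items are illustrative.

```python
from itertools import combinations

def join_step(frequent_k):
    """Candidates of size k+1 formed by uniting pairs of frequent k-item sets."""
    frequent_k = list(frequent_k)
    if not frequent_k:
        return set()
    k = len(frequent_k[0])
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        union = a | b
        if len(union) == k + 1:  # the two sets differ in exactly one item
            candidates.add(union)
    return candidates

# Hypothetical frequent 2-item sets.
frequent_2 = [frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")]
for c in sorted(join_step(frequent_2), key=sorted):
    print(sorted(c))
```

Here the unions {A, B, C}, {A, B, D} and {B, C, D} become candidates, while {A, C} joined with {B, D} has four items and is skipped. Each candidate still has to be checked against the minimum support threshold.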

## Min Support Threshold

The **minimum support threshold** is a user-defined parameter that specifies the minimum frequency an item set must have to be considered frequent. It helps in filtering out infrequent item sets and reduces the computational burden of the association analysis process.

## Frequent Item Set

A **frequent item set** is a set of items that appears in a dataset with a frequency equal to or greater than the minimum support threshold. Frequent item sets are the basis for generating association rules, which reveal relationships between items in the dataset.
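For single items, applying the threshold amounts to counting and filtering. The sketch below assumes a hypothetical helper `frequent_items` and toy data.

```python
from collections import Counter

def frequent_items(transactions, min_support):
    """Return {item set: support} for single items meeting `min_support`."""
    n = len(transactions)
    # set(t) ensures an item is counted once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    return {frozenset([item]): count / n
            for item, count in counts.items()
            if count / n >= min_support}

# Hypothetical toy baskets, not the lab dataset.
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Butter"],
    ["Milk", "Butter", "Bread"],
    ["Milk"],
]

# Bread and Milk each have support 0.75; Butter has 0.5 and is filtered out.
print(frequent_items(baskets, min_support=0.6))
```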


## Apriori Algorithm

Notation:

- $C_{k}$ = candidate itemsets of size $k$
- $L_{k}$ = frequent itemsets of size $k$
- $C_{k+1}$ = candidates generated from $L_{k}$
- $L_{k+1}$ = candidates in $C_{k+1}$ satisfying min_support

1. Read the transaction database and count the support of each individual item; keep the items that meet the minimum support to form the frequent itemsets at level 1 ($L_{1}$).
2. Join $L_{k}$ with itself to generate the candidate itemsets of size $k+1$ ($C_{k+1}$) for the next level.
3. Generate the frequent itemsets of size $k+1$ ($L_{k+1}$) at the next level using minimum support. In this step:
   - Scan the original database to count the support of each candidate in $C_{k+1}$
   - Prune candidates whose support is below min_support
4. Repeat steps 2 and 3 until no new frequent itemsets can be generated.
5. Generate rules from the frequent itemsets of size 2 and larger, keeping those that meet the minimum confidence.
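Steps 1 to 4 can be sketched as a short level-wise loop. This is a minimal illustration under simplifying assumptions (pairwise-union join, a full database scan per candidate), not the implementation used by `apyori`; the function name and toy data are hypothetical.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every frequent itemset (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: frequent 1-itemsets (L1).
    level = {}
    for item in {item for t in transactions for item in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            level[frozenset([item])] = s

    frequent = dict(level)
    k = 1
    while level:
        # Step 2: join L_k with itself to form C_{k+1}, dropping any
        # candidate that has an infrequent k-item subset (Apriori principle).
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Step 3: scan the database and keep candidates meeting min_support.
        level = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        k += 1  # Step 4: repeat until no new frequent itemsets appear.
    return frequent

# Hypothetical toy baskets, not the lab dataset.
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Butter"],
    ["Milk", "Butter", "Bread"],
    ["Milk"],
]
for itemset, s in sorted(apriori_frequent_itemsets(baskets, 0.5).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On this toy data the three-item candidate is pruned before any counting, because its subset {Butter, Milk} has support 0.25 and is not frequent.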


## Implementing the Apriori Algorithm for Market Basket Analysis


In [1]:
import pandas as pd
from apyori import apriori

In [2]:
data_frame = pd.read_csv('dataset/market_basket.csv', header=None)
data_frame.head()

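|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | Wine | Chips | Bread | Butter | Milk | Apple |
| 1 | Wine | NaN | Bread | Butter | Milk | NaN |
| 2 | NaN | NaN | Bread | Butter | Milk | NaN |
| 3 | NaN | Chips | NaN | Butter | NaN | Apple |
| 4 | Wine | Chips | Bread | Butter | Milk | Apple |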


In [3]:
data_frame.shape

(22, 6)

## Convert Pandas dataframe into nested lists


In [4]:
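# Convert each row of the dataframe into a list of item strings.
# Note: empty cells are NaN, and str() turns them into the literal item 'nan'.
n_rows, n_cols = data_frame.shape  # (22, 6) for this dataset
lsts = []
for i in range(n_rows):
    lsts.append([str(data_frame.values[i, j]) for j in range(n_cols)])

print(lsts)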

[['Wine', 'Chips', 'Bread', 'Butter', 'Milk', 'Apple'], ['Wine', 'nan', 'Bread', 'Butter', 'Milk', 'nan'], ['nan', 'nan', 'Bread', 'Butter', 'Milk', 'nan'], ['nan', 'Chips', 'nan', 'Butter', 'nan', 'Apple'], ['Wine', 'Chips', 'Bread', 'Butter', 'Milk', 'Apple'], ['Wine', 'Chips', 'nan', 'nan', 'Milk', 'nan'], ['Wine', 'Chips', 'Bread', 'Butter', 'nan', 'Apple'], ['Wine', 'Chips', 'nan', 'nan', 'Milk', 'nan'], ['Wine', 'nan', 'Bread', 'nan', 'nan', 'Apple'], ['nan', 'nan', 'Bread', 'Butter', 'Milk', 'nan'], ['Wine', 'Chips', 'Bread', 'Butter', 'nan', 'Apple'], ['Wine', 'nan', 'nan', 'Butter', 'Milk', 'Apple'], ['Wine', 'Chips', 'Bread', 'Butter', 'Milk', 'Apple'], ['Wine', 'nan', 'Bread', 'nan', 'Milk', 'nan'], ['Wine', 'nan', 'Bread', 'Butter', 'Milk', 'Apple'], ['Wine', 'Chips', 'Bread', 'Butter', 'Milk', 'Apple'], ['nan', 'Chips', 'Bread', 'Butter', 'Milk', 'Apple'], ['nan', 'Chips', 'nan', 'Butter', 'Milk', 'Apple'], ['Wine', 'Chips', 'Bread', 'Butter', 'Milk', 'Apple'], ['Wine', 'n
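The output shows that empty cells end up as the string `'nan'`, which the miner would treat as an ordinary item. One possible variant, shown here as a sketch rather than as the lab's method, filters those placeholders out. The name `drop_missing` is hypothetical, and the sample rows are the first four transactions printed above.

```python
def drop_missing(transactions, placeholder="nan"):
    """Remove the placeholder string that str() produces for empty cells."""
    return [[item for item in row if item != placeholder] for row in transactions]

# First four rows of the nested list printed above.
sample = [
    ['Wine', 'Chips', 'Bread', 'Butter', 'Milk', 'Apple'],
    ['Wine', 'nan', 'Bread', 'Butter', 'Milk', 'nan'],
    ['nan', 'nan', 'Bread', 'Butter', 'Milk', 'nan'],
    ['nan', 'Chips', 'nan', 'Butter', 'nan', 'Apple'],
]

for row in drop_missing(sample):
    print(row)
```

Removing the placeholder does not change the support of any item set made of real products, since those counts never depended on `'nan'`; it only prevents `'nan'` itself from appearing in frequent item sets.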

## Goal of Association Rule Mining and Its Application to Business

When you apply association rule mining to a given set of transactions T, the goal is to find all rules with:

1. Support greater than or equal to min_support
2. Confidence greater than or equal to min_confidence

Apriori is one of the algorithms for association rule mining, and it is the one implemented here.


## Support, Confidence, Strong Rules, and Lift in Association Analysis

### Support

Support defines the popularity of an item within the dataset. It is calculated as the proportion of transactions that contain the item or item set.

### Confidence

Confidence indicates how often item **Y** appears in transactions that contain item **X**, that is, the number of transactions containing both **X** and **Y** divided by the number containing **X**. It helps assess the strength of the association rule.

### Strong Rules

A rule **A ⇒ B** is considered a strong rule if it satisfies both the minimum support (min_support) and minimum confidence (min_confidence) thresholds. Strong rules indicate a reliable relationship between item sets.

### Lift

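Lift measures the correlation between **A** and **B** in the rule **A ⇒ B**. It shows how much the presence of item set **A** changes the likelihood of item set **B**, compared with what would be expected if the two were independent. It is calculated as:

$$
\text{Lift}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A) \times \text{Support}(B)}
$$

Here $A \cup B$ denotes the combined item set, so $\text{Support}(A \cup B)$ is the fraction of transactions that contain every item of both **A** and **B**.

If the lift is greater than **1**, then **A** and **B** appear together more often than they would if they were independent (positive correlation). A lift of **1** indicates independence, and a lift below **1** indicates that they appear together less often than expected. The further the lift is from 1, the stronger the dependence.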
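A small sketch of the lift calculation and its reading follows. The function names and the support values are made up for illustration; a tolerance is used when comparing with 1 because the inputs are floating-point numbers.

```python
def lift(support_ab, support_a, support_b):
    """Support of A and B together, divided by the product of their supports."""
    return support_ab / (support_a * support_b)

def describe(lift_value, tol=1e-9):
    """Classify a lift value relative to 1."""
    if abs(lift_value - 1.0) <= tol:
        return "independent"
    return "positive correlation" if lift_value > 1.0 else "negative correlation"

# Hypothetical support values (joint, A alone, B alone).
for s_ab, s_a, s_b in [(0.4, 0.5, 0.8), (0.3, 0.5, 0.4), (0.1, 0.5, 0.4)]:
    value = lift(s_ab, s_a, s_b)
    print(round(value, 3), describe(value))
```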

### Interpretation of Lift

- For a rule **X ⇒ Y**, lift shows how much more likely item **Y** is to be sold when item **X** is sold, compared with how often **Y** is sold overall.

The formula for lift can also be expressed as:

$$
\text{Lift}(X \Rightarrow Y) = \frac{\text{Confidence}(X \Rightarrow Y)}{\text{Support}(Y)}
$$
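The two expressions for lift agree. Writing $\text{Support}(X \cup Y)$ for the fraction of transactions that contain all items of both **X** and **Y**, and using the definition of confidence as that joint support divided by $\text{Support}(X)$:

$$
\text{Lift}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \times \text{Support}(Y)} = \frac{\text{Support}(X \cup Y) / \text{Support}(X)}{\text{Support}(Y)} = \frac{\text{Confidence}(X \Rightarrow Y)}{\text{Support}(Y)}
$$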

### Example

For the rule **X ⇒ Y** with:

- **Support = 60%**: This means that **60%** of all transactions contain both **X** and **Y**.
- **Confidence = 90%**: This indicates that **90%** of the customers who bought **X** also bought **Y**.
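These two numbers already pin down some other quantities. The sketch below uses exact fractions to show what follows from them; the variable names are illustrative, and the only inputs are the 60% and 90% from the example.

```python
from fractions import Fraction

support_xy = Fraction(60, 100)   # Support of X and Y together
confidence = Fraction(90, 100)   # Confidence of X => Y

# Confidence = Support(X and Y) / Support(X), so Support(X) follows directly.
support_x = support_xy / confidence

# Support(Y) is not given, but it must lie between Support(X and Y) and 1,
# which bounds the lift = Confidence / Support(Y).
lift_min = confidence / 1
lift_max = confidence / support_xy

print(support_x, lift_min, lift_max)  # 2/3 9/10 3/2
```

So **X** appears in about 66.7% of transactions, and the lift of the rule lies somewhere between 0.9 and 1.5 depending on how common **Y** is overall.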


## Build the Apriori Model for Rule Generation


In [5]:
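# Keep rules with support >= 0.50, confidence >= 0.7 and lift >= 1.2.
# min_length is not among apyori's documented arguments, so it likely has no effect.
asscsn_rules = apriori(lsts, min_support=0.50, min_confidence=0.7, min_lift=1.2, min_length=2)
asscsn_results = list(asscsn_rules)  # apriori() yields results lazily; collect them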

In [6]:
import json

print(json.dumps(asscsn_results, default=str, indent=4))

[
    [
        "frozenset({'Wine', 'Apple', 'Bread'})",
        0.5,
        [
            [
                "frozenset({'Apple'})",
                "frozenset({'Wine', 'Bread'})",
                0.7333333333333334,
                1.241025641025641
            ],
            [
                "frozenset({'Apple', 'Bread'})",
                "frozenset({'Wine'})",
                0.9166666666666667,
                1.2604166666666667
            ],
            [
                "frozenset({'Wine', 'Apple'})",
                "frozenset({'Bread'})",
                0.9166666666666667,
                1.2604166666666667
            ],
            [
                "frozenset({'Wine', 'Bread'})",
                "frozenset({'Apple'})",
                0.8461538461538461,
                1.241025641025641
            ]
        ]
    ]
]
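The nested output is easier to read as one row per rule. The sketch below assumes the result records expose the fields `items`, `support` and `ordered_statistics`, and that each statistic exposes `items_base`, `items_add`, `confidence` and `lift`. The namedtuples are stand-ins with those field names so the example runs without the library, and `flatten_rules` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-ins mirroring the assumed field names of the result records.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic",
                              ["items_base", "items_add", "confidence", "lift"])

def flatten_rules(results):
    """Turn nested result records into one flat dict per rule."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": sorted(stat.items_base),
                "consequent": sorted(stat.items_add),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows

# Two of the four rules from the output above.
sample = [RelationRecord(frozenset({"Wine", "Apple", "Bread"}), 0.5, [
    OrderedStatistic(frozenset({"Apple"}), frozenset({"Wine", "Bread"}),
                     0.7333333333333334, 1.241025641025641),
    OrderedStatistic(frozenset({"Apple", "Bread"}), frozenset({"Wine"}),
                     0.9166666666666667, 1.2604166666666667),
])]

for row in flatten_rules(sample):
    print(row["antecedent"], "->", row["consequent"],
          round(row["confidence"], 3), round(row["lift"], 3))
```

A list of flat dicts like this can be passed to `pd.DataFrame(...)` for sorting or filtering by confidence or lift.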


## Result Interpretation - Market Basket Analysis

### Consumer Behavior Insights

The rules mined from the dataset can be read as the following consumer behavior insights:

#### Frequent Item Set

The only frequent item set that yields rules meeting all three thresholds (support, confidence, and lift) is:

- {'Wine', 'Apple', 'Bread'}, support = 0.5
  - This means these three items appear together in **50%** of all transactions (11 of the 22).


#### Strong Association Rules

1. Rule 1: {'Apple'} → {'Bread', 'Wine'}

   - **Confidence**: 0.7333 (or 73.33%)
     - This indicates that 73.33% of the consumers who bought **Apple** also bought **Bread** & **Wine**.
   - **Lift**: 1.241
     - This means that customers who buy **Apple** are 1.24 times as likely to buy **Bread** & **Wine** as customers overall.
     - A lift greater than 1 indicates a positive correlation between the items.

2. Rule 2: {'Apple', 'Bread'} → {'Wine'}

   - **Confidence**: 0.9167 (or 91.67%)
     - This implies that 91.67% of the customers who bought **Apple** & **Bread** also bought **Wine**.
   - **Lift**: 1.260
     - Together with the high confidence, this lift above 1 indicates that customers who buy both **Apple** & **Bread** are highly likely to also purchase **Wine**, and more so than customers overall.

3. Rule 3: {'Apple', 'Wine'} → {'Bread'}

   - **Confidence**: 0.9167 (or 91.67%)
     - This means that 91.67% of the customers who bought **Apple** & **Wine** also bought **Bread**.
   - **Lift**: 1.260
     - Similar to Rule 2, this indicates a positive correlation: customers who buy **Apple** & **Wine** are also very likely to buy **Bread**.

4. Rule 4: {'Bread', 'Wine'} → {'Apple'}

   - **Confidence**: 0.8462 (or 84.62%)
     - This implies that 84.62% of the customers who bought **Bread** & **Wine** also bought **Apple**.
   - **Lift**: 1.241
     - This indicates that customers who purchase **Bread** & **Wine** are 1.24 times as likely to buy **Apple** as customers overall, a positive correlation.
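As a consistency check, the reported confidence and lift values can be turned back into transaction counts, using Support(antecedent) = Support(rule) / Confidence and Support(consequent) = Confidence / Lift. The sketch below uses only numbers from the output above and the 22 rows reported by `data_frame.shape`; the helper name `implied_counts` is hypothetical.

```python
N_TRANSACTIONS = 22      # rows reported by data_frame.shape
ITEMSET_SUPPORT = 0.5    # support of {'Wine', 'Apple', 'Bread'}

# (antecedent, consequent, confidence, lift) copied from the output above.
rules = [
    ("Apple", "Bread, Wine", 0.7333333333333334, 1.241025641025641),
    ("Apple, Bread", "Wine", 0.9166666666666667, 1.2604166666666667),
    ("Apple, Wine", "Bread", 0.9166666666666667, 1.2604166666666667),
    ("Bread, Wine", "Apple", 0.8461538461538461, 1.241025641025641),
]

def implied_counts(confidence, lift):
    """Transaction counts of antecedent and consequent implied by a rule."""
    antecedent_support = ITEMSET_SUPPORT / confidence
    consequent_support = confidence / lift
    return (round(antecedent_support * N_TRANSACTIONS),
            round(consequent_support * N_TRANSACTIONS))

for antecedent, consequent, confidence, lift in rules:
    a_count, c_count = implied_counts(confidence, lift)
    print(f"{{{antecedent}}}: {a_count} baskets, {{{consequent}}}: {c_count} baskets")
```

The counts agree across rules: {Apple} comes out as 15 baskets and {Bread, Wine} as 13 baskets whether they are derived as an antecedent or as a consequent.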
