# Apriori

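Apriori is an algorithm for mining frequent itemsets and association rules from transaction data. It builds frequent itemsets level by level, relying on the fact that every subset of a frequent itemset must itself be frequent.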
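To make the level-by-level idea concrete, here is a minimal pure-Python sketch of the frequent-itemset search. It is only an illustration on toy data, not the implementation used by mlxtend or apyori; the function name `apriori_sketch` and the sample transactions are made up for this example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    tx_sets = [set(tx) for tx in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset
        return sum(itemset <= tx for tx in tx_sets) / n

    # Level 1: frequent single items
    items = {item for tx in tx_sets for item in tx}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: sup for s, sup in current.items() if sup >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {c: support(c) for c in candidates}
        current = {c: sup for c, sup in current.items() if sup >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Toy transactions (hypothetical, not taken from the dataset)
toy = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["milk", "eggs"],
    ["bread"],
]
frequent = apriori_sketch(toy, min_support=0.5)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what keeps the search tractable: a candidate is discarded without counting as soon as one of its subsets is known to be infrequent.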

### Importing the dataset

In [129]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

**dataset**:
- each row is a transaction with the products purchased by a customer
- 7,501 transactions in a week

In [130]:
dataset = pd.read_csv("./filez/Market_Basket_Optimisation.csv", header=None, sep=',')
dataset.head()

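| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
| 1 | burgers | meatballs | eggs | | | | | | | | | | | | | | | | | |
| 2 | chutney | | | | | | | | | | | | | | | | | | | |
| 3 | turkey | avocado | | | | | | | | | | | | | | | | | | |
| 4 | mineral water | milk | energy bar | whole wheat rice | green tea | | | | | | | | | | | | | | | |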


### Data preprocessing

In [131]:
# Convert the DataFrame into a list of lists (transactions), excluding NaN values
transactions = dataset.apply(lambda x: x.dropna().tolist(), axis=1).tolist()
for tx in transactions[:5]:
    print(tx)

['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil']
['burgers', 'meatballs', 'eggs']
['chutney']
['turkey', 'avocado']
['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea']


In [132]:
te = TransactionEncoder()

# Transform the data into a one-hot encoded DataFrame
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()

| | asparagus | almonds | antioxydant juice | asparagus.1 | avocado | babies food | bacon | barbecue sauce | black tea | blueberries | ... | turkey | vegetables mix | water spray | white wine | whole weat flour | whole wheat pasta | whole wheat rice | yams | yogurt cake | zucchini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | True | False | True | False | False | False | False | False | ... | False | True | False | False | True | False | False | True | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | True | False | False | False | False | False | ... | True | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | False | False |
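For intuition, the one-hot encoding step can be approximated in plain Python. This is a rough sketch of the idea only, not mlxtend's actual implementation; `one_hot_encode` is a hypothetical helper, and the alphabetical column order is an assumption based on the output above.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): one boolean per known item for each transaction."""
    # Every distinct item becomes a column (sorted alphabetically here)
    columns = sorted({item for tx in transactions for item in tx})
    rows = []
    for tx in transactions:
        present = set(tx)
        # True where the transaction contains the column's item
        rows.append([item in present for item in columns])
    return columns, rows

# Three of the transactions printed earlier
sample = [
    ["burgers", "meatballs", "eggs"],
    ["chutney"],
    ["turkey", "avocado"],
]
columns, rows = one_hot_encode(sample)
print(columns)
for row in rows:
    print(row)
```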


### Equivalent parameters in two Python Apriori libraries

| apyori | mlxtend |
|----------|----------|
| min_confidence | min_threshold (when metric="confidence") |
| min_lift | min_threshold (when metric="lift") |
| min_support | min_support |
| max_length | max_len |

`association_rules` in **mlxtend** accepts a single `metric` and `min_threshold`. To apply both `min_confidence` and `min_lift`, generate the rules with one metric and then filter the resulting DataFrame on the other.

### Training the Apriori model on the dataset

The parameter choices below are mainly based on experience:

`min_support`:
- Definition: support(X) = number of transactions containing X / total number of transactions
- Dataset size: 7,501 transactions in a week
- Daily rule: only consider products that appear in at least 3 transactions a day (products purchased 3 times a day)
- Weekly equivalent: 3 x 7 = 21 transactions
- min_support: 21 / 7501 ≈ 0.0028 (rounded to 0.003 in the code below)
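The arithmetic behind this threshold can be reproduced in a few lines. The variable names are illustrative; the last line shows what the rounded value 0.003 implies as a minimum transaction count.

```python
import math

total_transactions = 7501   # transactions recorded in one week
min_per_day = 3             # an item must appear in at least 3 transactions a day
days_per_week = 7

min_per_week = min_per_day * days_per_week         # 3 x 7 = 21
min_support = min_per_week / total_transactions    # 21 / 7501

print(min_per_week)
print(round(min_support, 4))

# Rounding the threshold to 0.003 slightly raises the bar:
# an itemset then needs at least this many transactions to qualify
min_count_at_0_003 = math.ceil(0.003 * total_transactions)
print(min_count_at_0_003)
```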

`min_confidence`:
- The default value in R is 0.8, which means the rule has to be correct 80% of the time
- That is too strict for this dataset, so lower it and try again: 0.4 yields very few rules
- 0.2 yields a good set of rules

`min_lift`:
- Based on experience, keep rules with a lift of 3 or more

`max_len`:
- Limits the size of the itemsets generated
  - e.g. `max_len`=2: the largest frequent itemset has at most 2 items, so the antecedent and the consequent of any rule each contain exactly 1 item (the only way to split a 2-item set is one item leading to the other).
  - e.g. `max_len`=3: frequent itemsets can contain up to 3 items, so the antecedent and the consequent can each contain 1 or 2 items.
- For a "buy one, get another" offer, set `max_len`=2 to get 1 item on the left and 1 on the right
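The effect of the itemset size on the shape of the rules can be checked by enumerating every way to split an itemset into a non-empty antecedent and consequent. This is an illustrative sketch; `rule_splits` is a hypothetical helper, not part of mlxtend.

```python
from itertools import combinations

def rule_splits(itemset):
    """List every (antecedent, consequent) split of an itemset."""
    items = sorted(itemset)
    splits = []
    # The antecedent takes 1 .. len-1 items; the consequent takes the rest
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            splits.append((antecedent, consequent))
    return splits

pair_rules = rule_splits({"pasta", "shrimp"})
triple_rules = rule_splits({"pasta", "shrimp", "escalope"})

for antecedent, consequent in pair_rules:
    print(antecedent, "->", consequent)
print(len(triple_rules), "possible rules from a 3-item set")
```

A 2-item set only produces 1-to-1 rules, while a 3-item set adds 1-to-2 and 2-to-1 rules.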
 

In [133]:
# Step 1: Generate frequent itemsets
frequent_itemsets = apriori(df, min_support=0.003, use_colnames=True, max_len=2)

# Step 2: Generate rules, keeping those with confidence >= 0.2
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)

# Further filter the rules by lift
rules = rules[rules["lift"] >= 3]

### Visualizing the results

In [134]:
# select the columns of interest
selected_rules = rules[["antecedents", "consequents", "support", "confidence", "lift"]]

# sort rules by lift
sorted_rules = selected_rules.sort_values(by="lift", ascending=False)

print(sorted_rules)

                antecedents    consequents   support  confidence      lift
166         (fromage blanc)        (honey)  0.003333    0.245098  5.164271
58            (light cream)      (chicken)  0.004533    0.290598  4.843951
139                 (pasta)     (escalope)  0.005866    0.372881  4.700812
270                 (pasta)       (shrimp)  0.005066    0.322034  4.506672
267     (whole wheat pasta)    (olive oil)  0.007999    0.271493  4.122410
198          (tomato sauce)  (ground beef)  0.005333    0.377358  3.840659
138  (mushroom cream sauce)     (escalope)  0.005733    0.300699  3.790833
188         (herb & pepper)  (ground beef)  0.015998    0.323450  3.291994
211           (light cream)    (olive oil)  0.003200    0.205128  3.114710


Influence of "fromage blanc" on "honey":
- `Antecedents`: ('fromage blanc') - the **IF** part of the association rule. The rule applies to transactions that contain 'fromage blanc'.
- `Consequents`: ('honey') - the **THEN** part of the rule. In transactions where 'fromage blanc' is present, 'honey' also tends to be present.
- `Support`: 0.003333 - proportion of all transactions that contain both 'fromage blanc' and 'honey'. This combination appears in about 0.33% of all transactions.
- `Confidence`: 0.245098 - how often 'honey' appears in transactions that include 'fromage blanc'. 24.51% of the transactions containing 'fromage blanc' also contain 'honey'.
- `Lift`: 5.164271 - strength of the association between 'fromage blanc' and 'honey'. A lift greater than 1 indicates that the two items appear together more often than would be expected if they were independent. Here, the presence of 'fromage blanc' makes 'honey' about 5.16 times more likely to appear in the same transaction than it is overall.
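The three metrics are tied together, which can be checked with the numbers from the first row. This sketch assumes the standard definitions confidence(A → C) = support(A and C) / support(A) and lift(A → C) = confidence(A → C) / support(C); the variable names are illustrative, and the individual item supports are derived here rather than read from the output above.

```python
total_transactions = 7501

# Values from the rule (fromage blanc) -> (honey)
support_both = 0.003333
confidence = 0.245098
lift = 5.164271

# confidence = support_both / support(antecedent)
support_antecedent = support_both / confidence
# lift = confidence / support(consequent)
support_consequent = confidence / lift

print("support(fromage blanc) ~", round(support_antecedent, 4))
print("support(honey)         ~", round(support_consequent, 4))

# Approximate number of transactions containing both items
count_both = round(support_both * total_transactions)
print("transactions with both ~", count_both)

# Equivalent form of lift: joint support over the product of the individual supports
lift_check = support_both / (support_antecedent * support_consequent)
print("lift recomputed        ~", round(lift_check, 4))
```

Read this way, the lift says honey shows up in roughly a quarter of the baskets with fromage blanc, compared with about one in twenty baskets overall.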