For this project, we selected Python as the language. The first step was to import the required modules (pandas is the only third-party package; `os` and `random` come with the standard library):

In [1]:
import os
import pandas as pd
import random

As required, the first task was to create the list of items available in our store (which we called the inventory). For that, we used the items we most commonly buy at grocery stores (Walmart, ShopRite, Stop & Shop), along with some other items commonly found in the same categories. We created a list of dictionaries, converted it to a Pandas DataFrame, and then saved it to a TSV file.

In [2]:
inventory = [
    {"item_id": 1, "item_description": "Classic Coke"},
    {"item_id": 2, "item_description": "Sprite"},
    {"item_id": 3, "item_description": "Fanta"},
    {"item_id": 4, "item_description": "Apple Juice"},
    {"item_id": 5, "item_description": "Orange Juice"},
    {"item_id": 6, "item_description": "Pear"},
    {"item_id": 7, "item_description": "Apple"},
    {"item_id": 8, "item_description": "Grape"},
    {"item_id": 9, "item_description": "Lemon"},
    {"item_id": 10, "item_description": "Banana"},
    {"item_id": 11, "item_description": "Hot Pocket"},
    {"item_id": 12, "item_description": "Hungry Man"},
    {"item_id": 13, "item_description": "Meatlovers Pizza"},
    {"item_id": 14, "item_description": "Sliced Ham"},
    {"item_id": 15, "item_description": "Hard Salami"},
    {"item_id": 16, "item_description": "Provolone Cheese"},
    {"item_id": 17, "item_description": "Muenster Cheese"},
    {"item_id": 18, "item_description": "Bread"},
    {"item_id": 19, "item_description": "Milk"},
    {"item_id": 20, "item_description": "Coffee"},    
    {"item_id": 21, "item_description": "Rice"},
    {"item_id": 22, "item_description": "Popcorn"},
    {"item_id": 23, "item_description": "Italian Sub"},
    {"item_id": 24, "item_description": "Butter"},
    {"item_id": 25, "item_description": "Eggs"},
    {"item_id": 26, "item_description": "Batteries"},
    {"item_id": 27, "item_description": "Shampoo"},
    {"item_id": 28, "item_description": "Toothpaste"},
    {"item_id": 29, "item_description": "Tylenol"},
    {"item_id": 30, "item_description": "Yogurt"},
]
    
inventory = pd.DataFrame(inventory)
inventory.to_csv("inventory.tsv", sep = "\t", index=False)
inventory.head()

Unnamed: 0,item_id,item_description
0,1,Classic Coke
1,2,Sprite
2,3,Fanta
3,4,Apple Juice
4,5,Orange Juice


Next, we created a function to generate a database by randomly combining items from our inventory. This function receives as parameters:

* **transactions**: the number of transactions to be generated in the database
* **min_items_per_transaction**: the minimum number of items per transaction
* **max_items_per_transaction**: the maximum number of items per transaction
* **file_name**: the name of the file to be saved

In [3]:
def generate_database(transactions = 20, min_items_per_transaction = 2, 
        max_items_per_transaction = 6, file_name = "database"):
    
    database = []
    inventory = pd.read_csv("inventory.tsv", sep = "\t")
    for transaction in range(transactions):
        basket_items = random.sample(list(inventory["item_id"]), 
                                     random.randint(min_items_per_transaction, 
                                                    max_items_per_transaction))
        basket_items = inventory[
            inventory["item_id"].isin(basket_items)]["item_description"].tolist()
        database.append({
            "transaction_id": transaction + 1,
            "items": ','.join(basket_items)}
        )
    database = pd.DataFrame(database)
    database.to_csv(file_name, sep = "\t", index=False)
    print("Database ", file_name, " with ", transactions, 
          " transactions generated", sep = "")

Then we ran the function above to generate the five databases, using a different set of parameters for each:

In [4]:
# generate_database(min_items_per_transaction = 2, max_items_per_transaction = 6, file_name = "database1.tsv")
# generate_database(min_items_per_transaction = 3, max_items_per_transaction = 8, file_name = "database2.tsv")
# generate_database(min_items_per_transaction = 1, max_items_per_transaction = 3, file_name = "database3.tsv")
# generate_database(min_items_per_transaction = 2, max_items_per_transaction = 8, file_name = "database4.tsv")
# generate_database(min_items_per_transaction = 1, max_items_per_transaction = 10, file_name = "database5.tsv")
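
Because `random.sample` and `random.randint` draw from Python's global random generator, every call to `generate_database` produces different transactions. The following is a minimal sketch of how the sampling could be made repeatable, assuming reproducibility is wanted; the `sample_basket` helper and the seed value are hypothetical and not part of the project code.

```python
import random

def sample_basket(item_ids, min_items, max_items, rng):
    """Draw one basket of distinct item ids, mirroring the sampling in generate_database."""
    size = rng.randint(min_items, max_items)
    return rng.sample(item_ids, size)

item_ids = list(range(1, 31))   # the 30 item ids of the inventory

# Two generators seeded with the same (arbitrary, hypothetical) value
rng_a = random.Random(42)
rng_b = random.Random(42)
baskets_a = [sample_basket(item_ids, 2, 6, rng_a) for _ in range(20)]
baskets_b = [sample_basket(item_ids, 2, 6, rng_b) for _ in range(20)]

# Identical seeds give identical baskets
print(baskets_a == baskets_b)
```

Passing a dedicated `random.Random` instance keeps the seed local to the generator instead of changing the global random state.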

In [5]:
database = pd.read_csv("database1.tsv", sep = "\t")
database.head()

Unnamed: 0,transaction_id,items
0,1,"Apple,Hard Salami,Muenster Cheese,Coffee,Rice,..."
1,2,"Hard Salami,Provolone Cheese"
2,3,"Sprite,Hot Pocket,Hungry Man,Batteries,Yogurt"
3,4,"Hot Pocket,Popcorn,Tylenol"
4,5,"Orange Juice,Hard Salami,Batteries"


We then created a function to generate the combinations of items (itemsets). This function receives as parameters:

* **item_list**: the list of items
* **num_items**: the number of items to be combined in the item set

This function is used by both algorithms (brute-force and Apriori).

In [6]:
test = ["a", "b", "c", "d"]

def generate_combinations(item_list, num_items):
    comb_list = []
    def comb(combinations, item_list, n):
        if n == 0:
            combined_items = combinations[:-1].split("|")
            combined_items.sort()
            comb_list.append(combined_items)
        else:
            for i in range(len(item_list)):
                comb(combinations + item_list[i] + "|", item_list[i+1:], n-1)
    comb("", item_list, num_items)
    return comb_list

print(generate_combinations(test, 1))
print(generate_combinations(test, 2))
print(generate_combinations(test, 3))
print(generate_combinations(test, 4))

[['a'], ['b'], ['c'], ['d']]
[['a', 'b'], ['a', 'c'], ['a', 'd'], ['b', 'c'], ['b', 'd'], ['c', 'd']]
[['a', 'b', 'c'], ['a', 'b', 'd'], ['a', 'c', 'd'], ['b', 'c', 'd']]
[['a', 'b', 'c', 'd']]


We then created another function to check whether an itemset belongs to a superset. This function returns 1 if every item of the itemset (subset) is present in the superset, and 0 if at least one item is missing or if the itemset has more items than the superset. The function receives the following parameters:

* **itemset**: the subset to be checked against the superset
* **transaction_items**: the superset

In [7]:
transaction = ["a", "b", "c", "d"]

def check_belonging(itemset, transaction_items):
    belong = 0
    if len(itemset) > len(transaction_items):
        belong = 0
    elif(all(item in transaction_items for item in itemset)):
        belong = 1
    return belong

print(check_belonging(["a"], transaction))
print(check_belonging(["e"], transaction))
print(check_belonging(["b", "c"], transaction))
print(check_belonging(["a", "b", "c", "d"], transaction))
print(check_belonging(["a", "b", "c", "d", "e"], transaction))

1
0
1
1
0
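
The same test can be expressed with Python sets. This is a minimal sketch with a hypothetical name, `check_belonging_set`, and it assumes an itemset never repeats an item, which holds for the combinations generated above.

```python
def check_belonging_set(itemset, transaction_items):
    """Return 1 if every item of itemset appears in transaction_items, else 0."""
    # `<=` on sets is the subset test; an itemset larger than the
    # transaction can never be a subset, so no length check is needed
    return int(set(itemset) <= set(transaction_items))

transaction = ["a", "b", "c", "d"]
print(check_belonging_set(["b", "c"], transaction))                 # 1
print(check_belonging_set(["e"], transaction))                      # 0
print(check_belonging_set(["a", "b", "c", "d", "e"], transaction))  # 0
```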


# Brute-force method

We decided to implement the brute-force algorithm first, because it seemed less complex to develop and would help in developing Apriori. We created a function for it that receives:

* **inventory**: a TSV file containing the list of items available on our store (as created above)
* **database**: a TSV file containing the transactions with items of our inventory (as created above)
* **min_support**: the minimum support as a transaction count (integer)
* **min_confidence**: the minimum confidence as a decimal fraction


The function runs the brute-force search and returns the association rules found, with their support and confidence values, for the given parameters.

In [8]:
def brute_force(inventory, database, min_support, min_confidence):
    inventory = pd.read_csv(inventory, sep="\t")
    inventory = list(inventory["item_description"])
    transactions = pd.read_csv(database, sep="\t")
    frequent_items = []
    num_transactions = len(transactions.index)
    
    ## Getting the support for each combination of items available on inventory
    for num_items in range(1, len(inventory)):
        itemset = generate_combinations(inventory, num_items)  
        for each_combination in itemset:
            support = 0
            # Check for the presence of the item in the transaction and adds +1 to support if so
            for index, each_transaction in transactions.iterrows():
                support += check_belonging(each_combination, 
                                           each_transaction["items"].split(","))
            # Add to our frequent items list if above the minimum support
            if support >= min_support:
                frequent_items.append({
                        "itemset": ','.join(each_combination),
                        "support": support,
                        "qty_items": len(each_combination)
                    }
                )
        ## Early-stop if there is no frequent items for the combinations of that size
        if not frequent_items or pd.DataFrame(frequent_items)["qty_items"].max() < num_items:
            break
            
    frequent_itemsets = pd.DataFrame(frequent_items)
    # Remove frequent itemsets with only one item
    frequent_itemsets = frequent_itemsets[frequent_itemsets["qty_items"] > 1]   
    
    ## Creating association rules and getting the confidence
    association_rules = []
    for index, each_itemset in frequent_itemsets.iterrows():
        for each_item in each_itemset["itemset"].split(","):
            consequent = each_item
            antecedent = each_itemset["itemset"].split(",")
            antecedent.remove(consequent)
            confidence = 0
            # Check the combination on all transactions and add +1 to confidence if present
            for index, each_transaction in transactions.iterrows():
                confidence += check_belonging(antecedent, 
                                              each_transaction["items"].split(",")) 
            # Add to association rules
            if each_itemset["support"] / confidence >= min_confidence:
                association_rules.append({
                        "antecedent": ",".join(antecedent),
                        "consequent": consequent,
                        "support": str(each_itemset["support"]) + "/" + str(num_transactions),
                        "support %": each_itemset["support"] / num_transactions,
                        "confidence": str(each_itemset["support"]) + "/" + str(confidence),
                        "confidence %": each_itemset["support"] / confidence
                    }
                )
                
    if not association_rules:
        print("No frequent itemset found for support =", min_support, 
              "and confidence =", min_confidence, "in Brute Force algorithm")
        return
    
    return pd.DataFrame(association_rules).sort_values(by = ["antecedent", "consequent"])

Here is our test using the Database \#1 with 2 as the minimum support (10%) and 0.5 as the minimum confidence (50%)

In [9]:
df_brute_force = brute_force("inventory.tsv", "database1.tsv", min_support = 2, min_confidence = 0.5)
df_brute_force

Unnamed: 0,antecedent,consequent,support,support %,confidence,confidence %
1,Batteries,Orange Juice,3/20,0.15,3/6,0.5
4,Rice,Toothpaste,2/20,0.1,2/2,1.0
0,Sliced Ham,Orange Juice,2/20,0.1,2/3,0.666667
2,Toothpaste,Orange Juice,2/20,0.1,2/3,0.666667
3,Toothpaste,Rice,2/20,0.1,2/3,0.666667
5,Yogurt,Batteries,2/20,0.1,2/2,1.0
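
To make the two columns concrete: the support of a rule is the number of transactions containing the antecedent and the consequent together, and the confidence is that count divided by the number of transactions containing the antecedent. The sketch below recomputes both on a made-up toy dataset; the transactions and the `support_count` helper are illustrative assumptions, not data from Database \#1.

```python
# Hypothetical toy transactions, invented for this example
toy_transactions = [
    {"Rice", "Toothpaste", "Milk"},
    {"Rice", "Toothpaste"},
    {"Toothpaste", "Bread"},
    {"Milk", "Bread"},
]

def support_count(itemset, transactions):
    """Count the transactions that contain every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

antecedent = ["Toothpaste"]
consequent = "Rice"

# Support of the rule: transactions with antecedent and consequent together
rule_support = support_count(antecedent + [consequent], toy_transactions)
# Transactions with the antecedent alone (the confidence denominator)
antecedent_support = support_count(antecedent, toy_transactions)
confidence = rule_support / antecedent_support

print(rule_support, antecedent_support, round(confidence, 6))
# 2 3 0.666667
```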


# Apriori

Our Apriori implementation works almost the same way as the brute-force one, with one important difference: instead of generating the candidate combinations from the inventory (all available items), it uses a priori knowledge about which items are sold together, that is, it generates the candidates from the transactions themselves. The function receives the following parameters:

* **database**: a TSV file containing the transactions with items of our inventory (as created above)
* **min_support**: the minimum support as a transaction count (integer)
* **min_confidence**: the minimum confidence as a decimal fraction

It outputs the same as our brute-force function: the association rules with their support and confidence values, as a Pandas DataFrame.

In [10]:
def apriori(database, min_support, min_confidence):
    transactions = pd.read_csv(database, sep = "\t")
    frequent_items = []    
    num_transactions = len(transactions.index)
    
    num_items = 1
    # Loop over increasing itemset sizes; the early-stop check below ends the loop
    while True:
        for index, each_transaction in transactions.iterrows():
            itemset = generate_combinations(each_transaction["items"].split(","), num_items + 1)    
            for each_combination in itemset:
                # Check if we have already calculated the support for frequent itemset
                if (not frequent_items or 
                    pd.DataFrame(frequent_items)[
                        pd.DataFrame(frequent_items)["itemset"] == ','.join(each_combination)
                    ]["itemset"].count() == 0):
                    support = 0
                    # Check for the presence of the item in the transaction and adds +1 to support if so
                    for index, each_transaction in transactions.iterrows():
                        support += check_belonging(
                            each_combination, 
                            each_transaction["items"].split(","))
                    # Add to our frequent items list if above the minimum support
                    if support >= min_support:
                        frequent_items.append({
                                "itemset": ','.join(each_combination),
                                "support": support,
                                "qty_items": len(each_combination)
                            }
                        )
        num_items += 1
        ## Early-stop if there is no frequent items for the combinations of that size
        if not frequent_items or pd.DataFrame(frequent_items)["qty_items"].max() < num_items:
            break                        
                  
    if not frequent_items:
        print("No frequent itemset found for support =", min_support, "in Apriori algorithm")
        return
    
    frequent_itemsets = pd.DataFrame(frequent_items)
    # Remove frequent itemsets with only one item
    frequent_itemsets = frequent_itemsets[frequent_itemsets["qty_items"] > 1]                      

    ## Creating association rules and getting the confidence
    association_rules = []
    for index, each_itemset in frequent_itemsets.iterrows():
        for each_item in each_itemset["itemset"].split(","):
            consequent = each_item
            antecedent = each_itemset["itemset"].split(",")
            antecedent.remove(consequent)
            confidence = 0
            # Check the combination on all transactions and add +1 to confidence if present
            for index, each_transaction in transactions.iterrows():
                confidence += check_belonging(antecedent, each_transaction["items"].split(",")) 
            # Add to association rules
            if each_itemset["support"] / confidence >= min_confidence:
                association_rules.append({
                        "antecedent": ",".join(antecedent),
                        "consequent": consequent,
                        "support": str(each_itemset["support"]) + "/" + str(num_transactions),
                        "support %": each_itemset["support"] / num_transactions,
                        "confidence": str(each_itemset["support"]) + "/" + str(confidence),
                        "confidence %": each_itemset["support"] / confidence
                    }
                )
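    # Sorting an empty DataFrame by these columns would raise a KeyError
    if not association_rules:
        print("No association rule found for support =", min_support,
              "and confidence =", min_confidence, "in Apriori algorithm")
        return

    return pd.DataFrame(association_rules).sort_values(by = ["antecedent", "consequent"])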

Here is our test using the Database \#1 with 2 as the minimum support (10%) and 0.6 as the minimum confidence (60%)

In [11]:
df_apriori = apriori("database1.tsv", 2, 0.6)
df_apriori

Unnamed: 0,antecedent,consequent,support,support %,confidence,confidence %
1,Rice,Toothpaste,2/20,0.1,2/2,1.0
3,Sliced Ham,Orange Juice,2/20,0.1,2/3,0.666667
4,Toothpaste,Orange Juice,2/20,0.1,2/3,0.666667
0,Toothpaste,Rice,2/20,0.1,2/3,0.666667
2,Yogurt,Batteries,2/20,0.1,2/2,1.0


We purposely used a different confidence threshold because we wanted to outer join the results of the two runs and check that they behave as expected.

In [12]:
df_apriori.merge(df_brute_force, how = "outer", left_on=["antecedent", "consequent"], right_on=["antecedent", "consequent"], 
                 suffixes=(' [Apriori]', ' [Brute Force]')).sort_values(by = ["antecedent", "consequent"])

Unnamed: 0,antecedent,consequent,support [Apriori],support % [Apriori],confidence [Apriori],confidence % [Apriori],support [Brute Force],support % [Brute Force],confidence [Brute Force],confidence % [Brute Force]
5,Batteries,Orange Juice,,,,,3/20,0.15,3/6,0.5
0,Rice,Toothpaste,2/20,0.1,2/2,1.0,2/20,0.1,2/2,1.0
1,Sliced Ham,Orange Juice,2/20,0.1,2/3,0.666667,2/20,0.1,2/3,0.666667
2,Toothpaste,Orange Juice,2/20,0.1,2/3,0.666667,2/20,0.1,2/3,0.666667
3,Toothpaste,Rice,2/20,0.1,2/3,0.666667,2/20,0.1,2/3,0.666667
4,Yogurt,Batteries,2/20,0.1,2/2,1.0,2/20,0.1,2/2,1.0


We can see above that our two algorithms output the same frequent itemsets with the same support and confidence and that, because we raised the confidence threshold when running Apriori, the rule that did not meet that threshold (Batteries → Orange Juice, confidence 0.5) was removed from its final list.
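
Rather than scanning the merged table for empty cells, the rows that come from only one algorithm can be flagged automatically with the `indicator` option of `DataFrame.merge`. The sketch below uses a few rows copied from the result tables above for illustration; the variable names are hypothetical.

```python
import pandas as pd

keys = ["antecedent", "consequent"]

# A few rows copied from the result tables above, for illustration only
rules_apriori = pd.DataFrame({
    "antecedent": ["Rice", "Yogurt"],
    "consequent": ["Toothpaste", "Batteries"],
    "confidence %": [1.0, 1.0],
})
rules_brute_force = pd.DataFrame({
    "antecedent": ["Batteries", "Rice", "Yogurt"],
    "consequent": ["Orange Juice", "Toothpaste", "Batteries"],
    "confidence %": [0.5, 1.0, 1.0],
})

merged = rules_apriori.merge(
    rules_brute_force, how="outer", on=keys,
    suffixes=(" [Apriori]", " [Brute Force]"), indicator=True,
)
# Rows flagged "left_only" or "right_only" were produced by just one algorithm
mismatches = merged[merged["_merge"] != "both"]
print(mismatches[keys + ["_merge"]])
```

With equal thresholds, an empty `mismatches` table would indicate that both algorithms returned the same set of rules.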

As the final goal of the project is to compare the performance between both algorithms, we created a function to execute that comparison. This function receives:

* **inventory**: a TSV file containing the list of items available on our store (as created above)
* **database**: a TSV file containing the transactions with items of our inventory (as created above)
* **min_support**: the minimum support as a transaction count (integer)
* **min_confidence**: the minimum confidence as a decimal fraction

The function prints the running time (in seconds) of each algorithm and, if any association rules meet the thresholds, returns the merged Pandas DataFrame (using an outer join).

In [13]:
def compare_algorithms(inventory, database, min_support, min_confidence):
    import time
    start_time = time.time()
    df_apriori = apriori(database, min_support, min_confidence)
    apriori_time = time.time() - start_time
    start_time = time.time()
    df_brute_force = brute_force(inventory, database, min_support = min_support, 
                                 min_confidence = min_confidence)
    brute_force_time = time.time() - start_time
    print(
        "Apriori time (s): ", round(apriori_time, 3), 
        "\t\t\t\t", 
        "Brute Force time (s): ", round(brute_force_time, 3), sep = ""
    )
    
    # Merge only when both algorithms returned rules
    if df_apriori is not None and df_brute_force is not None:
        return df_apriori.merge(
            df_brute_force, 
            how = "outer", 
            left_on=["antecedent", "consequent"], 
            right_on=["antecedent", "consequent"], 
            suffixes=(' [Apriori]', ' [Brute Force]')
        ).sort_values(by = ["antecedent", "consequent"])
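
As an aside, the standard library also offers `time.perf_counter`, a clock intended for measuring short durations. A minimal sketch of a reusable timing helper follows; `time_call` is a hypothetical name and is not used by the project code.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this could be apriori or brute_force
total, seconds = time_call(sum, range(1_000_000))
print(total, round(seconds, 3))
```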

We executed the comparison for Database \#1 as a way of testing the function and obtaining the difference in performance.

In [14]:
compare_algorithms("inventory.tsv", "database1.tsv", min_support = 2, min_confidence = 0.5)

Apriori time (s): 0.574				Brute Force time (s): 8.605


Unnamed: 0,antecedent,consequent,support [Apriori],support % [Apriori],confidence [Apriori],confidence % [Apriori],support [Brute Force],support % [Brute Force],confidence [Brute Force],confidence % [Brute Force]
0,Batteries,Orange Juice,3/20,0.15,3/6,0.5,3/20,0.15,3/6,0.5
1,Rice,Toothpaste,2/20,0.1,2/2,1.0,2/20,0.1,2/2,1.0
2,Sliced Ham,Orange Juice,2/20,0.1,2/3,0.666667,2/20,0.1,2/3,0.666667
3,Toothpaste,Orange Juice,2/20,0.1,2/3,0.666667,2/20,0.1,2/3,0.666667
4,Toothpaste,Rice,2/20,0.1,2/3,0.666667,2/20,0.1,2/3,0.666667
5,Yogurt,Batteries,2/20,0.1,2/2,1.0,2/20,0.1,2/2,1.0


We already ran the comparison for Database \#1 (above). Now we run it for the rest of the databases, using different parameters for support and confidence, starting with Database \#2:

In [15]:
compare_algorithms("inventory.tsv", "database2.tsv", min_support = 3, min_confidence = 0.3)

Apriori time (s): 2.926				Brute Force time (s): 8.671


Unnamed: 0,antecedent,consequent,support [Apriori],support % [Apriori],confidence [Apriori],confidence % [Apriori],support [Brute Force],support % [Brute Force],confidence [Brute Force],confidence % [Brute Force]
0,Apple Juice,Bread,3/20,0.15,3/4,0.75,3/20,0.15,3/4,0.75
1,Banana,Butter,3/20,0.15,3/5,0.6,3/20,0.15,3/5,0.6
2,Banana,Classic Coke,3/20,0.15,3/5,0.6,3/20,0.15,3/5,0.6
3,Bread,Apple Juice,3/20,0.15,3/6,0.5,3/20,0.15,3/6,0.5
4,Bread,Grape,3/20,0.15,3/6,0.5,3/20,0.15,3/6,0.5
5,Bread,Yogurt,3/20,0.15,3/6,0.5,3/20,0.15,3/6,0.5
6,Butter,Banana,3/20,0.15,3/5,0.6,3/20,0.15,3/5,0.6
7,Classic Coke,Banana,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
8,Coffee,Grape,3/20,0.15,3/5,0.6,3/20,0.15,3/5,0.6
9,Coffee,Orange Juice,3/20,0.15,3/5,0.6,3/20,0.15,3/5,0.6


Database \#3 (for this database, we limited the transactions to at most 3 items on purpose, to see how the algorithms would behave when no meaningful associations are found):

In [16]:
compare_algorithms("inventory.tsv", "database3.tsv", min_support = 2, min_confidence = 0.2)

No frequent itemset found for support = 2 in Apriori algorithm
No frequent itemset found for support = 2 and confidence = 0.2 in Brute Force algorithm
Apriori time (s): 0.047				Brute Force time (s): 0.962


Database \#4

In [20]:
compare_algorithms("inventory.tsv", "database4.tsv", min_support = 3, min_confidence = 0.9)

Apriori time (s): 5.722				Brute Force time (s): 342.543


Unnamed: 0,antecedent,consequent,support [Apriori],support % [Apriori],confidence [Apriori],confidence % [Apriori],support [Brute Force],support % [Brute Force],confidence [Brute Force],confidence % [Brute Force]
0,"Apple,Hard Salami",Sprite,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
1,"Apple,Hard Salami",Tylenol,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
2,"Apple,Hard Salami,Sprite",Tylenol,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
3,"Apple,Hard Salami,Tylenol",Sprite,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
4,"Apple,Sprite",Hard Salami,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
5,"Apple,Sprite",Tylenol,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
6,"Apple,Sprite,Tylenol",Hard Salami,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
7,"Apple,Tylenol",Hard Salami,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
8,"Apple,Tylenol",Sprite,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
9,Hard Salami,Apple,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0


Database \#5

In [21]:
compare_algorithms("inventory.tsv", "database5.tsv", min_support = 3, min_confidence = 0.65)

Apriori time (s): 9.499				Brute Force time (s): 61.41


Unnamed: 0,antecedent,consequent,support [Apriori],support % [Apriori],confidence [Apriori],confidence % [Apriori],support [Brute Force],support % [Brute Force],confidence [Brute Force],confidence % [Brute Force]
0,Apple,Provolone Cheese,3/20,0.15,3/4,0.75,3/20,0.15,3/4,0.75
1,Apple,Sliced Ham,3/20,0.15,3/4,0.75,3/20,0.15,3/4,0.75
2,Apple,Yogurt,3/20,0.15,3/4,0.75,3/20,0.15,3/4,0.75
3,Eggs,Hot Pocket,3/20,0.15,3/4,0.75,3/20,0.15,3/4,0.75
4,"Fanta,Hot Pocket",Toothpaste,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
5,"Fanta,Toothpaste",Hot Pocket,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
6,"Hot Pocket,Toothpaste",Fanta,3/20,0.15,3/4,0.75,3/20,0.15,3/4,0.75
7,Hungry Man,Rice,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
8,Orange Juice,Popcorn,3/20,0.15,3/3,1.0,3/20,0.15,3/3,1.0
9,Popcorn,Orange Juice,3/20,0.15,3/4,0.75,3/20,0.15,3/4,0.75
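
To summarize the five runs, the sketch below divides the brute-force time by the Apriori time for each database, using the timings printed above. The dictionary layout and variable names are assumptions made for this example.

```python
# (Apriori seconds, brute-force seconds) as printed in the runs above
timings = {
    "database1": (0.574, 8.605),
    "database2": (2.926, 8.671),
    "database3": (0.047, 0.962),
    "database4": (5.722, 342.543),
    "database5": (9.499, 61.41),
}

# How many times longer the brute-force run took on each database
speedups = {
    name: brute_force_s / apriori_s
    for name, (apriori_s, brute_force_s) in timings.items()
}

for name, ratio in speedups.items():
    print(name, round(ratio, 1))
```

On these runs the ratio ranges from roughly 3x (Database \#2) to roughly 60x (Database \#4). Each timing comes from a single execution, so the ratios are indicative only.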


In [41]:
frequent_itemsets = pd.DataFrame(frequent_items)
frequent_itemsets
# Remove frequent itemsets with only one item
frequent_itemsets = frequent_itemsets[frequent_itemsets["qty_items"] > 1]    

# Brute-force method

We are going to use a maximum of 4 combined items as our stopping criterion and a support of 10% (at least 2 transactions containing the combination).

In [26]:
num_items_stopping_criteria = 2
min_support = 3
items = list(inventory["item_description"])
transactions = pd.DataFrame(database) 
frequent_items = []

for num_items in range(num_items_stopping_criteria):
    itemset = generate_combinations(items, num_items + 1)
    for each_combination in itemset:
        for each_transaction in list(transactions["items"]):
            frequent_items.append({
                    "itemset": str(each_combination),
                    "quantity": check_belonging(each_combination, each_transaction)
                }
            )
#     print(itemset)
frequent_itemsets = pd.DataFrame(frequent_items).groupby(["itemset"]).sum().reset_index()
frequent_itemsets[frequent_itemsets["quantity"] > min_support]

Unnamed: 0,itemset,quantity
58,['Apple'],4
188,['Classic Coke'],4
211,['Coffee'],4
274,['Grape'],4
311,['Hot Pocket'],4
344,['Italian Sub'],5
409,['Orange Juice'],6


# Brute-force

k = 1

Here you generate all 1-itemsets. Check their support values. Find those that are frequent.

k = 2

Here you generate all 2-itemsets. Check their support values. Find those that are frequent.

k = 3

Here you generate all 3-itemsets. Check their support values. Find those that are frequent. 

k = 4

Here you generate all 4-itemsets. Check their support values. Find those that are frequent. Suppose none of the generated 4-itemsets is frequent. You stop here; there is no need to consider k = 5.
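
The cost of each step can be estimated by counting the candidates it generates. The sketch below computes the number of k-itemsets that can be formed from the 30 inventory items for k = 1 to 4 (`math.comb` requires Python 3.8 or later).

```python
from math import comb

num_inventory_items = 30  # size of the inventory defined earlier

# Number of candidate k-itemsets the brute-force method must check at each level
candidate_counts = {k: comb(num_inventory_items, k) for k in range(1, 5)}
for k, count in candidate_counts.items():
    print(k, count)
# 1 30
# 2 435
# 3 4060
# 4 27405
```

Every candidate is checked against all transactions, so the work grows quickly with k; this is why stopping as soon as a level yields no frequent itemset matters.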

In [37]:
items = list(inventory["item_description"])
num_items = 2
items = pd.DataFrame(generate_combinations(items, num_items))
# items.sort_values(by = [0, 1])
items[(items[0] == "Apple Juice") | (items[1] == "Apple Juice")]

Unnamed: 0,0,1
2,Apple Juice,Classic Coke
30,Apple Juice,Sprite
57,Apple Juice,Fanta
84,Apple Juice,Orange Juice
85,Apple Juice,Pear
86,Apple,Apple Juice
87,Apple Juice,Grape
88,Apple Juice,Lemon
89,Apple Juice,Banana
90,Apple Juice,Hot Pocket


In [47]:
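test = pd.DataFrame(database)
# Split on commas (str.split defaults to whitespace) and name one column per
# position, since transactions hold different numbers of items
item_columns = test["items"].str.split(",", expand=True)
item_columns.columns = ["item" + str(i + 1) for i in range(item_columns.shape[1])]
test = test.join(item_columns)

test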

In [5]:

items.sort()
items

['Apple',
 'Apple Juice',
 'Banana',
 'Batteries',
 'Bread',
 'Butter',
 'Classic Coke',
 'Coffee',
 'Eggs',
 'Fanta',
 'Grape',
 'Hard Salami',
 'Hot Pocket',
 'Hungry Man',
 'Italian Sub',
 'Lemon',
 'Meatlovers Pizza',
 'Milk',
 'Muenster Cheese',
 'Orange Juice',
 'Pear',
 'Popcorn',
 'Provolone Cheese',
 'Rice',
 'Shampoo',
 'Sliced Ham',
 'Sprite',
 'Toothpaste',
 'Tylenol',
 'Yogurt']

In [6]:
df1 = pd.DataFrame(items)
df2 = pd.DataFrame(items)
df1 = df1.merge(df2, how="cross")
df1 = df1[df1.iloc[:,0] != df1.iloc[:,1]]
df1

Unnamed: 0,0_x,0_y
1,Apple,Apple Juice
2,Apple,Banana
3,Apple,Batteries
4,Apple,Bread
5,Apple,Butter
...,...,...
894,Yogurt,Shampoo
895,Yogurt,Sliced Ham
896,Yogurt,Sprite
897,Yogurt,Toothpaste


In [7]:
df2 = pd.DataFrame(items)
df1 = df1.merge(df2, how="cross")
df1 = df1[df1.iloc[:,0] != df1.iloc[:,2]]
df1 = df1[df1.iloc[:,1] != df1.iloc[:,2]]
df1

Unnamed: 0,0_x,0_y,0
2,Apple,Apple Juice,Banana
3,Apple,Apple Juice,Batteries
4,Apple,Apple Juice,Bread
5,Apple,Apple Juice,Butter
6,Apple,Apple Juice,Classic Coke
...,...,...,...
26093,Yogurt,Tylenol,Rice
26094,Yogurt,Tylenol,Shampoo
26095,Yogurt,Tylenol,Sliced Ham
26096,Yogurt,Tylenol,Sprite


In [64]:
num_items = 2
df1 = pd.DataFrame(items)
for i in range(num_items - 1):
    df2 = pd.DataFrame(items)
    df1 = df1.merge(df2, how="cross")
    for j in range(i + 1):
        # Rename columns
        df1.columns.values[j] = "item_" + str(j + 1)
        df1.columns.values[j + 1] = "item_" + str(j + 2)  

        df1 = df1[df1.iloc[:,j] != df1.iloc[:,i + 1]]
#         print(i, j, i + 1)
#     print(i)
df1.drop_duplicates()
print(len(df1.index))
import itertools
len(list(itertools.combinations(items, num_items)))

870


435

In [71]:
df1[((df1["item_1"] == "Apple Juice") & (df1["item_2"] == "Apple")) | 
    ((df1["item_2"] == "Apple Juice") & (df1["item_1"] == "Apple"))]

Unnamed: 0,item_1,item_2
1,Apple,Apple Juice
30,Apple Juice,Apple
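
The two rows above show why the cross join yields 870 pairs while `itertools.combinations` yields 435: every pair appears twice, once in each order. A minimal sketch of removing the mirrored copies, using a four-item subset of the inventory for brevity:

```python
from itertools import product

# Small subset of the inventory, for illustration
sample_items = ["Apple", "Apple Juice", "Banana", "Batteries"]

# Cross join without self-pairs: both (a, b) and (b, a) are present
ordered_pairs = [(a, b) for a, b in product(sample_items, repeat=2) if a != b]

# Keep only pairs whose first item sorts before the second,
# which drops the mirrored copy of each pair
unordered_pairs = [(a, b) for a, b in ordered_pairs if a < b]

print(len(ordered_pairs), len(unordered_pairs))
# 12 6
```

With the full list of 30 items, the same filter reduces 870 ordered pairs to 435 unordered ones.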


In [58]:
list(itertools.combinations(items, num_items))

[('Apple', 'Apple Juice'),
 ('Apple', 'Banana'),
 ('Apple', 'Batteries'),
 ('Apple', 'Bread'),
 ('Apple', 'Butter'),
 ('Apple', 'Classic Coke'),
 ('Apple', 'Coffee'),
 ('Apple', 'Eggs'),
 ('Apple', 'Fanta'),
 ('Apple', 'Grape'),
 ('Apple', 'Hard Salami'),
 ('Apple', 'Hot Pocket'),
 ('Apple', 'Hungry Man'),
 ('Apple', 'Italian Sub'),
 ('Apple', 'Lemon'),
 ('Apple', 'Meatlovers Pizza'),
 ('Apple', 'Milk'),
 ('Apple', 'Muenster Cheese'),
 ('Apple', 'Orange Juice'),
 ('Apple', 'Pear'),
 ('Apple', 'Popcorn'),
 ('Apple', 'Provolone Cheese'),
 ('Apple', 'Rice'),
 ('Apple', 'Shampoo'),
 ('Apple', 'Sliced Ham'),
 ('Apple', 'Sprite'),
 ('Apple', 'Toothpaste'),
 ('Apple', 'Tylenol'),
 ('Apple', 'Yogurt'),
 ('Apple Juice', 'Banana'),
 ('Apple Juice', 'Batteries'),
 ('Apple Juice', 'Bread'),
 ('Apple Juice', 'Butter'),
 ('Apple Juice', 'Classic Coke'),
 ('Apple Juice', 'Coffee'),
 ('Apple Juice', 'Eggs'),
 ('Apple Juice', 'Fanta'),
 ('Apple Juice', 'Grape'),
 ('Apple Juice', 'Hard Salami'),
 ('Appl

In [84]:
def Combination(inputArray, combinationArray, start, end, index, r):
    if index == r:
        for item in combinationArray:
            print(item, end = " ")
        print()
        return
    i = start
    # Stop early when too few characters remain to fill the combination
    while i <= end and end - i + 1 >= r - index:
        combinationArray[index] = inputArray[i]
        Combination(inputArray, combinationArray, i + 1, end, index + 1, r)
        i += 1
inputArray = "RGYBI"
n = len(inputArray)
r = 3
combinationArray = [0] * r
Combination(inputArray, combinationArray, 0, n - 1, 0, r)
# combinationArray

In [None]:
test = ["a", "b", "c", "d"]
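def combinations(item_list, n):
    # Base case: there is exactly one way to choose zero items
    if n == 0:
        return [[]]
    comb_list = []
    for i in range(len(item_list)):
        # Fix item i, then choose the remaining n - 1 items from those after it
        for rest in combinations(item_list[i + 1:], n - 1):
            comb_list.append([item_list[i]] + rest)
    return comb_list

print(combinations(test, 2))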

# Algorithm 5.1 Frequent itemset generation of the Apriori algorithm

```
  1: k = 1.
  2: Fk = {i | i ∈ I ∧ σ({i}) ≥ N × minsup}. {Find all frequent 1-itemsets}
  3: repeat
  4:   k = k + 1.
  5:   Ck = candidate-gen(Fk − 1). {Generate candidate itemsets.}
  6:   Ck = candidate-prune(Ck, Fk − 1). {Prune candidate itemsets.}
  7:   for each transaction t ∈ T do
  8:     Ct = subset(Ck, t). {Identify all candidates that belong to t.}
  9:     for each candidate itemset c ∈ Ct do
 10:       σ(c) = σ(c) + 1. {Increment support count.}
 11:     end for
 12:   end for
 13:   Fk = {c | c ∈ Ck ∧ σ(c) ≥ N × minsup}. {Extract the frequent k-itemsets.}
 14: until Fk = ∅
 15: Result = ∪Fk.
```
![image.png](attachment:image.png)
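
The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used earlier in this notebook: the function name is hypothetical, `min_support_count` stands for N × minsup, and candidate generation is done by joining pairs of frequent (k−1)-itemsets, which is one possible choice since the pseudocode leaves `candidate-gen` unspecified.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Level-wise frequent itemset generation, following Algorithm 5.1."""
    transactions = [frozenset(t) for t in transactions]

    # Lines 1-2: count single items and keep the frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_support_count}
    result = dict(frequent)

    k = 1
    while frequent:  # line 14: stop when no frequent k-itemset was found
        k += 1
        # Line 5: join pairs of frequent (k-1)-itemsets whose union has k items
        candidates = {a | b for a, b in combinations(list(frequent), 2)
                      if len(a | b) == k}
        # Line 6: prune candidates that have an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Lines 7-12: count the support of each remaining candidate
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Line 13: keep the frequent k-itemsets
        frequent = {c: n for c, n in counts.items() if n >= min_support_count}
        result.update(frequent)  # line 15: union of all frequent itemsets
    return result

# Hypothetical toy transactions, invented for this example
toy_transactions = [
    ["Rice", "Toothpaste", "Milk"],
    ["Rice", "Toothpaste"],
    ["Toothpaste", "Bread"],
    ["Milk", "Bread"],
]
result = apriori_frequent_itemsets(toy_transactions, 2)
for itemset, count in sorted(result.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning step is what separates this from the brute-force method: a candidate is counted only if all of its (k−1)-item subsets were already found to be frequent.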