# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Assignment 1: Market_Basket_Analysis

## Learning Objectives

At the end of the experiment, you will be able to:

* identify segments based on the overall buying behaviour

* identify the association rules for the items purchased

* implement the apriori algorithm from scratch

## Dataset

The dataset chosen for this assignment is the 'Grocery Store Dataset'. It contains 20 transactions of general items from a supermarket. For example, one transaction looks like 'BREAD, MILK, BISCUIT, CORNFLAKES'.

## Information

The Apriori algorithm is an influential algorithm for finding frequent sets of items in a dataset or database. It is mainly used for association rule mining. So, what exactly is association rule mining?

Alex goes to buy chips from the bakery. He grabs a Pepsi as well. The shop manager observes that not only Alex but people in general often tend to buy chips and Pepsi together. After finding this pattern, the shop manager arranges these items together and notices an increase in sales. This process of identifying the relationship between items is called association rule mining.

The key concept in the Apriori algorithm is the Apriori property, which states that:

- All subsets of a frequent itemset must be frequent
- If an itemset is infrequent, all its supersets will be infrequent.
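
The property can be checked on a small, made-up example. The sketch below uses four hypothetical transactions (not the grocery dataset) and a helper named `support`, introduced here only for illustration, to show that adding items to an itemset can only keep or lower its support.

```python
# Hypothetical toy transactions, for illustration only
toy_transactions = [
    {"BREAD", "MILK"},
    {"BREAD", "MILK", "BISCUIT"},
    {"BREAD", "BISCUIT"},
    {"MILK", "TEA"},
]

def support(item_set, transactions):
    """Fraction of transactions that contain every item of item_set."""
    hits = sum(1 for t in transactions if item_set <= t)
    return hits / len(transactions)

s_bread = support({"BREAD"}, toy_transactions)                   # 3 of 4 transactions
s_bread_milk = support({"BREAD", "MILK"}, toy_transactions)      # 2 of 4 transactions
s_all = support({"BREAD", "MILK", "BISCUIT"}, toy_transactions)  # 1 of 4 transactions

# A superset never has higher support than any of its subsets
print(s_bread, s_bread_milk, s_all)
```

So if {BREAD, MILK} fell below the chosen threshold, {BREAD, MILK, BISCUIT} could not reach it either, which is why Apriori can discard such supersets without counting them.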

**Important Definitions**

**Itemset:** A set of items is referred to as an itemset, and an itemset containing n items is called an n-itemset.

**SupportCount:** The number of transactions in which the itemset appears.

**MinimumSupportCount:** The minimum number of transactions in which an itemset must appear to be considered frequent.

**Frequent Itemset:** An itemset that satisfies the minimum support is a frequent itemset.

**Support:** The fraction of transactions in the dataset that contain the itemset, i.e., its SupportCount divided by the total number of transactions.

To know more about the Apriori algorithm, click [here](https://hackr.io/blog/what-is-apriori-algorithm)
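
To make the definitions concrete, here is a minimal sketch on five hypothetical transactions (not the grocery dataset). The threshold of 3 is an assumed MinimumSupportCount chosen only for this example.

```python
from collections import Counter

# Hypothetical toy transactions, for illustration only
toy_transactions = [
    ["BREAD", "MILK"],
    ["BREAD", "TEA"],
    ["BREAD", "MILK", "BISCUIT"],
    ["TEA"],
    ["MILK", "BISCUIT"],
]

# SupportCount of each 1-itemset: number of transactions containing the item
support_count = Counter(item for t in toy_transactions for item in set(t))

# Support: SupportCount divided by the total number of transactions
n = len(toy_transactions)
support = {item: count / n for item, count in support_count.items()}

# Frequent 1-itemsets for an assumed MinimumSupportCount of 3
min_support_count = 3
frequent_items = sorted(i for i, c in support_count.items() if c >= min_support_count)

print(support_count)
print(support)
print(frequent_items)
```

Here BREAD and MILK each appear in 3 of the 5 transactions (support 0.6) and are frequent, while TEA and BISCUIT appear in only 2 and are not.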

### Overview

![Overview](https://cdn.iisc.talentsprint.com/CDS/Apriori_Overview.JPG)

We will use the following helper functions to implement the Apriori algorithm:

* **get_support(transactions, item_set):** This function calculates the support value for the given item_set from the provided list of transactions.

* **self_join(frequent_item_sets_per_level, level):** This function performs a self-join on the given list of frequent itemsets of the previous level, and generates the candidate itemsets for the current level.

* **get_single_drop_subsets(item_set):** This function returns the subsets of the given item_set with one item less.

* **is_valid_set(item_set, prev_level_sets):** This checks if the given item_set is valid, i.e., whether all of its subsets with one item less are frequent. It relies on the fact that prev_level_sets contains only frequent item_sets, i.e., those whose support value is greater than or equal to the minimum support value.

* **pruning(frequent_item_sets_per_level, level, candidate_set):** This function performs the pruning step of the Apriori algorithm. It takes a list candidate_set of all the candidate itemsets for the current level and, for each candidate itemset, checks whether all of its subsets with one item less are frequent itemsets. If not, the candidate is pruned; if yes, it is added to the list post_pruning_set.

* **apriori(min_support):** This is the main function, which uses all the utility functions described above to implement the Apriori algorithm and generate the list of frequent itemsets for each level for the provided transactions and min_support value.

* **association_rules(min_confidence, support_dict):** This function generates the association rules in accordance with the given minimum confidence value and the provided dictionary of itemsets against their support values. It takes min_confidence and support_dict as parameters, and returns the rules as a list.

### Setup Steps:

In [None]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "" #@param {type:"string"}

In [None]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "" #@param {type:"string"}

In [None]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M8_AST_01_Market_Basket_Analysis_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    ipython.magic("sx wget https://cdn.exec.talentsprint.com/static/cds/content/GroceryStoreDataSet.csv")
    ipython.magic("sx wget https://cdn.exec.talentsprint.com/static/cds/content/grocerystoredataset.csv")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://learn-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else:
      return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



#### Import required packages

In [None]:
import pandas as pd
import numpy as np
from collections import defaultdict
from itertools import combinations


To know about the collections module, click [here](https://docs.python.org/3/library/collections.html)

To know about the itertools module, click [here](https://docs.python.org/3/library/itertools.html)

## Data Wrangling

### Load the data

In [None]:
store_data = pd.read_csv('/content/grocerystoredataset.csv',names=['Items'])
store_data.head()

A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:

- Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the minimum support threshold. These itemsets are called frequent itemsets.
- Rule Generation, whose objective is to extract all the high-confidence rules from the frequent itemsets found in the previous step.
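
The two thresholds used by these subtasks can be written compactly. This is a sketch of the standard definitions; the symbols $N$ (total number of transactions) and $\sigma(X)$ (SupportCount of itemset $X$) are notation introduced here, not names used in the code.

$$
\mathrm{support}(X) = \frac{\sigma(X)}{N},
\qquad
\mathrm{confidence}(A \rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
$$

An itemset $X$ is frequent when $\mathrm{support}(X)$ is at least the minimum support, and a rule $A \rightarrow B$ is kept when its confidence is at least the minimum confidence.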


#### Frequent Itemset Generation

Create the basket of transactions.

To know more about lambda functions, click [here](https://realpython.com/python-lambda/)
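
The effect of splitting with a lambda can be tried on plain strings first, without pandas. The rows below are hypothetical; the first one is formatted like the example transaction in the Dataset section. Whether the actual CSV has spaces after the commas is an assumption to verify, so the sketch shows both a plain split and a split that also strips whitespace.

```python
# Hypothetical raw rows, for illustration only
raw_rows = ["BREAD, MILK, BISCUIT, CORNFLAKES", "TEA,BREAD"]

# A plain split on ',' keeps any leading spaces in the item names
split_only = list(map(lambda row: row.split(","), raw_rows))

# Splitting and then stripping each piece gives clean item names
split_and_strip = list(
    map(lambda row: [item.strip() for item in row.split(",")], raw_rows)
)

print(split_only[0])
print(split_and_strip[0])
```

If item names carry stray spaces, ' MILK' and 'MILK' would be counted as different items, so it is worth inspecting the split result before building the basket.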

In [None]:
# Creating a list of items for each order by splitting with ','
store_data['Items'] = store_data['Items'].apply(lambda x: x.split(','))
store_data

In [None]:
# Finding the unique items from all the orders

unique_items = [] # defining an empty list

# Iterating over records in the dataset
for i in store_data['Items']:
    unique_items.extend(i) # extending the list with the items of this order

# extracting the unique items from the list by converting the list into the set
unique_items = set(unique_items)
unique_items

To know about extend click [here](https://www.techbeamers.com/python-list-extend/)
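
As a quick illustration of why `extend` is used here rather than `append`, the sketch below runs both on two hypothetical orders.

```python
orders_example = [["BREAD", "MILK"], ["MILK", "TEA"]]  # hypothetical orders

appended, extended = [], []
for order in orders_example:
    appended.append(order)   # adds the whole list as a single element
    extended.extend(order)   # adds each item of the list individually

unique_example = sorted(set(extended))  # converting to a set removes duplicates

print(appended)        # nested list of orders
print(extended)        # flat list of items, duplicates kept
print(unique_example)  # unique items
```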

In [None]:
# Initializing an empty dictionary (named item_columns to avoid shadowing the standard collections module)
item_columns = dict()

# Iterating over the unique items and building one 0/1 column per item
for item in unique_items:
    # item is the key; the value is a list with 1 where the order contains the item and 0 otherwise
    item_columns[item] = [1 if item in row else 0 for row in store_data['Items'] ]
orders = pd.DataFrame(item_columns)
orders

### Giving an index to each item to represent it with a number in each transaction

In [None]:
item_dict = dict(zip(unique_items, range(1, len(unique_items)+1)))
item_dict

In [None]:
item_list = list(orders.columns)
item_dict = dict()

# Iterating the item list
for i, item in enumerate(item_list):
    # Assigning index to item name to represent with a number
    item_dict[item] = i + 1

item_dict

### Identifying the pattern of the items purchased in each order

In [None]:
transactions = list()

for i, row in orders.iterrows():
    transaction = set()

    for item in item_dict:
        if row[item] != 0:
            transaction.add(item_dict[item])
    transactions.append(transaction)

transactions

#### Helper functions

Let's create a function that returns the support value for the given item_set from the provided list of transactions.

In [None]:
def get_support(transactions, item_set):
    match_count = 0 # initializing a variable to store the number of transactions where the given item_set is found.

    # Iterating through the list of transactions
    for transaction in transactions:
        if item_set.issubset(transaction): # checking whether the given item_set is a subset of the transaction or not
            match_count += 1 # incrementing the count when the above condition is met
    # support value calculated by dividing the match_count by total number of transactions is returned
    return float(match_count/len(transactions))


Let us create another helper function which performs a self-join on the given list of frequent itemsets of the previous level and generates the candidate itemsets for the current level.

In [None]:
# this function takes 2 arguments as input. the 1st argument is map of level to the list of itemsets found
# to be frequent for that level, 2nd argument is the current level number.
def self_join(frequent_item_sets_per_level, level):
    # initializing an empty list to store the current level candidates
    current_level_candidates = list()
    # Storing the list of frequent itemsets from the previous level
    last_level_items = frequent_item_sets_per_level[level - 1]

    # If there are no frequent itemsets from the previous level, return the empty list of candidates.
    # Otherwise, pair every itemset at index i with every later itemset at index j (j starts at i + 1),
    # so that each unordered pair is considered exactly once.
    if len(last_level_items) == 0:
        return current_level_candidates

    for i in range(len(last_level_items)):
        for j in range(i+1, len(last_level_items)):
            itemset_i = last_level_items[i][0]
            itemset_j = last_level_items[j][0]
            # union of itemsets at indices i and j
            union_set = itemset_i.union(itemset_j)

            # If this union_set is not already present in current_level_candidates and the
            # number of elements in the union_set is equal to the level number,
            # then this union_set is appended into current_level_candidates.
            if union_set not in current_level_candidates and len(union_set) == level:
                current_level_candidates.append(union_set)

    return current_level_candidates


We have a check for the number of elements in union_set to ensure that current_level_candidates contains only sets of a fixed length equal to the current level. This is a requirement of the Apriori algorithm.
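
To see the effect of the length check, here is a compact, self-contained stand-in for the join step. The function name `join_candidates` and the input itemsets are hypothetical; unlike `self_join`, it takes a plain list of sets rather than the per-level dictionary of (itemset, support) pairs.

```python
from itertools import combinations

def join_candidates(prev_level_sets, level):
    """Union every pair of previous-level itemsets; keep unions of size `level`."""
    candidates = []
    for a, b in combinations(prev_level_sets, 2):
        union_set = a | b
        if len(union_set) == level and union_set not in candidates:
            candidates.append(union_set)
    return candidates

# Hypothetical frequent 2-itemsets (items encoded as integers)
frequent_2 = [{1, 2}, {1, 3}, {2, 3}, {3, 4}]
candidates_3 = join_candidates(frequent_2, 3)
print(candidates_3)
```

The pair {1, 2} and {3, 4} has a union of four items, so it is rejected at level 3, and {1, 2, 3} is produced by several pairs but stored only once.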

Let's create another function that returns the subsets of the given items with one item less.

In [None]:
def get_single_drop_subsets(item_set):
    # initializing an empty list
    single_drop_subsets = list()

    # Iterating over each item in the item set
    for item in item_set:
        # creating a temporary copy of the item_set given
        temp = item_set.copy()
        # removing this item from the temporary item set (a subset of length one less than the length of the item_set)
        temp.remove(item)
        # append this temporary set to the single_drop_subsets
        single_drop_subsets.append(temp)

    return single_drop_subsets

Now let's create another function, **is_valid_set()**, that checks if the given item_set is valid, i.e., whether all of its subsets with one item less are frequent. It relies on the fact that prev_level_sets contains only frequent item_sets, i.e., those whose support value is greater than or equal to the minimum support value.

In [None]:

def is_valid_set(item_set, prev_level_sets):
    # generating all the subsets of the given item_set with length one less than the length of the original item_set.
    # This is done using the above described function get_single_drop_subsets()
    single_drop_subsets = get_single_drop_subsets(item_set)

    # iterating through the single_drop_subsets list.
    for single_drop_set in single_drop_subsets:
        # checks if it was present in the prev_level_sets. If it wasn’t it means the given
        # item_set is a superset of a non-frequent item_set. Thus, it returns False
        # If all the single_drop_subsets are frequent itemsets, and are present in the prev_level_sets, it returns True
        if single_drop_set not in prev_level_sets:
            return False
    return True

Now let's perform the pruning step of the Apriori algorithm. The function takes a list candidate_set of all the candidate itemsets for the current level and, for each candidate itemset, checks whether all of its subsets with one item less are frequent itemsets. If not, the candidate is pruned; if yes, it is added to the list post_pruning_set.
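
Before looking at the implementation, the pruning check can be sketched on made-up data. The helper name `has_all_frequent_subsets` and the itemsets are hypothetical; the sketch uses `itertools.combinations` to enumerate the one-item-less subsets instead of the notebook's `get_single_drop_subsets`.

```python
from itertools import combinations

def has_all_frequent_subsets(candidate, prev_level_sets):
    """True if every subset with one item removed is in prev_level_sets."""
    k = len(candidate) - 1
    return all(set(sub) in prev_level_sets for sub in combinations(candidate, k))

# Hypothetical frequent 2-itemsets and level-3 candidates
frequent_2 = [{1, 2}, {1, 3}, {2, 3}, {3, 4}]
candidates_3 = [{1, 2, 3}, {1, 3, 4}, {2, 3, 4}]

kept = [c for c in candidates_3 if has_all_frequent_subsets(c, frequent_2)]
print(kept)
```

Only {1, 2, 3} survives: {1, 3, 4} is pruned because {1, 4} is not frequent, and {2, 3, 4} is pruned because {2, 4} is not frequent. Neither needs a pass over the transactions to be ruled out.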

In [None]:
def pruning(frequent_item_sets_per_level, level, candidate_set):
    # Initializing empty list to store the list of frequent itemsets for the current level
    # after performing pruning operation on the given list of candidate sets.
    post_pruning_set = list()
    # If there are no candidate_set, it returns an empty list.
    # Otherwise, it first creates a list of frequent itemsets from the previous level and stores it in prev_level_sets.
    if len(candidate_set) == 0:
        return post_pruning_set

    prev_level_sets = list()
    for item_set, _ in frequent_item_sets_per_level[level - 1]:
        prev_level_sets.append(item_set)

    # Iterating over each item_set in candidate_set list
    for item_set in candidate_set:
        # checking whether it is a valid itemset or not
        if is_valid_set(item_set, prev_level_sets):
            # If this item_set is valid, it is appended to the list of post_pruning_set.
            post_pruning_set.append(item_set)

    return post_pruning_set

### Apriori algorithm

Now let us use all the above defined helper functions to implement the Apriori Algorithm and generate the list of frequent itemsets for each level for the provided transactions and min_support value.

In [None]:
def apriori(min_support):
    # creating a default empty dictionary which maps level numbers to the list of frequent itemsets for that level
    frequent_item_sets_per_level = defaultdict(list)
    print("level : 1", end = " ")

    # iterating through the list of all items item_list
    for item in range(1, len(item_list) + 1):
        # calculate the support value of each item using the helper function get_support().
        # If this support value is greater than or equal to the provided min_support value,
        # this item_set is added to the list of frequent itemsets for this level.
        support = get_support(transactions, {item})
        if support >= min_support:
            # every itemset is stored as a pair of 2 values: item, support
            frequent_item_sets_per_level[1].append(({item}, support))

    # For each level greater than 1, generate the current_level_candidates itemsets
    # by performing self_join() on the frequent itemsets of the previous level.
    for level in range(2, len(item_list) + 1):
        print(level, end = " ")
        current_level_candidates = self_join(frequent_item_sets_per_level, level)

        # perform the pruning operation on these current_level_candidates using the pruning()
        # helper function defined above, and obtain the results in post_pruning_candidates
        post_pruning_candidates = pruning(frequent_item_sets_per_level, level, current_level_candidates)

        # if there is no itemset left after pruning, we break the loop.
        # It means there is no point in processing for further levels.
        # Otherwise, for each item_set in post_pruning_candidates,
        # we calculate the support value using the get_support() helper function.
        if len(post_pruning_candidates) == 0:
            break

        for item_set in post_pruning_candidates:
            support = get_support(transactions, item_set)
            # If this support value is greater than or equal to the given min_support,
            # we append this item_set into the list of frequent itemsets for this level.
            if support >= min_support:
                frequent_item_sets_per_level[level].append((item_set, support))

    return frequent_item_sets_per_level

In [None]:
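# defining the minimum support value as 0.005
# (with 20 transactions, a single occurrence already gives support 1/20 = 0.05,
# so at this threshold every itemset that occurs at least once counts as frequent)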
min_support = 0.005
frequent_item_sets_per_level = apriori(min_support)

In [None]:
for level in frequent_item_sets_per_level:
    print(len(frequent_item_sets_per_level[level]))

In [None]:
for level in frequent_item_sets_per_level:
    print(frequent_item_sets_per_level[level])

In [None]:
# Creating a dictionary that maps itemsets (by item name) to their support values.
item_support_dict = dict()

# Reverse lookup from item index to item name. A separate name is used here so that
# the global item_list, which apriori() reads, is not overwritten.
index_to_name = {index: name for name, index in item_dict.items()}

# For each level in frequent_item_sets_per_level and each (itemset, support) pair,
# translate the item indices back to names and store the support value
# against the frozenset of names.
for level in frequent_item_sets_per_level:
    for item_set, support in frequent_item_sets_per_level[level]:
        item_names = frozenset(index_to_name[i] for i in item_set)
        item_support_dict[item_names] = support

In [None]:
item_support_dict

**find_possible_subsets()** takes an itemset from item_support_dict and its length item_length as parameters, and returns all non-empty combinations of its elements, from single items up to the full itemset.
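
As a quick check of what `itertools.combinations` produces, the sketch below enumerates the non-empty subsets of a hypothetical three-item itemset.

```python
from itertools import combinations

item = frozenset({"BREAD", "MILK", "TEA"})  # hypothetical frequent itemset

# Non-empty subsets of every size, from single items up to the full itemset
subsets = [
    set(c)
    for size in range(1, len(item) + 1)
    for c in combinations(item, size)
]

print(len(subsets))
```

An itemset with n items has 2^n - 1 non-empty subsets, so three items give seven. The full itemset is among them; during rule generation it leaves an empty remainder and is therefore skipped.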

In [None]:
def find_possible_subsets(item, item_length):
    # creating empty list to store a list of combinations.
    combs = []

    # Iterating over subset sizes from 1 up to the length of the itemset
    for i in range(1, item_length + 1):
        # appending the list of all combinations of size i to the combs list.
        combs.append(list(combinations(item, i)))

    # Creating a subset array
    subsets = []
    for comb in combs:
        for elt in comb:
            subsets.append(elt)

    return subsets

### Generate the Association rules
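
The confidence of a rule A -> B is the support of A and B together divided by the support of A. The sketch below works this out for made-up support values; the dictionary and the helper name `confidence` are hypothetical and are not computed from the grocery dataset.

```python
# Hypothetical support values, for illustration only
support_dict = {
    frozenset({"BREAD"}): 0.6,
    frozenset({"MILK"}): 0.5,
    frozenset({"BREAD", "MILK"}): 0.4,
}

def confidence(antecedent, consequent, support_dict):
    """support(A union B) / support(A) for the rule A -> B."""
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    return support_dict[ab] / support_dict[a]

conf_bread_milk = confidence({"BREAD"}, {"MILK"}, support_dict)  # 0.4 / 0.6
conf_milk_bread = confidence({"MILK"}, {"BREAD"}, support_dict)  # 0.4 / 0.5

print(round(conf_bread_milk, 3), round(conf_milk_bread, 3))
```

Confidence is not symmetric: in this example BREAD -> MILK has confidence of about 0.667, while MILK -> BREAD has 0.8, so a minimum confidence of 0.7 would keep only the second rule.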

In [None]:
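item = {1,2,3}
b = item.difference({1,2,3}) # difference of a set with itself is the empty set
if b: # an empty set is falsy, so this branch is skipped
  print("Do something")
item | b # union with the empty set gives back the original set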

In [None]:
def association_rules(min_confidence, support_dict):
    """For every itemset with more than one element, find all of its non-empty subsets
        by calling find_possible_subsets(item, item_length).
        For every subset A, compute the set B = itemset - A.
        If B is not empty, compute the confidence of the rule A -> B as support(A U B) / support(A).
        If this value is at least min_confidence, add (A, B, confidence) to the list of rules."""
    rules = list()
    for item, support in support_dict.items():
        item_length = len(item)

        if item_length > 1:
            subsets = find_possible_subsets(item, item_length)

            for A in subsets:
                B = item.difference(A)
                if B:
                    A = frozenset(A)

                    AB = A | B

                    confidence = support_dict[AB] / support_dict[A]
                    if confidence >= min_confidence:
                        rules.append((A, B, confidence))

    return rules

In [None]:
# Storing the result under a different name so that the association_rules function
# is not overwritten and this cell can be re-run
rules = association_rules(min_confidence = 0.6, support_dict = item_support_dict)

In [None]:
print("Number of rules: ", len(rules), "\n")

for rule in rules:
    print('{0} -> {1} <confidence: {2}>'.format(set(rule[0]), set(rule[1]), rule[2]))

### Please answer the questions below to complete the experiment:




In [None]:
#@title Mark the following statement as True or False: The pairs (Biscuit, Coffee) and (Biscuit, Coke) do not meet the minimum support and hence are not frequent. However, any larger set containing (Biscuit, Coffee) or (Biscuit, Coke) can still be frequent { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "" #@param ["","True", "False"]


In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}


In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]


In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")