# CSE 5243 - Introduction to Data Mining
## Homework 5: Association Analysis
- Semester: Fill in
- Instructor: Fill in
- Section: Fill in days, start time 
- Student Name: John Smith
- Student Email: smith.12345@osu.edu

Template Version V2.
***

# Introduction

### Objectives

In this lab, you will use a grocery dataset provided on Carmen to find potential association rules.

The objectives of this assignment are:
- Practice the Association Analysis content we covered this semester.
- Understand “why” the particular topics, techniques, etc., are important from a practical perspective.
- Understand how to choose and use appropriate tools to solve the provided problems.

### The Dataset
- The workbook contains a market basket dataset of transactions.
- The data file captures the data in "long format". Specifically, each row holds one transaction ID and one item. A transaction with multiple items therefore spans multiple rows.
- You can process the data however you like, but it is recommended that you convert it into a one-hot-encoded data structure. This will allow you to easily use the mlxtend package.
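
To make the difference between long format and one-hot encoding concrete, here is a minimal sketch on toy data. The column name `TxID` and the item values are assumptions made for illustration, and `pd.crosstab` is just one possible way to do the conversion.

```python
import pandas as pd

# Toy long-format data: one row per (transaction, item) pair.
# 'TxID' is a hypothetical column name; the items are made up.
long_df = pd.DataFrame({
    "TxID":   [1, 1, 2, 2, 2, 3, 3, 4, 5, 5],
    "ItemID": ["bread", "milk", "bread", "milk", "eggs",
               "bread", "eggs", "milk", "bread", "milk"],
})

# Count each item per transaction, then reduce the counts to True/False flags.
ohe_toy = pd.crosstab(long_df["TxID"], long_df["ItemID"]) > 0
print(ohe_toy)
```

Each row of `ohe_toy` is one transaction and each column is one item, which is the shape the mlxtend frequent-pattern functions expect.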

## The Business Problem
- Assume this dataset contains all of the transactions for one month for our store.  We wish to find association rules that would improve our revenue as follows:
  - We would discount one of our products by **10%** each month, with the hope that this would encourage customers to visit our store to purchase that product **8%** more frequently, and also purchase other products (that are not discounted) more frequently.
- Practically speaking, we would like to come up with **two-item** rules (one antecedent and one consequent: A -> B) and choose the one that adds the most to our revenue (based on the rule's support, confidence, etc.).
- For simplicity, don't consider complex rule interactions (e.g., A->B and B->C, A->B and B->A, etc.).  Assume each rule is completely separate.

### Proper Answers
- **IMPORTANT:** **Show your work** and **explain it**.  This will help us give partial credit in some cases.

### Collaboration
For this assignment, you should work as an individual. You may informally discuss ideas with classmates, but your work should be your own.

### What You Need to Turn In
- Submit this Jupyter Notebook in .IPYNB format.  Do not "zip" the file.

### Notes
- Feel free to use the **mlxtend** package throughout this assignment.
  - See: **https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/#example-1-generating-association-rules-from-frequent-itemsets**
***

***
# Section: 1 - Get Ready
1A) Load the data and prepare it for association analysis. Use Python helper methods where convenient, including the tools from the example we covered.
- Suggestion: One-hot encode the data.
***

In [None]:
#Note: If the mlxtend library is not installed, uncomment the following line (once) and run it.
#!pip install mlxtend
import numpy as np
import pandas as pd
import mlxtend as mlx
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import fpgrowth

pd.set_option('display.max_columns', 1000) #include to avoid ... in middle of display (and use 'display(...)' when printing in cells)

In [None]:
# From the Business Problem above, discounting a product by X% causes customers to purchase the product Y% more frequently.
price_discount  = 0.10
purchase_uplift = 0.08

In [None]:
data_file_name = 'TEB_Groceries_dataset6.xlsx'

In [None]:
# load the Transaction data (see the ReadMe sheet in the Excel workbook for more information)
transaction_df = pd.read_excel(data_file_name, sheet_name='TxToItem')
transaction_df

In [None]:
# load the Item data
item_df = pd.read_excel(data_file_name, sheet_name='Items')
item_df

In [None]:
# Check for null entries (and fix them if necessary)
transaction_df.isnull().values.any()

In [None]:
transaction_df.head()

In [None]:
# Find the unique items, sorted
items = sorted(transaction_df['ItemID'].unique())

In [None]:
# One-hot encode the Transaction data
from mlxtend.preprocessing import TransactionEncoder
from collections import defaultdict

# Group the long-format rows into one item list per transaction.
# Assumes column 0 is the transaction ID and column 1 is the item ID.
transaction_items = defaultdict(list)
for transaction in transaction_df.values.tolist():
    transaction_items[transaction[0]].append(transaction[1])

dataset_modified = list(transaction_items.values())

# Encode the item lists as a boolean matrix: one row per transaction, one column per item.
te = TransactionEncoder()
te_ary = te.fit(dataset_modified).transform(dataset_modified)

ohe_df = pd.DataFrame(te_ary, columns=te.columns_)
display(ohe_df.head(5))

***
# Section: 2 - Explore the Data
***

***
## Section: 2.1 - Get the Transaction and Item Sizes
- Calculate the **number_of_transactions** and **number_of_items**.
***

***
## Section: 2.2 - Evaluate the Itemset and Rule Size & Complexity
- Calculate the **maximum number of Itemsets** that could be created from the items (without considering the actual transaction data). Show your work.
- Calculate the **maximum number of Rules** that can be created from the items (without considering the actual transaction data). Show your work.
- What do the calculations suggest as a **potential cause of concern**?
- What might you do to manage these concerns?
***
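
As a rough sketch of the counting involved, assume $d$ distinct items. There are then $2^d - 1$ non-empty itemsets and $3^d - 2^{d+1} + 1$ rules with a non-empty antecedent and consequent. Whether the empty itemset is counted is a convention; it is excluded here. The function names and the value `d = 6` are illustrative only and are not taken from the dataset.

```python
from math import comb

def max_itemsets(d: int) -> int:
    """Number of non-empty itemsets over d items."""
    return 2 ** d - 1

def max_rules(d: int) -> int:
    """Number of rules X -> Y with X and Y non-empty and disjoint (closed form)."""
    return 3 ** d - 2 ** (d + 1) + 1

def max_rules_by_sum(d: int) -> int:
    """Same count, built up directly: each k-itemset splits into 2^k - 2 rules."""
    return sum(comb(d, k) * (2 ** k - 2) for k in range(2, d + 1))

def max_two_item_rules(d: int) -> int:
    """Rules with exactly one antecedent item and one consequent item."""
    return d * (d - 1)

d = 6  # toy value, not the dataset's item count
print(max_itemsets(d), max_rules(d), max_rules_by_sum(d), max_two_item_rules(d))
```

Both counts grow exponentially in `d`, while restricting the search to two-item rules keeps the count quadratic.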

**Discussion:**

***
# Section: 3 - Itemset Generation
- Create the 20 two-itemsets with the highest support. Sort them in decreasing order of support.
- NOTE: You can call the **apriori** function with **max_len=2** and try various values for **min_support** until you get the right number of 2-itemsets. Then filter the result down to only the 2-itemsets. Keep the combined 1- and 2-itemsets (1and2_itemsets) and the 1-itemsets around, though; they might be useful later.
- Show the results, briefly.
- Explain what you did and why you did it.
***

***
# Section: 4 - Generate Rules
- For the two-itemsets created above, create the related rules.
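- Use the **association_rules** function. You need to pass the 1and2_itemsets to the function, because computing the confidence of A -> B requires the support of A as well as the support of {A, B}. If you set **min_threshold=0.0** (with the default metric, confidence), you will get all of the rules.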
***
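
To show what the rule metrics mean, the sketch below computes support, confidence, and lift by hand for a toy pair of rules. This is a plain-Python illustration of the definitions, not a call into mlxtend; the transactions and the helper names `support` and `rule_metrics` are made up for the example.

```python
# Toy transactions; not the homework dataset.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk"},
    {"bread", "milk"},
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the two-item rule antecedent -> consequent."""
    s_ab = support({antecedent, consequent})
    s_a = support({antecedent})
    s_b = support({consequent})
    confidence = s_ab / s_a          # P(B | A)
    lift = confidence / s_b          # P(B | A) / P(B)
    return {"support": s_ab, "confidence": confidence, "lift": lift}

# The same 2-itemset yields two rules with different confidence.
for a, b in [("bread", "eggs"), ("eggs", "bread")]:
    print(a, "->", b, rule_metrics(a, b))
```

Support and lift are symmetric in A and B, but confidence is not, which is why each 2-itemset produces two distinct rules.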

***
# Section: 5 - Rule Evaluation
- For the rules created above, find the single Item (that would be given the discount) that would cause the greatest increase in monthly store revenue.
  - This is based on the Business Problem stated at the top of this notebook.
  - Consider:
    - How much will the store's monthly revenue decrease (or increase) due to the change in price for the chosen Item (and its increased sales)?
    - How much will the store's monthly revenue increase (or decrease) due to the increased sales of the associated Items?
***
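
One possible way to frame the two revenue effects is sketched below. It rests on assumptions that the Business Problem does not spell out: that "8% more frequently" means 8% more units of the discounted Item A, and that each extra purchase of A brings a purchase of Item B with probability equal to the confidence of A -> B. The function name, prices, unit counts, and confidence value are toy numbers, not data from the workbook.

```python
# Values from the Business Problem.
price_discount = 0.10
purchase_uplift = 0.08

def revenue_change(price_a, units_a, price_b, confidence_ab):
    """Monthly revenue change for discounting A, under the stated assumptions."""
    # Item A: each unit sells for less, but more units are sold.
    own_change = price_a * units_a * ((1 - price_discount) * (1 + purchase_uplift) - 1)
    # Item B (assumption): extra A purchases carry B at the rate confidence(A -> B).
    extra_units_b = confidence_ab * purchase_uplift * units_a
    associated_gain = price_b * extra_units_b
    return own_change, associated_gain, own_change + associated_gain

# Toy inputs: A costs 4.00 and sells 500 units; B costs 6.00; confidence(A -> B) = 0.5.
own, assoc, net = revenue_change(price_a=4.00, units_a=500, price_b=6.00, confidence_ab=0.5)
print(round(own, 2), round(assoc, 2), round(net, 2))
```

Because 0.90 × 1.08 = 0.972, the discount alone lowers Item A's revenue by 2.8%, so under this model a rule only pays off when the associated Item's gain exceeds that loss.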

**Discussion:** **XXXXX** looks like the **winner**!

***
# Section: 6 - Calculate Inventory Needs
- Based on the Chosen Item, how much additional inventory, for which Items, will be needed to support the additional sales?
***

***
# Section: 7 - Conclusions
- Write a paragraph on what you discovered or learned from this homework.
***

***
### END-OF-SUBMISSION
***