# Frequent Itemset Mining

### Student: Rodolfo Lerma

Learning Objectives:

- Extract frequent patterns given a corpus of data.
- Find the rules which are interesting and non-obvious for a given domain.

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
df = pd.read_excel('Online Retail.xlsx')
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [3]:
df.shape

(541909, 8)

In [4]:
columns = df.columns.to_list()

In [5]:
df2 = pd.read_excel('online_retail_II.xlsx')
df2.rename({'Invoice':'InvoiceNo', }, axis = 1, inplace = True)
df2.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [6]:
df3 = df2.set_axis(columns, axis=1)  # returns a new frame; the inplace argument was removed in pandas 2.0

In [7]:
df3.shape

(525461, 8)

### Question 1.1: Concatenate both dataframes to create a single dataframe. Remove any rows where InvoiceNo is Null and Quantity is Negative

In [8]:
frames = [df, df3]
data = pd.concat(frames, axis = 0, ignore_index = True ,sort=False)

In [9]:
data.tail(2)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
1067368,538171,20970,PINK FLORAL FELTCRAFT SHOULDER BAG,2,2010-12-09 20:01:00,3.75,17530.0,United Kingdom
1067369,538171,21931,JUMBO STORAGE BAG SUKI,2,2010-12-09 20:01:00,1.95,17530.0,United Kingdom


In [10]:
data.shape

(1067370, 8)

The row and column counts match the two source datasets: 541,909 + 525,461 = 1,067,370 rows, each with 8 columns.

In [11]:
data.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [12]:
data.isnull().sum() #Looking for NULL Values

InvoiceNo           0
StockCode           0
Description      4382
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     243007
Country             0
dtype: int64

The **InvoiceNo** column has no null values, so only the quantity filter is needed.

In [13]:
data = data[data['Quantity'] > 0] 
data.shape #Check point to verify the size of the filtered dataset

(1044420, 8)
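As a minimal sketch of the cleaning rule from Question 1.1, the snippet below applies both conditions (non-null invoice number, positive quantity) to a small made-up frame. The values in `toy` are hypothetical and only illustrate the mask; they are not taken from the retail data.

```python
import pandas as pd

# Toy frame (hypothetical values): one missing invoice and one return.
toy = pd.DataFrame({
    "InvoiceNo": [536365, None, 536367, "C536379"],
    "Quantity": [6, 3, 32, -1],
})

# Keep rows that have an invoice number and a strictly positive quantity.
mask = toy["InvoiceNo"].notna() & (toy["Quantity"] > 0)
clean = toy[mask]
print(clean)
```

Combining both conditions in one mask keeps the filter valid even if a later version of the data does contain null invoice numbers.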

### Question 1.2: Filter the data by only transactions that happened in the United Kingdom

In [14]:
data = data[data['Country'] == 'United Kingdom'] 
data.shape

(961224, 8)

### Question 1.3: What are the most popular 5 items?

In [15]:
data = data.astype({'StockCode': str})
data[data['StockCode'] == '85123A'].head(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
49,536373,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 09:02:00,2.55,17850.0,United Kingdom
66,536375,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 09:32:00,2.55,17850.0,United Kingdom


In [16]:
# For each item, collect its total quantity, its description, and the first invoice it appears in
items_list = data['StockCode'].unique().tolist()
total_sum_per_item = []
description_summary = []
invoice_summary = []
for i in items_list:
    x = data[data['StockCode'] == i]                      # all rows for this item
    total_sum_per_item.append(x['Quantity'].sum())        # total units sold
    description_summary.append(x['Description'].iloc[0])  # first description seen
    invoice_summary.append(x['InvoiceNo'].iloc[0])        # first invoice seen

In [17]:
data_summary = pd.DataFrame(list(zip(invoice_summary, items_list, total_sum_per_item, description_summary)),
               columns =['InvoiceNo','StockCode', 'Total_Qty', 'Description'])

In [18]:
data_sum_sorted = data_summary.sort_values(by = 'Total_Qty', ascending = False)
data_sum_sorted.head()

Unnamed: 0,InvoiceNo,StockCode,Total_Qty,Description
1341,536615,84077,101464,WORLD WAR 2 GLIDERS ASSTD DESIGNS
0,536365,85123A,92476,WHITE HANGING HEART T-LIGHT HOLDER
121,536386,85099B,89143,JUMBO BAG RED RETROSPOT
142,536390,22197,84149,SMALL POPCORN HOLDER
3935,581483,23843,80995,"PAPER CRAFT , LITTLE BIRDIE"


The 5 most popular items in this store, measured by total quantity sold, are:
- World War 2 Gliders Asstd Designs
- White Hanging Heart T-Light Holder
- Jumbo Bag Red Retrospot
- Small Popcorn Holder
- Paper Craft, Little Birdie
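The per-item totals can also be computed in a single pass with `groupby` instead of filtering the frame once per item. The sketch below uses a hypothetical `toy` frame with invented stock codes; note that the `"first"` aggregation skips missing descriptions, which differs slightly from `.iloc[0]`.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the notebook.
toy = pd.DataFrame({
    "StockCode": ["A", "B", "A", "C", "B", "A"],
    "Description": ["ITEM A", "ITEM B", "ITEM A", "ITEM C", "ITEM B", "ITEM A"],
    "Quantity": [5, 2, 7, 1, 4, 3],
})

# Sum the quantity per item and keep one description for display.
totals = (
    toy.groupby("StockCode", sort=False)
       .agg(Total_Qty=("Quantity", "sum"), Description=("Description", "first"))
       .sort_values("Total_Qty", ascending=False)
)
top_2 = totals.head(2)
print(top_2)
```

On a frame with about a million rows this avoids one full scan per distinct item, which is where the loop spends most of its time.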

### Question 1.4: Filter down the data to include transactions that contain the top 20 items

In [19]:
data_sum_sorted.head(20)

Unnamed: 0,InvoiceNo,StockCode,Total_Qty,Description
1341,536615,84077,101464,WORLD WAR 2 GLIDERS ASSTD DESIGNS
0,536365,85123A,92476,WHITE HANGING HEART T-LIGHT HOLDER
121,536386,85099B,89143,JUMBO BAG RED RETROSPOT
142,536390,22197,84149,SMALL POPCORN HOLDER
3935,581483,23843,80995,"PAPER CRAFT , LITTLE BIRDIE"
2913,541431,23166,77036,MEDIUM CERAMIC TOP STORAGE JAR
9,536367,84879,76021,ASSORTED COLOUR BIRD ORNAMENT
49,536378,21212,72389,PACK OF 72 RETROSPOT CAKE CASES
929,536544,17003,71000,BROCADE RING PURSE
51,536378,21977,46471,PACK OF 60 PINK PAISLEY CAKE CASES


In [20]:
first_20_list = data_sum_sorted['StockCode'].head(20).to_list()
first_20 = data['StockCode'].isin(first_20_list)
index = data.index
indexes = index[first_20]
indices_list = indexes.tolist()

In [21]:
# Rows whose StockCode is one of the top 20 items
data_20 = data.loc[indices_list]

In [22]:
# Invoices that contain at least one of the top 20 items
invoice_20 = data_20['InvoiceNo'].unique().tolist()

In [23]:
first_20_invoice = data['InvoiceNo'].isin(invoice_20)
index_invoice = data.index
indexes_invoice = index_invoice[first_20_invoice]  # was index[...], which reused the item-level index
indices_list_invoice = indexes_invoice.tolist()    # was indexes.tolist(), which kept only top-20 item rows

In [24]:
# Data Frame with every row of the invoices that include at least one of the 20 items
data_invoice_20 = data.loc[indices_list_invoice]
data_invoice_20.head(2)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [25]:
data_invoice_20['InvoiceNo'][0]

536365
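The invoice-level filter for Question 1.4 can be sketched with two boolean masks: one to find the invoices that contain a top item, and one to keep every row of those invoices. The frame `toy` and the list `top_items` are hypothetical stand-ins for `data` and `first_20_list`.

```python
import pandas as pd

# Toy invoices (hypothetical values): invoice 2 contains no top item.
toy = pd.DataFrame({
    "InvoiceNo": [1, 1, 2, 3, 3],
    "StockCode": ["A", "X", "Y", "B", "Z"],
})
top_items = ["A", "B"]  # stand-in for first_20_list

# Invoices that contain at least one top item.
hit_invoices = toy.loc[toy["StockCode"].isin(top_items), "InvoiceNo"].unique()

# Every row of those invoices, including items bought alongside the top items.
kept = toy[toy["InvoiceNo"].isin(hit_invoices)]
print(kept)
```

Keeping the companion items matters for association rules, because a rule needs to see what else was in the basket, not only the top items themselves.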

In [34]:
# Build one list of StockCodes per invoice.
# The earlier nested loop called .loc[i] with positional integers and raised a
# KeyError, because the filtered frame keeps its original, non-contiguous index labels.
list_of_list = (
    data_invoice_20.groupby('InvoiceNo', sort=False)['StockCode']
    .apply(list)
    .tolist()
)

### Question 2.1: Consolidate the items into 1 transaction per row and each product one-hot encoded.

In [None]:
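#Create the "basket": one row per invoice, one boolean column per StockCode
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(list_of_list).transform(list_of_list)
basket = pd.DataFrame(te_ary, columns=te.columns_)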


In [None]:
#Check to make sure you did it right
basket

### Question 2.2: Convert all the values to 1 when values are greater than 0 and 0 when values are 0 or less.
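One possible way to approach this step, assuming the basket is built from summed quantities rather than from `TransactionEncoder`, is to pivot the rows into an invoice-by-item table and then threshold it. The frame `toy` below is hypothetical.

```python
import pandas as pd

# Toy line items (hypothetical values); invoice 2 buys item A twice.
toy = pd.DataFrame({
    "InvoiceNo": [1, 1, 2, 2, 3],
    "StockCode": ["A", "B", "A", "A", "C"],
    "Quantity": [6, 2, 1, 3, 12],
})

# One row per invoice, one column per item, cells hold summed quantities.
qty = toy.pivot_table(index="InvoiceNo", columns="StockCode",
                      values="Quantity", aggfunc="sum", fill_value=0)

# 1 where the item was bought at all, 0 where the total is 0 or less.
basket_toy = (qty > 0).astype(int)
print(basket_toy)
```

A comparison followed by `astype(int)` handles zero and negative totals in one step; dropping the `astype(int)` leaves a boolean frame, which `apriori` also accepts.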

### Question 3.1: Apply [apriori](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/) algorithm to generate frequent item sets that have a support of at least 7%
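To make the support threshold concrete, here is a plain-Python sketch that enumerates itemsets on a tiny made-up dataset and keeps those whose support meets a minimum. It is an illustration of the idea, not mlxtend's `apriori` implementation: it checks every combination of a given size and only borrows the Apriori stopping rule. The `transactions` list is hypothetical, and `min_support` is raised from the assignment's 0.07 to 0.5 so that the toy data actually filters something.

```python
from itertools import combinations

# Toy transactions (hypothetical): each set is one invoice.
transactions = [
    {"A", "B"},
    {"A", "C"},
    {"A", "B", "C"},
    {"B"},
]
min_support = 0.5  # the assignment uses 0.07; raised here for the toy data


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    found = False
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= min_support:
            frequent[frozenset(combo)] = s
            found = True
    if not found:
        break  # no frequent set of this size, so no larger set can be frequent

for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

With the real basket, the equivalent call would be `apriori(basket, min_support=0.07, use_colnames=True)`, which returns a dataframe of itemsets and their supports.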

### Question 3.2: Generate the association rules with their corresponding support, confidence and lift.
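The three metrics requested here can be worked out by hand for a single rule, which helps when reading the table that `association_rules` returns. The sketch below computes them for the rule A -> B on a hypothetical set of four transactions; the names `antecedent` and `consequent` mirror the column names mlxtend uses.

```python
# Toy transactions (hypothetical): each set is one invoice.
transactions = [
    {"A", "B"},
    {"A", "B"},
    {"C"},
    {"A"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


antecedent = {"A"}
consequent = {"B"}

# support(A -> B): share of invoices containing both A and B.
rule_support = support(antecedent | consequent)

# confidence(A -> B) = support(A and B) / support(A)
confidence = rule_support / support(antecedent)

# lift(A -> B) = confidence(A -> B) / support(B)
lift = confidence / support(consequent)

print(rule_support, round(confidence, 3), round(lift, 3))
```

A lift above 1 means the two items appear together more often than they would if they were bought independently, which is the signal to look for when choosing an antecedent to promote.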

### Question 4: Based on the above rules, identify what would be the opportunity of promoting one of the antecedents.

### Question 5: Create a new text cell in your Notebook: Complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include:
- What was your incoming experience with this model, if any?
- What steps did you take, and what obstacles did you encounter?
- How do you link this exercise to real-world machine learning problem-solving?
- What steps were missing? What else do you need to learn?