# Groceries Analysis

**Questions**

1. What are the top 5 best-selling items in this dataset?
2. Which itemsets do customers like to purchase together?

Thanks to IFN509 Data Exploration and Mining (QUT) for the example code.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

I start by checking the information in this dataset. There are 38765 rows, and none of the variables has missing data. The variables are member number, date, and item description. I also display the first ten rows of the data.

In [None]:
df = pd.read_csv('/kaggle/input/groceries-dataset/Groceries_dataset.csv')

# print dataset information
df.info()
print()

# print first 10 rows
print(df.head(10))

To answer the first question about the best-selling products: whole milk is the best seller, followed by other vegetables, rolls/buns, soda, and yogurt. Root vegetables come next.

In [None]:
# count how often each item appears and show the 10 most frequent
df.itemDescription.value_counts().head(10)
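
As a small illustration of the counting step, the sketch below runs the same `value_counts()` call on a made-up list of purchases and limits the result to the top 5 to match the first question. The `toy` data is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical purchases, for illustration only (not from Groceries_dataset.csv)
toy = pd.DataFrame({
    'itemDescription': ['whole milk', 'soda', 'whole milk', 'yogurt',
                        'soda', 'whole milk', 'rolls/buns']
})

# value_counts() sorts by frequency in descending order,
# so head(5) keeps at most the 5 most frequent items
top5 = toy['itemDescription'].value_counts().head(5)
print(top5)
```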

To answer the second question, which frequent itemsets customers are likely to purchase together, I:
- group the data into lists
- install apyori to perform the analysis
- apply apyori
- convert the results to a pandas DataFrame
- sort by lift

In [None]:
# group item descriptions by member number (one list of items per member)
transactions = df.groupby(['Member_number'])['itemDescription'].apply(list)
print(transactions.head())
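
To show what this grouping produces, here is a minimal sketch on hypothetical rows. Grouping by `Member_number` alone puts everything a member ever bought into a single basket. Grouping by member and date is shown only as a possible alternative, not something this notebook does, and the `Date` column name is an assumption.

```python
import pandas as pd

# Hypothetical rows for illustration; the 'Date' column name is an assumption
toy = pd.DataFrame({
    'Member_number': [1000, 1000, 1000, 2000],
    'Date': ['01-01-2015', '01-01-2015', '05-02-2015', '01-01-2015'],
    'itemDescription': ['whole milk', 'yogurt', 'soda', 'whole milk'],
})

# One basket per member: every purchase a member made lands in the same list
per_member = toy.groupby('Member_number')['itemDescription'].apply(list)
print(per_member)

# One basket per member per day: a possible alternative, not used in this notebook
per_visit = toy.groupby(['Member_number', 'Date'])['itemDescription'].apply(list)
print(per_visit)
```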

In [None]:
%pip install apyori

In [None]:
from apyori import apriori

transaction_list = list(transactions)
results = list(apriori(transaction_list, min_support=0.05))

# print the first 5 results
print(results[:5])
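
To make `min_support` concrete: the support of an itemset is the fraction of transactions that contain every item in it, so a threshold of 0.05 keeps itemsets found in at least 5% of transactions. The sketch below counts support by brute force on a hypothetical toy list. It is not how apyori works internally, and the `support` helper and the 0.5 threshold are made up for illustration.

```python
from itertools import combinations

# Hypothetical baskets, for illustration only
toy_transactions = [
    ['whole milk', 'yogurt'],
    ['whole milk', 'soda'],
    ['whole milk', 'yogurt', 'soda'],
    ['rolls/buns'],
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Enumerate all single items and pairs, keeping those that meet the threshold
min_support = 0.5
items = sorted({item for t in toy_transactions for item in t})
frequent = {}
for size in (1, 2):
    for candidate in combinations(items, size):
        s = support(candidate, toy_transactions)
        if s >= min_support:
            frequent[candidate] = s

print(frequent)
```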

In [None]:
# convert into a pandas DataFrame to make the results easier to read
def convert_apriori_results_to_pandas_df(results):
    rules = []
    
    for rule_set in results:
        for rule in rule_set.ordered_statistics:
            # items_base = left side of rules, items_add = right side
            # support, confidence and lift for respective rules
            rules.append([','.join(rule.items_base), ','.join(rule.items_add),
                         rule_set.support, rule.confidence, rule.lift]) 
    
    # typecast it to pandas df
    return pd.DataFrame(rules, columns=['Left_side', 'Right_side', 'Support', 
                                        'Confidence', 'Lift']) 

result_df = convert_apriori_results_to_pandas_df(results)

print(result_df.head(20))
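
Depending on what apyori returns, the converted table may contain rows whose `Left_side` is an empty string, meaning a rule with nothing on the left. If such rows appear, one possible way to drop them before ranking is sketched below. The `toy_rules` table is hypothetical and its numbers are made up.

```python
import pandas as pd

# Hypothetical rule table with the same columns as result_df; values are made up
toy_rules = pd.DataFrame({
    'Left_side': ['', 'yogurt', 'whole milk'],
    'Right_side': ['whole milk', 'whole milk', 'yogurt'],
    'Support': [0.45, 0.15, 0.15],
    'Confidence': [0.45, 0.53, 0.33],
    'Lift': [1.00, 1.18, 1.18],
})

# Keep only rules with a non-empty left side, then rank by lift
ranked = (toy_rules[toy_rules['Left_side'] != '']
          .sort_values(by='Lift', ascending=False)
          .reset_index(drop=True))
print(ranked)
```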

In [None]:
# sort by lift
result_df = result_df.sort_values(by='Lift', ascending=False)
result_df.head(20)
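
Since the table is ranked by lift, a worked example of how the three columns relate may help. Under the standard definitions, confidence(A → B) = support(A and B) / support(A), and lift(A → B) = confidence(A → B) / support(B). A lift above 1 means the items occur together more often than they would if they were bought independently. The toy baskets and the `support` and `rule_metrics` helpers below are hypothetical.

```python
# Hypothetical baskets, for illustration only
toy_transactions = [
    {'whole milk', 'yogurt'},
    {'whole milk', 'yogurt'},
    {'whole milk', 'soda'},
    {'soda'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_metrics(left, right, transactions):
    """Return (support, confidence, lift) for the rule left -> right."""
    s_both = support(set(left) | set(right), transactions)
    confidence = s_both / support(left, transactions)
    lift = confidence / support(right, transactions)
    return s_both, confidence, lift

sup, conf, lift = rule_metrics({'yogurt'}, {'whole milk'}, toy_transactions)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

Here yogurt appears in 2 of 4 baskets and always together with whole milk, so the confidence is 1.0. Whole milk appears in 3 of 4 baskets, so the lift is 1.0 / 0.75, which is above 1.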

The first 4 rows show that customers tend to purchase bottled water, whole milk, and other vegetables together. These rows contain the same items in different orders, each with a support of 0.056183. The lift is greater than 1, which indicates a positive association, but I do not think it is a strong rule that customers who buy milk will also buy water.

Customers also tend to buy yogurt together with other vegetables and whole milk, at a support of 0.071832.

Other itemsets include [yogurt, sausage] and [yogurt, rolls/buns, whole milk].
