Author: Roman Janic Studer

In [1]:
import pandas as pd
import numpy as np
import csv
import pickle

# Product Reduction partially using the Apriori approach
------
This notebook creates a list of all products and eliminates up to 80% of the "weakest" products, i.e. those with the lowest support. (Ratings should not be used for a recommender at a later stage.)

In [2]:
# load the first few rows
overview = pd.read_csv('Recommender4Retail.csv', nrows = 10)
overview.head()


Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,department,aisle
0,1,2539329,1,prior,1,2,8,,196,1,0,Soda,77,7,beverages,soft drinks
1,2,2539329,1,prior,1,2,8,,14084,2,0,Organic Unsweetened Vanilla Almond Milk,91,16,dairy eggs,soy lactosefree
2,3,2539329,1,prior,1,2,8,,12427,3,0,Original Beef Jerky,23,19,snacks,popcorn jerky
3,4,2539329,1,prior,1,2,8,,26088,4,0,Aged White Cheddar Popcorn,23,19,snacks,popcorn jerky
4,5,2539329,1,prior,1,2,8,,26405,5,0,XL Pick-A-Size Paper Towel Rolls,54,17,household,paper goods


### Extract a list of Products and Orders out of a DataFrame
----
The following code blocks extract all unique values of a given column (in our case `product_id` and `order_id`).
The difficulty here is that the dataset is too large to be loaded at once, so it is read in chunks. Because extracting the unique values from such a large dataset takes some time, I decided to save the resulting lists as pickle files so they can be loaded again after a restart of the notebook.

In [3]:
def extract_unique_values(data, column, chunksize=10_000):
    """
    Extracts the unique values of a column from a csv file, reading it in chunks
    
    :param data: str, path to a csv file
    :param column: str, column whose unique values are extracted
    :param chunksize: int, number of rows to load per batch
    
    :return: list of all unique values
    """
    unique_values = set()
    
    # usecols avoids loading columns that are not needed
    for chunk in pd.read_csv(data, usecols=[column], chunksize=chunksize):
        # collect the unique values of this chunk; the set removes duplicates across chunks
        unique_values.update(chunk[column].unique().tolist())
    
    return list(unique_values)

In [4]:
# products = extract_unique_values('Recommender4Retail.csv', 'product_id', chunksize=10_000)

In [5]:
"""with open('product_ids.pkl', 'wb') as f:
    pickle.dump(products, f)"""

"with open('product_ids.pkl', 'wb') as f:\n    pickle.dump(products, f)"

In [6]:
# orders = extract_unique_values('Recommender4Retail.csv', 'order_id', chunksize=10_000)

In [7]:
"""with open('order_ids.pkl', 'wb') as f:
    pickle.dump(orders, f)"""

"with open('order_ids.pkl', 'wb') as f:\n    pickle.dump(orders, f)"

In [8]:
# load the saved lists from the pickle files

with open('product_ids.pkl', 'rb') as f:
    products = pickle.load(f)

with open('order_ids.pkl', 'rb') as f:
    orders = pickle.load(f)
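
The compute-once, pickle, reload pattern used in the cells above can also be wrapped in a small helper instead of commenting cells in and out. The following is only a sketch: `load_or_compute` is a hypothetical name that is not part of the original notebook, and it assumes the cached object can be pickled.

```python
import os
import pickle


def load_or_compute(path, compute):
    """Return the pickled object stored at `path`.

    If the file does not exist yet, call `compute()` (hypothetical callback),
    save its result to `path` and return it.
    """
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)

    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result


# Possible usage with the function defined above (left commented out because
# it scans the whole csv on the first run):
# products = load_or_compute(
#     'product_ids.pkl',
#     lambda: extract_unique_values('Recommender4Retail.csv', 'product_id', chunksize=10_000),
# )
```

The expensive extraction then runs only when the pickle file is missing; deleting the file forces a recomputation.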

In [9]:
f'Total number of Products in Recommender4Retail.csv: {len(products)}' 
# Products: 49_685

'Total number of Products in Recommender4Retail.csv: 49685'

In [10]:
f'Total number of Orders in Recommender4Retail.csv: {len(orders)}' 
# Orders: 3_346_083

'Total number of Orders in Recommender4Retail.csv: 3346083'

### Populate Dataframe
----
The code above created two lists called `products` and `orders`, representing all products and orders in the dataset. With this information, we can calculate the support for every product. The function `calc_product_count` creates a DataFrame with the columns `product_id` and `count`, where `count` is the number of times the product was ordered. The column `sup_product`, which holds the support of the product, is added further below.

In [11]:
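def calc_product_count():
    """
    Counts how often every product was ordered, reading the csv in chunks

    :return: pandas dataframe with the columns `product_id` and `count`
    """
    chunks = pd.read_csv("Recommender4Retail.csv", usecols=['product_id'], chunksize=10_000)

    # count the products within every chunk
    subsets = [chunk.groupby('product_id')['product_id'].agg(['count']) for chunk in chunks]

    # sum up the partial counts of all chunks per product
    product_order = pd.concat(subsets).groupby(level=0).sum()

    return product_order.reset_index()

# product_order = calc_product_count()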
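
To see why counting per chunk and summing the partial counts gives the same result as counting on the full data, the idea can be tried on a small in-memory csv. This is a minimal sketch under stated assumptions: the rows in `toy_csv` are made up for illustration and only stand in for `Recommender4Retail.csv`.

```python
import io

import pandas as pd

# Toy data standing in for Recommender4Retail.csv (hypothetical values)
toy_csv = "order_id,product_id\n1,196\n1,14084\n2,196\n2,12427\n3,196\n3,14084\n"

# Count per chunk of two rows, then sum the partial counts per product_id
chunks = pd.read_csv(io.StringIO(toy_csv), chunksize=2)
partial = [chunk.groupby('product_id')['product_id'].agg(['count']) for chunk in chunks]
chunked = pd.concat(partial).groupby(level=0).sum().reset_index()

# Reference: the same counts computed on the full data at once
full = pd.read_csv(io.StringIO(toy_csv))
reference = full.groupby('product_id')['product_id'].agg(['count']).reset_index()

print(chunked)
print(reference)
```

Both frames should list the same count for every `product_id`, because a sum of per-chunk counts does not depend on how the rows are split into chunks.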

In [12]:
#save list
"""with open('product_n_order.pkl', 'wb') as f:
    pickle.dump(product_order, f)"""

"with open('product_n_order.pkl', 'wb') as f:\n    pickle.dump(product_order, f)"

In [13]:
# load list
with open('product_n_order.pkl', 'rb') as f:
    product_order = pickle.load(f)

n_product_order = product_order['count'].sum()
'Total number of Products bought: {}'.format(n_product_order)

'Total number of Products bought: 33819106'

### Calculate the Support of the Products
-----

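The Apriori algorithm works with two probabilistic measures that rate association rules: support and confidence. The algorithm expects the threshold values `minsupp` and `minconf`. The support represents the probability of an item being in a basket.

#### Mathematical Approach
To calculate the support we need a multiset $X$ containing every ordered product. The support of a product $p$ is the number of occurrences of $p$ in $X$ divided by the size of $X$:

$$
support(p) = \frac{|\{x \in X : x = p\}|}{|X|}
$$

Or in words:
$$
support(\text{Product}) =  \frac{\text{Number of Buys of the Product}}{\text{Total number of products bought}}
$$

Note that the classical Apriori support divides by the number of orders (transactions) instead. As long as a product appears at most once per order, both definitions differ only by a constant factor and rank the products identically.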
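
As a small worked example, the support can be computed on a toy basket in two ways: relative to all purchased items ($|X|$, the denominator used in the next cell) and relative to the number of orders. The data in `toy` is made up for illustration and the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy baskets (hypothetical): 3 orders, 6 purchased items in total
toy = pd.DataFrame({
    'order_id':   [1, 1, 2, 2, 3, 3],
    'product_id': [196, 14084, 196, 12427, 196, 14084],
})

# Number of buys per product
counts = toy.groupby('product_id')['product_id'].count()

# Share of all purchased items
support_items = counts / len(toy)

# Share of orders that contain the product
orders_with_product = toy.groupby('product_id')['order_id'].nunique()
support_orders = orders_with_product / toy['order_id'].nunique()

print(support_items)
print(support_orders)
```

Product 196 is bought 3 times out of 6 items, so its item-based support is 0.5, while it appears in all 3 orders, so its order-based support is 1.0. The ordering of the products is the same under both definitions in this example.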

In [14]:
product_order['sup_product'] = product_order['count']/n_product_order # support for product
product_order.head()

Unnamed: 0,product_id,count,sup_product
0,1,1928,5.700919e-05
1,2,94,2.779494e-06
2,3,283,8.368051e-06
3,4,351,1.037875e-05
4,5,16,4.731054e-07


In [15]:
# Normalize support values

product_order['sup_product'] = (product_order['sup_product']-product_order['sup_product'].min())/(product_order['sup_product'].max()-product_order['sup_product'].min())
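
The min-max normalization above rescales the support to the range 0 to 1 but does not change the order of the products, so the later selection of weak products is unaffected by it. A minimal sketch with a handful of illustrative values (rounded from the supports shown earlier, not the full data):

```python
import pandas as pd

# Illustrative support values (rounded, hypothetical subset)
support = pd.Series([5.7e-05, 2.8e-06, 8.4e-06, 1.0e-05, 4.7e-07])

# Same min-max formula as in the cell above
scaled = (support - support.min()) / (support.max() - support.min())

# The smallest value maps to 0, the largest to 1, and the ranking is preserved
print(scaled.min(), scaled.max())
print(support.rank().equals(scaled.rank()))
```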

In [16]:
product_order.head()

Unnamed: 0,product_id,count,sup_product
0,1,1928,0.003922
1,2,94,0.000189
2,3,283,0.000574
3,4,351,0.000712
4,5,16,3.1e-05


### Drop "weak" products
----
With the support calculated for each product, we can now rank the products. This allows us to remove weak, rarely purchased products.

In [27]:
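def drop_products(df, rate=0.8):
    """
    Function drops the given share of products with the lowest support
    :param df: pandas dataframe with a column `sup_product`
    :param rate: float between 0 and 1 that defines the share of products to drop
    :return: pandas dataframe with the remaining products, sorted by support (descending)
    """
    # sort values (returns a sorted copy, the passed dataframe is left untouched):
    df = df.sort_values('sup_product', ascending=False)
    
    # number of rows to drop, based on the passed dataframe instead of the global product_order
    rows_to_drop = int(len(df) * rate)
    
    # keep everything except the weakest rows at the end
    return df.iloc[:len(df) - rows_to_drop]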

In [22]:
reduced_df = drop_products(product_order);

In [23]:
reduced_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9937 entries, 24849 to 6677
Data columns (total 3 columns):
product_id     9937 non-null int64
count          9937 non-null int64
sup_product    9937 non-null float64
dtypes: float64(1), int64(2)
memory usage: 310.5 KB


In [44]:
# list of products to keep:
good_products = list(reduced_df.product_id)