Author: Roman Janic Studer

In [1]:
import pandas as pd
import numpy as np
import csv
import pickle
from pathlib import Path

# Product Reduction using a Rating
------
This notebook builds a list of all products and eliminates up to 80% of the "weakest" products. (The rating should not be used for a recommender at a later stage.)

In [2]:
# load the first few rows to get an overview of the data
overview = pd.read_csv('Recommender4Retail.csv', nrows = 6)
overview.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,department,aisle
0,1,2539329,1,prior,1,2,8,,196,1,0,Soda,77,7,beverages,soft drinks
1,2,2539329,1,prior,1,2,8,,14084,2,0,Organic Unsweetened Vanilla Almond Milk,91,16,dairy eggs,soy lactosefree
2,3,2539329,1,prior,1,2,8,,12427,3,0,Original Beef Jerky,23,19,snacks,popcorn jerky
3,4,2539329,1,prior,1,2,8,,26088,4,0,Aged White Cheddar Popcorn,23,19,snacks,popcorn jerky
4,5,2539329,1,prior,1,2,8,,26405,5,0,XL Pick-A-Size Paper Towel Rolls,54,17,household,paper goods


### Extract a list of Products and Orders out of a DataFrame
----
The following code blocks extract all unique values of a given column (in our case `product_id` and `order_id`).
The difficulty here is that the dataset is too large to be loaded at once. Because extracting the unique values from a large dataset takes some time, I decided to save the created lists as pickle files so that they can be loaded again after a restart of the notebook.

In [3]:
def extract_unique_values(data, column, chunksize=None):
    """
    Extracts the unique values of a column from a csv that is read in chunks

    :param data: str, path to a csv
    :param column: str, column to extract the unique values from
    :param chunksize: int, number of rows to load per batch (None loads everything at once)

    :return: list of all unique values
    """
    unique_val = set()

    # only load the needed column
    reader = pd.read_csv(data, usecols=[column], chunksize=chunksize)
    # without a chunksize, read_csv returns a single DataFrame instead of an iterator
    chunks = [reader] if chunksize is None else reader

    for chunk in chunks:
        # a set keeps every value only once across all chunks
        unique_val.update(chunk[column].unique().tolist())

    return list(unique_val)

In [4]:
# products = extract_unique_values('Recommender4Retail.csv', 'product_id', chunksize=10_000)

In [5]:
"""with open('product_ids.pkl', 'wb') as f:
    pickle.dump(products, f)"""

"with open('product_ids.pkl', 'wb') as f:\n    pickle.dump(products, f)"

In [6]:
# orders = extract_unique_values('Recommender4Retail.csv', 'order_id', chunksize=10_000)

In [7]:
"""with open('order_ids.pkl', 'wb') as f:
    pickle.dump(orders, f)"""

"with open('order_ids.pkl', 'wb') as f:\n    pickle.dump(orders, f)"

In [8]:
# load the saved lists from disk

with open('product_ids.pkl', 'rb') as f:
    products = pickle.load(f)

with open('order_ids.pkl', 'rb') as f:
    orders = pickle.load(f)
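
The compute-once, load-afterwards pattern used in the cells above can also be wrapped in a small helper, so that nothing has to be commented in and out by hand. This is only a minimal sketch: `load_or_compute` and its parameters are hypothetical names that do not exist in this notebook.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute):
    """Return the pickled object at `path`; compute and save it first if the file is missing.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    path = Path(path)
    if path.is_file():
        # cache hit: skip the expensive computation
        with path.open('rb') as f:
            return pickle.load(f)

    # cache miss: run the computation once and store the result
    result = compute()
    with path.open('wb') as f:
        pickle.dump(result, f)
    return result
```

Assuming `extract_unique_values` from above, a call could look like `products = load_or_compute('product_ids.pkl', lambda: extract_unique_values('Recommender4Retail.csv', 'product_id', chunksize=10_000))`.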

In [9]:
f'Total number of Products in Recommender4Retail.csv: {len(products)}' 
# Products: 49_685

'Total number of Products in Recommender4Retail.csv: 49685'

In [10]:
f'Total number of Orders in Recommender4Retail.csv: {len(orders)}' 
# Orders: 3_346_083

'Total number of Orders in Recommender4Retail.csv: 3346083'

### Create DataFrame as a basis for a rating
----
The code above created two lists called `products` and `orders` representing all products and orders in the dataset. With this information, we can calculate a rating for every product. The function `calc_product_order_reorder_count` creates a DataFrame whose columns contain the information used to rate each product.

- `index` = product_id
- `n_orders` = number of times the product was ordered
- `n_users` = number of users who ordered the product
- `n_reorders` = number of times the product was reordered

In [11]:
def calc_product_order_reorder_count():
    chunks = pd.read_csv("Recommender4Retail.csv", chunksize=10_000)
    # aggregate every chunk on its own to keep the memory usage low
    subsets = [chunk.groupby('product_id').agg({'product_id': 'count',
                                                'user_id': 'nunique',
                                                'reordered': 'sum'}) for chunk in chunks]

    # combine the per-chunk results; product_id stays as the index
    # note: a user who buys a product in several chunks is counted once per chunk
    df = pd.concat(subsets).groupby(level=0).sum()

    df = df.rename(columns={"product_id": "n_orders", "user_id": "n_users", 'reordered': "n_reorders"})
    return df

In [12]:
df = calc_product_order_reorder_count()

In [None]:
df.head()

In [None]:
# sanity check: 
print(f'Number of Products in list: {len(df)}') #correct
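
One caveat of aggregating chunk by chunk: summing per-chunk `nunique` values counts a user once for every chunk in which they bought the product, so the user count can be too high. A possible way around this is sketched below on toy data, assuming that the distinct (`product_id`, `user_id`) pairs fit into memory. `count_users_per_product` is a hypothetical helper and not part of this notebook.

```python
import io

import pandas as pd


def count_users_per_product(csv_source, chunksize):
    """Count distinct users per product across all chunks (hypothetical helper)."""
    pairs = []
    for chunk in pd.read_csv(csv_source, usecols=['product_id', 'user_id'],
                             chunksize=chunksize):
        # keep every (product, user) pair only once per chunk
        pairs.append(chunk.drop_duplicates())

    # remove pairs that repeat across chunks, then count the users per product
    unique_pairs = pd.concat(pairs).drop_duplicates()
    return unique_pairs.groupby('product_id')['user_id'].nunique()


# toy data: user 1 buys product 196 in two different chunks
toy_csv = io.StringIO(
    "user_id,product_id\n"
    "1,196\n"
    "2,196\n"
    "1,196\n"
    "3,14084\n"
)
n_users = count_users_per_product(toy_csv, chunksize=2)
print(n_users.to_dict())  # {196: 2, 14084: 1}
```

Summing the per-chunk `nunique` values would report 3 users for product 196 in this toy example, because user 1 appears in both chunks.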

### Calculate the Support of the Products and implement it into a rating formula
-----

The Apriori algorithm works with two probabilistic measures that rate association rules: support and confidence. The support represents the probability of an item being in a basket. We use the basic idea of the support to calculate a rating for every product that combines how often the product has been bought, how many customers bought the product, and how often the product was reordered. This gives us the "rating" formula below:

#### Mathematical Approach
To calculate the support we need a multiset $$X = \text{Product for every transaction}$$ containing all ordered products.

The formula to calculate the support of a product $p$ can be written as follows:
$$
support(p) = \frac{|\{x \in X : x = p\}|}{|X|}
$$

Or in writing:
$$
support(\text{Product}) =  \frac{\text{Number of Buys of the Product}}{\text{Total number of transactions}}
$$

Rating formula:
$$
rating(\text{Product}) =  \text{normalized}( \frac{\text{Number of Buys of the Product}}{\text{Total number of transactions}}+\frac{\text{Number of Customers for the Product}}{\text{Total number of Customers}}+\frac{\text{Number of Reorders of the Product}}{\text{Total number of Reorders}} )
$$
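
To make the formula concrete, here is a small worked example on invented numbers. The three products and their counts are made up for illustration, and the totals are taken as column sums, which is an assumption that mirrors the cell below.

```python
import pandas as pd

# toy counts for three hypothetical products (index = product_id)
toy = pd.DataFrame(
    {'n_orders': [6, 3, 1], 'n_users': [3, 2, 1], 'n_reorders': [3, 1, 0]},
    index=[196, 14084, 12427],
)

# each term is the product's share of the respective column total
raw = (toy['n_orders'] / toy['n_orders'].sum()
       + toy['n_users'] / toy['n_users'].sum()
       + toy['n_reorders'] / toy['n_reorders'].sum())

# min-max normalization maps the weakest product to 0 and the strongest to 1
toy['rating'] = (raw - raw.min()) / (raw.max() - raw.min())
print(toy)
```

Product 196 has the largest share in all three terms and ends up with a rating of 1, product 12427 with a rating of 0, and product 14084 lands in between at roughly 0.39.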

In [None]:
n_orders = df['n_orders'].sum()
n_customers = df['n_users'].sum()
n_reorders = df['n_reorders'].sum()
# every term is divided by its own total (the customer term uses n_customers, not n_orders)
df['rating'] = df['n_orders']/n_orders + df['n_users']/n_customers + df['n_reorders']/n_reorders
# normalize rating
df['rating'] = (df['rating']-df['rating'].min())/(df['rating'].max()-df['rating'].min())

In [None]:
df.head()

### Drop "weak" products
----
With the rating now calculated for each product, we can sort the products by it. This allows us to remove weak, rarely purchased products.

In [None]:
def drop_products(df, rate=0.8):
    """
    Function drops a predefined percentage of the lowest-rated products
    
    :param df: pandas dataframe with a 'rating' column
    :param rate: float that defines the share of products to drop (0.8 drops the weakest 80%)
    :return df: DataFrame containing all the product_id's to keep
    """
    # sort values (returns a sorted copy, the passed DataFrame stays untouched):
    df = df.sort_values('rating', ascending=False)
    
    # calculate the number of rows to drop from the bottom
    rows_to_drop = int(len(df)*rate)
    df = df.drop(df.tail(rows_to_drop).index)
    
    return df
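
The same selection can also be phrased as "keep the top share" instead of "drop the bottom share". The following sketch uses `DataFrame.nlargest` on invented ratings. The toy values and the variable names `toy`, `n_keep` and `kept` are assumptions for illustration only.

```python
import pandas as pd

# ten hypothetical products with invented ratings (index = product_id)
toy = pd.DataFrame(
    {'rating': [0.9, 0.1, 0.5, 0.3, 1.0, 0.0, 0.7, 0.2, 0.4, 0.6]},
    index=range(100, 110),
)

rate = 0.8
# dropping 80% of 10 products leaves the 2 best-rated ones
n_keep = len(toy) - int(len(toy) * rate)
kept = toy.nlargest(n_keep, 'rating')
print(kept.index.tolist())  # [104, 100]
```

With `rate=0.8`, only the two products with the highest ratings (1.0 and 0.9) remain.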

In [None]:
reduced_df = drop_products(df);

In [None]:
reduced_df.info()

In [None]:
reduced_df.head()

In [None]:
# list of products to keep:
good_products = reduced_df.index.tolist()

In [None]:
# drop rows with bad products (skipped if rating.csv already exists)
if not Path('rating.csv').is_file():
    chunks = pd.read_csv('Recommender4Retail.csv', chunksize=10_000)
    for chunk in chunks:
        chunk = chunk[chunk['product_id'].isin(good_products)]

        # write the header only once, when the file is first created
        write_header = not Path('rating.csv').is_file()
        chunk.to_csv('rating.csv', mode='a', header=write_header)
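
The chunk-wise filtering can be tried out on a small in-memory csv before running it on the full file. The sketch below is a simplified variant and not the exact cell above: `filter_csv_in_chunks` is a hypothetical helper, it always rewrites the output file instead of skipping an existing one, and it leaves out the index column via `index=False`.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd


def filter_csv_in_chunks(csv_source, out_path, keep_ids, chunksize):
    """Write only the rows whose product_id is in keep_ids (hypothetical helper)."""
    keep_ids = set(keep_ids)
    first = True
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        kept = chunk[chunk['product_id'].isin(keep_ids)]
        # the first chunk creates the file with a header, later chunks append
        kept.to_csv(out_path, mode='w' if first else 'a',
                    header=first, index=False)
        first = False


toy_csv = io.StringIO("order_id,product_id\n1,196\n1,14084\n2,196\n2,12427\n")
out_path = Path(tempfile.mkdtemp()) / 'rating_toy.csv'
filter_csv_in_chunks(toy_csv, out_path, keep_ids=[196], chunksize=3)

filtered = pd.read_csv(out_path)
print(filtered)
```

Only the two rows with `product_id` 196 end up in the output file, and the header is written exactly once.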

In [None]:
df = pd.read_csv('rating.csv')