## Market Basket Analysis ##


**Running Kaggle competition**

Who cares?

- Instacart, other retailers
    - Personalised advertisement/product suggestions
         - Friendly reminders for items forgotten from the shopping list
         - Increase turnover/revenue
         - Reduce time spent shopping
    - Dynamic pricing tailored for each customer



**Instacart business model**

- Operates only in the US
- Online order (same-day delivery)
   - Delivery within one or two hours
       - depends on your subscription plan
   - 24-hour delivery
        - by personal shopper
- Prices may be higher or lower than in-store prices
- 5 years old, valued at $3.4 bn



**AIM: Predict products in the last order**

- Purchasing histories of 206209 customers
- 3214874 orders in total
    - Order history records (prior)
        - Multiple orders for each customer (min of 3)
    - Last order for each customer
        - train: 131209 orders
        - test: 75000 orders
- About 50000 products



**Data features**

- Orders are chronologically arranged
- Days_since_prior_order
- Products_per_order
- Add_to_cart_order
- Order_dow (day of the week)
- Order_hour_of_day
- Aisle/Department
- Reordered (60%)



**Kaggle scripts**

- EDA (a lot)
- ML models (a few)
    - XGBoost
    - LightGBM
    - Temporal Annotated Recurring Sequence (TARS) based prediction
        - code withdrawn from Github
- Benchmark models



**Benchmark models**

- Random baseline
    - Sample from categories (aisles/departments) ordered before
- Top-N products ranked by reordering propensity
    - Distribution of reordering propensity
    - F1 score sensitivity wrt N
- Top-N products ranked by reordering propensity over the last *m* orders
- All products ever purchased by a customer
    - 0.2164845 Public Leaderboard (LB) Score
- All products ever reordered by a customer
    - 0.2996690 Public LB Score
- All products ever reordered by a customer in the last *m* orders
- Repeat the last order
    - 0.3276746 Public LB Score
- Repeat Last Order (Reordered Products Only)
    - 0.3276826 Public LB Score
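
The set-based benchmarks above can be sketched in a few lines of Python. This is only an illustration on a toy history, not the code behind the LB scores; the function names and the data layout (one list of product ids per prior order, oldest first) are assumptions.

```python
def all_ever_purchased(orders):
    """All products the customer has ever bought."""
    return set().union(*orders)


def all_ever_reordered(orders):
    """Products that appear in at least two different orders."""
    seen, reordered = set(), set()
    for basket in orders:
        items = set(basket)
        reordered |= items & seen   # bought again in a later order
        seen |= items
    return reordered


def repeat_last_order(orders):
    """Predict exactly the most recent prior order."""
    return set(orders[-1])


def repeat_last_order_reordered_only(orders):
    """Products of the last order that were also bought in an earlier order."""
    earlier = set().union(*orders[:-1])
    return set(orders[-1]) & earlier


# Toy purchasing history of one customer (hypothetical data).
history = [["milk", "bread"], ["milk", "eggs"], ["eggs", "beer"]]

print(all_ever_purchased(history))
print(all_ever_reordered(history))
print(repeat_last_order(history))
print(repeat_last_order_reordered_only(history))
```
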

**ML models**

- XGBoost [LB 0.3808482]
- LightGBM [LB 0.3692]

- **There is an improvement over the benchmark models!**

- But note the dependent variable:
    - the probability that a product previously bought by a customer will be re-ordered
    - so the list of predicted products is always a subset of previously ordered products



# Top LB score - 0.4021544 #

**MBA-models**

- Enlarge the list of predicted products using association rules
    - most common pairs of products purchased
        - beer/nuts, beer/chips, cheese/wine
- Optimal size of the predicted basket varies by customer
- Sensitivity of F1-score wrt predicted basket size
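
A minimal sketch of how product pairs could be mined for such rules: count how often two products share a basket and turn the counts into rule confidences. The function name `pair_rules`, the `min_support` cut-off and the toy baskets are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=2):
    """Return {(a, b): confidence of the rule a -> b} for frequent pairs."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), n in pair_counts.items():
        if n >= min_support:                 # keep pairs seen often enough
            rules[(a, b)] = n / item_counts[a]   # share of a-baskets that contain b
            rules[(b, a)] = n / item_counts[b]   # share of b-baskets that contain a
    return rules


# Hypothetical baskets.
baskets = [
    ["beer", "nuts"],
    ["beer", "chips"],
    ["beer", "nuts", "chips"],
    ["cheese", "wine"],
]

for (a, b), conf in sorted(pair_rules(baskets).items()):
    print(f"{a} -> {b}: {conf:.2f}")
```

A rule such as nuts -> beer with high confidence would suggest adding beer to a predicted basket that already contains nuts.
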



**Evaluation metrics: mean F1 score**

- p - *precision*
    - the number of correctly predicted products divided by the number of all predicted products
- r - *recall*
    - the number of correctly predicted products divided by the number of products actually ordered (those that should have been returned)

- $$ F_1 \text{ score} = 2 \cdot \frac{1}{\frac{1}{p} + \frac{1}{r}}$$
    - harmonic mean of p and r: values between 0 and 1
- $$ F_1 \text{ score} = 2 \cdot \frac{p \cdot r}{p + r}$$
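
The metric can be sketched as follows: F1 is computed per order from the predicted and actual product sets and then averaged over orders. The function names are assumptions, and returning 0 when there is no overlap (including two empty baskets) is a simplification of this sketch.

```python
def f1_score(predicted, actual):
    """F1 for one order, from sets of predicted and actually ordered products."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0                      # also covers empty inputs (simplification)
    p = hits / len(predicted)           # precision
    r = hits / len(actual)              # recall
    return 2 * p * r / (p + r)


def mean_f1(predictions, truths):
    """Average the per-order F1 over all orders in `truths`."""
    scores = [f1_score(predictions.get(o, ()), truths[o]) for o in truths]
    return sum(scores) / len(scores)


# Hypothetical orders.
predictions = {1: {"milk", "eggs", "beer"}, 2: {"tea"}}
truths = {1: {"milk", "eggs", "bread", "tea"}, 2: {"tea"}}

print(f1_score(predictions[1], truths[1]))   # p = 2/3, r = 1/2
print(mean_f1(predictions, truths))
```
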


# Features
- **product-specific**
- **user-specific**
- **user-product-specific**
- **train/test-specific**

## Product-specific features (49688 products)
- **prod_orders**: how many times product_id was ordered by all customers
- **prod_reorders**: how many times product_id was re-ordered by all customers
- **prod_reorder_ratio**:
    - $$  prod\_reorder\_ratio = \frac{prod\_reorders}{prod\_orders}$$
- **prod_first_orders**: how many times product_id was ordered for the first time
- **prod_second_orders**: how many times product_id was ordered for the second time
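- **prod_reorder_probability**:
    - $$  prod\_reorder\_probability = \frac{prod\_second\_orders}{prod\_first\_orders}$$
    - measures the probability of the first re-order: the share of customers who, having ordered the product once, ordered it a second time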
- **prod_reorder_times**: 
    - $$ prod\_reorder\_times =  1 + \frac{prod\_reorders}{ prod\_first\_orders} = \frac{prod\_first\_orders + prod\_reorders}{ prod\_first\_orders} $$
    - inverse of share of first orders in total orders
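
A sketch of how these product-level features could be computed from order lines. It assumes that `prod_reorder_probability` is the ratio of second orders to first orders, which is what "probability of the first re-order" suggests. The row layout `(user_id, order_number, product_id)` and the toy data are hypothetical.

```python
from collections import defaultdict


def product_features(rows):
    """rows: iterable of (user_id, order_number, product_id), one per ordered item."""
    times_bought = defaultdict(int)      # (user, product) -> purchases so far
    stats = defaultdict(lambda: {
        "prod_orders": 0, "prod_reorders": 0,
        "prod_first_orders": 0, "prod_second_orders": 0,
    })
    # Sorting puts each user's orders in chronological order.
    for user, _order_number, product in sorted(rows):
        times_bought[(user, product)] += 1
        t = times_bought[(user, product)]
        s = stats[product]
        s["prod_orders"] += 1
        s["prod_first_orders"] += int(t == 1)
        s["prod_second_orders"] += int(t == 2)
        s["prod_reorders"] += int(t > 1)

    for s in stats.values():
        s["prod_reorder_ratio"] = s["prod_reorders"] / s["prod_orders"]
        s["prod_reorder_probability"] = s["prod_second_orders"] / s["prod_first_orders"]
        s["prod_reorder_times"] = 1 + s["prod_reorders"] / s["prod_first_orders"]
    return dict(stats)


rows = [
    (1, 1, "milk"), (1, 2, "milk"), (1, 3, "milk"),
    (2, 1, "milk"), (2, 2, "bread"),
    (3, 1, "milk"), (3, 2, "milk"),
]
features = product_features(rows)
print(features["milk"])
```
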


# User-specific features
- **user_orders**: number of orders made by user_id
- **user_period**: number of days since the first purchase
- **user_mean_days_since_prior**: average number of days since the prior order
- **user_mean_order_dow**: average day-of-week when orders were made
- **user_mean_order_hour_of_day**: average hour-of-day when orders were made
- **user_total_products**: number of all products ever ordered by user_id
- **user_distinct_products**: number of all *distinct* products ever ordered by user_id
- **user_average_basket**: size of average order
    - $$user\_average\_basket = \frac{user\_total\_products}{user\_orders}$$
- **user_reorder_ratio**: 
    - $$user\_reorder\_ratio = \frac{sum(reordered == 1)}{sum(order\_number > 1)}$$
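
A sketch of the user-level features for a single customer. The input layout is hypothetical, and `user_period` is assumed here to be the sum of `days_since_prior_order` over the customer's prior orders; the day-of-week and hour-of-day averages are omitted for brevity.

```python
def user_features(orders):
    """orders: one user's prior orders; each is a dict with
    'order_number', 'days_since_prior_order' (None for the first order)
    and 'items', a list of (product_id, reordered) pairs."""
    gaps = [o["days_since_prior_order"] for o in orders
            if o["days_since_prior_order"] is not None]
    items = [(o["order_number"], p, r) for o in orders for p, r in o["items"]]

    total = len(items)
    reordered = sum(1 for _, _, r in items if r == 1)
    after_first = sum(1 for n, _, _ in items if n > 1)   # items in orders 2, 3, ...

    return {
        "user_orders": len(orders),
        "user_period": sum(gaps),                        # assumption, see above
        "user_mean_days_since_prior": sum(gaps) / len(gaps) if gaps else None,
        "user_total_products": total,
        "user_distinct_products": len({p for _, p, _ in items}),
        "user_average_basket": total / len(orders),
        "user_reorder_ratio": reordered / after_first if after_first else None,
    }


orders = [
    {"order_number": 1, "days_since_prior_order": None,
     "items": [("milk", 0), ("bread", 0)]},
    {"order_number": 2, "days_since_prior_order": 7,
     "items": [("milk", 1), ("eggs", 0)]},
    {"order_number": 3, "days_since_prior_order": 3,
     "items": [("milk", 1), ("eggs", 1), ("beer", 0)]},
]
print(user_features(orders))
```
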



# User-Product-specific (UP) features
 - **up_orders**: how many times a user_id ordered a given product_id
 - **up_first_order**: in which **order_number** a product_id was first ordered
 - **up_last_order**: in which **order_number** a product_id was last ordered
 - **up_average_cart_position**: average position in ordering cart
 - **up_order_rate**: share of orders with ordered product_id in total number of orders
     - $$ up\_order\_rate = \frac{up\_orders}{user\_orders}$$
 - **up_orders_since_last_order**: how many orders ago product_id was ordered last time
     - $$up\_orders\_since\_last\_order = user\_orders - up\_last\_order$$
 - **up_order_rate_since_first_order**: how many times a user_id ordered a given product_id, divided by the number of orders placed since the product was first ordered (including that order)
     - $$ up\_order\_rate\_since\_first\_order =  \frac{up\_orders}{user\_orders - up\_first\_order + 1}$$
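
The UP features can be sketched for one customer as below. The input layout (baskets in chronological order, each listing product ids in add-to-cart order) and the function name are assumptions made for the example.

```python
def up_features(baskets):
    """baskets: one user's prior orders, oldest first; each basket lists
    product ids in the order they were added to the cart."""
    user_orders = len(baskets)
    feats = {}
    for order_number, basket in enumerate(baskets, start=1):
        for position, product in enumerate(basket, start=1):
            f = feats.setdefault(product, {
                "up_orders": 0,
                "up_first_order": order_number,
                "up_last_order": order_number,
                "_position_sum": 0,
            })
            f["up_orders"] += 1
            f["up_last_order"] = order_number
            f["_position_sum"] += position

    for f in feats.values():
        f["up_average_cart_position"] = f.pop("_position_sum") / f["up_orders"]
        f["up_order_rate"] = f["up_orders"] / user_orders
        f["up_orders_since_last_order"] = user_orders - f["up_last_order"]
        f["up_order_rate_since_first_order"] = (
            f["up_orders"] / (user_orders - f["up_first_order"] + 1)
        )
    return feats


baskets = [["milk", "bread"], ["eggs", "milk"], ["eggs", "beer"], ["milk"]]
up = up_features(baskets)
print(up["eggs"])
```
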

# Train/test-specific features
- **days_since_the_last_order**
- **postprior_order_dow**
- **postprior_order_hour_of_day**
- **reordered**: dependent variable
    - indicates whether product_id was (re-)ordered in train set
    - for products that were ordered before but not re-ordered in the train set, this variable is set to zero (!?)
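
A sketch of how the dependent variable could be built for one customer, assuming the candidates are exactly the products seen in the prior orders. The function name and the toy data are hypothetical.

```python
def build_labels(prior_products, train_order):
    """Label each previously bought product: 1 if it is in the train order, else 0."""
    train_order = set(train_order)
    return {p: int(p in train_order) for p in sorted(prior_products)}


prior_products = {"milk", "eggs", "bread"}   # everything bought in prior orders
train_order = {"milk", "tea"}                # the customer's last (train) order

labels = build_labels(prior_products, train_order)
print(labels)
```

Under this construction, zero means "bought before, not bought this time", and a first-time purchase such as tea never becomes a candidate, so it cannot be predicted.
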

# Suggestions:
- user-specific:
    - **buys_organic**: reorder ratio is different for organic/non-organic products
    - **s
- user-product-specific:
    - **up_days_since_last_order**: how many days have elapsed since product_id was last ordered

# Play with thresholds: apply user-specific thresholds
- identify **reordering** customers in the test set: according to https://www.kaggle.com/philippsp/exploratory-analysis-instacart there are 3,487 customers who always reorder products. Set their threshold to zero.
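
A sketch of applying user-specific thresholds to predicted reorder probabilities. The default threshold of 0.2, the function names and the toy probabilities are assumptions; with a threshold of zero, every candidate product of an always-reordering customer is predicted.

```python
def predict_basket(probabilities, threshold):
    """Keep the products whose predicted reorder probability reaches the threshold."""
    return {p for p, prob in probabilities.items() if prob >= threshold}


def predict_all(user_probs, always_reorder_users=frozenset(), default_threshold=0.2):
    """user_probs: {user_id: {product_id: probability}} -> {user_id: set of products}."""
    baskets = {}
    for user, probabilities in user_probs.items():
        threshold = 0.0 if user in always_reorder_users else default_threshold
        baskets[user] = predict_basket(probabilities, threshold)
    return baskets


# Hypothetical model output for two customers; customer 2 always reorders.
user_probs = {
    1: {"milk": 0.7, "eggs": 0.15, "beer": 0.05},
    2: {"milk": 0.7, "eggs": 0.15, "beer": 0.05},
}
result = predict_all(user_probs, always_reorder_users={2})
print(result)
```
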