# Notebook 1: Data Tasks

In this notebook, you will solve three tasks related to data processing and machine learning (ML), based on the loyalty card data of a retail company:

1. compute how often products co-occur in shopping baskets
1. calculate the number of _different_ products customers have bought at a given point in time
1. build a data streamer class that generates training samples for an ML model that we will discuss later in the lectures (module 3)

In [1]:
import os

import pandas as pd

In [2]:
# please update me!
PATH_DATA = "../data/instacart"

<br> 
<br> 

## Preparation: Get the data

### Download data from Kaggle

1. Create an account on www.kaggle.com
1. Download the Instacart data set from kaggle.com: https://www.kaggle.com/c/instacart-market-basket-analysis/data
1. Put the downloaded files into `PATH_DATA` (as specified above) and unzip them

In [3]:
# check that the required files are available
assert os.path.isfile(f"{PATH_DATA}/orders.csv")
assert os.path.isfile(f"{PATH_DATA}/order_products__prior.csv")
assert os.path.isfile(f"{PATH_DATA}/order_products__train.csv")

### Load data

In [4]:
orders = pd.read_csv(f"{PATH_DATA}/orders.csv")
orders.head()

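|   | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |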


In [5]:
order_products = pd.concat(
    [
        pd.read_csv(f"{PATH_DATA}/order_products__prior.csv"),
        pd.read_csv(f"{PATH_DATA}/order_products__train.csv"),
    ]
)
order_products.head()

|   | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


<br>
<br>

## 1. &ensp; Task 1: "Product co-occurrence"

For all products (`product_id`), compute how often the product co-occurs in orders (`order_id`) with every other product. The output should be a `pd.DataFrame` with the following three columns:

1. Product 1
1. Product 2
1. Number of times the products co-occur in a shopping basket
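
To make the expected output format concrete, here is a minimal sketch on a toy data set. It assumes that each unordered product pair should appear once and that a product repeated within an order counts once per basket; the function name `count_cooccurrences` and the output column names are hypothetical, not part of the task. A self-join like this grows quadratically with basket size, so it is a starting point rather than a solution that scales to the full data.

```python
import pandas as pd

# Toy stand-in for `order_products`: three baskets over four products.
toy = pd.DataFrame(
    {
        "order_id": [1, 1, 1, 2, 2, 3, 3],
        "product_id": [10, 20, 30, 10, 20, 20, 40],
    }
)


def count_cooccurrences(df: pd.DataFrame) -> pd.DataFrame:
    """Count how often each unordered product pair shares an order."""
    # Drop repeated (order, product) rows so a pair counts once per basket.
    pairs = df[["order_id", "product_id"]].drop_duplicates()
    # Self-join on the order to enumerate all product pairs within a basket.
    merged = pairs.merge(pairs, on="order_id", suffixes=("_1", "_2"))
    # Keep each unordered pair once; this also removes self-pairs.
    merged = merged[merged["product_id_1"] < merged["product_id_2"]]
    return (
        merged.groupby(["product_id_1", "product_id_2"])
        .size()
        .reset_index(name="n_cooccurrences")
    )


cooc = count_cooccurrences(toy)
print(cooc)
```

On the toy data, products 10 and 20 share two baskets, while every other pair that occurs shares exactly one.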

Some questions to consider:
- What drives runtime and memory consumption of your implementation? What tools can you use for profiling runtime and memory consumption in Python? How can you reduce runtime and memory consumption?
- What are meaningful parameters of your program that you should expose to the user?
- Instead of counting how often two products co-occur in a shopping basket, which metrics would be more meaningful, and why?
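
As one possible starting point for the profiling question, the sketch below measures wall-clock time and peak memory of a single function call using only the standard library (`time.perf_counter` and `tracemalloc`). The helper name `profile` is hypothetical. Note that `tracemalloc` only sees allocations reported to Python's memory tracing, so treat the peak as an approximation.

```python
import time
import tracemalloc


def profile(func, *args, **kwargs):
    """Return (result, seconds, peak_bytes) for one call of `func`."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Example: profile a simple computation.
result, seconds, peak_bytes = profile(
    lambda n: sum(i * i for i in range(n)), 100_000
)
print(f"{seconds:.4f} s, peak {peak_bytes / 1024:.1f} KiB")
```

Running the same helper on inputs of increasing size (for example, a growing sample of orders) gives a rough picture of how runtime and memory scale.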

In [6]:
# please add your implementation here

<br>
<br>

## 2. &ensp; Task 2: "Time-resolved ML feature"

Count the number of distinct products (`product_id`) that each user (`user_id`) has purchased up to any given point in time (`order_number`). The output should be a `pd.DataFrame` with the following three columns:

1. User
1. Order number
1. Number of _different_ products purchased prior to the given order

Some questions to consider:
- Instead of calculating the ___number___ of different products users have purchased, how could you normalize the feature so it is more meaningful? Explain why you suggest this normalization.
- What similar features (other than the number of unique products) can you compute given the data that is available to you?
- What drives runtime and memory consumption of your implementation? How can you reduce runtime and memory consumption?
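
The sketch below illustrates the feature on a toy data set. It assumes the input has one row per (user, order, product), which for the real data would come from joining `orders` with `order_products`, and it reads "prior to the given order" as strictly before that order. The function name `distinct_products_before_order` and the output column `n_distinct_prior` are hypothetical.

```python
import pandas as pd

# Toy stand-in for `orders` joined with `order_products`.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 1, 2, 2],
        "order_number": [1, 1, 2, 3, 3, 1, 2],
        "product_id": [10, 20, 10, 20, 30, 10, 10],
    }
)


def distinct_products_before_order(df: pd.DataFrame) -> pd.DataFrame:
    keys = ["user_id", "order_number"]
    # First order in which each user bought each product.
    first_seen = (
        df.groupby(["user_id", "product_id"])["order_number"].min().reset_index()
    )
    # Number of products a user bought for the first time in each order.
    new_per_order = first_seen.groupby(keys).size().reset_index(name="n_new")
    # All (user, order) pairs, including orders without any new product.
    all_orders = df[keys].drop_duplicates().sort_values(keys)
    out = all_orders.merge(new_per_order, on=keys, how="left")
    out["n_new"] = out["n_new"].fillna(0).astype(int)
    # Cumulative count including the current order, minus the current
    # order's new products, gives the count strictly before the order.
    cumulative = out.groupby("user_id")["n_new"].cumsum()
    out["n_distinct_prior"] = cumulative - out["n_new"]
    return out[keys + ["n_distinct_prior"]]


feature = distinct_products_before_order(toy)
print(feature)
```

For user 1, the feature is 0 before the first order, 2 before the second, and still 2 before the third, because the second order contains no new product.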

In [7]:
# please add your implementation here

<br>
<br>

## 3. &ensp; Task 3: "P2V-MAP data streamer"

Your goal is to implement the data streamer used in

> P2V-MAP: Mapping Market Structure for Large Assortments (Gabel et al. 2019).

The data streamer is used in a model that predicts whether two products occur together in shopping baskets. Your streamer provides data for model training (and prediction) in batches. Each batch is made up of three `numpy` arrays, and each batch contains `B` training samples (i.e., rows in the `np.arrays`):

- $a_1$: the product ID of a product in the basket
- $a_2$: the product ID of another product in the basket, but _not_ the product in array $a_1$
- $a_3$: `N` _randomly_ chosen products that are not in the same basket as the products in $a_1$ and $a_2$ in a given row.

For example, if a basket contains the products (1, 2, 3) and the product assortment is made up of 10 products (i.e., 1, ..., 10), then a possible output for a batch size of `B=3` and `N=2` is:

- $a_1$ = `array([[1], [2], [3]])` (a `np.array` of size `3x1`)
- $a_2$ = `array([[2], [1], [1]])` (a `np.array` of size `3x1`)
- $a_3$ = `array([[4, 8], [9, 5], [7, 4]])` (a `np.array` of size `3x2`)

As you can see, $a_3$ does not contain any of the products that you can find in the baskets. Of course, the streamer should generalize to multiple baskets.
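
The example can be reproduced for a single basket with the sketch below. It assumes one row per product in the basket and uniform sampling of negatives without replacement within a row; the paper may use a different sampling scheme. The function name `sample_basket` and its parameters are hypothetical. The basket needs at least two distinct products, and the assortment needs at least `n_negative` products outside the basket.

```python
import numpy as np


def sample_basket(basket, assortment, n_negative, rng):
    """Build (a1, a2, a3) for a single basket, one row per basket product."""
    basket = np.unique(basket)
    # Candidate negatives: assortment products that are not in the basket.
    negatives = np.setdiff1d(assortment, basket)
    a1 = basket.reshape(-1, 1)
    # For each product in a1, draw a different product from the same basket.
    a2 = np.array([[rng.choice(basket[basket != p])] for p in basket])
    # For each row, draw `n_negative` distinct products outside the basket.
    a3 = np.stack(
        [rng.choice(negatives, size=n_negative, replace=False) for _ in basket]
    )
    return a1, a2, a3


rng = np.random.default_rng(0)
a1, a2, a3 = sample_basket([1, 2, 3], np.arange(1, 11), n_negative=2, rng=rng)
print(a1.shape, a2.shape, a3.shape)  # shapes: (3, 1) (3, 1) (3, 2)
```

The exact values in $a_2$ and $a_3$ depend on the random seed; only the shapes and the constraints (same basket for $a_2$, outside the basket for $a_3$) are fixed.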

Please create a Python class `DataStreamerP2V` that contains (at least) the following methods:
1. `__init__`
1. `generate_batch`: returns one batch of training samples (the three arrays specified above)
1. `reset_iterator`: resets the state of the data streamer, for example, after the data streamer has iterated through all baskets

The basket data (derived from `order_products` and `orders`) must be one of the inputs for the data streamer. Think about what parameters your streamer should feature so you allow the user to configure the streamer's functionality.

_Bonus:_ Implement a unit test for your streamer class.
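
One way to structure the class and a simple test is sketched below. This is not the implementation from the paper. It assumes that all ordered within-basket pairs are precomputed, that negatives are drawn uniformly, that the last batch may be smaller than `batch_size`, and that `generate_batch` returns `None` once all pairs are used. The parameters `batch_size`, `n_negative`, and `seed` are hypothetical choices.

```python
import numpy as np
import pandas as pd


class DataStreamerP2V:
    """Minimal sketch: streams (a1, a2, a3) batches over all baskets."""

    def __init__(self, baskets, batch_size=4, n_negative=2, seed=0):
        # `baskets` needs the columns `order_id` and `product_id`.
        baskets = baskets[["order_id", "product_id"]].drop_duplicates()
        self.batch_size = batch_size
        self.n_negative = n_negative
        self.rng = np.random.default_rng(seed)
        self.assortment = np.sort(baskets["product_id"].unique())
        # Products per basket, used to exclude them when drawing negatives.
        self.basket_products = {
            order_id: products.to_numpy()
            for order_id, products in baskets.groupby("order_id")["product_id"]
        }
        # All ordered within-basket pairs (a1, a2) with a1 != a2.
        pairs = baskets.merge(baskets, on="order_id", suffixes=("_1", "_2"))
        pairs = pairs[pairs["product_id_1"] != pairs["product_id_2"]]
        self.pairs = pairs[["order_id", "product_id_1", "product_id_2"]].to_numpy()
        self.reset_iterator()

    def reset_iterator(self):
        """Reshuffle the pairs and start again from the first one."""
        self.order = self.rng.permutation(len(self.pairs))
        self.position = 0

    def generate_batch(self):
        """Return the next (a1, a2, a3), or None once all pairs are used."""
        idx = self.order[self.position : self.position + self.batch_size]
        if len(idx) == 0:
            return None
        self.position += len(idx)
        rows = self.pairs[idx]
        a3 = np.stack([self._negatives(order_id) for order_id in rows[:, 0]])
        return rows[:, [1]], rows[:, [2]], a3

    def _negatives(self, order_id):
        # Uniform draw from all products outside the given basket.
        candidates = np.setdiff1d(self.assortment, self.basket_products[order_id])
        return self.rng.choice(candidates, size=self.n_negative, replace=False)


def test_streamer():
    toy = pd.DataFrame({"order_id": [1, 1, 1, 2, 2], "product_id": [1, 2, 3, 4, 5]})
    streamer = DataStreamerP2V(toy, batch_size=4, n_negative=2)
    n_rows = 0
    while (batch := streamer.generate_batch()) is not None:
        a1, a2, a3 = batch
        assert a1.shape == a2.shape == (len(a1), 1)
        assert a3.shape == (len(a1), 2)
        assert (a1 != a2).all()
        for product, negatives in zip(a1[:, 0], a3):
            basket = {1, 2, 3} if product <= 3 else {4, 5}
            assert not basket & set(negatives.tolist())
        n_rows += len(a1)
    assert n_rows == 8  # 6 ordered pairs from basket 1, 2 from basket 2
    streamer.reset_iterator()
    assert streamer.generate_batch() is not None


test_streamer()
```

Precomputing all pairs keeps the sketch short but is memory-hungry for large baskets; generating pairs lazily per basket is an alternative worth considering for the full data set.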

In [8]:
# please add your implementation here

<br>
<br>
&mdash; <br>
Sebastian Gabel <br>
`Learning from Big Data` <br>