# First submission to [Kaggle Instacart competition](https://www.kaggle.com/c/instacart-market-basket-analysis)

> In this competition, Instacart is challenging the Kaggle community to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order.

For my first submission, I want to keep it simple and establish a baseline: **I will just predict that each user's next order will be the same as their last order.**

# Why start with a simple baseline?

This philosophy is explained in the following excerpt from [*Data Science for Business*](https://www.safaribooksonline.com/library/view/data-science-for/9781449374273/), by Provost & Fawcett, Chapter 7:

> Another fundamental notion in data science is: *it is important to consider carefully what would be a reasonable baseline against which to compare model performance.* This is important for the data science team in order to understand whether they indeed are improving performance, and is equally important for demonstrating to stakeholders that mining the data has added value. …

> In Nate Silver’s book on prediction, *The Signal and the Noise* (2012), he mentions the baseline issue with respect to weather forecasting. Weather forecasters have two simple—but not simplistic—baseline models that they compare against. One (persistence) predicts that the weather tomorrow is going to be whatever it was today. The other (climatology) predicts whatever the average historical weather has been on this day from prior years. Each model performs considerably better than random guessing, and both are so easy to compute that they make natural baselines of comparison. Any new, more complex model must beat these. …

> What are some general guidelines for good baselines? For classification tasks, one good baseline is the *majority classifier*, a naive classifier that always chooses the majority class of the training dataset …

> For regression problems we have a directly analogous baseline: predict the average value over the population (usually the mean or median). …

> In some applications there are multiple simple averages that one may want to combine. For example, when evaluating recommendation systems that internally predict how many “stars” a particular customer would give to a particular movie, we have the average number of stars a movie gets across the population (how well liked it is) and the average number of stars a particular customer gives to movies (what that customer’s overall bias is). …

# First, download the CSV files
From https://www.kaggle.com/c/instacart-market-basket-analysis/data

> The dataset for this competition is a relational set of files describing customers' orders over time. The goal of the competition is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders. For more information, see the [blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) accompanying its public release.

# Use `pandas` for data wrangling
To learn more: http://www.dataschool.io/best-python-pandas-resources/

In [2]:
import pandas as pd

# Look at sample submission
The sample submission shows what our submission should look like.  
Submissions should have 75,000 rows and two columns: an order id and the products predicted for that order, as a space-separated string of product ids.

In [3]:
sample_submission = pd.read_csv('sample_submission.csv')

In [4]:
sample_submission.shape

(75000, 2)

In [5]:
sample_submission.head()

Unnamed: 0,order_id,products
0,17,39276 29259
1,34,39276 29259
2,137,39276 29259
3,182,39276 29259
4,257,39276 29259


Recall that the baseline idea is: **I will just predict that each user's next order will be the same as their last order.**

To create a submission in the correct format, I'll need to answer:
* For a given `order_id`, who is the `user_id`?
* For that `user_id`, what was the `order_id` of their last prior order?
* For that last prior `order_id`, what were the `products` in the order?

Pandas can be used to wrangle the data and answer these questions.
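Before running this on the full files, the three questions above can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the tiny frames below are made up (only the column names match the real CSVs), and `last_order_products` is a hypothetical helper name.

```python
import pandas as pd

# Toy data using the same column names as orders.csv (values are made up).
orders = pd.DataFrame({
    'order_id':     [10, 11, 12, 20, 21],
    'user_id':      [1, 1, 1, 2, 2],
    'eval_set':     ['prior', 'prior', 'test', 'prior', 'test'],
    'order_number': [1, 2, 3, 1, 2],
})

# Toy data shaped like order_products__prior.csv.
order_products = pd.DataFrame({
    'order_id':   [10, 10, 11, 11, 20],
    'product_id': [196, 14084, 196, 12427, 24852],
})

def last_order_products(test_order_id):
    """Hypothetical helper: products in the user's last prior order."""
    # 1. For a given order_id, who is the user_id?
    user_id = orders.loc[orders.order_id == test_order_id, 'user_id'].iloc[0]
    # 2. For that user_id, what was the order_id of their last prior order?
    prior = orders[(orders.user_id == user_id) & (orders.eval_set == 'prior')]
    last_order_id = prior.loc[prior.order_number.idxmax(), 'order_id']
    # 3. For that last prior order_id, what were the products in the order?
    mask = order_products.order_id == last_order_id
    return order_products.loc[mask, 'product_id'].tolist()

print(last_order_products(12))  # user 1's last prior order is order 11
print(last_order_products(21))  # user 2's last prior order is order 20
```

A per-order lookup like this would be far too slow for 75,000 test orders, which is why the rest of the notebook answers the same questions with vectorized `groupby` and `merge` calls.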

# Load `orders.csv`

> This file tells to which set (prior, train, test) an order belongs. You are predicting reordered items only for the test set orders.

In [6]:
orders = pd.read_csv('orders.csv')

In [7]:
orders.shape

(3421083, 7)

In [8]:
orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [9]:
orders.eval_set.value_counts()

prior    3214874
train     131209
test       75000
Name: eval_set, dtype: int64

#### Split into 3 dataframes (prior, train, test)

In [10]:
orders_prior = orders[orders.eval_set=='prior']
orders_train = orders[orders.eval_set=='train']
orders_test = orders[orders.eval_set=='test']

# Load `order_products__prior.csv`

> order_products__prior.csv contains previous order contents for all customers. 'reordered' indicates that the customer has a previous order that contains the product.

In [11]:
%%time
order_products_prior = pd.read_csv('order_products__prior.csv')

Wall time: 9.85 s
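With more than 32 million rows, memory can matter as much as load time. One option is to pass narrower integer types to `read_csv` through its `dtype` parameter. The sketch below uses a three-row in-memory CSV so it runs on its own; the specific dtype choices are my assumptions about the value ranges in the real file and should be checked before relying on them.

```python
import io
import pandas as pd

# A tiny stand-in for order_products__prior.csv (same header, three rows).
csv_text = """order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
"""

# Assumed value ranges: ids fit in 32 bits, cart positions in 16 bits,
# and reordered is a 0/1 flag. Verify these against the real data.
dtypes = {
    'order_id': 'int32',
    'product_id': 'int32',
    'add_to_cart_order': 'int16',
    'reordered': 'int8',
}

default = pd.read_csv(io.StringIO(csv_text))                # int64 columns
compact = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)  # narrower columns

print(default.memory_usage(index=False).sum())
print(compact.memory_usage(index=False).sum())
```

The values are identical in both frames; only the storage width per column changes.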


In [12]:
order_products_prior.shape

(32434489, 4)

In [13]:
order_products_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


# Group products by `order_id`
I don't want one row per `product_id` and multiple rows per `order_id`.  
Instead I want one row per `order_id`, with a list of all the products in that order.

In [14]:
%%time
products = order_products_prior.groupby('order_id')['product_id'].apply(list).reset_index()
products.rename(columns={'product_id': 'products'}, inplace=True)

Wall time: 2min 21s
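To see what this grouping step produces, here is the same idea on a toy frame. Two things differ from the cell above and are my own choices rather than the notebook's: the rows are sorted by `add_to_cart_order` first so the order of each list is explicit, and `agg(list)` is used in place of `apply(list)`. I have not timed either variant on the full data.

```python
import pandas as pd

# Toy rows shaped like order_products__prior.csv, deliberately shuffled.
order_products = pd.DataFrame({
    'order_id':          [3, 2, 2, 3, 2],
    'product_id':        [24838, 28985, 33120, 33754, 9327],
    'add_to_cart_order': [2, 2, 1, 1, 3],
})

products = (
    order_products
    .sort_values(['order_id', 'add_to_cart_order'])  # make list order explicit
    .groupby('order_id')['product_id']
    .agg(list)                                       # one list per order_id
    .reset_index()
    .rename(columns={'product_id': 'products'})
)

print(products)
```

The result has one row per `order_id`, with the products listed in the order they were added to the cart.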


In [15]:
products.head()

Unnamed: 0,order_id,products
0,2,"[33120, 28985, 9327, 45918, 30035, 17794, 4014..."
1,3,"[33754, 24838, 17704, 21903, 17668, 46667, 174..."
2,4,"[46842, 26434, 39758, 27761, 10054, 21351, 225..."
3,5,"[13176, 15005, 47329, 27966, 23909, 48370, 132..."
4,6,"[40462, 15873, 41897]"


# Merge products & prior orders

In [16]:
orders_prior = orders_prior.merge(products, on='order_id')
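`merge` defaults to an inner join, so any prior order without a matching row in `products` would be dropped silently. A hedged way to check for that, sketched on toy data, is a left join with `indicator=True`, plus `validate='one_to_one'` to confirm that `order_id` is unique on both sides. The frames below are made up for illustration.

```python
import pandas as pd

# Toy frames: order 4 has no product list on purpose.
orders_prior = pd.DataFrame({'order_id': [2, 3, 4], 'user_id': [1, 1, 2]})
products = pd.DataFrame({
    'order_id': [2, 3],
    'products': [[33120, 28985], [33754]],
})

merged = orders_prior.merge(
    products,
    on='order_id',
    how='left',             # keep every prior order, matched or not
    validate='one_to_one',  # raise if order_id repeats on either side
    indicator=True,         # adds a _merge column describing each row
)

# Rows that an inner join would have dropped.
missing = merged[merged['_merge'] == 'left_only']
print(len(missing))
```

If `missing` is empty on the real data, the inner join used above loses nothing.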

In [17]:
orders_prior.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,products
0,2539329,1,prior,1,2,8,,"[196, 14084, 12427, 26088, 26405]"
1,2398795,1,prior,2,3,7,15.0,"[196, 10258, 12427, 13176, 26088, 13032]"
2,473747,1,prior,3,3,12,21.0,"[196, 12427, 10258, 25133, 30450]"
3,2254736,1,prior,4,4,7,29.0,"[196, 12427, 10258, 25133, 26405]"
4,431534,1,prior,5,4,15,28.0,"[196, 12427, 10258, 25133, 10326, 17122, 41787..."


# Get last order for each user

In [18]:
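# Flag each user's most recent prior order (the one with the highest order_number).
# The string alias 'max' avoids the warning newer pandas versions emit for the builtin max.
orders_prior['last_order'] = orders_prior.order_number == orders_prior.groupby('user_id')['order_number'].transform('max')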

In [19]:
last_orders = orders_prior.query('last_order==True')[['user_id', 'products']]

In [20]:
last_orders.head()

Unnamed: 0,user_id,products
9,1,"[196, 46149, 39657, 38928, 25133, 10258, 35951..."
23,2,"[24852, 16589, 1559, 19156, 18523, 22825, 2741..."
35,3,"[39190, 18599, 23650, 21903, 47766, 24810]"
40,4,"[26576, 25623, 21573]"
44,5,"[27344, 24535, 43693, 40706, 16168, 21413, 139..."
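As a cross-check on the last-order step, the same rows can be selected without a helper column by sorting and keeping the final row per user. This is an alternative sketch on toy data, not a replacement for the cells above.

```python
import pandas as pd

# Toy version of the merged orders_prior frame.
orders_prior = pd.DataFrame({
    'user_id':      [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'products':     [[196], [196, 10258], [196, 46149],
                     [24852], [24852, 16589]],
})

# Approach used in the notebook: compare against the per-user maximum.
is_last = orders_prior.order_number == (
    orders_prior.groupby('user_id')['order_number'].transform('max')
)
via_transform = orders_prior[is_last][['user_id', 'products']]

# Alternative: sort, then keep the last row for each user_id.
via_dedup = (
    orders_prior
    .sort_values(['user_id', 'order_number'])
    .drop_duplicates('user_id', keep='last')[['user_id', 'products']]
)

print(via_transform)
print(via_dedup)
```

Both frames should contain one row per user, holding that user's highest-numbered prior order.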


# Predict that each user's next order will be the same as their last order, for each user / order id in the test set

In [21]:
submission = orders_test.merge(last_orders, on='user_id')[['order_id', 'products']]

In [22]:
submission.shape

(75000, 2)

In [23]:
submission.head()

Unnamed: 0,order_id,products
0,2774568,"[39190, 18599, 23650, 21903, 47766, 24810]"
1,329954,"[26576, 25623, 21573]"
2,1528013,"[49401, 25659, 8424]"
3,1376945,"[24799, 17706, 33572, 27959, 48697, 49374, 830..."
4,1356845,"[13176, 14992, 44422, 11520, 31506, 22959, 712..."


In [24]:
def convert(list_of_numbers):
    list_of_strings = [str(x) for x in list_of_numbers]
    return ' '.join(list_of_strings)

In [25]:
submission.products = submission.products.apply(convert)

In [26]:
submission.head()

Unnamed: 0,order_id,products
0,2774568,39190 18599 23650 21903 47766 24810
1,329954,26576 25623 21573
2,1528013,49401 25659 8424
3,1376945,24799 17706 33572 27959 48697 49374 8309 30563...
4,1356845,13176 14992 44422 11520 31506 22959 7120 37687...
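Before writing the file, it can be worth comparing the submission against the sample submission. The helper below is a hypothetical sketch (`check_submission` is my own name, and the toy frames are made up); it only checks the properties visible in this notebook: column names, unique order ids, matching order ids, and string-valued products.

```python
import pandas as pd

def check_submission(submission, sample_submission):
    """Return a list of problems found; an empty list means the checks passed."""
    if list(submission.columns) != list(sample_submission.columns):
        return ['columns differ from the sample submission']
    problems = []
    if not submission['order_id'].is_unique:
        problems.append('duplicate order_id values')
    if set(submission['order_id']) != set(sample_submission['order_id']):
        problems.append('order_id values differ from the sample submission')
    if not submission['products'].map(lambda p: isinstance(p, str)).all():
        problems.append('products must be space-separated strings')
    return problems

# Toy stand-ins for the real frames.
sample = pd.DataFrame({'order_id': [17, 34, 137],
                       'products': ['39276 29259'] * 3})
good = pd.DataFrame({'order_id': [34, 17, 137],
                     'products': ['1 2', '3', '4 5 6']})
bad = pd.DataFrame({'order_id': [17, 17, 34],
                    'products': [[1, 2], '3', '4']})

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

Row order does not matter to these checks, since the order ids are compared as sets.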


In [27]:
submission.to_csv('submission-2017-06-06-A.csv', index=False)

The submission scored 0.3118026 on [Kaggle's Public Leaderboard](https://www.kaggle.com/c/instacart-market-basket-analysis/leaderboard).
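To interpret that number, it helps to know how a score of this kind can be computed. As far as I know, the competition is scored by the mean F1 score across orders, but the notebook does not state this, so treat it as an assumption. The sketch below computes F1 for one predicted basket against one actual basket; `f1_score` and `mean_f1` are hypothetical helper names, and scoring two empty baskets as 1.0 is my own convention.

```python
def f1_score(predicted, actual):
    """F1 between two collections of product ids, compared as sets."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 1.0  # convention chosen here: two empty baskets agree
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)  # share of predictions that were right
    recall = hits / len(actual)        # share of actual items that were found
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions, truths):
    """Average per-order F1 over paired lists of baskets."""
    scores = [f1_score(p, t) for p, t in zip(predictions, truths)]
    return sum(scores) / len(scores)

# Made-up example: 2 of 3 predicted items are correct, out of 4 actual items.
predicted = [196, 12427, 10258]
actual = [196, 10258, 25133, 26405]
print(f1_score(predicted, actual))
```

Here precision is 2/3 and recall is 1/2, so F1 is 4/7. The same functions could be applied to the `train` orders, whose contents are known, to estimate a baseline's score without using up a submission.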