# Data Description

### Instacart Purchase Prediction Competition (Kaggle)

https://www.kaggle.com/c/instacart-market-basket-analysis/overview

Last updated: November 16, 2020

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

import random

import matplotlib.pyplot as plt
%matplotlib inline

## Load data

In [2]:
aisles_df = pd.read_csv("aisles.csv")
departments_df = pd.read_csv("departments.csv")

orders_df = pd.read_csv("orders.csv")
products_df = pd.read_csv("products.csv")

prior_df = pd.read_csv("order_products__prior.csv")
train_df = pd.read_csv("order_products__train.csv")

template_df = pd.read_csv("sample_submission.csv")

## About the data

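We go through every dataset provided by *Kaggle*: `aisles.csv`, `departments.csv`, `products.csv`, `orders.csv`, `order_products__prior.csv`, `order_products__train.csv`, and `sample_submission.csv`.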

### Aisles.csv

`aisles.csv` contains the aisle information.

There are 2 columns:

- `aisle_id`: unique aisle ID (1 to 134)

- `aisle`: the name of the aisle category

In [3]:
aisles_df.head()

Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation


### Departments.csv

`departments.csv` contains the department information.

There are 2 columns:

- `department_id`: unique department ID (1 to 21)

- `department`: the name of the department category

In [5]:
departments_df.head()

Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol


In [12]:
departments_df['department'].values

array(['frozen', 'other', 'bakery', 'produce', 'alcohol', 'international',
       'beverages', 'pets', 'dry goods pasta', 'bulk', 'personal care',
       'meat seafood', 'pantry', 'breakfast', 'canned goods',
       'dairy eggs', 'household', 'babies', 'snacks', 'deli', 'missing'],
      dtype=object)

### Products.csv

`products.csv` contains the inventory of all products.

There are 4 columns:

- `product_id`: unique product id (1 to 49688)
- `product_name`: name of the product
- `aisle_id`: the id of the aisle (1 to 134)
- `department_id`: the id of the department (1 to 21)

In [13]:
products_df.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13
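
`products.csv` stores only the aisle and department IDs, so the names have to be looked up in the other two tables. The following is a minimal sketch of that lookup. The `*_sample` frames are hand-built for illustration from rows shown in this notebook, and the department names assume that `department_id` follows the order of the array printed earlier (13 is `pantry`, 19 is `snacks`). On the full data the same `merge` calls should apply to `products_df`, `aisles_df`, and `departments_df`.

```python
import pandas as pd

# Rows copied from the head() outputs in this notebook; not the full tables.
products_sample = pd.DataFrame({
    "product_id": [1, 2, 5],
    "product_name": [
        "Chocolate Sandwich Cookies",
        "All-Seasons Salt",
        "Green Chile Anytime Sauce",
    ],
    "aisle_id": [61, 104, 5],
    "department_id": [19, 13, 13],
})
aisles_sample = pd.DataFrame({
    "aisle_id": [5],
    "aisle": ["marinades meat preparation"],
})
# Assumption: department_id equals the 1-based position in the printed array.
departments_sample = pd.DataFrame({
    "department_id": [13, 19],
    "department": ["pantry", "snacks"],
})

# Left joins keep every product, even when the sample lacks a lookup row.
products_named = (
    products_sample
    .merge(aisles_sample, on="aisle_id", how="left")
    .merge(departments_sample, on="department_id", how="left")
)
print(products_named[["product_name", "aisle", "department"]])
```

Aisles 61 and 104 are missing from the small sample, so those two rows come back with an empty `aisle`; with the complete `aisles_df` every product should find a match.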


### Orders.csv

`orders.csv` links each `order_id` to a `user_id` and provides some additional information about the purchase.

It has 7 columns:

- `order_id`: unique order id (1 to 3,421,083)
- `user_id`: unique user id (1 to 206,209)
- `eval_set`: which set (prior, train, test) the order belongs to
- `order_number`: sequence number of the order for that user, from oldest to newest (min = 1, max = 100)
- `order_dow`: day of the week, from 0 to 6
- `order_hour_of_day`: hour of the day, from 0 to 23
- `days_since_prior_order`: days since the user's previous order (empty for a user's first order)

In [16]:
orders_df.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
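
In the output above, `days_since_prior_order` is empty for the row with `order_number` 1 and holds the gap to the previous order otherwise. The sketch below checks that pattern and accumulates the gaps to get the days elapsed since a user's first order. It rebuilds the five rows shown above as `orders_sample`; that name and the `days_since_first_order` column are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# The five rows shown by orders_df.head() above.
orders_sample = pd.DataFrame({
    "order_id": [2539329, 2398795, 473747, 2254736, 431534],
    "user_id": [1, 1, 1, 1, 1],
    "eval_set": ["prior"] * 5,
    "order_number": [1, 2, 3, 4, 5],
    "order_dow": [2, 3, 3, 4, 4],
    "order_hour_of_day": [8, 7, 12, 7, 15],
    "days_since_prior_order": [np.nan, 15.0, 21.0, 29.0, 28.0],
})

is_first = orders_sample["order_number"] == 1
is_missing = orders_sample["days_since_prior_order"].isna()

# True when the gap is missing exactly on each user's first order.
missing_only_on_first = bool((is_first == is_missing).all())
print(missing_only_on_first)

# Days elapsed since each user's first order: cumulative sum of the gaps.
orders_sample["days_since_first_order"] = (
    orders_sample.sort_values(["user_id", "order_number"])
    .groupby("user_id")["days_since_prior_order"]
    .transform(lambda s: s.fillna(0).cumsum())
)
print(orders_sample[["order_number", "days_since_first_order"]])
```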


### Order_products__prior.csv

`order_products__prior.csv` (`prior_df`) shows which products were purchased in each prior order.

There are 4 columns:

- `order_id`
- `product_id`
- `add_to_cart_order`: the position at which the product was added to the cart within the order
- `reordered`: whether the user had ordered this product before (1) or not (0)

In [17]:
prior_df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0
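
Because each row is one product within one order, per-order quantities such as basket size and the share of reordered items come from grouping on `order_id`. A minimal sketch follows, using only the five rows shown above as `prior_sample` (a name introduced here). `head()` may not show all of order 2, so the numbers describe the sample rather than the complete order.

```python
import pandas as pd

# The five rows shown by prior_df.head() above.
prior_sample = pd.DataFrame({
    "order_id": [2, 2, 2, 2, 2],
    "product_id": [33120, 28985, 9327, 45918, 30035],
    "add_to_cart_order": [1, 2, 3, 4, 5],
    "reordered": [1, 1, 0, 1, 0],
})

# One row per order: number of items and fraction flagged as reordered.
order_summary = prior_sample.groupby("order_id").agg(
    n_items=("product_id", "size"),
    reorder_share=("reordered", "mean"),
)
print(order_summary)
```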


### Order_products__train.csv

`order_products__train.csv` (`train_df`) is our training data. It contains 131,209 orders covering 39,123 distinct products. Each order is made by a different user, so the orders come from 131,209 users.

There are 4 columns:

- `order_id`
- `product_id`
- `add_to_cart_order`: the position at which the product was added to the cart within the order
- `reordered`: whether the user had ordered this product before (1) or not (0)

In [18]:
train_df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1
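
The order, product, and user counts for the training set can be recomputed by counting distinct IDs and looking up each order's `user_id` in the orders table. The helper below, `summarize_train`, is a hypothetical name for this sketch, and the toy frames use made-up IDs rather than rows from the real files.

```python
import pandas as pd

def summarize_train(order_products, orders):
    """Count distinct orders, products, and users in an order-products table."""
    # Attach user_id to each distinct order that appears in the table.
    order_users = (
        order_products[["order_id"]]
        .drop_duplicates()
        .merge(orders[["order_id", "user_id"]], on="order_id", how="left")
    )
    return {
        "n_orders": order_products["order_id"].nunique(),
        "n_products": order_products["product_id"].nunique(),
        "n_users": order_users["user_id"].nunique(),
    }

# Toy frames: made-up IDs for illustration only.
toy_order_products = pd.DataFrame({
    "order_id": [10, 10, 11],
    "product_id": [100, 101, 100],
})
toy_orders = pd.DataFrame({"order_id": [10, 11], "user_id": [7, 8]})

summary = summarize_train(toy_order_products, toy_orders)
print(summary)
```

On the full data, `summarize_train(train_df, orders_df)` would be the corresponding call; equal `n_orders` and `n_users` would indicate that every training order belongs to a different user.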


### Sample_submission.csv

`sample_submission.csv` (`template_df`) contains the template of the submission file. It has 2 columns:

- `order_id`: the orders we need to predict
- `products`: our prediction; multiple product IDs are joined by spaces

There are **75,000** orders to predict.

In [22]:
template_df.head()

Unnamed: 0,order_id,products
0,17,39276 29259
1,34,39276 29259
2,137,39276 29259
3,182,39276 29259
4,257,39276 29259
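
To produce a file in this shape, predictions held as one row per (`order_id`, `product_id`) pair can be collapsed into one space-separated string per order. The sketch below uses a hypothetical helper, `to_submission`. Filling orders that have no predicted product with the string `None` is an assumption about the competition's convention and should be checked against its evaluation page.

```python
import pandas as pd

def to_submission(predictions, order_ids):
    """Build a two-column submission frame from (order_id, product_id) rows."""
    joined = (
        predictions.groupby("order_id")["product_id"]
        .apply(lambda ids: " ".join(str(i) for i in ids))
    )
    # Orders with no predicted product get the placeholder "None"
    # (assumed convention, not confirmed by this notebook).
    products = joined.reindex(order_ids).fillna("None")
    return pd.DataFrame({
        "order_id": list(order_ids),
        "products": products.to_numpy(),
    })

# Predictions echoing the template rows above, plus one order left empty.
predictions = pd.DataFrame({
    "order_id": [17, 17, 34, 34],
    "product_id": [39276, 29259, 39276, 29259],
})
submission = to_submission(predictions, order_ids=[17, 34, 137])
print(submission)
```

Passing `template_df["order_id"]` as `order_ids` would keep the rows in the same order as the template.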


## Some more information about the data

#### 1. In a single order, there are no duplicate products

All `product_id` values within a single order are unique.

In [29]:
# Count rows that repeat an earlier (order_id, product_id) pair in prior_df
n_duplicates = prior_df.duplicated(subset=['order_id', 'product_id']).sum()

# Report the result of the check
if n_duplicates > 0:
    print("Duplicates found")
else:
    print("No duplicates")

No duplicates


In [30]:
# Count rows that repeat an earlier (order_id, product_id) pair in train_df
n_duplicates = train_df.duplicated(subset=['order_id', 'product_id']).sum()

# Report the result of the check
if n_duplicates > 0:
    print("Duplicates found")
else:
    print("No duplicates")

No duplicates
