# Rotman Data Science Competition
## Part 1.2: Exploratory Data Analysis of Context of Orders

# Table of Contents
1. [Competition Data Analysis](#CompetitionData)

## Part 1. Competition Data Analysis <a name="CompetitionData"></a>
### 1.0 Import Libraries & Load Data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
SHOW_GRAPHS = True

In [None]:
def load_competition_data() -> pd.DataFrame:
    """Load the competition order data from CSV."""
    DATA_PATH = "data/mma_mart.csv"
    data = pd.read_csv(DATA_PATH)
    return data

# Output directory for saved graphs
GRAPH_OUT_PATH = "graphs/"

In [None]:
mma_data = load_competition_data()
mma_data.head()

### 1.1 Examining How Items in an Order Relate to Each Other
To begin, read through the raw data to see whether the items within an order appear related. This check is done manually first; automating it is the aim of this project.


In [None]:
mma_data.head(200)
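
Reading raw rows makes it hard to see each order as a unit. As a rough sketch, the rows can be collapsed into one list of product names per order. The toy DataFrame below stands in for `mma_data` and assumes only the `order_id` and `product_name` columns; the helper name `orders_as_baskets` is hypothetical.

```python
import pandas as pd

# Toy stand-in for mma_data: one row per item in an order.
toy = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2],
    "product_name": ["Garlic", "Yellow Potato", "Smoked Uncured Kielbasa",
                     "Bananas", "Original Hummus"],
})

def orders_as_baskets(df: pd.DataFrame) -> pd.Series:
    """Collapse one-row-per-item data into one list of product names per order."""
    return df.groupby("order_id")["product_name"].agg(list)

baskets = orders_as_baskets(toy)
for order_id, items in baskets.items():
    print(order_id, "->", ", ".join(items))
```

On the real data, `orders_as_baskets(mma_data).head(20)` would give a more readable view than `mma_data.head(200)` for this kind of manual inspection.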

In [None]:
# Count the order rows for each product, most frequently sold first
prod_sales_count = (
    mma_data.groupby("product_id")["order_id"]
    .count()
    .sort_values(ascending=False)
    .rename("sales_count")
    .reset_index()
)

In [None]:
# Add column of product names corresponding to the product IDs
id_to_name = mma_data[["product_id", "product_name"]].drop_duplicates().set_index("product_id")
prod_sales_count["product_name"] = prod_sales_count["product_id"].map(id_to_name["product_name"])
prod_sales_count.head(100)

### a) Case Study
First, look for a product that can clearly be substituted in more than one way (e.g. apple -> apple juice, apple -> apple extract, apple -> banana).

In [None]:
# Look for good products for a case study
prod_sales_count[prod_sales_count["product_name"].str.contains("garlic", case=False)]
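
To judge whether a keyword has several kinds of substitutes, it can help to see the matching products grouped by aisle. The sketch below assumes an `aisle` column has been joined onto the sales counts (the original `prod_sales_count` does not carry one); the toy rows, aisle names, and the helper `keyword_candidates` are hypothetical.

```python
import pandas as pd

# Toy stand-in for prod_sales_count with an assumed, pre-joined aisle column.
toy_counts = pd.DataFrame({
    "product_name": ["Garlic", "Garlic Powder", "Minced Garlic", "Bananas"],
    "aisle": ["fresh vegetables", "spices seasonings",
              "spices seasonings", "fresh fruits"],
    "sales_count": [120, 40, 15, 300],
})

def keyword_candidates(df: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return products whose name contains the keyword, grouped by aisle."""
    mask = df["product_name"].str.contains(keyword, case=False)
    return (
        df[mask]
        .sort_values(["aisle", "sales_count"], ascending=[True, False])
        .reset_index(drop=True)
    )

candidates = keyword_candidates(toy_counts, "garlic")
print(candidates)
```

Matches spread over several aisles hint at a product with more than one substitution route, which is what the case study is looking for.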

### b) Trying Out the Fill-Mask Task Manually
Remove an entry from each of a few orders, and see if we can predict the missing entries manually.

In [81]:
# Extract the IDs of short orders containing 5 to 8 items (inclusive)
short_order_oids = mma_data.groupby("order_id")["product_id"].count()
short_order_oids = pd.DataFrame(short_order_oids[short_order_oids.between(5, 8)].index)

In [84]:
print(short_order_oids.head())
short_order_oids.shape

   order_id
0         1
1         3
2        11
3        15
4        20


(26656, 1)

Pick a few orders, then replace the product name of the first item in each order with a placeholder.
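
The masking step can also be written as a small function. The sketch below is a variant that hides one randomly chosen item per order, using a toy DataFrame in place of `mma_data`. The function name `mask_one_item_per_order`, the `seed` parameter, and the toy product names are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

MASK_TOKEN = "guess who I am"

# Toy stand-in for the sampled orders.
toy = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2],
    "product_name": ["A", "B", "C", "D", "E"],
})

def mask_one_item_per_order(df: pd.DataFrame, seed: int = 0):
    """Hide one random item per order; return the masked frame and the held-out rows."""
    masked = df.copy()  # work on a copy so the input frame is left untouched
    # Draw exactly one row per order and keep its index label
    picked = masked.groupby("order_id").sample(n=1, random_state=seed).index
    held_out = masked.loc[picked, ["order_id", "product_name"]].copy()
    masked.loc[picked, "product_name"] = MASK_TOKEN
    return masked, held_out

masked, held_out = mask_one_item_per_order(toy)
print(masked)
```

Returning the held-out rows alongside the masked frame keeps the answers separate from what the person guessing gets to see.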


In [88]:
# Pick a few orders; .copy() avoids pandas' SettingWithCopyWarning when the sample is modified below
short_ord_id_samp = short_order_oids.sample(5)
short_ord_samp = mma_data[mma_data["order_id"].isin(short_ord_id_samp["order_id"])].copy()
short_ord_samp = short_ord_samp.drop(columns=["product_id", "aisle_id", "aisle", "department_id", "department"])

answers = short_ord_samp.copy(deep=True)
# Replace the product name of the first item in each order with a placeholder
cur_o_id = None
for i in short_ord_samp.index:
    if cur_o_id != short_ord_samp.loc[i, "order_id"]:
        cur_o_id = short_ord_samp.loc[i, "order_id"]
        short_ord_samp.loc[i, "product_name"] = "guess who I am"

short_ord_samp


| | order_id | product_name |
|---|---|---|
| 135800 | 13730 | guess who I am |
| 135801 | 13730 | Diet Coke Caffeine Free Soda |
| 135802 | 13730 | Thin Sliced Oven Roasted Turkey Breast |
| 135803 | 13730 | Muenster Cheese Slices |
| 135804 | 13730 | Premium Paper Towels |
| 135805 | 13730 | Naturally Hickory Smoked Hometown Original Bacon |
| 135806 | 13730 | Buttermilk Ranch Dressing & Dip |
| 168920 | 17053 | guess who I am |
| 168921 | 17053 | Smoked Uncured Kielbasa |
| 168922 | 17053 | Garlic |


| Order ID | My guess | Correct answer | Match |
|---|---|---|---|
| 13730 | Turkey meal accessory or dressing | AA Batteries | x |
| 17053 | Salad ingredient or main course | Yellow Potato | √ |
| 58719 | Snack, fruit, drink, or salad dressing | Original Hummus | √ |
| 80755 | Something to accompany cake: a drink, another sweet food item, or accessories like candles | Bananas | √ |
| 89470 | Side dish, topping, or drink | Lime Seltzer | √ |

My score: 4/5

In [89]:
answers

| | order_id | product_name |
|---|---|---|
| 135800 | 13730 | Coppertop AA Batteries |
| 135801 | 13730 | Diet Coke Caffeine Free Soda |
| 135802 | 13730 | Thin Sliced Oven Roasted Turkey Breast |
| 135803 | 13730 | Muenster Cheese Slices |
| 135804 | 13730 | Premium Paper Towels |
| 135805 | 13730 | Naturally Hickory Smoked Hometown Original Bacon |
| 135806 | 13730 | Buttermilk Ranch Dressing & Dip |
| 168920 | 17053 | Yellow Potato |
| 168921 | 17053 | Smoked Uncured Kielbasa |
| 168922 | 17053 | Garlic |


This small experiment (4 of 5 masked items guessed correctly at the category level) suggests that the other items in an order carry usable information about a missing item. That in turn suggests a Natural Language Processing approach, such as a fill-mask model, may be useful for this problem.
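
To make the idea of predicting a missing item from the rest of the order concrete, here is a minimal co-occurrence baseline. It is not the method used in this notebook and not an NLP model, just an illustrative sketch: the toy orders and the helper names `build_cooccurrence` and `predict_missing` are assumptions made for this example.

```python
from collections import Counter
from itertools import permutations

# Toy orders; product names borrowed from the samples above, groupings invented.
toy_orders = [
    ["Yellow Potato", "Smoked Uncured Kielbasa", "Garlic"],
    ["Yellow Potato", "Garlic", "Bananas"],
    ["Original Hummus", "Bananas"],
]

def build_cooccurrence(orders):
    """Count, for each product, how often every other product shares an order with it."""
    counts = {}
    for order in orders:
        for a, b in permutations(set(order), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts

def predict_missing(visible_items, cooc, top_k=3):
    """Rank candidate products by their summed co-occurrence with the visible items."""
    scores = Counter()
    for item in visible_items:
        scores.update(cooc.get(item, {}))
    # Products already in the order cannot be the missing one
    for item in visible_items:
        scores.pop(item, None)
    return [name for name, _ in scores.most_common(top_k)]

cooc = build_cooccurrence(toy_orders)
prediction = predict_missing(["Smoked Uncured Kielbasa", "Garlic"], cooc)
print(prediction)
```

In a real evaluation, the co-occurrence counts would need to be built from orders other than the ones being masked, otherwise the held-out answer leaks into the counts.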