See https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b for an explanation of all the fields.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from IPython.display import Image, display
# from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeRegressor, export_graphviz

What if we could predict the `aisle` based on the `product_name`? We would need to train on `product_name` and `aisle_id` and then use the values in aisles.csv to go from `aisle_id` to `aisle` (the name of the aisle).
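
As a minimal sketch of that last step, assuming aisles.csv has the columns `aisle_id` and `aisle`: the two toy rows below are made up for illustration and are not taken from the real file, and the names `toy_aisles` and `aisle_by_id` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for aisles.csv; the real file is read further down.
toy_aisles = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["toy aisle a", "toy aisle b"]})

# Build an aisle_id -> aisle lookup once, then map predicted ids to names.
aisle_by_id = toy_aisles.set_index("aisle_id")["aisle"]

predicted_ids = pd.Series([2, 1, 2])
predicted_names = predicted_ids.map(aisle_by_id)
print(predicted_names.tolist())
```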

Let's take a closer look at products.csv first.

In [None]:
!apt-get install -y tree

In [None]:
%cd ../input
!tree

In [None]:
ITEMS = 10_000

# These are completely, entirely arbitrary. I generated them using:
#
# >>> import random
# >>> random.randint(1, 2**32 - 1)

RANDOM_STATE_SAMPLE_ONE =  596497102
RANDOM_STATE_SAMPLE_TWO = 1539131531
RANDOM_STATE_DEC_TREE   = 2343470337

In [None]:
products = pd.read_csv("products.csv")
products = products.sample(n=ITEMS, random_state=RANDOM_STATE_SAMPLE_ONE)
products = products.drop(columns=["product_id", "department_id"])

products.head()

... and a quick look at aisles.csv:

In [None]:
aisles = pd.read_csv("aisles.csv")
aisles.head()

In [None]:
# This is going to be our prediction target
y = products.aisle_id

_products = pd.get_dummies(products)

In [None]:
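def id_to_aisle(_id):
    # Look up the aisle name for an aisle_id; returns None if the id is unknown
    matches = aisles.loc[aisles.aisle_id == _id, "aisle"]
    return matches.iloc[0] if not matches.empty else None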

In [None]:
# Define model. Specify a number for random_state to ensure same results each run
products_model = DecisionTreeRegressor(random_state=RANDOM_STATE_DEC_TREE)

# Fit model
products_model.fit(_products, y)
predictions = products_model.predict(_products.head())

for idx, prediction in enumerate(predictions):
    print(products.iloc[idx].product_name, id_to_aisle(prediction))

In [None]:
products = pd.read_csv("products.csv")
products = products.sample(n=ITEMS, random_state=RANDOM_STATE_SAMPLE_TWO)
products = products.drop(columns=["product_id", "department_id"])
products.head()

In [None]:
_products = pd.get_dummies(products)

predictions = products_model.predict(_products)

n_correct = 0

for idx, prediction in enumerate(predictions):
    if products.iloc[idx].aisle_id == prediction:
        n_correct += 1

print(f"Got {n_correct} products correct, out of {len(products)} (accuracy: {round((n_correct/len(products)) * 100, 2)}%).")
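
The counting loop above can also be written as a single vectorized comparison. A small sketch with made-up numbers (the arrays below are hypothetical stand-ins, not results from this dataset):

```python
import numpy as np

# Hypothetical true labels and predictions, standing in for
# products.aisle_id and the model's output.
true_ids = np.array([3, 7, 7, 1])
predicted_ids = np.array([3.0, 7.0, 2.0, 1.0])

# Element-wise comparison gives booleans; their mean is the accuracy.
is_correct = predicted_ids == true_ids
n_correct = int(is_correct.sum())
accuracy = is_correct.mean()

print(f"Got {n_correct} correct out of {len(true_ids)} (accuracy: {accuracy:.2%}).")
```

`accuracy_score` from `sklearn.metrics`, which is already imported at the top, should give the same fraction as long as the predictions are integer-valued.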

Am I happy that it seemingly did so well, getting *literally every single one* correct? Of course.  
Am I kind of suspicious? Also yes.

What is going on here? Is there a rookie mistake in my logic somewhere? I mean, statistically speaking, we ought to get at least *a couple* values wrong. Like, say, ten or fifty or one hundred (which, I'm told, would actually be pretty bad again).

Or is this simply a dataset with an abundance of samples (almost 50K) for a relatively small space of possible values (134)? I need to investigate that more.
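
One thing worth checking first: `products` still contains `aisle_id` when it is passed to `pd.get_dummies`, so the target may also be sitting in the feature matrix. Below is a minimal sketch of that check on a made-up three-row frame (the product names, ids, and variable names are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the sampled products frame.
toy_products = pd.DataFrame({
    "product_name": ["toy product a", "toy product b", "toy product c"],
    "aisle_id": [61, 104, 61],
})

# get_dummies one-hot encodes the string column and passes numeric columns through.
features = pd.get_dummies(toy_products)
print(list(features.columns))

# If the target column is among the features, the tree can simply read it back.
target_leaks = "aisle_id" in features.columns
print("target column present in features:", target_leaks)

# Dropping the target before encoding keeps it out of the feature matrix.
clean_features = pd.get_dummies(toy_products.drop(columns=["aisle_id"]))
print("after dropping:", "aisle_id" in clean_features.columns)
```

If the same check comes back `True` on the real frame, a perfect score probably says very little about predicting from `product_name` alone.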

Also, I'm not sure yet how to use this model myself.

## TODO:

Try something like:

```python
for n in [10, 100, 1_000, 10_000]:
    products = pd.read_csv("products.csv")
    products = products.sample(n=n, random_state=RANDOM_STATE_SAMPLE_TWO)
    products = products.drop(columns=["product_id", "department_id"])

    _products = pd.get_dummies(products)

    predictions = products_model.predict(_products)

    n_correct = 0

    for idx, prediction in enumerate(predictions):
        if products.iloc[idx].aisle_id == prediction:
            n_correct += 1

    print(f"Got {n_correct} products correct, out of {len(products)} (accuracy: {round((n_correct/len(products)) * 100, 2)}%).")
```

to see when the accuracy is optimal. Note to self: this requires retraining the model.
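
A rough sketch of what the retraining version could look like, on a tiny made-up frame instead of products.csv. The toy data, the sample sizes, and the held-out split via `train_test_split` are assumptions for illustration, not something the cells above already do:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data: one numeric feature, and a label derived from it.
toy = pd.DataFrame({"feature": range(400)})
toy["label"] = toy["feature"] // 100

results = {}
for n in [20, 100, 400]:
    sample = toy.sample(n=n, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        sample[["feature"]], sample["label"], test_size=0.25, random_state=0
    )

    # Retrain from scratch for every sample size.
    model = DecisionTreeRegressor(random_state=0)
    model.fit(X_train, y_train)

    # Score only on rows the model has not seen during fitting.
    accuracy = (model.predict(X_test) == y_test.to_numpy()).mean()
    results[n] = accuracy
    print(f"n={n}: held-out accuracy {accuracy:.2%}")
```

Scoring on held-out rows, rather than on rows the model was fitted on, is what makes the numbers comparable across sample sizes.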

In [None]:
products = pd.read_csv("products.csv")
products_f = products.sample(n=ITEMS, random_state=RANDOM_STATE_SAMPLE_ONE)
products_l = products.sample(n=ITEMS, random_state=RANDOM_STATE_SAMPLE_TWO)

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,4))

# Series.value_counts() replaces the deprecated top-level pd.value_counts()
products_f.aisle_id.value_counts().plot.hist(ax=ax1)
products_l.aisle_id.value_counts().plot.hist(ax=ax2)

In [None]:
export_graphviz(products_model, "../model.dot")
!dot -Tpng ../model.dot -o ../model.png
display(Image(filename="../model.png"))
!rm ../model.png

(To see the model in all its glory, right-click the image and choose "Open image in new tab".)

I look forward to hearing what you think about my kernel.

# ✌️