# Wrangle ML datasets
- Explore tabular data for supervised machine learning
- Join relational data for supervised machine learning

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

In [None]:
from glob import glob
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import requests
import tarfile
from IPython.display import Image

# I. Wrangle Data

1. Download the data [HERE](https://www.kaggle.com/datasets/psparks/instacart-market-basket-analysis) by clicking the **Download** button in the top-right corner of the page.
2. Upload the file to your Google Drive in a folder named "Instacart".

In [None]:
# Mount your Google Drive (the mount point is the Drive root, not the Instacart folder)
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# Change the working directory to the Drive folder where you saved the Kaggle dataset.
%cd /content/gdrive/My Drive/Instacart

In [None]:
!ls

In [None]:
# Unzip the files, if you haven't already done so.
!unzip \*.zip  && rm *.zip

**Before you start,** load each of the above `.csv` files into its own DataFrame.

In [None]:
orders = pd.read_csv('orders.csv')
order_products_train = pd.read_csv('order_products__train.csv')
order_products_prior = pd.read_csv('order_products__prior.csv')
products = pd.read_csv('products.csv')

In [None]:
Image(url="https://i.imgur.com/R7c37Yw.png")

## I.a. Warm-up Questions

What information is contained in the column `orders['eval_set']`?

In [None]:
orders['eval_set']

The first row of `orders['order_id']` is `2539329`. Where can we find the items that were included in that order?

In [None]:
orders.head()

In [None]:
order_products_prior.head()

The first row of `order_products_prior['product_id']` is `33120`. What is the name of that product?
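Both warm-up lookups come down to filtering on an ID and joining the line items to `products` on `product_id`. Here is a minimal sketch of that pattern, assuming toy frames with the same column names as the Instacart files. The rows, IDs, and product names are made up for illustration and are not real data.

```python
import pandas as pd

# Toy stand-ins for products.csv and order_products__prior.csv (hypothetical rows)
products_toy = pd.DataFrame({
    'product_id': [101, 102, 103],
    'product_name': ['Toy Apples', 'Toy Bread', 'Toy Milk'],
})
order_products_toy = pd.DataFrame({
    'order_id': [1, 1, 2],
    'product_id': [103, 101, 102],
})

# Name of a single product: filter products on product_id
name = products_toy.loc[products_toy['product_id'] == 103, 'product_name'].iloc[0]

# Items in a single order: filter the line items, then join to products for the names
items = (order_products_toy[order_products_toy['order_id'] == 1]
         .merge(products_toy, on='product_id', how='left'))

print(name)
print(items)
```

With the real data, the same two steps should work on `products` and `order_products_prior` using the IDs quoted in the questions.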

## I.b. Define Our Machine Learning Problem

- We want to predict whether or not a customer will purchase a specific item (of our choosing).
- The most commonly ordered product: `'Banana'` (`24852`).
- Our model is going to predict whether or not an order will include a `'Banana'`.

Let's adjust the Kaggle competition classification task from "What products will be ordered?" (multiclass, multilabel classification) to "Will one product be reordered?" (binary classification).
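To check which product is ordered most often, one option is to count `product_id` occurrences in the line-item table and look up the top ID in `products`. The sketch below assumes toy frames with made-up rows (the IDs and names are hypothetical), so it only illustrates the pattern.

```python
import pandas as pd

# Toy stand-ins for products.csv and an order_products table (hypothetical rows)
products_toy = pd.DataFrame({
    'product_id': [101, 102, 103],
    'product_name': ['Toy Apples', 'Toy Bread', 'Toy Milk'],
})
order_products_toy = pd.DataFrame({
    'order_id':   [1, 1, 2, 3, 3],
    'product_id': [101, 102, 101, 101, 103],
})

# Count how many line items each product appears in, most frequent first
counts = order_products_toy['product_id'].value_counts()
top_id = counts.index[0]

# Translate the winning ID into a product name
top_name = products_toy.set_index('product_id').loc[top_id, 'product_name']

print(top_id, top_name, counts.iloc[0])
```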

## I.c. Create Feature Matrix and Target Vector

Our **feature matrix** will be all the `'train'` rows from `orders`.

Our **target vector** will be whether or not each order in `X` contains the item we've chosen above.

In [None]:
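X = orders[orders['eval_set'] == 'train'].drop(columns='eval_set')  # feature matrix: 'train' rows only
banana_orders = order_products_train.loc[order_products_train['product_id'] == 24852, 'order_id']

In [None]:
X['includes_bananas'] = X['order_id'].isin(banana_orders)  # True if the train order contains a banana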

In [None]:
X['includes_bananas']

## I.d. Feature Engineering

What features can we engineer? We want to predict whether these customers will reorder bananas on their next order.

- Products per order
- Time of day
- Have they reordered bananas before? (Have ordered bananas >= 2 times)
- Other fruit they buy
- Size of orders (customers with smaller orders on average are less likely to be reordering any particular product on their next order)


- Frequency of banana orders:
    - % of orders
    - Time between banana orders: Every n days on average
    - Raw count: Total orders, how many times have you ordered bananas?

- Recency of banana orders
    - n days since you last ordered bananas
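The frequency and recency ideas in the list above can be sketched with a per-user aggregation over prior orders. This is only an illustration under stated assumptions: the frame is made-up toy data, `has_banana` is a flag like the one built from the prior line items, and the helper names (`prior_toy`, `day`, `per_user`, `days_since_banana`) are hypothetical. Recency here is measured up to the user's last prior order, and the average time between banana orders is not covered.

```python
import pandas as pd

# Toy prior orders (hypothetical rows) using column names from orders.csv
prior_toy = pd.DataFrame({
    'user_id':                [1, 1, 1, 2, 2],
    'order_number':           [1, 2, 3, 1, 2],
    'days_since_prior_order': [float('nan'), 7.0, 3.0, float('nan'), 10.0],
    'has_banana':             [True, True, False, False, False],
})

prior_toy = prior_toy.sort_values(['user_id', 'order_number'])

# Day offset of each order, counted from the user's first order
prior_toy['day'] = (prior_toy['days_since_prior_order'].fillna(0)
                    .groupby(prior_toy['user_id']).cumsum())

# Frequency features: raw counts and the share of orders with a banana
per_user = prior_toy.groupby('user_id').agg(
    n_orders=('has_banana', 'size'),
    n_banana_orders=('has_banana', 'sum'),
    banana_order_rate=('has_banana', 'mean'),
    last_day=('day', 'max'),
)

# Recency feature: days between the last banana order and the last prior order
# (NaN for users who never ordered a banana)
last_banana_day = prior_toy[prior_toy['has_banana']].groupby('user_id')['day'].max()
per_user['days_since_banana'] = per_user['last_day'] - last_banana_day

print(per_user)
```

A table like `per_user` could then be attached to `X` with a left merge on `user_id`, leaving missing values for users without banana orders to be handled before modeling.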



Is an order placed before 11:00AM?

In [None]:
X['morning_order'] = X['order_hour_of_day'] < 11  # True for orders placed before 11:00 AM

In [None]:
X.head()

How many items in the order?

In [None]:
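n_items_per_order = order_products_train.groupby('order_id').size()  # line items per train order
X['n_items'] = X['order_id'].map(n_items_per_order)  # assumed feature name; adds the count to X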

Did the user order `'Banana'` in previous orders?

In [None]:
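# Flag prior line items that are bananas and collect the IDs of the orders containing them
order_products_prior['is_banana'] = order_products_prior['product_id'] == 24852
banana_orders_id_prior = order_products_prior[order_products_prior['is_banana']]['order_id']

# Mark each prior order that contained a banana
prior_orders = orders[orders['eval_set'] == 'prior'].copy()
prior_orders['has_banana'] = prior_orders['order_id'].isin(banana_orders_id_prior)

# Users who bought a banana in at least one prior order
prior_banana_user_ids = prior_orders[prior_orders['has_banana']]['user_id'].unique()

# 1 if the user has ordered a banana before, else 0
X['prior_banana_orders'] = X['user_id'].isin(prior_banana_user_ids).astype(int)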

In [None]:
X.head()

# II. Split Data

In [None]:
target = 'includes_bananas'

y = X[target]

X = X.drop(columns=['order_id', 'user_id', 'order_number', target])

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# III. Establish Baseline

In [None]:
print('Baseline accuracy:', y_train.value_counts(normalize=True).max())

In [None]:
y_train.value_counts(normalize=True).plot(kind='bar')

# IV. Build Model

In [None]:
model_rf = RandomForestClassifier(random_state=42, n_jobs=-1)

model_rf.fit(X_train, y_train)

# V. Check Metrics

In [None]:
print('RF Training Accuracy:', model_rf.score(X_train, y_train))
print('RF Validation Accuracy:', model_rf.score(X_val, y_val))

