# Task 2: Recommend Items

## Prompt:
Continuing from Task 1 above, find a dataset of customer grocery shopping baskets if required, and build a model that recommends an extra item. For example, given a shopping basket containing "pasta" and "olive oil", the model may recommend "canned tomato" as an extra item to add to the basket.

The output of the previous task should feed directly into this one, so look into reusing it here.

*Please read the full documentation [[here](https://docs.google.com/document/d/1ZQARiQPf4BdPAFJjts1v5l4Ewr0ICFlaDV9t4mTUxbE/edit?usp=sharing)]*
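
To make the pasta and olive oil example concrete, the sketch below scores candidate extra items by how often they appeared alongside the basket's items in past transactions. It is only an illustration under assumptions: the toy `history` baskets and the `recommend_extra_item` helper are hypothetical, and the notebook itself goes on to import the Apriori implementation from mlxtend rather than use this counting approach.

```python
from collections import Counter

# Hypothetical toy history of past baskets (not taken from the datasets used below).
history = [
    {"pasta", "olive oil", "canned tomato"},
    {"pasta", "canned tomato", "cheese"},
    {"pasta", "olive oil", "canned tomato", "basil"},
    {"bread", "butter"},
    {"olive oil", "bread"},
]

def recommend_extra_item(basket, history):
    """Return the item that most often co-occurs with the basket's items, or None."""
    basket = set(basket)
    scores = Counter()
    for past in history:
        overlap = len(basket & past)
        if overlap == 0:
            continue
        # Each candidate gets credit equal to the number of basket items it was seen with.
        for item in past - basket:
            scores[item] += overlap
    if not scores:
        return None
    # Highest score wins; ties are broken alphabetically so the result is deterministic.
    return min(scores, key=lambda item: (-scores[item], item))

print(recommend_extra_item(["pasta", "olive oil"], history))
```

Simple co-occurrence counting favours items that are popular everywhere, which is one reason association-rule metrics such as confidence and lift are often preferred for this kind of recommendation.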

## Part 1: Preprocess Data

### Step 1: Import Libraries & Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data1 = pd.read_csv("data1.csv")

# convert the raw xlsx file to csv, then load the csv
data2_xlsx = pd.read_excel("data2.xlsx")
data2_xlsx.to_csv("data2.csv", index=False, header=True)
data2 = pd.read_csv("data2.csv")

data3 = pd.read_csv("data3.csv")

### Step 2: Standardize Columns & Names

In [None]:
data1_temp = data1.iloc[:, [0,5]].copy()
data1_temp.columns = ["order_id", "item_name"]
data1_temp['order_id'] = "data1_" + data1_temp['order_id'].astype(str)

data2_temp = data2.iloc[:, [1,2]].copy()
data2_temp.columns = ["order_id", "item_name"]
data2_temp['order_id'] = "data2_" + data2_temp['order_id'].astype(str)

data3_temp = data3.iloc[:, [0,1]].copy()
data3_temp.columns = ["order_id", "item_name"]
data3_temp['order_id'] = "data3_" + data3_temp['order_id'].astype(str)

       order_id item_name
0    data3_1000    Apples
1    data3_1000    Butter
2    data3_1000      Eggs
3    data3_1000  Potatoes
4    data3_1004   Oranges
..          ...       ...
495  data3_1493     Juice
496  data3_1493     Bread
497  data3_1497    Coffee
498  data3_1497     Pasta
499  data3_1497      Eggs

[500 rows x 2 columns]


### Step 3: Convert Wide to Long

In [49]:
def wide_to_long(
    wide_data,
    prefix='data',
    separator='_',
    drop_columns=None,
    empty_val=('', 'None', None)
  ):

  """
  Converts a dataset from wide format (one row per basket) to long format.

  Parameters:
    wide_data: input DataFrame in wide format
    prefix: string prefix added to order_id so IDs stay distinguishable across datasets
    separator: string placed between the prefix and the row index in order_id
    drop_columns: column name or list of column names to remove before reshaping
    empty_val: values that represent missing items; matching cells are dropped

  Returns:
    new DataFrame with columns ['order_id', 'item_name'];
    the same data is also saved to '<prefix>.csv'
  """
  data_temp = wide_data.copy()

  if drop_columns:
    data_temp = data_temp.drop(columns=drop_columns)

  # use the row index as the basket identifier
  data_temp.insert(0, "order_id", data_temp.index)
  data_temp = data_temp.replace(list(empty_val), pd.NA)

  # every remaining column other than order_id holds items
  item_cols = [col for col in data_temp.columns if col != 'order_id']

  data = data_temp.melt(
    id_vars='order_id',
    value_vars=item_cols,
    var_name='temp',
    value_name='item_name'
  )

  data = data.dropna(subset=['item_name'])
  data = data.drop(columns='temp')
  data = data[['order_id', 'item_name']].reset_index(drop=True)
  data['order_id'] = prefix + separator + data['order_id'].astype(str)

  print(data)

  data.to_csv(f"{prefix}.csv", index=False)

  return data

data4_wide = pd.read_csv("data4_wide.csv")
data4 = wide_to_long(
  wide_data=data4_wide,
  prefix="data4",
  drop_columns=["Item(s)"]
)
data4_temp = pd.read_csv("data4.csv")

data5_wide = pd.read_csv("data5_wide.csv")
data5 = wide_to_long(
  wide_data=data5_wide,
  prefix="data5"
)
data5_temp = pd.read_csv("data5.csv")

         order_id                 item_name
0         data4_0              citrus fruit
1         data4_1            tropical fruit
2         data4_2                whole milk
3         data4_3                 pip fruit
4         data4_4          other vegetables
...           ...                       ...
41581  data4_9792               hard cheese
41582  data4_9796             sweet spreads
41583  data4_9817  long life bakery product
41584  data4_9821            red/blush wine
41585  data4_9830                     flour

[41586 rows x 2 columns]
         order_id        item_name
0         data5_0          burgers
1         data5_1          chutney
2         data5_2           turkey
3         data5_3    mineral water
4         data5_4   low fat yogurt
...           ...              ...
29253  data5_6521  frozen smoothie
29254  data5_6593      yogurt cake
29255  data5_6971      protein bar
29256  data5_7179        green tea
29257  data5_7341     tomato juice

[29258 rows x 2 columns]
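
The reshaping performed by `wide_to_long` can be seen on a small scale below. This is a minimal sketch on a made-up three-basket table (the `Item 1` to `Item 3` column names and their values are assumptions, not the real layout of `data4_wide.csv`); it shows how `melt` stacks the item columns and why empty cells have to be dropped afterwards.

```python
import pandas as pd

# Hypothetical wide table: one row per basket, one column per item slot.
toy_wide = pd.DataFrame({
    "Item 1": ["pasta", "bread", "milk"],
    "Item 2": ["olive oil", None, "eggs"],
    "Item 3": ["canned tomato", None, None],
})

# Use the row index as the basket identifier, as wide_to_long does.
toy_wide.insert(0, "order_id", toy_wide.index)

toy_long = (
    toy_wide
    .melt(id_vars="order_id", value_name="item_name")  # stack every item column
    .dropna(subset=["item_name"])                       # discard empty item slots
    [["order_id", "item_name"]]
    .reset_index(drop=True)
)

print(toy_long)
```

Rows come out grouped by source column rather than by basket, which matches the `data4_0`, `data4_1`, ... ordering in the output above. The later `groupby` on `order_id` brings each basket's items back together.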


### Step 4: Combine Datasets & Save Final Data

In [None]:
final_data = pd.concat([data1_temp, data2_temp, data3_temp, data4_temp, data5_temp], ignore_index=True)

# remove duplicates: items that appear twice in same basket
final_data = final_data.drop_duplicates()

final_data.to_csv("final_data.csv", index=False)

print(final_data)

          order_id           item_name
0       data1_1000         Wheat Flour
1       data1_1000  Dishwashing Liquid
2       data1_1000              Pastry
3       data1_1000              Marker
4       data1_1001               Saree
...            ...                 ...
125084  data5_6521     frozen smoothie
125085  data5_6593         yogurt cake
125086  data5_6971         protein bar
125087  data5_7179           green tea
125088  data5_7341        tomato juice

[120705 rows x 2 columns]
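
The printed index runs up to 125088 while the frame reports 120705 rows because `drop_duplicates` keeps the original index labels of the rows that survive. The sketch below reproduces this on a made-up frame (the `toy` values are assumptions) and shows `reset_index(drop=True)` as one optional way to renumber. The notebook does not depend on it, since the next step groups by `order_id` rather than by index.

```python
import pandas as pd

# Hypothetical long-format data with one repeated (order_id, item_name) pair.
toy = pd.DataFrame({
    "order_id": ["a_1", "a_1", "a_1", "a_2"],
    "item_name": ["milk", "milk", "bread", "milk"],
})

deduped = toy.drop_duplicates()
print(deduped.index.tolist())      # original labels survive; the dropped row leaves a gap

renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())   # contiguous labels again
```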


## Part 2: Convert Long Format to Transactions

In [58]:
transactions = (
  final_data
    .groupby("order_id")["item_name"]
    .apply(list)
    .tolist()
)

# print the first 5 baskets
for i, basket in enumerate(transactions[:5]):
    print(f"Basket {i}: {basket}")

Basket 0: ['Wheat Flour', 'Dishwashing Liquid', 'Pastry', 'Marker']
Basket 1: ['Saree', 'Spinach', 'Face Wash', 'Energy Drink', 'Mixer Grinder', 'Fish', 'Apple']
Basket 2: ['Cookies', 'Chicken Breast', 'Butter', 'Dress']
Basket 3: ['Vitamins', 'Dishwashing Liquid', 'Apple']
Basket 4: ['Pastry', 'Salt', 'Notebook', 'Tissue']
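
The grouping step can be checked on a small frame. The sketch below uses made-up rows (the `d_` identifiers are assumptions) to show two properties of `groupby(...).apply(list)`: rows belonging to one basket are collected even when they are scattered through the frame, and groups come back sorted by key, which for string IDs means lexicographic rather than numeric order.

```python
import pandas as pd

# Hypothetical long-format rows, deliberately out of basket order.
toy_long = pd.DataFrame({
    "order_id": ["d_10", "d_2", "d_10", "d_2", "d_1"],
    "item_name": ["pasta", "bread", "olive oil", "butter", "milk"],
})

baskets = toy_long.groupby("order_id")["item_name"].apply(list)

print(baskets.index.tolist())  # group keys, sorted as strings
print(baskets.tolist())        # one list of items per basket
```

Basket order does not affect frequent-itemset mining, so the string sort is harmless here; it only matters when reading the printed baskets back against the source files.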


## Part 3: Run Apriori Algorithm

In [None]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
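
For readers unfamiliar with what Apriori computes, here is a dependency-free sketch of the level-wise search on a toy input. It is not mlxtend's implementation and not necessarily how the notebook configures it: the `frequent_itemsets` helper, the toy transactions, and the 0.5 support threshold are all assumptions chosen for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {frozenset: support} for frequent itemsets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    current = {}
    for item in {item for b in baskets for item in b}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = {}
    k = 1
    while current:
        result.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return result

# Hypothetical toy baskets, not drawn from final_data.csv.
toy_transactions = [
    ["pasta", "olive oil", "canned tomato"],
    ["pasta", "canned tomato"],
    ["pasta", "olive oil", "canned tomato"],
    ["bread", "butter"],
]

found = frequent_itemsets(toy_transactions, min_support=0.5)

# Confidence of the rule {pasta, olive oil} -> {canned tomato}.
antecedent = frozenset({"pasta", "olive oil"})
rule_items = antecedent | {"canned tomato"}
confidence = found[rule_items] / found[antecedent]
print(len(found), confidence)
```

The prune step is what distinguishes Apriori from brute-force enumeration: an itemset can only be frequent if all of its subsets are, so most candidates are discarded before their support is ever counted.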