# Task 1: Sorting Groceries

## Prompt:
Mimicking the behavior of the Apple Reminders app when creating a "Groceries" reminder list, create a programme that, when passed a grocery item name (e.g. "oranges", "eggs"), returns the category it belongs to (e.g. "Fruits and Vegetables", "Eggs and Dairy", respectively).

You should start by collecting and processing the data needed to complete this task. The focus of this task is on these data collection and processing steps: what kind of data you need, how you will process it, how much you will need, etc.

### ML Classifier VS Neural Network?

My data:
- only need product name & category columns
- item names are short (1-4 words)
- small # of categories
- text input → text classification

So, I should use an ML classifier → it learns patterns from the given data to predict the correct answer (category)

Steps:

1. Training dataset with name and category
2. Tokenize necessary words
3. Vectorize the tokens into numbers as computers cannot understand words
4. Train the classifier to know what words belong to which category
5. Predict for the new items
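
To make steps 2 and 3 concrete, here is a minimal sketch in plain Python of tokenizing names and turning them into count vectors. It is only an illustration, not the scikit-learn code used later in this notebook; the sample names and the helper functions `tokenize` and `vectorize` are made up for this example.

```python
from collections import Counter

# Toy product names; made up for illustration only.
names = ["fresh oranges", "orange juice", "brown eggs"]

def tokenize(name):
    """Lowercase a product name and split it on whitespace."""
    return name.lower().split()

# Build the vocabulary: every distinct token gets one position in the vector.
vocab = sorted({tok for name in names for tok in tokenize(name)})

def vectorize(name):
    """Count how often each vocabulary word appears in the name."""
    counts = Counter(tokenize(name))
    return [counts.get(tok, 0) for tok in vocab]

print(vocab)
print(vectorize("orange juice"))
```

Words that are not in the vocabulary (for example "purple yam" here) simply produce a vector of zeros, which is why the training data needs to cover the item names I expect to see.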

Categories:

1. Fruits & Vegetables
2. Dairy & Eggs
3. Meat & Seafood
4. Grains & Staples
5. Bakery & Snacks
6. Beverages
7. Electronics
8. Household
9. Clothing & Lifestyle
10. Personal Care & Health
11. Stationery & Books
12. Instant & Frozen Food
13. Pet Care

## Part 1: Cleaning & Combining Datasets

### Step 1: Load the datasets

pandas = Python library for loading and working with datasets

In [148]:
import pandas as pd
#import numpy as np
#import matplotlib.pyplot as plt
#import keras as keras
#from keras import layers

from datasets import load_dataset

unwanted = [
  "Dairy & Breakfast",
  "Produce",
  "Pasta & Grains",
  "Pet Supplies",
  "Frozen Foods",
  "Instant & Frozen Food",
  "Deli"
]

# remove rows with NULL values
data1 = (pd.read_csv("data1.csv"))
data1.dropna(inplace = True)

data2 = (pd.read_csv("data2.csv"))
data2.dropna(inplace = True)

data3 = (pd.read_csv("data3.csv"))
data3.dropna(inplace = True)

data4 = (pd.read_csv("data4.csv"))
data4.dropna(inplace = True)

# remove unwanted categories
data4 = data4[~data4["category"].isin(unwanted)]
data4.to_csv("data4.csv", index=False)

# data5 = (pd.read_csv("data5.csv"))
# data5.dropna(inplace = True)

# load dataset from Hugging Face (kept under its own name so it is not confused with the data5_temp DataFrame built in Step 2)
data5_raw = load_dataset("AmirMohseni/GroceryList")
data5 = data5_raw["train"].to_pandas()

data5 = data5[~data5["Category"].isin(unwanted)]
data5.to_csv("data5.csv", index=False)


# to_string() prints out the ENTIRE dataset
# print(data1.to_string())
# print(data1)
# print(data2)

### Step 2: Select only the product & category columns

In [149]:
data1_temp = data1.iloc[:, [5,8]].copy()
data1_temp.columns = ["Product Name", "Category"]

data2_temp = data2.iloc[:, [1,2]].copy()
data2_temp.columns = ["Product Name", "Category"]

data3_temp = data3.iloc[:, [0,1]].copy()
data3_temp.columns = ["Product Name", "Category"]

data4_temp = data4.iloc[:, [6,7]].copy()
data4_temp.columns = ["Product Name", "Category"]

data5_temp = data5.iloc[:, [0,1]].copy()
data5_temp.columns = ["Product Name", "Category"]

print(data5_temp)

         Product Name        Category
17            chicken  Meat & Seafood
18               beef  Meat & Seafood
19               pork  Meat & Seafood
20             salmon  Meat & Seafood
21               tuna  Meat & Seafood
..                ...             ...
206  canned mushrooms    Canned Goods
207  canned pineapple    Canned Goods
208    canned peaches    Canned Goods
209    canned carrots    Canned Goods
210     canned olives    Canned Goods

[155 rows x 2 columns]


### Step 3: Replace the category data with a standardized name

In [150]:
replacements = {
    "Stationery": "Stationery & Books",
    "Books": "Stationery & Books",
    "Clothing": "Clothing & Lifestyle",
    "Footwear": "Clothing & Lifestyle",
    "Clothing Accessories": "Clothing & Lifestyle",
    "Personal Care": "Personal Care & Health",
    "Health & Wellness": "Personal Care & Health",
    "Oils & Fats": "Grains & Staples",
    "Grains & Pulses": "Grains & Staples",
    "Dairy": "Dairy & Eggs",
    "Seafood": "Meat & Seafood",
    "Grocery & Staples": "Grains & Staples",
    "Snacks & Munchies": "Bakery & Snacks",
    "Cold Drinks & Juices": "Beverages",
    "Household Care": "Household",
    "Baby Care": "Personal Care & Health",
    "Pharmacy": "Personal Care & Health",
    "Snacks": "Bakery & Snacks",
    "Bakery": "Bakery & Snacks",
    "Pantry": "Grains & Staples",
    "Pet Care": "Household",
    "Condiments & Sauces": "Grains & Staples",
    "Canned Goods": "Grains & Staples"
}

datasets = [data1_temp, data2_temp, data3_temp, data4_temp, data5_temp]

for df in datasets:
  df["Category"] = df["Category"].replace(replacements)
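
A dictionary-based replace leaves any label that is not in the dictionary unchanged, so a category I forgot to map would slip through silently. The sketch below shows the same idea in plain Python and flags leftovers. The small `replacements` subset, the `target_categories` set, and the `standardize` helper are hypothetical and only for illustration.

```python
# Hypothetical subset of the replacement table and of the target labels.
replacements = {"Dairy": "Dairy & Eggs", "Snacks": "Bakery & Snacks"}
target_categories = {"Dairy & Eggs", "Bakery & Snacks", "Beverages"}

def standardize(category):
    """Return the standardized label, or the input unchanged if unmapped."""
    return replacements.get(category, category)

raw = ["Dairy", "Snacks", "Beverages", "Deli"]
cleaned = [standardize(c) for c in raw]

# Labels that are still outside the target set need a mapping or removal.
leftover = sorted(set(cleaned) - target_categories)

print(cleaned)   # ['Dairy & Eggs', 'Bakery & Snacks', 'Beverages', 'Deli']
print(leftover)  # ['Deli']
```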

### Step 4: Combine the 5 datasets

In [151]:
# ignore_index=True renumbers the rows 0..n-1 so the five datasets share one continuous index
final_data = pd.concat([data1_temp, data2_temp, data3_temp, data4_temp, data5_temp], ignore_index=True)

print(final_data)

             Product Name              Category
0             Wheat Flour      Grains & Staples
1      Dishwashing Liquid             Household
2                  Pastry       Bakery & Snacks
3                  Marker    Stationery & Books
4                   Saree  Clothing & Lifestyle
...                   ...                   ...
12346    canned mushrooms      Grains & Staples
12347    canned pineapple      Grains & Staples
12348      canned peaches      Grains & Staples
12349      canned carrots      Grains & Staples
12350       canned olives      Grains & Staples

[12351 rows x 2 columns]


### Step 5: Preprocess the product names

1. Rewrite all product names into lowercase
2. Remove the words in parentheses

regex explanation:
1. r"" = raw string; backslashes are read literally
2. \s* = zero or more whitespace characters
3. \( = literal opening parenthesis
4. .*? = any characters, matched lazily so the match stops at the first closing parenthesis
5. \) = literal closing parenthesis

means -> targets the parentheses, the words inside them, and any whitespace just before them (if there are any)
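
Here is a small sketch of the same pattern using Python's built-in `re` module, so the behavior can be checked on a few names before applying it to the whole column. The product names and the `strip_parentheses` helper are made up for illustration.

```python
import re

# Same pattern as in this step: optional whitespace, then "(...)" matched lazily.
PAREN = re.compile(r"\s*\(.*?\)")

def strip_parentheses(name):
    """Lowercase the name and drop every parenthesized part."""
    return PAREN.sub("", name.lower())

examples = ["Salted Butter (500g)", "Milk (1L) Full Cream (Organic)", "Eggs"]
for name in examples:
    print(repr(strip_parentheses(name)))
# 'salted butter'
# 'milk full cream'
# 'eggs'
```

Because the match is lazy and stops at the first closing parenthesis, nested parentheses are not handled cleanly: `"a (b (c) d)"` becomes `"a d)"`.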

In [152]:
# rewrite all product names into lowercase
final_data["Product Name"] = final_data["Product Name"].str.lower()

# remove the words in parentheses
final_data["Product Name"] = final_data["Product Name"].str.replace(r"\s*\(.*?\)", "", regex=True)

### Step 6: Save the changes as a new dataset

In [153]:
# index=False leaves the row index out of the saved CSV
final_data.to_csv("final_data.csv", index=False)

Check how many products we have in each category

In [154]:
print(final_data['Category'].value_counts())

Category
Fruits & Vegetables       1848
Grains & Staples          1727
Bakery & Snacks           1583
Personal Care & Health    1248
Dairy & Eggs              1114
Beverages                  933
Clothing & Lifestyle       872
Household                  830
Meat & Seafood             801
Electronics                727
Stationery & Books         668
Name: count, dtype: int64


## Part 2: Splitting Train & Test Datasets

### Step 1: Import scikit-learn and split the data

scikit-learn = open source tool for predictive data analysis built on NumPy, matplotlib, and SciPy

I'm using classification = identifying which category an object belongs to

In [155]:
from sklearn.model_selection import train_test_split

# ML cannot work with raw text, so we need to vectorize them into numbers
# This converts text into vectors based on TF-IDF (word frequency and importance)
from sklearn.feature_extraction.text import TfidfVectorizer

# classifier model from scikit-learn used to predict category based on product name input
from sklearn.linear_model import LogisticRegression

# needed to evaluate my model after training it
# prints out accuracy % and model performance
from sklearn.metrics import classification_report, accuracy_score

df = pd.read_csv('final_data.csv')

X = df['Product Name']
Y = df['Category']

# 80% train 20% test
X_train, X_test, Y_train, Y_test = train_test_split(
  X,
  Y,
  test_size=0.2,
  random_state=42,
  stratify=Y
)

Check that the data was split correctly

In [156]:
X_train.shape, Y_train.shape

((9880,), (9880,))

In [157]:
X_test.shape, Y_test.shape

((2471,), (2471,))
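
The shapes above can be checked by hand. This is a quick arithmetic sketch, assuming the test count is rounded up and the training set gets the remaining rows; that assumption matches the shapes printed above.

```python
import math

n_samples = 12351   # rows in final_data.csv
test_size = 0.2

# Assumption: test count is rounded up, train count is whatever remains.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 9880 2471
```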

### Step 2: Oversample the training set

In [158]:
from imblearn.over_sampling import RandomOverSampler

# RandomOverSampler expects 2D input, so turn the Series into a one-column DataFrame
X_train_df = X_train.to_frame()

# creates an oversampler object
ros = RandomOverSampler(random_state=42)

# performs oversampling: randomly duplicates rows from the smaller categories until every category has as many rows as the largest one
X_train_over, Y_train_over = ros.fit_resample(X_train_df, Y_train)

# convert the one-column DataFrame back into a 1D Series
X_train_over = X_train_over.squeeze()
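
To show what random oversampling does, here is a minimal plain-Python sketch of the idea: group the rows by category, then randomly repeat rows from the smaller categories until each one matches the largest. It is an illustration only, not the imbalanced-learn implementation; `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(names, labels, seed=42):
    """Duplicate random rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for name, label in zip(names, labels):
        by_label.setdefault(label, []).append(name)

    target = max(len(rows) for rows in by_label.values())
    out_names, out_labels = [], []
    for label, rows in by_label.items():
        # Draw extra rows (with replacement) to reach the size of the largest class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for name in rows + extra:
            out_names.append(name)
            out_labels.append(label)
    return out_names, out_labels

names = ["apple", "banana", "carrot", "milk"]
labels = ["Fruits & Vegetables"] * 3 + ["Dairy & Eggs"]

over_names, over_labels = random_oversample(names, labels)
print(Counter(over_labels))
```

Only the training set is oversampled; the test set keeps its original distribution so the evaluation reflects the real class balance.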

### Step 3: Vectorize text into numbers using TF-IDF

In [159]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=5000)

# fit_transform = looks at all my training text, learns the vocabulary, and converts each product name into a numeric vector
# transform = uses only the vocabulary learned from the training data to convert new product names into vectors; words not seen in training are ignored
X_train_vec = vectorizer.fit_transform(X_train_over)
X_test_vec = vectorizer.transform(X_test)
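
To see where the TF-IDF numbers come from, here is a hand-rolled sketch in plain Python. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, and it uses simple whitespace tokenization with single words only (no bigrams), so it is a simplification of the vectorizer configured above. The three sample names are made up.

```python
import math
from collections import Counter

docs = ["orange juice", "apple juice", "brown eggs"]   # toy training names
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: in how many names does each word appear?
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Assumed smoothed idf: rarer words get a larger weight.
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf(text):
    """Return an L2-normalized TF-IDF vector over the learned vocabulary."""
    counts = Counter(text.split())
    raw = [counts[tok] * idf[tok] for tok in vocab]   # unseen words are ignored
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

vec = tfidf("orange juice")
for tok, weight in zip(vocab, vec):
    if weight:
        print(tok, round(weight, 3))
```

In this toy example "orange" ends up with a higher weight than "juice" because "juice" appears in two of the three names, so it says less about which category a name belongs to.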

### Step 4: Train a classifier model using logistic regression

In [160]:
clf = LogisticRegression(max_iter=3000)
clf.fit(X_train_vec, Y_train_over)

LogisticRegression parameters:

Parameter          Value
penalty            'l2'
dual               False
tol                0.0001
C                  1.0
fit_intercept      True
intercept_scaling  1
class_weight       None
random_state       None
solver             'lbfgs'
max_iter           3000
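
Roughly, a fitted logistic regression predicts by computing one score per category (weights times features plus a bias) and turning the scores into probabilities. The sketch below assumes the multinomial (softmax) formulation; the two categories, the `weights`, and the `bias` values are hypothetical numbers, not the ones learned by `clf`.

```python
import math

classes = ["Dairy & Eggs", "Fruits & Vegetables"]

# Hypothetical learned parameters: one row of weights per class,
# one column per feature (say, feature 0 = "milk", feature 1 = "orange").
weights = [[2.0, -1.0],
           [-1.5, 1.8]]
bias = [0.0, 0.1]

def predict_proba(x):
    """Softmax over the per-class linear scores."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [0.0, 1.0]                              # only the "orange" feature is present
probs = predict_proba(x)
best = classes[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```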


### Step 5: Predict data on the test set

In [161]:
# the categories that my model predicted
y_pred = clf.predict(X_test_vec)

### Step 6: Evaluate

In [162]:
# accuracy_score = number of correct predictions / total number of predictions
print("Accuracy: ", accuracy_score(Y_test, y_pred))
print("\nClassification Report:\n", classification_report(Y_test, y_pred))

Accuracy:  0.9943342776203966

Classification Report:
                         precision    recall  f1-score   support

       Bakery & Snacks       1.00      0.99      1.00       317
             Beverages       1.00      0.99      1.00       187
  Clothing & Lifestyle       1.00      1.00      1.00       174
          Dairy & Eggs       0.99      1.00      1.00       223
           Electronics       1.00      1.00      1.00       145
   Fruits & Vegetables       0.97      1.00      0.98       370
      Grains & Staples       1.00      0.99      1.00       345
             Household       1.00      0.99      1.00       166
        Meat & Seafood       1.00      0.98      0.99       160
Personal Care & Health       1.00      0.99      1.00       250
    Stationery & Books       1.00      1.00      1.00       134

              accuracy                           0.99      2471
             macro avg       1.00      0.99      1.00      2471
          weighted avg       0.99      0.99    
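
To read the report, it helps to compute the columns by hand on a tiny example. This sketch uses made-up labels and a hypothetical helper `per_class_metrics`; it shows how one class can have perfect recall but lower precision, the same pattern as "Fruits & Vegetables" in the report above.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)   # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)   # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)   # label missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up example: one "Meat" item is wrongly predicted as "Fruit".
y_true = ["Meat", "Meat", "Fruit", "Fruit", "Fruit"]
y_pred = ["Meat", "Fruit", "Fruit", "Fruit", "Fruit"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = per_class_metrics(y_true, y_pred, "Fruit")

print(accuracy)            # 0.8
print(precision, recall)   # 0.75 1.0
print(round(f1, 3))        # 0.857
```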

## Part 3: Create & Test User Input

In [163]:
while True:
  product_name = input("Enter product name (or type 'exit' to quit): ")

  if product_name.lower() == 'exit':
    break

  # wrap the name in a list because TfidfVectorizer expects an iterable of strings (one per product name), not a bare string
  product_vec = vectorizer.transform([product_name])

  # use [0] to get the predicted category as a plain string rather than a one-element array
  predicted_category = clf.predict(product_vec)[0]

  print(f"Category: {predicted_category}\n")

Category: Fruits & Vegetables

Category: Meat & Seafood

Category: Meat & Seafood

Category: Meat & Seafood

Category: Meat & Seafood

Category: Meat & Seafood

Category: Fruits & Vegetables

Category: Fruits & Vegetables

Category: Fruits & Vegetables

Category: Grains & Staples

Category: Grains & Staples

Category: Bakery & Snacks

Category: Bakery & Snacks

Category: Fruits & Vegetables

Category: Dairy & Eggs

Category: Dairy & Eggs

Category: Fruits & Vegetables

Category: Fruits & Vegetables

Category: Meat & Seafood

Category: Dairy & Eggs

Category: Fruits & Vegetables

Category: Fruits & Vegetables

Category: Grains & Staples

Category: Fruits & Vegetables

