# Task 1: Sorting Groceries

## Prompt:
Mimicking the behavior of the Apple Reminders app when creating a "Groceries" reminder list, create a programme that when passed a grocery item name (eg. "oranges", "eggs") is able to return which category they belong in (eg. "Fruits and vegetable", "Eggs and Dairy", respectively).

You should start with collecting and processing necessary data that can contribute to completing this task. The focus of this task will be on this data collection and processing steps. What kind of data do you need, how will you process them, how much will you need, etc.

*Please read the full documentation [[here](https://docs.google.com/document/d/1ZQARiQPf4BdPAFJjts1v5l4Ewr0ICFlaDV9t4mTUxbE/edit?usp=sharing)]*

### Which approach to use: ML Classifier VS Neural Network?

My data:
- only need product name & category columns
- item names are short (1-4 words)
- small # of categories
- text input → text classification

So, I should use ML Classifier → learns patterns from given data to guess the correct answer (category)

Steps:
1. Preprocess and combine the datasets into one
2. Split the final dataset into train and test sets
3. Vectorize text into numbers using TF-IDF
4. Train the model using logistic regression
5. Create & test user input

Categories:

1. Fruits & Vegetables
2. Dairy & Eggs
3. Meat & Seafood
4. Grains & Staples
5. Bakery & Snacks
6. Beverages
7. Electronics
8. Household
9. Clothing & Lifestyle
10. Personal Care & Health
11. Stationery & Books

## Part 1: Preprocess & Combine the Datasets

### Step 1: Create a Virtual Environment

    conda create -n task1_sortgroceries python=3.11
    conda activate task1_sortgroceries

### Step 2: Import Pandas and Datasets

I use pandas, a Python library for working with datasets, and load_dataset from the datasets library to access data5, which is a Hugging Face dataset.

Before cleaning the datasets, I went through each one to see which category names differed from the 11 categories that I wanted; then I decided whether to rename each one or remove it completely from my dataset. The unwanted categories are stored in a list called 'unwanted'.

### Step 3: Read Dataset Using Pandas

pd.read_csv reads the datasets, which are saved as text-based CSV (comma-separated values) files, into a pandas DataFrame: a two-dimensional, labeled table that is structured and easy to work with later.

.dropna(inplace=True) removes every row that contains a null value and modifies the DataFrame directly.

Datasets 4 and 5 had unrelated categories (e.g. ‘Deli’) and categories with too few items (e.g. ‘Pet Supplies’ and ‘Instant & Frozen Food’), so I removed those categories.

In [1]:
import pandas as pd
from datasets import load_dataset

unwanted = [
  "Dairy & Breakfast",
  "Produce",
  "Pasta & Grains",
  "Pet Supplies",
  "Frozen Foods",
  "Instant & Frozen Food",
  "Deli"
]

# remove rows with NULL values
data1 = (pd.read_csv("data1.csv"))
data1.dropna(inplace = True)

data2 = (pd.read_csv("data2.csv"))
data2.dropna(inplace = True)

data3 = (pd.read_csv("data3.csv"))
data3.dropna(inplace = True)

data4 = (pd.read_csv("data4.csv"))
data4.dropna(inplace = True)

# remove unwanted categories
data4 = data4[~data4["category"].isin(unwanted)]
data4.to_csv("data4.csv", index=False)

#load dataset from hugging face
data5_temp = load_dataset("AmirMohseni/GroceryList")
data5 = data5_temp["train"].to_pandas()

data5 = data5[~data5["Category"].isin(unwanted)]
data5.to_csv("data5.csv", index=False)


# to_string() prints out the ENTIRE dataset
# print(data1.to_string())
# print(data1)
# print(data2)
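
To see what dropna() and the ~isin() filter do in isolation, here is a minimal sketch on a made-up DataFrame. The rows and the `toy` and `kept` names are illustrative assumptions, not values from the real datasets.

```python
import pandas as pd

# Hypothetical toy data standing in for one of the CSV files
toy = pd.DataFrame({
    "name": ["salmon", "dog food", None, "ham slices"],
    "category": ["Meat & Seafood", "Pet Supplies", "Beverages", "Deli"],
})

unwanted = ["Pet Supplies", "Deli"]

# Drop rows that contain any null value (the row with a missing name)
toy = toy.dropna()

# isin() marks rows whose category is unwanted; ~ inverts the mask to keep the rest
kept = toy[~toy["category"].isin(unwanted)]

print(kept)
```

Only the "salmon" row survives: one row is dropped for its missing name and two for their unwanted categories.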

### Step 4: Selecting Only Product Names & Category Columns and Combine Into 1

.iloc[:, [5,8]].copy() uses iloc, an integer-based location indexer, to select all rows (‘:’) and only the columns at index 5 and 8. .copy() makes a deep copy of the selected columns, so changes made to the copy are not reflected in the original dataset. I will be building a new final dataset, so I don’t need to modify the original datasets directly.
	
  
data1_temp.columns = ["Product Name", "Category"] renames the two columns to "Product Name" and "Category" in the previously made copy. This standardizes the column names, so that when I replace the category names in each of the 5 datasets to fit the 11 categories I want, I can access the "Category" column of every dataset with the same line of code.


In [2]:
data1_temp = data1.iloc[:, [5,8]].copy()
data1_temp.columns = ["Product Name", "Category"]

data2_temp = data2.iloc[:, [1,2]].copy()
data2_temp.columns = ["Product Name", "Category"]

data3_temp = data3.iloc[:, [0,1]].copy()
data3_temp.columns = ["Product Name", "Category"]

data4_temp = data4.iloc[:, [6,7]].copy()
data4_temp.columns = ["Product Name", "Category"]

data5_temp = data5.iloc[:, [0,1]].copy()
data5_temp.columns = ["Product Name", "Category"]

print(data5_temp)

         Product Name        Category
17            chicken  Meat & Seafood
18               beef  Meat & Seafood
19               pork  Meat & Seafood
20             salmon  Meat & Seafood
21               tuna  Meat & Seafood
..                ...             ...
206  canned mushrooms    Canned Goods
207  canned pineapple    Canned Goods
208    canned peaches    Canned Goods
209    canned carrots    Canned Goods
210     canned olives    Canned Goods

[155 rows x 2 columns]
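
The select-copy-rename pattern can be checked on a small example. This is a sketch under assumed data: the `raw` DataFrame and its column names are hypothetical and only mimic a dataset with extra columns.

```python
import pandas as pd

# Hypothetical raw dataset with extra columns we do not need
raw = pd.DataFrame({
    "id": [1, 2],
    "item": ["Wheat Flour", "Marker"],
    "price": [2.5, 1.0],
    "group": ["Grocery & Staples", "Stationery"],
})

# Select all rows, and only the columns at positions 1 and 3
subset = raw.iloc[:, [1, 3]].copy()
subset.columns = ["Product Name", "Category"]

# Editing the copy leaves the original DataFrame untouched
subset.loc[0, "Product Name"] = "flour"

print(subset)
print(raw.loc[0, "item"])
```

The last line still prints the original item name, which is the point of calling .copy().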


### Step 5: Standardize Category Names

I manually went through each dataset and wrote down the replacements I needed to make in order to standardize and fit the category names into the 11 I decided above. Then, I created a replacement dictionary with {key:value} of {current category name:new standardized category name}. Then, I created a list called datasets to use a for loop to quickly run through each dataset and replace the category names if they need to be replaced according to the replacements dictionary. Here, renaming the category columns in each of the datasets came in handy, as now I only need one line of code to replace the category names inside the for loop.

In [3]:
replacements = {
    "Stationery": "Stationery & Books",
    "Books": "Stationery & Books",
    "Clothing": "Clothing & Lifestyle",
    "Footwear": "Clothing & Lifestyle",
    "Clothing Accessories": "Clothing & Lifestyle",
    "Personal Care": "Personal Care & Health",
    "Health & Wellness": "Personal Care & Health",
    "Oils & Fats": "Grains & Staples",
    "Grains & Pulses": "Grains & Staples",
    "Dairy": "Dairy & Eggs",
    "Seafood": "Meat & Seafood",
    "Grocery & Staples": "Grains & Staples",
    "Snacks & Munchies": "Bakery & Snacks",
    "Cold Drinks & Juices": "Beverages",
    "Household Care": "Household",
    "Baby Care": "Personal Care & Health",
    "Pharmacy": "Personal Care & Health",
    "Snacks": "Bakery & Snacks",
    "Bakery": "Bakery & Snacks",
    "Pantry": "Grains & Staples",
    "Personal Care": "Personal Care & Health",
    "Pet Care": "Household",
    "Condiments & Sauces": "Grains & Staples",
    "Canned Goods": "Grains & Staples"
}

datasets = [data1_temp, data2_temp, data3_temp, data4_temp, data5_temp]

for df in datasets:
  df["Category"] = df["Category"].replace(replacements)
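
Since the mapping was written by hand, it may help to check that no category slipped through. A minimal sketch, assuming the 11 target categories listed at the top; the `sample` rows and the "Toys" category are made up for illustration.

```python
import pandas as pd

# The 11 target categories from the list at the top of this document
target_categories = {
    "Fruits & Vegetables", "Dairy & Eggs", "Meat & Seafood",
    "Grains & Staples", "Bakery & Snacks", "Beverages", "Electronics",
    "Household", "Clothing & Lifestyle", "Personal Care & Health",
    "Stationery & Books",
}

# Hypothetical sample; "Toys" is deliberately not covered by the mapping
sample = pd.DataFrame({
    "Product Name": ["milk", "notebook", "teddy bear"],
    "Category": ["Dairy", "Stationery", "Toys"],
})

replacements = {"Dairy": "Dairy & Eggs", "Stationery": "Stationery & Books"}
sample["Category"] = sample["Category"].replace(replacements)

# Any category left outside the target set still needs a mapping or removal
leftover = set(sample["Category"]) - target_categories
print(leftover)
```

An empty set would mean every row already uses one of the standardized names.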

### Step 6: Combine the Datasets

Using pd.concat(), I combine the five copies of the datasets, each now containing only 2 columns (Product Name and Category) with standardized category names. ignore_index=True discards the original row index of each dataset and renumbers the rows of final_data from 0 to n-1, so the row labels do not restart or repeat partway through the data. print(final_data) prints a preview of the first and last rows of final_data and shows how many rows and columns there are.

In [4]:
# ignore_index=True renumbers the rows from 0 to n-1 across all five datasets
final_data = pd.concat([data1_temp, data2_temp, data3_temp, data4_temp, data5_temp], ignore_index=True)

print(final_data)

             Product Name              Category
0             Wheat Flour      Grains & Staples
1      Dishwashing Liquid             Household
2                  Pastry       Bakery & Snacks
3                  Marker    Stationery & Books
4                   Saree  Clothing & Lifestyle
...                   ...                   ...
12346    canned mushrooms      Grains & Staples
12347    canned pineapple      Grains & Staples
12348      canned peaches      Grains & Staples
12349      canned carrots      Grains & Staples
12350       canned olives      Grains & Staples

[12351 rows x 2 columns]
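
The effect of ignore_index=True is easiest to see on two one-row frames. This is a small sketch with made-up frames `a` and `b`.

```python
import pandas as pd

a = pd.DataFrame({"Product Name": ["pastry"], "Category": ["Bakery & Snacks"]})
b = pd.DataFrame({"Product Name": ["saree"], "Category": ["Clothing & Lifestyle"]})

# Without ignore_index, each frame keeps its own row labels
kept_index = pd.concat([a, b])

# With ignore_index=True, the rows are renumbered consecutively
new_index = pd.concat([a, b], ignore_index=True)

print(kept_index.index.tolist())
print(new_index.index.tolist())
```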


### Step 7: Preprocess the Product Names to Lowercase

1. Rewrite all product names into lowercase
2. Remove the words in parentheses

regex explanation:
1. r"" = raw string; backslashes are read literally
2. \s* = zero or more whitespace characters
3. \( = literal opening parenthesis
4. .*? = any characters, matched non-greedily, so the match stops at the first closing parenthesis
5. \) = literal closing parenthesis

Together, the pattern targets any parenthesized text (and the whitespace before it), if present.

In [5]:
# rewrite all product names into lowercase
final_data["Product Name"] = final_data["Product Name"].str.lower()

# remove the words in parenthesis
final_data["Product Name"] = final_data["Product Name"].str.replace(r"\s*\(.*?\)", "", regex=True)
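
The same pattern can be tried with Python's built-in re module. The product names here are invented examples; only the pattern comes from the cell above.

```python
import re

# Same pattern as in the cell above
pattern = r"\s*\(.*?\)"

# Hypothetical product names
examples = ["Milk (1 L)", "Basmati Rice (5kg) (Premium)", "eggs"]

# Lowercase first, then strip every parenthesized chunk
cleaned_names = [re.sub(pattern, "", name.lower()) for name in examples]

for before, after in zip(examples, cleaned_names):
    print(repr(before), "->", repr(after))
```

Because .*? is non-greedy, the two parenthesized chunks in the second name are removed one at a time instead of as one long match.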

### Step 8: Save Changes as a New CSV Dataset

.to_csv() saves the final_data as a new .csv file, and index=False removes the index column, as I only need the Product Name and Category columns.

In [6]:
# removes the index column
final_data.to_csv("final_data.csv", index=False)

Check how many products we have in each category:

In [7]:
print(final_data['Category'].value_counts())

Category
Fruits & Vegetables       1848
Grains & Staples          1727
Bakery & Snacks           1583
Personal Care & Health    1248
Dairy & Eggs              1114
Beverages                  933
Clothing & Lifestyle       872
Household                  830
Meat & Seafood             801
Electronics                727
Stationery & Books         668
Name: count, dtype: int64


## Part 2: Splitting Train & Test Datasets

### Step 1: Install scikit-learn

I’m using [scikit-learn](https://scikit-learn.org/stable/), an open source machine learning library with several algorithms and models, built on top of NumPy, matplotlib and SciPy. Specifically, I will be using the [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class to identify the category the grocery items belong to.

I’ve described each import line using comments below.

Using train_test_split() from scikit-learn, I pass in X (product names) and Y (categories). test_size=0.2 sends 20% of the data to testing and 80% to training. random_state=42 fixes the random seed so that the same split is produced every time, which makes the results reproducible, easier to debug, and fair to compare across runs. The number 42 has no special meaning; it is just a commonly used seed value. However, it’s interesting to note that it comes from The Hitchhiker’s Guide to the Galaxy, where 42 is said to be the answer to all things. Finally, stratify=Y ensures that each category makes up the same proportion of the training and test sets as it does in the full dataset. For instance, if ‘Fruits & Vegetables’ made up 90% of the data and ‘Household’ made up 10%, then both the training and test sets would have a similar 90%/10% split. This matters because a purely random split could leave a small category under-represented in one of the sets.

In [8]:
# splits arrays into random smaller sets for train & test data
from sklearn.model_selection import train_test_split

# ML cannot work with raw text, so we need to vectorize them into numbers
# This converts text into vectors based on TF-IDF (word frequency and importance)
from sklearn.feature_extraction.text import TfidfVectorizer

# classifier model from scikit-learn used to predict category based on product name input
from sklearn.linear_model import LogisticRegression

# needed to evaluate my model after training it
# prints out accuracy % and model performance
from sklearn.metrics import classification_report, accuracy_score

df = pd.read_csv('final_data.csv')

X = df['Product Name']
Y = df['Category']

# 80% train 20% test
X_train, X_test, Y_train, Y_test = train_test_split(
  X,
  Y,
  test_size=0.2,
  random_state=42,
  stratify=Y
)

Check if the data was correctly split:

In [9]:
X_train.shape, Y_train.shape

((9880,), (9880,))

In [10]:
X_test.shape, Y_test.shape

((2471,), (2471,))
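
The 90%/10% illustration of stratify can be reproduced on synthetic labels. A minimal sketch: the `items` and `labels` lists are made up, and only the train_test_split arguments mirror the cell above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 of one class, 10 of another
items = [f"item {i}" for i in range(100)]
labels = ["Fruits & Vegetables"] * 90 + ["Household"] * 10

_, _, y_train, y_test = train_test_split(
    items, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both splits keep the 90/10 ratio of the full list
print(Counter(y_train))
print(Counter(y_test))
```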

### Step 2: Oversample the Train Set

After running print(final_data['Category'].value_counts()) earlier, I noticed a gap of over 1,000 data points between my largest and smallest categories. As a result, I faced some errors when testing my model's category predictions. So, I decided to import the [RandomOverSampler](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html) class from imbalanced-learn, an open source library based on scikit-learn. RandomOverSampler over-samples the minority classes by duplicating samples from them. Because the oversampler requires the input X to be 2D, I convert the training data into a DataFrame. Then, I create an oversampler object, which looks at my training labels, finds the categories with fewer samples, and duplicates randomly chosen samples from them until all classes have the same number of data points. Lastly, I convert X_train_over back to a 1D Series.

In [11]:
from imblearn.over_sampling import RandomOverSampler

# convert train data into 2D array
X_train_df = X_train.to_frame()

# creates an oversampler object
ros = RandomOverSampler(random_state=42)

# performs oversampling: duplicates categories with few data
X_train_over, Y_train_over = ros.fit_resample(X_train_df, Y_train)

# convert data back in 1D array
X_train_over = X_train_over.squeeze()
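
To make the idea of random oversampling concrete, here is a plain-Python sketch of the general technique. It is not imbalanced-learn's implementation; the `random_oversample` helper and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(names, labels, seed=42):
    """Duplicate randomly chosen minority samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)

    # Group the samples by their class label
    by_class = {}
    for name, label in zip(names, labels):
        by_class.setdefault(label, []).append(name)

    target = max(len(group) for group in by_class.values())

    out_names, out_labels = [], []
    for label, group in by_class.items():
        # Draw duplicates (with replacement) to fill the gap to the target size
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for name in group + extra:
            out_names.append(name)
            out_labels.append(label)
    return out_names, out_labels

names = ["apple", "pear", "kiwi", "soap"]
labels = ["Fruits & Vegetables"] * 3 + ["Household"]

over_names, over_labels = random_oversample(names, labels)
print(Counter(over_labels))
```

Only the training set is resampled in the cell above; the test set keeps its natural class proportions.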

### Step 3: Vectorize Text Into Numbers Using TF-IDF

TF-IDF, standing for Term Frequency - Inverse Document Frequency, is one of the most common methods of converting text into numbers to be used for machine learning. TF measures how often a word appears in a single document (here, a product name). IDF measures how rare a word is across the entire dataset. TF-IDF is then calculated by multiplying those two scores together.
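
The multiplication can be worked through by hand. The sketch below uses the textbook definitions (relative term frequency and log(N / document frequency)); scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each vector, so its numbers will differ. The `docs` list and helper names are made up.

```python
import math

# Each "document" is one tokenized product name (hypothetical data)
docs = [["apple", "juice"], ["orange", "juice"], ["apple", "pie"]]

def tf(term, doc):
    # Share of the document's words that are this term
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Rarer terms (found in fewer documents) get a larger score
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("juice", docs[1], docs))
print(tf_idf("orange", docs[1], docs))
```

In "orange juice", the word "orange" scores higher than "juice" because it appears in only one of the three product names.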

I create a TF-IDF vectorizer with two settings: ngram_range=(1,2) tells the vectorizer to extract both single words (unigrams) and two-word sequences (bigrams) such as ‘apple juice’, and max_features=5000 tells the vectorizer to keep only the 5000 most frequent n-grams across the training data, which helps limit overfitting and noise from rare words.

Next, I use .fit_transform() to analyze all text in X_train_over and learn the vocabulary, and convert each product name into a numeric vector. .transform() then applies the trained vocabulary to my test set. Test data words not in the training vocabulary are ignored.


In [12]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=5000)

#fit_transform = looks at all my training text, learns the vocabulary, and converts each product name into a numeric vector
X_train_vec = vectorizer.fit_transform(X_train_over)

#transform = uses only the exact vocabulary learned from training data to convert new product names into vectors & words not in training data are ignored
X_test_vec = vectorizer.transform(X_test)

### Step 4: Train Model Using Logistic Regression

I create a clf variable, the logistic regression classifier, and set max_iter=3000 to raise the limit on solver iterations so that the optimizer has enough iterations to converge. Then, I call .fit() on clf to train my model using X_train_vec (my numerically converted vectors) and Y_train_over (my oversampled Y data).

In [13]:
clf = LogisticRegression(max_iter=3000)
clf.fit(X_train_vec, Y_train_over)

LogisticRegression(max_iter=3000)

Parameter          Value
penalty            'l2'
dual               False
tol                0.0001
C                  1.0
fit_intercept      True
intercept_scaling  1
class_weight       None
random_state       None
solver             'lbfgs'
max_iter           3000
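
As an alternative to keeping the vectorizer and classifier as two separate objects, scikit-learn's make_pipeline can bundle them into one. This is not what the notebook does; it is a sketch of a different arrangement on made-up training data, with `model` as an assumed name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training set
names = ["apple", "banana", "carrot", "milk", "cheese", "yogurt"]
cats = ["Fruits & Vegetables"] * 3 + ["Dairy & Eggs"] * 3

# The pipeline vectorizes raw text and then classifies it in one call
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=3000),
)
model.fit(names, cats)

prediction = model.predict(["banana"])[0]
print(prediction)
```

With a pipeline, prediction code passes raw strings directly and cannot forget the vectorizing step.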


### Step 5: Predict Data on Test Set

Now, I’m using my trained logistic regression model to predict the test data set I’ve previously split. 

In [14]:
# the categories that my model predicted
y_pred = clf.predict(X_test_vec)

### Step 6: Evaluate

I print out the accuracy score and a [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) of my model.

In [15]:
# accuracy_score = number of correct predictions / total number of predictions
print("Accuracy: ", accuracy_score(Y_test, y_pred))
print("\nClassification Report:\n", classification_report(Y_test, y_pred))

Accuracy:  0.9943342776203966

Classification Report:
                         precision    recall  f1-score   support

       Bakery & Snacks       1.00      0.99      1.00       317
             Beverages       1.00      0.99      1.00       187
  Clothing & Lifestyle       1.00      1.00      1.00       174
          Dairy & Eggs       0.99      1.00      1.00       223
           Electronics       1.00      1.00      1.00       145
   Fruits & Vegetables       0.97      1.00      0.98       370
      Grains & Staples       1.00      0.99      1.00       345
             Household       1.00      0.99      1.00       166
        Meat & Seafood       1.00      0.98      0.99       160
Personal Care & Health       1.00      0.99      1.00       250
    Stationery & Books       1.00      1.00      1.00       134

              accuracy                           0.99      2471
             macro avg       1.00      0.99      1.00      2471
          weighted avg       0.99      0.99    
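
The accuracy, precision, and recall columns can be computed by hand on a tiny example. The labels below are made up and unrelated to the real test set; `precision_recall` is a hypothetical helper.

```python
# Hypothetical true and predicted labels
y_true = ["Beverages", "Beverages", "Household", "Household", "Household"]
y_hat = ["Beverages", "Household", "Household", "Household", "Beverages"]

# Accuracy: correct predictions divided by all predictions
correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)

def precision_recall(label):
    # True positives: predicted as `label` and actually `label`
    tp = sum(t == label and p == label for t, p in zip(y_true, y_hat))
    predicted = sum(p == label for p in y_hat)   # everything predicted as label
    actual = sum(t == label for t in y_true)     # everything that truly is label
    return tp / predicted, tp / actual

print(accuracy)
print(precision_recall("Household"))
```

Precision asks "of the items labeled Household, how many really were?", while recall asks "of the real Household items, how many were found?".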

## Part 3: Create Input Interface & Test Model

### Version 1: Test Directly

In [16]:
while True:
  product_name = input("Enter product name (or type 'exit' to quit): ")

  if product_name.lower() == 'exit':
    break

  # wrap the name in [] because TfidfVectorizer expects an iterable of strings (one per product name), not a single bare string
  product_vec = vectorizer.transform([product_name])

  # use [0] to get the predicted category as a single string rather than an array
  predicted_category = clf.predict(product_vec)[0]

  print(f"Category: {predicted_category}\n")

Category: Fruits & Vegetables

Category: Fruits & Vegetables
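
One edge case is an input whose words never appeared in training: its vector is all zeros, so the model can only fall back on its learned bias. A possible guard is sketched below on a toy model; the `categorize` helper, the "Uncategorized" fallback, and the `toy_` objects are assumptions for illustration, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical miniature training set
names = ["apple", "banana", "milk", "cheese"]
cats = ["Fruits & Vegetables", "Fruits & Vegetables",
        "Dairy & Eggs", "Dairy & Eggs"]

toy_vectorizer = TfidfVectorizer()
toy_clf = LogisticRegression(max_iter=3000)
toy_clf.fit(toy_vectorizer.fit_transform(names), cats)

def categorize(text, fallback="Uncategorized"):
    vec = toy_vectorizer.transform([text])
    # nnz counts stored non-zero entries; 0 means no known word was found
    if vec.nnz == 0:
        return fallback
    return toy_clf.predict(vec)[0]

print(categorize("milk"))
print(categorize("xylophone"))
```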



### Version 2: Test Using ipywidgets

I wanted to create a simple GUI to visualize the grocery item sorter interface. 

I first tried to build it with HTML, CSS, and JS files, as those are the languages I am most comfortable with, but soon found that a browser cannot run .py files on its own; it needs either an extension or a Python backend, such as a Flask app or another API that exposes the model through URLs. So, I searched for Python GUI libraries that were simple and could run directly in this Jupyter notebook, and found ipywidgets. [Here is the doc webpage.](https://minrk-ipywidgets.readthedocs.io/en/latest/index.html)

The main function of the code below is creating an html widget with a text widget which takes a text input from the user, detects when the user [presses](https://minrk-ipywidgets.readthedocs.io/en/latest/examples/Widget%20Events.html) the "Enter" key, and uses my trained model to predict the category of the grocery item. If the category is already written down, the grocery item gets placed underneath the corresponding category in the form of a [vertical flexbox](https://ipywidgets.readthedocs.io/en/7.6.3/examples/Widget%20Styling.html); whereas, if the category is new, a new category box is created.

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output

text_input = widgets.Text(
  placeholder='Enter grocery ...',
  disabled=False
)

output = widgets.Output()

# VBox = vertical flexbox
assigned_category = {}
main_box = widgets.VBox()

def enter_pressed(text_widget):
  product_name = text_input.value

  # if no text was entered, return
  if not product_name:
    return
  
  try:
    with output:
      # clear the previous debug output before handling the next item
      clear_output()

    # same as before
    product_vec = vectorizer.transform([product_name])
    predicted_category = clf.predict(product_vec)[0]

    with output:
      print("Debug: ", product_name, " into ", predicted_category)

    # when a new grocery category is outputted
    if predicted_category not in assigned_category:
      category_header = widgets.HTML(
        f"<h1 style='font-size: 16px;'>{predicted_category}</h1>"
      )
      # VBox to hold grocery items
      grocery_box = widgets.VBox([])

      # wrap the header and its item list in one box for this category
      category_box = widgets.VBox(
        [category_header, grocery_box],
        layout=widgets.Layout(margin="20px 0px 20px 0px")
      )

      # since the category name doesn't already exist, add it to the "assigned_category" python dictionary
      assigned_category[predicted_category] = {
        "container": category_box,
        "grocery_box": grocery_box
      }

      # add the new category at the bottom of the main VBox
      main_box.children += (category_box,)

    # label the grocery item as a bullet point
    product_to_add = widgets.Label(f"- {product_name}")

    # look up this category's grocery_box and append the new grocery item as its last child
    grocery_box = assigned_category[predicted_category]["grocery_box"]
    grocery_box.children += (product_to_add,)

    # clear the input box after
    text_input.value = ''

  except Exception as e:
    print("Error: ", e)

# when the user presses the "Enter" key
text_input.on_submit(enter_pressed)

display(text_input) # displays the text input
display(main_box) # displays the main GUI
# display(output) displays the debug print messages



Text(value='', placeholder='Enter grocery ...')

VBox()
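
The grouping behavior of the widget code (create a category on first use, otherwise append under it) can be separated from the GUI and tested on its own. A minimal sketch with a plain dictionary; `add_item` and the hard-coded categories are hypothetical stand-ins for the model's predictions.

```python
def add_item(groups, product_name, predicted_category):
    """Place an item under its category, creating the category on first use."""
    if predicted_category not in groups:
        groups[predicted_category] = []
    groups[predicted_category].append(product_name)
    return groups

groups = {}
add_item(groups, "oranges", "Fruits & Vegetables")
add_item(groups, "eggs", "Dairy & Eggs")
add_item(groups, "apples", "Fruits & Vegetables")

# Print each category followed by its items as bullet points
for category, items in groups.items():
    print(category)
    for item in items:
        print(f"- {item}")
```

Categories appear in the order they were first used, matching how new category boxes are added at the bottom of the main VBox.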