# Task 1: Sorting Groceries

## Prompt:
Mimicking the behavior of the Apple Reminders app when creating a "Groceries" reminder list, create a program that, when passed a grocery item name (e.g. "oranges", "eggs"), returns the category it belongs to (e.g. "Fruits and Vegetables", "Eggs and Dairy", respectively).

You should start by collecting and processing the data needed to complete this task. The focus of this task is on these data collection and processing steps: what kind of data you need, how you will process it, how much you will need, etc.

### ML Classifier vs. Neural Network?

My data:
- only need product name & category columns
- item names are short (1-4 words)
- small number of categories (12)
- text input → text classification

So I should use a classical ML classifier rather than a neural network → it learns patterns from the given data to predict the correct answer (the category)

Steps:

1. Build a training dataset with product name and category
2. Tokenize the product names into words
3. Vectorize the tokens into numbers, since the classifier works on numbers, not words
4. Train the classifier to learn which words belong to which category
5. Predict the category of new items
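
The five steps above can be sketched end to end with nothing but the standard library. This is a minimal illustration, not the classifier used in this task: it assumes a tiny made-up training set (`TRAIN`), uses per-category token counts as the "vectors", and swaps in a simple naive Bayes scoring rule. The names `tokenize`, `train`, and `predict` are hypothetical helpers.

```python
import math
import re
from collections import Counter, defaultdict

# Step 1: hypothetical toy training data, not taken from the real datasets.
TRAIN = [
    ("oranges", "Fruits & Vegetables"),
    ("green apples", "Fruits & Vegetables"),
    ("whole milk", "Dairy & Eggs"),
    ("brown eggs", "Dairy & Eggs"),
    ("orange juice", "Beverages"),
    ("apple juice", "Beverages"),
]

def tokenize(name):
    """Step 2: lowercase the name and split it into word tokens."""
    return re.findall(r"[a-z]+", name.lower())

def train(rows):
    """Steps 3-4: count how often each token appears in each category."""
    token_counts = defaultdict(Counter)
    category_counts = Counter()
    for name, category in rows:
        category_counts[category] += 1
        token_counts[category].update(tokenize(name))
    vocabulary = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, category_counts, vocabulary

def predict(name, model):
    """Step 5: pick the category with the highest naive Bayes log-score."""
    token_counts, category_counts, vocabulary = model
    total_rows = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_rows in category_counts.items():
        score = math.log(n_rows / total_rows)
        n_tokens = sum(token_counts[category].values())
        for tok in tokenize(name):
            # Add-one smoothing so an unseen token does not rule out a category.
            seen = token_counts[category][tok]
            score += math.log((seen + 1) / (n_tokens + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

model = train(TRAIN)
print(predict("Eggs", model))         # Dairy & Eggs
print(predict("grape juice", model))  # Beverages
```

With real data, a library vectorizer and classifier would likely replace the hand-written counting, but the flow of tokenize, count, train, predict stays the same.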

Categories:

1. Fruits & Vegetables
2. Dairy & Eggs
3. Meat & Seafood
4. Grains & Staples
5. Bakery
6. Snacks
7. Beverages
8. Electronics
9. Household
10. Clothing & Lifestyle
11. Personal Care & Health
12. Stationery & Books

## Part 1: Cleaning & Combining Datasets

### Step 1: Load the datasets

pandas = Python library for loading and working with datasets

In [20]:
import pandas as pd

# load each CSV, then remove rows that have a NULL value in any column
data1 = pd.read_csv("data1.csv")
data1.dropna(inplace=True)

data2 = pd.read_csv("data2.csv")
data2.dropna(inplace=True)

# to_string() prints out the ENTIRE data
# print(data1.to_string())
# print(data1)
# print(data2)

### Step 2: Select only the product & category columns

In [21]:
data1_temp = data1.iloc[:, [5,8]].copy()
data1_temp.columns = ["Product Name", "Category"]

data2_temp = data2.iloc[:, [1,2]].copy()
data2_temp.columns = ["Product Name", "Category"]

print(data2_temp)

        Product Name             Category
0         Sushi Rice      Grains & Pulses
1     Arabica Coffee            Beverages
2         Black Rice      Grains & Pulses
3    Long Grain Rice      Grains & Pulses
4               Plum  Fruits & Vegetables
..               ...                  ...
985          Spinach  Fruits & Vegetables
986   Cheddar Cheese                Dairy
987          Cabbage  Fruits & Vegetables
988      Avocado Oil          Oils & Fats
989           Papaya  Fruits & Vegetables

[989 rows x 2 columns]


### Step 3: Replace the category labels with standardized names

In [None]:
data1_temp["Category"] = data1_temp["Category"].replace({
  "Stationery": "Stationery & Books",
  "Books": "Stationery & Books",
  "Clothing": "Clothing & Lifestyle",
  "Footwear": "Clothing & Lifestyle",
  "Clothing Accessories": "Clothing & Lifestyle",
  "Personal Care": "Personal Care & Health",
  "Health & Wellness": "Personal Care & Health"
})

data2_temp["Category"] = data2_temp["Category"].replace({
  "Oils & Fats": "Grains & Staples",
  "Grains & Pulses": "Grains & Staples",
  "Dairy": "Dairy & Eggs",
  "Seafood": "Meat & Seafood"
})
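
After the two mappings, it can help to confirm that no source label was missed. The sketch below is one possible check, assuming the twelve target categories are spelled as in the mappings above; `TARGET_CATEGORIES` and `find_unmapped` are hypothetical names, and the sample labels are made up. The function accepts any iterable of labels, so a column such as `data2_temp["Category"]` could be passed in directly.

```python
# The twelve target categories, spelled as in the replace() mappings.
TARGET_CATEGORIES = {
    "Fruits & Vegetables", "Dairy & Eggs", "Meat & Seafood",
    "Grains & Staples", "Bakery", "Snacks", "Beverages", "Electronics",
    "Household", "Clothing & Lifestyle", "Personal Care & Health",
    "Stationery & Books",
}

def find_unmapped(labels):
    """Return the labels that are not one of the target categories, sorted."""
    return sorted(set(labels) - TARGET_CATEGORIES)

# Hypothetical labels; "Seafood" and "Oils & Fats" have not been mapped yet.
sample = ["Dairy & Eggs", "Seafood", "Bakery", "Oils & Fats", "Seafood"]
print(find_unmapped(sample))  # ['Oils & Fats', 'Seafood']
```

An empty list means every label already belongs to one of the target categories.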

### Step 4: Combine the 2 datasets

In [None]:
# ignore_index=True renumbers the rows 0..n-1, so data2_temp's rows continue on from data1_temp's
final_data = pd.concat([data1_temp, data2_temp], ignore_index=True)

print(final_data)

             Product Name              Category
0             wheat flour      Grains & Staples
1      dishwashing liquid             Household
2                  pastry                Bakery
3                  marker    Stationery & Books
4                   saree  Clothing & Lifestyle
...                   ...                   ...
10984             spinach   Fruits & Vegetables
10985      cheddar cheese          Dairy & Eggs
10986             cabbage   Fruits & Vegetables
10987         avocado oil      Grains & Staples
10988              papaya   Fruits & Vegetables

[10989 rows x 2 columns]


### Step 5: Preprocess the product names

1. Rewrite all product names into lowercase
2. Remove the words in parentheses

regex explanation:
1. r"" = raw string; backslashes are read literally
2. \s* = zero or more whitespace characters
3. \( = literal opening parenthesis
4. .*? = any characters, as few as possible (non-greedy), so the match stops at the first closing parenthesis
5. \) = literal closing parenthesis

means -> targets the parentheses, the words inside them, and any whitespace just before them (if there are any)
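
To see the pattern in action outside pandas, here is a small sketch using Python's built-in `re` module. The product names are made up for illustration, `clean_name` is a hypothetical helper, and the trailing `.strip()` is an extra step that is not part of the original cleaning.

```python
import re

# Same pattern as in this step: optional whitespace, then "( ... )".
PAREN_PATTERN = re.compile(r"\s*\(.*?\)")

def clean_name(name):
    """Lowercase a product name and drop any parenthesised text."""
    return PAREN_PATTERN.sub("", name.lower()).strip()

print(clean_name("Milk (1 L)"))                # milk
print(clean_name("Rice (Basmati) Bag (5kg)"))  # rice bag
print(clean_name("Plum"))                      # plum

# A greedy .* would run to the LAST closing parenthesis and eat "bag" too.
print(re.sub(r"\s*\(.*\)", "", "rice (basmati) bag (5kg)"))  # rice
```

The last line shows why the non-greedy `.*?` matters when a name contains more than one pair of parentheses.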

In [None]:
# rewrite all product names into lowercase
final_data["Product Name"] = final_data["Product Name"].str.lower()

# remove the parentheses and the words inside them
final_data["Product Name"] = final_data["Product Name"].str.replace(r"\s*\(.*?\)", "", regex=True)

### Step 6: Save the changes as a new dataset

In [33]:
# index=False leaves the row index out of the saved file
final_data.to_csv("final_data.csv", index=False)
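
Since the prompt also asks how much data is needed, one possible follow-up is to count how many rows each category has in the saved file. The sketch below assumes a CSV with a "Category" column, as written above; `category_counts` is a hypothetical helper, and an in-memory sample stands in for `final_data.csv` so the block runs on its own.

```python
import csv
import io
from collections import Counter

def category_counts(csv_file):
    """Count rows per category in a CSV that has a "Category" column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Category"] for row in reader)

# Hypothetical stand-in for final_data.csv.
sample_csv = io.StringIO(
    "Product Name,Category\n"
    "spinach,Fruits & Vegetables\n"
    "papaya,Fruits & Vegetables\n"
    "cheddar cheese,Dairy & Eggs\n"
)
counts = category_counts(sample_csv)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

For the real file, the same function could be called inside `with open("final_data.csv", newline="") as f:`. Categories with very few rows are the ones most likely to need more data.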