# Adding material to the item name

The following changes were made manually to the `item` and `similar_items` fields of the NEA JSON data (`data/data.json`) so that the vision model can map the item in an image to the item name with the correct material:
- `tube` to `glass tube`
- `Paint container` to `Metal paint container`
- `Paint cans` to `Metal paint cans`
- `Bakeware` to `Glass Bakeware`
- `Condiment bottle` to `Glass condiment bottle`
- `Beer bottle` to `Glass beer bottle`
- `Dryer` to `Hairdryer`
- `Bodywash bottle` to `Plastic bodywash bottle`
- `Milk bottles` to `Plastic milk bottles`
- `Facial cleanser bottle` to `Plastic Facial cleanser bottle`
- `Magazine wrapper` to `Plastic magazine wrapper`
- `Sauce bottle` to `Glass sauce bottle`
- `Fruit box` to `Plastic fruit box`
- `Wine bottle` to `Glass wine bottle`
- `Beer bottle` to `Glass Beer bottle`
- `Bread bag` to `Plastic bread bag`
- `Fire-Wire` to `Fire wire`
- `Pill bottle` to `Plastic pill bottle`
- `Spoon` to `Plastic spoon`
- `Plastic container` to `Plastic tupperware container`
- `Serving Bowl` to `Plastic serving bowl`
- `Red Wine` to `red wine bottle`
- `Carbonated drink bottle` to `plastic carbonated drink bottle`
- `Carbonated drink can` to `Metal Carbonated drink can`
- `Milk bottles` to `plastic milk bottles`
- `Soft drink bottle` to `plastic soft drink bottle`
- `Medicine bottle` to `plastic medicine bottle`
- `spork` to `plastic spork`
- `White Wine ` to `white wine bottle`
- `Serving Plate` to `plastic serving plate`
- `Water bottle` to `plastic water bottle`
- `Saucer` to `plastic saucer`
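
The renames above were applied by hand. If they ever need to be replayed on a fresh copy of the data, a minimal sketch along the following lines might do it, assuming each record has an `item` string and a `similar_items` list. `RENAMES` and `apply_renames` are hypothetical names that do not appear in the notebook, only a few of the pairs are shown, and the sample record is made up.

```python
# Hypothetical helper: replay a few of the manual renames listed above.
RENAMES = {
    'Paint container': 'Metal paint container',
    'Bakeware': 'Glass Bakeware',
    'Bread bag': 'Plastic bread bag',
}

def apply_renames(records, renames):
    """Return copies of the records with exact-match names renamed."""
    renamed = []
    for record in records:
        updated = dict(record)
        updated['item'] = renames.get(record['item'], record['item'])
        updated['similar_items'] = [
            renames.get(name, name) for name in record['similar_items']
        ]
        renamed.append(updated)
    return renamed

# Made-up record for illustration; it is not taken from data.json
sample = [{'item': 'Bakeware', 'similar_items': ['Bread bag', 'Casserole dish']}]
print(apply_renames(sample, RENAMES))
```

Matching is exact and case-sensitive here, so a name with stray whitespace would be left untouched.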

In [54]:
# Brand names and overly generic terms to skip when expanding items and similar items
to_ignore = set([item.lower() for item in [
 '',
 '100 plus',
 'absolut',
 'acer',
 'adat',
 'adidas',
 'almon buuter',
 'anchor',
 'android',
 'apple',
 'apple juice',
 'aqua',
 'asahi',
 'asics',
 'asus',
 'belvedere',
 "benjaminmooreyeo's",
 'berocca',
 'blackmores',
 'bombay sapphire',
 'bonjour',
 'bowl',
 'breda',
 'calbee',
 'calsberg',
 'champagne',
 'chilli',
 'chivas',
 'coke',
 'corona',
 'dasani',
 'dell',
 'duracell',
 'energizer',
 'enriched',
 'eveready',
 'evian',
 'f&n',
 'fanta',
 'farm fresh',
 'fiji',
 'fire-wire',
 'fnn',
 'fork',
 'fragrance',
 'fruit juice packet',
 'gardenia',
 'gin',
 'google phone',
 'green tea',
 'greenfield',
 'grey goose',
 'guinness',
 'hammerite',
 'hdmi cable',
 'heaven & earth',
 'heineken',
 'hendricks',
 'hl',
 'hoegarden',
 'hojicha tea',
 'honey',
 'hp',
 'hua tiao jiu',
 'ice lemon tea',
 'ice mountain',
 'imac',
 'ipad',
 'iphone',
 'jack daniels',
 'jagermeister',
 'johnny walker',
 'jotun',
 'juice',
 'ketchup',
 'kingfisher',
 'knife',
 'kronenbourgh',
 "lay's",
 'lenovo',
 'lightning cable',
 'lychee tea',
 'macbook',
 'macbook air',
 'macbook pro',
 'marigold',
 'mayonaise',
 'meadows',
 'meiji',
 'metal carbonated drink can',
 'milo',
 'monkey shoulder',
 'mustard',
 'nescafe',
 'nespresso',
 'new balance',
 'nike',
 'nippon paint',
 'nokia',
 'oatly',
 'oatside',
 'oolong tea',
 'orange juice',
 'peanut butter',
 'peel fresh',
 'pepsi',
 'ph balancer',
 'pizza box',
 'plate',
 'pokka',
 'prosecco',
 'puma',
 'rafflespaint',
 'razer',
 'razor',
 'redoxon',
 'reebok',
 'ribena',
 'roku gin',
 'ronseal',
 'ruffles',
 'rum',
 'samsung',
 'school diary',
 'scotts',
 'seasons',
 'sesame oil',
 'sketchers',
 'smartwater',
 'smirnoff',
 'soy sauce',
 'spread',
 'sprite',
 'sunkist',
 'sunshine',
 'super value',
 'tanquery',
 'ten year series',
 'tequila',
 'the botanist',
 'tiger',
 'tiger brand',
 'toiletries',
 'top one',
 'torres',
 'trs',
 'ts',
 'tube',
 'tupperware',
 'twisties',
 'type b',
 'type c',
 "tyrrell's",
 'under armour',
 'usb',
 'usb c',
 'vitasoy',
 'volvic',
 'whisky',
 'xlr',
 "yeo's" 
]])

# Put similar items into new entries in the JSON file

In [56]:
import json
with open('../data/data.json', 'r') as file:
    data = json.load(file)

print(f"No. of items in data: {len(data)}")



No. of items in data: 321


In [57]:
# Collect every distinct item and similar-item name (trimmed and lower-cased)
items_in_data = set()
with open('../data/data.json', 'r') as file:
    data = json.load(file)

for item in data:
    # set.add() is a no-op for names already present, so no membership check is needed
    items_in_data.add(item['item'].strip().lower())
    for similar_item in item['similar_items']:
        items_in_data.add(similar_item.strip().lower())


print(f"No. of unique items and similar items in data: {len(items_in_data)}")

No. of unique items and similar items in data: 589


In [58]:
unique_items = set()
res = []
for item in data:
    # Treat the item name and each of its similar items as separate candidate entries
    for name in [item['item'], *item['similar_items']]:
        name = name.strip().lower()
        # Skip ignored names and names that already have an entry
        if name in to_ignore or name in unique_items:
            continue

        # Each entry inherits the parent item's material, recyclability and instructions
        res.append({
            'material': item['material'],
            'item': name,
            'recyclable': item['recyclable'],
            'instructions': item['instructions']
        })
        unique_items.add(name)

sorted_res = sorted(res, key=lambda item: item['item'])
print(f"No. of items in data with similar items after ignoring brands: {len(sorted_res)}")

No. of items in data with similar items after ignoring brands: 430
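
To make the expansion step concrete, here is a small self-contained sketch of the same logic applied to a single made-up record. `expand_similar_items` is a hypothetical wrapper name, and the field values (including the boolean `recyclable`) are assumptions for illustration, not values from `data.json`.

```python
def expand_similar_items(records, ignore):
    """Give each item and similar-item name its own entry, skipping ignored names and duplicates."""
    seen = set()
    entries = []
    for record in records:
        for name in [record['item'], *record['similar_items']]:
            name = name.strip().lower()
            if name in ignore or name in seen:
                continue
            entries.append({
                'material': record['material'],
                'item': name,
                'recyclable': record['recyclable'],
                'instructions': record['instructions'],
            })
            seen.add(name)
    return sorted(entries, key=lambda entry: entry['item'])

# Made-up record for illustration; field values are not taken from data.json
sample = [{
    'material': 'Glass',
    'item': 'Glass beer bottle',
    'similar_items': ['Heineken', ' Ale bottle ', 'glass beer bottle'],
    'recyclable': True,
    'instructions': 'Rinse before recycling.',
}]
for entry in expand_similar_items(sample, ignore={'heineken'}):
    print(entry['item'])
```

In this sample the brand name is dropped, the padded name is trimmed, and the repeated name is kept only once, which leaves two entries.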


In [59]:
with open('../data/data_with_similar_items.json', 'w') as file:
    json.dump(sorted_res, file, indent=4)



# Extracting links and removing HTML tags from instructions

In [60]:
import re

def extract_link(text):
    """Return every URL found in a single-quoted href attribute."""
    # re.findall already returns an empty list when nothing matches
    return re.findall(r"href='(.*?)'", text)
    

In [61]:
# text = """
# Clothes should be donated if they are in good condition. <br/><br/> Click <a href='https://www.nea.gov.sg/our-services/waste-management/donation-resale-and-repair-channels/' target='_blank' style='color:black; font-weight:600; text-decoration: underline; font-style: italic;'>here</a> for avenues to donate, resell or repair your clothes.can be recycled through E-waste bins, located <a href='https://www.nea.gov.sg/our-services/waste-management/3r-programmes-and-resources/e-waste-management/where-to-recycle-e-waste' target='_blank' style='color:black; font-weight:600; text-decoration: underline; font-style: italic;'>here</a>.
# """
text = 'should be disposed of as general waste'
print(extract_link(text))

[]
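
Note that `extract_link` only matches `href` values wrapped in single quotes, which is how the links appear in the sample instructions in this notebook. If some instructions used double quotes instead, a variant like the sketch below could cover both styles. `extract_links_any_quote` is a hypothetical name and the example URLs are placeholders.

```python
import re

# Hypothetical variant: accept href values in single or double quotes.
# The backreference \1 requires the closing quote to match the opening one.
HREF_PATTERN = re.compile(r'href\s*=\s*(["\'])(.*?)\1')

def extract_links_any_quote(text):
    """Return all href URLs in the text, whichever quote style wraps them."""
    return [match.group(2) for match in HREF_PATTERN.finditer(text)]

# Placeholder URLs for illustration
html = '<a href="https://a.example/x">a</a> and <a href=\'https://b.example/y\'>b</a>'
print(extract_links_any_quote(html))
print(extract_links_any_quote('should be disposed of as general waste'))
```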


In [62]:
import re

def remove_html_tags(text):
    # The non-greedy pattern strips each <...> tag and keeps the text between tags
    clean = re.compile(r'<.*?>')
    return clean.sub('', text)

In [63]:
text = "Clothes should be donated if they are in good condition. <br/><br/> Click <a href='https://www.nea.gov.sg/our-services/waste-management/donation-resale-and-repair-channels/' target='_blank' style='color:black; font-weight:600; text-decoration: underline; font-style: italic;'>here</a> for avenues to donate, resell or repair your clothes."

print(remove_html_tags(text))

Clothes should be donated if they are in good condition.  Click here for avenues to donate, resell or repair your clothes.


In [64]:
# Read the expanded JSON file, extract every link from each document's
# instructions into a new 'links' field, then strip the HTML tags
import json
with open('../data/data_with_similar_items.json', 'r') as file:
    data = json.load(file)
print(f"no. of documents in data with similar items: {len(data)}")

for doc in data:
    doc['item'] = doc['item'].lower()
    doc['links'] = extract_link(doc['instructions'])
    doc['instructions'] = remove_html_tags(doc['instructions'])
    doc['instructions'] = doc['instructions'].strip()
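    # Upper-case only the first character; calling capitalize() on the whole
    # string would also lower-case the rest. Empty strings are left as they are.
    if doc['instructions']:
        doc['instructions'] = doc['instructions'][0].capitalize() + doc['instructions'][1:]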

with open('../data/cleaned_data.json', 'w') as file:
    json.dump(data, file, indent=4)

no. of documents in data with similar items: 430


In [65]:
with open('../data/cleaned_data.json', 'r') as file:
    data = json.load(file)
print(f"no. of documents in cleaned data: {len(data)}")

no. of documents in cleaned data: 430
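
Before moving on to manual cleaning, it may help to confirm that no markup survived the tag removal. The sketch below is a hypothetical sanity check that is not part of the original notebook: `find_leftover_html` is an assumed name and the sample documents are made up.

```python
import re

# Hypothetical sanity check: flag instructions that still contain something tag-like.
TAG_PATTERN = re.compile(r'<[^>]+>')

def find_leftover_html(docs):
    """Return the item names whose instructions still match the tag pattern."""
    return [doc['item'] for doc in docs if TAG_PATTERN.search(doc['instructions'])]

# Made-up documents for illustration
sample_docs = [
    {'item': 'clothes', 'instructions': 'Clothes should be donated if they are in good condition.'},
    {'item': 'battery', 'instructions': 'Can be recycled through <b>E-waste bins</b>.'},
]
print(find_leftover_html(sample_docs))
```

Running the same function on the list loaded from `cleaned_data.json` should return an empty list if the cleaning worked as intended.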


Perform any further manual cleaning on the JSON and save it as `cleaned_data_final.json`.

Manual cleaning changes made:
- `located here. (ALBA E-waste recycling bins).` to `located at ALBA E-waste recycling bins.`
- `Click here for avenues` to `Refer to the link(s)`
- `Click here for for avenues` to `Refer to the link(s)`
- `Can be recycled through E-waste bins, located here.` to `Can be recycled through E-waste bins, located at the link(s).`
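
These replacements were made by hand. If the step needs to be repeated, it could be scripted roughly as in the sketch below, which applies the same pairs in order with plain `str.replace`. `MANUAL_REPLACEMENTS` and `apply_manual_cleaning` are hypothetical names, and the order is an assumption: the ALBA phrase is handled first because the last pattern would otherwise match the start of the same sentence.

```python
# Hypothetical helper: replay the manual phrase replacements listed above, in order.
MANUAL_REPLACEMENTS = [
    ('located here. (ALBA E-waste recycling bins).', 'located at ALBA E-waste recycling bins.'),
    ('Click here for avenues', 'Refer to the link(s)'),
    ('Click here for for avenues', 'Refer to the link(s)'),
    ('Can be recycled through E-waste bins, located here.',
     'Can be recycled through E-waste bins, located at the link(s).'),
]

def apply_manual_cleaning(text):
    """Apply each (old, new) pair to the text in sequence using exact matching."""
    for old, new in MANUAL_REPLACEMENTS:
        text = text.replace(old, new)
    return text

print(apply_manual_cleaning('Can be recycled through E-waste bins, located here.'))
print(apply_manual_cleaning('Click here for for avenues to donate, resell or repair your clothes.'))
```

The matching is case-sensitive, so phrases that differ in capitalization from the list would still need to be fixed by hand.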
