# DATA, TAGS & QUERIES
This notebook collects everything currently in the folder, including unfinished code for queries (near the end).

Tags in `coldstorage4.csv` have yet to be modified as well.

Always run *1) Imports*, *2) Constants & Variable Initialisation* and *3) Helper Functions* before running anything else.

Remember to document your code :)
_____

## Imports

In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import csv
import os
import random

## Constants & Variable Initialisation

In [None]:
BASE_URL = "https://coldstorage.com.sg/"

## Helper Functions

In [None]:
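def read(url):
    # Fetches a webpage and returns its html parsed with the lxml parser
    # (the lxml package must be installed for BeautifulSoup to use it)
    with urlopen(url) as response: # close the connection once the page is read
        content = response.read()
    return BeautifulSoup(content, 'lxml')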

# Data Scraping
from https://coldstorage.com.sg/

### Obtaining list of URLs to scrape

This data is written onto `children.csv`. **If the folder already contains this file, you don't need to run this code.**

For example, if one of the websites is https://coldstorage.com.sg/fruits-vegetables/fresh-fruits/apples-pears, the corresponding entry in the csv is `fruits-vegetables/fresh-fruits/apples-pears`.

Note: Only items with **no deeper subcategories** are considered. For example, https://coldstorage.com.sg/fruits-vegetables is not an entry in the file.
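
The leaf-only rule can be tried out without touching the network. The sketch below runs the same depth-first idea over a small hand-written category tree; the `TREE` dictionary and the `collect_leaves` name are made up for illustration and are not part of the scraper.

```python
# Hypothetical category tree: each key maps to its list of subcategories.
# A category with no subcategories is a leaf and would become a row in children.csv.
TREE = {
    "fruits-vegetables": [
        "fruits-vegetables/fresh-fruits",
        "fruits-vegetables/fresh-vegetables",
    ],
    "fruits-vegetables/fresh-fruits": [
        "fruits-vegetables/fresh-fruits/apples-pears",
    ],
    "fruits-vegetables/fresh-vegetables": [],
    "fruits-vegetables/fresh-fruits/apples-pears": [],
}

def collect_leaves(node, tree, visited=None, leaves=None):
    """Depth-first walk that records only categories without subcategories."""
    if visited is None:
        visited = set()
    if leaves is None:
        leaves = []
    if node in visited:
        return leaves
    visited.add(node)
    subcategories = tree.get(node, [])
    if not subcategories:
        leaves.append(node)  # nothing deeper, so this is a leaf
        return leaves
    for sub in subcategories:
        collect_leaves(sub, tree, visited, leaves)
    return leaves

leaves = collect_leaves("fruits-vegetables", TREE)
print(leaves)
```

Here `fruits-vegetables` and `fruits-vegetables/fresh-fruits` are skipped because they still have subcategories, which mirrors the note above.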

In [None]:
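visited = set()   # categories that have already been processed
children = []     # leaf categories (no deeper subcategories)
categories = {}   # top-level category path -> display name

def dfs(node):
    # Depth-first walk of the category tree, starting at `node`
    url = BASE_URL + node
    if node in visited:
        return
    visited.add(node)
    content = read(url)
    print("processing: " + node)
    sub_categories = content.find_all('a', {"class": "subcat_catalog"})
    if len(sub_categories) == 0:
        # no subcategory links on this page, so it is a leaf
        children.append(node)
        print("appended: " + node)
        return
    for x in sub_categories:
        nd = x['href'][1:] # drop the first character of the href (BASE_URL already ends in "/")
        dfs(nd)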

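content = read(BASE_URL)
tmp = content.find_all("li", {"class": "col-sm-6 col-md-3 menu-promo"})
for i in tmp[:2]: # only the first two menu blocks are used
    j = i.find_all('a')
    for k in j:
        # get_text also works when the link contains nested tags (k.string would be None)
        categories[k['href'][1:]] = k.get_text(strip=True)

for x in categories:
    dfs(x)

# Open the file only after scraping has finished, so that a failed run
# does not overwrite an existing children.csv with an empty file
with open("children.csv", "w") as file:
    for x in children:
        file.write(x + "\n")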

### The actual scraping

*We use BeautifulSoup to navigate the html. Documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/*

Using the URLs in `children.csv`, data is extracted from each webpage and written to a versioned csv file. The text here originally referred to `coldstorage4.csv`, but the cell below currently writes `coldstorage5.csv`; remember to change the version number if you re-run the code. This code also creates the tags for each item.

**Pages:** Depending on the number of items it contains, a URL may be split across several pages (e.g. https://coldstorage.com.sg/food-pantry/condiments-dressings/herbs-spices-seasonings), as shown:

![Pages](pagenumbers.png "Pages")

We use `find_all('li', {"class": "page"})` to look for every page in a URL, then iterate through the list of pages.

Every item on the webpage is wrapped in `<li class="col-xs-6 col-sm-4 col-md-3 col-lg-2 open-product-detail algolia-click"></li>`. We use `find_all('li', {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-2 open-product-detail algolia-click"})` to look for every item.
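
To see how these two lookups behave, the sketch below runs them on a small hand-written HTML snippet instead of the live site. The markup, the `?page=` links and the item names are invented for illustration, and `html.parser` is used only so that the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

ITEM_CLASS = "col-xs-6 col-sm-4 col-md-3 col-lg-2 open-product-detail algolia-click"

# Invented markup that imitates the structure described above.
sample_html = f"""
<ul>
  <li class="page"><a href="/food-pantry/herbs?page=1">1</a></li>
  <li class="page"><a href="/food-pantry/herbs?page=2">2</a></li>
  <li class="next"><a href="/food-pantry/herbs?page=2">Next</a></li>
</ul>
<ul>
  <li class="{ITEM_CLASS}">Item A</li>
  <li class="{ITEM_CLASS}">Item B</li>
  <li class="banner">Not an item</li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# One <li class="page"> per page; the link inside it points at that page.
pages = soup.find_all("li", {"class": "page"})
page_links = [li.find("a")["href"][1:] for li in pages]

# Every product is an <li> carrying the full item class string.
items = soup.find_all("li", {"class": ITEM_CLASS})
item_names = [li.get_text(strip=True) for li in items]

print(page_links)
print(item_names)
```

As far as the BeautifulSoup documentation describes it, a class string containing spaces is compared against the whole `class` attribute, so the item lookup depends on the classes appearing in exactly this order.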

Inside the `<li></li>` tags for each item `x`, we find relevant attributes as follows:
___
`a = x.find('a', {"class": "search product-quick-view"})`

![search-product-quick-view](pic1.png "search product-quick-view")

**Brand (WAITROSE):** `a['data-product-brand']`

**Size (350G):** `a['data-product-size'][6:]` (`[6:]` is to remove `Size: ` from the text)
___
`b = x.find('a', {"class": "product-link"})`

![product-link](pic2.png "product-link")

**Name (Sea Salt Coarse Crystals 350g):** `b.text`

**URL ( https://coldstorage.com.sg/sea-salt-coarse-crystals-016542 ):** `BASE_URL + b['href'][9:]`
___
`c = x.find_all('div', {"class": "content_price"})[0]`

![content-price](pic3.png "content-price")

**Price (3.85):** `d = c.find_all('div'); d[0].text`
___
`e = x.find_all('div', {"class":"promo-wrapper"})[0]`

![promo-wrapper](pic4.png "promo-wrapper")

**Promotion (13% off):** `f = e.find_all('span'); f[0].text`
___
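
The fixed-offset slices (`[6:]` and `[9:]`) can be checked on sample strings. This is a minimal sketch: the two raw values are hypothetical, and the `/product/` prefix in particular is an assumption (the notes above only say that the first 9 characters of the href are dropped). `strip_prefix` is an illustrative helper, not part of the scraper.

```python
BASE_URL = "https://coldstorage.com.sg/"

# Hypothetical attribute values, shaped like the ones described above.
raw_size = "Size: 350G"
raw_href = "/product/sea-salt-coarse-crystals-016542"  # assumed 9-character prefix

# Fixed-offset slicing, as used in the scraper.
size = raw_size[6:]
url = BASE_URL + raw_href[9:]

def strip_prefix(text, prefix):
    """More defensive variant: only strip the prefix when it is really there."""
    return text[len(prefix):] if text.startswith(prefix) else text

safe_size = strip_prefix(raw_size, "Size: ")
unchanged = strip_prefix("350G", "Size: ")  # no prefix, so nothing is cut off

print(size)
print(url)
```

The slice silently cuts off real data if the site ever changes the prefix, which is the case the prefix check guards against.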
We create an array `row = []` and append all the attributes to this array. Tags are added as a single element (string) to the array, with the following structure: `"<tag1>,<tag2>,<tag3>"`. We then use `writerow(row)` to add the row to the csv file.
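
The tag string and the `writerow` call can be tried in isolation. The sketch below is an assumption-laden illustration: `build_tags` is a hypothetical helper that mirrors the tag logic (category path first, then the words of the name), the sample name and tree are only example values, and an in-memory buffer stands in for the csv file.

```python
import csv
import io

def build_tags(name, tree):
    """Return de-duplicated tags: category path words first, then name words."""
    tags = []
    for part in tree.rstrip().split("/"):
        for word in part.split("-"):
            if word not in tags:
                tags.append(word)
    for word in name.rstrip().split(" "):
        if word not in tags:
            tags.append(word)
    return ",".join(tags)  # "<tag1>,<tag2>,<tag3>"

name = "Sea Salt Coarse Crystals 350g"
tree = "food-pantry/condiments-dressings/herbs-spices-seasonings"
tags = build_tags(name, tree)

buffer = io.StringIO()  # stands in for the csv file
writer = csv.writer(buffer)
writer.writerow([name, tree, tags])
line = buffer.getvalue()

# Reading the line back shows that the tags are still a single field.
fields = next(csv.reader(io.StringIO(line)))
print(line)
```

Because the tag string contains commas, `csv.writer` wraps it in double quotes, so it stays one column when the file is read back with `csv.reader`.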

In [None]:
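file = open("coldstorage5.csv", "w", newline="") # newline="" is what the csv module recommends for writers
filez = csv.writer(file)
# header row, without a trailing comma so that it has the same 8 columns as the data rows
filez.writerow(["name", "url", "tree", "brand", "size", "price", "promotion", "tags"])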

with open("children.csv", "r") as f:
    cutez = csv.reader(f)
    for j in cutez:
        ct = read(BASE_URL + j[0])
        pages = ct.find_all('li', {"class": "page"})
        if len(pages) > 0: # check if there are multiple pages
            # the if and else branches do the same work; the if branch just repeats it for every page in "for i in pages"
            for i in pages:
                ctt = read(BASE_URL + i.find('a')['href'][1:])
                pdt = ctt.find_all('li', {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-2 open-product-detail algolia-click"})
                for x in pdt:
                    a = x.find('a', {"class": "search product-quick-view"})
                    b = x.find('a', {"class": "product-link"})
                    c = x.find_all('div', {"class": "content_price"})[0]
                    d = c.find_all('div')
                    e = x.find_all('div', {"class":"promo-wrapper"})[0]
                    f = e.find_all('span')
                    row = [] # array of attributes to add as row to csv
                    row.append(b.text) # name
                    row.append(BASE_URL + b['href'][9:]) # url
                    row.append(j[0]) # tree (see children.csv)
                    row.append(a['data-product-brand']) # brand
                    row.append(a['data-product-size'][6:]) # size
                    if len(d) > 0: # check if item has price label
                        row.append(d[0].text)
                    else:
                        row.append("")
                    if len(f) > 0: # check if item has promotion
                        row.append(f[0].text)
                    else:
                        row.append("")
                        
                    arr = [] # list of tags, excluding duplicates (currently taken from "name" and "tree")
                    aa = b.text.rstrip().split(" ")
                    bb = j[0].rstrip().split("/")
                    for cc in bb:
                        dd = cc.split('-')
                        for i in dd:
                            if i not in arr:
                                arr.append(i)
                    for i in aa:
                        if i not in arr:
                            arr.append(i)
                    foo = "" # the string containing the tags
                    for i in arr:
                        foo += i
                        foo += ","
                    foo = foo[:len(foo)-1]
                    print(foo)
                    row.append(foo)
                    print(row)
                    filez.writerow(row)
        else:
            pdt = ct.find_all('li', {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-2 open-product-detail algolia-click"})
            for x in pdt:
                a = x.find('a', {"class": "search product-quick-view"})
                b = x.find('a', {"class": "product-link"})
                c = x.find_all('div', {"class": "content_price"})[0]
                d = c.find_all('div')
                e = x.find_all('div', {"class":"promo-wrapper"})[0]
                f = e.find_all('span')
                row = []
                row.append(b.text)
                row.append(BASE_URL + b['href'][9:])
                row.append(j[0])
                row.append(a['data-product-brand'])
                row.append(a['data-product-size'][6:])
                if len(d) > 0:
                    row.append(d[0].text)
                else:
                    row.append("")
                if len(f) > 0:
                    row.append(f[0].text)
                else:
                    row.append("")
                arr = []
                aa = b.text.rstrip().split(" ")
                bb = j[0].rstrip().split("/")
                for cc in bb:
                    dd = cc.split('-')
                    for i in dd:
                        if i not in arr:
                            arr.append(i)
                for i in aa:
                    if i not in arr:
                        arr.append(i)
                foo = ""
                for i in arr:
                    foo += i
                    foo += ","
                foo = foo[:len(foo)-1]
                print(foo)
                row.append(foo)
                print(row)
                filez.writerow(row)
file.close()

# Post-Processing

### Creating list of items with tags

For use in one of the app's .js files (`roc.js` in this folder). The cell below reads `old-data-files/coldstorage2.csv` and writes its output to `plswork2.txt`.
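
The cell below assembles each entry by string concatenation, which breaks if an item name contains a double quote. As a hedged alternative, the standard `json` module can do the quoting. In this sketch `make_entry` is a hypothetical helper, the field names are copied from the cell below, and the values are fixed samples rather than random numbers.

```python
import json

def make_entry(iuid, istock, item_name, shelf_location,
               friendly_location, shelf_row, shelf_column, tags):
    """Build one item record as a dict using the same keys as the notebook."""
    return {
        "iuid": iuid,
        "istock": istock,
        "itemName": item_name,
        "shelfLocation": shelf_location,
        "friendlyLocation": friendly_location,
        "shelfRow": shelf_row,
        "shelfColumn": shelf_column,
        "tags": tags,
    }

entry = make_entry(
    iuid=30,
    istock=42,
    item_name='Sea Salt "Coarse" Crystals 350g',  # quote characters on purpose
    shelf_location="shelf1",
    friendly_location="food-pantry/condiments-dressings/herbs-spices-seasonings",
    shelf_row=2,
    shelf_column=7,
    tags=["food", "pantry", "Sea", "Salt"],
)

text = json.dumps(entry)      # quotes inside the name are escaped automatically
restored = json.loads(text)   # round trip back to a dict
print(text + ",")
```

The printed line has the same `{...},` shape as the entries built below, so it could be pasted into a JavaScript array in the same way.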

In [None]:
# Don't need to understand the stuff here - it's just putting the tags in a js file for Siyong

cnt = 0

yay = ""

with open("old-data-files/coldstorage2.csv", "r") as f:
    file = csv.reader(f)
    for line in file:
        if (cnt == 0):
            cnt += 1
            continue

        itemName = line[0].rstrip()
        iuid = cnt + 29
        istock = random.randint(0,100)
        shelfLocation = "shelf" + str(random.randint(1, 4))
        shelfRow = random.randint(1, 4)
        shelfColumn = random.randint(1, 10)
        friendlyLocation = line[2].rstrip()
        cnt += 1

        arr = []
        a = line[0].rstrip().split(" ")
        b = line[2].rstrip().split('/')
        for c in b:
            d = c.split('-')
            for i in d:
                if i not in arr:
                    arr.append(i)
        for i in a:
            if i not in arr:
                arr.append(i)
        tags = "["
        for i in arr:
            tags += ("\"" + i + "\"")
            tags += ","
        tags = tags[:len(tags)-1]

        tags += "]"

        sleep = "{"
        sleep += ("\"iuid\":" + str(iuid) + ",")
        sleep += ("\"istock\":" + str(istock) + ",")
        sleep += ("\"itemName\":" + "\"" + itemName + "\"" + ",")
        sleep += ("\"shelfLocation\":" + "\"" + shelfLocation + "\"" + ",")
        sleep += ("\"friendlyLocation\":" + "\"" + friendlyLocation + "\"" + ",")
        sleep += ("\"shelfRow\":" + str(shelfRow) + ",")
        sleep += ("\"shelfColumn\":" + str(shelfColumn) + ",")
        sleep += ("\"tags\":" + tags)
        sleep += "},\n"
        yay += sleep

with open("plswork2.txt", "w") as f:
    f.write(yay)

### Query Types

In [None]:
def query(query_id, item_name):
    
    # query_id is the numerical id of the query type
    # item_name is a string containing the name of the item
    # The function returns a string containing the query
    
    if (query_id == 1):
        return "Where can I find " + item_name + "?"
    if (query_id == 2):
        return "Where can I get " + item_name + "?"
    if (query_id == 3):
        return "Can I find " + item_name + "?"
    if (query_id == 4):
        return "Can I get " + item_name + "?"
    if (query_id == 5):
        return "Is there " + item_name + "?"
    if (query_id == 6):
        return "Is " + item_name + " in stock?"
    if (query_id == 7):
        return "Is " + item_name + " available?"
    if (query_id == 8):
        return "Where is the " + item_name + "?"
    if (query_id == 9):
        return "Do you know where the " + item_name + " is?"
    if (query_id == 10):
        return "Can you check whether there is " + item_name + "?"
    if (query_id == 11):
        return "Check whether there is " + item_name + "."
    if (query_id == 12):
        return "I'm looking for " + item_name + "."
    if (query_id == 13):
        return "What " + item_name + " are there?"
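    if (query_id == 14):
        return "Where is the " + item_name + " section?"
    # fail loudly instead of silently returning None for an unknown id
    raise ValueError("Unknown query_id: " + str(query_id))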

def test():
    a = int(input("Enter query type: "))
    b = input("Enter item name: ") # input() already returns a string
    print("Query: " + query(a, b))
    
test()


'''
fyi i found a list of query types that we can use

1. Where can I find xxx?
2. Where can I get xxx?
3. Can I find xxx?
4. Can I get xxx?
5. Is there xxx here?
6. Is xxx in stock?
7. Is xxx available? 
8. Where is the xxx?
9. Do you know where the xxx is?
10. Can you check whether there is xxx?
11. Check whether there is xxx.
12. I’m looking for xxx
13. (broad cat) What xxx (vegetables) are there?
14. (broad cat) Where is the xxx section?
'''
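
As a possible next step for the unfinished query code, the chain of `if` statements could be replaced by a lookup table of templates. This is only a sketch: `QUERY_TEMPLATES` and `build_query` are hypothetical names, and only a few of the 14 query types are filled in.

```python
# Templates keyed by query type; "{item}" marks where the item name goes.
# Only a subset of the query types is listed here.
QUERY_TEMPLATES = {
    1: "Where can I find {item}?",
    2: "Where can I get {item}?",
    6: "Is {item} in stock?",
    12: "I'm looking for {item}.",
    14: "Where is the {item} section?",
}

def build_query(query_id, item_name):
    """Return the query text for a query type, or raise ValueError if unknown."""
    try:
        template = QUERY_TEMPLATES[query_id]
    except KeyError:
        raise ValueError(f"unknown query type: {query_id}") from None
    return template.format(item=item_name)

example = build_query(1, "apples")
print(example)

# Generating every listed query type for one item is then a single loop.
all_queries = [build_query(qid, "apples") for qid in sorted(QUERY_TEMPLATES)]
```

Adding a new query type then means adding one line to the table rather than another `if` branch.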

### Creating queries