Extract the list of all possible locations in bulk from www.niederoesterreich-card.at/alle-ausflugsziele/0 through www.niederoesterreich-card.at/alle-ausflugsziele/27 using http://import.io (https://import.io/data/mine/?id=57367c6e-1454-43b8-8f08-01780ca79aef).

In [1]:
import csv
import hashlib

import couchdb
import mpcouch
import requests
from lxml import html

couchdbUrl = "http://gi88.geoinfo.tuwien.ac.at:5984"

First, we read the data from the CSV file into a list. We could already fetch the remaining details for each entry in this loop, but with only about 330 entries, doing it in a separate pass causes no measurable loss in speed.

In [2]:
collectedDocs = []
with open('noeHomepageData.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=",", quotechar='"')
    for i, row in enumerate(csvreader):
        if i == 0: continue
        pushDoc = {'title': row[14],
                   'link': row[9],
                   'category': row[13],
                   'plz': row[15],
                   'addr': row[11],
                   'imageUrl': row[2]}
        collectedDocs.append(pushDoc)
        #couchPusher.pushData(pushDoc)
    #couchPusher.finish()
print("Collected entries: {}".format(len(collectedDocs)))

Collected entries: 333
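
The cell above addresses the CSV columns by position (`row[14]`, `row[9]`, ...), which breaks silently if the export changes its column order. As a minimal sketch, the same collection step could look the columns up by header name with `csv.DictReader`. The header names used here (`name`, `url`, `plz`) and the sample row are hypothetical; the real import.io export may label its columns differently.

```python
import csv
import io


def collect_docs(csvfile, field_map):
    """Build one document per CSV row.

    field_map maps the document key to the CSV header name
    (the header names are assumptions, not taken from the real export).
    """
    reader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    return [{key: row[column] for key, column in field_map.items()}
            for row in reader]


# Hypothetical in-memory sample standing in for noeHomepageData.csv.
sample = io.StringIO('name,url,plz\n"Example Place",http://example.invalid/1,0000\n')
docs = collect_docs(sample, {"title": "name", "link": "url", "plz": "plz"})
print("Collected entries: {}".format(len(docs)))
```

`DictReader` consumes the header row itself, so the `if i == 0: continue` check is no longer needed.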


Now we collect the missing information by requesting the web page of each entry and parsing it. The result is then uploaded to the CouchDB database.
One important detail: we calculate a hash value over the important fields of every entry, so that we can detect any real change compared with the data already in the database. If this check is performed on the mobile endpoint, every user automatically acts as an updating node. In practice, all we would then have to do manually is check whether entries have been added or removed.

The **contentHash** contains a hash value for the **description**, the **price** and the **opening hours**.
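
To illustrate the change check described above, here is a minimal sketch. The helper names `content_hash` and `has_changed` and the sample values are assumptions made for illustration; only the SHA-1 over the concatenated description, price and opening hours follows this notebook.

```python
import hashlib


def content_hash(description, price, opening_hours):
    """SHA-1 over description + price + opening hours; None counts as empty text."""
    text = "".join(part or "" for part in (description, price, opening_hours))
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def has_changed(stored_doc, description, price, opening_hours):
    """Compare freshly scraped fields against the hash stored in the database."""
    return stored_doc.get("contentHash") != content_hash(description, price, opening_hours)


# Hypothetical stored document and two fresh scrapes of the same entry.
stored = {"contentHash": content_hash("A museum.", "€ 5,00", "daily")}
unchanged = has_changed(stored, "A museum.", "€ 5,00", "daily")
changed = has_changed(stored, "A museum.", "€ 6,00", "daily")
print(unchanged, changed)
```

Because the fields are simply concatenated, text that moves from the end of one field to the start of the next yields the same hash. For this use case that seems acceptable, but a separator between the fields would rule it out.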

In [3]:
couchPusher = mpcouch.mpcouchPusher(couchdbUrl+"/noecard", 10000)
for i, entry in enumerate(collectedDocs):
    print("processing entry {} of {}: {}".format(i, len(collectedDocs), entry['title']))
    page = requests.get(entry['link'])
    pageTree = html.fromstring(page.content)
    errorText = pageTree.xpath('//*[@id="body"]/div[2]/div/section/article/p[1]/strong')
    if len(errorText) < 1:
        pageDescription = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/article/div/p[1]')[0].text
        pagePrice = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/aside/article[2]/p/span[2]')[0].text
        pageOpen = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/aside/article[4]/p[2]')[0].text
        pageLocation = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/aside/article[3]/p[2]/span[3]')[0].text
        pageLocationName = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/aside/article[3]/p[1]/strong')[0]

        try:
            pageMapString = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/aside/article[5]/figure/img/@src')[0]
            pageMap = pageMapString.replace('7C','').replace('2C','').split('%')
            pageMap = [pageMap[-1], pageMap[-2]]
        except IndexError:
            pageMap = []
        
        collectedDocs[i]['_id'] = collectedDocs[i]['title']
        collectedDocs[i]['description'] = pageDescription
        collectedDocs[i]['open'] = pageOpen
        collectedDocs[i]['coordinates'] = pageMap
        collectedDocs[i]['entryId'] = i

        # Treat missing text as empty so that neither the price conversion
        # nor the hash input can fail on None.
        if pageOpen is None: pageOpen = ""
        if pageDescription is None: pageDescription = ""
        if pagePrice is None: pagePrice = ""

        # The price text is expected as '€ <amount>' with a decimal comma;
        # store None when no price was found.
        priceText = pagePrice.replace('€ ','').replace(',','.').strip()
        collectedDocs[i]['price'] = float(priceText) if priceText else None

        hashString = (pageDescription+pagePrice+pageOpen).encode('utf-8')
        collectedDocs[i]['contentHash'] = hashlib.sha1(hashString).hexdigest()
        
        couchPusher.pushData(collectedDocs[i])
    else:
        # the page does not exist anymore, ignore
        pass
couchPusher.finish()

processing entry 0 of 333: Stift Altenburg
processing entry 1 of 333: BÄRENWALD
processing entry 2 of 333: Hammerschmiede Kamp
processing entry 3 of 333: Mohndorf Armschlag
processing entry 4 of 333: Schnaps-Glas-Museum Echsenbach
processing entry 5 of 333: Krahuletz-Museum
processing entry 6 of 333: Nostalgiewelt Eggenburg
processing entry 7 of 333: Perlmuttdrechslerei
processing entry 8 of 333: NÖ Falknerei- und Greifvogelzentrum
processing entry 9 of 333: Wirtex – Älteste Frottierweberei
processing entry 10 of 333: Naturpark Geras
processing entry 11 of 333: Stift Geras
processing entry 12 of 333: Naturpark Blockheide
processing entry 13 of 333: Sole-Felsen-Bad
processing entry 14 of 333: Waldviertelbahn
processing entry 15 of 333: Schloss Grafenegg
processing entry 16 of 333: SONNENWELT
processing entry 17 of 333: Nationalpark Thayatal
processing entry 18 of 333: Käsemacherwelt
processing entry 19 of 333: Naturpark Heidenreichsteiner Moor
processing entry 20 of 333: Museen der Stad

328
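
The scraping cell extracts the coordinates from the map image URL by deleting the `7C` and `2C` fragments and splitting on `%`. A slightly more explicit variant decodes the URL first. This is only a sketch: it assumes that the URL ends with a percent-encoded `|` followed by two comma-separated values, which is what the string handling in the cell suggests. The function name and the example URL are hypothetical.

```python
from urllib.parse import unquote


def parse_map_coordinates(map_url):
    """Return [second, first] of the trailing 'first,second' pair, or [] if absent.

    The reversed order and the string values mirror what the original cell stores.
    """
    decoded = unquote(map_url)       # '%7C' becomes '|', '%2C' becomes ','
    pieces = decoded.rsplit("|", 1)  # keep only what follows the last '|'
    if len(pieces) != 2:
        return []
    pair = pieces[1].split(",")
    if len(pair) != 2:
        return []
    return [pair[1], pair[0]]


# Hypothetical URL in the assumed format.
example = "http://example.invalid/map?markers=color:red%7C48.5%2C15.6"
coords = parse_map_coordinates(example)
print(coords)
```

Unlike the chained `replace` calls, this does not touch a `7C` or `2C` that happens to appear elsewhere in the URL.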

Now we want to attach the image of each entry. We have to do this in a second pass because, with the Python CouchDB interface, a document must already exist before an attachment can be added to it.

In [7]:
couchdbserver = couchdb.Server(couchdbUrl)
couchdbdb = couchdbserver['noecard']

for i, entry in enumerate(collectedDocs):
    print("Retrieving {}: {}".format(i, entry['title']))
    currentDoc = couchdbdb.get(entry['title'])
    if currentDoc is not None and currentDoc['imageUrl'] != '':
        response = requests.get(currentDoc['imageUrl'])
        storeName = currentDoc['title']+".jpg"
        couchdbdb.put_attachment(currentDoc, response.content, filename=storeName, content_type='image/jpeg')

Retrieving 0: Stift Altenburg


TypeError: unsupported operand type(s) for &: 'NoneType' and 'str'
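
The attachment cell stores every image as `<title>.jpg` with the content type `image/jpeg`, regardless of what the image URL points to. As a small sketch, the file name and content type could instead be derived from the URL with the standard `mimetypes` module. The helper name `attachment_info` and the example URLs are hypothetical, and the JPEG fallback is an assumption carried over from the cell.

```python
import mimetypes
import posixpath
from urllib.parse import urlparse


def attachment_info(title, image_url):
    """Return (filename, content_type) for an image URL, falling back to JPEG."""
    path = urlparse(image_url).path  # drop any query string or fragment
    extension = posixpath.splitext(path)[1].lower()
    content_type = mimetypes.guess_type(path)[0]
    if not extension or content_type is None:
        # Unknown or missing extension: keep the notebook's JPEG default.
        return title + ".jpg", "image/jpeg"
    return title + extension, content_type


png_info = attachment_info("Example Place", "http://example.invalid/img/a.png?w=200")
fallback_info = attachment_info("Example Place", "http://example.invalid/img/a")
print(png_info, fallback_info)
```

The two return values could then be passed as `filename` and `content_type` to `put_attachment`.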