Extract the full list of destinations in bulk from www.niederoesterreich-card.at/alle-ausflugsziele/0 through www.niederoesterreich-card.at/alle-ausflugsziele/27 using http://import.io (https://import.io/data/mine/?id=57367c6e-1454-43b8-8f08-01780ca79aef).

In [1]:
import csv
import hashlib

import couchdb
import mpcouch
import pyprind
import requests
from lxml import html

couchdbUrl = "http://gi88.geoinfo.tuwien.ac.at:5984"

First, we collect the data from the CSV file in a list. We could already upload the documents to CouchDB in this step, but since there are only about 334 entries, deferring the upload to the next step causes no measurable loss in speed.

In [None]:
collectedDocs = []
with open('noeHomepageData.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=",", quotechar='"')
    for i, row in enumerate(csvreader):
        if i == 0: continue
        pushDoc = {'title': row[14],
                   'link': row[9],
                   'category': row[13],
                   'plz': row[15],
                   'addr': row[11],
                   'imageUrl': row[2]}
        collectedDocs.append(pushDoc)
        #couchPusher.pushData(pushDoc)
    #couchPusher.finish()
print("Collected entries: {}".format(len(collectedDocs)))

Now we collect the missing information by requesting the web page of each entry and parsing it. The result is then uploaded to the CouchDB database.
One important detail: we calculate a hash value over the important fields of every entry, so we can detect any real change compared with the data already in the database. If this check runs on the mobile client, every user automatically acts as an updating node. In practice, the only manual task left would then be to check whether entries have been added or removed.

The **contentHash** contains a hash value for the **description**, the **price** and the **opening hours**.
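
The change check described above can be sketched as follows. This is only an illustration under assumptions: the function names `content_hash` and `has_changed` are hypothetical and not part of the notebook, and the price is hashed as the raw text scraped from the page, as in the cell below.

```python
import hashlib


def content_hash(description, price, opening_hours):
    """Return the SHA-1 hex digest over description, price and opening hours.

    Missing values (None) are treated as empty strings.
    """
    parts = [description, price, opening_hours]
    joined = "".join(part or "" for part in parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def has_changed(stored_doc, description, price, opening_hours):
    """Compare a freshly computed hash with the one stored in the document."""
    fresh = content_hash(description, price, opening_hours)
    return stored_doc.get("contentHash") != fresh
```

A client would call `has_changed(doc, description, price, opening_hours)` with freshly scraped values and only upload a new revision when it returns `True`.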

In [None]:
couchPusher = mpcouch.mpcouchPusher(couchdbUrl+"/noecard", 10000)
my_prbar = pyprind.ProgBar(len(collectedDocs))
for i, entry in enumerate(collectedDocs):
    my_prbar.update()
    #print("processing entry {} of {}: {}".format(i, len(collectedDocs), entry['title']))
    page = requests.get(entry['link'])
    pageTree = html.fromstring(page.content)
    errorText = pageTree.xpath('//*[@id="body"]/div[2]/div/section/article/p[1]/strong')
    if len(errorText) < 1:
        pageDescription = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/article/div/p[1]')[0].text
        pagePrice = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/aside/article[2]/p/span[2]')[0].text
        pageOpen = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/aside/article[4]/p[2]')[0].text
        pageLocation = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/aside/article[3]/p[2]/span[3]')[0].text
        pageLocationName = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/aside/article[3]/p[1]/strong')[0]

        try:
            pageMapString = pageTree.xpath('//*[@id="detailArticleNOECard"]/div/section/aside/article[5]/figure/img/@src')[0]
            pageMap = pageMapString.replace('7C','').replace('2C','').split('%')
            pageMap = [pageMap[-1], pageMap[-2]]
        except IndexError:
            pageMap = []
        
        collectedDocs[i]['_id'] = collectedDocs[i]['title']
        collectedDocs[i]['description'] = pageDescription
        collectedDocs[i]['open'] = pageOpen
        collectedDocs[i]['coordinates'] = pageMap
        collectedDocs[i]['checkedSeason'] = 2015

        collectedDocs[i]['entryId'] = i

        # normalize missing values to empty strings before parsing and hashing
        if pageOpen is None: pageOpen = ""
        if pageDescription is None: pageDescription = ""
        if pagePrice is None: pagePrice = ""

        # the price text looks like "€ 12,50"; store None if the page has no price
        priceText = pagePrice.replace('€ ','').replace(',','.')
        collectedDocs[i]['price'] = float(priceText) if priceText else None

        hashString = (pageDescription+pagePrice+pageOpen).encode('utf-8')
        collectedDocs[i]['contentHash'] = hashlib.sha1(hashString).hexdigest()
        
        couchPusher.pushData(collectedDocs[i])
    else:
        # the page does not exist anymore, ignore
        pass
couchPusher.finish()
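
The cell above extracts the coordinates from the map image URL by stripping the `7C` and `2C` fragments by hand. A slightly more explicit variant is sketched here. It assumes that the URL ends with a percent-encoded `|`, followed by two numbers separated by a percent-encoded comma; the function name `parse_map_coordinates` and the sample URL used to try it are made up for illustration. Unlike the cell above, it returns floats rather than strings, but it keeps the same swapped order (last number first).

```python
from urllib.parse import unquote


def parse_map_coordinates(map_url):
    """Return the last two numbers of the URL in swapped order, or []."""
    decoded = unquote(map_url)                 # "%7C" -> "|", "%2C" -> ","
    last_segment = decoded.rsplit("|", 1)[-1]  # text after the last "|"
    parts = last_segment.split(",")
    if len(parts) != 2:
        return []
    try:
        first, second = (float(part) for part in parts)
    except ValueError:
        # the tail of the URL did not consist of two numbers
        return []
    return [second, first]
```

Returning an empty list on any unexpected input mirrors the `except IndexError` fallback in the cell above.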

Now we want to include the image of each entry. We have to do this in a second pass, because with the CouchDB interface in Python the document must already exist before an attachment can be added to it.

In [None]:
couchdbserver = couchdb.Server(couchdbUrl)
couchdbdb = couchdbserver['noecard']

my_prbar = pyprind.ProgBar(len(collectedDocs))
for i, entry in enumerate(collectedDocs):
    my_prbar.update()
    #print("Retrieving {}: {}".format(i, entry['title']))
    currentDoc = couchdbdb.get(entry['title'])
    if currentDoc is not None and currentDoc['imageUrl'] != '':
        response = requests.get(currentDoc['imageUrl'])
        storeName = "preview.jpg"
        couchdbdb.put_attachment(currentDoc, response.content, filename=storeName, content_type='image/jpeg')
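
The cell above stores every preview with the content type `image/jpeg`, whatever the server actually delivered. As a hedged sketch, the type could instead be taken from the HTTP response headers, falling back to JPEG when the header is missing or does not describe an image. The helper name `attachment_content_type` is hypothetical.

```python
def attachment_content_type(headers, default="image/jpeg"):
    """Pick the MIME type from HTTP response headers, dropping parameters."""
    value = headers.get("Content-Type", "")
    # "image/png; charset=binary" -> "image/png"
    mime = value.split(";", 1)[0].strip().lower()
    return mime if mime.startswith("image/") else default
```

In the loop above, this would be passed as `content_type=attachment_content_type(response.headers)` to `put_attachment`.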

# Step 2: Data Correction

The imported opening hours are just one long string per entry. We split this string into its parts and parse useful information out of it.

In [6]:
couchdbserver = couchdb.Server(couchdbUrl)
couchdbdb = couchdbserver['noecard']

# each entry represents a day in the year
Jänner = [False for x in range(31)]
Februar = [False for x in range(29)] # 29 entries to cover leap years; in other years we will ignore the last entry later on
März = [False for x in range(31)]
April = [False for x in range(30)]
Mai = [False for x in range(31)]
Juni = [False for x in range(30)]
Juli = [False for x in range(31)]
August = [False for x in range(31)]
September = [False for x in range(30)]
Oktober = [False for x in range(31)]
November = [False for x in range(30)]
Dezember = [False for x in range(31)]
 
days = "Montag,Dienstag,Mittwoch,Donnerstag,Freitag,Samstag,Sonntag,Mo,Di,Mi,Do,Fr,Sa,So".split(',')
months = "Jänner,Januar,Februar,Feber,März,April,Mai,Juni,Juli,August,September,Oktober,November,Dezember".split(",")


dblen = (len(couchdbdb))
my_prbar = pyprind.ProgBar(dblen)
for i, entry in enumerate(couchdbdb):
    my_prbar.update()
    # exclude design documents starting with '_'
    if not entry.startswith('_'):
        currentDoc = couchdbdb[entry]
        newDoc = currentDoc
        if currentDoc['open'] is not None:

            parsedArray = []
            # parts are separated by ';', date ranges use the word "bis"
            for segment in currentDoc['open'].split(";"):
                splitEntry = segment.split("bis")
                print(splitEntry)
                parsedArray.append(segment)
            newDoc['openParsed'] = parsedArray
            
        else:
            newDoc['openParsed'] = []
        #couchdbdb.save(newDoc)
    if i == 20: break
        

0%                          100%
[                              ]

['ganzjährig, Mi–Do 11–21 h, Fr–So 11–18 h']
['Jänner ', ' November, tgl. 10–18 h, letzter Einlass 17 h. Wegen Umbauarbeiten ', ' 15. Februar geschlossen.']
['1. Mai ', ' 15. Oktober, Mo–Fr 9–12 und 14–18 h, Sa 9–12 h, So und Ftg geschlossen']
['1. Mai ', ' 4. Juli, Sa, So']
['']
['Sommersaison (Freibad und Hallenbad): Mitte Mai ', ' Anfang September, tgl. 9–21 h']
[' Wintersaison (nur Hallenbad): Mitte September ', ' Ende April, Di–Fr 13–21 h, Sa, So und Ftg 9–21 h']
[' während der Schulferien Mo–So 9–21 h']
['ganzjährig, tgl. 10–17 h']
[' Mai ', ' September 10–18 h']
[' Jänner u. Februar nur Sa, So u. Ftg. Besichtigung nur mit Führung möglich. Öffnungszeiten während der Weihnachtsfeiertage: ', ' 23.12. normale Öffnungszeiten, tgl. von 10–17 h']
[' 24.12. 10–12 h']
[' 25.12. geschlossen']
[' 26.12.–30.12. 10–17 h']
[' 31.12. 10–12 h']
[' 1.1. geschlossen']
[' 2.1.–6.1. 10–17 h']
[' 7. und 8.1. geschlossen']
[' von 9.1.–28.2. nur samstags und sonntags von 10–17 h geöffnet']
[' ab 29.2.

[#                             ] | ETA: 00:00:11


['21. März ', ' 15. Nov., täglich 9–17 h']
['Mai ', ' September, Sa, So']
['ganzjährig, 10–17 h']
[' Führungen: Sa, So u. Ftg. um 15 h und nach Vereinbarung']
['Mai ', ' Anfang September, tgl. 9–18 h (witterungsabhängig)']
['1. April ', ' 31. Oktober, Fr–So und Ftg 10–12 und 13–17 h']
['4. April ', ' 26. Oktober']
['Mai, Juni, Sept. und Okt.,']
['18. ', ' 30. April u. 5. ', ' 26. Okt. 2015 Einmalige „Große Wachaurundfahrt“ ab Krems 10.10 h ', ' Krems 15.30 h']
['Juli und August']
[' Renntage laut Rennplan, siehe Internet oder telefonische Auskunft']
['10. April ', ' 26. Oktober, Mi–So u Ftg. 13–19 h']
['1. Mai ', ' 26. Oktober, Sa, So und Ftg Abfahrten ab Retz 9.30, 13.20 und 16.20 h, ab Drosendorf 11.50, 14.50 und 17.50 h']
['ganzjährig, Di–So u. Ftg. 10–18 h']
[' 24. u. 31. Dez. 10–14 h']
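
As a possible next step, the strings printed above could be turned into structured rules. The following is a minimal sketch, not the notebook's method: it assumes that rules are separated by commas, that weekday ranges use the two-letter abbreviations with a dash (`Mi–Do`), and that time ranges end in `h` or are joined by `und`. All names (`parse_times`, `expand_days`, `parse_opening_rules`) are hypothetical. Single days, day lists such as `Sa, So und Ftg`, date ranges and free text are not handled.

```python
import re

WEEKDAYS = ["Mo", "Di", "Mi", "Do", "Fr", "Sa", "So"]
_DAY = "|".join(WEEKDAYS)
# weekday range such as "Mi–Do" (en dash or hyphen)
DAY_RANGE = re.compile(r"\b(" + _DAY + r")\s*[–-]\s*(" + _DAY + r")\b")
# time range such as "11–21 h" or "9.30–12 und 14–18 h"
TIME_RANGE = re.compile(
    r"(\d{1,2})(?:[.:](\d{2}))?\s*[–-]\s*(\d{1,2})(?:[.:](\d{2}))?"
    r"(?=\s*(?:h\b|und\s+\d))"
)


def _fmt(hour, minute):
    """Format hour and optional minute strings as "HH:MM"."""
    return "{:02d}:{:02d}".format(int(hour), int(minute or 0))


def parse_times(text):
    """Return all (start, end) time ranges found in the text."""
    return [(_fmt(m.group(1), m.group(2)), _fmt(m.group(3), m.group(4)))
            for m in TIME_RANGE.finditer(text)]


def expand_days(start, end):
    """Expand a weekday range, wrapping around the week if needed."""
    i, j = WEEKDAYS.index(start), WEEKDAYS.index(end)
    return WEEKDAYS[i:j + 1] if i <= j else WEEKDAYS[i:] + WEEKDAYS[:j + 1]


def parse_opening_rules(segment):
    """Return one rule per comma-separated part that contains a time range."""
    rules = []
    for part in segment.split(","):
        hours = parse_times(part)
        if not hours:
            continue  # no time range, e.g. "ganzjährig" or a date range
        day_match = DAY_RANGE.search(part)
        days = expand_days(*day_match.groups()) if day_match else None
        rules.append({"days": days, "hours": hours})
    return rules
```

For the first string in the output, `parse_opening_rules('ganzjährig, Mi–Do 11–21 h, Fr–So 11–18 h')` yields two rules, one for `Mi` and `Do` and one for `Fr` through `So`. A rule whose `days` value is `None` carries hours without a recognised weekday range.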
