# Aquire and store historical weather data for Kirkwall Airport from WeatherHQ

Vackervader (or its English language counterpart, WeatherHQ) provides a more comprehensive source of weather records than is available from Met Office DataPoint.
- https://www.weatherhq.co.uk/weather-station/kirkwall-airport
- https://www.vackertvader.se/v%C3%A4derstation/kirkwall-airport

Examining the website reveals a webservice, that can be used to retreive the raw data. 
![Vacker Website displaying Data sources](../images/VackerRestApi.png)

The station id for Kirkwall Airport is 3017 - the same as used by the Met Office.

This notebook exlpores and parses the available data, for the required period and writes it to a sqlite database and CSV file for further use.

Some points to note:
* The Vakervader returns datetimes as milliseconds from the epoch (Python uses seconds from the epoch in its datetime type)
* The webservice responded with a 'forbidden' error when queried using urlib.request.   
According to https://stackoverflow.com/questions/16627227/http-error-403-in-python-3-web-scraping  :  
>'This is probably because of ... some ...server security feature which blocks known spider/bot user agents' 

Consequently, a differnet request was used with which the user agent could be set.
* The webservice did not return data records matching the request.  A method that obtained the complete set in small batches was implemented. 

An example webservice call:

http://archive.vackertvader.se/archive/epoch_observations?station_id=3017&start=1517734899000&end=1549275655000

the data is returned in json format containing the weather records as a list of records, which in turn is a list of numbers and values for the 'from' and 'to' for the data set

each record consists of a list of 12 numbers which are interpreted as follows:

| posn | value  | property  | units |
|:---:|----:|:---|:---:|
|0 |	1549273800000 | timestamp| milliseconds|
|1 |	4 | temperature | degrees celcius|
|2 |	null | |
|3 |	1002 | pressure |mbar|
|4 | 3.6| wind speed | m/s|
|5.|	null| |
|6.	|93 | humidity |%|
|7.|	280 | wind direction| degrees|
|8.	|null| | 
|9.	|10000 | visibility |m|
|10.|	37.5 |cloud cover |%|
|11.	|1036 | cloud height |m|

In [2]:
from datetime import datetime, date, time
t = datetime(2005, 7, 14, 12, 30)
t.isoformat()

'2005-07-14T12:30:00'

In [3]:
from datetime import datetime
import calendar

d = datetime.utcnow()
unixtime = calendar.timegm(d.utctimetuple())
print (unixtime, d)

(1581815111, datetime.datetime(2020, 2, 16, 1, 5, 11, 456668))


In [4]:
def unixtime(d):
    return calendar.timegm(d.utctimetuple())

startdate = datetime(2019,1, 16, 12, 0)
enddate = datetime.utcnow()
print(unixtime(startdate))
print(unixtime(enddate))

1547640000
1581815113


In [4]:
print(str(1000*unixtime(startdate)))

1547640000000


In [5]:
print('http://archive.vackertvader.se/archive/epoch_observations?station_id=3017&start='+str(1000*unixtime(startdate))+'&end='+str(1000*unixtime(enddate)))

http://archive.vackertvader.se/archive/epoch_observations?station_id=3017&start=1547640000000&end=1573397301000


In [6]:
#!/usr/bin/env python3
# this returns http error 403 forbidden 
# according to https://stackoverflow.com/questions/16627227/http-error-403-in-python-3-web-scraping
# This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected). Try setting a known browser user agent with:

import urllib
import json
from datetime import datetime, date, time

startdate =datetime(2019, 1, 14, 23, 59, 59)
enddate =datetime.now()
URL = 'http://archive.vackertvader.se/archive/epoch_observations?station_id=3017&start='+str(1000*unixtime(startdate))+'&end='+str(1000*unixtime(enddate))

print(URL)

url = urllib.request.urlopen(URL)
page = url.read()
print(page)       


http://archive.vackertvader.se/archive/epoch_observations?station_id=3017&start=1547510399000&end=1581815581000


AttributeError: 'module' object has no attribute 'request'

In [7]:
from urllib.request import Request, urlopen

req = Request('http://archive.vackertvader.se/archive/epoch_observations?station_id=3017&start='+str(1000*unixtime(startdate))+'&end='+str(1000*unixtime(enddate)), headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
data = json.loads(webpage)

display first and last records

In [9]:
print(data['data'][0])
print(data['data'][-1])

[1561410000000, 11.3, None, 1020.1, 8.2, None, 92, 30, None, 16000, 100.0, None]
[1573394400000, 5.0, None, 1011.9, 2.1, None, 71, 90, None, 35000, 87.5, None]


data returned is not the range requested.  
possibly limited to 3000 records

In [1]:
print ('Data requested')
print ('from: ', 1000*unixtime(startdate), 'to: ', 1000*unixtime(enddate))
print ('from: ', startdate, 'to: ',enddate)
dt = enddate-startdate
print ('range: ', dt.days, ' days')
print ()

print ('Data returned')
print ('from returned labels')
print ('from: ', data['from'], 'to: ',data['to'])
print ('from: ', datetime.fromtimestamp(data['from']/1000, tz=None), 'to: ',datetime.fromtimestamp(data['to']/1000, tz=None))
dt = datetime.fromtimestamp(data['to']/1000, tz=None)-datetime.fromtimestamp(data['from']/1000)
print ('range: ', dt.days, ' days')
print ()

print('from returned data records')
returned_start= int(data['data'][0][0])//1000 
returned_end= data['data'][-1][0]//1000 
print ('from: ', data['data'][0][0], 'to: ',data['data'][-1][0])
print ('from: ', datetime.fromtimestamp(returned_start, tz=None), 'to: ',datetime.fromtimestamp(returned_end, tz=None))
dt = datetime.fromtimestamp(returned_end, tz=None)-datetime.fromtimestamp(returned_start)
print ('range: ', dt.days, 'days')
print ('records returned: ',len(data['data']))


Data requested


NameError: name 'unixtime' is not defined

Let's see if we can work out how the data set is returned and calulate which is next call


In [15]:
startdate = datetime(2019,1, 16, 12, 0)
enddate = datetime(2019,2,16,12,0)

req = Request('http://archive.vackertvader.se/archive/epoch_observations?station_id=3017&start='+str(1000*unixtime(startdate))+'&end='+str(1000*unixtime(enddate)), headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
data = json.loads(webpage)

In [16]:
print ('Data requested')
print ('from: ', 1000*unixtime(startdate), 'to: ', 1000*unixtime(enddate))
print ('from: ', startdate, 'to: ',enddate)
dt = enddate-startdate
print ('range: ', dt.days, ' days')
print ()

print ('Data returned')
print ('from returned labels')
print ('from: ', data['from'], 'to: ',data['to'])
print ('from: ', datetime.fromtimestamp(data['from']/1000, tz=None), 'to: ',datetime.fromtimestamp(data['to']/1000, tz=None))
dt = datetime.fromtimestamp(data['to']/1000, tz=None)-datetime.fromtimestamp(data['from']/1000)
print ('range: ', dt.days, ' days')
print ()

print('from returned data records')
returned_start= int(data['data'][0][0])//1000 
returned_end= data['data'][-1][0]//1000 
print ('from: ', data['data'][0][0], 'to: ',data['data'][-1][0])
print ('from: ', datetime.fromtimestamp(returned_start, tz=None), 'to: ',datetime.fromtimestamp(returned_end, tz=None))
dt = datetime.fromtimestamp(returned_end, tz=None)-datetime.fromtimestamp(returned_start)
print ('range: ', dt.days, 'days')
print ('records returned: ',len(data['data']))

Data requested
from:  1547640000000 to:  1550318400000
from:  2019-01-16 12:00:00 to:  2019-02-16 12:00:00
range:  31  days

Data returned
from returned labels
from:  1546776000000 to:  1551182400000
from:  2019-01-06 12:00:00 to:  2019-02-26 12:00:00
range:  51  days

from returned data records
from:  1546777200000 to:  1551180000000
from:  2019-01-06 12:20:00 to:  2019-02-26 11:20:00
range:  50 days
records returned:  2888


In [25]:
startdate = datetime(2019,2, 26, 11, 20)
enddate = datetime(2019,3,26,12,0)

req = Request('http://archive.vackertvader.se/archive/epoch_observations?station_id=3017&start='+str(1000*unixtime(startdate))+'&end='+str(1000*unixtime(enddate)), headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
data = json.loads(webpage)

In [26]:
print ('Data requested')
print ('from: ', 1000*unixtime(startdate), 'to: ', 1000*unixtime(enddate))
print ('from: ', startdate, 'to: ',enddate)
dt = enddate-startdate
print ('range: ', dt.days, ' days')
print ()

print ('Data returned')
print ('from returned labels')
print ('from: ', data['from'], 'to: ',data['to'])
print ('from: ', datetime.fromtimestamp(data['from']/1000, tz=None), 'to: ',datetime.fromtimestamp(data['to']/1000, tz=None))
dt = datetime.fromtimestamp(data['to']/1000, tz=None)-datetime.fromtimestamp(data['from']/1000)
print ('range: ', dt.days, ' days')
print ()

print('from returned data records')
returned_start= int(data['data'][0][0])//1000 
returned_end= data['data'][-1][0]//1000 
print ('from: ', data['data'][0][0], 'to: ',data['data'][-1][0])
print ('from: ', datetime.fromtimestamp(returned_start, tz=None), 'to: ',datetime.fromtimestamp(returned_end, tz=None))
dt = datetime.fromtimestamp(returned_end, tz=None)-datetime.fromtimestamp(returned_start)
print ('range: ', dt.days, 'days')
print ('records returned: ',len(data['data']))


Data requested
from:  1551180000000 to:  1553601600000
from:  2019-02-26 11:20:00 to:  2019-03-26 12:00:00
range:  28  days

Data returned
from returned labels
from:  1550316000000 to:  1554465600000
from:  2019-02-16 11:20:00 to:  2019-04-05 13:00:00
range:  48  days

from returned data records
from:  1550317800000 to:  1554463200000
from:  2019-02-16 11:50:00 to:  2019-04-05 12:20:00
range:  48 days
records returned:  2688


In [27]:
print (data['data'][0:2])

[[1550317800000, 8.0, None, 1010.0, 3.6, None, 93, 150, None, 9000, 75.0, 182], [1550318400000, 8.0, None, 1007.5, 4.6, None, 97, 140, None, 9000, 37.5, None]]


What we can be seen from the two examples above is that the returned data is quite different from the request: 
* the number of records returned seems to be limited to less than 3000, making the requested end date irrelevant, and,
* the first record may be prior to that requested.

So, to obtain the required data set the webservice will have to be polled repeatedly for limited quantities of records which will then have to be collated into a complete set.

The following script, requests the required data in batches of 30 days. Each batch is added to the resulting dataframe, omitting (dropping) duplicate entries.  The subsequent batch is requested from one day prior to last record in the previous batch to ensure there are no gaps in the dataset.  All batches seem to consist of 2688 returned records spanning approximately 49 to 50 days with the returned date being 9 or 10 days prior to the requested date.  It is a bit of a mystery as to what is going on.

In [None]:
# https://stackoverflow.com/questions/21317384/pandas-python-how-to-concatenate-two-dataframes-without-duplicates
# pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)

# https://stackoverflow.com/questions/1720421/how-do-i-concatenate-two-lists-in-python
# mergedlist = list(set(listone + listtwo))

In [57]:
# attempt to perform multiple calls on vaker waether rest api to obtain complete set of data.

from datetime import datetime, timedelta
import calendar
import pandas as pd

def get_weather_records(begin, end):
    req = Request('http://archive.vackertvader.se/archive/epoch_observations?station_id=3017&start='+str(1000*unixtime(begin))+'&end='+str(1000*unixtime(end)), headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    return json.loads(webpage)

def unixtime(d):
    return calendar.timegm(d.utctimetuple())

def print_info(startdate, enddate, datasetaslist):
    print ('Data requested')
    print ('from: ', 1000*unixtime(startdate), 'to: ', 1000*unixtime(enddate))
    print ('from: ', startdate, 'to: ',enddate)
    dt = enddate-startdate
    print ('range: ', dt.days, ' days')
    print ()
    
    print ('Data returned')   
    print('from returned data records')
    returned_start= int(datasetaslist[0][0])//1000 
    returned_end= datasetaslist[-1][0]//1000 
    print ('from: ', datasetaslist[0][0], 'to: ',datasetaslist[-1][0])
    print ('from: ', datetime.fromtimestamp(returned_start, tz=None), 'to: ',datetime.fromtimestamp(returned_end, tz=None))
    dt = datetime.fromtimestamp(returned_end, tz=None)-datetime.fromtimestamp(returned_start)
    print ('range: ', dt.days, 'days')
    print ('records returned: ',len(data['data']))



startdate = datetime(2019,1, 1, 0, 0)
enddate = datetime(2019,11,1,0,0)

range_to_get = timedelta(days=30)
one_day = timedelta(days=1)

batch_start = startdate
batch_end = batch_start+range_to_get
batch_count =1
result_set =pd.DataFrame()

while batch_start < enddate:
    new_batch = get_weather_records(batch_start, batch_end)
    print('batch: ', batch_count)
    print_info(batch_start, batch_end, new_batch['data'])
    # result_set = list(set(result_set+new_batch['data']))
    df = pd.DataFrame.from_records(new_batch['data'])
    result_set = pd.concat([result_set,df]).drop_duplicates().reset_index(drop=True)
    batch_start = datetime.fromtimestamp(new_batch['data'][-1][0]//1000, tz=None)-one_day
    batch_end = batch_start+range_to_get
    batch_count += 1
    


batch:  1
Data requested
from:  1546300800000 to:  1548892800000
from:  2019-01-01 00:00:00 to:  2019-01-31 00:00:00
range:  30  days

Data returned
from returned data records
from:  1545438000000 to:  1549754400000
from:  2018-12-22 00:20:00 to:  2019-02-09 23:20:00
range:  49 days
records returned:  2688
batch:  2
Data requested
from:  1549668000000 to:  1552260000000
from:  2019-02-08 23:20:00 to:  2019-03-10 23:20:00
range:  30  days

Data returned
from returned data records
from:  1548806400000 to:  1553122800000
from:  2019-01-30 00:00:00 to:  2019-03-20 23:00:00
range:  49 days
records returned:  2688
batch:  3
Data requested
from:  1553036400000 to:  1555628400000
from:  2019-03-19 23:00:00 to:  2019-04-18 23:00:00
range:  30  days

Data returned
from returned data records
from:  1552173600000 to:  1556491800000
from:  2019-03-09 23:20:00 to:  2019-04-28 23:50:00
range:  50 days
records returned:  2688
batch:  4
Data requested
from:  1556409000000 to:  1559001000000
from:  2019

In [58]:
result_set.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,1545438000000,4.0,,997.0,2.6,,100.0,230.0,,10000.0,0.0,
1,1545439800000,5.0,,997.0,3.1,,93.0,220.0,,10000.0,0.0,
2,1545441600000,5.0,,997.0,3.6,,93.0,230.0,,10000.0,0.0,
3,1545444000000,5.2,,995.0,3.6,,97.0,210.0,,50000.0,25.0,
4,1545445200000,5.0,,998.0,3.1,,93.0,210.0,,10000.0,75.0,975.0


In [59]:
result_set.describe()

Unnamed: 0,0,1,3,4,5,6,7,9,10,11
count,13121.0,13121.0,13121.0,13081.0,344.0,13118.0,12729.0,12805.0,12673.0,7142.0
mean,1556113000000.0,8.276343,1010.32541,6.066226,19.071802,83.967373,202.19106,16572.217884,51.57224,748.042705
std,6654478000.0,3.641796,13.345692,3.329315,3.912832,10.588983,90.996557,14787.716986,34.997363,384.482275
min,1545438000000.0,-1.4,958.7,0.0,11.3,33.0,10.0,50.0,0.0,0.0
25%,1550557000000.0,6.0,1002.0,3.6,15.9,77.0,140.0,10000.0,12.5,426.0
50%,1555621000000.0,8.0,1011.0,5.7,18.5,86.0,200.0,10000.0,50.0,762.0
75%,1560965000000.0,11.0,1020.1,8.2,21.6,93.0,270.0,17000.0,75.0,1066.0
max,1573336000000.0,22.8,1044.0,22.6,32.4,100.0,360.0,70000.0,100.0,1493.0


In [None]:
result_set.tail()

The columns are labelled and the unused columns are removed from the table.
The timestamp is converted to seconds from milliseconds.
It is then written to an sqlite database and csv files for later use.

In [61]:
result_set.columns=['timestamp', 'Temp','na','Pressure','WindSpeed','Precipitation','Humidity','WindDirection','na2','Visibility','CloudCover','CloudHeight']

In [62]:
result_set.head()

Unnamed: 0,timestamp,Temp,na,Pressure,WindSpeed,Precipitation,Humidity,WindDirection,na2,Visibility,CloudCover,CloudHeight
0,1545438000000,4.0,,997.0,2.6,,100.0,230.0,,10000.0,0.0,
1,1545439800000,5.0,,997.0,3.1,,93.0,220.0,,10000.0,0.0,
2,1545441600000,5.0,,997.0,3.6,,93.0,230.0,,10000.0,0.0,
3,1545444000000,5.2,,995.0,3.6,,97.0,210.0,,50000.0,25.0,
4,1545445200000,5.0,,998.0,3.1,,93.0,210.0,,10000.0,75.0,975.0


In [63]:
result_set = result_set.drop(['na','na2'], 1)

In [64]:
result_set.head()

Unnamed: 0,timestamp,Temp,Pressure,WindSpeed,Precipitation,Humidity,WindDirection,Visibility,CloudCover,CloudHeight
0,1545438000000,4.0,997.0,2.6,,100.0,230.0,10000.0,0.0,
1,1545439800000,5.0,997.0,3.1,,93.0,220.0,10000.0,0.0,
2,1545441600000,5.0,997.0,3.6,,93.0,230.0,10000.0,0.0,
3,1545444000000,5.2,995.0,3.6,,97.0,210.0,50000.0,25.0,
4,1545445200000,5.0,998.0,3.1,,93.0,210.0,10000.0,75.0,975.0


In [65]:
result_set.timestamp = result_set.timestamp//1000

In [73]:
result_set.head()

Unnamed: 0,timestamp,Temp,Pressure,WindSpeed,Precipitation,Humidity,WindDirection,Visibility,CloudCover,CloudHeight
0,1545438000,4.0,997.0,2.6,,100.0,230.0,10000.0,0.0,
1,1545439800,5.0,997.0,3.1,,93.0,220.0,10000.0,0.0,
2,1545441600,5.0,997.0,3.6,,93.0,230.0,10000.0,0.0,
3,1545444000,5.2,995.0,3.6,,97.0,210.0,50000.0,25.0,
4,1545445200,5.0,998.0,3.1,,93.0,210.0,10000.0,75.0,975.0


In [67]:
result_set.describe()

Unnamed: 0,timestamp,Temp,Pressure,WindSpeed,Precipitation,Humidity,WindDirection,Visibility,CloudCover,CloudHeight
count,13121.0,13121.0,13121.0,13081.0,344.0,13118.0,12729.0,12805.0,12673.0,7142.0
mean,1556113000.0,8.276343,1010.32541,6.066226,19.071802,83.967373,202.19106,16572.217884,51.57224,748.042705
std,6654478.0,3.641796,13.345692,3.329315,3.912832,10.588983,90.996557,14787.716986,34.997363,384.482275
min,1545438000.0,-1.4,958.7,0.0,11.3,33.0,10.0,50.0,0.0,0.0
25%,1550557000.0,6.0,1002.0,3.6,15.9,77.0,140.0,10000.0,12.5,426.0
50%,1555621000.0,8.0,1011.0,5.7,18.5,86.0,200.0,10000.0,50.0,762.0
75%,1560965000.0,11.0,1020.1,8.2,21.6,93.0,270.0,17000.0,75.0,1066.0
max,1573336000.0,22.8,1044.0,22.6,32.4,100.0,360.0,70000.0,100.0,1493.0


In [91]:
%load_ext sql

In [93]:
%sql sqlite:///./database/vackerWeather.db

'Connected: @./database/vackerWeather.db'

In [94]:
%sql persist result_set

 * sqlite:///./database/vackerWeather.db


'Persisted result_set'

In [95]:
result_set.to_csv('./database/vackerWeather.csv')