## Introduction

I am currently building a data lake which will be used to improve operations at an energy company using machine learning. Among the many interesting topics, the following are prioritized:
- Can we predict the energy production of hydroelectric, solar and wind power plants? 
- Can we predict the energy consumption using weather reports? After all, home owners need to heat their homes more on a cold winter day than on a sunny day in October.
- Under what conditions does equipment fail? Can we predict the need for maintenance and optimize scheduled downtime?

Hence, getting access to weather data is necessary. In today's post I want to share what I did yesterday to 
- import weather data from an external service provider
- deal with time stamps.

I intend to follow up this post with the subsequent steps in building the data lake, the predictive analytics and the delivery of these insights as a service.

## Getting weather data

There are multiple ways to get access to weather data, but my preferred method is using an API designed for this purpose. My favourite provider is Weather Underground, https://www.wunderground.com/. They have both free and paid options, but no matter what you opt for, you do need to register to get an API key.

After importing the libraries we will use, we query the API with the latitude and longitude of the location of interest.

In [3]:
import urllib2
import json
import time
import os

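# apikey must hold the API key you received when registering
# The location is passed as latitude,longitude
f = urllib2.urlopen('http://api.wunderground.com/api/'+apikey+'/geolookup/conditions/forecast/q/46.94809341,7.44744301.json')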
json_string = f.read()

It is also possible to query by city name, for example for Bern, the capital of Switzerland (CH):

In [None]:
f = urllib2.urlopen('http://api.wunderground.com/api/'+apikey+'/geolookup/conditions/q/CH/Bern.json')
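
The two query styles above differ only in the last part of the URL, so a small helper can build both. The following is a minimal sketch in Python 3 syntax (the cells in this post use Python 2). The names `build_query_url`, `coordinates_query` and `city_query` and the placeholder key are my own illustrations, and the URL pattern is simply copied from the calls above rather than taken from the API documentation.

```python
BASE_URL = 'http://api.wunderground.com/api/'


def build_query_url(api_key, query, features=('geolookup', 'conditions')):
    """Assemble a request URL following the pattern used in this post."""
    return '{base}{key}/{features}/q/{query}.json'.format(
        base=BASE_URL,
        key=api_key,
        features='/'.join(features),
        query=query,
    )


def coordinates_query(latitude, longitude):
    """Location expressed as 'latitude,longitude'."""
    return '{},{}'.format(latitude, longitude)


def city_query(country_code, city):
    """Location expressed as 'COUNTRY/City'."""
    return '{}/{}'.format(country_code, city)


api_key = 'YOUR_API_KEY'  # placeholder, replace with your own key

print(build_query_url(api_key,
                      coordinates_query(46.94809341, 7.44744301),
                      features=('geolookup', 'conditions', 'forecast')))
print(build_query_url(api_key, city_query('CH', 'Bern')))
```

Keeping the URL construction separate from the network call makes it easy to check the request before spending one of the API calls included in your plan.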

Sometimes I want to write the raw data returned by the API to disk:

In [4]:
# The with statement closes the file automatically, so no explicit close() is needed
with open('weather.json', 'w') as outfile:
  outfile.write(json_string)

However, we might want to extract specific fields of the JSON document the API returned. Fortunately, parsing the JSON document is trivial in Python.

In [5]:
parsed_json = json.loads(json_string)
location = parsed_json['location']['city']
temp_c = parsed_json['current_observation']['temp_c']
print "Current temperature in %s is: %s" % (location, temp_c)

Current temperature in Berne is: 20.9
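
When this runs unattended, a response may lack one of the expected fields, and direct indexing would then raise a `KeyError`. Below is a minimal sketch, in Python 3 syntax, of a more forgiving extraction. The function name `extract_observation` is hypothetical, and the sample document is hand-written: it contains only the fields used in this post, and I am assuming the epoch arrives as a string. A real response contains many more fields.

```python
import json

# Hand-written sample with only the fields used in this post (an assumption,
# not a captured API response).
sample_json_string = '''
{
  "location": {"city": "Berne"},
  "current_observation": {
    "temp_c": 20.9,
    "observation_epoch": "1471695656"
  }
}
'''


def extract_observation(json_string):
    """Return the fields of interest, using None for anything missing."""
    parsed = json.loads(json_string)
    observation = parsed.get('current_observation', {})
    epoch = observation.get('observation_epoch')
    return {
        'city': parsed.get('location', {}).get('city'),
        'temp_c': observation.get('temp_c'),
        # Store the epoch as an integer number of seconds
        'observation_epoch': int(epoch) if epoch is not None else None,
    }


record = extract_observation(sample_json_string)
print("Current temperature in %s is: %s" % (record['city'], record['temp_c']))
```

A flat dictionary like `record` is also a convenient shape for writing one row per observation later on.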


## Time information

To blend this weather data with the sensor data from the power plants, we have to make sure we take the data from the correct time window. We can get the time of the last observation just like we got the temperature above.

In [6]:
print parsed_json['current_observation']['observation_time']

Last Updated on August 20, 2:20 PM CEST


While this "pretty print" is human readable, it is harder for something like Hive or SQL to interpret. For this purpose, it is good practice to use the Unix epoch time, i.e. the number of seconds since 1970-01-01 00:00:00 UTC.

In [7]:
obs_time = parsed_json['current_observation']['observation_epoch']
print obs_time

1471695656


This value can be used to extract things like the year, month, day, etc., which makes it easier to determine which user-defined time window the observation belongs to.

In [8]:
print "Observation time", time.strftime('%Y-%m-%d %H:%M:%S %Z', time.localtime(float(obs_time)))

Observation time 2016-08-20 14:20:56 CEST
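
Because the epoch is just a count of seconds, assigning an observation to a fixed-size time window is plain integer arithmetic. The sketch below (Python 3 syntax) rounds the epoch down to the start of its window. The function name `window_start` and the window sizes are my own choices for illustration.

```python
def window_start(epoch_seconds, window_seconds):
    """Round an epoch down to the start of its fixed-size window."""
    epoch_seconds = int(epoch_seconds)
    return epoch_seconds - (epoch_seconds % window_seconds)


obs_time = '1471695656'  # the observation epoch shown above

hourly = window_start(obs_time, 3600)         # one-hour windows
quarter_hourly = window_start(obs_time, 900)  # 15-minute windows

print(hourly, quarter_hourly)
```

Two records that share the same window start fall into the same window, so this value can serve as a join key between weather and sensor data.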


Since I run this as a script on an R-server edge node in the Azure cloud, the time zone of the node can differ from that of the observation point. To ensure that the correct time zone is used, we can run the following before the previous command:

In [9]:
os.environ['TZ'] = 'Europe/Zurich'
time.tzset()
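
Setting `TZ` changes the time zone for the whole process, and `time.tzset()` is only available on Unix. As an alternative, here is a minimal sketch that attaches the time zone to the value itself. It assumes Python 3.9 or later (for the standard-library `zoneinfo` module) and a system with time zone data installed. The function name `epoch_to_zone` is hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library from Python 3.9


def epoch_to_zone(epoch_seconds, zone_name):
    """Convert an epoch to a time zone aware datetime in the given zone."""
    return datetime.fromtimestamp(float(epoch_seconds), tz=ZoneInfo(zone_name))


obs_time = '1471695656'  # the observation epoch shown above

local_obs = epoch_to_zone(obs_time, 'Europe/Zurich')
utc_obs = local_obs.astimezone(timezone.utc)

print("Observation time", local_obs.strftime('%Y-%m-%d %H:%M:%S %Z'))
print("Observation time", utc_obs.strftime('%Y-%m-%d %H:%M:%S %Z'))
```

Since the result does not depend on the node's own time zone, the same script should behave identically on a laptop and on the edge node.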

## Conclusion
This mini tutorial has shown how to query and access weather data, and how to deal with the time and time zone of the measurement. With this we are ready to start blending this data with other data stored in the data lake. The next steps will be covered in a future blog post.