# Hands-on exercise on TAHMO data

Outline
---------
- Dataset description and picture map. 
- Feature extraction step 
- Model construction 


<h2>Introduction</h2>

<img src=images/allafrica.png width=75%>

Stations located in Kenya. 

<img src=images/target-stations.png width=75%>

<h3>The TAHMO stations</h3>

<h3>Dataset</h3>

The data is a time series of weather sensor readings, consisting of different physical variables on a regular grid on the Earth, indexed by lon(gitude) and lat(itude) coordinates. The variables we have made available are: 
<ul>
<li>rain --- precipitation. 
<li>te --- air temperature. 
<li>rh --- relative humidity. 
<li>wsd --- wind speed. 
<li>wsg --- wind gust. 
<li>pres --- surface pressure.
<li>rad --- solar radiation. 
</ul>
 
The fields are recorded every 5 minutes over two years (2016-2017). The dataset used here contains these observations averaged to hourly resolution. 
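As a rough illustration of the hourly averaging described above, the sketch below resamples 5-minute readings to hourly means with pandas. The values are synthetic, and whether the provided files were produced this way is an assumption. Only the column names `te` and `rain` come from the variable list.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute readings for one day (illustrative values, not TAHMO data).
idx = pd.date_range("2016-01-01", periods=288, freq=pd.Timedelta(minutes=5))
raw = pd.DataFrame(
    {"te": np.linspace(15.0, 25.0, 288), "rain": np.full(288, 0.1)},
    index=idx,
)

# Average every block of twelve 5-minute readings into one hourly value.
hourly = raw.resample(pd.Timedelta(hours=1)).mean()
print(hourly.head())
```

For an accumulated quantity such as rain, an hourly sum may be a more natural aggregate than a mean. Which one the provided files use should be checked against the data.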


### Station description

 - from: station id
 - to: station id
 - distance (km): distance between the two stations.
 - elevation (m): elevation difference between the from and to stations (from minus to).

<h3>The prediction task</h3>
 - Predict whether it rains or not.
 - Predict the amount of rain for the rainy period. 



In [4]:
# Map of the stations
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np

In [76]:
import math
def haversine_distance(lat1, lon1, lat2, lon2):
    """
    Great-circle (haversine) distance in km between two geographical locations.
    args:
        lat1: latitude of station 1 (degrees)
        lon1: longitude of station 1 (degrees)
        lat2: latitude of station 2 (degrees)
        lon2: longitude of station 2 (degrees)
    """

    earth_radius = 6371.16  # km
    deg2rad = lambda deg: deg*math.pi/180
    dlat = deg2rad(lat2 - lat1)
    dlon = deg2rad(lon2 - lon1)
    # Haversine of the central angle between the two points
    a = math.sin(dlat / 2) * math.sin(dlat / 2) + math.cos(deg2rad(lat1)) * \
    math.cos(deg2rad(lat2)) * math.sin(dlon / 2) * math.sin(dlon / 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = earth_radius * c  # distance in km
    return d
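
When distances from one station to many others are needed, a vectorized version avoids a Python loop. The sketch below is one possible NumPy variant, and `haversine_km` is a hypothetical name. It uses the same Earth radius as the function above and the coordinates of TA00020, TA00024 and TA00025 from the station table in this notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.16  # same radius as haversine_distance above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or NumPy arrays in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Distance from TA00020 to TA00024 and TA00025 in a single call.
lats = np.array([-1.071731, -1.301839])
lons = np.array([37.045578, 36.7602])
dists = haversine_km(-1.653356, 36.862397, lats, lons)
print(dists.round(3))
```

The two values should agree with the distances listed for TA00024 and TA00025 in the `nearby_stations('TA00020', k=5)` output (about 67.8 km and 40.7 km).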


In [88]:
def nearby_stations(site_code, k=10, radius=100):
    """
    Return up to k nearest stations within `radius` km of `site_code`,
    sorted by distance (ties broken by elevation difference).
    """
    stations = pd.read_csv("nearest_stations.csv")
    k_nearest = stations[(stations['from'] == site_code) & (stations['distance'] < radius)]
    k_nearest = k_nearest.sort_values(by=['distance', 'elevation'], ascending=True)[0:k]

    return k_nearest

In [103]:
nearby_stations('TA00020',k=5)

Unnamed: 0,distance,elevation,from,to
3,40.705243,-155,TA00020,TA00025
11,44.519409,7,TA00020,TA00057
15,48.530912,-353,TA00020,TA00066
2,67.805403,123,TA00020,TA00024
16,85.761088,459,TA00020,TA00067
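
Note that `k` and `radius` interact: the radius filter is applied first, so fewer than `k` rows can come back. The sketch below shows this with a hypothetical variant, `nearby_from_table`, that takes the pair table as an argument instead of reading the CSV. The rows are copied from the output above.

```python
import pandas as pd

# Rows copied from the nearby_stations('TA00020', k=5) output above.
pairs = pd.DataFrame({
    "from": ["TA00020"] * 5,
    "to": ["TA00025", "TA00057", "TA00066", "TA00024", "TA00067"],
    "distance": [40.705243, 44.519409, 48.530912, 67.805403, 85.761088],
    "elevation": [-155, 7, -353, 123, 459],
})

def nearby_from_table(pairs, site_code, k=10, radius=100):
    """Hypothetical variant of nearby_stations that works on an in-memory table."""
    mask = (pairs["from"] == site_code) & (pairs["distance"] < radius)
    return pairs[mask].sort_values(["distance", "elevation"]).head(k)

# With a 50 km radius only three of the five neighbours remain, even though k=5.
close = nearby_from_table(pairs, "TA00020", k=5, radius=50)
print(close["to"].tolist())
```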


In [104]:
import os

In [106]:
frames = []
# Read each hourly station file and collect it for a single concat at the end
for ff in os.listdir('hrdata/'):
    dd2 = pd.read_csv(os.path.join('hrdata', ff))
    frames.append(dd2)
alldata = pd.concat(frames) if frames else pd.DataFrame()
    

In [109]:
alldata.to_csv("kenyastations.csv",index=False)

In [100]:
xx = pd.concat([dd,dd2])

In [102]:
xx.shape, dd.shape[0] + dd2.shape[0]

((35088, 9), 35088)

In [98]:
xx.shape

(0, 9)

In [None]:
class FeatureExtractor(object):

    def __init__(self):
        pass

    def transform(self, X_ds):
        """Compute the vector of input variables at time t. Spatial variables will
        be averaged along lat and lon coordinates."""
        # This is the range for which features should be provided. Strip
        # the burn-in from the beginning and the prediction look-ahead from
        # the end.
        valid_range = np.arange(X_ds.attrs['n_burn_in'], len(X_ds['time']))
        # We convert the Dataset into a 4D DataArray
        X_xr = X_ds.to_array()
        # We compute the mean over the lat and lon axes
        mean_xr = np.mean(X_xr, axis=(2, 3))
        # We convert it into numpy array, transpose, and slice the valid range
        X_array = mean_xr.values.T[valid_range]
        return X_array
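
The extractor above is written for a gridded xarray dataset. For the hourly station table, one common alternative is to use lagged readings as features. The sketch below is only an illustration under stated assumptions: `make_lag_features` is a hypothetical helper, the data are synthetic, and only the column names `te`, `rh` and `rain` come from the variable list.

```python
import numpy as np
import pandas as pd

def make_lag_features(df, columns, lags=(1, 2, 3)):
    """Build lagged copies of `columns`; rows without a full history are dropped."""
    feats = {
        f"{col}_lag{lag}": df[col].shift(lag) for col in columns for lag in lags
    }
    return pd.DataFrame(feats, index=df.index).dropna()

# Synthetic hourly readings for two days (illustrative values only).
rng = np.random.default_rng(0)
idx = pd.date_range("2016-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame(
    {
        "te": rng.normal(20, 3, 48),
        "rh": rng.uniform(0.3, 1.0, 48),
        "rain": rng.choice([0.0, 0.0, 0.0, 1.2], size=48),
    },
    index=idx,
)

X = make_lag_features(hourly, ["te", "rh", "rain"])
# Binary target for the rain / no-rain task, aligned with the feature rows.
y_wet = (hourly.loc[X.index, "rain"] > 0).astype(int)
print(X.shape, y_wet.shape)
```

Dropping the first rows plays the same role as the burn-in period in the extractor above: those time steps do not have enough history to build a full feature vector.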


### Exploratory analysis

In [13]:
stn.head(5)

Unnamed: 0,id,name,country,latitude,longitude,elevation
0,TA00020,Woodlands 2000 Trust,KE,-1.653356,36.862397,1643
1,TA00021,Kipsomba Secondary School,KE,0.757986,35.173711,1946
2,TA00023,Dwa Estate,KE,-2.38855,38.040767,791
3,TA00024,Mang'u High School,KE,-1.071731,37.045578,1520
4,TA00025,Kenya Meteorological Department,KE,-1.301839,36.7602,1798


1. Hurdle model: first predict whether it rains or not.
2. If it rains, predict how much. Possibly fit this stage on the rainy-event data only. 
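
The two-stage idea can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the model used in the exercise. The choice of logistic and linear regression, the log transform of the amount, and the 0.5 probability threshold are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic features and targets (stand-ins for real station features).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
wet = (X[:, 0] + 0.5 * rng.normal(size=n)) > 0.5
amount = np.where(wet, np.exp(0.5 * X[:, 1] + 0.1 * rng.normal(size=n)), 0.0)

# Stage 1: classify rain versus no rain using all samples.
clf = LogisticRegression().fit(X, wet)

# Stage 2: regress the (log) rain amount on the rainy samples only.
reg = LinearRegression().fit(X[wet], np.log(amount[wet]))

# Combine: predict an amount only where stage 1 says rain is likely.
p_rain = clf.predict_proba(X)[:, 1]
pred_amount = np.where(p_rain > 0.5, np.exp(reg.predict(X)), 0.0)
print(pred_amount[:5])
```

Fitting the second stage on rainy samples only keeps the many zero-rain hours from dominating the regression.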

### Divide the data into training/validation and testing data. 
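
Because the observations form a time series, one reasonable option is a chronological split rather than a random one, so that the model is always evaluated on data later than what it was trained on. The sketch below assumes hourly data covering 2016-2017. The split boundaries (2016 for training, the first half of 2017 for validation, the rest for testing) are illustrative choices.

```python
import numpy as np
import pandas as pd

# Placeholder hourly frame spanning 2016-2017 (values are synthetic).
idx = pd.date_range("2016-01-01", "2017-12-31 23:00", freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"rain": np.zeros(len(idx))}, index=idx)

# Chronological split using date-string slicing on the DatetimeIndex.
train = df.loc[:"2016-12-31"]
val = df.loc["2017-01-01":"2017-06-30"]
test = df.loc["2017-07-01":]

print(len(train), len(val), len(test))
```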