In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [2]:
import numpy as np
import pandas as pd
import datetime as dt
import gzip
import grader

# Time Series Data: Predict Temperature
Time series prediction presents its own challenges, which differ from those of other machine-learning problems.  As with many other classes of problems, a number of common features recur in these predictions.

## A note on scoring
It **is** possible to score >1 on these questions. A score above 1 indicates that you've beaten our reference model: we compare our model's score on a test set to your score on a test set. See how high you can go!

## Fetch the data:

In [3]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'train.txt.gz'

The columns of the data are, in order:
  - year
  - month
  - day
  - hour
  - temp
  - dew_temp
  - pressure
  - wind_angle
  - wind_speed
  - sky_code
  - rain_hour
  - rain_6hour
  - city

This function will read the data from a file handle into a Pandas DataFrame.  Feel free to use it, or to write your own version to load it in the format you desire.

In [4]:
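def load_stream(stream):
    # Fields are separated by runs of whitespace. The original pattern ' *'
    # also matches the empty string, so r'\s+' is used instead.
    return pd.read_table(stream, sep=r'\s+',
                         names=['year', 'month', 'day', 'hour', 'temp',
                                'dew_temp', 'pressure', 'wind_angle',
                                'wind_speed', 'sky_code', 'rain_hour',
                                'rain_6hour', 'city'])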

In [5]:
df = load_stream(gzip.open('train.txt.gz', 'r'))

The temperature is reported in tenths of a degree Celsius.  However, not all the values are valid.  Examine the data, and remove the invalid rows.
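As a rough illustration of the cleaning step, the sketch below drops rows whose temperature equals a sentinel value. It assumes that `-9999` marks a missing reading (the value this notebook replaces later on); the toy frame, the city codes, and the helper name `drop_invalid_temps` are made up for the example.

```python
import pandas as pd

# Assumption: -9999 is the sentinel for a missing reading.
MISSING = -9999

# Toy stand-in for the real data; city codes are hypothetical.
toy = pd.DataFrame({
    'temp': [215, MISSING, 198, 0],
    'city': ['bos', 'bos', 'chi', 'chi'],
})

def drop_invalid_temps(frame, sentinel=MISSING):
    """Return a copy of `frame` without rows whose temp equals the sentinel."""
    return frame[frame['temp'] != sentinel].copy()

clean = drop_invalid_temps(toy)
# Temperatures are stored in tenths of a degree; convert for inspection.
clean['temp_c'] = clean['temp'] / 10.0
```

A temperature of `0` is a valid reading (0.0 degrees Celsius), which is why the filter compares against the sentinel rather than testing for falsy values.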

We will focus on using the temporal elements to predict the temperature.

## Per city model

It makes sense for each city to have its own model.  Build a "groupby" estimator that takes an estimator factory as an argument and builds the resulting "groupby" estimator on each city.  That is, `fit` should create and fit a model per city, while the `predict` method should look up the corresponding model and perform a predict on each.  An estimator factory is something that returns an estimator each time it is called.  It could be a function or a class.
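To make the factory idea concrete, here is a minimal, dependency-free sketch. `MeanRegressor` and `mean_factory` are hypothetical names invented for this illustration, and the toy data is made up; the point is only that each call to the factory yields a separate estimator, so each city ends up with its own fitted state.

```python
class MeanRegressor:
    """Hypothetical toy estimator: always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def mean_factory():
    """Function-style factory: a new, unfitted estimator on every call."""
    return MeanRegressor()


# The class itself also works as a factory, since MeanRegressor() builds
# a new instance each time it is called.
factories = [mean_factory, MeanRegressor]

# Made-up per-city data: (features, targets).
data = {
    'bos': ([[1], [2]], [10.0, 20.0]),
    'chi': ([[1], [2]], [30.0, 50.0]),
}

# One independent model per city, each created by a fresh factory call.
models = {city: mean_factory().fit(X, y) for city, (X, y) in data.items()}
```

Passing a single estimator instance instead of a factory would make every city share one object, so each `fit` call would overwrite the previous city's model.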

In [11]:
from sklearn import base

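class GroupbyEstimator(base.BaseEstimator, base.RegressorMixin):
    """Fit one estimator per value of `column` and dispatch predictions to it."""

    def __init__(self, column, features, estimator_factory):
        self.column = column
        self.features = features
        self.estimator_factory = estimator_factory

    def _new_estimator(self):
        # Accept a factory (function or class) or a prototype estimator
        # instance, which is cloned so every group gets a separate model.
        if callable(self.estimator_factory):
            return self.estimator_factory()
        return base.clone(self.estimator_factory)

    def fit(self, X, y):
        self.fit_dict = {}
        for group in X[self.column].unique():
            mask = X[self.column] == group
            self.fit_dict[group] = self._new_estimator().fit(
                X.loc[mask, self.features].values, y[mask])
        return self

    def predict(self, X):
        # X is a list of text records in the same format as the training file.
        X = pd.DataFrame([x.split() for x in X],
                         columns=['year', 'month', 'day', 'hour', 'temp',
                                  'dew_temp', 'pressure', 'wind_angle',
                                  'wind_speed', 'sky_code', 'rain_hour',
                                  'rain_6hour', 'city'])
        # Split fields are strings; the features must be numeric.
        X[self.features] = X[self.features].apply(pd.to_numeric, errors='coerce')
        preds = pd.Series(np.nan, index=X.index)
        # Predict each group in one batch with that group's own model.
        for group, rows in X.groupby(self.column):
            preds.loc[rows.index] = self.fit_dict[group].predict(
                rows[self.features].values)
        return preds.tolist()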

# Questions

For each question, build a model to predict the temperature in a given city at a given time.  You will be given a list of records, each a string in the same format as the lines in the training file.  Return a list of predicted temperatures, one for each incoming record.  (As you can imagine, the temperature values will be stripped out in the actual text records.)

## month_hour_model
Seasonal features are nice because they are relatively safe to extrapolate into the future. There are two ways to handle seasonality.  

The simplest (and perhaps most robust) is to have a set of indicator variables. That is, make the assumption that the temperature at any given time is a function of only the month of the year and the hour of the day, and use that to predict the temperature value.

**Question**: Should month be a continuous or categorical variable?  (Recall that [one-hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) is useful to deal with categorical variables.)
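One way to explore the question is to treat both month and hour as categorical and fit a linear model on the indicator columns. The sketch below is only an illustration under stated assumptions: the data is synthetic (a made-up yearly and daily cycle plus noise), and the choice of `Ridge` with `alpha=1.0` is arbitrary rather than anything this assignment prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data, not the course data set.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'month': rng.randint(1, 13, size=500),
    'hour': rng.randint(0, 24, size=500),
})
# Made-up yearly and daily cycles, in tenths of a degree, plus noise.
toy['temp'] = (100 - 100 * np.cos(2 * np.pi * (toy['month'] - 1) / 12)
               - 50 * np.cos(2 * np.pi * toy['hour'] / 24)
               + rng.normal(0, 5, size=500))

# Each month and each hour becomes its own 0/1 indicator column.
month_hour = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ('reg', Ridge(alpha=1.0)),
])
month_hour.fit(toy[['month', 'hour']], toy['temp'])

july_noon = pd.DataFrame({'month': [7], 'hour': [12]})
jan_midnight = pd.DataFrame({'month': [1], 'hour': [0]})
warm = month_hour.predict(july_noon)[0]
cold = month_hour.predict(jan_midnight)[0]
```

With indicators, the model learns a separate offset for each month, so December (12) and January (1) can receive similar values even though they are far apart numerically. A single continuous `month` coefficient cannot express that.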

In [12]:
import xgboost as xgb
xgbr = xgb.XGBRegressor(max_depth=5, learning_rate=0.1,
                        colsample_bytree=0.7, subsample=0.7,
                        n_estimators=100, reg_alpha=0.1, reg_lambda=0.1)

In [13]:
# -9999 marks a missing value. Instead of dropping those rows, carry the
# previous valid reading forward.
df = df.replace(-9999, np.nan)
df = df.ffill()

In [16]:
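def season_factory():
    # Return a fresh, unfitted copy on every call so that models fitted for
    # different cities never share state.
    return base.clone(xgbr)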

features = ['month', 'day', 'hour', 'pressure', 'wind_speed','wind_angle', 'sky_code', 'rain_hour', 'rain_6hour']

season_model = GroupbyEstimator('city', features, xgbr).fit(df, df['temp'])

You will need to write a function that makes predictions from a list of strings.  You can either create a pipeline with a transformer and the `season_model`, or you can write a helper function to convert the lines to the format you expect.
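A helper along these lines could do the conversion. It is a sketch that assumes every record carries all 13 whitespace-separated fields in the training-file order; the name `lines_to_frame` and the sample record are made up for illustration.

```python
import pandas as pd

COLUMNS = ['year', 'month', 'day', 'hour', 'temp',
           'dew_temp', 'pressure', 'wind_angle',
           'wind_speed', 'sky_code', 'rain_hour',
           'rain_6hour', 'city']

def lines_to_frame(lines, columns=COLUMNS):
    """Split whitespace-separated records into a DataFrame with numeric fields."""
    frame = pd.DataFrame([line.split() for line in lines], columns=columns)
    # Everything except the city code is numeric; unparseable values become NaN.
    numeric = [c for c in columns if c != 'city']
    frame[numeric] = frame[numeric].apply(pd.to_numeric, errors='coerce')
    return frame

# Made-up record in the assumed format.
sample = ["2000 1 1 0 -11 -72 10197 220 26 4 0 0 bos"]
frame = lines_to_frame(sample)
```

Keeping the parsing in one function means the same code path can serve both exploratory work on raw lines and the `predict` call handed to the grader.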

In [17]:
grader.score('ts__month_hour_model', season_model.predict)

Your score:  1.10612415472


*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*