In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [2]:
import numpy as np
import pandas as pd
import datetime as dt
import gzip
import grader

# Time Series Data: Predict Temperature
Time series prediction presents challenges that differ from those of other machine-learning problems.  As with many other classes of problems, these predictions share a number of common features.

## A note on scoring
It **is** possible to score >1 on these questions. This indicates that you've beaten our reference model - we compare our model's score on a test set to your score on a test set. See how high you can go!

## Fetch the data:

In [3]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'train.txt.gz'

download: s3://dataincubator-course/mldata/train.txt.gz to ./train.txt.gz


The columns of the data are, in order:
  - year
  - month
  - day
  - hour
  - temp
  - dew_temp
  - pressure
  - wind_angle
  - wind_speed
  - sky_code
  - rain_hour
  - rain_6hour
  - city

This function will read the data from a file handle into a Pandas DataFrame.  Feel free to use it, or to write your own version to load it in the format you desire.

In [4]:
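def load_stream(stream):
    # Fields are separated by runs of whitespace. r'\s+' requires at least one
    # whitespace character, whereas ' *' can also match the empty string.
    return pd.read_table(stream, sep=r'\s+', engine='python',
                         names=['year', 'month', 'day', 'hour', 'temp',
                                'dew_temp', 'pressure', 'wind_angle', 
                                'wind_speed', 'sky_code', 'rain_hour',
                                'rain_6hour', 'city'])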

In [5]:
df = load_stream(gzip.open('train.txt.gz', 'r'))

The temperature is reported in tenths of a degree Celsius.  However, not all the values are valid.  Examine the data, and remove the invalid rows.

In [21]:
df = df[df['temp']!=-9999]
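
One way to examine the data for invalid readings is to count how often a sentinel value appears in each column. The sketch below assumes, based on the filter above, that -9999 marks a missing reading; the toy DataFrame `toy` and the helper `count_sentinel` are hypothetical names used only for illustration.

```python
import pandas as pd

MISSING = -9999  # assumed sentinel for a missing reading

def count_sentinel(frame, sentinel=MISSING):
    """Return the number of sentinel values in each numeric column."""
    numeric = frame.select_dtypes(include='number')
    return (numeric == sentinel).sum()

# Toy data in the same units as the training file (tenths of a degree Celsius).
toy = pd.DataFrame({
    'temp': [-11, -9999, -17],
    'rain_6hour': [0, -9999, -9999],
    'city': ['bos', 'bos', 'bos'],
})

counts = count_sentinel(toy)
clean = toy[toy['temp'] != MISSING]   # drop rows with no temperature
temp_celsius = clean['temp'] / 10.0   # convert tenths of a degree to degrees
```

Only rows with a missing `temp` are dropped here; other columns may still contain the sentinel, which matters if they are later used as features.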

In [28]:
df

Unnamed: 0,year,month,day,hour,temp,dew_temp,pressure,wind_angle,wind_speed,sky_code,rain_hour,rain_6hour,city
0,2000,1,1,0,-11,-72,10197,220,26,4,0,0,bos
1,2000,1,1,1,-6,-78,10206,230,26,2,0,-9999,bos
2,2000,1,1,2,-17,-78,10211,230,36,0,0,-9999,bos
3,2000,1,1,3,-17,-78,10214,230,36,0,0,-9999,bos
4,2000,1,1,4,-17,-78,10216,230,36,0,0,-9999,bos
5,2000,1,1,5,-22,-78,10218,230,36,0,0,-9999,bos
6,2000,1,1,6,-28,-83,10219,230,26,0,0,0,bos
7,2000,1,1,7,0,-78,10222,280,46,0,0,-9999,bos
8,2000,1,1,8,-11,-72,10231,240,36,7,0,-9999,bos
9,2000,1,1,9,-28,-78,10228,230,41,0,0,-9999,bos


In [24]:
set(df['city'])

{'bal', 'bos', 'chi', 'nyc', 'phi'}

We will focus on using the temporal elements to predict the temperature.

## Per city model

It makes sense for each city to have its own model.  Build a "groupby" estimator that takes an estimator factory as an argument and builds the resulting "groupby" estimator on each city.  That is, `fit` should create and fit a model per city, while the `predict` method should look up the corresponding model and perform a predict on each.  An estimator factory is something that returns an estimator each time it is called.  It could be a function or a class.

In [56]:
from sklearn import base

class GroupbyEstimator(base.BaseEstimator, base.RegressorMixin):
    
    def __init__(self, column, estimator_factory):
        # column is the value to group by; estimator_factory can be
        # called to produce estimators
        self.column = column
        self.estimator_factory = estimator_factory
    
    def fit(self, X, y):
        # Create an estimator and fit it with the portion in each group.
        # The groups come from X itself rather than from a global DataFrame.
        self.estimators_ = {}
        for c in X[self.column].unique():
            mask = X[self.column] == c
            self.estimators_[c] = self.estimator_factory().fit(X[mask], y[mask])
        return self

    def predict(self, X):
        # Call the appropriate predict method for each row of X.
        # Each per-group estimator receives a one-element list of rows.
        result = []
        for ind, row in X.iterrows():
            est = self.estimators_[row[self.column]]
            result.append(est.predict([row])[0])
        return result
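
As a rough illustration of the estimator-factory idea, the sketch below fits one model per group and predicts a whole group in a single call instead of row by row. It is not the notebook's implementation: `MeanRegressor`, `fit_per_group`, and `predict_per_group` are hypothetical names, the toy estimator simply predicts its training mean, and the index of `X` is assumed to be unique.

```python
import numpy as np
import pandas as pd

class MeanRegressor:
    """Hypothetical toy estimator: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

def fit_per_group(X, y, column, estimator_factory):
    """Fit one fresh estimator per distinct value of X[column]."""
    models = {}
    for key, part in X.groupby(column):
        models[key] = estimator_factory().fit(part, y.loc[part.index])
    return models

def predict_per_group(models, X, column):
    """Predict each group's rows in one call, keeping the row order of X."""
    out = pd.Series(np.nan, index=X.index)
    for key, part in X.groupby(column):
        out.loc[part.index] = models[key].predict(part)
    return out.to_numpy()

X = pd.DataFrame({'city': ['bos', 'nyc', 'bos', 'nyc'], 'hour': [0, 0, 1, 1]})
y = pd.Series([10.0, 20.0, 30.0, 40.0])

# The class itself serves as the factory: calling it returns a new estimator.
models = fit_per_group(X, y, 'city', MeanRegressor)
preds = predict_per_group(models, X, 'city')
```

Passing the class (or any zero-argument callable) rather than an instance ensures that each city gets its own independent estimator.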

# Questions

For each question, build a model to predict the temperature in a given city at a given time.  You will be given a list of records, each a string in the same format as the lines in the training file.  Return a list of predicted temperatures, one for each incoming record.  (As you can imagine, the temperature values will be stripped out in the actual text records.)

## month_hour_model
Seasonal features are nice because they are relatively safe to extrapolate into the future. There are two ways to handle seasonality.  

The simplest (and perhaps most robust) is to have a set of indicator variables. That is, make the assumption that the temperature at any given time is a function of only the month of the year and the hour of the day, and use that to predict the temperature value.
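
Under that assumption the model reduces to a lookup table of mean temperatures. A minimal sketch with pandas, using a made-up DataFrame `toy` and a hypothetical helper `predict`:

```python
import pandas as pd

# Made-up readings in tenths of a degree Celsius.
toy = pd.DataFrame({
    'month': [1, 1, 1, 7, 7],
    'hour':  [0, 0, 12, 0, 12],
    'temp':  [-20, -10, 30, 200, 280],
})

# Mean temperature for each (month, hour) pair; this table is the whole model.
table = toy.groupby(['month', 'hour'])['temp'].mean()

def predict(month, hour):
    """Look up the training mean for a (month, hour) pair."""
    return table.loc[(month, hour)]

january_midnight = predict(1, 0)
july_noon = predict(7, 12)
```

A pair that never occurs in the training data has no entry in the table, so a real model would need a fallback for that case.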

**Question**: Should month be a continuous or categorical variable?  (Recall that [one-hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) is useful to deal with categorical variables.)
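
One consideration when answering: treated as a plain number, December (12) and January (1) are far apart even though they are adjacent in the calendar. The sketch below shows what a one-hot encoding of month looks like. It uses `pandas.get_dummies` as a lightweight stand-in for the `OneHotEncoder` linked above, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

months = pd.Series([1, 6, 12])

# Fixing the categories guarantees 12 columns even if a month is absent.
categorical = pd.Categorical(months, categories=list(range(1, 13)))
one_hot = pd.get_dummies(categorical).astype(int).to_numpy()

# As a plain number, December and January differ by 11...
numeric_gap = abs(12 - 1)

# ...while one-hot rows for any two different months are equally far apart.
gap_dec_jan = np.abs(one_hot[2] - one_hot[0]).sum()
gap_dec_jun = np.abs(one_hot[2] - one_hot[1]).sum()
```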

In [57]:
import numpy as np
class season_factory(base.BaseEstimator, base.RegressorMixin):
    def __init__(self):
        self.avg = np.zeros((12,24))
    def fit(self, X, y):
        count = np.zeros((12,24))
        total = np.zeros((12,24))
        for idx, row in X.iterrows():
            h = int(row['hour'])
            m = int(row['month'])-1
            total[m,h]+=row['temp']
            count[m,h]+=1
        # Mean temperature per (month, hour) cell; cells with no rows become NaN.
        # Note that y is unused: the target is read from X['temp'].
        self.avg = total/count
        return self

    def predict(self, X):
        result = []
        for row in X:
            h = int(row['hour'])
            m = int(row['month'])-1
            t = self.avg[m,h]
            result.append(t)
        # One looked-up (month, hour) average per input row
        return result

In [62]:
season_model = GroupbyEstimator('city', season_factory).fit(df, df['temp'])

In [69]:
import pandas as pd
def answer(x):
    # Each record is a whitespace-separated line; keep month (field 1),
    # hour (field 3), and city (last field).
    X = []
    labels = ['month','hour','city']
    for r in x:
        l = r.split()
        X.append([l[1],l[3],l[-1]])
    data = pd.DataFrame.from_records(X, columns=labels)
    return season_model.predict(data)

You will need to write a function that makes predictions from a list of strings.  You can either create a pipeline with a transformer and the `season_model`, or you can write a helper function to convert the lines to the format you expect.
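
A helper of that kind might look like the sketch below. It assumes that month is the second field, hour the fourth, and city the last, matching the column order listed for the training file; `records_to_frame` is a hypothetical name, and the second sample record is invented.

```python
import pandas as pd

def records_to_frame(records):
    """Convert whitespace-separated record strings into a DataFrame.

    Assumes month is the second field, hour the fourth, and city the last.
    """
    rows = []
    for record in records:
        fields = record.split()
        rows.append((int(fields[1]), int(fields[3]), fields[-1]))
    return pd.DataFrame(rows, columns=['month', 'hour', 'city'])

# Example records in the training-file layout (the second one is invented).
sample = [
    '2000 1 1 0 -11 -72 10197 220 26 4 0 0 bos',
    '2000 7 4 15 289 178 10150 180 31 2 0 0 nyc',
]
frame = records_to_frame(sample)
```

Reading the city from the last field rather than from a fixed position keeps the helper working whether the temperature field is blanked out or removed entirely.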

In [70]:
grader.score('ts__month_hour_model', answer)

Your score:  1.00484342307


## fourier_model
Since temperature is roughly sinusoidal, a reasonable model might be

$$ y_t = y_0 \sin\left(2\pi\frac{t - t_0}{T}\right) + \epsilon $$

where $y_0$ and $t_0$ are parameters to be learned and $T$ is one year for seasonal variation.  While this is linear in $y_0$, it is not linear in $t_0$. However, we know from Fourier analysis that the above is
equivalent to

$$ y_t = A \sin\left(2\pi\frac{t}{T}\right) + B \cos\left(2\pi\frac{t}{T}\right) + \epsilon $$

which is linear in $A$ and $B$.
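
To see why, apply the angle-difference identity $\sin(a - b) = \sin a \cos b - \cos a \sin b$ with $a = 2\pi t/T$ and $b = 2\pi t_0/T$:

$$ y_0 \sin\left(2\pi\frac{t - t_0}{T}\right) = y_0 \cos\left(2\pi\frac{t_0}{T}\right) \sin\left(2\pi\frac{t}{T}\right) - y_0 \sin\left(2\pi\frac{t_0}{T}\right) \cos\left(2\pi\frac{t}{T}\right) $$

so $A = y_0 \cos(2\pi t_0/T)$ and $B = -y_0 \sin(2\pi t_0/T)$. Conversely, taking $y_0 > 0$, the original parameters can be recovered as $y_0 = \sqrt{A^2 + B^2}$ and $t_0 = \frac{T}{2\pi}\operatorname{atan2}(-B, A)$, up to a multiple of $T$.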

Create a model containing sinusoidal terms on one or more time scales, and fit it to the data using a linear regression.
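
As a rough check of this approach, the sketch below fits the sine and cosine terms to a synthetic series by ordinary least squares and converts the coefficients back to an amplitude and a phase shift. All values are made up for illustration, time is measured in days with $T$ assumed to be 365.25, and a constant column is added for the mean level, which the formula above omits.

```python
import numpy as np

T = 365.25                      # assumed period: one year, in days
rng = np.random.default_rng(0)

# Synthetic series (made-up values): amplitude 10, phase shift 100 days,
# mean level 5, plus Gaussian noise.
t = np.arange(730)
y = 10.0 * np.sin(2 * np.pi * (t - 100.0) / T) + 5.0 + rng.normal(0.0, 1.0, t.size)

# Design matrix with columns sin, cos, and a constant for the mean level.
features = np.column_stack([
    np.sin(2 * np.pi * t / T),
    np.cos(2 * np.pi * t / T),
    np.ones(t.size),
])
(A, B, offset), *_ = np.linalg.lstsq(features, y, rcond=None)

# Convert the linear coefficients back to amplitude and phase shift.
amplitude = np.hypot(A, B)
t0 = (T / (2 * np.pi)) * np.arctan2(-B, A)
```

A second time scale, such as the daily cycle, can be handled the same way by appending another sine and cosine pair with its own period to the design matrix.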

In [99]:
from sklearn.linear_model import LinearRegression,Ridge
class f_factory(base.BaseEstimator, base.RegressorMixin):
    def __init__(self):
        self.coefs = []
        self.intercept = 0
    def fit(self, X, y):
        regr = LinearRegression()
        XX = []
        yy = []
        for idx, row in X.iterrows():
            m = int(row['month'])
            h = int(row['hour'])
            v = [0]*24
            v[h]=1
            XX.append([np.sin(2*np.pi*(m)/12.0),np.cos(2*np.pi*(m)/12.0)]+v)
            yy.append(row['temp'])
        regr.fit(XX,yy)
        self.coefs = regr.coef_
        self.intercept = regr.intercept_
        return self

    def predict(self, X):
        pred = []
        for row in X:
            m = int(row['month'])
            h = int(row['hour'])
            v = [0]*24
            v[h] = 1
            x = [np.sin(2*np.pi*m/12.0),np.cos(2*np.pi*m/12.0)]+v
            p = sum(np.array(x)*self.coefs)+self.intercept 
            pred.append(p)
        # One prediction per input row: features dotted with coefficients, plus intercept
        return pred

In [100]:
fourier_model = GroupbyEstimator('city', f_factory).fit(df, df['temp'])

In [101]:
def answer_2(x):
    X=[]
    labels = ['month','hour','city']
    for r in x:
        l = r.split()
        X.append([l[1],l[3],l[-1]])
    data = pd.DataFrame.from_records(X, columns=labels)
    return fourier_model.predict(data)

In [102]:
grader.score('ts__fourier_model', answer_2)

Your score:  0.972040308163


*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*