In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [2]:
from grader import score



# Time Series Data: Predict Temperature
Time series prediction presents its own challenges, which differ from those of other machine-learning problems.  As with many other classes of problems, there are a number of common features in these predictions.

## Fetch the data:

In [3]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'train.txt.gz'

The columns of the data correspond to:
  - year
  - month
  - day
  - hour
  - temp
  - dew_temp
  - pressure
  - wind_angle
  - wind_speed
  - sky_code
  - rain_hour
  - rain_6hour
  - city

We will focus on using the temporal elements to predict the temperature.
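
As a starting point, a record can be split into named fields. The sketch below is only illustrative: it assumes the fields are whitespace-separated and appear in the column order listed above, which should be checked against the actual file. The names `COLUMNS`, `parse_record`, and `_to_number` are hypothetical helpers, not part of the course code.

```python
# Column order taken from the list above; the delimiter is an ASSUMPTION.
COLUMNS = ["year", "month", "day", "hour", "temp", "dew_temp", "pressure",
           "wind_angle", "wind_speed", "sky_code", "rain_hour", "rain_6hour",
           "city"]


def _to_number(text):
    """Convert a field to float, returning None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_record(line, sep=None):
    """Split one text record into a dict keyed by column name.

    sep=None splits on any run of whitespace (an assumption about the
    file format); pass sep="," or similar if the file differs.
    """
    fields = line.split(sep)
    record = {}
    for name, value in zip(COLUMNS, fields):
        # The city is kept as a string; everything else becomes a number.
        record[name] = value if name == "city" else _to_number(value)
    return record
```

If the temperature field is removed entirely from the records passed to the grader, the positions of the later columns shift, so the column list would need adjusting for those records.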

## Per city model

It makes sense for each city to have its own model.  Build a "groupby" estimator that takes an estimator as an argument and builds the resulting "groupby" estimator on each city.  That is, `fit` should fit a model per city, while `predict` should look up the corresponding model for each record and use it to make the prediction.
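
One possible shape for such an estimator is sketched below. It assumes the input `X` is a pandas `DataFrame` that still contains the grouping column; the class name `GroupbyEstimator` and its `column` parameter are illustrative choices, not a prescribed interface.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone


class GroupbyEstimator(RegressorMixin, BaseEstimator):
    """Fit one clone of `estimator` per distinct value of `column`."""

    def __init__(self, estimator, column="city"):
        self.estimator = estimator
        self.column = column

    def fit(self, X, y):
        # X is assumed to be a pandas DataFrame containing `column`.
        y = np.asarray(y)
        keys = X[self.column].to_numpy()
        self.models_ = {}
        for key in np.unique(keys):
            mask = keys == key
            model = clone(self.estimator)  # fresh, unfitted copy per group
            model.fit(X.loc[mask].drop(columns=self.column), y[mask])
            self.models_[key] = model
        return self

    def predict(self, X):
        keys = X[self.column].to_numpy()
        predictions = np.empty(len(X), dtype=float)
        for key in np.unique(keys):
            mask = keys == key
            # A group unseen during fit raises KeyError here.
            model = self.models_[key]
            predictions[mask] = model.predict(
                X.loc[mask].drop(columns=self.column))
        return predictions
```

Predictions are written back through a boolean mask, so they come out in the same order as the incoming rows even when cities are interleaved.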

# Questions

For each question, build a model to predict the temperature in a given city at a given time.  You will be given a list of records, each a string in the same format as the lines in the training file.  Return a list of predicted temperatures, one for each incoming record.

## month_hour_model
There are two ways to handle seasonality.  Seasonality features are nice because they are good at projecting arbitrarily far into the future.

The simpler (and perhaps more robust) of the two is to have a set of indicator variables for each month. That is to say, assume that the temperature at any given time is a function of only the month of the year and the hour of the day, and use that to predict the temperature value.  As you can imagine, the temperature values will be stripped out of the actual text records that are passed.
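
A minimal sketch of the indicator-variable idea follows. It assumes the features arrive as a two-column integer array of `[month, hour]` with months numbered 1 to 12 and hours 0 to 23; `make_month_hour_model` is a hypothetical helper name, and plain least squares is just one reasonable choice of regressor.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


def make_month_hour_model():
    """Pipeline: one indicator column per month and per hour, then least squares.

    Expects a 2-column integer array [month, hour]; the value ranges
    below (1-12 and 0-23) are assumptions about the data.
    """
    encoder = OneHotEncoder(
        categories=[list(range(1, 13)), list(range(24))],
        handle_unknown="ignore",  # unseen values encode as all zeros
    )
    return make_pipeline(encoder, LinearRegression())
```

Fixing the categories up front means the encoder produces the same 36 columns even if a particular city happens to be missing a month in its training data.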

**Question**: Should month be a continuous or categorical variable?

In [4]:
score('ts__month_hour_model', lambda x: [0] * len(x))

Your score:  -2.80125110686


## fourier_model
Since temperature is roughly sinusoidal, a reasonable model might be

$$ y_t = y_0 \sin\left(2\pi\frac{t - t_0}{T}\right) + \epsilon $$

where $y_0$ and $t_0$ are parameters to be learned and $T$ is one year for seasonal variation.  While this is linear in $y_0$, it is not linear in $t_0$. However, by the angle-sum identity (familiar from Fourier analysis), the above is
equivalent to

$$ y_t = A \sin\left(2\pi\frac{t}{T}\right) + B \cos\left(2\pi\frac{t}{T}\right) + \epsilon $$

which is linear in $A$ and $B$.
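
To see the equivalence, expand the sine of a difference:

$$ y_0 \sin\left(2\pi\frac{t - t_0}{T}\right) = y_0 \cos\left(2\pi\frac{t_0}{T}\right) \sin\left(2\pi\frac{t}{T}\right) - y_0 \sin\left(2\pi\frac{t_0}{T}\right) \cos\left(2\pi\frac{t}{T}\right) $$

so that

$$ A = y_0 \cos\left(2\pi\frac{t_0}{T}\right), \qquad B = -y_0 \sin\left(2\pi\frac{t_0}{T}\right). $$

The original parameters can be recovered from a fitted $A$ and $B$ through $y_0 = \sqrt{A^2 + B^2}$ and $t_0 = \frac{T}{2\pi}\,\operatorname{atan2}(-B, A)$.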

Create a model containing sinusoidal terms on one or more time scales, and fit it to the data using a linear regression.
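
The feature construction might look like the sketch below. It assumes time is expressed as hours elapsed since some arbitrary origin, and uses yearly and daily periods as the time scales; `fourier_features` and the period constants are illustrative names, not part of the course code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

HOURS_PER_DAY = 24.0
HOURS_PER_YEAR = 365.25 * 24.0  # average year length, an approximation


def fourier_features(t_hours, periods=(HOURS_PER_YEAR, HOURS_PER_DAY)):
    """Return a sin and a cos column for each period.

    t_hours: times in hours since an arbitrary origin (an assumption
    about how the year/month/day/hour columns have been combined).
    """
    t = np.asarray(t_hours, dtype=float)
    columns = []
    for period in periods:
        angle = 2.0 * np.pi * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)


def fit_fourier_model(t_hours, temps):
    """Ordinary least squares on the sinusoidal features."""
    return LinearRegression().fit(fourier_features(t_hours), temps)
```

The intercept of the linear regression absorbs the mean temperature, while each sin/cos pair captures the amplitude and phase of one cycle.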

In [None]:
score('ts__fourier_model', lambda x: [0] * len(x))

*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*