# Week 2: Linear Models

# Theory stuff

## Supervised learning: classification and regression
For now, we'll stick to supervised learning: the case of machine learning where we want to learn a function from input features $\vec{x}$ to some output $y$, for which we have a dataset of many examples of inputs and their associated "true" outputs $\{\vec{x}_i, y_i\}$.

Simple supervised learning problems are split into two cases.
**Regression** is the case where each $y_i$ is continuous, a real-valued number, and **classification** is the case where $y_i$ is discrete, one of a few classes.

## Loss functions
The loss function maps from the parameters $\theta$ of a model, and the set of examples (features and their true labels, $\{\vec{x_i}, y_i\}_i$), to a single value that represents how badly the model fits the data.
Minimizing the loss function is equivalent to finding the "optimum" parameter settings, in the sense that if our model was generating the data, it is the model with these parameters that is most likely to have generated the data (instead of a model structured the same way with any other parameters).

We can justify this kind of inversion using Bayes' theorem:
$$
p(\vec{\theta} | \text{data}) = \frac{p(\text{data} | \vec{\theta}) \cdot p(\vec{\theta})}{p(\text{data})} \propto p(\text{data} | \vec{\theta})
$$
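Therefore, as long as we treat every parameter setting as equally likely beforehand (a flat prior $p(\vec{\theta})$), maximizing the probability that our model was the one that generated the data is equivalent to finding the most probable parameters given the data, which is exactly what we want to do.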
We can also compute the loss value on held-out values (a validation set) to see how well the model generalizes to unseen data.
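
Going back to the Bayes' theorem argument, here is one way to connect "most likely parameters" to "minimum loss". This is a sketch under assumptions I haven't stated above: the examples are independent, the inputs are treated as given, and the prior over parameters is flat. Taking the negative log of the likelihood turns the product over examples into a sum:

$$
\begin{align}
\hat{\theta} &= \arg\max_{\vec{\theta}} \, p(\text{data} | \vec{\theta}) = \arg\max_{\vec{\theta}} \prod_{i=1}^n p(y_i | \vec{x}_i, \vec{\theta}) \\
&= \arg\min_{\vec{\theta}} \frac{1}{n} \sum_{i=1}^n -\log p(y_i | \vec{x}_i, \vec{\theta})
\end{align}
$$

Under these assumptions, $-\log p(y_i | \vec{x}_i, \vec{\theta})$ plays the role of a loss for example $i$, and the quantity being minimized is an average over examples.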

To make learning feasible, we usually assume that the loss function is possible to express as an average of per-example loss functions:
$$ L_\text{total}(\{\vec{x_i}, y_i\}_i^n, \vec{\theta}) = \frac{1}{n} \sum_i^n L((\vec{x}_i, y_i), \vec{\theta}) $$

Often the loss over the entire dataset is called a "cost function" and the loss over individual examples is just called a "loss function," so I'll use that notation from now on.
Either way, it's important that the loss function is differentiable so we can minimize it with gradient descent.

### Squared error
The (mean) squared error (MSE) is the "default" loss function for regression:
$$ L(y, \hat{y}) = (y - \hat{y})^2 $$
where $y$ is the true value to be predicted, and $\hat{y}$ is the model's prediction.

It penalizes the model for mistakes super-linearly, giving larger weight to big mistakes than small ones.
This can make MSE vulnerable to outliers.
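
To make the outlier point concrete, here is a minimal sketch in plain Python (the function names and the toy numbers are made up for illustration). Both sets of predictions have the same total absolute error, but the one that concentrates the error in a single example gets a much larger MSE:

```python
def squared_error(y, y_hat):
    """Per-example loss: squared difference between target and prediction."""
    return (y - y_hat) ** 2


def mean_squared_error(targets, predictions):
    """Cost over a dataset: the average of the per-example losses."""
    losses = [squared_error(y, y_hat) for y, y_hat in zip(targets, predictions)]
    return sum(losses) / len(losses)


targets = [1.0, 2.0, 3.0, 4.0]
spread_out = [1.5, 2.5, 3.5, 4.5]   # four errors of 0.5 (total absolute error 2)
one_outlier = [1.0, 2.0, 3.0, 6.0]  # one error of 2.0 (total absolute error 2)

mse_spread = mean_squared_error(targets, spread_out)
mse_outlier = mean_squared_error(targets, one_outlier)
print(mse_spread, mse_outlier)  # 0.25 1.0
```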

### Cross-entropy
Cross-entropy loss (or "log loss") is the most common loss function for classification.

The output of a classification model is a vector of probabilities, one per class, that should sum up to one.
The $i$th probability represents how confident the model is that the input corresponds to class $i$.
In an information-theoretic sense, cross-entropy loss measures how different the probability distribution given by the model is from the empirically-measured distribution for that one example (0 probability for every class except the true one, which has probability 1).

As the model's confidence in the correct class approaches 1, the cross-entropy loss approaches 0.
As the model's confidence in the correct class approaches 0, the cross-entropy loss increases to infinity.

#### The binary case
In binary classification, your model outputs a single probability $p(x)$: the probability that the output label is 1 (which is one minus the probability that the label is 0).
The true label $y$ is 0 or 1, and the loss is
$$L(\vec{x}, y; \vec{\theta}) = - y \cdot \log (p) - (1 - y) \cdot \log (1 - p)
= \begin{cases} -\log(p), & \text{if } y = 1 \\ -\log(1-p), & \text{if } y = 0 \end{cases}
$$
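
A small sketch of the binary loss in plain Python (the function name is mine, and `p` is assumed to be strictly between 0 and 1 so the logs are finite). It shows the loss shrinking as the model gets more confident in the correct label, and growing when it is confidently wrong:

```python
import math


def binary_cross_entropy(y, p):
    """Loss for a true label y in {0, 1} and predicted probability p = P(y = 1)."""
    return -y * math.log(p) - (1 - y) * math.log(1 - p)


for p in (0.5, 0.9, 0.99):
    # First column: loss if the true label is 1; second: loss if it is 0
    print(p, binary_cross_entropy(1, p), binary_cross_entropy(0, p))
```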

#### The multi-class case
In multi-class classification (with multinomial cross-entropy loss), we turn the true label into a "one-hot encoding" -- a vector with one entry per possible class, that's zero everywhere except the true class.
For example, for a problem with $n=4$ classes and an example with correct label of 1, the one-hot encoding is 
$$p_\text{true}(y) = \begin{bmatrix}0 & 1 & 0 & 0 \end{bmatrix}$$
which defines a discrete "probability distribution" over possible labels that just assigns probability 1 to the correct label.
Our model outputs a vector of confidences, $p_\text{model}(y)$.
The loss is
$$L = -\sum_{i=0}^{n-1} p_\text{true}(y_i) \log (p_\text{model}(y_i))
= - \log p_\text{model}(y_c)$$
where $c$ is the index of the true class (every other term in the sum is multiplied by zero).
Often this is computed by taking the dot product of the one-hot encoded label vector with the probability vector, then taking the negative log of the result.

Note that this loss function only cares about what probability the model assigns to the correct class, not any of the other classes.
This can make computing cross-entropy loss more efficient (i.e. by not computing the probabilities assigned to any other class).
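
Here is a minimal sketch of the multi-class loss in plain Python, using the 4-class example with true label 1. The helper names and the model's probability vector are made up for illustration. It computes the loss both as the full sum and via the dot-product shortcut, and the two agree:

```python
import math


def one_hot(label, n_classes):
    """Vector that is 1 at the true class and 0 everywhere else."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def cross_entropy(p_true, p_model):
    """Full sum; zero-probability entries of p_true contribute nothing."""
    return -sum(t * math.log(q) for t, q in zip(p_true, p_model) if t > 0)


p_true = one_hot(1, 4)             # [0, 1, 0, 0]
p_model = [0.1, 0.7, 0.1, 0.1]     # hypothetical model output

loss_full = cross_entropy(p_true, p_model)

# Shortcut: the dot product picks out the probability of the true class
p_correct = sum(t * q for t, q in zip(p_true, p_model))
loss_shortcut = -math.log(p_correct)

print(loss_full, loss_shortcut)
```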

#### Aside: the other terms in the Bayes theorem derivation
There are two terms we didn't talk about.
$p(\text{data})$ is the _probability of seeing the data we did under the true data generating process_, commonly called the "evidence", and it generally doesn't impact our model-building.

The other term, $p(\theta)$, is much more interesting.
It's our prior belief, represented as a probability distribution, about what kinds of parameters we expect to see.
In machine learning, we almost always express our priors through how the model is structured, how we preprocess (or augment) the data, and what kind of regularization we use.
Using neural networks (next week), for instance, expresses a strong prior that the function we want to learn is a _hierarchical composition of simple patterns_.

#### Aside: other loss functions
Lots of the time, other loss functions make sense to use.
It's problem-dependent.
For example, you might want to use dice-coefficient loss to deal with unbalanced classes in problems like classifying each pixel of an image (the "semantic segmentation" problem).
MSE and cross-entropy are good default choices, though.

## Stochastic and minibatch gradient descent
Since the loss over the entire dataset decomposes into an average of per-example losses, we can approximate the total loss with an average over a smaller number of examples:
$$L(\{\vec{x}_i, y_i\}; \vec{\theta}) = \frac{1}{n} \sum_i^n L((\vec{x}_i, y_i), \vec{\theta}) \approx \frac{1}{m} \sum_i^m L((\vec{x}_i, y_i), \vec{\theta})$$
where $m$ is the "batch size", or the number of examples used to approximate the true dataset loss in this case.

This is called "minibatch gradient descent" or "stochastic gradient descent" (SGD); sometimes the latter term is reserved for the case $m=1$.
It can converge faster than full-batch gradient descent because small batches are often sufficient to approximate the average loss reasonably well, and because the gradient of the minibatch loss points, on average, in the same direction as the gradient of the full dataset loss.
As a result, minibatch gradient descent is the standard way to train most machine learning models.
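
Here is a minimal sketch of why minibatches work, in plain Python on made-up data (the dataset, the one-weight model $\hat{y} = wx$, and the batch size of 8 are all assumptions for illustration). Each minibatch gradient is a bit off on its own, but when the data is split into equal-sized batches they average out to the full-dataset gradient:

```python
import random


def grad_mse(w, batch):
    """Gradient of the mean squared error of y_hat = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


random.seed(0)
# Hypothetical toy data: y = 3x plus a little Gaussian noise
data = [(i / 10, 3.0 * (i / 10) + random.gauss(0, 0.1)) for i in range(40)]

w = 0.0
m = 8  # batch size

full_grad = grad_mse(w, data)

# Shuffle, then cut the dataset into disjoint minibatches of size m
random.shuffle(data)
batches = [data[i:i + m] for i in range(0, len(data), m)]
batch_grads = [grad_mse(w, batch) for batch in batches]
mean_batch_grad = sum(batch_grads) / len(batch_grads)

print('Full gradient:      ', full_grad)
print('Minibatch gradients:', batch_grads)
print('Their average:      ', mean_batch_grad)
```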

Some considerations when picking a batch size:
 - Smaller batches provide a less-accurate estimate of the direction to move parameters in at each step
 - Larger batches take a longer time to compute
 - [There is evidence that the noise introduced from small batches is beneficial for reaching optima that generalize out-of-sample (geometrically, "broad" minima instead of "sharp" ones)](https://arxiv.org/abs/1609.04836)
 - [In some sense lowering the learning rate and increasing the batch size have the same effect](https://arxiv.org/abs/1711.00489)
 - Large batch sizes compute faster per-example on GPUs, making training the model overall faster (by reducing the frequency with which the GPU and CPU need to swap memory)
 - Large batches may not fit on the available memory (eg video RAM in a GPU)
 - Power-of-two batch sizes compute faster on GPUs (because of how "tensor cores" work)
 
Common batch sizes are anywhere from 4 to 64, sometimes higher.

## Linear regression
Linear regression is the simplest model we'll cover.
Linear regression with a single output variable has as parameters a vector of weights $\vec{w}$ and a scalar bias $b$; given a vector of input features $\vec{x}$, the model predicts $\hat{y} = \vec{w}^T \vec{x} + b$.
The output variable is a linear combination of the inputs weighted by the weight vector, plus a bias that captures something like the mean output.

While linear algebra gives us a closed-form solution, in practice it's inefficient with large datasets.
Instead, learning $\vec{w}$ and $b$ with gradient descent on mean-squared error produces the same optimum, since the learning problem for linear regression is convex.
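
As a sanity check on that claim, here is a sketch in plain Python with one input feature. The five data points, the learning rate, and the number of steps are made up for illustration. It fits the same data with the closed-form least-squares formulas and with gradient descent on MSE, and the two end up at the same weight and bias:

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]  # hypothetical data, roughly y = 2x + 1
n = len(xs)

# Closed-form least squares for a single feature
mean_x = sum(xs) / n
mean_y = sum(ys) / n
w_exact = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
           / sum((x - mean_x) ** 2 for x in xs))
b_exact = mean_y - w_exact * mean_x

# Gradient descent on the mean squared error
w, b = 0.0, 0.0
learning_rate = 0.05
for _ in range(5000):
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2 * sum(errors) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print('Closed form:     ', w_exact, b_exact)
print('Gradient descent:', w, b)
```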

Linear regression can actually be used to regress against any handmade set of features computed from the data, so long as the output is expected to be a linear combination of the features.
For instance, we can fit a quadratic polynomial by computing the square of every input feature, and treating them as completely new features: 
$$\vec{x} = \begin{bmatrix} x_1 & x_2 & \cdots \end{bmatrix} \rightarrow  \begin{bmatrix} x_1 & x_2 & \cdots & x_1^2 & x_2^2 & \cdots \end{bmatrix}$$
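
The feature expansion above can be sketched in a few lines of plain Python (the function names and the example weights are hypothetical). The model stays linear in its weights; only the feature vector gets longer:

```python
def add_squared_features(x):
    """Map [x1, x2, ...] to [x1, x2, ..., x1^2, x2^2, ...]."""
    return list(x) + [v ** 2 for v in x]


def predict(weights, bias, features):
    """Ordinary linear regression prediction on whatever features it is given."""
    return sum(w * f for w, f in zip(weights, features)) + bias


features = add_squared_features([2.0, -3.0])
print(features)  # [2.0, -3.0, 4.0, 9.0]

# Hypothetical weights: one per original feature, then one per squared feature
print(predict([1.0, 0.0, 0.5, 0.0], 1.0, features))  # 1*2 + 0.5*4 + 1 = 5.0
```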

## Logistic regression
Despite the name, logistic regression is the linear model for classification.
Linear regression on the probability that a particular class is right doesn't make much sense, because:
 - Individual class probabilities must be in the range [0, 1] and linear regression may produce values outside of this range
 - All class probabilities must sum to 1

Logistic regression models follow a few steps:
 1. For each class, perform linear regression to an "unnormalized probability" for that class. This value is called the **logit**, and it corresponds to the natural log of the odds (as opposed to probability) that the class is correct. This linear regression makes sense, since the log-odds can take any value from $-\infty$ to $\infty$.
 2. Convert the vector of log-odds to a valid probability distribution. In the one-variable case, this is done through the _logistic function_. In the multi-variable case, it means applying the _softmax function_.
 3. Train the model (usually using cross-entropy loss) on the computed probability distribution over classes.

#### One-variable case and the logistic function
In the one-variable case, $y \in \{0, 1\}$ and the model is as follows:

$$
\begin{align}
&l = \vec{w}^T \vec{x} + b \\
&p = \sigma(l) = \frac{e^l}{e^l + 1} \\ 
&L = -y \log(p) - (1 - y) \log (1 - p)
\end{align}
$$

where $l$ is the logit (the log of the odds that $y=1$), $\sigma$ is the logistic function, $p$ is the model probability that $y=1$, and $L$ is the cross-entropy loss.
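
The three equations above can be sketched in plain Python for a single input feature. The toy data, learning rate, and step count are made up for illustration, and the gradient uses the standard result that the derivative of the cross-entropy loss with respect to the logit is $p - y$:

```python
import math


def sigmoid(l):
    return 1.0 / (1.0 + math.exp(-l))


# Hypothetical 1-D data: the label is 1 for positive x and 0 for negative x
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]


def mean_loss(w, b):
    total = 0.0
    for x, y in zip(xs, ys):
        l = w * x + b                                          # logit
        p = sigmoid(l)                                         # probability that y = 1
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)  # cross-entropy
    return total / len(xs)


w, b, learning_rate = 0.0, 0.0, 0.5
initial_loss = mean_loss(w, b)
for _ in range(200):
    # dL/dl = p - y, then the chain rule through l = w * x + b
    residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = sum(r * x for r, x in zip(residuals, xs)) / len(xs)
    grad_b = sum(residuals) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
final_loss = mean_loss(w, b)

print('Loss before and after training:', initial_loss, final_loss)
print('P(y=1 | x=2):', sigmoid(w * 2.0 + b))
```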

The logistic function converts from log-odds to probability by exponentiating the logit (to undo the log), then computing the ratio of the odds (numerator) to the total odds (denominator).
![logistic function](https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg)
Image credit: [Wikipedia article on logistic regression](https://en.wikipedia.org/wiki/Logistic_regression)

The logistic function...
 - Maps every real input into the open interval (0, 1)
 - Is continuous and strictly increasing
 - "Saturates" (getting near-zero derivative) for high and low input values
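
These properties are easy to check numerically. The sketch below, in plain Python with function names of my own choosing, writes the logistic function as odds over total odds, inverts it to recover the log-odds, and prints the derivative $p(1-p)$ to show the saturation at both ends:

```python
import math


def sigmoid(l):
    """Logistic function: odds / (odds + 1), where odds = exp(l).

    Written this way to mirror the formula; exp(l) overflows for l above ~709.
    """
    odds = math.exp(l)
    return odds / (odds + 1.0)


def log_odds(p):
    """Inverse of the logistic function: probability back to logit."""
    return math.log(p / (1.0 - p))


def sigmoid_derivative(l):
    p = sigmoid(l)
    return p * (1.0 - p)


for l in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(l, sigmoid(l), sigmoid_derivative(l))
```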
 
#### Multivariable case and the softmax function
This can be generalized to the case of more than two classes.
In this case, there is one logit value per class, $l_i$, and so there is one linear regression per class.
Then the softmax function is applied, which converts the logits into probabilities:
$$ p_i = \frac{\exp(l_i)}{\sum_j \exp(l_j)}$$
It works exactly the same way as the logistic function, taking the ratio of the unnormalized probability of one class to the total of unnormalized probabilities.

The probabilities computed by the softmax function...
 - Are each individually in the range \[0, 1\]
 - Together sum to 1
 - Increase monotonically with their associated logit
 
Another interpretation of the softmax function is that it's a "softened" or "smoothed" version of the argmax function, which sets the largest element of a vector to 1 and all other elements to 0.
The softmax function just makes the largest element near 1 and all of the other elements near 0 -- the fact that it's "softened" allows it to be differentiable, and so we can use it for gradient descent.
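
A minimal softmax sketch in plain Python (the example logits are made up). Subtracting the largest logit before exponentiating is a common numerical-stability trick that isn't mentioned above; it doesn't change the result because the factor cancels in the ratio. The last lines check that softmax over the two logits $[l, 0]$ matches the logistic function of $l$:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtracted only to avoid overflow in exp
    exps = [math.exp(l - largest) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Widely separated logits give something close to a hard argmax
sharp = softmax([20.0, 10.0, 1.0])
print(sharp)

# With two classes and logits [l, 0], softmax reduces to the logistic function
two_class = softmax([1.5, 0.0])[0]
logistic = 1.0 / (1.0 + math.exp(-1.5))
print(two_class, logistic)
```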

## Backpropagation and the chain rule
The backpropagation algorithm, and how it combines with gradient descent to make differentiable programming a powerful way of writing models, is probably the most important concept from this class.

The critical idea is this: _if you have a computational graph made of only differentiable operations, you can efficiently compute the derivative of any one value with respect to every other value at once_.

By using the [multivariate chain rule](https://en.wikipedia.org/wiki/Chain_rule#Higher_dimensions), we can easily (and automatically!) compute the derivative of a value with respect to a single other value, assuming the derivatives of each individual operation are known:
![backpropagation on a computational graph](https://colah.github.io/posts/2015-08-Backprop/img/tree-eval-derivs.png)
In the above graph, all of the operations, values, and single-edge derivatives are labeled.
To compute the partial derivative of one value $A$ with respect to another value $B$, traverse every possible path from $B$ to $A$; within a path, multiply each of the edge partial derivatives together, and at the end add up all of the path derivatives to get the _total derivative_.
For instance, $\frac{\partial e}{\partial b} = (\frac{\partial e}{\partial c} \times \frac{\partial c}{\partial b}) + (\frac{\partial e}{\partial d} \times \frac{\partial d}{\partial b}) = (2 \times 1) + (3 \times 1) = 5$.
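
To check the arithmetic, here is a small sketch in plain Python. I'm assuming the graph in the figure is the one from the linked post, $c = a + b$, $d = b + 1$, $e = c \cdot d$, evaluated at $a = 2$, $b = 1$; the operations aren't spelled out in the text here, so treat them as an assumption. The sum over paths is compared against a finite-difference estimate:

```python
def forward(a, b):
    """Assumed graph: c = a + b, d = b + 1, e = c * d."""
    c = a + b
    d = b + 1
    e = c * d
    return c, d, e


a, b = 2.0, 1.0
c, d, e = forward(a, b)

# Derivative on each edge, evaluated at the current values
de_dc = d    # e = c * d
de_dd = c    # e = c * d
dc_db = 1.0  # c = a + b
dd_db = 1.0  # d = b + 1

# Multiply along each path from b to e, then add the paths
de_db = de_dc * dc_db + de_dd * dd_db

# Independent check: central finite difference on the whole function
eps = 1e-6
numeric = (forward(a, b + eps)[2] - forward(a, b - eps)[2]) / (2 * eps)

print(de_db, numeric)
```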

To understand this, **convince yourself** (really!) that:
 1. The single-variable chain rule means that multiplying the partial derivatives along a single path gives the contribution of that path to the derivative of the output
 2. The multi-variable chain rule means that we can add each of the paths from $b$ to $e$ to get its total contribution to the derivative

This process of **automatic differentiation** (abusing terminology slightly) makes it possible to compute the derivative of a single value with respect to another single value by traversing each path once.

However, if we want to compute the gradient of the loss function over a minibatch with respect to the parameters of our model (i.e. the `tf.Variable`s) so we can update them all with gradient descent, we need the partial derivative of the loss with respect to each parameter individually.
Naively "summing over paths" means we might wind up computing the same value many times, making the algorithm inefficient.

Instead, we step backwards through the graph, first computing the partial derivative with respect to the direct parents of the desired node ($\frac{\partial e}{\partial c}$ and $\frac{\partial e}{\partial d}$), then computing the partial derivative with respect to their parents, and so on.
You will notice that each of these computations makes use of values we already have (namely, the partial derivatives computed further-along in the graph) and only a single new edge, multiplying the saved derivative by the newly-computed derivative on that edge.
Thus, once we've computed the partial derivative of the loss with respect to every parameter, we have computed the derivative on each edge exactly once.

You should think of _values flowing forwards through a network_ (and being saved, since they're needed for differentiation) and _derivatives flowing backwards_.
Alternatively, this is a dynamic-programming solution to the problem of computing all of these derivatives, where we get a (worst-case) exponential-time speedup by computing the values in the correct order and caching them.
Computing the gradient of a single value (the loss function) with respect to every value in the graph takes only the time to compute a _forwards pass_ (to compute the loss), plus the sum of the times it takes to compute the derivatives of every individual operation (the _backwards pass_).
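
The forwards and backwards passes can be sketched by hand in plain Python. As an assumption (the operations aren't written out in the text), I'm using the small graph from the linked post: $c = a + b$, $d = b + 1$, $e = c \cdot d$ with $a = 2$, $b = 1$. Each edge derivative is used exactly once, and the derivative of $e$ with respect to every node comes out of a single backwards sweep:

```python
a, b = 2.0, 1.0

# Forwards pass: compute and save every intermediate value
c = a + b
d = b + 1.0
e = c * d

# Backwards pass: walk the graph in reverse, reusing derivatives already computed
grad = {'e': 1.0}                              # de/de
grad['c'] = grad['e'] * d                      # edge c -> e has derivative d
grad['d'] = grad['e'] * c                      # edge d -> e has derivative c
grad['a'] = grad['c'] * 1.0                    # edge a -> c has derivative 1
grad['b'] = grad['c'] * 1.0 + grad['d'] * 1.0  # b feeds both c and d, so add

print(grad)
```

This is only a hand-unrolled illustration; a real automatic differentiation system records the operations during the forwards pass and generates the backwards sweep for you.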

The end result is a beautiful system where so long as we can build a model out of differentiable operations, we can efficiently optimize it with gradient descent (subject, of course, to optimization's usual quibbles about smoothness assumptions, local minima, etc).
This is what differentiable programming is all about!

#### Outside reading on backpropagation
Since this is a three unit class I can't require you to read these, but I can _promise_ they'll be worth your time, and that once you read both you'll deeply understand one of the most important algorithms in the modern world:
 - ["Calculus on Computational Graphs: Backpropagation" (Chris Olah)](https://colah.github.io/posts/2015-08-Backprop/) quickly and intuitively derives backpropagation. If you only read one of these resources, make it this one.
 - ["Hacker's guide to Neural Networks" (Andrej Karpathy)](https://karpathy.github.io/neuralnets/) derives the chain rule and backpropagation from scratch, from the perspective of math as combining "circuits" (operations) and from a "hacking" or "coding" perspective
 
[Image credit: Colah's Blog](https://colah.github.io/posts/2015-08-Backprop/)
(Also, credit to the blog for examples I've taken here)

# TensorFlow stuff

## Loading data
In this case, I can't give a better explanation than [the TensorFlow official guide](https://www.tensorflow.org/guide/datasets).
Read the following sections:
 - [Basic mechanics (all subsections)](https://www.tensorflow.org/guide/datasets#basic_mechanics)
 - [Consuming NumPy arrays](https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays)
 - [Batching dataset elements](https://www.tensorflow.org/guide/datasets#batching_dataset_elements)
 - [Training workflows](https://www.tensorflow.org/guide/datasets#training_workflows)

`tf.data.Dataset` can be iterated over with a for-loop, and this is how I usually write my training loops, though an iterator can also be declared explicitly.
How you structure your input pipeline is mostly up to you (and what makes sense for the problem).

If you're interested in making performant data pipelines, check out the [official guide on data input pipeline performance](https://www.tensorflow.org/guide/performance/datasets).

## Saving and restoring variables
To save your model for later use, first create a `tf.train.Checkpoint(model=your_model)`, then run `checkpoint.write('file_name')`. To restore your model, first create a new instance of it, and a new checkpoint for your new model, `tf.train.Checkpoint(model=new_model)`. Then run `new_checkpoint.restore('file_name')`.

To learn more (saving and loading only specific variables, and the `SavedModel` abstraction for serving models), look at the [official guide](https://www.tensorflow.org/guide/saved_model).

## TensorBoard
TensorBoard encapsulates all of the many visualization tools that come with TensorFlow.
This week we'll cover two of its dashboards: Graphs (for visualizing the computational graph) and Scalars (for plotting values as we run).

### Running TensorBoard
When using TensorBoard, you first need at least one [`tf.summary.SummaryWriter`](https://www.tensorflow.org/api_docs/python/tf/summary/SummaryWriter) to create log files in some directory. Use `tf.summary.create_file_writer('logdir')`.

Then, run `tensorboard --logdir=path` in your shell, where `path` is the log directory you specified for the `SummaryWriter` (or a parent directory of it), and go to the URL `localhost:6006` in a browser.

### Graph visualization
To visualize a graph, you need to create a trace function, which needs to be a `tf.function`. A Python function can be compiled into a TensorFlow graph by adding the `@tf.function` decorator, but be warned it must only take in TensorFlow arguments like `tf.Tensor`s, not things like Python `int`s. This trace function should generally compute the final output of your model. Note that it will not work properly if any functions called within it also have a `@tf.function` decorator, as this will lead to a malformed graph. Inspecting the graph is one of the best ways to debug a model: make sure the right operations are connected and the shapes are correct.

Immediately before you run your trace function in your training loop, enable tracing with `tf.summary.trace_on(graph=True, profiler=True)`. The `graph` argument lets us see the computational graph while the `profiler` argument adds things like memory and CPU time. Run your trace function, then export the trace to TensorBoard with:
```
with writer.as_default():
    tf.summary.trace_export('name', step=i, profiler_outdir='logdir')
```

Don't do this for every run though, as it is quite computationally expensive and you only need to record your computational graph once.

To expand grouped nodes (for instance, under a `name_scope`), double click the node.
Nested name scopes can really make a graph a lot more readable.
There are plenty of other options worth playing around with too, in the left sidebar.

More detailed info in the [official guide on graph visualization](https://www.tensorflow.org/guide/graph_viz).

### Plotting scalars
Often there are scalars in your model worth keeping track of throughout training and inference.
Obvious examples include loss and accuracy -- if loss is decreasing and accuracy is increasing, the model is training correctly.

To keep track of a scalar, run:
```
with writer.as_default():
    tf.summary.scalar('name', scalar_tensor, step=i)
```

This writes the scalar to the `SummaryWriter`'s log files, which makes it visible on TensorBoard. Note that we must provide a `step` argument, which I will usually use to enumerate batches.

Using multiple `SummaryWriter`s pointed at different directories, and pointing TensorBoard's `logdir` at a parent directory of both, lets you plot multiple runs at once.
This is useful for plotting train and test statistics, or runs with different datasets or hyperparameters.

More detailed info in the [official guide on summaries](https://www.tensorflow.org/guide/summaries_and_tensorboard).

# Example: linear regression
A problem worked end-to-end with TensorFlow should make this all a lot more concrete.
This performs a linear regression on the Boston house-prices dataset.

In [1]:
%matplotlib inline
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

### Loading data
I'm using a scikit-learn built-in dataset for convenience: Boston house sale prices, in thousands of dollars, and some properties of the house as features.
We'll select two features to linearly regress from: number of rooms ('RM'), and percentage of the population that's lower-status ('LSTAT').

In [2]:
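# Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release
from sklearn.datasets import load_boston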

# Load data
dataset = load_boston()
print('Feature names:', dataset.feature_names, '\n')

x_all = dataset.data
y_all = dataset.target

# Shuffle some features and targets together
together = np.concatenate([x_all[:, [5, 12]], 
                           np.expand_dims(y_all, axis=1)], 
                          axis=1)
np.random.shuffle(together)
x_all = together[:, :-1]
y_all = together[:, -1]

print('Input shape:', x_all.shape)
print('Target shape:', y_all.shape)

# Split data into train and test sets
n_points = x_all.shape[0]
n_features = x_all.shape[1]
n_train = int(n_points * 0.7)
n_test = n_points - n_train

x_train, x_test = np.split(x_all, [n_train], axis=0)
y_train, y_test = np.split(y_all, [n_train])

Feature names: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT'] 

Input shape: (506, 2)
Target shape: (506,)


### Data pipeline

In [3]:
n_epochs = 1000
batch_size = 128
n_batches_per_epoch_train = n_train // batch_size
n_batches_per_epoch_test  = n_test  // batch_size

In [4]:
# Make one dataset for training data and one for test data
# Each dataset is cached in memory, then shuffled and batched; caching before
# shuffling means the order is re-randomized on every pass over the data
dataset_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))\
    .cache().shuffle(500).batch(batch_size)
dataset_test = tf.data.Dataset.from_tensor_slices((x_test, y_test))\
    .cache().shuffle(500).batch(batch_size)

In [5]:
train_writer = tf.summary.create_file_writer('./logs_lecture/train')
test_writer = tf.summary.create_file_writer('./logs_lecture/test')

### Model graph

In [6]:
# We write our models as classes because it is a very convenient format that lets
# us initialize and use TensorFlow's features easily
class LinearRegression(tf.Module):
    # If we wanted to, we could take in the number of inputs and outputs as arguments to the
    # constructor to set the shape of our weights and bias
    def __init__(self, name=None):
        # Inherit functionality of `tf.Module` super class
        super().__init__(name)
        # Variables: a weights vector (one weight per feature) initialized uniformly
        # and a bias scalar, initialized to zero
        self.weights = tf.Variable(tf.initializers.glorot_uniform()(shape=(n_features,), dtype=tf.float64), \
                                   name='weights')
        self.bias = tf.Variable(tf.zeros_initializer()(shape=(), dtype=tf.float64), name='bias')
    
    # We don't need to do it this way, but making our model callable is convenient for when we want
    # to do predictions as we can just use `model(x)` to get predictions on input x
    def __call__(self, x):
        return tf.reduce_sum(x * self.weights, axis=1) + self.bias

In [7]:
# We define a wrapper function with a `@tf.function` decorator in order
# to pass the model through so we can get our computational graph on TensorBoard

# We could technically use the `@tf.function` decorator on the `__call__` method above
# and simply call `model(x)` for a trace in our training loop
@tf.function
def trace_function(model, x):
    model(x)

In [8]:
# Compute MSE loss
def _loss(target, actual):
    mse_per_example = tf.pow(actual - target, 2) # Shape: (?)
    mse_batch = tf.reduce_mean(mse_per_example)  # Shape: ()
    return mse_batch

In [9]:
# Add an optimizer to the graph
optimizer = tf.optimizers.SGD(1e-4)

# I use a low learning rate here because MSE can lead to 
# high loss values at the start of training

# This is one iteration of training
def train(model, x, y, i):
    # Record the forward pass so the tape can differentiate the loss
    with tf.GradientTape() as g:
        loss = _loss(y, model(x))
    # `tf.Module` collects the model's `tf.Variable`s automatically; they are
    # trainable unless created with `trainable=False`
    grads = g.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Track how loss and bias change throughout training by
    # writing summaries for TensorBoard
    with train_writer.as_default():
        tf.summary.scalar('loss', loss, step=i)
        tf.summary.scalar('bias', model.bias, step=i)

# This is one iteration of validation
def test(model, x, y, i):
    loss = _loss(y, model(x))
    with test_writer.as_default():
        tf.summary.scalar('loss', loss, step=i)
    # We are returning loss because we want to print it during our training loop
    return loss

In [10]:
model = LinearRegression()
train_batch = 0
test_batch = 0

# Training loop
for i in range(n_epochs):
    # Iterate over dataset once
    for x, y in dataset_train:  
        if train_batch == 0:
            # On the first batch, run a full trace
            tf.summary.trace_on(graph=True, profiler=True)
            # We simply run this operation to add our graph to TensorBoard
            trace_function(model, x)
            with train_writer.as_default():
                tf.summary.trace_export(name='first training batch', step=0, profiler_outdir='./logs_lecture')
        # Call train iteration
        train(model, x, y, train_batch)
        train_batch += 1
    
    # Validation on every hundredth epoch
    if i % 100 != 0:
        continue

    print('Epoch:\t', i)
    test_losses = [] # Track average loss over test set
    for x, y in dataset_test:
        test_losses.append(test(model, x, y, test_batch))
        # Roughly align test batches with training batches
        test_batch += 100

    print('Average Test Set Loss:\t', np.mean(test_losses))
    
# Save model
checkpoint = tf.train.Checkpoint(model=model)
checkpoint.write('./checkpoints_lecture/model')

Epoch:	 0
Average Test Set Loss:	 556.9833048073272
Epoch:	 100
Average Test Set Loss:	 98.08206415370853
Epoch:	 200
Average Test Set Loss:	 41.108372156117
Epoch:	 300
Average Test Set Loss:	 25.842196225964592
Epoch:	 400
Average Test Set Loss:	 21.881191394440087
Epoch:	 500
Average Test Set Loss:	 20.923927114978262
Epoch:	 600
Average Test Set Loss:	 20.732184711250753
Epoch:	 700
Average Test Set Loss:	 20.717698469873227
Epoch:	 800
Average Test Set Loss:	 20.734063345206224
Epoch:	 900
Average Test Set Loss:	 20.749389107074357


'./checkpoints_lecture/model'

In [11]:
# Restore model
new_model = LinearRegression()
new_check = tf.train.Checkpoint(model=new_model)
new_check.restore('./checkpoints_lecture/model')

# Look at some test-set predictions
values, _ = next(iter(dataset_test))
# Check restored model is the same
tf.print(new_model(values))
tf.print(model(values))

[20.172116736291237 24.061122628252214 19.59290015280488 ... 21.774383212883116 27.565419189312387 19.571077915391154]
[20.172116736291237 24.061122628252214 19.59290015280488 ... 21.774383212883116 27.565419189312387 19.571077915391154]
