In [None]:
from training_functions import make_tfr_input_fn
import tensorflow as tf
tf.__version__

Take the file pattern from [Beam_Pipelines.ipynb](Beam_Pipelines.ipynb):

In [None]:
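import os

# Read the working directory written by the Beam pipeline notebook.
# strip() guards against a trailing newline ending up in the path.
with open('temp_dir.txt') as f:
    temp_dir = f.read().strip()
file_pattern = os.path.join(temp_dir, "training.tfr-*")
file_pattern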

When we call ```train_input_fn```, we get tensors that iterate through the training files and yield a new batch of records each time they are evaluated. In the end we will pass ```train_input_fn``` to the estimator, so that it can build the computational graph for the input stream within its own session and graph context.

In [None]:
train_input_fn = make_tfr_input_fn(
    filename_pattern=file_pattern,
    batch_size=1000, 
    options={'num_epochs': None,  # repeat infinitely
             'shuffle_buffer_size': 1000,
             'prefetch_buffer_size': 1000,
             'reader_num_threads': 10,
             'parser_num_threads': 10,
             'sloppy_ordering': True,
             'distribute': False})
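
The implementation of ```make_tfr_input_fn``` lives in ```training_functions``` and is not shown here. As a rough mental model only, the following pure-Python sketch shows what options named like ```num_epochs```, ```shuffle_buffer_size``` and ```batch_size``` conventionally mean for an input stream. The function ```batches``` and its arguments are hypothetical and are not part of the notebook's code.

```python
import itertools
import random

def batches(records, batch_size, shuffle_buffer_size, num_epochs=None, seed=0):
    """Hypothetical sketch: yield shuffled batches from a list of records.

    num_epochs=None repeats the records forever, mirroring the
    'repeat infinitely' comment in the options above.
    """
    rng = random.Random(seed)
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    stream = (rec for _ in epochs for rec in records)
    buffer, batch = [], []
    for rec in stream:
        buffer.append(rec)
        if len(buffer) < shuffle_buffer_size:
            continue  # keep filling the shuffle buffer first
        # Emit a random element of the buffer instead of the newest record.
        batch.append(buffer.pop(rng.randrange(len(buffer))))
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Only reached for a finite number of epochs: drain what is left.
    rng.shuffle(buffer)
    for rec in buffer:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Infinite stream: take only the first three batches.
first_three = list(itertools.islice(
    batches(list(range(10)), batch_size=4, shuffle_buffer_size=3), 3))

# One epoch: every record appears exactly once across all batches.
one_epoch = list(batches(list(range(10)), batch_size=4,
                         shuffle_buffer_size=3, num_epochs=1))
```

A larger shuffle buffer mixes records more thoroughly at the cost of memory, which is the usual trade-off behind such an option.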

---
Here, we take the tensors that provide the input stream and create an input layer from them, as described in detail in [Input_Functions.ipynb](Input_Functions.ipynb). Then we create a single ```Dense``` layer, which in essence provides the hypothesis function - a linear regression model for predicting the humidity from the full $170$-dimensional input:

$$
h(\beta_1, \beta_2, \vartheta_1, \vartheta_2, \dots, \vartheta_{168} ) 
= (A_1, A_2, B_1, B_2, \dots, B_{168}) \cdot
\left( 
\begin {array} {c}
\beta_1 \\
\beta_2 \\
\vartheta_1 \\
\vartheta_2 \\
\dots \\
\vartheta_{168} 
\end{array}
\right) + C
$$

$\vartheta_i$ are the components of the one-hot encoded $168$-dimensional vector for the hour-of-the-week.
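
To make the formula concrete, here is a minimal pure-Python sketch of the same hypothesis for a single record. It assumes the one-hot index is laid out as ```day * 24 + hour```; the helper names and the example weights are made up for illustration and do not come from ```training_functions```.

```python
def one_hot_hour_of_week(day, hour):
    """168-dimensional one-hot vector; index = day * 24 + hour (assumed layout)."""
    vec = [0.0] * 168
    vec[day * 24 + hour] = 1.0
    return vec

def linear_hypothesis(betas, thetas, A, B, C):
    """h = A . beta + B . theta + C, exactly as in the formula above."""
    return (sum(a * b for a, b in zip(A, betas))
            + sum(w * t for w, t in zip(B, thetas))
            + C)

# Illustrative weights: two for the beta inputs, 168 for the hour-of-week.
A = [0.5, -0.25]
B = [0.0] * 168
B[2 * 24 + 13] = 0.1   # weight for day 2, 13:00
C = 0.3

thetas = one_hot_hour_of_week(day=2, hour=13)
h = linear_hypothesis([1.0, 2.0], thetas, A, B, C)
```

Because $\vartheta$ is one-hot, the product $B \cdot \vartheta$ simply selects the single weight that belongs to the record's hour of the week.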

In [None]:
from training_functions import input_layer
features, measured_humidity = train_input_fn()
my_input_layer = input_layer(features)

linreg = tf.layers.Dense(name="LinReg", units=1)
hypothesis=linreg(my_input_layer)

In [None]:
my_input_layer, hypothesis

When we evaluate the hypothesis, the computational graph *magically* draws one batch of 1000 records from the given input files and passes it through the hypothesis function to arrive at 1000 one-dimensional predictions, one for each record in the current batch.

In [None]:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    out1 = sess.run(hypothesis)

In [None]:
out1.shape

---
We can actually inspect the variables of the ```Dense``` layer used in the hypothesis:

In [None]:
variables = hypothesis.graph.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
kernel, bias = variables[:2]

In [None]:
with tf.Session() as sess:
    sess.run(init)
    A170 = sess.run(kernel)
A170[2:].reshape([7,24])

---
Now we can manually adjust these parameters so that the predictions of the hypothesis come as close as possible to the humidity values that have actually been measured. For that, we compute the mean of the squared differences between each prediction and the measured humidity.

In [None]:
measured_humidity, hypothesis

In [None]:
loss = tf.reduce_mean((hypothesis-measured_humidity)**2)
loss
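
The loss can be illustrated outside TensorFlow with a few made-up numbers. The sketch below also shows a pitfall worth checking against the shapes printed by the previous cell: if one tensor had shape ```[batch, 1]``` and the other ```[batch]```, the subtraction would broadcast to all ```batch x batch``` pairs instead of pairing each prediction with its own measurement. Whether that applies here depends on what ```train_input_fn``` returns, which is not shown, so treat it as an assumption to verify.

```python
def mean_squared_error(predictions, targets):
    """Mean of the squared elementwise differences."""
    assert len(predictions) == len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def all_pairs_mse(predictions, targets):
    """What a [n, 1] minus [n] broadcast would average over: all n * n pairs."""
    total = sum((p - t) ** 2 for p in predictions for t in targets)
    return total / (len(predictions) * len(targets))

predictions = [0.5, 0.7, 0.2]   # made-up example values
targets = [0.4, 0.9, 0.2]

mse = mean_squared_error(predictions, targets)
broadcast_mse = all_pairs_mse(predictions, targets)
```

The two numbers differ, which is why matching shapes matter before trusting a loss value.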

One of the strengths of using an ML framework is that it usually provides the means to compute gradients easily. With TensorFlow, it's as easy as the following:

In [None]:
grad_k, grad_b = tf.gradients(loss, [kernel, bias])

In [None]:
grad_k, grad_b

In [None]:
learning_rate = 1e-1

In [None]:
update_k = tf.assign_sub(kernel, grad_k * learning_rate)
update_b = tf.assign_sub(bias, grad_b * learning_rate)

In mathematical terms, what we do is:
$$
A \leftarrow A - \varepsilon \cdot \frac{\partial}{\partial A} 
L(\vec{\beta}, \vec{\vartheta},  \vec{A}, \vec{B}, C)
$$

$$
B \leftarrow B - \varepsilon \cdot \frac{\partial}{\partial B} 
L(\vec{\beta}, \vec{\vartheta},  \vec{A}, \vec{B}, C)
$$

$$
C \leftarrow C - \varepsilon \cdot \frac{\partial}{\partial C} 
L(\vec{\beta}, \vec{\vartheta},  \vec{A}, \vec{B}, C)
$$

with $A$, $B$ being the *weight* parameters for $\beta_i$ and $\vartheta_i$, respectively, $\varepsilon$ being the learning rate, and
$L$ being the mean squared error ```loss``` as defined above. Computationally, evaluating ```update_k``` has the *side effect* of changing the values of $A$ and $B$ (both are stored in the ```kernel``` variable), and ```update_b``` does the same for $C$ (the ```bias``` variable).

In short: We tweak all parameters with the help of their respective gradients.

Now, all we need to do is continuously evaluate the ```update_b``` and ```update_k``` tensors and take note of the decreasing loss. Since each step sees a different batch, the loss may fluctuate slightly from step to step while following a downward trend.

Remember: The loss measures the difference between the prediction and the *reality*; the smaller it is, the better the hypothesis.

In [None]:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    l = sess.run(loss)
    print(l)
    for i in range(101):
        l, k, b, _, _ = sess.run([loss, kernel, bias, update_b, update_k])
        if i % 10 == 0:
            print(l)
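
The update rule above can be sketched without TensorFlow on a tiny, made-up data set. The example below fits a line $y = w x + c$ by hand-computed gradients of the mean squared error; the data, the variable names and the step count are illustrative assumptions. It uses the same learning rate of 0.1 as the cell above.

```python
# Made-up data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def loss_and_grads(w, c):
    """Mean squared error and its partial derivatives w.r.t. w and c."""
    n = len(xs)
    residuals = [w * x + c - y for x, y in zip(xs, ys)]
    loss = sum(r * r for r in residuals) / n
    grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
    grad_c = 2.0 * sum(residuals) / n
    return loss, grad_w, grad_c

w, c, learning_rate = 0.0, 0.0, 0.1
losses = []
for step in range(200):
    loss, grad_w, grad_c = loss_and_grads(w, c)
    losses.append(loss)
    # Same rule as assign_sub: parameter <- parameter - learning_rate * gradient
    w -= learning_rate * grad_w
    c -= learning_rate * grad_c
```

Here every step uses the full data set, so the recorded losses shrink at every step; with changing mini-batches, as in the TensorFlow loop, only the overall trend is downward.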

### Using Optimizers
Typically in TensorFlow, we don't perform the gradient update ourselves. That's done by a special breed of objects called *optimizers*, some of which are particularly efficient for certain kinds of problems. As an example, we use the *Adam* (adaptive moment estimation) optimizer below. Its inner workings are a subject for ML lessons, but if you're too curious to stop here, the following link leads to the well-known publication.

[Kingma, Ba 2014 - Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)
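
For the curious, the core update rule from that paper fits in a few lines. The following is a minimal pure-Python sketch for a single scalar parameter, not TensorFlow's implementation. ```beta1```, ```beta2``` and ```eps``` use the defaults suggested in the paper, while the step size ```alpha=0.1``` and the toy objective $(\theta - 3)^2$ are chosen here purely for illustration.

```python
import math

def adam_minimize(grad_fn, theta, steps, alpha=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of the Adam update rule for one scalar parameter."""
    m = 0.0  # running average of the gradient (first moment)
    v = 0.0  # running average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both averages start at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0, steps=500)
```

Compared with the plain update used earlier, the step is scaled per parameter by the recent gradient magnitude, which is what makes the method "adaptive".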

---
# Exercise 
### Please complete the cells below
---

In [None]:
features, measured_humidity = 
my_input_layer = 

linreg = 
hypothesis=

loss = 

In [None]:
optimizer = 
train = 

---
# Exercise Done
---

In [None]:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    l = sess.run(loss)
    print(l)
    for i in range(100):
        l, k, b, _ = sess.run([loss, kernel, bias, train])
    print(l)