<a href="https://colab.research.google.com/github/rzl-ds/gu511/blob/master/014_deep_learning.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>

# deep learning

## intro

it's hard to overstate the pervasiveness and success of deep learning methods in recent years. knowledge of deep learning techniques is a must for modern data scientists.

it's so important, in fact, that GU offers an entire class on it: [Math 514: intro to neural networks](https://myaccess.georgetown.edu/pls/bninbp/bwckctlg.p_display_courses?term_in=201910&one_subj=MATH&sel_crse_strt=514&sel_crse_end=514&sel_subj=&sel_levl=&sel_schd=&sel_coll=&sel_divs=&sel_dept=&sel_attr=#_ga=2.146187319.2080322458.1542654672-115011657.1531320772). this is not that course! you should consider taking it.

what follows is merely an incredibly hand-wavy introduction to deep neural nets. I hope to give you enough understanding and context that you feel comfortable executing simple deep learning code.

### deep learning vs. deep neural nets

for starters, a bit of nomenclature: the lecture is called "deep learning" but I will often be talking about "deep neural nets" instead. they are related:

+ **deep learning** is a family of statistical modelling approaches that attempt to "learn" the underlying structure or most convenient representation of data in order to make a specific sort of prediction
    + "learning" happens through exposure to subsequent examples. a model we have trained should become a better model if exposed to a new example
    + the predictions made in deep learning are typically supervised (real-world targets), but not necessarily so (autoencoders)
+ **deep neural nets** are a sub-family of *deep learning* models that are specifically constructed out of inter-connected "neurons", computation steps that perform a linear transformation and then a subsequent nonlinear transformation

it's a minor distinction, but there are things that are **deep learning** that are not **deep neural nets**; we're not going to talk about them!

### introduction to deep neural nets

let's talk about what a deep neural net is

#### TL;DR

a neural net is a highly flexible architecture for efficiently learning an optimal representation of input features for making specific predictions.

what follows builds up a neural net from the smallest components to the final net

#### one neuron / node
the fundamental element of a neural net is the neuron. this is so named due to long-standing analogies to the way neurons work in a brain.

I think this analogy is more confusing than it is worth. *just watch me* call them nodes instead of neurons.

the ~~neuron~~ node is a two-step operation: you do one *linear* transformation with a vector of weights and a bias value, then you do one *nonlinear* transformation with some function (called the **activation function**).

what weights? what bias value? what function? to be determined!

suppose we have a record of data with two features $x_1$ and $x_2$. a neuron that can act on that record would have two weight values ($w_1$ and $w_2$, one for each feature), a bias value $b$, and an activation function $f$

<br><div align="center"><img src="https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-09-at-3-42-21-am.png?w=568&h=303" width="800px"></div>

the first step is the *linear* transformation. symbolically, this is:

$$
W \cdot x + b = \sum_i W_i x_i + b
$$

geometrically, this is a measurement of how large the vector $x$ is when projected along the weight $W$

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1DYpFUfRxuuAhn66402TJOMbyXCzf4Zcw" width="800px"></div>

basically, we are *linearizing* the input by converting every incoming record in whatever space to one single number measuring the amount of that vector pointing in some specific direction.

after we have linearized the input, we add a non-linearity using an **activation function**. this activation function takes one input value (the linearized input value) and outputs something that is specifically non-linear.
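to make the two steps concrete, here is a minimal sketch of one node in `numpy`. the function names (`sigmoid`, `node`) and the values of `x`, `w`, and `b` are made up for illustration, and the sigmoid is just one of the common activation choices listed below.

```python
import numpy as np


def sigmoid(z):
    """squash any real number into the interval (0, 1)"""
    return 1.0 / (1.0 + np.exp(-z))


def node(x, w, b, activation=sigmoid):
    """one node: a linear step (w . x + b), then a nonlinear activation"""
    z = np.dot(w, x) + b   # step 1: linearize the record into one number
    return activation(z)   # step 2: apply the non-linearity


# hypothetical values, chosen only so the arithmetic is easy to follow
x = np.array([1.0, 2.0])    # one record with features x_1 and x_2
w = np.array([0.5, -0.25])  # one weight per feature
b = 0.0                     # the bias

# w . x + b = 0.5 - 0.5 + 0.0 = 0.0, and sigmoid(0) = 0.5
print(node(x, w, b))
```

swapping in a different activation is just a matter of passing a different function as `activation`.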

there are [a lot of these functions](https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions), but the most common are

the **sigmoid**, $\sigma(x) = \dfrac{1}{1 + \exp(-x)}$

In [None]:
import numpy as np, plotly.graph_objs as go
x = np.linspace(-10, 10, 100)
y = 1 / (1 + np.exp(-x))
data = [go.Scatter(x=x, y=y)]
go.Figure(data).show()

the **ReLU** (**Re**ctified **L**inear **U**nit),

$$
\operatorname{relu}(x) = \left\{\begin{array}{ll}
0 & x \leq 0 \\
x & x > 0
\end{array}
\right.
$$

In [None]:
y = np.where(x <= 0, 0, x)
data = [go.Scatter(x=x, y=y)]
go.Figure(data).show()

the **leaky ReLU**,

$$
\operatorname{leakyrelu}(x) = \left\{\begin{array}{ll}
0.01 x & x \leq 0 \\
x & x > 0
\end{array}
\right.
$$

In [None]:
y = np.where(x <= 0, 0.01 * x, x)
data = [go.Scatter(x=x, y=y)]
go.Figure(data).show()

#### a stack of neurons

once we understand what one neuron is doing, we could take a whole stack of $N$ of them. each could have different $W$ and $b$ values. they could have different activation functions (but typically don't). and all together they could take any one input record and create $N$ output values.

this is often visualized as a "net", where the neurons (nodes) are drawn as circles, and the "weights" are represented as edges (that is, the edge from $x_i$ to a node represents that node's $w_i$ value)

<br><img src="https://draftin.com/images/34466?token=YFsmpDuQfD3DDylinRD8F4sLOgjCFm4Aow1gIWoCY5KED3bnQKs17RaTja95OIQQWdr25dqS2fxq_6mDwwdcs9Y" width="500px"></img>

so with a stack of $N$ nodes we can convert an input record $x$ into an $N$-dimensional output record $z$
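as a rough sketch of that idea: if we put each node's weights in one row of a matrix `W` and each node's bias in one entry of a vector `b`, the whole stack becomes a single matrix-vector product followed by an element-wise activation. the names (`relu`, `layer`) and all of the numbers here are assumptions made up for the example.

```python
import numpy as np


def relu(z):
    """element-wise ReLU: negative values become 0"""
    return np.maximum(z, 0.0)


def layer(x, W, b, activation=relu):
    """a stack of N nodes: row i of W and entry i of b belong to node i"""
    return activation(W @ x + b)


# hypothetical weights for N = 3 nodes acting on a 2-feature record
x = np.array([1.0, 2.0])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.5, 0.0])

# W @ x + b = [1.0, 2.5, -3.0]; the ReLU then zeroes out the negative entry
z = layer(x, W, b)
print(z)
```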

#### a stack of stacks of neurons

the output of one stack of neurons is a new record. it's in some crazy $N$-dimensional space which is determined by the weights and biases of the previous layer, but it's basically now just a new record.

so we could do the same thing with *that* record that we did with our $x$ records, and feed it into a *new* stack of nodes

this is the "deep" neural net -- it's a neural net with hidden layers, so it's become "deep"
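a minimal sketch of stacking layers, assuming `numpy` and randomly initialized (that is, untrained) weights. the helper names (`init_layer`, `forward`) and the layer sizes are hypothetical; the point is only that the output of one layer is fed in as the "record" for the next.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


def init_layer(n_in, n_out):
    """random (untrained) weights and zero biases for one layer"""
    return rng.normal(size=(n_out, n_in)), np.zeros(n_out)


def forward(x, layers):
    """feed the record through each layer in turn"""
    for W, b in layers:
        x = relu(W @ x + b)  # the output of this layer is the next layer's input
    return x


# 3 input features -> hidden layer of 4 nodes -> hidden layer of 4 nodes
layers = [init_layer(3, 4), init_layer(4, 4)]
x = np.array([0.2, -1.0, 3.0])

h = forward(x, layers)
print(h.shape)
```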

<br><img src="http://cs231n.github.io/assets/nn1/neural_net2.jpeg"></img>

#### finally, an output

remember, we started down this path because our model we are constructing should be able to *predict* something. so we need a final layer that will take... whatever it is that we've created -- whatever that representation is -- and predict a value.

in practice, this is usually a logistic function (for binary predictions), a softmax (for categorical predictions), a linearization-only node (for regression), or a collection of logistics (for multi-label predictions, where more than one category can apply at once)
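here is a small sketch of three of those output options applied to the same hidden representation `h`. everything in it (the values in `h`, the output weights, the choice of 3 classes) is invented for illustration.

```python
import numpy as np


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """turn a vector of scores into probabilities that sum to 1"""
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow
    return e / e.sum()


# hypothetical output of the last hidden layer
h = np.array([1.0, -2.0, 0.5, 0.0])

# binary prediction: one node with a logistic activation
w_out, b_out = np.array([0.5, 0.1, -0.3, 0.2]), 0.0
p_binary = logistic(w_out @ h + b_out)

# categorical prediction (3 classes): one node per class, then a softmax
W_cat = np.array([[0.2, 0.0, 0.1, 0.0],
                  [0.0, 0.3, 0.0, 0.1],
                  [-0.1, 0.0, 0.4, 0.0]])
b_cat = np.zeros(3)
p_categorical = softmax(W_cat @ h + b_cat)

# regression: the linear step only, with no activation
y_hat = w_out @ h + b_out

print(p_binary, p_categorical, y_hat)
```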

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

#### summary

so a neural net is: a series of **layers**, where each **layer** is a stack of some number of **neurons**, and each **neuron** is a linearization (defined by a weight $w$ and a bias $b$) followed by an **activation** function.

#### why it works

if I just gave you a neural net with a random number of layers, with layers of random node size, and with random weights and biases throughout, it would be *terrible* at making predictions. so it's not the *structure* that is making good predictions.

rather, this particular way of arranging things has some special properties that make it easy to figure out how to tweak weights to incrementally improve those predictions. the process whereby we tweak weights is called *backpropagation* and is, at its heart, just the chain rule applied to millions of variables.
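to show what "just the chain rule" means, here is a sketch for the smallest possible case: a single sigmoid node with a squared-error loss. this is an illustration under those assumptions (the loss, the learning rate, and all of the numbers are made up), not how any particular library implements backpropagation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(w, b, x, y):
    """squared error of one sigmoid node on one record"""
    return 0.5 * (sigmoid(w @ x + b) - y) ** 2


def gradients(w, b, x, y):
    """chain rule: dL/dw = dL/da * da/dz * dz/dw (and likewise for b)"""
    a = sigmoid(w @ x + b)
    dL_da = a - y
    da_dz = a * (1.0 - a)
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz  # dz/dw = x and dz/db = 1


x, y = np.array([1.0, 2.0]), 1.0    # one record and its target
w, b = np.array([0.1, -0.2]), 0.0   # current weights and bias

grad_w, grad_b = gradients(w, b, x, y)

# sanity check the first weight's gradient against a finite difference
eps = 1e-6
step = np.array([eps, 0.0])
numeric = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)

# one "tweak": move the weights a little bit against the gradient
learning_rate = 0.5
w_new = w - learning_rate * grad_w
b_new = b - learning_rate * grad_b

print(loss(w, b, x, y), loss(w_new, b_new, x, y))
```

in a deep net the same multiplication of local derivatives is carried backwards through every layer, which is where the name comes from.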

slowly but surely, and with enough input data, we can update the weights in our deep neural net to **learn** the ways of representing our data (the elements that come out of each layer of nodes) that are **optimal** for making our predictions.

in a way, it's almost like cheating -- we know we want to make predictions, and we have a clever way of mashing together our input features such that what comes out is some $N$ dimensional vector that we can pass to a logistic regression and get amazing results

#### why we care (in this class)

so why go through the hassle of covering this in "advanced math and statistical computing" when it's the topic of an entirely different course?

because there's so much computing action focused on deep learning!

we spent the bulk of last lecture talking about how anything that can be parallelized is a good candidate for `gpu` analytics and acceleration, and in particular linear algebra.

well, as you saw above, deep neural nets are a giant pile of linear algebra. recent advancements in `gpu` availability (price and number) as well as speed have caused an explosion of `gpu` deep learning application development. including ours, in this course

## higher-level deep learning `api`s

let's figure out how we can code deep neural network models!

recall the deep learning stack picture in the `gpu` lecture:

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1M3LZQRI8nfCscnyL_h7xjKi4i9e8lo1t"></div>

this is called a stack because each level is providing a new interface and possibly new functionality "on top of" the level below it. the bottom three green levels are all very low level, and represent the particular stack made available by `nvidia`.

the very lowest level of that diagram (`gpu`) is the hardware level. you could have several types of `gpu`, but in this class we will focus on `nvidia` brand `gpu`s.

the level above that (`cuda`) is a set of `c++` libraries which allow programmers to write code that can be executed on the `gpu`. for `nvidia` `gpu`s, developers at `nvidia` have done the heavy lifting here, creating the bridge from the hardware (extremely low-level instructions!) to `c++` (a full OOP language).

they have also used that set of `c++` libraries to create a second set of libraries (`cudnn`) that use the lower-level `c++` `cuda` code base to implement functionality that is specifically meant to be used in deep learning applications.

at this point, if you are an extremely `l33t h4x0r` `c++` programmer, feel free to hop into your `ide` and bang out some deep neural nets.

for the rest of us mere mortals, we will focus on even-higher-level apis in `python`. fortunately, a few exist

the orange (`tensorflow`) and red (`keras`) boxes represent two levels of abstraction available to `python` coders for interacting with `gpu`s.

+ `tensorflow` is a numerical computation framework that defines complicated computations as a directed graph of smaller computations
    + we define a "graph" of operations (nodes) that we connect by their inputs and outputs (edges)
    + the graph defines how to get from one first node (e.g. loading the ultimate input) to any step downstream in the graph
    + this is *not* deep learning specific (we could create another crappy alarm clock script with `tensorflow`, e.g.), but it was created with deep learning in mind

perhaps that raises the question: how can I use these existing `c++` libraries from within `python`?

you look for a `python` `api` that

+ `keras` is a deep-learning-focused framework
    + this is a high-level way of describing deep neural networks and training methods in simple `python` code
    + it has *backends*, internal libraries which are used to *implement* the higher-level `keras` framework

### deep learning `api`s

each of these libraries is an *interface* to *something* beneath it. they provide developers a set of functions in some runtime (e.g. `python`) that hide the difficult, messy internal implementation details, so that someone who wants to use `tensorflow` to do *whatever* it is that `tensorflow` does won't need to know or care about what's happening a level below.

like any good interface, each of them **can** be an interface to the layer below

+ `tensorflow` is a `python` interface to the `c++` libraries `cuda` and `cudnn` (really, it "goes through" the `tensorflow` `c++` libraries, for an extra layer of interface-y goodness)
+ `keras` is a `python` interface to `tensorflow`

but they also **are not required** to be using that particular lower-level piece

+ `tensorflow` can use non-`nvidia` or non-`gpu` lower-level libraries (that is, it can work on different `gpu`s, or `cpu`s, or google's proprietary `tpu`s, or `android` phones)
    + while this is true in theory, in practice you are strongly incentivized to use `nvidia` `gpu`s if you want to use `gpu`s at all. this is less to do with `tensorflow`'s ability to support other types of `gpu`s, and more to do with the lack of replacements for `cuda` and especially `cudnn` for other `gpu` manufacturers. to put it another way: `amd` is way behind in developing code to enable deep learning, even if the architecture could do it in theory.
+ `keras` can use other `python` deep learning libraries (e.g. `theano`, `cntk`, or apache `mxnet`)

you've used at least one subject-matter-specific `api` library before in this class: the `scikit-learn` library is a framework for creating machine learning models. you are used to relying on the same sort of `api` for `sklearn` models:

```python
model = sklearn.somemodeltype.MySpecialClassifier(param1, param2)
model.fit(X_train, y_train)
model.score(X_test, y_test)
model.predict(X_test)
```

all that changes from model to model is the actual `model` object you create, but there are standard ways of creating those models. then you assume they all have the same methods
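a quick sketch of that shared interface, assuming `scikit-learn` is installed. the two model types and the train / test split are arbitrary choices for the example; the loop body doesn't change no matter which model it is handed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# two very different models, arbitrarily chosen
models = [LogisticRegression(max_iter=1000),
          DecisionTreeClassifier(random_state=0)]

scores = {}
for model in models:
    # the same three calls work for every model
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[type(model).__name__] = model.score(X_test, y_test)

print(scores)
```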

finally, if you weren't confused enough yet, the `keras` library is available as a standalone `python` package but *also* as a module within the `tensorflow` package:

In [None]:
import keras
help(keras)

In [None]:
import tensorflow as tf
help(tf.keras)

note: these are not necessarily the same version!!

In [None]:
print('keras version: {}'.format(keras.__version__))
print('tf.keras version: {}'.format(tf.keras.__version__))

let's take a step back and re-focus on what we want to do, to help illuminate what these `api`s (`tensorflow` and `keras`) are doing for us.

we want to create deep neural network models, and we'd like to be able to use `gpu`s to accelerate our computation if we would like

`tensorflow` can help us do this

+ **if** we can define our model as a directed graph of computation nodes and input / output edges, **then** `tensorflow` will handle the implementation on different lower-level hardware types (`gpu`, `cpu`, `tpu`, `android`) for us
+ **if** we have a novel or experimental neural network architecture we want to try, **then** `tensorflow` provides us with all of the necessary infrastructure to create and train that model in the same way we would train any other model. we should be able to build pretty much *whatever* deep learning model we want
+ **if** we want to train a fairly straightforward model type (`dnn`, `cnn`, `rnn`, `lstm`), **then** we may have to work a little bit harder than we'd like to define that model (see `keras`)

also, `keras` can help us do this

+ **if** we have a backend which implements neural net computation methods (e.g. `tensorflow` or `theano`), **then** `keras` gives us a *much* simpler interface for writing that code
+ **if** we want to use a different backend (`tensorflow` on one computer and `mxnet` on another), **then** `keras` will handle the implementation details for each without requiring code changes
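to give a feel for how little code that interface asks for, here is a rough sketch of a small classifier written with `tf.keras`. the data is random noise and the layer sizes, optimizer, and epoch count are arbitrary assumptions, so the fitted model is meaningless; only the shape of the code matters.

```python
import numpy as np
import tensorflow as tf

# made-up data: 32 records with 4 features each, and 3 possible classes
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4)).astype("float32")
y = rng.integers(0, 3, size=32)

# one hidden layer of 8 nodes, then a softmax output over the 3 classes
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=8, verbose=0)

# one row of class probabilities per record
probs = model.predict(X, verbose=0)
print(probs.shape)
```

note the family resemblance to the `sklearn` pattern: build a model object, `fit` it, `predict` with it.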

if you are just starting out and want to do some simple neural network development, I **strongly** encourage you to start with `keras`.

in particular, the author of the `keras` library (François Chollet) is a prolific author and blogger. his [`keras` blog](https://blog.keras.io/author/francois-chollet.html) is one of the best resources out there for tutorials on how to write deep learning models in `keras`. additionally, he wrote a great book: https://www.amazon.com/Deep-Learning-Python-Francois-Chollet/dp/1617294438

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

### alternatives

the *thing* `keras` gives us is a high-level backend-agnostic interface for creating most types of deep neural net architectures. alternative options include apache `mxnet`, which is a high-level framework with implementations in multiple different languages (and, coincidentally, one of the *backends* to `keras` to boot)

the *thing* `tensorflow` gives us is a computation environment with all the basic building blocks of deep neural net models (e.g. activation functions, loss functions, gradient descent algorithms) and supporting implementation on various different hardware types. the main alternative to `tensorflow` for this at this time is `pytorch`.

many people prefer `pytorch` to `tensorflow`, so this is by no means a settled dispute. that being said, one of the people that prefers `tensorflow` is `google`, so I feel pretty confident that project will keep advancing.

## hands-on

enough yakking, more key clacking. let's build some models

**<div align="center">exercise: install `tensorflow` and `keras`</div>**

on some machine where you have `conda` and some disk space, let's run

```sh
conda install -y tensorflow keras
```

verify it works by running (in a `python` or `ipython` session)

```python
import keras
import tensorflow as tf
```

note: we also could have used `docker` to create a `container` with `tensorflow` (and therefore `keras`, via `tf.keras`) pre-installed. look at https://hub.docker.com/r/tensorflow/tensorflow/ for details, but the basic commands are

```sh
# pull (if you haven't) and run the latest py v3 tensorflow
# container
docker run --rm -it -p 8888:8888 tensorflow/tensorflow:latest-py3

# open a jupyter notebook at localhost:8888
```

## using `tensorflow`

the [`tensorflow` documentation](https://www.tensorflow.org/tutorials/) is the definitive source for information on how to write `tensorflow` code, and this is no replacement. I simply want to cover the high-level concepts of working with `tensorflow`

### important context: `tf1` vs. `tf2`

about 2 months ago (Sep 30, 2019), `tensorflow` officially graduated from major version 1 to major version 2

this was a **big** change, and it has particular implications for how you as a data scientist should be using `tensorflow`.

to make a long story short, there were some *features* of `tensorflow` that the -- let's say, less technical (i.e. not software engineers at `google`) -- community didn't like. chief among these were

+ an over-abundance of "official" `api`s: a *lot* of ways to build models
    + this included `tf.estimator`, `tf.keras`, and `tf.slim` `api`s
    + these were not necessarily interchangeable and not all things you might want to do in one were available in others
+ an insistence on graph execution over eager execution
    + **graph execution**: create the calculation you want to perform but *DON'T PERFORM IT UNTIL I SAY SO*
    + **eager execution**: calculate things when I press `shift + enter` in my `jupyter notebook`
+ confusing locations of "side projects", including an enormous dumping ground called `tf.contrib` that held a mix of half-baked and extremely important things

`google` had many things it wanted to change from v1 to v2 in `tensorflow`, but some of the biggest changes come from concessions on these points above, namely

+ while other `api`s are still accessible, `tf.keras` is **the** `api` for `tensorflow`
+ eager execution (not graph execution) is now the default and recommended mode
    + useful side effect: much of the `tf` boilerplate (e.g. `tf.Session`, `tf.Graph`, `tf.placeholder`, and `tf.Session.run` calls) is deprecated
+ important contributed code will live in "the right place" in the `tf` module, and non-essential things will live in other packages

there is [a long walkthrough covering the changes you should implement to migrate from `v1` to `v2`](https://www.tensorflow.org/guide/migrate), but in practice the real implication is this:

> when using a pre-trained model or starting your own model from scratch, always use `tf2` if at all possible. if this is not immediately possible (e.g. a pre-trained model that is only available in `tf1`), consider three things
> 1. it may be better for you to look for a different pre-trained model (e.g. a `fork` done in `tf2`, or a `pytorch` version)
> 1. if you feel confident you can do it, create that `fork` and migrate it yourself
> 1. if neither of the above is possible or efficient, be aware that you are going to do some extra work that will soon be irrelevant. sometimes this is the price you have to pay!

### `tensor`s and `operation`s

there are many fundamental `python` objects in the `tensorflow` library, but the most important are `tensor`s and `operation`s

#### `tensor`

`tensor`s are arrays of objects (usually numbers) with arbitrary numbers of dimensions. these are the data of a `tensorflow` computation

you have plenty of experience with 0, 1, 2 dimensional tensors (generally called scalars, vectors, matrices).

the number of dimensions an object has -- or in `python` code, the number of brackets `[...]` you need to describe them -- is called the **rank**

+ `3`, a rank 0 tensor (aka scalar)
+ `[1, 2, 3]`, a rank 1 tensor (aka vector)
+ `[[1, 2], [3, 4]]`, a rank 2 tensor (aka matrix)
+ the below, a rank 3 tensor

```python
[[[1, 2], [3, 4]],
 [[5, 6], [7, 8]]]
```

`tensor`s also have **shape**, a list representing the number of elements in each dimension.

+ `3` has shape `[]`
+ `[1, 2, 3]` has shape `[3]`
+ `[[1, 2, 3], [4, 5, 6]]` has shape `[2, 3]` (2 "first level" elements with 3 elements inside each)
+ the below has shape `[2, 2, 2]` (2 "first level", each with 2 sub-elements, each of those with 2 elements)

```python
[[[1, 2], [3, 4]],
 [[5, 6], [7, 8]]]
```

there are a few types of `tensor`, but the most important to know about are:

+ `tf.Variable`: a tensor that holds a value and can be passed from one "session" (defined below) to the next and can change value
    + you can think of this like a variable in any program you've written; you instantiate it once but are free to add to it / update it over time.
    + for example, the weights and biases might be variables. you will pass them from one training run to the next and you will update them as part of your training
+ `tf.constant`: a constant, immutable value
+ `tf.placeholder`: a constant, immutable value *that isn't necessarily known yet, but will be*
    + note: these belong to `tf1`-style graph code; in v2.0 they are only available as `tf.compat.v1.placeholder`
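a short sketch of the difference between the first two, assuming `tensorflow` v2 with eager execution. the values are arbitrary, and `tf.placeholder` is left out because it isn't part of the top-level v2 `api`.

```python
import tensorflow as tf

w = tf.Variable([1.0, 2.0])   # mutable: think "weights we will update"
c = tf.constant([0.5, 0.5])   # immutable: its value is fixed at creation

# variables can be updated in place
w.assign_add(c)      # w is now [1.5, 2.5]
w.assign(w * 2.0)    # w is now [3.0, 5.0]

print(w.numpy())
print(c.numpy())     # unchanged
```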

In [None]:
import tensorflow as tf

t_scalar = tf.constant(3)
print('t_scalar = {}'.format(t_scalar))
print('t_scalar.shape = {}'.format(t_scalar.shape))

In [None]:
t_vector = tf.constant([1, 2, 3])
print('t_vector = {}'.format(t_vector))
print('t_vector.shape = {}'.format(t_vector.shape))

In [None]:
t_matrix = tf.constant([[1, 2, 3],
                        [4, 5, 6]])
print('t_matrix = {}'.format(t_matrix))
print('t_matrix.shape = {}'.format(t_matrix.shape))

In [None]:
t_rank3 = tf.constant([[[1, 2], [3, 4]],
                       [[5, 6], [7, 8]], ])
print('t_rank3 = {}'.format(t_rank3))
print('t_rank3.shape = {}'.format(t_rank3.shape))

#### `operation`

an `operation` is something I *do* to a `tensor` -- think any ol' math expression, `add`, `subtract`, `multiply`, `divide`, etc.

```python
tf.add(1, 1)
tf.multiply(x, y)
```

importantly, though: in `python`-speak, **these are not functions**, they are **objects** that represent a computation.

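as such, in graph mode (the `tf1` default, and what happens inside a `tf.function` in `tf2`) they don't return a computed value that is the result of that operation, but rather they return a new tensor which acts as a placeholder for the output of the function. it is not computed in real time, but rather assumed to be eventually computed. in the default eager mode of `tf2` the value *is* computed immediately, so the cells below display concrete `tf.Tensor` values.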

*note: this is slightly abusive terminology because base `python` functions themselves are in fact objects; what I really mean to say here is that `tensorflow` operations are not base `python` functions*

In [None]:
mysum_op = tf.add(1, 1)
mysum_op

In [None]:
mymult_op = tf.multiply(t_scalar, t_matrix)
mymult_op

In [None]:
mycombo_op = mymult_op / mysum_op
mycombo_op

### the execution graph

the fundamental abstraction in `tensorflow` is the **execution graph**: a directed graph connecting *computation* nodes (`operation` objects like `add`, `subtract`, `multiply`, etc) with edges that symbolize inputs and outputs (`tensor`s of data that the `operation`s consume or produce).

a graph is **directed** in that the `tensor` outputs of some `operation`s are subsequently consumed by later `operation`s. the logic of the computation is still sequential ("do this, then do that, then do that...")

for example, when we wrote `tf.add(1, 1)` above, we were updating the execution graph by

1. creating a new `add` operation node
1. assigning its `x` and `y` inputs to be two constant scalar `tensor`s with values of 1
    1. this implicitly created two `tf.constant` operations which "load" the scalar value 1 into the graph
1. creating a new output `tensor` to hold the result of the `add`

please disregard the mess here, this cell is just to get a `graph` we can actually view

In [None]:
!rm -rf /tmp/tensorboard-gu511-dl-lecture/*

In [None]:
%load_ext tensorboard

logdir = '/tmp/tensorboard-gu511-dl-lecture'
writer = tf.summary.create_file_writer(logdir)

t_scalar = tf.constant(3)
t_matrix = tf.constant([[1, 2, 3],
                        [4, 5, 6]])

# The function to be traced.
@tf.function
def my_func(x, y):
    return tf.divide(tf.multiply(x, y), tf.add(1, 1))


# bracket the function call with tf.summary.trace_on() and tf.summary.trace_export().
tf.summary.trace_on(graph=True, profiler=True)
# Call only one tf.function when tracing.
z = my_func(t_scalar, t_matrix)
with writer.as_default():
    tf.summary.trace_export(name="my_func_trace",
                            step=0,
                            profiler_outdir=logdir)

In [None]:
%tensorboard --port 6007 --logdir /tmp/tensorboard-gu511-dl-lecture

what the diagram above shows are **`operation` nodes** (e.g. `Add`, `Mul`) connected by **`tensor` edges**. together these make up the **execution graph**

as we said above, it used to be the case that you would write `tensorflow` code to create the `graph`, but that nothing would be *executed* until you explicitly called for that to happen.

for example, even though the program we wrote above is pretty trivial, the default behavior of `tensorflow` in previous versions was to define the execution graph and wait to calculate until you asked for it. it treated the graph like a recipe -- if you provide some materials (input `tensor`s `x` and `y`), `tensorflow` would bake you a `tensor` cake (the result of the function).

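this mode is no longer the default in `tf2` -- now it calculates `x` when you define `x`, `y` when you define `y`, the sum when you call `tf.add(x, y)`, ... (graph execution survives as an opt-in, e.g. via the `@tf.function` decorator used above)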
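a small sketch contrasting the two modes, assuming `tensorflow` v2. the function name `scaled` and the `modes` list are made up for the example; `tf.executing_eagerly()` reports which mode the surrounding code is running in.

```python
import tensorflow as tf

x = tf.constant(3)
y = tf.constant([[1, 2, 3],
                 [4, 5, 6]])

# eager mode (the tf2 default): the product is computed right here
product = tf.multiply(x, y)
print(product.numpy())

modes = []


@tf.function
def scaled(a, b):
    # this python body runs while tensorflow traces the function into a graph
    modes.append(tf.executing_eagerly())
    return tf.multiply(a, b)


# calling the decorated function builds a graph, then executes it
result = scaled(x, y)

# peek at the operation nodes in the traced graph
graph = scaled.get_concrete_function(x, y).graph
graph_ops = [op.type for op in graph.get_operations()]

print(modes)       # eager execution was off while tracing
print(graph_ops)   # includes a "Mul" node for tf.multiply
```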

### high-level modelling `api`s: `dataset` and `estimator`

the most common complaint about `tensorflow` is that it is over-engineered and not easy or intuitive to write. `tf2` helps with that, but in general I think that complaint is still true.

even so, you will *definitely* run into these `api`s if you are doing deep learning, and it is useful to know what these two major `api`s *are* and at a high level how to use them.

google generally recommends that you

1. use the `dataset` data `api` to define your datasets you will use for train, validation, dev, and test
1. use the `tf.keras` model `api` to define your models
1. use the `estimator` model `api` to define your models when `keras` won't do

because the `dataset` and `estimator` classes are unique to `tensorflow`, but `keras` is useful across different deep learning frameworks, we will talk about those two here and save `keras` for a later section

#### `tf.data` `api` basics

the `tensorflow` developers have created a class of objects `tf.data.Dataset` to handle the most common functionality of loading and manipulating data for deep learning modelling. if you load your data using this object and its `api`, you will be given free access to many pre-built methods for manipulating your data, and other `tensorflow` objects will know exactly what to do with your dataset.

the basic workflow is this:

+ identify your pre-`tensorflow` data location on disk (file paths) or in memory (a non-`tensorflow` object)
+ create a *source* `tf.data.Dataset` object which takes that data location and knows how to ingest it
+ use the *transformation* methods that all `tf.data.Dataset` objects have to transform that data (e.g. shuffling, batching, zero-padding)
+ iterate over the `tf.data.Dataset` to generate records for a training or evaluation routine (in `tf2` a plain `python` `for` loop does this; `tf1` code needed an explicit `tf.data.Iterator` object)
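the workflow above can be sketched in a few lines, assuming `tensorflow` v2 and an in-memory `numpy` array as the data source. the array contents, buffer size, and batch size are arbitrary choices for the example.

```python
import numpy as np
import tensorflow as tf

# step 1: some pre-tensorflow data in memory (just the numbers 0 through 9)
data = np.arange(10)

# step 2: a *source* dataset that knows how to ingest it
# step 3: *transformations* chained on top (shuffle the records, then batch them)
dataset = (tf.data.Dataset
           .from_tensor_slices(data)
           .shuffle(buffer_size=10, seed=0)
           .batch(4))

# step 4: iterate to generate (batches of) records
batches = [batch.numpy() for batch in dataset]
for batch in batches:
    print(batch)
```

ten records in batches of four means two full batches and one leftover batch of two.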

##### an example

the data that you load into a `tensorflow` dataset is assumed to be a collection of features that have the same first dimension shape but don't necessarily have much to do with each other in terms of their internal makeup.

taking [an example straight from the docs](https://www.tensorflow.org/guide/data#dataset_structure):

In [None]:
# create a tensor which is 4 x 10 random uniform numbers
dataset1 = (tf.data.Dataset
            .from_tensor_slices(tf.random.uniform([4, 10])))
dataset1

these `dataset` objects can be iterated over just like any other `python` collection:

In [None]:
for record in dataset1:
    print(record)

and we can extract the values inside the individual `tensor` elements as `numpy` arrays:

In [None]:
record.numpy()

the `dataset`s we make can be composed of several `tensor`s of different shape (they must share the first dimension (number of records), but after that, go wild)

In [None]:
# create one 4-element random number vector, and one 4 x 100 element random
# number matrix, and combine them into one dataset
dataset2 = (tf.data.Dataset
            .from_tensor_slices((tf.random.uniform([4]),
                                 tf.random.uniform([4, 100],
                                                   maxval=100,
                                                   dtype=tf.int32))))
dataset2

and there are composition functions (e.g. `zip`) to tie multiple `dataset`s together

In [None]:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3

in the above, we are creating a `tf.data.Dataset` which can be iterated to produce four records.

In [None]:
for (t1, (t2a, t2b)) in dataset3:
    print('shapes: {t1.shape}, < {t2a.shape}, {t2b.shape} >'.format(t1=t1, t2a=t2a, t2b=t2b))

so elements in `dataset3` have two top-level pieces (from `dataset1` and `dataset2`, respectively); and the second (from `dataset2`) has two sub-pieces, each of which is a `tensor`

In [None]:
print('t1 = {}'.format(t1))
print('\nt2a = {}'.format(t2a))
print('\nt2b = {}'.format(t2b))

for each of the `dataset` objects above we could have created them with a `dict` of features (with column names as keys) instead of just a tuple of `tensor`s -- this would result in datasets where the components of the records are named, and might help us unpack things a little better

In [None]:
dataset1 = (tf.data.Dataset
            .from_tensor_slices({'t1': tf.random.uniform([4, 10])}))
dataset2 = (tf.data.Dataset
            .from_tensor_slices({'t2a': tf.random.uniform([4]),
                                 't2b': tf.random.uniform([4, 100],
                                                          maxval=100,
                                                          dtype=tf.int32)}))
dataset3 = (tf.data.Dataset.zip({'t3a': dataset1,
                                 't3b': dataset2}))
dataset3

In [None]:
for record in dataset3:
    break
record

In [None]:
record['t3a']

In [None]:
record['t3b']

In [None]:
record['t3b']['t2b'].numpy()

##### loading data with `numpy` or `pandas`

in practice, though, you will spend much more time loading datasets from `numpy` arrays or `pandas` dataframes, or flat files, so why not spend some time figuring that out!

In [None]:
import numpy as np, pandas as pd, sklearn.datasets

iris = sklearn.datasets.load_iris()

x = iris.data
y = iris.target
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()

the basic type of `tensor`s in `tensorflow` is (effectively) just a `numpy` array, so you can pass the `numpy` array directly to a `from_tensor_slices` call (as if it were a `tensor` slice) and this will construct your `tensor`:

In [None]:
print(type(x))
iris_tensor = tf.data.Dataset.from_tensor_slices(x)

In [None]:
for record in iris_tensor:
    print(record)
    break

for a `pd.DataFrame` object, we have a few options. first, we could just use the `df.values` attribute, which is itself a `numpy` array

In [None]:
iris_tensor = tf.data.Dataset.from_tensor_slices(df.values)

In [None]:
for record in iris_tensor:
    print(record)
    break

alternatively, we could convert our dataframe into a dictionary with feature names as keys and columns as values

In [None]:
iris_tensor = tf.data.Dataset.from_tensor_slices(df.to_dict(orient='series'))

In [None]:
for record in iris_tensor:
    print(record)
    print()
    print(record['petal width (cm)'].numpy())
    break

note that the above two approaches are doing fundamentally different things, and your downstream code will need to know / care about that:

1. passing as an array gives you an iterable where each record is one **4-element `tensor` of numbers**
1. passing as a dictionary gives you an iterable where each record is one **dictionary** in which each key corresponds to a **scalar `tensor`**

in practice the former (pass as array) is more common

##### loading data from file

the above were simple ways to take datasets that fit **in memory** and load them directly into `tensor`s using the `tf.data.Dataset.from_tensor_slices` method.

this is great when the data you are working with fits in memory, but in practice this is rarely the case.

instead, it is usually the case that you are working with datasets that far surpass the memory constraints of your machine (data volume is one of the main reasons deep learning works!). in these cases, you will usually read from files instead.

while there is built-in support for `csv`s (`tf.data.experimental.make_csv_dataset`) and `tfrecord`, a `tensorflow`-specific binary file format for "records" (`tf.data.TFRecordDataset`), the general approach is to define a `tf.data.TextLineDataset` dataset object and to apply transformations to it that convert a single string per row into a `tensor`
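
to make the "one string per row, converted by a transformation" idea concrete, here is a plain-`python` analogue (it is *not* `tensorflow` code). the names `parse_line` and `stream_records`, and the tiny fake file, are made up for illustration; the point is only that rows are read and parsed one at a time rather than loaded all at once

```python
import io


def parse_line(line):
    """Convert one comma-separated text row into (features, label)."""
    *features, label = line.strip().split(',')
    return [float(v) for v in features], int(label)


def stream_records(file_obj, skip_header=True):
    """Lazily yield parsed records, one line at a time."""
    lines = iter(file_obj)
    if skip_header:
        next(lines)  # drop the column-name row
    for line in lines:
        if line.strip():  # ignore blank lines
            yield parse_line(line)


# a hypothetical two-row file standing in for a dataset too big for memory
fake_file = io.StringIO("sepal_length,sepal_width,label\n5.1,3.5,0\n6.3,3.3,2\n")
records = list(stream_records(fake_file))
print(records)
```

in `tensorflow` the same shape of pipeline would be a line-reading dataset followed by a `map` of the parsing function.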

##### `dataset` manipulation methods

in addition to loading data, we also have options for making common manipulations of that data before iterating through it. for example

`dataset.batch()` will combine records into a `batch` (or chunk) of a fixed size

In [None]:
iris_tensor = (tf.data.Dataset
               .from_tensor_slices(df.values)
               # here we batch
               .batch(3))
for record in iris_tensor:
    print(record)
    break
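
if it helps to see what batching means without any `tensorflow` machinery, here is a minimal pure-`python` sketch. the helper name `batch` is made up for illustration, and it keeps a smaller final chunk when the number of records is not a multiple of the batch size

```python
def batch(records, batch_size):
    """Group consecutive records into lists of at most `batch_size` items."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk:  # leftover records form a final, smaller batch
        yield chunk


batches = list(batch(range(7), 3))
print(batches)
```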

`dataset.filter()` will apply a filter to every record and only return those for which the predicate (filter function) is `True`. for example, the zeroth record in our `iris` dataset has a zeroth element of 5.1, so let's look at a function which checks for zeroth elements `<= 5.0`:

In [None]:
def five_filter(x):
    return x[0] <= 5.0

iris_tensor = (tf.data.Dataset
               .from_tensor_slices(df.values)
               # here we filter with a named predicate function
               .filter(five_filter))
for record in iris_tensor:
    print(record)
    break

`dataset.map()` will apply a single function to every record in the dataset

In [None]:
def double(x):
    return 2 * x

iris_tensor = (tf.data.Dataset
               .from_tensor_slices(df.values)
               # here we apply the function with map
               .map(double))
for record in iris_tensor:
    print(record)
    break

`dataset.shuffle()` will shuffle the records before iterating through them. in order to do this, `tensorflow` must know how many elements forward it should look for shuffling -- this doesn't have to be the entire dataset, but that's the number we'll use. we'll look at the first few elements and compare with the dataframe:

In [None]:
iris_tensor = (tf.data.Dataset
               .from_tensor_slices(df.values)
               .shuffle(buffer_size=x.shape[0]))
for (i, record) in enumerate(iris_tensor):
    print(record)
    if i == 2:
        break

In [None]:
df.head(3)
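
the role of `buffer_size` is easier to see in a toy version. the following is a pure-`python` sketch of the general buffered-shuffle idea, not `tensorflow`'s actual implementation; the function name `buffered_shuffle` and its `seed` argument are assumptions made for illustration

```python
import random


def buffered_shuffle(records, buffer_size, seed=None):
    """Yield records in a shuffled order using a fixed-size look-ahead buffer."""
    rng = random.Random(seed)
    buffer = []
    for record in records:
        if len(buffer) < buffer_size:
            buffer.append(record)  # still filling the buffer
            continue
        # emit a random buffered record and put the new one in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = record
    # input exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer


data = list(range(20))
unshuffled = list(buffered_shuffle(data, buffer_size=1, seed=0))
partially_shuffled = list(buffered_shuffle(data, buffer_size=5, seed=0))
print(unshuffled == data)
print(partially_shuffled)
```

with a buffer of one record nothing can be reordered, and with a small buffer the first record emitted can only come from the first few inputs. that is why a buffer as large as the dataset is needed for a fully uniform shuffle.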

for more information on `datasets`, please [refer to the documentation](https://www.tensorflow.org/guide/datasets)

#### `tf.estimator` `api` basics

so you've managed to get your data loaded as a `dataset`. good for you!

now, let's build a model!

but first, we have a decision to make: which `api` to use?

by default you should be thinking `keras`. *however*, there are certain situations in which you might want to use a different high-level `api` for modelling: the `estimator` `api`

for motivation, here are the reasons the `tensorflow` docs themselves give for using `estimator`s:

> Estimators provide the following benefits:
>
> + You can run Estimator-based models on a local host or on a distributed multi-server environment without changing your model. Furthermore, you can run Estimator-based models on CPUs, GPUs, or TPUs without recoding your model.
> + Estimators provide a safe distributed training loop that controls how and when to:
>   + load data
>   + handle exceptions
>   + create checkpoint files and recover from failures
>   + save summaries for TensorBoard

like most frameworks, in using the `estimator` framework for model building / training we are **inverting control**: rather than `tensorflow` providing us with a handful of functions and us writing the over-arching control flow and logic for using those functions, here `tensorflow` is asking us to define a few simple functions or objects and it will then use them to do any number of more complicated tasks.

those functions or objects we must provide to our `estimator` are:

1. `input_fn`: a function which returns a `dict` of `column_name: feature_tensor` key-value pairs and a `tensor` of `label`s
    + you may need to define different functions for different modes (e.g. training, eval)
1. a sort of schema for the columns in the loaded dataset in the form of a list of `tf.feature_column` objects
    + each `tf.feature_column` object indicates the name, data type, and required pre-processing for one feature

once that has been done, you need to

1. build the estimator using one of the `tf.estimator` classes, and
1. run one of the `train`, `evaluate`, or `predict` methods
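
to illustrate the inversion of control described above, here is a toy sketch in plain `python`. `ToyEstimator` and `input_fn_toy` are hypothetical names and have nothing to do with the real `tf.estimator` classes; the only point is that *we* supply a small function and the framework-like object decides when to call it

```python
class ToyEstimator:
    """A made-up stand-in that 'learns' the mean label; not tensorflow code."""

    def __init__(self):
        self.mean_label = None

    def train(self, input_fn):
        # the "framework" calls our function, not the other way around
        features, labels = input_fn()
        self.mean_label = sum(labels) / len(labels)
        return self

    def predict(self, input_fn):
        features, _ = input_fn()
        n_records = len(next(iter(features.values())))
        return [self.mean_label] * n_records


def input_fn_toy():
    """Return a dict of column_name -> values, and a list of labels."""
    features = {'sepal_length': [5.1, 4.9, 6.3, 5.8]}
    labels = [0, 0, 2, 2]
    return features, labels


estimator_toy = ToyEstimator().train(input_fn_toy)
print(estimator_toy.predict(input_fn_toy))
```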

##### an example

let's do the simplest thing we can: a logistic regression.

+ are we going to be dealing with big data? no!
+ are we going to do distributed training? no!
+ are we going to deploy this to production using TFX? no!

should we be using the `estimator` `api`? probably not!

but I ~~re~~digress. start by splitting our `iris` data into a train and test dataset

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    random_state=1337,
                                                    stratify=y,
                                                    test_size=0.3)

we can create an `input_fn` in a straight-forward way (we have to clean up the feature names first):

In [None]:
feature_names = [fn.replace(' ', '_').replace('(', '').replace(')', '')
                 for fn in iris.feature_names]

def input_fn_train():
    feature_dict = {feature_name: x_train[:, i]
                    for (i, feature_name) in enumerate(feature_names)}
    return feature_dict, y_train

In [None]:
def input_fn_test():
    feature_dict = {feature_name: x_test[:, i]
                    for (i, feature_name) in enumerate(feature_names)}
    return feature_dict, y_test

as our columns are each numeric, creating our "schema" is also easy. if our data was already normalized it would be trivial, but as it isn't we will add some normalization functions into this step -- note that this means we are assuming data comes in un-normalized going forward.

In [None]:
feature_columns = []

for (i, feature_name) in enumerate(feature_names):
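    # build the zscore normalization function here, using the statistics
    # of the i-th training column only
    x_i_mean = x_train[:, i].mean()
    x_i_std = x_train[:, i].std()
    # bind the statistics as default arguments so each feature keeps its own
    # values (a bare lambda would only see the last loop iteration's values)
    zscore = lambda x_i, mean=x_i_mean, std=x_i_std: (x_i - mean) / std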

    # add it to the list
    feature_columns.append(
        tf.feature_column.numeric_column(feature_name,
                                         normalizer_fn=zscore))
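
one thing to watch for whenever per-feature functions are built in a loop: `python` closures look up loop variables when they are *called*, not when they are *defined*. the sketch below uses made-up means and a hypothetical helper `build_normalizers` to show the difference between a bare `lambda` and one that binds its value as a default argument

```python
def build_normalizers(means):
    """Build 'subtract the mean' functions two ways, one per feature."""
    late_bound, early_bound = [], []
    for mean in means:
        # looks up `mean` at call time, so every function sees the last value
        late_bound.append(lambda v: v - mean)
        # the default argument captures the current value of `mean`
        early_bound.append(lambda v, mean=mean: v - mean)
    return late_bound, early_bound


late_bound, early_bound = build_normalizers([1.0, 10.0, 100.0])
print([f(0.0) for f in late_bound])   # [-100.0, -100.0, -100.0]
print([f(0.0) for f in early_bound])  # [-1.0, -10.0, -100.0]
```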

now we can initialize our estimator. one quick peek at the docs:

In [None]:
tf.estimator.LinearClassifier?

In [None]:
# make sure the logging level is at the default
tf.get_logger().setLevel('INFO')
estimator = (tf.estimator
             .LinearClassifier(feature_columns=feature_columns,
                               # the number of classes, len(iris.target_names)
                               n_classes=3))

and, finally, we use this model to `train`:

In [None]:
estimator.train?

In [None]:
estimator.train(input_fn=input_fn_train,
                steps=10000)

# turning down the logging level now
tf.get_logger().setLevel('ERROR')

our results on training:

In [None]:
result = estimator.evaluate(input_fn_train, steps=1)

for key,value in sorted(result.items()):
    print('{}: {:0.2f}'.format(key, value))

and on test:

In [None]:
result = estimator.evaluate(input_fn_test, steps=1)

for key,value in sorted(result.items()):
    print('{}: {:0.2f}'.format(key, value))

for more information on `estimator`s, please [refer to the documentation](https://www.tensorflow.org/guide/estimators)

In [None]:
# turn the logger back on...
tf.get_logger().setLevel('INFO')

### `gpu` and `tpu` acceleration

we have the ability to accelerate our computations with different hardware -- specifically, `gpu`s and `tpu`s. the way we do this in `tensorflow` is different for each type (though I expect in the long run they will converge). currently, executing computations on `gpu`s is trivial, whereas executing computations on a `tpu` requires a bit more effort

#### `gpu`s

`tensorflow` has an ingrained understanding of the "device" you intend to use to do your computation. currently the supported devices are `cpu`s and `gpu`s, and any model built to be executed on one can seamlessly be executed on the other with minimal code alteration. the same is not true of `tpu`s (more on this in the following)

##### `gpu`-capable installation

the *ability* to put computations on a `gpu` is something that your particular installation of `tensorflow` either has or doesn't -- it comes down to how you installed `tensorflow`.

if you look at [the main `tensorflow` installation page](https://www.tensorflow.org/install/), there are two installation methods: `pip` and `docker`. incidentally, `tensorflow` is also available via `conda`, but if you are aiming for `gpu` use you may want to consider the "official" installation methods

as you can read on [the `pip` install page](https://www.tensorflow.org/install/pip), the main difference between `gpu`-capable and non-`gpu`-capable `tensorflow` installations is the `python` package installed. the base package is called `tensorflow` and the `gpu`-capable package is called (shocker!) `tensorflow-gpu`.

if you are using `conda` to manage your environment, you can ignore the discussion about creating virtual environments using `virtualenv` -- just make sure you are in your desired `conda` environment before `pip` installing, and verify that `which pip` returns the path to your `conda` environment's `pip` executable.

meanwhile, [the `docker` install](https://www.tensorflow.org/install/docker) is perhaps even more trivial. for this method of installation we have a number of options to choose from, and once we have made our decisions we use our selected values to build our `docker` image's tag:

+ the library name is `tensorflow/tensorflow`
+ the base of the tag depends on whether we want the `latest` stable version, the `nightly` build, or a specific version number
+ do we need the source code? add a `-devel` to the tag if so
+ do we need `gpu` support? add a `-gpu` if so
+ do we intend to use `python` version 3 instead of 2? add a `-py3` if so

so to get the `docker` image with `gpu` support for `python` version 3 (but not the source code) we would have

+ `tensorflow/tensorflow`
+ the `latest` tag
+ nothing for source files
+ add a `-gpu` for `gpu` support
+ add a `-py3` for `python` 3

and our tag will be `tensorflow/tensorflow:latest-gpu-py3`
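
the tag-building rules above are mechanical enough to write down as a small function. `build_tag` is a hypothetical helper written for this lecture (it is not part of `docker` or `tensorflow`), and it only assembles the string; whether a given tag actually exists is something you would still check on the image registry

```python
def build_tag(version='latest', devel=False, gpu=False, py3=False):
    """Assemble an image name following the naming rules listed above."""
    parts = [version]
    if devel:
        parts.append('devel')  # include the source code
    if gpu:
        parts.append('gpu')    # gpu support
    if py3:
        parts.append('py3')    # python 3 instead of 2
    return 'tensorflow/tensorflow:' + '-'.join(parts)


print(build_tag(gpu=True, py3=True))  # tensorflow/tensorflow:latest-gpu-py3
```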

it is assumed that you will be working with `nvidia` `gpu`s. in order to verify whether or not that is the case, run (`*nix` only)

```sh
lspci | grep -i nvidia
```

machines with no `gpu`s will produce nothing, whereas machines with `gpu`s will produce something like

```
05:00.0 VGA compatible controller: NVIDIA Corporation Device 1d81 (rev a1)
05:00.1 Audio device: NVIDIA Corporation Device 10f2 (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation Device 1d81 (rev a1)
06:00.1 Audio device: NVIDIA Corporation Device 10f2 (rev a1)
09:00.0 VGA compatible controller: NVIDIA Corporation Device 1d81 (rev a1)
09:00.1 Audio device: NVIDIA Corporation Device 10f2 (rev a1)
0a:00.0 VGA compatible controller: NVIDIA Corporation Device 1d81 (rev a1)
0a:00.1 Audio device: NVIDIA Corporation Device 10f2 (rev a1)
```

finally, to launch a container with `gpu` support, run

```sh
# --runtime nvidia: using the nvidia runtime takes care of
#     much of the complexity of mapping your local host's gpu
#     devices to the container (i.e. getting your container to
#     "see" your local machine's gpus)
# -it: make it *i*nteractive and create a *t*ty terminal
# --rm: remove the container after we're done
# -p 8888:8888 : map the container's internal port 8888 to the
#     local host's port 8888 (so you can connect to jupyter)
docker run --runtime nvidia -it --rm -p 8888:8888 tensorflow/tensorflow:latest-gpu-py3
```

with this done, you will be able to log in to a `jupyter` notebook at `localhost:8888` and execute `tensorflow` code that will run on `gpu`

##### placing `operation`s on devices

among the many `tensorflow` `operation`s, *some* have implementations on `gpu`s and some do not. left to its own devices (pun intended), `tensorflow` will always attempt to execute a `gpu`-capable operation on a `gpu` before a `cpu`. you don't have to do anything

to see this in action on a machine with a `gpu` and in an environment that supports `gpu` execution, run

```python
tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)
```

even though I said nothing in the above code about `gpu`s, on a `gpu`-capable machine I see that my `matmul` operation occurred on my `gpu`:

```
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
```

if you want a bit more control than that, you simply tell `tensorflow` where you want the code to execute. `tensorflow` has names for these devices:

+ `cpu` is `/cpu:0` (as in the first (index 0) `cpu`)
+ the first `gpu` is `/device:GPU:0`, or `/GPU:0` for short
+ the second `gpu` is `/GPU:1`
+ etc.

you can choose to deliberately locate any `tensor` or `operation` on a specific single device with a `context` manager:

```python
with tf.device('/device:GPU:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c = tf.matmul(a, b)
```

the above code will place the `a`, `b`, and `c` `tensor`s on the first `gpu`

there are many more details that may matter for your particular edge case; your best bet is to check [the `tensorflow` documentation](https://www.tensorflow.org/guide/using_gpu)

#### `tpu`s

the actual details of implementing code on a `tpu` are beyond the scope of this lecture -- it suffices to say it is not as simple as `with tf.device('/device:TPU:0')`, though I expect that will eventually be the case.

the basic idea is that you replace some of the basic elements of `tensorflow` with classes that duplicate their `api` but for `tpu` use

+ `tf.estimator.Estimator` becomes `tf.contrib.tpu.TPUEstimator`
    + you must add a parameter `use_tpu=True`
    + you must add a `config` (previously this has always been the default run config) with value `tf.contrib.tpu.RunConfig()`
+ your chosen training optimizer (above, for the logistic regression, we used `tf.train.FtrlOptimizer`) must be wrapped inside a call to `tf.contrib.tpu.CrossShardOptimizer`
+ replace your `tf.estimator.EstimatorSpec` (previously left as default) with a `tf.contrib.tpu.TPUEstimatorSpec`
    + this has implications for how you define your metrics that you must resolve

the actual implementation of these code changes is not trivial and requires quite a bit of reference to existing `google` code and [the current somewhat under-complete documentation](https://www.tensorflow.org/guide/using_tpu).

for a faster start, consider doing one of the pre-canned tutorials, three of which are listed at the top of the above documentation page.

#### executing on `google` `colab`

recently `google` added both `gpu` and `tpu` support to `google` `colab`. to use an accelerator device, select "Runtime > Change runtime type > Hardware accelerator"

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1CcA73qR3-FF6zvvuylD8Wg0zmSKHaBAa" width="1000px"></div>

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

whew. that was a lot of `tensorflow`... feeling ready to easily crank out deep neural net models?

me neither.

if only there were an easier way

## `keras`

`keras` is a backend-agnostic, high-level `api` for creating neural networks.

it is **backend-agnostic** in that it is *not* implementing neural network computations directly, but rather utilizing *other* libraries that have implemented those computations and providing them to you, the `keras` user, under a unified single `api`.

to put it another way: they've figured out how you would create a simple `CNN` layer in multiple implementation libraries (e.g. `tensorflow`, `cntk`, and `theano`) and given you *one* set of functions that will (under the hood) call out to whichever of those libraries you've installed.

`keras` will allow you to write simpler, higher-level, portable neural net code

### `keras` vs. `tf.keras`

`keras` is itself [an independently developed `python` library](https://github.com/keras-team/keras) which you could install via `pip` or `conda`, and use without any `tensorflow` installation at all. it is completely open source and is developed by a team of OSS developers, and is free to contribute to at any time.

in a sense, `keras` is also an `api` specification -- a statement about what *interface* a deep neural net developer could use to create backend-agnostic neural net models. if I write code using the `keras` `api`, someone has created an under-the-hood implementation of that code that leverages `tensorflow`, `cntk`, or `theano` for me, and I can use any of them as I see fit

`google`, in an effort to make their library more user-friendly, incorporated the `keras` **`api`** into the `tensorflow` package as a submodule: `tensorflow.keras`

this library contains `google`'s own custom implementation of the `keras` `api` that is completely integrated with (and only with) `tensorflow` core code and written by `tensorflow` core developers. the author of `keras` discusses this in a [blog post from 2017](https://blog.keras.io/introducing-keras-2.html)

in summary,

+ the `keras` package is *open source* and implements the `keras` `api` in *several* backends
+ the `tf.keras` module is *developed by google* and implements the `keras` `api` for `tensorflow` *only*

*note*: additionally, the most recent version of `tf.keras` and `keras` may not be the same.

practically speaking, while there *may* be some performance differences (with the `tf.keras` implementation being faster for `tensorflow`), the resulting *code* should be immediately transferable from one library to the other because they are implementing a shared `api`

given `keras`-`api` code, you simply swap which way you `import` the `keras` package:

```python
import keras
```

or

```python
from tensorflow import keras
```

just to make things easier, since tensorflow is already installed and running, let's use the `tf.keras` version:

In [None]:
import tensorflow as tf
from tensorflow import keras

### general `keras` workflow

`keras` is a unified deep neural net `api`, so there is an assumed way of creating models:

+ create a `keras.models` object to collect your neural net layers
+ add layers to your model
+ `compile` your model to configure the learning process
    + here we specify things like [the `loss` function](https://keras.io/losses/) we wish to optimize, [the `optimizer`](https://keras.io/optimizers/) (optimization algorithm), [`metrics`](https://keras.io/metrics/) we wish to track along the way, etc
+ `fit` the model to labelled training data
+ `evaluate` the model on test records
+ use the model to `predict` the outcome of input predictors

### a simple example

the simple example found on [the main `keras` landing page](https://keras.io/) is a good walkthrough of the above workflow

#### create a `keras.models` object

we start off by creating a `keras.models` object using the sequential `api` (*"as opposed to what?" you ask -- more on this later*)

In [None]:
from tensorflow.keras.models import Sequential
model = Sequential()
model

#### add layers

this `model` object is a container into which we put our neural net layers.

a simple neural net is defined by the number of layers it has (depth), the number of nodes it has in each of those layers, the activation functions used in each of those layers, and perhaps some other information.

for the very first layer, we must provide an extra piece of information. recall the neural network picture from before:

<br><img src="http://cs231n.github.io/assets/nn1/neural_net2.jpeg"></img>

for the hidden layers and the output layer, we already know the dimension of the individual records coming *into* each layer -- it's the number of nodes in the previous layer.

for the very first layer, that is not known, and is defined by our input data set.

let's work with the `iris` dataset again, where each input record has four elements. we need to reload it and re-split it after killing our kernel up above

In [None]:
import sklearn.datasets, sklearn.model_selection

iris = sklearn.datasets.load_iris()
x = iris.data
y = iris.target

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    x, y, random_state=1337, stratify=y, test_size=0.3)

input_dim = x.shape[1]
input_dim

furthermore, let's arbitrarily choose our neural net architecture (number of layers, nodes per layer, activation, etc) to be

+ a first layer that is a dense (fully connected) layer of 64 neurons with a `relu` activation
+ a second that is a dense (fully connected) layer of 10 neurons with a `relu` activation
+ a final that is a dense (fully connected) layer with a 3-category softmax activation
    + the softmax activation is making our prediction among the three possible `iris` categories

we specify all layers with calls to a `keras.layers` object such as the ones created by the `keras.layers.Dense` class, and we add them to our `model` (a thing which collects `layer` objects) via `model.add`:

In [None]:
from tensorflow.keras.layers import Dense

model.add(Dense(units=64, activation='relu', input_dim=input_dim))
model.add(Dense(units=10, activation='relu'))  # input_dim inferred from previous layer
model.add(Dense(units=3, activation='softmax'))

at any time we may access a convenient summary of our `keras` model with the (surprise!) `.summary()` method

In [None]:
model.summary()

#### `compile`

we specify the way (`optimizer`) that our `model` will be optimized, as well as the goal of that optimization (the `loss` function we wish to minimize) in the `model.compile` function.

In [None]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
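
to get a feel for what the `categorical_crossentropy` loss is measuring, here is a hand-rolled, single-record version in plain `python`. it is a sketch of the textbook formula (minus the log of the probability assigned to the true class), not `keras`'s implementation, which among other things averages over a batch. the target is a one-hot vector (a 1 in the true category's slot, 0 elsewhere), and the probabilities are made up

```python
import math


def categorical_crossentropy(y_true_onehot, y_pred_probas):
    """Loss for one record: -log(probability given to the true class)."""
    return -sum(t * math.log(p)
                for (t, p) in zip(y_true_onehot, y_pred_probas)
                if t > 0)


# the true class is the last one; compare a good and a bad prediction
good = categorical_crossentropy([0, 0, 1], [0.1, 0.2, 0.7])
bad = categorical_crossentropy([0, 0, 1], [0.7, 0.2, 0.1])
print(round(good, 4), round(bad, 4))  # 0.3567 2.3026
```

the more probability the model puts on the correct category, the smaller the loss, which is exactly what the optimizer is pushing towards.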

#### `fit`

having completely defined our deep learning model, we can now `fit` it to our training data. the only complication is that our output is currently assumed to be a 3-element vector (a one-hot encoding) of the known category. we can build this encoding with the `tf.keras.utils.to_categorical` function

In [None]:
y_train[:4]

In [None]:
y_train_onehot = keras.utils.to_categorical(y_train)
y_train_onehot[:4]
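
if the encoding itself is unfamiliar, this is all that is going on, written as a pure-`python` sketch. the helper `one_hot` is made up for illustration (the real function returns a `numpy` array); like the real one, it infers the number of classes from the largest label when none is given

```python
def one_hot(labels, num_classes=None):
    """Turn integer class labels into one-hot rows."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]


encoded = one_hot([0, 2, 1])
print(encoded)
```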

In [None]:
# this should take a minute or two. set verbose = 1 if you want progress info
model.fit(x_train,
          y_train_onehot,
          epochs=1000,  # number of times we will iterate over the dataset
          verbose=0,  # 0 is silent, 1 is a progress bar, and 2 is a line per epoch
          validation_split=0.1)  # hold out some validation data and evaluate after every epoch

##### diversion: `model.history`

after fitting, each model acquires a useful attribute `model.history`, which itself has an attribute `model.history.history`. this object is a dictionary which contains the after-each-epoch values of whatever `loss`es and `metric`s were recorded.

for example, we just recorded

In [None]:
model.history.history.keys()

you could easily plot these values over time (though you may choose to use `tensorboard` for that, more later):

In [None]:
import plotly.graph_objs as go

h = model.history.history
# use a fresh name for the epoch axis so we don't overwrite `x` (our iris features)
epochs = list(range(len(h['loss'])))
data = [go.Scatter(x=epochs, y=h['loss'], name='training loss'),
        go.Scatter(x=epochs, y=h['val_loss'], name='validation loss')]
go.Figure(data).show()

In [None]:
data = [go.Scatter(x=epochs, y=h['accuracy'], name='training accuracy'),
        go.Scatter(x=epochs, y=h['val_accuracy'], name='validation accuracy')]
go.Figure(data).show()

#### `evaluate`

our trained model can now be used to evaluate on our held-out test data -- remember, we have to one-hot encode that too

In [None]:
y_test_onehot = keras.utils.to_categorical(y_test)

In [None]:
test_loss, test_acc = model.evaluate(x_test, y_test_onehot)
print("test loss: {}".format(test_loss))
print("test accuracy: {}".format(test_acc))

#### `predict`

and for our held out records we also get softmax predictions, so a probability of every category for a given record

In [None]:
probas = model.predict(x_test)
probas[:3]

In [None]:
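import numpy as np

# the most likely class is the index of the largest softmax probability.
# (older `tensorflow` releases offered `model.predict_classes` for this, but it
# has since been removed)
predictions = np.argmax(probas, axis=-1)
predictions[:3]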
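
in case the softmax output is still a bit mysterious, here is a small pure-`python` sketch of the standard formula applied to made-up raw scores for a single record. it is meant only to show why the outputs behave like probabilities and how a class prediction falls out of them

```python
import math


def softmax(scores):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
# the predicted class is the index of the largest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```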

In [None]:
import numpy as np
# the probability we assigned to the predicted (most likely) class of each record
predicted_probas = probas[np.arange(probas.shape[0]), predictions]
predicted_probas[:3]

In [None]:
data = [go.Histogram(x=predicted_probas)]
go.Figure(data).show()

#### summary

let's bring the above example together into one code block:

```python
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

input_dim = x.shape[1]
# note: you can replace multiple model.add with a list passed to Sequential
model = Sequential([Dense(units=64, activation='relu', input_dim=input_dim),
                    Dense(units=10, activation='relu'),
                    Dense(units=3, activation='softmax')])
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])
y_train_onehot = keras.utils.to_categorical(y_train)
model.fit(x_train, y_train_onehot, epochs=1000, verbose=0, validation_split=0.1)
y_test_onehot = keras.utils.to_categorical(y_test)
test_loss, test_acc = model.evaluate(x_test, y_test_onehot)
print("test loss: {}".format(test_loss))
print("test accuracy: {}".format(test_acc))
```

hopefully you recognize at this point the value of the `keras` `api` in terms of normalizing and simplifying our neural net model development!

### sequential vs. functional `api`s

the way we constructed our model up above used what `keras` calls its ["sequential" `api`](https://keras.io/getting-started/sequential-model-guide), in which you create a `Sequential` model object and use that to add on layers of the `tf.keras.layers` object type one layer at a time.

if the models you are attempting to build are *sequential* and are composed out of the simplest provided `layer` types, you should default to the sequential `api`

if, on the other hand, your model architecture is not strictly linear -- perhaps it includes branches that split off and join together again, as in a ladder network or a variational auto-encoder; or it includes multiple copies of one portion being used in tandem, as in a siamese network -- you may be better off using [the functional `api`](https://keras.io/getting-started/functional-api-guide)

while in the sequential `api` we treat a model as a list of consecutive layers all concatenated together, the functional `api` hearkens back to the `tensor` graph of base `tensorflow`: every layer (e.g. `Dense(64)`) is an *object*, and each *object* can be *called* as if it were a function. when you *call* the layer, you are effectively attaching some tensor as an *input* to that layer (in effect, building the execution graph)

this means, among other things, that you can attach layers in non-sequential ways.

it is straight-forward to replicate our work above. we start by defining the different layers (now we are thinking of these as `tensor`s). the first is an input layer:

In [None]:
from tensorflow.keras.layers import Input

inputs = Input(shape=(4,))
inputs

we construct the graph of `tensor`s one layer-call at a time by declaring what a given layer's input is and capturing the output as a variable. schematically this looks like:

```python
layer_output = NewLayer(...)(layer_input)
```

this is really just a very compact way of writing

```python
new_layer = NewLayer(...)
layer_output = new_layer(layer_input)
```

in practice the intermediate `layer_output` variable is often repeatedly called `x` (because we don't need it after we've defined the graph). thus, our layers collectively are built like

In [None]:
inputs = Input(shape=(4,))
x = Dense(units=64, activation='relu', autocast=False)(inputs)
x = Dense(units=10, activation='relu', autocast=False)(x)
predictions = Dense(units=3, activation='softmax', autocast=False)(x)
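
the "a layer is an object you can call" pattern is ordinary `python`, and a toy version shows why it makes branching easy. the classes `Scale` and `Add` below are made up for illustration and operate on plain lists of numbers; real `keras` layers are called on symbolic `tensor`s instead, so treat this only as a sketch of the calling convention

```python
class Scale:
    """A toy 'layer': configured once, then called on an input."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, values):
        return [self.factor * v for v in values]


class Add:
    """A toy merging 'layer': sums several branches element-wise."""

    def __call__(self, branches):
        return [sum(vs) for vs in zip(*branches)]


toy_inputs = [1.0, 2.0, 3.0]
# two branches split off from the same input...
branch_a = Scale(2.0)(toy_inputs)
branch_b = Scale(10.0)(toy_inputs)
# ...and are joined together again
merged = Add()([branch_a, branch_b])
print(merged)
```

a list of consecutive layers cannot express the split-and-join above, which is the kind of architecture the functional `api` is meant for.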

after creating the execution graph this way, just as with `tensorflow` we must define our `input` and our `output` `tensor`s in order to tell the training computation environment which graph components we need to execute:

In [None]:
from tensorflow.keras.models import Model

model = Model(inputs=inputs, outputs=predictions)

at this point, everything will proceed the same way it did before. we `compile` the model to define the training algorithm, and then we `fit` the model to training data and `predict` on test data

In [None]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
y_train_onehot = keras.utils.to_categorical(y_train)
model.fit(x_train, y_train_onehot, epochs=1000, verbose=0, validation_split=0.1)
y_test_onehot = keras.utils.to_categorical(y_test)
test_loss, test_acc = model.evaluate(x_test, y_test_onehot)
print("test loss: {}".format(test_loss))
print("test accuracy: {}".format(test_acc))

additionally, the **`model` itself** is also callable, returning the predicted values:

In [None]:
predictions = model(x_test)
predictions[:3]

remember, the motivation for having this `api` is to make more complicated network architectures, especially ones in which we re-use components. when in doubt, **try the sequential `api` first**

### callbacks

one last item on `keras` before we call it a day: `keras` has a notion of `callbacks` -- lists of functions that are meant to be called every time you pass some milestone in training. for example, the way we get our progress info printed to screen is via a `callback` function that is activated after every `epoch`

`keras` defines [a handful of useful native `callback` objects](https://keras.io/api/callbacks/), and also provides users with the ability to define their own via inheriting from the `keras.callbacks.Callback` class.

`callback`s can be registered to the beginning and end of

+ the entire training run
+ any single epoch within a training run
+ any single batch within an epoch
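
the three levels above can be sketched with a toy training loop in plain `python`. this is not `keras` code: `ToyCallback`, `EventRecorder`, and `toy_fit` are hypothetical names, and the loop does no actual training. the real `keras` hooks have similar names (e.g. `on_epoch_end`) and additionally receive a `logs` dictionary:

```python
class ToyCallback:
    """base class: every hook is a no-op until a subclass overrides it"""

    def on_train_begin(self):
        pass

    def on_train_end(self):
        pass

    def on_epoch_begin(self, epoch):
        pass

    def on_epoch_end(self, epoch):
        pass

    def on_batch_begin(self, batch):
        pass

    def on_batch_end(self, batch):
        pass


class EventRecorder(ToyCallback):
    """overrides a few hooks and records the order in which they fire"""

    def __init__(self):
        self.events = []

    def on_train_begin(self):
        self.events.append("train begin")

    def on_epoch_end(self, epoch):
        self.events.append("epoch {} end".format(epoch))

    def on_batch_end(self, batch):
        self.events.append("batch {} end".format(batch))

    def on_train_end(self):
        self.events.append("train end")


def toy_fit(epochs, batches_per_epoch, callbacks):
    """a training loop with the training removed: only the hooks remain"""
    for cb in callbacks:
        cb.on_train_begin()
    for epoch in range(epochs):
        for cb in callbacks:
            cb.on_epoch_begin(epoch)
        for batch in range(batches_per_epoch):
            for cb in callbacks:
                cb.on_batch_begin(batch)
            # (a real training step on one batch would happen here)
            for cb in callbacks:
                cb.on_batch_end(batch)
        for cb in callbacks:
            cb.on_epoch_end(epoch)
    for cb in callbacks:
        cb.on_train_end()


recorder = EventRecorder()
toy_fit(epochs=2, batches_per_epoch=2, callbacks=[recorder])
print(recorder.events)
```

the recorded events come out as `train begin`, then (`batch 0 end`, `batch 1 end`, `epoch N end`) for each of the two epochs, then `train end`.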

among the various callbacks, there are a few worth special mention

+ `ModelCheckpoint`: this callback will trigger after every epoch and save a checkpoint version of your current model to some `filepath` value
    + importantly, you can set the `save_best_only` parameter to be `True`, in which case the checkpoint will only be written if the last epoch resulted in the best-ever value of the monitored quantity (by default the validation loss, `val_loss`). that is, your final checkpoint will be whatever your best epoch was per that quantity
+ `EarlyStopping`: this callback will fire after every epoch and will terminate the training early if some monitored quantity (e.g. your validation set loss) has not improved in some number of epochs
    + this requires you to define "improved" (the `min_delta` parameter) and "some number" (the `patience` parameter), but is straightforward
+ `TensorBoard`: this callback will log history in a file consumable by the `tensorboard` application
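
the "keep the best epoch" and "stop when nothing has improved for a while" logic behind the first two callbacks can be sketched in a few lines of plain `python`. this is a sketch of the idea only, not `keras`'s actual implementation: `toy_early_stopping` is a hypothetical function, the validation losses are made up, and the `patience` and `min_delta` arguments are simply named after the corresponding `EarlyStopping` parameters:

```python
def toy_early_stopping(val_losses, patience=2, min_delta=0.0):
    """walk through per-epoch validation losses and report

    + which epoch was the best one seen (the one a "save best only"
      checkpoint would hold)
    + the epoch at which training would stop early (None if it never does)
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    stopped_epoch = None

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # "improved": remember this epoch and reset the counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stopped_epoch = epoch
                break

    return best_epoch, stopped_epoch


# made-up validation losses, one per epoch
val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.60]

best_epoch, stopped_epoch = toy_early_stopping(val_losses, patience=2)
print("best epoch: {}".format(best_epoch))        # best epoch: 2
print("stopped epoch: {}".format(stopped_epoch))  # stopped epoch: 4
```

notice that with `patience=2` training stops at epoch 4 and never sees the even better loss at epoch 5. a larger `patience` trades extra training time for a lower chance of stopping too soon.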

#### callback example

the following runs our exact same model as before, but this time when we `fit` the model we provide a `ModelCheckpoint` `callback` object

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint
ckpt_callback = ModelCheckpoint(filepath='/tmp/weights.hdf5',
                                verbose=1,
                                save_best_only=True)
model.fit(x_train, y_train_onehot, epochs=100, verbose=0, validation_split=0.1,
          callbacks=[ckpt_callback])

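the result of this is a checkpointed model file, saved *in the `keras` way* as an `hdf5` file. because we left `save_weights_only` at its default of `False`, this file holds the full model (architecture and weights), not just the weights: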

In [None]:
ls -alh /tmp/weights*

we can load a model from a checkpoint at any time via

In [None]:
loaded_model = keras.models.load_model('/tmp/weights.hdf5')
loaded_model.summary()

<strong><em><div align="center">it's cold outside, wear layers</div></em></strong>
<div align="center"><img src="https://cdn-images-1.medium.com/max/1600/1*U_mJ4Yq7pUctpFYwlx1u0g.jpeg" width="500px"></div>

# END OF LECTURE

next lecture: [`hadoop` and `spark`](015_hadoop.ipynb)

# appendix

## tensorflow `operation`s, `tensor`s, and the execution graph (v1)

***NOTE***: the following cells pertain to `tensorflow 1.x` **only**. they are left here as an indication of some of the differences between `tf1` and `tf2`, because `tf1` is still very common

what the diagram above (c.f. [the execution graph](#the-execution-graph) section) shows is **`operation` nodes** connected by **`tensor` edges**, i.e. the graph.

the ultimate output values are not calculated yet, but if we ask `tensorflow` nicely, saying "pretty pretty please" and all that, it will `flow` from the ultimate input `tf.constant` `tensor`s down to whatever output `tensor` we desire.

if you want to get any information out of any `tensor` edge or `operation` node in this graph, you need to *evaluate* it. `tensorflow` will then perform all the calculations required to get from an ultimate input source node to the requested edge or node.

you must ask for these values from within a ***session***: you create a context in which `tensorflow` knows what inputs it should expect (you define them when you `run` the session!) and which outputs to return (you are declaring this in real time).

there are a few ways to do this. first, for any `tensor` object you may call the `eval` method:

In [None]:
with tf.Session() as sess:
    mysum_value = mysum_op.eval()
mysum_value

alternatively, we could have asked the `tf.Session` object to `run`

In [None]:
sess.run?

and request any number of `tensor`s or `operation`s we wanted evaluated:

In [None]:
with tf.Session() as sess:
    x = sess.run([mymult_op, mycombo_op])
x

*note: any request for an `operation` to be evaluated within a `tf.Session.run` call will return `None` if successful and will raise an error otherwise*

this paradigm of creating a `tf.Session` and then evaluating `tensor`s and `operation`s from within that session is often referred to in the `tensorflow` docs as "graph execution", as opposed to "eager execution".
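
the "build the graph first, compute only when asked" idea can be mimicked in plain `python`. the following is a toy sketch and not how `tensorflow` is implemented: `ToyNode`, `toy_constant`, and `toy_run` are hypothetical names, and `toy_run` plays a role loosely similar to `sess.run`:

```python
import operator


class ToyNode:
    """a node in a toy computation graph: it remembers *how* to compute its
    value (a function and its input nodes) but computes nothing until asked
    """

    def __init__(self, name, fn, *inputs):
        self.name = name
        self.fn = fn
        self.inputs = inputs


def toy_constant(name, value):
    """a source node with no inputs that simply returns a fixed value"""
    return ToyNode(name, lambda: value)


def toy_run(node, evaluated):
    """evaluate `node` by first evaluating everything it depends on,
    appending the name of each node we touch to the `evaluated` list
    """
    input_values = [toy_run(parent, evaluated) for parent in node.inputs]
    evaluated.append(node.name)
    return node.fn(*input_values)


# building the graph: nothing is calculated here
a = toy_constant("a", 1)
b = toy_constant("b", 2)
c = toy_constant("c", 3)
toy_sum = ToyNode("sum", operator.add, a, b)
toy_mult = ToyNode("mult", operator.mul, b, c)

# running the graph: only the nodes needed for `toy_sum` are evaluated
evaluated = []
sum_value = toy_run(toy_sum, evaluated)
print(sum_value)  # 3
print(evaluated)  # ['a', 'b', 'sum']
```

the `c` and `mult` nodes are never touched, because nothing we requested depends on them.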

## graph execution vs. eager execution

to make a long story short, `google` tried to make everyone use `tensorflow` in their preferred more rigorous way and everyone hated it, so `google` relented and now lets everyone do things the way you hoped it would work all along

when developing code, this extremely structured way of doing things (build a computation graph, then run it in a session) can be... pretty annoying.

the google developers created an "eager execution" functionality to address exactly this problem. you can

1. develop your code in "eager execution" mode -- get the results of your operation immediately
1. remove one line of code from the beginning of your developed file and put everything inside a `tf.Session` to get the "production" behavior

note: choosing to work with graph execution or eager execution is a one-time-only decision. if you've started creating graphs with `with tf.Session()` or started performing eager execution (see below), you can't change to the other. you must kill your `python` session and restart to switch. so here, we will restart our `jupyter` kernel and start anew

In [None]:
# NOTE:
# you must restart your kernel if you want to do this!!!
import tensorflow as tf
try:
    tf.enable_eager_execution()
except ValueError:
    print("restart your kernel!!!!")
    raise

mysum_op = tf.add(1, 1)
mysum_op

In [None]:
mysum_op.numpy()

note: this was calculated for us, no need to create a `tf.Session`. yep, that's the entire point!

## `dataset` digression: making iterators

in `tf1`, each `Dataset` object is a thing which can *create* an iterator (it is not, itself, a thing which can be iterated). there are a number of methods for creating an `iterator` (an object we can iterate over to yield our individual row-records)

these methods are

+ `dataset.make_one_shot_iterator()`: will create a "one shot" iterator which we can iterate over exactly once
+ `dataset.make_initializable_iterator()`: will give us an iterator we must initialize with some `tf.placeholder` value
    + depending on exactly how we create the `dataset` that we then use to create an initializable iterator, this could be something we have to initialize
        + once and only once (*initializable*)
        + multiple times (*reinitializable*)
        + conditionally (*feedable*)
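
a loose plain-`python` analogy for the "one shot" versus "initializable" distinction (this is not `tensorflow` code, and the analogy is imperfect): below, the list `records` stands in for a `dataset`, and each call to `iter` stands in for creating or re-initializing an iterator. all names and values are made up for illustration:

```python
# stands in for a dataset: a source of row-records
records = [("a", 1), ("b", 2), ("c", 3)]

# stands in for a "one shot" iterator: we can walk through it exactly once
one_shot = iter(records)
first_pass = list(one_shot)
second_pass = list(one_shot)  # already exhausted, so nothing is left

# to go through the records again we need a fresh iterator, which is
# roughly what re-initializing an iterator buys us
fresh = iter(records)
third_pass = list(fresh)

print(len(first_pass), len(second_pass), len(third_pass))  # 3 0 3
```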