# UBC Scientific Software Seminar

## Hands-on Machine Learning with Scikit-Learn and TensorFlow

---

## Chapter 9: Up and running with TensorFlow

See [Hands-on Machine Learning with Scikit-Learn and TensorFlow](https://github.com/ageron/handson-ml) (by Aurélien Géron).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import tensorflow as tf

In [3]:
tf.__version__

'1.4.0'

## Computational graphs

A [computational graph](https://www.tensorflow.org/programmers_guide/graph_viz) is a set of nodes connected by edges such that nodes represent tensor operations and edges represent the tensors flowing to and from nodes.

We construct a computational graph by defining nodes for constant, variable and placeholder tensors, together with nodes for operations which combine tensors by addition, matrix multiplication, etc.

| Node | Description |
| ---: | :--- |
| `tf.constant` | tensors which do not change in the computation |
| `tf.Variable` | tensors (such as model parameters) which get updated during the computation |
| `tf.placeholder` | tensors which get assigned input data |
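
To make the three node types concrete, here is a toy sketch in plain Python and NumPy. It is not TensorFlow's implementation: the `Node` class and the helpers `constant`, `variable`, `placeholder`, `matmul` and `add` are hypothetical names invented for this illustration. The point is only that building the graph performs no arithmetic; values are computed later, when the graph is evaluated with input data.

```python
import numpy as np

class Node:
    """Toy graph node (hypothetical, for illustration only)."""
    def __init__(self, kind, inputs=(), value=None, fn=None):
        self.kind = kind      # "constant", "variable", "placeholder" or "op"
        self.inputs = inputs  # upstream nodes (edges of the graph)
        self.value = value    # stored tensor for constants and variables
        self.fn = fn          # function applied by operation nodes

    def eval(self, feed=None):
        feed = feed or {}
        if self.kind == "placeholder":
            # Placeholders get their value from the input data at run time
            return np.asarray(feed[self], dtype=float)
        if self.kind in ("constant", "variable"):
            return self.value
        # Operation node: evaluate the inputs first, then combine them
        return self.fn(*[node.eval(feed) for node in self.inputs])

def constant(value):
    return Node("constant", value=np.asarray(value, dtype=float))

def variable(value):
    return Node("variable", value=np.asarray(value, dtype=float))

def placeholder():
    return Node("placeholder")

def matmul(a, b):
    return Node("op", inputs=(a, b), fn=np.matmul)

def add(a, b):
    return Node("op", inputs=(a, b), fn=np.add)

# Build the graph for y = X w + b. Nothing is computed at this point.
X = placeholder()
w = variable([[2.0], [3.0]])
b = constant(1.0)
y = add(matmul(X, w), b)

# Run the graph by feeding data into the placeholder
data = [[1.0, 1.0], [0.0, 2.0]]
first = y.eval({X: data})

# Variables can be updated between runs; constants stay fixed
w.value = np.array([[0.0], [1.0]])
second = y.eval({X: data})

print(first.ravel(), second.ravel())
```

In TensorFlow 1.x the same separation appears as graph construction followed by evaluation inside a `tf.Session`.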

## Example: Linear regression by the formula

Let's construct a graph for fitting a [linear regression](https://en.wikipedia.org/wiki/Ordinary_least_squares) model. Suppose we want to fit a linear model

$$
f(X; \theta) = \theta_0 + \theta_1 X_1 + \cdots + \theta_m X_m
$$

to a set of $n$ data points with $m$ features: $(x_{1,1},\dots,x_{1,m}; y_1)$, $(x_{2,1},\dots,x_{2,m}; y_2)$, ... , $(x_{n,1},\dots,x_{n,m}; y_n)$.

The model parameters $\hat{\theta}$ which minimize the sum of squared errors

$$
SSE(\theta) = \sum_{i=1}^{n} (f(x_{i,1},x_{i,2},\dots,x_{i,m}; \theta) - y_i)^2
$$

are given by the formula

$$
\hat{\theta} = (X^T X)^{-1} X^T y
$$

where $y = (y_1, \dots, y_n)^T$ is the vector of target values and the matrix $X$ is

$$
X = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,m} \\
1 & x_{2,1} & \cdots & x_{2,m} \\
\vdots & & & \vdots \\
1 & x_{n,1} & \cdots & x_{n,m}
\end{bmatrix}
$$
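
The formula follows from setting the gradient of the sum of squared errors, $2 X^T (X\theta - y)$, to zero, which gives the normal equations $X^T X \theta = X^T y$. As a quick sanity check, the sketch below applies the formula to small synthetic data (an assumption for illustration, not the housing data) where the true parameters are known, and compares it with NumPy's least-squares solver `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 3

# Synthetic, noise-free data generated from known parameters
features = rng.normal(size=(n_samples, n_features))
X = np.c_[np.ones((n_samples, 1)), features]   # prepend the column of ones
theta_true = np.array([[4.0], [3.0], [-2.0], [0.5]])
y = X @ theta_true

# Normal equation: theta = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver, which avoids forming the inverse explicitly
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat.ravel())
print(theta_lstsq.ravel())
```

Both approaches should recover `theta_true` up to rounding error because the data contain no noise.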

Let's import the [California housing dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) and fit a linear model to the data.

In [4]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
m, n = housing.data.shape

In [5]:
print(m,n)

20640 8


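The California housing dataset includes 20,640 samples. Each sample is a census block group described by 8 features. Note that in the code `m` is the number of samples and `n` is the number of features, the reverse of the notation used in the formulas above.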

In [6]:
housing.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

Let's create the computational graph to compute the model parameters and run the graph in a session:

In [7]:
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]

X = tf.constant(housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)

with tf.Session() as sess:
    theta_value = theta.eval()

In [8]:
theta_value

array([[ -3.74651413e+01],
       [  4.35734153e-01],
       [  9.33829229e-03],
       [ -1.06622010e-01],
       [  6.44106984e-01],
       [ -4.25131839e-06],
       [ -3.77322501e-03],
       [ -4.26648885e-01],
       [ -4.40514028e-01]], dtype=float32)

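Compare with pure NumPy, which computes in 64-bit floating point (the TensorFlow graph above uses `tf.float32`, so its values differ slightly):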

In [9]:
X = housing_data_plus_bias
y = housing.target.reshape(-1, 1)
theta_numpy = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

print(theta_numpy)

[[ -3.69419202e+01]
 [  4.36693293e-01]
 [  9.43577803e-03]
 [ -1.07322041e-01]
 [  6.45065694e-01]
 [ -3.97638942e-06]
 [ -3.78654265e-03]
 [ -4.21314378e-01]
 [ -4.34513755e-01]]


Compare with Scikit-Learn:

In [10]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing.data, housing.target.reshape(-1, 1))

print(np.r_[lin_reg.intercept_.reshape(-1, 1), lin_reg.coef_.T])

[[ -3.69419202e+01]
 [  4.36693293e-01]
 [  9.43577803e-03]
 [ -1.07322041e-01]
 [  6.45065694e-01]
 [ -3.97638942e-06]
 [ -3.78654265e-03]
 [ -4.21314378e-01]
 [ -4.34513755e-01]]
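
The TensorFlow result agrees with NumPy and Scikit-Learn only to about two significant digits. A plausible explanation is numerical precision: the graph uses `tf.float32`, and inverting $X^T X$ amplifies rounding errors when the columns of $X$ have large offsets or very different scales. The sketch below illustrates the effect with the normal equation in NumPy on synthetic data (an assumption for illustration, not the housing data); the helper `normal_equation` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Synthetic features with large offsets (loosely mimicking coordinates)
x1 = 10.0 + rng.normal(size=n_samples)
x2 = -20.0 + rng.normal(size=n_samples)
X = np.c_[np.ones(n_samples), x1, x2]
theta_true = np.array([[1.0], [2.0], [-3.0]])
y = X @ theta_true   # noise-free targets

def normal_equation(X, y):
    """Solve least squares with the explicit formula (X^T X)^{-1} X^T y."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

theta64 = normal_equation(X, y)                                        # float64
theta32 = normal_equation(X.astype(np.float32), y.astype(np.float32))  # float32

# Largest absolute deviation from the known parameters
err64 = np.abs(theta64 - theta_true).max()
err32 = np.abs(theta32 - theta_true).max()
print("float64 error:", err64)
print("float32 error:", err32)
```

Expect the single-precision error to be noticeably larger than the double-precision one. Standardizing the features, as in the next example, usually reduces this sensitivity.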


## Example: Linear regression using batch gradient descent

Gradient descent converges much faster when the features are on similar scales, so we standardize the feature vectors first. We could do this using TensorFlow, but let's just use Scikit-Learn for now.

In [11]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_housing_data = scaler.fit_transform(housing.data)
scaled_housing_data_plus_bias = np.c_[np.ones((m, 1)), scaled_housing_data]
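
For reference, standardization subtracts each column's mean and divides by its standard deviation. The NumPy sketch below does this by hand on synthetic data (an assumption for illustration); it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic features on very different scales
data = np.c_[rng.normal(35.0, 2.0, size=500),
             rng.normal(1400.0, 1100.0, size=500)]

mean = data.mean(axis=0)   # per-column mean
std = data.std(axis=0)     # per-column population standard deviation
scaled = (data - mean) / std

print(scaled.mean(axis=0))  # close to 0 for each column
print(scaled.std(axis=0))   # close to 1 for each column
```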

In [12]:
tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

gradients = tf.gradients(mse, [theta])[0]
training_op = tf.assign(theta, theta - learning_rate * gradients)
# Or use TensorFlow's GradientDescentOptimizer
# optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
# training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op) 
    best_theta = theta.eval()

print("Best theta:")
print(best_theta)

Epoch 0 MSE = 2.75443
Epoch 100 MSE = 0.632222
Epoch 200 MSE = 0.57278
Epoch 300 MSE = 0.558501
Epoch 400 MSE = 0.549069
Epoch 500 MSE = 0.542288
Epoch 600 MSE = 0.537379
Epoch 700 MSE = 0.533822
Epoch 800 MSE = 0.531243
Epoch 900 MSE = 0.529371
Best theta:
[[  2.06855249e+00]
 [  7.74078071e-01]
 [  1.31192386e-01]
 [ -1.17845066e-01]
 [  1.64778143e-01]
 [  7.44078017e-04]
 [ -3.91945094e-02]
 [ -8.61356676e-01]
 [ -8.23479772e-01]]
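
`tf.gradients` obtains the gradient of the MSE by automatic differentiation. For linear regression the gradient also has a closed form, $\nabla_\theta MSE = \frac{2}{N} X^T (X\theta - y)$ for $N$ samples, so the same update rule can be sketched in NumPy. The synthetic data, learning rate and number of steps below are assumptions chosen for illustration, not the settings used above.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 500, 3

# Synthetic features that are already roughly standardized
features = rng.normal(size=(n_samples, n_features))
X = np.c_[np.ones((n_samples, 1)), features]
theta_true = np.array([[2.0], [0.8], [-0.5], [0.1]])
y = X @ theta_true   # noise-free targets

learning_rate = 0.1
n_steps = 500
theta = rng.uniform(-1.0, 1.0, size=(n_features + 1, 1))  # random start

for step in range(n_steps):
    error = X @ theta - y
    # Gradient of the mean squared error with respect to theta
    gradients = 2.0 / n_samples * X.T @ error
    # Same update as tf.assign(theta, theta - learning_rate * gradients)
    theta = theta - learning_rate * gradients

mse = float(np.mean((X @ theta - y) ** 2))
print(theta.ravel(), mse)
```

Every step uses all `n_samples` rows of `X`, which is what makes this batch gradient descent.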


## Example: Linear regression using mini-batch gradient descent

In the previous example, we computed the gradient of the mean squared error using **all** the data points at every step. This can become computationally expensive! Instead, mini-batch gradient descent computes gradients using a small portion (a mini-batch) of the data for each step in gradient descent.

We define placeholder nodes in the graph so that we can feed in a different mini-batch at each step through the graph.

In [13]:
n_epochs = 1000
learning_rate = 0.01

tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")

theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

n_epochs = 10  # overrides the value of 1000 set at the top of this cell

batch_size = 100
n_batches = int(np.ceil(m / batch_size))

def fetch_batch(epoch, batch_index, batch_size):
    np.random.seed(epoch * n_batches + batch_index)  # not shown in the book
    indices = np.random.randint(m, size=batch_size)  # not shown
    X_batch = scaled_housing_data_plus_bias[indices] # not shown
    y_batch = housing.target.reshape(-1, 1)[indices] # not shown
    return X_batch, y_batch

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()

In [14]:
best_theta

array([[ 2.07001591],
       [ 0.82045609],
       [ 0.1173173 ],
       [-0.22739051],
       [ 0.31134021],
       [ 0.00353193],
       [-0.01126994],
       [-0.91643935],
       [-0.87950081]], dtype=float32)
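
Note that `fetch_batch` draws indices with `np.random.randint`, that is, with replacement, so an epoch does not necessarily visit every sample. A common alternative, sketched below, is to shuffle the indices once per epoch and slice them into consecutive batches. The helper name `iterate_minibatches` is hypothetical and the sketch only produces index arrays; it is not part of the notebook's training loop.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that together cover every sample exactly once."""
    permutation = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        # The last slice may be shorter than batch_size
        yield permutation[start:start + batch_size]

rng = np.random.default_rng(42)
batches = list(iterate_minibatches(n_samples=20640, batch_size=100, rng=rng))

n_batches = len(batches)            # ceil(20640 / 100) = 207
last_batch_size = len(batches[-1])  # 20640 - 206 * 100 = 40
covered = np.sort(np.concatenate(batches))

print(n_batches, last_batch_size)
```

Each index array could be used to select rows of the scaled data and targets in the same way as `indices` inside `fetch_batch`.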

## Visualizing the graph with TensorBoard

In [15]:
mse_summary = tf.summary.scalar('MSE', mse)
file_writer = tf.summary.FileWriter('graph', graph=tf.get_default_graph())

In [16]:
n_epochs = 10
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

In [17]:
with tf.Session() as sess:                                                        # not shown in the book
    sess.run(init)                                                                # not shown

    for epoch in range(n_epochs):                                                 # not shown
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            if batch_index % 10 == 0:
                summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
                step = epoch * n_batches + batch_index
                file_writer.add_summary(summary_str, step)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()                                                     # not shown

In [18]:
file_writer.close()

In [19]:
best_theta

array([[ 2.07001591],
       [ 0.82045609],
       [ 0.1173173 ],
       [-0.22739051],
       [ 0.31134021],
       [ 0.00353193],
       [-0.01126994],
       [-0.91643935],
       [-0.87950081]], dtype=float32)
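
The loop above writes one MSE summary every 10 mini-batches and labels it with the global step `epoch * n_batches + batch_index`. The short calculation below reproduces that arithmetic in plain Python to show how many points end up on the TensorBoard curve. Each point is the MSE of a single mini-batch, so the curve is expected to look noisy.

```python
import numpy as np

m = 20640          # number of samples in the housing data
batch_size = 100
n_epochs = 10
n_batches = int(np.ceil(m / batch_size))

# Global steps at which a summary is written
logged_steps = [epoch * n_batches + batch_index
                for epoch in range(n_epochs)
                for batch_index in range(n_batches)
                if batch_index % 10 == 0]

print(n_batches)                         # 207
print(len(logged_steps))                 # 210
print(logged_steps[0], logged_steps[-1]) # 0 2063
```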

Enter the following command into a terminal to open TensorBoard:

```
python -m tensorboard.main --logdir graph
```

[TensorBoard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard) is a web application for visualizing TensorFlow graphs and sessions.