# Machine Learning (part 2)

**BMI 773 Clinical Research Informatics**

*Yuriy Sverchkov*

*March 25, 2020*

[![xkcd: Machine Learning](images/xkcd_machine_learning.png)](https://xkcd.com/1838/)

## Lecture goals

*stretch goals in italics*

* Understand concepts
  * perceptrons
  * hidden units
  * multilayer neural networks
  * gradient descent
  * backpropagation
  * activation functions
    * sigmoid, hyperbolic tangent, ReLU
  * loss functions
    * squared error, cross entropy
  * logistic regression as a neural network
  * input encodings
  * output encodings
  * *autoencodings*
  * *word embeddings*
  * *recurrent neural networks*
  * *convolutional networks*
  * *decision trees*
* Pitfalls of machine learning
  

## Artificial neural networks (ANNs) and deep learning

Artificial neural networks have seen much success in recent years.

[find examples]

### Biological neurons

<table>
    <tr>
        <td><img src="images/neuron.png"></td>
        <td><img src="images/all-or-none_law_en.svg" width=300></td>
    </tr>
</table>

* Receive inputs from other neurons via dendrites
* Activate at a certain stimulus threshold
* Signal to other neurons via terminal branches

### An artificial neural network's "neuron"

<img src='images/artificial-neuron.svg' />

$$
\underbrace{y}_\text{output} =
\sigma\left( \overbrace{
w_0 + \sum_{i=1}^m \underbrace{x_i}_\text{inputs} w_i
}^z \right) =
\sigma(
\underbrace{\langle w_0, w_1, w_2, \ldots, w_m \rangle}_{\mathbf w} \cdot
\underbrace{\langle 1, x_1, x_2, \ldots, x_m \rangle}_{\mathbf x}
)
$$
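The formula above can be sketched in a few lines of Python. This is only an illustration: it assumes that $\sigma$ is the logistic sigmoid (the formula itself allows any activation function), and the weights and inputs are made-up values.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs, activation=sigmoid):
    """Compute y = activation(w . x) for a single unit.

    weights: [w_0, w_1, ..., w_m], where w_0 is the bias weight
    inputs:  [x_1, ..., x_m]
    """
    x = [1.0] + list(inputs)  # prepend the constant 1 that pairs with w_0
    z = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return activation(z)


# Hypothetical weights and inputs, chosen only for illustration
y = neuron_output([0.5, -1.0, 2.0], [1.5, 0.5])
```

Here $z = 0.5 - 1.5 + 1.0 = 0$, so the unit outputs $\sigma(0) = 0.5$.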

### Multilayer neural network

A typical neural network has many neurons arranged in many layers.

<img src='images/lecun-nature-2015-f1c.png' width=600px/>

### Logistic regression as a neural network

$$ \hat y = \mathrm{expit}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_d x_d) $$

<img src='images/lr_net.svg' align='right' />

* __Input units:__ $$\mathbf x = \langle x_1, x_2, \ldots, x_d \rangle$$
* __Weights:__ $$\boldsymbol \beta = \langle \beta_0, \beta_1, \beta_2, \ldots, \beta_d \rangle$$
* __Activation function:__ $$\mathrm{expit}(z) = \frac{1}{1 + e^{-z}}$$
* __Output units:__ $$\hat y$$
* No __hidden units__.

## Training ANNs

### How did we train logistic regression?

* We had a training set $(\mathbf x^{(1)}, y^{(1)}), \ldots, (\mathbf x^{(n)}, y^{(n)})$
* Our model was a function $$ f_{LR}(\boldsymbol \beta, \mathbf x^{(i)}) $$ with parameters $\boldsymbol \beta$
* We had an instance-wise cost function (cross-entropy loss) $$
\mathrm{cost}(y, \hat y) =
 \begin{cases}
  -\log( \hat y ) & \text{ if }y=1 \\
  -\log( 1-\hat y ) & \text{ if }y=0
 \end{cases} $$

Training meant finding the parameters that minimize the average cost across all training instances.

$$ \hat{\boldsymbol \beta} = \underset{\boldsymbol \beta}{\arg \min} \frac{1}{n} \sum_{i=1}^n \mathrm{cost}(y^{(i)}, f_{LR}(\boldsymbol \beta, \mathbf x^{(i)})) $$

We found the minimum using numerical optimization software.
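As a rough sketch of the quantity being minimized, the average cross-entropy cost of a logistic regression model can be written as follows. The toy training set and the two candidate parameter vectors are invented for illustration, and the function names are not from any particular library.

```python
import math


def expit(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def f_lr(beta, x):
    """Logistic regression prediction; beta[0] is the intercept."""
    z = beta[0] + sum(b * x_j for b, x_j in zip(beta[1:], x))
    return expit(z)


def cost(y, y_hat):
    """Instance-wise cross-entropy loss for a binary label y in {0, 1}."""
    return -math.log(y_hat) if y == 1 else -math.log(1.0 - y_hat)


def average_cost(beta, xs, ys):
    """Mean cost over the training set: the objective to minimize."""
    return sum(cost(y, f_lr(beta, x)) for x, y in zip(xs, ys)) / len(xs)


# Made-up training set with one feature per instance
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0, 0, 1, 1]

cost_zero = average_cost([0.0, 0.0], xs, ys)     # every prediction is 0.5
cost_better = average_cost([-3.0, 2.0], xs, ys)  # predictions follow the labels
```

With all parameters at zero every prediction is 0.5 and the average cost is $\log 2$; parameters whose predictions track the labels give a lower average cost, which is what the optimizer searches for.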

### Training ANNs

* Training set: $(\mathbf x^{(1)}, \mathbf y^{(1)}), \ldots, (\mathbf x^{(n)}, \mathbf y^{(n)})$ for supervised learning
  * Unsupervised learning still needs a training set, but without targets: $\mathbf x^{(1)}, \ldots, \mathbf x^{(n)}$
  * The targets $\mathbf y$ may also be multidimensional vectors

* The full ANN can still be viewed as a function $f_\mathrm{net}( \mathbf W, \mathbf x )$
* $\mathbf W$ represents all the weights in the network - these are the parameters

* We need to pick a cost function (also called an *objective function*); the choice depends on the task and data representation

* As with LR, the task is to minimize the average cost:
$$ \hat{\mathbf W} = \underset{\mathbf W}{\arg \min} \frac{1}{n} \sum_{i=1}^n \mathrm{cost}(\mathbf y^{(i)}, f_\mathrm{net}(\mathbf W, \mathbf x^{(i)})) $$
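To make the "network as a function" view concrete, here is a minimal sketch of $f_\mathrm{net}(\mathbf W, \mathbf x)$ for a fully connected network. It assumes sigmoid activations in every layer and represents $\mathbf W$ as a list of per-layer weight tables; both choices, and the weight values, are illustrative assumptions rather than anything prescribed here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, inputs):
    """Apply one fully connected layer of sigmoid units.

    weights: one row per unit, each row [w_0, w_1, ..., w_m] with bias w_0
    inputs:  [x_1, ..., x_m]
    """
    x = [1.0] + list(inputs)  # constant 1 pairs with each unit's bias weight
    return [sigmoid(sum(w * v for w, v in zip(row, x))) for row in weights]


def f_net(W, x):
    """Forward pass: feed x through each layer's weights in turn."""
    activations = list(x)
    for layer_weights in W:
        activations = layer(layer_weights, activations)
    return activations


# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output unit
W = [
    [[0.1, 0.4, -0.2], [0.0, -0.3, 0.8], [-0.5, 0.2, 0.2]],
    [[0.0, 1.0, -1.0, 0.5]],
]
y_hat = f_net(W, [1.0, 2.0])
```

Every number in `W` is a parameter, so training means searching over all of them at once for the values that minimize the average cost.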

The following four slides are from Mark Craven's Machine learning course.

<img src='images/craven-slides/ann-3-08.png' width='90%' />

<img src='images/craven-slides/ann-3-09.png' width='90%' />

![](images/craven-slides/ann-3-10.png)

<img src='images/craven-slides/ann-3-11.png' width='90%' />

### Minimizing the cost function

$$ \hat{\mathbf W} = \underset{\mathbf W}{\arg \min} \frac{1}{n} \sum_{i=1}^n \mathrm{cost}(\mathbf y^{(i)}, f_\mathrm{net}(\mathbf W, \mathbf x^{(i)})) $$

* __Gradient descent__ (and variants of it) is an algorithm for finding the parameter values that minimize the cost function.
* __Backpropagation__ is an efficient method for computing the gradient of the cost function with respect to the weights in an ANN.

#### Gradient descent
* __Gradient:__ the multidimensional extension of a derivative
  * __Derivative:__ the slope of a function (locally) around a point
* __Descent:__ Using the gradient of the cost function to figure out how to change the network's weights to decrease the cost

![Alykhan Tejani: A Brief Introduction To Gradient Descent](images/gradient_descent_line_graph.gif)
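A minimal sketch of plain gradient descent is shown here on a toy two-parameter cost whose minimum is known in advance. The learning rate, the number of steps, and the cost function are arbitrary choices for illustration; practical training typically uses variants such as stochastic gradient descent.

```python
def gradient_descent(grad, w_init, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from w_init."""
    w = list(w_init)
    for _ in range(n_steps):
        g = grad(w)
        # Move each weight a small step in the direction that lowers the cost
        w = [w_j - learning_rate * g_j for w_j, g_j in zip(w, g)]
    return w


# Toy cost: cost(w) = (w_1 - 3)^2 + (w_2 + 1)^2, minimized at (3, -1)
def toy_grad(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w_hat = gradient_descent(toy_grad, [0.0, 0.0])
```

For this cost each step shrinks the distance to the minimum by a constant factor, so `w_hat` ends up very close to $(3, -1)$. A learning rate that is too large would instead overshoot and can diverge.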

#### Backpropagation

Backpropagation is an application of the chain rule from calculus.

<img src="images/toy_net.svg" />

$$ f_\mathrm{net}(w_1, w_2, x) = f_y( w_2 \underbrace{ f_h( w_1 x ) }_h ) $$

$$
\underbrace{
  \frac{
    \partial \mathrm{cost}(f_\mathrm{net}(w_1, w_2, x) )
  }{\partial w_2}
}_{\text{change of cost w.r.t. } w_2} =
\overbrace{ \frac{\partial \mathrm{cost} }{ \partial y } }^{\text{change of cost w.r.t. } y} 
\underbrace{ \frac{\partial y }{ \partial w_2 } }_{\text{change of } y \text{ w.r.t. } w_2}
$$

$$
\underbrace{
  \frac{
    \partial \mathrm{cost}(f_\mathrm{net}(w_1, w_2, x) )
  }{\partial w_1}
}_{\text{change of cost w.r.t. } w_1} =
\overbrace{ \frac{\partial \mathrm{cost} }{ \partial y } }^{\text{change of cost w.r.t. } y} 
\underbrace{ \frac{\partial y }{ \partial h } }_{\text{change of } y \text{ w.r.t. } h}
\overbrace{ \frac{\partial h }{ \partial w_1 } }^{\text{change of } h \text{ w.r.t. } w_1}
$$

1. Forward pass from inputs to outputs computes values of units
2. Backward pass from outputs to inputs computes gradients (derivatives) at those values
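The two passes can be sketched for the toy network above. This sketch assumes that both $f_h$ and $f_y$ are sigmoids and that the cost is half the squared error against a target value; neither choice is fixed by the toy network, and the finite-difference helper is included only as a sanity check on the chain-rule result.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    """Forward pass: compute the hidden value h and the output y."""
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    return h, y


def cost(y, target):
    """Assumed cost: half the squared error."""
    return 0.5 * (y - target) ** 2


def backward(w1, w2, x, target):
    """Backward pass: chain rule evaluated at the forward-pass values."""
    h, y = forward(w1, w2, x)
    dcost_dy = y - target
    dy_dw2 = y * (1.0 - y) * h   # sigmoid'(z) = y (1 - y), with z = w2 * h
    dy_dh = y * (1.0 - y) * w2
    dh_dw1 = h * (1.0 - h) * x
    grad_w1 = dcost_dy * dy_dh * dh_dw1
    grad_w2 = dcost_dy * dy_dw2
    return grad_w1, grad_w2


def numeric_grad(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to check backward()."""
    def c(a, b):
        return cost(forward(a, b, x)[1], target)
    g1 = (c(w1 + eps, w2) - c(w1 - eps, w2)) / (2.0 * eps)
    g2 = (c(w1, w2 + eps) - c(w1, w2 - eps)) / (2.0 * eps)
    return g1, g2


# Hypothetical weights, input, and target
analytic = backward(0.5, -1.2, 2.0, 1.0)
numeric = numeric_grad(0.5, -1.2, 2.0, 1.0)
```

The factor `dcost_dy` is computed once and reused for both weights. That reuse of shared intermediate terms is what makes backpropagation efficient in larger networks.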

## Network architecture tailored to the task

* Hierarchical networks
* Convolutional neural networks
* Recurrent neural networks

### Hierarchically structured deep network

![J. Ma et al., “Using deep learning to model the hierarchical structure and function of a cell,” Nat Methods, vol. 15, no. 4, pp. 290–298, Apr. 2018, doi: 10.1038/nmeth.4627.](images/ma2018title.png)

![](images/ma2018fig.png)

### Convolutional neural networks

![](images/alexnet.png)

<img src="images/convolution.gif" width=200 />

### Recurrent neural network

![](images/rnn.svg)

## Decision trees

![](images/craven-slides/dt-1-03.png)

![](images/craven-slides/dt-1-07.png)

## ML pitfalls

### Spurious correlations

![](images/caruana2015-title.png)

> The 2nd term in the model, asthma, is the one that caused trouble in the CEHC study in the mid-90’s and prevented clinical trials with the very accurate neural net model. The GA2M model has found the same pattern discovered back then: that **having asthma lowers the risk of dying from pneumonia.**

### Bias in the dataset

![](images/bissoto2019-title.png)

* Patterns in data collection can introduce information about the learning target that is irrelevant to the true task
* Example:
  * Deep convolutional neural networks trained to classify skin lesions
  * What happens when we remove the medically relevant information from the images?

![](images/bissoto2019-images.png)

<img src='images/bissoto2019-aucs.png' width=500 align='left' /> The networks still perform well above random chance (AUC=50%) when all information about the lesions is obscured!

### Adversarial attacks

<img alt='S. G. Finlayson, J. D. Bowers, J. Ito, J. L. Zittrain, A. L. Beam, and I. S. Kohane, “Adversarial attacks on medical machine learning,” Science, vol. 363, no. 6433, pp. 1287–1289, Mar. 2019, doi: 10.1126/science.aaw4399.' src='images/finlayson2019-title.png' width=800 />

![](images/finlayson2019-1.png)

![](images/finlayson2019-2.png)

-----

## Image credits in order of appearance

* [XKCD](https://xkcd.com/1838/)
* [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:All-or-none_law_en.svg)
* [Wikimedia Commons (Prof. Loc Vu-Quoc), CC BY-SA 4.0](https://commons.wikimedia.org/w/index.php?curid=72816083)
* LeCun et al. Nature 2015
* Select slides from Prof. Mark W Craven
* [Alykhan Tejani: A Brief Introduction To Gradient Descent](https://alykhantejani.github.io/a-brief-introduction-to-gradient-descent/)
* Ma et al. Nature Methods 2018
* CNN: https://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/
* Convolution gif: http://deeplearning.stanford.edu/
* RNN: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Recurrent_neural_network_unfold.svg)
* Caruana et al. KDD 2015
* Bissoto et al. CVPR 2019
* Finlayson et al. Science 2019