# Machine Learning (part 2)

**BMI 773 Clinical Research Informatics**

*Yuriy Sverchkov*

*March 25, 2020*

[![xkcd: Machine Learning](images/xkcd_machine_learning.png)](https://xkcd.com/1838/)

## Lecture goals

*stretch goals in italics*

* Understand concepts
  * perceptrons
  * hidden units
  * multilayer neural networks
  * gradient descent
  * backpropagation
  * activation functions
    * sigmoid, hyperbolic tangent, ReLU
  * loss functions
    * squared error, cross entropy
  * logistic regression as a neural network
  * input encodings
  * output encodings
  * *autoencodings*
  * *word embeddings*
  * *recurrent neural networks*
  * *convolutional networks*
  * *decision trees*
* Pitfalls of machine learning
  

## Artificial neural networks (ANNs)

Artificial neural networks have seen much success in recent years.

[find examples]

### Biological neurons

<table>
    <tr>
        <td><img src="images/neuron.png"></td>
        <td><img src="images/all-or-none_law_en.svg" width=300></td>
    </tr>
</table>

* Receive inputs from other neurons via dendrites
* Activate at a certain stimulus threshold
* Signal to other neurons via terminal branches

### An artificial neural network's "neuron"

<img src='images/artificial-neuron.svg' />

$$
\underbrace{y}_\text{output} =
\sigma\left( \overbrace{
w_0 + \sum_{i=1}^m \underbrace{x_i}_\text{inputs} w_i
}^z \right) =
\sigma(
\underbrace{\langle w_0, w_1, w_2, \ldots, w_m \rangle}_{\mathbf w} \cdot
\underbrace{\langle 1, x_1, x_2, \ldots, x_m \rangle}_{\mathbf x}
)
$$
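The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`neuron`, `relu`) and the example weights are made up for this sketch, and $\sigma$ is treated as a pluggable activation function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def neuron(weights, inputs, activation=sigmoid):
    """weights = [w_0, w_1, ..., w_m]; inputs = [x_1, ..., x_m]."""
    x = [1.0] + list(inputs)  # prepend a constant 1 so that w_0 acts as the bias
    z = sum(w * xi for w, xi in zip(weights, x))  # dot product w . x
    return activation(z)

# Illustrative weights and inputs chosen so that z = -1 + 2*0.5 + 0.5*0 = 0
w = [-1.0, 2.0, 0.5]
x = [0.5, 0.0]
print(neuron(w, x))             # sigmoid(0) = 0.5
print(neuron(w, x, math.tanh))  # tanh(0) = 0.0
print(neuron(w, x, relu))       # relu(0) = 0.0
```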

### Multilayer neural network

A typical neural network has many neurons arranged in many layers.

<img src='images/lecun-nature-2015-f1c.png' width=600px/>

### Logistic regression as a neural network

$$ \hat y = \mathrm{expit}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_d x_d) $$

* __Input units:__ $\boldsymbol x = \langle 1, x_1, x_2, \ldots, x_d \rangle$
* __Weights:__ $\boldsymbol \beta = \langle \beta_0, \beta_1, \beta_2, \ldots, \beta_d \rangle$
* __Activation function:__ $\mathrm{expit}(z) = \frac{1}{1 + e^{-z}}$
* __Output units:__ $\hat y$
* No __hidden units__.

[insert image]

## Training ANNs

### How did we train logistic regression?

* We had a training set $(\mathbf x^{(1)}, y^{(1)}), \ldots, (\mathbf x^{(n)}, y^{(n)})$
* Our model was a function $$ f_{LR}(\boldsymbol \beta, \mathbf x^{(i)}) $$ with parameters $\boldsymbol \beta$
* We had an instance-wise cost function (cross-entropy loss) $$
\mathrm{cost}(y, \hat y) =
 \begin{cases}
  -\log( \hat y ) & \text{ if }y=1 \\
  -\log( 1-\hat y ) & \text{ if }y=0
 \end{cases} $$

Training meant finding the parameters that minimize the average cost across all training instances.

$$ \hat{\boldsymbol \beta} = \underset{\boldsymbol \beta}{\arg \min} \frac{1}{n} \sum_{i=1}^n \mathrm{cost}(y^{(i)}, f_{LR}(\boldsymbol \beta, \mathbf x^{(i)})) $$
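As a rough sketch of the quantity being minimized, the code below evaluates the average cross-entropy cost of a logistic regression model for two candidate parameter vectors. The toy data and the parameter values are invented for illustration; only `expit`, `cost`, and $f_{LR}$ come from the definitions above.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def f_lr(beta, x):
    """beta = [beta_0, ..., beta_d]; x = [x_1, ..., x_d]."""
    return expit(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))

def cost(y, y_hat):
    # Instance-wise cross-entropy loss
    return -math.log(y_hat) if y == 1 else -math.log(1.0 - y_hat)

def average_cost(beta, X, Y):
    return sum(cost(y, f_lr(beta, x)) for x, y in zip(X, Y)) / len(X)

# Made-up one-feature training set
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [0, 0, 1, 1]

uninformative = average_cost([0.0, 0.0], X, Y)  # every prediction is 0.5, so the cost is log(2)
better = average_cost([-3.0, 2.0], X, Y)        # a boundary between x = 1 and x = 2
print(uninformative, better)
```

Training amounts to searching for the $\boldsymbol \beta$ that makes this number as small as possible.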

### Training ANNs

* Training set: $(\mathbf x^{(1)}, \mathbf y^{(1)}), \ldots, (\mathbf x^{(n)}, \mathbf y^{(n)})$ for supervised learning
  * Unsupervised learning still needs a training set, just without targets: $\mathbf x^{(1)}, \ldots, \mathbf x^{(n)}$
  * The targets $\mathbf y$ may also be multidimensional vectors
* The full ANN can still be viewed as a function $f_\mathrm{net}( \mathbf W, \mathbf x )$
* $\mathbf W$ represents all the weights in the network - these are the parameters
* Need to pick a cost function (also *objective function*) - the choice depends on the task and data representation
* As with LR, the task is to minimize the average cost:
$$ \hat{\mathbf W} = \underset{\mathbf W}{\arg \min} \frac{1}{n} \sum_{i=1}^n \mathrm{cost}(\mathbf y^{(i)}, f_\mathrm{net}(\mathbf W, \mathbf x^{(i)})) $$
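One way to picture $f_\mathrm{net}(\mathbf W, \mathbf x)$ is as a loop over layers, each of which applies the single-neuron formula to every unit. The sketch below assumes fully connected layers with sigmoid activations and a squared error cost. Those are choices made for this example, as are the names `layer`, `f_net`, and `squared_error` and the weight values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W_layer, x):
    """One fully connected layer; each row of W_layer is [w_0, w_1, ..., w_m] for one unit."""
    x1 = [1.0] + list(x)
    return [sigmoid(sum(w * xi for w, xi in zip(row, x1))) for row in W_layer]

def f_net(W, x):
    """W is a list of per-layer weight matrices; the output of one layer feeds the next."""
    for W_layer in W:
        x = layer(W_layer, x)
    return x

def squared_error(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))

def average_cost(W, X, Y, cost=squared_error):
    return sum(cost(y, f_net(W, x)) for x, y in zip(X, Y)) / len(X)

# 2 inputs -> 2 hidden units -> 2 output units, with arbitrary illustrative weights
W = [
    [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]],  # hidden layer
    [[0.5, 1.0, 0.0], [-0.5, 0.0, 1.0]],   # output layer
]
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 0.0], [0.0, 1.0]]  # targets are vectors here
print(f_net(W, X[0]))
print(average_cost(W, X, Y))
```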

[Todo: find multi-class classification example]

The following four slides are from Mark Craven's Machine learning course.

![](images/craven-slides/ann-3-08.png)

![](images/craven-slides/ann-3-09.png)

![](images/craven-slides/ann-3-10.png)

![](images/craven-slides/ann-3-11.png)

#### Backpropagation and gradient descent

__Gradient descent__ is a method for finding the minimum of a function with respect to parameters.

![Alykhan Tejani: A Brief Introduction To Gradient Descent](gradient_descent_line_graph.gif)

__Backpropagation__ is an efficient method for computing the gradient of the cost function with respect to the weights in an ANN.
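The two ideas can be sketched together for a small network with one hidden layer. The sketch assumes sigmoid units throughout and the cross-entropy cost, for which the output-layer error term simplifies to $\hat y - y$. The function names, the toy data, the learning rate, and the number of steps are all illustrative choices, not part of the lecture material.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, v, x):
    """Return the inputs and hidden activations (each with a leading 1) and the output."""
    x1 = [1.0] + list(x)
    h1 = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(row, x1))) for row in W]
    y_hat = sigmoid(sum(vj * hj for vj, hj in zip(v, h1)))
    return x1, h1, y_hat

def cost(y, y_hat):
    return -math.log(y_hat) if y == 1 else -math.log(1.0 - y_hat)

def average_cost(W, v, X, Y):
    return sum(cost(y, forward(W, v, x)[2]) for x, y in zip(X, Y)) / len(X)

def backprop(W, v, x, y):
    """Gradient of the cost for one instance with respect to W (hidden) and v (output)."""
    x1, h1, y_hat = forward(W, v, x)
    delta_out = y_hat - y  # error at the output unit (sigmoid + cross-entropy)
    grad_v = [delta_out * hj for hj in h1]
    grad_W = []
    for j in range(len(W)):
        hj = h1[j + 1]
        # propagate the output error back through v_j and the sigmoid derivative
        delta_hidden = delta_out * v[j + 1] * hj * (1.0 - hj)
        grad_W.append([delta_hidden * xi for xi in x1])
    return grad_W, grad_v

def gradient_descent(W, v, X, Y, rate=0.1, steps=100):
    n = len(X)
    for _ in range(steps):
        grads = [backprop(W, v, x, y) for x, y in zip(X, Y)]
        # average the per-instance gradients, then step against them
        W = [[w - rate * sum(g[0][j][i] for g in grads) / n
              for i, w in enumerate(row)] for j, row in enumerate(W)]
        v = [vj - rate * sum(g[1][j] for g in grads) / n for j, vj in enumerate(v)]
    return W, v

rng = random.Random(0)
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [0, 1, 1, 0]
W0 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 hidden units
v0 = [rng.uniform(-1, 1) for _ in range(3)]                      # 1 output unit
W1, v1 = gradient_descent(W0, v0, X, Y)
print(average_cost(W0, v0, X, Y), average_cost(W1, v1, X, Y))
```

The loop here takes only a few small steps, so expect the cost to go down rather than to reach zero. A common sanity check for a hand-written `backprop` is to compare its output against finite-difference estimates of the same derivatives.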

## The value of multiple layers

* A single layer isn't very expressive
* Example: 1 input (blood pressure), 1 output (normal/abnormal)
  * 1 layer: not a good fit
  * 2 layers: gets full separation
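A possible version of the blood pressure example is sketched below. Everything in it is hypothetical: the readings, the labels, and the hand-picked weights (with cut points at 90 and 140) are for illustration only and are not clinical guidance. A single sigmoid unit is monotone in its input, so it can flag readings on only one side of a threshold, whereas two hidden units can flag both low and high readings.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical systolic readings (mmHg); 1 = abnormal, 0 = normal
bp = [70, 80, 110, 120, 130, 160, 180]
label = [1, 1, 0, 0, 0, 1, 1]

def best_single_unit_correct(xs, ys):
    """Most instances any single unit sigmoid(w0 + w1*x) gets right at a 0.5 cutoff.

    Such a unit predicts 'abnormal' on one side of a single threshold, so it is
    enough to try every threshold between neighbouring readings, in both directions.
    """
    xs_sorted = sorted(xs)
    cuts = [xs_sorted[0] - 1] + [(a + b) / 2 for a, b in zip(xs_sorted, xs_sorted[1:])]
    best = 0
    for t in cuts:
        for sign in (1, -1):
            preds = [1 if sign * (x - t) > 0 else 0 for x in xs]
            best = max(best, sum(p == y for p, y in zip(preds, ys)))
    return best

def two_layer(x):
    high = sigmoid(0.5 * (x - 140))        # hidden unit 1: fires for high readings
    low = sigmoid(0.5 * (90 - x))          # hidden unit 2: fires for low readings
    return sigmoid(10 * (high + low) - 5)  # output unit: fires if either hidden unit does

two_layer_correct = sum((two_layer(x) > 0.5) == (y == 1) for x, y in zip(bp, label))
print(best_single_unit_correct(bp, label), "of", len(bp))  # 5 of 7
print(two_layer_correct, "of", len(bp))                    # 7 of 7
```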

## Network architecture tailored to the task

* Hierarchical networks
* Convolutional neural networks
* Recurrent neural networks

### Hierarchically structured deep network

![J. Ma et al., “Using deep learning to model the hierarchical structure and function of a cell,” Nat Methods, vol. 15, no. 4, pp. 290–298, Apr. 2018, doi: 10.1038/nmeth.4627.](images/ma2018title.png)

![](images/ma2018fig.png)

### Convolutional neural networks

![](images/alexnet.png)

<img src="images/convolution.gif" width=200 />
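The sliding-window operation in the animation can be sketched as follows. This assumes no padding and a stride of 1, and it does not flip the kernel (strictly a cross-correlation, which is what deep learning libraries typically compute under the name "convolution"). The image and kernel values are made up.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; each output cell is a weighted sum of one patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# A 4x4 image with a vertical edge between the second and third columns
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A 2x2 kernel that responds to dark-to-bright transitions from left to right
kernel = [
    [-1, 1],
    [-1, 1],
]
print(conv2d(image, kernel))  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The same small set of kernel weights is reused at every position, which is why a convolutional layer has far fewer parameters than a fully connected layer over the same image.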

### Recurrent neural network

![](images/rnn.svg)
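Unfolding a recurrent network over a sequence can be sketched as a loop that carries a hidden state from one step to the next. The version below assumes a single scalar hidden unit with a tanh activation; the parameter names `w_in`, `w_rec`, and `b` are placeholders for this sketch and do not correspond to the labels in the figure.

```python
import math

def rnn(xs, w_in, w_rec, b, h0=0.0):
    """Return the hidden state after each element of the input sequence xs."""
    h = h0
    states = []
    for x in xs:
        # the same weights are reused at every time step
        h = math.tanh(w_in * x + w_rec * h + b)
        states.append(h)
    return states

# With a recurrent weight, the same input produces a different state the second time,
# because the state also depends on what came before.
print(rnn([1.0, 1.0], w_in=1.0, w_rec=1.0, b=0.0))
# Without one (w_rec = 0), each state depends only on the current input.
print(rnn([1.0, 1.0], w_in=1.0, w_rec=0.0, b=0.0))
```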

## Decision trees

[image of a decision tree]

What is the optimization problem?

A greedy optimization strategy (greedy strategies do not always find the optimal tree)
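One greedy step might look like the sketch below: try every threshold on every feature and keep the split with the largest information gain (reduction in label entropy). Information gain is only one common criterion (Gini impurity is another), and the function names and toy data are invented for this example.

```python
import math

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result

def best_split(X, y):
    """Return (feature index, threshold, information gain) of the best single split."""
    base = entropy(y)
    best = (None, None, 0.0)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold halfway between neighbouring values
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            remaining = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            gain = base - remaining
            if gain > best[2]:
                best = (f, t, gain)
    return best

# Toy data: the second feature separates the classes, the first does not
X = [[1, 5], [2, 1], [3, 6], [4, 2]]
y = [1, 0, 1, 0]
print(best_split(X, y))  # (1, 3.5, 1.0)

# XOR-style data: no single split has any gain, although two levels of splits would separate it
X_xor = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_xor = [0, 1, 1, 0]
print(best_split(X_xor, y_xor))  # (None, None, 0.0)
```

A full tree learner would apply `best_split` recursively to the left and right subsets. The XOR-style case shows one way the greedy choice can fall short: it judges each split in isolation.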



## ML pitfalls

- A. Bissoto, M. Fornaciali, E. Valle, and S. Avila, “(De)Constructing Bias on Skin Lesion Datasets,” presented at the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, p. 9.
- R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’15, Sydney, NSW, Australia, 2015, pp. 1721–1730, doi: 10.1145/2783258.2788613.


### Adversarial attacks

![S. G. Finlayson, J. D. Bowers, J. Ito, J. L. Zittrain, A. L. Beam, and I. S. Kohane, “Adversarial attacks on medical machine learning,” Science, vol. 363, no. 6433, pp. 1287–1289, Mar. 2019, doi: 10.1126/science.aaw4399.](images/finlayson2019-title.png)

![](images/finlayson2019-1.png)

![](images/finlayson2019-2.png)

## Image credits in order of appearance

* [XKCD](https://xkcd.com/1838/)
* [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:All-or-none_law_en.svg)
* [Wikimedia Commons (Prof. Loc Vu-Quoc), CC BY-SA 4.0](https://commons.wikimedia.org/w/index.php?curid=72816083)
* LeCun et al. Nature 2015
* Select slides from Prof. Mark W Craven
* [Alykhan Tejani: A Brief Introduction To Gradient Descent](https://alykhantejani.github.io/a-brief-introduction-to-gradient-descent/)
* CNN: https://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/
* Convolution gif: http://deeplearning.stanford.edu/
* RNN: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Recurrent_neural_network_unfold.svg)
* Finlayson et al. Science 2019