# Crime Prediction with Recurrent Neural Networks (RNNs)

## Outline
 - Why do we need Recurrent Neural Networks?
 - Introduction to RNNs
 - Long Short-Term Memory Cell
 - Gated Recurrent Unit
 - Introduction to the Crime Dataset 
 - Implementation of RNN, LSTM and GRU
 - Parameter Tuning and Evaluation

## Why do we need Recurrent Neural Networks? 
<br>
### Recap Feed Forward Network
<br>
<img src="presentation_pics/neural_network1.png" alt="NN" style="width: 400px;"/>
<br>
[Source:  Alisa's Presentation on NN Primer]

## Let's simplify
<br>
<br>
<img src="presentation_pics/NN_simplified.png" alt="NN" style="width: 500px;"/>
<br>

### Input data:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __“In France, I had a great time”__

<br>
<br>
[Source: https://course.fast.ai/lessons/lesson6.html ]


- Changed from color coding to shape coding
- Dropped the explicit dimensions: each symbol now represents multiple activations
- Each arrow represents a layer operation, e.g. a matrix multiplication
<br>
<br>
- The theory is explained with a text-analysis example, because sequential dependencies in language are intuitive to understand

## Use case: sentiment analysis
### Is this a __positive__ or a __negative__ statement?
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __“In France, I had a great time”__
<br>
<br>
<img src="presentation_pics/NN_simplified.png" alt="NN" style="width: 500px;"/>
<br>
__One Solution:__

- Turn sentence into vector (Bag-of-Words)
- Use NN to predict the sentiment for each sentence independently
<br>

[Source: https://course.fast.ai/lessons/lesson6.html ]

#### One Solution:
- Use bag-of-words to turn the sentence into a vector (word order doesn't matter)
- Use a feed-forward network to predict the class of the given sentence

<br>

#### Notice:
- The order of the words is not taken into account
- Each sentence / document is a single observation
- The classification of the next sentence is independent of the previous sentence
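
A minimal sketch of the first point, using only the Python standard library: two sentences with the same words in a different order map to the same bag-of-words vector. The `bag_of_words` helper and the tiny `vocabulary` are illustrative assumptions, not part of the original material.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(sentence.lower().replace(",", "").split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary built from the example sentence.
vocabulary = ["a", "france", "great", "had", "i", "in", "time"]

original = bag_of_words("In France, I had a great time", vocabulary)
shuffled = bag_of_words("time great a had I France, in", vocabulary)

print(original)              # [1, 1, 1, 1, 1, 1, 1]
print(original == shuffled)  # True
```

A feed-forward classifier that only sees these vectors cannot tell the two sentences apart.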



## Let's change our classification task:

### What will be the next word?
<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __“In France, I had a great ...?”__
<br>
<br>
<img src="presentation_pics/NN_simplified.png" alt="NN" style="width: 500px;"/>


<br>

__How can we include multiple observation vectors while preserving the order?__

<br>

[Source: https://course.fast.ai/lessons/lesson6.html ]

#### What's new:
- Each word is an observation at a given point of time
- The next word depends mainly on the previous words -> the order is important
- Each word might be represented as a vector 
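
One simple way to represent each word as a vector is one-hot encoding, sketched below with plain Python lists. The `one_hot_sequence` helper and the vocabulary are assumptions for illustration; in practice learned embeddings are a common alternative. Unlike a bag-of-words vector, the result is a list of vectors, one per time step, so the order is preserved.

```python
def one_hot_sequence(sentence, vocabulary):
    """Turn a sentence into an ordered list of one-hot vectors, one per word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    sequence = []
    for word in sentence.lower().replace(",", "").split():
        vector = [0] * len(vocabulary)
        vector[index[word]] = 1
        sequence.append(vector)
    return sequence

# Hypothetical vocabulary built from the example sentence.
vocabulary = ["a", "france", "great", "had", "i", "in"]

sequence = one_hot_sequence("In France, I had a great", vocabulary)
print(len(sequence))  # 6
print(sequence[0])    # [0, 0, 0, 0, 0, 1]
```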
<br>

## How to add another preceding timestep

### What will be the next word?
<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __“a great ...?”__
<br>
<br>
<img src="presentation_pics/RNN2.png" alt="NN" style="width: 600px;"/>
<br>
<br>
Prediction at time t:  $$y_t = \mathrm{softmax}(W_{out}h_t)$$
Activations for step t: $$h_t = \tanh(W_{in}x_t + W_h h_{t-1})$$
<br>
[Source: https://course.fast.ai/lessons/lesson6.html ]



$W_{in}$, $W_h$ and $W_{out}$ are shared across all time steps, which reduces the number of parameters to learn.<br>
[Source: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ ]
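
The two equations can be sketched in NumPy as a single step function. This is an illustration under assumptions: the sizes, the random untrained weights and the word indices are made up, and bias terms are omitted to match the formulas above.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, W_in, W_h, W_out):
    """One time step: h_t = tanh(W_in x_t + W_h h_{t-1}), y_t = softmax(W_out h_t)."""
    h_t = np.tanh(W_in @ x_t + W_h @ h_prev)
    y_t = softmax(W_out @ h_t)
    return h_t, y_t

rng = np.random.default_rng(0)
vocab_size, hidden_size = 6, 4   # illustrative sizes
W_in = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_out = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

# One-hot inputs for "a" and "great" under a hypothetical vocabulary.
x_a, x_great = np.eye(vocab_size)[0], np.eye(vocab_size)[2]

h = np.zeros(hidden_size)                        # initial hidden state
h, _ = rnn_step(x_a, h, W_in, W_h, W_out)        # step 1: "a"
h, y = rnn_step(x_great, h, W_in, W_h, W_out)    # step 2: "great", same weights

n_params = W_in.size + W_h.size + W_out.size
print(n_params)  # 64, independent of the number of time steps
```

Because both steps call `rnn_step` with the same three matrices, adding more time steps does not add parameters.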

## How to add another preceding timestep

### What will be the next word?
<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __“had a great ...?”__
<br>
<br>
<img src="presentation_pics/RNN3.png" alt="NN" style="width: 650px;"/>
<br>
[Source: https://course.fast.ai/lessons/lesson6.html ]




## Adding an arbitrary number of preceding words

### What will be the next word?
<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __“In France, I had a great ...?”__
<br>
<br>
<img src="presentation_pics/RNN_loop.png" alt="NN" style="width: 600px;"/>
<br>
[Source: https://course.fast.ai/lessons/lesson6.html ]
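
The loop in the figure can be sketched as a plain `for` loop over the input sequence. This is a hedged illustration: `rnn_predict_next`, the sizes, the word indices and the random untrained weights are all assumptions, so the printed probabilities carry no meaning beyond showing that one cell handles contexts of any length.

```python
import numpy as np

def rnn_predict_next(xs, W_in, W_h, W_out):
    """Feed a sequence of any length through one recurrent cell, step by step."""
    h = np.zeros(W_h.shape[0])              # initial hidden state
    for x_t in xs:                          # the same weights are reused at every step
        h = np.tanh(W_in @ x_t + W_h @ h)
    logits = W_out @ h
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    return probs / probs.sum()

rng = np.random.default_rng(1)
vocab_size, hidden_size = 6, 4              # illustrative sizes
W_in = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_out = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
one_hot = np.eye(vocab_size)

# Word indices for "In France, I had a great" under a hypothetical vocabulary.
sentence = [5, 1, 4, 3, 0, 2]
for n_words in (2, 3, 6):
    context = one_hot[sentence[-n_words:]]  # the last n_words words as one-hot rows
    probs = rnn_predict_next(context, W_in, W_h, W_out)
    print(n_words, probs.round(3))
```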

## How to train a Recurrent Neural Net

### Backpropagation through time (BPTT)
<br>


Total Loss: $$J(\theta) = \sum_t{J_t(\theta)}$$
Loss at time t: $$J_t(\theta) = f(y_{t,true}, y_{t})$$

<br>
Gradient: because $W_{in}$ is used at every step, the chain rule collects one term for each earlier step $k$:
$$\frac{\partial J_t}{\partial W_{in}} =
\sum_{k=0}^{t}\frac{\partial J_t}{\partial y_{t}}\frac{\partial y_t}{\partial h_{t}}
\left(\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}\right)
\frac{\partial h_{k}}{\partial W_{in}}
$$
Here $\frac{\partial h_{k}}{\partial W_{in}}$ is the direct dependence of $h_k$ on $W_{in}$, with $h_{k-1}$ held fixed. The longest chain is the $k=0$ term:
$$\frac{\partial J_t}{\partial y_{t}}\frac{\partial y_t}{\partial h_{t}}\frac{\partial h_t}{\partial h_{t-1}}
\frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{0}}{\partial W_{in}}
$$




<img src="presentation_pics/RNN_loop.png" alt="NN" style="width: 600px;"/>
<br>
[Source: http://introtodeeplearning.com/materials/2018_6S191_Lecture2.pdf ]<br>
[Source: http://neuralnetworksanddeeplearning.com/chap5.html ]
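
A minimal NumPy sketch of backpropagation through time for $W_{in}$, checked against finite differences. Assumptions, since the slides leave them open: the per-step loss $f$ is taken to be cross-entropy, and the sizes, random weights, inputs and targets are made up for illustration.

```python
import numpy as np

def forward(xs, targets, W_in, W_h, W_out):
    """Run the RNN; return hidden states, predictions and the total loss J."""
    h = np.zeros(W_h.shape[0])
    hs, ys, loss = [], [], 0.0
    for x_t, target in zip(xs, targets):
        h = np.tanh(W_in @ x_t + W_h @ h)
        logits = W_out @ h
        y = np.exp(logits - logits.max())
        y /= y.sum()                       # softmax
        loss += -np.log(y[target])         # J_t: cross-entropy (assumed choice of f)
        hs.append(h)
        ys.append(y)
    return hs, ys, loss

def grad_W_in(xs, targets, W_in, W_h, W_out):
    """Gradient of the total loss w.r.t. W_in, walking backwards through time."""
    hs, ys, _ = forward(xs, targets, W_in, W_h, W_out)
    dW_in = np.zeros_like(W_in)
    dh_next = np.zeros(W_h.shape[0])       # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        dlogits = ys[t].copy()
        dlogits[targets[t]] -= 1.0         # d J_t / d logits for softmax + cross-entropy
        dh = W_out.T @ dlogits + dh_next   # from this step's loss and from later steps
        da = dh * (1.0 - hs[t] ** 2)       # through tanh
        dW_in += np.outer(da, xs[t])       # W_in is shared, so contributions add up
        dh_next = W_h.T @ da               # pass the gradient on to step t-1
    return dW_in

rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 5, 4, 6          # illustrative sizes
W_in = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W_out = rng.normal(scale=0.5, size=(n_in, n_hidden))
xs = list(np.eye(n_in)[rng.integers(0, n_in, size=n_steps)])
targets = rng.integers(0, n_in, size=n_steps)

analytic = grad_W_in(xs, targets, W_in, W_h, W_out)

# Central finite differences as an independent check.
eps = 1e-5
numeric = np.zeros_like(W_in)
for i in range(n_hidden):
    for j in range(n_in):
        W_plus, W_minus = W_in.copy(), W_in.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        loss_plus = forward(xs, targets, W_plus, W_h, W_out)[2]
        loss_minus = forward(xs, targets, W_minus, W_h, W_out)[2]
        numeric[i, j] = (loss_plus - loss_minus) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```

The `dh_next` variable is what carries the gradient across time steps; each pass through the loop multiplies it by one more step Jacobian.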

## Yet another representation

<br>

<img src="presentation_pics/RNN_alt.png" alt="NN" style="width: 600px;"/>
<br>
[Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]

## Drawback of a simple Recurrent Network
<br>

### Forgetting long-term dependencies:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __“In France, I had a great time and I learnt some of the ...? [language]”__
 
<img src="presentation_pics/vanishing_gradient.png" alt="vanishing_gradient" style="width: 600px;"/>
<br>
[Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]



- The network can no longer connect the relevant information across a long gap
- This is known as the "vanishing gradient problem"

## Vanishing gradient problem

Longest chain in the gradient $\frac{\partial J_t}{\partial W_{in}}$: $$\frac{\partial J_t}{\partial y_{t}}\frac{\partial y_t}{\partial h_{t}}\frac{\partial h_t}{\partial h_{t-1}}
\frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{0}}{\partial W_{in}}$$
<br><br>
It can be shown: $$\frac{\partial h_t}{\partial h_{t-1}}
= W_{h}^T \, \mathrm{diag}[\tanh'(W_{in}x_t + W_h h_{t-1})]$$
<br>
$\tanh' \in (0,1]$ <br><br>
$W_{h}$: entries sampled from a standard normal distribution, so most have magnitude below 1
<br>
<br>
[Source: http://introtodeeplearning.com/materials/2018_6S191_Lecture2.pdf]<br>
[Source: http://neuralnetworksanddeeplearning.com/chap5.html ]

- Training uses backpropagation through time
- As the gap between time steps grows, the product gets longer and multiplies many small numbers (small gradients)
- The small factors are due to the activation function (tanh), whose derivative is at most 1
- Crucial early time steps no longer influence later time steps: the gradient vanishes
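
The shrinking product can be observed numerically. From $h_t = \tanh(W_{in}x_t + W_h h_{t-1})$, the Jacobian of one step is $\mathrm{diag}(1 - h_t^2)\,W_h$ (the transposed layout of the expression above). The sketch below multiplies these step Jacobians and tracks the norm of the product. The sizes and random inputs are made up, and rescaling $W_h$ to spectral norm 0.9 is an assumption chosen so that the decay is guaranteed; with larger recurrent weights the behavior can differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_in, n_steps = 8, 5, 50         # illustrative sizes

W_in = rng.normal(size=(n_hidden, n_in))
W_h = rng.normal(size=(n_hidden, n_hidden))
W_h *= 0.9 / np.linalg.norm(W_h, 2)        # assumption: rescale to spectral norm 0.9

h = np.zeros(n_hidden)                     # initial hidden state
jacobian_product = np.eye(n_hidden)        # accumulates d h_t / d (initial state)
norms = []
for t in range(n_steps):
    x_t = rng.normal(size=n_in)            # random stand-in for real inputs
    h = np.tanh(W_in @ x_t + W_h @ h)
    step_jacobian = np.diag(1.0 - h ** 2) @ W_h   # d h_t / d h_{t-1}
    jacobian_product = step_jacobian @ jacobian_product
    norms.append(np.linalg.norm(jacobian_product, 2))

for t in (1, 10, 25, 50):
    print(t, norms[t - 1])                 # the norm shrinks as the gap grows
```

Each factor has spectral norm at most 0.9 here, so the product after $t$ steps is bounded by $0.9^t$, and the influence of the earliest state on later ones fades accordingly.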

## Application / Use Case
### Where should the head of the Chicago Police Force send his patrols?
<img src="presentation_pics/crime_intro.png" alt="GRU" style="width: 800px;"/>

## Machine Learning and Ethics
### Should the police apply our Model?
#### Pro:
- Reduce Crime
- Use public resources more effectively
<br>

#### Contra:
- Data bias towards certain crime types and neighborhoods
- Confirmation bias
<br>
<br>

[Source: Cathy O’Neil, Weapons of Math Destruction, Chapter 5]