### Assumptions in Machine Learning

Every machine learning model makes assumptions. Indeed, assumptions are the only way we can reasonably learn from data and reason about how what we learn generalizes to the real world.

The problem is the tradeoff between assumptions and difficulty. Simpler models tend to make more naive assumptions (i.e., assumptions that are unlikely to be true or unlikely to generalize). More complex models (such as neural networks, convolutional networks, and recurrent neural networks) make fewer assumptions and are therefore, save for overfitting, more likely to generalize to the real world, but they are more difficult (and expensive) to build, train, and deploy.

To understand this concretely, let's go through some models and state the assumptions they make.


### Linear Regression & Logistic Regression

Linear regression assumes that the target can be modeled as a linear function of the features (plus noise), and logistic regression assumes that the classes can be separated by a linear decision boundary.

![lin](http://sebastianraschka.com/images/blog/2014/kernel_pca/linear_vs_nonlinear.png)
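
To make the linearity assumption concrete, here is a minimal sketch in plain Python that fits a line to two synthetic datasets: one that really is linear and one that is quadratic. The data, the noise level, and the helper names `fit_line` and `mean_squared_error` are made up for illustration; only the closed-form least-squares fit for a single feature is standard.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (closed form, one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

def mean_squared_error(xs, ys, a, b):
    """Average squared gap between the fitted line and the targets."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
xs = [i / 10 for i in range(-50, 51)]
noise = [random.gauss(0, 0.1) for _ in xs]

# Case 1: the data really come from a line, so the assumption holds.
linear_ys = [2 * x + 1 + e for x, e in zip(xs, noise)]
# Case 2: the data come from a parabola, so the assumption is violated.
quadratic_ys = [x ** 2 + e for x, e in zip(xs, noise)]

linear_mse = mean_squared_error(xs, linear_ys, *fit_line(xs, linear_ys))
quadratic_mse = mean_squared_error(xs, quadratic_ys, *fit_line(xs, quadratic_ys))
print('MSE, linear data:   ', round(linear_mse, 4))
print('MSE, quadratic data:', round(quadratic_mse, 4))
```

When the assumption holds, the error is close to the noise level. When it is violated, no choice of slope and intercept can bring the error down.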

Although assuming linearity may seem very naive, these are two of the most commonly used algorithms in industry for making reasonable predictions, for the following reasons:

1. They are relatively easy to train and don't require expensive and difficult research to optimize.
2. They have well-known theoretical guarantees and have been shown to work well in practice for decades.
3. Domain expertise can be applied to make sense of what has been learned.
4. The assumptions often turn out to be mostly true, and the extra work needed to get the edge cases classified correctly often isn't worth it.

### Naive Bayes Classifier

Bayes' rule states that for a latent variable and an evidence variable, the posterior distribution of the latent variable equals the prior probability of the latent variable times the likelihood of the evidence given the latent variable, divided by a constant (the probability of observing the evidence).

Specifically:

$$ p(y_i | x) = \frac{p(y_i) p(x | y_i)}{p(x)} = \frac{p(x, y_i)}{\sum_j p(x, y_j)} $$
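
As a quick numeric check of the formula, here is a small sketch with two classes. The class names and all the probabilities are hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Hypothetical numbers, chosen only for illustration.
prior = {'spam': 0.2, 'ham': 0.8}        # p(y)
likelihood = {'spam': 0.6, 'ham': 0.05}  # p(x | y) for one observed x

# Joint p(x, y) = p(y) * p(x | y) for each class.
joint = {y: prior[y] * likelihood[y] for y in prior}
# Evidence p(x) is the sum of the joint over all classes.
evidence = sum(joint.values())
# Posterior p(y | x) = joint / evidence.
posterior = {y: joint[y] / evidence for y in joint}
print(posterior)
```

Even though the prior favors `'ham'`, the evidence is much more likely under `'spam'`, so the posterior ends up favoring `'spam'`.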

Using the chain rule of probability, we can write the joint distribution as a product of conditionals:

$$ p(x, y_i) = p(x_1 | x_2, \ldots, x_n, y_i) \, p(x_2 | x_3, \ldots, x_n, y_i) \cdots p(x_n | y_i) \, p(y_i) $$

Here is where the assumption comes into play: since we usually can't estimate these conditional probabilities reliably (there are too many combinations of feature values), we simply define:

$$ p(x_1 | x_2, ... x_n, y_i) = p(x_1 | y_i) $$

This assumption means that each feature depends only on the class label and is independent of every other feature, given that label.

We can also choose whatever probability distribution we want for $ p(x_1 | y_i) $ (such as a Gaussian); this choice is another assumption in our model.
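
Putting both assumptions together (conditional independence and a Gaussian for each feature), a naive Bayes classifier can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the dataset is made up, and the function names `fit`, `predict`, and `log_gaussian` are hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian, used for each log p(x_k | y)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def fit(samples, labels):
    """Estimate p(y) and a per-class, per-feature mean and variance."""
    model = {}
    for y in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == y]
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # small constant keeps the variance strictly positive
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[y] = (len(rows) / len(samples), stats)
    return model

def predict(model, x):
    """Return the class maximizing log p(y) + sum_k log p(x_k | y)."""
    scores = {}
    for y, (prior, stats) in model.items():
        scores[y] = math.log(prior) + sum(
            log_gaussian(v, mean, var) for v, (mean, var) in zip(x, stats))
    return max(scores, key=scores.get)

# Tiny made-up dataset: two features, two classes.
samples = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 0.5), (3.2, 0.4), (2.9, 0.7)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
model = fit(samples, labels)
print(predict(model, (1.1, 2.0)), predict(model, (3.1, 0.5)))
```

The sum over features inside `predict` is exactly where the independence assumption is used: each feature contributes its own term, with no interaction between features.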

### Neural Networks
- They also assume that the data are conditionally independent of each other.


We have used Neural Networks and Convolutional Neural Networks to remove the assumption of linearity in our data (and for convolutional neural networks, the assumption of spatial variance being important). Now we will use Recurrent Neural Networks to remove the assumption of independence (to an extent).

Aside: Hidden Markov Models are a much simpler model that also relaxes this independence assumption. They don't work as well, but they have still produced good results.



### An Introduction to Recurrent Neural Networks in Tensorflow

Recurrent Neural Networks are able to model relationships in data where each input is not independent of the previous inputs.

For example, consider the field of Natural Language Processing (NLP). NLP deals with all tasks involved with processing texts and obtaining information (learning) from text. It is extremely common, especially today, to solve NLP tasks with machine learning (and especially deep learning). 

Recurrent Neural Networks are a popular method for solving NLP problems, including predicting the sentiment of a movie review, predicting the next word (or character) in a sequence, and even translating text into another language. This is because they can model dependencies, and in text the current word often depends on the previous words (you can probably think of several examples).

Just for a break from all this technical talk, here is some Shakespeare: 

```
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

DUKE VINCENTIO:
Well, your wit is in the care of side and that.

Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.

Clown:
Come, sir, I will make did behold your worship.

VIOLA:
I'll drink it.
```


### The Power of Recurrent Neural Networks

I lied: the above was not really Shakespeare. It was actually generated by a recurrent neural network trained on a dataset of Shakespeare's writings. This should give you an idea of the power of RNNs: just by looking at characters (not even words!), it learned to generate text strikingly similar to Shakespeare. Granted, it does not make much sense, but it is likely that at least engineers will not be able to tell the difference.

This example is from an awesome article: http://karpathy.github.io/2015/05/21/rnn-effectiveness/


Recurrent Neural Networks are important because they model sequential data. At each time step, a recurrent unit receives, in addition to its data input, a previous **state**, which is just a vector of numbers summarizing the computation done so far in the RNN. This state input is what allows RNNs to learn dependencies in sequential data.

It's a lot like you reading this sentence: when you work out the meaning of a particular word, you don't start from scratch, but take the previous words and sentences into account.
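
The role of the state can be sketched with a single recurrence. The example below assumes the simplest ("vanilla") RNN update, `h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)`, rather than the LSTM used later in this notebook. The sizes, the random weights, and the names `rnn_step` and `run` are all made up for illustration.

```python
import math
import random

def rnn_step(x, h_prev, W_xh, W_hh, bias):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + bias)."""
    h_new = []
    for i in range(len(h_prev)):
        total = bias[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

random.seed(0)
input_size, hidden_size = 3, 4
W_xh = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(hidden_size)]
W_hh = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
bias = [0.0] * hidden_size

def run(sequence):
    """Feed a whole sequence through the RNN and return the final state."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = rnn_step(x, h, W_xh, W_hh, bias)
    return h

# Three one-hot inputs standing in for three different characters.
x_a, x_b, x_c = [1, 0, 0], [0, 1, 0], [0, 0, 1]
# Both sequences end with the same input but have different histories.
state_1 = run([x_a, x_b, x_c])
state_2 = run([x_b, x_a, x_c])
print(state_1)
print(state_2)
```

Both runs see the same final input, yet they end in different states, because the state carries information about everything that came before.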

Here is what an RNN looks like: 

![rnn](https://cdn-images-1.medium.com/max/1600/1*V2W4TCmTj2h1CE7I-DngPw.png)


And this is what an RNN looks like, predicting words character by character: 

![rnn-2](https://cdn-images-1.medium.com/max/1600/1*IMalbwl6uj3nlqxixZYFvA.jpeg)

To understand RNNs deeply, we'll need to get into some data and code. 

#### Setup
- First, let's import everything we need. Download the file "shakespeare.txt" and ensure that it is placed in the subdirectory 'data'. 


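In [None]:
import unidecode  # third-party package: pip install unidecode
import string
import random
import re
import numpy as np

# the vocabulary is every printable ASCII character
all_characters = string.printable
n_characters = len(all_characters)

# transliterate any non-ASCII characters to ASCII so they can be looked up in all_characters
with open('./data/shakespeare.txt') as f:
    file = unidecode.unidecode(f.read())
file_len = len(file)
print('file_len =', file_len)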

Let's define the RNN model. We'll use an LSTM cell for our hidden units. The model takes raw character indices as input and converts them into character-level embeddings (similar to word embeddings).

In [None]:
import torch
from torch import nn
import torch.nn.functional as F
from torch import autograd

class RNN(nn.Module):
    def __init__(self, *, input_size, embedding_size, hidden_size, output_size, n_layers=1):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        
        # map inputs to embeddings
        self.embedding_layer = nn.Embedding(input_size, embedding_size)
        # forward embeddings through LSTM
        self.LSTM = nn.LSTM(embedding_size, hidden_size, n_layers)
        # compute a linear transformation to output space
        self.linear = nn.Linear(hidden_size, output_size)
    
    def forward(self, input, hidden):
        # input is a single character index; embed it to shape (1, 1, embedding_size)
        embedded = self.embedding_layer(input.view(1, -1))
        # the LSTM expects input of shape (seq_len, batch_size, embedding_size)
        output, hidden = self.LSTM(embedded.view(1, 1, -1), hidden)
        # project the LSTM output to (batch_size, output_size): one score per character
        output = self.linear(output.view(1, -1))
        return output, hidden

    def init_hidden(self):
        # an LSTM needs two initial states: the hidden state and the cell state
        # each has shape (n_layers, batch_size, hidden_size), with batch_size = 1
        # plain tensors suffice: autograd.Variable has been deprecated since PyTorch 0.4
        return (torch.zeros(self.n_layers, 1, self.hidden_size),
                torch.zeros(self.n_layers, 1, self.hidden_size))
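
Conceptually, the embedding layer in the model is a lookup table with one row of learnable numbers per character. The sketch below mimics that lookup in plain Python; the tiny `embedding_size`, the random initialization, and the helper name `embed` are assumptions for illustration and are not how `nn.Embedding` is implemented internally.

```python
import random
import string

all_characters = string.printable
embedding_size = 4  # tiny, for illustration only

random.seed(0)
# One row of randomly initialized numbers per character in the vocabulary.
embedding_table = [[random.gauss(0, 1) for _ in range(embedding_size)]
                   for _ in all_characters]

def embed(char):
    """Look up the embedding row for a single character."""
    return embedding_table[all_characters.index(char)]

print(embed('a'))
```

During training, the rows of the table are updated like any other weights, so characters that behave similarly can end up with similar rows.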

Let's write some code to prepare the inputs to our model. We'll need a function to extract a sequence from our file, a function to convert it into a vector of character indices that can be fed into our model, and an overarching function that generates pairs of inputs and labels.

In [None]:
seq_len = 200

def get_seq(seq_len = 200):
    # pick a random window of seq_len characters from the file
    start = np.random.randint(0, file_len - seq_len)
    seq = file[start:start + seq_len]
    assert len(seq) == seq_len
    return seq

def to_vector(seq, chars = all_characters):
    # map each character to its index in the vocabulary
    return torch.LongTensor([chars.index(s) for s in seq])

def generate_training_set():
    seq = get_seq()
    inputs, labels = to_vector(seq[:-1]), to_vector(seq[1:])
    return inputs, labels
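
To see what the input/label pairs look like, here is a small sketch on a five-character string. It uses a plain-list stand-in for `to_vector` (named `to_indices` here, a hypothetical helper) so that it runs without PyTorch.

```python
import string

all_characters = string.printable

def to_indices(seq, chars=all_characters):
    """Plain-list version of to_vector, for illustration."""
    return [chars.index(s) for s in seq]

seq = 'hello'
# Inputs are all characters but the last; labels are the same string shifted by one.
inputs, labels = to_indices(seq[:-1]), to_indices(seq[1:])
for i, l in zip(inputs, labels):
    print(repr(all_characters[i]), '->', repr(all_characters[l]))
```

At every position the label is simply the next character, which is why a sequence of `seq_len` characters yields `seq_len - 1` training pairs.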

Next, let's write a function to output example sequences of characters from our RNN. We'll treat the outputs of the RNN as an (unnormalized) probability distribution and choose the next character by sampling from it, with a temperature that controls how random the samples are. We'll also use "priming", which means feeding a starting string through the RNN to build up the hidden state, so that the RNN already has a representation of the preceding context before we start sampling from it.

In [None]:
def evaluate(prime_str='A', predict_len=100, temperature=0.8):
    predicted = prime_str
    # no gradients are needed while sampling
    with torch.no_grad():
        hidden = rnn.init_hidden()
        prime_input = to_vector(prime_str)

        # Use priming string to "build up" hidden state
        for p in range(len(prime_str) - 1):
            _, hidden = rnn(prime_input[p], hidden)
        inp = prime_input[-1]

        for p in range(predict_len):
            output, hidden = rnn(inp, hidden)

            # Scale the scores by the temperature and exponentiate;
            # torch.multinomial treats the results as unnormalized sampling weights
            output_dist = output.view(-1).div(temperature).exp()
            top_i = torch.multinomial(output_dist, 1)[0].item()

            # Add predicted character to string and use as next input
            predicted_char = all_characters[top_i]
            predicted += predicted_char
            inp = to_vector(predicted_char)

    return predicted
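
The `temperature` argument in `evaluate` divides the network's scores before they are exponentiated. The sketch below shows the effect on a made-up set of three scores, normalizing explicitly so the probabilities are visible; the function name `softmax_with_temperature` and the scores are assumptions for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [l / temperature for l in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
for temperature in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all the probability on the highest-scoring character, giving conservative and repetitive text. High temperatures flatten the distribution, giving more varied text with more mistakes.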

Now, let's take our RNN, define our hyperparameters, and train it.

In [None]:
n_epochs = 5000
hidden_size = 200
embedding_size = 100
n_layers = 4
        
rnn = RNN(input_size = n_characters,
          embedding_size = embedding_size,
          hidden_size = hidden_size, 
          output_size = n_characters,
          n_layers = n_layers)

optim = torch.optim.Adam(rnn.parameters(), lr = 0.005)
criterion = nn.CrossEntropyLoss()
all_losses = []

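for epoch in range(n_epochs):
    # each "epoch" here is one randomly sampled training sequence
    # re-init hidden and zero the grads
    hidden = rnn.init_hidden()
    rnn.zero_grad()
    inputs, labels = generate_training_set()
    # run through the inputs one by one, accumulating a loss
    loss = 0
    for c in range(seq_len - 1):
        output, hidden = rnn(inputs[c], hidden)
        # the target must have shape (batch_size,), which is (1,) here
        loss += criterion(output, labels[c].view(1))
    # the graph is rebuilt on every iteration, so retain_graph is not needed
    loss.backward()
    optim.step()
    # average loss per predicted character (loss.data[0] fails on recent PyTorch versions)
    avg_loss = loss.item() / (seq_len - 1)
    all_losses.append(avg_loss)
    if epoch % 20 == 0:
        print('EPOCH: {}'.format(epoch))
        predicted = evaluate()
        print('predicted: {}'.format(predicted))
        print('LOSS: {}'.format(avg_loss))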
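
A useful reference point when reading the printed loss is the loss of a model that knows nothing. The sketch below computes that baseline, under the assumption that such a model assigns equal probability to every character in `string.printable`.

```python
import math
import string

n_characters = len(string.printable)

# A model that assigns equal probability to every character has
# cross-entropy -log(1 / n_characters) = log(n_characters) per character.
uniform_loss = math.log(n_characters)
print('n_characters =', n_characters)
print('uniform baseline loss = {:.3f}'.format(uniform_loss))
```

If the per-character loss reported during training does not drop well below this value, the network has not learned anything beyond guessing uniformly.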
