## Text generation and processing with an artificial neural network (ANN)

The basic building block of an ANN is the perceptron, which is based on a simplification of the biological neuron. The perceptron receives multiple inputs. Each input is multiplied by a weight, and the weighted inputs are summed together with a bias. This linear combination is then passed through an activation function, which produces a non-linear output (in the classic perceptron, a value of 0 or 1) that feeds into the next layer.

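The weights $w_i$ are initialized randomly. Before the activation function is applied, the perceptron computes the weighted sum

$$ z = \sum_i w_i x_i + b $$

and the output is the activation function applied to $z$.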
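
As a rough illustration of the weighted sum above, the sketch below computes a single perceptron's output in plain Python. The function name `perceptron_output`, the step threshold at zero and the example numbers are assumptions made for illustration; they are not part of the notebook.

```python
def perceptron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through a step activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step activation: fire (1) when the weighted sum is positive, otherwise 0.
    return 1 if z > 0 else 0


# Hypothetical values chosen only for illustration.
example_fired = perceptron_output(inputs=[1.0, 0.5], weights=[0.4, -0.2], bias=-0.1)
example_silent = perceptron_output(inputs=[1.0, 0.5], weights=[0.4, -0.2], bias=-0.5)
print(example_fired, example_silent)
```

The only difference between the two calls is the bias, which shifts the threshold at which the perceptron fires.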


A densely connected neural network is a series of layers of perceptrons, where every output of one layer feeds into every perceptron of the next layer. A network with multiple hidden layers between the input and output layers is known as a deep network.

Changing the activation function can be useful depending on the task. Common ones include: 

- The sigmoid (0 to 1)  
- Tanh (-1 to 1)  
- ReLU (0 to z), which is just max(0, z)  
- Step function (0 or 1)  

ReLU tends to perform well in many situations and is a common default choice for hidden layers.
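
The activation functions listed above can be sketched in a few lines of plain Python. The function names here (including `step` for the 0-or-1 threshold function) are chosen for illustration only; Keras ships its own implementations, which the notebook selects by name, for example `activation='relu'`.

```python
import math


def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Passes positive values through unchanged and clips negatives to 0."""
    return max(0.0, z)


def step(z):
    """Outputs 1 for positive inputs and 0 otherwise."""
    return 1.0 if z > 0 else 0.0


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), step(z))
```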

Let's use Keras on the classic Iris dataset to build a simple ANN.

In [None]:
import numpy as np
from sklearn.datasets import load_iris

In [None]:
iris = load_iris()

In [None]:
type(iris)

In [None]:
print(iris.DESCR)

In [None]:
X = iris.data

In [None]:
y = iris.target

In [None]:
iris.feature_names

For Keras, we need to convert each class label into a one-hot encoding.

In [None]:
# class 0 --> [1, 0, 0]
# class 1 --> [0, 1, 0]
# class 2 --> [0, 0, 1]

In [None]:
from keras.utils import to_categorical

In [None]:
y = to_categorical(y)
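
To make the encoding concrete, here is a minimal pure-Python sketch of what one-hot encoding does to integer class labels. The helper `one_hot` is hypothetical and only mirrors the idea; the notebook itself relies on `to_categorical`.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1 at the label's index and 0 elsewhere."""
    return [[1 if i == label else 0 for i in range(num_classes)] for label in labels]


# Four toy labels drawn from three classes.
encoded = one_hot([0, 1, 2, 1], num_classes=3)
for row in encoded:
    print(row)
```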

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33,random_state=42)

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
scaler.fit(X_train)

In [None]:
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)

In [None]:
from keras.models import Sequential
from keras.layers import Dense

In [None]:
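# How to make a simple ANN: two hidden layers of 8 units and a 3-class softmax output

model = Sequential()
model.add(Dense(8,input_dim=4,activation='relu'))  # first layer takes the 4 input features
model.add(Dense(8,activation='relu'))  # input size is inferred from the previous layer
model.add(Dense(3,activation='softmax'))  # one output probability per class
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])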


In [None]:
model.summary()

In [None]:
model.fit(scaled_X_train,y_train,epochs=150,verbose=2)

In [None]:
model.predict(scaled_X_test)

In [None]:
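# predict_classes was removed in recent Keras/TensorFlow releases; take the argmax of the predicted probabilities instead
predictions = model.predict(scaled_X_test).argmax(axis=1)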

In [None]:
y_test.argmax(axis=1)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test.argmax(axis=1),predictions))

In [None]:
accuracy_score(y_test.argmax(axis=1),predictions)

In [None]:
#easy way of saving a model to disk
model.save('myfirstmodel.h5')

In [None]:
from keras.models import load_model

In [None]:
new_model = load_model('myfirstmodel.h5')

## RNN theory overview   

RNNs are specifically designed to work with sequence data because they can retain information about datapoints that occurred earlier in the sequence. A sequence is a vector whose order matters; typically we want to predict the sequence shifted forward by some number of timesteps.

In a normal feed-forward network, the output only goes to the next layer of cells. In a recurrent neuron, the output from the previous timestep is fed back into the cell and helps determine the value at the current timestep. It's easy to create a layer of recurrent neurons, where the outputs are fed back into each of the neurons in the layer.

The output at timestep t depends, through the fed-back state, on all the previous timesteps, so the network has some memory of previous information. The contribution of earlier datapoints is controlled by learned weights.
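
The recurrence described above can be sketched with NumPy. This is a minimal illustration assuming the common formulation h(t) = tanh(W_x x(t) + W_h h(t-1) + b); the names `W_x`, `W_h`, `b` and `rnn_step`, the layer sizes and the random toy data are all assumptions, not something defined in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 3, 4
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)


def rnn_step(x_t, h_prev):
    """One timestep: combine the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)


sequence = rng.normal(size=(5, input_size))  # 5 timesteps of toy data
h = np.zeros(hidden_size)                    # the initial state carries no memory
states = []
for x_t in sequence:
    h = rnn_step(x_t, h)  # h now summarizes every input seen so far
    states.append(h)

states = np.stack(states)  # shape: (timesteps, hidden_size)
print(states.shape)
```

Because `h` is passed back in at every step, the final state is a function of the whole sequence, not just the last input.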

We can perform sequence-to-sequence prediction, for example shifting a given sequence into the future.

We can also pass in a sequence and output a single vector (sentiment scores, for example). This requires a training dataset in which each sequence is labelled with its target vector.

Vector-to-sequence is also possible. For example, given a seed word such as 'hello', the network predicts the next part of the phrase.

### LSTM cells 

Simple RNNs tend to lose information from early in a long sequence, because its influence fades as it passes through many timesteps. The LSTM (long short-term memory) cell was developed to counter this.

The inputs to an LSTM cell are the previous hidden state h(t-1) and the current input x(t), in addition to the previous cell state c(t-1).

The first step is the forget gate, which decides what will be dropped from the cell state. Pass h(t-1) and x(t) into a sigmoid layer; an output near 1 means keep and an output near 0 means forget. This is f(t).

The next step is to decide what new information is to be stored in the new state, c(t). Pass h(t-1) and x(t) into a sigmoid layer, giving the input gate i(t), and also into a tanh layer, which produces candidate values that can be added to the state. The sigmoid output then scales the candidate values.

To update the state, multiply the old state c(t-1) by f(t), then add the candidate values scaled by i(t).

Finally, the output gate produces the new hidden state: a sigmoid of h(t-1) and x(t) is multiplied by the tanh of the new cell state c(t) to give h(t).
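
The gate-by-gate description above corresponds to the standard LSTM update, which can be sketched in NumPy as follows. The parameter names (`W_f`, `W_i`, `W_c`, `W_o` and the matching biases), the helper `lstm_step` and the toy sizes are assumptions for illustration; Keras' `LSTM` layer manages its own weights internally.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep following the forget / input / update / output stages."""
    concat = np.concatenate([h_prev, x_t])                      # h(t-1) and x(t) together
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate: 1 keeps, 0 forgets
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# Randomly initialized toy parameters, one weight matrix and bias per gate.
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

h_t, c_t = lstm_step(
    rng.normal(size=input_size), np.zeros(hidden_size), np.zeros(hidden_size), params
)
print(h_t.shape, c_t.shape)
```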

The Gated Recurrent Unit (GRU) follows much the same idea but has a simpler architecture.

LSTMs tend to work well with text data, so we are going to use them here for text generation.



In [None]:
def read_file(filepath):
    
    with open(filepath) as f:
        str_text = f.read()
    
    return str_text

In [None]:
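import spacy
# The 'en' shortcut was removed in spaCy 3; load the model by its full name (it must be installed)
nlp = spacy.load('en_core_web_sm',disable=['parser','tagger','ner'])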

In [None]:
nlp.max_length = 1198623

In [None]:
def separate_punc(doc_text):
    
    return [token.text.lower() for token in nlp(doc_text) if token.text  not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']

In [None]:
doc = read_file('moby_dick_four_chapters.txt')

In [None]:
tokens = separate_punc(doc)

In [None]:
len(tokens) #all the tokens (not punctuation) in the book

The general idea is this: we pass in a sequence of words and then have the network predict the next word. Each training sequence has 25 words, from which we predict the 26th word.
The window then slides along by one word and the prediction is run again.
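
To see the sliding window on a small scale, the sketch below applies the same indexing to a ten-word toy sentence with 3 context words instead of 25. The names `toy_tokens`, `window` and `windows` are made up for this illustration.

```python
toy_tokens = "call me ishmael some years ago never mind how long".split()
window = 3 + 1  # 3 context words plus the word to predict

# Same indexing pattern as the notebook's loop, just with a shorter window.
windows = [toy_tokens[i - window:i] for i in range(window, len(toy_tokens))]

for seq in windows:
    print(seq[:-1], "->", seq[-1])
```

Each window overlaps the previous one by all but one word, so a text of N tokens yields roughly N training sequences.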

In [None]:
train_len = 25 + 1 
text_sequences = []

for i in range(train_len,len(tokens)):
    
    seq = tokens[i-train_len:i]
    text_sequences.append(seq)

In [None]:
' '.join(text_sequences[0])

In [None]:
' '.join(text_sequences[1])

These are all the sequences in the training dataset. Given the preceding 25 words, can the network predict the last word of each sequence?

In [None]:
from keras.preprocessing.text import Tokenizer

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)

In [None]:
#Each number in these sequences is an id linked to a particular word. The tokenizer builds a dictionary that maps between them.
sequences = tokenizer.texts_to_sequences(text_sequences)
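
As a simplified stand-in for the word-to-id mapping, the sketch below builds a similar dictionary by hand. It assumes ids are assigned in order of descending word frequency starting at 1, with 0 left unused for padding, which is why the notebook later uses `vocab_size+1` classes. The toy data and variable names are illustrative only.

```python
from collections import Counter

toy_sequences = [["the", "whale", "the", "sea"], ["the", "sea", "calls"]]

# Count every word across all sequences.
counts = Counter(word for seq in toy_sequences for word in seq)

# The most frequent word gets id 1; id 0 is deliberately left unused.
word_index = {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}
index_word = {idx: word for word, idx in word_index.items()}

# Replace each word by its id.
encoded = [[word_index[word] for word in seq] for seq in toy_sequences]
print(word_index)
print(encoded)
```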

In [None]:
#tokenizer.word_index

In [None]:
#word_counts holds how many times each word appears (as in count vectorization)
#tokenizer.word_counts

In [None]:
vocab_size = len(tokenizer.word_counts)

In [None]:
vocab_size

In [None]:
import numpy as np

Cast the token ids to a NumPy array.

In [19]:
sequences = np.array(sequences)

In [20]:
sequences

In [21]:
from keras.utils import to_categorical

In [22]:
X = sequences[:,:-1] #features
y = sequences[:,-1] #labels

In [23]:
#set one-hot encoding
y = to_categorical(y,num_classes=vocab_size+1)

In [24]:
y

In [25]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

In [30]:
def create_model(vocab_size,seq_len):
    
    model = Sequential()
    model.add(Embedding(vocab_size,seq_len,input_length=seq_len))
    model.add(LSTM(seq_len*2,return_sequences=True))
    model.add(LSTM(seq_len*2))
    model.add(Dense(seq_len*2,activation='relu'))
    
    model.add(Dense(vocab_size,activation='softmax'))
    model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
    
    model.summary()
    
    return model
    
    

In [31]:
model = create_model(vocab_size+1,X.shape[1])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 25, 25)            67950     
_________________________________________________________________
lstm_5 (LSTM)                (None, 25, 50)            15200     
_________________________________________________________________
lstm_6 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dense_5 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_6 (Dense)              (None, 2718)              138618    
Total params: 244,518
Trainable params: 244,518
Non-trainable params: 0
_________________________________________________________________


In [32]:
from pickle import dump,load

In [33]:
#We don't really have a test-train split, we're just looking at the training dataset. We are going to use the output to generate some new text

model.fit(X,y,batch_size=128,epochs=2,verbose=1)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0xb427d13c8>

In [34]:
model.save('my_mobydick_model.h5')
dump(tokenizer,open('my_simpletokenizer','wb'))

In [57]:
from keras.preprocessing.sequence import pad_sequences

In [79]:
def generate_text(model,tokenizer,seq_len,seed_text,num_words):
    
    output_text  = []
    
    input_text = seed_text #need to feed it some seed -> will take in and generate one word. Then we chop the first word and feed to model again
    
    for i in range(num_words):
        
        
        #Transform input
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        
        #if the text is too short or too long, pad or truncate it to seq_len -> truncating='pre' chops the earliest words
        pad_encoded = pad_sequences([encoded_text],maxlen=seq_len,truncating='pre')
        
        #predict the probs for each word and get the index of the word with the highest prob
        #(predict_classes was removed in recent Keras/TensorFlow releases, so take the argmax instead)
        predicted_word_ind = model.predict(pad_encoded,verbose=0).argmax(axis=-1)[0]
        
        #get the word associated with the predicted index
        pred_word = tokenizer.index_word[predicted_word_ind]
        
        #add the predicted word to the input text
        input_text += ' '+pred_word
        
        output_text.append(pred_word)
        
        #after this, pad_sequences chops the oldest word from the sequence and the loop proceeds to predict
        #word by word
        
    
    return ' '.join(output_text)

In [80]:
import random
random.seed(101)
random_pick = random.randint(0,len(text_sequences)-1) #randint includes both endpoints, so stop at the last valid index

In [81]:
random_pick

9521

In [82]:
seed_text = text_sequences[random_pick]

In [83]:
seed_text = ' '.join(seed_text)

In [84]:
print(seed_text)

thought i to myself the man 's a human being just as i am he has just as much reason to fear me as i have


In [85]:
generate_text(model,tokenizer,seq_len=X.shape[1],seed_text=seed_text,num_words=25)

'the the the the the the the the the the the the the the the the the the the the the the the the the'

In [86]:
from keras.models import load_model

In [87]:
model = load_model('epochBIG.h5')

In [88]:
tokenizer = load(open('epochBIG','rb'))

In [89]:
generate_text(model,tokenizer,seq_len=X.shape[1],seed_text=seed_text,num_words=25)

"to be seen there was no bad olfactories my own letter was cheerily listening over his hearers who 's more can go have a wearing"

### Chatbot learning 

We will be following a paper called 'End-To-End Memory Networks'.

The model takes in a discrete set of inputs x1 to xn, which are stored in memory, along with a query q, and it then outputs an answer a.

Each of the x, q and a contains symbols drawn from a dictionary of V (vocabulary) words.

The model writes all x to the memory up to a fixed buffer size, then finds a continuous representation for the x and q.

The model has three parts: the input memory representation (how the stories are stored and matched against the query), the output memory representation, and the generation of the final prediction.
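
The three stages can be sketched for a single memory layer in NumPy. This assumes the single-hop formulation from the paper: a softmax attention over the input memories, an attention-weighted sum of the output memories, and a final softmax over the vocabulary. The random arrays stand in for learned embeddings, and the names `m`, `c`, `u`, `W` and the toy sizes are assumptions for illustration.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()


rng = np.random.default_rng(0)
n_memories, d, vocab = 4, 6, 10

m = rng.normal(size=(n_memories, d))  # input memory representations of the stories x_i
c = rng.normal(size=(n_memories, d))  # output memory representations of the same stories
u = rng.normal(size=d)                # continuous representation of the query q
W = rng.normal(size=(vocab, d))       # weights of the final prediction layer

p = softmax(m @ u)            # how well each stored memory matches the query
o = p @ c                     # response: attention-weighted sum of the output memories
a_hat = softmax(W @ (o + u))  # predicted distribution over the V vocabulary words

print(p)
print(a_hat.argmax())
```

In a trained model the index of the largest entry of `a_hat` would be looked up in the vocabulary to give the answer word.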