### Workflow

This time we can use a larger chunk of the original dataset without running into as many memory and runtime problems.

1. x - Explore and analyze the set of unique characters present in the dataset.
1. x - Subsample the dataset to keep only the titles.
1. x - Implement the tokenization of the dataset and transform the tokens into AllenNLP Instances.
1. x - Design an RNN with AllenNLP that includes:
   1. an embedding of the tokens,
   1. a seq2seq LSTM layer,
   1. a feed-forward layer that outputs a probability distribution over the characters.
1. x - Train the model.
1. Evaluate the model by calculating the loss of some sentences and by generating text.

### Bonus

Can you swap in a Transformer-based model in place of the LSTM?

### Resources

* [AllenNLP the hard way: Building a Baseline Model](https://jbarrow.ai/allennlp-the-hard-way-2/)
* [Sequential Labeling and Language Modeling (Chapter 5 - Covers allenNLP and Char RNN LM)](https://livebook.manning.com/book/real-world-natural-language-processing/chapter-5/v-5/35)
  - [Related Colab Jupyter Notebook](https://colab.research.google.com/github/mhagiwara/realworldnlp/blob/master/examples/generation/lm.ipynb#scrollTo=CX2cfBPu1bfw)
* [Attention and the Transformer (Chapter 8)](https://livebook.manning.com/book/real-world-natural-language-processing/chapter-8/v-5/)
* [Sequence to Sequence Models (Chapter 6)](https://livebook.manning.com/book/real-world-natural-language-processing/chapter-6/v-5/)
* [An In Depth AllenNLP tutorial (Basics to BERT)](https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/)
* [Official AllenNLP tutorial](https://allennlp.org/tutorials)
* [AllenNLP Guide - Reading Data](https://guide.allennlp.org/reading-data)
* [AllenNLP Simple Language Model DatasetReader](https://github.com/allenai/allennlp-models/blob/master/allennlp_models/lm/dataset_readers/simple_language_modeling.py)
* [Testing your DatasetReader](https://jbarrow.ai/allennlp-the-hard-way-1/)
* [AllenNLP - Vocabulary](https://guide.allennlp.org/reading-data#3)
* [sequence_cross_entropy_with_logits](https://github.com/allenai/allennlp/blob/master/allennlp/nn/util.py#L704) - Looks like the simplest way to calculate the loss for your CharRNN.
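For intuition about what that loss measures, here is a minimal pure-Python sketch of masked cross-entropy for a single sequence. It is not AllenNLP's implementation (which works on batched tensors and offers several averaging modes); the names `log_softmax` and `masked_cross_entropy` and the toy logits are assumptions made purely for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_cross_entropy(logits, targets, mask):
    """Mean negative log-likelihood over the unmasked time steps.

    logits:  one list of vocabulary scores per time step
    targets: index of the correct character at each time step
    mask:    1 for real characters, 0 for padding
    """
    total, count = 0.0, 0
    for step_logits, target, keep in zip(logits, targets, mask):
        if keep:
            total -= log_softmax(step_logits)[target]
            count += 1
    # Guard against an all-padding sequence
    return total / max(count, 1)


# Toy example: 3-character vocabulary, 3 time steps, the last one is padding
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
targets = [0, 1, 0]
mask = [1, 1, 0]

loss = masked_cross_entropy(logits, targets, mask)
perplexity = math.exp(loss)  # per-character perplexity
print(f"loss={loss:.4f} perplexity={perplexity:.4f}")
```

The padded step contributes nothing to the loss, which is the point of passing a mask. Exponentiating the mean loss gives a per-character perplexity, which is often easier to compare across runs than the raw loss.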



In [1]:
import pandas as pd
import numpy as np
from collections import Counter, defaultdict


In [13]:
%load_ext autoreload
%autoreload 2

from utils import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Explore the dataset

In [3]:
TOK_FILE = 'stackexchange_tokenized.csv'
file_path = f'data/{TOK_FILE}'
df = pd.read_csv(file_path)

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,post_id,parent_id,comment_id,text,category,length,tokenized
0,0,1,,,Eliciting priors from experts,title,29,Eliciting priors from experts
1,1,2,,,What is normality?,title,18,What is normality ?
2,2,3,,,What are some valuable Statistical Analysis op...,title,65,What are some valuable Statistical Analysis op...
3,3,4,,,Assessing the significance of differences in d...,title,58,Assessing the significance of differences in d...
4,4,6,,,The Two Cultures: statistics vs. machine learn...,title,50,The Two Cultures : statistics vs. machine lear...


#### Split into training and test sets

In [4]:
from sklearn.model_selection import train_test_split

max_df_size = int(1e7)
df_size = len(df.index)

if df_size > max_df_size:
    df_size = max_df_size

indxs     = np.arange(df_size)
train_i, test_i = train_test_split(indxs, test_size=0.4, random_state=42) 

df_train = df.iloc[train_i]
df_test  = df.iloc[test_i]
# TODO: add a validation dataset later?

# Conscious choice for now: no train/valid/test split, just train and test.
df_train.to_csv('data/char-rnn-train.csv')
df_test.to_csv('data/char-rnn-test.csv')



### Limit Dataset to Titles

In [5]:
df_titles = df[df.category == 'title']

In [6]:
len(df_titles)

91648

In [11]:
len(df)

809166

In [12]:
charCtr = Counter()

def collect_chars(txt):
    # No `global` needed: charCtr is mutated, not rebound.
    # Counter.update on a string counts each character.
    charCtr.update(txt)

In [13]:
df_titles['text'].apply(collect_chars)
charCtr.most_common(15)

[(' ', 728098),
 ('e', 490819),
 ('i', 408741),
 ('t', 394552),
 ('a', 368819),
 ('o', 357424),
 ('n', 342324),
 ('r', 300101),
 ('s', 296598),
 ('l', 182752),
 ('d', 149191),
 ('c', 139820),
 ('m', 123104),
 ('u', 112104),
 ('f', 110940)]
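One thing these counts are useful for is deciding how large the character vocabulary needs to be. The sketch below is a hypothetical helper (the name `chars_needed_for_coverage` and the toy titles are not from the notebook) that reports how many of the most frequent characters are needed to cover a given share of all character occurrences.

```python
from collections import Counter


def chars_needed_for_coverage(counter, coverage=0.99):
    """Number of most-frequent characters needed to reach `coverage` of all occurrences."""
    total = sum(counter.values())
    running = 0
    for rank, (_, count) in enumerate(counter.most_common(), start=1):
        running += count
        if running / total >= coverage:
            return rank
    return len(counter)


# Toy stand-in for the titles column
toy_titles = ["What is normality?", "Eliciting priors from experts"]
toy_counter = Counter()
for title in toy_titles:
    toy_counter.update(title)

print(len(toy_counter), chars_needed_for_coverage(toy_counter, 0.9))
```

In the notebook, `charCtr` could be passed in directly, for example `chars_needed_for_coverage(charCtr, 0.999)`, to sanity-check a `max_vocab_size` setting.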

### Tokenization

In [14]:




# https://github.com/allenai/allennlp/issues/1998

'''
Strategy

* Work on text_to_instance first.
* When you implement read, you could probably pull rows straight from the dataframe. In the real world this might be slow.

Questions

* What are token indexers?
* Do I need both input and output tokens captured in my dataset reader? I assume I can make the call. See the example below.
'''

In [None]:
reader = CharDatasetReader()
reader.read(file_path)

#### Model Parameters

In [3]:
EMBEDDING_DIM = 512
HIDDEN_DIM = 256

BATCH_SIZE=8

#### Construct the Vocab

In [4]:
# TODO: try using the DatasetReader to accomplish this, as done at the following link
# https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/

# Read the train and test CSVs with the custom DatasetReader
reader = CharDatasetReader()
train_ds = reader.read('data/char-rnn-train.csv')
valid_ds = reader.read('data/char-rnn-test.csv')

# Build the vocab from the training instances only
vocab = Vocabulary.from_instances(train_ds, max_vocab_size=1000)

train_ds.index_with(vocab)
valid_ds.index_with(vocab)


#### Construct DataLoaders

In [5]:

import allennlp

train_dl = allennlp.data.DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
valid_dl = allennlp.data.DataLoader(valid_ds, batch_size=BATCH_SIZE, shuffle=True)


#### Construct the Embedder

In [6]:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
embedder = BasicTextFieldEmbedder({"tokens": token_embedding})

#### Construct the Model

In [7]:

encoder = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
clm = CharLanguageModel(vocab, embedder, encoder)

In [8]:
vocab.get_vocab_size("tokens") 


171
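With the vocabulary size known, the size of the model so far can be worked out by hand. The arithmetic below assumes a single-layer, unidirectional `torch.nn.LSTM` with biases (the defaults used in the cell above). The output layer inside `CharLanguageModel` is not shown in this notebook, so it is left out of the count. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim


def lstm_params(input_dim, hidden_dim):
    """Single-layer, unidirectional LSTM: four gates, each with input weights,
    recurrent weights, and two bias vectors (PyTorch keeps b_ih and b_hh separately)."""
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)


VOCAB_SIZE = 171      # from vocab.get_vocab_size("tokens")
EMBEDDING_DIM = 512
HIDDEN_DIM = 256

n_embed = embedding_params(VOCAB_SIZE, EMBEDDING_DIM)
n_lstm = lstm_params(EMBEDDING_DIM, HIDDEN_DIM)
print(n_embed, n_lstm, n_embed + n_lstm)
```

Most of the weights sit in the LSTM rather than the embedding, which is worth keeping in mind when tuning `HIDDEN_DIM` for memory or speed.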

#### Construct an Iterator

In [9]:
# Not strictly necessary: the DataLoaders above already handle batching and are passed to the trainer directly.

#### Construct a Trainer

In [10]:
optimizer  = optim.Adam(clm.parameters(), lr=5.e-3)
num_epochs = 1

trainer = GradientDescentTrainer(
    model=clm,
    optimizer=optimizer,
    #iterator=iterator,
    data_loader=train_dl,
    validation_data_loader=valid_dl,  
    num_epochs=num_epochs,
)

You provided a validation dataset but patience was set to None, meaning that early stopping is disabled


In [11]:

#LEFT OFF HERE

'''
There is likely a mismatch between reader and forward.  Make sure you have the same names for Instances / fields?

The input/output spec of Model.forward() is somewhat more strictly defined than that of PyTorch modules. 
Its parameters need to match field names in your data code exactly. Instances created by the dataset 
reader are batched and converted to a set of tensors by AllenNLP (specifically, this part happens in 
the allennlp_collate function that the DataLoader uses). Inside our Trainer, batched tensors get passed 
to Model.forward() by their original field names. 
'''

# https://guide.allennlp.org/building-your-model#1

trainer.train()



{'best_epoch': 0,
 'peak_worker_0_memory_MB': 1341.708,
 'peak_gpu_0_memory_MB': 2,
 'peak_gpu_1_memory_MB': 1278,
 'training_duration': '0:11:06.716631',
 'training_start_epoch': 0,
 'training_epochs': 0,
 'epoch': 0,
 'training_loss': 1.2657199104252046,
 'training_reg_loss': 0.0,
 'training_worker_0_memory_MB': 1341.708,
 'training_gpu_0_memory_MB': 2,
 'training_gpu_1_memory_MB': 1278,
 'validation_loss': 1.1288638919357035,
 'validation_reg_loss': 0.0,
 'best_validation_loss': 1.1288638919357035,
 'best_validation_reg_loss': 0.0}
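The note in the training cell about field names matching `forward()` parameters is easy to see in miniature. The trainer in effect calls the model with the batch unpacked as keyword arguments, so a key that is not a parameter name fails immediately. The sketch below uses a hypothetical `ToyModel` and `run_batch` in plain Python; it is not AllenNLP code, and the field names are borrowed from the `Instance` example elsewhere in this notebook.

```python
class ToyModel:
    """Stand-in for a Model whose forward() expects specific field names."""

    def forward(self, input_tokens, output_tokens=None):
        return {
            "n_inputs": len(input_tokens),
            "has_targets": output_tokens is not None,
        }


def run_batch(model, batch):
    # Batch keys become keyword arguments, so they must match forward()'s parameters
    return model.forward(**batch)


good_batch = {"input_tokens": [1, 2, 3], "output_tokens": [2, 3, 4]}
bad_batch = {"tokens": [1, 2, 3]}  # field name the model does not know about

print(run_batch(ToyModel(), good_batch))

try:
    run_batch(ToyModel(), bad_batch)
except TypeError as err:
    print("field/parameter mismatch:", err)
```

If training fails with a `TypeError` about unexpected keyword arguments, comparing the keys of the `Instance` built in the reader with the signature of `forward()` is a good first check.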

### Predict

In [None]:
predict(clm, reader, 'How do I get started with Machine Le')

#you could try to predict partial strings from data in the test set. 

# left off here - test me. 
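The `predict` helper lives in `utils` and is not shown here, so the following is only a sketch of what character-by-character generation generally looks like, not a description of that helper. It assumes a hypothetical callable `next_char_probs(text)` that returns a `{character: probability}` dict, and uses a toy stand-in for the trained model.

```python
import random


def sample_next(probs, temperature=1.0, rng=random):
    """Sample one character from a {char: probability} dict.

    Temperatures below 1 sharpen the distribution; above 1 flatten it.
    """
    chars = list(probs)
    weights = [probs[c] ** (1.0 / temperature) for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]


def generate(next_char_probs, prefix, max_len=40, end_symbol="@end@",
             temperature=1.0, rng=random):
    """Extend `prefix` one character at a time until the end symbol or max_len."""
    text = prefix
    while len(text) < max_len:
        ch = sample_next(next_char_probs(text), temperature, rng)
        if ch == end_symbol:
            break
        text += ch
    return text


# Hypothetical stand-in for a trained model: it always continues one fixed title
TARGET = "Machine Learning"


def toy_model(text):
    if len(text) < len(TARGET):
        return {TARGET[len(text)]: 1.0}
    return {"@end@": 1.0}


print(generate(toy_model, "Machine Le"))
```

With a real model, `next_char_probs` would run the prefix through the network and apply a softmax to the logits at the last position. Feeding in truncated titles from the test set and comparing the completions is one way to eyeball quality.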

### Testing

In [5]:
@Model.register('char_lstm')
class CharLSTM(Model):
    pass

In [7]:
encoder = PytorchSeq2VecWrapper(
     LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

In [7]:
ct = CharacterTokenizer()


In [8]:
tokens = ct.tokenize('this is a test')

In [9]:
tokens

[t, h, i, s,  , i, s,  , a,  , t, e, s, t]

In [10]:
# left off here


reader2 = CharDatasetReader()
train_ds = reader2.read('data/char-rnn-train-10.csv')

vocab = Vocabulary.from_instances(train_ds, max_vocab_size=1000)

#train_ds.index_with(vocab)




In [13]:
print(vocab.get_vocab_size("tokens"))  # might the problem just be that I am using the wrong namespace?
vocab.print_statistics()  # prints directly and returns None, so no need to wrap it in print()


28


----Vocabulary Statistics----


Top 10 most frequent tokens in namespace 'tokens':
	Token:  		Frequency: 15
	Token: t		Frequency: 11
	Token: e		Frequency: 11
	Token: i		Frequency: 9
	Token: n		Frequency: 9
	Token: o		Frequency: 8
	Token: r		Frequency: 6
	Token: a		Frequency: 6
	Token: d		Frequency: 5
	Token: s		Frequency: 4

Top 10 longest tokens in namespace 'tokens':
	Token:  		length: 1	Frequency: 15
	Token: t		length: 1	Frequency: 11
	Token: e		length: 1	Frequency: 11
	Token: i		length: 1	Frequency: 9
	Token: n		length: 1	Frequency: 9
	Token: o		length: 1	Frequency: 8
	Token: r		length: 1	Frequency: 6
	Token: a		length: 1	Frequency: 6
	Token: d		length: 1	Frequency: 5
	Token: s		length: 1	Frequency: 4

Top 10 shortest tokens in namespace 'tokens':
	Token: ?		length: 1	Frequency: 1
	Token: m		length: 1	Frequency: 1
	Token: g		length: 1	Frequency: 1
	Token: )		length: 1	Frequency: 1
	Token: %		length: 1	Frequency: 1
	Token: (		length: 1	Frequency: 1
	Token: b		length: 1	Frequenc

These are the start and end symbols you should add to each sequence when training your language model.

In [37]:
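from allennlp.common.util import START_SYMBOL, END_SYMBOL
from allennlp.data.tokenizers import Token

# Wrap the character tokens with sequence boundary markers
tokens.insert(0, Token(START_SYMBOL))
tokens.append(Token(END_SYMBOL))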

In [38]:
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
 
token_indexers = {'tokens': SingleIdTokenIndexer()}

This is the beginning of how you'd create a DatasetReader. Implement the `_read` method so that it yields instances.

In [39]:
from allennlp.data.fields import TextField
from allennlp.data.instance import Instance
 
input_field = TextField(tokens[:-1], token_indexers)
output_field = TextField(tokens[1:], token_indexers)
instance = Instance({'input_tokens': input_field,
                     'output_tokens': output_field})
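To make the one-character shift between input and output concrete without any AllenNLP machinery, here is a plain-Python stand-in for what a `_read` method could do. This is a sketch under assumptions: `text_to_pair` and `read_titles` are hypothetical names, the CSV content is made up, and the boundary strings are local constants chosen to mirror the ones in `allennlp.common.util`.

```python
import csv
import io

# Local stand-ins mirroring AllenNLP's START_SYMBOL / END_SYMBOL strings
START_SYMBOL = "@start@"
END_SYMBOL = "@end@"


def text_to_pair(text):
    """Return (inputs, targets): the same character sequence, offset by one step."""
    chars = [START_SYMBOL] + list(text) + [END_SYMBOL]
    return chars[:-1], chars[1:]


def read_titles(csv_file):
    """Yield one (inputs, targets) pair per row whose category is 'title'."""
    for row in csv.DictReader(csv_file):
        if row["category"] == "title":
            yield text_to_pair(row["text"])


# Made-up two-row file with the same column names as the dataframe
toy_csv = io.StringIO(
    "text,category\n"
    "What is normality?,title\n"
    "some other text,comment\n"
)

for inputs, targets in read_titles(toy_csv):
    print(inputs[:4], targets[:4])
```

At every position the target is simply the next input character, so the model learns to predict character t+1 from characters up to t. A real reader would wrap each list in a `TextField` and yield an `Instance`, as in the cell above.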

In [45]:
for row in df.itertuples(index=False):
    print(row.text)
    break

Eliciting priors from experts


#### Messing with Fields / TextFields

* [TextFields](https://guide.allennlp.org/representing-text-as-features#2)
* [Field API](https://guide.allennlp.org/reading-data#1)
* [TextFieldTensors](https://guide.allennlp.org/representing-text-as-features#9)

In [22]:
from allennlp.data.tokenizers import Token
from allennlp.data import Vocabulary
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.fields import TextField

tokens = [Token('The'), Token('best'), Token('movie'), Token('ever'), Token('!')]
token_indexers = {'tokens': SingleIdTokenIndexer()}
text_field = TextField(tokens, token_indexers=token_indexers)

In [17]:
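vocab = Vocabulary()
# NOTE: these tokens go into the 'token_vocab' namespace, but SingleIdTokenIndexer() above
# looks tokens up in its default 'tokens' namespace, so every token indexes as unknown (1).
vocab.add_tokens_to_namespace(['The', 'best', 'movie', 'ever', '!'], namespace='token_vocab')
text_field.index(vocab)
padding_lengths = text_field.get_padding_lengths()
tft = text_field.as_tensor(padding_lengths)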

In [18]:
type(tft)

dict

In [30]:
# TextFieldTensors = Dict[str, Dict[str, torch.Tensor]]  (type definition)

from allennlp.data import TextFieldTensors
# isinstance(tft, TextFieldTensors) won't work: it is a typing alias, not a class.

In [24]:
tft.keys()

dict_keys(['tokens'])

In [25]:
tft['tokens']

{'tokens': tensor([1, 1, 1, 1, 1])}

In [26]:
tft['tokens']['tokens'] #this looks so strange. 

tensor([1, 1, 1, 1, 1])
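Two things are going on in that output. The nesting is by design: the outer key is the name given to the indexer in `token_indexers`, and the inner key is the name of the tensor that indexer produces. The all-ones tensor, however, appears to be a namespace mismatch: the tokens were added to `'token_vocab'`, while `SingleIdTokenIndexer()` looks them up in its default `'tokens'` namespace, so every token falls back to the unknown index. The toy class below is a hypothetical stand-in, not AllenNLP's `Vocabulary`, and only reproduces that lookup behaviour.

```python
class ToyVocab:
    """Tiny namespaced vocabulary: index 0 is padding, index 1 is unknown."""

    def __init__(self):
        self._namespaces = {}

    def add_tokens(self, tokens, namespace):
        table = self._namespaces.setdefault(
            namespace, {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
        )
        for token in tokens:
            # New tokens get the next free index; existing ones are left alone
            table.setdefault(token, len(table))

    def index(self, token, namespace):
        # Unknown tokens and unknown namespaces both fall back to index 1
        return self._namespaces.get(namespace, {}).get(token, 1)


words = ["The", "best", "movie", "ever", "!"]
toy_vocab = ToyVocab()
toy_vocab.add_tokens(words, namespace="token_vocab")

mismatched = [toy_vocab.index(w, "tokens") for w in words]
matched = [toy_vocab.index(w, "token_vocab") for w in words]
print(mismatched)
print(matched)
```

In the notebook, either adding the tokens to the `'tokens'` namespace or constructing the indexer as `SingleIdTokenIndexer(namespace='token_vocab')` should make the two sides agree and give distinct indices.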