### Workflow

This time we can use a larger chunk of the original dataset without running into so many memory and runtime problems.

1. x - Explore and analyze the set of unique characters present in the dataset.
1. x - Subsample the dataset to take into account the titles only.
1. x - Implement the tokenization of the dataset and transform the tokens into AllenNLP Instances.
1. Design an RNN with AllenNLP that includes:
   1. an embedding of the tokens,
   1. a seq2seq LSTM layer,
   1. a feed-forward layer that outputs a probability distribution over the characters.
1. Train the model.
1. Evaluate the model by calculating the loss of some sentences and by generating text.
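
The "generating text" part of the last step can be sketched without any framework. The snippet below is a minimal illustration, assuming the model exposes some function that returns one score per character for a given prefix; `toy_logits`, `CHARS`, and the `"$"` end marker are hypothetical stand-ins, not part of the notebook or of AllenNLP.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_text(next_logits, chars, end_char, max_len=50, temperature=1.0, seed=0):
    """Sample one character at a time until end_char is drawn or max_len is reached."""
    rng = random.Random(seed)
    out = []
    for _ in range(max_len):
        probs = softmax(next_logits(out), temperature)
        ch = rng.choices(chars, weights=probs, k=1)[0]
        if ch == end_char:
            break
        out.append(ch)
    return "".join(out)

# Hypothetical stand-in for a trained model's output layer.
CHARS = ["a", "b", "$"]  # "$" plays the role of the end symbol

def toy_logits(prefix):
    if len(prefix) >= 6:
        return [0.0, 0.0, 5.0]  # strongly prefer ending after six characters
    if prefix and prefix[-1] == "a":
        return [0.0, 3.0, 0.0]  # prefer "b" after "a"
    return [3.0, 0.0, 0.0]      # otherwise prefer "a"

generated = sample_text(toy_logits, CHARS, end_char="$", temperature=0.5)
print(generated)
```

Lower temperatures make the samples follow the most likely characters more closely; higher temperatures make them more varied.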

### Bonus

Can you swap in a Transformer-based model for the LSTM?

### Resources

* [AllenNLP the hard way: Building a Baseline Model](https://jbarrow.ai/allennlp-the-hard-way-2/)
* [Sequential Labeling and Language Modeling (Chapter 5 - Covers allenNLP and Char RNN LM)](https://livebook.manning.com/book/real-world-natural-language-processing/chapter-5/v-5/35)
  - [Related Colab Jupyter Notebook](https://colab.research.google.com/github/mhagiwara/realworldnlp/blob/master/examples/generation/lm.ipynb#scrollTo=CX2cfBPu1bfw)
* [Attention and the Transformer (Chapter 8)](https://livebook.manning.com/book/real-world-natural-language-processing/chapter-8/v-5/)
* [Sequence to Sequence Models (Chapter 6)](https://livebook.manning.com/book/real-world-natural-language-processing/chapter-6/v-5/)
* [An In Depth AllenNLP tutorial (Basics to BERT)](https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/)
* [Official AllenNLP tutorial](https://allennlp.org/tutorials)
* [AllenNLP Guide - Reading Data](https://guide.allennlp.org/reading-data)
* [AllenNLP Simple Language Model DatasetReader](https://github.com/allenai/allennlp-models/blob/master/allennlp_models/lm/dataset_readers/simple_language_modeling.py)
* [Testing your DatasetReader](https://jbarrow.ai/allennlp-the-hard-way-1/)
* [AllenNLP - Vocabulary](https://guide.allennlp.org/reading-data#3)
* [sequence_cross_entropy_with_logits](https://github.com/allenai/allennlp/blob/master/allennlp/nn/util.py#L704) - Looks like the simplest way to calculate the loss for your CharRNN.
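
To make the idea behind a masked sequence loss concrete, here is a plain-Python sketch. It is an illustration of the technique, not AllenNLP's implementation: it averages the negative log-likelihood over the non-padded positions of each sequence, then averages over sequences. The function names and the toy inputs are assumptions made for this example.

```python
import math

def log_softmax(logits):
    """Log-probabilities for one time step, computed stably."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def masked_sequence_cross_entropy(logits, targets, mask):
    """logits: [batch][time][vocab]; targets and mask: [batch][time], mask holds 0/1."""
    per_sequence = []
    for seq_logits, seq_targets, seq_mask in zip(logits, targets, mask):
        n_real = sum(seq_mask)
        if n_real == 0:
            continue  # skip sequences that are entirely padding
        nll = 0.0
        for step_logits, target, keep in zip(seq_logits, seq_targets, seq_mask):
            if keep:
                nll -= log_softmax(step_logits)[target]
        per_sequence.append(nll / n_real)
    if not per_sequence:
        return 0.0
    return sum(per_sequence) / len(per_sequence)

# Uniform scores over a 4-character vocabulary; padded positions have mask 0.
uniform = [0.0, 0.0, 0.0, 0.0]
logits = [[uniform, uniform, uniform], [uniform, uniform, uniform]]
targets = [[1, 2, 0], [3, 0, 0]]
mask = [[1, 1, 0], [1, 0, 0]]

loss = masked_sequence_cross_entropy(logits, targets, mask)
perplexity = math.exp(loss)
print(loss, perplexity)
```

With uniform scores the loss equals the log of the vocabulary size, which is a handy sanity check for an untrained model.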



In [43]:
import pandas as pd
import numpy as np
from collections import Counter, defaultdict


In [44]:
%load_ext autoreload
%autoreload 2

from utils import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Explore the dataset

In [8]:
TOK_FILE = 'stackexchange_tokenized.csv'
file_path = f'data/{TOK_FILE}'
df = pd.read_csv(file_path)

In [20]:
df.head()

Unnamed: 0.1,Unnamed: 0,post_id,parent_id,comment_id,text,category,length,tokenized
0,0,1,,,Eliciting priors from experts,title,29,Eliciting priors from experts
1,1,2,,,What is normality?,title,18,What is normality ?
2,2,3,,,What are some valuable Statistical Analysis op...,title,65,What are some valuable Statistical Analysis op...
3,3,4,,,Assessing the significance of differences in d...,title,58,Assessing the significance of differences in d...
4,4,6,,,The Two Cultures: statistics vs. machine learn...,title,50,The Two Cultures : statistics vs. machine lear...


#### Split into training and test sets

In [12]:
from sklearn.model_selection import train_test_split

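# Cap the number of rows to keep memory use and runtime manageable.
max_df_size = int(1e7)
df_size = min(len(df.index), max_df_size)

# Split row positions 60/40 into train and test; the seed keeps the split reproducible.
indxs = np.arange(df_size)
train_i, test_i = train_test_split(indxs, test_size=0.4, random_state=42)

df_train = df.iloc[train_i]
df_test  = df.iloc[test_i]
# TODO: add a validation dataset later?

# index=False avoids writing the row index as an extra "Unnamed: 0" column.
df_train.to_csv('data/char-rnn-train.csv', index=False)
df_test.to_csv('data/char-rnn-test.csv', index=False)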
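
One possible way to address the validation-set TODO is a three-way split of the row positions. The sketch below uses only the standard library as a stand-in for `train_test_split`; the function name and the 60/20/20 fractions are assumptions for illustration, not choices made in this notebook.

```python
import random

def three_way_split(n_rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle row positions and cut them into train / validation / test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    test_i = indices[:n_test]
    val_i = indices[n_test:n_test + n_val]
    train_i = indices[n_test + n_val:]
    return train_i, val_i, test_i

train_i, val_i, test_i = three_way_split(1000)
print(len(train_i), len(val_i), len(test_i))
```

The resulting lists can be passed to `df.iloc[...]` in the same way as the two-way split above.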



### Limit Dataset to Titles

In [23]:
df_titles = df[df.category == 'title']

In [26]:
len(df_titles)

91648

In [29]:
len(df)

809166

In [32]:
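charCtr = Counter()

def collect_chars(txt):
    # Counter.update on a string counts each character. No `global` is needed
    # because charCtr is mutated in place, not reassigned.
    charCtr.update(txt)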

In [35]:
df_titles['text'].apply(collect_chars)
charCtr.most_common(15)

[(' ', 1456196),
 ('e', 981638),
 ('i', 817482),
 ('t', 789104),
 ('a', 737638),
 ('o', 714848),
 ('n', 684648),
 ('r', 600202),
 ('s', 593196),
 ('l', 365504),
 ('d', 298382),
 ('c', 279640),
 ('m', 246208),
 ('u', 224208),
 ('f', 221880)]
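
Character counts like these are what a character vocabulary is built from. The sketch below shows one way to turn a `Counter` into a character-to-index mapping with reserved padding and unknown entries. The `<pad>` and `<unk>` markers, the function names, and the `min_count` threshold are hypothetical choices for illustration; AllenNLP's `Vocabulary` handles this internally.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical marker names

def build_char_vocab(texts, min_count=1):
    """Map each sufficiently frequent character to an integer id."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    kept = [ch for ch, n in counts.most_common() if n >= min_count]
    itos = [PAD, UNK] + kept  # ids 0 and 1 are reserved
    stoi = {ch: i for i, ch in enumerate(itos)}
    return stoi, itos

def encode(text, stoi):
    """Replace characters with ids, falling back to the unknown id."""
    return [stoi.get(ch, stoi[UNK]) for ch in text]

titles = ["What is normality?", "Eliciting priors from experts"]
stoi, itos = build_char_vocab(titles, min_count=2)
ids = encode("What?", stoi)
print(ids)
```

Raising `min_count` is a simple way to fold rare characters (stray symbols, unusual Unicode) into the unknown id.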

### Tokenization

In [37]:




# https://github.com/allenai/allennlp/issues/1998
#
# Strategy
#
# * Work on text_to_instance first.
# * When you implement read, you could probably pull the rows from the dataframe. In the real world this might be slow.
#
# Questions
#
# * What are token indexers?
# * Do I need both input and output tokens captured in my data reader? I assume I can make the call. See example below.

In [39]:
reader = CharDatasetReader()
reader.read(file_path)


<allennlp.data.dataset_readers.dataset_reader.AllennlpDataset at 0x7fd622b7e700>

#### Model Parameters

In [15]:
EMBEDDING_DIM = 512
HIDDEN_DIM = 256
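
As a rough sanity check on these sizes, the parameter counts can be worked out by hand. The sketch below assumes a single-layer, unidirectional LSTM with the PyTorch parameter layout (input weights, recurrent weights, and two bias vectors per gate) and a hypothetical vocabulary of 100 characters; the real vocabulary size depends on the data.

```python
EMBEDDING_DIM = 512
HIDDEN_DIM = 256
VOCAB_SIZE = 100  # hypothetical; the real value comes from the Vocabulary

def embedding_param_count(vocab_size, embedding_dim):
    # one embedding vector per vocabulary entry
    return vocab_size * embedding_dim

def lstm_param_count(input_dim, hidden_dim):
    # four gates, each with input weights, recurrent weights, and two bias vectors
    return 4 * hidden_dim * (input_dim + hidden_dim + 2)

def projection_param_count(hidden_dim, vocab_size):
    # linear layer from hidden state to one score per character, plus bias
    return hidden_dim * vocab_size + vocab_size

embedding_params = embedding_param_count(VOCAB_SIZE, EMBEDDING_DIM)
lstm_params = lstm_param_count(EMBEDDING_DIM, HIDDEN_DIM)
projection_params = projection_param_count(HIDDEN_DIM, VOCAB_SIZE)
print(embedding_params, lstm_params, projection_params)
```

Under these assumptions most of the parameters sit in the LSTM, and an embedding much wider than the character vocabulary may be more than the task needs.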

#### Construct the Vocab

In [13]:
# TODO: try building the vocabulary through the DatasetReader, as in
# https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/

# Left off here.
# Construct a DatasetReader for 'data/char-rnn-train.csv' ...
train_reader = CharDatasetReader()
train_ds = train_reader.read('data/char-rnn-train.csv')

# ... then build the vocabulary from its instances.
vocab = Vocabulary.from_instances(train_ds, max_vocab_size=1000)


#### Construct the Embedder

In [20]:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
embedder = BasicTextFieldEmbedder({"tokens": token_embedding})

#### Construct the Model

In [45]:

encoder = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
clm = CharLanguageModel(vocab,embedder, encoder)

### Testing

In [5]:
@Model.register('char_lstm')
class CharLSTM(Model):
    pass

In [7]:
encoder = PytorchSeq2VecWrapper(
     LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

In [30]:
ct = CharacterTokenizer()


In [34]:
tokens = ct.tokenize('this is a test')

AllenNLP provides start and end symbols that you should use when training your language model.

In [37]:
from allennlp.common.util import START_SYMBOL, END_SYMBOL
from allennlp.data.tokenizers import Token

# Wrap the character tokens in start and end markers.
tokens.insert(0, Token(START_SYMBOL))
tokens.append(Token(END_SYMBOL))

In [38]:
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
 
token_indexers = {'tokens': SingleIdTokenIndexer()}

This is the beginning of how you'd create a DatasetReader. Implement the `_read` method so that it yields instances like the one built below.

In [39]:
from allennlp.data.fields import TextField
from allennlp.data.instance import Instance
 
input_field = TextField(tokens[:-1], token_indexers)
output_field = TextField(tokens[1:], token_indexers)
instance = Instance({'input_tokens': input_field,
                     'output_tokens': output_field})
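
The input/output shift above is the core of the reader. The sketch below shows the same logic as a plain generator over CSV rows, using only the standard library so it can run without AllenNLP. The marker strings, the function names, the `"post"` category, and the in-memory sample file are stand-ins for illustration; a real `_read` would wrap each pair in an `Instance` as in the cell above.

```python
import csv
import io

START, END = "@start@", "@end@"  # stand-ins for START_SYMBOL and END_SYMBOL

def text_to_pair(text):
    """Split text into characters and return (input, target) shifted by one step."""
    tokens = [START] + list(text) + [END]
    return tokens[:-1], tokens[1:]

def read_pairs(handle, category="title", max_len=None):
    """Yield one (input, target) pair per CSV row of the requested category."""
    for row in csv.DictReader(handle):
        if row["category"] != category:
            continue
        text = row["text"]
        if max_len is not None:
            text = text[:max_len]  # optionally truncate long texts
        yield text_to_pair(text)

# Tiny in-memory stand-in for the tokenized CSV file.
sample = io.StringIO(
    "post_id,text,category\n"
    "1,Eliciting priors from experts,title\n"
    "2,Some answer body,post\n"
    "3,What is normality?,title\n"
)
pairs = list(read_pairs(sample))
print(len(pairs), pairs[1][0][:5], pairs[1][1][:5])
```

Because the reader is a generator, rows are processed lazily, which helps when the file is too large to hold in memory.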

In [45]:
for row in df.itertuples(index=False):
    print(row.text)
    break

Eliciting priors from experts


#### Messing with Fields / TextFields

* [TextFields](https://guide.allennlp.org/representing-text-as-features#2)
* [Field API](https://guide.allennlp.org/reading-data#1)
* [TextFieldTensors](https://guide.allennlp.org/representing-text-as-features#9)

In [22]:
from allennlp.data.tokenizers import Token
from allennlp.data import Vocabulary
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.fields import TextField

tokens = [Token('The'), Token('best'), Token('movie'), Token('ever'), Token('!')]
token_indexers = {'tokens': SingleIdTokenIndexer()}
text_field = TextField(tokens, token_indexers=token_indexers)

In [17]:
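vocab = Vocabulary()
vocab.add_tokens_to_namespace(['The', 'best', 'movie', 'ever', '!'], namespace='token_vocab')
# Note: the indexer above uses its default 'tokens' namespace, while the words were
# added to 'token_vocab', so every lookup falls back to the unknown-token index.
text_field.index(vocab)
padding_lengths = text_field.get_padding_lengths()
tft = text_field.as_tensor(padding_lengths)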

In [18]:
type(tft)

dict

In [30]:
# Type definition: TextFieldTensors = Dict[str, Dict[str, torch.Tensor]]

from allennlp.data import TextFieldTensors
# isinstance(tft, TextFieldTensors) raises TypeError: a subscripted typing alias cannot be used with isinstance().

In [24]:
tft.keys()

dict_keys(['tokens'])

In [25]:
tft['tokens']

{'tokens': tensor([1, 1, 1, 1, 1])}

In [26]:
tft['tokens']['tokens']  # outer key: the indexer's name; inner key: the tensor that indexer produced

tensor([1, 1, 1, 1, 1])
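
The all-ones tensor is most likely a namespace mismatch: the words were added to `token_vocab`, while the indexer looks them up in `tokens`. The toy class below imitates that lookup behaviour in plain Python. It is a simplified model for illustration, not AllenNLP's `Vocabulary`; `ToyVocab` and its methods are hypothetical.

```python
from collections import defaultdict

PAD_INDEX, OOV_INDEX = 0, 1  # reserved ids, as in a padded namespace

class ToyVocab:
    """Minimal namespaced vocabulary: each namespace has its own token table."""

    def __init__(self):
        self._index = defaultdict(dict)

    def add_tokens(self, tokens, namespace):
        table = self._index[namespace]
        for tok in tokens:
            # new tokens get the next free id after the two reserved ones
            table.setdefault(tok, len(table) + 2)

    def get_index(self, token, namespace):
        # unknown tokens (or unknown namespaces) map to the OOV id
        return self._index[namespace].get(token, OOV_INDEX)

words = ["The", "best", "movie", "ever", "!"]
vocab = ToyVocab()
vocab.add_tokens(words, namespace="token_vocab")

mismatched = [vocab.get_index(w, "tokens") for w in words]
matched = [vocab.get_index(w, "token_vocab") for w in words]
print(mismatched)
print(matched)
```

In the notebook itself, constructing the indexer as `SingleIdTokenIndexer(namespace='token_vocab')` should make the two namespaces agree and produce distinct ids.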