### Workflow

This time we can use a larger chunk of the original dataset without running into as many memory and runtime problems.

1. Explore and analyze the set of unique characters present in the dataset.
1. Subsample the dataset to keep only the titles.
1. Implement the tokenization of the dataset and transform the tokens into AllenNLP Instances.
1. Design an RNN with AllenNLP that includes:
   1. an embedding of the tokens,
   1. a seq2seq LSTM layer,
   1. a feed-forward layer that outputs a probability distribution over the characters.
1. Train the model.
1. Evaluate the model by calculating the loss of some sentences and by generating text.
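
The evaluation step can be illustrated without any model code. The sketch below is a minimal, standard-library illustration, not the AllenNLP implementation: `next_char_probs` is a hypothetical stand-in for a trained model (here a uniform distribution over a toy vocabulary), and `"$"` is an assumed end-of-sequence marker. It shows the two ideas in the last step: the average negative log-likelihood per character (whose exponential is the perplexity) and text generation by sampling with a temperature.

```python
import math
import random

# Hypothetical toy vocabulary; "$" stands in for an end-of-sequence symbol.
VOCAB = ["a", "b", "c", "$"]

def next_char_probs(context):
    """Stand-in for a trained model: P(next char | context).
    This toy version ignores the context and returns a uniform distribution."""
    return {ch: 1.0 / len(VOCAB) for ch in VOCAB}

def sentence_loss(text):
    """Average negative log-likelihood per predicted character, in nats."""
    targets = list(text) + ["$"]
    total = 0.0
    for i, target in enumerate(targets):
        probs = next_char_probs(targets[:i])
        total += -math.log(probs[target])
    return total / len(targets)

def sample_text(max_len=20, temperature=1.0, seed=0):
    """Generate text one character at a time until "$" or max_len is reached."""
    rng = random.Random(seed)
    out = []
    while len(out) < max_len:
        probs = next_char_probs(out)
        # Temperature rescaling: p ** (1 / T); random.choices normalises the weights.
        weights = [probs[ch] ** (1.0 / temperature) for ch in VOCAB]
        ch = rng.choices(VOCAB, weights=weights, k=1)[0]
        if ch == "$":
            break
        out.append(ch)
    return "".join(out)

loss = sentence_loss("abc")
perplexity = math.exp(loss)   # uniform over 4 symbols gives a perplexity of 4
generated = sample_text()
```

A lower loss on held-out titles than on random strings is a quick sanity check that the model has learned something about the data.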

### Bonus

Can you swap in a Transformer-based model in place of the LSTM?

### Resources

* [AllenNLP the hard way: Building a Baseline Model](https://jbarrow.ai/allennlp-the-hard-way-2/)
* [Sequential Labeling and Language Modeling (Chapter 5 - Covers AllenNLP and Char RNN LM)](https://livebook.manning.com/book/real-world-natural-language-processing/chapter-5/v-5/35)
* [Attention and the Transformer (Chapter 8)](https://livebook.manning.com/book/real-world-natural-language-processing/chapter-8/v-5/)
* [Sequence to Sequence Models (Chapter 6)](https://livebook.manning.com/book/real-world-natural-language-processing/chapter-6/v-5/)
* [An In Depth AllenNLP tutorial (Basics to BERT)](https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/)
* [Official AllenNLP tutorial](https://allennlp.org/tutorials)
* [AllenNLP Guide - Reading Data](https://guide.allennlp.org/reading-data)
* [AllenNLP Simple Language Model DatasetReader](https://github.com/allenai/allennlp-models/blob/master/allennlp_models/lm/dataset_readers/simple_language_modeling.py)

In [20]:
import pandas as pd
from collections import Counter, defaultdict


### Explore the dataset

In [42]:
from allennlp.models import Model
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder, PytorchSeq2VecWrapper
from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer
from allennlp.data.tokenizers import Token
from allennlp.data.dataset_readers.dataset_reader import DatasetReader

from torch.nn import LSTM

from typing import Dict, Iterable, Union, Optional, List

In [13]:
TOK_FILE = 'stackexchange_tokenized.csv'
df = pd.read_csv(f'data/{TOK_FILE}')

In [14]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | post_id | parent_id | comment_id | text | category | length | tokenized |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | | | Eliciting priors from experts | title | 29 | Eliciting priors from experts |
| 1 | 1 | 2 | | | What is normality? | title | 18 | What is normality ? |
| 2 | 2 | 3 | | | What are some valuable Statistical Analysis op... | title | 65 | What are some valuable Statistical Analysis op... |
| 3 | 3 | 4 | | | Assessing the significance of differences in d... | title | 58 | Assessing the significance of differences in d... |
| 4 | 4 | 6 | | | The Two Cultures: statistics vs. machine learn... | title | 50 | The Two Cultures : statistics vs. machine lear... |


### Limit Dataset to Titles

In [17]:
df_titles = df[df.category == 'title']

In [18]:
len(df_titles)

91648

In [19]:
len(df)

809166

In [26]:
charCtr = Counter()

def collect_chars(txt):
    # Counter.update on a string counts each character.
    # No `global` is needed because charCtr is mutated, not reassigned.
    charCtr.update(txt)

In [28]:
df_titles['text'].apply(collect_chars)
charCtr.most_common(15)

[(' ', 1456196),
 ('e', 981638),
 ('i', 817482),
 ('t', 789104),
 ('a', 737638),
 ('o', 714848),
 ('n', 684648),
 ('r', 600202),
 ('s', 593196),
 ('l', 365504),
 ('d', 298382),
 ('c', 279640),
 ('m', 246208),
 ('u', 224208),
 ('f', 221880)]
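
Once the character counts are available, the rare characters at the tail of the distribution are usually the interesting part of the exploration. The following is a minimal sketch, assuming a hypothetical helper `build_char_vocab` and placeholder names `<pad>` and `<unk>` (AllenNLP manages its own padding and unknown tokens, so this is only an illustration of the idea): characters seen fewer than `min_count` times are left out of the vocabulary and would be mapped to the unknown id.

```python
from collections import Counter

def build_char_vocab(texts, min_count=2):
    """Map each sufficiently frequent character to an integer id.

    Returns the vocabulary and the sorted list of characters that were
    too rare to get their own id.
    """
    counts = Counter()
    for text in texts:
        counts.update(text)
    # Ids 0 and 1 are reserved for padding and unknown characters (placeholder names).
    vocab = {"<pad>": 0, "<unk>": 1}
    for ch, n in counts.most_common():
        if n >= min_count:
            vocab[ch] = len(vocab)
    rare = sorted(ch for ch, n in counts.items() if n < min_count)
    return vocab, rare

# Two titles taken from the df.head() output above.
titles = ["What is normality?", "Eliciting priors from experts"]
vocab, rare = build_char_vocab(titles, min_count=2)
print(len(vocab), "ids;", "rare characters:", rare)
```

On the full set of titles, inspecting `rare` is a quick way to spot stray symbols or encoding debris before training.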

### Tokenization

In [52]:


from allennlp.data.fields import TextField
from allennlp.data.instance import Instance
from allennlp.data.token_indexers import SingleIdTokenIndexer

#@DatasetReader.register("character_based_lm_so")
class CharDatasetReader(DatasetReader):
    def __init__(self) -> None:
        super().__init__()
        # TODO: these could become constructor arguments
        self._token_indexers = {'tokens': SingleIdTokenIndexer()}
        self._tokenizer = CharacterTokenizer()

    def text_to_instance(self, sentence: str) -> Instance:
        tokenized = self._tokenizer.tokenize(sentence)
        # TODO: add start and end characters
        # TODO: do you want to add "source" and "target" here?
        instance = Instance({"source": TextField(tokenized, self._token_indexers)})
        return instance

    def _read(self, df: pd.DataFrame) -> Iterable[Instance]:
        # The base class normally hands a file path to _read; a DataFrame is used here for convenience.
        for row in df.itertuples(index=False):
            instance = self.text_to_instance(row.text)
            yield instance

# Left off here: https://jbarrow.ai/allennlp-the-hard-way-1/ (good info on testing your DatasetReader)

'''

Strategy

* Work on text_to_instance first.
* When you implement _read, you could probably pull the rows straight from the DataFrame. In the real world this might be slow.

Questions

* What are token indexers?
* Do I need both input and output tokens captured in my dataset reader? I assume I can make the call. See example below.
'''

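
On the first question: in AllenNLP a token indexer decides how each token is turned into indices, and `SingleIdTokenIndexer` represents each token by a single integer id looked up in the vocabulary. The sketch below illustrates that idea in plain Python. It is not AllenNLP's implementation; `ToyVocab`, `index_tokens`, and the `<unk>` placeholder are hypothetical names used only for this example.

```python
class ToyVocab:
    """Hypothetical stand-in for a vocabulary: token string -> integer id."""

    def __init__(self, tokens):
        # Id 0 is reserved for tokens that were never seen.
        self.token_to_id = {"<unk>": 0}
        for tok in tokens:
            self.token_to_id.setdefault(tok, len(self.token_to_id))

    def get_id(self, token):
        return self.token_to_id.get(token, self.token_to_id["<unk>"])

def index_tokens(tokens, vocab):
    """Single-id indexing: one integer per token."""
    return [vocab.get_id(tok) for tok in tokens]

# Build the vocabulary from the characters of a sample sentence.
vocab = ToyVocab("this is a test")
ids = index_tokens(list("test"), vocab)
unknown = index_tokens(["z"], vocab)   # "z" was never seen, so it maps to id 0
print(ids, unknown)
```

Other indexers can produce several indices per token (for example, the characters of a word), which is why a `TextField` takes a dictionary of indexers rather than a single one.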

### Testing

In [5]:
@Model.register('char_lstm')
class CharLSTM(Model):
    pass

In [6]:
EMBEDDING_DIM = 512
HIDDEN_DIM = 256

In [7]:
encoder = PytorchSeq2VecWrapper(
     LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

In [30]:
ct = CharacterTokenizer()


In [34]:
tokens = ct.tokenize('this is a test')

Add the start and end symbols, which you should use when training your language model.

In [37]:
from allennlp.common.util import START_SYMBOL, END_SYMBOL


tokens.insert(0, Token(START_SYMBOL))
tokens.append(Token(END_SYMBOL))

In [38]:
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
 
token_indexers = {'tokens': SingleIdTokenIndexer()}

This is the starting point for a DatasetReader. Implement the `_read` method so that it yields instances.

In [39]:
from allennlp.data.fields import TextField
from allennlp.data.instance import Instance
 
input_field = TextField(tokens[:-1], token_indexers)
output_field = TextField(tokens[1:], token_indexers)
instance = Instance({'input_tokens': input_field,
                     'output_tokens': output_field})
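
The slicing in the cell above is what turns one sequence into a next-character prediction task: the input drops the last token and the target drops the first, so position `i` of the input is paired with the character that follows it. A minimal plain-Python sketch of the same shift, where `"@start@"` and `"@end@"` are assumed stand-ins for `START_SYMBOL` and `END_SYMBOL` and `make_lm_pair` is a hypothetical helper:

```python
# Assumed stand-ins for AllenNLP's START_SYMBOL and END_SYMBOL.
START, END = "@start@", "@end@"

def make_lm_pair(text):
    """Return (inputs, targets) where targets is inputs shifted left by one."""
    tokens = [START] + list(text) + [END]
    return tokens[:-1], tokens[1:]

inputs, targets = make_lm_pair("hi")
for x, y in zip(inputs, targets):
    # Each input token is paired with the token the model should predict next.
    print(f"{x!r} -> {y!r}")
```

Both sequences always have the same length, which is what lets the loss be computed position by position.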

In [45]:
for row in df.itertuples(index=False):
    print(row.text)
    break

Eliciting priors from experts
