## A Few NLP Tasks
1. Part-of-Speech Tagging
2. Parsing: Syntactic / Semantic / Dependency
3. Named Entity Recognition

## Higher Level Tasks
1. Natural Language Inference: Entailment, Contradiction and Neutral: SNLI / MNLI
2. Semantic Textual Similarity: graded similarity score for a sentence pair: STS
3. Machine Comprehension: Question Answering, Summarization: SQuAD, SWAG
4. Machine Translation: English to French; the target need not be a natural language

[GLUE Leaderboard](https://gluebenchmark.com/leaderboard)

[SQuAD Leaderboard](https://rajpurkar.github.io/SQuAD-explorer/)

[Conversational QA](https://stanfordnlp.github.io/coqa/)

## Is this enough? Reasoning and Common Sense, Open Domain QA, Maths ...
[Allen AI Leaderboard](https://leaderboard.allenai.org/)

## Summary of Tools


1.   Google Colab: GPUs and TPUs, file uploads and downloads
2.   Jupyter Notebooks
3.   PyTorch as the deep learning library
4.   AllenNLP as the NLP framework
5.   NLTK for tokenizers, stopwords, datasets
6.   spaCy for tokenizers, POS and dependency tagging, feature generation; KParser and Stanford parsers
7.   AllenNLP demo
8.   scikit-learn and GridSearch demo
9.   Embeddings
10.  BERT, ELMo, OpenAI GPT, GloVe

## Resources:
1. AWS Educate: https://aws.amazon.com/education/awseducate/ ($100 credit)
2. Google GCP ($100 credit)
3. Use Spot Instances


In [None]:
'''
!pip install torch
!pip install allennlp
!pip install --upgrade numpy pandas plotly

'''

import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------


Downloader>  l



Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.............. Chat-80 Data Files
  [ ] city_database....... City Database
  [ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
  [ ] comparative_sentences Comparative Sentence Dataset
  [ ] comtrans............ ComTrans Corpus Sample


In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [4]:
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')

### POS, NER, DEPENDENCY PARSING 

In [6]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
    
print()
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple apple PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. u.k. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
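
The `start_char` and `end_char` values printed for each entity are character offsets into the original string, with the end offset exclusive. A minimal sketch in plain Python (no spaCy required) that checks this, with the offsets copied from the output above:

```python
# Sentence and entity offsets copied from the spaCy output above.
text = 'Apple is looking at buying U.K. startup for $1 billion'
entities = [(0, 5, 'ORG'), (27, 31, 'GPE'), (44, 54, 'MONEY')]

# Slicing the raw string with (start_char, end_char) recovers each entity.
spans = [(text[start:end], label) for start, end, label in entities]
for span_text, label in spans:
    print(span_text, label)
```

This is handy when entity annotations need to be aligned with the raw text, for example to highlight spans.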


### Similarity between Tokens

In [7]:
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

Apple Apple 1.0
Apple is -0.05606477
Apple looking 0.17363018
Apple at 0.09761385
Apple buying 0.23062494
Apple U.K. 0.49552646
Apple startup 0.088767774
Apple for -0.020235488
Apple $ 0.05591571
Apple 1 0.116298646
Apple billion 0.1802153
is Apple -0.05606477
is is 1.0
is looking 0.03646119
is at 0.15363081
is buying -0.010881584
is U.K. 0.014506461
is startup 0.22643858
is for 0.22094187
is $ -0.015712954
is 1 0.015063022
is billion -0.054269664
looking Apple 0.17363018
looking is 0.03646119
looking looking 1.0
looking at 0.20644525
looking buying 0.55810255
looking U.K. 0.010433668
looking startup 0.4008714
looking for 0.045873784
looking $ 0.093066834
looking 1 0.089511566
looking billion 0.10066204
at Apple 0.09761385
at is 0.15363081
at looking 0.20644525
at at 1.0
at buying 0.20436426
at U.K. 0.07365204
at startup 0.13014099
at for 0.60098934
at $ 0.116372705
at 1 0.18446393
at billion 0.21969722
buying Apple 0.23062494
buying is -0.010881584
buying looking 0.55810255
buying at 
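
`token.similarity` compares the vectors of two tokens; by default spaCy bases this on cosine similarity, which is why every token scores 1.0 against itself and why the table is symmetric. A minimal sketch with made-up 3-dimensional vectors (the vectors are illustrative and not taken from any spaCy model):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen only for illustration.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 1.0, 1.0]

print(cosine_similarity(vec_a, vec_a))  # a vector compared with itself
print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_b, vec_a))  # same value as the line above
```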

In [10]:
from spacy import displacy

doc_dep = nlp(u'This is a sentence.')
displacy.render(doc_dep, style='dep', jupyter=True)

doc_ent = nlp(u'When Sebastian Thrun started working on self-driving cars at Google '
              u'in 2007, few people outside of the company took him seriously.')
displacy.render(doc_ent, style='ent', jupyter=True)

In [5]:
# Read the header plus the first four QNLI training rows.
with open("/scratch/pbanerj6/datasets/glue_data/QNLI/train.tsv") as qnli_file:
    qnli = qnli_file.readlines()[0:5]
for line in qnli:
    print(line)

index	question	sentence	label

0	What is the Grotto at Notre Dame?	Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection.	entailment

1	What is the Grotto at Notre Dame?	It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.	not_entailment

2	What sits on top of the Main Building at Notre Dame?	Atop the Main Building's gold dome is a golden statue of the Virgin Mary.	entailment

3	What sits on top of the Main Building at Notre Dame?	Next to the Main Building is the Basilica of the Sacred Heart.	not_entailment
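
Each QNLI row is tab-separated with the columns `index`, `question`, `sentence` and `label`. A minimal sketch of parsing such rows with the standard `csv` module; two rows are copied from the output above into an in-memory string so the example runs without the dataset file:

```python
import csv
import io

# Header plus two rows copied from the QNLI sample above.
sample = (
    "index\tquestion\tsentence\tlabel\n"
    "2\tWhat sits on top of the Main Building at Notre Dame?\t"
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary.\tentailment\n"
    "3\tWhat sits on top of the Main Building at Notre Dame?\t"
    "Next to the Main Building is the Basilica of the Sacred Heart.\tnot_entailment\n"
)

# QUOTE_NONE keeps quotation marks inside sentences from being treated as CSV quoting.
tsv_reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(tsv_reader)

for row in rows:
    print(row["label"], "|", row["question"], "|", row["sentence"])
```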



In [1]:
import json
from typing import Iterator, List, Dict, Optional
import torch
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

# for dataset reader
from allennlp.data import Instance
from allennlp.data.fields import TextField, SequenceLabelField, LabelField
from allennlp.data.dataset_readers import DatasetReader
from allennlp.common.file_utils import cached_path
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token, Tokenizer, WordTokenizer
from allennlp.data.vocabulary import Vocabulary

# read pretrained embedding from AWS S3
from allennlp.modules.token_embedders.embedding import _read_embeddings_from_text_file

# for building model
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder, PytorchSeq2VecWrapper
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper
from allennlp.modules import FeedForward
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.nn import InitializerApplicator, RegularizerApplicator
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer

In [2]:
class PublicationDatasetReader(DatasetReader):
    """
    DatasetReader for publication and venue dataset
    """
    def __init__(self, 
                 tokenizer: Tokenizer = None,
                 token_indexers: Dict[str, TokenIndexer] = None, 
                 lazy: bool = False,
                 train_max_len = 1000,
                 val_max_len = 500) -> None:
        super().__init__(lazy)
        self._tokenizer = tokenizer or WordTokenizer()
        self._token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
        self.train_max_len = train_max_len
        self.val_max_len = val_max_len

   
    def _read(self, file_path: str) -> Iterator[Instance]:
        """
        Read publication and venue dataset in JSON format
        
        Data is in the following format:
            {"title": ..., "paperAbstract": ..., "venue": ...}
        """
        is_train = "train" in file_path
        max_len = self.train_max_len if is_train else self.val_max_len
        with open(cached_path(file_path), "r") as data_file:
            lines = data_file.readlines()
            for line in lines[0:max_len]:
                line = line.strip("\n")
                if not line:
                    continue
                paper_json = json.loads(line)
                title = paper_json['title']
                abstract = paper_json['paperAbstract']
                venue = paper_json['venue']
                yield self.text_to_instance(title, abstract, venue)
        
    def text_to_instance(self, 
                         title: str, 
                         abstract: str, 
                         venue: str=None) -> Instance:
        """
        Turn title, abstract, and venue to instance
        """
        tokenized_title = self._tokenizer.tokenize(title)
        tokenized_abstract = self._tokenizer.tokenize(abstract)
        title_field = TextField(tokenized_title, self._token_indexers)
        abstract_field = TextField(tokenized_abstract, self._token_indexers)
        fields = {'title': title_field, 
                  'abstract': abstract_field}
        if venue is not None:
            fields['label'] = LabelField(venue)
        return Instance(fields)

In [3]:
class AcademicPaperClassifier(Model):
    """
    Model to classify venue based on input title and abstract
    """
    def __init__(self, 
                 vocab: Vocabulary,
                 text_field_embedder: TextFieldEmbedder,
                 title_encoder: Seq2VecEncoder,
                 abstract_encoder: Seq2VecEncoder,
                 classifier_feedforward: FeedForward,
                 initializer: InitializerApplicator = InitializerApplicator(),
                 regularizer: Optional[RegularizerApplicator] = None) -> None:
        super(AcademicPaperClassifier, self).__init__(vocab, regularizer)
        self.text_field_embedder = text_field_embedder
        self.num_classes = self.vocab.get_vocab_size("labels")
        self.title_encoder = title_encoder
        self.abstract_encoder = abstract_encoder
        self.classifier_feedforward = classifier_feedforward
        self.metrics = {
                "accuracy": CategoricalAccuracy(),
                "accuracy3": CategoricalAccuracy(top_k=3)
        }
        self.loss = torch.nn.CrossEntropyLoss()
        initializer(self)
    
    def forward(self, 
                title: Dict[str, torch.LongTensor],
                abstract: Dict[str, torch.LongTensor],
                label: torch.LongTensor = None) -> Dict[str, torch.Tensor]:
        
        embedded_title = self.text_field_embedder(title)
        title_mask = get_text_field_mask(title)
        encoded_title = self.title_encoder(embedded_title, title_mask)

        embedded_abstract = self.text_field_embedder(abstract)
        abstract_mask = get_text_field_mask(abstract)
        encoded_abstract = self.abstract_encoder(embedded_abstract, abstract_mask)

        logits = self.classifier_feedforward(torch.cat([encoded_title, encoded_abstract], dim=-1))
        class_probabilities = F.softmax(logits, dim=-1)
        argmax_indices = np.argmax(class_probabilities.cpu().data.numpy(), axis=-1)
        labels = [self.vocab.get_token_from_index(x, namespace="labels") for x in argmax_indices]
        output_dict = {
            'logits': logits, 
            'class_probabilities': class_probabilities,
            'predicted_label': labels
        }
        if label is not None:
            loss = self.loss(logits, label)
            for metric in self.metrics.values():
                metric(logits, label)
            output_dict["loss"] = loss

        return output_dict

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        
        return {
                'accuracy': self.metrics["accuracy"].get_metric(reset),
                'accuracy3': self.metrics["accuracy3"].get_metric(reset)
                }
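
The model tracks two metrics: plain accuracy and `accuracy3`, which is `CategoricalAccuracy(top_k=3)`. Under top-k accuracy a prediction counts as correct when the gold label is among the k highest-scoring classes. A minimal plain-Python sketch of that idea (the helper name `top_k_accuracy` and the toy logits are illustrative and not part of AllenNLP):

```python
def top_k_accuracy(logits, gold_labels, k=1):
    """Fraction of rows whose gold label index is among the k largest logits."""
    correct = 0
    for scores, gold in zip(logits, gold_labels):
        # Indices of the k highest-scoring classes for this example.
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        correct += gold in top_k
    return correct / len(gold_labels)

# Toy batch: 3 examples, 4 classes.
toy_logits = [
    [0.1, 2.0, 0.3, 0.0],   # gold class 1 ranks first
    [1.5, 0.2, 1.0, 0.9],   # gold class 3 ranks third
    [0.0, 0.1, 0.2, 3.0],   # gold class 0 ranks last
]
toy_gold = [1, 3, 0]

print(top_k_accuracy(toy_logits, toy_gold, k=1))
print(top_k_accuracy(toy_logits, toy_gold, k=3))
```

If the dataset has only three venue classes, as the three logits in the prediction output further down suggest, `accuracy3` would always be 1.0, so `accuracy` is the more informative number here.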

In [4]:
train_data_path = "https://s3-us-west-2.amazonaws.com/allennlp/datasets/academic-papers-example/train.jsonl"
validation_data_path = "https://s3-us-west-2.amazonaws.com/allennlp/datasets/academic-papers-example/dev.jsonl"
pretrained_file = "https://s3-us-west-2.amazonaws.com/allennlp/datasets/glove/glove.6B.100d.txt.gz"

In [5]:
reader = PublicationDatasetReader()

In [6]:
instance = reader.text_to_instance("This is a great paper.", 
                                   "Indeed, this is a great paper of all time", 
                                   "Nature")

In [12]:
train_dataset = reader.read(train_data_path)
validation_dataset = reader.read(validation_data_path)

1000it [00:02, 401.17it/s]
500it [00:01, 438.67it/s]


In [13]:
vocab = Vocabulary.from_instances(train_dataset + validation_dataset)

100%|██████████| 1500/1500 [00:00<00:00, 3679.47it/s]
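
`Vocabulary.from_instances` counts the tokens in every instance and assigns each distinct token an integer index. A simplified plain-Python sketch of the same idea; the `build_vocab` helper is hypothetical, and the reserved padding and unknown entries only mimic the kind of special tokens a real vocabulary keeps:

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # illustrative special tokens

def build_vocab(tokenized_texts, min_count=1):
    """Map tokens to indices, most frequent first; 0 and 1 are reserved."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    token_to_index = {PAD: 0, UNK: 1}
    for token, count in counts.most_common():
        if count >= min_count:
            token_to_index[token] = len(token_to_index)
    return token_to_index

texts = [["this", "is", "a", "great", "paper"],
         ["indeed", "this", "is", "a", "great", "paper", "of", "all", "time"]]
toy_vocab = build_vocab(texts)

# Tokens that were never seen fall back to the UNK index.
indexed = [toy_vocab.get(tok, toy_vocab[UNK]) for tok in ["a", "great", "survey"]]
print(len(toy_vocab), indexed)
```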


In [14]:
embedding_matrix = _read_embeddings_from_text_file(file_uri=pretrained_file, 
                                                   embedding_dim=100, 
                                                   vocab=vocab)

400000it [00:03, 120373.42it/s]


In [15]:
print(embedding_matrix.size())

torch.Size([17398, 100])
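
The pretrained file is a GloVe text file: one token per line followed by its vector components, separated by spaces. The resulting matrix has one row per vocabulary entry (17398 here) and 100 columns. A rough sketch of how such a matrix can be assembled, assuming tokens missing from the file get a small random vector; the `load_embeddings` helper and the tiny 3-dimensional sample are illustrative, not the AllenNLP implementation:

```python
import io
import random

def load_embeddings(lines, token_to_index, dim, seed=0):
    """Build a len(vocab) x dim matrix from 'token v1 v2 ...' lines."""
    rng = random.Random(seed)
    # Start from small random vectors so tokens missing from the file still get a row.
    matrix = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in token_to_index]
    found = set()
    for line in lines:
        parts = line.rstrip().split(" ")
        token, values = parts[0], parts[1:]
        if token in token_to_index and len(values) == dim:
            matrix[token_to_index[token]] = [float(v) for v in values]
            found.add(token)
    return matrix, found

# Tiny stand-in for a GloVe file (the real file used here has 100 dimensions).
glove_sample = io.StringIO("paper 0.1 0.2 0.3\ngreat 0.4 0.5 0.6\nbanana 0.7 0.8 0.9\n")
toy_index = {"paper": 0, "great": 1, "survey": 2}

emb, found = load_embeddings(glove_sample, toy_index, dim=3)
print(len(emb), len(emb[0]), sorted(found))
```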


In [16]:
EMBEDDING_DIM = 100
HIDDEN_DIM = 100
num_classes = len(vocab.get_index_to_token_vocabulary('labels'))

In [17]:
# embedding
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'), 
                            embedding_dim=EMBEDDING_DIM,
                            weight=embedding_matrix)
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

In [18]:

lstm_title = PytorchSeq2VecWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, 
                                                 batch_first=True, bidirectional=True))
lstm_abstract = PytorchSeq2VecWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, 
                                                    batch_first=True, bidirectional=True))
feed_forward = torch.nn.Linear(2 * 2 * HIDDEN_DIM, num_classes)
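
The input size of the final linear layer, `2 * 2 * HIDDEN_DIM`, follows from the encoder shapes: each bidirectional LSTM concatenates a forward and a backward final state (`2 * HIDDEN_DIM`), and the model then concatenates the title and abstract encodings. A small sketch of that arithmetic (the helper name is illustrative):

```python
def classifier_input_dim(hidden_dim, bidirectional=True, num_encoders=2):
    """Size of the concatenated encoder outputs fed to the classifier."""
    directions = 2 if bidirectional else 1
    per_encoder = directions * hidden_dim   # output size of one Seq2Vec encoder
    return num_encoders * per_encoder       # title + abstract concatenated

HIDDEN_DIM = 100
print(classifier_input_dim(HIDDEN_DIM))                        # 400
print(classifier_input_dim(HIDDEN_DIM, bidirectional=False))   # 200
```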

In [19]:
model = AcademicPaperClassifier(vocab,
                                word_embeddings, 
                                lstm_title, 
                                lstm_abstract, 
                                feed_forward)

In [20]:
optimizer = optim.SGD(model.parameters(), lr=0.005)

In [21]:
iterator = BucketIterator(batch_size=64, 
                          sorting_keys=[("abstract", "num_tokens"), 
                                        ("title", "num_tokens")])
iterator.index_with(vocab) # index with the created vocabulary
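
`BucketIterator` groups instances of similar length into the same batch, using the sorting keys given above, so that less padding is needed. A simplified sketch of the idea in plain Python; it sorts by a single length and omits the shuffling and noise a real iterator may add, and `bucket_batches` is a hypothetical helper:

```python
def bucket_batches(sequences, batch_size):
    """Sort sequences by length, then cut consecutive batches."""
    ordered = sorted(sequences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_needed(batches):
    """Total pad tokens when each batch is padded to its longest sequence."""
    return sum(max(len(s) for s in batch) - len(s) for batch in batches for s in batch)

# Toy "abstracts" of different lengths (token ids are placeholders).
lengths = [3, 50, 4, 48, 5, 52]
toy_sequences = [[0] * n for n in lengths]

unsorted_batches = [toy_sequences[i:i + 3] for i in range(0, len(toy_sequences), 3)]
bucketed_batches = bucket_batches(toy_sequences, batch_size=3)

print(padding_needed(unsorted_batches), padding_needed(bucketed_batches))
```

In this toy case, batching in the original order mixes short and long sequences, while bucketing keeps them apart and needs far fewer pad tokens.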

In [22]:
trainer = Trainer(
    model=model,
    optimizer=optimizer,
    iterator=iterator,
    train_dataset=train_dataset,
    validation_dataset=validation_dataset,
    patience=2,
    num_epochs=3,
    serialization_dir='output2',
    validation_metric="+accuracy"
)

In [23]:
trainer.train()

{'training_duration': '00:00:00',
 'training_start_epoch': 3,
 'training_epochs': 0,
 'best_epoch': 2}

In [24]:
from allennlp.common.util import JsonDict
from allennlp.predictors.predictor import Predictor

class PaperClassifierPredictor(Predictor):
    """
    Predictor wrapper for the AcademicPaperClassifier
    """
    def _json_to_instance(self, json_dict: JsonDict) -> Instance:
        title = json_dict['title']
        abstract = json_dict['paperAbstract']
        instance = self._dataset_reader.text_to_instance(title=title, abstract=abstract)
        return instance

predictor = PaperClassifierPredictor(model, dataset_reader=reader)
prediction_output = predictor.predict_json(
    {
        "title": "Know What You Don't Know: Unanswerable Questions for SQuAD", 
        "paperAbstract": "Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify. To address these weaknesses, we present SQuAD 2.0, the latest version of the Stanford Question Answering Dataset (SQuAD). SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD 2.0 is a challenging natural language understanding task for existing models: a strong neural system that gets 86% F1 on SQuAD 1.1 achieves only 66% F1 on SQuAD 2.0."
    }
)

print(prediction_output)

{'logits': [0.13181300461292267, -0.05287451297044754, -0.06042443588376045], 'class_probabilities': [0.3764386773109436, 0.3129575848579407, 0.31060364842414856], 'predicted_label': 'ACL'}
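
The `class_probabilities` in the output are the softmax of the `logits`, matching the `F.softmax(logits, dim=-1)` call in `forward`, and `predicted_label` corresponds to the largest probability. A plain-Python check using the logits printed above:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the prediction output above.
logits = [0.13181300461292267, -0.05287451297044754, -0.06042443588376045]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 4) for p in probs], best)
```

The probabilities are close to uniform (roughly one third each), so the model is far from confident in this prediction.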


1. [Link to Structured GitHub Repo](https://github.com/allenai/allennlp-as-a-library-example)


2. [Link to Notebook with Comments](https://github.com/titipata/allennlp-tutorial/blob/master/allennlp_tutorial.ipynb)
