## Data Augmentation
I have tried augmentation by paraphrasing with a GPT-J prompt and, separately, with EDA (Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks); the code for both is in src/data/. Here we will use the NLPAug library to introduce the topic.
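
For orientation, two of the four EDA operations (random swap and random deletion) can be sketched in plain Python. This is a minimal illustration, not the code in src/data/; the function names, the whitespace tokenisation and the fixed seed are assumptions made for the example.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    words = list(words)
    if not words:
        return words
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = "The quick brown fox jumps over the lazy dog .".split()
print(" ".join(random_swap(tokens, n_swaps=2, rng=rng)))
print(" ".join(random_deletion(tokens, p=0.2, rng=rng)))
```

The other two EDA operations, synonym replacement and random insertion, need a synonym source such as WordNet and are left out here.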

In [1]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

In [2]:
text = 'The quick brown fox jumps over the lazy dog .'
print(text)

The quick brown fox jumps over the lazy dog .


### Contextual Word Embeddings Augmenter
Insert words using contextual word embeddings (BERT, DistilBERT, RoBERTa or XLNet). Here we use SciBERT, a BERT model pretrained on scientific text.

In [4]:
aug = naw.ContextualWordEmbsAug(
    model_path='allenai/scibert_scivocab_uncased', action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
now the quick brown tit fox slowly jumps over the very lazy dog.


Now that we have a specialist model, let's try a sentence from the biomedical domain:

In [5]:
text = "The frequencies and nature of the adverse events were in line with those reported for the innovator MabThera/Rituxan in the RA and NLH study populations."
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The frequencies and nature of the adverse events were in line with those reported for the innovator MabThera/Rituxan in the RA and NLH study populations.
Augmented Text:
the distribution frequencies and very nature therefore of all the adverse drug events were quite in line entirely with those reported previously for the canadian innovator trial mabthera / rituxan in the ra and nlh study populations.


### Back Translation Augmenter
Translates the sentence into another language and back again (here English to German to English) to yield a slightly different sentence.

In [8]:
import nlpaug.augmenter.word as naw
text = "In the pivotal RA trial, efficacy results in terms of DAS28 and ACR were shown to be comparable between CT-P10 and MabThera."
back_translation_aug = naw.BackTranslationAug(
    from_model_name='facebook/wmt19-en-de', 
    to_model_name='facebook/wmt19-de-en'
)
print("Original:")
print(text)
print("Augmented Text:")
print(back_translation_aug.augment(text))

Original:
In the pivotal RA trial, efficacy results in terms of DAS28 and ACR were shown to be comparable between CT-P10 and MabThera.
Augmented Text:
In the pivotal RA study, efficacy results for DAS28 and ACR were comparable between CT-P10 and MabThera.
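
To put a rough number on "slightly different", one option is the Jaccard overlap between the token sets of the original and the back-translated sentence. The sketch below is only an illustration: the `token_set` and `jaccard` helpers and the simple regex tokeniser are assumptions for this example, not part of NLPAug.

```python
import re


def token_set(text):
    """Lowercased set of alphanumeric tokens; hyphenated names stay whole."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))


def jaccard(a, b):
    """Size of the token intersection divided by the size of the union."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


original = (
    "In the pivotal RA trial, efficacy results in terms of DAS28 and ACR "
    "were shown to be comparable between CT-P10 and MabThera."
)
augmented = (
    "In the pivotal RA study, efficacy results for DAS28 and ACR "
    "were comparable between CT-P10 and MabThera."
)
print(f"Jaccard overlap: {jaccard(original, augmented):.2f}")
```

A score of 1.0 means the augmenter returned the same set of words, which adds nothing to the training data, while a very low score may signal that the meaning has drifted.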


### Abstractive Summarization Augmenter

This might also fit our use case: the sentiment, or the judgement of how reliable a study is, should not depend on report numbers and the like. At least, I assume we do not want the model to learn that.

In [11]:
text = "to the EMA Guideline on the Investigation of Bioequivalence (CPMP/EWP/QWP/1401/98 Rev. 1/ Corr **), Ivabradine 5 mg film-coated tablets satisfy the conditions for waiver of bioequivalence studies conducted on the applied product 7.5 mg strength."

aug = nas.AbstSummAug(model_path='t5-base')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
to the EMA Guideline on the Investigation of Bioequivalence (CPMP/EWP/QWP/1401/98 Rev. 1/ Corr **), Ivabradine 5 mg film-coated tablets satisfy the conditions for waiver of bioequivalence studies conducted on the applied product 7.5 mg strength.
Augmented Text:
5 mg film-coated tablets satisfy the conditions for waiver of bioequivalence studies conducted on the applied product 7.5 mg strength .

The augmenter also accepts a list of sentences:


In [15]:
my_list = ['The planned extension studies CT-P10 3.2 (RA) CT-P10 3.3 (AFL) and CT-P10 3.4 (LTBFL) listed in the RMP will provide additional long term safety data.', 'For these parameters, CHF 5993 pMDI and Foster pMDI + Tiotropium generally showed a similar effect.']

In [16]:
aug.augment(my_list)

Your max_length is set to 50, but you input_length is only 34. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)


['the planned extension studies will provide additional long term safety data . the studies CT-P10 3.2 (RA) CT- P10 3.3 (AFL) and CT-p10 3.4 (LTBFL) listed in the',
 'CHF 5993 pMDI and Foster + Tiotropium generally show a similar effect .']
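
When the augmented sentences are added to a labelled training set, each variant should inherit the label of its source sentence. The helper below is a hypothetical sketch of that bookkeeping: `expand_dataset`, `dummy_augment` and the example labels are invented for illustration. It accepts any callable, so `aug.augment` could be passed in. It also normalises the return value, on the assumption that some NLPAug versions return a string for a single input (as the outputs above suggest) while others return a list; check this against the installed version.

```python
def expand_dataset(texts, labels, augment_fn, n_copies=1):
    """Return the originals plus augmented variants, copying each label."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    out_texts, out_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        for _ in range(n_copies):
            result = augment_fn(text)
            # Normalise: an augmenter may return a str or a list of str.
            variants = [result] if isinstance(result, str) else list(result)
            for variant in variants:
                # Skip empty outputs and outputs identical to the input.
                if variant and variant != text:
                    out_texts.append(variant)
                    out_labels.append(label)
    return out_texts, out_labels


def dummy_augment(text):
    """Stand-in augmenter for illustration only; it just lowercases."""
    return [text.lower()]


texts = ["Efficacy WAS comparable.", "no serious events"]
labels = ["positive", "neutral"]  # hypothetical labels
new_texts, new_labels = expand_dataset(texts, labels, dummy_augment)
print(list(zip(new_texts, new_labels)))
```

Unchanged outputs are dropped so that the augmenter cannot silently duplicate training examples.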