# Scientific Text & Summary Generation with GPT-2
Notebook to generate scientific text and summaries of abstracts or scientific papers using GPT-2.
Data: Abstracts and full texts of over 6000 papers from econstor.
Sources for code and notebooks:
  [Huggingface Transformers](https://github.com/huggingface/transformers)
  There are also many articles, such as this [Medium post](https://towardsdatascience.com/fine-tuning-gpt2-for-text-generation-using-pytorch-2ee61a4f1ba7), that explain how to use GPT-2 for text generation. Most of them, however, use older versions of the transformers library.



## Colab preparation & Data preparation


1.   Mount Google Drive
2.   Download the data
3.   Extract the abstracts
4.   Create training, validation, and test datasets






In [1]:
#1. Mount gDrive
#Steffen's code for mounting a GoogleDrive to the content folder
from google.colab import drive
import os

drive.mount('/content/gdrive')  # Mounting GoogleDrive to the content folder

project_dir = 'NLP_scientific-text-generation'
if not os.path.exists('/content/gdrive/MyDrive/'+project_dir):  # Create a project folder if it does not exist yet
    os.makedirs('/content/gdrive/MyDrive/'+project_dir)
os.chdir('/content/gdrive/MyDrive/'+project_dir)  # Changing the working directory to the project folder on GoogleDrive

Mounted at /content/gdrive


In [10]:
#2. Get the data
#Code from Steffen's colab
import urllib.request
import tarfile

def getData (dataset):
    # Open the download as a stream and read it as a gzip-compressed tar archive
    url = "https://www.econstor.eu/ki-hackathon/" + dataset
    stream = urllib.request.urlopen(url)
    tf = tarfile.open(fileobj=stream, mode="r|gz")
    # Extract files (they end up in the subfolder contained in the archive)
    tf.extractall()


In [11]:
getData('econstor-cc-by-4.0-json.tgz')

In [12]:
getData('econstor-cc-by-4.0-txt.tgz')

In [13]:
#3. Data preparation
#The following code extracts the abstracts from the json files and creates a list of abstracts (code is from Philipp's aitextgen notebook).
import os
import json
import pandas as pd

dir = "/content/gdrive/MyDrive/NLP_scientific-text-generation/json"

abstracts = []
count = 1000
i = 0

for f_name in os.listdir(dir):
  if f_name.endswith(".json"):
    path = os.path.join(dir, f_name)
    with open(path, "r") as f:
      data = json.load(f)
      is_english = False
      abstract = ""
      for obj in data["metadata"]:
        if obj["key"] == "dc.language.iso":
          if obj["value"] == "eng":
            is_english = True
          else:
            break
        elif obj["key"] == "dc.description.abstract":
          abstract = obj["value"]
          if is_english:
            break
      
      # Keep only English records that actually have an abstract
      if is_english and abstract:
        abstracts.append(abstract)

    # Stop once the required number of samples has been collected
    if len(abstracts) >= count:
      break
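
The loop above depends on the order in which the language and abstract keys appear in each record. A minimal order-independent sketch is shown below. It assumes the record layout implied by the code above (a `metadata` list of `key`/`value` pairs); the helper name `extract_english_abstract` and the toy record are made up for illustration.

```python
def extract_english_abstract(record):
    """Return the abstract of an English-language record, or None."""
    language = None
    abstract = None
    # Read every metadata entry first, so the key order does not matter
    for entry in record["metadata"]:
        if entry["key"] == "dc.language.iso":
            language = entry["value"]
        elif entry["key"] == "dc.description.abstract":
            abstract = entry["value"]
    if language == "eng" and abstract:
        return abstract
    return None


# Toy record (illustrative only; real files contain many more keys)
record = {
    "metadata": [
        {"key": "dc.description.abstract", "value": "We study job mobility."},
        {"key": "dc.language.iso", "value": "eng"},
    ]
}
print(extract_english_abstract(record))
```

Records without an abstract, or in another language, yield `None` and can simply be skipped.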

In [14]:
#Create a dataframe with one abstract per row and take a first look at the data
df = pd.DataFrame({'abstract': abstracts})
df = df.reset_index()
print(df)

     index                                           abstract
0        0  OBJECTIVES: This systematic review and meta-an...
1        1  In light of the recent worldwide migration of ...
2        2  Theories about neighbours’ influence on childr...
3        3  Although electricity supply is still dominated...
4        4  Job mobility equilibrates disparities in local...
..     ...                                                ...
995    995  With increasing market competition, organizati...
996    996  We introduce a class of production function wh...
997    997  HPV infections can cause substantial burden in...
998    998  Additive manufacturing (AM), or popular scient...
999    999  In many applications, ranking of decision maki...

[1000 rows x 2 columns]


In [16]:
#4. Create training, validation and test datasets
from sklearn.model_selection import train_test_split
import re

train_test_ratio = 0.9  #Share of the data kept for training + validation (the remaining 10% is test data)
train_valid_ratio = 7/9 #Share of that remainder used for training (the rest is validation data), i.e. 70/20/10 overall
df_full_train, df_test = train_test_split(df, train_size = train_test_ratio, random_state = 1)
df_train, df_valid = train_test_split(df_full_train, train_size = train_valid_ratio, random_state = 1)
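
The two ratios are applied one after the other, so the resulting split sizes are easy to misread. The sketch below works through the arithmetic for the 1000 abstracts collected above. It assumes that a fractional `train_size` is rounded down and that the remaining rows go to the other part, which is how I understand `train_test_split` to behave.

```python
import math

n_total = 1000               # number of abstracts collected above
train_test_ratio = 0.9       # share kept for training + validation
train_valid_ratio = 7 / 9    # share of that remainder used for training

# First split: (training + validation) versus test
n_full_train = math.floor(n_total * train_test_ratio)
n_test = n_total - n_full_train

# Second split: training versus validation
n_train = math.floor(n_full_train * train_valid_ratio)
n_valid = n_full_train - n_train

print(n_train, n_valid, n_test)  # 700 200 100
```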

In [24]:
#Create dataset
def build_dataset(df, dest_path):
    # Write one abstract per line, wrapped in <BOS>/<EOS> markers
    bos_token = '<BOS>'
    eos_token = '<EOS>'
    with open(dest_path, 'w', encoding='utf-8') as f:  # the file is closed even if writing fails
        for abstract in df['abstract'].tolist():
            abstract = str(abstract).strip()
            abstract = re.sub(r"\s", " ", abstract)  # replace line breaks and tabs with spaces
            f.write(bos_token + ' ' + abstract + ' ' + eos_token + '\n')

In [25]:
build_dataset(df_train, 'train.txt')
build_dataset(df_valid, 'valid.txt')
build_dataset(df_test, 'test.txt')
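
To make the file format concrete, here is a small sketch of what a single line in `train.txt` looks like. The helper `format_abstract` and the sample text are hypothetical. Unlike `build_dataset`, this variant uses `\s+`, so runs of whitespace collapse into one space.

```python
import re


def format_abstract(abstract, bos_token="<BOS>", eos_token="<EOS>"):
    # Collapse line breaks, tabs and repeated spaces into single spaces
    text = re.sub(r"\s+", " ", str(abstract).strip())
    return f"{bos_token} {text} {eos_token}"


sample = "A toy abstract\nspanning two lines. "
line = format_abstract(sample)
print(line)  # <BOS> A toy abstract spanning two lines. <EOS>
```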

## Get the huggingface transformers library

In [12]:
!pip install datasets
!pip install transformers
#Clone the GitHub repository to get the required Python scripts
!git clone https://github.com/huggingface/transformers.git

Cloning into 'transformers'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 60280 (delta 13), reused 12 (delta 12), pack-reused 60263
Receiving objects: 100% (60280/60280), 45.06 MiB | 7.88 MiB/s, done.
Resolving deltas: 100% (42517/42517), done.
Checking out files: 100% (1015/1015), done.


In [31]:
!nvidia-smi

Fri Jan 22 14:05:13 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    77W / 149W |  11326MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Train the model
Different model sizes can be used by changing "--model_name_or_path gpt2".
GPT-2 models from huggingface can be found [here](https://huggingface.co/transformers/pretrained_models.html).

GPT-2 models (OpenAI):
*   gpt2 (12-layer, 768-hidden, 12-heads, 117M parameters)
*   gpt2-medium (24-layer, 1024-hidden, 16-heads, 345M parameters)
*   gpt2-large (36-layer, 1280-hidden, 20-heads, 774M parameters)
*   gpt2-xl (48-layer, 1600-hidden, 25-heads, 1558M parameters)

DistilGPT2 by huggingface:
*   distilgpt2 (6-layer, 768-hidden, 12-heads, 82M parameters), distilled from the GPT-2 (gpt2) model checkpoint.

The large and xl versions might not run on Colab because of RAM issues, even when a batch size of one is used.
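
A rough back-of-the-envelope estimate suggests why the larger models are a problem. The sketch assumes plain fp32 training with an Adam-style optimizer, which keeps weights, gradients and two moment buffers (about 16 bytes per parameter), and it ignores activations entirely. The parameter counts are the ones listed above; everything else is an assumption made for illustration.

```python
# Parameter counts in millions, as listed above
PARAMS_MILLIONS = {
    "distilgpt2": 82,
    "gpt2": 117,
    "gpt2-medium": 345,
    "gpt2-large": 774,
    "gpt2-xl": 1558,
}

# Assumption: fp32 weights + gradients + two Adam moment buffers, 4 bytes each
BYTES_PER_PARAM = 4 * 4


def training_memory_gib(params_millions):
    """Lower bound on training memory in GiB (activations not included)."""
    return params_millions * 1e6 * BYTES_PER_PARAM / 2**30


for name, size in PARAMS_MILLIONS.items():
    print(f"{name:12s} ~{training_memory_gib(size):5.1f} GiB")
```

Under these assumptions, gpt2-large already needs more than the 11441 MiB shown by `nvidia-smi` above before any activations are stored, while gpt2-medium stays well below it.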

# PROBLEMS / ToDo
Since our data have only one abstract per line, the "--line_by_line" option should be enabled: "If your dataset is organized with one sample per line, you can use the --line_by_line flag (otherwise the script concatenates all texts and then splits them in blocks of the same length)".
However, enabling this option currently produces an error.
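
Without `--line_by_line`, the quoted behaviour (concatenate everything, then cut into blocks of equal length) can be sketched as follows. This is a simplified illustration on toy token ids, not the script's actual code; the function name `group_texts` and the tiny block size are chosen for the example.

```python
def group_texts(token_lists, block_size):
    """Concatenate tokenised samples and cut them into equal-sized blocks."""
    concatenated = [tok for tokens in token_lists for tok in tokens]
    # Drop the tail that does not fill a complete block
    usable = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, usable, block_size)]


# Three toy "abstracts" of different lengths
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = group_texts(samples, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

A block can therefore span several abstracts, and the number of training examples equals the number of blocks rather than the number of lines in `train.txt`. That would be consistent with the training log reporting only 148 examples.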








In [3]:
#Trains the model. Be careful with the batch_size when using larger models or datasets.
%run "/content/gdrive/MyDrive/NLP_scientific-text-generation/transformers/examples/language-modeling/run_clm.py" \
    --model_name_or_path gpt2-medium \
    --train_file "/content/gdrive/My Drive/NLP_scientific-text-generation/train.txt" \
    --validation_file "/content/gdrive/My Drive/NLP_scientific-text-generation/valid.txt" \
    --do_train \
    --do_eval \
    --output_dir "/content/gdrive/My Drive/NLP_scientific-text-generation/output/" \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --learning_rate 5e-5 \
    --num_train_epochs=6 \
    --overwrite_output_dir

01/22/2021 14:07:23 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=/content/gdrive/My Drive/NLP_scientific-text-generation/output/, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=6.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Jan22_14-07-23_4f0af75a20c5, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, 


Using custom data configuration default



Downloading and preparing dataset text/default-152b112fb2efee7b (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/text/default-152b112fb2efee7b/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab...



Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-152b112fb2efee7b/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab. Subsequent calls will reuse this data.


01/22/2021 14:07:25 - INFO - filelock -   Lock 140134538215264 acquired on /root/.cache/huggingface/transformers/3a7a4b7235202f93d14a4a5e8200709184c5b25a29d9cfa6b0ede5166adf0768.cf0ec4a33a38dc96108560e01338af4bd3360dd859385d451c35b41987ae73ff.lock
[INFO|file_utils.py:1272] 2021-01-22 14:07:25,368 >> https://huggingface.co/gpt2-medium/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpuf4_gkkg



[INFO|file_utils.py:1276] 2021-01-22 14:07:25,684 >> storing https://huggingface.co/gpt2-medium/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/3a7a4b7235202f93d14a4a5e8200709184c5b25a29d9cfa6b0ede5166adf0768.cf0ec4a33a38dc96108560e01338af4bd3360dd859385d451c35b41987ae73ff
[INFO|file_utils.py:1279] 2021-01-22 14:07:25,685 >> creating metadata file for /root/.cache/huggingface/transformers/3a7a4b7235202f93d14a4a5e8200709184c5b25a29d9cfa6b0ede5166adf0768.cf0ec4a33a38dc96108560e01338af4bd3360dd859385d451c35b41987ae73ff
01/22/2021 14:07:25 - INFO - filelock -   Lock 140134538215264 released on /root/.cache/huggingface/transformers/3a7a4b7235202f93d14a4a5e8200709184c5b25a29d9cfa6b0ede5166adf0768.cf0ec4a33a38dc96108560e01338af4bd3360dd859385d451c35b41987ae73ff.lock
[INFO|configuration_utils.py:445] 2021-01-22 14:07:25,689 >> loading configuration file https://huggingface.co/gpt2-medium/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3a7




[INFO|configuration_utils.py:445] 2021-01-22 14:07:25,965 >> loading configuration file https://huggingface.co/gpt2-medium/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3a7a4b7235202f93d14a4a5e8200709184c5b25a29d9cfa6b0ede5166adf0768.cf0ec4a33a38dc96108560e01338af4bd3360dd859385d451c35b41987ae73ff
[INFO|configuration_utils.py:481] 2021-01-22 14:07:25,966 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 1024,
  "n_special": 0,
  "predict_special_tokens": true,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summ


[INFO|file_utils.py:1276] 2021-01-22 14:07:27,047 >> storing https://huggingface.co/gpt2-medium/resolve/main/vocab.json in cache at /root/.cache/huggingface/transformers/fee58641d7a73348d842afaa337d5a7763dad32beff8d9008bb3c3c847749d6b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|file_utils.py:1279] 2021-01-22 14:07:27,048 >> creating metadata file for /root/.cache/huggingface/transformers/fee58641d7a73348d842afaa337d5a7763dad32beff8d9008bb3c3c847749d6b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
01/22/2021 14:07:27 - INFO - filelock -   Lock 140134534741912 released on /root/.cache/huggingface/transformers/fee58641d7a73348d842afaa337d5a7763dad32beff8d9008bb3c3c847749d6b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f.lock





01/22/2021 14:07:27 - INFO - filelock -   Lock 140134696737648 acquired on /root/.cache/huggingface/transformers/23c853a0fcfc12c7d72ad4e922068b6982665b673f6de30b4c5cbe5bd70a2236.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b.lock
[INFO|file_utils.py:1272] 2021-01-22 14:07:27,327 >> https://huggingface.co/gpt2-medium/resolve/main/merges.txt not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpak0pb_ua



[INFO|file_utils.py:1276] 2021-01-22 14:07:28,021 >> storing https://huggingface.co/gpt2-medium/resolve/main/merges.txt in cache at /root/.cache/huggingface/transformers/23c853a0fcfc12c7d72ad4e922068b6982665b673f6de30b4c5cbe5bd70a2236.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|file_utils.py:1279] 2021-01-22 14:07:28,029 >> creating metadata file for /root/.cache/huggingface/transformers/23c853a0fcfc12c7d72ad4e922068b6982665b673f6de30b4c5cbe5bd70a2236.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
01/22/2021 14:07:28 - INFO - filelock -   Lock 140134696737648 released on /root/.cache/huggingface/transformers/23c853a0fcfc12c7d72ad4e922068b6982665b673f6de30b4c5cbe5bd70a2236.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b.lock





01/22/2021 14:07:28 - INFO - filelock -   Lock 140134696737648 acquired on /root/.cache/huggingface/transformers/8e4f9a65085b1b4ae69ffac9a953a44249c9ea1e72e4a7816ee87b70081df038.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0.lock
[INFO|file_utils.py:1272] 2021-01-22 14:07:28,312 >> https://huggingface.co/gpt2-medium/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp8tihcftb



[INFO|file_utils.py:1276] 2021-01-22 14:07:29,107 >> storing https://huggingface.co/gpt2-medium/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/8e4f9a65085b1b4ae69ffac9a953a44249c9ea1e72e4a7816ee87b70081df038.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|file_utils.py:1279] 2021-01-22 14:07:29,108 >> creating metadata file for /root/.cache/huggingface/transformers/8e4f9a65085b1b4ae69ffac9a953a44249c9ea1e72e4a7816ee87b70081df038.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
01/22/2021 14:07:29 - INFO - filelock -   Lock 140134696737648 released on /root/.cache/huggingface/transformers/8e4f9a65085b1b4ae69ffac9a953a44249c9ea1e72e4a7816ee87b70081df038.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0.lock
[INFO|tokenization_utils_base.py:1766] 2021-01-22 14:07:29,111 >> loading file https://huggingface.co/gpt2-medium/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/fee58641d7




01/22/2021 14:07:29 - INFO - filelock -   Lock 140134534740008 acquired on /root/.cache/huggingface/transformers/6249eef5c8c1fcfccf9f36fc2e59301b109ac4036d8ebbee9c2b7f7e47f440bd.2538e2565f9e439a3668b981faf959c8b490b36dd631f3c4cd992519b2dd36f1.lock
[INFO|file_utils.py:1272] 2021-01-22 14:07:29,453 >> https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpma06mlm0



[INFO|file_utils.py:1276] 2021-01-22 14:08:37,402 >> storing https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/6249eef5c8c1fcfccf9f36fc2e59301b109ac4036d8ebbee9c2b7f7e47f440bd.2538e2565f9e439a3668b981faf959c8b490b36dd631f3c4cd992519b2dd36f1
[INFO|file_utils.py:1279] 2021-01-22 14:08:37,403 >> creating metadata file for /root/.cache/huggingface/transformers/6249eef5c8c1fcfccf9f36fc2e59301b109ac4036d8ebbee9c2b7f7e47f440bd.2538e2565f9e439a3668b981faf959c8b490b36dd631f3c4cd992519b2dd36f1
01/22/2021 14:08:37 - INFO - filelock -   Lock 140134534740008 released on /root/.cache/huggingface/transformers/6249eef5c8c1fcfccf9f36fc2e59301b109ac4036d8ebbee9c2b7f7e47f440bd.2538e2565f9e439a3668b981faf959c8b490b36dd631f3c4cd992519b2dd36f1.lock
[INFO|modeling_utils.py:1027] 2021-01-22 14:08:37,412 >> loading weights file https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/6




[INFO|modeling_utils.py:1143] 2021-01-22 14:08:49,622 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.

[INFO|modeling_utils.py:1152] 2021-01-22 14:08:49,622 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2-medium.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.



[INFO|tokenization_utils_base.py:951] 2021-01-22 14:08:50,153 >> Assigning <BOS> to the bos_token key of the tokenizer
[INFO|tokenization_utils_base.py:951] 2021-01-22 14:08:50,154 >> Assigning <EOS> to the eos_token key of the tokenizer
[INFO|tokenization_utils_base.py:951] 2021-01-22 14:08:50,154 >> Assigning <PAD> to the pad_token key of the tokenizer









[INFO|trainer.py:442] 2021-01-22 14:09:06,601 >> The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:442] 2021-01-22 14:09:06,603 >> The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:791] 2021-01-22 14:09:06,615 >> ***** Running training *****
[INFO|trainer.py:792] 2021-01-22 14:09:06,616 >>   Num examples = 148
[INFO|trainer.py:793] 2021-01-22 14:09:06,617 >>   Num Epochs = 6
[INFO|trainer.py:794] 2021-01-22 14:09:06,617 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:795] 2021-01-22 14:09:06,619 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:796] 2021-01-22 14:09:06,620 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:797] 2021-01-22 14:09:06,621 >>   Total optimization steps = 888


Step,Training Loss
500,3.3134


[INFO|trainer.py:1344] 2021-01-22 14:14:09,237 >> Saving model checkpoint to /content/gdrive/My Drive/NLP_scientific-text-generation/output/checkpoint-500
[INFO|configuration_utils.py:300] 2021-01-22 14:14:09,247 >> Configuration saved in /content/gdrive/My Drive/NLP_scientific-text-generation/output/checkpoint-500/config.json
[INFO|modeling_utils.py:817] 2021-01-22 14:14:16,425 >> Model weights saved in /content/gdrive/My Drive/NLP_scientific-text-generation/output/checkpoint-500/pytorch_model.bin
[INFO|trainer.py:953] 2021-01-22 14:19:25,443 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|trainer.py:1344] 2021-01-22 14:19:25,470 >> Saving model checkpoint to /content/gdrive/My Drive/NLP_scientific-text-generation/output/
[INFO|configuration_utils.py:300] 2021-01-22 14:19:25,479 >> Configuration saved in /content/gdrive/My Drive/NLP_scientific-text-generation/output/config.json
[INFO|modeling_utils.py:817] 2021-01-22 14:19:32,713 >> Model

01/22/2021 14:19:40 - INFO - __main__ -   ***** Eval results *****
01/22/2021 14:19:40 - INFO - __main__ -     perplexity = 24.676151471832235
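
The numbers in the log can be related to each other with a few lines of arithmetic. Perplexity is the exponential of the mean cross-entropy loss on the validation set, and the step count follows from the number of examples, the batch size and the number of epochs. The sketch only uses values printed above; the function and variable names are illustrative.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Perplexity is the exponential of the mean token-level loss."""
    return math.exp(mean_cross_entropy_loss)


# Evaluation loss implied by the reported perplexity
reported_perplexity = 24.676151471832235
implied_loss = math.log(reported_perplexity)
print(round(implied_loss, 4))

# Optimization steps: one step per batch, repeated for every epoch
num_examples, num_epochs, batch_size = 148, 6, 1
total_steps = math.ceil(num_examples / batch_size) * num_epochs
print(total_steps)  # 888
```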


## Use the trained model to generate text
usage: run_generation.py [-h] --model_type MODEL_TYPE --model_name_or_path
                         MODEL_NAME_OR_PATH [--prompt PROMPT]
                         [--length LENGTH] [--stop_token STOP_TOKEN]
                         [--temperature TEMPERATURE]
                         [--repetition_penalty REPETITION_PENALTY] [--k K]
                         [--p P] [--prefix PREFIX]
                         [--padding_text PADDING_TEXT]
                         [--xlm_language XLM_LANGUAGE] [--seed SEED]
                         [--no_cuda]
                         [--num_return_sequences NUM_RETURN_SEQUENCES]
                         [--fp16]

In [18]:
#Text generation
%run "/content/gdrive/MyDrive/NLP_scientific-text-generation/transformers/examples/text-generation/run_generation.py" \
--model_type gpt2 \
--model_name_or_path "/content/gdrive/MyDrive/NLP_scientific-text-generation/output/" \
--length 100 \
--prompt "This study" \
--stop_token "<EOS>" \
--k 50 \
--num_return_sequences 5

[INFO|tokenization_utils_base.py:1685] 2021-01-22 16:10:02,359 >> Model name '/content/gdrive/MyDrive/NLP_scientific-text-generation/output/' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming '/content/gdrive/MyDrive/NLP_scientific-text-generation/output/' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1718] 2021-01-22 16:10:02,362 >> Didn't find file /content/gdrive/MyDrive/NLP_scientific-text-generation/output/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1764] 2021-01-22 16:10:02,364 >> loading file /content/gdrive/MyDrive/NLP_scientific-text-generation/output/vocab.json
[INFO|tokenization_utils_base.py:1764] 2021-01-22 16:10:02,365 >> loading file /content/gdrive/MyDrive/NLP_scientific-text-generation/output/merges.txt
[INFO|tokenization_utils_base.py:1764] 2021-01-22 16:10:02,366 >> loading file /content/gdrive/MyDrive/NLP_scientific-text-gener

=== GENERATED SEQUENCE 1 ===
This study presents the results of a multi-dimensional modeling approach which allows us to investigate the contribution of different aspects of the business environment (cultural, administrative, legal, professional etc.) on business failures. A theoretical model was constructed from the findings of the survey conducted on small and medium enterprises in Lithuania, and the study results show that cultural factors such as family and friends play a crucial role in the formation of a successful business. The paper proposes some steps for further research. 
=== GENERATED SEQUENCE 2 ===
This study examines how changes in energy use are associated with changes in residential electricity use and greenhouse gas emissions. The main result shows that a transition to a low carbon electricity system can help reduce electricity consumption, increase renewable energy generation and help achieve energy efficiency measures and GHG reductions. However, because electricity 


=== GENERATED SEQUENCE 5 === 
This study shows that there is a correlation between the distance travelled on average and the cost of accommodation.
In my opinion, this last sequence would make an ideal TL;DR.
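
Two of the options used above, `--k 50` and `--stop_token "<EOS>"`, can be illustrated in plain Python. The sketch below shows the general idea of top-k sampling (sample only among the k most likely tokens) and of cutting the output at the stop token. It is a simplified illustration on toy logits, not the code of `run_generation.py`.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token id from the k highest-scoring logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top)
    # Softmax weights over the surviving tokens (shifted for numerical stability)
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def truncate_at_stop_token(text, stop_token):
    """Drop everything from the first stop token onwards."""
    position = text.find(stop_token)
    return text if position == -1 else text[:position]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0]   # toy scores for a 5-token vocabulary
token_id = top_k_sample(logits, k=2, rng=rng)
print(token_id in (0, 3))             # only the two best tokens can be drawn
print(truncate_at_stop_token("This study shows ... <EOS> leftover", "<EOS>"))
```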

In [13]:
#Save the trained model
! tar -czf gpt2-tuned.tar.gz output/

## Things that don't work at the moment

### Summarization with the huggingface pipeline - GPT-2 is not among the supported models

In [None]:
#Abstract generation - GPT-2 is not supported by the summarization pipeline, so this code does not work
from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer
model = GPT2LMHeadModel.from_pretrained('/content/gdrive/MyDrive/NLP_scientific-text-generation/output/')
tokenizer = GPT2Tokenizer.from_pretrained('/content/gdrive/MyDrive/NLP_scientific-text-generation/output/')
summarizer = pipeline("summarization", model = model, tokenizer = tokenizer)
summarizer("This paper presents two models of consumption for the primary purpose of forecasting consumption expenditure growth in New Zealand. The models, which are consistent with a range of consumption functions including the life-cycle and permanent income hypothesis, are error correction models with the long-run equations estimated using both the conventional ordinary least squares procedure as well as the Stock and Watson procedure of leads and lags. Unlike earlier New Zealand studies, actual data on household net wealth, rather than proxies or derived series were used. This allowed the wealth variable to modelled in disaggregated form. Mortgage equity withdrawal by households and funds brought into the economy by immigrants are two novel variables included in the consumption models. Migrant transfers were found to have an influence on short-run consumption growth, but not mortgage equity withdrawal although the latter did contribute to a higher overall model fit. Net non-financial wealth was found to have short-run influence on consumption but not in the long-run.", min_length = 20, max_length = 150)
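
A possible workaround, not tried in this notebook, is the prompt trick described in the GPT-2 paper: append "TL;DR:" to the text and let the model continue. The sketch below only builds such a prompt, which could then be passed to `run_generation.py` via `--prompt`. The helper `build_tldr_prompt` and the crude word limit are assumptions made for illustration, and a model fine-tuned only on abstracts may not respond well to this prompt.

```python
def build_tldr_prompt(text, max_words=600):
    """Clip the text and append the TL;DR cue used for zero-shot summarization."""
    # Crude word-based clipping to leave room in GPT-2's 1024-token context
    words = text.split()
    clipped = " ".join(words[:max_words])
    return f"{clipped} TL;DR:"


abstract = "This paper presents two models of consumption for forecasting purposes."
prompt = build_tldr_prompt(abstract)
print(prompt)
```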

## More finetuning using PPLM (Plug and Play Language Models)
[Related article](https://arxiv.org/pdf/1912.02164.pdf). Idea: context-specific text or summaries can be improved without retraining the complete model on a very specific dataset (e.g. only climate change data) by steering the model with a list of domain-specific words.
"PPLM builds on top of other large transformer-based generative models (like GPT-2), where it enables finer-grained control of attributes of the generated language (e.g. gradually switching topic or sentiment).
This controlled language generation method consists of plugging in simple bag-of-words or one-layer classifiers as attribute controllers, and making updates in the activation space, without changing any model parameters."
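
For the bag-of-words variant, the attribute score that PPLM tries to increase is, as I read the paper, the log of the total next-token probability assigned to the words in the bag. The sketch below computes only that score on a toy distribution; the actual method then nudges the model's activations along the gradient of this score. All names and numbers are illustrative.

```python
import math


def bow_log_likelihood(next_token_probs, bag_of_words_ids):
    """Log of the probability mass that falls on the bag-of-words tokens."""
    return math.log(sum(next_token_probs[i] for i in bag_of_words_ids))


# Toy next-token distribution over a 4-token vocabulary
probs = [0.5, 0.2, 0.2, 0.1]
climate_ids = [1, 3]  # hypothetical ids of words from ClimateChangeWords.txt

score = bow_log_likelihood(probs, climate_ids)
print(score)

# Shifting probability mass towards the bag increases the score
shifted = [0.3, 0.3, 0.2, 0.2]
print(bow_log_likelihood(shifted, climate_ids) > score)
```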

In [None]:
#More finetuning using UberAI's PPLM
!git clone https://github.com/uber-research/PPLM
!pip install transformers==3.4.0  #Conflicts with the version used for model training. Best used in a separate session

In [None]:
#More finetuning - Either crashes after the first sample or produces nonsense after the first sample
%run "/content/gdrive/MyDrive/NLP_scientific-text-generation/PPLM/run_pplm.py" -B "/content/gdrive/My Drive/NLP_scientific-text-generation/ClimateChangeWords.txt"  \
--pretrained_model='/content/gdrive/MyDrive/NLP_scientific-text-generation/output/' \
--cond_text="Abstract" \
--num_samples=5 \
--length=150 \
--stepsize=0.05 \
--num_iterations=3 \
--window_length=5 \
--gamma=1.5 \
--gm_scale=0.95 \
--kl_scale=0.01 \
--colorama \
--verbosity='regular' \
--sample