# Colab Preparation & Data Preparation


1.   Mount Google Drive
2.   Import & Prepare Fulltexts
3.   Create Training, Validation, and Test Datasets






### 1. Mount Google Drive

In [1]:
# Mount Google Drive under /content/gdrive
from google.colab import drive
import os

drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
# Change the working directory to the project folder on Google Drive
working_dir = 'NLP_scientific-text-generation/'
os.chdir(os.path.join('/content/gdrive/MyDrive/', working_dir))

### 2. Import and Prepare Fulltexts

In [3]:
import re

fulltext_dir = "prep/02_fulltexts-climate-change"  # Folder containing the fulltext files
papers = []

# Beginning-of-sample (BOS) and end-of-sample (EOS) tokens
bos_token = '<BOS>'
eos_token = '<EOS>'

for f_name in os.listdir(fulltext_dir):
  path = os.path.join(fulltext_dir, f_name)
  with open(path, "r") as f:
    f_text = f.read()  # Fulltext as a single string
    # Replace every single newline (a line break introduced by scanning) with a space;
    # runs of two or more newlines (paragraph breaks) are left untouched.
    f_text = re.sub(r'\n(?!\n)(?<!\n\n)', ' ', f_text)
    f_text = bos_token + f_text + eos_token
    papers.append(f_text)
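
The effect of the regular expression used above is easiest to see on a small string. The following is a minimal sketch; the function name `join_wrapped_lines` and the sample text are made up for illustration and are not part of the notebook.

```python
import re

def join_wrapped_lines(text: str) -> str:
    """Replace isolated newlines with a space and keep blank-line paragraph breaks."""
    # \n(?!\n)    : a newline that is not followed by another newline
    # (?<!\n\n)   : ...and that is not itself the second newline of a pair
    return re.sub(r'\n(?!\n)(?<!\n\n)', ' ', text)

# Hypothetical sample: two wrapped lines, a paragraph break, two wrapped lines
sample = "Climate change\naffects forests.\n\nNew paragraph\nstarts here."
print(repr(join_wrapped_lines(sample)))
# 'Climate change affects forests.\n\nNew paragraph starts here.'
```

The two lookarounds together mean that only newlines with no neighbouring newline are replaced, so paragraph breaks survive the cleanup.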
  

In [4]:
# Create a DataFrame with one fulltext per row
import pandas as pd
df = pd.DataFrame({'sample': papers})

# Visual Check of the data
print(df)

                                                sample
0    <BOS>Introduction A new trend of research in p...
1    <BOS>U.S. – E.U. ECONOMIC RELATIONS IN THE CON...
2    <BOS>JEL Classification: F18; Q34; Q56; Q57.\n...
3    <BOS>Excellence fosters convergence in higher ...
4    <BOS>Introduction\n\n1.1\n\nThe transition to ...
..                                                 ...
143  <BOS>With varying degrees of intensity, each o...
144  <BOS> Second query\n\nTOPIC (electric OR hybri...
145  <BOS>\n\n1 Introduction Companies are more tha...
146  <BOS>Introduction\n\nThe UN’s Sustainable Deve...
147  <BOS>Introduction\n\nSustainable functioning o...

[148 rows x 1 columns]


### 3. Create Training, Validation, and Test Datasets

In [5]:
from sklearn.model_selection import train_test_split

# Split into training, validation, and test data (roughly 70% / 20% / 10%)
train_test_ratio = 0.9   # Share of all samples kept for training + validation
train_valid_ratio = 7/9  # Share of the remaining samples used for training

# Split off the test data (10%) from the full dataset
df_full_train, df_test = train_test_split(df, train_size = train_test_ratio, random_state = 1)

# Split the remaining data into training and validation datasets
df_train, df_valid = train_test_split(df_full_train, train_size = train_valid_ratio, random_state = 1)

# Visual Check of the Training Data
print(df_train)


                                                sample
22   <BOS> 1. Introduction Analyses of farm efficie...
96   <BOS> Background Developed and developing econ...
106  <BOS>clearly stands out. Marchand and Walker (...
67   <BOS>Introduction\n\nIn the modern society, to...
48   <BOS>JEL Classification: F15, F16, F41, F02\n\...
..                                                 ...
27   <BOS> Introduction Sustainability has become a...
137  <BOS> 9891\n\n1. Introduction Sustainability o...
87   <BOS>Economic Policy in the Wake of the Crisis...
23   <BOS>JEL Classification: Q42; Q48; F10; F5.\n\...
124  <BOS> Background Current issues and challenges...

[103 rows x 1 columns]
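
The 103 training rows follow from the two ratios applied to the 148 fulltexts. The sketch below reproduces the arithmetic in plain Python; it assumes that `train_test_split` rounds a fractional `train_size` down and assigns the rest to the other split, and the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, train_test_ratio=0.9, train_valid_ratio=7/9):
    """Return (n_train, n_valid, n_test) for the two-stage split used above."""
    # Stage 1: keep 90% for training + validation, the rest is test data
    n_full_train = math.floor(n_samples * train_test_ratio)
    n_test = n_samples - n_full_train
    # Stage 2: 7/9 of the remainder is training data, the rest is validation data
    n_train = math.floor(n_full_train * train_valid_ratio)
    n_valid = n_full_train - n_train
    return n_train, n_valid, n_test

print(split_sizes(148))  # (103, 30, 15)
```

Since 0.9 × 7/9 = 0.7, the overall split is roughly 70% training, 20% validation, and 10% test data.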


In [6]:
# Save Datasets to Individual Files
df_train.to_csv("train.txt", index=False, header=False, encoding='utf-8')
df_valid.to_csv("valid.txt", index=False, header=False, encoding='utf-8')
df_test.to_csv("test.txt", index=False, header=False, encoding='utf-8')

## Get the Hugging Face transformers library

In [7]:
!pip install datasets
!pip install transformers
# Clone the GitHub repository to get the required Python scripts
!git clone https://github.com/huggingface/transformers.git

fatal: destination path 'transformers' already exists and is not an empty directory.


In [8]:
!nvidia-smi

Tue Jan 26 12:33:40 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8     8W /  75W |      0MiB /  7611MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Fine-Tuning of the GPT-2 Model
Different model sizes can be used by changing the `--model_name_or_path gpt2` argument.
GPT-2 models from Hugging Face can be found [here](https://huggingface.co/transformers/pretrained_models.html).

GPT-2 models (OpenAI):
*   gpt2 (12-layer, 768-hidden, 12-heads, 117M parameters)
*   gpt2-medium (24-layer, 1024-hidden, 16-heads, 345M parameters)
*   gpt2-large (36-layer, 1280-hidden, 20-heads, 774M parameters)
*   gpt2-xl (48-layer, 1600-hidden, 25-heads, 1558M parameters)

DistilGPT-2 (Hugging Face):
*   distilgpt2 (6-layer, 768-hidden, 12-heads, 82M parameters), distilled from the GPT-2 (gpt2) model checkpoint.

The large and xl versions might not run on Colab because of memory limitations, even with a batch size of one.
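
A rough back-of-the-envelope estimate illustrates why the larger models are a problem. The sketch below assumes plain fp32 training with an Adam-style optimizer (about 16 bytes per parameter for weights, gradients, and two optimizer moment buffers) and ignores activations entirely, so real usage is higher. The parameter counts are the ones listed above, and the GPU size is the 7611 MiB reported by `nvidia-smi` for the Tesla P4.

```python
# Rule-of-thumb assumption: fp32 weights (4 B) + gradients (4 B) + two Adam moments (8 B)
BYTES_PER_PARAM = 4 + 4 + 8

# Parameter counts as listed in this notebook
MODEL_PARAMS = {
    "distilgpt2": 82e6,
    "gpt2": 117e6,
    "gpt2-medium": 345e6,
    "gpt2-large": 774e6,
    "gpt2-xl": 1558e6,
}

GPU_MEMORY_BYTES = 7611 * 1024**2  # Tesla P4 as reported by nvidia-smi

def training_state_bytes(n_params):
    """Memory for weights, gradients, and optimizer state (activations excluded)."""
    return n_params * BYTES_PER_PARAM

for name, n_params in MODEL_PARAMS.items():
    need = training_state_bytes(n_params)
    fits = need < GPU_MEMORY_BYTES
    print(f"{name:12s} {need / 1024**3:5.1f} GiB  fits before activations: {fits}")
```

Under these assumptions, gpt2-large and gpt2-xl exceed the available GPU memory before a single activation is stored, which is independent of the batch size.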





In [9]:
#Trains the model. Be careful with the batch_size when using larger models or datasets.
%run "prep/transformers/examples/language-modeling/run_clm.py" \
    --model_name_or_path gpt2 \
    --train_file "train.txt" \
    --validation_file "valid.txt" \
    --do_train \
    --do_eval \
    --output_dir "output/" \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --learning_rate 5e-5 \
    --num_train_epochs=6 \
    --overwrite_output_dir

01/26/2021 12:34:19 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=output/, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=6.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Jan26_12-34-19_c494915f7425, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=output/, disable_tqdm=False, remove_unused_colu

Downloading and preparing dataset text/default-5d1e496d89f8f224 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/text/default-5d1e496d89f8f224/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab...



Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-5d1e496d89f8f224/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab. Subsequent calls will reuse this data.


[INFO|configuration_utils.py:445] 2021-01-26 12:34:20,832 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:481] 2021-01-26 12:34:20,834 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_





[INFO|trainer.py:442] 2021-01-26 12:34:44,961 >> The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:442] 2021-01-26 12:34:44,962 >> The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:791] 2021-01-26 12:34:44,972 >> ***** Running training *****
[INFO|trainer.py:792] 2021-01-26 12:34:44,972 >>   Num examples = 984
[INFO|trainer.py:793] 2021-01-26 12:34:44,973 >>   Num Epochs = 6
[INFO|trainer.py:794] 2021-01-26 12:34:44,974 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:795] 2021-01-26 12:34:44,974 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:796] 2021-01-26 12:34:44,975 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:797] 2021-01-26 12:34:44,977 >>   Total optimization steps = 5904
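
The step count in the log follows directly from the reported values: 984 examples, 6 epochs, and an effective batch size of 1. The helper below is a simplified sketch of that arithmetic, not the Trainer's actual code, and its name is hypothetical. Note that the 984 examples are presumably the fixed-length token blocks that `run_clm.py` builds from the text, not the 103 training papers.

```python
import math

def total_optimization_steps(num_examples, num_epochs,
                             per_device_batch_size=1, num_devices=1,
                             grad_accum_steps=1):
    """Simplified estimate of the number of optimizer updates for a training run."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

print(total_optimization_steps(num_examples=984, num_epochs=6))  # 5904
```

With checkpoints saved every 500 steps, this run writes a checkpoint at steps 500, 1000, and so on up to 5500.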


| Step | Training Loss |
|------|---------------|
| 500  | 3.4714 |
| 1000 | 3.3273 |
| 1500 | 3.056  |
| 2000 | 3.0397 |
| 2500 | 2.8619 |
| 3000 | 2.7866 |
| 3500 | 2.6919 |
| 4000 | 2.6758 |
| 4500 | 2.5669 |
| 5000 | 2.5897 |


[INFO|trainer.py:1344] 2021-01-26 12:39:25,322 >> Saving model checkpoint to output/checkpoint-500
[INFO|configuration_utils.py:300] 2021-01-26 12:39:25,641 >> Configuration saved in output/checkpoint-500/config.json
[INFO|modeling_utils.py:817] 2021-01-26 12:39:27,982 >> Model weights saved in output/checkpoint-500/pytorch_model.bin
[INFO|trainer.py:1344] 2021-01-26 12:44:15,642 >> Saving model checkpoint to output/checkpoint-1000
[INFO|configuration_utils.py:300] 2021-01-26 12:44:15,889 >> Configuration saved in output/checkpoint-1000/config.json
[INFO|modeling_utils.py:817] 2021-01-26 12:44:18,356 >> Model weights saved in output/checkpoint-1000/pytorch_model.bin
[INFO|trainer.py:1344] 2021-01-26 12:49:06,474 >> Saving model checkpoint to output/checkpoint-1500
[INFO|configuration_utils.py:300] 2021-01-26 12:49:06,664 >> Configuration saved in output/checkpoint-1500/config.json
[INFO|modeling_utils.py:817] 2021-01-26 12:49:09,391 >> Model weights saved in output/checkpoint-1500/pyto

01/26/2021 13:32:46 - INFO - __main__ -   ***** Eval results *****
01/26/2021 13:32:46 - INFO - __main__ -     perplexity = 32.8289058882148
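
The reported perplexity is the exponential of the mean cross-entropy loss on the validation data, so the two numbers can be converted into each other. A minimal sketch (the function names are made up for illustration):

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is exp(mean negative log-likelihood per token)."""
    return math.exp(mean_cross_entropy)

def loss_from_perplexity(perplexity):
    """Inverse conversion: recover the mean loss from a perplexity value."""
    return math.log(perplexity)

reported_perplexity = 32.8289058882148  # value from the eval log above
eval_loss = loss_from_perplexity(reported_perplexity)
print(f"implied evaluation loss: {eval_loss:.4f}")
```

This makes the evaluation result comparable with the training-loss values logged during fine-tuning, which are on the same (natural-log) scale.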


# Generation of TL;DRs
1. Definition of the Test Abstracts
2. Definition of the Prompts for One-Shot and Few-Shot Learning
3. Generation of the Output

### 1. Definition of the Test Abstracts

In [10]:
test_abstract_1 = "Greenhouse gas emissions have significantly altered global climate, and will continue to do so in the future. Increases in the frequency, duration, and/or severity of drought and heat stress associated with climate change could fundamentally alter the composition, structure, and biogeography of forests in many regions. Of particular concern are potential increases in tree mortality associated with climate-induced physiological stress and interactions with other climate-mediated processes such as insect outbreaks and wildfire. Despite this risk, existing projections of tree mortality are based on models that lack functionally realistic mortality mechanisms, and there has been no attempt to track observations of climate-driven tree mortality globally. Here we present the first global assessment of recent tree mortality attributed to drought and heat stress. Although episodic mortality occurs in the absence of climate change, studies compiled here suggest that at least some of the world's forested ecosystems already may be responding to climate change and raise concern that forests may become increasingly vulnerable to higher background tree mortality rates and die-off in response to future warming and drought, even in environments that are not normally considered water-limited. This further suggests risks to ecosystem services, including the loss of sequestered forest carbon and associated atmospheric feedbacks. Our review also identifies key information gaps and scientific uncertainties that currently hinder our ability to predict tree mortality in response to climate change and emphasizes the need for a globally coordinated observation system. Overall, our review reveals the potential for amplified tree mortality due to drought and heat in forests worldwide."
test_abstract_2 = "The world's forests influence climate through physical, chemical, and biological processes that affect planetary energetics, the hydrologic cycle, and atmospheric composition. These complex and nonlinear forest-atmosphere interactions can dampen or amplify anthropogenic climate change. Tropical, temperate, and boreal reforestation and afforestation attenuate global warming through carbon sequestration. Biogeophysical feedbacks can enhance or diminish this negative climate forcing. Tropical forests mitigate warming through evaporative cooling, but the low albedo of boreal forests is a positive climate forcing. The evaporative effect of temperate forests is unclear. The net climate forcing from these and other processes is not known. Forests are under tremendous pressure from global change. Interdisciplinary science that integrates knowledge of the many interacting climate services of forests with the impacts of global change is necessary to identify and understand as yet unexplored feedbacks in the Earth system and the potential of forests to mitigate climate change."
test_abstract_3 = "The paper summarizes the current knowledge about the impact of livestock sector on climate change. The main sources of greenhouse gas (GHG) emissions from livestock are described and the contribution of livestock sector to the global GHG emissions is presented on the basis of the latest results obtained from the scientific research. The most recent mitigation strategies for reducing greenhouse gas emissions from livestock sector are also discussed. The paper aims to provide a general overview of an emergent environmental issue such as the impact of livestock sector on climate change. While the paper is easy to understand for non-expert readers, it may also be a relevant reference point for academic researchers and for policy makers aimed at achieving the sustainability of livestock/food sector."
test_abstract_4 = "Feeding a growing global population in a changing climate presents a significant challenge to society. The projected yields of crops under a range of agricultural and climatic scenarios are needed to assess food security prospects. Previous meta-analyses have summarized climate change impacts and adaptive potential as a function of temperature, but have not examined uncertainty, the timing of impacts, or the quantitative effectiveness of adaptation. Here we develop a new data set of more than 1,700 published simulations to evaluate yield impacts of climate change and adaptation. Without adaptation, losses in aggregate production are expected for wheat, rice and maize in both temperate and tropical regions by 2 °C of local warming. Crop-level adaptations increase simulated yields by an average of 7–15%, with adaptations more effective for wheat and rice than maize. Yield losses are greater in magnitude for the second half of the century than for the first. Consensus on yield decreases in the second half of the century is stronger in tropical than temperate regions, yet even moderate warming may reduce temperate crop yields in many locations. Although less is known about interannual variability than mean yields, the available data indicate that increases in yield variability are likely."
test_abstract_5 = "The effects of climate change on biodiversity are increasingly well documented, and many methods have been developed to assess species' vulnerability to climatic changes, both ongoing and projected in the coming decades. To minimize global biodiversity losses, conservationists need to identify those species that are likely to be most vulnerable to the impacts of climate change. In this Review, we summarize different currencies used for assessing species' climate change vulnerability. We describe three main approaches used to derive these currencies (correlative, mechanistic and trait-based), and their associated data requirements, spatial and temporal scales of application and modelling methods. We identify strengths and weaknesses of the approaches and highlight the sources of uncertainty inherent in each method that limit projection reliability. Finally, we provide guidance for conservation practitioners in selecting the most appropriate approach(es) for their planning needs and highlight priority areas for further assessments."

### 2. Definition of the Prompts

In [11]:
abstract_1 = "Causal attribution of recent biological trends to climate change is complicated because non-climatic influences dominate local, short-term biological changes. Any underlying signal from climate change is likely to be revealed by analyses that seek systematic trends across diverse species and geographic regions; however, debates within the Intergovernmental Panel on Climate Change (IPCC) reveal several definitions of a ‘systematic trend’. Here, we explore these differences, apply diverse analyses to more than 1,700 species, and show that recent biological trends match climate change predictions. Global meta-analyses documented significant range shifts averaging 6.1 km per decade towards the poles (or metres per decade upward), and significant mean advancement of spring events by 2.3 days per decade. We define a diagnostic fingerprint of temporal and spatial ‘sign-switching’ responses uniquely predicted by twentieth century climate trends. Among appropriate long-term/large-scale/multi-species data sets, this diagnostic fingerprint was found for 279 species. This suite of analyses generates ‘very high confidence’ (as laid down by the IPCC) that climate change is already affecting living systems."
tldr_1 = "Climate change predictions are confirmed by the collective change in distribution of species, and the change in timing of biological events."

abstract_2 = "Significantly more carbon is stored in the world's soils—including peatlands, wetlands and permafrost—than is present in the atmosphere. Disagreement exists, however, regarding the effects of climate change on global soil carbon stocks. If carbon stored belowground is transferred to the atmosphere by a warming-induced acceleration of its decomposition, a positive feedback to climate change would occur. Conversely, if increases of plant-derived carbon inputs to soils exceed increases in decomposition, the feedback would be negative. Despite much research, a consensus has not yet emerged on the temperature sensitivity of soil carbon decomposition. Unravelling the feedback effect is particularly difficult, because the diverse soil organic compounds exhibit a wide range of kinetic properties, which determine the intrinsic temperature sensitivity of their decomposition. Moreover, several environmental constraints obscure the intrinsic temperature sensitivity of substrate decomposition, causing lower observed ‘apparent’ temperature sensitivity, and these constraints may, themselves, be sensitive to climate."
tldr_2 = "Soil carbon decomposition may be sensitive to climate, but the amount of decomposition is constrained by other factors."

abstract_3 = "Climate change over the past ∼30 years has produced numerous shifts in the distributions and abundances of species and has been implicated in one species-level extinction. Using projections of species' distributions for future climate scenarios, we assess extinction risks for sample regions that cover some 20% of the Earth's terrestrial surface. Exploring three approaches in which the estimated probability of extinction shows a power-law relationship with geographical range size, we predict, on the basis of mid-range climate-warming scenarios for 2050, that 15–37% of species in our sample of regions and taxa will be ‘committed to extinction’. When the average of the three methods and two dispersal scenarios is taken, minimal climate-warming scenarios produce lower projections of species committed to extinction (∼18%) than mid-range (∼24%) and maximum-change (∼35%) scenarios. These estimates show the importance of rapid implementation of technologies to decrease greenhouse gas emissions and strategies for carbon sequestration."
tldr_3 = "Using predictions of future climate, the authors predict that if greenhouse gases continue to increase, 15-37% of species will be committed to extinction."


# One-Shot Prompt
os_prompt = "Abstract:\n" + abstract_1 + "\nTL;DR:\n" + tldr_1 + "\n\nAbstract:\n"

# Few-Shot Prompt (3-Shot)
fs_prompt = "Abstract:\n" + abstract_1 + "\nTL;DR:\n" + tldr_1 + "\n\nAbstract:\n" + abstract_2 + "\nTL;DR:\n" + tldr_2 + "\n\nAbstract:\n" + abstract_3 + "\nTL;DR:\n" + tldr_3 + "\n\nAbstract:\n"
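
The same prompt layout can be generated for any number of examples. The helper below is a sketch with hypothetical names (`build_prompt`, `complete_prompt`) and placeholder strings; it produces the same `Abstract:` / `TL;DR:` pattern as the hand-written concatenations above.

```python
def build_prompt(examples):
    """Build a k-shot prefix from (abstract, tldr) pairs, ending with an open 'Abstract:' slot."""
    shots = "".join(f"Abstract:\n{abstract}\nTL;DR:\n{tldr}\n\n"
                    for abstract, tldr in examples)
    return shots + "Abstract:\n"

def complete_prompt(prefix, test_abstract):
    """Append the abstract to summarize and the final 'TL;DR:' cue."""
    return prefix + test_abstract + "\nTL;DR:\n"

# Placeholder texts for illustration only
examples = [("First abstract.", "First summary."),
            ("Second abstract.", "Second summary.")]

one_shot = build_prompt(examples[:1])
two_shot = build_prompt(examples)
print(complete_prompt(one_shot, "Abstract to summarize."))
```

In the notebook, `build_prompt([(abstract_1, tldr_1)])` would yield the same string as `os_prompt`, and passing all three pairs would yield `fs_prompt`.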


### 3. Generation of the TL;DR

For each test abstract, five TL;DRs are generated using a one-shot approach and five using a few-shot approach.

`run_generation.py` accepts the following arguments and flags:
```
run_generation.py [-h]
                  --model_type MODEL_TYPE
                  --model_name_or_path MODEL_NAME_OR_PATH
                  [--prompt PROMPT]
                  [--length LENGTH]
                  [--stop_token STOP_TOKEN]
                  [--temperature TEMPERATURE]
                  [--repetition_penalty REPETITION_PENALTY]
                  [--k K]
                  [--p P]
                  [--prefix PREFIX]
                  [--padding_text PADDING_TEXT]
                  [--xlm_language XLM_LANGUAGE]
                  [--seed SEED]
                  [--no_cuda]
                  [--num_return_sequences NUM_RETURN_SEQUENCES]
                  [--fp16]
```



In [15]:
# Test Abstract 1 with One-Shot-TLDR-Generation
prompt = os_prompt + test_abstract_1 + "\nTL;DR:\n"

%run "transformers/examples/text-generation/run_generation.py" \
--model_type gpt2 \
--model_name_or_path "output/" \
--prompt "$prompt" \
--length 100 \
--stop_token "\." \
--temperature .7 \
--k 50 \
--num_return_sequences 5

[INFO|tokenization_utils_base.py:1685] 2021-01-26 13:38:20,036 >> Model name 'output/' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming 'output/' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1718] 2021-01-26 13:38:20,038 >> Didn't find file output/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-26 13:38:20,040 >> Didn't find file output/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1764] 2021-01-26 13:38:20,041 >> loading file output/vocab.json
[INFO|tokenization_utils_base.py:1764] 2021-01-26 13:38:20,042 >> loading file output/merges.txt
[INFO|tokenization_utils_base.py:1764] 2021-01-26 13:38:20,043 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-26 13:38:20,043 >> loading file output/special_tokens_map.json
[INFO|tokenization_utils_base.py:1764] 2021-01-26 13:38:20,044 >> loading file out

=== GENERATED SEQUENCE 1 ===
Abstract:
Causal attribution of recent biological trends to climate change is complicated because non-climatic influences dominate local, short-term biological changes. Any underlying signal from climate change is likely to be revealed by analyses that seek systematic trends across diverse species and geographic regions; however, debates within the Intergovernmental Panel on Climate Change (IPCC) reveal several definitions of a ‘systematic trend’. Here, we explore these differences, apply diverse analyses to more than 1,700 species, and show that recent biological trends match climate change predictions. Global meta-analyses documented significant range shifts averaging 6.1 km per decade towards the poles (or metres per decade upward), and significant mean advancement of spring events by 2.3 days per decade. We define a diagnostic fingerprint of temporal and spatial ‘sign-switching’ responses uniquely predicted by twentieth century climate trends. Among app
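
Each generated sequence echoes the full prompt before the newly generated text, so the TL;DR has to be separated from it afterwards. The following is a minimal post-processing sketch; `extract_tldr` is a hypothetical helper and the demo strings are placeholders, not actual model output.

```python
def extract_tldr(generated: str, prompt: str, stop: str = ".") -> str:
    """Strip the echoed prompt and keep the completion up to the first stop string."""
    if generated.startswith(prompt):
        completion = generated[len(prompt):]
    else:
        # Fallback: keep whatever follows the last "TL;DR:" marker
        completion = generated.rpartition("TL;DR:")[2]
    completion = completion.strip()
    cut = completion.find(stop)
    # Keep the stop string itself so the sentence stays complete
    return completion if cut == -1 else completion[: cut + len(stop)]

# Placeholder prompt and output for illustration only
demo_prompt = "Abstract:\nForests store carbon.\nTL;DR:\n"
demo_output = demo_prompt + "Forests are a carbon sink. They also cool the air."
print(extract_tldr(demo_output, demo_prompt))  # Forests are a carbon sink.
```

Cutting at the first period keeps a single sentence; a summary that contains an abbreviation or a decimal number would be truncated early, so the stop string is a design choice worth revisiting.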

In [None]:
# Save the trained model as a compressed archive
!tar -czf gpt2-tuned.tar.gz output/