#Using GPT-2 to Generate TL;DRs for the EconStor Database
Notebook to generate TL;DRs of scientific papers using GPT-2.

Data: Abstracts and full texts of over 7000 papers from EconStor. This notebook uses a subset of 193 climate change related papers.

Sources for code and notebooks:
  [Huggingface Transformers library](https://github.com/huggingface/transformers).



##Colab preparation & Data preparation


1.   Mount Google Drive
2.   Download the data
3.   Extract the abstracts
4.   Create training and validation datasets






In [None]:
#1. Mount gDrive
from google.colab import drive
import os

drive.mount('/content/gdrive')  # Mounting GoogleDrive to the content folder

project_dir = 'NLP_scientific-text-generation'
if not os.path.exists('/content/gdrive/MyDrive/'+project_dir):  # Create a project folder if it does not exist yet
    os.makedirs('/content/gdrive/MyDrive/'+project_dir)
os.chdir('/content/gdrive/MyDrive/'+project_dir)  # Changing the working directory to the project folder on GoogleDrive

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
#2. Get the data
import urllib.request
import tarfile

def getData (dataset):
    # Connect download stream as tar file object
    url = "https://www.econstor.eu/ki-hackathon/" + dataset
    ftpstream = urllib.request.urlopen(url)
    tf = tarfile.open(fileobj=ftpstream, mode="r|gz")
    # Extract files (files will be extracted into a subfolder included in the zip file)
    tf.extractall()


In [None]:
getData('econstor-cc-by-4.0-json.tgz')

In [None]:
getData('econstor-cc-by-4.0-txt.tgz')

In [None]:
#3. Data preparation
import pandas as pd
import numpy as np
import json
import os

###code for getting the handles, the journal name and the metadata
#setting your personal path
path = '/content/gdrive/MyDrive/NLP_scientific-text-generation/json'

#initiating the lists for handles, journal name and metadata

os.chdir(path)

#metadata = []
abstracts = []
#keywords = []
handles =[]
journal_name = []
##setting up the list of keywords to filter out the abstracts related to climate change

#193 abstracts with this keyword list
#Each entry needs a trailing comma: adjacent string literals without one are silently concatenated by Python
keyword_list_climate = ['Sustainability', 'sustainability','sustainable development','Sustainable development','Sustainable Development',
                'globalization', 'Globalization',
                'environment', 'Environment',
                'climate change', 'Climate change', 'Climate Change',
                'energy','Energy', 'sustainable development goals' ]

#here we filter only the abstracts with keywords related to climate change
for item in keyword_list_climate:
    for i in os.listdir():

        with open(i, "r", encoding='utf-8', errors='ignore') as f:
            data = json.load(f)

            handles.append(data['handle'])
            journal_name.append(data['parentCollection']['name'])

            for obj in data["metadata"]:
                if obj["key"] == "dc.subject.keyword":
                    if (obj['value'] == ''+ item):
                        for ob in data['metadata']:
                            if ob['key'] == 'dc.description.abstract':
                                abstracts.append(ob["value"])

#put the list into a dataframe to print the abstracts with an index to count the number of abstracts
df = pd.DataFrame(abstracts)
df.columns = ['text'] #The column name should be 'text'

print(df)

                                                  text
0    In this paper the contribution of technology m...
1    Crop yield is influenced over time and space, ...
2    Almost daily, news indicates that there are en...
3    The paper examines the corporate social respon...
4    Environmental assessment and pollution protect...
..                                                 ...
188  More and better quality private sector investm...
189  Ensuring 'health for all' remains a persistent...
190  Ensuring "health for all" remains a persistent...
191  With the Sustainable Development Goals (SDGs),...
192  The United Nation's Agenda 2030 and Sustainabl...

[193 rows x 1 columns]
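
The nested loop above re-reads every JSON file once per keyword and matches keywords case-sensitively. As an illustration only, the same selection can be sketched as a single pass with case-insensitive matching and duplicate removal. The function name `filter_abstracts` and the sample records are hypothetical; only the `metadata` layout (`key`/`value` pairs) is taken from the cell above. Because it removes duplicates and ignores case, this sketch would not necessarily reproduce the 193 rows shown.

```python
def filter_abstracts(records, keywords):
    """Return abstracts of records with a matching subject keyword, without duplicates."""
    wanted = {k.lower() for k in keywords}
    selected = []
    seen = set()
    for record in records:
        metadata = record.get("metadata", [])
        # Collect all subject keywords of this record, normalised to lower case
        subject_keywords = {
            m["value"].strip().lower()
            for m in metadata
            if m.get("key") == "dc.subject.keyword"
        }
        if not (subject_keywords & wanted):
            continue
        for m in metadata:
            if m.get("key") == "dc.description.abstract" and m["value"] not in seen:
                seen.add(m["value"])
                selected.append(m["value"])
    return selected


# Hypothetical records, shaped like the JSON files read above
sample_records = [
    {"handle": "h/1", "metadata": [
        {"key": "dc.subject.keyword", "value": "Climate Change"},
        {"key": "dc.subject.keyword", "value": "energy"},
        {"key": "dc.description.abstract", "value": "Abstract A"},
    ]},
    {"handle": "h/2", "metadata": [
        {"key": "dc.subject.keyword", "value": "labour market"},
        {"key": "dc.description.abstract", "value": "Abstract B"},
    ]},
]

print(filter_abstracts(sample_records, ["climate change", "energy"]))
```

With real data, `records` would be the list of dictionaries returned by `json.load` for each file.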


In [None]:
print(df)

In [None]:
df = pd.DataFrame(abstracts)
df.columns = ['text']
df.to_csv('/content/test.csv')

In [None]:
#4. Create training, validation and test datasets
#Train-, Validation-, & Testdata

from sklearn.model_selection import train_test_split
import re

train_valid_ratio = 0.8  #Proportion of training data
data_train, data_valid = train_test_split(df, train_size = train_valid_ratio, random_state = 1)

#Uncomment these lines to withhold a portion of the data for testing (70% training, 20% validation, 10% test)
#train_test_ratio = 0.9  #Proportion of the data kept for training and validation; the remaining 10% is the test set
#train_valid_ratio = 7/9 #Proportion of the kept data used for training; the rest is the validation set
#data_full_train, data_test = train_test_split(df, train_size = train_test_ratio, random_state = 1)
#data_train, data_valid = train_test_split(data_full_train, train_size = train_valid_ratio, random_state = 1)
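
The optional three-way split chains two ratios, so the final shares are products of the two values rather than the values themselves. A quick check of the arithmetic (the exact row counts also depend on how `train_test_split` rounds, which this sketch does not model):

```python
# Shares produced by the optional three-way split
train_test_ratio = 0.9      # share kept for training + validation
train_valid_ratio = 7 / 9   # share of the kept data used for training

train_share = train_test_ratio * train_valid_ratio
valid_share = train_test_ratio * (1 - train_valid_ratio)
test_share = 1 - train_test_ratio

print(f"train={train_share:.2f}, valid={valid_share:.2f}, test={test_share:.2f}")
```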

In [None]:
#Create dataset
#A BOS token is added before each abstract and a EOS token is added at the end of each abstract
#as suggested here: 
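import re

def build_dataset(df, dest_path):
    #Use the 'text' column if present, otherwise fall back to 'abstract' (the column name used in the appendix)
    column = 'text' if 'text' in df.columns else 'abstract'
    bos_token = '<BOS>'
    eos_token = '<EOS>'
    data = ''
    for abstract in df[column].tolist():
        abstract = str(abstract).strip()
        abstract = re.sub(r"\s", " ", abstract)  #Replace line breaks and tabs with spaces so each abstract stays on one line
        data += bos_token + ' ' + abstract + ' ' + eos_token + '\n'

    #The with-block closes the file, so the data is flushed to disk
    with open(dest_path, 'w', encoding='utf-8') as f:
        f.write(data)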

In [None]:
build_dataset(data_train, '/content/gdrive/My Drive/NLP_scientific-text-generation/train.txt')
build_dataset(data_valid, '/content/gdrive/My Drive/NLP_scientific-text-generation/valid.txt')
#build_dataset(data_test, '/content/gdrive/My Drive/NLP_scientific-text-generation/test.txt') #Uncomment if a test dataset was defined in the previous step.

In [None]:
data_train.to_csv('/content/gdrive/My Drive/NLP_scientific-text-generation/train.csv')
data_valid.to_csv('/content/gdrive/My Drive/NLP_scientific-text-generation/valid.csv')

##Get the huggingface transformers library

In [None]:
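!pip install datasets
#!pip install transformers
os.chdir('/content/gdrive/My Drive/NLP_scientific-text-generation/')
!pip install git+https://github.com/huggingface/transformers  #Install the Huggingface Transformers library from its GitHub source
#pip does not place the example scripts in the project folder. The cells below expect them in a local "transformers" folder; clone the repository if it is missing:
#!git clone https://github.com/huggingface/transformers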

In [None]:
!nvidia-smi

## Train the model
Different model sizes can be used by changing the value passed to "--model_name_or_path" (here "gpt2").

GPT-2 models from Hugging Face can be found [here](https://huggingface.co/transformers/pretrained_models.html).

Pretrained "standard" GPT-2 models:
*   gpt2 -> 12-layer, 768-hidden, 12-heads, 117M parameters
*   gpt2-medium -> 24-layer, 1024-hidden, 16-heads, 345M parameters
*   gpt2-large -> 36-layer, 1280-hidden, 20-heads, 774M parameters
*   gpt2-xl -> 48-layer, 1600-hidden, 25-heads, 1558M parameters

DistilGPT2 by Hugging Face:
*   distilgpt2 -> 6-layer, 768-hidden, 12-heads, 82M parameters (distilled from the gpt2 model)

The large and xl versions might not run on Colab because of RAM limits, even with a batch size of one.
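
To get a feeling for why the larger models run into memory limits, here is a rough back-of-the-envelope sketch. It assumes full-precision training with Adam, i.e. about 16 bytes per parameter (4 bytes each for the weights, the gradients and the two optimizer moment buffers). Activations and framework overhead are ignored, so real usage is higher. The 16-byte figure is an assumption for illustration, not a measurement from this notebook.

```python
# Parameter counts as listed above (in millions)
param_millions = {
    "distilgpt2": 82,
    "gpt2": 117,
    "gpt2-medium": 345,
    "gpt2-large": 774,
    "gpt2-xl": 1558,
}

# Assumption: fp32 weights + gradients + two Adam moment buffers, 4 bytes each
BYTES_PER_PARAM = 16


def training_memory_gb(millions):
    """Rough lower bound on training memory in GB, excluding activations."""
    return millions * 1e6 * BYTES_PER_PARAM / 1e9


for name, millions in param_millions.items():
    print(f"{name:12s} ~{training_memory_gb(millions):5.1f} GB")
```

Under these assumptions gpt2-large already needs more than 12 GB before any activations are stored, and gpt2-xl roughly twice that.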

In [None]:
#Trains the model. Be careful with the batch_size when using larger models or datasets.
%run "/content/gdrive/My Drive/NLP_scientific-text-generation/transformers/examples/language-modeling/run_clm.py" \
    --model_name_or_path gpt2 \
    --train_file "/content/gdrive/My Drive/NLP_scientific-text-generation/train.csv" \
    --validation_file "/content/gdrive/My Drive/NLP_scientific-text-generation/valid.csv" \
    --do_train \
    --do_eval \
    --output_dir "/content/gdrive/My Drive/NLP_scientific-text-generation/output/" \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --learning_rate 2e-5 \
    --num_train_epochs=15 \
    --overwrite_output_dir \
    --logging_steps 50 \
    --save_steps 1000

[INFO|training_args.py:495] 2021-01-26 22:11:26,512 >> PyTorch: setting up devices
01/26/2021 22:11:26 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=/content/gdrive/My Drive/NLP_scientific-text-generation/output/, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=15.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Jan26_22-11-26_65398e238d13, logging_first_step=False, logging_steps=50, save_steps=1000, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, d

Step,Training Loss
50,3.6092
100,3.4021
150,3.2848
200,3.14
250,3.0773
300,2.992
350,2.9767
400,2.9279
450,2.9328


[INFO|trainer.py:964] 2021-01-26 22:14:45,154 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|trainer.py:1358] 2021-01-26 22:14:45,176 >> Saving model checkpoint to /content/gdrive/My Drive/NLP_scientific-text-generation/output/
[INFO|configuration_utils.py:300] 2021-01-26 22:14:45,183 >> Configuration saved in /content/gdrive/My Drive/NLP_scientific-text-generation/output/config.json
[INFO|modeling_utils.py:817] 2021-01-26 22:14:47,486 >> Model weights saved in /content/gdrive/My Drive/NLP_scientific-text-generation/output/pytorch_model.bin
01/26/2021 22:14:47 - INFO - __main__ -   ***** Train results *****
01/26/2021 22:14:47 - INFO - __main__ -     epoch = 15.0
01/26/2021 22:14:47 - INFO - __main__ -     total_flos = 355519553863680
01/26/2021 22:14:47 - INFO - __main__ -     train_runtime = 193.5011
01/26/2021 22:14:47 - INFO - __main__ -     train_samples_per_second = 2.403
01/26/2021 22:14:47 - INFO - __main__ -   *** Evaluate ***
[I

01/26/2021 22:14:48 - INFO - __main__ -   ***** Eval results *****
01/26/2021 22:14:48 - INFO - __main__ -     perplexity = 32.116767088050096
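
The perplexity reported here is, to our understanding of `run_clm.py`, the exponential of the mean cross-entropy loss on the validation set. Taking the logarithm therefore recovers the evaluation loss, which can be compared with the training losses logged above (the last logged value is 2.9328). A minimal sketch:

```python
import math

reported_perplexity = 32.116767088050096   # value from the eval log above
eval_loss = math.log(reported_perplexity)  # mean cross-entropy per token, in nats

print(f"implied eval loss: {eval_loss:.4f}")
```

An evaluation loss clearly above the final training loss is what one would expect after 15 epochs on fewer than 200 abstracts.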


##Use the trained model to generate TL;DRs
The trained model can now be used to generate TL;DRs of abstracts.

Here we use one-shot and few-shot prompting to generate TL;DRs for five test abstracts. The example TL;DRs used in the one-shot and few-shot prompts can be written by hand or generated with zero-shot prompting (i.e. using only an example abstract followed by "TL;DR:" as the prompt).
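
The prompt layout used in the next cell ("Abstract:", the text, "TL;DR:", the summary, a blank line) can be generated for any number of examples with a small helper. This is only a sketch: `build_prompt` is a hypothetical name, and the strings below are placeholders rather than real abstracts.

```python
def build_prompt(examples, target_abstract):
    """Build an 'Abstract:/TL;DR:' prompt from (abstract, tldr) example pairs."""
    parts = []
    for abstract, tldr in examples:
        parts.append("Abstract:\n" + abstract + "\nTL;DR:\n" + tldr + "\n\n")
    # The prompt ends with an open TL;DR slot for the model to complete
    parts.append("Abstract:\n" + target_abstract + "\nTL;DR:\n")
    return "".join(parts)


# Placeholder strings for illustration
zero_shot = build_prompt([], "A new abstract.")
one_shot = build_prompt([("An example abstract.", "Its summary.")], "A new abstract.")
print(one_shot)
```

Passing an empty list gives the zero-shot prompt, one pair the one-shot prompt, and two or more pairs a few-shot prompt.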



In [None]:
test_abstract_1 = "Greenhouse gas emissions have significantly altered global climate, and will continue to do so in the future. Increases in the frequency, duration, and/or severity of drought and heat stress associated with climate change could fundamentally alter the composition, structure, and biogeography of forests in many regions. Of particular concern are potential increases in tree mortality associated with climate-induced physiological stress and interactions with other climate-mediated processes such as insect outbreaks and wildfire. Despite this risk, existing projections of tree mortality are based on models that lack functionally realistic mortality mechanisms, and there has been no attempt to track observations of climate-driven tree mortality globally. Here we present the first global assessment of recent tree mortality attributed to drought and heat stress. Although episodic mortality occurs in the absence of climate change, studies compiled here suggest that at least some of the world's forested ecosystems already may be responding to climate change and raise concern that forests may become increasingly vulnerable to higher background tree mortality rates and die-off in response to future warming and drought, even in environments that are not normally considered water-limited. This further suggests risks to ecosystem services, including the loss of sequestered forest carbon and associated atmospheric feedbacks. Our review also identifies key information gaps and scientific uncertainties that currently hinder our ability to predict tree mortality in response to climate change and emphasizes the need for a globally coordinated observation system. Overall, our review reveals the potential for amplified tree mortality due to drought and heat in forests worldwide."
test_abstract_2 = "The world's forests influence climate through physical, chemical, and biological processes that affect planetary energetics, the hydrologic cycle, and atmospheric composition. These complex and nonlinear forest-atmosphere interactions can dampen or amplify anthropogenic climate change. Tropical, temperate, and boreal reforestation and afforestation attenuate global warming through carbon sequestration. Biogeophysical feedbacks can enhance or diminish this negative climate forcing. Tropical forests mitigate warming through evaporative cooling, but the low albedo of boreal forests is a positive climate forcing. The evaporative effect of temperate forests is unclear. The net climate forcing from these and other processes is not known. Forests are under tremendous pressure from global change. Interdisciplinary science that integrates knowledge of the many interacting climate services of forests with the impacts of global change is necessary to identify and understand as yet unexplored feedbacks in the Earth system and the potential of forests to mitigate climate change."
test_abstract_3 = "The paper summarizes the current knowledge about the impact of livestock sector on climate change. The main sources of greenhouse gas (GHG) emissions from livestock are described and the contribution of livestock sector to the global GHG emissions is presented on the basis of the latest results obtained from the scientific research. The most recent mitigation strategies for reducing greenhouse gas emissions from livestock sector are also discussed. The paper aims to provide a general overview of an emergent environmental issue such as the impact of livestock sector on climate change. While the paper is easy to understand for non-expert readers, it may also be a relevant reference point for academic researchers and for policy makers aimed at achieving the sustainability of livestock/food sector."
test_abstract_4 = "Feeding a growing global population in a changing climate presents a significant challenge to society. The projected yields of crops under a range of agricultural and climatic scenarios are needed to assess food security prospects. Previous meta-analyses have summarized climate change impacts and adaptive potential as a function of temperature, but have not examined uncertainty, the timing of impacts, or the quantitative effectiveness of adaptation. Here we develop a new data set of more than 1,700 published simulations to evaluate yield impacts of climate change and adaptation. Without adaptation, losses in aggregate production are expected for wheat, rice and maize in both temperate and tropical regions by 2 °C of local warming. Crop-level adaptations increase simulated yields by an average of 7–15%, with adaptations more effective for wheat and rice than maize. Yield losses are greater in magnitude for the second half of the century than for the first. Consensus on yield decreases in the second half of the century is stronger in tropical than temperate regions, yet even moderate warming may reduce temperate crop yields in many locations. Although less is known about interannual variability than mean yields, the available data indicate that increases in yield variability are likely."
test_abstract_5 = "The effects of climate change on biodiversity are increasingly well documented, and many methods have been developed to assess species' vulnerability to climatic changes, both ongoing and projected in the coming decades. To minimize global biodiversity losses, conservationists need to identify those species that are likely to be most vulnerable to the impacts of climate change. In this Review, we summarize different currencies used for assessing species' climate change vulnerability. We describe three main approaches used to derive these currencies (correlative, mechanistic and trait-based), and their associated data requirements, spatial and temporal scales of application and modelling methods. We identify strengths and weaknesses of the approaches and highlight the sources of uncertainty inherent in each method that limit projection reliability. Finally, we provide guidance for conservation practitioners in selecting the most appropriate approach(es) for their planning needs and highlight priority areas for further assessments."

####################

abstract_1 = "Causal attribution of recent biological trends to climate change is complicated because non-climatic influences dominate local, short-term biological changes. Any underlying signal from climate change is likely to be revealed by analyses that seek systematic trends across diverse species and geographic regions; however, debates within the Intergovernmental Panel on Climate Change (IPCC) reveal several definitions of a ‘systematic trend’. Here, we explore these differences, apply diverse analyses to more than 1,700 species, and show that recent biological trends match climate change predictions. Global meta-analyses documented significant range shifts averaging 6.1 km per decade towards the poles (or metres per decade upward), and significant mean advancement of spring events by 2.3 days per decade. We define a diagnostic fingerprint of temporal and spatial ‘sign-switching’ responses uniquely predicted by twentieth century climate trends. Among appropriate long-term/large-scale/multi-species data sets, this diagnostic fingerprint was found for 279 species. This suite of analyses generates ‘very high confidence’ (as laid down by the IPCC) that climate change is already affecting living systems."
tldr_1 = "Climate change predictions are confirmed by the collective change in distribution of species, and the change in timing of biological events."

abstract_2 = "Significantly more carbon is stored in the world's soils—including peatlands, wetlands and permafrost—than is present in the atmosphere. Disagreement exists, however, regarding the effects of climate change on global soil carbon stocks. If carbon stored belowground is transferred to the atmosphere by a warming-induced acceleration of its decomposition, a positive feedback to climate change would occur. Conversely, if increases of plant-derived carbon inputs to soils exceed increases in decomposition, the feedback would be negative. Despite much research, a consensus has not yet emerged on the temperature sensitivity of soil carbon decomposition. Unravelling the feedback effect is particularly difficult, because the diverse soil organic compounds exhibit a wide range of kinetic properties, which determine the intrinsic temperature sensitivity of their decomposition. Moreover, several environmental constraints obscure the intrinsic temperature sensitivity of substrate decomposition, causing lower observed ‘apparent’ temperature sensitivity, and these constraints may, themselves, be sensitive to climate."
tldr_2 = "Soil carbon decomposition may be sensitive to climate, but the amount of decomposition is constrained by other factors."

abstract_3 = "Climate change over the past ∼30 years has produced numerous shifts in the distributions and abundances of species and has been implicated in one species-level extinction. Using projections of species' distributions for future climate scenarios, we assess extinction risks for sample regions that cover some 20% of the Earth's terrestrial surface. Exploring three approaches in which the estimated probability of extinction shows a power-law relationship with geographical range size, we predict, on the basis of mid-range climate-warming scenarios for 2050, that 15–37% of species in our sample of regions and taxa will be ‘committed to extinction’. When the average of the three methods and two dispersal scenarios is taken, minimal climate-warming scenarios produce lower projections of species committed to extinction (∼18%) than mid-range (∼24%) and maximum-change (∼35%) scenarios. These estimates show the importance of rapid implementation of technologies to decrease greenhouse gas emissions and strategies for carbon sequestration."
tldr_3 = "Using predictions of future climate, the authors predict that if greenhouse gases continue to increase, 15-37% of species will be committed to extinction."

####################
# One-Shot Prompt
os_prompt = "Abstract:\n" + abstract_1 + "\nTL;DR:\n" + tldr_1 + "\n\nAbstract:\n"

# Few-Shot Prompt (2-Shot)
fs_prompt = "Abstract:\n" + abstract_1 + "\nTL;DR:\n" + tldr_1 + "\n\nAbstract:\n" + abstract_2 + "\nTL;DR:\n" + tldr_2 +  "\n\nAbstract:\n"

####################

In [None]:
# Test Abstract 5 with One-Shot TL;DR generation
prompt = os_prompt + test_abstract_5 + "\nTL;DR:\n"

%run "/content/gdrive/MyDrive/NLP_scientific-text-generation/transformers/examples/text-generation/run_generation.py" \
--model_type gpt2 \
--model_name_or_path '/content/gdrive/MyDrive/NLP_scientific-text-generation/output/' \
--length 100 \
--prompt "$prompt" \
--stop_token "<EOS>" \
--temperature 0.7 \
--k 50 \
--num_return_sequences 5

[INFO|tokenization_utils_base.py:1685] 2021-01-26 23:14:29,409 >> Model name '/content/gdrive/MyDrive/NLP_scientific-text-generation/output/' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming '/content/gdrive/MyDrive/NLP_scientific-text-generation/output/' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1718] 2021-01-26 23:14:29,414 >> Didn't find file /content/gdrive/MyDrive/NLP_scientific-text-generation/output/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-26 23:14:29,418 >> Didn't find file /content/gdrive/MyDrive/NLP_scientific-text-generation/output/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1764] 2021-01-26 23:14:29,421 >> loading file /content/gdrive/MyDrive/NLP_scientific-text-generation/output/vocab.json
[INFO|tokenization_utils_base.py:1764] 2021-01-26 23:14:29,421 >> loading file /content/gdrive/MyD

=== GENERATED SEQUENCE 1 ===
Abstract:
Causal attribution of recent biological trends to climate change is complicated because non-climatic influences dominate local, short-term biological changes. Any underlying signal from climate change is likely to be revealed by analyses that seek systematic trends across diverse species and geographic regions; however, debates within the Intergovernmental Panel on Climate Change (IPCC) reveal several definitions of a ‘systematic trend’. Here, we explore these differences, apply diverse analyses to more than 1,700 species, and show that recent biological trends match climate change predictions. Global meta-analyses documented significant range shifts averaging 6.1 km per decade towards the poles (or metres per decade upward), and significant mean advancement of spring events by 2.3 days per decade. We define a diagnostic fingerprint of temporal and spatial ‘sign-switching’ responses uniquely predicted by twentieth century climate trends. Among app
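
The `--temperature` and `--k` flags in the command above control how the next token is sampled. Assuming `--k` is passed on as the top-k cutoff, only the k most likely tokens are kept, and their scores are divided by the temperature before being turned into probabilities. The following toy sketch illustrates the idea on a four-token vocabulary. It is not the library's implementation, and the logits are made up.

```python
import math
import random


def sample_top_k(logits, k=50, temperature=0.7, rng=random):
    """Sample a token index from the k highest logits after temperature scaling."""
    # Keep the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
rng = random.Random(0)
samples = [sample_top_k(toy_logits, k=2, temperature=0.7, rng=rng) for _ in range(1000)]
print({i: samples.count(i) for i in sorted(set(samples))})
```

A temperature below 1 sharpens the distribution towards the most likely token, while a smaller k removes unlikely tokens entirely.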

In [None]:
# Test Abstract 5 with Few-Shot TL;DR generation
prompt = fs_prompt + test_abstract_5 + "\nTL;DR:\n"

%run "/content/gdrive/MyDrive/NLP_scientific-text-generation/transformers/examples/text-generation/run_generation.py" \
--model_type gpt2 \
--model_name_or_path "/content/gdrive/MyDrive/NLP_scientific-text-generation/output/" \
--length 100 \
--prompt "$prompt" \
--stop_token "<EOS>" \
--temperature .7 \
--k 50 \
--num_return_sequences 5

[INFO|tokenization_utils_base.py:1685] 2021-01-26 22:53:53,989 >> Model name '/content/gdrive/MyDrive/NLP_scientific-text-generation/output/' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming '/content/gdrive/MyDrive/NLP_scientific-text-generation/output/' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1718] 2021-01-26 22:53:53,991 >> Didn't find file /content/gdrive/MyDrive/NLP_scientific-text-generation/output/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-26 22:53:53,993 >> Didn't find file /content/gdrive/MyDrive/NLP_scientific-text-generation/output/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1764] 2021-01-26 22:53:53,995 >> loading file /content/gdrive/MyDrive/NLP_scientific-text-generation/output/vocab.json
[INFO|tokenization_utils_base.py:1764] 2021-01-26 22:53:53,996 >> loading file /content/gdrive/MyD

=== GENERATED SEQUENCE 1 ===
Abstract:
Causal attribution of recent biological trends to climate change is complicated because non-climatic influences dominate local, short-term biological changes. Any underlying signal from climate change is likely to be revealed by analyses that seek systematic trends across diverse species and geographic regions; however, debates within the Intergovernmental Panel on Climate Change (IPCC) reveal several definitions of a ‘systematic trend’. Here, we explore these differences, apply diverse analyses to more than 1,700 species, and show that recent biological trends match climate change predictions. Global meta-analyses documented significant range shifts averaging 6.1 km per decade towards the poles (or metres per decade upward), and significant mean advancement of spring events by 2.3 days per decade. We define a diagnostic fingerprint of temporal and spatial ‘sign-switching’ responses uniquely predicted by twentieth century climate trends. Among app

## Appendix
To create a dataset based on all abstracts, use this code for the third step of the data preparation.

In [None]:
#3. Data preparation
import os
import json
import pandas as pd

json_dir = '/content/gdrive/MyDrive/NLP_scientific-text-generation/json'

abstracts = []
count = 7129 #7129 includes all abstracts in the dataset. Change to a smaller number to choose only a subset.

for f_name in os.listdir(json_dir):
  if f_name.endswith(".json"):
    path = os.path.join(json_dir, f_name)
    with open(path, "r", encoding="utf8") as f:
      data = json.load(f)
      is_english = False
      abstract = ""
      for obj in data["metadata"]:
        if obj["key"] == "dc.language.iso":
          if obj["value"] == "eng":
            is_english = True
          else:
            break
        elif obj["key"] == "dc.description.abstract":
          abstract = obj["value"]
          if is_english:
            break
      
      if is_english:
        abstracts.append(abstract)

    # ensure we get exactly the amount of samples required
    if len(abstracts) >= count:
      break
#put the list into a dataframe to print the abstracts with an index to count the number of abstracts
df = pd.DataFrame(abstracts)
df.columns = ['abstract']
df = df.reset_index()
print(df)      