## Setting environment

In [1]:
!curl -O https://raw.githubusercontent.com/deepjavalibrary/d2l-java/master/tools/colab_build.sh && bash colab_build.sh
!java --list-modules | grep "jdk.jshell"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100   520  100   520    0     0   4262      0 --:--:-- --:--:-- --:--:--  4262
Update environment...
Install Java...
Install Jupyter java kernel...
jdk.jshell@11.0.14.1


In [2]:
!git clone https://github.com/allenai/s2orc-doc2json.git
import os
os.chdir('/content/s2orc-doc2json')
!pip install -r requirements.txt

Cloning into 's2orc-doc2json'...
remote: Enumerating objects: 480, done.[K
remote: Counting objects: 100% (279/279), done.[K
remote: Compressing objects: 100% (187/187), done.[K
remote: Total 480 (delta 166), reused 181 (delta 88), pack-reused 201[K
Receiving objects: 100% (480/480), 9.40 MiB | 29.98 MiB/s, done.
Resolving deltas: 100% (255/255), done.
Collecting beautifulsoup4==4.7.1
  Downloading beautifulsoup4-4.7.1-py3-none-any.whl (94 kB)
[K     |████████████████████████████████| 94 kB 1.4 MB/s 
[?25hCollecting boto3==1.9.147
  Downloading boto3-1.9.147-py2.py3-none-any.whl (128 kB)
[K     |████████████████████████████████| 128 kB 33.2 MB/s 
[?25hCollecting requests==2.21.0
  Downloading requests-2.21.0-py2.py3-none-any.whl (57 kB)
[K     |████████████████████████████████| 57 kB 4.2 MB/s 
[?25hCollecting Flask==1.0.2
  Downloading Flask-1.0.2-py2.py3-none-any.whl (91 kB)
[K     |████████████████████████████████| 91 kB 6.1 MB/s 
Collecting python-magic==0.4.18
  Download

### interupt cell when output is stuck in *EXECUTING 87%*

In [3]:
!bash scripts/setup_grobid.sh



In [4]:
import subprocess
subprocess.Popen(["bash", "scripts/run_grobid.sh"])

<subprocess.Popen at 0x7f1f29b6ad10>

## Pdf2Json

In [5]:
import json
import argparse
import time
from bs4 import BeautifulSoup
from typing import Optional, Dict

from doc2json.grobid2json.grobid.grobid_client import GrobidClient
from doc2json.grobid2json.tei_to_json import convert_tei_xml_file_to_s2orc_json, convert_tei_xml_soup_to_s2orc_json

BASE_TEMP_DIR = 'temp'
BASE_OUTPUT_DIR = 'output'
BASE_LOG_DIR = 'log'

def process_pdf_stream(input_file: str, sha: str, input_stream: bytes, grobid_config: Optional[Dict] = None) -> Dict:
    """
    Process PDF stream
    :param input_file:
    :param sha:
    :param input_stream:
    :return:
    """
    # process PDF through Grobid -> TEI.XML
    client = GrobidClient(grobid_config)
    tei_text = client.process_pdf_stream(input_file, input_stream, 'temp', "processFulltextDocument")

    # make soup
    soup = BeautifulSoup(tei_text, "xml")

    # get paper
    paper = convert_tei_xml_soup_to_s2orc_json(soup, input_file, sha)

    return paper.release_json('pdf')


def process_pdf_file(
        input_file: str,
        temp_dir: str = BASE_TEMP_DIR,
        output_dir: str = BASE_OUTPUT_DIR,
        grobid_config: Optional[Dict] = None
) -> str:
    """
    Process a PDF file and get JSON representation
    :param input_file:
    :param temp_dir:
    :param output_dir:
    :return:
    """
    os.makedirs(temp_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)

    # get paper id as the name of the file
    paper_id = '.'.join(input_file.split('/')[-1].split('.')[:-1])
    tei_file = os.path.join(temp_dir, f'{paper_id}.tei.xml')
    output_file = os.path.join(output_dir, f'{paper_id}.json')

    # check if input file exists and output file doesn't
    if not os.path.exists(input_file):
        raise FileNotFoundError(f"{input_file} doesn't exist")
    # if os.path.exists(output_file):
    #     print(f'{output_file} already exists!')

    # process PDF through Grobid -> TEI.XML
    client = GrobidClient(grobid_config)
    # TODO: compute PDF hash
    # TODO: add grobid version number to output
    client.process_pdf(input_file, temp_dir, "processFulltextDocument")

    # process TEI.XML -> JSON
    assert os.path.exists(tei_file)
    paper = convert_tei_xml_file_to_s2orc_json(tei_file)

    # write to file
    with open(output_file, 'w') as outf:
        json.dump(paper.release_json(), outf, indent=4, sort_keys=False)

    return paper,output_file
os.makedirs(BASE_TEMP_DIR, exist_ok=True)
os.makedirs(BASE_OUTPUT_DIR, exist_ok=True)

In [None]:
input_path = '/content/s2orc-doc2json/tests/pdf/N18-3011.pdf'
start_time = time.time()
paper,tei_file = process_pdf_file(input_path, BASE_TEMP_DIR, BASE_OUTPUT_DIR)
runtime = round(time.time() - start_time, 3)
print("runtime: %s seconds " % (runtime))
## Output file in /content/s2orc-doc2json/output/xx.json

/content/s2orc-doc2json/tests/pdf/N18-3011.pdf


ConnectionError: ignored

In [6]:
parent_path = '/content/'
for file in os.listdir(parent_path):
  if file.endswith('.pdf'):
    pdf_path = os.path.join(parent_path, file)
    paper,tei_file = process_pdf_file(pdf_path, BASE_TEMP_DIR, BASE_OUTPUT_DIR)

/content/648.pdf
/content/SlowFast.pdf
/content/Timesformer.pdf
/content/Video Transformer Network.pdf
/content/ViT.pdf
/content/I3D.pdf
/content/MAE.pdf
/content/566.pdf
/content/ViViT.pdf
/content/Deep residual learning for image recognition.pdf
/content/739.pdf
/content/518.pdf
/content/556.pdf
Processing failed with error 500


AssertionError: ignored

In [None]:
! cp -r /content/s2orc-doc2json/output/ /conten

In [None]:
!zip output_json.zip /content/s2orc-doc2json/output/*

  adding: content/s2orc-doc2json/output/518.json (deflated 86%)
  adding: content/s2orc-doc2json/output/556.json (deflated 84%)
  adding: content/s2orc-doc2json/output/566.json (deflated 83%)
  adding: content/s2orc-doc2json/output/648.json (deflated 86%)
  adding: content/s2orc-doc2json/output/687.json (deflated 83%)
  adding: content/s2orc-doc2json/output/697.json (deflated 84%)
  adding: content/s2orc-doc2json/output/719.json (deflated 83%)
  adding: content/s2orc-doc2json/output/739.json (deflated 81%)


## tldr generation

### envirnoment setting

In [None]:
!pip install datasets transformers rouge-score nltk
!apt install git-lfs
!pip install -U spacy
!python -m spacy download en_core_web_sm
import transformers
print(transformers.__version__)

Collecting datasets
  Downloading datasets-2.0.0-py3-none-any.whl (325 kB)
[K     |████████████████████████████████| 325 kB 5.3 MB/s 
[?25hCollecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
[K     |████████████████████████████████| 4.0 MB 8.6 MB/s 
[?25hCollecting rouge-score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
[K     |████████████████████████████████| 77 kB 5.9 MB/s 
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.3.0-py3-none-any.whl (136 kB)
[K     |████████████████████████████████| 136 kB 38.3 MB/s 
[?25hCollecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[K     |████████████████████████████████| 212 kB 62.1 MB/s 
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37

In [None]:
!pip install transformers

In [None]:
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
sum_model = AutoModelForSeq2SeqLM.from_pretrained("HenryHXR/t5-base-finetuned-scitldr")
sum_tokenizer = AutoTokenizer.from_pretrained("HenryHXR/t5-base-finetuned-scitldr")

Downloading:   0%|          | 0.00/1.35k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.88k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.74k [00:00<?, ?B/s]

In [None]:
del sum_model

In [None]:
def get_aic_from_json(paper_json):
  with open(paper_json,'r') as f: 
    paper = json.load(f)
  body_text = paper['pdf_parse']['body_text']
  text = paper['abstract']
  for index in range(len(body_text)):
    sec_name = body_text[index]['section'].lower()
    if sec_name.find('introduction')>=0 or sec_name.find('conclusion')>=0:
      text_toks = body_text[index].text.split(' ')
      if len(text_toks) > 60:
        print("sec:{}, text:{}".format(sec_name, body_text[index]['text']))
        text += body_text[index]['text']
  
  return text

  
def get_aic_from_obj(paper_json):
  
  body_text = paper_json.body_text
  text = [t.text for t in paper_json.abstract]
  text = ''.join(text)
  for index in range(len(body_text)):
    sec = body_text[index].section
    if sec:
      sec_name = sec[0][-1].lower()
      if sec_name.find('introduction')>=0 or sec_name.find('conclusion')>=0 or sec_name.find('discuss')>=0:
        
        text_toks = body_text[index].text.split(' ')
        if len(text_toks) > 60:
          print("sec:{}, text:{}".format(sec_name, body_text[index].text))
          text += body_text[index].text
  return text


import time
def generate_tldr(pdf_path,model,tokenizer):
  paper_tmp,tei_file = process_pdf_file(pdf_path,BASE_TEMP_DIR,BASE_OUTPUT_DIR)
  # time.sleep(5)
  article = get_aic_from_obj(paper_tmp)
  print(article)
  x = tokenizer("summarize: " + article, return_tensors="pt", max_length=1024, truncation=True)
  outputsd = model.generate(
      x["input_ids"], max_length=100, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
      )
  gen_tldr = tokenizer.decode(outputsd[0])
  print(gen_tldr)
  return gen_tldr.strip('<pad> </s>')

In [None]:
input_path = '/content/s2orc-doc2json/tests/pdf/I3D.pdf'
tldr = generate_tldr(input_path,sum_model,sum_tokenizer)
print(tldr)

### named entity recognition

In [None]:
from transformers import pipeline
ner_pipe = pipeline(task="ner",model='HenryHXR/scibert_scivocab_uncased-finetuned-ner')

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/417M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/359 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/223k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/700k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [None]:
import spacy
nlp_sentence = spacy.load("en_core_web_sm")
label_list = ['O', 'B-Task', 'I-Task', 'B-Method', 'I-Method', 'B-OtherScientificTerm', 'I-OtherScientificTerm',
        'B-Generic', 'I-Generic', 'B-Material', 'I-Material', 'B-Metric', 'I-Metric']


In [None]:

def scientific_term_recognition(ner_pipe, tldr, nlp_sentence, label_list):

  label_dict_list = {'Task':[],'Method':[],'OtherScientificTerm':[],
            'Generic':[],'Material':[],'Metric':[]}
  tldr_doc = nlp_sentence(tldr)
  for sen in tldr_doc.sents:
    # print('sentence text:',sen.text.strip())
    entities = ner_pipe(sen.text)
    # for entity in entities:
    #   print(entity)

    prev = ' '
    label_sen = [label_list[int(en['entity'][-1])] for en in entities]
    for en in entities:
      label_name = label_sen[en['index']-1]
      if label_name != 'O':

        label_name = label_name[2:]
        if prev != label_name:
          first_list = [en['word']]
          label_dict_list[label_name].append(first_list)
        else:
          label_dict_list[label_name][-1].append(en['word'])
      prev = label_name

  for key in label_dict_list:
    label_dict_list[key] = [" ".join(a).replace(' ##','').replace(' - ','-').replace(' ( ','(').replace(' )',')') for a in label_dict_list[key]]

  return label_dict_list

In [None]:
input_path = '/content/ViT.pdf'
tldr = generate_tldr(input_path,sum_model,sum_tokenizer)
label_list_dict1 = scientific_term_recognition(ner_pipe, tldr, nlp_sentence, label_list)


sec:introduction, text:Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017) , have become the model of choice in natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019) . Thanks to Transformers' computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020) . With the models and datasets growing, there is still no sign of saturating performance.
sec:introduction, text:In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016) . Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020) , some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a) 

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


<pad> We show that a pure Transformer applied directly to sequences of image patches can perform very well on image classification tasks compared to state-of-the-art convolutional networks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks</s>


In [None]:
tldr

'We show that a pure Transformer applied directly to sequences of image patches can perform very well on image classification tasks compared to state-of-the-art convolutional networks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmark'

In [None]:
out_str = ''
firstEnter = False
for key in label_list_dict1:
  if label_list_dict1[key]:
    label_one_key = ','.join(label_list_dict1[key])
    tmp_str = f'{key}: {label_one_key}'
    if firstEnter:
      out_str += '\n'
    else: firstEnter = True
    out_str+=tmp_str
print(out_str)

Task: image classification tasks,mid,sized or small image recognition benchmark
Method: pure transformer,convolutional networks
OtherScientificTerm: image patches
Generic: -
Material: -


In [None]:
label_list_dict1

{'Generic': ['-'],
 'Material': ['-'],
 'Method': ['pure transformer', 'convolutional networks'],
 'Metric': [],
 'OtherScientificTerm': ['image patches'],
 'Task': ['image classification tasks',
  'mid',
  'sized or small image recognition benchmark']}