## 🤗Transformers - Generating Titles from Paper Abstracts using the T5 Model
This notebook uses T5, a sequence-to-sequence model that treats every NLP task as a text-to-text problem: it takes a text string as input and produces a text string as output. This framing covers tasks such as language translation and summarization.

In this notebook, we use each paper's abstract as the input text and its title as the target text, and fine-tune a T5 model on these pairs. So, let's dive in...



We will install dependencies and work with the latest stable PyTorch (1.6 at the time of writing).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import json

data_file = '../input/hepth-dataset/HEPTH_METADATA_2.csv'

In [None]:
df = pd.read_csv(data_file)
df

In [None]:
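import re

authors = []
for _, row in df.iterrows():
    # Replace the standalone word "and" with a comma; a plain str.replace
    # would also hit names that contain "and" (e.g. "Anderson").
    author_str = re.sub(r"\band\b", ",", row["Authors"])
    # Drop parenthesised parts such as affiliations.
    author_str = re.sub(r"\([^()]*\)", "", author_str)
    authors.append(author_str)

df["Authors"] = authors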

**We will take the last 5 years of arXiv papers (2016-2021) due to Kaggle's compute limits**
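
A minimal sketch of such a year filter, assuming the metadata exposes a numeric `Year` column as the later cells suggest. The toy frame and the `filter_years` helper are illustrative and not part of the original notebook.

```python
import pandas as pd

def filter_years(frame, start=2016, end=2021, year_col="Year"):
    """Keep rows whose year falls in [start, end], inclusive on both ends."""
    # Coerce to numbers so stray strings become NaN and are filtered out.
    years = pd.to_numeric(frame[year_col], errors="coerce")
    return frame[years.between(start, end)].reset_index(drop=True)

# Toy data standing in for the real metadata file.
toy = pd.DataFrame({
    "Paper Id": ["a", "b", "c", "d"],
    "Year": [2014, 2016, 2019, 2021],
})
recent = filter_years(toy)
print(recent)
```

On the real data this would be called as `filter_years(df)` before the per-author expansion, which keeps the loops below smaller.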

In [None]:
df

In [None]:
# Build one row per (paper, author) pair. Rows are collected in a list first
# because DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
rows = []
for _, row in df.iterrows():
    for author in row["Authors"].split(","):
        rows.append({"Paper Id": row["Paper Id"], "Authors": author.strip(), "Year": row["Year"]})

new_df = pd.DataFrame(rows, columns=["Paper Id", "Authors", "Year"])

In [None]:
new_df

In [None]:
print(new_df["Paper Id"][61504])

In [None]:
new_df.to_csv("authors_hepth.csv")

In [None]:
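import ast

# NOTE: `arr` is not defined in this notebook; it is assumed to hold the raw
# `authors_parsed` strings of `papers`, each a list of [last, first, ...] entries.
list_arr = []
for raw in arr:
    parsed = ast.literal_eval(raw)
    # Join each author as "first last", separated by commas.
    list_arr.append(','.join(f"{a[1]} {a[0]}" for a in parsed))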

In [None]:
len(list_arr)

In [None]:
papers['authors_parsed'] = list_arr

In [None]:
ds = papers
ds["authors_parsed"]

In [None]:
opt = []
for index, row in ds.iterrows():
    #print(type(row['authors_parsed']))
    tmp = row['authors_parsed'].split(',')
    for j in range(len(tmp)):
        opt.append({'paper_id':row['id'],'title':row['title'],'author':tmp[j],'abstract':row['abstract'],'year':row['year']})

In [None]:
new_ds = pd.DataFrame(opt)

In [None]:
new_ds

In [None]:
new_ds.sort_values(['year'])

In [None]:
community = new_ds.paper_id.str.split('/').str[0].tolist()

In [None]:
paper_id = new_ds.paper_id.str.split('/').str[1].tolist()

In [None]:
new_ds['paper_id'] = paper_id

In [None]:
new_ds = new_ds.sort_values(['year'])

In [None]:
new_ds

In [None]:
new_ds.to_csv('new_ds.csv')

In [None]:
cpu_stats()

**We will use the `simpletransformers` library to train a T5 model**

In [None]:
import logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)


**The Simpletransformers implementation of the T5 model expects the data to be a dataframe with 3 columns:**
`<prefix>, <input_text>, <target_text>`
* `<prefix>`: A string indicating the task to perform. (E.g. "question", "stsb")
* `<input_text>`: The input text sequence (we will use the paper's abstract as `input_text`)
* `<target_text>`: The target sequence (we will use the paper's title as `target_text`)

You can read about the data format: https://github.com/ThilinaRajapakse/simpletransformers#t5-transformer
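
The three-column layout described above can be sketched on toy data as follows. The `to_t5_frame` helper and the sample rows are hypothetical; only the column names `prefix`, `input_text`, and `target_text` come from the format described here.

```python
import pandas as pd

def to_t5_frame(frame, prefix="summarize"):
    """Return a new frame with prefix, input_text and target_text columns."""
    out = frame[["abstract", "title"]].dropna().rename(
        columns={"abstract": "input_text", "title": "target_text"}
    )
    # Every row gets the same task prefix.
    out.insert(0, "prefix", prefix)
    return out.reset_index(drop=True)

# Toy rows standing in for the real papers; the second has no title.
toy_papers = pd.DataFrame({
    "title": ["A Toy Title", None],
    "abstract": ["A toy abstract about a toy result.", "An abstract with no title."],
})
t5_df = to_t5_frame(toy_papers)
print(t5_df.columns.tolist())
print(len(t5_df))
```

Rows with a missing title or abstract are dropped, since they cannot form an input/target pair.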

In [None]:
papers = papers[['title','abstract']]
papers.columns = ['target_text', 'input_text']
papers = papers.dropna()

In [None]:
eval_df = papers.sample(frac=0.2, random_state=101)
train_df = papers.drop(eval_df.index)

In [None]:
train_df.shape, eval_df.shape

**We will train our T5 model with a bare-minimum configuration, `num_train_epochs=4` and `train_batch_size=16`, to fit into Kaggle's compute limits**

In [None]:
import logging

import pandas as pd
from simpletransformers.t5 import T5Model

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

train_df['prefix'] = "summarize"
eval_df['prefix'] = "summarize"


model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 512,
    "train_batch_size": 16,
    "num_train_epochs": 4,
}

# Create T5 Model
model = T5Model("t5-small", args=model_args, use_cuda=True)

# Train T5 Model on new task
model.train_model(train_df)

# Evaluate T5 Model on new task
results = model.eval_model(eval_df)

# Predict with trained T5 model
#print(model.predict(["convert: four"]))

In [None]:
results

## And We're Done!
**Let's see how our model performs at generating paper titles**

In [None]:
random_num = 350
actual_title = eval_df.iloc[random_num]['target_text']
actual_abstract = ["summarize: "+eval_df.iloc[random_num]['input_text']]
predicted_title = model.predict(actual_abstract)

print(f'Actual Title: {actual_title}')
print(f'Predicted Title: {predicted_title}')
print(f'Actual Abstract: {actual_abstract}')


In [None]:
random_num = 478
actual_title = eval_df.iloc[random_num]['target_text']
actual_abstract = ["summarize: "+eval_df.iloc[random_num]['input_text']]
predicted_title = model.predict(actual_abstract)

print(f'Actual Title: {actual_title}')
print(f'Predicted Title: {predicted_title}')
print(f'Actual Abstract: {actual_abstract}')

In [None]:
random_num = 999
actual_title = eval_df.iloc[random_num]['target_text']
actual_abstract = ["summarize: "+eval_df.iloc[random_num]['input_text']]
predicted_title = model.predict(actual_abstract)

print(f'Actual Title: {actual_title}')
print(f'Predicted Title: {predicted_title}')
print(f'Actual Abstract: {actual_abstract}')
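
The three cells above repeat the same steps for different rows. A hedged sketch of a small helper that wraps them is shown below. `show_prediction` and `fake_predict` are hypothetical names, and the sketch assumes the predict function takes a list of prefixed strings and returns one string per input.

```python
import pandas as pd

def show_prediction(predict_fn, frame, row, prefix="summarize"):
    """Print and return the actual and predicted title for one row of frame."""
    actual_title = frame.iloc[row]["target_text"]
    model_input = f"{prefix}: {frame.iloc[row]['input_text']}"
    # Assumption: predict_fn returns a list with one string per input.
    predicted_title = predict_fn([model_input])[0]
    print(f"Actual Title: {actual_title}")
    print(f"Predicted Title: {predicted_title}")
    return actual_title, predicted_title

def fake_predict(texts):
    """Stand-in predictor so the sketch runs without a trained model."""
    return [t.split(": ", 1)[1][:20] for t in texts]

toy_eval = pd.DataFrame({
    "target_text": ["Toy Title"],
    "input_text": ["Toy abstract text."],
})
actual, predicted = show_prediction(fake_predict, toy_eval, 0)
```

With the trained model from this notebook, the equivalent call would be `show_prediction(model.predict, eval_df, 350)`.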