## **Training and Evaluating GPT2 for Text Summarization**

This notebook illustrates how to use this repository to train **GPT2 for abstractive summarization**.




*   We will use a small sample from the CNNDailyMail dataset ([download here](https://drive.google.com/file/d/19MXZFt6V-OQd0PgljC9GOscQy_ma3pdT/view?usp=sharing)) to train the model. Any other dataset can be used instead.
*   We will use a pretrained model (DistilGPT2 or gpt2-medium) from the Huggingface [model hub](https://huggingface.co/models) and fine-tune it on the sample dataset.
*   We will also use a pretrained gpt2-medium ([download here](https://drive.google.com/file/d/1pdJafkmv4phlMLjP6rGK8DvBiLZqYr12/view?usp=sharing)) to generate summaries for a provided text file.


In [1]:
# create a project folder
!mkdir project

In [2]:
# clone the repository
!git clone https://github.com/rohitashwa1907/Text-Summarization-Using-GPT2.git /content/project

Cloning into '/content/project'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 38 (delta 13), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (38/38), done.


In [3]:
# mounting the google drive inside project folder
from google.colab import drive
drive.mount('/content/project/drive')

Mounted at /content/project/drive


In [4]:
%cd /content/project

/content/project


In [5]:
# use the GPU if one is available, otherwise fall back to the CPU
import torch
if torch.cuda.is_available():
  device = torch.device('cuda:0')
  print('gpu')
  print(torch.cuda.get_device_properties(0))

else:
  device = torch.device('cpu')
  print(device)

gpu
_CudaDeviceProperties(name='Tesla P100-PCIE-16GB', major=6, minor=0, total_memory=16280MB, multi_processor_count=56)


In [6]:
""" To use any other data, make sure the dataset contains only the two columns ['article', 'summary']. """

# loading the sample training dataset
import pandas as pd
traindata = pd.read_csv('/content/project/drive/MyDrive/Colab Notebooks/Text Summarization/CNNDailymail_small.csv')
traindata

| Unnamed: 0 | article | summary |
| --- | --- | --- |
| 0 | It's official: U.S. President Barack Obama wan... | Syrian official: Obama climbed to the top of t... |
| 1 | (CNN) -- Usain Bolt rounded off the world cham... | Usain Bolt wins third gold of world championsh... |
| 2 | Kansas City, Missouri (CNN) -- The General Ser... | The employee in agency's Kansas City office is... |
| 3 | Los Angeles (CNN) -- A medical doctor in Vanco... | NEW: A Canadian doctor says she was part of a ... |
| 4 | (CNN) -- Police arrested another teen Thursday... | Another arrest made in gang rape outside Calif... |
| ... | ... | ... |
| 2866 | Beijing (CNN) -- Li Keqiang on Friday was name... | Li Keqiang was named China's premier, the No. ... |
| 2867 | Fort Lauderdale, Florida (CNN) -- A Florida te... | Hospital spokeswoman says Michael Brewer havin... |
| 2868 | (CNN Student News) -- December 16, 2010 . Down... | Examine some of the stories making headlines i... |
| 2869 | Istanbul (CNN) -- Crowds of mourners gathered ... | At least two people shot dead in Saturday clas... |
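
When swapping in a different dataset, it may help to check the column layout before training. The sketch below is not part of the repository: `check_summarization_frame` is a hypothetical helper, and it assumes that an extra index column such as `Unnamed: 0` (visible in the output above) and rows with missing values can simply be dropped.

```python
import pandas as pd

# Columns the training data is expected to provide (from the note in the cell above)
REQUIRED_COLUMNS = ["article", "summary"]


def check_summarization_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only 'article' and 'summary', drop incomplete rows."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"dataset is missing required columns: {missing}")
    # Selecting the two columns discards extras such as a saved index column
    return df[REQUIRED_COLUMNS].dropna().reset_index(drop=True)


# Small in-memory example standing in for a real CSV
example = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "article": ["Some long news article ...", None],
    "summary": ["A short summary.", "Summary without an article."],
})
cleaned = check_summarization_frame(example)
print(cleaned.columns.tolist(), len(cleaned))  # ['article', 'summary'] 1
```

The cleaned frame could then be written back with `cleaned.to_csv(path, index=False)` so that no index column is saved alongside the data.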


In [None]:
# installing the required libraries
!pip install -r requirements.txt

In [None]:
# create a folder for the fine tuned model
!mkdir fine_tuned_folder

### **Training Script**

In [None]:
!python train_GPT2.py --epochs=3 --data_path='/content/project/drive/MyDrive/Colab Notebooks/Text Summarization/CNNDailymail_small.csv' --model_arch_name='distilgpt2' --model_directory='/content/project/fine_tuned_folder/gpt2.pt'

PROCESSING THE DATA .......................................................................
Downloading: 100% 1.04M/1.04M [00:00<00:00, 4.13MB/s]
Downloading: 100% 456k/456k [00:00<00:00, 2.62MB/s]
Downloading: 100% 1.36M/1.36M [00:00<00:00, 5.22MB/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (1177 > 1024). Running this sequence through the model will result in indexing errors
CREATING BATCHES FOR TRAINING .............................................................
DOWNLOADING MODEL FROM HUGGINGFACE 🤗 🤗........................................................
Downloading: 100% 762/762 [00:00<00:00, 689kB/s]
Downloading: 100% 353M/353M [00:05<00:00, 65.3MB/s]
STARTING THE TRAINING PROCESS  😇 😇
Learning Rate -->>  9.938666768372829e-05
For epoch : 1  Training loss is : 5.00528316812303  Validation loss is : 2.285308855651605  Time taken: 134.0967948436737
Saving the best model 😁
Learning Rate -->>  4.590520244649574e-05
For epoch :
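
The tokenizer warning in the log above (1177 > 1024) indicates that some articles exceed GPT-2's 1024-token context window. How the training script handles such examples is not shown here. One common approach, sketched below under stated assumptions, is to truncate the article so that the article, a separator token, and the summary fit together. `fit_to_context` is a hypothetical helper, and using id 50256 (GPT-2's `<|endoftext|>` token) as the separator is an assumption; the repository may use a different separator or strategy.

```python
MAX_CONTEXT = 1024  # GPT-2 context window, as reported in the warning


def fit_to_context(article_ids, summary_ids, sep_id, max_len=MAX_CONTEXT):
    """Hypothetical helper: truncate the article so article + separator + summary fits."""
    budget = max_len - len(summary_ids) - 1  # reserve one slot for the separator
    if budget < 0:
        raise ValueError("summary alone does not fit in the context window")
    return article_ids[:budget] + [sep_id] + summary_ids


# Toy token ids standing in for real tokenizer output
article = list(range(1177))  # same length as the sequence in the warning
summary = [7] * 50
packed = fit_to_context(article, summary, sep_id=50256)  # separator id is an assumption
print(len(packed))  # 1024
```

Truncating the article rather than the summary keeps the training target intact, at the cost of dropping the tail of long articles.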

### **Inference Script**

In [None]:
!python eval.py --input_file='/content/project/sample.txt' --model_directory='/content/project/drive/MyDrive/Colab Notebooks/Text Summarization/GPT2_medium_CNNDailymail_new.pt' --model_arch_name='gpt2-medium' --num_of_samples=3