<a href="https://colab.research.google.com/github/harmishpatel21/SnapShort/blob/main/Summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers -q
!pip install wandb -q


In [2]:
# Importing stock libraries
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

# WandB – Import the wandb library
import wandb



In [3]:
!nvidia-smi

Sat Nov  7 22:40:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P8    10W /  70W |     10MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# Preparing for TPU usage
# import torch_xla
# import torch_xla.core.xla_model as xm
# device = xm.xla_device()

In [5]:
# Login to wandb to log the model run and all the parameters
!wandb login

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [6]:
!git clone https://github.com/harmishpatel21/SnapShort.git

Cloning into 'SnapShort'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 18 (delta 3), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (18/18), done.


In [7]:
# Read the JSON Lines file (one record per line) from the cloned repository
data = pd.read_json("/content/SnapShort/data000000000000.json", lines=True)
data.head()

Unnamed: 0,publication_number,abstract,application_number,description
0,US-2006134160-A1,this invention relates to novel calcium phosph...,US-52740605-A,throughout the following description specific ...
1,US-4592498-A,"a stapler , particularly for suturing skin wou...",US-71138185-A,preferred embodiments of the invention are ill...
2,US-2014379009-A1,a nerve guidance conduit includes a spiral str...,US-201414313384-A,embodiments of the present invention provide n...
3,US-4157173-A,a rail connector and improvement in seat base ...,US-86596677-A,"referring now to the drawings , and particular..."
4,US-2017360443-A1,an anvil assembly is disclosed that includes a...,US-201715606289-A,exemplary embodiments of the presently disclos...


In [8]:
import re
# Match runs of non-ASCII characters so they can be stripped from both text columns
pattern = r"[^\x00-\x7F]+"
data['description'] = data['description'].apply(lambda x: re.sub(pattern, '', x))
data['abstract'] = data['abstract'].apply(lambda x: re.sub(pattern, '', x))
data.head()

Unnamed: 0,publication_number,abstract,application_number,description
0,US-2006134160-A1,this invention relates to novel calcium phosph...,US-52740605-A,throughout the following description specific ...
1,US-4592498-A,"a stapler , particularly for suturing skin wou...",US-71138185-A,preferred embodiments of the invention are ill...
2,US-2014379009-A1,a nerve guidance conduit includes a spiral str...,US-201414313384-A,embodiments of the present invention provide n...
3,US-4157173-A,a rail connector and improvement in seat base ...,US-86596677-A,"referring now to the drawings , and particular..."
4,US-2017360443-A1,an anvil assembly is disclosed that includes a...,US-201715606289-A,exemplary embodiments of the presently disclos...
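The cleaning step above can be tried on a single string to see what it removes. The following is a minimal sketch that reuses the same character class; the helper name `strip_non_ascii` and the sample sentence are made up for illustration and are not part of the dataset.

```python
import re

# Same character class as the cell above: one or more characters outside ASCII.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(text: str) -> str:
    """Delete every run of non-ASCII characters from text."""
    return NON_ASCII.sub("", text)

# Hypothetical sample; units such as the micro sign and the degree sign are lost.
sample = "a 5 µm coating at 37 °C"
print(strip_non_ascii(sample))
```

Because the characters are deleted rather than transliterated, "µm" becomes "m" here, which is worth keeping in mind when reading generated summaries of technical text.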


In [9]:
# Custom dataset that reads the dataframe and tokenizes each row so a DataLoader can feed it to the network for fine-tuning and, later, prediction

class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.text = self.data.abstract
        self.ctext = self.data.description

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        ctext = str(self.ctext[index])
        ctext = ' '.join(ctext.split())

        text = str(self.text[index])
        text = ' '.join(text.split())

        # Pad every example to a fixed length and truncate longer inputs explicitly
        source = self.tokenizer.batch_encode_plus([ctext], max_length=self.source_len, padding='max_length', truncation=True, return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([text], max_length=self.summ_len, padding='max_length', truncation=True, return_tensors='pt')

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }
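To make the fixed-length encoding in `__getitem__` concrete, here is a plain-Python sketch of what padding or truncating to `max_length` does to a list of token ids and its attention mask. The function name `pad_or_truncate` and the token ids are hypothetical. The sketch assumes a pad id of 0 (the value T5 uses) and ignores special tokens such as the end-of-sequence marker, which the real tokenizer handles itself.

```python
from typing import List, Tuple

PAD_ID = 0  # assumption of this sketch: pad token id 0, as in T5

def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace, as ' '.join(text.split()) does in the dataset."""
    return " ".join(text.split())

def pad_or_truncate(token_ids: List[int], max_length: int,
                    pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]           # truncate anything past max_length
    mask = [1] * len(ids)                  # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

print(normalize_whitespace("a  stapler \n for   skin"))
print(pad_or_truncate([5, 6, 7], 5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))
```

With `MAX_LEN = 512`, any description longer than 512 tokens is cut off in the same way, so the model only ever sees the beginning of a long patent description.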

In [10]:
# Creating the training function. It is called from main() once per epoch.
# The model is put into train mode, then we enumerate over the training loader and pass each batch to the network.

def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    for step, data in enumerate(loader, 0):
        y = data['target_ids'].to(device, dtype = torch.long)
        # Teacher forcing: the decoder input drops the last token, the labels drop the first
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        # Padding positions are set to -100 so the loss ignores them
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype = torch.long)
        mask = data['source_mask'].to(device, dtype = torch.long)

        # Newer transformers releases use `labels` in place of the deprecated `lm_labels` argument
        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]
        
        if step % 10 == 0:
            wandb.log({"Training Loss": loss.item()})

        if step % 500 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # xm.optimizer_step(optimizer)
        # xm.mark_step()
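The slicing in `train` is easier to follow on a single sequence. The sketch below mirrors `y[:, :-1]`, `y[:, 1:]` and the `-100` masking with plain lists; the function name and the token ids are hypothetical, and a pad id of 0 is assumed.

```python
PAD_ID = 0           # assumption of this sketch: pad token id 0, as in T5
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def shift_for_teacher_forcing(target_ids, pad_id=PAD_ID):
    """Split one padded target sequence into decoder inputs and labels."""
    decoder_input = target_ids[:-1]  # everything except the last token
    # Labels are the same sequence shifted left by one, with padding masked out.
    labels = [IGNORE_INDEX if t == pad_id else t for t in target_ids[1:]]
    return decoder_input, labels

# Hypothetical padded target: three real tokens followed by two pad tokens.
target = [71, 9, 1, 0, 0]
decoder_input, labels = shift_for_teacher_forcing(target)
print(decoder_input)
print(labels)
```

At position `i` the decoder receives `target[i]` and is scored on predicting `target[i + 1]`; positions whose label is `-100` contribute nothing to the loss.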

In [11]:
def validate(epoch, tokenizer, model, device, loader):
    model.eval()
    predictions = []
    actuals = []
    with torch.no_grad():
        for step, data in enumerate(loader, 0):
            y = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=150, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
                )
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
            if step % 100 == 0:
                print(f'Completed {step}')

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals

In [12]:
def main():
    # WandB – Initialize a new run
    wandb.init(project="SnapShort")

    # WandB – Config is a variable that holds and saves hyperparameters and inputs
    # Defining some key variables that will be used later on in the training  
    config = wandb.config          # Initialize config
    config.TRAIN_BATCH_SIZE = 2    # batch size for training
    config.VALID_BATCH_SIZE = 2    # batch size for validation
    config.TRAIN_EPOCHS = 4        # number of training epochs
    config.VAL_EPOCHS = 1          # number of validation passes
    config.LEARNING_RATE = 1e-4    # learning rate for the optimizer
    config.SEED = 42               # random seed
    config.MAX_LEN = 512           # maximum number of source (description) tokens
    config.SUMMARY_LEN = 150       # maximum number of target (abstract) tokens

    # Set random seeds and deterministic pytorch for reproducibility
    torch.manual_seed(config.SEED) # pytorch random seed
    np.random.seed(config.SEED) # numpy random seed
    torch.backends.cudnn.deterministic = True

    # tokenizer for encoding the text
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    

    # Importing and pre-processing the domain data
    # Selecting only the needed columns.
    # Prepending the 'summarize: ' prefix, following the format T5 was trained on for the summarization task.
    # Note: the prefix is added to the abstract, which CustomDataset uses as the target text;
    # T5's convention is to put the task prefix on the input text (here, the description).
    df = data[['abstract','description']].copy()  # work on a copy so the assignment below does not write to a slice of `data`
    df['abstract'] = 'summarize: ' + df['abstract']
    print(df.head())

    
    # Creation of Dataset and Dataloader
    # Defining the train size. So 80% of the data will be used for training and the rest will be used for validation. 
    train_size = 0.8
    train_dataset=df.sample(frac=train_size,random_state = config.SEED)
    val_dataset=df.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)

    print("FULL Dataset: {}".format(df.shape))
    print("TRAIN Dataset: {}".format(train_dataset.shape))
    print("TEST Dataset: {}".format(val_dataset.shape))


    # Creating the Training and Validation dataset for further creation of Dataloader
    training_set = CustomDataset(train_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)
    val_set = CustomDataset(val_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)

    # Defining the parameters for creation of dataloaders
    train_params = {
        'batch_size': config.TRAIN_BATCH_SIZE,
        'shuffle': True,
        'num_workers': 0
        }

    val_params = {
        'batch_size': config.VALID_BATCH_SIZE,
        'shuffle': False,
        'num_workers': 0
        }

    # Creation of Dataloaders for training and validation. These are used below in the training and validation stages.
    training_loader = DataLoader(training_set, **train_params)
    val_loader = DataLoader(val_set, **val_params)


    
    # Defining the model. We use the pretrained t5-base checkpoint with its language-modeling head (T5ForConditionalGeneration) to generate summaries.
    # The model is then moved to the device (GPU/TPU).
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    model = model.to(device)

    # Defining the optimizer that will be used to tune the weights of the network in the training session. 
    optimizer = torch.optim.Adam(params =  model.parameters(), lr=config.LEARNING_RATE)

    # Log metrics with wandb
    wandb.watch(model, log="all")
    # Training loop
    print('Initiating Fine-Tuning for the model on our dataset')

    for epoch in range(config.TRAIN_EPOCHS):
        train(epoch, tokenizer, model, device, training_loader, optimizer)


    # Validation loop: collecting the predictions and actuals in a dataframe.
    # The dataframe is saved as /content/SnapShort/model2.csv
    print('Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe')
    for epoch in range(config.VAL_EPOCHS):
        predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
        final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
        final_df.to_csv('/content/SnapShort/model2.csv')
        print('Output Files generated for review')

if __name__ == '__main__':
    main()

wandb: Currently logged in as: harrypotter (use `wandb login --relogin` to force relogin)


                                            abstract                                        description
0  summarize: this invention relates to novel cal...  throughout the following description specific ...
1  summarize: a stapler , particularly for suturi...  preferred embodiments of the invention are ill...
2  summarize: a nerve guidance conduit includes a...  embodiments of the present invention provide n...
3  summarize: a rail connector and improvement in...  referring now to the drawings , and particular...
4  summarize: an anvil assembly is disclosed that...  exemplary embodiments of the presently disclos...
FULL Dataset: (1016, 2)
TRAIN Dataset: (813, 2)
TEST Dataset: (203, 2)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Initiating Fine-Tuning for the model on our dataset




Epoch: 0, Loss:  8.761116981506348
Epoch: 1, Loss:  2.2331957817077637
Epoch: 2, Loss:  2.121066093444824
Epoch: 3, Loss:  1.837071180343628
Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe
Completed 0
Completed 100
Output Files generated for review
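The split sizes printed above (813 training rows and 203 held-out rows out of 1016) follow from taking 80% of the rows, rounded to a whole number, and keeping the remainder for validation. The sketch below reproduces that arithmetic with the standard library only. The function name is hypothetical, and the rows it picks will not match the ones `df.sample` selected, since the random number generators differ.

```python
import random

def split_indices(n_rows, train_frac=0.8, seed=42):
    """Return disjoint (train, validation) index lists covering range(n_rows)."""
    n_train = round(n_rows * train_frac)        # 1016 * 0.8 = 812.8, rounded to 813
    rng = random.Random(seed)                   # seeded for a repeatable split
    train_idx = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train_idx)
    val_idx = [i for i in range(n_rows) if i not in train_set]
    return train_idx, val_idx

train_idx, val_idx = split_indices(1016)
print(len(train_idx), len(val_idx))
```

Fixing the seed matters here because the validation rows are defined only as "whatever was not sampled"; a different seed gives a different validation set and therefore different ROUGE scores.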


In [None]:
# summary = pd.read_csv("/content/SnapShort/model.csv")
# summary.head()

Unnamed: 0.1,Unnamed: 0,Generated Text,Actual Text
0,0,the invention relates to a stapler for applyin...,"summarize: a stapler, particularly for suturin..."
1,1,a damage resistant anvil assembly for use in s...,summarize: an anvil assembly is disclosed that...
2,2,the invention relates to a method for treating...,summarize: a neural prosthetic device for redu...
3,3,a garnish pick for use in a martini glass incl...,summarize: a garnish pick for food and / or be...
4,4,a method and apparatus for treating wounds wit...,summarize: hydrostatic pressure of aqueous sol...


In [13]:
summary = pd.read_csv("/content/SnapShort/model2.csv")
summary.head()

Unnamed: 0.1,Unnamed: 0,Generated Text,Actual Text
0,0,a stapler for applying surgical staples to an ...,"summarize: a stapler, particularly for suturin..."
1,1,"an anvil assembly includes a handle assembly, ...",summarize: an anvil assembly is disclosed that...
2,2,the invention relates to a method for treating...,summarize: a neural prosthetic device for redu...
3,3,a garnish pick includes an appendage that exte...,summarize: a garnish pick for food and / or be...
4,4,a method and apparatus for treating wounds is ...,summarize: hydrostatic pressure of aqueous sol...


In [14]:
!pip install rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [16]:
from rouge import Rouge

rouge = Rouge()
score = rouge.get_scores(summary['Generated Text'][0],summary['Actual Text'][0])
score

[{'rouge-1': {'f': 0.485714280736508,
   'p': 0.5204081632653061,
   'r': 0.45535714285714285},
  'rouge-2': {'f': 0.13461537963803644,
   'p': 0.14432989690721648,
   'r': 0.12612612612612611},
  'rouge-l': {'f': 0.32786884747111, 'p': 0.3448275862068966, 'r': 0.3125}}]
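For readers unfamiliar with the metric, ROUGE-1 compares unigram overlap between a generated summary and its reference. The sketch below is a simplified, whitespace-tokenized version written from the definition; it is not the `rouge` package's implementation and is not guaranteed to reproduce the numbers above. It also shows the effect of the `summarize: ` prefix that is still present in the `Actual Text` column. The helper names and example sentences are hypothetical.

```python
from collections import Counter

PREFIX = "summarize: "

def strip_prefix(text, prefix=PREFIX):
    """Remove the task prefix from a reference if it is present."""
    return text[len(prefix):] if text.startswith(prefix) else text

def rouge_1(hypothesis, reference):
    """Unigram precision, recall and F1 using clipped token counts."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    overlap = sum((hyp & ref).values())  # each token counted at most min(hyp, ref) times
    p = overlap / sum(hyp.values()) if hyp else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

generated = "a stapler for skin"                  # made-up example
reference = "summarize: a stapler for wounds"     # made-up example
print(rouge_1(generated, reference))
print(rouge_1(generated, strip_prefix(reference)))
```

Leaving the prefix in the reference adds one token that the generated text can never match, which lowers recall slightly. Averaging scores over every row, rather than scoring only row 0, would also give a more representative picture of the model.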

