# Part 0 - Data preparation

In this notebook we will download the arXiv dataset and save it to S3. We will also do some light data preprocessing by keeping only the columns we need, filtering out abstracts and titles that are too short, and limiting the size of the datasets.

To read more, please check out https://towardsdatascience.com/setting-up-a-text-summarisation-project-introduction-526622eea4a8.

First of all, we want to make sure that the relevant libraries are installed on this machine:

In [None]:
%pip install "sagemaker>=2.48.0" "transformers==4.6.1" "datasets[s3]==1.6.2" --upgrade

In [None]:
%pip install rouge_score

We will download the dataset directly from the Kaggle website, so we need to install the Kaggle Python package.

In [None]:
%pip install kaggle

In [None]:
# https://github.com/Kaggle/kaggle-api
# follow the instructions in the above link to export your Kaggle username and key to your local environment so you don't
# have to put them in the notebook (so they won't be exposed on GitHub)

# only run the below if you aren't able to export these to your local environment through your shell
# import os
# os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
# os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"
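
Before authenticating, it can help to check that credentials are actually visible to Python. The following is a minimal sketch, assuming the Kaggle client reads either the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables or a `~/.kaggle/kaggle.json` file, as described in the kaggle-api README linked above. The helper name `kaggle_credentials_available` is hypothetical and not part of the Kaggle package.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, config_file=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper: checks the two environment variables first,
    then falls back to the conventional kaggle.json location.
    """
    env = os.environ if env is None else env
    if config_file is None:
        config_file = Path.home() / ".kaggle" / "kaggle.json"
    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
    return has_env or Path(config_file).is_file()


if not kaggle_credentials_available():
    print("No Kaggle credentials found; export them before calling authenticate().")
```

Passing `env` and `config_file` explicitly makes the check easy to try out without touching your real environment.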

In [3]:
import kaggle
kaggle.api.authenticate()

In [2]:
kaggle.api.dataset_download_files('Cornell-University/arxiv', path=".")

In [6]:
!unzip arxiv.zip

Archive:  arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


In [7]:
!mkdir data

mkdir: data: File exists


In [3]:
!mkdir raw_data
!mv arxiv.zip raw_data/

In [4]:
!mv arxiv-metadata-oai-snapshot.json raw_data/

In [5]:
from datasets import load_dataset
dataset = load_dataset("arxiv_dataset", data_dir='./raw_data/', split='train', ignore_verifications=True)

Using custom data configuration default-data_dir=.%2Fraw_data%2F


Downloading and preparing dataset arxiv_dataset/default (download: Unknown size, generated: 2.09 GiB, post-processed: Unknown size, total: 2.09 GiB) to /Users/robbdunlap/.cache/huggingface/datasets/arxiv_dataset/default-data_dir=.%2Fraw_data%2F/1.1.0/242eb95c95350194872f5be3fb00e7938e53b0944442e85f45a5d2240328f370...


Dataset arxiv_dataset downloaded and prepared to /Users/robbdunlap/.cache/huggingface/datasets/arxiv_dataset/default-data_dir=.%2Fraw_data%2F/1.1.0/242eb95c95350194872f5be3fb00e7938e53b0944442e85f45a5d2240328f370. Subsequent calls will reuse this data.


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


The original dataset is too large, so we shuffle it and limit the number of records to 25,000.

In [6]:
dataset = dataset.shuffle(seed=42)
dataset = dataset.select(range(25000))
dataset

Dataset({
    features: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'update_date'],
    num_rows: 25000
})

In [7]:
import pandas as pd
df = pd.DataFrame(dataset)

In [8]:
# only keep columns that are required
df = df[['abstract', 'title']]
df = df.rename(columns={"abstract": "text", "title": "summary"})

In [9]:
df = df.replace(r'\n',' ', regex=True)

In [10]:
pd.options.display.max_colwidth = 100

In [11]:
df.head()

| | text | summary |
|---|---|---|
| 0 | The rest-frame UV spectra of three recent tidal disruption events (TDEs), ASASSN-14li, PTF15af... | Carbon and Nitrogen Abundance Ratio In the Broad Line Region of Tidal Disruption Events |
| 1 | Inspired by the success of transformer-based pre-training methods on natural language tasks an... | Survey: Transformer based Video-Language Pre-training |
| 2 | Pandharipande-Pixton have used the geometry of the moduli space of stable quotients to produce... | Tautological relations in moduli spaces of weighted pointed curves |
| 3 | Suppose X is a projective toric scheme defined over a commutative ring R equipped with an ampl... | A splitting result for the algebraic K-theory of projective toric schemes |
| 4 | We introduce an approach that accurately reconstructs 3D human poses and detailed 3D full-body... | Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images |


## Filtering the dataset

We want to discard abstracts and titles that are too short, so that our model can produce more interesting summaries.

**How this works**
The filter splits each entry in the summary and text columns on whitespace and keeps only the rows where both word counts are greater than or equal to their minimum cutoffs (5 words for the summary, 20 words for the text).

In [17]:
cutoff_summary = 5
cutoff_text = 20
df = df[(df['summary'].apply(lambda x: len(x.split()) >= cutoff_summary)) & (df['text'].apply(lambda x: len(x.split()) >= cutoff_text))]

In [18]:
len(df)

23719
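
To make the filtering rule concrete, here is a small sketch in plain Python that applies the same word-count test to a few made-up rows. The helper names `word_count` and `keep_row` and the sample rows are illustrative assumptions; the notebook itself does this with pandas `apply`.

```python
def word_count(text):
    # str.split() with no arguments splits on any run of whitespace
    return len(text.split())


def keep_row(row, cutoff_summary=5, cutoff_text=20):
    # a row survives only if BOTH fields meet their minimum length
    return (word_count(row["summary"]) >= cutoff_summary
            and word_count(row["text"]) >= cutoff_text)


long_text = " ".join(["word"] * 25)  # 25 words, above the text cutoff
rows = [
    {"text": long_text, "summary": "A long enough title for testing"},        # kept
    {"text": long_text, "summary": "Too short"},                              # summary too short
    {"text": "tiny abstract", "summary": "A long enough title for testing"},  # text too short
]

kept = [row for row in rows if keep_row(row)]
print(len(kept))  # 1
```

Note that the cutoffs count whitespace-separated words, not model tokens, so the counts will differ from what a tokenizer reports.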

## Limiting the size of the datasets and splitting

We want to limit the size of the datasets so that training of the model can finish in a reasonable amount of time. This is a decision that we might want to revisit in the experimentation phase if we want to increase the performance of the model. We then split the dataset into train (80%), validation (10%), and test (10%).

In [19]:
df = df.sample(20000, random_state=43)

In [20]:
import numpy as np
# split the dataset into train, val, and test
df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=44), [int(0.8*len(df)), int((0.9)*len(df))])
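
The split boundaries used above can be checked with a short sketch in plain Python. It mirrors the same arithmetic (cut at 80% and 90% of the shuffled rows) on a list of integers instead of a DataFrame. The function name `train_val_test_split` is a hypothetical helper, and `random.Random(seed).shuffle` stands in for `df.sample(frac=1, ...)`, so the actual row order will differ from pandas.

```python
import random


def train_val_test_split(items, train_end_frac=0.8, val_end_frac=0.9, seed=44):
    """Shuffle items and cut them at the two cumulative fractions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = int(train_end_frac * n)  # first 80% -> train
    val_end = int(val_end_frac * n)      # next 10% -> validation
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


train, val, test = train_val_test_split(range(20000))
print(len(train), len(val), len(test))  # 16000 2000 2000
```

With 20,000 rows this yields 16,000 training, 2,000 validation, and 2,000 test examples, and every row lands in exactly one split.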

In [21]:
df_train.to_csv('data/train.csv', index=False)
df_val.to_csv('data/val.csv', index=False)
df_test.to_csv('data/test.csv', index=False)

## Save the data as CSV files and upload them to S3

We need to upload the data to S3 in order to train the model at a later point.

## Connecting to AWS
For the code below to work, you'll need to configure your environment to work with AWS. If you don't, you'll get **"ValueError: Must setup local AWS configuration with a region supported by SageMaker."**

<u>Steps</u>
1. Install AWS Command Line Interface (CLI) on your system (https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
(I used Homebrew to install instead of using curl/sudo).
2. Create an AWS Identity and Access Management (IAM) profile (https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started_create-admin-group.html). This is like creating a user account that doesn't have root access on Unix.
3. Create access keys for the IAM profile (https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-creds).
4. Configure your local environment using the AWS CLI tool (same link as above).

The next problem was a bugger to figure out. The instructional text said to set "role" using:
```python
role = sagemaker.get_execution_role()
```
but this doesn't work (it only works in a Jupyter notebook within SageMaker, not when you're executing code on your local machine and working through the AWS API). Instead, do the following to create the "role" variable:

1. Go to the IAM dashboard (https://us-east-1.console.aws.amazon.com/iamv2/home#/users).
2. Click on "Users" in the left-hand side navigation panel.
3. Click on the "User Name" you created when creating the IAM profile above.
4. Copy the "User ARN" that is at the top of the "Summary" section on the resulting page.
5. Set the "role" variable equal to this value, as a string:
```python
role = "arn:aws:iam::123456789:user/IAMaccountname"
```

If this still doesn't work, verify that you gave the IAM account access to S3 by clicking on the "Access Advisor" tab in the "Summary" section. If the account has access, it will have "Amazon S3" listed as a service.

After completing these steps, the code below will run properly.

**\*\*Do not save your role ARN value to GitHub as it probably can be used by an attacker to get access to your account\*\***
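
One way to keep the ARN out of the notebook is to read it from an environment variable, in the same spirit as the Kaggle credentials earlier. This is a minimal sketch, not something SageMaker requires: the variable name `SAGEMAKER_ROLE_ARN` and the helper `load_role_arn` are assumptions made for illustration.

```python
import os


def load_role_arn(env=None, var_name="SAGEMAKER_ROLE_ARN"):
    """Read the ARN from the environment instead of hard-coding it.

    Hypothetical helper; var_name is an arbitrary choice.
    """
    env = os.environ if env is None else env
    arn = env.get(var_name, "")
    if not arn.startswith("arn:"):
        raise ValueError(f"Set {var_name} to your ARN before running this cell.")
    return arn


# Demonstration with a stand-in environment and the placeholder ARN from above;
# in the notebook you would call load_role_arn() with no arguments.
example_role = load_role_arn({"SAGEMAKER_ROLE_ARN": "arn:aws:iam::123456789:user/IAMaccountname"})
print(example_role)
```

You would export the variable in your shell before starting Jupyter, so the value never appears in the saved notebook.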

In [None]:
import sagemaker
# create one session and reuse it for both lookups
session = sagemaker.Session()
bucket = session.default_bucket()

region = session.boto_region_name
print(f"AWS Region: {region}")
role = 'put your ARN info here - as described above'
print(f"Role Arn: {role}")

In [None]:
!aws s3 cp data/train.csv s3://$bucket/summarization/data/train.csv
!aws s3 cp data/val.csv s3://$bucket/summarization/data/val.csv
!aws s3 cp data/test.csv s3://$bucket/summarization/data/test.csv
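
For reference in later steps, the destination paths used by the copy commands above follow a simple pattern. The sketch below rebuilds them in Python; the helper `s3_uri` and the bucket name `my-example-bucket` are hypothetical placeholders, and in the notebook you would pass the `bucket` variable instead.

```python
def s3_uri(bucket, split, prefix="summarization/data"):
    # mirrors the layout used by the `aws s3 cp` commands: s3://<bucket>/summarization/data/<split>.csv
    return f"s3://{bucket}/{prefix}/{split}.csv"


uris = {split: s3_uri("my-example-bucket", split) for split in ("train", "val", "test")}
for split, uri in uris.items():
    print(split, uri)
```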