# Preparing data for fine-tuning on JumpStart

We first install the `datasets` package, which we will use to download our dataset, and upgrade the SageMaker SDK to the latest version.

In [None]:
! pip install sagemaker datasets --upgrade --quiet

------
## 1. Set up permissions and SageMaker Role

If you are going to use SageMaker in a local environment (not SageMaker Studio or SageMaker Notebook Instances), you will need to be able to assume an IAM Role with the required permissions for SageMaker. Find out more about this [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

In [None]:
import sagemaker, boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
region = sess.boto_region_name
sm_client = boto3.client('sagemaker')

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {region}")

-----
## 2. Load `sts` dataset

As explained in the README file, we will be fine-tuning a model to identify whether a question asked by a user is a paraphrase of one of the FAQs in a list, i.e., whether the two questions are semantically equivalent.

To do this, we want to start with a dataset focused on this task. We will be using the Semantic Textual Similarity (STS) dataset, which consists of sentence pairs from different domains (plagiarized sentences, machine-translated sentences, and others), each with a similarity score in the range 0-5.

The STS dataset is downloaded from [Hugging Face](https://huggingface.co/datasets/stsb_multi_mt) under the [CC BY-SA 3.0 License](https://creativecommons.org/licenses/by-sa/3.0/legalcode).

In [None]:
from datasets import load_dataset

# Concatenate the train, dev and test splits into a single dataset
dataset = load_dataset("stsb_multi_mt", name="en", split="train+dev+test")
df = dataset.to_pandas()

Let's check out what our data looks like.

In [None]:
df.head()

Since we will be training a binary classification model (that is, we want a yes or no answer to the question "are these two sentences equivalent?"), we need to threshold the similarity scores into boolean labels. The dataset's [documentation](https://huggingface.co/datasets/stsb_multi_mt) gives us the following guide on what the 0-5 score means:

![scores](img/score_meaning.png)

To be sure we don't capture ambiguous input data, we will only keep sentence pairs with scores of 4 or higher as positive examples, and scores of 2 or lower as negative examples. We create a new `labels` series with 1's and 0's for the positive and negative examples respectively, which is the label format expected when fine-tuning binary classification models in JumpStart. This will be explained in more detail in the next section of this notebook.

In [None]:
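# Keep only clear-cut pairs; .copy() avoids a SettingWithCopyWarning when df_ext is modified later
df_ext = df[(df['similarity_score'] >= 4) | (df['similarity_score'] <= 2)].copy()
# 1 for paraphrases (score >= 4), 0 for non-paraphrases (score <= 2)
labels = df_ext.apply(lambda x: 1 if x['similarity_score'] >= 4 else 0, axis=1)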
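
The thresholding step can be tried out on a small hand-made table before running it on the full dataset. The following is a minimal sketch: the `toy` DataFrame and its scores are made up for illustration, and the vectorized comparison is an alternative to the row-wise `apply`, not the notebook's own code.

```python
import pandas as pd

# Toy scores (hypothetical values, not taken from the STS dataset)
toy = pd.DataFrame({
    "sentence1": ["a", "b", "c", "d", "e"],
    "sentence2": ["a2", "b2", "c2", "d2", "e2"],
    "similarity_score": [5.0, 4.0, 3.0, 2.0, 0.5],
})

# Keep only clear-cut pairs: score >= 4 (positive) or score <= 2 (negative)
mask = (toy["similarity_score"] >= 4) | (toy["similarity_score"] <= 2)
toy_ext = toy[mask].copy()

# Vectorized labelling: the boolean comparison is cast to 1/0
toy_labels = (toy_ext["similarity_score"] >= 4).astype(int)

print(toy_labels.tolist())  # [1, 1, 0, 0]
```

The pair with a score of 3.0 falls between the two thresholds and is dropped, which is the intended handling of ambiguous examples.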

Let's check how many positive and negative examples we're left with.

In [None]:
print(f'There are {len(labels[labels == 1].index)} positive examples')
print(f'There are {len(labels[labels == 0].index)} negative examples')

We have more negative than positive examples, but it's not too imbalanced; we will keep this partition of the dataset to train the model.
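
To put a number on the imbalance, the share of positive examples can be computed with `value_counts`. This is a small sketch on a made-up label series (`toy_labels` is hypothetical and only stands in for `labels`); what counts as "too imbalanced" is a judgment call and is not defined here.

```python
import pandas as pd

# Hypothetical label series standing in for `labels`
toy_labels = pd.Series([0, 0, 0, 1, 1])

# Count each class, then compute the fraction of positive examples
counts = toy_labels.value_counts()
positive_share = counts.get(1, 0) / len(toy_labels)

print(f"negatives: {counts.get(0, 0)}, positives: {counts.get(1, 0)}")
print(f"share of positive examples: {positive_share:.2f}")
```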

------
## 3. Transform dataset into expected format

We will now transform the data into a format that the pre-trained models available on JumpStart - as well as the underlying scripts to fine-tune them - can handle. All of the JumpStart models available for sentence-pair classification expect a common data format, explained below.

**Input**: A .csv file named `data.csv`, with the following structure:
- Each row of the first column should have a 0/1 integer class label.
- Each row of the second column should have the corresponding first sentence.
- Each row of the third column should have the corresponding second sentence.

Below is an example of a `data.csv` file for a random sentence-pair classification dataset, showing values in its first three columns. Note that the file should not have any header.

|   |  |  |
|---|---|---|
|0|What is the Grotto at Notre Dame?|Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection.|
|1|What is the Grotto at Notre Dame?|It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.|
|0|What sits on top of the Main Building at Notre Dame?|Atop the Main Building's gold dome is a golden statue of the Virgin Mary.|
|...|...|...|

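The format rules above can be checked mechanically before a file is uploaded. The sketch below uses only the standard library; `check_rows` is a hypothetical helper written for this illustration and is not part of JumpStart or the SageMaker SDK, and the sample rows are made up.

```python
import csv
import io

def check_rows(lines):
    """Return a list of problems found in the rows of a header-less sentence-pair CSV."""
    problems = []
    for i, row in enumerate(csv.reader(lines), start=1):
        if len(row) < 3:
            problems.append(f"row {i}: expected at least 3 columns, got {len(row)}")
        elif row[0] not in ("0", "1"):
            problems.append(f"row {i}: label {row[0]!r} is not 0 or 1")
    return problems

# A well-formed file: no header, label first, then the two sentences
good = io.StringIO('0,"What is A?","A is, roughly, B."\n1,"What is A?","Define A."\n')
# A malformed file: it has a header row, and its second row lacks a sentence
bad = io.StringIO("label,sentence1,sentence2\n1,only one sentence\n")

print(check_rows(good))  # []
print(check_rows(bad))
```

For the malformed input, the header row is reported because its first field is not a 0/1 label, and the second row is reported because it has only two columns. The same function accepts an open file handle, for example `check_rows(open('data/data.csv', newline=''))`.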

Let us now transform our dataset into this format. This is quite simple, as we only need to insert our `labels` as the first column of the dataframe, and drop the `similarity_score` column.

In [None]:
df_ext.insert(loc=0, column='label', value=labels)
# Reassign instead of dropping in place, so this also works if df_ext is a slice of df
df_ext = df_ext.drop(columns='similarity_score')

The result is in line with the expected format.

In [None]:
df_ext.head()

Now, we can save our `data.csv` file to disk, making sure that the index column and header of the DataFrame are not included.

In [None]:
!mkdir -p data
file_name = 'data.csv'
df_ext.to_csv(f'data/{file_name}', header=False, index=False)

------
## 4. Upload data to S3

Finally, we upload our data to S3, where our training job will read it from.

In [None]:
prefix = 'datasets/sts-paraphrase'

data_url = sess.upload_data(f'data/{file_name}', bucket=sagemaker_session_bucket, key_prefix=prefix)

Make sure to copy the URL in the output of the following cell, so you can provide it to your training job as input.

In [None]:
data_url
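
The value returned by `upload_data` is an S3 URI of the form `s3://<bucket>/<key_prefix>/<file name>`. If you need its parts separately, they can be pulled out with the standard library, as in the sketch below. The bucket name `example-bucket` is a placeholder, and whether your training job expects the file URI or the enclosing prefix is an assumption to verify against the job's input requirements.

```python
from urllib.parse import urlparse

# Hypothetical URI; the real bucket name depends on your account and region
example_url = "s3://example-bucket/datasets/sts-paraphrase/data.csv"

parsed = urlparse(example_url)
bucket = parsed.netloc           # bucket name
key = parsed.path.lstrip("/")    # object key inside the bucket

# Enclosing prefix, in case the training input should point at the folder (assumption)
folder_url = example_url.rsplit("/", 1)[0] + "/"

print(bucket)      # example-bucket
print(key)         # datasets/sts-paraphrase/data.csv
print(folder_url)  # s3://example-bucket/datasets/sts-paraphrase/
```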

Next, go to the NLP lab README to learn how to fine-tune a pre-trained model on our prepared dataset via the JumpStart UI, or alternatively open and run `jsapi_finetune_paraphrase.ipynb`, which you can find in the same directory as this notebook.