![NVIDIA Logo](images/nvidia.png)

# Data for P-tuning

In this notebook we look at the data requirements for p-tuning, specifically on NeMo Service, and prepare the PubMedQA data for p-tuning.

---

## Learning Objectives

By the time you complete this notebook you will be able to:
- Format the PubMedQA data for p-tuning.
- Upload p-tuning data to NeMo Service.

---

## Imports

In [None]:
import json
import random

from llm_utils.pubmedqa import generate_prompt_and_answer
from llm_utils.mocks import upload

---

## Load Data

First we'll load the train and validate splits we made a few notebooks back.

In [None]:
with open('data/pubmedqa_train.json', 'r') as f:
    pubmedqa_train_data = json.load(f)
with open('data/pubmedqa_validate.json', 'r') as f:
    pubmedqa_validate_data = json.load(f)

---

## Process Data in Prep for Prompts

In the last few notebooks we worked with the test data rather than the train and validation data, and neither of those two splits has yet been converted into the `prompts_and_answers` format we have been using. Let's do that first.

In [None]:
train_prompts_and_answers = []
for value in pubmedqa_train_data.values():
    train_prompts_and_answers.append(generate_prompt_and_answer(value))

In [None]:
validate_prompts_and_answers = []
for value in pubmedqa_validate_data.values():
    validate_prompts_and_answers.append(generate_prompt_and_answer(value))

In [None]:
sample_prompt = train_prompts_and_answers[0][0]
sample_answer = train_prompts_and_answers[0][1]

In [None]:
print(sample_prompt)

In [None]:
sample_answer

---

## Format Data for P-tuning

The NeMo Service expects p-tuning data in JSON Lines (`jsonl`) format, where each line of the file is a JSON object of the following form:

```python
{"prompt": <prompt>, "completion": <completion/label>}
```
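
As a minimal sketch of what one such line looks like, the snippet below serializes a single record with `json.dumps`. The record contents are made up for illustration and are not taken from PubMedQA. Note that newlines inside a prompt are escaped during serialization, so a multi-line prompt still occupies exactly one line of the file.

```python
import json

# Made-up record for illustration; the real prompts come from `generate_prompt_and_answer`.
example_record = {
    "prompt": "QUESTION: Is this an example?\nCONTEXT: Two lines of text.\nANSWER:",
    "completion": "yes",
}

example_line = json.dumps(example_record)

# The serialized record contains no raw newline characters...
assert "\n" not in example_line
# ...and parsing the line restores the original record.
assert json.loads(example_line) == example_record

print(example_line)
```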

In [None]:
ptuning_train_data_list = [{'prompt': prompt, 'completion': answer} for prompt, answer in train_prompts_and_answers]

In [None]:
ptuning_validation_data_list = [{'prompt': prompt, 'completion': answer} for prompt, answer in validate_prompts_and_answers]

Here we see an example of data well formatted for p-tuning.

In [None]:
ptuning_train_data_list[0]

---

## Write P-tuning Data to File

We will ultimately upload our p-tuning data to the NeMo Service, where it can be used for p-tuning. First we need to write it to a file.

In [None]:
ptuning_train_data_filename = 'data/pubmedqa-train-data.jsonl'

In [None]:
ptuning_validation_data_filename = 'data/pubmedqa-validation-data.jsonl'

In [None]:
with open(ptuning_train_data_filename, 'w') as f:
    for ptuning_data in ptuning_train_data_list:
        f.write(json.dumps(ptuning_data) + '\n')

In [None]:
with open(ptuning_validation_data_filename, 'w') as f:
    for ptuning_data in ptuning_validation_data_list:
        f.write(json.dumps(ptuning_data) + '\n')
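
Before uploading, it can be worth reading a file back to confirm that every line parses and carries both expected keys. The following is a minimal sketch of such a check. The helper `count_valid_records` is a hypothetical name introduced here, not part of `llm_utils` or the NeMo Service, and the demonstration runs on a throwaway file with made-up records.

```python
import json
import os
import tempfile


def count_valid_records(path):
    """Return the number of records in a JSONL file, raising if any line is malformed."""
    count = 0
    with open(path, "r") as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
            missing = {"prompt", "completion"} - set(record)
            if missing:
                raise ValueError(f"line {line_number} is missing keys: {sorted(missing)}")
            count += 1
    return count


# Demonstration on a temporary file containing two made-up records.
sample_records = [
    {"prompt": "Example question 1?", "completion": "yes"},
    {"prompt": "Example question 2?", "completion": "no"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w") as f:
        for record in sample_records:
            f.write(json.dumps(record) + "\n")
    sample_count = count_valid_records(sample_path)

print(sample_count)
```

The same helper could be pointed at the training and validation files written above, in which case the returned counts should match the lengths of the corresponding data lists.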

---

## NeMo Service Mocking

There are two scenarios in this workshop where, instead of working directly with the real NeMo Service, you will interact with mocks. The first, which we discuss here, is uploading data files. The primary reason is that with many students using the same NeMo Service account at the same time, keeping track of everyone's files quickly becomes cumbersome.

Instead, we provide mock functions that simulate the real functions you would call if you were working with your own account.

---

## Upload Data to NeMo Service

Uploading data to the NeMo Service is straightforward. Typically you would create a `conn` object with the NeMo Service, as we did in the first notebook, and then call its `upload` method with the path of the file you would like to upload. In our case, we will use a mock `upload` function that we have provided for you, and view the (mock) response that it generates.

In [None]:
train_response = upload(ptuning_train_data_filename)

In [None]:
train_response

In [None]:
validation_response = upload(ptuning_validation_data_filename)

In [None]:
validation_response

When working with the NeMo Service API on your own account, you would want to keep track of the `id` field returned in each response so that later, when performing p-tuning, you can use it to specify which uploaded data to use.
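
One simple way to do that bookkeeping is to collect the IDs in a dictionary keyed by split. The sketch below uses stand-in responses and assumes a response can be indexed like a dictionary. Only the `id` field is described in this notebook, so the values and the rest of the response shape are hypothetical.

```python
import json

# Stand-in responses for illustration only. Apart from the `id` field,
# the shape of a real upload response is an assumption here.
example_train_response = {"id": "example-train-file-id"}
example_validation_response = {"id": "example-validation-file-id"}

# Keep the IDs together so they can be looked up when configuring p-tuning.
uploaded_file_ids = {
    "train": example_train_response["id"],
    "validation": example_validation_response["id"],
}

print(json.dumps(uploaded_file_ids, indent=2))
```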

In this workshop we have provided a setup for you where you do not need to keep track of the IDs yourself.