# GenAI model tuning prep data

This notebook converts free-form text files into JSONL format so that they can be used with the OpenAI model tuning (fine-tuning) API.

* 20231205 Uses try4 data with the context rule hardcoded.
* 20231125 Uses try3 data with the context rule hardcoded.
* 20231118 First set of train and test code.
* 20231116 Second set of actual asm code.
* 20231109 First set of actual asm code.
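
For orientation, each line of the JSONL file produced by this notebook is one JSON object holding a `messages` list with `system`, `user`, and `assistant` entries. The following is a minimal sketch of one such record; the prompt and answer strings are made-up placeholders, not taken from the real data.

```python
import json

# Placeholder strings for illustration only (not from the real data set).
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant who understands BAL."},
        {"role": "user", "content": "Load the fullword at VALUE into register one."},
        {"role": "assistant", "content": "         L    R1,VALUE"},
    ]
}

# JSONL means one JSON document per line, so a record must serialize
# without raw newlines; json.dumps escapes any "\n" inside the strings.
jsonl_line = json.dumps(record)

# Round-trip check: the line parses back to the same record.
parsed = json.loads(jsonl_line)
roles = [m["role"] for m in parsed["messages"]]
print(roles)
```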



In [11]:
# Standard library only; nothing else is needed by the cells below
import datetime
import json
import os
import pathlib

### Setup dirs 

In [12]:
# Get the current date
current_date = datetime.datetime.now()

# Format the date as YYYYMMDD
formatted_date = current_date.strftime('%Y%m%d')
formatted_date

'20231206'

In [13]:
# This can be varied to point to different files.
OUT_TRAIN_FILE_NAME = 'train' + formatted_date + '.jsonl'
OUT_TEST_FILE_NAME = 'test' + formatted_date + '.jsonl'
os.environ['OUT_TRAIN_FILE_NAME'] = OUT_TRAIN_FILE_NAME
os.environ['OUT_TEST_FILE_NAME'] = OUT_TEST_FILE_NAME
print("OUT_TRAIN_FILE_NAME: ", OUT_TRAIN_FILE_NAME)
print("OUT_TEST_FILE_NAME: ", OUT_TEST_FILE_NAME)

OUT_TRAIN_FILE_NAME:  train20231206.jsonl
OUT_TEST_FILE_NAME:  test20231206.jsonl


In [14]:
# The current directory is where this notebook is located,
# which is the notebooks dir of the project.
dirpath = os.getcwd()
print("current directory is : " + dirpath)

current directory is : /workspaces/BALSA/notebooks


In [15]:
# Use pathlib to find the root dir of the git repo
root_path = pathlib.PurePath(dirpath).parents[0]
data_path = root_path / 'data'
train_path = data_path / 'try4' / 'train'
test_path = data_path / 'try4' / 'test'
print("root directory is: ", root_path)
print("data directory is: ",  data_path)
print("train directory is: ",  train_path)
print("test directory is: ", test_path)

root directory is:  /workspaces/BALSA
data directory is:  /workspaces/BALSA/data
train directory is:  /workspaces/BALSA/data/try4/train
test directory is:  /workspaces/BALSA/data/try4/test
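
The cell above assumes the notebook sits exactly one level below the repository root. If that layout ever changes, one alternative is to walk up the parent directories until a `.git` entry is found. A minimal sketch follows; `find_repo_root` is a hypothetical helper, not part of this project, and it assumes the repository has a `.git` directory or file at its root.

```python
import pathlib

def find_repo_root(start):
    """Return the first directory at or above `start` that contains `.git`.

    Hypothetical helper for illustration; raises FileNotFoundError when
    no enclosing repository is found.
    """
    start = pathlib.Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError(f"no .git found at or above {start}")
```

With such a helper, `root_path` could be set from `find_repo_root(os.getcwd())` instead of relying on a fixed number of parent levels.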


In [16]:
# Create equivalent dir names in the environment
# Data
DATA_DIR_NAME = data_path.as_posix()
print("DATA_DIR_NAME: ", DATA_DIR_NAME)
os.environ['DATA_DIR_NAME'] = DATA_DIR_NAME

DATA_DIR_NAME:  /workspaces/BALSA/data


In [17]:
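%%bash
# Verify env variables are set.
# Of these three, only DATA_DIR_NAME is set by this notebook;
# LOGS_DIR_NAME and CSV_FILE_NAME are not set here and print as empty lines.
echo ${DATA_DIR_NAME}
echo ${LOGS_DIR_NAME}
echo ${CSV_FILE_NAME}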

/workspaces/BALSA/data
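
The same check can be done from Python, which makes unset variables explicit instead of printing blank lines. A small sketch; `report_env` is a hypothetical helper, and the variable names are simply the ones echoed above.

```python
import os

def report_env(names):
    """Map each variable name to its value, or None when it is unset."""
    return {name: os.environ.get(name) for name in names}

status = report_env(["DATA_DIR_NAME", "LOGS_DIR_NAME", "CSV_FILE_NAME"])
for name, value in status.items():
    print(f"{name}: {value if value is not None else '<not set>'}")
```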




# Routine to build our tuning jsonl file from txt files.

In [18]:
# Function to read a RESULT/PROMPT-delimited text file and convert it to JSONL format
def convert_text_to_jsonl(input_file, output_file):
    """Convert one free-form text file into a chat-format record.

    The record is appended to output_file as a single JSON line.
    """
    with open(input_file, 'r') as file:
        lines = file.readlines()


    #
    # find delimiters
    #

    # Positions of the RESULT and PROMPT marker lines, in file order
    posn = [line_nbr for line_nbr, a_line in enumerate(lines)
            if a_line in ("RESULT\n", "PROMPT\n")]

    # Record the end of the file as the final boundary
    posn.append(len(lines))

    #print("posn: ", posn)

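    # separate out the parts:
    # the RESULT section runs from the first marker up to the second,
    # the PROMPT section runs from the second marker to the end of the file
    result_lines = lines[posn[0]+1:posn[1]]
    prompt_lines = lines[posn[1]+1:posn[2]]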

    # dump the parts
    #print("===result_lines:===\n", result_lines)
    #print("===prompt_lines:===\n", prompt_lines)

    hardcoded_context="""
You are a helpful assistant who understands BAL (a subset of IBM HLASM)

When writing code, obey these rules:

* NAME corresponds to a LABEL and is always in column 1.
    - The NAME is at most 8 characters long.
    - The NAME begins with characters A-Z, a-z, $, # or @. 
* OPERATION corresponds to an instruction (mnemonic) and starts in column 10.
* OPERANDS corresponds to instruction arguments or parameters and starts in column 15.
    - Multiple operands are separated by a comma `,`.
    - Space ` ` characters are not permitted between OPERANDS.
* COMMENT corresponds to non functional text and has two possible starting locations.  
    - If the entire line is a comment, then the comment marker `*` starts in column 1.
    - If the comment is used at the end of a line of code, it starts at column 32.
* Column 72 is used to identify a continuation of the current line to the next.
    - Only use a continuation character when an instruction line spans more than 65 columns.
    - In this case, use an `x` in column 72 on the first line.
    - On the second continued line, code starts at column 16.

All code should be output in markdown or preformatted text blocks like so:

```
code here
```

Unless explicitly told to do so, do not include any commentary.

When specifying registers, be explicit.  For example, when referring to register one, use R1 rather than 1.

When printing a markdown table, align the table in column 1.

When issuing instruction commands, only consider the instructions / op codes in this table:

| Instruction / Op Code | Description                        | 
| :-------| :------------------------------------------------|
| A       | Add fullword                       |
| AH      | Add halfword                       |
| AL      | Add unsigned fullword               |
| ALR     | Add unsigned register               |
| AP      | Add Packed two fields in Memory    | 
| AR      | Add Register fullword              | 
| BAL     | Branch and Link                    | 
| BALR    | Branch and Link Register           | 
| BAS     | Branch and Save                    | 
| BASR    | Branch and Save Register           | 
| BASSM   | Branch and Save and Set Mode       | 
| BC      | Branch on Condition                | 
| BCR     | Branch on Condition Register       | 
| BCT     | Branch on Count                    | 
| BCTR    | Branch on Count Register           | 
| BSM     | Branch and Set Mode                          | 
| BXH     | Branch on Index Greater                      | 
| BXLE    | Branch on Index Less than or Equal           | 
| C       | Compare Fullword in register against memory  | 
| CDS     | Compare Doubleword in even/odd register pair against memory and Swap                  | 
| CH      | Compare Halfword                             | 
| CL      | Compare logically unsigned fullword in register against memory   | 
| CLC     | Compare up to 256 consecutive bytes in Memory | 
| CLCL    | Compare Characters Long            | 
| CLI     | Compare Logical Immediate          | 
| CLM     | Compare Selected Bytes in Memory using mask  | 
| CLR     | Compare Logical Registers          | 
| CP      | Compare Packed two fields in Memory| 
| CR      | Compare Fullword in Registers      | 
| CS      | Compare Fullword in register against memory and Swap | 
| CVB     | Convert Packed Decimal Value in Memory to signed integers in register |
| CVD     | Convert Signed Fullword in Register to Packed Decimal in Memory |
| D       | Divide doubleword in even/odd register pair by fullword value in Memory       | 
| DP      | Divide Packed Decimals two fields in Memory | 
| DR      | Divide doubleword in even/odd register pair by fullword Register        | 
| ED      | Edit - formats packed decimal field | 
| EDMK    | Edit and Mark - address of first significant digit in R1      | 
| EX      | Execute a target instruction       | 
| IC      | Insert Character - into low byte of register   | 
| ICM     | Insert Characters under mask into Register | 
| L       | Load a fullword from memory into a register | 
| LA      | Load address of a storage location into a register | 
| LCR     | Load signed 2's complement value in Register2 into Register1 | 
| LH      | Load signed halfword from memory into a Register | 
| LM      | Load Multiple fullword values from memory into registers | 
| LNR     | Load value in register R2 into register R1 with negative sign | 
| LPR     | Load value in register R2 into register R1 with positive sign (absolute value) |
| LR      | Load value in register R2 into register R1 | 
| LTR     | Load and Test value in register R2 into register R1 | 
| M       | Multiply register fullword in even/odd register pair by fullword in memory|
| MH      | Multiply fullword register by halfword in memory |
| MP      | Multiply Packed decimal in memory by packed decimal in memory | 
| MR      | Multiply fullword value in even/odd register pair by fullword value in register |
| MVC     | Copy L bytes from memory to memory             | 
| MVCIN   | Copy L bytes from memory to memory reversing order of values with second operand is last byte |   
| MVCL    | Copy or fill bytes in memory       | 
| MVI     | Store byte I2 to memory            | 
| MVN     | Copy low nibbles from memory to memory | 
| MVO     | Copy nibbles from memory to memory offset by 4 bits | 
| MVZ     | Copy high nibbles from memory to memory | 
| N       | Logical AND register and full word in memory | 
| NC      | Logical AND consecutive bytes in memory with memory | 
| NI      | Logical AND byte in memory with immediate | 
| NR      | Logical AND register with register | 
| O       | Logical OR register with memory | 
| OC      | Logical OR consecutive bytes in memory with memory | 
| OI      | Logical OR byte in memory with immediate | 
| OR      | Logical OR register with register | 
| PACK    | Convert zoned decimal characters in memory to packed decimal numbers in memory |
| S       | Subtract signed fullword in memory from register  | 
| SH      | Subtract signed halfword in memory from register | 
| SL      | Subtract unsigned fullword in memory from register | 
| SLA     | Shift left fullword register arithmetically by specified number of bits |
| SLDA    | Shift left signed 64-bit value in even/odd register pair arithmetically by specified number of bits |
| SLDL    | Shift left signed 64-bit value in even/odd register pair logically by specified number of bits |
| SLL     | Shift left fullword register logically by specified number of bits  | 
| SLR     | Subtract unsigned fullword in register by register           |   
| SP      | Subtract packed decimals in memory | 
| SR      | Subtract signed values in register by register | 
| SRA     | Shift right register arithmetically by specified number of bits | 
| SRDA    | Shift right signed 64-bit value from even/odd register pair arithmetically by specified number of bits |
| SRDL    | Shift right signed 64-bit value from even/odd register pair logically by specified number of bits |
| SRL     | Shift right register logically by specified number of bits     |
| SRP     | Shift packed number in memory by the 6-bit signed number using the value I3 as a rounding value and with a negative value shifting right. |
| ST      | Store fullword in register to memory | 
| STC     | Store lowest byte in register to memory | 
| STCM    | Store selected bytes in register to memory using mask | 
| STH     | Store halfword in register to memory | 
| STM     | Store values of several registers to fullwords in memory | 
| SVC     | Supervisor call - invoke Operating System service number  | 
| TM      | Test bits of byte in memory using mask | 
| TR      | Translate memory area at address using table in memory | 
| TRT     | Examine table in memory by using memory bytes as an index with the first non-zero value found is inserted into the low byte of R2 |
| UNPK    | Convert packed decimal in memory to zoned decimal in memory | 
| X       | Logical XOR register with memory | 
| XC      | Logical XOR up to 256 bytes in memory with bytes in memory | 
| XI      | Logical XOR byte in memory with immediate | 
| XR      | Logical XOR register with register | 
| ZAP     | Set packed decimal number in memory to 0 and then add from memory |



"""

    # Input lines keep their trailing newline, so join without a separator.
    # Joining with ' ' would indent every line after the first and break
    # the column-sensitive BAL layout (NAME must start in column 1).
    a_dict = {'messages': [
        {'role': 'system', 'content': hardcoded_context},
        {'role': 'user', 'content': ''.join(prompt_lines)},
        {'role': 'assistant', 'content': ''.join(result_lines)},
    ]}


    #print(a_dict)
    #print(output_file)

    # append to output file
    # modify with w to write a new one
    with open(output_file, 'a') as jsonl_file:
        jsonl_file.write(json.dumps(a_dict) + '\n')
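
To make the expected input layout concrete, the sketch below applies marker-based slicing to an in-memory sample. The sample text is invented, and the layout (a `RESULT` line, the answer, a `PROMPT` line, then the question) is an assumption inferred from the parser above, not a documented format. `split_sections` is a hypothetical helper.

```python
def split_sections(lines):
    """Split lines into (result_lines, prompt_lines) using the
    RESULT and PROMPT marker lines. Assumes RESULT comes first."""
    result_at = lines.index("RESULT\n")
    prompt_at = lines.index("PROMPT\n")
    return lines[result_at + 1:prompt_at], lines[prompt_at + 1:]

# Invented sample; real input files may look different.
sample = (
    "RESULT\n"
    "         L    R1,VALUE\n"
    "PROMPT\n"
    "Load the fullword at VALUE into register one.\n"
)
result_lines, prompt_lines = split_sections(sample.splitlines(keepends=True))
print(result_lines)
print(prompt_lines)
```

Unlike the routine above, `list.index` raises `ValueError` when a marker is missing, which surfaces malformed input files early.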

In [19]:


#train_files

# Build train file

In [26]:
OUT_TRAIN_FQPN = train_path / OUT_TRAIN_FILE_NAME

# Remove any existing output so that appending starts from an empty file
try:
    os.remove(OUT_TRAIN_FQPN)
    print("removed")
except OSError:
    print("did not remove the existing training file.")

# List the inputs after the removal, sorted for a reproducible record order.
# Skip .jsonl files so that output from earlier runs, which lives in the
# same directory, is not parsed as input.
train_files = [f for f in sorted(os.listdir(train_path))
               if not f.endswith('.jsonl')]

for a_file in train_files:
    IN_FQPN = train_path / a_file
    convert_text_to_jsonl(IN_FQPN, OUT_TRAIN_FQPN)


removed
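
An alternative to deleting the old output and appending record by record is to open the output once in write mode and emit all records in a single pass. The sketch below assumes a hypothetical `to_record` callable that turns one input file into a dict; `build_jsonl` is likewise a hypothetical name. It skips `.jsonl` files so that output written into the input directory is never read back as input.

```python
import json
import pathlib

def build_jsonl(input_dir, output_file, to_record):
    """Write one JSON line per input file and return the number written.

    `to_record` is a hypothetical callable: path -> dict.
    """
    input_dir = pathlib.Path(input_dir)
    # Collect inputs before opening the output so the listing is stable.
    inputs = sorted(p for p in input_dir.iterdir()
                    if p.is_file() and p.suffix != ".jsonl")
    # Mode "w" truncates any previous output, so no separate removal step.
    with open(output_file, "w") as out:
        for path in inputs:
            out.write(json.dumps(to_record(path)) + "\n")
    return len(inputs)
```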


# Build test file

In [27]:
OUT_TEST_FQPN = test_path / OUT_TEST_FILE_NAME

# Remove any existing output so that appending starts from an empty file
try:
    os.remove(OUT_TEST_FQPN)
    print("removed")
except OSError:
    print("did not remove the existing test file.")

# List the inputs after the removal, sorted for a reproducible record order.
# Skip .jsonl files so that output from earlier runs, which lives in the
# same directory, is not parsed as input.
test_files = [f for f in sorted(os.listdir(test_path))
              if not f.endswith('.jsonl')]

for a_file in test_files:
    IN_FQPN = test_path / a_file
    convert_text_to_jsonl(IN_FQPN, OUT_TEST_FQPN)

removed
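
Once both files are written, a quick structural check can catch malformed records before upload. The sketch below is an assumption about what is worth checking (valid JSON on every line, the system/user/assistant role order this notebook writes, and non-empty content); it is not the tuning API's own validation. `check_jsonl` is a hypothetical helper.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]

def check_jsonl(path):
    """Return a list of (line_number, problem) tuples; empty means OK."""
    problems = []
    with open(path, "r") as handle:
        for number, line in enumerate(handle, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((number, f"invalid JSON: {err.msg}"))
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list) or not all(
                    isinstance(m, dict) for m in messages):
                problems.append((number, "missing or malformed messages list"))
                continue
            roles = [m.get("role") for m in messages]
            if roles != EXPECTED_ROLES:
                problems.append((number, f"unexpected roles: {roles}"))
            elif any(not str(m.get("content") or "").strip() for m in messages):
                # An empty prompt or answer usually means the input file
                # did not have the expected RESULT/PROMPT layout.
                problems.append((number, "empty content"))
    return problems
```

Calling `check_jsonl(OUT_TRAIN_FQPN)` and `check_jsonl(OUT_TEST_FQPN)` should return empty lists when every record has the expected shape.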


In [23]:
os.environ['OUT_TRAIN_FQPN'] = OUT_TRAIN_FQPN.as_posix()
os.environ['OUT_TEST_FQPN'] = OUT_TEST_FQPN.as_posix()

In [24]:
%%bash
head ${OUT_TRAIN_FQPN}

{"messages": [{"role": "system", "content": "\nYou are a helpful assistant who understands BAL (a subset of IBM HLASM)\n\nWhen writing code, obey these rules:\n\n* NAME corresponds to a LABEL and is always in column 1.\n    - The NAME is at most 8 characters long.\n    - The NAME begins with characters A-Z, a-z, $, # or @. \n* OPERATION corresponds to an instruction (mnemonic) and starts in column 10.\n* OPERANDS corresponds to instruction argumennts or parameters and starts in column 15.\n    - Multiple operands are separated by a comma `,`.\n    - Space ` ` characters are not permitted between OPERANDS.\n* COMMENT corresponds to non functional text and has two possible starting locations.  \n    - If the entire line is a comment, then the comment marker `*` starts in column 1.\n    - If the comment is used at the end of a line of code, it starts at column 32.\n* Column 72 is used to identify a continuation of the current line to the next.\n    - Only use a continuation character when

In [25]:
%%bash
head ${OUT_TEST_FQPN}

{"messages": [{"role": "system", "content": "\nYou are a helpful assistant who understands BAL (a subset of IBM HLASM)\n\nWhen writing code, obey these rules:\n\n* NAME corresponds to a LABEL and is always in column 1.\n    - The NAME is at most 8 characters long.\n    - The NAME begins with characters A-Z, a-z, $, # or @. \n* OPERATION corresponds to an instruction (mnemonic) and starts in column 10.\n* OPERANDS corresponds to instruction argumennts or parameters and starts in column 15.\n    - Multiple operands are separated by a comma `,`.\n    - Space ` ` characters are not permitted between OPERANDS.\n* COMMENT corresponds to non functional text and has two possible starting locations.  \n    - If the entire line is a comment, then the comment marker `*` starts in column 1.\n    - If the comment is used at the end of a line of code, it starts at column 32.\n* Column 72 is used to identify a continuation of the current line to the next.\n    - Only use a continuation character when