# **Demo: Building a data prep pipeline for model fine-tuning**

This notebook demonstrates data preparation techniques for fine-tuning language models using the Data Prep Kit.
Here is the workflow:


![](https://raw.githubusercontent.com/sapthasurendran/data-prep-lab/nb/examples/notebooks/intro/images/code-processing-flowdiagram.png)




## How to run this notebook

Two options:

- **Option 1 - Google Colab:** The easiest option, with no setup required. Click the badge to open this notebook in Google Colab.  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sapthasurendran/data-prep-lab/blob/nb/examples/notebooks/intro/dpk_intro_code_python.ipynb)
- **Option 2 - Local Python dev environment:** Set up using this [guide](../../../README.md#-getting-started).


## Step-1: SET UP

### 1.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.


In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

Running in Colab


### 1.2 - Download Data if running on Google Colab

In [2]:
if RUNNING_IN_COLAB:
    !mkdir -p 'input/source-code-data'

    !wget -O 'input/source-code-data/application-java.zip'  'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/input/application-java.zip'
    !wget -O 'input/source-code-data/data-processing-lib.zip' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/input/data-processing-lib.zip'
    !wget -O 'input/source-code-data/https___github.com_00000o1_environments_archive_refs_heads_master.zip' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/input/https___github.com_00000o1_environments_archive_refs_heads_master.zip'
    !wget -O 'my_utils.py'  'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/my_utils.py'
    !wget -O 'language.json'  'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/languages/lang_extensions.json'


--2024-12-18 06:55:05--  https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/input/application-java.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28680721 (27M) [application/zip]
Saving to: ‘input/source-code-data/application-java.zip’


2024-12-18 06:55:06 (230 MB/s) - ‘input/source-code-data/application-java.zip’ saved [28680721/28680721]

--2024-12-18 06:55:06--  https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/input/data-processing-lib.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443.

### 1.3 - Install dependencies if running on Google Colab

In [3]:
if RUNNING_IN_COLAB:
    !pip install "data-prep-toolkit-transforms[all]==0.2.2"
    !pip install pandas
    !pip install humanfriendly

Collecting data-prep-toolkit-transforms==0.2.2 (from data-prep-toolkit-transforms[all]==0.2.2)
  Downloading data_prep_toolkit_transforms-0.2.2-1-py3-none-any.whl.metadata (10 kB)
Collecting data-prep-toolkit>=0.2.2 (from data-prep-toolkit-transforms==0.2.2->data-prep-toolkit-transforms[all]==0.2.2)
  Downloading data_prep_toolkit-0.2.3-py3-none-any.whl.metadata (2.3 kB)
  Downloading data_prep_toolkit-0.2.2-py3-none-any.whl.metadata (2.2 kB)
Collecting scancode-toolkit==32.1.0 (from data-prep-toolkit-transforms[all]==0.2.2)
  Downloading scancode_toolkit-32.1.0-cp310-none-any.whl.metadata (16 kB)
Collecting bs4==0.0.2 (from data-prep-toolkit-transforms[all]==0.2.2)
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting transformers==4.38.2 (from data-prep-toolkit-transforms[all]==0.2.2)
  Downloading transformers-4.38.2-py3-none-any.whl.metadata (130 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 130.7/130.7 kB 6.1 MB/s eta

Collecting humanfriendly
  Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.8/86.8 kB 5.5 MB/s eta 0:00:00
Installing collected packages: humanfriendly
Successfully installed humanfriendly-10.0


### 1.4 - Restart Runtime

After installing dependencies, be sure to <font color="red">restart the runtime</font> so that the newly installed libraries are loaded.

You do this by going to **`Runtime --> Restart Session`**

Then you can continue to the next step (no need to re-run the notebook)

## Step-2: Configure


### 2.1 - Basic configuration

In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

Running in Colab


In [2]:
import os

## Configuration
class MyConfig:
    pass

MY_CONFIG = MyConfig ()

MY_CONFIG.INPUT_DATA_DIR = 'input/source-code-data/'

MY_CONFIG.OUTPUT_FOLDER = "output"
MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , "output_final")



In [3]:
## Add parent dir to path
import os,sys

this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

### 2.2 - Set up input/output directories

In [4]:
import os
import shutil

if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):
    raise Exception (f"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found")

output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')

output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_exact_dedupe_out')
output_code_quality_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_code_quality_out')
output_hap_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_hap_out')
output_filter_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_filter_out')
output_tokenisation_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_tokenisation_out')

## clear output folder
shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)

print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-3: Data ingestion -  Convert source data to Parquet


This is the first component of the pipeline. It ingests a few ZIP files and converts them into
Parquet files for consumption by the next steps in the data processing pipeline.


### 3.1 - Set Input/output Folder

In [5]:
STAGE = 1

input_folder = MY_CONFIG.INPUT_DATA_DIR
output_folder =  output_parquet_dir

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-1: Processing input='input/source-code-data/' --> output='output/01_parquet_out'


### 3.2 - Execute

In [6]:

%%time

import sys
import ast
from data_processing.utils import ParamsUtils
from data_processing.runtime.pure_python import PythonTransformLauncher
from code2parquet_transform import (
    detect_programming_lang_cli_key,
    supported_langs_file_cli_key,
)
from code2parquet_transform_python import CodeToParquetPythonConfiguration
# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

supported_languages_file = "language.json"

params = {
    # code2parquet parameters
    supported_langs_file_cli_key: supported_languages_file,
    detect_programming_lang_cli_key: True,
    "data_files_to_use": ast.literal_eval("['.zip']"),

    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(CodeToParquetPythonConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")


07:02:08 INFO - data factory code2parquet_ is using local configuration without input/output path
INFO:data_processing.data_access.data_access_factory_base9b4792f3-c008-4003-9760-41d0826f6106:data factory code2parquet_ is using local configuration without input/output path
07:02:08 INFO - data factory code2parquet_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_base9b4792f3-c008-4003-9760-41d0826f6106:data factory code2parquet_ max_files -1, n_sample -1
07:02:08 INFO - data factory code2parquet_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
INFO:data_processing.data_access.data_access_factory_base9b4792f3-c008-4003-9760-41d0826f6106:data factory code2parquet_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
07:02:08 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.executi

✅ Stage:1 completed successfully
CPU times: user 1.38 s, sys: 276 ms, total: 1.65 s
Wall time: 2.33 s


### 3.3 - Inspect Generated output


In [7]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display certain columns
# output_df[['column1', 'column2', 'column3']].head(5)

Output dimensions (rows x columns)=  (74, 10)


Unnamed: 0,title,document,contents,document_id,ext,hash,size,date_acquired,repo_name,programming_language
0,application-java/bin/application-java,application-java.zip,#!/usr/bin/env sh\n\n#\n# Copyright 2015 the o...,501a4498-7594-42a5-a027-28dcbfeab5e9,,318066feeb8ac614dd5eab57fbf2faa5a83f2b582a089f...,8168,2024-12-18T07:02:08.978176,application-java,unknown
1,application-java/bin/application-java.bat,application-java.zip,@rem\r\n@rem Copyright 2015 the original autho...,6a9fb410-8156-4122-849d-d37786df53e8,.bat,e90dc018527e3b0798537919c4e49fb2b25ebec62aa7a9...,5539,2024-12-18T07:02:08.978398,application-java,Batchfile
2,environments-master/.bash_aliases,https___github.com_00000o1_environments_archiv...,function gdrive_download () {\n CONFIRM=$(wge...,577af8fc-b90a-473a-8a81-c9d0019663ff,,c95ebe462ef03e13dcec8918f35932a80de119acad9271...,379,2024-12-18T07:02:09.739644,https___github.com_00000o1_environments_archiv...,unknown
3,environments-master/.bashrc,https___github.com_00000o1_environments_archiv...,# ~/.bashrc: executed by bash(1) for non-login...,a0fa8941-a58d-4aa7-914a-98ae33ffaf5f,,c5ebce9359e2570aea09ef4dead7196210dbabf0311d79...,3840,2024-12-18T07:02:09.739896,https___github.com_00000o1_environments_archiv...,unknown
4,environments-master/.gitconfig,https___github.com_00000o1_environments_archiv...,[user]\n\temail = writetocris@outlook.com\n\tn...,d9fb1b62-f875-408e-abd0-e1e2caf72f55,,286fc2d8b5211247e850506197280ee5a59c0c6dd11327...,161,2024-12-18T07:02:09.740032,https___github.com_00000o1_environments_archiv...,unknown


### 3.4 - Understand the output

Each file in a ZIP archive becomes a distinct row in the Parquet dataset, with the following schema:

- **title**: Path to the file within the ZIP archive.
- **document**: Name of the ZIP file containing the current file.
- **repo_name**: Name of the repository to which the code belongs. This should match the name of the ZIP file containing the repository.
- **contents**: Content of the file, converted to a string.
- **document_id**: Unique identifier computed as a UUID.
- **ext**: File extension extracted from the file path.
- **hash**: SHA256 hash value computed from the file content string.
- **size**: Size of the file content in bytes.
- **date_acquired**: Timestamp indicating when the file was processed.
- **programming_language**: Programming language detected using the file extension.
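
To make the schema concrete, here is a minimal standard-library sketch of how one ZIP archive could be turned into rows of this shape. It is not the code2parquet implementation: `zip_to_rows` is a hypothetical helper, deriving `repo_name` by stripping the `.zip` suffix is an assumption, and `programming_language` is left out because it needs the language-extension mapping file.

```python
import hashlib
import io
import os
import uuid
import zipfile
from datetime import datetime


def zip_to_rows(zip_name: str, zip_bytes: bytes) -> list[dict]:
    """Turn every file in a ZIP archive into one dict per file (hypothetical helper)."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            text = archive.read(info).decode("utf-8", errors="replace")
            encoded = text.encode("utf-8")
            rows.append({
                "title": info.filename,                         # path inside the archive
                "document": zip_name,                           # archive the file came from
                "repo_name": os.path.splitext(zip_name)[0],     # assumption: ZIP name minus suffix
                "contents": text,
                "document_id": str(uuid.uuid4()),
                "ext": os.path.splitext(info.filename)[1],      # '' when there is no extension
                "hash": hashlib.sha256(encoded).hexdigest(),
                "size": len(encoded),
                "date_acquired": datetime.now().isoformat(),
            })
    return rows


# Build a tiny in-memory archive to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("demo/Main.java", "class Main {}\n")
    demo.writestr("demo/README", "hello\n")

rows = zip_to_rows("demo.zip", buffer.getvalue())
for row in rows:
    print(row["title"], repr(row["ext"]), row["size"])
```

Note that `os.path.splitext` returns an empty extension for names such as `.bashrc` or `README`, which is consistent with the empty `ext` values in the output above.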




In [8]:
import pprint
import json

pprint.pprint (output_df.iloc[5, ])
# json.loads(output_df.iloc[0, ]['contents'])

title                                      environments-master/.gitignore
document                https___github.com_00000o1_environments_archiv...
contents                .*.swp\ninstall_nvm.sh\nnodesource_setup.sh\n*...
document_id                          a39ee0f8-9851-43bc-aa88-f14f7906d4c3
ext                                                                      
hash                    d5b36ed76ab3816191cd51579a17f81c1735d05f1f2530...
size                                                                   50
date_acquired                                  2024-12-18T07:02:09.741148
repo_name               https___github.com_00000o1_environments_archiv...
programming_language                                              unknown
Name: 5, dtype: object


##  Step-4: Exact Deduplication

This step finds exact duplicates in the `contents` column and removes them. It computes a SHA256 hash of each file's contents, keeps one record for each distinct hash, and removes the rest.
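
The idea can be sketched in a few lines of plain Python. This is an illustration only, not the ededup transform itself: `exact_dedupe` is a hypothetical helper, and keeping the first occurrence of each hash is an assumption about which copy survives.

```python
import hashlib


def exact_dedupe(rows, text_column="contents", id_column="document_id"):
    """Keep the first row for each distinct content hash and report the dropped ids."""
    seen = set()
    kept, removed_ids = [], []
    for row in rows:
        digest = hashlib.sha256(row[text_column].encode("utf-8")).hexdigest()
        if digest in seen:
            removed_ids.append(row[id_column])  # identical content already kept
        else:
            seen.add(digest)
            kept.append(row)
    return kept, removed_ids


rows = [
    {"document_id": "a", "contents": "print('hi')\n"},
    {"document_id": "b", "contents": "x = 1\n"},
    {"document_id": "c", "contents": "print('hi')\n"},
]
kept, removed_ids = exact_dedupe(rows)
print([r["document_id"] for r in kept], removed_ids)
```

Because only hashes are kept in memory, this approach catches byte-identical files but not near-duplicates that differ by even a single character.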

### 4.1 - Set Input/output Folder

In [9]:
STAGE = 2

input_folder = output_parquet_dir # previous output folder is the input folder for the current stage
output_folder =  output_exact_dedupe_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_exact_dedupe_out'


### 4.2 - Execute

In [10]:
%%time

import sys

from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration
from ededup_transform_base import doc_column_name_cli_param, int_column_name_cli_param

from data_processing.utils import ParamsUtils

# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

params = {
    # ededup parameters
    doc_column_name_cli_param: "contents",
    int_column_name_cli_param: "document_id",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")


07:02:31 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'document_id', 'use_snapshot': False, 'snapshot_directory': None}
INFO:ededup_transform_base:exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'document_id', 'use_snapshot': False, 'snapshot_directory': None}
07:02:31 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
07:02:31 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
07:02:31 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_exact_dedupe_out
INFO:data_processing.data_access.data_access_factory_basefe731803-e4f4-48d1-8a49-1c2289617776:data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_exact_dedupe_out
07:02:31 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_acce

✅ Stage:2 completed successfully
CPU times: user 80.9 ms, sys: 9.7 ms, total: 90.6 ms
Wall time: 99.3 ms


### 4.3 - Inspect Generated output



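You will notice that the output has one row fewer than the input (73 vs. 74) because the duplicate was removed, and one extra column, `removed`, which records the `document_id` values of the removed duplicates.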

In [11]:
from my_utils import read_parquet_files_as_df
import pandas as pd

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)


Input data dimensions (rows x columns)=  (74, 10)
Output data dimensions (rows x columns)=  (73, 11)


Unnamed: 0,title,document,contents,document_id,ext,hash,size,date_acquired,repo_name,programming_language,removed
0,application-java/bin/application-java,application-java.zip,#!/usr/bin/env sh\n\n#\n# Copyright 2015 the o...,501a4498-7594-42a5-a027-28dcbfeab5e9,,318066feeb8ac614dd5eab57fbf2faa5a83f2b582a089f...,8168,2024-12-18T07:02:08.978176,application-java,unknown,[]
1,application-java/bin/application-java.bat,application-java.zip,@rem\r\n@rem Copyright 2015 the original autho...,6a9fb410-8156-4122-849d-d37786df53e8,.bat,e90dc018527e3b0798537919c4e49fb2b25ebec62aa7a9...,5539,2024-12-18T07:02:08.978398,application-java,Batchfile,[]
2,environments-master/.bash_aliases,https___github.com_00000o1_environments_archiv...,function gdrive_download () {\n CONFIRM=$(wge...,577af8fc-b90a-473a-8a81-c9d0019663ff,,c95ebe462ef03e13dcec8918f35932a80de119acad9271...,379,2024-12-18T07:02:09.739644,https___github.com_00000o1_environments_archiv...,unknown,[72e8e695-9447-4ea0-9458-02c29c57d982]
3,environments-master/.bashrc,https___github.com_00000o1_environments_archiv...,# ~/.bashrc: executed by bash(1) for non-login...,a0fa8941-a58d-4aa7-914a-98ae33ffaf5f,,c5ebce9359e2570aea09ef4dead7196210dbabf0311d79...,3840,2024-12-18T07:02:09.739896,https___github.com_00000o1_environments_archiv...,unknown,[]
4,environments-master/.gitconfig,https___github.com_00000o1_environments_archiv...,[user]\n\temail = writetocris@outlook.com\n\tn...,d9fb1b62-f875-408e-abd0-e1e2caf72f55,,286fc2d8b5211247e850506197280ee5a59c0c6dd11327...,161,2024-12-18T07:02:09.740032,https___github.com_00000o1_environments_archiv...,unknown,[]
5,environments-master/.gitignore,https___github.com_00000o1_environments_archiv...,.*.swp\ninstall_nvm.sh\nnodesource_setup.sh\n*...,a39ee0f8-9851-43bc-aa88-f14f7906d4c3,,d5b36ed76ab3816191cd51579a17f81c1735d05f1f2530...,50,2024-12-18T07:02:09.741148,https___github.com_00000o1_environments_archiv...,unknown,[]
6,environments-master/LICENSE,https___github.com_00000o1_environments_archiv...,MIT License\n\nCopyright (c) 2020 Cris Stringf...,cfbcd6cf-6698-43d2-a014-da457561b238,,df618e5918dc644fe48af59524ae0bb200c09db70e118b...,1100,2024-12-18T07:02:09.741324,https___github.com_00000o1_environments_archiv...,unknown,[]
7,environments-master/README.md,https___github.com_00000o1_environments_archiv...,# [:small_airplane: environments](https://gith...,b83cdc85-5edf-4059-9e5f-8c40fd12179a,.md,3a7b615878972b74384111a3ff92b54d68ba07a0b3d15d...,875,2024-12-18T07:02:09.741441,https___github.com_00000o1_environments_archiv...,Markdown,[]
8,environments-master/basic_setup,https___github.com_00000o1_environments_archiv...,git config --global gpg.format ssh\ngit config...,758c5336-6184-443c-b5f8-d3b59f63bb27,,f1175fbe8273c2b5c88a23c65c72194a4d22049862d324...,939,2024-12-18T07:02:09.741564,https___github.com_00000o1_environments_archiv...,unknown,[]
9,environments-master/cfortunes/acknowledge,https___github.com_00000o1_environments_archiv...,# Author\nhttps://github.com/threemachines/obl...,791c5411-fb13-41cc-ba87-b4739539a402,,5cbe02758a4501e5ac5017cf06e20cb55b216f9628435c...,196,2024-12-18T07:02:09.741668,https___github.com_00000o1_environments_archiv...,unknown,[]


##  Step-5: Code Quality

The code quality transform evaluates each file in the dataset and adds metrics that describe its structural properties, flag anomalies, and classify files based on their characteristics.

### 5.1 - Set Input/output Folder

In [12]:
STAGE = 3

input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage
output_folder =  output_code_quality_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-3: Processing input='output/02_exact_dedupe_out' --> output='output/03_code_quality_out'


### 5.2 - Execute

In [13]:

%%time

import sys
from code_quality_transform_python import CodeQualityPythonTransformConfiguration
from data_processing.utils import ParamsUtils


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

params = {
    # code quality parameters
    "cq_contents_column_name": "contents",
    "cq_language_column_name": "programming_language",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(CodeQualityPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

07:02:52 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
07:02:52 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
07:02:52 INFO - data factory data_ is using local data access: input_folder - output/02_exact_dedupe_out output_folder - output/03_code_quality_out
INFO:data_processing.data_access.data_access_factory_basefe731803-e4f4-48d1-8a49-1c2289617776:data factory data_ is using local data access: input_folder - output/02_exact_dedupe_out output_folder - output/03_code_quality_out
07:02:52 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_basefe731803-e4f4-48d1-8a49-1c2289617776:data factory data_ max_files -1, n_sample -1
07:02:52 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
INFO:data_processing.data_access

tokenizer_config.json:   0%|          | 0.00/259 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/497k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/277k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/840k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/90.0 [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (3071 > 1024). Running this sequence through the model will result in indexing errors
07:02:56 INFO - Completed 1 files (33.33%) in 0.0 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 1 files (33.33%) in 0.0 min
07:02:56 INFO - Completed 2 files (66.67%) in 0.001 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 2 files (66.67%) in 0.001 min
07:02:56 INFO - Completed 3 files (100.0%) in 0.003 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 3 files (100.0%) in 0.003 min
07:02:56 INFO - Done processing 3 files, waiting for flush() completion.
INFO:data_processing.runtime.pure_python.transform_orchestrator:Done processing 3 files, waiting for flush() completion.
07:02:56 INFO - done flushing in 0.0 sec
INFO:data_processing.runtime.pure_python.transform_orchestrator:done flushing in 0.0 sec
07:02:56 INFO - 

✅ Stage:3 completed successfully
CPU times: user 3.63 s, sys: 468 ms, total: 4.1 s
Wall time: 8.39 s


### 5.3 - Inspect Generated output



You will notice we have extra columns:

- **line_mean**: Average line length.
- **line_max**: Length of the longest line.
- **total_num_lines**: Number of lines.
- **avg_longest_lines**: Average length of the top n longest lines.
- **alphanum_frac**: Fraction of characters that are alphanumeric.
- **char_token_ratio**: Character-to-token ratio.
- **autogenerated**: Whether the file appears to be autogenerated.
- **config_or_test**: Whether the file appears to be a config or test file.
- **has_no_keywords**: Whether the file contains no Python keywords (e.g., `class`, `def`).
- **has_few_assignments**: Whether the file has fewer than a minimum number of `=` signs.
- **is_xml/is_html**: Whether the content is XML or HTML.

But we still have the same number of rows as before.
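
As a rough illustration of the simpler line-based metrics, the sketch below computes four of them with plain Python. The definitions are assumptions made for illustration and may differ from the transform's own; `line_stats` is a hypothetical helper. The sample string is the `.gitignore` file from the sample data.

```python
def line_stats(contents: str) -> dict:
    """Compute a few line-based metrics for one file (assumed definitions)."""
    lines = contents.splitlines()
    lengths = [len(line) for line in lines]
    alnum = sum(ch.isalnum() for ch in contents)  # count alphanumeric characters
    return {
        "total_num_lines": len(lines),
        "line_max": max(lengths, default=0),
        "line_mean": sum(lengths) / len(lengths) if lengths else 0.0,
        "alphanum_frac": alnum / len(contents) if contents else 0.0,
    }


sample = ".*.swp\ninstall_nvm.sh\nnodesource_setup.sh\n*.patch\n"
stats = line_stats(sample)
print(stats)
```

For this file the sketch gives `total_num_lines` of 4 and `alphanum_frac` of 0.74, which agrees with the values the transform reports for `.gitignore` in the output table; the other metrics are not guaranteed to match.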


In [14]:
from my_utils import read_parquet_files_as_df
import pprint

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)



pprint.pprint(f"output table columns: {output_df.columns}")

output_df.head(10)



Input data dimensions (rows x columns)=  (73, 11)
Output data dimensions (rows x columns)=  (73, 23)
("output table columns: Index(['title', 'document', 'contents', 'document_id', "
 "'ext', 'hash', 'size',\n"
 "       'date_acquired', 'repo_name', 'programming_language', 'removed',\n"
 "       'line_mean', 'line_max', 'total_num_lines', 'avg_longest_lines',\n"
 "       'alphanum_frac', 'char_token_ratio', 'autogenerated', "
 "'config_or_test',\n"
 "       'has_no_keywords', 'has_few_assignments', 'is_xml', 'is_html'],\n"
 "      dtype='object')")


Unnamed: 0,title,document,contents,document_id,ext,hash,size,date_acquired,repo_name,programming_language,...,total_num_lines,avg_longest_lines,alphanum_frac,char_token_ratio,autogenerated,config_or_test,has_no_keywords,has_few_assignments,is_xml,is_html
0,application-java/bin/application-java,application-java.zip,#!/usr/bin/env sh\n\n#\n# Copyright 2015 the o...,501a4498-7594-42a5-a027-28dcbfeab5e9,,318066feeb8ac614dd5eab57fbf2faa5a83f2b582a089f...,8168,2024-12-18T07:02:08.978176,application-java,unknown,...,185,449.857143,0.628795,2.65972,False,False,False,False,False,False
1,application-java/bin/application-java.bat,application-java.zip,@rem\r\n@rem Copyright 2015 the original autho...,6a9fb410-8156-4122-849d-d37786df53e8,.bat,e90dc018527e3b0798537919c4e49fb2b25ebec62aa7a9...,5539,2024-12-18T07:02:08.978398,application-java,Batchfile,...,104,448.285714,0.702293,2.551359,False,False,False,False,False,False
2,environments-master/.bash_aliases,https___github.com_00000o1_environments_archiv...,function gdrive_download () {\n CONFIRM=$(wge...,577af8fc-b90a-473a-8a81-c9d0019663ff,,c95ebe462ef03e13dcec8918f35932a80de119acad9271...,379,2024-12-18T07:02:09.739644,https___github.com_00000o1_environments_archiv...,unknown,...,5,74.8,0.675462,2.65035,False,False,False,False,False,False
3,environments-master/.bashrc,https___github.com_00000o1_environments_archiv...,# ~/.bashrc: executed by bash(1) for non-login...,a0fa8941-a58d-4aa7-914a-98ae33ffaf5f,,c5ebce9359e2570aea09ef4dead7196210dbabf0311d79...,3840,2024-12-18T07:02:09.739896,https___github.com_00000o1_environments_archiv...,unknown,...,119,92.285714,0.678646,2.909091,False,False,False,False,False,False
4,environments-master/.gitconfig,https___github.com_00000o1_environments_archiv...,[user]\n\temail = writetocris@outlook.com\n\tn...,d9fb1b62-f875-408e-abd0-e1e2caf72f55,,286fc2d8b5211247e850506197280ee5a59c0c6dd11327...,161,2024-12-18T07:02:09.740032,https___github.com_00000o1_environments_archiv...,unknown,...,11,18.142857,0.68323,2.367647,False,False,False,False,False,False
5,environments-master/.gitignore,https___github.com_00000o1_environments_archiv...,.*.swp\ninstall_nvm.sh\nnodesource_setup.sh\n*...,a39ee0f8-9851-43bc-aa88-f14f7906d4c3,,d5b36ed76ab3816191cd51579a17f81c1735d05f1f2530...,50,2024-12-18T07:02:09.741148,https___github.com_00000o1_environments_archiv...,unknown,...,4,11.5,0.74,2.272727,False,False,False,False,False,False
6,environments-master/LICENSE,https___github.com_00000o1_environments_archiv...,MIT License\n\nCopyright (c) 2020 Cris Stringf...,cfbcd6cf-6698-43d2-a014-da457561b238,,df618e5918dc644fe48af59524ae0bb200c09db70e118b...,1100,2024-12-18T07:02:09.741324,https___github.com_00000o1_environments_archiv...,unknown,...,21,76.571429,0.806364,4.621849,False,False,False,False,False,False
7,environments-master/README.md,https___github.com_00000o1_environments_archiv...,# [:small_airplane: environments](https://gith...,b83cdc85-5edf-4059-9e5f-8c40fd12179a,.md,3a7b615878972b74384111a3ff92b54d68ba07a0b3d15d...,875,2024-12-18T07:02:09.741441,https___github.com_00000o1_environments_archiv...,Markdown,...,19,115.142857,0.757714,3.417969,False,False,False,False,False,False
8,environments-master/basic_setup,https___github.com_00000o1_environments_archiv...,git config --global gpg.format ssh\ngit config...,758c5336-6184-443c-b5f8-d3b59f63bb27,,f1175fbe8273c2b5c88a23c65c72194a4d22049862d324...,939,2024-12-18T07:02:09.741564,https___github.com_00000o1_environments_archiv...,unknown,...,20,67.428571,0.769968,3.019293,False,True,False,False,False,False
9,environments-master/cfortunes/acknowledge,https___github.com_00000o1_environments_archiv...,# Author\nhttps://github.com/threemachines/obl...,791c5411-fb13-41cc-ba87-b4739539a402,,5cbe02758a4501e5ac5017cf06e20cb55b216f9628435c...,196,2024-12-18T07:02:09.741668,https___github.com_00000o1_environments_archiv...,unknown,...,7,27.0,0.780612,2.8,False,False,False,False,False,False


In [15]:

pprint.pprint("-----------------Contents---------------------------")
total_num_lines_rows=output_df[output_df["total_num_lines"]<10].head(3)
for _, row in total_num_lines_rows.iterrows():  # Use the second element (row) from the tuple
    pprint.pprint(f'-------Total Num Lines {row["total_num_lines"]}------\n{row["contents"]}\n-------')

'-----------------Contents---------------------------'
('-------Total Num Lines 5------\n'
 'function gdrive_download () {\n'
 '  CONFIRM=$(wget --quiet --save-cookies /tmp/cookies.txt '
 '--keep-session-cookies --no-check-certificate '
 '"https://docs.google.com/uc?export=download&id=$1" -O- | sed -rn '
 "'s/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p')\n"
 '  wget --load-cookies /tmp/cookies.txt '
 '"https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$1" -O $2\n'
 '  rm -rf /tmp/cookies.txt\n'
 '}\n'
 '\n'
 '-------')
('-------Total Num Lines 4------\n'
 '.*.swp\n'
 'install_nvm.sh\n'
 'nodesource_setup.sh\n'
 '*.patch\n'
 '\n'
 '-------')
('-------Total Num Lines 7------\n'
 '# Author\n'
 'https://github.com/threemachines/obliqueMOTD\n'
 '\n'
 '# Original Author\n'
 'Brian Eno (Oblique Strategies, c. 1970)\n'
 'Richard Diebenkorn (Notes to Myself on Beginning a Painting, unpublished, c. '
 '20C)\n'
 '\n'
 '\n'
 '-------')


##  Step-6: HAP

The HAP (hate, abuse, and profanity) transform maps a non-empty input table to an output table with an added `hap_score` column. Each row in the table represents a document, and the transform performs the following three steps to calculate the HAP score for each document:

- Sentence splitting: NLTK is used to split the document into sentences.
- HAP annotation: each sentence is assigned a HAP score between 0 and 1, where 1 represents HAP and 0 represents non-HAP.
- Aggregation: the document HAP score is the maximum HAP score among its sentences.
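
The three steps can be sketched as follows. This is a toy illustration, not the transform's code: a naive regex splitter stands in for NLTK, and `toy_score` is a hypothetical keyword-based placeholder for the classifier model. Only the max-aggregation step mirrors the description above.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive stand-in for NLTK's sentence tokenizer: split after '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_score(sentence: str) -> float:
    # Hypothetical placeholder scorer; the real transform uses a classifier model.
    flagged = {"idiot", "stupid"}
    words = re.findall(r"[a-z']+", sentence.lower())
    return 1.0 if any(w in flagged for w in words) else 0.0


def document_hap_score(text: str, scorer=toy_score) -> float:
    # Aggregation: the document score is the maximum sentence score.
    sentences = split_sentences(text)
    return max((scorer(s) for s in sentences), default=0.0)


print(document_hap_score("This code works. You are an idiot! Thanks."))
print(document_hap_score("All good here. Nothing to flag."))
```

Taking the maximum means a single flagged sentence is enough to give the whole document a high score, which suits a filter that errs on the side of caution.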


### 6.1 - Set Input/output Folder

In [16]:
STAGE = 4

input_folder = output_code_quality_dir # previous output folder is the input folder for the current stage
output_folder =  output_hap_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-4: Processing input='output/03_code_quality_out' --> output='output/04_hap_out'


### 6.2 - Execute

In [17]:
%%time

import sys
from hap_transform_python import HAPPythonTransformConfiguration
from data_processing.utils import ParamsUtils


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

params = {
    # hap  parameters
    "model_name_or_path": 'ibm-granite/granite-guardian-hap-38m',
    "annotation_column": "hap_score",
    "doc_text_column": "contents",
    "inference_engine": "CPU",
    "max_length": 512,
    "batch_size": 128,
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(HAPPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
07:03:27 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} 
INFO:hap_transform:hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} 
07:03:27 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
07:03:27 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
07:03:27 INFO - data factory data_ is using local data access: input_folder - output/03_code_quality_out output_folder - output/04_hap_out
INFO:data_processing.data_access.data_access_factory_basefe731803-e4f4



07:03:57 INFO - Completed 1 files (33.33%) in 0.331 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 1 files (33.33%) in 0.331 min


                                       title              document  \
0      application-java/bin/application-java  application-java.zip   
1  application-java/bin/application-java.bat  application-java.zip   

                                            contents  \
0  #!/usr/bin/env sh\n\n#\n# Copyright 2015 the o...   
1  @rem\r\n@rem Copyright 2015 the original autho...   

                            document_id   ext  \
0  501a4498-7594-42a5-a027-28dcbfeab5e9         
1  6a9fb410-8156-4122-849d-d37786df53e8  .bat   

                                                hash  size  \
0  318066feeb8ac614dd5eab57fbf2faa5a83f2b582a089f...  8168   
1  e90dc018527e3b0798537919c4e49fb2b25ebec62aa7a9...  5539   

                date_acquired         repo_name programming_language  ...  \
0  2024-12-18T07:02:08.978176  application-java              unknown  ...   
1  2024-12-18T07:02:08.978398  application-java            Batchfile  ...   

  avg_longest_lines  alphanum_frac  char_token_ratio 

07:04:42 INFO - Completed 2 files (66.67%) in 1.093 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 2 files (66.67%) in 1.093 min


                                                title  \
0                                      ray/.gitignore   
1   ray/test/data_processing_ray_tests/launch/ray/...   
2   ray/test/data_processing_ray_tests/launch/ray/...   
3   ray/test/data_processing_ray_tests/launch/ray/...   
4   ray/test/data_processing_ray_tests/launch/ray/...   
5                                  ray/pyproject.toml   
6   ray/test-data/data_processing/ray/noop/expecte...   
7                                        ray/Makefile   
8   ray/src/data_processing_ray/test_support/trans...   
9   ray/src/data_processing_ray/test_support/trans...   
10  ray/src/data_processing_ray/runtime/ray/transf...   
11  ray/src/data_processing_ray/runtime/ray/ray_ut...   
12  ray/src/data_processing_ray/runtime/ray/transf...   
13  ray/src/data_processing_ray/runtime/ray/transf...   
14  ray/src/data_processing_ray/runtime/ray/transf...   
15  ray/src/data_processing_ray/runtime/ray/transf...   
16  ray/src/data_processing_ray

07:06:03 INFO - Completed 3 files (100.0%) in 2.43 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 3 files (100.0%) in 2.43 min
07:06:03 INFO - Done processing 3 files, waiting for flush() completion.
INFO:data_processing.runtime.pure_python.transform_orchestrator:Done processing 3 files, waiting for flush() completion.
07:06:03 INFO - done flushing in 0.0 sec
INFO:data_processing.runtime.pure_python.transform_orchestrator:done flushing in 0.0 sec
07:06:03 INFO - Completed execution in 2.588 min, execution result 0
INFO:data_processing.runtime.pure_python.transform_launcher:Completed execution in 2.588 min, execution result 0


                                              title  \
0                 environments-master/.bash_aliases   
1                       environments-master/.bashrc   
2                    environments-master/.gitconfig   
3                    environments-master/.gitignore   
4                       environments-master/LICENSE   
5                     environments-master/README.md   
6                   environments-master/basic_setup   
7         environments-master/cfortunes/acknowledge   
8    environments-master/cfortunes/diebenkorn_notes   
9   environments-master/cfortunes/obliquestrategies   
10             environments-master/commands/addswap   
11                 environments-master/commands/adk   
12                 environments-master/commands/arx   
13     environments-master/commands/auto_cert_renew   
14                 environments-master/commands/cfo   
15            environments-master/commands/cp_certs   
16           environments-master/commands/diffrepos   
17        

### 6.3 - Inspect Generated output


In [18]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print (f"Files processed : {input_df.shape[0]:,}")
print (f"Rows created : {output_df.shape[0]:,}")

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

Files processed : 73
Rows created : 73
Input data dimensions (rows x columns)=  (73, 23)
Output data dimensions (rows x columns)=  (73, 24)


Unnamed: 0,title,document,contents,document_id,ext,hash,size,date_acquired,repo_name,programming_language,...,avg_longest_lines,alphanum_frac,char_token_ratio,autogenerated,config_or_test,has_no_keywords,has_few_assignments,is_xml,is_html,hap_score
0,application-java/bin/application-java,application-java.zip,#!/usr/bin/env sh\n\n#\n# Copyright 2015 the o...,501a4498-7594-42a5-a027-28dcbfeab5e9,,318066feeb8ac614dd5eab57fbf2faa5a83f2b582a089f...,8168,2024-12-18T07:02:08.978176,application-java,unknown,...,449.857143,0.628795,2.65972,False,False,False,False,False,False,0.113481
1,application-java/bin/application-java.bat,application-java.zip,@rem\r\n@rem Copyright 2015 the original autho...,6a9fb410-8156-4122-849d-d37786df53e8,.bat,e90dc018527e3b0798537919c4e49fb2b25ebec62aa7a9...,5539,2024-12-18T07:02:08.978398,application-java,Batchfile,...,448.285714,0.702293,2.551359,False,False,False,False,False,False,0.153107
2,environments-master/.bash_aliases,https___github.com_00000o1_environments_archiv...,function gdrive_download () {\n CONFIRM=$(wge...,577af8fc-b90a-473a-8a81-c9d0019663ff,,c95ebe462ef03e13dcec8918f35932a80de119acad9271...,379,2024-12-18T07:02:09.739644,https___github.com_00000o1_environments_archiv...,unknown,...,74.8,0.675462,2.65035,False,False,False,False,False,False,0.00044
3,environments-master/.bashrc,https___github.com_00000o1_environments_archiv...,# ~/.bashrc: executed by bash(1) for non-login...,a0fa8941-a58d-4aa7-914a-98ae33ffaf5f,,c5ebce9359e2570aea09ef4dead7196210dbabf0311d79...,3840,2024-12-18T07:02:09.739896,https___github.com_00000o1_environments_archiv...,unknown,...,92.285714,0.678646,2.909091,False,False,False,False,False,False,0.187185
4,environments-master/.gitconfig,https___github.com_00000o1_environments_archiv...,[user]\n\temail = writetocris@outlook.com\n\tn...,d9fb1b62-f875-408e-abd0-e1e2caf72f55,,286fc2d8b5211247e850506197280ee5a59c0c6dd11327...,161,2024-12-18T07:02:09.740032,https___github.com_00000o1_environments_archiv...,unknown,...,18.142857,0.68323,2.367647,False,False,False,False,False,False,0.003179
5,environments-master/.gitignore,https___github.com_00000o1_environments_archiv...,.*.swp\ninstall_nvm.sh\nnodesource_setup.sh\n*...,a39ee0f8-9851-43bc-aa88-f14f7906d4c3,,d5b36ed76ab3816191cd51579a17f81c1735d05f1f2530...,50,2024-12-18T07:02:09.741148,https___github.com_00000o1_environments_archiv...,unknown,...,11.5,0.74,2.272727,False,False,False,False,False,False,0.002326
6,environments-master/LICENSE,https___github.com_00000o1_environments_archiv...,MIT License\n\nCopyright (c) 2020 Cris Stringf...,cfbcd6cf-6698-43d2-a014-da457561b238,,df618e5918dc644fe48af59524ae0bb200c09db70e118b...,1100,2024-12-18T07:02:09.741324,https___github.com_00000o1_environments_archiv...,unknown,...,76.571429,0.806364,4.621849,False,False,False,False,False,False,0.000312
7,environments-master/README.md,https___github.com_00000o1_environments_archiv...,# [:small_airplane: environments](https://gith...,b83cdc85-5edf-4059-9e5f-8c40fd12179a,.md,3a7b615878972b74384111a3ff92b54d68ba07a0b3d15d...,875,2024-12-18T07:02:09.741441,https___github.com_00000o1_environments_archiv...,Markdown,...,115.142857,0.757714,3.417969,False,False,False,False,False,False,0.000222
8,environments-master/basic_setup,https___github.com_00000o1_environments_archiv...,git config --global gpg.format ssh\ngit config...,758c5336-6184-443c-b5f8-d3b59f63bb27,,f1175fbe8273c2b5c88a23c65c72194a4d22049862d324...,939,2024-12-18T07:02:09.741564,https___github.com_00000o1_environments_archiv...,unknown,...,67.428571,0.769968,3.019293,False,True,False,False,False,False,0.003785
9,environments-master/cfortunes/acknowledge,https___github.com_00000o1_environments_archiv...,# Author\nhttps://github.com/threemachines/obl...,791c5411-fb13-41cc-ba87-b4739539a402,,5cbe02758a4501e5ac5017cf06e20cb55b216f9628435c...,196,2024-12-18T07:02:09.741668,https___github.com_00000o1_environments_archiv...,unknown,...,27.0,0.780612,2.8,False,False,False,False,False,False,0.000215


##  Step-7: Filtering

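This step filters the annotated code files on conditions of our choosing. Here we keep files that have more than 10 and fewer than 90 lines and a hap_score below 0.5; both criteria must hold because the logical operator is AND.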
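
To make the effect of the criteria and the logical operator concrete, here is a minimal sketch in plain Python. It does not use the filter transform; `apply_filter` is a hypothetical helper, the criteria are rewritten as Python predicates, and the four rows are made-up values chosen only to exercise each condition.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]

# The two criteria configured in this step, written as plain predicates.
criteria: List[Callable[[Row], bool]] = [
    lambda r: 10 < r["total_num_lines"] < 90,
    lambda r: r["hap_score"] < 0.5,
]


def apply_filter(rows: List[Row], criteria: List[Callable[[Row], bool]],
                 logical_operator: str = "AND") -> List[Row]:
    """Keep rows that satisfy all criteria (AND) or at least one criterion (OR)."""
    if logical_operator not in ("AND", "OR"):
        raise ValueError("logical_operator must be 'AND' or 'OR'")
    combine = all if logical_operator == "AND" else any
    return [r for r in rows if combine(c(r) for c in criteria)]


# Illustrative rows only; these are not taken from the dataset.
rows = [
    {"total_num_lines": 7, "hap_score": 0.0002},   # too few lines
    {"total_num_lines": 42, "hap_score": 0.003},   # passes both criteria
    {"total_num_lines": 42, "hap_score": 0.8},     # hap_score too high
    {"total_num_lines": 300, "hap_score": 0.1},    # too many lines
]

kept = apply_filter(rows, criteria, "AND")
print(len(kept))
```

With AND, only the second row survives. Switching the operator to OR would keep every row in this toy set, since each one satisfies at least one criterion.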

### 7.1 - Set Input/output Folder

In [19]:
STAGE = 5

input_folder = output_hap_dir # previous output folder is the input folder for the current stage
output_folder =  output_filter_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-5: Processing input='output/04_hap_out' --> output='output/05_filter_out'


### 7.2 - Execute

In [20]:
%%time

import sys
from filter_transform import (
    filter_columns_to_drop_cli_param,
    filter_criteria_cli_param,
    filter_logical_operator_cli_param,
)
from filter_transform_python import FilterPythonTransformConfiguration
from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils



# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

filter_criteria = [
    "total_num_lines > 10 AND total_num_lines < 90",
    "hap_score < 0.5",
]
filter_logical_operator = "AND"
filter_columns_to_drop = []

params = {

    # filter parameters
    filter_criteria_cli_param: filter_criteria,
    filter_columns_to_drop_cli_param: filter_columns_to_drop,
    filter_logical_operator_cli_param: filter_logical_operator,
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(FilterPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

07:06:19 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
07:06:19 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
07:06:19 INFO - data factory data_ is using local data access: input_folder - output/04_hap_out output_folder - output/05_filter_out
INFO:data_processing.data_access.data_access_factory_basefe731803-e4f4-48d1-8a49-1c2289617776:data factory data_ is using local data access: input_folder - output/04_hap_out output_folder - output/05_filter_out
07:06:19 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_basefe731803-e4f4-48d1-8a49-1c2289617776:data factory data_ max_files -1, n_sample -1
07:06:19 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
INFO:data_processing.data_access.data_access_factory_basefe731

✅ Stage:5 completed successfully
CPU times: user 188 ms, sys: 49.5 ms, total: 237 ms
Wall time: 562 ms


### 7.3 - Inspect Generated output


In [21]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print (f"Files processed : {input_df.shape[0]:,}")
print (f"Rows created : {output_df.shape[0]:,}")

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

Files processed : 73
Rows created : 33
Input data dimensions (rows x columns)=  (73, 24)
Output data dimensions (rows x columns)=  (33, 24)


Unnamed: 0,title,document,contents,document_id,ext,hash,size,date_acquired,repo_name,programming_language,...,avg_longest_lines,alphanum_frac,char_token_ratio,autogenerated,config_or_test,has_no_keywords,has_few_assignments,is_xml,is_html,hap_score
0,environments-master/.gitconfig,https___github.com_00000o1_environments_archiv...,[user]\n\temail = writetocris@outlook.com\n\tn...,d9fb1b62-f875-408e-abd0-e1e2caf72f55,,286fc2d8b5211247e850506197280ee5a59c0c6dd11327...,161,2024-12-18T07:02:09.740032,https___github.com_00000o1_environments_archiv...,unknown,...,18.142857,0.68323,2.367647,False,False,False,False,False,False,0.003179
1,environments-master/LICENSE,https___github.com_00000o1_environments_archiv...,MIT License\n\nCopyright (c) 2020 Cris Stringf...,cfbcd6cf-6698-43d2-a014-da457561b238,,df618e5918dc644fe48af59524ae0bb200c09db70e118b...,1100,2024-12-18T07:02:09.741324,https___github.com_00000o1_environments_archiv...,unknown,...,76.571429,0.806364,4.621849,False,False,False,False,False,False,0.000312
2,environments-master/README.md,https___github.com_00000o1_environments_archiv...,# [:small_airplane: environments](https://gith...,b83cdc85-5edf-4059-9e5f-8c40fd12179a,.md,3a7b615878972b74384111a3ff92b54d68ba07a0b3d15d...,875,2024-12-18T07:02:09.741441,https___github.com_00000o1_environments_archiv...,Markdown,...,115.142857,0.757714,3.417969,False,False,False,False,False,False,0.000222
3,environments-master/basic_setup,https___github.com_00000o1_environments_archiv...,git config --global gpg.format ssh\ngit config...,758c5336-6184-443c-b5f8-d3b59f63bb27,,f1175fbe8273c2b5c88a23c65c72194a4d22049862d324...,939,2024-12-18T07:02:09.741564,https___github.com_00000o1_environments_archiv...,unknown,...,67.428571,0.769968,3.019293,False,True,False,False,False,False,0.003785
4,environments-master/cfortunes/diebenkorn_notes,https___github.com_00000o1_environments_archiv...,Attempt what is not certain. Certainty may or ...,ab08881c-4d6f-4b2f-bee7-ce8ceee16290,,56c3dc574898fc8f104eae7481404091d2693822531aae...,759,2024-12-18T07:02:09.741782,https___github.com_00000o1_environments_archiv...,unknown,...,86.857143,0.766798,3.481651,False,False,False,False,False,False,0.00923
5,environments-master/commands/addswap,https___github.com_00000o1_environments_archiv...,"#!/usr/bin/env bash\n\nif [[ -z ""$1"" ]]; then\...",fd9b1845-53b7-46f3-819e-7592edaf5459,,76df5315cd84ca2f732c4b00b04b62df12d66ca2114404...,402,2024-12-18T07:02:09.749356,https___github.com_00000o1_environments_archiv...,unknown,...,37.428571,0.664179,2.830986,False,False,False,False,False,False,0.000722
6,environments-master/commands/auto_cert_renew,https___github.com_00000o1_environments_archiv...,"#!/usr/bin/env bash\n\nif [[ -z ""$1"" ]]; then\...",0a1a9f14-eb41-4764-8f74-b207efb0fb66,,70fc0311900ae36b0ad751d10420f3b346f6117b768d48...,429,2024-12-18T07:02:09.749632,https___github.com_00000o1_environments_archiv...,unknown,...,53.428571,0.631702,2.732484,False,False,False,False,False,False,0.000368
7,environments-master/commands/cp_certs,https___github.com_00000o1_environments_archiv...,"#!/usr/bin/env bash\n\nif [[ -z ""$1"" ]]; then\...",227834fd-c743-460f-987b-11b5d1972ab2,,fd7da5d160e0b19cd0318c64edd07e20b31eb7471f926a...,240,2024-12-18T07:02:09.749796,https___github.com_00000o1_environments_archiv...,unknown,...,30.714286,0.625,2.857143,False,False,False,False,False,False,0.000254
8,environments-master/commands/gclean,https___github.com_00000o1_environments_archiv...,"#!/bin/bash\n\nset -e\n\nbash -c ""java --versi...",656ebaf1-ee1a-45ac-bb70-eb4ecfc14b0c,,400d421b250439d06fdcc549d319414606360614c6ae24...,356,2024-12-18T07:02:09.750925,https___github.com_00000o1_environments_archiv...,unknown,...,42.571429,0.63764,2.170732,False,False,False,False,False,False,0.002831
9,environments-master/commands/gitrmdir,https___github.com_00000o1_environments_archiv...,#!/bin/bash\nif [[ $# -eq 0 ]] ; then\n echo ...,b3000c2e-821b-4c2d-ae6d-62831d39ef6a,,8027daa1b11ab752a52c64fc651c82e7193cf9928b417b...,1136,2024-12-18T07:02:09.751341,https___github.com_00000o1_environments_archiv...,unknown,...,87.285714,0.704225,3.401198,False,False,False,False,False,False,0.002564


##  Step-8: Tokenization

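Next, we tokenize the filtered data so that it can be used for fine-tuning.
The tokenization module can use any Hugging Face-compatible tokenizer. The cell below passes only the data-access parameters, so the transform uses its default tokenizer.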
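
As a rough illustration of what this stage produces for each document (the tokens, document_id, document_length and token_count columns shown in 8.3), consider the sketch below. It is not the transform's code: `tokenize_row` and `make_toy_encoder` are hypothetical helpers, the toy encoder merely assigns one id per whitespace-separated word, and document_length is assumed here to be the character count of the text.

```python
from typing import Callable, Dict, List


def tokenize_row(document_id: str, contents: str,
                 encode: Callable[[str], List[int]]) -> Dict[str, object]:
    """Build one output row from a document, using any text -> token-id function."""
    tokens = encode(contents)
    return {
        "tokens": tokens,
        "document_id": document_id,
        "document_length": len(contents),  # assumption: length in characters
        "token_count": len(tokens),
    }


def make_toy_encoder() -> Callable[[str], List[int]]:
    """Toy stand-in for a real tokenizer: one id per whitespace-separated word."""
    vocab: Dict[str, int] = {}

    def encode(text: str) -> List[int]:
        ids = [1]  # hypothetical begin-of-sequence id
        for word in text.split():
            ids.append(vocab.setdefault(word, len(vocab) + 2))
        return ids

    return encode


row = tokenize_row("doc-1", "git config --global", make_toy_encoder())
print(row)
```

With the Hugging Face `transformers` library installed, the `encode` method of a tokenizer loaded via `AutoTokenizer.from_pretrained(...)` could be passed in place of the toy encoder.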



### 8.1 - Set Input/output Folder

In [22]:
STAGE = 6

input_folder = output_filter_dir # previous output folder is the input folder for the current stage
output_folder =  output_tokenisation_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-6: Processing input='output/05_filter_out' --> output='output/06_tokenisation_out'


### 8.2 - Execute

In [23]:

%%time

import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from tokenization_transform_python import TokenizationPythonConfiguration


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),

}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(TokenizationPythonConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

07:06:37 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
07:06:37 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
07:06:37 INFO - data factory data_ is using local data access: input_folder - output/05_filter_out output_folder - output/06_tokenisation_out
INFO:data_processing.data_access.data_access_factory_basefe731803-e4f4-48d1-8a49-1c2289617776:data factory data_ is using local data access: input_folder - output/05_filter_out output_folder - output/06_tokenisation_out
07:06:37 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_basefe731803-e4f4-48d1-8a49-1c2289617776:data factory data_ max_files -1, n_sample -1
07:06:37 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
INFO:data_processing.data_access.data_access


07:06:38 INFO - Completed 1 files (33.33%) in 0.0 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 1 files (33.33%) in 0.0 min
07:06:38 INFO - Completed 2 files (66.67%) in 0.001 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 2 files (66.67%) in 0.001 min
07:06:39 INFO - Completed 3 files (100.0%) in 0.001 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 3 files (100.0%) in 0.001 min
07:06:39 INFO - Done processing 3 files, waiting for flush() completion.
INFO:data_processing.runtime.pure_python.transform_orchestrator:Done processing 3 files, waiting for flush() completion.
07:06:39 INFO - done flushing in 0.0 sec
INFO:data_processing.runtime.pure_python.transform_orchestrator:done flushing in 0.0 sec
07:06:39 INFO - Completed execution in 0.026 min, execution result 0
INFO:data_processing.runtime.pure_python.transform_launcher:Completed execution in 0.026 min, execution result 0


✅ Stage:6 completed successfully
CPU times: user 375 ms, sys: 60.1 ms, total: 435 ms
Wall time: 1.57 s


### 8.3 - Inspect Generated output

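Here we should see the contents column replaced by its tokenized form: each row now holds the tokens, the document_id, the document_length, and the token_count.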

In [24]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)



Output dimensions (rows x columns)=  (33, 4)


| | tokens | document_id | document_length | token_count |
|---|---|---|---|---|
| 0 | [1, 518, 1792, 29962, 13, 12, 5269, 353, 2044,... | d9fb1b62-f875-408e-abd0-e1e2caf72f55 | 161 | 70 |
| 1 | [1, 341, 1806, 19245, 13, 13, 11882, 1266, 313... | cfbcd6cf-6698-43d2-a014-da457561b238 | 1100 | 352 |
| 2 | [1, 396, 20840, 9278, 29918, 1466, 22116, 2990... | b83cdc85-5edf-4059-9e5f-8c40fd12179a | 875 | 280 |
| 3 | [1, 6315, 2295, 1192, 10945, 330, 4061, 29889,... | 758c5336-6184-443c-b5f8-d3b59f63bb27 | 939 | 352 |
| 4 | [1, 6212, 3456, 825, 338, 451, 3058, 29889, 31... | ab08881c-4d6f-4b2f-bee7-ce8ceee16290 | 759 | 220 |


**The data is now ready for extended pretraining or fine-tuning of any open-source code model.**