# **Demo: building a data prep pipeline for model fine-tuning**

<a href="https://colab.research.google.com/github/IBM/data-prep-kit/blob/tree/dev/examples/notebooks/code/sample-notebook.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook demonstrates data preparation techniques for fine-tuning language models using the Data Prep Kit.
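The workflow has five stages: convert source code archives to Parquet, remove exact duplicates, compute code quality metrics, filter the files, and tokenize the result.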


## How to run this notebook

Two options:

- **Option 1 - Google Colab:** The easiest option; no setup required. Click this badge to open the notebook in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/cods-demo.ipynb)
- **Option 2 - Local Python dev environment:** Set up using this [guide](../../../README.md#-getting-started)


## Step-1: SET UP

### 1.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.


In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 1.2 - Download Data if running on Google Colab

In [2]:
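# NOTE: the next line overrides the detection from step 1.1, so the sample data
# is downloaded in every environment. Remove it to download only on Colab.
RUNNING_IN_COLAB = True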
if RUNNING_IN_COLAB:
    !mkdir -p 'input/source-code-data'

    !wget -O 'input/source-code-data/application-java.zip'  'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/input/application-java.zip'
    !wget -O 'input/source-code-data/data-processing-lib.zip' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/input/data-processing-lib.zip'
    !wget -O 'input/source-code-data/https___github.com_00000o1_environments_archive_refs_heads_master.zip' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/input/https___github.com_00000o1_environments_archive_refs_heads_master.zip'
    !wget -O 'my_utils.py'  'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/my_utils.py'
    !wget -O 'language.json'  'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/languages/lang_extensions.json'


--2024-12-14 00:03:46--  https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/input/application-java.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28680721 (27M) [application/zip]
Saving to: ‘input/source-code-data/application-java.zip’


2024-12-14 00:03:50 (7.80 MB/s) - ‘input/source-code-data/application-java.zip’ saved [28680721/28680721]

--2024-12-14 00:03:50--  https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/code2parquet/python/test-data/input/data-processing-lib.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443

### 1.3 - Install dependencies if running on Google Colab

In [3]:
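# NOTE: the next line overrides the detection from step 1.1, so the packages
# are installed in every environment. Remove it to install only on Colab.
RUNNING_IN_COLAB = True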
if RUNNING_IN_COLAB:
    !pip install "data-prep-toolkit-transforms[all]==0.2.2"
    !pip install datasets
    !pip install pandas
    !pip install humanfriendly

Collecting argparse (from data-prep-toolkit>=0.2.2->data-prep-toolkit-transforms==0.2.2->data-prep-toolkit-transforms[all]==0.2.2)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


### 1.4 - Restart Runtime

After installing dependencies, be sure to <font color="red">restart the runtime</font> so the newly installed libraries are loaded.

You do this by going to **`Runtime --> Restart Session`**

Then you can continue to the next step (no need to re-run the earlier cells).

## Step-2: Configure


### 2.1 - Basic configuration

In [23]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


In [4]:
import os

## Configuration
class MyConfig:
    pass

MY_CONFIG = MyConfig ()

MY_CONFIG.INPUT_DATA_DIR = 'input/source-code-data/'

MY_CONFIG.OUTPUT_FOLDER = "output"
MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , "output_final")



In [5]:
## Add parent dir to path
import os,sys

this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

### 2.2 - Set up input/output directories

In [6]:
import os
import shutil

if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):
    raise Exception (f"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found")

output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')

output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_exact_dedupe_out')
output_code_quality_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_code_quality_out')
output_filter_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_filter_out')
output_tokenisation_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_tokenisation_out')

## clear output folder
shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)

print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-3: Data ingestion -  Convert source data to Parquet


This is the first component of the pipeline. It ingests a few zip files and converts them into
Parquet files for consumption by the next steps in this data processing pipeline.
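To make the idea concrete, here is a minimal sketch of turning the files inside a zip archive into one record per file, using only the standard library. It is not the `code2parquet` implementation: the helper name `zip_bytes_to_rows` is hypothetical, the column names simply mirror the ones shown in the output table of this notebook, and using SHA-256 over the raw bytes for `hash` is an assumption.

```python
import hashlib
import io
import os
import uuid
import zipfile


def zip_bytes_to_rows(zip_name, zip_bytes):
    """Turn each file inside a zip archive into one record (illustrative only)."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            raw = archive.read(info.filename)
            rows.append(
                {
                    "title": info.filename,  # path inside the archive
                    "document": zip_name,  # source archive name
                    "contents": raw.decode("utf-8", errors="replace"),
                    "document_id": str(uuid.uuid4()),
                    "ext": os.path.splitext(info.filename)[1],
                    "hash": hashlib.sha256(raw).hexdigest(),  # assumed hash choice
                    "size": len(raw),  # size in bytes
                }
            )
    return rows


# Build a tiny archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/hello.py", "print('hello')\n")
    archive.writestr("demo/Makefile", "all:\n\techo done\n")

rows = zip_bytes_to_rows("demo.zip", buffer.getvalue())
for row in rows:
    print(row["title"], repr(row["ext"]), row["size"])
```

A list of records like this could then be written to Parquet with a library such as pandas or pyarrow; the real transform also detects the programming language, which this sketch leaves out.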


Add image - describe data transformation - multiple


### 3.1 - Set Input/output Folder

In [7]:
STAGE = 1

input_folder = MY_CONFIG.INPUT_DATA_DIR
output_folder =  output_parquet_dir

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-1: Processing input='input/source-code-data/' --> output='output/01_parquet_out'


### 3.2 - Execute

In [8]:

%%time

import sys
import ast
from data_processing.utils import ParamsUtils
from data_processing.runtime.pure_python import PythonTransformLauncher
from code2parquet_transform import (
    detect_programming_lang_cli_key,
    supported_langs_file_cli_key,
)
from code2parquet_transform_python import CodeToParquetPythonConfiguration
# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}


supported_languages_file = "language.json"

params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.zip']"),

    # code2parquet parameters
    supported_langs_file_cli_key: supported_languages_file,
    detect_programming_lang_cli_key: True,
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(CodeToParquetPythonConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE} failed")


00:05:35 INFO - data factory code2parquet_ is using local configuration without input/output path
00:05:35 INFO - data factory code2parquet_ max_files -1, n_sample -1
00:05:35 INFO - data factory code2parquet_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:05:35 INFO - pipeline id pipeline_id
00:05:35 INFO - code location None
00:05:35 INFO - data factory data_ is using local data access: input_folder - input/source-code-data/ output_folder - output/01_parquet_out
00:05:35 INFO - data factory data_ max_files -1, n_sample -1
00:05:35 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.zip'], files to checkpoint ['.parquet']
00:05:35 INFO - orchestrator code2parquet started at 2024-12-14 00:05:35
00:05:35 INFO - Number of files is 3, source profile {'max_file_size': 27.35206699371338, 'min_file_size': 0.11289310455322266, 'total_file_

✅ Stage:1 completed successfully
CPU times: user 1.5 s, sys: 1.29 s, total: 2.79 s
Wall time: 606 ms


In [9]:
output_folder

'output/01_parquet_out'

### 3.3 - Inspect Generated output


In [10]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display certain columns
#parquet_df[['column1', 'column2', 'column3']].head(5)

Output dimensions (rows x columns)=  (74, 10)


Unnamed: 0,title,document,contents,document_id,ext,hash,size,date_acquired,repo_name,programming_language
0,application-java/bin/application-java,application-java.zip,#!/usr/bin/env sh\n\n#\n# Copyright 2015 the o...,cadbcc7e-4a0b-4115-a5b5-ad325b8d2194,,318066feeb8ac614dd5eab57fbf2faa5a83f2b582a089f...,8168,2024-12-14T00:05:35.272041,application-java,unknown
1,application-java/bin/application-java.bat,application-java.zip,@rem\r\n@rem Copyright 2015 the original autho...,5d338050-f7ed-4f52-95aa-da4b905fa60c,.bat,e90dc018527e3b0798537919c4e49fb2b25ebec62aa7a9...,5539,2024-12-14T00:05:35.272261,application-java,Batchfile
2,ray/.gitignore,data-processing-lib.zip,\n\n\n# Byte-compiled / optimized / DLL files\...,899c061b-27a0-4b22-af03-4e2712f1f6df,,10d9872967cc070881e20d8691a6461abc148fad6543e0...,357,2024-12-14T00:05:35.428454,data-processing-lib,unknown
3,ray/test/data_processing_ray_tests/launch/ray/...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,238d4a07-3758-4368-944b-1e549a02cde8,.py,8a768e1ed38b458e121928f15a0d782d2f9d4ee39e4f65...,2828,2024-12-14T00:05:35.428699,data-processing-lib,Python
4,ray/test/data_processing_ray_tests/launch/ray/...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,faf3b1fa-51b4-407d-b70c-e6796e35735e,.py,5a2a60c8e23fffc0956595760b0b532d85b21dd36db8d5...,1853,2024-12-14T00:05:35.428746,data-processing-lib,Python


### 3.4 - Understand the output

Here are some interesting attributes to note:

- **title** : path of the file inside the archive
- **document** : name of the source zip archive
- **contents** : text of the source file
- **document_id**: unique id (UUID) assigned to this document
- **hash** : hash of the document
- **programming_language** : detected language (`unknown` when it cannot be determined)

Let's inspect the **contents** column. It holds the raw text of each source file.

In [None]:
# `contents` holds plain source text, not JSON, so print it directly
print (output_df.iloc[0]['contents'])

### 3.5 - Metadata information


In [11]:
import json
from pathlib import Path

# Load the metadata file that this stage wrote to its output folder
json.loads(Path(f"{output_folder}/metadata.json").read_text())

##  Step-4: Exact Deduplication

This step finds exact duplicates in the `contents` column and removes them. It computes a SHA256 hash of each code file and removes records that have identical hashes.
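The hashing idea described above can be sketched in a few lines of plain Python. This is only an illustration of hash-based exact deduplication, not the `ededup` transform itself; the function name `exact_dedup` and the sample records are made up for the example.

```python
import hashlib


def exact_dedup(records, column="contents"):
    """Keep the first record for each distinct SHA-256 digest of `column`."""
    seen = set()
    kept = []
    for record in records:
        digest = hashlib.sha256(record[column].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical content was already kept
        seen.add(digest)
        kept.append(record)
    return kept


records = [
    {"document_id": "a", "contents": "print('hi')\n"},
    {"document_id": "b", "contents": "x = 1\n"},
    {"document_id": "c", "contents": "print('hi')\n"},  # duplicate of "a"
]
deduped = exact_dedup(records)
print([r["document_id"] for r in deduped])
```

Storing digests instead of full file contents keeps the `seen` set small, which matters once the dataset no longer fits comfortably in memory.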

### 4.1 - Set Input/output Folder

In [12]:
STAGE = 2

input_folder = output_parquet_dir # previous output folder is the input folder for the current stage
output_folder =  output_exact_dedupe_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_exact_dedupe_out'


### 4.2 - Execute

In [17]:
%%time

import sys

from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration
from ededup_transform_base import doc_column_name_cli_param, int_column_name_cli_param

from data_processing.utils import ParamsUtils

# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

params = {
    # ededup parameters
    doc_column_name_cli_param: "contents",
    int_column_name_cli_param: "document_id",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE} failed")


00:13:14 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'document_id', 'use_snapshot': False, 'snapshot_directory': None}
00:13:14 INFO - pipeline id pipeline_id
00:13:14 INFO - code location None
00:13:14 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_exact_dedupe_out
00:13:14 INFO - data factory data_ max_files -1, n_sample -1
00:13:14 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:13:14 INFO - orchestrator ededup started at 2024-12-14 00:13:14
00:13:14 INFO - Number of files is 3, source profile {'max_file_size': 0.03529071807861328, 'min_file_size': 0.00879669189453125, 'total_file_size': 0.06692790985107422}
00:13:14 INFO - Starting from the beginning
00:13:14 INFO - Completed 1 files (33.33%) in 0.0 min
00:13:14 INFO - Completed 2 files (66.67%) in 0.0 min
00:13:14 INFO

✅ Stage:2 completed successfully
CPU times: user 25.2 ms, sys: 9.9 ms, total: 35.1 ms
Wall time: 26.2 ms


### 4.3 - Inspect Generated output



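You will notice that one duplicate row was removed (74 rows in, 73 rows out) and that a new `removed` column was added.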

In [20]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

Input data dimensions (rows x columns)=  (74, 10)
Output data dimensions (rows x columns)=  (73, 11)


Unnamed: 0,title,document,contents,document_id,ext,hash,size,date_acquired,repo_name,programming_language,removed
0,application-java/bin/application-java,application-java.zip,#!/usr/bin/env sh\n\n#\n# Copyright 2015 the o...,cadbcc7e-4a0b-4115-a5b5-ad325b8d2194,,318066feeb8ac614dd5eab57fbf2faa5a83f2b582a089f...,8168,2024-12-14T00:05:35.272041,application-java,unknown,[]
1,application-java/bin/application-java.bat,application-java.zip,@rem\r\n@rem Copyright 2015 the original autho...,5d338050-f7ed-4f52-95aa-da4b905fa60c,.bat,e90dc018527e3b0798537919c4e49fb2b25ebec62aa7a9...,5539,2024-12-14T00:05:35.272261,application-java,Batchfile,[]
2,ray/.gitignore,data-processing-lib.zip,\n\n\n# Byte-compiled / optimized / DLL files\...,899c061b-27a0-4b22-af03-4e2712f1f6df,,10d9872967cc070881e20d8691a6461abc148fad6543e0...,357,2024-12-14T00:05:35.428454,data-processing-lib,unknown,[]
3,ray/test/data_processing_ray_tests/launch/ray/...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,238d4a07-3758-4368-944b-1e549a02cde8,.py,8a768e1ed38b458e121928f15a0d782d2f9d4ee39e4f65...,2828,2024-12-14T00:05:35.428699,data-processing-lib,Python,[]
4,ray/test/data_processing_ray_tests/launch/ray/...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,faf3b1fa-51b4-407d-b70c-e6796e35735e,.py,5a2a60c8e23fffc0956595760b0b532d85b21dd36db8d5...,1853,2024-12-14T00:05:35.428746,data-processing-lib,Python,[]
5,ray/test/data_processing_ray_tests/launch/ray/...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,d1b948b6-29e9-4eda-9b43-97b943c95b0e,.py,0d48720a35061d1085ac040f2a96cb57ae82d3921edfdd...,7229,2024-12-14T00:05:35.428786,data-processing-lib,Python,[]
6,ray/test/data_processing_ray_tests/launch/ray/...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,66008011-5ee1-4620-bedd-be3c39b53169,.py,62286fd87e04cd78dc419854778ef4d98f7bdb2b8cf515...,3292,2024-12-14T00:05:35.428818,data-processing-lib,Python,[]
7,ray/pyproject.toml,data-processing-lib.zip,"[project]\nname = ""data_prep_toolkit_ray""\nver...",2b6fd286-c370-47c6-aed1-80018ad85681,.toml,6b2df4f160514a3f43fa81dec6c59da01bb2faa62d72dc...,1337,2024-12-14T00:05:35.428843,data-processing-lib,TOML,[]
8,ray/test-data/data_processing/ray/noop/expecte...,data-processing-lib.zip,"{\n ""pipeline"": ""pipeline_id"",\n ""job de...",9b2210d2-7031-4875-ba80-8996571a588a,.json,317541c784bbf90c260f9f74ca31957afbcaf8c4adb66e...,1128,2024-12-14T00:05:35.430952,data-processing-lib,JSON,[]
9,ray/Makefile,data-processing-lib.zip,"# Use make help, to see the available rules\nR...",891081b4-69c0-443a-a0fe-543fb7cdb1ec,,784fc2ccbc718411372a6f45d0405ff4b36da89f56bcd8...,1766,2024-12-14T00:05:35.432188,data-processing-lib,unknown,[]


##  Step-5: Code Quality

The code quality transform evaluates several aspects of code quality in your dataset. It adds metrics that describe structural properties, help detect anomalies, and classify files based on their characteristics.

### 5.1 - Set Input/output Folder

In [54]:
STAGE = 3

input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage
output_folder =  output_code_quality_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-3: Processing input='output/02_exact_dedupe_out' --> output='output/03_code_quality_out'


### 5.2 - Execute

In [27]:

%%time

import sys
from code_quality_transform_python import CodeQualityPythonTransformConfiguration
from data_processing.utils import ParamsUtils


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

params = {
    # code quality parameters
    "cq_contents_column_name": "contents",
    "cq_language_column_name": "programming_language",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(CodeQualityPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE} failed")

00:24:36 INFO - pipeline id pipeline_id
00:24:36 INFO - code location None
00:24:36 INFO - data factory data_ is using local data access: input_folder - output/02_exact_dedupe_out output_folder - output/03_code_quality_out
00:24:36 INFO - data factory data_ max_files -1, n_sample -1
00:24:36 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:24:36 INFO - orchestrator code_quality started at 2024-12-14 00:24:36
00:24:36 INFO - Number of files is 3, source profile {'max_file_size': 0.035880088806152344, 'min_file_size': 0.009145736694335938, 'total_file_size': 0.0682210922241211}
Token indices sequence length is longer than the specified maximum sequence length for this model (3071 > 1024). Running this sequence through the model will result in indexing errors
00:24:36 INFO - Completed 1 files (33.33%) in 0.0 min
00:24:36 INFO - Completed 2 files (66.67%) in 0.001 min
00:24:36

✅ Stage:3 completed successfully
CPU times: user 150 ms, sys: 17.7 ms, total: 167 ms
Wall time: 392 ms


### 5.3 - Inspect Generated output



You will notice we have twelve extra columns:



- **line_mean**: Average line length.
- **line_max**: Longest line length.
- **total_num_lines**: Number of lines.
- **avg_longest_lines**: Avg. of top n longest lines.
- **alphanum_frac**: Alphanumeric fraction.
- **char_token_ratio**: Character-to-token ratio.
- **autogenerated**: Detects autogenerated files.
- **config_or_test**: Identifies config/test files.
- **has_no_keywords**: No Python keywords (e.g., class, def).
- **has_few_assignments**: Fewer than a minimum number of `=` signs.
- **is_xml/is_html**: Detects XML or HTML content.


The number of rows is still the same as before.
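As a rough illustration of what a few of these metrics measure, the sketch below computes line-based statistics for a single string. The helper `basic_line_stats` is hypothetical, and the exact definitions used by the code quality transform (for example, how newlines or empty files are treated) may differ.

```python
def basic_line_stats(text):
    """Compute simple line-based statistics for one source file (illustrative)."""
    lines = text.splitlines()
    lengths = [len(line) for line in lines]
    total = len(lines)
    return {
        "total_num_lines": total,
        "line_max": max(lengths) if lengths else 0,
        "line_mean": sum(lengths) / total if total else 0.0,
        # fraction of all characters (newlines included) that are alphanumeric
        "alphanum_frac": sum(ch.isalnum() for ch in text) / len(text) if text else 0.0,
    }


sample = "import os\n\nprint(os.getcwd())\n"
stats = basic_line_stats(sample)
print(stats)
```

Metrics such as `char_token_ratio` additionally need a tokenizer, so they are left out of this sketch.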


In [61]:
from my_utils import read_parquet_files_as_df
import pprint

output_df = read_parquet_files_as_df(output_folder)


print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

pprint.pprint(output_df.columns)
total_num_lines_rows=output_df[output_df["total_num_lines"]<10].head(3)

for _, row in total_num_lines_rows.iterrows():  # Use the second element (row) from the tuple
    pprint.pprint(f'-------Total Num Lines {row["total_num_lines"]}------\n{row["contents"]}\n-------')

   

Input data dimensions (rows x columns)=  (73, 11)
Output data dimensions (rows x columns)=  (73, 23)
Index(['title', 'document', 'contents', 'document_id', 'ext', 'hash', 'size',
       'date_acquired', 'repo_name', 'programming_language', 'removed',
       'line_mean', 'line_max', 'total_num_lines', 'avg_longest_lines',
       'alphanum_frac', 'char_token_ratio', 'autogenerated', 'config_or_test',
       'has_no_keywords', 'has_few_assignments', 'is_xml', 'is_html'],
      dtype='object')
('-------Total Num Lines 3------\n'
 'from .noop_transform import (\n'
 '    NOOPRayTransformConfiguration,\n'
 ')\n'
 '\n'
 '-------')
('-------Total Num Lines 8------\n'
 'from data_processing_ray.runtime.ray.ray_utils import RayUtils\n'
 'from data_processing_ray.runtime.ray.transform_statistics import '
 'TransformStatisticsRay\n'
 'from data_processing_ray.runtime.ray.transform_runtime import '
 'DefaultRayTransformRuntime\n'
 'from data_processing_ray.runtime.ray.runtime_configuration import '


##  Step-6: Filtering
 
This step can be used to filter the code files based on our chosen conditions.
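The condition used in this notebook keeps files with more than 10 and fewer than 90 lines and drops the `removed` column. A plain-Python sketch of that logic is shown here; the function `filter_rows` and the sample rows are hypothetical, and the filter transform itself evaluates its criteria differently.

```python
def filter_rows(rows, min_lines=10, max_lines=90, columns_to_drop=("removed",)):
    """Keep rows with min_lines < total_num_lines < max_lines, minus dropped columns."""
    kept = []
    for row in rows:
        if min_lines < row["total_num_lines"] < max_lines:
            kept.append({k: v for k, v in row.items() if k not in columns_to_drop})
    return kept


rows = [
    {"title": "a.py", "total_num_lines": 3, "removed": []},   # too short
    {"title": "b.py", "total_num_lines": 35, "removed": []},  # kept
    {"title": "c.py", "total_num_lines": 90, "removed": []},  # not < 90
]
filtered = filter_rows(rows)
print(filtered)
```

Both bounds are strict, so a file with exactly 10 or exactly 90 lines is filtered out.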

### 6.1 - Set Input/output Folder

In [63]:
STAGE = 4

input_folder = output_code_quality_dir # previous output folder is the input folder for the current stage
output_folder =  output_filter_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-4: Processing input='output/03_code_quality_out' --> output='output/04_filter_out'


### 6.2 - Execute

In [64]:
%%time

import sys
from filter_transform import (
    filter_columns_to_drop_cli_param,
    filter_criteria_cli_param,
    filter_logical_operator_cli_param,
)
from filter_transform_python import FilterPythonTransformConfiguration
from data_processing.utils import ParamsUtils



# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

filter_criteria = [
    "total_num_lines > 10 AND total_num_lines < 90"
]
filter_logical_operator = "AND"
filter_columns_to_drop = ["removed"]

params = {

    # filter parameters
    filter_criteria_cli_param: filter_criteria,
    filter_columns_to_drop_cli_param: filter_columns_to_drop,
    filter_logical_operator_cli_param: filter_logical_operator,
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(FilterPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE} failed")

01:05:37 INFO - pipeline id pipeline_id
01:05:37 INFO - code location None
01:05:37 INFO - data factory data_ is using local data access: input_folder - output/03_code_quality_out output_folder - output/04_filter_out
01:05:37 INFO - data factory data_ max_files -1, n_sample -1
01:05:37 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
01:05:37 INFO - orchestrator filter started at 2024-12-14 01:05:37
01:05:37 INFO - Number of files is 3, source profile {'max_file_size': 0.041484832763671875, 'min_file_size': 0.013142585754394531, 'total_file_size': 0.08251476287841797}
01:05:37 INFO - Completed 1 files (33.33%) in 0.0 min
01:05:37 INFO - Completed 2 files (66.67%) in 0.001 min
01:05:37 INFO - Completed 3 files (100.0%) in 0.001 min
01:05:37 INFO - Done processing 3 files, waiting for flush() completion.
01:05:37 INFO - done flushing in 0.0 sec
01:05:37 INFO - Completed executi

✅ Stage:4 completed successfully
CPU times: user 46.4 ms, sys: 32.7 ms, total: 79.2 ms
Wall time: 55.6 ms


### 6.3 - Inspect Generated output


In [65]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print (f"Files processed : {input_df.shape[0]:,}")
print (f"Rows created : {output_df.shape[0]:,}")

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

Files processed : 73
Rows created : 33
Input data dimensions (rows x columns)=  (73, 23)
Output data dimensions (rows x columns)=  (33, 22)


Unnamed: 0,title,document,contents,document_id,ext,hash,size,date_acquired,repo_name,programming_language,...,total_num_lines,avg_longest_lines,alphanum_frac,char_token_ratio,autogenerated,config_or_test,has_no_keywords,has_few_assignments,is_xml,is_html
0,ray/.gitignore,data-processing-lib.zip,\n\n\n# Byte-compiled / optimized / DLL files\...,899c061b-27a0-4b22-af03-4e2712f1f6df,,10d9872967cc070881e20d8691a6461abc148fad6543e0...,357,2024-12-14T00:05:35.428454,data-processing-lib,unknown,...,35,23.857143,0.714286,2.625,False,False,False,False,False,False
1,ray/test/data_processing_ray_tests/launch/ray/...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,238d4a07-3758-4368-944b-1e549a02cde8,.py,8a768e1ed38b458e121928f15a0d782d2f9d4ee39e4f65...,2828,2024-12-14T00:05:35.428699,data-processing-lib,Python,...,80,79.571429,0.649222,3.677503,False,False,False,False,False,False
2,ray/test/data_processing_ray_tests/launch/ray/...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,faf3b1fa-51b4-407d-b70c-e6796e35735e,.py,5a2a60c8e23fffc0956595760b0b532d85b21dd36db8d5...,1853,2024-12-14T00:05:35.428746,data-processing-lib,Python,...,39,91.857143,0.69347,4.010823,False,True,False,False,False,False
3,ray/pyproject.toml,data-processing-lib.zip,"[project]\nname = ""data_prep_toolkit_ray""\nver...",2b6fd286-c370-47c6-aed1-80018ad85681,.toml,6b2df4f160514a3f43fa81dec6c59da01bb2faa62d72dc...,1337,2024-12-14T00:05:35.428843,data-processing-lib,TOML,...,49,67.571429,0.635752,2.70101,False,True,False,False,False,False
4,ray/test-data/data_processing/ray/noop/expecte...,data-processing-lib.zip,"{\n ""pipeline"": ""pipeline_id"",\n ""job de...",9b2210d2-7031-4875-ba80-8996571a588a,.json,317541c784bbf90c260f9f74ca31957afbcaf8c4adb66e...,1128,2024-12-14T00:05:35.430952,data-processing-lib,JSON,...,46,43.857143,0.454787,3.204545,False,False,False,False,False,False
5,ray/Makefile,data-processing-lib.zip,"# Use make help, to see the available rules\nR...",891081b4-69c0-443a-a0fe-543fb7cdb1ec,,784fc2ccbc718411372a6f45d0405ff4b36da89f56bcd8...,1766,2024-12-14T00:05:35.432188,data-processing-lib,unknown,...,49,121.714286,0.731597,3.109155,False,True,False,False,False,False
6,ray/src/data_processing_ray/test_support/trans...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,c0a4c2d4-6885-45ec-9941-5206f5177603,.py,653c74f4cf6f34e6aaad5cdeb93d9599633e205e14a62a...,1580,2024-12-14T00:05:35.432308,data-processing-lib,Python,...,45,76.571429,0.700633,4.463277,False,False,False,False,False,False
7,ray/src/data_processing_ray/runtime/ray/transf...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,10d62504-9d44-4d81-bfa5-5600632773b7,.py,d3e6f9b32396f7336767198c1c37703e8fe9be8a830623...,2523,2024-12-14T00:05:35.432567,data-processing-lib,Python,...,53,102.285714,0.690844,4.724719,False,False,False,True,False,False
8,ray/src/data_processing_ray/runtime/ray/transf...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,171b95eb-f494-400d-9280-c876a0798344,.py,4710d5b4ae145bfed0a5765bce5c12dce9b66bbe8536b8...,3289,2024-12-14T00:05:35.432641,data-processing-lib,Python,...,66,101.285714,0.622378,4.414765,False,False,False,False,False,False
9,ray/src/data_processing_ray/runtime/ray/transf...,data-processing-lib.zip,# (C) Copyright IBM Corp. 2024.\n# Licensed un...,979f574c-9e8f-4f8b-b156-08a09b0719de,.py,2a27ccb73d717ce53215b545521c64bd5fa83f62dd6db0...,1993,2024-12-14T00:05:35.432993,data-processing-lib,Python,...,46,79.285714,0.67135,4.592166,False,False,False,False,False,False


### 6.4 - Metadata Information 


In [None]:
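import json
from pathlib import Path

# Load the metadata file that this stage wrote to its output folder
json.loads(Path(f"{output_folder}/metadata.json").read_text())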

##  Step-7: Tokenization

Next, we tokenize the data to be used for fine tuning. 
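For readers new to the idea, tokenization maps text to a sequence of integer ids. The toy example below splits on whitespace and assigns ids in order of first appearance. It is deliberately simplistic and is not what the tokenization transform does, which relies on a pretrained tokenizer; the names `build_vocab` and `encode` are made up for this sketch.

```python
def build_vocab(texts):
    """Assign an integer id to each whitespace-separated token, in order seen."""
    vocab = {}
    for text in texts:
        for token in text.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text into a list of token ids using an existing vocabulary."""
    return [vocab[token] for token in text.split()]


corpus = ["x = 1", "y = x + 1"]
vocab = build_vocab(corpus)
ids = encode("y = x + 1", vocab)
print(vocab)
print(ids)
```

Real tokenizers work on subword units, so they can encode text containing words that never appeared in the training corpus, which this toy version cannot.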

Add more info...tokeniser used

### 7.1 - Set Input/output Folder

In [66]:
STAGE = 5

input_folder = output_filter_dir # previous output folder is the input folder for the current stage
output_folder =  output_tokenisation_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-5: Processing input='output/04_filter_out' --> output='output/05_tokenisation_out'


### 7.2 - Execute

In [67]:

%%time

import sys

from data_processing.utils import ParamsUtils
from tokenization_transform_python import TokenizationPythonConfiguration


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}


params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),

}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(TokenizationPythonConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception (f"❌ Stage:{STAGE} failed")

01:10:04 INFO - pipeline id pipeline_id
01:10:04 INFO - code location None
01:10:04 INFO - data factory data_ is using local data access: input_folder - output/04_filter_out output_folder - output/05_tokenisation_out
01:10:04 INFO - data factory data_ max_files -1, n_sample -1
01:10:04 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
01:10:04 INFO - orchestrator Tokenization started at 2024-12-14 01:10:04
01:10:04 INFO - Number of files is 3, source profile {'max_file_size': 0.027441024780273438, 'min_file_size': 0.004353523254394531, 'total_file_size': 0.05213642120361328}
01:10:05 INFO - Completed 1 files (33.33%) in 0.0 min
01:10:05 INFO - Completed 2 files (66.67%) in 0.0 min
01:10:05 INFO - Completed 3 files (100.0%) in 0.0 min
01:10:05 INFO - Done processing 3 files, waiting for flush() completion.
01:10:05 INFO - done flushing in 0.0 sec
01:10:05 INFO - Completed execu

✅ Stage:5 completed successfully
CPU times: user 108 ms, sys: 25.9 ms, total: 134 ms
Wall time: 963 ms


### 7.3 - Inspect Generated output

Here we should see the contents column tokenised.

In [1]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)




**The data is now ready for extended pretraining or fine tuning using any open source code models.**