
## 1. Sourcing datasets for pretraining

In [73]:
import warnings
warnings.filterwarnings('ignore')

In [74]:
# Load the WikiText-103 training split from the Hugging Face Hub
import datasets
pretraining_dataset = datasets.load_dataset("wikitext", "wikitext-103-v1", split="train")


In [75]:
# Display the dataset sample
pretraining_dataset['text'][:5]

['',
 ' = Valkyria Chronicles III = \n',
 '',
 ' Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . \n',
 " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving

In [76]:
# Load and display the TinyStories dataset from the Hugging Face Hub
story_dataset = datasets.load_dataset("roneneldan/TinyStories", split="train")
story_dataset['text'][:5]

['One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.',
 'Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.\n\nOne day, Beep was driving in the park when he saw a big tree. The tree had many leaves that we

## 2. Scrape Python code from GitHub

In [77]:
import os
import requests

In [78]:
# Path to store all the python scripts
code_dir = "downloaded_scripts"
os.makedirs(code_dir, exist_ok=True)

In [79]:
# URLs of the Python scripts to use for LLM pretraining
urls = [
    "https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py",
    "https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py",
    "https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py",
    "https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py",
    "https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py",
    "https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py",
    "https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py",
    "https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib/python3.8/site-packages/django/contrib/messages/__init__.py",
    "https://raw.githubusercontent.com/PaliC/pytorch/master/test/fx/test_subgraph_rewriter.py"
]

In [80]:
# Fetch and store the scripts
print(f'Fetching script from {len(urls)} files')
for url in urls:
  print(f"Working on url: {url}")
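  response = requests.get(url, timeout=30)
  response.raise_for_status()  # Stop on HTTP errors instead of saving an error page as code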
  file_name = os.path.basename(url)
  file_path = os.path.join(code_dir, file_name)
  with open(file_path, 'wb') as file:
    file.write(response.content)
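
One caveat with `os.path.basename(url)`: if two URLs ended in the same file name (for example two `__init__.py` files), the second download would overwrite the first. A minimal sketch of one way around this, assuming it is acceptable to encode the URL path into the file name; `unique_file_name` and the two example URLs are hypothetical and not part of the notebook:

```python
from urllib.parse import urlparse

def unique_file_name(url):
    # Hypothetical helper: turn "/owner/repo/branch/dir/file.py" into
    # "owner-repo-branch-dir-file.py" so that equal base names do not collide
    path = urlparse(url).path.strip("/")
    return path.replace("/", "-")

# Made-up URLs that share the same base name
a = unique_file_name("https://raw.githubusercontent.com/user_a/repo/master/pkg/__init__.py")
b = unique_file_name("https://raw.githubusercontent.com/user_b/repo/master/pkg/__init__.py")
print(a)
print(b)
```

The resulting name could be passed to `os.path.join(code_dir, ...)` in place of the base name.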

Fetching script from 9 files
Working on url: https://raw.githubusercontent.com/TheAlgorithms/Python/master/searches/double_linear_search_recursion.py
Working on url: https://raw.githubusercontent.com/KosingZhu/tensorflow/master/tensorflow/python/tools/module_util.py
Working on url: https://raw.githubusercontent.com/EricRemmerswaal/tensorflow/master/tensorflow/python/distribute/distribute_coordinator_context.py
Working on url: https://raw.githubusercontent.com/computationalartist/tensorflow/master/tensorflow/python/ops/numpy_ops/integration_test/benchmarks/numpy_mlp.py
Working on url: https://raw.githubusercontent.com/Van-an/tensorflow/master/tensorflow/python/distribute/coordinator/values.py
Working on url: https://raw.githubusercontent.com/nkgwer/tensorflow/master/tensorflow/lite/tools/visualize.py
Working on url: https://raw.githubusercontent.com/gitblazer/youtube-dl/master/youtube_dl/version.py
Working on url: https://raw.githubusercontent.com/Joshua-Barawa/My-Photos/master/venv/lib

In [81]:
# List all the downloaded scripts filenames
files = os.listdir(code_dir)
for file in files:
  print(file)

version.py
numpy_mlp.py
distribute_coordinator_context.py
double_linear_search_recursion.py
module_util.py
values.py
__init__.py
visualize.py
test_subgraph_rewriter.py


In [82]:
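# Read each script and collect it as a {'text': ...} record
code_dataset = []
for file in files:
  file_path = os.path.join(code_dir, file)
  # Use a separate name for the file handle so the loop variable is not shadowed
  with open(file_path, 'r', encoding='utf-8') as f:
    code = f.read()
    code_dataset.append({'text': code})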

In [83]:
# Convert list to hugging face dataset format
code_dataset = datasets.Dataset.from_list(code_dataset)
code_dataset

Dataset({
    features: ['text'],
    num_rows: 9
})

In [84]:
# Append the code dataset to the pretraining dataset downloaded above
print(f'Length of pretraining dataset: {len(pretraining_dataset)}')
print(f'Length of code dataset: {len(code_dataset)}')
sharded_dataset = datasets.concatenate_datasets([pretraining_dataset, code_dataset])
sharded_dataset

Length of pretraining dataset: 1801350
Length of code dataset: 9


Dataset({
    features: ['text'],
    num_rows: 1801359
})

## 3. Data Cleaning

In [85]:
# Display the number of rows in the concatenated dataset
sharded_dataset.num_rows

1801359

In [86]:
sharded_dataset['text'][:5]

['',
 ' = Valkyria Chronicles III = \n',
 '',
 ' Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . \n',
 " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving

In [87]:
import re

def clean_paragraph(paragraph):
    text = paragraph["text"]

    # Skip empty lines
    if not text.strip():
        return None

    # Split into sentences using punctuation
    lines = re.split(r'[.!?]', text)
    lines = [line.strip() for line in lines if line.strip()]
    return lines


In [88]:
import heapq

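def paragraph_length_filter(paragraph):
    lines = clean_paragraph(paragraph)
    # Drop empty rows
    if not lines:
        return False
    # Drop rows with fewer than 3 sentences
    if len(lines) < 3:
        return False
    # Drop rows whose third-longest sentence is shorter than 3 characters
    if min(heapq.nlargest(3, [len(line) for line in lines])) < 3:
        return False
    return True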
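
To see what this filter keeps, here is a minimal, self-contained sketch that re-declares the two functions in condensed form and applies them to three rows. The third sample string is made up for illustration; the heading row mirrors the one shown in the dataset sample.

```python
import heapq
import re

def clean_paragraph(paragraph):
    text = paragraph["text"]
    if not text.strip():
        return None
    # Split into sentences on ., ! or ? and drop empty pieces
    lines = re.split(r'[.!?]', text)
    return [line.strip() for line in lines if line.strip()]

def paragraph_length_filter(paragraph):
    lines = clean_paragraph(paragraph)
    if not lines or len(lines) < 3:
        return False
    # The third-longest sentence must have at least 3 characters
    return min(heapq.nlargest(3, [len(line) for line in lines])) >= 3

samples = [
    {"text": ""},                                        # empty row
    {"text": " = Valkyria Chronicles III = \n"},         # heading, no sentence punctuation
    {"text": "It rained. We stayed in. Then we read."},  # three sentences
]
results = [paragraph_length_filter(s) for s in samples]
print(results)  # [False, False, True]
```

Empty rows and section headings are dropped because they contain fewer than three sentences, which is why the row count falls so sharply after this step.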


In [89]:
filtered = sharded_dataset.filter(paragraph_length_filter)
print(filtered[0])


{'text': ' Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . \n'}


In [90]:
filtered.num_rows

629564

### After length filtering: 1801359 -> 629564 rows

In [91]:
# Find the number of repetitions in the paragraphs.
def find_duplicates(paragraphs):
  unique_x = set()
  duplicate_chars = 0
  duplicate_elements = 0
  for element in paragraphs:
    if element in unique_x:
      duplicate_chars += len(element)
      duplicate_elements += 1
    else:
      unique_x.add(element)
  return duplicate_elements, duplicate_chars

In [92]:
import re

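def paragraph_reperation_filter(x):
  text = x['text']
  # Paragraphs are separated by one or more blank lines
  paragraphs = re.compile(r'\n{2,}').split(text.strip())
  paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs)
  # Reject if more than 30% of the paragraphs are repeats
  if paragraphs_duplicates/len(paragraphs) > 0.3:
    return False
  # Reject if repeated paragraphs account for more than 20% of the characters
  if char_duplicates/len(text) > 0.2:
    return False
  return True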
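
The two thresholds are easier to follow on a small input. The sketch below re-declares `find_duplicates` and adds a hypothetical helper, `repetition_ratios`, that returns the two ratios the filter compares against 0.3 and 0.2. Both sample texts are made up.

```python
import re

def find_duplicates(paragraphs):
    unique_x = set()
    duplicate_chars = 0
    duplicate_elements = 0
    for element in paragraphs:
        if element in unique_x:
            duplicate_chars += len(element)
            duplicate_elements += 1
        else:
            unique_x.add(element)
    return duplicate_elements, duplicate_chars

def repetition_ratios(text):
    # Hypothetical helper: (share of repeated paragraphs, share of repeated characters)
    paragraphs = re.split(r'\n{2,}', text.strip())
    dup_paragraphs, dup_chars = find_duplicates(paragraphs)
    return dup_paragraphs / len(paragraphs), dup_chars / len(text)

repetitive = "spam spam\n\nspam spam\n\nspam spam\n\nsomething else"
varied = "first paragraph\n\nsecond paragraph\n\nthird paragraph"

print(repetition_ratios(repetitive))  # first ratio is 0.5: 2 of 4 paragraphs are repeats
print(repetition_ratios(varied))      # (0.0, 0.0)
```

The repetitive text exceeds the 0.3 paragraph threshold and would be filtered out, while the varied text passes.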


In [None]:
unique_dataset = filtered.filter(paragraph_reperation_filter, load_from_cache_file=False)

In [94]:
unique_dataset.num_rows

629563

In [95]:
print(f'Removed {filtered.num_rows - unique_dataset.num_rows}')

Removed 1


In [None]:
# Remove exact duplicate entries
def deduplication(ds):
  # Texts seen so far; shared by the closure below, so keep num_proc=1
  unique_text = set()

  def dedup_func(x):
    if x['text'] in unique_text:
      return False
    unique_text.add(x['text'])
    return True

  return ds.filter(dedup_func, load_from_cache_file=False, num_proc=1)

deduplicated_dataset = deduplication(unique_dataset)
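
The cell above keeps every full text in a Python set. As a variation (not what the notebook does), the set can hold fixed-size hashes instead, which may reduce memory use when rows are long. A minimal sketch on a plain list, where `deduplicate_texts` and the sample rows are hypothetical:

```python
import hashlib

def deduplicate_texts(texts):
    # Store a 32-byte SHA-256 digest per text instead of the text itself
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate of an earlier row
        seen.add(digest)
        unique.append(text)
    return unique

rows = ["a cat sat", "a dog ran", "a cat sat"]
print(deduplicate_texts(rows))  # ['a cat sat', 'a dog ran']
```

Like the original, this only catches exact duplicates and keeps the first occurrence of each text.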

In [97]:
print(f'Removed {unique_dataset.num_rows - deduplicated_dataset.num_rows} rows')

Removed 7750 rows


In [98]:
deduplicated_dataset.num_rows

621813

In [None]:
# Keep only English-language entries
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

# Fix the seed so that language detection is deterministic
DetectorFactory.seed = 0

def english_language_filter(ds):

  def is_english(example):
      # `example` is one dataset row (a dict), so read its 'text' field
      try:
          return detect(example['text']) == "en"
      except LangDetectException:
          # Raised when the text has no detectable language features
          return False
  return ds.filter(is_english, load_from_cache_file=False, num_proc=4)

english_dataset = english_language_filter(deduplicated_dataset)

print(english_dataset.num_rows)

In [101]:
print(f'English rows {english_dataset.num_rows}')
print(f'Removed {deduplicated_dataset.num_rows - english_dataset.num_rows} rows')

English rows 620583
Removed 1230 rows


In [None]:
# Save the processed English dataset to a Parquet file
file_path = 'processed_dataset.parquet'
english_dataset.to_parquet(file_path)

## 4. Data Packaging: Tokenizing + Packing
- Tokenizing: breaking each text into smaller, meaningful units called tokens.
- Packing: concatenating the tokens and splitting them into sequences of the maximum sequence length to improve training efficiency.


### 4.1 Tokenizing

In [149]:
# Load parquet file data
import datasets
dataset = datasets.load_dataset('parquet', data_files='/content/processed_dataset.parquet', split='train')
dataset

Dataset({
    features: ['text'],
    num_rows: 620583
})

In [150]:
# Keep only the first of 10 shards so the remaining steps run on a smaller subset
sharded_dataset = dataset.shard(num_shards=10, index=0)
print(sharded_dataset)

Dataset({
    features: ['text'],
    num_rows: 62059
})


In [151]:
# Load the tokenizer
from transformers import AutoTokenizer
model_to_path_or_name='upstage/Solar-10.7B-v1.0'
tokenizer = AutoTokenizer.from_pretrained(model_to_path_or_name, use_fast=False)

In [152]:
# test tokenization
tokenizer.tokenize('I am utkarsh gupta')

['▁I', '▁am', '▁ut', 'kar', 'sh', '▁gu', 'pt', 'a']

In [156]:
# Helper function that converts text into token ids
def tokenization(example):
  tokens = tokenizer.tokenize(example['text'])
  token_ids = tokenizer.convert_tokens_to_ids(tokens)
  # Wrap the sequence with the <bos> and <eos> token ids
  token_ids = [tokenizer.bos_token_id] + token_ids + [tokenizer.eos_token_id]
  example['input_ids'] = token_ids
  example['num_tokens'] = len(token_ids)
  return example
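
The same steps can be traced without downloading a tokenizer. The sketch below swaps in a toy whitespace tokenizer with a three-word vocabulary; the vocabulary and all ids are made up for illustration and do not match the real tokenizer.

```python
# Toy stand-in for a real tokenizer (all ids are made up)
BOS_ID, EOS_ID, UNK_ID = 1, 2, 0
vocab = {"the": 3, "cat": 4, "sat": 5}

def toy_tokenization(example):
    # Split on whitespace, then map each token to an id (unknown words -> UNK_ID)
    tokens = example["text"].split()
    token_ids = [vocab.get(token, UNK_ID) for token in tokens]
    # Wrap the sequence with the begin- and end-of-sequence ids
    token_ids = [BOS_ID] + token_ids + [EOS_ID]
    return {**example, "input_ids": token_ids, "num_tokens": len(token_ids)}

out = toy_tokenization({"text": "the cat sat down"})
print(out["input_ids"])   # [1, 3, 4, 5, 0, 2]
print(out["num_tokens"])  # 6
```

The begin and end markers let the model see document boundaries once the sequences are concatenated during packing.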

In [157]:
tokenized_dataset = sharded_dataset.map(tokenization, load_from_cache_file=False)


In [158]:
print(tokenized_dataset)
print(tokenized_dataset[:2])

Dataset({
    features: ['text', 'input_ids', 'num_tokens'],
    num_rows: 62059
})
{'text': [' Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . \n', " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjust

In [159]:
# Calculate the total number of tokens in the dataset
import numpy as np
np.sum(tokenized_dataset['num_tokens'])

np.int64(12188371)

### 4.2 Packing the data
- Concatenate all the token ids into one long list, known as serialization.
- Reshape that list by partitioning it into smaller lists of the maximum sequence length.
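
These two steps can be sketched on toy data before running them on the real token ids. The three short sequences and `max_seq_length = 4` below are made-up values chosen so that one trailing token is dropped.

```python
import numpy as np

max_seq_length = 4  # toy value for illustration

# Three tokenized documents of different lengths (1 = <bos>, 2 = <eos>; ids are made up)
tokenized = [[1, 10, 11, 2], [1, 20, 2], [1, 30, 31, 32, 33, 2]]

# Step 1: serialization, one flat array of 13 ids
input_ids = np.concatenate(tokenized)

# Step 2: trim to a multiple of max_seq_length (13 -> 12), then reshape
total_length = len(input_ids) - len(input_ids) % max_seq_length
packed = input_ids[:total_length].reshape(-1, max_seq_length)

print(packed.shape)     # (3, 4)
print(packed.tolist())  # [[1, 10, 11, 2], [1, 20, 2, 1], [30, 31, 32, 33]]
```

Note that documents can straddle row boundaries: the second row ends with the `<bos>` of the third document, and the final `<eos>` is dropped by the trim.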

In [161]:
input_ids = np.concatenate(tokenized_dataset['input_ids'])
len(input_ids)

12188371

In [162]:
# max sequence length
max_seq_length=32


In [164]:
# Largest multiple of max_seq_length that fits, so the remainder is 0; trailing tokens beyond it are dropped
total_length = len(input_ids) - len(input_ids) % max_seq_length
total_length

12188352

In [165]:
input_ids = input_ids[:total_length]
input_ids.shape

(12188352,)

In [166]:
# Reshape the input ids
input_ids_reshaped = input_ids.reshape(-1, max_seq_length).astype(np.int32)
input_ids_reshaped.shape

(380886, 32)

In [167]:
# See an example of the reshaped input ids
input_ids_reshaped[0]

array([    1, 28705,  5355, 28768, 28934,   708,   550,  1093, 28724,
        3931, 28705, 28770,   714, 28705,     0, 28705, 23967,  4992,
         325,  8092,   714, 28705, 30842, 30016, 28993, 31428, 30000,
       29182, 29753, 30051, 29306, 29322], dtype=int32)

In [170]:
# Convert the reshaped input ids to a Hugging Face dataset
input_ids_list = input_ids_reshaped.tolist()
packged_pretrain_dataset = datasets.Dataset.from_dict({'input_ids': input_ids_list})
packged_pretrain_dataset

Dataset({
    features: ['input_ids'],
    num_rows: 380886
})

In [None]:
packged_pretrain_dataset.to_parquet('packged_pretrain_dataset.parquet')


In [175]:
# Download the packaged pretraining dataset locally (or save it to Google Drive) for the model training part
from google.colab import files
files.download("/content/packged_pretrain_dataset.parquet")
