# Data Preprocessing Module MVP for Exploration

In this module I focus on the basics of preprocessing text data:
- Text Cleaning and Normalization
- Tokenization
- Deduplication
- Segmentation

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
import os
from datasets import load_dataset

# Used for normalization and text cleaning
import re
import unicodedata

# For tokenizing
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize
from transformers import AutoTokenizer



### Text Cleaning and Normalization
This step ensures consistency in the text input, reducing noise that can adversely affect later preprocessing or model training.

Using the open-source [wikitext-103-raw-v1 dataset from Hugging Face](https://huggingface.co/datasets/iohadrubin/wikitext-103-raw-v1).

In [2]:
# Load the dataset
dataset = load_dataset("iohadrubin/wikitext-103-raw-v1", split="train")
df_wiki = pd.DataFrame(dataset)

Because the preprocessed dataset was too large, I cap the dataset at 25 GB for the initial analysis.

In [3]:
# Calculate the byte size of each text entry
df_wiki['text_size'] = df_wiki['text'].apply(lambda x: len(x.encode('utf-8')))

# Compute cumulative size (in bytes)
df_wiki['cum_size'] = df_wiki['text_size'].cumsum()

# Define maximum allowed size: 25GB in bytes
max_bytes = 25 * 1024 * 1024 * 1024

# Filter the DataFrame to keep rows until we reach the cap
df_wiki_capped = df_wiki[df_wiki['cum_size'] <= max_bytes].copy()
print(f"Original dataset rows: {len(df_wiki)}")
print(f"Rows retained (capped to ~25GB): {len(df_wiki_capped)}")

# Drop the temporary columns
df_wiki_capped.drop(columns=['text_size', 'cum_size'], inplace=True)

Original dataset rows: 29567
Rows retained (capped to ~25GB): 29567
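
The cap works by accumulating UTF-8 byte sizes and keeping rows while the running total stays within the limit. A minimal, pandas-free sketch of the same idea on toy data (the strings and the 20-byte limit are made up for illustration, not taken from the dataset):

```python
from itertools import accumulate

# Toy stand-ins for the 'text' column; "\u014d" is the letter o with a macron
texts = ["alpha", "Senj\u014d", "gamma ray", "delta"]

# Byte size per entry: the accented letter takes 2 bytes in UTF-8,
# so byte size can differ from character count
sizes = [len(t.encode("utf-8")) for t in texts]

# Running total, the equivalent of cumsum()
running = list(accumulate(sizes))

# Keep entries while the running total stays within the (hypothetical) cap
max_bytes = 20
kept = [t for t, total in zip(texts, running) if total <= max_bytes]

print(sizes)    # [5, 6, 9, 5]
print(running)  # [5, 11, 20, 25]
print(kept)     # the first three entries
```

Because the filter compares the running total, everything after the first row that crosses the limit is dropped, which is also how the DataFrame version behaves.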


#### Data Exploration

In [4]:
df_wiki_capped.head()

Unnamed: 0,text
0,= Valkyria Chronicles III =\nSenjō no Valkyria...
1,= Tower Building of the Little Rock Arsenal =\...
2,= Cicely Mary Barker =\nCicely Mary Barker (28...
3,= Gambia women's national football team =\nThe...
4,= Plain maskray =\nThe plain maskray or brown ...


In [5]:
df_wiki_capped.tail()

Unnamed: 0,text
29562,"= Si Una Vez =\n""Si Una Vez"" (English: If I On..."
29563,= Sicklefin lemon shark =\nThe sicklefin lemon...
29564,= Flammulated flycatcher =\nThe flammulated fl...
29565,"= Ontario Highway 89 =\nKing's Highway 89, com..."
29566,= Luke Smith (writer) =\nLuke Michael Smith is...


In [6]:
# Basic info
print("Dataset columns:", df_wiki_capped.columns)
print("Dataset shape:", df_wiki_capped.shape)

Dataset columns: Index(['text'], dtype='object')
Dataset shape: (29567, 1)


#### Data cleaning exploration

In [7]:
# Text length (in characters) for each entry
df_wiki_capped['text_length'] = df_wiki_capped['text'].str.len()
print("Text length stats:")
print(df_wiki_capped['text_length'].describe())

Text length stats:
count     29567.000000
mean      17537.078161
std       14555.364685
min          16.000000
25%        7750.000000
50%       12994.000000
75%       22721.500000
max      140098.000000
Name: text_length, dtype: float64


In [8]:
# Rows containing double quotes, which can break CSV-style rendering
problematic = df_wiki_capped[df_wiki_capped['text'].str.contains('"', regex=False, na=False)]
# I check this because df_wiki.head() triggered a table formatting error in the IDE:
# "Unterminated quoted field at end of CSV line"
print("Rows with potential quotation issues: ")
print(problematic.head(5))

Rows with potential quotation issues: 
                                                text  text_length
0  = Valkyria Chronicles III =\nSenjō no Valkyria...        20297
1  = Tower Building of the Little Rock Arsenal =\...        20770
2  = Cicely Mary Barker =\nCicely Mary Barker (28...        15371
4  = Plain maskray =\nThe plain maskray or brown ...         6695
5  = 2011 – 12 Columbus Blue Jackets season =\nTh...        17189


### Text Cleaning and Normalization
The cleaning function below:
- Removes HTML tags.
- Normalizes Unicode (NFC) to standardize characters.
- Converts text to lowercase and drops combining accent marks.
- Replaces non-ASCII characters with spaces and collapses extra whitespace.

This should make the text more uniform for initial tokenization.

In [9]:
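def normalize_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    # Normalize Unicode to the composed form (NFC)
    text = unicodedata.normalize('NFC', text)
    # Lowercase, then drop any stand-alone combining marks.
    # Note: NFC keeps precomposed letters such as 'ō' intact, so they are not
    # reduced to their base letter here; the ASCII filter below removes them.
    text = text.lower()
    text = ''.join(c for c in text if not unicodedata.combining(c))
    # Replace runs of non-ASCII characters with a space, then collapse whitespace
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text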

In [10]:
# Applying normalizations
df_wiki_capped['cleaned_text'] = df_wiki_capped['text'].apply(normalize_text)
print("cleaned sample:")
print(df_wiki_capped[['text', 'cleaned_text']].head(5))

cleaned sample:
                                                text  \
0  = Valkyria Chronicles III =\nSenjō no Valkyria...   
1  = Tower Building of the Little Rock Arsenal =\...   
2  = Cicely Mary Barker =\nCicely Mary Barker (28...   
3  = Gambia women's national football team =\nThe...   
4  = Plain maskray =\nThe plain maskray or brown ...   

                                        cleaned_text  
0  = valkyria chronicles iii = senj no valkyria 3...  
1  = tower building of the little rock arsenal = ...  
2  = cicely mary barker = cicely mary barker (28 ...  
3  = gambia women's national football team = the ...  
4  = plain maskray = the plain maskray or brown s...  
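
In the sample above, "Senjō" comes out as "senj": NFC leaves the accented letter as a single precomposed character, so the combining-mark filter does not touch it and the non-ASCII filter then removes the whole letter. If the goal is to keep the base letter, one option is to decompose first. The sketch below assumes NFKD is acceptable for this data; `strip_accents` is a hypothetical helper, not part of the notebook:

```python
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters
    return "".join(c for c in decomposed if not unicodedata.combining(c))

sample = "Senj\u014d no Valkyria"  # "\u014d" is the precomposed letter o with macron

# With NFC the letter stays precomposed, so filtering combining marks changes nothing
nfc_filtered = "".join(
    c for c in unicodedata.normalize("NFC", sample) if not unicodedata.combining(c)
)

print(nfc_filtered == sample)          # True
print(strip_accents(sample).lower())   # senjo no valkyria
```

Swapping this into the pipeline would change the cleaned text (and therefore the tokens and deduplication counts), so the outputs in this notebook would need to be regenerated.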


### Tokenization
Tokenize the cleaned text with a Hugging Face fast tokenizer (bert-base-uncased) so that it is ready for downstream processing.

In [11]:
# Tokenizer initialization
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

def tokenize_text(text):
    tokens = tokenizer.tokenize(text)
    return tokens

df_wiki_capped['tokens'] = df_wiki_capped['cleaned_text'].apply(tokenize_text)
print("tokenized sample:")
print(df_wiki_capped[['cleaned_text', 'tokens']].head(2))

Token indices sequence length is longer than the specified maximum sequence length for this model (4206 > 512). Running this sequence through the model will result in indexing errors


tokenized sample:
                                        cleaned_text  \
0  = valkyria chronicles iii = senj no valkyria 3...   
1  = tower building of the little rock arsenal = ...   

                                              tokens  
0  [=, val, ##ky, ##ria, chronicles, iii, =, sen,...  
1  [=, tower, building, of, the, little, rock, ar...  
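
The `##` prefix in the output marks a sub-word piece that continues the previous token. BERT-style WordPiece tokenizers split each word greedily, taking the longest vocabulary match first. The following is a simplified sketch of that matching step only; `wordpiece` and `toy_vocab` are hypothetical and stand in for the real tokenizer and its roughly 30k-entry vocabulary:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary chosen to reproduce the first tokens of the sample above
toy_vocab = {"=", "val", "##ky", "##ria", "chronicles", "iii"}
tokens = [p for w in "= valkyria chronicles iii =".split() for p in wordpiece(w, toy_vocab)]
print(tokens)  # ['=', 'val', '##ky', '##ria', 'chronicles', 'iii', '=']
```

The length warning printed by the tokenizer concerns feeding the ids to the model (limit 512); `tokenize` itself still returns all tokens.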


### Deduplication
Deduplication removes duplicates or near-duplicates to avoid redundancy in the dataset. For now I remove only exact matches.

In [12]:
def deduplicate_texts(df, text_column='cleaned_text'):
    # Keep the first occurrence of each distinct text and drop exact repeats
    df = df.drop_duplicates(subset=[text_column])
    return df

df_wiki_capped_unique = deduplicate_texts(df_wiki_capped)
print("Number of rows after deduplication:", len(df_wiki_capped_unique))

Number of rows after deduplication: 29116
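
Exact matching misses texts that differ by only a word or two. As a rough idea of what a near-duplicate check could look like later, here is a sketch using Jaccard similarity over word shingles. The helper names, the shingle size of 3, and the toy sentences are all assumptions for illustration; nothing here is run on the dataset:

```python
def shingles(text, k=3):
    # Set of overlapping k-word windows; short texts yield a single shingle
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b, k=3):
    # Shared shingles divided by all distinct shingles
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

doc_a = "the plain maskray is a species of stingray"
doc_b = "the plain maskray is a species of ray"
doc_c = "ontario highway 89 is a provincially maintained highway"

sim_ab = jaccard(doc_a, doc_b)  # 5 shared shingles out of 7 distinct
sim_ac = jaccard(doc_a, doc_c)  # no shared shingles
print(round(sim_ab, 3), sim_ac)  # 0.714 0.0
```

Comparing every pair this way is quadratic in the number of documents, so at the scale of this dataset a hashing scheme such as MinHash would likely be needed.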


### Data Segmentation
Segment the cleaned text into sentences for finer-grained analysis, using NLTK's sentence tokenizer.

In [13]:
def segment_text(text, mode='sentence', fixed_token_length=100):
    if mode == 'sentence':
        segments = sent_tokenize(text)
    elif mode == 'fixed':
        tokens = tokenize_text(text)
        segments = [' '.join(tokens[i:i+fixed_token_length]) for i in range(0, len(tokens), fixed_token_length)]
    else:
        segments = [text]
    return segments
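
The `fixed` mode is not exercised in this notebook, so here is a small sketch of the same slicing pattern on whitespace tokens. `chunk_tokens` is a hypothetical helper, and whitespace splitting stands in for the BERT tokenizer:

```python
def chunk_tokens(tokens, size):
    # Step through the list in strides of `size`; the last chunk may be shorter
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

tokens = "released in january 2011 in japan it is the third game".split()
chunks = chunk_tokens(tokens, size=4)

for chunk in chunks:
    print(" ".join(chunk))
# released in january 2011
# in japan it is
# the third game
```

With `fixed_token_length` at or below the model limit, this kind of windowing is one way to avoid the over-length sequences flagged in the tokenizer warning.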

In [14]:
def segment_dataframe(df, text_column='cleaned_text', mode='sentence'):
    # assign() returns a new DataFrame, so the caller's frame is not modified
    # and pandas does not raise a SettingWithCopyWarning
    df = df.assign(segments=df[text_column].apply(lambda x: segment_text(x, mode=mode)))
    # One row per segment; the other columns are repeated for each segment
    df_segmented = df.explode('segments')
    return df_segmented

In [15]:
df_segmented = segment_dataframe(df_wiki_capped_unique, text_column='cleaned_text', mode='sentence')
print("Segmented data sample:")
print(df_segmented[['cleaned_text', 'segments']].head(5))

Segmented data sample:
                                        cleaned_text  \
0  = valkyria chronicles iii = senj no valkyria 3...   
0  = valkyria chronicles iii = senj no valkyria 3...   
0  = valkyria chronicles iii = senj no valkyria 3...   
0  = valkyria chronicles iii = senj no valkyria 3...   
0  = valkyria chronicles iii = senj no valkyria 3...   

                                            segments  
0  = valkyria chronicles iii = senj no valkyria 3...  
0  valkyria of the battlefield 3), commonly refer...  
0  released in january 2011 in japan, it is the t...  
0  employing the same fusion of tactical and real...  
0  the game began development in 2010, carrying o...  


In [16]:
len(df_segmented)

3414617

In [17]:
# Further capping the data
# Calculate the total number of segmented rows and keep one eighteenth of them
total_rows = len(df_segmented)
one_eighteenth_rows = total_rows // 18
print(f"Total segmented rows: {total_rows}")
print(f"Keeping only one eighteenth: {one_eighteenth_rows} rows")

Total segmented rows: 3414617
Keeping only one eighteenth: 189700 rows


In [18]:
# Select the first one eighteenth of the rows
df_segmented_subset = df_segmented.iloc[:one_eighteenth_rows].copy()

### Saving the preprocessed data

In [19]:
output_dir = "data"
os.makedirs(output_dir, exist_ok=True)
output_file = os.path.join(output_dir, "preprocessed_wikitext103_subset.csv")

# Save the segmented DataFrame to the specified CSV file
df_segmented_subset.to_csv(output_file, index=False)
print(f"Subset of preprocessed data saved to {output_file}")

Subset of preprocessed data saved to data\preprocessed_wikitext103_subset.csv
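
One caveat with a CSV round trip: list-valued columns such as `tokens` are written as their string representation, so after reloading each cell is a string rather than a list. A minimal sketch of the effect and one way to undo it, using only the standard library on a made-up row (with pandas, the analogous step would be applying `ast.literal_eval` to the column):

```python
import ast
import csv
import io

# One toy row with a list-valued field, mimicking the 'tokens' column
rows = [{"segments": "valkyria chronicles iii",
         "tokens": ["val", "##ky", "##ria", "chronicles", "iii"]}]

# Write to an in-memory CSV; non-string values are stringified with str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["segments", "tokens"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the list is now plain text
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["tokens"]
print(type(raw).__name__)  # str

# literal_eval safely parses the Python literal back into a list
restored = ast.literal_eval(raw)
print(restored == rows[0]["tokens"])  # True
```

A format that preserves nested types, such as Parquet or JSON Lines, would avoid the parsing step altogether.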


In [20]:
# Test the export by reloading the saved data

# Define the file path in the "data" directory
data_file = os.path.join("data", "preprocessed_wikitext103_subset.csv")

# Load the CSV file with a safe option to skip problematic lines if any exist
df_loaded = pd.read_csv(data_file, on_bad_lines='skip', engine='python')

# Display a sample of the loaded data and its dimensions
print("Loaded data sample:")
print(df_loaded.head())
print("\nShape of loaded data:", df_loaded.shape)

Loaded data sample:
                                                text  text_length  \
0  = Valkyria Chronicles III =\nSenjō no Valkyria...        20297   
1  = Valkyria Chronicles III =\nSenjō no Valkyria...        20297   
2  = Valkyria Chronicles III =\nSenjō no Valkyria...        20297   
3  = Valkyria Chronicles III =\nSenjō no Valkyria...        20297   
4  = Valkyria Chronicles III =\nSenjō no Valkyria...        20297   

                                        cleaned_text  \
0  = valkyria chronicles iii = senj no valkyria 3...   
1  = valkyria chronicles iii = senj no valkyria 3...   
2  = valkyria chronicles iii = senj no valkyria 3...   
3  = valkyria chronicles iii = senj no valkyria 3...   
4  = valkyria chronicles iii = senj no valkyria 3...   

                                              tokens  \
0  ['=', 'val', '##ky', '##ria', 'chronicles', 'i...   
1  ['=', 'val', '##ky', '##ria', 'chronicles', 'i...   
2  ['=', 'val', '##ky', '##ria', 'chronicles', 'i...   
3  [