# Data Preprocessing Module MVP for Exploration

This module focuses on the basics of preprocessing text data:
- Text Cleaning and Normalization
- Tokenization
- Deduplication
- Segmentation

In [7]:
!pip install datasets


Defaulting to user installation because normal site-packages is not writeable


In [2]:
# Importing libraries
import numpy as np
import pandas as pd
import os
from datasets import load_dataset

# Used for normalization and text cleaning
import re
import unicodedata

# For tokenizing
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize
from transformers import AutoTokenizer

[nltk_data] Downloading package punkt to C:\Users\ROSHAL
[nltk_data]     CARDOZA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\ROSHAL
[nltk_data]     CARDOZA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Text Cleaning and normalization
This step ensures consistency in the text input, reducing noise that can adversely affect later preprocessing or model training.

The examples use the open-source [wikitext-103-raw-v1 dataset from Hugging Face](https://huggingface.co/datasets/iohadrubin/wikitext-103-raw-v1).

In [3]:
# Loading the dataset
dataset = load_dataset("iohadrubin/wikitext-103-raw-v1", split="train")
# Convert it to a pandas DataFrame for easier processing
df_wiki = pd.DataFrame(dataset)

Generating train split:   0%|          | 0/29567 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/60 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/62 [00:00<?, ? examples/s]

#### Data Exploration

In [4]:
df_wiki.head()

Unnamed: 0,text
0,= Valkyria Chronicles III =\nSenjō no Valkyria...
1,= Tower Building of the Little Rock Arsenal =\...
2,= Cicely Mary Barker =\nCicely Mary Barker (28...
3,= Gambia women's national football team =\nThe...
4,= Plain maskray =\nThe plain maskray or brown ...


In [5]:
df_wiki.tail()

Unnamed: 0,text
29562,"= Si Una Vez =\n""Si Una Vez"" (English: If I On..."
29563,= Sicklefin lemon shark =\nThe sicklefin lemon...
29564,= Flammulated flycatcher =\nThe flammulated fl...
29565,"= Ontario Highway 89 =\nKing's Highway 89, com..."
29566,= Luke Smith (writer) =\nLuke Michael Smith is...


In [6]:
# Basic info
print("Dataset columns:", df_wiki.columns)
print("Dataset shape:", df_wiki.shape)

Dataset columns: Index(['text'], dtype='object')
Dataset shape: (29567, 1)


#### Data cleaning exploration

In [7]:
# Text length (in characters) for each entry
df_wiki['text_length'] = df_wiki['text'].str.len()
print("Text length stats:")
print(df_wiki['text_length'].describe())

Text length stats:
count     29567.000000
mean      17537.078161
std       14555.364685
min          16.000000
25%        7750.000000
50%       12994.000000
75%       22721.500000
max      140098.000000
Name: text_length, dtype: float64


In [8]:
# Rows containing double quotes. I checked this because the IDE's table view of
# df_wiki.head() failed with: "Unterminated quoted field at end of CSV line"
problematic = df_wiki[df_wiki['text'].str.contains('"', regex=False, na=False)]
print("Rows with potential quotation issues: ")
print(problematic.head(5))

Rows with potential quotation issues: 
                                                text  text_length
0  = Valkyria Chronicles III =\nSenjō no Valkyria...        20297
1  = Tower Building of the Little Rock Arsenal =\...        20770
2  = Cicely Mary Barker =\nCicely Mary Barker (28...        15371
4  = Plain maskray =\nThe plain maskray or brown ...         6695
5  = 2011 – 12 Columbus Blue Jackets season =\nTh...        17189


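### Text Cleaning and Normalization
The cleaning function below:
- Removes HTML tags.
- Normalizes Unicode (NFC) to standardize characters.
- Converts text to lowercase and drops stray combining accent marks.
- Replaces non-ASCII characters with spaces and collapses extra whitespace.

This should make the text more uniform for initial tokenization. Note that NFC keeps accented letters in composed form, so a character such as "ō" is removed by the non-ASCII filter rather than reduced to "o". This is why "Senjō" appears as "senj" in the cleaned sample.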

In [9]:
def normalize_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    # Normalize Unicode to composed form (NFC)
    text = unicodedata.normalize('NFC', text)
    # Lowercase, then drop any combining marks that remain after NFC
    text = text.lower()
    text = ''.join(c for c in text if not unicodedata.combining(c))
    # Replace non-ASCII characters with a space and collapse extra whitespace
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
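
The function above drops accented letters entirely, because NFC keeps them composed and the ASCII filter then removes them. A minimal sketch of one alternative, assuming it is acceptable to map accented letters to their base letters: decompose with NFKD first, drop the combining marks, then apply the same ASCII filter. The name `strip_accents_then_ascii` is hypothetical and not part of this notebook.

```python
import re
import unicodedata

def strip_accents_then_ascii(text):
    """Hypothetical variant of normalize_text that keeps base letters."""
    # NFKD splits "ō" into "o" plus a combining macron (U+0304)
    text = unicodedata.normalize('NFKD', text)
    # Drop the combining marks, leaving the base letters
    text = ''.join(c for c in text if not unicodedata.combining(c))
    text = text.lower()
    # Replace any remaining non-ASCII characters and collapse whitespace
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_accents_then_ascii("Senjō no Valkyria"))  # senjo no valkyria
```

NFKD also applies compatibility mappings (for example, ligatures are expanded), which may or may not be wanted depending on the downstream task.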

In [10]:
# Applying normalizations
df_wiki['cleaned_text'] = df_wiki['text'].apply(normalize_text)
print("cleaned sample:")
print(df_wiki[['text', 'cleaned_text']].head(5))

cleaned sample:
                                                text  \
0  = Valkyria Chronicles III =\nSenjō no Valkyria...   
1  = Tower Building of the Little Rock Arsenal =\...   
2  = Cicely Mary Barker =\nCicely Mary Barker (28...   
3  = Gambia women's national football team =\nThe...   
4  = Plain maskray =\nThe plain maskray or brown ...   

                                        cleaned_text  
0  = valkyria chronicles iii = senj no valkyria 3...  
1  = tower building of the little rock arsenal = ...  
2  = cicely mary barker = cicely mary barker (28 ...  
3  = gambia women's national football team = the ...  
4  = plain maskray = the plain maskray or brown s...  


### Tokenization
Tokenize the cleaned text with a Hugging Face fast tokenizer (`bert-base-uncased`) so that downstream steps can work on subword tokens.

In [11]:
# Tokenizer initialization
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

def tokenize_text(text):
    tokens = tokenizer.tokenize(text)
    return tokens

df_wiki['tokens'] = df_wiki['cleaned_text'].apply(tokenize_text)
print("tokenized sample:")
print(df_wiki[['cleaned_text', 'tokens']].head(2))

Token indices sequence length is longer than the specified maximum sequence length for this model (4206 > 512). Running this sequence through the model will result in indexing errors


tokenized sample:
                                        cleaned_text  \
0  = valkyria chronicles iii = senj no valkyria 3...   
1  = tower building of the little rock arsenal = ...   

                                              tokens  
0  [=, val, ##ky, ##ria, chronicles, iii, =, sen,...  
1  [=, tower, building, of, the, little, rock, ar...  
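
The warning above says the first article has 4206 tokens, well beyond the 512-token limit of `bert-base-uncased`. A minimal sketch of one way to handle this, assuming overlapping fixed-size windows are acceptable for the downstream task. The function name `window_tokens` and the defaults `max_len=510` and `stride=128` are assumptions for illustration, not values used elsewhere in this notebook.

```python
def window_tokens(tokens, max_len=510, stride=128):
    """Split a token list into overlapping windows of at most max_len tokens.

    max_len=510 leaves room for the [CLS] and [SEP] tokens BERT adds;
    stride is the number of tokens shared by consecutive windows.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
    return windows

# Stand-in for a tokenized article of 4206 tokens
tokens = [f"tok{i}" for i in range(4206)]
windows = window_tokens(tokens)
print(len(windows), len(windows[0]), len(windows[-1]))
```

The overlap means a sentence cut at one window boundary still appears whole in the neighboring window, at the cost of processing some tokens twice.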


### Deduplication
Deduplication removes duplicates or near-duplicates to avoid redundancy in the dataset. For now I remove only exact matches on the cleaned text.

In [12]:
def deduplicate_texts(df, text_column='cleaned_text'):
    # Keep the first occurrence of each distinct cleaned text
    df = df.drop_duplicates(subset=[text_column])
    return df

df_wiki_unique = deduplicate_texts(df_wiki)
print("Number of rows after deduplication:", len(df_wiki_unique))

Number of rows after deduplication: 29116
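
Exact matching cannot catch near-duplicates, which the paragraph above mentions but the notebook does not yet handle. A minimal sketch of one common approach, assuming word-level shingles and Jaccard similarity are a reasonable similarity measure here. The helper names, the `k=3` shingle size, the thresholds, and the sample strings are all made up for illustration.

```python
from itertools import combinations

def shingles(text, k=3):
    """Return the set of k-word shingles in a text."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicates(texts, threshold=0.8, k=3):
    """Return index pairs whose shingle sets have Jaccard similarity >= threshold."""
    sets = [shingles(t, k) for t in texts]
    return [(i, j) for i, j in combinations(range(len(texts)), 2)
            if jaccard(sets[i], sets[j]) >= threshold]

docs = [
    "the plain maskray is a species of stingray found in shallow waters",
    "the plain maskray is a species of stingray found in shallow coastal waters",
    "king's highway 89 is a provincially maintained highway in ontario",
]
print(near_duplicates(docs, threshold=0.6))  # [(0, 1)]
```

This compares every pair, so it only suits small samples. For the full set of roughly 29,000 articles, an approximate method such as MinHash with locality-sensitive hashing would likely be needed.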


### Data Segmentation
Segment the cleaned text into sentences with NLTK's sentence tokenizer for finer-grained analysis. The helper below also supports fixed-length token chunks (`mode='fixed'`).

In [13]:
def segment_text(text, mode='sentence', fixed_token_length=100):
    if mode == 'sentence':
        segments = sent_tokenize(text)
    elif mode == 'fixed':
        tokens = tokenize_text(text)
        segments = [' '.join(tokens[i:i+fixed_token_length]) for i in range(0, len(tokens), fixed_token_length)]
    else:
        segments = [text]
    return segments

In [14]:
def segment_dataframe(df, text_column='cleaned_text', mode='sentence'):
    # Work on a copy so the caller's DataFrame is not modified and pandas
    # does not raise SettingWithCopyWarning
    df = df.copy()
    df['segments'] = df[text_column].apply(lambda x: segment_text(x, mode=mode))
    df_segmented = df.explode('segments')
    return df_segmented

In [15]:
df_segmented = segment_dataframe(df_wiki_unique, text_column='cleaned_text', mode='sentence')
print("Segmented data sample:")
print(df_segmented[['cleaned_text', 'segments']].head(5))

Segmented data sample:
                                        cleaned_text  \
0  = valkyria chronicles iii = senj no valkyria 3...   
0  = valkyria chronicles iii = senj no valkyria 3...   
0  = valkyria chronicles iii = senj no valkyria 3...   
0  = valkyria chronicles iii = senj no valkyria 3...   
0  = valkyria chronicles iii = senj no valkyria 3...   

                                            segments  
0  = valkyria chronicles iii = senj no valkyria 3...  
0  valkyria of the battlefield 3), commonly refer...  
0  released in january 2011 in japan, it is the t...  
0  employing the same fusion of tactical and real...  
0  the game began development in 2010, carrying o...  
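
In the sample above, the first segment still starts with the article title (`= valkyria chronicles iii = ...`), because whitespace collapsing removes the newline that separated the title from the body. A minimal sketch of one way to separate the two before normalization, assuming the raw text begins with a single title line of the form `= Title =`. The name `split_title` and the sample string are hypothetical.

```python
import re

# Matches a top-level title line such as "= Example title =".
# Lines like "= = Section = =" are deliberately not matched.
TITLE_RE = re.compile(r'^\s*=\s*([^=\s].*?)\s*=\s*$')

def split_title(raw_text):
    """Return (title, body); title is None when the first line is not a title."""
    first_line, _, rest = raw_text.partition('\n')
    match = TITLE_RE.match(first_line)
    if match:
        return match.group(1), rest
    return None, raw_text

title, body = split_title("= Example title =\nFirst sentence. Second sentence.")
print(title)
print(body)
```

The title could then be kept in its own column, and only the body passed through `normalize_text` and `segment_text`.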


### Saving the preprocessed data

In [None]:
output_dir = "data"
os.makedirs(output_dir, exist_ok=True)  # Create the directory if it doesn't exist
output_file = os.path.join(output_dir, "preprocessed_wikitext103.csv")

# Save the segmented DataFrame to the specified CSV file
df_segmented.to_csv(output_file, index=False)
print(f"Preprocessed data saved to {output_file}")
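
Because many rows contain double quotes, the saved CSV is only useful if every field is quoted consistently. Otherwise readers can fail with errors like the "Unterminated quoted field" message seen earlier. A minimal sketch with the standard `csv` module, assuming full quoting (`csv.QUOTE_ALL`) is acceptable. The sample rows are made up. It writes to an in-memory buffer and checks that the data round-trips unchanged.

```python
import csv
import io

rows = [
    {"cleaned_text": 'a "quoted" title, with a comma', "segments": 'a "quoted" title'},
    {"cleaned_text": "line one\nline two", "segments": "line one"},
]

# Wrap every field in quotes; embedded quotes are escaped by doubling them
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["cleaned_text", "segments"],
                        quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)

# Read the data back and confirm nothing was lost or split across rows
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)
```

`DataFrame.to_csv` accepts the same constant through its `quoting` parameter, so `df_segmented.to_csv(output_file, index=False, quoting=csv.QUOTE_ALL)` would be the pandas equivalent.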

In [None]:

# Import necessary libraries for Membership Inference Attack (MIA)
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

print("Libraries for MIA loaded successfully!")
    

In [None]:

# Load a pre-trained model for embeddings
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()  # inference only

def get_embedding(text):
    # Inputs are truncated to 128 tokens, so only the start of each article is embedded
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()  # [CLS] token representation, shape (1, 768)

# Embed only the first 100 rows for efficiency; all other rows get NaN in this column
df_wiki['embedding'] = df_wiki['text'][:100].apply(get_embedding)

print("Embeddings extracted successfully!")
    

In [None]:

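# Stack the available embeddings (rows without one are NaN) into a 2-D array
embeddings = np.vstack(df_wiki['embedding'].dropna().tolist())

# Split embeddings into train & attack-test sets
X_train, X_attack_test = train_test_split(embeddings, test_size=0.5, random_state=42)

# Compute cosine similarity between each attack-test and train embedding
similarities = cosine_similarity(X_attack_test, X_train)

# Highest similarity to any training embedding (nearest neighbor)
nearest_distances = np.max(similarities, axis=1)

print("Nearest-neighbor similarities computed!")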
    

In [None]:

# Placeholder labels (1 = in-train, 0 = out-of-train): drawn at random, so they
# do NOT reflect real membership. Replace them with ground-truth labels before
# drawing any conclusion from the accuracy below.
true_labels = np.random.choice([0, 1], size=len(X_attack_test), p=[0.5, 0.5])

# Train a logistic regression attack model on the nearest-neighbor similarity
attack_model = LogisticRegression()
attack_model.fit(nearest_distances.reshape(-1, 1), true_labels)

# Evaluate the MIA attack (on the same samples it was fitted on, so the
# estimate is optimistic)
attack_predictions = attack_model.predict(nearest_distances.reshape(-1, 1))
accuracy = accuracy_score(true_labels, attack_predictions)

print(f"Membership Inference Attack (MIA) Accuracy: {accuracy:.2f}")
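
The labels in the cell above are random, so the reported accuracy says nothing about actual membership. A minimal sketch of a similarity-threshold membership test in which the labels are known by construction, using small synthetic vectors as stand-ins for embeddings. Everything here (the vector size, the `0.99` threshold, the helper names) is an assumption for illustration. Members score essentially 1.0 because they are exact copies of reference vectors, so this only shows how to set up labels and scoring, not how a real attack would perform.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def max_similarity(query, reference):
    """Highest cosine similarity between query and any reference vector."""
    return max(cosine(query, r) for r in reference)

rng = random.Random(42)

def random_vec(dim=16):
    """Synthetic stand-in for an embedding."""
    return [rng.gauss(0, 1) for _ in range(dim)]

# Members are stored in the reference set; non-members are fresh vectors
members = [random_vec() for _ in range(20)]
non_members = [random_vec() for _ in range(20)]
candidates = members + non_members
true_labels = [1] * len(members) + [0] * len(non_members)

# Threshold test: predict "member" when the nearest reference is very close
threshold = 0.99  # assumed cut-off; would need tuning on real data
scores = [max_similarity(c, members) for c in candidates]
predictions = [int(s >= threshold) for s in scores]
accuracy = sum(p == t for p, t in zip(predictions, true_labels)) / len(true_labels)
print(f"Accuracy on synthetic data: {accuracy:.2f}")
```

With real embeddings, the member and non-member score distributions would overlap, and the threshold (or a classifier such as the logistic regression above) would have to be fitted on one labeled split and evaluated on another.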
    