# 1. Data Preprocessing

---------------------
1) Loads the IMDB and Amazon (polarity) datasets from Kaggle via `kagglehub`.
2) Cleans text (lowercase, remove punctuation, normalize spaces) to match FastText requirements.
3) Converts items into FastText supervised format: `__label__positive <text>` or `__label__negative <text>`.
4) Samples:
- IMDB: 5500 total (50/50 pos/neg) as the training source pool.
- IMDB: 500 total (50/50) for in-distribution testing.
- Amazon: 500 total (50/50) for OOD testing only.
5) Accepts a synthetic dataset file (generated externally) and optionally:
- Deduplicates and removes highly similar items using cosine similarity of sentence embeddings.
- Performs submodular selection (facility location) to keep a diverse subset.
- Synthetic file format expected: one example per line in FastText format, e.g.
`__label__positive this movie was absolutely amazing and touching`
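
The similarity filter described in step 5 can be sketched as follows. This is a minimal illustration, not the pipeline's actual filter: it uses bag-of-words counts as a stand-in for sentence embeddings so that it runs with the standard library only, and the names `bow_vector`, `cosine`, `drop_near_duplicates`, and the `0.9` threshold are assumptions made for this example.

```python
import math
import re
from collections import Counter
from typing import List

def bow_vector(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def drop_near_duplicates(lines: List[str], threshold: float = 0.9) -> List[str]:
    """Keep a line only if it is not too similar to any line already kept."""
    kept, kept_vecs = [], []
    for line in lines:
        # Compare on the text only, ignoring the leading __label__ token.
        text = line.split(" ", 1)[1] if " " in line else ""
        vec = bow_vector(text)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(line)
            kept_vecs.append(vec)
    return kept

examples = [
    "__label__positive this movie was absolutely amazing and touching",
    "__label__positive this movie was absolutely amazing and touching",
    "__label__negative the plot was dull and the acting was worse",
]
print(len(drop_near_duplicates(examples)))  # the exact duplicate is dropped
```

The pairwise comparison is quadratic in the number of lines, which is workable for a few thousand examples but would need an approximate nearest-neighbour index for much larger files.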

Output directory structure (created if missing):

- ./data/
  - imdb_train_source.txt # 5500 balanced cleaned samples (real IMDB pool)
  - amazon_test_500.txt # OOD test set 500 balanced
  - imdb_test_500.txt # In-distribution test set 500 balanced
- ./data/synthetic/
  - synthetic_raw.txt # provided synthetic file (cleaned/normalized)
  - synthetic_filtered.txt # after similarity filtering + optional submodular selection

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("kritanjalijain/amazon-reviews")

print("Path to dataset files:", path)
import os
os.listdir(path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/kritanjalijain/amazon-reviews?dataset_version_number=2...


100%|██████████| 1.29G/1.29G [00:16<00:00, 86.2MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2


['amazon_review_polarity_csv.tgz', 'train.csv', 'test.csv']

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

print("Path to dataset files:", path)

import os
os.listdir(path)


Using Colab cache for faster access to the 'imdb-dataset-of-50k-movie-reviews' dataset.
Path to dataset files: /kaggle/input/imdb-dataset-of-50k-movie-reviews


['IMDB Dataset.csv']

In [None]:
# ==== Data preprocessing only: IMDB (Kaggle CSV) + Amazon Review Polarity (Zhang) ====
import os, re, tarfile, random
import pandas as pd
from collections import Counter
from typing import List

SEED = 42
random.seed(SEED)

# ---- Input paths (edit if needed) ----
IMDB_CSV = "/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv"
AMZ_TGZ  = "/kaggle/input/amazon-reviews/amazon_review_polarity_csv.tgz"  # optional; will be extracted if present
AMZ_TRAIN = "/kaggle/input/amazon-reviews/train.csv"
AMZ_TEST  = "/kaggle/input/amazon-reviews/test.csv"

#
#IMDB_CSV = "/root/.cache/kagglehub/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/versions/1/IMDB Dataset.csv"
# AMZ_TGZ = "/root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2/amazon_review_polarity_csv.tgz"
# AMZ_TRAIN = "/root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2/train.csv"
# AMZ_TEST = "/root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2/test.csv"

# ---- Output dirs ----
os.makedirs("data/final_splits", exist_ok=True)
os.makedirs("data/vocabs", exist_ok=True)

# ---- Basic cleaner (FastText-style) ----
PUNCT_RE = re.compile(r"[^a-z0-9\s]")
SPACE_RE = re.compile(r"\s+")
FASTTEXT_POS = "__label__positive"
FASTTEXT_NEG = "__label__negative"

def base_clean_text(s: str) -> str:
    s = str(s).lower()
    s = PUNCT_RE.sub(" ", s)
    s = SPACE_RE.sub(" ", s)
    return s.strip()

def to_fasttext(label: str, text: str) -> str:
    lab = FASTTEXT_POS if label == "positive" else FASTTEXT_NEG
    return f"{lab} {text}"

def save_lines(path: str, lines: List[str]):
    d = os.path.dirname(path)
    if d: os.makedirs(d, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for ln in lines:
            f.write(ln + "\n")

def save_vocab(path: str, lines: List[str]):
    cnt = Counter()
    for l in lines:
        parts = l.split(" ", 1)
        if len(parts) == 2:
            cnt.update(parts[1].split())
    d = os.path.dirname(path)
    if d: os.makedirs(d, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for tok, c in cnt.most_common():
            f.write(f"{tok}\t{c}\n")

# ---- 0) Extract Amazon polarity if we only have the .tgz ----
if os.path.exists(AMZ_TGZ):
    try:
        with tarfile.open(AMZ_TGZ, "r:gz") as tar:
            tar.extractall(path=".", filter="data")  # Safer default: only extract regular files
        print("Extracted amazon_review_polarity_csv.tgz")
    except Exception as e:
        print("Warning: could not extract .tgz:", e)

# ---- 1) IMDB (Kaggle) -> pool + 500 test ----
if not os.path.exists(IMDB_CSV):
    raise FileNotFoundError(f"Missing {IMDB_CSV}. Update IMDB_CSV path at the top of the cell.")

imdb = pd.read_csv(IMDB_CSV)  # columns: review, sentiment (Kaggle)
text_col = "review" if "review" in imdb.columns else imdb.columns[0]
label_col = "sentiment" if "sentiment" in imdb.columns else imdb.columns[1]
imdb = imdb[[text_col, label_col]].dropna()

# Clean + map labels
imdb[text_col] = imdb[text_col].astype(str).apply(base_clean_text)
imdb[label_col] = (
    imdb[label_col].astype(str).str.lower().map({
        "positive": "positive", "pos": "positive", "1": "positive",
        "negative": "negative", "neg": "negative", "0": "negative"
    })
)
imdb = imdb[imdb[label_col].isin(["positive", "negative"])]

# Split
imdb_pos = imdb[imdb[label_col] == "positive"][text_col].tolist()
imdb_neg = imdb[imdb[label_col] == "negative"][text_col].tolist()
random.shuffle(imdb_pos); random.shuffle(imdb_neg)

# Pool ~5.5k (2750/2750); you can later sample 5k for training if needed
pos_pool = imdb_pos[:2750]
neg_pool = imdb_neg[:2750]
imdb_pool_lines = [to_fasttext("positive", t) for t in pos_pool] + \
                  [to_fasttext("negative", t) for t in neg_pool]
random.shuffle(imdb_pool_lines)

# IMDB test 500 (250/250), prefer using rows beyond pool if available
pos_test = imdb_pos[2750:3000] if len(imdb_pos) >= 3000 else imdb_pos[:250]
neg_test = imdb_neg[2750:3000] if len(imdb_neg) >= 3000 else imdb_neg[:250]
pos_test = pos_test[:250]
neg_test = neg_test[:250]
imdb_test_lines = [to_fasttext("positive", t) for t in pos_test] + \
                  [to_fasttext("negative", t) for t in neg_test]
random.shuffle(imdb_test_lines)

# ---- 2) Amazon Review Polarity (Zhang) -> 500 test (250/250) ----
# CSV format is usually: [label, title, text] or [label, text]; labels: 1=negative, 2=positive
def _read_amazon_csv(csv_path: str) -> pd.DataFrame:
    try:
        df = pd.read_csv(csv_path, header=None)
    except Exception:
        df = pd.read_csv(csv_path, header=None, engine="python", sep=",", quoting=3, on_bad_lines="skip")
    if df.shape[1] == 3:
        df.columns = ["label", "title", "text"]
        df["text"] = df["text"].astype(str)
    elif df.shape[1] == 2:
        df.columns = ["label", "text"]
        df["text"] = df["text"].astype(str)
    else:
        df["text"] = df.iloc[:, -1].astype(str)
        df["label"] = df.iloc[:, 0]
    return df[["label", "text"]]


if os.path.exists(AMZ_TEST):
    amz_df = _read_amazon_csv(AMZ_TEST)
elif os.path.exists(AMZ_TRAIN):
    amz_df = _read_amazon_csv(AMZ_TRAIN)
else:
    raise FileNotFoundError("Amazon polarity CSVs not found (train.csv/test.csv).")

def _map_amz_label(v):
    try:
        v = int(v)
    except Exception:
        return None
    if v == 1: return "negative"
    if v == 2: return "positive"
    return None

amz_df = amz_df.dropna()
amz_df["label_bin"] = amz_df["label"].apply(_map_amz_label)
amz_df = amz_df[amz_df["label_bin"].notna()].copy()
amz_df["text"] = amz_df["text"].astype(str).apply(base_clean_text)

amz_pos = amz_df[amz_df["label_bin"] == "positive"]["text"].tolist()
amz_neg = amz_df[amz_df["label_bin"] == "negative"]["text"].tolist()
random.shuffle(amz_pos); random.shuffle(amz_neg)

amz_pos = amz_pos[:250] if len(amz_pos) >= 250 else amz_pos
amz_neg = amz_neg[:250] if len(amz_neg) >= 250 else amz_neg
amazon_test_lines = [to_fasttext("positive", t) for t in amz_pos] + \
                    [to_fasttext("negative", t) for t in amz_neg]
random.shuffle(amazon_test_lines)

# ---- 3) Save outputs ----
save_lines("data/imdb_train_source.txt", imdb_pool_lines)
save_lines("data/imdb_test_500.txt", imdb_test_lines)
save_lines("data/amazon_test_500.txt", amazon_test_lines)

# ---- 3.1) Optional: also save an exact 5,000-line balanced IMDB training set ----
random.seed(SEED)

# separate positive and negative lines from the 5.5k pool
pos = [l for l in imdb_pool_lines if l.startswith(FASTTEXT_POS)]
neg = [l for l in imdb_pool_lines if l.startswith(FASTTEXT_NEG)]

# take 2,500 of each → total 5,000
k = min(2500, len(pos), len(neg))
real_5k = pos[:k] + neg[:k]
random.shuffle(real_5k)

save_lines("data/final_splits/real_5k.txt", real_5k)
save_vocab("data/vocabs/vocab_real_5k.txt", real_5k)

print(f"Saved balanced IMDB training set: data/final_splits/real_5k.txt ({len(real_5k)} lines)")



# (Optional) quick vocabs for sanity check
save_vocab("data/vocabs/vocab_imdb_pool.txt", imdb_pool_lines)
save_vocab("data/vocabs/vocab_imdb_test_500.txt", imdb_test_lines)
save_vocab("data/vocabs/vocab_amazon_test_500.txt", amazon_test_lines)

# ---- 4) Report ----
print("  Preprocessing complete.")
print("  IMDB pool:", len(imdb_pool_lines), "(~5500 target)")
print("  IMDB test:", len(imdb_test_lines), "(target 500)")
print("  Amazon test:", len(amazon_test_lines), "(target 500)")
print("\nFiles written:")
for p in [
    "data/imdb_train_source.txt",
    "data/imdb_test_500.txt",
    "data/amazon_test_500.txt",
    "data/vocabs/vocab_imdb_pool.txt",
    "data/vocabs/vocab_imdb_test_500.txt",
    "data/vocabs/vocab_amazon_test_500.txt",
]:
    print(" -", p, "→", os.path.exists(p))


In [None]:
""" Sanity checks: line counts and a quick peek at a few reviews """

# line counts
!wc -l data/imdb_train_source.txt data/imdb_test_500.txt data/amazon_test_500.txt

# peek a few rows
!head -n 3 data/imdb_test_500.txt
!head -n 3 data/amazon_test_500.txt

   5500 data/imdb_train_source.txt
    500 data/imdb_test_500.txt
    500 data/amazon_test_500.txt
   6500 total
__label__positive lead actor yuko tanaka fulfills so much in the exceptionally meditative the milkwoman a tranquil canvass on missed chances in the life of a 50 something woman charting her routine with sincerely poignant motives played out in the picturesque tranquil town of nagasaki akira ogata s unconventional romantic film so to speak is less a straight out melodrama than a deliberate introspection of its characters surrender to their current lives as a result of a tragic past that forced them to a choice they did not call for br br perfectly embodying the requisite world weariness subjected to a spiritless routine tanaka plays minako oba a middle aged woman who before her work shift at a supermarket takes it upon herself to deliver bottles of milk among the residents of the hilly nagasaki one of the houses she constantly passes by to make such a delivery is that of kait
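
Line counts alone do not confirm the 50/50 label balance. A small check along these lines could be added; it is a sketch, and the helper name `label_counts` and the sample lines are made up for illustration.

```python
from collections import Counter
from typing import Iterable

def label_counts(lines: Iterable[str]) -> Counter:
    """Count the leading __label__ token of each non-empty line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[line.split(" ", 1)[0]] += 1
    return counts

# Illustrative lines only; in practice pass an open file handle instead.
sample = [
    "__label__positive great film\n",
    "__label__negative boring film\n",
    "__label__positive loved it\n",
]
print(label_counts(sample))
```

For the real files, something like `with open("data/imdb_test_500.txt", encoding="utf-8") as f: print(label_counts(f))` should report 250 lines per label if the sampling worked as intended.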

# 2. Data Generation

The synthetic data was generated with the ChatGPT web app using the GPT-5 model.

Code to shuffle the datasets

In [None]:
import numpy as np
import os
np.random.seed(42)

#Uncomment the necessary file names to shuffle the 100% synthetic or 50/50 real/synthetic dataset
# inputFile = "syntheticData.txt"
# outputFile = "shuffledSyntheticData.txt"
inputFile = "5050RealSynthetic.txt"
outputFile = "shuffled5050RealSynthetic.txt"

# Read every line of the input file into a list
with open(inputFile, encoding="utf-8") as datasetUnshuffled:
    datasetLines = datasetUnshuffled.readlines()
print(len(datasetLines))

# Make sure the last line ends with a newline so shuffled lines cannot merge
if datasetLines and not datasetLines[-1].endswith("\n"):
    datasetLines[-1] += "\n"

# Shuffle indices
index = np.arange(0, len(datasetLines), 1, dtype=int)
print("index has length: ", len(index))
np.random.shuffle(index)
print("after shuffling index has length: ", len(index))

# Write the lines in shuffled order; opening in "w" mode creates the output file if needed
count = 0
with open(outputFile, "w", encoding="utf-8") as datasetShuffled:
    for idx in index:
        datasetShuffled.write(datasetLines[idx])
        count += 1
print("count is: ", count)

# Re-read the output to confirm the line count
with open(outputFile, "r", encoding="utf-8") as datasetShuffled:
    print(len(datasetShuffled.readlines()))
print("finished shuffling the dataset!")

FileNotFoundError: [Errno 2] No such file or directory: '5050RealSynthetic.txt'
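
The shuffle step can also be expressed as a small pure function, which makes it easy to test without touching files. This is a sketch under the assumption that blank lines should be dropped; `shuffle_lines` is a hypothetical helper name, not part of the notebook.

```python
import random
from typing import List

def shuffle_lines(lines: List[str], seed: int = 42) -> List[str]:
    """Return a shuffled copy; every kept line ends with exactly one newline."""
    normalized = [ln.rstrip("\n") + "\n" for ln in lines if ln.strip()]
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    rng.shuffle(normalized)
    return normalized

lines = ["__label__positive a\n", "__label__negative b\n", "__label__positive c"]
shuffled = shuffle_lines(lines)
# Same multiset of lines before and after, only the order may differ.
print(sorted(shuffled) == sorted(ln.rstrip("\n") + "\n" for ln in lines))
```

Using a local `random.Random(seed)` keeps the result reproducible regardless of what other cells have done to the global seed.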

Code to generate the 50/50 real/synthetic data (needs to be shuffled after generation)

In [None]:
import numpy as np

limit = 1250
realPositiveCount = 0
realNegativeCount = 0
synthPositiveCount = 0
synthNegativeCount = 0

#a function to determine the sentiment label of a review
def parseLabel(line):
    if "__label__positive" in line:
        return "positive"
    elif "__label__negative" in line:
        return "negative"

#open the real and synthetic datasets and read the lines. Separate the lines into positive and negative reviews
np.random.seed(42)
realData = open("real_5k.txt")
realLines = realData.readlines()
realPositiveList = []
realNegativeList = []
for line in realLines:
    if parseLabel(line) == "positive":
        realPositiveList.append(line)
    elif parseLabel(line) == "negative":
        realNegativeList.append(line)

syntheticData = open("shuffledSyntheticData.txt")
syntheticLines = syntheticData.readlines()
syntheticPositiveList = []
syntheticNegativeList = []
for line in syntheticLines:
    if parseLabel(line) == "positive":
        syntheticPositiveList.append(line)
    elif parseLabel(line) == "negative":
        syntheticNegativeList.append(line)

realData.close()
syntheticData.close()

print("We have: ", len(realPositiveList), " real positive examples")
print("We have: ", len(realNegativeList), " real negative examples")
print("We have: ", len(syntheticPositiveList), " synthetic positive examples")
print("We have: ", len(syntheticNegativeList), " synthetic negative examples")


lists = [realPositiveList, realNegativeList, syntheticPositiveList, syntheticNegativeList]
listLimit = [realPositiveCount, realNegativeCount, synthPositiveCount, synthNegativeCount]
# Create the output file and randomly sample `limit` examples from each of the four subcategories
with open("5050RealSynthetic.txt", "w") as f:
    for i in range(len(lists)):
        if len(lists[i]) < limit:
            raise ValueError(f"list {i} has only {len(lists[i])} examples, need {limit}")
        # Shuffle indices over this list's own length so every index is valid
        index = np.random.permutation(len(lists[i]))
        while listLimit[i] < limit:
            idx = listLimit[i]
            f.write(lists[i][index[idx]])
            listLimit[i] += 1


print("finished sampling")

totalPositiveCount = 0
totalNegativeCount = 0
totalLineCount = 0
with open("5050RealSynthetic.txt", "r") as f:
    for line in f:
        if parseLabel(line) == "positive":
            totalPositiveCount += 1
        elif parseLabel(line) == "negative":
            totalNegativeCount += 1
        totalLineCount += 1

print("there are ", totalPositiveCount, " positive examples")
print("there are ", totalNegativeCount, " negative examples")
print("there are ", totalLineCount, " total examples")

FileNotFoundError: [Errno 2] No such file or directory: 'real_5k.txt'
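
The four-way sampling above (real/synthetic crossed with positive/negative) can be sketched more compactly with `random.sample`, which draws without replacement. The helper name `sample_balanced`, the group names, and the toy data are assumptions for illustration.

```python
import random
from typing import Dict, List

def sample_balanced(groups: Dict[str, List[str]], per_group: int, seed: int = 42) -> List[str]:
    """Draw the same number of lines from every group, without replacement."""
    rng = random.Random(seed)
    out: List[str] = []
    for name, items in groups.items():
        if len(items) < per_group:
            raise ValueError(f"group {name!r} has only {len(items)} items, need {per_group}")
        out.extend(rng.sample(items, per_group))
    return out

# Toy stand-ins for the four lists built in the cell above.
groups = {
    "real_pos": [f"__label__positive real {i}\n" for i in range(5)],
    "real_neg": [f"__label__negative real {i}\n" for i in range(5)],
    "synth_pos": [f"__label__positive synth {i}\n" for i in range(5)],
    "synth_neg": [f"__label__negative synth {i}\n" for i in range(5)],
}
mixed = sample_balanced(groups, per_group=2)
print(len(mixed))
```

With `per_group=1250` and the real lists this would yield the same 5,000-line composition; the output is still grouped by category, so it needs the shuffle step afterwards.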

Code to generate the appended dataset (needs to be shuffled after generation)

---



In [None]:
def parseLabel(line):
    if "__label__positive" in line:
        return "positive"
    elif "__label__negative" in line:
        return "negative"

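def appendExample(ex1, ex2):
    # Drop ex1's trailing newline, then append ex2's text without its "__label__... " prefix
    return ex1.rstrip("\n") + " " + ex2.split(" ", 1)[1]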

np.random.seed(42)
realData = open("real_5k.txt")
realLines = realData.readlines()
realPositiveList = []
realNegativeList = []
for line in realLines:
    if parseLabel(line) == "positive":
        realPositiveList.append(line)
    elif parseLabel(line) == "negative":
        realNegativeList.append(line)

syntheticData = open("shuffledSyntheticData.txt")
syntheticLines = syntheticData.readlines()
syntheticPositiveList = []
syntheticNegativeList = []
for line in syntheticLines:
    if parseLabel(line) == "positive":
        syntheticPositiveList.append(line)
    elif parseLabel(line) == "negative":
        syntheticNegativeList.append(line)

realData.close()
syntheticData.close()

print("We have: ", len(realPositiveList), " real positive examples")
print("We have: ", len(realNegativeList), " real negative examples")
print("We have: ", len(syntheticPositiveList), " synthetic positive examples")
print("We have: ", len(syntheticNegativeList), " synthetic negative examples")

lengths = {len(realPositiveList), len(realNegativeList), len(syntheticPositiveList), len(syntheticNegativeList)}
if len(lengths) != 1:
    raise ValueError("The list lengths are not all equal.")

with open("appendedRealSynthetic.txt", "w") as f:
    for i in range(len(realPositiveList)):
        f.write(appendExample(realPositiveList[i], syntheticPositiveList[i]))
        f.write(appendExample(realNegativeList[i], syntheticNegativeList[i]))

print("finished appending")

totalPositiveCount = 0
totalNegativeCount = 0
totalLineCount = 0
with open("appendedRealSynthetic.txt", "r") as f:
    for line in f:
        # print(line)
        if parseLabel(line) == "positive":
            totalPositiveCount += 1
        elif parseLabel(line) == "negative":
            totalNegativeCount += 1
        totalLineCount += 1

print("there are ", totalPositiveCount, " positive examples")
print("there are ", totalNegativeCount, " negative examples")
print("there are ", totalLineCount, " total examples")
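
To make the appending step concrete, here is a small worked example. It is a sketch: `append_example` is a hypothetical variant of the helper above that also checks that both lines share a label, which the notebook's version does not do.

```python
def append_example(real_line: str, synthetic_line: str) -> str:
    """Join a real line with the text of a synthetic line that has the same label."""
    real_label, real_text = real_line.rstrip("\n").split(" ", 1)
    synth_label, synth_text = synthetic_line.rstrip("\n").split(" ", 1)
    if real_label != synth_label:
        raise ValueError("both lines must carry the same label")
    return f"{real_label} {real_text} {synth_text}\n"

combined = append_example(
    "__label__positive great cast\n",
    "__label__positive loved the ending\n",
)
print(combined, end="")  # __label__positive great cast loved the ending
```

Each output line therefore carries one label followed by the real text and then the synthetic text, so the appended file has as many lines as the real file.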

# 3. Submodular Data Subset Selection

### Install Dependencies

In [None]:
!pip install scikit-learn tqdm -q

print("Dependencies installed!")

### Import Libraries

In [None]:
import numpy as np
import os
import json
from typing import List, Tuple
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.notebook import tqdm

# Configuration
SEED = 42
np.random.seed(SEED)

# Paths
REAL_5K = "/content/data/input/real_5k.txt"
SYNTHETIC_5K = "/content/data/input/shuffledSyntheticData.txt"
MIXED_5K = "/content/data/input/5050RealSynthetic.txt"

OUTPUT_DIR = "data/submodular_selected"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Libraries imported successfully!")
print(f"Output directory: {OUTPUT_DIR}")

### Helper Functions

In [None]:
def load_fasttext_data(filepath: str) -> List[Tuple[str, str]]:
    """Read FastText-format lines into (label, text) pairs; other lines are skipped."""
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            for label in ("positive", "negative"):
                prefix = f"__label__{label}"
                if line.startswith(prefix):
                    # Strip only the leading label, not later occurrences in the text
                    data.append((label, line[len(prefix):].strip()))
                    break
    return data

def save_fasttext_data(filepath: str, data: List[Tuple[str, str]]):
    """Write (label, text) pairs back out in FastText format."""
    out_dir = os.path.dirname(filepath)
    if out_dir:  # os.makedirs("") raises, so skip it for files in the current directory
        os.makedirs(out_dir, exist_ok=True)
    with open(filepath, 'w', encoding='utf-8') as f:
        for label, text in data:
            f.write(f"__label__{label} {text}\n")

print("Functions loaded")

### Check Input Files

In [None]:
print("Checking for input files...\n")

for name, path in [("Real 5k", REAL_5K),
                   ("Synthetic 5k", SYNTHETIC_5K),
                   ("Mixed 5k", MIXED_5K)]:
    exists = os.path.exists(path)
    status = "Found" if exists else "Not found"
    print(f"{name}: {status}")
    if exists:
        with open(path) as f:
            count = sum(1 for _ in f)
        print(f"   → {count} lines\n")


### Facility Location Dataset Selector
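
For reference, the quantity the selector in this section greedily maximizes can be written out as follows. The notation is an assumption introduced here, not taken from the notebook: $V$ is the full dataset, $S \subseteq V$ the selected subset, $\mathrm{sim}(i, j)$ the TF-IDF cosine similarity, and $m_i$ the current coverage of point $i$ (the `coverage` array in the code).

```latex
F(S) = \sum_{i \in V} \max_{j \in S} \mathrm{sim}(i, j),
\qquad
m_i = \max_{j \in S} \mathrm{sim}(i, j)

\Delta(c \mid S) = F(S \cup \{c\}) - F(S)
                 = \sum_{i \in V} \max\bigl(0,\ \mathrm{sim}(i, c) - m_i\bigr)
```

Each greedy step adds the candidate $c$ with the largest gain $\Delta(c \mid S)$. For the empty set, $m_i = 0$ is a natural starting value, under the assumption that similarities are non-negative (which holds for cosine similarity of TF-IDF vectors).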

In [None]:
class FacilityLocationSelector:
    """
    Greedy facility-location (submodular) selection with batched processing.

    Speed improvements:
    1. Precompute similarity matrix (done once)
    2. Batch candidate evaluation (check 100 at a time)
    """

    def __init__(self, tfidf_matrix, labels: List[str]):
        self.tfidf_matrix = tfidf_matrix
        self.labels = np.array(labels)
        self.n_samples = tfidf_matrix.shape[0]

        print("Computing similarity matrix (this takes time but speeds up selection)...")
        # Precompute full similarity matrix
        self.similarity = cosine_similarity(tfidf_matrix)
        print(f"Similarity matrix: {self.similarity.shape}")

    def select_greedy(self, k: int, stratified: bool = True) -> List[int]:
        """Fast greedy selection with stratification."""
        if stratified:
            pos_indices = np.where(self.labels == 'positive')[0]
            neg_indices = np.where(self.labels == 'negative')[0]

            k_pos = k // 2
            k_neg = k - k_pos

            print(f"Selecting {k_pos} positive + {k_neg} negative = {k} total")

            selected_pos = self._fast_select(pos_indices, k_pos)
            selected_neg = self._fast_select(neg_indices, k_neg)

            return list(selected_pos) + list(selected_neg)
        else:
            return self._fast_select(np.arange(self.n_samples), k)

    def _fast_select(self, candidate_indices: np.ndarray, k: int) -> List[int]:
        """
        FAST greedy selection using vectorized operations.

        Key optimization: Use precomputed similarity matrix
        """
        selected = []
        remaining = set(candidate_indices)

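        # Coverage: max similarity from each point to selected set.
        # Start at 0 (TF-IDF cosine similarities are non-negative); starting at -inf
        # makes every first-step gain infinite, so the first pick would be arbitrary.
        coverage = np.zeros(self.n_samples)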

        pbar = tqdm(range(k), desc="Fast selection")

        for iteration in pbar:
            # Convert remaining to array for vectorized ops
            remaining_arr = np.array(list(remaining))

            if len(remaining_arr) == 0:
                break

            # Vectorized gain computation
            # For each candidate, compute new coverage if we add it
            best_idx = None
            best_gain = -np.inf

            # Process in batches for memory efficiency
            batch_size = 100
            for i in range(0, len(remaining_arr), batch_size):
                batch = remaining_arr[i:i+batch_size]

                # Get similarities for this batch
                batch_sims = self.similarity[:, batch]  # (n_samples, batch_size)

                # Compute new coverage for each in batch
                new_coverage = np.maximum(coverage[:, np.newaxis], batch_sims)  # (n_samples, batch_size)

                # Compute gains
                gains = np.sum(new_coverage, axis=0) - np.sum(coverage)  # (batch_size,)

                # Find best in batch
                batch_best_idx = np.argmax(gains)
                batch_best_gain = gains[batch_best_idx]

                if batch_best_gain > best_gain:
                    best_gain = batch_best_gain
                    best_idx = batch[batch_best_idx]

            # Add best
            selected.append(best_idx)
            remaining.remove(best_idx)

            # Update coverage
            coverage = np.maximum(coverage, self.similarity[:, best_idx])

            # Update progress bar
            pbar.set_postfix({
                'gain': f'{best_gain:.1f}',
                'remaining': len(remaining)
            })

        return selected

    def compute_diversity_score(self, selected_indices: List[int]) -> float:
        """Compute facility location objective."""
        if not selected_indices:
            return 0.0

        selected_sims = self.similarity[:, selected_indices]
        max_sims = np.max(selected_sims, axis=1)
        return float(np.sum(max_sims))

print("FacilityLocationSelector loaded\n")
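
To see the greedy rule at work on something small, here is a toy version on a hand-written 4x4 similarity matrix. It is a sketch in plain Python, not the class above: `greedy_facility_location` and the matrix values are made up for illustration, and coverage starts at 0 under the assumption that similarities are non-negative.

```python
from typing import List

def greedy_facility_location(sim: List[List[float]], k: int) -> List[int]:
    """Greedily pick up to k columns that maximize the summed per-row max similarity."""
    n = len(sim)
    coverage = [0.0] * n  # best similarity of each point to the selected set so far
    selected: List[int] = []
    for _ in range(min(k, n)):
        best_idx, best_gain = -1, -1.0
        for c in range(n):
            if c in selected:
                continue
            # Marginal gain: how much total coverage improves if c is added.
            gain = sum(max(coverage[i], sim[i][c]) - coverage[i] for i in range(n))
            if gain > best_gain:
                best_idx, best_gain = c, gain
        selected.append(best_idx)
        coverage = [max(coverage[i], sim[i][best_idx]) for i in range(n)]
    return selected

# Points 0 and 1 are near-duplicates; points 2 and 3 are mostly unrelated to everything.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.2],
    [0.0, 0.1, 0.2, 1.0],
]
print(greedy_facility_location(sim, k=2))  # [1, 3]
```

The first pick is point 1, the most central item. The second pick skips its near-duplicate (point 0, gain of about 0.1) in favour of point 3, which is the least covered so far.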

### Main Processing Function

In [None]:
def process_dataset(input_file: str, output_prefix: str, target_size: int = 3000):
    """Process dataset with fast submodular selection."""

    print(f"\n{'='*70}")
    print(f"Processing: {input_file}")
    print(f"{'='*70}\n")

    # Load
    print("[1/4] Loading data...")
    data = load_fasttext_data(input_file)
    labels = [label for label, text in data]
    texts = [text for label, text in data]
    print(f" Loaded {len(data)} examples")

    # TF-IDF
    print("\n[2/4] Computing TF-IDF...")
    vectorizer = TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.95
    )
    tfidf_matrix = vectorizer.fit_transform(texts)
    print(f" TF-IDF: {tfidf_matrix.shape}")

    # Selection
    print("\n[3/4] Fast submodular selection...")
    selector = FacilityLocationSelector(tfidf_matrix, labels)

    selected_indices = selector.select_greedy(k=target_size, stratified=True)

    # Diversity
    diversity_full = selector.compute_diversity_score(list(range(len(texts))))
    diversity_selected = selector.compute_diversity_score(selected_indices)

    print(f"\n Diversity: {diversity_selected/diversity_full*100:.1f}% retained")

    # Save
    print("\n[4/4] Saving...")
    selected_data = [(labels[i], texts[i]) for i in selected_indices]

    output_file = os.path.join(OUTPUT_DIR, f"{output_prefix}_selected.txt")
    save_fasttext_data(output_file, selected_data)
    print(f" Saved: {output_file}")

    stats = {
        'original_size': len(data),
        'selected_size': len(selected_indices),
        'positive_count': sum(1 for i in selected_indices if labels[i] == 'positive'),
        'negative_count': sum(1 for i in selected_indices if labels[i] == 'negative'),
        'diversity_retention': float(diversity_selected / diversity_full)
    }

    stats_file = os.path.join(OUTPUT_DIR, f"{output_prefix}_stats.json")
    with open(stats_file, 'w') as f:
        json.dump(stats, f, indent=2)

    return stats

### Process All Three Datasets

In [None]:
print("="*70)
print("PROCESSING ALL DATASETS")
print("="*70)

all_stats = {}

# Real
if os.path.exists(REAL_5K):
    all_stats['real'] = process_dataset(REAL_5K, "real_3k", target_size=3000)
    print("\n Real complete!\n")

# Synthetic
if os.path.exists(SYNTHETIC_5K):
    all_stats['synthetic'] = process_dataset(SYNTHETIC_5K, "synthetic_3k", target_size=3000)
    print("\n Synthetic complete!\n")

# Mixed
if os.path.exists(MIXED_5K):
    all_stats['mixed'] = process_dataset(MIXED_5K, "mixed_3k", target_size=3000)
    print("\n Mixed complete!\n")

### Summary

In [None]:
if all_stats:
    print("\n" + "="*70)
    print("SUMMARY")
    print("="*70)

    print(f"\n{'Dataset':<15} {'Original':<10} {'Selected':<10} {'Diversity':<12}")
    print("-" * 50)
    for name, stats in all_stats.items():
        print(f"{name:<15} {stats['original_size']:<10} "
              f"{stats['selected_size']:<10} "
              f"{stats['diversity_retention']*100:.1f}%")

    print(f"\nResults in: {OUTPUT_DIR}/")
    print("\nFiles created:")
    for name in all_stats.keys():
        print(f"  • {OUTPUT_DIR}/{name}_3k_selected.txt")

    print("\n Done! ")

print("\n" + "="*70)

# 4. Model Training

In [None]:
!pip install fasttext

In [None]:
import os
import fasttext
import json

##### Hyperparameter Search for FastText (Learning Rate × Epoch)
- We ran a grid search over several learning rates and epoch counts to find the best configuration for the FastText classifier trained on the 5k real dataset.
- For each combination, we:
  1. Train the FastText model
  2. Evaluate precision, recall, and F1 score on the validation set
  3. Save the trained model
  4. Record all results and track the best-performing hyperparameter configuration

In [None]:
lrs = [0.01, 0.05, 0.1, 0.3]
epochs = [5, 10, 25, 50]

os.makedirs("models", exist_ok=True)

results = []
best_score = 0.0
best_config = None

for lr in lrs:
    for epoch in epochs:
        print(f"Training lr={lr}, epoch={epoch}")
        model = fasttext.train_supervised(
            input="real_5000.train",
            lr=lr,
            epoch=epoch,
            wordNgrams=2,
            dim=100,
            loss="softmax",
        )

        # Validation
        # N: number of samples, p: precision@1, r: recall@1
        N, p, r = model.test("real_5000.valid")

        #F1 Score
        f1 = 0.0
        if p + r > 0:
            f1 = 2 * p * r / (p + r)

        # Save Model
        model_path = f"models/real_5000_lr{lr}_ep{epoch}.bin"
        model.save_model(model_path)

        # Result -> dictionary
        result = {
            "lr": lr,
            "epoch": epoch,
            "N": N,
            "precision": p,
            "recall": r,
            "f1": f1,
            "model_path": model_path,
        }
        results.append(result)

        # Best Score
        if f1 > best_score:
            best_score = f1
            best_config = result

with open("results_real_5000.json", "w") as f:
    json.dump(results, f, indent=2)

print("Best config:", best_config)
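
Since every configuration is written to `results_real_5000.json`, the runs can be ranked afterwards instead of tracking only the single best one. A minimal sketch follows; `top_configs` is a hypothetical helper, and the demo rows are placeholder values in the same shape as the saved entries, not measured results.

```python
from typing import Dict, List

def top_configs(results: List[Dict], metric: str = "f1", n: int = 3) -> List[Dict]:
    """Return the n entries with the highest value of the given metric."""
    return sorted(results, key=lambda r: r[metric], reverse=True)[:n]

# Placeholder rows for illustration only; these are NOT results from the notebook.
demo = [
    {"lr": 0.01, "epoch": 5, "f1": 0.50},
    {"lr": 0.1, "epoch": 25, "f1": 0.75},
    {"lr": 0.3, "epoch": 25, "f1": 0.70},
]
for row in top_configs(demo, n=2):
    print(row["lr"], row["epoch"], row["f1"])
```

On the real file this would be called as `top_configs(json.load(f))` after opening `results_real_5000.json`, which makes it easier to see whether several configurations score almost the same.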


##### Training
- A FastText classifier is trained on each dataset (Real, Synthetic, Hybrid) with identical settings.

Before Submodular Data Selection

- Real 5000 dataset

In [None]:
!head -n 4500 real_5k.txt > real_5000.train
!tail -n 500 real_5k.txt > real_5000.valid
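
The `head`/`tail` split relies on the file already being shuffled, and even then the validation slice is only balanced by chance. A stratified split is one alternative; the sketch below is an assumption about how it could be done, and `stratified_split` is not part of the notebook.

```python
import random
from typing import List, Tuple

def stratified_split(lines: List[str], n_valid: int, seed: int = 42) -> Tuple[List[str], List[str]]:
    """Hold out n_valid lines, half per label; everything else is training data."""
    rng = random.Random(seed)
    pos = [l for l in lines if l.startswith("__label__positive")]
    neg = [l for l in lines if l.startswith("__label__negative")]
    rng.shuffle(pos)
    rng.shuffle(neg)
    half = n_valid // 2
    valid = pos[:half] + neg[:half]
    train = pos[half:] + neg[half:]
    rng.shuffle(valid)
    rng.shuffle(train)
    return train, valid

# Toy data: 10 positive and 10 negative lines.
toy = [f"__label__positive p{i}\n" for i in range(10)] + \
      [f"__label__negative n{i}\n" for i in range(10)]
train, valid = stratified_split(toy, n_valid=4)
print(len(train), len(valid))  # 16 4
```

With `n_valid=500` on the 5,000-line file this would give the same 4,500/500 sizes as the shell commands, with exactly 250 validation lines per label.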

In [None]:
model = fasttext.train_supervised(input="real_5000.train", lr=0.3, epoch=25, wordNgrams=2,dim=100, loss='softmax')
model.save_model("real_5000.bin")
model.test("real_5000.valid")

(500, 0.886, 0.886)

- Synthetic 5000 dataset

In [None]:
!head -n 4500 shuffledSyntheticData.txt > synthetic_5000.train
!tail -n 500 shuffledSyntheticData.txt > synthetic_5000.valid

In [None]:
model = fasttext.train_supervised(input="synthetic_5000.train", lr=0.3, epoch=25, wordNgrams=2,dim=100, loss='softmax')
model.save_model("synthetic_5000.bin")
model.test("synthetic_5000.valid")

(500, 0.996, 0.996)

- Hybrid 5000 dataset (real+synthetic)

In [None]:
!head -n 4500 5050RealSynthetic.txt > mixed_5000.train
!tail -n 500 5050RealSynthetic.txt > mixed_5000.valid

In [None]:
model = fasttext.train_supervised(input="mixed_5000.train", lr=0.3, epoch=25, wordNgrams=2,dim=100, loss='softmax')
model.save_model("mixed_5000.bin")
model.test("mixed_5000.valid")

(500, 0.954, 0.954)

After Submodular Data Selection

- Real 3000 dataset

In [None]:
!head -n 2700 real_3k_selected_shuffled.txt > real_selected.train
!tail -n 300 real_3k_selected_shuffled.txt > real_selected.valid

In [None]:
model = fasttext.train_supervised(input="real_selected.train", lr=0.3, epoch=25, wordNgrams=2,dim=100, loss='softmax')
model.save_model("real_selected.bin")
model.test("real_selected.valid")

(300, 0.83, 0.83)

- Hybrid 3000 dataset (real+synthetic)

In [None]:
!head -n 2700 mixed_3k_selected_shuffled.txt > mixed.train
!tail -n 300 mixed_3k_selected_shuffled.txt > mixed.valid

In [None]:
model = fasttext.train_supervised(input="mixed.train", lr=0.3, epoch=25, wordNgrams=2,dim=100, loss='softmax')
model.save_model("mixed.bin")
model.test("mixed.valid")

(300, 0.88, 0.88)

- Synthetic 3000 dataset

In [None]:
!head -n 2700 synthetic_3k_selected_shuffled.txt > synthetic.train
!tail -n 300 synthetic_3k_selected_shuffled.txt > synthetic.valid

In [None]:
model = fasttext.train_supervised(input="synthetic.train", lr=0.3, epoch=25, wordNgrams=2,dim=100, loss='softmax')
model.save_model("synthetic.bin")
model.test("synthetic.valid")

(300, 0.9933333333333333, 0.9933333333333333)

# 5. Evaluation

1. Create 500-sample test sets from IMDB and Amazon datasets

In [None]:
!head -n 500 imdb_test_500.txt > imdb.test
!head -n 500 amazon_test_500.txt > amazon.test

- Mix the IMDB and Amazon test sets and shuffle them to create a combined test set

In [None]:
!cat imdb.test amazon.test | shuf > imdb_amazon.test

2. Evaluation
  - Load each trained model
  - Evaluate on:
      - imdb_amazon.test: combined IMDB + Amazon performance
      - imdb.test: in-domain performance
      - amazon.test: out-of-distribution performance
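
Every `model.test(...)` result in this section is a tuple of (number of examples, precision, recall) in which the two scores are identical. That is expected when each example has exactly one gold label and the model returns exactly one prediction: both scores reduce to plain accuracy. The sketch below illustrates this with made-up labels; `precision_recall_at_1` is a hypothetical helper, not a FastText function.

```python
from typing import List, Tuple

def precision_recall_at_1(gold: List[str], predicted: List[str]) -> Tuple[int, float, float]:
    """Precision and recall when there is one gold label and one prediction per example."""
    n = len(gold)
    correct = sum(g == p for g, p in zip(gold, predicted))
    precision = correct / n  # correct predictions / number of predictions made
    recall = correct / n     # correct predictions / number of gold labels
    return n, precision, recall

# Made-up labels for illustration.
gold = ["positive", "negative", "positive", "negative"]
pred = ["positive", "negative", "negative", "negative"]
print(precision_recall_at_1(gold, pred))  # (4, 0.75, 0.75)
```

Either number in the tuples can therefore be read as the accuracy on that test set.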


##### Before Data Selection
1. Real 5000 Dataset

In [None]:
real_5k_model = fasttext.load_model("real_5000.bin")
real_5k_model.test("imdb_amazon.test")

(1000, 0.822, 0.822)

In [None]:
print(real_5k_model.test("imdb.test"))
print(real_5k_model.test("amazon.test"))

(500, 0.88, 0.88)
(500, 0.764, 0.764)


2. Synthetic 5000 Dataset

In [None]:
model = fasttext.load_model("synthetic_5000.bin")
model.test("imdb_amazon.test")

(1000, 0.614, 0.614)

In [None]:
print(model.test("imdb.test"))
print(model.test("amazon.test"))

(500, 0.63, 0.63)
(500, 0.598, 0.598)


3. Hybrid 5000 Dataset (Real + Synthetic)

In [None]:
model = fasttext.load_model("mixed_5000.bin")
model.test("imdb_amazon.test")

(1000, 0.687, 0.687)

In [None]:
print(model.test("imdb.test"))
print(model.test("amazon.test"))

(500, 0.688, 0.688)
(500, 0.686, 0.686)


##### After Data Selection
1. Real 3000 Dataset

In [None]:
model = fasttext.load_model("real_selected.bin")
model.test("imdb_amazon.test")

(1000, 0.819, 0.819)

In [None]:
print(model.test("imdb.test"))
print(model.test("amazon.test"))

(500, 0.848, 0.848)
(500, 0.79, 0.79)


2. Synthetic 3000 Dataset

In [None]:
model = fasttext.load_model("synthetic.bin")
model.test("imdb_amazon.test")

(1000, 0.608, 0.608)

In [None]:
print(model.test("imdb.test"))
print(model.test("amazon.test"))

(500, 0.628, 0.628)
(500, 0.588, 0.588)


3. Hybrid 3000 Dataset (Real + Synthetic)

In [None]:
model = fasttext.load_model("mixed.bin")
model.test("imdb_amazon.test")

(1000, 0.815, 0.815)

In [None]:
print(model.test("imdb.test"))
print(model.test("amazon.test"))

(500, 0.85, 0.85)
(500, 0.78, 0.78)
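
The per-model outputs are easier to compare side by side. The sketch below copies the `imdb.test` and `amazon.test` scores from the cell outputs in this section and computes the in-domain minus out-of-distribution gap; the names `reported` and `ood_gap` are introduced here for illustration.

```python
# (imdb.test, amazon.test) scores copied from the cell outputs above
reported = {
    "real 5000": (0.88, 0.764),
    "synthetic 5000": (0.63, 0.598),
    "hybrid 5000": (0.688, 0.686),
    "real 3000": (0.848, 0.79),
    "synthetic 3000": (0.628, 0.588),
    "hybrid 3000": (0.85, 0.78),
}

def ood_gap(in_domain: float, out_of_domain: float) -> float:
    """In-domain score minus out-of-distribution score, rounded to 3 decimals."""
    return round(in_domain - out_of_domain, 3)

for name, (imdb, amazon) in reported.items():
    print(f"{name:<15} imdb={imdb:.3f} amazon={amazon:.3f} gap={ood_gap(imdb, amazon):.3f}")
```

Read this way, the hybrid model moves from 0.688/0.686 at 5,000 examples to 0.85/0.78 after selection, while the real and synthetic models change by only a few points in either direction.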
