# **Information**

**• Members:** 사푸트라 (Saputra Rizky Johan), 바트오르식 (Butemj Bat-Orshikh), 쉬슈잔 (Shu Xian Chow)

**• Institution:** Seoul National University, South Korea

**• Course:** 2025-2 Introduction to Natural Language Processing (001)

**• Instructors:** 황승원 (Prof.), 김종윤 (TA), 한상은 (TA)

**• Project:** Classifier-Guided Politeness Rewriting via Span Detection and Controlled Text Generation

**• Corpus:** Stanford Politeness Corpus (ConvoKit)


# **Politeness Rewriter - Colab Runner**

**• Note:** This notebook runs the **politeness-rewriter** project end-to-end in Google Colab with the current version pins:
- Keep **NumPy ≥ 2** (for spaCy/thinc/ConvoKit)
- Pin **Transformers 4.44.2** (supports `evaluation_strategy`)
- Keep **pandas==2.2.2** (to match google-colab)
- Use **spaCy 3.8.x** + **ConvoKit 3.4.1** (Py3.12-friendly wheels)

**• Pipeline:** Drive mount → unzip → fix package layout → install deps → download data → train classifier → sanity infer → rewrite → quick eval → *(optional)* Gradio demo.


# **Case 1. Run via Google Drive**

## **1.1. Check GPU and Mount Drive**

**• Purpose:**
- Check which GPU (if any) is available to the Colab runtime and mount Google Drive.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.

In [None]:
## --------------------------------------- START ---------------------------------------
# Check the available GPU
!nvidia-smi || echo "No GPU visible (training will still work, just slower)."

# Mount drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
## ---------------------------------------- END ----------------------------------------
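The same GPU check can be done from Python, which helps when a later cell needs to branch on GPU availability. The following is a minimal sketch using only the standard library; the helper name `gpu_visible` is hypothetical and not part of the project.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```

A `False` result only means that `nvidia-smi` could not be run; training should still work on CPU, just more slowly.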

## **1.2. Locate and load the project file**

**• Purpose:**
- Locate the project ZIP in Drive, unzip it into `/content`, and make `/content` the working directory.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.
- Adjust the ZIP file names in `ZIP_CANDIDATES` if your archive is named differently.

In [None]:
## --------------------------------------- START ---------------------------------------
# Setup the necessary modules or packages
import os

# Declare the candidate names of the ZIP file in your Drive (adjust if necessary)
ZIP_CANDIDATES = [
    "/content/drive/MyDrive/politeness-rewriters.zip",
    "/content/drive/MyDrive/politeness-rewritters.zip",
    "/content/drive/MyDrive/Downloads/politeness-rewriter.zip",
    "/content/drive/MyDrive/Downloads/politeness-rewritter.zip",
    "/content/drive/MyDrive/politeness-rewriter.zip",
    "/content/drive/MyDrive/politeness-rewritter.zip",
]

# Locate the politeness rewriter ZIP file (the first existing candidate wins)
zip_path = None
for cand in ZIP_CANDIDATES:
    if os.path.exists(cand):
        zip_path = cand
        break
if zip_path is None:
    raise FileNotFoundError("Could not find the project ZIP. Set `zip_path` manually to its location in Drive.")

# Display the zip file and its path
print("Using ZIP:", zip_path)

# Unzip the file into the content directory (Adjust if necessary)
%cd /content
!unzip -o "$zip_path" -d /content
## ---------------------------------------- END ----------------------------------------
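If the archive sits somewhere other than the listed candidates, a pattern search can serve as a fallback. The sketch below is illustrative only: the helper `find_zip` and the glob pattern `politeness-rewrit*.zip` are assumptions chosen to cover the spellings in the candidate list, and a recursive search over a large Drive may be slow.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_zip(candidates: Iterable[str],
             search_root: Optional[str] = None,
             pattern: str = "politeness-rewrit*.zip") -> Optional[Path]:
    """Return the first existing candidate, else the first pattern match under search_root."""
    for cand in candidates:
        path = Path(cand)
        if path.is_file():
            return path
    if search_root is not None:
        # Fall back to a recursive search; sorted() keeps the choice deterministic
        matches = sorted(Path(search_root).rglob(pattern))
        if matches:
            return matches[0]
    return None


# Example (not executed here): find_zip(ZIP_CANDIDATES, search_root="/content/drive/MyDrive")
```

Explicit candidates are checked first so that a known location always takes priority over whatever the search happens to find.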

## **1.3. Project Normalization**

**• Purpose:**
- Normalize the project folder name, ensure that `src` is a package, and remove any bundled `env/`.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.
- Adjust the candidate folder names if your project folder is named differently.

In [None]:
## --------------------------------------- START ---------------------------------------
# Setup the necessary modules or packages
import os, shutil, sys

# Normalize the project folder name
proj_path = None
for cand in ["/content/politeness-rewriters", "/content/politeness-rewritters"]:
    if os.path.isdir(cand):
        proj_path = cand; break
if proj_path is None:
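    for root, dirs, files in os.walk("/content"):
        # Skip the mounted Drive and Colab's sample data so the search stays fast
        if root == "/content":
            dirs[:] = [d for d in dirs if d not in ("drive", "sample_data")]
        if "src" in dirs and "requirements.txt" in files:
            proj_path = root; break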
if proj_path is None:
    raise RuntimeError("Could not locate the project folder after unzipping.")

# Rename the misspelling to a canonical folder
if proj_path.endswith("politeness-rewritters"):
    target = "/content/politeness-rewriters"
    if os.path.isdir(target):
        shutil.rmtree(target, ignore_errors=True)
    os.rename(proj_path, target)
    proj_path = target

print("Project path:", proj_path)
%cd "$proj_path"

# Remove the bundled venv if present
if os.path.isdir("env"):
    print("Removing bundled env/"); shutil.rmtree("env", ignore_errors=True)

# Ensure src is a package
os.makedirs("src", exist_ok=True)
if os.path.exists("src/init.py") and not os.path.exists("src/__init__.py"):
    os.rename("src/init.py", "src/__init__.py")
if not os.path.exists("src/__init__.py"):
    open("src/__init__.py", "a").close()
print("src/__init__.py OK")
## ---------------------------------------- END ----------------------------------------
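After normalization, a quick check can confirm that the folder has the expected layout before dependencies are installed. This is a sketch under the assumptions stated in this section (a `src/` package and no bundled `env/`); the helper `layout_problems` is hypothetical.

```python
from pathlib import Path
from typing import List


def layout_problems(project_root: str) -> List[str]:
    """List deviations from the layout this notebook expects; empty means OK."""
    root = Path(project_root)
    problems = []
    if not (root / "src").is_dir():
        problems.append("missing src/ directory")
    elif not (root / "src" / "__init__.py").is_file():
        problems.append("src/ has no __init__.py")
    if (root / "env").is_dir():
        problems.append("bundled env/ is still present")
    return problems


# Example (not executed here): print(layout_problems(proj_path) or "Layout OK")
```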

## **1.4. Install Dependencies**

**• Purpose:**
- Install the necessary Python dependencies required to load and run the Politeness Rewriter project.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.
- Adjust the dependencies only when necessary; otherwise, keep the current pins.

In [None]:
%%bash
## --------------------------------------- START ---------------------------------------
set -e

# Modernize toolchain
python -m pip install -U pip setuptools wheel

# Foundation pins (Keep NumPy >=2 and pandas==2.2.2 for Colab harmony)
python -m pip install -U "numpy>=2.0,<3" "pandas==2.2.2"

# Core HF stack (Transformers 4.44.2 supports `evaluation_strategy`)
python -m pip install -U "transformers==4.44.2" "tokenizers==0.19.1" "accelerate==0.33.0"

# NLP, metrics, utils
python -m pip install -U "datasets>=2.20.0" "sentence-transformers>=3.0.1"   "scikit-learn>=1.3.0" "tqdm>=4.66.0" "bert-score>=0.3.13" "evaluate>=0.4.1"   "gradio>=4.36.1" "nltk>=3.8.1" "pyyaml>=6.0"

# Spacy 3.8.x (Py3.12 wheels) + Convokit 3.4.1
python -m pip install -U "spacy>=3.8,<3.9" "convokit==3.4.1"

# If repo has requirements.txt, install it without deps so we don't downgrade the pinned stack.
if [ -f requirements.txt ]; then
  python -m pip install --no-deps -r requirements.txt || true
fi
## ---------------------------------------- END ----------------------------------------
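To confirm that the pins took effect, the installed versions can be read from package metadata without importing the packages. A minimal sketch follows; the helper `installed_versions` and the `PINNED` list are illustrative names, with the package names taken from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


PINNED = ["numpy", "pandas", "transformers", "tokenizers", "accelerate", "spacy", "convokit"]
for name, ver in installed_versions(PINNED).items():
    print(f"{name:>14}: {ver or 'NOT INSTALLED'}")
```

If a package was already imported before the install, restarting the runtime may be needed for the new version to be the one actually loaded.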

## **1.5. Cache the HuggingFace models (Optional)**

**• Purpose:**
- Cache Hugging Face models in Drive to avoid re-downloading them in later sessions. This step is optional.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.

In [None]:
## --------------------------------------- START ---------------------------------------
# Setup the necessary modules or packages
import os

# Declare the directory for caching the HF models into Drive
os.environ["HF_HOME"] = "/content/drive/MyDrive/hf_cache"
os.environ["TRANSFORMERS_CACHE"] = "/content/drive/MyDrive/hf_cache"
os.makedirs(os.environ["HF_HOME"], exist_ok=True)
print("HF cache:", os.environ["HF_HOME"])
## ---------------------------------------- END ----------------------------------------
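Setting the variables through `os.environ` matters because the later `!python ...` commands run as child processes, and child processes inherit the notebook's environment. The sketch below demonstrates that inheritance with a throwaway variable; `DEMO_CACHE_DIR` and the helper `child_sees` are hypothetical names used only for this demonstration.

```python
import os
import subprocess
import sys


def child_sees(var: str) -> str:
    """Return the value of `var` as seen by a fresh Python subprocess."""
    code = f"import os; print(os.environ.get({var!r}, ''))"
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Hypothetical variable, set the same way the cell above sets HF_HOME
os.environ["DEMO_CACHE_DIR"] = "/tmp/demo_cache"
print("Child process sees:", child_sees("DEMO_CACHE_DIR"))
```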

## **1.6. Patch the dataset key**

**• Purpose:**
- Patch `src/download_data.py` so that it uses the correct ConvoKit dataset key (`stack-exchange-politeness-corpus`).

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.

In [None]:
## --------------------------------------- START ---------------------------------------
# Setup the necessary modules or packages
from pathlib import Path

# Declare the directory and associated paths
p = Path("src/download_data.py")
s = p.read_text()

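# Patch the download_data.py module (report honestly whether anything changed)
old_key = 'download("stanford-politeness-corpus")'
new_key = 'download("stack-exchange-politeness-corpus")'
if old_key in s:
    p.write_text(s.replace(old_key, new_key))
    print("Patched download_data.py -> 'stack-exchange-politeness-corpus'")
else:
    print("No patch applied: old dataset key not found (already patched?)")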
## ---------------------------------------- END ----------------------------------------

## **1.7. Download the Stanford Politeness data via ConvoKit**

**• Purpose:**
- Run download_data.py with the stack-exchange-politeness-corpus.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.

In [None]:

## --------------------------------------- START ---------------------------------------
# Download the Stanford Politeness data and run the dataset loader for training
%cd "$proj_path"
!python src/download_data.py
## ---------------------------------------- END ----------------------------------------
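Before training, it can help to confirm that the download produced a non-empty JSONL file and to look at the label balance. The sketch below assumes one JSON object per line; the field name `label` is a guess and should be replaced with whatever key `download_data.py` actually writes. The helper `summarize_jsonl` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Optional, Tuple


def summarize_jsonl(path: str, label_key: Optional[str] = None) -> Tuple[int, Counter]:
    """Count the records in a JSONL file and, optionally, the values of one field."""
    n_rows = 0
    labels = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip empty lines
                continue
            record = json.loads(line)
            n_rows += 1
            if label_key is not None:
                labels[record.get(label_key)] += 1
    return n_rows, labels


train_file = Path("data/train.jsonl")
if train_file.exists():
    # "label" is an assumed field name; adjust it to match the actual data
    rows, label_counts = summarize_jsonl(str(train_file), label_key="label")
    print("Rows:", rows, "| Label counts:", dict(label_counts))
```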

## **1.8. Training the classifier**

**• Purpose:**
- Train the classifier with `classifier_train.py`. The number of epochs, batch size, and learning rate are passed as command-line arguments and can be tuned for better results.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.
- For better results, try more epochs together with a lower learning rate, and watch for overfitting.

In [None]:
## --------------------------------------- START ---------------------------------------
# Train the classifier (Please adjust the parameters for better training results)
%cd "$proj_path"
!python src/classifier_train.py \
  --data_path data/train.jsonl \
  --save_dir out/classifier/model \
  --epochs 2 \
  --batch_size 16 \
  --lr 5e-5
## ---------------------------------------- END ----------------------------------------
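When comparing several settings, the training commands can be generated rather than edited by hand. The sketch below only builds command strings from the flags used in the cell above; it does not run them. The helper `training_commands` and the per-run `save_dir` naming scheme are assumptions, and the other scripts may expect the model at `out/classifier/model`, so copy or retrain the chosen run there if needed.

```python
import itertools
import shlex


def training_commands(epochs_grid, lr_grid, batch_size=16,
                      data_path="data/train.jsonl", out_root="out/classifier"):
    """Build one classifier_train.py command per (epochs, lr) combination."""
    commands = []
    for epochs, lr in itertools.product(epochs_grid, lr_grid):
        # Hypothetical naming scheme so that runs do not overwrite each other
        save_dir = f"{out_root}/model_e{epochs}_lr{lr:g}"
        args = ["python", "src/classifier_train.py",
                "--data_path", data_path,
                "--save_dir", save_dir,
                "--epochs", str(epochs),
                "--batch_size", str(batch_size),
                "--lr", f"{lr:g}"]
        commands.append(shlex.join(args))
    return commands


for cmd in training_commands([2, 4], [5e-5, 2e-5]):
    print(cmd)
```

Each printed line can be pasted into a cell with a leading `!` to launch that run.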

## **1.9. Run inference and sanity check**

**• Purpose:**
- Run a quick inference sanity check on the trained classifier with `classifier_infer.py`.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.

In [None]:
## --------------------------------------- START ---------------------------------------
# Quick inference sanity check
%cd "$proj_path"
!python src/classifier_infer.py --text "send me the report now"
## ---------------------------------------- END ----------------------------------------

## **1.10. End-to-end rewriting**

**• Purpose:**
- Rewrite a sample input end to end with `pipeline.py`, using the trained classifier. If the rewrites are poor, revisit the training parameters in step 1.8.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.

In [None]:
## --------------------------------------- START ---------------------------------------
# Execute the end-to-end rewriting
%cd "$proj_path"
!python src/pipeline.py --text "send me the report now asap"
## ---------------------------------------- END ----------------------------------------

## **1.11. Evaluate the model with samples**

**• Purpose:**
- Evaluate the model on a specified number of samples with `eval.py` and save the results to a CSV file.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.
- Increase `--n` for more reliable results; a larger sample is preferred.

In [None]:
## --------------------------------------- START ---------------------------------------
# Evaluate the model with the specified number of samples
%cd "$proj_path"
!python src/eval.py --n 20 --save_csv out/eval_samples.csv
## ---------------------------------------- END ----------------------------------------
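Once the CSV has been written, a quick summary of its numeric columns gives a first impression of the results. The sketch below uses only the standard library and makes no assumption about column names: it averages every column whose values all parse as numbers and skips the rest. The helper `numeric_column_means` is hypothetical, and the columns that `eval.py` actually writes are not documented here.

```python
import csv
from pathlib import Path
from statistics import mean
from typing import Dict


def numeric_column_means(csv_path: str) -> Dict[str, float]:
    """Return the mean of every fully numeric column in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    means: Dict[str, float] = {}
    if not rows:
        return means
    for column in rows[0].keys():
        try:
            values = [float(row[column]) for row in rows]
        except (TypeError, ValueError):
            continue  # skip text or partially filled columns
        means[column] = mean(values)
    return means


eval_file = Path("out/eval_samples.csv")
if eval_file.exists():
    for column, value in numeric_column_means(str(eval_file)).items():
        print(f"{column}: {value:.3f}")
```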

## **1.12. Launch the Gradio demo**

**• Purpose:**
- Launch a Gradio demo with `app.py` to try the rewriter interactively on your own sentences.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.
- Alternatively, run the app from a terminal (`cd politeness-rewriters`, or your project directory, then `python app.py`) to obtain the public URL.

In [None]:
## --------------------------------------- START ---------------------------------------
# Launch the Gradio demo
%cd "$proj_path"
!python app.py
## ---------------------------------------- END ----------------------------------------

# **Case 2. Visual Analysis**

## **2.1. Load the Stanford Dataset**

**• Purpose:**
- Load the Stanford dataset to preview the sample sentences used in this project.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.
- Adjust the file names if necessary, and document any changes you push to GitHub.

In [None]:
## --------------------------------------- START ---------------------------------------
# Setup the necessary modules or packages
import json
import pandas as pd

# Define the path of the dataset (Adjust if necessary)
DATA_PATH = "/content/politeness-rewriters/data/stanford_politeness.jsonl"

# Read the JSONL
records = []
with open(DATA_PATH, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # Skip any empty lines
            records.append(json.loads(line))

# Convert to DataFrame
df = pd.DataFrame(records)

# Preview the dataset
print("Rows:", len(df))
print("Columns:", list(df.columns))
df.head() # Adjust the number of rows if necessary
## ---------------------------------------- END ----------------------------------------

## **2.2. Load the Image of the Stanford Dataset Preview**

**• Purpose:**
- Render a preview of the Stanford dataset as an image for documentation and presentation purposes.

**• Note:**
- Make sure the previous steps completed without errors (subprocess errors can be ignored).
- Re-run this cell if you change any configuration or file paths, and document those changes.
- Adjust the file names if necessary, and document any changes you push to GitHub.

In [None]:
## --------------------------------------- START ---------------------------------------
# Setup the necessary modules or packages
!pip install dataframe-image
import pandas as pd
import dataframe_image as dfi

# Declare the sample DataFrame
sample_df = df.head(10) # Adjust the number of rows if necessary

# Style the layout of the data
styled = (
    sample_df.style.set_properties(**{
        "background-color": "#f9f9f9",
        "border-color": "black",
        "color": "black",
        "border-width": "1px",
        "border-style": "solid"
    })
)

# Call the matplotlib engine to avoid Playwright/Chrome errors or mismatch
dfi.export(styled, "dataset_preview.png", table_conversion="matplotlib")

# Confirm that the preview image was written
print("Saved: dataset_preview.png")
## ---------------------------------------- END ----------------------------------------