# Steam Review Summarization - Sequential (Windows & Mac Compatible)

Uses a smaller, faster model (sshleifer/distilbart-cnn-6-6) with GPU acceleration (CUDA or MPS) when available.
Games are processed one at a time, which keeps memory use bounded during overnight runs.

**Compatible with:** Windows (CUDA/CPU), Mac (MPS/CPU), Linux (CUDA/CPU)

**Run cells in order: 1 → 6**

In [1]:
# Cell 1: Imports
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import pandas as pd
import glob, re, string, time, gc, warnings
warnings.filterwarnings('ignore')

try:
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
    from tqdm import tqdm
    import torch
except ImportError:
    import subprocess, sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "transformers", "sentencepiece", "torch", "tqdm"])
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
    from tqdm import tqdm
    import torch

print("✓ All imports loaded")

✓ All imports loaded


In [None]:
# Cell 2: Configuration (cross-platform paths)
from pathlib import Path

# Use the smaller, faster 6-6 model.
# Path.home() resolves correctly on Windows, Mac, and Linux, so one set of paths covers every platform.
default_local_cache = Path.home() / "hf_cache" / "distilbart-6-6"
hf_default_cache = Path.home() / ".cache" / "huggingface" / "hub" / "models--sshleifer--distilbart-cnn-6-6"

if default_local_cache.exists():
    MODEL_PATH = str(default_local_cache)
elif hf_default_cache.exists():
    MODEL_PATH = str(hf_default_cache)
else:
    MODEL_PATH = "sshleifer/distilbart-cnn-6-6"  # smaller & faster than 12-6

# Project root detection (works on Windows, Mac, Linux)
env_root = os.environ.get("STEAM_GAMES_ROOT")
if env_root:
    project_root = Path(env_root).expanduser().resolve()
else:
    here = Path.cwd().resolve()
    project_root = None
    for cand in [here, *here.parents]:
        if cand.name == "python":
            continue
        if (cand / "data").exists():
            project_root = cand
            break
    if project_root is None:
        project_root = here.parent if here.name == "python" else here
if project_root.name == "python":
    project_root = project_root.parent

DATA_DIR = (project_root / "data").resolve()
PROCESSED_DIR = (DATA_DIR / "processed").resolve()
OUT_DIR = (DATA_DIR / "reviews_summary").resolve()
OUT_DIR.mkdir(parents=True, exist_ok=True)

START_APP_ID = 293741
END_APP_ID   = 942970
CHECKPOINT_EVERY = 10

print("Configuration:")
print(f"  Platform: {os.name} ({'Windows' if os.name == 'nt' else 'Unix-like'})")
print(f"  Model: {MODEL_PATH}")
print(f"  App ID range: {START_APP_ID or 'start'} to {END_APP_ID or 'end'}")
print(f"  Sequential processing, Checkpoint every: {CHECKPOINT_EVERY} games")
print(f"  Project root: {project_root}")
print(f"  Output dir:   {OUT_DIR}")

Configuration:
  Model: sshleifer/distilbart-cnn-6-6
  App ID range: 4780 to 204180
  Sequential processing, Checkpoint every: 10 games
  Project root: /Users/radimsoukal/Library/Mobile Documents/com~apple~CloudDocs/VŠE/05. SEMESTR/Text Analytics/R/Steam_Games 2
  Output dir:   /Users/radimsoukal/Library/Mobile Documents/com~apple~CloudDocs/VŠE/05. SEMESTR/Text Analytics/R/Steam_Games 2/data/reviews_summary


In [3]:
# Cell 3: Load data with cache (same pattern as combine_reviews_working_edit)
cache_file = PROCESSED_DIR / 'combined_reviews_cache.pkl'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

combined_df = None

if cache_file.exists():
    try:
        obj = pd.read_pickle(cache_file)
        if isinstance(obj, pd.DataFrame) and 'combined_reviews' in obj.columns:
            combined_df = obj
            print(f"✓ Loaded cache: {len(combined_df)} games")
        else:
            print("Cache invalid -> will rebuild")
            cache_file.unlink()
    except Exception as e:
        print(f"Cache error: {e} -> will rebuild")
        try:
            cache_file.unlink()
        except Exception:
            pass

if combined_df is None:
    print("Loading raw CSV files...")
    csv_files = sorted(glob.glob(str(DATA_DIR / 'raw' / 'app_reviews_*.csv')))
    if not csv_files:
        raise FileNotFoundError(f"No input files in {DATA_DIR / 'raw'}")
    dfs = [pd.read_csv(f) for f in csv_files]
    combined_df = pd.concat(dfs, ignore_index=True)
    print(f"✓ Loaded {len(csv_files)} files, {len(combined_df)} games")
    
    def clean_text(text):
        if not isinstance(text, str) or not text.strip():
            return None
        text = re.sub(r'<[^>]+>', '', text)
        text = re.sub(r'http[s]?://\S+|www\.\S+', '', text)
        text = re.sub(r'[\r\n\t]+', ' ', text)
        text = re.sub(r'[\U00010000-\U0010ffff]', '', text)
        allowed = set(string.ascii_letters + string.digits + ' .,!?\'-:/()[]')
        text = ''.join(ch for ch in text if ch in allowed)
        text = re.sub(r'\s+', ' ', text).strip()
        return text if text else None
    
    def combine_reviews(row):
        reviews = []
        for i in range(1, 101):
            rv = row.get(f'review_{i}')
            if rv and isinstance(rv, str):
                cleaned = clean_text(rv)
                if cleaned:
                    reviews.append(cleaned)
        unique = list(dict.fromkeys(reviews))
        return ' [SEP] '.join(unique) if unique else ''
    
    combined_df['combined_reviews'] = combined_df.apply(combine_reviews, axis=1)
    combined_df.to_pickle(cache_file)
    print("✓ Cleaned and cached")

print(f"Avg length: {combined_df['combined_reviews'].str.len().mean():.0f} chars")

✓ Loaded cache: 2622 games
Avg length: 80789 chars


In [None]:
# Cell 4: Load model with cross-platform GPU support (Windows CUDA / Mac MPS / CPU fallback)
def _resolve_hub_snapshot_dir(root_dir):
    snapshots_dir = os.path.join(root_dir, "snapshots")
    refs_main = os.path.join(root_dir, "refs", "main")
    if os.path.isfile(refs_main):
        try:
            with open(refs_main, "r") as f:
                commit = f.read().strip()
            cand = os.path.join(snapshots_dir, commit)
            if os.path.isdir(cand):
                return cand
        except OSError:
            pass
    try:
        candidates = [d for d in glob.glob(os.path.join(snapshots_dir, "*")) if os.path.isdir(d)]
        if candidates:
            candidates.sort(key=lambda d: os.path.getmtime(d), reverse=True)
            return candidates[0]
    except OSError:
        pass
    return None

def _looks_like_model_dir(path):
    # A loadable directory needs config.json plus at least one weights file.
    weight_files = {"pytorch_model.bin", "model.safetensors"}
    try:
        files = set(os.listdir(path))
    except OSError:
        return False
    return "config.json" in files and bool(weight_files & files)

# Cross-platform device detection: Windows (CUDA) / Mac (MPS) / Fallback (CPU)
def detect_device():
    """Detect best available device: CUDA (Windows/Linux) > MPS (Mac) > CPU"""
    if torch.cuda.is_available():
        return torch.device("cuda"), torch.float16, "CUDA (GPU)"
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps"), torch.float32, "MPS (Apple Silicon)"
    return torch.device("cpu"), torch.float32, "CPU"

print(f"Loading model from: {MODEL_PATH}")
is_local_path = os.path.exists(MODEL_PATH)

load_dir = None
if is_local_path:
    if os.path.basename(MODEL_PATH).startswith("models--"):
        candidate = _resolve_hub_snapshot_dir(MODEL_PATH)
        if candidate and _looks_like_model_dir(candidate):
            load_dir = candidate
    if load_dir is None and _looks_like_model_dir(MODEL_PATH):
        load_dir = MODEL_PATH

if load_dir:
    tokenizer = AutoTokenizer.from_pretrained(load_dir, local_files_only=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(load_dir, local_files_only=True)
    print(f"✓ Loaded from local cache: {load_dir}")
else:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH)
    print("✓ Downloaded from Hugging Face Hub")

device, dtype, dev_name = detect_device()
print(f"✓ Device: {dev_name}")
model = model.to(device)
if dtype == torch.float16:
    model = model.half()

# Pass the detected device explicitly so the pipeline runs where the model was placed.
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, device=device)
print("✓ Summarizer ready")

Loading model from: sshleifer/distilbart-cnn-6-6
✓ Downloaded from Hugging Face Hub
✓ Device: MPS (Apple Silicon)


Device set to use mps:0


✓ Summarizer ready


In [5]:
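# Cell 5: Summarization functions with content filtering
# Note: distilbart-cnn is a summarization model, not an instruction-tuned one, so
# GUIDANCE_PROMPT is consumed as ordinary input text rather than followed as an instruction.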
GUIDANCE_PROMPT = (
    "Summarize the following user reviews into a concise, neutral third-person description of the game. "
    "Focus on: genre, core gameplay loop and mechanics, key features/modes (campaign/multiplayer/co-op), "
    "difficulty/learning curve, performance/technical notes. Avoid slang, memes, and repetition.\n\nReviews:\n"
)

KEYWORDS_RE = re.compile(
    r"gameplay|mechanic|combat|gun|weapon|campaign|story|mode|co-?op|multiplayer|"
    r"graphics|performance|bug|optim|difficulty|progression|content|map|level|class|rank",
    re.I
)

def keep_relevant_sentences(text):
    sents = re.split(r"(?<=[.!?])\s+", text)
    kept = [s for s in sents if KEYWORDS_RE.search(s)]
    filtered = " ".join(kept).strip()
    return filtered if len(filtered) >= 300 else text

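def split_into_chunks(text, max_chunk=1500, overlap=100):
    if not text or len(text) <= max_chunk:
        return [text] if text else []
    separators = (". ", "! ", "? ", " [SEP] ")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chunk, len(text))
        if end < len(text):
            window = text[start:end]
            # Cut just after the last sentence end or review separator in the window.
            cut = max(window.rfind(sep) + len(sep) if sep in window else -1 for sep in separators)
            # Ignore cuts inside the overlap region so every step moves forward.
            if cut > overlap:
                end = start + cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        # Step back by `overlap` characters so adjacent chunks share some context.
        start = end - overlap if end < len(text) else end
    return chunks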

def token_budget(n_chars, cap=96):
    return max(32, min(cap, n_chars // 4))

def generate_summary(text, cap=96):
    if not text or not text.strip():
        return ""
    max_new = token_budget(len(text), cap=cap)
    with torch.inference_mode():
        out = summarizer(text, max_new_tokens=max_new, num_beams=1, 
                        do_sample=False, truncation=True, batch_size=1)[0]["summary_text"]
    return out

def summarize_reviews(text):
    if not text or not text.strip():
        return ""
    base = keep_relevant_sentences(text)
    if len(base) <= 800:
        guided = f"{GUIDANCE_PROMPT}{base}"
        return generate_summary(guided, cap=64)
    parts = []
    for ch in split_into_chunks(base, max_chunk=1500, overlap=100):
        guided = f"{GUIDANCE_PROMPT}{ch}"
        parts.append(generate_summary(guided, cap=64))
    combined = " ".join(p for p in parts if p).strip()
    if len(combined) < 200:
        return combined
    final_guided = f"{GUIDANCE_PROMPT}{combined}"
    return generate_summary(final_guided, cap=96)

print("✓ Summarization functions ready")

✓ Summarization functions ready


In [None]:
# Cell 6: Sequential processing with memory management
df_sorted = combined_df.sort_values('app_id').reset_index(drop=True)
if START_APP_ID is not None:
    df_sorted = df_sorted[df_sorted['app_id'] >= START_APP_ID]
if END_APP_ID is not None:
    df_sorted = df_sorted[df_sorted['app_id'] <= END_APP_ID]
df_sorted = df_sorted.reset_index(drop=True)

n_rows = len(df_sorted)
print(f"\nProcessing {n_rows} games (SEQUENTIAL mode)")
print("-" * 60)

if n_rows == 0:
    print("No rows to process")
else:
    if 'reviews_summary' not in df_sorted.columns:
        df_sorted['reviews_summary'] = ''
    
    ok = fail = skip = 0
    batch_size = max(1, CHECKPOINT_EVERY)
    overall_start_id = int(df_sorted.iloc[0]['app_id'])
    overall_end_id = int(df_sorted.iloc[-1]['app_id'])
    total_batches = (n_rows + batch_size - 1) // batch_size
    
    for b in range(total_batches):
        s = b * batch_size
        e = min(s + batch_size, n_rows)
        batch = df_sorted.iloc[s:e]
        batch_start_id = int(batch.iloc[0]['app_id'])
        batch_end_id = int(batch.iloc[-1]['app_id'])
        
        print(f"\nBatch {b+1}/{total_batches}  (app_id {batch_start_id}–{batch_end_id})  size={len(batch)}")
        
        for idx, row in tqdm(batch.iterrows(), total=len(batch), desc=f"Batch {b+1}"):
            text = row.get('combined_reviews', '')
            app_id = int(row['app_id'])
            
            if not text or not text.strip():
                skip += 1
                continue
            
            try:
                summary = summarize_reviews(text)
                df_sorted.loc[idx, 'reviews_summary'] = summary
                ok += 1
            except Exception as ex:
                fail += 1
                print(f"  ⚠️  app_id {app_id}: {ex}")
        
        ck_path = OUT_DIR / f"checkpoint_{batch_start_id:06d}_to_{batch_end_id:06d}.csv"
        df_sorted.loc[s:e-1, ['app_id', 'reviews_summary']].to_csv(ck_path, index=False)
        print(f"  💾 Saved {ck_path.name} | ok={ok}, fail={fail}, skip={skip}")
        
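        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        elif hasattr(torch, "mps") and hasattr(torch.mps, "empty_cache") and torch.backends.mps.is_available():
            torch.mps.empty_cache()  # release cached MPS memory on Apple Silicon
        time.sleep(0.2)
        print("  🧹 Memory cleanup complete")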
    
    final_name = f"review_summaries_COMPLETE_{overall_start_id:06d}_to_{overall_end_id:06d}.csv"
    final_path = OUT_DIR / final_name
    df_sorted[['app_id', 'combined_reviews', 'reviews_summary']].to_csv(final_path, index=False)
    
    print(f"\n✓ Complete: ok={ok}, fail={fail}, skip={skip}")
    print(f"✓ Saved: {final_path}")
    gc.collect()
    print("✓ Final cleanup complete")


Processing 577 games (SEQUENTIAL mode)
------------------------------------------------------------

Batch 1/58  (app_id 4780–6060)  size=10


Batch 1:   0%|          | 0/10 [00:00<?, ?it/s]
The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Batch 1:  30%|███       | 3/10 [08:18<19:23, 166.26s/it]

## Performance Notes

**Platform Compatibility:**
- ✅ **Windows**: CUDA GPU (NVIDIA) or CPU
- ✅ **Mac**: MPS (M1/M2/M3) or CPU  
- ✅ **Linux**: CUDA GPU or CPU

**Sequential vs Parallel:**
- ✅ Sequential: Stable, predictable memory ~2-4GB  
- ✅ No crashes overnight  
- ✅ GPU acceleration when available (2-5x faster than CPU)  
- ✅ Smaller model (6-6 vs 12-6) = faster processing  

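**Speed estimates by device:**
- ~20-30 sec/game on **NVIDIA GPU** (Windows/Linux CUDA)
- ~30-40 sec/game on **M1/M2 Mac** (MPS)  
- ~60-90 sec/game on **CPU** (any platform)
- These are rough figures; actual time scales with the amount of review text per game. The partial run shown above was slower, at about 166 sec/game on MPS over the first 3 games of batch 1.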
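
As a rough planning aid, the per-game figures above can be turned into a total runtime. The sketch below assumes a constant cost per game, which is a simplification; the helper name `estimate_runtime_hours` is hypothetical, 577 is the game count printed by Cell 6, and 166.26 is the `s/it` value from the progress bar.

```python
def estimate_runtime_hours(n_games, sec_per_game):
    """Total wall-clock hours for n_games at a constant per-game cost."""
    return n_games * sec_per_game / 3600

# Per-game ranges copied from the list above; 577 is the game count from the run log.
N_GAMES = 577
RANGES = {"CUDA": (20, 30), "MPS": (30, 40), "CPU": (60, 90)}

for device, (fast, slow) in RANGES.items():
    low_h = estimate_runtime_hours(N_GAMES, fast)
    high_h = estimate_runtime_hours(N_GAMES, slow)
    print(f"{device}: {low_h:.1f}-{high_h:.1f} h")

# The tqdm line in the log reported 166.26 s/it on MPS after 3 games.
print(f"Observed MPS rate: {estimate_runtime_hours(N_GAMES, 166.26):.1f} h")
```

At the observed rate the 577-game range would take roughly 27 hours, so it is worth timing a first batch before committing to an overnight run.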

**Memory optimizations:**
- Content filtering reduces input size  
- Sequential processing (no parallel overhead)  
- Garbage collection after each batch  
- Smaller model footprint  
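
To illustrate the content-filtering point, here is a simplified, self-contained version of the keyword filter from Cell 5. The shortened keyword list, the name `filter_sentences`, and the `min_chars` parameter are assumptions made for this demo; the notebook hard-codes the 300-character threshold and uses a longer pattern.

```python
import re

# Shortened keyword list for the demo; Cell 5 uses a longer pattern.
DEMO_KEYWORDS = re.compile(
    r"gameplay|combat|campaign|multiplayer|performance|bug|difficulty", re.I
)

def filter_sentences(text, min_chars=300):
    """Keep sentences that mention a keyword; fall back to the full text if too little survives."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = " ".join(s for s in sentences if DEMO_KEYWORDS.search(s)).strip()
    return kept if len(kept) >= min_chars else text

sample = (
    "Best game ever lol. The combat feels tight and responsive. "
    "10/10 would buy again. Performance is poor on older GPUs."
)

# With the threshold disabled, only the two keyword sentences remain.
print(filter_sentences(sample, min_chars=0))
# With the default threshold the filtered text is too short, so the full text is returned.
print(filter_sentences(sample) == sample)
```

The fallback matters for games with few or very short reviews: dropping most of a small input would leave the model with almost nothing to summarize.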

**Windows-specific notes:**
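- If you have an NVIDIA GPU, make sure a CUDA-enabled PyTorch build and a current NVIDIA driver are installed; `torch.cuda.is_available()` should return `True`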
- CPU mode works on any Windows machine (just slower)
- Monitor GPU usage in Task Manager > Performance > GPU
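
Before launching a long run, it can help to confirm which backend PyTorch actually sees. The following is a minimal sketch, not part of the notebook; `describe_torch_backend` is a hypothetical helper name. It relies on `torch.version.cuda` being `None` for builds without CUDA support, which makes an installed CPU-only build easy to spot.

```python
import importlib.util

def describe_torch_backend():
    """Return a one-line description of the backend PyTorch would use."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        return f"cuda: {name} (CUDA build {torch.version.cuda})"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps: Apple Silicon GPU"
    # torch.version.cuda is None when the installed build has no CUDA support.
    if torch.version.cuda is None:
        return "cpu: build without CUDA support"
    return f"cpu: CUDA build {torch.version.cuda}, but no usable GPU was found"

print(describe_torch_backend())
```

If this reports a build without CUDA support on a machine with an NVIDIA GPU, reinstalling PyTorch with a CUDA-enabled build is the likely fix.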

**If you need more speed:**
- Use the parallel version for small batches  
- Process in smaller ranges (e.g., 100-200 games)  
- This sequential version is best for overnight/long runs
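
When a run is split into smaller ranges, or stops partway, the per-batch checkpoint files can be stitched back together. The sketch below assumes the file naming (`checkpoint_<start>_to_<end>.csv`) and the columns (`app_id`, `reviews_summary`) written by Cell 6; `merge_checkpoints` is a hypothetical helper, and preferring a non-empty summary when an `app_id` appears twice is a design choice of this example.

```python
import csv
from pathlib import Path

def merge_checkpoints(out_dir):
    """Merge checkpoint CSVs into {app_id: summary}, preferring non-empty summaries."""
    merged = {}
    for path in sorted(Path(out_dir).glob("checkpoint_*_to_*.csv")):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                app_id = int(row["app_id"])
                summary = (row.get("reviews_summary") or "").strip()
                # Never overwrite an existing summary with an empty one.
                if summary or app_id not in merged:
                    merged[app_id] = summary
    return dict(sorted(merged.items()))
```

Calling `merge_checkpoints(OUT_DIR)` after an interrupted run shows which games already have a summary, so `START_APP_ID` can be moved past them before restarting.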