# Text Processing

Purpose: Clean and transform raw Reddit JSONL into a modeling-ready DataFrame, then tokenize and vectorize text (TF‑IDF) for NMF topic extraction.

This notebook:
- Loads posts with embedded top comments from `data/raw/posts_with_comments_YYYYMMDD.jsonl`.
- Builds a DataFrame with post metadata and a combined text field (title + selftext + top comments).
- Applies basic text normalization (lowercase, URL removal, punctuation stripping as needed).
- Creates TF‑IDF features to feed into an NMF model for topic discovery.
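
The normalization step listed above could look roughly like the sketch below. It uses only the standard library; the helper name `normalize_text` and the exact rules (which URL pattern to match, replacing punctuation with spaces) are assumptions for illustration, not the notebook's confirmed implementation.

```python
import re
import string

# Hypothetical helper: the name and the exact rules are assumptions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Map every ASCII punctuation character to a space. Non-ASCII punctuation
# (for example curly apostrophes) is left untouched by this table.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize_text(text):
    """Lowercase, drop URLs, replace ASCII punctuation with spaces, collapse whitespace."""
    if not text:
        return ""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove URLs before touching punctuation
    text = text.translate(PUNCT_TABLE)   # punctuation becomes whitespace
    return " ".join(text.split())        # collapse runs of whitespace


print(normalize_text("Check THIS out: https://example.com/a?b=1 -- it's great!"))
# → check this out it s great
```

URLs are removed before punctuation so that the `://` and `?` inside a link do not split it into stray tokens.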

In [2]:
import json
from pathlib import Path
from typing import List, Dict, Any
import pandas as pd

In [3]:
def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    rows: List[Dict[str, Any]] = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows


def to_dataframe(records: List[Dict[str, Any]]) -> pd.DataFrame:
    # Extract required fields; coerce missing keys to None
    def extract(r: Dict[str, Any]) -> Dict[str, Any]:
        top_comments = r.get("top_comments") or []
        # Keep only non-empty comment bodies for compactness
        comment_bodies = [
            c["body"] for c in top_comments if isinstance(c, dict) and c.get("body")
        ]
        return {
            "post_id": r.get("post_id"),
            "subreddit": r.get("subreddit"),
            "created_utc": r.get("created_utc"),
            "title": r.get("title"),
            "selftext": r.get("selftext"),
            "score": r.get("score"),
            "num_comments": r.get("num_comments"),
            "upvote_ratio": r.get("upvote_ratio"),
            "over_18": r.get("over_18"),
            # Keep list of strings (comment bodies)
            "top3_comments": comment_bodies,
        }

    df = pd.DataFrame([extract(r) for r in records])
    return df

In [None]:
records = load_jsonl(Path("../data/raw/posts_with_comments_20250907.jsonl")) # this jsonl file combines posts and comments
df = to_dataframe(records)
df.head()

| | post_id | subreddit | created_utc | title | selftext | score | num_comments | upvote_ratio | over_18 | top3_comments |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1naxqdh | OpenAI | 1757263000.0 | Are sora images free to use, print and sell? | Quick question... Are the images found on Sora... | 1 | 2 | 1.0 | False | [Yes, but you don't have any copyright, so don... |
| 1 | 1nawwab | OpenAI | 1757261000.0 | Has anyone here used OpenAI tools to build som... | I’ve been experimenting with OpenAI tools like... | 0 | 2 | 0.4 | False | [I'm discussing investment options with it. Wo... |
| 2 | 1nawvss | OpenAI | 1757261000.0 | PSA: OpenAI expires credits you buy automatica... | Basically what it says on the tin, I learned t... | 16 | 14 | 0.79 | False | [Always been the case., 🌍 👨‍🚀 🔫 👩‍🚀, Sensible.... |
| 3 | 1navt7t | OpenAI | 1757258000.0 | OpenAI Should Publish Most Common Topics | Wouldn’t it be very useful for all of humanity... | 5 | 4 | 0.78 | False | [so basically… openAI should publish a weekly ... |
| 4 | 1navjgf | OpenAI | 1757257000.0 | Introducing Terra Code CLI: Your AI Coding Ass... | Ever wished your AI coding assistant actually ... | 0 | 11 | 0.31 | False | [There are just too many CLIs now!\n\nhttps://... |
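
The DataFrame keeps `title`, `selftext`, and `top3_comments` as separate columns, while the purpose statement describes a single combined text field. A minimal sketch of that merge follows; the helper name `combine_text`, the newline separator, and the example row are all assumptions for illustration.

```python
from typing import Any, Dict


def combine_text(row: Dict[str, Any], sep: str = "\n") -> str:
    """Join title, selftext, and comment bodies, skipping missing or blank parts."""
    parts = [row.get("title"), row.get("selftext")]
    parts.extend(row.get("top3_comments") or [])
    # isinstance() also filters out None and NaN placeholders
    return sep.join(p.strip() for p in parts if isinstance(p, str) and p.strip())


# Made-up row with a missing selftext and unusable comment entries
example = {
    "title": "Example title",
    "selftext": None,
    "top3_comments": ["First comment.", None, "   "],
}
print(repr(combine_text(example)))
# → 'Example title\nFirst comment.'
```

Applied to the DataFrame, this could be written as `df["text"] = df.apply(lambda r: combine_text(r.to_dict()), axis=1)`, where the column name `text` is likewise an assumption.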


In [9]:
df['top3_comments'].values[2]

['Always been the case.',
 '🌍 👨\u200d🚀 🔫 👩\u200d🚀',
 'Sensible. Difficult to run a business with an ever growing stockpile of liabilities that could be called in at any time.']