<a href="https://colab.research.google.com/github/wtrekell/soylent-army/blob/main/colab/ai_vs_human_v1.4c.3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ai_vs_human_v1.4c.3 — Analyzer

Updated per your specifications.

In [1]:
# CELL 1: This cell installs the necessary Python libraries for the project using pip.
# %%capture --no-display: This is a cell magic that captures the standard output and standard error of the cell and prevents it from being displayed. This is used here to hide the detailed output of the pip install command, while still allowing progress bars (if any are shown by the installed packages) to be displayed.
# %pip install -q: This is a line magic that runs the pip install command. The -q flag stands for "quiet", which reduces the verbosity of the output.
# The rest of the line lists the specific libraries and their versions to be installed.

# In summary, this cell ensures that all required libraries are installed with minimal output shown, except for potential progress bars during the installation process.

%%capture --no-display
%pip install -q "markdown-it-py[linkify]==3.0.0" "linkify-it-py==2.0.3" \
                "sentence-transformers==3.0.1" "scikit-learn>=1.6.1" \
                "beautifulsoup4==4.12.3" "lxml==5.2.2" \
                "rapidfuzz==3.9.6" "pandas==2.2.2" "numpy>=2.0.0,<2.3.0"

In [2]:
# CELL 2: This cell imports several Python libraries and defines helper functions for processing and analyzing text data, along with a configuration class.

# Imports: It imports necessary libraries like pathlib, os, re, json, glob, time, numpy, pandas, BeautifulSoup (for parsing HTML), MarkdownIt (for parsing Markdown), SentenceTransformer (for creating sentence embeddings), TfidfVectorizer and cosine_similarity from sklearn (for TF-IDF analysis and similarity), and Levenshtein from rapidfuzz (for calculating edit distance). It also imports tqdm.autonotebook.tqdm for progress bars and suppresses a related warning.
# MarkdownIt and md_to_text: Initializes a Markdown parser and defines a function md_to_text to convert Markdown strings to plain text using BeautifulSoup.
# split_sentences: Defines a function to split a given text into a list of sentences based on punctuation and spacing.
# tokenize: Defines a function to tokenize text by converting it to lowercase, removing non-alphanumeric characters, and splitting it into words.
# jaccard: Defines a function to calculate the Jaccard similarity between two sets of tokens.
# modification_label_for: Defines a function to categorize the degree of modification ("unchanged", "minor", or "major") based on a similarity score and defined thresholds.
# Config class: Defines a class to hold configuration parameters such as data and output directories, the name of the sentence transformer model, parameters for candidate selection and similarity weighting, thresholds for modification labels, and file patterns for different document versions.
# Constants: Defines constants for the order of document versions (VERSION_ORDER), display names for versions (DISPLAY_NAME), and labels for modifications (MOD_LABELS).

# In essence, this cell sets up the tools and configurations needed for the subsequent steps of loading, processing, and comparing different versions of text documents.

from pathlib import Path
import os, re, json, glob, time
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from markdown_it import MarkdownIt
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rapidfuzz.distance import Levenshtein

# Let tqdm use notebook mode (keeps progress bars visible in Colab/Jupyter).
# Suppress the "ExperimentalWarning" so you don't see the noisy message.
import warnings
from tqdm.autonotebook import tqdm
warnings.filterwarnings("ignore", category=UserWarning, module="tqdm")

md = MarkdownIt("commonmark", {})
def md_to_text(md_str: str) -> str:
    html = md.render(md_str or "")
    soup = BeautifulSoup(html, "lxml")
    text = soup.get_text(" ")
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text: str) -> list:
    if not text:
        return []
    text = re.sub(r"\s+", " ", text).strip()
    parts = re.split(r"(?<=[.!?])\s+(?=[\(\"\'A-Z0-9])", text)
    return [p.strip() for p in parts if p and not p.isspace()]

def tokenize(text: str) -> list:
    if not text:
        return []
    text = text.lower()
    # Replace non-alphanumeric characters with spaces
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.split()

def jaccard(tokens_a, tokens_b) -> float:
    if not tokens_a and not tokens_b:
        return 1.0
    A, B = set(tokens_a), set(tokens_b)
    if not A and not B:
        return 1.0
    if not A or not B:
        return 0.0
    return len(A & B) / float(len(A | B))

def modification_label_for(sim: float, th_unchanged=0.9, th_minor=0.7) -> str:
    if sim >= th_unchanged:
        return "unchanged"
    if sim >= th_minor:
        return "minor"
    return "major"

class Config:
    data_dir = "./data"
    out_dir = "./output"
    model_name = "all-MiniLM-L6-v2"
    topk_candidates = 10
    weight_semantic = 0.6
    weight_tfidf = 0.4
    unchanged_threshold = 0.90
    minor_change_threshold = 0.70
    patterns = {"draft":"draft*.md","refined":"refined*.md","edited":"edited*.md","final":"final*.md"}

cfg = Config()

VERSION_ORDER = ["draft", "refined", "edited", "final"]
DISPLAY_NAME  = {"draft":"Draft","refined":"Refined","edited":"Edited","final":"Final"}
MOD_LABELS    = ["unchanged","minor","major"]

print("ai_vs_human_v1.4c.3 configuration loaded.")

  from tqdm.autonotebook import tqdm, trange


ai_vs_human_v1.4c.3 configuration loaded.


In [3]:
# CELL 3: This cell contains functions for loading and discovering article data, and then uses them to find and list the available article locations.

# read_first_match(folder: Path, pattern: str): This function takes a folder path and a file pattern as input. It searches the folder for files matching the pattern, sorts them, and reads the content of the first matching file. If no files are found, it returns None.
# load_versions(article_folder: Path): This function takes an article folder path and uses the read_first_match function to load the content of different versions (draft, refined, edited, final) of an article based on predefined file patterns in the cfg.patterns dictionary. It returns a dictionary where keys are version names and values are the loaded content (or None if no file was found for a version).
# to_sentences(md_content: str) -> list: This function takes Markdown content as input, converts it to plain text using md_to_text (defined in a previous cell), and then splits the text into a list of sentences using split_sentences (also defined previously).
# discover_articles(data_dir: str): This function takes a data directory path, ensures the directory exists, and then finds all subdirectories within it. These subdirectories are considered as individual "article locations". If no subdirectories are found, it returns a list containing the data directory itself.
# articles = discover_articles(cfg.data_dir): This line calls the discover_articles function with the data directory specified in the configuration (cfg.data_dir) to find all article locations.
# The subsequent print statements display the number of discovered article locations and list each one.

# In summary, this cell sets up the functions to handle the loading of different article versions and then identifies where these articles are located within the specified data directory.

def read_first_match(folder: Path, pattern: str):
    files = sorted(folder.glob(pattern))
    if not files:
        return None
    return files[0].read_text(encoding="utf-8", errors="ignore")

def load_versions(article_folder: Path):
    data = {}
    for v in VERSION_ORDER:
        pattern = cfg.patterns.get(v)
        if not pattern:
            data[v] = None
            continue
        content = read_first_match(article_folder, pattern)
        data[v] = content
    return data

def to_sentences(md_content: str) -> list:
    return split_sentences(md_to_text(md_content or ""))

def discover_articles(data_dir: str):
    base = Path(data_dir)
    if not base.exists():
        base.mkdir(parents=True, exist_ok=True)  # ensure the folder exists
    folders = [p for p in base.iterdir() if p.is_dir()]
    if not folders:
        return [base]
    return folders

articles = discover_articles(cfg.data_dir)
print(f"Discovered {len(articles)} article location(s).")
for a in articles:
    print(' -', a)

Discovered 1 article location(s).
 - data


In [4]:
# CELL 4:This code cell loads a pre-trained sentence transformer model and initializes a TF-IDF vectorizer, which are both used for calculating text similarity later in the notebook.

# print("Loading embedding model:", cfg.model_name): This line prints a message indicating which sentence embedding model is being loaded, using the model name specified in the cfg (Config) object.
# st_model = SentenceTransformer(cfg.model_name): This line initializes a SentenceTransformer model with the specified model name. This model will be used to convert sentences into numerical vectors (embeddings) that capture their semantic meaning.
# def embed_sentences(sentences: list[str]) -> np.ndarray:: This function takes a list of sentences and uses the loaded st_model to generate embeddings for each sentence. It returns a NumPy array of these embeddings. If the input list is empty, it returns an empty array with the expected shape and data type.
# tfidf = TfidfVectorizer(min_df=1, ngram_range=(1,2)): This line initializes a TfidfVectorizer. This tool is used to convert a collection of raw documents into a matrix of TF-IDF features. min_df=1 means that terms that appear in less than 1 document will be ignored. ngram_range=(1,2) means that both unigrams (single words) and bigrams (two-word phrases) will be considered as features.

# In summary, this cell prepares the necessary models and tools for semantic and TF-IDF based similarity calculations between sentences.

print("Loading embedding model:", cfg.model_name)
st_model = SentenceTransformer(cfg.model_name)

def embed_sentences(sentences: list[str]) -> np.ndarray:
    if not sentences:
        return np.empty((0, 384), dtype=np.float32)
    return np.array(st_model.encode(sentences, show_progress_bar=False), dtype=np.float32)

tfidf = TfidfVectorizer(min_df=1, ngram_range=(1,2))


Loading embedding model: all-MiniLM-L6-v2


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
## CELL 5: This code cell defines the main analysis function analyze_article which compares sentences from a "final" version of an article to earlier versions (draft, refined, edited) to determine how much each final sentence has been modified and attribute it to an origin version.

# Load Versions: It first loads the content of the different article versions using the load_versions function. If there's no "final" version, it skips the analysis for that article.
# Convert to Sentences: It converts the content of each version into a list of sentences using the to_sentences function. It also skips the analysis if the "final" version has no sentences.
# Prepare Candidates: It creates a combined list of all sentences from the "draft", "refined", and "edited" versions, along with metadata (version and index) for each candidate sentence.
# TF-IDF Vectorization: It fits a TfidfVectorizer on all sentences (final and candidates) to generate TF-IDF representations.
# Sentence Embedding: It generates sentence embeddings for both the final sentences and the candidate sentences using the loaded SentenceTransformer model.
# Iterate and Compare: It iterates through each sentence in the "final" version:
# Calculates semantic similarity (cosine similarity of embeddings) and TF-IDF similarity between the current final sentence and all candidate sentences.
# If there are no candidates, it records the final sentence with None values for origin and similarity metrics.
# Calculates a combined similarity score (weighted average of semantic and TF-IDF similarity).
# Identifies the top-k candidate sentences with the highest combined similarity.
# For the top candidates, it calculates Jaccard similarity (token overlap) and Levenshtein ratio (edit distance).
# Calculates a final combined score using a weighted average of the fast combined similarity, Jaccard, and Levenshtein ratio.
# Identifies the candidate sentence with the highest final combined score as the "best" match.
# Records the details of the final sentence and its best matching origin sentence (or None if no suitable candidate was found), including various similarity scores and a "modification label" ("unchanged", "minor", or "major") based on the final combined score and defined thresholds.
# Generate Output: After processing all final sentences, it creates a pandas DataFrame from the collected data. It also calculates the distribution of origin versions and modification labels.
# Save Results: It saves the detailed results to a CSV file and a JSON file in the output directory specified in the configuration. It also saves a separate JSON file with footer metrics like execution time and comparison counts.
# Print Diagnostic: Finally, it prints a plain-English diagnostic message summarizing the analysis results, including the number of sentences analyzed, heavy comparisons performed, and an interpretation of the heavy comparison count relative to the top-k setting.

# The function returns a dictionary containing the name of the article and the paths to the generated output files.

def analyze_article(article_path: Path):
    versions = load_versions(article_path)
    if not versions.get("final"):
        print(f"[SKIP] No final*.md found in {article_path}.")
        return None

    sents = {v: to_sentences(versions.get(v) or "") for v in VERSION_ORDER}
    final_sents = sents["final"]
    if not final_sents:
        print(f"[SKIP] No sentences produced for Final in {article_path}.")
        return None

    candidate_versions = ["draft", "refined", "edited"]
    candidates, cand_meta = [], []
    for v in candidate_versions:
        for idx, sent in enumerate(sents.get(v, [])):
            candidates.append(sent); cand_meta.append((v, idx))

    all_for_tfidf = final_sents + candidates
    tfidf_matrix = tfidf.fit_transform(all_for_tfidf)
    tfidf_final = tfidf_matrix[:len(final_sents)]
    tfidf_cands = tfidf_matrix[len(final_sents):]

    emb_final = embed_sentences(final_sents)
    emb_cands = embed_sentences(candidates)

    rows = []
    heavy_calls = 0
    t_heavy0 = time.perf_counter()

    for f_idx, f_sent in enumerate(final_sents):
        sem_sim = cosine_similarity(emb_final[f_idx:f_idx+1], emb_cands).ravel() if emb_cands.shape[0] > 0 else np.array([])
        tfidf_sim = cosine_similarity(tfidf_final[f_idx:f_idx+1], tfidf_cands).ravel() if tfidf_cands.shape[0] > 0 else np.array([])
        if sem_sim.size == 0 or tfidf_sim.size == 0:
            rows.append({
                "final_index": f_idx,
                "final_sentence": f_sent,
                "origin_version": None,
                "origin_index": None,
                "origin_sentence": None,
                "semantic_sim": None,
                "tfidf_sim": None,
                "jaccard": None,
                "levenshtein": None,
                "combined_sim": 0.0,
                "modification_label": modification_label_for(0.0, cfg.unchanged_threshold, cfg.minor_change_threshold),
            })
            continue

        combined_fast = cfg.weight_semantic * sem_sim + cfg.weight_tfidf * tfidf_sim
        topk = min(cfg.topk_candidates, combined_fast.shape[0])
        cand_idx = np.argpartition(-combined_fast, kth=topk-1)[:topk]

        f_tokens = tokenize(f_sent)
        best_score = -1.0
        best = None
        for ci in cand_idx:
            c_sent = candidates[ci]
            c_tokens = tokenize(c_sent)
            A = set(f_tokens); B = set(c_tokens)
            jac = 0.0 if (not A or not B) else len(A & B)/float(len(A | B))
            max_len = max(len(f_sent), len(c_sent)) or 1
            lev_dist = Levenshtein.distance(f_sent, c_sent)
            lev_ratio = 1.0 - (lev_dist / max_len)
            heavy_calls += 1
            score = 0.7 * combined_fast[ci] + 0.2 * jac + 0.1 * lev_ratio
            if score > best_score:
                best_score = score
                best = (ci, jac, lev_ratio, combined_fast[ci])

        if best is None:
            rows.append({
                "final_index": f_idx,
                "final_sentence": f_sent,
                "origin_version": None,
                "origin_index": None,
                "origin_sentence": None,
                "semantic_sim": float(np.max(sem_sim)),
                "tfidf_sim": float(np.max(tfidf_sim)),
                "jaccard": None,
                "levenshtein": None,
                "combined_sim": float(np.max(combined_fast)),
                "modification_label": modification_label_for(float(np.max(combined_fast)), cfg.unchanged_threshold, cfg.minor_change_threshold),
            })
            continue

        ci, jac, lev_ratio, fast_keep = best
        origin_version, origin_index = cand_meta[ci]
        origin_sentence = candidates[ci]
        final_combined = max(0.0, min(1.0, best_score))
        label = modification_label_for(final_combined, cfg.unchanged_threshold, cfg.minor_change_threshold)

        rows.append({
            "final_index": f_idx,
            "final_sentence": f_sent,
            "origin_version": origin_version,
            "origin_index": int(origin_index),
            "origin_sentence": origin_sentence,
            "semantic_sim": float(sem_sim[ci]),
            "tfidf_sim": float(tfidf_sim[ci]),
            "jaccard": float(jac),
            "levenshtein": float(lev_ratio),
            "combined_sim": float(final_combined),
            "modification_label": label,
        })

    t_heavy1 = time.perf_counter()
    avg_heavy_per_final = heavy_calls / max(1, len(final_sents))

    df = pd.DataFrame(rows)
    origin_dist = (df["origin_version"].fillna("none").value_counts(normalize=True)
                   .reindex(["draft","refined","edited","none"], fill_value=0.0).to_dict())
    mod_dist = (df["modification_label"].fillna("major").value_counts(normalize=True)
                .reindex(["unchanged","minor","major"], fill_value=0.0).to_dict())

    article_name = article_path.name if article_path.is_dir() else Path(cfg.data_dir).name

    out_dir = Path(cfg.out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    csv_path    = out_dir / f"{article_name}_final_sentence_attribution.csv"
    json_path   = out_dir / f"{article_name}_complete_summary.json"
    footer_path = out_dir / f"{article_name}_footer_metrics.json"

    df.to_csv(csv_path, index=False, encoding="utf-8")

    payload = {
        "version": "1.4c.3",
        "article_name": article_name,
        "config": {
            "model_name": cfg.model_name,
            "topk_candidates": cfg.topk_candidates,
            "weights": {"semantic": cfg.weight_semantic, "tfidf": cfg.weight_tfidf},
            "unchanged_threshold": cfg.unchanged_threshold,
            "minor_change_threshold": cfg.minor_change_threshold,
        },
        "summary": {
            "origin_distribution": origin_dist,
            "modification_distribution": mod_dist,
            "counts": {
                "sentences_final": int(len(final_sents)),
                "heavy_comparisons": int(heavy_calls),
                "avg_heavy_per_final": float(avg_heavy_per_final),
            }
        },
        "rows": rows
    }
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)

    footer = {
        "article_name": article_name,
        "version": "1.4c.3",
        "elapsed_heavy_seconds": float(t_heavy1 - t_heavy0),
        "counts": payload["summary"]["counts"]
    }
    with open(footer_path, "w", encoding="utf-8") as f:
        json.dump(footer, f, ensure_ascii=False, indent=2)

    # Plain-English Diagnostic
    msg = []
    msg.append(f"Analysis completed for '{article_name}'.")
    msg.append(f"- Final sentences analyzed: {len(final_sents)}")
    msg.append(f"- Heavy comparisons executed: {heavy_calls}")
    msg.append(f"- Average heavy comparisons per final sentence: {avg_heavy_per_final:.2f}")
    msg.append("Interpretation:")
    if len(candidates) == 0:
        msg.append("• No Draft/Refined/Edited candidates found. Origins are unspecified and heavy metrics were skipped where possible.")
    else:
        k = cfg.topk_candidates
        if avg_heavy_per_final <= k * 1.2:
            msg.append(f"• Close to top‑k (k={k}). Heavy metrics are gated to the shortlist as intended.")
        else:
            msg.append(f"• Higher than expected for top‑k={k}. Consider verifying the loop only iterates over top‑k indices or lowering k.")
    msg.append(f"Outputs written to: {out_dir}")
    print("\n".join(msg))

    return {
        "article_name": article_name,
        "csv": str(csv_path),
        "json": str(json_path),
        "footer": str(footer_path)
    }

print("Analyzer ready.")


Analyzer ready.


In [6]:
## CELL 6: This code snippet iterates through a list of article locations (articles), runs the analyze_article function on each location, and collects the results.

# results = []: Initializes an empty list to store the analysis results for each article.
# for art in articles:: This loop iterates through each item in the articles list. Based on the previous cell's output, articles contains the path to the data directory (./data).
# res = analyze_article(art): Calls the analyze_article function for the current article location (art). This function performs the core analysis of comparing document versions.
# if res:: Checks if the analyze_article function returned a result. The function returns None if no "final" version is found or if the "final" version has no sentences.
# results.append(res): If analyze_article returned a result (meaning the analysis was performed), the result is added to the results list.
# if not results:: After the loop finishes, this checks if the results list is empty. This would happen if analyze_article returned None for all article locations.
# print("No outputs produced. Ensure your ./data contains expected markdown files."): If results is empty, this message is printed, indicating that no analysis was completed.
# else:: If results is not empty, the code in this block is executed.
# print("\nSummary of outputs:"): Prints a header for the summary.
# for r in results:: This loop iterates through each result dictionary stored in the results list.
# print(f"- {r['article_name']} -> JSON: {r['json']} | CSV: {r['csv']}"): For each result, it prints a formatted string showing the article name and the paths to the generated JSON and CSV output files.

# In summary, this code orchestrates the analysis process for all discovered article locations and then provides a summary of the generated output files.

results = []
for art in articles:
    res = analyze_article(art)
    if res:
        results.append(res)

if not results:
    print("No outputs produced. Ensure your ./data contains expected markdown files.")
else:
    print("\nSummary of outputs:")
    for r in results:
        print(f"- {r['article_name']} -> JSON: {r['json']} | CSV: {r['csv']}")


Analysis completed for 'data'.
- Final sentences analyzed: 103
- Heavy comparisons executed: 1030
- Average heavy comparisons per final sentence: 10.00
Interpretation:
• Close to top‑k (k=10). Heavy metrics are gated to the shortlist as intended.
Outputs written to: output

Summary of outputs:
- data -> JSON: output/data_complete_summary.json | CSV: output/data_final_sentence_attribution.csv
