# Project Purpose
We are preparing sectioned text files from the WGU 2025_06 catalog to support NLP tasks. The main goal is **help-seeking detection** in social media posts about the school.

## Why Sectioned Text?
Institutional catalog content (e.g. degree listings, policies, tuition) is **not help-seeking by nature**. By segmenting and analyzing these texts, we can build **custom stopword/phrase lists** to exclude non-help-seeking language during model training or inference.

## Workflow Summary
1. Extract raw text from catalog PDF via `pdfplumber`.
2. Segment by top-level catalog sections.
3. Save each section as a `.txt` file under versioned folder (`sections/2025_06/`).
4. Use these files to identify **institutional language** for exclusion in downstream social media NLP.

## Primary NLP Task
- **Help-seeking detection** in student or prospect-generated content (e.g. Reddit posts).
- Note: The Kneed algorithm (reference: Satopaa, V.) is used for elbow detection in frequency distributions.
- Note: The elbow point is sometimes one off from the obvious correct point, not critical especially for our use case. It's actually just like the graphs are for if an increase in Y value is bad, but it's actually good.


In [15]:
# Imports claude
import sys
from pathlib import Path
from string import punctuation
from calendar import month_name

import matplotlib.pyplot as plt
from kneed import KneeLocator
from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords
from nltk.util import bigrams, trigrams

In [16]:
# configs claude
# Set project root to one level above current notebook directory
project_root = Path().resolve().parent
sys.path.append(str(project_root))

wgu_catalog = Path("/Users/buddy/Desktop/WGU-Reddit/WGU_catalog")

# Input file
input_file = wgu_catalog / "sections" / "2025_06" / "01_about_western_governors_university.txt"
output_dir = Path("/Users/buddy/Desktop/WGU-Reddit/outputs")

# Fixed JSON structure
catalog_sections = {
    "Catalog_Version": "2025_06",
    "Sections": {
        "Section01": "01_about_western_governors_university.txt",
        "Section02": "02_admissions.txt",
        "Section03": "03_state_regulatory_information.txt",
        "Section04": "04_tuition_and_financial_aid.txt",
        "Section05": "05_academic_policies.txt",
        "Section06": "06_standalone_courses_and_certificates.txt",
        "Section07": "07_academic_programs.txt",
        "Section08": "08_school_of_business_programs.txt",
        "Section09": "09_leavitt_school_of_health_programs.txt",
        "Section10": "10_school_of_technology_programs.txt",
        "Section11": "11_school_of_education_programs.txt",
        "Section12": "12_program_outcomes.txt",
        "Section13": "13_course_descriptions.txt",
        "Section14": "14_instructor_directory.txt",
        "Section15": "15_certificate_programs.txt"
    }
}

# Build metadata list from catalog_sections
catalog_version = catalog_sections["Catalog_Version"]
catalog_dir = wgu_catalog / "sections" / catalog_version

section_index = []
# Build metadata list from catalog_sections
catalog_version = catalog_sections["Catalog_Version"]
catalog_dir = wgu_catalog / "sections" / catalog_version

section_index = []
for section_key, filename in catalog_sections["Sections"].items():
    file_path = catalog_dir / filename
    section_id = filename.split("_")[0]
    section_title = " ".join(filename.split("_")[1:]).replace(".txt", "").title()
    
    section_index.append({
        "filename": filename,
        "path": str(file_path),
        "section_id": section_id,
        "section_title": section_title,
        "catalog_version": catalog_version,
    })

In [17]:
# functions claude
def get_top_unigrams(input_path: Path, top_k: int = 50):
    """
    Extract top unigrams from a catalog section using NLTK after basic cleaning.

    Args:
        input_path (Path): Path to section .txt file.
        top_k (int): Number of top unigrams to return.

    Returns:
        list[tuple[str, int]]: List of (term, frequency) tuples.
    """
    with open(input_path) as f:
        text = f.read().lower()

    tokens = word_tokenize(text)
    std_stopwords = set(stopwords.words("english"))

    filtered_tokens = [
        t for t in tokens
        if t.isalpha() and t not in std_stopwords and len(t) > 1
    ]

    fdist = FreqDist(filtered_tokens)
    return fdist.most_common(top_k)

def get_top_bigrams(input_path: Path, top_k: int = 50):
    """
    Extract top bigrams from a catalog section using NLTK after basic cleaning.

    Args:
        input_path (Path): Path to section .txt file.
        top_k (int): Number of top bigrams to return.

    Returns:
        list[tuple[str, int]]: List of (bigram_string, frequency) tuples.
    """
    with open(input_path) as f:
        text = f.read().lower()

    tokens = word_tokenize(text)
    std_stopwords = set(stopwords.words("english"))

    filtered_tokens = [
        t for t in tokens
        if t.isalpha() and t not in std_stopwords and len(t) > 1
    ]

    bigram_tokens = bigrams(filtered_tokens)
    bigram_strings = [' '.join(pair) for pair in bigram_tokens]

    fdist = FreqDist(bigram_strings)
    return fdist.most_common(top_k)

def get_top_trigrams(input_path: Path, top_k: int = 50):
    """
    Extract top trigrams from a catalog section using NLTK after basic cleaning.

    Args:
        input_path (Path): Path to section .txt file.
        top_k (int): Number of top trigrams to return.

    Returns:
        list[tuple[str, int]]: List of (trigram_string, frequency) tuples.
    """
    with open(input_path) as f:
        text = f.read().lower()

    tokens = word_tokenize(text)
    std_stopwords = set(stopwords.words("english"))

    filtered_tokens = [
        t for t in tokens
        if t.isalpha() and t not in std_stopwords and len(t) > 1
    ]

    trigram_tokens = trigrams(filtered_tokens)
    trigram_strings = [' '.join(tg) for tg in trigram_tokens]

    fdist = FreqDist(trigram_strings)
    return fdist.most_common(top_k)

def convert_chart_title(top_terms, section_name="01 About Section", catalog_version="2025_06"):
    """
    Return a prettified chart title for a frequency rank plot.

    Args:
        top_terms (list): List of (term or phrase, frequency) tuples.
        section_name (str): Catalog section name.
        catalog_version (str): Catalog version in 'YYYY_MM' format.

    Returns:
        str: Prettified chart title.
    """
    # Convert catalog version
    year, month = catalog_version.split("_")
    month_str = month_name[int(month)]
    pretty_version = f"{month_str} {year}"

    # Detect n-gram type
    term = top_terms[0][0] if top_terms else ""
    n = len(term.split())
    label = {1: "Unigram", 2: "Bigram", 3: "Trigram"}.get(n, f"{n}-gram")

    return f"WGU Catalog {pretty_version} {label} Frequency Rank with Elbow"

def plot_elbow_with_knee(top_terms: list[tuple[str, int]], section_name: str,
                          catalog_version: str = "2025_06",
                          curve: str = 'convex', direction: str = 'decreasing',
                          highlight: bool = True):
    """
    Plot unigram frequency rank with elbow detection using KneeLocator.

    Args:
        top_terms (list): List of (term, frequency) tuples.
        section_name (str): Name of the section for labeling.
        catalog_version (str): Catalog version in 'YYYY_MM' format.
        curve (str): Shape of curve ('convex' or 'concave').
        direction (str): 'increasing' or 'decreasing'.
        highlight (bool): Whether to highlight the elbow.
    """
    freqs = [freq for _, freq in top_terms]
    ranks = list(range(1, len(freqs) + 1))

    kneedle = KneeLocator(ranks, freqs, curve=curve, direction=direction)
    elbow_rank = kneedle.knee

    plt.figure(figsize=(10, 6))
    plt.plot(ranks, freqs, marker='o')

    if highlight and elbow_rank:
        plt.scatter(elbow_rank, freqs[elbow_rank - 1], s=225,
                    edgecolors='red', facecolors='none', linewidths=2, zorder=5)
        plt.annotate('Elbow', xy=(elbow_rank, freqs[elbow_rank - 1]),
                     xytext=(elbow_rank + 1, freqs[elbow_rank - 1] + 2),
                     arrowprops=dict(arrowstyle='->'))

    title = convert_chart_title(top_terms, section_name, catalog_version)
    plt.title(title)
    plt.xlabel('Rank')
    plt.ylabel('Frequency')
    plt.xticks(ranks, [term for term, _ in top_terms], rotation=90)
    plt.tight_layout()
    plt.show()

    if elbow_rank:
        term, freq = top_terms[elbow_rank - 1]
        print(f"Elbow at rank: {elbow_rank}, term: '{term}', frequency: {freq}")
    else:
        print("No elbow detected.")

def save_elbow_chart(top_terms: list[tuple[str, int]], section_name: str,
                     catalog_version: str, output_path: Path,
                     curve: str = 'convex', direction: str = 'decreasing',
                     highlight: bool = True):
    """
    Save elbow chart to file instead of showing it.
    """
    freqs = [freq for _, freq in top_terms]
    ranks = list(range(1, len(freqs) + 1))

    kneedle = KneeLocator(ranks, freqs, curve=curve, direction=direction)
    elbow_rank = kneedle.knee

    plt.figure(figsize=(10, 6))
    plt.plot(ranks, freqs, marker='o')

    if highlight and elbow_rank:
        plt.scatter(elbow_rank, freqs[elbow_rank - 1], s=225,
                    edgecolors='red', facecolors='none', linewidths=2, zorder=5)
        plt.annotate('Elbow', xy=(elbow_rank, freqs[elbow_rank - 1]),
                     xytext=(elbow_rank + 1, freqs[elbow_rank - 1] + 2),
                     arrowprops=dict(arrowstyle='->'))

    title = convert_chart_title(top_terms, section_name, catalog_version)
    plt.title(title)
    plt.xlabel('Rank')
    plt.ylabel('Frequency')
    plt.xticks(ranks, [term for term, _ in top_terms], rotation=90)
    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    plt.close()

    return elbow_rank

def print_ngram_report(input_file: Path, catalog_version="2025_06"):
    from calendar import month_name

    section_name = input_file.stem
    year, month = catalog_version.split("_")
    month_str = month_name[int(month)]

    # Get n-grams
    top_unigrams = get_top_unigrams(input_file, top_k=50)
    top_bigrams = get_top_bigrams(input_file, top_k=50)
    top_trigrams = get_top_trigrams(input_file, top_k=50)

    # Compute elbows
    unigram_elbow = get_elbow_cutoff(top_unigrams)

    # Header
    print(f"WGU Catalog {month_str} {year} Unigrams (Elbow: {unigram_elbow})\n")

    # Print top N based on elbow
    def print_top(title, ngrams):
        elbow = get_elbow_cutoff(ngrams)
        print(f"{title} (Top {elbow}, based on elbow point)")
        for i, (term, freq) in enumerate(ngrams[:elbow], start=1):
            print(f"{i}. {term} ({freq})")
        print()

    print_top("Unigrams", top_unigrams)
    print_top("Bigrams", top_bigrams)
    print_top("Trigrams", top_trigrams)

    # Plot charts
    plot_elbow_with_knee(top_unigrams, section_name, catalog_version)
    plot_elbow_with_knee(top_bigrams, section_name, catalog_version)
    plot_elbow_with_knee(top_trigrams, section_name, catalog_version)

def get_elbow_cutoff(top_terms: list[tuple[str, int]]) -> int:
    freqs = [freq for _, freq in top_terms]
    ranks = list(range(1, len(freqs) + 1))
    kneedle = KneeLocator(ranks, freqs, curve='convex', direction='decreasing')
    return kneedle.knee or len(top_terms)

def save_ngram_report(entry: dict, output_root: Path = Path("/Users/buddy/Desktop/WGU-Reddit/outputs")):
    """
    Save n-gram reports and charts to files.
    
    Args:
        entry (dict): Dictionary containing file info with keys: filename, path, section_id, catalog_version
        output_root (Path): Root directory for outputs
    """
    input_file = Path(entry["path"])
    section_name = input_file.stem
    catalog_version = entry["catalog_version"]
    section_id = entry["section_id"]
    
    # Create output directory structure
    output_dir = output_root / catalog_version
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Extract first word after section number for filename
    filename_parts = entry["filename"].split("_")
    first_word = filename_parts[1] if len(filename_parts) > 1 else "section"
    
    # Get n-grams
    top_unigrams = get_top_unigrams(input_file, top_k=50)
    top_bigrams = get_top_bigrams(input_file, top_k=50)
    top_trigrams = get_top_trigrams(input_file, top_k=50)
    
    # Generate reports and save
    ngram_types = [
        ("unigram", top_unigrams),
        ("bigram", top_bigrams),
        ("trigram", top_trigrams)
    ]
    
    for ngram_type, ngram_data in ngram_types:
        # Create filenames
        report_filename = f"{section_id}_{first_word}_{ngram_type}.txt"
        chart_filename = f"{section_id}_{first_word}_{ngram_type}.png"
        
        report_path = output_dir / report_filename
        chart_path = output_dir / chart_filename
        
        # Save report
        elbow = get_elbow_cutoff(ngram_data)
        year, month = catalog_version.split("_")
        month_str = month_name[int(month)]
        
        with open(report_path, 'w') as f:
            f.write(f"WGU Catalog {month_str} {year} {ngram_type.title()}s (Elbow: {elbow})\n")
            f.write(f"Section: {section_name}\n\n")
            f.write(f"{ngram_type.title()}s (Top {elbow}, based on elbow point)\n")
            
            for i, (term, freq) in enumerate(ngram_data[:elbow], start=1):
                f.write(f"{i}. {term} ({freq})\n")
        
        # Save chart
        elbow_rank = save_elbow_chart(ngram_data, section_name, catalog_version, chart_path)
        
        print(f"Saved {ngram_type} report: {report_path}")
        print(f"Saved {ngram_type} chart: {chart_path}")

# Process all sections
def process_all_sections():
    """Process all sections in the catalog"""
    for entry in section_index:
        print(f"\nProcessing {entry['filename']}...")
        save_ngram_report(entry)

# Run for all sections
process_all_sections()


Processing 01_about_western_governors_university.txt...
Saved unigram report: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/01_about_unigram.txt
Saved unigram chart: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/01_about_unigram.png
Saved bigram report: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/01_about_bigram.txt
Saved bigram chart: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/01_about_bigram.png
Saved trigram report: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/01_about_trigram.txt
Saved trigram chart: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/01_about_trigram.png

Processing 02_admissions.txt...
Saved unigram report: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/02_admissions.txt_unigram.txt
Saved unigram chart: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/02_admissions.txt_unigram.png
Saved bigram report: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/02_admissions.txt_bigram.txt
Saved bigram chart: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/02_adm

In [19]:
# configs claude
# Set project root to one level above current notebook directory
import sys
from pathlib import Path
import re

project_root = Path().resolve().parent
sys.path.append(str(project_root))
unigram_input_dir = Path("/Users/buddy/Desktop/WGU-Reddit/outputs/2025_06/")

def identify_unigram_files(directory):
    """
    Identify files that:
    1. Have .txt extension (handling duplicate .txt cases)
    2. Contain 'unigram' in the filename
    """
    unigram_files = []
    
    for file_path in directory.glob("*"):
        filename = file_path.name
        
        # Check if file has .txt extension (handle duplicate .txt)
        if filename.endswith('.txt'):
            # Check if filename contains 'unigram'
            if 'unigram' in filename.lower():
                unigram_files.append(file_path)
    
    return sorted(unigram_files)

def extract_unigrams_from_file(file_path):
    """
    Extract unigrams from a single file using regex pattern matching
    Returns a set of unique words and word count
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        
        # Regex pattern to match numbered lines with word and count
        # Pattern: number. word (count)
        pattern = r'^\d+\.\s+([a-zA-Z]+)\s+\(\d+\)$'
        
        words = set()
        for line in content.split('\n'):
            match = re.match(pattern, line.strip())
            if match:
                word = match.group(1).lower()  # Extract word and convert to lowercase
                words.add(word)
        
        return words, len(words)
    
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return set(), 0

def process_unigram_files(directory, test_mode=True):
    """
    Process unigram files and combine words into catalog_stopwords set
    """
    # Identify files
    unigram_files = identify_unigram_files(directory)
    
    if not unigram_files:
        print("No unigram files found!")
        return set(), 0, 0
    
    print(f"Found {len(unigram_files)} unigram files:")
    for file_path in unigram_files:
        print(f"  - {file_path.name}")
    
    # Initialize combined set
    catalog_stopwords = set()
    total_words_processed = 0
    files_processed = 0
    
    # Process files (test mode: only first file, otherwise all files)
    files_to_process = [unigram_files[0]] if test_mode else unigram_files
    
    print(f"\nProcessing {'1 file (TEST MODE)' if test_mode else f'{len(files_to_process)} files'}:")
    
    for file_path in files_to_process:
        print(f"\nProcessing: {file_path.name}")
        
        words, word_count = extract_unigrams_from_file(file_path)
        
        if words:
            catalog_stopwords.update(words)
            total_words_processed += word_count
            files_processed += 1
            
            print(f"  - Extracted {word_count} unique words")
            print(f"  - Sample words: {list(words)[:5]}")  # Show first 5 words as sample
        else:
            print(f"  - No words extracted from {file_path.name}")
    
    return catalog_stopwords, files_processed, total_words_processed

# Run the processor
print("=== NLTK Unigram Processor ===")
print(f"Input directory: {unigram_input_dir}")
print(f"Directory exists: {unigram_input_dir.exists()}")

# Process files (TEST MODE - only first file)
catalog_stopwords, files_processed, total_words_processed = process_unigram_files(
    unigram_input_dir, 
    test_mode=False  # Change to False to process all files
)
import pandas as pd

# Example output from the unigram script (what catalog_stopwords would contain)
# Based on your file examples, it would extract words like:
example_catalog_stopwords = {
    'university', 'state', 'term', 'tuition', 'student', 'students', 
    'wgu', 'financial', 'aid', 'program', 'per', 'may', 'payment',
    'course', 'degree', 'credit', 'academic', 'enrollment', 'requirement'
}

# Convert to DataFrame
df_stopwords = pd.DataFrame(list(example_catalog_stopwords), columns=['word'])

print(df_stopwords.head())
print(f"\nDataFrame shape: {df_stopwords.shape}")
print(f"\n=== RESULTS ===")
print(f"Files processed: {files_processed}")
print(f"Total words processed: {total_words_processed}")
print(f"Unique words in catalog_stopwords: {len(catalog_stopwords)}")

if catalog_stopwords:
    print(f"\nFirst 10 words in catalog_stopwords:")
    sorted_words = sorted(catalog_stopwords)
    for i, word in enumerate(sorted_words[:10]):
        print(f"  {i+1}. {word}")
    
    print(f"\nLast 10 words in catalog_stopwords:")
    for i, word in enumerate(sorted_words[-10:]):
        print(f"  {len(sorted_words)-9+i}. {word}")

# Display catalog_stopwords set for verification
print(f"\ncatalog_stopwords set contains: {len(catalog_stopwords)} unique words")
print("Set ready for use!")

=== NLTK Unigram Processor ===
Input directory: /Users/buddy/Desktop/WGU-Reddit/outputs/2025_06
Directory exists: True
Found 15 unigram files:
  - 01_about_unigram.txt
  - 02_admissions.txt_unigram.txt
  - 03_state_unigram.txt
  - 04_tuition_unigram.txt
  - 05_academic_unigram.txt
  - 06_standalone_unigram.txt
  - 07_academic_unigram.txt
  - 08_school_unigram.txt
  - 09_leavitt_unigram.txt
  - 10_school_unigram.txt
  - 11_school_unigram.txt
  - 12_program_unigram.txt
  - 13_course_unigram.txt
  - 14_instructor_unigram.txt
  - 15_certificate_unigram.txt

Processing 15 files:

Processing: 01_about_unigram.txt
  - Extracted 9 unique words
  - Sample words: ['program', 'academic', 'western', 'governors', 'student']

Processing: 02_admissions.txt_unigram.txt
  - Extracted 8 unique words
  - Sample words: ['state', 'program', 'nursing', 'requirements', 'students']

Processing: 03_state_unigram.txt
  - Extracted 2 unique words
  - Sample words: ['state', 'university']

Processing: 04_tuition_

In [27]:
# unigram_processor.py

import re
from pathlib import Path

def identify_unigram_files(directory):
    return sorted([
        f for f in directory.glob("*.txt")
        if 'unigram' in f.name.lower()
    ])

def extract_unigrams_from_file(file_path):
    pattern = r'^\d+\.\s+([a-zA-Z]+)\s+\(\d+\)$'
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.read().splitlines()
    return {
        match.group(1).lower()
        for line in lines
        if (match := re.match(pattern, line.strip()))
    }

def combine_unigrams(directory):
    """
    Combine all unigram words from unigram .txt files in the directory
    Returns a set of stopwords to be added to NLTK stopwords
    """
    unigram_files = identify_unigram_files(directory)
    stopwords_set = set()
    for file_path in unigram_files:
        stopwords_set.update(extract_unigrams_from_file(file_path))
    return stopwords_set



In [29]:
# save_institutional_stopwords.py

# Generate stopwords from unigrams
institutional_stopwords = combine_unigrams(unigram_input_dir)

# Save to file
output_file = unigram_input_dir.parent / "institutional_stopwords.txt"

with open(output_file, "w", encoding="utf-8") as f:
    for word in sorted(institutional_stopwords):
        f.write(word + "\n")