### 📌 Help-Seeking Detection (NLTK + Keyword Match)

**1. Load & Prep**
- Load posts, combine `title` + `selftext` → `post_text`
- Lowercase for matching

**2. Baseline Detection**
- Flag posts with `help_keywords` → `has_help_phrase`

**3. Phrase Snippets**
- Extract & highlight ~30-char context around matched phrases

**4. Token Exploration**
- Run `FreqDist()` on help-seeking posts
- Remove stopwords/punctuation
- Export top tokens

**5. Build `help_keywords_v2`**
- Manual review of top tokens
- Apply to flag `has_help_v2`

**6. Compare v1 vs v2**
- v1: 13.5k posts, v2: 15.2k posts (3.3k flagged by v2 only)
- ~1.9k missed by both

**7. Phrase Discovery**
- On `v2_only`:
  - Extract bigrams/trigrams
  - Use `common_contexts()` for “help”, “advice”, etc.

**8. Sentiment Analysis**
- VADER sentiment → histogram of all / help / missed
- Found sentiment alone insufficient for detection

**9. Missed Post Analysis**
- Focus: `VADER < -0.2`, no help phrases
- Token + bigram `FreqDist()` to find implicit help signals

**10. Course Mentions**
- Count posts with Top 20 course codes

In [None]:
# === Project Bootstrapping ===
from pathlib import Path
import sys

# Locate and add project root so 'utils' is importable
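ROOT_DIR = Path().resolve()
while not (ROOT_DIR / "utils" / "db_connection_new.py").exists():
    if ROOT_DIR == ROOT_DIR.parent:
        # Reached the filesystem root without finding the project; stop instead of looping forever
        raise FileNotFoundError("Could not locate utils/db_connection_new.py in any parent directory")
    ROOT_DIR = ROOT_DIR.parent
sys.path.append(str(ROOT_DIR))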

# Now imports will work
from utils.db_connection_new import load_posts_dataframe
from utils.new_paths import DATA_DIR, OUTPUT_DIR

# Input / Output files
course_list = DATA_DIR / "courses_with_college_v10.csv"
courses_top20 = DATA_DIR / "reddit_top_20_mentioned_courses.csv"
output_dir = OUTPUT_DIR


df = load_posts_dataframe()

print(f"Loaded {len(df)} rows")
print("Columns:", df.columns.tolist())
display(df.head(3))  # optional if in notebook

def combine_and_clean_text(df):
    """
    Returns a cleaned text Series combining 'title' and 'selftext'.
    """
    post_text = (df['title'].fillna('') + ' ' + df['selftext'].fillna('')).str.strip()
    post_text = post_text.str.lower()
    return post_text

df['post_text'] = combine_and_clean_text(df).copy()

display(df.head(1))

In [None]:
help_keywords = [
    "need help", "help!", "help with", "any advice", "looking for advice",  
    "advice on", "tips on", "looking for tips", "need suggestions",
    "need recommendations", "how do i", "where do i", "where can i",
    "what do i", "when should i", "which should i", "does anyone know",
    "does anyone have", "anyone know how", "can someone help",
    "can anyone help", "stuck on", "struggling with", "cannot figure out",
    "can’t figure out", "having trouble with", "confused about",
    "lost on", "don’t understand", "not sure how", "no idea how",
    "trying to figure out", "help me understand", "explain how",
    "can someone explain", "make sense of", "anyone dealt with",
    "how did you handle", "how did you manage", "what worked for you",
    "am i missing something", "doing something wrong", "what am i doing wrong",
    "should i be", "am i supposed to", "can anyone explain",
    "what's the best way to", "any pointers on", 
]

In [None]:

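import pandas as pd

# Later cells refer to the posts DataFrame as df_posts; alias the frame loaded above
df_posts = df
display(df_posts.head(1))

# === Load CSVs ===
# Paths come from the bootstrapping cell (courses_top20, course_list)
df_top20 = pd.read_csv(courses_top20)
df_courses = pd.read_csv(course_list)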

print("✅ Loaded top 20 courses:")
display(df_top20.head(1))

print("✅ Loaded full course/college catalog:")
display(df_courses.head(1))

## Identify Help-Seeking Posts

Detect posts where students explicitly ask for help, using NLP tools.

---

### Goal
Go beyond manual keywords to uncover real help-seeking language patterns.

---

### Plan

#### Step 1: Baseline Keyword Match
- Use `help_keywords` (47 phrases)
- Flag posts with `has_help_phrase`

#### Step 2: NLP Phrase Discovery
- Focus on posts with `has_help_phrase` or low VADER sentiment
- Use:
  - `FreqDist()` → top unigrams
  - `bigrams()` / `trigrams()` → discover help phrases
  - `common_contexts()` → explore usage of “help”, “stuck”, etc.

#### Step 3: Refine & Expand
- Build `help_keywords_v2`
- (Optional) Train a basic classifier using labeled examples

## Step 1: Baseline Keyword Match

### Clean

In [None]:
# === Step 1: Apply baseline help phrase match ===

print("Merging title + selftext into Post_Text...")
df_posts['Post_Text'] = df_posts.apply(
    lambda r: f"{r['title']}\n{r['selftext']}".strip() if pd.notnull(r['selftext']) else r['title'].strip(),
    axis=1
)
print("Sample Post_Text:")
print(df_posts['Post_Text'].head(3))

print("\nLowercasing for help phrase matching...")
df_posts['Post_Text_LC'] = df_posts['Post_Text'].str.lower()
print("Sample Post_Text_LC:")
print(df_posts['Post_Text_LC'].head(3))

print("\nChecking for help phrases...")
df_posts['has_help_phrase'] = df_posts['Post_Text_LC'].apply(
    lambda text: any(phrase in text for phrase in help_keywords)
)
print("Sample has_help_phrase values:")
print(df_posts['has_help_phrase'].head(3))

print("\nFiltering matched help-seeking posts...")
df_help_labeled = df_posts[df_posts['has_help_phrase']].copy()
print(f"Baseline help-seeking matches: {len(df_help_labeled)} / {len(df_posts)}")

print("Sample matched posts:")
display(df_help_labeled[['post_id', 'Post_Text']].head())

In [None]:
import re
from IPython.display import HTML, display  # needed for the HTML table rendered below

# Combine title + selftext
df_help_labeled['Post_Text'] = df_help_labeled.apply(
    lambda r: f"{r['title']}\n{r['selftext']}".strip() if pd.notnull(r['selftext']) else r['title'].strip(),
    axis=1
)

# Lowercase version for matching
df_help_labeled['Post_Text_LC'] = df_help_labeled['Post_Text'].str.lower()

# Find matched phrases
df_help_labeled['matched_phrases'] = df_help_labeled['Post_Text_LC'].apply(
    lambda text: [p for p in help_keywords if p in text]  # baseline list; help_keywords_v2 is built later
)

# Extract ~30-character window around each match
def extract_context_snippets(text, phrases, window=30):
    text_lc = text.lower()
    matches = []
    for phrase in sorted(set(phrases), key=len, reverse=True):
        for match in re.finditer(re.escape(phrase), text_lc):
            start, end = match.span()
            snippet = text[max(0, start-window):min(len(text), end+window)]
            matches.append((start, snippet))
    # Remove overlapping snippets
    matches = sorted(matches, key=lambda x: x[0])
    final_snippets = []
    last_end = -1
    for start, snippet in matches:
        if start > last_end:
            final_snippets.append(snippet.strip())
            last_end = start + len(snippet)
    return final_snippets

# Extract snippets
df_help_labeled['snippets'] = df_help_labeled.apply(
    lambda row: extract_context_snippets(row['Post_Text'], row['matched_phrases']),
    axis=1
)

# Highlight help phrases
def highlight_in_snippet(snippet, phrases):
    for p in sorted(phrases, key=len, reverse=True):
        pattern = re.compile(re.escape(p), re.IGNORECASE)
        # Wrap the matched text itself so the post's original casing is preserved
        snippet = pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", snippet)
    return snippet

# Filter + explode
df_snippet = df_help_labeled[df_help_labeled['snippets'].str.len() > 0].copy()
df_snippet = df_snippet.explode('snippets')

# Highlight in each snippet
df_snippet['Context'] = df_snippet.apply(
    lambda row: highlight_in_snippet(row['snippets'], row['matched_phrases']),
    axis=1
)

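# truncate() is not defined in this notebook; this is a minimal stand-in (assumed behavior)
def truncate(text, max_len):
    text = str(text)
    return text if len(text) <= max_len else text[:max_len - 3] + "..."

# Format display DataFrame
df_snippet['Title'] = df_snippet['title'].apply(lambda x: truncate(x, 40))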
df_display = df_snippet[['post_id', 'Title', 'Context']].copy()

# Render HTML (wider table)
html_code = df_display.to_html(index=False, escape=False)
display(HTML(f"""
<div style='max-height: 700px; overflow-y: auto; border: 1px solid #ccc; padding: 10px; font-family: sans-serif'>
<table style='width:100%; table-layout: fixed'>
{html_code}
</table>
</div>
"""))

### Observation: Baseline help-seeking phrase detection looks precise
In the snippets reviewed, every post matched by `help_keywords` was a genuine help-seeking post; no false positives were observed. This says nothing yet about posts the list misses.
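
Recall also depends on exact character matches. The `help_keywords` list mixes typographic apostrophes (`can’t figure out`) with straight ones (`what's the best way to`), so a post typed with the other variant would not match. Below is a minimal sketch of normalizing both sides before matching. `normalize_quotes` and `has_help_phrase_normalized` are hypothetical helpers, not part of the original notebook, and the phrases are a small illustrative subset.

```python
# Hypothetical helpers: map typographic quotes to ASCII before keyword matching.
QUOTE_MAP = str.maketrans({"\u2019": "'", "\u2018": "'", "\u201c": '"', "\u201d": '"'})

def normalize_quotes(text: str) -> str:
    """Return lowercase text with curly quotes replaced by straight ones."""
    return text.translate(QUOTE_MAP).lower()

def has_help_phrase_normalized(text: str, phrases) -> bool:
    """True if any normalized phrase occurs in the normalized text."""
    norm_text = normalize_quotes(text)
    return any(normalize_quotes(p) in norm_text for p in phrases)

# Illustrative subset of help_keywords, written with curly apostrophes
sample_phrases = ["can\u2019t figure out", "don\u2019t understand"]

# The post uses a straight apostrophe, which a raw substring check would miss
matched = has_help_phrase_normalized("I can't figure out this task", sample_phrases)
unmatched = has_help_phrase_normalized("All good here", sample_phrases)
print(matched, unmatched)
```

In the notebook, the same normalization could be applied once to `Post_Text_LC` and once to the keyword list rather than on every comparison.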

---

### Next: Improve and Validate the List Using NLTK

1. **Token Exploration**
   - Run `FreqDist()` on `has_help_phrase == True` posts
   - Find common unigrams related to help-seeking

2. **Missed Signal Discovery**
   - Run `FreqDist()` on posts where `has_help_phrase == False` and `VADER < -0.2`
   - Identify help-related language that isn’t in the current list

3. **Phrase Expansion**
   - Use `bigrams()` and `trigrams()` to find multi-word patterns
   - Use `common_contexts()` to explore key word usage

4. **Refine Help Phrase List**
   - Build `help_keywords_v2`
   - Optional: flag more posts or prep training data for a classifier

In [None]:
# === Step 5: Token Exploration — FreqDist on help-seeking posts ===

import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

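# Ensure the NLTK Punkt tokenizer data is available
# (newer NLTK releases load 'punkt_tab' instead of 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)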

# Get only the help-seeking text
help_texts = df_help_labeled['Post_Text'].dropna().tolist()
all_help_tokens = []

print("➡️ Tokenizing help-seeking posts...")
for text in help_texts:
    tokens = word_tokenize(text.lower())
    all_help_tokens.extend(tokens)

# Generate frequency distribution
fdist = FreqDist(all_help_tokens)

# Show the 30 most common tokens
print("✅ Top 30 tokens in help-seeking posts:")
for word, count in fdist.most_common(30):
    print(f"{word:>12} : {count}")

### Remove Stop Words

In [None]:
# === Step 6: Cleaned Token Exploration — Stopwords Removed + Export ===

from nltk.corpus import stopwords
import string

# Download stopwords if needed
nltk.download('stopwords', quiet=True)

# Define base stopwords and punctuation
stop_words = set(stopwords.words('english'))
punct = set(string.punctuation)

# Tokenize and clean
clean_tokens = []
for text in help_texts:
    tokens = word_tokenize(text.lower())
    filtered = [t for t in tokens if t not in stop_words and t not in punct and len(t) > 2]
    clean_tokens.extend(filtered)

# Frequency distribution
clean_fdist = FreqDist(clean_tokens)

# Show top 30
print("Top 30 cleaned tokens in help-seeking posts:")
for word, count in clean_fdist.most_common(30):
    print(f"{word:>12} : {count}")

# Save full freq list to CSV (output_dir comes from the bootstrapping cell, avoiding a machine-specific path)
output_path = output_dir / "help_token_freq.csv"
freq_df = pd.DataFrame(clean_fdist.items(), columns=["token", "count"]).sort_values(by="count", ascending=False)
freq_df.to_csv(output_path, index=False)

print(f"Full token frequency saved to: {output_path}")

### Help-Seeking Tokens Identified
`FreqDist()` revealed strong signals like: `help`, `need`, `advice`, `anyone`, `question`, `tips`, `thanks`, `stuck`.

---

### Noise Observed
Default stopword removal missed low-signal terms:
- WGU-specific: `wgu`, `course`, `class`, `degree`
- Fillers: `like`, `get`, `know`, `really`, `also`

---

### What We Did
- Manually reviewed top tokens
- Removed generic terms
- Finalized a list of **62 high-signal help-seeking words**

Saved to: `/outputs/help_keywords_v2.csv`

---

### Sample from `help_keywords_v2`:
`help`, `advice`, `anyone`, `need`, `tips`, `questions`, `looking`, `struggling`, `please`, `appreciated`
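
Because these entries are single words, a plain substring check (`kw in text`) can over-match: `help` also matches `helpful`, and `need` matches `needed`. If that turns out to hurt precision, one option is a word-boundary regex. The following is a sketch only; `build_keyword_pattern` is a hypothetical helper and the keyword subset is illustrative.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one regex that matches any keyword as a whole word, ignoring case."""
    # Longest first so multi-word phrases are tried before their shorter prefixes
    escaped = [re.escape(k) for k in sorted(keywords, key=len, reverse=True)]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

# Illustrative subset of help_keywords_v2
pattern = build_keyword_pattern(["help", "advice", "need", "tips"])

whole_word = bool(pattern.search("Need help with C960"))      # contains 'need' and 'help'
partial = bool(pattern.search("This was a helpful post"))     # 'helpful' is not 'help'
print(whole_word, partial)
```

A symbol-only keyword such as `?` would need separate handling, since `\b` requires a word character on one side of the boundary.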

### Next Steps 

1. **Apply `help_keywords_v2`**
   - Load CSV
   - Flag posts with `has_help_v2`

2. **Compare Old vs New**
   - Count posts:
     - `has_help_phrase`
     - `has_help_v2`
     - in `v2` only
     - still undetected

3. **Analyze Misses**
   - Focus: `has_help_v2 == False` and `VADER < -0.2`
   - Run `FreqDist()` + `bigrams()` on this slice

4. **Prepare for Classifier**
   - Create labeled dataset with:
     - `post_id`, `Post_Text`, `has_help_v2`, `VADER`
   - Save as `help_labeled_dataset.csv`

In [None]:
# === Step 1: Apply help_keywords_v2 ===

import csv

# Load refined keyword list (lowercase).
# PROJECT_ROOT is assumed to be defined earlier in the session; it is not set in this notebook.
v2_path = PROJECT_ROOT / "WGU_catalog" / "outputs" / "help_keywords_v2.csv"
with open(v2_path, "r") as f:
    reader = csv.reader(f)
    help_keywords_v2 = [row[0].strip().lower() for row in reader if row]

# Optional: uncomment to also treat "?" as a help signal.
# Left disabled because it matches nearly any question and may change the counts reported below.
# help_keywords_v2.append("?")

# Flag posts containing any v2 keyword
df_posts["has_help_v2"] = df_posts["Post_Text_LC"].apply(
    lambda text: any(kw in text for kw in help_keywords_v2)
)

# Preview results
print(f"✅ help_keywords_v2 applied. Matches found: {df_posts['has_help_v2'].sum()}")
display(df_posts[df_posts["has_help_v2"]].head(10)[["post_id", "Post_Text"]])

## Compare first and second list

In [None]:
# === Step 2: Compare has_help_phrase vs has_help_v2 ===

# Basic counts
count_phrase = df_posts["has_help_phrase"].sum()
count_v2 = df_posts["has_help_v2"].sum()

# Posts newly flagged by v2
v2_only = df_posts[(df_posts["has_help_v2"]) & (~df_posts["has_help_phrase"])]
missed_by_both = df_posts[(~df_posts["has_help_v2"]) & (~df_posts["has_help_phrase"])]

print("Help Phrase Comparison:")
print(f"- Old (has_help_phrase): {count_phrase}")
print(f"- New (has_help_v2): {count_v2}")
print(f"- Detected by v2 only: {len(v2_only)}")
print(f"- Missed by both: {len(missed_by_both)}")



### Help Phrase Comparison – Observation

The refined keyword list (`help_keywords_v2`) flagged 15,244 posts, up from 13,551 with the original list: a net gain of 1,693 posts (about 12%).

Over 3,300 posts were uniquely caught by v2, highlighting the expanded coverage.

However, this assumes all matches are valid — in reality, some may be false positives. A manual sample review is needed to verify precision.
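
A manual review could start from a reproducible random sample of the `v2_only` posts, with precision estimated from the reviewer's labels. The sketch below uses only the standard library. `draw_review_sample` and `estimate_precision` are hypothetical helpers, the post IDs are placeholders, and the labels are made up purely to show the calculation.

```python
import random

def draw_review_sample(post_ids, k=100, seed=42):
    """Return a reproducible random sample of post IDs for manual labeling."""
    rng = random.Random(seed)
    ids = list(post_ids)
    return rng.sample(ids, min(k, len(ids)))

def estimate_precision(labels):
    """Share of reviewed posts judged to be genuine help-seeking (True)."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels provided")
    return sum(bool(x) for x in labels) / len(labels)

# Placeholder IDs standing in for the v2_only posts
sample = draw_review_sample([f"post_{i}" for i in range(3302)], k=50)

# Made-up reviewer judgments, for illustration only (not real results)
manual_labels = [True] * 40 + [False] * 10
precision = estimate_precision(manual_labels)
print(f"Sampled {len(sample)} posts; estimated precision: {precision:.2f}")
```

Fixing the seed keeps the sample stable across notebook runs, so labels recorded against `post_id` stay valid.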

Meanwhile, ~1,900 posts were flagged by neither list. The low-sentiment subset of these is examined later for help signals that both lists miss.

**Help Phrase Comparison:**
- Old (`has_help_phrase`): 13,551  
- New (`has_help_v2`): 15,244  
- Detected by v2 only: 3,302  
- Missed by both: 1,896
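
The four counts above also determine the remaining cells of the v1/v2 overlap table. This short sketch is plain arithmetic on the reported numbers; the variable names are chosen here for illustration.

```python
# Counts reported above
v1_total = 13_551        # has_help_phrase
v2_total = 15_244        # has_help_v2
v2_only_count = 3_302    # flagged by v2 but not by v1
missed_by_both = 1_896   # flagged by neither list

# Derived cells of the 2x2 overlap table
both = v2_total - v2_only_count     # flagged by both lists
v1_only_count = v1_total - both     # flagged by the original list only
total_posts = both + v1_only_count + v2_only_count + missed_by_both

print(f"both={both}, v1_only={v1_only_count}, total={total_posts}")
```

The derived `v1_only_count` is non-zero, which indicates that v2 is not a strict superset of v1; combining both flags (as the later `help_mask` does) covers more posts than either list alone.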

### Phrase Expansion: Bigrams, Trigrams, and Contexts from `v2_only` Posts

To uncover additional help-seeking patterns beyond keywords, we analyze `v2_only` posts using NLTK tools:

1. **Tokenize & Clean Text**  
   Normalize case and remove all punctuation except `?` for consistent phrase extraction.

2. **Extract Bigrams and Trigrams**  
   Identify frequently occurring 2–3 word phrases that suggest help-seeking intent (e.g. "need advice", "anyone know how").

3. **Explore Common Contexts**  
   Use `common_contexts()` to see how key terms like "help", "tips", and "advice" are used in surrounding text.

These steps help surface real-world language patterns to refine detection beyond manual phrases.
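
To make the third step concrete: a context here is the pair of words immediately before and after a target word. The sketch below approximates that idea with `collections.Counter`. It is not NLTK's implementation (for example, it skips occurrences at the very start or end of the token list), and `context_counts` and the sample tokens are made up for illustration.

```python
from collections import Counter

def context_counts(tokens, target):
    """Count (previous_word, next_word) pairs around each occurrence of target."""
    pairs = Counter()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == target:
            pairs[(tokens[i - 1], tokens[i + 1])] += 1
    return pairs

# Made-up tokens for illustration
tokens = "i need help with c960 and i need help with c949 please help me".split()
help_contexts = context_counts(tokens, "help")

# Print in a left_right style similar to what common_contexts() displays
for (left, right), n in help_contexts.most_common(2):
    print(f"{left}_{right}: {n}")
```

Frequent contexts such as `need_with` point directly at candidate phrases ("need help with") for the keyword list.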

In [None]:
# === Step 3: Phrase Expansion from v2_only posts ===

from nltk.tokenize import word_tokenize
from nltk.util import bigrams, trigrams
from nltk.text import Text
from nltk.probability import FreqDist
import string

def clean_and_tokenize(text):
    # Keep '?', remove all other punctuation
    punct = string.punctuation.replace('?', '')
    text = text.lower().translate(str.maketrans('', '', punct))
    return word_tokenize(text)

# Tokenize v2-only posts
v2_only_tokens = v2_only["Post_Text"].dropna().apply(clean_and_tokenize)

# Flatten tokens
flat_tokens = [token for tokens in v2_only_tokens for token in tokens]

# === Bigrams ===
bi = FreqDist(bigram for tokens in v2_only_tokens for bigram in bigrams(tokens))
print("\nTop 30 bigrams in v2-only posts:")
for phrase, count in bi.most_common(30):
    print(f"{phrase[0]} {phrase[1]}: {count}")

# === Trigrams ===
tri = FreqDist(trigram for tokens in v2_only_tokens for trigram in trigrams(tokens))
print("\nTop 30 trigrams in v2-only posts:")
for phrase, count in tri.most_common(30):
    print(f"{phrase[0]} {phrase[1]} {phrase[2]}: {count}")

# === Common Contexts ===
# Use flat tokens to build a Text object
text_obj = Text(flat_tokens)

# Explore how "help", "tips", "advice" appear in context
print("\nCommon contexts for 'help':")
text_obj.common_contexts(["help"])

print("\nCommon contexts for 'tips':")
text_obj.common_contexts(["tips"])

print("\nCommon contexts for 'advice':")
text_obj.common_contexts(["advice"])

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# add a column for VADER
if "VADER" not in df_posts.columns:
    analyzer = SentimentIntensityAnalyzer()
    df_posts["VADER"] = df_posts["Post_Text"].apply(
        lambda t: analyzer.polarity_scores(t)["compound"]
    )
    print("✅ VADER sentiment scores added.")

## Identify Sentiment Ranges with Help-Seeking Signals


In [None]:
import matplotlib.pyplot as plt
pd.set_option("display.max_colwidth", 500)

# === Histogram: All Posts ===
plt.figure(figsize=(10, 4))
plt.hist(df_posts["VADER"], bins=40, edgecolor='black')
plt.title("VADER Sentiment — All Posts")
plt.xlabel("VADER Score")
plt.ylabel("Post Count")
plt.grid(True)
plt.show()

# === Histogram: Help-Seeking Posts Only ===
help_mask = df_posts["has_help_phrase"] | df_posts["has_help_v2"]

plt.figure(figsize=(10, 4))
plt.hist(df_posts[help_mask]["VADER"], bins=40, edgecolor='black', color='green')
plt.title("VADER Sentiment — Help-Seeking Posts")
plt.xlabel("VADER Score")
plt.ylabel("Help-Seeking Post Count")
plt.grid(True)
plt.show()

# === Histogram: Missed Posts ===
missed_mask = (~df_posts["has_help_phrase"]) & (~df_posts["has_help_v2"])

plt.figure(figsize=(10, 4))
plt.hist(df_posts[missed_mask]["VADER"], bins=40, edgecolor='black', color='red')
plt.title("VADER Sentiment — Missed Posts (Unflagged)")
plt.xlabel("VADER Score")
plt.ylabel("Missed Post Count")
plt.grid(True)
plt.show()




### Observation

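A sharp peak in the -0.5 to -0.4 range suggests many distress-based help posts. The large spike at 0.0 likely reflects posts in which VADER found no sentiment-bearing words (for example, very short or title-only posts), since the compound score is exactly 0.0 in that case. Consistent volume in the 0.3 to 0.9 range points to advice, gratitude, or resolved help, which is still relevant for identifying help-seeking behavior.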

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Define 0.1 bins from -1.0 to +1.0
bins = np.arange(-1.0, 1.1, 0.1)

# === Histogram: All Posts ===
plt.figure(figsize=(10, 4))
plt.hist(df_posts["VADER"], bins=bins, edgecolor='black')
plt.title("VADER Sentiment — All Posts (0.1 Bins)")
plt.xlabel("VADER Score")
plt.ylabel("Post Count")
plt.grid(True)
plt.show()

# === Histogram: Original Keyword Matches ===
plt.figure(figsize=(10, 4))
plt.hist(df_posts[df_posts["has_help_phrase"]]["VADER"], bins=bins, edgecolor='black', color='orange')
plt.title("VADER Sentiment — has_help_phrase (Original Keywords)")
plt.xlabel("VADER Score")
plt.ylabel("Help-Seeking Post Count")
plt.grid(True)
plt.show()

# === Histogram: Expanded Keyword Matches ===
plt.figure(figsize=(10, 4))
plt.hist(df_posts[df_posts["has_help_v2"]]["VADER"], bins=bins, edgecolor='black', color='green')
plt.title("VADER Sentiment — has_help_v2 (Expanded Keywords)")
plt.xlabel("VADER Score")
plt.ylabel("Help-Seeking Post Count")
plt.grid(True)
plt.show()

### Sentiment Histogram Conclusion

Sentiment distributions for all posts, original-keyword matches, and v2 matches are similar. The v2 list captures more posts, especially at low sentiment, but does not shift the overall curve.

**Conclusion:** No clear sentiment range isolates missed help-seeking. Future improvements should focus on language patterns, not sentiment.
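
One way to check this conclusion numerically, rather than by eye, is to compute the share of flagged posts per sentiment bin. If sentiment separated help-seeking posts, that share would vary sharply between bins. The sketch below is standard-library only; `flagged_share_by_bin` is a hypothetical helper and the scores and flags are made up.

```python
import math
from collections import defaultdict

def flagged_share_by_bin(scores, flags, width=0.5):
    """Return {bin_lower_edge: share of flagged posts} for fixed-width bins on [-1, 1]."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    n_bins = int(2.0 / width)
    for score, flag in zip(scores, flags):
        # Clamp the index so a score of exactly 1.0 falls in the top bin
        idx = min(math.floor((score + 1.0) / width), n_bins - 1)
        edge = round(-1.0 + idx * width, 10)
        totals[edge] += 1
        flagged[edge] += bool(flag)
    return {edge: flagged[edge] / totals[edge] for edge in sorted(totals)}

# Made-up values for illustration only
scores = [-0.8, -0.6, -0.3, 0.0, 0.2, 0.7, 0.9, 1.0]
flags = [True, False, True, True, False, True, True, False]
shares = flagged_share_by_bin(scores, flags)
for edge, share in shares.items():
    print(f"bin starting at {edge:+.1f}: {share:.2f}")
```

With the real data, `scores` would be `df_posts["VADER"]` and `flags` the combined help mask; roughly flat shares across bins would support the conclusion above.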

In [None]:
# === Step 3: Analyze Misses (VADER < -0.2 but no help keywords matched) ===

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.util import bigrams
import string

# Filter missed posts with negative sentiment
NEG_THRESHOLD = -0.2
missed_neg = df_posts[(~df_posts["has_help_v2"]) & 
                      (~df_posts["has_help_phrase"]) &
                      (df_posts["VADER"] < NEG_THRESHOLD)].copy()

print(f"Posts missed by both help detectors but VADER < {NEG_THRESHOLD}: {len(missed_neg)}")

# Lowercase + remove punctuation
def clean_and_tokenize(text):
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return word_tokenize(text)

# Tokenize all missed posts
missed_tokens = missed_neg["Post_Text"].apply(clean_and_tokenize).explode()

# FreqDist of individual tokens
fdist = FreqDist(missed_tokens)
print("\nTop 30 tokens in missed posts:")
print(fdist.most_common(30))

# Get bigrams
missed_bigrams = missed_neg["Post_Text"].apply(
    lambda text: list(bigrams(clean_and_tokenize(text)))
).explode()

bigram_dist = FreqDist(missed_bigrams)
print("\nTop 30 bigrams in missed posts:")
for bigram, count in bigram_dist.most_common(30):
    print(f"{bigram[0]} {bigram[1]}: {count}")

## Top 20 course code match

In [None]:
# Check how many posts mention any Top 20 course code
top20_list = df_top20['Course Code'].str.upper().unique().tolist()

literal_counts = df_posts["Post_Text"].str.upper().apply(
    lambda t: any(code in t for code in top20_list)
).sum()

print(f"✅ Posts mentioning a Top 20 course: {literal_counts} / {len(df_posts)}")

## VADER on preview

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
sample_scores = df_posts["Post_Text"].head(5).apply(lambda t: analyzer.polarity_scores(t)['compound'])

print("✅ VADER compound scores (sample):")
print(sample_scores)