# Identify Help-Seeking Posts

Detect posts where students ask for help using NLP tools and data-driven keyword discovery.

## Goal
Go beyond manual keywords to uncover real help-seeking language patterns from student discourse.

## Data Source
**WGU-Reddit Database**: `WGU-Reddit/db/WGU-Reddit.db`
- Contains 18,000+ posts from ~50 WGU-related subreddits
- Covers posts from December 28, 2014 to July 15, 2025 (UTC)
- Includes post metadata (scores, comments, timestamps)
- *All data is sourced from public Reddit posts. No usernames or personal identifiers are stored.*

## Definitions

- **Help-Seeking**: A post where the author is asking for guidance, support, or answers—typically involving a need, confusion, or problem to solve.

    - **Explicit Help-Seeking**: Clearly stated requests, such as *"Can someone explain this?"*

    - **Implicit Help-Seeking**: Indirect expressions of struggle or uncertainty that imply a need for help. In natural conversation, people often ask for help without directly saying so. This kind of intent is hard to detect algorithmically. For example:
        - “This situation is getting out of hand.”
        - “I feel totally stuck.”
        - “Nothing is working.”

        These cases are difficult to classify and are excluded from this phase. We will refine this example list as we encounter real posts.

        We will narrow our scope to explicit asks due to the extreme challenge of modeling indirect emotional cues without tone or context—“even humans struggle to detect the emotion of user utterance solely on the basis of text” (Chatterjee et al., 2019).

- **Non-Help-Seeking**: Posts that are informational, reflective, or conversational with no clear request for help.

*Note: All posts come from WGU-related subreddits, so discussions are generally university-adjacent, though not always strictly academic.*

## Methodology

**Step 1: Load and Clean Data**
- Load posts from database
- Filter to last 24 hours for manual labeling sample
- Combine & lowercase `title` and `selftext` into a single `post_text` field
- Create baseline dataset for pattern discovery

**Step 2: Manual Labeling**
- Create ground truth dataset with explicit help-seeking labels
- Focus on direct questions and requests only (exclude implicit help-seeking)
- Establish single source of truth for model training

**Step 3: Baseline Keyword Detection**
- Start with strongest anchor: question mark (`?`)
- Evaluate performance against manual labels
- Analyze false positives/negatives for pattern insights
- Build a robust institutional stopword list using `FreqDist` on WGU Catalog Sections.

**Step 4: NLP Phrase Discovery**
- Extract linguistic patterns from labeled help-seeking posts
- Use NLTK tools:
  - `FreqDist()` → identify top unigrams
  - `bigrams()` / `trigrams()` → discover help phrases
  - `common_contexts()` → explore usage of "help", "stuck", etc.

**Step 5: Refine & Expand**
- Build enhanced keyword list (`help_keywords_v2`)
- Scale to full database for training/testing
- (Optional) Train classifier using discovered features

---

### Imports

In [129]:
from pathlib import Path
import sys
import pandas as pd
from IPython.display import display, HTML
import re


# Set project root to one level above current notebook directory
project_root = Path().resolve().parent
sys.path.append(str(project_root))

from utils.db_connection import get_db_connection

## Step 1: Load and Clean Data

In [94]:
db = get_db_connection()
df = pd.read_sql_query(
    """
    SELECT p.post_id, p.subreddit_id, p.title, p.selftext, p.created_utc,
           p.score, p.num_comments, p.permalink, s.name AS subreddit_name
    FROM posts p
    LEFT JOIN subreddits s ON p.subreddit_id = s.subreddit_id
    """, db)
db.close()

df['created_at'] = pd.to_datetime(df['created_utc'], unit='s')

# Summary metrics
total_posts = len(df)
unique_subs = df['subreddit_name'].nunique()
min_date = df['created_at'].min().strftime('%Y-%m-%d')
max_date = df['created_at'].max().strftime('%Y-%m-%d')

print(f"Loaded {total_posts} posts from {unique_subs} subreddits ({min_date} to {max_date})")

# Cell 2: Filter posts from the last 24 hours

latest_timestamp = df['created_at'].max()
df = df[df['created_at'] >= latest_timestamp - pd.Timedelta(hours=24)]
df = df[['post_id', 'title', 'selftext']]

print(f"Filtered {len(df)} posts from the last 24 hours")

Loaded 18829 posts from 51 subreddits (2014-12-28 to 2025-07-15)
Filtered 83 posts from the last 24 hours


### Preprocessing
- Combine `title` and `selftext`
- Lowercase for keyword matching

In [95]:
def combine_and_clean_text(df):
    """
    Returns a cleaned text Series combining 'title' and 'selftext'.
    """
    post_text = (df['title'].fillna('') + ' ' + df['selftext'].fillna('')).str.strip()
    post_text = post_text.str.lower()
    return post_text

In [96]:
print("df columns:", df.columns.tolist())

df columns: ['post_id', 'title', 'selftext']


In [97]:
# combine and clean with updated column name 'post_text'
df_clean = df.copy()
df_clean['post_text'] = combine_and_clean_text(df_clean)

print("df columns:", df.columns.tolist())
print("df_clean columns:", df_clean.columns.tolist())

df columns: ['post_id', 'title', 'selftext']
df_clean columns: ['post_id', 'title', 'selftext', 'post_text']


In [98]:
# display changes
preview = df_clean[['post_id','title', 'selftext', 'post_text']].copy()
pd.set_option('display.max_colwidth', None)  # Don't let pandas truncate; we do it ourselves
preview['selftext'] = preview['selftext'].str.slice(0, 120)
preview['post_text'] = preview['post_text'].str.slice(0, 120)

html = preview.head(5).to_html(index=False, escape=False)
display(HTML(html))

post_id,title,selftext,post_text
1lzfct7,MSN Application,I am in the process of filling out the application for MSN and it won’t let me pass the employment page even though I ha,msn application i am in the process of filling out the application for msn and it won’t let me pass the employment page
1lzf7hc,It is done.,"Honestly one of the hardest things I've ever done for myself. Did terrible in highschool (was diagnosed at 15 with MS),",it is done. honestly one of the hardest things i've ever done for myself. did terrible in highschool (was diagnosed at 1
1lzfgof,Dse withdraw process mba admissions?,How we can withdraw admission form dse mba program,dse withdraw process mba admissions? how we can withdraw admission form dse mba program
1m07p15,"D427 | Nervous, haven't coded in forever. Need tips/advice.","I'm trying to graduate and I have a few courses left to go. I am really scared of D427 (I did not have to do 426, I gue","d427 | nervous, haven't coded in forever. need tips/advice. i'm trying to graduate and i have a few courses left to go."
1m07304,Payment,Looking to start wgu for a bachelors \n\n\nDo I have to pay first before starting ?,payment looking to start wgu for a bachelors \n\n\ndo i have to pay first before starting ?


## Step 2: Manual Labeling
### Create labelled dataset
- Export a template CSV, `manual_help_truth.csv`:

| **post_id** | **post_text**          | **help_truth** |
|-------------|------------------------|----------------|
| abc123      | is this a question?    | 0 → 1          |

- Manually tag each post (change `help_truth` from 0 to 1 where appropriate) and move the file to `data/`
- Merge the `help_truth` tag into the dataset


### Export a template CSV to manually tag help-seeking posts

In [99]:
print("df_clean columns:", df_clean.columns.tolist())

df_clean[['post_id', 'post_text']].assign(help_truth=0).to_csv('outputs/manual_help_truth.csv', index=False)

print("Exported manual_help_truth.csv with columns: post_id, post_text, help_truth")

df_clean columns: ['post_id', 'title', 'selftext', 'post_text']
Exported manual_help_truth.csv with columns: post_id, post_text, help_truth


### Merge 'help_truth' tag into dataset

In [100]:
# add_truth_flag_to_clean.py
df_truth = pd.read_csv('data/manual_help_truth.csv')[['post_id', 'help_truth']]
df_labeled = df_clean.merge(df_truth, on='post_id', how='left')

### Display labeled dataset

In [101]:
preview = df_labeled[['post_id', 'post_text', 'help_truth']].copy()
preview['post_text'] = preview['post_text'].str.slice(0, 120)
html = preview.head(5).to_html(index=False, escape=False)
print("df_labeled columns:", df_labeled.columns.tolist())

display(HTML(html))

df_labeled columns: ['post_id', 'title', 'selftext', 'post_text', 'help_truth']


post_id,post_text,help_truth
1lzfct7,msn application i am in the process of filling out the application for msn and it won’t let me pass the employment page,1
1lzf7hc,it is done. honestly one of the hardest things i've ever done for myself. did terrible in highschool (was diagnosed at 1,0
1lzfgof,dse withdraw process mba admissions? how we can withdraw admission form dse mba program,1
1m07p15,"d427 | nervous, haven't coded in forever. need tips/advice. i'm trying to graduate and i have a few courses left to go.",1
1m07304,payment looking to start wgu for a bachelors \n\n\ndo i have to pay first before starting ?,1


## Step 3: Baseline Keyword Detection


In [102]:
keywords = ['?']  # initial list

# Create keyword_match and help_flag
def detect_keywords(post_text):
    matches = [kw for kw in keywords if kw in post_text]
    return ' | '.join(matches), int(bool(matches))

df_labeled['keyword_match'], df_labeled['help_flag'] = zip(*df_labeled['post_text'].map(detect_keywords))

In [103]:
# Summary metrics
total_posts = len(df_labeled)
total_flagged = df_labeled['help_flag'].sum()
total_truth = df_labeled['help_truth'].sum()
correct_matches = (df_labeled['help_flag'] == df_labeled['help_truth']).sum()
accuracy = correct_matches / total_posts
false_positives = ((df_labeled['help_flag'] == 1) & (df_labeled['help_truth'] == 0)).sum()
false_negatives = ((df_labeled['help_flag'] == 0) & (df_labeled['help_truth'] == 1)).sum()

print(f"Keyword Match Results (keywords: {', '.join(keywords)})")
print(f"Total posts reviewed:             {total_posts}")
print(f"Posts flagged by keyword:         {total_flagged}")
print(f"Help-seeking posts (ground truth):{total_truth}")
print(f"Correctly classified posts:       {correct_matches}")
print(f"False positives (flagged, not truth): {false_positives}")
print(f"False negatives (missed, were truth): {false_negatives}")
print(f"Accuracy:                         {accuracy:.2f}")

Keyword Match Results (keywords: ?)
Total posts reviewed:             83
Posts flagged by keyword:         54
Help-seeking posts (ground truth):62
Correctly classified posts:       73
False positives (flagged, not truth): 1
False negatives (missed, were truth): 9
Accuracy:                         0.88
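
Accuracy alone can hide how the errors are distributed, especially when most posts in the sample are help-seeking. The sketch below derives precision, recall and F1 from the counts printed above; the function name `classification_metrics` is introduced here for illustration and is not part of the notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        'accuracy': (tp + tn) / total,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

# Counts taken from the printed results above
total, flagged, truth, fp, fn = 83, 54, 62, 1, 9
tp = flagged - fp            # flagged posts that really are help-seeking
tn = total - tp - fp - fn    # posts correctly left unflagged

metrics = classification_metrics(tp, fp, fn, tn)
majority_baseline = truth / total  # accuracy of labeling every post as help-seeking

for name, value in metrics.items():
    print(f"{name:<10}{value:.2f}")
print(f"{'baseline':<10}{majority_baseline:.2f}")
```

With these counts the `?` rule is very precise (about 0.98) but misses some help-seeking posts (recall about 0.85). Labeling every post as help-seeking would already reach about 0.75 accuracy on this sample, so the 0.88 figure is best read against that baseline.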


### Wrap the keyword test in a function, then test whether any sentence **ends** with a `?`

In [124]:
# test_keyword_match.py

def test_keywords(df, keywords):
    def detect_keywords(post_text):
        matches = [kw for kw in keywords if kw in post_text]
        return ' | '.join(matches), int(bool(matches))

    df = df.copy()
    df['keyword_match'], df['help_flag'] = zip(*df['post_text'].map(detect_keywords))
    accuracy = (df['help_flag'] == df['help_truth']).mean()
    return accuracy


# Example usage:
# keywords = ['?', 'help', 'advice', 'anyone', 'how do i', 'need', 'should i', 'what to do']
# accuracy = test_keywords(df_labeled, keywords)
# print(f"Accuracy: {accuracy:.2f}")

In [128]:
keywords = ['how']
accuracy = test_keywords(df_labeled, keywords)
print(f"Accuracy: {accuracy:.2f}")

keywords = ['?']
accuracy = test_keywords(df_labeled, keywords)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.46
Accuracy: 0.88
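
`test_keywords` uses plain substring matching (`kw in post_text`), so a keyword like `how` also matches inside "show" or "however". That may be one contributor to noisy results for word keywords. Below is a minimal sketch of whole-word matching; `build_keyword_pattern` and `matches_keyword` are hypothetical helper names, not functions from this notebook.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one regex that matches any keyword.

    Keywords that start and end with a letter or digit are matched as
    whole words or phrases; punctuation keywords such as '?' match anywhere.
    """
    parts = []
    for kw in keywords:
        escaped = re.escape(kw)
        if kw[0].isalnum() and kw[-1].isalnum():
            parts.append(rf'(?<!\w){escaped}(?!\w)')
        else:
            parts.append(escaped)
    return re.compile('|'.join(parts))

def matches_keyword(post_text, pattern):
    """Return 1 if the pattern occurs in the text, otherwise 0."""
    return int(bool(pattern.search(post_text)))

pattern = build_keyword_pattern(['how', 'how do i', '?'])
print(matches_keyword('show me the results', pattern))   # 'how' inside 'show' is ignored
print(matches_keyword('how do i start', pattern))
print(matches_keyword('is this ok?', pattern))
```

The returned flag could be used in place of `help_flag` in `test_keywords` to compare both matching strategies on the same labeled sample.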


In [127]:
keywords = ['?']
accuracy = test_keywords(df_labeled, keywords)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.88


In [123]:
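import nltk
from nltk.tokenize import sent_tokenize

# Tokenizer models used by sent_tokenize
# (newer NLTK releases may require 'punkt_tab' instead)
nltk.download('punkt')

# Updated keyword detector: sentence ends with '?'
def ends_with_question(post_text):
    # Non-string values (e.g. NaN) cannot contain a question
    if not isinstance(post_text, str):
        return False
    sentences = sent_tokenize(post_text)
    return any(s.strip().endswith('?') for s in sentences)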

# Apply to labeled dataset
df_labeled['help_flag'] = df_labeled['post_text'].apply(ends_with_question)
df_labeled['help_flag'] = df_labeled['help_flag'].astype(int)
df_labeled['keyword_match'] = df_labeled['help_flag'].map(lambda x: '?' if x else '')

# Metrics
total_posts = len(df_labeled)
total_flagged = df_labeled['help_flag'].sum()
total_truth = df_labeled['help_truth'].sum()
correct_matches = (df_labeled['help_flag'] == df_labeled['help_truth']).sum()
accuracy = correct_matches / total_posts
false_positives = ((df_labeled['help_flag'] == 1) & (df_labeled['help_truth'] == 0)).sum()
false_negatives = ((df_labeled['help_flag'] == 0) & (df_labeled['help_truth'] == 1)).sum()

print("Keyword Match Results (sentence ends with '?'):")
print(f"Total posts reviewed:             {total_posts}")
print(f"Posts flagged by keyword:         {total_flagged}")
print(f"Help-seeking posts (ground truth):{total_truth}")
print(f"Correctly classified posts:       {correct_matches}")
print(f"False positives (flagged, not truth): {false_positives}")
print(f"False negatives (missed, were truth): {false_negatives}")
print(f"Accuracy:                         {accuracy:.2f}")

Keyword Match Results (sentence ends with '?'):
Total posts reviewed:             83
Posts flagged by keyword:         54
Help-seeking posts (ground truth):62
Correctly classified posts:       73
False positives (flagged, not truth): 1
False negatives (missed, were truth): 9
Accuracy:                         0.88


[nltk_data] Downloading package punkt to /Users/buddy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### inspect false positives
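
The cell below calls `highlight_snippet`, which is not defined anywhere in this notebook export. The following is a hypothetical stand-in, not the author's helper: it assumes the intent is to show roughly 100 characters around the first `?`, highlighted with `<mark>`, and to fall back to a truncated prefix when there is no match.

```python
import html

def highlight_snippet(text, keyword='?', width=100):
    """Assumed stand-in: return an HTML snippet of `text` around `keyword`."""
    idx = text.find(keyword)
    if idx == -1:
        # No match: show the start of the text, with an ellipsis if truncated
        suffix = '...' if len(text) > width else ''
        return html.escape(text[:width]) + suffix
    # Match: centre a window of `width` characters on the keyword
    start = max(0, idx - width // 2)
    end = start + width
    before = html.escape(text[start:idx])
    after = html.escape(text[idx + len(keyword):end])
    return f"{before}<mark>{html.escape(keyword)}</mark>{after}"

print(highlight_snippet('is it ok? yes'))
print(highlight_snippet('no question here'))
```

The text is HTML-escaped because the tables are rendered with `escape=False`, which would otherwise pass raw post content straight into the page.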

In [104]:
# filename: step3_false_positives.py

# Apply highlight and filter false positives
df_display = df_labeled.copy()
df_display['title'] = df_display['title'].fillna('').apply(highlight_snippet)
df_display['selftext'] = df_display['selftext'].fillna('').apply(highlight_snippet)

fp_df = df_display[(df_labeled['help_flag'] == 1) & (df_labeled['help_truth'] == 0)]

def render_table(df, title):
    html = df[['post_id', 'title', 'selftext', 'help_truth', 'help_flag', 'keyword_match']].to_html(index=False, escape=False)
    display(HTML(f"""
    <h4>{title}</h4>
    <div style="max-height:500px; overflow:auto; border:1px solid #ccc; padding:10px; font-family:monospace; font-size:12px">
    {html}
    </div>
    """))

render_table(fp_df, "False Positives (Flagged, but not Help-Seeking)")

post_id,title,selftext,help_truth,help_flag,keyword_match
1lzuqo9,D316/D317 IT Foundations/Applications 1101/1102 (Completed) in 2 months,they just purchased. What should you do **FIRST**?\n\nA. Install the new RAM and power on the system,0,1,?


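**Observation:**  
The false positive above was triggered by a `?` that is not a request for help. In the snippet shown, the `?` belongs to a quoted practice question (*"What should you do **FIRST**?"*); a `?` can also appear in a **URL** query string.

We should **remove URLs** during preprocessing to avoid misleading keyword matches. Quoted questions will need separate handling.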
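
A minimal sketch of URL removal, assuming a simple regular expression is good enough for Reddit post text. `URL_PATTERN` and `strip_urls` are illustrative names, and the pattern is deliberately loose rather than a full URL parser.

```python
import re

# Matches http(s) links and bare www. links up to the next whitespace
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def strip_urls(text):
    """Replace URLs with a space, then collapse repeated whitespace."""
    without_urls = URL_PATTERN.sub(' ', text)
    return re.sub(r'\s+', ' ', without_urls).strip()

sample = 'view poll https://www.reddit.com/poll/abc?utm_source=share thanks'
print(strip_urls(sample))
```

This step would fit in `combine_and_clean_text`, before keyword matching. One caveat: punctuation attached directly to the end of a URL is removed along with it.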

### inspect false negatives

In [105]:
# filename: step3_false_negatives.py

# Apply highlight and filter false negatives
fn_df = df_display[(df_labeled['help_flag'] == 0) & (df_labeled['help_truth'] == 1)]

render_table(fn_df, "False Negatives (Missed, but Help-Seeking)")

post_id,title,selftext,help_truth,help_flag,keyword_match
1lzxbgh,FASFA Issues,"I Graduated 2022 High school diploma, and wanted to take a break for a semester or two. But my famil...",1,0,
1lzvqd9,Anyone have an update on the Notarized Fasfa Verification.,My documents were uploaded and expidited on friday I haven't received any update since. Just wonderi...,1,0,
1lzq8vi,Capstone “write up and summary product is missing” please help,Please help. I'm on an extension and needing to wrap up my capstone. It has been rejected with this ...,1,0,
1lzpqre,Statement of Purpose,"Hello, I wanted to make a thread for the people who had to fill out and submit the statement of purp...",1,0,
1lzohxm,ITIL4 room requirements,"Hey,\n\nCan a mirror be in the bathroom when I take my exam. That is the only room in my house with 1 ...",1,0,
1lzvmzt,Goreact,I am struggling trying figure out how to use The goreact recording system. Help me,1,0,
1lzpgag,Capstone Trouble - I don't understand what I'm doing wrong,Please help. I'm on an extension and needing to wrap up my capstone. It has been rejected with this ...,1,0,
1m03meb,D280 Javascript Programming,I just can't seem to get the map to load / become interactive. Desperately need help,1,0,
1m04rem,Failed OA,Okay so I’m not understanding this is my 3rd retake for the Learners and learning science class I un...,1,0,


**Observation:** 
The false negatives contain obvious keywords such as "help". We will analyze them methodically to update our keyword list.
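
One way to make that analysis systematic is to count how many missed posts each candidate keyword would recover. The sketch below is an assumption about how this could be done, with `keyword_coverage` as a hypothetical helper; the two sample texts are bodies from the false-negative table above.

```python
import re
from collections import Counter

def keyword_coverage(texts, candidates):
    """Count, per candidate keyword, how many texts contain it as a whole word."""
    coverage = Counter({kw: 0 for kw in candidates})
    for text in texts:
        lowered = text.lower()
        for kw in candidates:
            if re.search(rf'(?<!\w){re.escape(kw)}(?!\w)', lowered):
                coverage[kw] += 1
    return coverage

missed_posts = [
    "I am struggling trying figure out how to use The goreact recording system. Help me",
    "I just can't seem to get the map to load / become interactive. Desperately need help",
]
coverage = keyword_coverage(missed_posts, ['help', 'need', 'advice'])
print(coverage.most_common())
```

Running the same count over the posts that are not help-seeking would show how many new false positives each keyword introduces.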

___

## Step 4: NLP Phrase Discovery

Filter the dataset by `?` and look for emerging patterns. Continued in `Help Seeking Step 4.ipynb`.

The `?` baseline was 88% accurate at identifying help-seeking posts in the 83-post labeled sample.

In [107]:


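from nltk import FreqDist
from nltk.tokenize import word_tokenize

# Filter all posts containing a question mark
# (use df_clean: the filtered df has no 'post_text' column)
df_question_posts = df_clean[df_clean['post_text'].str.contains(r'\?', na=False)]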

# Combine into one text blob
all_text = ' '.join(df_question_posts['post_text'].dropna().tolist())

# Tokenize
tokens = word_tokenize(all_text)

# Unigrams
fdist_unigram = FreqDist(tokens)
print("Top unigrams:")
print(fdist_unigram.most_common(20))

Top unigrams:
[('i', 247), ('.', 168), ('to', 139), ('and', 125), ('the', 117), ('a', 99), ('?', 95), (',', 95), ('in', 88), ('my', 83), ('of', 71), ('for', 69), ('’', 66), ('│', 60), ('is', 55), ('this', 51), ('have', 50), ('it', 47), (')', 40), ('*', 40)]


**Observation:**  
Top unigrams are dominated by stopwords and punctuation.

**Plan:**  
- *Temporarily* use NLTK's stopword list to filter unigrams for clarity.
- Retain stopwords for bigram/trigram discovery — since many help-seeking phrases rely on them (e.g., *"how do I"*).
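
To illustrate why stopwords are kept for n-gram discovery, the sketch below builds trigrams with and without stopword filtering. It is standard-library only: `ngrams` is a small stand-in for `nltk.util.trigrams`, and `SAMPLE_STOPWORDS` is a hand-picked subset assumed for the example, not the full NLTK list.

```python
def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Small illustrative subset; the notebook uses NLTK's English stopword list
SAMPLE_STOPWORDS = {'how', 'do', 'i', 'to', 'the', 'my'}

tokens = 'how do i submit my capstone'.split()

trigrams_with_stopwords = ngrams(tokens, 3)
trigrams_without_stopwords = ngrams(
    [t for t in tokens if t not in SAMPLE_STOPWORDS], 3
)

print(trigrams_with_stopwords)
print(trigrams_without_stopwords)
```

With stopwords kept, the phrase `('how', 'do', 'i')` survives as a trigram. After filtering, only "submit" and "capstone" remain, so no trigram can be formed at all.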

In [108]:
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/buddy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [109]:
# Define stopwords and punctuation set
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation) # '?' removal ok - already a keyword

# Filter tokens
filtered_tokens = [t for t in tokens if t.lower() not in stop_words and t not in punctuation and len(t) > 1]

# Filtered unigram frequency
fdist_unigram_filtered = FreqDist(filtered_tokens)
print("Top unigrams (stopwords & punctuation removed):")
print(fdist_unigram_filtered.most_common(20))

Top unigrams (stopwords & punctuation removed):
[('├──', 36), ('accounting', 25), ('get', 24), ('classes', 24), ('degree', 22), ("'m", 20), ('take', 20), ('course', 17), ('one', 17), ('wgu', 15), ('time', 15), ('anyone', 15), ('business', 15), ('class', 13), ('school', 13), ('work', 13), ('trying', 12), ('first', 12), ('like', 11), ('test', 11)]


### Import the individual unigrams for each Catalog section and combine
(see notebooks/Catalog Stop Words.ipynb for methods)

**Observation:**  
After removing NLTK stopwords, the top unigrams include domain-specific terms that are frequent but do not indicate help-seeking intent.

**Domain-specific stopwords identified:**
- `accounting`
- `classes`
- `degree`
- `course`
- `wgu`
- `business`
- `class`
- `school`
- `test`
  
**Plan:**  
Introduce a custom stopword list to filter these out and improve signal.

*Note: the custom stopword list derived from the 2025_06 catalog is developed in `notebooks/Catalog Stop Words.ipynb`.*

In [110]:
custom_stopwords = {
    'wgu', 'class', 'classes', 'course', 'degree',
    'school', 'business', 'accounting', 'test'
}
custom_stopwords = stop_words.union(custom_stopwords)

# Re-filter tokens
filtered_tokens_custom = [
    t for t in tokens
    if t.lower() not in custom_stopwords and t not in punctuation and len(t) > 1
]

# Updated unigram frequency with domain stopwords removed
fdist_unigram_custom = FreqDist(filtered_tokens_custom)
print("Top unigrams (NLTK + domain stopwords removed):")
print(fdist_unigram_custom.most_common(50))

Top unigrams (NLTK + domain stopwords removed):
[('├──', 36), ('get', 24), ("'m", 20), ('take', 20), ('one', 17), ('time', 15), ('anyone', 15), ('work', 13), ('trying', 12), ('first', 12), ('like', 11), ("'s", 11), ("n't", 10), ('need', 10), ('advice', 10), ('capstone', 10), ('credits', 10), ('getting', 9), ('term', 9), ('transfer', 9), ('oa', 9), ('finish', 9), ('help', 9), ('questions', 8), ('complete', 8), ('know', 8), ('second', 8), ('foundations', 8), ('management', 8), ('└──', 8), ('study', 8), ('intermediate', 8), ('go', 7), ('really', 7), ('back', 7), ('tips', 7), ('new', 7), ('looking', 7), ('start', 7), ('taken', 7), ('applications', 7), ('im', 7), ('also', 7), ('long', 7), ('taking', 7), ('thanks', 7), ('project', 7), ('hello', 7), ('times', 7), ('every', 7)]
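
The output above still contains non-word tokens such as `├──` and contraction fragments such as `'m` and `n't`, because the filter only rejects single punctuation characters. A minimal sketch of a stricter filter follows, assuming that keeping only purely alphabetic tokens is acceptable. `keep_word_tokens` is a hypothetical name.

```python
def keep_word_tokens(tokens, stopwords=frozenset()):
    """Keep alphabetic tokens that are not in the stopword set."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stopwords]

# Tokens copied from the unigram output above
sample_tokens = ['├──', 'get', "'m", 'take', "n't", 'need', 'advice', '└──', 'help']
print(keep_word_tokens(sample_tokens))
print(keep_word_tokens(sample_tokens, stopwords={'get', 'take'}))
```

The trade-off is that `str.isalpha()` also drops course codes like `c877` and hyphenated words like `write-up`, so it suits unigram review better than course-level analysis.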


In [111]:
# filename: step4_bigrams_filtered.py

from nltk.util import bigrams

# Extend custom stopwords
# combined_keywords.py

domain_stopwords = [
    # Institutional and program terms
    'academic', 'accounting', 'advanced', 'course', 'cus', "cu's", 'data', 'degree',
    'doctorate', 'education', 'engineering', 'faculty', 'financial', 'foundations',
    'governors', 'graduate', 'health', 'healthcare', 'information', 'learn',
    'learners', 'learning', 'management', 'marketing', 'master', 'methods', 'mgmt',
    'nurs', 'nursing', 'payment', 'phd', 'practice', 'program', 'project',
    'requirements', 'science', 'security', 'skills', 'software', 'student',
    'students', 'teaching', 'term', 'tuition', 'university', 'western', 'wgu',
    # Terms carried over from the unigram review above
    'class', 'classes', 'school', 'business', 'test', 'capstone', 'credits',
    'transfer', 'oa', 'study', 'applications', 'times'
]
custom_stopwords = stop_words.union(domain_stopwords)

# Re-filter tokens
filtered_tokens_custom = [
    t for t in tokens
    if t.lower() not in custom_stopwords and t not in punctuation and len(t) > 1
]

# Generate and count bigrams
bigram_tokens = list(bigrams(filtered_tokens_custom))
fdist_bigrams_custom = FreqDist(bigram_tokens)

print("Top bigrams (NLTK + domain stopwords removed):")
print(fdist_bigrams_custom.most_common(50))

Top bigrams (NLTK + domain stopwords removed):
[(('bachelor', "'s"), 5), (('feel', 'like'), 4), (('markdown', 'cell'), 4), (('health', 'human'), 3), (('human', 'services'), 3), (('commit', 'start'), 3), (('practice', 'tests'), 3), (('write', 'summary'), 3), (('summary', 'product'), 3), (('please', 'help'), 3), (('transferred', 'gpas'), 3), (('need', 'get'), 3), (('supply', 'chain'), 3), (('-discrete', 'math'), 3), (('discrete', 'math'), 3), (('student', 'teaching'), 3), (('full', 'time'), 3), (("'m", 'trying'), 2), (('easier', 'c877'), 2), (('c877', 'c883'), 2), (('want', 'take'), 2), (('view', 'poll'), 2), (('poll', 'https'), 2), (('bit', 'time'), 2), (('long', 'take'), 2), (('hoping', 'complete'), 2), (('one', 'semester'), 2), (('traditional', 'college'), 2), (('people', 'look'), 2), (('much', 'time'), 2), (('tests', 'taken'), 2), (('taken', 'one'), 2), (('anyone', 'taken'), 2), (('transcript', 'evaluation'), 2), (('trying', 'get'), 2), (('take', 'break'), 2), (('semester', 'failed')

In [112]:
# filename: step4_trigrams_filtered.py

from nltk.util import trigrams

# Generate and count trigrams
trigram_tokens = list(trigrams(filtered_tokens_custom))
fdist_trigrams_custom = FreqDist(trigram_tokens)

print("Top trigrams (NLTK + domain stopwords removed):")
print(fdist_trigrams_custom.most_common(50))

Top trigrams (NLTK + domain stopwords removed):
[(('health', 'human', 'services'), 3), (('write', 'summary', 'product'), 3), (('easier', 'c877', 'c883'), 2), (('view', 'poll', 'https'), 2), (('please', 'help', "'m"), 2), (('help', "'m", 'extension'), 2), (("'m", 'extension', 'needing'), 2), (('extension', 'needing', 'wrap'), 2), (('needing', 'wrap', 'rejected'), 2), (('wrap', 'rejected', 'exact'), 2), (('rejected', 'exact', 'message'), 2), (('exact', 'message', 'twice'), 2), (('message', 'twice', 'included'), 2), (('twice', 'included', '``'), 2), (('included', '``', 'write'), 2), (('``', 'write', 'summary'), 2), (('summary', 'product', "''"), 2), (('product', "''", 'made'), 2), (("''", 'made', 'jupyter'), 2), (('made', 'jupyter', 'notebook'), 2), (('jupyter', 'notebook', 'first'), 2), (('notebook', 'first', 'submission'), 2), (('first', 'submission', 'every'), 2), (('submission', 'every', 'piece'), 2), (('every', 'piece', 'write-up'), 2), (('piece', 'write-up', 'markdown'), 2), (('writ
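
The trigram list is dominated by a long chain of overlapping trigrams that each occur twice, which is the pattern repeated text produces. The two capstone posts in the false-negative table appear to share the same body, and that is likely the cause. Joining all posts into one string also lets n-grams span post boundaries. The sketch below addresses both by counting n-grams per post over de-duplicated text. It is standard-library only, and `count_ngrams_per_post` is a hypothetical helper that splits on whitespace rather than using `word_tokenize`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield consecutive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams_per_post(texts, n=3):
    """Count n-grams post by post, skipping exact duplicate texts."""
    unique_texts = list(dict.fromkeys(texts))  # preserves first-seen order
    counts = Counter()
    for text in unique_texts:
        counts.update(ngrams(text.split(), n))
    return counts

bodies = [
    'please help with capstone',
    'please help with capstone',   # same body posted twice
    'need advice on d427',
]
counts = count_ngrams_per_post(bodies)
print(counts.most_common(3))
```

Here the duplicated body contributes each trigram once, and no trigram bridges the end of one post and the start of the next.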

In [113]:
from nltk.text import Text

# Focus on help-seeking posts
help_posts = df_labeled[df_labeled['help_truth'] == 1]['post_text'].dropna().tolist()
all_text = ' '.join(help_posts)

# Tokenize
tokens = word_tokenize(all_text)

# Unigrams
fdist_unigram = FreqDist(tokens)
print("Top unigrams:")
print(fdist_unigram.most_common(20))

# Trigrams
trigram_tokens = list(trigrams(tokens))
fdist_trigram = FreqDist(trigram_tokens)
print("\nTop trigrams:")
print(fdist_trigram.most_common(20))

# Common contexts for "help", "stuck", etc.
text_obj = Text(tokens)
print("\nContexts for 'help':")
text_obj.common_contexts(['help'])

print("\nContexts for 'stuck':")
text_obj.common_contexts(['stuck'])

## Step 5: Refine & Expand