# sudregex Tutorial   

In this tutorial, we walk through the process of using the `sudregex` package.
The goal is to illustrate the basic components and concepts of `sudregex` in a simple way.

In [1]:
# Import the necessary packages
import pandas as pd   
import sudregex as sud
import warnings
warnings.filterwarnings('ignore')

# 1. Loading Data 

Load clinical notes

In [None]:
# Directory containing all the files needed for the tutorial
dir = 'your_directory_here'  # Change this to your directory (note: this name shadows Python's built-in dir())
source_dirs = {
    'tutorial_data': dir + '/input_data.txt',
    'checklist_file': dir + '/abc_regex_checklist.py',
    'termlist_file': dir + '/termslist.py',
    'output_data': dir + '/output_data.txt'
}

In [None]:
# Access the file paths from the dictionary
tutorial_data = source_dirs['tutorial_data']
checklist_file = source_dirs['checklist_file']
termlist_file = source_dirs['termlist_file']
output_data = source_dirs['output_data']
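
As an alternative to string concatenation, the same paths could be built with `pathlib`, which also makes it easy to check that the input files exist before loading them. This is a minimal sketch; `base_dir`, `file_names`, `source_paths`, and `missing` are names introduced here for illustration and are not part of `sudregex`.

```python
from pathlib import Path

# Placeholder directory, as in the cell above; change this to your directory
base_dir = Path("your_directory_here")

file_names = {
    "tutorial_data": "input_data.txt",
    "checklist_file": "abc_regex_checklist.py",
    "termlist_file": "termslist.py",
    "output_data": "output_data.txt",
}

# Join each file name onto the base directory
source_paths = {key: base_dir / name for key, name in file_names.items()}

# Input files that are not on disk yet
# (output_data is assumed to be written later, so it is skipped)
missing = [
    key for key, path in source_paths.items()
    if key != "output_data" and not path.exists()
]
print("Missing input files:", missing)
```

Passing `str(source_paths["tutorial_data"])` gives the same kind of path string used in the rest of the tutorial.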

In [None]:
# Use pandas to read in the clinical notes; fields are separated by " !^! "
df_notes = pd.read_csv(tutorial_data, sep=r' !\^! ', header=None, engine='python')
df_notes.head(3)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | R3000000025 | 800000000000025 | "Patient smokes half a pack of cigarettes daily". |
| 1 | R3000000026 | 800000000000026 | "Denies any current use of illicit substances." |
| 2 | R3000000027 | 800000000000027 | "Discharge instructions include avoiding alcoh... |
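
The `sep` argument in the `read_csv` call is a regular expression, which is why the caret is escaped and why `engine='python'` is needed. The sketch below uses only the standard `re` module to show how that pattern splits one raw line into three fields; the sample line is made up for illustration.

```python
import re

# Same delimiter pattern as in the read_csv call: space, "!^!", space.
# The caret is escaped because an unescaped ^ is a regex anchor.
SEP = re.compile(r" !\^! ")

# Hypothetical raw line, invented for this example
raw_line = "R0000000001 !^! 100000000000001 !^! Patient denies tobacco use."

fields = SEP.split(raw_line)
print(fields)
```

Each line of the input file is expected to yield exactly three fields, which become the `patient_id`, `note_id`, and `note_text` columns.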


In [16]:
# Renaming the columns for better readability
df_notes.columns = ["patient_id", "note_id", "note_text"]
df_notes.head(3)

|   | patient_id | note_id | note_text |
|---|---|---|---|
| 0 | R3000000025 | 800000000000025 | "Patient smokes half a pack of cigarettes daily". |
| 1 | R3000000026 | 800000000000026 | "Denies any current use of illicit substances." |
| 2 | R3000000027 | 800000000000027 | "Discharge instructions include avoiding alcoh... |


## The Core Function of `sudregex`: `extract_df`

The heart of the `sudregex` package is the function **`extract_df()`**, which applies predefined
regular expression (regex) rules to clinical note text.

### Arguments to specify 

**1) Checklist**
- Think of this as a **rulebook of keywords/patterns** to flag in text.
- By default, `sudregex` provides the **ABC (Addiction Behavior Checklist)**, which includes regex
  rules to identify addiction-related patterns (e.g., “substance craving,” “missed appointments,”
  “uncontrolled use”).
- You can also **load your own checklist** to adapt the package to your research context.

**2) Termslist**
- A **dictionary of substance-related keywords and phrases**, grouped into categories.
- Default term groups include:
  - **Alcohol**
  - **Opioids**
  - **Chronic Pain–related terms**
- Each group can be **activated independently** (e.g., only search for opioids) or **combined**,
  depending on your analysis needs.
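
To make the idea of activating groups concrete, here is a small sketch of how a comma-separated group string (such as `"alcohol_terms,opioid_terms"`) could be resolved against a termslist dictionary. `select_terms` is a hypothetical helper written for this tutorial, not a `sudregex` function, and the terms in the toy dictionary are only a small illustrative sample.

```python
# Toy termslist using the default group names; the terms are illustrative only
termslist = {
    "alcohol_terms": ["alcohol", "etoh"],
    "opioid_terms": ["opioid", "oxycodone"],
    "chronic_pain_terms": ["chronic pain"],
}

def select_terms(termslist, terms_active):
    """Hypothetical helper: merge the groups named in a comma-separated string."""
    selected = []
    for group in terms_active.split(","):
        group = group.strip()
        if group not in termslist:
            raise KeyError(f"Unknown term group: {group!r}")
        selected.extend(termslist[group])
    return selected

print(select_terms(termslist, "opioid_terms"))
print(select_terms(termslist, "alcohol_terms, opioid_terms"))
```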

### ⚙️ How It All Fits Together

When you pass a DataFrame of clinical notes to `extract_df`:
- The **termslist** locates mentions of specific substances (e.g., “opioids”), so that pattern matches
  are specific to the substance of interest.
- The **checklist** applies higher-level rules that capture **behavioral or contextual signals** of
  substance use (e.g., “denies use,” “positive screen,” “discharge instructions”).
- The result is a **structured DataFrame** with new **indicator columns** that flag these matches,
  ready for **downstream analysis** (e.g., prevalence estimates, modeling features).
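
The interplay between the two inputs can be sketched with plain `re`. The code below is not how `sudregex` is implemented; it is a simplified illustration under stated assumptions: one invented checklist rule, a three-term group, a 40-character window, and a made-up `_near_term` suffix that is not the package's column naming.

```python
import re

# Toy inputs, invented for this illustration
opioid_terms = ["opioid", "oxycodone", "morphine"]
checklist = {"tolerance": r"\btoleran(?:ce|t)\b"}

def flag_note(text, window=40):
    """Return indicator values for one note using the toy rules above."""
    row = {}
    term_pat = re.compile("|".join(re.escape(t) for t in opioid_terms), re.I)
    for item, pattern in checklist.items():
        match = re.search(pattern, text, re.I)
        row[item] = int(match is not None)
        # Look for a substance term within `window` characters of the match
        nearby = ""
        if match:
            nearby = text[max(0, match.start() - window): match.end() + window]
        row[item + "_near_term"] = int(bool(term_pat.search(nearby)))
    return row

print(flag_note("Patient reports tolerance to oxycodone."))
print(flag_note("Poor exercise tolerance noted."))
```

The second note matches the checklist rule but mentions no opioid term, which is the kind of match a termslist helps to separate out.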


In [None]:
result = sud.extract_df(
    df_notes,
    checklist = checklist_file,     # path to the checklist file
    termslist = termlist_file,      # path to the termslist file
    terms_active = "opioid_terms",  # term group to activate
    note_column = "note_text",      # name of the column with the text to be processed
    debug = False,                  # set to True to see debug output
)
result.head(3)

Unnamed: 0,note_id,illicit_drugs,illicit_drugs_SUBSTANCE_MATCHED,illicit_drugs_SUBSTANCE_MATCHED_NEG,problem_drinking,problem_drinking_NEG,dui,dui_NEG,hoarding,hoarding_SUBSTANCE_MATCHED,...,lack_interest_rehab_SUBSTANCE_MATCHED,minimal_relief_x,minimal_relief_x_SUBSTANCE_MATCHED,tolerance,tolerance_SUBSTANCE_MATCHED,tolerance_SUBSTANCE_MATCHED_NEG,med_agreement,SO_concern,SO_concern_SUBSTANCE_MATCHED,SO_concern_SUBSTANCE_MATCHED_NEG
0,800000000000025,0,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,800000000000026,0,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,800000000000027,0,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Disambiguation & Context Filters -- Optional

Even well-written regex rules catch **spurious matches** (e.g., “family history” vs. a true history of use)
and **template text** (e.g., discharge instructions) that can distort your counts. `sudregex` includes helpers to
reduce these errors.

### 1) Filter Common False Positives
Use `check_common_false_positives` to *zero out* flags when a term appears in a known **non-informative phrase**.


In [5]:
# Check for common false positives in your data
# Arguments: pat, df_searched, col_name_fp, common_fp, span=20
import re

df = pd.DataFrame({
    "note_id": [1],
    "note_text": ["family history of asthma"],
    "history": [1],
})

df_out = sud.check_common_false_positives(
    re.compile("history"),  # pat: pattern that produced the flag
    df.copy(),              # df_searched: DataFrame to check
    "history",              # col_name_fp: flag column to correct
    ["family"],             # common_fp: phrases that mark a false positive
)
# 'family history' should be filtered, so history → 0

df_out

|   | note_id | note_text | history |
|---|---|---|---|
| 0 | 1 | family history of asthma | 0 |
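
The general idea behind a window-based false-positive check can be sketched as follows. This is an assumption about the approach, not the package's actual logic: the `keep_flag` helper is hypothetical, and it assumes that `span` means a number of characters on each side of a match and that a flag survives if at least one match has no false-positive phrase nearby.

```python
import re

def keep_flag(pat, text, phrases, span=20):
    """Hypothetical helper: True if some match of `pat` has none of `phrases`
    within `span` characters on either side."""
    for m in pat.finditer(text):
        window = text[max(0, m.start() - span): m.end() + span].lower()
        if not any(p.lower() in window for p in phrases):
            return True  # at least one match is free of false-positive context
    return False

pat = re.compile("history")
print(keep_flag(pat, "family history of asthma", ["family"]))
print(keep_flag(pat, "history of opioid use", ["family"]))
```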


### 2) Remove “Discharge Instructions” Context (Optional)

Use `discharge_instructions` to clear flags within discharge sections, which are often templated or repeated.

In [None]:

df = pd.DataFrame({
    "note_id": [1, 2],
    "note_text": ["Discharge instructions: avoid alcohol. foo", "No discharge section. foo"],
    "foo": [1, 1]   # example flag from earlier regex
})

# Provide the same pattern you previously used to set `foo`
out2 = sud.discharge_instructions(
    re.compile(r"foo", flags=re.I),
    df.copy(),
    "foo",
)

# Expect: note_id 1 -> foo=0 (discharge context), note_id 2 -> foo=1 (kept)
out2

|   | note_id | note_text | foo |
|---|---|---|---|
| 0 | 1 | Discharge instructions: avoid alcohol. foo | 0 |
| 1 | 2 | No discharge section. foo | 1 |


## ✏️ Customizing Checklists and Termslists (Notebook-Friendly)

One strength of `sudregex` is that you don’t have to rely only on the defaults —
you can **append new terms and rules** inside your notebook.  
This lets you experiment without permanently editing the package files.



In [None]:
import sudregex as sud

# Load the termslist (module or dict depending on your build)
raw_terms = sud.default_termslist

# If it's a module, collect its public list/tuple attributes into a dict;
# if it's already a dict, copy it. Each group is copied into a new list so
# that the edits below do not modify the packaged defaults.
if hasattr(raw_terms, "__dict__") and not isinstance(raw_terms, dict):
    # vars() is used instead of dir() because `dir` was reassigned to a path string earlier
    termslist = {
        k: list(v)
        for k, v in vars(raw_terms).items()
        if not k.startswith("_") and isinstance(v, (list, tuple))
    }
else:
    termslist = {k: list(v) for k, v in raw_terms.items()}

print("Available groups:", sorted(termslist.keys()))


Available groups: ['alcohol_terms', 'chronic_pain_terms', 'opioid_terms']


In [8]:

# Add new terms to an existing group
termslist['opioid_terms'].extend(['new added', 'new addded 2'])

# Confirm the update
print("Updated opioid terms:", termslist['opioid_terms'])


Updated opioid terms: ['pain med', 'opioid', 'opiod', 'narc', 'analges', 'suboxone', 'Avinza', 'codeine', 'dilaudid', 'fentanyl', 'hydrocodone', 'morphine', 'opana', 'opiate', 'oxycodone', 'oxycontin', 'oxymorphone', 'percocet', 'roxicodone', 'sufentanyl', 'vicodin', 'lortab', 'hydromorphone', 'abstral', 'actiq', 'alfentanil', 'arymo', 'ascomp', 'astramorph', 'avinza', 'belbuca', 'brompheniramine', 'bunavail', 'buprenex', 'buprenorphine', 'butalbital', 'butorphanol', 'butrans', 'capcof', 'carisoprodol', 'cheratussin', 'coditussin', 'conzip', 'demerol', 'dexbrompheniramine', 'dihydrocodeine', 'diskets', 'dolophine', 'durmorph', 'embeda', 'endacof', 'endocet', 'exalgo', 'fentora', 'fioricet', 'flowtuss', 'guaifenesin', 'histex', 'hycet', 'hycofenix', 'hydrocodone', 'hydromorphone', 'hysingla', 'ibudone', 'infumorph', 'iophen', 'iorinal', 'kadian', 'lazanda', 'levorphanol', 'lorcet', 'lotruss', 'meperidine', 'methadone', 'methadose', 'morphabond', 'morphine', 'ms contin', 'nalbuphine', 'n
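
The printed list contains some repeated entries (for example `'hydrocodone'` and `'morphine'` appear twice, and `'Avinza'` appears in two capitalizations). Duplicates are presumably harmless for matching, but if you want a tidy list after extending a group, a small helper like the one below could remove them. `dedupe_terms` is a hypothetical helper, not part of `sudregex`, and the sample list is a short excerpt chosen for illustration.

```python
def dedupe_terms(terms):
    """Drop case-insensitive duplicates while keeping the first occurrence and order."""
    seen = set()
    unique = []
    for term in terms:
        key = term.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(term.strip())
    return unique

sample = ["Avinza", "morphine", "hydrocodone", "avinza", "morphine", "new added"]
print(dedupe_terms(sample))
```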

## Running `extract_df` in Parallel (2 Workers)

In [None]:
# Example data
df = pd.DataFrame({
    "note_id":  ["n1", "n2", "n3"],
    "note_text": [
        "Patient was arrested for DUI last year.",
        "Denies alcohol use; discharge instructions reviewed.",
        "History of opioid dependence; not currently using."
    ]
})

# Use packaged checklist + terms
checklist = sud.checklist_abc
termslist = sud.default_termslist

# Run in parallel with 2 workers
res = sud.extract_df(
    df=df,
    checklist=checklist,
    termslist=termslist,
    terms_active="alcohol_terms,opioid_terms",  # comma-separated groups to combine
    parallel=True,                     # enable parallel processing
    n_workers=2,                       # number of workers
    include_note_text=True,            # keep the note text in the output
    exclude_discharge_mentions=False,  # default; False keeps discharge-context hits
)

res.head()


Unnamed: 0,note_id,illicit_drugs,illicit_drugs_SUBSTANCE_MATCHED,illicit_drugs_SUBSTANCE_MATCHED_NEG,problem_drinking,problem_drinking_NEG,dui,dui_NEG,hoarding,hoarding_SUBSTANCE_MATCHED,...,minimal_relief_x,minimal_relief_x_SUBSTANCE_MATCHED,tolerance,tolerance_SUBSTANCE_MATCHED,tolerance_SUBSTANCE_MATCHED_NEG,med_agreement,SO_concern,SO_concern_SUBSTANCE_MATCHED,SO_concern_SUBSTANCE_MATCHED_NEG,note_text
0,n1,0,0,0,0,0,1,1.0,0,0,...,0,0,0,0,0,0,0,0,0,Patient was arrested for DUI last year.
1,n2,0,0,0,0,0,0,,0,0,...,0,0,0,0,0,0,0,0,0,Denies alcohol use; discharge instructions rev...
2,n3,0,0,0,0,0,0,,0,0,...,0,0,0,0,0,0,0,0,0,History of opioid dependence; not currently us...
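
Conceptually, running with several workers means splitting the notes into chunks, processing each chunk independently, and concatenating the results in order. The sketch below illustrates that pattern with the standard library; it says nothing about how `sudregex` parallelizes internally. It uses a thread pool so that it runs as-is in a notebook (a process pool is the more usual choice for CPU-bound regex work), and the `DUI` pattern, `flag_chunk`, and `chunked` are invented for this example.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Toy rule, invented for this illustration
DUI = re.compile(r"\bDUI\b", re.I)

def flag_chunk(notes):
    """Flag each note in one chunk (1 if the toy rule matches, else 0)."""
    return [int(bool(DUI.search(note))) for note in notes]

def chunked(seq, n_chunks):
    """Split a list into at most n_chunks consecutive pieces."""
    size = max(1, -(-len(seq) // n_chunks))  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]

notes = [
    "Patient was arrested for DUI last year.",
    "Denies alcohol use; discharge instructions reviewed.",
    "History of opioid dependence; not currently using.",
]

# pool.map returns results in the same order as the input chunks
with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(flag_chunk, chunked(notes, 2)))

flags = [flag for part in parts for flag in part]
print(flags)
```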



### 📋 Checklist Validation Summary
Use the following code to check that the checklist and termslist work as expected with your data. It:
- ✅ Loads regex patterns from `snapshotchecklist.py`
- 📝 Parses clinical notes from `notes_input.txt`
- 🔍 Validates each note against its expected match
- 📄 Saves:
  - Full results to `package_validation_result.csv`
  - Checklist-item-level results to `package_checklist_validation_by_item.csv`
- 📊 Returns both results as DataFrames for further use


In [None]:
source_dir = "your_directory_here"  # Change this to your directory

detailed, by_item = sud.validation(
    checklist = source_dir + '/snapshotchecklist.py',
    examples = source_dir + '/notes_input.txt',
)

detailed

|   | item_code | expected | note_text | actual_match | mismatch |
|---|---|---|---|---|---|
| 0 | 1a | 1 | Patient reports cocaine use last week. | 1 | 0 |
| 1 | 1a | 0 | Patient reports no recreational substance use. | 0 | 0 |
| 2 | 1a | 0 | Mother has history of depression. | 0 | 0 |
| 3 | 1b | 1 | Problem drinking documented; alcohol abuse noted. | 1 | 0 |
| 4 | 1b | 0 | Patient does not drink alcohol. | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 76 | 19 | 0 | Patient follows the medication agreement witho... | 0 | 0 |
| 77 | 19 | 0 | Insurance agreement issue was noted. | 0 | 0 |
| 78 | 20 | 1 | Patient's wife is concerned about his use of p... | 1 | 0 |
| 79 | 20 | 0 | No relatives expressed alarm about medication ... | 0 | 0 |


In [28]:
by_item

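|   | item_key | n | correct | accuracy |
|---|---|---|---|---|
| 0 | 10 | 3 | 3 | 1.0 |
| 1 | 11a | 3 | 3 | 1.0 |
| 2 | 11b | 3 | 3 | 1.0 |
| 3 | 12a | 3 | 3 | 1.0 |
| 4 | 12b | 3 | 3 | 1.0 |
| 5 | 13 | 3 | 3 | 1.0 |
| 6 | 14 | 3 | 3 | 1.0 |
| 7 | 15 | 3 | 3 | 1.0 |
| 8 | 16 | 3 | 3 | 1.0 |
| 9 | 17 | 3 | 3 | 1.0 |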
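
The per-item summary can be understood as a simple aggregation of the detailed results: a row is correct when `expected` equals `actual_match`, and `accuracy` is `correct / n` for each item. The sketch below reproduces that arithmetic with the standard library on the five detailed rows displayed above for items `1a` and `1b`; it is an illustration of the calculation, not the code that `sud.validation` runs.

```python
from collections import defaultdict

# (item_code, expected, actual_match) for the first five detailed rows shown above
rows = [
    ("1a", 1, 1),
    ("1a", 0, 0),
    ("1a", 0, 0),
    ("1b", 1, 1),
    ("1b", 0, 0),
]

# Count examples and correct predictions per checklist item
counts = defaultdict(lambda: {"n": 0, "correct": 0})
for item, expected, actual in rows:
    counts[item]["n"] += 1
    counts[item]["correct"] += int(expected == actual)

# Accuracy is the share of examples whose actual match equals the expected one
summary = {
    item: {**c, "accuracy": c["correct"] / c["n"]}
    for item, c in counts.items()
}
print(summary)
```

An item with accuracy below 1.0 points to a rule whose pattern either misses an expected match or fires on a note it should not.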
