In [3]:
!pip install bertopic


In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import spacy
from bertopic import BERTopic

In [2]:
# Load datasets
df = pd.read_csv('0_all.csv')

## Preprocess Text Data

Clean text & handle missing values

In [4]:
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # Check if text is NaN (float) or None
    if isinstance(text, float) or text is None:
        return ""

    doc = nlp(text.lower().strip())
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    return " ".join(tokens)

df["Assault_Description_Clean"] = df["Assault Description"].apply(preprocess)
df["Primary_Assault_Description_Clean"] = df["Primary Assault Description"].apply(preprocess)
df["Primary_Contributing_Factors_Clean"] = df["Primary Contributing Factors"].apply(preprocess)

## NLP

### A. Topic Modeling (e.g., LDA, BERTopic)

#### Contributing Factor

In [6]:
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(df["Primary_Contributing_Factors_Clean"])
topic_model.get_topic_info()  # View topics


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,38,-1_disorient_confused_homelessness_housing,"[disorient, confused, homelessness, housing, a...",[action patient resident agitated cognitive im...
1,0,933,0_influence___,"[influence, , , , , , , , , ]","[, , influence]"
2,1,72,1_action_resident_patient_abuse,"[action, resident, patient, abuse, substance, ...","[action patient resident, action patient resid..."
3,2,65,2_alter_mental_status_understand,"[alter, mental, status, understand, lack, pati...","[alter mental status, alter mental status, alt..."
4,3,60,3_action_resident_patient_visitor,"[action, resident, patient, visitor, , , , , , ]","[action patient resident, action patient resid..."
5,4,60,4_unknown___,"[unknown, , , , , , , , , ]","[n unknown, n unknown, n unknown]"
6,5,43,5_bed_unavailable_inpatient_detox,"[bed, unavailable, inpatient, detox, snf, psyc...","[inpatient bed unavailable, inpatient bed unav..."
7,6,42,6_adherence_compliance_lack_patient,"[adherence, compliance, lack, patient, issue, ...","[patient lack compliance adherence, patient la..."
8,7,33,7_staff_insufficient_interference_guardian,"[staff, insufficient, interference, guardian, ...",[action patient resident staff insufficient st...
9,8,24,8_unknown___,"[unknown, , , , , , , , , ]","[unknown, unknown, unknown]"






Interpretation:

### **Key Observations About the Table**
1.  **Structure**:
- **Topic**: Unique ID for the topic (e.g., `0`, `1`).  `-1` typically indicates "outliers" (texts that don’t fit any topic).
- **Count**: Number of documents/text entries assigned to the topic.
- **Name**: Automatically generated label based on top terms (e.g., `1_action_resident_patient_abuse`).
- **Representation**: Top keywords defining the topic (e.g., `[action, resident, patient, abuse]`).
- **Representative_Docs**: Example text entries from the dataset that belong to the topic.

2.  **Dominant Topics**:
- **Topic 0**: The largest cluster (`Count = 933`), but its keywords (`[influence, , , , , , ...]`) suggest it might be **noise** (e.g., empty strings or placeholder text).
- **Topic 1**: The second-largest meaningful cluster (`Count = 72`) focuses on **patient/resident abuse**.
- **Topic -1**: Outliers (`Count = 38`) include terms like "disorient," "confused," and "homelessness," indicating edge cases or unique scenarios.

---
### **Detailed Topic Breakdown**
#### **1.   Topic -1 (`-1_disorient_confused_homelessness_housing`)**:
- **Keywords**: `[disorient, confused, homelessness, housing, ...]`
- **Interpretation**: Incidents involving individuals with **altered mental states** (e.g., confusion, disorientation) or **social determinants** like homelessness. Because `-1` is the outlier group, these entries do not necessarily form one coherent theme.
- **Actionable Insight**: Link to `Severity of Assault` or `Level of Care Needed`: are these cases more likely to require urgent care?

#### **2.   Topic 0 (`0_influence___`)**:
- **Keywords**: `[influence, , , , ...]`
- **Issue**: This topic is likely **noise**: apart from `influence`, every keyword slot is empty, and its representative documents (`[, , influence]`) are mostly empty strings. It probably collects rows whose cleaned text is blank (e.g., missing values) or reduced to a single word.
- **Recommendation**: Investigate the raw data for this topic and clean or remove these entries if they add no value.
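
One way to follow up on this recommendation is to count how often each cleaned string occurs and to drop blank entries before fitting. The following is a minimal sketch, assuming the cleaned factors are available as a plain list of strings (for example via `df["Primary_Contributing_Factors_Clean"].tolist()`). `summarize_docs` and `drop_empty` are hypothetical helper names, and the sample list is invented for illustration.

```python
from collections import Counter


def summarize_docs(docs, top_n=3):
    """Return the most frequent cleaned strings, with blanks shown as '<empty>'."""
    counts = Counter(doc.strip() or "<empty>" for doc in docs)
    return counts.most_common(top_n)


def drop_empty(docs):
    """Keep only non-blank documents, together with their original positions."""
    kept = [(i, doc) for i, doc in enumerate(docs) if doc.strip()]
    indices = [i for i, _ in kept]
    texts = [doc for _, doc in kept]
    return indices, texts


# Invented sample standing in for the cleaned column (assumption).
sample = ["", "influence", "", "alter mental status", "  ", "influence"]

print(summarize_docs(sample))
indices, texts = drop_empty(sample)
print(indices, texts)
```

Keeping the original positions makes it possible to write the resulting topic IDs back to the matching rows of `df` after fitting on the filtered documents only.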

#### **3.   Topic 1 (`1_action_resident_patient_abuse`)**:
- **Keywords**: `[action, resident, patient, abuse, substance...]`
- **Interpretation**: Cases involving **abuse by residents/patients**, potentially linked to substance use.
- **Link to Structured Data**: Check if this topic correlates with high `Severity of Assault` or `Aggressor = Patient`.

#### **4.   Topic 2 (`2_alter_mental_status_understand`)**:
- **Keywords**: `[alter, mental, status, understand, lack, ...]`
- **Interpretation**: Incidents where the aggressor or victim had **altered mental states** (e.g., dementia, psychosis).
- **Example Docs**: "alter mental status" is repeated verbatim across the representative documents, which suggests templated descriptions.

#### **5.   Topic 3 (`3_action_resident_patient_visitor`)**:
- **Keywords**: `[action, resident, patient, visitor]`
- **Interpretation**: Incidents involving **visitors** (e.g., family members) alongside patients/residents.
- **Action**: Compare with `Department/Office Incident Took Place`: are visitors more problematic in specific departments?

#### **6.   Topics 4 and 8 (`4_unknown___`, `8_unknown___`)**:
- **Keywords**: `[unknown]`
- **Interpretation**: Entries where the contributing factor was marked as **unknown**. The two clusters differ only slightly in their text (representative documents `n unknown` vs. `unknown`).
- **Insight**: Highlights gaps in documentation; consider policies to improve data collection.

#### **7.   Topic 5 (`5_bed_unavailable_inpatient_detox`)**:
- **Keywords**: `[bed, unavailable, inpatient, detox, snf (skilled nursing facility), ...]`
- **Interpretation**: Root causes tied to **resource shortages** (e.g., lack of beds or long-term care placements).
- **Policy Implication**: Advocate for increased bed capacity or partnerships with SNFs.

#### **8.   Topic 6 (`6_adherence_compliance_lack_patient`)**:
- **Keywords**: `[adherence, compliance, lack, patient, issue, ...]`
- **Interpretation**: Cases where **patients lacked compliance** with care plans (e.g., refusing medication).
- **Link to Outcomes**: Does non-compliance correlate with higher `Severity of Assault`?

#### **9.   Topic 7 (`7_staff_insufficient_interference_guardian`)**:
- **Keywords**: `[staff, insufficient, interference, guardian, ...]`
- **Interpretation**: Systemic issues like **understaffing**, alongside terms that point to interference by guardians.
- **Action**: Cross-reference with `Department/Office Incident Took Place` to prioritize staffing in high-risk departments.
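
The suggested links to structured fields can be checked with a simple cross-tabulation of topic ID against a categorical column. The following is a minimal sketch, assuming `topics` is the list returned by `fit_transform` and `severity` is the row-aligned list of values from a column such as `Severity of Assault`. `crosstab_share` is a hypothetical helper, and the values below are invented.

```python
from collections import Counter, defaultdict


def crosstab_share(topics, labels):
    """For each topic, return the share of its documents that fall in each label."""
    if len(topics) != len(labels):
        raise ValueError("topics and labels must have the same length")
    counts = defaultdict(Counter)
    for topic, label in zip(topics, labels):
        counts[topic][label] += 1
    shares = {}
    for topic, label_counts in counts.items():
        total = sum(label_counts.values())
        shares[topic] = {label: n / total for label, n in label_counts.items()}
    return shares


# Invented toy data (assumption): topic IDs and a severity label per document.
topics = [1, 1, 1, 1, 2, 2]
severity = ["Mild", "None", "Mild", "Mild", "None", "None"]

shares = crosstab_share(topics, severity)
for topic in sorted(shares):
    print(topic, shares[topic])
```

With pandas, `pd.crosstab(topics, severity, normalize="index")` should give the same row-wise shares. Either way, the two inputs must be aligned row by row, which matters whenever documents were filtered before fitting.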

---


#### Assault Description

In [7]:
# Fill NA values with empty string and combine
df["Combined_Description"] = df["Primary_Assault_Description_Clean"].fillna('') + " " + df["Assault_Description_Clean"].fillna('')
df["Combined_Description"] = df["Combined_Description"].str.strip()

# Filter out empty strings if needed
non_empty_descriptions = df[df["Combined_Description"] != ""]["Combined_Description"]

# Fit the model
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(non_empty_descriptions)
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,63,-1_slap_pt_bottle_mileu,"[slap, pt, bottle, mileu, face, area, cultural...",[pt pace mileu social worker mileu pt pt start...
1,0,309,0_patient_pt_rn_room,"[patient, pt, rn, room, staff, beating, kickin...",[grab pinching scratch hair pull kicking hit b...
2,1,105,1_pinching_scratch_hair_grab,"[pinching, scratch, hair, grab, pull, bite, pt...",[grab pinching scratch hair pull punch confuse...
3,2,56,2_fist_hit_trash_dodge,"[fist, hit, trash, dodge, open, slap, able, ar...","[hit fist, hit fist, hit fist]"
4,3,52,3_harassment_fighting_harrasment_sexual,"[harassment, fighting, harrasment, sexual, , ,...","[harassment, harassment, harassment]"
5,4,44,4_threat_violence_fist_fluid,"[threat, violence, fist, fluid, bodily, bitten...",[assault bodily fluid grab hit fist kick scrat...
6,5,42,5_object_throw_break_posturing,"[object, throw, break, posturing, verbal, expo...","[throw break object, throw break object, throw..."
7,6,40,6_intimidation_tone_record_consent,"[intimidation, tone, record, consent, abuse, v...","[intimidation, intimidation, intimidation]"
8,7,38,7_yelling_scream_threat_violence,"[yelling, scream, threat, violence, threaten, ...",[intimidation scream yelling threat violence u...
9,8,36,8_threat_violence_weapon_assault,"[threat, violence, weapon, assault, , , , , , ]","[threat violence, threat violence, threat viol..."


Interpretation: identify patterns in assault types

### **Column Explanations**
1.   **`Topic`**:
- The assigned topic ID (`-1` represents "outliers" – documents that don't fit clearly into any topic).
- Example: Topic `-1` contains ambiguous/unclassifiable entries, while Topic `0` is the largest coherent cluster.

2.   **`Count`**:
- Number of documents assigned to each topic.
- Example: Topic `0` has 309 entries (most frequent), while Topic `43` has only 11.

3.   **`Name`**:
- Automatically generated topic name (BERTopic joins the topic ID and the top four keywords with underscores).
- Example: `1_pinching_scratch_hair_grab` suggests a cluster about physical altercations involving scratching/hair-pulling.

4.   **`Representation`**:
- The most representative keywords defining the topic (ordered by importance).
- Example: Topic `2` includes words like `fist`, `hit`, `trash`, `dodge` – implying violent physical actions.

5.   **`Representative_Docs`**:
- Example snippets of documents assigned to the topic.
- Example: For Topic `-1`, documents include phrases like `"pt pace mileu social worker..."`.
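
The naming pattern can be reproduced by hand, which also explains the trailing underscores in names such as `4_unknown___` from the contributing-factor table: empty keyword slots still contribute a separator. This sketch only mimics the pattern visible in the tables; it is not BERTopic's internal code, and `build_topic_name` is a hypothetical helper.

```python
def build_topic_name(topic_id, keywords, n_words=4):
    """Join the topic ID and the first n_words keywords with underscores."""
    return "_".join([str(topic_id)] + list(keywords[:n_words]))


# Keyword lists copied from the Representation columns shown in the tables.
print(build_topic_name(1, ["pinching", "scratch", "hair", "grab", "pull", "bite"]))
print(build_topic_name(4, ["unknown", "", "", "", "", ""]))
```

A name that ends in several underscores is therefore a quick signal that a topic is described by very few distinct words.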

---

### **Key Observations**
1.   **Dominant Themes**:
- **Physical Violence**: Topics like `0_patient_pt_rn_room`, `1_pinching_scratch_hair_grab`, `12_push_choke_bite_grab` focus on physical attacks (e.g., kicking, biting, choking).
- **Verbal Threats**: Topics like `7_yelling_scream_threat_violence`, `18_aggression_inanimate_object_fluid` include yelling, threats, or intimidation.
- **Sexual Harassment**: Topics like `3_harassment_fighting_harrasment_sexual`, `38_touching_offensive_unwelcome_sexual` highlight sexual misconduct.
- **Property Damage**: Topics like `5_object_throw_break_posturing` involve throwing objects or breaking things.

2.   **Outliers**:
- Topic `-1` has 63 entries that BERTopic couldn’t confidently assign to any cluster. These might need manual review (e.g., ambiguous descriptions or poor text quality).

3.   **Overlap/Duplication**:
- Some topics seem similar (e.g., `26_lewd_profane_aggressive_language` vs. `30_language_gesture_profane_lewd`). This could indicate:
  - Subtle differences in context (e.g., lewd language vs. aggressive language).
  - Noise in the data (e.g., typos like `harrasment` in Topic 3).
- Consider merging or refining these topics post-hoc.

4.   **Data Quality Issues**:
- Topics like `41_yell___` (only the word "yell") or `34_grab_masturbation__` (sparse keywords) suggest:
  - Short/incomplete text entries in the original data.
  - Potential preprocessing gaps (e.g., handling stopwords, typos).
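
Overlap between topics can be screened by comparing their keyword sets, for example with the Jaccard index (shared keywords divided by all distinct keywords). The following is a minimal sketch that uses only the four keywords visible in the topic names quoted above. `jaccard` and `overlapping_pairs` are hypothetical helpers, and the 0.5 threshold is an arbitrary assumption.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two keyword collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_pairs(topic_keywords, threshold=0.5):
    """Return (topic_a, topic_b, score) for pairs at or above the threshold."""
    pairs = []
    items = sorted(topic_keywords.items())
    for (id_a, kw_a), (id_b, kw_b) in combinations(items, 2):
        score = jaccard(kw_a, kw_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Keywords taken from the topic names quoted in this notebook.
topic_keywords = {
    26: ["lewd", "profane", "aggressive", "language"],
    30: ["language", "gesture", "profane", "lewd"],
    5: ["object", "throw", "break", "posturing"],
}

print(overlapping_pairs(topic_keywords))
```

Topics 26 and 30 share three of five distinct keywords, so they are flagged as merge candidates, while Topic 5 is not. In practice the full keyword lists from `topic_model.get_topic(topic_id)` would give a more reliable score than the four words in the name.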

---

### **How to Use These Results**
1.   **Review Outliers** (`Topic -1`):
Check if these unclassified entries contain meaningful patterns or require re-preprocessing.

2.   **Actionable Clusters**:
- Prioritize high-frequency topics (e.g., Topic `0` with 309 entries) for intervention.
- Investigate clusters like `3_harassment_fighting_harrasment_sexual` for policy violations.

3.   **Refine Topics**:
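- Use `topic_model.reduce_topics(docs, nr_topics=...)` to lower the overall number of topics, or `topic_model.merge_topics(docs, topics_to_merge)` to merge specific overlapping topics.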
- Manually rename topics (e.g., `topic_model.set_topic_labels()`) for clearer reporting.

4.   **Data Cleaning**:
- Fix typos (e.g., `harrasment` → `harassment`) to improve topic coherence.
- Filter very short/noisy text entries.
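
Both cleaning steps can be sketched in a few lines. The example below assumes whitespace-tokenized, already lowercased text such as the `*_Clean` columns. The `TYPO_FIXES` mapping, the `normalize_typos` and `is_informative` helpers, and the two-token minimum are illustrative assumptions, not part of the original pipeline.

```python
# Hypothetical typo map; extend it after reviewing the topic keywords.
TYPO_FIXES = {"harrasment": "harassment", "mileu": "milieu"}


def normalize_typos(text, fixes=TYPO_FIXES):
    """Replace known misspellings token by token."""
    return " ".join(fixes.get(token, token) for token in text.split())


def is_informative(text, min_tokens=2):
    """Treat a document as informative if it has at least min_tokens tokens."""
    return len(text.split()) >= min_tokens


# Strings modeled on entries visible in the topic tables.
docs = [
    "harassment fighting harrasment sexual",
    "yell",
    "pt pace mileu social worker",
]

cleaned = [normalize_typos(doc) for doc in docs]
kept = [doc for doc in cleaned if is_informative(doc)]
print(kept)
```

A token-count threshold would also drop legitimate one-word entries such as `harassment` or `intimidation`, so it should be tuned against the data rather than applied blindly.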

---

### **Example Interpretation**
- **Topic 0**: Likely describes assaults in healthcare settings (`patient`, `pt`, `rn`, `room` = patient, nurse, room).
- **Topic 12**: May include animal-related incidents (`bite`, `dog`, `kneed`), although its name (`12_push_choke_bite_grab`) points mainly to physical attacks.
- **Topic 33**: Involves bodily fluid assaults (`fluid`, `bodily`, `assault`, `bitten`).

In [8]:
df.head()

Unnamed: 0,Event Date,Facility Type,Department/Office Incident Took Place,Occupational Category of Person Affected,Aggressor,Type of Violence,Primary Contributing Factors,Severity of Assault,Primary Assault Description,Assault Description,Emotional and/ or Psychological Impact,Level of Care Needed,Response Action Taken,Assault_Description_Clean,Primary_Assault_Description_Clean,Primary_Contributing_Factors_Clean,Combined_Description
0,1/9/2024,ED,Nurses station,"Nurse (RN, LPN), Nurse (RN, LPN)",Patient,"Physical, Verbal",Homelessness/Lack of Housing,Mild - Mild Soreness/Abrasions/Scratches/Small...,"Verbal Assault, Pushing/Shoving, Harassment",Head-butted another patient,Mild - Upset/Angry/Scared/Humiliated,,"Security Called, Law Enforcement Called, De-es...",head butt patient,verbal assault push shoving harassment,homelessness lack housing,verbal assault push shoving harassment head bu...
1,2/6/2024,ED,Patient room,"Nurse (RN, LPN), Security, Allied Health/Techn...",Patient,"Physical, Verbal","Altered Mental Status, Inpatient Bed Unavailable",None - No Contact/Unwanted Contact w/No Injury,"Grabbing/Pinching/Scratching/Hair Pull, Kickin...",,None - No emotional and/or Psychological Impact,,"Security Called, Law Enforcement Called, Physi...",,grab pinching scratch hair pull kicking hit be...,alter mental status inpatient bed unavailable,grab pinching scratch hair pull kicking hit be...
2,2/9/2024,ED,Patient room,"Nurse (RN, LPN), Physician/Advanced Practice P...",Patient,"Physical, Verbal",Inpatient Bed Unavailable,None - No Contact/Unwanted Contact w/No Injury,"Posturing, Throwing Object/Breaking Object",,None - No emotional and/or Psychological Impact,,"Security Called, Law Enforcement Called, Physi...",,posturing throw object break object,inpatient bed unavailable,posturing throw object break object
3,2/8/2024,ED,Patient room,"Nurse (RN, LPN), Physician/Advanced Practice P...",Patient,Verbal,Inpatient Bed Unavailable,None - No Contact/Unwanted Contact w/No Injury,"Harassment, Verbal Assault, Posturing",,None - No emotional and/or Psychological Impact,,"Emergency Call/Code, De-escalation Techniques,...",,harassment verbal assault posture,inpatient bed unavailable,harassment verbal assault posture
4,2/9/2024,ED,Behavioral health unit,"Nurse (RN, LPN)",Patient,"Physical, Verbal",Inpatient Bed Unavailable,None - No Contact/Unwanted Contact w/No Injury,"Posturing, Throwing Object/Breaking Object, Ve...",Attempting to break windows,None - No emotional and/or Psychological Impact,,"Security Called, Seclusion of Patient",attempt break window,posturing throw object break object verbal ass...,inpatient bed unavailable,posturing throw object break object verbal ass...


In [9]:
df.to_csv('0_9.1_all_clean_factor.des.csv', index=False)