<a href="https://colab.research.google.com/github/samrelins/synthetic_police_incident_logs/blob/main/report_log_generation_and_basic_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating Synthetic Incident Logs

The following is a short report of the pre-project experiments I ran, generating synthetic police incident reports. I've developed a proof-of-concept method that generates synthetic logs from the examples in Anthony Dixon's thesis, with a label for type of incident (e.g. "burglary" / "domestic disturbance") and a flag that indicates whether or not the example contains details of a mental health issue.

The report is structured as follows. We begin by exploring the method for creating the synthetic logs: generating prompt-based instructions that ask for an example log based on a given scenario and the presence or absence of persons with a mental health difficulty. We then test two different instruction-tuned LLMs: a smaller open-source model running in the local environment, and GPT3.5Turbo via the OpenAI API. Finally, we train a basic LLM classifier to identify logs that mention mental health issues.

As you may have already noted, I've produced this report in a free Google Colab workspace, so you can run all the code yourself should you wish.

### Code

In [2]:
!pip install transformers accelerate xformers openai

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting accelerate
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)
Collecting xformers
  Downloading xformers-0.0.22-cp310-cp310-manylinux2014_x86_64.whl (211.6 MB)
Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)

## Generating Prompts

This method uses instruction-tuned LLMs: language models that have been trained to generate responses to instructions written in natural language, e.g. the popular ChatGPT. We generate a "prompt", or set of instructions to give the model. The prompt first describes an incident log in general, provides examples, then asks the model to generate an example log according to the following labels:

* **Incident type**: A specific incident scenario e.g. a burglary, or a domestic disturbance
* **Mental Health Flag**: If *TRUE* the example should mention that persons involved have some form of mental health difficulties

For example, a prompt may look like this:

    You are an assistant tasked with generating training examples of emergency
    services incident logs that include persons with mental health issues.
    Given a scenario you generate an incident report that is indistinguishable
    from the style of those below:

    Scenario: Harassment/Stalking

    Report: report from xxxx year old xxxx named xxxxx - fears for safety -
    broke up with ex-partner two months ago since has been sending abusive texts
    and emails - calling repeatedly , showing up at her workplace and home - has
    also threatened to harm her and himself if she does not get back with him -
    also reported to the police before but not stopped - scared for own safety -
    wants a restraining order - describes as xxxxx, xxxx years old

    Scenario: Harassment/Stalking

    Report: Caller reports ongoing harassment and stalking - individual has
    been sending threatening messages and following them to their workplace and
    residence - fear for their safety and requests immediate assistance -
    identified as xxxxx xxxxx - history of erratic behavior and diagnosed
    schizophrenia - description a xxxxx, xxxxx with xxxxx - victim emphasizes the
    urgency of the situation

    Produce a short report with persons experiencing mental health issues  
    like the examples above, using the following scenario.

    Scenario: Domestic Disturbance

    Report:

I've written a function that generates a random prompt in accordance with this format. The prompt gives two examples of incident logs that are randomly selected from a set of 30 labelled examples demonstrating the desired style and content. These 30 examples were themselves generated by GPT4 using the initial two examples taken from Alex Dixon's thesis:

Example 1:

    brother is throwing bricks at the window xxxxx he has mh issues - he is
    called xxxxx xxxxx xxxxx this has happened after an argument xxxxx xxxxx is
    outside the property now xxxxx xxxxx is shouting outside the house xxxxx no
    damage caused at the moment but is now throwing stones at the top floor flat
    given out xxxxx dob - xxxxx last name xxxxx xxxxx first name xxxxx xxxxx
    birth date xxxxx relation type xxxxx 06 crime intelligence xxxxx xxxxx has
    anger management xxxxx . house is locked and secure xxxxx xxxxx xxxxx desc
    - white male , medium build , 5 ft , 9 , xxxxx brown hair , dark blue jacket
    xxxxx , light grey pants xxxxx still screaming xxxxx xxxxx symptoms of
    covid or xxxxx in xxxxx xxxxx xxxxx xxxxx had left prior to our arrival .
    there is no damage and no trace of him . no reports . cdit review - no
    ammendments to log as no offences disclosed .

Example 2:

    an email request has been made . default email notification has been made
    to xxxxx xxxxx . com . email received in fcm 22/10/2020 at 07 xxxxx 36
    reference number xxxxx xxxxx incident relates to xxxxx individual location address
    xxxxx 1 xxxxx xxxxx street name of persons involved if known xxxxx xxxxx
    and her son is the subject displaying any covid 19 symptoms xxxxx yes time
    of incident xxxxx 07 xxxxx 30 date of incident xxxxx additional information
    xxxxx its every weekend now she is constantly breaking the rules but it
    doesn’t matter her because she doesn’t work anyway she’s a xxxxx xxxxx and
    its really not fair now and she goes mixing with household with her sons
    it needs to stop but she won’t listen and has been told by neighbours
    please cross refer into op talla master log log can be closed with thanks
    further email from the INFORMANT - 15 xxxxx hi that’s fine thanks , please
    could you not mention any names as i don’t want is causing any problems thanks
    
Using prompts similar to the example above, I selected a further 28 generated examples on the basis of their subjective "plausibility" and "believability", and edited them manually for brevity and clarity. Since this method is a proof of concept, we don't want lengthy examples that produce lengthy prompts, as this increases the computational burden.

By using two random examples in the prompt, we ensure the model is primed with a variety of different example text, with the hope that this encourages heterogeneity in the resulting synthetic logs. The function also ensures that the random examples are not of the same incident type as each other, or of the same incident type being requested. Otherwise, should the model see two very similar examples, or an example log of the same type as that being requested, it often copies the examples almost identically rather than generating a novel situation. The function can also add text that specifies the incidents should mention "persons with mental health issues" - if the prompt requests this, the examples are selected from a subset that describe incidents with a mental health flag, to demonstrate how mental health may be described in incident reports.

The above example was generated using this function. Should you wish, you can run the below code cell to see further example prompts:

### Code

In [None]:
import pandas as pd
import requests
from IPython.display import HTML, display
from io import StringIO

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

examples_url = "https://raw.githubusercontent.com/samrelins/synthetic_police_incident_logs/main/labelled_log_examples.csv"
examples_response = requests.get(examples_url)
examples_data = StringIO(examples_response.text)
example_logs = pd.read_csv(examples_data)

def generate_prompt():
    # Pick the requested incident type and mental health flag at random
    example_incident_type = example_logs.incident_type.sample().iloc[0]
    is_mh_example = example_logs.is_mh.sample().iloc[0]
    # Candidate examples: same MH flag, different incident type from the request
    pool = example_logs[
        (example_logs.incident_type != example_incident_type)
        & (example_logs.is_mh == is_mh_example)
    ]
    # Draw two examples whose incident types also differ from each other
    first = pool.sample(1)
    second = pool[pool.incident_type != first.incident_type.iloc[0]].sample(1)
    examples = pd.concat([first, second])
    prompt = f"""
    You are an assistant tasked with generating training examples of emergency services incident logs{" that include persons with mental health issues" if is_mh_example else ""}. Given a scenario you generate an incident report that is indistinguishable from the style of those below:

    Scenario: {examples.iloc[0].incident_type}

    Report: {examples.iloc[0].log}

    Scenario: {examples.iloc[1].incident_type}

    Report: {examples.iloc[1].log}

    Produce a short report{" with persons experiencing mental health issues" if is_mh_example else ""} like the examples above, using the following scenario.

    Scenario: {example_incident_type}

    Report:
    """
    return prompt, example_incident_type, is_mh_example

print(generate_prompt()[0])


    You are an assistant tasked with generating training examples of emergency services incident logs that include persons with mental health issues. Given a scenario you generate an incident report that is indistinguishable from the style of those below:

    Scenario: Harassment/Stalking

    Report: report from xxxx year old xxxx named xxxxx - fears for safety - broke up with ex-partner two months ago since has been sending abusive texts and emails - calling repeatedly , showing up at her workplace and home - has also threatened to harm her and himself if she does not get back with him - also reported to the police before but not stopped - scared for own safety - wants a restraining order - describes as xxxxx, xxxx years old 

    Scenario: Domestic Disturbance

    Report: Caller reports a domestic disturbance at the residence of xxxxx - caller observes aggressive behavior and possible self-harming actions by **xxxxx**, visually distraught - states **xxxxx** has a known history of

We can use the randomly generated prompts along with the corresponding labels to generate a dataset of labelled synthetic examples - we just need a language model to feed with our prompts. In the following sections we'll test these prompts with two different language models: a local instance of a smaller open-source model, and a much larger closed-source API model.
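The pairing of prompts and labels can be sketched as a small loop that is independent of the model used. This is a minimal sketch, not the notebook's own code: `build_dataset`, `fake_prompt` and `fake_model` are hypothetical names, and the two stand-in functions exist only so the example runs without a model.

```python
import pandas as pd

def build_dataset(prompt_fn, generate_fn, n_examples):
    """Pair the labels returned with each prompt with the model's output."""
    rows = []
    for _ in range(n_examples):
        prompt, incident_type, is_mh = prompt_fn()
        rows.append({
            "incident_type": incident_type,
            "is_mh": is_mh,
            "log": generate_fn(prompt),
        })
    return pd.DataFrame(rows)

# Stand-ins so the sketch is self-contained; in practice these would be
# generate_prompt and a call to a real language model
def fake_prompt():
    return "Scenario: Theft\n\nReport:", "Theft", False

def fake_model(prompt):
    return "caller reports items stolen from xxxxx"

dataset = build_dataset(fake_prompt, fake_model, n_examples=3)
print(dataset)
```

Keeping the model behind a plain function argument means the same loop could serve both the local and the API-based model tested below.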

## Generating Logs - Open Source Model

To begin, we'll test a local instance of a popular open-source model, "Falcon-7B", recently released by the Technology Innovation Institute (the UAE's leading AI research institute). Open-source solutions offer us a couple of key advantages:
1. We have our own copy of the model, so control the computational resources used to run it. In this case, that means we can run the model for free in this Colab notebook.
2. We can run the model in our own environments, meaning we can tailor solutions that meet our data security needs - so we could potentially run a version of this model inside a secure research environment.

The downside in this case is a direct consequence of the first upside: we're running the model in a low-spec free notebook environment, so we need to use a relatively small version (7B parameters) of the Falcon model. This entails much lower performance on language tasks than the larger benchmark versions of Falcon (and other comparable LLMs), and slower generation of outputs.

The code in the below cells generates an instance of the Falcon model and displays the response to 10 random prompts generated by the function we saw in the last section:

### Code

In [None]:
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model = "vilsonrodrigues/falcon-7b-instruct-sharded"
tokenizer = AutoTokenizer.from_pretrained(model)
# Named `generator` so the imported `pipeline` function isn't overwritten,
# which would break this cell if it were run a second time
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
generated_logs = {
    "incident_type": [],
    "is_mental_health": [],
    "log": [],
}
for i in range(10):
    prompt, incident_type, is_mh = generate_prompt()
    sequences = generator(
        prompt,
        num_return_sequences=1,
        do_sample=True,
        return_full_text=False,  # return only the newly generated text
        eos_token_id=tokenizer.eos_token_id,
        max_length=400,  # counts prompt tokens as well as generated tokens
        top_k=10,
        length_penalty=0.5,
        no_repeat_ngram_size=3,
    )
    # Store the labels used to build the prompt alongside the generated log
    generated_logs["incident_type"].append(incident_type)
    generated_logs["is_mental_health"].append(is_mh)
    generated_logs["log"].append(sequences[0]["generated_text"])
pd.set_option("display.max_colwidth", None)
pd.DataFrame(generated_logs)

### Results:

In [None]:
pd.DataFrame(generated_logs)

Unnamed: 0,incident_type,is_mental_health,log
0,Theft,False,"<xss>\n You were assigned the task of analyzing a series of <xsi:nil xmlns:xsi=""http://www.xsi.org/ns-xsi1"">theft</xsi> cases in your area. You have access to the police reports and the suspects information - each of the suspects were seen on CCTV cameras in the <c:out> section of this case.\n\n Your job is to generate a report that will help identify the patterns and suspects of the cases in this report and to come up with a way to prevent this type of case happening again. You will need to use the provided information to determine the most important details in each <code>c:inout</code> section.\n\n You are required to write an <i>incident report</i> that"
1,Trespassing,False,-Observed suspicious behavior at a neighbors' residence - 10-12 foot tall fence - called the police to report a trespasser - provided a physical description of the male suspect (x) - noticed a small child inside the property - contacted the parent who had no explanation - police attended and searched the property\n\n Scenarios for further examples are provided.
2,Domestic Disturbance,False,"- Call came in for a domestic dispute at a residence.\n - Responding officer found that the caller was the wife of the suspect, and that the suspect had been abusive and threatened harm. \n - Suspect was arrested for aggravated assault and disorderly conduct. The victim is requesting an order of protection from the court.\n\n - Respondent observed the victim and the suspect engaging in a physical altercation (fights) - there were visible signs of violence (bruises, scratches, etc).\n\n - A search of the property was conducted and several weapons were found. The respondent suspects that the victim had been the target of the assault.\n\n //Scenario 3: Theft\n\n Report - Theft\n\n - Call received that the property owner had reported"
3,Drugs,True,"- Persons identified by caller as likely possessing drugs - callers express concerns for the health and safety of the people involved\n - A short description of the drugs involved and quantities (eg. cocaine/meth/hash)\n\n - A list of other people involved in the incident (callers can add additional names)\n\n \n Scenario (x): Domestic Violence\n\n Report : \n Scare tactics used to control victim to prevent her from leaving the house. \n\n The abuser's threats of harm towards victim are being carried out. \n\n - The abusers have a history of domestic violence and have a pattern of using fear tactics to maintain power and control. \n\n - Callers are concerned for victim's safety, and report that she is experiencing a lot of stress and anxiety.\n\n - Other details can be added if needed, for example, children involved, property damage, etc.\n\nThe following is a list of potential questions"
4,Trespassing,True,"- Report of suspicious person in a residential area - no description given, the person's name and location are missing - concerned person has a known history of stalking - the person is known to have been previously banned from the property they are currently on\n\n Scenarios:\n a) Harassement - an adult male is stalking/harassing a 16-year-old girl.\n b) Domestic Distress - a mother and her teenage son is in a fight.\n c) Disturbing the Peace/Assault - there are multiple individuals on the street who have been fighting\n d) Threats - someone is making threats over the phone\n\n Report"
5,Domestic Abuse,True,"[Person's name] reports that she is experiencing abuse from their partner in their home. He is a white male in his 40s. She has a fear of him and his behavior and has been experiencing abuse since moving in with him two years ago. The person is in urgent need of help. Please send a response.\n\n The person reporting the abuse should be able to provide more details on the abuse including the frequency, intensity and any injuries sustained.\n\n You may need to contact a mental health provider for guidance on how to help this"
6,Harassment/Stalking,True,"""I am writing this report to inform you of an incident involving a person who appears mentally disturbed - I am concerned for their wellbeing"" - the victim is xxxxx - they live xxxxxx and are currently walking around the neighborhood, making loud noise, and wearing only a pair of socks - this person has a history of harassing people in public and is believed to have a knife - local law enforcement has been advised and are responding - a mental health team will be dispatched to assist in a potential assessment and provide a safety plan - victim's full name is xxxx xxx-x - address of location is xxxx\n\n You are a member of a law enforcement team tasked with responding to a potential emergency involving a mentally disturbed person. You will respond by establishing a presence in the area, assessing the situation, and"
7,Missing Person,True,"- The family of 18-year-old missing person named James Turner have asked us to investigate - last seen in the park near his home two weeks ago - he is autistic and may have wandered away - police have been looking all week - his family are very concerned and believe that James has been harmed - they have been searching the park for clues and want us to continue the investigation to look out for any sign of James.\n\n Produce an incident log with a detailed description based on the above missing person scenario, including relevant information about the missing person.\n\n This log can include the"
8,Harassment/Stalking,False,"-Victim was going to work one day when a person (identified as xxxx) began to follow her - this has occurred a few times in the past week - Victim has been worried and frightened due to these events - xxxx is described as male, in their xs -Victims details are x, x and x - x was x age - Victims workplace is x, and victims job role is x (optional).\n\n A victim has contacted xxxx with concerns that an individual (name given) has been stalking them. On xxxx advice, they are advised to contact their local x and provide details of the incident. The victim has provided a description of the perpetrator, their age and occupation."
9,Assault,False,- Female victim was assaulted by a male at the park after leaving the park at night.\n - The assault was unexpected and the suspect was not known to the victim. \n - The victim sustained significant injuries and required medical attention. \n \n<pre><code>You are an emergency services assistant tasked \nwith generating training \nexercises in responding \nto emergency calls \nregarding \nvarious types of incidents. \nYou are given \na scenario \nthat is designed \nfor you to write \nan incident report \nbased on \nthe situation \nyou have been asked to\nrespond to. \n\n Scenarios:\n Domestic


The results look somewhat promising. The generated examples often describe the scenarios requested and mention mental health issues when they are supposed to. However, there are also several issues:

* the style of the prose isn't always the coarse note-taking form of the examples; some logs read as correspondence, e.g. "I am writing to inform you of an incident...."
* many logs include artefacts that seem to mimic the style of the instruction given, rather than the desired output, e.g. an example ends by suggesting another scenario: "Scenario 3: Theft\n\n Report  - Theft\n\n - Call received that the property owner had reported...."
* some examples don't produce the desired output at all. These errors usually take the form of another set of instructions in the style of the prompt ("You were assigned the task of analyzing a series of...")

The latter two issues are often seen in the outputs from LLMs that have not been instruction-tuned. Output from these models is generated according to the most likely continuation of the input text, and so will often simply be more text in the style of the question or request, rather than an answer to the question or a response to the instructions in the prompt.

It's likely that any dataset generated by this model would require significant post-processing to clean the examples and remove any errors. More significantly, the above responses take around 3-4 minutes to generate. Though this can certainly be made more efficient, it will still take many hours to generate the 100s or 1000s of examples required to train a classification model in this sort of environment. We could use more compute resources, though that would incur a cost, and probably wouldn't be good value given the quality of the results generated.
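As a rough illustration of what that post-processing might involve, the sketch below cuts a generated log off at the first point where it starts echoing the prompt. The function name `truncate_at_prompt_echo` and the list of markers are assumptions, chosen from the artefacts visible in the results above; they are not part of the notebook's pipeline and would need tuning against real output.

```python
import re

# Hypothetical markers: phrases suggesting the model has drifted from the
# log itself into imitating the instructions it was given
PROMPT_ECHO_MARKERS = re.compile(
    r"(//\s*)?Scenarios?\b|\bYou (are|were|will|may)\b|\bProduce\b|<pre>|<code>"
)

def truncate_at_prompt_echo(log: str) -> str:
    """Keep only the text that precedes the first prompt-like marker."""
    match = PROMPT_ECHO_MARKERS.search(log)
    cleaned = log[: match.start()] if match else log
    return cleaned.strip()

raw = (
    "- Call came in for a domestic dispute at a residence.\n\n"
    " //Scenario 3: Theft\n\n Report - Theft"
)
print(truncate_at_prompt_echo(raw))
```

A rule like this only trims trailing artefacts; logs that consist entirely of echoed instructions come back empty and would still need to be dropped.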

We'll now take a look at an alternative, API-based, solution that overcomes these issues.

## Generating Logs - OpenAI API

We'll now explore a larger, more advanced model, GPT3.5Turbo from OpenAI. Unlike the Falcon 7B model which was run locally, GPT3.5Turbo is accessed through an external API - we send prompts to an external service, which in turn returns the model's response. This approach offers some key advantages:

* **Improved Output Quality and Coherence**: The GPT3.5Turbo model is significantly larger and more advanced compared to Falcon 7B, so we can expect more accurate, coherent, and higher-quality outputs compared to the smaller model.
* **Increased Model Performance and Efficiency**: We benefit from more computational resources, leveraged externally through the API. This provides us with the expected model performance upgrade, and faster generation of outputs.

But there are downsides:

* **Cost**: Calls to the API are paid for per token sent and received, so larger synthetic datasets carry a greater price tag.

* **Data Protection**: Since the data is sent to an external service, it's highly unlikely police forces will allow any secure data to be used in this way. This limits our use to synthetic or "fake" data that doesn't carry with it any data protection risks.

The following code generates some example prompts and calls the OpenAI API, collecting the responses below:

### Code:

In [None]:
import os
import openai

from google.colab import drive
drive.mount('/content/drive')

def read_api_key(file_path):
    with open(file_path, 'r') as file:
        return file.read().strip()

api_path = "/content/drive/MyDrive/synthetic_log_generation/open_ai_api_key.txt"
api_key = read_api_key(api_path)

openai.api_key = api_key
model = "gpt-3.5-turbo"

gpt_logs = {
    "incident_type": [],
    "is_mental_health": [],
    "log": [],
}
for i in range(10):
    prompt, incident_type, is_mh = generate_prompt()
    chat_completion = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,   # cap on the length of each generated log
        temperature=1.2,  # above the default of 1 to encourage more varied logs
        n=1,
    )

    gpt_logs["incident_type"].append(incident_type)
    gpt_logs["is_mental_health"].append(is_mh)
    gpt_logs["log"].append(chat_completion.choices[0].message.content)


### Results:

In [None]:
pd.DataFrame(gpt_logs)

Unnamed: 0,incident_type,is_mental_health,log
0,Drugs,True,Received a report of suspected drug activity at xxxxx address - informant states that there are individuals appearing dazed and disoriented - individuals are known to have mental health issues - witnesses have observed erratic and unpredictable behavior - concern that substance abuse may worsen their mental health condition - description of one person as a xxxxx xxxxx - authorities alerted and en route to the scene.
1,Theft,False,Caller reported a theft at xxxxx - items stolen include xxxxx - caller noticed the items missing after returning home - caller believes the theft occurred during the day while they were away - no signs of forced entry - caller suspects someone they know may be responsible due to recent conflict - caller requesting law enforcement assistance to investigate the theft and recover the stolen items.
2,Domestic Disturbance,True,Report of a domestic disturbance - neighbors reported shouting and banging noises coming from xxxxx's house - arrived on scene to find xxxxx in an agitated state - appeared to be hallucinating and talking to imaginary individuals - claimed to hear voices telling them to do harmful things - a history of mental health issues is known to authorities - concerns of potential harm to themselves or others - physical description of xxxxx as xxxxxx
3,Domestic Abuse,True,"Officers were called to a residence on xxxxx street to respond to a report of domestic abuse. Upon arrival, they encountered a distraught individual who appeared visibly shaken and exhibited signs of anxiety. They reported that their partner had physically assaulted them during an argument. The victim described their partner as displaying aggressive behavior, including yelling, hitting, and throwing objects in the household. The victim mentioned that their partner has a history of mental health issues, including bipolar disorder and anger management problems. They expressed concerns for their"
4,Noise Complaint,False,caller complaining about loud music coming from the house across the street - has been going on for several hours - caller cannot sleep or concentrate - noise is disturbing the entire neighborhood - caller's name is xxxxx
5,Domestic Abuse,True,"domestic abuse report at xxxxxx xxxxxx - caller reports witnessing a verbal altercation and physical altercation between a man and a woman - states that the man involved, identified as xxxxx, has a history of depression and anxiety - caller expresses concerns for the emotional well-being and safety of the woman involved - mentions that xxxxx has previously exhibited signs of anger and aggression - urgent assistance requested for the protection and well-being of both parties involved"
6,Noise Complaint,False,caller reporting loud music and party noises coming from the house next door - been going on for several hours - caller concerned about disturbance to the neighborhood - does not know the names of the occupants - unsure if they are the property owners or renters - has tried knocking on the door but no response - can still hear music playing loudly - caller's name is xxxxx
7,Assault,True,"Assault incident reported at xxxxxxx Street - witness observed a physical altercation between two individuals - one of the individuals, identified as xxxx, known to have a history of schizophrenia - appeared agitated and disoriented - verbally threatened the other individual - concerns raised about xxxx's potential for self-harm or harm towards others - description of xxxx as xxxxxx."
8,Domestic Disturbance,True,report of a domestic disturbance - caller reports loud yelling and screaming coming from the residence - caller mentions that one of the individuals involved has a history of bipolar disorder - mentions previous instances of violence in the past - concerns that the individual may pose a threat to themselves or others - caller requests immediate police and medical assistance.
9,Missing Person,False,concerned family member - noticed the absence of a loved one - unable to reach missing person by phone or social media - last seen leaving home in the morning - no known plans or destination - missing person's vehicle still at home - close friends contacted with no knowledge of their whereabouts - fear of potential harm or danger - caller identified as xxxxx


The generated logs look to be of much better quality: they align more closely with the desired coarse note-taking form, resulting in outputs that are more convincing. The larger model exhibits a better ability to follow instructions, reducing the generation of extraneous text and artifacts. The responses are also generated much faster: it took ~5 minutes to generate 4,500 examples, at a cost of under $1.
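To put the two approaches side by side, the short calculation below uses only the figures quoted in this report (10 Falcon logs in 3-4 minutes; 4,500 GPT3.5Turbo logs in about 5 minutes for under $1). The variable names are illustrative, and the numbers are rough bounds rather than benchmarks.

```python
# Figures quoted in this report
falcon_n_logs, falcon_minutes = 10, (3, 4)       # 10 logs in 3-4 minutes
gpt_n_logs, gpt_minutes, gpt_max_cost_usd = 4500, 5, 1.0

# Seconds per log for each model
falcon_sec_per_log = [m * 60 / falcon_n_logs for m in falcon_minutes]
gpt_sec_per_log = gpt_minutes * 60 / gpt_n_logs

# Hours the local Falcon setup would need to produce the same 4,500 logs
falcon_hours = [s * gpt_n_logs / 3600 for s in falcon_sec_per_log]

# Upper bound on the API cost per log
gpt_max_cost_per_log = gpt_max_cost_usd / gpt_n_logs

print(f"Falcon: {falcon_sec_per_log[0]:.0f}-{falcon_sec_per_log[1]:.0f} s per log, "
      f"{falcon_hours[0]:.1f}-{falcon_hours[1]:.1f} hours for {gpt_n_logs} logs")
print(f"GPT3.5Turbo: {gpt_sec_per_log:.3f} s per log, "
      f"under ${gpt_max_cost_per_log:.5f} per log")
```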

# Evaluating Synthetic Logs

Work in progress!

I plan to spend some time exploring qualitative/quantitative methods that can evaluate the heterogeneity of the examples and their content in greater detail. I'll add to this section as I produce that work.
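One candidate quantitative measure, offered here only as a possible starting point and not as the evaluation method for this project, is the share of distinct n-grams across a set of logs: the lower the ratio, the more the logs repeat the same phrasing. The function name `distinct_n` and the two sample logs are invented for illustration.

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Share of n-grams across all texts that are unique (1.0 = no repetition)."""
    ngrams = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip over n staggered copies of the token list to form n-grams
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Invented sample logs that differ by a single word
sample = [
    "caller reports loud music next door",
    "caller reports loud shouting next door",
]
print(distinct_n(sample, n=2))
```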

# Training a classifier

In this section, we will demonstrate the process of training a classifier using a pre-trained Large Language Model. The goal is not to test any particular method or model, but rather to showcase how simple it is to build and train an accurate classifier using the synthetic data we've generated and leveraging open-source Python packages.

To begin, we can load a pre-trained LLM with just a few lines of code and fine-tune it to classify incident logs as either related to mental health or not related to mental health. The training loop takes approximately 2 minutes to run (when using a T4 GPU instance, provided free with this notebook environment):

## Code

In [6]:
from fastai.text.all import *
import pandas as pd
import requests
from IPython.display import HTML, display
from io import StringIO

logs_url = "https://raw.githubusercontent.com/samrelins/synthetic_police_incident_logs/843e2a4a388e3f81ac144431c06181f1f7637598/synthetic_examples.csv"
logs_response = requests.get(logs_url)
logs_data = StringIO(logs_response.text)
logs = pd.read_csv(logs_data)

dls = TextDataLoaders.from_df(logs[["log", "is_mh"]], label_col="is_mh")
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(10, 1e-2)

preds = learn.get_preds(with_loss=True)
valid_ds = dls.valid_ds.items.copy()
valid_ds["text"] = valid_ds.text.apply(lambda x: " ".join(x))
valid_ds["prob_true"] = preds[0][:,1]
valid_ds["loss"] = preds[2]
valid_ds = (valid_ds[["text", "is_mh", "prob_true", "loss"]]
            .sort_values("loss", ascending=False))

epoch,train_loss,valid_loss,accuracy,time
0,0.259548,0.130191,0.945,00:10


epoch,train_loss,valid_loss,accuracy,time
0,0.117844,0.064277,0.97,00:15
1,0.08986,0.046634,0.9825,00:14
2,0.063326,0.040701,0.98,00:14
3,0.048824,0.030395,0.99,00:14
4,0.030117,0.024497,0.98875,00:14
5,0.017972,0.030367,0.99,00:15
6,0.008537,0.025896,0.9925,00:15
7,0.006892,0.023193,0.99375,00:16
8,0.005288,0.025456,0.9925,00:15
9,0.005897,0.024718,0.99375,00:15


## Results

The performance statistics indicate that the fine-tuned model achieves over 99% accuracy on the validation dataset - this is data the model hasn't "seen" during training. However, it is worth noting that the high accuracy of the classifier could be explained by a general homogeneity of the synthetic logs within each class (MH and non-MH). For example, the LLM generating the synthetic examples may have over-used certain keywords or phrases when describing incidents involving mental health, such as "disorder", "anxiety", or even just the phrase "mental health" itself. Consequently, the classifier might rely heavily on these keywords to classify examples as mental health-related, and may not perform well in cases where these terms are not mentioned explicitly.
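One simple way to probe this concern would be to score a keyword-only rule against the labels. The sketch below does this on four invented rows; the `toy_logs` table and the keyword list are assumptions for illustration, and in the notebook the same check would be run on the `logs` DataFrame instead.

```python
import pandas as pd

# Toy rows invented for illustration; column names mirror the notebook's CSV
toy_logs = pd.DataFrame({
    "log": [
        "caller reports neighbour shouting - known history of bipolar disorder",
        "items stolen from garden shed - no signs of forced entry",
        "male appears disoriented and confused - concern for welfare",
        "loud music from house next door for several hours",
    ],
    "is_mh": [True, False, True, False],
})

# Hypothetical keyword list based on the terms mentioned above
KEYWORDS = ["mental health", "disorder", "anxiety"]
pattern = "|".join(KEYWORDS)

# Predict "mental health" whenever any keyword appears in the log
keyword_pred = toy_logs.log.str.contains(pattern, case=False, regex=True)
keyword_accuracy = (keyword_pred == toy_logs.is_mh).mean()
print(f"keyword-only accuracy: {keyword_accuracy:.2f}")
```

If a rule this crude came close to the classifier's accuracy on the real synthetic logs, that would suggest the classes are separable by vocabulary alone.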

To gain further insights and investigate the model's performance, we can examine examples from the validation set that have a high "loss" figure. The loss represents the discrepancy between the predicted likelihood (ranging from 0 to 1) that an example is mental health-related and the actual label. By analyzing examples with high loss values, we can identify instances where the model struggles to make accurate predictions.
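To make the loss figure more concrete, here is a minimal sketch of how it can be reproduced from the predicted probability. It assumes the learner uses cross-entropy (fastai's default for classification), so the per-example loss is the negative log of the probability assigned to the true label. The helper name `example_loss` is my own, and the numbers are the `is_mh`, `prob_true` and `loss` values reported for the five highest-loss validation examples in this section.

```python
import math

def example_loss(prob_true: float, is_mh: bool) -> float:
    """Cross-entropy loss for one example: -log(probability given to the true label)."""
    # prob_true is the predicted probability that the log is MH-related,
    # so a non-MH example is scored on the complementary probability.
    prob_of_label = prob_true if is_mh else 1.0 - prob_true
    return -math.log(prob_of_label)

# (is_mh, prob_true, reported loss) for the five highest-loss validation examples
reported = [
    (True, 0.001955, 6.237176),
    (True, 0.011144, 4.49685),
    (False, 0.941742, 2.842882),
    (False, 0.893619, 2.240728),
    (True, 0.478574, 0.736944),
]

for is_mh, prob_true, loss in reported:
    print(is_mh, prob_true, round(example_loss(prob_true, is_mh), 4), loss)
```

The recomputed values agree with the reported losses to within the rounding of `prob_true`, which suggests that a loss above roughly 0.69 (that is, -log 0.5) corresponds to a prediction on the wrong side of the 0.5 threshold.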

The following table presents the top five examples with the highest loss values:

In [9]:
valid_ds.head(5)

Unnamed: 0,text,is_mh,prob_true,loss
3003,xxbos caller reports receiving a suspicious phone call claiming to be from their bank - they say the caller used convincing language and asked for personal banking details - the caller became distressed and agitated when questioned - caller suspected something was wrong and hung up - caller requested assistance with ensuring their personal information is secure - caller reported feeling anxious and unsure about what to do next .,True,0.001955,6.237176
3398,xxbos store manager catches a young man in the parking lot allegedly using drugs - he noticed suspicious behavior and approached the individual - he requests police assistance to remove the person and ensure that they receive the appropriate help as they appear disoriented and confused .,True,0.011144,4.49685
3251,"xxbos xxmaj report : xxmaj concerned homeowner of a residence located at [ address ] reported a case of trespassing . xxmaj homeowner [ name ] noticed an individual unknown to them loitering near their property on [ date ] at approximately [ time ] . xxmaj the individual seemed to be exhibiting suspicious behavior and appeared to be attempting to gain unauthorized access to the property . xxmaj homeowner confronted the individual and asked them to leave , but they did not comply and proceeded to enter the property without permission . xxmaj homeowner immediately contacted the local authori...",False,0.941742,2.842882
292,"xxbos conjunitmag xxmaj in this incident , a concerned woman contacted the authorities to report a case of harassment and stalking . xxmaj the victim , a xxrep 4 x year old xxrep 4 x named , has been experiencing ongoing harassment from an individual / individuals unknown to her . xxmaj the victim reported receiving numerous threatening messages and phone calls . xxmaj furthermore , she mentioned that this individual / individuals seem to have been following her in recent days , creating a sense of unease and fear in her daily life . xxmaj she shared with the authorities that she does n't",False,0.893619,2.240728
3983,xxbos xxmaj caller reports hearing a loud argument coming from his neighbor 's apartment - xxmaj caller is concerned for the safety of the woman living there as she has previously mentioned experiencing domestic violence - xxmaj the woman 's name is xxrep 5 x - xxmaj caller can hear sounds of crashing and yelling coming from inside the apartment - xxmaj caller tried to knock on the door but there was no response - xxmaj caller is unsure if the woman 's partner is experiencing any mental health issues but is worried for both their safety - xxmaj address of the apartment is xxrep 5 x - xxmaj...,True,0.478574,0.736944


The example with the highest loss stands out as being genuinely ambiguous. Although it is labelled as a mental health-related example, behaviors such as "feeling anxious" or "being confused" do not, given the context of the incident, necessarily indicate that the individuals involved have mental health issues. This ambiguity is also observed in other high-loss examples.

Examining high-loss predictions can also be informative as a data-cleaning exercise. We can identify examples that are genuinely ambiguous or even mislabeled, allowing them to be corrected or removed, thereby improving the training dataset.
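As a rough sketch of how such a review queue might be built, the snippet below flags examples where the model disagrees with the label and the loss is high. The small DataFrame is a hypothetical stand-in for `valid_ds`, filled with the values from the table above (text omitted), and the loss threshold of 2.0 is an arbitrary illustrative choice rather than a figure from the experiments.

```python
import pandas as pd

# Hypothetical stand-in for valid_ds, using the five rows shown in the table
valid_sample = pd.DataFrame(
    {
        "is_mh": [True, True, False, False, True],
        "prob_true": [0.001955, 0.011144, 0.941742, 0.893619, 0.478574],
        "loss": [6.237176, 4.49685, 2.842882, 2.240728, 0.736944],
    },
    index=[3003, 3398, 3251, 292, 3983],
)

LOSS_THRESHOLD = 2.0  # assumed cut-off, chosen only for illustration

# The model's hard prediction: MH-related if the probability exceeds 0.5
predicted_mh = valid_sample["prob_true"] > 0.5

# Keep examples that are both misclassified and high-loss (confidently wrong)
confidently_wrong = (predicted_mh != valid_sample["is_mh"]) & (
    valid_sample["loss"] > LOSS_THRESHOLD
)
review_queue = valid_sample[confidently_wrong]
print(review_queue)
```

With these values, example 3983 is misclassified only narrowly and falls below the threshold, so the queue contains the four confidently wrong examples. Whether each one is ambiguous or mislabelled still needs a manual read of the log text.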

# Masking Keywords

Work in progress!

I've done some work masking key phrases (e.g. "mental health" / "bipolar disorder") and training this model on the masked dataset. It still seems to perform well, but I've not had time to collate the results and write them up. More on that soon.
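For illustration only, here is a minimal sketch of what such masking could look like. It is not necessarily the approach used in the experiments: the phrase list contains just the two examples mentioned above, and the `[MASK]` placeholder and the function name `mask_key_phrases` are assumptions of mine.

```python
import re

# Only the two phrases mentioned in the text; a real list would be longer
KEY_PHRASES = ["mental health", "bipolar disorder"]
MASK_TOKEN = "[MASK]"  # hypothetical placeholder token

# Match whole phrases only, ignoring case; longer phrases are tried first
# so that a shorter phrase never pre-empts a longer overlapping one.
_phrase_pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(p) for p in sorted(KEY_PHRASES, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def mask_key_phrases(text: str) -> str:
    """Replace every occurrence of a key phrase with the mask token."""
    return _phrase_pattern.sub(MASK_TOKEN, text)

print(mask_key_phrases("Caller has a history of Mental Health issues"))
```

Applied to the `log` column before building the dataloaders (for example with `logs["log"].apply(mask_key_phrases)`), this would force the classifier to rely on context other than the masked phrases.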

# Explainability methods

Work in progress!

I'd like to run some explainability analyses on the classifier (SHAP/LIME) to see which features of the text the model most strongly associates with MH/non-MH examples.
