# Test ways to extract regex patterns using small language models

Applied specifically to postcode extraction.

Requires Ollama (https://ollama.com/download).

See: https://ollama.com/blog/structured-outputs

Candidate model: Phi-2 (2.7B), https://ollama.com/library/phi:2.7b. Note that the experiment below currently calls `deepseek-r1:1.5b`.

In [1]:
%load_ext kedro.ipython
%reload_kedro --env=test
%load_ext autoreload
%autoreload 2
%config IPCompleter.use_jedi=False
import os

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"
os.chdir(context.project_path)
catalog = context.catalog
params = context.params

In [85]:
from ollama import chat
from pydantic import BaseModel


# Sampling options passed to the model
options = {
    "temperature": 0.7,  # Controls randomness
    "top_k": 50,  # Limits token selection to the top-K choices
    "top_p": 0.9,  # Controls nucleus sampling
    "repeat_penalty": 1.2,  # Discourages repetitive responses
}

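class Regex(BaseModel):
    # NOTE: the system prompt below asks for "explanation" and "example_matches",
    # but this schema (which constrains the output) uses "explainer" and "example".
    regex_pattern: str
    explainer: str
    example: str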


system_prompt1 = """\
You are an expert in regular expressions. Your task is to generate python regex patterns based on user descriptions. 
Your responses should be structured in JSON format with the following fields:

{
  "regex_pattern": "<the regex pattern>",
  "explanation": "<detailed explanation of the regex>",
  "example_matches": ["example1", "example2"]
}

Ensure that:
- The output is strictly in JSON format.
- No unexpected escape characters (e.g., \\x08, \\x1b, or other non-printable characters).
- The regex should be properly formatted, valid, and optimized.
- Provide an explanation and valid example matches.

Return responses ONLY in the following JSON format:
{
  "regex_pattern": "<valid regex>",
  "explanation": "<why this regex works>",
  "example_matches": ["example1", "example2"]
}

Do NOT return any text outside the JSON format.

Example 1:
User: "Extract email addresses from text."
Response:
{
  "regex_pattern": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\\\.[a-zA-Z]{2,}",
  "explanation": "This pattern matches email addresses by capturing alphanumeric characters, dots, underscores, and other valid email symbols before and after the '@' sign.",
  "example_matches": ["user@example.com", "test.email@domain.net"]
}

Example 2:
User: "Find all dates in YYYY-MM-DD format."
Response:
{
  "regex_pattern": "\\\\b\\\\d{4}-\\\\d{2}-\\\\d{2}\\\\b",
  "explanation": "This pattern captures dates formatted as four digits for the year, followed by a hyphen, two digits for the month, another hyphen, and two digits for the day.",
  "example_matches": ["2024-01-31", "1999-12-25"]
}
"""

pass_examples = ["569933", "310145", "520147", "521147"]
fail_examples = ["123456", "56993", "56"]

test_regex = "Singapore postcodes are five or six digits long."

response = chat(
    model="deepseek-r1:1.5b",
    messages=[
        # {"role": "system", "content": "The regex pattern outputs should be in the form of a string, to be used in python for regex matching."},
        {"role": "system", "content": system_prompt1},  # Define behavior
        {"role": "user", "content": test_regex},
    ],
    options=options,
    format=Regex.model_json_schema(),  # Constrain the output to the Regex schema
)

pattern = Regex.model_validate_json(response.message.content)
print(pattern)

regex_pattern='/\x08(0-9){6}\x08/' explainer='Singapore has a postcode system where each area is identified by 6 digits, and these can range from 212078 to 543216.' example="The Singapore postcode for the Albert City Post Office in Singapore is 212078. This matches the first six numerical digits of a Singapore postcode because it's a single block area identified by six distinct numbers, and any valid Singapore post office number would match this pattern perfectly."
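
The `\x08` characters in the returned pattern are backspace characters, not regex word boundaries. One plausible explanation (an assumption, since the raw response text was not captured here) is that the model wrote `\b` with a single backslash inside the JSON string, which JSON decodes as a backspace. The sketch below reproduces that effect with the standard library only; `repair_pattern` is a hypothetical clean-up helper, not part of Ollama or Pydantic.

```python
import json
import re

# A JSON document as a model might emit it: "\b" with a single backslash.
# (Assumed for illustration; the raw response is not shown in this notebook.)
single_escaped = r'{"regex_pattern": "\b[0-9]{6}\b"}'
# The same pattern with the backslashes escaped correctly for JSON.
double_escaped = r'{"regex_pattern": "\\b[0-9]{6}\\b"}'

bad = json.loads(single_escaped)["regex_pattern"]
good = json.loads(double_escaped)["regex_pattern"]

print(repr(bad))   # contains backspace characters (\x08)
print(repr(good))  # contains a real regex word boundary (\b)


def repair_pattern(pattern: str) -> str:
    """Hypothetical clean-up: undo backspace decoding and drop /.../ delimiters."""
    repaired = pattern.replace("\x08", r"\b")
    if len(repaired) >= 2 and repaired.startswith("/") and repaired.endswith("/"):
        repaired = repaired[1:-1]
    return repaired


print(re.search(bad, "569933"))                  # None: needs literal backspaces
print(re.search(repair_pattern(bad), "569933"))  # a match on 569933
```

A repair like this only addresses the escaping. The pattern returned above also uses `(0-9)` where a character class `[0-9]` was presumably intended, and that cannot be fixed mechanically.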


In [53]:
import re

In [74]:
re.match(pattern.regex_pattern, "569933")
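
A single `re.match` call on one string says little about the quality of a generated pattern. The `pass_examples` and `fail_examples` lists defined earlier can drive a more systematic check. The following is a minimal sketch: `evaluate_pattern` is a hypothetical helper, and it assumes each example is a complete candidate postcode, so it uses `fullmatch`.

```python
import re

pass_examples = ["569933", "310145", "520147", "521147"]
fail_examples = ["123456", "56993", "56"]


def evaluate_pattern(pattern, should_match, should_not_match):
    """Report which examples a candidate regex gets wrong (hypothetical helper)."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # Model output is not guaranteed to be a valid regex at all.
        return {
            "valid": False,
            "error": str(exc),
            "missed": list(should_match),
            "false_hits": [],
        }
    return {
        "valid": True,
        "error": None,
        "missed": [s for s in should_match if compiled.fullmatch(s) is None],
        "false_hits": [s for s in should_not_match if compiled.fullmatch(s) is not None],
    }


# The pattern returned by the model in the run above.
model_pattern = "/\x08(0-9){6}\x08/"
report = evaluate_pattern(model_pattern, pass_examples, fail_examples)
print(report)
```

For the pattern recorded above, the regex compiles but misses every entry in `pass_examples`, which is consistent with `re.match` returning nothing.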

In [49]:

POSTCODE_PATTERN = r"(?<!\d)\d{5,6}(?!\d)"
re.match(POSTCODE_PATTERN, "569933")

[1m<[0m[1;95mre.Match[0m[39m object; [0m[33mspan[0m[39m=[0m[1;39m([0m[1;36m0[0m[39m, [0m[1;36m6[0m[1;39m)[0m[39m, [0m[33mmatch[0m[39m=[0m[32m'569933'[0m[1m>[0m