# Custom tokenizer
This notebook compares `nltk.tokenize.word_tokenize` (currently used in `pvops`) with a custom regex-based tokenizer, with the aim of removing the `nltk` dependency.

## Define tokenizer

In [1]:
from nltk.tokenize import word_tokenize

import re
# The default pattern matches runs of contiguous non-alphanumeric, non-whitespace characters
# that sit at a word boundary: a word character on one side and whitespace on the other.
# Runs interior to a word or number (e.g. the "." in "1.18") are left untouched.
def regex_tokenize(doc, pattern=r'(?<=\w)([^\w\s]+)(?=\s)|(?<=\s)([^\w\s]+)(?=\w)'):
    
    # Temporarily buffer the document with spaces
    doc = ' ' + doc + ' '

    # Buffer anything matching the pattern with spaces on either side
    doc = re.sub(pattern, r' \1\2 ', doc)

    # replace any contiguous whitespace with a single space
    doc = re.sub(r'\s+', ' ', doc)

    # remove leading and ending whitespace; break into tokens along spaces
    return doc.strip().split(' ')
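
As a quick sanity check of the default pattern, the sketch below repeats the function so it runs on its own and applies it to a short hand-written string. The `sample` string is an assumption made up for illustration (it borrows fragments that appear in the ticket data) and is not a row from the dataset.

```python
import re

# Same definition as in the cell above, repeated so this sketch is self-contained
def regex_tokenize(doc, pattern=r'(?<=\w)([^\w\s]+)(?=\s)|(?<=\s)([^\w\s]+)(?=\w)'):
    doc = ' ' + doc + ' '
    doc = re.sub(pattern, r' \1\2 ', doc)
    doc = re.sub(r'\s+', ' ', doc)
    return doc.strip().split(' ')

# Hypothetical sample string, not taken verbatim from the data
sample = 'cb 1.18 terminated.. e-c3-1 faulted. techdispatched: yes'
tokens = regex_tokenize(sample)
print(tokens)
# ['cb', '1.18', 'terminated', '..', 'e-c3-1', 'faulted', '.', 'techdispatched', ':', 'yes']
```

Punctuation inside `1.18` and `e-c3-1` is preserved, while trailing punctuation is split off as its own token. One edge case to be aware of: an empty or whitespace-only document yields `['']` rather than `[]`, because `''.split(' ')` returns a single empty string.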

## Read in data

In [2]:
import pandas as pd

df = pd.read_csv('example_data/example_ML_ticket_data.csv')
docs = df['CompletionDesc'].to_list()

## Perform tokenization and compare

In [3]:
nltk_tokens = [word_tokenize(doc) for doc in docs]
regex_tokens = [regex_tokenize(doc) for doc in docs]
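
Beyond inspecting a handful of documents by eye, it may help to quantify how often the two tokenizers agree. The following is a minimal sketch, assuming two equal-length lists of token lists such as `nltk_tokens` and `regex_tokens`. The helper name `agreement_rate` and the toy inputs are hypothetical and hand-written for illustration; they are not real tokenizer output.

```python
def agreement_rate(tokens_a, tokens_b):
    """Return (fraction of documents with identical token lists, indices that differ)."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError('both inputs must contain the same number of documents')
    # Indices of documents whose token lists are not identical
    mismatches = [i for i, (a, b) in enumerate(zip(tokens_a, tokens_b)) if a != b]
    # Treat an empty corpus as full agreement to avoid dividing by zero
    rate = 1.0 if not tokens_a else 1 - len(mismatches) / len(tokens_a)
    return rate, mismatches

# Hand-written toy inputs (assumed for illustration only)
toy_a = [['self', 'resolved', '.'], ['c4', '.'], ['do', "n't"]]
toy_b = [['self', 'resolved', '.'], ['c4', '.'], ["don't"]]

rate, mismatches = agreement_rate(toy_a, toy_b)
print(f'agreement: {rate:.2f}, differing docs: {mismatches}')
```

In the notebook, the same helper could be called as `agreement_rate(nltk_tokens, regex_tokens)`, and the returned indices used to print only the documents where the outputs diverge.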

In [4]:
num_print = 7

for i in range(num_print):
    nltk_doc = ' '.join(nltk_tokens[i])
    regex_doc = ' '.join(regex_tokens[i])
    print(f'doc {i+1}. \t{docs[i]}')
    print(f'nltk: \t{nltk_doc}')
    print(f'regex: \t{regex_doc}')
    print()

doc 1. 	cb 1.18 was found to have contactor issue would not close the contactor was cycled terminated.. output was verified 126.1a combiner box output also verified with c4. techdispatched: yes
nltk: 	cb 1.18 was found to have contactor issue would not close the contactor was cycled terminated .. output was verified 126.1a combiner box output also verified with c4 . techdispatched : yes
regex: 	cb 1.18 was found to have contactor issue would not close the contactor was cycled terminated .. output was verified 126.1a combiner box output also verified with c4 . techdispatched : yes

doc 2. 	self resolved. techdispatched: no
nltk: 	self resolved . techdispatched : no
regex: 	self resolved . techdispatched : no

doc 3. 	all module rows washed, waiting for final report from sun power.
nltk: 	all module rows washed , waiting for final report from sun power .
regex: 	all module rows washed , waiting for final report from sun power .

doc 4. 	14 nov: we were alerted that e-c3-1 had faulted by 