# Tokenizer

Tokenizing is a deceptively difficult task.
You need to take a single string and break it into individual tokens, and do it quickly.

We could do it in pure Python, but let's try using Cython to get performance gains.

First let's download the data.

In [None]:
%%bash

mkdir -p ../.data
cd ../.data
# https://s3-us-west-2.amazonaws.com/resero2/datasets/ml-foundations/emoji_tweets_5k.csv
if [ ! -f emoji_tweets_5k.csv ]; then
    echo "File not found. Downloading from s3"
    wget -q https://s3-us-west-2.amazonaws.com/resero2/datasets/ml-foundations/emoji_tweets_5k.csv
else
    echo "File exists, not downloading from s3"
fi


In [None]:
%load_ext Cython

Now let's open the data file and read it into two lists: our tweets and their emoji targets.

In [None]:
import csv
import json

texts = []
emojis = []

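# The csv module expects newline=""; the explicit encoding assumes the file is UTF-8
with open("../.data/emoji_tweets_5k.csv", encoding="utf-8", newline="") as infile: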
    for row in csv.reader(infile):
        text = json.loads(row[1]).strip()
        texts.append(text)
        emojis.append(json.loads(row[2]))

print(f'Text count: {len(texts)}')
print(f'Emojis count: {len(emojis)}')
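
The loader above assumes that each CSV field is itself a JSON-encoded string, with the tweet text in column 1 and the emoji in column 2. Here is a minimal sketch of that round trip on a made-up row (the row contents and the meaning of column 0 are assumptions, not taken from the dataset):

```python
import csv
import io
import json

# Hypothetical row: an id, then a JSON-encoded tweet text and a JSON-encoded emoji
buffer = io.StringIO()
csv.writer(buffer).writerow([0, json.dumps("  good morning  "), json.dumps("☀")])

# Read the row back the same way the loader does
buffer.seek(0)
row = next(csv.reader(buffer))

sample_text = json.loads(row[1]).strip()  # decode the JSON string, then trim whitespace
sample_emoji = json.loads(row[2])
print(repr(sample_text), repr(sample_emoji))
```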

Let's build an analyzer.

The analyzer will take in a tokenizer and then be able to tokenize a list of tweets.

In [None]:
from typing import Iterable

class Analyzer:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    
    def analyze_all(self, tweets: Iterable[str]) -> Iterable[Iterable[str]]:
        return [self.analyze(tweet) for tweet in tweets]
            
    def analyze(self, tweet_text: str) -> Iterable[str]:
        tokens = self.tokenizer(tweet_text)
        return list(tokens)

First, let's make a simple regex tokenizer that splits on whitespace.

In [None]:
import re

def simple_re_tokenizer(text):
    # Raw string, so that \s reaches the regex engine without an invalid-escape warning
    return re.compile(r'\s').split(text)
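
One detail worth knowing about this tokenizer: splitting on a single whitespace character produces an empty string wherever two whitespace characters are adjacent. A small sketch of that behaviour (the sample sentence is made up for illustration):

```python
import re

WHITESPACE = re.compile(r'\s')

sample = "Hello  how are you\n I'm fine"
tokens = WHITESPACE.split(sample)
print(tokens)  # contains '' between the two spaces and between '\n' and ' '

# For this sample, dropping the empty strings matches what str.split() returns
non_empty = [t for t in tokens if t]
print(non_empty == sample.split())
```

Whether the empty tokens matter depends on what consumes them, but it is worth keeping in mind when comparing tokenizers that skip them.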

In [None]:
analyzer = Analyzer(simple_re_tokenizer)
%timeit analyzer.analyze_all(texts)
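
The `%timeit` magic only exists inside IPython and Jupyter. If you want to run the same kind of measurement from a plain script, a rough equivalent can be sketched with the standard `timeit` module. The helper name `time_tokenizer` and the stand-in texts are assumptions for illustration:

```python
import re
import timeit

def time_tokenizer(tokenizer, texts, repeat=5, number=10):
    """Return the best average time in seconds for one pass over `texts`."""
    def run():
        return [list(tokenizer(text)) for text in texts]
    # Each timing in the list is the total for `number` passes
    timings = timeit.repeat(run, repeat=repeat, number=number)
    return min(timings) / number

sample_texts = ["Hello how are you", "I'm fine thanks"] * 1000  # stand-in data
best = time_tokenizer(re.compile(r'\s').split, sample_texts)
print(f"{best * 1e3:.3f} ms per pass")
```

Taking the minimum over several repeats reduces noise from other processes on the machine.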

## Regex in Cython
~10 ms isn't bad, but let's see if we can improve on that using the POSIX `regex.h` C library.

This was pretty gnarly code so I won't make you implement it.
Let's just walk through the code.

In [None]:
%%cython

# Import regex.h and its functions/structs/dtypes
# See more on importing C & C++ headers here: https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#referencing-c-header-files
cdef extern from "regex.h" nogil:
    ctypedef struct regmatch_t:
       int rm_so
       int rm_eo
    ctypedef struct regex_t:
       pass
    int REG_EXTENDED
    int regcomp(regex_t* preg, const char* regex, int cflags)
    int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
    void regfree(regex_t* preg) 

cdef regex_split(bytes pageContent, bytes regex):
    cdef int end_position
    cdef list results = list()
    cdef regex_t regex_obj
    cdef regmatch_t regmatch_obj[1]
    cdef int regex_res = 0
    cdef int current_str_pos = 0
    
    regcomp(&regex_obj, regex, REG_EXTENDED)
    regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)

    while regex_res == 0:
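        # rm_so is relative to the start of the slice: > 0 means a non-empty token precedes the match
        # (comparing against 1 would silently drop single-character tokens)
        if regmatch_obj[0].rm_so > 0: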
            end_position = current_str_pos + regmatch_obj[0].rm_so
            results.append(pageContent[current_str_pos : end_position])

        current_str_pos += regmatch_obj[0].rm_eo
        regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)
    cdef int bytes_len = len(pageContent)
    if current_str_pos != bytes_len:
        results.append(pageContent[current_str_pos : bytes_len])
    regfree(&regex_obj)
    return results

def cython_whitespace_tokenize(text):
    # \s is not portable in POSIX regexes; [[:space:]] is the standard whitespace class
    return [t.decode('utf8') for t in regex_split(text.encode('utf8'), b'[[:space:]]')]


![elvish](https://ci.memecdn.com/5509375.jpg)

In [None]:
print(cython_whitespace_tokenize("Hello how are you\n I'm fine"))
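
To walk through the logic of `regex_split` without the C declarations in the way, here is a rough pure-Python mirror of the loop: find the next match, keep any non-empty text before it, skip past it, and finally keep any trailing text. It uses Python's `re` in place of `regexec`, and the name `regex_split_py` is hypothetical. It is a sketch of the intended behaviour, not a replacement for the Cython code.

```python
import re

def regex_split_py(data: bytes, pattern: bytes) -> list:
    """Split `data` on every match of `pattern`, skipping empty tokens."""
    regex = re.compile(pattern)
    results = []
    pos = 0  # plays the role of current_str_pos
    match = regex.search(data, pos)
    while match is not None:
        if match.end() == match.start():
            # A zero-width match would never advance `pos`
            raise ValueError("pattern must not match the empty string")
        if match.start() > pos:
            # Non-empty text between the previous separator and this one
            results.append(data[pos:match.start()])
        pos = match.end()
        match = regex.search(data, pos)
    if pos < len(data):
        # Trailing text after the last separator
        results.append(data[pos:])
    return results

print(regex_split_py(b"Hello how are you\n I'm fine", rb"\s"))
```

Note that `regex.search(data, pos)` searches from an offset without copying, whereas slicing `pageContent[current_str_pos:]` in the Cython version builds a new bytes object on every iteration.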

Now let's test the performance of our tokenizer.

In [None]:
analyzer = Analyzer(cython_whitespace_tokenize)
%timeit analyzer.analyze_all(texts)

So we can see that our custom C regex was not as fast as Python's regex.

Python's regex runs very fast.
Not just fast for Python, but fast compared to other languages.
You can also import re2, which reportedly runs about 60% faster than the standard library `re` module.

Let's try building our own splitter in Cython that splits on the space character `' '`.
To make things easier, you can access the C++ standard library.

In [None]:
def simple_whitespace_tokenizer(text):
    '''
    Split on the space character ' ', skipping empty tokens.
    '''
    # We could just use text.split(' '), but we want code that is closer to
    # Cython, to make the Cython version easier to write.
    results = []
    last_whitespace = 0  # index where the current token starts
    length = len(text)
    for i in range(length):
        if text[i] == ' ':
            if last_whitespace < i:
                results.append(text[last_whitespace : i])
            last_whitespace = i + 1
    # Flush the final token, if there is one
    if last_whitespace < length:
        results.append(text[last_whitespace : length])
    return results

In [None]:
def simple_whitespace_tokenizer(text):
    '''
    Split on the space character ' '.
    '''
    # We could just use text.split(' '), but we want code that is closer to
    # Cython, to make the Cython version easier to write.
    # So try to use for loops for this example.
    raise NotImplementedError()
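
If you write your own version of this tokenizer, it helps to check it against a known-good reference before timing it. The sketch below assumes the intended behaviour is "split on `' '` and drop empty tokens"; the names `reference_space_tokenize`, `check_tokenizer` and `SAMPLES` are made up for illustration.

```python
def reference_space_tokenize(text):
    """Reference behaviour: split on ' ' and drop empty tokens."""
    return [token for token in text.split(' ') if token]

def check_tokenizer(tokenize, samples):
    """Return the samples on which `tokenize` disagrees with the reference."""
    return [s for s in samples if list(tokenize(s)) != reference_space_tokenize(s)]

SAMPLES = ["", " ", "a", "a b", "  leading", "trailing  ", "two  spaces", "tab\tstays"]

# Example: plain str.split(' ') keeps empty strings, so it disagrees on some samples
failures = check_tokenizer(lambda s: s.split(' '), SAMPLES)
print(failures)
```

A correct implementation should make `check_tokenizer(simple_whitespace_tokenizer, SAMPLES)` return an empty list.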

In [None]:
analyzer = Analyzer(simple_whitespace_tokenizer)
%timeit analyzer.analyze_all(texts)


Now implement the same code in Cython.

In [None]:
%%cython

cdef c_tokenize(char *text):
    cdef int last_whitespace = 0
    cdef int length = len(text)
    cdef int i
    results = []
    for i in range(length):
        if text[i] == b' ':
            if last_whitespace < i:
                results.append(text[last_whitespace : i])
            last_whitespace = i + 1
    if last_whitespace < length:
        # Slice to `length` rather than i+1, so we never depend on the loop variable after the loop
        results.append(text[last_whitespace : length])
    return results

def cython_space_tokenize(text):
    return [t.decode('utf8') for t in c_tokenize(text.encode('utf8'))]


In [None]:
%%cython
# distutils: language = c++

# Import C++ classes from the standard library, if you need them
from libcpp.pair cimport pair
from libcpp.string cimport string
from libcpp.vector cimport vector

cdef c_tokenize(char *text):
    # Split the characters wherever the char == b' '.
    # Return a Python list of bytes objects, e.g. [b'hi', b"I'm", b'bob'].
    # Remember that you can implement this in pure Python first, then gradually change variables to C variables for speed.
    raise NotImplementedError()

def cython_space_tokenize(text):
    # Take the Python string and convert it to UTF-8 encoded bytes
    return [t.decode('utf8') for t in c_tokenize(text.encode('utf8'))]
 

In [None]:
analyzer = Analyzer(cython_space_tokenize)
%timeit analyzer.analyze_all(texts)


As we can see, we are now at just half the speed of Python's regex.
But that isn't a fair comparison. So let's see how fast Python is if we split on whitespace with the built-in `str.split()`.

In [None]:
def simple_space_tokenizer(text):
    return text.split()

analyzer = Analyzer(simple_space_tokenizer)
%timeit analyzer.analyze_all(texts)

## Conclusion

Sometimes Python is just fast. String manipulation in Python is very fast.
It is often better to look at performance and then decide if you need to optimize.

Don't prematurely optimize.

Use a profiler to look for hotspots in code if you need performance gains.
The built-in `cProfile` module is a great tool to use.
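
As a starting point, here is a minimal sketch of profiling a tokenizing workload with `cProfile` and `pstats`. The `tokenize_all` function and the sample texts are stand-ins for illustration; in practice you would profile your own code path.

```python
import cProfile
import io
import pstats

def tokenize_all(texts):
    """Stand-in workload: tokenize every text with str.split()."""
    return [text.split() for text in texts]

sample_texts = ["Hello how are you", "I'm fine thanks"] * 5000  # stand-in data

# Collect profiling data only around the call we care about
profiler = cProfile.Profile()
profiler.enable()
tokenize_all(sample_texts)
profiler.disable()

# Write the five most expensive entries, sorted by cumulative time, into a string
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The entries near the top of the report are the hotspots worth looking at before reaching for Cython.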