# Word2vecTokenizer

This module provides the `W2VTokenizer` class, which transforms the `text` of a document into `tokens`.
It keeps only those tokens that appear in the vocabulary of the corresponding embedding model,
and combines adjacent tokens into a phrase when that phrase itself appears in the vocabulary.

---
## Setup and Settings
---

In [124]:
from __init__ import init_vars
init_vars(vars(), ('info', {}), ('runvars', {}))

import re
    
import data
import config
from base import nbprint
from widgetbase import nbbox
from util import ProgressIterator, add_method

from embedding.main import get_model

import tokenizer.common
from tokenizer.token_util import TokenizerBase
from tokenizer.default_tokenizer import DefaultTokenizer
from tokenizer.widgets import token_picker, run_and_compare, show_comparison

if RUN_SCRIPT: token_picker(info, runvars, 'C')


Data Name,Reuters (exists)
Token,missing
Token Version,Cw2v
Class,embedding_tokenizer.W2VTokenizer
Settings,
Document,
Id,15
,"EC SUGAR TENDER HARD TO PREDICT - LONDON TRADE The outcome of today's European Community (EC) white sugar tender is extremely difficult to predict after last week's substantial award of 102,350 tonnes at the highest ever rebate of 46.864 European currency units (Ecus) per 100 kilos, traders said. Some said they believed the tonnage would probably be smaller, at around 60,000 tonnes, but declined to give a view on the likely restitution. Last week, the European Commission accepted 785,000 tonnes of sugar into intervention by operators protesting about low rebates. This might be a determining factor in today's result, they added."





---
## Tokenize Document
---
The following functions constitute the `W2VTokenizer` class, which transforms the raw text of a document into tokens.

In [170]:
class W2VTokenizer(TokenizerBase):
    def __init__(self, *args, **kwargs):
        super().__init__(*args,**kwargs)
        self.embedding_model = get_model(self.info)
        self.filter = self.embedding_model.filter.filter

if RUN_SCRIPT:
    nbbox()
    w2v_tokenizer = W2VTokenizer(info)
    w2v_tokenizer.text = runvars['document']['text']
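
The methods in the following cells are attached to the class with `add_method` from `util`, whose source is not shown in this notebook. A minimal sketch of how such a decorator might work, assuming it simply sets the decorated function as an attribute on the class (the `Greeter` class and `greet` function are hypothetical and serve only as illustration):

```python
def add_method(cls):
    """Hypothetical stand-in for util.add_method: attach a function to cls."""
    def decorator(func):
        # Register the function under its own name so instances can call it.
        setattr(cls, func.__name__, func)
        return func
    return decorator


class Greeter:
    def __init__(self, name):
        self.name = name


@add_method(Greeter)
def greet(self):
    return 'hello ' + self.name


greeting = Greeter('reuters').greet()
print(greeting)
```

Defining methods this way lets a notebook spread one class over several cells, so each step can be run and inspected on its own.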




### Prepare Text

This step lowercases all characters and replaces the following:
- `separator_token` by `separator_token_replacement`
- `#` by nothing
- URLs (starting with `http://`, `https://`, or `www.`) by a single space
- each run of whitespace by a single space

In [171]:
# Raw strings avoid invalid-escape warnings for \s and \.
_re_whitespace = re.compile(r'\s+', re.UNICODE)
_re_url = re.compile(r'(http://\S+)|(https://\S+)|(www\.\S+)')

@add_method(W2VTokenizer)
def prepare(self):
    self.text = self.text.lower()
    self.text = self.text.replace(tokenizer.common.separator_token,tokenizer.common.separator_token_replacement)
    self.text = self.text.replace('#', '')
    self.text, count = _re_url.subn(' ', self.text)
    self.text, count = _re_whitespace.subn(' ', self.text)
    
if RUN_SCRIPT:
    run_and_compare(w2v_tokenizer, w2v_tokenizer.prepare, 'text')

Before,After
"EC SUGAR TENDER HARD TO PREDICT - LONDON TRADE The outcome of today's European Community (EC) white sugar tender is extremely difficult to predict after last week's substantial award of 102,350 tonnes at the highest ever rebate of 46.864 European currency units (Ecus) per 100 kilos, traders said. Some said they believed the tonnage would probably be smaller, at around 60,000 tonnes, but declined to give a view on the likely restitution. Last week, the European Commission accepted 785,000 tonnes of sugar into intervention by operators protesting about low rebates. This might be a determining factor in today's result, they added.","ec sugar tender hard to predict - london trade the outcome of today's european community (ec) white sugar tender is extremely difficult to predict after last week's substantial award of 102,350 tonnes at the highest ever rebate of 46.864 european currency units (ecus) per 100 kilos, traders said. some said they believed the tonnage would probably be smaller, at around 60,000 tonnes, but declined to give a view on the likely restitution. last week, the european commission accepted 785,000 tonnes of sugar into intervention by operators protesting about low rebates. this might be a determining factor in today's result, they added."

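The Reuters sample contains no URLs, `#` characters, or separator tokens, so the comparison above only shows the lowercasing. The following standalone sketch exercises the remaining replacements. The two separator values are assumptions made for illustration; the real ones are defined in `tokenizer.common` and are not shown in this notebook.

```python
import re

# Assumed values; the real ones live in tokenizer.common.
SEPARATOR_TOKEN = ';'
SEPARATOR_TOKEN_REPLACEMENT = ','

_re_whitespace = re.compile(r'\s+')
_re_url = re.compile(r'(http://\S+)|(https://\S+)|(www\.\S+)')


def prepare_text(text):
    """Standalone version of the prepare step, in the same order as above."""
    text = text.lower()
    text = text.replace(SEPARATOR_TOKEN, SEPARATOR_TOKEN_REPLACEMENT)
    text = text.replace('#', '')
    text = _re_url.sub(' ', text)          # a URL becomes a single space
    return _re_whitespace.sub(' ', text)   # then whitespace runs collapse


prepared = prepare_text('See  https://example.com/a?b=1 for #Sugar;\tPRICES')
print(prepared)  # see for sugar, prices
```

Removing `#` at this stage matters because the next step reuses `#` as the placeholder for digits.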

### Replace numbers

Every digit is replaced by `#`, so `102,350` becomes `###,###`. This includes all characters in the Unicode 'Number, Decimal Digit' category.

In [172]:
_re_decimal = re.compile(r'\d', re.UNICODE)

@add_method(W2VTokenizer)
def replace_numbers(self):
    self.text, count = _re_decimal.subn('#', self.text)
    
if RUN_SCRIPT:
    run_and_compare(w2v_tokenizer, w2v_tokenizer.replace_numbers, 'text')

Before,After
"ec sugar tender hard to predict - london trade the outcome of today's european community (ec) white sugar tender is extremely difficult to predict after last week's substantial award of 102,350 tonnes at the highest ever rebate of 46.864 european currency units (ecus) per 100 kilos, traders said. some said they believed the tonnage would probably be smaller, at around 60,000 tonnes, but declined to give a view on the likely restitution. last week, the european commission accepted 785,000 tonnes of sugar into intervention by operators protesting about low rebates. this might be a determining factor in today's result, they added.","ec sugar tender hard to predict - london trade the outcome of today's european community (ec) white sugar tender is extremely difficult to predict after last week's substantial award of ###,### tonnes at the highest ever rebate of ##.### european currency units (ecus) per ### kilos, traders said. some said they believed the tonnage would probably be smaller, at around ##,### tonnes, but declined to give a view on the likely restitution. last week, the european commission accepted ###,### tonnes of sugar into intervention by operators protesting about low rebates. this might be a determining factor in today's result, they added."

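To illustrate which characters count as digits here, the sketch below applies the same pattern to a few hand-picked samples (not taken from the dataset) and prints the Unicode category of the last character of each. In Python 3, `\d` on a `str` pattern matches category `Nd` only, so superscripts (`No`) and Roman numerals (`Nl`) are left untouched.

```python
import re
import unicodedata

_re_decimal = re.compile(r'\d')

samples = [
    '46.864',          # ASCII digits
    '\u0664\u0662',    # Arabic-Indic digits four and two
    'x\u00b2',         # superscript two
    '\u2167',          # Roman numeral eight
]

replaced = [_re_decimal.sub('#', s) for s in samples]
categories = [unicodedata.category(s[-1]) for s in samples]

for sample, result, category in zip(samples, replaced, categories):
    print(repr(sample), '->', repr(result), category)
```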

### Split at breaking characters

This step splits the string into substrings $s_i$ at every sequence of breaking characters, i.e. characters that are neither word characters (`\w`), whitespace (`\s`), apostrophes (`'` or `’`), nor the number placeholder `#`. Later, the algorithm only tries to combine tokens within the same $s_i$ into phrases, never tokens from different substrings.

In [173]:
_re_breaking = re.compile(r"[^\w\s'’#]+", re.UNICODE)

@add_method(W2VTokenizer)
def split_text(self):
    self.subtexts = _re_breaking.split(self.text)
    
if RUN_SCRIPT:
    run_and_compare(w2v_tokenizer, w2v_tokenizer.split_text, 'text', 'subtexts')

Before,After
"ec sugar tender hard to predict - london trade the outcome of today's european community (ec) white sugar tender is extremely difficult to predict after last week's substantial award of ###,### tonnes at the highest ever rebate of ##.### european currency units (ecus) per ### kilos, traders said. some said they believed the tonnage would probably be smaller, at around ##,### tonnes, but declined to give a view on the likely restitution. last week, the european commission accepted ###,### tonnes of sugar into intervention by operators protesting about low rebates. this might be a determining factor in today's result, they added.",ec sugar tender hard to predict ; london trade the outcome of today's european community ; ec; white sugar tender is extremely difficult to predict after last week's substantial award of ###; ### tonnes at the highest ever rebate of ##; ### european currency units ; ecus; per ### kilos; traders said; some said they believed the tonnage would probably be smaller; at around ##; ### tonnes; but declined to give a view on the likely restitution; last week; the european commission accepted ###; ### tonnes of sugar into intervention by operators protesting about low rebates; this might be a determining factor in today's result; they added;

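A shorter input makes the split points easier to see. The sketch below applies the same character class to a hand-written string. Note that adjacent breaking characters form a single split point, and that a breaking character at the end of the text leaves an empty trailing substring.

```python
import re

# Same character class as above: split at anything that is not a word
# character, whitespace, an apostrophe, or the digit placeholder '#'.
_re_breaking = re.compile(r"[^\w\s'’#]+")

text = "community (ec) award of ###,### tonnes, today's result."
subtexts = _re_breaking.split(text)
print(subtexts)
# ['community ', 'ec', ' award of ###', '### tonnes', " today's result", '']
```

The apostrophe in `today's` does not split the text, which keeps possessives together as one token.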

### Split at nonbreaking characters

In [174]:
@add_method(W2VTokenizer)
def split_subtexts(self):
    self.tokenlists = [subtext.split()
                      for subtext in self.subtexts
                      if len(subtext) > 0]
    
if RUN_SCRIPT:
    run_and_compare(w2v_tokenizer, w2v_tokenizer.split_subtexts, 'subtexts', 'tokenlists')

Before,After
ec sugar tender hard to predict ; london trade the outcome of today's european community ; ec; white sugar tender is extremely difficult to predict after last week's substantial award of ###; ### tonnes at the highest ever rebate of ##; ### european currency units ; ecus; per ### kilos; traders said; some said they believed the tonnage would probably be smaller; at around ##; ### tonnes; but declined to give a view on the likely restitution; last week; the european commission accepted ###; ### tonnes of sugar into intervention by operators protesting about low rebates; this might be a determining factor in today's result; they added;,"['ec', 'sugar', 'tender', 'hard', 'to', 'predict']; ['london', 'trade', 'the', 'outcome', 'of', ""today's"", 'european', 'community']; ['ec']; ['white', 'sugar', 'tender', 'is', 'extremely', 'difficult', 'to', 'predict', 'after', 'last', ""week's"", 'substantial', 'award', 'of', '###']; ['###', 'tonnes', 'at', 'the', 'highest', 'ever', 'rebate', 'of', '##']; ['###', 'european', 'currency', 'units']; ['ecus']; ['per', '###', 'kilos']; ['traders', 'said']; ['some', 'said', 'they', 'believed', 'the', 'tonnage', 'would', 'probably', 'be', 'smaller']; ['at', 'around', '##']; ['###', 'tonnes']; ['but', 'declined', 'to', 'give', 'a', 'view', 'on', 'the', 'likely', 'restitution']; ['last', 'week']; ['the', 'european', 'commission', 'accepted', '###']; ['###', 'tonnes', 'of', 'sugar', 'into', 'intervention', 'by', 'operators', 'protesting', 'about', 'low', 'rebates']; ['this', 'might', 'be', 'a', 'determining', 'factor', 'in', ""today's"", 'result']; ['they', 'added']"
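
Reading the code above, each non-empty substring is split at whitespace into its own token list. The sketch below runs the same list comprehension on a small hypothetical input to show two edge cases: the empty string is dropped, while a whitespace-only substring passes the length check and yields an empty token list.

```python
# Hypothetical substrings, as they might come out of the previous step.
subtexts = ['community ', 'ec', ' award of ###', ' ', '']

tokenlists = [subtext.split()
              for subtext in subtexts
              if len(subtext) > 0]
print(tokenlists)
# [['community'], ['ec'], ['award', 'of', '###'], []]
```

An empty token list is harmless as long as the filter in the next step returns an empty list for it.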


### Filter

In [175]:
@add_method(W2VTokenizer)
def build_tokens(self):
    self.tokens = []
    for tokenlist in self.tokenlists:
        self.tokens.extend(self.filter(tokenlist))
    
if RUN_SCRIPT:
    run_and_compare(w2v_tokenizer, w2v_tokenizer.build_tokens, 'tokenlists', 'tokens')

Before,After
"['ec', 'sugar', 'tender', 'hard', 'to', 'predict']; ['london', 'trade', 'the', 'outcome', 'of', ""today's"", 'european', 'community']; ['ec']; ['white', 'sugar', 'tender', 'is', 'extremely', 'difficult', 'to', 'predict', 'after', 'last', ""week's"", 'substantial', 'award', 'of', '###']; ['###', 'tonnes', 'at', 'the', 'highest', 'ever', 'rebate', 'of', '##']; ['###', 'european', 'currency', 'units']; ['ecus']; ['per', '###', 'kilos']; ['traders', 'said']; ['some', 'said', 'they', 'believed', 'the', 'tonnage', 'would', 'probably', 'be', 'smaller']; ['at', 'around', '##']; ['###', 'tonnes']; ['but', 'declined', 'to', 'give', 'a', 'view', 'on', 'the', 'likely', 'restitution']; ['last', 'week']; ['the', 'european', 'commission', 'accepted', '###']; ['###', 'tonnes', 'of', 'sugar', 'into', 'intervention', 'by', 'operators', 'protesting', 'about', 'low', 'rebates']; ['this', 'might', 'be', 'a', 'determining', 'factor', 'in', ""today's"", 'result']; ['they', 'added']",ec; sugar; tender; HARD_TO; predict; london; trade; THE_OUTCOME_OF; today's; european; community; ec; white; sugar; tender; is; Extremely_Difficult; TO_PREDICT; after; last; week's; substantial; award; oF; ###; ###; tonnes; at; THE_HIGHEST; ever; rebate; oF; ##; ###; european; currency; units; ECUs; per; ###; kilos; traders; said; some; SAID_THEY; believed; the; tonnage; would; probably; be; smaller; at; around; ##; ###; tonnes; but; DECLINED_TO; give; a_; view; ON_THE; likely; restitution; LAST_WEEK; THE_EUROPEAN_COMMISSION; accepted; ###; ###; TONNES_OF; sugar; into; intervention; by; operators; protesting; about; low; rebates; this; Might_Be; a_; determining; FACTOR_IN; today's; result; they; added

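The `filter` callable comes from the embedding model and its implementation is not part of this notebook. As a rough illustration of the idea only, the sketch below uses a greedy longest-match strategy: at each position it tries the longest phrase first and drops tokens that match nothing. The function name, the `max_len` parameter, the underscore joining, and the toy vocabulary are all assumptions, not the model's actual behaviour.

```python
def greedy_phrase_filter(tokens, vocabulary, max_len=3):
    """Keep in-vocabulary tokens, preferring the longest phrase at each position."""
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = '_'.join(tokens[i:i + n])
            if candidate in vocabulary:
                result.append(candidate)
                i += n
                break
        else:
            i += 1  # neither a phrase nor the single token matched: drop it
    return result


# Toy vocabulary, made up for this example.
vocabulary = {'last_week', 'the_european_commission', 'accepted', 'european', 'the'}
tokens = greedy_phrase_filter(
    ['last', 'week', 'the', 'european', 'commission', 'accepted', 'xyzzy'],
    vocabulary,
)
print(tokens)
# ['last_week', 'the_european_commission', 'accepted']
```

The mixed casing in the real output above (`HARD_TO`, `ECUs`, `Might_Be`) suggests that the actual filter also maps tokens to the casing stored in the model vocabulary. The sketch does not attempt that.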

---
## Complete function
---

In [177]:
@add_method(W2VTokenizer)
def tokenize(self, text, *args):
    self.text = text
    self.prepare()
    self.replace_numbers()
    self.split_text()
    self.split_subtexts()
    self.build_tokens()
    return self.tokens
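
For experimenting outside the notebook, the same sequence of steps can be sketched as a single standalone function. This version is a simplification under stated assumptions: it omits the separator-token replacement and takes the filter as a parameter, with `list` as a stand-in that keeps every token instead of consulting an embedding model.

```python
import re

_re_whitespace = re.compile(r'\s+')
_re_url = re.compile(r'(http://\S+)|(https://\S+)|(www\.\S+)')
_re_decimal = re.compile(r'\d')
_re_breaking = re.compile(r"[^\w\s'’#]+")


def tokenize(text, token_filter=list):
    """Standalone sketch; token_filter stands in for the model's filter."""
    # Prepare: lowercase, strip '#', drop URLs, collapse whitespace.
    text = text.lower().replace('#', '')
    text = _re_url.sub(' ', text)
    text = _re_whitespace.sub(' ', text)
    # Replace every decimal digit by the placeholder.
    text = _re_decimal.sub('#', text)
    # Split at breaking characters, then at whitespace, then filter.
    tokens = []
    for subtext in _re_breaking.split(text):
        if subtext:
            tokens.extend(token_filter(subtext.split()))
    return tokens


tokens = tokenize('Award of 102,350 tonnes (EC), traders said.')
print(tokens)
# ['award', 'of', '###', '###', 'tonnes', 'ec', 'traders', 'said']
```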

---
## Test tokenizer
---

In [178]:
if RUN_SCRIPT:
    w2v_tokenizer.tokenize(runvars['document']['text'])
    show_comparison(runvars['document']['text'], w2v_tokenizer.tokens, 'Text', 'Tokens')

Text,Tokens
"EC SUGAR TENDER HARD TO PREDICT - LONDON TRADE The outcome of today's European Community (EC) white sugar tender is extremely difficult to predict after last week's substantial award of 102,350 tonnes at the highest ever rebate of 46.864 European currency units (Ecus) per 100 kilos, traders said. Some said they believed the tonnage would probably be smaller, at around 60,000 tonnes, but declined to give a view on the likely restitution. Last week, the European Commission accepted 785,000 tonnes of sugar into intervention by operators protesting about low rebates. This might be a determining factor in today's result, they added.",ec; sugar; tender; HARD_TO; predict; london; trade; THE_OUTCOME_OF; today's; european; community; ec; white; sugar; tender; is; Extremely_Difficult; TO_PREDICT; after; last; week's; substantial; award; oF; ###; ###; tonnes; at; THE_HIGHEST; ever; rebate; oF; ##; ###; european; currency; units; ECUs; per; ###; kilos; traders; said; some; SAID_THEY; believed; the; tonnage; would; probably; be; smaller; at; around; ##; ###; tonnes; but; DECLINED_TO; give; a_; view; ON_THE; likely; restitution; LAST_WEEK; THE_EUROPEAN_COMMISSION; accepted; ###; ###; TONNES_OF; sugar; into; intervention; by; operators; protesting; about; low; rebates; this; Might_Be; a_; determining; FACTOR_IN; today's; result; they; added
