In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import typing
import re

## Normalize text using a custom preprocessor

In [2]:
# for details on the docstring format used for this function, 
# see https://www.sphinx-doc.org/en/master/usage/restructuredtext/domains.html#python-signatures

def text_normalizer(text: str) -> str:
    '''
    Normalizes text (ex. $42.32 -> CURRENCY).
    
    :param str text: The raw text to be normalized
    :return: normalized text
    :rtype: str
    '''
    normalized = text.lower()

    # raw strings keep the regex escapes (\$, \d, \s) from being read as string escapes
    CURRENCY   = re.compile(r"\$\d[\d,]*\.?\d{0,2}")
    URL        = re.compile(r"https?://[^\s]+?(?=\.?$|[\.,]\s)")

    normalized = re.sub(pattern=CURRENCY, repl="CURRENCY", string=normalized)
    normalized = re.sub(pattern=URL, repl="URL", string=normalized)
    
    normalized = normalized.strip()
    return normalized

In [3]:
# test our normalizer
text_normalizer("It's going to cost you $23,030.12 or more.  Send a payment to http://scam-you-later.com.")

"it's going to cost you CURRENCY or more.  send a payment to URL."

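The two patterns behave a little differently at token boundaries, which is easy to miss. The sketch below re-declares them outside the function and probes a few inputs. `URL_RELAXED` is a hypothetical variant introduced only for illustration; it is not part of the normalizer above.

```python
import re

# Same patterns as in text_normalizer, written as raw strings.
CURRENCY = re.compile(r"\$\d[\d,]*\.?\d{0,2}")
URL = re.compile(r"https?://[^\s]+?(?=\.?$|[\.,]\s)")

# Hypothetical variant (an assumption, not the notebook's pattern):
# also stop at plain whitespace, so mid-sentence URLs are caught.
URL_RELAXED = re.compile(r"https?://\S+?(?=[.,]?(?:\s|$))")

# A trailing period is absorbed by the optional "\.?" in CURRENCY.
price_at_end = CURRENCY.sub("CURRENCY", "that costs $5.")

# The original URL pattern only matches when the URL is followed by
# end-of-string, or by "." / "," and then whitespace.
mid_sentence = URL.sub("URL", "visit http://a.com now")
mid_sentence_relaxed = URL_RELAXED.sub("URL", "visit http://a.com now")
end_of_sentence = URL_RELAXED.sub("URL", "pay at http://a.com.")

print(price_at_end)
print(mid_sentence)
print(mid_sentence_relaxed)
print(end_of_sentence)
```

Whether the relaxed behavior is what you want depends on your data, so treat it as a starting point rather than a drop-in replacement.
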
## Create our TF-IDF vectorizer

We'll register our `text_normalizer` with an instance of [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [4]:
vectorizer = TfidfVectorizer(
        encoding='utf-8',
        # let's register our custom normalization function
        preprocessor=text_normalizer,
        stop_words=None, 
        # we'll use word n-grams 
        # from size 1 (unigrams) to 3 (trigrams)
        ngram_range=(1, 3), 
        binary=False, 
        use_idf=True
)
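
A custom `preprocessor` replaces the vectorizer's built-in preprocessing step (accent stripping and lowercasing), while tokenization and n-gram generation still run. That is why an uppercase placeholder such as `CURRENCY` can survive as a feature. Here is a small sketch using `build_analyzer()`; `demo_normalizer` is a simplified stand-in for `text_normalizer` and is an assumption made for this example only.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer


def demo_normalizer(text: str) -> str:
    # Simplified stand-in for text_normalizer (currency only).
    return re.sub(r"\$\d[\d,]*\.?\d{0,2}", "CURRENCY", text.lower())


demo_vectorizer = TfidfVectorizer(
    preprocessor=demo_normalizer,
    ngram_range=(1, 2),
)

# build_analyzer() returns the callable that maps raw text to n-gram features.
# It does not require the vectorizer to be fitted.
analyzer = demo_vectorizer.build_analyzer()
features = analyzer("He paid $10 today")
print(features)
```

The printed list should contain the four unigrams and three bigrams, with `CURRENCY` left in uppercase.
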

In [7]:
docs_train = [
    "It's going to cost you $23,030.12 or more.", 
    "He charged me $10 for that banana.", 
    "Check this out: http://house-elves.com"
]

Fit the vectorizer on the toy data.  This is where we determine our vocabulary, IDF scores, etc.

Note that we will **not** fit again on our test data.  _We only calculate our vocabulary/features on our training data._
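
To see why this matters, here is a minimal sketch on made-up sentences (not the notebook's data). A vectorizer fitted on training data always produces vectors with the training vocabulary's columns. Refitting on test data produces a different feature space that a downstream model would not understand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy data for illustration only.
train = ["the cat sat", "the dog sat"]
test = ["the bird flew"]

# Fit once on training data, then only transform the test data.
fitted = TfidfVectorizer().fit(train)
test_vectors = fitted.transform(test)

# Refitting on the test data yields a different vocabulary and width.
refit = TfidfVectorizer().fit(test)
refit_vectors = refit.transform(test)

print(sorted(fitted.vocabulary_), test_vectors.shape)
print(sorted(refit.vocabulary_), refit_vectors.shape)
```
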

In [6]:
vectorizer.fit(docs_train)

TfidfVectorizer(ngram_range=(1, 3),
                preprocessor=<function text_normalizer at 0x7f7590312ee0>)

`.fit()` assigns each unique feature an ID that corresponds to its position in any feature vector we later generate using the `.transform()` method of our vectorizer.

## Feature vocabulary

`TfidfVectorizer` provides a vocabulary mapping (`item` $\rightarrow$ `index`).

In [6]:
vectorizer.vocabulary_

{'it': 25,
 'going': 19,
 'to': 41,
 'cost': 13,
 'you': 44,
 'CURRENCY': 0,
 'or': 32,
 'more': 31,
 'it going': 26,
 'going to': 20,
 'to cost': 42,
 'cost you': 14,
 'you CURRENCY': 45,
 'CURRENCY or': 3,
 'or more': 33,
 'it going to': 27,
 'going to cost': 21,
 'to cost you': 43,
 'cost you CURRENCY': 15,
 'you CURRENCY or': 46,
 'CURRENCY or more': 4,
 'he': 22,
 'charged': 7,
 'me': 28,
 'for': 16,
 'that': 36,
 'banana': 6,
 'he charged': 23,
 'charged me': 8,
 'me CURRENCY': 29,
 'CURRENCY for': 1,
 'for that': 17,
 'that banana': 37,
 'he charged me': 24,
 'charged me CURRENCY': 9,
 'me CURRENCY for': 30,
 'CURRENCY for that': 2,
 'for that banana': 18,
 'check': 10,
 'this': 38,
 'out': 34,
 'URL': 5,
 'check this': 11,
 'this out': 39,
 'out URL': 35,
 'check this out': 12,
 'this out URL': 40}
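
Notice that `CURRENCY` and `URL` sit at the lowest indices. Indices follow the sorted order of the feature strings, and uppercase letters sort before lowercase ones. A small sketch on made-up documents, using `lowercase=False` instead of a custom preprocessor to keep the example short:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents that already contain the placeholder tokens.
docs = ["pay CURRENCY now", "visit URL now"]

# lowercase=False keeps the uppercase placeholders intact.
demo = TfidfVectorizer(lowercase=False).fit(docs)

# Rebuild the mapping from the sorted feature strings and compare.
expected = {term: i for i, term in enumerate(sorted(demo.vocabulary_))}
print(demo.vocabulary_ == expected)
print(sorted(demo.vocabulary_, key=demo.vocabulary_.get))
```
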

Let's apply our vectorizer to some data using `.transform()`.

**QUESTION**: _Why shouldn't you use `fit()` or `fit_transform()` here?_

In [7]:
vectorizer.transform(
    [
        "It's going to cost you $23,030.12 or more.", 
        "Pay here: http://super-sketchy-site.info"
    ]
).todense()

matrix([[0.16765177, 0.        , 0.        , 0.22044193, 0.22044193,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.22044193, 0.22044193,
         0.22044193, 0.        , 0.        , 0.        , 0.22044193,
         0.22044193, 0.22044193, 0.        , 0.        , 0.        ,
         0.22044193, 0.22044193, 0.22044193, 0.        , 0.        ,
         0.        , 0.22044193, 0.22044193, 0.22044193, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.22044193, 0.22044193, 0.22044193, 0.22044193,
         0.22044193, 0.22044193],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         1.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
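         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ]])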

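The two distinct weights in the first row can be reproduced by hand. This sketch assumes the `TfidfVectorizer` defaults we did not override (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), so that $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$ and each row is divided by its L2 norm. In the first document every one of the 21 features occurs once; `CURRENCY` appears in 2 of the 3 training documents and the other 20 features in just 1.

```python
import math

n_docs = 3

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf_currency = math.log((1 + n_docs) / (1 + 2)) + 1  # df = 2
idf_other = math.log((1 + n_docs) / (1 + 1)) + 1     # df = 1

# Term frequency is 1 for every feature in this document,
# so the unnormalized weights are just the IDF values.
norm = math.sqrt(idf_currency**2 + 20 * idf_other**2)

currency_weight = idf_currency / norm
other_weight = idf_other / norm
print(round(currency_weight, 8), round(other_weight, 8))
```

The results should come out close to the `0.16765177` and `0.22044193` shown in the matrix above.
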
## Unknown features

What happens if we pass a datum composed solely of unseen/unknown features?

In [8]:
vectorizer.transform(["ZAMBORTANI DIEMPO"]).todense()

matrix([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [9]:
vectorizer.transform(["Kltpzyxm"]).todense()

matrix([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [10]:
vectorizer.transform(["$20.00"]).todense()

matrix([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

**QUESTION**: _Why is it that calling `.transform()` on either **ZAMBORTANI DIEMPO** or **Kltpzyxm** results in vectors of the same length?_

**QUESTION**: _Why are all values in both the **ZAMBORTANI DIEMPO** and **Kltpzyxm** vectors 0?_
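
If you want to check your answers empirically, here is a small sketch on a made-up vocabulary (not the notebook's data). It compares a document made only of unseen tokens with a document that contains exactly one known token.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training data; the fitted vocabulary has 4 unigram features.
toy = TfidfVectorizer().fit(["red apple", "green apple", "red pear"])

unseen = toy.transform(["zambortani diempo"])  # no known tokens
single = toy.transform(["pear"])               # exactly one known token

print(unseen.shape, unseen.toarray())
print(single.shape, single.toarray())
```

The width of both rows is fixed by the fitted vocabulary. A single known feature ends up with weight 1 because of the L2 normalization, which is the same effect seen for `"$20.00"` above.
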

## Scores -> features

For ease of inspection, let's create a reverse mapping from (`index` $\rightarrow$ `item`).

In [11]:
i2v = dict((i, v) for (v, i) in vectorizer.vocabulary_.items())

In [12]:
# what is the feature in the first position (index=0)?
i2v[0]

'CURRENCY'

In [13]:
# what is the feature in the sixth position (index=5)?
i2v[5]

'URL'
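
Because the indices are contiguous and start at 0, a plain list ordered by index works just as well as a dict. The vocabulary below is a hypothetical, shortened stand-in for `vectorizer.vocabulary_`, used only to keep the sketch self-contained.

```python
# Hypothetical excerpt of a fitted vocabulary (term -> column index).
vocabulary = {"it": 3, "CURRENCY": 0, "URL": 1, "banana": 2}

# Sort the terms by their column index to get an index -> term list.
index_to_term = sorted(vocabulary, key=vocabulary.get)

print(index_to_term[0])
print(index_to_term[1])
```

Recent scikit-learn releases (1.0 and later) also expose `get_feature_names_out()`, which returns the feature names in column order.
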

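Alternatively, we can transform some data and map the result back to feature names with `.inverse_transform()`. Note that we convert the sparse result to a dense array with `.toarray()` first (a plain NumPy array, unlike the `numpy.matrix` returned by `.todense()`), so that the names in each returned array are ordered by vocabulary index, starting with the `index=0` feature when it is present.

In [14]:
vectorizer.inverse_transform(
    vectorizer.transform(["It's going to cost you $23,030.12 or more."]).toarray()
)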

[array(['CURRENCY', 'CURRENCY or', 'CURRENCY or more', 'cost', 'cost you',
        'cost you CURRENCY', 'going', 'going to', 'going to cost', 'it',
        'it going', 'it going to', 'more', 'or', 'or more', 'to',
        'to cost', 'to cost you', 'you', 'you CURRENCY', 'you CURRENCY or'],
       dtype='<U19')]
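
`inverse_transform()` tells us which features are present but drops their scores. To see both, we can pair each nonzero column with its feature name. A minimal sketch on made-up data (the variable names here are illustrative, not from the notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training data for illustration only.
toy = TfidfVectorizer().fit(["red apple", "green apple", "red pear"])

# index -> term list, ordered by column index.
index_to_term = sorted(toy.vocabulary_, key=toy.vocabulary_.get)

# Dense 1-D row of TF-IDF scores for a single document.
row = toy.transform(["green apple"]).toarray()[0]

# Keep nonzero features and sort them from highest to lowest score.
weighted = sorted(
    ((index_to_term[i], float(w)) for i, w in enumerate(row) if w > 0),
    key=lambda pair: pair[1],
    reverse=True,
)

for term, weight in weighted:
    print(f"{term}: {weight:.4f}")
```

Here `green` should rank above `apple`, since it appears in fewer training documents and therefore has the higher IDF.
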