# About Tokenization
Tokenization splits a sentence into segments (tokens) on whitespace or punctuation. We can then use these tokens as features in a machine learning algorithm. How we tokenize the text in our DataFrame affects the statistics available to our model.
* Bag-of-words: count the number of times each token appears. This discards information about word order.
* N-gram: in addition to a column for every single token (a "1-gram"), we may have a column for every ordered sequence of N consecutive tokens.  
See [More](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)
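
To make the two representations concrete, here is a minimal sketch in plain Python. The helper `ngrams` and the sample sentence are made up for illustration and are not part of the notebook's data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample sentence, split on whitespace.
tokens = "the cat sat on the mat".split()

# Bag-of-words: 1-gram counts, word order is lost.
unigram_counts = Counter(ngrams(tokens, 1))

# 2-grams keep the order of each adjacent pair.
bigram_counts = Counter(ngrams(tokens, 2))

print(unigram_counts[("the",)])
print(sorted(bigram_counts))
```

The unigram counts cannot distinguish "the cat" from "cat the", while the bigram counts can.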

# Simple Bag of words

We can build the vocabulary using `Counter`, a dictionary subclass that maps each word to its count.

In [11]:
from collections import Counter
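# Build the vocabulary: token -> number of occurrences.
# `x` is the pandas Series of two reviews defined in the CountVectorizer cell below.
vocab = Counter()

for data in x:
    tokens = data.split()  # split on whitespace
    vocab.update(tokens)

tokens = list(vocab)  # unique tokens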

# Print the number of unique tokens and the token counts
msg = "There are {} tokens in data if we split on whitespace"
print(msg.format(len(tokens)))
print(vocab.items())

There are 155 tokens in data if we split on whitespace
dict_items([('If', 1), ('you', 3), ('like', 4), ('adult', 1), ('comedy', 1), ('cartoons,', 1), ('South', 1), ('Park,', 1), ('then', 1), ('this', 2), ('is', 6), ('nearly', 1), ('a', 4), ('similar', 1), ('format', 1), ('about', 1), ('the', 15), ('small', 2), ('adventures', 1), ('of', 2), ('three', 1), ('teenage', 1), ('girls', 1), ('at', 3), ('Bromwell', 1), ('High.', 1), ('Keisha,', 1), ('Natella', 1), ('and', 6), ('Latrina', 1), ('have', 3), ('given', 1), ('exploding', 1), ('sweets', 1), ('behaved', 1), ('bitches,', 1), ('I', 2), ('think', 1), ('Keisha', 1), ('good', 1), ('leader.', 1), ('There', 1), ('are', 1), ('also', 2), ('stories', 1), ('going', 1), ('on', 1), ('with', 1), ('teachers', 1), ('school.', 1), ("There's", 1), ('idiotic', 1), ('principal,', 1), ('Mr.', 1), ('Bip,', 1), ('nervous', 1), ('Maths', 1), ('teacher', 1), ('many', 1), ('others.', 1), ('The', 2), ('cast', 1), ('fantastic,', 1), ('Lenny', 1), ("Henry's

`scikit-learn` provides `CountVectorizer` for bag-of-words. It will:
* Tokenize all the strings
* Build a 'vocabulary' containing all the tokens
* Count the occurrences of each token in the vocabulary

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
data1 = "If you like adult comedy cartoons, like South Park, then this is nearly a similar format about the small adventures of three teenage girls at Bromwell High. Keisha, Natella and Latrina have given exploding sweets and behaved like bitches, I think Keisha is a good leader. There are also small stories going on with the teachers of the school. There's the idiotic principal, Mr. Bip, the nervous Maths teacher and many others. The cast is also fantastic, Lenny Henry's Gina Yashere, EastEnders Chrissie Watts, Tracy-Ann Oberman, Smack The Pony's Doon Mackichan, Dead Ringers' Mark Perry and Blunder's Nina Conti. I didn't know this came from Canada, but it is very good. Very good!"
data2 = 'All the world\'s a stage and its people actors in it"--or something like that. Who the hell said that theatre stopped at the orchestra pit--or even at the theatre door? Why is not the audience participants in the theatrical experience, including the story itself?<br /><br />This film was a grand experiment that said: "Hey! the story is you and it needs more than your attention, it needs your active participation". "Sometimes we bring the story to you, sometimes you have to go to the story."<br /><br />Alas no one listened, but that does not mean it should not have been said.'
x = pd.Series([data1, data2]) 

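# Regular expression that matches runs of non-whitespace characters.
# The lookahead (?=\\s+) requires whitespace after the token, so a token at
# the very end of a string (with nothing after it) is not matched.
TOKEN_BASIC = '\\S+(?=\\s+)'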

# Instantiate the CountVectorizer
vec_basic = CountVectorizer(token_pattern=TOKEN_BASIC)

# Fit to the data
vec_basic.fit(x)

# Print the number of tokens and the full vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = vec_basic.get_feature_names_out().tolist()
msg = "There are {} tokens in data if we split on whitespace"
print(msg.format(len(feature_names)))
print(feature_names)
print(vec_basic)

There are 151 tokens in data if we split on whitespace
['"hey!', '"sometimes', '/><br', '/>alas', '/>this', 'a', 'about', 'active', 'actors', 'adult', 'adventures', 'all', 'also', 'and', 'are', 'at', 'attention,', 'audience', 'been', 'behaved', 'bip,', 'bitches,', "blunder's", 'bring', 'bromwell', 'but', 'came', 'canada,', 'cartoons,', 'cast', 'chrissie', 'comedy', 'conti.', 'dead', "didn't", 'does', 'doon', 'door?', 'eastenders', 'even', 'experience,', 'experiment', 'exploding', 'fantastic,', 'film', 'format', 'from', 'gina', 'girls', 'given', 'go', 'going', 'good', 'good.', 'grand', 'have', 'hell', "henry's", 'high.', 'i', 'idiotic', 'if', 'in', 'including', 'is', 'it', 'it"--or', 'its', 'itself?<br', 'keisha', 'keisha,', 'know', 'latrina', 'leader.', 'lenny', 'like', 'listened,', 'mackichan,', 'many', 'mark', 'maths', 'mean', 'more', 'mr.', 'natella', 'nearly', 'needs', 'nervous', 'nina', 'no', 'not', 'oberman,', 'of', 'on', 'one', 'orchestra', 'others.', 'park,', 'participan
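
The whitespace pattern used above ends with a lookahead, which changes what gets matched at the end of a string. The sketch below applies the pattern directly with `re.findall` on a short made-up string. It only approximates what the vectorizer does, since `CountVectorizer` also lowercases by default, and `TOKEN_NO_LOOKAHEAD` is an assumed alternative shown only for comparison.

```python
import re

TOKEN_BASIC = r"\S+(?=\s+)"   # same pattern as in the cell above
TOKEN_NO_LOOKAHEAD = r"\S+"   # assumed alternative, for comparison only

# Hypothetical text with no whitespace after the final word.
text = "it is very good. Very good!"

with_lookahead = re.findall(TOKEN_BASIC, text)
without_lookahead = re.findall(TOKEN_NO_LOOKAHEAD, text)

print(with_lookahead)     # the trailing "good!" is missing
print(without_lookahead)  # every whitespace-separated token is kept
```

If the last token of each document matters, dropping the lookahead or appending a trailing space to each string are two possible workarounds.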

# Text Preprocessing

## Remove punctuation
Tokenize on punctuation so that hyphens, underscores, and similar characters do not end up inside tokens.
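
As a rough illustration, the sketch below compares whitespace splitting with an alphanumeric-only pattern on a fragment of the first review. It uses a plain `[A-Za-z0-9]+` pattern without the lookahead found in the notebook's patterns, which is a simplification made for this example.

```python
import re

# Fragment taken from the first review text.
text = "Tracy-Ann Oberman, Smack The Pony's Doon Mackichan"

# Splitting on whitespace keeps punctuation attached to the words.
whitespace_tokens = text.split()

# Matching only alphanumeric runs also splits on hyphens, commas and apostrophes.
alnum_tokens = re.findall(r"[A-Za-z0-9]+", text)

print(whitespace_tokens)
print(alnum_tokens)
```

Note the trade-off: "Oberman," and "Oberman" now map to the same token, but "Pony's" is split into "Pony" and "s".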

## N-grams
Including both unigrams and bigrams in the model makes it more likely to capture information that is spread across multiple tokens, such as "middle school".

In [22]:
# Define the token pattern that contains only alphanumeric characters.
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate the CountVectorizer
vec = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                      ngram_range=(1,2))

## Interaction Terms

`PolynomialFeatures` generates polynomial and interaction features up to a degree that we specify.  
**Interaction terms** let us mathematically describe when tokens appear together.  
$$ \beta_{1} x_{1}+\beta_{2} x_{2}+\beta_{3}\left(x_{1} \times x_{2}\right) $$  
$ \beta_{3} $ measures how important it is that $ x_{1} $ and $ x_{2} $ appear together.
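
The formula can be evaluated by hand to see what the interaction term contributes. In this sketch the function `predict` and the coefficient values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def predict(x1, x2, beta1, beta2, beta3):
    """Linear model with one pairwise interaction term."""
    return beta1 * x1 + beta2 * x2 + beta3 * (x1 * x2)


# Hypothetical coefficients, for illustration only.
BETA1, BETA2, BETA3 = 0.5, 0.25, 2.0

only_x1 = predict(1, 0, BETA1, BETA2, BETA3)  # only token 1 present
only_x2 = predict(0, 1, BETA1, BETA2, BETA3)  # only token 2 present
both = predict(1, 1, BETA1, BETA2, BETA3)     # both tokens present

# The interaction term is what "both" adds beyond the two separate effects.
extra = both - (only_x1 + only_x2)
print(only_x1, only_x2, both, extra)
```

With binary indicators, the product is 1 only when both tokens occur, so the last value equals the interaction coefficient.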

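In the following `PolynomialFeatures` example, we multiply two columns together to see whether the tokens co-occur. `include_bias` controls whether a bias column of ones is added, which allows the model to have a non-zero y value where the x values are zero. The `degree` argument sets the polynomial degree of the interactions to compute, and `interaction_only=True` keeps only products of distinct features (no squared terms).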

In [31]:
from sklearn.preprocessing import PolynomialFeatures
data = {"x1": [0, 1], "x2": [1, 1]}
x = pd.DataFrame(data, columns=['x1', 'x2'])
x.rename(index={0:'a',1:'b'}, inplace=True)
x

|   | x1 | x2 |
|---|----|----|
| a | 0  | 1  |
| b | 1  | 1  |


In [32]:
interaction  = PolynomialFeatures(degree=2,
                                  interaction_only=True,
                                  include_bias=False)

interaction.fit_transform(x)

array([[0., 1., 0.],
       [1., 1., 1.]])

The X array grows quickly: with degree-2 interactions, n features produce n(n-1)/2 additional columns, which is on the order of n-squared.  
Because `PolynomialFeatures` did not support sparse matrices in older scikit-learn releases, we can use a custom interaction object, `SparseInteractions`.

In [33]:
from itertools import combinations

import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class SparseInteractions(BaseEstimator, TransformerMixin):
    def __init__(self, degree=2, feature_name_separator="_"):
        self.degree = degree
        self.feature_name_separator = feature_name_separator

    def fit(self, X, y=None):
        return self

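    def transform(self, X):
        # Read column names before converting: the sparse matrix no longer
        # carries the DataFrame's `columns` attribute.
        if hasattr(X, "columns"):
            self.orig_col_names = np.array([str(c) for c in X.columns])
        else:
            self.orig_col_names = np.array([str(i) for i in range(X.shape[1])])

        if not sparse.isspmatrix_csc(X):
            X = sparse.csc_matrix(X)

        spi = self._create_sparse_interactions(X)
        return spi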

    def get_feature_names(self):
        return self.feature_names

    def _create_sparse_interactions(self, X):
        out_mat = []
        self.feature_names = self.orig_col_names.tolist()

        for sub_degree in range(2, self.degree + 1):
            for col_ixs in combinations(range(X.shape[1]), sub_degree):
                # add name for new column
                name = self.feature_name_separator.join(self.orig_col_names[list(col_ixs)])
                self.feature_names.append(name)

                # get column multiplications value
                out = X[:, col_ixs[0]]
                for j in col_ixs[1:]:
                    out = out.multiply(X[:, j])

                out_mat.append(out)

        return sparse.hstack([X] + out_mat)

In [34]:
SparseInteractions(degree=2).fit_transform(x).toarray()

array([[0, 1, 0],
       [1, 1, 1]], dtype=int64)
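
To get a feel for how quickly interaction columns accumulate, the sketch below counts them with `math.comb`. The helper `n_interaction_columns` is hypothetical. It mirrors the loop in `SparseInteractions`, which adds one column per combination of 2 up to `degree` distinct features.

```python
from math import comb


def n_interaction_columns(n_features, degree=2):
    """Number of extra columns from products of 2..degree distinct features."""
    return sum(comb(n_features, d) for d in range(2, degree + 1))


# The two-column example above gains exactly one interaction column.
for n in (2, 100, 10_000):
    print(n, n_interaction_columns(n))
```

A vocabulary of 10,000 tokens already yields tens of millions of pairwise columns, which is why a sparse representation matters here.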

# Hash Function

We need to balance adding new features against the computational cost of the additional columns. For example, adding 3-grams or 4-grams enormously increases the size of the array, so we need more computational power to fit our model.  
Hashing is a way of limiting the size of the matrix that we create without sacrificing too much model accuracy.  

A hash function takes an input, in this case a token, and outputs a hash value. For example, the input may be a string and the hash value may be an integer.

When we use the hashing trick, we want to make the array of features as small as possible. Doing so is a form of **dimensionality reduction**.
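
The idea can be sketched in a few lines of plain Python. This is not how scikit-learn implements it: the helpers `hash_token` and `hashed_counts`, the use of MD5, and the bucket count of 8 are all assumptions chosen to keep the example deterministic and small.

```python
import hashlib


def hash_token(token, n_buckets=8):
    """Map a token to a column index in the range [0, n_buckets)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hashed_counts(tokens, n_buckets=8):
    """Count tokens into a fixed-width row, whatever the vocabulary size."""
    row = [0] * n_buckets
    for token in tokens:
        row[hash_token(token, n_buckets)] += 1
    return row


# Hypothetical sample sentence.
tokens = "the cat sat on the mat".split()
row = hashed_counts(tokens, n_buckets=8)
print(row)
```

The row always has `n_buckets` entries, so no vocabulary needs to be stored. The price is that two different tokens can land in the same bucket and become indistinguishable.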

In `sklearn`, we can use `HashingVectorizer` instead of `CountVectorizer`. `HashingVectorizer` maps every token to one of a pre-defined number of columns (set by `n_features`). Some columns may have multiple tokens that map to them.

In [39]:
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(norm=None,
                        token_pattern=TOKENS_ALPHANUMERIC,
                        ngram_range=(1,2))

In [42]:
x = [1,2,3]
np.array(x) > 2

array([False, False,  True])