# Bag of words in sklearn
Reference:
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length.

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

- tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.

- counting the occurrences of tokens in each document.

- normalizing and weighting tokens, with diminishing importance given to tokens that occur in the majority of samples / documents.
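
The first two steps above can be sketched in plain Python, without scikit-learn. This is only an illustration: the helper names `tokenize` and `bag_of_words` are made up for this sketch, and the regular expression is meant to mirror CountVectorizer's default token pattern (words of two or more characters), which is why a single-letter word such as "a" disappears.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    tokenized = [tokenize(doc) for doc in docs]
    # Assign an integer id to every distinct token, in alphabetical order
    distinct = sorted({tok for doc in tokenized for tok in doc})
    vocabulary = {tok: i for i, tok in enumerate(distinct)}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # One count per vocabulary entry, in id order
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(["Penny is a fish", "The cat ate a fish at the store."])
print(vocabulary)
print(rows)
```

Each row has one entry per vocabulary token, so documents of different lengths end up as vectors of the same size.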

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
pd.options.display.max_columns = 30
%matplotlib inline

In [3]:
texts = [
    "Penny bought bright blue fishes.",
    "Penny bought bright blue and orange fish.",
    "The cat ate a fish at the store.",
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "It meowed once at the bug, it is still meowing at the bug and the fish",
    "The cat is at the fish store. The cat is orange. The cat is meowing at the fish.",
    "Penny is a fish"
]

In [4]:
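# The simplest CountVectorizer: default tokenization, no stop words
count_vectorizer = CountVectorizer()
x = count_vectorizer.fit_transform(texts)
print(x.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(x.toarray(), columns=count_vectorizer.get_feature_names_out())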

[[0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0]
 [1 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 2 0 0]
 [0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 3 1 0 1 1 1 1]
 [1 2 0 0 0 0 2 0 1 0 1 2 1 1 1 0 0 0 1 0 3 0 0]
 [0 2 0 0 0 0 0 3 2 0 3 0 0 1 0 1 0 0 0 1 5 0 0]
 [0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0]]


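| | and | at | ate | blue | bought | bright | bug | cat | fish | fishes | is | it | meowed | meowing | once | orange | penny | saw | still | store | the | to | went |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 1 | 1 | 1 | 1 |
| 4 | 1 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 |
| 5 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 5 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |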


In [5]:
# CountVectorizer with English stop words removed: the number of features drops from 23 to 15.

count_vectorizer = CountVectorizer(stop_words='english')
x = count_vectorizer.fit_transform(texts)
print(x.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(x.toarray(), columns=count_vectorizer.get_feature_names_out())

[[0 1 1 1 0 0 0 1 0 0 0 2 0 0 0]
 [0 1 1 1 0 0 1 0 0 0 1 1 0 0 0]
 [1 0 0 0 0 1 1 0 0 0 0 0 0 1 0]
 [1 0 0 0 1 0 1 0 0 0 0 3 1 1 1]
 [0 0 0 0 2 0 1 0 1 1 0 0 0 0 0]
 [0 0 0 0 0 3 2 0 0 1 1 0 0 1 0]
 [0 0 0 0 0 0 1 0 0 0 0 1 0 0 0]]


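| | ate | blue | bought | bright | bug | cat | fish | fishes | meowed | meowing | orange | penny | saw | store | went |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 1 |
| 4 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |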


Tokenization is the act of breaking up a string into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences.
When a custom tokenizer is supplied, the vectorizer passes each document to it as a single string ("str_input" below), and the tokenizer returns the list of tokens for that document. You can modify those tokens any way you want before returning them. An intuitive way to think about a tokenizer is as a place to apply rules to words (e.g. mapping two different words to the same word, or reducing different tenses of a verb to one form).
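
To make that contract concrete, the sketch below calls a tokenizer by hand on one whole document, which is how the vectorizer is expected to call it. `merging_tokenizer` and its `synonyms` mapping are hypothetical names invented for this illustration.

```python
import re

# Hypothetical rule: treat 'cat' and 'fish' as the same feature, 'meat'
synonyms = {"cat": "meat", "fish": "meat"}

def merging_tokenizer(str_input):
    # str_input is one whole document, not a single word
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    # Apply the rule to each token and return the list of tokens
    return [synonyms.get(word, word) for word in words]

tokens = merging_tokenizer("The cat ate a fish at the store.")
print(tokens)
```

Passing such a function as `tokenizer=` to CountVectorizer should yield a single 'meat' column in place of separate 'cat' and 'fish' columns.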

In [19]:
# CountVectorizer with a custom tokenizer; this "boring" tokenizer only splits and lowercases, so the features are unchanged.
import re

def boring_tokenizer(str_input):
    # Replace every character that is not a letter, digit or hyphen with a space, then split on whitespace
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split() # https://docs.python.org/3/library/re.html
    
    # Uncomment to see an example of what you can do in a tokenizer: do 'cat' and 'fish' still exist as features?
#     for i in range(len(words)):
#         if words[i] == 'cat' or words[i] == 'fish':
#             words[i] = 'meat'
            
    return words

count_vectorizer = CountVectorizer(stop_words='english', tokenizer=boring_tokenizer)
X = count_vectorizer.fit_transform(texts)
# get_feature_names() was removed in scikit-learn 1.2; tolist() prints a plain Python list
print(count_vectorizer.get_feature_names_out().tolist())

['ate', 'blue', 'bought', 'bright', 'bug', 'cat', 'fish', 'fishes', 'meowed', 'meowing', 'orange', 'penny', 'saw', 'store', 'went']


In [11]:
# Try out the Porter stemmer on a few words.

from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

print(porter_stemmer.stem('fishes'))
print(porter_stemmer.stem('meowed'))
print(porter_stemmer.stem('oranges'))
print(porter_stemmer.stem('meowing'))
print(porter_stemmer.stem('orange'))
print(porter_stemmer.stem('go'))
print(porter_stemmer.stem('went'))

fish
meow
orang
meow
orang
go
went


In [12]:
# Stop words + stemming tokenizer
# Stop words are filtered after tokenization, so stemmed forms that no longer match the
# stop list (e.g. 'onc' from 'once') are kept as features.
porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

count_vectorizer = CountVectorizer(stop_words='english', tokenizer=stemming_tokenizer)
X = count_vectorizer.fit_transform(texts)
print(count_vectorizer.get_feature_names_out().tolist())

['ate', 'blue', 'bought', 'bright', 'bug', 'cat', 'fish', 'meow', 'onc', 'orang', 'penni', 'saw', 'store', 'went']



### Term frequency = the relative frequency of the term within the string/sentence

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
# Same usage as CountVectorizer, with one special setting here: use_idf=False. Setting norm='l1' makes each row
# sum to 1, so the values are easy to read as relative frequencies.
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False, norm='l1')
X = tfidf_vectorizer.fit_transform(texts)
df = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
df

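| | ate | blue | bought | bright | bug | cat | fish | meow | onc | orang | penni | saw | store | went |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.2 | 0.2 | 0.2 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.166667 | 0.166667 | 0.166667 | 0.0 | 0.0 | 0.166667 | 0.0 | 0.0 | 0.166667 | 0.166667 | 0.0 | 0.0 | 0.0 |
| 2 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 |
| 3 | 0.111111 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.0 | 0.111111 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.111111 | 0.111111 | 0.111111 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.166667 | 0.333333 | 0.166667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.375 | 0.25 | 0.125 | 0.0 | 0.125 | 0.0 | 0.0 | 0.125 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 |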
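
As a check on the table above, row 3 can be reproduced by hand. The token list below is an assumption: it is what the sentence "Penny went to the store. Penny ate a bug. Penny saw a fish." should reduce to after stop-word removal and stemming, written out manually rather than produced by the vectorizer.

```python
from collections import Counter

# Assumed tokens for document 3 after stop-word removal and Porter stemming
tokens = ["penni", "went", "store", "penni", "ate", "bug", "penni", "saw", "fish"]

counts = Counter(tokens)
total = sum(counts.values())
# With use_idf=False and norm='l1', each weight is count / total tokens in the document
tf = {term: count / total for term, count in counts.items()}

print(round(tf["penni"], 6))  # 0.333333
print(round(tf["fish"], 6))   # 0.111111
```

"penni" appears three times out of nine tokens, hence 0.333333; every other term appears once, hence 0.111111.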


### Inverse document frequency = how much information the term provides: the more documents a term shows up in, the less important it is.

In [18]:
# Setting use_idf=True (this is also the default). The default setting for norm is 'l2'.
idf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=True, norm='l2')
X = idf_vectorizer.fit_transform(texts)
idf_df = pd.DataFrame(X.toarray(), columns=idf_vectorizer.get_feature_names_out())
idf_df

| | ate | blue | bought | bright | bug | cat | fish | meow | onc | orang | penni | saw | store | went |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.512612 | 0.512612 | 0.512612 | 0.0 | 0.0 | 0.258786 | 0.0 | 0.0 | 0.0 | 0.380417 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.45617 | 0.45617 | 0.45617 | 0.0 | 0.0 | 0.230292 | 0.0 | 0.0 | 0.45617 | 0.33853 | 0.0 | 0.0 | 0.0 |
| 2 | 0.578752 | 0.0 | 0.0 | 0.0 | 0.0 | 0.578752 | 0.292176 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.494698 | 0.0 |
| 3 | 0.303663 | 0.0 | 0.0 | 0.0 | 0.303663 | 0.0 | 0.153301 | 0.0 | 0.0 | 0.0 | 0.676058 | 0.365821 | 0.259561 | 0.365821 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.641958 | 0.0 | 0.162043 | 0.641958 | 0.386682 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.840166 | 0.282766 | 0.280055 | 0.0 | 0.280055 | 0.0 | 0.0 | 0.239382 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.562463 | 0.0 | 0.0 | 0.0 | 0.826823 | 0.0 | 0.0 | 0.0 |
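
The last row of this table ("Penny is a fish") can be reproduced by hand. The sketch assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by l2 normalization. The document frequencies are read off the table by counting the non-zero entries in the 'fish' and 'penni' columns, and `smoothed_idf` is a helper name made up for this illustration.

```python
import math

n_docs = 7
# Document frequencies counted from the table: 'fish' occurs in all 7 documents, 'penni' in 4
doc_freq = {"fish": 7, "penni": 4}
# Raw term counts in "Penny is a fish" after stop-word removal and stemming
counts = {"fish": 1, "penni": 1}

def smoothed_idf(df, n):
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

# tf-idf before normalization, then divide by the l2 norm of the row
raw = {term: count * smoothed_idf(doc_freq[term], n_docs) for term, count in counts.items()}
l2_norm = math.sqrt(sum(weight * weight for weight in raw.values()))
weights = {term: weight / l2_norm for term, weight in raw.items()}

print({term: round(weight, 6) for term, weight in weights.items()})
```

Because 'fish' occurs in every document, its idf is the minimum value of 1, so 'penni' ends up with the larger weight even though both terms appear once in the sentence.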


# Smoothing parameter in naive Bayes

In statistics, additive smoothing, also called Laplace smoothing (not to be confused with Laplacian smoothing), or Lidstone smoothing, is a technique used to smooth categorical data.
 
Given an observation $x = (x_1, \dots, x_d)$ from a multinomial distribution with $N$ trials and parameter vector $\theta = (\theta_1, \dots, \theta_d)$, a "smoothed" version of the data gives the estimator:

$\hat\theta_{i} = \frac{x_{i} + \alpha}{N + \alpha d}$

where the pseudocount $\alpha > 0$ is the smoothing parameter ($\alpha = 0$ corresponds to no smoothing). Additive smoothing is a type of shrinkage estimator, as the resulting estimate will be between the empirical estimate $x_i / N$ and the uniform probability $1/d$. Using Laplace's rule of succession, some authors have argued that $\alpha$ should be 1 (in which case the term add-one smoothing is also used), though in practice a smaller value is typically chosen.
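
A minimal sketch of the estimator above, to show how the pseudocount pulls the estimate from the empirical frequencies toward the uniform distribution. The function name `additive_smoothing` and the example counts are made up for this illustration.

```python
def additive_smoothing(counts, alpha=1.0):
    # counts: observed count x_i for each of the d categories
    n = sum(counts)
    d = len(counts)
    return [(x + alpha) / (n + alpha * d) for x in counts]

counts = [3, 0, 1]
print(additive_smoothing(counts, alpha=0))    # empirical estimate x_i / N
print(additive_smoothing(counts, alpha=1))    # add-one smoothing
print(additive_smoothing(counts, alpha=100))  # close to the uniform 1/d
```

The category with a zero count gets a non-zero probability as soon as alpha > 0, which is what keeps naive Bayes from assigning zero likelihood to a word it never saw in a class.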

Reference: https://medium.com/syncedreview/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf

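The following lines of code are useful in the project. They use a cross-validated grid search to pick the smoothing parameter alpha for a multinomial naive Bayes classifier, where X is the document-term matrix and y holds the class labels.

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

tuned_parameters = [{'alpha': [1, 5, 10, 20]}]
nb = GridSearchCV(MultinomialNB(), tuned_parameters, refit=True)
nb.fit(X, y)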

Please read the documentation for specific use.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html