http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

# The Bag of Words representation
Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

- tokenizing strings and giving an integer id to each possible token, for instance by using white-spaces and punctuation as token separators.
- counting the occurrences of tokens in each document.
- normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.

In this scheme, features and samples are defined as follows:

- each individual token occurrence frequency (normalized or not) is treated as a feature.
- the vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
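
The tokenizing and counting steps can be sketched in plain Python, without scikit-learn. This is only a minimal illustration under stated assumptions: the `tokenize` helper and its regular expression are made up for this example, ids are assigned in first-seen order (CountVectorizer sorts its vocabulary alphabetically instead), and the normalization step is left out.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split on anything
    # that is not a letter or a digit
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

corpus = ["The sky is blue.", "The sun is bright."]

# Step 1: give an integer id to every distinct token, in first-seen order
vocabulary = {}
for doc in corpus:
    for tok in tokenize(doc):
        vocabulary.setdefault(tok, len(vocabulary))

# Step 2: count token occurrences, one row per document, one column per token
matrix = []
for doc in corpus:
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(doc)).items():
        row[vocabulary[tok]] = n
    matrix.append(row)

print(vocabulary)  # {'the': 0, 'sky': 1, 'is': 2, 'blue': 3, 'sun': 4, 'bright': 5}
print(matrix)      # [[1, 1, 1, 1, 0, 0], [1, 0, 1, 0, 1, 1]]
```

Each row has the same fixed length regardless of how long the document is, which is what most learning algorithms require.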

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
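
To illustrate the loss of positional information, the sketch below vectorizes two sentences that contain the same words in a different order. The sentences are made up for this illustration and are not part of the original example; with CountVectorizer's default settings, both should map to the same count vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative sentences with opposite meanings but identical word counts
docs = ["the dog bites the man", "the man bites the dog"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

# Column order follows the integer ids stored in vocabulary_
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
rows_identical = bool((counts[0] == counts[1]).all())

print(vocab)
print(counts.tolist())
print(rows_identical)
```

A model trained on these features cannot tell the two sentences apart, which is the price paid for a fixed-size representation.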

In [40]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from nltk.corpus import stopwords
from sklearn.model_selection import GridSearchCV

In [14]:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.")

In [15]:
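# stopwords.words() already returns a list; a list is kept here because
# recent scikit-learn releases validate stop_words and reject a set.
# The NLTK corpus must be available (nltk.download('stopwords')).
stop_words = stopwords.words('english')
vectorizer = CountVectorizer(stop_words=stop_words)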

In [16]:
vectorizer = CountVectorizer(stop_words = stop_words)
train_smatrix = vectorizer.fit_transform(train_set)

In [17]:
print(vectorizer.vocabulary_)

{'sky': 2, 'blue': 0, 'sun': 3, 'bright': 1}


In [18]:
print(train_smatrix) 

  (0, 0)	1
  (0, 2)	1
  (1, 1)	1
  (1, 3)	1


In [12]:
train_smatrix.todense()

matrix([[1, 0, 1, 0],
        [0, 1, 0, 1]], dtype=int64)

In [19]:
test_smatrix = vectorizer.transform(test_set)
print(test_smatrix) 

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (1, 1)	1
  (1, 3)	2


# Tf–idf term weighting

In a large text corpus, some words occur very frequently (e.g. “the”, “a”, “is” in English) and hence carry very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for use by a classifier, it is very common to use the tf–idf transform.

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: 

$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t).$

Using the TfidfTransformer’s default settings, TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False), the term frequency (the number of times a term occurs in a given document) is multiplied by the idf component, which is computed as

$ \text{idf}(t) = \log{\frac{1 + n_d}{1 + \text{df}(d,t)}} + 1 $

where $n_d$ is the total number of documents, and $\text{df}(d,t)$ is the number of documents that contain term $t$. The resulting tf-idf vectors are then normalized by the Euclidean norm:

$ v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}. $
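
As a rough check of the two formulas above, the sketch below reproduces by hand the numbers that appear in the toy example of this notebook. It assumes a corpus of two training documents in which every term occurs in exactly one document, and the count vector `[0, 1, 0, 2]` of the second test sentence. The helper name `smoothed_idf` is made up for illustration and is not part of scikit-learn.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # idf(t) = ln((1 + n_d) / (1 + df)) + 1, i.e. the smooth_idf=True variant
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Toy corpus: 2 documents, each term appears in exactly 1 of them
idf_value = smoothed_idf(n_docs=2, doc_freq=1)

# Raw counts for one document over a 4-term vocabulary
counts = [0, 1, 0, 2]
tfidf = [c * idf_value for c in counts]

# Euclidean (L2) normalization: divide by the square root of the sum of squares
l2_norm = math.sqrt(sum(v * v for v in tfidf))
normalized = [v / l2_norm for v in tfidf]

print(round(idf_value, 8))                # 1.40546511
print([round(v, 4) for v in normalized])  # [0.0, 0.4472, 0.0, 0.8944]
```

Because every term shares the same idf in this tiny corpus, the normalization cancels it out and the result depends only on the counts, giving 1/√5 and 2/√5.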

This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering.



In [31]:
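# norm=None turns off L2 normalization, so the output is the raw tf * idf product
tfidf = TfidfTransformer(smooth_idf=True, norm=None)
train_tfidf = tfidf.fit_transform(train_smatrix)  # learn idf from the training counts
test_tfidf = tfidf.transform(test_smatrix)        # reuse the learned idf on the test counts

print("IDF:", tfidf.idf_)
print("Test TFIDF")
print(test_tfidf)

print("Train TFIDF")
print(train_tfidf)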

IDF: [1.40546511 1.40546511 1.40546511 1.40546511]
Test TFIDF
  (0, 3)	1.4054651081081644
  (0, 2)	1.4054651081081644
  (0, 1)	1.4054651081081644
  (1, 3)	2.8109302162163288
  (1, 1)	1.4054651081081644
Train TFIDF
  (0, 2)	1.4054651081081644
  (0, 0)	1.4054651081081644
  (1, 3)	1.4054651081081644
  (1, 1)	1.4054651081081644


In [30]:
# Default norm='l2': each row is scaled to unit Euclidean length
tfidf = TfidfTransformer(smooth_idf=True)
train_tfidf = tfidf.fit_transform(train_smatrix)
test_tfidf = tfidf.transform(test_smatrix)

print("IDF:", tfidf.idf_)
print("Test TFIDF")
print(test_tfidf)

IDF: [1.40546511 1.40546511 1.40546511 1.40546511]
Test TFIDF
  (0, 3)	0.5773502691896257
  (0, 2)	0.5773502691896257
  (0, 1)	0.5773502691896257
  (1, 3)	0.894427190999916
  (1, 1)	0.447213595499958


# Train the NB classifier

Now that we have our features, we can train a classifier to try to predict the category of a document. Let’s start with a naïve Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier; the one most suitable for word counts is the multinomial variant:

In [33]:
Y = ['sky', 'sun']
clf = MultinomialNB().fit(train_tfidf, Y)

To try to predict the outcome on a new document we need to extract the features using almost the same feature extracting chain as before. The difference is that we call transform instead of fit_transform on the transformers, since they have already been fit to the training set:

In [34]:
print(clf.predict(test_tfidf))

['sun' 'sun']
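
The reason for calling transform rather than fit_transform on new documents is that the vocabulary learned during fit must stay fixed, so that the columns keep the same meaning. A small sketch of this behaviour follows; the test sentence is made up for illustration, and the default CountVectorizer is used here without the stop-word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["The sky is blue.", "The sun is bright."])

# Column order follows the integer ids learned during fit
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# "shining" was never seen during fit, so transform() drops it
# instead of adding a new column
row = vectorizer.transform(["The shining sun"]).toarray()[0].tolist()

print(vocab)  # ['blue', 'bright', 'is', 'sky', 'sun', 'the']
print(row)    # [0, 0, 0, 0, 1, 1]
```

Calling fit_transform on the new document instead would rebuild the vocabulary from that document alone, and the resulting columns would no longer line up with what the classifier was trained on.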


In order to make the vectorizer => transformer => classifier chain easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier:

In [36]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

In [37]:
text_clf.fit(train_set, Y)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [39]:
print(text_clf.predict(test_set))

['sun' 'sun']
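
As a sanity check, the pipeline can be compared with the same three steps chained by hand. The sketch below is self-contained and reuses the toy sentences from this notebook under new variable names (`train_docs`, `labels`, `new_docs`), with default settings for every step; under these assumptions the two approaches should give identical predictions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_docs = ["The sky is blue.", "The sun is bright."]
labels = ["sky", "sun"]
new_docs = ["The sun in the sky is bright.",
            "We can see the shining sun, the bright sun."]

# Manual chain: fit_transform on the training data, transform on new data
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
clf.fit(tfidf.fit_transform(vect.fit_transform(train_docs)), labels)
manual_pred = clf.predict(tfidf.transform(vect.transform(new_docs)))

# The same steps wrapped in a Pipeline: fit() and predict() handle the chaining
pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])
pipe.fit(train_docs, labels)
pipe_pred = pipe.predict(new_docs)

print(list(manual_pred) == list(pipe_pred))
```

Besides being shorter, the pipeline removes the risk of accidentally calling fit_transform on test data, since predict only ever calls transform on the intermediate steps.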
