In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV 

%matplotlib inline

## Bag of Words and Tf-idf
In the above examples, each vector can be considered a *bag of words*. By themselves, raw counts may not be helpful until we consider *term frequencies*, that is, how often individual words appear in each document. A simple way to calculate a term frequency is to divide the number of occurrences of a word by the total number of words in the document. This normalization lets us compare how often a word appears in a long document with how often it appears in a short one.

However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider *inverse document frequency*, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).

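Multiplying term frequency by inverse document frequency gives [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf): a word scores high when it is frequent in one document but rare across the collection.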
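
To make the two definitions concrete, here is a minimal sketch in plain Python that follows them literally. The helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are made up for illustration, the natural logarithm is an assumed choice of log scale, and the code assumes the word occurs in at least one document.

```python
import math

docs = [
    "this is a line",
    "this is another line and has no connection with previous line",
    "completely different line",
]
tokenized = [d.split() for d in docs]

def term_frequency(word, tokens):
    """Occurrences of `word` divided by the total number of words in the document."""
    return tokens.count(word) / len(tokens)

def inverse_document_frequency(word, corpus):
    """log(total documents / documents containing `word`); assumes at least one match."""
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(word, tokens, corpus):
    return term_frequency(word, tokens) * inverse_document_frequency(word, corpus)

# "line" appears in every document, "connection" in only one
print(tf_idf("line", tokenized[1], tokenized))
print(tf_idf("connection", tokenized[1], tokenized))
```

With this plain definition, "line" scores exactly zero because it appears in all three documents, while "connection" gets a positive score. scikit-learn uses a smoothed variant of the formula, so its numbers differ from these.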

## Stop Words and Word Stems
Some words like "the" and "and" appear so frequently, and in so many documents, that we needn't bother counting them. Also, it may make sense to only record the root of a word, say `cat` in place of both `cat` and `cats`. Both steps shrink the vocabulary, which reduces memory use and usually speeds up processing.
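
As a rough illustration of both ideas, the sketch below filters a tiny hand-picked stop-word set and strips a trailing plural "s". The `STOP_WORDS` set and the `crude_stem` helper are hypothetical and far simpler than what real libraries provide.

```python
# Illustrative stop-word list; real lists contain hundreds of words.
STOP_WORDS = {"the", "and", "a", "is", "on"}

def crude_stem(word):
    """Toy stemmer: strip a trailing plural "s". Not a real stemming algorithm."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(text):
    """Lowercase, split on whitespace, drop stop words, then stem what is left."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(normalize("The cat and the cats sat on a mat"))
# ['cat', 'cat', 'sat', 'mat']
```

Nine input words collapse to three distinct vocabulary entries, which is the shrinkage described above.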

## Tokenization and Tagging
When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of *tokenization* - that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on some fairly sophisticated *morphology* to parse text appropriately.
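
To see why whitespace splitting is crude, the sketch below compares it with a regular expression that keeps runs of two or more word characters. This is the same pattern `CountVectorizer` uses by default for its `token_pattern`; the example sentence is made up.

```python
import re

sentence = "This is a line. Is this another line?"

# Whitespace splitting leaves punctuation glued to the words
whitespace_tokens = sentence.lower().split()

# Keep only runs of two or more word characters
regex_tokens = re.findall(r"\b\w\w+\b", sentence.lower())

print(whitespace_tokens)
# ['this', 'is', 'a', 'line.', 'is', 'this', 'another', 'line?']
print(regex_tokens)
# ['this', 'is', 'line', 'is', 'this', 'another', 'line']
```

With whitespace splitting, `line.` and `line?` would count as two different words. The regular expression treats them as the same token and also drops the one-letter word "a".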

Once the text is divided, we can go back and *tag* our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. Because every distinct word (and every tag) gets its own column, while any single document uses only a few of them, the document vectors together form ***high dimensional sparse matrices***.

In [9]:
text = ['This is a line',
        'This is another line and has no connection with previous line',
        'Completely different line']

In [16]:
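from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer, CountVectorizer

# stop_words='english' drops common English words such as "this", "is" and "and"
cv = CountVectorizer(stop_words='english')

# Learn the vocabulary and return the document-term count matrix in one call
sparse_mat = cv.fit_transform(text)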

In [17]:
sparse_mat

<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

In [18]:
print(sparse_mat)

  (0, 3)	1
  (1, 3)	2
  (1, 1)	1
  (1, 4)	1
  (2, 3)	1
  (2, 0)	1
  (2, 2)	1


In [19]:
sparse_mat.todense() # avoid on large datasets: the dense copy stores every zero explicitly

matrix([[0, 0, 0, 1, 0],
        [0, 1, 0, 2, 1],
        [1, 0, 1, 1, 0]], dtype=int64)
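
The printed `(row, column)  value` pairs are all that the sparse matrix stores; every other cell is an implicit zero. As a minimal sketch in plain Python, with the triples copied by hand from the output above, the dense matrix can be rebuilt like this:

```python
# (row, column, count) triples copied from print(sparse_mat)
triples = [(0, 3, 1), (1, 3, 2), (1, 1, 1), (1, 4, 1),
           (2, 3, 1), (2, 0, 1), (2, 2, 1)]
n_rows, n_cols = 3, 5

# Start from all zeros, then fill in only the stored entries
dense = [[0] * n_cols for _ in range(n_rows)]
for row, col, value in triples:
    dense[row][col] = value

for row in dense:
    print(row)
# [0, 0, 0, 1, 0]
# [0, 1, 0, 2, 1]
# [1, 0, 1, 1, 0]

print(f"{len(triples)} of {n_rows * n_cols} cells are stored")
```

Here 7 of 15 cells are stored. With a real vocabulary of tens of thousands of words the stored fraction is far smaller, which is why the sparse format matters.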

In [20]:
cv.vocabulary_

{'line': 3, 'connection': 1, 'previous': 4, 'completely': 0, 'different': 2}
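
`vocabulary_` maps each word to its column index, so sorting the words by that index recovers the column order of the matrix. The sketch below uses the dictionary copied from the output above; recent scikit-learn versions also return the same ordering from `cv.get_feature_names_out()`.

```python
vocabulary = {'line': 3, 'connection': 1, 'previous': 4, 'completely': 0, 'different': 2}

# Sort the words by their column index
columns = sorted(vocabulary, key=vocabulary.get)
print(columns)
# ['completely', 'connection', 'different', 'line', 'previous']

# Label the second row of the dense matrix with its words
second_row = [0, 1, 0, 2, 1]
print(dict(zip(columns, second_row)))
# {'completely': 0, 'connection': 1, 'different': 0, 'line': 2, 'previous': 1}
```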

In [21]:
tfidf_transformer = TfidfTransformer()

In [22]:
cv = CountVectorizer()
counts = cv.fit_transform(text)

In [23]:
counts

<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [24]:
counts.todense()

matrix([[0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0],
        [1, 1, 0, 1, 0, 1, 1, 2, 1, 1, 1, 1],
        [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]], dtype=int64)

In [25]:
tfidf = tfidf_transformer.fit_transform(counts)

In [26]:
tfidf.todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.61980538, 0.48133417, 0.        , 0.        ,
         0.61980538, 0.        ],
        [0.32355669, 0.32355669, 0.        , 0.32355669, 0.        ,
         0.32355669, 0.2460732 , 0.38219558, 0.32355669, 0.32355669,
         0.2460732 , 0.32355669],
        [0.        , 0.        , 0.65249088, 0.        , 0.65249088,
         0.        , 0.        , 0.38537163, 0.        , 0.        ,
         0.        , 0.        ]])
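
These values differ from the textbook formula because `TfidfTransformer`, with its default settings (`smooth_idf=True`, `norm='l2'`), computes idf as `ln((1 + n) / (1 + df)) + 1` and then scales each row to unit length. As a minimal sketch, the first row can be reproduced by hand, with the counts and document frequencies read off `counts.todense()` above:

```python
import math

n_docs = 3

# First document: columns 6, 7 and 10 correspond to "is", "line" and "this"
counts_row = {"is": 1, "line": 1, "this": 1}
# Number of documents containing each word
doc_freq = {"is": 2, "line": 3, "this": 2}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

# Raw count times idf, then divide by the row's Euclidean (L2) norm
weights = {w: counts_row[w] * idf[w] for w in counts_row}
norm = math.sqrt(sum(v * v for v in weights.values()))
tfidf_row = {w: v / norm for w, v in weights.items()}

for word, value in tfidf_row.items():
    print(word, round(value, 4))
# is 0.6198
# line 0.4813
# this 0.6198
```

"line" appears in every document, so it gets the lowest idf and therefore the smallest weight in the row.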

In [27]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('cv',CountVectorizer()),('tfidf',TfidfTransformer())])
results = pipe.fit_transform(text)
results

<3x12 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [28]:
results.todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.61980538, 0.48133417, 0.        , 0.        ,
         0.61980538, 0.        ],
        [0.32355669, 0.32355669, 0.        , 0.32355669, 0.        ,
         0.32355669, 0.2460732 , 0.38219558, 0.32355669, 0.32355669,
         0.2460732 , 0.32355669],
        [0.        , 0.        , 0.65249088, 0.        , 0.65249088,
         0.        , 0.        , 0.38537163, 0.        , 0.        ,
         0.        , 0.        ]])

### Instead, do it all at once with `TfidfVectorizer`!

In [29]:
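# TfidfVectorizer combines CountVectorizer and TfidfTransformer in a single step
tfidf_vectorizer = TfidfVectorizer()
new = tfidf_vectorizer.fit_transform(text)
new.todense()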

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.61980538, 0.48133417, 0.        , 0.        ,
         0.61980538, 0.        ],
        [0.32355669, 0.32355669, 0.        , 0.32355669, 0.        ,
         0.32355669, 0.2460732 , 0.38219558, 0.32355669, 0.32355669,
         0.2460732 , 0.32355669],
        [0.        , 0.        , 0.65249088, 0.        , 0.65249088,
         0.        , 0.        , 0.38537163, 0.        , 0.        ,
         0.        , 0.        ]])
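
The three outputs above look identical. Rather than compare them by eye, a short check along these lines can confirm it. It rebuilds all three variants from scratch and assumes default settings throughout.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline

text = ['This is a line',
        'This is another line and has no connection with previous line',
        'Completely different line']

# 1. Two explicit steps
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(text))

# 2. The same two steps wrapped in a Pipeline
piped = Pipeline([('cv', CountVectorizer()),
                  ('tfidf', TfidfTransformer())]).fit_transform(text)

# 3. One step with TfidfVectorizer
one_step = TfidfVectorizer().fit_transform(text)

print(np.allclose(two_step.toarray(), piped.toarray()))
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

Both comparisons should print `True`. The `.toarray()` calls are fine here only because the matrices are tiny.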