# Computing TF-IDF Vectors with Scikit-Learn


excerpt from __Data Science Bookcamp: Five Python Projects__ MEAP V04 livebook by Leonard Apeltsin


<div class="alert alert-block alert-info">
Large parts of the explanatory text in this notebook come from the liveLessons notebook.
</div>

NB. The author directs the reader to the newsgroups dataset (`fetch_20newsgroups` from `sklearn.datasets`). Because the dataset is large, it is not bundled with Scikit-Learn and is downloaded on first use. You may wish to define a shell variable named `SCIKIT_LEARN_DATA` in your environment so that the dataset is stored in a single known location.

In [5]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(remove=('headers', 'footers'))

In [6]:
print(newsgroups.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [7]:
# Return 1st newsgroup posting
print(f'---\n\n{newsgroups.data[0]}')

---

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


In [8]:
# Return the newsgroup name associated with the posting
origin = newsgroups.target_names[newsgroups.target[0]]
print(f'---\n\nThe post at index 0 first appeared in the \'{origin}\' group.')

---

The post at index 0 first appeared in the 'rec.autos' group.


NB. So far, nothing unexpected… the car post came from the car discussions on Usenet!

In [9]:
# Count the number of newsgroup posts
dataset_size = len(newsgroups.data)
print(f'---\n\nOur dataset contains {dataset_size:,} newsgroup posts.')

---

Our dataset contains 11,314 newsgroup posts.


In [10]:
# Let's move on to transforming the input texts into TF vectors with the Scikit-Learn `CountVectorizer` class
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [11]:
# Do the transform
tf_matrix = vectorizer.fit_transform(newsgroups.data)
print(tf_matrix)

  (0, 108644)	4
  (0, 110106)	1
  (0, 57577)	2
  (0, 24398)	2
  (0, 79534)	1
  (0, 100942)	1
  (0, 37154)	1
  (0, 45141)	1
  (0, 70570)	1
  (0, 78701)	2
  (0, 101084)	4
  (0, 32499)	4
  (0, 92157)	1
  (0, 100827)	6
  (0, 79461)	1
  (0, 39275)	1
  (0, 60326)	2
  (0, 42332)	1
  (0, 96432)	1
  (0, 67137)	1
  (0, 101732)	1
  (0, 27703)	1
  (0, 49871)	2
  (0, 65338)	1
  (0, 14106)	1
  :	:
  (11313, 55901)	1
  (11313, 93448)	1
  (11313, 97535)	1
  (11313, 93393)	1
  (11313, 109366)	1
  (11313, 102215)	1
  (11313, 29148)	1
  (11313, 26901)	1
  (11313, 94401)	1
  (11313, 89686)	1
  (11313, 80827)	1
  (11313, 72219)	1
  (11313, 32984)	1
  (11313, 82912)	1
  (11313, 99934)	1
  (11313, 96505)	1
  (11313, 72102)	1
  (11313, 32981)	1
  (11313, 82692)	1
  (11313, 101854)	1
  (11313, 66399)	1
  (11313, 63405)	1
  (11313, 61366)	1
  (11313, 7462)	1
  (11313, 109600)	1


In [12]:
# Suspense! What kind of data structure did the Scikit-Learn `CountVectorizer` yield?
print(type(tf_matrix))

<class 'scipy.sparse.csr.csr_matrix'>


NB. The matrix is a *Compressed Sparse Row* (CSR) SciPy object. Because it stores only the non-zero elements, the CSR matrix is efficient in storage and memory usage. A SciPy CSR matrix and a NumPy array differ in some subtle ways, so to reduce confusion we will convert the object to a NumPy array. Doing so makes the similarities and differences between the two matrix representations easier to see.
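NB. To make the "stores only non-zero elements" point concrete, here is a minimal sketch on made-up data (the 3x5 `toy_dense` array and every `toy_` name are illustrative assumptions, not part of the notebook). It shows the three flat arrays a CSR matrix keeps in place of the full grid.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small, mostly-zero count matrix: 3 "posts" x 5 "words" (made-up data)
toy_dense = np.array([[0, 2, 0, 0, 1],
                      [0, 0, 0, 0, 0],
                      [3, 0, 0, 1, 0]])
toy_sparse = csr_matrix(toy_dense)

# CSR keeps three flat arrays instead of the full grid
print(toy_sparse.data)     # the non-zero values, row by row
print(toy_sparse.indices)  # the column index of each stored value
print(toy_sparse.indptr)   # offsets into `data` marking where each row starts
print(f'{toy_sparse.nnz} stored values versus {toy_dense.size} dense cells')

# Converting back to a NumPy array recovers the original grid
toy_round_trip = toy_sparse.toarray()
```

Only 4 of the 15 cells are stored here; on the newsgroups matrix the saving is far larger, because each post uses a tiny fraction of the vocabulary.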

In [13]:
# For discussion, we are changing the CSR matrix to a NumPy ndarray
tf_np_matrix = tf_matrix.toarray()
print(tf_np_matrix)
# -> Yields a 2D NumPy array

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [38]:
# For sure!
print(type(tf_np_matrix))
print(tf_np_matrix.shape)       # (number of posts, vocabulary size)
print(tf_np_matrix[0,:].shape)  # one row: a vocabulary-sized vector for a single post
print(tf_np_matrix[:,0].shape)  # one column: REM. 11,314 is len(newsgroups.data)

<class 'numpy.ndarray'>
(11314, 114441)
(114441,)
(11314,)


NB. So the array is mostly zeros, no surprises there…
The printed matrix is a 2D NumPy array. Each matrix row represents a post, and each matrix column represents an individual word, so the column count equals the vocabulary size of the dataset. Each matrix element is the count of a word within a post.

In [15]:
# Get some stats
assert tf_np_matrix.shape == tf_matrix.shape
num_posts, vocabulary_size = tf_np_matrix.shape
print(f'---\n\nOur collection of {num_posts:,} newsgroup posts contains a total of '
      f'{vocabulary_size:,} unique words.')

---

Our collection of 11,314 newsgroup posts contains a total of 114,751 unique words.


NB. There are over 114k words, but most posts hold only a few dozen of them. You can measure the unique word count of the post at index `i` by counting the number of non-zero elements in row `tf_np_matrix[i]`. NumPy can easily find the non-zero indices of the vector at `tf_np_matrix[i]` with the `np.flatnonzero` function.

In [27]:
# Count the unique words in post 0 via its non-zero elements
import numpy as np
tf_vector = tf_np_matrix[0]
non_zero_indices = np.flatnonzero(tf_vector) # Equal to `a.ravel().nonzero()[0]` where `a` is array_like
num_unique_words = non_zero_indices.size
print(f'---\n\nThe newsgroup post in row 0 contains {num_unique_words} unique words.\n'
      f'The actual word-counts map to the following column indices:\n\n'
      f'{non_zero_indices}')

---

The newsgroup post in row 0 contains 34 unique words.
The actual word-counts map to the following column indices:

[ 14106  15549  22085  29307  30041  30577  32139  32441  39212  42264
  42265  43571  45010  45062  50060  55328  58709  63943  65197  66993
  66996  68934  72762  84198  87487  88958  91953  93095  95006  95918
  96205 100175 109803 112632]


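NB. We now have the column indices of the unique words in post 0. Mapping them back to the words themselves is done via the `CountVectorizer` method `get_feature_names()` (in Scikit-Learn 1.0 and later, use `get_feature_names_out()`, which returns a NumPy array; the old name was removed in 1.2). The method call returns a list of words, and each index `i` corresponds to the `i`-th word within that list.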

In [39]:
# Get list of words from the `CountVectorizer` method
words = vectorizer.get_feature_names()
# List comprehension to view our non-zero words
unique_words = [words[i] for i in non_zero_indices]
print(unique_words)

['60s', '70s', 'addition', 'body', 'bricklin', 'bumper', 'called', 'car', 'day', 'door', 'doors', 'early', 'engine', 'enlighten', 'funky', 'history', 'info', 'know', 'late', 'looked', 'looking', 'mail', 'model', 'production', 'really', 'rest', 'saw', 'separate', 'small', 'specs', 'sports', 'tellme', 'wondering', 'years']


NB. You can also get these words by calling `vectorizer.inverse_transform(tf_vector)`. This method call returns all the words associated with the input TF vector (here, row 0 of the NumPy array from above).
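NB. A small sketch of `inverse_transform` on a made-up two-text corpus (the texts and the `toy_` names are assumptions for illustration). It passes a one-row slice of the CSR matrix, since a 2D input is the safe choice across Scikit-Learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only
toy_texts = ['the car was a sports car', 'please e-mail engine specs']
toy_vectorizer = CountVectorizer()
toy_tf_matrix = toy_vectorizer.fit_transform(toy_texts)

# One-row (2D) slice in, one array of words per row out
toy_words_per_text = toy_vectorizer.inverse_transform(toy_tf_matrix[0])

# Sort for display: the method does not promise any particular word order
toy_words_in_first_text = sorted(str(word) for word in toy_words_per_text[0])
print(toy_words_in_first_text)
```

Note that the single-letter token "a" is missing from the result: the default tokenizer only keeps tokens of two or more characters.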

### Activity: View _word_ mention counts by extracting _Non-Zero_ elements of 1D NumPy arrays

`non_zero_indices = np.flatnonzero(np_vector)`: Returns the non-zero indices in a 1D NumPy array.

`non_zero_vector = np_vector[non_zero_indices]`: Selects the non-zero elements of a 1D NumPy array (assuming `non_zero_indices` corresponds to non-zero indices of that array).
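NB. The two calls above, sketched on a made-up six-element vector (the `toy_` names are illustrative assumptions, chosen so they do not overwrite the notebook's own variables):

```python
import numpy as np

# Made-up TF vector: counts for six vocabulary words in one text
toy_vector = np.array([0, 3, 0, 0, 1, 2])

# Positions of the non-zero counts
toy_non_zero_indices = np.flatnonzero(toy_vector)

# The counts found at those positions
toy_non_zero_vector = toy_vector[toy_non_zero_indices]

print(toy_non_zero_indices)  # [1 4 5]
print(toy_non_zero_vector)   # [3 1 2]
```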

We have printed the words from `newsgroups.data[0]`, but some of these words are more frequent than others. Let's dig down to find the most frequent words, along with the count for each word, and represent them in a Pandas table.

In [53]:
import pandas as pd
data = {'Word': unique_words, # our list
        'Count': tf_vector[non_zero_indices]} # counts from the first row (REM. position `0`) of the 2D NumPy array, selected at the non-zero indices

df = pd.DataFrame(data).sort_values('Count', ascending = False)
print(df[:10].to_string(index = False))

       Word  Count
        car      4
        60s      1
        saw      1
    looking      1
       mail      1
      model      1
 production      1
     really      1
       rest      1
   separate      1


NB. So we have a top ten, but the top words are not interesting. Fortunately, `CountVectorizer` accepts a `stop_words` parameter to remove *stop words*: words that, although part of ordinary speech and written language, carry little information for the "science". We will now re-initialize a stop-word-aware vectorizer and recompute the TF matrix. We will also regenerate our `words` list.

In [19]:
# Set up the transform and end with an assert that common stop words are dropped
vectorizer = CountVectorizer(stop_words = 'english')
tf_matrix = vectorizer.fit_transform(newsgroups.data)
assert tf_matrix.shape[1] < 114751 # number of unique words known to be in our newsgroup vocabulary

words = vectorizer.get_feature_names()
for common_word in ['the', 'this', 'was', 'if', 'it', 'on']:
    assert common_word not in words

In [54]:
tf_np_matrix = tf_matrix.toarray()
tf_vector = tf_np_matrix[0]
non_zero_indices = np.flatnonzero(tf_vector)
unique_words = [words[index] for index in non_zero_indices]
data = {'Word': unique_words,
        'Count': tf_vector[non_zero_indices]}

df = pd.DataFrame(data).sort_values('Count', ascending=False)
print(f'---\n\nAfter stop-word deletion, {df.shape[0]} unique words remain.')
print('The 10 most frequent words are:\n')
print(df[:10].to_string(index=False))

---

After stop-word deletion, 34 unique words remain.
The 10 most frequent words are:

       Word  Count
        car      4
        60s      1
        saw      1
    looking      1
       mail      1
      model      1
 production      1
     really      1
       rest      1
   separate      1


### Activity: Ranking _words_ by both post-frequency and count

NB. Each of the 34 words in our dataframe appears in a certain fraction of newsgroup posts. In NLP, this fraction is referred to as the _document frequency_ of a word. From here, the job of the scientist is to hypothesize that document frequencies can be used to improve word rankings, and thus our analysis. Initially, we will limit the exploration to a single document. Later, we will generalize the insights we obtain to the other documents in the dataset.

#### Interlude: Common Scikit-Learn CountVectorizer Methods
`vectorizer = CountVectorizer()`: Initializes a `CountVectorizer` object capable of vectorizing input texts based on their TF counts.

`vectorizer = CountVectorizer(stop_words='english')`: Initializes an object capable of vectorizing input texts, while filtering out common English words like "this" or "the".

`tf_matrix = vectorizer.fit_transform(texts)`: Executes TF vectorization on a list of input texts, using the initialized `vectorizer` object. Returns CSR matrix of term-frequency values. Each matrix row `i` corresponds to `texts[i]`. Each matrix column `j` corresponds to the term-frequency of word `j`.

`vocabulary_list = vectorizer.get_feature_names()`: Returns the vocabulary-list associated with the columns of a computed TF matrix. Each column `j` of the matrix corresponds to `vocabulary_list[j]`.
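NB. A hedged sketch of the calls listed above, run on a made-up two-text corpus (texts and `toy_` names are assumptions for illustration). It reads the vocabulary from the fitted `vocabulary_` attribute, a word-to-column dictionary, so it does not depend on which feature-name method your Scikit-Learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only
toy_texts = ['this was the car', 'it was on the funky bumper']

# Without stop-word filtering, common words enter the vocabulary
toy_plain = CountVectorizer().fit(toy_texts)
print(sorted(toy_plain.vocabulary_))

# With stop_words='english', words like "this", "was", and "the" are dropped
toy_filtered = CountVectorizer(stop_words='english')
toy_tf = toy_filtered.fit_transform(toy_texts).toarray()
print(sorted(toy_filtered.vocabulary_))  # the columns follow this alphabetical order
print(toy_tf)                            # row i = text i, column j = word j
```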

Begin an exploration: compute the 34 document frequencies to try to improve our word-relevancy rankings. We can compute these frequencies using a series of NumPy matrix manipulations. First, select those columns of `tf_np_matrix` that correspond to the 34 non-zero indices within the `non_zero_indices` array. The sub-matrix is available via `tf_np_matrix[:, non_zero_indices]`.

In [21]:
sub_matrix = tf_np_matrix[:, non_zero_indices]
print(f'---\n\nGet the sub-matrix corresponding to the 34 words within post 0.'
      f'\nThe first row in the sub-matrix is:\n\n{sub_matrix[0]}')

---

Get the sub-matrix corresponding to the 34 words within post 0.
The first row in the sub-matrix is:

[1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


In [22]:
# (szf) For show
print(type(sub_matrix))
print(sub_matrix.shape)
print(sub_matrix.size) # 11,314 * 34 -> 384,676

<class 'numpy.ndarray'>
(11314, 34)
384676


NB. The first row of `sub_matrix` corresponds to the 34 word counts in `df`. Together, all the matrix rows correspond to counts across all posts. However, the goal is to know whether a word is present or absent from each post. Consequently, we need to convert the counts to binary values (a binary matrix, if you will). Then each element `(i, j)` will equal 1 if word `j` is in post `i`, and 0 otherwise. We binarize the sub-matrix by importing `binarize` from `sklearn.preprocessing`, and then print a sample of the results.

In [23]:
from sklearn.preprocessing import binarize
binary_matrix = binarize(sub_matrix)
print(binary_matrix)

[[1 1 1 ... 1 1 1]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
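NB. The same binarization on a made-up 3x3 count matrix, small enough to check by eye (the `toy_` names are illustrative assumptions). The plain NumPy comparison on the last line is an equivalent alternative for non-negative counts, not something the book uses.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Made-up counts: 3 posts (rows) x 3 words (columns)
toy_counts = np.array([[4, 0, 1],
                       [0, 0, 2],
                       [1, 3, 0]])

# Every count above the default threshold of 0 becomes 1
toy_binary = binarize(toy_counts)
print(toy_binary)

# Equivalent NumPy-only formulation for non-negative counts
toy_binary_np = (toy_counts > 0).astype(int)
```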


Now add together the rows of our binary sub-matrix, producing a vector of integer counts. The `i`-th vector element will equal the number of unique posts in which word `i` is present. To sum the rows, we need only pass `axis=0` to the array's `sum` method.
NB. A 2D NumPy array contains two axes. Axis 0 runs down the rows and axis 1 runs across the columns. Summing along axis 0 therefore collapses the rows and leaves one sum per column.

In [55]:
# Count the unique posts that mention each word
unique_post_mentions = binary_matrix.sum(axis = 0)
print(f'---\n\nThis vector counts the unique posts in which each word is mentioned:\n'
      f'{unique_post_mentions}')

---

This vector counts the unique posts in which each word is mentioned:
[  18   21  202  314    4   26  802  536  842  154   67  348  184   25
    7  368  469 3093  238  268  780  901  292   95 1493  407  354  158
  574   95   98    2  295 1174]


In [56]:
# No change in our sample
print(unique_post_mentions.size)

34
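NB. To make the axis convention concrete, here is a sketch on a made-up binary matrix with 2 posts and 3 words (the `toy_` names are illustrative assumptions). Summing along axis 0 gives one value per word; summing along axis 1 gives one value per post.

```python
import numpy as np

# Made-up binary matrix: 2 posts (rows) x 3 words (columns)
toy_binary = np.array([[1, 0, 1],
                       [1, 1, 1]])

# axis=0 collapses the rows: the number of posts that mention each word
toy_posts_per_word = toy_binary.sum(axis=0)

# axis=1 collapses the columns: the number of unique words in each post
toy_words_per_post = toy_binary.sum(axis=1)

print(toy_posts_per_word)  # [2 1 2]
print(toy_words_per_post)  # [2 3]
```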


NB. We should note that the above three procedures can be combined into a single line of code, by running `binarize(tf_np_matrix[:,non_zero_indices]).sum(axis=0)`. Furthermore, substituting NumPy’s `tf_np_matrix` with SciPy’s `tf_matrix` will still produce the same post mention-counts.

In [50]:
# Do over, just for the academic rub
np_post_mentions = binarize(tf_np_matrix[:,non_zero_indices]).sum(axis=0)
csr_post_mentions = binarize(tf_matrix[:,non_zero_indices]).sum(axis=0)
print(f'---\n\nNumPy matrix-generated counts:\n {np_post_mentions}\n')
print(f'---\n\nCSR matrix-generated counts:\n {csr_post_mentions}')

---

NumPy matrix-generated counts:
 [  18   21  202  314    4   26  802  536  842  154   67  348  184   25
    7  368  469 3093  238  268  780  901  292   95 1493  407  354  158
  574   95   98    2  295 1174]

---

CSR matrix-generated counts:
 [[  18   21  202  314    4   26  802  536  842  154   67  348  184   25
     7  368  469 3093  238  268  780  901  292   95 1493  407  354  158
   574   95   98    2  295 1174]]


In [54]:
# (szf) For show
print(type(np_post_mentions))
print(np_post_mentions.shape)
print('^ NumPy matrix\n\n---\n\nv CSR matrix')
print(type(csr_post_mentions))
print(csr_post_mentions.shape)

<class 'numpy.ndarray'>
(34,)
^ NumPy matrix

---

v CSR matrix
<class 'numpy.matrix'>
(1, 34)


### Activity: Methods for Aggregating Matrix Rows

`vector_of_sums = np_matrix.sum(axis=0)`: Sums up the rows of a NumPy matrix. If `np_matrix` is a TF matrix, then `vector_of_sums[i]` equals the total mention-count of word `i` within the dataset.

`vector_of_sums = binarize(np_matrix).sum(axis=0)`: Converts a NumPy matrix to binary, and then sums up its rows. If `np_matrix` is a TF matrix, then `vector_of_sums[i]` equals the total count of texts in which word `i` is mentioned.

`matrix_1D = binarize(csr_matrix).sum(axis=0)`: Converts a CSR matrix to binary, and then sums up its rows. The returned result is a special matrix object holding a single row. _It is not a NumPy vector_. The `matrix_1D` can be converted into a NumPy vector by running `np.asarray(matrix_1D)[0]`.
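NB. A minimal sketch of that last conversion, on a made-up 2x3 CSR matrix (the `toy_` names are illustrative assumptions). The row-sum of a SciPy CSR matrix keeps two dimensions, so we index row 0 after converting it to an array.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up binary CSR matrix: 2 posts (rows) x 3 words (columns)
toy_csr = csr_matrix(np.array([[1, 0, 1],
                               [1, 1, 1]]))

# Summing the rows of a CSR matrix returns a two-dimensional, single-row result
toy_matrix_1d = toy_csr.sum(axis=0)

# Convert to an ordinary array and take its only row to get a 1D vector
toy_sums_vector = np.asarray(toy_matrix_1d)[0]

print(toy_matrix_1d.shape, toy_sums_vector.shape)
```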

New goal: transform the post mention-counts into document frequencies and align these frequencies with `df.Word`. Afterwards, we'll output all the words that are mentioned in at least 10% of newsgroup posts. We (as scientists, remember) hypothesize that the printed words will not be specific to a particular topic. If the hypothesis is correct, then these words will not be very relevant.

In [57]:
# Print the words with the highest document frequency
document_frequencies = unique_post_mentions / dataset_size
data = {'Word': unique_words,
        'Count': tf_vector[non_zero_indices],
        'Document Frequency': document_frequencies}

df = pd.DataFrame(data)
df_common_words = df[df['Document Frequency'] >= .1]
print(df_common_words.to_string(index=False))

   Word  Count  Document Frequency
   know      1            0.273378
 really      1            0.131960
  years      1            0.103765


From the 34 unique words, three have a document frequency of at least 0.1. These words are very general and not specific to a Usenet post on cars (especially the word "really"). Let's apply the discovered document frequencies for ranking purposes, and rank our words by relevance in the following manner. First, we’ll sort the words by count, from greatest to smallest. Afterwards, all words with equal count will be sorted by document frequency, from smallest to greatest. In Pandas, we can execute this dual-column sorting by running `df.sort_values(['Count', 'Document Frequency'], ascending=[False, True])`.

In [65]:
# Ranking words by count and frequency
df_sorted = df.sort_values(['Count', 'Document Frequency'],
                           ascending=[False, True])
print(df_sorted[:10].to_string(index=False))

       Word  Count  Document Frequency
        car      4            0.047375
     tellme      1            0.000177
   bricklin      1            0.000354
      funky      1            0.000619
        60s      1            0.001591
        70s      1            0.001856
  enlighten      1            0.002210
     bumper      1            0.002298
      doors      1            0.005922
 production      1            0.008397


So there are things of interest in this printout… the word "bumper" is both car-related and in the result set. With the two-level sorting, most of the result set has a count of 1, and the run-together term "tellme" has the lowest document frequency in `df.Word`.
This can be simplified by combining the word counts and the document frequencies into one score. One approach is to divide each word-count by its associated document frequency. The resulting value will go up if:
* The word-count goes up
* The document frequency goes down

Start by computing `1 / document_frequencies`, producing an array of inverse document frequencies (commonly shortened to IDF). Next, we’ll multiply `df.Count` by the IDF array in order to compute the combined score. We’ll then add both the IDF values and our combined scores to our Pandas table. Finally, we’ll sort on the combined score at printout.

In [60]:
# Combining counts and frequencies into a single score
inverse_document_frequencies = 1 / document_frequencies
df['IDF'] = inverse_document_frequencies
df['Combined'] = df.Count * inverse_document_frequencies
df_sorted = df.sort_values('Combined', ascending=False)
print(df_sorted[:10].to_string(index=False))

       Word  Count  Document Frequency          IDF     Combined
     tellme      1            0.000177  5657.000000  5657.000000
   bricklin      1            0.000354  2828.500000  2828.500000
      funky      1            0.000619  1616.285714  1616.285714
        60s      1            0.001591   628.555556   628.555556
        70s      1            0.001856   538.761905   538.761905
  enlighten      1            0.002210   452.560000   452.560000
     bumper      1            0.002298   435.153846   435.153846
      doors      1            0.005922   168.865672   168.865672
      specs      1            0.008397   119.094737   119.094737
 production      1            0.008397   119.094737   119.094737


💥 There is a problem now! The word *car* is no longer at the top of the list. Looking within the table, the printout has some huge IDF values, while the word-count range is very small, with values from 1 to 4. When we multiply word-counts by IDF values, the IDF dominates. The counts (such as the 4 occurrences of "car") then have almost no impact on the final results.

This is a common problem in data science. One technique is to shrink the values with a logarithmic function. For example, `np.log10(1000000)` returns `6.0`, which is the count of zeroes in the value.
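NB. A quick sketch of how much the logarithm compresses the IDF range, using the counts and IDF values of "car" and "tellme" copied from the table above (the `toy_` names are illustrative assumptions). Before the logarithm, the larger IDF swamps the larger count; afterwards, the count matters again.

```python
import numpy as np

# Values copied from the table above: 'car' first, 'tellme' second
toy_counts = np.array([4, 1])
toy_idf = np.array([21.108209, 5657.0])

# Raw product: the IDF of 'tellme' dominates
toy_raw_scores = toy_counts * toy_idf

# Log-shrunken product: the count of 'car' now outweighs the IDF gap
toy_log_scores = toy_counts * np.log10(toy_idf)

print(toy_raw_scores)
print(toy_log_scores)
```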

Let's recompute our ranking score by running `df.Count * np.log10(df.IDF)`. The product of the counts and the shrunken IDF values should lead to a more reasonable ranking metric.

In [66]:
# Adjustment of combined score using logarithms
df['Combined'] = df.Count * np.log10(df.IDF)
df_sorted = df.sort_values('Combined', ascending=False)
print(df_sorted[:10].to_string(index=False))

      Word  Count  Document Frequency          IDF  Combined
       car      4            0.047375    21.108209  5.297806
    tellme      1            0.000177  5657.000000  3.752586
  bricklin      1            0.000354  2828.500000  3.451556
     funky      1            0.000619  1616.285714  3.208518
       60s      1            0.001591   628.555556  2.798344
       70s      1            0.001856   538.761905  2.731397
 enlighten      1            0.002210   452.560000  2.655676
    bumper      1            0.002298   435.153846  2.638643
     doors      1            0.005922   168.865672  2.227541
     specs      1            0.008397   119.094737  2.075893


The word "car" is back at the top of the list and "bumper" is still present, whereas the general word "really" remains absent.
This score is called the **term frequency-inverse document frequency**, or TFIDF for short. The TFIDF can be computed by taking the product of the TF (word-count) with the log of the IDF.

Mathematically, `np.log(1/x)` is equal to `-np.log(x)`. Therefore, we can compute the TFIDF directly from the document frequencies. We simply need to run `df.Count * -np.log10(document_frequencies)`. Also please be aware that other, less common formulations of TFIDF exist in the literature. For instance, when dealing with large documents, some NLP practitioners compute the TFIDF as `np.log(df.Count + 1) * -np.log10(document_frequencies)`. *This limits the influence of any word that is very common within a document.*
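NB. A short numerical check of the identity, followed by the alternative formulation, on three (count, document frequency) pairs taken from the tables above for "car", "tellme", and "know" (the `toy_` names are illustrative assumptions).

```python
import numpy as np

# Values taken from the tables above: 'car', 'tellme', 'know'
toy_counts = np.array([4, 1, 1])
toy_doc_freqs = np.array([0.047375, 0.000177, 0.273378])

# log(1/x) == -log(x), so both routes give the same TFIDF scores
toy_via_idf = toy_counts * np.log10(1 / toy_doc_freqs)
toy_direct = toy_counts * -np.log10(toy_doc_freqs)
print(np.allclose(toy_via_idf, toy_direct))

# Alternative formulation: the count is passed through a logarithm as well,
# so a repeated word gains less from each extra mention
toy_damped = np.log(toy_counts + 1) * -np.log10(toy_doc_freqs)
print(toy_direct)
print(toy_damped)
```

With the damped version, the four mentions of "car" no longer outweigh the rarity of "tellme", which is why this variant tends to be reserved for long documents with very repetitive words.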

For most real-world text datasets, TFIDF produces good ranking results. Furthermore, the metric has additional uses. It can be utilized to vectorize words within a document. The numeric content of `df.Combined` is essentially a vector. It was produced by modifying the TF vector stored in `df.Count`. In this same manner, we can transform any TF vector into a TFIDF vector. We just need to multiply the TF vector by the log of inverse document frequencies.

Within larger text datasets, the transform of TF vectors into more complicated TFIDF vectors will provide a greater signal of textual similarity and divergence. For example, two texts that are both discussing "cars" are more likely to cluster together if their irrelevant vector elements are penalized. Thus, penalizing common words using the IDF will improve the clustering of large text collections.

NB. This benefit of transforming TF vectors into TFIDF vectors does not necessarily hold for smaller datasets, where the number of documents is low and the document frequencies are high. The IDF values might be too small to improve the clustering results meaningfully.

### Activity: Computing TFIDF Vectors with Scikit-Learn

The `TfidfVectorizer` class is nearly identical to `CountVectorizer`, except that it takes IDF into account during the vectorization process. Initializing the class with `stop_words='english'` yields an object that ignores all stop words. After calling `fit_transform(newsgroups.data)`, we have a matrix of vectorized TFIDF values. Its shape is the same as that of `tf_matrix`.

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(newsgroups.data)
assert tfidf_matrix.shape == tf_matrix.shape, "The matrices do not have the same shape."

Our `tfidf_vectorizer` has learned the same vocabulary as the simpler TF vectorizer. In fact, the indices of words in `tfidf_matrix` are identical to those of `tf_matrix`. We can confirm this by calling `tfidf_vectorizer.get_feature_names()`. The method call will return an ordered list of words that is identical to our previously computed `words` list.

In [76]:
assert tfidf_vectorizer.get_feature_names() == words, "The ordered lists are different."

Since word order is preserved, we should expect the non-zero indices of `tfidf_matrix[0]` to equal our previously computed `non_zero_indices` array. We’ll confirm below, after converting `tfidf_matrix` from a CSR data structure to a NumPy array.

In [78]:
tfidf_np_matrix = tfidf_matrix.toarray()
tfidf_vector = tfidf_np_matrix[0]
tfidf_non_zero_indices = np.flatnonzero(tfidf_vector)
assert np.array_equal(tfidf_non_zero_indices,
                      non_zero_indices), "The NumPy arrays are different."

NB. The non-zero indices of `tf_vector` and `tfidf_vector` are identical! We can thus add the TFIDF vector as a column in our existing `df` table. Adding a TFIDF column will allow us to compare Scikit-Learn’s output with our manually-computed score.

In [79]:
# Adding TFIDF vector to the existing Pandas table
df['TFIDF'] = tfidf_vector[non_zero_indices]

In [80]:
# The relevancy ranking is the same whether we sort by `df.TFIDF` or by `df.Combined`
df_sorted_old = df.sort_values('Combined', ascending=False)
df_sorted_new = df.sort_values('TFIDF', ascending=False)
assert np.array_equal(df_sorted_old['Word'].values,
                      df_sorted_new['Word'].values)
print(df_sorted_new[:10].to_string(index=False))

      Word  Count  Document Frequency          IDF  Combined     TFIDF
       car      4            0.047375    21.108209  5.297806  0.459552
    tellme      1            0.000177  5657.000000  3.752586  0.262118
  bricklin      1            0.000354  2828.500000  3.451556  0.247619
     funky      1            0.000619  1616.285714  3.208518  0.234280
       60s      1            0.001591   628.555556  2.798344  0.209729
       70s      1            0.001856   538.761905  2.731397  0.205568
 enlighten      1            0.002210   452.560000  2.655676  0.200827
    bumper      1            0.002298   435.153846  2.638643  0.199756
     doors      1            0.005922   168.865672  2.227541  0.173540
     specs      1            0.008397   119.094737  2.075893  0.163752


Our word-rankings have remained unchanged. However, the values of the *TFIDF* and *Combined* columns are not identical. Our top 10 manually-computed Combined values are all greater than 1. Meanwhile, all of *Scikit-Learn's TFIDF* values are less than 1. Why is this the case?

As it turns out, **Scikit-Learn automatically normalizes its TFIDF vector results**. The magnitude of `df.TFIDF` has been modified to equal 1. We can confirm by calling `norm(df.TFIDF.values)`.

NB. In order to turn off the normalization, we must pass `norm=None` into the vectorizer’s initialization function. Running `TfidfVectorizer(norm=None, stop_words='english')` will return a vectorizer in which normalization has been deactivated.

Scikit-Learn performs this normalization because text-vector similarity is easier to compute when all vector magnitudes equal 1. Consequently, our normalized TFIDF matrix is primed for similarity analysis.
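NB. Normalization is not the only difference between the two columns. As far as I understand the Scikit-Learn defaults, `TfidfVectorizer` also uses a smoothed, natural-log IDF, `ln((1 + n) / (1 + df)) + 1`, in place of our `log10(n / df)`. The sketch below treats that formula as an assumption and checks it on a made-up three-text corpus (all `toy_` names are illustrative) by rebuilding the vectorizer's output by hand.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus, for illustration only
toy_texts = ['car car bumper', 'car engine', 'funky bumper']

toy_tf = CountVectorizer().fit_transform(toy_texts).toarray()
num_texts = toy_tf.shape[0]
toy_text_counts = (toy_tf > 0).sum(axis=0)

# Assumed default formula (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
toy_smooth_idf = np.log((1 + num_texts) / (1 + toy_text_counts)) + 1

# Scale by the IDF, then divide each row by its magnitude (L2 normalization)
toy_manual = toy_tf * toy_smooth_idf
toy_manual = toy_manual / np.linalg.norm(toy_manual, axis=1, keepdims=True)

# Compare with the vectorizer, with and without normalization
toy_sklearn = TfidfVectorizer().fit_transform(toy_texts).toarray()
toy_unnormalized = TfidfVectorizer(norm=None).fit_transform(toy_texts).toarray()
print(np.allclose(toy_manual, toy_sklearn))
```

Both scores grow with the count and shrink with the document frequency, which is consistent with the matching rankings observed in this post.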

In [83]:
# Confirmation that the TFIDF vector is normalized
from numpy.linalg import norm
# Compare with a tolerance, since exact equality on floats is brittle
assert np.isclose(norm(df.TFIDF.values), 1), "The magnitude of the TFIDF vector is not 1."



### Common Scikit-Learn TfidfVectorizer Methods

`tfidf_vectorizer = TfidfVectorizer(stop_words='english')`: Initializes a `TfidfVectorizer` object capable of vectorizing input texts based on their TFIDF values. The object is pre-set to filter common English stop words.

`tfidf_matrix = tfidf_vectorizer.fit_transform(texts)`: Executes TFIDF vectorization on a list of input texts, using the initialized vectorizer object. Returns CSR matrix of normalized TFIDF values. Each row of the matrix is automatically normalized, for easier similarity computation.

`vocabulary_list = tfidf_vectorizer.get_feature_names()`: Returns the vocabulary-list associated with the columns of a computed TFIDF matrix. Each column `j` of the matrix corresponds to `vocabulary_list[j]`.
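NB. Why do unit-length rows make similarity easier? Assuming the similarity in question is cosine similarity (my assumption; this excerpt does not name the measure), the division by the two magnitudes disappears once both vectors have magnitude 1, leaving a plain dot product. A minimal sketch on two made-up vectors:

```python
import numpy as np

# Two made-up vectors, for illustration only
toy_a = np.array([3.0, 4.0, 0.0])
toy_b = np.array([1.0, 2.0, 2.0])

# Cosine similarity in full: dot product divided by both magnitudes
toy_cosine = toy_a @ toy_b / (np.linalg.norm(toy_a) * np.linalg.norm(toy_b))

# After normalizing each vector to magnitude 1, the dot product alone suffices
toy_unit_a = toy_a / np.linalg.norm(toy_a)
toy_unit_b = toy_b / np.linalg.norm(toy_b)
toy_dot_of_units = toy_unit_a @ toy_unit_b

print(toy_cosine, toy_dot_of_units)
```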