# Computing TF-IDF Vectors with Scikit-Learn


excerpt from __Data Science Bookcamp: Five Python Projects__ MEAP V04 livebook by Leonard Apeltsin

NB. The author directs the reader to work with the 20 newsgroups dataset (`fetch_20newsgroups` from `sklearn.datasets`). Because the dataset is large, it is not bundled with Scikit-Learn and is downloaded on first use. You may wish to define an environment variable named `SCIKIT_LEARN_DATA` so that the dataset is stored in a single, known location.
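As a rough sketch of how that location is resolved: Scikit-Learn documents that its dataset loaders cache downloads in the directory named by `SCIKIT_LEARN_DATA`, falling back to `~/scikit_learn_data` when the variable is unset. The helper `resolve_data_home` below is hypothetical and only mirrors that rule with the standard library; the real lookup is done by `sklearn.datasets.get_data_home()`. The path `/tmp/sklearn_cache` is an illustrative assumption.

```python
import os
from pathlib import Path


def resolve_data_home(environ=os.environ):
    """Hypothetical helper: mirror Scikit-Learn's documented cache-directory rule."""
    default = Path("~") / "scikit_learn_data"
    return Path(environ.get("SCIKIT_LEARN_DATA", default)).expanduser()


# Illustrative path only; any writable directory would do.
custom = resolve_data_home({"SCIKIT_LEARN_DATA": "/tmp/sklearn_cache"})
fallback = resolve_data_home({})
print(custom)    # the directory taken from the variable
print(fallback)  # the home-directory default
```

Setting the variable in your shell profile (rather than inside the notebook) keeps every notebook and script pointed at the same cache.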

In [6]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(remove=('headers', 'footers'))

In [7]:
print(newsgroups.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [14]:
# Return 1st newsgroup posting
print(f'---\n\n{newsgroups.data[0]}')

---

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


In [13]:
# Return the newsgroup name associated with the posting
origin = newsgroups.target_names[newsgroups.target[0]]
print(f'---\n\nThe post at index 0 first appeared in the \'{origin}\' group.')

---

The post at index 0 first appeared in the 'rec.autos' group.


NB. So far, nothing unexpected… the car post was sourced from the car discussions on Usenet!

In [28]:
# Count the number of newsgroup posts
dataset_size = len(newsgroups.data)
print(f'---\n\nOur dataset contains {dataset_size:,} newsgroup posts.')

---

Our dataset contains 11,314 newsgroup posts.


In [29]:
# Let's move on to transforming the input texts into TF vectors via the Scikit-Learn `CountVectorizer` class
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [30]:
tf_matrix = vectorizer.fit_transform(newsgroups.data)
print(tf_matrix)

  (0, 108644)	4
  (0, 110106)	1
  (0, 57577)	2
  (0, 24398)	2
  (0, 79534)	1
  (0, 100942)	1
  (0, 37154)	1
  (0, 45141)	1
  (0, 70570)	1
  (0, 78701)	2
  (0, 101084)	4
  (0, 32499)	4
  (0, 92157)	1
  (0, 100827)	6
  (0, 79461)	1
  (0, 39275)	1
  (0, 60326)	2
  (0, 42332)	1
  (0, 96432)	1
  (0, 67137)	1
  (0, 101732)	1
  (0, 27703)	1
  (0, 49871)	2
  (0, 65338)	1
  (0, 14106)	1
  :	:
  (11313, 55901)	1
  (11313, 93448)	1
  (11313, 97535)	1
  (11313, 93393)	1
  (11313, 109366)	1
  (11313, 102215)	1
  (11313, 29148)	1
  (11313, 26901)	1
  (11313, 94401)	1
  (11313, 89686)	1
  (11313, 80827)	1
  (11313, 72219)	1
  (11313, 32984)	1
  (11313, 82912)	1
  (11313, 99934)	1
  (11313, 96505)	1
  (11313, 72102)	1
  (11313, 32981)	1
  (11313, 82692)	1
  (11313, 101854)	1
  (11313, 66399)	1
  (11313, 63405)	1
  (11313, 61366)	1
  (11313, 7462)	1
  (11313, 109600)	1


In [31]:
# Suspense! What kind of data structure did the Scikit-Learn `CountVectorizer` yield?
print(type(tf_matrix))

<class 'scipy.sparse.csr.csr_matrix'>


NB. The matrix is a SciPy *Compressed Sparse Row* (CSR) object. Because it stores only the non-zero elements, a CSR matrix is economical in memory. A SciPy CSR matrix and a NumPy array differ in some subtle ways, so to reduce confusion we will convert the object to a NumPy array. Working with the dense form also makes the similarities and differences between the two representations easier to see.
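To make the "store only the non-zero elements" idea concrete, here is a minimal pure-Python sketch of the CSR layout. The function `to_csr` is a hypothetical helper written for illustration, not SciPy's implementation; the three lists it returns correspond to the `data`, `indices`, and `indptr` attributes that a SciPy CSR matrix exposes.

```python
def to_csr(dense_rows):
    """Encode a list of equal-length rows as CSR-style (data, indices, indptr) lists."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)     # the non-zero value itself
                indices.append(col)    # the column it lives in
        indptr.append(len(data))       # where the next row starts in `data`
    return data, indices, indptr


dense = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 4],
]
data, indices, indptr = to_csr(dense)
print(data)     # [3, 1, 2, 4]
print(indices)  # [2, 0, 1, 3]
print(indptr)   # [0, 1, 2, 2, 4]
```

Row `i` occupies `data[indptr[i]:indptr[i + 1]]`, so the all-zero third row costs only one repeated entry in `indptr`.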

In [32]:
tf_np_matrix = tf_matrix.toarray()
print(tf_np_matrix)
# -> Yields a 2D NumPy array

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


NB. The array is almost entirely zeros, which is why the sparse format pays off; no surprises there…
Each matrix row represents a post and each matrix column represents an individual word, so the column count equals the vocabulary size of the dataset. Each matrix element is the count of that word within that post.
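The row/column layout can be reproduced by hand on a toy corpus. The sketch below is not `CountVectorizer` itself: it assumes lowercasing plus a "two or more word characters" token rule as an approximation of the default tokenization, and the two sentences are made up for illustration.

```python
import re
from collections import Counter

posts = [
    "The car was a small car.",
    "Please e-mail me the engine specs.",
]

# Lowercase, then keep runs of two or more word characters
# (assumed to approximate CountVectorizer's default tokenization).
tokenized = [re.findall(r"\b\w\w+\b", post.lower()) for post in posts]

# One column per distinct word, in sorted order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per post; each element counts a word within that post.
tf_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    tf_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in tf_rows:
    print(row)
```

Here the vocabulary has nine words, so each of the two rows has nine elements; "car" is counted twice in the first row and zero times in the second.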

In [35]:
assert tf_np_matrix.shape == tf_matrix.shape
num_posts, vocabulary_size = tf_np_matrix.shape
print(f'---\n\nOur collection of {num_posts:,} newsgroup posts contains a total of '
      f'{vocabulary_size:,} unique words.')

---

Our collection of 11,314 newsgroup posts contains a total of 114,751 unique words.


NB. There are over 114k words, but most posts hold only a few dozen of them. You can measure the unique word count of the post at index `i` by counting the non-zero elements in row `tf_np_matrix[i]`. NumPy's `np.flatnonzero` function returns the non-zero indices of the vector at `tf_np_matrix[i]`, which makes the count easy to obtain.
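A quick illustration of what `np.flatnonzero` returns, using a made-up seven-element vector rather than a real row of the TF matrix (the `toy_` names are introduced here only for the example):

```python
import numpy as np

# Toy TF vector: one post over a seven-word vocabulary (made-up counts).
toy_vector = np.array([0, 2, 0, 1, 4, 0, 0])

toy_indices = np.flatnonzero(toy_vector)      # positions of the non-zero counts
print(toy_indices.tolist())                   # [1, 3, 4]
print(toy_indices.size)                       # 3 distinct words appear in this post
print(toy_vector[toy_indices].tolist())       # [2, 1, 4], the counts at those positions
```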

In [42]:
import numpy as np
tf_vector = tf_np_matrix[0]
non_zero_indeces = np.flatnonzero(tf_vector)
num_unique_words = non_zero_indeces.size
print(f'---\n\nThe newsgroup post in row 0 contains {num_unique_words} unique words.\n'
      f'The actual word-counts map to the following column indices:\n\n'
      f'{non_zero_indeces}')

---

The newsgroup post in row 0 contains 64 unique words.
The actual word-counts map to the following column indices:

[ 14106  15549  22088  23323  24398  27703  29357  30093  30629  32194
  32305  32499  37154  39275  42332  42333  43643  45089  45141  49871
  49881  50165  54442  55453  57577  58321  58842  60116  60326  64083
  65338  67137  67140  68931  69080  70570  72915  75280  78264  78701
  79055  79461  79534  82759  84398  87690  89161  92157  93304  95225
  96145  96432 100406 100827 100942 101084 101732 108644 109086 109254
 109294 110106 112936 113262]


NB. We now have the index values of the 64 unique words. To map them back to the words themselves, use the `CountVectorizer` method `get_feature_names()`. The call returns a list of words in which the word at position `i` corresponds to column index `i`. (Scikit-Learn 1.0 deprecated this method in favor of `get_feature_names_out()`, and later releases removed it.)

In [43]:
# Get list of words from the `CountVectorizer` method
words = vectorizer.get_feature_names()
# List comprehension to view our non-zero words
unique_words = [words[i] for i in non_zero_indeces]
print(unique_words)

['60s', '70s', 'addition', 'all', 'anyone', 'be', 'body', 'bricklin', 'bumper', 'called', 'can', 'car', 'could', 'day', 'door', 'doors', 'early', 'engine', 'enlighten', 'from', 'front', 'funky', 'have', 'history', 'if', 'in', 'info', 'is', 'it', 'know', 'late', 'looked', 'looking', 'made', 'mail', 'me', 'model', 'name', 'of', 'on', 'or', 'other', 'out', 'please', 'production', 'really', 'rest', 'saw', 'separate', 'small', 'specs', 'sports', 'tellme', 'the', 'there', 'this', 'to', 'was', 'were', 'whatever', 'whe

NB. You can also get these words by calling `vectorizer.inverse_transform(tf_vector)`, which returns the words associated with the input TF vector (here a 1D NumPy array, row 0 of the array above). Recent Scikit-Learn releases may expect a 2D input, such as `tf_vector.reshape(1, -1)`.

### Activity: View _word_ mention counts

We have printed the words from `newsgroups.data[0]`, but some of these words occur more often than others. Let's find the most frequent words, along with their counts, and present them in a Pandas table.

In [47]:
import pandas as pd
data = {'Word': unique_words, # our list
        'Count': tf_vector[non_zero_indeces]} # counts from row 0 of the 2D NumPy array, selected with the non-zero index array

df = pd.DataFrame(data).sort_values('Count', ascending = False)
print(df[:10].to_string(index = False))

   Word  Count
    the      6
   this      4
    was      4
    car      4
     if      2
     is      2
     it      2
   from      2
     on      2
 anyone      2


NB. So we have a top ten, but three of the top four words ("the", "this", "was") are not interesting. Fortunately, `CountVectorizer` has a `stop_words` option that removes *stop words*: words that are a normal part of speech and writing but carry little information for our analysis. We will now re-initialize a stop-word-aware vectorizer and recompute the TF matrix. We will also regenerate our `words` list.
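The effect of stop-word removal can be sketched without Scikit-Learn. The stop list and the token list below are hypothetical and deliberately tiny; the built-in English list selected by `stop_words='english'` is far longer.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list for illustration only.
toy_stop_words = {"the", "this", "was", "if", "it", "is", "on", "from"}

# Made-up tokens loosely modelled on the Bricklin post.
tokens = ["the", "car", "was", "the", "car", "this", "bricklin", "is", "funky"]

# Drop every token that appears in the stop list, then count what is left.
kept = [token for token in tokens if token not in toy_stop_words]
counts = Counter(kept)
print(counts.most_common())  # [('car', 2), ('bricklin', 1), ('funky', 1)]
```

Once the filler words are gone, the content-bearing word "car" rises to the top of the ranking.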

In [49]:
vectorizer = CountVectorizer(stop_words = 'english')
tf_matrix = vectorizer.fit_transform(newsgroups.data)
assert tf_matrix.shape[1] < 114751 # number of unique words known to be in our newsgroup vocabulary

words = vectorizer.get_feature_names()
for common_word in ['the', 'this', 'was', 'if', 'it', 'on']:
    assert common_word not in words

In [51]:
tf_np_matrix = tf_matrix.toarray()
tf_vector = tf_np_matrix[0]
non_zero_indices = np.flatnonzero(tf_vector)
unique_words = [words[index] for index in non_zero_indices]
data = {'Word': unique_words,
        'Count': tf_vector[non_zero_indices]}

df = pd.DataFrame(data).sort_values('Count', ascending=False)
print(f'---\n\nAfter stop-word deletion, {df.shape[0]} unique words remain.')
print('The 10 most frequent words are:\n')
print(df[:10].to_string(index=False))

---

After stop-word deletion, 34 unique words remain.
The 10 most frequent words are:

       Word  Count
        car      4
        60s      1
        saw      1
    looking      1
       mail      1
      model      1
 production      1
     really      1
       rest      1
   separate      1


NB. Each of the 34 words in our dataframe appears in a certain fraction of the newsgroup posts. In NLP, this fraction is referred to as the _document frequency_ of a word. From here, the job of the scientist is to hypothesize that document frequencies can be used to improve word rankings, and thus our analysis. Initially, we will limit the exploration to a single document. Later, we will generalize the insights we obtain to the other documents in the dataset.
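Under the definition above (the fraction of posts in which a word appears at least once), document frequencies can be read straight off a TF matrix. This is a minimal sketch on a made-up 4-by-3 array; the `toy_tf` counts are assumptions for illustration, not values from the newsgroups data.

```python
import numpy as np

# Toy TF matrix: 4 posts (rows) by 3 words (columns), made-up counts.
toy_tf = np.array([
    [2, 0, 1],
    [0, 0, 3],
    [1, 1, 0],
    [0, 0, 1],
])

# A word is "present" in a post if its count is non-zero.
posts_with_word = (toy_tf > 0).sum(axis=0)

# Divide by the number of posts to turn counts of posts into fractions.
document_frequencies = posts_with_word / toy_tf.shape[0]
print(document_frequencies.tolist())  # [0.5, 0.25, 0.75]
```

Note that the third word has the highest document frequency even though the first word has the larger count within the first post: document frequency ignores how often a word repeats inside a single post.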

#### Interlude: Common Scikit-Learn CountVectorizer Methods
`vectorizer = CountVectorizer()`: Initializes a `CountVectorizer` object capable of vectorizing input texts based on their TF counts.

`vectorizer = CountVectorizer(stop_words='english')`: Initializes an object capable of vectorizing input texts while filtering out common English words like "this" or "the".

`tf_matrix = vectorizer.fit_transform(texts)`: Executes TF vectorization on a list of input texts, using the initialized `vectorizer` object. Returns CSR matrix of term-frequency values. Each matrix row `i` corresponds to `texts[i]`. Each matrix column `j` corresponds to the term-frequency of word `j`.

`vocabulary_list = vectorizer.get_feature_names()`: Returns the vocabulary list associated with the columns of a computed TF matrix. Each column `j` of the matrix corresponds to `vocabulary_list[j]`. In Scikit-Learn 1.0 and later, the replacement method is `get_feature_names_out()`.
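The methods above can be exercised together on a toy corpus. This is a sketch under stated assumptions: the two sentences and the `toy_` names are made up, and the `hasattr` check is one possible way to cope with the method rename across Scikit-Learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "The car was a small car.",
    "Please e-mail me the engine specs.",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_tf_matrix = toy_vectorizer.fit_transform(texts)   # CSR matrix, one row per text

# Newer Scikit-Learn releases provide get_feature_names_out();
# older ones only have get_feature_names().
if hasattr(toy_vectorizer, "get_feature_names_out"):
    toy_vocabulary = list(toy_vectorizer.get_feature_names_out())
else:
    toy_vocabulary = toy_vectorizer.get_feature_names()

print(toy_tf_matrix.shape)      # (number of texts, vocabulary size)
print(toy_vocabulary)           # column j of the matrix counts toy_vocabulary[j]
print(toy_tf_matrix.toarray())  # dense view is fine at this tiny scale
```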