tf-idf is a way to encode words numerically so they can be fed into a model (e.g., Logistic Regression, Multinomial Naive Bayes, etc.). As you can see below, it creates a sparse matrix with one row per document: each row has zeros at most word indices and tf-idf scores at the indices of the words that are present in that document (i.e., if the total number of distinct words across all documents in your corpus is ten thousand, then each individual document's tf-idf representation will be a 1 x 10,000 sparse row vector).

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
#create three example documents
x = ["hello world", "I ate the bagel.", "what the hell, Danny!"]

In [16]:
#create the vectorizer
vec = TfidfVectorizer()

#convert the documents to tf-idf vectors
transform_examples = vec.fit_transform(x)

In [17]:
#show the shape of the newly created tf-idf matrix
transform_examples.shape

(3, 8)

In [15]:
#see the entire vocabulary of the vectorizer
vec.vocabulary_

{'ate': 0,
 'bagel': 1,
 'danny': 2,
 'hell': 3,
 'hello': 4,
 'the': 5,
 'what': 6,
 'world': 7}

In [18]:
#view the idf scores of each word
#NOTE: they are identical because each word appears in only one document, except for 'the', which appears in two
vec.idf_

array([ 1.69314718,  1.69314718,  1.69314718,  1.69314718,  1.69314718,
        1.28768207,  1.69314718,  1.69314718])
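
The idf values above can be reproduced by hand. The sketch below assumes the vectorizer's default smoothing (`smooth_idf=True`), under which scikit-learn documents the formula idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word. The helper name `smoothed_idf` is made up for illustration.

```python
import math

# Three example documents; 'the' appears in two of them, every other word in one.
n_docs = 3

def smoothed_idf(doc_freq, n_docs):
    # Assumed default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_rare = smoothed_idf(1, n_docs)  # words found in a single document
idf_the = smoothed_idf(2, n_docs)   # 'the', found in two documents

print(round(idf_rare, 8))
print(round(idf_the, 8))
```

The more documents a word appears in, the lower its idf, which is why 'the' gets the smallest score.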

In [19]:
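#check stop_words_: it only holds terms dropped by max_df, min_df or max_features, so it is empty here
#'I' is missing from the vocabulary because the default token_pattern only keeps tokens of 2+ characters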
vec.stop_words_

set()
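
To see why 'I' never reaches the vocabulary, the tokenization step can be approximated with the standard library alone. This is a rough sketch, assuming the vectorizer's default settings (`lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`); the `tokenize` helper is hypothetical and is not how the vectorizer is implemented internally.

```python
import re

# Assumed default token pattern: runs of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    # Lowercase first, then keep only tokens matching the pattern
    return re.findall(TOKEN_PATTERN, text.lower())

docs = ["hello world", "I ate the bagel.", "what the hell, Danny!"]
tokens = [tokenize(doc) for doc in docs]
vocabulary = sorted({tok for doc_tokens in tokens for tok in doc_tokens})

print(tokens)
print(vocabulary)
```

The single-character token 'i' is discarded at this stage, before any stop-word or document-frequency filtering happens, so it never appears in `stop_words_`.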

In [24]:
#show the third document's matrix
transform_examples[2]

<1x8 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [28]:
#show the matrix of all documents
transform_examples.toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.70710678,
         0.        ,  0.        ,  0.70710678],
       [ 0.62276601,  0.62276601,  0.        ,  0.        ,  0.        ,
         0.4736296 ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.52863461,  0.52863461,  0.        ,
         0.40204024,  0.52863461,  0.        ]])
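
The numbers in each row can also be checked by hand. The sketch below assumes the default settings (`norm='l2'`, raw term counts as tf): every word in the second document occurs once, so its unnormalized weight is just its idf, and the row is then scaled to unit length. The `l2_normalize` helper is made up for illustration.

```python
import math

# idf values copied from vec.idf_ above; tf = 1 for each word in "I ate the bagel."
weights = {"ate": 1.69314718, "bagel": 1.69314718, "the": 1.28768207}

def l2_normalize(weights):
    # Divide every weight by the Euclidean length of the row
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row = l2_normalize(weights)
for term, score in row.items():
    print(term, round(score, 8))
```

The same reasoning explains the first row: two words with equal idf each end up at 1/sqrt(2), about 0.70710678.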