![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

## Introduction to Text Mining and Natural Language Processing


## Session 4: Vectormath

This notebook takes a closer look at different tf-idf weightings. As a preamble, we compare the functional form of the classic textbook tf-idf formula with the one used by the sklearn package. We will see that a critical component in sklearn is the sublinear_tf option: with sublinear_tf=False (the default) the weight grows linearly in the raw count, whereas the classic formula grows with the logarithm of the count.

In [None]:
import math
import numpy as np

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer

def classic_tfidf(x, df, D):
    """
    x      = raw count in the document
    df     = number of documents containing the term
    D      = total number of documents
    Returns the "classic" textbook tf-idf
    """
    if x == 0:
        return 0.0
    # log base can be e or 10 - typically it doesn't matter as it only changes scaling
    return (1.0 + math.log(x)) * math.log(D / df)

def sklearn_like_tfidf(x, df, D, sublinear_tf=False, smooth_idf=True):
    """
    Rough replication of how scikit-learn does its weighting by default.
    """
    # 1) TF part
    if sublinear_tf and x > 0:
        tf = 1.0 + math.log(x)
    else:
        tf = float(x)
    
    # 2) IDF part
    #  - if smooth_idf=True, formula is log((1 + D)/(1 + df)) + 1
    #  - if smooth_idf=False, formula is log(D/df) + 1
    if smooth_idf:
        idf = math.log((1.0 + D)/(1.0 + df)) + 1.0
    else:
        idf = math.log(D/df) + 1.0
    
    return tf * idf

# --- Compare the two for different df values in a small example

D           = 1000          # total documents
df_values   = [50, 100]     # two different df values
x_values    = [1, 2, 5, 10, 20]  # raw counts to examine

print("Raw count (x)   | Classic(df=50)  sklearn(df=50)  |  Classic(df=100)  sklearn(df=100)")
print("---------------+--------------------------------+------------------------------------")

for x in x_values:
    # For df=50
    c50 = classic_tfidf(x, df_values[0], D)
    s50 = sklearn_like_tfidf(x, df_values[0], D, sublinear_tf=True, smooth_idf=False)
    
    # For df=100
    c100 = classic_tfidf(x, df_values[1], D)
    s100 = sklearn_like_tfidf(x, df_values[1], D, sublinear_tf=True, smooth_idf=False)
    
    print(f"{x:14d} | {c50:14.4f}  {s50:14.4f}  |  {c100:14.4f}  {s100:14.4f}")

print("     ")
print("Now with sublinear_tf=False:")

for x in x_values:
    # For df=50
    c50 = classic_tfidf(x, df_values[0], D)
    s50 = sklearn_like_tfidf(x, df_values[0], D, sublinear_tf=False, smooth_idf=False)
    
    # For df=100
    c100 = classic_tfidf(x, df_values[1], D)
    s100 = sklearn_like_tfidf(x, df_values[1], D, sublinear_tf=False, smooth_idf=False)
    
    print(f"{x:14d} | {c50:14.4f}  {s50:14.4f}  |  {c100:14.4f}  {s100:14.4f}")

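# Note: smooth_idf=False is used above, so the idf differs from the classic one only by the "+ 1".
# Switching to smooth_idf=True makes only a small difference for these values of D and df.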


Raw count (x)   | Classic(df=50)  sklearn(df=50)  |  Classic(df=100)  sklearn(df=100)
---------------+--------------------------------+------------------------------------
             1 |         2.9957          3.9957  |          2.3026          3.3026
             2 |         5.0722          6.7654  |          3.8986          5.5918
             5 |         7.8172         10.4266  |          6.0085          8.6179
            10 |         9.8937         13.1962  |          7.6045         10.9071
            20 |        11.9701         15.9659  |          9.2005         13.1962
     
Now with sublinear_tf=False:
             1 |         2.9957          3.9957  |          2.3026          3.3026
             2 |         5.0722          7.9915  |          3.8986          6.6052
             5 |         7.8172         19.9787  |          6.0085         16.5129
            10 |         9.8937         39.9573  |          7.6045         33.0259
            20 |        11.9701         79.9146  |          9.2005         66.0517
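
The note on smoothing can be checked directly. The following is a minimal sketch that reuses the two idf formulas from the cell above for the same D and df values; the helper names `idf_unsmoothed` and `idf_smoothed` are introduced here for illustration only and are not part of sklearn.

```python
import math

def idf_unsmoothed(df, D):
    # Formula used when smooth_idf=False: log(D / df) + 1
    return math.log(D / df) + 1.0

def idf_smoothed(df, D):
    # Formula used when smooth_idf=True: log((1 + D) / (1 + df)) + 1
    return math.log((1.0 + D) / (1.0 + df)) + 1.0

D = 1000  # same corpus size as in the comparison above
for df in (50, 100):
    plain = idf_unsmoothed(df, D)
    smooth = idf_smoothed(df, D)
    print(f"df={df:4d}  unsmoothed={plain:.4f}  smoothed={smooth:.4f}  diff={plain - smooth:.4f}")
```

With a large D and a df well above 1, adding one to numerator and denominator barely moves the ratio, which is why the two variants stay close here. For very small corpora the gap can be larger.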

## Understanding tf-idf in practice

In [6]:
# Define some example sentences
sentence1 = "The president of the United States (US) is president Donald Trump."
sentence2 = "Donald Trump the president wants us in a united country to trump other countries."
sentence3 = "Did a known artist paint portraits of Donald Trump?"
sentence4 = "A really well-known portrait artist is Vincent van Gogh."

corpus=[sentence1,sentence2,sentence3,sentence4]

cv = CountVectorizer(ngram_range = (1,1))
cv.fit(corpus)
vectorized_text=cv.transform(corpus)
vectorized_text=vectorized_text.todense()
print("document term matrix has size", vectorized_text.shape)
print(cv.get_feature_names_out())

document term matrix has size (4, 26)
['artist' 'countries' 'country' 'did' 'donald' 'gogh' 'in' 'is' 'known'
 'of' 'other' 'paint' 'portrait' 'portraits' 'president' 'really' 'states'
 'the' 'to' 'trump' 'united' 'us' 'van' 'vincent' 'wants' 'well']


In [7]:
for document in vectorized_text:
    print(document)

[[0 0 0 0 1 0 0 1 0 1 0 0 0 0 2 0 1 2 0 1 1 1 0 0 0 0]]
[[0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 1 1 2 1 1 0 0 1 0]]
[[1 0 0 1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0]]
[[1 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 1]]


In [8]:
print(cv.get_feature_names_out())


['artist' 'countries' 'country' 'did' 'donald' 'gogh' 'in' 'is' 'known'
 'of' 'other' 'paint' 'portrait' 'portraits' 'president' 'really' 'states'
 'the' 'to' 'trump' 'united' 'us' 'van' 'vincent' 'wants' 'well']


In [9]:
cv = TfidfVectorizer(ngram_range = (1,1), norm=None, sublinear_tf=False)
cv.fit(corpus)
vectorized_text=cv.transform(corpus)
vectorized_text=vectorized_text.todense()
print("document term matrix has size", vectorized_text.shape)
print(cv.get_feature_names_out())

document term matrix has size (4, 26)
['artist' 'countries' 'country' 'did' 'donald' 'gogh' 'in' 'is' 'known'
 'of' 'other' 'paint' 'portrait' 'portraits' 'president' 'really' 'states'
 'the' 'to' 'trump' 'united' 'us' 'van' 'vincent' 'wants' 'well']


In [10]:
vectorized_text=vectorized_text.round(decimals=3, out=None)
for document in vectorized_text:
    print(document)

[0.    0.    0.    0.    1.223 0.    0.    1.511 0.    1.511 0.    0.
 0.    0.    3.022 0.    1.916 3.022 0.    1.223 1.511 1.511 0.    0.
 0.    0.   ]
[0.    1.916 1.916 0.    1.223 0.    1.916 0.    0.    0.    1.916 0.
 0.    0.    1.511 0.    0.    1.511 1.916 2.446 1.511 1.511 0.    0.
 1.916 0.   ]
[1.511 0.    0.    1.916 1.223 0.    0.    0.    1.511 1.511 0.    1.916
 0.    1.916 0.    0.    0.    0.    0.    1.223 0.    0.    0.    0.
 0.    0.   ]
[1.511 0.    0.    0.    0.    1.916 0.    1.511 1.511 0.    0.    0.
 1.916 0.    0.    1.916 0.    0.    0.    0.    0.    0.    1.916 1.916
 0.    1.916]


### sklearn
The tf-idf of a term v in a document d is computed as tf-idf(v, d) = tf(v, d) * idf(v). With smooth_idf=False, the idf is computed as idf(v) = log [ D / df(v) ] + 1, where D is the total number of documents in the document set and df(v) is the document frequency of v, i.e., the number of documents in the document set that contain the term v. The effect of adding “1” to the idf is that terms whose log term is zero, i.e., terms that occur in all documents of a training set, are not entirely ignored. (Note that this idf formula differs from the standard textbook notation, which defines the idf as idf(v) = log [ D / (df(v) + 1) ].)

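The default, which we used above, is smooth_idf=True:
idf(v) = log [ (1 + D) / (1 + df(v)) ] + 1.

In addition, the default setting normalizes each document vector by its Euclidean norm (norm='l2'). We switched this off above with norm=None so that the weights can be reproduced by hand.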
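
To see what the normalization does, the sketch below compares the default TfidfVectorizer output with the norm=None output divided by each row's Euclidean length. It assumes the same four sentences as above; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The president of the United States (US) is president Donald Trump.",
    "Donald Trump the president wants us in a united country to trump other countries.",
    "Did a known artist paint portraits of Donald Trump?",
    "A really well-known portrait artist is Vincent van Gogh.",
]

# Unnormalized weights, as in the cells above.
raw = TfidfVectorizer(norm=None).fit_transform(corpus).toarray()

# Default settings: norm="l2", so each document vector is scaled to unit length.
default = TfidfVectorizer().fit_transform(corpus).toarray()

# Manual normalization: divide every row by its Euclidean norm.
row_norms = np.linalg.norm(raw, axis=1, keepdims=True)
manual = raw / row_norms

print(np.allclose(manual, default))
print(np.linalg.norm(default, axis=1))
```

After normalization the dot product of two document vectors is their cosine similarity, so document length no longer drives the comparison.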

In [123]:
import math
#Take the term "donald" in the first document:
#a) mentioned once
#b) D=4
#c) df(v)=3
1*(math.log(5/4)+1)

1.2231435513142097

In [12]:
#Take the term "countries" in the second document:
#a) mentioned once
#b) D=4
#c) df(v)=1
import math
1*(math.log(5/2)+1)

1.916290731874155

In [11]:
#Take the term "president" in the first document:
#a) mentioned twice
#b) D=4
#c) df(v)=2
import math
2*(math.log(5/3)+1)

3.0216512475319814
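
The three hand computations can also be checked against the fitted vectorizer in one go. This is a sketch under the same settings as above (norm=None, sublinear_tf=False, default smoothing); the helper `manual` and the list `checks` are illustrative names.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The president of the United States (US) is president Donald Trump.",
    "Donald Trump the president wants us in a united country to trump other countries.",
    "Did a known artist paint portraits of Donald Trump?",
    "A really well-known portrait artist is Vincent van Gogh.",
]

vec = TfidfVectorizer(norm=None, sublinear_tf=False)
weights = vec.fit_transform(corpus).toarray()
col = vec.vocabulary_  # maps each term to its column index
D = len(corpus)

def manual(count, df):
    # Smoothed idf times the raw count, as in the hand computations above.
    return count * (math.log((1 + D) / (1 + df)) + 1)

# (term, document index, raw count in that document, document frequency)
checks = [("donald", 0, 1, 3), ("countries", 1, 1, 1), ("president", 0, 2, 2)]
for term, doc, count, df in checks:
    print(term, round(weights[doc, col[term]], 4), round(manual(count, df), 4))
```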

In [127]:
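#So "president" is not downweighted as strongly as "donald" (df=2 instead of df=3), and because it
#appears twice in the document its weight doubles, since sublinear_tf=False lets the count enter linearly.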
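
For comparison, the sketch below recomputes the weight of "president" in the first document with sublinear_tf=True, where the raw count x is replaced by 1 + log(x). It assumes the same corpus and norm=None; `president_weight` is a hypothetical helper written for this illustration.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The president of the United States (US) is president Donald Trump.",
    "Donald Trump the president wants us in a united country to trump other countries.",
    "Did a known artist paint portraits of Donald Trump?",
    "A really well-known portrait artist is Vincent van Gogh.",
]

def president_weight(sublinear):
    # Weight of "president" in the first document for a given sublinear_tf setting.
    vec = TfidfVectorizer(norm=None, sublinear_tf=sublinear)
    weights = vec.fit_transform(corpus).toarray()
    return weights[0, vec.vocabulary_["president"]]

linear = president_weight(False)  # tf = 2
damped = president_weight(True)   # tf = 1 + log(2)

idf = math.log(5 / 3) + 1  # smoothed idf with D=4, df=2
print(linear, 2 * idf)
print(damped, (1 + math.log(2)) * idf)
```

The idf part is identical in both cases; only the term-frequency factor changes, so the gap between the two settings widens as the raw count grows.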