# 01-TF-IDF

Here we compute TF-IDF on a corpus of newspaper headlines.

Begin by importing the required libraries:

In [1]:
# import needed libraries
import nltk
import numpy as np
import pandas as pd

Load the data from the file *headlines.csv*.

In [2]:
# TODO: Load the dataset
df = pd.read_csv('headlines.csv')

As usual, check the dataset's basic information.

In [5]:
# TODO: Have a look at the data
print(df.head())
print(df.info())

   publish_date                                      headline_text
0      20170721  algorithms can make decisions on behalf of fed...
1      20170721  andrew forrests fmg to appeal pilbara native t...
2      20170721                           a rural mural in thallan
3      20170721  australia church risks becoming haven for abusers
4      20170721  australian company usgfx embroiled in shanghai...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   1999 non-null   int64 
 1   headline_text  1999 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB
None


We will now preprocess this text data: tokenization, removal of punctuation and stop words, and stemming.

Hint: use NLTK, the *pandas* method *apply*, lambda functions, and list comprehensions.

In [11]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Note: word_tokenize and stopwords rely on NLTK data packages (installed with nltk.download)

# Tokenize each headline
df['tokenized_text'] = df['headline_text'].apply(word_tokenize)
# Keep alphabetic tokens only (drops punctuation and numbers)
df['tokenized_text'] = df['tokenized_text'].apply(lambda x: [word for word in x if word.isalpha()])
# Remove English stop words
stop_words = set(stopwords.words('english'))
df['tokenized_text'] = df['tokenized_text'].apply(lambda x: [word for word in x if word not in stop_words])
# Stem the remaining tokens
stemmer = PorterStemmer()
df['stemmed'] = df['tokenized_text'].apply(lambda x: [stemmer.stem(word) for word in x])
print(df['stemmed'])

0         [algorithm, make, decis, behalf, feder, minist]
1       [andrew, forrest, fmg, appeal, pilbara, nativ,...
2                                 [rural, mural, thallan]
3                  [australia, church, risk, becom, abus]
4       [australian, compani, usgfx, embroil, shanghai...
                              ...                        
1994    [constitut, avenu, win, top, prize, act, archi...
1995                         [dark, mofo, number, crunch]
1996    [david, petraeu, say, australia, must, firm, s...
1997    [driverless, car, australia, face, challeng, r...
1998               [drug, compani, criticis, price, hike]
Name: stemmed, Length: 1999, dtype: object
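
To see what each step does, the pipeline can be traced on a single sentence. This is a minimal sketch under assumptions: the sentence is made up for illustration, `str.split` stands in for `word_tokenize`, and a tiny hand-written stop-word set stands in for NLTK's English list, so that no NLTK data needs to be downloaded.

```python
from nltk.stem import PorterStemmer

# Hypothetical example sentence (not taken from the dataset)
headline = "drug companies criticised for price hike"

# Stand-in for word_tokenize: split on whitespace (assumption, for illustration only)
tokens = headline.split()

# Keep purely alphabetic tokens, which drops punctuation and numbers
tokens = [word for word in tokens if word.isalpha()]

# Tiny stand-in for stopwords.words('english') (assumption)
toy_stop_words = {"for", "the", "of", "in", "on"}
tokens = [word for word in tokens if word not in toy_stop_words]

# Reduce each remaining word to its Porter stem
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in tokens]
print(stems)
```

Stems are not always dictionary words (for example `compani`), which is why the vocabulary printed above looks truncated.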


Now compute the Bag of Words for our data, using scikit-learn.

Warning: since we used our own preprocessing, you have to bypass the vectorizer's analyzer by passing an identity function.

In [151]:
from sklearn.feature_extraction.text import CountVectorizer
def identity_tokenizer(text):
    return text

vectorizer = CountVectorizer(analyzer=identity_tokenizer)
bow = vectorizer.fit_transform(df['stemmed'])
feature_names = vectorizer.get_feature_names_out()
print(bow.toarray().shape)

(1999, 4165)


You can check the shape of the BOW; the expected value is `(1999, 4165)`.

Now compute the Term Frequency and then the Inverse Document Frequency, and check that the values are not all zeros.
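
For reference, the manual computation in this notebook follows one common set of definitions, written out here. The symbols are introduced only for illustration: $f_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t)
$$

Other variants exist (smoothed IDF, raw counts as TF), so values computed with a different library need not match these exactly.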

In [94]:
bow_array = bow.toarray()

# Number of tokens in each document (column vector, one entry per row of the BOW)
tokens_per_document = bow_array.sum(axis=1, keepdims=True)

# TF: divide each count by its document length; empty documents keep a row of zeros
tf_matrix = np.divide(bow_array, tokens_per_document,
                      out=np.zeros(bow_array.shape),
                      where=tokens_per_document != 0)

# The distinct TF values show that the matrix is not all zeros
unique_values = np.unique(tf_matrix)
unique_values


array([0.        , 0.08333333, 0.09090909, 0.1       , 0.11111111,
       0.125     , 0.14285714, 0.16666667, 0.18181818, 0.2       ,
       0.22222222, 0.25      , 0.28571429, 0.33333333, 0.4       ,
       0.5       , 1.        ])

In [100]:
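# Document frequency: number of documents in which each term appears
document_frequency = np.sum(bow_array > 0, axis=0)
# Inverse document frequency (one value per term, despite the variable name)
idf_matrix = np.log(bow_array.shape[0] / document_frequency)

# Sort the IDF values in ascending order for inspection
sorted_indices = np.argsort(idf_matrix)
idf_array_sorted = idf_matrix[sorted_indices]

idf_array_sorted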

array([3.28291422, 3.36629583, 3.44151925, ..., 7.60040233, 7.60040233,
       7.60040233])

Finally, compute the TF-IDF.

In [149]:
# Multiply each TF row by the IDF vector; broadcasting replaces the explicit double loop
tf_idf_matrix = tf_matrix * idf_matrix

tf_idf_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
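
The displayed corners are all zeros only because the matrix is sparse and the printout is truncated. A small worked example makes the arithmetic easier to follow; this is a sketch on a made-up 3x3 count matrix (all `toy_*` names are assumptions), using the same TF and IDF definitions as the manual computation.

```python
import numpy as np

# Toy count matrix (hypothetical): 3 documents, 3 terms
toy_counts = np.array([[1, 1, 0],
                       [1, 0, 2],
                       [0, 0, 1]])

# TF: divide each row by the number of tokens in that document
toy_tf = toy_counts / toy_counts.sum(axis=1, keepdims=True)

# IDF: log of (number of documents / number of documents containing the term)
toy_df = (toy_counts > 0).sum(axis=0)
toy_idf = np.log(toy_counts.shape[0] / toy_df)

# Broadcasting multiplies every row of toy_tf by the IDF vector
toy_tfidf = toy_tf * toy_idf

# Same result as an element-by-element double loop
looped = np.zeros(toy_tf.shape)
for i in range(toy_tf.shape[0]):
    for j in range(toy_tf.shape[1]):
        looped[i, j] = toy_tf[i, j] * toy_idf[j]

print(np.allclose(toy_tfidf, looped))
print(toy_tfidf.round(3))
```

The second term appears in only one of the three documents, so it gets the largest IDF, `ln(3)`, and its TF-IDF in the first document is `0.5 * ln(3)`.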

What are the 10 words with the highest and lowest TF-IDF on average?

In [153]:
word_tfidf_df = pd.DataFrame({'Word': feature_names, 'TF-IDF': tf_idf_matrix.mean(axis=0)})
lowest_words = word_tfidf_df.sort_values(by='TF-IDF').head(10)
highest_words = word_tfidf_df.sort_values(by='TF-IDF', ascending=False).head(10)

print("Top 10 words with lowest TF-IDF on average:")
print(lowest_words)

print("\nTop 10 words with highest TF-IDF on average:")
print(highest_words)


Top 10 words with lowest TF-IDF on average:
      Word    TF-IDF
1648    gw  0.000317
2515  nmfc  0.000317
2300  melb  0.000317
760   coll  0.000317
1526  gcfc  0.000317
1528  geel  0.000317
33    adel  0.000317
1684   haw  0.000317
3619   syd  0.000317
3930     v  0.000346

Top 10 words with highest TF-IDF on average:
            Word    TF-IDF
249    australia  0.019863
250   australian  0.019709
2493         new  0.017262
2245      market  0.015168
2771       polic  0.014775
3223         say  0.014357
3836       trump  0.013593
3995          wa  0.012777
2224         man  0.012624
3620      sydney  0.012067


Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the one you used to compute the BOW).

In [154]:
from sklearn.feature_extraction.text import TfidfVectorizer
df['processed_text'] = df['stemmed'].apply(lambda x: ' '.join(x))
vectorizer = TfidfVectorizer()

tfidf_matrix = vectorizer.fit_transform(df['processed_text'])
print(tfidf_matrix)


  (0, 2351)	0.37940445994614136
  (0, 1351)	0.36465387895025764
  (0, 354)	0.4836746969147799
  (0, 990)	0.3852340690707783
  (0, 2217)	0.3267905927279272
  (0, 90)	0.4836746969147799
  (1, 3170)	0.327942300480771
  (1, 3747)	0.3229796671605963
  (1, 2455)	0.34651373707753075
  (1, 2724)	0.39063104545637256
  (1, 159)	0.327942300480771
  (1, 1413)	0.3756519462531484
  (1, 1440)	0.3756519462531484
  (1, 130)	0.35454009167689143
  (2, 3689)	0.6136272674992542
  (2, 2426)	0.6136272674992542
  (2, 3175)	0.4969136274673873
  (3, 11)	0.43643479335149765
  (3, 345)	0.48833892208322827
  (3, 3116)	0.4412675558631679
  (3, 704)	0.5235336535355645
  (3, 249)	0.3197580743141225
  (4, 3491)	0.4228385123497916
  (4, 3482)	0.3424134326053298
  (4, 3302)	0.4228385123497916
  :	:
  (1995, 916)	0.5485265217155414
  (1995, 2373)	0.5485265217155414
  (1995, 2536)	0.4302755758469799
  (1995, 959)	0.4616278141304413
  (1996, 2701)	0.4205359647251682
  (1996, 972)	0.3718074452777344
  (1996, 2436)	0.3718074

Compare the 10 highest and lowest TF-IDF words on average with the ones you computed yourself.

In [155]:
average_tfidf_per_word = np.asarray(tfidf_matrix.mean(axis=0)).flatten()
feature_names = vectorizer.get_feature_names_out()
word_tfidf_df = pd.DataFrame({'Word': feature_names, 'Average TF-IDF': average_tfidf_per_word})
lowest_words = word_tfidf_df.sort_values(by='Average TF-IDF').head(10)
highest_words = word_tfidf_df.sort_values(by='Average TF-IDF', ascending=False).head(10)

print("Top 10 words with lowest TF-IDF on average:")
print(lowest_words)

print("\nTop 10 words with highest TF-IDF on average:")
print(highest_words)

Top 10 words with lowest TF-IDF on average:
       Word  Average TF-IDF
759    coll        0.000153
3612    syd        0.000153
2511   nmfc        0.000153
2297   melb        0.000153
1646     gw        0.000153
33     adel        0.000153
1524   gcfc        0.000153
1526   geel        0.000153
1682    haw        0.000153
1308  fabio        0.000161

Top 10 words with highest TF-IDF on average:
            Word  Average TF-IDF
249    australia        0.010058
250   australian        0.009756
2489         new        0.008802
2766       polic        0.007834
3216         say        0.007608
3829       trump        0.006907
2221         man        0.006618
3986          wa        0.006291
668        charg        0.006091
3613      sydney        0.005723


Do you get the same words? How do you explain it?
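
One way to explore this question: with its default settings, `TfidfVectorizer` does not use the same formulas as the manual computation. By default it keeps raw counts as TF, uses a smoothed IDF, $\ln\frac{1+N}{1+\mathrm{df}(t)} + 1$, and L2-normalizes each row. Its default token pattern also keeps only tokens of two or more characters, which is consistent with `v` appearing in the manual list but not in the scikit-learn one. The sketch below imitates these defaults with NumPy on a made-up count matrix; the toy data and variable names are assumptions for illustration, and this is not scikit-learn's actual implementation.

```python
import re
import numpy as np

# Default token pattern of TfidfVectorizer: tokens need at least two word characters
default_token_pattern = r"(?u)\b\w\w+\b"
kept = re.findall(default_token_pattern, "adel v geel")
print(kept)  # the single-character token is dropped

# Toy count matrix (hypothetical): 3 documents, 3 terms
toy_counts = np.array([[1, 1, 0],
                       [1, 0, 2],
                       [0, 0, 1]], dtype=float)
n_docs = toy_counts.shape[0]
toy_df = (toy_counts > 0).sum(axis=0)

# Manual variant used earlier: length-normalized TF, unsmoothed IDF
manual_tf = toy_counts / toy_counts.sum(axis=1, keepdims=True)
manual_tfidf = manual_tf * np.log(n_docs / toy_df)

# scikit-learn-like defaults: raw counts, smoothed IDF, then L2 row normalization
smooth_idf = np.log((1 + n_docs) / (1 + toy_df)) + 1
sk_like = toy_counts * smooth_idf
sk_like = sk_like / np.linalg.norm(sk_like, axis=1, keepdims=True)

print(manual_tfidf.round(3))
print(sk_like.round(3))
```

Because of the smoothing and the row normalization, the average scores sit on a different scale and the two rankings need not coincide exactly, even though frequent words such as `australia` lead both lists.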