# Feature Engineering and Syntactic Similarity: ABC News Dataset

A tutorial on feature engineering and syntactic similarity for a text corpus. The functions are adapted from Blueprints for Text Analytics by Albrecht et al. (2021), with several adjustments for clarity. For the ready-to-use functions, please refer to the file **fun_syntatic_similarity.py**.

The Feature Engineering part shows how to:
- Vectorize data
- Calculate similarities with cosine similarity
- Improve time efficiency
- Improve the features themselves

The Syntactic Similarity part covers the applications. It explains how to:
- Find the documents most similar to a made-up document (like a search engine)
- Find the two most similar documents in a corpus
- Find related words across the documents in a corpus

In [87]:
import pandas as pd
import numpy as np

import nltk
import spacy
from tqdm import tqdm

# libraries for the machine learning models
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer  
from sklearn.metrics.pairwise import cosine_similarity

# Basics of Feature Engineering

1. Vectorizing Data
2. Calculating Similarities

## Building Your Own Vectorizer

    Create the dummy data by creating sentences

In [3]:
sentences = ['It was the best of times',
             'it was the worst of times',
             'it was the age of wisdom',
             'it was the age of foolishness']

    Tokenizing text by splitting the data

In [4]:
# the basics of splitting a sentence
# (the next cell iterates over every item in the list)

sentences[0].split()

['It', 'was', 'the', 'best', 'of', 'times']

In [5]:
# compile to a list of tokenized text
tokenized_sentences = [[token for token in sentence.split()] for sentence in sentences]
display(tokenized_sentences)

# then convert the list to a set
vocabulary = set([word for sentence in tokenized_sentences for word in sentence])
display(vocabulary)

# convert it to a pandas dataframe
# that maps each vocabulary word to its position
pd.DataFrame([word, i] for i, word in enumerate(vocabulary))

[['It', 'was', 'the', 'best', 'of', 'times'],
 ['it', 'was', 'the', 'worst', 'of', 'times'],
 ['it', 'was', 'the', 'age', 'of', 'wisdom'],
 ['it', 'was', 'the', 'age', 'of', 'foolishness']]

{'It',
 'age',
 'best',
 'foolishness',
 'it',
 'of',
 'the',
 'times',
 'was',
 'wisdom',
 'worst'}

Unnamed: 0,0,1
0,it,0
1,best,1
2,was,2
3,the,3
4,of,4
5,times,5
6,worst,6
7,age,7
8,wisdom,8
9,It,9
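The vocabulary above contains both `It` and `it`, and a Python `set` has no guaranteed order, so column positions can differ between runs. A minimal sketch, assuming lowercasing is acceptable for this corpus, that normalizes case and sorts the vocabulary (the names `tokenized`, `normalized_vocabulary` and `word_index` are illustrative and not from the book):

```python
sentences = ['It was the best of times',
             'it was the worst of times',
             'it was the age of wisdom',
             'it was the age of foolishness']

# lowercase before splitting so that 'It' and 'it' become the same token
tokenized = [sentence.lower().split() for sentence in sentences]

# sorting gives a stable, reproducible word order
normalized_vocabulary = sorted({word for sentence in tokenized for word in sentence})

# map each word to its column position
word_index = {word: i for i, word in enumerate(normalized_vocabulary)}

print(normalized_vocabulary)
print(word_index['it'])
```

With this variant the vocabulary has ten entries instead of eleven, and the position of every word is the same on each run.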


    Vectorizing Documents and Building Document Term Matrix

Each sentence is compared against the vocabulary: a position gets a 1 if the vocabulary word occurs in the sentence and a 0 otherwise.

In [6]:
# create function to one-hot encode the tokenized sentence
def onehot_encode(tokenized_sentence):
    return [1 if word in tokenized_sentence else 0 for word in vocabulary]

# one-hot encode every sentence in the tokenized sentences list
# using the for in iteration
onehot = [onehot_encode(tokenized_sentence) for tokenized_sentence in tokenized_sentences]

# print the one hot encoded sentence result
for (sentence, oh) in zip(sentences, onehot):
    print(f'{oh}: {sentence}')

[0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0]: It was the best of times
[1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]: it was the worst of times
[1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0]: it was the age of wisdom
[1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1]: it was the age of foolishness


In [7]:
pd.DataFrame([word, i] for i, word in enumerate(vocabulary))

Unnamed: 0,0,1
0,it,0
1,best,1
2,was,2
3,the,3
4,of,4
5,times,5
6,worst,6
7,age,7
8,wisdom,8
9,It,9


In [8]:
# testing the encoder on documents that contain out-of-vocabulary words

print(onehot_encode('the age of wisdom is the best of times'.split()))  # mostly known vocabulary

# mostly unknown vocabulary
# BEWARE: this call passes the raw string instead of a list of tokens, so `in` does substring
# matching and finds "of" inside "pr(OF)essional"; call .split() first to compare whole words
print(onehot_encode('Lionel Andrés Messi, also known as Leo Messi, is an Argentine professional footballer who plays as a forward for Ligue 1')) 

[0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0]
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
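The single `1` in the second vector comes from substring matching on the unsplit string. A small sketch of a more defensive encoder, assuming whitespace tokenization and lowercasing are good enough here (`onehot_encode_tokens` and the hard-coded, sorted `vocabulary` list are hypothetical helpers for illustration):

```python
# illustrative, sorted and lowercased copy of the vocabulary
vocabulary = ['age', 'best', 'foolishness', 'it', 'of', 'the',
              'times', 'was', 'wisdom', 'worst']

def onehot_encode_tokens(text_or_tokens, vocabulary):
    # split raw strings first so that `in` compares whole tokens, not substrings
    if isinstance(text_or_tokens, str):
        text_or_tokens = text_or_tokens.lower().split()
    tokens = set(text_or_tokens)
    return [1 if word in tokens else 0 for word in vocabulary]

print(onehot_encode_tokens('a professional footballer', vocabulary))
print(onehot_encode_tokens('the age of wisdom', vocabulary))
```

A document without any known word now yields an all-zero vector, which is the expected behavior for out-of-vocabulary text.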


**A document-term matrix** is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.

In [9]:
# create a dataframe from the list of one-hot encoded sentences
# using the vocabulary as its columns (as a list, since recent pandas versions may reject a set)

pd.DataFrame(onehot, columns=list(vocabulary))

Unnamed: 0,it,best,was,the,of,times,worst,age,wisdom,It,foolishness
0,0,1,1,1,1,1,0,0,0,1,0
1,1,0,1,1,1,1,1,0,0,0,0
2,1,0,1,1,1,0,0,1,1,0,0
3,1,0,1,1,1,0,0,1,0,0,1


    The Similarity Matrix

The similarity between two sentences/documents is calculated as the number of positions where both vectors contain a 1.

In [10]:
# basic of the building

# by iterating
# we want to know the similarity between the 2nd and 3rd sentences, thus onehot[1] and onehot[2]

sim = [onehot[1][i] & onehot[2][i] for i in range(len(vocabulary))]
display(sim)
display(sum(sim))

# or with the dot product (sum of products) approach
# same result as the code above, in a single line

np.dot(onehot[1], onehot[2])

[1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]

4

4

In [11]:
# calculate the similarity matrix using a faster approach
# see the book page 127

np.dot(onehot, np.transpose(onehot))

array([[6, 4, 3, 3],
       [4, 6, 4, 4],
       [3, 4, 6, 5],
       [3, 4, 5, 6]])

    One-Hot Encoding Using scikit-learn
    
Now that we have covered the foundation, note that scikit-learn already provides this kind of vectorization with **MultiLabelBinarizer**.

In [12]:
from sklearn.preprocessing import MultiLabelBinarizer

lb = MultiLabelBinarizer()  # define the binarizer
lb.fit([vocabulary])  # fit on a list of label collections; a bare set of strings would be split into characters
lb.transform(tokenized_sentences)  # transform the tokenized sentences (columns follow the sorted lb.classes_)



array([[1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
       [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0]])

## Bag-of-Words Models

Instead of recording only whether a vocabulary word appears in a document, as we did before, a bag-of-words model counts how often each word occurs in each document. I will use **CountVectorizer()** from the sklearn.feature_extraction.text module.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [14]:
new_text = ['John likes to watch movies. Mary likes movies too.', 
            'Mary also likes to watch football games']

sentences = sentences + new_text

    Fitting the Vocabulary

In [21]:
# the way countvectorizer works

# 1. learn about the vocabulary
# for the parameter building, see the documentation on https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

cv.fit(sentences)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=np.int64, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

CountVectorizer()

In [23]:
# print the vocabulary
print(cv.get_feature_names_out())

['age' 'also' 'best' 'foolishness' 'football' 'games' 'it' 'john' 'likes'
 'mary' 'movies' 'of' 'the' 'times' 'to' 'too' 'was' 'watch' 'wisdom'
 'worst']


    Transforming the Sentences

In [25]:
# 2. transform the documents to vectors
# scikit-learn returns a sparse matrix instead of a list

dt = cv.transform(sentences)
dt

<6x20 sparse matrix of type '<class 'numpy.int64'>'
	with 38 stored elements in Compressed Sparse Row format>

In [29]:
# convert sparse matrix to pandas dataframe for readability

pd.DataFrame(dt.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,0,1
2,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,1,0
3,1,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,1,2,1,2,0,0,0,1,1,0,1,0,0
5,0,1,0,0,1,1,0,0,1,1,0,0,0,0,1,0,0,1,0,0


    Calculating Similarities

The similarities are calculated with cosine similarity, the same way we did when building a product recommendation.

In [30]:
from sklearn.metrics.pairwise import cosine_similarity

In [31]:
# calculating similarity between sentence 1 and 2
cosine_similarity(dt[0], dt[1])

array([[0.83333333]])

In [32]:
# calculating similarity for all sentences
# then convert it to a dataframe

pd.DataFrame(cosine_similarity(dt, dt))

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.833333,0.666667,0.666667,0.0,0.0
1,0.833333,1.0,0.666667,0.666667,0.0,0.0
2,0.666667,0.666667,1.0,0.833333,0.0,0.0
3,0.666667,0.666667,0.833333,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.524142
5,0.0,0.0,0.0,0.0,0.524142,1.0
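To make the numbers in the table less of a black box, here is a minimal pure-Python sketch of cosine similarity on word counts. It assumes a tokenizer that imitates CountVectorizer's default behavior (lowercasing, tokens of two or more word characters); `count_vector` and `cosine` are illustrative names, not scikit-learn functions.

```python
import math
import re
from collections import Counter

def count_vector(text):
    # imitate CountVectorizer's defaults: lowercase, tokens of 2+ word characters
    return Counter(re.findall(r'\b\w\w+\b', text.lower()))

def cosine(u, v):
    # dot product over the shared terms, divided by the product of the vector lengths
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

s0 = count_vector('It was the best of times')
s1 = count_vector('it was the worst of times')
s4 = count_vector('John likes to watch movies. Mary likes movies too.')
s5 = count_vector('Mary also likes to watch football games')

print(round(cosine(s0, s1), 6))  # 5 shared words, both vectors have length sqrt(6)
print(round(cosine(s4, s5), 6))  # dot product 5, lengths sqrt(13) and sqrt(7)
```

The two printed values should correspond to the entries for sentences 0/1 and 4/5 in the matrix above.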


    TF-IDF Models for Similarity

TF-IDF weights each word count by how rare the word is across the whole corpus. It reduces the weights of frequent words and at the same time increases the weights of uncommon words. We can apply TF-IDF to our count vectors using **TfidfTransformer()** from scikit-learn.

In [40]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()  # as always, define the model first
tfidf_dt = tfidf.fit_transform(dt)  # fit and transform, and pass the 'dt' vector
pd.DataFrame(tfidf_dt.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0.0,0.0,0.56978,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.467228,0.0,0.0,0.338027,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.467228,0.0,0.0,0.338027,0.0,0.0,0.56978
2,0.467228,0.0,0.0,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.0,0.0,0.0,0.338027,0.0,0.56978,0.0
3,0.467228,0.0,0.0,0.56978,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.0,0.0,0.0,0.338027,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.305609,0.501208,0.250604,0.611219,0.0,0.0,0.0,0.250604,0.305609,0.0,0.250604,0.0,0.0
5,0.0,0.419233,0.0,0.0,0.419233,0.419233,0.0,0.0,0.343777,0.343777,0.0,0.0,0.0,0.0,0.343777,0.0,0.0,0.343777,0.0,0.0
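The weights in the table can be reproduced by hand. The sketch below assumes scikit-learn's default settings for TfidfTransformer (smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and a tokenizer that imitates CountVectorizer; `tokenize` and `tfidf_row` are illustrative helpers.

```python
import math
import re
from collections import Counter

docs = ['It was the best of times',
        'it was the worst of times',
        'it was the age of wisdom',
        'it was the age of foolishness',
        'John likes to watch movies. Mary likes movies too.',
        'Mary also likes to watch football games']

def tokenize(text):
    # imitate CountVectorizer's defaults: lowercase, tokens of 2+ word characters
    return re.findall(r'\b\w\w+\b', text.lower())

counts = [Counter(tokenize(d)) for d in docs]
n_docs = len(docs)

# document frequency: in how many documents does each term occur?
doc_freq = Counter(term for count in counts for term in count)

def tfidf_row(count):
    # term frequency times smoothed idf
    raw = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, tf in count.items()}
    # L2 normalization so that every document vector has length 1
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row0 = tfidf_row(counts[0])
print({t: round(w, 6) for t, w in sorted(row0.items())})
```

The rare word `best` ends up with a higher weight than `it`, `of`, `the` and `was`, which occur in four of the six documents.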


In [43]:
# new cosine similarity with TF-IDF Model

pd.DataFrame(cosine_similarity(tfidf_dt, tfidf_dt))

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.675351,0.457049,0.457049,0.0,0.0
1,0.675351,1.0,0.457049,0.457049,0.0,0.0
2,0.457049,0.457049,1.0,0.675351,0.0,0.0
3,0.457049,0.457049,0.675351,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.43076
5,0.0,0.0,0.0,0.0,0.43076,1.0


# Real-World Application: ABC Dataset

The ABC dataset contains news headlines published over a 19-year span (until 2021). The data comes from ABC (Australian Broadcasting Corporation), a reputable Australian news organization. The dataset is large (more than a million records), which suits this notebook because one of the goals of the project is to **demonstrate several approaches that improve the time efficiency of feature engineering on text**.

In [64]:
df = pd.read_csv('dataset/abcnews-date-text.csv', parse_dates=['publish_date'])

display(df.head(5))
print('length of the dataframe:', len(df))

Unnamed: 0,publish_date,headline_text
0,2003-02-19,aba decides against community broadcasting lic...
1,2003-02-19,act fire witnesses must be aware of defamation
2,2003-02-19,a g calls for infrastructure protection summit
3,2003-02-19,air nz staff in aust strike for pay rise
4,2003-02-19,air nz strike to affect australian travellers


length of the dataframe: 1244184


In [68]:
# TfidfVectorizer: vectorizes raw text (here a pandas Series) with TF-IDF in a single step
from sklearn.feature_extraction.text import TfidfVectorizer  

tfidf = TfidfVectorizer()
dt = tfidf.fit_transform(df['headline_text'])

In [70]:
# this is a very large sparse matrix
# even though only the non-zero elements are stored, be careful when processing it (sample first)
# computations on it can take a very long time

dt

<1244184x105966 sparse matrix of type '<class 'numpy.float64'>'
	with 8072405 stored elements in Compressed Sparse Row format>

In [72]:
%%time

# example to show how long it takes
# computing the cosine similarity for only the first 100,000 records
# already took almost 3 minutes of wall time

cosine_similarity(dt[0:100000], dt[0:100000])

CPU times: user 38.6 s, sys: 56.1 s, total: 1min 34s
Wall time: 2min 47s


array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.03076323],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.03076323, 0.        , ..., 0.        , 0.        ,
        1.        ]])
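Part of the cost is memory: by default `cosine_similarity` returns a dense array, so the result grows with the square of the number of rows. A back-of-the-envelope sketch, assuming 8 bytes per float64 value (`dense_similarity_bytes` is a hypothetical helper):

```python
def dense_similarity_bytes(n_rows, bytes_per_value=8):
    # a full n x n similarity matrix of float64 values
    return n_rows * n_rows * bytes_per_value

for n in (10_000, 100_000, 1_244_184):
    print(f'{n:>9} rows -> {dense_similarity_bytes(n) / 1e9:,.1f} GB')
```

scikit-learn's `cosine_similarity` accepts `dense_output=False` to keep the result sparse when both inputs are sparse, and working in batches avoids materializing the full matrix at once.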

## Improving Time-Efficiency: Reducing Feature Dimensions    

    Removing stopwords    

In [76]:
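# requires the NLTK stop word list: nltk.download('stopwords')
# kept as a list because recent scikit-learn versions expect a list for stop_words
stopwords = list(nltk.corpus.stopwords.words('english'))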

tfidf = TfidfVectorizer(stop_words=stopwords)
dt = tfidf.fit_transform(df['headline_text'])
dt

<1244184x105830 sparse matrix of type '<class 'numpy.float64'>'
	with 6730097 stored elements in Compressed Sparse Row format>

    Minimum frequency

Remove all words that occur in fewer than two documents.

In [77]:
tfidf = TfidfVectorizer(stop_words=stopwords, min_df=2)
dt = tfidf.fit_transform(df['headline_text'])
dt

<1244184x64129 sparse matrix of type '<class 'numpy.float64'>'
	with 6688396 stored elements in Compressed Sparse Row format>

    Maximum frequency 

In [79]:
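# max_df=0.1 drops words that appear in more than 10% of the documents
# the matrix below is unchanged: after stop word removal, no remaining word is that frequent
tfidf = TfidfVectorizer(stop_words=stopwords, min_df=2, max_df=0.1)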
dt = tfidf.fit_transform(df['headline_text'])
dt

<1244184x64129 sparse matrix of type '<class 'numpy.float64'>'
	with 6688396 stored elements in Compressed Sparse Row format>
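What `min_df` and `max_df` do can be sketched in a few lines of plain Python. This is an illustration under the assumption that `min_df` is an absolute document count and `max_df` a proportion of the corpus, as in the cells above; `filter_by_document_frequency` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, min_df=2, max_df=0.1):
    n_docs = len(tokenized_docs)
    # count every term once per document, no matter how often it repeats inside it
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # keep terms that are neither too rare nor too common
    return sorted(term for term, df in doc_freq.items()
                  if df >= min_df and df <= max_df * n_docs)

toy_docs = [['police', 'probe'],
            ['police', 'rain'],
            ['police', 'rain'],
            ['police', 'flood']]

print(filter_by_document_frequency(toy_docs, min_df=2, max_df=0.5))
```

In the toy corpus `police` is dropped for being too common, `probe` and `flood` for being too rare, and only `rain` survives.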

## Improving Features: Lemmatization

Instead of using the original words, we can apply linguistic analysis and use the lemmatized form of the text.

In [96]:
# beware of running this code
# it will take some time
# to track the progress, wrap the for loop in tqdm

# for this project, we restrict the data to news published after 2015-01-01

df_up_2015 = df[df['publish_date'] > '2015-01-01'].copy()  # copy to avoid SettingWithCopyWarning
df_up_2015.reset_index(inplace=True)

import spacy

nlp = spacy.load('en_core_web_sm')
pos_to_take = ['NOUN', 'PROPN', 'ADJ', 'ADV', 'VERB']

for i, row in tqdm(df_up_2015.iterrows(), total=df_up_2015.shape[0]):  # iterating dataframe row with its index
    doc = nlp(str(row['headline_text']))
    df_up_2015.at[i, 'lemmas'] = ' '.join([token.lemma_ for token in doc])  # df.at sets a single cell and is faster than df.loc for this
    df_up_2015.at[i, 'nav'] = ' '.join([token.lemma_ for token in doc if token.pos_ in pos_to_take])  # lemmas of the selected parts of speech only

100%|██████████| 318573/318573 [33:46<00:00, 157.22it/s] 
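The row-by-row loop above took more than half an hour. spaCy can process texts in batches with `nlp.pipe`, which is usually faster than calling `nlp` once per row. The sketch below is an assumption-laden alternative, not the book's code: `lemmatize_texts` is a hypothetical helper, and disabling the parser and named-entity recognizer is assumed not to change the lemmas or POS tags.

```python
def lemmatize_texts(texts, nlp, pos_to_take, batch_size=1000):
    """Return two lists: all lemmas per text, and lemmas of selected POS tags per text."""
    lemmas, nav = [], []
    # nlp.pipe streams the texts through the pipeline in batches
    for doc in nlp.pipe(texts, batch_size=batch_size):
        lemmas.append(' '.join(token.lemma_ for token in doc))
        nav.append(' '.join(token.lemma_ for token in doc
                            if token.pos_ in pos_to_take))
    return lemmas, nav

# Usage (requires spaCy and the en_core_web_sm model, so it is left as a comment):
# import spacy
# nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# pos_to_take = {'NOUN', 'PROPN', 'ADJ', 'ADV', 'VERB'}
# lemmas, nav = lemmatize_texts(df_up_2015['headline_text'].map(str), nlp, pos_to_take)
# df_up_2015['lemmas'] = lemmas
# df_up_2015['nav'] = nav
```

Collecting the results in lists and assigning each column once also avoids hundreds of thousands of single-cell writes to the dataframe.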


In [108]:
df_up_2015

Unnamed: 0,publish_date,headline_text,lemmas,nav
0,2015-01-02,abalone salmon fish farming environment,abalone salmon fish farming environment,abalone salmon fish farming environment
1,2015-01-02,abalone salmon tassal huon aquaculture aquacul...,abalone salmon tassal huon aquaculture aquacul...,abalone salmon tassal huon aquaculture aquacul...
2,2015-01-02,act government approves changes to asbestos ma...,act government approve change to asbestos mana...,act government approve change asbestos managem...
3,2015-01-02,activist to send copies of the interview to no...,activist to send copy of the interview to nort...,activist send copy interview north korea
4,2015-01-02,agricultural graduate job access limited by la...,agricultural graduate job access limit by lack...,agricultural graduate job access limit lack ex...
...,...,...,...,...
318568,2021-12-31,two aged care residents die as state records 2...,two aged care resident die as state record 2;093,aged care resident die state record
318569,2021-12-31,victoria records 5;919 new cases and seven deaths,victoria record 5;919 new case and seven death,victoria record new case death
318570,2021-12-31,wa delays adopting new close contact definition,wa delay adopt new close contact definition,wa delay adopt new close contact definition
318571,2021-12-31,western ringtail possums found badly dehydrate...,western ringtail possum find badly dehydrate i...,western ringtail possum find badly dehydrate h...


In [129]:
tfidf = TfidfVectorizer()
dt = tfidf.fit_transform(df_up_2015['headline_text'].map(str))  # without lemmatized form
print(dt.shape)
print(dt.data.nbytes)

tfidf = TfidfVectorizer(stop_words=stopwords)
dt = tfidf.fit_transform(df_up_2015['nav'].map(str))  # with lemmatized form
print(dt.shape)
print(dt.data.nbytes)

(318573, 69905)
19475544
(318573, 58427)
15420656


    Removing Most Common Words

Using <a href='https://github.com/first20hours/google-10000-english/blob/master/google-10000-english.txt'>Google dataset</a> of 10,000 most common words in English.

In [124]:
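# removing the most common words greatly reduces the number of stored elements

common_words = pd.read_csv('dataset/google-10000-english.txt', header=None)
common_words = list(set(common_words.iloc[:,0].values))  # a list, since recent scikit-learn versions expect a list for stop_words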

tfidf = TfidfVectorizer(stop_words=common_words)
dt = tfidf.fit_transform(df_up_2015['nav'].map(str))
dt

<318573x51206 sparse matrix of type '<class 'numpy.float64'>'
	with 479932 stored elements in Compressed Sparse Row format>

    Adding Context via N-Grams    

In [128]:
# unigrams and bigrams
tfidf = TfidfVectorizer(stop_words=stopwords, ngram_range=(1,2), min_df=2)
dt = tfidf.fit_transform(df_up_2015['headline_text'])
print(dt.shape)
print(dt.data.nbytes)

# unigrams, bigrams and trigrams
tfidf = TfidfVectorizer(stop_words=stopwords, ngram_range=(1,3), min_df=2)
dt = tfidf.fit_transform(df_up_2015['headline_text'])
print(dt.shape)
print(dt.data.nbytes)

(318573, 221206)
22983976
(318573, 286281)
24720640
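The growth in the number of columns follows from what `ngram_range` adds to the vocabulary. A minimal sketch of word n-gram generation for a single tokenized headline (`word_ngrams` is an illustrative helper, not the scikit-learn implementation):

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    grams = []
    # for every n in the range, slide a window of n tokens over the text
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

print(word_ngrams(['world', 'cup', 'final'], 1, 2))
print(word_ngrams(['world', 'cup', 'final'], 1, 3))
```

Every additional n contributes new combined terms such as `world cup`, which adds context but also more features.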


In [131]:
# combining n-grams with linguistic features and removal of the most common words
# this reduces the size of dt by roughly a factor of 5 to 6 compared with the n-gram matrices above

tfidf = TfidfVectorizer(stop_words=common_words, ngram_range=(1,2), min_df=2)
dt = tfidf.fit_transform(df_up_2015['nav'])
print(dt.shape)
print(dt.data.nbytes)

(318573, 46834)
4261056


# Business Application: After Vectorizing

For this part, we vectorize the text with the following specifications:
- Original dataframe, without linguistic preprocessing (lemmatization and POS tags)
- Stop words removed
- Unigrams and bigrams
- Minimum document frequency: a term must appear in at least 2 headlines
- L2 normalization

In [133]:
tfidf = TfidfVectorizer(stop_words=stopwords, 
                        ngram_range=(1,2), 
                        min_df=2, 
                        norm='l2')

dt = tfidf.fit_transform(df['headline_text'])

    Finding Similar Headlines

Looking for the headlines in the corpus that are most similar to a made-up headline.

In [188]:
# first, we transform the headline that we're going to check
text_to_look_for = 'Lionel Messi and the world cup'
look_for = tfidf.transform([text_to_look_for])

# second, calculate the cosine similarity between look_for and dt
sim = cosine_similarity(look_for, dt)

In [187]:
# last, show the most similar headlines from dt

top_n = 5  # number of most similar headlines to show
x = np.argsort(sim)[0][::-1][:top_n]  # positions of the top_n highest similarities
top_df = df[['publish_date', 'headline_text']].iloc[x].copy()
top_df['cosine_sim'] = sim[0][x]  # positional assignment, independent of the index labels

top_df

Unnamed: 0,publish_date,headline_text,cosine_sim
1004486,2016-01-12,lionel messi in quotes,0.753968
1129253,2018-06-01,lionel messi montage,0.721831
1211510,2020-08-26,could lionel messi be coming to your club,0.715942
1150146,2018-12-09,young afghan lionel messi fan threatened by ta...,0.624409
1129594,2018-06-04,world cup journeys lionel messi chance to matc...,0.608173
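Sorting more than a million scores only to keep five of them does more work than needed. As a sketch of the idea of partial selection, using only the standard library (`top_n_indices` is a hypothetical helper; NumPy offers `np.argpartition` for the same purpose):

```python
import heapq

def top_n_indices(scores, top_n):
    # (index, score) pairs for the top_n largest scores, best first
    return heapq.nlargest(top_n, enumerate(scores), key=lambda pair: pair[1])

scores = [0.10, 0.75, 0.00, 0.62, 0.72]
print(top_n_indices(scores, 3))
```

For a single query the difference is small, but it adds up when many queries are run against the same corpus.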


In [249]:
# then, we can wrap it to a function

def look_for_text(doc, text_to_look_for, top_n, **kwargs):
    '''
    Look for the texts most similar to a query text in a collection of documents.
    Note: publish date and headline are looked up in the global dataframe `df`,
    so `doc` must be a column of `df` (same row order).
    
    Parameters:
    doc                 : collection of docs we want to analyze (from a corpus)
    text_to_look_for    : text that we're looking for
    top_n               : number of most similar texts to return
    **kwargs            : parameters passed on to TfidfVectorizer
    
    Returns:
    Dataframe of the top_n most similar texts with their cosine similarity
    '''
    
    tfidf = TfidfVectorizer(**kwargs)
    dt = tfidf.fit_transform(doc)
    
    look_for = tfidf.transform([text_to_look_for])
    sim = cosine_similarity(look_for, dt)
    
    x = np.argsort(sim)[0][::-1][:top_n]  # positions of the top_n highest similarities
    top_df = df[['publish_date', 'headline_text']].iloc[x].copy()
    top_df['cosine_sim'] = sim[0][x]  # positional assignment, independent of the index labels
    
    return top_df

In [255]:
# let's try the function

text_to_look_for = 'Australia and Indonesia agreement'
look_for_text(df['headline_text'], 
              text_to_look_for, 
              top_n=7,
              stop_words=stopwords, 
              ngram_range=(1,2), 
              min_df=2, 
              norm='l2')

Unnamed: 0,publish_date,headline_text,cosine_sim
722039,2012-09-05,australia and indonesia at odds over rescued,0.671871
826204,2013-10-15,australia and indonesia split with activists over,0.666306
245606,2006-06-27,australia and indonesia back on track,0.637413
834743,2013-11-18,australia indonesia relationship,0.589054
896768,2014-08-19,australia and indonesia sort out strained,0.57568
1093584,2017-08-10,australia and indonesia at the asean summit,0.555942
1157287,2019-02-28,australia indonesia to sign free trade agreement,0.539515


    Finding the Two Most Similar Documents in a Corpus

This code is taken directly from the book because it **has to do with time efficiency**: the similarities are computed batch by batch instead of for the whole corpus at once. See **page 146** of the book for more details on how it works.

In [193]:
%%time
batch = 10000
max_sim = 0.0

max_a = None
max_b = None

for a in tqdm(range(0, dt.shape[0], batch)):
    
    for b in range(0, a+batch, batch): 
        # print(a, b) -> should be eliminated, the book says to print(a,b)
        r = np.dot(dt[a:a+batch], np.transpose(dt[b:b+batch]))
        
        # eliminate identical vectors (including each document compared with itself)
        # by setting their similarity to np.nan, which gets sorted out
        r[r > 0.9999] = np.nan
        sim = r.max()
        
        if sim > max_sim:
            # argmax returns a single value which we have to
            # map to the two dimensions
            (max_a, max_b) = np.unravel_index(np.argmax(r), r.shape)  
            
            # adjust offsets in corpus (this is a submatrix)
            max_a += a
            max_b += b
            max_sim = sim

100%|██████████| 125/125 [10:34<00:00,  5.07s/it]

CPU times: user 9min 31s, sys: 17.3 s, total: 9min 48s
Wall time: 10min 34s
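The loop structure is easier to follow on a toy example. The following pure-Python sketch imitates the batching logic under simplifying assumptions: rows are small dense lists that are already L2-normalized, so the dot product equals the cosine similarity. `most_similar_pair` is an illustrative helper and not the book's code.

```python
def most_similar_pair(rows, batch=2, identical=0.9999):
    """Return (similarity, i, j) of the most similar non-identical pair of rows."""
    best = (0.0, None, None)
    for a in range(0, len(rows), batch):
        # only blocks on or below the diagonal are visited, because
        # similarity is symmetric: block (a, b) mirrors block (b, a)
        for b in range(0, a + batch, batch):
            for i in range(a, min(a + batch, len(rows))):
                for j in range(b, min(b + batch, len(rows))):
                    sim = sum(x * y for x, y in zip(rows[i], rows[j]))
                    if sim > identical:
                        continue  # skip self-comparisons and exact duplicates
                    if sim > best[0]:
                        best = (sim, i, j)  # i and j are already corpus-wide positions
    return best

toy_rows = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
print(most_similar_pair(toy_rows))
```

In the notebook code the positions found inside a block are relative to that block, which is why `a` and `b` are added back as offsets.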





In [274]:
# wrap it into a function
# beware of how long it takes to process

def similar_doc_in_corpus(doc, **kwargs):
    batch = 10000
    max_sim = 0.0

    max_a = None
    max_b = None
    
    tfidf = TfidfVectorizer(**kwargs)
    dt = tfidf.fit_transform(doc)

    for a in tqdm(range(0, dt.shape[0], batch)):
    
        for b in range(0, a+batch, batch): 
            # print(a, b) -> should be eliminated, the book says to print(a,b)
            r = np.dot(dt[a:a+batch], np.transpose(dt[b:b+batch]))
        
            # eliminate identical vectors (including each document compared with itself)
            # by setting their similarity to np.nan, which gets sorted out
            r[r > 0.9999] = np.nan
            sim = r.max()
        
            if sim > max_sim:
                # argmax returns a single value which we have to
                # map to the two dimensions
                (max_a, max_b) = np.unravel_index(np.argmax(r), r.shape)  
            
                # adjust offsets in corpus (this is a submatrix)
                max_a += a
                max_b += b
                max_sim = sim
        
    # look the two documents up in the collection that was passed in (a pandas Series)
    list_sim = [doc.iloc[max_a], doc.iloc[max_b]]
        
    print('\n'.join(list_sim))

In [275]:
# let's check the most similar sentences here
similar_doc_in_corpus(df['headline_text'], 
                      stop_words=stopwords, 
                      ngram_range=(1,2), 
                      min_df=2, 
                      norm='l2')

100%|██████████| 125/125 [13:14<00:00,  6.35s/it]


vline fails to meet punctuality targets report
vline fails to meet punctuality targets


    Finding Related Words in the Corpus

Based on two rules:

- Words are related if they are used in the same documents
- They become more related the more frequently they appear together in the corpus

In [202]:
# create the term-document matrix
# it's the transposed form of the document-term matrix

tfidf_word = TfidfVectorizer(stop_words=stopwords, min_df=1000)
dt_word = tfidf_word.fit_transform(df['headline_text'])

r = cosine_similarity(dt_word.T, dt_word.T)  # this is the part where we transpose the data
np.fill_diagonal(r, 0)

In [226]:
voc = tfidf_word.get_feature_names_out()  # create the vocabulary
size = r.shape[0]

for i in np.argsort(r.flatten())[::-1][0:40]:
    # finding the pair: map the flat index back to (row, column)
    a = i // size
    b = i % size
    if a > b:  # to avoid repetitions (only show the pair once)
        print(f'{voc[a]} related to {voc[b]}')

kong related to hong
sri related to lanka
covid related to 19
seekers related to asylum
springs related to alice
trump related to donald
hour related to country
pleads related to guilty
hill related to broken
vs related to summary
violence related to domestic
climate related to change
royal related to commission
care related to aged
scott related to morrison
gold related to coast
driving related to drink
wall related to street
mental related to health
world related to cup
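The idea behind the transposed matrix can be shown on a toy corpus: each word is described by the documents it occurs in, and two words are similar when those document vectors point in the same direction. The sketch below uses raw counts instead of TF-IDF weights to stay short; `term_docs` and `cosine` are illustrative names.

```python
import math
from collections import defaultdict
from itertools import combinations

docs = ['hong kong protest', 'hong kong election', 'election result']

# term -> {document index: count}, i.e. one row of the term-document matrix
term_docs = defaultdict(dict)
for doc_id, text in enumerate(docs):
    for term in text.split():
        term_docs[term][doc_id] = term_docs[term].get(doc_id, 0) + 1

def cosine(u, v):
    dot = sum(u[d] * v[d] for d in u.keys() & v.keys())
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

# combinations() yields every unordered pair exactly once,
# so no extra check against repeated pairs is needed
pairs = sorted(((cosine(term_docs[a], term_docs[b]), a, b)
                for a, b in combinations(sorted(term_docs), 2)),
               reverse=True)

print(pairs[0])
```

`hong` and `kong` always occur in the same documents, so they come out as the most related pair, just as in the output above.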


In [256]:
# we can turn it into a dataframe and show its cosine similarity score

def similar_word(doc, top_n, min_appear, **kwargs):
    '''
    Look for the most related word pairs in a corpus.
    
    Parameters:
    doc         : collection of docs we want to analyze (from a corpus)
    top_n       : number of most related word pairs to return
    min_appear  : minimum number of documents a word must appear in (passed as min_df)
    **kwargs    : parameters passed on to TfidfVectorizer
    
    Returns:
    Dataframe of the most related word pairs in the corpus with their cosine similarity
    '''
    
    tfidf_word = TfidfVectorizer(min_df=min_appear, **kwargs)  
    dt_word = tfidf_word.fit_transform(doc)

    r = cosine_similarity(dt_word.T, dt_word.T)  # this is the part where we transpose the data
    np.fill_diagonal(r, 0)
    
    voc = tfidf_word.get_feature_names_out()  # create the vocabulary
    size = r.shape[0]
    
    # create the data frame and its row iteration
    df_sim = pd.DataFrame()
    row = 0
    
    for i in np.argsort(r.flatten())[::-1][0:100]:
        # finding the pair
        a = int(i/size)
        b = i%size
        if a > b:  # to avoid repetitions (only show the pair once)
            df_sim.loc[row, 'word_1'] = voc[a]
            df_sim.loc[row, 'word_2'] = voc[b]
            df_sim.loc[row, 'sim'] = r[a][b]
            row += 1
    
    return df_sim.head(top_n)

In [257]:
similar_word(df['headline_text'], 
             top_n=10, 
             min_appear=1000, 
             stop_words=stopwords)

Unnamed: 0,word_1,word_2,sim
0,kong,hong,0.953522
1,sri,lanka,0.817174
2,covid,19,0.630267
3,seekers,asylum,0.579449
4,springs,alice,0.575388
5,trump,donald,0.571107
6,hour,country,0.53605
7,pleads,guilty,0.516256
8,hill,broken,0.479397
9,vs,summary,0.448103
