<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Similar-Texts-Recommendation-Program" data-toc-modified-id="Similar-Texts-Recommendation-Program-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Similar Texts Recommendation Program</a></span></li><li><span><a href="#Data" data-toc-modified-id="Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#Text-Processing" data-toc-modified-id="Text-Processing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Text Processing</a></span><ul class="toc-item"><li><span><a href="#What-does-text-processing-do?" data-toc-modified-id="What-does-text-processing-do?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>What does text processing do?</a></span></li><li><span><a href="#Apply-text-processing-to-all-text" data-toc-modified-id="Apply-text-processing-to-all-text-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Apply text processing to all text</a></span></li></ul></li><li><span><a href="#TF-IDF-(Term-Frequency---Inverse-Document-Frequency)" data-toc-modified-id="TF-IDF-(Term-Frequency---Inverse-Document-Frequency)-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>TF-IDF (Term Frequency - Inverse Document Frequency)</a></span></li><li><span><a href="#Cosine-Similarity" data-toc-modified-id="Cosine-Similarity-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Cosine-Similarity</a></span></li><li><span><a href="#Why-are-texts-similar?—Common-features" data-toc-modified-id="Why-are-texts-similar?—Common-features-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Why are texts similar?—Common features</a></span></li><li><span><a href="#Output" data-toc-modified-id="Output-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Output</a></span></li></ul></div>

# Similar Texts Recommendation Program

1. This program finds the top n most similar texts (article title + abstract) for a given text, using TF-IDF (term frequency-inverse document frequency) weights and cosine similarity.
2. It can be applied to literature reviews and document classification.

# Data

In [289]:
import pandas as pd
import time as tm
start_time = tm.strftime("%m/%d/%Y, %A, %H:%M %p")
print("This jupyter notebook was created on ", str(start_time))

This jupyter notebook was created on  05/27/2023, Saturday, 02:38 AM


In [290]:
df = pd.read_excel("mig_analysis.xlsx")

In [495]:
# which are the variables?
df.columns

Index(['Author Full Names', 'Source Title', 'Times Cited, WoS Core',
       'Publication Year', 'WoS Categories', 'id', 'id2', 'keywords', 'decade',
       'alltext', 'token'],
      dtype='object')

# Text Processing

In [292]:
# concatenate the article title and abstract
# NAs are deliberately not filled with ' ': a row missing either the title or
# the abstract becomes NA here and is dropped in the next step
df['alltext'] = df['Article Title'] + ' ' + df['Abstract']
# drop the two source columns
df = df.drop(['Article Title', 'Abstract'], axis = 1)

In [293]:
# remove rows with any NAs
df = df.dropna(how = 'any')

In [389]:
# remaining observations
print(df.shape[0])
df.tail(2)

2439


Unnamed: 0,Author Full Names,Source Title,"Times Cited, WoS Core",Publication Year,WoS Categories,id,id2,keywords,decade,alltext,token
2615,"Ly Thi Tran; Tan, George; Bui, Huyen; Rahimi, ...",POPULATION SPACE AND PLACE,0,2023,Demography; Geography,23_50,2616,EMPLOYMENT OUTCOMES; GRADUATE EMPLOYABILITY; H...,t20,international graduates on temporary post grad...,intern graduat temporari post graduat visa aus...
2616,"Mabi, Millicent N.; O'Brien, Heather L.; Natha...",JOURNAL OF DOCUMENTATION,1,2023,"Computer Science, Information Systems; Informa...",23_51,2617,INFORMATION POVERTY; AFRICAN IMMIGRANTS; INFOR...,t20,questioning the role of information poverty in...,question role inform poverti immigr employ acq...


In [294]:
# suppress warning messages
import warnings
warnings.filterwarnings("ignore")
# tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# lemmatize: reduce a word to its base (dictionary) form
from nltk.stem import WordNetLemmatizer
# stem: strip suffixes to keep the root of a word
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
lm = WordNetLemmatizer()
# abstracts are sometimes also available in Spanish and/or German
# note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain set
stopwords = set(stopwords.words(['english','spanish','german']))

## What does text processing do?

- A brief demonstration of what each step does
- A comparison of the output under different configurations

In [350]:
# the hyphen in 'Northern-Ireland' is left in place here to show its effect
body = 'This paper takes a preliminary look at a disaggregate data source not previously used in the analysis of Northern-Ireland migration patterns. '
# each [...] below is a list comprehension: tokenize first, then lemmatize and stem each token
# 'This' survives as 'thi' in step 3 because the stopword list is lowercase and body is not lowercased here
print("1. TOKENIZE, then LEMMATIZE & STEM:\n"\
      ,[stemmer.stem(lm.lemmatize(w)) for w in word_tokenize(body)],"\nPunctuation and stopwords are here.\n")
print("2. TOKENIZE, LEMMATIZE & STEM, and LETTERS only:\n"\
      ,[stemmer.stem(lm.lemmatize(w)) for w in word_tokenize(body) if w.isalpha()],'\nStopwords are still here.\n')
print("3. TOKENIZE, LEMMATIZE & STEM, LETTERS only, and NO STOPWORDS:\n"\
      ,[stemmer.stem(lm.lemmatize(w)) for w in word_tokenize(body) if (w.isalpha() and w not in stopwords)],"\nBoth punctuation and stopwords are gone.")

1. TOKENIZE, then LEMMATIZE & STEM:
 ['thi', 'paper', 'take', 'a', 'preliminari', 'look', 'at', 'a', 'disaggreg', 'data', 'sourc', 'not', 'previous', 'use', 'in', 'the', 'analysi', 'of', 'northern-ireland', 'migrat', 'pattern', '.'] 
Punctuation and stopwords are here.

2. TOKENIZE, LEMMATIZE & STEM, and LETTERS only:
 ['thi', 'paper', 'take', 'a', 'preliminari', 'look', 'at', 'a', 'disaggreg', 'data', 'sourc', 'not', 'previous', 'use', 'in', 'the', 'analysi', 'of', 'migrat', 'pattern'] 
Stopwords are still here.

3. TOKENIZE, LEMMATIZE & STEM, LETTERS only, and NO STOPWORDS:
 ['thi', 'paper', 'take', 'preliminari', 'look', 'disaggreg', 'data', 'sourc', 'previous', 'use', 'analysi', 'migrat', 'pattern'] 
Both punctuation and stopwords are gone.


## Apply text processing to all text

In [352]:
# set to lowercase
df['alltext'] = df['alltext'].str.lower()

# replace hyphen '-' with ' '. w/o this step, words w/ '-' are removed at w.isalpha() step
df['alltext'] = df['alltext'].str.replace('-',' ')

# collect one processed string per row
tokens = []

# loop over each row of the alltext column
for row in df['alltext']:
    # list comprehension: tokenize the row, keep alphabetic non-stopword tokens,
    # then lemmatize and stem each one
    my_token = [stemmer.stem(lm.lemmatize(w)) for w in word_tokenize(row)\
               if (w.isalpha() and w not in stopwords)]
    # join the processed tokens back into one space-separated string
    tokens.append(' '.join(my_token))
df['token'] = tokens
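
The hyphen replacement matters because `str.isalpha()` rejects any token that contains a non-letter. The following is a minimal pure-Python sketch of that effect; whitespace splitting stands in for `word_tokenize`, and the helper name `alpha_tokens` is made up for illustration.

```python
# Whitespace splitting is a simplification standing in for word_tokenize.
def alpha_tokens(text):
    """Lowercase, split on whitespace, and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]

phrase = "Northern-Ireland migration patterns"

# Without the replacement, the hyphenated token fails isalpha() and is dropped.
kept_raw = alpha_tokens(phrase)

# With the replacement, both halves survive as separate tokens.
kept_clean = alpha_tokens(phrase.replace("-", " "))

print(kept_raw)    # ['migration', 'patterns']
print(kept_clean)  # ['northern', 'ireland', 'migration', 'patterns']
```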

# TF-IDF (Term Frequency - Inverse Document Frequency)

$$w_{i,j} = tf_{i,j} \times \log\frac{N}{df_{i}}$$

where

- $tf_{i,j} = $ the number of occurrences of term $i$ in document $j$
- $df_{i} = $ the number of documents containing term $i$
- $N = $ the total number of documents
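
As a worked illustration of the formula, here is a small sketch that computes the weights by hand for three made-up, already-tokenized documents. Note that scikit-learn's `TfidfVectorizer` does not use exactly this textbook form: by default it applies a smoothed idf, $\ln\frac{1 + N}{1 + df_{i}} + 1$, and then scales each row to unit $L_{2}$ length, so its numbers will differ from the ones below.

```python
import math
from collections import Counter

# Three tiny, already-tokenized "documents" (made up for illustration).
docs = [
    ["migrat", "labor", "wage"],
    ["migrat", "remitt", "remitt"],
    ["migrat", "labor", "polici"],
]

N = len(docs)

# df_i: number of documents that contain term i (each document counts once).
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: tf * log(N / df)} for one document, per the textbook formula."""
    tf = Counter(doc)
    return {term: count * math.log(N / doc_freq[term]) for term, count in tf.items()}

weights = [tfidf(doc) for doc in docs]

# "migrat" occurs in every document, so log(3/3) = 0 and its weight vanishes;
# "remitt" occurs twice in one document only, so its weight is 2 * log(3).
print(weights[1])
```

A term that appears everywhere carries no weight, which is the same intuition behind the `max_df` cut-off used in the vectorizer.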

In [353]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.5, min_df = 3)
corpus = tfidf_vectorizer.fit_transform(df['token'])

In [372]:
print(type(corpus))

<class 'scipy.sparse._csr.csr_matrix'>


At this step, the `token` column has been vectorized into `corpus`, a sparse TF-IDF matrix. Next, convert `corpus` to a data frame with the selected features (stemmed tokens) as column names.

In [370]:
import numpy as np
# What type holds the vectorized feature names? A NumPy array.
print(type(tfidf_vectorizer.get_feature_names_out()))
# Spot-check that a stopword is absent from the features (an empty result means it is):
np.where(tfidf_vectorizer.get_feature_names_out() == "and")

<class 'numpy.ndarray'>


(array([], dtype=int64),)

In [354]:
tfidf_matrix_df = pd.DataFrame(corpus.toarray(),\
                              columns = tfidf_vectorizer.get_feature_names_out(),\
                              index = df.index)

In [356]:
round(tfidf_matrix_df,2)

Unnamed: 0,abandon,abil,abl,abroad,absenc,absent,absolut,absorb,abstract,abu,...,yield,york,young,younger,youth,yugoslavia,zealand,zero,zimbabw,zone
18,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.00
20,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.00
23,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.00
24,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.00
26,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2612,0.0,0.00,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.09
2613,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.09,0.0,0.0,0.0,0.0,0.0,0.00
2614,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.00
2615,0.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.00


# Cosine-Similarity

$$\text{similarity}(A,B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum\limits_{i = 1}^{n} A_{i} B_{i}}{\sqrt{\sum\limits_{i = 1}^{n} A^{2}_{i}} \sqrt{\sum\limits_{i = 1}^{n} B^{2}_{i}}}$$

where

- $\theta$ is the angle between the vectors,
- $A \cdot B$ is the dot product of $A$ and $B$, calculated as $A \cdot B = A^{T}B = \sum^{n}_{i = 1} A_{i}B_{i} = A_{1}B_{1} + A_{2}B_{2} + ... + A_{n}B_{n}$,
- $\|A\|$ is the $L_{2}$ norm (magnitude) of the vector, calculated as $\|A\| = \sqrt{A^{2}_{1} + A^{2}_{2} + ... + A^{2}_{n}}$.
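
The definition can be checked on small vectors with a few lines of plain Python. This is a sketch for intuition only; the function name `cosine` and the example vectors are made up, and returning 0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

A = [1.0, 2.0, 0.0]
B = [2.0, 4.0, 0.0]  # same direction as A, twice the length
C = [0.0, 0.0, 3.0]  # shares no non-zero component with A

sim_ab = cosine(A, B)  # close to 1: direction matters, length does not
sim_ac = cosine(A, C)  # 0: no shared terms
print(sim_ab, sim_ac)
```

Because TF-IDF weights are non-negative, the similarity between two documents always falls between 0 and 1.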


In [427]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = pd.DataFrame(cosine_similarity(tfidf_matrix_df,\
                                            dense_output = True),\
                                           columns = df.index,\
                                           index = df.index)

In [428]:
cosine_sim.index

Int64Index([  18,   20,   23,   24,   26,   28,   29,   31,   34,   36,
            ...
            2607, 2608, 2609, 2610, 2611, 2612, 2613, 2614, 2615, 2616],
           dtype='int64', length=2439)

In [429]:
%%time
for i in range(len(cosine_sim)):
    # sort the i-th row in descending order; position 0 is expected to be the query
    # itself (similarity 1), so positions 1-5 are the top five matches
    top_score = cosine_sim.iloc[i].sort_values(ascending = False).iloc[1:6]
    # top_score.index holds index labels of df, so select rows with .loc;
    # keep only the columns that are useful in the result
    output = df.loc[top_score.index, ['id','id2','WoS Categories','alltext']]
    output = output.reset_index(drop = True)
    # put all results together: copy the result from the first iteration
    if i == 0:
        all_outputs = output.copy(deep = True)
        all_scores = top_score.copy(deep = True)
    # after the first iteration, concatenate the following results
    else:
        all_outputs = pd.concat([all_outputs, output])
        all_scores = pd.concat([all_scores, top_score])

CPU times: user 3.92 s, sys: 60.7 ms, total: 3.98 s
Wall time: 4.02 s
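
The loop above drops the query by position, which assumes the query always sorts first. If another document ties with it at a similarity of 1 (an exact duplicate, for example), the order among the tied entries is not guaranteed. A possible alternative, sketched here in plain Python with made-up scores and a hypothetical helper `top_matches`, is to exclude the query by its label instead.

```python
import heapq

def top_matches(scores, query, n=5):
    """Return the n highest-scoring (label, score) pairs, skipping the query itself.

    `scores` maps document labels to their similarity with `query`.
    """
    candidates = ((label, s) for label, s in scores.items() if label != query)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Made-up similarities for a query labelled 18; label 20 is an exact duplicate.
row = {18: 1.0, 20: 1.0, 23: 0.42, 24: 0.10, 26: 0.35}

best = top_matches(row, query=18, n=3)
print(best)  # [(20, 1.0), (23, 0.42), (26, 0.35)]
```

In this notebook the later upper cut-off on `cosine_similarity` removes scores of 1 anyway, so the sketch mainly matters if that filter is changed or dropped.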


In [359]:
all_scores = pd.DataFrame(all_scores)

In [360]:
all_scores = all_scores.rename(columns={0:"cosine_similarity"}).reset_index(drop = True)
print(all_scores.shape[0])
print(all_scores.head())

12195
   cosine_similarity
0           0.318249
1           0.311735
2           0.298912
3           0.294885
4           0.271116


In [361]:
# create a query id data frame that repeats each id n times, consistent with the top n matches
match_id = pd.DataFrame(np.repeat(df[['id2']].values, 5, axis = 0)).rename(columns={0:'query_id2'})
match_id.head()

Unnamed: 0,query_id2
0,19
1,19
2,19
3,19
4,19


In [362]:
# reset the indices so the data frames align row by row
all_scores = all_scores.reset_index(drop=True)
match_id = match_id.reset_index(drop=True)

In [363]:
# the index is zero-based (0-4) within each query, so add 1 to get ranks 1-5
all_outputs['rank'] = all_outputs.index+1
# reset index
all_outputs = all_outputs.reset_index(drop = True)

In [364]:
# all data frames to be concatenated must have consistent indices
matches_df = pd.concat([match_id, all_scores, all_outputs],axis=1).\
rename(columns={'id':'match_id','id2':'match_id2'})

In [365]:
# merge the final results with the query ids
# rename the columns to make them more explanatory 
final_output = df[['alltext','id','id2']]\
.merge(matches_df, left_on='id2', right_on='query_id2', how = 'inner')\
.rename(columns={'alltext_x':'query_text', 'id':'query_id', 'alltext_y':'match_text'})\
.drop(['id2','query_id2','WoS Categories'], axis = 1)

In [366]:
# set the lower and upper cut-offs for cosine_similarity:
# drop weak matches (<= 0.3) and near-identical texts (>= 0.99)
final_output = \
final_output[(final_output['cosine_similarity'] < 0.99) & (final_output['cosine_similarity'] > 0.3)]

In [367]:
# distribution of cosine_similarity scores
final_output['cosine_similarity'].describe()

count    6047.000000
mean        0.382366
std         0.075261
min         0.300002
25%         0.327077
50%         0.360428
75%         0.416307
max         0.795589
Name: cosine_similarity, dtype: float64

In [368]:
# number of rows remaining after filtering on the cosine similarity scores
final_output.shape[0]

6047

In [371]:
# the distribution of ranked matches
final_output['rank'].value_counts()

1    1854
2    1412
3    1128
4     917
5     736
Name: rank, dtype: int64

# Why are texts similar?—Common features

In [381]:
# Show the non-zero features of the first row and their scores
tfidf_matrix_df.iloc[0][tfidf_matrix_df.iloc[0]>0]

allow       0.064269
analysi     0.080951
benefici    0.088011
brain       0.123310
capit       0.045787
              ...   
unemploy    0.065561
urban       0.119121
wage        0.048593
welfar      0.063713
worker      0.033935
Name: 18, Length: 67, dtype: float64

In [397]:
# non-zero features as column names from the tfidf_matrix
tfidf_matrix_df.columns[tfidf_matrix_df.iloc[0]>0]

Index(['allow', 'analysi', 'benefici', 'brain', 'capit', 'caput', 'caus',
       'character', 'condit', 'consequ', 'construct', 'countri', 'cours',
       'develop', 'discov', 'drain', 'effect', 'emigr', 'examin', 'exchang',
       'export', 'fall', 'foreign', 'gain', 'growth', 'harri', 'home',
       'impact', 'incom', 'labor', 'laid', 'ldc', 'le', 'long', 'lose', 'loss',
       'model', 'order', 'paper', 'period', 'phenomenon', 'possibl',
       'profession', 'promot', 'purpos', 'rate', 'real', 'receiv', 'remitt',
       'return', 'rise', 'run', 'rural', 'second', 'sector', 'short', 'social',
       'theoret', 'todaro', 'type', 'ultim', 'unambigu', 'unemploy', 'urban',
       'wage', 'welfar', 'worker'],
      dtype='object')

In [403]:
# common features between two articles
list(set(tfidf_matrix_df.columns[tfidf_matrix_df.iloc[0]>0])\
     & set(tfidf_matrix_df.columns[tfidf_matrix_df.iloc[1]>0]))

['growth',
 'emigr',
 'receiv',
 'labor',
 'develop',
 'consequ',
 'countri',
 'exchang']

In [400]:
type(tfidf_matrix_df.columns[tfidf_matrix_df.iloc[0]>0])

pandas.core.indexes.base.Index

In [461]:
# confirm that the original df and the tfidf_matrix have the same number of rows
df.shape[0] == tfidf_matrix_df.shape[0]

True

In [465]:
# Re-index with the id column so that rows can be looked up by article id.
# This makes it easier to add more information to the final results.
tfidf_matrix_df.index = df['id']
tfidf_matrix_df.head()

Unnamed: 0_level_0,abandon,abil,abl,abroad,absenc,absent,absolut,absorb,abstract,abu,...,yield,york,young,younger,youth,yugoslavia,zealand,zero,zimbabw,zone
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
91_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92_4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92_5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92_7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [474]:
# ids of the first query/match pair
print(final_output['query_id'].iloc[0])
print(final_output['match_id'].iloc[0])

91_3
13_56


In [488]:
# initiate an empty list
common_feature = []
# for each query/match pair, intersect the non-zero features of the two articles
for i in range(len(final_output)):
    feature = list(set(tfidf_matrix_df.columns[tfidf_matrix_df.loc[final_output['query_id'].iloc[i]]>0])\
                   & set(tfidf_matrix_df.columns[tfidf_matrix_df.loc[final_output['match_id'].iloc[i]]>0]))
    common_feature.append(', '.join(map(str,feature)))       
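
A set intersection returns the shared features in arbitrary order. Since the numerator of the cosine similarity is a sum of per-term products, one way to see which shared terms matter most is to rank them by the product of their two weights. The sketch below uses made-up weights and a hypothetical helper `shared_term_contributions`; it is an optional refinement, not part of the pipeline above.

```python
def shared_term_contributions(query_weights, match_weights):
    """Return shared terms with the product of their weights, largest first.

    Both arguments map term -> TF-IDF weight and hold only non-zero entries.
    """
    shared = query_weights.keys() & match_weights.keys()
    products = {t: query_weights[t] * match_weights[t] for t in shared}
    return sorted(products.items(), key=lambda item: item[1], reverse=True)

# Made-up weights for two documents.
query_doc = {"brain": 0.5, "drain": 0.4, "wage": 0.1}
match_doc = {"brain": 0.6, "drain": 0.2, "emigr": 0.3}

ranked = shared_term_contributions(query_doc, match_doc)
print([term for term, _ in ranked])  # ['brain', 'drain']
```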

In [489]:
# add the common feature to the final output data frame
final_output['common_feature'] = common_feature

In [493]:
# Take a look at the final results
final_output[['query_id','query_text','match_id','match_text','common_feature','cosine_similarity','rank']]\
.head()

Unnamed: 0,query_id,query_text,match_id,match_text,common_feature,cosine_similarity,rank
0,91_3,a theoretical analysis of the beneficial effec...,13_56,what circumstances lead a government to promot...,"worker, emigr, caput, fall, le, brain, drain, ...",0.318249,1
1,91_3,a theoretical analysis of the beneficial effec...,11_67,the brain drain and the world distribution of ...,"emigr, analysi, short, possibl, effect, long, ...",0.311735,2
10,92_4,"factor mobility, trade and welfare a north s...",17_139,north south migrations and the asymmetric expu...,"popul, capit, develop, south, mobil, affect, n...",0.487309,1
11,92_4,"factor mobility, trade and welfare a north s...",19_175,a tale of two countries: directed technical ch...,"unskil, posit, substitut, labor, trade, south,...",0.351653,2
12,92_4,"factor mobility, trade and welfare a north s...",14_72,"trade, capital adjustment and the migration of...","unskil, sector, capit, labor, develop, trade, ...",0.341164,3


In [497]:
final_output['common_feature'][0:5]

0     worker, emigr, caput, fall, le, brain, drain, ...
1     emigr, analysi, short, possibl, effect, long, ...
10    popul, capit, develop, south, mobil, affect, n...
11    unskil, posit, substitut, labor, trade, south,...
12    unskil, sector, capit, labor, develop, trade, ...
Name: common_feature, dtype: object

# Output

In [498]:
final_output.to_csv('article_similarity.csv', index=False)