Longformer
--
Longformer is a transformer model whose attention mechanism scales linearly with sequence length, making it practical to process documents of thousands of tokens or longer. Its attention mechanism is a drop-in replacement for standard self-attention and combines local windowed attention with task-motivated global attention.
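
To make the attention pattern concrete, here is a minimal pure-Python sketch of which positions each token may attend to. The function name `longformer_attention_pattern`, the meaning of `window` (neighbours on each side), and the choice of global positions are illustrative assumptions, not the library's API. The sketch builds the full matrix only so it can be printed; the point is that each non-global row has a fixed number of allowed entries, so the work grows linearly with sequence length.

```python
def longformer_attention_pattern(seq_len, window, global_positions):
    """Return allowed[i][j] == True when token i may attend to token j.

    Illustrative only: `window` is the number of neighbours on each side,
    and `global_positions` are the tokens given global attention.
    """
    global_set = set(global_positions)
    allowed = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            local = abs(i - j) <= window
            # Global tokens attend everywhere and every token attends to them.
            is_global = i in global_set or j in global_set
            allowed[i][j] = local or is_global
    return allowed


pattern = longformer_attention_pattern(seq_len=8, window=1, global_positions=[0])
for row in pattern:
    print("".join("x" if ok else "." for ok in row))
```

In the printed grid, the diagonal band is the local window and the first row and column are the global token.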

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Pre-processing the content
--

In [2]:
import json
import pandas as pd

In [3]:
# Each line of the file is one JSON record (JSON Lines format).
with open("/content/drive/MyDrive/Gensim_LDA/source_data/article.json", encoding="utf-8") as f:
    articles = [json.loads(line) for line in f]
art_train = pd.DataFrame(articles)

In [4]:
# Career path pages, one JSON record per line.
with open("/content/drive/MyDrive/Gensim_LDA/source_data/careerpathpage.json", encoding="utf-8") as f:
    cp = [json.loads(line) for line in f]
cp_train = pd.DataFrame(cp)

In [5]:
# Cover letter pages, one JSON record per line.
with open("/content/drive/MyDrive/Gensim_LDA/source_data/coverletter.json", encoding="utf-8") as f:
    cl = [json.loads(line) for line in f]
cl_train = pd.DataFrame(cl)

In [6]:
# Resume sample pages, one JSON record per line.
with open("/content/drive/MyDrive/Gensim_LDA/source_data/resumesamplepage.json", encoding="utf-8") as f:
    res = [json.loads(line) for line in f]
res_train = pd.DataFrame(res)
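
The four loading cells above share one pattern, which could be wrapped in a small helper. This is a sketch, assuming a hypothetical helper name `load_jsonl`; it returns a list of dicts that can be passed straight to `pd.DataFrame`.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                records.append(json.loads(line))
    return records


# Usage (path taken from the cells above):
# art_train = pd.DataFrame(load_jsonl("/content/drive/MyDrive/Gensim_LDA/source_data/article.json"))
```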

In [7]:
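def remove_hyperlinks(corpus, col):
    """
    Blank out every line of corpus[col] that contains a hyperlink,
    then join the lines back together with spaces.
    """
    cleaned = []
    for content in corpus[col]:
        # "http" also matches "https", so a single check is enough.
        lines = ["" if "http" in line else line for line in content.split("\n")]
        cleaned.append(" ".join(lines))
    # Write back to the column that was read (not a hard-coded "content"),
    # and assign the whole column to avoid chained-indexing assignment.
    corpus[col] = cleaned
    return corpus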

In [8]:
art_train = remove_hyperlinks(art_train, "content")
cl_train = remove_hyperlinks(cl_train, "contentA")
res_train  = remove_hyperlinks(res_train, "contentA")
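
As a worked example of what this cleaning step does to a single document, here is a string-level sketch. The helper name `strip_link_lines` and the sample text are made up for illustration. Note that the whole line containing a link is dropped, not just the URL.

```python
def strip_link_lines(text):
    """Blank out lines that contain a hyperlink, then join lines with spaces."""
    lines = text.split("\n")
    kept = ["" if "http" in line else line for line in lines]
    return " ".join(kept)


sample = "Read our guide\nhttps://example.com/guide\nThen apply"
print(repr(strip_link_lines(sample)))
# 'Read our guide  Then apply'  (the blanked line leaves a double space)
```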

Loading Longformer Model
--

In [9]:
!conda create --name longformer python=3.7
!conda activate longformer
!conda install cudatoolkit=10.0
!pip install git+https://github.com/allenai/longformer.git

/bin/bash: conda: command not found
/bin/bash: conda: command not found
/bin/bash: conda: command not found
Collecting git+https://github.com/allenai/longformer.git
  Cloning https://github.com/allenai/longformer.git to /tmp/pip-req-build-_xy74mfy
  Running command git clone -q https://github.com/allenai/longformer.git /tmp/pip-req-build-_xy74mfy
Collecting transformers@ git+http://github.com/ibeltagy/transformers.git@longformer_encoder_decoder#egg=transformers
  Cloning http://github.com/ibeltagy/transformers.git (to revision longformer_encoder_decoder) to /tmp/pip-install-ty8zl5ef/transformers
  Running command git clone -q http://github.com/ibeltagy/transformers.git /tmp/pip-install-ty8zl5ef/transformers
  Running command git checkout -b longformer_encoder_decoder --track origin/longformer_encoder_decoder
  Switched to a new branch 'longformer_encoder_decoder'
  Branch 'longformer_encoder_decoder' set up to track remote branch 'longformer_encoder_decoder' from 'origin'.
Collecting p

In [10]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading https://files.pythonhosted.org/packages/cc/75/df441011cd1726822b70fbff50042adb4860e9327b99b346154ead704c44/sentence-transformers-1.2.0.tar.gz (81kB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-1.2.0-cp37-none-any.whl size=123339 sha256=9baeb3

In [11]:
import torch
from transformers import LongformerModel, LongformerTokenizer

In [12]:
# is_available must be called; the bare function object is always truthy.
device = "cuda" if torch.cuda.is_available() else "cpu"

In [13]:
model = LongformerModel.from_pretrained('allenai/longformer-base-4096').to(device)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

Some weights of LongformerModel were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['longformer.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Checking the first 10 content titles
--

Note: only the first 10 titles are used because CUDA runs out of memory with more.

In [14]:
title1 =list(art_train['contentTitle'][:10])

In [15]:
input_ids = tokenizer(title1, padding=True, return_tensors="pt")

In [16]:
with torch.no_grad():  # inference only; skipping gradient tracking saves GPU memory
    outputs = model(**input_ids.to(device))

In [17]:
outputs[0].shape

torch.Size([10, 18, 768])

In [18]:
from sentence_transformers import SentenceTransformer, util
cosine_scores = util.pytorch_cos_sim(outputs[1], outputs[1])
cosine_scores.shape

torch.Size([10, 10])
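
`util.pytorch_cos_sim` returns the cosine of the angle between every pair of embedding vectors, which is why 10 inputs give a 10 x 10 matrix. A dependency-free sketch of the same formula for a single pair of vectors (the helper name `cosine_similarity` is ours, not part of any library):

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
```

The score depends only on direction, not magnitude, so vectors that all point roughly the same way score close to 1 regardless of their content.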

In [19]:
pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

In [20]:
import pandas as pd
from collections import defaultdict

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
dict_ = defaultdict(list)
for pair in pairs:
    i, j = pair['index']
    dict_["sent1"].append(title1[i]) 
    dict_["sent2"].append(title1[j])
    dict_["scores"].append(pair["score"].item())    

In [21]:
df = pd.DataFrame.from_dict(dict_).sort_values(by=['scores'], ascending=False)
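
Cells 19 to 21 enumerate every unordered pair (i < j) from the score matrix and sort the pairs by score. The same logic is repeated in later cells, so it may be worth factoring out. A minimal sketch using plain lists, where `rank_pairs` is a hypothetical helper name and the 3 x 3 score matrix is toy data:

```python
from itertools import combinations


def rank_pairs(labels, scores):
    """Return (label_i, label_j, score) for every i < j, highest score first."""
    pairs = [
        (labels[i], labels[j], scores[i][j])
        for i, j in combinations(range(len(labels)), 2)
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


labels = ["a", "b", "c"]
scores = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
]
for row in rank_pairs(labels, scores):
    print(row)
```

Only the upper triangle is visited, so each pair appears once and self-pairs on the diagonal are skipped.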

Longformer results by Cosine similarity
--

In [22]:
df.head(60)

Unnamed: 0,sent1,sent2,scores
0,Collaboration Skills: Definition and Examples,Management Skills: Definition and Examples,0.999523
1,Management Skills: Definition and Examples,Problem-Solving Skills: Definitions and Examples,0.999465
2,Collaboration Skills: Definition and Examples,Problem-Solving Skills: Definitions and Examples,0.999347
3,Management Skills: Definition and Examples,High School Resume Tips and Example,0.998996
4,Collaboration Skills: Definition and Examples,Communication Skills for Career Success,0.998816
5,Collaboration Skills: Definition and Examples,High School Resume Tips and Example,0.998804
6,Communication Skills for Career Success,Management Skills: Definition and Examples,0.998664
7,High School Resume Tips and Example,Problem-Solving Skills: Definitions and Examples,0.998651
8,Becoming a Manager: How To Develop a Work Sche...,How to Create a Resume Template in Word,0.998578
9,Communication Skills for Career Success,Problem-Solving Skills: Definitions and Examples,0.998574


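Sentence Transformers to the rescue!
--

The Longformer scores above are all close to 1 (the top pairs all exceed 0.998, even for unrelated titles), so they do not separate similar titles from dissimilar ones. A Sentence Transformer model encodes the semantic context of each text into an embedding, and the similarity between texts is then measured on those embeddings.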
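
Sentence-embedding models of this kind commonly produce one vector per text by averaging the token vectors while ignoring padding (mean pooling). The sketch below illustrates that idea with plain lists; it is an assumption about the general technique, not a description of the internals of the specific models used in this notebook, and `mean_pool` is a made-up name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, ignoring padding (mask == 0).

    Assumes at least one token has mask == 1.
    """
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    return [t / count for t in total]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
```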

In [23]:
title1 =list(art_train['contentTitle'])[:200]

In [24]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L12-v2')
embeddings1 = model.encode(title1, convert_to_tensor=True) 
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings1)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

import pandas as pd
from collections import defaultdict

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
dict_ = defaultdict(list)

for pair in pairs:
    i, j = pair['index']

    dict_["sent1"].append(title1[i]) 
    dict_["sent2"].append(title1[j])
    dict_["scores"].append(pair["score"].item())    

df = pd.DataFrame.from_dict(dict_).sort_values(by=['scores'], ascending=False)


In [29]:
pd.set_option('max_colwidth', 400)
df[df["sent1"]=="Combination Resume Tips and Examples"][:5]

Unnamed: 0,sent1,sent2,scores
2,Combination Resume Tips and Examples,Resume Objectives: 70+ Examples and Tips,0.881121
4,Combination Resume Tips and Examples,A Complete Resume Summary Guide (40+ Examples),0.850433
31,Combination Resume Tips and Examples,How to Make a Resume (With Examples),0.799398
189,Combination Resume Tips and Examples,Guide To Updating Your Resume,0.677714
242,Combination Resume Tips and Examples,10 Best Skills To Include on a Resume (With Examples),0.653828


In [30]:
df[df["sent1"]=="High School Resume Tips and Example"][:5]

Unnamed: 0,sent1,sent2,scores
6,High School Resume Tips and Example,9 College Resume Tips + Examples,0.846611
17,High School Resume Tips and Example,Combination Resume Tips and Examples,0.82363
28,High School Resume Tips and Example,Chronological Resume Tips and Examples,0.804529
49,High School Resume Tips and Example,Resume Objectives: 70+ Examples and Tips,0.769507
100,High School Resume Tips and Example,How to Make a Resume (With Examples),0.719584


Sentence transformer model results ordered by cosine similarity scores in descending order
--

In [31]:
df.head(20)

Unnamed: 0,sent1,sent2,scores
0,How To Write a Cover Letter (Plus Tips and Examples),7 Powerful Ways to Start a Cover Letter (With Examples),0.892443
1,How to Format a Cover Letter (With Tips and Examples),How To Write a Cover Letter (Plus Tips and Examples),0.882038
2,Combination Resume Tips and Examples,Resume Objectives: 70+ Examples and Tips,0.881121
3,Chronological Resume Tips and Examples,Combination Resume Tips and Examples,0.856826
4,Combination Resume Tips and Examples,A Complete Resume Summary Guide (40+ Examples),0.850433
5,Best Careers for ISTJ Personalities,Best Careers for ISTP Personalities,0.846749
6,High School Resume Tips and Example,9 College Resume Tips + Examples,0.846611
7,39 Strengths and Weaknesses to Discuss in a Job Interview,List of Weaknesses: 10 Things To Say in an Interview,0.842401
8,Best Careers for ISTP Personalities,Best Careers for INTP Personalities,0.840465
9,Here's Everything You Should Include on a Resume,10 Best Skills To Include on a Resume (With Examples),0.840029


Trying the Sentence Transformer on article content
--

In [32]:
cont =list(art_train['content'])[:200]

In [33]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L12-v2')
embeddings1 = model.encode(cont, convert_to_tensor=True) 
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings1)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

import pandas as pd
from collections import defaultdict

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
dict_ = defaultdict(list)

for pair in pairs:
    i, j = pair['index']

    dict_["sent1"].append(title1[i]) 
    dict_["sent2"].append(title1[j])
    dict_["scores"].append(pair["score"].item())    

df = pd.DataFrame.from_dict(dict_).sort_values(by=['scores'], ascending=False)

In [34]:
df.head(20)

Unnamed: 0,sent1,sent2,scores
0,How to Write a Letter of Recommendation (With Template and Example),Letter of Recommendation for a Teacher,0.827835
1,Best Careers for ENFP Personalities,Best Careers for INFP Personalities,0.814514
2,How to Make a Resume for Your First Job,Writing a Resume With No Experience,0.813814
3,How To List Education on a Resume,How to Make a Resume (With Examples),0.795837
4,How to Make a Resume (With Examples),How to Make a Resume for Your First Job,0.793513
5,How To List Education on a Resume,Listing Hobbies and Interests on Your Resume (With Examples),0.789733
6,The Essential Job Search Guide,The New Graduate's Guide To Job Search,0.789454
7,Career Advice for Service Members: Making a Transition To Civilian Life,Job Search Guide for Former Military Members,0.766813
8,How to Create a Resume Template in Word,How to Make a Resume for Your First Job,0.766515
9,Resignation Letter Due to a Career Change: Tips and Examples,How To Write a Resignation Letter (With Samples and Tips),0.765828


Checking other Sentence Transformer models for better cosine similarity scores
--

In [35]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embeddings1 = model.encode(title1, convert_to_tensor=True) 
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings1)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

import pandas as pd
from collections import defaultdict

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
dict_ = defaultdict(list)

for pair in pairs:
    i, j = pair['index']
    dict_["sent1"].append(title1[i]) 
    dict_["sent2"].append(title1[j])
    dict_["scores"].append(pair["score"].item())    

df = pd.DataFrame.from_dict(dict_).sort_values(by=['scores'], ascending=False)
df.head(20)


Unnamed: 0,sent1,sent2,scores
0,Best Careers for ISTJ Personalities,Best Careers for ISTP Personalities,0.908848
1,Best Careers for ENFP Personalities,Best Careers for INFP Personalities,0.889055
2,Best Careers for ESFJ Personalities,Best Careers for ISFJ Personalities,0.884238
3,Chronological Resume Tips and Examples,Combination Resume Tips and Examples,0.882135
4,How to Format a Cover Letter (With Tips and Examples),How To Write a Cover Letter (Plus Tips and Examples),0.874603
5,High School Resume Tips and Example,9 College Resume Tips + Examples,0.865415
6,Best Careers for ISTP Personalities,Best Careers for INTP Personalities,0.851982
7,Chronological Resume Tips and Examples,A Complete Resume Summary Guide (40+ Examples),0.849624
8,Technical Resume Writing: Tips and Examples,10 Resume Writing Tips To Help You Land a Job,0.842368
9,Combination Resume Tips and Examples,Resume Objectives: 70+ Examples and Tips,0.840752


In [36]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embeddings1 = model.encode(cont, convert_to_tensor=True) 
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings1)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

import pandas as pd
from collections import defaultdict

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
dict_ = defaultdict(list)

for pair in pairs:
    i, j = pair['index']
    dict_["sent1"].append(title1[i]) 
    dict_["sent2"].append(title1[j])
    dict_["scores"].append(pair["score"].item())    

df = pd.DataFrame.from_dict(dict_).sort_values(by=['scores'], ascending=False)
df.head(20)

Unnamed: 0,sent1,sent2,scores
0,The Essential Job Search Guide,Job Search Guide: Product Management and Software Engineering,0.888103
1,Technical Resume Writing: Tips and Examples,How to Make a Resume (With Examples),0.864793
2,Human Resources: Definition and How It Works,12 Human Resources Jobs That Pay Well,0.843934
3,How to Make a Resume for Your First Job,Writing a Resume With No Experience,0.841326
4,How To List Education on a Resume,Listing Hobbies and Interests on Your Resume (With Examples),0.833636
5,How To List Volunteer Work on Your Resume (With Example),Listing Hobbies and Interests on Your Resume (With Examples),0.823916
6,How To List Education on a Resume,How to Make a Resume (With Examples),0.82262
7,Best Careers for ENFP Personalities,Best Careers for INFP Personalities,0.819723
8,Guide To Updating Your Resume,How to Make a Resume (With Examples),0.81571
9,How to Create a Resume Template in Word,How to Make a Resume for Your First Job,0.812128


In [37]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('msmarco-MiniLM-L-12-v3')
embeddings1 = model.encode(title1, convert_to_tensor=True) 
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings1)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

import pandas as pd
from collections import defaultdict

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
dict_ = defaultdict(list)

for pair in pairs:
    i, j = pair['index']

    dict_["sent1"].append(title1[i]) 
    dict_["sent2"].append(title1[j])
    dict_["scores"].append(pair["score"].item())    

df = pd.DataFrame.from_dict(dict_).sort_values(by=['scores'], ascending=False)
df.head(20)


Unnamed: 0,sent1,sent2,scores
0,Best Careers for ISTJ Personalities,Best Careers for ISTP Personalities,0.88342
1,How to Write a Letter of Recommendation (With Template and Example),How To Ask for a Letter of Recommendation (With Examples),0.855017
2,How to Format a Cover Letter (With Tips and Examples),How To Write a Cover Letter (Plus Tips and Examples),0.850855
3,Best Careers for ESFJ Personalities,Best Careers for ISFJ Personalities,0.828206
4,How To Write a Cover Letter (Plus Tips and Examples),7 Powerful Ways to Start a Cover Letter (With Examples),0.79389
5,Best Careers for ENFP Personalities,Best Careers for INFP Personalities,0.793882
6,14 Sales Jobs That Pay Well,12 Retail Jobs That Pay Well,0.793466
7,How to Introduce Yourself in an Interview,How To Prepare for an Interview,0.77756
8,15 Phone Interview Questions (With Example Answers),5 Situational Interview Questions (With Example Answers),0.771728
9,14 Sales Jobs That Pay Well,15 Marketing Jobs That Pay Well,0.77076


In [38]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('msmarco-MiniLM-L-12-v3')
embeddings1 = model.encode(cont, convert_to_tensor=True) 
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings1)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

import pandas as pd
from collections import defaultdict

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
dict_ = defaultdict(list)

for pair in pairs:
    i, j = pair['index']

    dict_["sent1"].append(title1[i]) 
    dict_["sent2"].append(title1[j])
    dict_["scores"].append(pair["score"].item())    

df = pd.DataFrame.from_dict(dict_).sort_values(by=['scores'], ascending=False)
df.head(20)

Unnamed: 0,sent1,sent2,scores
0,Resignation Letter Due to a Career Change: Tips and Examples,How To Write a Resignation Letter (With Samples and Tips),0.911585
1,10 Resume Writing Tips To Help You Land a Job,How to Write a Resume Employers Will Notice,0.908975
2,Chronological Resume Tips and Examples,2021’s Top Resume Formats: Tips and Examples of Three Common Resumes,0.893147
3,6 Universal Rules for Resume Writing (With Video),10 Resume Writing Tips To Help You Land a Job,0.89181
4,The Essential Job Search Guide,Job Search Guide: Product Management and Software Engineering,0.888012
5,"What Does ""Business Casual"" Mean? (With Example Outfits)",What to Wear: The Best Job Interview Attire,0.880033
6,Management Skills: Definition and Examples,How To Be a Good Manager,0.873066
7,How to Write a Letter of Intent (With Examples and Writing Tips),"Letter of Interest: Definition, Tips and Examples",0.872647
8,How to Write a Letter of Recommendation (With Template and Example),How To Ask for a Letter of Recommendation (With Examples),0.870218
9,Soft Skills: Definitions and Examples,10 Best Skills To Include on a Resume (With Examples),0.869144


"msmarco-MiniLM-L-6-v3" seems to be more semantically aligned and produces better results for both titles and content
--

In [39]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('msmarco-MiniLM-L-6-v3')
embeddings1 = model.encode(title1, convert_to_tensor=True) 
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings1)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

import pandas as pd
from collections import defaultdict

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
dict_ = defaultdict(list)

for pair in pairs:
    i, j = pair['index']

    dict_["sent1"].append(title1[i]) 
    dict_["sent2"].append(title1[j])
    dict_["scores"].append(pair["score"].item())    

df = pd.DataFrame.from_dict(dict_).sort_values(by=['scores'], ascending=False)
df.head(60)


Unnamed: 0,sent1,sent2,scores
0,Best Careers for ISTJ Personalities,Best Careers for ISTP Personalities,0.909458
1,Best Careers for ENFP Personalities,Best Careers for INFP Personalities,0.891321
2,Best Careers for ESFJ Personalities,Best Careers for ISFJ Personalities,0.847839
3,Letter of Recommendation for College Students,Letter of Recommendation for a Teacher,0.845687
4,How to Format a Cover Letter (With Tips and Examples),How To Write a Cover Letter (Plus Tips and Examples),0.822301
5,How To Write a Cover Letter (Plus Tips and Examples),7 Powerful Ways to Start a Cover Letter (With Examples),0.817732
6,High School Resume Tips and Example,9 College Resume Tips + Examples,0.814334
7,39 Strengths and Weaknesses to Discuss in a Job Interview,List of Weaknesses: 10 Things To Say in an Interview,0.80442
8,How to Write a Letter of Recommendation (With Template and Example),How To Ask for a Letter of Recommendation (With Examples),0.8031
9,Jobs That Pay Well,Work-From-Home Jobs That Pay Well,0.794318


In [40]:
########## final #############

cont =list(art_train['content'])[:200]
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('msmarco-MiniLM-L-6-v3')
embeddings1 = model.encode(cont, convert_to_tensor=True) 
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings1)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

import pandas as pd
from collections import defaultdict

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
dict_ = defaultdict(list)

for pair in pairs:
    i, j = pair['index']

    dict_["sent1"].append(title1[i]) 
    dict_["sent2"].append(title1[j])
    dict_["scores"].append(pair["score"].item())    

df = pd.DataFrame.from_dict(dict_).sort_values(by=['scores'], ascending=False)
df.head(60)

Unnamed: 0,sent1,sent2,scores
0,Chronological Resume Tips and Examples,2021’s Top Resume Formats: Tips and Examples of Three Common Resumes,0.938358
1,Resignation Letter Due to a Career Change: Tips and Examples,How To Write a Resignation Letter (With Samples and Tips),0.929564
2,"What Does ""Business Casual"" Mean? (With Example Outfits)",What to Wear: The Best Job Interview Attire,0.913917
3,Guide To Business Attire (With Examples),"What Does ""Business Casual"" Mean? (With Example Outfits)",0.907581
4,Here's Everything You Should Include on a Resume,How to Make a Resume for Your First Job,0.896563
5,10 Resume Writing Tips To Help You Land a Job,Here's Everything You Should Include on a Resume,0.893672
6,10 Resume Writing Tips To Help You Land a Job,How to Write a Resume Employers Will Notice,0.890749
7,10 Resume Writing Tips To Help You Land a Job,How to Make a Resume for Your First Job,0.881679
8,6 Universal Rules for Resume Writing (With Video),10 Resume Writing Tips To Help You Land a Job,0.880301
9,Here's Everything You Should Include on a Resume,How to Make a Resume (With Examples),0.878379


Comparing the "msmarco-MiniLM-L-6-v3" model results with the link data provided by Indeed
--

In [45]:
from collections import defaultdict

path = "/content/drive/MyDrive/Copy of Pageview_matrix_20210511.csv"
links_not_found = set()
final_dict = defaultdict(list)

# Corpora to search, in order, with the column that holds each page title.
sources = [(art_train, "contentTitle"), (cp_train, "h1"),
           (cl_train, "title"), (res_train, "title")]

def find_title(link):
    """Return the title of the page whose urlRoute matches the link, or None."""
    url = link.split("/")[-1]
    for frame, title_col in sources:
        match = frame[frame["urlRoute"] == url]
        if len(match) > 0:
            return match[title_col].values[0]
    links_not_found.add(link)
    return None

with open(path, "r") as fin:
    fin.readline()  # skip the header row
    for line in fin:
        if line != "\n":
            try:
                line = line.split(",")
                sent1 = find_title(line[0])
                sent2 = find_title(line[1])
                # Skip rows with an unresolved link instead of reusing a stale title.
                if sent1 is None or sent2 is None:
                    continue
                final_dict["sent1"].append(sent1)
                final_dict["sent2"].append(sent2)
                final_dict["visit"].append(int(line[2].replace("\n", "")))
            except IndexError:
                print(line)
In [46]:
title1=list(art_train['contentTitle'])[:200]
cont =list(art_train['content'])[:200]
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('msmarco-MiniLM-L-6-v3')
embeddings1 = model.encode(cont, convert_to_tensor=True) 
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings1)

pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

import pandas as pd
from collections import defaultdict

pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)
dict_ = defaultdict(list)

for pair in pairs:
    i, j = pair['index']

    dict_["sent1"].append(title1[i]) 
    dict_["sent2"].append(title1[j])
    dict_["scores"].append(pair["score"].item())    

final_df = pd.DataFrame.from_dict(dict_).sort_values(by=['scores'], ascending=False)

In [47]:
link_df = pd.DataFrame.from_dict(final_dict).sort_values(by=['visit'], ascending=False)
link_df.head(60)

Unnamed: 0,sent1,sent2,visit
0,List of Weaknesses: 10 Things To Say in an Interview,39 Strengths and Weaknesses to Discuss in a Job Interview,2083
2,List of Weaknesses: 10 Things To Say in an Interview,"How to Answer ""Tell Me About Yourself"" (Tips and Example Answers)",322
25,List of Weaknesses: 10 Things To Say in an Interview,"Interview Question: ""How Would You Describe Yourself?"" (With Examples)",116
86,List of Weaknesses: 10 Things To Say in an Interview,"Interview Question: ""Why Should We Hire You?""",65
93,List of Weaknesses: 10 Things To Say in an Interview,"Interview Question: ""How Do You Handle Conflict in the Workplace?""",63
104,List of Weaknesses: 10 Things To Say in an Interview,Where Do You See Yourself in 5 Years?,57
123,List of Weaknesses: 10 Things To Say in an Interview,39 of the Best Questions to Ask at the End of an Interview,53
125,List of Weaknesses: 10 Things To Say in an Interview,How to Explain Your Reasons for Leaving a Job (With Examples),52
143,List of Weaknesses: 10 Things To Say in an Interview,How to Introduce Yourself in an Interview,48
155,List of Weaknesses: 10 Things To Say in an Interview,Interview Question: What Are Your Greatest Weaknesses?,46


In [48]:
final_df[final_df["sent2"]=="List of Weaknesses: 10 Things To Say in an Interview"].head(60)

Unnamed: 0,sent1,sent2,scores
160,39 Strengths and Weaknesses to Discuss in a Job Interview,List of Weaknesses: 10 Things To Say in an Interview,0.754508
250,"How to Answer ""Tell Me About Yourself"" (Tips and Example Answers)",List of Weaknesses: 10 Things To Say in an Interview,0.726102
458,How To Prepare for an Interview,List of Weaknesses: 10 Things To Say in an Interview,0.68661
579,14 Common Second Interview Questions (With Example Answers),List of Weaknesses: 10 Things To Say in an Interview,0.669918
598,Problem-Solving Skills: Definitions and Examples,List of Weaknesses: 10 Things To Say in an Interview,0.667006
624,21 Job Interview Tips: How To Make a Great Impression,List of Weaknesses: 10 Things To Say in an Interview,0.663864
632,12 Tough Interview Questions and Answers,List of Weaknesses: 10 Things To Say in an Interview,0.662564
661,"Interview Question: ""How Would You Describe Yourself?"" (With Examples)",List of Weaknesses: 10 Things To Say in an Interview,0.659704
693,5 Situational Interview Questions (With Example Answers),List of Weaknesses: 10 Things To Say in an Interview,0.656247
1136,125 Common Interview Questions and Answers (With Tips),List of Weaknesses: 10 Things To Say in an Interview,0.61672
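
The link-data table and the model table above are compared by eye. One way to put a number on the agreement is the fraction of the top-k linked pages that also appear in the model's top k. This is a sketch under our own assumptions: `top_k_overlap` is a hypothetical helper, each argument is a list of neighbour titles already sorted best-first, and the single-letter titles are placeholders rather than results from this notebook.

```python
def top_k_overlap(link_neighbours, model_neighbours, k):
    """Fraction of the top-k link-data neighbours that the model also ranks in its top k."""
    top_links = link_neighbours[:k]
    top_model = set(model_neighbours[:k])
    if not top_links:
        return 0.0
    hits = sum(1 for title in top_links if title in top_model)
    return hits / len(top_links)


by_visits = ["a", "b", "c"]   # neighbours ordered by visit count (placeholder data)
by_score = ["b", "x", "a"]    # neighbours ordered by cosine score (placeholder data)
print(top_k_overlap(by_visits, by_score, k=2))
```

A value near 1 would mean the model surfaces the same pages that users actually navigate to; a value near 0 would mean the two rankings disagree.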


In [None]:
link_df[link_df["sent1"]=="125 Common Interview Questions and Answers (With Tips)"].head(60)

Unnamed: 0,sent1,sent2,visit
1,125 Common Interview Questions and Answers (With Tips),21 Job Interview Tips: How To Make a Great Impression,428
3,125 Common Interview Questions and Answers (With Tips),39 Strengths and Weaknesses to Discuss in a Job Interview,307
4,125 Common Interview Questions and Answers (With Tips),"How to Answer ""Tell Me About Yourself"" (Tips and Example Answers)",295
6,125 Common Interview Questions and Answers (With Tips),List of Weaknesses: 10 Things To Say in an Interview,274
10,125 Common Interview Questions and Answers (With Tips),"Interview Question: ""How Would You Describe Yourself?"" (With Examples)",177
26,125 Common Interview Questions and Answers (With Tips),"Interview Question: ""Why Do You Want to Work Here?""",115
29,125 Common Interview Questions and Answers (With Tips),Interview Question: What Are Your Greatest Weaknesses?,110
32,125 Common Interview Questions and Answers (With Tips),30+ Questions to Ask in a Job Interview (With Video Examples),103
37,125 Common Interview Questions and Answers (With Tips),How to Answer “What Motivates You?” (With Examples),99
49,125 Common Interview Questions and Answers (With Tips),How To Use the STAR Interview Response Technique,89


In [None]:
final_df[final_df["sent2"]=="125 Common Interview Questions and Answers (With Tips)"].head(60)

Unnamed: 0,sent1,sent2,scores
68,21 Job Interview Tips: How To Make a Great Impression,125 Common Interview Questions and Answers (With Tips),0.800237
204,12 Tough Interview Questions and Answers,125 Common Interview Questions and Answers (With Tips),0.739717
206,"How to Answer ""Tell Me About Yourself"" (Tips and Example Answers)",125 Common Interview Questions and Answers (With Tips),0.739399
211,15 Phone Interview Questions (With Example Answers),125 Common Interview Questions and Answers (With Tips),0.737007
339,How to Introduce Yourself in an Interview,125 Common Interview Questions and Answers (With Tips),0.707659
385,Email Examples: How to Respond to an Employer Interview Request,125 Common Interview Questions and Answers (With Tips),0.698295
545,39 Strengths and Weaknesses to Discuss in a Job Interview,125 Common Interview Questions and Answers (With Tips),0.675285
855,10 Resume Writing Tips To Help You Land a Job,125 Common Interview Questions and Answers (With Tips),0.639319
865,Guide: How to Succeed at a Hiring Event or Open Interview,125 Common Interview Questions and Answers (With Tips),0.638334
910,Follow-Up Email Examples For After the Interview,125 Common Interview Questions and Answers (With Tips),0.634134


In [None]:
link_df[link_df["sent1"]=="39 Strengths and Weaknesses to Discuss in a Job Interview"].head(60)

Unnamed: 0,sent1,sent2,visit
5,39 Strengths and Weaknesses to Discuss in a Job Interview,"How to Answer ""Tell Me About Yourself"" (Tips and Example Answers)",281
27,39 Strengths and Weaknesses to Discuss in a Job Interview,"Interview Question: ""How Would You Describe Yourself?"" (With Examples)",111
82,39 Strengths and Weaknesses to Discuss in a Job Interview,How To Prepare for an Interview,67
100,39 Strengths and Weaknesses to Discuss in a Job Interview,Interview Question: What Are Your Greatest Weaknesses?,61
127,39 Strengths and Weaknesses to Discuss in a Job Interview,"Interview Question: ""What Are Your Future Goals?""",51
134,39 Strengths and Weaknesses to Discuss in a Job Interview,How to Explain Your Reasons for Leaving a Job (With Examples),50
142,39 Strengths and Weaknesses to Discuss in a Job Interview,"Interview Question: ""How Do You Handle Conflict in the Workplace?""",48
210,39 Strengths and Weaknesses to Discuss in a Job Interview,How To Prepare for a Behavioral Interview,39
225,39 Strengths and Weaknesses to Discuss in a Job Interview,"Interview Question: ""Do You Have Any Questions?""",38
273,39 Strengths and Weaknesses to Discuss in a Job Interview,How to Introduce Yourself in an Interview,34


In [None]:
final_df[final_df["sent2"]=="39 Strengths and Weaknesses to Discuss in a Job Interview"].head(60)

Unnamed: 0,sent1,sent2,scores
639,How to Write a Resume Employers Will Notice,39 Strengths and Weaknesses to Discuss in a Job Interview,0.661646
745,6 Universal Rules for Resume Writing (With Video),39 Strengths and Weaknesses to Discuss in a Job Interview,0.649263
872,10 Resume Writing Tips To Help You Land a Job,39 Strengths and Weaknesses to Discuss in a Job Interview,0.637309
889,15 Phone Interview Questions (With Example Answers),39 Strengths and Weaknesses to Discuss in a Job Interview,0.636109
1155,Guide: How To Choose a Career,39 Strengths and Weaknesses to Discuss in a Job Interview,0.615863
1261,How to Introduce Yourself in an Interview,39 Strengths and Weaknesses to Discuss in a Job Interview,0.609409
1337,"Interview Question: ""What Are Your Salary Expectations?""",39 Strengths and Weaknesses to Discuss in a Job Interview,0.60474
1446,15 Best Jobs for Introverts,39 Strengths and Weaknesses to Discuss in a Job Interview,0.599167
1456,Listing Professional Experience on Your Resume,39 Strengths and Weaknesses to Discuss in a Job Interview,0.598745
1538,Management Skills: Definition and Examples,39 Strengths and Weaknesses to Discuss in a Job Interview,0.59407


Using Kendall's tau to check whether the relative pairwise ranking implied by the links is preserved in the ranking produced by the model
--

In [50]:
golds = set(link_df["sent1"].values)
systems = set(final_df["sent2"].values)

In [51]:
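from collections import defaultdict
from scipy.stats import kendalltau

# Kendall tau per article title, keyed by the title.
results = defaultdict(float)

for gold in golds:
    if gold in systems:
        # Embed the sent2 title from the first link_df row and the sent1 title from the first final_df row for this article.
        final = model.encode(link_df[link_df["sent1"]==gold]["sent2"].values[0], convert_to_tensor=True)
        to_check = model.encode(final_df[final_df["sent2"]==gold]["sent1"].values[0], convert_to_tensor=True)
        # kendalltau returns (statistic, p-value); keep only the statistic.
        kt = kendalltau(final.cpu(), to_check.cpu())[0]
        results[gold] = kt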
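
The cell above feeds Kendall's tau the embedding vectors of the first title from each table. As a complementary check, the two rankings themselves could be compared over the candidate articles that appear in both tables. The following is a minimal sketch under that assumption, not the notebook's method: `rank_agreement` is a hypothetical helper, the column names `sent1`, `sent2`, `visit` and `scores` follow the tables shown earlier, and the toy frames are made up for illustration.

```python
import pandas as pd
from scipy.stats import kendalltau


def rank_agreement(link_df, final_df, title):
    """Kendall tau between visit counts and model scores for one article.

    Hypothetical helper: only candidates present in both tables are compared.
    Returns (tau, n_shared); tau is None when fewer than two candidates are shared.
    """
    # Candidates linked from `title`, with their visit counts.
    linked = link_df.loc[link_df["sent1"] == title, ["sent2", "visit"]]
    linked = linked.rename(columns={"sent2": "candidate"})
    # Candidates the model scored against `title`.
    scored = final_df.loc[final_df["sent2"] == title, ["sent1", "scores"]]
    scored = scored.rename(columns={"sent1": "candidate"})
    # Keep only the candidates that both rankings cover.
    shared = linked.merge(scored, on="candidate")
    if len(shared) < 2:
        return None, len(shared)
    tau, _ = kendalltau(shared["visit"], shared["scores"])
    return float(tau), len(shared)


# Toy data, invented for illustration (not taken from the notebook's tables).
toy_links = pd.DataFrame({
    "sent1": ["A", "A", "A"],
    "sent2": ["B", "C", "D"],
    "visit": [300, 200, 100],
})
toy_scores = pd.DataFrame({
    "sent1": ["B", "C", "D"],
    "sent2": ["A", "A", "A"],
    "scores": [0.80, 0.70, 0.75],
})

tau, n_shared = rank_agreement(toy_links, toy_scores, "A")
print(tau, n_shared)
```

In the toy data, two of the three candidate pairs are ordered the same way by visits and by scores, so the tau is positive but below 1. Titles with fewer than two shared candidates return `None` rather than a misleading score.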

In [52]:
import operator

sorted_tuples = sorted(results.items(), key=operator.itemgetter(1))[::-1]

In [53]:
sorted_tuples

[('How To Ask for a Letter of Recommendation (With Examples)', 1.0),
 ('List of Weaknesses: 10 Things To Say in an Interview', 1.0),
 ('7 Powerful Ways to Start a Cover Letter (With Examples)', 1.0),
 ('125 Common Interview Questions and Answers (With Tips)', 1.0),
 ('Best Careers for INFP Personalities', 1.0),
 ('Letter of Recommendation for College Students', 1.0),
 ('What Does "Business Casual" Mean? (With Example Outfits)', 1.0),
 ('Video Interview Guide: Tips for a Successful Interview', 1.0),
 ('Guide to Gender Neutral Attire', 0.7363468233246301),
 ('How To Write a Resignation Letter (With Samples and Tips)',
  0.5741405570060922),
 ('10 Ways To Get the Most From Your Internship', 0.5520833333333333),
 ('12 Tough Interview Questions and Answers', 0.5054939077458659),
 ('Resume Objectives: 70+ Examples and Tips', 0.47076261966927757),
 ('How to Make a Resume (With Examples)', 0.4505820278503046),
 ('Low Stress Jobs', 0.43877828546562225),
 ('2021’s Top Resume Formats: Tips and Ex

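**Good results:** in the output shown above, eight articles reach a Kendall tau of 1.0, and every other visible value is positive.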