This notebook takes a list of questions and calculates sentence similarity scores between each pair of them using a pre-trained BERT-style sentence-embedding model. We begin by importing the dataset below. Credit for the "how-to" and code below goes to this [fantastic article](https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1). The main requirement is that the column with the text must be called 'text' for this script to work.

In [17]:
# Constants
dev = False
dataset = 'chinese_proverbs' # https://www.kaggle.com/bryanb/scraping-sayings-and-proverbs
column = 'text' # Name of the column that holds the sentences
model_name = 'all-mpnet-base-v2' # Alternatives: 'dmis-lab/biobert-base-cased-v1.2', 'multi-qa-mpnet-base-dot-v1'
cutoff = 0.6 # Cosine-distance cutoff for drawing an edge. Note that cutoffs differ depending on the model used. 0.5 for all-mpnet-base-v2

In [18]:
import pandas as pd
dat = pd.read_csv('../data/processed/' + dataset + '.csv')

if dev:  # Use a small sample while developing
    dat = dat.head(100)

display(dat.shape)
display(dat)

(127, 6)

Unnamed: 0.1,Unnamed: 0,in_chinese,pin_yin,text,category,origin
0,0,不作不死。,Bù zuò bù sǐ. 'Not do not die.',If you don't do stupid things you won't end up in tragedy.,Wisdom,Chinese
1,1,塞翁失马，焉知非福。,"Sài Wēng shī mǎ, yān zhī fēi fú. 'Sai Weng [legendary old man's name] lost horse, how know not blessing'.",Blessings come in disguise.,Wisdom,Chinese
2,2,小洞不补，大洞吃苦。,"Xiǎodòng bù bǔ, dàdòng chī kǔ.'small hole not mend; big hole eat hardship'","If small holes aren't fixed, then big holes will bring hardship.",Wisdom,Chinese
3,3,水满则溢。,Shuǐmǎn zé yì. 'water full but overflows',Water flows in only to flow out.,Wisdom,Chinese
4,4,读万卷书不如行万里路。,"Dú wànjuànshū bù rú xíng wànlǐlù. 'reading 10,000 books, not as good as walking 10,000 li road'",It's better to walk thousands of miles than to read thousands of books.,Wisdom,Chinese
...,...,...,...,...,...,...
122,122,龙潭虎穴。,Lóng tán hǔ xué. 'dragon pool tiger cave',A dragon's pool and a tiger's den.,Dragons,Chinese
123,123,画龙点睛。,Huàlóngdiǎnjīng. 'paint dragon dot eye',Paint a dragon and dot the eye.,Dragons,Chinese
124,124,叶公好龙。,Yè Gōng hào long.,Lord Ye loves dragons.,Dragons,Chinese
125,125,鲤鱼跳龙门。,Lǐyú tiào lóng mén. 'carp jump dragon gate',A carp has jumped the dragon's gate.,Dragons,Chinese


In [19]:
# Optional block where you filter and modify the dataset to fit your needs 
#d = {column: dat[column], 'links': dat[column + '_links']}
#dat = pd.DataFrame(data=d)
#display(dat)

Now we isolate the questions as a list of sentences, which is fed into a pre-trained, user-specified model. The code below previously raised a "module not found" error. The maintainer of the sentence-transformers package has fixed it, and picking up the fix requires installing the latest version via "pip install -U sentence-transformers". The code below takes a while to run, as the pre-trained model is quite large. A menu of models is [here](https://www.sbert.net/docs/pretrained_models.html).

In [None]:
from sentence_transformers import SentenceTransformer

sentences = dat[column].tolist()
model = SentenceTransformer(model_name)

sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

Ignored unknown kwarg option direction


TODO: Build a topic model from the sentence similarity embeddings. We will use the BERTopic library, [here](https://maartengr.github.io/BERTopic/index.html#quick-start).

From here, we use cosine similarity to determine which questions are most similar to each other. One example of this is below. We compare a single question (index 20) against every question in the dataset, then display the questions sorted by similarity score to see whether the ranking makes sense.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

cos_sim = cosine_similarity(
    [sentence_embeddings[20]],
    sentence_embeddings
).tolist()

df = pd.DataFrame({'text': sentences,
            'similarity': cos_sim[0]})

# Arrange by similarity
pd.set_option('display.max_colwidth', None)
display(df.sort_values(by = 'similarity', ascending = False))
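
To make the similarity score itself concrete, here is a minimal pure-Python sketch of cosine similarity for a single pair of vectors. The three-dimensional vectors and the function name `cosine_similarity_pair` are made up for illustration; this is not how scikit-learn implements it internally, only the same formula written out.

```python
import math

def cosine_similarity_pair(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings" (real ones have hundreds of dimensions)
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # a scaled copy of query
orthogonal = [0.0, 0.0, 3.0]      # shares no component with query

sim_same = cosine_similarity_pair(query, same_direction)
sim_orth = cosine_similarity_pair(query, orthogonal)
print(sim_same, sim_orth)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and vectors with nothing in common score 0, which is why the measure suits comparing embeddings.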

We were going to run UMAP on the vector space to get some intuition for what it looks like, but because UMAP has issues with Python 3.9 at the moment, we jump straight to computing a cosine distance matrix, which we then turn into a graph.

In [None]:
from sklearn.metrics.pairwise import cosine_distances

# Pairwise cosine distance (1 - cosine similarity) between all sentence embeddings
dist = cosine_distances(sentence_embeddings)
dist

# nn = sklearn.neighbors.kneighbors_graph(sentence_embeddings, n_neighbors = 1)
# nn

In [None]:
import igraph as ig
import matplotlib.pyplot as plt

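# Connect two sentences when their distance is below the cutoff. Excluding exact
# zeros drops the diagonal (self-loops). Each comparison needs its own parentheses
# because & binds more tightly than < and >.
adj = (dist > 0) & (dist < cutoff)

g = ig.Graph.Adjacency(adj) # Boolean adjacency matrix
g = g.as_undirected()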

if dat.shape[0] <= 1000:
    fig, ax = plt.subplots()
    ig.plot(g, target=ax)
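
Since the cutoff decides which pairs of sentences become edges, it helps to see how the edge list changes as the cutoff moves. The sketch below uses a hypothetical 4x4 distance matrix and a helper we call `edges_below` (both are assumptions for illustration, not part of the pipeline above).

```python
# Hypothetical symmetric cosine-distance matrix for four sentences
toy_dist = [
    [0.0, 0.2, 0.7, 0.9],
    [0.2, 0.0, 0.55, 0.8],
    [0.7, 0.55, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]

def edges_below(dist_matrix, threshold):
    """Return undirected edges (i, j), i < j, whose distance is below threshold."""
    n = len(dist_matrix)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)  # j > i skips the diagonal and mirrored pairs
        if dist_matrix[i][j] < threshold
    ]

for threshold in (0.25, 0.6, 0.85):
    print(threshold, edges_below(toy_dist, threshold))
```

A low cutoff leaves most sentences isolated, while a high one connects nearly everything, so it is worth trying a few values whenever the model changes.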

Now we take a quick measure of degree and betweenness centrality to get a feel for which questions are the most central. We display the dataset sorted by each measure, most central first.

In [None]:
deg = g.degree()
btw = g.betweenness()
dat['degree'] = deg
dat['betweenness'] = btw
display(dat.sort_values(by='degree', ascending=False))
display(dat.sort_values(by='betweenness', ascending=False))
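
For intuition about what the two measures capture, here is a small pure-Python sketch on a toy graph. Degree counts a node's neighbours, while betweenness counts how many shortest paths between other nodes pass through it. The betweenness part follows Brandes' algorithm for unweighted graphs. The function name and the toy graph are our own assumptions, and this is not igraph's implementation.

```python
from collections import deque

def degree_and_betweenness(n, edges):
    """Degree and unnormalised betweenness for an undirected, unweighted graph."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    degree = [len(neighbours) for neighbours in adj]
    betweenness = [0.0] * n
    for s in range(n):
        # Breadth-first search from s, counting shortest paths to every node
        order = []
        preds = [[] for _ in range(n)]
        sigma = [0] * n
        sigma[s] = 1
        dist = [-1] * n
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies
        delta = [0.0] * n
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                betweenness[w] += delta[w]
    # Each unordered pair was counted once from each endpoint
    return degree, [b / 2 for b in betweenness]

# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy_degree, toy_betweenness = degree_and_betweenness(6, toy_edges)
print(toy_degree)
print(toy_betweenness)
```

The two bridge nodes (2 and 3) have only slightly higher degree than the rest, but they carry all of the betweenness, because every path between the two triangles runs through them. This is why sorting by each measure can surface different questions.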

From here, we run clustering to see whether the questions group into particular themes. We use Louvain clustering, as it is often used in graph-based analysis.

In [None]:
clust = g.community_multilevel() # Louvain community detection
#df = pd.DataFrame({'name': dat['Question'], 'cluster': clust.membership})
dat['cluster'] = clust.membership

# Display each cluster with more than 5 members, most connected sentences first
for i in pd.unique(dat['cluster']):
    curr = dat[dat['cluster'] == i]
    if curr.shape[0] > 5:
        display(curr.sort_values(by='degree', ascending=False))
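
Louvain clustering works by searching for a partition with high modularity, a score that compares the number of edges inside each cluster with what would be expected if edges were placed at random. As a rough sketch of what is being optimised (the function and toy graph below are our own illustration, not igraph's code), modularity for a given partition can be computed like this:

```python
def modularity(edges, membership):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}    # number of edges with both endpoints in the same cluster
    degree_sum = {}  # total degree of the nodes in each cluster
    for a, b in edges:
        ca, cb = membership[a], membership[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

q_split = modularity(toy_edges, [0, 0, 0, 1, 1, 1])   # one cluster per triangle
q_single = modularity(toy_edges, [0, 0, 0, 0, 0, 0])  # everything in one cluster
print(q_split, q_single)
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is the kind of improvement the algorithm looks for. python-igraph's clustering result is expected to expose the same score (for example as `clust.modularity`), though that is worth checking against the installed version.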
        

Now we pull the graph out as an edge list to get a feel for what is connected to what. We then export the edge list for import into Neo4j.

TODO: 1) Make this a searchable UMAP. 2) Place this into Neo4j. 3) Set up a virtual environment and stop installing anything into the regular environment.

In [None]:
el = g.get_edge_dataframe()
el

Now we have to convert the edges from their vertex IDs to the questions themselves. We see numbers right now because the original adjacency matrix was built from row positions, not from the questions. The IDs follow the order of the sentences, so we convert them with a simple dictionary that maps each ID to its question.

In [None]:
# Map each vertex ID (row position) to its sentence
q_dict = dict(enumerate(sentences))


In [None]:
e1 = [q_dict[i] for i in el['source']]
e2 = [q_dict[i] for i in el['target']]

el_df = pd.DataFrame({'edge1': e1, 'edge2': e2})
el_df 



In [None]:
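# Model names such as 'dmis-lab/biobert-base-cased-v1.2' contain '/', which would be read as a directory separator
model_tag = model_name.replace('/', '_')
el_df.to_csv('../output/' + dataset + '_' + model_tag + '_edgelist_dist_' + str(cutoff) + '.csv', encoding='utf-8-sig')
dat.to_csv('../output/' + dataset + '_' + model_tag + '_analyzed_' + str(cutoff) + '.csv', encoding='utf-8-sig')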
#model_out.to_csv('../output/' + dataset + '_topic_model' + str(cutoff) + '.csv', encoding='utf-8-sig')