# Rough Unsupervised Paragraph Matching
Given a list of questions that we want to answer from the document:
* construct a doc2vec model of the paragraphs and questions
* create a TF-IDF matrix of the paragraphs and questions

Then score every paragraph against each question (n_similarity for doc2vec, linear_kernel for TF-IDF) and keep the most similar paragraphs, which may answer the question.

In [47]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

In [None]:
questions = ["Does a company need to keep track of the carbon intensity of the electricity?",
    "What metric is used for evaluating emissions?",
    "How does one get to net-zero emissions economy?",
    "What is net-zero emissions economy?",
    "How can carbon emission of the processes of cement clinker be reduced?",
    "How is the weighted cogeneration threshold calculated?",
    "What is carbon capture and sequestration?",
    "What are the stages of carbon capture and sequestration?",
    "What should the average energy consumption of a water supply system be?",
    "What are sludge treatments?",
    "How does anaerobic digestion work?",
    "What qualifies as a zero direct emission vehicle?",
]

In [117]:
para_df = pd.read_csv('finance_taxonomy.csv', sep='\t')
para_df.head()

Unnamed: 0,paragraphs
0,Updated methodology & Updated Technical Screen...
1,March 2020
2,About this report\nThis document includes an ...
3,Explanation of the Taxonomy approach. This sec...
4,PART B


In [118]:
# Basic clean-up: cast to str, flatten newlines, and count whitespace-separated tokens for later filtering
para_df['paragraphs'] = para_df['paragraphs'].astype(str)
para_df['paragraphs'] = para_df['paragraphs'].apply(lambda x: x.replace('\n', ' '))
para_df['num_tokens'] = para_df['paragraphs'].apply(lambda x: len(x.split()))
para_df

Unnamed: 0,paragraphs,num_tokens
0,Updated methodology & Updated Technical Screen...,9
1,March 2020,2
2,About this report This document includes an u...,48
3,Explanation of the Taxonomy approach. This sec...,42
4,PART B,2
...,...,...
8979,Heat Stress,2
8980,Economic Activity,2
8981,Wildfire,1
8982,5.,1


In [123]:
# Drop duplicates, keeping the first occurrence
para_df.drop_duplicates(subset="paragraphs", inplace=True)
len(para_df)

4037

# Create Doc2Vec model

In [124]:
paras = para_df.query('num_tokens > 5')['paragraphs'].tolist()
all_paras = questions + paras
len(paras), len(questions)

(2530, 12)

In [125]:
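# NOTE: each doc is a raw string; TaggedDocument expects a list of tokens
# (e.g. doc.split()), otherwise every character is treated as a word.
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(all_paras)]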
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
model.save("paragraphs.doc2vec.mdl")
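
One detail worth checking in the cell above: TaggedDocument takes a list of tokens, while all_paras holds raw strings. The sketch below uses plain Python only (no gensim) to show what iterating over a string yields compared with a whitespace split. The helper simple_tokenize is hypothetical and not part of the notebook.

```python
def simple_tokenize(text):
    """Hypothetical helper: lowercase the text and split on whitespace."""
    return text.lower().split()


question = "What is net-zero emissions economy?"

# Iterating over a str yields single characters ...
as_characters = list(question)
# ... whereas a tokenizer yields whole words.
as_tokens = simple_tokenize(question)

print(as_characters[:6])
print(as_tokens)
```

Under that assumption, the tagged corpus could be built as TaggedDocument(simple_tokenize(doc), [i]), and the same tokenizer would then be applied to the questions and paragraphs before comparing them.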

In [126]:
# Sanity-check one example
# model.wv.n_similarity(all_paras[0], all_paras[0]), model.wv.n_similarity(all_paras[0], all_paras[100])
# (1.0, -0.65090156)

# Compute top 5 similarities and save in a dataframe

In [127]:
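results = []
for question in tqdm(questions):
    tmp_results = []
    for idx, para in enumerate(paras):
        # NOTE: question and para are raw strings, and iterating a str yields
        # characters, so this likely compares mean character vectors from
        # model.wv rather than the learned document vectors.
        score = model.wv.n_similarity(question, para)
        tmp_results.append((score, idx))
    # Highest similarity first; keep the top 5 with a 1-based rank
    tmp_results = sorted(tmp_results, key=lambda x: x[0], reverse=True)
    for cnt, (score, idx) in enumerate(tmp_results[:5], start=1):
        results.append((question, cnt, idx, score))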
        
question, rank, idx, score = zip(*results)
doc_paras = [ paras[tmp]  for tmp in idx]
doc2vec_df =  pd.DataFrame({"question": question, "rank": rank,
                         "doc_paragraph": doc_paras, "score": score, 
                         "doc_para_idx": idx})
doc2vec_df.to_csv('doc2vecparas.csv', index=False, sep='\t')

100%|██████████| 12/12 [00:31<00:00,  2.60s/it]
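
The ranking loop above sorts every (score, idx) pair and then keeps the first five. A minimal sketch of the same top-k selection with the standard library's heapq.nlargest follows. The scores are made-up illustrative values, and top_k is a hypothetical helper name, not something defined in the notebook.

```python
import heapq


def top_k(scores, k=5):
    """Return (rank, idx, score) tuples for the k highest scores, rank starting at 1."""
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    return [(rank, idx, score) for rank, (idx, score) in enumerate(best, start=1)]


# Made-up similarity scores for seven paragraphs (illustrative only).
scores = [0.10, 0.42, 0.05, 0.58, 0.33, 0.27, 0.19]
ranked = top_k(scores, k=3)
print(ranked)
```

For a few thousand paragraphs the full sort is cheap, so this is mainly a readability choice: the helper returns the rank directly instead of tracking a separate counter.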


# Create TfIdf Rankings

In [128]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(all_paras)
X.shape

(2542, 6647)
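
TfidfVectorizer L2-normalises each row by default (norm='l2'), so the plain dot product that linear_kernel computes on these rows equals their cosine similarity. The sketch below checks that identity in plain Python on two made-up weight vectors; the helper names and the numbers are assumptions for illustration only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity computed from the unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up term weights for a question and a paragraph (illustrative only).
q = [0.0, 2.0, 1.0, 0.0]
p = [1.0, 1.0, 3.0, 0.0]

# Dot product after L2 normalisation, which is what a linear kernel
# computes on already-normalised rows.
linear = dot(l2_normalize(q), l2_normalize(p))
print(linear, cosine(q, p))
```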

In [129]:
# Sanity-check one example
# slot = 1556
# q_ar = vectorizer.transform([questions[0]])
# score = linear_kernel(q_ar, X[slot + len(questions)]) # [0]
# questions[0], paras[slot], score[0][0]

# ('Does a company need to keep track of the carbon intensity of the electricity?',
#  '\x0cincrease\ncarbon\nsequestration\nin soil, reduce\nfertilizer need,\nand N20\nemissions)',
#  0.13801633141227623)

# Compute top 5 similarities and save in a dataframe

In [130]:
results = []
para_X = X[len(questions):]  # rows of X that belong to paragraphs, not questions
for question in tqdm(questions):
    q_ar = vectorizer.transform([question])
    # Score the question against every paragraph in a single call
    scores = linear_kernel(q_ar, para_X)[0]
    tmp_results = [(score, idx) for idx, score in enumerate(scores)]
    # Highest similarity first; keep the top 5 with a 1-based rank
    tmp_results = sorted(tmp_results, key=lambda x: x[0], reverse=True)
    for cnt, (score, idx) in enumerate(tmp_results[:5], start=1):
        results.append((question, cnt, idx, score))

question, rank, idx, score = zip(*results)
doc_paras = [ paras[tmp]  for tmp in idx]
tfidf_df = pd.DataFrame({"question": question, "rank": rank,
                         "doc_paragraph": doc_paras, "score": score, 
                         "doc_para_idx": idx})
tfidf_df.to_csv('tfidfparas.csv', index=False, sep='\t')

100%|██████████| 12/12 [00:15<00:00,  1.28s/it]


In [131]:
tfidf_df.head()

Unnamed: 0,question,rank,doc_paragraph,score,doc_para_idx
0,Does a company need to keep track of the carbo...,1,no increase in emissions intensity of the acti...,0.265953,2103
1,Does a company need to keep track of the carbo...,2,This approach ensures translation of the thre...,0.264573,2237
2,Does a company need to keep track of the carbo...,3,Average carbon intensity of the electricity th...,0.197531,957
3,Does a company need to keep track of the carbo...,4,Average carbon intensity of the electricity pr...,0.194597,938
4,Does a company need to keep track of the carbo...,5,projects (i.e. non-certified) may also meet t...,0.188993,334


In [133]:
df = tfidf_df.query('rank==1').merge(doc2vec_df.query('rank==1'), 
                                     left_on='question', right_on='question', 
                                     suffixes=('_tfidf', '_doc2vec'))

In [134]:
df.head()

Unnamed: 0,question,rank_tfidf,doc_paragraph_tfidf,score_tfidf,doc_para_idx_tfidf,rank_doc2vec,doc_paragraph_doc2vec,score_doc2vec,doc_para_idx_doc2vec
0,Does a company need to keep track of the carbo...,1,no increase in emissions intensity of the acti...,0.265953,2103,1,any solid bio-waste used in the manufacturing ...,0.99996,1049
1,What metric is used for evaluating emissions?,1,transport sector. The Heavy Duty CO2 Regulati...,0.204511,1699,1,When considering the development of Circular E...,0.999992,543
2,How does one get to net-zero emissions economy?,1,Support a transition to a net-zero emissions e...,0.428232,1107,1,"Flood (coastal, fluvial, pluvial, ground water)",0.99972,261
3,What is net-zero emissions economy?,1,Support a transition to a net-zero emissions e...,0.582024,1107,1,5.4 Separate collection and transport of non-...,0.999985,1464
4,How can carbon emission of the processes of ce...,1,Thresholds for cement Clinker (A) are applicab...,0.52194,882,1,SFM requirements are essential to guarantee th...,0.999987,445


In [104]:
df.drop(columns=['rank_tfidf', 'rank_doc2vec'], inplace=True)

In [105]:
df

Unnamed: 0,question,doc_paragraph_tfidf,score_tfidf,doc_para_idx_tfidf,doc_paragraph_doc2vec,score_doc2vec,doc_para_idx_doc2vec
0,Does a company need to keep track of the carbo...,no increase in emissions intensity of the acti...,0.266298,2901,Where the (remaining) lifecycle of the crop pr...,0.999993,807
1,What metric is used for evaluating emissions?,transport sector. The Heavy Duty CO2 Regulati...,0.20772,2197,This is an illustrative example and should not...,0.999976,3722
2,How does one get to net-zero emissions economy?,Support a transition to a net-zero emissions e...,0.412338,1316,"Storm (including blizzards, dust and sandstorms)",0.999882,260
3,What is net-zero emissions economy?,Support a transition to a net-zero emissions e...,0.574484,1316,Enabling the integration of renewable energy B...,0.999569,1534
4,How can carbon emission of the processes of ce...,Thresholds for cement Clinker (A) are applicab...,0.524109,1042,“Material recovery from non-hazardous waste” S...,0.999972,1928
5,How is the weighted cogeneration threshold cal...,Any cogeneration technology can be included in...,0.246546,1635,Further guidance Typical sensitivities The ta...,0.999993,3046
6,What is carbon capture and sequestration?,E39.0.0 Remediation activities: 8. Landfill ga...,0.372866,1814,Use of perennial crops / pasture in highly ero...,0.999959,2669
7,What are the stages of carbon capture and sequ...,E39.0.0 Remediation activities: 8. Landfill ga...,0.304103,1814,Use of perennial crops / pasture in highly ero...,0.999981,2669
8,What should the average energy consumption of ...,The unit of measurement is the Infrastructure...,0.444345,1840,"use/conservation management plans, developed ...",0.999973,1850
9,What are sludge treatments?,No threshold applies. Rationale Sewage sludge ...,0.333653,1872,An ILI of 1.5 represents a very efficient perf...,0.999989,1842


In [106]:
df.to_csv('doc2vec_tfidf_paras.csv', index=False, sep="\t")