Using the tfidf data, input a new document, get a list of the most textually similar documents including author names, email addresses etc.  To assist with referee selection.

### Usage
- download new submission pdfs to 'new_subs' folder
- run all cells in this notebook
- see excel output in 'outputs'
- check excel output.  There is a lot of noise in the pdf data, so you may find matches are not ideal.  However, it's likely you will find similar papers and therefore suitable referees in the results.


In [1]:
# imports
import os
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd
import pickle
from bs4 import BeautifulSoup as bs
import string
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

In [2]:
# configure
from config import Config as c
working_data = c.working_data
doi_datapath = c.dois_pkl
# word_datapath = c.word_datapath
tfidf_datapath = c.tfidf_datapath
# cosine_sims_datapath = c.cosine_sims_datapath
vectorizer_datapath = c.vectorizer_datapath
tsne_data = c.tsne_data
vocab_p = c.vocab_p
# idf_p = c.idf_p


# how many rows of results do we want in the output files?
n_results=20

In [3]:
# try to load in data.  If that fails, re-index the stuff you need for this script
try:
    tfidf = pickle.load(open(tfidf_datapath,'rb')).todense() # should this be todense?
    dois = pickle.load(open(doi_datapath,'rb'))
    vocab = pickle.load(open(vocab_p,'rb'))
    df = pd.read_csv(working_data,index_col = 0)
except:
    print('Unable to find data for search engine.  Re-indexing now.')
    files = [
    'Step_0_create_word_data',
    'Step_1_data_processing_TF-IDF',
    'Step_3_Add_citation_data']
    to_run = [0,1,2]
    # convert to python
    for file in [files[i] for i in to_run]:
        print('--- Converting .ipynb to .py ---')
        os.system('jupyter nbconvert --to python {}.ipynb'.format(file))
        print('--- Executing .py file ---')
        os.system('python {}.py'.format(file))
        print('--- Deleting .py file ---')
        os.system('del {}.py'.format(file))
    tfidf = pickle.load(open(tfidf_datapath,'rb')).todense() # should this be todense?
    dois = pickle.load(open(doi_datapath,'rb'))
    vocab = pickle.load(open(vocab_p,'rb'))
    df = pd.read_csv(working_data,index_col = 0)

### Before we begin
Search queries are going to be pdf documents.  These should be converted to text

In [4]:
# define procedure
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter,TextConverter,XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import io

def convert(case,fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    manager = PDFResourceManager()
    codec = 'utf-8'
    caching = True
    output = io.StringIO()
    converter = TextConverter(manager, output, codec=codec, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    convertedPDF = output.getvalue()
    print(convertedPDF)

    infile.close()
    converter.close()
    output.close()
    return convertedPDF

### Set up indexing for new doc

In [5]:
print('Starting TF-IDF indexing.')
# stopwords
from sklearn.feature_extraction import text

my_words = ['et','al','use','article','introduction','abstract','title', 'nan'] # note that there are 'NaN's in WoS data!

# Add custom stopwords here.  E.g. ['sensor','network','data','node']  are so common in DSN
# that they appear in almost every paper and make it hard to differentiate between clusters.  
custom_stops = []
my_words = my_words+custom_stops

my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)
my_stop_words = set(my_stop_words)

Starting TF-IDF indexing.


In [6]:
from tools import strip_stem

## Load Vectorizer

In [7]:
## also loads stopwords
vectorizer_datapath = c.vectorizer_datapath
with open(vectorizer_datapath,'rb') as f:
    vectorizer = pickle.loads(f.read())
    
with open(vocab_p,'rb') as f:
    vocab = pickle.loads(f.read())

vectorizer.vocabulary_ = vocab

In [8]:
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words={'what', 'onto', 'not', 'should', 'nothing', 'anywhere', 'several', 'couldnt', 'and', 'down', 'beside', 'twenty', 'always', 'therefore', 'ours', 'next', 'anyone', 'my', 'anyhow', 'whose', 'whereupon', 'go', 'full', 'etc', 'everyone', 'often', 'above', 'else', 'here', 'as', 'mine', 'under'...tter', 'thru', 'towards', 'then', 'another', 'it', 'only', 'could', 'thereafter', 'do', 'if', 'are'},
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

### Find similar documents

In [10]:
# new doc
folder = 'new_subs/pdf/'
fnames = os.listdir(folder)
for filename in fnames:
    doc = convert('text',os.path.join(folder,filename))
    print('First line of doc is: ')
    print(doc[:doc.find('\n')])
    print('This repeats regularly on the submitted MS and therefore skews results')
    print("Should this be deleted iteratively from each page of data?")
    doc = strip_stem(doc)
    q_vec = vectorizer.transform([doc])
    q_vec = q_vec.todense()
    sims = q_vec*tfidf.T
    sims = np.array(sims)[0,:] # convert from matrix to array
    best_matches = np.argsort(-sims)[:n_results] # get indices of top 10 values
    results = [dois[best_match] for best_match in best_matches]
    # create a new df with only results in it
    df_out = df[df.DI.isin(results)] # df is read in near the top
    # Create the dictionary that defines the order for sorting
    sorterIndex = dict(zip(results,range(len(results))))
    # add the ordering as a col
    df_out['rank'] = df_out['DI'].map(sorterIndex)
    # sort the dataframe according to that col
    df_out.sort_values(['rank'],ascending=True, inplace=True)
    # remove the ordering col
    df_out.drop('rank',1,inplace=True)
    # reset the index to show the ranks of the results
    df_out.reset_index(inplace=True)
    # write to file
    df_out.to_csv('outputs/search_results_{}.csv'.format(filename))
    

Journal of Cerebral Blood Flow and Metabolism

 

 

C

Protective Effect of Primary Collaterals on Dynamic Cerebral 
o

Autoregulation 

n

fi

Journal:  Journal of Cerebral Blood Flow and Metabolism 
d

Manuscript ID  JCBFM-0094-18-ORIG 

 

 

 

 

 

 

Manuscript Type:  Original Article 

e

n

Date Submitted by the Author:  20-Feb-2018 

Complete List of Authors:  Guo, Zhen-Ni; Jilin University First Hospital, Clinical Trial and Research 

tial: F

o

r 

R

Center for Stroke, Department of Neurology, the First Hospital of Jilin 
University 
Sun, Xin; Jilin University First Hospital, Department of Neurology, the First 
Hospital of Jilin University 
Liu, Jia; Chinese Academy of Sciences, Shenzhen Institutes of Advanced 
Technology 
Sun, Huijie; Jilin University First Hospital, Cadre Ward, Department of 
Neurology, the First Hospital of Jilin University 
Zhao, Yingkai; Jilin University First Hospital, Cadre Ward, Department of 
Neurology, the First Hospital of Jilin University 
Ma

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Journal of Cerebral Blood Flow and Metabolism

 

 

C

o

How doesn't cerebrospinal fluid circulate? A literature 
review from the historical pioneers' theories to current 
n

models. 

fi

d
Journal:  Journal of Cerebral Blood Flow and Metabolism 

Manuscript ID  JCBFM-0089-18-OP 

e

Manuscript Type:  Opinion 

n

 

 

 

 

 

 

  

 

 

Date Submitted by the Author:  19-Feb-2018 

Complete List of Authors:  Mantovani, Giorgio; University of Ferrara 

tial: F

Menegatti, Marta; University of Ferrara 
Scerrati, Alba; University of Ferrara, Neurosurgery 
De Bonis, Pasquale; University of Ferrara, Neurosurgery 

Research Topics: 

Cerebrospinal fluid, Neuroanatomy, Neuropathology, Neurophysiology, 
Neurosurgery 

o

Research Techniques:  Brain Imaging, MRI 

r 

R

e

vi

e

w

 

O

n

ly

Page 1 of 19

Journal of Cerebral Blood Flow and Metabolism

 

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46


NameError: name 'dim_data' is not defined