In [1]:
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cos_sim

In [2]:
with open('id+abstract.pkl', 'rb') as f:
    df = pickle.load(f)

In [3]:
tfidf_vectorizer_ngram1 = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), max_features=30000)
%time tfidf_vectorizer_ngram1.fit(df.abstract)

CPU times: user 2min 20s, sys: 1.64 s, total: 2min 22s
Wall time: 2min 22s


TfidfVectorizer(max_features=30000, stop_words='english')

In [4]:
# tfidf_vectorizer_ngram2 = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=30000)
# %time tfidf_vectorizer_ngram2.fit(df.abstract)

In [5]:
tfidf_v = tfidf_vectorizer_ngram1

In [6]:
%time X_tfidf = tfidf_v.transform(df.abstract)

CPU times: user 2min 29s, sys: 1.01 s, total: 2min 30s
Wall time: 2min 30s


In [7]:
my_thesis = """
Wildfires are a growing problem in the US and worldwide – in the last decade we
witnessed some of the costliest, most destructive, and deadliest wildland fires on record.
The consistent growth in the number of incidents, affected area, and suppression costs
suggests that the issue might become even worse in the future. Solutions include early fire
detection and preventative scanning of the vast wildlands. This thesis proposes a
vision-based multimodal fire detection system that is deployed on an Unmanned Aerial
Vehicle (UAV, drone) and can be used for early detection of new wildfires, and
surveillance of existing ones. The Fire Perception Box multimodal perception hardware is
designed and deployed onboard a custom built UAV. Visual spectrum (RGB) and infrared
(IR) classification algorithms along with a fusion strategy are proposed and deployed to
the UAV system. Overall, the system is capable of fully onboard real-time visual
processing and produces spatial results which can later be utilized for realtime wildfire
maps — a technology that is very much needed in fire management. The effectiveness of
the system is shown via quantitative evaluation on the proposed Aerial Fire Dataset, as
well as external datasets. Furthermore, the performance of the system is evaluated on
never-seen data from a real-world 80-acre wildfire.
"""

In [8]:
resnet_abstact = """
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. 
"""

In [9]:
def inference(abstr):
    x = tfidf_v.transform([abstr])
    sims = cos_sim(X_tfidf, x)
    sims = np.squeeze(sims)
    most_sim_idx = sims.argsort()
    return df.iloc[most_sim_idx[:10]].abstract

In [10]:
x = tfidf_v.transform([my_thesis])
sims = cos_sim(X_tfidf, x)
sims = np.squeeze(sims)
most_sim_idx = sims.argsort()[::-1]

In [11]:
with open('X_tfidf.pkl', 'wb') as f:
    pickle.dump(X_tfidf, f)

In [16]:
most_sim_idx

array([1325672,  837142, 1312419, ...,  303361,  303362, 1789906])

In [17]:
df.iloc[1325672].abstract

"  The challenge of wildfire management and detection is recently gaining\nincreased attention due to the increased severity and frequency of wildfires\nworldwide. Popular fire detection techniques such as satellite imaging and\nremote camera-based sensing suffer from late detection and low reliability\nwhile early wildfire detection is a key to prevent massive fires. In this\npaper, we propose a novel wildfire detection solution based on unmanned aerial\nvehicles assisted Internet of things (UAV-IoT) networks. The main objective is\nto (1) study the performance and reliability of the UAV-IoT networks for\nwildfire detection and (2) present a guideline to optimize the UAV-IoT network\nto improve fire detection probability under limited budgets. We focus on\noptimizing the IoT devices' density and number of UAVs covering the forest area\nsuch that a lower bound of the wildfires detection probability is maximized\nwithin a limited time and budget. At any time after the fire ignition, the

In [22]:
%time inference(resnet_abstact)

CPU times: user 1.37 s, sys: 176 ms, total: 1.54 s
Wall time: 1.54 s


1789906      The Ginzburg Landau theory for d_{x^2-y^2}-w...
776722       The static properties of the fundamental mod...
356930       We study ultracold Rydberg-dressed Bose gase...
776724       Inspired by recent work of Carlson, Friedlan...
776725       The transient Be/X-ray binary A0538-66 shows...
1635971      Dyonic black holes with string-loop correcti...
1635973      The emitted power of the radiation from a ch...
1635974      We prove that there do not exist multisolito...
776731       Ivory's Lemma is a geometrical statement in ...
1635975      Using the results of previous investigations...
Name: abstract, dtype: object

In [19]:
inference(my_thesis)

1789906      The Ginzburg Landau theory for d_{x^2-y^2}-w...
303362       Let $H(\mathbb{B})$ denote the space of all ...
303361       We study the discreteness for non-elementary...
303360       In this article, we assume that a cold charg...
799639       We prove that intermediate extensions of per...
1671642      We study the general gaugings of N=2 Maxwell...
1671643      We present what we believe is the minimal th...
1671644      We investigate the instanton effects of non-...
1438950      BVRI light curves are presented for 27 Type ...
1671639      Vacuum spherically symmetric Einstein gravit...
Name: abstract, dtype: object

In [20]:
most_sim_idx = sims.argsort()[::-1] # reverse 
# if we also want to reject the paper itself, then it'd be most_sim_idx[-2::-1]

In [21]:
most_sim_idx

array([1325672,  837142, 1312419, ...,  303361,  303362, 1789906])