In [1]:
import pickle
import time
import scipy.sparse
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cos_sim

In [2]:
with open('df_arxiv_id_abst_full.pkl', 'rb') as f:
    df = pickle.load(f)

In [3]:
with open('tf_idf_vectorizer_ngram1.pkl', 'rb') as f:
    tfidf_v = pickle.load(f)

In [4]:
with open('X_tfidf.pkl', 'rb') as f:
    X_tfidf = pickle.load(f)

# For fast column lookups
X_tfidf_csc = X_tfidf.tocsc()
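
The conversion above matters because CSR stores rows contiguously while CSC stores columns contiguously, so extracting a handful of columns is typically cheaper from CSC. A minimal sketch on a small random matrix (the shape, density, and column indices are arbitrary assumptions, not the real `X_tfidf`) showing that both layouts return the same columns:

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-in for X_tfidf; shape and density are illustrative assumptions.
X_csr = sp.random(200, 50, density=0.05, format="csr", random_state=0)
X_csc = X_csr.tocsc()

# Hypothetical token columns to extract.
cols = np.array([3, 17, 42])

# Fancy indexing selects the same columns from either layout;
# CSC keeps each column's entries together, so no row scan is needed.
sub_from_csr = X_csr[:, cols]
sub_from_csc = X_csc[:, cols]

same = np.allclose(sub_from_csr.toarray(), sub_from_csc.toarray())
print(same, sub_from_csc.shape)
```

Timing the two slices with `time.perf_counter()` on the real matrix would show whether the conversion pays off for a given query length.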

In [5]:
def calculate_cos_sim_fast(query_str: str):
    # Computes similarity using only the columns (tokens) present in the query.
    # Cost grows with the number of unique tokens in the query,
    # so this is more efficient, but only for small inputs.

    # Vectorize the query string with the fitted TF-IDF vectorizer
    x = tfidf_v.transform([query_str]).toarray()

    # Keep only the nonzero entries of the query vector
    x_nonzero_idx = x.nonzero()[1]
    x_small = x[:, x_nonzero_idx]

    # Narrow the similarity computation to tokens that are present in query_str.
    # Column indexing also works when no query token is in the vocabulary.
    X_words_sparse = X_tfidf_csc[:, x_nonzero_idx]

    # The vectors are already normalized,
    # so a*b / (norm(a) * norm(b)) reduces to a*b
    similarities = X_words_sparse @ x_small.T

    return similarities



def inference(abstract, top_k=5):
    st = time.perf_counter()
    similarities = calculate_cos_sim_fast(abstract)
    # Sort document indices by descending similarity.
    # To also reject the top hit (the paper itself, if it is in the corpus),
    # use similarities.squeeze().argsort()[-2::-1] instead.
    most_sim_idx = similarities.squeeze().argsort()[::-1]
    df_similar = df.iloc[most_sim_idx[:top_k]]

    print(f"Took {1000 * (time.perf_counter() - st)} ms")

    return df_similar
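
The shortcut in `calculate_cos_sim_fast` relies on two facts: the query vector is zero outside the tokens it contains, and both sides are unit length. A minimal sketch checking this on a toy corpus (the four documents and the query are illustrative assumptions, and the vectorizer here uses scikit-learn's default `norm="l2"`, which may differ from the pickled one):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the arXiv abstracts (illustrative assumption).
docs = [
    "wildfire detection with unmanned aerial vehicles",
    "deep residual networks for image recognition",
    "stochastic models for wildfire spread",
    "design spaces for convolutional networks",
]
vec = TfidfVectorizer()  # default norm="l2" yields unit-length rows
X = vec.fit_transform(docs).tocsc()

query = "aerial wildfire detection"
q_sparse = vec.transform([query])
q = q_sparse.toarray()   # shape (1, n_tokens)
nz = q.nonzero()[1]      # columns of tokens present in the query

# Restricted dot product over the query's nonzero columns only.
restricted = np.asarray(X[:, nz] @ q[:, nz].T).ravel()

# Reference: full cosine similarity over every column.
full = cosine_similarity(X, q_sparse).ravel()

print(np.allclose(restricted, full))
```

If the stored vectors were not normalized, the restricted product would still be a valid dot product but would no longer equal the cosine similarity.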

In [6]:
my_thesis = """
Wildfires are a growing problem in the US and worldwide – in the last decade we
witnessed some of the costliest, most destructive, and deadliest wildland fires on record.
The consistent growth in the number of incidents, affected area, and suppression costs
suggests that the issue might become even worse in the future. Solutions include early fire
detection and preventative scanning of the vast wildlands. This thesis proposes a
vision-based multimodal fire detection system that is deployed on an Unmanned Aerial
Vehicle (UAV, drone) and can be used for early detection of new wildfires, and
surveillance of existing ones. The Fire Perception Box multimodal perception hardware is
designed and deployed onboard a custom built UAV. Visual spectrum (RGB) and infrared
(IR) classification algorithms along with a fusion strategy are proposed and deployed to
the UAV system. Overall, the system is capable of fully onboard real-time visual
processing and produces spatial results which can later be utilized for realtime wildfire
maps — a technology that is very much needed in fire management. The effectiveness of
the system is shown via quantitative evaluation on the proposed Aerial Fire Dataset, as
well as external datasets. Furthermore, the performance of the system is evaluated on
never-seen data from a real-world 80-acre wildfire.
"""

In [7]:
resnet_abstract = """
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. 
"""

In [8]:
something_recent = """
In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
"""

In [9]:
X_tfidf.shape

(1000000, 407494)

In [10]:
inference(my_thesis)

Took 160.81252599542495 ms


Unnamed: 0,id,abstract
837142,1704.0263,Wildland fire fighting is a very dangerous j...
972416,1804.10723,With the maturity of unmanned aerial vehicle...
934017,1801.05086,Unmanned aerial vehicles (UAV) are commonly ...
579939,1412.1961,This paper presents an approach for defining...
905772,1710.10389,The past few years have witnessed a tremendo...


In [11]:
inference(something_recent)

Took 153.42719700129237 ms


Unnamed: 0,id,abstract
982156,1805.09111,Graph-based design languages in UML (Unified...
868199,1707.03167,"In this paper, we present RegNet, the first ..."
609968,1503.07254,This paper studies the problem of designing ...
986744,1806.01104,Many-core co-design is a complex task in whi...
669210,1510.05253,This paper reviews the design of experiments...


In [12]:
inference(my_thesis[:50])

Took 43.04327900172211 ms


Unnamed: 0,id,abstract
154072,911.0051,Models for wildfires must be stochastic if t...
753110,1607.05559,Gelfand numbers represent a measure for the ...
79793,808.3661,"In this paper, we abstract a kind of stochas..."
539846,1407.3089,We propose new summary statistics for intens...
278792,1108.0754,The Burning Index (BI) produced daily by the...
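
`inference` ranks all 1,000,000 scores with a full `argsort`, although only the first few indices are used. A minimal sketch of a partial selection with `np.argpartition`, which may reduce the per-query time; `top_k_indices` is a hypothetical helper and the scores are made-up values, not output from the notebook:

```python
import numpy as np

def top_k_indices(scores, k=5):
    """Return indices of the k largest scores, best first (hypothetical helper)."""
    scores = np.asarray(scores).ravel()
    k = min(k, scores.size)
    if k <= 0:
        return np.empty(0, dtype=np.intp)
    # argpartition moves the k largest values into the last k slots
    # without fully sorting the array.
    part = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates, in descending order of score.
    return part[np.argsort(scores[part])[::-1]]

# Made-up similarity scores for six documents.
scores = np.array([0.10, 0.90, 0.30, 0.70, 0.05, 0.50])
print(top_k_indices(scores, k=3))  # → [1 3 5]
```

Inside `inference`, the result could replace `most_sim_idx[:top_k]` as the argument to `df.iloc`.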
