#### Test sentence-transformers with the Wikipedia Big Data classification approach

In [1]:
import wikipedia
import pandas as pd
import numpy as np

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

##### Get the Wikipedia Big Data summary and embed it

In [2]:
print(wikipedia.search("big data"))

['Big data', 'Data', 'Big Data (band)', 'Data science', 'Big data ethics', 'List of big data companies', 'Data lake', 'Data mining', 'Data analysis', 'Streaming data']


In [3]:
big_data_wiki_text = wikipedia.summary("Big data")
print(big_data_wiki_text[0:100])

Big data refers to data sets that are too large or complex to be dealt with by traditional data-proc


In [4]:
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [5]:
wiki_embed = model.encode(big_data_wiki_text)
print(wiki_embed.shape)

(768,)


##### Read in Federal RePORTER abstracts and embed them

In [6]:
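# Load the Federal RePORTER metadata with abstracts and tokens.
df = pd.read_pickle("../../../data/prd/Paper/FR_meta_and_final_tokens_23DEC21.pkl")
# Reset to a clean 0..n-1 index so positional lookups below match the row labels.
df = df.reset_index(drop=True)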

print(df.shape)

(1143869, 30)


In [7]:
df.head()

| | PROJECT_ID | ABSTRACT | PROJECT_TERMS | PROJECT_TITLE | DEPARTMENT | AGENCY | IC_CENTER | PROJECT_NUMBER | PROJECT_START_DATE | PROJECT_END_DATE | ... | BUDGET_END_DATE | CFDA_CODE | FY | FY_TOTAL_COST | FY_TOTAL_COST_SUB_PROJECTS | ORG_COUNT | PI_COUNT | FY_TOTAL_COST_SUM | NUM_RECORDS | final_tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 89996 | This is a project to explore Game-based, Metap... | Achievement; analog; base; Cognitive Science; ... | RUI: CYGAMES: CYBER-ENABLED TEACHING AND LEARN... | NSF | NSF | | 814512 | 9/15/2008 | 8/31/2012 | ... | | 47.076 | 2008 | 1999467.0 | | 1 | 1 | 1999467.0 | 1 | project explore game base metaphor enhanced ga... |
| 1 | 89997 | Institution: Franklin Institute Science Museum... | Active Learning; Child; Computer software; des... | ARIEL - AUGMENTED REALITY FOR INTERPRETIVE AND... | NSF | NSF | | 741659 | 9/15/2008 | 8/31/2012 | ... | | 47.076 | 2008 | 1799699.0 | | 1 | 1 | 1799699.0 | 1 | institution franklin institute science museum ... |
| 2 | 89998 | Through programs (including small group conver... | Address; Age; Birth; Brain; Caregivers; Child;... | BRIGHTER FUTURES: PUBLIC DELIBERATION ABOUT TH... | NSF | NSF | | 813522 | 9/15/2008 | 8/31/2011 | ... | | 47.076 | 2008 | 1505858.0 | | 1 | 1 | 1505858.0 | 1 | program include small group conversation citiz... |
| 3 | 89999 | In partnership with the American Chemical Soci... | Advanced Development; American; Chemicals; Che... | FOSTERING US-INTERNATIONAL COLLABORATIVE PARTN... | NSF | NSF | | 838627 | 8/1/2008 | 12/31/2010 | ... | | 47.049 | 2008 | 51000.0 | | 1 | 1 | 51000.0 | 1 | partnership american chemical society acs nati... |
| 4 | 90001 | The Center for Molecular Interfacing (CMI) wil... | Address; Architecture; Carbon Nanotubes; Catal... | CCI PHASE I: CENTER FOR MOLECULAR INTERFACING | NSF | NSF | | 847926 | 10/1/2008 | 9/30/2011 | ... | | 47.049 | 2008 | 1519821.0 | | 1 | 1 | 1519821.0 | 1 | center molecular interfacing cmi enable integr... |


In [8]:
# As a test, embed only the first 5,000 raw abstracts.
# encode() is given a plain list of strings rather than a pandas Series.

abstract_embeddings = model.encode(df['ABSTRACT'].iloc[:5000].tolist())
print(abstract_embeddings.shape)

(5000, 768)


In [9]:
abstract_embeddings

array([[-0.31997305,  0.700894  ,  0.74465793, ..., -0.7184012 ,
        -0.9252532 ,  1.1383075 ],
       [-0.32650214,  0.427477  ,  1.1298351 , ..., -0.31565255,
        -0.57146335,  0.23342875],
       [-0.21210362,  0.13335809,  1.1379268 , ..., -0.11215588,
        -0.6897892 ,  0.18945196],
       ...,
       [-0.3130962 ,  0.4245602 ,  0.6859571 , ..., -0.2976885 ,
        -0.08703874,  0.2766873 ],
       [-0.3731108 ,  0.06801865,  0.572278  , ...,  0.13657175,
        -0.93242854,  0.63883793],
       [-0.67293763,  0.51720095,  1.1885586 , ..., -0.77405745,
        -0.3536448 ,  0.39297816]], dtype=float32)

##### Calculate similarity between Big Data Wiki summary and abstracts

In [10]:
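# Compare the single Wikipedia embedding against every abstract embedding.
# The result has shape (1, 5000): one row for the query, one column per abstract.
scores = cosine_similarity(
    [wiki_embed],
    abstract_embeddings
)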
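
For reference, the quantity computed for each abstract is the cosine of the angle between two embedding vectors, i.e. their dot product divided by the product of their lengths. The following is a minimal standard-library sketch of that formula on made-up 3-dimensional vectors. The function name `cosine` and the toy values are illustrative assumptions, not part of the notebook, and the sketch is not how scikit-learn implements it internally.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for the 768-dimensional embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
docs = [
    [2.0, 0.0, 2.0],    # same direction as the query: score close to 1
    [0.0, 1.0, 0.0],    # orthogonal to the query: score close to 0
    [-1.0, 0.0, -1.0],  # opposite direction: score close to -1
]

# One score per document, analogous to one row of the scores matrix.
toy_scores = [cosine(query, d) for d in docs]
```

Because the score depends only on direction and not on vector length, a long and a short text can still score highly if their embeddings point the same way.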

In [11]:
type(scores)

numpy.ndarray

In [12]:
scores.max()

0.8374932

In [13]:
np.argmax(scores[0])  # position of the best-matching abstract in the single score row

3830
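
A single best match says little about how well the approach separates relevant from irrelevant abstracts, so it may help to inspect several top-ranked abstracts at once. The sketch below ranks a row of scores and returns the `k` highest with their positions. The helper name `top_k` and the toy scores are assumptions made for illustration; with NumPy the same ranking could be obtained from `np.argsort`.

```python
import heapq

def top_k(scores_row, k=5):
    """Return (index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores_row), key=lambda pair: pair[1])

# Hypothetical similarity scores for five abstracts.
toy_row = [0.12, 0.84, 0.33, 0.79, 0.05]
best = top_k(toy_row, k=3)

for index, score in best:
    print(index, score)
```

In this notebook the helper would presumably be called on the single row of the score matrix, for example `top_k(scores[0], k=10)`, and each returned position could then be looked up in `df['ABSTRACT']`.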

In [14]:
df['ABSTRACT'][3830]

'Explosion of wireless products and innovative use of the ISM bands lead to a very crowded spectrum space. When densely deployed, significant performance degradation may be experienced ranging from higher latency and lower data rate to starvation and service disruption. To tackle the co-existence problems, two key challenges need to be addressed. First, there exists innate uncertainty in channel quality, user location and population as well as coexisting devices and networks. Second, many emerging applications using radio technologies in the ISM bands require high availability and predictable services instead of large access bandwidth. The focus of this project is thus to develop theoretical models and algorithms for robust resource management that target at minimizing the outage and/or disruption of desired service level under varying resource availability in 802.11 like networks. This work will result in i) new methods and measurement procedures for inferring internal and external co