# Natural Language Processing using Doc2Vec on National Science Foundation Awards Abstracts
---
### Team:  
Jacob Noble  
Himanshu Gamit  
Shantanu Hadap


### Imports

scikit-learn's `StandardScaler` is used for feature scaling until a SAS equivalent replaces it (see the inline comment on the import).

In [0]:
import gensim
from gensim.models import Doc2Vec
import pandas as pd
from sklearn.preprocessing import StandardScaler # Replace with SAS version
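
As a minimal sketch of how the scaler might be applied, the block below standardizes a small random matrix that stands in for document vectors. The array `doc_vectors` and its shape are assumptions made for illustration; they are not taken from the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for document vectors: 50 documents, 8 dimensions.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(loc=3.0, scale=2.0, size=(50, 8))

# Fit on the vectors and rescale each dimension to zero mean and unit variance.
scaler = StandardScaler()
scaled_vectors = scaler.fit_transform(doc_vectors)

print(scaled_vectors.mean(axis=0).round(6))
print(scaled_vectors.std(axis=0).round(6))
```

Scaling each dimension this way keeps any single embedding dimension from dominating a distance-based method such as K-means.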

In [2]:
!wget https://nsfdata.s3.amazonaws.com/nsfdataset.zip
!unzip nsfdataset.zip -d data

--2020-04-15 20:28:03--  https://nsfdata.s3.amazonaws.com/nsfdataset.zip
Resolving nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)... 52.216.92.107
Connecting to nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)|52.216.92.107|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358656602 (342M) [application/zip]
Saving to: ‘nsfdataset.zip’


2020-04-15 20:28:11 (45.7 MB/s) - ‘nsfdataset.zip’ saved [358656602/358656602]

Archive:  nsfdataset.zip
  inflating: data/nsf_proposals.csv  


In [3]:
#import saspy
import pandas as pd
import numpy as np


selected_cols = ['abstractText','date']
projects = pd.read_csv("data/nsf_proposals.csv", usecols = selected_cols, low_memory=False) #, nrows=30000
print ("Total Projects: ", projects.shape[0], "\nTotal Features: ", projects.shape[1])
projects.date = pd.to_datetime(projects.date.str.replace('D', 'T'))
projects = projects.sort_values('date')

Total Projects:  329321 
Total Features:  2


### Load in Data using Pandas

In [4]:
# Drop values without Abstract texts
projects = projects.dropna(how='any')

print("Max length of the abstractText:", projects.abstractText.str.len().max())
print("Min length of the abstractText:", projects.abstractText.str.len().min())
# Note: this average counts non-space characters only, unlike the max/min above
print("Avg length of the abstractText:", projects.abstractText.apply(lambda x: len(x) - x.count(" ")).mean())

words = projects.abstractText.str.split().apply(len)
print("Max words abstractText:", words.max())
projects.head()

Max length of the abstractText: 8166
Min length of the abstractText: 1
Avg length of the abstractText: 1645.861842268939
Max words abstractText: 1252
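
The statistics above mix two measures: the max and min count every character, while the average excludes spaces. A small sketch on toy strings makes the difference visible. The `abstracts` Series is hypothetical and stands in for `projects.abstractText`.

```python
import pandas as pd

# Toy abstracts (hypothetical), standing in for projects.abstractText.
abstracts = pd.Series(["deep learning for text", "a b c"])

# Every character, spaces included.
total_chars = abstracts.str.len()
# Same count with the space characters removed.
non_space_chars = total_chars - abstracts.str.count(" ")
# Number of whitespace-separated tokens per abstract.
word_counts = abstracts.str.split().str.len()

print(total_chars.tolist())
print(non_space_chars.tolist())
print(word_counts.tolist())
```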


| | abstractText | date |
|---|---|---|
| 295988 | Nondestructive Evaluation (NDE) is important t... | 1985-08-30 |
| 307099 | An Industry/University Cooperative Research Ce... | 1985-08-30 |
| 284877 | This research seeks to obtain a representation... | 1985-09-06 |
| 177779 | This is an attempt to develop a "solvated elec... | 1986-01-16 |
| 166668 | The New York State College of Ceramics at Alfr... | 1986-06-03 |


### Setup Training Set for Doc2Vec

In [5]:
X_raw = projects.abstractText.str.lower().values
num_of_docs = len(X_raw)
print('Number of abstracts: ', num_of_docs)

Number of abstracts:  321560


### Create generator that will tokenize the training abstracts on the fly to save on memory

In [0]:
def doc_generator(input_docs_array):
    for i, doc in enumerate(input_docs_array):
        tokens = gensim.utils.simple_preprocess(doc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [0]:
X = doc_generator(X_raw)
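
A generator object such as `X` can be consumed only once, so any step that needs several passes over the corpus requires a fresh generator each time. The sketch below shows one way around that in plain Python: a class whose `__iter__` starts a new pass on every call. `TaggedCorpus` and `simple_tokenize` are hypothetical names introduced for illustration; in this notebook, `gensim.utils.simple_preprocess` and `TaggedDocument` would take the place of the stand-in tokenizer and the plain tuples.

```python
import re
from typing import Callable, Iterator, List, Sequence, Tuple


def simple_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class TaggedCorpus:
    """Restartable iterable: each call to iter() starts a fresh pass."""

    def __init__(self, docs: Sequence[str],
                 tokenize: Callable[[str], List[str]] = simple_tokenize):
        self.docs = docs
        self.tokenize = tokenize

    def __iter__(self) -> Iterator[Tuple[List[str], List[int]]]:
        # Tokenize lazily so only one document's tokens are held at a time.
        for i, doc in enumerate(self.docs):
            yield self.tokenize(doc), [i]


docs = ["Nondestructive evaluation is important.",
        "This research seeks a representation."]
corpus = TaggedCorpus(docs)
first_pass = list(corpus)
second_pass = list(corpus)   # same content again

# A one-shot generator, by contrast, yields nothing on a second pass.
one_shot = (simple_tokenize(d) for d in docs)
consumed = list(one_shot)
leftover = list(one_shot)
```

The memory benefit is the same as with the generator function, since tokens are still produced one document at a time.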

### Create Initial Doc2Vec Model and Training

#### Skip to the next section to load a pretrained model. Training can take a long time.

In [0]:
# %%time
# import multiprocessing
# from pprint import pprint
# cores = multiprocessing.cpu_count()

# models = [
#     # PV-DBOW 
#     Doc2Vec(dm=0, dbow_words=1, size=300, window=8, min_count=1, iter=100, workers=cores),
#     # PV-DM w/average
#     Doc2Vec(dm=1, dm_mean=1, size=300, window=8, min_count=1, iter =100, workers=cores),
# ]

# models[0].build_vocab(doc_generator(X_raw))
# print(str(models[0]))
# models[1].reset_from(models[0])
# print(str(models[1]))

In [0]:
# %%time 
# for i, model in enumerate(models):
#     model.train(doc_generator(X_raw), total_examples=len(X_raw), epochs=model.iter)
#     model.save(str(i)+'New_nsf_doc2vec.model')

In [0]:
# model.docvecs.vectors_docs.shape

In [0]:
# model.train(doc_generator(X_raw), total_examples=num_of_docs, epochs=1)
# model.save('nsf_doc2vec.model')

### Load Existing Model

In [8]:
existing_model = 'nsf_d2v_100.model' # Name of Existing model to load
model = gensim.models.Doc2Vec.load(str(existing_model)) # Model is assumed to be in the shared folder



### Testing Doc2Vec Model

In [13]:
text = """This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards, document vectors were used to cluster similar abstracts using SAS® Studio KMeans Clustering Module. For visualization, the abstract embeddings were reduced to two dimensions using Principal Component Analysis (PCA) within SAS® Studio. This was then compared to a t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction technique as part of the Scikit-learn machine learning toolkit for Python.Conclusively, NSF proposal abstract text analysis can help an awardee read and improve their proposal model by identifying similar proposal abstracts from the last 24 years. It could also help NSF evaluators identify similar existing proposals that indirectly provides insights on whether a new proposal is going to be fruitful or not."""
print(text)

This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards, d

In [0]:
test_vector = model.infer_vector(gensim.utils.simple_preprocess(text))

In [11]:
test_vector

array([-1.40085745e+00, -1.14149630e+00,  9.86760318e-01, -1.62086606e+00,
        1.42755830e+00, -1.04308939e+00,  7.57398307e-01,  7.10320592e-01,
        4.35654730e-01, -6.89626532e-03,  2.37417603e+00,  3.77059531e+00,
        2.33537984e+00,  1.54080391e-01, -5.34181595e-01, -1.64020312e+00,
       -5.56470789e-02, -1.47133923e+00, -7.56075680e-02, -8.01131248e-01,
       -3.72663260e+00, -1.56961095e+00,  1.13951278e+00,  3.87071043e-01,
        9.35525373e-02,  1.49220622e+00, -4.79168594e-01, -1.82896674e+00,
        2.15608788e+00,  7.90614605e-01,  6.88332319e-01, -2.58386111e+00,
        2.04685712e+00,  7.56009459e-01, -7.70424485e-01,  1.54531622e+00,
       -9.00842607e-01, -2.23787022e+00,  1.88065755e+00,  1.21908095e-02,
        3.37004691e-01, -1.12742722e+00, -3.59022069e+00,  2.73497081e+00,
       -9.36699629e-01,  1.98303711e+00,  5.09007454e-01, -2.09456515e+00,
       -1.69952250e+00, -2.63290793e-01, -4.67105448e-01,  7.25143254e-01,
       -4.64004904e-01, -

In [12]:
model.docvecs.most_similar(
    positive=[test_vector], 
    topn=3)



[(289674, 0.3379325568675995),
 (258922, 0.3322885036468506),
 (305430, 0.32162389159202576)]
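
The scores in the result pairs read as cosine similarities between the inferred vector and the stored document vectors. A minimal NumPy sketch of that kind of top-n lookup follows. The function `top_n_cosine` and the tiny 2-D vectors are hypothetical and only illustrate the idea; this is not gensim's implementation.

```python
import numpy as np


def top_n_cosine(query, doc_vectors, topn=3):
    """Return (row index, cosine similarity) pairs for the topn closest rows."""
    # Normalize rows and the query to unit length so a dot product is the cosine.
    doc_norm = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    sims = doc_norm @ query_norm
    # Sort by descending similarity and keep the first topn indices.
    best = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in best]


# Hypothetical 2-D vectors for illustration only.
doc_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.0])
results = top_n_cosine(query, doc_vectors, topn=2)
print(results)
```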

In [0]:
# Save the closest abstract; the context manager closes the file automatically
with open('top1.txt', 'w', encoding="utf-8") as file:
    file.write(X_raw[289674])

In [0]:
# Save the second-closest abstract
with open('top2.txt', 'w', encoding="utf-8") as file:
    file.write(X_raw[258922])

In [0]:
# Save the third-closest abstract (index 305430, per the most_similar output above)
with open('top3.txt', 'w', encoding="utf-8") as file:
    file.write(X_raw[305430])
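
The three cells above repeat one pattern, so the files could also be written in a loop that takes each index directly from the ranked result instead of copying it by hand. The sketch below assumes a hypothetical helper `write_top_abstracts` and toy data standing in for `X_raw` and the `most_similar` output.

```python
import os
import tempfile


def write_top_abstracts(docs, ranked, out_dir="."):
    """Write the abstract for each (index, score) pair to top1.txt, top2.txt, ..."""
    paths = []
    for rank, (doc_index, _score) in enumerate(ranked, start=1):
        path = os.path.join(out_dir, f"top{rank}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(docs[doc_index])
        paths.append(path)
    return paths


# Hypothetical data standing in for X_raw and the most_similar output.
docs = ["abstract zero", "abstract one", "abstract two"]
ranked = [(2, 0.9), (0, 0.8)]
out_dir = tempfile.mkdtemp()
paths = write_top_abstracts(docs, ranked, out_dir)
print(paths)
```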

In [0]:
### Text summarizer

### Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [1]:
import sagemaker as sage
from time import gmtime, strftime
from sagemaker import get_execution_role

sess = sage.Session()
role = get_execution_role()

## Create Model

Now we use the subscribed Model Package to create a SageMaker model.

In [2]:
# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'
model_package_arn = 'arn:aws:sagemaker:us-east-2:057799348421:model-package/marketplace-text-summarizer-11-d2490248e8de20f24ae3b72d0d74654c'
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()
model = ModelPackage(model_package_arn=model_package_arn,
                    role = role,
                    sagemaker_session = sagemaker_session)

## Input File

Now we define the S3 locations of the sample input files used to test the model.

In [3]:
top1_txt="s3://nsf/top1.txt"
top2_txt="s3://nsf/top2.txt"
top3_txt="s3://nsf/top3.txt"

## Batch Transform Job

Now let's use the model to run a batch transform (batch inference) job and verify that it works.

In [6]:
# %%time
# import json 
# import uuid
# transformer = model.transformer(2, 'ml.m5.xlarge')
# transformer.output_path = "s3://nsf/Summerized"
# transformer.transform(top1_txt, content_type='text/plain')
# transformer.wait()
# print("Batch Transform complete")

## Output from Batch Transform

Note: Ensure that the `boto3` package is installed on the local system.

In [9]:
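import boto3
# `transformer` comes from the Batch Transform cell above; uncomment and run that cell first
print(transformer.output_path)
# "s3://nsf/Summerized".split('/') gives ['s3:', '', 'nsf', 'Summerized'], so index 3 is the output folder
bucketFolder = transformer.output_path.split('/')[3]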
#print(s3bucket,s3prefix)
s3_conn = boto3.client("s3")

s3://nsf/Summerized
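
Indexing into the split path works for this exact URI, but it drops anything past the first folder level. As a hedged alternative, the sketch below parses an S3 URI with the standard library. `split_s3_uri` is a hypothetical helper, not part of boto3 or the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the key prefix is the path
    # without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://nsf/Summerized")
print(bucket, prefix)
```

The returned bucket and prefix could then be passed to `download_fileobj` in place of the hard-coded bucket name.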


In [10]:
bucket_name="nsf"
with open('result.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,bucketFolder+'/top1.txt.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [11]:
with open('./result.txt', 'rb') as file_stream:
    output_text = file_stream.read().decode('utf-8')

#### Original Text Input

In [14]:
print(text)

This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards, d

#### Similar abstract

In [15]:
bucket_name="nsf"
with open('top1.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,'top1.txt', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [16]:
with open('./top1.txt', 'rb') as file_stream:
    input_text = file_stream.read().decode('utf-8')

In [17]:
print(input_text)

this reu site award to auburn university, located in auburn, alabama, will support the training of 10 students for 10 weeks during each summer from 2018-2020. this program is focused on warm-water aquatic ecology and will use the vast biological diversity and many aquatic ecosystems found in the southeastern u.s. as well as the available infrastructure in participating labs to broadly train reu students in core elements of interdisciplinary research, namely establishing hypotheses, experimental design, data interpretation, professional network development, and scientific ethics. reu students will participate in a variety of activities, such as a two-day aquatic ecology course, mentor research seminars, regular mentor-mentee and reu student meetings, student-driven research projects, professional development activities, and field trips.

it is anticipated that a total of 30 students, primarily freshman and sophomores from schools with limited research opportunities including community c

#### Summary of similar abstract

In [18]:
print(output_text)

This reu site award to auburn university , located in auburn , alabama , will support the training of 10 students for 10 weeks during each summer from 2018-2020. This program is focused on warm-water aquatic ecology and will use the vast biological diversity and many aquatic ecosystems found in the southeastern u. As well as the available infrastructure in participating labs to broadly train reu students in core elements of interdisciplinary research , namely establishing hypotheses , experimental design , data interpretation , professional network development , and scientific ethics. It is anticipated that a total of 30 students , primarily freshman and sophomores from schools with limited research opportunities including community colleges , will be trained in the program. Reu students will participate in a variety of activities , such as a two-day aquatic ecology course , mentor research seminars , regular mentor-mentee and reu student meetings , student-driven research projects , p