# Natural Language Processing using Doc2Vec on National Science Foundation Awards Abstracts
---
### Team:  
Jacob Noble  
Himanshu Gamit  
Shantanu Hadap


### Imports

scikit-learn's StandardScaler should be used for feature scaling.

In [1]:
!pip install gensim



In [2]:
import gensim
from gensim.models import Doc2Vec
import pandas as pd
from sklearn.preprocessing import StandardScaler # Replace with SAS version



In [3]:
!wget https://nsfdata.s3.amazonaws.com/nsfdataset.zip
!unzip -o nsfdataset.zip -d data

--2020-04-16 16:01:19--  https://nsfdata.s3.amazonaws.com/nsfdataset.zip
Resolving nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)... 52.216.108.51
Connecting to nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)|52.216.108.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358656602 (342M) [application/zip]
Saving to: ‘nsfdataset.zip.2’


2020-04-16 16:01:28 (38.6 MB/s) - ‘nsfdataset.zip.2’ saved [358656602/358656602]

Archive:  nsfdataset.zip


In [4]:
#import saspy
import pandas as pd
import numpy as np


selected_cols = ['abstractText','date']
projects = pd.read_csv("data/nsf_proposals.csv", usecols = selected_cols, low_memory=False) #, nrows=30000
print ("Total Projects: ", projects.shape[0], "\nTotal Features: ", projects.shape[1])
projects.date = pd.to_datetime(projects.date.str.replace('D', 'T'))
projects = projects.sort_values('date')

Total Projects:  10000 
Total Features:  2


### Load in Data using Pandas

In [5]:
# Drop values without Abstract texts
projects = projects.dropna(how='any')

print("Max length of the abstractText:", projects.abstractText.str.len().max())
print("Min length of the abstractText:", projects.abstractText.str.len().min())
print("Avg length of the abstractText:", projects.abstractText.apply(lambda x: len(x) - x.count(" ")).mean())

words = projects.abstractText.str.split().apply(len)
print("Max words abstractText:", words.max())
projects.head()

Max length of the abstractText: 7977
Min length of the abstractText: 5
Avg length of the abstractText: 1468.5334029227558
Max words abstractText: 1169


| | abstractText | date |
|---|---|---|
| 2 | A research and training center for quantitativ... | 1988-08-08 |
| 8891 | In order to understand the impact of energetic... | 1988-11-25 |
| 7780 | Studies are proposed of a single electron, iso... | 1988-11-29 |
| 5558 | Texas A&M University and the University of Tex... | 1988-11-30 |
| 6669 | Major objectives of this study are to determin... | 1988-11-30 |
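
The length figures printed above measure different things: the max and min count every character, while the average counts only non-space characters. A minimal pure-Python sketch of the same measures follows; the `length_stats` helper and the two sample strings are illustrative assumptions, not part of the notebook or the dataset.

```python
def length_stats(texts):
    """Return character and word statistics for a list of strings."""
    char_lengths = [len(t) for t in texts]
    # Same measure as the lambda in the cell above: characters excluding spaces.
    non_space_lengths = [len(t) - t.count(" ") for t in texts]
    word_counts = [len(t.split()) for t in texts]
    return {
        "max_chars": max(char_lengths),
        "min_chars": min(char_lengths),
        "avg_non_space_chars": sum(non_space_lengths) / len(texts),
        "max_words": max(word_counts),
    }

# Illustrative inputs only (not taken from the NSF data).
sample = ["model checking is automatic", "short text"]
stats = length_stats(sample)
print(stats)
```

Reporting the average under a label such as "non-space characters" would make it harder to confuse with the max and min lengths.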


### Setup Training Set for Doc2Vec

In [6]:
X_raw = projects.abstractText.str.lower().values
num_of_docs = len(X_raw)
print('Number of abstracts: ', num_of_docs)

Number of abstracts:  9580


### Create a generator that tokenizes the training abstracts on the fly to save memory

In [7]:
def doc_generator(input_docs_array):
    for i, doc in enumerate(input_docs_array):
        tokens = gensim.utils.simple_preprocess(doc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [8]:
X = doc_generator(X_raw)
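
One caveat with this pattern: a Python generator can be consumed only once, while multi-epoch training needs to iterate over the corpus repeatedly. A common workaround is a small class whose `__iter__` starts a fresh pass each time. The sketch below is an assumption-laden illustration: `simple_tokenize` is a hypothetical stand-in for `gensim.utils.simple_preprocess`, `TaggedCorpus` is a hypothetical name, and the class yields plain `(tokens, tags)` pairs instead of `TaggedDocument` so that it runs without gensim.

```python
import re

def simple_tokenize(doc, min_len=2, max_len=15):
    """Hypothetical stand-in tokenizer: lowercase alphabetic tokens of bounded length."""
    tokens = re.findall(r"[^\W\d_]+", doc.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

class TaggedCorpus:
    """Re-iterable corpus: every call to __iter__ starts a new pass over the documents."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for i, doc in enumerate(self.docs):
            # Tokenize lazily so only one document's tokens are held at a time.
            yield simple_tokenize(doc), [i]

# Illustrative documents only.
docs = [
    "Model checking is an automatic verification technique.",
    "A research and training center.",
]

corpus = TaggedCorpus(docs)
first_pass = list(corpus)
second_pass = list(corpus)   # a second pass still yields every document

gen = (simple_tokenize(d) for d in docs)
list(gen)                    # the first pass consumes the generator
exhausted = list(gen)        # a second pass over the same generator is empty
```

With gensim, the same `__iter__` could yield `TaggedDocument(tokens, [i])`, and a single instance could then be passed to both `build_vocab` and `train`.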

### Create Initial Doc2Vec Model and Training

#### Skip to the next section to load a pretrained model. Training can take a long time.

We replicate the approach of Document Embedding with Paragraph Vectors (http://arxiv.org/abs/1507.07998) on the NSF abstracts. The paper reports only PV-DBOW results, so the model list below also includes a PV-DM configuration, which is commented out in this run.


In [9]:
%%time
import multiprocessing
from pprint import pprint
cores = multiprocessing.cpu_count()

# Train for 100 epochs in a 300-dimensional vector space
models = [
    # PV-DBOW, with word vectors trained alongside (dbow_words=1)
    Doc2Vec(dm=0, dbow_words=1, vector_size=300, window=8, min_count=1, epochs=100, workers=cores),
    # PV-DM w/average
    #Doc2Vec(dm=1, dm_mean=1, vector_size=300, window=8, min_count=1, epochs=100, workers=cores),
]

models[0].build_vocab(doc_generator(X_raw))
print(str(models[0]))
#models[1].reset_from(models[0])
#print(str(models[1]))



Doc2Vec(dbow+w,d300,n5,w8,s0.001,t2)
CPU times: user 4.99 s, sys: 41 ms, total: 5.03 s
Wall time: 5.03 s


In [10]:
%%time 
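# A generator is exhausted after a single pass, so materialize the tagged
# corpus once; otherwise only the first epoch would see any documents.
train_corpus = list(doc_generator(X_raw))
for i, model in enumerate(models):
    model.train(train_corpus, total_examples=len(train_corpus), epochs=model.epochs)
    model.save(str(i)+'nsf_doc2v.model')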



CPU times: user 43.9 s, sys: 326 ms, total: 44.2 s
Wall time: 23.8 s


In [11]:
print("Doc2Vec Embedding Shape:",model.docvecs.vectors_docs.shape)

Doc2Vec Embedding Shape: (9580, 300)


In [12]:
# model.train(doc_generator(X_raw), total_examples=num_of_docs, epochs=1)
# model.save('nsf_doc2vec.model')

### Load Existing Model

In [13]:
existing_model = '0nsf_doc2v.model' #'nsf_d2v_100.model' # Name of Existing model to load
model = gensim.models.Doc2Vec.load(str(existing_model)) # Model is assumed to be in the shared folder

### Testing Doc2Vec Model

1. Infer Embedding Vector for a New Abstract
2. Generate the Most Similar Articles from the document vectors, using the cosine similarity score.
3. Store the most similar files in an S3 bucket.


#### 1. Infer Embedding Vector for a New Abstract

First, we infer a paragraph vector for the new text so that its cosine similarity to the training abstracts can be computed. Doc2Vec stores word vectors and document vectors separately; the document vectors are accessed through the model's .docvecs attribute.

In [14]:
!wget https://nsf-bucket.s3.us-east-2.amazonaws.com/input.txt

--2020-04-16 16:01:58--  https://nsf-bucket.s3.us-east-2.amazonaws.com/input.txt
Resolving nsf-bucket.s3.us-east-2.amazonaws.com (nsf-bucket.s3.us-east-2.amazonaws.com)... 52.219.84.208
Connecting to nsf-bucket.s3.us-east-2.amazonaws.com (nsf-bucket.s3.us-east-2.amazonaws.com)|52.219.84.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1758 (1.7K) [text/plain]
Saving to: ‘input.txt.2’


2020-04-16 16:01:58 (106 MB/s) - ‘input.txt.2’ saved [1758/1758]



In [15]:
with open('./input.txt', 'rb') as file_stream:
    text = file_stream.read().decode('utf-8')

In [16]:
#text = """This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards, document vectors were used to cluster similar abstracts using SAS® Studio KMeans Clustering Module. For visualization, the abstract embeddings were reduced to two dimensions using Principal Component Analysis (PCA) within SAS® Studio. This was then compared to a t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction technique as part of the Scikit-learn machine learning toolkit for Python.Conclusively, NSF proposal abstract text analysis can help an awardee read and improve their proposal model by identifying similar proposal abstracts from the last 24 years. It could also help NSF evaluators identify similar existing proposals that indirectly provides insights on whether a new proposal is going to be fruitful or not."""
print(text)

This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. 
To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards,

In [17]:
test_vector = model.infer_vector(gensim.utils.simple_preprocess(text))

In [18]:
test_vector

array([-0.48014164,  0.2292498 , -0.12621427, -0.7242522 , -0.20420226,
        0.39799866, -0.09107541, -0.3656355 , -0.6984015 ,  0.6077379 ,
       -0.7734449 ,  0.06929234, -0.27689552,  0.02980844,  0.45976412,
       -0.54307365, -0.4842458 ,  0.1078022 , -0.3171443 , -0.10234564,
        0.07606005, -0.24821833,  0.04033251, -0.33498216, -0.60328996,
       -0.6001    , -0.29221055,  0.5743363 ,  0.1012374 , -0.4831736 ,
        0.4124102 ,  0.00392349, -0.13052528,  0.18914907, -0.3898968 ,
       -0.21136637,  0.40480727, -0.39430317,  0.16069913, -0.15890579,
        0.05071688,  0.11542416,  0.12898336,  0.44509062,  0.49148583,
        0.18427867, -0.21677431, -0.43187273,  0.3376419 ,  0.32490942,
       -0.110402  , -0.41751203,  0.42074555,  0.07995483, -0.10702007,
        0.06365649,  0.23538487, -0.17533523, -0.7302663 , -0.14668289,
        0.1569598 ,  0.08411324, -0.21061108, -0.0579369 , -0.07535064,
       -0.0165423 , -0.5922711 , -0.27911848, -0.24422689, -0.73

#### 2. Generating Most Similar Articles from the document vectors. This uses the cosine similarity score.

In [19]:
similar = model.docvecs.most_similar(
    positive=[test_vector], 
    topn=3)
print(similar)

[(6880, 0.3747926950454712), (4998, 0.3733541667461395), (8597, 0.3627268075942993)]
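
Each pair in the result above is a document tag and its cosine similarity to the inferred vector. As a rough illustration of how such a ranking can be computed, the sketch below scores a query against a handful of made-up 3-dimensional vectors. The `cosine_similarity` and `top_n` helpers and the toy vectors are assumptions for illustration, not gensim's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query, doc_vectors, n=3):
    """Return the n (tag, score) pairs with the highest cosine similarity."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy document vectors; real Doc2Vec vectors here have 300 dimensions.
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
]
query = [1.0, 0.0, 0.0]
ranking = top_n(query, doc_vectors, n=3)
print(ranking)
```

Cosine similarity ranges from -1 to 1, with 1 meaning the two vectors point in the same direction.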


#### 3. Storing the most similar files in an S3 bucket.

In [20]:
# file = open('top1.txt', 'w',encoding="utf-8")
# file.write(X_raw[similar[0][0]])
# file.close()

In [21]:
# file = open('top2.txt', 'w',encoding="utf-8")
# file.write(X_raw[similar[1][0]])
# file.close()

In [22]:
# file = open('top3.txt', 'w',encoding="utf-8")
# file.write(X_raw[similar[2][0]])
# file.close()

In [23]:
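import boto3
s3_conn = boto3.client("s3")
s3 = boto3.resource('s3')
# Upload the most similar abstract as top1.txt (avoids shadowing the built-in name `object`)
top1_object = s3.Object('nsf-bucket', 'top1.txt')
top1_object.put(Body=X_raw[similar[0][0]])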

{'ResponseMetadata': {'RequestId': 'E605CDCBEB09B713',
  'HostId': 'F7O1HFG+UEOxNBr0IZioDhjnGaNIFRGmysaX3xexB6jVDEYr5aK+/xa9DqXE9Hg00/hcnR9XPNI=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'F7O1HFG+UEOxNBr0IZioDhjnGaNIFRGmysaX3xexB6jVDEYr5aK+/xa9DqXE9Hg00/hcnR9XPNI=',
   'x-amz-request-id': 'E605CDCBEB09B713',
   'date': 'Thu, 16 Apr 2020 16:02:00 GMT',
   'etag': '"408662cfbd020c93f25d8b75015cda76"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"408662cfbd020c93f25d8b75015cda76"'}
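
The response dictionary above carries the HTTP status of the upload. A small check like the following could guard the later steps against a failed upload. Note that `upload_succeeded` is a hypothetical helper, and the sample dictionary is trimmed from the response shown above; boto3 normally raises an exception on a failed request, so this is only an extra safeguard.

```python
def upload_succeeded(response):
    """Return True if an S3 put response reports HTTP status 200."""
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    return status == 200

# Trimmed copy of the response printed above.
sample_response = {
    "ResponseMetadata": {"HTTPStatusCode": 200, "RetryAttempts": 0},
    "ETag": '"408662cfbd020c93f25d8b75015cda76"',
}

ok = upload_succeeded(sample_response)
failed = upload_succeeded({"ResponseMetadata": {"HTTPStatusCode": 500}})
print(ok, failed)
```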

### Text Summarizer

### Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [24]:
import sagemaker as sage
from time import gmtime, strftime
from sagemaker import get_execution_role

sess = sage.Session()
role = get_execution_role()

## Create Model

Now we use the Model Package to create a model

In [25]:
# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'
model_package_arn = 'arn:aws:sagemaker:us-east-2:057799348421:model-package/marketplace-text-summarizer-11-d2490248e8de20f24ae3b72d0d74654c'
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()
model = ModelPackage(model_package_arn=model_package_arn,
                    role = role,
                    sagemaker_session = sagemaker_session)

## Input File

Now we point to the input file in S3: the most similar abstract uploaded above.

In [26]:
top1_txt="s3://nsf-bucket/top1.txt"

## Batch Transform Job

Now let's use the model built to run a batch inference job and verify it works.

In [27]:
# Run the batch transform on two ml.m5.xlarge instances and write the results to S3
transformer = model.transformer(instance_count=2, instance_type='ml.m5.xlarge',
                                output_path="s3://nsf-bucket/summerized-abstract")
transformer.transform(top1_txt, content_type='text/plain')
transformer.wait()
print("Batch Transform complete")

.........................2020-04-16T16:05:58.054:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
169.254.255.130 - - [16/Apr/2020 16:05:58] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [16/Apr/2020 16:05:58] "GET /execution-parameters HTTP/1.1" 404 -
2020-04-16T16:06:00.752:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD


Batch Transform complete


## Output from Batch Transform

Note: Ensure that the boto3 package is installed on the local system.

In [28]:
import boto3
print(transformer.output_path)
# Everything after "s3://<bucket>/" is the output prefix (folder) inside the bucket
bucketFolder = transformer.output_path.split('/', 3)[3]
s3_conn = boto3.client("s3")

s3://nsf-bucket/summerized-abstract
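
Indexing into the split URI works for this path but depends on its exact shape. As a hedged alternative, the standard library's `urlparse` can separate the bucket from the key prefix; `split_s3_uri` below is a hypothetical helper, not part of boto3 or the SageMaker SDK.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path keeps its leading slash, which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")

bucket, prefix = split_s3_uri("s3://nsf-bucket/summerized-abstract")
# Key of the batch transform output for top1.txt, as downloaded in the next cell.
output_key = prefix + "/top1.txt.out"
print(bucket, output_key)
```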


In [29]:
bucket_name="nsf-bucket"
with open('result.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,bucketFolder+'/top1.txt.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [30]:
with open('./result.txt', 'rb') as file_stream:
    output_text = file_stream.read().decode('utf-8')

#### Original Text Input

In [31]:
print(text)

This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. 
To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards,

#### Similar abstract

In [32]:
bucket_name="nsf-bucket"
with open('top1.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,'top1.txt', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [33]:
with open('./top1.txt', 'rb') as file_stream:
    top1_text = file_stream.read().decode('utf-8')

In [34]:
print(top1_text)

ccr-0098072
edmund clarke
cmu

abstract:

model checking is an automatic verification technique for concurrent systems such as sequential circuit design and communication protocols in which temporal logic specifications are checked by an exhaustive search of the state space of the concurrent system. considerable progress has been made in the last two decades, and many major companies are now using model checking. to extend the potential of the method, this project pursues several avenues of research that will enable larger hardware systems and certain software systems to be verified.

model checking and theorem proving:  theorem proving avoids the state explosion problem, but relies heavily on human guidance and tends to get unmanageable for large designs.  the project attempts to find and implement a practical methodology that will combine the benefits of theorem proving and model checking and apply it to verification of processors and security protocols.

sat-based model checking: de

#### Summary of similar abstract

In [35]:
print(output_text)

Model checking and theorem proving : theorem proving avoids the state explosion problem , but relies heavily on human guidance and tends to get unmanageable for large designs. Considerable progress has been made in the last two decades , and many major companies are now using model checking. To extend the potential of the method , this project pursues several avenues of research that will enable larger hardware systems and certain software systems to be verified.
Execution time : 1.55seconds
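
The summary above appears to be extractive: its sentences match sentences of the similar abstract apart from capitalization and spacing around punctuation. A hedged sketch of how that could be checked follows; the `normalize` and `is_extractive` helpers are hypothetical, and the two matching strings are copied from the outputs above.

```python
import re

def normalize(text):
    """Lowercase and keep only letters and digits, ignoring spacing and punctuation."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())

def is_extractive(sentence, source):
    """Return True if the sentence occurs in the source after normalization."""
    return normalize(sentence) in normalize(source)

# Sentence from the similar abstract (lowercased during preprocessing).
source = ("considerable progress has been made in the last two decades, "
          "and many major companies are now using model checking.")
# The corresponding sentence from the summarizer output.
summary_sentence = ("Considerable progress has been made in the last two decades , "
                    "and many major companies are now using model checking.")

copied = is_extractive(summary_sentence, source)
# A made-up sentence that does not occur in the source.
invented = is_extractive("Model checking was invented last year.", source)
print(copied, invented)
```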

