# Natural Language Processing using Doc2Vec on National Science Foundation Awards Abstracts
---
### Team:  
Jacob Noble  
Himanshu Gamit  
Shantanu Hadap


## 1. Pre-requisites: SageMaker Jupyter Instance (ml.m5.xlarge) and MphasisDeepInsightsTextSummarizer Product ARN

1. Make sure you are in the North Virginia region.
2. Go to https://aws.amazon.com/marketplace/pp/prodview-uzkcdmjuagetk.
3. Subscribe and accept the offer.
4. Wait for a confirmation email, then choose Continue to Configuration.
5. Set the launch method to SageMaker Console.
6. Go to Configure for Amazon SageMaker console.
7. Copy the product ARN and keep it for later use.


#### Make sure to update the product ARN as highlighted below

#### MphasisDeepInsightsTextSummarizer Product ARN

## 2. FOR CUSTOM USER ABSTRACT INPUT: update the `text` variable in the section highlighted below

### Imports


In [1]:
!pip install gensim

You should consider upgrading via the 'pip install --upgrade pip' command.


In [2]:
import gensim
from gensim.models import Doc2Vec
import pandas as pd
from sklearn.preprocessing import StandardScaler # Replace with SAS version

unable to import 'smart_open.gcs', disabling that module


In [3]:
#!rm -r data
!wget https://nsfdata.s3.amazonaws.com/nsfdataset.zip -O nsfdataset.zip
!unzip -o nsfdataset.zip -d data

--2020-04-17 03:58:38--  https://nsfdata.s3.amazonaws.com/nsfdataset.zip
Resolving nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)... 52.216.251.188
Connecting to nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)|52.216.251.188|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358656602 (342M) [application/zip]
Saving to: ‘nsfdataset.zip’


2020-04-17 03:58:42 (82.9 MB/s) - ‘nsfdataset.zip’ saved [358656602/358656602]

Archive:  nsfdataset.zip
  inflating: data/nsf_proposals.csv  


In [4]:
!ls data

nsf_proposals.csv


In [5]:
#import saspy
import pandas as pd
import numpy as np


selected_cols = ['abstractText','date']
projects = pd.read_csv("data/nsf_proposals.csv", usecols = selected_cols, low_memory=False) #, nrows=30000
print ("Total Projects: ", projects.shape[0], "\nTotal Features: ", projects.shape[1])
projects.date = pd.to_datetime(projects.date.str.replace('D', 'T'))
projects = projects.sort_values('date')

Total Projects:  329321 
Total Features:  2


### Load in Data using Pandas

In [6]:
# Drop values without Abstract texts
projects = projects.dropna(how='any')

print("Max length of the abstractText:", projects.abstractText.str.len().max())
print("Min length of the abstractText:", projects.abstractText.str.len().min())
print("Avg length of the abstractText:", projects.abstractText.apply(lambda x: len(x) - x.count(" ")).mean())

words = projects.abstractText.str.split().apply(len)
print("Max words abstractText:", words.max())
projects.head()

Max length of the abstractText: 8166
Min length of the abstractText: 1
Avg length of the abstractText: 1645.861842268939
Max words abstractText: 1252


| | abstractText | date |
|---|---|---|
| 295988 | Nondestructive Evaluation (NDE) is important t... | 1985-08-30 |
| 307099 | An Industry/University Cooperative Research Ce... | 1985-08-30 |
| 284877 | This research seeks to obtain a representation... | 1985-09-06 |
| 177779 | This is an attempt to develop a "solvated elec... | 1986-01-16 |
| 166668 | The New York State College of Ceramics at Alfr... | 1986-06-03 |


### Setup Training Set for Doc2Vec

In [7]:
X_raw = projects.abstractText.str.lower().values
num_of_docs = len(X_raw)
print('Number of abstracts: ', num_of_docs)

Number of abstracts:  321560


#### Download Model Files

In [39]:
!wget https://nsfdata.s3.amazonaws.com/0nsf_doc2v.model -O 0nsf_doc2v.model
!wget https://nsfdata.s3.amazonaws.com/0nsf_doc2v.model.docvecs.vectors_docs.npy -O 0nsf_doc2v.model.docvecs.vectors_docs.npy
!wget https://nsfdata.s3.amazonaws.com/0nsf_doc2v.model.trainables.syn1neg.npy -O 0nsf_doc2v.model.trainables.syn1neg.npy
!wget https://nsfdata.s3.amazonaws.com/0nsf_doc2v.model.wv.vectors.npy -O 0nsf_doc2v.model.wv.vectors.npy

--2020-04-17 04:21:05--  https://nsfdata.s3.amazonaws.com/0nsf_doc2v.model
Resolving nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)... 52.216.85.227
Connecting to nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)|52.216.85.227|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19628559 (19M) [application/x-www-form-urlencoded]
Saving to: ‘0nsf_doc2v.model’


2020-04-17 04:21:06 (46.8 MB/s) - ‘0nsf_doc2v.model’ saved [19628559/19628559]

--2020-04-17 04:21:06--  https://nsfdata.s3.amazonaws.com/0nsf_doc2v.model.docvecs.vectors_docs.npy
Resolving nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)... 52.216.85.227
Connecting to nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)|52.216.85.227|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 385872128 (368M) [application/x-www-form-urlencoded]
Saving to: ‘0nsf_doc2v.model.docvecs.vectors_docs.npy’


2020-04-17 04:21:10 (92.9 MB/s) - ‘0nsf_doc2v.model.docvecs.vectors_docs.npy’ saved [3

### Create generator that will tokenize the training abstracts on the fly to save on memory

In [9]:
def doc_generator(input_docs_array):
    for i, doc in enumerate(input_docs_array):
        tokens = gensim.utils.simple_preprocess(doc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [10]:
X = doc_generator(X_raw)
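
One caveat with the generator above: a Python generator can be consumed only once, while training generally needs to read the corpus more than once (once to build the vocabulary and again during training), which is why the later cells call `doc_generator(X_raw)` afresh each time. A minimal sketch of a re-iterable alternative follows. It uses only the standard library: `simple_tokenize` is a hypothetical, simplified stand-in for `gensim.utils.simple_preprocess`, and plain `(tokens, [tag])` tuples stand in for `TaggedDocument`.

```python
import re


def simple_tokenize(text):
    # Hypothetical stand-in for gensim.utils.simple_preprocess:
    # lowercase the text and keep alphabetic runs of 2 to 15 characters.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if 2 <= len(t) <= 15]


class TaggedCorpus:
    """Re-iterable stream of (tokens, [tag]) pairs, tokenized lazily on each pass."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        # A fresh generator is created on every call to iter(), so the
        # corpus can be traversed as many times as needed.
        for i, doc in enumerate(self.docs):
            yield simple_tokenize(doc), [i]


docs = [
    "Nondestructive Evaluation (NDE) is important",
    "This research seeks to obtain a representation",
]
corpus = TaggedCorpus(docs)
first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would yield nothing on a second pass
```

Only one tokenized document is held in memory at a time, so the memory benefit of the generator is kept.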

### Create Initial Doc2Vec Model and Training

#### Skip to the next section to load a pretrained model. Training can take a long time.

We replicate the approach of Document Embedding with Paragraph Vectors (http://arxiv.org/abs/1507.07998) on the NSF data. The paper reports only DBOW results, so we set up this experiment with both DBOW and DM models.


In [11]:
%%time
import multiprocessing
from pprint import pprint
cores = multiprocessing.cpu_count()

#It is training 100 epochs in 300 dimension vector space
#models = [
    # PV-DBOW 
#    Doc2Vec(dm=0, dbow_words=1, size=300, window=8, min_count=1, iter=100, workers=cores),
    # PV-DM w/average
    #Doc2Vec(dm=1, dm_mean=1, size=300, window=8, min_count=1, iter =100, workers=cores),
#]

#models[0].build_vocab(doc_generator(X_raw))
#print(str(models[0]))
#models[1].reset_from(models[0])
#print(str(models[1]))

CPU times: user 34 µs, sys: 5 µs, total: 39 µs
Wall time: 43.9 µs


In [12]:
%%time 
#for i, model in enumerate(models):
#    model.train(doc_generator(X_raw), total_examples=len(X_raw), epochs=model.iter)
#    model.save(str(i)+'nsf_doc2v.model')
    
#print("Doc2Vec Embedding Shape:",model.docvecs.vectors_docs.shape)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.72 µs


#### Other Method

In [13]:
%%time
#model = Doc2Vec(vector_size=300, window=8, min_count=1, workers=4)
#model.build_vocab(documents=doc_generator(X_raw))
#model.train(doc_generator(X_raw), total_examples=num_of_docs, epochs=100)
#model.save('nsf_doc2vec.model')

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 5.96 µs


In [14]:
# model.train(doc_generator(X_raw), total_examples=num_of_docs, epochs=1)
# model.save('nsf_doc2vec.model')

### Load Existing Model

In [40]:
existing_model = '0nsf_doc2v.model'  # Name of the existing model to load (alternative: 'nsf_d2v_100.model')
model = gensim.models.Doc2Vec.load(str(existing_model))  # Model is assumed to be in the shared folder

### Testing Doc2Vec Model

1. Infer an embedding vector for a new abstract.
2. Generate the most similar articles based on document cluster, using the cosine similarity score.
3. Store the most similar files in an S3 bucket.


#### 1. Infer Embedding Vector for a New Abstract

First, we infer a paragraph vector for the new text so that its cosine similarity to the training abstracts can be computed. Word vectors and document vectors are stored separately in the model; the document vectors are accessed by adding `.docvecs` after the model name.

In [16]:
!wget https://nsfdata.s3.amazonaws.com/input.txt

--2020-04-17 03:59:20--  https://nsfdata.s3.amazonaws.com/input.txt
Resolving nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)... 52.216.163.11
Connecting to nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)|52.216.163.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1758 (1.7K) [text/plain]
Saving to: ‘input.txt.3’


2020-04-17 03:59:20 (111 MB/s) - ‘input.txt.3’ saved [1758/1758]



In [17]:
with open('./input.txt', 'rb') as file_stream:
    text = file_stream.read().decode('utf-8')

# FOR CUSTOM USER ABSTRACT INPUT - Update below text variable value

In [18]:
#text = """This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards, document vectors were used to cluster similar abstracts using SAS® Studio KMeans Clustering Module. For visualization, the abstract embeddings were reduced to two dimensions using Principal Component Analysis (PCA) within SAS® Studio. This was then compared to a t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction technique as part of the Scikit-learn machine learning toolkit for Python.Conclusively, NSF proposal abstract text analysis can help an awardee read and improve their proposal model by identifying similar proposal abstracts from the last 24 years. It could also help NSF evaluators identify similar existing proposals that indirectly provides insights on whether a new proposal is going to be fruitful or not."""
print(text)

This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. 
To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards,

In [19]:
test_vector = model.infer_vector(gensim.utils.simple_preprocess(text))

In [20]:
test_vector

array([-7.25775585e-02, -5.57031870e-01,  6.69417799e-01,  4.46947575e-01,
       -1.70384920e+00, -6.58859491e-01, -6.52592927e-02,  6.26584530e-01,
       -1.43293905e+00,  9.31003392e-01, -1.62645832e-01,  1.30462810e-01,
       -1.31180871e+00,  4.55911577e-01,  1.12497248e-01, -7.34875724e-02,
        1.34455514e+00,  1.95934862e-01,  7.87647903e-01, -1.34009421e-01,
        5.23046136e-01,  5.31487703e-01,  5.20799458e-01, -2.56483734e-01,
        8.64075541e-01,  2.54420936e-01, -6.63086534e-01, -1.37219131e-01,
       -4.47274178e-01,  6.99588180e-01, -7.66095400e-01,  7.78913777e-03,
        3.42206925e-01, -1.46842778e-01,  4.58635688e-01, -5.77536345e-01,
       -5.92268586e-01, -9.92954195e-01,  5.41749537e-01, -6.72993541e-01,
        5.24993181e-01,  1.07623406e-01, -8.39666605e-01,  6.64346099e-01,
        6.74897075e-01, -2.28553519e-01,  1.52662590e-01, -1.46814391e-01,
       -3.94025832e-01,  7.21900463e-01,  9.37481344e-01, -4.35221583e-01,
        1.19576883e+00,  

#### 2. Generating Most Similar Articles based on document cluster. This uses the cosine similarity score.

In [41]:
similar = model.docvecs.most_similar(
    positive=[test_vector], 
    topn=3)
print(similar)

[(180443, 0.28728967905044556), (229192, 0.2774297595024109), (183100, 0.2724209427833557)]
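
Each `(index, score)` pair above holds a document index and its cosine similarity to the inferred vector. As a rough illustration of that ranking, the sketch below computes cosine similarity by brute force over a tiny set of made-up vectors. `cosine` and `top_n` are hypothetical helper names, and the numbers are toy values, not taken from the model.

```python
import math


def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n(query, doc_vectors, n=3):
    # Score every document against the query and keep the n best,
    # as (index, score) pairs sorted by descending score.
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy 2-dimensional "document vectors" (assumed values for illustration only)
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]
ranking = top_n(query, doc_vectors, n=3)
```

A score near 1 means the two vectors point in almost the same direction, so top scores of about 0.29 suggest a fairly loose match.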


In [42]:
X_raw[similar[0][0]]

'abstract\n\niis - 0836181   \nframework for a general multimedia workbench\nbyrd, donald, indiana university\n\nthis sger project will pursue the development of a general framework for exploring and editing time-dependent multimedia using a variety of coordinated visualizations and sonifications.  it enhances contemporary methods of content-based searching and retrieval of musical information from online databases and stored in a variety of formats ranging from encoded score files to digital audio.  the type of system proposed would for the first time make it possible for scholars to compare relationships among musical works and among versions of the same work in a new flexible ways. the underlying general problem addressed in the project of a framework supporting combinations of visualizations and sonifications  in a modular framework allows the results to apply to temporal phenomena generally across disciplinary domains.'

#### 3. Storing the most similar files to S3 bucket

In [23]:
# file = open('top1.txt', 'w',encoding="utf-8")
# file.write(X_raw[similar[0][0]])
# file.close()

In [24]:
# file = open('top2.txt', 'w',encoding="utf-8")
# file.write(X_raw[similar[1][0]])
# file.close()

In [25]:
# file = open('top3.txt', 'w',encoding="utf-8")
# file.write(X_raw[similar[2][0]])
# file.close()
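
The three commented-out cells above differ only in the rank. If local copies are wanted, a loop with a context manager covers all of them. This is a sketch: `save_top_abstracts` is a hypothetical helper, and the sample `abstracts` and `similar` values are placeholders shaped like `X_raw` and the `most_similar` output.

```python
import os
import tempfile


def save_top_abstracts(abstracts, similar, out_dir="."):
    # Write the abstract for each (index, score) pair to top1.txt, top2.txt, ...
    paths = []
    for rank, (doc_index, _score) in enumerate(similar, start=1):
        path = os.path.join(out_dir, f"top{rank}.txt")
        # The context manager closes the file even if the write fails.
        with open(path, "w", encoding="utf-8") as f:
            f.write(abstracts[doc_index])
        paths.append(path)
    return paths


# Placeholder data standing in for X_raw and model.docvecs.most_similar(...)
abstracts = ["first abstract", "second abstract", "third abstract"]
similar = [(2, 0.29), (0, 0.28), (1, 0.27)]
out_dir = tempfile.mkdtemp()
paths = save_top_abstracts(abstracts, similar, out_dir)
```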

In [26]:
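import boto3

s3_conn = boto3.client("s3")
s3 = boto3.resource('s3')
# Upload the most similar abstract to the 'nsfdata' bucket as top1.txt
s3_object = s3.Object('nsfdata', 'top1.txt')  # renamed to avoid shadowing the built-in `object`
s3_object.put(Body=X_raw[similar[0][0]])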

{'ResponseMetadata': {'RequestId': '64F3A14308EE35B2',
  'HostId': 'SD3KACeltyQwl5sgDxXVsb/jKljcgdJfa6HrPsArzpFxVei3MWOnzo/M6FyA6Zj0okq68JioPkM=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'SD3KACeltyQwl5sgDxXVsb/jKljcgdJfa6HrPsArzpFxVei3MWOnzo/M6FyA6Zj0okq68JioPkM=',
   'x-amz-request-id': '64F3A14308EE35B2',
   'date': 'Fri, 17 Apr 2020 03:59:22 GMT',
   'etag': '"3bc1e4f25546d98b8cc9fff3c4189309"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"3bc1e4f25546d98b8cc9fff3c4189309"'}

### MphasisDeepInsightsTextSummarizer

### Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [27]:
import sagemaker as sage
from time import gmtime, strftime
from sagemaker import get_execution_role

sess = sage.Session()
role = get_execution_role()

## Create Model

Now we use the Model Package to create a model.

# PLEASE CHANGE the MphasisDeepInsightsTextSummarizer Product ARN

## Make sure you update model_package_arn to the product ARN that you retrieved for marketplace-text-summarizer


In [28]:
# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'
##OHIO REGION
#model_package_arn = 'arn:aws:sagemaker:us-east-2:057799348421:model-package/marketplace-text-summarizer-11-d2490248e8de20f24ae3b72d0d74654c'

##North Virginia Region
model_package_arn = 'arn:aws:sagemaker:us-east-1:865070037744:model-package/marketplace-text-summarizer-11-d2490248e8de20f24ae3b72d0d74654c'

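from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()
# Note: this rebinds `model`, which previously held the Doc2Vec model;
# reload the Doc2Vec model before running the similarity cells again.
model = ModelPackage(model_package_arn=model_package_arn,
                    role = role,
                    sagemaker_session = sagemaker_session)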
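
Because the model package ARN is region-specific, checking that the ARN's region matches the region you are working in can catch a mismatch before the model is created. The sketch below only parses the ARN string. `arn_region` is a hypothetical helper, and `expected_region` is hard-coded here as an assumption; in the notebook it could be taken from the SageMaker session instead.

```python
def arn_region(arn):
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[3]


model_package_arn = (
    "arn:aws:sagemaker:us-east-1:865070037744:model-package/"
    "marketplace-text-summarizer-11-d2490248e8de20f24ae3b72d0d74654c"
)
expected_region = "us-east-1"  # North Virginia (assumed to be the notebook's region)
region_ok = arn_region(model_package_arn) == expected_region
```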

## Input File

Now we point to the input file for testing the model: the most similar abstract (top1.txt) that was uploaded to S3 earlier.

In [29]:
top1_txt="s3://nsfdata/top1.txt"

## Batch Transform Job

Now let's use the model built to run a batch inference job and verify it works.

In [30]:
# Run the batch transform on two ml.m5.xlarge instances
transformer = model.transformer(2, 'ml.m5.xlarge')
transformer.output_path = "s3://nsfdata/summerized-abstract"
transformer.transform(top1_txt, content_type='text/plain')
transformer.wait()
print("Batch Transform complete")

............................2020-04-17T04:03:56.919:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
169.254.255.130 - - [17/Apr/2020 04:03:56] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [17/Apr/2020 04:03:56] "GET /execution-parameters HTTP/1.1" 404 -
---input--- abstract

iis - 0836181   
framework for a general multimedia workbench
byrd, donald, indiana university

this sger project will pursue the development of a general framework for exploring and editing time-dependent multimedia using a variety of coordinated visualizations and sonifications.  it enhances contemporary methods of content-based searching and retrieval of musical information from

## Output from Batch Transform

Note: Ensure that the boto3 package is installed on the local system.

In [31]:
import boto3
print(transformer.output_path)
# Everything after "s3://<bucket>/" is the output prefix (folder) inside the bucket
bucketFolder = transformer.output_path.split('/', 3)[3]
s3_conn = boto3.client("s3")

s3://nsfdata/summerized-abstract
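
Indexing into the split string works for this path but is easy to get wrong when the prefix has several levels. A small sketch using `urllib.parse` from the standard library separates the bucket from the key prefix. `split_s3_uri` is a hypothetical helper, and the `.out` suffix simply mirrors the `top1.txt.out` key downloaded in the next cell.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    # "s3://bucket/some/prefix" -> ("bucket", "some/prefix")
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.strip("/")


bucket, prefix = split_s3_uri("s3://nsfdata/summerized-abstract")
# Key of the batch transform result for the input file top1.txt
output_key = f"{prefix}/top1.txt.out"
```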


In [32]:
bucket_name="nsfdata"
with open('result.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,bucketFolder+'/top1.txt.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [33]:
with open('./result.txt', 'rb') as file_stream:
    output_text = file_stream.read().decode('utf-8')

#### Original Text Input

In [34]:
print(text)

This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. 
To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards,

#### Similar abstract

In [35]:
bucket_name="nsfdata"
with open('top1.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,'top1.txt', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [36]:
with open('./top1.txt', 'rb') as file_stream:
    top1_text = file_stream.read().decode('utf-8')

In [37]:
print(top1_text)

abstract

iis - 0836181   
framework for a general multimedia workbench
byrd, donald, indiana university

this sger project will pursue the development of a general framework for exploring and editing time-dependent multimedia using a variety of coordinated visualizations and sonifications.  it enhances contemporary methods of content-based searching and retrieval of musical information from online databases and stored in a variety of formats ranging from encoded score files to digital audio.  the type of system proposed would for the first time make it possible for scholars to compare relationships among musical works and among versions of the same work in a new flexible ways. the underlying general problem addressed in the project of a framework supporting combinations of visualizations and sonifications  in a modular framework allows the results to apply to temporal phenomena generally across disciplinary domains.


## Summary of similar abstract

### The output is about content-based searching and retrieval of musical information, from encoded score files to digital audio, which is similar to the machine learning for abstract recommendation described in the input study.

In [38]:
print(output_text)

The type of system proposed would for the first time make it possible for scholars to compare relationships among musical works and among versions of the same work in a new flexible ways. It enhances contemporary methods of content-based searching and retrieval of musical information from online databases and stored in a variety of formats ranging from encoded score files to digital audio. The underlying general problem addressed in the project of a framework supporting combinations of visualizations and sonifications in a modular framework allows the results to apply to temporal phenomena generally across disciplinary domains.
Execution time : 1.28seconds

