# Natural Language Processing using Doc2Vec on National Science Foundation Awards Abstracts
---
### Team:  
Jacob Noble  
Himanshu Gamit  
Shantanu Hadap


## 1. Pre-requisites: SageMaker Jupyter Instance (ml.m5.xlarge) and MphasisDeepInsightsTextSummarizer Product ARN

1. Make sure you are in the N. Virginia (us-east-1) region.
2. Go to https://aws.amazon.com/marketplace/pp/prodview-uzkcdmjuagetk.
3. Subscribe and accept the offer.
4. Wait for the confirmation email, then choose Continue to Configuration.
5. Select SageMaker Console as the launch method.
6. Go to Configure for Amazon SageMaker console.
7. Copy the product ARN and keep it for later use.


#### Make sure to update the product ARN as highlighted below

#### MphasisDeepInsightsTextSummarizer Product ARN

## 2. FOR CUSTOM USER ABSTRACT INPUT: Update the `text` variable in the section highlighted below

### Imports


In [1]:
!pip install gensim



In [2]:
import gensim
from gensim.models import Doc2Vec
import pandas as pd
from sklearn.preprocessing import StandardScaler # Replace with SAS version

unable to import 'smart_open.gcs', disabling that module


In [3]:
#!rm -r data
!wget https://nsfdata.s3.amazonaws.com/nsfdataset.zip -O nsfdataset.zip
!unzip -o nsfdataset.zip -d data

--2020-04-17 02:11:27--  https://nsfdata.s3.amazonaws.com/nsfdataset.zip
Resolving nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)... 52.216.160.43
Connecting to nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)|52.216.160.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358656602 (342M) [application/zip]
Saving to: ‘nsfdataset.zip’


2020-04-17 02:11:32 (65.8 MB/s) - ‘nsfdataset.zip’ saved [358656602/358656602]

Archive:  nsfdataset.zip
  inflating: data/nsf_proposals.csv  


In [4]:
!ls data

nsf_proposals.csv


In [5]:
#import saspy
import pandas as pd
import numpy as np


selected_cols = ['abstractText','date']
projects = pd.read_csv("data/nsf_proposals.csv", usecols = selected_cols, low_memory=False) #, nrows=30000
print ("Total Projects: ", projects.shape[0], "\nTotal Features: ", projects.shape[1])
projects.date = pd.to_datetime(projects.date.str.replace('D', 'T'))
projects = projects.sort_values('date')

Total Projects:  329321 
Total Features:  2


### Load in Data using Pandas

In [6]:
# Drop values without Abstract texts
projects = projects.dropna(how='any')

print("Max length of the abstractText:", projects.abstractText.str.len().max())
print("Min length of the abstractText:", projects.abstractText.str.len().min())
print("Avg length of the abstractText:", projects.abstractText.apply(lambda x: len(x) - x.count(" ")).mean())

words = projects.abstractText.str.split().apply(len)
print("Max words abstractText:", words.max())
projects.head()

Max length of the abstractText: 8166
Min length of the abstractText: 1
Avg length of the abstractText: 1645.861842268939
Max words abstractText: 1252


| | abstractText | date |
|---|---|---|
| 295988 | Nondestructive Evaluation (NDE) is important t... | 1985-08-30 |
| 307099 | An Industry/University Cooperative Research Ce... | 1985-08-30 |
| 284877 | This research seeks to obtain a representation... | 1985-09-06 |
| 177779 | This is an attempt to develop a "solvated elec... | 1986-01-16 |
| 166668 | The New York State College of Ceramics at Alfr... | 1986-06-03 |
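
The length statistics printed above measure three different things: total characters, characters excluding spaces (the value labelled "Avg length"), and whitespace-separated words. The sketch below reproduces the same measures in plain Python on a toy list; the sample strings are made up for illustration and are not taken from the dataset.

```python
# Toy abstracts; the notebook applies the same measures to projects.abstractText.
abstracts = ["deep learning for text", "a b", "x"]

char_lengths = [len(a) for a in abstracts]                      # every character
non_space_lengths = [len(a) - a.count(" ") for a in abstracts]  # spaces excluded
word_counts = [len(a.split()) for a in abstracts]               # whitespace-separated tokens

max_chars = max(char_lengths)
min_chars = min(char_lengths)
avg_non_space = sum(non_space_lengths) / len(non_space_lengths)
max_words = max(word_counts)

print("Max length:", max_chars)
print("Min length:", min_chars)
print("Avg length (spaces excluded):", avg_non_space)
print("Max words:", max_words)
```

Note that the minimum length of 1 reported for the real data suggests some abstracts are essentially empty; whether to filter them out before training is a design choice this notebook leaves open.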


### Setup Training Set for Doc2Vec

In [7]:
X_raw = projects.abstractText.str.lower().values
num_of_docs = len(X_raw)
print('Number of abstracts: ', num_of_docs)

Number of abstracts:  321560


#### Download Model Files

In [8]:
#!wget https://nsfdata.s3.amazonaws.com/0nsf_doc2v.model -O 0nsf_doc2v.model
#!wget https://nsfdata.s3.amazonaws.com/0nsf_doc2v.model.docvecs.vectors_docs.npy -O 0nsf_doc2v.model.docvecs.vectors_docs.npy
#!wget https://nsfdata.s3.amazonaws.com/0nsf_doc2v.model.trainables.syn1neg.npy -O 0nsf_doc2v.model.trainables.syn1neg.npy
#!wget https://nsfdata.s3.amazonaws.com/0nsf_doc2v.model.wv.vectors.npy -O 0nsf_doc2v.model.wv.vectors.npy

### Create generator that will tokenize the training abstracts on the fly to save on memory

In [9]:
def doc_generator(input_docs_array):
    for i, doc in enumerate(input_docs_array):
        tokens = gensim.utils.simple_preprocess(doc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [10]:
X = doc_generator(X_raw)
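
A Python generator can be consumed only once, so any routine that makes several passes over the corpus (for example, multi-epoch training) would find it empty after the first pass. One common alternative is a small class whose `__iter__` starts a fresh pass each time. The sketch below illustrates the idea without requiring gensim: `TaggedDoc`, `basic_tokenize`, and `TaggedCorpus` are hypothetical names introduced here, not part of the original notebook.

```python
import re
from collections import namedtuple

# Stand-in with the same (words, tags) fields as gensim's TaggedDocument,
# so the sketch runs without gensim installed.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def basic_tokenize(doc):
    """Hypothetical tokenizer: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", doc.lower())


class TaggedCorpus:
    """Restartable iterable: every call to iter() starts a fresh pass."""

    def __init__(self, docs, tokenize=basic_tokenize, make_doc=TaggedDoc):
        self.docs = docs
        self.tokenize = tokenize
        self.make_doc = make_doc

    def __iter__(self):
        # Tokenize lazily, one document at a time, to keep memory use low.
        for i, doc in enumerate(self.docs):
            yield self.make_doc(self.tokenize(doc), [i])

    def __len__(self):
        return len(self.docs)


docs = ["Nondestructive evaluation is important", "An industry cooperative"]

one_shot = iter(TaggedCorpus(docs))  # behaves like a plain generator
first_pass = list(one_shot)
second_pass = list(one_shot)         # empty: the generator is exhausted

corpus = TaggedCorpus(docs)
pass_a = list(corpus)
pass_b = list(corpus)                # full again: __iter__ restarts

print(len(first_pass), len(second_pass), len(pass_a), len(pass_b))
```

In the notebook, one could pass `tokenize=gensim.utils.simple_preprocess` and `make_doc=gensim.models.doc2vec.TaggedDocument` (both already used in `doc_generator`) and hand the same corpus object to both vocabulary building and training.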

### Create Initial Doc2Vec Model and Training

#### Skip to the next section to load a pretrained model instead. Training can take a long time.

We replicate the approach of Document Embedding with Paragraph Vectors (http://arxiv.org/abs/1507.07998) on the NSF data. That paper showed only DBOW results, so this experiment is set up to cover both DBOW and DM, although the DM entry in the model list below is currently commented out.


In [11]:
%%time
import multiprocessing
from pprint import pprint
cores = multiprocessing.cpu_count()

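# Train for 100 epochs in a 300-dimensional vector space
# (vector_size/epochs are the current names of the older size/iter arguments)
models = [
    # PV-DBOW, with word vectors trained alongside the document vectors (dbow_words=1)
    Doc2Vec(dm=0, dbow_words=1, vector_size=300, window=8, min_count=1, epochs=100, workers=cores),
    # PV-DM w/average
    #Doc2Vec(dm=1, dm_mean=1, vector_size=300, window=8, min_count=1, epochs=100, workers=cores),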
]

models[0].build_vocab(doc_generator(X_raw))
print(str(models[0]))
#models[1].reset_from(models[0])
#print(str(models[1]))



Doc2Vec(dbow+w,d300,n5,w8,s0.001,t8)
CPU times: user 2min 26s, sys: 948 ms, total: 2min 27s
Wall time: 2min 27s


In [12]:
%%time 
for i, model in enumerate(models):
    model.train(doc_generator(X_raw), total_examples=len(X_raw), epochs=model.epochs)
    model.save(str(i)+'nsf_doc2v.model')
    
print("Doc2Vec Embedding Shape:",model.docvecs.vectors_docs.shape)



Doc2Vec Embedding Shape: (321560, 300)
CPU times: user 26min 13s, sys: 1.91 s, total: 26min 15s
Wall time: 26min 15s


## Other Method

In [13]:
%%time
#model = Doc2Vec(vector_size=300, window=8, min_count=1, workers=4)
#model.build_vocab(documents=doc_generator(X_raw))
#model.train(doc_generator(X_raw), total_examples=num_of_docs, epochs=100)
#model.save('nsf_doc2vec.model')

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs


In [14]:
# model.train(doc_generator(X_raw), total_examples=num_of_docs, epochs=1)
# model.save('nsf_doc2vec.model')

### Load Existing Model

In [15]:
existing_model = '0nsf_doc2v.model' #'0nsf_doc2v.model' #'nsf_d2v_100.model' # Name of Existing model to load
model = gensim.models.Doc2Vec.load(str(existing_model)) # Model is assumed to be in the shared folder

### Testing Doc2Vec Model

1. Infer an embedding vector for a new abstract.
2. Generate the most similar articles, ranked by cosine similarity score.
3. Store the most similar files in an S3 bucket.


#### 1. Infer Embedding Vector for a New Abstract

First, we infer a paragraph vector for the new text so that its cosine similarity to the training abstracts can be calculated. Word vectors and document vectors are stored separately in the Doc2Vec model; the document vectors are accessed by adding `.docvecs` after the model name.

In [16]:
!wget https://nsfdata.s3.amazonaws.com/input.txt

--2020-04-17 02:40:53--  https://nsfdata.s3.amazonaws.com/input.txt
Resolving nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)... 52.216.244.188
Connecting to nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)|52.216.244.188|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1758 (1.7K) [text/plain]
Saving to: ‘input.txt.1’


2020-04-17 02:40:53 (111 MB/s) - ‘input.txt.1’ saved [1758/1758]



In [17]:
with open('./input.txt', 'rb') as file_stream:
    text = file_stream.read().decode('utf-8')

# FOR CUSTOM USER ABSTRACT INPUT - Update the value of the `text` variable below

In [18]:
#text = """This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards, document vectors were used to cluster similar abstracts using SAS® Studio KMeans Clustering Module. For visualization, the abstract embeddings were reduced to two dimensions using Principal Component Analysis (PCA) within SAS® Studio. This was then compared to a t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction technique as part of the Scikit-learn machine learning toolkit for Python.Conclusively, NSF proposal abstract text analysis can help an awardee read and improve their proposal model by identifying similar proposal abstracts from the last 24 years. It could also help NSF evaluators identify similar existing proposals that indirectly provides insights on whether a new proposal is going to be fruitful or not."""
print(text)

This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. 
To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards,

In [19]:
test_vector = model.infer_vector(gensim.utils.simple_preprocess(text))

In [20]:
test_vector

array([ 0.6275049 , -0.69909126,  0.24648897,  0.6473362 , -0.14309214,
       -0.12364113, -0.45723322,  0.19341896,  0.3059978 ,  0.5516042 ,
       -0.93698806, -0.5967499 , -0.47690502, -0.77345866, -0.4994785 ,
       -0.22809467, -0.0147303 , -0.5137103 ,  0.22297043,  1.2357723 ,
       -0.16500396, -1.2042545 , -0.783777  ,  0.12628776,  0.51366127,
       -0.06516813, -0.5799814 , -0.6226896 , -0.7133552 , -0.9241057 ,
        1.2958399 ,  0.11201333,  1.0210373 , -0.22493178,  0.6654851 ,
        0.24299653,  0.39740056, -0.36731383, -1.1357244 ,  0.10958847,
        0.0349189 ,  0.08420368,  0.75649464, -0.3745052 ,  0.557635  ,
        0.9662127 ,  1.0169607 , -0.07046982, -0.01328701,  0.068611  ,
        0.7502527 , -0.15684119, -0.12343872,  0.8755129 , -1.9245733 ,
        1.1535857 , -0.06343684, -0.26710075,  0.99154687,  0.5169596 ,
       -0.27407172, -0.82625276,  0.09931551, -0.641839  , -0.30299821,
       -0.90837485, -0.8976406 ,  0.26431394,  0.02322918, -0.27

#### 2. Generating Most Similar Articles Using Cosine Similarity Score

In [21]:
similar = model.docvecs.most_similar(
    positive=[test_vector], 
    topn=3)
print(similar)

[(272039, 0.2936977744102478), (240400, 0.28609341382980347), (240378, 0.285956472158432)]
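
The result above is a list of (document tag, score) pairs ranked by cosine similarity. As a rough illustration of that ranking idea (not gensim's actual implementation), the sketch below computes cosine similarity in plain Python and returns the top matches. The function names and the two-dimensional toy vectors are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_similar(query, doc_vectors, topn=3):
    """Return (index, score) pairs for the topn vectors closest to query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-d "document vectors"; the real model uses 300 dimensions.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]

ranked = top_n_similar(query, doc_vectors, topn=3)
print(ranked)
```

Because cosine similarity ignores vector length, the query `[2.0, 0.0]` matches `[1.0, 0.0]` perfectly even though the two vectors differ in magnitude.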


In [22]:
X_raw[similar[0][0]]

"the broader impact/commercial potential of this small business technology transfer (sttr) phase i project will be to bring modern data analytics to the book publishing industry and apply machine learning to extract and articulate human emotion as applied to the reading of literature for the first time in history. this innovation will dramatically change the way books are discovered, resulting in the first commercial version of a book recommendation system based on the experiential reading value of books. with approximately 1.4 million new books published each year, it's extremely difficult for authors to connect with readers and for readers to find the book that is just right for them. current recommendation systems are based on purchase history or social networks and fail to provide what readers told us are the most important factors in their reading satisfaction: writing style and how a book will make them feel. the proposed sttr project will lead to a commercially marketable produc

#### 3. Storing the Most Similar Files in an S3 Bucket

In [23]:
# file = open('top1.txt', 'w',encoding="utf-8")
# file.write(X_raw[similar[0][0]])
# file.close()

In [24]:
# file = open('top2.txt', 'w',encoding="utf-8")
# file.write(X_raw[similar[1][0]])
# file.close()

In [25]:
# file = open('top3.txt', 'w',encoding="utf-8")
# file.write(X_raw[similar[2][0]])
# file.close()

In [26]:
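import boto3
s3_conn = boto3.client("s3")
s3 = boto3.resource('s3')
# Upload the most similar abstract as top1.txt
# (named top1_object to avoid shadowing the built-in `object`)
top1_object = s3.Object('nsfdata', 'top1.txt')
top1_object.put(Body=X_raw[similar[0][0]])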

{'ResponseMetadata': {'RequestId': 'E397F67285A79584',
  'HostId': 'bQODYBEMvArx0vneQhfvoFPeN/CDx4UhqHa1yTNbZY/qqWZHzvKrDoD89rGsRGCxG/3XBQke9vs=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'bQODYBEMvArx0vneQhfvoFPeN/CDx4UhqHa1yTNbZY/qqWZHzvKrDoD89rGsRGCxG/3XBQke9vs=',
   'x-amz-request-id': 'E397F67285A79584',
   'date': 'Fri, 17 Apr 2020 02:40:55 GMT',
   'etag': '"5284bb0274fe21e74bc6c54b4d4e6658"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"5284bb0274fe21e74bc6c54b4d4e6658"'}
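
The cell above uploads only the top match, while the commented-out cells write `top1.txt` through `top3.txt` one by one. A loop could cover all matches at once. The sketch below builds the key-to-text mapping with a hypothetical helper, `build_uploads`, using toy stand-ins for `X_raw` and `similar`; the actual upload is left as a comment because it needs AWS credentials.

```python
def build_uploads(similar, docs, prefix="top", suffix=".txt"):
    """Map S3 keys (top1.txt, top2.txt, ...) to the matching abstract text."""
    return {
        f"{prefix}{rank}{suffix}": docs[doc_id]
        for rank, (doc_id, _score) in enumerate(similar, start=1)
    }


# Toy stand-ins for X_raw and the most_similar result.
docs = ["abstract zero", "abstract one", "abstract two", "abstract three"]
similar_example = [(2, 0.29), (0, 0.28), (3, 0.27)]

uploads = build_uploads(similar_example, docs)
print(sorted(uploads))

# With credentials configured, the upload could reuse the call from the cell above:
# import boto3
# s3 = boto3.resource("s3")
# for key, body in uploads.items():
#     s3.Object("nsfdata", key).put(Body=body)
```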

### MphasisDeepInsightsTextSummarizer

### Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [27]:
import sagemaker as sage
from time import gmtime, strftime
from sagemaker import get_execution_role

sess = sage.Session()
role = get_execution_role()

## Create Model

Now we use the Model Package to create a model.

# PLEASE CHANGE MphasisDeepInsightsTextSummarizer Product ARN

## Make sure you update model_package_arn to the product ARN that you retrieved from the marketplace-text-summarizer listing


In [28]:
# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'
##OHIO REGION
#model_package_arn = 'arn:aws:sagemaker:us-east-2:057799348421:model-package/marketplace-text-summarizer-11-d2490248e8de20f24ae3b72d0d74654c'

##North Virginia Region
model_package_arn = 'arn:aws:sagemaker:us-east-1:865070037744:model-package/marketplace-text-summarizer-11-d2490248e8de20f24ae3b72d0d74654c'

from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()
model = ModelPackage(model_package_arn=model_package_arn,
                    role = role,
                    sagemaker_session = sagemaker_session)
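
Since the model package ARN is region-specific (the cell above lists separate Ohio and N. Virginia values), a quick check that the ARN's region matches the notebook's region can catch a mismatch early. The sketch below parses the region out of the ARN string; `arn_region` is a hypothetical helper, and `notebook_region` is hard-coded here as an assumption (in the notebook it could come from `sage.Session().boto_region_name`).

```python
def arn_region(arn):
    """Return the region field of an ARN (arn:partition:service:region:account:resource)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[3]


example_arn = (
    "arn:aws:sagemaker:us-east-1:865070037744:model-package/"
    "marketplace-text-summarizer-11-d2490248e8de20f24ae3b72d0d74654c"
)
notebook_region = "us-east-1"  # assumption for this sketch

if arn_region(example_arn) != notebook_region:
    raise RuntimeError("model package ARN and notebook session are in different regions")
print("ARN region:", arn_region(example_arn))
```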

## Input File

Now we pull a sample input file for testing the model.

In [29]:
top1_txt="s3://nsfdata/top1.txt"

## Batch Transform Job

Now let's use the model built to run a batch inference job and verify it works.

In [30]:
# Run the batch transform on 2 ml.m5.xlarge instances
transformer = model.transformer(2, 'ml.m5.xlarge')
# Results are written under this S3 prefix
transformer.output_path = "s3://nsfdata/summerized-abstract"
transformer.transform(top1_txt, content_type='text/plain')
transformer.wait()
print("Batch Transform complete")

................................2020-04-17T02:46:06.980:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
2020-04-17T02:46:06.980:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
169.254.255.130 - - [17/Apr/2020 02:46:06] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [17/Apr/2020 02:46:06] "GET /execution-parameters HTTP/1.1" 404 -
 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
169.254.255.130 - - [17/Apr/2020 02:46:06] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [17

## Output from Batch Transform

Note: Ensure that the boto3 package is installed on the local system.

In [31]:
import boto3
print(transformer.output_path)
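# 's3://nsfdata/summerized-abstract'.split('/') -> ['s3:', '', 'nsfdata', 'summerized-abstract'],
# so index 3 is the output folder (this assumes a single-level prefix)
bucketFolder = transformer.output_path.split('/')[3]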
#print(s3bucket,s3prefix)
s3_conn = boto3.client("s3")

s3://nsfdata/summerized-abstract
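
Indexing into the `/`-separated output path works for a single-level prefix like the one above. For nested prefixes, parsing the URI with the standard library may be more robust. The sketch below uses `urllib.parse.urlparse`; `split_s3_uri` is a hypothetical helper introduced for illustration, and the `.out` suffix follows the output file name used later in this notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://nsfdata/summerized-abstract")
output_key = f"{prefix}/top1.txt.out"  # input file name plus ".out", as used below

print(bucket, prefix, output_key)
```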


In [32]:
bucket_name="nsfdata"
with open('result.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,bucketFolder+'/top1.txt.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [33]:
with open('./result.txt', 'rb') as file_stream:
    output_text = file_stream.read().decode('utf-8')

#### Original Text Input

In [34]:
print(text)

This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. 
To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards,

#### Similar abstract

In [35]:
bucket_name="nsfdata"
with open('top1.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,'top1.txt', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [36]:
with open('./top1.txt', 'rb') as file_stream:
    top1_text = file_stream.read().decode('utf-8')

In [37]:
print(top1_text)

the broader impact/commercial potential of this small business technology transfer (sttr) phase i project will be to bring modern data analytics to the book publishing industry and apply machine learning to extract and articulate human emotion as applied to the reading of literature for the first time in history. this innovation will dramatically change the way books are discovered, resulting in the first commercial version of a book recommendation system based on the experiential reading value of books. with approximately 1.4 million new books published each year, it's extremely difficult for authors to connect with readers and for readers to find the book that is just right for them. current recommendation systems are based on purchase history or social networks and fail to provide what readers told us are the most important factors in their reading satisfaction: writing style and how a book will make them feel. the proposed sttr project will lead to a commercially marketable product

## Summary of similar abstract

### The output is about machine learning for book recommendation, similar to the machine learning for abstract recommendation described in the input study.

In [38]:
print(output_text)

This innovation will dramatically change the way books are discovered , resulting in the first commercial version of a book recommendation system based on the experiential reading value of books. Then we apply machine learning to 19 million current books to generate deeply personalized book recommendations. The proposed sttr project will lead to a commercially marketable product that deeply personalizes the book discovery process and perpetuates literacy.
Execution time : 1.81seconds

