# Natural Language Processing using Doc2Vec on National Science Foundation Awards Abstracts
---
### Team:  
Jacob Noble  
Himanshu Gamit  
Shantanu Hadap


### Imports

Feature scaling uses scikit-learn's StandardScaler (marked in the code for later replacement with a SAS version).

In [0]:
import gensim
from gensim.models import Doc2Vec
import pandas as pd
from sklearn.preprocessing import StandardScaler # Replace with SAS version

In [2]:
!wget https://nsfdata.s3.amazonaws.com/nsfdataset.zip
!unzip nsfdataset.zip -d data

--2020-04-16 14:46:53--  https://nsfdata.s3.amazonaws.com/nsfdataset.zip
Resolving nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)... 52.216.16.176
Connecting to nsfdata.s3.amazonaws.com (nsfdata.s3.amazonaws.com)|52.216.16.176|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358656602 (342M) [application/zip]
Saving to: ‘nsfdataset.zip’


2020-04-16 14:47:04 (33.1 MB/s) - ‘nsfdataset.zip’ saved [358656602/358656602]

Archive:  nsfdataset.zip
  inflating: data/nsf_proposals.csv  


In [3]:
#import saspy
import pandas as pd
import numpy as np

# Load only the abstract text and date columns for the first 30,000 rows
selected_cols = ['abstractText', 'date']
projects = pd.read_csv("data/nsf_proposals.csv", usecols=selected_cols, nrows=30000, low_memory=False)
print("Total Projects: ", projects.shape[0], "\nTotal Features: ", projects.shape[1])
# Replace 'D' with 'T' before parsing the dates, then sort chronologically
projects.date = pd.to_datetime(projects.date.str.replace('D', 'T'))
projects = projects.sort_values('date')

Total Projects:  30000 
Total Features:  2


### Load in Data using Pandas

In [4]:
# Drop values without Abstract texts
projects = projects.dropna(how='any')

print("Max length of the abstractText:", projects.abstractText.str.len().max())
print("Min length of the abstractText:", projects.abstractText.str.len().min())
# Note: this average counts non-space characters only
print("Avg length of the abstractText:", projects.abstractText.apply(lambda x: len(x) - x.count(" ")).mean())

words = projects.abstractText.str.split().apply(len)
print("Max words abstractText:", words.max())
projects.head()

Max length of the abstractText: 7977
Min length of the abstractText: 1
Avg length of the abstractText: 1524.8768992781024
Max words abstractText: 1169


| | abstractText | date |
|---|---|---|
| 2 | A research and training center for quantitativ... | 1988-08-08 |
| 20002 | Biotechnology is now recognized as a new focus... | 1988-08-24 |
| 25558 | The Luquillo Experimental Forest (LEF) in Puer... | 1988-10-12 |
| 28891 | An institute for atomic and molecular theory w... | 1988-10-12 |
| 27780 | An institute for atomic and molecular theory w... | 1988-10-12 |


### Setup Training Set for Doc2Vec

In [5]:
X_raw = projects.abstractText.str.lower().values
num_of_docs = len(X_raw)
print('Number of abstracts: ', num_of_docs)

Number of abstracts:  29090


### Create generator that will tokenize the training abstracts on the fly to save on memory

In [0]:
def doc_generator(input_docs_array):
    for i, doc in enumerate(input_docs_array):
        tokens = gensim.utils.simple_preprocess(doc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [0]:
X = doc_generator(X_raw)
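
One caveat about the generator approach: a Python generator can be consumed only once, so any code that needs several passes over the corpus (for example one pass to build the vocabulary and more passes for training epochs) has to receive a fresh generator or a restartable iterable each time. The sketch below shows a restartable wrapper. The names `TaggedCorpus`, `Tagged`, and `simple_tokens` are illustrative stand-ins (assumptions) for `TaggedDocument` and `gensim.utils.simple_preprocess`, so the example runs without gensim.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (words, tags).
Tagged = namedtuple("Tagged", ["words", "tags"])


def simple_tokens(doc):
    """Rough stand-in for simple_preprocess: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", doc.lower())


class TaggedCorpus:
    """Restartable iterable: every call to iter() starts a fresh pass."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for i, doc in enumerate(self.docs):
            yield Tagged(simple_tokens(doc), [i])


corpus = TaggedCorpus(["Transonic flow about wings.", "Pair-bonded birds."])
first_pass = [t.words for t in corpus]
second_pass = [t.words for t in corpus]  # same content: the corpus restarts

gen = iter(corpus)   # a plain generator, by contrast
list(gen)            # first pass consumes it
leftover = list(gen) # second pass yields nothing
```

An object like `corpus` can be iterated as many times as needed, whereas `gen` is empty after its first pass.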

### Create Initial Doc2Vec Model and Training

#### Skip to the next section to load a pretrained model. Training can take a long time.

We replicate the approach of Document Embedding with Paragraph Vectors (http://arxiv.org/abs/1507.07998). The paper reports only PV-DBOW results, so this notebook applies the experiment to NSF data and sets up both PV-DBOW and PV-DM. The PV-DM model is commented out in the cell below, so only PV-DBOW is actually trained.


In [17]:
%%time
import multiprocessing
from pprint import pprint
cores = multiprocessing.cpu_count()

# Train for 100 epochs in a 300-dimensional vector space
# (size= and iter= are the gensim 3.x names; gensim 4 renamed them to vector_size= and epochs=)
models = [
    # PV-DBOW 
    Doc2Vec(dm=0, dbow_words=1, size=300, window=8, min_count=1, iter=100, workers=cores),
    # PV-DM w/average
    #Doc2Vec(dm=1, dm_mean=1, size=300, window=8, min_count=1, iter =100, workers=cores),
]

models[0].build_vocab(doc_generator(X_raw))
print(str(models[0]))
#models[1].reset_from(models[0])
#print(str(models[1]))



Doc2Vec(dbow+w,d300,n5,w8,s0.001,t2)
CPU times: user 30.6 s, sys: 153 ms, total: 30.7 s
Wall time: 30.7 s


In [18]:
%%time 
for i, model in enumerate(models):
    model.train(doc_generator(X_raw), total_examples=len(X_raw), epochs=model.iter)
    model.save(str(i)+'nsf_doc2v.model')

  


CPU times: user 2min 39s, sys: 531 ms, total: 2min 40s
Wall time: 1min 24s


In [19]:
print("Doc2Vec Embedding Shape:",model.docvecs.vectors_docs.shape)

Doc2Vec Embedding Shape: (29090, 300)


In [0]:
# model.train(doc_generator(X_raw), total_examples=num_of_docs, epochs=1)
# model.save('nsf_doc2vec.model')

### Load Existing Model

In [20]:
existing_model = '0nsf_doc2v.model' #'nsf_d2v_100.model' # Name of Existing model to load
model = gensim.models.Doc2Vec.load(str(existing_model)) # Model is assumed to be in the shared folder



### Testing Doc2Vec Model

1. Infer Embedding Vector for a New Abstract
2. Generating Most Similar Articles based on document cluster. This uses the cosine similarity score.
3. Storing the most similar files to S3 bucket.


#### 1. Infer Embedding Vector for a New Abstract

First, we compute the cosine similarity between a new text and the stored paragraph vectors. A Doc2Vec model stores word vectors and document vectors separately, so the document vectors are accessed by appending .docvecs to the model name.

In [30]:
!wget https://nsf-bucket.s3.us-east-2.amazonaws.com/input.txt

--2020-04-16 15:12:41--  https://nsf-bucket.s3.us-east-2.amazonaws.com/input.txt
Resolving nsf-bucket.s3.us-east-2.amazonaws.com (nsf-bucket.s3.us-east-2.amazonaws.com)... 52.219.96.8
Connecting to nsf-bucket.s3.us-east-2.amazonaws.com (nsf-bucket.s3.us-east-2.amazonaws.com)|52.219.96.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1758 (1.7K) [text/plain]
Saving to: ‘input.txt’


2020-04-16 15:12:41 (122 MB/s) - ‘input.txt’ saved [1758/1758]



In [0]:
with open('./input.txt', 'rb') as file_stream:
    text = file_stream.read().decode('utf-8')

In [32]:
#text = """This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards, document vectors were used to cluster similar abstracts using SAS® Studio KMeans Clustering Module. For visualization, the abstract embeddings were reduced to two dimensions using Principal Component Analysis (PCA) within SAS® Studio. This was then compared to a t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction technique as part of the Scikit-learn machine learning toolkit for Python.Conclusively, NSF proposal abstract text analysis can help an awardee read and improve their proposal model by identifying similar proposal abstracts from the last 24 years. It could also help NSF evaluators identify similar existing proposals that indirectly provides insights on whether a new proposal is going to be fruitful or not."""
print(text)

This study utilizes publicly available data from the National Science Foundation (NSF) Web Application Programming Interface (API). In this paper, various machine learning techniques are demonstrated to explore, analyze and recommend similar proposal abstracts to aid the NSF or Awardee with the Merit Review Process. These techniques extract textual context and group it with similar context. The goal of the analysis was to utilize a Doc2Vec unsupervised learning algorithms to embed NSF funding proposal abstracts text into vector space.  Once vectorized, the abstracts were grouped together using K-means clustering. These techniques together proved to be successful at grouping similar proposals together and could be used to find similar proposals to newly submitted NSF funding proposals. 
To perform text analysis, SAS® University Edition is used which supports SASPy, SAS® Studio and Python JupyterLab. Gensim Doc2vec is used to generate document vectors for proposal abstracts. Afterwards,

In [0]:
test_vector = model.infer_vector(gensim.utils.simple_preprocess(text))

In [34]:
test_vector

array([-0.8438365 , -0.17401043, -0.13399816, -0.2000463 ,  0.13937432,
        0.23570478,  0.09454932,  0.57916385,  0.23849203, -0.17535031,
        1.3722361 , -1.0783211 ,  0.270243  , -0.97854984,  0.8234382 ,
        0.02699925, -0.4561137 , -0.00563184, -0.5500358 , -0.47532412,
        0.4896474 , -1.061523  ,  0.41433418, -0.33881608, -0.50388104,
        0.13778783,  0.7151214 ,  0.38627973,  0.26692402,  0.78165257,
        0.3631258 ,  0.31219473, -0.06193725,  0.1172061 ,  0.36414784,
        0.35421765, -0.13689107, -0.45280287,  0.80544066, -0.41016334,
       -0.6796709 ,  0.16536756, -0.04947023,  0.36030695,  0.02316136,
       -0.14729203, -0.5510913 ,  0.4065121 , -0.2794985 ,  0.44954312,
       -0.6153437 , -0.15876888, -0.92806524,  1.0740391 ,  0.7172982 ,
       -0.96308166,  0.13134834,  1.0354185 ,  0.28054276, -0.7680927 ,
        0.5887584 , -1.5406504 ,  0.10726024, -0.01891238, -0.41754752,
        0.04708375, -0.9622146 , -0.40153816, -0.83389646,  0.50

#### 2. Generating Most Similar Articles based on document cluster. This uses the cosine similarity score.

In [35]:
model.docvecs.most_similar(
    positive=[test_vector], 
    topn=3)



[(28343, 0.2686604857444763),
 (28391, 0.2660232186317444),
 (28587, 0.26435062289237976)]
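
For intuition about the scores returned above, cosine similarity can be computed by hand. The sketch below is pure Python with made-up 2-D vectors; the function names `cosine_similarity` and `top_n_similar` are illustrative and are not part of gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n_similar(query, doc_vectors, n=3):
    """Return (index, score) pairs for the n highest-scoring documents."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy 2-D "document vectors" (made-up values, not from the NSF model).
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ranked = top_n_similar([1.0, 0.1], docs, n=2)
```

A score close to 1 means the two vectors point in nearly the same direction, a score near 0 means they are close to orthogonal, and a negative score means they point in opposing directions.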

#### 3. Storing the most similar files to S3 bucket.

In [0]:
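# Look up the top-3 matches for the query vector and write the best one to disk.
# Each result is a (doc_id, score) pair; doc_id is the position of the abstract in X_raw.
similar_docs = model.docvecs.most_similar(positive=[test_vector], topn=3)
with open('top1.txt', 'w', encoding="utf-8") as file:
    file.write(X_raw[similar_docs[0][0]])

In [0]:
with open('top2.txt', 'w', encoding="utf-8") as file:
    file.write(X_raw[similar_docs[1][0]])

In [0]:
with open('top3.txt', 'w', encoding="utf-8") as file:
    file.write(X_raw[similar_docs[2][0]])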
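
The cells above only write local files; the upload to S3 itself is not shown in the notebook. A minimal sketch of that step follows, assuming a boto3 S3 client and the `nsf` bucket named later in this notebook. The helper name `upload_files` is hypothetical, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
import os


def upload_files(s3_client, bucket, paths):
    """Upload each local file to `bucket`, using the file name as the object key."""
    uploaded = []
    for path in paths:
        key = os.path.basename(path)
        # boto3 S3 clients expose upload_file(Filename, Bucket, Key).
        s3_client.upload_file(path, bucket, key)
        uploaded.append("s3://{}/{}".format(bucket, key))
    return uploaded


# Usage (assumes boto3 is installed and AWS credentials are configured):
# import boto3
# upload_files(boto3.client("s3"), "nsf", ["top1.txt", "top2.txt", "top3.txt"])
```

Returning the resulting URIs makes it easy to feed them directly into a later batch transform step.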

In [0]:
### Text summarizer

### Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [0]:
import sagemaker as sage
from time import gmtime, strftime
from sagemaker import get_execution_role

sess = sage.Session()
role = get_execution_role()

ValueError: ignored

## Create Model

Now we use the Model Package to create a model.

In [0]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'
model_package_arn = 'arn:aws:sagemaker:us-east-2:057799348421:model-package/marketplace-text-summarizer-11-d2490248e8de20f24ae3b72d0d74654c'

role = get_execution_role()
sagemaker_session = sage.Session()
# Note: this rebinds `model`, which previously held the Doc2Vec model
model = ModelPackage(model_package_arn=model_package_arn,
                     role=role,
                     sagemaker_session=sagemaker_session)

## Input File

Now we pull a sample input file for testing the model.

In [0]:
top1_txt="s3://nsf/top1.txt"
top2_txt="s3://nsf/top2.txt"
top3_txt="s3://nsf/top3.txt"

## Batch Transform Job

Now let's use the model built to run a batch inference job and verify it works.

In [0]:
# Run the transform on 2 ml.m5.xlarge instances and write results under the output path
transformer = model.transformer(2, 'ml.m5.xlarge')
transformer.output_path = "s3://nsf/Summerized"
transformer.transform(top2_txt, content_type='text/plain')
transformer.wait()  # block until the batch job finishes
print("Batch Transform complete")

Using already existing model: marketplace-text-summarizer-11-d2490248-2020-04-15-19-31-40-147


..........................
2020-04-15T20:22:02.869:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
169.254.255.130 - - [15/Apr/2020 20:22:02] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [15/Apr/2020 20:22:02] "GET /execution-parameters HTTP/1.1" 404 -
2020-04-15T20:22:09.872:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
2020-04-15T20:22:09.872:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)


Batch Transform complete


## Output from Batch Transform

Note: Ensure that the boto3 package is installed on the local system.

In [0]:
import boto3
print(transformer.output_path)
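# 's3://nsf/Summerized'.split('/') gives ['s3:', '', 'nsf', 'Summerized']; index 3 is the output folder
bucketFolder = transformer.output_path.split('/')[3]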
#print(s3bucket,s3prefix)
s3_conn = boto3.client("s3")

s3://nsf/Summerized
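
Indexing into a split string works only for a single-level output folder. As a more general sketch, the bucket and key prefix can be separated with the standard library's `urlparse`; the helper name `split_s3_uri` is an illustrative assumption, not part of boto3 or SageMaker.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    # netloc holds the bucket name; the path holds the key prefix with a leading slash
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://nsf/Summerized")
```

Here `bucket` is `nsf` and `prefix` is `Summerized`, and a nested prefix such as `a/b/c` would be returned intact.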


In [0]:
bucket_name="nsf"
with open('result.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,bucketFolder+'/top2.txt.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [0]:
with open('./result.txt', 'rb') as file_stream:
    output_text = file_stream.read().decode('utf-8')

#### Original Text Input

In [0]:
print(text)

This research project will test four hypotheses about evolutionary forces that affect the behavior of pair-bonded birds to each other and to the nestlings they raise:1.male-male aggression during a female's fertile period decreases the likelihood that the female's young are sired by males other thanher territorial partner.2.female-female aggression during the egg-laying period functions to decrease the likelihood that females' broods contain nestlings from other females of the samespecies that lay their eggs in the nests of neighbors.3.mate- guarding by males is a response to females' behavior or ecological circumstances that increase the likelihood thatfemales mate with more than one male.4.parental feeding ofnestlings by males varies according to males' likelihood of genetic paternity as indicated by the behavior of females whenthey are fertile. these are important questions for several reasons.dr. gowatyand her colleagues are attempting to observe evolutionary processes as they occu

#### Similar abstract

In [0]:
bucket_name="nsf"
with open('top2.txt', 'wb') as f:
    s3_conn.download_fileobj(bucket_name,'top2.txt', f)
    print("Output file loaded from bucket")

Output file loaded from bucket


In [0]:
with open('./top2.txt', 'rb') as file_stream:
    input_text = file_stream.read().decode('utf-8')

In [0]:
print(input_text)

numerical and asymptotic techniques will be used to                        investigate transonic flow about aircraft wings and flows with                  distributed vorticity.  specific transonic problems to be                       analyzed are design of optimum critical axisymmetric bodies,                    flows about lifting, not-so-slender transonic wings and the                     transonic area rule, and wind tunnel flows.  some use will be                   made of the transonic small disturbance formulation.  the                       investigations of the vortical flow problem will involve                        development and clarification of theories integrating nonlinear                 and viscous effects.  both the transonic and vortical flow                      problems require interaction between numerical and asymptotic                   treatments in order to deal with complex nonlinearities.                             present day commercial aircraft cruise i

#### Summary of similar abstract

In [0]:
print(output_text)

Numerical and asymptotic techniques will be used to investigate transonic flow about aircraft wings and flows with distributed vorticity. Specific transonic problems to be analyzed are design of optimum critical axisymmetric bodies , flows about lifting , not-so-slender transonic wings and the transonic area rule , and wind tunnel flows. Some use will be made of the transonic small disturbance formulation.
Execution time : 1.40seconds

