The purpose of this script is to update the NASA ADS synonyms files with AGU index terms related to Heliophysics and Space Weather.

Relevant resources:
- ADS synonyms files: 
    - simple (single term) synonyms: data/ads_simple_synonyms.txt
    - multi-term synonyms: data/ads_multi_synonyms.txt
- AGU Index Terms: data/agu-index-terms.xlsx
- Heliophysics Acronyms (generated from 2013 Decadal Survey): data/solar_physics_acronyms.csv

Dependencies:
- [NLTK](https://www.nltk.org), plus the NLTK corpora used here (`wordnet` and `omw-1.4`), which must be downloaded separately

In [1]:
import os, json
import re
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import nltk

In [9]:
# nltk.download()
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ryanmcgranaghan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/ryanmcgranaghan/nltk_data...


True

In [13]:


def synonym_antonym_extractor(phrase):
    '''
    Print the WordNet synonyms and antonyms of `phrase` as two sets.

    Obtained from: https://www.holisticseo.digital/python-seo/nltk/wordnet
    '''
    from nltk.corpus import wordnet
    synonyms = []
    antonyms = []

    # Collect lemma names from every synset that matches the phrase
    for syn in wordnet.synsets(phrase):
        for l in syn.lemmas():
            synonyms.append(l.name())
            if l.antonyms():
                antonyms.append(l.antonyms()[0].name())

    print(set(synonyms))
    print(set(antonyms))



### Read in AGU index terms

In [17]:
# Read in AGU Index Terms
pd_agu = pd.read_excel('data/agu-index-terms.xlsx')
pd_agu


# AGU code ranges retained by this notebook; upper bounds are exclusive,
# matching the original comparisons (e.g. 1900 <= Code < 1999)
code_ranges = [(1900, 1999), (2100, 2199), (2400, 2499), (2700, 2799), (3200, 3299),
               (4300, 4399), (6900, 6999), (7500, 7599), (7800, 7899)]
in_range = pd.Series(False, index=pd_agu.index)
for lo, hi in code_ranges:
    in_range |= (pd_agu['Code'] >= lo) & (pd_agu['Code'] < hi)
# Keep only rows whose code falls in one of the ranges (original index labels are preserved)
pd_agu = pd_agu[in_range].copy()

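for r in pd_agu.index:
    description = pd_agu.loc[r, 'Description']
    if '(' in description:
        print('prior = {}'.format(description))
        # Drop the trailing cross-reference codes, e.g. "Forecasting (2722, 4315, 7924)";
        # .loc avoids chained assignment on a copy of the DataFrame
        pd_agu.loc[r, 'Description'] = description[:description.find('(')].rstrip()
        print('post = {}'.format(pd_agu.loc[r, 'Description']))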

# Transforming into a new dataframe that can be combined with other glossaries
pd_agu_terms = pd.DataFrame(columns=['term','definition'])
pd_agu_terms['term'] = pd_agu['Description']
pd_agu_terms

pd_agu_terms['term'] = pd_agu_terms['term'].str.lower()

pd_agu_terms['source'] = 'agu'  # a scalar is broadcast to every row
pd_agu_terms


prior = Decision analysis (4324, 6309)
post = Decision analysis
prior = Forecasting (2722, 4315, 7924)
post = Forecasting
prior = Machine learning (0555)
post = Machine learning
prior = Modeling (0466, 0545, 0798, 1847, 4255, 4316)
post = Modeling
prior = Real-time and responsive information delivery (4346)
post = Real-time and responsive information delivery
prior = Spatial analysis and representation (0500, 3252)
post = Spatial analysis and representation
prior = Statistical methods: Descriptive (4318)
post = Statistical methods: Descriptive
prior = Statistical methods: Inferential (4318)
post = Statistical methods: Inferential
prior = Temporal analysis and representation (1872, 3270, 4277, 4475)
post = Temporal analysis and representation
prior = Uncertainty (1873, 3275)
post = Uncertainty
prior = Visualization and portrayal (0530)
post = Visualization and portrayal
prior = Coronal mass ejections (4305, 7513)
post = Coronal mass ejections
prior = Discontinuities (7811)
post = Discon



Unnamed: 0,term,definition,source
432,informatics,,agu
433,community modeling frameworks,,agu
434,community standards,,agu
435,"computational models, algorithms",,agu
436,cyberinfrastructure,,agu
...,...,...,...
1154,transport processes,,agu
1155,turbulence,,agu
1156,wave/particle interactions,,agu
1157,wave/wave interactions,,agu


### Read in ADS synonyms

In [6]:
simple_syns_file = '/Users/ryanmcgranaghan/Documents/Helio_ECIP/dev/Helio-KNOW/ADS_enrichment/data/ads_simple_synonyms.txt'
# The context manager closes the file even if reading fails
with open(simple_syns_file, "r") as f:
    txt_data = f.read().split('\n')

simple_syns_data = [x.split('=>') for x in txt_data]
pd_simple_syns = pd.DataFrame(simple_syns_data,columns=['words','ADS term'])
pd_simple_syns

Unnamed: 0,words,ADS term
0,"1820-30, 1820-303",1820-30
1,"first, 1st",first
2,"second, 2nd",second
3,"third, 3rd",third
4,"fourth, 4th",fourth
...,...,...
9736,"zt, zts",zt
9737,"zuckerman, zuckermann",zuckerman
9738,"zustandsdiagramm, zustandsdiagramms",zustandsdiagramm
9739,"zwicky, zw",zwicky


In [7]:
multi_syns_file = '/Users/ryanmcgranaghan/Documents/Helio_ECIP/dev/Helio-KNOW/ADS_enrichment/data/ads_multi_synonyms.txt'
# The context manager closes the file even if reading fails
with open(multi_syns_file, "r") as f:
    txt_data = f.read().split('\n')

# TODO: how do we parse this more complicated file programmatically?
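
One possible answer to the TODO above, as a minimal sketch. It assumes, based only on the excerpt of the file printed in this notebook, that every non-blank line not starting with `#` is one comma-separated group of equivalent forms. The helper name `parse_multi_synonyms` and the variable names are illustrative and not part of any ADS tooling.

```python
def parse_multi_synonyms(lines):
    """Group the lines of a multi-term synonyms file (hypothetical helper).

    Assumption: each non-blank line that does not start with '#' holds one
    comma-separated group of equivalent forms.
    """
    groups = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        forms = [form.strip() for form in line.split(',') if form.strip()]
        if forms:
            groups.append(forms)
    return groups


# Sample lines copied from the file excerpt shown in this notebook
sample_lines = [
    '# first we start with classes of stars',
    'o star,ostar,o stars,ostars',
    '',
    '# common constellations',
    'cas,cassiopeiae',
]
multi_syn_groups = parse_multi_synonyms(sample_lines)
print(multi_syn_groups)
```

In the notebook itself, the same helper could be applied to the lines read from `multi_syns_file`, i.e. `parse_multi_synonyms(txt_data)`. Whether the real file contains other line types (for example `=>` mappings) would need to be checked before relying on this.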

In [8]:
txt_data

['# A list of acronyms that ADS curates independently of Wikipedia',
 '# They are useful to deal with some normalization of the input',
 '# stream which would otherwise require regular expressions',
 '#',
 '# AA 12/14/2012',
 '',
 '# first we start with classes of stars',
 'o star,ostar,o stars,ostars',
 'b star,bstar,b stars,bstars',
 'a star,astar,a stars,astars',
 'f star,fstar,f stars,fstars',
 'g star,gstar,g stars,gstars',
 'k star,kstar,k stars,kstars',
 'm star,mstar,m stars,mstars',
 's star,sstar,s stars,sstars',
 'l star,lstar,l stars,lstars',
 't star,tstar,t stars,tstars',
 'be star,bestar,be stars,bestars',
 '',
 '# common constellations',
 'cas,cassiopeiae',
 'cen,centaurus',
 'cyg,cygnus',
 'her,hercules',
 'per,perseus',
 'sgr,sagittarius',
 'tau,taurus',
 'vir,virgo',
 '',
 '# main stars in popular constellations',
 'cas a,casa,cassiopeiae a',
 'cen a,cena,centaurus a',
 'cyg a,cyga,cygnus a',
 'her a,hera,hercules a',
 'per a,pera,perseus a',
 'sgr a,sgra,sagittarius

### Loop over AGU terms and add to synonyms file

In [None]:
# Exploration of using NLTK to identify terms in the ADS synonyms that are
#   similar or related to a given AGU index term

# Open questions:
#  - How to handle multi-term phrases in the AGU index terms?
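
One way to approach the open question above about multi-term phrases, sketched without NLTK: split an AGU term into individual words and check each one against a lookup built from the simple synonyms. This assumes each line of the simple synonyms file has the form `words=>ADS term`, as implied by the `split('=>')` parsing earlier in this notebook. The helper names `build_simple_lookup` and `match_term_words` and the sample lines are hypothetical.

```python
import re


def build_simple_lookup(lines):
    """Map every listed variant to its ADS term (hypothetical helper)."""
    lookup = {}
    for line in lines:
        if '=>' not in line:
            continue  # skip blank or malformed lines
        words, ads_term = line.split('=>', 1)
        for word in words.split(','):
            word = word.strip()
            if word:
                lookup[word] = ads_term.strip()
    return lookup


def match_term_words(term, lookup):
    """Split an AGU term into lowercase words and look each one up."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return {word: lookup.get(word) for word in words}


# Illustrative lines in the assumed format, not real file contents
sample_simple_lines = ['model, models=>model', 'forecast, forecasts=>forecast']
lookup = build_simple_lookup(sample_simple_lines)

# 'computational models, algorithms' is one of the AGU terms listed in this notebook
matches = match_term_words('computational models, algorithms', lookup)
print(matches)
```

Words that map to `None` have no entry in the lookup and are candidates for new synonym entries. Note that this word-level split also breaks hyphenated terms such as "high-performance" into separate words, which may or may not be desirable.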


In [37]:
# phrase_to_search = pd_agu_terms['term'].iloc[1]
phrase_to_search = 'model'
print('trialing AGU index term = {}'.format(phrase_to_search))
print('output using wordnet ....\n\t')
synonym_antonym_extractor(phrase=phrase_to_search)

# for syn in wordnet.synsets("Spaces"):
#     print('{}: {}'.format(syn,syn.lemmas()))

trialing AGU index term = model
output using wordnet ....
	
{'mannequin', 'exemplar', 'manakin', 'pose', 'mould', 'role_model', 'theoretical_account', 'mold', 'good_example', 'posture', 'mock_up', 'modeling', 'pattern', 'fashion_model', 'framework', 'sit', 'mannikin', 'simulate', 'exemplary', 'poser', 'modelling', 'simulation', 'model', 'example', 'manikin'}
set()


In [10]:
for t in pd_agu_terms['term']:
    print('working on term {}'.format(t))
    
    

working on term informatics
working on term community modeling frameworks
working on term community standards
working on term computational models, algorithms
working on term cyberinfrastructure
working on term data assimilation, integration and fusion
working on term data management, preservation, rescue
working on term data mining
working on term data and information discovery
working on term decision analysis
working on term emerging informatics technologies
working on term forecasting
working on term formal logics and grammars
working on term geospatial
working on term gis science
working on term data and information governance
working on term high-performance computing
working on term international collaboration
working on term interoperability
working on term knowledge representation and knowledge bases
working on term machine-to-machine communication
working on term machine learning
working on term markup languages
working on term metadata
working on term metadata: provenance
wo