## Machine Learning Record Mining

This project builds a pipeline that uses GeoDeepDive's output to find unacquired sites for Neotoma.

Using NLP-parsed text and a data science approach, the pipeline identifies whether a paper is suitable for Neotoma and detects features such as 'Site Name', 'Location', 'Age Span' and 'Site Descriptions'.

In [1]:
# Loading libraries

import numpy as np
import pandas as pd
import csv
import psycopg2

In [214]:
# Options for DF display
pd.set_option('display.max_colwidth', 10)
pd.set_option('display.max_rows', 10)

## Loading and viewing the Data

### Loading NLP Sentences

In [199]:
# Start the PostgreSQL server from a terminal before running this cell:
# pg_ctl -D PSQL_Data -l logfile start

try:
    # TODO Move the SQL connection into a separate script
    conn = psycopg2.connect("dbname=gdd_database user=seiryu8808 password=")
    nlp_sentences = pd.read_sql_query('''SELECT * FROM sentences;''', conn)

# If the database is unavailable, fall back to the tab-separated file
except Exception:
    # Column names follow the SQL table (docid, sentid), which the cells below rely on
    header_list = ["docid", "sentid", "wordidx", "words", "poses", "ners",
                   "lemmas", "dep_paths", "dep_parents"]
    # poses = part of speech, ners = named-entity class,
    # dep_paths = dependency type, dep_parents = index of the word modified
    nlp_sentences = pd.read_csv("../Do_not_commit_data/sentences_nlp352", sep='\t', names=header_list)
    # Strip quotes and braces so each array column becomes a plain comma-separated string
    nlp_sentences = nlp_sentences.replace(r'["{}]', '', regex=True)
    # Split every array column into a Python list
    for column in header_list[2:]:
        nlp_sentences[column] = nlp_sentences[column].str.split(",")
    #nlp_sentences['wordidx'].str.len()
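
One caveat with the file fallback: stripping quotes and then splitting on `,` also splits tokens that are themselves commas. A minimal sketch of a stricter parser follows. It assumes the file stores each array as a PostgreSQL-style literal such as `{"The",",","bog"}` with backslash-escaped quotes, which has not been verified against the real file. `parse_pg_array` is a hypothetical helper, not part of the pipeline.

```python
import csv


def parse_pg_array(raw):
    """Split a string like '{"The",",","bog"}' into ['The', ',', 'bog']."""
    inner = raw.strip()
    # Drop the surrounding braces if they are present
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    if not inner:
        return []
    # csv.reader keeps a quoted "," as one token instead of splitting on it.
    # Assumption: embedded quotes are escaped with a backslash.
    reader = csv.reader([inner], quotechar='"', escapechar="\\", doublequote=False)
    return next(reader)


print(parse_pg_array('{"The",",","bog"}'))
print(parse_pg_array('{1,2,3}'))
```

If the assumption holds, the helper could be applied to the raw column values with `Series.apply(parse_pg_array)` before any quote stripping.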

In [201]:
nlp_sentences.head(10)

Unnamed: 0,docid,sentid,wordidx,words,poses,ners,lemmas,dep_paths,dep_parents
0,54b432...,1,"[1, 2,...",[Avail...,"[JJ, N...","[O, O,...",[avail...,"[dep, ...","[218, ..."
1,54b432...,2,"[1, 2,...","[The, ...","[DT, N...","[O, O,...","[the, ...","[det, ...","[4, 4,..."
2,54b432...,3,"[1, 2,...","[An, A...","[DT, N...","[O, O,...","[a, Ar...","[det, ...","[3, 3,..."
3,54b432...,4,"[1, 2,...","[C/N, ...","[JJ, N...","[O, O,...","[c/n, ...","[amod,...","[7, 7,..."
4,54b432...,5,"[1, 2,...",[Highe...,"[JJR, ...","[O, O,...",[highe...,"[amod,...","[2, 10..."
5,54b432...,294,"[1, 2,...",[Ander...,"[NNP, ...",[PERSO...,[Ander...,[compo...,"[3, 0,..."
6,54b432...,6,"[1, 2,...","[From,...","[IN, F...","[O, O,...","[from,...","[case,...","[2, 19..."
7,54b432...,7,"[1, 2,...","[C/N, ...","[JJ, N...","[O, O,...","[c/n, ...","[amod,...","[2, 3,..."
8,54b432...,8,"[1, 2,...",[Wette...,"[JJ, N...","[O, O,...",[wette...,"[amod,...","[2, 3,..."
9,54b432...,9,"[1, 2,...",[Highe...,"[JJR, ...","[O, O,...",[highe...,"[amod,...","[4, 4,..."


### Loading Bibliography Data

In [203]:
# TODO Load into SQL server and connect through SQL

bibliography = pd.read_json(r'../Do_not_commit_data/bibjson')

In [205]:
bibliography.head(10)

# TODO Figure out how to flatten nested JSON `[{}]`

Unnamed: 0,publisher,title,journal,author,year,number,volume,link,_gddid,identifier,type,pages
0,Elsevier,Palaeo...,{'name...,[{'nam...,1999,7,18.0,[{'url...,550453...,[{'typ...,article,945--960
1,Canadi...,Holoce...,{'name...,[{'nam...,1992,1,70.0,[{'url...,578b5a...,[{'typ...,article,6--18
2,Elsevier,Glacia...,{'name...,[{'nam...,1980,,,[{'url...,54b432...,[{'typ...,article,247--340
3,GSA,A reco...,{'name...,[{'nam...,2014,6,42.0,[{'url...,57c5b9...,[{'typ...,article,499--502
4,Elsevier,Plant ...,{'name...,[{'nam...,1981,1,16.0,[{'url...,54b432...,[{'typ...,article,66--79
5,Taylor...,10. Na...,{'name...,[{'nam...,2010,1,49.0,[{'url...,58d27c...,[{'typ...,article,79--81
6,Canadi...,Age ve...,{'name...,[{'nam...,1999,3,36.0,[{'url...,574629...,[{'typ...,article,383--393
7,Elsevier,Holoce...,{'name...,[{'nam...,2013,3,79.0,[{'url...,54b432...,[{'typ...,article,366--376
8,Elsevier,Synchr...,{'name...,[{'nam...,2009,2,72.0,[{'url...,54b432...,[{'typ...,article,234--245
9,Wiley,Contra...,{'name...,[{'nam...,2005,7-8,20.0,[{'url...,56f8f6...,[{'typ...,article,663--670
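
One possible way to handle the nested JSON TODO is sketched below. It uses a single record copied from the bibliography row for `57928e07cf58f133d1c26609` shown later in this notebook. `pd.json_normalize` expands nested dicts into dotted column names but leaves lists of dicts untouched, so those are reduced by hand. The output column names `authors`, `doi` and `url` are illustrative choices, and the sketch assumes every record has the same shape as this one.

```python
import pandas as pd

# One record shaped like the bibjson entries displayed in this notebook
records = [{
    "_gddid": "57928e07cf58f133d1c26609",
    "title": "Timberline fluctuations and late Quaternary paleoclimates "
             "in the Southern Rocky Mountains, Colorado",
    "journal": {"name": {"name": "Geological Society of America Bulletin"}},
    "author": [{"name": "Fall, Patricia L."}],
    "link": [{"url": "http://dx.doi.org/10.1130/0016-7606(1997)109<1306:tfalqp>2.3.co;2",
              "type": "publisher"}],
    "identifier": [{"type": "doi",
                    "id": "10.1130/0016-7606(1997)109<1306:tfalqp>2.3.co;2"}],
}]

# Nested dicts become dotted columns, e.g. 'journal.name.name'
flat = pd.json_normalize(records)

# Lists of dicts are not expanded automatically, so reduce them by hand
flat["authors"] = flat["author"].apply(
    lambda items: "; ".join(a["name"] for a in items))
flat["doi"] = flat["identifier"].apply(
    lambda items: next((i["id"] for i in items if i["type"] == "doi"), None))
flat["url"] = flat["link"].apply(
    lambda items: items[0]["url"] if items else None)

flat = flat.drop(columns=["author", "identifier", "link"])
print(flat.columns.tolist())
```

Records with missing or differently shaped fields would need extra guards before this could run on the full file.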


## EDA

Reviewing the data involves skimming some of the papers online and checking that their content is consistent with the NLP sentences dataframe.

From there, we can also note, from a human perspective, what we would like the model to predict: 'Location', 'Site Name', 'Age Span', and 'Site Descriptions'.

In [206]:
def order_article(article_id):
    '''
    Find an article by its gddid in the NLP sentences and return it ordered by sentence.
    
    Keyword arguments:
    article_id -- gddid of the article (matched against the docid column)
    
    Returns:
    DataFrame with the article's sentid and words columns, sorted by sentid
    '''
    article = nlp_sentences[nlp_sentences['docid'] == article_id]
    return article[['sentid', 'words']].sort_values(by = 'sentid')
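
Token lists are hard to skim, so a small helper that joins them back into readable sentences can make comparison with the online paper easier. The sketch below is one way to do that. `tokens_to_text` and `article_text` are hypothetical names, and the bracket mapping assumes the Penn Treebank escapes (`-LRB-`, `-RRB-`) that appear in the token lists. The toy frame reuses two sentences shown in this notebook.

```python
import pandas as pd

# Penn Treebank bracket escapes mapped back to their literal characters
PTB_BRACKETS = {"-LRB-": "(", "-RRB-": ")", "-LSB-": "[",
                "-RSB-": "]", "-LCB-": "{", "-RCB-": "}"}


def tokens_to_text(words):
    """Join a token list into one space-separated string."""
    return " ".join(PTB_BRACKETS.get(word, word) for word in words)


def article_text(sentences, article_id):
    """Return the article's sentences as strings, ordered by sentid."""
    article = sentences[sentences["docid"] == article_id].sort_values(by="sentid")
    return article["words"].apply(tokens_to_text).tolist()


# Toy frame with the same columns as nlp_sentences, deliberately out of order
toy = pd.DataFrame({
    "docid": ["54b43266e138239d8684efed"] * 2,
    "sentid": [573, 2],
    "words": [
        ["Geology", "23", ",", "411", "--", "414", "."],
        ["The", "Chihuahueños", "Bog", "record", "extends", "to", "over",
         "15,000", "cal", "yr", "BP", "."],
    ],
})

for line in article_text(toy, "54b43266e138239d8684efed"):
    print(line)
```

Spacing around punctuation is left as tokenized, which is enough for skimming but not a faithful reconstruction of the original text.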

In [215]:
order_article('550453fde1382326932d85f7')

Unnamed: 0,sentid,words
94865,1,[Quate...
94866,2,[Plant...
94867,3,[Quant...
94868,4,"[The, ..."
94870,5,[Surfa...
...,...,...
95325,448,[Veget...
95326,449,[Quate...
95327,450,[Zolit...
95329,451,[Sedim...


Skimmed info:  
* `Link`: http://www.sciencedirect.com/science/article/pii/S0277379199000074
* `Site Name`:  
* `Location`:  
* `Age Span`:   
* `Site Descriptions`:    

In [189]:
order_article('54b43266e138239d8684efed')

Unnamed: 0,sentid,words
0,1,"[Available, online, at, www.sciencedirect.com, Quaternary, Research, 69, -LRB-, 2008, -RRB-, 263..."
1,2,"[The, Chihuahueños, Bog, record, extends, to, over, 15,000, cal, yr, BP, .]"
2,3,"[An, Artemisia, steppe, ,, then, an, open, Picea, woodland, grew, around, a, small, pond, until,..."
3,4,"[C/N, ratios, ,, δ13C, and, δ15N, values, indicate, both, terrestrial, and, aquatic, organic, ma..."
4,5,"[Higher, percentages, of, aquatic, algae, and, elevated, C/N, ratios, indicate, higher, lake, le..."
...,...,...
605,572,"[Is, the, Valles, caldera, entering, a, new, cycle, of, activity, ?]"
144,573,"[Geology, 23, ,, 411, --, 414, .]"
607,574,"[Wright, Jr., ,, H.E., ,, Bent, ,, A.M., ,, Hansen, ,, B.S., ,, Maher, Jr., ,, L.J., ,, 1973, .]"
608,575,"[Present, and, past, vegetation, of, the, Chuska, Mountains, ,, northwestern, New, Mexico, .]"


Skimmed info:  
* `Link`: http://www.sciencedirect.com/science/article/pii/S0033589407001512
* `Site Name`:  
* `Location`:  
* `Age Span`:   
* `Site Descriptions`:    

In [190]:
order_article('57c5b941cf58f1338eaddb5b')

Unnamed: 0,sentid,words
67708,1,"[A, record, of, sustained, prehistoric, and, historic, land, use, from, the, Cahokia, region, ,,..."
67709,2,"[Here, we, report, a, high-resolution, and, multiproxy, paleoecological, record, from, Horseshoe..."
67710,3,"[Palynological, and, carbon, isotope, data, document, pronounced, vegetation, changes, over, the..."
67712,4,"[Rapid, forest, clearance, was, followed, closely, by, the, proliferation, of, indigenous, seed,..."
67713,5,"[Agricultural, intensiﬁcation, that, included, the, use, of, maize, -LRB-, Zea, mays, subsp, .]"
...,...,...
67845,123,"[Simon, ,, M.L., ,, and, Parker, ,, K.E., ,, 2006, ,, Prehistoric, plant, use, in, the, American..."
67846,124,"[Smith, ,, B.D., ,, and, Yarnell, ,, R.A., ,, 2009, ,, Initial, formation, of, an, indigenous, c..."
67848,125,"[Sugita, ,, S., ,, 1993, ,, A, model, of, pollen, source, area, for, an, entire, lake, surface, ..."
67849,126,"[Trubitt, ,, M.B.D., ,, 2000, ,, Mound, building, and, prestige, goods, exchange, :, Changing, s..."


Skimmed info:  
* `Link`: http://dx.doi.org/10.1130/g35541.1 (no full access to the article)
* `Site Name`:  
* `Location`:  
* `Age Span`:   
* `Site Descriptions`:    

In [207]:
order_article('58d29193cf58f14928755ba5')

Unnamed: 0,sentid,words
110659,1,[Grana...
110660,2,[Peat-...
110661,3,[Peat-...
110662,4,[Submi...
110665,5,[Peat-...
...,...,...
110740,80,"[Ann, .]"
110741,81,[Sofia...
110742,82,"[Fac, .]"
110743,83,"[Geol, .]"


Skimmed info:  
* `Link`: http://www.tandfonline.com/doi/abs/10.1080/00173130902965157
* `Site Name`:  
* `Location`:  
* `Age Span`:   
* `Site Descriptions`:    

In [192]:
order_article('57928e07cf58f133d1c26609')

Unnamed: 0,sentid,words
39606,1,"[Timberline, fluctuations, and, late, Quaternary, paleoclimates, in, the, Southern, Rocky, Mount..."
39607,2,"[By, tracking, climatically, sensitive, forest, boundaries, ,, the, moisturecontrolled, lower, t..."
39608,3,"[Pollen, data, suggest, that, prior, to, 11, 000, yr, B.P., ,, a, subalpine, forest, dominated, ..."
39609,4,"[The, inferred, climate, was, 2Ð5, ¡, C, cooler, and, had, 7Ð16, cm, greater, precipitation, tha..."
39611,5,"[Abies, -LRB-, fir, -RRB-, increased, in, abundance, in, the, subalpine, forest, around, 11, 000..."
...,...,...
40107,448,"[Weber, ,, W., A., ,, 1987, ,, Colorado, flora, :, Western, slope, :, Boulder, ,, Colorado, Asso..."
40108,449,"[Whitlock, ,, C., ,, 1993, ,, Postglacial, vegetation, and, climate, of, Grand, Teton, and, sout..."
40110,450,"[Whitlock, ,, C., ,, and, Bartlein, ,, P., J., ,, 1993, ,, Spatial, variation, of, Holocene, cli..."
40111,451,"[Wright, ,, H., E., ,, Jr., ,, 1983, ,, Late-Quaternary, environments, of, the, United, States, ..."


In [211]:
bibliography[bibliography['_gddid'] == '57928e07cf58f133d1c26609']

Unnamed: 0,publisher,title,journal,author,year,number,volume,link,_gddid,identifier,type,pages
93,GSA,"Timberline fluctuations and late Quaternary paleoclimates in the Southern Rocky Mountains, Colorado",{'name': {'name': 'Geological Society of America Bulletin'}},"[{'name': 'Fall, Patricia L.'}]",1997,10,109,"[{'url': 'http://dx.doi.org/10.1130/0016-7606(1997)109<1306:tfalqp>2.3.co;2', 'type': 'publisher'}]",57928e07cf58f133d1c26609,"[{'type': 'doi', 'id': '10.1130/0016-7606(1997)109<1306:tfalqp>2.3.co;2'}]",article,1306--1320
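
Looking up one article at a time works for skimming. To review many papers at once, the two frames can be joined on their shared identifier. The sketch below assumes the sentence frame uses `docid` and the bibliography uses `_gddid`, as in the outputs above. The ids, titles and years in the toy data are hypothetical placeholders.

```python
import pandas as pd

# Toy stand-ins for nlp_sentences and bibliography (hypothetical values)
sentences = pd.DataFrame({
    "docid": ["doc_a", "doc_a", "doc_a", "doc_b"],
    "sentid": [1, 2, 3, 1],
})
bib = pd.DataFrame({
    "_gddid": ["doc_a", "doc_b", "doc_c"],
    "title": ["Paper A", "Paper B", "Paper C"],
    "year": [1997, 2014, 2008],
})

# Count the parsed sentences available for each document
counts = sentences.groupby("docid").size().rename("n_sentences").reset_index()

# A left join keeps bibliography entries that have no parsed sentences
summary = (bib.merge(counts, left_on="_gddid", right_on="docid", how="left")
              .drop(columns="docid"))
summary["n_sentences"] = summary["n_sentences"].fillna(0).astype(int)
print(summary)
```

A zero in `n_sentences` flags a bibliography entry whose text is missing from the sentences table.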


Skimmed info:  
* `Link`: http://dx.doi.org/10.1130/0016-7606(1997)109<1306:tfalqp>2.3.co;2
* `Site Name`:  
* `Location`:  
* `Age Span`:   
* `Site Descriptions`:    