### Project Kojak

** The problem  **

We will attempt to identify amboguously defined words - words that are homographs (spelled the same, but with multiple meanings) and determine the exact meaning of the word from a context window.

Here we attempt to do this in a few stages
1. train a word embedding on some training corpus using skip-gram (Here we use 1000 sholarly research papers) 
2. identify common homographs and extract the various context windows
3. interpret the context windows as vectors in the embedding space and appy a clustering algorith (DBSCAN). Each cluster is interpreted as a distinct definition of the homograph. Each cluster then is representative vector.
4. apply to a test corpus - match context of given homograph to most similar group.


### This notebook

Loads a pre-trained word embedding model, uses DBSCAN clustering to identify several 'definitions' of a set of homographs, and saves those definitions for later use.

In [1]:
import gensim
import json
import os
import re
import time
from nltk.corpus import stopwords
from nltk import tokenize
from nltk import pos_tag
from pprint import pprint



Using Theano backend.


In [2]:
# Declare stopwords, preprocess the data from source file

stop = stopwords.words('english')
stop += ['?','!',':',';','[',']','[]','“' ]
stop += ['.', ',', '(', ')', "'", '"',"''",'""',"``",'”', '“', '?', '!', '’', 'et', 'al', 'al.']
stop = set(stop)

class MyPapers(object):
    # a memory-friendly way to load a large corpora
     def __init__(self, dirname):
            self.dirname = dirname
 
     def __iter__(self):
        with open(self.dirname) as data_file:    
            data = json.load(data_file)
        # iterate through all file names in our directory
        for paper in data:
            sentences = tokenize.sent_tokenize(paper['full_text'])
            for sentence in sentences:
                try:
                    line = re.sub(r'[?\.,!:;\(\)“\[\]]',' ',sentence)
                    line = [word for word in line.lower().split() if word not in stop]
                    yield line
                except:
                    print("Empty line found")
                    continue
                

In [3]:
#Instantiate iterable on the data

#papers is an iterable of scholarly papers, tokenized for prcessing
papers = MyPapers('data/train_data.json') 


In [4]:
def find_sentence(json_file, word_list):
    words = []
    for w in word_list:
        for _ in w.split('_'):
            words.append(_)
    for paper in json_file:
        for sentence in tokenize.sent_tokenize(paper['full_text']):
            if all(word in sentence.lower() for word in words):
                return sentence

## Word embeddings

Import word2vec word embeddings trained on 2848 scholarly journal articles

In [5]:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.cluster import AgglomerativeClustering
from functools import reduce

In [6]:
model = gensim.models.word2vec.Word2Vec.load("data/journal.txt")

In [7]:
model.corpus_count

182533

In [8]:
vectors = model.wv

In [9]:
vocab = vectors.vocab

In [10]:
len(vectors.vocab)

127777

In [None]:
vectors

** contexts to vectors **

In [None]:
from collections import Counter, defaultdict
# The function takes as arguments a list of tokenized documents and a window size
# and returns each word in the document along with its window context as a tuple

def generate_word_counts(documents):
    counts = Counter()
    
    for document in documents:        
        for word in set(document):                    
            counts[word] += 1
            
    return counts

# Takes list of word tokens as arguments
# Returns a list of vectors whose components are the arithmetic mean of the 
# corresponding component of all of the input vectors

def get_vectors(word_list):
    vecs = []
    for word in word_list:
        try:
            vecs.append(vectors[word])
        except:
            print("{} missing from vocabulary".format(word))
            #continue
    return vecs

# Takes list of words as arguments
# Returns a single vector whose components are the arithmetic mean of the 
# corresponding component of all of the input vectors

def vector_average(vector_list):
    #words = [x for x in word_list if x in vocab]
    #vector_list = get_vectors(words)
    A = np.array(vector_list)
    dim = A.shape[0]
    ones = np.ones(dim)
    return ones.dot(A)/len(vector_list)

# Takes list of tokenized documents, target word and window size as arguments
# Returns list of vectors where each vector represents the context window 
# of the target word in the word embedding space

def context2vectors(documents,target):

    context_vectors = []

    for document in documents:
        sentence = document
        if target in sentence:
            #str_sentence = streamlined_sentence(sentence)
            context_vectors.append(vector_average(get_vectors(sentence)))
                    
    return list(context_vectors)


# Takes list of vectors as arguments
# Returns a single vector whose components are the arithmetic mean of the 
# corresponding component of all of the input vectors weighted by Inverse Document Frecuency

def vector_average2(words): #, word_counts, vectors):
    
    total = sum(list(word_counts.values()))
    vocab = set(vectors.vocab.keys())
    words = [x for x in words if x in vocab]
    vector_list = list(map((lambda x: vectors[x]*np.log((1 + total)/(1 + word_counts[x]))),words))
    
    if len(vector_list) == 0:
        return 0
    elif len(vector_list) == 1:
        vector_sum = vector_list[0]
    else:
        vector_sum = reduce((lambda x,y: np.add(x,y)),vector_list)
        
    weighted_average = (1.0/np.linalg.norm(vector_sum))*vector_sum
    
    return weighted_average

# Takes list of tokenized documents, target word and window size as arguments
# Returns list of vectors where each vector represents the context window 
# of the target word in the word embedding space

def context2vectors2(documents,target):

    context_vectors = []

    for document in documents:
        sentence = document
        if target in sentence:
            i = sentence.index(target)
            #str_sentence = streamlined_sentence(sentence)
            s = max(i-6,0)
            e = min(i+6, len(sentence))
            context_vectors.append(vector_average2(sentence[s:e]))
                    
    return context_vectors

In [None]:
def MyPapers_plus(papers):
    
    phrases = gensim.models.phrases.Phrases(sentences = papers, min_count = 5, threshold = 150)
    bigram = gensim.models.phrases.Phraser(phrases)
    phrases2 = gensim.models.phrases.Phrases(sentences = bigram[papers], min_count = 5, threshold = 300)
    trigram = gensim.models.phrases.Phraser(phrases2)
    
    return trigram[bigram[papers]]

In [None]:
word_counts = generate_word_counts(MyPapers_plus(papers))

In [None]:
word_counts['new_york_city']

In [None]:
#dictionary = gensim.corpora.dictionary.Dictionary(MyPapers_plus(papers))
#text = [dictionary.doc2bow(c) for c in MyPapers_plus(papers)]

** Clustering with DBSAN **

Use DBSCAN to determine similar usages of the target homographs. Each of these similar usages will be combined to a representative vector in the embedding space and constitute a "definition" of that word.

In [None]:
def streamlined_sentence(sentence):
    POS = {'JJ','JJR','JJS','NN','NNS','NNP','NNPS','VB','VBD','VBG','VBN','VBD','VBZ'}
    st_sent = [word[0] for word in pos_tag(sentence) if word[1] in POS]
    return st_sent

In [None]:
#Arguments: The desired cluster number, a list of documents making up the corpus, the target homograph
#           a list of labels for the conext sentences indicating the homographs usage, and window size
# The function prints the representative context windows for the target word within the desired cluster   

def print_cluster_context(cluster_number, documents, target, labels):
    
    context_vectors = []

    for document in documents:
        sentence = document
        if target in sentence:
            str_sentence = streamlined_sentence(sentence)
            context_vectors.append(str_sentence)
            
    for i, label in enumerate(labels):
        if label == cluster_number:
            print(context_vectors[i])
            
#Arguments: The desired cluster number, a list of documents making up the corpus, the target homograph
#           a list of labels for the conext sentences indicating the homographs usage, and window size
# The function prints the representative context windows for the target word within the desired cluster   

def cluster_context(documents, target, labels):
    
    context_sentences = []

    for document in documents:
        sentence = document
        if target in sentence:
            str_sentence = streamlined_sentence(sentence)
            context_sentences.append(str_sentence)
                     
    clustered_sentences = defaultdict(list)                
    for i, label in enumerate(labels):
        clustered_sentences[label].append(context_sentences[i])
    
    return clustered_sentences

# Arguments: List of vectors, each representing a context sentence and a list of labels 
#            corresponding to the context vectors
# Returns:   A dictionary where keys are the identified labels from clustering and the value is a single 
#            representing the cluster

def identify_definition(context_vectors, labels):
    
    cluster_numbers = set(labels)
    definitions = dict()
    cluster_vectors = defaultdict(list)
    
    if len(set(labels)) == 1:
        print("No consistent definition found")
        definitions[0] = np.zeros(len(context_vectors[0]))
        return definitions
    
    for i, label in enumerate(labels):
        cluster_vectors[label].append(context_vectors[i])
    
    for key in cluster_vectors.keys():
        if key < 0:
            continue
        else:
            v = vector_average(cluster_vectors[key])
            definitions[key] = v/np.linalg.norm(v)
                    
    return definitions


In [18]:
test = MyPapers('data/testing_data.json')

In [19]:
target = u'state'
context_vectors = context2vectors2(MyPapers_plus(papers), target)

In [114]:
epsilon = .115

In [115]:
dbscan = DBSCAN(eps = epsilon, metric = 'cosine', algorithm = 'brute', min_samples = 5)
dbscan.fit(context_vectors)

DBSCAN(algorithm='brute', eps=0.115, leaf_size=30, metric='cosine',
    min_samples=5, n_jobs=1, p=None)

In [116]:
labels = dbscan.labels_
n_clusters = len(set(labels)) # - (1 if -1 in labels else 0)
print(n_clusters)

12


In [117]:
count = 0 
while count < len(labels):
    print(labels[count:count+20])
    count += 20 
    

[-1 -1  0  0  0  0 -1  0  0  0  0  0 -1 -1 -1  0 -1  0  0 -1]
[ 0  0 -1 -1  0 -1  0  0 -1 -1  0 -1 -1  0  0 -1  0  0 -1  0]
[ 0  0  0 -1 -1  0  0 -1 -1  0  0  0  0  0  0  0  0  0  0 -1]
[ 0 -1  0 -1  0  0  0  0 -1  0  0 -1 -1  0  0  0 -1 -1 -1 -1]
[ 0 -1  0 -1  0  0 -1 -1  0 -1 -1  0 -1  0  0  0 -1 -1  0  0]
[-1  0  0 -1  0 -1 -1 -1 -1  0 -1 -1 -1  0 -1  0  0 -1  0  0]
[ 0  0  0 -1  0  0 -1 -1  0  0  0 -1 -1 -1  0  0  0  0  0  0]
[ 0  0  0  0  0  0 -1 -1 -1  0  0 -1 -1 -1  0 -1  0  0 -1  0]
[ 0 -1  0 -1 -1 -1  0  0  0  0  0 -1 -1 -1 -1  8 -1 -1 -1  0]
[-1 -1 -1 -1  0 -1 -1 -1 -1  0 -1  0 -1  0  0  0 -1 -1 -1 -1]
[ 0 -1 -1 -1  0  0  0  0 -1 -1 -1  0 -1 -1 -1 -1 -1  0 -1 -1]
[ 0  0  0  0 -1  0 -1 -1  0  0  0 -1  0  0 -1  0 -1 -1  0 -1]
[-1 -1 -1 -1 -1  0 -1 -1 -1  0  0  0  0  0  0  0  0  0 -1  0]
[-1 -1  0  0  0 -1 -1 -1  7  0 -1  0 -1 -1 -1  0  0  0 -1  0]
[ 0  0  0 -1 -1 -1 -1 -1 -1  0 -1 -1  0 -1 -1 -1 -1  0 -1  0]
[-1 -1 -1 -1  0 -1  1  2  1  1  1  2  1  1  2 -1  1 -1  1 -1]
[ 0 -1  

In [None]:
len(labels)

In [95]:
target_definitions = identify_definition(context_vectors, labels)

In [24]:
def read_glossary(glossary):
    
    vector_glossary = dict()
    
    for k, v in glossary.items():
        vector_glossary[k] = {key:vector_average2(tokenize.word_tokenize(value)) for (key,value) in v.items()}
    
    return vector_glossary

def get_target_sentences(documents, target):
    
    context_sentences = []

    for document in documents:
        #print(document[:15])
        sentence = document
        if target in sentence:
            #str_sentence = streamlined_sentence(sentence)
            #print[str_sentence]
            sentence.remove(target)
            context_sentences.append(sentence)
            
    return context_sentences



In [96]:
target_sentences = get_target_sentences(MyPapers_plus(test), target)

In [97]:
for s in target_sentences:
    d = define_target_from_sentence(s, target_definitions)
    print("{}\nDefined as {} \n".format(s, d))

[(0.12806673202069829, 0), (0.29771833157340333, 1), (0.3877854064279056, 2)]
['although', 'evident', 'communist', 'east', 'germany', 'unjust', 'many', 'people', 'consider', 'prosecution', 'condemnation', 'leading', 'individuals', 'gdr', 'form', 'victor’s', 'justice']
Defined as 0 

[(0.15005540137358664, 0), (0.27383506746701536, 1), (0.37858901890537766, 2)]
['researchers', 'believe', 'take', 'time', 'chc_teachers', 'change', 'habits', 'traditional', 'teaching', 'interact', 'newly', 'designed', 'social', 'constructivism-based', 'science', 'curriculum', 'challenges', 'vietnamese', 'chc_teachers', 'need', 'deep', 'understanding', 'scientific', 'content', 'knowledge', 'vietnamese_teachers', 'find', 'teaching', 'learning', 'scientific', 'argumentation', 'difficult', 'align', 'concerns', 'researchers', 'educators', 'science', 'teaching', 'many', 'primary', 'classrooms', 'recently']
Defined as 0 

[(0.068144872437592374, 0), (0.20233850878751714, 1), (0.35691849132607811, 2)]
['specificall

[(0.22944286943619208, 2), (0.25922620760269011, 0), (0.39220721360012334, 1)]
['16', 'hectare', '38', 'acre', 'area', 'north', 'carolina', 'university’s', 'campus']
Defined as 2 

[(0.17393984242034055, 0), (0.24476494151610351, 1), (0.40516353128964089, 2)]
['non-steady', 'operation', 'one', 'needs', 'take_account', 'heat', 'dynamics', 'occur', 'near', 'inner', 'surface', 'furnace', 'walls', 'coupled', 'dynamics', 'radiant', 'heat_transfer', 'heat_transfer', 'within', 'steel', 'strip']
Defined as 0 

[(0.25000465240492331, 0), (0.39004282661893175, 1), (0.46246980577886077, 2)]
['invaded', 'pastoral', 'agro-pastoral', 'areas', 'afar', 'regional', 'dire_dawa_administration']
Defined as 0 

[(0.15522134991243985, 0), (0.21175259054383933, 1), (0.36751160970624031, 2)]
['present', 'work', 'aims', 'model', 'shape', 'memory', 'effect', 'axially', 'compressed', 'polymeric', 'rod', 'post-buckled', 'equilibrium']
Defined as 0 

[(0.20562528184846107, 0), (0.24648732532580442, 1), (0.31011430

[(0.14819118849474056, 0), (0.28102060489278879, 1), (0.43681387275774286, 2)]
['growth-enhancing', 'structural', 'changes', 'observed', 'institutions', 'production', 'adequately', 'enforced', 'khan']
Defined as 0 

[(0.097888589079507704, 0), (0.21098189565774006, 1), (0.31521311255595053, 2)]
['determines', 'rate', 'firm', 'level', 'innovation', 'diversification', 'economy', 'length', 'job', 'ladders', 'direction', 'structural', 'change', 'many', 'sources', 'structural', 'transformation', 'grouped', 'two', 'broad', 'categories', '1', 'intervention', '2', 'external', 'shocks']
Defined as 0 

[(0.18855841362182302, 0), (0.35987602093996951, 1), (0.47326012537616879, 2)]
['intervention', 'encompasses', 'deliberate', 'change', 'market', 'incentives', 'creation', 'destruction', 'markets']
Defined as 0 

[(0.23696074840474446, 0), (0.38351706694614918, 1), (0.42806284766470837, 2)]
['experience', 'emerging_markets', 'east_asia', 'old', 'europe', 'nineteenth', 'twentieth', 'centuries', 'exa

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

# Recording Definitions

Use the DBSCAN clusters to determine the various definitions of a word, then create a dictionary for the word

In [25]:
def define_target_from_sentence(sentence, dictionary):
    cosine_dists = []
    sv = vector_average2(sentence)
    sentence_vector = sv/np.linalg.norm(sv)
    for k,v in dictionary.items():
        cosine_dists.append((1 - np.dot(sentence_vector, v),k))
    cosine_dists.sort()
    print(cosine_dists)
    return cosine_dists[0][1]
    
def extract_dictionary(papers, homographs):
    dictionary = dict()
    for word in homographs:
        print("Calculating context vectors for \"{}\"".format(word))
        context_vectors = context2vectors2(MyPapers_plus(papers), word)
        print("Clustering...")
        if len(context_vectors)<1:
            print("no example for \"{}\" found.".format(word))
            continue
        dbscan = DBSCAN(eps = epsilon, metric = 'cosine', algorithm = 'brute', min_samples = 3)
        dbscan.fit(context_vectors)
        labels = dbscan.labels_
        print("found {} distinct definitions".format(len(set(labels))))
        print("Building definitions for \"{}\"".format(word))
        dictionary[word] = identify_definition(context_vectors, labels)
        
    print("Dictionary complete")    
    return dictionary    

In [53]:
epsilon = 0.1
homographs = ['charge','state', 'train']

dictionary = extract_dictionary(papers, homographs)

Calculating context vectors for "charge"
Clustering...
found 5 distinct definitions
Building definitions for "charge"
Calculating context vectors for "state"
Clustering...
found 7 distinct definitions
Building definitions for "state"
Calculating context vectors for "train"
Clustering...
found 3 distinct definitions
Building definitions for "train"
Dictionary complete


In [138]:
dictionary['charge'].keys()

dict_keys([0, 1, 2, 3, 4])

## Testing

In [None]:
# TEST WINDOW #
epsilon = 0.12
target = u'charge'
with open('data/testing_data.json') as f:
    file = json.load(f)
    
context_vectors = context2vectors2(MyPapers_plus(papers), target)
dbscan = DBSCAN(eps = epsilon, metric = 'cosine', algorithm = 'brute', min_samples = 4)
dbscan.fit(context_vectors)
labels = dbscan.labels_
#target_definitions = identify_definition(context_vectors, labels)
target_definitions = dictionary[target]


contexts = cluster_context(MyPapers_plus(papers), target, labels)
correct = 0
wrong = 0
for c in contexts:
    for window in contexts[c]:
        d = define_target_from_sentence(window, target_definitions)
        if c == d:
            correct += 1
        else:
            print("Cluster {}, defined as {} \n".format(c, d), find_sentence(file,window), '\n')
            if c != -1:
                wrong += 1
                
print("Number correct: {}\nNumber wrong: {}".format(correct, wrong))

In [137]:
target = u'charge'
with open('data/testing_data.json') as f:
    file = json.load(f)
context_sentences = context(MyPapers_plus(test), target)
#print(context_sentences)

for c in context_sentences:
    d = define_target_from_sentence(c, dictionary[target])
    #print(c)
    print("{}\nDefined as {} \n".format(find_sentence(file,c),d))



[(0.055669052328530477, 0), (0.16997656549947227, 3), (0.18869912453325421, 1), (0.20350545291678956, 4), (0.22427200788766244, 2)]
But the most significant implication is that the way that H is incorporated in olivine may vary with the fugacity of HO, depending on whether the H atoms needed to achieve charge balance are completely associated with the point defect by being bonded to the oxygen atoms surrounding it in specific locations, or are disordered over the lattice by being bonded to oxygen atoms without regard to location.
Defined as 0 

[(0.070765776459314367, 0), (0.18354397038542425, 1), (0.20159771967971685, 3), (0.26746119951088776, 2), (0.28529326483576622, 4)]
For example, if the four H atoms of the [Si] mechanism are bonded to oxygen atoms surrounding the Si site vacancy to produce local charge balance (that is short-range order), as recently shown by Xue et al.
Defined as 0 

[(0.11366215858995266, 0), (0.16091510984935953, 1), (0.19528214471626981, 3), (0.2607369904631

After this, a training set of 1249 samples (423 positive samples and 826 negative samples) and a test set of 312 samples (105 positive samples and 207 negative samples) were obtained.In the feature calculation step, 203 descriptor were calculated including 30 constitution descriptors, 44 connectivity indices, 7 kappa indices, 32 Moran auto-correction descriptors, 5 molecular properties, 25 charge descriptors and 60 MOE-Type descriptors.
Defined as 0 

[(0.10224518372763458, 0), (0.16258850563945493, 4), (0.16728005781188271, 3), (0.18051267370454949, 2), (0.30600007017572417, 1)]
This has resulted in significant new functionality in core application programming interfaces (APIs) while maintaining the quality of code depending on those core APIs.Examples of new features supported by the improved development model include InChI functionality [], greatly improved ring detection algorithms [], improvements to the core atom type perception module that now covers a much more comprehensive se

## Validate glossary approach

### STATE

In [None]:
charge_def = {1:"(criminal law) a pleading describing some wrong or offense",
              2:"(explosive) a quantity of explosive to be set off at one time",
              3:"(physics) the quantity of unbalanced electricity in a body (either positive or negative) and construed as an excess or deficiency of electrons",
              4:"(finance) request for payment or fee in exchange for a good or service",
              5:"the responsibility of taking care or control of someone or something."}

state_def = {1:"the particular condition that someone or something is in at a specific time",
            2:"a nation or territory considered as an organized political community under one government",
            3:"express something definitely or clearly in speech or writing"}

train_def = {1:"teach (a person or animal) a particular skill or type of behavior through practice and instruction over a period of time.",
            2: "a series of railroad cars moved as a unit by a locomotive or by integral motors."}

attribute_def = {1:"a quality or feature that is as a characteristic of a person or thing",
                2:"regard something as being caused by someone or something"}

glossary = { 'charge':charge_def, 'state':state_def, 'train':train_def, 'attribute':attribute_def}

In [3]:
import pandas as pd

df = pd.read_pickle('first_experiments/STATE_val_df_labeled')

In [28]:
windows = list(df.window)
labels = list(df.sense_labels)
sentences = list(df.sentences)

In [27]:
g = read_glossary(glossary)

In [37]:
target = u'state'
#with open('data/testing_data.json') as f:
#    file = json.load(f)
context_sentences = MyPapers_plus(windows)
#print(context_sentences)

results = []
for i, c in enumerate(context_sentences):
    print(i)
    d = define_target_from_sentence(c, g[target])
    print("{}\nDefined as {} \nLabeled as {}".format(sentences[i],d, labels[i]))
    results.append((labels[i],d))



0
[(0.23731625080108643, 1), (0.3905489444732666, 3), (0.39849257469177246, 2)]
In particular we show that the power spectrum of large-scale neocortical activity has a Brownian motion baseline  and that the statistical structure of the random bursts of spiking activity found near the resting state indicates that such a state can be represented as a percolation process on a random graph  called  percolation Other data indicate that resting cortex exhibits pair correlations between neighboring populations of cells  the amplitudes of which decay slowly with distance  whereas stimulated cortex exhibits pair correlations which decay rapidly with distance 
Defined as 1 
Labeled as 1
1
[(0.24511873722076416, 1), (0.317055344581604, 3), (0.34091228246688843, 2)]
The realistic national relations are formed by various factors in the history and the multiple actions of modern people  so that they are not caused and restricted by a certain factor Beyond doubt  the significance of methods is multip

22
[(0.26863282918930054, 2), (0.38878738880157471, 1), (0.39108192920684814, 3)]
This is significant because employment relations are regulated by the Labor Law rather than by state policies 
Defined as 2 
Labeled as 2
23
[(0.25346779823303223, 2), (0.28704023361206055, 1), (0.33847665786743164, 3)]
Although the increasing importance of laws and markets partially frees workers from their past economic and political dependence on the state  it nevertheless enhances their dependence on market forces This paper principally draws on approaches that rest upon qualitative traditions to critically analyze challenges facing knowledge workers  particularly editors in the publishing industry  during China’s on-going social transformation 
Defined as 2 
Labeled as 2
24
[(0.2575223445892334, 2), (0.26773428916931152, 1), (0.29634344577789307, 3)]
Results show that companies owned or controlled by the state are more likely to form interlocking networks  and these relations tend to emerge among com

In [45]:
report = np.zeros(9)
report = report.reshape(3,3)

for result in results:
    report[result[0]-1][result[1]-1] += 1
    
print(report)

[[ 94.   8.  40.]
 [ 22.  73.  15.]
 [ 14.   1.   4.]]


### TRAIN

In [23]:
test = MyPapers('data/validation_data.json')

In [124]:
target = u'train'
with open('data/validation_data.json') as f:
    file = json.load(f)
context_sentences = context(MyPapers_plus(test), target)
#print(context_sentences)

for c in context_sentences:
    d = define_target_from_sentence(c, g[target])
    #print(c)
    print("{}\nDefined as {} \n".format(find_sentence(file,c),d))



[(0.20578956604003906, 1), (0.29383689165115356, 2)]
military, fire fighters and labourers) [], who rely upon physical and mental proficiencies to compete or train at elite levels and remain productive in the workforce.
Defined as 1 

[(0.17394262552261353, 1), (0.24395477771759033, 2)]
To help the clarification process, face-to-face-communication would always be preferable over e-mails, webinars or other mere technical means and one of the main rules for a double bind-free organisation should be: Talk  each other instead of  each other!All staff members on each level should therefore train and maintain a neutral attitude towards all their colleagues, supervisors and co-workers.
Defined as 1 

[(0.18678128719329834, 1), (0.22531867027282715, 2)]
The proposed extracting method improves (1) the extraction quality of the system by increasing the precision of entity extraction as a result of the initial extracted entities from our method; our highly precised entities will be used as a seed

In [125]:
target = u'charge'
with open('data/testing_data.json') as f:
    file = json.load(f)
context_sentences = context(MyPapers_plus(test), target)
#print(context_sentences)

for c in context_sentences:
    d = define_target_from_sentence(c, g[target])
    #print(c)
    print("{}\nDefined as {} \n".format(find_sentence(file,c),d))



[(0.14442133903503418, 3), (0.22563308477401733, 2), (0.40358829498291016, 1), (0.41298401355743408, 5), (0.43839478492736816, 4)]
But the most significant implication is that the way that H is incorporated in olivine may vary with the fugacity of HO, depending on whether the H atoms needed to achieve charge balance are completely associated with the point defect by being bonded to the oxygen atoms surrounding it in specific locations, or are disordered over the lattice by being bonded to oxygen atoms without regard to location.
Defined as 3 

[(0.17687559127807617, 3), (0.25559866428375244, 2), (0.45852863788604736, 4), (0.47920215129852295, 1), (0.49328607320785522, 5)]
For example, if the four H atoms of the [Si] mechanism are bonded to oxygen atoms surrounding the Si site vacancy to produce local charge balance (that is short-range order), as recently shown by Xue et al.
Defined as 3 

[(0.19387054443359375, 3), (0.26459795236587524, 2), (0.43715059757232666, 1), (0.448050141334533

After this, a training set of 1249 samples (423 positive samples and 826 negative samples) and a test set of 312 samples (105 positive samples and 207 negative samples) were obtained.In the feature calculation step, 203 descriptor were calculated including 30 constitution descriptors, 44 connectivity indices, 7 kappa indices, 32 Moran auto-correction descriptors, 5 molecular properties, 25 charge descriptors and 60 MOE-Type descriptors.
Defined as 3 

[(0.20881187915802002, 3), (0.24343907833099365, 2), (0.34048378467559814, 1), (0.36043977737426758, 4), (0.36821633577346802, 5)]
This has resulted in significant new functionality in core application programming interfaces (APIs) while maintaining the quality of code depending on those core APIs.Examples of new features supported by the improved development model include InChI functionality [], greatly improved ring detection algorithms [], improvements to the core atom type perception module that now covers a much more comprehensive se

In [136]:
target = u'attribute'
with open('data/testing_data.json') as f:
    file = json.load(f)
context_sentences = context(MyPapers_plus(test), target)
#print(context_sentences)

for c in context_sentences:
    d = define_target_from_sentence(c, g[target])
    #print(c)
    print("{}\nDefined as {} \n".format(find_sentence(file,c),d))



[(0.33354330062866211, 1), (0.37769865989685059, 2)]
Here, we focus on models explaining linkage between dyads beyond structure by incorporating node attribute information.
Defined as 1 

[(0.38523983955383301, 1), (0.52475601434707642, 2)]
The  () models network structure and node attributes by learning the attribute correlations in the observed network.
Defined as 1 

[(0.41379630565643311, 1), (0.5419037938117981, 2)]
Furthermore, the  () takes into account attribute information from nodes to model network structure.
Defined as 1 

[(0.31510353088378906, 1), (0.39183962345123291, 2)]
This model defines the probability of an edge as the product of individual attribute link formation affinities.
Defined as 1 

[(0.35475432872772217, 1), (0.53952652215957642, 2)]
Each feature vector 
                        =(
                        [1],...,
                        []) maps a node 
                         to  (numeric or categorical) attribute values.
Defined as 1 

[(0.3240596055984

## Validation

In [60]:
target = u'state'
with open('data/validation_data.json') as f:
    file = json.load(f)
validation_text = MyPapers('data/validation_data.json')
context_sentences = get_target_sentences(MyPapers_plus(validation_text), target)
#print(context_sentences)

for c in context_sentences:
    d = define_target_from_sentence(c, dictionary[target])
    #print(c)
    print("{}\nDefined as {} \n".format(find_sentence(file,c),d))



[(0.10803020897731686, 2), (0.12519943250077081, 0), (0.17674089186264252, 3), (0.23380997755080324, 4), (0.28968123389905787, 5), (0.40995219110093428, 1)]
In particular we show that the power spectrum of large-scale neocortical activity has a Brownian motion baseline, and that the statistical structure of the random bursts of spiking activity found near the resting state indicates that such a state can be represented as a percolation process on a random graph, called  percolation.Other data indicate that resting cortex exhibits pair correlations between neighboring populations of cells, the amplitudes of which decay slowly with distance, whereas stimulated cortex exhibits pair correlations which decay rapidly with distance.
Defined as 2 

[(0.044577169705506847, 0), (0.16763083963604863, 2), (0.20982685040993343, 3), (0.26670168416072482, 5), (0.33375389007581391, 4), (0.3627322726399137, 1)]
The realistic national relations are formed by various factors in the history and the multip

[(0.15058479522502366, 0), (0.20409318910126339, 5), (0.24807350266152983, 1), (0.28878309965120663, 4), (0.3121640228734619, 3), (0.3417241230131125, 2)]
Thus, when the state of Louisiana legalized a casino, casinos were prohibited from having in-house hotel facilities and restaurants.
Defined as 0 

[(0.086746510950096889, 0), (0.24211804542967408, 5), (0.28775421262871814, 3), (0.28846687740205901, 2), (0.33119461322575494, 1), (0.35586221170084398, 4)]
Meanwhile in Australia, state governments have liberalized gambling policy though at the same time, revising regulations and coming up with new ones to deal with negative socio-economic impact of growth in the gambling industry (Delfabbro and King ).
Defined as 0 

[(0.081081075572314254, 0), (0.21287332406490478, 5), (0.26321762893068523, 2), (0.2768653180005276, 3), (0.30692117350422732, 4), (0.34518639206601931, 1)]
, ).In the case of the poor and less educated, participation in forms of gambling such as state lotteries would actu

The two databases were coreleased by the National Bureau of Statistics and State Ethnic Affairs Commission, which recorded the information obtained from these two censuses, such as ethnic identity for the head and spouse in households nationwide.
Defined as 0 

[(0.16713209486352532, 0), (0.25342362354061365, 3), (0.30888631723731841, 2), (0.33160222940712747, 1), (0.38876148064686844, 5), (0.40851626364261739, 4)]
The first step, entitled K-Tech Tree, involves the development of necessary technologies for SWMI implementation based on the current state of technologies.
Defined as 0 

[(0.086664901580448817, 0), (0.14734292758088174, 3), (0.2181935959470408, 2), (0.26465649551366777, 5), (0.29762231425017749, 4), (0.33032548221304903, 1)]
With GPU accelerating, this should be also feasible.The reward function should be redesigned as well because now every non-collision state was rewarded with the same score, which limits the interactions between the state and the robot in very few condi

The question of public participation is fundamental in the governance of biobanks and has been framed in two different ways, focusing on people providing biological material for research biobanks on the one hand, and on people sharing in the decision-making involved in these projects (Machado and Silva ).Because of their value in criminal investigations, forensic DNA databases are used in a completely different context from biobanks for research purposes, and the former are controlled by state authorities or police services under specific legal provisions.
Defined as 0 

[(0.046349142360539552, 0), (0.16140215972256922, 5), (0.20976400695724329, 2), (0.21663696268912513, 3), (0.27952512615185765, 4), (0.32738409286480286, 1)]
They seemed to have little idea of the benefits and risks of such action, and their reasons for agreeing or refusing to have their genetic profile stored in the database.In particular, we want to underline that the Italian Forensic DNA database has yet to come int

Hence, every element should correspond to a state in the statechart.In comparison with the work of (), the main difference is that we address the system’s context, the operationalization of the NFRs and prioritization of variants.
Defined as 0 

[(0.14425197664982781, 0), (0.18455751017714006, 3), (0.27952940400304793, 2), (0.37125194274608775, 5), (0.42608708445947041, 4), (0.43007966877560122, 1)]
The process is guided by heuristic rules and patterns to map a goal hierarchy into an isomorphic state hierarchy in a statechart.
Defined as 0 

[(0.087337770914695412, 0), (0.18098255045392209, 3), (0.23220493531952413, 2), (0.30741565856846842, 5), (0.35791769086062053, 1), (0.36715733760960445, 4)]
In their work, RGMs extend DGMs with additional state, flow expressions and historical information about the fulfilment of goals.In this work, we proposed a systematic process for deriving the behavior of context-sensitive systems, expressed as statechart, from goal models namely GO2S.
Defined

In Section , by employing the Manasevich-Mawhin continuation theorem, we state and prove the existence of a positive periodic solution for () with attractive singularity.
Defined as 0 

[(0.20836016171848182, 0), (0.26287923857973172, 3), (0.30420199084186361, 2), (0.39623744129969451, 5), (0.41296150242039587, 4), (0.51423848938659322, 1)]
As an application, we study the convergence of the state observers of linear-time-invariant systems.
Defined as 0 

[(0.20836016171848182, 0), (0.26287923857973172, 3), (0.30420199084186361, 2), (0.39623744129969451, 5), (0.41296150242039587, 4), (0.51423848938659322, 1)]
As an application, we study the convergence of the state observers of linear-time-invariant systems.
Defined as 0 

[(0.15828642155142181, 0), (0.23901068205479881, 3), (0.27965241695226539, 2), (0.35474487201073757, 5), (0.35709290284366335, 4), (0.48646624591886933, 1)]
In addition, for the case , we have the following result.As an application of Theorem , we study the convergenc

While Chinese workers are  relatively free to take jobs in areas where they do not have a permanent residence permit, both state and city governments have created a set of rules that distinguish between local  (LH) and non-local  persons (NLH) in a number of key markets, including those for housing, labor, education and health.
Defined as 0 

[(0.15090487717822709, 0), (0.24840289620713141, 3), (0.26129346983899104, 2), (0.27793301595356512, 5), (0.30812748979719018, 4), (0.38344573939822801, 1)]
These authors’ analysis of NLSY79 data uses state employment growth as a measure of the business cycle.
Defined as 0 

[(0.14379767847811342, 0), (0.17743815684003561, 5), (0.29280294165445331, 3), (0.3008972180171896, 1), (0.34638561807137136, 4), (0.35787248526673965, 2)]
For both the primary and second jobs, information is collected on class of the job (private for-profit, private not-for-profit, federal, state, or local), detailed industry and occupation, and usual weekly hours worked.
Def

Nevertheless, he would not have been able to convince the ruler for complete abolishment as it was a significant source of revenue to the state exchequer.
Defined as 0 

[(0.11345418791837325, 0), (0.26650491656087394, 5), (0.28572304965854434, 3), (0.29393969773532769, 2), (0.34912656117123675, 4), (0.38110689233600303, 1)]
In doing so, they are not unique.My claim in this paper is that, to a significant degree, ritual among the Rathvas, a community of  (indigenous people) who live in the easternmost portion of the western Indian state of Gujarat, proceeds by constructing and then breaching borders.
Defined as 0 

[(0.093106501356432947, 0), (0.22035015376447631, 5), (0.25916668250389174, 3), (0.27292969186414007, 2), (0.28836352657931485, 4), (0.29548811237178296, 1)]
I am deliberately limiting myself here to the practices of non-, people known locally as , , or  and seen as being traditional .According to the 2011 Census, Rathvas (Rathawas) are the third largest  community (Schedule

Comprehensive state data collection systems now being put into place are a major tool to conduct such study.Contemporary ECCE research began with a few small-scale studies showing that high-quality early childhood programs produce an intellectual boost.
Defined as 0 

[(0.1765230642209239, 0), (0.26461668743927569, 3), (0.28068529086005789, 5), (0.34683615804680223, 2), (0.40571409194500263, 4), (0.40755564541254008, 1)]
It is easy to see one difference between the first programs and the federal and state programs.
Defined as 0 

[(0.048482560474102154, 0), (0.19536513691388979, 5), (0.19802056008867142, 2), (0.21181920648203945, 3), (0.26408329977227418, 4), (0.32565305490392615, 1)]
But surely scale alone cannot explain the difference between success and failure, and the difference between the Tulsa early childhood program and the Tennessee prekindergarten program is more highly nuanced.Like virtually all countries, the U.S. is governed by policymakers—heads of state, bureaucrats, le

The reduced mitochondrial respiratory activity in this group was characterized by a decrease in the maximum activity in both the coupled (maximum oxidative phosphorylation, OxPhos) and uncoupled (maximum capacity of the electron transfer system) state.
Defined as 2 

[(0.1259955075023087, 2), (0.17076599224771383, 0), (0.26379012802507407, 4), (0.30764472137293453, 3), (0.35617374448355554, 5), (0.47400154338196765, 1)]
This resulting state of reduced mitochondrial respiration is a sign of injury, which has also been confirmed in experimental CLP models [].
Defined as 2 

[(0.11663388737951563, 0), (0.18530341495076907, 2), (0.20666960183059213, 3), (0.31896416689311424, 4), (0.35506346256832999, 5), (0.40743973778127585, 1)]
The modeling results are thus capable of demonstrating the importance of a quantitative approach in studying the present thermal state of a sedimentary basin.Therefore, in this study, we attempt to perform the first numerical modeling study of the 3-D thermal stru

This is a common assumption in target state estimation problems (e.g.
Defined as 0 

[(0.19244171039779745, 0), (0.24134657464019593, 3), (0.24245866165443664, 2), (0.29730576433264044, 4), (0.37838247603395692, 5), (0.40178612080014275, 1)]
A Stark decelerator produces beams of molecules with high quantum state purity, and small spatial, temporal and velocity spreads.
Defined as 0 

[(0.27511906354532423, 0), (0.31120631580632274, 3), (0.32854640388946943, 4), (0.33294813117992872, 2), (0.42705818545496632, 5), (0.45501711424487568, 1)]
The meta-stable energy state is the lowest excited state of the atoms.
Defined as 0 

[(0.25230805436816373, 0), (0.2698742620860074, 3), (0.33554401063556671, 2), (0.37152836946737233, 4), (0.39771471931354596, 1), (0.45888949579868399, 5)]
Robust and scaleable chip-based electric and magnetic traps and guides have been developed for atomic ions [] and neutral ground state atoms [].
Defined as 0 

[(0.16981929828372633, 0), (0.18013807745712773, 3), (

That is, to produce the distribution , , where , for some inverse temperature , if the goal is to predict thermal properties of the model; or  with  being the ground state of , if the goal is to predict ground state properties.
Defined as 0 

[(0.12750284611121809, 0), (0.18941123889074019, 2), (0.23349950545415799, 3), (0.28377442267750785, 5), (0.30358049031786938, 4), (0.44837750540022014, 1)]
We quantify the reliability of an AQS by the robustness of this probability distribution with respect to the deviations of  from its ideal value .In general, there is no reason to expect that the prepared state  will be robust to perturbations of .
Defined as 0 

[(0.103267675476864, 0), (0.17958443900863841, 2), (0.20669546598743316, 3), (0.27430678610930825, 4), (0.32799612771790776, 5), (0.43394407569036575, 1)]
This is expected since the equilibrium state becomes dominated by thermal fluctuations at high temperatures, and observables become insensitive to underlying Hamiltonian parameters.

), the crime data we use in the present study is for the state of Minas Gerais which is rated as having high quality police recorded crime data (Fórum Brasileiro de Segurança Pública ).The city of Belo Horizonte in Brazil was the chosen area of study.
Defined as 0 

[(0.15773725247596371, 0), (0.20941843303270147, 5), (0.26246083329025227, 4), (0.29353842908875549, 3), (0.31362657300891017, 1), (0.3502671300358986, 2)]
Belo Horizonte is in the southeastern region of Brazil, is capital of the state of Minas Gerais, and is Brazil’s third largest metropolitan area with a population of 5.5 million (Brookings ).An important contextual difference between Brazilian cities and many cities in western countries relates to urban living.
Defined as 0 

[(0.18444390432057656, 0), (0.1996663060024535, 2), (0.27637741250742731, 3), (0.39913693233588199, 5), (0.41854332225945767, 4), (0.43201303213313413, 1)]
Complex networks often exhibit co-evolutionary dynamics, meaning that the network topology an

It triggers people’s need to know.Most studies of the TOT state focus on either the characteristics of the information pertaining to the not-yet-retrieved items—such as the first letter, the number of syllables, the gender of the word in certain languages, or incorrect words called ‘blockers’ (Brown, , ; Gollan & Brown, ; Kornell & Metcalfe, ; Miozzo & Caramazza, )—or they focus on the compelling phenomenology of the state (Cleary & Claxton, ; Schwartz & Cleary, ).
Defined as 0 

[(0.11308443239083132, 0), (0.21066563653267778, 2), (0.24095368975694664, 5), (0.26626462120989625, 4), (0.29067725453215976, 3), (0.35830474178621963, 1)]
With respect to the latter, researchers often include William James’ (, p. 251) description “The state of our consciousness is peculiar.
Defined as 0 

[(0.090197732417899745, 0), (0.21384655587451062, 2), (0.24372115797591798, 3), (0.28653331140938032, 5), (0.37609169521272401, 4), (0.39049673883853897, 1)]
The question that underlies the research in the 

The second dimension of burnout (i.e., depersonalization) is the state of becoming indifferent to the people and ignoring the service recipients in order to put distance between themselves and oneself (Maslach & Jackson, ).
Defined as 0 

[(0.10335438706429367, 0), (0.27133280476096988, 5), (0.28443845620954722, 2), (0.28818406203533731, 3), (0.35254291545008809, 4), (0.39198985992480717, 1)]
Finding no hope from bilateral negotiations Bangladesh being the lower riparian state tried to internationalize the issue to get the equitable water share of the River Ganges but failed.
Defined as 0 

[(0.11848174185177851, 0), (0.22512905296944319, 2), (0.26798321157589799, 3), (0.29582852009237448, 4), (0.30234385269564457, 5), (0.38795792940138962, 1)]
Absence of such friendly atmosphere between the states might lead to a popular movement in the lower riparian state if deprived by the unilateral water withdrawal by the upper riparian state.
Defined as 0 

[(0.085801660472189401, 0), (0.2295402