# Testing graph and path building strategies

1. Get material from Wikipedia articles.
2. Extract concepts.
 - After part-of-speech tagging, chunk different types of concepts. Make probable_lists and sentence classifications.
3. Build an undirected graph of concepts. The graph conveys the centrality of each concept and the existence of a relationship between two concepts.
4. Get the student's input on which concepts are not known at all.
___________
5. Side note: maybe build a directed graph from the undirected graph. Annotate sentence types and calculate readability. Then look for patterns to learn how to classify prerequisites from this.
_______

Evaluating different metrics for building directed prerequisite concepts graph:

1. term frequency
2. inverse document frequency for several corpora:
    wiki corpus 
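
The two metrics listed above can be sketched on a toy corpus. This is only an illustration: the documents and the helper names `term_frequency` and `inverse_document_frequency` are made up, and the smoothed form `log(N) - log(1 + df)` is chosen to match the `reuters_idf` function defined later in this notebook.

```python
import math
from collections import Counter

# Toy corpus (made up): each document is a list of already-extracted concepts.
docs = [
    ["plasma", "ion", "sheath"],
    ["plasma", "wafer"],
    ["ion", "plasma", "plasma"],
]

def term_frequency(term, doc):
    """Raw count of `term` in a single document."""
    return Counter(doc)[term]

def inverse_document_frequency(term, corpus):
    """log(N) - log(1 + df), where df is the number of documents containing `term`."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus)) - math.log(1 + df)

tf_plasma = term_frequency("plasma", docs[2])
idf_plasma = inverse_document_frequency("plasma", docs)  # appears in every document
idf_wafer = inverse_document_frequency("wafer", docs)    # appears in one document
print(tf_plasma, idf_plasma, idf_wafer)
```

With this smoothing, a term that appears in every document gets a slightly negative IDF, so rarer terms always score higher than common ones.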
    

## 1. Downloading a wikipedia article's text

In [None]:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Facet'

source = requests.get(url).text
soup = BeautifulSoup(source,'lxml')


text_set = soup.find_all(['p'])
text_list = [p1.get_text() for p1 in text_set]
tags_list = [p1.name for p1 in text_set ]

rawtxt = ''.join(text_list)
## This will skip headings ('h2','h3') and lists that are made as links( 'li'). For now, this is okay.
print("length of material")
print(len(rawtxt))

print("Sample of text")
print(rawtxt[0:500])

## Save rawtxt as is for later:

In [None]:
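path_name = "C:/Users/Arati/Documents/personal docs/python_introduction_course/textdata/"
# NOTE: `filename` must already be defined before running this cell (it is set in the next cell).
# Mode "a" appends, so rerunning this cell adds another copy of the text to the file.
with open(path_name + filename,"a",encoding="utf-8") as myfile:
    myfile.write(rawtxt)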

## Alternately getting file from disk and loading to rawtxt:



In [1]:
filename = 'Cognitive_Load_Theory.txt'
path_name = "C:/Users/Arati/Documents/personal docs/python_introduction_course/textdata/"
with open(path_name + filename, "r", encoding="utf-8") as myfile:
    rawtxt = myfile.read()

## 2. Extracting list of concepts:

### 2.1. Importing libraries

In [46]:
import nltk
from nltk import word_tokenize
from nltk.chunk import *
from nltk.chunk.util import *
from nltk.chunk.regexp import *
from nltk import Tree
import re
import pickle
import math
from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english'))
import itertools
from itertools import chain
import collections
import numpy as num
import pandas as pd
import csv
import statistics
from nltk.corpus import cmudict
cmud = cmudict.dict()
wnl = nltk.WordNetLemmatizer()
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
with open('common_words_documents.pickle', 'rb') as f:
    common_words_documents = pickle.load(f)
type(common_words_documents)

dict

### 2.2 Training an unsupervised sentence tokenizer based off downloaded material

In [47]:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(rawtxt)
 
tokenizer = PunktSentenceTokenizer(trainer.get_params())
sents = tokenizer.tokenize(rawtxt)

print("Number of sentences in text "+str(len(sents)))
print(len(sents))

print("Sample of sentences:")
print(sents[0:5])

Number of sentences in text 269
269
Sample of sentences:
['\ufeffCatalysis (/kəˈtæləsɪs/) is the process of increasing the rate of a chemical reaction by adding a substance known as a catalyst[1] (/ˈkætəlɪst/), which is not consumed in the catalyzed reaction and can continue to act repeatedly. Because of this, only very small amounts of catalyst are required to alter the reaction rate in principle.', '[2]\nIn general, chemical reactions occur faster in the presence of a catalyst because the catalyst provides an alternative reaction pathway with a lower activation energy than the non-catalyzed mechanism.', 'In catalyzed mechanisms, the catalyst usually reacts to form a temporary intermediate, which then regenerates the original catalyst in a cyclic process.', 'A substance which provides a mechanism with a higher activation energy does not decrease the rate because the reaction can still occur by the non-catalyzed route.', '[3] An added substance which does reduce the reaction rate is no

### 2.3 Extracting concepts from text, sentence by sentence

#### 2.3.1 Setting up chunking rules:

Chunking has to be done in batches. This way, single-noun concepts can be extracted as well as adjectives.

In [48]:
chunkrules = {}

chunkrules['JJNP'] = r"""    
    JJNP: {<RB.*>?<J.*>?<NN.*>{1,}}       
"""
## Examples: reusable contactless stored value smart card

##### Validchar function - checks if it is alphanumeric including hyphens

In [49]:
def validchar(wrd):
    if wrd.isalnum():
        ##print('isalnum')
        return 1
    elif '-' in wrd:
        wrd = wrd.replace('-','1')
        if wrd.isalnum():
            ##print('only replaced hyphens')
            return 1
        else:
           ## print('replaced hyphens but still not alnum')
            return 0
            
    else:
       ## print('there were no hyphens')
        return 0

In [50]:
def lemmatize_by_pos(tag):
    token = tag[0]
    pos = tag[1]
    if token in stop_words:
        return (token,pos)
    if pos.startswith('J'):
        # adjective form
        lemma = wnl.lemmatize(token,'s')
    elif pos.startswith('N'):
        # noun form
        lemma = wnl.lemmatize(token,'n')
    elif pos.startswith('R'):
        # adverb
        lemma = wnl.lemmatize(token,'r')
    elif pos.startswith('V'):
        lemma = wnl.lemmatize(token,'v')
    else:
        lemma = token
    return (lemma,pos)

In [29]:
lemmatize_by_pos(('dancing','VBS'))

('dance', 'VBS')

##### Chunk this function = takes a grammar rule and sentence tags and returns chunks

In [30]:
def chunk_this(grammar_rule_key,sentence_tags):
    setlist = []
    cp = nltk.RegexpParser(chunkrules[grammar_rule_key])
    J = cp.parse(sentence_tags) 
    for i in range(len(J)):
        if not(isinstance(J[i],tuple)):
            if (J[i].label()==grammar_rule_key):
                setlist.append((' '.join([J[i][j][0] for j in range(len(J[i])) if (validchar(J[i][j][0])==1)])))
    setlist = list(set(setlist))
    setlist = [wrd.lower() for wrd in setlist if len(wrd)>0]
    return setlist
#%% 


Creating an equation dictionary so that mathematical expressions can be replaced by labels and treated as noun phrases

In [31]:
eqn_dict = {}
eqn_count = 1

def eqn_label(tokens):
    global eqn_count
    global eqn_dict
    EQNlist = [wrd for wrd in tokens if not(wrd.isalnum()) and re.search(r'[\[\]\{\}\+*^=_%$]',wrd) and len(wrd)>1 ]
    ## replace equations with a label and save to equation dictionary
    for eqn in EQNlist:
        if not(eqn in eqn_dict):
            # first time this equation is seen: register a new label for it
            eqn_dict[eqn] = ''.join(['equation',str(eqn_count)])
            eqn_count = eqn_count + 1
        # replace this occurrence with its label (first occurrences included)
        tokens[tokens.index(eqn)] = eqn_dict[eqn]

    return tokens
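
As a rough illustration of the labelling idea, the sketch below replaces equation-like tokens with placeholders and then restores them. It is a standalone variant, not the notebook's function: `label_equations` is a hypothetical name, the dictionary is passed in instead of being global, and the sample tokens are made up.

```python
import re

# Same character class as the notebook's eqn_label regex.
EQN_PATTERN = re.compile(r'[\[\]\{\}\+*^=_%$]')

def label_equations(tokens, eqn_dict):
    """Return a copy of `tokens` with equation-like tokens replaced by labels."""
    labelled = []
    for tok in tokens:
        if len(tok) > 1 and not tok.isalnum() and EQN_PATTERN.search(tok):
            if tok not in eqn_dict:
                eqn_dict[tok] = 'equation' + str(len(eqn_dict) + 1)
            labelled.append(eqn_dict[tok])
        else:
            labelled.append(tok)
    return labelled

eqn_dict = {}
tokens = ['the', 'ratio', 'x=y+1', 'equals', 'x=y+1', 'and', 'a^2']
labelled = label_equations(tokens, eqn_dict)

# Invert the mapping to display the original equations again.
inverse = {label: eqn for eqn, label in eqn_dict.items()}
restored = [inverse.get(tok, tok) for tok in labelled]
print(labelled)
print(restored)
```

Returning a new list instead of editing `tokens` in place keeps the original tokens available for comparison.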

In [32]:
def display_equation(reptokens):
    # build the inverse mapping at call time so that it reflects every
    # equation registered in eqn_dict so far
    inv_eqn_dict = {value: key for key, value in eqn_dict.items()}
    for wrd in reptokens:
        if wrd in inv_eqn_dict:
            reptokens[reptokens.index(wrd)] = inv_eqn_dict[wrd]
    return reptokens

In [33]:
def chunker(sentence_tags):
    return [chunk_this(key,sentence_tags)  for key in chunkrules]

Tokens are lemmatized (see lemmatize_by_pos) before being saved to sent_to_np

Process each sentence:

In [34]:
%%time 
sent_to_np = {}
sent_to_tokens = {}
sent_to_tags = {}

# tokens = [word_tokenize(s) for s in sents]
# reptokens = [eqn_label(t) for t in tokens]
# tags = [nltk.pos_tag(rt) for rt in reptokens]
# sent_to_np = {s:}

for i in range(len(sents)):
    tokens = word_tokenize(sents[i])
    reptokens = eqn_label(tokens)
    tags = nltk.pos_tag(reptokens)
    #print(tags)
    lemmatags = [lemmatize_by_pos(t) for t in tags]
    #print(lemmatags)
    sent_to_np[i] = chunker(lemmatags)
    ##sent_to_np_basics = extract_basicnp(sent_to_np[i])
    #sent_to_tokens[i] = lemma
    #sent_to_tags[i] = tags

Wall time: 1.57 s


In [35]:
print(sent_to_np[0])

[['station rf impedance', 'shd', 'ccp discharge how', 'pecvd q', 'a pecvd']]


In [36]:
sent_to_npflat = {}
np_to_sent = {}
for key in sent_to_np:
    sent_to_npflat[key] = list(set((itertools.chain(*sent_to_np[key]))))  
    for np in sent_to_npflat[key]:            
        if np in np_to_sent:                           
            np_to_sent[np].append(key)
        else:                
            np_to_sent[np]=[key]

In [37]:
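# np1 appears in the same sentence as np2 (j = 0) or up to max_sent_dist-1 sentences before it

def build_graph(sent_to_npflat,max_sent_dist, min_sent_dist):
    npnp_bondstrengthdir = {}
    # note: the last max_sent_dist sentences are never used as the source sentence
    for i in range(len(sent_to_npflat)-max_sent_dist):
        for np1 in sent_to_npflat[i]:
            # keep the bonds accumulated from earlier sentences instead of resetting them
            npnp_bondstrengthdir.setdefault(np1, {})
            for j in range(min_sent_dist, max_sent_dist):
                np2list = [np2 for np2 in sent_to_npflat[i+j] if np2!=np1]
                for np2 in np2list:
                    # closer sentences contribute more: weight 1/(j+1)
                    npnp_bondstrengthdir[np1][np2] = npnp_bondstrengthdir[np1].get(np2,0) + 1/(j+1)
    return npnp_bondstrengthdir

npnp_bondstrengthdir = build_graph(sent_to_npflat,3,0)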
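
To make the distance weighting concrete, here is a small standalone sketch on three made-up sentences. It is a variant for illustration only: `cooccurrence_graph` is a hypothetical name, and unlike `build_graph` it also uses the final sentences as sources by stopping at the end of the document.

```python
def cooccurrence_graph(sent_to_concepts, max_sent_dist, min_sent_dist=0):
    """Directed co-occurrence weights: a pair j sentences apart adds 1/(j+1)."""
    graph = {}
    n = len(sent_to_concepts)
    for i in range(n):
        for j in range(min_sent_dist, max_sent_dist):
            if i + j >= n:
                break  # ran past the last sentence
            for c1 in sent_to_concepts[i]:
                for c2 in sent_to_concepts[i + j]:
                    if c1 != c2:
                        edges = graph.setdefault(c1, {})
                        edges[c2] = edges.get(c2, 0) + 1 / (j + 1)
    return graph

# Toy input (made up): sentence index -> concepts found in that sentence.
toy = {0: ['catalyst', 'reaction'], 1: ['activation energy'], 2: ['catalyst']}
graph = cooccurrence_graph(toy, max_sent_dist=3)
print(graph)
```

Here `reaction -> catalyst` receives 1 from the shared sentence 0 plus 1/3 from sentence 2, which shows how repeated nearby mentions strengthen a bond.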

In [38]:
count = 0
for np1 in npnp_bondstrengthdir.keys():
    for np2 in npnp_bondstrengthdir[np1].keys():
        print(np1,np2)
        count = count+1
        if count>10:
            break
    if count>10:
        break

ccp discharge how station rf impedance
ccp discharge how pecvd q
ccp discharge how shd
ccp discharge how a pecvd
ccp discharge how series rf circuit
ccp discharge how simple model assumes
ccp discharge how gas molecule
ccp discharge how resistive impedance
ccp discharge how collision
ccp discharge how real part
station rf impedance ccp discharge how


Reuters IDF dictionary has already been made.

In [39]:
def tf(np,rawtxt):
    # escape the phrase so that any regex metacharacters in it are matched literally
    p = re.compile(re.escape(np))
    return len(p.findall(rawtxt))

wnl = nltk.WordNetLemmatizer()
def reuters_idf(token):
    ## idf = log(N) - log(1 + document frequency), with N = 10788
    ## (presumably the number of documents in the Reuters corpus)
    lemma = wnl.lemmatize(token)
    if lemma in common_words_documents:
        idf = math.log(10788) - math.log(1 + common_words_documents[lemma])
    else:
        # unseen tokens get the maximum idf
        idf = math.log(10788)
    return idf

def tf_reuters_idf(np,rawtxt):
    return tf(np,rawtxt)*reuters_idf(np)
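
The IDF used above can be written out explicitly. This assumes that 10788 is the number of documents in the Reuters corpus behind `common_words_documents` (the notebook does not state this), and writes $\mathrm{df}(t)$ for the number of those documents that contain the lemma of token $t$:

$$
\mathrm{idf}(t) = \ln N - \ln\bigl(1 + \mathrm{df}(t)\bigr) = \ln\frac{N}{1 + \mathrm{df}(t)}, \qquad N = 10788
$$

$$
\text{tf-idf}(t) = \mathrm{tf}(t) \cdot \mathrm{idf}(t)
$$

A token that is not in the dictionary gets the maximum value $\ln 10788 \approx 9.286$.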

In [40]:


# print('prefixes')
# example = 'dancer'
# print(wnl.lemmatize(example))
# print(porter.stem(example))
# print(lancaster.stem(example))

# print('prefix2')
# example = 'dances'
# print(wnl.lemmatize(example))
# print(porter.stem(example))
# print(lancaster.stem(example))

# print('prefix3')
# example = 'dancers'
# print(wnl.lemmatize(example))
# print(porter.stem(example))
# print(lancaster.stem(example))

# print('prefix4')
# example = 'dancing'
# print(wnl.lemmatize(example))
# print(porter.stem(example))
# print(lancaster.stem(example))

# There seem to be no hard and fast rules for the behaviour of the Porter stemmer.
# The lemmatizer is the slowest.
# The lemmatizer always reduces plural to singular.
# Examples: leaves, boxes, machines, areas.
# Prefixes for antonyms: the lemmatizer keeps the word as is; Porter and Lancaster reduce it to a stem without removing the prefix.

In [41]:
Concept = pd.Series([key for (key,value) in np_to_sent.items()])
Occurence = pd.Series([num.array(value) for (key,value) in np_to_sent.items()])
Frequency = pd.Series([len(o) for o in Occurence])
Sdev = pd.Series([num.std(o) for o in Occurence])
#ReutersIDF = pd.Series([reuters_idf(key) for (key,value) in np_to_sent.items()])
Conceptdata = pd.DataFrame({'Concept':Concept,'Occurence':Occurence,'Frequency':Frequency,'Sdev':Sdev})

In [42]:
Conceptdata.sort_values(by='Frequency',ascending=False).head(25)

| | Concept | Occurence | Frequency | Sdev |
|---|---|---|---|---|
| 324 | plasma | [102, 107, 112, 113, 114, 137, 156, 172, 190, ... | 30 | 230.600434 |
| 85 | ion | [27, 70, 99, 111, 128, 129, 131, 134, 135, 136... | 26 | 237.70636 |
| 136 | surface | [44, 78, 80, 82, 88, 102, 147, 162, 163, 208, ... | 23 | 218.631402 |
| 47 | lf | [15, 21, 131, 197, 199, 201, 287, 292, 305, 36... | 20 | 180.389377 |
| 235 | radical | [69, 78, 79, 87, 89, 149, 151, 209, 218, 224, ... | 20 | 120.784219 |
| 63 | wafer | [19, 125, 203, 254, 268, 307, 335, 411, 448, 4... | 19 | 220.698727 |
| 91 | process | [29, 33, 55, 98, 150, 169, 303, 317, 318, 390,... | 19 | 207.593786 |
| 107 | sheath | [34, 72, 101, 103, 106, 109, 127, 128, 129, 13... | 18 | 142.286834 |
| 166 | showerhead | [52, 203, 268, 407, 420, 461, 497, 500, 540, 5... | 17 | 169.368642 |
| 486 | station | [176, 209, 297, 406, 484, 485, 487, 489, 492, ... | 17 | 114.125712 |


In [43]:
Conceptdata.to_csv('PlasmafaqConceptdata.csv',sep=',')

In [None]:
import bisect as bs

def find_shortest_distance(search_list, value):
    ins_point = bs.bisect_right(search_list,value)
    if ins_point < len(search_list):
        return min(abs(search_list[ins_point] - value), abs(search_list[ins_point - 1] - value))
    return abs(search_list[ins_point - 1] - value)


In [None]:
def find_shortest_distance_withdir(search_list,value):
    ins_point = bs.bisect_right(search_list, value)
    if ins_point < len(search_list):
        if abs(search_list[ins_point] - value) < abs(search_list[ins_point - 1] - value):
            return search_list[ins_point] - value
        else:
            return search_list[ins_point-1] - value
    return search_list[ins_point-1] - value

# A negative result means the nearest entry in search_list lies before value (value is greater);
# a positive result means the nearest entry lies after value (value is lower).

In [None]:
search_list = [1,2,3,4,5]
value = 3
print(bs.bisect_right([1,2,3,4,5],3))
print(find_shortest_distance(search_list,value))
print(find_shortest_distance_withdir(search_list,value))

In [None]:
def manual_syllable_count(phrase):
    vowels = {'a','e','i','o','u'}
    consonants = {'b','c','d','f','g','h','j','k','l','m','n','p','q','r','s','t','v','w','x','z'}
    y = {'y'}
    length = len(phrase)
    count_s = 0
    # syllables are counted in middle and end from the starting consonant or y sound with vowel sound following
    # in the starting: vowel sounds from a,e,i,o,u,and y are counted as 1 syllable regardless
    # in the end: consonant - vowel end with e is not counted, every other case including y as the vowel is counted
    first = phrase[0]
    #print(first)
    # dividing middle portion of word into pairs
    pairs = [phrase[i:i+2] for i in range(len(phrase)-2)]
    # getting ending pair
    end = phrase[len(phrase)-2:len(phrase)]
    
    if first in vowels|y:
        count_s = count_s + 1
        #print(first,count_s)
        
    for p in pairs:
        if p[0] in consonants|y and p[1] in vowels|y:
            count_s = count_s + 1
        #print(p,count_s)
    #print(end)
    if end[0] in consonants|y and end[1] in {'a','i','o','u','y'}:
        count_s = count_s + 1
        #print(end,count_s)
    return count_s
#     'employee'
#     'e'   :1
#     'em'  :0
#     'mp'  :0
#     'pl'  :0
#     'lo'  :1
#     'oy'  :0
#     'ye'  :1
#     'ee'  :0
    # getting first letter

def syllable_count(phrase):
    toks = nltk.word_tokenize(phrase)
    count = 0
    for t in toks:
        #syll_list = list(chain.from_iterable(cmud.get(t,[[0]])))
        syll_list = cmud.get(t,[[0]])[0] # arbitrarily taking the first listed pronunciation
        #print(syll_list)
        if syll_list==[0]:
            count = count + manual_syllable_count(t)
        else:
            count = count + sum([1 for y in syll_list if y[-1].isdigit()])
    return count
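
The dictionary branch of `syllable_count` relies on the CMU convention that every vowel phoneme carries a stress digit (0, 1 or 2), so counting phonemes that end in a digit counts syllables. A minimal sketch, using a hand-written phoneme list in that style (illustrative only, not looked up in `cmudict`) and a hypothetical helper name:

```python
def count_syllables_arpabet(phonemes):
    """Count phonemes that end in a stress digit, i.e. the vowel sounds."""
    return sum(1 for p in phonemes if p[-1].isdigit())

# Hand-written example in CMU dictionary style (an assumption, not real lookup output).
example_pronunciation = ['P', 'L', 'AE1', 'Z', 'M', 'AH0']
n_syllables = count_syllables_arpabet(example_pronunciation)
print(n_syllables)
```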


In [None]:
Concept1 = [[np1]*len(npnp_bondstrengthdir[np1]) for np1 in npnp_bondstrengthdir.keys()]
Concept1 = list(chain.from_iterable(Concept1))

In [None]:
Concept2 = [np2 for np1 in npnp_bondstrengthdir.keys() for np2 in npnp_bondstrengthdir[np1].keys()]

In [None]:
Bondstrength = [npnp_bondstrengthdir[Concept1[i]][Concept2[i]] for i in range(len(Concept1))]

In [None]:
# Number of sentences in which concept occurs
FA = [len(np_to_sent[np1]) for np1 in Concept1]
FB = [len(np_to_sent[np2]) for np2 in Concept2]

In [None]:
# Standard deviation of the sentence indices where a concept occurs, i.e. its spread: does it occur all over the document or just in one section?
SdevA = [num.std(np_to_sent[np1]) for np1 in Concept1]
SdevB = [num.std(np_to_sent[np2]) for np2 in Concept2]

In [None]:
## Computing the mean bond strength of A to other concepts
meanBSA = [num.mean(list(npnp_bondstrengthdir[np1].values())) for np1 in Concept1]
meanBSB = [num.mean(list(npnp_bondstrengthdir.get(np2,{}).values())) for np2 in Concept2]

In [None]:
## Computing the average shortest distance from each occurrence of A to an occurrence of B, and vice versa. This is a metric for co-occurrence.
OcA = [np_to_sent[np1] for np1 in Concept1]
OcB = [np_to_sent[np2] for np2 in Concept2]

In [None]:
dAB=[]
dBA=[]

for i in range(len(Concept1)):
    dAB.append(num.mean([abs(find_shortest_distance(OcB[i],o)) for o in OcA[i]]))
    dBA.append(num.mean([abs(find_shortest_distance(OcA[i],o)) for o in OcB[i]]))

In [None]:
print('Computing number of mappings for Concept1, Concept2 respectively and how many of those concepts intersect') 
npnp_bondstrengthdir.get('complete exoneration',{})
%time Amap = [len(npnp_bondstrengthdir[np1]) for np1 in Concept1]
%time Bmap = [len(npnp_bondstrengthdir.get(np2,{})) for np2 in Concept2]
%time AmapintersectBmap = [len(set(npnp_bondstrengthdir[Concept1[i]].keys()) & set(npnp_bondstrengthdir.get(Concept2[i],{}).keys())) for i in range(len(Concept1))]
%time AminusB = [Amap[i]-AmapintersectBmap[i] for i in range(len(Concept1))]
%time BminusA = [Bmap[i]-AmapintersectBmap[i] for i in range(len(Concept1))]

In [None]:
## Edit word distance between the two concepts
nptoWtkeys = list(np_to_sent.keys())
print('word tokenizing each concept')
%time nptoWtvals = [nltk.word_tokenize(np) for np in nptoWtkeys]
nptoWt = dict(zip(nptoWtkeys,nptoWtvals))

In [None]:
print('syllable counts for each concept')
%time scvals = [syllable_count(np) for np in nptoWtkeys]
nptoSC = dict(zip(nptoWtkeys,scvals))

In [None]:
wtA = [nptoWt[np1] for np1 in Concept1]
wtB = [nptoWt[np2] for np2 in Concept2]
lenwtA = [len(wtA[i]) for i in range(len(Concept1))]
lenwtB = [len(wtB[i]) for i in range(len(Concept1))]

print('calculating word edit distance for concept1 and 2')
%time editDwAtoB = [nltk.edit_distance(wtA[i],wtB[i])/lenwtA[i] for i in range(len(Concept1))]
%time editDwBtoA = [nltk.edit_distance(wtA[i],wtB[i])/lenwtB[i] for i in range(len(Concept1))]

print('calculating letter edit distance for concept1 and 2')
lenA = [len(np1) for np1 in Concept1]
lenB = [len(np2) for np2 in Concept2]
%time editD =[nltk.edit_distance(Concept1[i],Concept2[i]) for i in range(len(Concept1))]
editDlAtoB = [editD[i]/lenA[i] for i in range(len(Concept1))]
editDlBtoA = [editD[i]/lenB[i] for i in range(len(Concept1))]

## Jaccard word distance A to B
print('calculating Jaccard distances by word')
%time Jaccardw = [nltk.jaccard_distance(set(wtA[i]),set(wtB[i])) for i in range(len(Concept1))]
print('calculating jaccard distance by letter')
%time Jaccardl = [nltk.jaccard_distance(set(Concept1[i]),set(Concept2[i])) for i in range(len(Concept1))]

lensents = len(sents)
lennp = len(np_to_sent)
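
The word-level distances above can be checked by hand on one pair. The sketch below uses plain-Python stand-ins for `nltk.edit_distance` and `nltk.jaccard_distance` (unit insert, delete and substitute costs are assumed) and the concept pair "statistical science" / "statistical methods mining" that is quoted in the hypotheses later in this notebook.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, unit cost per operation."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute, or match at cost 0
        prev = curr
    return prev[-1]

def jaccard_distance(s1, s2):
    """1 - |intersection| / |union| for two non-empty sets."""
    union = s1 | s2
    return (len(union) - len(s1 & s2)) / len(union)

wt_a = 'statistical science'.split()
wt_b = 'statistical methods mining'.split()

d_words = edit_distance(wt_a, wt_b)
edit_a_to_b = d_words / len(wt_a)   # normalised by the length of A
edit_b_to_a = d_words / len(wt_b)   # normalised by the length of B
jaccard_w = jaccard_distance(set(wt_a), set(wt_b))
print(d_words, edit_a_to_b, edit_b_to_a, jaccard_w)
```

The normalised values 1.0 and about 0.667 agree with the `editDwAtoB` and `editDwBtoA` entries shown for this pair in the hypotheses.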

In [None]:
AfirstOc = [np_to_sent[np1][0]/lensents for np1 in Concept1]
BfirstOc = [np_to_sent[np2][0]/lensents for np2 in Concept2]

In [None]:
syllcountA = [nptoSC[np1] for np1 in Concept1]
syllcountB = [nptoSC[np2] for np2 in Concept2]

In [None]:
print('getting reuters idf value for each concept')
nptoReutersIDFvals = [num.mean([reuters_idf(t) for t in nptoWt[np1]]) for np1 in nptoWt.keys()]
nptoReutersIDF = dict(zip(nptoWtkeys,nptoReutersIDFvals))

In [None]:
ReutersIDFA = [nptoReutersIDF[np1] for np1 in Concept1]
ReutersIDFB = [nptoReutersIDF[np2] for np2 in Concept2]

In [None]:
print('making into dataframe')

%time df = pd.DataFrame({'Concept1':Concept1,'Concept2': Concept2,'FA':FA,'FB':FB,'SdevA':SdevA,'SdevB':SdevB, 'meanBSA':meanBSA, 'meanBSB':meanBSB,'dAB':dAB,'dBA':dBA,'Amap':Amap,'Bmap':Bmap,'AmapintersectBmap':AmapintersectBmap, 'AminusB':AminusB, 'BminusA':BminusA,'lenwtA':lenwtA,'lenwtB':lenwtB,'editDwAtoB':editDwAtoB, 'editDwBtoA':editDwBtoA, 'lenA':lenA,'lenB':lenB,'editDlAtoB':editDlAtoB, 'editDlBtoA':editDlBtoA, 'Jaccardw':Jaccardw, 'Jaccardl':Jaccardl,'AfirstOc':AfirstOc,'BfirstOc':BfirstOc,'syllcountA':syllcountA,'syllcountB':syllcountB,'ReutersIDFA':ReutersIDFA,'ReutersIDFB':ReutersIDFB, 'lennp':[lennp]*len(Concept1),'lensents':[lensents]*len(Concept1)})
df.head(5)

In [None]:
'complete exoneration' in Concept2

In [None]:
df.sort_values(by=['AfirstOc','BfirstOc'],ascending=True)[['Concept1','Concept2','AfirstOc','BfirstOc']].head(20)

## Path building

1. The graph is a dictionary.
2. Decide whether to convert it to a dataframe.
3. Write a function to return graph edges and their corresponding sentences. These would not be in the npnp_bondstrength graph, since nothing directs to them.
4. Optional: write a function to look up a concept on Wikipedia or WordNet and add it to the graph dynamically if edges are not understood.
5. Write a function to slice out subgraphs by quantiles of number of nodes or bond strength (connectedness). Add visualization.
6. Write a function that takes a concept to understand and finds the shortest path to it.
7. Write a function to return the shortest path that covers all the concepts in the sliced-out graphs.
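
Item 6 can be sketched with a breadth-first search over a dictionary graph of the same shape as `npnp_bondstrengthdir` (concept to a dict of neighbouring concepts). This is a minimal sketch under assumptions: the function name and the toy graph are made up, and edge weights are ignored, so "shortest" means fewest hops.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Fewest-hop path from source to target in a dict-of-dicts graph, or None."""
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, {}):
            if nxt not in parents:
                parents[nxt] = node
                if nxt == target:
                    # walk back through the parents to rebuild the path
                    path = [nxt]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                queue.append(nxt)
    return None

# Toy graph (made up): concept -> {neighbour: bond strength}.
toy_graph = {
    'atom': {'ion': 1.0, 'molecule': 0.5},
    'ion': {'plasma': 1.0},
    'molecule': {'plasma': 0.3},
    'plasma': {'sheath': 1.0},
}
path = shortest_path(toy_graph, 'atom', 'sheath')
print(path)
```

The notebook later uses `nx.all_pairs_shortest_path` for the same purpose; the sketch only shows what that lookup does on a plain dictionary.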

### Assigning a direction to every mapping: 

Before we make the concept graph, we need to ensure the graph is directed and acyclic so it is clear which concept is the prerequisite and which is the more advanced topic. 

For now, the rules used to do this are ad hoc.

['Concept1', 'Concept2', 'FA', 'FB', 'SdevA', 'SdevB', 'meanBSA',
       'meanBSB', 'dAB', 'dBA', 'Amap', 'Bmap', 'AmapintersectBmap', 'AminusB',
       'BminusA', 'lenwtA', 'lenwtB', 'editDwAtoB', 'editDwBtoA', 'lenA',
       'lenB', 'editDlAtoB', 'editDlBtoA', 'Jaccardw', 'Jaccardl', 'AfirstOc',
       'BfirstOc', 'syllcountA', 'syllcountB', 'ReutersIDFA', 'ReutersIDFB',
       'lennp', 'lensents']

Hypotheses:
1. The concept with the lower Reuters IDF value is the prerequisite.
2. If the word-level edit distance (editDwAtoB or editDwBtoA) is less than one, then the concept with the lower lenwt is the prerequisite. For example:  
Concept1                    statistical science
Concept2             statistical methods mining
lenwtA                                        2
lenwtB                                        3
editDwAtoB                                    1
editDwBtoA                             0.666667
ReutersIDFA                             9.28619
ReutersIDFB                             7.45517
It should be noted here, however, that the Reuters IDF value for the second concept is lower, so hypotheses 1 and 2 disagree for this pair. So we need to weight these hypotheses.
3. AminusB, BminusA: the one with the higher value is the prerequisite (this assumes that the prerequisite is mentioned several times, and is in fact more __central__).
4. The concept with the lower syllable count is the prerequisite. This becomes important when both are one-word concepts.
5. The concept that occurs first in the document is the prerequisite.

For now, each of these rules has equal weight and the direction will be decided by majority.
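
An equal-weight majority vote over the rules can be sketched in a few lines. The rule names and the votes below are made up for illustration, and treating a tie as "not a prerequisite" is an assumption, not something the notebook specifies.

```python
def direction_by_majority(votes):
    """votes maps a rule name to True if that rule says A is the prerequisite of B.

    Returns True only when strictly more than half of the rules agree.
    """
    in_favour = sum(votes.values())
    return in_favour > len(votes) / 2

# Hypothetical votes for one (Concept1, Concept2) pair.
votes = {
    'lower_reuters_idf': False,
    'fewer_words': True,
    'more_exclusive_neighbours': True,
    'fewer_syllables': True,
    'earlier_first_occurrence': False,
}
a_is_prerequisite = direction_by_majority(votes)
print(a_is_prerequisite)
```

Replacing the plain sum with a weighted sum would be one way to implement the weighting mentioned under hypothesis 2.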


In [None]:
df['Cond1'] = df['ReutersIDFA']< df['ReutersIDFB']
df['Cond2'] =  df['lenwtA']<df['lenwtB']
#df['Cond2'] = df['meanBSA'] < df['meanBSB']
df['Cond3'] = df['AminusB'] > df['BminusA'] # A maps to more concepts that don't map to B within this document. 
df['Cond4'] = df['syllcountA'] < df['syllcountB']
df['Cond5'] = df['AfirstOc'] < df['BfirstOc']
df['Cond6'] = df['SdevA'] > df['SdevB'] # spread of A in document is more than spread of B
df['Cond7'] = df['FA']>df['FB']
#df['Cond8'] = df['editDlAtoB'] < 1
#df['Cond9'] = df['Jaccardw'] >= 0.5
#df['Cond10'] = df['Jaccardl'] > 0.5

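# count of satisfied conditions, for a majority vote (currently disabled):
#df['Direction'] = num.sum(df.loc[:,'Cond1':'Cond7'],axis=1)
# for now the direction is decided by Cond3 alone
df['Direction'] = df['Cond3']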


In [None]:
#writing to csv file for easier exploration temporarily
df.to_csv('df_CLT.csv',sep=',')

In [None]:
dfdir = df[df['Direction']>=1].copy()  # copy so that columns can be added later without a SettingWithCopyWarning
print(len(dfdir))

In [None]:
import networkx as nx
%time G = nx.from_pandas_edgelist(dfdir,'Concept1','Concept2', create_using=nx.DiGraph())
#%time num_cycles = len(list(nx.simple_cycles(G)))
#print('Number of cycles in this graph = '+str(num_cycles))
#print(G)
#print(nx.draw(G))
Conceptdata.sort_values(by=['Frequency','Sdev'],ascending = [0,0]).head(20)
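
The commented-out cycle count above enumerates every simple cycle, which can be slow on a dense graph. If only a yes/no answer is needed, networkx offers `nx.is_directed_acyclic_graph(G)`. The sketch below shows the same check without networkx, using Kahn's algorithm on a plain edge list; the function name and the toy edges are made up.

```python
from collections import deque

def is_acyclic(edges):
    """True if the directed graph given as (source, target) pairs has no cycle."""
    nodes = set()
    indegree = {}
    children = {}
    for a, b in edges:
        nodes.update((a, b))
        children.setdefault(a, []).append(b)
        indegree[b] = indegree.get(b, 0) + 1
    # start from the nodes that nothing points to
    queue = deque(n for n in nodes if indegree.get(n, 0) == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children.get(node, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # every node is removed exactly when there is no cycle
    return visited == len(nodes)

chain_ok = is_acyclic([('atom', 'ion'), ('ion', 'plasma')])
loop_found = not is_acyclic([('atom', 'ion'), ('ion', 'atom')])
print(chain_ok, loop_found)
```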

In [None]:
#nodelist = list(Conceptdata.sort_values(by=['Frequency','Sdev'],ascending = [0,0]).head(25)['Concept'])

# computing the load centrality of each node in the graph
%time loadcentdict = nx.load_centrality(G)
%time dfdirConcept1 = list(dfdir['Concept1'])
%time dfdirConcept2 = list(dfdir['Concept2'])
%time centralityA = [loadcentdict.get(np1,0) for np1 in dfdirConcept1]
%time centralityB = [loadcentdict.get(np2,0) for np2 in dfdirConcept2]

dfdir['centralityA'] = centralityA
dfdir['centralityB'] = centralityB

%time dfdir.sort_values(by = ['centralityA'],ascending=[False]).head(20)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(dfdir['centralityA'],dfdir['meanBSA'])
plt.show()

In [None]:
# find paths from each of these to the others. 
%time paths=dict(nx.all_pairs_shortest_path(G,cutoff=None))
type(paths)

In [None]:
def get_nodes_allpairs(concept_list):
    nodelist = [paths.get(cl1, {}).get(cl2,None) for cl1 in concept_list for cl2 in concept_list if paths.get(cl1, {}).get(cl2,None) is not None]
    nodelist = list(chain.from_iterable(nodelist))
    return list(set(nodelist))

In [None]:
start_concept_list = list(Conceptdata.sort_values(by=['Frequency','Sdev'],ascending = [0,0]).head(5)['Concept'])
#start_concept_list = ['trump','midterm election']
print(start_concept_list)
# now get the paths from all pairs in the concept list and the corresponding nodes 
nodelist = get_nodes_allpairs(start_concept_list)
print(nodelist)
start_concept_edges = dfdir[dfdir['Concept1'].isin(nodelist) & dfdir['Concept2'].isin(nodelist)]
start_concept_from = set(start_concept_edges['Concept1'])
#print(start_concept_edges)
print(start_concept_from)
start_concept_to = set(start_concept_edges['Concept2'])
print(start_concept_to)
len(start_concept_edges)
plt.figure(figsize=(20,10))
nx.draw_circular(G.subgraph(list(start_concept_from|start_concept_to)),with_labels=True, font_size=18,node_size=600)


In [None]:
print(np_to_sent['trump'])
print(np_to_sent['president'])
print(paths['midterm election']['trump'])
#[p for p in paths['learners'].keys() if p in nodelist

In [None]:
def get_sentence_indices(np1,np2,max_distance=3):
    sents1 = np_to_sent[np1]
    sents2 = np_to_sent[np2]
    ind1 = 0
    ind2 = 0
    tuplist = []
    lensents1 = len(sents1)
    print(lensents1)
    lensents2 = len(sents2)
    print(lensents2)
    while(ind1<lensents1 and ind2 <lensents2):
        #print(ind1,ind2)
        if (sents1[ind1]<sents2[ind2]):
            #print('sent1 less than sent2')
            if sents2[ind2]-sents1[ind1]<=max_distance:
                tuplist.append((sents1[ind1],sents2[ind2]))
                ind1 = ind1+1
                ind2 = ind2 + 1
            else:
                #ind1 = bs.bisect_left(sents1,sents2[ind2])
                ind1 = ind1 + 1
        elif (sents1[ind1]>sents2[ind2]):
            #print('sent2 less than sent1')
            if sents1[ind1]-sents2[ind2] <= max_distance:
                tuplist.append((sents2[ind2],sents1[ind1]))
                ind1 = ind1 + 1
                ind2 = ind2 + 1
            else:
                #ind2 = bs.bisect_left(sents2,sents1[ind1])
                ind2 = ind2 + 1
        else:
            tuplist.append((sents1[ind1],sents2[ind2]))
            ind1 = ind1+1
            ind2 = ind2+1
    return tuplist


In [None]:
def get_blurbs(np1,np2,max_distance=3):
    blurblist = []
    tuplist = get_sentence_indices(np1,np2,max_distance)
    print(tuplist)
    for t in tuplist:
        blurb = []
        print(t)
        blurb = ' '.join(sents[t[0]:t[1]+1]).replace('\n', ' ').replace('\r', '')
        print(blurb)
        blurblist.append(blurb)
    return blurblist

In [None]:
blurblist = get_blurbs('fitness function','rnns')

In [None]:
np_to_sent['rnns']
print(sents[148])
print(sents[148:149])

In [None]:
print(blurblist)

In [None]:
# # find paths from each of these nodes to others.
# indegreecentralitydict = nx.in_degree_centrality(G)
# outdegreecentralitydict = nx.out_degree_centrality(G)

# # in_centralityA = [indegreecentralitydict.get(np1,0) for np1 in Concept1]
# # in_centralityB = [indegreecentralitydict[np2] for np2 in Concept2]
# # out_centralityA = [outdegreecentralitydict[np1] for np1 in Concept1]
# # out_centralityB = [outdegreecentralitydict[np1] for np2 in Concept2]

### Function to find common paras - this is possibly a redo of npnp_bondstrengthdir, maybe club this in there.



In [None]:
def find_close_text(np1,np2,max_sent_dist=3):
    sentlistA = np_to_sent.get(np1,[0])
    sentlistB = np_to_sent.get(np2,[0])

    # find_shortest_distance_withdir returns (nearest sentence of np2) - sA,
    # so the nearest np2 sentence is sA + that signed distance
    SA_dist = [(sA,find_shortest_distance_withdir(sentlistB,sA)) for sA in sentlistA if find_shortest_distance(sentlistB,sA)<=max_sent_dist]
    index_tuples = [(tup[0],tup[0]+tup[1]) for tup in SA_dist]
    return index_tuples
# maybe pass a list of ranges? - convert each tuple to range, and sets, and if any intersections exist, then make them one. 
# or back calculate from npnp_bondstrengthdir?
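
The idea in the comments above, turning each tuple into a range and uniting ranges that intersect, can be sketched with the usual sort-and-sweep merge. The function name and the sample tuples are made up; tuples are sorted internally because a pair may be given in either order.

```python
def merge_ranges(index_tuples):
    """Merge overlapping (start, end) sentence ranges into disjoint ones."""
    spans = sorted(tuple(sorted(t)) for t in index_tuples)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # overlaps the previous range: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

merged = merge_ranges([(5, 7), (6, 9), (12, 12), (15, 13)])
print(merged)
```

Ranges that merely touch, such as (12, 12) and (13, 15), stay separate here; using `start <= merged[-1][1] + 1` instead would join adjacent sentences too.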
    


In [None]:
# take dAB and dBA, make blurbs indices of sentences +- rounded up int dAB.
# merge blurb indices with max(dAB,dBA) and return (this might be recursive)

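def display_sentences(np1,np2,batch_size=5):
    # batch_size (how many blurbs to print) and its default of 5 are assumptions
    index_tup = find_close_text(np1,np2)
    print(index_tup)
    if batch_size>len(index_tup):
        batch_size = len(index_tup)
    for i in range(batch_size):
        textblob=''
        # the two sentence indices may come in either order
        first, last = sorted(index_tup[i])
        for s in sents[first:last+1]:
            textblob = textblob + ' '+s.replace('\n', ' ').replace('\r', '')
        print(textblob)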

# print the concepts in a different color. 


In [None]:
find_close_text('trump','mueller',4)

## Take input on what concepts are not known:

Need to calculate idf for different corpus. 

Todo 3/25/2019
Add the following columns to df, and increase speed
2. stem/lem the concepts word by word and unite them with the first mention and rerun analysis:
 - how to get simple adjective from comparative and superlative?
3. annotate: semantic notes and roles for each concept. identify prerequisites and make direction column manually.
4. rough rule-based or n-gram/ sentence structure based role classification from pos and keywords in sentence?
5. centrality from graph library
6. annotate for training and test data using at least 5 more wiki articles: also save the total number of concepts and sentences, etc., i.e. columns for the data set
7. setup rough plan for classification - what algorithms are you going to try. do some exploration. 
8. google ngrams or scientific corpus IDF 
9. modify Reuters IDF to take in word tokenized noun phrase and consider each one separately, return the average. 
10. what other pos taggers are available, and should we look at those instead. should we clean up our sents a little more?
11. consider stemming for idf values and for combining singular/ plural forms - going to take the shortest word that is in the cmud dictionary. 
12. what does sorting by descending amapintersectbmap show? does text1.similar show anything interesting?
13. while displaying sentences with a concept, it should not include noun phrases containing more than just that concept? like ___ dance and just dance - only select sentences that display the concept asked for. 
14.  one way of measuring the clt is to get the combined idf for all the concepts, the number of total concepts and number of unknown concepts. Then every sentence/blurb will have a clt number, then you can select blurbs so that clt is always minimized. 
15. while calculating combined reuters_idf: do not include stop words. (and, of etc.)

Todo 3/28/2019
interactive graph visualization

1. How do we have nodes with no parents or children?? does it automatically show the concepts in concept data that are not pointed to?

Sentence classification with rules and annotation. (the J tree graph structure as input for ML maybe?)

Todo 3/29/2019
simple form GUI in tkinter

Todo 3/30/2019
inference engine: validation, inference, user friendly form questions. review question formats.

I came across the same problem, searched the web with no answer, then discovered that it actually can be done with the WordNet lemmatizer in nltk.

Recall that WordNet has those simplified pos tags:

n    NOUN 
v    VERB 
a    ADJECTIVE 
s    ADJECTIVE SATELLITE 
r    ADVERB 
among which the adjective tags, a and s, can be used for the normalization.

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('biggest', 'a')
u'big'
>>> wnl.lemmatize('better', 'a')
u'good'
Here the second parameter does the magic trick. If left blank, it defaults to 'n', or wordnet.NOUN in lemmatize(). Similarly, it should be put explicitly as 'v' or 'r' for normalizing verbs and adverbs, respectively.
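
Since the tagger used in this notebook produces Penn Treebank tags, the tag has to be mapped to one of the WordNet letters before calling `lemmatize`. A minimal sketch of that mapping follows; `penn_to_wordnet` is a hypothetical helper, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's own default.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if penn_tag.startswith('J'):
        return 'a'   # adjectives: JJ, JJR, JJS
    if penn_tag.startswith('V'):
        return 'v'   # verbs: VB, VBD, VBG, ...
    if penn_tag.startswith('R'):
        return 'r'   # adverbs: RB, RBR, RBS (the prefix also matches RP)
    return 'n'       # nouns and everything else

mapped = [penn_to_wordnet(t) for t in ['JJS', 'VBG', 'RB', 'NNS', 'DT']]
print(mapped)
```

The result can be passed straight through, for example `wnl.lemmatize(token, penn_to_wordnet(tag))`.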


In [None]:
wnl.lemmatize('pickling','v')

In [None]:
import pattern3


In [None]:
from pattern3.en import conjugate, lemma, lexeme, parse

In [None]:
reuters_idf('of')

In [None]:
nltk.help.upenn_tagset()

In [None]:
# reduce tokens to root so that variables and variable can be counted as one np
nodelist = []
for i in np_to_sent['president']:
    nodelist.extend(np for np in sent_to_np[i])
nodelist = set(list(chain.from_iterable(nodelist)))



In [None]:
nodelist