# Learning linkset correctness - (MTSR 2019)

**Research Questions**: 
- Can we train a neural network to distinguish between correct and incorrect links?

## Preparing the training sets
The training set is based on a set of linksets generated while building the [Linked Thesaurus fRamework for Environment (LusTRE)](http://linkeddata.ge.imati.cnr.it/) as part of the research activity carried out during two EU-funded projects: NatureSDIPlus and eENVplus. 


The procedure adopted to prepare a view of the linksets with the labels and the broader (BT), narrower (NT), and related (RT) terms is described in:
* [Preparing Linkset involving local dumps](http://localhost:8888/notebooks/ai-related/LinkCorrectess/PreparingLinksetWithLocalDumps.ipynb), which includes all the linksets not involving DBpedia
* [Preparing Linkset involving Dbpedia](http://localhost:8888/notebooks/ai-related/LinkCorrectess/PreparingLinksetWithDBPEDIA.ipynb)

### Useful tutorials
 - A useful tutorial about pandas DataFrames is available at https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/
 - Creating and editing DataFrames: https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
 - [Advanced Jupyter Notebook Tricks — Part I](https://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks/)

### Global variables

In [25]:
### Global variables 
path="data/" # path where to find data
#namesa=['sBT','sprefLabel','sURI','oURI','oprefLabel','oBT', 'KindOfLink'] #column names for training data frame

## What features are we going to consider to characterize a link?


Text and conceptual similarity between preferred labels and between broader terms are considered significant features on which to classify a link.

Different approaches are available to work out the text similarity.

### word2Vec
- A pretrained model for text similarity: http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/ (**pretrained model in Users/bubu/model/ but instructions outdated**)
- Example of usage: https://radimrehurek.com/gensim/models/keyedvectors.html

Other resources
- https://medium.freecodecamp.org/how-to-get-started-with-word2vec-and-then-how-to-make-it-work-d0a2fca9dad3
- https://www.slideshare.net/lechatpito
- https://code.google.com/archive/p/word2vec/
- [Vector Representations of Words in TF](https://www.tensorflow.org/tutorials/representation/word2vec)
- [Stanford course - Word Vector Representations: word2vec](https://www.youtube.com/watch?v=ERibwqs9p38)
- Using word2vec in RapidMiner: https://community.rapidminer.com/discussion/43860/synonym-detection-with-word2vec
- https://www.neuralmarkettrends.com/word2vec-example-process-rapidminer

### GloVe

- https://medium.com/@japneet121/word-vectorization-using-glove-76919685ee0b

### Text Similarity
- Basic text similarities https://pypi.org/project/textdistance/
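
Among the measures offered by textdistance, this notebook later relies on the normalized Hamming similarity. As a rough illustration of what that measure captures, here is a hand-rolled version in plain Python. It is a sketch and an assumption about the behaviour (position-by-position comparison, with the extra characters of the longer string counted as mismatches), not the library's own code; the function name is made up for this example.

```python
from itertools import zip_longest


def normalized_hamming_similarity(a: str, b: str) -> float:
    """Share of positions at which two strings carry the same character."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    # zip_longest pads the shorter string with None, so every extra
    # character of the longer string counts as a mismatch.
    mismatches = sum(x != y for x, y in zip_longest(a, b))
    return 1.0 - mismatches / longest


print(normalized_hamming_similarity("forest", "forest"))   # identical strings
print(normalized_hamming_similarity("forest", "forests"))  # 6 of 7 positions match
print(normalized_hamming_similarity("soil", "oils"))       # same letters, shifted by one
```

Because the comparison is positional, a one-character shift is enough to drive the score to zero, which is worth keeping in mind when the measure is applied to multi-word labels.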



# A - Attempt 1: Let's initialize the Word2Vec with a pre-existing model

## Design choices
- **design choice 1**: We use Google’s pre-trained model, see [here](http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/). It’s 1.5GB! It includes word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset. Each vector has 300 features.
- **design choice 2**: Similarity(s1,s2) implements a first attempt to work out the similarity between two sets of words. It works out the max of sim over the pairs taken from the Cartesian product of the sets, ignoring the words in the stoplist. 
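
The `sim` between two words in design choice 2 is, in gensim's `KeyedVectors.similarity`, the cosine similarity of their vectors. The sketch below shows that computation on made-up 3-dimensional vectors; the vectors and variable names are assumptions for illustration only, whereas the real model works with 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only.
africa = [0.9, 0.1, 0.0]
countries = [0.6, 0.6, 0.1]
water = [0.0, 0.2, 0.9]

print(cosine_similarity(africa, countries))  # vectors pointing in similar directions
print(cosine_similarity(africa, water))      # nearly orthogonal vectors
```

The value depends only on the direction of the vectors, not on their length, and lies between -1 and 1.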





In [29]:
# Loading the pre-trained vectors takes a long time
# https://radimrehurek.com/gensim/models/keyedvectors.html
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

#model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
#word_vectors = model.wv
word_vectors = KeyedVectors.load_word2vec_format("/Users/bubu/model/GoogleNews-vectors-negative300.bin", binary=True)  # C bin format
 

## A1 - How to call the similarity between vectors

In [83]:
#similarity = word_vectors.similarity('africa'.lower(), 'Countries in Africa'.lower().split())
#print(similarity)
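# wmdistance expects two lists of tokens; when given plain strings it iterates
# over their characters, so this call compares two bags of single characters
docdistance=word_vectors.wmdistance('africa', 'Africa')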
print(docdistance)

0.675937254097414


In [84]:
#v= ['woman','man', 'house', 'pippo']
def printifvector(v):
    """Print the vector of each word in v, or a notice when the word is not in the model."""
    for e in v:
        print(e +":")
        try:
            #vector = word_vectors.wv.word_vec( word_vectors.doesnt_match(e), use_norm=True)
            print(word_vectors.get_vector(e)) 
        except KeyError:
            print('exception for ' +e)


## A2 - Procedure to work out similarity on BT according to the first attempt 


In [85]:
# Attempt to work out similarity between two sets of words
# s1 and s2 are two strings, each holding a set of space-separated words
def similarityBetweenSetsSplitingInWords(s1,s2):
    ## remove common words and not indexed words and tokenize
    stoplist = set('for a of the and to in'.split())
    if (type(s1) is not str) or (type(s2) is not str):
        return 0.0
    # tokenize and removing |
    remove= lambda x :x.replace('|',"")
    lw1=list(map(remove, s1.lower().split()))
    lw2=list(map(remove,s2.lower().split()))
    
    strip= lambda x :x.strip()
    lw1=list(map(strip, lw1))
    lw2=list(map(strip, lw2))
    
    
    
    ## what words are indexed?
    lw1 =  [word for word in lw1 if word not in stoplist]
    lw2 =  [word for word in lw2 if word not in stoplist]
    
    print(lw1)
    print(lw2)
    amax=-1.0
    
    # if one of the sets is empty the loops are skipped and the initial -1.0 is returned
    if not ((len(lw1) == 0) or (len(lw2) == 0)): 
        for i in lw1 :  
            lmax=-1.0
            try:
                #test if i is indexed otherwise exception
                word_vectors.get_vector(i) 
            except KeyError as ex:
                print('exception for ' +i)
                continue
            for ii in lw2 :
                try:
                    #test if ii is indexed otherwise exception
                    word_vectors.get_vector(ii)
                except KeyError as ex:
                    print('exception for ' +ii)
                    continue
                sim=word_vectors.similarity(i,ii)
                #print('sim(%s, %s) = %f' %(i,ii,sim) )   
                lmax = max(sim , lmax)
                # note: the running maximum is added once per (i, ii) pair on top of the
                # initial -1.0, so two identical single words score -1.0 + 1.0 = 0.0
                amax+=lmax
                       
    return amax



In [86]:
print(similarityBetweenSetsSplitingInWords("africa", 'africa'))


['africa']
['africa']
0.0


## maxInSplitWords (M)
Given 
*  two sets of words $S_1$ and $S_2$
*  a similarity function $sim$
*  the pairs $(x_i,y_j) \in S_1 \times S_2$
  
maxInSplitWords implements the following mathematical function

$\text{maxInSplitWords}(S_1,S_2,sim)=\max_{i,j}\, sim(x_i,y_j)$
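
To make the formula concrete, the sketch below evaluates it on two tiny word sets. The `max_over_pairs` helper and the `exact_match` similarity are hypothetical names introduced for this example only; `exact_match` stands in for whatever `sim` is plugged in.

```python
from itertools import product


def max_over_pairs(s1, s2, sim):
    """Largest sim(x, y) over all pairs (x, y) in s1 x s2; None if either set is empty."""
    if not s1 or not s2:
        return None
    return max(sim(x, y) for x, y in product(s1, s2))


def exact_match(x, y):
    """Toy similarity (an assumption for this example): 1.0 for equal words, else 0.0."""
    return 1.0 if x == y else 0.0


print(max_over_pairs(["soil", "pollution"], ["water", "pollution"], exact_match))  # 1.0
print(max_over_pairs(["soil"], ["water"], exact_match))                            # 0.0
```

A single well-matching pair is enough to reach the top score, regardless of how many other words the two sets contain.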

In [87]:
# Attempt to work out similarity between two sets of words
# s1 and s2 are two strings, each holding a set of space-separated words
# function is the similarity function to apply
# it returns the maximum similarity over the Cartesian product of the two sets
import math
def maxInSplitWords(s1,s2, function):
    ## remove common words and not indexed words and tokenize
    stoplist = set('for a of the and to in'.split())
    if (type(s1) is not str) or (type(s2) is not str):
        return 0.0
    # tokenize and removing |
    remove= lambda x :x.replace('|',"")
    lw1=list(map(remove, s1.lower().split()))
    lw2=list(map(remove,s2.lower().split()))
    
    strip= lambda x :x.strip()
    lw1=list(map(strip, lw1))
    lw2=list(map(strip, lw2))
    
    
    ## what words are indexed?
    lw1 =  [word for word in lw1 if word not in stoplist]
    lw2 =  [word for word in lw2 if word not in stoplist]
    
    print(lw1)
    print(lw2)
    lmax=float('nan')
    # if one of the sets is empty it returns nan
    if not ((len(lw1) == 0) or (len(lw2) == 0)): 
        for i in lw1 :  
            for ii in lw2 :
                sim=function(i,ii)
                if ( math.isnan(lmax)) :
                    lmax=sim
                else:
                    lmax = max(sim , lmax)                     
    return lmax



## SummingMax (SM)

Given 
*  two sets of words $S_1$ and $S_2$
*  a similarity function $sim$
*  the pairs $(x_i,y_j) \in S_1 \times S_2$
  
summingMax implements the following mathematical function


$\text{summingMax}(S_1,S_2,sim)=\sum_i \max_j\, sim(x_i,y_j)$
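
A small worked example may help to read the formula: for every word of $S_1$ take its best match in $S_2$, then add up those best scores. The helper name and the toy similarity table below are assumptions made for illustration.

```python
def sum_of_row_maxima(s1, s2, sim):
    """Sum, over the words of s1, of the best similarity with any word of s2."""
    if not s1 or not s2:
        return None
    return sum(max(sim(x, y) for y in s2) for x in s1)


# Made-up similarity values for illustration only.
TOY_SIM = {("soil", "ground"): 0.5, ("land", "ground"): 0.5}


def toy_sim(x, y):
    """Toy symmetric similarity: 1.0 for equal words, table lookup otherwise."""
    if x == y:
        return 1.0
    return TOY_SIM.get((x, y), TOY_SIM.get((y, x), 0.0))


print(sum_of_row_maxima(["soil", "land"], ["ground"], toy_sim))  # 0.5 + 0.5 = 1.0
print(sum_of_row_maxima(["ground"], ["soil", "land"], toy_sim))  # max(0.5, 0.5) = 0.5
```

The two calls show that the measure is not symmetric in its arguments and that it grows with the number of words in $S_1$, so scores for labels of different lengths are not directly comparable.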

In [88]:
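# Attempt to work out similarity between two sets of words
# s1 and s2 are two strings, each holding a set of space-separated words
# function is the similarity function to apply
# it returns the sum, over the words of s1, of the maximum similarity with the words of s2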
import math
def summingMax(s1, s2, function):
    ## remove common words and not indexed words and tokenize
    stoplist = set('for a of the and to in'.split())
    if (type(s1) is not str) or (type(s2) is not str):
        return 0.0
    # tokenize and removing |
    remove= lambda x :x.replace('|',"")
    lw1=list(map(remove, s1.lower().split()))
    lw2=list(map(remove,s2.lower().split()))
    
    strip= lambda x :x.strip()
    lw1=list(map(strip, lw1))
    lw2=list(map(strip, lw2))
    
    
    
    ## what words are indexed?
    lw1 =  [word for word in lw1 if word not in stoplist]
    lw2 =  [word for word in lw2 if word not in stoplist]
    
    print(lw1)
    print(lw2)
   
    amax=float('nan')
    # if one of the sets is empty it returns nan
    if not ((len(lw1) == 0) or (len(lw2) == 0)): 
        for i in lw1 :  
            lmax=float('nan')
            for ii in lw2 :
                sim=function(i,ii)
                if (math.isnan(lmax)) :
                    lmax=sim
                else:
                    lmax = max(sim , lmax)
            if (math.isnan(amax)):
                amax=lmax
            else: 
                amax+=lmax
    return amax



In [89]:
import textdistance
print(summingMax('', '', textdistance.hamming.normalized_similarity))


[]
[]
nan


### Generating the List of validated linksets

In [90]:
# https://dzone.com/articles/listing-a-directory-with-python
import os

TrainingFiles= filter(lambda x: x.endswith('EnrichedLinkeset.csv'),  os.listdir(path)) 
 
for file in TrainingFiles : 
    print(file)
       
    

Thist2AGROVOCEnrichedLinkeset.csv
Thist2EUROVOCEnrichedLinkeset.csv
ThIST2BpediaEnrichedLinkeset.csv
Thist2GEMETEnrichedLinkeset.csv


In [103]:
import pandas as pd

def workOutSim():
    TrainingFiles= filter(lambda x: x.endswith('EnrichedLinkeset.csv'),  os.listdir(path)) 
    for file in TrainingFiles: 
        print(file)
        lf=pd.read_csv(path+ file,delimiter=",")
        lf=lf.fillna('')
        
        #BT  
        lf['BT_similaritySInW']=0.0 
        lf['BT_wmdistance']=lf['BT_Mwmdistance']= lf['BT_SMwmdistance']= 0.0
        lf['BT_nhammingSim']= lf['BT_MnhammingSim']= lf['BT_SMnhammingSim']= 0.0
        #Preferred Label
        lf['PrefLabel_similaritySInW']=0.0 
        lf['PrefLabel_wmdistance']=lf['PrefLabel_Mwmdistance']= lf['PrefLabel_SMwmdistance']= 0.0
        lf['PrefLabel_nhammingSim']= lf['PrefLabel_MnhammingSim']= lf['PrefLabel_SMnhammingSim']= 0.0
        #RT
        lf['RT_similaritySInW']=0.0 
        lf['RT_wmdistance']=lf['RT_Mwmdistance']= lf['RT_SMwmdistance']= 0.0
        lf['RT_nhammingSim']= lf['RT_MnhammingSim']= lf['RT_SMnhammingSim']= 0.0


        # (column prefix, subject column, object column) for each group of features
        groups = [('BT','sBT','oBT'), ('RT','sRT','oRT'), ('PrefLabel','sPrefLabel','oPrefLabel')]
        hamming = textdistance.hamming.normalized_similarity
        for i in range(len(lf)):  # start at 0 so that the first row is processed too
            for prefix, sCol, oCol in groups:
                s, o = lf.at[i, sCol], lf.at[i, oCol]
                # skip the pair when either side is missing; the columns keep their 0.0 default
                if s == '' or o == '':
                    continue
                # .at writes into lf directly, avoiding chained assignment (lf[col][i] = ...)
                lf.at[i, prefix+'_similaritySInW'] = similarityBetweenSetsSplitingInWords(s, o)
                lf.at[i, prefix+'_wmdistance'] = word_vectors.wmdistance(s, o)
                lf.at[i, prefix+'_Mwmdistance'] = maxInSplitWords(s, o, word_vectors.wmdistance)
                lf.at[i, prefix+'_SMwmdistance'] = summingMax(s, o, word_vectors.wmdistance)
                lf.at[i, prefix+'_nhammingSim'] = hamming(s, o)
                lf.at[i, prefix+'_MnhammingSim'] = maxInSplitWords(s, o, hamming)
                lf.at[i, prefix+'_SMnhammingSim'] = summingMax(s, o, hamming)
            
        lf.to_csv(path+file.replace('.csv','SimilarityResult.csv'), sep=';')



In [None]:
workOutSim()
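
The row-wise pattern behind `workOutSim()` is: for each group of labels, compare the subject side with the object side and leave the 0.0 default when either side is empty. A stripped-down sketch of that pattern follows. It uses only the standard library, a toy `exact_match` similarity, and a hypothetical `_sim` column suffix, all of which are assumptions for illustration; the notebook itself relies on pandas, word2vec, and textdistance.

```python
import csv
import io


def exact_match(a, b):
    """Toy similarity (an assumption): 1.0 for equal strings, else 0.0."""
    return 1.0 if a == b else 0.0


def add_similarity_columns(rows, groups, sim):
    """Add '<prefix>_sim' per (prefix, s_col, o_col); 0.0 when either side is empty."""
    enriched_rows = []
    for row in rows:
        enriched = dict(row)
        for prefix, s_col, o_col in groups:
            s, o = row.get(s_col, ""), row.get(o_col, "")
            enriched[prefix + "_sim"] = sim(s, o) if s and o else 0.0
        enriched_rows.append(enriched)
    return enriched_rows


# Two made-up links; the second has no broader terms at all.
raw = "sPrefLabel,oPrefLabel,sBT,oBT\nsoil,soil,land,\nwater,air,,\n"
rows = list(csv.DictReader(io.StringIO(raw)))
groups = [("PrefLabel", "sPrefLabel", "oPrefLabel"), ("BT", "sBT", "oBT")]
result = add_similarity_columns(rows, groups, exact_match)

# Write the enriched rows with ';' as separator, as the notebook does for its results.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(result[0]), delimiter=";")
writer.writeheader()
writer.writerows(result)
print(buffer.getvalue())
```

Note that the input is read with `,` and the output written with `;`, so any later step has to read the result files with the matching delimiter.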

## Cleaning the Similarity Results

In [106]:
TrainingFiles= filter(lambda x: x.endswith('SimilarityResult.csv'),  os.listdir(path)) 
for file in TrainingFiles: 
    print(file)
    lf=pd.read_csv(path+ file,delimiter=";")
    lf.to_csv(path+file.replace('SimilarityResult.csv','SimilarityResultOld.csv'), sep=';')
    # keep only the link description columns and the similarity features
    lf1=lf[['s','o','KindOfLink',
            'sBT','sNT','sRT','sPrefLabel','sAltLabels',
            'oPrefLabel','oAltLabels','oBT','oNT','oRT',
            'RT_similaritySInW','RT_wmdistance','RT_Mwmdistance','RT_SMwmdistance',
            'RT_nhammingSim','RT_MnhammingSim','RT_SMnhammingSim',
            'BT_similaritySInW','BT_wmdistance','BT_Mwmdistance','BT_SMwmdistance',
            'BT_nhammingSim','BT_MnhammingSim','BT_SMnhammingSim',
            'PrefLabel_similaritySInW','PrefLabel_wmdistance','PrefLabel_Mwmdistance','PrefLabel_SMwmdistance',
            'PrefLabel_nhammingSim','PrefLabel_MnhammingSim','PrefLabel_SMnhammingSim']]
    lf1.to_csv(path+file.replace('SimilarityResult.csv','SimilarityResult1.csv'), sep=';')
    
    

ThIST2BpediaEnrichedLinkesetSimilarityResult.csv
Thist2AGROVOCEnrichedLinkesetSimilarityResult.csv
Thist2GEMETEnrichedLinkesetSimilarityResult.csv
Thist2EUROVOCEnrichedLinkesetSimilarityResult.csv
