# Learning linkset correctness

**Research Questions**: 
- Can we train a neural network to distinguish between correct and incorrect links?
- Can we train a neural network to distinguish between an exactMatch and a closeMatch?



## Preparing the training sets
The training set is based on a set of linksets generated while building the [Linked Thesaurus fRamework for Environment (LusTRE)](http://linkeddata.ge.imati.cnr.it/) as part of the research activity carried out during two EU-funded projects: NatureSDIPlus and eENVplus.

### Useful tutorials
 - A useful tutorial about pandas DataFrames: https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/
 - Creating and editing DataFrames: https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
 - [Advanced Jupyter Notebook Tricks — Part I](https://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks/)

### Global variables

In [3]:
### Global variables 
path="data/" # path where to find data
namesa=['sBT','sprefLabel','sURI','oURI','oprefLabel','oBT', 'KindOfLink'] #column names for training data frame

### Generating the List of validated linksets

In [4]:
# https://dzone.com/articles/listing-a-directory-with-python
import os

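# Build a list rather than using filter(): in Python 3 filter() returns a
# one-shot iterator, so a second loop over it would print nothing.
TrainingFiles = [f for f in os.listdir(path) if f.endswith('.csv')]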
 
for file in TrainingFiles : 
    print(file)
       
    

Thist2CGI_ICST_ok.csv
ThIST2BpediaEnrichedLinkeset.csv
Thist2DBPEDIA_ok.csv
DBPEDIA_seeds.csv
Thist2AGROVOC_ok.csv
Thist2DBPEDIA_ok_res.csv
Thist2GEMET_ok.csv
Thist2EUROVOC_ok.csv


### Reading a specific validated linkset (Thist2DBPEDIA_ok.csv)

In [5]:
#preparing a dictionary with  

for file in TrainingFiles : 
    print(file)
       

In [6]:
import pandas as pd


# reading Thist2DBPEDIA_ok.csv
THIST2DBPEDIA_df=pd.read_csv(path+'Thist2DBPEDIA_ok.csv',delimiter=";",names=namesa)
# dropping row 0, which contains the original column names
THIST2DBPEDIA_df=THIST2DBPEDIA_df.drop([0])
THIST2DBPEDIA_df.head()

THIST2DBPEDIA_df.count()


sBT           2304
sprefLabel    3455
sURI          3455
oURI          3455
oprefLabel    3455
oBT           3427
KindOfLink    3428
dtype: int64

### Crawling RDF 


In [9]:
import pandas as pd

# settings for crawling
pathLDSpider='/Users/bubu/Work/Programming'
pathcrawling=path+'crawling/'
CrawlingDepth="1"
CrawlingNumberOfURI="-1"  # not limited
crawledFile=pathcrawling+'DBPEDIA_Crawling.nt'
pathLDSpider+='/ldspider-1.1e.jar'
pathSeed=pathcrawling+'DBPEDIA_Crawling_seeds.csv'
pathLog=pathcrawling+'DBPEDIA_Crawling_seeds.log'

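# Preparing the seed file: one distinct target URI (oURI) per line.
# Note: mode='a' appends, so re-running this cell writes the same seeds again.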

r=THIST2DBPEDIA_df["oURI"].drop_duplicates()
r.to_csv(pathSeed, header=False, index=False, sep='\t', mode='a')



#### A bash cell for crawling via LDSpider

In [11]:
%%bash -s "$crawledFile" "$pathLDSpider" "$pathSeed" "$pathLog" "$CrawlingDepth" "$CrawlingNumberOfURI" 
#echo "$1 $2 $3 $4 $5 $6"
# Arguments are quoted so that paths containing spaces are passed intact
if test -f "$1"; then
 echo "File $1 ; exist";
else
 echo "File $1 Not exist; start crawling java -jar $2 -s $3 -a $4  -y -b $5 $6 -o $1"
 java -jar "$2" -s "$3" -a "$4" -y -b "$5" "$6" -o "$1"
fi

File data/crawling/DBPEDIA_Crawling.nt ; exist
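
The same check-then-crawl logic can also be sketched in Python. The flags below are copied as-is from the bash cell above and have not been checked against the LDSpider documentation; `build_crawl_command` and `crawl_if_missing` are hypothetical helper names, and the file names in the last lines are placeholders.

```python
import os
import subprocess

def build_crawl_command(jar, seed_file, log_file, depth, max_uris, out_file):
    """Assemble the LDSpider command line (flags copied from the bash cell)."""
    return ["java", "-jar", jar, "-s", seed_file, "-a", log_file,
            "-y", "-b", str(depth), str(max_uris), "-o", out_file]

def crawl_if_missing(jar, seed_file, log_file, depth, max_uris, out_file):
    """Run the crawler only when the output file does not exist yet."""
    if os.path.isfile(out_file):
        print(f"File {out_file} exists; skipping crawl")
        return False
    command = build_crawl_command(jar, seed_file, log_file, depth, max_uris, out_file)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return True

# Only builds and prints the command; nothing is executed here.
cmd = build_crawl_command("ldspider-1.1e.jar", "seeds.csv", "crawl.log", 1, -1, "out.nt")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.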


## What features are we going to consider to characterize a link?


Textual and conceptual similarity between preferred labels and between broader terms are considered significant features for classifying a link.

Different approaches are available to work out the text similarity:

### word2Vec
- A pretrained model for text similarity: http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/ (**pretrained model in Users/bubu/model/, but the instructions are outdated**)
- Example of usage: https://radimrehurek.com/gensim/models/keyedvectors.html

Other resources:
- https://medium.freecodecamp.org/how-to-get-started-with-word2vec-and-then-how-to-make-it-work-d0a2fca9dad3
- https://www.slideshare.net/lechatpito
- https://code.google.com/archive/p/word2vec/
- [Vector Representations of Words in TensorFlow](https://www.tensorflow.org/tutorials/representation/word2vec)
- [Stanford course - Word Vector Representations: word2vec](https://www.youtube.com/watch?v=ERibwqs9p38)
- Using word2vec in RapidMiner: https://community.rapidminer.com/discussion/43860/synonym-detection-with-word2vec
  - https://www.neuralmarkettrends.com/word2vec-example-process-rapidminer

### GloVe

- https://medium.com/@japneet121/word-vectorization-using-glove-76919685ee0b

### Text Similarity
- Basic text similarities https://pypi.org/project/textdistance/
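
To make the `textdistance.hamming.normalized_similarity` feature used later more concrete, here is a pure-Python sketch of a normalized Hamming similarity. It assumes that positions are compared one by one, that missing positions in the shorter string count as mismatches, and that the result is normalized by the longer length. It is meant to convey the idea, not to reproduce the library's implementation exactly.

```python
from itertools import zip_longest

def normalized_hamming_similarity(a: str, b: str) -> float:
    """1 - (mismatching positions / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    # zip_longest pads the shorter string with None, which never equals a character
    mismatches = sum(x != y for x, y in zip_longest(a, b))
    return 1.0 - mismatches / longest

print(normalized_hamming_similarity("africa", "africa"))
print(normalized_hamming_similarity("test", "text"))
print(normalized_hamming_similarity("africa", "east africa"))
```

Because the comparison is positional, "africa" and "east africa" score low even though one contains the other, which is a reason to also apply the measure token by token.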



In [47]:
#import textdistance as textd
## glo ve
# import nltk 
# nltk.download() 


# A - Attempt 1: Let's initialize the Word2Vec with a pre-existing model

## Design choices
- **design choice 1**: We use Google's pre-trained model (see [here](http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/)). It is 1.5 GB and includes word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset. Each vector has 300 dimensions.
- **design choice 2**: Similarity(s1,s2) implements a first attempt to work out the similarity between two sets of words. It takes the maximum similarity over the pairs in the Cartesian product of the two sets, ignoring the words in the stoplist.
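
The preprocessing behind design choice 2 (lowercasing, removing the `|` separator, dropping stoplist words) can be sketched as a single helper. `tokenize` is a hypothetical name; the stoplist is the one used in the notebook's functions. Unlike the notebook code, this sketch also drops tokens that become empty after removing `|`.

```python
STOPLIST = frozenset("for a of the and to in".split())

def tokenize(label):
    """Split a label into lowercase tokens, dropping '|' and stoplist words.

    Non-string input (e.g. NaN for a missing broader term) yields an empty list.
    """
    if not isinstance(label, str):
        return []
    tokens = (word.replace("|", "").strip() for word in label.lower().split())
    # Assumption: tokens left empty after removing '|' carry no information
    return [t for t in tokens if t and t not in STOPLIST]

print(tokenize("Countries in Africa"))
print(tokenize("Africa| Central Africa"))
print(tokenize(float("nan")))
```

Keeping this step in one place would avoid repeating it in every similarity function.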





In [108]:
# Loading the model takes a long time
# https://radimrehurek.com/gensim/models/keyedvectors.html
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

#model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
#word_vectors = model.wv
word_vectors = KeyedVectors.load_word2vec_format("/Users/bubu/model/GoogleNews-vectors-negative300.bin", binary=True)  # C bin format
 

## A1 - How to call the similarity between vectors

In [78]:
#similarity = word_vectors.similarity('africa'.lower(), 'Countries in Africa'.lower().split())
#print(similarity)
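# NOTE: wmdistance expects two lists of tokens; plain strings like these are
# iterated character by character rather than treated as single words.
docdistance=word_vectors.wmdistance('africa', 'Africa')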
print(docdistance)

0.675937254097414


In [50]:
#v= ['woman','man', 'house', 'pippo']
def printifvector(v):
    for e in v:
        print(e +":")
        try:
        #vector = word_vectors.wv.word_vec( word_vectors.doesnt_match(e), use_norm=True)
            print(word_vectors.get_vector(e)) 
        except KeyError as ex:
            print('exception for ' +e)


## A2 - Procedure to work out similarity on BT according to the first attempt


In [51]:
# Attempt to work out the similarity between two sets of words
# s1 and s2 are two strings, each containing a set of words
def similarityBetweenSetsSplitingInWords(s1,s2):
    ## remove common words and not indexed words and tokenize
    stoplist = set('for a of the and to in'.split())
    if (type(s1) is not str) or (type(s2) is not str):
        return 0.0
    # tokenize and removing |
    remove= lambda x :x.replace('|',"")
    lw1=list(map(remove, s1.lower().split()))
    lw2=list(map(remove,s2.lower().split()))
    
    strip= lambda x :x.strip()
    lw1=list(map(strip, lw1))
    lw2=list(map(strip, lw2))
    
    
    
    ## what words are indexed?
    lw1 =  [word for word in lw1 if word not in stoplist]
    lw2 =  [word for word in lw2 if word not in stoplist]
    
    print(lw1)
    print(lw2)
    amax=-1.0
    
    # if one of the sets is empty the loops are skipped and amax keeps its initial value (-1.0)
    if not ((len(lw1) == 0) or (len(lw2) == 0)): 
        for i in lw1 :  
            lmax=-1.0
            try:
                #test if i is indexed otherwise exception
                word_vectors.get_vector(i) 
            except KeyError as ex:
                print('exception for ' +i)
                continue
            for ii in lw2 :
                try:
                     #test if ii is indexed otherwise exception
                    word_vectors.get_vector(ii)
                except KeyError as ex:
                    print('exception for ' +ii)
                    continue
                sim=word_vectors.similarity(i,ii)
                #print('sim(%s, %s) = %f' %(i,ii,sim) )   
                lmax = max(sim , lmax)
                # amax is updated after every pair (i, ii), so it accumulates the running maximum
                amax+=lmax
                       
    return amax



In [1]:
print(similarityBetweenSetsSplitingInWords("", ''))


NameError: name 'similarityBetweenSetsSplitingInWords' is not defined

## maxInSplitWords (M)
Given 
*  two sets of words $S_1$ and $S_2$
*  a similarity function $sim$
*  the pairs $(x_i,y_j) \in S_1 \times S_2$
  
maxInSplitWords implements the following mathematical function

$\text{maxInSplitWords}(S_1,S_2,sim)=\max_{i,j}\, sim(x_i,y_j)$
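
As a worked illustration of the formula, the sketch below evaluates the maximum over the Cartesian product for two small token lists. `toy_sim` is a hypothetical stand-in based on `difflib`; in the notebook the similarity comes from word2vec or `textdistance`.

```python
from difflib import SequenceMatcher
from itertools import product

def toy_sim(x, y):
    """Hypothetical character-based similarity in [0, 1]."""
    return SequenceMatcher(None, x, y).ratio()

def max_over_pairs(tokens1, tokens2, sim):
    """Maximum of sim(x_i, y_j) over all pairs; NaN if either list is empty."""
    scores = [sim(x, y) for x, y in product(tokens1, tokens2)]
    return max(scores) if scores else float("nan")

# The pair ('africa', 'africa') is an exact match, so it determines the maximum.
best = max_over_pairs(["africa"], ["countries", "africa"], toy_sim)
print(best)
```

A single shared word is enough to reach the top score, regardless of how many other words differ.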

In [131]:
# Attempt to work out the similarity between two sets of words
# s1 and s2 are two strings, each containing a set of words
# function is the similarity function to apply
# it returns the maximum similarity over the Cartesian product of the two sets
import math  # needed for math.isnan below
def maxInSplitWords(s1,s2, function):
    ## remove common words and not indexed words and tokenize
    stoplist = set('for a of the and to in'.split())
    if (type(s1) is not str) or (type(s2) is not str):
        return 0.0
    # tokenize and removing |
    remove= lambda x :x.replace('|',"")
    lw1=list(map(remove, s1.lower().split()))
    lw2=list(map(remove,s2.lower().split()))
    
    strip= lambda x :x.strip()
    lw1=list(map(strip, lw1))
    lw2=list(map(strip, lw2))
    
    
    ## what words are indexed?
    lw1 =  [word for word in lw1 if word not in stoplist]
    lw2 =  [word for word in lw2 if word not in stoplist]
    
    print(lw1)
    print(lw2)
    lmax=float('nan')
    # if one of the sets is empty it returns NaN
    if not ((len(lw1) == 0) or (len(lw2) == 0)): 
        for i in lw1 :  
            for ii in lw2 :
                sim=function(i,ii)
                if ( math.isnan(lmax)) :
                    lmax=sim
                else:
                    lmax = max(sim , lmax)                     
    return lmax



## SummingMax (SM)

Given 
*  two sets of words $S_1$ and $S_2$
*  a similarity function $sim$
*  the pairs $(x_i,y_j) \in S_1 \times S_2$
  
summingMax implements the following mathematical function


$\text{summingMax}(S_1,S_2,sim)=\sum_i \max_{j}\, sim(x_i,y_j)$
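
The difference from maxInSplitWords shows up on a small example: each word of $S_1$ contributes its best match in $S_2$, and the contributions are added. The similarity table below is made up for illustration, and `summing_max` is a compact sketch of the formula rather than the notebook's function.

```python
# Made-up similarity values, for illustration only.
SIM = {
    ("east", "countries"): 0.1, ("east", "africa"): 0.3,
    ("africa", "countries"): 0.2, ("africa", "africa"): 1.0,
}

def table_sim(x, y):
    """Look up the made-up similarity of a pair of words."""
    return SIM[(x, y)]

def summing_max(tokens1, tokens2, sim):
    """Sum over x_i of the best similarity between x_i and any y_j; NaN if a list is empty."""
    if not tokens1 or not tokens2:
        return float("nan")
    return sum(max(sim(x, y) for y in tokens2) for x in tokens1)

# 0.3 (best match for 'east') plus 1.0 (best match for 'africa')
total = summing_max(["east", "africa"], ["countries", "africa"], table_sim)
print(total)
```

Note that the score is not normalized: it grows with the number of words in $S_1$, so values for labels of different lengths are not directly comparable.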

In [132]:
# Attempt to work out the similarity between two sets of words
# s1 and s2 are two strings, each containing a set of words
# function is the similarity function to apply
import math
def summingMax(s1, s2, function):
    ## remove common words and not indexed words and tokenize
    stoplist = set('for a of the and to in'.split())
    if (type(s1) is not str) or (type(s2) is not str):
        return 0.0
    # tokenize and removing |
    remove= lambda x :x.replace('|',"")
    lw1=list(map(remove, s1.lower().split()))
    lw2=list(map(remove,s2.lower().split()))
    
    strip= lambda x :x.strip()
    lw1=list(map(strip, lw1))
    lw2=list(map(strip, lw2))
    
    
    
    ## what words are indexed?
    lw1 =  [word for word in lw1 if word not in stoplist]
    lw2 =  [word for word in lw2 if word not in stoplist]
    
    print(lw1)
    print(lw2)
   
    amax=float('nan')
    # if one of the sets is empty it returns NaN
    if not ((len(lw1) == 0) or (len(lw2) == 0)): 
        for i in lw1 :  
            # lmax: best similarity between word i and any word of lw2
            lmax=float('nan')
            for ii in lw2 :
                sim=function(i,ii)
                if (math.isnan(lmax)) :
                    lmax=sim
                else:
                    lmax = max(sim , lmax)
            # amax: sum of the per-word maxima
            if (math.isnan(amax)):
                amax=lmax
            else: 
                amax+=lmax
    return amax



In [2]:
print(summingMax('', '', textdistance.hamming.normalized_similarity))


NameError: name 'summingMax' is not defined

In [145]:
a = 'africa'.lower().split()
b = 'countries in africa'.lower().split()
c ='countries in Europe'.lower().split()
d ='Italy'  # NOTE: a plain string, not a token list, so wmdistance iterates it character by character
similarity = word_vectors.wmdistance(a, b)
print("{:.4f}".format(similarity))
similarity = word_vectors.wmdistance(a, c)
print("{:.4f}".format(similarity))
similarity = word_vectors.wmdistance(b, c)
print("{:.4f}".format(similarity))
similarity = word_vectors.wmdistance(a, d)
print("{:.4f}".format(similarity))

#a='great africa'.lower().strip().split()
#b='Countries in Africa'.lower().strip().split()
#print(a)
#print(b)
#similarity = word_vectors.distances(a,b )
#print(similarity)
#docdistance=word_vectors.wmdistance('woman', 'man')
#print(docdistance)

2.6975
3.7890
1.0915
4.1048


In [147]:
import textdistance
#df  data frame
df=THIST2DBPEDIA_df.drop_duplicates()

#df.assign(wSim = lambda df : similarity(df.sBT,df.oBT))
#print(df['sBT'])
#print(df['oBT'])

#sim = lambda x,y : similarity(x , y)
#df.assign(sim = lambda x, y : df.sBT + df.oBT)
##df.assign(sim = lambda df: df.sBT + df.oBT)
df['similaritySInW']=0.0 
df['wmdistance']=df['Mwmdistance']= df['SMwmdistance']= 0.0
df['nhammingSim']= df['MnhammingSim']= df['SMnhammingSim']= 0.0

l = range(1, len(df))
for i in l:
     df['similaritySInW'][i] =similarityBetweenSetsSplitingInWords(df.sBT[i], df.oBT[i])
     df['wmdistance'][i]=word_vectors.wmdistance(df.sBT[i], df.oBT[i])
     df['Mwmdistance'][i]= maxInSplitWords(df.sBT[i], df.oBT[i], word_vectors.wmdistance)
     df['SMwmdistance'][i]= summingMax(df.sBT[i], df.oBT[i], word_vectors.wmdistance)
     df['nhammingSim'][i]=textdistance.hamming.normalized_similarity(df.sBT[i], df.oBT[i])
     df['MnhammingSim'][i]=maxInSplitWords(df.sBT[i], df.oBT[i],textdistance.hamming.normalized_similarity)
     df['SMnhammingSim'][i] =summingMax(df.sBT[i], df.oBT[i],textdistance.hamming.normalized_similarity)
    

['africa']
['countries', 'africa']


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


['africa']
['east', 'africa']
['africa']
['french-speaking', 'countries', 'territories']
exception for french-speaking
['africa']
['island', 'countries']
['africa']
['islands', 'africa']
['africa']
['member', 'states', 'la', 'francophonie']
[...]
['africa', 'north', 'africa']
['maghreb', 'maghreb']
exception for maghreb
[...]
['africa', 'west', 'africa']
['west', 'africa', 'west', 'africa']


TypeError: object of type 'float' has no len()
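
The `SettingWithCopyWarning` messages and the final `TypeError` in the output above point to two likely issues in the loop: chained assignment (`df['col'][i] = ...`) on the result of `drop_duplicates()`, and rows whose `sBT` or `oBT` is missing (a missing value is not a string and has no length). A minimal sketch of a more defensive loop follows, assuming a toy DataFrame and a stand-in similarity; `toy_similarity` and the `toySim` column are hypothetical and replace the word2vec-based measures.

```python
import pandas as pd
from difflib import SequenceMatcher

def toy_similarity(a, b):
    """Hypothetical stand-in for the notebook's similarity functions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Toy data: the second row has a missing broader term, as happens in the real linkset.
raw = pd.DataFrame({
    "sBT": ["Africa", None, "Africa"],
    "oBT": ["Countries in Africa", "East Africa", "Africa"],
})

# .copy() makes df an independent frame, so later assignments do not target a slice.
df = raw.drop_duplicates().copy()
df["toySim"] = float("nan")

# Iterate over the actual index labels instead of range(1, len(df)).
for i in df.index:
    s, o = df.loc[i, "sBT"], df.loc[i, "oBT"]
    if isinstance(s, str) and isinstance(o, str):
        # .loc[row, column] writes into df itself (no chained indexing).
        df.loc[i, "toySim"] = toy_similarity(s, o)

print(df)
```

Rows with a missing broader term keep NaN in the new column, which makes them easy to filter out before training.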

#### Printing the computed similarities

In [146]:
print(df[['sBT','oBT','similaritySInW', 'wmdistance','Mwmdistance','SMwmdistance','nhammingSim','MnhammingSim','SMnhammingSim']])


                       sBT                                                oBT  \
1                   Africa                                Countries in Africa   
2                   Africa                                        East Africa   
3                   Africa          French-speaking countries and territories   
4                   Africa                                   Island countries   
5                   Africa                                  Islands of Africa   
6                   Africa                   Member states of La Francophonie   
7                   Africa                            Physiographic provinces   
8                   Africa                                    Southern Africa   
9                   Africa         States and territories established in 1960   
10                  Africa         Wikipedia categories named after countries   
11                  Africa           Wikipedia categories named after islands   
12                  Africa  

In [148]:
#saving the results in Thist2DBPEDIA_ok_res.csv
df.to_csv(path+'Thist2DBPEDIA_ok_res.csv', sep=';', index=False) 

# Considerations - lessons learnt for further attempts
The similarity computed on BT tokens does not seem to characterize the links well.
This is due to the design choices and the data preparation:
- **choice 1**: we split each BT (e.g. "Countries in Africa") into tokens (Countries, in, Africa), as "wmdistance" does not work on composed text
     - should we use another pretrained model or build our own? https://radimrehurek.com/gensim/wiki.html#preparing-the-corpus
     - should we avoid splitting the words into a flat structure?
     - shall we use the sum of the vectors instead of splitting?
- **choice 2**: the similarity functions implement a naive, brute-force approach
    - exceptions about non-indexed words are repeated too often; should we check for them in the similarity function, outside the for loops?
    - is it the right similarity?
        - we might study and try other similarities provided by the same package, Gensim
            - evaluate_word_pairs
            - should we use document similarity? distance(d1, d2)
- **Data preparation**:
    - there is more than one row for each link to be validated
    - BTs are preferred labels instead of URIs
    - BTs contain repetitions, which are re-evaluated during the similarity calculation and distort the score; can we remove them at the source?
- **Should we train word2vec differently?**
    - **Are there any semantic embeddings of entities that work with URIs?**
       - [Entity2vec](https://github.com/ot/entity2vec)
       - [Entity embeddings](https://towardsdatascience.com/understanding-entity-embeddings-and-its-application-69e37ae1501d)
       - [An Introduction to Deep Learning for Tabular Data](https://www.fast.ai/2018/04/29/categorical-embeddings/)
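
Three of the points above (checking for non-indexed words once, removing repeated BT tokens, and comparing combined vectors instead of individual words) can be sketched together. The tiny 2-dimensional `EMBEDDING` dictionary is invented for illustration and stands in for the word2vec model; `known_unique`, `mean_vector` and `cosine` are hypothetical helpers, not part of Gensim.

```python
import math

# Invented 2-d vectors standing in for a real word-embedding model.
EMBEDDING = {
    "africa": (1.0, 0.0),
    "east": (0.6, 0.8),
    "countries": (0.0, 1.0),
}

def known_unique(tokens, vocabulary):
    """Drop repeated and out-of-vocabulary tokens in one pass, keeping order."""
    return [t for t in dict.fromkeys(tokens) if t in vocabulary]

def mean_vector(tokens, embedding):
    """Average the vectors of the given tokens; None if no token is usable."""
    vectors = [embedding[t] for t in tokens]
    if not vectors:
        return None
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))

def cosine(u, v):
    """Cosine similarity of two non-zero 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

s_tokens = known_unique(["africa", "east", "africa", "djibouti", "djibouti"], EMBEDDING)
o_tokens = known_unique(["countries", "africa", "countries", "africa"], EMBEDDING)
score = cosine(mean_vector(s_tokens, EMBEDDING), mean_vector(o_tokens, EMBEDDING))
print(s_tokens, o_tokens, score)
```

With this layout the vocabulary check happens once per label rather than once per word pair, and repeated tokens no longer weigh on the score.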