# **Generating MUSCLE semantic split**
Given a file with the Louvain communities (`louvainGlobalWithIDs.csv`) and the complete dataset (`dataset_P_L25.csv`), this notebook generates the train and test sets of the MUSCLE semantic split.

In [5]:
# download file louvainGlobalWithIDs.csv
!gdown '1DqVW0Cy-B7nshjjBcaZfQpN5qyNAfmU0'

# download file dataset_P_L25.csv
!gdown '1cYPnih0UVpbdzkVa16wTrabdIisdWsAy'

Downloading...
From: https://drive.google.com/uc?id=1DqVW0Cy-B7nshjjBcaZfQpN5qyNAfmU0
To: /content/louvainGlobalWithIDs.csv
100% 170k/170k [00:00<00:00, 19.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1cYPnih0UVpbdzkVa16wTrabdIisdWsAy
To: /content/dataset_P_L25.csv
100% 23.6M/23.6M [00:00<00:00, 50.0MB/s]


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
import ast
import re
import numpy as np
from scipy.stats import entropy

**Read the Louvain communities**

In [4]:
FILE_NAME = '/content/louvainGlobalWithIDs.csv'

louvain_com = pd.read_csv(FILE_NAME)
communities_louvain = []
# The third column holds each community as a bracketed, comma-separated string.
for raw in louvain_com.iloc[:, 2]:
    raw = re.sub(r'[\[\]]', '', raw)  # strip the square brackets
    communities_louvain.append(raw.split(','))
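
The parsing step above splits on commas without trimming the items. If the community strings ever contain spaces or quotes around the IDs (an assumption; the actual file layout is not shown here), a slightly more defensive helper could look like the sketch below. The name `parse_community` and the sample string are hypothetical.

```python
def parse_community(raw: str) -> list[str]:
    """Turn a string such as "[Q1, 'Q2',Q3]" into ['Q1', 'Q2', 'Q3']."""
    # Remove the enclosing square brackets (and any outer whitespace).
    inner = raw.strip().strip("[]")
    # Trim whitespace and quotes around each item.
    items = (item.strip().strip("'\"") for item in inner.split(","))
    # Skip empty items, e.g. for an empty community "[]".
    return [item for item in items if item]


# Hypothetical sample value, not taken from louvainGlobalWithIDs.csv.
print(parse_community("[Q33514, 'Q19860',Q2736]"))
```

Trimming matters because `isin` compares strings exactly, so an ID with a leading space would silently fail to match the `subject` and `object` columns.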

**Find all concepts in the Louvain communities**

In [None]:
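# Flatten the list of communities into a single list of concepts.
all_concepts = sum(communities_louvain, [])
print(len(communities_louvain))                    # number of communities
print(len(all_concepts))                           # number of concepts
print(len(all_concepts) - len(set(all_concepts)))  # duplicated concepts (0 = communities are disjoint)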

1718
7231
0


**Split the concepts into train and test sets of roughly equal size (50% train / 50% test), keeping each community whole on one side.**

In [None]:
# Greedy balancing: add each community to whichever side is currently smaller (ties go to train).
len_train = 0
concepts_train = []
len_test = 0
concepts_test = []
for c in communities_louvain:
    if len_train <= len_test:
        concepts_train.extend(c)
        len_train += len(c)
    else:
        concepts_test.extend(c)
        len_test += len(c)

In [None]:
print(len(concepts_train))
print(len(concepts_test))

3616
3615
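
The balancing loop above can be read as a greedy assignment: each community goes, as a whole, to whichever side currently holds fewer concepts. The sketch below restates that rule as a function and runs it on toy data. The name `greedy_split` and the toy communities are illustrative and not part of the notebook.

```python
def greedy_split(communities):
    """Assign whole communities to train or test, always filling the smaller side."""
    train, test = [], []
    for community in communities:
        # Same rule as the notebook loop: ties go to train.
        target = train if len(train) <= len(test) else test
        target.extend(community)
    return train, test


# Toy communities (hypothetical concept names).
toy_communities = [["a", "b", "c"], ["d"], ["e", "f"], ["g"], ["h"]]
toy_train, toy_test = greedy_split(toy_communities)
print(toy_train)  # ['a', 'b', 'c', 'g']
print(toy_test)   # ['d', 'e', 'f', 'h']
```

With this rule the two sides can never differ by more than the size of the largest community, but the exact balance depends on the community sizes and the order in which they are processed.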


**Read the complete dataset `dataset_P_L25.csv`.**

In [6]:
data_25 = pd.read_csv('dataset_P_L25.csv')

In [None]:
data_25

Unnamed: 0,subject,object,relation_type,property,en_label_subject,en_label_object,fr_label_subject,fr_label_object,de_label_subject,de_label_object,...,tr_label_subject,tr_label_object,id_label_subject,id_label_object,sr_label_subject,sr_label_object,hu_label_subject,hu_label_object,da_label_subject,da_label_object
0,Q33514,Q19860,hyponym for,P279,Indo-Iranian,Indo-European,langues indo-iraniennes,langues indo-européennes,Indoiranisch,indogermanische Sprachen,...,Hint-İran dilleri,Hint-Avrupa dil ailesi,Rumpun bahasa Indo-Iran,Rumpun bahasa Indo-Eropa,индо-ирански језици,индоевропски језици,indoiráni nyelvek,indoeurópai nyelvcsalád,Indoiranske sprog,indoeuropæiske sprog
1,Q2736,Q28640,random,random,association football,profession,football,profession,Fußball,Beruf,...,futbol,meslek,sepak bola,profesi,фудбал,занимање,labdarúgás,szakma,fodbold,profession
2,Q166376,Q172833,random,random,doping in sport,broom,dopage sportif,balai,Doping,Besen,...,Doping,Süpürge,Doping,Sapu,допинг,метла,dopping,seprű,Doping,kost
3,Q194235,Q44722,hyperonym for,P279_inv,lunisolar calendar,Hebrew calendar,calendrier luni-solaire,calendrier hébraïque,Lunisolarkalender,Jüdischer Kalender,...,lunisolar takvim,İbrani takvimi,Kalender suryacandra,Kalender Ibrani,Лунисоларни календар,Јеврејски календар,Szolunáris naptár,zsidó naptár,lunisolarkalender,Den jødiske kalender
4,Q93200,Q44602,random,random,sexism,fasting,sexisme,jeûne,Sexismus,Fasten,...,cinsiyetçilik,oruç,seksisme,puasa,сексизам,Пост,szexizmus,böjt,sexisme,Faste
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27502,Q186385,Q56061,random,random,caviar,administrative territorial entity,caviar,entité territoriale administrative,Kaviar,administrativ-territoriale Entität,...,Havyar,idari bölünüş,kaviar,wilayah administratif,Кавијар,управна јединица,kaviár,közigazgatási egység,kaviar,administrativ-territorial enhed
27503,Q7918,Q6583695,random,random,Bulgarian,thermal expansion,bulgare,dilatation thermique,Bulgarisch,Wärmeausdehnung,...,Bulgarca,Genleşme,Bahasa Bulgaria,Pemuaian,бугарски језик,Termička dilatacija,bolgár,hőtágulás,bulgarsk,Termisk ekspansion
27504,Q32090,Q152234,random,random,lifestyle,edema,mode de vie,œdème,Lebensstil,Ödem,...,yaşam tarzı,Ödem,gaya hidup,Sembap,животни стил,Otok,életstílus,ödéma,livsstil,ødem
27505,Q48422,Q845120,random,random,cadaver,segc,cadavre,économie du Brésil,Leichnam,Wirtschaft Brasiliens,...,ceset,Brezilya ekonomisi,jenazah,ekonomi Brasil,леш,привреда Бразила,holttest,Brazília gazdasága,menneskelig,Brasiliens økonomi


In [None]:
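# Collect every concept that appears as a subject or an object in the dataset.
all_25 = data_25.subject.tolist()
all_25.extend(data_25.object.tolist())
all_25 = set(all_25)
print(len(all_25))
print(len(all_concepts))
# If the union is no larger than either set, the dataset and the communities cover the same concepts.
all_25 = all_25.union(all_concepts)
print(len(all_25))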

7231
7231
7231


**Split the complete dataset into train and test according to the train/test concepts computed above, dropping every relation that links a train concept to a test concept.**

In [None]:
filter_train = data_25.subject.isin(concepts_train) & data_25.object.isin(concepts_train)
filter_test = data_25.subject.isin(concepts_test) & data_25.object.isin(concepts_test)

In [None]:
print(sum(filter_train))
print(sum(filter_test))

7616
7841


In [None]:
data_25_train = data_25[filter_train]
data_25_test = data_25[filter_test]
print(data_25_train.shape)
print(data_25_test.shape)

(7616, 54)
(7841, 54)
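
To make the effect of the two filters concrete, here is a small sketch on a toy frame: a row is kept only when its subject and object fall on the same side, and any row that crosses the split is dropped from both sets. All names prefixed with `toy_` and the concept labels are made up for illustration.

```python
import pandas as pd

# Toy relation pairs (hypothetical concepts A-D).
toy_pairs = pd.DataFrame({
    "subject": ["A", "A", "C", "D"],
    "object":  ["B", "C", "D", "A"],
})
toy_concepts_train = ["A", "B"]
toy_concepts_test = ["C", "D"]

# Keep a row only if BOTH endpoints belong to the same side.
toy_in_train = toy_pairs.subject.isin(toy_concepts_train) & toy_pairs.object.isin(toy_concepts_train)
toy_in_test = toy_pairs.subject.isin(toy_concepts_test) & toy_pairs.object.isin(toy_concepts_test)

# Rows matching neither filter cross the split and are discarded.
toy_dropped = toy_pairs[~(toy_in_train | toy_in_test)]

print(toy_in_train.sum(), toy_in_test.sum(), len(toy_dropped))  # 1 1 2
```

Because the concept sets are disjoint, no row can satisfy both filters, so the kept rows and the dropped rows always add up to the size of the full dataset.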


In [None]:
import csv
data_25_train.to_csv('dataset_L25_train.csv', index=False, quoting=csv.QUOTE_ALL)
data_25_test.to_csv('dataset_L25_test.csv', index=False, quoting=csv.QUOTE_ALL)

In [None]:
# Number of non-random relations in the train set.
sum(data_25_train.relation_type.value_counts()) - data_25_train.relation_type.value_counts()['random']

2903

In [None]:
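# Relation-type counts and proportions for the train set.
print(data_25_train.relation_type.value_counts())
print(data_25_train.relation_type.value_counts()/sum(data_25_train.relation_type.value_counts()))

# Relation-type counts and proportions for the test set.
print(data_25_test.relation_type.value_counts())
print(data_25_test.relation_type.value_counts()/sum(data_25_test.relation_type.value_counts()))

# Non-random relations per split; drop 'random' by label instead of assuming it is the first entry.
print(data_25_train.relation_type.value_counts().drop('random').sum())
print(data_25_test.relation_type.value_counts().drop('random').sum())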

random           4713
hyponym for       993
hyperonym for     955
holonym for       470
meronym for       340
antonym for       145
Name: relation_type, dtype: int64
random           0.618829
hyponym for      0.130383
hyperonym for    0.125394
holonym for      0.061712
meronym for      0.044643
antonym for      0.019039
Name: relation_type, dtype: float64
random           4833
hyperonym for     952
hyponym for       888
holonym for       691
meronym for       354
antonym for       123
Name: relation_type, dtype: int64
random           0.616375
hyperonym for    0.121413
hyponym for      0.113251
holonym for      0.088127
meronym for      0.045147
antonym for      0.015687
Name: relation_type, dtype: float64
2903
3008
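
One way to summarize how closely the two splits agree is to compare their relation-type distributions with a single number. The sketch below uses the counts printed above and computes a Kullback-Leibler divergence in plain Python. The choice of KL divergence is an assumption made for illustration, not a step taken in this notebook; `scipy.stats.entropy(p, q)`, which the notebook imports, computes the same quantity.

```python
from math import log

# Counts copied from the value_counts() output above.
train_counts = {"random": 4713, "hyponym for": 993, "hyperonym for": 955,
                "holonym for": 470, "meronym for": 340, "antonym for": 145}
test_counts = {"random": 4833, "hyponym for": 888, "hyperonym for": 952,
               "holonym for": 691, "meronym for": 354, "antonym for": 123}


def proportions(counts):
    """Normalize raw counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes p and q share the same labels and q has no zeros."""
    return sum(p[k] * log(p[k] / q[k]) for k in p if p[k] > 0)


p_train = proportions(train_counts)
p_test = proportions(test_counts)
kl_train_test = kl_divergence(p_train, p_test)
print(f"KL(train || test) = {kl_train_test:.4f}")

# Cross-check against the totals printed above.
non_random_train = sum(train_counts.values()) - train_counts["random"]
non_random_test = sum(test_counts.values()) - test_counts["random"]
print(non_random_train, non_random_test)  # 2903 3008
```

A value close to zero indicates that the two splits have similar label proportions; the largest visible gap in the printed proportions is for `holonym for` (about 6.2% in train versus 8.8% in test).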
