### Comparison and alignment of multi-lingual representational similarities for cross-linguistic comparison

**COLT: Computational Linguistics and Linguistic Theory**

*Universitat Pompeu Fabra*

In [1]:
# !pip install fasttext
import pandas as pd
import numpy as np
import fasttext.util
import nltk
# nltk.download('punkt')

In [2]:
ft_en = fasttext.load_model('../cc.en.300.bin')
ft_tl = fasttext.load_model('../cc.tl.300.bin')



In [3]:
df = pd.read_csv('../df_all_raw.csv')
df.head(5)



Unnamed: 0,dataset_ID,Form_ID,Form,clics_form,gloss_in_source,Concepticon_ID,Concepticon_Gloss,Ontological_Category,Semantic_Field,variety,...,Macroarea,Family,Latitude,Longitude,MRC_WORD,AGE_OF_ACQUISITION,CONCRETENESS,FAMILIARITY,IMAGABILITY,KUCERA_FRANCIS_FREQUENCY
0,abrahammonpa,BugunBichom-100_gold-1,san,san,gold,1369,GOLD,Person/Thing,Basic actions and technology,Bugun Bichom,...,,Sino-Tibetan,,,GOLD,,576.0,550.0,594.0,52.0
1,abrahammonpa,BugunKaspi-100_gold-1,san,san,gold,1369,GOLD,Person/Thing,Basic actions and technology,Bugun Kaspi,...,,Sino-Tibetan,,,GOLD,,576.0,550.0,594.0,52.0
2,abrahammonpa,BugunNamphri-100_gold-1,san,san,gold,1369,GOLD,Person/Thing,Basic actions and technology,Bugun Namphri,...,,Sino-Tibetan,,,GOLD,,576.0,550.0,594.0,52.0
3,abrahammonpa,BugunSingchung-100_gold-1,san,san,gold,1369,GOLD,Person/Thing,Basic actions and technology,Bugun Singchung,...,,Sino-Tibetan,,,GOLD,,576.0,550.0,594.0,52.0
4,abrahammonpa,BugunWangho-100_gold-1,san,san,gold,1369,GOLD,Person/Thing,Basic actions and technology,Bugun Wangho,...,,Sino-Tibetan,,,GOLD,,576.0,550.0,594.0,52.0


*1. Extract all the unique "Concepticon_ID" - "clics_form" tuples (meaning-word pairs) for a given language*


- "clics_form": the normalized word for a meaning in the language (e.g., carne, dza)
- "Form": the word for the meaning as written in the original resource the data comes from (e.g., carne, dzà)
- "Concepticon_ID": unique numerical identifier for a meaning (e.g., 634, 111)
- "Concepticon_Gloss": intuitive plain-English gloss of the meaning (e.g., MEAT, DRUMMING)
- "variety": intuitive name of the language (e.g., Spanish, Fali Mucella)
- "Glottocode": unique identifier for the language (e.g., stan1288, gude1246)

In [4]:
# all unique tuples
tuples_df = df.drop_duplicates(['Concepticon_ID', 'clics_form'])
tuples_df.head(5)


Unnamed: 0,dataset_ID,Form_ID,Form,clics_form,gloss_in_source,Concepticon_ID,Concepticon_Gloss,Ontological_Category,Semantic_Field,variety,...,Macroarea,Family,Latitude,Longitude,MRC_WORD,AGE_OF_ACQUISITION,CONCRETENESS,FAMILIARITY,IMAGABILITY,KUCERA_FRANCIS_FREQUENCY
0,abrahammonpa,BugunBichom-100_gold-1,san,san,gold,1369,GOLD,Person/Thing,Basic actions and technology,Bugun Bichom,...,,Sino-Tibetan,,,GOLD,,576.0,550.0,594.0,52.0
5,abrahammonpa,ChugParchu-100_gold-1,ser,ser,gold,1369,GOLD,Person/Thing,Basic actions and technology,Chug Parchu,...,Eurasia,Sino-Tibetan,27.418381,92.234687,GOLD,,576.0,550.0,594.0,52.0
6,abrahammonpa,DammaiDibin-100_gold-1,sen,sen,gold,1369,GOLD,Person/Thing,Basic actions and technology,Dammai Dibin,...,,Sino-Tibetan,,,GOLD,,576.0,550.0,594.0,52.0
7,abrahammonpa,DammaiRurang-100_gold-1,ʃə,s@,gold,1369,GOLD,Person/Thing,Basic actions and technology,Dammai Rurang,...,,Sino-Tibetan,,,GOLD,,576.0,550.0,594.0,52.0
23,abrahammonpa,NamreiNabolang-100_gold-1,seʔ,se,gold,1369,GOLD,Person/Thing,Basic actions and technology,Namrei Nabolang,...,,Sino-Tibetan,,,GOLD,,576.0,550.0,594.0,52.0
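
To make the deduplication concrete, here is a minimal sketch on a toy dataframe. Three rows reuse values from the output above and the Spanish row reuses the `carne` / 634 example from the column list; `toy`, `unique_tuples`, and `per_variety` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy rows with the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    "Concepticon_ID": [1369, 1369, 1369, 634],
    "clics_form": ["san", "san", "ser", "carne"],
    "variety": ["Bugun Bichom", "Bugun Kaspi", "Chug Parchu", "Spanish"],
})

# Keep the first row for each distinct (meaning, word) combination.
unique_tuples = toy.drop_duplicates(["Concepticon_ID", "clics_form"])
print(unique_tuples)

# Assumed variant: also treat the language as part of the key.
per_variety = toy.drop_duplicates(["Concepticon_ID", "clics_form", "variety"])
print(per_variety)
```

Because the first call looks only at the meaning and the form, a form shared by two varieties is kept once, for whichever variety appears first (here the Bugun Kaspi row is dropped). If the tuples are meant to be unique per language, adding `variety` (or `Glottocode`) to the subset, as in `per_variety`, may be closer to the intent.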


From the dataset, select the languages that have a fastText model; here these are English and Tagalog, matching the two models loaded above.

In [5]:
# filter dataframe to a specific language
def get_lang_df(df, lang):
    # copy the filtered rows so that adding a column does not trigger
    # pandas' SettingWithCopyWarning
    lang_df = df[df.variety == lang].copy()

    # tokenize the variety name (tk_variety is not used in the later steps)
    lang_df['tk_variety'] = lang_df['variety'].apply(nltk.tokenize.word_tokenize)

    return lang_df

eng_df = get_lang_df(tuples_df, 'English')
tgl_df = get_lang_df(tuples_df, 'Tagalog')
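
The cell above names English and Tagalog directly. One hedged way to make the coverage check explicit is a small lookup from `variety` names to fastText language codes. `FASTTEXT_CODES` is a hypothetical helper, not part of fastText or the dataset; its two entries are taken from the model files loaded earlier (`cc.en.300.bin`, `cc.tl.300.bin`), and the toy rows are illustrative.

```python
import pandas as pd

# Hypothetical lookup: dataset variety name -> fastText language code.
# Only these two entries are grounded in the models loaded in this notebook.
FASTTEXT_CODES = {"English": "en", "Tagalog": "tl"}

# Toy rows with the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    "variety": ["English", "Tagalog", "Bugun Bichom", "English"],
    "clics_form": ["gold", "bahay", "san", "horse"],
})

# Varieties present in the data that also have an entry in the lookup.
covered = sorted(set(toy["variety"]) & set(FASTTEXT_CODES))
print(covered)

# Rows restricted to the covered varieties.
covered_df = toy[toy["variety"].isin(list(FASTTEXT_CODES))]
print(covered_df)
```

Extending the lookup to further languages would require checking each variety name against the list of models fastText actually distributes.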


*2. Generate all possible pairs of words based on (1)*

For each language (unique Glottocode) for which we have a word embedding model (e.g., a fastText model), I merge the language's dataframe with itself to generate every pairing of its words, then drop the pairings in which both words express the same meaning.

In [6]:
def get_pairs_df(df, word1, word2):
    # merge the dataframe with itself; every row shares the same 'variety',
    # so this yields every ordered pairing of rows (including self-pairs)
    pairs_df = pd.merge(df, df, on='variety', how='outer')
    print(pairs_df.shape)

    # drop pairs whose two entries have the same value in the compared columns
    # (here: the same Concepticon_ID, i.e. the same meaning)
    pairs_df = pairs_df[pairs_df[word1] != pairs_df[word2]]
    print(pairs_df.shape)

    return pairs_df

eng_pairs_df = get_pairs_df(eng_df, 'Concepticon_ID_x', 'Concepticon_ID_y')
tgl_pairs_df = get_pairs_df(tgl_df, 'Concepticon_ID_x', 'Concepticon_ID_y')



(5253264, 45)
(5249174, 45)
(4489, 45)
(4422, 45)
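
The printed shapes follow from the self-merge: the first row count for each language is the square of the number of input rows (5253264 = 2292², 4489 = 67²). A minimal sketch on three toy English rows shows the same pattern; `toy`, `pairs`, and `cross` are hypothetical names, and `how='cross'` (available in pandas 1.2 and later) is shown only as an assumed alternative to merging on a constant column.

```python
import pandas as pd

# Toy rows with the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    "Concepticon_ID": [1369, 615, 1489],
    "clics_form": ["gold", "horse", "cloud"],
    "variety": ["English"] * 3,
})

# Self-merge on the constant 'variety' column: every row pairs with every row.
pairs = pd.merge(toy, toy, on="variety", how="outer")
print(pairs.shape)  # 3 x 3 = 9 rows before filtering

# Drop pairs in which both sides express the same meaning.
pairs = pairs[pairs["Concepticon_ID_x"] != pairs["Concepticon_ID_y"]]
print(pairs.shape)

# Assumed alternative: an explicit Cartesian product, no shared key needed.
cross = toy.merge(toy, how="cross")
print(cross.shape)
```

Both orderings of each pair survive this step, so (gold, horse) and (horse, gold) are separate rows until they are deduplicated later.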


In [8]:
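# randomly subsample 10,000 English pairs (no random_state is set, so the sample differs between runs)
eng_pairs_df = eng_pairs_df.sample(10000)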

*3. Query the model to add, to (2), the cosine similarity of each pair of words, according to the word embedding model for the language*


The `get_cos_similarity` function reads in the dataframe columns containing our target words. Then it extracts the word vector for each word and takes the cosine similarity between the two vectors.

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

def get_cos_similarity(df, col1, col2, model):
    # work on a copy so the caller's dataframe is left untouched
    df = df.copy()

    # add word pair names
    df['word_pair'] = list(zip(df[col1], df[col2]))

    # sort words within the tuple so that (a, b) and (b, a) count as duplicates
    df['word_pair'] = df['word_pair'].apply(lambda x: tuple(sorted(x)))
    df = df.drop_duplicates('word_pair').copy()

    # use the fastText model to get a vector for each word
    word_vector_1 = df[col1].apply(lambda x: model.get_sentence_vector(x))
    word_vector_2 = df[col2].apply(lambda x: model.get_sentence_vector(x))

    # cosine similarity between the vectors: the full pairwise matrix is computed
    # and only its diagonal (row i of col1 against row i of col2) is kept
    cosine_sim = cosine_similarity(np.vstack(word_vector_1), np.vstack(word_vector_2)).diagonal()

    # append the column back to the dataframe
    df['cosine_similarity'] = cosine_sim

    return df

eng_sim = get_cos_similarity(eng_pairs_df,'clics_form_x','clics_form_y', ft_en)
tgl_sim = get_cos_similarity(tgl_pairs_df,'clics_form_x','clics_form_y', ft_tl)
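
The cell above obtains the per-pair similarity by building the full n × n similarity matrix and reading its diagonal. Since only matching rows are needed, the same quantity, cos(u, v) = (u · v) / (‖u‖ ‖v‖), can be computed row by row, which uses memory proportional to n rather than n². The following is a minimal NumPy sketch under that assumption; `rowwise_cosine` is a hypothetical helper and the small arrays stand in for fastText vectors.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two (n, d) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product of row i of a with row i of b.
    dots = np.einsum("ij,ij->i", a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Return 0 where either vector is all zeros, instead of dividing by zero.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Stand-in vectors; real inputs would be stacked fastText vectors.
v1 = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
v2 = np.array([[0.0, 2.0], [2.0, 2.0], [1.0, 0.0]])
sims = rowwise_cosine(v1, v2)
print(sims)  # orthogonal, parallel, and zero-vector pairs
```

In the function above, this would replace the `cosine_similarity(...).diagonal()` line with a call on the two stacked arrays; the result should agree up to floating-point error.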


*4. Save this as a CSV with five columns that has "Concepticon_ID" for each of the two meanings that the words express; the two "clics_form"s; and their cosine similarity.*

Now, keep only these five columns for each language and write the result to a CSV file.

In [10]:
# filter the desired columns
res_df_en = eng_sim[['Concepticon_ID_x', 'Concepticon_ID_y', 'clics_form_x', 'clics_form_y', 'cosine_similarity']]
res_df_en

Unnamed: 0,Concepticon_ID_x,Concepticon_ID_y,clics_form_x,clics_form_y,cosine_similarity
894879,692,525,bring,artery,0.004467
2708409,474,2468,womansdress,baeg,0.297035
1868241,1200,522,husband,worry,0.148170
2704670,326,1401,cloak,drink,0.095954
291983,615,1172,horse,shoot,0.156469
...,...,...,...,...,...
265405,1398,1577,dry,aend,0.066536
1177058,2249,1901,thin,shovel,0.106086
1577200,1521,666,seem,brook,0.073672
3204290,615,1489,hos,cloud,0.089852


In [11]:
res_df_tgl = tgl_sim[['Concepticon_ID_x', 'Concepticon_ID_y', 'clics_form_x', 'clics_form_y', 'cosine_similarity']]
res_df_tgl

Unnamed: 0,Concepticon_ID_x,Concepticon_ID_y,clics_form_x,clics_form_y,cosine_similarity
1,499,24,lumunas,dasal,0.269225
2,499,1773,lumunas,libingan,0.202819
3,499,1973,lumunas,demonyo,0.211333
4,499,1175,lumunas,multo,0.147523
5,499,1252,lumunas,bahay,0.261617
...,...,...,...,...,...
4286,1616,2680,kayo,gulok,0.134507
4287,1616,846,kayo,tigre,0.136307
4353,1927,2680,kawayanspinyspecies,gulok,0.271064
4354,1927,846,kawayanspinyspecies,tigre,0.307241


In [16]:
# output to csv
res_df_en.to_csv('res_df_english.csv', index=False)
res_df_tgl.to_csv('res_df_tagalog.csv', index=False)
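
As a quick check on the output format, the sketch below round-trips two rows copied from the English result above through an in-memory buffer instead of a file. The names `res`, `back`, and `with_index` are hypothetical, and the buffer stands in for `res_df_english.csv`.

```python
import io
import pandas as pd

cols = ["Concepticon_ID_x", "Concepticon_ID_y",
        "clics_form_x", "clics_form_y", "cosine_similarity"]

# Two rows copied from the English output shown earlier.
res = pd.DataFrame(
    [[692, 525, "bring", "artery", 0.004467],
     [615, 1172, "horse", "shoot", 0.156469]],
    columns=cols,
)

# Write without the row index, then read the text back.
buffer = io.StringIO()
res.to_csv(buffer, index=False)
buffer.seek(0)
back = pd.read_csv(buffer)
print(list(back.columns))

# For contrast: keeping the index adds an extra, unnamed first column.
buffer_idx = io.StringIO()
res.to_csv(buffer_idx)
buffer_idx.seek(0)
with_index = pd.read_csv(buffer_idx)
print(list(with_index.columns))
```

Writing with `index=False` keeps the file to exactly the five requested columns; without it, pandas stores the row index as an extra column that `read_csv` labels `Unnamed: 0` when the file is loaded again.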
