# **Exploring Multilingual Word Embeddings**

The aim of this notebook is to briefly explore how to work with FastText word embeddings and to investigate their capacity to align multiple languages. 

**Note**: This notebook is intended to run in Google Colab.

# **Setup**

In [1]:
pip install pytorch-nlp fasttext

Collecting pytorch-nlp
  Downloading https://files.pythonhosted.org/packages/4f/51/f0ee1efb75f7cc2e3065c5da1363d6be2eec79691b2821594f3f2329528c/pytorch_nlp-0.5.0-py3-none-any.whl (90kB)
Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3018303 sha256=a766b442480a9e160d56a94443ea6e495231f9f1ec084dc5e6d2d4ad77c68295
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee91c154b75231136cc3a3321ab0e30f592
Successfully built fasttext
Installing collected packages: pytorch-nlp, fasttext
Successfully installed fasttext-0.9.2 pytorch-

In [None]:
# basic libraries
import torch
from torch import nn
import pandas as pd
import pickle
import sklearn
import numpy as np
import string
import re
from collections import Counter
import time
import matplotlib.pyplot as plt
from torchnlp.word_to_vector import FastText


## **FastText Embeddings**

In [None]:
start = time.time()

# English embeddings (aligned with other languages)
en_embeddings = FastText(language = "en", aligned = True)

# French embeddings (aligned with other languages)
fr_embeddings = FastText(language = "fr", aligned = True)

# German embeddings (aligned with other languages)
de_embeddings = FastText(language = "de", aligned = True)
end = time.time()

print("******************** Time elapsed loading embeddings (minutes): " ,(end - start)/60)

wiki.en.align.vec: 5.69GB [02:52, 32.9MB/s]                            
  0%|          | 0/2519371 [00:00<?, ?it/s]Skipping token 2519370 with 1-dimensional vector ['300']; likely a header
100%|██████████| 2519371/2519371 [04:50<00:00, 8660.62it/s]
wiki.fr.align.vec: 2.61GB [01:16, 34.1MB/s]                            
  0%|          | 0/1152450 [00:00<?, ?it/s]Skipping token 1152449 with 1-dimensional vector ['300']; likely a header
100%|██████████| 1152450/1152450 [02:10<00:00, 8803.75it/s]
wiki.de.align.vec: 5.15GB [03:28, 24.7MB/s]                            
  0%|          | 0/2275234 [00:00<?, ?it/s]Skipping token 2275233 with 1-dimensional vector ['300']; likely a header
100%|██████████| 2275234/2275234 [04:21<00:00, 8707.69it/s]


******************** Time elapsed loading embeddings (minutes):  20.711681509017943
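
Loading three full vocabularies takes about twenty minutes here. For quick experiments it may be enough to read only the first few thousand entries of a `.vec` file. The sketch below is not part of `torchnlp`: `load_vec` and `max_words` are hypothetical names, and it assumes the plain-text layout hinted at by the log above (a header line with the vocabulary size and dimension, then one token and its values per line). It is demonstrated on a made-up in-memory file rather than the real download.

```python
import io
import numpy as np

def load_vec(stream, max_words=None):
    """Read word vectors from a text stream in the .vec layout (hypothetical helper)."""
    header = stream.readline().split()
    dim = int(header[1])  # header line: "<vocabulary size> <dimension>"
    tokens, rows = [], []
    for line in stream:
        if max_words is not None and len(tokens) >= max_words:
            break  # stop early once enough words have been read
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not have exactly one token plus dim values
        tokens.append(parts[0])
        rows.append([float(x) for x in parts[1:]])
    vectors = np.array(rows, dtype=np.float32).reshape(len(rows), dim)
    token_to_index = {tok: i for i, tok in enumerate(tokens)}
    return token_to_index, vectors

# toy file with the same layout (the values are made up)
toy_file = io.StringIO("3 2\nthe 0.1 0.2\ncat 0.3 0.4\ndog 0.5 0.6\n")
toy_token_to_index, toy_vectors = load_vec(toy_file, max_words=2)
print(toy_token_to_index, toy_vectors.shape)
```

Truncating this way only keeps the most useful words if the file is ordered by frequency, which is an assumption worth checking on the real files.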


## **Testing Word Embeddings**

In [None]:
# get the embedding for any word
word = "politician"
en_embeddings[word]

tensor([-0.0837, -0.0981, -0.0277, -0.0433, -0.0062, -0.0337,  0.0926, -0.0561,
        -0.0294,  0.0495, -0.0249,  0.0286, -0.0654, -0.1783, -0.0070, -0.0665,
        -0.0468, -0.0673, -0.0650,  0.1219, -0.0324,  0.1153, -0.0321,  0.0027,
        -0.0351, -0.0350, -0.0293, -0.0468, -0.0130, -0.0296,  0.0095,  0.0135,
        -0.0095, -0.0108, -0.0021, -0.0207, -0.0012, -0.0047,  0.0590,  0.0893,
        -0.0688,  0.0524, -0.0786, -0.0105,  0.0609,  0.0239, -0.0004, -0.0171,
        -0.0131,  0.0266,  0.0350,  0.0011, -0.0240, -0.0832,  0.0016, -0.1578,
         0.0537, -0.0399,  0.0060,  0.0871, -0.0204, -0.0299,  0.0182,  0.0402,
         0.0047, -0.0203,  0.0460, -0.0312, -0.0531, -0.0269,  0.0426,  0.0300,
         0.1082, -0.0076,  0.0665, -0.0754,  0.0342, -0.0120, -0.0166,  0.0246,
        -0.0575,  0.1112,  0.1102, -0.0808,  0.0334,  0.0126, -0.0354, -0.0758,
         0.0233, -0.0333,  0.0590, -0.0967,  0.0719, -0.0449,  0.0015,  0.0209,
         0.0682,  0.0842, -0.0892, -0.05

In [None]:
# FastText also provides IDs for words, plus dictionaries to map in either direction
word_id = en_embeddings.token_to_index[word]
id_word = en_embeddings.index_to_token[word_id]
word_id, id_word

(1084, 'politician')

In [None]:
# We can also use the id of the word to get the corresponding embedding
en_embeddings.vectors[word_id]

tensor([-0.0837, -0.0981, -0.0277, -0.0433, -0.0062, -0.0337,  0.0926, -0.0561,
        -0.0294,  0.0495, -0.0249,  0.0286, -0.0654, -0.1783, -0.0070, -0.0665,
        -0.0468, -0.0673, -0.0650,  0.1219, -0.0324,  0.1153, -0.0321,  0.0027,
        -0.0351, -0.0350, -0.0293, -0.0468, -0.0130, -0.0296,  0.0095,  0.0135,
        -0.0095, -0.0108, -0.0021, -0.0207, -0.0012, -0.0047,  0.0590,  0.0893,
        -0.0688,  0.0524, -0.0786, -0.0105,  0.0609,  0.0239, -0.0004, -0.0171,
        -0.0131,  0.0266,  0.0350,  0.0011, -0.0240, -0.0832,  0.0016, -0.1578,
         0.0537, -0.0399,  0.0060,  0.0871, -0.0204, -0.0299,  0.0182,  0.0402,
         0.0047, -0.0203,  0.0460, -0.0312, -0.0531, -0.0269,  0.0426,  0.0300,
         0.1082, -0.0076,  0.0665, -0.0754,  0.0342, -0.0120, -0.0166,  0.0246,
        -0.0575,  0.1112,  0.1102, -0.0808,  0.0334,  0.0126, -0.0354, -0.0758,
         0.0233, -0.0333,  0.0590, -0.0967,  0.0719, -0.0449,  0.0015,  0.0209,
         0.0682,  0.0842, -0.0892, -0.05

In [None]:
# what happens if we try to get the index of a word that is not in the dictionary?
ind = en_embeddings.token_to_index['#$pld']
ind

KeyError: ignored
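
Since an unknown token raises a `KeyError`, a lookup that returns a default value can be handier inside loops. The following is a minimal sketch, assuming `token_to_index` behaves like a plain Python dict (the `KeyError` above suggests it does). `lookup_index` is a hypothetical helper, and the toy vocabulary stands in for the real one.

```python
def lookup_index(token_to_index, word, default=None):
    """Return the index of `word`, or `default` if the word is out of vocabulary."""
    return token_to_index.get(word, default)

# toy vocabulary; the index is the one printed for 'politician' above
toy_vocab = {"politician": 1084}
print(lookup_index(toy_vocab, "politician"))
print(lookup_index(toy_vocab, "#$pld"))
```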

### **Embeddings algebra**

In [None]:
emb = en_embeddings['germany'] + en_embeddings['capital'] 

In [None]:
def get_nn_emb(word_emb, src_emb, K=10):
    """ 
    Print the K words of `src_emb` whose embeddings are closest (by cosine
    similarity) to the embedding provided as input, and return the
    second-ranked word. Requires K >= 2.
    """

    word_emb = np.array(word_emb)
    vectors = np.array(src_emb.vectors)
    # calculate the cosine similarity between all the words of the
    # dictionary and the input embedding
    scores = (vectors / np.linalg.norm(vectors, 2, 1)[:, None]).dot(word_emb / np.linalg.norm(word_emb))
    k_best = scores.argsort()[-K:][::-1]
    for idx in k_best:
        print('%.4f - %s' % (scores[idx], src_emb.index_to_token[idx]))

    # return the second-ranked word (index 1); the top-ranked word is at index 0
    return src_emb.index_to_token[k_best[1]]

In [None]:
get_nn_emb(emb, en_embeddings, 20)

0.7614 - germany
0.7614 - capital
0.6809 - germany`s
0.6694 - germany,
0.6545 - germany´s
0.6490 - germany‎
0.6490 - germany‘s
0.6350 - berlin,germany
0.6345 - germany—in
0.6327 - germany—was
0.6279 - germany#nazi
0.6276 - germany…
0.6246 - germany—the
0.6235 - germanys
0.6234 - germany—where
0.6192 - ,germany
0.6166 - cologne,germany
0.6160 - germanyitaly
0.6154 - germany—whose
0.6148 - germanyoccupied


'capital'
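
The two highest-scoring words above are the inputs themselves, which is common when adding embeddings. One way around it is to skip the query words when ranking. The sketch below illustrates this on made-up 3-dimensional vectors. `nearest_excluding` is a hypothetical helper, not part of `torchnlp`, and the toy values are chosen only so that the effect is visible.

```python
import numpy as np

def nearest_excluding(query, vectors, index_to_token, exclude=(), k=3):
    """Rank tokens by cosine similarity to `query`, skipping tokens in `exclude`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ (query / np.linalg.norm(query))
    results = []
    for idx in np.argsort(-scores):  # highest similarity first
        token = index_to_token[idx]
        if token in exclude:
            continue
        results.append((token, float(scores[idx])))
        if len(results) == k:
            break
    return results

# toy vocabulary with made-up vectors
toy_tokens = ["germany", "capital", "berlin", "banana"]
toy_vectors = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.4, 0.4, 0.8],
                        [0.0, 0.0, 1.0]])
toy_query = toy_vectors[0] + toy_vectors[1]

print(nearest_excluding(toy_query, toy_vectors, toy_tokens, k=1))
print(nearest_excluding(toy_query, toy_vectors, toy_tokens,
                        exclude={"germany", "capital"}, k=1))
```

This only removes exact matches. Variants with attached punctuation, such as `germany,` in the output above, would need additional vocabulary filtering.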

### **Normalization**

Are FastText word embeddings normalized?

In [None]:
print(np.linalg.norm(en_embeddings['hello']))
print(np.linalg.norm(en_embeddings['cat']))
print(np.linalg.norm(en_embeddings['running']))
print(np.linalg.norm(en_embeddings['migration']))

print(np.linalg.norm(fr_embeddings['oui']))
print(np.linalg.norm(fr_embeddings['sociale']))
print(np.linalg.norm(fr_embeddings['aide']))
print(np.linalg.norm(fr_embeddings['cheval']))

print(np.linalg.norm(de_embeddings['gesetz']))
print(np.linalg.norm(de_embeddings['tor']))
print(np.linalg.norm(de_embeddings['politik']))

0.9999994
1.0000013
1.000005
1.0000033
1.000018
1.0000418
1.0000318
0.9999768
0.9999604
1.000003
1.0000075
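
All the printed norms are within rounding error of 1, so these aligned vectors appear to be unit length. Rather than spot-checking individual words, the same check can be run over a whole matrix at once. This is a small sketch on random stand-in data. `fraction_unit_norm` and the tolerance of `1e-3` are assumptions made for illustration.

```python
import numpy as np

def fraction_unit_norm(vectors, tol=1e-3):
    """Fraction of rows whose L2 norm is within `tol` of 1."""
    norms = np.linalg.norm(vectors, axis=1)
    return float(np.mean(np.abs(norms - 1.0) < tol))

# random stand-in for an embedding matrix, scaled to unit length row by row
rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(5, 4))
toy_unit = toy_matrix / np.linalg.norm(toy_matrix, axis=1, keepdims=True)

print(fraction_unit_norm(toy_unit))
print(fraction_unit_norm(2 * toy_unit))
```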


## **Exploring the Multilingual Universe**

In [None]:
import nltk
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances
from nltk.cluster import KMeansClusterer

In [None]:
# example words (the same word in all 3 languages)
ger_word = 'hund'
ger_emb = np.array(de_embeddings[ger_word])
en_word = 'dog'
en_emb = np.array(en_embeddings[en_word])
fr_word = 'chien'
fr_emb = np.array(fr_embeddings[fr_word])

In [None]:
print(nltk.cluster.util.cosine_distance(ger_emb, en_emb))
print(1- nltk.cluster.util.cosine_distance(ger_emb, en_emb))

0.4677146024463045
0.5322853975536955


In [None]:
print(" ========= English and German cosine similarity ==============")
print(cosine_similarity(ger_emb.reshape(1,300),en_emb.reshape(1,300))[0][0])
print(1- nltk.cluster.util.cosine_distance(ger_emb, en_emb))

print(" ========= English and German euclidean distance ==============")
print(pairwise_distances(ger_emb.reshape(1,300),en_emb.reshape(1,300), metric = 'euclidean')[0][0])

0.5322854
0.5322853975536955
0.9671857
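
For unit-length vectors the two measures carry the same information: the squared Euclidean distance equals 2 - 2 * cosine similarity. With the cosine similarity of 0.5323 printed above, this gives a distance of about 0.967, which matches the Euclidean value. The sketch below checks the identity on random unit vectors. `euclidean_from_cosine` is a hypothetical helper name.

```python
import math
import numpy as np

def euclidean_from_cosine(cos_sim):
    """Euclidean distance between two unit vectors, given their cosine similarity."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos_sim))

# two random unit vectors of the same dimension as the embeddings
rng = np.random.default_rng(0)
a = rng.normal(size=300)
a /= np.linalg.norm(a)
b = rng.normal(size=300)
b /= np.linalg.norm(b)

cos_ab = float(a @ b)
direct = float(np.linalg.norm(a - b))
print(direct, euclidean_from_cosine(cos_ab))
```

A consequence is that ranking neighbours by cosine similarity or by Euclidean distance gives the same order, as long as the vectors are normalized.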


In [None]:
print(" ========= English and French similarity ==============")
print(cosine_similarity(fr_emb.reshape(1,300),en_emb.reshape(1,300))[0][0])
print(1- nltk.cluster.util.cosine_distance(fr_emb, en_emb))

print(" ========= English and French euclidean distance ==============")
print(pairwise_distances(fr_emb.reshape(1,300),en_emb.reshape(1,300), metric = 'euclidean')[0][0])

0.5562931
0.5562930855472754
0.94201696


In [None]:
print(" ========= German and French similarity ==============")
print(cosine_similarity(ger_emb.reshape(1,300),fr_emb.reshape(1,300))[0][0])
print(1- nltk.cluster.util.cosine_distance(ger_emb, fr_emb))

print(" ========= German and French euclidean distance ==============")
print(pairwise_distances(ger_emb.reshape(1,300), fr_emb.reshape(1,300), metric = 'euclidean')[0][0])

0.70935184
0.7093518149716478
0.7624254


In [None]:
# example words (3 different words)
ger_word = 'vogel'
ger_emb = np.array(de_embeddings[ger_word])
en_word = 'water'
en_emb = np.array(en_embeddings[en_word])
fr_word = 'carnet'
fr_emb = np.array(fr_embeddings[fr_word])

In [None]:
print(" ========= English and German similarity ==============")
print(cosine_similarity(ger_emb.reshape(1,300),en_emb.reshape(1,300))[0][0])
print(1- nltk.cluster.util.cosine_distance(ger_emb, en_emb))

print(" ========= English and German euclidean distance ==============")
print(pairwise_distances(ger_emb.reshape(1,300),en_emb.reshape(1,300), metric = 'euclidean')[0][0])

0.009164232
0.00916423094868457
1.4077184


In [None]:
print(" ========= English and French similarity ==============")
print(cosine_similarity(fr_emb.reshape(1,300),en_emb.reshape(1,300))[0][0])
print(1- nltk.cluster.util.cosine_distance(fr_emb, en_emb))

print(" ========= English and French euclidean distance ==============")
print(pairwise_distances(fr_emb.reshape(1,300),en_emb.reshape(1,300), metric = 'euclidean')[0][0])

-0.08737762
-0.08737760906309533
1.4747393


In [None]:
print(" ========= German and French similarity ==============")
print(cosine_similarity(ger_emb.reshape(1,300),fr_emb.reshape(1,300))[0][0])
print(1- nltk.cluster.util.cosine_distance(ger_emb, fr_emb))

print(" ========= German and French euclidean distance ==============")
print(pairwise_distances(ger_emb.reshape(1,300), fr_emb.reshape(1,300), metric = 'euclidean')[0][0])

0.22292829
0.22292826918326591
1.2466705


In [None]:
def get_nn(word, src_emb, tgt_emb, K=10):
    """
    Print the K words from the target language that are closest (by cosine
    similarity) to the embedding of the provided word, and return the closest.
    Returns None if the word is not in the source vocabulary.
    """
    
    print("Nearest neighbors of \"%s\":" % word)
    if word not in src_emb.token_to_index:
        print("Word not in the dictionary")
        return None

    word_emb = np.array(src_emb[word])
    tgt_vectors = np.array(tgt_emb.vectors)
    # calculate the cosine similarity between all the words of the target
    # dictionary and the input word
    scores = (tgt_vectors / np.linalg.norm(tgt_vectors, 2, 1)[:, None]).dot(word_emb / np.linalg.norm(word_emb))
    k_best = scores.argsort()[-K:][::-1]
    for idx in k_best:
        print('%.4f - %s' % (scores[idx], tgt_emb.index_to_token[idx]))
    
    # return the nearest neighbor
    return tgt_emb.index_to_token[k_best[0]]

In [None]:
get_nn('bird', en_embeddings, de_embeddings)

Nearest neighbors of "bird":
0.4386 - vogel
0.4362 - bird
0.4204 - vogel,
0.4113 - vogelschnabel
0.4082 - vögel
0.4053 - vogell
0.4016 - vogelschwanz
0.3969 - wattvögel
0.3834 - »vogel
0.3775 - vogelschwarm


'vogel'
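
`get_nn` re-normalizes the entire target matrix on every call, which is expensive with millions of rows. If many words need translating, the normalized matrix can be computed once and reused. The class below is a sketch of that idea on toy 2-dimensional vectors. `CosineIndex` is a hypothetical name, and the toy vocabulary is invented for the example.

```python
import numpy as np

class CosineIndex:
    """Stores row-normalized vectors so repeated cosine queries are one matrix product."""

    def __init__(self, vectors, index_to_token):
        vectors = np.asarray(vectors, dtype=np.float32)
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # leave all-zero rows untouched instead of dividing by zero
        self.unit = vectors / norms
        self.index_to_token = index_to_token

    def query(self, word_emb, k=10):
        q = np.asarray(word_emb, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.unit @ q
        k = min(k, len(scores))
        # argpartition finds the k best without sorting the whole vocabulary
        top = np.argpartition(-scores, k - 1)[:k]
        top = top[np.argsort(-scores[top])]
        return [(self.index_to_token[i], float(scores[i])) for i in top]

# toy target vocabulary with made-up vectors
toy_index = CosineIndex([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]],
                        ["vogel", "hund", "haus"])
toy_hits = toy_index.query([1.0, 0.0], k=2)
print(toy_hits)
```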

### **Clustering example words**

In [None]:
# a potential cluster of words in different languages
cluster1_en = ['puppies', 'cat', 'tiger', 'bird', 'worms', 'frog']
cluster1_en_emb = [np.array(en_embeddings[word]) for word in cluster1_en]
cluster1_fr = ['cheval', 'chiens', 'lapin', 'veau', 'canard', 'âne']
cluster1_fr_emb = [np.array(fr_embeddings[word]) for word in cluster1_fr]
cluster1_de = ['hund', 'wolf', 'krebs', 'fisch', 'hai', 'igel']
cluster1_de_emb = [np.array(de_embeddings[word]) for word in cluster1_de]

In [None]:
cluster1_emb = cluster1_en_emb + cluster1_fr_emb + cluster1_de_emb
df1 = pd.DataFrame({'words': cluster1_en + cluster1_fr + cluster1_de,
                    'embeddings': cluster1_emb, 'cluster': [1]*len(cluster1_emb)})

df1

Unnamed: 0,words,embeddings,cluster
0,puppies,"[0.0368, -0.0728, -0.0413, 0.0336, -0.0342, 0....",1
1,cat,"[-0.0327, 0.0332, -0.0772, 0.0275, -0.0469, 0....",1
2,tiger,"[-0.067, 0.0162, -0.0978, 0.0705, -0.053, 0.09...",1
3,bird,"[-0.0982, -0.0368, 0.0447, 0.0752, -0.0479, 0....",1
4,worms,"[0.0219, 0.026, -0.0404, -0.0292, -0.0343, 0.0...",1
5,frog,"[-0.0393, -0.0135, -0.0017, 0.1069, -0.0813, -...",1
6,cheval,"[0.0667, 0.0019, 0.0687, -0.0496, -0.0082, 0.0...",1
7,chiens,"[0.0242, -0.0141, -0.0569, -0.0495, -0.0034, 0...",1
8,lapin,"[0.0313, -0.0256, -0.0598, -0.0521, 0.0351, 0....",1
9,veau,"[0.0059, 0.0308, 0.0282, -0.0869, 0.0484, 0.11...",1


In [None]:
# a potential different cluster of words in different languages
cluster2_en = ['computer', 'tablet', 'cpu', 'televsion']
cluster2_en_emb = [np.array(en_embeddings[word]) for word in cluster2_en]
cluster2_fr = ['ordinateur', 'téléphone', 'clavier', 'bureau', 'courriel']
cluster2_fr_emb = [np.array(fr_embeddings[word]) for word in cluster2_fr]
cluster2_de = ['fernseh', 'internet', 'digitalen', 'wlan', 'telefon', 'handy']
cluster2_de_emb = [np.array(de_embeddings[word]) for word in cluster2_de]

In [None]:
cluster2_emb = cluster2_en_emb + cluster2_fr_emb + cluster2_de_emb
df2 = pd.DataFrame({'words': cluster2_en + cluster2_fr + cluster2_de,
                    'embeddings': cluster2_emb, 'cluster': [2]*len(cluster2_emb)})

df2

Unnamed: 0,words,embeddings,cluster
0,computer,"[0.0704, -0.0315, 0.0575, -0.042, 0.0005, 0.03...",2
1,tablet,"[0.0675, 0.1163, 0.0446, -0.0074, -0.0983, 0.0...",2
2,cpu,"[0.0262, 0.0757, 0.0103, -0.0744, -0.0175, 0.0...",2
3,televsion,"[-0.0576, 0.0362, 0.0436, 0.0011, -0.0137, -0....",2
4,ordinateur,"[0.0864, 0.0765, 0.0122, -0.0753, 0.0349, 0.04...",2
5,téléphone,"[0.0774, 0.0804, 0.0242, -0.1341, -0.0219, 0.0...",2
6,clavier,"[-0.0145, -0.0059, -0.0098, -0.1505, 0.0501, -...",2
7,bureau,"[0.0866, -0.045, 0.097, -0.0708, 0.0368, -0.01...",2
8,courriel,"[-0.011, -0.003, 0.0532, -0.0926, 0.081, -0.00...",2
9,fernseh,"[0.0046, 0.0656, 0.0341, -0.0594, 0.0889, -0.0...",2


In [None]:
# concatenate dfs
df = pd.concat([df1,df2])
df

Unnamed: 0,words,embeddings,cluster
0,puppies,"[0.0368, -0.0728, -0.0413, 0.0336, -0.0342, 0....",1
1,cat,"[-0.0327, 0.0332, -0.0772, 0.0275, -0.0469, 0....",1
2,tiger,"[-0.067, 0.0162, -0.0978, 0.0705, -0.053, 0.09...",1
3,bird,"[-0.0982, -0.0368, 0.0447, 0.0752, -0.0479, 0....",1
4,worms,"[0.0219, 0.026, -0.0404, -0.0292, -0.0343, 0.0...",1
5,frog,"[-0.0393, -0.0135, -0.0017, 0.1069, -0.0813, -...",1
6,cheval,"[0.0667, 0.0019, 0.0687, -0.0496, -0.0082, 0.0...",1
7,chiens,"[0.0242, -0.0141, -0.0569, -0.0495, -0.0034, 0...",1
8,lapin,"[0.0313, -0.0256, -0.0598, -0.0521, 0.0351, 0....",1
9,veau,"[0.0059, 0.0308, 0.0282, -0.0869, 0.0484, 0.11...",1


In [None]:
# Create an empty array in which we will store the embedding data
emb_cluster = np.empty(shape=(len(df),300))
# add embeddings to the array
for i,emb in enumerate(df.embeddings):
  print(np.linalg.norm(np.array(emb)))
  emb_cluster[i] = np.array(emb)

0.99998415
1.0000013
1.0000073
1.0000074
1.0000066
0.9999982
0.9999768
0.9999764
1.0000417
1.0000749
1.0000218
1.0000094
1.000017
0.9999636
0.99999255
0.99999815
1.0000035
0.9999997
0.99999803
1.0000094
1.0000132
0.99999535
1.0000223
0.9999764
1.0000381
0.9999807
1.0000062
1.0000147
1.0000341
0.99999964
0.9999854
0.99997264
0.9999807


In [None]:
# KMEANS
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
k = 2
kmeans = KMeans(k, verbose=0)
kmeans.fit(emb_cluster)
kmeans_pred = kmeans.labels_

In [None]:
df['kmeans'] = kmeans_pred
df

Unnamed: 0,words,embeddings,cluster,kmeans,kmeans_cos,db
0,puppies,"[0.0368, -0.0728, -0.0413, 0.0336, -0.0342, 0....",1,0,1,-1
1,cat,"[-0.0327, 0.0332, -0.0772, 0.0275, -0.0469, 0....",1,0,1,-1
2,tiger,"[-0.067, 0.0162, -0.0978, 0.0705, -0.053, 0.09...",1,0,1,-1
3,bird,"[-0.0982, -0.0368, 0.0447, 0.0752, -0.0479, 0....",1,0,1,-1
4,worms,"[0.0219, 0.026, -0.0404, -0.0292, -0.0343, 0.0...",1,0,1,-1
5,frog,"[-0.0393, -0.0135, -0.0017, 0.1069, -0.0813, -...",1,0,1,-1
6,cheval,"[0.0667, 0.0019, 0.0687, -0.0496, -0.0082, 0.0...",1,1,1,0
7,chiens,"[0.0242, -0.0141, -0.0569, -0.0495, -0.0034, 0...",1,1,1,0
8,lapin,"[0.0313, -0.0256, -0.0598, -0.0521, 0.0351, 0....",1,1,1,0
9,veau,"[0.0059, 0.0308, 0.0282, -0.0869, 0.0484, 0.11...",1,1,1,0


In [None]:
# KMEANS WITH COSINE SIMILARITY

# initialize the KMeansClusterer
kmeans_cos = KMeansClusterer(k, distance= nltk.cluster.util.cosine_distance, repeats = 25, normalise=True)
kmeans_cos_pred  = kmeans_cos.cluster(emb_cluster, assign_clusters=True)

In [None]:
df['kmeans_cos'] = kmeans_cos_pred
df

Unnamed: 0,words,embeddings,cluster,kmeans,kmeans_cos,db
0,puppies,"[0.0368, -0.0728, -0.0413, 0.0336, -0.0342, 0....",1,0,1,-1
1,cat,"[-0.0327, 0.0332, -0.0772, 0.0275, -0.0469, 0....",1,0,1,-1
2,tiger,"[-0.067, 0.0162, -0.0978, 0.0705, -0.053, 0.09...",1,0,1,-1
3,bird,"[-0.0982, -0.0368, 0.0447, 0.0752, -0.0479, 0....",1,0,1,-1
4,worms,"[0.0219, 0.026, -0.0404, -0.0292, -0.0343, 0.0...",1,0,1,-1
5,frog,"[-0.0393, -0.0135, -0.0017, 0.1069, -0.0813, -...",1,0,1,-1
6,cheval,"[0.0667, 0.0019, 0.0687, -0.0496, -0.0082, 0.0...",1,1,0,0
7,chiens,"[0.0242, -0.0141, -0.0569, -0.0495, -0.0034, 0...",1,1,0,0
8,lapin,"[0.0313, -0.0256, -0.0598, -0.0521, 0.0351, 0....",1,1,1,0
9,veau,"[0.0059, 0.0308, 0.0282, -0.0869, 0.0484, 0.11...",1,1,1,0


In [None]:
# DBSCAN
db = DBSCAN(metric = 'cosine')
db = db.fit(emb_cluster)
db_pred = db.labels_

In [None]:
df['db'] = db_pred
df

Unnamed: 0,words,embeddings,cluster,kmeans,kmeans_cos,db
0,puppies,"[0.0368, -0.0728, -0.0413, 0.0336, -0.0342, 0....",1,0,1,-1
1,cat,"[-0.0327, 0.0332, -0.0772, 0.0275, -0.0469, 0....",1,0,1,-1
2,tiger,"[-0.067, 0.0162, -0.0978, 0.0705, -0.053, 0.09...",1,0,1,-1
3,bird,"[-0.0982, -0.0368, 0.0447, 0.0752, -0.0479, 0....",1,0,1,-1
4,worms,"[0.0219, 0.026, -0.0404, -0.0292, -0.0343, 0.0...",1,0,1,-1
5,frog,"[-0.0393, -0.0135, -0.0017, 0.1069, -0.0813, -...",1,0,1,-1
6,cheval,"[0.0667, 0.0019, 0.0687, -0.0496, -0.0082, 0.0...",1,1,0,0
7,chiens,"[0.0242, -0.0141, -0.0569, -0.0495, -0.0034, 0...",1,1,0,0
8,lapin,"[0.0313, -0.0256, -0.0598, -0.0521, 0.0351, 0....",1,1,1,0
9,veau,"[0.0059, 0.0308, 0.0282, -0.0869, 0.0484, 0.11...",1,1,1,0


In [None]:
# Agglomerative Clustering
# note: in scikit-learn >= 1.2 the `affinity` parameter is named `metric`
agglo = AgglomerativeClustering(k, affinity='cosine', linkage='complete')
agglo.fit(emb_cluster)
agglo_pred = agglo.labels_

In [None]:
df['agglo'] = agglo_pred
df

Unnamed: 0,words,embeddings,cluster,kmeans,kmeans_cos,db,agglo
0,puppies,"[0.0368, -0.0728, -0.0413, 0.0336, -0.0342, 0....",1,0,1,-1,1
1,cat,"[-0.0327, 0.0332, -0.0772, 0.0275, -0.0469, 0....",1,0,1,-1,1
2,tiger,"[-0.067, 0.0162, -0.0978, 0.0705, -0.053, 0.09...",1,0,1,-1,1
3,bird,"[-0.0982, -0.0368, 0.0447, 0.0752, -0.0479, 0....",1,0,1,-1,1
4,worms,"[0.0219, 0.026, -0.0404, -0.0292, -0.0343, 0.0...",1,0,1,-1,1
5,frog,"[-0.0393, -0.0135, -0.0017, 0.1069, -0.0813, -...",1,0,1,-1,1
6,cheval,"[0.0667, 0.0019, 0.0687, -0.0496, -0.0082, 0.0...",1,1,0,0,0
7,chiens,"[0.0242, -0.0141, -0.0569, -0.0495, -0.0034, 0...",1,1,0,0,0
8,lapin,"[0.0313, -0.0256, -0.0598, -0.0521, 0.0351, 0....",1,1,1,0,0
9,veau,"[0.0059, 0.0308, 0.0282, -0.0869, 0.0484, 0.11...",1,1,1,0,0
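
In the rows visible above, the English and French animal words receive different labels, which hints that the algorithms may be separating languages rather than topics. One way to probe this is to compare the average cosine similarity within a language to the average across languages. The sketch below runs on made-up vectors. `mean_similarity_by_group` is a hypothetical helper, and with the real data the `groups` argument would be a list of language codes, one per row of `emb_cluster`.

```python
import numpy as np

def mean_similarity_by_group(vectors, groups):
    """Mean cosine similarity for pairs in the same group vs. pairs in different groups."""
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    groups = np.asarray(groups)
    same = groups[:, None] == groups[None, :]
    off_diagonal = ~np.eye(len(groups), dtype=bool)  # ignore each vector's similarity to itself
    within = sims[same & off_diagonal].mean()
    between = sims[~same].mean()
    return float(within), float(between)

# toy data: two "languages" whose vectors sit in different regions
toy_points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_langs = ["en", "en", "fr", "fr"]
toy_within, toy_between = mean_similarity_by_group(toy_points, toy_langs)
print(toy_within, toy_between)
```

If the within-language average is clearly higher than the cross-language one on the real embeddings, a language-driven split with `k = 2` would not be surprising.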


### **Clustering example sentences**

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# a potential cluster of sentences in different languages

# english
cluster1_en = ['The shark is the most ferocious animal in the sea It’s my favourite',
               'Scientist have discovered that squids are extremely intelligent creatures',
               'The animal with the longest neck is the giraffe']

cluster1_en_emb = [[] for i in range(len(cluster1_en))]
cluster1_en_emb_stop = [[] for i in range(len(cluster1_en))]

# iterate through the list of sentences and generate embeddings
for i,tweet in enumerate(cluster1_en):
  for word in tweet.split():
    try:
      word = word.lower()
      ind = en_embeddings.token_to_index[word]
      cluster1_en_emb[i].append(np.array(en_embeddings.vectors[ind].tolist()))
      if word not in stopwords.words('english'):
        cluster1_en_emb_stop[i].append(np.array(en_embeddings.vectors[ind].tolist()))
    except KeyError:
      # skip words that are not in the embedding vocabulary
      continue

# french
cluster1_fr = ['Le requin est l’animal le plus féroce de la mer C’est mon prefere',
               'Les scientifiques ont découvert que les calmars sont des créatures extrêmement intelligentes', 
               'L’animal au cou le plus long est la girafe']

cluster1_fr_emb = [[] for i in range(len(cluster1_fr))]
cluster1_fr_emb_stop = [[] for i in range(len(cluster1_fr))]

for i,tweet in enumerate(cluster1_fr):
  for word in tweet.split():
    try:
      word = word.lower()
      ind = fr_embeddings.token_to_index[word]
      cluster1_fr_emb[i].append(np.array(fr_embeddings.vectors[ind].tolist()))
      if word not in stopwords.words('french'):
        cluster1_fr_emb_stop[i].append(np.array(fr_embeddings.vectors[ind].tolist()))
    except KeyError:
      # skip words that are not in the embedding vocabulary
      continue

# german 
cluster1_de = ['Der Hai ist das wildeste Tier im Meer Es ist mein Favorit',
               'Wissenschaftler haben entdeckt dass Tintenfische extrem intelligente Wesen sind', 
               'Das Tier mit dem längsten Hals ist die Giraffe']

cluster1_de_emb = [[] for i in range(len(cluster1_de))]
cluster1_de_emb_stop = [[] for i in range(len(cluster1_de))]

for i,tweet in enumerate(cluster1_de):
  tweet = tweet.split()
  for word in tweet:
    word = word.lower()
    try:
      ind = de_embeddings.token_to_index[word]
      cluster1_de_emb[i].append(np.array(de_embeddings.vectors[ind].tolist()))
      
      if word not in stopwords.words('german'):
        cluster1_de_emb_stop[i].append(np.array(de_embeddings.vectors[ind].tolist()))
      
    except KeyError:
      # report words that are not in the embedding vocabulary
      print(word)

cluster_animals = cluster1_en_emb + cluster1_fr_emb + cluster1_de_emb
cluster_animals_stop = cluster1_en_emb_stop + cluster1_fr_emb_stop + cluster1_de_emb_stop
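
The loops above split on whitespace only, so tokens such as `it’s`, `l’animal` or `c’est` keep their apostrophe and are likely to be skipped as out-of-vocabulary. A slightly more careful tokenizer splits on every non-letter character. This is a sketch, not the notebook's method: `tokenize` is a hypothetical helper, and the small stop-word set is made up for the example instead of using the NLTK lists.

```python
import re

def tokenize(sentence, stop_words=frozenset()):
    """Lowercase, keep runs of letters only (accents included), drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    return [token for token in tokens if token not in stop_words]

# made-up stop-word set for the demonstration
toy_stop_words = {"le", "est", "l", "plus"}
print(tokenize("Le requin est l’animal le plus féroce", toy_stop_words))
```

Switching to this tokenizer would change which words contribute to each sentence, so the embeddings and cluster assignments shown in this notebook would not carry over unchanged.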

In [None]:
df_animals = pd.DataFrame({'sentences': cluster1_en + cluster1_fr + cluster1_de,
                    'embeddings': cluster_animals, 'embeddings_stop': cluster_animals_stop,
                    'cluster': [1]*len(cluster_animals)})

df_animals

Unnamed: 0,sentences,embeddings,embeddings_stop,cluster
0,The shark is the most ferocious animal in the ...,"[[-0.03240000084042549, -0.04619999974966049, ...","[[-0.005200000014156103, 0.04019999876618385, ...",1
1,Scientist have discovered that squids are extr...,"[[0.016899999231100082, -0.00559999980032444, ...","[[0.016899999231100082, -0.00559999980032444, ...",1
2,The animal with the longest neck is the giraffe,"[[-0.03240000084042549, -0.04619999974966049, ...","[[0.04390000179409981, 0.056699998676776886, -...",1
3,Le requin est l’animal le plus féroce de la me...,"[[0.038600001484155655, 0.022099999710917473, ...","[[0.042100001126527786, 0.0066999997943639755,...",1
4,Les scientifiques ont découvert que les calmar...,"[[0.012600000016391277, 0.06719999760389328, 0...","[[0.04149999842047691, -0.008500000461935997, ...",1
5,L’animal au cou le plus long est la girafe,"[[-0.02290000021457672, 0.03139999881386757, 0...","[[-0.025299999862909317, -0.018400000408291817...",1
6,Der Hai ist das wildeste Tier im Meer Es ist m...,"[[0.018400000408291817, -0.03099999949336052, ...","[[0.015799999237060547, 0.026399999856948853, ...",1
7,Wissenschaftler haben entdeckt dass Tintenfisc...,"[[0.0568000003695488, 0.07339999824762344, 0.0...","[[0.0568000003695488, 0.07339999824762344, 0.0...",1
8,Das Tier mit dem längsten Hals ist die Giraffe,"[[-0.020500000566244125, 0.03819999843835831, ...","[[0.0044999998062849045, 0.09790000319480896, ...",1


In [None]:
# a potential cluster of sentences in different languages

# english
cluster2_en = ['The telephone is the most ferocious invention in the world It’s my favourite',
               'Scientists have discovered that robots are extremely intelligent beings',
               'The computer with the longest battery is the laptop']

cluster2_en_emb = [[] for i in range(len(cluster2_en))]
cluster2_en_emb_stop = [[] for i in range(len(cluster2_en))]

# iterate through the list of sentences and generate embeddings
for i,tweet in enumerate(cluster2_en):
  for word in tweet.split():
    try:
      word = word.lower()
      ind = en_embeddings.token_to_index[word]
      cluster2_en_emb[i].append(np.array(en_embeddings.vectors[ind].tolist()))
      if word not in stopwords.words('english'):
        cluster2_en_emb_stop[i].append(np.array(en_embeddings.vectors[ind].tolist()))
    except KeyError:
      # skip words that are not in the embedding vocabulary
      continue

# french
cluster2_fr = ['Le téléphone est l’invention la plus féroce au monde C’est mon prefere',
               'Les scientifiques ont découvert que les robots sont des êtres extrêmement intelligents', 
               'L’ordinateur avec la batterie la plus longue est l’ordinateur portable']

cluster2_fr_emb = [[] for i in range(len(cluster2_fr))]
cluster2_fr_emb_stop = [[] for i in range(len(cluster2_fr))]

for i,tweet in enumerate(cluster2_fr):
  for word in tweet.split():
    try:
      word = word.lower()
      ind = fr_embeddings.token_to_index[word]
      cluster2_fr_emb[i].append(np.array(fr_embeddings.vectors[ind].tolist()))
      if word not in stopwords.words('french'):
        cluster2_fr_emb_stop[i].append(np.array(fr_embeddings.vectors[ind].tolist()))
    except KeyError:
      # skip words that are not in the embedding vocabulary
      continue

# german 
cluster2_de = ['Das Telefon ist die wildeste Erfindung der Welt Es ist mein Favorit',
               'Wissenschaftler haben entdeckt dass Roboter extrem intelligente Wesen sind', 
               'Der Computer mit dem längsten Akku ist der Laptop']

cluster2_de_emb = [[] for i in range(len(cluster2_de))]
cluster2_de_emb_stop = [[] for i in range(len(cluster2_de))]

for i,tweet in enumerate(cluster2_de):
  for word in tweet.split():
    try:
      word = word.lower()
      ind = de_embeddings.token_to_index[word]
      cluster2_de_emb[i].append(np.array(de_embeddings.vectors[ind].tolist()))
      if word not in stopwords.words('german'):
        cluster2_de_emb_stop[i].append(np.array(de_embeddings.vectors[ind].tolist()))
    except KeyError:
      # skip words that are not in the embedding vocabulary
      continue

cluster_tech = cluster2_en_emb + cluster2_fr_emb + cluster2_de_emb
cluster_tech_stop = cluster2_en_emb_stop + cluster2_fr_emb_stop + cluster2_de_emb_stop

In [None]:
df_tech = pd.DataFrame({'sentences': cluster2_en + cluster2_fr + cluster2_de,
                    'embeddings': cluster_tech, 'embeddings_stop': cluster_tech_stop,
                    'cluster': [2]*len(cluster_tech)})

df_tech

Unnamed: 0,sentences,embeddings,embeddings_stop,cluster
0,The telephone is the most ferocious invention ...,"[[-0.03240000084042549, -0.04619999974966049, ...","[[0.007000000216066837, 0.07699999958276749, 0...",2
1,Scientists have discovered that robots are ext...,"[[-0.04490000009536743, 0.0737999975681305, -0...","[[-0.04490000009536743, 0.0737999975681305, -0...",2
2,The computer with the longest battery is the l...,"[[-0.03240000084042549, -0.04619999974966049, ...","[[0.07039999961853027, -0.03150000050663948, 0...",2
3,Le téléphone est l’invention la plus féroce au...,"[[0.038600001484155655, 0.022099999710917473, ...","[[0.07739999890327454, 0.0803999975323677, 0.0...",2
4,Les scientifiques ont découvert que les robots...,"[[0.012600000016391277, 0.06719999760389328, 0...","[[0.04149999842047691, -0.008500000461935997, ...",2
5,L’ordinateur avec la batterie la plus longue e...,"[[-0.020999999716877937, 0.08550000190734863, ...","[[-0.04190000146627426, -0.04170000180602074, ...",2
6,Das Telefon ist die wildeste Erfindung der Wel...,"[[-0.020500000566244125, 0.03819999843835831, ...","[[0.024700000882148743, 0.11760000139474869, 0...",2
7,Wissenschaftler haben entdeckt dass Roboter ex...,"[[0.0568000003695488, 0.07339999824762344, 0.0...","[[0.0568000003695488, 0.07339999824762344, 0.0...",2
8,Der Computer mit dem längsten Akku ist der Laptop,"[[0.018400000408291817, -0.03099999949336052, ...","[[0.08749999850988388, 0.028999999165534973, 0...",2


In [None]:
df = pd.concat([df_animals, df_tech])
df.reset_index(inplace=True)
df

Unnamed: 0,index,sentences,embeddings,embeddings_stop,cluster
0,0,The shark is the most ferocious animal in the ...,"[[-0.03240000084042549, -0.04619999974966049, ...","[[-0.005200000014156103, 0.04019999876618385, ...",1
1,1,Scientist have discovered that squids are extr...,"[[0.016899999231100082, -0.00559999980032444, ...","[[0.016899999231100082, -0.00559999980032444, ...",1
2,2,The animal with the longest neck is the giraffe,"[[-0.03240000084042549, -0.04619999974966049, ...","[[0.04390000179409981, 0.056699998676776886, -...",1
3,3,Le requin est l’animal le plus féroce de la me...,"[[0.038600001484155655, 0.022099999710917473, ...","[[0.042100001126527786, 0.0066999997943639755,...",1
4,4,Les scientifiques ont découvert que les calmar...,"[[0.012600000016391277, 0.06719999760389328, 0...","[[0.04149999842047691, -0.008500000461935997, ...",1
5,5,L’animal au cou le plus long est la girafe,"[[-0.02290000021457672, 0.03139999881386757, 0...","[[-0.025299999862909317, -0.018400000408291817...",1
6,6,Der Hai ist das wildeste Tier im Meer Es ist m...,"[[0.018400000408291817, -0.03099999949336052, ...","[[0.015799999237060547, 0.026399999856948853, ...",1
7,7,Wissenschaftler haben entdeckt dass Tintenfisc...,"[[0.0568000003695488, 0.07339999824762344, 0.0...","[[0.0568000003695488, 0.07339999824762344, 0.0...",1
8,8,Das Tier mit dem längsten Hals ist die Giraffe,"[[-0.020500000566244125, 0.03819999843835831, ...","[[0.0044999998062849045, 0.09790000319480896, ...",1
9,0,The telephone is the most ferocious invention ...,"[[-0.03240000084042549, -0.04619999974966049, ...","[[0.007000000216066837, 0.07699999958276749, 0...",2


In [None]:
# Generate the embedding at the tweet level (normalized sum of the word embeddings)
def tweet_embedding_sum(df, embeddings_location = 'word_embeddings'):
  """
  Sum the word embeddings of each tweet and scale the result to unit length.
  `embeddings_location` is the name of the column holding the word embeddings.
  """
  tweet_agg = []
  for tweet in df[embeddings_location]:
    # sum all elements from the same dimension for all words
    tweet_emb = np.sum(tweet, axis = 0, dtype=np.float64)
    # normalize
    tweet_emb_norm = tweet_emb/np.linalg.norm(tweet_emb) 
    tweet_agg.append(tweet_emb_norm)

  return tweet_agg

In [None]:
# Generate the embedding at the tweet level (taking the mean)
def tweet_embedding_mean(df, embeddings_location = 'word_embeddings'):
  """
  Average the word embeddings of each tweet (the result is not normalized).
  `embeddings_location` is the name of the column holding the word embeddings.
  """
  tweet_agg = []
  for tweet in df[embeddings_location]:
    # take the mean across each dimension for all words
    tweet_emb = np.mean(tweet, axis = 0, dtype=np.float64)
    tweet_agg.append(tweet_emb)

  return tweet_agg

In [None]:
# generate the tweet embeddings

# with stop words
tweets_mean = tweet_embedding_mean(df,'embeddings')
tweets_sum = tweet_embedding_sum(df,'embeddings')
# without stop words
tweets_mean_stop = tweet_embedding_mean(df,'embeddings_stop')
tweets_sum_stop = tweet_embedding_sum(df,'embeddings_stop')
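
The sum and the mean of a set of word vectors differ only by a positive scale factor, so they point in the same direction. For cosine-based clustering the two aggregations are therefore interchangeable, and they differ only for methods that depend on vector length, such as Euclidean k-means on the unnormalized mean. The check below uses random stand-in vectors rather than real embeddings.

```python
import numpy as np

# random stand-in for the word embeddings of one six-word sentence
rng = np.random.default_rng(0)
toy_word_vectors = rng.normal(size=(6, 300))

# normalized sum, as in tweet_embedding_sum
sum_emb = toy_word_vectors.sum(axis=0)
sum_emb_norm = sum_emb / np.linalg.norm(sum_emb)

# plain mean, as in tweet_embedding_mean
mean_emb = toy_word_vectors.mean(axis=0)

# cosine similarity between the two sentence vectors
cosine_sum_mean = float(sum_emb_norm @ mean_emb / np.linalg.norm(mean_emb))
print(cosine_sum_mean)
print(np.linalg.norm(sum_emb_norm), np.linalg.norm(mean_emb))
```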

In [None]:
from sklearn.cluster import DBSCAN, KMeans
k = 2
kmeans = KMeans(k, verbose=0)
kmeans.fit(tweets_sum_stop)
kmeans_pred = kmeans.labels_

In [None]:
df['kmeans_sum'] = kmeans_pred
df

Unnamed: 0,index,sentences,embeddings,embeddings_stop,cluster,kmeans_sum
0,0,The shark is the most ferocious animal in the ...,"[[-0.03240000084042549, -0.04619999974966049, ...","[[-0.005200000014156103, 0.04019999876618385, ...",1,1
1,1,Scientist have discovered that squids are extr...,"[[0.016899999231100082, -0.00559999980032444, ...","[[0.016899999231100082, -0.00559999980032444, ...",1,1
2,2,The animal with the longest neck is the giraffe,"[[-0.03240000084042549, -0.04619999974966049, ...","[[0.04390000179409981, 0.056699998676776886, -...",1,1
3,3,Le requin est l’animal le plus féroce de la me...,"[[0.038600001484155655, 0.022099999710917473, ...","[[0.042100001126527786, 0.0066999997943639755,...",1,0
4,4,Les scientifiques ont découvert que les calmar...,"[[0.012600000016391277, 0.06719999760389328, 0...","[[0.04149999842047691, -0.008500000461935997, ...",1,0
5,5,L’animal au cou le plus long est la girafe,"[[-0.02290000021457672, 0.03139999881386757, 0...","[[-0.025299999862909317, -0.018400000408291817...",1,0
6,6,Der Hai ist das wildeste Tier im Meer Es ist m...,"[[0.018400000408291817, -0.03099999949336052, ...","[[0.015799999237060547, 0.026399999856948853, ...",1,0
7,7,Wissenschaftler haben entdeckt dass Tintenfisc...,"[[0.0568000003695488, 0.07339999824762344, 0.0...","[[0.0568000003695488, 0.07339999824762344, 0.0...",1,0
8,8,Das Tier mit dem längsten Hals ist die Giraffe,"[[-0.020500000566244125, 0.03819999843835831, ...","[[0.0044999998062849045, 0.09790000319480896, ...",1,0
9,0,The telephone is the most ferocious invention ...,"[[-0.03240000084042549, -0.04619999974966049, ...","[[0.007000000216066837, 0.07699999958276749, 0...",2,1


In [None]:
from nltk.cluster import KMeansClusterer
k = 2
kmeans = KMeansClusterer(k,distance=nltk.cluster.util.cosine_distance, repeats = 10)
kmeans_pred  = kmeans.cluster(tweets_sum_stop, assign_clusters=True)

In [None]:
df['kmeans_sum_cos'] = kmeans_pred
df

Unnamed: 0,index,sentences,embeddings,embeddings_stop,cluster,kmeans_sum,kmeans_sum_cos
0,0,The shark is the most ferocious animal in the ...,"[[-0.03240000084042549, -0.04619999974966049, ...","[[-0.005200000014156103, 0.04019999876618385, ...",1,1,1
1,1,Scientist have discovered that squids are extr...,"[[0.016899999231100082, -0.00559999980032444, ...","[[0.016899999231100082, -0.00559999980032444, ...",1,1,1
2,2,The animal with the longest neck is the giraffe,"[[-0.03240000084042549, -0.04619999974966049, ...","[[0.04390000179409981, 0.056699998676776886, -...",1,1,1
3,3,Le requin est l’animal le plus féroce de la me...,"[[0.038600001484155655, 0.022099999710917473, ...","[[0.042100001126527786, 0.0066999997943639755,...",1,0,0
4,4,Les scientifiques ont découvert que les calmar...,"[[0.012600000016391277, 0.06719999760389328, 0...","[[0.04149999842047691, -0.008500000461935997, ...",1,0,0
5,5,L’animal au cou le plus long est la girafe,"[[-0.02290000021457672, 0.03139999881386757, 0...","[[-0.025299999862909317, -0.018400000408291817...",1,0,0
6,6,Der Hai ist das wildeste Tier im Meer Es ist m...,"[[0.018400000408291817, -0.03099999949336052, ...","[[0.015799999237060547, 0.026399999856948853, ...",1,0,0
7,7,Wissenschaftler haben entdeckt dass Tintenfisc...,"[[0.0568000003695488, 0.07339999824762344, 0.0...","[[0.0568000003695488, 0.07339999824762344, 0.0...",1,0,0
8,8,Das Tier mit dem längsten Hals ist die Giraffe,"[[-0.020500000566244125, 0.03819999843835831, ...","[[0.0044999998062849045, 0.09790000319480896, ...",1,0,0
9,0,The telephone is the most ferocious invention ...,"[[-0.03240000084042549, -0.04619999974966049, ...","[[0.007000000216066837, 0.07699999958276749, 0...",2,1,1
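
To put a number on how well a clustering matches the intended topics, the predicted labels can be compared with the `cluster` column. Cluster IDs are arbitrary, so the comparison should be made over pairs of items rather than raw label values. The sketch below computes the plain Rand index in pure Python, using only the ten rows visible in the output above as a small sample. `rand_index` is a hypothetical helper name.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree (same group vs. different group)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# the ten visible rows: intended topic vs. the kmeans_sum_cos column
expected = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sample_score = rand_index(expected, predicted)
print(round(sample_score, 4))
```

In this sample the predicted labels follow the language of the sentence (English versus French or German) rather than the topic, which is why the score is low. scikit-learn's `adjusted_rand_score` provides a chance-corrected variant of the same idea.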
