## Learning languages from a single message

In this example we'll see how this algorithm can accurately identify languages from very short messages. First, we import everything used throughout the notebook.

In [1]:
import os
import sys
from statistics import median
from itertools import combinations, product
from collections import defaultdict
from random import shuffle, sample, seed

import pandas as pd
import networkx as nx
from networkx.algorithms.shortest_paths.weighted import single_source_dijkstra

sys.path.append('example/dataset_utils')
from sample_dataset import sample_dataset


seed(42)

### Differentiating between English, French, Italian and German

For our first example, we take four languages and see how well the algorithm differentiates messages written in them. We will stress-test the algorithm under varying scenarios, but let us first try a simple case: 3,000 messages from each of these languages, each 10 words long.

In [10]:
num_samples = 3000
samples = {
    'english' : sample_dataset(num_samples, 10, 'en'),
    'french' : sample_dataset(num_samples, 10, 'fr'),
    'italian' : sample_dataset(num_samples, 10, 'it'),
    'german' : sample_dataset(num_samples, 10, 'de')
    }
    
languages = samples.keys()

Now we define the similarity score function. This function is the crux of the whole model ...

In [11]:
languages

dict_keys(['english', 'french', 'italian', 'german'])

In [12]:
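def similarity_score(string1, string2):
    # Inputs are compared element by element. With the split calls below
    # commented out, raw strings are compared character by character;
    # pass lists of words (or re-enable the splits) for word-level overlap.
    #string1 = string1.split(' ')
    #string2 = string2.split(' ')
    # Elements of string1 that also occur in string2 (repeats are counted).
    intersection = [x for x in string1 if x in string2]
    if len(intersection) == 0:
        # No overlap: infinite distance, so no edge is added to the graph.
        return float('inf')
    else:
        # More shared elements give a much smaller distance.
        return 1/(len(intersection) ** 4)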
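
To get a feel for the scale of these scores, here is a minimal sketch that applies the same formula to tokenised messages. It assumes each message is passed as a list of words, and the three sentences are made up for illustration rather than drawn from the dataset.

```python
def similarity_score(tokens1, tokens2):
    """Same formula as the cell above: 1 / (number of shared tokens) ** 4."""
    shared = [tok for tok in tokens1 if tok in tokens2]
    if not shared:
        return float('inf')  # no overlap, so no edge would be added
    return 1 / len(shared) ** 4

# Illustrative messages (assumption: whitespace tokenisation).
english_a = 'the council did not agree'.split()
english_b = 'the council did agree on the budget'.split()
spanish = 'el consejo no estuvo de acuerdo'.split()

forward = similarity_score(english_a, english_b)   # 4 shared tokens -> 1/256
backward = similarity_score(english_b, english_a)  # 'the' counted twice -> 1/625
unrelated = similarity_score(english_a, spanish)   # nothing shared -> inf

print(forward, backward, unrelated)
```

Two things are worth noticing. The fourth power makes the distance fall off very quickly as overlap grows, and the score is not symmetric when a message repeats a token, because repeats in the first argument are counted each time.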

In [13]:
for language_anchor, language_other in product(languages, repeat=2):
    if language_anchor == language_other:
        continue
    print(f'Language anchor: {language_anchor}, Other language: {language_other}')
    current_sample = samples[language_anchor] + samples[language_other]
    current_sample = list(enumerate(current_sample))
    graph = []
    for combination in combinations(current_sample, 2):
        similarity = similarity_score(combination[0][1], combination[1][1])
        if similarity == float('inf'):
            continue
        # Edge: (index of first message, index of second message, weight).
        graph.append( (combination[0][0], combination[1][0], similarity) )
    
    G = nx.Graph()

    for edge in graph:
        G.add_edge(str(edge[0]), str(edge[1]), weight=edge[2])

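    # Shortest-path distance from message 0 (anchor language) to every reachable message.
    distances = single_source_dijkstra(G, '0')[0]
    medijan = median(distances.values())
    predicted_labels = []
    # Messages farther away than the median distance are labelled 1 (other language).
    for x in range(num_samples*2):
        if x == 0:
            continue
        if distances[str(x)] > medijan:
            predicted_labels.append(1)
        else:
            predicted_labels.append(0)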

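    # Only the first num_samples-1 predictions are scored (nodes 1 to num_samples-1,
    # which come from the anchor language), so every expected label here is 0.
    labelss = [0 if x < num_samples-1 else 1 for x in range(num_samples-1)]
    truth = [1 if (predicted_labels[x] == labelss[x]) else 0 for x in range(num_samples-1)]
    print(sum(truth))
    print(f'Accuracy: {sum(truth)/(num_samples-1)}')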


Language anchor: english, Other language: french
2888
Accuracy: 0.9629876625541848
Language anchor: english, Other language: italian
2406
Accuracy: 0.802267422474158
Language anchor: english, Other language: german
2406
Accuracy: 0.802267422474158
Language anchor: french, Other language: english
2832
Accuracy: 0.9443147715905301
Language anchor: french, Other language: italian
2848
Accuracy: 0.9496498832944315
Language anchor: french, Other language: german
2838
Accuracy: 0.9463154384794932
Language anchor: italian, Other language: english
2340
Accuracy: 0.7802600866955652
Language anchor: italian, Other language: french
2419
Accuracy: 0.8066022007335779
Language anchor: italian, Other language: german
2857
Accuracy: 0.952650883627876
Language anchor: german, Other language: english
2509
Accuracy: 0.8366122040680227
Language anchor: german, Other language: french
2513
Accuracy: 0.837945981993998
Language anchor: german, Other language: italian
2751
Accuracy: 0.9173057685895298
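
The loop above can be hard to follow at full scale, so here is a minimal, dependency-free sketch of the same idea on a toy corpus: build a graph whose edge weights are similarity scores, compute shortest-path distances from message 0, and split at the median. It swaps networkx's `single_source_dijkstra` for a small `heapq`-based Dijkstra, tokenises messages on whitespace, and uses made-up sentences. All of these are assumptions for illustration, not the notebook's setup.

```python
import heapq
from itertools import combinations
from statistics import median

def similarity_score(tokens1, tokens2):
    shared = [tok for tok in tokens1 if tok in tokens2]
    return float('inf') if not shared else 1 / len(shared) ** 4

def dijkstra(adjacency, source):
    """Shortest-path distance from source to every reachable node."""
    distances = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances.get(node, float('inf')):
            continue  # stale queue entry
        for neighbour, weight in adjacency[node].items():
            candidate = dist + weight
            if candidate < distances.get(neighbour, float('inf')):
                distances[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return distances

# Toy corpus: messages 0-2 are English, 3-5 are Spanish.
messages = [
    'the council did not agree on the budget',
    'the council did agree on the new budget',
    'we agree that Mr Gottardi is right',
    'Donata Gottardi califica el presupuesto',
    'el presupuesto del consejo es limitado',
    'no estamos de acuerdo con el consejo',
]
tokenised = [m.split() for m in messages]

# Undirected weighted graph; pairs with no shared token get no edge.
adjacency = {i: {} for i in range(len(messages))}
for (i, a), (j, b) in combinations(enumerate(tokenised), 2):
    weight = similarity_score(a, b)
    if weight == float('inf'):
        continue
    adjacency[i][j] = weight
    adjacency[j][i] = weight

distances = dijkstra(adjacency, 0)
threshold = median(distances.values())
# Unreachable messages are treated as infinitely far from the anchor.
predicted = [1 if distances.get(i, float('inf')) > threshold else 0
             for i in range(len(messages))]
print(predicted)  # [0, 0, 0, 1, 1, 1]
```

In this toy case the only bridge between the two languages is the shared proper noun "Gottardi", which costs a full unit of distance, while messages in the same language are joined by much cheaper multi-word edges. That gap is what the median threshold exploits.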


In [169]:
combined_list = samples_limited['en'] + samples_limited['es']

x = list(enumerate(combined_list))
shuffle(x)
indices, combined_list = zip(*x)

labels = [0 if y < n_of_samples else 1 for y in indices]

In [170]:
indices[:10], labels[:10], combined_list[:10]

((454, 773, 198, 609, 450, 1255, 1765, 985, 143, 816),
 [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
 ('objective to reduce greenhouse gas experience I know that heading',
  'benefit from social is that the Council did not agree',
  'and to guarantee that the privacy of their communications is',
  'the utmost we had to give reasons for the existence',
  'we now all know is in no position to bring',
  'ha aprendido la lección tras la crisis Donata Gottardi califica',
  'justificación parece de sentido los recursos como el bacalao por',
  'develop and the maintenance of ecosystems should become a fundamental',
  'like to thank Mr Graefe zu Baringdorf for a very',
  'Pacific entire dispute with the United has arisen because we'))

In [171]:
graph = []
for combination in combinations(x, 2):
    similarity = similarity_score(combination[0][1], combination[1][1])
    if similarity == float('inf'):
        continue
    # Edge: (index of first message, index of second message, weight).
    graph.append( (combination[0][0], combination[1][0], similarity) )

In [172]:
len(graph)

782798

In [173]:
G = nx.Graph()

In [174]:
for edge in graph:
    G.add_edge(str(edge[0]), str(edge[1]), weight=edge[2])

In [176]:
distances = single_source_dijkstra(G, '0')[0]
medijan = median(distances.values())
print(medijan)
predicted_labels = []
for x in range(n_of_samples*2):
    if x == 0:
        continue
    if distances[str(x)] > medijan:
        predicted_labels.append(1)
    else:
        predicted_labels.append(0)

0.014660493827160493


In [177]:
labelss = [0 if x < n_of_samples else 1 for x in range(n_of_samples-1)]
truth = [1 if (predicted_labels[x] == labelss[x]) else 0 for x in range(n_of_samples-1)]

In [178]:
sum(truth)/(n_of_samples-1)

0.9259259259259259
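
The accuracy above is computed only over the first `n_of_samples - 1` predictions, all of which are compared against label 0. A hedged sketch of scoring every message against its true label follows. The `distances` dictionary here is a hand-written toy, not output from the notebook, and the helper name `median_split_accuracy` is hypothetical.

```python
from statistics import median

def median_split_accuracy(distances, true_labels):
    """Label nodes beyond the median distance as 1 and score all of them."""
    threshold = median(distances.values())
    correct = 0
    for node, label in true_labels.items():
        # Unreachable nodes are treated as infinitely far from the anchor.
        predicted = 1 if distances.get(node, float('inf')) > threshold else 0
        correct += int(predicted == label)
    return correct / len(true_labels)

# Toy distances keyed by node id (strings, as in the graph cells above).
distances = {'0': 0.0, '1': 0.01, '2': 0.02, '3': 0.5, '4': 0.6, '5': 0.015}
true_labels = {'0': 0, '1': 0, '2': 0, '3': 1, '4': 1, '5': 1}

accuracy = median_split_accuracy(distances, true_labels)
print(accuracy)  # 4 of the 6 toy messages are on the correct side of the median
```

Scoring both halves this way makes it visible when messages from the second language sit close to the anchor, which the one-sided check cannot reveal.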