# Learning languages from a single message

[...] First we import everything that we're going to be using throughout the notebook.

In [15]:
import os
import sys

from statistics import median, mean
from itertools import combinations, product
from collections import defaultdict
from random import shuffle, sample, seed

import networkx as nx
from networkx.algorithms.shortest_paths.weighted import single_source_dijkstra

if os.getcwd()[-7:] == 'example':
    os.chdir("..")

from algorithm.base import ShortestPathModel

from example.dataset_utils.sample_dataset import sample_dataset


seed(42)

### Differentiating between English, French, Italian and German

For our first example, we take just 4 of the 15 languages in our dataset and see how well our algorithm differentiates messages of varying length in these languages. We start with a simple example: two thousand messages from each of these languages, each 10 words long.

In [16]:
num_samples = 2000
samples = {
    'english' : sample_dataset(num_samples, 10, 'en'),
    'french' : sample_dataset(num_samples, 10, 'fr'),
    'italian' : sample_dataset(num_samples, 10, 'it'),
    'german' : sample_dataset(num_samples, 10, 'de')
    }
    
languages = samples.keys()

Now we define the similarity score function. This function is the crux of the whole model, but it is very simple because the domain is very simple. For two strings $s_1$ and $s_2$, we define
$$
\operatorname{weight}(s_1, s_2, p)=
\begin{cases}
\text{(number of words shared by } s_1 \text{ and } s_2 \text{)}^{-p} & \text{ if } s_1 \text{ and } s_2 \text{ share at least one word,}\\
\infty & \text{ otherwise.}\\
\end{cases}
$$
Here $\infty$ is a shorthand for "there is no edge between those two vertices."

We also define one other helper function.

In [17]:
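def weight_function(string1, string2, p=2):
    # Collect the words of string1 that also occur in string2.
    intersection = [x for x in string1 if x in string2]
    if not intersection:
        # No shared words: infinite weight, i.e. no edge between the two.
        return float('inf')
    return 1 / (len(intersection) ** p)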

def print_report(accuracy_list):
    print(f"""Final report
---------
Mean accuracy is {100*mean(accuracy_list)}% and median is {100*median(accuracy_list)}%,
minimum accuracy is {100*min(accuracy_list)}% and maximum accuracy is {100*max(accuracy_list)}%
---------""")
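
To make the weights concrete, here is a minimal sketch on toy messages. It assumes each message is a list of words (the return type of `sample_dataset` is not shown here). The name `shared_word_weight` is hypothetical, and the function is a set-based variant: unlike `weight_function` above, it counts each shared word once even if the word is repeated within a message.

```python
def shared_word_weight(words1, words2, p=2):
    """Set-based variant of the weight: (number of distinct shared words)^(-p)."""
    shared = set(words1) & set(words2)
    if not shared:
        # No shared words: treated as "no edge" between the two messages.
        return float("inf")
    return 1 / len(shared) ** p


# Toy messages, assumed to be lists of words.
english_a = "the cat sat on the mat".split()
english_b = "the dog sat on the rug".split()
french = "le chat est sur le tapis".split()

print(shared_word_weight(english_a, english_b))  # shares "the", "sat", "on": 1/9
print(shared_word_weight(english_a, french))     # shares nothing: inf
```

With $p=2$, three shared words give a weight of $1/9$, while messages with no words in common are left unconnected.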

In [12]:
accuracies = []

for language_anchor, language_other in product(languages, repeat=2):
    if language_anchor == language_other:
        continue
    print(f"Using an example in {language_anchor.capitalize()} "
          f"to differentiate it from {language_other.capitalize()}...")

    model = ShortestPathModel(weight_function)

    current_sample = samples[language_anchor] + samples[language_other]
    labels = (len(samples[language_anchor]) * [1] +
                len(samples[language_other]) * [0] )
    n_of_labels = len(labels)
    
    model.fit_predict(current_sample)

    # For calculating accuracy, take all but the first example, since
    # that is the known one
    accuracy = sum(labels[i] == model.predictions_on_train_set[i]
                   for i in range(1, n_of_labels)) / (n_of_labels - 1)
    accuracies.append(accuracy)
    print(f"Accuracy: {100*accuracy}%")
    

print_report(accuracies)


Using an example in English to differentiate it from French...
Accuracy: 99.17479369842461%
Using an example in English to differentiate it from Italian...
Accuracy: 96.42410602650664%
Using an example in English to differentiate it from German...
Accuracy: 99.4498624656164%
Using an example in French to differentiate it from English...
Accuracy: 90.49762440610152%
Using an example in French to differentiate it from Italian...
Accuracy: 95.04876219054765%
Using an example in French to differentiate it from German...
Accuracy: 97.07426856714179%
Using an example in Italian to differentiate it from English...
Accuracy: 91.02275568892223%
Using an example in Italian to differentiate it from French...
Accuracy: 84.0960240060015%
Using an example in Italian to differentiate it from German...
Accuracy: 91.02275568892223%
Using an example in German to differentiate it from English...
Accuracy: 87.44686171542885%
Using an example in German to differentiate it from French...
Accuracy: 99.124781
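
The accuracy computation used in the loop above skips the first message, since that is the single labelled example. A minimal sketch of the same computation as a reusable helper follows; the name `accuracy_excluding_seed` is hypothetical, and the sketch assumes the predictions come as a list aligned with the labels.

```python
def accuracy_excluding_seed(labels, predictions):
    """Fraction of correct predictions, ignoring index 0 (the known example)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if len(labels) < 2:
        raise ValueError("need at least one example besides the known one")
    correct = sum(l == p for l, p in zip(labels[1:], predictions[1:]))
    return correct / (len(labels) - 1)


# Toy data: index 0 is the known example and is not scored.
labels = [1, 1, 1, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 1]
print(accuracy_excluding_seed(labels, predictions))  # 3 of 5 correct: 0.6
```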

## Playing with hyperparameters

In the above example, we've seen that the model works well when given 2000 messages of length 10 per language, with the similarity function being the inverse of the squared number of shared words. So there are a few things we can play with: the number of messages, their length, and the similarity function (in which, for simplicity, we'll only change the "squared" part).
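
To get a feel for what changing the exponent does, the sketch below tabulates the weight $k^{-p}$ for a few values of $k$ (the number of shared words) and $p$. The helper name `weight_for_shared` and the chosen values are purely illustrative.

```python
def weight_for_shared(k, p):
    """Edge weight for k >= 1 shared words with exponent p."""
    return k ** -p


# Compare how quickly the weight shrinks as the overlap grows.
for p in (1, 2, 3):
    weights = [weight_for_shared(k, p) for k in (1, 2, 5)]
    print(f"p={p}: " + ", ".join(f"{w:.3f}" for w in weights))
```

A pair sharing a single word always has weight 1, while a larger $p$ makes pairs with many shared words much cheaper in comparison, so shortest paths lean more heavily on strongly overlapping messages.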

Let us now turn our attention to just French and Italian. Although it says little about the peculiarities of our own dataset, French and Italian do have a high [lexical similarity](https://en.wikipedia.org/wiki/Lexical_similarity#Indo-European_languages), which makes this pair as good a choice as any among the $15\cdot14/2=105$ options.

### Let's try the same parameters again

As our first experiment, let us keep the same parameters but draw many different samples from our dataset. When we draw 2000 samples of length 10, we draw only about 20 thousand words, a small subset of the roughly 3 million words in each dataset, so we can be fairly confident that there will be little correlation between our different runs.
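
The arithmetic behind that estimate can be checked in a couple of lines. The corpus size below is the approximate figure quoted in the text, not a measured value.

```python
num_samples = 2000
words_per_message = 10
corpus_words = 3_000_000  # approximate size of each language's dataset

words_drawn = num_samples * words_per_message
fraction = words_drawn / corpus_words

print(words_drawn)        # 20000
print(f"{fraction:.2%}")  # 0.67%
```

Each run therefore touches well under one percent of a language's dataset.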

In [18]:
num_of_runs = 50
accuracies = []
for _ in range(num_of_runs):
    french = sample_dataset(num_samples, 10, 'fr')
    italian = sample_dataset(num_samples, 10, 'it')
    
    model = ShortestPathModel(weight_function)

    current_sample = french + italian
    labels = (len(french) * [1] +
                len(italian) * [0] )
    n_of_labels = len(labels)
    
    model.fit_predict(current_sample)

    # For calculating accuracy, take all but the first example, since
    # that is the known one
    accuracy = sum(labels[i] == model.predictions_on_train_set[i]
                   for i in range(1, n_of_labels)) / (n_of_labels - 1)
    accuracies.append(accuracy)
    print(f"Accuracy: {100*accuracy}%")

print_report(accuracies)

Accuracy: 76.56914228557139%
Accuracy: 93.02325581395348%
Accuracy: 89.89747436859214%
Accuracy: 77.56939234808702%
Accuracy: 94.47361840460114%
Accuracy: 93.09827456864215%
Accuracy: 93.67341835458865%
Accuracy: 84.6461615403851%
Accuracy: 93.82345586396599%
Accuracy: 94.19854963740936%
Accuracy: 75.66891722930733%
Accuracy: 91.12278069517379%
Accuracy: 76.44411102775695%
Accuracy: 74.56864216054014%
Accuracy: 93.5233808452113%
Accuracy: 84.92123030757689%
Accuracy: 90.74768692173043%
Accuracy: 92.54813703425856%
Accuracy: 93.74843710927732%
Accuracy: 93.64841210302576%
Accuracy: 91.07276819204802%
Accuracy: 94.87371842960741%
Accuracy: 94.42360590147537%
Accuracy: 93.62340585146288%
Accuracy: 80.64516129032258%
Accuracy: 94.24856214053513%
Accuracy: 92.39809952488122%
Accuracy: 90.67266816704176%
Accuracy: 94.5986496624156%
Accuracy: 91.74793698424605%
Accuracy: 92.39809952488122%
Accuracy: 94.72368092023005%
Accuracy: 92.27306826706678%
Accuracy: 87.99699924981246%
Accuracy: 76.5441