## Learning languages from a single message

In this example we'll see how accurately the algorithm can identify languages from very short messages. First, we import everything used throughout the notebook.

In [1]:
import os
import sys
from statistics import median, mean
from itertools import combinations, product
from collections import defaultdict
from random import shuffle, sample, seed

import networkx as nx
from networkx.algorithms.shortest_paths.weighted import single_source_dijkstra

if os.getcwd()[-7:] == 'example':
    os.chdir("..")

from algorithm.base import ShortestPathModel

from example.dataset_utils.sample_dataset import sample_dataset


seed(42)

### Differentiating between English, French, Italian and German

For our first example, we take four languages and see how well the algorithm differentiates messages written in them. We'll stress-test the algorithm under varying scenarios later, but let's first try a simple case: 2,000 messages from each of these languages, each 10 words long.

In [2]:
num_samples = 2000
samples = {
    'english' : sample_dataset(num_samples, 10, 'en'),
    'french' : sample_dataset(num_samples, 10, 'fr'),
    'italian' : sample_dataset(num_samples, 10, 'it'),
    'german' : sample_dataset(num_samples, 10, 'de')
    }
    
languages = samples.keys()

Now we define the similarity score function. This function is the crux of this whole model ...

In [13]:
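def similarity_score(string1, string2):
    # Elements of string1 that also occur in string2 (duplicates are counted)
    intersection = [x for x in string1 if x in string2]
    if len(intersection) == 0:
        # Nothing in common: treat the two messages as infinitely far apart
        return float('inf')
    else:
        # More shared elements give a smaller score, i.e. a shorter distance
        return 1/(len(intersection) ** 0.8)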
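
To get a feel for what this score does, here is a small sketch that calls it on a few made-up messages. The function is repeated so the snippet runs on its own. The example messages, and the assumption that a message is a list of words, are hypothetical; the format actually returned by `sample_dataset` is not shown in this notebook. If messages were plain strings, the same code would compare characters instead of words.

```python
def similarity_score(string1, string2):
    # Same definition as in the cell above, repeated for self-containment.
    intersection = [x for x in string1 if x in string2]
    if len(intersection) == 0:
        return float('inf')
    else:
        return 1/(len(intersection) ** 0.8)


# Hypothetical messages, tokenised into words (an assumption, see above).
en_1 = "the cat sat on the mat".split()
en_2 = "the dog lay on the rug".split()
fr_1 = "le chat dort sur le tapis".split()

# "the", "on", "the" from en_1 all occur in en_2: three shared elements.
score_same_language = similarity_score(en_1, en_2)

# No word is shared, so the score is infinite.
score_other_language = similarity_score(en_1, fr_1)

print(score_same_language, score_other_language)

# The score is not symmetric: duplicates in the first argument are counted.
print(similarity_score(["the", "the", "cat"], ["the"]))  # two shared elements
print(similarity_score(["the"], ["the", "the", "cat"]))  # one shared element
```

Despite its name, the function behaves like a distance: a lower value means more overlap, and messages with nothing in common are infinitely far apart.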

In [14]:
accuracies = []

for language_anchor, language_other in product(languages, repeat=2):
    if language_anchor == language_other:
        continue
    print(f'Using an example in {language_anchor} to differentiate that language from {language_other}...')

    model = ShortestPathModel(similarity_score)

    current_sample = samples[language_anchor] + samples[language_other]
    labels = (len(samples[language_anchor]) * [1] +
                len(samples[language_other]) * [0] )
    n_of_labels = len(labels)
    
    model.fit_predict(current_sample)

    # For calculating accuracy, take all but the first example, since
    # that is the known one
    accuracy = sum([1 if (labels[i] == model.predictions_on_train_set[i])
                    else 0
                    for i in range(n_of_labels)][1:]) / (n_of_labels - 1)
    accuracies.append(accuracy)
    print(f"Accuracy: {100*accuracy}%")
    
print(f"""Final report
---------
Mean accuracy is {100*mean(accuracies)}% and median is {100*median(accuracies)}%,
minimum accuracy is {100*min(accuracies)}% and maximum accuracy is {100*max(accuracies)}%
---------""")   
    



Using an example in english to differentiate that language from french...
Accuracy: 96.64916229057265%
Using an example in english to differentiate that language from italian...
Accuracy: 95.72393098274569%
Using an example in english to differentiate that language from german...
Accuracy: 96.72418104526132%
Using an example in french to differentiate that language from english...
Accuracy: 99.32483120780195%
Using an example in french to differentiate that language from italian...
Accuracy: 80.32008002000501%
Using an example in french to differentiate that language from german...
Accuracy: 92.07301825456365%
Using an example in italian to differentiate that language from english...
Accuracy: 85.4463615903976%
Using an example in italian to differentiate that language from french...
Accuracy: 76.69417354338584%
Using an example in italian to differentiate that language from german...
Accuracy: 89.59739934983746%
Using an example in german to differentiate that language from english...
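
The accuracy computation above skips index 0 because that is the one example whose label is known. A minimal sketch of the same calculation on toy data follows; the helper name `accuracy_excluding_anchor` and the toy labels are illustrative and not part of the notebook.

```python
def accuracy_excluding_anchor(labels, predictions):
    """Fraction of correct predictions, ignoring index 0 (the known example)."""
    pairs = list(zip(labels, predictions))[1:]
    return sum(label == pred for label, pred in pairs) / len(pairs)


# Toy data: index 0 is the labelled anchor and is left out of the score.
toy_labels      = [1, 1, 1, 0, 0]
toy_predictions = [1, 1, 0, 0, 0]

toy_accuracy = accuracy_excluding_anchor(toy_labels, toy_predictions)
print(toy_accuracy)  # 0.75, i.e. 3 of the 4 scored examples are correct
```

Assuming `sample_dataset` returns exactly `num_samples` messages per language, each run scores 2 × 2000 − 1 = 3999 predictions. The first reported figure, 96.649...%, is then consistent with 3865 of 3999 correct.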

### Playing with hyperparameters

In the above example, we've seen that the model works well when given 2,000 messages of length 10 in one language, and with the 