This notebook tests the effect of the edit distance threshold on the final transliteration model.

My hypothesis is that the higher the permitted edit distance, the higher the validation loss will be.

It covers every step from pre-processing the data to saving a model that can be reused later. It also performs some basic evaluation, namely plotting the training and validation loss.

It works best on Japanese ('ja'), because the Twitter data used contains a higher proportion of Japanese Tweets than of any other non-English language. The standard transliteration used when evaluating candidate name pairs also works very well for Japanese.
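To make the role of the threshold concrete, here is a minimal sketch of a length-normalized Levenshtein distance. The notebook does not show how `Cleanse` applies `edit_threshold`, so this is an assumption: the thresholds swept below range from 0 to 1, which suggests a normalized distance, and a name pair is presumably kept when its distance is at or below the threshold. The names `levenshtein`, `normalized_edit_distance` and `keep_pair` are hypothetical and are not part of `name_transliteration`.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer length, so the result lies in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def keep_pair(candidate: str, reference: str, edit_threshold: float) -> bool:
    """Assumed filtering rule: keep the pair if it is within the threshold."""
    return normalized_edit_distance(candidate, reference) <= edit_threshold


# "tanaka" vs "tanka" differ by one deletion over a maximum length of 6.
print(normalized_edit_distance("tanaka", "tanka"))
print(keep_pair("tanaka", "tanka", edit_threshold=0.2))
```

Under this assumption, a threshold of 0 keeps only exact matches, while a threshold of 1 keeps every pair, including pairs with no characters in common.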

In [1]:
language = 'ja'

In [2]:
# Alias the module as `filtering` so it does not shadow the built-in `filter`.
import name_transliteration.filtering as filtering

my_filter = filtering.Filter(language)
my_filter.filterData("./data/")

my_filter.saveDataAsText()

./data/stream-2021-01-13T01:21:29.804195.gz
./data/stream-2021-01-12T23:08:30.828340.gz
./data/stream-2021-01-13T00:02:22.807571.gz
./data/stream-2021-01-13T02:14:13.914215.gz
./data/stream-2021-01-13T03:09:29.015229.gz
./data/stream-2021-01-13T00:55:27.831486.gz
./data/stream-2021-01-12T23:35:55.813786.gz
./data/stream-2021-01-13T00:28:46.798948.gz
./data/stream-2021-01-13T01:47:06.536491.gz
./data/stream-2021-01-13T02:42:07.071964.gz


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['screen_name'] = df['screen_name'].apply(self.removeNonLanguageCharacters)


Saving filtered names. 140052 number of rows. 
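The pandas message in the output above is a `SettingWithCopyWarning`: a column was assigned on a DataFrame that was itself derived from another DataFrame by slicing. The sketch below shows one common way to avoid it, by taking an explicit `.copy()` before assigning. The internals of `Filter` are not shown in this notebook, so the sample data and the helper `remove_non_ascii` (a stand-in for `removeNonLanguageCharacters`) are hypothetical.

```python
import pandas as pd


def remove_non_ascii(text: str) -> str:
    # Hypothetical stand-in for the library's removeNonLanguageCharacters.
    return "".join(ch for ch in text if ch.isascii())


raw = pd.DataFrame({
    "screen_name": ["tanaka_太郎", "yuki123", "bot"],
    "lang": ["ja", "ja", "en"],
})

# Boolean indexing returns a derived DataFrame. Calling .copy() makes it an
# independent object, so the assignment below is unambiguous and does not warn.
df = raw[raw["lang"] == "ja"].copy()
df["screen_name"] = df["screen_name"].apply(remove_non_ascii)
```

The original `raw` frame is left untouched, and only the copied subset is modified.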


In [None]:
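import name_transliteration.cleansing as cleanse
import name_transliteration.model_trainer as model_trainer

# Edit distance thresholds to sweep.
edit_thresholds = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]

for threshold in edit_thresholds:
    # Cleanse the filtered names using the current threshold and save them.
    my_cleanser = cleanse.Cleanse(my_filter.getDataFrame(), edit_threshold=threshold)
    my_cleanser.cleanseData()
    my_cleanser.saveDataAsText()

    # Bind the trainer instance to its own name: reassigning `model_trainer`
    # would shadow the module and break the second iteration of the loop.
    data_path = './ja_' + str(int(threshold * 10)) + '_edit_distance_language_cleansed.txt'
    trainer = model_trainer.ModelTrainer(language=language, data_path=data_path, epochs=50)
    trainer.runWholeTrainProcess()

    # Save one loss plot and one accuracy plot per threshold.
    trainer.plotLoss(file_name=str(threshold) + '_loss.png')
    trainer.plotAccuracy(file_name=str(threshold) + '_accuracy.png')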

Saving cleansed names as: ja_0_edit_distance_language_cleansed.txt 4730 number of rows. 
Number of samples: 4730
Number of unique input tokens: 24
Number of unique output tokens: 871
Max sequence length for inputs: 15
Max sequence length for outputs: 11
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50