In [63]:
import csv
import pandas as pd
import requests

## Audio Sentence Processing
*Note:* There is an issue with using `sentence_id` as the index.

The downloaded vocab list should include English translations. When a sentence in the target language (German) has multiple English translations, Tatoeba creates one row per translation. For example, "Tom bleibt bei uns." has two translations, "Tom stays with us." and "Tom will stay with us.", so it yields two rows that differ only in the last field, `translation`.

The issue is that if `sentence_id` is the index, the index will contain duplicate values.

*Also,* there are duplicate sentence ids in the `sentences_with_audio` data, presumably because some sentences have more than one recording. I'm going to proceed as if they were unique: I'm adding sentences to the vocab list manually, so I'll notice if I add one with more than one recording.
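
A quick way to check for the duplicates described above, sketched on a toy frame. The `vocab` frame here is a hand-built stand-in for the real vocab list, not the downloaded file, and `dupe_mask` and `duplicated_ids` are illustrative names.

```python
import pandas as pd

# Toy stand-in for the vocab list; the real data comes from vocab_basket.tsv.
vocab = pd.DataFrame({
    'sentence_id': [1907195, 2776108, 2776108],
    'text': ['Es war ein Geschenk.',
             'Tom bleibt bei uns.',
             'Tom bleibt bei uns.'],
    'translation': ['It was a gift.',
                    'Tom stays with us.',
                    'Tom will stay with us.'],
})

# keep=False marks every row whose sentence_id occurs more than once.
dupe_mask = vocab['sentence_id'].duplicated(keep=False)
duplicated_ids = sorted(vocab.loc[dupe_mask, 'sentence_id'].unique().tolist())
print(duplicated_ids)

# An index built from sentence_id would therefore not be unique.
print(vocab.set_index('sentence_id').index.is_unique)
```

The same check should work on `ingest` to see how many sentences have more than one recording.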

In [124]:
ingest = pd.read_table('sentences_with_audio.csv',
                       names=['sentence_id',
                              'audio_id',
                              'username',
                              'license',
                              'attribution_url'])

In [125]:
ingest

Unnamed: 0,sentence_id,audio_id,username,license,attribution_url
0,61,1,fucongcong,,
1,68,2,fucongcong,,
2,78,754915,mramosch,,
3,85,566395,driini,CC BY-NC 4.0,https://tatoeba.org/deu/user/profile/driini
4,88,592881,driini,CC BY-NC 4.0,https://tatoeba.org/deu/user/profile/driini
...,...,...,...,...,...
1195633,12858087,1238345,PaulP,CC BY-NC 4.0,
1195634,12865980,1239468,PaulP,CC BY-NC 4.0,
1195635,12867905,1239469,PaulP,CC BY-NC 4.0,
1195636,12875115,1239470,PaulP,CC BY-NC 4.0,


In [126]:
VOCAB_LIST_FILEPATH = 'vocab_basket.tsv'

In [127]:
vocab_list = pd.read_table(VOCAB_LIST_FILEPATH,
                           names=['sentence_id',
                                  'text',
                                  'translation'],)

In [128]:
vocab_list

Unnamed: 0,sentence_id,text,translation
0,1729338,Ich nehme Geschenke an.,I accept gifts.
1,1907195,Es war ein Geschenk.,It was a gift.
2,2776108,Tom bleibt bei uns.,Tom stays with us.
3,2776108,Tom bleibt bei uns.,Tom will stay with us.
4,6960575,Tom akzeptierte mein Geschenk.,Tom accepted my present.
5,7636008,Tom schickte mir ein Geschenk.,Tom sent me a present.


In [129]:
audios = pd.merge(ingest, vocab_list, how='inner', on='sentence_id')

In [130]:
audios

Unnamed: 0,sentence_id,audio_id,username,license,attribution_url,text,translation
0,1907195,87415,gretelen,CC BY-NC 4.0,,Es war ein Geschenk.,It was a gift.
1,2776108,166329,Yeti,CC BY 4.0,,Tom bleibt bei uns.,Tom stays with us.
2,2776108,166329,Yeti,CC BY 4.0,,Tom bleibt bei uns.,Tom will stay with us.
3,6960575,484943,moskytoo,CC BY-NC 4.0,,Tom akzeptierte mein Geschenk.,Tom accepted my present.
4,7636008,757809,mramosch,,,Tom schickte mir ein Geschenk.,Tom sent me a present.


## Get the audio files

In [131]:
AUDIO_FOLDER_PATH = './vocab_basket_audio'

In [132]:
audio_url_template = 'https://tatoeba.org/audio/download/{0}'

In [133]:
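# AUDIO_FOLDER_PATH must already exist.
# Download each recording once, even when its row repeats for several translations.
for audio_id in audios['audio_id'].unique():
    request_url = audio_url_template.format(audio_id)
    mp3data_request = requests.get(request_url, timeout=30)
    # Fail loudly instead of saving an error page as an .mp3 file.
    mp3data_request.raise_for_status()
    mp3data = mp3data_request.content
    mp3_filepath = f'{AUDIO_FOLDER_PATH}/{audio_id}.mp3'
    with open(mp3_filepath, 'wb') as mp3file:
        mp3file.write(mp3data)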

## Anki Card Generation
I'll do this *not* through `genanki` but rather by generating text data that the Anki importer can work with.

TODO: Update this processing pipeline so that we can process a vocab list where not *all* sentences have audio.
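
One possible direction for this TODO, as a rough sketch only: merge with `how='left'` from the vocab side, so sentences without a recording survive the merge with a missing `audio_id`. The frames `ingest_demo` and `vocab_demo` are made-up stand-ins for `ingest` and `vocab_list`, and the rest of the pipeline would still need to handle the missing values.

```python
import pandas as pd

# Toy stand-ins for `ingest` and `vocab_list`; only the columns needed here.
ingest_demo = pd.DataFrame({'sentence_id': [1907195], 'audio_id': [87415]})
vocab_demo = pd.DataFrame({
    'sentence_id': [1729338, 1907195],
    'text': ['Ich nehme Geschenke an.', 'Es war ein Geschenk.'],
})

# how='left' keeps every vocab row; audio_id is NaN where no recording exists.
merged = pd.merge(vocab_demo, ingest_demo, how='left', on='sentence_id')
has_audio = merged['audio_id'].notna()

print(merged[has_audio])   # rows to download audio for
print(merged[~has_audio])  # rows that would get a card without sound
```

Note that the NaN values turn `audio_id` into a float column, so it would need converting back to `int` before building file names.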

We want something that looks like:

sentence_id | sentence_de | audio_de_filename | translation1_en | translation2_en | translation3_en

Here `translation1_en` is never empty, but the following two columns may be empty.
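
A minimal sketch of how a variable-length list of translations could be forced into exactly three columns. The helper `pad_translations` is hypothetical and not part of the pipeline in this notebook.

```python
def pad_translations(translations, width=3):
    """Return exactly `width` items: drop extras, pad the rest with ''."""
    kept = list(translations)[:width]
    return kept + [''] * (width - len(kept))


print(pad_translations(['It was a gift.']))
print(pad_translations(['Tom stays with us.', 'Tom will stay with us.']))
```

Padding with empty strings keeps every row the same length as the header, which may make the import easier to reason about.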

Using `df.loc[label]` where `label` isn't unique in the index returns a `pd.DataFrame`; otherwise it returns a `pd.Series`.
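
A small illustration of that difference, using toy data. The names `rows`, `by_row`, and `grouped` are made up for this sketch; `grouped` is built the same way as the grouped frame used further down.

```python
import pandas as pd

rows = pd.DataFrame({
    'sentence_id': [1907195, 2776108, 2776108],
    'translation': ['It was a gift.',
                    'Tom stays with us.',
                    'Tom will stay with us.'],
})

by_row = rows.set_index('sentence_id')                  # duplicate labels
grouped = rows.groupby('sentence_id').aggregate(list)   # unique labels

repeated = by_row.loc[2776108]   # label occurs twice in the index
single = grouped.loc[2776108]    # label occurs once in the index

print(type(repeated).__name__)
print(type(single).__name__)
print(single.iloc[0])
```

Grouping first makes every lookup return a `pd.Series`, so `.iloc[0]` reliably yields the list of translations.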

In [144]:
translation_xs = audios[['sentence_id', 'translation']].groupby('sentence_id').aggregate(list)

In [148]:
translation_xs

sentence_id,translation
1907195,[It was a gift.]
2776108,"[Tom stays with us., Tom will stay with us.]"
6960575,[Tom accepted my present.]
7636008,[Tom sent me a present.]


In [166]:
output_rows = {}
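# One entry per sentence; repeated rows (one per translation) overwrite
# the entry with identical data.
for i in range(len(audios)):
    audio_row = audios.iloc[i]
    sentence_id = int(audio_row['sentence_id'])
    output_rows[sentence_id] = {'sentence_de': audio_row['text'],
                                'audio_id': int(audio_row['audio_id']),
                                # Keep at most three translations.
                                'translations': translation_xs.loc[sentence_id].iloc[0][:3]}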

In [167]:
output_rows

{1907195: {'sentence_de': 'Es war ein Geschenk.',
  'audio_id': 87415,
  'translations': ['It was a gift.']},
 2776108: {'sentence_de': 'Tom bleibt bei uns.',
  'audio_id': 166329,
  'translations': ['Tom stays with us.', 'Tom will stay with us.']},
 6960575: {'sentence_de': 'Tom akzeptierte mein Geschenk.',
  'audio_id': 484943,
  'translations': ['Tom accepted my present.']},
 7636008: {'sentence_de': 'Tom schickte mir ein Geschenk.',
  'audio_id': 757809,
  'translations': ['Tom sent me a present.']}}

In [177]:
with open('vocab_bucket.tsv', 'w', newline='', encoding='utf-8') as tsvfile:
    tsvwriter = csv.writer(tsvfile, delimiter='\t')
    tsvwriter.writerow([
        'sentence_id',
        'audio_id',
        'sentence_de',
        'translation1_en',
        'translation2_en',
        'translation3_en'])
    for sentence_id in output_rows:
        sentence_de = output_rows[sentence_id]['sentence_de']
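        # Anki plays a media file referenced as [sound:filename].
        # Double quotes outside avoid a syntax error before Python 3.12.
        audio_id = f"[sound:{output_rows[sentence_id]['audio_id']}.mp3]"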
        translations = output_rows[sentence_id]['translations']
        tsv_row = [sentence_id, audio_id, sentence_de] + translations
        tsvwriter.writerow(tsv_row)