In [26]:
import pandas as pd
import requests

## Audio Sentence Processing
*Note:* Using `sentence_id` as the index causes a problem.

The downloaded vocab list should include English translations. When a sentence in the target language (German) has multiple English translations, Tatoeba creates one row per translation. For example, if the sentence is "Tom bleibt bei uns." and it has two translations, "Tom stays with us." and "Tom will stay with us.", you get two rows that differ only in the last field, `translation`.

As a result, an index built from `sentence_id` contains duplicate values.
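
A minimal sketch of the duplicate-index problem and one possible workaround, using toy rows copied from the example sentence. The " / " separator and the choice to collapse translations into a single row are assumptions for illustration, not something the rest of the notebook does.

```python
import pandas as pd

# Toy rows mirroring the vocab list layout; the real data comes from the downloaded file.
rows = pd.DataFrame(
    {
        "sentence_id": [2776108, 2776108, 1907195],
        "text": ["Tom bleibt bei uns.", "Tom bleibt bei uns.", "Es war ein Geschenk."],
        "translation": ["Tom stays with us.", "Tom will stay with us.", "It was a gift."],
    }
).set_index("sentence_id")

# The index is not unique because one sentence has two translations.
has_duplicates = not rows.index.is_unique

# Possible workaround (an assumption): keep one row per sentence and
# join the alternative translations with " / ".
collapsed = rows.groupby(level="sentence_id", sort=False).agg(
    {"text": "first", "translation": " / ".join}
)
```

After collapsing, `collapsed.index.is_unique` holds, so later lookups by `sentence_id` return a single row.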

In [32]:
ingest = pd.read_table('sentences_with_audio.csv',
                       names=['sentence_id',
                              'audio_id',
                              'username',
                              'license',
                              'attribution_url'],
                      index_col='sentence_id')

In [33]:
ingest

sentence_id,audio_id,username,license,attribution_url
61,1,fucongcong,,
68,2,fucongcong,,
78,754915,mramosch,,
85,566395,driini,CC BY-NC 4.0,https://tatoeba.org/deu/user/profile/driini
88,592881,driini,CC BY-NC 4.0,https://tatoeba.org/deu/user/profile/driini
...,...,...,...,...
12858087,1238345,PaulP,CC BY-NC 4.0,
12865980,1239468,PaulP,CC BY-NC 4.0,
12867905,1239469,PaulP,CC BY-NC 4.0,
12875115,1239470,PaulP,CC BY-NC 4.0,


In [34]:
VOCAB_LIST_FILEPATH = 'vocab_basket.tsv'

In [35]:
vocab_list = pd.read_table(VOCAB_LIST_FILEPATH,
                           names=['sentence_id',
                                  'text',
                                  'translation'],
                          index_col='sentence_id')

In [36]:
vocab_list

sentence_id,text,translation
1729338,Ich nehme Geschenke an.,I accept gifts.
1907195,Es war ein Geschenk.,It was a gift.
2776108,Tom bleibt bei uns.,Tom stays with us.
2776108,Tom bleibt bei uns.,Tom will stay with us.
6960575,Tom akzeptierte mein Geschenk.,Tom accepted my present.
7636008,Tom schickte mir ein Geschenk.,Tom sent me a present.


In [37]:
audios = ingest.join(vocab_list, how='inner')

In [38]:
audios

sentence_id,audio_id,username,license,attribution_url,text,translation
1907195,87415,gretelen,CC BY-NC 4.0,,Es war ein Geschenk.,It was a gift.
2776108,166329,Yeti,CC BY 4.0,,Tom bleibt bei uns.,Tom stays with us.
2776108,166329,Yeti,CC BY 4.0,,Tom bleibt bei uns.,Tom will stay with us.
6960575,484943,moskytoo,CC BY-NC 4.0,,Tom akzeptierte mein Geschenk.,Tom accepted my present.
7636008,757809,mramosch,,,Tom schickte mir ein Geschenk.,Tom sent me a present.


## Get the audio files

In [40]:
AUDIO_FOLDER_PATH = './vocab_basket_audio'

In [41]:
audio_url_template = 'https://tatoeba.org/audio/download/{0}'

In [44]:
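import os

os.makedirs(AUDIO_FOLDER_PATH, exist_ok=True)  # create the target folder if it is missing

# unique() avoids downloading the same audio twice for sentences with several translations
for audio_id in audios['audio_id'].unique():
    request_url = audio_url_template.format(audio_id)
    mp3data_request = requests.get(request_url, timeout=30)
    mp3data_request.raise_for_status()  # fail on HTTP errors instead of saving an error page
    mp3_filepath = f'{AUDIO_FOLDER_PATH}/{audio_id}.mp3'
    with open(mp3_filepath, 'wb') as mp3file:
        mp3file.write(mp3data_request.content)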

## Anki Card Generation
I'll do this *not* through `genanki`, but by generating text data that the Anki importer can read.
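
A rough sketch of what that text data could look like: a tab-separated file with one note per line. The function name `anki_import_text` and the field order (German text, translation, audio) are assumptions. The `[sound:...]` tag is how Anki references a media file from a field, which presumes the downloaded MP3s are also copied into the collection's media folder.

```python
import csv
import io


def anki_import_text(records):
    """Return tab-separated text, one line per (audio_id, text, translation) record.

    Hypothetical helper: the field order is an assumption and must match
    the note type selected in the Anki import dialog.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for audio_id, text, translation in records:
        # File names follow the '{audio_id}.mp3' pattern used when downloading.
        writer.writerow([text, translation, f"[sound:{audio_id}.mp3]"])
    return buffer.getvalue()


# Sample records taken from the joined table above.
sample = [
    (87415, "Es war ein Geschenk.", "It was a gift."),
    (166329, "Tom bleibt bei uns.", "Tom stays with us."),
]
anki_text = anki_import_text(sample)
```

Writing `anki_text` to a `.txt` file with UTF-8 encoding should give a file the importer can read, with the tab as the field separator.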