# Get field data for Nicos Weg A1 cards
[This Anki deck for Nicos Weg](https://ankiweb.net/shared/info/52409495) is great, but I want to get the field data out of it so that I can customize the cards further.

- [Original deck github](https://github.com/brkhrdt/dw_anki)

- [Reddit post mentioning the deck](https://www.reddit.com/r/German/comments/awnq5q/anki_flashcards_for_dw_nicos_weg/)

In [60]:
import csv
import pprint
import re

In [4]:
deck_export = 'DW_Nicos_Weg_A1.txt'

In [6]:
deck_data = []

In [7]:
with open(deck_export, encoding='utf-8', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for _ in range(3):
        next(reader)  # skip metadata
    for row in reader:
        deck_data.append(row)
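
The fixed `range(3)` skip only works if the export starts with exactly three metadata lines. Here is a minimal sketch of a more tolerant variant. It assumes (I have not checked this against the export) that every metadata line starts with `#` and that no note field begins a line with `#`. `read_notes` and the sample text are made up for illustration:

```python
import csv
import io


def read_notes(lines):
    """Parse tab-separated note rows, skipping lines that start with '#'."""
    data_lines = (line for line in lines if not line.startswith('#'))
    return list(csv.reader(data_lines, delimiter='\t'))


# Made-up sample in the assumed layout; for the real export, pass the file
# object from open(deck_export, encoding='utf-8', newline='') instead.
sample = io.StringIO(
    '#separator:tab\n'
    '#html:true\n'
    '#tags column:3\n'
    'also\t[sound:BAKU_A1_auch_dwdownload.mp3]auch\thallo\n'
)
rows = read_notes(sample)
print(rows)
```

If a quoted field ever spans several lines and one of those lines starts with `#`, this filter would break it, so the fixed skip may still be the safer choice for this deck.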

In [8]:
deck_data

[['also', '[sound:BAKU_A1_auch_dwdownload.mp3]auch', 'hallo'],
 ['example',
  '[sound:BAKU_A1_Beispiel_dwdownload.mp3]das Beispiel, die Beispiele ',
  'hallo'],
 ['thanks',
  '[sound:BAKU_A1_danke_dwdownload.mp3]danke <br><small><i>alternativ: danke schön / danke sehr</i></small>',
  'hallo'],
 ['I’m doing well.',
  "[sound:BAKU_A1_mir_gehts_gut_dwdownload.mp3]Mir geht's gut.<br><br>[sound:BAKU_A1_es_geht_mir_gut_dwdownload.mp3]Es geht mir gut.",
  'hallo'],
 ["It is 9 o'clock.",
  '[sound:A1_E7_L1_S14_A1_Loesungsaudio_dwdownload.mp3]Es ist 09:00 Uhr.',
  'hallo'],
 ['Ms/Mrs',
  '[sound:BAKU_Frau_dwdownload.mp3]Frau <br><small><i>hier nur Singular, ohne Artikel</i></small>',
  'hallo'],
 ['woman<br><img src="39539529_507.jpg" width="50%" height="50%">',
  '[sound:BAKU_A1_Frau_dwdownload.mp3]die Frau, die Frauen',
  'hallo'],
 ['well;\xa0good',
  '[sound:BAKU_A1_gut_dwdownload.mp3]gut <br><small><i>besser, am besten</i></small>',
  'hallo'],
 ['Good evening.',
  '[sound:BAKU_A1_Guten_Ab

### REs

`(\[[a-zA-Z0-9:_.]+\])` matches the sound file identifier, e.g. `[sound:BAKU_A1_auch_dwdownload.mp3]`.

`(\[[a-zA-Z0-9:_.]+\])([^<>]+(<br>)*)` seems to match the sound tag together with the German text that follows it.

Try this test string at [regex101](https://regex101.com/) to see what I'm talking about:
```
[sound:BAKU_A1_Tuete_dwdownload.mp3]die Tüte, die Tüten<br><br>[sound:BAKU_A1_Tasche_dwdownload.mp3]die Tasche, die Taschen
```
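
The same check can be run in Python instead of regex101. This is only a sketch of the pattern above applied to that test string; the name `SOUND_AND_TEXT_RE` is mine:

```python
import re

# The sound tag is group 1; the German text plus any trailing <br> is group 2.
SOUND_AND_TEXT_RE = re.compile(r'(\[[a-zA-Z0-9:_.]+\])([^<>]+(<br>)*)')

test_str = ('[sound:BAKU_A1_Tuete_dwdownload.mp3]die Tüte, die Tüten<br><br>'
            '[sound:BAKU_A1_Tasche_dwdownload.mp3]die Tasche, die Taschen')

pairs = [(m.group(1), m.group(2)) for m in SOUND_AND_TEXT_RE.finditer(test_str)]
for sound, text in pairs:
    print(sound, '->', repr(text))
```

Note that the character class has no `-`, so a sound file name containing a hyphen would not match this version of the pattern.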

To match the `<small>` tags: `<small>.+?<\/small>`

In [14]:
test_str = '[sound:BAKU_A1_buchstabieren_dwdownload.mp3](etwas) buchstabieren <br><small><i>buchstabiert, buchstabierte, hat buchstabiert</i></small><br>[sound:BAKU_A1_heissen_dwdownload.mp3]heißen <br><small><i>heißt, hieß, hat geheißen</i></small>'
re.findall(r'(<small>.+?<\/small>)', test_str)

['<small><i>buchstabiert, buchstabierte, hat buchstabiert</i></small>',
 '<small><i>heißt, hieß, hat geheißen</i></small>']

### Search for notes with more than one `<small>` tag

In [23]:
small_tags = []
idxs = []  # (note index, number of <small> matches) for notes with more than one
small_tags_re = r'<small>.+?<\/small>'
for i, sublist in enumerate(deck_data):
    matches = re.findall(small_tags_re, sublist[1])
    if len(matches) > 1:
        small_tags.append(sublist[1])
        # may as well record the index and the match count too
        idxs.append((i, len(matches)))

In [24]:
small_tags

['[sound:BAKU_A1_uni_dwdownload.mp3]die Uni, die Unis <br><small><i>Kurzform von: Universität</i></small><br><br>[sound:BAKU_A1_Universitaet_dwdownload.mp3]die Universität, die Universitäten  <br><small><i>Kurzform: die Uni, die Unis</i></small>',
 '[sound:BAKU_A1_Passnummer_dwdownload.mp3]die Passnummer, die Passnummern <br><small><i>Kurzform von: Reisepassnummer</i></small><br><br>[sound:BAKU_A1_Reisepassnummer_dwdownload.mp3]die Reisepassnummer, die Reisepassnummern  <br><small><i>Kurzform: die Passnummer, die Passnummern</i></small>',
 '[sound:BAKU_A1_Limonade_dwdownload.mp3]die Limonade, die Limonaden <br><small><i>Kurzform: die Limo, die Limos</i></small><br><br>[sound:A1_E3_L2_S7_A5_Audio1_dwdownload.mp3]die Limo, die Limos  <br><small><i>Abkürzung für: Limonade</i></small>',
 '[sound:BAKU_A2_kuehl_dwdownload.mp3]kühl <br><small><i>kühler, am kühlsten</i></small><br><br>[sound:BAKU_A1_cool_dwdownload.mp3]cool <br><small><i>cooler, am coolsten; aus dem Englischen</i></small>',
 '

In [25]:
idxs

[(63, 2),
 (258, 2),
 (298, 2),
 (402, 2),
 (511, 2),
 (607, 2),
 (661, 2),
 (1086, 2),
 (1108, 2),
 (1268, 2),
 (1577, 2)]

This is probably the most complicated HTML in the deck:
```html
[sound:BAKU_A1_uni_dwdownload.mp3]die Uni, die Unis <br><small><i>Kurzform von: Universität</i></small><br><br>[sound:BAKU_A1_Universitaet_dwdownload.mp3]die Universität, die Universitäten  <br><small><i>Kurzform: die Uni, die Unis</i></small>
```

It seems like:

1. There are never more than two entries.
2. The first entry always ends with nothing, `<br><br>`, or a `<small>` tag
3. If there is a second entry, the first entry always ends with `<br><br>`
4. If there is a second entry, it always ends with `<small>` tags or nothing.

I think this does it:
```
(\[[a-zA-Z0-9:_.\-]+\])(.+?)(<br><br>|<\/small><br><br>|<\/small>|$)
```
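Each match has 3 groups: the first is the sound tag (not a URL), the second is the vocab term, and the third is whatever terminates the term: `<br><br>`, `</small><br><br>`, `</small>`, or the end of the string. Because group 2 is lazy, the text of a `<small>` note ends up in group 2 and only its closing `</small>` lands in group 3.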
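
As a quick sanity check, here is a sketch that runs this pattern on the Uni/Universität card shown above and prints what each group captures:

```python
import re

search_re = r'(\[[a-zA-Z0-9:_.\-]+\])(.+?)(<br><br>|<\/small><br><br>|<\/small>|$)'

card_back = (
    '[sound:BAKU_A1_uni_dwdownload.mp3]die Uni, die Unis '
    '<br><small><i>Kurzform von: Universität</i></small><br><br>'
    '[sound:BAKU_A1_Universitaet_dwdownload.mp3]die Universität, die Universitäten  '
    '<br><small><i>Kurzform: die Uni, die Unis</i></small>'
)

matches = re.findall(search_re, card_back)
for soundfile, vocab, rest in matches:
    print('sound:', soundfile)
    print('vocab:', repr(vocab))
    print('rest: ', repr(rest))
```

Joining the three groups of every match gives back the original string, so nothing on this card is dropped.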

### Search for cards with no sound

In [46]:
no_sound = []
for i, sublist in enumerate(deck_data):
    if 'sound:' not in sublist[1]:
        no_sound.append(i)

In [47]:
no_sound

[525, 691, 1052, 1294, 1331, 1383]

In [55]:
deck_data[691]

['soccer',
 '(der) Fußball <br><small><i>nur Singular, selten mit Artikel</i></small>',
 'am-sonntag-koche-ich']

### Some cards have no sound
What to do...

Let's just remove the no-sound cards from the deck for now because there are only 6 of them. We can label them as "no sound" and put them back into the deck.

In [52]:
def has_sound(card):
    """Return True if the back of the card contains a [sound:...] tag."""
    return 'sound:' in card[1]

In [53]:
cards_with_sound = [card for card in deck_data if has_sound(card)]
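
For the "label them and put them back" part, a possible sketch. It assumes the third column is Anki's space-separated tag string; the tag name `no_sound` and the helper `label_no_sound` are hypothetical, and the two sample rows are copied from the deck output above:

```python
def has_sound(card):
    return 'sound:' in card[1]


def label_no_sound(card, tag='no_sound'):
    """Return a copy of the card with `tag` added to its space-separated tags."""
    front, back, tags = card
    return [front, back, f'{tags} {tag}'.strip()]


deck_sample = [
    ['also', '[sound:BAKU_A1_auch_dwdownload.mp3]auch', 'hallo'],
    ['soccer',
     '(der) Fußball <br><small><i>nur Singular, selten mit Artikel</i></small>',
     'am-sonntag-koche-ich'],
]

# Keep the removed cards around, labeled, so they can be re-added.
cards_without_sound = [label_no_sound(c) for c in deck_sample if not has_sound(c)]
print(cards_without_sound)
```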

In [54]:
cards_with_sound

[['also', '[sound:BAKU_A1_auch_dwdownload.mp3]auch', 'hallo'],
 ['example',
  '[sound:BAKU_A1_Beispiel_dwdownload.mp3]das Beispiel, die Beispiele ',
  'hallo'],
 ['thanks',
  '[sound:BAKU_A1_danke_dwdownload.mp3]danke <br><small><i>alternativ: danke schön / danke sehr</i></small>',
  'hallo'],
 ['I’m doing well.',
  "[sound:BAKU_A1_mir_gehts_gut_dwdownload.mp3]Mir geht's gut.<br><br>[sound:BAKU_A1_es_geht_mir_gut_dwdownload.mp3]Es geht mir gut.",
  'hallo'],
 ["It is 9 o'clock.",
  '[sound:A1_E7_L1_S14_A1_Loesungsaudio_dwdownload.mp3]Es ist 09:00 Uhr.',
  'hallo'],
 ['Ms/Mrs',
  '[sound:BAKU_Frau_dwdownload.mp3]Frau <br><small><i>hier nur Singular, ohne Artikel</i></small>',
  'hallo'],
 ['woman<br><img src="39539529_507.jpg" width="50%" height="50%">',
  '[sound:BAKU_A1_Frau_dwdownload.mp3]die Frau, die Frauen',
  'hallo'],
 ['well;\xa0good',
  '[sound:BAKU_A1_gut_dwdownload.mp3]gut <br><small><i>besser, am besten</i></small>',
  'hallo'],
 ['Good evening.',
  '[sound:BAKU_A1_Guten_Ab

### Try the search on cards with sound

In [92]:
search_re = r'(\[[a-zA-Z0-9:_.\-]+\])(.+?)(<br><br>|<\/small><br><br>|<\/small>|$)'
with_fields = {}
for sublist in cards_with_sound:
    card_front = sublist[0]
    card_back = sublist[1]
    with_fields[card_front] = []
    matches = re.findall(search_re, card_back)
    assert matches, card_back  # every card with sound should match at least once
    for match in matches:
        # Group 1 is the sound tag,
        # group 2 is the vocab term,
        # group 3 is whatever terminates the term:
        # <br><br>, </small><br><br>, </small>, or nothing.
        soundfile, vocab, rest = match
        fields = {'soundfile': soundfile,
                  'term': vocab + rest}
        with_fields[card_front].append(fields)
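
One thing to watch: `with_fields` is keyed by the card front, so if two cards share the same English front, the later card replaces the earlier one. I have not checked whether that happens in this deck. A minimal sketch for finding such fronts (the helper name and the sample rows are made up):

```python
from collections import Counter


def duplicate_fronts(cards):
    """Map each front that appears more than once to its number of occurrences."""
    counts = Counter(card[0] for card in cards)
    return {front: n for front, n in counts.items() if n > 1}


# Made-up rows in the same [front, back, tags] layout as deck_data.
sample_cards = [
    ['woman', '[sound:a.mp3]first back', 'hallo'],
    ['Ms/Mrs', '[sound:b.mp3]second back', 'hallo'],
    ['woman', '[sound:c.mp3]third back', 'hallo'],
]
print(duplicate_fronts(sample_cards))
```

Running `duplicate_fronts(cards_with_sound)` before building `with_fields` would show whether any cards get lost.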


In [93]:
pprint.pp(with_fields)

{'also': [{'soundfile': '[sound:BAKU_A1_auch_dwdownload.mp3]', 'term': 'auch'}],
 'example': [{'soundfile': '[sound:BAKU_A1_Beispiel_dwdownload.mp3]',
              'term': 'das Beispiel, die Beispiele '}],
 'thanks': [{'soundfile': '[sound:BAKU_A1_danke_dwdownload.mp3]',
             'term': 'danke <br><small><i>alternativ: danke schön / danke '
                     'sehr</i></small>'}],
 'I’m doing well.': [{'soundfile': '[sound:BAKU_A1_mir_gehts_gut_dwdownload.mp3]',
                      'term': "Mir geht's gut.<br><br>"},
                     {'soundfile': '[sound:BAKU_A1_es_geht_mir_gut_dwdownload.mp3]',
                      'term': 'Es geht mir gut.'}],
 "It is 9 o'clock.": [{'soundfile': '[sound:A1_E7_L1_S14_A1_Loesungsaudio_dwdownload.mp3]',
                       'term': 'Es ist 09:00 Uhr.'}],
 'Ms/Mrs': [{'soundfile': '[sound:BAKU_Frau_dwdownload.mp3]',
             'term': 'Frau <br><small><i>hier nur Singular, ohne '
                     'Artikel</i></small>'}],
 'woman<b

### Add the tags
Append each card's tags as the last item of its list in `with_fields`.

In [94]:
for sublist in cards_with_sound:
    card_front = sublist[0]
    card_tags = sublist[2]
    with_fields[card_front].append(card_tags)

In [95]:
pprint.pp(with_fields)

{'also': [{'soundfile': '[sound:BAKU_A1_auch_dwdownload.mp3]', 'term': 'auch'},
          'hallo'],
 'example': [{'soundfile': '[sound:BAKU_A1_Beispiel_dwdownload.mp3]',
              'term': 'das Beispiel, die Beispiele '},
             'hallo'],
 'thanks': [{'soundfile': '[sound:BAKU_A1_danke_dwdownload.mp3]',
             'term': 'danke <br><small><i>alternativ: danke schön / danke '
                     'sehr</i></small>'},
            'hallo'],
 'I’m doing well.': [{'soundfile': '[sound:BAKU_A1_mir_gehts_gut_dwdownload.mp3]',
                      'term': "Mir geht's gut.<br><br>"},
                     {'soundfile': '[sound:BAKU_A1_es_geht_mir_gut_dwdownload.mp3]',
                      'term': 'Es geht mir gut.'},
                     'hallo'],
 "It is 9 o'clock.": [{'soundfile': '[sound:A1_E7_L1_S14_A1_Loesungsaudio_dwdownload.mp3]',
                       'term': 'Es ist 09:00 Uhr.'},
                      'hallo'],
 'Ms/Mrs': [{'soundfile': '[sound:BAKU_Frau_dwdownload.mp3]',

### Export the fields for Anki import
This needs the `with_fields` variable from above.

In [98]:
with open('NW_A1_front_and_fields.tsv', 'w', newline='', encoding='utf-8') as f:
    # The 'excel-tab' dialect already uses a tab delimiter.
    # Not sure whether its quoting rules will mess things up on import;
    # maybe sniff the input file instead.
    writer = csv.writer(f, dialect='excel-tab')
    first_row = ['English', 'Term1', 'Term2', 'Term1_Audio', 'Term2_Audio', 'tags']
    writer.writerow(first_row)
    for english_term in with_fields:
        german_terms = with_fields[english_term]
        term1_audio = german_terms[0]['soundfile']
        term1_german = german_terms[0]['term']
        tags = german_terms[-1]  # tags should be the last list item
        if len(german_terms) > 2:
            # there's a second term
            term2_audio = german_terms[1]['soundfile']
            term2_german = german_terms[1]['term']
            row = [english_term, term1_german, term2_german, term1_audio, term2_audio, tags]
        else:
            row = [english_term, term1_german, '', term1_audio, '', tags]
        writer.writerow(row)
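
On the dialect worry: with `excel-tab`, a field that contains a double quote (such as the `<img src="...">` on some card fronts) is written wrapped in quotes, with the inner quotes doubled. A small in-memory sketch of that round trip, using a shortened, made-up row:

```python
import csv
import io

row = ['woman<br><img src="39539529_507.jpg">',
       'die Frau, die Frauen', '',
       '[sound:BAKU_A1_Frau_dwdownload.mp3]', '', 'hallo']

# Write one row to an in-memory buffer instead of a file.
buf = io.StringIO(newline='')
csv.writer(buf, dialect='excel-tab').writerow(row)
written = buf.getvalue()
print(written)

# Reading it back with the same dialect restores the original fields.
read_back = next(csv.reader(io.StringIO(written, newline=''), dialect='excel-tab'))
print(read_back == row)
```

Whether the import side handles the doubled quotes the same way is worth checking on a few cards before importing the whole file.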


### Clean up the front of the cards -- TODO
Separate the HTML from the front of the cards.

Assumptions:

The front of a card:

1. Always starts with English text (a word, phrase, or sentence).
2. Sometimes ends with HTML, which is always an image.
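
A first sketch for this TODO, under exactly those assumptions (English text first, optionally followed by `<br>` and a single `<img>` tag at the very end). `FRONT_RE` and `split_front` are hypothetical names, and the pattern has only been tried on the fronts shown above:

```python
import re

# Lazy text, then an optional trailing "<br>...<img ...>" block at the end.
FRONT_RE = re.compile(r'^(?P<text>.*?)(?P<html>(?:<br>)*<img\b[^>]*>)?$', re.DOTALL)


def split_front(front):
    """Split a card front into (english_text, trailing_image_html)."""
    m = FRONT_RE.match(front)
    return m.group('text').strip(), m.group('html') or ''


for front in ['also',
              'woman<br><img src="39539529_507.jpg" width="50%" height="50%">']:
    print(split_front(front))
```

Fronts with HTML anywhere other than at the end would come back unsplit, which makes them easy to spot.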