# Anki Statistics for "Remembering the Kanji" Flashcards

<img src="images/vocab_relearn.png" alt="Current kanji being relearned." width="48%" align="left">
<img src="images/anki_stats.png" border="1px" width="48%" align="right">    

## Import Section

In [None]:
# Fundamental packages
import pandas as pd
import numpy
import sqlite3
import json
from datetime import datetime, timezone, timedelta
from tabulate import tabulate
import re
import os

# Charts
import plotly.graph_objs as go
import plotly.io as pio

# Widgets to combine charts with interactive controls.
from ipywidgets import HTML, Image, Layout, Button, Label
from ipywidgets import HBox, VBox, Box

# The images used in the flashcards are SVG graphics generated by the Anki "Kanji Colorizer" add-on.
# To display an array of images, the SVG files must be converted to PNG.
import cairosvg

# To cluster the keywords, use pymagnitude to read in the English word vectors.
import pymagnitude

# Utilize the UNIHAN database for looking up information regarding characters.
from cihai.core import Cihai

## Parsing an Anki Database
<img src="images/rtk.jpg" width="200" align="right" style="margin: 0px 20px">

[Anki flashcards](https://apps.ankiweb.net) use spaced repetition to help learn facts presented as a set of flashcards. In this instance, there is a flashcard for each of the 2136 [jōyō kanji](https://en.wikipedia.org/wiki/List_of_jōyō_kanji), and the layout of the cards follows the scheme outlined in the book "Remembering the Kanji" (RTK). In the RTK scheme, each kanji has its own unique keyword (2136 in total) and a story to help visualize the character. The characters are ordered according to their core components to assist memorization.

<img src="images/card_懲.png" width="200" align="left" border="0" style="margin: 0px 0px">

Anki saves data in sqlite databases with a table each for cards, notes, collections, and a review log. This flashcard deck, "Heisig - Remembering The Kanji," also includes an SVG graphic for each character generated from the [KanjiVG](http://kanjivg.tagaini.net/index.html) project.

In [None]:
con = sqlite3.connect("../models/collection.anki2")
df_cards = pd.read_sql_query("SELECT * from cards", con)
df_notes = pd.read_sql_query("SELECT * from notes", con)
df_col = pd.read_sql_query("SELECT * from col", con)
df_revlog = pd.read_sql_query("SELECT * from revlog", con)

In [None]:
# Convert columns of data frame into lookup mapping.
card_note_mapping = dict(df_cards[['id', 'nid']].values)
note_fields_mapping = dict(df_notes[['id', 'flds']].values)

Note: This Anki flashcard deck contains 3008 cards. The first 2136 cards are the [jōyō kanji](https://en.wikipedia.org/wiki/List_of_jōyō_kanji) (from Volume 1 of Remembering the Kanji), and an additional 872 come from Volume 3. (Volume 2 of RTK focuses on strategies for memorizing the multiple pronunciations of each character.)

The [database structure](https://github.com/ankidroid/Anki-Android/wiki/Database-Structure) describes the fields used by Anki. These include
* **id:** the epoch milliseconds of when the card was created
* **nid:** notes id
* **type:** 0=new, 1=learning, 2=due, 3=filtered
* **due:** Due is used differently for different card types: 
     * new: note id or random int
     * due: integer day, relative to the collection's creation time
     * learning: integer timestamp
* **ivl:** interval (used in SRS algorithm). Negative = seconds, positive = days
* **reps:** number of reviews
* **lapses:** the number of times the card went from a "was answered correctly" to "was answered incorrectly" state
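
The sign convention for `ivl` and the epoch-millisecond `id` described above can be decoded with two small helpers. This is only a sketch: the names `describe_interval` and `card_created` are hypothetical and are not part of Anki or of this notebook.

```python
from datetime import datetime, timedelta, timezone

def describe_interval(ivl):
    """Convert an Anki `ivl` value to a timedelta (negative = seconds, positive = days)."""
    if ivl < 0:
        return timedelta(seconds=-ivl)
    return timedelta(days=ivl)

def card_created(card_id):
    """Card ids are epoch milliseconds; return the creation time in UTC."""
    return datetime.fromtimestamp(card_id / 1000.0, timezone.utc)

print(describe_interval(-600))        # a learning step stored in seconds
print(describe_interval(21))          # a review interval stored in days
print(card_created(1560247200000))    # illustrative id, not taken from the deck
```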

In [None]:
print(len(df_cards))
df_cards.head()

In [None]:
# Integer to track cards. Should not be necessary.
df_cards['index'] = list(range(len(df_cards)))

From the Anki desktop application, export the flashcards as an Anki Deck Package (file extension apkg) and include the associated media. The export also contains a `media` file: a JSON table that maps the on-disk filenames of the images to the references used in the notes markup.

In [None]:
with open('/Users/ray/scratch/anki_statistics/models/heisig/media') as fp:
    image_data = json.load(fp)

In [None]:
# Card 909 maps to RTK frame 953 (kanji 懲).
image_data['909']

In [None]:
# Invert the media table: image reference used in the notes -> on-disk filename.
image_file_map = {name: key for key, name in image_data.items()}

In [None]:
def get_rtk(note_id):
    '''
    The notes table contains a column, flds, as one long string. The seven fields for the "Heisig - Remembering The Kanji" flashcard deck
    are Kanji, Keyword, My Story, Stroke Count, Heisig Number, Diagram, and RTK. 
    '''
    fields = note_fields_mapping[note_id].split('\x1f')
    rtk = fields[6]
    return rtk

In [None]:
def get_kanji(note_id):
    '''
    The notes table contains a column, flds, as one long string. The seven fields for the "Heisig - Remembering The Kanji" flashcard deck
    are Kanji, Keyword, My Story, Stroke Count, Heisig Number, Diagram, and RTK. 
    '''
    fields = note_fields_mapping[note_id].split('\x1f')
    kanji = fields[0]
    return kanji

In [None]:
def get_keyword(note_id):
    '''
    The notes table contains a column, flds, as one long string. The seven fields for the "Heisig - Remembering The Kanji" flashcard deck
    are Kanji, Keyword, My Story, Stroke Count, Heisig Number, Diagram, and RTK. 
    '''
    fields = note_fields_mapping[note_id].split('\x1f')
    keyword = fields[1]
    return keyword

In [None]:
def get_diagram_filename(note_id):
    '''
    The notes table contains a column, flds, as one long string. The seven fields for the "Heisig - Remembering The Kanji" flashcard deck
    are Kanji, Keyword, My Story, Stroke Count, Heisig Number, Diagram, and RTK. 
    '''
    fields = note_fields_mapping[note_id].split('\x1f')
    try:
        image_name = fields[5].split('=')[1].split('>')[0].replace('"','')
    except Exception as ex:
        print(fields, ex)
        return None
    
    # Hack to deal with a few malformed image file names.
    if ' /' in image_name:
        image_name = image_name.replace(' /','')
    
    return image_file_map[image_name]

def get_diagram(note_id):
    file_name = get_diagram_filename(note_id)
    with open("/Users/ray/scratch/anki_statistics/models/heisig/" + file_name, "rb") as fp:
        image = fp.read()
        png = cairosvg.svg2png(image, dpi=100)
        
    return png
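
The getters above all repeat the same split-and-index step on `flds`. As a sketch of an alternative, the fields can be labeled once using the seven field names listed in the docstrings. The helper `parse_fields` and the sample note below are hypothetical; the sample contents are made up for illustration.

```python
# Field order as listed in the docstrings above.
FIELD_NAMES = ['Kanji', 'Keyword', 'My Story', 'Stroke Count',
               'Heisig Number', 'Diagram', 'RTK']

def parse_fields(flds):
    """Split the notes.flds string on the unit separator (0x1f) and label each field."""
    return dict(zip(FIELD_NAMES, flds.split('\x1f')))

# Made-up note contents, for illustration only.
sample = '\x1f'.join(['一', 'one', '', '1', '1', '<img src="example.svg">', '1'])
note = parse_fields(sample)
print(note['Kanji'], note['Keyword'], note['RTK'])
```

With this in place, a lookup such as `parse_fields(note_fields_mapping[note_id])['Keyword']` would replace the positional index `fields[1]`.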

In [None]:
# Add new columns to the cards table in order to track the RTK frame number, kanji, and keyword from the textbook.
df_cards['rtk'] = df_cards.nid.map(get_rtk)
df_cards['kanji'] = df_cards.nid.map(get_kanji)
df_cards['keyword'] = df_cards.nid.map(get_keyword)

In [None]:
max_lapses = max(df_cards.lapses)
mask = (df_cards.lapses == max_lapses)
df_cards[mask]

#### KanjiVG -- Generate colorized SVG diagrams for Kanji characters

In [None]:
from src.external.kanjicolorizer.colorizer import (KanjiVG, KanjiColorizer, InvalidCharacterError)

In [None]:
nid = df_cards[df_cards.kanji == '懲'].nid.values[0]
get_diagram_filename(nid)

In [None]:
s = '懲'
# Keep only code points from U+4E00 (19968) to U+9FAF (40879), inside the CJK Unified Ideographs block.
characters_to_colorize = [c for c in s if 0x4E00 <= ord(c) <= 0x9FAF]
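
The numeric bounds in the filter above are easier to read as code points. A minimal sketch, assuming the same range as the cell above; the helper name `is_kanji` and the sample string are hypothetical:

```python
def is_kanji(char, first=0x4E00, last=0x9FAF):
    """True if the character's code point lies in the range used by the filter above."""
    return first <= ord(char) <= last

# Mixed kanji and hiragana; only the kanji should survive the filter.
text = '懲らしめる'
print([c for c in text if is_kanji(c)])
print(hex(19968), hex(40879))
```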

In [None]:
KanjiVG('懲').ascii_filename

Configuration parameters used by Anki kanjicolorizer addon:

    {
        "group-mode": true,
        "image-size": 327,
        "mode": "spectrum",
        "saturation": 0.95,
        "value": 0.75
    }

In [None]:
config = "--mode "
config += "spectrum "
config += " --group-mode "
config += " --saturation "
config += "0.95 "
config += " --value "
config += "0.75 "
config += " --image-size "
config += "327 "
kc = KanjiColorizer(config)

In [None]:
char_svg = kc.get_colored_svg('懲').encode('utf_8')
png = cairosvg.svg2png(char_svg)
display(Image(value=png, layout=Layout(width='100px', height='100px')))

In [None]:
display(df_cards[df_cards.kanji == '懲'])
nid = df_cards[df_cards.kanji == '懲'].nid.values[0]
print(get_diagram_filename(nid))
with open("/Users/ray/scratch/anki_statistics/models/heisig/909", "rb") as fp:
    image = fp.read()
    png = cairosvg.svg2png(image)
    display(Image(value=png, layout=Layout(width='100px', height='100px')))

In [None]:
# The image_widget variable displays the current character when hovered over with the cursor.
nid = df_cards[df_cards.kanji == '懲'].nid.values[0]
png = get_diagram(nid)
image_widget = Image(value=png, layout=Layout(width='100px', height='100px'))

#### Diagnose mismatch in sum of reps vs. total reviews.

In [None]:
total_reps = df_cards.reps.sum()
total_reviews = len(df_revlog)
print(total_reps, total_reviews)

# The sum of the reps recorded in df_cards doesn't match what the Anki statistics UI reports.
# Check that no reps are recorded on cards that lack a Remembering the Kanji index (i.e. cards not in the first book).
mask = (df_cards.rtk == '') & (df_cards.reps > 0)
df_cards[mask]

In [None]:
# Card deck has 3008 total cards (volumes 1 & 3 of RTK). In the card deck (as of 11 June 2019), only the volume 1 cards have the tag RTK.
# Going to clobber the cards that do not have an RTK tag.
print(len(df_cards))

In [None]:
indexNames = df_cards[ df_cards.rtk == '' ].index
df_cards.drop(indexNames , inplace=True)

In [None]:
print(len(df_cards))

## Chart to Display Review Counts

In [None]:
def hover_fn(trace, points, state):
    '''
    For scatter plots, hovering over a point gives data about the character repetitions and displays the KanjiVG diagram
    that is colored by the subcomponents. Must be converted to png in order to display a (grid) of images.
    
    Does not return a value, but updates the image_widget and details variables as side effects.
    '''
    ind = points.point_inds[0]
    hd = dict(df_cards[['id', 'nid', 'reps', 'index']].iloc[ind])
    fields = note_fields_mapping[hd['nid']].split('\x1f')
    kanji = fields[0]
    keyword = fields[1]
    rtk = fields[6]
    image_name = fields[5].split('=')[1].split('>')[0].replace('"','')
    file_name = image_file_map[image_name]
    with open("/Users/ray/scratch/anki_statistics/models/heisig/" + file_name, "rb") as fp:
        image = fp.read()
        png = cairosvg.svg2png(image, dpi=300)
    image_widget.value = png
#     details.value = df_cards[['id', 'nid', 'reps', 'index']].iloc[ind].to_frame().to_html()
    details.value = ''' <table border="0" align="left">
                        <tr><td>{} {}</td></tr>
                        <tr><td>RTK: {}</td></tr>
                        <tr><td>REPS: {}</td></tr>
                        <tr><td>nid: {}</td></tr>
                        </table>
                        '''.format(kanji, keyword, rtk, hd['reps'], hd['nid'])

In [None]:
# Variables to hold the display of charts generated below.
details_kodansha = HBox(children=[HTML(value='Kodansha place holder.')])
details_nearest_neighbors = HBox(children=[HTML(value='Keyword nearest neighbors place holder.')])

In [None]:
# Plot the number of repetitions per card against its RTK index (drawn as a scatter plot rather than Anki's bar chart).

cards_fig = go.FigureWidget(
    data=[
        dict(
            type='scatter',
            x = df_cards['rtk'],
            y = df_cards['reps'],
            mode = 'markers'
        )
    ])

scatter = cards_fig.data[0]
scatter.hoverinfo = 'text'
# scatter.on_hover(hover_fn)
scatter.marker.opacity = 0.4
scatter.marker.size = 4
scatter.marker.color = None
cards_fig.layout.template = 'seaborn'
margin=go.layout.Margin(
        l=20,
        r=10,
        b=10,
        t=0,
        pad=4
)
cards_fig.layout.margin = margin
cards_fig.layout.xaxis.title = 'Remembering the Kanji Index'
cards_fig.layout.yaxis.title = 'Repetitions'
cards_fig.layout.hovermode = 'closest'

**TODO:** Compute and display the Jaccard similarity of each Kodansha cluster with the nearest-neighbor list (or hdbscan cluster).
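
For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal sketch of the computation follows; the function name `jaccard` and the two example sets are made up and do not come from the Kodansha data or the word vectors.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Made-up example sets, for illustration only.
kodansha_cluster = {'気', '息', '風'}
neighbor_kanji = {'気', '風', '空', '雲'}
print(jaccard(kodansha_cluster, neighbor_kanji))
```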

In [None]:
# Dummy figure, real one will be updated in cell below.
fig1 = go.FigureWidget()

In [None]:
container_layout = Layout(border='1px solid grey', width='100%')
vcontainer_layout = Layout(border='0px solid black', justify_content='space-between')
rcontainer_layout = Layout(border='0px solid red')
gcontainer_layout = Layout(border='0px solid green', width='400px')
HBox(children=[VBox(children=[cards_fig], layout=gcontainer_layout), 
               VBox(children=[VBox(children=[details_kodansha, details_nearest_neighbors],layout=rcontainer_layout)], 
                    layout=vcontainer_layout),# HBox(children=[fig1], layout=Layout(width='30%')),
               VBox(children=[image_widget], layout=rcontainer_layout)
              ], 
     layout=container_layout)

In [None]:
# Add information from cards to the review log table.
df_revlog['nid'] = df_revlog.cid.map(card_note_mapping)
df_revlog['flds'] = df_revlog.nid.map(note_fields_mapping)
df_revlog['formatted_flds'] = df_revlog['flds'].str.split('\x1f')
df_revlog['kanji'] = df_revlog['formatted_flds'].map(lambda x: x[0])
df_revlog['keyword'] = df_revlog['formatted_flds'].map(lambda x: x[1])

In [None]:
# Shift the review log dataframe to local time in Hawaii. This does not take into account
# the fact that review sessions started every morning at 4am in Hawaii. Two trips to the
# EST time zone during the project made it difficult to continue the reviews on time.
dti = pd.to_datetime(list(df_revlog['id']/1000.0), unit='s')
dti = dti.tz_localize('UTC')
dti = dti.tz_convert('US/Hawaii')
df_revlog = df_revlog.set_index(dti)

In [None]:
def review_type(row):
    rev_type = ''
    if row['type'] == 0:
        rev_type = 'learn'
    elif row['type'] == 1 and row['ivl'] > 21:
        rev_type = 'mature'
    elif row['type'] == 1 and row['ivl'] <= 21:
        rev_type = 'young'
    elif row['type'] == 2:
        rev_type = 'relearn'
    else:
        print('error {}'.format(row))
    
    return rev_type

df_revlog['review_type'] = df_revlog.apply(review_type, axis=1)
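
Row-wise `apply` can be slow on a long review log. The same classification can be sketched in vectorized form with `numpy.select`, assuming the same `type` and `ivl` rules as the function above. The name `review_type_vectorized` and the demo frame are hypothetical.

```python
import numpy as np
import pandas as pd

def review_type_vectorized(df, mature_ivl=21):
    """Label each review-log row using the same rules as review_type()."""
    conditions = [
        df['type'] == 0,
        (df['type'] == 1) & (df['ivl'] > mature_ivl),
        (df['type'] == 1) & (df['ivl'] <= mature_ivl),
        df['type'] == 2,
    ]
    labels = ['learn', 'mature', 'young', 'relearn']
    # Rows matching no condition get an empty label, like the fallthrough above.
    return pd.Series(np.select(conditions, labels, default=''), index=df.index)

# Made-up rows covering each branch, for illustration only.
demo = pd.DataFrame({'type': [0, 1, 1, 2, 3], 'ivl': [-600, 30, 5, -600, 10]})
demo['review_type'] = review_type_vectorized(demo)
print(demo)
```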

In [None]:
def get_current_review_type(kanji):
    mask = (df_revlog.kanji == kanji)
    return df_revlog[mask].iloc[-1]['review_type']

In [None]:
df_cards['review_type'] = df_cards.kanji.map(get_current_review_type)

In [None]:
# Update the scatter plot of RTK card number vs review count to highlight the current review type.
anki_colors = {'mature':'green', 'young':'lightgreen', 'learn':'blue', 'relearn':'red'}

scatter.marker.color = df_cards.review_type.map(anki_colors)
scatter.marker.opacity = 1

## Bar Chart reproducing Anki Statistics

In [None]:
df_cards.groupby('review_type').count().loc[:,'id'].to_frame()

**TODO:** Add python callback to update stacked bars with total reviews.

In [None]:
card_data = []

groupby = df_revlog.groupby('review_type')
for review_type in ['mature', 'young', 'learn', 'relearn']:
    group = groupby.get_group(review_type)
    # Review periods start at 4am (HST) every morning, so use a base of 4 and frequency of 24*n hours.
    card_series = group.loc[:, 'cid'].groupby(pd.Grouper(freq='24H', base=4, label='left')).count()
    card_data.append(go.Bar(x=card_series.index, y=card_series.values,
                            name=review_type, marker=dict(color=anki_colors[review_type]), opacity=0.9))
    
card_series = df_revlog.loc[:, 'cid'].groupby(pd.Grouper(freq='24H', base=4, label='left')).count()
card_data.append(go.Bar(x=card_series.index, y=card_series.values,
                            name='total', marker=dict(color='white'), opacity=0.1))

layout = go.Layout(title='Review Count', xaxis=dict(title='Date'), barmode='stack', yaxis=dict(title='Count'))
fig = go.FigureWidget(data=card_data, layout=layout)
fig
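
Recent pandas versions no longer accept the `base` argument of `pd.Grouper` (it was replaced by `offset`), so the cell above may need adjusting on a newer installation. One version-independent way to get the same 4am buckets is to shift each timestamp back by four hours and truncate to the date. The helper `study_day` and the three sample reviews below are hypothetical.

```python
import pandas as pd

def study_day(index, rollover_hour=4):
    """Map each timestamp to the date of the study day it belongs to (days roll over at 4am)."""
    shifted = index - pd.Timedelta(hours=rollover_hour)
    return shifted.normalize()

# Made-up review times: two before the 4am rollover, one exactly at it.
reviews = pd.Series(
    [1, 1, 1],
    index=pd.DatetimeIndex(
        ['2019-01-01 23:50', '2019-01-02 03:30', '2019-01-02 04:00']
    ).tz_localize('US/Hawaii'),
)
daily_counts = reviews.groupby(study_day(reviews.index)).count()
print(daily_counts)
```

The review at 03:30 on 2 January is counted with 1 January, which matches how Anki attributes late-night reviews to the previous study day.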

## Display Current Keywords/Kanji being Relearned

In [None]:
# Get a list of all rendered SVG images of kanji diagrams that are being relearned.
mask = (df_cards.review_type == 'relearn')
images = list(df_cards[mask].nid.map(get_diagram))

In [None]:
# Grid of relearned kanji images.
z_container_layout = Layout(border='0px solid  grey', width='50px', height='50px')
a_container_layout = Layout(border='0px solid red')
b_container_layout = Layout(border='0px solid green', justify_content='flex-start')
c_container_layout = Layout(border='0px solid black', width='50%', flex_direction='column', justify_content='space-around')
no_boxes_per_line = 10
fig1 = VBox(children=[HBox(children=[VBox(children=[Image(value=image, layout=z_container_layout)], layout=a_container_layout) 
                     for image in images[no_boxes_per_line*m:no_boxes_per_line*m+no_boxes_per_line]], layout=b_container_layout) for m in range(13)], layout=c_container_layout)

In [None]:
# Grid of relearned keywords.
keywords = list(df_cards[mask].keyword)
a_container_layout = Layout(border='0px solid red')
b_container_layout = Layout(border='0px solid green', justify_content='space-between')
c_container_layout = Layout(border='0px solid black', width='50%', flex_direction='column', justify_content='space-around')
no_boxes_per_line = 10
fig2 = VBox(children=[HBox(children=[HBox(children=[HTML(value=keyword)], layout=a_container_layout) 
                     for keyword in keywords[no_boxes_per_line*m:no_boxes_per_line*m+no_boxes_per_line]], layout=b_container_layout) for m in range(13)],
     layout=c_container_layout)

In [None]:
print(len(keywords))
HBox(children=[fig1, fig2], layout=Layout(border='0px solid black', justify_content='space-around'))
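
Both grids above hard-code 13 rows, which only covers up to 130 items. As a sketch, the number of rows can be derived from the list length instead. The helper `chunk` is hypothetical, and plain integers stand in for the images and keywords.

```python
import math

def chunk(items, per_row):
    """Split a list into consecutive rows of at most per_row items."""
    n_rows = math.ceil(len(items) / per_row)
    return [items[per_row * m: per_row * (m + 1)] for m in range(n_rows)]

# 23 stand-in items laid out 10 per row.
rows = chunk(list(range(23)), 10)
print([len(row) for row in rows])
```

Each row returned by `chunk` could then be wrapped in an `HBox`, and the list of rows in a `VBox`, as in the cells above.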

## Clusters of English keywords associated with kanji.

In [None]:
# english_model =  '../models/crawl-300d-2M.magnitude'
english_model =  '/Users/ray/scratch/flash/models/wiki-news-300d-1M.magnitude'
english_vectors = pymagnitude.Magnitude(english_model)

#### Cihai Utility for UNIHAN character data
<img src="images/cihai.jpg" width="400" align="right" style="margin: 0px 20px">

The [Cihai](https://cihai.git-pull.com/en/latest/features.html) project provides a (sqlalchemy) interface to [UNIHAN](https://unicode.org/charts/unihan.html) data.

In [None]:
# Full config for Cihai (via sqlalchemy) to suppress the error
# "SQLite objects created in a thread can only be used in that same thread."
# The fix is the argument "?check_same_thread=False" in the database url.
base_directory = '../'
cihai_path = os.path.join(base_directory, 'models/cihai.db')
config = {'debug': False,
  'database': {'url': 'sqlite:///' + cihai_path + '?check_same_thread=False'},
  'dirs': {'cache': base_directory,
  'log': base_directory,
  'data': base_directory}}

cihai = Cihai(config=config)

In [None]:
def lookup_Unihan(character):
    # TODO: Documentation for UNIHAN (CiHai package).
    kGlyphs = {}
    try:
        query = cihai.lookup_char(character)
        glyph = query.first()
        kGlyphs = {
            'kMandarin': glyph.kMandarin,
            'kCantonese': glyph.kCantonese,
            'kTang': glyph.kTang,
            'kJapaneseOn': glyph.kJapaneseOn,
            'kJapaneseKun': glyph.kJapaneseKun,
            'kKorean': glyph.kKorean,
            'kHangul': glyph.kHangul,
            'kDefinition': glyph.kDefinition
        }
    except Exception as ex:
        print('Exception: {}, Character {}'.format(ex, character))
    return kGlyphs

In [None]:
unihan_data = {char:lookup_Unihan(char) for char in list(df_cards.kanji)}

#### Nearest Neighbors

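Computing the approximate nearest neighbors for every keyword is very slow, so the result is cached on disk as JSON and reloaded on later runs.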

In [None]:
topn = 100
cached_knn = '../data/processed/knn_data.json'
if not os.path.isfile(cached_knn):
    knn_data = {keyword:{k:sim for k, sim in english_vectors.most_similar_approx(keyword, topn=topn)} for keyword in list(df_cards.keyword)}
    with open(cached_knn, 'w') as fp:
        json.dump(knn_data, fp)

with open(cached_knn) as fp:
    knn_data = json.load(fp)
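
The compute-once-then-reload pattern in the cell above can be sketched as a small reusable helper. The names `load_or_compute` and `slow_lookup` are hypothetical, and the cached dictionary is made-up data standing in for the word-vector results.

```python
import json
import os
import tempfile

def load_or_compute(path, compute):
    """Run compute() only if no cache file exists, then always read the result back from disk."""
    if not os.path.isfile(path):
        with open(path, 'w') as fp:
            json.dump(compute(), fp)
    with open(path) as fp:
        return json.load(fp)

calls = []

def slow_lookup():
    # Stand-in for the expensive nearest-neighbor query; the values are made up.
    calls.append(1)
    return {'inspection': {'examination': 0.8}}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, 'knn_demo.json')
    first = load_or_compute(cache_file, slow_lookup)
    second = load_or_compute(cache_file, slow_lookup)

print(first == second, len(calls))
```

Reading the result back from disk even on the first run keeps both code paths identical, at the cost of one extra file read.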

In [None]:
keyword_kanji_mapping = dict(df_revlog[['keyword', 'kanji']].values)
kanji_keyword_mapping = dict(df_revlog[['kanji', 'keyword']].values)

def keyword_neighbors(keyword='inspection'):
    topn = 100
#     nearest_keywords_sims = {keyword:sim for keyword, sim in english_vectors.most_similar_approx(keyword, topn=topn)}
    nearest_keywords_sims = knn_data[keyword]
    nearest_keywords_sims[keyword] = 1.0
    nearest_keywords = {k for k in nearest_keywords_sims}
    nearest_keywords = nearest_keywords.intersection(set(keywords))
    nearest_keywords.add(keyword)

    keyword_data = []
    cols = ['keyword', 'kanji', 'similarity']
    glyph_data = unihan_data['請']
    cols.extend(list(glyph_data.keys()))
    
    for k in nearest_keywords:
        glyph_data = {}
        kanji = keyword_kanji_mapping[k]
        glyph_data['keyword'] = k
        glyph_data['kanji'] = kanji
        glyph_data['similarity'] = nearest_keywords_sims[k]
        glyph_data.update(unihan_data[kanji])
        keyword_data.append(glyph_data)
        
    df = pd.DataFrame(keyword_data, columns=cols)
    df = df.sort_values(by='similarity', ascending=False)
    df = df.round({'similarity': 2})
#     print(tabulate(df, showindex='never'))

    return df

In [None]:
keyword_neighbors(keyword='inspection')

In [None]:
def hnn(ind):
#     ind = points.point_inds[0]
    hd = dict(df_cards[['id', 'nid', 'reps', 'index']].iloc[ind])
    fields = note_fields_mapping[hd['nid']].split('\x1f')
    kanji = fields[0]
    keyword = fields[1]
    df = keyword_neighbors(keyword=keyword)
    table = tabulate(df[['keyword', 'kanji', 'similarity', 'kJapaneseOn']], tablefmt='html', headers='keys', showindex='never')
    return table

In [None]:
HTML(hnn(0))

In [None]:
def hover_fn_nn(trace, points, state):
    '''
    Hover callback: look up the keyword of the card under the cursor and display its nearest-neighbor keywords
    (with kanji, similarity, and Japanese on reading) in the details_nearest_neighbors widget.
    '''
    ind = points.point_inds[0]
    hd = dict(df_cards[['id', 'nid', 'reps', 'index']].iloc[ind])
    fields = note_fields_mapping[hd['nid']].split('\x1f')
    kanji = fields[0]
    keyword = fields[1]
    df = keyword_neighbors(keyword=keyword)
#     details.value = keyword
    table = tabulate(df[['keyword', 'kanji', 'similarity', 'kJapaneseOn']], tablefmt='html', headers='keys', showindex='never')
    details_nearest_neighbors.children = [HTML(table)]

In [None]:
scatter.hoverinfo = 'text'
scatter.on_hover(hover_fn_nn)

## Parse word clusters (scraped from) Kodansha's dictionary.
<img src="images/kodansha_cover.jpg" width="200" align="left" style="margin: 0px 20px">

>    **Kanji Synonyms** The words of a language form a closely linked network of interdependent
>    units. The meaning of a word or expression cannot really be understood unless its relationships with
>    other closely related words are taken into account. In English, for example, such words as kill, murder,
>    and execute share the meaning of ‘put to death’, but they differ considerably in usage and connotation.
>    The ability to distinguish between such words not only allows one to gain a fuller understanding of
>    their individual shades of meaning, but also helps one write with greater clarity and precision.
>
>    A special feature of this dictionary, presented for the first time in the first edition, is the complete guidance
>    it offers for the precise distinctions between kanji synonyms, or characters of similar meaning. Since
>    a proper understanding of the meanings of each character is essential for the effective mastery of the
>    Japanese vocabulary, this will be of considerable benefit to the serious student. The kanji synonyms serve
>    as a powerful learning aid for the following reasons:
>      1. They show the differences and similarities between closely related characters.
>      2. They act as a network of cross-references for quickly locating any synonym group member.
>      3. They act as a simple kanji thesaurus.
>      4. They provide the educator with a valuable source of reference data.

In [None]:
IN_CLUSTER = False
clusters = {}
current_cluster_id = ''
for line in open('../data/raw/Kodansha Word Clusters.txt'):
    if line == '\n':
        IN_CLUSTER = False
        continue
        
    k_map = re.findall('[→]', line)
    if k_map:
#         print(line.strip())
        continue
    k_title = re.findall(r'([….,\(\)\[\]/\-0-9a-zA-Z\s]+\n)', line)
    try:
        if k_title[0] == line:
            IN_CLUSTER = True
            clusters[line.strip()] = []
            current_cluster_id = line.strip()
    #             print(k_title)
            continue
    except Exception as ex:
        print(line, ex)
    k_line = re.findall('([一-龯ぁ-んァ-ン𠮟媾󠄁\-嚮󠄃〇]+) ([….,’ \(\)\[\]/\-0-9a-zA-Zō\s]*) ([0-9]+)', line)
    try:
        num = k_line[0][2].strip()
        if IN_CLUSTER:
            clusters[current_cluster_id].append(k_line[0])
        if num == '':
            print(k_line)
    except Exception as ex:
        print(k_title, ex)

In [None]:
print(tabulate(clusters['rice']))

In [None]:
def extend_cluster_rtk(cluster):
    table_rtk = []
    for row in cluster:
        if row[0] in kanji_keyword_mapping:
            row = (row[0], kanji_keyword_mapping[row[0]], row[1], row[2])
        else:
            row = (row[0], '', row[1], row[2])
        table_rtk.append(row)
    return table_rtk

In [None]:
HTML('<style>{}</style>'.format(open('custom.css').read()))

In [None]:
char = '気'
boxes = []
boxes_layout = Layout(border='0px solid  grey', width='100%', justify_content='space-around')
max_rows = 0
for key in clusters:
    if char in [t[0] for t in clusters[key]]:
        table_rtk = extend_cluster_rtk(clusters[key])
        if len(table_rtk) > max_rows:
            max_rows = len(table_rtk)
        table = tabulate(table_rtk, tablefmt='html')
        caption = '<caption>{}</caption>'.format(key)
        table = table.replace('<table>', '<table id="kodansha">\n'+caption)
        table = table.replace('style="text-align: right;"', '')
        boxes.append((len(table_rtk), table))
        
boxes_extended = []
for no_rows, table in boxes:
#     print(max_rows - no_rows)
    extra_rows = '</tbody>\n' + ('<tr>' + '<td>&nbsp;</td>'*4 + '</tr>\n')*(max_rows - no_rows)
    table = table.replace('</tbody>', extra_rows)
    boxes_extended.append(table)
    
boxes = [HTML(box) for box in boxes_extended]
HBox(children=boxes, layout=boxes_layout)

In [None]:
item_layout = Layout(height='100px', min_width='40px')
items = [Button(layout=item_layout, description=str(i), button_style='warning') for i in range(40)]
box_layout = Layout(overflow_x='scroll',
                    border='3px solid black',
                    width='500px',
                    height='',
                    flex_flow='row',
                    display='flex')
carousel = Box(children=items, layout=box_layout)
VBox([Label('Scroll horizontally:'), carousel])

In [None]:
char = '気'
boxes = []
# boxes_layout = Layout(border='0px solid  grey', width='100%', justify_content='space-around')
max_rows = 0
for key in clusters:
    if char in [t[0] for t in clusters[key]]:
        table_rtk = extend_cluster_rtk(clusters[key])
        if len(table_rtk) > max_rows:
            max_rows = len(table_rtk)
        table = tabulate(table_rtk, tablefmt='html')
        caption = '<caption>{}</caption>'.format(key)
        table = table.replace('<table>', '<table id="kodansha">\n'+caption)
        table = table.replace('style="text-align: right;"', '')
        boxes.append((len(table_rtk), table))
        
boxes_extended = []
for no_rows, table in boxes:
#     print(max_rows - no_rows)
    extra_rows = '</tbody>\n' + ('<tr>' + '<td>&nbsp;</td>'*4 + '</tr>\n')*(max_rows - no_rows)
    table = table.replace('</tbody>', extra_rows)
    boxes_extended.append(table)
    
item_layout = Layout(min_width='300px')
boxes = [HTML(box, layout=item_layout) for box in boxes_extended]
# HBox(children=boxes, layout=boxes_layout)
box_layout = Layout(overflow_x='scroll',
                    border='0px solid black',
                    width='500px',
                    height='300px',
                    flex_flow='row',
                    display='flex')
details_kodansha = Box(children=boxes, layout=box_layout)
label = Label('Scroll horizontally:')
details = VBox([carousel, label])

In [None]:
label.value

In [None]:
print(table)

In [None]:
print( '</tbody>\n' + ('<tr>' + '<td></td>'*4 + '</tr>\n')*4 )

In [None]:
max_cluster_length = 0
for key in clusters:
    if len(clusters[key]) > max_cluster_length:
        max_cluster_length = len(clusters[key])

In [None]:
def hover_kodansha_clusters(trace, points, state):
    ind = points.point_inds[0]
    hd = dict(df_cards[['id', 'nid', 'reps', 'index']].iloc[ind])
    fields = note_fields_mapping[hd['nid']].split('\x1f')
    kanji = fields[0]
    keyword = fields[1]
    image_name = fields[5].split('=')[1].split('>')[0].replace('"','')
    file_name = image_file_map[image_name]
    with open("/Users/ray/scratch/anki_statistics/models/heisig/" + file_name, "rb") as fp:
        image = fp.read()
        png = cairosvg.svg2png(image, dpi=300)
    image_widget.value = png
    boxes = []
    boxes_layout = Layout(border='1px solid  grey', width='100%', justify_content='space-around')
    max_rows = 0
    cluster_names = []
    for key in clusters:
        if kanji in [t[0] for t in clusters[key]]:
            cluster_names.append(key)
            table_rtk = extend_cluster_rtk(clusters[key])
            if len(table_rtk) > max_rows:
                max_rows = len(table_rtk)
            table = tabulate(table_rtk, tablefmt='html')
            caption = '<caption>{}</caption>'.format(key)
            table = table.replace('<table>', '<table id="kodansha">\n'+caption)
            table = table.replace('style="text-align: right;"', '')
            boxes.append((len(table_rtk), table))

    boxes_extended = []
    for no_rows, table in boxes:
        # print(max_rows - no_rows)
        extra_rows = '</tbody>\n' + ('<tr>' + '<td>&nbsp;</td>'*4 + '</tr>\n')*(max_cluster_length - no_rows)
        table = table.replace('</tbody>', extra_rows)
        boxes_extended.append(table)

    item_layout = Layout(min_width='300px')
    boxes = [HTML(box, layout=item_layout) for box in boxes_extended]
#     details.layout = boxes_layout
    details_kodansha.children = boxes
    if len(cluster_names) == 1:
        label.value = 'Word Cluster: ' + ', '.join(cluster_names)
    else:
        label.value = 'Word Clusters: ' + ', '.join(cluster_names)
    cards_fig.layout.xaxis.title.text = '{} {}'.format(kanji, keyword)
    df = keyword_neighbors(keyword=keyword)
#     details.value = keyword
    table = tabulate(df[['keyword', 'kanji', 'similarity', 'kJapaneseOn']], tablefmt='html', headers='keys', showindex='never')
    details_nearest_neighbors.children = [HTML(table)]

In [None]:
scatter.hoverinfo = 'text'
scatter.on_hover(hover_kodansha_clusters)

## Notes on timestamps

Review periods start at 4am (HST) every morning, while the review log ids are epoch timestamps in UTC. Reviews therefore need to be bucketed into the period between 4am of one day and 4am of the next. Many of the card reviews during the holidays of 2018 were finished after 3am (the study pattern was new cards in the morning and reviews in the evening).

Three ways to work with timestamps: pandas, sqlite, and python datetime.
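
As a quick consistency check, the same instant (4am HST on 9 May 2019, which is 14:00 UTC) can be converted to an epoch timestamp in seconds with each of the three tools. This is a sketch that uses an in-memory sqlite database rather than the Anki collection; the variable names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

import pandas as pd

# pandas: parse a string with an explicit UTC offset.
with_pandas = int(pd.Timestamp('2019-05-09 04:00:00-10:00').timestamp())

# python datetime: build the same instant directly in UTC.
with_datetime = int(datetime(2019, 5, 9, 14, 0, tzinfo=timezone.utc).timestamp())

# sqlite: strftime('%s', ...) interprets the time string as UTC and returns seconds as text.
memory_con = sqlite3.connect(':memory:')
with_sqlite = int(memory_con.execute(
    "SELECT strftime('%s', '2019-05-09 14:00:00')").fetchone()[0])
memory_con.close()

print(with_pandas, with_datetime, with_sqlite)
```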

In [None]:
import time

In [None]:
# Timestamps of card reviews as computed in anki source code (routine _logLrn in sched.py).
# This is the current "time in seconds since the epoch as a floating point number" UTC.
print( int(time.time()*1000) )

# For comparison, compute the same time stamp using sqlite.
cmd = "SELECT strftime('%s', 'now') as time_stamp"
df = pd.read_sql_query(cmd, con)
print(pd.to_datetime(df['time_stamp'], unit='s'))
df

In [None]:
# Our epoch. Unix epoch is January 1, 1970, 00:00:00 (UTC).
time.gmtime(0)

In [None]:
# Local time according to python
t = time.localtime()
print(t, t.tm_zone)

In [None]:
from calendar import TextCalendar

In [None]:
tc = TextCalendar(firstweekday=6)
print(tc.formatmonth(t.tm_year, t.tm_mon))

In [None]:
# Using sqlite to get timestamp (10 digit) for 4am tomorrow morning.
cmd = "SELECT strftime('%s', 'now', 'localtime', 'start of day', '+1 day', '+4 hours') as time_stamp"
df = pd.read_sql_query(cmd, con)
print(pd.to_datetime(df['time_stamp'], unit='s'))
tomorrow_4am_localtime = int(df.iloc[0]['time_stamp'])
print('Timestamp for tomorrow (localtime) at 4am, {}, is {} digits long.'.format(tomorrow_4am_localtime, len(str(tomorrow_4am_localtime))))

In [None]:
# Difference between midnight 10 June 2019 and 4am 11 June.
# Six hours -- that is HST is 10 hours behind UTC less the 4 hours to 4am.
june_10_2019_midnight_utc = 1560247200    # int(time.time()) (seconds) run at midnight 10 June 2019 in HST.
june_11_2019_4am_hst = 1560225600         # Computed by SELECT strftime('%s','now','localtime','start of day','+1 day','+4 hours') as time_stamp
                                          # early in the day of 10 June 2019 (before midnight UTC).
(june_10_2019_midnight_utc - june_11_2019_4am_hst) / 3600    # Difference in hours.

In [None]:
# From the python datetime documentation:
# "Return the local date corresponding to the POSIX timestamp, such as is returned by time.time()."
# Passing timezone.utc as the tz argument returns an aware datetime in UTC instead of local time.
datetime.fromtimestamp(tomorrow_4am_localtime, timezone.utc)

In [None]:
# Ten digit time stamp is in seconds while 13 digit time stamp is ms -- example uses pandas to convert.
first_review = df_revlog.iloc[0]['id']
last_review = df_revlog.iloc[-1]['id']
print('First review: {}. Time stamp is {}, a {} digit number.'.format(pd.to_datetime(first_review, unit='ms'), first_review, len(str(first_review))))
print('Last review: {}. Time stamp is {}, a {} digit number.'.format(pd.to_datetime(last_review, unit='ms'), last_review, len(str(last_review))))

In [None]:
print(datetime.fromtimestamp(first_review/1000.0))
print(datetime.fromtimestamp(last_review/1000.0))

In [None]:
# Pandas data structure for time stamps (date object), initialized here by a time string.
# Example here shows 4am in the morning in the HST timezone (UTC-10).
s = pd.Timestamp('2019-05-09 04:00:00-10:00')
print('Timestamp = {}, clock in Greenwich is {}.'.format(s.timestamp(), datetime.fromtimestamp(s.timestamp(), timezone.utc)))
print('Timestamp = {}, local clock is {}'.format(s.timestamp(), datetime.fromtimestamp(s.timestamp(), timezone(timedelta(hours=-10)))))

In [None]:
# Second major pandas data structure is for time periods.
pd.Period('2019-05')

In [None]:
s.to_period(freq='M')

In [None]:
s.week

In [None]:
# One day is 60*60*24 == 86400 seconds.
1560225600 - 1560139200

In [None]:
# df_revlog.groupby(['review_type', pd.Grouper(key='shift_date', freq='13D')])['id'].count()

In [None]:
# for bucket in df_revlog_type['learn'].groupby(pd.Grouper(freq='24H', base=4, label='right')):
#     pass

In [None]:
df_cards.tail()

In [None]:
pd.to_datetime(1536562919907, unit='ms')

In [None]:
def _daysSinceCreation(crt=1533996000):
    """Whole days elapsed since 4am (local time) on the day of timestamp crt (in seconds)."""
    startDate = datetime.fromtimestamp(crt)
    startDate = startDate.replace(hour=4, minute=0, second=0, microsecond=0)
    return int((time.time() - time.mktime(startDate.timetuple())) // 86400)


In [None]:
start = _daysSinceCreation(1533996000)
print(start/7)

In [None]:
def due(date):
    return date - start

In [None]:
start

In [None]:
mask = (df_cards.due - start) == 0
df_cards[mask].head()
images = list(df_cards[mask].nid.map(get_diagram))

In [None]:
for c in set(df_cards[mask].kanji):
    print(c, end=' ')

In [None]:
z_container_layout = Layout(border='0px solid grey', width='50px', height='50px')
a_container_layout = Layout(border='0px solid red')
b_container_layout = Layout(border='0px solid green', justify_content='flex-start')
c_container_layout = Layout(border='0px solid black', width='50%', flex_direction='column', justify_content='space-around')
no_boxes_per_line = 10
fig1 = VBox(children=[HBox(children=[VBox(children=[Image(value=image, layout=z_container_layout)], layout=a_container_layout)
                     for image in images[no_boxes_per_line*m:no_boxes_per_line*m+no_boxes_per_line]], layout=b_container_layout) for m in range(15)], layout=c_container_layout)

In [None]:
keywords = list(df_cards[mask].keyword)
a_container_layout = Layout(border='0px solid red')
b_container_layout = Layout(border='0px solid green', justify_content='space-between')
c_container_layout = Layout(border='0px solid black', width='50%', flex_direction='column', justify_content='space-around')
no_boxes_per_line = 10
fig2 = VBox(children=[HBox(children=[HBox(children=[HTML(value=keyword)], layout=a_container_layout) 
                     for keyword in keywords[no_boxes_per_line*m:no_boxes_per_line*m+no_boxes_per_line]], layout=b_container_layout) for m in range(9)],
     layout=c_container_layout)

In [None]:
print(len(keywords))
HBox(children=[fig1, fig2], layout=Layout(border='0px solid black', justify_content='space-around'))

In [None]:
df_col

In [None]:
print(datetime.fromtimestamp(1533996000))

In [None]:
mask = (df_revlog.nid == 1185996957436)
df_revlog[mask].tail()

In [None]:
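# datetime and timedelta are already imported in the Import Section; importing the
# datetime module here would shadow the datetime class used by earlier cells.
import numpy as np

In [None]:
# Span of the review log, from the first to the last review (ids are millisecond timestamps).
start = df_revlog.iloc[0].id
start_dt = datetime.fromtimestamp(start/1000)

end = df_revlog.iloc[-1].id
end_dt = datetime.fromtimestamp(end/1000)

delta = end_dt - start_dt + timedelta(1)
dates_in_year = [start_dt + timedelta(i) for i in range(delta.days+1)]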

In [None]:
weekdays_in_year = [i.weekday() for i in dates_in_year]  # Values 0 (Mon) through 6 (Sun); ticktext in the yaxis dict translates these to weekday names.
weeknumber_of_dates = [i.strftime("%Gww%V")[2:] for i in dates_in_year]  # ISO year and week label for each date, e.g. '19ww24' for ISO week 24 of 2019.
z = np.random.randint(2, size=(len(dates_in_year)))
text = [str(d) + '_' + str(i) for d,i in zip(data_days.values, dates_in_year)]

colorscale=[[False, '#eeeeee'], [True, '#76cf63']]

In [None]:
mask = (df_revlog.kanji == '泉')
z = list(mask.map(int))
data_days = df_revlog.loc[:, 'cid'].groupby(pd.Grouper(freq='24H', base=4, label='left')).count()

In [None]:
len(data_days)

In [None]:
len(weeknumber_of_dates)

In [None]:
data = [
        go.Heatmap(
            x = weeknumber_of_dates,
            y = weekdays_in_year,
            z = data_days,
            text=text,
            hoverinfo="text",
            xgap=3, # this
            ygap=3, # and this are used to make the grid-like appearance
            showscale=False,
            colorscale=colorscale
            )
        ]

layout = go.Layout(
        title='activity chart',
        height=280,
        yaxis=dict(
            showline = False, showgrid = False, zeroline = False,
            tickmode="array",
            ticktext=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
            tickvals=[0,1,2,3,4,5,6],
            ),
        xaxis=dict(
            showline = False, showgrid = False, zeroline = False,
            ),
        font={"size":10, "color":"#9e9e9e"},
        plot_bgcolor=("#fff"),
        margin = dict(t=40),
        ) 

fig = go.FigureWidget(data=data, layout=layout)

fig

In [None]:
with open('../data/raw/kiritsubo.txt') as fp:
    text = fp.read()

In [None]:
re.findall('馬寮', text)