# Exploratory Data Analysis - Bengali AI Dataset
# Grapheme Combinations 

This dataset contains images of individual hand-written [Bengali characters](https://en.wikipedia.org/wiki/Bengali_alphabet). 
Bengali characters (graphemes) are written by combining three components: a grapheme_root,
vowel_diacritic, and consonant_diacritic. Your challenge is to classify the components of the grapheme in each
image. There are roughly 10,000 possible graphemes, of which roughly 1,000 are represented in the training set. The
test set includes some graphemes that do not exist in train but has no new grapheme components. It takes a lot of
volunteers filling out [sheets like this](https://github.com/BengaliAI/graphemePrepare/blob/master/collection/A4/form_1.jpg)
to generate a useful amount of real data; focusing the problem on the grapheme components rather than on recognizing
whole graphemes should make it possible to assemble a Bengali OCR system without handwriting samples for all 10,000
graphemes.

I have extended this EDA with a full [Unicode Visualization of the Bengali Alphabet](https://www.kaggle.com/jamesmcguigan/unicode-visualization-of-the-bengali-alphabet/).

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix
from IPython.display import Markdown, HTML
from itertools import chain, product
# from src.jupyter import grid_df_display, combination_matrix

pd.set_option('display.max_columns',   500)
pd.set_option('display.max_colwidth', None)  # None = never truncate cells; -1 is deprecated since pandas 1.0

%load_ext autoreload
%autoreload 2

## Custom Library Functions

In [None]:
# Source: https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side/50899244#50899244
import pandas as pd
from IPython.display import display,HTML

def grid_df_display(list_dfs, rows = 2, cols=3, fill = 'cols'):
    if fill not in ['rows', 'cols']: print("grid_df_display() - fill must be one of: 'rows', 'cols'")

    html_table = "<table style='width:100%; border:0px'>{content}</table>"
    html_row   = "<tr style='border:0px'>{content}</tr>"
    html_cell  = "<td style='width:{width}%;vertical-align:top;border:0px'>{{content}}</td>"
    html_cell  = html_cell.format(width=100/cols)

    cells = [ html_cell.format(content=df.to_html()) for df in list_dfs[:rows*cols] ]
    cells += cols * [html_cell.format(content="")] # pad

    if fill == 'rows':   # fill in rows first (first row: 0,1,2,... col-1)
        grid = [ html_row.format(content="".join(cells[i:i+cols])) for i in range(0,rows*cols,cols)]
    elif fill == 'cols': # fill columns first (first column: 0,1,2,..., rows-1)
        grid = [ html_row.format(content="".join(cells[i:rows*cols:rows])) for i in range(0,rows)]
    else:
        grid = []

    # noinspection PyTypeChecker
    display(HTML(html_table.format(content="".join(grid))))

    # add extra dfs to bottom
    [display(list_dfs[i]) for i in range(rows*cols,len(list_dfs))]


if __name__ == "__main__":  # demo: renders a 3x3 grid of small example DataFrames
    list_dfs = []
    list_dfs.extend((pd.DataFrame(2*[{"x":"hello"}]),
                     pd.DataFrame(2*[{"x":"world"}]),
                     pd.DataFrame(2*[{"x":"gdbye"}])))

    grid_df_display(3*list_dfs)

In [None]:
# Source: https://github.com/JamesMcGuigan/kaggle-digit-recognizer/blob/master/src/utils/confusion_matrix.py
from typing import Union

import pandas as pd
from pandas.io.formats.style import Styler


def combination_matrix(dataset: pd.DataFrame, x: str, y: str, z: str,
                       format=None, unique=True) -> Union[pd.DataFrame, Styler]:
    """
    Returns a combination matrix, showing all valid combinations between three DataFrame columns.
    Sort of like a heatmap, but returning lists of (optionally) unique values

    :param dataset: The dataframe to create a combination_matrix from
    :param x: column name to use for the X axis
    :param y: column name to use for the Y axis
    :param z: column name to use for the Z axis (values that appear in the cells)
    :param format: '', ' ', '-', ', ', '\n' = join each value list into a string using this separator
                    str, bool, int, float   = cast value lists
    :param unique:  whether to return only unique values or not - eg: combination_matrix(unique=False).applymap(sum)
    :return: DataFrame of combinations, or a Styler when format == '\n' (needed to render the line breaks)
    """
    unique_y = sorted(dataset[y].unique())
    combinations = pd.DataFrame({
        n: dataset.where(lambda df: df[y] == n)
            .groupby(x)[z]
            .pipe(lambda df: df.unique() if unique else df )
            .apply(list)
            .apply(sorted)
        for n in unique_y
    }).T

    if isinstance(format, str):
        combinations = combinations.applymap(
            lambda cell: f"{format}".join([str(value) for value in list(cell) ])
            if isinstance(cell, list) else cell
        )
    if format == str:   combinations = combinations.applymap(lambda cell: str(cell)      if isinstance(cell, list) and len(cell) > 0 else ''     )
    if format == bool:  combinations = combinations.applymap(lambda cell: True           if isinstance(cell, list) and len(cell) > 0 else False  )
    if format == int:   combinations = combinations.applymap(lambda cell: int(cell[0])   if isinstance(cell, list) and len(cell)     else ''     )
    if format == float: combinations = combinations.applymap(lambda cell: float(cell[0]) if isinstance(cell, list) and len(cell)     else ''     )

    combinations.index.rename(y, inplace=True)
    combinations.fillna('', inplace=True)
    if format == '\n':
        return combinations.style.set_properties(**{'white-space': 'pre-wrap'})  # needed for display
    else:
        return combinations  # Allows for subsequent .applymap()

## Inspect Raw Data

In [None]:
!ls ../input/bengaliai-cv19/

In [None]:
dataset = pd.read_csv('../input/bengaliai-cv19/train.csv')
# for key in ['grapheme_root','vowel_diacritic','consonant_diacritic','grapheme']:
#     dataset[key] = dataset[key].astype('category')  # ensures groupby().count() shows zeros
dataset['graphemes'] = dataset['grapheme'].apply(tuple)
dataset.head()

## Question: How many unique graphemes are there?

There are 168 grapheme roots, 11 vowel diacritics, 7 consonant diacritics, and 1295 unique graphemes within the 200k training dataset.

In [None]:
dataset

In [None]:
unique = dataset.apply(lambda col: col.nunique()); unique

## Question: Can all diacritics be used with any grapheme?

- Documentation claims 10,000+ possible graphemes, which matches the theoretical maximum of `168 * 11 * 7 = 12936`

- Assuming that the training dataset is representative of common usage, 
  then certain combinations may never (or only rarely) be used in practice.

- Unconfirmed Theory: the physics of the human mouth may make such combinations unpronounceable.

- Conclusion: it may be possible to infer excluded combinations using simple logical rules
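
As a rough worked example of the arithmetic above, the sketch below compares the theoretical number of root/vowel/consonant combinations with the number of unique graphemes reported for the training set. The variable names are illustrative only, and the counts are simply the figures quoted in this notebook.

```python
# Component counts quoted in this notebook (not recomputed from train.csv here)
n_roots, n_vowels, n_consonants = 168, 11, 7
observed_graphemes = 1295  # unique graphemes reported for the training set

# Upper bound if every root could combine with every diacritic
possible = n_roots * n_vowels * n_consonants

# Fraction of the theoretical space that actually appears in training
coverage = observed_graphemes / possible

print(f"possible combinations: {possible}")
print(f"observed coverage:     {coverage:.1%}")
```

Under these numbers only about a tenth of the theoretical combinations are observed, which is what motivates looking for exclusion rules.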

### Vowel / Consonant Combinations:
- Vowel #0 and Consonant #0 combine with everything
- Vowels #3, #5, #6, #8 have limited combinations with Consonants 
- Consonant #3 is never combined except with Vowel #0
- Consonant #6 only combines with Vowels #0 and #1
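
The kind of co-occurrence count behind these observations can also be sketched with `pd.crosstab`. The toy DataFrame below is hypothetical (it is not taken from `train.csv`) and only illustrates how a consonant that pairs with a single vowel shows up as a column with one non-zero entry.

```python
import pandas as pd

# Hypothetical toy labels, NOT real rows from train.csv
toy = pd.DataFrame({
    "vowel_diacritic":     [0, 0, 0, 1, 3],
    "consonant_diacritic": [0, 1, 3, 0, 0],
})

# Rows = vowel diacritic, columns = consonant diacritic, cells = number of samples
counts = pd.crosstab(toy["vowel_diacritic"], toy["consonant_diacritic"])
print(counts)

# For each consonant, how many distinct vowels does it ever combine with?
vowels_per_consonant = (counts > 0).sum(axis=0)
print(vowels_per_consonant)
```

In this toy data, consonant 3 combines with exactly one vowel (vowel 0), mirroring the pattern described above for the real dataset.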

In [None]:
combination_matrix(dataset, x='consonant_diacritic', y='vowel_diacritic', z='consonant_diacritic', unique=False).applymap(len)

### Grapheme Root Combinations:
- Vowel #0 and Consonant #0 combine with (nearly) everything
- ALL Roots combine with Consonant #0
- Several Roots do NOT combine with Vowel #0 = [26, 28, 33, 34, 73, 82, 108, 114, 126, 152, 157, 158, 163]
- Several Roots combine with ALL Vowels = [13, 23, 64, 72, 79, 81, 96, 107, 113, 115, 133, 147]
- Only Root #107 combines with ALL Consonants

In [None]:
root_vowels            = dataset.groupby('grapheme_root')['vowel_diacritic'].unique().apply(sorted).to_frame().T
root_consonants        = dataset.groupby('grapheme_root')['consonant_diacritic'].unique().apply(sorted).to_frame().T
root_vowels_values     = root_vowels.applymap(len).values.flatten()
root_consonants_values = root_consonants.applymap(len).values.flatten()

display(root_vowels)
display({
    "mean":   root_vowels_values.mean(),
    "median": np.median( root_vowels_values ),
    "min":    root_vowels_values.min(),
    "max":    root_vowels_values.max(),
    "unique_vowels":    unique['vowel_diacritic'],
    "root_combine_0":   sum([ 0 in lst for lst in root_vowels.values.flatten() ]),
    "unique_roots":     unique['grapheme_root'],
    "root_combine_not_0": str([ index for index, lst in enumerate(root_vowels.values.flatten()) if 0 not in lst ]),    
    "root_combine_all":       [ index for index, lst in enumerate(root_vowels.values.flatten()) if len(lst) == unique['vowel_diacritic'] ],
})
# print('--------------------')
display(root_consonants)
display({
    "mean":   root_consonants_values.mean(),
    "median": np.median( root_consonants_values ),
    "min":    root_consonants_values.min(),
    "max":    root_consonants_values.max(),
    "unique_consonants":  unique['consonant_diacritic'],
    "root_combine_0": sum([ 0 in lst for lst in root_consonants.values.flatten() ]),
    "unique_roots":   unique['grapheme_root'],
    "root_combine_not_0": str([ index for index, lst in enumerate(root_consonants.values.flatten()) if 0 not in lst ]),        
    "root_combine_all":       [ index for index, lst in enumerate(root_consonants.values.flatten()) if len(lst) == unique['consonant_diacritic'] ],
})

### Combination Matrices

This is the full list of which Grapheme Roots combine with which Vowel and Consonant Diacritics.
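
To make the idea of a combination matrix concrete, here is a minimal sketch on a hypothetical toy DataFrame. It uses plain `groupby` / `unstack` rather than the `combination_matrix()` helper defined earlier, so it should be read as an illustration of the concept and not as that function's implementation.

```python
import pandas as pd

# Hypothetical toy labels, NOT real rows from train.csv
toy = pd.DataFrame({
    "grapheme_root":       [1, 1, 2, 2, 3],
    "vowel_diacritic":     [0, 1, 0, 0, 0],
    "consonant_diacritic": [0, 0, 0, 1, 1],
})

matrix = (
    toy.groupby(["vowel_diacritic", "consonant_diacritic"])["grapheme_root"]
       # one cell per (vowel, consonant) pair: the sorted unique roots, comma separated
       .agg(lambda roots: ", ".join(str(root) for root in sorted(set(roots))))
       # pivot consonants into columns; pairs that never occur become empty cells
       .unstack("consonant_diacritic", fill_value="")
)
print(matrix)
```

Each cell lists the roots observed with that vowel/consonant pair, and an empty cell marks a pair that never occurs in the data.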

In [None]:
combination_matrix(dataset, x='consonant_diacritic', y='vowel_diacritic', z='grapheme_root', format=', ')

### Visualizing Bengali

In [None]:
from collections import Counter

def filter_pairs_diacritics(pairs, diacritics, key=None):
    previous_diacritics = set(chain(*[ diacritics[k] for k,v in diacritics.items() if k != key ]))
    return [ pair for pair in pairs if pair[0] not in previous_diacritics ]

def print_conflicts(pairs, diacritics, key):
    valid = filter_pairs_diacritics(pairs, diacritics, key)
    if len(valid) == 0:
        conflict_key   = [ k for k,v in diacritics.items() if pairs[0][0] in diacritics[k].values ][0]
        conflict_dict  = { v:k for k,v in diacritics[conflict_key].items() }
        display({
            "source":   ( key, pairs[:4] ),
            "conflict": ( conflict_key, conflict_dict[pairs[0][0]], pairs[0][0] ),
        })
    return pairs

In [None]:
diacritics_raw = {
    "vowel_diacritic":     None,
    "consonant_diacritic": None,
    "grapheme_root":       None
}
for key in [ 'vowel_diacritic', 'consonant_diacritic', 'grapheme_root' ]:
    diacritics_raw[key] = (
        dataset.groupby(key)
            .apply(lambda group:   sum(group['graphemes'].apply(set).apply(Counter), Counter()) )   # -> Counter()
            .apply(lambda counter: counter.most_common() )                                          # -> [ tuple(symbol, count), ]
            .apply(lambda pairs:   pairs[0][0] if len(pairs) else '?' )
    )

    
### Hardcode conflict resolution and deduplicate - TODO: Verify correctness        
diacritics_resolutions = {
    "vowel_diacritic":     pd.Series({ 0: '্', 1: 'া', 2: 'ি' }),
    "consonant_diacritic": pd.Series({ 0: '্' }),
    "grapheme_root":       pd.Series({ 4: 'য' })
}
diacritics = { k:v.copy() for k,v in diacritics_resolutions.items() }
for key in [ 'vowel_diacritic', 'consonant_diacritic', 'grapheme_root' ]:
    diacritics[key] = (
        dataset.groupby(key)
            ### NOTE: group['graphemes'].apply(set) removes duplicate unicode diacritics       
            .apply(lambda group:   sum(group['graphemes'].apply(set).apply(Counter), Counter()) )   # -> Counter()
            .apply(lambda counter: counter.most_common() )                                          # -> [ tuple(symbol, count), ]
            .apply(lambda pairs:   print_conflicts(pairs, diacritics, key) ) 
            .apply(lambda pairs:   filter_pairs_diacritics(pairs, diacritics, key)) 
            .apply(lambda pairs:   pairs[0][0] if len(pairs) else '?' )
    )
    for index, symbol in diacritics_resolutions[key].items():
        diacritics[key][index] = symbol
    
display("Before Conflict Resolution")
display( pd.DataFrame(diacritics_raw).fillna('').T )

display("Deduplicated")
display( pd.DataFrame(diacritics).fillna('').T )

In [None]:
combination_matrix(dataset, x='consonant_diacritic', y='vowel_diacritic', z='grapheme', format=' ')

In [None]:
combination_matrix(dataset, x='grapheme_root', y='vowel_diacritic', z='grapheme', format=' ')

In [None]:
combination_matrix(dataset, x='grapheme_root', y='consonant_diacritic', z='grapheme', format=' ')

## Sanity Checking = Found Dataset BUG!

This combination_matrix lists 1292 unique grapheme combinations, which is 3 fewer than the 1295 unique graphemes listed in the training dataset. Something is WRONG!

Found a discrepancy BUG in the dataset. The following root/vowel/consonant keys have multiple unicode grapheme renderings! 

{'64-3-2': ['র্তী', 'র্ত্রী'],
 '64-7-2': ['র্তে', 'র্ত্রে'],
 '72-0-2': ['র্দ্র', 'র্দ']}

In [None]:
from itertools import chain
{
    "combinations": len(list(chain( 
        *combination_matrix(dataset, x='consonant_diacritic', y='vowel_diacritic', z='grapheme_root')
        .values.flatten() 
    ))),
    "unique_graphemes": unique['grapheme']
}

- Confirm that there are no null or NaN values in the dataset

In [None]:
dataset.apply(lambda row: row.isnull()).sum()

- Found the BUG! It is in the dataset!
- There are THREE sets of unique root/vowel/consonant keys that have multiple unicode renderings 

In [None]:
( 
    dataset
    .groupby(['grapheme_root', 'vowel_diacritic', 'consonant_diacritic'])
    .nunique(dropna=False) > 1 
).sum()

In [None]:
( 
    dataset
    .groupby(['grapheme_root', 'vowel_diacritic', 'consonant_diacritic'])
    .nunique(dropna=False) > 1
).query("grapheme != False")

In [None]:
multilabled_graphemes = {
    "64-3-2": dataset.query("grapheme_root == 64 & vowel_diacritic == 3 & consonant_diacritic == 2")['grapheme'].unique().tolist(),
    "64-7-2": dataset.query("grapheme_root == 64 & vowel_diacritic == 7 & consonant_diacritic == 2")['grapheme'].unique().tolist(),
    "72-0-2": dataset.query("grapheme_root == 72 & vowel_diacritic == 0 & consonant_diacritic == 2")['grapheme'].unique().tolist(),
}
multilabled_graphemes

As discovered by @ren4yu[[1](https://www.kaggle.com/c/bengaliai-cv19/discussion/133735#763562)] the unicode itself is encoded using the root/vowel/consonant diacritics as a multibyte string, and the unicode consortium have implemented multiple renderings of the same grapheme by allowing duplicate diacritics within the unicode. 
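
A minimal sketch of what this looks like at the code point level, using the standard library `unicodedata` module. The two escaped strings are my transcription of the '64-3-2' pair listed earlier and are an assumption; the exact code point sequences stored in `train.csv` should be checked against the data.

```python
import unicodedata

# Assumed code point sequences for the two renderings of key '64-3-2'
shorter = "\u09b0\u09cd\u09a4\u09c0"              # RA, VIRAMA, TA, VOWEL SIGN II
longer  = "\u09b0\u09cd\u09a4\u09cd\u09b0\u09c0"  # RA, VIRAMA, TA, VIRAMA, RA, VOWEL SIGN II

def code_point_names(grapheme):
    """Return the official Unicode name of every code point in the string."""
    return [unicodedata.name(char) for char in grapheme]

for grapheme in (shorter, longer):
    print(grapheme, len(grapheme), code_point_names(grapheme))

# Code points present in the longer rendering but absent from the shorter one
extra = len(longer) - len(shorter)
print("extra code points:", extra)
```

Both strings share the same leading and trailing code points; the longer one carries an additional virama plus letter pair in the middle, which is how two distinct strings can end up under one root/vowel/consonant key.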

This potentially opens up another data source for investigation, which is to explore the full range of diacritic combinations within the unicode specification.

The paper [Fonts-2-Handwriting: A Seed-Augment-Train framework for universal digit classification](https://arxiv.org/pdf/1905.08633.pdf) also suggests that it may be possible to generate synthetic data for handwriting recognition by rendering each of the unicode graphemes using various Bengali fonts.

In [None]:
multilabled_grapheme_list   = list(chain(*multilabled_graphemes.values()))
multilabled_grapheme_dict   = { grapheme: list(grapheme) for grapheme in multilabled_grapheme_list }
display(multilabled_grapheme_list)
display(multilabled_grapheme_dict)

This simply counts how many times each of these multi-keyed unicode graphemes appears in the training dataset.

In [None]:
dataset[ dataset['grapheme'].isin(multilabled_grapheme_list) ].groupby(['grapheme']).count()['image_id'].to_dict()