# Kuzushiji Recognition Complete Guide

## *Build a model to transcribe ancient Kuzushiji into contemporary Japanese characters*

<img src="http://static.mxbi.net/umgy001-010-smallannomasked.jpg" height="600" width="600">

---

Imagine the history contained in a thousand years of books. What stories are in those books? What knowledge can we learn from the world before our time? What was the weather like 500 years ago? What happened when Mt. Fuji erupted? How can one fold 100 cranes using only one piece of paper? The answers to these questions are in those books.

Japan has millions of books and over a billion historical documents such as personal letters or diaries preserved nationwide. Most of them cannot be read by the majority of Japanese people living today because they were written in “Kuzushiji”.

Even though Kuzushiji, a cursive writing style, had been used in Japan for over a thousand years, there are very few fluent readers of Kuzushiji today (only 0.01% of modern Japanese natives). Due to the lack of available human resources, there has been a great deal of interest in using Machine Learning to automatically recognize these historical texts and transcribe them into modern Japanese characters. 


**<span style="color:green">kernel completed!</span>**

---

### Content

1. **[EDA]()**
     - New ```df_train```
     - missing data
     - char stats
     - top-10 chars
     - top-100 chars (plot)
     
     
2. **[Simple Visualization]()**
3. **[KMNIST]()**
    - Save the 683464 char/digit crops in ```kminst.zip``` and their labels in ```info.csv```
    - Examples of obtained chars from a random image.
    
    
4. **[KMNIST Classification]()**
    - Simple KNN
    - Deep Learning
    
    
5. **[Simple Predictions Visualization]()**

### Other kernels

I will create other kernels for **different tasks**: digitizing images, training models, running inference, etc. Here you can check them:

### More information

- [Must-read material](https://www.kaggle.com/c/kuzushiji-recognition/discussion/100579#latest-580915)
- [Worldwide Competition to Develop AI for Historical Japanese Character (Kuzushiji) Recognition](https://www.nii.ac.jp/en/news/release/2019/0710.html)
- [KMNIST Dataset](http://codh.rois.ac.jp/kmnist/index.html.en)
- [Osaka University](http://www.digitalhumanities.org/dhq/vol/11/1/000281/000281.html)

<br>

In [None]:
from PIL import Image, ImageDraw, ImageFont
from os import listdir
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import os
import gc
import sys
import seaborn as sns
import cv2
import shutil
from sklearn.neighbors import KNeighborsClassifier
from tqdm.notebook import tqdm  # notebook progress bar (tqdm_notebook is deprecated)

%matplotlib inline

print (os.listdir('../input/'))
print("Ready!")

**Load packages**

**Install ```NotoSans```**

In [None]:
fontsize = 50

# From https://www.google.com/get/noto/
!wget -q --show-progress https://noto-website-2.storage.googleapis.com/pkgs/NotoSansCJKjp-hinted.zip
!unzip -p NotoSansCJKjp-hinted.zip NotoSansCJKjp-Regular.otf > NotoSansCJKjp-Regular.otf
!rm NotoSansCJKjp-hinted.zip

font = ImageFont.truetype('./NotoSansCJKjp-Regular.otf', fontsize, encoding='utf-8')

### Utils
> from: [Kuzushiji Visualisation](https://www.kaggle.com/anokas/kuzushiji-visualisation)

1. ```visualize_training_data```
2. ```visualize_test_data```
3. ```visualize_predictions```

In [None]:
# This function takes in a filename of an image, and the labels in the string format given in train.csv, and returns an image containing the bounding boxes and characters annotated
def visualize_training_data(image_fn, labels):
    # Convert annotation string to array
    labels = np.array(labels.split(' ')).reshape(-1, 5)
    
    # Read image
    imsource = Image.open(image_fn).convert('RGBA')
    bbox_canvas = Image.new('RGBA', imsource.size)
    char_canvas = Image.new('RGBA', imsource.size)
    bbox_draw = ImageDraw.Draw(bbox_canvas) # Separate canvases for boxes and chars so a box doesn't cut off a character
    char_draw = ImageDraw.Draw(char_canvas)

    for codepoint, x, y, w, h in labels:
        x, y, w, h = int(x), int(y), int(w), int(h)
        char = unicode_map[codepoint] # Convert codepoint to actual unicode character

        # Draw bounding box around character, and unicode character next to it
        bbox_draw.rectangle((x, y, x+w, y+h), fill=(255, 255, 255, 0), outline=(255, 0, 0, 255))
        char_draw.text((x + w + fontsize/4, y + h/2 - fontsize), char, fill=(0, 0, 255, 255), font=font)

    imsource = Image.alpha_composite(Image.alpha_composite(imsource, bbox_canvas), char_canvas)
    imsource = imsource.convert("RGB") # Remove alpha for saving in jpg format.
    return np.asarray(imsource)



def visualize_test_data(image_fn):
    
    # Read image
    imsource = Image.open(image_fn).convert('RGBA')
    imsource = imsource.convert("RGB") # Remove alpha for saving in jpg format.
    return np.asarray(imsource)
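
The `labels` strings in `train.csv` are flat, space-separated sequences with five fields per character (codepoint, x, y, width, height), which is what `reshape(-1, 5)` above relies on. As a minimal sketch, the hypothetical helper `parse_labels` below (not part of the original kernel) performs the same split in plain Python and returns an empty list for missing labels instead of raising. The sample string is made up for illustration.

```python
import math

def parse_labels(labels, n_fields=5):
    """Split a label string into (codepoint, int, int, ...) tuples.

    Hypothetical helper: returns [] for missing labels (None or NaN).
    """
    if labels is None or (isinstance(labels, float) and math.isnan(labels)):
        return []
    tokens = labels.split()
    if len(tokens) % n_fields != 0:
        raise ValueError("token count is not a multiple of %d" % n_fields)
    rows = []
    for i in range(0, len(tokens), n_fields):
        codepoint = tokens[i]
        numbers = [int(t) for t in tokens[i + 1:i + n_fields]]
        rows.append((codepoint, *numbers))
    return rows

sample = "U+306F 1231 1465 133 53 U+304C 275 1652 84 69"  # made-up values
boxes = parse_labels(sample)
print(boxes)
```

Passing `n_fields=3` parses the submission format (codepoint, x, y) used by `visualize_predictions` in the same way.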

In [None]:
# This function takes in a filename of an image, and the labels in the string format given in a submission csv, and returns an image with the characters and predictions annotated.
def visualize_predictions(image_fn, labels):
    # Convert annotation string to array
    labels = np.array(labels.split(' ')).reshape(-1, 3)
    
    # Read image
    imsource = Image.open(image_fn).convert('RGBA')
    bbox_canvas = Image.new('RGBA', imsource.size)
    char_canvas = Image.new('RGBA', imsource.size)
    bbox_draw = ImageDraw.Draw(bbox_canvas) # Separate canvases for boxes and chars so a box doesn't cut off a character
    char_draw = ImageDraw.Draw(char_canvas)

    for codepoint, x, y in labels:
        x, y = int(x), int(y)
        char = unicode_map[codepoint] # Convert codepoint to actual unicode character

        # Draw a filled 20x20 marker at the predicted point, and the unicode character next to it
        bbox_draw.rectangle((x-10, y-10, x+10, y+10), fill=(255, 0, 0, 255))
        char_draw.text((x+25, y-fontsize*(3/4)), char, fill=(255, 0, 0, 255), font=font)

    imsource = Image.alpha_composite(Image.alpha_composite(imsource, bbox_canvas), char_canvas)
    imsource = imsource.convert("RGB") # Remove alpha for saving in jpg format.
    return np.asarray(imsource)

# EDA

----

### Load data

> <span style="color:red"> DISCLAIMER </span> Remember to change the ```PATH``` (if necessary)

In [None]:
!ls ../input/

In [None]:
PATH = '../input/kuzushiji-recognition/'
df_train = pd.read_csv(PATH+'train.csv')
df_test = os.listdir(PATH+'test_images/')
unicode_map = {codepoint: char for codepoint, char in pd.read_csv(PATH+'unicode_translation.csv').values}
print ("TRAIN: ", df_train.shape)
print ("TEST: ", len(df_test))
df_train.head()

### Check missing

In [None]:
df_train.isnull().sum()

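276 images have no labels. ```dropna``` would remove them; the call below is left commented out so that these pages stay in ```df_train``` for the low-character analysis later on.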

In [None]:
#df_train.dropna(inplace=True)
df_train.reset_index(inplace=True, drop=True)
print ("TRAIN: ", df_train.shape)

### Processing
> lazy code, click ```code``` to see.

In [None]:
chars = {}

for i in range (df_train.shape[0]):
    try:
        a = [x for x in df_train.labels.values[i].split(' ') if x.startswith('U')]
        n_a = int(len(a))        
        for j in a:
            if j not in chars: chars[j]=1
            else:
                chars[j]+=1
                
        a = " ".join(a)
        
    except AttributeError:
        a = None
        n_a = 0
        
    df_train.loc[i,'chars'] = a
    df_train.loc[i,'n_chars'] = n_a
    
df_train.head()
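
The loop above keeps only the tokens that start with `U` (the codepoints) and counts them per image and over the whole training set. A compact sketch of the same counting with `collections.Counter` follows. It runs on made-up label strings rather than the real `train.csv`, and the helper name `codepoints` is hypothetical.

```python
from collections import Counter

def codepoints(labels):
    """Return the codepoint tokens of one label string ([] if labels are missing)."""
    if not isinstance(labels, str):
        return []
    return [tok for tok in labels.split() if tok.startswith("U")]

# Made-up label strings standing in for df_train.labels
label_strings = [
    "U+306F 10 20 30 40 U+304C 50 60 70 80",
    "U+306F 5 5 20 20",
    None,  # an image without labels
]

char_counts = Counter()  # overall frequency of each codepoint
n_chars = []             # number of characters per image
for labels in label_strings:
    cps = codepoints(labels)
    char_counts.update(cps)
    n_chars.append(len(cps))

print(char_counts.most_common(2))
print(n_chars)
```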

**char stats**

In [None]:
print ("MAX chars in a picture= ", df_train.n_chars.max())
print ("MIN chars in a picture= ", df_train.n_chars.min())
print ("MEAN chars in a picture= ", df_train.n_chars.mean())

## Most common chars

In [None]:
chars = pd.DataFrame(list(chars.items()), columns=['char', 'count'])
chars['jp_char'] = chars['char'].map(unicode_map)
print (" >> Chars dataframe <<")
print ("Number of chars: ",chars.shape[0])
chars.to_csv("chars_freq.csv",index=False)
chars.head()

**TOP-10**

In [None]:
chars.sort_values(by=['count'], ascending=False).head(10).reset_index()

**TOP-100**

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(22,20))
ax = sns.barplot(y="char", x="count", data=chars.sort_values(by=['count'], ascending=False).head(100))
ax.set_title("Character frequency in images (top 100)")
plt.show()

## Rare chars

In [None]:
print ('Total chars', chars.shape[0])
print ('<= 10 freq', chars[chars['count'] <= 10].shape[0])

In [None]:
rare = chars[chars['count'] <= 10]
print (rare.shape)
rare.head()

In [None]:
rare.to_csv('rare_chars.csv', index=False)
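
Besides counting the rare classes, it can help to know what share of all character occurrences they cover. The sketch below uses the same `<= 10` threshold on a made-up frequency table; `rare_summary` is a hypothetical helper and the counts are not taken from the competition data.

```python
def rare_summary(counts, threshold=10):
    """Return (number of rare classes, share of all occurrences they cover)."""
    rare = {char: n for char, n in counts.items() if n <= threshold}
    total = sum(counts.values())
    share = sum(rare.values()) / total if total else 0.0
    return len(rare), share

# Made-up frequencies, for illustration only
counts = {"U+306B": 500, "U+306E": 480, "U+4E00": 10, "U+9F8D": 2, "U+5B9D": 8}
n_rare, share = rare_summary(counts)
print(n_rare, share)
```

A large number of rare classes with a small share of occurrences would suggest a long-tailed label distribution, which matters when choosing a classifier and a validation split.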

## Images with 10 or fewer chars (including none)

In [None]:
lowchar = df_train[df_train.n_chars <= 10]
print ('lowchar images ',lowchar.shape[0], lowchar.shape[0]/ df_train.shape[0])
lowchar.head()

In [None]:
for image_fn in lowchar.image_id:
    image_fn = PATH+'train_images/'+image_fn+'.jpg'
    imsource = Image.open(image_fn).convert('RGBA')
    imsource = imsource.convert("RGB") # Remove alpha for saving in jpg format.
    imsource = np.asarray(imsource)
    plt.figure(figsize=(10, 10))
    plt.title(image_fn)
    plt.axis("off")
    plt.imshow(imsource, interpolation='lanczos')
    plt.show() 

In [None]:
print (lowchar.shape)
lowchar = lowchar.dropna()  # reassign instead of modifying a slice of df_train in place
print (lowchar.shape)
lowchar.to_csv('train_lowchar.csv',index=False)

<br>
## Books

In [None]:
df_train["title"]= df_train["image_id"].str.split("_", n = 1, expand = True)[0]
#df_train["chapter"]= df_train["image_id"].str.split("_", n = 2, expand = True)[1]
#df_train["page"]= df_train["image_id"].str.split("_", n = 3, expand = True)[2]
df_train.head()

In [None]:
print (df_train['title'].nunique())
df_train['title'].unique()[0:10]

In [None]:
book = df_train[df_train['title']== '200006663'].reset_index(drop=True)
book
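
The `title` column is simply everything before the first underscore of `image_id`. The sketch below repeats that split in plain Python on a few image ids mentioned in this kernel; `book_title` is a hypothetical helper. One thing it makes visible: an id without an underscore, such as `hnsd007-039`, is returned unchanged, so for ids of that shape each page probably ends up as its own "title".

```python
from collections import Counter

def book_title(image_id):
    """Everything before the first underscore, mirroring str.split('_', n=1)[0]."""
    return image_id.split("_", 1)[0]

image_ids = [
    "200021660-00023_2",
    "100249537_00013_2",
    "100249537_00003_2",
    "hnsd007-039",
    "200014685-00003_1",
]

pages_per_title = Counter(book_title(i) for i in image_ids)
print(pages_per_title)
```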

### Visualize book
**Click to see the function code**

In [None]:
def visualize_book(title, df_train):
    # Select the pages that belong to this book
    book = df_train[df_train['title']== title].reset_index(drop=True)
    print ('Book ', title)
    for i in book.index:
        img,labels,_,_,_ = book.values[i]
        viz = visualize_training_data(PATH+'train_images/{}.jpg'.format(img), labels)
        plt.figure(figsize=(15, 15))
        plt.title(img)
        plt.axis("off")
        plt.imshow(viz, interpolation='lanczos')
        plt.show()

In [None]:
visualize_book('200006663', df_train)

#### Other interesting books

In [None]:
visualize_book('200014685-00002', df_train)

In [None]:
visualize_book('200014685-00003', df_train)

In [None]:
print ("TRAIN: ", df_train.shape)

# Visualization

#### Click ```output``` to see the images.

In [None]:
np.random.seed(1337)

for i in range(2):
    img,labels,_,_,_ = df_train.values[np.random.randint(len(df_train))]
    viz = visualize_training_data(PATH+'train_images/{}.jpg'.format(img), labels)
    plt.figure(figsize=(15, 15))
    plt.title(img)
    plt.axis("off")
    plt.imshow(viz, interpolation='lanczos')
    plt.show()

## Visualize Test

In [None]:
for img in df_test[0:2]:
    viz = visualize_test_data(PATH+'test_images/{}'.format(img))
    plt.figure(figsize=(15, 15))
    plt.title(img)
    plt.axis("off")
    plt.imshow(viz, interpolation='lanczos')
    plt.show()


# KMNIST

----

**<span style="color:red">DISCLAIMER</span>**
> In this part I saved the 683464 char/digit crops in ```kminst.zip``` and their labels in ```info.csv```. You don't have to run this code if you can import those files: check the ``` output ``` of version **V4** of this kernel and download them.

**get_char**
> Gets all the characters from the image ```img_id``` and saves them in ```kminst```. The image names have the format ```img_id_idx.jpg```, where ```idx``` is in range (0, number of chars in the image).

In [None]:
def get_char(img_id, labels):
    
    image_fn = PATH+'train_images/{}.jpg'.format(img_id)
    # Convert annotation string to array
    labels = np.array(labels.split(' ')).reshape(-1, 5)
    # Read image
    imsource = Image.open(image_fn).convert('RGBA')
    img = np.asarray(imsource.convert("RGB"))

    info = []
    
    for idx, (codepoint, x, y, w, h) in enumerate(labels):
        x, y, w, h = int(x), int(y), int(w), int(h)
        try:
            char = unicode_map[codepoint] # Convert codepoint to actual unicode character
        except KeyError:
            char = "e" # https://www.kaggle.com/c/kuzushiji-recognition/discussion/100712#latest-580747
        
        # crop char
        #print (idx,x,y,w,h,char)
        crop_img = img[y:y+h, x:x+w]
        result = Image.fromarray(crop_img, mode='RGB')
        name = img_id+'_{}.jpg'.format(idx)
        result.save('kminst/'+name)
        
        info.append((name,codepoint))
        
    del imsource, img  # result and name only exist if at least one crop was saved
    gc.collect()
    
    return info
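
`get_char` crops each character with `img[y:y+h, x:x+w]`. If a box ever extended past the image border (an assumption, I have not checked whether this occurs in the data), NumPy slicing would return a smaller or empty crop, and a negative start index would wrap around. A minimal sketch of a guard, using a hypothetical `clip_box` helper that is not part of the original kernel:

```python
def clip_box(x, y, w, h, img_w, img_h):
    """Clip a box to the image and return (x0, y0, x1, y1), or None if nothing is left."""
    x0 = max(0, min(x, img_w))
    y0 = max(0, min(y, img_h))
    x1 = max(0, min(x + w, img_w))
    y1 = max(0, min(y + h, img_h))
    if x1 <= x0 or y1 <= y0:
        return None  # the box lies entirely outside the image
    return x0, y0, x1, y1

print(clip_box(10, 20, 30, 40, 100, 100))   # fully inside
print(clip_box(90, 90, 30, 30, 100, 100))   # cut at the right/bottom border
print(clip_box(120, 0, 10, 10, 100, 100))   # entirely outside
```

With the returned tuple, the crop would become `img[y0:y1, x0:x1]`, and boxes that return `None` could be skipped.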

**Create the folder ```kminst```** where I'm going to save all the chars from all the pictures.

In [None]:
!mkdir kminst
!ls

#### Save all the digits/chars in ```kminst```

In [None]:
'''
generated = 0
info = []

for i in tqdm(df_train.index):
    img, labels = df_train.values[i][:2]  # image_id and labels are the first two columns
    info += get_char(img, labels)
    generated += int(df_train.n_chars.values[i])
    
    if (i+1)%500 == 0 or i==df_train.index[-1]:
        # save memory
        shutil.make_archive('kminst_'+str(i//500), 'zip', 'kminst')
        print (i+1,"\t>> generated ...", generated)
        shutil.rmtree('kminst', ignore_errors=True)
        os.mkdir('kminst')
'''
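
The disabled cell above zips the `kminst` folder every 500 images and then empties it, so the working directory never holds all crops at once. The sketch below shows that batching pattern in isolation. It is a simplification under stated assumptions: it batches by file rather than by source image, writes placeholder bytes instead of real crops, and works in a temporary directory; `archive_in_batches` is a hypothetical name.

```python
import os
import shutil
import tempfile
import zipfile

def archive_in_batches(file_names, work_dir, batch_size=500):
    """Write placeholder files into work_dir/kminst and zip them batch by batch."""
    crop_dir = os.path.join(work_dir, "kminst")
    os.makedirs(crop_dir, exist_ok=True)
    archives = []
    for i, name in enumerate(file_names):
        # Stand-in for saving a real character crop
        with open(os.path.join(crop_dir, name), "wb") as handle:
            handle.write(b"placeholder")
        last = i == len(file_names) - 1
        if (i + 1) % batch_size == 0 or last:
            base = os.path.join(work_dir, "kminst_{}".format(i // batch_size))
            archives.append(shutil.make_archive(base, "zip", crop_dir))
            # Empty the folder so the next archive only holds new files
            shutil.rmtree(crop_dir)
            os.makedirs(crop_dir)
    return archives

with tempfile.TemporaryDirectory() as tmp:
    names = ["crop_{}.jpg".format(i) for i in range(5)]
    archives = archive_in_batches(names, tmp, batch_size=2)
    contents = []
    for path in archives:
        with zipfile.ZipFile(path) as zf:
            contents.append(sorted(zf.namelist()))

print([os.path.basename(p) for p in archives])
print(contents)
```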

In [None]:
!rm -r kminst
!ls

**All as 1 zip ```kminst.zip```**

In [None]:
#shutil.make_archive('kminst', 'zip', 'kminst')
#!rm -r kminst
#!ls

#### Generate and save ```info```

In [None]:
info[0:5]

In [None]:
infok = pd.DataFrame(columns=['char_id','unicode'])
infok['char_id'] = [i[0] for i in info]
infok['unicode'] = [i[1] for i in info]
print ("TOTAL KMNIST = ", infok.shape[0])
infok.to_csv('info.csv',index=False)
infok.head()

## Examples

**These images make good examples**

In [None]:
# Image ids that make good examples
examples = [
    "200021660-00023_2",
    "100249537_00013_2",
    "hnsd007-039",
    "100249537_00003_2",
    "200014685-00003_1",
    "200014685-00016_2",
]

**get_char_example**
> visualize what ```get_char``` does.

In [None]:
def get_char_example(image_fn, labels):
    # Convert annotation string to array
    labels = np.array(labels.split(' ')).reshape(-1, 5)
    
    # Read image
    imsource = Image.open(image_fn).convert('RGBA')
    img = np.asarray(imsource.convert("RGB"))
    bbox_canvas = Image.new('RGBA', imsource.size)
    char_canvas = Image.new('RGBA', imsource.size)
    bbox_draw = ImageDraw.Draw(bbox_canvas) # Separate canvases for boxes and chars so a box doesn't cut off a character
    char_draw = ImageDraw.Draw(char_canvas)
    for codepoint, x, y, w, h in labels:
        x, y, w, h = int(x), int(y), int(w), int(h)
        char = unicode_map[codepoint] # Convert codepoint to actual unicode character
        # Draw bounding box around character, and unicode character next to it
        bbox_draw.rectangle((x, y, x+w, y+h), fill=(255, 255, 255, 0), outline=(255, 0, 0, 255))
        char_draw.text((x + w + fontsize/4, y + h/2 - fontsize), char, fill=(0, 0, 255, 255), font=font)
        
        # crop char
        print (x,y,w,h,char)
        crop_img = img[y:y+h, x:x+w]
        plt.axis("off")
        plt.imshow(np.asarray(crop_img), interpolation='lanczos')
        plt.show()

    imsource = Image.alpha_composite(Image.alpha_composite(imsource, bbox_canvas), char_canvas)
    imsource = imsource.convert("RGB") # Remove alpha for saving in jpg format.
    return np.asarray(imsource)

In [None]:
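# Select the columns by name so that extra columns (chars, n_chars, title) don't break the unpacking
img, labels = df_train.loc[df_train['image_id']=="umgy004-011", ['image_id', 'labels']].values[0]
print ("IMAGE: ", img)
print (">> chars:", int(df_train.loc[df_train['image_id']==img, 'n_chars'].iloc[0]),"\n")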

viz = get_char_example(PATH+'train_images/{}.jpg'.format(img), labels)
plt.figure(figsize=(15, 15))
plt.title(img)
plt.axis("off")
plt.imshow(viz, interpolation='lanczos')
plt.show()

# INFERENCE


**<span style="color:red">DISCLAIMER</span>**
> The following code will perform the **inference** on the test set. For more information about the training (detector and classifier) please check the official github: https://github.com/mv-lab/kuzushiji-recognition

We take the predictions from the ```detector```, ```classify``` each detected symbol, and generate the ```submission``` file.
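
As a rough sketch of the last step, the snippet below turns detected boxes and their predicted codepoints into the space-separated `codepoint x y` triplets that `visualize_predictions` parses with `reshape(-1, 3)`. The detector output format (a list of `(codepoint, (x, y, w, h))` pairs) and the helper names are assumptions made for illustration; the actual training and inference code lives in the GitHub repository linked above.

```python
def box_center(x, y, w, h):
    """Integer centre point of a box given as top-left corner plus size."""
    return x + w // 2, y + h // 2

def to_prediction_string(detections):
    """Join (codepoint, (x, y, w, h)) pairs into 'codepoint cx cy ...' format."""
    parts = []
    for codepoint, (x, y, w, h) in detections:
        cx, cy = box_center(x, y, w, h)
        parts.append("{} {} {}".format(codepoint, cx, cy))
    return " ".join(parts)

# Made-up detections, for illustration only
detections = [
    ("U+306F", (1200, 1400, 62, 130)),
    ("U+304C", (250, 1600, 50, 104)),
]
pred_string = to_prediction_string(detections)
print(pred_string)
```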

<br>
# Visualize Predictions

source: [Kuzushiji Visualisation](https://www.kaggle.com/anokas/kuzushiji-visualisation)
> For the test set, you're only required to predict a single point within each bounding box instead of the entire bounding box (ideally, the centre of the bounding box). It may also be useful to visualise the box centres on the image:

In [None]:
image_fn = PATH+'test_images/test_030d9355.jpg'
pred_string = 'U+306F 1231 1465 U+304C 275 1652 U+3044 1495 1218 U+306F 436 1200 U+304C 800 2000 U+3044 1000 300' # Prediction string in submission file format
viz = visualize_predictions(image_fn, pred_string)

plt.figure(figsize=(15, 15))
plt.imshow(viz, interpolation='lanczos')
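
Since a prediction only needs to be a point inside the right bounding box, a quick sanity check on training pages is to count how many predicted points fall inside a ground-truth box with the same codepoint. The sketch below is a simplified illustration under that reading of the quoted description, not the competition's official metric; the helper names and the sample values are made up.

```python
def point_in_box(px, py, box):
    """True if the point lies inside the (x, y, w, h) box, borders included."""
    x, y, w, h = box
    return x <= px <= x + w and y <= py <= y + h

def count_hits(predictions, truth):
    """Count predictions that land in a same-label box; each box is matched at most once.

    predictions: list of (codepoint, px, py)
    truth:       list of (codepoint, x, y, w, h)
    """
    unmatched = list(truth)
    hits = 0
    for codepoint, px, py in predictions:
        for i, (true_cp, *box) in enumerate(unmatched):
            if codepoint == true_cp and point_in_box(px, py, box):
                hits += 1
                del unmatched[i]  # a box cannot be credited twice
                break
    return hits

truth = [("U+306F", 1200, 1400, 62, 130), ("U+304C", 250, 1600, 50, 104)]
predictions = [
    ("U+306F", 1231, 1465),  # inside the first box
    ("U+304C", 800, 2000),   # right label, wrong place
    ("U+306F", 1231, 1465),  # duplicate of an already matched box
]
hits = count_hits(predictions, truth)
print(hits, "of", len(predictions), "predictions hit a matching box")
```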