Charguana Comes to the Rescue
====

To find the distribution of `hiragana`, `katakana` and `kanji` within the list of lemmas, we can use the `charguana` library.

[`Charguana`](https://github.com/alvations/charguana) is a "character vomiting" library, i.e. it contains manually labelled character sets that are useful when dealing with human language orthography in Unicode.

First, let's retrieve the relevant Japanese character sets.


```python
from charguana import get_charset

# Syllables. 
hiragana = set(get_charset('hiragana'))
katakana = set(get_charset('katakana'))

# Punctuations.
cjk_punct = set(get_charset('cjk_punctuation'))

# Kanji characters are other characters in the "japanese" set.
kanji = set(get_charset('japanese')) - hiragana - katakana - cjk_punct
```

**Note:** Sadly, it's not supported by Kaggle kernels, so we'll have to plug in the lists manually.

In [1]:
hiragana = {'え', 'ぁ', 'で', 'め', 'を', 'こ', 'い', '゙', '゜', 'あ', 'ゎ', 'ぺ', 'げ', 'る', 'ず', 'ち', 'ぐ', 'ゖ', 'ゆ', 'ゞ', 'ご', 'ね', 'の', 'ゑ', 'し', 'っ', 'び', 'ぶ', 'へ', 'わ', 'ほ', 'ゅ', 'さ', 'べ', 'ぜ', 'も', 'だ', 'ぱ', 'ぇ', 'ろ', 'ゕ', 'く', 'ま', 'ば', 'ざ', 'は', 'お', 'ぼ', 'ゔ', 'よ', 'ぞ', 'つ', 'が', '\u3097', 'り', 'そ', 'ん', 'ら', 'ゝ', 'れ', 'ぃ', 'ぅ', 'む', 'ぴ', '゛', 'す', 'ぽ', 'み', 'か', 'や', 'ゐ', 'ひ', 'ぷ', 'け', 'せ', '\u3098', 'て', 'ゃ', 'づ', 'ふ', 'ょ', 'じ', '゚', '\u3040', 'ゟ', 'ぉ', 'ぢ', 'な', 'に', 'た', 'ぎ', 'ぬ', 'ど', 'と', 'き', 'う'}
katakana = {'ヅ', 'プ', 'ヂ', 'ァ', 'ヵ', 'ヿ', 'ブ', 'ョ', 'ム', 'ポ', 'エ', 'ノ', 'カ', 'シ', 'ュ', 'モ', 'ナ', 'ト', 'ェ', 'ロ', 'ハ', 'オ', 'ヒ', 'ホ', 'ィ', 'ペ', 'コ', '・', '゠', 'ヶ', 'ク', 'メ', 'ギ', 'ゼ', 'ユ', 'パ', 'ビ', 'ソ', 'ピ', 'ヲ', 'ス', 'ゾ', 'ン', 'ヤ', 'リ', 'ォ', 'ッ', 'ウ', 'ツ', 'ザ', 'グ', 'ベ', 'フ', 'ヘ', 'ニ', 'ジ', 'ゥ', 'キ', 'セ', 'ヱ', 'ャ', 'ヽ', 'ケ', 'ゴ', 'ヾ', 'ラ', 'ヌ', 'タ', 'ヺ', 'サ', 'ヴ', 'ダ', 'レ', 'ヮ', 'テ', 'ヰ', 'ガ', 'ヹ', 'ヷ', 'マ', 'ミ', 'ー', 'ア', 'デ', 'ボ', 'ド', 'チ', 'ゲ', 'イ', 'バ', 'ル', 'ネ', 'ズ', 'ヨ', 'ヸ', 'ワ'}
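
As an aside, the two sets above appear to cover the full Unicode Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks, so a rough equivalent can be generated from code point ranges. This is only a sketch under that assumption, not how charguana builds its lists; the names `hiragana_block` and `katakana_block` are made up for illustration.

```python
# Assumption: the manual lists mirror the Unicode Hiragana and Katakana blocks.
hiragana_block = {chr(cp) for cp in range(0x3040, 0x30A0)}  # U+3040..U+309F
katakana_block = {chr(cp) for cp in range(0x30A0, 0x3100)}  # U+30A0..U+30FF

# Spot-check a few characters against the generated sets.
print('\u3042' in hiragana_block)  # HIRAGANA LETTER A
print('\u30a2' in katakana_block)  # KATAKANA LETTER A
print('\u30fc' in katakana_block)  # prolonged sound mark, part of the Katakana block
```

Note that both blocks include a few unassigned or combining code points, so these sets are slightly broader than "the syllables" in a strict sense.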

To check whether all characters in a string belong to the same character set, we can loop through each character and check it like this:

In [2]:
# E.g. we want to check whether ある is hiragana.
s = 'ある'
all(ch in hiragana for ch in s)

Alternatively, we can simply check the first character and assume that a lemma doesn't contain mixed scripts:

In [3]:
s = 'ある'
s[0] in hiragana

Another way is to check whether the set of characters in the string is a subset of the character set, e.g.

In [4]:
set(s).issubset(hiragana)
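
To see how these checks behave on edge cases, here is a small sketch. It assumes a stand-in set `hiragana_block` built from the Unicode Hiragana block (U+3040 to U+309F) rather than the list above, and the helper names `all_in` and `subset_of` are hypothetical. A lemma with a repeated character such as ああ has fewer distinct characters than its length, so the checks compare characters against the set rather than comparing lengths.

```python
# Stand-in for the hiragana set (assumption: the full Unicode Hiragana block).
hiragana_block = {chr(cp) for cp in range(0x3040, 0x30A0)}

def all_in(s, charset):
    """True if every character of s is in charset."""
    return all(ch in charset for ch in s)

def subset_of(s, charset):
    """Same test, phrased as a set operation."""
    return set(s).issubset(charset)

# Plain hiragana, repeated hiragana, and a kanji + hiragana lemma.
for s in ['\u3042\u308b', '\u3042\u3042', '\u601d\u3046']:
    print(s, all_in(s, hiragana_block), subset_of(s, hiragana_block),
          s[0] in hiragana_block)
```

The first-character shortcut is the cheapest of the three, but it is only as good as the assumption that lemmas never mix scripts.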

Phew, now the heavy lifting is over, let's count!

In [5]:
from collections import defaultdict

import pandas as pd

df = pd.read_csv('../input/japanese-lemma-frequency/japanese_lemmas.csv')
df.head()

If it's not hiragana/katakana, we can treat it as kanji.

In [6]:
def is_charset(s, charset):
    # Subset test, so lemmas with repeated characters (e.g. ああ) still match.
    return set(s).issubset(charset)

# We'll store the charset as the keys
# and list of row index as the values.
charset2idx = defaultdict(list)

for idx, row in df.iterrows():
    lemma = row['lemma']
    if is_charset(lemma, hiragana):
        k = 'hiragana'
    elif is_charset(lemma, katakana):
        k = 'katakana'
    else: # i.e. Kanji.
        k = 'kanji'
    charset2idx[k].append(idx)

In [7]:
num_lemmas = len(df)
print(len(charset2idx['hiragana']), 'out of', num_lemmas, 'are hiragana.')
print(len(charset2idx['katakana']), 'out of', num_lemmas, 'are katakana.')
print(len(charset2idx['kanji']), 'out of', num_lemmas, 'are kanji.')
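
If the row indices aren't needed, the same tally can be written with a `Counter` over a classifying function. This is a minimal sketch on a toy list of lemmas; `classify`, the toy data, and the block-based stand-in sets are assumptions for illustration, not part of the dataset.

```python
from collections import Counter

# Stand-in sets (assumption: full Unicode Hiragana and Katakana blocks).
hiragana_block = {chr(cp) for cp in range(0x3040, 0x30A0)}
katakana_block = {chr(cp) for cp in range(0x30A0, 0x3100)}

def classify(lemma):
    """Label a lemma by script, falling back to 'kanji' for everything else."""
    chars = set(lemma)
    if chars <= hiragana_block:
        return 'hiragana'
    if chars <= katakana_block:
        return 'katakana'
    return 'kanji'

# Toy lemmas: one hiragana, one katakana, one kanji, one kanji + hiragana.
toy_lemmas = ['\u3042\u308b', '\u30c6\u30ec\u30d3', '\u65e5\u672c', '\u601d\u3046']
tally = Counter(classify(lemma) for lemma in toy_lemmas)

total = len(toy_lemmas)
for name, count in tally.most_common():
    print(f'{count} out of {total} ({count / total:.1%}) are {name}.')
```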

Numbers are nice but a picture tells 15,000 words ;P

In [8]:
import matplotlib.pyplot as plt
from matplotlib import rc
 
# Data to plot
labels = 'Hiragana', 'Katakana', 'Kanji', 
sizes = [len(charset2idx['hiragana']),
         len(charset2idx['katakana']),
         len(charset2idx['kanji']), 
        ]
colors = ['lightcoral', 'yellowgreen', 'lightskyblue']
explode = (0.2, 0.1, 0.1)  # offset every slice, the 1st one the most
 
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.rcParams["figure.figsize"] = [16,9]

font = {'size': 22}

rc('font', **font)
plt.show()

## ちょっと待ってください (*wait a minute*), what if there are lemmas with mixed character sets?

Let's go through every character in each lemma and count them =)


In [9]:
from collections import Counter

idx2charset = defaultdict(Counter)

for idx, row in df.iterrows():
    lemma = row['lemma']
    # Iterate through each character in lemma.
    for ch in lemma:
        if ch in hiragana:
            idx2charset[idx]['Hiragana'] += 1
        elif ch in katakana:
            idx2charset[idx]['Katakana'] += 1
        else:
            idx2charset[idx]['Kanji'] += 1

So it seems that there are mixed-script lemmas!

In [10]:
next((idx, idx2charset[idx]) for idx in idx2charset if len(idx2charset[idx]) > 1)

In [11]:
df.iloc[43]

Aha, we see words like **思う** where kanji is mixed with hiragana!
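
To make this concrete, here is roughly what the per-character count looks like for that single lemma. It is a sketch with stand-in sets built from the Unicode Hiragana and Katakana blocks, and `charset_counts` is a hypothetical helper that mirrors the loop above.

```python
from collections import Counter

# Stand-in sets (assumption: full Unicode Hiragana and Katakana blocks).
hiragana_block = {chr(cp) for cp in range(0x3040, 0x30A0)}
katakana_block = {chr(cp) for cp in range(0x30A0, 0x3100)}

def charset_counts(lemma):
    """Count the characters of a lemma per script, with 'Kanji' as the fallback."""
    counts = Counter()
    for ch in lemma:
        if ch in hiragana_block:
            counts['Hiragana'] += 1
        elif ch in katakana_block:
            counts['Katakana'] += 1
        else:
            counts['Kanji'] += 1
    return counts

counts = charset_counts('\u601d\u3046')  # one kanji followed by one hiragana
label = ' + '.join(sorted(counts))       # sorted names give a stable label
print(counts, label)
```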

In [12]:
Counter(' + '.join(sorted(charset_in_lemma.keys())) 
        for idx, charset_in_lemma in idx2charset.items())

In [13]:
charset_counter = Counter(' + '.join(sorted(charset_in_lemma.keys())) 
                          for idx, charset_in_lemma in idx2charset.items())

num_lemmas = len(df)
for cs, count in charset_counter.most_common():
    print(count, 'out of', num_lemmas, 'are', cs)

In [14]:
import matplotlib.pyplot as plt
from matplotlib import rc
 
# Close the previous plot
plt.close()

# Data to plot
labels, sizes = zip(*charset_counter.most_common())

# Red = Hiragana
# Green = Katakana
# Blue = Kanji
# Purple = Kanji + Hiragana
# Yellow = Hiragana + Katakana
# Cyan = Kanji + Katakana

colors = ['lightskyblue', 'orchid', 'yellowgreen', 
          'lightcoral',  'cyan', 'yellow']
explode = (0.1, 0.1, 0.1, 0.1, 0.1, 0.0)  # offset every slice except the last
 
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=False, startangle=120)
plt.axis('equal')
plt.rcParams["figure.figsize"] = [14,7]

font = {'size': 12}

rc('font', **font)
plt.show()

It seems that the katakana+kanji and katakana+hiragana lemmas make up a very small percentage.

Let's take a closer look at them. 

In [15]:
charset2idx = defaultdict(list)

for idx, charset_in_lemma in idx2charset.items():
    _charset = ' + '.join(sorted(charset_in_lemma.keys())) 
    charset2idx[_charset].append(idx)

Katakana + Kanji lemmas
====

In [16]:
for idx in charset2idx['Kanji + Katakana']:
    print(df.iloc[idx]['lemma'])

Katakana + Hiragana lemmas
====

In [17]:
for idx in charset2idx['Hiragana + Katakana']:
    print(df.iloc[idx]['lemma'])

The *ー* character isn't really katakana in these lemmas; it lengthens the vowel of the preceding hiragana syllable.
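
One possible way to account for this is to let the prolonged sound mark (U+30FC) inherit the script of the character before it. The sketch below assumes stand-in sets built from the Unicode blocks; `scripts_in` is a hypothetical helper, not something the notebook or charguana provides.

```python
# Stand-in sets (assumption: full Unicode Hiragana and Katakana blocks).
hiragana_block = {chr(cp) for cp in range(0x3040, 0x30A0)}
katakana_block = {chr(cp) for cp in range(0x30A0, 0x3100)}

PROLONGED_SOUND_MARK = '\u30fc'  # lives in the Katakana block

def scripts_in(lemma):
    """Return the set of scripts in a lemma, treating the long-vowel mark
    as belonging to the script of the preceding character."""
    scripts = []
    for ch in lemma:
        if ch == PROLONGED_SOUND_MARK and scripts:
            scripts.append(scripts[-1])  # inherit from the previous character
        elif ch in hiragana_block:
            scripts.append('Hiragana')
        elif ch in katakana_block:
            scripts.append('Katakana')
        else:
            scripts.append('Kanji')
    return set(scripts)

print(scripts_in('\u3089\u30fc\u3081\u3093'))  # hiragana with a long-vowel mark
print(scripts_in('\u30e9\u30fc\u30e1\u30f3'))  # the same word in katakana
```

A mark at the very start of a lemma has nothing to inherit from, so in this sketch it falls through to the katakana branch.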

Are kanji really kanji?
====

If we take a look at the first 20 kanji lemmas, it seems like some of them are romanji and punctuation!

In [18]:
for idx in charset2idx['Kanji'][:20]:
    print(df.iloc[idx]['lemma'])
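
A quick way to see what a suspicious character really is, without any external data, is to ask the standard library for its Unicode name. This is a small sketch; `describe` is a hypothetical helper and the sample characters are chosen for illustration.

```python
import unicodedata

def describe(ch):
    """Return the Unicode name of a character, or 'UNKNOWN' if it has none."""
    return unicodedata.name(ch, 'UNKNOWN')

# A kanji, a Latin letter, a CJK punctuation mark, and the long-vowel mark.
for ch in ['\u601d', 'A', '\u3001', '\u30fc']:
    print(repr(ch), describe(ch))
```

Real kanji show up with names starting with `CJK UNIFIED IDEOGRAPH`, which makes the Latin letters and punctuation that fell into the "kanji" bucket easy to spot.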

Let's redo the counting by cheating a little: I've pickled the character sets from charguana and uploaded them as a Kaggle dataset.

In [19]:
import pickle
with open('../input/charguana/japanese.pkl', 'rb') as fin:
    japanese = pickle.load(fin)

katakana = set(japanese['katakana'])
hiragana = set(japanese['hiragana'])
kanji = set(japanese['kanji'])
romanji = set(japanese['romanji'])

In [20]:
from collections import Counter

idx2charset = defaultdict(Counter)
otherchars = set()
for idx, row in df.iterrows():
    lemma = row['lemma']
    # Iterate through each character in lemma.
    for ch in lemma:
        if ch in hiragana:
            idx2charset[idx]['Hiragana'] += 1
        elif ch in katakana:
            idx2charset[idx]['Katakana'] += 1
        elif ch in romanji:
            idx2charset[idx]['Romanji'] += 1
        elif ch in kanji:
            idx2charset[idx]['Kanji'] += 1
        else:
            otherchars.add(ch)

It seems like we're still letting some slip through the gaps...
Looking at each one of the `otherchars`, they seem to fall into the **romanji** category.

In [21]:
''.join(sorted(otherchars))

Let's add them to **romanji** and recount.

In [22]:
romanji = set(japanese['romanji']).union(otherchars)

In [23]:
idx2charset = defaultdict(Counter)
otherchars = set()
for idx, row in df.iterrows():
    lemma = row['lemma']
    # Iterate through each character in lemma.
    for ch in lemma:
        if ch in hiragana:
            idx2charset[idx]['Hiragana'] += 1
        elif ch in katakana:
            idx2charset[idx]['Katakana'] += 1
        elif ch in romanji:
            idx2charset[idx]['Romanji'] += 1
        elif ch in kanji:
            idx2charset[idx]['Kanji'] += 1
        else: 
            # Now, we should have caught everything so 
            # there shouldn't be anything falling in these gaps. 
            print(ch)

In [24]:
charset_counter = Counter(' + '.join(sorted(charset_in_lemma.keys())) 
                          for idx, charset_in_lemma in idx2charset.items())

num_lemmas = len(df)
for cs, count in charset_counter.most_common():
    print(count, 'out of', num_lemmas, 'are', cs)

Now, that's kind of messy with so many slices of the pie.

Except for **Hiragana+Kanji**, let's group the other mixed scripts as **Others** since they're each <1%.

In [25]:
import matplotlib.pyplot as plt
from matplotlib import rc
 
# Close the previous plot
plt.close()

# Data to plot
labels, sizes = zip(*charset_counter.most_common()[:5])
_, size_of_others = zip(*charset_counter.most_common()[5:])

# Added the label and counts of 'Others'
labels = ['Others'] + list(labels) 
sizes = [sum(size_of_others)] + list(sizes)

# Blue = Kanji
# Purple = Kanji + Hiragana
# Green = Katakana
# Red = Hiragana
# Grey = Romanji
# Light Grey = Others

colors = ['gainsboro', 'lightskyblue', 'orchid', 'yellowgreen', 
          'lightcoral',  'silver']
explode = (0.0, 0.1, 0.1, 0.1, 0.1, 0.1)  # offset every slice except 'Others'
 
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=False, startangle=120)
plt.axis('equal')
plt.rcParams["figure.figsize"] = [14,7]

font = {'size': 12}

rc('font', **font)
plt.show()
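
Instead of hard-coding how many of the top categories to keep, the grouping could also be driven by a share threshold. The sketch below assumes a 1% cut-off; `group_small` is a hypothetical helper and the counts in `toy_counter` are made-up numbers, not results from the dataset.

```python
from collections import Counter

def group_small(counter, min_share=0.01, other_label='Others'):
    """Merge every category whose share is below min_share into other_label."""
    total = sum(counter.values())
    grouped = Counter()
    for label, count in counter.items():
        key = label if count / total >= min_share else other_label
        grouped[key] += count
    return grouped

# Made-up counts for illustration only.
toy_counter = Counter({
    'Kanji': 600,
    'Hiragana + Kanji': 250,
    'Katakana': 100,
    'Hiragana': 45,
    'Kanji + Katakana': 3,
    'Hiragana + Katakana': 2,
})

grouped = group_small(toy_counter)
labels, sizes = zip(*grouped.most_common())
print(labels, sizes)
```

The resulting `labels` and `sizes` can be passed to `plt.pie` in the same way as in the cell above.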

Going back to the DataFrame
====

Let's add the counts of the characters per lemma back into the original dataframe.

In [26]:
df.head()

In [31]:
# Initialize the new columns with zeros
df['#Katakana'] = 0
df['#Hiragana'] = 0
df['#Kanji'] = 0
df['#Romanji'] = 0

In [32]:
df.head()

In [33]:
for idx, row in df.iterrows():
    lemma = row['lemma']
    for ch in lemma:
        if ch in hiragana:
            df.iloc[idx, df.columns.get_loc('#Hiragana')] += 1
        elif ch in katakana:
            df.iloc[idx, df.columns.get_loc('#Katakana')] += 1
        elif ch in romanji:
            df.iloc[idx, df.columns.get_loc('#Romanji')] += 1
        elif ch in kanji:
            df.iloc[idx, df.columns.get_loc('#Kanji')] += 1

In [34]:
df.head()
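
The row-by-row loop works, but the same columns can also be filled column-wise with `Series.map`, which tends to be shorter and faster on large frames. This is a sketch on a toy DataFrame with stand-in sets built from the Unicode blocks; `count_in` and `toy` are hypothetical names.

```python
import pandas as pd

# Stand-in sets (assumption: full Unicode Hiragana and Katakana blocks).
hiragana_block = {chr(cp) for cp in range(0x3040, 0x30A0)}
katakana_block = {chr(cp) for cp in range(0x30A0, 0x3100)}

def count_in(charset):
    """Build a function that counts how many characters of a lemma are in charset."""
    return lambda lemma: sum(ch in charset for ch in lemma)

# Toy lemmas: hiragana, kanji + hiragana, katakana.
toy = pd.DataFrame({'lemma': ['\u3042\u308b', '\u601d\u3046', '\u30c6\u30ec\u30d3']})
toy['#Hiragana'] = toy['lemma'].map(count_in(hiragana_block))
toy['#Katakana'] = toy['lemma'].map(count_in(katakana_block))
print(toy)
```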