# Pinyin Spelling Corrector (inspired by Peter Norvig) and Mini Chinese Input Method Editor
## Discovery and Discussion Notebook
I was inspired by Peter Norvig's chapter on natural language corpus data in the book *Beautiful Data*, and by his [implementation](https://norvig.com/spell-correct.html) of a spelling corrector. I wondered whether I could implement a similar spelling corrector for pinyin, with the corrected pinyin then used to suggest individual characters, as a primitive Chinese input method editor (IME) would do. The syllable (pinyin without tones) and character frequency lists were taken from [Jun Da](http://lingua.mtsu.edu/chinese-computing/) at Middle Tennessee State University. As a linguistics minor, I am interested in natural language processing, but I have no significant experience in it beyond experimenting a little with NLTK. I started this project out of curiosity and without much expectation, figuring it would be an interesting learning experience either way. I also wanted to use the opportunity to explore how to make Jupyter notebooks more interactive.

Intended Organization of the Repository:
- Workspace and Discussion Notebook (what you're looking at!): Includes thought process, exploratory visualizations, implementation.
- Scripts: Python scripts of the implementation.
- Report Notebook: Report outlining the project.
- Files: CSV files used.

In [2]:
# For working with the data
import pandas as pd
# For the wordcloud
import numpy as np
import matplotlib.pyplot as plt
import wordcloud as wc
# To get rid of tone marks
import unidecode
# For interactivity
from ipywidgets import widgets

In [5]:
# Read in data
syllables = pd.read_csv("C:/Users/rebek/Anaconda3/envs/xiaoshuru/data/syllable frequencies.csv")
characters = pd.read_csv("C:/Users/rebek/Anaconda3/envs/xiaoshuru/data/character ranking reduced.csv").dropna()

In [6]:
characters.head()

Unnamed: 0,frequency_rank,character,pinyin
0,1,的,de
1,2,一,yī
2,3,是,shì
3,4,不,bù
4,5,了,le


In [7]:
syllables.head()

Unnamed: 0,syllable,frequency
0,a,143836
1,ai,213586
2,an,418511
3,ang,10267
4,ao,60455


### From DataFrame to Dictionary
In order to work with the data and implement the algorithms, we're going to convert the dataframes into dictionaries.
The syllable dictionary is easy: the keys are the syllables and the values are their frequencies. This is what we'll use for the spelling corrector, and we can also use it to create the "wordcloud" below.

The character dictionary is a little more involved, and is what will be used for the IME. It basically *is* the IME, so I'll take the opportunity to explain the intuition behind the IME as we put together the character dictionary.

In [8]:
# Function to create simple dictionary from dataframe
def simple_dict(df):
    """Return a dict with the first column of df as keys and the second column as values."""
    # Pair the two columns row by row and build the dictionary in one pass
    return dict(zip(df.iloc[:, 0], df.iloc[:, 1]))

In [9]:
# Create syllable dictionary from dataframe, to prep for spelling corrector
syll_dict = simple_dict(syllables)

In [34]:
# Wordcloud code

Most Chinese IMEs don't differentiate between tones, so we won't either. We'll make a new column called "toneless" which contains the pinyin without tone marks generated using Unidecode.

In [10]:
# Use unidecode to strip tone marks and store the result in a new "toneless" column
characters["toneless"] = characters["pinyin"].apply(unidecode.unidecode)
characters.head()

Unnamed: 0,frequency_rank,character,pinyin,toneless
0,1,的,de,de
1,2,一,yī,yi
2,3,是,shì,shi
3,4,不,bù,bu
4,5,了,le,le
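
To make the tone-stripping step concrete, here is a minimal sketch that does the same job with only the standard library. It is not the notebook's method (which uses Unidecode); the helper name `strip_tone_marks` is hypothetical, and the approach assumes the pinyin uses ordinary Unicode diacritics.

```python
import unicodedata

def strip_tone_marks(pinyin):
    """Remove combining marks (tone marks, diaeresis) from a pinyin string.

    Hypothetical helper, shown as a stdlib alternative to unidecode.
    """
    # NFD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", pinyin)
    # Category "Mn" (nonspacing mark) covers the tone diacritics
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for p in ["de", "yī", "shì", "bù", "lǜ"]:
    print(p, "->", strip_tone_marks(p))
```

One thing this makes visible: "ü" collapses to "u" along with the tone mark, so "lǜ" and "lù" would end up under the same toneless key.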


Now let's walk through the intuition behind the IME. Let's say we type "yi" into the IME. Ideally the IME should retrieve all the characters corresponding to "yi" in order of usage frequency.

Take a look at the dataframe. First, let's get all the rows that have the same toneless pinyin "yi." There are 159 rows, and thankfully they already appear in frequency order, apart from some duplicated entries. If they weren't, we could sort them with the pandas `sort_values()` method.

In [11]:
yi = characters.loc[characters["toneless"] == "yi", :]
yi.head()

Unnamed: 0,frequency_rank,character,pinyin,toneless
1,2,一,yī,yi
22,23,以,yǐ,yi
101,2,一,yī,yi
122,23,以,yǐ,yi
203,104,意,yì,yi


Now we want to save the "yi" characters into a list. There are some duplicates, so we'll only save the unique characters.

In [12]:
yi_characters = list(yi["character"].unique())
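
The filter, sort, and de-duplicate steps above can be sketched end to end on a toy dataframe. The rows below are a hand-made stand-in for the real data (deliberately shuffled, with one duplicate), so treat this as an illustration under those assumptions, not as output from the actual CSV.

```python
import pandas as pd

# Toy stand-in for the characters dataframe (shuffled, with a duplicate row)
toy = pd.DataFrame({
    "frequency_rank": [104, 3, 2, 23, 2],
    "character": ["意", "是", "一", "以", "一"],
    "toneless": ["yi", "shi", "yi", "yi", "yi"],
})

# 1. Keep only the rows whose toneless pinyin is "yi"
matches = toy.loc[toy["toneless"] == "yi"]
# 2. Order them from most to least frequent (rank 1 is the most frequent)
ranked = matches.sort_values("frequency_rank")
# 3. Drop repeated characters while keeping the first (best-ranked) occurrence
toy_yi_characters = list(ranked["character"].unique())

print(toy_yi_characters)
```

Because `unique()` keeps values in order of first appearance, sorting before de-duplicating is what guarantees the list is ordered by frequency.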

Now let's write the function that creates the character dictionary from the character dataframe.

For the character dictionary, we're going to have the toneless pinyin as the keys, and the corresponding character lists as the values.

In [14]:
# Function to create character dictionary from dataframe
def character_dict(df):
    # Takes in df containing 4 columns: frequency_rank, character, pinyin, toneless
    # Returns dictionary with toneless pinyin as the keys, list of corr. characters as values
    # Initialize dictionary
    chars = dict()
    # Get unique list of toneless pinyin
    unique_pinyin = df["toneless"].unique()
    for u in unique_pinyin:
        # Get all rows that have toneless pinyin u
        u_rows = df.loc[df["toneless"] == u, :]
        # Save u characters into a list, only save the unique characters
        u_list = list(u_rows["character"].unique())
        # Set u as key, the list as value
        chars[u] = u_list
    # Return dictionary
    return chars

In [15]:
# Create the character dictionary!
char_dict = character_dict(characters)
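
As a cross-check on the loop-based function, the same mapping can be sketched with a pandas `groupby`. The function name `character_dict_groupby` and the toy rows are assumptions for illustration; the sketch assumes the dataframe is already in frequency order, as the real one appears to be.

```python
import pandas as pd

def character_dict_groupby(df):
    """Map each toneless pinyin to its unique characters, in row order.

    Hypothetical alternative to character_dict; expects "toneless" and
    "character" columns.
    """
    return {
        key: list(group["character"].unique())
        # sort=False keeps the groups in order of first appearance
        for key, group in df.groupby("toneless", sort=False)
    }

# Toy rows, already ordered by frequency rank, with one duplicate
toy = pd.DataFrame({
    "character": ["的", "一", "是", "以", "一"],
    "toneless": ["de", "yi", "shi", "yi", "yi"],
})

toy_char_dict = character_dict_groupby(toy)
print(toy_char_dict)
```

The grouped version scans the dataframe once instead of filtering it once per syllable, which may matter on larger character lists.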

### Implementing the Pinyin Spelling Corrector for Syllables

In [16]:
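def P(syllable, N=sum(syll_dict.values())):
    "Probability of `syllable`; unknown syllables get probability 0."
    # syll_dict is a plain dict, so use .get to avoid a KeyError on unknown input
    return syll_dict.get(syllable, 0) / N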

def correction(syllable): 
    "Most probable spelling correction for syllable."
    return max(candidates(syllable), key=P)

def candidates(syllable): 
    "Generate possible spelling corrections for syllable."
    return (known([syllable]) or known(edits1(syllable)) or known(edits2(syllable)) or [syllable])

def known(syllables): 
    "The subset of `syllables` that appear in syll_dict."
    return set(s for s in syllables if s in syll_dict)

def edits1(syllable):
    "All edits that are one edit away from `syllable`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(syllable[:i], syllable[i:])    for i in range(len(syllable) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(syllable): 
    "All edits that are two edits away from `syllable`."
    return (e2 for e1 in edits1(syllable) for e2 in edits1(e1))

In [17]:
candidates("mieo")

{'miao', 'mie'}
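
To see where these two candidates come from, it can help to split `edits1` into its four edit families. The sketch below repeats the same list comprehensions but returns them separately; `edit_breakdown` is a hypothetical helper name, not part of the notebook.

```python
import string

def edit_breakdown(syllable, letters=string.ascii_lowercase):
    """Return the four edit families used by edits1 as separate lists."""
    splits = [(syllable[:i], syllable[i:]) for i in range(len(syllable) + 1)]
    return {
        "deletes": [L + R[1:] for L, R in splits if R],
        "transposes": [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1],
        "replaces": [L + c + R[1:] for L, R in splits if R for c in letters],
        "inserts": [L + c + R for L, R in splits for c in letters],
    }

breakdown = edit_breakdown("mieo")
counts = {name: len(edits) for name, edits in breakdown.items()}
# For a syllable of length n: n deletes, n-1 transposes, 26n replaces,
# and 26(n+1) inserts, i.e. 54n + 25 strings before duplicates are removed
total = sum(counts.values())

print(counts, total)
print("mie" in breakdown["deletes"], "miao" in breakdown["replaces"])
```

So "mie" is reached by deleting the final "o", and "miao" by replacing "e" with "a"; both are one edit away, and `known` discards the remaining strings that are not real syllables.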

In [97]:
correction("mieo")

'miao'

In [102]:
candidates("ddu")

{'diu', 'dou', 'du'}

In [103]:
correction("ddu")

'dou'
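
The choice among the three candidates comes down to frequency: `correction` returns whichever candidate has the highest probability. Here is a minimal sketch of that selection step. The counts are made up for illustration and are not Jun Da's figures; `toy_freq` and `prob` are hypothetical names.

```python
# Hypothetical counts, chosen only so that the ordering is easy to follow
toy_freq = {"dou": 300, "du": 200, "diu": 5}
toy_total = sum(toy_freq.values())

def prob(syllable):
    """Relative frequency of a syllable; unseen syllables score 0."""
    return toy_freq.get(syllable, 0) / toy_total

toy_candidates = {"diu", "dou", "du"}

# correction() keeps only the single best candidate...
best = max(toy_candidates, key=prob)
# ...but sorting shows the full ranking an IME might want to offer
ranked = sorted(toy_candidates, key=prob, reverse=True)

print(best, ranked)
```

With these toy numbers "dou" wins simply because it is the most frequent of the three, regardless of which edit produced it.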

In [113]:
char_dict[correction("mieo")][0:10]

['描', '妙', '庙', '苗', '秒', '瞄', '渺', '藐', '缈', '喵']

### Interactivity

In [20]:
from IPython.display import display

output = widgets.Textarea(description = "Output: ")
pinyin = widgets.Text(description = "Pinyin Input: ")
select = widgets.Select(options=char_dict[correction("")], description = "Select: ")

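def pinyin_handler(sender):
    # On Enter: correct the typed pinyin, list its characters, and append the top one
    select.index = None
    select.options = char_dict[correction(pinyin.value)]
    output.value = output.value + select.value
    # select.observe(on_click, "value")
    pinyin.value = ""

def on_click(change):
    # Swap the most recently added character for the one picked in the list
    # (not wired up yet; see the commented-out observe call above)
    output.value = output.value[:-1] + select.value
    pinyin.value = ""
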

display(pinyin)
pinyin.on_submit(pinyin_handler)
display(select)
display(output)

[Widget output: the pinyin input box, the candidate selection list, and the output text area]