# Data Cleaning

## Formatting
Classical Chinese *juéjù* 絕句, or quatrains, have four lines of either five or seven syllables each, for a total of 20 or 28 characters per poem. The JSON text corpus also includes other types of poems, and its text is punctuated with full-width commas and periods.
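The character arithmetic can be checked on a single poem. The sketch below is only an illustration: the poem (Wang Zhihuan's 登鸛雀樓) is typed in by hand in the corpus's two-string `paragraphs` layout rather than taken from the output in this notebook, and the variable names are hypothetical.

```python
import re

# Illustrative pentasyllabic quatrain, hand-typed in the corpus's
# two-string `paragraphs` layout (one couplet per string).
paragraphs = ["白日依山盡，黃河入海流。", "欲窮千里目，更上一層樓。"]

# Drop the full-width comma and period, then count what remains.
text = re.sub(r"[，。]", "", "".join(paragraphs))
total_characters = len(text)
characters_per_line = total_characters // 4  # a quatrain has four lines

print(total_characters, characters_per_line)  # 20 5
```

A heptasyllabic quatrain would give 28 and 7 under the same steps.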

## Selecting the Data
To select only quatrains, the number and length of each poem's lines are evaluated. The quatrains that pass are then stored in a DataFrame and labeled as pentasyllabic (5) or heptasyllabic (7).
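If separate pentasyllabic and heptasyllabic DataFrames are wanted, a syllable label is enough to split on. This is a minimal sketch on a toy frame with placeholder text; the `text` and `syllables` column names mirror the ones used later in this notebook, while `toy`, `pentasyllabic`, and `heptasyllabic` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the combined quatrain DataFrame; the text is placeholder data.
toy = pd.DataFrame({
    "text": ["a" * 20, "b" * 28, "c" * 20],
    "syllables": [5, 7, 5],
})

# Boolean masks on the label give one frame per quatrain type.
pentasyllabic = toy[toy["syllables"] == 5].reset_index(drop=True)
heptasyllabic = toy[toy["syllables"] == 7].reset_index(drop=True)

print(len(pentasyllabic), len(heptasyllabic))  # 2 1
```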


### Imports

In [1]:
import numpy as np
import pandas as pd
import os
import re

In [2]:
# Create a files variable that contains all of our data files.
files = os.listdir('../json')

In [3]:
df = pd.read_json('../json/poet.tang.0.json')

In [4]:
df.head()

|   | author | paragraphs | title | id | tags |
|---|--------|------------|-------|----|------|
| 0 | 太宗皇帝 | [秦川雄帝宅，函谷壯皇居。, 綺殿千尋起，離宮百雉餘。, 連甍遙接漢，飛觀迥凌虛。, 雲日隱... | 帝京篇十首 一 | 3ad6d468-7ff1-4a7b-8b24-a27d70d00ed4 | |
| 1 | 太宗皇帝 | [巖廊罷機務，崇文聊駐輦。, 玉匣啓龍圖，金繩披鳳篆。, 韋編斷仍續，縹帙舒還卷。, 對此乃... | 帝京篇十首 二 | 13e72581-968b-457f-b381-a3b7d95b8b7c | |
| 2 | 太宗皇帝 | [移步出詞林，停輿欣武宴。, 琱弓寫明月，駿馬疑流電。, 驚雁落虛弦，啼猿悲急箭。, 閱賞誠... | 帝京篇十首 三 | a7ff247d-a11c-4ca9-a22f-ca420b8c537c | |
| 3 | 太宗皇帝 | [鳴笳臨樂館，眺聽歡芳節。, 急管韻朱絃，清歌凝白雪。, 彩鳳肅來儀，玄鶴紛成列。, 去茲鄭... | 帝京篇十首 四 | fa374b2b-c196-4362-b4ad-8931fc9a8860 | |
| 4 | 太宗皇帝 | [芳辰追逸趣，禁苑信多奇。, 橋形通漢上，峰勢接雲危。, 煙霞交隱映，花鳥自參差。, 何如肆... | 帝京篇十首 五 | 86952cb3-b622-4398-a56a-01dd39f6c6ec | |


In [5]:
# Characters stripped before line lengths are counted (spaces, question marks,
# dashes, title brackets, placeholder symbols, and similar marks)
NOISE = re.compile(r'[ ?？—《》□●/{}·、「」|]')

# Function that checks if a poem (a list of paragraph strings) is a quatrain
def quatrain_checker(poem):
    # A quatrain is stored as exactly two paragraphs, one couplet each
    if len(poem) != 2:
        return False
    # Strip noise characters, then split each couplet on the full-width comma
    couplets = [NOISE.sub('', paragraph).split('，') for paragraph in poem]
    # Each couplet must split into exactly two lines
    if any(len(couplet) != 2 for couplet in couplets):
        return False
    line_lengths = [len(line) for couplet in couplets for line in couplet]
    # The second line of each couplet still carries its closing '。',
    # so valid quatrains measure 5, 6, 5, 6 or 7, 8, 7, 8
    return line_lengths in ([5, 6, 5, 6], [7, 8, 7, 8])
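The expected lengths of 6 and 8 for the second line of each couplet come from the closing period, which stays attached after the split on the comma. A small sketch with a hand-typed heptasyllabic poem (Li Bai's 早發白帝城, used purely as an illustration and not drawn from the output here) shows the effect:

```python
# Hand-typed heptasyllabic quatrain in the two-string `paragraphs` layout.
paragraphs = ["朝辭白帝彩雲間，千里江陵一日還。", "兩岸猿聲啼不住，輕舟已過萬重山。"]

# Split each couplet on the full-width comma; the '。' is left on the second half.
halves = [paragraph.split("，") for paragraph in paragraphs]
line_lengths = [len(part) for pair in halves for part in pair]

print(line_lengths)  # [7, 8, 7, 8]
```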

In [6]:
# Function that takes a .json file and returns a DataFrame with only quatrains, labeled by number of syllables per line
def quatrain_extractor(file):
    poetry = pd.read_json(f'../json/{file}')
    cleaned_poetry = []
    for poem in poetry.paragraphs:
        if quatrain_checker(poem):
            # Join the two couplets and drop the full-width comma and period
            poem = re.sub(r'[，。]', '', ''.join(poem))
            # Only punctuation is removed here, so a poem that still holds any
            # character ignored by quatrain_checker fails both tests and is skipped
            if len(poem) == 20:
                cleaned_poetry.append({'text': poem, 'syllables': 5})
            elif len(poem) == 28:
                cleaned_poetry.append({'text': poem, 'syllables': 7})
    return pd.DataFrame(cleaned_poetry)

In [7]:
# Creating DataFrame with all quatrains
poetry = pd.concat([quatrain_extractor(n) for n in files], ignore_index=True)

In [8]:
poetry.syllables.value_counts()

7    90576
5    17096
Name: syllables, dtype: int64

In [9]:
len(poetry)

107672

In [10]:
len(poetry.text.unique())

103363

In [11]:
# Removing duplicates
poetry.drop_duplicates(subset='text', inplace=True)

In [12]:
# Confirming duplicate deletion
len(poetry)

103363

In [13]:
poetry.syllables.value_counts()

7    86578
5    16785
Name: syllables, dtype: int64

In [14]:
# Resetting index
poetry.reset_index(inplace=True, drop=True)
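The index reset matters because `drop_duplicates` keeps the original row labels, leaving gaps where duplicates were removed. A minimal sketch on a toy frame with placeholder text (the names `toy` and `deduped` are hypothetical):

```python
import pandas as pd

# Toy frame with one repeated text; the values are placeholders.
toy = pd.DataFrame({"text": ["甲", "乙", "甲", "丙"], "syllables": [5, 7, 5, 7]})

# By default the first occurrence is kept and later repeats are dropped.
deduped = toy.drop_duplicates(subset="text")
index_before = list(deduped.index)

# Resetting with drop=True renumbers the rows without adding an 'index' column.
deduped = deduped.reset_index(drop=True)
index_after = list(deduped.index)

print(index_before, index_after)  # [0, 1, 3] [0, 1, 2]
```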

In [15]:
# Exporting to CSV
poetry.to_csv('../data/quatrains.csv')
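Because `to_csv` writes the index by default, the exported file gains an unnamed first column. The sketch below, which uses an in-memory buffer and a toy frame instead of the real `../data/quatrains.csv`, shows two ways the file might be read back; which one fits depends on how later notebooks load it.

```python
import io
import pandas as pd

# Toy frame standing in for the exported quatrains; the text is placeholder data.
toy = pd.DataFrame({"text": ["甲", "乙"], "syllables": [5, 7]})

# Same call shape as the export above: the index is written as the first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# A plain read keeps that column under the name 'Unnamed: 0'.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 restores it as the index instead.
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))  # ['Unnamed: 0', 'text', 'syllables']
print(list(clean.columns))  # ['text', 'syllables']
```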