# Aozora Bunko Vocabulary Extractor

## Introduction
This project builds dataframes of the vocabulary used in classic Japanese literature available on Aozora Bunko, annotated with JLPT level data to estimate the difficulty of each book. The dataframes will later be used to create visual guides to Japanese literature by level of difficulty.

[Aozora Bunko](https://www.aozora.gr.jp/) is a digital library that hosts classic Japanese literature in a convenient HTML format.

The JLPT (Japanese Language Proficiency Test) is an exam administered by the Japan Foundation and Japan Educational Exchanges and Services to evaluate the Japanese proficiency of non-native speakers. The test has five levels, with N5 being the easiest and N1 the hardest. These levels serve as a guide for estimating the difficulty of the analysed texts.

## Importing libraries
The main libraries used in this project are:
- Janome - Japanese text processing, such as tokenisation and conversion of words to their base form
- BeautifulSoup - scraping text from HTML files
- pandas - data frame creation

In [2]:
import requests
import asyncio
import aiohttp
import pandas as pd
from janome.tokenizer import Tokenizer
from bs4 import BeautifulSoup
from collections import Counter
from tqdm import tqdm

## Source text preparation
For this project I chose to analyse "Rashōmon" by Akutagawa Ryūnosuke. The original text is available [on Aozora Bunko](https://www.aozora.gr.jp/cards/000879/files/128_15261.html).

In [3]:
URL = "https://www.aozora.gr.jp/cards/000879/files/128_15261.html"

After assigning the Aozora Bunko link to a variable, I download the page and parse it with BeautifulSoup. I also extract the title of the book and the name of the author, which are used later to name the CSV file.

In [4]:
book_source = requests.get(URL)
soup = BeautifulSoup(book_source.content, "html.parser")
book_text = soup.find("div",{"class":"main_text"}).get_text()
book_title = soup.find("h1").get_text()
book_author = soup.find("h2").get_text()
file_name = f'{book_author} - {book_title}.csv'
print(f'Book loaded: {book_title} by {book_author}')

Book loaded: 羅生門 by 芥川龍之介


## Creating the dataframe
Next, I define a function that tokenises the text and returns a list of words. The function keeps only the most important parts of speech (nouns, including pronouns, as well as verbs and adjectives) and converts every word to its base form.

In [5]:
def extract_wordlist(text):
    t = Tokenizer()
    wordlist = []

    for token in t.tokenize(text):
        pos = token.part_of_speech
        # Keep nouns, verbs and adjectives. Pronouns (代名詞) already contain "名詞",
        # and "助動詞" never starts with "動詞", so neither needs its own check.
        if "名詞" in pos or pos.startswith("動詞") or "形容詞" in pos:
            wordlist.append(token.base_form)
    return wordlist
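
The filter in extract_wordlist() matches "名詞" anywhere in the part-of-speech string, not only in its first field. The sketch below is a minimal illustration of what that implies. The part-of-speech strings are assumed examples in the comma-separated layout that Janome reports, and `matches_top_level()` is a hypothetical stricter variant, not part of the notebook.

```python
# Assumed example part-of-speech strings; real values depend on the dictionary in use.
SAMPLE_POS = {
    "下人": "名詞,一般,*,*",
    "これ": "名詞,代名詞,一般,*",
    "する": "動詞,自立,*,*",
    "た": "助動詞,*,*,*",
    "お": "接頭詞,名詞接続,*,*",
}

def matches_substring(pos):
    # Equivalent to the condition in extract_wordlist(): "名詞" may appear in any field.
    return "名詞" in pos or pos.startswith("動詞") or "形容詞" in pos

def matches_top_level(pos):
    # Hypothetical stricter variant: compare only the first (top-level) field.
    return pos.split(",")[0] in {"名詞", "動詞", "形容詞"}

kept_substring = [w for w, pos in SAMPLE_POS.items() if matches_substring(pos)]
kept_top_level = [w for w, pos in SAMPLE_POS.items() if matches_top_level(pos)]

print(kept_substring)  # ['下人', 'これ', 'する', 'お']
print(kept_top_level)  # ['下人', 'これ', 'する']
```

Under these assumptions, a prefix such as お passes the substring test because its subcategory contains "名詞". Whether such tokens belong in the vocabulary list is a design choice; the stricter variant would exclude them and change the word counts.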

After creating the list, I use a Counter to consolidate the results and count the occurrences of each word in the text. I then create a dataframe from the counts, sorted from the most to the least frequent word.

In [6]:
wordlist = extract_wordlist(book_text)
wordlist = Counter(wordlist)
wordlist = pd.DataFrame(wordlist.items(), columns = ['Word', 'Count']).sort_values(by = 'Count', ascending = False)
wordlist

| | Word | Count |
|---|---|---|
| 31 | する | 78 |
| 11 | ゐる | 59 |
| 5 | 下人 | 44 |
| 2 | 事 | 31 |
| 42 | 云 | 30 |
| ... | ... | ... |
| 301 | 惧 | 1 |
| 300 | 人目 | 1 |
| 299 | 患 | 1 |
| 298 | 雨風 | 1 |


The resulting dataframe contains 689 unique words from Rashōmon, counting only the parts of speech kept by the filter. The most common word in the story is the verb する (suru).
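
The counting step can be sketched on its own with a short, made-up token list (the sample below is illustrative, not taken from the book). `Counter.most_common()` already returns the words ordered by descending count, so it is an alternative to sorting the dataframe afterwards.

```python
from collections import Counter

# Illustrative sample; in the notebook this list comes from extract_wordlist().
tokens = ["下人", "する", "事", "する", "下人", "する"]

counts = Counter(tokens)
rows = counts.most_common()  # list of (word, count) pairs, most frequent first

print(rows)  # [('する', 3), ('下人', 2), ('事', 1)]
```

These pairs could then be passed to `pd.DataFrame(rows, columns=['Word', 'Count'])` to obtain the same two columns as above.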

## API Tests
To retrieve data on the JLPT level of each word, I use the Jisho API.

In [7]:
requests.get("https://jisho.org/api/v1/search/words?keyword=下人").json()

{'meta': {'status': 200},
 'data': [{'slug': '下人',
   'is_common': False,
   'tags': [],
   'jlpt': [],
   'japanese': [{'word': '下人', 'reading': 'げにん'}],
   'senses': [{'english_definitions': ['low-rank person', 'menial'],
     'parts_of_speech': ['Noun'],
     'links': [],
     'tags': [],
     'restrictions': [],
     'see_also': [],
     'antonyms': [],
     'source': [],
     'info': []}],
   'attribution': {'jmdict': True, 'jmnedict': False, 'dbpedia': False}}]}

As the response above shows, the API returns not only the JLPT level of a word (an empty list here, since 下人 has no JLPT level assigned) but also its English definitions and its reading. I will not take the reading from the API, however, because Janome can supply it whether or not the API has a record for the queried word.

In [8]:
requests.get("https://jisho.org/api/v1/search/words?keyword=聞く").json()['data'][0]['jlpt'][0]


'jlpt-n5'

In [9]:
requests.get("https://jisho.org/api/v1/search/words?keyword=聞く").json()['data'][0]['senses'][0]['english_definitions'][0]

'to hear'
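
Chained indexing such as `['data'][0]['jlpt'][0]` raises an error when a word has no JLPT tag or no entry at all, as with 下人 above. The helper below is a minimal sketch of a more defensive lookup. `parse_entry()` is a hypothetical name, and keeping the easiest level when several tags are listed is an assumption, not something the API prescribes.

```python
def parse_entry(response):
    """Return (level, meaning) from a Jisho-style response dict, or None where missing."""
    data = response.get("data") or []
    if not data:
        return None, None
    entry = data[0]

    # Tags look like 'jlpt-n5'; keep only the number after the final 'n'.
    levels = [int(tag.rsplit("n", 1)[1]) for tag in entry.get("jlpt", [])]
    # Assumption: if several tags are present, keep the easiest one (largest number).
    level = max(levels) if levels else None

    senses = entry.get("senses") or []
    definitions = senses[0].get("english_definitions", []) if senses else []
    meaning = definitions[0] if definitions else None
    return level, meaning

# Trimmed copy of the 下人 response shown earlier.
genin = {
    "meta": {"status": 200},
    "data": [{
        "slug": "下人",
        "jlpt": [],
        "senses": [{"english_definitions": ["low-rank person", "menial"]}],
    }],
}
# Minimal stand-in for the 聞く response, containing only the fields used here.
kiku = {"data": [{"jlpt": ["jlpt-n5"], "senses": [{"english_definitions": ["to hear"]}]}]}

print(parse_entry(genin))  # (None, 'low-rank person')
print(parse_entry(kiku))   # (5, 'to hear')
```

Returning the level as an integer makes it easier to sort or aggregate later than the raw 'jlpt-n5' strings.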

## API Request Functions
I split the API logic into two functions. The Jisho API tends to time out frequently, so make_api_request() automatically resends the request after each failure. The requests are issued through asyncio and aiohttp; note, however, that get_word_info() awaits each request before sending the next one, so the words are still processed one at a time.

In [10]:
async def make_api_request(word):
    url = f"https://jisho.org/api/v1/search/words?keyword={word}"
    # Retry in a loop rather than recursively, so repeated failures cannot exhaust the call stack.
    while True:
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url, timeout = 10) as response:
                    response.raise_for_status()
                    return await response.json()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            # Covers connection errors and HTTP error statuses as well as timeouts.
            print(f"Timeout for word: {word}. Retrying in 5 seconds...")
            await asyncio.sleep(5)
    
async def get_word_info(wordlist):
    readings = []
    levels = []
    meanings = []
    t = Tokenizer()

    for word in tqdm(wordlist, desc='Processing'):
        response = await make_api_request(word)

        tokens = list(t.tokenize(word))

        reading = tokens[0].reading
        
        try:
            level = response['data'][0]['jlpt'][-1]
        except (IndexError, KeyError):
            level = 'unknown'
        
        try:
            meaning = response['data'][0]['senses'][0]['english_definitions'][0]
        except (IndexError, KeyError):
            meaning = 'unknown'

        readings.append(reading)
        levels.append(level)
        meanings.append(meaning)

    return readings, levels, meanings
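
Because get_word_info() awaits one request at a time, the total run time grows with the number of words. The sketch below shows one way the requests could overlap instead, using `asyncio.gather()` with a semaphore to cap the number of requests in flight. `fetch_all()`, `fake_request()` and the limit of 5 are assumptions for illustration; `fake_request()` stands in for make_api_request() so the example runs without network access.

```python
import asyncio

async def fake_request(word):
    # Stand-in for make_api_request(); sleeps briefly instead of calling the API.
    await asyncio.sleep(0.01)
    return {"data": [{"slug": word}]}

async def fetch_all(words, request=fake_request, max_concurrent=5):
    # The semaphore limits how many requests run at the same time.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(word):
        async with semaphore:
            return await request(word)

    # gather() returns the results in the same order as the input words.
    return await asyncio.gather(*(fetch_one(w) for w in words))

responses = asyncio.run(fetch_all(["聞く", "やる", "掛ける"]))
print([r["data"][0]["slug"] for r in responses])  # ['聞く', 'やる', '掛ける']
```

Inside a notebook, where an event loop is already running, the call would be `await fetch_all(...)` rather than `asyncio.run(...)`. A low concurrency limit is probably the safer choice for an API that already times out often.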

## API Request Test
After creating the functions, I tested them using a short vocabulary list:

In [11]:
testlist = ['聞く', 'やる', '掛ける']
testlist_df = pd.DataFrame(testlist, columns = ['Word'])
readings, levels, meanings = await get_word_info(testlist_df['Word'])
testlist_df['Reading'] = readings
testlist_df['Level'] = levels
testlist_df['Meaning'] = meanings
testlist_df



Processing: 100%|██████████| 3/3 [00:02<00:00,  1.09it/s]


| | Word | Reading | Level | Meaning |
|---|---|---|---|---|
| 0 | 聞く | キク | jlpt-n5 | to hear |
| 1 | やる | ヤル | jlpt-n1 | to do |
| 2 | 掛ける | カケル | jlpt-n5 | to hang up (e.g. a coat, a picture on the wall) |


## Actual vocabulary list creation
The functions work as expected, so I can now populate the Rashōmon dataframe with readings, JLPT levels and word meanings.

In [12]:
readings, levels, meanings = await get_word_info(wordlist['Word'])
wordlist['Reading'] = readings
wordlist['JLPT Level'] = levels
wordlist['Meaning'] = meanings
wordlist

Processing:  60%|██████    | 416/689 [07:01<03:42,  1.23it/s]

Timeout for word: お. Retrying in 5 seconds...
Timeout for word: お. Retrying in 5 seconds...


Processing:  78%|███████▊  | 540/689 [09:40<02:15,  1.10it/s]

Timeout for word: 修理. Retrying in 5 seconds...


Processing:  99%|█████████▉| 683/689 [12:21<00:05,  1.02it/s]

Timeout for word: 晩. Retrying in 5 seconds...


Processing: 100%|██████████| 689/689 [12:42<00:00,  1.11s/it]


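| | Word | Count | Reading | JLPT Level | Meaning |
|---|---|---|---|---|---|
| 31 | する | 78 | スル | jlpt-n5 | to do |
| 11 | ゐる | 59 | ヰル | unknown | unknown |
| 5 | 下人 | 44 | ゲニン | unknown | low-rank person |
| 2 | 事 | 31 | コト | jlpt-n3 | thing |
| 42 | 云 | 30 | ウン | jlpt-n5 | to say |
| ... | ... | ... | ... | ... | ... |
| 301 | 惧 | 1 | * | jlpt-n3 | to fear |
| 300 | 人目 | 1 | ヒトメ | jlpt-n1 | (public) notice |
| 299 | 患 | 1 | * | jlpt-n3 | patient |
| 298 | 雨風 | 1 | アメカゼ | unknown | rain and wind |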
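
As a rough first step towards estimating difficulty, the levels can be weighted by how often each word occurs. The sketch below uses only the five most frequent rows from the table above, so the resulting shares are illustrative and say nothing about the book as a whole. Weighting by count is an assumption about how difficulty might be summarised.

```python
from collections import Counter

# (word, count, level) rows copied from the top of the table above.
rows = [
    ("する", 78, "jlpt-n5"),
    ("ゐる", 59, "unknown"),
    ("下人", 44, "unknown"),
    ("事", 31, "jlpt-n3"),
    ("云", 30, "jlpt-n5"),
]

# Sum the occurrences per level rather than counting each unique word once.
weighted = Counter()
for word, count, level in rows:
    weighted[level] += count

total = sum(weighted.values())
shares = {level: round(n / total, 3) for level, n in weighted.items()}

print(shares)  # {'jlpt-n5': 0.446, 'unknown': 0.426, 'jlpt-n3': 0.128}
```

The large 'unknown' share in this sample comes partly from historical spellings such as ゐる, which suggests that normalising older orthography before the lookup could matter for the final estimate.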


Finally, I can export the list for further visualisation.

In [13]:
wordlist.to_csv(f'data/{file_name}', index = False)
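
The export assumes that a `data` folder already exists, and `to_csv()` fails if it does not. A minimal sketch of creating the folder first is shown below. `build_output_path()` is a hypothetical helper, and the example writes into a temporary directory only to avoid touching the working directory.

```python
import tempfile
from pathlib import Path

def build_output_path(author, title, directory="data"):
    # Create the output folder (and any parents) if it does not exist yet.
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"{author} - {title}.csv"

with tempfile.TemporaryDirectory() as tmp:
    path = build_output_path("芥川龍之介", "羅生門", directory=Path(tmp) / "data")
    folder_created = path.parent.is_dir()
    print(path.name)  # 芥川龍之介 - 羅生門.csv
```

The returned path could then be passed directly to `wordlist.to_csv(path, index = False)`.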