# Aozora Bunko Vocabulary Extractor

## Introduction
The goal of this project is to build dataframes of the vocabulary used in classic Japanese literature available on Aozora Bunko, annotated with JLPT levels so that the difficulty of each book can be estimated. The dataframes will later be used to create visual guides to Japanese literature based on difficulty.

[Aozora Bunko](https://www.aozora.gr.jp/) is a digital library hosting classic Japanese literature in a convenient HTML format. 

## Importing libraries
The main libraries used for this project are:
- Janome - processing Japanese text (tokenization, part-of-speech tagging, and base forms of conjugated words)
- BeautifulSoup - scraping text from HTML files
- pandas - dataframe creation

In [18]:
import requests
import time
import pandas as pd
from janome.tokenizer import Tokenizer
from bs4 import BeautifulSoup
from collections import Counter
from tqdm import tqdm

## Source text preparation
For this project, I chose "Rashōmon" by Ryūnosuke Akutagawa. The original text is available [on Aozora Bunko](https://www.aozora.gr.jp/cards/000879/files/128_15261.html).

In [19]:
URL = "https://www.aozora.gr.jp/cards/000879/files/128_15261.html"

After assigning the Aozora Bunko link to a variable, I fetch the page and parse it with BeautifulSoup. I also extract the book's title and the author's name, which are later used to name the CSV file.

In [20]:
book_source = requests.get(URL)
soup = BeautifulSoup(book_source.content, "html.parser")
book_text = soup.find("div",{"class":"main_text"}).get_text()
book_title = soup.find("h1").get_text()
book_author = soup.find("h2").get_text()
file_name = f'{book_author} - {book_title}.csv'
print(f'Book loaded: {book_title} by {book_author}')

Book loaded: 羅生門 by 芥川龍之介
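
One thing that may be worth checking at this stage: `get_text()` returns the text of every nested element, so if the page annotates kanji with ruby markup, the readings inside `<rt>` and the parentheses inside `<rp>` would end up in `book_text`. The sketch below uses only the standard library's `html.parser` to drop those elements. The `RubyStripper` class, the `strip_ruby()` helper, and the sample markup are hypothetical and are not part of the original notebook.

```python
from html.parser import HTMLParser


class RubyStripper(HTMLParser):
    """Collect text while skipping the contents of <rt> and <rp> elements."""

    SKIPPED_TAGS = {"rt", "rp"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <rt> or <rp>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside the skipped elements
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_ruby(html):
    """Return the text of an HTML fragment without ruby readings."""
    parser = RubyStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical fragment, written by hand for illustration
sample = "<ruby><rb>下人</rb><rp>（</rp><rt>げにん</rt><rp>）</rp></ruby>が門の下にいた。"
print(strip_ruby(sample))
```

Whether this step is needed depends on how the chosen page is marked up, so it is best treated as an optional cleanup before tokenizing.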


## Creating the dataframe
Next, I created a function that tokenizes the text and returns a list of words. The function keeps only the most important parts of speech: nouns (including pronouns), verbs, and adjectives. Every word is converted to its base form.

In [21]:
def extract_wordlist(text):
    t = Tokenizer()
    wordlist = []

    for token in t.tokenize(text):
        pos = token.part_of_speech
        # "名詞" is a substring of "代名詞", so pronouns are covered by the noun check;
        # startswith("動詞") never matches tags that start with "助動詞"
        if "名詞" in pos or pos.startswith("動詞") or "形容詞" in pos:
            wordlist.append(token.base_form)
    return wordlist
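
The filter condition used in `extract_wordlist()` can be checked in isolation, without running the tokenizer, by applying it to part-of-speech strings directly. The sketch below assumes the comma-separated tag format that Janome exposes through `token.part_of_speech`. The `is_content_word()` helper and the sample tags are written by hand for illustration and are not taken from the notebook.

```python
def is_content_word(part_of_speech):
    """Return True for tags that the word list should keep."""
    return (
        "名詞" in part_of_speech               # nouns; also matches 代名詞 (pronouns)
        or part_of_speech.startswith("動詞")   # verbs, but not 助動詞 (auxiliary verbs)
        or "形容詞" in part_of_speech          # adjectives
    )


# Hand-written sample tags (assumed format: comma-separated categories)
sample_tags = [
    "名詞,一般,*,*",
    "名詞,代名詞,一般,*",
    "動詞,自立,*,*",
    "形容詞,自立,*,*",
    "助動詞,*,*,*",
    "助詞,格助詞,一般,*",
]

for tag in sample_tags:
    print(tag, is_content_word(tag))
```

Because the noun and adjective checks are substring tests, they also accept any tag that merely contains those characters (for example, a tag such as `接頭詞,名詞接続,*,*`). A stricter variant could compare only the first comma-separated field.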

After creating the list, I use a Counter to consolidate the results into the number of occurrences of every word in the text, and then create a dataframe from the counts.

In [22]:
wordlist = extract_wordlist(book_text)
wordlist = Counter(wordlist)
wordlist = pd.DataFrame(wordlist.items(), columns = ['Word', 'Count']).sort_values(by = 'Count', ascending = False)
wordlist

| | Word | Count |
|---|---|---|
| 31 | する | 78 |
| 11 | ゐる | 59 |
| 5 | 下人 | 44 |
| 2 | 事 | 31 |
| 42 | 云 | 30 |
| ... | ... | ... |
| 301 | 惧 | 1 |
| 300 | 人目 | 1 |
| 299 | 患 | 1 |
| 298 | 雨風 | 1 |


As seen in the table above, Rashōmon consists of 689 unique words. The most frequent word in the story is the verb する (suru).
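
Both figures can also be read straight from the `Counter`, before it is converted to a dataframe. The following is a minimal sketch on a hand-made token list. The `tokens` list is hypothetical and much shorter than the real word list.

```python
from collections import Counter

# Hypothetical tokens standing in for the output of extract_wordlist()
tokens = ["する", "下人", "する", "事", "下人", "する"]

counts = Counter(tokens)

# Number of distinct words is the number of keys in the counter
unique_words = len(counts)

# most_common(1) returns a list with the single most frequent (word, count) pair
top_word, top_count = counts.most_common(1)[0]

print(unique_words, top_word, top_count)
```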

## API Tests
To retrieve the JLPT level of each word, I am using the [JLPT-VOCAB API](https://github.com/wkei/jlpt-vocab-api).

In [7]:
requests.get("https://jlpt-vocab-api.vercel.app/api/words?word=人目").json()

{'total': 1,
 'offset': 0,
 'limit': 10,
 'words': [{'word': '人目',
   'meaning': 'glimpse; public gaze',
   'furigana': 'じんもく',
   'romaji': 'jinmoku',
   'level': 1}]}

As seen in the API response above, we can access not only the JLPT level of the word, but also its English definition and furigana spelling. However, I am not going to use the furigana from the API, since the Janome library can provide a reading regardless of whether the API has a record for the queried word.

In [8]:
requests.get("https://jlpt-vocab-api.vercel.app/api/words?word=人目").json()['words'][0]['level']

1
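
Indexing with `['words'][0]['level']` works when the API returns at least one entry. A more defensive lookup, sketched below, returns a default for empty responses and only accepts an entry whose `word` field equals the queried word. The `extract_level()` helper is hypothetical, and the exact-match check is an assumption about how one might guard against a response that contains other entries. The sample dictionary copies the response shown above.

```python
def extract_level(response, word, default="unknown"):
    """Return the JLPT level of `word` from an API response, or `default`."""
    for entry in response.get("words", []):
        # Accept only an entry for exactly the queried word
        if entry.get("word") == word:
            return entry.get("level", default)
    return default


# Response copied from the API test above
sample_response = {
    "total": 1,
    "offset": 0,
    "limit": 10,
    "words": [
        {
            "word": "人目",
            "meaning": "glimpse; public gaze",
            "furigana": "じんもく",
            "romaji": "jinmoku",
            "level": 1,
        }
    ],
}

print(extract_level(sample_response, "人目"))
print(extract_level({"words": []}, "下人"))
```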

## API Request Functions
I decided to split the API fetching into two separate functions. Originally, I wanted to use the jisho.org API, but it tends to time out frequently, so I created the make_api_request() function to automatically resend the request whenever a timeout occurred. Eventually I switched to a different API, because the constant timeouts made the fetching process significantly longer.

In [9]:
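def make_api_request(word, max_retries=5):
    url = "https://jlpt-vocab-api.vercel.app/api/words"
    for _ in range(max_retries):
        try:
            # Passing the word via params lets requests URL-encode it
            response = requests.get(url, params={"word": word}, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as error:
            # Timeout is a subclass of RequestException, so one clause covers both
            print(f"Request failed for word: {word} ({error}). Retrying in 5 seconds...")
            time.sleep(5)
    # Stop after max_retries attempts instead of retrying forever;
    # an empty result is reported as 'unknown' by get_word_info()
    return {"words": []}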
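
A fixed five-second pause is simple, but a common alternative is to cap the number of attempts and double the delay after each failure. The sketch below shows that pattern in isolation. `retry_with_backoff()`, its parameters, and the simulated `flaky_fetch()` are hypothetical names introduced for illustration. The fetch function and the sleep function are passed in so the example runs without network access or real waiting.

```python
import time


def retry_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as error:  # in real use, narrow this to the request errors
            last_error = error
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: surface the last error to the caller
    raise last_error


# Simulated request that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return {"words": []}


delays = []  # records the requested delays instead of actually sleeping
result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Raising after the last attempt makes a persistent failure visible. Returning a placeholder instead would let a long batch continue, at the cost of hiding which words were never looked up.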

In [14]:
def get_word_info(wordlist):
    readings = []
    levels = []
    t = Tokenizer()

    for word in tqdm(wordlist, desc='Processing'):
        response = make_api_request(word)
        tokens = list(t.tokenize(word))

        # Katakana reading of the first token Janome produces for the word
        reading = tokens[0].reading

        # Words that are missing from the API get the level 'unknown'
        try:
            level = response['words'][0]['level']
        except (IndexError, KeyError):
            level = 'unknown'

        readings.append(reading)
        levels.append(level)

    return readings, levels

## API Request Test
After creating the functions, I tested them on a short word list:

In [15]:
testlist = ['聞く', '焼く', '本気']
readings, levels = get_word_info(testlist)
print(readings)
print(levels)

Processing: 100%|██████████| 3/3 [00:01<00:00,  1.92it/s]

['キク', 'ヤク', 'ホンキ']
[5, 4, 1]





## Actual vocabulary list creation
The functions work correctly, so I can now populate the Rashōmon dataframe with readings and JLPT levels.

In [None]:
wordlist['Reading'], wordlist['JLPT Level'] = get_word_info(wordlist['Word'])

Finally, I export the list to a CSV file for later visualization.

In [None]:
wordlist.to_csv(file_name, index = False)
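
As a rough illustration of how the exported columns could feed a difficulty estimate, the sketch below computes the share of word occurrences at each JLPT level using only the standard library. The `level_shares()` helper is hypothetical. The levels for 聞く, 焼く and 本気 come from the test above, while their counts and the 'unknown' level for 下人 are made-up values used purely for illustration.

```python
from collections import Counter


def level_shares(rows):
    """Return the fraction of word occurrences at each JLPT level."""
    totals = Counter()
    for _word, count, level in rows:
        totals[level] += count
    grand_total = sum(totals.values())
    return {level: count / grand_total for level, count in totals.items()}


# (word, count, JLPT level) rows; counts here are illustrative, not real data
rows = [
    ("聞く", 12, 5),
    ("焼く", 3, 4),
    ("本気", 1, 1),
    ("下人", 44, "unknown"),
]

shares = level_shares(rows)
for level, share in shares.items():
    print(level, round(share, 3))
```

Weighting by occurrence count rather than by unique word is one possible design choice. It reflects how often a reader meets words of each level, but a per-word breakdown may suit other visualizations better.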