# DEVoca-scraper

## Method

- Scrapes the terms and their descriptions from Wikipedia's [Glossary of computer science](https://en.wikipedia.org/wiki/Glossary_of_computer_science).

- On that page, the **term list** for each letter ('A', 'B', 'C', ...) is wrapped in a `<dl class="glossary">` tag.
- Each term list consists of **terms** inside `<dt class="glossary">` tags and **term descriptions** inside `<dd class="glossary">` tags.

    ![](./glossary1.png)
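
The `<dl>` / `<dt>` / `<dd>` layout described above can be tried out on a small snippet before touching the live page. The sketch below is only an illustration: `SAMPLE_HTML` is hand-written, hypothetical markup that imitates the glossary structure, and `pairs` is a name introduced here, not part of the scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML that imitates the glossary layout
SAMPLE_HTML = """
<dl class="glossary">
<dt class="glossary"><a href="/wiki/Algorithm">algorithm</a></dt>
<dd class="glossary">An unambiguous specification of how to solve a class of problems.</dd>
<dt class="glossary">bit</dt>
<dd class="glossary">A basic unit of information.</dd>
</dl>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

pairs = []
for dt in soup.find_all("dt", class_="glossary"):
    # First <dd> element that follows this <dt> (text nodes are skipped)
    dd = dt.find_next_sibling("dd")
    pairs.append((dt.get_text(strip=True), dd.get_text(strip=True) if dd else ""))

for term, definition in pairs:
    print(f"{term}: {definition}")
```

Note that `find_next_sibling("dd")` does not stop at the next `<dt>`, so a term with no description would pick up the following term's `<dd>`. The notebook code avoids this by walking the siblings one by one and stopping at the first element that is not a `<dd>`.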

In [1]:
import json
import re
import requests
from bs4 import BeautifulSoup as bs
from bs4 import Comment
from pprint import pprint

In [2]:
URL = 'https://en.wikipedia.org/wiki/Glossary_of_computer_science'
page = requests.get(URL)
print(page)

<Response [200]>


In [3]:
soup = bs(page.text, "html.parser")

In [4]:
# Despite the name, dls holds the <dt class="glossary"> elements (one per term)
dls = soup.find_all("dt", class_="glossary")
print(len(dls))

296
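
The cell that follows post-processes each description: it strips trailing newlines, collapses blank lines, joins the pieces, and removes citation markers such as `[12]`. Here is a minimal sketch of that cleanup in isolation, using only the standard library. `clean_definition` and `CITATION` are hypothetical names introduced for illustration, and the input strings are made up.

```python
import re

# Footnote markers such as [3] or [12]
CITATION = re.compile(r"\[\d+\]")


def clean_definition(parts):
    """Join description fragments and strip citation markers (hypothetical helper)."""
    cleaned = []
    for part in parts:
        text = part.rstrip()               # drop trailing newlines and spaces
        text = text.replace("\n\n", "\n")  # collapse blank lines
        cleaned.append(text)
    return CITATION.sub("", " ".join(cleaned))


sample = ["A unit of data.[1]\n", "Also called a binary digit.[23]\n\nSee byte.\n"]
print(clean_definition(sample))
```

The pattern is written as a raw string (`r"\[\d+\]"`) so that the backslashes reach the regex engine unchanged instead of being treated as string escapes.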


In [16]:
word_data = []

for word_def in dls:
    new_word = {}

    # Term
    new_word['word_name_en'] = word_def.text
    if word_def.a is not None:
        new_word['word_link'] = 'https://en.wikipedia.org' + word_def.a['href']
    else:
        new_word['word_link'] = ''

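    # Description: collect the <dd> elements that follow this <dt>
    word_exps = []
    for sibling in word_def.next_siblings:
        # Skip HTML comments and whitespace-only text nodes between tags
        # (checked first, because these nodes have no tag name)
        if isinstance(sibling, Comment) or (isinstance(sibling, str) and not sibling.strip()):
            continue

        # Stop at the first node that is not a <dd> (usually the next <dt>)
        if sibling.name != 'dd':
            break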

        # Prefix every list item with a bullet, for both ul and ol
        for li in sibling('li'):
            if not li.contents or li.contents[0] != ' * ':
                li.insert(0, ' * ')

        # LaTeX formulas are rendered as images that are hard to use, so drop them for now
        for eq in sibling.find_all('span', class_='mwe-math-element'):
            eq.clear()

        # Strip the trailing '\n'
        sibling_text = sibling.text.rstrip()

        # Collapse blank lines (double newlines) into single newlines
        sibling_text = sibling_text.replace('\n\n', '\n')

        # Collect the text of this description
        word_exps.append(sibling_text)

    # Join the collected descriptions and strip citation markers such as [12]
    new_word['word_def'] = re.sub(r'\[\d+\]', '', ' '.join(word_exps))

    # Skip terms that have no description
    if new_word['word_def'] == '':
        continue

    word_data.append(new_word)

pprint(word_data[:3], sort_dicts=False)

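with open("word_wiki.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False writes non-ASCII characters as UTF-8 instead of \uXXXX escapes
    json.dump(word_data, f, indent=2, ensure_ascii=False)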

[{'word_name_en': 'abstract data type (ADT)',
  'word_link': 'https://en.wikipedia.org/wiki/Abstract_data_type',
  'word_def': 'A mathematical model for data types in which a data type is '
              'defined by its behavior (semantics) from the point of view of a '
              'user of the data, specifically in terms of possible values, '
              'possible operations on data of this type, and the behavior of '
              'these operations. This contrasts with data structures, which '
              'are concrete representations of data from the point of view of '
              'an implementer rather than a user.'},
 {'word_name_en': 'abstract method',
  'word_link': 'https://en.wikipedia.org/wiki/Abstract_method',
  'word_def': 'One with only a signature and no implementation body. It is '
              'often used to specify that a subclass must provide an '
              'implementation of the method. Abstract methods are used to '
              'specify interfaces in 