# Cambridge Dictionary Web Scraper

We will scrape pages from the Cambridge Dictionary website and use them to build a custom dictionary.

### Get the URLs

In [1]:
! curl https://dictionary.cambridge.org/browse/english-chinese-traditional | egrep -o 'https://dictionary.cambridge.org/browse/english-chinese-traditional/[^/]+/' > toc1.url.txt 

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  180k  100  180k    0     0   297k      0 --:--:-- --:--:-- --:--:--  299k
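
The same extraction can be done in Python once the page text is available as a string. The following is a minimal sketch, not part of the original pipeline: `extract_toc_urls` is a hypothetical helper name, the sample HTML is hand-written, and unlike `egrep -o` it also drops duplicate matches.

```python
import re

# Same pattern as the egrep call above, with the dots escaped so they match literally.
TOC1_PATTERN = re.compile(
    r"https://dictionary\.cambridge\.org/browse/english-chinese-traditional/[^/]+/"
)


def extract_toc_urls(html):
    """Return first-level browse URLs in order of appearance, without duplicates."""
    seen = set()
    urls = []
    for match in TOC1_PATTERN.finditer(html):
        url = match.group(0)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hand-written stand-in for the browse page, only to show the behaviour.
sample = (
    '<a href="https://dictionary.cambridge.org/browse/english-chinese-traditional/a/">a</a>'
    '<a href="https://dictionary.cambridge.org/browse/english-chinese-traditional/b/">b</a>'
    '<a href="https://dictionary.cambridge.org/browse/english-chinese-traditional/a/">a</a>'
)
print(extract_toc_urls(sample))
```

Keeping the URLs in a list rather than a text file makes it easy to inspect or filter them before the next download step.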


In [2]:
! cat toc1.url.txt | xargs curl > toc1.page.txt  

In [3]:
! egrep -o 'https://dictionary.cambridge.org/browse/english-chinese-traditional/[^/]+?/[^"]+' toc1.page.txt > toc2.url.txt 

In [4]:
! cat toc2.url.txt | xargs curl > toc2.page.txt

In [5]:
! egrep -o "/dictionary/english-chinese-traditional/[^\"\'']+" toc2.page.txt | awk '{print "http://dictionary.cambridge.org/"$0}' > toc3.url.txt

In [20]:
! head toc3.url.txt

http://dictionary.cambridge.org//dictionary/english-chinese-traditional/0800-number
http://dictionary.cambridge.org//dictionary/english-chinese-traditional/0898-number
http://dictionary.cambridge.org//dictionary/english-chinese-traditional/101
http://dictionary.cambridge.org//dictionary/english-chinese-traditional/12a
http://dictionary.cambridge.org//dictionary/english-chinese-traditional/12th-man
http://dictionary.cambridge.org//dictionary/english-chinese-traditional/15
http://dictionary.cambridge.org//dictionary/english-chinese-traditional/18
http://dictionary.cambridge.org//dictionary/english-chinese-traditional/18-yard-box
http://dictionary.cambridge.org//dictionary/english-chinese-traditional/180
http://dictionary.cambridge.org//dictionary/english-chinese-traditional/2-3-5
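
The URLs listed above contain `//dictionary` because the prefix in the `awk` command ends with a slash and each matched path starts with one. They also use `http`, while the test request later in this notebook uses `https`. The scraping results further down suggest the server accepts these URLs anyway, but they can be tidied before use. A minimal sketch, where `normalize_url` is a hypothetical helper name:

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Force https and collapse repeated slashes in the path."""
    parts = urlsplit(url.strip())
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(("https", parts.netloc, path, parts.query, parts.fragment))


print(normalize_url(
    "http://dictionary.cambridge.org//dictionary/english-chinese-traditional/101"
))
# https://dictionary.cambridge.org/dictionary/english-chinese-traditional/101
```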


### Scrape the webpage

Now we have a list of URLs for the pages we want to scrape. We will use the scraping package **BeautifulSoup4**; see its documentation for usage details.

[BeautifulSoup4 Official Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [1]:
from bs4 import BeautifulSoup
import requests



#### Test if web requests work

Check that the HTTP request works. If the result doesn't look right, please let us know (or try to debug it yourself!).

In [2]:
url = 'https://dictionary.cambridge.org/dictionary/english-chinese-traditional/accident' # example
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36"  # this is to bypass the limitation from the cambridge dictionary server
headers = {'User-Agent': user_agent}
web_request = requests.get(url, headers=headers)
soup = BeautifulSoup(web_request.text, "html.parser")

In [3]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   accident in Traditional Chinese - Cambridge Dictionary
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="accident translate: 意外；不測；事故. Learn more in the Cambridge English-Chinese traditional Dictionary." name="description"/>
  <meta content="accident, chinese (traditional), dictionary, english, british, british english, definition, define, meaning, spelling, conjugation, audio pronunciation, free, online" name="keywords"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,minimum-scale=1,initial-scale=1" name="viewport"/>
  <script async="" charset="UTF-8" data-domain-script="1be0fc8f-627a-4f3a-a13f-a4d11e467d2a" data-ignore-ga="true" src="https://cdn.cookielaw.org/scripttemplates/otSDKStub.js" type="text/javascript">
  </script>
  <meta content="app-id=1586255267" name="apple-itunes-app">
   <meta content="app-id=or

#### Get the head word of the page

Some results may span multiple lines, which is fine. Just handle it in your implementation.

In [4]:
for sents in soup.find_all('span', {'class': 'hw dhw'}):
    print(sents.get_text())

accident
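
One way to handle text that spans several lines is to collapse all whitespace before storing it. This is a minimal sketch; `clean_text` is a hypothetical helper name and the sample string is made up for illustration.

```python
def clean_text(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


print(clean_text("  by\n   accident \n"))  # by accident
```

The same helper could be applied to any `get_text()` result before it is stored.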


#### Get the part of speech (POS) of the word

In [5]:
for sents in soup.find_all('span', {'class': 'pos dpos'}):
    print(sents.get_text())

noun


#### Get the definition of the word

In [6]:
for sents in soup.find_all('div', {'class': 'ddef_h'}):
    print(sents.find('div', {'class': 'def ddef_d db'}).get_text())

something bad that happens that is not expected or intended and that often damages something or injures someone
without intending to, or without being intended


#### Translation (of the definition)

In [7]:
for sents in soup.find_all('span', {'class': 'trans dtrans dtrans-se break-cj'}):
    print(sents.get_text())

意外；不測；事故
偶然地，意外地


In [8]:
for sents in soup.find_all('span', {'class': 'eg deg'}):
    print(sents.get_text())

Josh had an accident and spilled water all over his work.
She was injured in a car/road accident (= when one car hit another).
I deleted the file by accident.
I found her letter by accident as I was looking through my files.


In [9]:
for sents in soup.find_all('span', {'class': 'trans dtrans dtrans-se hdb break-cj'}):
    print(sents.get_text())

喬希不小心把作業上灑得都是水。
她在一宗車禍／交通意外中受傷了。
我不小心刪掉了那個檔案。
我在查看我的文件時，意外地發現了她的信。
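
Printing the sentences and their translations in two separate loops loses the pairing: if one example has no translation, the two lists no longer line up. A possible alternative is to walk each example block and read both parts from it. The sketch below assumes the `examp dexamp` container class used later in this notebook; `example_pairs` is a hypothetical helper name, and the HTML fragment is a hand-written stand-in rather than real page markup. It uses CSS selectors via `select`, which match on individual classes regardless of their order.

```python
from bs4 import BeautifulSoup


def example_pairs(soup):
    """Return (sentence, translation) tuples, one per example block."""
    pairs = []
    for block in soup.select("div.examp.dexamp"):
        sentence = block.select_one("span.eg.deg")
        if sentence is None:
            continue  # nothing to pair in this block
        translation = block.select_one("span.trans.dtrans")
        pairs.append((
            sentence.get_text(strip=True),
            translation.get_text(strip=True) if translation else None,
        ))
    return pairs


# Hand-written fragment for illustration only.
html = """
<div class="examp dexamp"><span class="eg deg">I deleted the file by accident.</span>
<span class="trans dtrans dtrans-se hdb break-cj">我不小心刪掉了那個檔案。</span></div>
<div class="examp dexamp"><span class="eg deg">An example without a translation.</span></div>
"""
for pair in example_pairs(BeautifulSoup(html, "html.parser")):
    print(pair)
```

A missing translation shows up as `None` in its own pair instead of shifting every later translation by one position.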


#### Real implementation of the scraping!

Build a Python dictionary by scraping each page with bs4.

Note: Scraping every URL in the .txt file would take a very long time, so we only process the first 1000 URLs.
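
With this many sequential requests, it can help to pause between retries instead of hitting the server again immediately. The following is a minimal sketch of retrying with an increasing delay. `fetch_with_retry`, `flaky_fetch`, and the delay values are all assumptions made for illustration, and the fetch function is passed in so the logic can be tried without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, waiting longer after each failure.

    `fetch` is any callable that returns the page text or raises on failure.
    Returns None if every attempt fails.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:  # broad on purpose for this sketch
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None


# Simulated fetcher that fails twice and then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated failure")
    return "<html>ok</html>"


delays = []  # record the requested delays instead of actually sleeping
print(fetch_with_retry(flaky_fetch, "https://example.org", sleep=delays.append))
print(delays)  # [1.0, 2.0]
```

With `requests`, the `fetch` argument could be a small wrapper around `requests.get(url, headers=headers, timeout=10)` that returns `.text`; the timeout value here is only an example.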

In [17]:
from bs4 import BeautifulSoup
import requests
import re
import json

# read the URL list, one URL per line
with open('toc3.url.txt', 'r') as url_file:
    url_list = [line.rstrip() for line in url_file]

word_dict = {}

for url in url_list[:1000]:
    web_request = None

    headword = None
    current_dict = {}

    retry_count = 0

    # send request
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36"  # this is to bypass the limitation from the cambridge dictionary server
    headers = {'User-Agent': user_agent}

    while retry_count < 3:
        web_request = requests.get(url, headers=headers)
        soup = BeautifulSoup(web_request.text, "html.parser")

        # get elements from the page
        if soup.find('div', {'class': 'di-body'}) != None:
            di_body = soup.find('div', {'class': 'di-body'})

            # get Headword
            if di_body.find('h2', {'class': re.compile(r'headword tw-bw dhw dpos-h_hw.*')}) != None:
                headword = di_body.find('h2', {'class': re.compile(r'headword tw-bw dhw dpos-h_hw.*')}).get_text()
            elif di_body.find('span', {'class': 'hw dhw'}) != None:
                headword = di_body.find('span', {'class': 'hw dhw'}).get_text()
            else:
                headword = url
            current_dict['Headword'] = headword
            
            entry_array = []

            # check structure
            if di_body.find('div', {'class': 'pr entry-body__el'}) != None:
                entry_body = di_body.find_all('div', {'class': 'pr entry-body__el'})
            else:
                entry_body = soup.find_all('div', {'class': 'di-body'})

            for entry in entry_body:
                entry_array_dict = {}

                # check POS
                if entry.find('span', {'class': 'pos dpos'}) != None:
                    POS_array = []

                    # get POS
                    for POS_span in entry.find_all('span', {'class': 'pos dpos'}):
                        POS_array.append(POS_span.text)
                    entry_array_dict['POS'] = POS_array 
                else:
                    entry_array_dict['POS'] = 'N/A'

                sense_body_array = []

                for sense_body in entry.find_all('div', {'class': 'sense-body dsense_b'}):

                    # check DEFINITIONS
                    if sense_body.find('div', {'class': 'def-block ddef_block'}, recursive=False) != None:

                        for def_block in sense_body.find_all('div', {'class': 'def-block ddef_block'}, recursive=False):
                            def_block_array_dict = {}

                            # get DEFINITION-ENG
                            if def_block.find('div', {'class': 'def ddef_d db'}) != None:
                                def_block_array_dict['DEFINITION-ENG'] = def_block.find('div', {'class': 'def ddef_d db'}).get_text()
                            else:
                                def_block_array_dict['DEFINITION-ENG'] = 'N/A'

                            # get DEFINITION-CHI
                            if def_block.find('span', {'class': 'trans dtrans dtrans-se break-cj'}) != None:
                                def_block_array_dict['DEFINITION-CHI'] = def_block.find('span', {'class': 'trans dtrans dtrans-se break-cj'}).get_text()
                            else:
                                def_block_array_dict['DEFINITION-CHI'] = 'N/A'
                            
                            if def_block.find('div', {'class': 'examp dexamp'}) != None:
                                examp_array = []

                                # get EXAMPLE-SENTS
                                for examples in def_block.find_all('div', {'class': 'examp dexamp'}):
                                    examp_array_dict = {}

                                    # get SENT
                                    examp_array_dict['SENT'] = examples.find('span', {'class': 'eg deg'}).get_text()

                                    # get SENT-CHT
                                    if examples.find('span', {'class': 'trans dtrans dtrans-se hdb break-cj'}) != None:
                                        examp_array_dict['SENT-CHT'] = examples.find('span', {'class': 'trans dtrans dtrans-se hdb break-cj'}).get_text()
                                    else:
                                        examp_array_dict['SENT-CHT'] = 'N/A'

                                    examp_array.append(examp_array_dict)

                                def_block_array_dict['EXAMPLE-SENTS'] = examp_array
                            
                            else: 
                                def_block_array_dict['EXAMPLE-SENTS'] = 'N/A'

                            sense_body_array.append(def_block_array_dict)

                    # check PHRASE
                    if sense_body.find('div', {'class': re.compile(r'pr phrase-block dphrase-block.*')}) != None:

                        for phrase_block in sense_body.find_all('div', {'class': re.compile(r'pr phrase-block dphrase-block.*')}):
                            phrase_block_array_dict = {}

                            # get PHRASE
                            phrase_block_array_dict['PHRASE'] = phrase_block.find('span', {'class': 'phrase-title dphrase-title'}).get_text()
                            
                            phrase_body = phrase_block.find('div', {'class': 'phrase-body dphrase_b'})

                            def_block_array = []

                            # get PHRASE-DEFINITIONS
                            for def_block in phrase_body.find_all('div', {'class': 'def-block ddef_block'}, recursive=False):
                                def_block_array_dict = {}

                                # get PHRASE-DEFINITION-ENG
                                def_block_array_dict['PHRASE-DEFINITION-ENG'] = def_block.find('div', {'class': 'def ddef_d db'}).get_text()

                                # get PHRASE-DEFINITION-CHT
                                if def_block.find('span', {'class': 'trans dtrans dtrans-se break-cj'}) != None:
                                    def_block_array_dict['PHRASE-DEFINITION-CHI'] = def_block.find('span', {'class': 'trans dtrans dtrans-se break-cj'}).get_text()

                                def_block_array.append(def_block_array_dict)

                            phrase_block_array_dict['PHARSE-BODY'] = def_block_array

                            sense_body_array.append(phrase_block_array_dict)
                        
                entry_array_dict['POS-BODY'] = sense_body_array

                entry_array.append(entry_array_dict)

            # put the entry to current_dict
            current_dict['ENTRY'] = entry_array

            # put the word to word_dict
            word_dict[headword] = current_dict

            break
        
        retry_count += 1

    if retry_count == 3:
        # every attempt failed: key the placeholder by URL, because headword is still None here
        current_dict['Headword'] = url
        current_dict['ENTRY'] = 'url request fail'
        word_dict[url] = current_dict

# save dict
json_str = json.dumps(word_dict, ensure_ascii=False, indent=4)

with open("result.json", "w", encoding="utf-8") as file:
    file.write(json_str)

# print dict
word_dict


{'0800 number': {'Headword': '0800 number',
  'ENTRY': [{'POS': ['noun'],
    'POS-BODY': [{'DEFINITION-ENG': 'in the UK, a free phone number that begins with 0800, provided by companies or other organizations offering advice or information',
      'DEFINITION-CHI': '（英國的）0800號碼（公司企業或其他提供諮詢或資訊服務的機構開通的以0800開頭的免費電話號碼）',
      'EXAMPLE-SENTS': 'N/A'}]}]},
 '0898 number': {'Headword': '0898 number',
  'ENTRY': [{'POS': ['noun'],
    'POS-BODY': [{'DEFINITION-ENG': 'in the UK, an expensive phone number that begins with 0898 that is provided by companies offering services such as chatlines',
      'DEFINITION-CHI': '（英國的）0898號碼（由提供電話聊天之類服務的公司開通的以0898開頭的高收費電話號碼）',
      'EXAMPLE-SENTS': 'N/A'}]}]},
 '101': {'Headword': '101',
  'ENTRY': [{'POS': ['adjective'],
    'POS-BODY': [{'DEFINITION-ENG': 'showing the most basic knowledge about a subject',
      'DEFINITION-CHI': '基本知識',
      'EXAMPLE-SENTS': [{'SENT': "You should know how to boil an egg - that's cooking 101.",
        'SENT-CHT': '你應

### Output example
`word_dict['accident']`
```json
{
    "Headword": "accident",
    "ENTRY": [
        {
            "POS": [...],
            "POS-BODY": [
                {
                    "DEFINITION-ENG": ...,
                    "DEFINITION-CHI": ...,
                    "EXAMPLE-SENTS": [
                        {"SENT": ..., "SENT-CHT": ...},
                        ...
                    ]
                },
                ...
            ]
        },
        ...
    ]
}
```

#### Test if the result is available

Let's look up the word 'accident' and see what we get.

In [18]:
word_dict['accident']

{'Headword': 'accident',
 'ENTRY': [{'POS': ['noun'],
   'POS-BODY': [{'DEFINITION-ENG': 'something bad that happens that is not expected or intended and that often damages something or injures someone',
     'DEFINITION-CHI': '意外；不測；事故',
     'EXAMPLE-SENTS': [{'SENT': 'Josh had an accident and spilled water all over his work.',
       'SENT-CHT': '喬希不小心把作業上灑得都是水。'},
      {'SENT': 'She was injured in a car/road accident (= when one car hit another).',
       'SENT-CHT': '她在一宗車禍／交通意外中受傷了。'}]},
    {'PHRASE': 'by accident',
     'PHARSE-BODY': [{'PHRASE-DEFINITION-ENG': 'without intending to, or without being intended',
       'PHRASE-DEFINITION-CHI': '偶然地，意外地'}]}]}]}
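
Because the result is nested several levels deep, a small helper can make it easier to read. The sketch below flattens one entry into `(POS, English definition, Chinese definition)` rows and skips phrase blocks. `iter_definitions` is a hypothetical helper name, and the sample dictionary is a trimmed copy of the output above.

```python
def iter_definitions(entry):
    """Yield (pos, english, chinese) for every plain definition in one word entry."""
    if not isinstance(entry.get("ENTRY"), list):
        return  # failed requests store a string here instead of a list
    for pos_block in entry["ENTRY"]:
        pos = pos_block["POS"]
        pos_label = "/".join(pos) if isinstance(pos, list) else pos
        for item in pos_block["POS-BODY"]:
            if "DEFINITION-ENG" in item:  # phrase blocks use different keys
                yield pos_label, item["DEFINITION-ENG"], item["DEFINITION-CHI"]


# Trimmed copy of word_dict['accident'] from the output above.
accident = {
    "Headword": "accident",
    "ENTRY": [{
        "POS": ["noun"],
        "POS-BODY": [
            {
                "DEFINITION-ENG": "something bad that happens that is not expected "
                                  "or intended and that often damages something or "
                                  "injures someone",
                "DEFINITION-CHI": "意外；不測；事故",
                "EXAMPLE-SENTS": "N/A",
            },
            {"PHRASE": "by accident", "PHARSE-BODY": []},
        ],
    }],
}

for row in iter_definitions(accident):
    print(row)
```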