### Exploring scraping the knowledge bases
We want to pull the paragraphs from the link.  
Notably, we can pull the entire passage out from the link, but we remove all paragraphs with  < > in it. 

In [133]:
link = "https://biblehub.com/commentaries/expositors/genesis/1.htm"

from collections import Counter 
import requests
from bs4 import BeautifulSoup as soup
import json
import json_lines

r = requests.get(link)
link = soup(r.text)

# https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8
r.encoding = r.apparent_encoding

# passages = soup.findAll("span", {"class": "p"})
# passages = link.findAll("div", {"class": "chap"})

link = soup(r.text)
passages = link.findAll("div", {"class": "chap"})

text_elements = str(passages).split('<span class="p"><br/><br/></span>')
text_elements = [text_element for text_element in text_elements if (Counter(text_element)['>'] <3) & (Counter(text_element)['>'] <3) ]
text_elements[4]

'Accepting this chapter then as it stands, and believing that only by looking at the Bible as it actually is can we hope to understand God’s method of revealing Himself, we at once perceive that ignorance of some departments of truth does not disqualify a man for knowing and imparting truth about God. In order to be a medium of revelation a man does not need to be in advance of his age in secular learning. Intimate communion with God, a spirit trained to discern spiritual things, a perfect understanding of and zeal for God’s purpose, these are qualities quite independent of a knowledge of the discoveries of science. The enlightenment which enables men to apprehend God and spiritual truth has no necessary connection with scientific attainments. David’s confidence in God and his declarations of His faithfulness are none the less valuable, because he was ignorant of a very great deal which every schoolboy now knows. Had inspired men introduced into their writings information which anticip

### 1. A pull clean paragraphs from url function

In [91]:
def link_to_soup(link):
    """
    Takes in the link and returns the text paragraphs
    """
    import requests
    from bs4 import BeautifulSoup as soup
    r = requests.get(link)
    link = soup(r.text)

    # https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8
    r.encoding = r.apparent_encoding

    # get the passages
    soup_ = soup(r.text)
    
    return soup_

def soup_to_paragraphs(url_soup):
    """
    Takes in the soup and retrieves the passage paragraphs
    """
    passages = url_soup.findAll("div", {"class": "chap"})

    text_elements = str(passages).split('<span class="p"><br/><br/></span>')
    paragraphs = [text_element for text_element in text_elements 
                  if (Counter(text_element)['>'] <3) & (Counter(text_element)['>'] <3)]
    
    return paragraphs

def soup_to_next_extension(url_soup):
    """
    The next page would point to an extention, not the full link. e.g. "../genesis/1.htm"
    """
    next_url_extension = url_soup.find("div", {"id": "topheading"}).findAll('a')[-1].get('href')
    next_url_extension = next_url_extension.replace('..','')
    return next_url_extension

def soup_to_verse_references(url_soup):
    """
    Finds the verse references at the top of the page
    (1) the references may overlap, repeats should later be removed
    (2) There may be multiple of them
    """
    passages = url_soup.findAll("div", {"class": "chap"})
    verse_references = passages[0].findAll("span", {"class":"bld"})[0].findAll('a')
    verse_references = ', '.join([verse_reference.text for verse_reference in verse_references])
    
    return verse_references

def save_passage(url, passages, verse_references, authors, save_path='../data/temp.jsonl'):
    """
    Save the knowledge base in a dict line in a jsonl file
    
    args:
        url: (str) link of the referenced paragraph
        passages: (list) list of the string paragraphs
        verse_references: (str) indicates which verse is used here
        authors: (str) credits to the original authors
        save_path: (str) filepath of the jsonl database
        
    Return:
        None
    """
    for paragraph in passages:
        dict_to_save = {'url': url,
                        'paragraph':paragraph,
                        'verse_references':verse_references,
                        'authors':authors
                       }
        with open(save_path, 'a') as outfile:
            json.dump(dict_to_save, outfile)
            outfile.write('\n')

### 2. From URL to the passage

In [59]:
url_soup = link_to_soup(url)

# get verse_references
verse_references = soup_to_verse_references(url_soup)

# get commentary text
passages = soup_to_paragraphs(url_soup)

# next URL
next_extension = soup_to_next_extension(url_soup)
url = '/'.join(url.split('/')[:-2]) + next_extension

### 2b. Writing the data to a jsonl file

In [93]:
def save_passage(url, passages, verse_references, authors, save_path='../data/temp.jsonl'):
    """
    Save the knowledge base in a dict line in a jsonl file
    
    args:
        url: (str) link of the referenced paragraph
        passages: (list) list of the string paragraphs
        verse_references: (str) indicates which verse is used here
        authors: (str) credits to the original authors
        save_path: (str) filepath of the jsonl database
        
    Return:
        None
    """
    for paragraph in passages:
        dict_to_save = {'url': url,
                        'paragraph':paragraph,
                        'verse_references':verse_references,
                        'authors':authors
                       }
        with open(save_path, 'a') as outfile:
            json.dump(dict_to_save, outfile)
            outfile.write('\n')
            
save_passage(url, passages, verse_references, 'expositors commentary', save_path='../data/temp.jsonl')

In [None]:
import json
import json_lines
with open('../data/expositors_commentary.jsonl', 'r') as handle:
    for line in handle.readlines():
        print(line)
        print("")

### 3. Deploying the scraping logic

In [122]:
url = "https://biblehub.com/commentaries/expositors/genesis/10.htm"
last_url = 'https://biblehub.com/commentaries/expositors/revelation/22.htm'


while True:
    url_soup = link_to_soup(url)

    # get verse_references
    verse_references = soup_to_verse_references(url_soup)
    
    # get commentary text
    passages = soup_to_paragraphs(url_soup)

    # next URL
    next_extension = soup_to_next_extension(url_soup)
    url = '/'.join(url.split('/')[:-2]) + next_extension
    
    # save the verse_references, passages, url
    # Should we just separate the sentences here?
    
    # breaking condition: we have reached the last page
    if url==last_url:
        break

### 4. Troubleshooting

In [124]:
url = "https://biblehub.com/commentaries/expositors/genesis/1.htm"
last_url = 'https://biblehub.com/commentaries/expositors/revelation/22.htm'

url_soup = link_to_soup(url)

# get verse_references
verse_references = soup_to_verse_references(url_soup)

# get commentary text
passages = soup_to_paragraphs(url_soup)

# next URL
next_extension = soup_to_next_extension(url_soup)
url = '/'.join(url.split('/')[:-2]) + next_extension

# save the verse_references, passages, url
# Should we just separate the sentences here?


In [128]:
passages[2]

'It will, however, be said, and with much appearance of justice, that although the first object of the writer was not to convey scientific information, yet he might have been expected to be accurate in the information he did advance regarding the physical universe. This is an enormous assumption to make on <span class="ital">a priori</span> grounds, but it is an assumption worth seriously considering because it brings into view a real and important difficulty which every reader of Genesis must face. It brings into view the twofold character of this account of creation. On the one hand it is irreconcilable with the teachings of science. On the other hand it is in striking contrast to the other cosmogonies which have been handed down from prescientific ages. These are the two patent features of this record of creation and both require to be accounted for. Either feature alone would be easily accounted for; but the two co-existing in the same document are more baffling. We have to account

In [107]:
url_soup.findAll("div", {"class": "chap"})[0].findAll("span", {"class":"bld"})

[]

### 5. How to read the knowledge base

In [118]:
import json
with open('../data/expositors_commentary.jsonl', 'r') as handle:
    df = handle.read()

In [131]:
json.loads(df.split('\n')[0])#['paragraph']

{'url': 'https://biblehub.com/commentaries/expositors/genesis/10.htm',
 'paragraph': 'Before exposing another, think first whether your own conduct could bear a similar treatment, whether you have never done the thing you desire to conceal, said the thing you would blush to hear repeated, or thought the thought you could not bear another to read. And if you be a Christian, does it not become you to remember what you yourself have learnt of the slipperiness of this world’s ways, of your liability to fall, of your sudden exposure to sin from some physical disorder, or some slight mistake which greatly extenuates your sin, but which you could not plead before another? And do you know nothing of the difficulty of conquering one sin that is rooted in your constitution, and the strife that goes on in a man’s own soul and in secret though he show little immediate fruit of it in his life before men? Surely it becomes us to give a man credit for much good resolution and much sore self-denial an

In [None]:
import json
with open('../data/expositors_commentary.jsonl', 'r') as handle:
    for line in handle.readlines():
        print(line)
        print("")

In [108]:
with open('../data/expositors_commentary.jsonl', 'rb') as f: # opening file in binary(rb) mode    
    for item in json_lines.reader(f):
        print(item) #or use print(item['X'])

{'url': 'https://biblehub.com/commentaries/expositors/genesis/10.htm', 'paragraph': 'Before exposing another, think first whether your own conduct could bear a similar treatment, whether you have never done the thing you desire to conceal, said the thing you would blush to hear repeated, or thought the thought you could not bear another to read. And if you be a Christian, does it not become you to remember what you yourself have learnt of the slipperiness of this world’s ways, of your liability to fall, of your sudden exposure to sin from some physical disorder, or some slight mistake which greatly extenuates your sin, but which you could not plead before another? And do you know nothing of the difficulty of conquering one sin that is rooted in your constitution, and the strife that goes on in a man’s own soul and in secret though he show little immediate fruit of it in his life before men? Surely it becomes us to give a man credit for much good resolution and much sore self-denial and

{'url': 'https://biblehub.com/commentaries/expositors/genesis/50.htm', 'paragraph': 'V. ITS SANCTIONS <a href="/context/exodus/23-20.htm" title="Behold, I send an Angel before you, to keep you in the way, and to bring you into the place which I have prepared....">Exodus 23:20-33</a>.', 'verse_references': 'Genesis 48:1-22, Genesis 49:1-33', 'authors': 'Expositors Commentary'}
{'url': 'https://biblehub.com/commentaries/expositors/genesis/50.htm', 'paragraph': 'A bold transition: the Angel in Whom is "My Name."', 'verse_references': 'Genesis 48:1-22, Genesis 49:1-33', 'authors': 'Expositors Commentary'}
{'url': 'https://biblehub.com/commentaries/expositors/genesis/50.htm', 'paragraph': 'Not a mere messenger.', 'verse_references': 'Genesis 48:1-22, Genesis 49:1-33', 'authors': 'Expositors Commentary'}
{'url': 'https://biblehub.com/commentaries/expositors/genesis/50.htm', 'paragraph': 'Nor the substitute of <a href="/context/exodus/33-2.htm" title="And I will send an angel before you; and 