# Image Processing in the Scraping Process

While scraping JPE, I noticed that most of the articles were well digitized, with their abstracts provided on their webpages, but some were not. Several articles, specifically those from 1970, 1972, 1974, 1985, and 1987, have an abstract in their first-page image but no ready-to-copy abstract in text.

The following program first scrapes the first-page images of the articles from those years, applies some image processing with `OpenCV` to detect the text area of the abstract, and finally turns it into text with `tesseract`.

## Part 1    The scraping

In [58]:
import bs4
import requests
import CONFIG

## scraping functions
def getBSFromURL(url):
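    try:
        r = requests.get(url, headers = CONFIG.HEADER, timeout = 10)
        return getBS(r.text)
    
    # ReadTimeout is a subclass of RequestException, so it has to be caught
    # first; otherwise the generic handler would always take over.
    except requests.exceptions.ReadTimeout as e:
        print("Timeout")
        raise e
    except requests.exceptions.RequestException as e:
        print("Connection Error")
        raise e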

def getBS(html):
    html_bs = bs4.BeautifulSoup(html, 'html.parser')
    return html_bs

In [76]:
## Read articles from 1970, 1972, 1974, 1985, 1987 that don't have abstracts.
import pandas as pd

with open('JPE.csv') as file:
    all_articles = pd.read_csv(file)

abstract_mask = all_articles['Abstract'].isnull()
comment_mask = all_articles['Title'].str.contains('|'.join([
    'comment', 'reply'
]),regex=True, case=False)
years_mask = all_articles['Year'].isin([1970,1972, 1974, 1985, 1987])

articles_needed_mask = abstract_mask & (~comment_mask) & years_mask

articles_needed = all_articles[articles_needed_mask]
all_articles.loc[articles_needed_mask, 'img'] = 1


In [75]:
# all_articles.to_csv('JPE_img_proc.csv', index = False)

In [84]:
def getImgID(url):
    return url.split('/')[-1]
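
`getImgID` assumes that the article URL ends with the numeric ID and nothing else. As a minimal sketch of a more defensive variant, assuming the same URL layout, the helper below (the name `get_img_id` and the example URLs are hypothetical) also tolerates a trailing slash or a query string:

```python
from urllib.parse import urlparse

def get_img_id(url):
    """Return the last non-empty path segment, ignoring query string and fragment."""
    path = urlparse(url).path
    segments = [s for s in path.split('/') if s]
    if not segments:
        raise ValueError(f'no path segment in {url!r}')
    return segments[-1]

# Hypothetical URLs, for illustration only
print(get_img_id('https://example.org/doi/10.0000/259715'))
print(get_img_id('https://example.org/doi/10.0000/259715/?journalCode=jpe'))
```

Both calls yield the same ID, so the saved image name does not depend on how the URL happens to be written in the CSV.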

In [92]:
with open('JPE_img_proc.csv') as file:
    article_img_proc = pd.read_csv(file)

articles_needed = article_img_proc[article_img_proc['img']==1]

for index, article in articles_needed.iterrows():
    url = article['Source URL']
    try:
        article_bs = getBSFromURL(url)
        c_img = article_bs.select("img.firstPageImage")
        if not c_img:
            raise Exception(f'No first-page image found at {url}')
        img_url = CONFIG.DOMAIN_URL + c_img[0]['src']
        img_name = getImgID(url)
        print(f'handling {img_name}: {article["Year"]}')
        
        
        ## save Img
        img_byte = requests.get(img_url, stream=True).content
        with open(f'first_pages/orig/{img_name}.png','wb') as f:
            f.write(img_byte)
        
        article_img_proc.loc[index, 'img'] = 2
        
        with open('JPE_img_proc.csv', 'w') as file:
            article_img_proc.to_csv(file, index = False)
        
    except Exception as e:
        print(e)
        print(f'Went wrong in {index}')
        continue
    

handling 259715: 1970
handling 259716: 1970
handling 259717: 1970
handling 259718: 1970
handling 259720: 1970
handling 259721: 1970
handling 259687: 1970
handling 259688: 1970
handling 259689: 1970
handling 259690: 1970
handling 259691: 1970
handling 259692: 1970
handling 259693: 1970
handling 259694: 1970
handling 259695: 1970
handling 259697: 1970
handling 259698: 1970
handling 259699: 1970
handling 259700: 1970
handling 259701: 1970
handling 259702: 1970
handling 259703: 1970
handling 259705: 1970
handling 259678: 1970
handling 259679: 1970
handling 259680: 1970
handling 259681: 1970
handling 259682: 1970
handling 259683: 1970
handling 259684: 1970
handling 259685: 1970
handling 259686: 1970
handling 259658: 1970
handling 259659: 1970
handling 259660: 1970
handling 259661: 1970
handling 259662: 1970
handling 259663: 1970
handling 259664: 1970
handling 259665: 1970
handling 259666: 1970
handling 259667: 1970
handling 259669: 1970
handling 259670: 1970
handling 259671: 1970
handling 2

handling 261314: 1985
handling 261315: 1985
handling 261316: 1985
handling 261317: 1985
handling 261318: 1985
handling 261319: 1985
handling 261320: 1985
handling 261321: 1985
handling 261322: 1985
handling 261297: 1985
handling 261298: 1985
handling 261299: 1985
handling 261300: 1985
handling 261301: 1985
handling 261302: 1985
handling 261303: 1985
handling 261304: 1985
handling 261305: 1985
handling 261306: 1985
handling 261307: 1985
handling 261308: 1985
handling 261309: 1985
handling 261284: 1985
handling 261285: 1985
handling 261286: 1985
handling 261287: 1985
handling 261288: 1985
handling 261289: 1985
handling 261290: 1985
handling 261291: 1985
handling 261292: 1985
handling 261293: 1985
handling 261294: 1985
handling 261295: 1985
handling 261296: 1985
handling 261508: 1987
handling 261509: 1987
handling 261510: 1987
handling 261511: 1987
handling 261512: 1987
handling 261513: 1987
handling 261514: 1987
handling 261515: 1987
handling 261516: 1987
handling 261517: 1987
handling 2

## Part 2    Detecting text by OpenCV

Looking at several articles that do provide an abstract suggests that the abstract sits in the horizontal center of the page, with a margin of about 6% of the page width on each side, as in the following example:

<img src="text_section_example.png" alt="Drawing" style="width: 200px;"/>

A standard process is applied to eliminate noise and then to dilate and erode the text iteratively, so that the characters merge and the entire text area can be extracted.

I first use `cv2.medianBlur` with a kernel size of 5 to blur out noise, which was possibly introduced when the document was scanned. I then binarize the image with `cv2.adaptiveThreshold`, which computes a separate threshold for every pixel from its neighboring pixels. I chose `cv2.ADAPTIVE_THRESH_GAUSSIAN_C`, which weights the neighborhood with a Gaussian window, so that the edges of a paragraph are handled better than with a plain mean.
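
To illustrate why a per-pixel threshold helps on unevenly lit scans, here is a minimal pure-Python sketch on a single row of gray values. It is not OpenCV's implementation: it uses a plain mean window instead of the Gaussian one, clips the window at the borders, and the function names and pixel values are made up for the example.

```python
def global_threshold_inv(row, thresh):
    """Inverse binary threshold with one cutoff for the whole row."""
    return [255 if v <= thresh else 0 for v in row]

def adaptive_mean_threshold_inv(row, block_size=3, c=10):
    """Inverse binary threshold where each pixel is compared with
    (mean of its neighborhood - c). The window is clipped at the borders."""
    half = block_size // 2
    out = []
    for i, v in enumerate(row):
        window = row[max(0, i - half): i + half + 1]
        local_thresh = sum(window) / len(window) - c
        out.append(255 if v <= local_thresh else 0)
    return out

# Background fades from 200 to 110; two dark "strokes" sit at index 2 and 6.
row = [200, 190, 120, 170, 160, 150, 80, 130, 120, 110]

print(global_threshold_inv(row, 125))      # also marks the dim background at the right
print(adaptive_mean_threshold_inv(row))    # marks only the two strokes
```

With a single cutoff, the darker end of the background is mistaken for text, while the local threshold follows the fading background and keeps only the strokes.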

I then proceed to the morphological step. I dilate everything in the thresholded image to merge single characters into one paragraph block, and then erode the edges to get a decent text block. This is the critical part of extracting the text area of an article. In the code below, the dilation runs for 20 iterations and the erosion for 19.
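
Dilation followed by erosion is known as a morphological closing. The following is a minimal one-dimensional sketch in pure Python, not the OpenCV routine: it uses a 3-pixel window, and the helper names and the sample row are made up for the example.

```python
def dilate(row, iterations=1):
    """A pixel turns on if it or one of its direct neighbors is on."""
    for _ in range(iterations):
        n = len(row)
        row = [
            1 if any(row[j] for j in range(max(0, i - 1), min(n, i + 2))) else 0
            for i in range(n)
        ]
    return row

def erode(row, iterations=1):
    """A pixel stays on only if it and its direct neighbors are all on."""
    for _ in range(iterations):
        n = len(row)
        row = [
            1 if all(row[j] for j in range(max(0, i - 1), min(n, i + 2))) else 0
            for i in range(n)
        ]
    return row

# Three one-pixel "characters" separated by two-pixel gaps
row = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]

closed = erode(dilate(row, 1), 1)   # same number of iterations
grown = erode(dilate(row, 2), 1)    # one more dilation than erosion
print(closed)
print(grown)
```

With equal iteration counts the three characters merge into one block that spans exactly from the first to the last character. With one extra dilation, as in the 20 versus 19 iterations used in this notebook, the block ends up one pixel wider on each side.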

Next I call `cv2.findContours` to get the contours of the text areas, and then `cv2.boundingRect` to get the bounding box of each contour. A box is treated as the abstract if its left and right margins are each about 6% of the page width (between 5% and 7% in the code).
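
As a worked example of the margin rule, the sketch below computes the left and right margins of a bounding box as fractions of the page width and checks them against a 6% target with a tolerance of one percentage point. The function names, the tolerance parameter, and the sample boxes are assumptions made for illustration.

```python
def margin_fractions(x, w, img_width):
    """Left and right margins of a box as fractions of the page width."""
    left = x / img_width
    right = 1 - (x + w) / img_width
    return left, right

def has_abstract_margins(x, w, img_width, target=0.06, tolerance=0.01):
    """True if both side margins are within `tolerance` of `target`."""
    left, right = margin_fractions(x, w, img_width)
    return all(abs(m - target) <= tolerance for m in (left, right))

page_width = 1000  # hypothetical page width in pixels

print(has_abstract_margins(60, 880, page_width))   # indented block: 6% on both sides
print(has_abstract_margins(20, 960, page_width))   # full-width body text: 2% margins
print(has_abstract_margins(60, 500, page_width))   # short line: right margin is 44%
```

Only the first box passes, because both of its margins fall in the 5% to 7% band. Checking both sides is what separates an indented abstract from a short line that merely starts at the same offset.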

In [149]:
## get all images
from os import listdir
from os.path import isfile, join

orig_path = 'first_pages/orig'
cropped_path = 'first_pages/crop'
onlyfiles = [f for f in listdir(orig_path) if isfile(join(orig_path, f))]
images = list( filter(lambda f:f.split('.')[-1] == 'png', onlyfiles) )

In [145]:
def getContours(img):
    # convert to grayscale
    gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)

    # first blur out any noises
    gray = cv2.medianBlur(gray,5)

    # transform the image into threshold for handling text 
    thresh = cv2.adaptiveThreshold(gray,255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV,11,2)

    # merge single characters into a block of paragraph
    thresh = cv2.dilate(thresh,None,iterations = 20)
    #erode the edges to get a decent text block
    thresh = cv2.erode(thresh,None,iterations = 19)

    # find the contours
    contours,hierarchy = cv2.findContours(thresh,cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)
    
    return contours

def isAbstract(x,y,w,h,img_height, img_width):
    margin_to_left = x / img_width
    margin_to_right = 1 - (x+w)/img_width
    abstract_margin_min = 0.05
    abstract_margin_max = 0.07
    if  margin_to_left >= abstract_margin_min and \
        margin_to_right >= abstract_margin_min and \
        margin_to_left <= abstract_margin_max and \
        margin_to_right <= abstract_margin_max :
        return True
    return False

In [151]:
import cv2
import os


for file_name in images[300:301] :
    img_path = orig_path + '/' + file_name
    print(img_path)
    img = cv2.imread(img_path)
    img_height, img_width,_ = img.shape
    contours = getContours(img)
    
    abstract_area = None
    for contour in contours:
        x,y,w,h = cv2.boundingRect(contour)  #x,y, width, height
        if isAbstract(x,y,w,h,img_height, img_width):
            abstract_area = {'x': x,'y':y,'w':w,'h':h}
            
    if not abstract_area:
        continue
    
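    ## handle text area: crop with the box stored in abstract_area, because
    ## x, y, w, h from the loop hold the last contour visited, not the match
    x, y, w, h = (abstract_area[k] for k in ('x', 'y', 'w', 'h'))
    abstract_text_img = img[y:y+h,x:x+w].copy()
#     cv2.imwrite(f'{cropped_path}/{file_name}_cropped.png',abstract_text_img)
#     os.rename(img_path, f'{orig_path}/c_{file_name}')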
    cv2.imshow('img',abstract_text_img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
    cv2.waitKey(1)

first_pages/orig/259635.png
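
The last step named in the introduction, turning the cropped image into text with `tesseract`, could look roughly like the sketch below. It assumes the `pytesseract` wrapper and a local Tesseract installation; the function names and the clean-up rules are hypothetical and are not taken from this notebook.

```python
import re

def clean_ocr_text(raw):
    """Join OCR output lines into one paragraph."""
    text = re.sub(r'-\s*\n\s*', '', raw)   # re-join words hyphenated at a line end
    text = re.sub(r'\s+', ' ', text)       # collapse newlines and runs of spaces
    return text.strip()

def image_to_abstract(image_path, lang='eng'):
    """Run Tesseract on a cropped abstract image and return cleaned text."""
    # Imported here so that clean_ocr_text works without the OCR stack installed.
    import cv2
    import pytesseract

    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads images as BGR
    return clean_ocr_text(pytesseract.image_to_string(rgb, lang=lang))

print(clean_ocr_text('This paper exam-\nines the  wage\nstructure.\n'))
```

The hyphen rule is a simplification: it also removes the hyphen of a genuinely hyphenated word that happens to break at a line end, so the result may still need a manual check.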
