# Image Processing in the Scraping Process

While scraping JPE, I noticed that most articles are well digitized, with the abstract provided as text on the article's webpage, but some are not. Several articles, specifically those from 1970, 1972, 1974, 1985, and 1987, show the abstract in the first-page image but offer no ready-to-copy abstract text.

The following program first scrapes the first-page images of the articles from those years, then applies some image processing with OpenCV to detect the text area of the abstract, and finally converts that area into text with Tesseract (via `pytesseract`).

## Part 1    The scraping

In [58]:
import bs4
import requests
import CONFIG

## scraping functions
def getBSFromURL(url):
    try:
        r = requests.get(url, headers = CONFIG.HEADER, timeout = 10)
        return getBS(r.text)
    
    # Timeout is a subclass of RequestException, so it has to be caught first
    except requests.exceptions.Timeout as e:
        print("Timeout")
        raise e
    except requests.exceptions.RequestException as e:
        print("Connection Error")
        raise e

def getBS(html):
    html_bs = bs4.BeautifulSoup(html, 'html.parser')
    return html_bs

In [76]:
## Read articles from 1970, 1972, 1974, 1985, 1987 that don't have abstracts.
import pandas as pd

with open('JPE.csv') as file:
    all_articles = pd.read_csv(file)

abstract_mask = all_articles['Abstract'].isnull()
comment_mask = all_articles['Title'].str.contains('|'.join([
    'comment', 'reply'
]),regex=True, case=False)
years_mask = all_articles['Year'].isin([1970,1972, 1974, 1985, 1987])

articles_needed_mask = abstract_mask & (~comment_mask) & years_mask

articles_needed = all_articles[articles_needed_mask]
all_articles.loc[articles_needed_mask, 'img'] = 1


In [75]:
# Run once to write the working copy that the download loop below reads and updates.
# all_articles.to_csv('JPE_img_proc.csv', index = False)

In [84]:
def getImgID(url):
    return url.split('/')[-1]

In [92]:
with open('JPE_img_proc.csv') as file:
    article_img_proc = pd.read_csv(file)

articles_needed = article_img_proc[article_img_proc['img']==1]

for index, article in articles_needed.iterrows():
    url = article['Source URL']
    try:
        article_bs = getBSFromURL(url)
        c_img = article_bs.select("img.firstPageImage")
        if not c_img:
            raise Exception(f'No first-page image found at {url}')
        img_url = CONFIG.DOMAIN_URL + c_img[0]['src']
        img_name = getImgID(url)
        print(f'handling {img_name}: {article["Year"]}')
        
        
        ## save Img
        img_byte = requests.get(img_url, stream=True).content
        with open(f'first_pages/orig/{img_name}.png','wb') as f:
            f.write(img_byte)
        
        article_img_proc.loc[index, 'img'] = 2
        
        with open('JPE_img_proc.csv', 'w') as file:
            article_img_proc.to_csv(file, index = False)
        
    except Exception as e:
        print(e)
        print(f'Went wrong in {index}')
        continue
    

handling 259715: 1970
handling 259716: 1970
handling 259717: 1970
handling 259718: 1970
handling 259720: 1970
handling 259721: 1970
handling 259687: 1970
handling 259688: 1970
handling 259689: 1970
handling 259690: 1970
handling 259691: 1970
handling 259692: 1970
handling 259693: 1970
handling 259694: 1970
handling 259695: 1970
handling 259697: 1970
handling 259698: 1970
handling 259699: 1970
handling 259700: 1970
handling 259701: 1970
handling 259702: 1970
handling 259703: 1970
handling 259705: 1970
handling 259678: 1970
handling 259679: 1970
handling 259680: 1970
handling 259681: 1970
handling 259682: 1970
handling 259683: 1970
handling 259684: 1970
handling 259685: 1970
handling 259686: 1970
handling 259658: 1970
handling 259659: 1970
handling 259660: 1970
handling 259661: 1970
handling 259662: 1970
handling 259663: 1970
handling 259664: 1970
handling 259665: 1970
handling 259666: 1970
handling 259667: 1970
handling 259669: 1970
handling 259670: 1970
handling 259671: 1970
handling 2

handling 261314: 1985
handling 261315: 1985
handling 261316: 1985
handling 261317: 1985
handling 261318: 1985
handling 261319: 1985
handling 261320: 1985
handling 261321: 1985
handling 261322: 1985
handling 261297: 1985
handling 261298: 1985
handling 261299: 1985
handling 261300: 1985
handling 261301: 1985
handling 261302: 1985
handling 261303: 1985
handling 261304: 1985
handling 261305: 1985
handling 261306: 1985
handling 261307: 1985
handling 261308: 1985
handling 261309: 1985
handling 261284: 1985
handling 261285: 1985
handling 261286: 1985
handling 261287: 1985
handling 261288: 1985
handling 261289: 1985
handling 261290: 1985
handling 261291: 1985
handling 261292: 1985
handling 261293: 1985
handling 261294: 1985
handling 261295: 1985
handling 261296: 1985
handling 261508: 1987
handling 261509: 1987
handling 261510: 1987
handling 261511: 1987
handling 261512: 1987
handling 261513: 1987
handling 261514: 1987
handling 261515: 1987
handling 261516: 1987
handling 261517: 1987
handling 2

## Detecting text by OpenCV

Inspecting several first-page images suggests that the abstract sits in the center of the page, with a margin of about 6% of the page width on each side, as in the following example:

<img src="text_section_example.png" alt="Drawing" style="width: 200px;"/>

A standard pipeline is applied to remove noise and then merge the individual characters into connected regions, so that the entire text area can be detected.

I first use `cv2.medianBlur` with a kernel size of 5 to blur out noise, which was probably introduced when the document was scanned. I then binarize the image with `cv2.adaptiveThreshold`, which computes the threshold for each pixel from its neighborhood instead of using one global value. I chose `cv2.ADAPTIVE_THRESH_GAUSSIAN_C`, which weights the neighboring pixels with a Gaussian window, so that the edges of a paragraph are handled better than with a plain mean.
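To make the effect of the median blur concrete, here is a minimal 1-D sketch in plain Python rather than OpenCV. The helper `median_filter_1d` and the sample scanline are made up for illustration; `cv2.medianBlur(gray, 5)` applies the same idea over a 5×5 neighborhood in 2-D.

```python
from statistics import median

def median_filter_1d(values, ksize=5):
    """Replace each value by the median of a ksize-wide window around it.

    Positions outside the row are filled by repeating the edge value
    (an assumption of this sketch).
    """
    half = ksize // 2
    n = len(values)
    out = []
    for i in range(n):
        window = [values[min(max(j, 0), n - 1)]
                  for j in range(i - half, i + half + 1)]
        out.append(median(window))
    return out

# Hypothetical scanline: white paper (255), one isolated dark speck at
# index 3, and a 5-pixel-wide dark stroke at indices 7 to 11.
scanline = [255, 255, 255, 0, 255, 255, 255, 0, 0, 0, 0, 0, 255, 255, 255]
smoothed = median_filter_1d(scanline)
print(smoothed)
```

The isolated speck disappears because it never forms a majority inside any window, while the wider stroke survives unchanged.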

I then move on to the morphological step. I dilate everything in the thresholded image to merge single characters into one paragraph-sized block, and then erode the edges to get a tight text block. This is the critical part of extracting the text area of an article. Dilation and erosion are each iterated 18 times in the code below.
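Dilating and then eroding by the same amount is known as morphological closing. The following 1-D sketch in plain Python shows why it merges neighboring characters. The helpers `dilate`, `erode`, and `count_runs` are illustrative stand-ins, not OpenCV calls, and a 3-pixel window plays the role of the 3×3 element that `cv2.dilate` and `cv2.erode` use when the kernel is `None`.

```python
def dilate(row, iterations=1):
    """Set a pixel to 1 if it or either neighbor is 1 (3-wide window)."""
    for _ in range(iterations):
        row = [max(row[max(i - 1, 0):i + 2]) for i in range(len(row))]
    return row

def erode(row, iterations=1):
    """Keep a pixel at 1 only if it and its in-row neighbors are all 1."""
    for _ in range(iterations):
        row = [min(row[max(i - 1, 0):i + 2]) for i in range(len(row))]
    return row

def count_runs(row):
    """Count maximal runs of consecutive 1s, i.e. separate blobs."""
    return sum(1 for i, v in enumerate(row)
               if v == 1 and (i == 0 or row[i - 1] == 0))

# Hypothetical row of a thresholded image: three "characters" one pixel apart.
row = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
closed = erode(dilate(row, iterations=1), iterations=1)

print(count_runs(row), count_runs(closed))
print(closed)
```

The gaps between the characters are filled during dilation and stay filled after erosion, while the outer extent of the block shrinks back to where the text actually is. More iterations bridge wider gaps, such as the spacing between lines.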

Next I call `cv2.findContours` to get the contours of the text blocks and `cv2.boundingRect` to get the bounding box of each one. A block is taken to be the abstract if its left and right margins are each about 6% of the page width (between 4% and 8% in the code below).
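As a worked example of the margin test, the sketch below computes the side margins of a few bounding boxes on a hypothetical 1000-pixel-wide page. The names `side_margins` and `looks_like_abstract` and the sample boxes are assumptions for illustration; the 4% to 8% band matches the thresholds used in `isAbstract`.

```python
def side_margins(x, w, img_width):
    """Return the (left, right) margins of a box as fractions of the page width."""
    left = x / img_width
    right = 1 - (x + w) / img_width
    return left, right

def looks_like_abstract(x, w, img_width, lo=0.04, hi=0.08):
    """True if both side margins fall inside the [lo, hi] band."""
    left, right = side_margins(x, w, img_width)
    return lo <= left <= hi and lo <= right <= hi

page_width = 1000  # hypothetical page width in pixels

print(looks_like_abstract(60, 880, page_width))  # indented about 6% on both sides
print(looks_like_abstract(20, 960, page_width))  # nearly full-width body text
print(looks_like_abstract(60, 500, page_width))  # short block with a wide right margin
```

Only the first box passes. The height and vertical position of the box play no role in the test, so any block with the right indentation is accepted.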

In [197]:
## get all images
from os import listdir
from os.path import isfile, join

orig_path = 'first_pages/orig'
cropped_path = 'first_pages/cropped'
onlyfiles = [f for f in listdir(orig_path) if isfile(join(orig_path, f))]
images = list( filter(lambda f: (f.split('.')[-1] == 'png') and (f.split('_')[0] != 'c'), onlyfiles) )

In [198]:
def getContours(img):
    # convert to grayscale
    gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)

    # first blur out any noises
    gray = cv2.medianBlur(gray,5)

    # transform the image into threshold for handling text 
    thresh = cv2.adaptiveThreshold(gray,255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV,11,2)

    # merge single characters into a block of paragraph
    thresh = cv2.dilate(thresh,None,iterations = 18)
    #erode the edges to get a decent text block
    thresh = cv2.erode(thresh,None,iterations = 18)

    # find the contours
    contours,hierarchy = cv2.findContours(thresh,cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)
    
    return contours

def isAbstract(x,y,w,h,img_height, img_width):
    margin_to_left = x / img_width
    margin_to_right = 1 - (x+w)/img_width
    abstract_margin_min = 0.04
    abstract_margin_max = 0.08
    if  margin_to_left >= abstract_margin_min and \
        margin_to_right >= abstract_margin_min and \
        margin_to_left <= abstract_margin_max and \
        margin_to_right <= abstract_margin_max :
        return True
    return False

In [199]:
import cv2
import os


for file_name in images:
    img_path = orig_path + '/' + file_name
    print(img_path)
    img = cv2.imread(img_path)
    img_height, img_width,_ = img.shape
    contours = getContours(img)
    
    abstract_area = None
    for contour in contours:
        x,y,w,h = cv2.boundingRect(contour)  #x,y, width, height
        if isAbstract(x,y,w,h,img_height, img_width):
            abstract_area = {'x': x,'y':y,'w':w,'h':h}
            
    if not abstract_area:
        continue
    
    ## handle text area
    print(abstract_area)
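    # pad the box by 5 px on each side; clamp at 0 because a negative
    # slice start would wrap around to the far side of the image
    x = max(abstract_area['x']-5, 0)
    y = max(abstract_area['y']-5, 0)
    h = abstract_area['h']+10
    w = abstract_area['w']+10
    abstract_text_img = img[y:y+h,x:x+w].copy()
    # uncomment to preview each crop; imshow needs waitKey to render the window
#     cv2.imshow('img',abstract_text_img)
#     cv2.waitKey(0)
#     cv2.destroyAllWindows()
#     cv2.waitKey(1)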
    cv2.imwrite(f'{cropped_path}/cropped_{file_name}',abstract_text_img)
    os.rename(img_path, f'{orig_path}/c_{file_name}')


first_pages/orig/259706.png
first_pages/orig/260187.png
first_pages/orig/259712.png
first_pages/orig/259909.png
first_pages/orig/260178.png
first_pages/orig/259666.png
first_pages/orig/259896.png
first_pages/orig/259869.png
first_pages/orig/261474.png
first_pages/orig/259855.png
first_pages/orig/261306.png
first_pages/orig/259699.png
first_pages/orig/260226.png
first_pages/orig/260227.png
first_pages/orig/259698.png
first_pages/orig/261307.png
first_pages/orig/261475.png
first_pages/orig/259868.png
first_pages/orig/259897.png
first_pages/orig/259667.png
first_pages/orig/259673.png
first_pages/orig/259934.png
first_pages/orig/259713.png
first_pages/orig/260186.png
first_pages/orig/259707.png
first_pages/orig/259711.png
first_pages/orig/259705.png
first_pages/orig/259665.png
first_pages/orig/259671.png
first_pages/orig/259856.png
first_pages/orig/261477.png
first_pages/orig/260219.png
first_pages/orig/260218.png
first_pages/orig/260224.png
first_pages/orig/259857.png
first_pages/orig/261

## OCR

The next step is to apply text recognition to the cropped image. 
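Tesseract returns the text with the line breaks of the printed page, so the output needs a small cleanup before it can be stored as one paragraph. The sketch below isolates that cleanup in a hypothetical helper `clean_ocr_text` and runs it on a made-up string, without calling Tesseract itself.

```python
def clean_ocr_text(raw):
    """Flatten multi-line OCR output into a single line of text."""
    # rejoin words that were split by a hyphen at the end of a line
    text = ''.join(raw.split('-\n'))
    # turn the remaining line breaks into spaces
    return ' '.join(text.split('\n'))

# Made-up sample that mimics raw OCR output with hard line breaks.
raw = ("Rent control affects the alloca-\n"
       "tion of resources and the\n"
       "distribution of well-being.\n")

print(clean_ocr_text(raw))
```

One side effect of this approach is that a genuine hyphenated compound broken at a line end also loses its hyphen, which is why words such as "taxexempt" show up in the output below.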

In [200]:
import pytesseract as pt

cropped_path = 'first_pages/cropped'
onlyfiles = [f for f in listdir(cropped_path) if isfile(join(cropped_path, f))]
images = list( filter(lambda f: (f.split('.')[-1] == 'png'), onlyfiles))

update_list = []
for abstract_img_path in images:
    print(f'Processing {abstract_img_path}')
    img = cv2.imread(cropped_path + '/' + abstract_img_path)
    
    abstract = pt.image_to_string(img, lang='eng')
    abstract = ''.join(abstract.split('-\n'))
    abstract = ' '.join(abstract.split('\n'))
    print(abstract)
    print('='*30)

    img_ID = abstract_img_path.split('_')[-1].split('.')[0]
    update_list.append([img_ID, abstract])
    ## update abstract
    

Processing cropped_261344.png
In markets where product quality is diffuse and verification by buyers is sufficiently costly, high-quality sellers have an incentive to “signal” to buyers by investing in some activity that is more costly for low-quality sellers. Unfortunately, with competition among buyers over the price paid for each level of the signal, there is, in general, no Nash equilibrium. However, it is sufficient for equilibrium that (i) low-quality sellers would, under symmetric information, choose not to enter the market, and (ii) the rate at which the marginal cost of signaling declines across types is sufficiently large.       
Processing cropped_260270.png
This paper examines the following apparent paradox. Adam Smith’s Wealth of Nations is universally regarded as a book that powerfully presented the social case for giving the businessman the maximum degree of freedom of action. And yet, although Smith unqualifiedly treats high wages as desirable, he treats high profits as

A simple theoretical model is developed to illustrate that three aggregation procedures used in the estimation of long-run money demand functions from time-series data—deflating the aggregate data by population and prices, deflating by prices only, or using nominal data undeflated by population or prices—are mathematically equivalent when the data are dominated by time trend. Differences in regressions based on the three aggregation procedures have nothing to do with the degree of homogencity in population and prices, as is often claimed, but merely reflect common time trends in the data. The model is seen to provide good agreement with data from three countries. 
Processing cropped_260267.png
The wage rate a person receives depends not only on the wage offered (a function of his market characteristics), but also on his job-search strategy. The higher his wage demands, the higher the wage he can expect, though the probability of finding an adequate job is lower. When comparing wages of

An asset-pricing model with money introduced via a cash-in-advance constraint is presented. The monetary velocity is variable; hence money demand does not obey the trivial quantity equation. The effects of disturbances in output and money growth on real balances, the price level, and interest rates are examined. Monetary policy has effects on real asset prices. The Fisher relation and the premium on nominal bonds are discussed. The precise role of the timing of information and transactions for properties of price levels and interest rates are clarified. 
Processing cropped_259959.png
Rent control affects the allocation of resources and the distribution of well-being. In New York City in 1968, it is estimated that occupants of controlled housing consumed 4.4 percent less housing service and 9.9 percent more nonhousing goods than they would have consumed in the absence of rent control. The resulting increase in their real income was 3.4 percent. Poorer families received larger benefits t

This paper argues that the structure of corporate ownership varies systematically in ways that are consistent with value maximization. Among the variables that are empirically significant in explaining the variation in ownership structure for 511 U.S. corporations are firm size, instability of profit rate, whether or not the firm is a regulated utility or financial institution, and whether or not the firm is in the mass media or sports industry. Doubt is cast on the Berle-Means thesis, as no significant relationship is found between ownership concentration and accounting profit rates for this set of firms.             
Processing cropped_261340.png
The law of diminishing marginal product, applied to food nutrients, implies that as consumers’ incomes increase, a smaller fraction of their food budget will be devoted to pure nutrition. | test this result using data from the Nationwide Food Consumption Survey, 1977— 78. The foods consumed by five income groups are observed, and the amounts

Time-series tests of the Hotelling r-percent rule for natural resource prices have not been strongly supportive, but the tests and the data are subject to serious difficulties. We propose here an alternative testing strategy based on another but less widely known implication of the Hotelling model. We test this implication, which we call the Hotelling Valuation Principle, by regressing the market values of the reserves of a sample of U.S. domestic oil- and gas-producing companies on their estimated Hotelling values. We find that the es timated Hotelling values account for a significant portion of the observed variations in market values and that the Hotelling mez sures are better indicators of the market values of petroleum properties than two widely cited publicly available alternative appraisals.              
Processing cropped_259900.png
A number of recent contributions have suggested that in a very basic sense liberal values conflict with the Pareto principle. This paper evaluates

In this paper we test the efficiency of the gambling market for National Football League games. Two efficiency tests are conducted. The first test is derived from the finance literature on market efficiency, while the second test is based on a market's being efficient when the rate of return on any gambling strategy based on publicly available information approximates the bookmaker’s commission. While the first test is found to be too weak to establish conclusions about the efficiency of the NFL gambling market, the second test results, showing the existence of profitable betting opportunities, indicate that speculative inefficiencies exist in this market. 
Processing cropped_259861.png
This paper examines some questions relating to the use of the domestic resource cost (DRC) and the effective rate of protection (ERP) measures for project selection and for evaluating the cost of protection. It is shown that, while DRC and ERP give the same results under optimal policies, the choice bet

   Standard models of tax incidence have no explicit public sector. For property taxes at least, this is not an innocent assumption. The local public sector is a tax-exempt sector into which capital can flow to potentially escape the burden of tax increases. Moreover, this taxexempt sector differs from typical tax-exempt sectors in incidence analysis. It grows directly with tax increases, while the taxable sectors shrink. Given this twist of reality, while at low tax rates tax increases may, for example, be borne by capital owners, at high rates the relative burden can shift increasingly to consumers. 
Processing cropped_259863.png
Cost-benefit analyses of urban renewal have generally not investigated the changes in local government expenditures generated by a project, although it is recognized that these may constitute an important source of benefits. This study seeks to measure the changes. It first outlines the qualitative effects of urban renewal projects on population, housing, an

     This is a study of the current account dynamics resulting from the savings and investment dynamics in a small open economy that is subject to exogenous changes in its terms of trade and in world interest rates. Anticipated and unanticipated, as well as temporary and permanent, terms-of-trade changes have very different effects. There is, however, a general tendency toward cycles in both savings and investment, which gives rise to cycles in the current account. [tis shown that the classic Harberger-Laursen-Metzler effect on saving of a terms-of-trade deterioration can have any sign for plausible parameter values, both for temporary and permanent disturbances.                   
Processing cropped_260211.png
In this paper, we examine the behavior of the competitive firm faced with making input-hiring decisions under conditions of price uncertainty. Unlike Sandmo and Baron, among others, we show categorically that a marginal increase in uncertainty stimulates a decline in the firm’s 

Search theory, which purports to explain how individuals behave when they have imperfect or incomplete market information, has received much attention recently. Economists have derived a number of results characterizing the effects of various changes on optimal-search behavior. Almost without exception, these results depend on the untenable assumption that searchers know the probability distribution from which they are searching. This paper studies the effect of assuming instead that searchers learn about the probability distribution while they search from it. Not invariably, but in many instances, the qualitative properties of optimal-search strategies—and thus the behavior of those who follow them—are the same as in the simpler case when the distribution is assumed known. 
Processing cropped_260189.png
‘The article examines the behavior of the ratio of gross private saving to gross national product—the gross private savings rate or GPSR— for the United States during the period 1898-1

Past empirical studies of the aggregate-consumption function have often assumed constancy of the age distribution of the population. This paper relaxes that assumption by specifying a multiperiod-consumption function which introduces age-distribution parameters explicitly into the model. This is accomplished by specifying a multiperiod constant-elasticity-of-substitution (CES) type utility function where the consumer’s time horizon is determined by various age-distribution parameters such as median age, retirement age, etc. The proportion of lifetime income spent on current consumption is shown to depend on the rate of interest, the age-distribution parameters, and the parameters of the utility function. Lifetime income, in turn, depends on the interest rates and age parameters. The model, which is nonlinear, is estimated using annual data from 1948 to 1965. The effects of changes in interest rates and age parameters are assessed, and a prediction-interval test is applied to the model.

A general model of intertemporal distribution is developed; and its implications, discussed. It is shown that Samuelson’s concept of social optimality corresponds to the golden rule of production theory. A more general model of intertemporal distribution efficiency (which would correspond to the Malinvaud concept for production) is suggested, and it is shown that the production and distribution models behave analogously; in particular, for the case of proportional growth, both models are efficient if the interest rate exceeds the growth rate and inefficient if it falls short. 
Processing cropped_261460.png
This paper extends the martingale analysis of no arbitrage pricing to worlds with taxation, The absence of arbitrage is shown to imply the existence of different shadow prices for income streams that are subject to differing tax treatments. For example, no arbitrage implies the existence of different martingale measures for capital gains and for ordinary income when they are differen

Following a discussion of theoretical and empirical problems of industry capacity measurement, a process analysis approach is used to derive a capacity measure for the U.S. petroleum refining industry. Capacity is explicitly defined as the minimum point of the industry’s short-run average cost curve and measured with respect to a fixed, full-employment product mix; capital is disaggregated into process types. Capacity utilization series for petroleum products and processes are also presented along with a comparison between the measure derived and two alternative capacity measures.     
Processing cropped_261515.png
This paper provides an empirical analysis of the determinants of intraindustry trade. We demonstrate that two- way trade flows occur where conditions are favorable to international specialization consistent with the Stigler-Williamson analysis of the division of labor. The results are inconsistent with the prevailing product differentiation— cum-scale economies model of intr

Empirical work on the causes and effects of inventive activity has had difficulty in finding measures that can indicate when and where changes in either inventive inputs or inventive output have occurred. The recent computerization of the U.S. Patent Office’s data base may prove helpful in this context, but there is the problem that a priori we do not know the relationships between patent appli tions and economically meaningful measures of these inputs and outputs. To help solve this problem, this paper investigates the dynamic relationships among the number of successful patent applications of firms, a measure of the firm’s investment in inventive activity (its R & D expenditures), and an indicator of its inventive output (the stock market value of the firm).     a        
Processing cropped_261339.png
A microeconomic theory of the financial firm is developed that is empirically testable. Financial firms are deposit-taking  intermediaries issuing their own liabilities, exemplified by 

Phis paper examines the dynamic impact of government purchases in a simple general equilibrium model with both durable and nondurable consumer goods as well as productive capital. The model generates perhaps surprising results. In particular, increases in government purchases are shown to cause reductions in real interest tes. The model thus provides a possible explanation for the observed behavior of real interest rates around wars.             
Processing cropped_261304.png
The purposes of this paper are twofold. The first is to demonstrate that the expected utility hypothesis is a reasonable description of behavior for consumers who face a low-probability, high-loss natural hazard event, given that they have adequate information. The secis to demonstrate that in California information on earthquake hazards was generated by a 1974 state law that created a market for safe housing that previously did not exi          
Processing cropped_261310.png
   This paper contrasts optimal employ

The study makes an attempt to investigate whether the Soviet Union would be richer or poorer if it had adopted a decentralized, agriculturepropelled, market-oriented economic policy. The Soviet economy is divided into agricultural and nonagricultural sectors. Each sector is approximated by a Cobb-Douglas production function. Special output, labor, and capital indices are then used in order to compare the actual with the hypothetical growth of the Soviet economy. The conclusion reached is that in absence of centralized planning, the growth of the USSR would have been the same or better. 
Processing cropped_261512.png
Models of collusive bidder behavior at single-object second-price and English auctions are provided. The independent private values model is generalized to permit the formation of coalitions and a strategic response by the auctioneer. Cooperative strategies are found to be dominant in these models: coalitions of any size are viable, and the payoff to each member increases w

In this paper we examine the effect of competition in the market for bank acquisitions on the acquirers’ stock returns. Bank acquisitions are examined because federal and state regulations greatly facilitate the identification of potential bidders and alternative targets in an acquisition. We find that the gain to acquirers is positively related to the number of alternative target firms available and negatively related to the number of other potential bidders. These results provide some insights into the sources of gains from bank acquisitions.    
Processing cropped_261303.png
This paper investigates the nature of observed deviations from the unbiased expectations hypothesis in the forward foreign exchange market. If these deviations are due to risk premia then the same premia should be observed in nominal bonds denominated in different currencies. This condition imposes testable restrictions on the parameters of a multivariate regression model. The empirical results are consistent wi

Unlike military conscription, the procurement of jurors by conscription has been widely accepted throughout the history of the Western world. This paper estimates and analyzes the social costs, the wealth redistribution, and the resource allocative consequences of this judiciary institution. As an alternative, the implications of a volunteer system of juror procurement are discussed.  I believe that the success of the jury system depends upon the willingness of men of integrity and intelligence to accept jury service. ... Therefore, I will regard a summons to serve as a juror as a test of my patriotism, just as I would consider a call to armed service. . . . I will allow nothing but unavoidable necessity to induce me to seek to be excused from jury service, and I will try to serve the entire period for which I am summoned, without regard to my personal sacrifice or my financial loss. [The Juror’s Creed,” honorable mention essay, Committee on American Citizenship] 
Processing cropped_26

The article is based on textual evidence from the quantity-theory and Keynesian literature. It shows, first, that the conceptual framework of a portfolio demand for money that Friedman denotes as the “quantity theory” is actually that of Keynesian economics. Conversely, Friedman detracts from the true quantity theory by stating that its formal short-run analysis assumes real output constant, while only prices change, Friedman also incorrectly characterizes Keynesian economics in terms of absolute price rigidity. He does this by overlooking the systematic analysis by Keynes and the Keynesians of the role of downward wage flexibility during unemployment, and of the “inflationary gap” during full employment. Otherwise Friedman’s interpretation of Keynes is the standard textbook one of an economy in a “liquiditytrap” unemployment equilibrium. The author restates his alternative interpretation of Keynesian economics in terms of unemployment disequilibrium. 
Processing cropped_260250.png
Thi

This paper outlines an optimization framework which extends the familiar Tinbergen-Theil model in two ways. First, a “piecewise quadratic” replaces the standard quadratic objective function. Second, the time horizon of the optimization becomes, within the context of economic stabilization problems, endogenous to the optimization process itself. The purpose of both extensions is to escape the conceptual restrictiveness of the Tinbergen-Theil structure while preserving the practical convenience of that model for applied policy work. The paper also describes a solution algorithm incorporating these two extensions, and it presents the results of a sample computational application based on the 1957-58 recession. 
Processing cropped_259946.png
This paper applies previous theoretical and empirical results on inflation and demand for money to a study of inflationary finance and the welfare cost of inflation. The amount of revenue generated by a steady inflation is derived as a function of the 

In [201]:
all_articles = pd.read_csv('JPE.csv')
for article_id, abstract in update_list:
    # compare against the last URL segment so an ID cannot match as a substring
    article_row = all_articles['Source URL'].str.split('/').str[-1] == article_id
    all_articles.loc[article_row, 'Abstract'] = abstract

In [202]:
all_articles.to_csv('JPE.csv', index=False)