# Crawler and Scraping from scratch tutorial

***Author: Tuomas Takko, tuomas (at) fna (dot) fi***

In this brief tutorial I'll describe the process of scraping a completely new site for text resources. We'll start by identifying the elements in the URL and on the corresponding page, and then move on to constructing a simple "crawler" or ad hoc API.

The example site I'll be referring to is Elephind, which provides a search engine for browsing historical news articles. The site can be a great source of news text from the time when print media was the major source of news.

Hopefully this tutorial gives you some tips and tricks for other sites as well!

**Let's get started!**


In [43]:
# Imports
import re
import sys
import csv
import os
import string
import time
from bs4 import BeautifulSoup
import urllib.request
import requests

## Identifying the URL and search

The base URL for Elephind is https://elephind.com/ and by using the search with query "REIT" OR "real estate investment trust" a script on the page redirects us to another URL:

https://elephind.com/?a=q&results=1&r=11&e=--1980---2020--en-10--2--txt-txINtxCO-%22REIT%22+OR+%22real+estate+investment+trust%22------US---

From this URL we can identify the query, the results page number and the number of results per page. We find these by playing around with the search page and observing how the URL changes.

**Let's break it down:**

1. https://elephind.com/ (Base)

1. ?a=q&r=1 (Starting index of links on page, starts from 1, increments by "o")

1. &results=1 (Page number)

1. &o=10 (Number of results per page)

1. &e=--1980---2020--en-100--1--txt-txINtxCO- (Time span)

1. "REIT"+OR+"real+estate+investment+trust"------US--- (Our query string and country)
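The breakdown above can be sketched as a small URL builder. This is only an illustration: `first_index` and `build_search_url` are hypothetical helper names, and the `e` segment is copied as a fixed string in the form seen in the example URL, since the meaning of each of its fields has not been verified here.

```python
from urllib.parse import quote_plus

BASE = "https://elephind.com/"
# Fixed state segment in the form seen in the example URL (assumption: left unchanged)
STATE = "&e=--1980---2020--en-10--1--txt-txINtxCO-"

def first_index(pagenumber, numperpage=10):
    """Index of the first result on a page: 1, 11, 21, ... for 10 results per page."""
    return 1 + (pagenumber - 1) * numperpage

def build_search_url(query, pagenumber, numperpage=10):
    """Assemble a search URL from the pieces identified in the breakdown."""
    start = "?a=q&results=1&r=" + str(first_index(pagenumber, numperpage))
    # quote_plus turns spaces into '+' and double quotes into %22
    return BASE + start + STATE + quote_plus(query) + "------US---"

example_url = build_search_url('"REIT" OR "real estate investment trust"', 2)
print(example_url)
```

With ten results per page, page 2 starts at result 11, which is where the `r=11` in the example URL comes from.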

Next we'll investigate by opening one search result page and inspecting it using the developer tools in our browser of choice. I'm using Chrome but any browser should do the trick. This is the part where a basic knowledge of HTML comes in handy.

Using the element highlighting function we can see that each result has a link to an external site, and that the link sits inside a div element with the class name elephind_querymaindiv. That element contains several links, but we only need to scrape the first one into our list of links. The link is the href attribute of the first 'a' element.


In [20]:
def get_elephindpage(query, pagenumber, numperpage=10):
    '''
    Scraper for getting a specific page of an Elephind query into links of sources.

    This can be looped for getting all the article links.
    '''
    # Index of the first result on the requested page (1, 11, 21, ... for 10 per page)
    firstindex = 1+(pagenumber-1)*numperpage
    querytoURL = query.replace(' ','+').replace('%22','"')

    urlbase = "https://elephind.com/?a=q&results=1&r="+str(firstindex)
    querypage = "&e=--1980---2020--en-10--1--txt-txINtxCO-"
    url = urlbase+querypage+querytoURL+"------US---"

    page = urllib.request.urlopen(url)

    soup = BeautifulSoup(page, "lxml")

    links = []
    for row in soup.find_all('div', class_='elephind_querymaindiv'):
        # Each result block holds several links; keep only the first usable one
        for anchor in row.find_all('a', href=True):
            href = anchor['href']
            if len(href)>1:
                links.append('https://elephind.com'+href)
                break

    return links

In [22]:
'''
Let's give it a try!

The following query should give us the very first link on the page
'''
truelink = 'https://elephind.com/?a=p&p=redirect&vhttp=http%3a%2f%2fcdnc.ucr.edu%2fcgi-bin%2fcdnc%3fa%3dd%26d%3dDS19841208.2.166%26txq%3d%22REIT%22+OR+%22real+estate+investment+trust%22&vsource=UCR'
firstlink = get_elephindpage('"REIT"+OR+"real+estate+investment+trust"', 1,10)[0]
print(firstlink)
print('The links match if True:', firstlink==truelink)

https://elephind.com/?a=p&p=redirect&vhttp=http%3a%2f%2fcdnc.ucr.edu%2fcgi-bin%2fcdnc%3fa%3dd%26d%3dDS19841208.2.166%26txq%3d%22REIT%22+OR+%22real+estate+investment+trust%22&vsource=UCR
The links match if True: True


In [178]:
'''
Now that the page scraper seems to work, let's see if we can get all the links from the database.

Elephind says that there were 1785 results for the query.

We'll start by looping as many times as we can, and stopping once there are no more links.
'''
all_links = []

page = 1
while True:
    #print('Page:', page, ', links so far: ', len(all_links))
    nlinks = get_elephindpage('"REIT"+OR+"real+estate+investment+trust"', page,10)
    if len(nlinks)==0:
        # An empty page means we have run past the last result
        break
    all_links.extend(nlinks)
    page+=1
print('Number of links received:', len(all_links))

Number of links received: 1785
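The loop above sends its requests back to back, which adds up to roughly 180 page loads for this query. A minimal sketch of the same "loop until a page comes back empty" idea with a pause between requests is shown here. `collect_all_links` is a hypothetical helper, and the page fetcher is passed in as a function so the loop can be tried offline with fake data.

```python
import time

def collect_all_links(fetch_page, delay=1.0, max_pages=None):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty.

    fetch_page: any function that takes a page number and returns a list of links.
    delay:      seconds to wait between requests, to go easy on the server.
    max_pages:  optional safety cap on the number of pages requested.
    """
    all_links = []
    page = 1
    while max_pages is None or page <= max_pages:
        links = fetch_page(page)
        if not links:
            break  # empty page: we have run past the last result
        all_links.extend(links)
        page += 1
        time.sleep(delay)
    return all_links

# Offline demo with a fake two-page "site" instead of real requests
fake_site = {1: ['link-a', 'link-b'], 2: ['link-c']}
demo_links = collect_all_links(lambda p: fake_site.get(p, []), delay=0)
print(demo_links)
```

For the real crawl, the fetcher could be something like `lambda p: get_elephindpage(query, p, 10)`.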


## Scraping the source text

Now that we can get the list of all links, we have essentially created a 'crawler' for the Elephind page. The next step in finishing this ad hoc API is to create a scraping function for the pages.

Once again we'll start by manually opening one link and inspecting it using the developer tools in our browser of choice. I'm using Chrome but any browser should do the trick.

On the first link we get directed to cdnc.ucr.edu (we first need to GET the new URL), where the scanned article is shown alongside the extracted text in the left panel, inside a div element identified as 'documentdisplayleftpanesectiontextcontainer'. Under this element is a paragraph element ('p') that holds the text.

There is a small problem with the text, however. Inspecting the network log in the developer tools shows that the text is loaded with JavaScript and AJAX, so the element is empty in the page we receive. We can either use a browser automation tool such as Selenium and wait for the AJAX call to finish, or find the AJAX call in the network log and request it ourselves. I wanted to keep things simple, so I looked up the URL that the request is sent to:

https://cdnc.ucr.edu/?a=da&command=getSectionText&d=DS19841208.2.166&srpos=&f=XML&e=-------en--20--1--txt-txIN-%252522REIT%252522+OR+%252522real+estate+investment+trust%252522-------1

Inspecting the other URLs shows that the document ID and the remaining parameters follow the same format, which makes the parsing rather easy.
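To illustrate the parsing, here is a minimal offline sketch that pulls the host, document ID, query and date out of an Elephind link and assembles the AJAX URL in the CDNC format quoted above. It assumes that the `vhttp` parameter of the Elephind link carries the target URL and that the first eight digits of the document ID are the date. Both function names are hypothetical, and other libraries may need a different `e` segment.

```python
import re
from urllib.parse import urlsplit, parse_qs

def parse_elephind_link(link):
    """Return (host, doc_id, query, date) decoded from an Elephind redirect link."""
    outer = parse_qs(urlsplit(link).query)
    target = urlsplit(outer["vhttp"][0])  # the decoded URL on the library's site
    inner = parse_qs(target.query)
    doc_id = inner["d"][0]
    query = inner["txq"][0]
    # Assumption: the first run of eight digits in the ID is YYYYMMDD
    digits = re.search(r"\d{8}", doc_id).group(0)
    date = f"{digits[:4]}-{digits[4:6]}-{digits[6:8]}"
    return target.hostname, doc_id, query, date

def section_text_url(host, doc_id, query):
    """Build the getSectionText URL in the CDNC format quoted in the text."""
    encoded = query.replace(" ", "+").replace('"', "%252522")
    return (f"https://{host}/?a=da&command=getSectionText&d={doc_id}&srpos=&f=XML"
            f"&e=-------en--20--1--txt-txIN-{encoded}-------1")

example_link = ('https://elephind.com/?a=p&p=redirect&vhttp=http%3a%2f%2fcdnc.ucr.edu'
                '%2fcgi-bin%2fcdnc%3fa%3dd%26d%3dDS19841208.2.166%26txq%3d%22REIT%22'
                '+OR+%22real+estate+investment+trust%22&vsource=UCR')
host, doc_id, query, date = parse_elephind_link(example_link)
ajax_url = section_text_url(host, doc_id, query)
print(doc_id, date)
print(ajax_url)
```

This sketch never touches the network, so it only shows how the pieces of the URL relate to each other; whether the server accepts the assembled URL still has to be checked with a real request.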

Let's try to get the text from the first link.

The text should be the following:

<i>"Option taken on building

SAN FRANCISCO - California Real Estate Investment Trust has entered into an option agreement to sell its Caelus Memories Inc Building in San Jose As consideration for Cal REIT s commitment to sell the property, the purchaser made a nonrefundable option payment to Cal REIT of 525.000 on Nov 24, which will be applied toward the purchase price."</i>


In [221]:
def cleanhtml(raw_html):
    '''
    Simple regex substitution for cleaning the html text
    '''
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext

def scrape_oldnews(origurl):
    '''
    Return a (date, text) tuple for one Elephind result link.

    The original link from Elephind needs redirecting, so we first need to catch the correct URL on CDNC.
    On any failure the function prints the URL and returns ('', '').
    '''
    try:
        response = requests.get(origurl, allow_redirects=True)
        # The target URL is stored in a JavaScript variable named vhttp
        surl = None
        for line in response.text.splitlines():
            tmp = line.strip()
            if 'var vhttp' in tmp:
                surl = tmp.split('\'')[1].replace(' ','+')
        # After splitting: rls[0] contains the host, rls[1] the document ID (d=...), rls[2] the query (txq=...)
        rls = surl.split('&')

        # A major problem with scraping the CDNC site using these tools is that the text is behind an AJAX call.
        # After a bit of digging using the browser dev tools and the network log I found the URL that gets
        # the text on the left hand side.
        baseurl = rls[0].split('edu')[0]+'edu/'
        newlist = [baseurl+'?a=da', 'command=getSectionText', rls[1], 'srpos=', 'f=XML', rls[2].replace('txq=','e=-------en--20--1--txt-txIN-').replace('"','%252522')+'-------1' ]
        nurl = '&'.join(newlist)

        # Date from the URL: the digits at the start of the document ID are YYYYMMDD
        for cc in range(len(rls[1])):
            if rls[1][cc].isdigit():
                break
        ndatestr = rls[1][cc:]
        ndate = ndatestr[:4]+'-'+ndatestr[4:6]+'-'+ndatestr[6:8]

        # Now that the AJAX URL is formed, we can request the data
        response = requests.get(nurl)
        soup = BeautifulSoup(response.text, "lxml")
        tmptext = ''
        for i in soup.find_all('sectiontext'):
            tmptext = str(i.text)

        return (ndate, cleanhtml(tmptext))
    except Exception:
        print('Problem with URL: ', origurl)
        return ('','')


In [214]:
'''
Let's give this scraping trick a try.

Elephind crawls multiple libraries. Some examples:

California: https://cdnc.ucr.edu/?a=da&command=getSectionText&d=DS19841208.2.166&srpos=&f=XML&e=-------en--20--1--txt-txIN-%252522REIT%252522+OR+%252522real+estate+investment+trust%252522-------1
Boston: https://newspapers.bc.edu/?a=da&command=getSectionText&d=bcheights19900319.2.39&srpos=&f=XML&e=-------en-20--1--txt-txIN-%22REIT%22+OR+%22real+estate+investment+trust%22------
'''
blink = 'https://elephind.com/?a=p&p=redirect&vhttp=http%3a%2f%2fnewspapers.bc.edu%2fcgi-bin%2fbostonsh%3fa%3dd%26d%3dbcheights19900319.2.39%26txq%3d%22REIT%22+OR+%22real+estate+investment+trust%22&vsource=BC'
scrape_oldnews(truelink)

('1984-12-08',
 ' Option taken on building  SAN FRANCISCO - California Real  Estate   Investment  Trust has entered into an option agreement to sell its Caelus Memories Inc Building in San Jose As consideration for Cal REIT s commitment to sell the property, the purchaser made a nonrefundable option payment to Cal REIT of 525.000 on Nov 24, which will be applied toward the purchase price. ')
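The output above still contains runs of spaces where tags were replaced (for example "Real  Estate   Investment"). If that matters for later processing, a small post-processing step can collapse them. This is a minimal sketch, with `normalize_whitespace` as a hypothetical helper name:

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

sample = ' Option taken on building  SAN FRANCISCO - California Real  Estate   Investment  Trust has entered '
cleaned = normalize_whitespace(sample)
print(cleaned)
```

It could be applied to the second element of the tuple returned by `scrape_oldnews`.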

## Success so far!

Let's run a short test with a limited number of articles using the functions created previously!


In [215]:
# First page of our search
links = get_elephindpage('"REIT"+OR+"real+estate+investment+trust"', 1,10)

#For each link get the text
news = []
for lnk in links:
    text = scrape_oldnews(lnk)
    print(text)
    print('-'*20)
    news.append(text)

('1984-12-08', ' Option taken on building  SAN FRANCISCO - California Real  Estate   Investment  Trust has entered into an option agreement to sell its Caelus Memories Inc Building in San Jose As consideration for Cal REIT s commitment to sell the property, the purchaser made a nonrefundable option payment to Cal REIT of 525.000 on Nov 24, which will be applied toward the purchase price. ')
--------------------
('1987-01-24', " REIT expects to reduce its distribution  SAN FRANCISCO - California Real  Estate   Investment  Trust this week reported that it expects to substantially reduce its distributions to shareholders during 1987 The trust’s annual distribution was $1.28 per share (32 cents per quarter) during 1986 Cal REIT currently expects to announce its  next quarterly distribution in late February. Cal REIT’s policy in determining the level of distributions to shareholders has been based on the trust’s anticipated operating earnings, together with profits from the sale of properti

('1990-03-19', " 80 Comm. Ave. named Riley Hall  The BC Office of Communication is pleased to announce that the dormitory located at 80 Commonwealth Avenue has been named Riley Hall. The Hall has been named after John M. Riley a continuously generous benefactor of BC. Mr. Riley, a native of Dover, Mass., received a Bachelor of Arts  degree in Philosophy from BC in 1956. Mr. Riley presently resides in Minneapolis, Minnesota where he is the Senior Managing Partner of Piper, Carraway, and Riley REIT (Real Estate Investment Trust). Mr. Riley's two children both attended and graduated from Boston College in the early eighties.  The next academic year will mark the inception of the Collinsworth Chair in the Department of Economics. The funding for the Chair will be provided primarily by the John M. Riley Trust Fund. The official invocation ceremony has been tentatively scheduled for Friday, April 27th. ")
--------------------


## Full text crawl

The following code creates a full dataset from the search and saves the text into a CSV file. If many of the fields come back empty, the scraper must be adjusted. Elephind indexes multiple sources, which makes it difficult to write one method that works for every source. Luckily, many of the sources use the same website structure.

In [222]:
import pandas as pd

def get_fullhistory(query, fname, limit=None):
    all_links = []

    page = 1
    while True:
        #print('Page:', page, ', links so far: ', len(all_links))
        nlinks = get_elephindpage(query, page,10)
        if len(nlinks)==0:
            # An empty page means we have run past the last result
            break
        all_links.extend(nlinks)
        page+=1
        if limit is not None and len(all_links)>=limit:
            # Drop any surplus links from the last page
            all_links = all_links[:limit]
            break
    print('Number of links received:', len(all_links))
    timestamps, newstexts = [], []
    for lnk in all_links:
        text = scrape_oldnews(lnk)
        timestamps.append(text[0])
        newstexts.append(text[1])
    # to_csv returns None when given a path, so keep the DataFrame in its own variable
    df = pd.DataFrame({'date':timestamps,'texts':newstexts})
    df.to_csv(fname+'.csv')
    return df


In [223]:
get_fullhistory('"REIT"+OR+"real+estate+investment+trust"', 'testrun', 20)

Number of links received: 20
Problem with URL:  https://elephind.com/?a=p&p=redirect&vhttp=http%3a%2f%2ftexashistory.unt.edu%2fark%3a%2f67531%2fmetapth188167%2fm1%2f42%2fzoom%2f%3fq%3d%22REIT%22+OR+%22real+estate+investment+trust%22&vsource=TEXAS
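The failing URL above comes from a different library (`vsource=TEXAS`), which the CDNC-style scraper does not handle. To see which sources need their own handling, the empty results can be counted per source. This is a minimal sketch under the assumption that every Elephind link carries a `vsource` parameter; `source_of` and `count_failures_by_source` are hypothetical helpers.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def source_of(link):
    """Read the vsource parameter from an Elephind link ('unknown' if missing)."""
    params = parse_qs(urlsplit(link).query)
    return params.get('vsource', ['unknown'])[0]

def count_failures_by_source(links, results):
    """Count the (date, text) results with empty text, grouped by source."""
    failures = Counter()
    for link, (date, text) in zip(links, results):
        if not text.strip():
            failures[source_of(link)] += 1
    return failures

# Offline demo: one result with text, one empty result as returned on failure
demo_links = [
    'https://elephind.com/?a=p&p=redirect&vhttp=http%3a%2f%2fcdnc.ucr.edu%2fcgi-bin%2fcdnc%3fa%3dd%26d%3dDS19841208.2.166%26txq%3d%22REIT%22+OR+%22real+estate+investment+trust%22&vsource=UCR',
    'https://elephind.com/?a=p&p=redirect&vhttp=http%3a%2f%2ftexashistory.unt.edu%2fark%3a%2f67531%2fmetapth188167%2fm1%2f42%2fzoom%2f%3fq%3d%22REIT%22+OR+%22real+estate+investment+trust%22&vsource=TEXAS',
]
demo_results = [('1984-12-08', ' Option taken on building '), ('', '')]
failure_counts = count_failures_by_source(demo_links, demo_results)
print(failure_counts)
```

Sources that show up often in this count are the ones worth inspecting with the developer tools next.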
