I built a web scraper for the academic economics forum "EconJobRumors" that returns the names of discussion topics and their corresponding links. The number of pages is adjustable; the run below sets num_pages = 20. The script takes about 20 seconds to run.

Note: I did not use an API, but scraped from the page directly. The prompt asks to "Do a little scraping or API-calling of your own," so I don't anticipate this being a problem.

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

num_pages = 20

class ESSpider(scrapy.Spider):
    # Scrapy requires every spider to have a name; it identifies the spider
    # and should be unique if you run more than one spider.
    name = "ESS"
    
    # URL(s) to start with.
    start_urls = [
        'https://www.econjobrumors.com',
    ]

    # Use XPath to parse the response we get.
    def parse(self, response):
        
        # Iterate over every table row (<tr>) on the page.
        for row in response.xpath('//tr'):

            # Yield a dictionary with the values we want. Rows without a
            # topic link yield None for both fields; these are dropped later.
            yield {
                'name': row.xpath('td/a[starts-with(@href, "https://www.econjobrumors.com/topic/")]/text()').extract_first(),
                'link': row.xpath('td/a[starts-with(@href, "https://www.econjobrumors.com/topic/")]/@href').extract_first(),
            }
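        # Find the current page number from the pagination widget.
        page_num = int(response.xpath('//span[@class="page-numbers current"]/text()').extract_first())

        print(page_num)

        # Follow the next page. Because the test is `<=`, the page after
        # num_pages is also requested, so num_pages + 1 pages are scraped.
        if page_num <= num_pages:
            next_page = 'https://www.econjobrumors.com/page/' + str(page_num + 1)
            yield scrapy.Request(next_page, callback=self.parse)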

# Tell the script how to run the crawler by passing in settings.
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'json_output.json',  # Name our storage file.
    'LOG_ENABLED': False           # Turn off logging for now.
})

# Start the crawler with our spider.
process.crawl(ESSpider)
process.start()
print('Success!')

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
Success!
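
The printed page numbers run to 21 even though num_pages is 20. A minimal sketch of the spider's pagination test, with no network access, suggests why: the check `page_num <= num_pages` is evaluated after each page is parsed, so page 20 still schedules page 21. The helper name `pages_parsed` and its `inclusive` flag are hypothetical, introduced only for this illustration.

```python
def pages_parsed(num_pages, inclusive=True):
    """Simulate which page numbers the spider's parse() would visit.

    inclusive=True mirrors the `page_num <= num_pages` test in the spider;
    inclusive=False models a strict `<` test instead.
    """
    visited = []
    page_num = 1  # the start URL is page 1
    while True:
        visited.append(page_num)  # parse() runs on this page
        follow = page_num <= num_pages if inclusive else page_num < num_pages
        if not follow:
            break
        page_num += 1  # corresponds to requesting /page/<page_num + 1>
    return visited


print(len(pages_parsed(20)))                   # pages visited with `<=`
print(len(pages_parsed(20, inclusive=False)))  # pages visited with `<`
```

With a strict `<`, the crawl would stop after exactly num_pages pages.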


Remove rows with a null name and display the first five:

In [2]:
import pandas as pd

pages = pd.read_json('json_output.json')
pages = pages[pages['name'].notna()]

pages.head()

    name                                                link
25  2                                                   https://www.econjobrumors.com/topic/request-a-...
26  5 months                                            https://www.econjobrumors.com/topic/about-ejmr...
27  Universities withdrawing offers                     https://www.econjobrumors.com/topic/universiti...
28  Dow Jones would surely touch 17000 once the ne...   https://www.econjobrumors.com/topic/dow-jones-...
29  Bill Ackman is buying stocks.                       https://www.econjobrumors.com/topic/bill-ackma...
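
The null rows arise because the spider yields one dictionary per table row, including rows that contain no topic link. As a pandas-free sketch of the same filtering step, the snippet below drops records whose name is null using only the standard library. The sample records, the example URL, and the helper `drop_null_names` are made up for illustration; they are not taken from the actual output file.

```python
import json

# Hypothetical sample in the same shape as json_output.json: a list of
# {"name": ..., "link": ...} records, some with null fields.
sample = json.loads("""
[
  {"name": null, "link": null},
  {"name": "Example topic", "link": "https://www.econjobrumors.com/topic/example-topic"},
  {"name": null, "link": null}
]
""")


def drop_null_names(records):
    """Keep only the records whose 'name' field is not null."""
    return [r for r in records if r.get("name") is not None]


cleaned = drop_null_names(sample)
print(len(sample), "->", len(cleaned))
```

This keeps the original list order, much as the pandas filter keeps the original row index (which is why the table above starts at 25 rather than 0).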
