## Challenge

Do a little scraping or API-calling of your own.  Pick a new website and see what you can get out of it.  Expect that you'll run into bugs and blind alleys, and rely on your mentor to help you get through.  

Formally, your goal is to write a scraper that will:

1) Return specific pieces of information (rather than just downloading a whole page)  
2) Iterate over multiple pages/queries  
3) Save the data to your computer  

Once you have your data, compute some statistical summaries and/or visualizations that give you new insight into your scraping topic of interest.  Write up a report that covers everything from the scraping code to the summary, and share it with your mentor.

In [None]:
# Import scrapy 
import scrapy
from scrapy.crawler import CrawlerProcess

In [None]:
import os

class ESSpider(scrapy.Spider):
    # The name identifies the spider; each spider in a project
    # needs a unique one
    name = 'ESS'
    
    start_urls = ['http://www.everydaysexism.com']
    
    # Defining the scraping process
    def parse(self, response):
        # Create the output folder if it does not exist yet
        os.makedirs('./scraper_results', exist_ok=True)
        with open('./scraper_results/mainpage.html', 'wb') as f:
            f.write(response.body)

# Instantiate the crawler
process = CrawlerProcess()

# Start the crawler with the spider
process.crawl(ESSpider)
process.start()

We now have a file called 'mainpage.html' saved locally that contains the full HTML of the www.everydaysexism.com home page. However, to get more useful, parsed data, we must give the spider more specific instructions.
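To make "more specific instructions" concrete, here is a minimal sketch of the idea behind relative paths: find each `<article>` element, then look up fields relative to it. It uses the standard library's `xml.etree.ElementTree` on a small, hypothetical, well-formed snippet rather than Scrapy's selectors on the real site, so the markup, titles, and dates below are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a page; NOT the real site's markup.
PAGE = """
<html><body>
  <article>
    <header>
      <h2><a title="First post" href="/p/1">First post</a></h2>
      <section><span class="entry-date">1 January 2020</span></section>
    </header>
    <section class="entry-content"><p>Body one.</p><p>More of one.</p></section>
  </article>
  <article>
    <header>
      <h2><a title="Second post" href="/p/2">Second post</a></h2>
      <section><span class="entry-date">2 January 2020</span></section>
    </header>
    <section class="entry-content"><p>Body two.</p></section>
  </article>
</body></html>
"""

def extract_articles(page):
    """Return one dict per <article>, using paths relative to each article."""
    root = ET.fromstring(page.strip())
    rows = []
    for article in root.iter('article'):
        link = article.find('header/h2/a')
        date = article.find("header/section/span[@class='entry-date']")
        paragraphs = article.findall("section[@class='entry-content']/p")
        rows.append({
            'name': link.get('title') if link is not None else None,
            'date': date.text if date is not None else None,
            'text': [p.text for p in paragraphs],
        })
    return rows

rows = extract_articles(PAGE)
print(rows)
```

Each field is looked up relative to its own `article` element, which is what keeps a title paired with the right date and text. Real pages are rarely well-formed XML, which is one reason to rely on Scrapy's selectors for actual scraping.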

__Note:__ Remember to restart the kernel before rerunning a Scrapy script. `process.start()` runs a Twisted reactor, which cannot be restarted within the same Python process.

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

class ESSpider(scrapy.Spider):
    name = 'ESS'
    start_urls = ['http://www.everydaysexism.com']
    
    def parse(self, response):
        # Iterate over every <article> element on the page
        for article in response.xpath('//article'):
            yield {
                'name': article.xpath('header/h2/a/@title').extract_first(),
                'date': article.xpath('header/section/span[@class="entry-date"]/text()').extract_first(),
                'text': article.xpath('section[@class="entry-content"]/p/text()').extract(),
                'tags': article.xpath('*/span[@class="tag-links"]/a/text()').extract()
            }
            
            
# Pass in crawler parameters
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in json
    'FEED_URI': './scraper_results/firstpage.json',  # Name of the json file
    'LOG_ENABLED': False           # Turning off logging
})

# Start the crawler & spider
process.crawl(ESSpider)
process.start()
print('Success')

In [None]:
import pandas as pd 

firstpage = pd.read_json('./scraper_results/firstpage.json', orient='records')
print(firstpage.shape)
firstpage.head()

### Recursion
Now that we have a scraper that can pull the information we want off a page and store it in a file, we want to run it over multiple pages of the website. We do this using recursion: the spider parses a page, yields the items it finds, then looks for a link to the next page and requests it with the same `parse` method as the callback.
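The control flow can be sketched without any network access. The example below is a simplified illustration, not how Scrapy works internally: Scrapy schedules each yielded request rather than recursing on the call stack. The `FAKE_SITE` dictionary, the `page_number` helper, and the `crawl` function are all hypothetical names invented for this sketch.

```python
import re

# Hypothetical in-memory "site": each URL maps to its items and next-page link.
FAKE_SITE = {
    '/page/1': {'items': ['a', 'b'], 'next': '/page/2'},
    '/page/2': {'items': ['c'], 'next': '/page/3'},
    '/page/3': {'items': ['d', 'e'], 'next': '/page/4'},
    '/page/4': {'items': ['f'], 'next': None},
}

def page_number(url):
    """Return the first run of digits in url as an int, or None if there is none."""
    if url is None:
        return None
    digits = re.findall(r'\d+', url)
    return int(digits[0]) if digits else None

def crawl(url, max_page):
    """Yield this page's items, then recurse into the next page up to max_page."""
    page = FAKE_SITE[url]
    yield from page['items']

    next_url = page['next']
    num = page_number(next_url)
    # Check for a missing link BEFORE comparing page numbers
    if num is not None and num <= max_page:
        yield from crawl(next_url, max_page)

collected = list(crawl('/page/1', max_page=3))
print(collected)
```

The important detail is the order of the checks: the next-page link is tested for `None` before its page number is parsed, so the last page ends the crawl cleanly instead of raising an error.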

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess
import re

class ESSpider(scrapy.Spider):
    name = 'ESS'
    start_urls = ['http://www.everydaysexism.com']
    
    def parse(self, response):

        # Iterate over every <article> element on the page
        for article in response.xpath('//article'):
            yield {
                'name': article.xpath('header/h2/a/@title').extract_first(),
                'date': article.xpath('header/section/span[@class="entry-date"]/text()').extract_first(),
                'text': article.xpath('section[@class="entry-content"]/p/text()').extract(),
                'tags': article.xpath('*/span[@class="tag-links"]/a/text()').extract()
            }

        # Get the next page URL (once per page, so outside the article loop)
        next_page = response.xpath('//div[@class="nav-previous"]/a/@href').extract_first()

        # Check for a missing link before parsing the page number;
        # re.findall raises a TypeError if next_page is None
        if next_page is not None:
            # Grab the next page number from the URL
            digits = re.findall(r'\d+', next_page)

            # Recursively call the spider until page 9
            if digits and int(digits[0]) < 10:
                next_page = response.urljoin(next_page)
                # Request next page with same parsing as above
                yield scrapy.Request(next_page, callback=self.parse)

            
# Pass in crawler parameters
# Additional parameters are for scraping etiquette
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': './scraper_results/data.json',
    'LOG_ENABLED': False,
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True
})

# Instantiate and start crawler
process.crawl(ESSpider)
process.start()
print('Success')

In [None]:
import pandas as pd

df = pd.read_json('./scraper_results/data.json', orient='records')
print(df.shape)
df.head()

Nine pages at 10 entries per page should give us 90 rows. If the shape printed above shows 90 rows, the scrape was successful.

## Challenge

Now, we will try to pull Nathaniel Rakich's articles from [FiveThirtyEight](https://fivethirtyeight.com/contributors/nathaniel-rakich/).

In [2]:
import scrapy
from scrapy.crawler import CrawlerProcess
import re 

class FiveThirtyEight(scrapy.Spider):
    name = "NS"
    
    start_urls = ['https://fivethirtyeight.com/contributors/nathaniel-rakich/']
    
    def parse(self, response):
        for item in response.xpath("//div[@class='content-area']/div"):
            yield {
                'date': item.xpath(".//div[@class='post-info']/p/time/text()").extract_first(),
                'title': item.xpath(".//div[@class='post-info']/div/div/h2/a/text()").extract_first(),
                'article_link': item.xpath(".//div[@class='post-info']/div/div/h2/a/@href").extract_first(),
                'author': item.xpath(".//div[@class='post-info']/div/div/p[@class='single-metadata card vcard']/a/text()").extract_first()
            }
        
        # Link to the next page of results
        nextpage = response.xpath("//div[@class='links']/a/@href").extract_first()
        
        # Recursively call next page; check for None before parsing the
        # page number, since re.findall raises a TypeError on None
        if nextpage is not None:
            digits = re.findall(r'\d+', nextpage)
            if digits and int(digits[0]) < 4:
                nextpage = response.urljoin(nextpage)
                yield scrapy.Request(nextpage, callback=self.parse)
            
            
# Passing crawler parameters
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': './scraper_results/NS538.json',
    'LOG_ENABLED': False,
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': 'Thapani Sawaengsri (thapani.sawaengsri@gmail.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True
})

# Instantiate and start scraper
process.crawl(FiveThirtyEight)
process.start()
print('Success')

Success


In [3]:
import pandas as pd

df = pd.read_json('./scraper_results/NS538.json', orient='records')

print(df.shape)
df.head()

(26, 4)


| | date | title | article_link | author |
|---|---|---|---|---|
| 0 | NaT | | | |
| 1 | NaT | | | |
| 2 | 2019-10-25 | \n\t\t\t\tWhere The Public Stands On Impeachme... | https://fivethirtyeight.com/features/where-the... | Nathaniel Rakich |
| 3 | 2019-10-24 | \n\t\t\t\tNine Candidates Have Made The Novemb... | https://fivethirtyeight.com/features/klobuchar... | Nathaniel Rakich |
| 4 | 2019-10-23 | \n\t\t\t\tWe’ve Already Seen Twice As Many Pre... | https://fivethirtyeight.com/features/weve-alre... | Nathaniel Rakich |


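Looks like the scraper worked! The first two rows are blank because the XPath also matched two article-like `div` areas that were not actually articles. The titles also carry leading whitespace (`\n\t\t\t\t`) that should be stripped before any analysis.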
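A minimal sketch of the cleanup those two observations call for, followed by one simple summary (articles per month). The records below are hypothetical placeholders shaped like the scraper's output, not real scraped data, and the `clean` helper is a name invented for this example.

```python
from collections import Counter

# Hypothetical records shaped like the scraper's feed; NOT real scraped data.
records = [
    {'date': None, 'title': None, 'article_link': None, 'author': None},
    {'date': '2019-10-25', 'title': '\n\t\t\t\tExample title A',
     'article_link': 'https://example.com/a', 'author': 'Example Author'},
    {'date': '2019-10-24', 'title': '\n\t\t\t\tExample title B\n',
     'article_link': 'https://example.com/b', 'author': 'Example Author'},
    {'date': '2019-09-30', 'title': 'Example title C',
     'article_link': 'https://example.com/c', 'author': 'Example Author'},
]

def clean(rows):
    """Drop rows where every field is missing and strip whitespace from titles."""
    cleaned = []
    for row in rows:
        # Skip "article-like" rows that carried no data at all
        if all(value is None for value in row.values()):
            continue
        fixed = dict(row)  # copy so the input is left untouched
        if fixed.get('title') is not None:
            fixed['title'] = fixed['title'].strip()
        cleaned.append(fixed)
    return cleaned

cleaned = clean(records)

# Count articles per month using the YYYY-MM prefix of each date string
per_month = Counter(row['date'][:7] for row in cleaned if row['date'])
print(len(cleaned), dict(per_month))
```

With the DataFrame from the cell above, the equivalent pandas steps would be `df.dropna(how='all')` followed by `df['title'].str.strip()`.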