## Challenge

Do a little scraping or API-calling of your own.  Pick a new website and see what you can get out of it.  Expect that you'll run into bugs and blind alleys, and rely on your mentor to help you get through.  

Formally, your goal is to write a scraper that will:

1) Return specific pieces of information (rather than just downloading a whole page)  
2) Iterate over multiple pages/queries  
3) Save the data to your computer  

Once you have your data, compute some statistical summaries and/or visualizations that give you some new insights into your scraping topic of interest.  Write up a report from scraping code to summary and share it with your mentor.

**This scraper will get prices of cars posted on Craigslist Washington DC**



In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess


class CLSpider(scrapy.Spider):
    name = "CL"
    
    allowed_domains = ['https://washingtondc.craigslist.org/']
    # Here is where we insert our API call.
    start_urls = [
        'https://washingtondc.craigslist.org/d/cars-trucks/search/cta'
        ]
    # Use XPath to parse the response we get.
    def parse(self, response):
        
        # Iterate over every <article> element on the page.
        for posting in response.xpath('//p'):
            
            # Yield a dictionary with the values we want.
            yield {
                # This is the code to choose what we want to extract
                # You can modify this with other Xpath expressions to extract other information from the site
                # https://www.w3schools.com/xml/xpath_intro.asp
                'title': posting.xpath('a[@class="result-title hdrlnk"]/text()').extract_first(),
                'date': posting.xpath('time[@class="result-date"]/text()').extract_first(),
                'price': posting.xpath('span/span[@class="result-price"]/text()').extract_first()
            }
        
        # scrape all pages
        next_page_relative_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        next_page_absolute_url = response.urljoin(next_page_relative_url)

In [2]:
# Tell the script how to run the crawler by passing in settings.
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'cars.json',  # Name our storage file.
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': 'ThinkfulDataScienceBootcamp_Rodolfo (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False           # Turn off logging for now.
    # We use CLOSESPIDER_PAGECOUNT to limit our scraper to the first 100 links
    #'CLOSESPIDER_PAGECOUNT' : 10
})

# Start the crawler with our spider.
process.crawl(CLSpider)
process.start()
print('Success!')

Success!


In [3]:
import pandas as pd

# Turning JSON into Data Frame
cars = pd.read_json('cars.json')
print(cars.shape)
cars.head()

(120, 3)


Unnamed: 0,date,price,title
0,Nov 7,$12995,2004 Chevrolet Silverado 2500HD 4dr Crew Cab L...
1,Nov 7,$6495,2012 VW Jetta 2.5 SE 5 SPEED
2,Nov 7,$2950,"2004 Hyundai Santa Fe LX, AWD, whith Leather."
3,Nov 7,$3500,2002 Honda Odyssey EX-L
4,Nov 7,$5442,2011 Dodge Avenger!!!


The data can be further processed using the methods we've coverd previously which we'll skip here to save time. 