# "The Itsy Bitsy Spider Climbed into Your Website"
> "Creating your first Scrapy web spider to get the data you need."

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [web scraping, Scrapy, XPath and CSS selectors]
- hide: false
- search_exclude: true

## Theory
### What is web scraping?
Web scraping is a method to automate the extraction of large amounts of data from a website.

### What are the different methods for web scraping in Python?
Credit to [this blog post](https://elitedatascience.com/python-web-scraping-libraries).
- Level 1: Beautiful Soup and LXML
    - Parses data on a web page.
- Level 2: Requests and Selenium
    - Fetches data from a web page.
- Level 3: Scrapy
    - Systematically crawls through a website and "manages requests, preserves user sessions, follows redirects, and handles output pipelines."

### What are CSS and XPath selectors?
- CSS and XPath can be used for selecting elements on a webpage. Credit to this [blog post](https://medium.com/dataflow-kit/css-selectors-vs-xpath-f368b431c9dc):
    - "Cascading Style Sheets (CSS) is a style sheet language used for describing the look and formatting of a document written in HTML or XML. CSS Selectors are patterns used to select the styled element(s)."
    - "XPath, the XML path language, is a query language for selecting nodes from an XML document. Locating elements with XPath works very well with a lot of flexibility."
- They are both part of our Scrapy example code below, which is used to crawl a website for images and captions.

## Examples
### When would you need web scraping?
Credit to [this blog post](https://towardsdatascience.com/https-medium-com-hiren787-patel-web-scraping-applications-a6f370d316f4), with direct quotations in quotation marks:
- "Competitor price monitoring"
- "Monitoring minimum advertised price compliance"
- "Fetching images and product descriptions"
- "Monitoring consumer sentiment"
- Aggregated news articles and news monitoring
- "Market data aggregation"
- "Extracting financial statement\[s\]"
- "Insurance"
- Predictive and real-time analytics
- "Machine-learning models"
- Data-driven and content marketing
- "Lead generation"
- "Competitive analysis"
- "Search engine optimization monitoring"
- "Reputation monitoring"
- "Academic research"
- "Data journalism"
- "Employment"
- "Search engine for classified sites"

## Code
### An Example Web Scraper with Scrapy

I will show you how to create a basic web scraper that extracts and exports image URLs and image captions from the "Places" listing on my favorite website, a travel guide for obscure destinations called [Atlas Obscura](https://www.atlasobscura.com/places). I really appreciate them for letting me scrape their website, and in return I bought their book, their wall calendar, and their daily calendar.

To start, install Scrapy from the Terminal using pip. Refer to [this blog post](https://www.liquidweb.com/kb/install-pip-windows/) if you need help installing pip. I use the `pip3` command because I'm installing the Scrapy package for my Python 3 installation (rather than using the `pip` command, which would alter my MacBook's system Python).

![](2020-05-04-Image-1.png)

Next, change directory to the location where you will create your web scraping project, and tell Scrapy to create a scraper project. Scrapy will create all the necessary files for you; you only need to alter the ones you need for your web scraper.

![](2020-05-04-Image-2.png)

Navigate to the project you created and then generate a web scraping program (spider) with the name you choose. Pass the root URL that you would like to scrape as well.

![](2020-05-04-Image-3.png)

In the main directory of the scraper, open `settings.py`. I used [Atom text editor](https://atom.io/) for this project. In the `settings.py` file, add a feed format (I used CSV) and a feed URI (since I was saving the CSV to the web scraper directory, I only used the filename) to tell Scrapy how to export the web scraping results. Additionally, set a user agent that identifies you and your website to the website you're accessing.

![](2020-05-04-Image-4.png)
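
For readers who cannot make out the screenshot, the following is a rough sketch of what such settings might look like as text. It is not a transcription of my file: the filename and the user agent string are placeholders, and `FEED_FORMAT`, `FEED_URI`, and `USER_AGENT` are the standard Scrapy setting names I am assuming here.

```python
# settings.py (excerpt): a sketch with placeholder values.

# Export scraped items as CSV. A bare filename is written relative to the
# directory from which the crawl is started.
FEED_FORMAT = "csv"
FEED_URI = "atlas_images.csv"  # hypothetical filename

# Identify yourself (and your website) to the site you are crawling.
USER_AGENT = "my-name (+https://www.example.com)"  # hypothetical value
```

If you are on Scrapy 2.1 or later, the `FEEDS` dictionary setting replaces `FEED_FORMAT` and `FEED_URI`, which are deprecated there.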

Please keep in mind, there are myriad other settings to tune, and I am only covering the basics here. Now, open the scraper file in the `spiders` subdirectory of the project folder (mine was called `AtlasScraper.py`). The file should be populated with some class variables like `name` (the name of the scraper) and `start_urls` (which you entered with the `genspider` command above). Edit the `parse` function to add code to parse the pages of the website.

![](2020-05-04-Image-5.png)

In this case, my parse function did the following:
1. Extracted the image URLs and the image descriptions on the page.
2. Extracted the total number of pages on the website.
3. Waited for a few seconds to avoid overloading the website's servers.
4. Yielded a dictionary entry with the URL, the image description, and the response code (200 means the request succeeded).
5. Went to the next page.
6. Repeated steps 1-5 until the scraper passed the last page.

Finally, I started my web scraper, and I let it run for a few days.

![](2020-05-04-Image-6.png)

### How Did I Know What XPath and CSS Selectors to Use?

Go to the website, right-click on the item you want to download, and click "Inspect" to see the web page's code for the image. For my web scraper, I started with photo paths.

![](2020-05-04-Image-7.png)

As you can see, this photo's URL sits in the `data-src` attribute of an `img` element, which the XPath `//div/figure/img/@data-src` reaches. All the photo URLs followed this same path, so my XPath selector found all of them on the page. Next, I looked for the total number of pages in the web page's code.

![](2020-05-04-Image-8.png)

The number of pages appeared in the URL found at `//span[@class='last']/a/@href`, so I used that for my XPath selector. (Only the `span` with a `class` of `last` had the total number of pages in the URL.) Lastly, I looked for the image captions.

![](2020-05-04-Image-9.png)

All of the image captions had `js-subtitle-content` in their `class`, so I used a CSS selector to extract the text associated with each of them.
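
One detail worth spelling out: the last-page selector returns a URL, not a number, so the page count still has to be pulled out of it. Here is a small sketch using only the standard library. The function name `last_page_number` is made up for this example, and it assumes the page count is carried in a `page` query parameter, which you should confirm in the browser's inspector.

```python
from urllib.parse import parse_qs, urlparse


def last_page_number(href: str) -> int:
    """Return the value of the `page` query parameter in a pagination link."""
    query = parse_qs(urlparse(href).query)
    return int(query["page"][0])


# Hypothetical hrefs standing in for what the XPath selector returns.
print(last_page_number("/places?page=42"))
print(last_page_number("https://www.example.com/places?sort=recent&page=7"))
```

Parsing the query string rather than slicing the URL by position keeps the code working if the site adds other parameters to the link.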

If you need a useful tool to help you with the XPath or CSS selectors, I recommend using the [ChroPath](https://sanjayselectorshub.medium.com/chropath-firepath-for-chrome-browser-3130e72b4754) tool, which helps you generate XPath or CSS selectors, or the [SelectorGadget](https://selectorgadget.com/) tool, which shows you the CSS selector for a set of elements you want to select on a web page.

For more information, please see the following links:
- [Web Scraping in Python Using Scrapy (with Multiple Examples)](https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/)
- [Making Web Crawlers Using Scrapy for Python](https://www.datacamp.com/community/tutorials/making-web-crawlers-scrapy-python)
- [Python Scrapy Tutorial for Beginners – 01 – Creating Your First Spider](https://letslearnabout.net/tutorial/scrapy-tutorial/python-scrapy-tutorial-for-beginners-01-creating-your-first-spider/)
- [Python Web Scraping & Crawling Using Scrapy](https://www.youtube.com/watch?v=ve_0h4Y8nuI&list=PLhTjy8cBISEqkN-5Ku_kXG4QW33sxQo0t)
- [Scrape Multiple Pages with Scrapy](https://towardsdatascience.com/scrape-multiple-pages-with-scrapy-ea8edfa4318)

I studied all of these before creating my scraper, so I recommend you check them out too.