ODIQ workshop - Web Scraping
QUT DMRC - 2015
Patrik Wikström & Brenda Moon
Notebooks
- Introduction
- Step 1. Extract road sign name from a single item on a single page
- Step 2. Extract all road sign names on a single page
- Step 3. Extract all road sign data from a single page
- Step 4. Structure the data extraction as a function
- Step 5. Store the data in a dataframe and save to disk
- Step 6. Restructure the code for clarity
- Step 7. Plotting, tiny stat analysis and improved I/O
- Final. Support for multiple pages
Installation
This workshop uses Python 3 and the Python modules listed below.
One of the easiest ways to install Python and the packages relevant to web scraping is to use Anaconda, developed by Continuum Analytics. Note, however, that Python is available in two versions, 2.7 and 3.4. We will be using Python 3.4, so make sure you download the correct one.
Anaconda can be downloaded from https://www.continuum.io/downloads
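Once Anaconda is installed, it can be worth confirming that the interpreter you are running really is Python 3. The snippet below is a minimal sketch using only the standard library; the helper name `is_python3` is our own and is not part of the workshop notebooks.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True if the given version tuple belongs to Python 3."""
    # The first element of the version tuple is the major version number.
    return version_info[0] == 3


# Show the version string of the interpreter that is running this code.
print("Running Python", sys.version.split()[0])

if not is_python3():
    print("This workshop expects Python 3; please install the Python 3 version of Anaconda.")
```

Run it in a notebook cell or at the Python prompt. If the warning appears, the Python 2.7 installer was probably downloaded instead.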
Python modules used in this workshop
We use three Python modules in this workshop. Full documentation for each is available on its website:
- Requests - get webpages from URLs
- BeautifulSoup - select text out of webpages
- Pandas - Python data analysis library
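To check that the three modules listed above are available before starting the notebooks, something like the following sketch may help. It relies only on the standard library. The names `WORKSHOP_MODULES` and `check_modules` are illustrative, not part of the workshop code. Note that BeautifulSoup is imported under the name `bs4`.

```python
import importlib.util

# Display name -> import name. BeautifulSoup is imported as "bs4".
WORKSHOP_MODULES = {
    "Requests": "requests",
    "BeautifulSoup": "bs4",
    "Pandas": "pandas",
}


def check_modules(modules=WORKSHOP_MODULES):
    """Return a dict mapping each display name to True if its module can be found."""
    return {
        name: importlib.util.find_spec(import_name) is not None
        for name, import_name in modules.items()
    }


# Report the status of each module without actually importing it.
for name, found in check_modules().items():
    print(name, "OK" if found else "MISSING")
```

Anaconda normally ships with all three, so a "MISSING" line usually means the code is running under a different Python installation than the one Anaconda provided.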
These pages are written as Jupyter Notebooks.