Working with Web Data in Python
Course materials for working with web data in Python. Originally developed for a full-day course at the Methods Institute @Sheffield.
About the course
This course will show how one can treat the Internet as a source of data.
What will be covered?
- Use Python to scrape, parse and read web data
- Understanding Application Programming Interfaces (APIs) and using them to collect data
- How to query APIs using the appropriate requests (case studies: GitHub, Twitter)
- Commonly returned data formats: HTML, JSON, XML
- Programmatic web data collection and streams
- Regular expressions
- Manipulating web data with Pandas (visualization included)
The materials are split into broad sections, each containing exercises, explanations and most of the information you need to complete them. The material is far from exhaustive. Instead, it aims to give you enough information and practice to get started, and to bring you quickly to a level where you can keep using Python to collect and handle web data on your own.
To follow the course you need a basic knowledge of Python, as well as a basic understanding of:
- Nested data structures
- Variable assignment and types
- How to import modules in Python
- Basic HTML tagging
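As a quick self-check of the Python prerequisites listed above, the sketch below touches on importing a module, assigning variables of different types, and indexing into a nested data structure. The dictionary contents are invented for illustration and are not part of the course materials.

```python
import json  # importing a module from the standard library

# A nested data structure: a dictionary holding a list of dictionaries.
# The names and numbers here are made up for illustration only.
course = {
    "title": "Working with Web Data in Python",
    "sections": [
        {"name": "Regular expressions", "exercises": 3},
        {"name": "Accessing APIs", "exercises": 5},
    ],
}

# Variable assignment: each name is bound to a value with its own type.
first_section = course["sections"][0]["name"]  # a str
total_exercises = sum(s["exercises"] for s in course["sections"])  # an int

# Nested dictionaries and lists map directly onto JSON,
# one of the formats web APIs commonly return.
as_json = json.dumps(course)
round_trip = json.loads(as_json)

print(type(first_section).__name__, first_section)
print(type(total_exercises).__name__, total_exercises)
print(round_trip == course)
```

If you can predict what each `print` call shows before running it, you are likely comfortable enough with these basics to follow along.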
💻 Software requirements
We recommend using the Anaconda distribution of Python. It's free and comes with a large number of additional modules ready to import into your scripts, the IPython shell and notebook interfaces, a powerful Python editor (Spyder), and a good package manager, conda, for installing and updating packages.
You need to have the following installed on your laptop for the course:
- Anaconda (get it from here)
- Python > 3.5
- Jupyter notebooks (already installed with Anaconda)
- A shell (we recommend Git Bash, which is installed with Git, or Cmder)
Once everything is installed, clone the course repository:
$ git clone https://github.com/trallard/WebData_Python.git
Then navigate to the directory containing the materials for the session. So if you have them in Documents/WebData_Python you'll type:
$ cd Documents/WebData_Python
Next, we'll use a conda environment to install all the packages needed for the course:
$ conda env create -f environment.yml
Finally, activate the environment you just created (with conda 4.4 or later, conda activate webdata is the preferred form):
$ source activate webdata
To deactivate the environment, run:
$ source deactivate
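Once the environment is active, a short script can confirm that the setup worked. The sketch below is only an illustration: `check_environment` is a hypothetical helper, not part of the course repository, and the module list is an assumption, since the contents of environment.yml are not shown here (pandas is included because the course uses it).

```python
import importlib.util
import sys


def check_environment(modules=("pandas",)):
    """Return (python_ok, missing) for the current interpreter.

    python_ok is True when the interpreter is newer than Python 3.5.
    missing lists the top-level modules that cannot be found.
    The default module list is an assumption, not taken from environment.yml.
    """
    python_ok = sys.version_info[:2] > (3, 5)
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    return python_ok, missing


if __name__ == "__main__":
    python_ok, missing = check_environment()
    print("Python version OK:", python_ok)
    print("Missing modules:", missing or "none")
```

Run it from inside the activated environment. If a module is reported missing, the environment may not have been created or activated correctly.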
⚡️ Course content
- Introduction to regular expressions
- Working with dictionaries
- Web scraping
- Working with JSON data
- Accessing APIs
This work is licensed under a Creative Commons Attribution 4.0 International License.