wiki-table-scrape

Scrape HTML tables from a Wikipedia page into CSV format.

wikitablescrape can be used as a shell command or imported as a Python package.
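As a rough illustration of driving the tool from Python, the sketch below shells out to the `wikitablescrape` command and parses its CSV output. It does not use the package's own import API, which this README does not describe. The helper names `build_command`, `parse_csv`, and `scrape_table` are hypothetical, and only the `--url` and `--header` flags shown in this README are assumed.

```python
import csv
import io
import subprocess


def build_command(url, header=None):
    """Build the argument list for the wikitablescrape CLI (hypothetical helper)."""
    cmd = ["wikitablescrape", f"--url={url}"]
    if header is not None:
        cmd.append(f"--header={header}")
    return cmd


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def scrape_table(url, header):
    """Run the CLI and return the selected table as a list of dicts.

    Assumes wikitablescrape is installed and on PATH; raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        build_command(url, header), capture_output=True, text=True, check=True
    )
    return parse_csv(result.stdout)
```

Calling `scrape_table("https://en.wikipedia.org/wiki/List_of_highest-grossing_films", "films by year")` would then return one dict per table row, provided the command is installed and the page is reachable.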

Why?

This tool makes it easy to download any Wikipedia table via CLI in a format ready for text processing.

This is especially useful when combined with a tool like xsv.

Year Distribution of Costliest Atlantic Hurricanes
wikitablescrape --url='https://en.wikipedia.org/wiki/List_of_costliest_Atlantic_hurricanes' --header='costliest' | xsv select "Season" | xsv stats --median | xsv select field,min,max,median,mean,stddev | xsv table
field   min   max   median  mean                stddev
Season  1965  2018  2002    1999.1228070175441  12.900523823770502
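If xsv is not available, similar summary statistics can be sketched with the Python standard library. The function name `column_stats` is hypothetical, and the choice of population standard deviation (`statistics.pstdev`) is an assumption; xsv may use a different convention.

```python
import csv
import io
import statistics


def column_stats(csv_text, column):
    """Summarize an integer column from CSV text (hypothetical helper)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Skip empty cells so blank rows do not break int().
    values = [int(row[column]) for row in rows if (row[column] or "").strip()]
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        # Assumption: population standard deviation.
        "stddev": statistics.pstdev(values),
    }
```

The CSV text could come from the command's stdout, for example by reading `sys.stdin` when the script sits at the end of a pipe.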
Country / Market Distribution of Best-selling Music Artists
wikitablescrape --url='https://en.wikipedia.org/wiki/List_of_best-selling_music_artists' --header='100 million' | xsv select 'Country / Market' | xsv frequency | xsv table
field             value                         count
Country / Market  United States                 26
Country / Market  United Kingdom                10
Country / Market  United Kingdom United States  1
Country / Market  Australia                     1
Country / Market  Spain                         1
Country / Market  Japan                         1
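The same kind of frequency count can be approximated in Python with `collections.Counter`. This is a minimal sketch; `column_frequency` is a hypothetical helper name, not part of wikitablescrape.

```python
import csv
import io
from collections import Counter


def column_frequency(csv_text, column):
    """Count how often each value appears in a column, most common first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in rows).most_common()
```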

Installation

You can install the package from PyPI or build it from source. Both options require Python 3.

As a system-level Python package

python3 -m pip install wikitablescrape
wikitablescrape --help

In a virtual environment

python3 -m venv venv
. venv/bin/activate
pip install wikitablescrape
wikitablescrape --help

Build from source

git clone https://github.com/rocheio/wiki-table-scrape
cd ./wiki-table-scrape
python3 -m venv venv
. venv/bin/activate
python setup.py install
wikitablescrape --help

Sample Commands

Write a single table to stdout
wikitablescrape --url="https://en.wikipedia.org/wiki/List_of_highest-grossing_films" --header="films by year" | tee >(head -1) >(tail -5) >/dev/null
"Year","Title","Worldwide gross","Budget","Reference(s)"
"2015","Star Wars: The Force Awakens","$2,068,223,624","$245,000,000",""
"2016","Captain America: Civil War","$1,153,304,495","$250,000,000",""
"2017","Star Wars: The Last Jedi","$1,332,539,889","$200,000,000",""
"2018","Avengers: Infinity War","$2,048,359,754","$316,000,000–400,000,000",""
"2019","Avengers: Endgame","$2,796,255,086","$356,000,000",""
Download all tables on a page into a folder of CSV files
wikitablescrape --url="https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list" --output-folder="/tmp/scrape"
Parsing all tables from 'https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages#Top-100_list' into '/tmp/scrape'
Writing table 1 to /tmp/scrape/table_1_top_100_list.csv
Writing table 2 to /tmp/scrape/table_2_countries.csv
Writing table 3 to /tmp/scrape/table_3_cities.csv
Writing table 4 to /tmp/scrape/table_4_buildings_&_structures_&_statues.csv
Writing table 5 to /tmp/scrape/table_5_people.csv
Writing table 6 to /tmp/scrape/table_6_people_singers.csv
Writing table 7 to /tmp/scrape/table_7_people_actors.csv
Writing table 8 to /tmp/scrape/table_8_people_romantic_actors.csv
Writing table 9 to /tmp/scrape/table_9_people_athletes.csv
Writing table 10 to /tmp/scrape/table_10_people_modern_political_leaders.csv
Writing table 11 to /tmp/scrape/table_11_people_pre_modern_people.csv
Writing table 12 to /tmp/scrape/table_12_people_3rd_millennium_people.csv
Writing table 13 to /tmp/scrape/table_13_progression_of_the_most_viewed_millennial_persons_on_wikipedia.csv
Writing table 14 to /tmp/scrape/table_14_music_bands_historical_most_viewed_3rd_millennium_persons.csv
Writing table 15 to /tmp/scrape/table_15_sport_teams_historical_most_viewed_3rd_millennium_persons.csv
Writing table 16 to /tmp/scrape/table_16_films_and_tv_series_historical_most_viewed_3rd_millennium_persons.csv
Writing table 17 to /tmp/scrape/table_17_albums_historical_most_viewed_3rd_millennium_persons.csv
Writing table 18 to /tmp/scrape/table_18_books_and_book_series_historical_most_viewed_3rd_millennium_persons.csv
Writing table 19 to /tmp/scrape/table_19_books_and_book_series_pre_modern_books_and_texts.csv
head -5 /tmp/scrape/table_3_cities.csv
"Rank","Page","Continent","Views in millions"
"1","New York City","North America","75"
"2","Singapore","Asia","63"
"3","London","Europe","61"
"4","Hong Kong","Asia","50"
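Once a folder of CSV files exists, it can be loaded back into Python for further processing. The sketch below assumes the files are UTF-8 encoded and have a header row, as in the output above; `load_tables` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def load_tables(folder):
    """Read every CSV file in a folder into a dict keyed by file name."""
    tables = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # Assumption: files are UTF-8 with a header row.
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.name] = list(csv.DictReader(handle))
    return tables
```

For example, `load_tables("/tmp/scrape")["table_3_cities.csv"]` would give the rows of the cities table as a list of dicts.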

Testing

./scripts/test.sh

# Show coverage data in a browser (`open` is the macOS command; on Linux use `xdg-open`)
coverage html && open htmlcov/index.html

Sample Articles for Scraping

Contributing

If you would like to contribute to this module, please open an issue or pull request.

More Information

If you'd like to read more about this module, please check out my blog post from the initial release.
