# Downloading all Wikipedia Articles 

This notebook downloads all English Wikipedia articles from the Wikimedia dumps. I kept the actual download out of the main notebook because of the lengthy output.

## Find Files to Download

In [1]:
import requests
from bs4 import BeautifulSoup
from timeit import default_timer as timer
import os

base_url = 'https://dumps.wikimedia.org/enwiki/'
index = requests.get(base_url).text
soup_index = BeautifulSoup(index, 'html.parser')

# Find the link to the dump dated 2018-09-01
dumps = [a['href'] for a in soup_index.find_all('a') if 
         a.text == '20180901/']

dumps_url = base_url + dumps[0]

# Retrieve the html
dump_html = requests.get(dumps_url).text

# Convert to a soup
soup_dump = BeautifulSoup(dump_html, 'html.parser')

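# Collect (file name, remaining listing text) for every pages-articles entry
files = []
for file in soup_dump.find_all('li', {'class': 'file'}):
    text = file.text
    if 'pages-articles' in text:
        files.append((text.split()[0], text.split()[1:]))

# Keep only the partitioned files, whose names contain '.xml-p'
files_to_download = [file[0] for file in files if '.xml-p' in file[0]]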
print(f'There are {len(files_to_download)} files to download.')

There are 55 files to download.
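
Each partition's file name appears to encode the range of page IDs it covers (for example, `p10p30302` in the first file). The sketch below pulls that range out of a name, which can help when sorting the partitions or checking them for gaps. `page_range` is a hypothetical helper and not part of the original notebook; the regular expression assumes names shaped like the ones in this dump.

```python
import re

# Assumed name shape: ...pages-articlesN.xml-p<start>p<end>.bz2
_RANGE = re.compile(r'\.xml-p(\d+)p(\d+)\.bz2$')


def page_range(file_name):
    """Return (start, end) page IDs parsed from a partition file name."""
    match = _RANGE.search(file_name)
    if match is None:
        raise ValueError(f'No page range found in {file_name!r}')
    return int(match.group(1)), int(match.group(2))


names = [
    'enwiki-20180901-pages-articles2.xml-p30304p88444.bz2',
    'enwiki-20180901-pages-articles1.xml-p10p30302.bz2',
]

# Sort partitions by their starting page ID
ordered = sorted(names, key=page_range)
```

Sorting by the parsed range rather than by the raw string avoids `articles10` landing before `articles2`.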


## Download Files Using Keras

Files will be saved in `~/.keras/datasets`, the default cache directory used by `get_file`.

In [2]:
from keras.utils import get_file

data_paths = []

start = timer()
for file in files_to_download:
    data_paths.append(get_file(file, dumps_url + file))
    
end = timer()
print(f'{round(end - start)} total seconds elapsed.')

Using TensorFlow backend.


Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles1.xml-p10p30302.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles2.xml-p30304p88444.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles3.xml-p88445p200507.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles4.xml-p200511p352689.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles5.xml-p352690p565312.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles6.xml-p565314p892912.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles7.xml-p892914p1268691.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles8.xml-p1268693p1791079.bz2
Downloading data from https://dumps.w

Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles26.xml-p42567204p42663461.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles27.xml-p42663464p44163464.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles27.xml-p44163464p45663464.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles27.xml-p45663464p47163464.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles27.xml-p47163464p48663464.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles27.xml-p48663464p50163464.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles27.xml-p50163464p51663464.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20180901/enwiki-20180901-pages-articles27.xml-p51663464p53163

The total download time was just over 2 hours. That's not bad for all of Wikipedia (at least the English articles).
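
If installing Keras only to fetch files is not desirable, similar download-and-cache behavior can be approximated with the standard library. This is a minimal sketch and not what the notebook ran: `fetch` and its `dest_dir` argument are hypothetical names, and the skip-if-present check looks only at whether the file exists, not at its contents.

```python
import os
import shutil
import urllib.request


def fetch(url, dest_dir):
    """Download url into dest_dir unless it is already there; return the path."""
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, url.rsplit('/', 1)[-1])

    # Reuse a previous download instead of fetching it again
    if os.path.exists(path):
        return path

    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run
    tmp_path = path + '.part'
    with urllib.request.urlopen(url) as response, open(tmp_path, 'wb') as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp_path, path)
    return path
```

With the notebook's variables, a call would look like `fetch(dumps_url + file, os.path.expanduser('~/wiki_dumps'))`, where the target directory is only an example.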

This process could also be done in parallel using multithreading or multiprocessing. However, I have run into issues when downloading files in parallel because the code was making too many requests to the server.
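
One way to get some parallelism while keeping the request rate low is to cap the number of simultaneous downloads. The sketch below uses a small thread pool. `download_all` and its parameters are hypothetical names, and the default of two workers is an assumption about what the server tolerates, not a documented limit.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download_one, max_workers=2):
    """Call download_one on every URL with at most max_workers running at once.

    Results come back in the same order as urls. If any download raises,
    the exception propagates when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_one, urls))
```

With the notebook's variables, this could be called as `download_all([dumps_url + f for f in files_to_download], lambda url: get_file(url.rsplit('/', 1)[-1], url))`. Threads are enough here because the work is network-bound, so multiprocessing adds little.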