## Meta Title
Download Complete English Wikipedia knowledge base for large-scale semantic search and AI applications

# Introduction

In this guide, I’ll walk you through the entire process of downloading, parsing, and preparing the complete English Wikipedia knowledge base for advanced AI applications like large-scale semantic search. You’ll learn how to select the right dump files, automate downloads, handle massive datasets, and apply best practices for reliability and performance. Whether you’re building an NLP pipeline, a custom search engine, or training embeddings, this tutorial provides the technical steps and context needed for robust Wikipedia data engineering.


# Downloading the English Wikipedia Database

Wikimedia offers several types of database dumps, each tailored for different technical needs. You can browse and download these files from the [Wikimedia Dumps](https://dumps.wikimedia.org/enwiki/latest/) page. Here’s a quick rundown:

- **Pages-Articles Dumps (`enwiki-latest-pages-articles*.xml.bz2`)**  
  *Contains the text of all Wikipedia articles, excluding talk pages, user pages, and other non-content pages. This is the go-to dump for NLP and embedding tasks.*

- **Pages-Articles Multistream Dumps (`enwiki-latest-pages-articles-multistream*.xml.bz2`)**  
  *Designed for efficient random access, these files come with an index for quick retrieval of specific articles.*

- **Pages-Meta-Current Dumps (`enwiki-latest-pages-meta-current*.xml.bz2`)**  
  *Includes the current revision of all pages, plus metadata like timestamps and contributor info.*

- **Pages-Meta-History Dumps (`enwiki-latest-pages-meta-history*.xml.bz2`)**  
  *Contains the full revision history for all pages—ideal for research on editing behavior or historical analysis.*

- **Full XML Dumps (`enwiki-latest.xml.bz2`)**  
  *All pages and complete revision history for comprehensive research.*

- **Abstracts Dumps (`enwiki-latest-abstract.xml.gz`)**  
  *Short summaries of each article for lightweight applications.*

- **SQL Dumps (`*.sql.gz`)**  
  *Database tables in SQL format for advanced analysis or custom mirrors.*

- **Image, Category, Pagelinks, User, Redirect, and Other Specialized Dumps**  
  *Each serves specific analytical or archival purposes.*
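
To make the "random access" idea behind the multistream dumps concrete, here is a minimal sketch. It assumes the commonly documented layout: the multistream file is a concatenation of independent bz2 streams, and the companion index file has lines of the form `offset:page_id:title`, where `offset` is the byte position of the stream holding that page. The function names are my own and not part of any Wikimedia tooling.

```python
import bz2

def parse_index_line(line):
    """Split an 'offset:page_id:title' index line; titles may contain colons."""
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title

def read_stream_at(path, offset, chunk_size=64 * 1024):
    """Decompress only the single bz2 stream that starts at byte `offset`."""
    decompressor = bz2.BZ2Decompressor()
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)
        # Stop as soon as this stream ends; later streams are left untouched
        while not decompressor.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts.append(decompressor.decompress(chunk))
    return b"".join(parts)
```

The bytes returned are an XML fragment containing a batch of `<page>` elements, so you still need to locate the page you want inside that fragment by its ID or title.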



# Pages Article Dump

In this article we will focus on downloading the **Pages-Articles Dump**, the most widely used dataset for NLP and embedding tasks. The main file, `enwiki-latest-pages-articles.xml.bz2`, contains the full text of all Wikipedia articles, excluding non-content pages like talk, user, and file pages. This dump is updated regularly and is the recommended source for extracting article content for semantic search and machine learning projects.


## Split Dumps: Handling Large Files

Due to the massive size of the English Wikipedia, the pages-articles dump is often split into multiple parts for easier downloading and processing. These files are named sequentially, such as:

```
enwiki-latest-pages-articles1.xml-p1p41242.bz2
enwiki-latest-pages-articles2.xml-p41243p151573.bz2
enwiki-latest-pages-articles3.xml-p151574p311329.bz2
...
```

Each split file contains a portion of the articles, with the filename indicating the page ID range (e.g., `p1p41242` means pages with IDs from 1 to 41,242). The main file, `enwiki-latest-pages-articles.xml.bz2`, may be a concatenation of these splits or a separate full dump, depending on the release.
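
The naming scheme above is regular enough to parse. The following sketch pulls the part number and page ID range out of a split filename; the regular expression and function name are my own, and it only targets the plain `pages-articles` splits (not the multistream variants).

```python
import re

# Matches e.g. "...pages-articles3.xml-p151574p311329.bz2"
SPLIT_NAME_RE = re.compile(r"pages-articles(\d+)\.xml-p(\d+)p(\d+)\.bz2$")

def parse_split_name(filename):
    """Return (part_number, first_page_id, last_page_id), or None if not a split file."""
    match = SPLIT_NAME_RE.search(filename)
    if match is None:
        return None
    part, first_id, last_id = (int(group) for group in match.groups())
    return part, first_id, last_id

print(parse_split_name("enwiki-latest-pages-articles2.xml-p41243p151573.bz2"))
```

This is handy for sorting splits in order or for finding which file should contain a given page ID.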


## Typical Dump Sizes

- **Complete Dump (compressed):** ~20–25 GB (`enwiki-latest-pages-articles.xml.bz2`)
- **Single Split Part (compressed):** ~1–2 GB (e.g., `enwiki-latest-pages-articles1.xml-p1p41242.bz2`)
- **Complete Dump (uncompressed):** ~80–100 GB
- **Single Split Part (uncompressed):** ~4–8 GB

*Sizes vary by release and Wikipedia growth. Always check the actual file sizes on the [Wikimedia Dumps](https://dumps.wikimedia.org/enwiki/latest/) page.*

Working with the split files instead of the single combined file has practical advantages:

- **File Size:** The full articles dump can exceed 20 GB compressed and 80 GB uncompressed. Smaller parts are easier to download, and a corrupted transfer affects only one part.
- **Parallel Processing:** Splits allow you to process multiple chunks in parallel, speeding up parsing and embedding.
- **Resilience:** If a download fails, you only need to re-download the affected split, not the entire dataset.
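
As a rough illustration of the parallel-processing point, the sketch below streams several split files concurrently and counts the `<page>` elements in each without decompressing them to disk. The function names are hypothetical, and I use a thread pool to keep the example simple; for CPU-heavy parsing a process pool may scale better.

```python
import bz2
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

def count_pages(path):
    """Stream one compressed split and count its <page> elements."""
    count = 0
    with bz2.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            # Tags carry the MediaWiki export namespace, e.g. "{...}page"
            if elem.tag.rsplit("}", 1)[-1] == "page":
                count += 1
                elem.clear()  # drop the finished page's children and text
    return count

def count_pages_in_splits(paths, max_workers=4):
    """Process several split files concurrently; returns {path: page_count}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(count_pages, paths)))
```

The same pattern applies if you replace the counting logic with text extraction or embedding: each worker handles one split from start to finish.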


# Programmatically Downloading the Wikipedia Data

To automate the process of finding and downloading the latest Wikipedia article dumps, you can use the official RSS feed and the `WikiDumpClient` class (see notebook for code). This approach ensures you always get the most recent files and can handle both split and combined dumps.

## Use the RSS Feed to Find the Latest Dump

Set the RSS feed URL:

```python
DEFAULT_RSS_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml"
```

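The `WikiDumpClient.get_latest_dump_link_from_rss(rss_url)` method fetches and parses this RSS feed to extract the latest dump directory. It returns the location without the protocol prefix (e.g., `dumps.wikimedia.org/enwiki/20250920`), and the caller prepends `https://` when building further URLs.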
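
To show the parsing step in isolation, here is a small offline sketch that extracts the dump directory from an RSS document. The sample feed is made up for illustration (the real feed has more fields), and `dump_dir_from_rss` is a hypothetical helper that mirrors what the client method does after fetching the feed.

```python
import xml.etree.ElementTree as ET

# Illustrative feed only; the live feed contains additional elements
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Illustrative feed</title>
    <item>
      <title>http://download.wikimedia.org/enwiki/20250920</title>
      <link>http://download.wikimedia.org/enwiki/20250920</link>
    </item>
  </channel>
</rss>"""

def dump_dir_from_rss(rss_text):
    """Return 'host/path' of the first <item><link>, or None if there is none."""
    root = ET.fromstring(rss_text)
    link = root.findtext(".//item/link")
    if not link:
        return None
    # Drop the protocol and normalize the host name to the dumps server
    host_and_path = link.strip().split("://", 1)[-1]
    return host_and_path.replace("download.wikimedia.org", "dumps.wikimedia.org")

print(dump_dir_from_rss(SAMPLE_RSS))
```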

## Download and Parse dumpstatus.json

The `dumpstatus.json` file is located in the dump directory, for example `https://dumps.wikimedia.org/enwiki/20250920/dumpstatus.json`.

## About `dumpstatus.json`

The `dumpstatus.json` file is a machine-readable summary of the current Wikipedia dump directory (e.g., [20250920](https://dumps.wikimedia.org/enwiki/20250920/)). It provides structured metadata about all files generated during the dump process, including:

- **File names and URLs:** Direct download links for each dump file (e.g., split articles, combined dumps, indexes, SQL tables).
- **Status:** Whether each file has finished processing, is in progress, or failed.
- **File sizes:** Both compressed and uncompressed sizes.
- **Checksums:** MD5 or SHA1 hashes for verifying file integrity.
- **Timestamps:** When each file was started, finished, or last updated.
- **Job metadata:** Information about the dump run, such as job names, types, and completion status.

**Example snippet from `dumpstatus.json`** (abridged, with placeholder values; the job key, relative URL, and `md5` field follow what the notebook code below reads):
```json
{
  "jobs": {
    "articlesdump": {
      "files": {
        "enwiki-20250920-pages-articles1.xml-p1p41242.bz2": {
          "url": "/enwiki/20250920/enwiki-20250920-pages-articles1.xml-p1p41242.bz2",
          "size": 123456789,
          "md5": "abcdef123456...",
          "sha1": "abcdef123456...",
          "status": "done"
        },
        // ...more files...
      },
      "status": "done"
    },
    // ...other jobs...
  }
}
```
You can use this file to programmatically list, verify, and download the latest Wikipedia dump files for your project.
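
As a quick way to inspect the file once it is loaded, the sketch below summarizes each job's status, file count, and total size. The sample dictionary is invented for illustration and uses the `articlesdump` job key that the notebook code reads; `summarize_jobs` is a hypothetical helper, not part of `WikiDumpClient`.

```python
def summarize_jobs(dump_json):
    """Return {job_name: (status, file_count, total_bytes)} for every job."""
    summary = {}
    for name, job in dump_json.get("jobs", {}).items():
        files = job.get("files", {})
        total_bytes = sum(info.get("size", 0) for info in files.values())
        summary[name] = (job.get("status"), len(files), total_bytes)
    return summary

# Made-up sample with the same shape as the snippet above
sample = {
    "jobs": {
        "articlesdump": {
            "status": "done",
            "files": {
                "part1.bz2": {"size": 100, "url": "/enwiki/20250920/part1.bz2"},
                "part2.bz2": {"size": 200, "url": "/enwiki/20250920/part2.bz2"},
            },
        }
    }
}

for job_name, (status, count, size) in summarize_jobs(sample).items():
    print(f"{job_name}: status={status}, files={count}, bytes={size}")
```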

## Extract File Lists for Split and Combined Dumps

- To get the list of split "pages-articles" files:
  
  The `dumpstatus.json` file contains a `"jobs"` dictionary, where each key is a dump job. The split "pages-articles" files belong to the `"articlesdump"` job. Inside each job, the `"files"` dictionary lists all output files for that job. For split dumps, each part (e.g., `enwiki-20250920-pages-articles1.xml-p1p41242.bz2`, `enwiki-20250920-pages-articles2.xml-p41243p151573.bz2`, etc.) appears as a separate entry. Each file entry includes metadata such as the download URL, size, and checksum. To extract all split files, iterate over the `"files"` dictionary under the `"articlesdump"` job and collect the entries once the status is `"done"`.


  ```python
  split_files = client.get_articlesdump(dump_json)
  ```
  This method reads the `"articlesdump"` job in `dumpstatus.json` and returns a list with one metadata dictionary per split file. It does not filter on status, so check the job's `"status"` field yourself if the dump run may still be in progress.

- To get the combined file (if available):

  The combined "pages-articles" file is a single, large compressed XML file that contains the entire set of Wikipedia articles, rather than being split into multiple parts. It is typically named `enwiki-YYYYMMDD-pages-articles.xml.bz2` (for example, `enwiki-20250920-pages-articles.xml.bz2`). Not every dump run produces a combined file, but when available, it is listed in the `"files"` dictionary under the `"articlesdumprecombine"` job in `dumpstatus.json`.


  ```python
  combined_files = client.get_articlesdumpcombine(dump_json)
  ```
  This method returns a list (usually with a single entry) of metadata dictionaries for the combined dump file, or `None` if the job is missing from `dumpstatus.json`.
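
Since the client methods return raw metadata entries, a small helper can turn them into download-ready URLs and skip jobs that have not finished. This is a sketch under two assumptions taken from the notebook code: file URLs are stored relative to `https://dumps.wikimedia.org`, and each job carries a `"status"` field. The name `ready_file_urls` is hypothetical.

```python
DUMPS_HOST = "https://dumps.wikimedia.org"

def ready_file_urls(dump_json, job_name):
    """Return absolute URLs for a job's files, or [] unless the job is 'done'."""
    job = (dump_json or {}).get("jobs", {}).get(job_name)
    if not job or job.get("status") != "done":
        return []
    urls = []
    for info in job.get("files", {}).values():
        url = info.get("url")
        if not url:
            continue  # skip entries without a download location
        # Relative URLs are joined to the dumps host; absolute ones pass through
        urls.append(url if url.startswith("http") else DUMPS_HOST + url)
    return urls
```

For example, `ready_file_urls(dump_json, "articlesdump")` would give the split files, and `ready_file_urls(dump_json, "articlesdumprecombine")` the combined file when that job exists.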

## Example Code

```python
from wiki1 import WikiDumpClient, DEFAULT_RSS_URL

client = WikiDumpClient()
dump_json = client.download_links(DEFAULT_RSS_URL)
split_files = client.get_articlesdump(dump_json)         # List of split files
combined_files = client.get_articlesdumpcombine(dump_json)  # List of combined files (if available)
```

This workflow ensures you always have the latest file list for Wikipedia articles, and the code handles caching, retries, and error handling for robust automation.

# Downloading Process

For downloading, I recommend using PycURL for its speed, reliability, and efficient handling of large files, as detailed in [High Performance File Downloads in Python with PycURL](https://medium.com/neural-engineer/high-performance-file-downloads-in-python-with-pycurl-f3adcfddfaa8). PycURL leverages libcurl’s optimized networking stack and supports:

- **Streaming Downloads:** Avoid loading entire files into memory by streaming data directly to disk.
- **Resumable Downloads:** Support for HTTP range requests allows interrupted downloads to resume, saving bandwidth and time.
- **Progress Monitoring:** PycURL provides hooks for real-time progress updates, useful for tracking large downloads.
- **Error Handling & Retries:** Implement robust error checking and automatic retries for transient network issues.
- **Connection Management:** Efficiently reuse connections for multiple files, reducing overhead.
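
The resumable-download and retry points can be sketched as follows. This is not the `FileDownloader` from the notebook, just a minimal illustration: it appends to a partial file and asks the server to continue from the current size using libcurl's `RESUME_FROM` option. The function names, retry count, and backoff are my own choices, and a server that does not support range requests (or a file that is already complete) will cause an error rather than a clean skip.

```python
import os
import time

def resume_offset(path):
    """Bytes already on disk for a partial download (0 if the file is absent)."""
    return os.path.getsize(path) if os.path.exists(path) else 0

def download_resumable(url, path, max_attempts=3):
    """Append to `path` from where it left off, retrying on libcurl errors."""
    import pycurl  # imported lazily so resume_offset works without PycURL

    for attempt in range(1, max_attempts + 1):
        offset = resume_offset(path)
        with open(path, "ab") as f:
            c = pycurl.Curl()
            try:
                c.setopt(pycurl.URL, url)
                c.setopt(pycurl.FOLLOWLOCATION, 1)
                c.setopt(pycurl.FAILONERROR, 1)       # treat HTTP 4xx/5xx as errors
                c.setopt(pycurl.RESUME_FROM, offset)  # request bytes from `offset` on
                c.setopt(pycurl.WRITEDATA, f)         # stream straight to disk
                c.perform()
                return path
            except pycurl.error as exc:
                if attempt == max_attempts:
                    raise
                print(f"Attempt {attempt} failed ({exc}); retrying")
                time.sleep(2 ** attempt)  # simple exponential backoff
            finally:
                c.close()
```

After a resumed download finishes, it is worth verifying the size and checksum from `dumpstatus.json`, since appending to a stale or mismatched partial file would otherwise go unnoticed.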

The Full source code can be found at 

# Conclusion

Downloading and embedding the entire English Wikipedia is a challenging yet incredibly rewarding project for any technical AI engineer. By leveraging the latest Wikimedia dumps, parsing structured metadata from `dumpstatus.json`, and using high-performance tools like PycURL, you can build a scalable pipeline for semantic search, knowledge extraction, and advanced NLP applications.

If you found this guide helpful, consider subscribing to my newsletter for more deep dives into AI, NLP, and scalable machine learning. Share your feedback, questions, or experiences in the comments below.

In [None]:
# Install prerequisite packages for Wikipedia dump downloading and processing
!pip install pycurl requests tqdm

In [1]:
import os
import pycurl
import hashlib

class FileDownloader:
	def __init__(self, download_dir, chunk_size=8192, callback=None):
		self.download_dir = download_dir
		self.chunk_size = chunk_size
		self.callback = callback

	def _md5sum(self, file_path):
		"""Compute the MD5 hash of a file, reading it in chunks to limit memory use."""
		hash_md5 = hashlib.md5()
		
		with open(file_path, "rb") as f:
			while True:
				chunk = f.read(self.chunk_size)
				if not chunk:
					break
				hash_md5.update(chunk)
		return hash_md5.hexdigest()

	def _validate_file(self, file_path, expected_size=None, expected_md5=None):
		if not os.path.exists(file_path):
			return False
		if expected_size is not None and os.path.getsize(file_path) != expected_size:
			return False
		if expected_md5 is not None:
			actual_md5 = self._md5sum(file_path)
			if actual_md5 != expected_md5:
				return False
		return True

	def download(self, url, filename, expected_size=None, expected_md5=None):
		file_path = os.path.join(self.download_dir, filename)

		# Check if file already exists and is valid
		if self._validate_file(file_path, expected_size, expected_md5):
			print(f"File {file_path} already exists and is valid. Skipping download.")
			return file_path

		md5 = hashlib.md5()
		received = 0

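		# Ask the server for the file size with a HEAD request (used for progress reporting only)
		total_bytes = expected_size
		c = pycurl.Curl()
		try:
			c.setopt(c.URL, url)
			c.setopt(c.NOBODY, 1)
			c.setopt(c.FOLLOWLOCATION, 1)
			c.perform()
			content_length = int(c.getinfo(pycurl.CONTENT_LENGTH_DOWNLOAD))
			# libcurl reports -1 when the server sends no Content-Length
			if content_length > 0:
				total_bytes = content_length
		except pycurl.error:
			pass  # keep expected_size as the fallback
		finally:
			c.close()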

		def write_callback(data):
			nonlocal received
			f.write(data)
			md5.update(data)
			received += len(data)
			if self.callback:
				self.callback(received, total_bytes)

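		with open(file_path, 'wb') as f:
			c = pycurl.Curl()
			try:
				c.setopt(c.URL, url)
				c.setopt(c.FOLLOWLOCATION, 1)
				# Raise pycurl.error on HTTP 4xx/5xx instead of saving the error page
				c.setopt(c.FAILONERROR, 1)
				c.setopt(c.WRITEFUNCTION, write_callback)
				c.perform()
			finally:
				c.close()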

		# Check file size
		if expected_size is not None and received != expected_size:
			raise ValueError(f"Size mismatch: expected {expected_size}, got {received}")

		# Check MD5
		if expected_md5 is not None and md5.hexdigest() != expected_md5:
			raise ValueError("MD5 checksum mismatch")

		return file_path

In [None]:
import requests
import logging
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import xml.etree.ElementTree as ET
import os
import json



# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("wikidump")

DEFAULT_RSS_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml"

class WikiDumpClient:
    def __init__(self, retries=3, backoff_factor=0.3, timeout=10, download_dir="."):
        self.session = self._requests_session_with_retries(retries, backoff_factor, timeout)
        self.timeout = timeout
        self.download_dir = download_dir

    def _requests_session_with_retries(self, retries, backoff_factor, timeout):
        session = requests.Session()
        retry = Retry(
            total=retries,
            read=retries,
            connect=retries,
            backoff_factor=backoff_factor,
            status_forcelist=(500, 502, 503, 504),
            allowed_methods=["GET", "POST"]
        )
        adapter = HTTPAdapter(max_retries=retry)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        session.request_timeout = timeout
        return session

    def get_json_with_logging(self, url):
        try:
            logger.info(f"Requesting: {url}")
            url = url.replace("downloads.wikimedia.org", "dumps.wikimedia.org")
            resp = self.session.get(url, timeout=self.session.request_timeout)
            logger.info(f"Response: {resp.status_code} {resp.reason}")
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as e:
            logger.error(f"Request failed: {e}")
            return None

    def print_dump_files(self, dump_json):
        if not dump_json or "jobs" not in dump_json:
            logger.error("Invalid dumpstatus.json structure.")
            return
        jobs = dump_json["jobs"]
        for key in ["pages-articles", "pages-articles-multistream"]:
            if key in jobs:
                logger.info(f"\n=== {key} ===")
                files = jobs[key].get("files", {})
                for fname, finfo in files.items():
                    logger.info(f"{fname}: {finfo.get('url', '')}")

    def get_latest_dump_link_from_rss(self, rss_url=None):
        if not rss_url:
            rss_url = DEFAULT_RSS_URL
        try:
            logger.info(f"Requesting RSS: {rss_url}")
            rss_url = rss_url.replace("downloads.wikimedia.org", "dumps.wikimedia.org")
            resp = self.session.get(rss_url, timeout=self.session.request_timeout)
            logger.info(f"Response: {resp.status_code} {resp.reason}")
            resp.raise_for_status()
            root = ET.fromstring(resp.content)
            item = root.find(".//item")
            if item is not None:
                link_elem = item.find("link")
                if link_elem is not None and link_elem.text:
                    logger.info(f"Found dump link: {link_elem.text}")
                    url = link_elem.text
                    url_no_protocol = url.split('://', 1)[-1] if '://' in url else url
                    url_no_protocol = url_no_protocol.replace("download.wikimedia.org", "dumps.wikimedia.org")
                    return url_no_protocol
            logger.error("Could not find dump link in RSS feed.")
            return None
        except Exception as e:
            logger.error(f"Failed to parse RSS: {e}")
            return None

    def read_last_dump_url(self, filename):
        if os.path.exists(filename):
            with open(filename, 'r') as f:
                return f.read().strip()
        return None

    def write_last_dump_url(self, filename, url):
        with open(filename, 'w') as f:
            f.write(url)

    def read_json_file(self, filename):
        try:
            with open(filename, 'r') as f:
                data = json.load(f)
                logger.info(f"Successfully read JSON from {filename}")
                return data
        except Exception as e:
            logger.error(f"Failed to read JSON from {filename}: {e}")
            return None

    def download_links(self, rss_url=None, last_url_file="last_dump_url.txt", json_file="dumpstatus.json"):
        """Return cached JSON if present and up-to-date, else download and update cache if RSS URL has changed."""
        if not rss_url:
            rss_url = DEFAULT_RSS_URL
        dump_url = self.get_latest_dump_link_from_rss(rss_url)
        last_dump_url = self.read_last_dump_url(last_url_file)
        # If no dump_url, fallback to cached json if present
        if not dump_url:
            logger.warning("No dump_url found in RSS. Returning cached JSON if available.")
            return self.read_json_file(json_file)
        # If dump_url unchanged and json_file exists, return cached json
        if dump_url == last_dump_url and os.path.exists(json_file):
            logger.info("Dump URL unchanged. Returning cached JSON.")
            return self.read_json_file(json_file)
        # Otherwise, download new json, update cache, and return it
        logger.info("Dump URL changed or cache missing. Downloading new dumpstatus.json.")
        dumpstatus_url = f"https://{dump_url}/dumpstatus.json"
        dump_json = self.get_json_with_logging(dumpstatus_url)
        if dump_json:
            with open(json_file, 'w') as jf:
                json.dump(dump_json, jf, indent=2)
            self.write_last_dump_url(last_url_file, dump_url)
            logger.info("Updated cache with new dumpstatus.json.")
            return dump_json
        else:
            logger.error("Failed to fetch new dumpstatus.json. Returning cached JSON if available.")
            return self.read_json_file(json_file)

    def get_articlesdump(self, dump_json):
        """Return a list of metadata dicts, one per split file in the 'articlesdump' job."""
        try:
            files = dump_json['jobs']['articlesdump']['files']
            # Each value holds one file's metadata (relative url, size, checksums)
            return list(files.values())
        except Exception as e:
            import traceback
            traceback.print_exc()
            logger.error(f"Error getting articlesdump: {e}")
        return None

    def get_articlesdumpcombine(self, dump_json):
        """Return a list of metadata dicts for the recombined articles dump (if present)."""
        try:
            files = dump_json['jobs']['articlesdumprecombine']['files']
            # Usually a single entry: the combined pages-articles file
            return list(files.values())
        except Exception as e:
            logger.error(f"Error getting articlesdumprecombine: {e}")
        return None


    def download_articlesdump(self,  download_dir=None, chunk_size=16384, progress_callback=None):
        """Download all files from articlesdump using FileDownloader."""
        dump_json = self.download_links()
        from pathlib import Path
        # Import FileDownloader from the notebook's context

        files = self.get_articlesdump(dump_json)
        if not files:
            logger.error("No articlesdump files found to download.")
            return []
        if download_dir is None:
            download_dir = self.download_dir
        Path(download_dir).mkdir(parents=True, exist_ok=True)
        downloader = FileDownloader(download_dir=download_dir, chunk_size=chunk_size, callback=progress_callback)
        downloaded_files = []
        for f in files:
            url = f.get('url')
            if not url:
                logger.warning(f"No URL for file entry: {f}")
                continue
            # dumpstatus.json stores URLs relative to the dumps host
            url = "https://dumps.wikimedia.org" + url
            filename = url.split('/')[-1]
            expected_size = f.get('size')
            expected_md5 = f.get('md5')
            try:
                logger.info(f"Downloading {filename} from {url}")
                file_path = downloader.download(url, filename, expected_size=expected_size, expected_md5=expected_md5)
                downloaded_files.append(file_path)
            except Exception as e:
                logger.error(f"Failed to download {filename}: {e}")
        return downloaded_files

    def download_articlesdumpcombine(self,  download_dir=None, chunk_size=16384, progress_callback=None):
        """Download all files from the articlesdumprecombine job using FileDownloader."""
        dump_json = self.download_links()
        from pathlib import Path

        files = self.get_articlesdumpcombine(dump_json)
        if not files:
            logger.error("No articlesdumprecombine files found to download.")
            return []
        if download_dir is None:
            download_dir = self.download_dir
        Path(download_dir).mkdir(parents=True, exist_ok=True)
        downloader = FileDownloader(download_dir=download_dir, chunk_size=chunk_size, callback=progress_callback)
        downloaded_files = []
        for f in files:
            url = f.get('url')
            if not url:
                logger.warning(f"No URL for file entry: {f}")
                continue
            # dumpstatus.json stores URLs relative to the dumps host
            url = "https://dumps.wikimedia.org" + url
            filename = url.split('/')[-1]
            expected_size = f.get('size')
            expected_md5 = f.get('md5')
            try:
                logger.info(f"Downloading {filename} from {url}")
                file_path = downloader.download(url, filename, expected_size=expected_size, expected_md5=expected_md5)
                downloaded_files.append(file_path)
            except Exception as e:
                logger.error(f"Failed to download {filename}: {e}")
        return downloaded_files





In [None]:



client = WikiDumpClient()

# Progress callback: print roughly every 5% of each file
last_percent = {'value': -5}

def progress_callback(bytes_downloaded, total_bytes):
    percent = (bytes_downloaded / total_bytes) * 100 if total_bytes else 0
    # A lower percentage than last time means a new file has started
    if percent < last_percent['value']:
        last_percent['value'] = -5
    if percent - last_percent['value'] >= 5 or percent == 100:
        print(f"Downloaded {bytes_downloaded}/{total_bytes} bytes ({percent:.2f}%)")
        last_percent['value'] = percent

# Specify a custom output directory
output_dir = "/tmp/wikidump"

# Download articles dump with progress callback and custom output directory
downloaded_files = client.download_articlesdump(
        download_dir=output_dir,
        chunk_size=16384,
        progress_callback=progress_callback
    )
print("Downloaded files:", downloaded_files)