# Downloading Wikipedia Pages and extracting links

Now that we have generated the list of pages we want, we need to download them, parse them for links, and save them in a reasonable format.

In [None]:
from typing import Dict
import regex as re
from pathlib import Path
import os
from tqdm.auto import tqdm
import json
from library_functions.config import Config

We are still using the [MediaWiki](https://github.com/barrust/mediawiki) library. Unfortunately, it had two problems that prevented us from doing what we wanted:

1. The intro section of a Wikipedia page was not parsed by the library. This meant that for many short pages, which *only* had an introduction, the library did not return anything.

2. When parsing links in separate sections of a page, the library re-downloaded the *whole* page for every section. This made it prohibitively slow.


But fear not: through the power of Open-Source software, we cloned the library and fixed it! This resulted in two pull requests, [1](https://github.com/barrust/mediawiki/pull/90) and [2](https://github.com/barrust/mediawiki/pull/91). We are proud to say that we contributed to FOSS as part of our assignment.

One of the pull requests was approved; the other is still pending. To use the library, we therefore need to import it from our local, modified version:


In [None]:
import sys
sys.path.append("/home/ldorigo/MEGA/DTU/Q2/social_graphs/mediawiki/")
from mediawiki import MediaWiki

Load in the category tree we generated previously:

In [None]:
with open(Config.Path.full_category_tree_clean, "r") as f:
    root_tree = json.load(f)

## Utility Functions to traverse the category tree and download pages

### Getting inline links
The 'official' MediaWiki API, when asked for links from a page, returns all outgoing links, including those in the template boxes at the end of pages (see the bottom of [this page](https://en.wikipedia.org/wiki/Caffeine#External_links) for an example).

This adds an enormous number of links that aren't relevant for our network and results in a big mess. We thus need to extract the links "by hand", using the Python library's function that we adapted in one of our pull requests.

In [None]:
def get_inline_links(page: MediaWiki.page):
    # Store the links given by the "official" API
    api_links = page.links

    # Only get links from sections that are actually on-topic
    relevant_sections = [
        section
        for section in page.sections
        if section not in [
            "See also", 
            "References", 
            "Bibliography", 
            "External links"
            ]
    ]

    # Also parse the intro section (0) 
    relevant_sections.append(0)

    actual_links = {}
    for section in tqdm(relevant_sections):
        # Start by getting all links
        tentative_links = page.parse_section_links(section)

        if not tentative_links:
            continue

        # Remove all references and external links
        no_references = []
        for name, url in tentative_links:
            # Raw strings avoid invalid-escape warnings for "\[" and "\."
            if not re.search(r"\[[0-9]*\]", name) and not re.search(r"^FILE", name):
                if re.search(r"en\.wikipedia\.org", url):
                    no_references.append((name, url))

        for name, url in no_references:
            # Get the wikipedia title from the http link            
            link_title = re.search("/([^/]+)(/?)$", url)
            if link_title:
                # Wikipedia titles have spaces, not underscores
                actual_title = link_title[1].replace("_", " ")
                # Sanity check: make sure the link we extracted 
                # is part of the links returned by the API
                if actual_title not in api_links:
                    continue
                # We also keep track of how many times a specific link is used,
                # in case we want to use that information later on
                actual_links[actual_title] = (
                    actual_links.setdefault(actual_title, 0) + 1
                )
    return actual_links
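
To see the filtering and counting steps in isolation, here is a minimal offline sketch. It assumes the links arrive as `(name, url)` tuples, as in the loop above. The helper `count_inline_links` and the sample data are hypothetical and only for illustration, and it uses the standard-library `re` module instead of `regex`.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def count_inline_links(
    links: Iterable[Tuple[str, str]], api_links: Set[str]
) -> Dict[str, int]:
    """Count how often each valid Wikipedia title appears in `links`."""
    counts: Counter = Counter()
    for name, url in links:
        # Skip footnote markers such as "[1]" and file links
        if re.search(r"\[[0-9]*\]", name) or re.search(r"^FILE", name):
            continue
        # Skip anything that does not point to English Wikipedia
        if not re.search(r"en\.wikipedia\.org", url):
            continue
        # The title is the last path component of the URL
        match = re.search(r"/([^/]+)(/?)$", url)
        if not match:
            continue
        title = match[1].replace("_", " ")
        # Keep only titles that the API also reports as links
        if title in api_links:
            counts[title] += 1
    return dict(counts)


# Hypothetical sample data
sample_links = [
    ("caffeine", "https://en.wikipedia.org/wiki/Caffeine"),
    ("[1]", "https://en.wikipedia.org/wiki/Caffeine#cite_note-1"),
    ("adenosine receptor", "https://en.wikipedia.org/wiki/Adenosine_receptor"),
    ("caffeine again", "https://en.wikipedia.org/wiki/Caffeine"),
    ("PubChem", "https://pubchem.ncbi.nlm.nih.gov/compound/2519"),
]
sample_api_links = {"Caffeine", "Adenosine receptor"}

print(count_inline_links(sample_links, sample_api_links))
```

The footnote marker and the external link are dropped, and the two links to the same article are merged into a single entry with a count of 2.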

Now let's write a couple of functions to traverse our tree and download pages along the way.

First, to download and save a single page:

In [None]:

def process_page(name: str, current_dir: Path, redirects: Dict) -> Dict:
    # Path where we want to write the file
    filepath = current_dir.joinpath(name.replace("/", "_") + ".json")

    # To enable resuming: if the article was already downloaded, 
    # just read it from file
    if filepath.exists():
        with open(filepath.as_posix(), "r") as f:
            return json.load(f)

    # Get the page through the API
    page = mw.page(name)
    results = {}

    ## Add redirects to the global list of synonyms
    for r in page.redirects:
        redirects[r] = name

    ## Save all outgoing links
    results["links"] = get_inline_links(page)

    ## Save "secondary" categories of the page
    results["categories"] = page.categories

    ## Also save a reference to the redirects within the page itself
    results["redirects"] = page.redirects

    ## Finally, save the contents and the url for quick reference:
    results["url"] = page.url
    results["content"] = page.content

    ## Save to a json file
    with open(filepath, "w+") as f:
        json.dump(results, f)
    return results
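
The resume logic in `process_page` can be tried out without touching the network. The sketch below assumes a stand-in `fake_fetch` function in place of the real API call. `load_or_fetch`, `fake_fetch` and the sample title are hypothetical names used only for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def load_or_fetch(name: str, directory: Path, fetch: Callable[[str], Dict]) -> Dict:
    """Return the cached JSON for `name`, or fetch it and cache it."""
    # Slashes in titles would otherwise be read as sub-directories
    filepath = directory / (name.replace("/", "_") + ".json")
    if filepath.exists():
        with open(filepath, "r") as f:
            return json.load(f)
    result = fetch(name)
    with open(filepath, "w") as f:
        json.dump(result, f)
    return result


# Record every call so we can see when the "download" really happens
calls: List[str] = []


def fake_fetch(name: str) -> Dict:
    calls.append(name)
    return {"url": "https://example.org/" + name, "links": {}}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch("AC/DC", Path(tmp), fake_fetch)
    second = load_or_fetch("AC/DC", Path(tmp), fake_fetch)

print(calls)
```

The second call is served from disk, so `fake_fetch` runs only once. One consequence of this design is that an interrupted run can simply be restarted, at the cost of never refreshing pages that are already on disk.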

And now a function that recursively traverses the category tree:

In [None]:
def process_tree(
    root_name: str, root_node: Dict, current_dir: Path, redirects: Dict
) -> Dict:

    results = {}

    # Go through all links (pages) in the current level and process them:
    for link in tqdm(root_node["links"]):
        results[link] = {}
        results[link]["category"] = root_name
        results[link]["data"] = process_page(link, current_dir, redirects)

    # Then, go through all sub-categories, and process them 
    # using this same function recursively
    for sub_category in tqdm(root_node["sub-categories"]):
        # Create a directory for each category
        new_dir = current_dir.joinpath(sub_category)
        if not new_dir.exists():
            os.mkdir(new_dir)
        # recursive call   
        cat_results = process_tree(
            root_name=sub_category,
            root_node=root_node["sub-categories"][sub_category],
            current_dir=new_dir,
            redirects=redirects,
        )
        results.update(cat_results)
    return results
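
To make the shape of the result concrete, here is a small sketch of the same traversal on a toy tree, with the download step left out. It assumes the tree uses the `"links"` and `"sub-categories"` keys seen above. `flatten_tree` and `toy_tree` are hypothetical names used only for illustration.

```python
from typing import Any, Dict


def flatten_tree(root_name: str, root_node: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
    """Map every page title in the tree to the category it was found in."""
    results: Dict[str, Dict[str, str]] = {}
    # Pages listed directly under this category
    for link in root_node["links"]:
        results[link] = {"category": root_name}
    # Recurse into each sub-category and merge its pages
    for sub_name, sub_node in root_node["sub-categories"].items():
        results.update(flatten_tree(sub_name, sub_node))
    return results


toy_tree = {
    "links": ["Psychoactive drug"],
    "sub-categories": {
        "Stimulants": {"links": ["Caffeine", "Nicotine"], "sub-categories": {}},
        "Depressants": {"links": ["Ethanol", "Caffeine"], "sub-categories": {}},
    },
}

flat = flatten_tree("custom_root", toy_tree)
print(flat)
```

Because the results are merged with `dict.update`, a page that appears in several categories ends up with a single entry, labelled with the last category visited (here "Caffeine" is recorded under "Depressants").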

Last but not least, run our function, saving all data to file as well as in a separate data structure. Note that we are saving two things:

1. The raw data, with one page per file under the `full_wiki_data` directory
2. The full dataset in a more readily accessible form, which we will store as a large json file

In [None]:
# Keep a flat dict of redirects (synonyms) for later reference
redirects = {}

data_path = Config.Path.private_data_folder / "full_wiki_data"
if not data_path.exists():
    os.mkdir(data_path)

mw = MediaWiki()
content_tree = process_tree(
    root_name="custom_root",
    root_node=root_tree,
    current_dir=data_path,
    redirects=redirects,
)


Afterwards, we realized that we had overlooked some pages that also needed to be removed, so we remove them here rather than redoing everything:

In [None]:
# Clean some pages that were overlooked
overlooked_pages = [
    "Antidepressant",
    "Reversible inhibitor of monoamine oxidase A",
    "Stimulant",
    "Anxiolytic",
    "Histone deacetylase inhibitor",
    "Norepinephrine–dopamine disinhibitor",
    "Norepinephrine–dopamine reuptake inhibitor",
    "Monoamine oxidase inhibitor",
    "Heterocyclic antidepressant",
    "Tricyclic antidepressant",
    "Barbiturate",
    "Aphrodisiac",
]

for title in overlooked_pages:
    # pop() with a default tolerates titles that are already gone,
    # so this cell can be re-run without raising a KeyError
    content_tree.pop(title, None)

Save the full data to a single file:

In [None]:
with open(Config.Path.full_wiki_data, "w+") as f:
    json.dump(content_tree, f, indent=2)
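
As a quick check of the saved structure, the sketch below walks a dictionary of the same shape and lists the links that stay inside the dataset. It assumes each entry has the `"category"` and `"data"` keys built above. `edges_from_content_tree` and the toy data are hypothetical and only illustrate one possible way to read the file back.

```python
from typing import Any, Dict, List, Tuple


def edges_from_content_tree(
    content_tree: Dict[str, Dict[str, Any]]
) -> List[Tuple[str, str, int]]:
    """List (source, target, count) for links between pages in the dataset."""
    edges = []
    for source, entry in content_tree.items():
        for target, count in entry["data"]["links"].items():
            # Ignore links to pages we did not download, and self-links
            if target in content_tree and target != source:
                edges.append((source, target, count))
    return edges


# Hypothetical data in the same shape as the saved JSON
toy_content_tree = {
    "Caffeine": {
        "category": "Stimulants",
        "data": {"links": {"Nicotine": 2, "Coffee": 5}},
    },
    "Nicotine": {
        "category": "Stimulants",
        "data": {"links": {"Caffeine": 1}},
    },
}

toy_edges = edges_from_content_tree(toy_content_tree)
print(toy_edges)
```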
