# Prepare Wikipedia data

## Resolving Page categories to top-level categories

We now have all our Wikipedia data in a neatly formatted file, but one small step remains: Wikipedia pages only store the categories they directly belong to, not their super-categories. For some of the analysis we want to do, we need to determine which "top-level" categories each page belongs to.

In [None]:
from typing import Set
import sys
import json
import library_functions as lf
from library_functions.config import Config

sys.path.append("..")
from mediawiki.mediawiki import MediaWiki
mw = MediaWiki()

Get the Wikipedia data we downloaded. Note that around this point in the assignment, we realized that using notebooks for everything was extremely impractical: notebooks are great for many things, but re-usable and modular code is not one of them.

Instead, we put all of our re-usable code in the [library functions](https://github.com/wojciechdk/Social_Graphs_and_Interactions_Final_Project/tree/master/library_functions) package. If in doubt about what any function does, you are invited to refer to the inline documentation directly.

In [None]:
wiki_data = lf.load_data_wiki()

We define a function that starts from a given category on Wikipedia (for instance, [Psychoactive drugs by mechanism of action](https://en.wikipedia.org/wiki/Category:Psychoactive_drugs_by_mechanism_of_action)). The subcategories of that category are considered the "top-level" categories, and every category below them in the category tree is mapped to the top-level category it descends from.

In [None]:
def build_category_mapping(category):
    # Helper function that gets called recursively on the tree:
    def extract_categories(tree, current: Set[str]) -> Set[str]:
        # If we're not on a leaf:
        if tree:
            # Go through all subcategories
            for subcategory in tree["sub-categories"]:
                # Some subcategories are the children of more than one category
                # (i.e., this is not technically a tree). Visit each only once.
                if subcategory not in current:
                    # Record the subcategory before recursing; the recursive
                    # call adds its descendants to `current` in place.
                    current.add(subcategory)
                    extract_categories(
                        tree["sub-categories"][subcategory], current
                    )
        return current

    # Get category tree from wikipedia
    ct = mw.categorytree(category, depth=15)
    ct = ct[category]
    # Call the recursive function on the resulting tree
    results = {
        key: extract_categories(ct["sub-categories"][key], set())
        for key in ct["sub-categories"].keys()
    }
    return results
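
To see what the flattening step does without querying Wikipedia, here is a minimal sketch that runs on a hand-written toy tree. It assumes the same nested `"sub-categories"` layout that the function above reads; the category names and the helper name `flatten_subcategories` are made up for illustration.

```python
from typing import Dict, Optional, Set

# Hypothetical toy tree using the nested "sub-categories" layout;
# all category names are invented for this example.
toy_tree = {
    "sub-categories": {
        "Stimulants": {
            "sub-categories": {
                "Amphetamines": {"sub-categories": {}},
                "Xanthines": None,  # falsy entry, treated as a leaf
            }
        },
        "Depressants": {
            "sub-categories": {
                "Barbiturates": {"sub-categories": {}},
                # Shared child: also listed under "Stimulants" above
                "Xanthines": None,
            }
        },
    }
}


def flatten_subcategories(tree: Optional[Dict], found: Set[str]) -> Set[str]:
    """Collect the names of all categories below `tree` into `found`."""
    if tree:
        for name, subtree in tree["sub-categories"].items():
            if name not in found:
                found.add(name)
                flatten_subcategories(subtree, found)
    return found


# One flattened set per top-level category
toy_mapping = {
    top: flatten_subcategories(subtree, set())
    for top, subtree in toy_tree["sub-categories"].items()
}
print(toy_mapping)
```

Because each top-level category starts from its own empty set, a shared subcategory such as "Xanthines" ends up under both top-level categories.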

We decided to limit our analysis to two parent categories:

1. [Psychoactive drugs by mechanism of action](https://en.wikipedia.org/wiki/Category:Psychoactive_drugs_by_mechanism_of_action)
2. [Drugs by psychological effects](https://en.wikipedia.org/wiki/Category:Drugs_by_psychological_effects)

Ideally, we should also include [Dietary Supplements](https://en.wikipedia.org/wiki/Category:Dietary_supplements). However, we added those later in our analysis, and we lacked the time to redo this step with them as well.

In [None]:
categories_mechanism = build_category_mapping(
    "Psychoactive drugs by mechanism of action"
)
categories_effects = build_category_mapping("Drugs by psychological effects")


The function above returns a mapping from each top-level category name to the set of all categories below it. To persist these mappings to disk, we need to convert the sets to lists, as JSON does not support sets:

In [None]:
for i in categories_mechanism:
    categories_mechanism[i] = list(categories_mechanism[i])
for i in categories_effects:
    categories_effects[i] = list(categories_effects[i])
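
As a quick illustration of why the conversion is needed, the sketch below tries to serialize a toy mapping with the standard `json` module, first with sets and then with lists. The sample data is invented, and sorting the lists is our own optional addition that only makes the output deterministic.

```python
import json

# Invented sample in the same shape as the mappings above
example = {"Stimulants": {"Xanthines", "Amphetamines"}}

try:
    json.dumps(example)
    set_is_serializable = True
except TypeError:
    # json raises TypeError for objects it cannot encode, such as sets
    set_is_serializable = False

# Convert every set to a (sorted) list before encoding
as_lists = {key: sorted(values) for key, values in example.items()}
encoded = json.dumps(as_lists)
print(set_is_serializable, encoded)
```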

And finally, save the mapping to file:

In [None]:
with open(Config.Path.wiki_effect_categories, "w+") as f:
    json.dump(categories_effects, f)
with open(Config.Path.wiki_mechanism_categories, "w+") as f:
    json.dump(categories_mechanism, f)
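
The saved mappings can later be used to resolve a page's direct categories to top-level categories. The following is a minimal sketch, assuming the mapping has the shape produced above and that category names are spelled identically in both places. The helper `top_level_categories` and the sample data are hypothetical and not part of `library_functions`.

```python
from typing import Dict, Iterable, List


def top_level_categories(
    page_categories: Iterable[str], mapping: Dict[str, List[str]]
) -> List[str]:
    """Return the top-level categories that contain any of the page's categories."""
    page_set = set(page_categories)
    return sorted(
        top
        for top, members in mapping.items()
        # A page matches if it sits directly in the top-level category
        # or in any of that category's sub-categories
        if top in page_set or page_set.intersection(members)
    )


# Invented sample mapping and page categories
sample_mapping = {
    "Stimulants": ["Amphetamines", "Xanthines"],
    "Depressants": ["Barbiturates"],
}
print(top_level_categories(["Xanthines", "Alkaloids"], sample_mapping))
```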

While we ended up not using the following, we also save a mapping between non-root categories and the set of pages that are part of that category:

In [None]:
# First determine the set of all categories (at any level) that exist
all_categories = set()
for category in wiki_data["categories"]:
    all_categories = all_categories.union(set(category))

print(f"Number of categories: {len(all_categories)}")

# Initialize the dictionary with empty lists
category_inverse_mapping = {category: [] for category in all_categories}

# Iterate over all (pagename, categories) tuples
for name, categories in zip(wiki_data["name"], wiki_data["categories"]):
    # For each category that the current page is in, add the page's
    # name to the list of pages in that category
    for category in categories:
        category_inverse_mapping[category].append(name)

Save the resulting mapping to file:

In [None]:
with open(Config.Path.all_categories_to_names_mapping, "w+") as f:
    json.dump(category_inverse_mapping, f)
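
For reference, the same inversion can be written more compactly with `collections.defaultdict`, which removes the need to collect all categories first. This is an alternative sketch on invented toy data that mimics the `"name"` and `"categories"` columns used above, not the code we actually ran.

```python
from collections import defaultdict

# Invented toy data with parallel "name" and "categories" lists
toy_data = {
    "name": ["Caffeine", "Amphetamine"],
    "categories": [["Xanthines", "Stimulants"], ["Stimulants"]],
}

# Missing keys start out as empty lists automatically
inverse = defaultdict(list)
for name, categories in zip(toy_data["name"], toy_data["categories"]):
    for category in categories:
        inverse[category].append(name)

# Convert back to a plain dict, e.g. before dumping to JSON
inverse = dict(inverse)
print(inverse)
```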

## Some convenience files

Let's store some extra files that provide alternative ways to access the Wikipedia data.
Note that the functions that do this live in our `library_functions` package, so you can refer to it for the code; here we only briefly explain what each one does.

First, save a mapping between synonyms and substance names. This allows us to quickly retrieve the original substance given a synonym:

In [None]:
lf.save_synonym_mapping(wiki_data)
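
To illustrate the idea of such a reverse lookup, here is a minimal sketch on invented data. The `"synonyms"` field, the lowercase keys, and the variable names are assumptions made for this example; the actual implementation lives in `library_functions` and may differ.

```python
# Hypothetical records: the "synonyms" field is an assumption about the data layout
toy_data = {
    "name": ["Caffeine", "Diazepam"],
    "synonyms": [["Guaranine", "Theine"], ["Valium"]],
}

# Map every synonym (lowercased, an illustrative choice) to its substance name
synonym_to_name = {
    synonym.lower(): name
    for name, synonyms in zip(toy_data["name"], toy_data["synonyms"])
    for synonym in synonyms
}
print(synonym_to_name["valium"])
```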

Save a list of all substance names:

In [None]:
lf.save_substance_names(wiki_data)

Save a mapping from substance name to the text contents of the page:

In [None]:
lf.save_contents(wiki_data)

And finally, save a mapping from substance name to the URL of its Wikipedia page:

In [None]:
lf.save_urls(wiki_data)