## Acquiring the Data

For our model we need background stories for many different heroes and villains. We decided to collect these from Wikipedia.

### The Wikipedia API

Wikipedia has an API that is useful on its own, but there are also packages that streamline the process and add extra features.

In [None]:
import requests
import wikipedia
import string
from wikiparse import WikiParser
%load_ext autoreload
%autoreload 2

First, we tried querying the basic API. A `categorymembers` request returns the members of a category. Each member has a `pageid` and a `title`, which we can use to look up a specific page.

In [None]:
S = requests.Session()

URL = "https://en.wikipedia.org/w/api.php"

TITLE = "Category:Marvel Comics supervillains"

PARAMS = {
    'action': "query",
    'list': 'categorymembers',
    'cmtitle': TITLE,
    'cmlimit': '100',
    'format': "json",
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()
ids = [item['title'] for item in DATA['query']['categorymembers']]
ids[:20]
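Each entry in `categorymembers` is a small dictionary carrying both the page ID and the title. As a minimal sketch, the hypothetical helper `member_index` below builds a title-to-ID lookup from a response shaped like the one above. The sample payload is invented for illustration and is not real API output.

```python
def member_index(data):
    """Map each category member's title to its page ID."""
    members = data.get("query", {}).get("categorymembers", [])
    return {m["title"]: m["pageid"] for m in members}


# Illustrative payload only: titles and page IDs are made up.
sample = {
    "query": {
        "categorymembers": [
            {"pageid": 101, "ns": 0, "title": "Example Villain A"},
            {"pageid": 102, "ns": 0, "title": "Example Villain B"},
        ]
    }
}

index = member_index(sample)
print(index)
```

Keeping the page ID alongside the title can help later, since an ID identifies a page unambiguously while a title may resolve to a disambiguation page.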

Double-check that the request returns the expected number of entries.

In [None]:
len(ids)

When you request a page through the basic API, it returns the page content as HTML, so we need a way to turn it into plain text. Here we test a method for taking a list of titles and extracting the plain-text version of each one.

It turns out that the `wikipedia` package returns page objects with a `content` attribute, which contains the plain-text version of the page. With this package we only need to do a little cleaning of the text before appending it to a list.

In [None]:
bios = []
miss = 0

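# get_bio is a helper that is not defined in this cell; judging by the
# check below, it returns -1 when no usable biography is found.
for title in ids:
    try:
        hero = wikipedia.page(title)
        bio = get_bio(hero)
        if bio != -1:
            bios.append(bio)
    except wikipedia.exceptions.DisambiguationError:
        # Ambiguous titles are skipped and counted.
        miss += 1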
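The cleaning step itself is not shown here. As a rough sketch of one way it might work, the hypothetical function `extract_section` below pulls a single section out of plain text in which top-level headings appear as `== Heading ==` lines, which is how the `content` text is typically laid out. The heading name "Fictional character biography" and the sample text are assumptions for illustration, and this is not necessarily what `get_bio` does.

```python
import re

# Matches top-level "== Heading ==" lines only; "=== Sub ===" lines are skipped.
HEADING = re.compile(r"^==[ \t]*([^=\n][^\n]*?)[ \t]*==[ \t]*$", re.MULTILINE)


def extract_section(content, heading):
    """Return the text under a top-level heading, or None if it is absent."""
    # re.split with one capture group yields [intro, title1, body1, title2, ...].
    parts = HEADING.split(content)
    for title, body in zip(parts[1::2], parts[2::2]):
        if title.strip().lower() == heading.lower():
            return body.strip()
    return None


# Made-up article text, used only to exercise the function.
sample = (
    "Example Hero is a fictional character.\n\n"
    "== Publication history ==\n"
    "First appeared in an example issue.\n\n"
    "== Fictional character biography ==\n"
    "Born in an example city.\n\n"
    "=== Early life ===\n"
    "Raised by example parents.\n\n"
    "== Powers and abilities ==\n"
    "Example strength.\n"
)

bio = extract_section(sample, "Fictional character biography")
print(bio)
```

Subsections stay attached to their parent section because only two-equals headings are treated as boundaries.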

## Collecting the Data

Below is the final code for scanning the entire Wikipedia category for Marvel Comics superheroes, 100 entries at a time. Each batch of 100 entries is processed and added to a list. We created a custom object, `WikiParser`, to handle many of the functions involved.

This cell only downloads the entries for heroes; the villains are collected later.

In [None]:
#S = requests.Session()
import datetime
import time

now = datetime.datetime.now()

wp = WikiParser()
cat = "Category:Marvel Comics superheroes"
titles = wp.get_category(cat)

h_bios = []
old_bios = []

for page in range(15):
    print(f'page: {page} parsing...', end=" ")
    new_bios = wp.get_all_bios(titles,'hero')
    if new_bios == old_bios:
        break
    h_bios += new_bios
    if wp.cmc != -1:
        titles = wp.continue_category(cat)
    old_bios = new_bios

later = datetime.datetime.now()
elapsed = later-now
print("Time: ", elapsed) 
heroes = h_bios
print(len(h_bios))

In [None]:
len(heroes)
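The internals of `WikiParser` are not shown in this notebook. As a hedged sketch of how the same batching could be done against the raw API, the generator below follows the `cmcontinue` token that the MediaWiki API returns while more category members remain. The names `iter_category_titles`, `fetch`, and `fake_fetch` are hypothetical; `fetch` stands in for any callable that takes a parameter dict and returns the decoded JSON, and the fake responses are invented so the example runs offline.

```python
def iter_category_titles(fetch, category, limit=100):
    """Yield member titles of `category`, one batch of `limit` at a time."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(limit),
        "format": "json",
    }
    while True:
        data = fetch(dict(params))
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        token = data.get("continue", {}).get("cmcontinue")
        if token is None:
            break  # no continuation token: this was the last batch
        params["cmcontinue"] = token


def fake_fetch(params):
    """Offline stand-in for the API: two invented batches keyed by token."""
    batches = {
        None: {
            "query": {"categorymembers": [{"title": "A"}, {"title": "B"}]},
            "continue": {"cmcontinue": "page|2", "continue": "-||"},
        },
        "page|2": {"query": {"categorymembers": [{"title": "C"}]}},
    }
    return batches[params.get("cmcontinue")]


titles_demo = list(iter_category_titles(fake_fetch, "Category:Example"))
print(titles_demo)
```

Against the live API, `fetch` could be something like `lambda p: S.get(url=URL, params=p).json()`, reusing the session and URL from the earlier cell. Stopping on a missing token avoids having to guess the number of batches in advance.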

Now we perform the same operation, but for the villains.

In [None]:
#S = requests.Session()
import datetime
import time

now = datetime.datetime.now()

wp_v = WikiParser()
cat = "Category:Marvel Comics supervillains"
titles = wp_v.get_category(cat)

v_bios = []
old_bios = []

for page in range(12):
    print(f'page: {page} parsing...', end=" ")
    new_bios = wp_v.get_all_bios(titles,'villain')
    if new_bios == old_bios:
        break
    v_bios += new_bios
    if wp_v.cmc != -1:
        titles = wp_v.continue_category(cat)
    old_bios = new_bios
        
villains = v_bios
later = datetime.datetime.now()
elapsed = later-now
print("Time: ", elapsed)    
print(len(v_bios))

Finally, we compile heroes and villains into a single DataFrame and export it to a JSON file.

In [None]:
all_marvel = villains + heroes

In [None]:
len(all_marvel)

In [None]:
import pandas as pd

marvel_df = pd.DataFrame.from_dict(all_marvel)
marvel_df.head()
#marvel_df.to_json('marvel_bios.json')

We also experimented with collecting some DC bios, but ultimately decided it wasn't necessary.

In [None]:
wp = WikiParser()
cat = "Category:DC Comics superheroes"
titles = wp.get_category(cat)

bios = []
old_bios = []

for page in range(10):
    print(f'page: {page} parsing...', end=" ")
    new_bios = wp.get_all_bios(titles,'hero')
    if new_bios == old_bios:
        break
    bios += new_bios
    if wp.cmc != -1:
        titles = wp.continue_category(cat)
    old_bios = new_bios
    
print(len(bios))
dc_heroes = bios

In [None]:
# Show both counts; a notebook cell only displays its last expression.
len(dc_heroes), len(wp.ambiguation)

In [None]:
wp_v = WikiParser()
cat = "Category:DC Comics supervillains"
titles = wp_v.get_category(cat)

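bios = []
old_bios = []

for page in range(8):
    print(f'page: {page} parsing...', end=" ")
    new_bios = wp_v.get_all_bios(titles,'villain')
    # Stop once a batch repeats, matching the other collection loops;
    # otherwise the last batch would be appended again on every pass.
    if new_bios == old_bios:
        break
    bios += new_bios
    if wp_v.cmc != -1:
        titles = wp_v.continue_category(cat)
    old_bios = new_bios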
    
print(len(bios))
dc_villains = bios