In [1]:
import urllib.request
import json
import pandas as pd
import re
import requests
import os

In [2]:
# create directories used to save data
os.makedirs("data", exist_ok=True) 
os.makedirs("data/characters", exist_ok=True) 
os.makedirs("data/thumbnails", exist_ok=True)  

## Dunderpedia: The Office Wiki

Using the Dunderpedia API, we first extract all the character names.

This data is located in the Characters category: https://theoffice.fandom.com/wiki/Category:Characters. To extract the members of this category, we make a request to the API base URL https://theoffice.fandom.com/api.php, to which we append the following parameters:
- `action=query`: fetches information from the Dunderpedia wiki.
- `cmtitle=Category:Characters`: the `cmtitle` parameter specifies which category we want to enumerate.
- `list=categorymembers&cmlimit=max`: `list=categorymembers` lists the pages in the category, and setting the `cmlimit` parameter to its maximum returns as many of them as a single request allows. Otherwise, not all characters would be included in the result.

In [3]:
baseurl = "https://theoffice.fandom.com/api.php?" # the baseurl to build the API query to the character pages
action = "action=query"

# we create the title for the page we want to extract the content from
title = "cmtitle=Category:Characters"

# add the parameter list=categorymembers to the query's content
content = "list=categorymembers&cmlimit=max"
dataformat ="format=json" # we want the return of the query to be json for future use when we extract the contents

# construct the query for the title
category_query = f"{baseurl}{action}&{content}&{title}&{dataformat}"
print(f"For title: {title.split('=')[1]}, the query is: {category_query}")

For title: Category:Characters, the query is: https://theoffice.fandom.com/api.php?action=query&list=categorymembers&cmlimit=max&cmtitle=Category:Characters&format=json
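
The query string above is assembled by hand, which works as long as no parameter value contains characters such as `&` or spaces. As a minimal sketch of an alternative, the standard library's `urllib.parse.urlencode` can percent-encode the values for us. The helper name `build_query` and the constant `API_URL` are our own, introduced only for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_URL = "https://theoffice.fandom.com/api.php"  # same base URL as above, without the trailing "?"

def build_query(**params):
    """Return the API URL with the given parameters percent-encoded."""
    return f"{API_URL}?{urlencode(params)}"

category_url = build_query(
    action="query",
    list="categorymembers",
    cmlimit="max",
    cmtitle="Category:Characters",
    format="json",
)
print(category_url)

# Decoding the query string gives back the original values
print(parse_qs(urlsplit(category_url).query)["cmtitle"])
```

Note that `urlencode` escapes the colon in `Category:Characters`, so the printed URL differs textually from the hand-built one while carrying the same parameters.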


Once the category query has been built, we make the actual request and extract the title and id of each character page from the `categorymembers` section of the response. These are then used to build the queries that fetch the actual information about the characters: 

In [4]:
response = urllib.request.urlopen(category_query) # make the request to the query
data_response = response.read().decode('utf-8') # read the response and decode it
data_json = json.loads(data_response)['query']['categorymembers'] # convert the response to json format

page_ids = []
titles = []

for category in data_json:
    # we keep a list with the page indexes and titles to be used when extracting the content for the category pages
    page_ids.append(category['pageid'])
    titles.append(category['title'].replace(' ', '_'))
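
The MediaWiki API paginates `categorymembers`: when a category has more members than one response can hold, the response carries a `continue` object whose values must be sent along with the next request. The Characters category appears to fit into a single response here, but a hedged sketch of the continuation loop could look as follows. `fetch_json` is a hypothetical callable (parameters in, decoded JSON out), and `fake_fetch` stands in for real requests so the example runs offline.

```python
def collect_category_members(fetch_json, category):
    """Gather every member of `category`, following continuation tokens.

    `fetch_json` is a hypothetical callable: it takes a dict of query
    parameters and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "max",
        "format": "json",
    }
    members = []
    while True:
        data = fetch_json(params)
        members.extend(data["query"]["categorymembers"])
        if "continue" not in data:
            return members
        # Merge the continuation tokens into the next request
        params = {**params, **data["continue"]}


# Offline demonstration: two canned responses instead of real requests
def fake_fetch(params):
    if "cmcontinue" not in params:
        return {
            "continue": {"cmcontinue": "page|next", "continue": "-||"},
            "query": {"categorymembers": [{"pageid": 1, "title": "A.J."}]},
        }
    return {"query": {"categorymembers": [{"pageid": 2, "title": "Abby"}]}}

members = collect_category_members(fake_fetch, "Category:Characters")
print([m["title"] for m in members])
```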

The above query returns not only the character names but also the categories the characters are divided into. Looking at the data, we identified 5 main categories and 18 subcategories. We extract them and separate them from the character names. These categories will be used later in the analysis. 


In [5]:
main_categories = list(map(lambda title: title.casefold(), titles[:5]))
sub_categories = list(map(lambda title: title.casefold().replace('category:', ''), titles[-18:]))
characters = titles[5:-18]

print(f"Main categories: {main_categories}\n")
print(f"Sub categories: {sub_categories}")

Main categories: ['background_employees', 'clients_of_dunder_mifflin', 'main_characters', 'mentioned_characters', 'voiced_characters']

Sub categories: ['actors_of_the_3rd_floor', 'actors_of_threat_level_midnight', 'animals', 'characters_of_dwight_schrute', 'characters_of_michael_scott', 'deceased_characters', 'dunder_mifflin_employees', 'family_members', 'females', 'former_employees', 'it_guys', 'main_characters', 'males', 'the_3rd_floor_characters', 'threat_level_midnight_characters', 'unnamed', 'unseen_characters', 'warehouse_worker']


Now that we have the correct titles and ids for each of the character pages, we can make the API queries to extract the content of each page. This content will later be used to extract each character's information and hyperlinks, which we need for the analysis and the network construction.

To do this, we again use the base URL `https://theoffice.fandom.com/api.php?` together with the title of each character to build one API query per character. We then make the request for each query and, for each character, save the resulting page content in a txt file.

In [6]:
content = "prop=revisions&rvprop=content&rvslots=*"

queries = []
page_ids = page_ids[5:-18]
for i, page_id in enumerate(page_ids):
    title = characters[i]
    query = f"{baseurl}{action}&{content}&titles={title}&{dataformat}"
    queries.append(query)
    print(f"For title: {title}, the query is: {query}")

For title: A.J., the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=A.J.&format=json
For title: Abby, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=Abby&format=json
For title: Alan, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=Alan&format=json
For title: Alan_Brand, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=Alan_Brand&format=json
For title: Alex, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=Alex&format=json
For title: Alice, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=Alice&format=json
For title: Amy, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslot

Next, we download each character page following the same strategy we used for the category page above.

#### Regular expression: Find character categories<a class="anchor" id="regex_categories"></a>

For each character, we extracted their Main Role and Secondary Roles. In each of the character pages, the roles are stored in the following way: `[[Category:Family members]]`. To extract them, we used the following regular expression: `(?:\[\[Category\:)(.*?)\]\]`: 

 - `(?:\[\[Category\:)`: creates a non-capturing group matching the literal text `[[Category:`. The content of this non-capturing group is not included in the match result.
 - `(.*?)\]\]`: creates a capturing group matching the shortest run of characters that comes after `[[Category:` and is followed by `]]`. This is the information that we are interested in. 
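
To make the behaviour of this expression concrete, here is a small check on a hypothetical wikitext snippet (the sample text is made up for illustration and is not taken from a real page):

```python
import re

CATEGORY_PATTERN = r'(?:\[\[Category\:)(.*?)\]\]'

# Hypothetical wikitext, invented for this example
sample = "Some biography text.\n[[Category:Family members]]\n[[Category:Males]]"

found = re.findall(CATEGORY_PATTERN, sample)
print(found)  # ['Family members', 'Males']
```

Only the text inside the capturing group is returned; the `[[Category:` prefix and the closing `]]` are matched but dropped.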


In [7]:
def get_character_roles(content):
    roles = {'Main Role': '', 'Secondary Roles': []}
    pattern = r'(?:\[\[Category\:)(.*?)\]\]'
    matches = re.findall(pattern, content)
    
    for match in matches:
        if match.casefold().replace(' ', '_') in main_categories: # the regex match is part of the main_categories
            roles['Main Role'] = match
        elif match.casefold().replace(' ', '_') in sub_categories: # the regex match is part of the sub_categories
            roles['Secondary Roles'].append(match)
    return roles

In [8]:
contents = {'Name': [], 'Main Role': [], 'Secondary Roles': []} # dictionary storing the characters' contents - will later be used to build the dataframe

for i, query in enumerate(queries):
    # make the actual requests using the urllib library for each of the previously initialized queries
    response = urllib.request.urlopen(query) # make the request to the query
    data_response = response.read().decode('utf-8') # read the response and decode it

    data_json = json.loads(data_response) # convert the response to json format
    name = data_json['query']['pages'][str(page_ids[i])]['title']
   
    contents['Name'].append(name) # add character name
    
    # the content that we want to find in the initial page is located at query/pages/'index'/revisions[0]/slots/main/*
    content = data_json['query']['pages'][str(page_ids[i])]['revisions'][0]['slots']['main']['*']
    
    roles = get_character_roles(content) # add character main and secondary roles
    if 'Main Role' in roles:
        contents['Main Role'].append(roles['Main Role'])
    if 'Secondary Roles' in roles:
        contents['Secondary Roles'].append(roles['Secondary Roles'])
    with open(f'./data/characters/{name}.txt', 'w+') as file:
        file.write(content)
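
The loop above looks up each page by its id in the response. Since every query requests a single title, the response holds a single page, so a possible alternative is to take the first (and only) value of `pages`. The sketch below illustrates this on a hypothetical, heavily shortened response; `extract_page` is a helper name of our own.

```python
import json

# Hypothetical response in the shape described above (page id and text are made up)
raw = json.dumps({
    "query": {"pages": {"1234": {
        "pageid": 1234,
        "title": "Abby",
        "revisions": [{"slots": {"main": {"*": "[[Category:Females]]"}}}],
    }}}
})

def extract_page(response_text):
    """Return (title, wikitext) for the single page in an API response."""
    pages = json.loads(response_text)["query"]["pages"]
    page = next(iter(pages.values()))  # one title requested, so one page returned
    wikitext = page["revisions"][0]["slots"]["main"]["*"]
    return page["title"], wikitext

title, wikitext = extract_page(raw)
print(title, wikitext)
```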

#### Extract characters branches (workplace)

Next we extract the different branches of Dunder Mifflin to see in which branch and department each character works, so that we can compute some basic statistics that will help us interpret the analysis.

Using the following page inside the Dunderpedia: https://theoffice.fandom.com/wiki/Branch_(disambiguation), we can see that there are 8 branches. For each branch, we make a request and extract the characters belonging to that branch. Then we add this information to the character dataframe.

We construct the queries in the same way as before. Since there are only a few pages to query (the cell below covers six of them), we manually extracted the page ids and titles for each of the pages:

In [9]:
content = "prop=revisions&rvprop=content&rvslots=*"

branch_queries = []
page_ids_branches = [1657, 2011, 2092, 1765, 9709, 2095]
titles_branches = ['Dunder_Mifflin_Scranton', 'Dunder_Mifflin_Albany', 
                   'Dunder_Mifflin_Nashua', 'Dunder_Mifflin_Corporate_Office', 
                   'Dunder_Mifflin_Syracuse', 'Dunder_Mifflin_Utica']
for title in titles_branches:
    query = f"{baseurl}{action}&{content}&titles={title}&{dataformat}"
    branch_queries.append(query)
    print(f"For title: {title}, the query is: {query}")

For title: Dunder_Mifflin_Scranton, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=Dunder_Mifflin_Scranton&format=json
For title: Dunder_Mifflin_Albany, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=Dunder_Mifflin_Albany&format=json
For title: Dunder_Mifflin_Nashua, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=Dunder_Mifflin_Nashua&format=json
For title: Dunder_Mifflin_Corporate_Office, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=Dunder_Mifflin_Corporate_Office&format=json
For title: Dunder_Mifflin_Syracuse, the query is: https://theoffice.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=*&titles=Dunder_Mifflin_Syracuse&format=json
For title: Dunder_Mifflin_Utica, the query is: https://theoffice.fandom.com/a

Next we make the actual requests and extract the branch and department for the characters listed in the above pages. Note that an employee can have more than one branch and more than one department, as these change over the course of the series.

#### Regular expression: Find character work position<a class="anchor" id="regex_position"></a>


For each character, we also extracted their work positions, e.g. secretary, manager, salesman. To extract them, we used the following regular expression: `\[\[(.*?)(?:\|.*?)?\]\](?:.*?)(?:\- )(.*?)(?:\;|\,|\n|\<small\>)`: 
 - `\[\[`: matches the opening double square brackets `[[` of every link following this pattern.
 - `(.*?)`: creates a capturing group matching the shortest string containing any character except a new line. This returns the character name. 
 - `(?:\|.*?)?\]\]`: creates a non-capturing group (marked by `?:` at the start) that is not included in the result. The group matches a `|` followed by the link's display text, and the `?` after the group makes it optional, so it matches zero or one time. The link must then end with `]]`.
 - `(?:.*?)(?:\- )`: two non-capturing groups that match everything from `]]` up to and including `- `. 
 - `(.*?)`: creates a second capturing group matching the shortest string containing any character except a new line. This returns the role of the character.
 - `(?:\;|\,|\n|\<small\>)`: a non-capturing group that ends the match as soon as one of these four tokens appears: `;`, `,`, `\n`, `<small>`. 
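
As a quick illustration of how the two capturing groups behave, the sketch below applies the expression to two hypothetical list lines. The sample lines are invented for this example and only mimic the layout of a branch page.

```python
import re

POSITION_PATTERN = r'\[\[(.*?)(?:\|.*?)?\]\](?:.*?)(?:\- )(.*?)(?:\;|\,|\n|\<small\>)'

# Hypothetical lines in the style of a branch page's employee list
sample = (
    "*[[Michael Scott]] - Regional Manager\n"
    "*[[Pam Beesly|Pam]] - Receptionist, later saleswoman\n"
)

pairs = re.findall(POSITION_PATTERN, sample)
print(pairs)  # [('Michael Scott', 'Regional Manager'), ('Pam Beesly', 'Receptionist')]
```

The second line shows two properties of the pattern: the display text after `|` is discarded, and the role is cut at the first `,`, so anything after it is not captured.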

In [10]:
def get_chraracter_position(title, content):
    pattern = r'\[\[(.*?)(?:\|.*?)?\]\](?:.*?)(?:\- )(.*?)(?:\;|\,|\n|\<small\>)'
    matches = re.findall(pattern, content)
    for match in matches:
        character = match[0]
        role = match[1]
        try:
            index = contents['Name'].index(character)
            contents['Position'][index].append(role)
            contents['Branch'][index].append(title.replace('_', ' '))
        except ValueError: # character not present in the list of saved characters
            continue

In [11]:
contents['Position'] = [[] for i in range(len(contents['Name']))]
contents['Branch'] = [[] for i in range(len(contents['Name']))]

for i, query in enumerate(branch_queries):
    # make the actual requests using the urllib library for each of the previously initialized queries
    response = urllib.request.urlopen(query) # make the request to the query
    data_response = response.read().decode('utf-8') # read the response and decode it

    data_json = json.loads(data_response) # convert the response to json format
    name = data_json['query']['pages'][str(page_ids_branches[i])]['title']
       
    # the content that we want to find in the initial page is located at query/pages/'index'/revisions[0]/slots/main/*
    content = data_json['query']['pages'][str(page_ids_branches[i])]['revisions'][0]['slots']['main']['*']
    
    get_chraracter_position(titles_branches[i], content) # add character position and branch to contents dictionary

#### Extract character links

We have now extracted the description of each character. Each page corresponds to a character, which is a node in our network. Next, we take the pages we downloaded for each character and find all the hyperlinks in a character's page that point to another node of the network (i.e. another character). For this we use regular expressions.

#### Regular expression: Extract character links<a class="anchor" id="regex_links"></a>

For each character, we extract the pages that the character's page links to. Links are written in the following way: ``[[Angela Martin]]``. This represents a link to the character Angela Martin. However, there are also links to other targets such as Wikipedia, written in the following way: `[[Wikipedia:Shunning|shunning]]`; these should be removed. To extract the link targets, we created the following regular expression: `\[\[(.*?)(?:\|.*?)?\]\]`

 - `\[\[`: matches the opening double square brackets `[[` of every link following this pattern.
 - `(.*?)`: creates a capturing group matching the shortest string containing any character except a new line. This returns the link target, e.g. the character name. 
 - `(?:\|.*?)?\]\]`: creates a non-capturing group (marked by `?:` at the start) that is not included in the result. The group matches a `|` followed by the link's display text, and the `?` after the group makes it optional, so it matches zero or one time. The link must then end with `]]`. This drops the display text after a `|`, so only the link target is kept; targets that are not characters (such as the Wikipedia links) are filtered out afterwards in `get_links` by checking them against the list of character names. 
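
The following sketch shows the two stages on a hypothetical sentence: the regular expression returns every link target, and a membership check against the known character names keeps only links to characters. The sample sentence and the set `known_characters` are made up for illustration.

```python
import re

LINK_PATTERN = r'\[\[(.*?)(?:\|.*?)?\]\]'

# Hypothetical inputs, invented for this example
known_characters = {"Angela Martin", "Dwight Schrute"}
sample = ("[[Angela Martin]] and [[Dwight Schrute|Dwight]] practise "
          "[[Wikipedia:Shunning|shunning]].")

targets = re.findall(LINK_PATTERN, sample)
character_links = [t for t in targets if t in known_characters]

print(targets)          # ['Angela Martin', 'Dwight Schrute', 'Wikipedia:Shunning']
print(character_links)  # ['Angela Martin', 'Dwight Schrute']
```

Using a set for the known names keeps each membership check cheap, which matters when the check runs once per link on every page.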

In [12]:
def get_links(content):
    pattern1_matches = re.findall(r'\[\[(.*?)(?:\|.*?)?\]\]', content)
        
    # we need to check if the found matches are characters (i.e. they are present in the extracted df)
    pattern1_matches = list(filter(lambda x: x in list(contents['Name']), pattern1_matches))
    return pattern1_matches

In [13]:
links = []

for name in contents['Name']:
    with open(f'./data/characters/{name}.txt') as file:
        content = file.read()
        # use get_links to extract the links using a regular expression
        links.append(get_links(content))

#### Constructing the Dunderpedia Dataframe with Pandas

Next we create a dataframe with the character names, main and secondary roles, positions, branches and links. The data is saved as csv at the end of this notebook.

In [14]:
df = pd.DataFrame.from_dict(contents)
df.insert(5, 'Links', links) # add the extracted links as a new column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298 entries, 0 to 297
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Name             298 non-null    object
 1   Main Role        298 non-null    object
 2   Secondary Roles  298 non-null    object
 3   Position         298 non-null    object
 4   Branch           298 non-null    object
 5   Links            298 non-null    object
dtypes: object(6)
memory usage: 14.1+ KB


In [15]:
df.head() # display the new dataframe

Unnamed: 0,Name,Main Role,Secondary Roles,Position,Branch,Links
0,A.J.,,[Males],[A salesman and ex-boyfriend of [[Holly Flax|H...,[Dunder Mifflin Nashua],"[Holly Flax, Holly Flax, Michael Scott, Pam Be..."
1,Abby,,"[Females, Family members]",[],[],"[Stacy, Pam Beesly]"
2,Alan,,[],[],[],"[Kelly Kapoor, Ryan Howard, Pam Beesly]"
3,Alan Brand,,"[Dunder Mifflin employees, Former employees, M...",[CEO ],[Dunder Mifflin Corporate Office],"[Jan Levinson, Michael Scott, Oscar Martinez, ..."
4,Alex,,[Males],[],[],"[Pam Beesly, Alex, Karen Filippelli]"


#### Extract characters thumbnails

Next we extract pictures for each character so that we can use them for the network nodes

In [16]:
content = "prop=images&rvprop=content&rvslots=*"

img_queries = []

for i, page_id in enumerate(page_ids):
    title = characters[i]
    query = f"{baseurl}{action}&{content}&titles={title}&{dataformat}"
    img_queries.append(query)
    print(f"For title: {title}, the query is: {query}")

For title: A.J., the query is: https://theoffice.fandom.com/api.php?action=query&prop=images&rvprop=content&rvslots=*&titles=A.J.&format=json
For title: Abby, the query is: https://theoffice.fandom.com/api.php?action=query&prop=images&rvprop=content&rvslots=*&titles=Abby&format=json
For title: Alan, the query is: https://theoffice.fandom.com/api.php?action=query&prop=images&rvprop=content&rvslots=*&titles=Alan&format=json
For title: Alan_Brand, the query is: https://theoffice.fandom.com/api.php?action=query&prop=images&rvprop=content&rvslots=*&titles=Alan_Brand&format=json
For title: Alex, the query is: https://theoffice.fandom.com/api.php?action=query&prop=images&rvprop=content&rvslots=*&titles=Alex&format=json
For title: Alice, the query is: https://theoffice.fandom.com/api.php?action=query&prop=images&rvprop=content&rvslots=*&titles=Alice&format=json
For title: Amy, the query is: https://theoffice.fandom.com/api.php?action=query&prop=images&rvprop=content&rvslots=*&titles=Amy&format

In [17]:
images = {}
errors = []
for i, query in enumerate(img_queries):
    # make the actual requests using the urllib library for each of the previously initialized queries
    response = urllib.request.urlopen(query) # make the request to the query
    data_response = response.read().decode('utf-8') # read the response and decode it

    data_json = json.loads(data_response) # convert the response to json format
    name = content = data_json['query']['pages'][str(page_ids[i])]['title']
    
    try:
        # the image titles are located at query/pages/'index'/images[n]/title
        img_index = 0
        img_title = data_json['query']['pages'][str(page_ids[i])]['images'][img_index]['title']
        # skip gif images: advance the index until a non-gif title is found
        # (the IndexError raised when no such image exists is caught below)
        while 'gif' in img_title:
            img_index += 1
            img_title = data_json['query']['pages'][str(page_ids[i])]['images'][img_index]['title']

        content = 'prop=pageimages'
        query_img = f"{baseurl}{action}&{content}&titles={img_title.replace(' ', '_').replace('&', '%26')}&{dataformat}"
        response = urllib.request.urlopen(query_img) # make the request to the query
        img_response = response.read() # read the raw response bytes (json.loads accepts bytes)
        data_json = json.loads(img_response) # convert the response to json format
        pages = data_json['query']['pages']
        
        page = list(pages.values())[0]
        img_url = page['thumbnail']['source']
        img_data = requests.get(img_url).content
        
        extension = img_title.split('.')[-1] if 'gif' not in img_title.split('.')[-1] else 'png'
        path =  "./data/thumbnails/" + characters[i] + "." + extension
        images[characters[i]] = characters[i] + "." + extension

        with open(path, 'wb') as file:
            file.write(img_data)
    except Exception as e:
        errors.append(characters[i].replace('_', ' '))
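
The image selection step can also be written as a small helper that walks the list of images once and returns the first one that is not a gif. This is only a sketch under our own assumptions: the name `first_non_gif` is hypothetical, and unlike the cell above it checks the file extension rather than looking for the substring `gif` anywhere in the title.

```python
def first_non_gif(images):
    """Return the title of the first image that is not a .gif, or None."""
    for image in images:
        if not image["title"].lower().endswith(".gif"):
            return image["title"]
    return None

# Hypothetical entries in the shape of the API's `images` list
sample_images = [{"title": "File:Dance.gif"}, {"title": "File:Abby.png"}]
print(first_non_gif(sample_images))  # File:Abby.png
```

Returning `None` instead of raising lets the caller decide explicitly how to record characters without a usable image.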

Clean up the dataset by removing all characters that do not have a thumbnail, since they should not be included in the network.

In [18]:
df = df[~df['Name'].isin(errors)]

In [19]:
df.insert(6, 'ImagePath', images.values()) # add the thumbnail file names to the dataframe

In [20]:
df.head() # display the new dataframe with the new ImagePath column

Unnamed: 0,Name,Main Role,Secondary Roles,Position,Branch,Links,ImagePath
0,A.J.,,[Males],[A salesman and ex-boyfriend of [[Holly Flax|H...,[Dunder Mifflin Nashua],"[Holly Flax, Holly Flax, Michael Scott, Pam Be...",A.J..jpg
1,Abby,,"[Females, Family members]",[],[],"[Stacy, Pam Beesly]",Abby.png
2,Alan,,[],[],[],"[Kelly Kapoor, Ryan Howard, Pam Beesly]",Alan.jpg
3,Alan Brand,,"[Dunder Mifflin employees, Former employees, M...",[CEO ],[Dunder Mifflin Corporate Office],"[Jan Levinson, Michael Scott, Oscar Martinez, ...",Alan_Brand.png
4,Alex,,[Males],[],[],"[Pam Beesly, Alex, Karen Filippelli]",Alex.jpg


#### Save data to csv (to be used in the main notebook)

In [21]:
# use the built-in pandas to_csv function to save our data to a csv file called dunderpedia_characters.csv
df.to_csv('data/dunderpedia_characters.csv', index=False)
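
One thing to keep in mind when loading this file in the main notebook: a csv cell can only hold text, so the list-valued columns (`Secondary Roles`, `Position`, `Branch`, `Links`) are written as the string representation of each list and come back as plain strings. The sketch below illustrates the round trip with the standard library only (it is an illustration and does not use pandas); `ast.literal_eval` turns such a string back into a list. With pandas, applying the same function to the affected columns after reading the file should have the same effect.

```python
import ast
import csv
import io

# Write one row whose "Links" cell holds the string form of a list
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "Links"])
writer.writerow(["Abby", str(["Stacy", "Pam Beesly"])])

# Read it back: the cell is a plain string, not a list
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_links = rows[0]["Links"]

# Safely evaluate the literal to recover the list
parsed_links = ast.literal_eval(raw_links)
print(type(raw_links).__name__, type(parsed_links).__name__)  # str list
```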