This notebook contains data prep and analysis for the Wikipedia page graph.

The page graph is a directed graph in which each node represents a page and each edge represents a hyperlink from one page to another.
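
As a minimal sketch of that structure, a directed page graph can be held as adjacency lists keyed by page. The link list below is toy data made up for illustration, and the names `links`, `out_links` and `in_links` are assumptions rather than anything defined in `wiki_analysis_utils`.

```python
from collections import defaultdict

# Hypothetical toy hyperlinks: (source page, target page).
links = [
    ("Coriander", "Apiaceae"),
    ("Coriander", "Spice"),
    ("Spice", "Coriander"),
]

out_links = defaultdict(set)  # page -> pages it links to
in_links = defaultdict(set)   # page -> pages that link to it
for src, dst in links:
    out_links[src].add(dst)
    in_links[dst].add(src)

# Direction matters: Coriander links to Apiaceae, but not the other way round.
print(sorted(out_links["Coriander"]))
print(sorted(in_links["Coriander"]))
```

Keeping both directions makes it cheap to ask either "what does this page link to" or "what links here".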

In [50]:
import pandas as pd
import time, json, os, random, traceback, pyperclip, importlib, queue
import matplotlib.pyplot as plt
import seaborn as sns
import src.wiki_analysis_utils as wiki_analysis_utils

_ = importlib.reload(wiki_analysis_utils)

In [4]:
data_root_dir = r'C:\Users\mohitvyas\MyDesktop\WikipediaDataset\data\\'

Prepare a small subgraph of articles within a particular topic for experimentation

0. Identify a seed page (e.g. https://en.wikipedia.org/wiki/Coriander)
1. Start with all categories tagged on it
2. ~~Do a BFS on the undirected version of the category graph, starting from the above categories, to obtain the category set of interest.~~ (This approach reaches ~80% of all categories, which is unhelpful. The goal is a small subgraph that is very similar to the seed page, something like "the set of all wiki articles about plant species like coriander".)
    - ~~Alternative approach 1: do a BFS on the parent directed graph to obtain all ancestors, then do a BFS on the child directed graph to obtain all descendants of these ancestors.~~ (This also doesn't work.)
    - Alternative approach 2: same as approach 1, with these constraints:
        - don't include back edges (to avoid cycles)
        - do a BFS from the seed categories and keep only ancestors <= k hops away from them
        - only include ancestors with <= N descendant categories
3. Get all articles that have at least 1 category from the category set of interest.
4. Keep only edges that are between articles in the above set to obtain the final dataset.
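
The constrained ancestor search and the final filtering (steps 3 and 4) can be sketched on toy data as below. This is an illustration under assumptions, not the code used later in this notebook: the function names, the `max_hops` (k) and `max_descendants` (N) parameters, and the toy ids are all hypothetical. The sketch also prunes the search at an oversized ancestor instead of traversing through it, and it skips the descendant expansion of step 2 for brevity.

```python
from collections import deque


def bounded_ancestors(seeds, parents, edge_type, num_descendants,
                      max_hops, max_descendants):
    """BFS over parent edges; returns {category: hops from nearest seed}."""
    distance = {seed: 0 for seed in seeds}
    frontier = deque(seeds)
    while frontier:
        node = frontier.popleft()
        if distance[node] >= max_hops:
            continue  # constraint: at most max_hops (k) hops from a seed
        for parent in parents.get(node, ()):
            if parent in distance:
                continue
            if edge_type.get((parent, node)) == 'BackEdge':
                continue  # constraint: skip back edges to avoid cycles
            # Assumption: a category with no recorded count is treated as size 0.
            if num_descendants.get(parent, 0) > max_descendants:
                continue  # constraint: drop overly broad ancestors (N)
            distance[parent] = distance[node] + 1
            frontier.append(parent)
    return distance


def induced_subgraph(article_categories, page_edges, category_set):
    """Steps 3 and 4: keep tagged articles and the edges among them."""
    articles = {article for article, cats in article_categories.items()
                if category_set.intersection(cats)}
    edges = [(src, dst) for src, dst in page_edges
             if src in articles and dst in articles]
    return articles, edges


# Toy data (hypothetical ids, not taken from the dataset).
parents = {1: [2], 2: [3, 4], 3: [5]}
edge_type = {(4, 2): 'BackEdge'}
num_descendants = {2: 28, 3: 11, 4: 3, 5: 2258}

ancestors = bounded_ancestors([1], parents, edge_type, num_descendants,
                              max_hops=3, max_descendants=100)
articles, edges = induced_subgraph(
    {'p1': [1], 'p2': [3], 'p3': [9]},
    [('p1', 'p2'), ('p1', 'p3'), ('p3', 'p2')],
    set(ancestors),
)
print(ancestors)
print(sorted(articles), edges)
```

In the toy run, category 4 is dropped because it is only reachable through a back edge, and category 5 is dropped because its 2258 descendants exceed the limit of 100.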

In [5]:
target_article_name = 'Coriander'
target_dir = data_root_dir + 'PageSubGraphs\\' + \
    wiki_analysis_utils.normalized_page_name(target_article_name).replace(' ', '_') + '\\'
os.makedirs(target_dir, exist_ok=True)

In [14]:
category_stats = pd.read_csv(data_root_dir + 'category_id_to_stats.tsv', sep='\t')

In [6]:
page_name_to_page_id, page_id_to_page_name, error_counts = \
    wiki_analysis_utils.load_page_name_to_id_map(data_root_dir, silent=True)

In [7]:
# load category name to id mappings from the category pages
categories, failure_counts = wiki_analysis_utils\
    .load_category_name_to_id_map(data_root_dir, silent=True)

In [8]:
target_page_ids = set([page_name_to_page_id[wiki_analysis_utils.normalized_page_name(target_article_name)]])
seed_categories = []
silent = True
start_time = time.time()
for partition in range(10):
    with open(data_root_dir + f'category_pages/part-{partition}.txt', 'r') as f:
        for line in f:
            if not line.strip(): continue  # lines keep their trailing newline, so compare after stripping
            data = json.loads(line)
            if 'category_id' in data:
                for _, page_id in data['articles']:
                    if page_id in target_page_ids:
                        seed_categories.append(data['category_id'])
                        break
    if not silent:
        print(f"Processed till part {partition} in {(time.time() - start_time) / 60} minutes")

In [9]:
seed_category_names = [categories['id_to_name'][cat_id] for cat_id in seed_categories]
print (f"Found {len(seed_categories)} seed categories")
print (json.dumps(seed_category_names, indent=4))

Found 8 seed categories
[
    "plants described in 1753",
    "spices",
    "indian spices",
    "edible apiaceae",
    "herbs",
    "medicinal plants",
    "plants used in native american cuisine",
    "apioideae"
]


In [10]:
parent_graph_adj_lists, child_graph_adj_lists = wiki_analysis_utils\
    .load_category_graph(data_root_dir, silent=True)

Loaded 500000 edges in 0.5993496974309286 minutes
Loaded 1000000 edges in 1.16529008547465 minutes
Loaded 1500000 edges in 1.7171114603678386 minutes
Loaded 2000000 edges in 2.1856919646263124 minutes
Loaded 2500000 edges in 2.6906119704246523 minutes
Loaded 3000000 edges in 3.1646892468134564 minutes
Loaded 3500000 edges in 3.6239789764086408 minutes
Loaded 4000000 edges in 4.230975333849589 minutes
Loaded 4500000 edges in 4.71725279490153 minutes


In [13]:
edge_to_type = {}
with open(data_root_dir + 'category_edges_dfs_classification.tsv', 'r') as f:
    i = 0
    for line in f:
        i+=1
        if i==1: continue
        data = line.strip().split('\t')
        edge_to_type[(int(data[0]), int(data[1]))] = data[2]

In [22]:
category_to_num_descendants = category_stats\
    .set_index('CategoryId')['NumDescendantCategories'].to_dict()

In [15]:
# BFS on the parent graph to find all the ancestors of the seed categories
# avoid back edges to work with a cycle-less graph
print (f"Starting with {len(seed_categories)} seed categories")
ancestor_to_distance = {}
nodes_to_visit = queue.Queue()
for seed_category in seed_categories:
    ancestor_to_distance[seed_category] = (0, seed_category)
    nodes_to_visit.put(seed_category)
while not nodes_to_visit.empty():
    node = nodes_to_visit.get()
    if node in parent_graph_adj_lists:
        for parent in parent_graph_adj_lists[node]:
            if parent not in ancestor_to_distance and edge_to_type[(parent, node)] != 'BackEdge':
                ancestor_to_distance[parent] = (ancestor_to_distance[node][0] + 1, node)
                nodes_to_visit.put(parent)
print(f"Found {len(ancestor_to_distance)} ancestor categories")

Starting with 8 seed categories
Found 3409 ancestor categories


In [25]:
# print the ancestors grouped by distance from the seed categories
ancestor_to_distance_sorted = sorted(ancestor_to_distance.items(), key=lambda x: x[1][0])
print (max([x[1][0] for x in ancestor_to_distance_sorted]), "\n\n")  # maximum distance found
for distance in range(7): # range(max([x[1][0] for x in ancestor_to_distance_sorted]) + 1):
    print(f"Distance {distance}:")
    for ancestor, (d, _) in ancestor_to_distance_sorted:
        if d==distance:
            print(f"{categories['id_to_name'][ancestor]} ({category_to_num_descendants[ancestor]}), ", end='')
    print("\n\n---------------------------\n\n")

107 


Distance 0:
plants described in 1753 (1), spices (16), indian spices (1), edible apiaceae (3), herbs (8), medicinal plants (10), plants used in native american cuisine (5), apioideae (28), 

---------------------------


Distance 1:
herbs and spices (28), plant products (5), indian cuisine (74), apiaceae (37), edible plants (335), medicinal herbs and fungi (14), plants and humans (3), native american ethnobotany (1), native american cuisine (7), flora of the united states (16), 

---------------------------


Distance 2:
food ingredients (11), natural materials (2258), natural products (3), products by type (6), asian cuisine by country (1), culture of india (22677), food and drink in india (19), cuisine by country (11), south asian cuisine (2), desi cuisine (30), apiales (72), agronomy (31), vegan cuisine (3), plant crops (339), biologically based therapies (60), herbalism (2), drugs by structure (3), pharmacognosy (13), plants (3449), humans and other species (3), ethnobotany 

In [24]:
"""
Observations:
- a lot of ancestors are reachable, and they become generic and irrelevant very rapidly
- ancestors that are big categories are not very related to the initial seed categories
- ancestors that are not too big but are far away are also likely to be irrelevant

One approach would be to apply very aggressive thresholds on both properties (category size and
distance from seed categories), but that may reduce recall significantly.

Alternate approach:
Slowly expand a cluster. Start with the seed articles and fetch
similar articles. Expand the cluster slowly and define the distance of a new article from the
cluster as the average or Nth percentile of its distances to all articles in the cluster, to ensure
that the cluster stays compact. This will be very slow.
"""
_=1
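
The alternate approach in the note above could look roughly like the following greedy loop. It is only a sketch: `expand_cluster`, `max_distance` and the `distance` callable are hypothetical, and the note does not say how similarity between two articles would be measured, so the toy run just uses the absolute difference between numbers as a stand-in.

```python
from statistics import mean


def expand_cluster(seeds, candidates, distance, max_distance, aggregate=mean):
    """Greedily grow a cluster, admitting one candidate per iteration."""
    cluster = list(seeds)
    remaining = set(candidates) - set(cluster)
    while remaining:
        # Distance of a candidate to the cluster = aggregate (mean by default)
        # of its distances to every article already in the cluster.
        scored = {c: aggregate([distance(c, member) for member in cluster])
                  for c in remaining}
        best = min(scored, key=scored.get)
        if scored[best] > max_distance:
            break  # admitting anything else would make the cluster less compact
        cluster.append(best)
        remaining.remove(best)
    return cluster


# Toy run: "articles" are numbers and distance is their absolute difference.
result = expand_cluster(
    seeds=[0.0],
    candidates=[1.0, 2.0, 10.0],
    distance=lambda a, b: abs(a - b),
    max_distance=2.0,
)
print(result)
```

Every iteration rescans all remaining candidates against all cluster members, which is why this is expected to be very slow on the full article set. A percentile-based aggregate can be passed through the `aggregate` parameter instead of the mean.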

In [40]:
# rank order candidate categories based on 2 features:
# 1. distance from seed categories
# 2. size of the category
# pick weights manually

candidate_categories = []
for ancestor, (d, _) in ancestor_to_distance_sorted:
    candidate_categories.append((ancestor, categories['id_to_name'][ancestor], d, category_to_num_descendants[ancestor]))
candidate_categories = pd.DataFrame(candidate_categories, columns=['CategoryId', 'CategoryName', 'Distance', 'Size'])

In [43]:
candidate_categories['Score'] = -1 * candidate_categories['Distance'] / 100 - 0.1 * candidate_categories['Size'] / 1000
candidate_categories.sort_values('Score', ascending=False).head(200)

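|     | CategoryId | CategoryName                           | Distance | Size | Score   |
|-----|------------|----------------------------------------|----------|------|---------|
| 0   | 22012331   | plants described in 1753               | 0        | 1    | -0.0001 |
| 2   | 44992786   | indian spices                          | 0        | 1    | -0.0001 |
| 3   | 19950859   | edible apiaceae                        | 0        | 3    | -0.0003 |
| 6   | 45068645   | plants used in native american cuisine | 0        | 5    | -0.0005 |
| 4   | 743395     | herbs                                  | 0        | 8    | -0.0008 |
| ... | ...        | ...                                    | ...      | ...  | ...     |
| 139 | 740309     | gardening                              | 4        | 81   | -0.0481 |
| 128 | 5183917    | structure                              | 4        | 84   | -0.0484 |
| 150 | 74709352   | subfields of chemistry                 | 4        | 87   | -0.0487 |
| 131 | 40455234   | euthenics                              | 4        | 87   | -0.0487 |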
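
To make the hand-picked weights concrete, the score can be recomputed in plain Python for three categories whose distance and size appear in the outputs above. The helper name `score` is only for illustration. With these weights, one extra hop costs as much as 100 extra descendant categories.

```python
def score(distance, size):
    # Same weights as the scoring cell: 0.01 per hop, 0.0001 per descendant category.
    return -1 * distance / 100 - 0.1 * size / 1000


# (name, distance, size) taken from the outputs above.
rows = [
    ("natural materials", 2, 2258),
    ("gardening", 4, 81),
    ("herbs", 0, 8),
]

ranked = sorted(rows, key=lambda row: score(row[1], row[2]), reverse=True)
for name, distance, size in ranked:
    print(f"{name}: {score(distance, size):.4f}")
```

Despite being only 2 hops away, "natural materials" ranks below "gardening" at 4 hops because its 2258 descendants dominate the score.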


In [48]:
expanded_seed_category_set = set(candidate_categories.sort_values('Score', ascending=False).head(200)['CategoryId'].values)
print (len(expanded_seed_category_set))

200


In [49]:
# get full category set by getting all descendants of the expanded seed category set
expanded_category_set = set()
nodes_to_visit = queue.Queue()
for seed_category in expanded_seed_category_set:
    expanded_category_set.add(seed_category)
    nodes_to_visit.put(seed_category)
while not nodes_to_visit.empty():
    node = nodes_to_visit.get()
    if node in child_graph_adj_lists:
        for child in child_graph_adj_lists[node]:
            # remove back edges
            if child not in expanded_category_set and edge_to_type[(node, child)] != 'BackEdge':
                expanded_category_set.add(child)
                nodes_to_visit.put(child)
print(f"Expanded to {len(expanded_category_set)} categories")

Expanded to 1649504 categories
