# Data wrangling and exploration - Wikipedia citation records

* Read bulk TSV files, select and process informative columns. Construct a directed bipartite graph of web pages and cited publications, and analyze the graph properties.

## Data wrangling procedure description

### 1. Raw data collection

The primary source of data for the Wikipedia citation statistics is the following online repository: https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/mwcites-20180301/
The datasets for different languages are stored in separate archives, each containing a tab-separated values (TSV) file.

### 2. Data cleaning

The TSV files were read into a pandas dataframe. The datasets contain 6 features: page_id, page_title, rev_id, timestamp, type, and id. The data is already clean. The only modification concerns the Wikipedia page titled 'NaN' (which describes the concept of not-a-number): pandas read_csv() interprets this title as a missing value (NaN), so it was converted back to the string 'NaN'.
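
The effect is easy to reproduce on a small in-memory sample. The sketch below uses two made-up rows (the page ids and the DOI are illustrative, not taken from the dataset) to show how `read_csv` turns the title `NaN` into a missing value and how `fillna` restores it. The last call shows one possible alternative, `keep_default_na=False`, which switches off NaN parsing for every column and is therefore only suitable when the file has no genuine gaps.

```python
import io
import pandas as pd

# Illustrative sample only: two made-up rows, one for the page titled "NaN".
sample_tsv = (
    "page_id\tpage_title\ttype\tid\n"
    "1\tNaN\tdoi\t10.1000/example.1\n"
    "2\tMu Aquilae\tarxiv\t0708.1752\n"
)

# Default parsing: the title "NaN" is read as a missing value.
raw = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
n_missing = int(raw["page_title"].isna().sum())

# Fix used in this notebook: put the string back.
fixed = raw.assign(page_title=raw["page_title"].fillna("NaN"))

# Alternative: disable the default NaN markers while reading.
# This applies to all columns, not just page_title.
literal = pd.read_csv(io.StringIO(sample_tsv), sep="\t", keep_default_na=False)

print(n_missing)
print(fixed["page_title"].tolist())
print(literal["page_title"].tolist())
```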


### 3. Construction of graph representation of the web pages and publications

Using the networkx library, a directed bipartite graph was constructed to represent the relations between pages and publications. The unique page ids ('page_id') and publication ids ('id') were converted to graph nodes, and each ('page_id', 'id') pair from the dataframe records was converted to an edge pointing from the page to the publication. Each node carries a 'bipartite' attribute that records which partition it belongs to.


### 4. Exploration of graph properties

Degree centrality (DOC) was calculated for both partitions, and the most connected pages and publications were identified. The DOC distribution is most conveniently presented as a log-log plot of DOC against rank. The near-linear relationship between the log of a publication's rank and the log of its DOC suggests a power-law relation between the two.
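
One rough way to put a number on that near-linear trend is to fit a straight line to log(rank) against log(DOC); the slope is then an estimate of the power-law exponent. The helper below, `loglog_slope`, is a hypothetical name introduced for illustration and is not part of the original analysis. A least-squares fit on log-log axes is only a crude estimate, so treat the result as indicative.

```python
import numpy as np

def loglog_slope(values):
    """Least-squares slope and intercept of log10(value) against log10(rank).

    `values` must all be positive. They are sorted in descending order so
    that rank 1 corresponds to the largest value.
    """
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]
    ranks = np.arange(1, ordered.size + 1)
    slope, intercept = np.polyfit(np.log10(ranks), np.log10(ordered), deg=1)
    return slope, intercept

# Synthetic check: values that follow rank**-0.8 exactly.
synthetic = np.arange(1, 1001, dtype=float) ** -0.8
slope, intercept = loglog_slope(synthetic)
print(round(slope, 3), round(intercept, 3))
```

With the ranked lists built in the code section, the same helper could be called as `loglog_slope([c for _, c in pub_rank])`.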



## Data wrangling code

In [1]:
import os
import numpy as np
import pandas as pd
import networkx as nx
%matplotlib inline
import matplotlib.pyplot as plt

The Wikipedia data comes as a series of language-specific TSV files, which were downloaded into a local directory (data/raw).

In [2]:
base_path = '../data/raw'
processed_path = '../data/processed'

In [3]:
# Read the TSV data for English Wikipedia.
# (infer_datetime_format is omitted: it is deprecated in pandas 2.x and has no effect there.)
df = pd.read_csv(os.path.join(base_path, 'enwiki.tsv'), sep='\t', parse_dates=['timestamp'])

# read_csv parses the page title 'NaN' as a missing value; restore it as the string 'NaN'
df.page_title = df.page_title.fillna("NaN")

df.head(5)

Unnamed: 0,page_id,page_title,rev_id,timestamp,type,id
0,2867096,Mu Aquilae,503137751,2012-07-19 16:08:41,doi,10.1051/0004-6361:20078357
1,2867096,Mu Aquilae,508363722,2012-08-20 22:56:21,arxiv,astro-ph/0604502
2,2867096,Mu Aquilae,508363722,2012-08-20 22:56:21,arxiv,astro-ph/0003329
3,2867096,Mu Aquilae,508363722,2012-08-20 22:56:21,arxiv,0708.1752
4,2867096,Mu Aquilae,503137751,2012-07-19 16:08:41,doi,10.1051/0004-6361:20064946


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3794695 entries, 0 to 3794694
Data columns (total 6 columns):
page_id       int64
page_title    object
rev_id        int64
timestamp     datetime64[ns]
type          object
id            object
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 173.7+ MB


**Create a bipartite graph**

Prepare lists of nodes and edges from the dataframe

In [5]:
# list of unique web pages (web_page nodes)
wpages = df.page_id.unique()

# list of unique publications (publication nodes)
pubs = df.id.unique()

# list of references (edges)
edges = [(page, pub) for page, pub in df[['page_id', 'id']].values]

Create a bipartite directed graph 

In [6]:
G = nx.DiGraph()
G.add_nodes_from(wpages, bipartite='web_page')
G.add_nodes_from(pubs, bipartite='publication')
G.add_edges_from(edges)
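
Because `add_edges_from` silently creates any node it has not seen before, it can be worth checking that every edge really runs from a web page to a publication. The sketch below does this on a toy graph built the same way as `G`; the helper name `edges_respect_partitions` and the node ids are illustrative assumptions.

```python
import networkx as nx

def edges_respect_partitions(graph, source="web_page", target="publication"):
    """Return True if every edge goes from a `source` node to a `target` node."""
    kind = nx.get_node_attributes(graph, "bipartite")
    return all(kind.get(u) == source and kind.get(v) == target
               for u, v in graph.edges())

# Toy graph built like G above; the ids are made up.
toy = nx.DiGraph()
toy.add_nodes_from([1, 2], bipartite="web_page")
toy.add_nodes_from(["doi:a", "doi:b"], bipartite="publication")
toy.add_edges_from([(1, "doi:a"), (1, "doi:b"), (2, "doi:b")])
ok = edges_respect_partitions(toy)

# A stray edge in the wrong direction makes the check fail.
toy.add_edge("doi:a", 2)
broken = edges_respect_partitions(toy)

print(ok, broken)
```

On the real graph the call would simply be `edges_respect_partitions(G)`.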

Make lists of web_page and publication nodes (same as above, but now extracted directly from the graph)

In [7]:
wpage_nodes = [node for node, data in G.nodes(data=True) if data['bipartite']=='web_page']
pub_nodes = [node for node, data in G.nodes(data=True) if data['bipartite']=='publication']

Calculate bipartite degree centrality

In [8]:
dcent = nx.bipartite.degree_centrality(G, wpage_nodes)
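
For reference, networkx's bipartite degree centrality divides each node's degree by the number of nodes in the opposite partition, and the returned dictionary covers both partitions. The toy example below (made-up ids, three publications and two pages) makes the normalisation visible; on a `DiGraph` the degree is the sum of in-degree and out-degree.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy graph: 2 web pages, 3 publications, one of them never cited.
toy = nx.DiGraph()
pages = [1, 2]
pubs = ["a", "b", "c"]
toy.add_nodes_from(pages, bipartite="web_page")
toy.add_nodes_from(pubs, bipartite="publication")
toy.add_edges_from([(1, "a"), (1, "b"), (2, "b")])

cent = bipartite.degree_centrality(toy, pages)

# Page 1 cites 2 of the 3 publications: 2 / 3.
# Publication "b" is cited by both pages: 2 / 2.
for node in pages + pubs:
    print(node, round(cent[node], 3))
```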

Assign degree centrality to the two partitions and rank the nodes within each partition

In [9]:
# webpage ranking
wpage_dcent = [(node, dcent[node]) for node in wpage_nodes]
wpage_rank = sorted(wpage_dcent, key=lambda x: x[1], reverse=True)

# publication ranking
pub_dcent = [(node, dcent[node]) for node in pub_nodes]
pub_rank = sorted(pub_dcent, key=lambda x: x[1], reverse=True)
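
The same rankings can be cross-checked directly in pandas. One point to keep in mind is that a `DiGraph` stores each (page, publication) pair only once, so a node's degree counts distinct partners, whereas counting dataframe rows also counts any repeated pairs. The sketch below illustrates the difference on made-up records in which one pair is deliberately repeated.

```python
import pandas as pd

# Illustrative records; the (1, "x") pair appears twice on purpose.
toy_df = pd.DataFrame({
    "page_id": [1, 1, 1, 2],
    "id": ["x", "x", "y", "y"],
})

# Raw row counts per page (repeated pairs included).
rows_per_page = toy_df.groupby("page_id").size()

# Distinct partners, which is what the graph degree measures.
pubs_per_page = toy_df.groupby("page_id")["id"].nunique()
pages_per_pub = toy_df.groupby("id")["page_id"].nunique()

print(rows_per_page.to_dict())
print(pubs_per_page.to_dict())
print(pages_per_pub.to_dict())
```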

**Look at the most prominent web pages and publications**

In [10]:
print('Web page with most references:', df.loc[df.page_id == wpage_rank[0][0]].iloc[0]['page_title'])
print('Number of references:', df.loc[df.page_id == wpage_rank[0][0]].shape[0])
df.loc[df.page_id == wpage_rank[0][0]].head()

Web page with most references: 2017 in paleontology
Number of references: 930


Unnamed: 0,page_id,page_title,rev_id,timestamp,type,id
2863256,52206193,2017 in paleontology,748990989,2016-11-11 17:21:59,doi,10.1017/S0016756816000236
2863257,52206193,2017 in paleontology,793738221,2017-08-03 16:49:35,doi,10.1016/j.cub.2017.06.071
2863258,52206193,2017 in paleontology,804164124,2017-10-07 05:05:59,pmid,28489871
2863259,52206193,2017 in paleontology,787972457,2017-06-28 18:08:23,doi,10.3140/bull.geosci.1668
2863260,52206193,2017 in paleontology,768238206,2017-03-02 16:51:02,doi,10.5710/AMGH.05.09.2016.3009


Most cited publication (English Wikipedia): "Encyclopedia Of All Footballers (10th Edition)"

In [None]:
print('Number of citations:', df.loc[df.id == pub_rank[0][0]].shape[0])
df.loc[df.id == pub_rank[0][0]].head()

Number of citations: 4769


Unnamed: 0,page_id,page_title,rev_id,timestamp,type,id
206501,3415451,List of VFL/AFL players with international bac...,798221651,2017-08-31 17:13:15,isbn,9781921496325
368444,5271083,Vic Cumberland,794792437,2017-08-10 02:26:43,isbn,9781921496325
704178,13802361,Ian Mort,714356193,2016-04-09 07:02:30,isbn,9781921496325
834821,15292398,Syd Barker Sr.,640627969,2015-01-02 08:53:41,isbn,9781921496325
1303211,19228797,Athol Milne,717048257,2016-04-25 12:26:15,isbn,9781921496325


In [None]:
plt.figure(figsize=(18,10))

plt.subplot(121)
plt.xscale('log')
plt.yscale('log')
plt.plot(np.array(wpage_rank)[:,1])
plt.xlabel('web page #')
plt.ylabel('Degree centrality (bipartite)')
plt.title('Wiki page degree centrality')
plt.xlim(1,)

plt.subplot(122)
plt.xscale('log')
plt.yscale('log')
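# Publication ids are strings, so np.array(pub_rank) would cast the centralities to strings too;
# extract the numeric values explicitly instead.
plt.plot([cent for _, cent in pub_rank])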
plt.xlabel('publication #')
plt.ylabel('Degree centrality (bipartite)')
plt.title('Publication degree centrality')
plt.xlim(1,)

Visualize bipartite graph

In [None]:
#matrix = nx.bipartite.biadjacency_matrix(G, row_order=np.array(wpage_rank)[:,0], column_order=np.array(pub_rank)[:,0])
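
The commented-out line above points towards a biadjacency matrix, with one row per web page and one column per publication. A minimal sketch on a toy graph follows; the ids are made up, and the call needs SciPy installed because networkx returns a sparse matrix.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy graph: rows will be pages, columns will be publications.
toy = nx.DiGraph()
pages = [1, 2]
pubs = ["a", "b", "c"]
toy.add_nodes_from(pages, bipartite="web_page")
toy.add_nodes_from(pubs, bipartite="publication")
toy.add_edges_from([(1, "a"), (1, "b"), (2, "b")])

# Sparse matrix with entry (i, j) = 1 when pages[i] cites pubs[j].
matrix = bipartite.biadjacency_matrix(toy, row_order=pages, column_order=pubs)
dense = matrix.toarray().tolist()
print(dense)
```

For the full graph the matrix should stay sparse; a call such as `plt.spy(matrix, markersize=1)` is one way to inspect its structure without converting it to a dense array.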

In [None]:
from networkx.readwrite import json_graph

In [None]:
json_graph.node_link_data(G)

In [None]:
import wikipedia

In [None]:
os.path.getsize(os.path.join(processed_path, 'wiki_en.csv'))