# Tracking Knowledge Propagation Across Wikipedia Languages

We present a dataset of inter-language knowledge propagation in Wikipedia. The dataset covers the period from 2001 to the first quarter of 2020. Spanning all 309 language editions and 33M articles, it aims to track the full propagation history of Wikipedia concepts and to enable follow-up research on predictive models of that propagation. For this purpose, we align all Wikipedia articles in a language-agnostic manner according to the concept they cover, which results in 13M propagation instances. To the best of our knowledge, this dataset is the first to explore full inter-language propagation at a large scale.

Authors:
- Rodolfo Valentim
- Giovanni Comarela
- Souneil Park
- Diego Saez-Trumper


## Load entire dataset

In [1]:
import pandas as pd 

wikipedia_df = pd.read_csv('https://zenodo.org/record/4433137/files/dataset.csv.zip', sep=',')
wikipedia_df.head()

Unnamed: 0,Wikidata ID,Language Edition,Creation Timestamp,Topics,Scores
0,Q1,plwiki;svwiki;jawiki;fiwiki;frwiki;cswiki;mswi...,1023468530;1038689254;1047585716;1053504055;10...,STEM.Space;STEM.STEM*;STEM.Physics,0.99;0.97;0.97
1,Q100,enwiki;svwiki;frwiki;nlwiki;zhwiki;eswiki;fiwi...,999811023;1045709287;1048540987;1073222047;108...,Geography.Geographical,0.9
2,Q1000,enwiki;dewiki;svwiki;eswiki;nlwiki;jawiki;etwi...,999997044;1033492217;1045359510;1061260123;106...,Geography.Geographical;Geography.Regions.Afric...,0.9;0.77
3,Q10000,enwiki;eswiki;plwiki;fiwiki;jawiki;zhwiki;nlwi...,1103755157;1124976107;1138957516;1146600855;11...,STEM.Libraries_&_Information;STEM.STEM*,0.99;0.98
4,Q100000,nlwiki;enwiki;liwiki;frwiki;zhwiki;cawiki;arzwiki,1139001160;1139662968;1142592879;1244377983;13...,Geography.Regions.Europe.Western_Europe;Geogra...,1.0;1.0;0.94


## Load dataset in chunks

In [13]:
import pandas as pd 

for chunk in pd.read_csv('https://zenodo.org/record/4433137/files/dataset.csv.zip', chunksize=10**2, sep=','):
    print(chunk.shape)
    break  # remove this line to run through the entire dataset 

(100, 5)
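
Reading in chunks also makes it possible to compute aggregates without holding the whole table in memory. The sketch below counts, for each language edition, how many items list it first. It assumes that the first entry of `Language Edition` is the edition where the item was created earliest, as the preview rows suggest. The function name `count_origin_editions` and the inline sample (values copied from the preview, cut to two editions per item) are illustrative only.

```python
import io
from collections import Counter

import pandas as pd


def count_origin_editions(chunks):
    """Count, per language edition, how many items list it first."""
    origins = Counter()
    for chunk in chunks:
        # Take the first entry of each ';'-separated list of editions.
        first_edition = chunk["Language Edition"].str.split(";").str[0]
        origins.update(first_edition.dropna().tolist())
    return origins


# Tiny inline sample standing in for the real file (assumption: the
# real CSV exposes the same column names as the preview above).
sample_csv = io.StringIO(
    "Wikidata ID,Language Edition,Creation Timestamp\n"
    "Q1,plwiki;svwiki,1023468530;1038689254\n"
    "Q100,enwiki;svwiki,999811023;1045709287\n"
    "Q1000,enwiki;dewiki,999997044;1033492217\n"
)

origin_counts = count_origin_editions(pd.read_csv(sample_csv, chunksize=2))
print(origin_counts.most_common())
```

To run this on the full dataset, pass the chunked reader from the cell above (with a larger `chunksize`) in place of the inline sample.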


## Terminology

- Wikidata ID: the language-agnostic Wikidata identifier; for instance, Q298 (Chile, https://www.wikidata.org/wiki/Q298).
- Language Edition: a specific language instance of an item. For example, the Portuguese version of Q298 would be ptwiki-Q298 (https://pt.wikipedia.org/wiki/Chile). In the dataset, this column holds a `;`-separated list of the editions that have a page for the item.
- Creation Timestamp: an ordered list of Unix timestamps aligned with the Language Edition list, so that each timestamp gives the creation time of the page in the corresponding edition.
- Topics: the topics assigned to the item (e.g. History). Note that a topic is assigned to the item and is propagated to all the pages of the item. Therefore, if item Q298 belongs to the topic 'Geography', all the pages about Q298 also belong to that topic. See https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification for details.
- Scores: an ordered list aligned with the Topics list, giving the probability that each assigned topic is correct.
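
As a minimal sketch of how the aligned lists can be read, the snippet below pairs each language edition with its creation time. The helper name `parse_row` is hypothetical, and the input is only the first three entries of the Q1 row from the preview above (the full row is longer). Timestamps are interpreted as seconds since the Unix epoch in UTC.

```python
from datetime import datetime, timezone


def parse_row(language_editions, creation_timestamps):
    """Pair each language edition with its page creation time (UTC)."""
    languages = language_editions.split(";")
    timestamps = [int(t) for t in creation_timestamps.split(";")]
    if len(languages) != len(timestamps):
        raise ValueError("the two lists are expected to have equal length")
    return [
        (lang, datetime.fromtimestamp(ts, tz=timezone.utc))
        for lang, ts in zip(languages, timestamps)
    ]


# First three entries of the Q1 row shown in the preview.
propagation = parse_row(
    "plwiki;svwiki;jawiki",
    "1023468530;1038689254;1047585716",
)
for lang, created in propagation:
    print(lang, created.isoformat())
```

The same pairing applies to Topics and Scores, with `float` in place of `int` and no date conversion.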

Note that we removed non-Wikipedia projects such as Wiktionary, Wikiquote, and Wikibooks. We also make the same dataset available in a CSV file, which is sorted first by Wikidata item and then by time of page creation.

## Process each column

In [None]:
import pandas as pd
from tqdm import tqdm  # requires the tqdm package (pip install tqdm)

# Registers progress_apply on pandas objects to show a progress bar.
# To run without a progress bar, replace progress_apply with apply.
tqdm.pandas()

wikipedia_df = pd.read_csv('https://zenodo.org/record/4433137/files/dataset.csv.zip', sep=',')

# Each column below stores a ';'-separated list in a single string; split it into a Python list.
wikipedia_df['Language Edition'] = wikipedia_df['Language Edition'].progress_apply(lambda x: x.split(";"))

# The resulting timestamps are still strings; cast them to int if numeric values are needed.
wikipedia_df['Creation Timestamp'] = wikipedia_df['Creation Timestamp'].progress_apply(lambda x: x.split(";"))

# Missing values are read as NaN; replace them with '' so that split() works (such rows become ['']).
wikipedia_df['Topics'] = wikipedia_df['Topics'].fillna('')
wikipedia_df['Topics'] = wikipedia_df['Topics'].progress_apply(lambda x: x.split(";"))

wikipedia_df['Scores'] = wikipedia_df['Scores'].fillna('')
wikipedia_df['Scores'] = wikipedia_df['Scores'].progress_apply(lambda x: x.split(";"))

wikipedia_df.head()

100%|██████████| 13251684/13251684 [00:25<00:00, 511762.15it/s]
100%|██████████| 13251684/13251684 [00:26<00:00, 503124.73it/s]
100%|██████████| 13251684/13251684 [00:15<00:00, 843272.24it/s]
 80%|███████▉  | 10550730/13251684 [00:26<00:02, 913204.84it/s]
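
Once the columns hold lists, one possible next step is to reshape the table so that each row is a single (item, language edition, creation time) instance. The sketch below uses a hypothetical two-row frame in the shape produced by the processing cell above, with values taken from the preview and cut to two editions per item. It assumes pandas 1.3 or later, where `DataFrame.explode` accepts a list of columns.

```python
import pandas as pd

# Hypothetical frame in the shape produced by the processing cell:
# list-valued columns, with timestamps still stored as strings.
df = pd.DataFrame({
    "Wikidata ID": ["Q1", "Q100"],
    "Language Edition": [["plwiki", "svwiki"], ["enwiki", "svwiki"]],
    "Creation Timestamp": [["1023468530", "1038689254"],
                           ["999811023", "1045709287"]],
})

# Explode both aligned columns together so the pairs stay matched.
long_df = df.explode(["Language Edition", "Creation Timestamp"],
                     ignore_index=True)

# Convert Unix timestamps (seconds) to timezone-aware datetimes.
long_df["Creation Timestamp"] = pd.to_datetime(
    long_df["Creation Timestamp"].astype("int64"), unit="s", utc=True
)

print(long_df)
```

The long format makes per-edition grouping and time-based filtering straightforward, at the cost of a much larger row count on the full dataset.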