# Wikilanguage Walkthrough for Dave

This is an interactive notebook showing different operations you can perform using Wikilanguage. Broadly, Wikilanguage computes the "most culturally relevant" articles within a set of parameters.

**Calculation Modes**:
- Most culturally relevant in English
- Most culturally relevant to other cultures (e.g. Japan)
- Under-appreciated by English speakers relative to another culture
- Most obscure (least culturally relevant)

**Narrowing (Boolean) Operations**:
- Belonging / not belonging to a certain category (e.g. movies, cities, people)
- Country of origin
- Date of publication
- Near to a location (e.g. within 50km of Palo Alto)

**Grouping Modes**:
- By date of release

## Prerequisites
Imports -- if something is not working here I've fucked up the requirements.txt

In [1]:
%load_ext autoreload
%autoreload 2
from pathlib import Path
import pandas as pd
import numpy as np
import pandas_helper
from pandas_helper import Concepts, Countries
from tqdm.notebook import tqdm
import requests
n_display = 20
n_display_short = 10
pd.set_option('display.max_rows', n_display)

Dataset download -- this project contains code to recompute from scratch but it requires extensive set-up

In [2]:
url = "https://wikilanguage.storage.googleapis.com/wikilanguage.tsv.gz"
data_path = Path("data/wikilanguage.test.tsv.gz")

if not data_path.exists():
    print(f"Downloading dataset from {url}")
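    # Create the data directory (and any missing parents) before writing
    data_path.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(url, stream=True)
    # Fail early on an HTTP error instead of saving an error page as the dataset
    response.raise_for_status()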

    with open(str(data_path), "wb") as handle:
        for data in tqdm(response.iter_content(chunk_size=1*1024*1024), unit="mb"):
            handle.write(data)

Load data file into dataframe

In [3]:
df = pandas_helper.load_data(data_path)

## Demos
### Movies
Concept ids (e.g. `Concepts.FILM`) correspond to ids in the Wikidata database, which can be searched at https://www.wikidata.org/. `wl` is a namespace for Wikilanguage-specific queries implemented in `pandas_helper.py`.
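
For readers wondering how a namespace like `wl` can hang off a DataFrame: pandas lets you register custom accessors. The sketch below is not the code in `pandas_helper.py`. The accessor name `demo`, the `instance_of` list column and the toy rows are all assumptions made for illustration.

```python
import pandas as pd


# Hypothetical accessor name "demo", chosen so it cannot clash with the real `wl`.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    def instance_of(self, concept_id: str) -> pd.DataFrame:
        # Assumption: each row stores its concept ids as a list in "instance_of".
        mask = self._df["instance_of"].apply(lambda ids: concept_id in ids)
        return self._df[mask]


# Toy rows; Q11424 is the Wikidata id for "film".
toy = pd.DataFrame(
    {
        "sample_label": ["Titanic", "pizza"],
        "instance_of": [["Q11424"], ["Q19861951"]],
    },
    index=pd.Index(["Q44578", "Q177"], name="concept_id"),
)

films = toy.demo.instance_of("Q11424")
print(films["sample_label"].tolist())
```

Because each method returns a DataFrame, the accessor is available again on the result, which is what makes chained calls possible.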

The most culturally relevant movies rated by English speakers

In [4]:
df.wl.instance_of(Concepts.FILM).wl.top_ranked("enwiki", n_display)

concept_id,sample_label,enwiki_pagerank,enwiki_relative_to_max
Q2875,Gone with the Wind,5e-06,1.0
Q17738,Star Wars Episode IV: A New Hope,4e-06,0.875392
Q193695,The Wizard of Oz,4e-06,0.866004
Q44578,Titanic,3e-06,0.692884
Q47703,The Godfather,3e-06,0.666326
Q24815,Citizen Kane,3e-06,0.665082
Q132689,Casablanca,3e-06,0.617875
Q190908,Seven,3e-06,0.593116
Q184843,Blade Runner,3e-06,0.580551
Q220394,The Birth of a Nation,3e-06,0.527241
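
The `enwiki_relative_to_max` column in the output above looks like each PageRank divided by the largest PageRank in the filtered set. A minimal sketch of such a ranking step follows, assuming that interpretation. The function below is a stand-in written for this walkthrough, not the implementation in `pandas_helper.py`, and the toy rows are made up.

```python
import pandas as pd


def top_ranked(df: pd.DataFrame, wiki: str, n: int, desc: bool = True) -> pd.DataFrame:
    """Rank rows by `<wiki>_pagerank` and add a relative-to-max column."""
    col = f"{wiki}_pagerank"
    rel = f"{wiki}_relative_to_max"
    # Articles missing from this wiki have no PageRank, so drop them first.
    out = df.dropna(subset=[col]).copy()
    # Scale against the maximum of the whole filtered set, before truncating to n.
    out[rel] = out[col] / out[col].max()
    out = out.sort_values(col, ascending=not desc).head(n)
    return out[["sample_label", col, rel]]


# Toy data with made-up values.
toy = pd.DataFrame(
    {"sample_label": ["A", "B", "C"], "enwiki_pagerank": [2e-6, 4e-6, 1e-6]},
    index=pd.Index(["Q1", "Q2", "Q3"], name="concept_id"),
)

ranked = top_ranked(toy, "enwiki", 2)
print(ranked)
```

Passing `desc=False` reverses the sort, which gives the "most obscure" view while keeping the same scale.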


We can calculate this relative to other cultures by specifying another wiki (e.g. jawiki for Japan, itwiki for Italy or frwiki for France). Below are the most culturally relevant movies according to Italian Wikipedia.

In [5]:
print("Available wikis: ", ", ".join(pandas_helper.load_wikis(data_path)))
df.wl.instance_of(Concepts.FILM).wl.top_ranked("itwiki", n_display)

Available wikis:  ptwiki, eswiki, jawiki, arwiki, dewiki, enwiki, zhwiki, ruwiki, itwiki, frwiki


concept_id,sample_label,itwiki_pagerank,itwiki_relative_to_max
Q44578,Titanic,8e-06,1.0
Q18407,La Dolce Vita,8e-06,0.94711
Q180098,Ben-Hur,7e-06,0.839822
Q24871,Avatar,6e-06,0.756159
Q47703,The Godfather,6e-06,0.68191
Q761952,L'Arrivée d'un train en gare de La Ciotat,6e-06,0.673
Q134430,Snow White and the Seven Dwarfs,5e-06,0.660865
Q103474,2001: A Space Odyssey,5e-06,0.650776
Q131074,The Lord of the Rings: The Return of the King,5e-06,0.631053
Q2875,Gone with the Wind,5e-06,0.604718


Because movies are often global, there is a lot of overlap between the movies ranked by Italian Wikipedia and those ranked by English Wikipedia. A more interesting query is to determine which movies are "trending" in Italian Wikipedia relative to English Wikipedia. We do this through the `kl_divergence` function, which performs the trending calculation.

In [6]:
df.wl.instance_of(Concepts.FILM).wl.kl_divergence(base_wiki="enwiki", target_wiki="itwiki", importance_weight=5).nlargest(n_display, "kl_divergence")[[
    "sample_label", "itwiki_title", "kl_divergence", "kl_relative_to_max"
]]

concept_id,sample_label,itwiki_title,kl_divergence,kl_relative_to_max
Q18407,La Dolce Vita,La dolce vita,8.06103e-16,1.0
Q761952,L'Arrivée d'un train en gare de La Ciotat,L'arrivo di un treno alla stazione di La Ciotat,1.937562e-16,0.240362
Q180098,Ben-Hur,Ben-Hur (film 1959),1.821666e-16,0.225984
Q1570686,Partie de cartes,La partita a carte,6.628555e-17,0.08223
Q8665,Workers Leaving the Lumière Factory,L'uscita dalle officine Lumière,5.3508220000000006e-17,0.066379
Q19355,Life is Beautiful,La vita è bella (film 1997),3.9599050000000003e-17,0.049124
Q4660499,A Visit to the Seaside,A Visit to the Seaside,3.37808e-17,0.041906
Q464032,Cinema Paradiso,Nuovo Cinema Paradiso,3.368082e-17,0.041782
Q2570819,Don Juan,Don Giovanni e Lucrezia Borgia,3.117837e-17,0.038678
Q212775,The Last Emperor,L'ultimo imperatore,2.4429290000000002e-17,0.030305
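
To give a feel for what a KL-style "trending" score can look like, here is a hedged sketch that scores each article by its pointwise contribution to KL(target || base). It is an assumption about the general idea, not the formula used in `pandas_helper.py`. In particular it ignores `importance_weight`, whose role is not shown in this notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def pointwise_kl(df: pd.DataFrame, base_col: str, target_col: str) -> pd.DataFrame:
    """Score rows by p * log(p / q), with p from the target wiki and q from the base."""
    # Keep only rows with a positive score in both wikis (this also drops NaN).
    out = df[(df[base_col] > 0) & (df[target_col] > 0)].copy()
    # Normalise so each column is a probability distribution over the filtered rows.
    p = out[target_col] / out[target_col].sum()
    q = out[base_col] / out[base_col].sum()
    # Positive when the target wiki gives the article more weight than the base wiki.
    out["kl_divergence"] = p * np.log(p / q)
    out["kl_relative_to_max"] = out["kl_divergence"] / out["kl_divergence"].max()
    return out


# Toy data with made-up values.
toy = pd.DataFrame(
    {
        "sample_label": ["film A", "film B", "film C"],
        "enwiki_pagerank": [1e-6, 4e-6, 3e-6],
        "itwiki_pagerank": [4e-6, 4e-6, 1e-6],
    }
)

scored = pointwise_kl(toy, "enwiki_pagerank", "itwiki_pagerank")
top = scored.nlargest(1, "kl_divergence")
print(top[["sample_label", "kl_divergence"]])
```

In this toy case "film A" comes out on top because it holds a much larger share of the target distribution than of the base distribution.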


An alternative is to consider Italian-made films under the English or Italian Wikipedia importance ranking, which gives an intuitive sense of "the most culturally relevant Italian films". In this case `Countries.ITALY` also resolves to a Wikidata id that can be looked up at https://www.wikidata.org/.

In [7]:
df.wl.instance_of(Concepts.FILM).wl.country_of_origin(Countries.ITALY).wl.top_ranked('itwiki', n_display)

concept_id,sample_label,itwiki_pagerank,itwiki_relative_to_max
Q19355,Life is Beautiful,4e-06,1.0
Q3823660,La presa di Roma,3e-06,0.740985
Q76479,A Fistful of Dollars,3e-06,0.692648
Q172837,Bicycle Thieves,3e-06,0.676532
Q1024861,Cabiria,3e-06,0.617405
Q3214027,La canzone dell'amore,2e-06,0.531347
Q3818300,L'allenatore nel pallone,2e-06,0.526271
Q41483,"The Good, the Bad and the Ugly",2e-06,0.516605
Q6379279,The Great Beauty,2e-06,0.458824
Q12018,8½,2e-06,0.458503


Since Wikidata is structured, we can also apply these types of rankings to group by a specific attribute (e.g. top ranked films by year)

In [8]:
df.wl.instance_of(Concepts.FILM).wl.top_by_year(n=n_display, top_col='enwiki_pagerank')[["sample_label", "publication_date", "enwiki_pagerank"]]

concept_id,sample_label,publication_date,enwiki_pagerank
Q3604746,Avatar 2,2021-12-15,1.231067e-07
Q57177410,Birds of Prey,2020-01-01,2.101868e-07
Q23781155,Avengers: Endgame,2019-04-22,1.92825e-06
Q23780914,Avengers: Infinity War,2018-04-25,1.104982e-06
Q23780734,Black Panther,2017-11-03,1.017793e-06
Q19590955,Rogue One,2016-12-10,1.133212e-06
Q6074,Star Wars Episode VII: The Force Awakens,2015-12-16,1.763776e-06
Q13417189,Interstellar,2014-10-26,9.406503e-07
Q246283,Frozen,2013-11-10,1.251971e-06
Q189330,The Dark Knight Rises,2012-07-20,1.45198e-06
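
Grouping by year can be sketched with a plain pandas `groupby`. The version below assumes a `publication_date` column of ISO date strings and a unique `concept_id` index. It is an illustration of the idea rather than the code behind `wl.top_by_year`, and the toy rows are made up.

```python
import pandas as pd


def top_by_year(df: pd.DataFrame, n: int, top_col: str,
                date_col: str = "publication_date") -> pd.DataFrame:
    """Return the highest-scoring row of each year, newest year first."""
    out = df.dropna(subset=[date_col, top_col]).copy()
    out["year"] = pd.to_datetime(out[date_col]).dt.year
    # idxmax gives, per year, the index label of the row with the largest score.
    winners = out.loc[out.groupby("year")[top_col].idxmax()]
    return winners.sort_values("year", ascending=False).head(n).drop(columns="year")


# Toy data with made-up values.
toy = pd.DataFrame(
    {
        "sample_label": ["A", "B", "C", "D"],
        "publication_date": ["2019-04-22", "2019-01-01", "2018-04-25", "2018-06-01"],
        "enwiki_pagerank": [3e-6, 1e-6, 2e-6, 5e-6],
    },
    index=pd.Index(["Q1", "Q2", "Q3", "Q4"], name="concept_id"),
)

yearly = top_by_year(toy, n=10, top_col="enwiki_pagerank")
print(yearly)
```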


Finally, operators can be chained, so we can find the most culturally relevant Italian films crunched by year

In [9]:
df.wl.instance_of(Concepts.FILM)\
    .wl.country_of_origin(Countries.ITALY)\
    .wl.top_by_year(n=n_display, top_col='itwiki_pagerank')\
    [["sample_label", "itwiki_title", "publication_date", "itwiki_pagerank"]]

concept_id,sample_label,itwiki_title,publication_date,itwiki_pagerank
Q81635374,Hammamet (film),Hammamet (film),2020-01-09,1.66302e-07
Q63213307,The Traitor,Il traditore (film 2019),2019-05-23,6.542975e-07
Q51800111,Dogman,Dogman (film),2018-07-11,6.137035e-07
Q25136757,Call Me by Your Name,Chiamami col tuo nome (film),2017-01-01,8.864058e-07
Q22340123,Like Crazy,La pazza gioia,2016-01-01,5.794918e-07
Q19587078,Tale of Tales,Il racconto dei racconti - Tale of Tales,2015-01-01,6.78691e-07
Q17605404,Il giovane favoloso,Il giovane favoloso,2014-01-01,7.18153e-07
Q6379279,The Great Beauty,La grande bellezza,2013-05-21,2.028506e-06
Q172419,Piazza Fontana: The Italian Conspiracy,Romanzo di una strage,2012-01-01,7.946213e-07
Q1242957,What a Beautiful Day,Che bella giornata,2011-01-05,5.926636e-07


### Other Concepts
The same operators apply to other types of concepts beyond films. For a given article, you can figure out the concepts it belongs to by pressing the "Edit Links" button underneath languages on Wikipedia. The Wikidata page will state the article is an "instance of" some concept (e.g. https://en.wikipedia.org/wiki/Hypnotize_(The_Notorious_B.I.G._song) is an instance of single - [Q134556](https://www.wikidata.org/wiki/Q134556)). Those concepts may also be a subclass of another concept, e.g. single is a subclass of "release" - [Q2031291](https://www.wikidata.org/wiki/Q2031291).
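
The same lookup can be done programmatically. In Wikidata's entity JSON, "instance of" is property `P31` and "subclass of" is `P279`. The sketch below reads those ids from an entity dictionary. The helper name `claim_ids` is hypothetical and the `entity` dictionary is a trimmed, hand-written stand-in for what an endpoint such as `https://www.wikidata.org/wiki/Special:EntityData/<id>.json` returns under `entities`.

```python
from typing import Dict, List


def claim_ids(entity: Dict, prop: str) -> List[str]:
    """Collect the item ids referenced by one property of a Wikidata entity."""
    ids = []
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        # "novalue" and "somevalue" snaks carry no datavalue, so skip them.
        if snak.get("snaktype") == "value":
            ids.append(snak["datavalue"]["value"]["id"])
    return ids


# Hand-written stand-in for one entity; real entities carry many more fields.
entity = {
    "claims": {
        "P31": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P31",
                    "datavalue": {
                        "value": {"entity-type": "item", "id": "Q134556"},
                        "type": "wikibase-entityid",
                    },
                }
            }
        ]
    }
}

instance_of = claim_ids(entity, "P31")  # "instance of"
subclass_of = claim_ids(entity, "P279")  # "subclass of" (absent in this stand-in)
print(instance_of, subclass_of)
```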

For example, here are the most culturally relevant singles for English speakers:

In [10]:
df.wl.instance_of("Q134556").wl.top_ranked('enwiki', n_display)

concept_id,sample_label,enwiki_pagerank,enwiki_relative_to_max
Q1472773,"Oh, Pretty Woman",1.620994e-06,1.0
Q214430,Like a Rolling Stone,1.459345e-06,0.900278
Q210211,Amazing Grace,1.306221e-06,0.805815
Q653991,All I Want for Christmas Is You,1.214108e-06,0.74899
Q308895,We Are the World,1.166166e-06,0.719415
Q890,Gangnam Style,1.075949e-06,0.663759
Q1188494,Do They Know It's Christmas?,1.069725e-06,0.659919
Q1067025,Hollaback Girl,1.011987e-06,0.6243
Q161402,Over the Rainbow,9.938199e-07,0.613093
Q154968,Lili Marleen,9.814362e-07,0.605453


If articles underneath those concepts often have a publication date, then you can also crunch by date. For example, the most culturally relevant singles by year

In [11]:
df.wl.instance_of("Q134556").wl.top_by_year(n=(n_display*2), top_col='enwiki_pagerank')[["enwiki_title", "publication_date", "enwiki_pagerank"]][n_display:]

concept_id,enwiki_title,publication_date,enwiki_pagerank
Q1921383,Music (Madonna song),2000-08-21,4.792736e-07
Q2539121,Expo 2000 (song),1999-12-01,4.462832e-07
Q1932351,Iris (song),1998-04-01,7.109357e-07
Q5463103,Fly (Sugar Ray song),1997-06-17,5.48648e-07
Q908516,Don't Speak,1996-04-15,7.15579e-07
Q1193181,Torn (Ednaswap song),1995-01-01,6.914456e-07
Q653991,All I Want for Christmas Is You,1994-10-29,1.214108e-06
Q1149738,Macarena,1993-08-15,4.691665e-07
Q1165194,One (U2 song),1992-02-02,2.556562e-07
Q214113,Black or White,1991-10-11,4.025967e-07


Wikidata gets pretty wacky, so you can rank all kinds of strange things. Here are the most culturally relevant **humans**

In [12]:
df.wl.instance_of(Concepts.HUMAN).wl.top_ranked(n=n_display_short, wiki='enwiki') 

concept_id,sample_label,enwiki_pagerank,enwiki_relative_to_max
Q76,Barack Obama,7.6e-05,1.0
Q1043,Carl Linnaeus,7.1e-05,0.944269
Q9682,Elizabeth II,6.4e-05,0.845081
Q868,Aristotle,6.1e-05,0.808661
Q207,George W. Bush,6e-05,0.788083
Q302,Jesus Christ,5.8e-05,0.767799
Q517,Napoleon,5.7e-05,0.756202
Q22686,Donald Trump,5.6e-05,0.746166
Q8007,Franklin Delano Roosevelt,5.2e-05,0.693304
Q692,William Shakespeare,4.9e-05,0.647217


Or the most culturally relevant **wonders of the ancient world**

In [13]:
df.wl.instance_of(Concepts.WONDERS_OF_THE_ANCIENT_WORLD).wl.top_ranked(n=n_display_short, wiki='enwiki') 

concept_id,sample_label,enwiki_pagerank,enwiki_relative_to_max
Q37200,Great Pyramid of Giza,3.917875e-06,1.0
Q43018,Temple of Artemis,1.016052e-06,0.259337
Q43244,Lighthouse of Alexandria,9.826618e-07,0.250815
Q45368,Mausoleum of Maussollos,9.288698e-07,0.237085
Q41931,Hanging Gardens of Babylon,8.61247e-07,0.219825
Q41553,Colossus of Rhodes,7.609217e-07,0.194218
Q46239,Statue of Zeus at Olympia,5.2255e-07,0.133376


My workflow is typically to find an article, look up its Wikidata entry, find the concept id and start crunching. For example [Pizza](https://en.wikipedia.org/wiki/Pizza) -> [Instance of Types of Food or Dish](https://www.wikidata.org/wiki/Q177) -> [Types of Food or Dish (Q19861951)](https://www.wikidata.org/wiki/Q19861951)

In [14]:
df.wl.instance_of('Q19861951').wl.top_ranked(n=n_display_short, wiki='enwiki') 

concept_id,sample_label,enwiki_pagerank,enwiki_relative_to_max
Q40050,drink,1.2e-05,1.0
Q736427,staple food,4e-06,0.361253
Q182940,dessert,4e-06,0.355222
Q177,pizza,4e-06,0.30568
Q41415,soup,3e-06,0.275485
Q2920963,stew,3e-06,0.238418
Q46383,sushi,3e-06,0.233892
Q6663,hamburger,3e-06,0.217335
Q28803,sandwich,2e-06,0.149575
Q275068,broth,2e-06,0.138362


You can also use this in reverse to find the least relevant types of food or dish

In [15]:
df.wl.instance_of('Q19861951').wl.top_ranked(n=n_display_short, wiki='enwiki', desc=False) 

concept_id,sample_label,enwiki_pagerank,enwiki_relative_to_max
Q3369568,Nguri,1.915574e-08,0.001621
Q11559422,seafood dish,2.258693e-08,0.001911
Q5523588,Garnache,2.574945e-08,0.002179
Q4982700,Bucheron,2.617011e-08,0.002214
Q16359149,Jāņi cheese,2.965499e-08,0.002509
Q25067130,pizzetta,3.278636e-08,0.002774
Q7813019,tofurkey,3.340876e-08,0.002827
Q4209872,jeok,5.055754e-08,0.004278
Q1018075,Butterbrot,5.523101e-08,0.004673
Q1591487,Havarti,5.864418e-08,0.004962


### Geographic Queries

Many Wikipedia articles also have coordinates in Wikidata, which means we can search relative to a location. For example, the most culturally relevant tourist attractions within 50 km of Palo Alto

In [16]:
palo_alto = df.wl.resolve("Palo Alto, California", col="enwiki_title")
palo_alto.wl.within_radius(df, 50).wl.instance_of(Concepts.TOURIST_ATTRACTION).wl.top_ranked('enwiki', n_display)[['sample_label', 'enwiki_pagerank']]


concept_id,sample_label,enwiki_pagerank
Q964035,Computer History Museum,1.150577e-05
Q913672,San Francisco Museum of Modern Art,2.288416e-06
Q635559,Golden Gate Park,2.022624e-06
Q965731,California Academy of Sciences,1.959615e-06
Q1470276,M. H. de Young Memorial Museum,8.921262e-07
Q1324280,Fisherman's Wharf,8.824058e-07
Q206518,Exploratorium,7.738572e-07
Q1416890,Fine Arts Museums of San Francisco,6.103649e-07
Q1672708,Iris & B. Gerald Cantor Center for Visual Arts,5.964724e-07
Q930276,Lombard Street,5.532263e-07
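
A radius filter like the one above can be sketched with the haversine great-circle distance. This is one common way to do it and not necessarily how `wl.within_radius` is implemented. The `lat` and `lon` column names are assumptions, and the coordinates in the toy table are approximate.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or pandas Series in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def within_radius(df: pd.DataFrame, lat: float, lon: float, radius_km: float) -> pd.DataFrame:
    # Rows without coordinates produce NaN distances and are filtered out.
    dist = haversine_km(lat, lon, df["lat"], df["lon"])
    return df[dist <= radius_km]


# Approximate coordinates, for illustration only.
places = pd.DataFrame(
    {
        "name": ["San Francisco", "Los Angeles", "No coordinates"],
        "lat": [37.77, 34.05, np.nan],
        "lon": [-122.42, -118.24, np.nan],
    }
)

# Roughly Palo Alto
nearby = within_radius(places, 37.44, -122.14, 50)
print(nearby["name"].tolist())
```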


This also works with other Wikipedias. For example, which tourist attractions are more culturally relevant to Italians than to English speakers?

In [17]:
palo_alto = df.wl.resolve("Palo Alto, California", col="enwiki_title")
palo_alto.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.TOURIST_ATTRACTION)\
    .wl.kl_divergence(base_wiki="enwiki", target_wiki="itwiki", importance_weight=5)\
    .nlargest(n_display, "kl_divergence")\
    [['sample_label', 'kl_divergence']]

concept_id,sample_label,kl_divergence
Q1856083,Pier 39,2.191236e-05
Q1107297,Coit Tower,1.342134e-05
Q913672,San Francisco Museum of Modern Art,8.271844e-06
Q1470276,M. H. de Young Memorial Museum,7.020882e-06
Q930276,Lombard Street,4.795701e-06
Q965731,California Academy of Sciences,7.386911e-07
Q2166318,Walk of Game,5.360747e-07
Q1324280,Fisherman's Wharf,2.730834e-07
Q877714,Oakland Museum of California,2.268066e-09
Q3313374,Hiller Aviation Museum,2.626996e-10


Perhaps this is more interesting when we consider tourist attractions or museums in other countries. What do English speakers find in Rome?

In [18]:
rome = df.wl.resolve("Rome", col="enwiki_title")
rome.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.TOURIST_ATTRACTION)\
    .wl.top_ranked('enwiki', n_display)

concept_id,sample_label,enwiki_pagerank,enwiki_relative_to_max
Q237,Vatican City,2.406982e-05,1.0
Q12512,St. Peter's Basilica,1.24098e-05,0.515575
Q10285,Colosseum,5.316048e-06,0.220859
Q182955,Vatican Museums,4.167579e-06,0.173145
Q180212,Roman Forum,4.146222e-06,0.172258
Q99309,Pantheon,4.126112e-06,0.171423
Q2943,Sistine Chapel,4.010773e-06,0.166631
Q200642,Palatine Hill,2.84329e-06,0.118127
Q333906,Capitoline Museums,1.959018e-06,0.081389
Q486382,Castel Sant'Angelo,1.533058e-06,0.063692


But which hidden gems among these tourist attractions do Italians know about that English speakers don't?

In [19]:
rome = df.wl.resolve("Rome", col="enwiki_title")
rome.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.TOURIST_ATTRACTION)\
    .wl.kl_divergence(base_wiki="enwiki", target_wiki="itwiki", importance_weight=5)\
    .nlargest(n_display, "kl_divergence")\
    [['sample_label', 'kl_divergence']]

concept_id,sample_label,kl_divergence
Q237,Vatican City,0.0004561892
Q10285,Colosseum,4.422506e-08
Q200642,Palatine Hill,1.043703e-08
Q333906,Capitoline Museums,6.8888e-09
Q486382,Castel Sant'Angelo,6.62451e-09
Q680971,Experimental Centre of Cinematography,3.589622e-09
Q1492387,Galleria Nazionale d'Arte Moderna e Contemporanea,6.816189e-10
Q2579612,Palazzo delle Esposizioni,1.021645e-10
Q1135392,National Museum of Rome,3.938613e-11
Q1881229,MAXXI – National Museum of the 21st Century Arts,2.083575e-11


Or top museums according to Italians?

In [20]:
rome = df.wl.resolve("Rome", col="enwiki_title")
rome.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.MUSEUM)\
    .wl.top_ranked("itwiki", n_display)

concept_id,sample_label,itwiki_pagerank,itwiki_relative_to_max
Q182955,Vatican Museums,1.7e-05,1.0
Q200642,Palatine Hill,1.5e-05,0.852827
Q333906,Capitoline Museums,1.2e-05,0.664486
Q486382,Castel Sant'Angelo,1e-05,0.597392
Q680971,Experimental Centre of Cinematography,7e-06,0.397573
Q1492387,Galleria Nazionale d'Arte Moderna e Contemporanea,5e-06,0.291925
Q841506,Galleria Borghese,5e-06,0.262623
Q1135392,National Museum of Rome,4e-06,0.238981
Q502098,Baths of Caracalla,4e-06,0.23764
Q836108,Baths of Diocletian,3e-06,0.200447
