# Wikilanguage Walkthrough for Dave

This is an interactive notebook showing the operations you can perform with Wikilanguage. Broadly, it computes the "most culturally relevant" articles within a set of parameters.

**Calculation Modes**:
- Most culturally relevant in English
- Most culturally relevant to other cultures (e.g. Japan)
- Under-appreciated by English speakers relative to another culture
- Most obscure (least culturally relevant)

**Narrowing (Boolean) Operations**:
- Belonging / not belonging to a certain category (e.g. movies, cities, people)
- Country of origin
- Date of publication
- Near to a location (e.g. within 50km of Palo Alto)

**Grouping Modes**:
- By date of release

## Prerequisites
Imports -- if something is not working here I've fucked up the requirements.txt

In [1]:
%load_ext autoreload
%autoreload 2
from pathlib import Path
import pandas as pd
import numpy as np
import pandas_helper
from pandas_helper import Concepts, Countries
from tqdm.notebook import tqdm
import requests
n_display = 20
n_display_short = 10
pd.set_option('display.max_rows', n_display)

Dataset download -- this project contains code to recompute from scratch but it requires extensive set-up

In [2]:
url = "https://wikilanguage.storage.googleapis.com/wikilanguage.tsv.gz"
data_path = Path("data/wikilanguage.test.tsv.gz")

if not data_path.exists():
    print(f"Downloading dataset from {url}")
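    data_path.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(url, stream=True)
    response.raise_for_status()  # fail early instead of saving an error page as the dataset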

    with open(str(data_path), "wb") as handle:
        for data in tqdm(response.iter_content(chunk_size=1*1024*1024), unit="mb"):
            handle.write(data)

Load data file into dataframe

In [3]:
df = pandas_helper.load_data(data_path)

## Demos
### Movies
Concept ids (e.g. `Concepts.FILM`) correspond to ids in the Wikidata database, which can be searched at https://www.wikidata.org/. `wl` is a namespace for Wikilanguage-specific queries implemented in `pandas_helper.py`.
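As a rough sketch of how a namespace like `wl` can hang off a dataframe: pandas lets you register custom accessors. The accessor name `wl_demo`, the `instance_of` column, the "divide by the maximum" reading of `relative_to_max` and the toy rows below are all assumptions made for illustration; the real implementation in `pandas_helper.py` may look quite different.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("wl_demo")
class WikilanguageDemoAccessor:
    """Hypothetical stand-in for the real `wl` accessor."""

    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._df = pandas_obj

    def instance_of(self, concept_id: str) -> pd.DataFrame:
        # Keep rows whose (assumed) `instance_of` column matches the concept id.
        return self._df[self._df["instance_of"] == concept_id]

    def top_ranked(self, wiki: str, n: int) -> pd.DataFrame:
        # Rank by the wiki's pagerank column and add a score relative to the top row.
        col = f"{wiki}_pagerank"
        top = self._df.nlargest(n, col)[["sample_label", col]].copy()
        top[f"{wiki}_relative_to_max"] = top[col] / top[col].max()
        return top


# Toy data, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "sample_label": ["Film A", "Film B", "City C"],
        "instance_of": ["Q11424", "Q11424", "Q515"],
        "enwiki_pagerank": [4e-06, 5e-06, 9e-06],
    },
    index=pd.Index(["toy1", "toy2", "toy3"], name="concept_id"),
)

# Each call returns a plain dataframe, so accessor calls can be chained.
result = toy.wl_demo.instance_of("Q11424").wl_demo.top_ranked("enwiki", 2)
print(result)
```

Because every method returns an ordinary dataframe, filters and rankings compose in the same chained style used in the cells below.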

The most culturally relevant movies for English speakers (ranked by `enwiki` PageRank):

In [4]:
df.wl.instance_of(Concepts.FILM).wl.top_ranked("enwiki", n_display)

concept_id,sample_label,enwiki_pagerank,enwiki_relative_to_max
Q2875,Gone with the Wind,5e-06,1.0
Q17738,Star Wars Episode IV: A New Hope,4e-06,0.875392
Q193695,The Wizard of Oz,4e-06,0.866004
Q44578,Titanic,3e-06,0.692884
Q47703,The Godfather,3e-06,0.666326
Q24815,Citizen Kane,3e-06,0.665082
Q132689,Casablanca,3e-06,0.617875
Q190908,Seven,3e-06,0.593116
Q184843,Blade Runner,3e-06,0.580551
Q220394,The Birth of a Nation,3e-06,0.527241


We can calculate this relative to other cultures by manually specifying another wiki (e.g. `jawiki` for Japan, `itwiki` for Italy or `frwiki` for France). Below are the most culturally relevant movies according to Italian Wikipedia.

In [5]:
print("Available wikis: ", ", ".join(pandas_helper.load_wikis(data_path)))
df.wl.instance_of(Concepts.FILM).wl.top_ranked("itwiki", n_display)

Available wikis:  ptwiki, eswiki, jawiki, arwiki, dewiki, enwiki, zhwiki, ruwiki, itwiki, frwiki


concept_id,sample_label,itwiki_pagerank,itwiki_relative_to_max
Q44578,Titanic,8e-06,1.0
Q18407,La Dolce Vita,8e-06,0.94711
Q180098,Ben-Hur,7e-06,0.839822
Q24871,Avatar,6e-06,0.756159
Q47703,The Godfather,6e-06,0.68191
Q761952,L'Arrivée d'un train en gare de La Ciotat,6e-06,0.673
Q134430,Snow White and the Seven Dwarfs,5e-06,0.660865
Q103474,2001: A Space Odyssey,5e-06,0.650776
Q131074,The Lord of the Rings: The Return of the King,5e-06,0.631053
Q2875,Gone with the Wind,5e-06,0.604718


Because movies are often global, the Italian and English rankings overlap heavily. A more interesting query is which movies are "trending" in Italian Wikipedia relative to English Wikipedia. The `kl_divergence` function performs this calculation, scoring each article by how much its importance in the target wiki diverges from its importance in the base wiki.

In [6]:
df.wl.instance_of(Concepts.FILM).wl.kl_divergence(base_wiki="enwiki", target_wiki="itwiki", importance_weight=5).nlargest(n_display, "kl_divergence")[[
    "sample_label", "itwiki_title", "kl_divergence", "kl_relative_to_max"
]]

concept_id,sample_label,itwiki_title,kl_divergence,kl_relative_to_max
Q18407,La Dolce Vita,La dolce vita,8.06103e-16,1.0
Q761952,L'Arrivée d'un train en gare de La Ciotat,L'arrivo di un treno alla stazione di La Ciotat,1.937562e-16,0.240362
Q180098,Ben-Hur,Ben-Hur (film 1959),1.821666e-16,0.225984
Q1570686,Partie de cartes,La partita a carte,6.628555e-17,0.08223
Q8665,Workers Leaving the Lumière Factory,L'uscita dalle officine Lumière,5.350822e-17,0.066379
Q19355,Life is Beautiful,La vita è bella (film 1997),3.959905e-17,0.049124
Q4660499,A Visit to the Seaside,A Visit to the Seaside,3.37808e-17,0.041906
Q464032,Cinema Paradiso,Nuovo Cinema Paradiso,3.368082e-17,0.041782
Q2570819,Don Juan,Don Giovanni e Lucrezia Borgia,3.117837e-17,0.038678
Q212775,The Last Emperor,L'ultimo imperatore,2.442929e-17,0.030305
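For intuition about the "trending" score, here is a minimal sketch of the textbook pointwise KL term, `p * log(p / q)`, where `p` is an article's share in the target wiki and `q` its share in the base wiki. This is an assumption about the general idea only: the exact formula used by `kl_divergence` in `pandas_helper.py`, including what `importance_weight` does, is not shown in this notebook, and the scores below are made up.

```python
import math


def pointwise_kl(p_target: float, q_base: float) -> float:
    """One article's contribution to KL(target || base): p * log(p / q)."""
    return p_target * math.log(p_target / q_base)


# Toy scores (made up): (label, share in target wiki, share in base wiki).
toy_scores = [
    ("Film A", 0.50, 0.20),  # more prominent in the target wiki
    ("Film B", 0.30, 0.30),  # equally prominent in both
    ("Film C", 0.20, 0.50),  # less prominent in the target wiki
]

# Rank articles by their contribution, largest first.
ranked = sorted(
    ((label, pointwise_kl(p, q)) for label, p, q in toy_scores),
    key=lambda pair: pair[1],
    reverse=True,
)

# Summing the pointwise terms gives the overall divergence between the two distributions.
total_divergence = sum(score for _, score in ranked)

for label, score in ranked:
    print(f"{label}: {score:.4f}")
```

Articles that are much more prominent in the target wiki than in the base wiki get the largest positive terms, which is why they surface at the top of a ranking like the one above.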


An alternative is to consider Italian-made films under the English or Italian Wikipedia importance ranking, which gives an intuitive sense of "the most culturally relevant Italian films". In this case `Countries.ITALY` also resolves to a Wikidata id available at https://www.wikidata.org/.

In [7]:
df.wl.instance_of(Concepts.FILM).wl.country_of_origin(Countries.ITALY).wl.top_ranked('itwiki', n_display)

concept_id,sample_label,itwiki_pagerank,itwiki_relative_to_max
Q19355,Life is Beautiful,4e-06,1.0
Q3823660,La presa di Roma,3e-06,0.740985
Q76479,A Fistful of Dollars,3e-06,0.692648
Q172837,Bicycle Thieves,3e-06,0.676532
Q1024861,Cabiria,3e-06,0.617405
Q3214027,La canzone dell'amore,2e-06,0.531347
Q3818300,L'allenatore nel pallone,2e-06,0.526271
Q41483,"The Good, the Bad and the Ugly",2e-06,0.516605
Q6379279,The Great Beauty,2e-06,0.458824
Q12018,8½,2e-06,0.458503


Since Wikidata is structured, we can also group these rankings by a specific attribute (e.g. the top-ranked film for each year).

In [8]:
df.wl.instance_of(Concepts.FILM).wl.top_by_year(n=n_display, top_col='enwiki_pagerank')[["sample_label", "publication_date", "enwiki_pagerank"]]

concept_id,sample_label,publication_date,enwiki_pagerank
Q3604746,Avatar 2,2021-12-15,1.231067e-07
Q57177410,Birds of Prey,2020-01-01,2.101868e-07
Q23781155,Avengers: Endgame,2019-04-22,1.92825e-06
Q23780914,Avengers: Infinity War,2018-04-25,1.104982e-06
Q23780734,Black Panther,2017-11-03,1.017793e-06
Q19590955,Rogue One,2016-12-10,1.133212e-06
Q6074,Star Wars Episode VII: The Force Awakens,2015-12-16,1.763776e-06
Q13417189,Interstellar,2014-10-26,9.406503e-07
Q246283,Frozen,2013-11-10,1.251971e-06
Q189330,The Dark Knight Rises,2012-07-20,1.45198e-06
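The grouping above can be pictured as "keep the best-scoring row per publication year, newest year first". A minimal sketch of that idea in plain Python follows; the function body and the toy rows are assumptions for illustration, not the actual `top_by_year` implementation.

```python
from datetime import date

# Toy rows (made up): (label, publication date, score).
rows = [
    ("Film A", date(2019, 4, 22), 1.9e-06),
    ("Film B", date(2019, 7, 1), 0.4e-06),
    ("Film C", date(2018, 4, 25), 1.1e-06),
    ("Film D", date(2018, 1, 1), 1.3e-06),
    ("Film E", date(2017, 11, 3), 1.0e-06),
]


def top_by_year(rows, n):
    """Return the highest-scoring row for each year, for the n most recent years."""
    best = {}
    for label, published, score in rows:
        current = best.get(published.year)
        # Keep this row if the year is new or it beats the current best for that year.
        if current is None or score > current[2]:
            best[published.year] = (label, published, score)
    newest_first = sorted(best.values(), key=lambda row: row[1].year, reverse=True)
    return newest_first[:n]


winners = top_by_year(rows, n=2)
for label, published, score in winners:
    print(published.year, label, score)
```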


Finally, operators can be chained, so we can find the most culturally relevant Italian films grouped by year.

In [9]:
df.wl.instance_of(Concepts.FILM)\
    .wl.country_of_origin(Countries.ITALY)\
    .wl.top_by_year(n=n_display, top_col='itwiki_pagerank')\
    [["sample_label", "itwiki_title", "publication_date", "itwiki_pagerank"]]

concept_id,sample_label,itwiki_title,publication_date,itwiki_pagerank
Q81635374,Hammamet (film),Hammamet (film),2020-01-09,1.66302e-07
Q63213307,The Traitor,Il traditore (film 2019),2019-05-23,6.542975e-07
Q51800111,Dogman,Dogman (film),2018-07-11,6.137035e-07
Q25136757,Call Me by Your Name,Chiamami col tuo nome (film),2017-01-01,8.864058e-07
Q22340123,Like Crazy,La pazza gioia,2016-01-01,5.794918e-07
Q19587078,Tale of Tales,Il racconto dei racconti - Tale of Tales,2015-01-01,6.78691e-07
Q17605404,Il giovane favoloso,Il giovane favoloso,2014-01-01,7.18153e-07
Q6379279,The Great Beauty,La grande bellezza,2013-05-21,2.028506e-06
Q172419,Piazza Fontana: The Italian Conspiracy,Romanzo di una strage,2012-01-01,7.946213e-07
Q1242957,What a Beautiful Day,Che bella giornata,2011-01-05,5.926636e-07


### Other Concepts
The same operators apply to other types of concepts beyond films. For a given article, you can figure out the concepts it belongs to by pressing the "Edit Links" button underneath languages on Wikipedia. The Wikidata page will state the article is an "instance of" some concept (e.g. https://en.wikipedia.org/wiki/Hypnotize_(The_Notorious_B.I.G._song) is an instance of single - [Q134556](https://www.wikidata.org/wiki/Q134556)). Those concepts may also be a subclass of another concept, e.g. single is a subclass of "release" [Q2031291](https://www.wikidata.org/wiki/Q2031291).
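To make the "instance of" / "subclass of" relationship concrete, here is a small sketch that expands a concept to include all of its transitive subclasses before matching articles. Whether `wl.instance_of` follows subclass links like this is not stated in this notebook, so treat the helper `with_subclasses` and every `toy_` id below as hypothetical.

```python
def with_subclasses(concept_id, subclass_of):
    """Return concept_id plus every concept that is (transitively) a subclass of it."""
    found = {concept_id}
    frontier = [concept_id]
    while frontier:
        parent = frontier.pop()
        for child, parents in subclass_of.items():
            if parent in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found


# child concept -> its direct parent concepts ("subclass of" statements).
subclass_of = {
    "Q134556": {"Q2031291"},          # single is a subclass of release
    "toy_remix_single": {"Q134556"},  # made-up concept id
}

# article -> concept it is an "instance of" (toy data).
instance_of = {
    "toy_song_article": "Q134556",
    "toy_remix_article": "toy_remix_single",
    "toy_city_article": "toy_city",
}

release_like = with_subclasses("Q2031291", subclass_of)
matches = sorted(a for a, concept in instance_of.items() if concept in release_like)
print(matches)
```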

For example, here are the most culturally relevant singles for English speakers:

In [10]:
df.wl.instance_of("Q134556").wl.top_ranked('enwiki', n_display)

concept_id,sample_label,enwiki_pagerank,enwiki_relative_to_max
Q1472773,"Oh, Pretty Woman",1.620994e-06,1.0
Q214430,Like a Rolling Stone,1.459345e-06,0.900278
Q210211,Amazing Grace,1.306221e-06,0.805815
Q653991,All I Want for Christmas Is You,1.214108e-06,0.74899
Q308895,We Are the World,1.166166e-06,0.719415
Q890,Gangnam Style,1.075949e-06,0.663759
Q1188494,Do They Know It's Christmas?,1.069725e-06,0.659919
Q1067025,Hollaback Girl,1.011987e-06,0.6243
Q161402,Over the Rainbow,9.938199e-07,0.613093
Q154968,Lili Marleen,9.814362e-07,0.605453


If articles underneath those concepts often have a publication date, then you can also crunch by date. For example, the most culturally relevant single for each year (the slice skips the first `n_display` rows, i.e. the most recent years):

In [11]:
df.wl.instance_of("Q134556").wl.top_by_year(n=(n_display*2), top_col='enwiki_pagerank')[["enwiki_title", "publication_date", "enwiki_pagerank"]][n_display:]

concept_id,enwiki_title,publication_date,enwiki_pagerank
Q1921383,Music (Madonna song),2000-08-21,4.792736e-07
Q2539121,Expo 2000 (song),1999-12-01,4.462832e-07
Q1932351,Iris (song),1998-04-01,7.109357e-07
Q5463103,Fly (Sugar Ray song),1997-06-17,5.48648e-07
Q908516,Don't Speak,1996-04-15,7.15579e-07
Q1193181,Torn (Ednaswap song),1995-01-01,6.914456e-07
Q653991,All I Want for Christmas Is You,1994-10-29,1.214108e-06
Q1149738,Macarena,1993-08-15,4.691665e-07
Q1165194,One (U2 song),1992-02-02,2.556562e-07
Q214113,Black or White,1991-10-11,4.025967e-07


Wikidata gets pretty wacky, so you can rank all kinds of strange things. Here are the most culturally relevant **humans**

In [None]:
df.wl.instance_of(Concepts.HUMAN).wl.top_ranked(n=n_display_short, wiki='enwiki') 

Or the most culturally relevant **wonders of the ancient world**

In [None]:
df.wl.instance_of(Concepts.WONDERS_OF_THE_ANCIENT_WORLD).wl.top_ranked(n=n_display_short, wiki='enwiki') 

My workflow is typically to find an article, look up its Wikidata entry, find the concept id and start crunching. For example: [Pizza](https://en.wikipedia.org/wiki/Pizza) -> [Instance of Types of Food or Dish](https://www.wikidata.org/wiki/Q177) -> [Types of Food or Dish (Q19861951)](https://www.wikidata.org/wiki/Q19861951)

In [None]:
df.wl.instance_of('Q19861951').wl.top_ranked(n=n_display_short, wiki='enwiki') 

You can also use this in reverse to find the least relevant types of food or dish

In [None]:
df.wl.instance_of('Q19861951').wl.top_ranked(n=n_display_short, wiki='enwiki', desc=False) 

## Geographic Queries

Many Wikipedia articles also have co-ordinates in Wikidata. That means we can do searches relative to a locality. For example, the most culturally relevant tourist attractions within 50 km of Palo Alto:

In [None]:
palo_alto = df.wl.resolve("Palo Alto, California", col="enwiki_title")
palo_alto.wl.within_radius(df, 50).wl.instance_of(Concepts.TOURIST_ATTRACTION).wl.top_ranked('enwiki', n_display)[['sample_label', 'enwiki_pagerank']]

This also works with other Wikipedias. For example, which tourist attractions are more culturally relevant to Italians than to English speakers?

In [None]:
palo_alto = df.wl.resolve("Palo Alto, California", col="enwiki_title")
palo_alto.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.TOURIST_ATTRACTION)\
    .wl.kl_divergence(base_wiki="enwiki", target_wiki="itwiki", importance_weight=5)\
    .nlargest(n_display, "kl_divergence")\
    [['sample_label', 'kl_divergence']]

Perhaps this is more interesting when we consider tourist attractions or museums in other countries. What do English speakers find in Rome?

In [None]:
rome = df.wl.resolve("Rome", col="enwiki_title")
rome.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.TOURIST_ATTRACTION)\
    .wl.top_ranked('enwiki', n_display)

But what hidden gems among tourist attractions do Italians know about that English speakers don't?

In [None]:
rome = df.wl.resolve("Rome", col="enwiki_title")
rome.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.TOURIST_ATTRACTION)\
    .wl.kl_divergence(base_wiki="enwiki", target_wiki="itwiki", importance_weight=5)\
    .nlargest(n_display, "kl_divergence")\
    [['sample_label', 'kl_divergence']]

Or top museums according to Italians?

In [None]:
rome = df.wl.resolve("Rome", col="enwiki_title")
rome.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.MUSEUM)\
    .wl.top_ranked("itwiki", n_display)
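The radius filter used throughout this section can be sketched with a great-circle (haversine) distance. This is only an illustration: how `within_radius` is actually implemented in `pandas_helper.py` is not shown here, the helper names are made up, and the coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(center, places, radius_km):
    """Return the names of places no further than radius_km from center."""
    return [
        name
        for name, (lat, lon) in places.items()
        if haversine_km(center[0], center[1], lat, lon) <= radius_km
    ]


# Approximate coordinates, for illustration only.
palo_alto = (37.4419, -122.1430)
places = {
    "Stanford University": (37.4275, -122.1697),
    "San Francisco": (37.7749, -122.4194),
    "Los Angeles": (34.0522, -118.2437),
}

nearby = within_radius(palo_alto, places, radius_km=50)
print(nearby)
```

Filtering by distance first and ranking afterwards mirrors the order of the chained calls in the cells above: narrow the dataframe to a locality, then apply `instance_of` and a ranking.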