# Wikilanguage Walkthrough for Dave

This interactive notebook shows the different operations you can perform using Wikilanguage. Broadly, it computes the "most culturally relevant" articles within a set of parameters.

**Calculation Modes**:
- Most culturally relevant in English
- Most culturally relevant to other cultures (e.g. Japan)
- Under-appreciated by English speakers relative to another culture
- Most obscure (least culturally relevant)

**Narrowing (Boolean) Operations**:
- Belonging / not belonging to a certain category (e.g. movies, cities, people)
- Country of origin
- Date of publication
- Near to a location (e.g. within 50km of Palo Alto)

**Grouping Modes**:
- By date of release

## Pre-Requirements
Imports -- if something is not working here I've fucked up the requirements.txt

In [1]:
%load_ext autoreload
%autoreload 2
from pathlib import Path
import pandas as pd
import numpy as np
import pandas_helper
from pandas_helper import Concepts, Countries
from tqdm.notebook import tqdm
import requests
n_display = 20
n_display_short = 10
pd.set_option('display.max_rows', n_display)

Dataset download -- this project contains code to recompute from scratch but it requires extensive set-up

In [2]:
url = "https://wikilanguage.storage.googleapis.com/wikilanguage.tsv.gz"
data_path = Path("data/wikilanguage.test.tsv.gz")

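if not data_path.exists():
    print(f"Downloading dataset from {url}")
    # Create the data directory, including any missing parents
    data_path.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(url, stream=True)
    # Fail early on HTTP errors rather than saving an error page to disk
    response.raise_for_status()

    # Stream the file to disk in 1 MiB chunks
    with open(data_path, "wb") as handle:
        for data in tqdm(response.iter_content(chunk_size=1024 * 1024), unit="MiB"):
            handle.write(data)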

Load data file into dataframe

In [None]:
df = pandas_helper.load_data(data_path)

## Demos
### Movies
Concept ids (e.g. `Concepts.FILM`) correspond to ids in the Wikidata database, which can be searched at https://www.wikidata.org/. `wl` is a namespace for Wikilanguage-specific queries implemented in `pandas_helper.py`.

The most culturally relevant movies according to English Wikipedia

In [None]:
df.wl.instance_of(Concepts.FILM).wl.top_ranked("enwiki", n_display)
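The `df.wl...` chaining above relies on pandas' custom accessor mechanism. Here is a minimal sketch of how such a namespace could be registered. It assumes a hypothetical `instance_of` column holding a list of concept ids per row, plus a `<wiki>_pagerank` column per wiki; the real schema and logic in `pandas_helper.py` may well differ. The accessor is named `wl_sketch` so it doesn't clobber the real `wl`.

```python
import pandas as pd


# Hypothetical stand-in for the real `wl` accessor; registered under a
# different name so it does not overwrite the one from pandas_helper.
@pd.api.extensions.register_dataframe_accessor("wl_sketch")
class WikilanguageSketch:
    def __init__(self, frame: pd.DataFrame):
        self._frame = frame

    def instance_of(self, concept_id: str) -> pd.DataFrame:
        # Assumes an `instance_of` column holding a list of Wikidata ids per row.
        mask = self._frame["instance_of"].apply(lambda ids: concept_id in ids)
        return self._frame[mask]

    def top_ranked(self, wiki: str, n: int, desc: bool = True) -> pd.DataFrame:
        # Rank by the wiki's PageRank column, e.g. `enwiki_pagerank`.
        col = f"{wiki}_pagerank"
        return self._frame.sort_values(col, ascending=not desc).head(n)


# Toy data, made up for illustration (Q11424 = film, Q515 = city).
toy = pd.DataFrame({
    "sample_label": ["Film A", "Film B", "City C"],
    "instance_of": [["Q11424"], ["Q11424"], ["Q515"]],
    "enwiki_pagerank": [0.2, 0.5, 0.9],
})
top_films = toy.wl_sketch.instance_of("Q11424").wl_sketch.top_ranked("enwiki", 1)
print(top_films["sample_label"].tolist())
```

Because each method returns a plain DataFrame, the accessor is available again on the result, which is what makes the calls chainable.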

We can calculate this relative to other cultures by manually specifying another wiki (e.g. jawiki for Japan, itwiki for Italy or frwiki for France). Below are the most culturally relevant movies according to Italian Wikipedia

In [None]:
print("Available wikis: ", ", ".join(pandas_helper.load_wikis(data_path)))
df.wl.instance_of(Concepts.FILM).wl.top_ranked("itwiki", n_display)

Because movies are often global, there is a lot of overlap between the top movies on Italian Wikipedia and those on English Wikipedia. A more interesting query is to determine which movies are "trending" in Italian Wikipedia relative to English Wikipedia. We do this with the `kl_divergence` function, which performs a trending calculation

In [None]:
df.wl.instance_of(Concepts.FILM).wl.kl_divergence(base_wiki="enwiki", target_wiki="itwiki", importance_weight=5).nlargest(n_display, "kl_divergence")[[
    "sample_label", "itwiki_title", "kl_divergence", "kl_relative_to_max"
]]
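As a rough illustration of the idea behind a KL-style "trending" score, here's a sketch in plain Python. It treats each wiki's PageRank values as a probability distribution and scores each article by its pointwise contribution to KL(target || base). The helper name, the toy numbers and the omission of `importance_weight` are all assumptions; the real `kl_divergence` in `pandas_helper.py` may be computed differently.

```python
import math


def pointwise_kl(target, base):
    """Per-article contribution p * log(p / q) to KL(target || base).

    Both inputs map article -> non-negative score; each is normalised to
    sum to 1 first. Hypothetical helper, not the pandas_helper code.
    """
    t_total = sum(target.values())
    b_total = sum(base.values())
    scores = {}
    for article, t_score in target.items():
        p = t_score / t_total
        q = base[article] / b_total
        # Articles with zero mass on either side contribute nothing here.
        scores[article] = p * math.log(p / q) if p > 0 and q > 0 else 0.0
    return scores


# Toy PageRank-like scores (made up for illustration).
itwiki_scores = {"Film A": 0.6, "Film B": 0.3, "Film C": 0.1}
enwiki_scores = {"Film A": 0.2, "Film B": 0.3, "Film C": 0.5}

kl_scores = pointwise_kl(itwiki_scores, enwiki_scores)
trending = max(kl_scores, key=kl_scores.get)
print(trending, round(kl_scores[trending], 3))
```

Articles that carry more weight in the target wiki than in the base wiki get a positive score, while those that carry less get a negative one, so sorting by the score surfaces what is over-represented in the target.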

An alternative is to consider only Italian-made films, ranked by either English or Italian Wikipedia importance, which gives an intuitive sense of "the most culturally relevant Italian films". In this case `Countries.ITALY` also resolves to a Wikidata id available at https://www.wikidata.org/

In [None]:
df.wl.instance_of(Concepts.FILM).wl.country_of_origin(Countries.ITALY).wl.top_ranked('itwiki', n_display)

Since Wikidata is structured, we can also apply these types of rankings to group by a specific attribute (e.g. top ranked films by year)

In [None]:
df.wl.instance_of(Concepts.FILM).wl.top_by_year(n=n_display, top_col='enwiki_pagerank')[["sample_label", "publication_date", "enwiki_pagerank"]]

Finally, operators can be nested so we can find the most culturally relevant Italian films crunched by year

In [None]:
df.wl.instance_of(Concepts.FILM)\
    .wl.country_of_origin(Countries.ITALY)\
    .wl.top_by_year(n=n_display, top_col='itwiki_pagerank')\
    [["sample_label", "itwiki_title", "publication_date", "itwiki_pagerank"]]
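Conceptually, grouping by year amounts to keeping the highest-ranked row within each publication year. Here's a plain-Python sketch with made-up rows. `top_per_year` is a hypothetical helper, the ISO date format is an assumption, and the real `top_by_year` may treat `n` and ties differently.

```python
def top_per_year(rows, score_key):
    """Return {year: best_row}, keeping the highest-scoring row per year."""
    best = {}
    for row in rows:
        year = row["publication_date"][:4]  # assumes ISO dates like "1960-02-05"
        if year not in best or row[score_key] > best[year][score_key]:
            best[year] = row
    return dict(sorted(best.items()))


# Made-up rows using column names that appear in this notebook.
film_rows = [
    {"sample_label": "Film A", "publication_date": "1960-02-05", "itwiki_pagerank": 0.4},
    {"sample_label": "Film B", "publication_date": "1960-09-01", "itwiki_pagerank": 0.7},
    {"sample_label": "Film C", "publication_date": "1963-03-14", "itwiki_pagerank": 0.2},
]

winners = top_per_year(film_rows, "itwiki_pagerank")
for year, row in winners.items():
    print(year, row["sample_label"])
```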

### Other Concepts
The same operators apply to other types of concepts beyond films. For a given article, you can figure out which concepts it belongs to by pressing the "Edit links" button underneath the language list on Wikipedia. The Wikidata page will state that the article is an "instance of" some concept (e.g. https://en.wikipedia.org/wiki/Hypnotize_(The_Notorious_B.I.G._song) is an instance of single, [Q134556](https://www.wikidata.org/wiki/Q134556)). Those concepts may also be a subclass of another concept, e.g. single is a subclass of "release", [Q2031291](https://www.wikidata.org/wiki/Q2031291)

For example, here are the most culturally relevant singles for English speakers:

In [None]:
df.wl.instance_of("Q134556").wl.top_ranked('enwiki', n_display)
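The "instance of" and "subclass of" relationships described above form a small graph, and walking it upwards yields every broader concept an article could be filed under. The sketch below is illustrative only: `ancestors` is a hypothetical helper, the second edge uses a placeholder id, and this notebook doesn't say whether `instance_of` follows subclass links at all.

```python
def ancestors(concept, subclass_of):
    """All concepts reachable from `concept` via subclass-of edges."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Q134556 (single) -> Q2031291 (release) is the edge mentioned above;
# "Q_PARENT_OF_RELEASE" is a placeholder, not a real Wikidata id.
subclass_edges = {
    "Q134556": ["Q2031291"],
    "Q2031291": ["Q_PARENT_OF_RELEASE"],
}
print(sorted(ancestors("Q134556", subclass_edges)))
```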

If articles underneath those concepts often have a publication date, then you can also crunch by date. For example, the most culturally relevant singles by year

In [None]:
df.wl.instance_of("Q134556").wl.top_by_year(n=(n_display*2), top_col='enwiki_pagerank')[["enwiki_title", "publication_date", "enwiki_pagerank"]][n_display:]

Wikidata gets pretty wacky, so you can rank all kinds of strange things. Here are the most culturally relevant **humans**

In [None]:
df.wl.instance_of(Concepts.HUMAN).wl.top_ranked(n=n_display_short, wiki='enwiki') 

Or the most culturally relevant **wonders of the ancient world**

In [None]:
df.wl.instance_of(Concepts.WONDERS_OF_THE_ANCIENT_WORLD).wl.top_ranked(n=n_display_short, wiki='enwiki') 

My workflow is typically to find an article, look up its Wikidata entry, find the concept id and start crunching. For example [Pizza](https://en.wikipedia.org/wiki/Pizza) -> [Instance of Types of Food or Dish](https://www.wikidata.org/wiki/Q177) -> [Types of Food or Dish (Q19861951)](https://www.wikidata.org/wiki/Q19861951)

In [None]:
df.wl.instance_of('Q19861951').wl.top_ranked(n=n_display_short, wiki='enwiki') 

You can also use this in reverse to find the least relevant types of food or dish

In [None]:
df.wl.instance_of('Q19861951').wl.top_ranked(n=n_display_short, wiki='enwiki', desc=False) 

## Geographic Queries

Many Wikipedia articles also contain co-ordinates within Wikidata. That means we can do searches relative to a locality. For example, the most culturally relevant tourist attractions near Palo Alto

In [None]:
palo_alto = df.wl.resolve("Palo Alto, California", col="enwiki_title")
palo_alto.wl.within_radius(df, 50).wl.instance_of(Concepts.TOURIST_ATTRACTION).wl.top_ranked('enwiki', n_display)[['sample_label', 'enwiki_pagerank']]
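A radius filter like this can be built on the haversine great-circle distance. The sketch below is plain Python with approximate coordinates. `haversine_km` and `places_within_radius` are hypothetical helpers, and the real `within_radius` in `pandas_helper.py` may use a different distance formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def places_within_radius(center, places, radius_km):
    """Names of places no farther than radius_km from center."""
    lat, lon = center
    return [name for name, (plat, plon) in places.items()
            if haversine_km(lat, lon, plat, plon) <= radius_km]


palo_alto_latlon = (37.44, -122.14)  # approximate coordinates
candidate_places = {
    "San Jose": (37.34, -121.89),
    "Los Angeles": (34.05, -118.24),
}
nearby = places_within_radius(palo_alto_latlon, candidate_places, 50)
print(nearby)
```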

This also works with other Wikipedias. For example, which tourist attractions are more culturally relevant to Italians than to English speakers?

In [None]:
palo_alto = df.wl.resolve("Palo Alto, California", col="enwiki_title")
palo_alto.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.TOURIST_ATTRACTION)\
    .wl.kl_divergence(base_wiki="enwiki", target_wiki="itwiki", importance_weight=5)\
    .nlargest(n_display, "kl_divergence")\
    [['sample_label', 'kl_divergence']]

Perhaps this is more interesting when we consider tourist attractions or museums in other countries. What do English speakers find in Rome?

In [None]:
rome = df.wl.resolve("Rome", col="enwiki_title")
rome.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.TOURIST_ATTRACTION)\
    .wl.top_ranked('enwiki', n_display)

But which hidden-gem tourist attractions do Italians know about that English speakers don't?

In [None]:
rome = df.wl.resolve("Rome", col="enwiki_title")
rome.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.TOURIST_ATTRACTION)\
    .wl.kl_divergence(base_wiki="enwiki", target_wiki="itwiki", importance_weight=5)\
    .nlargest(n_display, "kl_divergence")\
    [['sample_label', 'kl_divergence']]

Or top museums according to Italians?

In [None]:
rome = df.wl.resolve("Rome", col="enwiki_title")
rome.wl.within_radius(df, 50)\
    .wl.instance_of(Concepts.MUSEUM)\
    .wl.top_ranked("itwiki", n_display)