# 🛠️ Setup – Run once

In [1]:
# @title Code download
# @markdown This block initialises the notebook.
!rm -rf sample_data
!mkdir -p data export plots
!wget -O wikilit.zip "https://userpage.fu-berlin.de/fu7182rx/wikilit.zip"
!unzip -o wikilit.zip
!rm wikilit.zip

--2023-10-06 13:11:51--  https://userpage.fu-berlin.de/fu7182rx/wikilit.zip
Resolving userpage.fu-berlin.de (userpage.fu-berlin.de)... 130.133.4.196
Connecting to userpage.fu-berlin.de (userpage.fu-berlin.de)|130.133.4.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3186 (3.1K) [application/zip]
Saving to: ‘wikilit.zip’


2023-10-06 13:11:52 (145 MB/s) - ‘wikilit.zip’ saved [3186/3186]

Archive:  wikilit.zip
  inflating: requirements.txt        
  inflating: user-config.py          
  inflating: wikilit.py              


In [2]:
# @title Package installation
# @markdown This block installs the required packages.
!pip install -r requirements.txt



In [3]:
# @title Package imports
# @markdown This block imports the necessary packages.
from wikilit import wikilit
import polars as pl
import plotly.express as px
# Colab files handling
from google.colab import files
import os

# 📚 Selection Method

This notebook supports multiple selection methods that determine the way the list of articles is compiled.

__Note:__ Page names may be written with spaces (`Charlotte Brontë`) or with underscores (`Charlotte_Brontë`); both forms are accepted.
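
How `wikilit` normalises titles internally is not shown in this notebook. As a minimal sketch, assuming the usual MediaWiki convention that underscores and spaces are interchangeable in page titles, the two spellings can be reduced to a single form like this (`normalize_title` is a hypothetical helper, not part of `wikilit`):

```python
def normalize_title(title: str) -> str:
    """Turn underscores into spaces and collapse surrounding or repeated whitespace."""
    return " ".join(title.replace("_", " ").split())


# Both spellings end up as the same string.
print(normalize_title("Charlotte_Brontë"))
print(normalize_title("Charlotte Brontë") == normalize_title("Charlotte_Brontë"))
```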

## Article’s language links (`langlinks`)
This method collects all interlanguage links of a given article, i.e. the versions of that article in the other language editions of the site. The `selection` should be an article title.

Examples: `Charlotte Brontë`, `Virginia Woolf`, `César Aira`, `Herta Müller`, `Elfriede Jelinek`, `Christa Wolf`, `Nikos Kazantzakis`.
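
To make "interlanguage links" concrete, here is a rough sketch based on the MediaWiki Action API (`action=query` with `prop=langlinks`). It is independent of `wikilit`, whose internals are not shown here and may work differently. The helper names and the `sample` response are hand-written for illustration; the sample only mimics the shape of a real response and is not real API output.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def langlinks_url(title: str) -> str:
    """Build a query URL that asks for the interlanguage links of one page."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_langlinks(response: dict) -> list[tuple[str, str]]:
    """Extract (language code, title) pairs from a decoded JSON response."""
    links = []
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            # In the default JSON format the linked title is stored under "*".
            links.append((link["lang"], link["*"]))
    return links


# Hand-written sample in the shape of an API response (illustrative only).
sample = {
    "query": {
        "pages": {
            "1": {
                "title": "Elfriede Jelinek",
                "langlinks": [
                    {"lang": "de", "*": "Elfriede Jelinek"},
                    {"lang": "fr", "*": "Elfriede Jelinek"},
                ],
            }
        }
    }
}

print(langlinks_url("Elfriede Jelinek"))
print(parse_langlinks(sample))
```

A single request may not return every link; a complete client would also follow the API's continuation parameters, which this sketch leaves out.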

In [12]:
selection_method = "langlinks"
article_title = "Elfriede Jelinek" # @param ["Charlotte Brontë", "Virginia Woolf", "César Aira", "Herta Müller", "Elfriede Jelinek", "Christa Wolf", "Nikos Kazantzakis"] {allow-input: true}
selection = article_title
site = "wikipedia" # @param ["wikipedia"] {allow-input: true}
lang = "en" # @param {type:"string"}

## Articles of a category (`category`)
This method selects all articles in a MediaWiki category and its subcategories. The `selection` should be a category title.

Examples: `19th-century English women writers`, `Chinese women novelists`, `North Korean novelists`.
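
Collecting "a category and its subcategories" amounts to walking a graph. The sketch below shows one way to do that on a small in-memory example; it does not reflect how `wikilit` is implemented. The function `collect_articles` and the toy category data are hypothetical. Because wiki categories can contain cycles, visited categories are tracked so that each one is processed only once.

```python
from collections import deque


def collect_articles(root, subcategories, articles):
    """Breadth-first walk from `root`, gathering articles from every reachable category."""
    seen = {root}
    queue = deque([root])
    found = set()
    while queue:
        category = queue.popleft()
        found.update(articles.get(category, []))
        for child in subcategories.get(category, []):
            if child not in seen:  # guards against cycles and repeated visits
                seen.add(child)
                queue.append(child)
    return sorted(found)


# Toy data (made up): the two categories deliberately reference each other.
subcategories = {
    "Novelists": ["Women novelists"],
    "Women novelists": ["Novelists"],
}
articles = {
    "Novelists": ["Author A"],
    "Women novelists": ["Author B", "Author A"],
}

print(collect_articles("Novelists", subcategories, articles))
```

An article that appears in several subcategories is returned only once, since results are gathered in a set.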

In [4]:
selection_method = "category"
category_title = "Brazilian LGBT novelists" # @param ["19th-century English women writers", "North Korean novelists", "Chinese women novelists", "South Korean women novelists", "Brazilian LGBT novelists"] {allow-input: true}
selection = category_title
site = "wikipedia" # @param ["wikipedia"] {allow-input: true}
lang = "en" # @param {type:"string"}

## Articles from a file (`file`)
This method reads article titles from a tab-separated values (TSV) file you supply.

The resulting dataframe retains all columns of the supplied file and includes the page stats as additional columns. This allows you to enrich existing datasets.

This method is especially useful when combined with output from the [Wikidata Query Service](https://query.wikidata.org/).

`article_title_column` names the column of the supplied file that holds the article titles.
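
Before uploading, it can help to check that the file really is tab-separated and contains the chosen column. The following is a minimal sketch using only the Python standard library; `read_titles` is a hypothetical helper and the sample rows are made up. The column name `desc` matches the default value of `article_title_column` in the cell below.

```python
import csv
import io


def read_titles(tsv_text: str, column: str) -> list[str]:
    """Return the values of `column` from TSV text, or raise if the column is missing."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"Column {column!r} not found; available: {reader.fieldnames}")
    return [row[column] for row in reader]


# Made-up sample file content: a header row followed by two data rows.
sample_tsv = "id\tdesc\n1\tCharlotte Brontë\n2\tChrista Wolf\n"

print(read_titles(sample_tsv, "desc"))
```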

In [4]:
selection_method = "file"
article_title_column = "desc"  # @param {type:"string"}
site = "wikipedia"  # @param ["wikipedia"] {allow-input: true}
lang = "en" # @param ["en", "de", "fr", "es", "ja", "ru", "pt", "zh", "it", "ar"] {allow-input: true}

# Upload file
uploaded = files.upload()
file_list = list(uploaded.keys())

# Check that exactly one file was uploaded (a cancelled upload yields none)
if len(file_list) != 1:
    for file in file_list:
        os.remove(file)
    raise ValueError("Please upload exactly one file.")

filename = file_list[0]
if not filename.endswith(".tsv"):
    os.remove(filename)
    raise ValueError("File must be TSV.")

# Move file to data/ directory
dest = "data/input.tsv"
os.rename(filename, dest)
print(f"File saved as {dest}")
selection = dest

Saving test.tsv to test.tsv
File saved as data/input.tsv


# 🛰️ Data Retrieval

In [13]:
# @title Fetch
print(
    f"Retrieving data using selection_method='{selection_method}' and selection='{selection}' from {lang}.{site}"
)

if "article_title_column" not in locals():
    article_title_column = None

data = wikilit(
  selection_method=selection_method,
  selection=selection,
  lang=lang,
  site=site,
  article_title_column=article_title_column
)

print(data)
data.write_csv("data/output.tsv", separator="\t")
print("Wrote data to data/output.tsv")


Retrieving data using selection_method='langlinks' and selection='Elfriede Jelinek' from en.wikipedia



Site wikipedia:be-tarask instantiated using different code "be-x-old"


Site wikipedia:no instantiated using different code "nb"

Getting page stats: 100%|██████████| 101/101 [04:19<00:00,  2.57s/it, zh-yue:耶利內克]

shape: (101, 15)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ title      ┆ url        ┆ length ┆ n_contrib ┆ … ┆ first_rev ┆ selection ┆ selection ┆ selection │
│ ---        ┆ ---        ┆ ---    ┆ utors     ┆   ┆ ision     ┆ _method   ┆ ---       ┆ _case     │
│ str        ┆ str        ┆ i64    ┆ ---       ┆   ┆ ---       ┆ ---       ┆ str       ┆ ---       │
│            ┆            ┆        ┆ i64       ┆   ┆ datetime[ ┆ str       ┆           ┆ str       │
│            ┆            ┆        ┆           ┆   ┆ μs]       ┆           ┆           ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ Elfriede   ┆ https://af ┆ 3080   ┆ 4         ┆ … ┆ 2017-03-2 ┆ langlinks ┆ Elfriede  ┆ af:Elfrie │
│ Jelinek    ┆ .wikipedia ┆        ┆           ┆   ┆ 6         ┆           ┆ Jelinek   ┆ de        │
│            ┆ .org/wiki/ ┆        ┆           ┆   ┆ 16:31:59  ┆          




In [10]:
# @title Data export
format = "TSV"  # @param ["TSV", "Excel"]
filename = "output"  # @param {type:"string"}

export_dir = "export"

output_file = os.path.join(export_dir, filename)

if format == "TSV":
    output_file += ".tsv"
    data.write_csv(output_file, separator="\t")
elif format == "Excel":
    output_file += ".xlsx"
    data.write_excel(output_file)

files.download(output_file)


# 📊 Plotting

This section plots the retrieved data, either as a bar chart of one variable per page or as a scatter plot of two variables with a regression line.


In [24]:
# @title Bar plot

template = "plotly_white"
yaxis_var = "length" # @param ["length", "n_contributors", "n_revisions", "n_extlinks", "n_langlinks", "n_links", "n_linkshere", "n_categories", "pageviews_60d", "first_revision"] {allow-input: true}
yaxis_type = "Auto" # @param ["Auto", "linear", "log", "date"] {allow-input: true}
if yaxis_type == "Auto":
  yaxis_type = "-"

title = "Auto" # @param ["Auto"] {allow-input: true}
if title == "Auto":
  title = f"{selection_method.title()}: {selection}"
xaxis_title = "Auto" # @param ["Auto"] {allow-input: true}
if xaxis_title == "Auto":
  xaxis_title = "Page"
yaxis_title = "Auto" # @param ["Auto"] {allow-input: true}
if yaxis_title == "Auto":
  yaxis_title = yaxis_var


# Bar chart: the chosen variable for each page, sorted in descending order
plt = px.bar(
    data.sort(yaxis_var, descending=True).to_pandas(),
    x="selection_case",
    y=yaxis_var,
    template=template,
    color_discrete_sequence=px.colors.qualitative.Set2
)

plt.update_layout(
    title=title,
    xaxis_title=xaxis_title,
    yaxis_title=yaxis_title,
    # Rotate x-axis labels 45 degrees
    xaxis_tickangle=-45,
    yaxis_type=yaxis_type,
)

plt.show()

In [20]:
# @title Regression plot

template = "plotly_white"
xaxis_var = "length"  # @param ["length", "n_contributors", "n_revisions", "n_extlinks", "n_langlinks", "n_links", "n_linkshere", "n_categories", "pageviews_60d", "first_revision"] {allow-input: true}
yaxis_var = "n_links"  # @param ["length", "n_contributors", "n_revisions", "n_extlinks", "n_langlinks", "n_links", "n_linkshere", "n_categories", "pageviews_60d", "first_revision"] {allow-input: true}

title = "Auto"  # @param ["Auto"] {allow-input: true}
if title == "Auto":
    title = f"{selection_method.title()}: {selection}"
xaxis_title = "Auto"  # @param ["Auto"] {allow-input: true}
if xaxis_title == "Auto":
    xaxis_title = xaxis_var
yaxis_title = "Auto"  # @param ["Auto"] {allow-input: true}
if yaxis_title == "Auto":
    yaxis_title = yaxis_var


# Scatter plot: xaxis_var against yaxis_var, one point per page
plt = px.scatter(
    data.to_pandas(),
    x=xaxis_var,
    y=yaxis_var,
    template=template,
    color_discrete_sequence=px.colors.qualitative.Set2,
    # Add an OLS regression line (Plotly relies on statsmodels for this)
    trendline="ols",
    hover_name="selection_case",
)

plt.update_layout(
    title=title,
    xaxis_title=xaxis_title,
    yaxis_title=yaxis_title,
)

plt.show()


## Plot Export

You may export your generated plot as a static image or interactive HTML.

### Formats
__HTML__: Interactive plot as shown above. May be viewed using a browser.  
__SVG__: Scalable vector graphic. Static image that may be resized without becoming blurry.  
__PNG__: Static raster image. May be pixelated or blurry due to Google Colab rendering.


### Options
`format`: Format to produce  
`filename`: Desired filename without extension  

__Non-interactive formats only (SVG/PNG)__  
`width` and `height`: Dimensions in pixels  
`scale`: Scale factor to use when exporting the figure. A scale factor larger than 1.0 will increase the image resolution with respect to the figure’s layout pixel dimensions.
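
As a worked example of how `scale` interacts with `width` and `height` for a raster export, the pixel dimensions of the resulting PNG are the layout dimensions multiplied by the scale factor. The helper `exported_size` below is hypothetical and exists only to show the arithmetic; for SVG the output is a vector graphic, so resolution is not a concern in the same way.

```python
def exported_size(width: int, height: int, scale: float) -> tuple[int, int]:
    """Pixel dimensions of a raster export for the given layout size and scale factor."""
    return round(width * scale), round(height * scale)


print(exported_size(1200, 600, 1))  # (1200, 600)
print(exported_size(1200, 600, 2))  # (2400, 1200)
```

With the default values in the cell below (`width = 1200`, `height = 600`, `scale = 1`), the PNG is 1200 by 600 pixels; raising `scale` to 2 doubles both dimensions without changing the figure's layout.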

In [9]:
# @title Options
format = "HTML"  # @param ["HTML", "SVG", "PNG"]
filename = "plot"  # @param {type:"string"}
width = 1200  # @param {type:"number"}
height = 600  # @param {type:"number"}
scale = 1  # @param {type:"number"}

plots_dir = "plots"

output_file = os.path.join(plots_dir, filename)

if format == "HTML":
    output_file += ".html"
    plt.write_html(output_file)
elif format in ["PNG", "SVG"]:
    output_file += "." + format.lower()
    plt.write_image(
        output_file,
        scale=scale,
        width=width,
        height=height,
    )

files.download(output_file)
