<a href="https://colab.research.google.com/github/yojuna/experiments/blob/main/wikipedia_edit_vandalism_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## wikipedia article analysis

### motivation: in our post-truth internet, the ability to find reliable information should be a fundamental right for all.

Wikipedia is still one of the most important sources of our collective knowledge. However, in the wars of propaganda, it is by far the lowest hanging fruit that can be attacked to manipulate public consciousness.

The following is an attempt to glean insights from publicly available information to assess the quality of a Wikipedia article and, in the process, to develop a reproducible workflow that automates this analysis by automagically pulling in the required info and processing the data.

To date, there have been extensive open source efforts toward this kind of goal. This work aims to build on top of those amazing contributions and also to document them.



## related work

### wikipedia-histories

https://github.com/ndrezn/wikipedia-histories

https://txtlab.org/2020/09/do-wikipedia-editors-specialize/

From the article:

"

One of the students in our lab, Nathan Drezner, has a new collaboration out entitled, “[Everyday Specialization: The coherence of editorial communities on Wikipedia.](https://txtlab.org/wp-content/uploads/2020/09/WikipediaCommunities_2020.pdf)”

In this paper, Drezner studies edit histories of over 30,000 Wiki pages across four different cultural domains (science, sports, culture, and politics). His goal is to better understand how editors cluster together in smaller subdomains: do editors tend to focus on a single science, or single sport, or single political party or type of art form (books, film, television) or do they range widely in the types of pages they edit?

...

Drezner's work gives us a great look into how editorial communities behave on Wikipedia, helping us see the dynamics of lay users when it comes to knowledge formation. His work highlights the value of this new kind of data — the edit history — as a fascinating resource to better understand editorial behaviour. The data he used in the study is available at our lab [dataverse](https://dataverse.harvard.edu/dataverse/txtlab) and he has also created code at his GitHub repository that others can use to create their own datasets of edit histories by different domains.

"

### mwedittypes

https://techblog.wikimedia.org/2022/06/28/what-is-in-an-edit-automated-detection-of-edit-types-on-wikipedia/

Edit diffs and type detection for Wikipedia. The goal is to transform unstructured edits to Wikipedia articles into a structured summary of what actions were taken in the edit. The library has two major formats (and associated algorithms):

- Simple summary: fast computation of changes that results in a basic summary of counts of changes
- Structured summary: slow but more context-aware computation that provides details of each specific change


https://github.com/geohci/edit-types
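
To make the idea of a "simple summary" concrete, here is a minimal sketch that counts line-level changes between two revision texts using only the standard library's `difflib`. It is not the mwedittypes algorithm (which works on wikitext nodes such as templates, links and references); the function name `simple_edit_summary` and the sample texts are made up for illustration.

```python
import difflib
from collections import Counter


def simple_edit_summary(old_text: str, new_text: str) -> Counter:
    """Count inserted, removed and changed lines between two revisions."""
    matcher = difflib.SequenceMatcher(
        a=old_text.splitlines(), b=new_text.splitlines(), autojunk=False
    )
    summary = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            summary["insert"] += j2 - j1
        elif tag == "delete":
            summary["remove"] += i2 - i1
        elif tag == "replace":
            # A replaced block may differ in length on each side
            summary["change"] += max(i2 - i1, j2 - j1)
    return summary


# Hypothetical revision texts, only for demonstration
old = "Intro line\nSecond line\nThird line"
new = "Intro line\nSecond line, edited\nThird line\nNew closing line"
print(simple_edit_summary(old, new))
```

A structured summary would go further and describe each individual change, which is why it is the slower of the two formats.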

### Analysis of Wikipedia Vandalism

https://github.com/yinghawl/Analysis-of-Vandalism-on-Wikipedia

From the repo description README:

While the goal of this investigation is, in essence, to answer the question "Is the content of Wikipedia articles reliable?", such a question is more subjective than objective.

However, analyzing editing behavior can provide users of the site with some quantitative information about its reliability. The specific questions considered in this analysis are:

- Are the population edit ratios among the top five portals equal? That is, is the category of an article a good predictor of the percentage of edits that are unconstructive?

- Is there a relationship between the number of page views and the number of reverted edits? That is, is the viewership of an article a good predictor of the number of unconstructive edits?

- Is there a difference in the response time among the different portals? That is, is the category of an article a good predictor of how long it takes for an unconstructive edit to be reverted?

- Is there a difference in the response time between pages that are semi-protected versus those that are unprotected? That is, is the protection status of an article a good predictor of how long it takes an unconstructive edit to be reverted?

The terminology used in each question statement is defined as necessary in subsequent sections.
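
As a rough sketch of how the revert-related questions could be quantified (this is not code from the repo above), the snippet below flags reverts by their edit comment and measures how long the preceding edit stayed live. The comment patterns, the `revert_stats` helper, and the assumption that a revert always undoes the immediately preceding revision are simplifications introduced here; the sample records are made up.

```python
import re
from datetime import datetime
from statistics import median

# Heuristic (assumption): revert comments start with one of these phrases
REVERT_PATTERN = re.compile(r"^(reverted|undid revision|rv\b)", re.IGNORECASE)


def is_revert(comment):
    """Return True when an edit comment looks like a revert."""
    return bool(REVERT_PATTERN.match(comment or ""))


def revert_stats(revisions):
    """Compute the revert ratio and the median response time in seconds.

    revisions: list of dicts with 'time' (datetime) and 'comment', oldest first.
    """
    reverts = sum(is_revert(rev["comment"]) for rev in revisions)
    response_times = [
        (cur["time"] - prev["time"]).total_seconds()
        for prev, cur in zip(revisions, revisions[1:])
        if is_revert(cur["comment"])
    ]
    ratio = reverts / len(revisions) if revisions else 0.0
    return ratio, (median(response_times) if response_times else None)


# Made-up revision records, oldest first
sample = [
    {"time": datetime(2024, 2, 9, 10, 0), "comment": "/* Prelude */ added a sentence"},
    {"time": datetime(2024, 2, 9, 10, 5), "comment": "Reverted edit by [[Special:Contribs/Example]]"},
    {"time": datetime(2024, 2, 9, 11, 0), "comment": "copyedit"},
    {"time": datetime(2024, 2, 9, 11, 30), "comment": "Undid revision 123 by [[Special:Contribs/Example]]"},
]
ratio, median_seconds = revert_stats(sample)
print(ratio, median_seconds)
```

Comment matching misses reverts made without a standard summary, so the resulting ratio is best read as a lower bound.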

### other relevant links dump

PyWikiBot

https://www.mediawiki.org/wiki/Manual:Pywikibot

https://github.com/wikimedia/pywikibot



Kaggle notebook of wikipedia data analysis

https://www.kaggle.com/code/evarga/analysis-of-wikipedia-edits/notebook



## Implementation / Experiments

#### notebook housekeeping

In [1]:
# pretty print text outputs

# import pprint
# pp = pprint.PrettyPrinter(indent=4)

from pprint import pprint

In [2]:
# suppress warnings in code output cells
import warnings
warnings.filterwarnings('ignore')

### Gather the data

#### wikipedia python library

https://pypi.org/project/wikipedia/

https://github.com/goldsmith/Wikipedia

In [None]:
# install the package
!  pip install wikipedia

##### Basic Usage

API summary:

https://wikipedia.readthedocs.io/en/latest/code.html

In [3]:
import wikipedia

In [4]:
# https://en.wikipedia.org/wiki/Russian_invasion_of_Ukraine
article_name = 'Russian invasion of Ukraine'

article = wikipedia.page(article_name)

In [5]:
# print(article.content)

pprint(article.content)

('On 24 February 2022, Russia invaded Ukraine in an escalation of the '
 'Russo-Ukrainian War that started in 2014. The invasion became the largest '
 'attack on a European country since World War II. It is estimated to have '
 'caused tens of thousands of Ukrainian civilian casualties and hundreds of '
 'thousands of military casualties. By June 2022, Russian troops occupied '
 'about 20% of Ukrainian territory. About 8 million Ukrainians had been '
 'internally displaced and more than 8.2 million had fled the country by April '
 "2023, creating Europe's largest refugee crisis since World War II. Extensive "
 'environmental damage caused by the war, widely described as an ecocide, '
 'contributed to food crises worldwide.\n'
 "Before the invasion, Russian troops massed near Ukraine's borders as Russian "
 'officials denied any plans to attack. Russian president Vladimir Putin '
 'announced a "special military operation" to support the Russian-backed '
 'breakaway republics of Donetsk 

In [6]:
# article summary

article_summary = wikipedia.summary(article_name)

pprint(article_summary)

('On 24 February 2022, Russia invaded Ukraine in an escalation of the '
 'Russo-Ukrainian War that started in 2014. The invasion became the largest '
 'attack on a European country since World War II. It is estimated to have '
 'caused tens of thousands of Ukrainian civilian casualties and hundreds of '
 'thousands of military casualties. By June 2022, Russian troops occupied '
 'about 20% of Ukrainian territory. About 8 million Ukrainians had been '
 'internally displaced and more than 8.2 million had fled the country by April '
 "2023, creating Europe's largest refugee crisis since World War II. Extensive "
 'environmental damage caused by the war, widely described as an ecocide, '
 'contributed to food crises worldwide.\n'
 "Before the invasion, Russian troops massed near Ukraine's borders as Russian "
 'officials denied any plans to attack. Russian president Vladimir Putin '
 'announced a "special military operation" to support the Russian-backed '
 'breakaway republics of Donetsk 

In [7]:
# other accessible metadata

# article title
print("Article Title:")
pprint(article.title)

# page URL
print("URL:")
pprint(article.url)

# article references
print("References:")
pprint(article.references)

# sections
print("Sections:")
pprint(article.sections)

# links
print("Links:")
pprint(article.links)

Article Title:
'Russian invasion of Ukraine'
URL:
'https://en.wikipedia.org/wiki/Russian_invasion_of_Ukraine'
References:
['https://www.understandingwar.org/backgrounder/ukraine-conflict-updates',
 'https://news.yahoo.com/russia-bombards-kyiv-vows-strike-123428279.html?fr=yhssrp_catchall',
 'https://www.bbc.com/news/world-europe-61773356',
 'https://unric.org/en/the-un-and-the-war-in-ukraine-key-information/',
 'https://www.consilium.europa.eu/en/documents-publications/library/library-blog/posts/think-tank-reports-on-the-invasion-of-ukraine/',
 'https://www.nbcnews.com/news/world/russia-withdraws-snake-island-ukraine-counteroffensive-south-kherson-rcna35874',
 'https://www.bbc.com/news/world-europe-62033619',
 'https://www.france24.com/en/europe/20220716-live-russia-accused-of-shelling-from-zaporizhzhia-nuclear-plant',
 'https://www.cnbc.com/2022/07/23/world-leaders-slam-putins-attack-on-odesa-following-sea-corridor-deal.html',
 'https://www.cnn.com/2022/07/31/europe/ukraine-russia-wa
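
One simple thing that could be done with the reference list is to count how often each source domain is cited. The sketch below is only an illustration: `count_reference_domains` is a hypothetical helper, and the sample list reuses a few URLs from the output above rather than the full `article.references`.

```python
from collections import Counter
from urllib.parse import urlparse


def count_reference_domains(urls):
    """Count how many references point at each domain (ignoring a leading 'www.')."""
    domains = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains[host] += 1
    return domains


# A few URLs copied from the output above
sample_references = [
    "https://www.understandingwar.org/backgrounder/ukraine-conflict-updates",
    "https://www.bbc.com/news/world-europe-61773356",
    "https://unric.org/en/the-un-and-the-war-in-ukraine-key-information/",
    "https://www.bbc.com/news/world-europe-62033619",
]
domain_counts = count_reference_domains(sample_references)
for domain, count in domain_counts.most_common():
    print(domain, count)
```

In the notebook, `count_reference_domains(article.references)` would give the same breakdown for the whole article.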

#### mwclient

https://pypi.org/project/mwclient/

mwclient is a lightweight Python client library to the [MediaWiki API](https://www.mediawiki.org/wiki/API) which provides access to most API functionality.
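
For orientation, this is roughly what a raw MediaWiki API request for revision metadata looks like without a client library. The sketch only builds the URL and does not send it; `build_revisions_query` is a hypothetical helper, and the chosen `rvprop` fields are just one reasonable selection.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_revisions_query(title, limit=100):
    """Build an API URL that asks for the latest revisions of a page."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        # metadata fields to return for every revision
        "rvprop": "ids|timestamp|user|comment|flags",
        "rvlimit": limit,
    }
    return f"{API_URL}?{urlencode(params)}"


url = build_revisions_query("Russian invasion of Ukraine", limit=5)
print(url)
```

A client library such as mwclient wraps this kind of request and also handles continuation when a page has more revisions than fit in one response.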

In [71]:
# installation

! pip install mwclient



##### Basic usage

In [8]:
# basic usage

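from mwclient import Site
from requests.exceptions import ConnectionError

domain = "en.wikipedia.org"
title = "Russian invasion of Ukraine"

try:
    site = Site(domain)
    page = site.pages[title]
except ConnectionError as e:
    # mwclient talks to the API through `requests`, so catch the requests
    # exception rather than the built-in ConnectionError
    print(e)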

In [9]:
# basic page operations
## ref: https://mwclient.readthedocs.io/en/latest/user/page-ops.html

page_text = page.text()

pprint(page_text)

Streaming output truncated to the last 5000 lines.
 '|title=Russia demands NATO roll back from East Europe and stay out of '
 'Ukraine |work=[[Reuters]] '
 '|url=https://www.reuters.com/world/russia-unveils-security-guarantees-says-western-response-not-encouraging-2021-12-17/ '
 '|url-status=live |access-date=24 February 2022 '
 '|archive-url=https://web.archive.org/web/20220222081106/https://www.reuters.com/world/russia-unveils-security-guarantees-says-western-response-not-encouraging-2021-12-17/ '
 '|archive-date=22 February 2022}}</ref> Russia threatened an unspecified '
 'military response if NATO followed an "aggressive line."<ref>{{Cite news '
 '|last=MacKinnon |first=Mark |date=21 December 2021 |title=Putin warns of '
 "unspecified military response if U.S. and NATO continue 'aggressive line' "
 '|work=[[The Globe and Mail]] '
 '|url=https://www.theglobeandmail.com/world/article-putin-warns-of-unspecified-military-response-if-us-and-nato-continue/ '
 '|url-status=l

#### Extract page revision history

NOTE: edited wikipedia-histories library code to limit to a fixed number of articles for processing, as parsing the complete data for an article was taking too much time.
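
A note on how the limit can be applied: slicing a fully built list, as in `list(page.revisions())[:num_edits]`, still walks the whole history before cutting it down. Taking items from the iterator with `itertools.islice` stops as soon as enough have been read. The sketch below demonstrates this with a stand-in generator (`fake_revisions` is made up; a real API iterator fetches in chunks, so it may pull slightly more than requested).

```python
from itertools import islice

fetched = []  # records which revisions the generator actually produced


def fake_revisions(total):
    """Stand-in for a lazy revision iterator; yields the newest revision first."""
    for revid in range(total, 0, -1):
        fetched.append(revid)
        yield {"revid": revid}


def take_revisions(revisions, num_edits=None):
    """Return at most num_edits items from an iterator; None means no limit."""
    return list(islice(revisions, num_edits))


latest = take_revisions(fake_revisions(10_000), num_edits=100)
print(len(latest), len(fetched))
```

Only the requested items are ever produced by the generator, which is the behaviour wanted for articles with very long histories.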

In [11]:
# ref: https://github.com/ndrezn/wikipedia-histories/blob/main/src/wikipedia_histories/get_histories.py

import asyncio
import re
from datetime import datetime
from time import mktime

import aiohttp
import mwparserfromhell as mw
import pandas as pd
from lxml import html
from mwclient import Site
from requests.exceptions import ConnectionError

# from .revision import Revision
"""
Container for Change object
"""

# ref: https://github.com/ndrezn/wikipedia-histories/blob/main/src/wikipedia_histories/revision.py


class Revision:
    """
    Class to track data about each change of a page
    """

    def __init__(self, index, title, time, revid, kind, user, comment, rating, content):
        """
        Create a change object

        :param index: edit level
        :param title: page title
        :param time: time the edit was made
        :param revid: the id number of the revision
        :param kind: minor or non-minor edit
        :param user: the user who made the edit
        :param comment: the comment attached to the edit
        :param rating: the quality of the edit
        :param content: the content of the edit, i.e. the text
        """
        self.index = index
        self.title = title
        self.time = time
        self.revid = revid
        self.kind = kind
        self.user = user
        self.comment = comment
        self.rating = rating
        self.content = content

    def __str__(self):
        return str(self.revid)

    def __repr__(self):
        return str(self.revid)


def _get_users(metadata):
    """
    Pull users, handles hidden user errors
    Parameters:
        metadata: sheet of metadata from mwclient
    Returns:
        the list of users
    """
    users = []
    for rev in metadata:
        try:
            users.append(rev["user"])
        except (KeyError):
            users.append(None)
    return users


def get_kind(metadata):
    """
    Gather edit types (minor or not), handles untagged edits
    Parameters:
        metadata: sheet of metadata from mwclient
    Returns:
        list of True/False representing whether an edit is minor
    """
    kind = []
    for rev in metadata:
        if "minor" in rev:
            kind.append(True)
        else:
            kind.append(False)
    return kind


def get_comment(metadata):
    """
    Check for comments
    Parameters:
        metadata: sheet of metadata from mwclient
    Returns:
        The comments as a list
    """
    comment = []
    for rev in metadata:
        try:
            comment.append(rev["comment"])
        except KeyError:
            comment.append("")
    return comment


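def get_ratings(talk, num_revisions=100):
    """
    Output classes of a page to a list (FA, good, etc.) given a talk page
    Parameters:
        talk: the talk page of the article (an mwclient page object)
        num_revisions: maximum number of talk page revisions to process;
            None processes all of them
    Returns:
        The ratings and timestamps for a page
    """
    timestamps = [rev["timestamp"] for rev in talk.revisions()]
    ratings = []
    content = []

    prev = None
    for cur in talk.revisions(prop="content"):
        # Stop once the requested number of revisions has been collected
        if num_revisions is not None and len(content) >= num_revisions:
            break

        # Entries with a single key are treated as having no usable text,
        # so fall back to the previous revision
        if len(cur) == 1:
            content.append(prev)
        else:
            content.append(cur)

        prev = cur

    print('limiting # of entries to: ', len(content))

    # enumerate keeps each rating aligned with the timestamp of its own
    # revision, even when a revision is skipped
    for i, version in enumerate(content):
        if version is None:
            # No earlier revision was available to fall back on
            continue
        try:
            templates = mw.parse(version.get("*")).filter_templates()
        except IndexError:
            continue

        rate = "NA"
        for template in templates:
            try:
                rate = template.get("class").value
                break
            except ValueError:
                continue

        rating = (rate, datetime.fromtimestamp(mktime(timestamps[i])))
        ratings.append(rating)

    return ratings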


async def get_text(revid, attempts=0, lang_code="en"):
    """
    Pull plain text representation of a revision from API
    Parameters:
        revid: revision id of a page
        attempts: The number of attempts at retrieving the id so far
    """
    try:
        # async implementation of requests get
        async with aiohttp.ClientSession() as session:
            async with session.get(
                f"https://{lang_code}.wikipedia.org/w/api.php",
                params={"action": "parse", "format": "json", "oldid": revid,},
            ) as resp:
                response = await resp.json()
    # request errors from server
    except Exception:
        if attempts == 10:
            return -1
        # If there's a server error, re-send the request, giving up after 10 retries
        return await get_text(revid, attempts=attempts + 1, lang_code=lang_code)
    # Check if page was deleted (deleted pages have no text and are therefore un-parsable)
    try:
        raw_html = response["parse"]["text"]["*"]
    # Page error (represents deleted pages)
    except KeyError:
        return None
    # Parse raw html from response
    document = html.document_fromstring(raw_html)
    text = document.xpath("//p")
    paragraphs = []
    for paragraph in text:
        paragraphs.append(paragraph.text_content())

    # Put everything together
    cur = "".join(paragraphs)

    return cur


async def get_texts(revids, lang_code="en"):
    """
    Get the text of articles given the list of revision ids

    Parameters:
        revids: A list of revids (type int) correlating to article revisions
    Returns:
        The text for each revision id
    """
    # Container for the revision texts
    texts = []

    # Gather body content of all revisions (asynchronously), in batches of `batch_size`
    batch_size = 100
    for i in range(0, len(revids), batch_size):
        texts += await asyncio.gather(
            *(get_text(revid, lang_code=lang_code) for revid in revids[i : i + batch_size])
        )
    return texts


def get_history(title, include_text=True, domain="en.wikipedia.org", num_edits=100):
    """
    Collects everything and returns a list of Revision objects

    Parameters:
        title: article title
        include_text: Whether to include body text or not. Speed increases if False
        domain: domain of the wiki to query
        num_edits: maximum number of page and talk page revisions to process
    Returns:
        A list of Revision objects representing each revision to the article
    """

    # Load the article
    try:
        print('extracting page: ', title)
        site = Site(domain)
        page = site.pages[title]
    except ConnectionError:
        return -1
    try:
        print('extracting page talk history')
        talk = site.pages["Talk:" + title]
    except Exception:
        return -1
    # ratings = get_ratings(talk)
    ratings = get_ratings(talk, num_revisions=num_edits)

    # Collect metadata information
    # metadata = list(page.revisions())
    ## limit to fixed number of pages, to avoid long waiting times
    # if num_edits:
    #   print("extracting only top ", num_edits, " edits")
    #   metadata = list(page.revisions())[:num_edits]
    # else:
    #   metadata = list(page.revisions())

    print('extracting page revisions: ')
    print('limiting to number of edits: ', num_edits)
    metadata = list(page.revisions())[:num_edits]
    users = _get_users(metadata)
    kind = get_kind(metadata)
    comments = get_comment(metadata)

    # Collect list of revision ids using the metadata pull
    revids = [rev["revid"] for rev in metadata]

    # Get the text of the revisions. Performance is improved if this isn't done, but you lose the revision text
    if include_text:
        lang_code = extract_lang_code_from_domain(domain)
        # get_texts is a coroutine function, so it must be driven by an event loop;
        # calling it bare only returns a coroutine object and texts[i] below would fail.
        # asyncio.run works in a plain script but raises RuntimeError inside a
        # notebook, where an event loop is already running.
        texts = asyncio.run(get_texts(revids, lang_code))
    else:
        texts = [""] * len(metadata)

    # Iterate backwards through our metadata and put together the list of change items
    history = []
    for i in range(len(metadata) - 1, -1, -1):
        # Iterate against talk page editions
        time = datetime.fromtimestamp(mktime(metadata[i]["timestamp"]))
        rating = "NA"

        for item in ratings:
            if time > item[1]:
                rating = item[0]
                break

        change = Revision(
            i,
            title,
            time,
            metadata[i]["revid"],
            kind[i],
            users[i],
            comments[i],
            rating,
            texts[i],
        )

        # Compile the list of changes
        history.append(change)

    return history


def to_df(changes):
    """
    Make a dataframe out of the change objects

    Parameters:
        changes: A list of changes
    Returns:
        A DataFrame representation of the changes
    """
    df = []

    for change in changes:
        row = dict(
            title=change.title,
            time=change.time,
            revid=change.revid,
            kind=change.kind,
            user=change.user,
            comment=change.comment,
            rating=change.rating,
            text=change.content,
        )
        df.append(row)
    return pd.DataFrame(df)


def extract_lang_code_from_domain(domain: str) -> str:
    # Escape the dots so they match literally instead of matching any character
    match = re.match(r"([a-z-]+)\.wikipedia\.org", domain)
    if match:
        return match.group(1)
    return ""




In [14]:
article_name = "Russian invasion of Ukraine"

domain = "en.wikipedia.org"
include_text=False
# ^^ include_text=True fails when running in notebook due to asyncio loop issue

# limit the number of entries to keep the edit count manageable for articles with a large revision history;
# use `num_edits=None` to get the full history, which takes a long time for this example
num_edits=100

# get revision history for article
article_hist = get_history(article_name, include_text=include_text, domain=domain, num_edits=num_edits)



extracting page:  Russian invasion of Ukraine
extracting page talk history




limiting # of entries to:  100




extracting page revisions: 
limiting to number of edits:  100
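
On the `include_text=True` problem: notebooks already run an event loop, so `asyncio.run` cannot be called directly from a cell. One possible workaround, sketched below under that assumption, is to run the coroutine in a fresh loop on a worker thread. `run_coroutine_blocking` and `fake_get_texts` are hypothetical names used for illustration; `fake_get_texts` stands in for the real `get_texts` so the example needs no network access.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_coroutine_blocking(coro):
    """Run a coroutine to completion, whether or not a loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain script): asyncio.run can be used directly
        return asyncio.run(coro)
    # A loop is already running (e.g. in a notebook): use a fresh loop in a worker thread
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def fake_get_texts(revids):
    """Stand-in for get_texts that returns placeholder text for each revision id."""
    await asyncio.sleep(0)
    return [f"text of {revid}" for revid in revids]


texts = run_coroutine_blocking(fake_get_texts([1, 2, 3]))
print(texts)
```

Inside `get_history`, the same helper could wrap the `get_texts(revids, lang_code)` call. The calling thread blocks until the worker finishes, which is acceptable for a one-off data pull.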


In [15]:
changes_df = to_df(article_hist)



In [16]:
changes_df



| | title | time | revid | kind | user | comment | rating | text |
|---|---|---|---|---|---|---|---|---|
| 0 | Russian invasion of Ukraine | 2023-12-26 17:08:48 | 1191927807 | True | Ellwat | /* Kherson-Mykolaiv front */ | | |
| 1 | Russian invasion of Ukraine | 2023-12-26 17:09:25 | 1191927884 | True | Ellwat | /* Zaporizhzhia front */ | | |
| 2 | Russian invasion of Ukraine | 2023-12-27 07:21:41 | 1192034246 | False | Penlite | /* 2023 counteroffensives and summer campaign ... | | |
| 3 | Russian invasion of Ukraine | 2023-12-27 07:27:31 | 1192034923 | True | Penlite | /* 2023 counteroffensives and summer campaign ... | | |
| 4 | Russian invasion of Ukraine | 2023-12-27 07:32:04 | 1192035453 | True | Penlite | /* 2023 counteroffensives and summer campaign ... | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Russian invasion of Ukraine | 2024-02-09 05:52:12 | 1205238419 | False | ElderZamzam | /* Ukrainian revolution, Russian intervention ... | (b) | |
| 96 | Russian invasion of Ukraine | 2024-02-09 22:00:18 | 1205535428 | True | TylerBurden | Reverted edit by [[Special:Contribs/ElderZamza... | (b) | |
| 97 | Russian invasion of Ukraine | 2024-02-09 22:51:12 | 1205552077 | False | TylerBurden | Undid revision 1205535428 by [[Special:Contrib... | (b) | |
| 98 | Russian invasion of Ukraine | 2024-02-10 08:24:44 | 1205722277 | False | ElderZamzam | /* Prelude */ Missing remarks from MSC 2022 | (b) | |
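
As a first pass over this table, it could be useful to see who edits most and how much of their activity is flagged as minor. The sketch below works on plain row dictionaries so it runs on its own; `summarize_editors` is a hypothetical helper, and the sample rows copy only the `user` and `kind` values of rows 0 to 4 above.

```python
from collections import Counter


def summarize_editors(rows):
    """Return edit count and share of minor edits per user."""
    edits = Counter(row["user"] for row in rows)
    minor = Counter(row["user"] for row in rows if row["kind"])
    return {
        user: {"edits": count, "minor_share": minor[user] / count}
        for user, count in edits.items()
    }


# `user` and `kind` values taken from rows 0-4 of the table above
sample_rows = [
    {"user": "Ellwat", "kind": True},
    {"user": "Ellwat", "kind": True},
    {"user": "Penlite", "kind": False},
    {"user": "Penlite", "kind": True},
    {"user": "Penlite", "kind": True},
]
summary = summarize_editors(sample_rows)
for user, stats in summary.items():
    print(user, stats)
```

With the real data, `summarize_editors(changes_df.to_dict("records"))` should give the same breakdown for all fetched revisions.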


#### PyWikiBot

In [6]:
# pip installation
# ! pip install pywikibot



Collecting pywikibot
  Downloading pywikibot-8.6.0-py3-none-any.whl (707 kB)
Installing collected packages: pywikibot
Successfully installed pywikibot-8.6.0


In [53]:
## alternative installation from source; github repo clone
## ref: https://github.com/wikimedia/pywikibot?tab=readme-ov-file#quick-start

# ! pip install requests
# ! git clone https://gerrit.wikimedia.org/r/pywikibot/core.git

## ! cd core &&  git submodule update --init
## ! cd core && pip install -r requirements.txt

# use `%cd` to change the working directory in the notebook instead of `!cd` which only changes for each command
%cd /content
%cd core
! pwd
! git submodule update --init
! pip install -r requirements.txt

/content
/content/core
/content/core
Ignoring importlib_metadata: markers 'python_version < "3.8"' don't match your environment


In [59]:
! python pwb.py generate_user_files
# follow the prompts and fill in the values after clicking on the code output to get the input prompt
## can press enter for the default prompt suggestions for wikipedia and english lang

You can abort at any time by pressing ctrl-c

Your default user directory is "/content/core"
 1: commons
 2: foundation
 3: i18n
 4: incubator
 5: lingualibre
 6: mediawiki
 7: meta
 8: osm
 9: outreach
10: species
11: vikidia
12: wikibooks
13: wikidata
14: wikifunctions
15: wikihow
16: wikimania
17: wikimediachapter
18: wikinews
19: wikipedia
20: wikiquote
21: wikisource
22: wikispore
23: wikitech
24: wikiversity
25: wikivoyage
26: wiktionary
27: wowwiki
Select family of sites we are working on, just enter the number or name (default: wikipedia): 
This is the list of known site codes:
th, olo, pag, als, mnw, frr, ki, su, ml, ka, fon, kg, hyw, it, awa, nia, ug,
dga, nah, ny, vo, ky, got, cr, lo, pi, bi, et, csb, nn, bug, lld, bh, din, sv,
sg, nl, zea, tw, mni, iu, ady, ku, sc, yi, avk, ce, bat-smg, mzn, lmo, bbc,
vep, ko, nov, alt, szy, tpi, kbp, tyv, be, rmy, kl, el, trv, af, zh-yue, ms,
bar, cu, anp, guc, fy, oc, sh, sl, jv, eml, tay, az, bm, wuu, qu, yo, dz, gn,
mai, cs, crh, li, go

In [61]:
import pywikibot
site = pywikibot.Site('en', 'wikipedia')  # The site we want to run our bot on
# page = pywikibot.Page(site, 'Wikipedia:Sandbox')

ImportError: cannot import name 'RateLimit' from 'pywikibot.tools.collections' (/usr/local/lib/python3.10/dist-packages/pywikibot/tools/collections.py)

In [None]:
## alternate config instructions:

# # configure PyWikibot
# ## https://www.mediawiki.org/wiki/Manual:Pywikibot/Installation#Configure_Pywikibot


# git clone https://github.com/wikimedia/pywikibot.git --depth 1
# python3 -m pip install -U setuptools
# python3 -m pip install -e pywikibot/
# cd pywikibot/
# python3 pwb.py generate_family_file.py https://url.to.your/wiki/api.php? mywikiname
# python3 pwb.py generate_user_files.py
# # follow the prompts

In [62]:
%cd /content
! git clone https://github.com/wikimedia/pywikibot.git --depth 1
! python3 -m pip install -U setuptools
! python3 -m pip install -e pywikibot/

/content
Cloning into 'pywikibot'...
remote: Enumerating objects: 600, done.
remote: Counting objects: 100% (600/600), done.
remote: Compressing objects: 100% (488/488), done.
remote: Total 600 (delta 120), reused 415 (delta 102), pack-reused 0
Receiving objects: 100% (600/600), 1.88 MiB | 10.47 MiB/s, done.
Resolving deltas: 100% (120/120), done.
Collecting setuptools
  Downloading setuptools-69.0.3-py3-none-any.whl (819 kB)
Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 67.7.2
    Uninstalling setuptools-67.7.2:
      Successfully uninstalled setuptools-67.7.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.34.0 requires jedi>=0.16, w

UsageError: Line magic function `%` not found.


In [63]:
%cd pywikibot/
# ! python3 pwb.py generate_family_file.py https://url.to.your/wiki/api.php? mywikiname
! python3 pwb.py generate_user_files.py
# # follow the prompts

/content/pywikibot
You can abort at any time by pressing ctrl-c

Your default user directory is "/content/pywikibot"
 1: commons
 2: foundation
 3: i18n
 4: incubator
 5: lingualibre
 6: mediawiki
 7: meta
 8: osm
 9: outreach
10: species
11: vikidia
12: wikibooks
13: wikidata
14: wikifunctions
15: wikihow
16: wikimania
17: wikimediachapter
18: wikinews
19: wikipedia
20: wikiquote
21: wikisource
22: wikispore
23: wikitech
24: wikiversity
25: wikivoyage
26: wiktionary
27: wowwiki
Select family of sites we are working on, just enter the number or name (default: wikipedia): 
This is the list of known site codes:
nv, uz, tt, ace, fur, ts, eu, io, cdo, haw, lbe, ta, sr, ckb, pfl, tay, scn,
st, hr, pcd, got, gu, anp, be, km, za, sah, inh, ny, bpy, nl, or, hi, kg, mni,
guc, ia, dga, zh-classical, olo, mnw, ch, nap, mai, pnt, krc, he, mt, pa, chy,
sk, dsb, war, te, nn, crh, es, fr, bm, dv, ko, kv, ext, gag, csb, sm, gpe, map-
bms, lo, ltg, pag, fon, ss, hyw, ga, oc, stq, shn, xh, sv, gl, jam, 

In [64]:
import pywikibot
site = pywikibot.Site('en', 'wikipedia')  # The site we want to run our bot on
# page = pywikibot.Page(site, 'Wikipedia:Sandbox')

ImportError: cannot import name 'RateLimit' from 'pywikibot.tools.collections' (/usr/local/lib/python3.10/dist-packages/pywikibot/tools/collections.py)

NOTE:

PyWikiBot Import fails after installation and config due to:

ImportError: cannot import name 'RateLimit' from 'pywikibot.tools.collections' (/usr/local/lib/python3.10/dist-packages/pywikibot/tools/collections.py)
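
The cause of this error was not tracked down here. Since both a pip-installed release and a source checkout were set up above, one thing worth checking is which copy of the package Python actually resolves and which version is registered as installed. The sketch below uses only the standard library; `describe_install` is a hypothetical helper, and it reports `None` values when the package is not present at all.

```python
import importlib.util
from importlib import metadata


def describe_install(package, module=None):
    """Report where a module would be imported from and which version is installed."""
    spec = importlib.util.find_spec(module or package)
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    return {
        "module_path": spec.origin if spec else None,
        "installed_version": version,
    }


print(describe_install("pywikibot"))
```

If the reported path and version do not match the checkout that is being run, a mismatch between the two installs would be a plausible suspect, though that is an assumption rather than a confirmed diagnosis.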




#### wikipedia-histories

In [5]:
# # install via pip
# ! pip install wikipedia-histories



In [65]:
import wikipedia_histories

article_name = 'Russian invasion of Ukraine'

In [70]:
article_history = wikipedia_histories.get_history(article_name)



KeyboardInterrupt: 

In [4]:
article_history



NameError: name 'article_history' is not defined

In [None]:
## ^^ was taking too long for the selected article, tried using the mwclient
## library with hardcoded modified code from this repo to limit the number of
## revisions being handled