# The Wikipedia API: The Basics

* by [R. Stuart Geiger](http://stuartgeiger.com), released [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)

## The API

An API (Application Programming Interface) is a standardized way for programs to communicate and share data with each other. Wikipedia, like many other wikis, runs on an open source platform called MediaWiki, and the MediaWiki API lets you do almost anything that you can do with the browser.

You want to use the API (rather than just downloading the full text of the HTML page as if you were a web browser) for a few reasons: it uses fewer resources (for you and Wikipedia), it is standardized, and it is very well supported in many different programming languages.
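
To make the idea of a standardized request concrete, here is a minimal sketch that builds (but does not send) a query URL for the English Wikipedia API endpoint. The function name `build_query_url` is made up for this illustration, and `prop=info` is just one of the available property modules.

```python
from urllib.parse import urlencode

# English Wikipedia's API endpoint; other MediaWiki wikis expose their own api.php
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build (but do not send) a URL asking the API for basic info about one page."""
    params = {
        "action": "query",  # the general-purpose read module
        "prop": "info",     # basic page metadata
        "titles": title,
        "format": "json",   # ask for machine-readable JSON
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(build_query_url("Berkeley, California"))
```

Pasting the printed URL into a browser should return JSON rather than a rendered HTML page. The libraries below build and send requests like this for you.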

### API resources
* [The main API documentation](https://www.mediawiki.org/wiki/API:Main_page)
* [The properties modules](https://www.mediawiki.org/wiki/API:Properties)
* [Client code for many languages](https://www.mediawiki.org/wiki/API:Client_code)
* [Etiquette and usage limits](https://www.mediawiki.org/wiki/API:Etiquette) -- most libraries will rate limit for you
* [pywikibot main manual](https://www.mediawiki.org/wiki/Manual:Pywikibot) and [library docs](http://pywikibot.readthedocs.org/en/latest/pywikibot/)


## The wikipedia library
This is the simplest, no-hassle library for querying Wikipedia articles, but it has fewer features than the alternatives. You should use this if you just want to get the text of articles.

In [None]:
!pip install wikipedia
import wikipedia


In this example, we will get the page for Berkeley, California and count the most commonly used words in the article. I'm using nltk, which is a nice library for natural language processing (although it is probably overkill for this).


In [None]:
bky = wikipedia.page("Berkeley, California")
bky

In [None]:
bk_split = bky.content.split()

In [None]:
bk_split[:10]

In [None]:
!pip install nltk
import nltk


In [None]:
fdist1 = nltk.FreqDist(bk_split)
fdist1.most_common(10)
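
Since nltk is probably overkill for a simple word count, here is a rough standard-library sketch of the same idea. The helper name `top_words` and the sample sentence are made up for illustration. Unlike the plain `split()` above, it lowercases the text and drops punctuation, so "City" and "city." count as the same word.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most common words, ignoring case and punctuation."""
    # keep only runs of letters, so "city." and "City" both become "city"
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)


# Made-up sample text; in the notebook you would pass bky.content instead
sample = "Berkeley is a city. The city of Berkeley is in California."
print(top_words(sample, 3))
```

The most common words in a real article will mostly be stopwords such as "the" and "of", so some filtering is usually needed before the counts become interesting.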

A page object from the wikipedia library exposes much more than the text. We can also get all the Wikipedia articles that are linked from a page (`links`), all the external URLs in the page (`references`), or the geographical coordinates of the page (`coordinates`).

There was a study about which domains were most popular in Wikipedia articles.

In [None]:
print(bky.references[:10])
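
As a small illustration of the kind of domain analysis mentioned above, the following sketch counts how often each host name appears in a list of URLs. The function name `count_domains` and the sample URLs are placeholders, not real references from the article.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count how often each host name appears in a list of URLs."""
    hosts = (urlparse(u).netloc.lower() for u in urls)
    # skip anything that did not parse to a host name
    return Counter(h for h in hosts if h)


# Placeholder URLs for illustration; in the notebook, pass bky.references instead
sample_refs = [
    "http://www.example.org/history",
    "https://www.example.org/census",
    "http://archive.example.com/page?id=1",
]
print(count_domains(sample_refs).most_common(2))
```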

In [None]:
print(bky.links[:10])

While the wikipedia package doesn't support categories, Wikipedia has list articles that function similarly, such as [List of colleges and universities in California](https://enwp.org/List_of_colleges_and_universities_in_California).

In [None]:
ca_colleges = wikipedia.page("List of colleges and universities in California")

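for uni in ca_colleges.links:
    try:
        uni_page = wikipedia.page(uni)
    except (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError):
        # skip links that are ambiguous or do not resolve to an article
        continue
    print(uni, len(uni_page.content))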

## Querying using pywikibot

pywikibot is one of the most well-developed and widely used libraries for querying the Wikipedia API. It needs a configuration file (`user-config.py`) in the directory where you are running the Python script. It is often used by bots that edit, so there are many features that are not available unless you log in with a Wikipedia account.

**Note: you can edit pages with pywikibot, but please don't! You have to get approval from Wikipedia's bot approval group, or else your account is likely to be banned.**

In [None]:
!pip install pywikibot
import pywikibot

In [None]:
site = pywikibot.Site()

In [None]:
bky_page = pywikibot.Page(site, "Berkeley, California")
bky_page

In [None]:
# page text with all the wikimarkup and templates
bky_page.text

# page text with templates expanded (the result is still wikitext, not HTML)
bky_page.expand_text()

In [None]:
# All the geographical coordinates linked in a page (may have multiple per article)
bky_page.coordinates()

## Generators


In [None]:
from pywikibot import pagegenerators

In [None]:
cat = pywikibot.Category(site,'Category:Cities in Alameda County, California')

In [None]:
gen = cat.members()

In [None]:
gen

In [None]:
# create an empty list
coord_d = []

In [None]:
for page in gen:
    # Skip subcategories; we only want articles
    if page.is_categorypage():
        continue
    pc = page.coordinates()  # fetch once and reuse
    print(page.title(), pc)
    for coord in pc:
        coord_d.append({'label': page.title(), 'latitude': coord.lat, 'longitude': coord.lon})
        

In [None]:
coord_d[:3]

In [None]:
import pandas as pd
coord_df = pd.DataFrame(coord_d)
coord_df

### Subcategories
Pages are only listed as members of the categories they are placed in directly. If a page is in a category, and that category is itself a member of another category, then the page will not be shown when you call members() on the outer category. So you have to walk through the subcategories recursively to reach their members. This exercise is left to the readers. :)
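
If you want a starting point, here is one possible sketch of such a recursive walk. It assumes only that a category object offers `title()`, `articles()` and `subcategories()`, which pywikibot's `Category` does. `walk_category` and `FakeCategory` are hypothetical names, and `FakeCategory` exists only so the example runs offline.

```python
def walk_category(cat, max_depth=2, _seen=None):
    """Yield articles in cat and its subcategories, down to max_depth levels."""
    seen = set() if _seen is None else _seen
    key = cat.title()
    if key in seen:  # categories can loop back on themselves
        return
    seen.add(key)
    yield from cat.articles()
    if max_depth > 0:
        for sub in cat.subcategories():
            yield from walk_category(sub, max_depth - 1, seen)


class FakeCategory:
    """Hypothetical offline stand-in that mimics the three methods used above."""

    def __init__(self, name, articles=(), subcats=()):
        self._name = name
        self._articles = list(articles)
        self._subcats = list(subcats)

    def title(self):
        return self._name

    def articles(self):
        return iter(self._articles)

    def subcategories(self):
        return iter(self._subcats)


private = FakeCategory("Private universities", ["Stanford University"])
top = FakeCategory("Universities", ["UC Berkeley"], [private])
print(list(walk_category(top)))
```

The depth limit and the set of visited titles are there because real category trees can be very deep and occasionally circular. An article that sits in several subcategories will be yielded more than once. pywikibot's own category methods also appear to accept a `recurse` argument, which may be simpler; check the library docs for your version.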

In [None]:
bay_cat = pywikibot.Category(site,'Category:Universities and colleges in California')
bay_gen = bay_cat.members()

In [None]:
for page in bay_gen:
    print(page.title(), page.is_categorypage(), page.coordinates())

### Other interesting information from pages

Backlinks are all the pages that link to a page. Note: this list can get very, very long, even for moderately popular articles.

In [None]:
telegraph_page = pywikibot.Page(site, u"Telegraph Avenue")
telegraph_backlinks = telegraph_page.backlinks
telegraph_backlinks()

In [None]:
for bl_page in telegraph_backlinks():
    # namespace 1 is Talk, so this prints only the talk pages that link here
    if bl_page.namespace() == 1:
        print(bl_page.title())

Who has contributed to a page, and how many times have they edited?

In [None]:
telegraph_page.contributors()

Templates are all the extensions to wikimarkup that give you things like citations, tables, infoboxes, etc. You can iterate over all the templates in a page.

In [None]:
bky_templates = bky_page.templatesWithParams()
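
Each item returned by `templatesWithParams()` should be a pair of the template page and a list of parameter strings. The sketch below turns such a list into a dictionary, assuming named parameters arrive as `name=value` strings and positional ones as bare values. The helper `params_to_dict` and the sample parameters are made up for illustration.

```python
def params_to_dict(params):
    """Turn a list of 'name=value' strings into a dict, numbering bare values."""
    result = {}
    position = 0
    for raw in params:
        name, sep, value = raw.partition("=")
        if sep:  # named parameter: name=value
            result[name.strip()] = value.strip()
        else:    # positional parameter: give it a running number
            position += 1
            result[str(position)] = raw.strip()
    return result


# Made-up parameters in the style of a settlement infobox
sample = ["name = Berkeley", "settlement_type = City", "California"]
print(params_to_dict(sample))
```

This deliberately ignores the hard cases, such as nested templates or positional values that themselves contain an equals sign.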


But templates are quite non-standard and very difficult to parse! Hence... [Wikidata!](https://github.com/thehackerwithin/berkeley/blob/master/scraping_wikipedia/wikidata-intro.ipynb)