# Requesting ORES scores through LiftWing ML Service API
The Wikimedia Foundation (WMF) is reworking access to their APIs. It is likely that in the coming years all API access will require some kind of authentication, either through a simple key/token or through some version of OAuth. For now this is still a work in progress. You can follow the progress from their [API portal](https://api.wikimedia.org/wiki/Main_Page). Another ongoing change is better control over API services in situations where those services require additional computational resources, beyond simply serving the text of a web page (i.e., the text of an article). Services like ORES, which require running an ML model over the text of an article page, are an example of a compute-intensive API service.

Wikimedia is implementing a new Machine Learning (ML) service infrastructure that they call [LiftWing](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing). Given that ORES already has several ML models that have been well used, ORES is the first set of APIs that are being moved to LiftWing.

This example illustrates how to generate article quality estimates for article revisions using the LiftWing version of [ORES](https://www.mediawiki.org/wiki/ORES). The [ORES API documentation](https://ores.wikimedia.org) can be accessed from the main ORES page. The [ORES LiftWing documentation](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage) is very thin ... even thinner than the standard ORES documentation. Further, it is clear that some parameters have been renamed (e.g., "revid" in the old ORES API is now "rev_id" in the LiftWing ORES API).


## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023



In [1]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

In [2]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that it's English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#

# Get your access token

You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). If you have a Wikipedia user account, you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password to check it. If you do not have a Wikimedia account you will need to create an account that you can use to get an access token.

There is [a 'guide' that describes how to get authentication tokens](https://api.wikimedia.org/wiki/Authentication) - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment.

The documentation talks about using a "dashboard" for managing authentication tokens. That's a rather generous description for what looks like a simple list of token things. You might have a hard time finding this "dashboard". First, on the left hand side of the page, you'll see a column of links. The bottom section is a set of links titled "Tools". In that section is a link that says [Special pages](https://api.wikimedia.org/wiki/Special:SpecialPages) which will take you to a list of ... well, special pages. At the very bottom of the "Special pages" page is a section titled "Other special pages" (scroll all the way to the bottom). The first link in that section is called [API keys](https://api.wikimedia.org/wiki/Special:AppManagement). When you get to the "API keys" page you can create a new key.

The authentication guide suggests that you should create a server-side app key. This does not seem to work correctly - as yet. It failed each of the several times I tried to create a server-side app key. BUT, there is an option to create a [Personal API token](https://api.wikimedia.org/wiki/Authentication) that should work for this course and the type of ORES page scoring that you will need to perform.

Note, when you create a Personal API token you are granted three items - a Client ID, a Client secret, and an Access token - you should save all three of these. When you dismiss the box they are gone. If you lose any one of them you can destroy or deactivate the Personal API token from the dashboard and then create a new one.

The value you need to work the code below is the Access token - a very long string.


In [3]:
#   Once you've done the right set up with your Wikimedia account, it should provide you with three different keys, a Client ID,
#   a Client secret, and an Access token.
#
#   In this case I don't want to distribute my keys with the source of the notebook, so I wrote a key manager object that helps
#   track all of my API keys - a username and domain name retrieves the key. The key manager hides the keys on disk separate
#   from the code. A common code idiom to hide API keys will use code to extract the key from an OS environment variable. 
#
#   In the Homework 2 folder you should be able to find a zip file containing the apikeys user module. Install this module
#   into the folder where you keep all of your user modules. This is also the folder that your PYTHONPATH variable points to.
#
from apikeys.KeyManager import KeyManager
keyman = KeyManager()

#
#   This is my Wikipedia/Wikimedia username. They suggest you request your keys using your Wikipedia username, so I
#   also stored the API key using my Wikipedia username.
#
#   You should probably use your own username here.
USERNAME = "dwmc"
key_info = keyman.findRecord(USERNAME,API_ORES_LIFTWING_ENDPOINT)
ACCESS_TOKEN = key_info[0]['key']
print(key_info[0]['description'])
#print(ACCESS_TOKEN)
#
#   Note: if you don't want to use the key manager to help manage your API keys, you can specify the values as constants
#   below. Just don't distribute the notebook without removing the constants or you'll be distributing your key too.
#
#USERNAME = "<your_wikimedia_username>"
#ACCESS_TOKEN = "<your_wikimedia_provided_access_token_its_a_really_long_string>"
#

Access Token for Wikimedia APIs - this 'Personal' token was created to access ORES through the Lift Wing ML service
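
The environment variable idiom mentioned in the comments above can be sketched as follows. This is a minimal sketch, not part of the key manager module: the variable name `WIKIMEDIA_ACCESS_TOKEN` and the helper `load_access_token` are hypothetical names chosen for illustration, and the sketch assumes you have exported the token in the shell that launches the notebook.

```python
import os

# Hypothetical variable name - any name works as long as it matches what you export in your shell
TOKEN_ENV_VAR = "WIKIMEDIA_ACCESS_TOKEN"

def load_access_token(env_var=TOKEN_ENV_VAR):
    """Return the access token stored in an environment variable, or raise if it is missing."""
    # Strip whitespace that often sneaks in when a long token is copied and pasted
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Wikimedia access token")
    return token
```

With the variable set, `ACCESS_TOKEN = load_access_token()` would take the place of the key manager lookup, and the token never appears in the notebook source.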


The Wikimedia Foundation appears to be issuing access tokens that adhere to the [JWT (JSON Web Token) standard](https://jwt.io/introduction/). IBM also has some documentation about the [use of JWT tokens](https://www.ibm.com/docs/en/cics-ts/6.1?topic=cics-json-web-token-jwt) that I found useful. Keep in mind, documentation from IBM is specific to their implementation of the JWT standard. Access tokens are composed of different parts that specify the domain being accessed and the rate limits. The little snippet of code below is not required to make ORES requests. It just allows us to see what is in the Wikimedia-provided access token that you were issued.

In [4]:
#
#   Decode the Wikimedia JWT Access token
#
#   NOTE: This is not required to use LiftWing to request ORES scores. This is just being done to satisfy my curiosity.
#   You might be curious too!
#
import base64

print("Decoding the ACCESS_TOKEN:")
try:
    token_components = ACCESS_TOKEN.split(".")
    if len(token_components) == 3:
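        # JWT segments are base64url encoded with the '=' padding stripped, so restore the padding before decoding
        padded = [part + "=" * (-len(part) % 4) for part in token_components]
        header = json.loads(base64.urlsafe_b64decode(padded[0]).decode())
        payload = json.loads(base64.urlsafe_b64decode(padded[1]).decode())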
        print("Token Header:",json.dumps(header,indent=4))
        print("Token Payload:",json.dumps(payload,indent=4))
        #print("Token Signature:",token_components[2])
        print("Token Signature: <value_suppressed>")
        #
        #  One should be able to use public/private keys to actually validate that signature - left as an exercise for later
        #
    else:
        print(f"The ACCESS_TOKEN appears to be improperly structured. It should have 3 components and it has {len(token_components)}")
except Exception:
    print("Could not decode the ACCESS_TOKEN - check that it is defined and was copied completely")
    raise


Decoding the ACCESS_TOKEN:
Token Header: {
    "typ": "JWT",
    "alg": "RS256"
}
Token Payload: {
    "aud": "d37664a5e7d77aa45af7bff44ae2b45d",
    "jti": "efe884f9443ce32928ac994496d0b05a0db5a6dd76193c1d66528a7973ca53caf11c82a84cbc0516",
    "iat": 1692043627.687235,
    "nbf": 1692043627.687238,
    "exp": 33248952427.68444,
    "sub": "34000125",
    "iss": "https://meta.wikimedia.org",
    "ratelimit": {
        "requests_per_unit": 5000,
        "unit": "HOUR"
    },
    "scopes": [
        "basic"
    ]
}
Token Signature: <value_suppressed>
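
The `ratelimit` entry in the payload above is where the throttling constants come from. As a rough sketch, the wait time between requests could be derived from a decoded payload like this. The function name and the unit table are assumptions for illustration: only the `"HOUR"` unit appears in the token shown here, so the other unit names are guesses about what a token might contain.

```python
# Seconds in each rate-limit unit. Only "HOUR" was observed in the token above;
# the remaining unit names are assumptions, not documented values.
SECONDS_PER_UNIT = {"SECOND": 1.0, "MINUTE": 60.0, "HOUR": 3600.0, "DAY": 86400.0}

def throttle_wait_from_payload(payload, latency_assumed=0.002):
    """Return the number of seconds to sleep between requests for a decoded token payload."""
    ratelimit = payload["ratelimit"]
    unit_seconds = SECONDS_PER_UNIT[ratelimit["unit"].upper()]
    # Spread the allowed requests evenly over the unit, then subtract the assumed latency
    wait = unit_seconds / ratelimit["requests_per_unit"] - latency_assumed
    # Never return a negative sleep time
    return max(wait, 0.0)

# Just the rate-limit portion of the payload printed above
sample_payload = {"ratelimit": {"requests_per_unit": 5000, "unit": "HOUR"}}
print(throttle_wait_from_payload(sample_payload))
```

For 5000 requests per hour this works out to roughly 0.718 seconds, the same arithmetic as the `API_THROTTLE_WAIT` constant.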


## Define a function to make the ORES API request

The API request will be made using a function that encapsulates the call and makes access reusable in other notebooks. The procedure is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision id on each call and get back a Python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, the quality level with the highest probability score is considered the quality level for the article. This can be tricky when you have two (or more) highly probable quality levels.

In [5]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
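    #    Work on copies so the templates passed in (including the module-level defaults) are never modified
    request_data = dict(request_data)
    header_params = dict(header_params)

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call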
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


**Example 1** - Call the function by passing in three items, revision id, email, and access token

In [6]:
#   
#
#   Which article - the key for the article dictionary defined above
article_title = "Bison"
#
print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {ARTICLE_REVISIONS[article_title]:d}")
#
#    Make the call, just pass in the article revision ID, email address, and access token
score = request_ores_score_per_article(article_revid=ARTICLE_REVISIONS[article_title],
                                       email_address="dwmc@uw.edu",
                                       access_token=ACCESS_TOKEN)
#
#    Output the result
print(json.dumps(score,indent=4))
#

Getting LiftWing ORES scores for 'Bison' with revid: 1085687913
{
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.9.2"
            }
        },
        "scores": {
            "1085687913": {
                "articlequality": {
                    "score": {
                        "prediction": "FA",
                        "probability": {
                            "B": 0.07895665991827401,
                            "C": 0.03728215742560417,
                            "FA": 0.5629436065906797,
                            "GA": 0.30547854835374505,
                            "Start": 0.011061807252218824,
                            "Stub": 0.00427722045947826
                        }
                    }
                }
            }
        }
    }
}
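
The response above makes it easy to see why picking a single quality level can be tricky: 'FA' wins, but 'GA' is not far behind. A small sketch of one way to surface that, assuming the response has the shape shown above. The helper name `top_two_quality_levels` is made up for this example, and the sample dictionary is just the probability portion of the 'Bison' result.

```python
def top_two_quality_levels(response, rev_id, wiki="enwiki", model="articlequality"):
    """Return (best_label, second_label, margin) for one scored revision."""
    # Revision ids are string keys in the response even though we send them as integers
    score = response[wiki]["scores"][str(rev_id)][model]["score"]
    # Rank the quality levels from most to least probable
    ranked = sorted(score["probability"].items(), key=lambda item: item[1], reverse=True)
    (best, best_p), (second, second_p) = ranked[0], ranked[1]
    return best, second, best_p - second_p

# The relevant part of the 'Bison' response shown above
sample_response = {
    "enwiki": {"scores": {"1085687913": {"articlequality": {"score": {
        "prediction": "FA",
        "probability": {
            "B": 0.07895665991827401, "C": 0.03728215742560417,
            "FA": 0.5629436065906797, "GA": 0.30547854835374505,
            "Start": 0.011061807252218824, "Stub": 0.00427722045947826,
        },
    }}}}}
}

best, second, margin = top_two_quality_levels(sample_response, 1085687913)
print(f"{best} leads {second} by {margin:.3f}")
```

What counts as "too close to call" is a judgment call. A small margin between the top two levels is a reasonable signal to look at the article yourself rather than trust the single prediction.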


**Example 2** - Call the function by copying and initializing parameter templates and using those

In [7]:
#
#   What article - the key for the article dictionary defined above
article_title = "Red squirrel"
#
#   We have to pass in some parameters used for the request header. Create a copy of the template and fill in some fields.
hparams = REQUEST_HEADER_PARAMS_TEMPLATE.copy()
hparams['email_address'] = "dwmc@uw.edu"
hparams['access_token'] = ACCESS_TOKEN
#
#    We can also do this with the request data - although this might not be as useful as with the header params
rd = ORES_REQUEST_DATA_TEMPLATE.copy()
rd['rev_id'] = ARTICLE_REVISIONS[article_title]
#
print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {ARTICLE_REVISIONS[article_title]:d}")
#
#    Make the call, just pass in the request data (which holds the revision ID) and the header parameters
score = request_ores_score_per_article(request_data=rd,
                                       header_params=hparams)
#
#    Output the result
print(json.dumps(score,indent=4))
#

Getting LiftWing ORES scores for 'Red squirrel' with revid: 1083787665
{
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.9.2"
            }
        },
        "scores": {
            "1083787665": {
                "articlequality": {
                    "score": {
                        "prediction": "C",
                        "probability": {
                            "B": 0.34796005858456314,
                            "C": 0.5493773163633026,
                            "FA": 0.033407655474737605,
                            "GA": 0.056969127408174655,
                            "Start": 0.008976178264322794,
                            "Stub": 0.0033096639048991057
                        }
                    }
                }
            }
        }
    }
}
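
Since the function is meant to be called once per revision for a whole set of articles, a loop over a dictionary like `ARTICLE_REVISIONS` might look like the sketch below. The helper `score_revisions` is hypothetical, and it takes the scoring function as a parameter so the loop can be tried out without making any API requests. It assumes responses have the shape shown in the two examples above and records `None` for anything else.

```python
def score_revisions(revisions, score_fn):
    """Map each article title to its predicted quality level, or None when no usable score came back."""
    predictions = {}
    for title, rev_id in revisions.items():
        response = score_fn(rev_id)
        try:
            score = response["enwiki"]["scores"][str(rev_id)]["articlequality"]["score"]
            predictions[title] = score["prediction"]
        except (KeyError, TypeError):
            # The request failed (None) or the response did not have the expected shape
            predictions[title] = None
    return predictions

# Possible use with the function defined earlier in this notebook (makes real API requests):
#
# predictions = score_revisions(
#     ARTICLE_REVISIONS,
#     lambda rev_id: request_ores_score_per_article(article_revid=rev_id,
#                                                   email_address="<your_email_address>",
#                                                   access_token=ACCESS_TOKEN))
```

Throttling is already handled inside `request_ores_score_per_article`, so the loop itself does not need to sleep between calls.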
