## Raman SV - Data 512 - Assignment 2

Code snippets in this document are either used as-is or adapted from the example code shared by Dr. David McDonald, professor for Data 512.

They come from the notebooks "wp_page_info_example" and "wp_ores_liftwing_example", shared as part of the starter code for this assignment.

These are licensed CC-BY (https://creativecommons.org/licenses/by/4.0/) by the original author.

This entire notebook, along with the data sources, is licensed CC-BY (https://creativecommons.org/licenses/by/4.0/).

Import the necessary libraries and packages for this project and set up the local environment.

In [2]:
import os
import json, time, urllib.parse
import requests
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.lines as mlines
import seaborn as sns
import datetime

pd.set_option('display.max_colwidth', None)
Curr_Dir = 'C:/Users/raman/OneDrive/Desktop/UDubs/Classroom/Q4/Data 512 - HCDE/Assignments/Week 2/data-512-homework_2/'

The two files below contain the regions and divisions of the US states and the state population estimates for 2022.
The population estimates are obtained from this link: [State Population Totals and Components of Change: 2020-2022](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html)

These two files are merged to get a list of states with their regional division and 2022 estimated population.

In [7]:
states_by_region = pd.read_excel(os.path.join(Curr_Dir,'Input Files', 'US States by Region - US Census Bureau.xlsx'))

# Combine region and division into one label, e.g. "South (East South Central)" (no trailing space)
states_by_region['regional_division'] = states_by_region['REGION'] + ' (' + states_by_region['DIVISION'] + ')'

#print(states_by_region.columns)
states_by_region.head(5)

Unnamed: 0,REGION,DIVISION,STATE,regional_division
0,South,East South Central,Alabama,South (East South Central)
1,West,Pacific,Alaska,West (Pacific)
2,West,Mountain,Arizona,West (Mountain)
3,South,West South Central,Arkansas,South (West South Central)
4,West,Pacific,California,West (Pacific)


In [8]:
population_2022_est = pd.read_excel(os.path.join(Curr_Dir,'Input Files', 'NST-EST2022-POP.xlsx'))
#print(population_2022_est.columns)
population_2022_est.head(10)

Unnamed: 0,Geographic Area,"April 1, 2020 Estimates Base",2022
0,United States,331449520,333287557
1,Northeast,57609156,57040406
2,Midwest,68985537,68787595
3,South,126266262,128716192
4,West,78588565,78743364
5,Alabama,5024356,5074296
6,Alaska,733378,733583
7,Arizona,7151507,7359197
8,Arkansas,3011555,3045637
9,California,39538245,39029342


In [9]:
states_by_region_with_pop = pd.merge(states_by_region[['STATE', 'regional_division']], population_2022_est[['Geographic Area', 2022]], left_on='STATE', right_on='Geographic Area', how= 'left' )
states_by_region_with_pop.drop(columns=['Geographic Area'], inplace=True)
states_by_region_with_pop.rename(columns={2022: 'Population Est. 2022'}, inplace=True)

states_by_region_with_pop.head(5)


Unnamed: 0,STATE,regional_division,Population Est. 2022
0,Alabama,South (East South Central),5074296
1,Alaska,West (Pacific),733583
2,Arizona,West (Mountain),7359197
3,Arkansas,South (West South Central),3045637
4,California,West (Pacific),39029342


Import the list of cities by state that will act as the source of article titles. Source file courtesy of Dr. David McDonald, professor for Data 512.

In [10]:
state_list_source = pd.read_csv(os.path.join(Curr_Dir,'Input Files', 'us_cities_by_state_SEPT.2023.csv'))
state_list_source.head(100)

Unnamed: 0,state,page_title,url
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama"
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama"
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama"
...,...,...,...
95,Alabama,"Courtland, Alabama","https://en.wikipedia.org/wiki/Courtland,_Alabama"
96,Alabama,"Cowarts, Alabama","https://en.wikipedia.org/wiki/Cowarts,_Alabama"
97,Alabama,"Creola, Alabama","https://en.wikipedia.org/wiki/Creola,_Alabama"
98,Alabama,"Crossville, Alabama","https://en.wikipedia.org/wiki/Crossville,_Alabama"


Merge the three datasets above to get a list of cities with their state, regional division, and state population estimate.

In [12]:
state_list = pd.merge(state_list_source, states_by_region_with_pop, left_on='state', right_on='STATE', how= 'left' )
state_list.drop(columns=['STATE'], inplace=True)

output_excel_path = os.path.join(Curr_Dir, 'Input Files', 'test_temp.xlsx')
state_list.to_excel(output_excel_path, index=False)

state_list.head(100)


Unnamed: 0,state,page_title,url,regional_division,Population Est. 2022
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",South (East South Central),5074296
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",South (East South Central),5074296
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama",South (East South Central),5074296
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama",South (East South Central),5074296
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama",South (East South Central),5074296
...,...,...,...,...,...
95,Alabama,"Courtland, Alabama","https://en.wikipedia.org/wiki/Courtland,_Alabama",South (East South Central),5074296
96,Alabama,"Cowarts, Alabama","https://en.wikipedia.org/wiki/Cowarts,_Alabama",South (East South Central),5074296
97,Alabama,"Creola, Alabama","https://en.wikipedia.org/wiki/Creola,_Alabama",South (East South Central),5074296
98,Alabama,"Crossville, Alabama","https://en.wikipedia.org/wiki/Crossville,_Alabama",South (East South Central),5074296


### Page Info Variables initialization
The code and comments below are mainly derived from the aforementioned starter code for this assignment.

In [97]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<svraman@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
# ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [98]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
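
The function above writes the title into `request_template`, which by default is the shared `PAGEINFO_PARAMS_TEMPLATE` dictionary, so the last title requested stays in the template between calls. A minimal sketch of one way to avoid that is shown here. The helper `build_pageinfo_params` is hypothetical and not part of the starter code, and no request is sent.

```python
# Hypothetical helper (not in the starter code): build a fresh parameter
# dictionary per request instead of mutating a shared template.
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

def build_pageinfo_params(article_title, template=PAGEINFO_PARAMS_TEMPLATE):
    """Return a copy of the template with 'titles' filled in."""
    if not article_title:
        raise ValueError("Must supply an article title to make a pageinfo request.")
    params = dict(template)           # shallow copy is enough: all values are strings
    params["titles"] = article_title
    return params

params = build_pageinfo_params("Abbeville, Alabama")
print(params["titles"])                           # Abbeville, Alabama
print(repr(PAGEINFO_PARAMS_TEMPLATE["titles"]))   # '' (the template is unchanged)
```

The returned dictionary could then be passed as `params=` to `requests.get`, leaving the constants untouched for the next call.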


In [99]:
state_list.head(10)

Unnamed: 0,state,page_title,url,regional_division,Population Est. 2022
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",South (East South Central),5074296
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",South (East South Central),5074296
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama",South (East South Central),5074296
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama",South (East South Central),5074296
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama",South (East South Central),5074296
5,Alabama,"Albertville, Alabama","https://en.wikipedia.org/wiki/Albertville,_Alabama",South (East South Central),5074296
6,Alabama,"Alexander City, Alabama","https://en.wikipedia.org/wiki/Alexander_City,_Alabama",South (East South Central),5074296
7,Alabama,"Aliceville, Alabama","https://en.wikipedia.org/wiki/Aliceville,_Alabama",South (East South Central),5074296
8,Alabama,"Allgood, Alabama","https://en.wikipedia.org/wiki/Allgood,_Alabama",South (East South Central),5074296
9,Alabama,"Altoona, Alabama","https://en.wikipedia.org/wiki/Altoona,_Alabama",South (East South Central),5074296


The code below uses the function above to fetch the page info for every city article.

In [100]:
state_list_data = []

for i, title in enumerate(state_list['page_title']):
    PageInfo = request_pageinfo_per_article(title)
    #print(PageInfo)
    # Added this check because a KeyError was being raised for some responses; it handles
    # scenarios where 'query' is not part of the response for any reason
    if PageInfo is not None and isinstance(PageInfo, dict) and 'query' in PageInfo:

        # df = pd.DataFrame(PageInfo)
        state_list_data.append(PageInfo)

    else:
        print("No valid response")

state_list_data = pd.DataFrame({'Response': state_list_data})


output_excel_path = os.path.join(Curr_Dir, 'Intermediate files', 'responses_for_states.xlsx')
state_list_data.to_excel(output_excel_path, index=False)

# Now, 'state_list_data' contains all the responses in a single column
state_list_data.head()




Unnamed: 0,Response
0,"{'batchcomplete': '', 'query': {'pages': {'104730': {'pageid': 104730, 'ns': 0, 'title': 'Abbeville, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1171163550, 'length': 24706, 'talkid': 281244, 'fullurl': 'https://en.wikipedia.org/wiki/Abbeville,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Abbeville,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Abbeville,_Alabama'}}}}"
1,"{'batchcomplete': '', 'query': {'pages': {'104761': {'pageid': 104761, 'ns': 0, 'title': 'Adamsville, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1177621427, 'length': 18040, 'talkid': 281272, 'fullurl': 'https://en.wikipedia.org/wiki/Adamsville,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Adamsville,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Adamsville,_Alabama'}}}}"
2,"{'batchcomplete': '', 'query': {'pages': {'105188': {'pageid': 105188, 'ns': 0, 'title': 'Addison, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1168359898, 'length': 13309, 'talkid': 281517, 'fullurl': 'https://en.wikipedia.org/wiki/Addison,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Addison,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Addison,_Alabama'}}}}"
3,"{'batchcomplete': '', 'query': {'pages': {'104726': {'pageid': 104726, 'ns': 0, 'title': 'Akron, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1165909508, 'length': 11710, 'talkid': 281240, 'fullurl': 'https://en.wikipedia.org/wiki/Akron,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Akron,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Akron,_Alabama'}}}}"
4,"{'batchcomplete': '', 'query': {'pages': {'105109': {'pageid': 105109, 'ns': 0, 'title': 'Alabaster, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1179139816, 'length': 20343, 'talkid': 281444, 'fullurl': 'https://en.wikipedia.org/wiki/Alabaster,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Alabaster,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Alabaster,_Alabama'}}}}"


The above code took 87 minutes to execute. It would be better either to split the loop into smaller runs of 1000 requests each or to send up to 50 titles in a single request, so that a timeout or a network or computer issue does not cost the whole run.
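
The batching idea mentioned above can be sketched as follows. The MediaWiki Action API accepts several titles in one `titles` parameter, separated by `|` (up to 50 per request for regular accounts). The helper name `chunk_titles` and the sample titles are illustrative assumptions, and no request is actually sent here.

```python
def chunk_titles(titles, batch_size=50):
    """Yield pipe-joined batches of at most batch_size titles."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])

# Made-up titles, used only to show the batch sizes
sample_titles = [f"City {n}, Alabama" for n in range(120)]
batches = list(chunk_titles(sample_titles))
print(len(batches))                  # 3
print(batches[2].count("|") + 1)     # 20
```

Each batch string would be passed as the `titles` value of one request. The `pages` dictionary in the response then holds several entries, so any downstream parsing has to loop over all of them rather than stop at the first page.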

Now that we have the lastrevid and related details for the city articles, we have to define the methodology that will generate the page score.
Some of the code in the segments below is from the notebook "wp_ores_liftwing_example", shared as part of the starter code for this assignment.
These are licensed CC-BY (https://creativecommons.org/licenses/by/4.0/) by the original author.
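
As a small illustration of where `lastrevid` sits in a page info response, the sketch below pulls the title and revision ID out of a trimmed copy of the first response shown above. The function name `extract_title_and_revid` is a hypothetical helper, not something from the starter code.

```python
def extract_title_and_revid(response):
    """Return (title, lastrevid) for the first page in a page info response, or (None, None)."""
    pages = (response or {}).get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("title"), page.get("lastrevid")
    return None, None

# Trimmed copy of the response for 'Abbeville, Alabama'
sample_response = {
    "batchcomplete": "",
    "query": {"pages": {"104730": {
        "pageid": 104730,
        "title": "Abbeville, Alabama",
        "lastrevid": 1171163550,
    }}},
}
print(extract_title_and_revid(sample_response))   # ('Abbeville, Alabama', 1171163550)
print(extract_title_and_revid(None))              # (None, None)
```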

In [102]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#    The token used here grants 5000 requests per HOUR, so the wait is based on 3600 seconds rather than 60.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (3600.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there

#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<svraman@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}

#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "svraman@uw.edu",         # your email address should go here
    # Read the access token from an environment variable instead of hard-coding it in the notebook.
    # WIKIMEDIA_ACCESS_TOKEN is an assumed variable name; set it before starting the notebook.
    'access_token'  : os.environ.get("WIKIMEDIA_ACCESS_TOKEN", "")
}


#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }


#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}
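
The comment above about dissecting the token can be made concrete. A JWT has three dot-separated parts, and the middle one is base64url-encoded JSON. The sketch below builds a dummy token, decodes its payload without verifying the signature, and derives a per-request wait. It assumes the payload carries a `ratelimit` entry with `requests_per_unit` and `unit` fields; the helper names and the dummy token are illustrative only.

```python
import base64
import json

SECONDS_PER_UNIT = {"SECOND": 1, "MINUTE": 60, "HOUR": 3600}

def decode_jwt_payload(token):
    """Decode the (unverified) payload segment of a JWT into a dict."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)   # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def throttle_wait(ratelimit, latency=0.002):
    """Seconds to sleep between requests to stay under the granted rate."""
    per_request = SECONDS_PER_UNIT[ratelimit["unit"]] / ratelimit["requests_per_unit"]
    return max(per_request - latency, 0.0)

# Dummy token for illustration only: header.payload.signature with fake outer parts
dummy_payload = {"ratelimit": {"requests_per_unit": 5000, "unit": "HOUR"}}
encoded = base64.urlsafe_b64encode(json.dumps(dummy_payload).encode()).decode().rstrip("=")
dummy_token = f"header.{encoded}.signature"

limit = decode_jwt_payload(dummy_token)["ratelimit"]
print(limit)                            # {'requests_per_unit': 5000, 'unit': 'HOUR'}
print(round(throttle_wait(limit), 3))   # 0.718
```

For a limit of 5000 requests per hour this gives roughly 0.72 seconds between requests.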



To execute the API calls, you need an access token from Wikimedia. The process for obtaining one is described below.

Once you have the token, enter it in the code snippet below. The text below is from the starter code:


    You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). If you have a Wikipedia user account - you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password to check it. If you do not have a Wikimedia account you will need to create an account that you can use to get an access token.

    There is [a 'guide' that describes how to get authentication tokens](https://api.wikimedia.org/wiki/Authentication) - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment.

    The documentation talks about using a "dashboard" for managing authentication tokens. That's a rather generous description for what looks like a simple list of token things. You might have a hard time finding this "dashboard". First, on the left hand side of the page, you'll see a column of links. The bottom section is a set of links titled "Tools". In that section is a link that says [Special pages](https://api.wikimedia.org/wiki/Special:SpecialPages) which will take you to a list of ... well, special pages. At the very bottom of the "Special pages" page is a section titled "Other special pages" (scroll all the way to the bottom). The first link in that section is called [API keys](https://api.wikimedia.org/wiki/Special:AppManagement). When you get to the "API keys" page you can create a new key.

    The authentication guide suggests that you should create a server-side app key. This does not seem to work correctly - as yet. It failed on multiple attempts when I attempted to create a server-side app key. BUT, there is an option to create a [Personal API token](https://api.wikimedia.org/wiki/Authentication) that should work for this course and the type of ORES page scoring that you will need to perform.

    Note, when you create a Personal API token you are granted three items - a Client ID, a Client secret, and an Access token - and you should save all three of these. When you dismiss the box they are gone. If you lose any one of the tokens you can destroy or deactivate the Personal API token from the dashboard and then create a new one.

    The value you need to work the code below is the Access token - a very long string.


In [103]:
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#

The below text and code are from the starter code

##### Define a function to make the ORES API request

    The API request will be made using a function to encapsulate call and make access reusable in other notebooks. The procedure is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision id on each call and get back a python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, quality level with the highest probability score is considered the quality level for the article. This can be tricky when you have two (or more) highly probable quality levels.
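
The tie problem mentioned above can be illustrated with a short sketch. It picks the most probable quality level and flags cases where the runner-up is close behind. The 0.05 margin and the helper name `top_quality_levels` are arbitrary assumptions; the probabilities are rounded from the response for "Adamsville, Alabama" in this notebook.

```python
def top_quality_levels(probabilities, margin=0.05):
    """Return (best_label, runner_up_label, is_close) from a label -> probability dict."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best, best_p), (second, second_p) = ranked[0], ranked[1]
    return best, second, (best_p - second_p) < margin

# Rounded probabilities for one article revision
probs = {"B": 0.198, "C": 0.377, "FA": 0.019, "GA": 0.351, "Start": 0.050, "Stub": 0.004}
print(top_quality_levels(probs))   # ('C', 'GA', True)
```

Here the model predicts C, but GA is less than three percentage points behind, which is the kind of case where taking only the top label hides real uncertainty.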

In [104]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
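
To see what the header-building loop in the function produces, here is the same templating step in isolation with placeholder values. The email and token below are dummies. In the notebook's template the email is written directly into the User-Agent string, so `email_address` is validated but never substituted; the `{email_address}` placeholder used here is an assumption showing how it could be used.

```python
header_template = {
    "User-Agent": "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    "Content-Type": "application/json",
    "Authorization": "Bearer {access_token}",
}
header_params = {"email_address": "student@example.edu", "access_token": "DUMMY_TOKEN"}

# str.format ignores unused keyword arguments, so every template value can be
# formatted with the full parameter dictionary
headers = {key: value.format(**header_params) for key, value in header_template.items()}
print(headers["Authorization"])   # Bearer DUMMY_TOKEN
print(headers["Content-Type"])    # application/json
```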


In [107]:
# taking a backup to ensure progress isn't lost
#state_list_data.head(5)
state_list_data_2 = state_list_data.copy()

In [112]:
state_list_data.head(5)

Unnamed: 0,Response
0,"{'batchcomplete': '', 'query': {'pages': {'104730': {'pageid': 104730, 'ns': 0, 'title': 'Abbeville, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1171163550, 'length': 24706, 'talkid': 281244, 'fullurl': 'https://en.wikipedia.org/wiki/Abbeville,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Abbeville,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Abbeville,_Alabama'}}}}"
1,"{'batchcomplete': '', 'query': {'pages': {'104761': {'pageid': 104761, 'ns': 0, 'title': 'Adamsville, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1177621427, 'length': 18040, 'talkid': 281272, 'fullurl': 'https://en.wikipedia.org/wiki/Adamsville,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Adamsville,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Adamsville,_Alabama'}}}}"
2,"{'batchcomplete': '', 'query': {'pages': {'105188': {'pageid': 105188, 'ns': 0, 'title': 'Addison, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1168359898, 'length': 13309, 'talkid': 281517, 'fullurl': 'https://en.wikipedia.org/wiki/Addison,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Addison,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Addison,_Alabama'}}}}"
3,"{'batchcomplete': '', 'query': {'pages': {'104726': {'pageid': 104726, 'ns': 0, 'title': 'Akron, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1165909508, 'length': 11710, 'talkid': 281240, 'fullurl': 'https://en.wikipedia.org/wiki/Akron,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Akron,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Akron,_Alabama'}}}}"
4,"{'batchcomplete': '', 'query': {'pages': {'105109': {'pageid': 105109, 'ns': 0, 'title': 'Alabaster, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1179139816, 'length': 20343, 'talkid': 281444, 'fullurl': 'https://en.wikipedia.org/wiki/Alabaster,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Alabaster,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Alabaster,_Alabama'}}}}"


Now that the function is defined, we can call it after copying and initializing the parameter templates.

In [120]:
lastrevid_values = []
title_values = []
results = []
scores = []
# define the parameters to be used in the API call
hparams = REQUEST_HEADER_PARAMS_TEMPLATE.copy()
rd = ORES_REQUEST_DATA_TEMPLATE.copy()

prediction_mapping = {
    'FA': 'Featured article',
    'GA': 'Good article (sometimes called A-class)',
    'B': 'B-class article',
    'C': 'C-class article',
    'Start': 'Start-class article',
    'Stub': 'Stub-class article'
}

for index, row in state_list_data.iterrows():
    response_dict = row['Response']

    # Parse values from the response for processing
    pages_data = response_dict.get('query', {}).get('pages', {})
    lastrevid = None
    title = None

    for key, page_data in pages_data.items():
        lastrevid = page_data.get('lastrevid')
        title = page_data.get('title')
        if lastrevid:
            lastrevid_values.append(lastrevid)
            title_values.append(title)
            rd['rev_id'] = lastrevid
            score = request_ores_score_per_article(request_data=rd, header_params=hparams)
            scores.append(score)

            # Guard against a failed request (None) or an error payload without 'enwiki'
            score_data = ((score or {}).get('enwiki', {}).get('scores', {})
                          .get(str(lastrevid), {}).get('articlequality', {}).get('score', {}))
            prediction = score_data.get('prediction')
            probability = score_data.get('probability', {}).get(prediction)
            prediction_text = prediction_mapping.get(prediction)

            result_dict = {
                    'title': title,
                    'lastrevid': lastrevid,
                    'response': score,
                    'prediction': prediction,
                    'prediction_text': prediction_text
                    #'probability': probability
            }
            results.append(result_dict)

            break

    if lastrevid is None:
        lastrevid_values.append(None)
        title_values.append(None)
        #scores.append(None)



# Add 'lastrevid' values as a new column in the DataFrame
#state_list_data['lastrevid'] = lastrevid_values
#state_list_data['title'] = title_values
#state_list_data['score_response'] = scores
results_df = pd.DataFrame(results)
#state_list_data = pd.concat([state_list_data, results_df], axis=1)

#state_list_data = state_list_data[['title',	'lastrevid', 'prediction', 'prediction_text','probability']]
#state_list_data.head(5)
results_df.head(10)

Unnamed: 0,title,lastrevid,response,prediction,prediction_text
0,"Abbeville, Alabama",1171163550,"{'enwiki': {'models': {'articlequality': {'version': '0.9.2'}}, 'scores': {'1171163550': {'articlequality': {'score': {'prediction': 'C', 'probability': {'B': 0.31042252456158204, 'C': 0.5979200965294227, 'FA': 0.025186220917133947, 'GA': 0.04952133645299354, 'Start': 0.013573873336789355, 'Stub': 0.0033759482020785892}}}}}}}",C,C-class article
1,"Adamsville, Alabama",1177621427,"{'enwiki': {'models': {'articlequality': {'version': '0.9.2'}}, 'scores': {'1177621427': {'articlequality': {'score': {'prediction': 'C', 'probability': {'B': 0.198274200391586, 'C': 0.3770695177348356, 'FA': 0.019070364455845708, 'GA': 0.3514876684327692, 'Start': 0.05026148902798659, 'Stub': 0.003836759956977147}}}}}}}",C,C-class article
2,"Addison, Alabama",1168359898,"{'enwiki': {'models': {'articlequality': {'version': '0.9.2'}}, 'scores': {'1168359898': {'articlequality': {'score': {'prediction': 'C', 'probability': {'B': 0.27104076563661905, 'C': 0.324459707767518, 'FA': 0.011265514086494389, 'GA': 0.29487067754320384, 'Start': 0.0931882446366844, 'Stub': 0.005175090329480344}}}}}}}",C,C-class article
3,"Akron, Alabama",1165909508,"{'enwiki': {'models': {'articlequality': {'version': '0.9.2'}}, 'scores': {'1165909508': {'articlequality': {'score': {'prediction': 'GA', 'probability': {'B': 0.175388344565975, 'C': 0.2655870765311225, 'FA': 0.011556876058535826, 'GA': 0.4485841879139288, 'Start': 0.09350806348909406, 'Stub': 0.005375451441343951}}}}}}}",GA,Good article (sometimes called A-class)
4,"Alabaster, Alabama",1179139816,"{'enwiki': {'models': {'articlequality': {'version': '0.9.2'}}, 'scores': {'1179139816': {'articlequality': {'score': {'prediction': 'C', 'probability': {'B': 0.270971932616856, 'C': 0.6463838191722866, 'FA': 0.009590992690925318, 'GA': 0.033641571757281816, 'Start': 0.036340860820955355, 'Stub': 0.003070822941694893}}}}}}}",C,C-class article


In [143]:
# Create an ID column to facilitate processing in chunks
state_list_data['Row_ID'] = state_list_data.index + 1
state_list_data.head(10)

Unnamed: 0,Response,Row_ID
0,"{'batchcomplete': '', 'query': {'pages': {'104730': {'pageid': 104730, 'ns': 0, 'title': 'Abbeville, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1171163550, 'length': 24706, 'talkid': 281244, 'fullurl': 'https://en.wikipedia.org/wiki/Abbeville,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Abbeville,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Abbeville,_Alabama'}}}}",1
1,"{'batchcomplete': '', 'query': {'pages': {'104761': {'pageid': 104761, 'ns': 0, 'title': 'Adamsville, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1177621427, 'length': 18040, 'talkid': 281272, 'fullurl': 'https://en.wikipedia.org/wiki/Adamsville,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Adamsville,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Adamsville,_Alabama'}}}}",2
2,"{'batchcomplete': '', 'query': {'pages': {'105188': {'pageid': 105188, 'ns': 0, 'title': 'Addison, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1168359898, 'length': 13309, 'talkid': 281517, 'fullurl': 'https://en.wikipedia.org/wiki/Addison,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Addison,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Addison,_Alabama'}}}}",3
3,"{'batchcomplete': '', 'query': {'pages': {'104726': {'pageid': 104726, 'ns': 0, 'title': 'Akron, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1165909508, 'length': 11710, 'talkid': 281240, 'fullurl': 'https://en.wikipedia.org/wiki/Akron,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Akron,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Akron,_Alabama'}}}}",4
4,"{'batchcomplete': '', 'query': {'pages': {'105109': {'pageid': 105109, 'ns': 0, 'title': 'Alabaster, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1179139816, 'length': 20343, 'talkid': 281444, 'fullurl': 'https://en.wikipedia.org/wiki/Alabaster,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Alabaster,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Alabaster,_Alabama'}}}}",5
5,"{'batchcomplete': '', 'query': {'pages': {'104899': {'pageid': 104899, 'ns': 0, 'title': 'Albertville, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1179198677, 'length': 26930, 'watchers': 34, 'talkid': 281390, 'fullurl': 'https://en.wikipedia.org/wiki/Albertville,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Albertville,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Albertville,_Alabama'}}}}",6
6,"{'batchcomplete': '', 'query': {'pages': {'105153': {'pageid': 105153, 'ns': 0, 'title': 'Alexander City, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1179140073, 'length': 25275, 'watchers': 32, 'talkid': 281484, 'fullurl': 'https://en.wikipedia.org/wiki/Alexander_City,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Alexander_City,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Alexander_City,_Alabama'}}}}",7
7,"{'batchcomplete': '', 'query': {'pages': {'105086': {'pageid': 105086, 'ns': 0, 'title': 'Aliceville, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:37Z', 'lastrevid': 1167792390, 'length': 31568, 'talkid': 281424, 'fullurl': 'https://en.wikipedia.org/wiki/Aliceville,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Aliceville,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Aliceville,_Alabama'}}}}",8
8,"{'batchcomplete': '', 'query': {'pages': {'100811': {'pageid': 100811, 'ns': 0, 'title': 'Allgood, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:35Z', 'lastrevid': 1165909718, 'length': 11278, 'talkid': 281031, 'fullurl': 'https://en.wikipedia.org/wiki/Allgood,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Allgood,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Allgood,_Alabama'}}}}",9
9,"{'batchcomplete': '', 'query': {'pages': {'100812': {'pageid': 100812, 'ns': 0, 'title': 'Altoona, Alabama', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:35:35Z', 'lastrevid': 1165909823, 'length': 10679, 'talkid': 281032, 'fullurl': 'https://en.wikipedia.org/wiki/Altoona,_Alabama', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Altoona,_Alabama&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Altoona,_Alabama'}}}}",10


In [151]:

# Initialize variables
lastrevid_values = []
title_values = []
results = []
scores = []

hparams = REQUEST_HEADER_PARAMS_TEMPLATE.copy()
rd = ORES_REQUEST_DATA_TEMPLATE.copy()

prediction_mapping = {
    'FA': 'Featured article',
    'GA': 'Good article (sometimes called A-class)',
    'B': 'B-class article',
    'C': 'C-class article',
    'Start': 'Start-class article',
    'Stub': 'Stub-class article'
}

# Process 750 rows per chunk, then pause for 60 seconds before the next chunk
chunk_size = 750
break_duration = 60  # seconds

# Create a function to process a chunk of data
def process_chunk(chunk):
    for index, row in chunk.iterrows():
        response_dict = row['Response']

        pages_data = response_dict.get('query', {}).get('pages', {})
        lastrevid = None
        title = None

        for key, page_data in pages_data.items():
            lastrevid = page_data.get('lastrevid')
            title = page_data.get('title')
            if lastrevid:
                #lastrevid_values.append(lastrevid)
                #title_values.append(title)
                rd['rev_id'] = lastrevid
                score = request_ores_score_per_article(request_data=rd, header_params=hparams)
                scores.append(score)

                prediction = score['enwiki']['scores'].get(str(lastrevid), {}).get('articlequality', {}).get('score', {}).get('prediction')
                probability = score['enwiki']['scores'].get(str(lastrevid), {}).get('articlequality', {}).get('score', {}).get('probability', {}).get(prediction, None)

                prediction_text = prediction_mapping.get(prediction, None)

                result_dict = {
                        'title': title,
                        'lastrevid': lastrevid,
                        'response': score,
                        'prediction': prediction,
                        'probability': probability,
                        'prediction_text': prediction_text,
                        'Row ID': row['Row_ID']
                }
                results.append(result_dict)

                break

        if lastrevid is None:
            lastrevid_values.append(None)
            title_values.append(None)

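# Process the data in chunks
# The start value 18751 presumably resumes an earlier run; set it to 0 to start from the first row
# state_list_data[i:i + chunk_size] slices by position, so the chunk starting at i begins at Row_ID i + 1
for i in range(18751, max(state_list_data['Row_ID']), chunk_size):
    chunk = state_list_data[i:i + chunk_size]
    process_chunk(chunk)
    output_excel_path = os.path.join(Curr_Dir, 'Intermediate files', 'responses_for_states_' + str(i) + '_' + str(i + chunk_size) + '.xlsx')
    # 'results' keeps growing across chunks, so each file holds every result collected so far in this run
    temp_results_df = pd.DataFrame(results)

    temp_results_df.to_excel(output_excel_path, index=False)
    time.sleep(break_duration)  # pause for the predefined amount of time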

# Create DataFrames
results_df = pd.DataFrame(results)
#lastrevid_df = pd.DataFrame({'lastrevid': lastrevid_values})
#title_df = pd.DataFrame({'title': title_values})

# Merge the DataFrames
#state_list_data = pd.concat([title_df, lastrevid_df, results_df], axis=1)

# Filter columns
#state_list_data = state_list_data[['title', 'lastrevid', 'prediction', 'prediction_text']]

# Display the first 10 rows of the resulting DataFrame
#state_list_data.head(10)
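
The chunked loop above can also be written so that each intermediate file holds only the rows of its own chunk, which avoids the repeated rows that are removed later with `drop_duplicates`. This is a minimal sketch under stated assumptions: `iter_chunks` and `fake_score` are hypothetical names, the ORES request is replaced by a stand-in, and nothing is written to disk.

```python
import pandas as pd

def iter_chunks(df, chunk_size, start=0):
    """Yield (start_position, chunk) pairs using positional slicing."""
    for pos in range(start, len(df), chunk_size):
        yield pos, df.iloc[pos:pos + chunk_size]

def fake_score(row_id):
    # Hypothetical stand-in for the ORES request; returns a fixed label
    return {'Row ID': row_id, 'prediction': 'C'}

# Small synthetic frame that follows the same Row_ID convention (index + 1)
demo = pd.DataFrame({'Response': ['placeholder'] * 7})
demo['Row_ID'] = demo.index + 1

chunk_frames = []
for pos, chunk in iter_chunks(demo, chunk_size=3):
    # A fresh list per chunk, so a file saved here would hold only this chunk
    chunk_results = [fake_score(rid) for rid in chunk['Row_ID']]
    chunk_frames.append(pd.DataFrame(chunk_results))

chunk_sizes = [len(frame) for frame in chunk_frames]
all_results = pd.concat(chunk_frames, ignore_index=True)
print(chunk_sizes)       # [3, 3, 1]
print(len(all_results))  # 7
```

With this layout the combined data would have one row per processed page, so the de-duplication step would only be a safeguard.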


If the results were stored in multiple files because of batch processing, use the code below to combine them into a single dataframe once all the data has been processed.

In [3]:
folder_path = os.path.join(Curr_Dir, 'Intermediate files', 'All responses')

# Collect the DataFrame read from each file in a list
frames = []

# Iterate through all Excel files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith('.xlsx'):
        file_path = os.path.join(folder_path, filename)
        frames.append(pd.read_excel(file_path))

# Concatenate once at the end; DataFrame.append is deprecated and was removed in pandas 2.0
combined_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

combined_df.count()




title              79156
lastrevid          79156
response           79156
prediction         79156
probability        79156
prediction_text    79156
Row ID             58156
dtype: int64
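
`DataFrame.count()` reports non-null values per column, which is why `Row ID` can show fewer entries than the other columns. One plausible cause (an assumption, not confirmed by the notebook) is that some intermediate files were written without that column. A small sketch with synthetic frames shows the effect:

```python
import pandas as pd

# Two synthetic "files": one with a 'Row ID' column, one without
with_id = pd.DataFrame({'title': ['a', 'b'], 'Row ID': [1, 2]})
without_id = pd.DataFrame({'title': ['c']})

# Missing columns are filled with NaN when the frames are concatenated
combined = pd.concat([with_id, without_id], ignore_index=True)

counts = combined.count()  # counts non-null values only
print(counts['title'])     # 3
print(counts['Row ID'])    # 2
```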

In [153]:
output_excel_path_2 = os.path.join(Curr_Dir, 'Intermediate files', 'responses_for_states_combined' + '.xlsx')
combined_df.to_excel(output_excel_path_2, index=False)

In [4]:
#combined_df.head(10)
result_processing = combined_df.copy()

# Drop the raw response and the row ID so that repeated rows become exact duplicates
result_processing.drop(columns=['response', 'Row ID'], inplace=True)
# 79156 rows before removing duplicates
result_processing.drop_duplicates(inplace=True)
# 21518 rows after removing duplicates


In [5]:
result_processing.count()
#result_processing.head(10)


title              21518
lastrevid          21518
prediction         21518
probability        21518
prediction_text    21518
dtype: int64

In [13]:
state_list.head(10)


Unnamed: 0,state,page_title,url,regional_division,Population Est. 2022
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",South (East South Central),5074296
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",South (East South Central),5074296
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama",South (East South Central),5074296
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama",South (East South Central),5074296
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama",South (East South Central),5074296
5,Alabama,"Albertville, Alabama","https://en.wikipedia.org/wiki/Albertville,_Alabama",South (East South Central),5074296
6,Alabama,"Alexander City, Alabama","https://en.wikipedia.org/wiki/Alexander_City,_Alabama",South (East South Central),5074296
7,Alabama,"Aliceville, Alabama","https://en.wikipedia.org/wiki/Aliceville,_Alabama",South (East South Central),5074296
8,Alabama,"Allgood, Alabama","https://en.wikipedia.org/wiki/Allgood,_Alabama",South (East South Central),5074296
9,Alabama,"Altoona, Alabama","https://en.wikipedia.org/wiki/Altoona,_Alabama",South (East South Central),5074296


In [16]:
results_and_details = pd.merge(state_list[['state','page_title', 'regional_division','Population Est. 2022']], result_processing[['title', 'lastrevid', 'prediction', 'probability', 'prediction_text']], left_on='page_title', right_on='title', how= 'left' )
#results_and_details.head(10)

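# Keep 'title' in the DataFrame because the analysis cells below count unique titles;
# drop it only from the copy that is written to the output file
output_excel_path_3 = os.path.join(Curr_Dir, 'Output Files', 'Article quality predictions' + '.xlsx')
results_and_details.drop(columns=['title']).to_excel(output_excel_path_3, index=False)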
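
Because the merge above is a left join, pages without a quality prediction stay in the result with empty prediction columns. A hedged way to list them is the `indicator` option of `pd.merge`. The sketch below uses synthetic frames; 'Example Town, Alabama' is a made-up title used only for illustration.

```python
import pandas as pd

# Synthetic stand-ins for state_list and result_processing
pages = pd.DataFrame({
    'page_title': ['Abbeville, Alabama', 'Akron, Alabama', 'Example Town, Alabama']
})
scores = pd.DataFrame({
    'title': ['Abbeville, Alabama', 'Akron, Alabama'],
    'prediction': ['C', 'GA'],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
checked = pd.merge(pages, scores, left_on='page_title', right_on='title',
                   how='left', indicator=True)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'page_title'].tolist()
print(unmatched)  # ['Example Town, Alabama']
```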


In [171]:
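# lastrevid can become a float (for example after a left merge with unmatched rows),
# so convert it to text and keep only the part before the decimal point
results_and_details['lastrevid'] = results_and_details['lastrevid'].astype(str)
results_and_details['lastrevid'] = results_and_details['lastrevid'].str.split('.').str[0]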


results_and_details.head(10)

Unnamed: 0,state,page_title,regional_division,Population Est. 2022,title,lastrevid,prediction,probability,prediction_text
0,Alabama,"Abbeville, Alabama",South (East South Central),5074296,"Abbeville, Alabama",1171163550,C,0.59792,C-class article
1,Alabama,"Adamsville, Alabama",South (East South Central),5074296,"Adamsville, Alabama",1177621427,C,0.37707,C-class article
2,Alabama,"Addison, Alabama",South (East South Central),5074296,"Addison, Alabama",1168359898,C,0.32446,C-class article
3,Alabama,"Akron, Alabama",South (East South Central),5074296,"Akron, Alabama",1165909508,GA,0.448584,Good article (sometimes called A-class)
4,Alabama,"Alabaster, Alabama",South (East South Central),5074296,"Alabaster, Alabama",1179139816,C,0.646384,C-class article
5,Alabama,"Albertville, Alabama",South (East South Central),5074296,"Albertville, Alabama",1179198677,C,0.575156,C-class article
6,Alabama,"Alexander City, Alabama",South (East South Central),5074296,"Alexander City, Alabama",1179140073,GA,0.389467,Good article (sometimes called A-class)
7,Alabama,"Aliceville, Alabama",South (East South Central),5074296,"Aliceville, Alabama",1167792390,GA,0.563505,Good article (sometimes called A-class)
8,Alabama,"Allgood, Alabama",South (East South Central),5074296,"Allgood, Alabama",1165909718,C,0.41782,C-class article
9,Alabama,"Altoona, Alabama",South (East South Central),5074296,"Altoona, Alabama",1165909823,C,0.379118,C-class article
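
An alternative to converting `lastrevid` to text and splitting on the decimal point is pandas' nullable integer dtype, which keeps revision IDs as integers while still allowing missing values. This is only a sketch on a synthetic series; the two revision IDs are taken from the output above and the missing value is added for illustration.

```python
import numpy as np
import pandas as pd

# Floats with a missing value, as produced by a left merge with unmatched rows
demo = pd.Series([1171163550.0, np.nan, 1165909508.0], name='lastrevid')

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>
as_int = demo.astype('Int64')

print(as_int.dtype)        # Int64
print(as_int[0])           # 1171163550
print(pd.isna(as_int[1]))  # True
```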


### Analysis
This section aggregates the merged data to summarize article counts and article quality at the state level and at the census division level.


Analysis by State:

In [181]:
# Creating a copy of the results to get details by state
results_and_details_state = results_and_details.copy()

# Group the data to get the population and number of articles by state
results_and_details_state_groups = results_and_details_state.groupby('state').agg({'Population Est. 2022': 'max', 'title': 'nunique'}).reset_index()

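# Pivot the data to get the number of rows in each quality class
# (aggfunc='count' counts rows, so a title that appears more than once is counted each time)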
results_and_details_state_groups_pivot = results_and_details_state.pivot_table(index='state', columns='prediction', values='title', aggfunc='count', fill_value=0).reset_index()

#results_and_details_state_groups.head(10)
#results_and_details_state_groups_pivot.head(10)

# Merge the above 2 to get the combined dataframe
results_and_details_state_aggregate = results_and_details_state_groups.merge(results_and_details_state_groups_pivot, on='state', how='left')
#results_and_details_state_aggregate.head(5)

# Add columns for articles per capita
results_and_details_state_aggregate['total articles per capita'] = results_and_details_state_aggregate['title']/results_and_details_state_aggregate['Population Est. 2022']
results_and_details_state_aggregate['High Quality (FA or GA) articles per capita'] = (results_and_details_state_aggregate['FA'] + results_and_details_state_aggregate['GA'])/results_and_details_state_aggregate['Population Est. 2022']

results_and_details_state_aggregate.head(5)

Unnamed: 0,state,Population Est. 2022,title,B,C,FA,GA,Start,Stub,total articles per capita,High Quality (FA or GA) articles per capita
0,Alabama,5074296,461,14,616,4,102,182,4,9.1e-05,2.1e-05
1,Alaska,733583,149,8,89,1,30,19,2,0.000203,4.2e-05
2,Arizona,7359197,91,12,54,1,23,0,1,1.2e-05,3e-06
3,Arkansas,3045637,500,3,322,0,72,102,1,0.000164,2.4e-05
4,California,39029342,482,102,207,3,170,0,0,1.2e-05,4e-06
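
As a check on the two per-capita columns, the Alabama row can be reproduced by hand from the values shown in the table above (461 unique titles, 4 FA and 102 GA rows, population 5,074,296). The variable names are illustrative only.

```python
# Values copied from the Alabama row of the state-level table
population = 5074296
unique_titles = 461
fa, ga = 4, 102

total_per_capita = unique_titles / population
hq_per_capita = (fa + ga) / population

print(f"{total_per_capita:.1e}")  # 9.1e-05
print(f"{hq_per_capita:.1e}")     # 2.1e-05
```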


Analysis by Region and Division:

In [202]:
# Get one population value per state, then sum the state populations by regional division
results_and_details_reg_1 = results_and_details_state.groupby(['state','regional_division']).agg({'Population Est. 2022': 'max'}).reset_index()
results_and_details_reg_1.sort_values(by='regional_division', ascending=True).head(50)

population_by_region = results_and_details_reg_1.groupby('regional_division').agg({'Population Est. 2022': 'sum'}).reset_index()
population_by_region.head(10)

Unnamed: 0,regional_division,Population Est. 2022
0,Midwest (East North Central),47097779
1,Midwest (West North Central),19721893
2,Northeast (Middle Atlantic),41910858
3,Northeast (New England),11503343
4,South (East South Central),19578002
5,South (South Atlantic),66781137
6,South (West South Central),41685250
7,West (Mountain),25514320
8,West (Pacific),53229044


In [203]:
# Creating a copy of the results to get details by regional division
results_and_details_region = results_and_details.copy()

# Group the data to get the number of unique articles by regional division
results_and_details_region_groups = results_and_details_region.groupby('regional_division').agg({'title': 'nunique'}).reset_index()

# Pivot the data to get the number of articles by quality
results_and_details_region_groups_pivot = results_and_details_region.pivot_table(index='regional_division', columns='prediction', values='title', aggfunc='count', fill_value=0).reset_index()

#results_and_details_region_groups.head(10)
#results_and_details_region_groups_pivot.head(10)

# Merge the above 2 to get the combined dataframe
results_and_details_region_aggregate = results_and_details_region_groups.merge(results_and_details_region_groups_pivot, on='regional_division', how='left')
#results_and_details_region_aggregate.head(10)

# Get the population by region
results_and_details_region_aggregate = results_and_details_region_aggregate.merge(population_by_region, on='regional_division', how='left')

# Add columns for articles per capita
results_and_details_region_aggregate['total articles per capita'] = results_and_details_region_aggregate['title']/results_and_details_region_aggregate['Population Est. 2022']
results_and_details_region_aggregate['High Quality (FA or GA) articles per capita'] = (results_and_details_region_aggregate['FA'] + results_and_details_region_aggregate['GA'])/results_and_details_region_aggregate['Population Est. 2022']

results_and_details_region_aggregate.head(10)

Unnamed: 0,regional_division,title,B,C,FA,GA,Start,Stub,Population Est. 2022,total articles per capita,High Quality (FA or GA) articles per capita
0,Midwest (East North Central),4753,138,2976,7,712,862,59,47097779,0.000101,1.5e-05
1,Midwest (West North Central),3578,40,2727,5,635,161,10,19721893,0.000181,3.2e-05
2,Northeast (Middle Atlantic),3781,283,1747,142,914,234,470,41910858,9e-05,2.5e-05
3,Northeast (New England),1437,56,920,9,216,171,65,11503343,0.000125,2e-05
4,South (East South Central),1529,40,1261,6,365,311,9,19578002,7.8e-05,1.9e-05
5,South (South Atlantic),1850,112,1204,13,533,117,5,66781137,2.8e-05,8e-06
6,South (West South Central),2103,62,1189,10,625,213,7,41685250,5e-05,1.5e-05
7,West (Mountain),1189,49,722,4,338,89,9,25514320,4.7e-05,1.3e-05
8,West (Pacific),1304,133,595,19,471,40,46,53229044,2.4e-05,9e-06


## Results


1) Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order).


In [204]:
top10_states_by_coverage = results_and_details_state_aggregate.sort_values(by='total articles per capita', ascending=False)
top10_states_by_coverage.head(10)

Unnamed: 0,state,Population Est. 2022,title,B,C,FA,GA,Start,Stub,total articles per capita,High Quality (FA or GA) articles per capita
42,Vermont,647064,329,3,185,0,45,35,61,0.000508,7e-05
31,North Dakota,779261,356,5,309,0,26,16,0,0.000457,3.3e-05
17,Maine,1385340,483,2,300,0,43,135,3,0.000349,3.1e-05
38,South Dakota,909824,311,2,248,0,56,3,2,0.000342,6.2e-05
13,Iowa,3200517,1043,8,903,2,102,27,1,0.000326,3.2e-05
1,Alaska,733583,149,8,89,1,30,19,2,0.000203,4.2e-05
35,Pennsylvania,12972008,2556,60,1347,3,563,140,443,0.000197,4.4e-05
20,Michigan,10034113,1773,49,861,1,132,685,45,0.000177,1.3e-05
47,Wyoming,581381,99,1,55,0,39,3,1,0.00017,6.7e-05
26,New Hampshire,1395231,234,4,167,1,62,0,0,0.000168,4.5e-05


2) Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order).


In [186]:
bottom10_states_by_coverage = results_and_details_state_aggregate.sort_values(by='total articles per capita', ascending=True)
bottom10_states_by_coverage.head(10)

Unnamed: 0,state,Population Est. 2022,title,B,C,FA,GA,Start,Stub,total articles per capita,High Quality (FA or GA) articles per capita
30,North Carolina,10698973,50,2,27,1,20,0,0,5e-06,2e-06
25,Nevada,3177772,19,1,10,0,8,0,0,6e-06,3e-06
4,California,39029342,482,102,207,3,170,0,0,1.2e-05,4e-06
2,Arizona,7359197,91,12,54,1,23,0,1,1.2e-05,3e-06
43,Virginia,8683619,133,44,186,2,34,0,0,1.5e-05,4e-06
7,Florida,22244823,412,34,235,6,114,23,1,1.9e-05,5e-06
33,Oklahoma,4019800,75,8,35,0,31,1,0,1.9e-05,8e-06
14,Kansas,2937150,63,11,30,1,21,0,0,2.1e-05,7e-06
18,Maryland,6164660,157,8,100,2,40,4,3,2.5e-05,7e-06
46,Wisconsin,5892539,192,14,117,0,61,1,0,3.3e-05,1e-05


3) Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order).


In [187]:
top10_states_by_HQ_coverage = results_and_details_state_aggregate.sort_values(by='High Quality (FA or GA) articles per capita', ascending=False)
top10_states_by_HQ_coverage.head(10)

Unnamed: 0,state,Population Est. 2022,title,B,C,FA,GA,Start,Stub,total articles per capita,High Quality (FA or GA) articles per capita
42,Vermont,647064,329,3,185,0,45,35,61,0.000508,7e-05
47,Wyoming,581381,99,1,55,0,39,3,1,0.00017,6.7e-05
38,South Dakota,909824,311,2,248,0,56,3,2,0.000342,6.2e-05
45,West Virginia,1775156,232,6,120,0,106,0,0,0.000131,6e-05
24,Montana,1122867,128,8,62,0,55,3,0,0.000114,4.9e-05
26,New Hampshire,1395231,234,4,167,1,62,0,0,0.000168,4.5e-05
35,Pennsylvania,12972008,2556,60,1347,3,563,140,443,0.000197,4.4e-05
23,Missouri,6177957,951,6,605,1,262,71,6,0.000154,4.3e-05
1,Alaska,733583,149,8,89,1,30,19,2,0.000203,4.2e-05
27,New Jersey,9261699,564,183,0,130,249,0,2,6.1e-05,4.1e-05


4) Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).

In [189]:
bottom10_states_by_HQ_coverage = results_and_details_state_aggregate.sort_values(by='High Quality (FA or GA) articles per capita', ascending=True)
bottom10_states_by_HQ_coverage.head(10)

Unnamed: 0,state,Population Est. 2022,title,B,C,FA,GA,Start,Stub,total articles per capita,High Quality (FA or GA) articles per capita
30,North Carolina,10698973,50,2,27,1,20,0,0,5e-06,2e-06
25,Nevada,3177772,19,1,10,0,8,0,0,6e-06,3e-06
2,Arizona,7359197,91,12,54,1,23,0,1,1.2e-05,3e-06
43,Virginia,8683619,133,44,186,2,34,0,0,1.5e-05,4e-06
4,California,39029342,482,102,207,3,170,0,0,1.2e-05,4e-06
7,Florida,22244823,412,34,235,6,114,23,1,1.9e-05,5e-06
29,New York,19677151,661,40,400,9,102,94,25,3.4e-05,6e-06
18,Maryland,6164660,157,8,100,2,40,4,3,2.5e-05,7e-06
14,Kansas,2937150,63,11,30,1,21,0,0,2.1e-05,7e-06
33,Oklahoma,4019800,75,8,35,0,31,1,0,1.9e-05,8e-06


5) Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.

In [205]:

top_divisions_by_coverage = results_and_details_region_aggregate.sort_values(by='total articles per capita', ascending=False)
top_divisions_by_coverage.head(10)

Unnamed: 0,regional_division,title,B,C,FA,GA,Start,Stub,Population Est. 2022,total articles per capita,High Quality (FA or GA) articles per capita
1,Midwest (West North Central),3578,40,2727,5,635,161,10,19721893,0.000181,3.2e-05
3,Northeast (New England),1437,56,920,9,216,171,65,11503343,0.000125,2e-05
0,Midwest (East North Central),4753,138,2976,7,712,862,59,47097779,0.000101,1.5e-05
2,Northeast (Middle Atlantic),3781,283,1747,142,914,234,470,41910858,9e-05,2.5e-05
4,South (East South Central),1529,40,1261,6,365,311,9,19578002,7.8e-05,1.9e-05
6,South (West South Central),2103,62,1189,10,625,213,7,41685250,5e-05,1.5e-05
7,West (Mountain),1189,49,722,4,338,89,9,25514320,4.7e-05,1.3e-05
5,South (South Atlantic),1850,112,1204,13,533,117,5,66781137,2.8e-05,8e-06
8,West (Pacific),1304,133,595,19,471,40,46,53229044,2.4e-05,9e-06


6) Census divisions by high quality coverage: A rank ordered list of US census divisions (in descending order) by high quality articles per capita.


In [206]:

top_divisions_by_HQ_coverage = results_and_details_region_aggregate.sort_values(by='High Quality (FA or GA) articles per capita', ascending=False)
top_divisions_by_HQ_coverage.head(10)

Unnamed: 0,regional_division,title,B,C,FA,GA,Start,Stub,Population Est. 2022,total articles per capita,High Quality (FA or GA) articles per capita
1,Midwest (West North Central),3578,40,2727,5,635,161,10,19721893,0.000181,3.2e-05
2,Northeast (Middle Atlantic),3781,283,1747,142,914,234,470,41910858,9e-05,2.5e-05
3,Northeast (New England),1437,56,920,9,216,171,65,11503343,0.000125,2e-05
4,South (East South Central),1529,40,1261,6,365,311,9,19578002,7.8e-05,1.9e-05
0,Midwest (East North Central),4753,138,2976,7,712,862,59,47097779,0.000101,1.5e-05
6,South (West South Central),2103,62,1189,10,625,213,7,41685250,5e-05,1.5e-05
7,West (Mountain),1189,49,722,4,338,89,9,25514320,4.7e-05,1.3e-05
8,West (Pacific),1304,133,595,19,471,40,46,53229044,2.4e-05,9e-06
5,South (South Atlantic),1850,112,1204,13,533,117,5,66781137,2.8e-05,8e-06
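
The rankings in this section sort the whole table and then take the first rows. `DataFrame.nlargest` and `DataFrame.nsmallest` give the same top and bottom rows in a single call. The sketch below uses four states with the title counts and populations from the tables above; the column name `articles_per_capita` is chosen here for illustration.

```python
import pandas as pd

# Title counts and 2022 population estimates copied from the state-level tables
sample = pd.DataFrame({
    'state': ['Vermont', 'North Dakota', 'North Carolina', 'Nevada'],
    'Population Est. 2022': [647064, 779261, 10698973, 3177772],
    'title': [329, 356, 50, 19],
})
sample['articles_per_capita'] = sample['title'] / sample['Population Est. 2022']

# Highest and lowest coverage without sorting the full table first
top2 = sample.nlargest(2, 'articles_per_capita')['state'].tolist()
bottom2 = sample.nsmallest(2, 'articles_per_capita')['state'].tolist()

print(top2)     # ['Vermont', 'North Dakota']
print(bottom2)  # ['North Carolina', 'Nevada']
```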
