## Raman SV - Data 512 - Assignment 2

Code snippets in this document are either used as-is or adapted from the example code shared by Dr. David McDonald, professor for Data 512.
They come from the notebooks "wp_page_info_example" and "wp_ores_liftwing_example", shared as part of the starter code for this assignment.
The original code is licensed CC-BY (https://creativecommons.org/licenses/by/4.0/) by its author.

Import the necessary libraries and packages for this project and set up the local environment.

In [1]:
import os
import json, time, urllib.parse
import requests
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.lines as mlines
import seaborn as sns
import datetime

pd.set_option('display.max_colwidth', None)
Curr_Dir = 'C:/Users/raman/OneDrive/Desktop/UDubs/Classroom/Q4/Data 512 - HCDE/Assignments/Week 2/hub.com-svraman1991-data-512-homework_1/'



Import the list of US cities by state that serves as the input. Source file courtesy of Dr. David McDonald, professor for Data 512.

In [6]:
state_list = pd.read_csv(os.path.join(Curr_Dir,'Input Files', 'us_cities_by_state_SEPT.2023 - test_version.csv'))
state_list.head(100)

Unnamed: 0,state,page_title,url
0,Washington,"Aberdeen, Washington","https://en.wikipedia.org/wiki/Aberdeen,_Washington"
1,Washington,"Airway Heights, Washington","https://en.wikipedia.org/wiki/Airway_Heights,_Washington"
2,Washington,"Albion, Washington","https://en.wikipedia.org/wiki/Albion,_Washington"
3,Washington,"Algona, Washington","https://en.wikipedia.org/wiki/Algona,_Washington"
4,Washington,"Almira, Washington","https://en.wikipedia.org/wiki/Almira,_Washington"
...,...,...,...
95,Washington,"Harrah, Washington","https://en.wikipedia.org/wiki/Harrah,_Washington"
96,Washington,"Harrington, Washington","https://en.wikipedia.org/wiki/Harrington,_Washington"
97,Washington,"Hartline, Washington","https://en.wikipedia.org/wiki/Hartline,_Washington"
98,Washington,"Hatton, Washington","https://en.wikipedia.org/wiki/Hatton,_Washington"


### Pageview Variables initialization
The code below (and its comments) is mainly derived from the aforementioned starter code for this assignment.

In [7]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<svraman@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
# ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
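The throttle constants above can be checked with a short worked example. This is a minimal sketch that copies the two constant values from the cell; `time_per_request` and `effective_rate` are names introduced here for illustration only, and the 2 ms latency remains an assumption, not a measurement.

```python
# Values copied from the constants cell above
API_LATENCY_ASSUMED = 0.002                               # assumed 2 ms latency per request
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED   # sleep time before each request

# Hypothetical names, used only for this illustration
time_per_request = API_THROTTLE_WAIT + API_LATENCY_ASSUMED   # sleep plus assumed latency
effective_rate = 1.0 / time_per_request                      # requests per second

print(f"sleep per request: {API_THROTTLE_WAIT * 1000:.1f} ms")
print(f"effective rate:    {effective_rate:.0f} requests/second")
```

Under these assumptions the code sleeps for about 8 ms before each call, which caps the request rate at roughly 100 per second. If the real latency is higher than 2 ms, the actual rate is lower.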


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [9]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

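    # work on a copy so the shared template is not modified between calls
    request_template = dict(request_template)

    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")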

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
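To see what a single request looks like on the wire, the parameters can be turned into a query string by hand. The sketch below uses the standard library's `urllib.parse.urlencode` as a stand-in; `requests` builds the query string itself and its exact escaping may differ slightly, so treat the printed URL as an approximation. The `params` dict copies the template values from the constants cell.

```python
import urllib.parse

API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# Parameters as they would look for one article (values copied from the template above)
params = {
    "action": "query",
    "format": "json",
    "titles": "Albion, Washington",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

# urlencode escapes characters such as ',', ' ' and '|' for use in a URL
query_string = urllib.parse.urlencode(params)
approximate_url = f"{API_ENWIKIPEDIA_ENDPOINT}?{query_string}"
print(approximate_url)
```

Pasting a URL of this shape into a browser is a quick way to inspect the raw JSON for one title before running the full loop.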


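Below is the example call from the starter code that uses the above function to fetch the page info for a single article. This cell was not executed in this run; `ARTICLE_TITLES` is commented out in the constants cell, so it is defined here to keep the cell runnable.

In [None]:
# Example titles from the starter code (commented out in the constants cell above)
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

print(f"Getting page info data for: {ARTICLE_TITLES[2]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[2])
print(json.dumps(info,indent=4))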

In [12]:
state_list['page_title'][1]

'Airway Heights, Washington'

In [21]:

state_list_temp = state_list['page_title'].iloc[:10]
state_list_temp.head(20)

0             Aberdeen, Washington
1       Airway Heights, Washington
2               Albion, Washington
3               Algona, Washington
4               Almira, Washington
5            Anacortes, Washington
6            Arlington, Washington
7               Asotin, Washington
8               Auburn, Washington
9    Bainbridge Island, Washington
Name: page_title, dtype: object

In [23]:
#PageViews_state_list = PAGEINFO_PARAMS_TEMPLATE.copy()
#PageViews_state_list['titles'] = ['page_title'][1]


info = request_pageinfo_per_article(state_list_temp[2])
print(json.dumps(info,indent=4))

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "138321": {
                "pageid": 138321,
                "ns": 0,
                "title": "Albion, Washington",
                "contentmodel": "wikitext",
                "pagelanguage": "en",
                "pagelanguagehtmlcode": "en",
                "pagelanguagedir": "ltr",
                "touched": "2023-10-10T22:36:02Z",
                "lastrevid": 1165909617,
                "length": 10572,
                "talkid": 12209556,
                "fullurl": "https://en.wikipedia.org/wiki/Albion,_Washington",
                "editurl": "https://en.wikipedia.org/w/index.php?title=Albion,_Washington&action=edit",
                "canonicalurl": "https://en.wikipedia.org/wiki/Albion,_Washington"
            }
        }
    }
}
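The response above nests the page fields under `query`, then `pages`, then the page ID. A minimal sketch of pulling one page record out of that structure follows; `first_page_record` is a hypothetical helper, not part of the starter code, and `sample_response` is a trimmed copy of the output printed above.

```python
# Trimmed copy of the response printed above (only a few fields kept)
sample_response = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "138321": {
                "pageid": 138321,
                "title": "Albion, Washington",
                "lastrevid": 1165909617,
                "length": 10572,
                "talkid": 12209556,
            }
        }
    },
}

def first_page_record(response):
    """Return the first page dict under query.pages, or None (hypothetical helper)."""
    pages = (response or {}).get("query", {}).get("pages", {})
    # 'pages' is keyed by page ID, so take the values rather than a fixed key
    return next(iter(pages.values()), None)

page = first_page_record(sample_response)
print(page["title"], page["lastrevid"])

# 'watchers' is absent from this response, so read it with a default
print(page.get("watchers"))
```

Using `.get()` for optional fields avoids a `KeyError` when a field such as `watchers` is missing from a particular page's record.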


In [45]:
state_list_data = []

for i, title in enumerate(state_list_temp):
    PageInfo = request_pageinfo_per_article(title)
    print(PageInfo)
    # Guard against responses that lack the 'query' key (for example, None after a failed
    # request), which would otherwise raise a KeyError later on
    if PageInfo is not None and isinstance(PageInfo, dict) and 'query' in PageInfo:

        # df = pd.DataFrame(PageInfo)
        state_list_data.append(PageInfo)

    else:
        print("No valid response")

state_list_data = pd.DataFrame({'Response': state_list_data})

# state_list_data now contains all the responses in a single column
state_list_data.head()



{'batchcomplete': '', 'query': {'pages': {'137935': {'pageid': 137935, 'ns': 0, 'title': 'Aberdeen, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:36:02Z', 'lastrevid': 1179080433, 'length': 34588, 'watchers': 80, 'talkid': 1883355, 'fullurl': 'https://en.wikipedia.org/wiki/Aberdeen,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Aberdeen,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Aberdeen,_Washington'}}}}
{'batchcomplete': '', 'query': {'pages': {'138251': {'pageid': 138251, 'ns': 0, 'title': 'Airway Heights, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-13T07:42:59Z', 'lastrevid': 1176353978, 'length': 19652, 'talkid': 8997019, 'fullurl': 'https://en.wikipedia.org/wiki/Airway_Heights,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?

Unnamed: 0,Response
0,"{'batchcomplete': '', 'query': {'pages': {'137935': {'pageid': 137935, 'ns': 0, 'title': 'Aberdeen, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:36:02Z', 'lastrevid': 1179080433, 'length': 34588, 'watchers': 80, 'talkid': 1883355, 'fullurl': 'https://en.wikipedia.org/wiki/Aberdeen,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Aberdeen,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Aberdeen,_Washington'}}}}"
1,"{'batchcomplete': '', 'query': {'pages': {'138251': {'pageid': 138251, 'ns': 0, 'title': 'Airway Heights, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-13T07:42:59Z', 'lastrevid': 1176353978, 'length': 19652, 'talkid': 8997019, 'fullurl': 'https://en.wikipedia.org/wiki/Airway_Heights,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Airway_Heights,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Airway_Heights,_Washington'}}}}"
2,"{'batchcomplete': '', 'query': {'pages': {'138321': {'pageid': 138321, 'ns': 0, 'title': 'Albion, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:36:02Z', 'lastrevid': 1165909617, 'length': 10572, 'talkid': 12209556, 'fullurl': 'https://en.wikipedia.org/wiki/Albion,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Albion,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Albion,_Washington'}}}}"
3,"{'batchcomplete': '', 'query': {'pages': {'137975': {'pageid': 137975, 'ns': 0, 'title': 'Algona, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:36:02Z', 'lastrevid': 1176355686, 'length': 13446, 'talkid': 12043125, 'fullurl': 'https://en.wikipedia.org/wiki/Algona,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Algona,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Algona,_Washington'}}}}"
4,"{'batchcomplete': '', 'query': {'pages': {'138087': {'pageid': 138087, 'ns': 0, 'title': 'Almira, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-13T15:22:31Z', 'lastrevid': 1135299836, 'length': 14608, 'talkid': 12205121, 'fullurl': 'https://en.wikipedia.org/wiki/Almira,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Almira,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Almira,_Washington'}}}}"
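The loop above prints "No valid response" but does not record which title failed. One possible variation, sketched below, keeps the failed titles in a second list so they can be retried later. `collect_pageinfo` and `fake_fetch` are hypothetical names; the fetch function is passed in as an argument so the sketch runs offline with canned data, whereas in the notebook it would be `request_pageinfo_per_article`.

```python
def collect_pageinfo(titles, fetch):
    """Call fetch(title) for each title and split the results (hypothetical helper).

    `fetch` is any callable that returns a parsed JSON dict or None.
    """
    responses, failed_titles = [], []
    for title in titles:
        info = fetch(title)
        if isinstance(info, dict) and "query" in info:
            responses.append(info)
        else:
            failed_titles.append(title)   # remember which title needs a retry
    return responses, failed_titles

# Stand-in for the real API call (canned data, no network access)
def fake_fetch(title):
    canned = {
        "Albion, Washington": {"query": {"pages": {"138321": {"title": title}}}},
    }
    return canned.get(title)

responses, failed = collect_pageinfo(
    ["Albion, Washington", "Title with no canned response"], fake_fetch
)
print(len(responses), failed)
```

The `responses` list can then be wrapped in a DataFrame exactly as in the cell above, while `failed` gives a concrete list to re-request.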


In [52]:
lastrevid_values = []

for index, row in state_list_data.iterrows():
    response_dict = row['Response']

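    # First attempt: look up a field directly under 'pages'. This always returns None because
    # 'pages' is keyed by page ID rather than by field name; the next cell corrects this
    lastrevid = response_dict.get('query', {}).get('pages', {}).get('title', None)
    lastrevid_values.append(lastrevid)

# Add the (empty) 'lastrevid' values as a new column in the DataFrame
state_list_data['lastrevid'] = lastrevid_values

# state_list_data now has a 'lastrevid' column, which is empty for the reason noted above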
state_list_data.head(5)


Unnamed: 0,Response,lastrevid
0,"{'batchcomplete': '', 'query': {'pages': {'137935': {'pageid': 137935, 'ns': 0, 'title': 'Aberdeen, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:36:02Z', 'lastrevid': 1179080433, 'length': 34588, 'watchers': 80, 'talkid': 1883355, 'fullurl': 'https://en.wikipedia.org/wiki/Aberdeen,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Aberdeen,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Aberdeen,_Washington'}}}}",
1,"{'batchcomplete': '', 'query': {'pages': {'138251': {'pageid': 138251, 'ns': 0, 'title': 'Airway Heights, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-13T07:42:59Z', 'lastrevid': 1176353978, 'length': 19652, 'talkid': 8997019, 'fullurl': 'https://en.wikipedia.org/wiki/Airway_Heights,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Airway_Heights,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Airway_Heights,_Washington'}}}}",
2,"{'batchcomplete': '', 'query': {'pages': {'138321': {'pageid': 138321, 'ns': 0, 'title': 'Albion, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:36:02Z', 'lastrevid': 1165909617, 'length': 10572, 'talkid': 12209556, 'fullurl': 'https://en.wikipedia.org/wiki/Albion,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Albion,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Albion,_Washington'}}}}",
3,"{'batchcomplete': '', 'query': {'pages': {'137975': {'pageid': 137975, 'ns': 0, 'title': 'Algona, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:36:02Z', 'lastrevid': 1176355686, 'length': 13446, 'talkid': 12043125, 'fullurl': 'https://en.wikipedia.org/wiki/Algona,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Algona,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Algona,_Washington'}}}}",
4,"{'batchcomplete': '', 'query': {'pages': {'138087': {'pageid': 138087, 'ns': 0, 'title': 'Almira, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-13T15:22:31Z', 'lastrevid': 1135299836, 'length': 14608, 'talkid': 12205121, 'fullurl': 'https://en.wikipedia.org/wiki/Almira,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Almira,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Almira,_Washington'}}}}",
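The empty `lastrevid` column above follows from the shape of the response: the only key directly under `pages` is the page ID, so a lookup by field name at that level finds nothing. The short sketch below contrasts the two lookups on a trimmed copy of the first response; the variable `direct_lookup` is introduced here for illustration only.

```python
# One response in the shape shown above (trimmed to two fields)
response_dict = {
    "query": {
        "pages": {
            "137935": {"title": "Aberdeen, Washington", "lastrevid": 1179080433}
        }
    }
}

pages = response_dict.get("query", {}).get("pages", {})

# Looking up a field name directly on 'pages' finds nothing: its only key is the page ID
direct_lookup = pages.get("lastrevid")
print(list(pages.keys()), direct_lookup)

# Going one level deeper, through each page ID, reaches the fields
for page_id, page_data in pages.items():
    print(page_id, page_data.get("lastrevid"))
```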


In [55]:
lastrevid_values = []
title_values = []

for index, row in state_list_data.iterrows():
    response_dict = row['Response']

    # Access 'lastrevid' value and append to the list
    pages_data = response_dict.get('query', {}).get('pages', {})
    lastrevid = None
    title = None

    for key, page_data in pages_data.items():
        lastrevid = page_data.get('lastrevid')
        title = page_data.get('title')
        if lastrevid:
            lastrevid_values.append(lastrevid)
            title_values.append(title)
            break  # Stop searching when 'lastrevid' is found

    if lastrevid is None:
        lastrevid_values.append(None)
        title_values.append(None)


# Add the 'lastrevid' and 'title' values as new columns in the DataFrame
state_list_data['lastrevid'] = lastrevid_values
state_list_data['title'] = title_values

# state_list_data now contains the 'lastrevid' and 'title' values
state_list_data.head(5)


Unnamed: 0,Response,lastrevid,title
0,"{'batchcomplete': '', 'query': {'pages': {'137935': {'pageid': 137935, 'ns': 0, 'title': 'Aberdeen, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:36:02Z', 'lastrevid': 1179080433, 'length': 34588, 'watchers': 80, 'talkid': 1883355, 'fullurl': 'https://en.wikipedia.org/wiki/Aberdeen,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Aberdeen,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Aberdeen,_Washington'}}}}",1179080433,"Aberdeen, Washington"
1,"{'batchcomplete': '', 'query': {'pages': {'138251': {'pageid': 138251, 'ns': 0, 'title': 'Airway Heights, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-13T07:42:59Z', 'lastrevid': 1176353978, 'length': 19652, 'talkid': 8997019, 'fullurl': 'https://en.wikipedia.org/wiki/Airway_Heights,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Airway_Heights,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Airway_Heights,_Washington'}}}}",1176353978,"Airway Heights, Washington"
2,"{'batchcomplete': '', 'query': {'pages': {'138321': {'pageid': 138321, 'ns': 0, 'title': 'Albion, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:36:02Z', 'lastrevid': 1165909617, 'length': 10572, 'talkid': 12209556, 'fullurl': 'https://en.wikipedia.org/wiki/Albion,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Albion,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Albion,_Washington'}}}}",1165909617,"Albion, Washington"
3,"{'batchcomplete': '', 'query': {'pages': {'137975': {'pageid': 137975, 'ns': 0, 'title': 'Algona, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-10T22:36:02Z', 'lastrevid': 1176355686, 'length': 13446, 'talkid': 12043125, 'fullurl': 'https://en.wikipedia.org/wiki/Algona,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Algona,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Algona,_Washington'}}}}",1176355686,"Algona, Washington"
4,"{'batchcomplete': '', 'query': {'pages': {'138087': {'pageid': 138087, 'ns': 0, 'title': 'Almira, Washington', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2023-10-13T15:22:31Z', 'lastrevid': 1135299836, 'length': 14608, 'talkid': 12205121, 'fullurl': 'https://en.wikipedia.org/wiki/Almira,_Washington', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Almira,_Washington&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Almira,_Washington'}}}}",1135299836,"Almira, Washington"
