<a href="https://colab.research.google.com/github/vriteshg1210/data-512-homework_2/blob/main/DATA512_HW2_Analysing_Wikipedia_Page_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DATA512 - HW2: Considering Bias in Data
by Vritesh Gera, University of Washington

The goal of this project is to explore the concept of bias in data using Wikipedia articles. We consider articles about cities in different US states, combine a dataset of Wikipedia articles with a dataset of state populations, and use a machine learning service called ORES to estimate the quality of the articles about the cities. We then analyze how both the coverage of US cities on Wikipedia and the quality of the articles about those cities vary among states. The Wikipedia ['Category:Lists of cities in the United States by state'](https://en.wikipedia.org/wiki/Category:Lists_of_cities_in_the_United_States_by_state) was crawled to generate a list of Wikipedia article pages about US cities from each state. This data can be found in [us_cities_by_state_SEPT.2023.csv](https://drive.google.com/file/d/1khouDmMaZyKo0y5WkFj4lu7g8o35x_98/view?usp=sharing).


## Data Acquisition

We will start by acquiring the page info data using the MediaWiki API described below, collecting the responses into a list of dictionaries and finally into dataframes. The first step is to import the Python libraries needed to run this notebook.

In [4]:
# json, time, urllib.parse and warnings are standard Python modules
import json, time, urllib.parse
import warnings

# pandas, matplotlib and requests are not standard Python modules. You will need to install them with pip/pip3 if you do not already have them
import pandas as pd
import matplotlib.pyplot as plt
import requests

I have uploaded the dataset to my Google Drive, so to access it we have to mount the drive in Google Colab. You can find the dataset here: https://drive.google.com/file/d/1df_fvJXuFtZGuYUpqitDYj2RWkkPbKOv/view?usp=drive_link

In [5]:
# Suppress the warning statements
warnings.filterwarnings("ignore")

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Page Info API
The code below illustrates how to access page info data using the [MediaWiki Action API for the EN Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example. The code provided below has been sampled from [this notebook](https://drive.google.com/file/d/15UoE16s-IccCTOXREjU3xDIz07tlpyrl/view?usp=sharing) under the [CC-BY license](https://creativecommons.org/licenses/by/4.0/).

We will now use the API to get the page info data from Wikipedia. Below are some constants that make the code more readable and the API calls smoother.

In [7]:
#    CONSTANTS

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<vriteshg@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

#Now we will access the list of article urls along with the state and the title of the article stored in a csv
df_city_by_state = pd.read_csv('/content/drive/My Drive/us_cities_by_state_SEPT.2023.csv')
ARTICLE_TITLES = df_city_by_state['page_title'].values

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [8]:
#    PROCEDURES/FUNCTIONS

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

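    # article title can be as a parameter to the call or in the request_template
    # work on a copy so the shared module-level template is never mutated between calls
    request_template = dict(request_template)
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")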

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
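
Because the procedure above asks for one title per call, a full crawl makes one request per article. The MediaWiki Action API also accepts several titles in a single `titles` parameter, separated by `|` (up to 50 per request for ordinary accounts). The following is only a sketch of how the titles could be grouped; `batch_titles` and `MAX_TITLES_PER_REQUEST` are illustrative names that are not part of this notebook, and no request is made here.

```python
from typing import Iterator, Sequence

MAX_TITLES_PER_REQUEST = 50  # assumed limit for ordinary (non-bot) accounts


def batch_titles(titles: Sequence[str],
                 batch_size: int = MAX_TITLES_PER_REQUEST) -> Iterator[str]:
    """Yield pipe-joined groups of at most batch_size titles."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])


# Three titles taken from the crawl log, grouped two at a time for illustration
sample_titles = ["Abbeville, Alabama", "Adamsville, Alabama", "Addison, Alabama"]
batches = list(batch_titles(sample_titles, batch_size=2))
print(batches)
```

Each joined string could be passed as the `article_title` argument. The response would then carry several pages at once, so any downstream code would need to iterate over all of them.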


Now we will hit the API and start appending the data being sent back for each article in a list. In the end we'll have a list of dictionaries.

In [None]:
article_info_list = []
for article_title in ARTICLE_TITLES:
  print(f"Getting page info data for: {article_title}")
  info = request_pageinfo_per_article(article_title)
  article_info_list.append(info)

Getting page info data for: Abbeville, Alabama
Getting page info data for: Adamsville, Alabama
Getting page info data for: Addison, Alabama
Getting page info data for: Akron, Alabama
Getting page info data for: Alabaster, Alabama
Getting page info data for: Albertville, Alabama
Getting page info data for: Alexander City, Alabama
Getting page info data for: Aliceville, Alabama
Getting page info data for: Allgood, Alabama
Getting page info data for: Altoona, Alabama
Getting page info data for: Andalusia, Alabama
Getting page info data for: Anderson, Lauderdale County, Alabama
Getting page info data for: Anniston, Alabama
Getting page info data for: Arab, Alabama
Getting page info data for: Ardmore, Alabama
Getting page info data for: Argo, Alabama
Getting page info data for: Ariton, Alabama
Getting page info data for: Arley, Alabama
Getting page info data for: Ashford, Alabama
Getting page info data for: Ashland, Alabama
Getting page info data for: Ashville, Alabama
Getting page info dat

As the API takes a very long time to fetch all the data, we will write the list of dicts to a JSON file so that it can be reloaded without calling the API again.

In [None]:
# A context manager closes the file even if json.dump raises an error
with open('/content/drive/MyDrive/city.json', "w") as city_data:
  json.dump(article_info_list, city_data, indent = 4)

In [9]:
#Reading the city json
file_path_city = '/content/drive/My Drive/city.json'
with open(file_path_city, 'r') as json_file:
  article_info_list = json.load(json_file)

The next step would be to extract the useful information from the list of dicts we have created. Out of this, we need the title of the article and the lastrevid. We extract this data and append it to a new list.

In [10]:
title_revid_list = [{'title': item['title'], 'lastrevid': item['lastrevid']} for item in article_info_list]
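
The comprehension above expects every item to expose `title` and `lastrevid` at the top level. A raw `action=query&prop=info` response nests those fields under `query`, then `pages`, then the page ID. If the saved JSON holds raw responses, a flattening step along the following lines may be needed first. This is a sketch: `flatten_pageinfo` is a hypothetical helper, and the sample responses are hand-written for illustration (the page ID 12345 is made up).

```python
def flatten_pageinfo(responses):
    """Collect {'title', 'lastrevid'} records from raw page info responses."""
    records = []
    for response in responses:
        if not response:          # a failed request was stored as None
            continue
        pages = response.get("query", {}).get("pages", {})
        for page in pages.values():
            if "lastrevid" in page:   # missing pages have no revision ID
                records.append({"title": page["title"],
                                "lastrevid": page["lastrevid"]})
    return records


sample_responses = [
    {"batchcomplete": "",
     "query": {"pages": {"12345": {"pageid": 12345, "ns": 0,
                                   "title": "Abbeville, Alabama",
                                   "lastrevid": 1171163550}}}},
    None,
    {"query": {"pages": {"-1": {"ns": 0, "title": "No such page", "missing": ""}}}},
]
flat_records = flatten_pageinfo(sample_responses)
print(flat_records)
```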

## The ORES API

After we have the data, we will call another API that gives us the ORES rating of each article in our list. The machine learning system used for this purpose is called [ORES](https://www.mediawiki.org/wiki/ORES). This was originally an acronym for 'Objective Revision Evaluation Service' but was simply renamed 'ORES'. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are labelings learned from articles in Wikipedia that were peer-reviewed using the [Wikipedia content assessment](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment) procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors, ranked below from best to worst:

1. FA - Featured article
2. GA - Good article (sometimes called A-class)
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

ORES requires a specific revision ID of a specific article to be able to make a label prediction. You can use the [API:Info](https://www.mediawiki.org/wiki/API:Info) request to get a range of metadata on an article, including the most current revision ID of the article page. For more information, the [ORES API documentation](https://ores.wikimedia.org/) can be accessed from the main ORES page.

Wikimedia is implementing a new Machine Learning (ML) service infrastructure that they call [LiftWing](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing). Given that ORES already has several ML models that have been well used, ORES is the first set of APIs that are being moved to LiftWing. This code below illustrates how to generate article quality estimates for article revisions using the LiftWing version of ORES. The [ORES LiftWing documentation](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage) has information about some parameters that have been renamed (e.g., "revid" in the old ORES API is now "rev_id" in the LiftWing ORES API).

The code to request ORES scores through LiftWing ML Service API has been sampled from [this notebook](https://drive.google.com/file/d/17C9xsmR9U3lJeD52UTbAedlHDetwYsxs/view?usp=sharing) under the [CC-BY license](https://creativecommons.org/licenses/by/4.0/).

In [11]:
#    CONSTANTS
#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token.
#    The token used here grants 5000 requests per HOUR, so the wait between requests is computed over 3600 seconds.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (3600.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there

#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too

REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<gera.vritesh@gmail.com>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}

#    This is a template for the parameters that we need to supply in the headers of an API request
#    The access token is a credential, so it is left blank here and supplied when the scoring function is called

REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "gera.vritesh@gmail.com",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}



# We will input the list of titles along with its revid
ARTICLE_REVISIONS = title_revid_list
#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

# The real values are replaced with placeholder strings because credentials should not be distributed openly
USERNAME = "xxx"
ACCESS_TOKEN = "xxxx"

## Get your Access Token for getting ORES Data

You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). If you have a Wikipedia user account, you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password to check. If you do not have a Wikimedia account, you will need to create one that you can use to get an access token.

There is [a 'guide' that describes how to get authentication tokens](https://api.wikimedia.org/wiki/Authentication) - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment.

The documentation talks about using a 'dashboard' for managing authentication tokens. You might have a hard time finding this 'dashboard'. First, on the left hand side of the page, you'll see a column of links. The bottom section is a set of links titled "Tools". In that section is a link that says [Special pages](https://api.wikimedia.org/wiki/Special:SpecialPages). At the very bottom of the 'Special pages' page is a section titled 'Other special pages' (scroll all the way to the bottom). The first link in that section is called [API keys](https://api.wikimedia.org/wiki/Special:AppManagement). When you get to the 'API keys' page you can create a new key.

The authentication guide suggests that you should create a server-side app key which did not work when I tried. But, there is an option to create a [Personal API token](https://api.wikimedia.org/wiki/Authentication) that should work for this study and the type of ORES page scoring that you will need to perform.

Note that when you create a Personal API token you are granted three items - a Client ID, a Client secret, and an Access token - and you should save all three. When you dismiss the box they are gone. If you lose any one of them, you can destroy or deactivate the Personal API token from the dashboard and then create a new one.

The value you need to work the code below is the Access token - a very long string.
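
The throttling constants used for the ORES requests are described as coming from dissecting the granted token. A minimal sketch of that step follows, assuming the access token is a JSON Web Token whose payload carries a `ratelimit` claim of the form `{"requests_per_unit": 5000, "unit": "HOUR"}`. The helper names are hypothetical, the unit names other than `HOUR` are assumptions, and the demo token is built locally so that no real credential appears.

```python
import base64
import json

# Seconds in each rate-limit unit; names other than HOUR are assumed
SECONDS_PER_UNIT = {"SECOND": 1.0, "MINUTE": 60.0, "HOUR": 3600.0}


def decode_jwt_payload(token):
    """Return the (unverified) payload segment of a JWT as a dict."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)   # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def throttle_wait(ratelimit, latency=0.002):
    """Seconds to sleep between requests so the granted rate is not exceeded."""
    seconds = SECONDS_PER_UNIT[ratelimit["unit"]]
    return max(seconds / ratelimit["requests_per_unit"] - latency, 0.0)


def _encode_segment(obj):
    """Encode a dict the way a JWT segment is encoded (base64url, no padding)."""
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# A locally built stand-in token, not a real credential
demo_token = ".".join([
    _encode_segment({"typ": "JWT", "alg": "none"}),
    _encode_segment({"ratelimit": {"requests_per_unit": 5000, "unit": "HOUR"}}),
    "signature-placeholder",
])

claims = decode_jwt_payload(demo_token)
wait_seconds = throttle_wait(claims["ratelimit"])
print(claims["ratelimit"], round(wait_seconds, 3))
```

With 5000 requests per hour, the spacing works out to 3600 / 5000 = 0.72 seconds per request before the latency allowance is subtracted.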

## Define a function to make the ORES API request

The API request will be made using a function that encapsulates the call and makes access reusable in other notebooks. The procedure is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision id on each call and get back a Python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, the quality level with the highest probability score is considered the quality level for the article.
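
To illustrate how a quality level can be read out of a result dictionary, the sketch below uses the response that this notebook prints for 'Red Bay, Alabama'. `extract_quality` is a hypothetical helper, not part of the sampled course code. It returns both the model's own `prediction` field and the class with the highest probability, or `None` when the response does not contain a score.

```python
def extract_quality(response, rev_id, wiki="enwiki", model="articlequality"):
    """Return (prediction, most_probable_class), or None if no score is present."""
    try:
        score = response[wiki]["scores"][str(rev_id)][model]["score"]
    except (KeyError, TypeError):
        return None
    probabilities = score["probability"]
    most_probable = max(probabilities, key=probabilities.get)
    return score["prediction"], most_probable


sample_response = {
    "enwiki": {
        "models": {"articlequality": {"version": "0.9.2"}},
        "scores": {
            "1169333757": {
                "articlequality": {
                    "score": {
                        "prediction": "C",
                        "probability": {
                            "B": 0.2808033331369756,
                            "C": 0.6241599246116336,
                            "FA": 0.016545542445909205,
                            "GA": 0.05155356322806621,
                            "Start": 0.023080365037708828,
                            "Stub": 0.0038572715397066325,
                        },
                    }
                }
            }
        },
    }
}

quality = extract_quality(sample_response, 1169333757)
print(quality)
```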

In [12]:

#    PROCEDURES/FUNCTIONS
def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


Now we will create a list with all the ORES scores based on the revids

In [None]:
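#   Score every article revision; ores_score_file maps each article title to its predicted quality class
ores_score_file = {}
unscored_article_list = []
for item in ARTICLE_REVISIONS:
  article_title = item['title']
  temp_revid = item['lastrevid']
  print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {temp_revid}")
  score = request_ores_score_per_article(article_revid=temp_revid,
                                       email_address="gera.vritesh@gmail.com",
                                       access_token=ACCESS_TOKEN)
  # Pull the predicted class out of the nested response; a failed request has no such path
  try:
    prediction = score['enwiki']['scores'][str(temp_revid)]['articlequality']['score']['prediction']
  except (KeyError, TypeError):
    prediction = None
  if prediction:
    ores_score_file[article_title] = prediction
    print(json.dumps(score,indent=4))
  else:
    # Keep track of articles that could not be scored
    unscored_article_list.append(article_title)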


Streaming output truncated to the last 5000 lines.
                }
            }
        }
    }
}
Getting LiftWing ORES scores for 'Red Bay, Alabama' with revid: 1169333757
{
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.9.2"
            }
        },
        "scores": {
            "1169333757": {
                "articlequality": {
                    "score": {
                        "prediction": "C",
                        "probability": {
                            "B": 0.2808033331369756,
                            "C": 0.6241599246116336,
                            "FA": 0.016545542445909205,
                            "GA": 0.05155356322806621,
                            "Start": 0.023080365037708828,
                            "Stub": 0.0038572715397066325
                        }
                    }
                }
            }
        }
    }
}
Getting LiftWing ORES scores for 'Red Level, Alab

Note that no articles were found without a quality score. Now we will write the articles and their predicted quality classes to a CSV file for storage.

In [None]:
# ores_score_file maps each article title to its predicted quality class
predictions = pd.DataFrame({
    'article': list(ores_score_file.keys()),
    'prediction': list(ores_score_file.values())
})
predictions.to_csv("/content/drive/MyDrive/predictions (1).csv")

The API took 6-7 hours to run and failed multiple times. I therefore kept adding the results from each run to my dictionary and eventually saved the combined results to a CSV file.
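
One way to make a long scoring run survive failures is to save progress to disk and skip finished articles on the next run. The sketch below shows the idea under stated assumptions: `score_with_checkpoint` and `flaky_scorer` are hypothetical names, the scorer is a stand-in that fails once on purpose instead of calling the API, and the checkpoint is written to a temporary directory.

```python
import json
import os
import tempfile


def score_with_checkpoint(revisions, score_fn, checkpoint_path):
    """Score each article once, saving progress after every success."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r") as f:
            results = json.load(f)            # resume from an earlier run
    for item in revisions:
        title = item["title"]
        if title in results:
            continue                          # already scored earlier
        prediction = score_fn(item["lastrevid"])
        if prediction is None:
            continue                          # leave it for the next run
        results[title] = prediction
        with open(checkpoint_path, "w") as f:
            json.dump(results, f)
    return results


revisions = [
    {"title": "Abbeville, Alabama", "lastrevid": 1171163550},
    {"title": "Akron, Alabama", "lastrevid": 1165909508},
]
calls = []


def flaky_scorer(rev_id):
    """Stand-in for the API call: the first attempt for one revision fails."""
    calls.append(rev_id)
    if rev_id == 1165909508 and calls.count(rev_id) == 1:
        return None
    return "GA" if rev_id == 1165909508 else "C"


checkpoint = os.path.join(tempfile.mkdtemp(), "ores_checkpoint.json")
first_run = score_with_checkpoint(revisions, flaky_scorer, checkpoint)
second_run = score_with_checkpoint(revisions, flaky_scorer, checkpoint)
print(first_run)
print(second_run)
```

Rewriting the whole checkpoint after every article is simple but does repeated disk writes. For a crawl of this size, saving every few hundred articles would likely be enough.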

Now we will read the file into a dataframe. Re-running the APIs takes a very long time, so after the initial run we can pull the data directly from the file whenever we need it.

In [13]:
#Reading the ORES file
ores_score_file = pd.read_csv('/content/drive/My Drive/predictions (1).csv')

In [14]:
ores_score_file.head(10)

Unnamed: 0.1,Unnamed: 0,article,prediction
0,0,"Abbeville, Alabama",C
1,1,"Adamsville, Alabama",C
2,2,"Addison, Alabama",C
3,3,"Akron, Alabama",GA
4,4,"Alabaster, Alabama",C
5,5,"Albertville, Alabama",C
6,6,"Alexander City, Alabama",GA
7,7,"Aliceville, Alabama",GA
8,8,"Allgood, Alabama",C
9,9,"Altoona, Alabama",C


We identified a few entries that are not geographical locations, so we will remove them.

In [16]:
tbd = ['2020 United States census','2010 United States census','County (United States)','Population','Square mile','Federal Information Processing Standards',
       'American National Standards Institute','Geographic Names Information System','Wikipedia:Citation needed']
pred_drop = ores_score_file[ores_score_file['article'].isin(tbd)].index
ores_score_file.drop("Unnamed: 0",axis=1,inplace=True)
ores_score_file.drop(pred_drop,inplace=True)

We will now merge each article's prediction with its title and revision ID, so that every record in our list of dictionaries carries the predicted quality class.

In [27]:
#Creating a temp dataframe so that the original does not get impacted with our experimentation
df = pd.DataFrame(ores_score_file)

df2 = pd.DataFrame(title_revid_list)

df3 = pd.merge(df, df2, left_on='article', right_on='title', how='left')
df3 = df3.drop_duplicates()
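
A left merge repeats a row from the left table once for every matching row on the right, which is why duplicates have to be dropped afterwards when a title appears more than once in the revision list. The toy example below (hand-made frames, not the notebook's data) shows the effect and one alternative: deduplicating the right-hand side first and asking pandas to check the relationship with `validate`.

```python
import pandas as pd

predictions = pd.DataFrame({
    "article": ["Abbeville, Alabama", "Akron, Alabama"],
    "prediction": ["C", "GA"],
})
# The same title listed twice, as can happen when a city appears in two source lists
revisions = pd.DataFrame({
    "title": ["Abbeville, Alabama", "Abbeville, Alabama", "Akron, Alabama"],
    "lastrevid": [1171163550, 1171163550, 1165909508],
})

# Naive left merge: the duplicated title produces a duplicated output row
naive = pd.merge(predictions, revisions,
                 left_on="article", right_on="title", how="left")

# Deduplicate first; validate raises MergeError if keys are still not unique
deduped = pd.merge(predictions, revisions.drop_duplicates(subset="title"),
                   left_on="article", right_on="title", how="left",
                   validate="one_to_one")

print(len(naive), len(deduped))
```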

In [29]:
df3.head()

Unnamed: 0,article,prediction,title,lastrevid
0,"Abbeville, Alabama",C,"Abbeville, Alabama",1171163550
2,"Adamsville, Alabama",C,"Adamsville, Alabama",1177621427
4,"Addison, Alabama",C,"Addison, Alabama",1168359898
6,"Akron, Alabama",GA,"Akron, Alabama",1165909508
8,"Alabaster, Alabama",C,"Alabaster, Alabama",1179139816


In [32]:
title_revid_list= df3.to_dict(orient='records')

In [33]:
title_revid_list

[{'article': 'Abbeville, Alabama',
  'prediction': 'C',
  'title': 'Abbeville, Alabama',
  'lastrevid': 1171163550},
 {'article': 'Adamsville, Alabama',
  'prediction': 'C',
  'title': 'Adamsville, Alabama',
  'lastrevid': 1177621427},
 {'article': 'Addison, Alabama',
  'prediction': 'C',
  'title': 'Addison, Alabama',
  'lastrevid': 1168359898},
 {'article': 'Akron, Alabama',
  'prediction': 'GA',
  'title': 'Akron, Alabama',
  'lastrevid': 1165909508},
 {'article': 'Alabaster, Alabama',
  'prediction': 'C',
  'title': 'Alabaster, Alabama',
  'lastrevid': 1179139816},
 {'article': 'Albertville, Alabama',
  'prediction': 'C',
  'title': 'Albertville, Alabama',
  'lastrevid': 1179198677},
 {'article': 'Alexander City, Alabama',
  'prediction': 'GA',
  'title': 'Alexander City, Alabama',
  'lastrevid': 1179140073},
 {'article': 'Aliceville, Alabama',
  'prediction': 'GA',
  'title': 'Aliceville, Alabama',
  'lastrevid': 1167792390},
 {'article': 'Allgood, Alabama',
  'prediction': 'C',
 

We want the 'state' as a separate key-value pair in each dict, as outlined in the homework document.
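
The state is taken to be the text after the last comma in the title. The short check below shows how that rule behaves on a title with two commas and on a title with no comma at all. `state_from_title` is a hypothetical variant that returns `None` when there is no comma, whereas `title.split(', ')[-1]` returns the whole title in that case.

```python
def state_from_title(title):
    """Return the text after the last ', ' in a title, or None if there is none."""
    if ", " not in title:
        return None
    return title.rsplit(", ", 1)[1]


examples = [
    "Abbeville, Alabama",
    "Anderson, Lauderdale County, Alabama",
    "Population",
]
states = [state_from_title(t) for t in examples]
print(states)
```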

In [34]:
# Add the 'state' key-value pair
for item in title_revid_list:
    title = item.get('title', '')
    state = title.split(', ')[-1]
    item['state'] = state
# Print only the first few records; printing the whole list exceeds the notebook's output rate limit
print(title_revid_list[:5])


Now we will read the population data into a dataframe. The US Census Bureau provides updated population estimates for every US state. The data can be found on their website under ['State Population Totals and Components of Change: 2020-2022'](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html). An Excel file linked from that page contains the estimated populations of all US states for 2022.
I have uploaded the file to my Drive and access it from there.

In [36]:
population_data = pd.read_excel('/content/drive/My Drive/NST-EST2022-POP.xlsx')
population_data

Unnamed: 0,table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts),Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,Geographic Area,"April 1, 2020 Estimates Base",Population Estimate (as of July 1),,
1,,,2020,2021.0,2022.0
2,United States,331449520,331511512,332031554.0,333287557.0
3,Northeast,57609156,57448898,57259257.0,57040406.0
4,Midwest,68985537,68961043,68836505.0,68787595.0
...,...,...,...,...,...
60,Note: The estimates are developed from a base ...,,,,
61,Suggested Citation:,,,,
62,Annual Estimates of the Resident Population fo...,,,,
63,"Source: U.S. Census Bureau, Population Division",,,,


Pre-processing the population data

In [37]:
# Use the first row as the column names, then drop that row from the data
population_data.columns = population_data.iloc[0]
population_data = population_data[1:]

# Reset the index
population_data.reset_index(drop=True, inplace=True)
# Keep the two columns needed for our analysis: the area name and the last column (the 2022 estimate)
population_data = population_data[['Geographic Area',population_data.columns[-1]]]
# Preview the data without the leftover year row (the result is not assigned, so population_data is unchanged here)
population_data.drop(0)

Unnamed: 0,Geographic Area,NaN
1,United States,333287557.0
2,Northeast,57040406.0
3,Midwest,68787595.0
4,South,128716192.0
5,West,78743364.0
...,...,...
59,Note: The estimates are developed from a base ...,
60,Suggested Citation:,
61,Annual Estimates of the Resident Population fo...,
62,"Source: U.S. Census Bureau, Population Division",


In [38]:
#We rename the column names and then remove the '.' from the geographic area's name
population_data.columns = ['Geographic Area','Population 2022']
population_data = population_data.drop(0)
population_data['Geographic Area'] = population_data['Geographic Area'].str.lstrip('.')
population_data.head(10)

Unnamed: 0,Geographic Area,Population 2022
1,United States,333287557.0
2,Northeast,57040406.0
3,Midwest,68787595.0
4,South,128716192.0
5,West,78743364.0
6,Alabama,5074296.0
7,Alaska,733583.0
8,Arizona,7359197.0
9,Arkansas,3045637.0
10,California,39029342.0


In [39]:
# Merge the population data with your list of dictionaries
merged_data = pd.merge(population_data, pd.DataFrame(title_revid_list), left_on='Geographic Area', right_on='state')
merged_data.head(10)

Unnamed: 0,Geographic Area,Population 2022,article,prediction,title,lastrevid,state
0,Alabama,5074296.0,"Abbeville, Alabama",C,"Abbeville, Alabama",1171163550,Alabama
1,Alabama,5074296.0,"Adamsville, Alabama",C,"Adamsville, Alabama",1177621427,Alabama
2,Alabama,5074296.0,"Addison, Alabama",C,"Addison, Alabama",1168359898,Alabama
3,Alabama,5074296.0,"Akron, Alabama",GA,"Akron, Alabama",1165909508,Alabama
4,Alabama,5074296.0,"Alabaster, Alabama",C,"Alabaster, Alabama",1179139816,Alabama
5,Alabama,5074296.0,"Albertville, Alabama",C,"Albertville, Alabama",1179198677,Alabama
6,Alabama,5074296.0,"Alexander City, Alabama",GA,"Alexander City, Alabama",1179140073,Alabama
7,Alabama,5074296.0,"Aliceville, Alabama",GA,"Aliceville, Alabama",1167792390,Alabama
8,Alabama,5074296.0,"Allgood, Alabama",C,"Allgood, Alabama",1165909718,Alabama
9,Alabama,5074296.0,"Altoona, Alabama",C,"Altoona, Alabama",1165909823,Alabama


The 'region' demarcation within the US is not standardized and fixed. In fact, different US government agencies agglomerate states to define regions as a function of differing goals. For this analysis, we will use the regional and divisional agglomerations as defined by the US Census Bureau. The data for the same can be found in ['US States by Region - US Census Bureau'](https://docs.google.com/spreadsheets/d/14Sjfd_u_7N9SSyQ7bmxfebF_2XpR8QamvmNntKDIQB0/edit?usp=sharing).

In [41]:
#Pulling the region data from the file uploaded to my drive and processing it to make it usable
region_data = pd.read_excel('/content/drive/My Drive/US States by Region - US Census Bureau.xlsx')
region_data['DIVISION'].fillna(method = 'ffill', inplace = True)
region_data.head(10)

Unnamed: 0,REGION,DIVISION,STATE
0,Northeast,,
1,,New England,
2,,New England,Connecticut
3,,New England,Maine
4,,New England,Massachusetts
5,,New England,New Hampshire
6,,New England,Rhode Island
7,,New England,Vermont
8,,Middle Atlantic,
9,,Middle Atlantic,New Jersey
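
The preview above shows the spreadsheet's layout: a region name sits on its own row, each division name sits on its own row beneath it, and the states follow with the other columns empty. A small sketch of turning that layout into one tidy row per state is given below. The toy frame copies the first few rows shown above, and filling `REGION` as well as dropping the header rows are optional extras rather than steps this notebook takes.

```python
import pandas as pd

raw = pd.DataFrame({
    "REGION":   ["Northeast", None, None, None, None, None],
    "DIVISION": [None, "New England", None, None, "Middle Atlantic", None],
    "STATE":    [None, None, "Connecticut", "Maine", None, "New Jersey"],
})

tidy = raw.copy()
# Carry each region and division name down onto the rows below it
tidy[["REGION", "DIVISION"]] = tidy[["REGION", "DIVISION"]].ffill()
# Rows without a state are only headers, so they can be dropped
tidy = tidy.dropna(subset=["STATE"]).reset_index(drop=True)
print(tidy)
```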


Now we merge the data to create the final dataframe on which all the analysis is run.

In [42]:
#Merging the previous dataframes with the new region_data
merged_region_data = merged_data.merge(region_data, left_on='state', right_on='STATE', how='left')
merged_region_data = merged_region_data.drop(columns=['Geographic Area','REGION','STATE'])
merged_region_data.head(10)

Unnamed: 0,Population 2022,article,prediction,title,lastrevid,state,DIVISION
0,5074296.0,"Abbeville, Alabama",C,"Abbeville, Alabama",1171163550,Alabama,East South Central
1,5074296.0,"Adamsville, Alabama",C,"Adamsville, Alabama",1177621427,Alabama,East South Central
2,5074296.0,"Addison, Alabama",C,"Addison, Alabama",1168359898,Alabama,East South Central
3,5074296.0,"Akron, Alabama",GA,"Akron, Alabama",1165909508,Alabama,East South Central
4,5074296.0,"Alabaster, Alabama",C,"Alabaster, Alabama",1179139816,Alabama,East South Central
5,5074296.0,"Albertville, Alabama",C,"Albertville, Alabama",1179198677,Alabama,East South Central
6,5074296.0,"Alexander City, Alabama",GA,"Alexander City, Alabama",1179140073,Alabama,East South Central
7,5074296.0,"Aliceville, Alabama",GA,"Aliceville, Alabama",1167792390,Alabama,East South Central
8,5074296.0,"Allgood, Alabama",C,"Allgood, Alabama",1165909718,Alabama,East South Central
9,5074296.0,"Altoona, Alabama",C,"Altoona, Alabama",1165909823,Alabama,East South Central


# Analysis

The analysis for this study will consist of calculating total-articles-per-population (a ratio representing the number of articles per person) and high-quality-articles-per-population (a ratio representing the number of high quality articles per person) on a state-by-state and divisional basis. All of these values are 'per capita' ratios.
For this analysis 'high quality' articles are considered as articles that ORES predicted would be in either the 'FA' (featured article) or 'GA' (good article) classes.

The data is then used to construct the following six tables:
1. Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order).
2. Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order).
3. Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order).
4. Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).
5. Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.
6. Census divisions by high quality coverage: A rank ordered list of US census divisions (in descending order) by high quality articles per capita.
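
As a minimal sketch of the two ratios defined above, both can be computed per state in a single grouped aggregation. The toy frame below only mirrors the column names of `merged_region_data`. Its states, predictions and populations are invented for illustration, and the names `articles` and `per_state` are hypothetical.

```python
import pandas as pd

# Toy article-level data; every value here is made up for illustration
articles = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B"],
    "title": ["a1", "a2", "a3", "b1", "b2"],
    "prediction": ["GA", "C", "Stub", "FA", "GA"],
    "Population 2022": [1000.0, 1000.0, 1000.0, 4000.0, 4000.0],
})

# Flag articles that ORES predicted as FA or GA
articles["high_quality"] = articles["prediction"].isin(["FA", "GA"])

# One row per state: article count, high quality count, and population
per_state = articles.groupby("state").agg(
    total_articles=("title", "count"),
    high_quality_articles=("high_quality", "sum"),
    population=("Population 2022", "first"),
)

# The two per capita ratios used throughout the analysis
per_state["total_per_capita"] = (
    per_state["total_articles"] / per_state["population"]
)
per_state["high_quality_per_capita"] = (
    per_state["high_quality_articles"] / per_state["population"]
)
print(per_state)
```

Taking the `first` population per state relies on the population being repeated unchanged on every row of that state, as it is in the merged data shown earlier.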


# a) Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order).


In [62]:
#First we count the number of articles per state
statewise_count = merged_region_data.groupby('state')['title'].count()
#We calculate the statewise population
statewise_pop = merged_region_data.groupby('state')['Population 2022'].mean()
#Then we calculate the count of articles per capita
statewise_coverage_per_capita = statewise_count/statewise_pop
statewise_coverage_per_capita.sort_values(ascending=False,inplace=True)
top_state_coverage = statewise_coverage_per_capita.to_frame()
#Now we show the 10 US states with the most articles per capita
top_state_coverage = top_state_coverage.rename(columns={0:'Top 10 USA states by coverage'})
top_state_coverage.head(10)

state,Top 10 USA states by coverage
Vermont,0.000507
North Dakota,0.000457
Maine,0.000349
South Dakota,0.000342
Iowa,0.000326
Alaska,0.000202
Pennsylvania,0.000197
Michigan,0.000177
Wyoming,0.00017
New Hampshire,0.000168


# b) Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order).

In [63]:
#First we count the number of articles per state
statewise_count = merged_region_data.groupby('state')['title'].count()
#We calculate the statewise population
statewise_pop = merged_region_data.groupby('state')['Population 2022'].mean()
#Then we calculate the count of articles per capita
statewise_coverage_per_capita = statewise_count/statewise_pop
statewise_coverage_per_capita.sort_values(ascending=True,inplace=True)
top_state_coverage = statewise_coverage_per_capita.to_frame()
#Now we show the 10 US states with the fewest articles per capita
top_state_coverage = top_state_coverage.rename(columns={0:'Bottom 10 USA states by coverage'})
top_state_coverage.head(10)

state,Bottom 10 USA states by coverage
North Carolina,5e-06
Nevada,6e-06
California,1.2e-05
Arizona,1.2e-05
Virginia,1.5e-05
Oklahoma,1.8e-05
Florida,1.8e-05
Kansas,2.1e-05
Maryland,2.5e-05
Wisconsin,3.2e-05
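
The top and bottom tables come from the same per capita series, so the ratio does not have to be recomputed and re-sorted for each one. A minimal sketch, assuming a series like `statewise_coverage_per_capita`; the state labels and values below are hypothetical.

```python
import pandas as pd

# Hypothetical per capita coverage values, indexed by state
coverage = pd.Series(
    {"A": 0.0005, "B": 0.00002, "C": 0.0003, "D": 0.00001, "E": 0.0001},
    name="articles_per_capita",
)

n = 2  # the notebook uses 10

# nlargest returns the n highest values in descending order
top_n = coverage.nlargest(n)
# nsmallest returns the n lowest values in ascending order
bottom_n = coverage.nsmallest(n)

print(top_n.index.tolist())     # ['A', 'C']
print(bottom_n.index.tolist())  # ['D', 'B']
```

Both calls leave the original series unsorted, which avoids the side effects of sorting with `inplace=True`.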


# c) Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order).

In [64]:
#First we create a boolean column that is True for high quality articles (FA or GA) and False otherwise
merged_region_data['high_quality_articles'] = (merged_region_data['prediction'] == 'FA') | (merged_region_data['prediction'] == 'GA')
#We group the high quality articles by state
highest_quality_by_state = merged_region_data.groupby('state')['high_quality_articles'].sum()
#Calculating these sums per capita
highest_quality_by_state_per_capita = highest_quality_by_state/statewise_pop
highest_quality_by_state_per_capita.sort_values(ascending=False,inplace=True)
top_high_qual = highest_quality_by_state_per_capita.to_frame()
top_high_qual = top_high_qual.rename(columns={0:'Top 10 US states with High Quality Articles per Capita'})
top_high_qual.head(10)

state,Top 10 US states with High Quality Articles per Capita
Vermont,7e-05
Wyoming,6.7e-05
South Dakota,6.2e-05
West Virginia,6e-05
Montana,4.9e-05
New Hampshire,4.5e-05
Pennsylvania,4.4e-05
Missouri,4.2e-05
Alaska,4.2e-05
New Jersey,4.1e-05


# d) Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).

In [65]:
#We follow the same procedure as before, but with ascending=True so that we get the bottom 10 states
lowest_quality_by_state = merged_region_data.groupby('state')['high_quality_articles'].sum()
lowest_quality_by_state_per_capita = lowest_quality_by_state/statewise_pop
lowest_quality_by_state_per_capita.sort_values(ascending=True,inplace=True)
low_high_qual = lowest_quality_by_state_per_capita.to_frame()
low_high_qual = low_high_qual.rename(columns={0:'Bottom 10 US states with High Quality Articles per Capita'})
low_high_qual.head(10)

state,Bottom 10 US states with High Quality Articles per Capita
North Carolina,2e-06
Virginia,2e-06
Nevada,2e-06
Arizona,3e-06
California,4e-06
Florida,5e-06
New York,6e-06
Maryland,7e-06
Kansas,7e-06
Oklahoma,8e-06
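
Per capita values this small are hard to compare at a glance. One option, which the notebook does not use, is to express them per 100,000 residents. The sketch below takes three of the rounded values printed above, so its results inherit that rounding. The names `hq_per_capita` and `per_100k` are illustrative.

```python
import pandas as pd

# Rounded high quality per capita values copied from the outputs above
hq_per_capita = pd.Series({
    "Vermont": 7e-05,
    "Wyoming": 6.7e-05,
    "North Carolina": 2e-06,
})

# Scale to high quality articles per 100,000 residents
per_100k = (hq_per_capita * 100_000).round(1)
print(per_100k)
```

On this scale Vermont has about 7 high quality articles per 100,000 residents and North Carolina about 0.2.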


# e) Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.

In [58]:
# Rank ordered list of US census divisions by total articles per capita
final_df3 = merged_region_data.copy()
articles_per_state = final_df3[['state', 'title']].groupby('state', as_index=False).count()
articles_per_state = pd.DataFrame(articles_per_state)
articles_per_state.head()

state_division_pop = final_df3[['state', 'DIVISION', 'Population 2022']].drop_duplicates()
state_division_pop.head()

articles_per_div_state = pd.merge(state_division_pop, articles_per_state, on='state', how='inner')
articles_per_div_state.head()

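grouped_values = articles_per_div_state.groupby('DIVISION', as_index=False)
#Sum only the numeric columns so that the 'state' strings are not concatenated in newer pandas versions
grouped_values = grouped_values[['Population 2022', 'title']].sum()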
grouped_values['articles_per_capita'] = grouped_values['title']/grouped_values['Population 2022']
grouped_values = grouped_values.sort_values(by='articles_per_capita', ascending=False)
grouped_values

Unnamed: 0,DIVISION,Population 2022,title,articles_per_capita
7,West North Central,19721893.0,3574,0.000181
4,New England,11503343.0,1433,0.000125
0,East North Central,47097779.0,4750,0.000101
2,Middle Atlantic,41910858.0,3772,9e-05
1,East South Central,19578002.0,1527,7.8e-05
8,West South Central,41685250.0,2098,5e-05
3,Mountain,25514320.0,1184,4.6e-05
6,South Atlantic,66781137.0,1846,2.8e-05
5,Pacific,53229044.0,1298,2.4e-05
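
The division figures above are pooled ratios: total articles in the division divided by total population in the division. This is not the same as averaging the per-state ratios, which would give a small state the same weight as a large one. The sketch below shows the difference on a made-up division with two hypothetical states.

```python
import pandas as pd

# Toy state-level table for a single hypothetical division
states = pd.DataFrame({
    "state": ["Small", "Large"],
    "DIVISION": ["Toy Division", "Toy Division"],
    "Population 2022": [1_000.0, 9_000.0],
    "title": [10, 10],  # article counts per state
})

# Pooled ratio: total articles divided by total population
totals = states.groupby("DIVISION")[["Population 2022", "title"]].sum()
pooled = totals["title"] / totals["Population 2022"]

# Alternative: average the per-state ratios, weighting every state equally
states["ratio"] = states["title"] / states["Population 2022"]
mean_of_ratios = states.groupby("DIVISION")["ratio"].mean()

print(pooled["Toy Division"])          # 20 articles / 10,000 people
print(mean_of_ratios["Toy Division"])  # mean of 10/1,000 and 10/9,000
```

In this toy case the mean of ratios is more than twice the pooled ratio, because the small state's high ratio dominates the average.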


# f) Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita.

In [60]:
# Rank ordered list of US census divisions by high quality articles per capita
final_df4 = merged_region_data.copy()
final_df4 = final_df4[final_df4['prediction'].isin(['FA', 'GA'])]
articles_per_state2 = final_df4[['state', 'title']].groupby('state', as_index=False).count()
articles_per_state2 = pd.DataFrame(articles_per_state2)

state_division_pop2 = final_df4[['state', 'DIVISION', 'Population 2022']].drop_duplicates()

articles_per_div_state2 = pd.merge(state_division_pop2, articles_per_state2, on='state', how='inner')

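grouped_values2 = articles_per_div_state2.groupby('DIVISION', as_index=False)
#Sum only the numeric columns so that the 'state' strings are not concatenated in newer pandas versions
grouped_values2 = grouped_values2[['Population 2022', 'title']].sum()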
grouped_values2['articles_per_capita'] = grouped_values2['title']/grouped_values2['Population 2022']
grouped_values2 = grouped_values2.sort_values(by='articles_per_capita', ascending=False)
grouped_values2

Unnamed: 0,DIVISION,Population 2022,title,articles_per_capita
7,West North Central,19721893.0,637,3.2e-05
2,Middle Atlantic,41910858.0,1055,2.5e-05
4,New England,11503343.0,224,1.9e-05
1,East South Central,19578002.0,316,1.6e-05
0,East North Central,47097779.0,715,1.5e-05
8,West South Central,41685250.0,632,1.5e-05
3,Mountain,25514320.0,333,1.3e-05
5,Pacific,53229044.0,489,9e-06
6,South Atlantic,66781137.0,525,8e-06
