# Considering Bias in Data - Homework 2

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles about cities in different US states. For this assignment, we will combine a dataset of Wikipedia articles with a dataset of state populations, and use a machine learning service called ORES to estimate the quality of the articles about the cities.

## Step 1: Getting the Article, Population and Region Data

The first step is getting the data, which lives in several different places. You will need data that lists Wikipedia articles about US cities and data for US state populations. The Wikipedia [Category:Lists of cities in the United States by state](https://en.wikipedia.org/wiki/Category:Lists_of_cities_in_the_United_States_by_state) was crawled to generate a list of Wikipedia article pages about US cities from each state. This list is referred to as 'us_cities_by_state_SEPT.2023.csv' in the notebook.

### 1.a. Importing the required libraries

In [1]:
import json, time, urllib.parse
import requests
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
import warnings

# Suppressing the less important warnings.
warnings.filterwarnings('ignore')

### 1.b. Loading the .csv files into dataframes

The US Census Bureau provides updated population estimates for every US state. We can find [State Population Totals and Components of Change: 2020-2022](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) on their website. An Excel file linked to that page contains estimated populations of all US states for 2022.

The 'region' demarcation within the US is not standardized or fixed. Different US government agencies group states into regions according to their own goals (see List of regions of the United States for some examples). For this analysis, we will use the regional and divisional groupings defined by the US Census Bureau and listed in 'US States by Region - US Census Bureau.csv'.

There are 3 input files:
1. **us_cities_by_state_SEPT.2023.csv**: List of Wikipedia article pages about US cities from each state.
2. **population_by_states_2022.csv**: Contains estimated populations of all US states for 2022.
3. **US States by Region - US Census Bureau.csv**: Regional and divisional agglomerations as defined by the US Census Bureau and used for analysis in this notebook.

The files are read into 3 dataframes:
- Cities by states is saved into a dataframe called **df_cities**
- Population by states is saved into a dataframe called **df_pop**
- US states based on regions and divisions are saved into a dataframe called **region_df**

In [2]:
df_cities = pd.read_csv('../input files/us_cities_by_state_SEPT.2023.csv')
df_pop = pd.read_csv('../input files/population_by_states_2022.csv')
region_df = pd.read_csv('../input files/US States by Region - US Census Bureau.csv')

- We create a dictionary called state_to_region_division to map states to their corresponding regions and divisions based on data from the region_df. This mapping is used to add "Region" and "Division" columns to the df_pop DataFrame.
- We define a custom sorting key function called custom_sort_key that assigns a sorting order to each state based on its region and division. This function is used to sort the df_pop DataFrame in a customized order.
- We perform various data manipulations, including sorting the DataFrame using the custom sorting key, filtering out rows with 'NaN' Division, and converting the "Population Estimate" column to numeric format. 
- Finally, we group the data by regional divisions and calculate the population sum for each division, storing the results in a DataFrame named df_pop_division.

In [3]:
# Creating a mapping dictionary for states, regions, and divisions
state_to_region_division = {}
current_region = None
current_division = None

for row in region_df.values:
    if not pd.isna(row[0]):
        current_region = row[0]
        current_division = None
    if not pd.isna(row[1]):
        current_division = row[1]
    if not pd.isna(row[2]):
        state = row[2]
        state_to_region_division[state] = (current_region, current_division)

def map_state_to_region(state):
    return state_to_region_division.get(state, ('NaN', 'NaN'))
# Adding "Region" and "Division" columns to df_pop
df_pop['Region'], df_pop['Division'] = zip(*df_pop['Geographic Area'].map(map_state_to_region))

# Creating a custom sorting key function
def custom_sort_key(state):
    region, division = state_to_region_division.get(state, ('NaN', 'NaN'))
    region_order = ['Northeast', 'Midwest', 'South', 'West']
    division_order = [
        'New England', 'Middle Atlantic',
        'East North Central', 'West North Central',
        'South Atlantic', 'East South Central', 'West South Central',
        'Mountain', 'Pacific'
    ]
    # Assigning a high value for 'NaN' regions and divisions to place them at the end
    region_index = region_order.index(region) if region in region_order else len(region_order)
    division_index = division_order.index(division) if division in division_order else len(division_order)
    return (region_index, division_index, state)

# Sorting the DataFrame by the custom sorting key
df_pop['SortKey'] = df_pop['Geographic Area'].map(custom_sort_key)
df_pop = df_pop.sort_values(by='SortKey').drop('SortKey', axis=1).reset_index().drop('index', axis=1)

# Filtering out rows with 'NaN' Division
# (.copy() so the column assignment below modifies an independent frame, not a view of df_pop)
df_pop_division = df_pop[df_pop['Division'] != 'NaN'].copy()
# Removing commas and converting "Population Estimate" to numeric
df_pop_division['Population Estimate'] = df_pop_division['Population Estimate'].str.replace(',', '').astype(int)
df_pop_division = df_pop_division.groupby('Division')['Population Estimate'].sum().reset_index()
df_pop_division.columns = ['regional_division', 'population']

We will use the df_pop_division dataframe later on in the analysis.
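As an aside, the region and division lookup built in the cell above can also be sketched with a forward fill. This is a minimal illustration on a toy table, not the notebook's method: the column names `REGION`, `DIVISION` and `STATE` are assumptions about the file layout, and the sketch assumes every state row is preceded by a division row inside its region.

```python
import pandas as pd

# Toy stand-in for the hierarchical region file: a REGION or DIVISION cell is
# filled only on the row that introduces it (column names are assumed).
toy_regions = pd.DataFrame({
    "REGION":   ["Northeast", None, None, None, "Midwest", None, None],
    "DIVISION": [None, "New England", None, None, None, "East North Central", None],
    "STATE":    [None, None, "Connecticut", "Maine", None, None, "Illinois"],
})

# Carry the most recent region and division labels down onto the state rows.
filled = toy_regions.copy()
filled[["REGION", "DIVISION"]] = filled[["REGION", "DIVISION"]].ffill()

# Keep only the rows that actually name a state.
state_lookup = filled.dropna(subset=["STATE"]).set_index("STATE")

print(state_lookup.loc["Maine", "DIVISION"])
print(state_lookup.loc["Illinois", "REGION"])
```

The explicit loop in the notebook resets the division whenever a new region starts, which the forward fill does not do, so the loop is the safer choice if the file could contain a state directly under a region.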

In [4]:
# Viewing the df_cities dataframe
df_cities.head()

Unnamed: 0,state,page_title,url
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama"
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama"
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama"


In [5]:
# Viewing the df_pop dataframe
df_pop.head()

Unnamed: 0,Geographic Area,Population Estimate,Region,Division
0,Connecticut,3626205,Northeast,New England
1,Maine,1385340,Northeast,New England
2,Massachusetts,6981974,Northeast,New England
3,New Hampshire,1395231,Northeast,New England
4,Rhode Island,1093734,Northeast,New England


In [6]:
# Viewing the df_pop_division dataframe
df_pop_division.head()

Unnamed: 0,regional_division,population
0,East North Central,47097779
1,East South Central,19578002
2,Middle Atlantic,41910858
3,Mountain,25514320
4,New England,15129548


### 1.c. Data Cleaning

**Some Considerations**

Crawling Wikipedia categories to identify relevant page subsets can result in misleading and/or duplicate category labels, and the same article can be linked from differently named sub-categories. The data crawl attempts to resolve some of these problems, but not all of them may have been caught. The section below describes how we handled the inconsistencies in the data.

#### Checking for duplicates in both the dataframes and removing those records.

In [7]:
print("Number of rows for cities dataframe = ", len(df_cities))
print("Number of rows for population dataframe = ", len(df_pop))

Number of rows for cities dataframe =  22157
Number of rows for population dataframe =  52


In [8]:
df_cities = df_cities.drop_duplicates(subset=['state', 'page_title', 'url'], keep = 'last')
df_cities = df_cities.reset_index(drop=True)
df_pop = df_pop.drop_duplicates()

In [9]:
print("Number of rows for cities dataframe after removing the duplicates = ", len(df_cities))
print("Number of rows for population dataframe after removing the duplicates = ", len(df_pop))

Number of rows for cities dataframe after removing the duplicates =  21525
Number of rows for population dataframe after removing the duplicates =  52


632 duplicate rows were removed from the cities dataframe (22157 rows before, 21525 after), and there were no duplicates in the population dataframe.
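The change in row count can be cross-checked by counting the duplicates before dropping them. The sketch below uses a small toy dataframe (the rows are illustrative, not the real input) with the same key columns as the notebook.

```python
import pandas as pd

# Toy data with one repeated row; the real df_cities has the same three columns.
toy_cities = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alabama"],
    "page_title": ["Abbeville, Alabama", "Abbeville, Alabama", "Addison, Alabama"],
    "url": [
        "https://en.wikipedia.org/wiki/Abbeville,_Alabama",
        "https://en.wikipedia.org/wiki/Abbeville,_Alabama",
        "https://en.wikipedia.org/wiki/Addison,_Alabama",
    ],
})

key_cols = ["state", "page_title", "url"]

# duplicated() marks every repeat except the one that drop_duplicates() keeps.
n_duplicates = int(toy_cities.duplicated(subset=key_cols, keep="last").sum())
deduped = toy_cities.drop_duplicates(subset=key_cols, keep="last").reset_index(drop=True)

# The number of rows removed should equal the number of flagged duplicates.
rows_removed = len(toy_cities) - len(deduped)
print(n_duplicates, rows_removed)
```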

#### Checking for data inconsistencies like nulls/zero numeric values

##### Checking for NULL values

In [10]:
df_cities.isnull().sum()

state         0
page_title    0
url           0
dtype: int64

In [11]:
df_pop.isnull().sum()

Geographic Area        0
Population Estimate    0
Region                 0
Division               0
dtype: int64

There are no NULL values in df_cities and df_pop.

##### Checking for ZERO values

In [12]:
df_pop[df_pop['Population Estimate'] == 0]

Unnamed: 0,Geographic Area,Population Estimate,Region,Division


There are no 0 population values in the df_pop dataframe.

## Step 2: Getting Article Quality Predictions

Now we need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [ORES](https://www.mediawiki.org/wiki/ORES). This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:
1. FA - Featured article
2. GA - Good article (sometimes called A-class)
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

These labelings were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a subset of quality assessment categories developed by Wikipedia editors.


To get a Wikipedia page quality prediction from ORES for each city's article page, we will need to:
- read each line of us_cities_by_state_SEPT.2023.csv
- make a page info request to get the current page revision
- make an ORES request using the page title and current revision id.

### 2.a. Configuring the API parameters
This example illustrates how to access page info data using the [MediaWiki Action API for the EN Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

**License**

This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.1 - August 14, 2023.

In [13]:
# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0) - API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<sagnik99@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is the list of Wikipedia article titles for the US cities
ARTICLE_TITLES = df_cities['page_title']

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

### 2.b. API request function

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [14]:
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

- We iterate through ARTICLE_TITLES and call the function defined above to get the JSON response from the endpoint for each title.
- We keep only the article title and the last revision id (lastrevid) from each response, in a dataframe called **df_articles_lastrevid**.

**Article pages for which the request fails are recorded in the error log file "API_request_error_log.txt".**
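To make the structure of a page info response concrete, here is a minimal sketch of pulling the revision id out of it. The sample dictionary is hand-written to mirror the `query` / `pages` nesting that the loop below indexes into; the page id is invented, and the entry keyed `"-1"` with a `missing` field is our understanding of how a nonexistent title is reported, so treat it as an assumption. `extract_lastrevids` is a hypothetical helper, not part of the course code.

```python
# Hand-written sample; a real single-title request returns one page entry.
# Two entries are shown only to illustrate both the found and the missing case.
sample_response = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "104730": {"pageid": 104730, "ns": 0,
                       "title": "Abbeville, Alabama", "lastrevid": 1171163550},
            "-1": {"ns": 0, "title": "Not A Real City, Alabama", "missing": ""},
        }
    },
}

def extract_lastrevids(response):
    """Split a page info response into found {title: lastrevid} and missing titles."""
    found, missing = {}, []
    pages = (response or {}).get("query", {}).get("pages", {})
    for page in pages.values():
        if "lastrevid" in page:
            found[page["title"]] = page["lastrevid"]
        else:
            missing.append(page["title"])
    return found, missing

found, missing = extract_lastrevids(sample_response)
print(found)
print(missing)
```

A check like this separates titles that returned no revision id from requests that failed outright, which a bare `try`/`except` around the request does not distinguish.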

In [15]:
pageinfo_list = {}
with open('../intermediate files/API_request_error_log.txt', 'w') as f:

    for i in tqdm(range(0, len(ARTICLE_TITLES))):
        try: 
            request_op = request_pageinfo_per_article(article_title = ARTICLE_TITLES[i],
                                                      request_template = PAGEINFO_PARAMS_TEMPLATE)['query']['pages']
            pageinfo_list.update(request_op)
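        except Exception:
            # i is an integer index, so log the title itself (string + int would raise a TypeError here)
            txt = "Couldn't get the page info for: " + str(ARTICLE_TITLES[i])
            f.write(txt)
            f.write('\n')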
    
# The API response stores the article name under 'title' (the key used for the merge in Step 3)
df_articles_lastrevid = pd.DataFrame.from_dict(pageinfo_list, orient='index', columns=['title', 'lastrevid'])
df_articles_lastrevid.reset_index(inplace = True, drop = True)
df_articles_lastrevid.head()

In [16]:
# Saving the API call response into a csv file to avoid reloading it multiple times
df_articles_lastrevid.to_csv('../intermediate files/request_pageinfo_per_article_output.csv', index=False)

### 2.c. Requesting ORES scores

This example illustrates how to generate quality scores for article revisions using [ORES](https://www.mediawiki.org/wiki/ORES). It shows how to request a score for a specific revision, where the score provides probabilities for all of the possible article quality levels. The API documentation can be accessed from the [ORES API documentation](https://ores.wikimedia.org) page. However, this documentation is a little skimpy, and if you want more information you may have to dig around.

**License**

This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023.

In [19]:
# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/?models={model}&revids={revids}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<sagnik99@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023'
}

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revids" : "",              # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}

In [20]:
def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: 
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    
    # set the revision id into the template
    request_template['revids'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    # (the URL already carries a query string, so the extra parameter is appended with '&')
    if features:
        request_url = request_url+"&features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

Extracting the ORES score by running a for loop

In [21]:
ores_score = {}
for i in tqdm(range(0, len(df_articles_lastrevid.lastrevid))):
    try:
        revids = str(int(df_articles_lastrevid['lastrevid'][i]))
        req_op = request_ores_score_per_article(revids)['enwiki']['scores']
        ores_score[revids] = req_op[revids]['articlequality']['score']['prediction']
    except:
        print("Couldn't get the ORES info for: ", i)

The articles for which no ORES score was captured are listed below (article_title, lastrevid):
1. **Kennebunk, Maine 1172898961**
2. **Fraser, Michigan 1162379459**
3. **Wildwood Crest, New Jersey 1179887888**

Since the ORES score is missing for only three articles, we list them in the notebook instead of storing them in a separate file. The absence of ORES scores for these three articles may be due to various reasons, such as missing data in the source, temporary unavailability of ORES data, or specific issues related to these articles.
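One way to keep track of such failures without a bare `except` is to extract the prediction with a small helper that returns `None` when the expected keys are absent. This is a sketch under assumptions: the nesting follows the indexing used in the loop above, the error payload is invented for illustration, and `extract_prediction` is a hypothetical name.

```python
def extract_prediction(response, revid, context="enwiki", model="articlequality"):
    """Return the predicted quality class, or None if the response has no score."""
    try:
        return response[context]["scores"][str(revid)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        # KeyError: a level of the nesting is missing (e.g. an error payload)
        # TypeError: the response itself is None because the request failed
        return None

# Hand-written responses mirroring the nesting the notebook indexes into.
ok_response = {"enwiki": {"scores": {"1171163550": {
    "articlequality": {"score": {"prediction": "C"}}}}}}
error_response = {"enwiki": {"scores": {"1172898961": {
    "articlequality": {"error": {"type": "hypothetical", "message": "hypothetical"}}}}}}

unscored = []
for revid, response in [(1171163550, ok_response), (1172898961, error_response)]:
    if extract_prediction(response, revid) is None:
        unscored.append(revid)

print(extract_prediction(ok_response, 1171163550))
print(unscored)
```

Collecting the unscored revision ids in a list makes it easy to write them to a file if the number of failures grows beyond a handful.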

Creating a dataframe of the ORES output and renaming the columns for easier tracking and combining later.

In [22]:
df_scores = pd.DataFrame.from_dict(ores_score, orient='index', columns=['prediction'])
df_scores.reset_index(inplace = True)
df_scores = df_scores.rename(columns = {'index': 'lastrevid'})
df_scores['lastrevid'] = df_scores['lastrevid'].astype('int')
df_scores

Storing the ORES score output dataframe as a .csv file to avoid re-running the requests to retrieve the information.

In [23]:
df_scores.to_csv('../intermediate files/request_ores_score_per_article_output.csv', index=False)

## Step 3: Combining the Datasets

Some processing of the data will be necessary. In particular, we'll need to - after retrieving and including the ORES data for each article - merge the Wikipedia data and population data together. Both files have fields containing state names for just that purpose.

The combined dataset also requires labeling each state with its US Census regional-division. The [spreadsheet listing the states in each regional division](https://docs.google.com/spreadsheets/d/14Sjfd_u_7N9SSyQ7bmxfebF_2XpR8QamvmNntKDIQB0/edit?usp=sharing) represents the regions, divisions, and states hierarchically. We will need to read this data file and merge it into the resulting dataset.

When merging the data, we found entries that cannot be trivially merged. Most likely, the Census Bureau population data includes areas that are not technically states (e.g., Washington, D.C., or Puerto Rico). Non-states are ignored. All areas for which there is no match are also identified, and we write them, one area per line, to a file called **wp_areas-no_match.txt**.

Finally, we consolidate the merged data into a single CSV file called: **wp_scored_city_articles_by_state.csv**

The schema for that file should look something like this:

| Column            |
|:------------------|
| state             |
| regional_division |
| population        |
| article_title     |
| revision_id       |
| article_quality   |
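The unmatched areas described above can also be located with the `indicator` option of `DataFrame.merge`, which labels each row of an outer merge by where it came from. The sketch below uses toy frames with illustrative rows and only the key columns; it shows an alternative to checking for nulls after the merge and is not the approach used in this notebook.

```python
import pandas as pd

# Toy frames: one matching state, one mismatched spelling, one non-state area.
toy_articles = pd.DataFrame({"state": ["Alabama", "Alabama", "New_York"]})
toy_population = pd.DataFrame({"Geographic Area": ["Alabama", "New York", "Puerto Rico"]})

merged = toy_articles.merge(
    toy_population,
    left_on="state",
    right_on="Geographic Area",
    how="outer",
    indicator=True,   # adds a _merge column: left_only / right_only / both
)

# Rows that exist on only one side are the areas with no match.
unmatched = merged[merged["_merge"] != "both"]
no_match_areas = sorted(
    set(unmatched["state"].dropna()) | set(unmatched["Geographic Area"].dropna())
)
print(no_match_areas)
```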

### 3.a. Combining Datasets

- cities data (df_cities) has columns: state, page_title, url
- page info (df_articles_lastrevid) has columns: title, lastrevid
- ores score (df_scores) has columns: lastrevid, prediction
- population data (df_pop) has columns: Geographic Area, Population Estimate, Region, Division

Merging the article page info dataframe with the ORES score predictions.

In [26]:
print('Number of records in Article page info = ', len(df_articles_lastrevid))
print('Number of records in ORES score prediction = ', len(df_scores))

df_joined = df_articles_lastrevid.merge(df_scores, on = ['lastrevid'], how = 'left')

# Adding state as well
df_joined = df_cities.merge(df_joined, left_on = "page_title", right_on = "title", how = 'left')

# Cleaning the dataframe by removing duplicate name column (i.e., title) and url
df_cities_scores = df_joined.drop(['url', 'title'], axis = 1)
print('Number of records in the joined dataframe with cities and their ORES score prediction =',
      len(df_cities_scores))

# To view a snippet of the dataframe
df_cities_scores.head()

Number of records in Article page info =  21519
Number of records in ORES score prediction =  21516
Number of records in the joined dataframe with cities and their ORES score prediction = 21525


Unnamed: 0,state,page_title,lastrevid,prediction
0,Alabama,"Abbeville, Alabama",1171163550,C
1,Alabama,"Adamsville, Alabama",1177621427,C
2,Alabama,"Addison, Alabama",1168359898,C
3,Alabama,"Akron, Alabama",1165909508,GA
4,Alabama,"Alabaster, Alabama",1179139816,C


Adding the population data to this dataframe as well.

In [27]:
df_consolidated = df_cities_scores.merge(df_pop, left_on = 'state', right_on = 'Geographic Area', how = 'outer')
print('Number of records in the joined dataframe =', len(df_consolidated))
df_consolidated.head()

Number of records in the joined dataframe = 21540


Unnamed: 0,state,page_title,lastrevid,prediction,Geographic Area,Population Estimate,Region,Division
0,Alabama,"Abbeville, Alabama",1171164000.0,C,Alabama,5074296,South,East South Central
1,Alabama,"Adamsville, Alabama",1177621000.0,C,Alabama,5074296,South,East South Central
2,Alabama,"Addison, Alabama",1168360000.0,C,Alabama,5074296,South,East South Central
3,Alabama,"Akron, Alabama",1165910000.0,GA,Alabama,5074296,South,East South Central
4,Alabama,"Alabaster, Alabama",1179140000.0,C,Alabama,5074296,South,East South Central
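Note that lastrevid now shows up as a float (for example 1171164000.0). An outer merge introduces missing values for areas with no articles, and pandas stores an integer column containing missing values as float64; the rounded look most likely comes from display precision rather than lost digits. If integer revision ids are wanted in the output file, a nullable integer dtype is one option. The sketch below shows the effect on a toy merge and is a suggestion, not part of the original pipeline.

```python
import pandas as pd

# Toy frames: Puerto Rico has population data but no article rows.
left = pd.DataFrame({"state": ["Alabama"], "lastrevid": [1171163550]})
right = pd.DataFrame({"Geographic Area": ["Alabama", "Puerto Rico"]})

merged = left.merge(right, left_on="state", right_on="Geographic Area", how="outer")

# The unmatched row has no lastrevid, so the integer column becomes float64.
dtype_after_merge = str(merged["lastrevid"].dtype)

# "Int64" (capital I) is the pandas nullable integer dtype; it keeps the
# missing value as <NA> while storing the ids as integers again.
merged["lastrevid"] = merged["lastrevid"].astype("Int64")

alabama_revid = int(merged.loc[merged["state"] == "Alabama", "lastrevid"].iloc[0])
print(dtype_after_merge, merged["lastrevid"].dtype, alabama_revid)
```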


### 3.b. Finding states with no matches

In [28]:
# s1 is a list for checking for states with no wiki data
# creating sets and taking a set difference for the no matches count

s1 = df_consolidated[df_consolidated['state'].isnull()]['Geographic Area'].unique()

# s2 is a list for checking states with no population data
s2 = df_consolidated[df_consolidated['Geographic Area'].isnull()]['state'].unique()

no_match = list(set(np.append(s1, s2)))
no_match.sort()
no_match

['Connecticut',
 'District of Columbia',
 'Georgia',
 'Georgia_(U.S._state)',
 'Nebraska',
 'New Hampshire',
 'New Jersey',
 'New Mexico',
 'New York',
 'New_Hampshire',
 'New_Jersey',
 'New_Mexico',
 'New_York',
 'North Carolina',
 'North Dakota',
 'North_Carolina',
 'North_Dakota',
 'Puerto Rico',
 'Rhode Island',
 'Rhode_Island',
 'South Carolina',
 'South Dakota',
 'South_Carolina',
 'South_Dakota',
 'West Virginia',
 'West_Virginia']
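Many entries in this list are the same state under two spellings: the crawled article data uses underscores (and a disambiguation suffix for Georgia), while the population data uses spaces. Those states fail to match only because of the naming, so they drop out of the analysis even though both datasets contain them. A minimal sketch of a normalization that could be applied to the state column before merging is shown below; `normalize_state_name` is a hypothetical helper, and this step is not part of the notebook as run.

```python
import re

def normalize_state_name(raw):
    """Turn a crawled state label into the plain name used by the population data."""
    name = raw.replace("_", " ")                   # 'New_Hampshire' -> 'New Hampshire'
    name = re.sub(r"\s*\([^)]*\)\s*$", "", name)   # drop a trailing '(U.S. state)'
    return name.strip()

crawled = ["Georgia_(U.S._state)", "New_Hampshire", "West_Virginia", "Alabama"]
normalized = [normalize_state_name(s) for s in crawled]
print(normalized)
```

This would not help for areas that appear in the list under only one spelling; those have data on one side of the merge only.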

#### Writing to the output text file wp_areas-no_match.txt

In [29]:
with open ('../output files/wp_areas-no_match.txt', 'w') as f:
    for i in no_match:
        f.write(i)
        f.write('\n')

#### Consolidate the remaining data into a single CSV file

Checking for nulls in Geographic Area and state; rows where either is null are dropped.

In [30]:
df_consolidated = df_consolidated[(~df_consolidated['state'].isnull()) & (~df_consolidated['Geographic Area'].isnull())]
df_consolidated = df_consolidated.drop('Geographic Area', axis = 1)

Renaming all the columns as per the standard given in the instruction file

In [31]:
df_consolidated = df_consolidated.rename(columns = {
    'state' : 'state',
    'Division' : 'regional_division',
    'Population Estimate': 'population',
    'page_title' : 'article_title',
    'lastrevid' : 'revision_id',
    'prediction': 'article_quality'
}).drop('Region', axis=1)
desired_order = ['state', 'regional_division', 'population', 'article_title', 'revision_id', 'article_quality']
df_consolidated = df_consolidated[desired_order]

Saving df_consolidated into a csv file as required and viewing the same.

In [32]:
df_consolidated.to_csv('../output files/wp_scored_city_articles_by_state.csv', index=False)
df_consolidated.head()

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,East South Central,5074296,"Abbeville, Alabama",1171164000.0,C
1,Alabama,East South Central,5074296,"Adamsville, Alabama",1177621000.0,C
2,Alabama,East South Central,5074296,"Addison, Alabama",1168360000.0,C
3,Alabama,East South Central,5074296,"Akron, Alabama",1165910000.0,GA
4,Alabama,East South Central,5074296,"Alabaster, Alabama",1179140000.0,C


## Step 4: Analysis

Our analysis will consist of calculating total-articles-per-population (a ratio representing the number of articles per person) and high-quality-articles-per-population (a ratio representing the number of high-quality articles per person) on a state-by-state and divisional basis. All of these values are "per capita" ratios. For this analysis, we should consider "high-quality" articles to be articles that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.
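To make the two ratios concrete, here is a minimal sketch on a toy table. The state names, populations and quality labels are invented for illustration; only the column names follow the consolidated file.

```python
import pandas as pd

HIGH_QUALITY = ["FA", "GA"]

# Toy consolidated data: one row per article, population repeated per state.
toy = pd.DataFrame({
    "state": ["State A", "State A", "State A", "State B", "State B", "State C"],
    "population": [1000, 1000, 1000, 500, 500, 200],
    "article_quality": ["FA", "C", "Stub", "GA", "GA", "Stub"],
})

per_state = toy.groupby("state").agg(
    population=("population", "first"),            # same value on every row of a state
    article_count=("article_quality", "count"),    # all articles
    hq_count=("article_quality", lambda s: s.isin(HIGH_QUALITY).sum()),  # FA or GA only
)

per_state["articles_per_capita"] = per_state["article_count"] / per_state["population"]
per_state["hq_articles_per_capita"] = per_state["hq_count"] / per_state["population"]
print(per_state)
```

Computing both counts in a single group-by keeps states with no high-quality articles in the table with a count of 0, whereas filtering first and then merging on state would drop them.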

### 4.a. Total Articles per Population (articles per capita)

#### By State

In [33]:
# Keeping one row per state to read off its population,
# counting the number of articles per state,
# and calculating articles_per_capita
df1 = df_consolidated[~df_consolidated.duplicated(subset=['state', 'regional_division'], keep = 'last')]

# Calculating the population of each state
state_pop = df1[['state', 'population']].groupby('state').sum().reset_index()
state_article_cnt = df_consolidated[['state', 'article_title']].groupby('state').count().reset_index()
total_articles_state = state_pop.merge(state_article_cnt, on='state')
total_articles_state.columns=['state', 'population', 'article_count']
total_articles_state['article_count'] = total_articles_state['article_count'].astype('int')
total_articles_state['population'] = total_articles_state['population'].str.replace(',', '').astype('int')
total_articles_state['articles_per_capita'] = total_articles_state['article_count'] / (total_articles_state['population'])
total_articles_state['articles_per_capita'] = total_articles_state['articles_per_capita'].astype('float')

# Guarding against states with zero population (none are present, but the check is kept)
total_articles_state = total_articles_state[total_articles_state['articles_per_capita'] != np.inf] 
print('On a state level, the dataframe returns the below number of rows')
print(len(total_articles_state['state'].unique()))
total_articles_state.reset_index(inplace=True)
total_articles_state = total_articles_state.drop('index', axis = 1)
total_articles_state.head()

On a state level, the dataframe returns the below number of rows
37


Unnamed: 0,state,population,article_count,articles_per_capita
0,Alabama,5074296,461,9.1e-05
1,Alaska,733583,149,0.000203
2,Arizona,7359197,91,1.2e-05
3,Arkansas,3045637,500,0.000164
4,California,39029342,482,1.2e-05


#### By Regional Division

In [34]:
# Repeating the same as above but grouping by regional division in this case
# Using the population of each regional division

division_pop = df_pop_division # Using from step 1
division_article_cnt = df_consolidated[['regional_division', 'article_title']].groupby('regional_division').count().reset_index()
total_articles_division = division_pop.merge(division_article_cnt, on='regional_division')
total_articles_division.columns=['regional_division', 'population', 'article_count']
total_articles_division['articles_per_capita'] = total_articles_division['article_count'] / (total_articles_division['population'])
 
print('On a regional division level, the dataframe returns the below number of rows')
print(len(total_articles_division['regional_division'].unique()))
total_articles_division.head()

On a regional division level, the dataframe returns the below number of rows
9


Unnamed: 0,regional_division,population,article_count,articles_per_capita
0,East North Central,47097779,4754,0.000101
1,East South Central,19578002,1529,7.8e-05
2,Middle Atlantic,41910858,2556,6.1e-05
3,Mountain,25514320,1083,4.2e-05
4,New England,15129548,1164,7.7e-05


### 4.b. High Quality Articles per Population

This applies only to the articles tagged FA or GA in the "article_quality" column.

#### By State

In [35]:
# Filtering the articles based on the article_quality attribute
# Calculation for article_count and article_per_capita done the same as above i.e., group by state

df3 = df_consolidated[~df_consolidated.duplicated(subset=['state', 'regional_division'], keep = 'last')]

state_pop = df3[['state', 'population']].groupby('state').sum().reset_index()
hq_state_df = df_consolidated[(df_consolidated['article_quality'] == 
                                 'FA') | (df_consolidated['article_quality'] == 'GA')]

state_count = hq_state_df[['state', 'article_title']].groupby('state').count().reset_index()
hq_state_df = state_pop.merge(state_count, on='state')
hq_state_df.columns=['state', 'population', 'article_count']
hq_state_df['article_count'] = hq_state_df['article_count'].astype('int')
hq_state_df['population'] = hq_state_df['population'].str.replace(',', '').astype('int')
hq_state_df['articles_per_capita'] = hq_state_df['article_count'] / (hq_state_df['population'])
hq_state_df['articles_per_capita'] = hq_state_df['articles_per_capita'].astype('float')

# Need to exclude conditions where the population of a state is zero
hq_state_df = hq_state_df[hq_state_df['articles_per_capita'] != np.inf]
hq_state_df.reset_index(inplace=True)
hq_state_df.drop(columns=['index'], inplace=True)

print('On a state level, the high quality dataframe returns the below number of rows')
print(len(hq_state_df['state'].unique()))
hq_state_df.head()

On a state level, the high quality dataframe returns the below number of rows
37


| | state | population | article_count | articles_per_capita |
|---|---|---|---|---|
| 0 | Alabama | 5074296 | 53 | 1e-05 |
| 1 | Alaska | 733583 | 31 | 4.2e-05 |
| 2 | Arizona | 7359197 | 24 | 3e-06 |
| 3 | Arkansas | 3045637 | 72 | 2.4e-05 |
| 4 | California | 39029342 | 173 | 4e-06 |
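
The state-level merge above uses pandas' default inner join, so a state with no FA or GA article does not appear in the result; this may be one reason the table has fewer rows than there are states. A minimal sketch of one alternative is a left merge that keeps such states with a count of zero. The frames `toy_pop` and `toy_hq_count` and all their values are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Illustrative data only (hypothetical values, not taken from the notebook)
toy_pop = pd.DataFrame({
    'state': ['A', 'B', 'C'],
    'population': [1_000_000, 2_000_000, 500_000],
})
toy_hq_count = pd.DataFrame({
    'state': ['A', 'C'],          # state 'B' has no FA/GA article
    'article_count': [10, 5],
})

# how='left' keeps every state from toy_pop; missing counts become NaN
toy_hq = toy_pop.merge(toy_hq_count, on='state', how='left')

# Treat a missing count as zero high-quality articles
toy_hq['article_count'] = toy_hq['article_count'].fillna(0).astype(int)
toy_hq['articles_per_capita'] = toy_hq['article_count'] / toy_hq['population']

print(toy_hq)
```

Whether zero-count states belong in the bottom-10 ranking is a design choice for the analysis; the sketch only shows how they could be retained.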


#### By Regional Division

In [36]:
# Filter the articles on the article_quality attribute (keep FA and GA only).
# article_count and articles_per_capita are calculated as above, i.e., grouped by regional division.

division_pop = df_pop_division  # Reusing the division populations from Step 1

hq_division_df = df_consolidated[df_consolidated['article_quality'].isin(['FA', 'GA'])]
division_count = hq_division_df[['regional_division', 'article_title']].groupby('regional_division').count().reset_index()
hq_division_df = division_pop.merge(division_count, on='regional_division')
hq_division_df.columns=['regional_division', 'population', 'article_count']
hq_division_df['articles_per_capita'] = hq_division_df['article_count'] / (hq_division_df['population'])

print('On a regional division level, the high quality dataframe returns the below number of rows')
print(len(hq_division_df['regional_division'].unique()))
hq_division_df.head()

On a regional division level, the high quality dataframe returns the below number of rows
9


| | regional_division | population | article_count | articles_per_capita |
|---|---|---|---|---|
| 0 | East North Central | 47097779 | 716 | 1.5e-05 |
| 1 | East South Central | 19578002 | 316 | 1.6e-05 |
| 2 | Middle Atlantic | 41910858 | 564 | 1.3e-05 |
| 3 | Mountain | 25514320 | 304 | 1.2e-05 |
| 4 | New England | 15129548 | 150 | 1e-05 |
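
Per-capita values on the order of 1e-05 are hard to compare by eye. As a hedged illustration, the same ratio can be scaled to articles per one million residents. The sketch below copies two rows from the division table above; the `articles_per_million` column is a hypothetical addition that the notebook itself does not compute.

```python
import pandas as pd

# Two rows copied from the high-quality division table above
sample = pd.DataFrame({
    'regional_division': ['East North Central', 'East South Central'],
    'population': [47097779, 19578002],
    'article_count': [716, 316],
})

# Same calculation as in the notebook
sample['articles_per_capita'] = sample['article_count'] / sample['population']

# Hypothetical extra column: the same ratio scaled to one million residents
sample['articles_per_million'] = (sample['articles_per_capita'] * 1_000_000).round(2)

print(sample[['regional_division', 'articles_per_million']])
```

The scaling does not change any ranking; it only makes the magnitudes easier to read.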


## Step 5: Results

The results of the analysis are presented as data tables. The assignment asks for six tables in total, listed below.

### 1. Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order)

In [37]:
top10_state = total_articles_state.sort_values(by=['articles_per_capita'],
                                                    ascending=False).head(10).reset_index()
top10_state.index += 1
top10_state['state']

1          Vermont
2            Maine
3             Iowa
4           Alaska
5     Pennsylvania
6         Michigan
7          Wyoming
8         Arkansas
9         Missouri
10       Minnesota
Name: state, dtype: object

### 2. Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order) 

In [38]:
bottom10_state = total_articles_state.sort_values(by=['articles_per_capita'],
                                                    ascending=True).head(10).reset_index()
bottom10_state.index += 1
bottom10_state['state']

1         Nevada
2     California
3        Arizona
4       Virginia
5        Florida
6       Oklahoma
7         Kansas
8       Maryland
9      Wisconsin
10    Washington
Name: state, dtype: object

### 3. Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order)

In [39]:
top10_hq_state = hq_state_df.sort_values(by=['articles_per_capita'],
                                             ascending=False).head(10).reset_index()
top10_hq_state.index += 1
top10_hq_state['state']

1          Vermont
2          Wyoming
3          Montana
4     Pennsylvania
5         Missouri
6           Alaska
7             Iowa
8           Oregon
9            Maine
10       Minnesota
Name: state, dtype: object

### 4. Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order)

In [40]:
bottom10_hq_state = hq_state_df.sort_values(by=['articles_per_capita'],
                                             ascending=True).head(10).reset_index()
bottom10_hq_state.index += 1
bottom10_hq_state['state']

1          Virginia
2            Nevada
3           Arizona
4        California
5           Florida
6          Maryland
7            Kansas
8          Oklahoma
9     Massachusetts
10        Louisiana
Name: state, dtype: object

### 5. Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita

In [41]:
division_coverage = total_articles_division.sort_values(by=['articles_per_capita'],
                                                ascending=False).reset_index()
division_coverage.index += 1
division_coverage['regional_division']

1    West North Central
2    East North Central
3    East South Central
4           New England
5       Middle Atlantic
6    West South Central
7              Mountain
8               Pacific
9        South Atlantic
Name: regional_division, dtype: object

### 6. Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita

In [42]:
division_hq_coverage = hq_division_df.sort_values(by=['articles_per_capita'],
                                           ascending=False).reset_index()
division_hq_coverage.index += 1
division_hq_coverage['regional_division']

1    West North Central
2    East South Central
3    East North Central
4    West South Central
5       Middle Atlantic
6              Mountain
7           New England
8               Pacific
9        South Atlantic
Name: regional_division, dtype: object
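
All six tables follow the same pattern: sort by `articles_per_capita`, optionally keep the first ten rows, and number the rows from 1. A minimal sketch of a shared helper is given below. The function `rank_by_per_capita` is a hypothetical name, not part of the notebook, and the sample frame reuses the five total-coverage division rows shown in section 4 rather than the full data. Unlike the cells above, it also returns the per-capita value next to each name.

```python
import pandas as pd

def rank_by_per_capita(df, label_col, n=None, ascending=False):
    """Return rows ranked by articles_per_capita with a 1-based rank index."""
    ranked = df.sort_values(by='articles_per_capita', ascending=ascending)
    if n is not None:
        ranked = ranked.head(n)
    # Drop the old index and number the ranks from 1
    ranked = ranked.reset_index(drop=True)
    ranked.index += 1
    return ranked[[label_col, 'articles_per_capita']]

# Sample rows copied from the total-coverage division table (first five rows only)
sample = pd.DataFrame({
    'regional_division': ['East North Central', 'East South Central',
                          'Middle Atlantic', 'Mountain', 'New England'],
    'population': [47097779, 19578002, 41910858, 25514320, 15129548],
    'article_count': [4754, 1529, 2556, 1083, 1164],
})
sample['articles_per_capita'] = sample['article_count'] / sample['population']

top3 = rank_by_per_capita(sample, 'regional_division', n=3)
print(top3)
```

With the notebook's own frames, a call such as `rank_by_per_capita(total_articles_state, 'state', n=10)` would correspond to the first table, and `ascending=True` to the bottom-10 variants.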