# Analysis of the Data

## Overview of the Analyses

The analysis computes two per capita ratios on a state-by-state and census-division basis: total-articles-per-population (the number of articles per person) and high-quality-articles-per-population (the number of high-quality articles per person).

For this analysis, we consider 'high quality' articles to be those that ORES predicted to be in either the 'FA' (featured article) or 'GA' (good article) class.
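
As a minimal sketch of the two ratios defined above, the example below computes them on a small made-up table. The column names match the saved CSV, but the states `A` and `B`, their populations, and the article rows are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real CSV (values are made up).
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B"],
    "population": [1000.0, 1000.0, 1000.0, 500.0, 500.0],
    "article_title": ["a1", "a2", "a3", "b1", "b2"],
    "article_quality": ["FA", "C", "GA", "GA", "Stub"],
})

# The population is repeated on every article row, so take one value per state.
population = toy.groupby("state")["population"].first()

# Total articles per capita: number of articles divided by state population.
total_per_capita = toy.groupby("state")["article_title"].count() / population

# High-quality articles per capita: count only rows predicted as FA or GA.
high_quality = toy[toy["article_quality"].isin(["FA", "GA"])]
high_quality_per_capita = (
    high_quality.groupby("state")["article_title"].count() / population
)

print(total_per_capita)
print(high_quality_per_capita)
```

In this toy table, state `A` has 3 articles for 1000 people and state `B` has 2 articles for 500 people, so `B` ranks higher on total coverage even though it has fewer articles.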

In [1]:
# ---------------------------- importing libraries --------------------------- #

import pandas as pd

In [2]:
# --------------------------- importing saved data --------------------------- #

df = pd.read_csv("data/wp_scored_city_articles_by_state.csv")
df.head()

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,East South Central,5074296.0,"Abbeville, Alabama",1171163550,C
1,Alabama,East South Central,5074296.0,"Adamsville, Alabama",1177621427,C
2,Alabama,East South Central,5074296.0,"Addison, Alabama",1168359898,C
3,Alabama,East South Central,5074296.0,"Akron, Alabama",1165909508,GA
4,Alabama,East South Central,5074296.0,"Alabaster, Alabama",1179139816,C


### Top 10 US States by coverage: The 10 states with the highest total articles per capita (descending order)

In [3]:
# ------------------------- top 10 states by coverage ------------------------ #

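# calculating total articles per capita
# (population is repeated on every article row of a state, so the mean recovers the state's population)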
total_articles = df.groupby('state')['article_title'].count() / df.groupby('state')['population'].mean()

# sorting the states by total articles per capita in descending order and selecting the top 10
top_10_states = total_articles.sort_values(ascending = False).head(10)

# creating a DataFrame for the top 10 states by coverage
top_10_states_coverage = pd.DataFrame({
    'State': top_10_states.index,
    'Total Articles per Capita': top_10_states.values
})

# display the data table
table_name = "Top 10 US States by Coverage"
print(f"Data Table: {table_name}\n")
print(top_10_states_coverage)

Data Table: Top 10 US States by Coverage

           State  Total Articles per Capita
0        Vermont                   0.001014
1   South Dakota                   0.000684
2   North Dakota                   0.000457
3   Pennsylvania                   0.000394
4          Maine                   0.000349
5        Wyoming                   0.000341
6           Iowa                   0.000326
7  West Virginia                   0.000261
8         Alaska                   0.000203
9       Michigan                   0.000177


### Bottom 10 US States by coverage: The 10 states with the lowest total articles per capita (ascending order)

In [4]:
# ------------------------- bottom 10 states by coverage ------------------------ #

# sorting the states by total articles per capita in ascending order and selecting the bottom 10
bottom_10_states = total_articles.sort_values(ascending = True).head(10)

# creating a DataFrame for the bottom 10 states by coverage
bottom_10_states_coverage = pd.DataFrame({
    'State': bottom_10_states.index,
    'Total Articles per Capita': bottom_10_states.values
})

# display the data table
table_name = "Bottom 10 US States by Coverage"
print(f"Data Table: {table_name}\n")
print(bottom_10_states_coverage)

Data Table: Bottom 10 US States by Coverage

            State  Total Articles per Capita
0  North Carolina                   0.000005
1          Nevada                   0.000006
2      California                   0.000012
3         Arizona                   0.000012
4        Virginia                   0.000015
5         Florida                   0.000018
6        Oklahoma                   0.000019
7          Kansas                   0.000021
8        Maryland                   0.000025
9        New York                   0.000033
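
The top 10 and bottom 10 tables above repeat the same sort, select, and wrap steps. One possible way to factor that out is sketched below; `rank_table` is a hypothetical helper that is not part of the notebook, and it uses `Series.nlargest` and `Series.nsmallest` in place of `sort_values(...).head(10)`. The input values are made up.

```python
import pandas as pd

def rank_table(ratios: pd.Series, label: str, value_name: str,
               n: int = 10, top: bool = True) -> pd.DataFrame:
    """Return the n largest (top=True) or n smallest (top=False) ratios as a table."""
    picked = ratios.nlargest(n) if top else ratios.nsmallest(n)
    # Name the index so that reset_index turns it into a labelled column.
    return picked.rename_axis(label).reset_index(name=value_name)

# Hypothetical per capita ratios, used only to exercise the helper.
ratios = pd.Series({"X": 0.3, "Y": 0.1, "Z": 0.2})

top_2 = rank_table(ratios, "State", "Total Articles per Capita", n=2, top=True)
bottom_2 = rank_table(ratios, "State", "Total Articles per Capita", n=2, top=False)

print(top_2)
print(bottom_2)
```

With the real data, the call would presumably look like `rank_table(total_articles, "State", "Total Articles per Capita")`.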


### Top 10 US States by high quality: The 10 states with the highest high quality articles per capita (descending order)

In [5]:
# ----------------------- top 10 states by high quality ---------------------- #

# filtering and counting high-quality articles
high_quality_articles = df[df['article_quality'].isin(['FA', 'GA'])]

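# calculating the high-quality articles per capita
# (a state with no FA/GA articles is absent from the numerator, so its ratio comes out as NaN rather than 0)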
high_quality_articles_pc = high_quality_articles.groupby('state')['article_title'].count() / df.groupby('state')['population'].mean()

# sorting the states by high-quality articles per capita in descending order and selecting the top 10
top_10_states = high_quality_articles_pc.sort_values(ascending = False).head(10)

# creating a DataFrame for the top 10 states by high quality
top_10_states_high_quality = pd.DataFrame({
    'State': top_10_states.index,
    'High Quality Articles per Capita': top_10_states.values
})

# display the data table
table_name = "Top 10 states by high quality"
print(f"Data Table: {table_name}\n")
print(top_10_states_high_quality)

Data Table: Top 10 states by high quality

           State  High Quality Articles per Capita
0        Vermont                          0.000139
1        Wyoming                          0.000134
2   South Dakota                          0.000123
3  West Virginia                          0.000119
4   Pennsylvania                          0.000087
5        Montana                          0.000049
6  New Hampshire                          0.000045
7       Missouri                          0.000043
8         Alaska                          0.000042
9      Tennessee                          0.000041


### Bottom 10 US States by high quality: The 10 states with the lowest high quality articles per capita (ascending order)

In [6]:
# --------------------- bottom 10 states by high quality --------------------- #

# sorting the states by high-quality articles per capita in ascending order and selecting the bottom 10
bottom_10_states_by_high_quality = high_quality_articles_pc.sort_values(ascending = True).head(10)

# creating a DataFrame from the sorted Series
bottom_10_states_high_quality = pd.DataFrame({
    'State': bottom_10_states_by_high_quality.index,
    'High Quality Articles per Capita': bottom_10_states_by_high_quality.values
})

# display the data table
table_name = "Bottom 10 states by high quality"
print(f"Data Table: {table_name}\n")
print(bottom_10_states_high_quality)

Data Table: Bottom 10 states by high quality

            State  High Quality Articles per Capita
0  North Carolina                          0.000002
1        Virginia                          0.000002
2          Nevada                          0.000003
3         Arizona                          0.000003
4      California                          0.000004
5         Florida                          0.000005
6        New York                          0.000006
7        Maryland                          0.000007
8          Kansas                          0.000007
9        Oklahoma                          0.000008
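
One caveat about the bottom 10 ranking: because the high-quality counts come from a filtered DataFrame, a state with no 'FA' or 'GA' articles has no entry in the numerator and its ratio becomes `NaN`, which `sort_values` places last by default. Whether any state in the real data is affected is not something we checked here. The sketch below reproduces the effect on made-up data and shows one possible fix using `reindex(..., fill_value=0)`.

```python
import pandas as pd

# Hypothetical toy data: state B has no FA/GA articles.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "population": [1000.0, 1000.0, 500.0, 500.0],
    "article_title": ["a1", "a2", "b1", "b2"],
    "article_quality": ["GA", "C", "C", "Stub"],
})

population = toy.groupby("state")["population"].first()
hq_counts = (
    toy[toy["article_quality"].isin(["FA", "GA"])]
    .groupby("state")["article_title"]
    .count()
)

# Same pattern as the notebook: B is missing from hq_counts, so B becomes NaN.
naive = hq_counts / population

# Possible fix: give every state an explicit count, using 0 where none exists.
filled = hq_counts.reindex(population.index, fill_value=0) / population

print(naive.sort_values(ascending=True))
print(filled.sort_values(ascending=True))
```

In the `filled` version, state `B` sorts first in ascending order with a ratio of 0, whereas in the `naive` version it drops to the end as `NaN`.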


### Census divisions by total coverage: A rank ordered list of US census divisions (descending order) by total articles per capita

In [7]:
# -------------------- census divisions by total coverage -------------------- #

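# calculating total articles per capita by Census division
# (note: .mean() averages over article rows, so the denominator is an article-weighted
# mean of state populations, not the total population of the division)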
total_articles_per_capita = df.groupby('regional_division')['article_title'].count() / df.groupby('regional_division')['population'].mean()


# sorting the divisions by total articles per capita in descending order
sorted_divisions_total_coverage = total_articles_per_capita.sort_values(ascending = False)

# creating DataFrame for rank-ordered lists
divisions_by_total_coverage = pd.DataFrame({
    'Census Division': sorted_divisions_total_coverage.index,
    'Total Articles per Capita': sorted_divisions_total_coverage.values
})

# display the data table
table_name = "Census Divisions by Total Coverage"
print(f"Data Table: {table_name}\n")
print(divisions_by_total_coverage)

Data Table: Census Divisions by Total Coverage

      Census Division  Total Articles per Capita
0  West North Central                   0.000889
1  East North Central                   0.000642
2         New England                   0.000556
3  East South Central                   0.000389
4     Middle Atlantic                   0.000327
5      South Atlantic                   0.000235
6            Mountain                   0.000142
7  West South Central                   0.000141
8             Pacific                   0.000088


### Census divisions by high quality coverage: A rank ordered list of US census divisions (descending order) by high quality articles per capita

In [8]:
# ----------------- census divisions by high quality coverage ---------------- #

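# calculating high-quality articles per capita by Census division
# (note: as with total coverage, .mean() averages population over article rows,
# which is not the total population of the division)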
high_quality_articles_per_capita = high_quality_articles.groupby('regional_division')['article_title'].count() / df.groupby('regional_division')['population'].mean()

# sorting the divisions by high-quality articles per capita in descending order
sorted_divisions_high_quality_coverage = high_quality_articles_per_capita.sort_values(ascending = False)

# creating DataFrame for the rank-ordered list
divisions_by_high_quality_coverage = pd.DataFrame({
    'Census Division': sorted_divisions_high_quality_coverage.index,
    'High Quality Articles per Capita': sorted_divisions_high_quality_coverage.values
})

# display the data table
table_name = "Census Divisions by High Quality Coverage"
print(f"Data Table: {table_name}\n")
print(divisions_by_high_quality_coverage)

Data Table: Census Divisions by High Quality Coverage

      Census Division  High Quality Articles per Capita
0  West North Central                          0.000165
1  East North Central                          0.000113
2  East South Central                          0.000093
3     Middle Atlantic                          0.000088
4         New England                          0.000087
5      South Atlantic                          0.000063
6            Mountain                          0.000049
7  West South Central                          0.000045
8             Pacific                          0.000033
