# Creating a dataset for Internet access by county #

1. [Introduction](#introduction_tag)
2. [Source Datasets](#dataset_tag)
3. [Notebook Preparation](#prep_tag)
4. [Organizing the BroadbandNow Set](#organize_tag)
5. [Merge the Sets and Save](#merge_tag)
6. [Concluding Comments and Observations](#conclusion_tag)

<a id="introduction_tag"></a>
## Introduction ##

I read a [USA Today article][usa_today] from June 2020 that discusses
library usage during the pandemic. Some libraries set up wi-fi networks that
extended outside the building, so that people would have access to the Internet
even when the library was shut down.

This made me curious about how many people have convenient access to the Internet.
Some companies no longer have live customer service phone numbers,
so any support relies on being able to read their web page. If someone wants to
determine the validity of claims and rumors spread by social media, they either
need a trusted radio/television news source, or they need convenient access
to the Internet to investigate the information (by searching for original
articles or unaltered video).

I found a pair of datasets that had information that would let me look at the situation.
But while doing data cleaning, I found some problems that required significant effort
to diagnose. I figured it would be useful to create a new dataset, and provide it on
Kaggle in case others were interested.

[usa_today]: https://www.usatoday.com/story/news/2020/06/11/when-libraries-reopen-after-coronavirus-might-months/5316591002/

<a id="dataset_tag"></a>
## Source Datasets ##

I started with the [dataset][imlsdb] provided by the Institute of
Museum and Library Services [(IMLS)][imls], titled
"IMLS Indicators Workbook: Economic Status and Broadband Availability and Adoption".
The workbook contained statistics blended from three sources: the U.S. Census Bureau
American Community Survey (ACS 5-year 2014-2018 estimates); broadbandnow.com
(commercial aggregator of FCC data); and the Bureau of Labor Statistics (local area unemployment statistics).

On December 10, 2020, BroadbandNow.com [(BBN)][bbn] provided a
[dataset hosted at GitHub][bbndb] as part of their [Open Data Challenge][bbn_open].
This had the features I wanted to cross check with the IMLS dataset.

Unfortunately, there are problems with both datasets. Some examples:
1. The IMLS set has Autauga County, Alabama as having zero broadband providers and
"NaN" for the lowest cost. In the BBN set, there are 6 zip codes in that county,
and 5 of them have broadband providers and plan prices.
2. The IMLS set has Baldwin County, Alabama as having zero broadband providers and
"NaN" for the lowest cost. In the BBN set, there are 21 zip codes in that county,
and 20 of them have broadband providers and plan prices.
3. The BBN set has 33023 rows, but 4816 of them are duplicates.
4. The BBN set has some zip codes labeled with counties that I could not find.
Zip code 52626 is labeled as "Clark, Iowa", even though it belongs to Van Buren
and Lee counties. I was unable to find Walsh, Minnesota for zip code 56744.
5. I don't know what the population numbers in the BBN set represent.
For example, I looked at the 5 counties for Hawaii.
The IMLS population numbers are identical to the 2019 estimates for each county.
The BBN population numbers are below the 2010 census numbers for each county
(Hawaii, Honolulu, Kalawao, Kauai),
except for Maui which is exactly the 2010 census number.

I decided it would be worth it to do a partial clean-up of both sets, and then
merge them to create a dataset with fewer problems. However, that still requires
some choices and compromises.
1. The ACS data that was used for the IMLS set was grouped by PUMAs (Public Use Microdata Areas).
2. The BBN data is grouped by zip code.
3. The set I want to make is grouped by county.

I have to use what is convenient, since I don't have the resources or knowledge to work out
how best to group the data. If someone is curious about what sort of issues are involved, I found
this web page from the University of Michigan titled
["Creating County-Level Statistics from Public Use Microdata Areas (PUMAS)"][umich_puma].
A zip code can lie in multiple counties, and vice versa.

When there wasn't a clear match up between the counties listed in the IMLS and BBN data sets,
I consulted the [Wikipedia list of US counties][wiki_county]
and the [zip-codes.com web site][zip_web_site] for more
information.

[imlsdb]: https://www.imls.gov/data/data-catalog/imls-indicators-workbook-economic-status-and-broadband-availability-and-adoption
[imls]: https://www.imls.gov/research-evaluation
[bbndb]: https://github.com/BroadbandNow/Open-Data
[bbn]: https://broadbandnow.com
[bbn_open]: https://broadbandnow.com/report/open-dataset-announcement
[umich_puma]: https://www.psc.isr.umich.edu/dis/census/Features/puma2cnty/
[wiki_county]: https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county_equivalents
[zip_web_site]: https://www.zip-codes.com/zip-code-database.asp


<a id="prep_tag"></a>
## Notebook Preparation ##

I ran this Jupyter notebook on my home computer, so you will likely have to modify it
for your computer set-up. I downloaded the IMLS and BBN datasets from the linked web
pages mentioned in the previous section, but I also put a copy with this Kaggle dataset
for convenience. The IMLS information is in an Excel spreadsheet, but I put the
county information into a CSV file.

As an aside, the IMLS set has FIPS codes for each county, but the
[Wikipedia entry][wiki_fips] mentions: "On September 2, 2008, FIPS 5-2 was one of ten standards withdrawn by NIST as a Federal Information Processing Standard."

[wiki_fips]: https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import statsmodels.api as sm
import time

plt.rcParams["figure.figsize"] = (12,10)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_info_columns', 150)

# suppress warnings about "value is trying to be set on a copy of a slice from a DataFrame"
pd.options.mode.chained_assignment = None  # default='warn'

# *** following lines if using Kaggle ***
data_imls_df = pd.read_csv('../input/us-broadband-availability/source_sets/IMLS_county_data.csv', skipinitialspace=True)
data_bbn_df = pd.read_csv('../input/us-broadband-availability/source_sets/broadbandnow_opendata.csv', skipinitialspace=True)

# *** read from local version of files ***
#data_imls_df = pd.read_csv('IMLS_county_data.csv')
#data_bbn_df = pd.read_csv('broadbandnow_opendata.csv')

print(data_imls_df.info())
print(data_bbn_df.info())

The IMLS dataset has fields that begin with "MOE" (margin of error).
Since I am interested in clustering
rather than precise error estimates, those features are dropped.
I am also dropping the GEO and FIPS features.

The long column names are descriptive, but I've renamed them to shorter versions
for easier use.

For convenience, I am removing entries for Puerto Rico, which will leave the counties
in the 50 US states and the District of Columbia.

The "population" is listed as an object Dtype, because a handful of the entries
have commas as thousands separators. I removed the commas, and saved the column as "float".
(I could have saved it as int64, but later decimal calculations will be using this number.)
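
As a small illustration of the comma handling (the sample values and the helper name `to_number` are made up for this sketch, not taken from the IMLS file): `str.strip(',')` only removes commas at the two ends of a string, so a thousands separator in the middle needs `str.replace` instead.

```python
import pandas as pd

# Hypothetical sample: strings with and without separators, plus a plain integer
sample = pd.Series(["12,345", "6789", 1011], dtype=object)

def to_number(x):
    """Remove thousands separators from strings; leave other values untouched."""
    return x.replace(",", "") if isinstance(x, str) else x

# strip() leaves the interior comma in place, so float() would fail on this value
still_has_comma = "12,345".strip(",")

population = sample.apply(to_number).astype(float)
print(still_has_comma, population.tolist())
```

If the column were entirely strings, `sample.str.replace(",", "")` would do the same job without the `isinstance` check.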

In [None]:
#remove MOE features
moe_list = ["MOE, % w/o health ins.", "MOE, Poverty Rate (%)", "MOE SNAP",
           "MOE No Computer", "MOE no Internet", "MOE Broadband"]
for feat in moe_list:
    del data_imls_df[feat]

#remove some identifier features; leave the NAME and Stabr
id_list = ["GEO_ID", "FIPS_State", "FIPS_County"]
for feat in id_list:
    del data_imls_df[feat]

# Long column names are descriptive, but are making automatically generated plots
# difficult to read. Shorten them.

data_imls_df = data_imls_df.rename(
    columns = {'County':'county', 'State':'state',
               'NAME':'full_name', 'Stabr': 'state_abr',
               'Population_2019':'population', 'Unemployment rate 2019':'unemp',
               'Percent w/o Health insurance':'health_ins', 'Poverty Rate (%)':'poverty',
               'Percent received SNAP (2018)':'SNAP',
               'Percent with no home computer (2018)':'no_comp',
               'Percent with no home Internet (2018)':'no_internet',
               'Percent with home Broadband (2018)':'home_broad',
               'Number of Broadband providers (2019)':'broad_num',
               'Population for whom broadband available, 2019 (%)':'broad_avail',
               'Lowest broadband cost per month, 2019 ($)':'broad_cost'})

# remove Puerto Rico
data_imls_df = data_imls_df[~((data_imls_df['state'] == 'Puerto Rico'))]

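# a few of the population numbers have commas as thousands separators;
# str.strip(',') would only remove commas at the ends, so use replace
data_imls_df['population'] = data_imls_df['population'].apply(lambda x: x.replace(',', '')
                                if isinstance(x, str) else x).astype(float)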

print(data_imls_df.info())

For Nevada, the "county" information does not match the "full_name".

The "full_name" rows begin with Churchill County, and end with Carson City.
The "county" labels start with "Carson City", instead of ending with it.
This led to major problems later on when I was using the "county" field
to merge data sets.

After verifying that the population numbers were consistent with the "full_name" field,
I am overwriting the "county" label.
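
A possible cross-check for this kind of mislabeling is to derive a county label from "full_name" and compare it with the "county" column. The sketch below is only an illustration: the helper `county_label` and the list of suffixes are my assumptions, and other states may follow naming conventions that this simple rule does not cover.

```python
import re

# Suffixes assumed for illustration; the real data may contain other county-equivalent types
SUFFIX = re.compile(r"\s+(County|Parish|Borough|Census Area)$")

def county_label(full_name):
    """Drop the state after the last comma, then drop a trailing county-type suffix."""
    place = full_name.rsplit(",", 1)[0].strip()
    return SUFFIX.sub("", place)

names = ["Churchill County, Nevada", "White Pine County, Nevada", "Carson City, Nevada"]
labels = [county_label(n) for n in names]
print(labels)
```

Rows where the derived label and the existing "county" value disagree would be the ones to inspect by hand.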

In [None]:
print('    Before:')
print(data_imls_df[data_imls_df['state'] == "Nevada"][['full_name', 'county']].head(20))

nevada_reorganize = [
    ("Churchill County, Nevada", "Churchill"),
    ("Clark County, Nevada", "Clark"),
    ("Douglas County, Nevada", "Douglas"),
    ("Elko County, Nevada", "Elko"),
    ("Esmeralda County, Nevada", "Esmeralda"),
    ("Eureka County, Nevada", "Eureka"),
    ("Humboldt County, Nevada", "Humboldt"),
    ("Lander County, Nevada", "Lander"),
    ("Lincoln County, Nevada", "Lincoln"),
    ("Lyon County, Nevada", "Lyon"),
    ("Mineral County, Nevada", "Mineral"),
    ("Nye County, Nevada", "Nye"),
    ("Pershing County, Nevada", "Pershing"),
    ("Storey County, Nevada", "Storey"),
    ("Washoe County, Nevada", "Washoe"),
    ("White Pine County, Nevada", "White Pine"),
    ("Carson City, Nevada", "Carson City") ]

for (name,county) in nevada_reorganize:
    data_imls_df.loc[data_imls_df['full_name'] == name, 'county'] = county
print('\n    After:')
print(data_imls_df[data_imls_df['state'] == "Nevada"][['full_name', 'county']].head(20))

For the broadbandnow.com dataset, since I had access to more information, I wanted to save
more of the features than were originally used in the IMLS set. (This is still a subset
of the possible features present in the dataset.)
- Zip, Population, County, State : identifiers (most counties have multiple Zip codes)
- WiredCount_2020: Number of Wired (Cable, Copper, DSL, Fiber) Providers present in a zip code
- AllProviderCount_2020: Number of Providers of any technology present in a zip code, including Fixed Wireless Providers (WISPs)
- All25_3_2020: Number of Providers (any technology) present in a zip code offering speeds of at least 25 Mbps Download / 3 Mbps Upload
- AverageMbps: Average Download Speed via M-Lab Speed Tests, rolling 12 months
- %Access to Terrestrial Broadband: Percent of the Zip's Population that has Access to Terrestrial (Wired + Fixed Wireless) Broadband (25 Mbps Download / 3 Mbps Upload)
- Lowest Priced Terrestrial Broadband Plan: The Lowest Regular Monthly Priced Terrestrial (Wired + Fixed Wireless) Residential Standalone-Internet Broadband (25 Mbps Download / 3 Mbps Upload) Plan available in the zip

These fields were renamed to shorter versions, and "_bbn" appended to the variable
name to keep them distinct from the IMLS features.

The prices have dollar signs, and the access numbers have percentage symbols.
These were stripped off, and the values saved as floating-point numbers.

The BBN set has 33023 rows, but 4816 of them are duplicates.
(The duplicate rows are identical to the first found entries, so I was free to choose
the first or the last duplicate found.)
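
For anyone who wants to verify the duplicate count, the sketch below shows the idea on a toy frame (all values are hypothetical). `duplicated()` flags every repeat after the first occurrence, which is the same set of rows that `drop_duplicates()` removes by default.

```python
import pandas as pd

# Toy frame with hypothetical values; the second and third rows are exact copies
toy = pd.DataFrame({
    "Zip": [10001, 10002, 10002, 10003],
    "County": ["A", "B", "B", "C"],
    "Population": [100, 200, 200, 300],
})

# Number of rows that repeat an earlier row in every column
n_dupes = int(toy.duplicated().sum())

# keep=False flags all members of a duplicated group, which helps when inspecting them
all_copies = toy[toy.duplicated(keep=False)]

deduped = toy.drop_duplicates()
print(n_dupes, len(all_copies), len(deduped))
```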

In [None]:
id_list = ["Zip", "Population", "County", "State",
           "WiredCount_2020", "AllProviderCount_2020", "All25_3_2020",
           "AverageMbps", "%Access to Terrestrial Broadband",
           "Lowest Priced Terrestrial Broadband Plan"]

data_bbn_df = data_bbn_df[id_list].rename(
    columns = {'Population':'population_bbn', 'County':'county', 'State':'state',
               'WiredCount_2020':'wired_bbn',
               'AllProviderCount_2020':'provide_bbn',
               'All25_3_2020':'all25_bbn',
               'AverageMbps':'downave_bbn',
               '%Access to Terrestrial Broadband':'access_bbn',
               'Lowest Priced Terrestrial Broadband Plan':'price_bbn'})

# The prices are mostly strings (due to $), except for the NaN which are floats
data_bbn_df['price_bbn'] = data_bbn_df['price_bbn'].apply(lambda x: x.strip('$')
                                if isinstance(x, str) else x).astype(float)

# Remove the percentage symbol from the access numbers
data_bbn_df['access_bbn'] = data_bbn_df['access_bbn'].apply(lambda x: x.strip('%')
                                if isinstance(x, str) else x).astype(float)

# over 10% of the data set is duplicates
data_bbn_df = data_bbn_df.drop_duplicates()

print(data_bbn_df.info())

<a id="organize_tag"></a>
## Organizing the BroadbandNow Set ##

Both sets have a "county" and "state" label, so the plan is to merge the IMLS and BBN sets
using those two features. However, there are a few mismatches in the names of counties
between the two sets. These will be dealt with on a case-by-case basis.

In [None]:
# This will loop over all states, and compare the counties in the two data sets,
# and look for any names that are present in only one of the two sets.
#
# This generates a lot of text, so it is commented out.
# Uncomment if you want to verify my observations.

#state_list = list(data_imls_df['state'].unique())
#for local_state in state_list:
#    set1 = set(data_imls_df[data_imls_df['state'] == local_state]['county'].sort_values())
#    set2 = set(data_bbn_df[data_bbn_df['state'] == local_state]['county'].sort_values())
#    print("Checking for the state of",local_state,"have",
#          len(set1),"IMLS counties and",len(set2),"BBN counties")
#    print("Counties missing in BBN list:",list(set1-set2))
#    print("Counties missing in IMLS list:",list(set2-set1))

#### Alaska ####
The IMLS set has a full_name of "Kusilvak Census Area", but has "null" for the county label.
The BBN set has "Wade Hampton" as a county. Looking in Wikipedia, there was a name change
in 2015. For both data sets, I changed the county label to "Kusilvak".

#### Iowa ####
The BBN set has a "Clark county". This is distinct from "Clarke county", which is present in both sets. Looking at the zip code (52626), it belongs to Van Buren and Lee counties.
I have changed the BBN entry to use "Van Buren" for the county.

#### Minnesota ####
The BBN set has a "Walsh county". My online search only found a Walsh county in North Dakota.
The zip code (56744) belongs to the city of Oslo, MN, which is right next to the North Dakota
border. This is probably a zip code that crosses state lines. The zip code belongs
to Marshall and Polk counties.
I have changed the BBN entry to use "Marshall" for the county.

#### counties missing from the BBN set ####
The following counties are not in the BBN data set:
- Georgia: Quitman
- Hawaii: Kalawao
- Mississippi: Issaquena, George
- Nebraska: Loup, Banner
- Nevada: Storey
- Virginia: Franklin City, Covington, Lexington, Emporia, Martinsville, Fairfax City, Manassas Park

Most of these have small populations (but not always the smallest). Without investigating every case, I assume most of these are where zip codes and county lines do not conveniently line up.

For Nebraska, Loup contains zip code 68879, while Banner has zip code 69345. Neither of those numbers was in the BBN data set. (Considering there are almost 42 thousand zip codes in the USA, a dataframe with a little over 28 thousand rows will not cover all cases.)

I will let the user decide how to deal with these cases. After the sets are merged, all the "_bbn" variables will be null/NaN for counties not in the BBN dataset.
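
One way to list the counties that end up without BBN values is the `indicator` option of `DataFrame.merge`. This is a sketch on toy stand-ins for the two county-level frames; the numbers are hypothetical and only the column names follow this notebook.

```python
import pandas as pd

# Toy stand-ins for the county-level frames (hypothetical values)
imls = pd.DataFrame({"county": ["Loup", "Lancaster"],
                     "state": ["Nebraska", "Nebraska"],
                     "population": [600.0, 300000.0]})
bbn = pd.DataFrame({"county": ["Lancaster"],
                    "state": ["Nebraska"],
                    "price_bbn": [30.0]})

# indicator=True adds a "_merge" column describing where each row came from
merged = imls.merge(bbn, on=["county", "state"], how="left", indicator=True)

# "left_only" marks counties that had no matching BBN row
unmatched = merged.loc[merged["_merge"] == "left_only", ["county", "state"]]
print(unmatched)
```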

In [None]:
# handle some of the county mismatches
data_imls_df.loc[data_imls_df['full_name'] == "Kusilvak Census Area, Alaska", 'county'] = "Kusilvak"
data_bbn_df.loc[data_bbn_df['county'] == "Wade Hampton", 'county'] = "Kusilvak"

data_bbn_df.loc[data_bbn_df['Zip'] == 52626, 'county'] = "Van Buren"
data_bbn_df.loc[data_bbn_df['Zip'] == 56744, 'county'] = "Marshall"

The BBN set is organized by zip code, so it will need to be grouped into counties.

Some of the variables are straightforward. The county population is the sum of the populations of the zip codes assigned to that county. The lowest plan price is the minimum (non-NaN) value found. The average download speed is the population weighted average of the "average download speed" for each zip code.

The number of providers in the county is more complicated. It is not clear whether the same provider covers more than one zip code, or how many zip codes it covers, so the counts cannot simply be added up. I have decided to use the population weighted average for the number of providers. This will result in non-integer numbers.
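
To make the weighting concrete, here is a worked toy example for a single county with three zip codes (all values hypothetical). Each zip code contributes its provider count multiplied by its share of the county population.

```python
import pandas as pd

# Hypothetical zip-code rows for one county
zips = pd.DataFrame({
    "county": ["X", "X", "X"],
    "state": ["S", "S", "S"],
    "population_bbn": [1000, 3000, 6000],
    "provide_bbn": [2, 4, 5],
})

keys = ["county", "state"]

# Share of the county population living in each zip code: 0.1, 0.3, 0.6
county_pop = zips.groupby(keys)["population_bbn"].transform("sum")
zips["weight"] = zips["population_bbn"] / county_pop

# Weighted average = sum over zip codes of (value * population share)
weighted = (zips["provide_bbn"] * zips["weight"]).groupby([zips[k] for k in keys]).sum()
print(weighted)
```

Here the result is 2(0.1) + 4(0.3) + 5(0.6) = 4.4 providers, which shows why the county values come out as non-integers.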

In [None]:
# making a copy, in case I want to overwrite some values
work_df = data_bbn_df.copy()

# compute the fraction of the population for each zip code in a county
county_pop = work_df.groupby(['county', 'state'])['population_bbn'].transform('sum')
work_df['weight'] = work_df['population_bbn'].div(county_pop)

# start with the population of the county
county_bbn_df = work_df.groupby(['county', 'state'])['population_bbn'].sum()
county_bbn_df = county_bbn_df.reset_index(name = 'population_bbn')

temp_df = work_df.groupby(['county', 'state'])['price_bbn'].min()
temp_df = temp_df.reset_index(name = 'price_bbn')
county_bbn_df = county_bbn_df.merge(temp_df, how='left')

id_list = ["wired_bbn", "provide_bbn", "all25_bbn",
           "downave_bbn", "access_bbn"]
for col_name in id_list:
    work_df = work_df.astype({col_name:'float'})
    work_df[col_name] = work_df[col_name] * work_df['weight']
    temp_df = work_df.groupby(['county', 'state'])[col_name].sum()
    temp_df = temp_df.reset_index(name = col_name)
    county_bbn_df = county_bbn_df.merge(temp_df, how='left')

I noticed that some zip codes (likely in cities) can have average download speeds in the hundreds of Mbps, while other zip codes (likely rural) are in the single digits. I wanted to create a feature that would track this. The BBN set has a feature for the 90th percentile download speed, but I won't be able to compute that quantity when grouping the zip codes.

I decided to create a measure called "slowfrac_bbn".
I select the zip codes with an average download speed of less than 6 Mbps,
and determine what fraction of the county's population are in those zip codes.

The distribution of average download speeds for all zip codes looks like
it has an exponential decrease. Six seemed like a reasonable cut-off point.

If there are no zip codes in a county with an average speed of less than 6,
this will result in "slowfrac" being NaN. I replace those values with 0.

This is not a rigorously calculated statistical value, since I am using numbers which
are already averaged by zip code. However, I think it will still help indicate what
is going on in the county, if a high average speed is due to some extreme outliers
or if everyone has good connection speeds.
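
The sketch below walks through the "slowfrac" idea on toy data (two hypothetical counties, made-up speeds). The threshold is applied to the raw per-zip average speeds, and a county with no slow zip codes is filled in with 0.

```python
import pandas as pd

# Hypothetical zip-code rows for two counties
zips = pd.DataFrame({
    "county": ["X", "X", "X", "Y", "Y"],
    "state": ["S"] * 5,
    "population_bbn": [1000, 3000, 6000, 500, 1500],
    "downave_bbn": [4.0, 50.0, 5.5, 30.0, 120.0],
})

keys = ["county", "state"]
zips["weight"] = zips["population_bbn"] / zips.groupby(keys)["population_bbn"].transform("sum")

# Add up the population share of the slow zip codes only
slow = zips[zips["downave_bbn"] < 6.0]
slowfrac = slow.groupby(keys)["weight"].sum()

# Counties without any slow zip codes drop out of the groupby, so restore them with 0
all_counties = pd.MultiIndex.from_frame(zips[keys].drop_duplicates())
slowfrac = slowfrac.reindex(all_counties, fill_value=0.0)
print(slowfrac)
```

County X ends up at 0.7 (the zip codes with 10% and 60% of its population are slow), and county Y at 0.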

In [None]:
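# work_df['downave_bbn'] was multiplied by the population weight in the previous cell,
# so select on the raw speeds in data_bbn_df (which shares its index with work_df)
slower_df = work_df[data_bbn_df['downave_bbn'] < 6.0].copy()
temp_df = slower_df.groupby(['county', 'state'])['weight'].sum()
temp_df = temp_df.reset_index(name = 'slowfrac_bbn')
county_bbn_df = county_bbn_df.merge(temp_df, how='left')

county_bbn_df['slowfrac_bbn'] = county_bbn_df['slowfrac_bbn'].fillna(0.0)

# histogram of the raw (unweighted) average speeds below 25 Mbps
slower_df = data_bbn_df[data_bbn_df['downave_bbn'] < 25.0].copy()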
plt.figure(figsize=(12,4))
plt.subplot(1, 2, 1)
plt.hist(slower_df['downave_bbn'])
plt.xlabel('ave download speed by zip code')

plt.subplot(1, 2, 2)
plt.hist(county_bbn_df['slowfrac_bbn'])
plt.xlabel('slowfrac measure by county')

plt.tight_layout()
plt.show()

<a id="merge_tag"></a>
## Merge the Sets and Save ##

This notebook is saved on Kaggle to show how the dataset was created, but I'm running
it on a home computer. The line to save the dataframe as a CSV file is commented out.

In [None]:
access_data = data_imls_df.merge(county_bbn_df, how='left')

print(access_data.info())

file_name = 'broadband_access.csv'
#access_data.to_csv(file_name,index=False)

# the to_csv line above is commented out; uncomment it for this message to be accurate
print('File',file_name,'saved at',time.asctime( time.localtime(time.time()) ))

<a id="conclusion_tag"></a>
## Concluding Comments and Observations ##

Hopefully this dataset will be interesting for other people,
and this notebook has enough documentation to explain
how it was created.

The user will still have to do some data cleaning before
proceeding with an analysis. For example, some rows are missing
county unemployment numbers. The cases where BroadbandNow
did not have information for a particular county will also
need to be dealt with.

I mentioned earlier that the population numbers provided by
BroadbandNow don't seem to match the census numbers I could
find. I don't know if that is due to missing zip codes, sample
weighting, or if something else went into the methodology.
The values are close enough to the 2019 census estimates
that I assume they are useful in calculating
averages and other weighted values, even though I will
be using the census estimates for the county population.

This set also includes the numbers for BroadbandNow that were
in the original IMLS set (number of providers, % population coverage,
lowest price). These are kept in case more investigation is desired.
For analysis, they will probably be removed, and the newer BBN values used instead.

In [None]:
temp_df = data_imls_df[['state','population']]
state1_df = temp_df.groupby(['state'])['population'].sum()
state1_df = state1_df.reset_index(name = 'pop')
temp_df = data_bbn_df[['state','population_bbn']]
state2_df = temp_df.groupby(['state'])['population_bbn'].sum()
state2_df = state2_df.reset_index(name = 'pop_bbn')
state_df = state1_df.merge(state2_df, how='left')

plt.figure(figsize=(12,4))

plt.subplot(1, 2, 1)
plt.scatter(access_data["population"], access_data["population_bbn"])
plt.plot(access_data["population"], access_data["population"], color="red")
plt.xlabel('pop_census by county')
plt.ylabel('pop_bbn by county')

plt.subplot(1, 2, 2)

plt.scatter(state_df["pop"], state_df["pop_bbn"])
plt.plot(state_df["pop"], state_df["pop"], color="red")
plt.xlabel('pop_census by state')
plt.ylabel('pop_bbn by state')

plt.tight_layout()
plt.show()

In [None]:
access_data = data_imls_df.merge(county_bbn_df, how='left')

plt.figure(figsize=(12,4))

plt.subplot(1, 3, 1)
plt.scatter(access_data["broad_num"], access_data["all25_bbn"])
plt.plot(access_data["broad_num"], access_data["broad_num"], color="red")
plt.xlabel('# of providers (IMLS set)')
plt.ylabel('# of providers (BBN set, zip code weighted)')

plt.subplot(1, 3, 2)
plt.scatter(access_data["broad_avail"], access_data["access_bbn"])
plt.plot(access_data["broad_avail"], access_data["broad_avail"], color="red")
plt.xlabel('% population bb access (IMLS set)')
plt.ylabel('% population bb access (BBN set, zip code weighted)')

plt.subplot(1, 3, 3)
plt.scatter(access_data["broad_cost"], access_data["price_bbn"])
plt.plot(access_data["broad_cost"], access_data["broad_cost"], color="red")
plt.xlabel('lowest price (IMLS set)')
plt.ylabel('lowest price (BBN set)')

plt.tight_layout()
plt.show()