# Determine Comparisons Based on Population
#### Ian Mac Moore, github @zenfinity, 4/24/20
From the start, we wanted to compare MN with another State of similar population, and MSP with another metro area of similar population. We also needed State and national totals. This notebook shows how those comparison areas were determined.

## Process

### Initialize notebook with imports

In [1]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import requests
import time
from scipy.stats import linregress
import sys

sys.path.insert(0, "/Users/ianmacmoore/Desktop/ClassHomework/keys")

### County data and metro comparison
Right away we knew that the most granular data we could get was at the county level. We assumed that the sum of the Hennepin and Ramsey county populations is a decent representation of the "metro area" of Minneapolis/St Paul. Given that, we looked for another county of similar population, with the further assumption that a city is generally contained in a single county. The Twin Cities metro is unusual in that its two core cities are geographically adjacent yet administratively distinct.

So we start with county data from the Census.

In [2]:
pop_byCounty_df = pd.read_csv("./Output_Data/Census_County_Population_20200424.csv")
pop_byCounty_df


Unnamed: 0,Population,County,State,Abbreviation
0,47086.0,Washington,Mississippi,MS
1,12028.0,Perry,Mississippi,MS
2,8321.0,Choctaw,Mississippi,MS
3,23480.0,Itawamba,Mississippi,MS
4,10129.0,Carroll,Mississippi,MS
...,...,...,...,...
3137,19994.0,Carroll,Indiana,IN
3138,36378.0,Huntington,Indiana,IN
3139,24217.0,White,Indiana,IN
3140,20993.0,Jay,Indiana,IN


Determine the population of Twin Cities "metro" based on county.

In [3]:
#Get MN Counties of interest
pop_TwinCities = pop_byCounty_df.loc[(pop_byCounty_df['State']=="Minnesota")&(pop_byCounty_df['County']=="Hennepin")|(
    pop_byCounty_df['County']=="Ramsey"),:]
pop_TwinCities
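#Ramsey Co ND is still returned because & binds tighter than |: the State test applies only to Hennepin, so any county named Ramsey matches. It is filtered out in the next cell.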




Unnamed: 0,Population,County,State,Abbreviation
638,11557.0,Ramsey,North Dakota,ND
778,1235478.0,Hennepin,Minnesota,MN
815,541493.0,Ramsey,Minnesota,MN


In [4]:
pop_TwinCities = pop_TwinCities.loc[pop_TwinCities['State']=="Minnesota"]
pop_TwinCities

Unnamed: 0,Population,County,State,Abbreviation
778,1235478.0,Hennepin,Minnesota,MN
815,541493.0,Ramsey,Minnesota,MN
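
The two-step filter above works, but the same result can be had in one step. The following is a minimal sketch, assuming a small stand-in frame (`sample_df`, a name introduced here for illustration) built from rows shown in the outputs above. It uses `isin` so that `&` and `|` never have to be mixed:

```python
import pandas as pd

# Stand-in for pop_byCounty_df, built from rows shown in the outputs above.
sample_df = pd.DataFrame({
    "Population": [11557.0, 1235478.0, 541493.0, 47086.0],
    "County": ["Ramsey", "Hennepin", "Ramsey", "Washington"],
    "State": ["North Dakota", "Minnesota", "Minnesota", "Mississippi"],
    "Abbreviation": ["ND", "MN", "MN", "MS"],
})

# The State must match AND the county must be one of the two metro counties.
is_mn = sample_df["State"] == "Minnesota"
is_metro_county = sample_df["County"].isin(["Hennepin", "Ramsey"])
twin_cities_rows = sample_df.loc[is_mn & is_metro_county]
print(twin_cities_rows)
```

Naming each mask separately also makes the intended grouping explicit, which is where the original expression went wrong.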


In [5]:
TwinCities = pop_TwinCities.sum()
TwinCities

Population             1.77697e+06
County              HennepinRamsey
State           MinnesotaMinnesota
Abbreviation                  MNMN
dtype: object
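
Calling `.sum()` on the whole frame also concatenates the text columns, which is why the result above has `dtype: object`. A minimal sketch of summing only the numeric column, using the two rows shown above (the variable names here are illustrative, not the notebook's):

```python
import pandas as pd

# The two Twin Cities rows from the output above.
pop_twin_cities = pd.DataFrame({
    "Population": [1235478.0, 541493.0],
    "County": ["Hennepin", "Ramsey"],
    "State": ["Minnesota", "Minnesota"],
})

# Select the numeric column before summing so the result is a single number.
twin_cities_total = pop_twin_cities["Population"].sum()
print(twin_cities_total)
```

A plain number like this can be used directly in later arithmetic, such as a percent difference against a candidate county.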

Given that, we can look at a range of counties and determine a good comparison.

In [6]:
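#Assign the converted column back; pd.to_numeric returns a new Series and does not modify the DataFrame
pop_byCounty_df['Population'] = pd.to_numeric(pop_byCounty_df['Population'])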

pop_PotentialComparisons = pop_byCounty_df.loc[(
    pop_byCounty_df['Population']>1000000)&(
    pop_byCounty_df['Population']<2000000),:]




In [7]:
#Reassign instead of using inplace=True on a slice, which triggers SettingWithCopyWarning
pop_PotentialComparisons = pop_PotentialComparisons.sort_values('Population').reset_index()
pop_PotentialComparisons


Unnamed: 0,index,Population,County,State,Abbreviation
0,1541,1019722.0,Pima,Arizona,AZ
1,2879,1021902.0,Fulton,Georgia,GA
2,2427,1040133.0,Montgomery,Maryland,MD
3,532,1046558.0,Wake,North Carolina,NC
4,499,1054314.0,Mecklenburg,North Carolina,NC
5,1083,1120805.0,Salt Lake,Utah,UT
6,2542,1133247.0,Contra Costa,California,CA
7,1124,1143529.0,Fairfax,Virginia,VA
8,1049,1203166.0,Travis,Texas,TX
9,1750,1225561.0,Allegheny,Pennsylvania,PA


In [8]:
#ComparisonCounty = pop_PotentialComparisons.loc[pop_PotentialComparisons['County']=="Wayne",'Population'].astype(float)
#ComparisonCounty
#percentDiffCountyCompare = np.abs((ComparisonCounty-TwinCities)/TwinCities)*100
#percentDiffCountyCompare

Wayne County, MI is very close to the sum of Hennepin and Ramsey counties, so we use it for our comparison. I attempted above to calculate a percent difference to quantify "very close", but abandoned the attempt for lack of time.
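
The abandoned percent-difference calculation could look something like the sketch below. It is an illustration rather than the notebook's own method: `percent_diff` is a hypothetical helper, and because Wayne County's row is not visible in the truncated table above, the example uses two candidate rows that are shown there. One likely obstacle in the original attempt is that `TwinCities` is a Series that also holds concatenated strings, so this sketch starts from a plain number.

```python
import pandas as pd

def percent_diff(value, reference):
    """Absolute difference between value and reference, as a percent of reference."""
    return abs(value - reference) / reference * 100

# Hennepin + Ramsey populations, from the output above.
twin_cities_total = 1235478.0 + 541493.0

# Two candidate rows copied from the truncated table above. Wayne's row is not
# displayed there, so it would need to be read from the full DataFrame.
candidates = pd.DataFrame({
    "County": ["Travis", "Allegheny"],
    "Population": [1203166.0, 1225561.0],
})
candidates["PercentDiff"] = percent_diff(candidates["Population"], twin_cities_total)
print(candidates)
```

The same call would apply to Wayne County's population once that row is selected from the full county data.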

### State Comparison
We follow the same steps to determine a State to compare MN with.

In [9]:
#Do it again for States
pop_byState_df = pd.read_csv("./Output_Data/Census_State_Population_20200424.csv")
pop_byState_df

Unnamed: 0,State,Population,Abbreviation
0,Minnesota,5527358.0,MN
1,Mississippi,2988762.0,MS
2,Missouri,6090062.0,MO
3,Montana,1041732.0,MT
4,Nebraska,1904760.0,NE
5,Nevada,2922849.0,NV
6,New Hampshire,1343622.0,NH
7,New Jersey,8881845.0,NJ
8,New Mexico,2092434.0,NM
9,New York,19618453.0,NY


In [10]:
pop_PotentialComparisonsSt = pop_byState_df.loc[(
    pop_byState_df['Population']>5000000)&(
    pop_byState_df['Population']<6000000)]
pop_PotentialComparisonsSt

Unnamed: 0,State,Population,Abbreviation
0,Minnesota,5527358.0,MN
26,Wisconsin,5778394.0,WI
33,Colorado,5531141.0,CO


Having narrowed this down, we will compare with CO because its population (5,531,141) is very close to that of MN (5,527,358), and because it is not a Midwestern State.
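
To make the choice less dependent on eyeballing the table, the candidates can be ranked by how far each is from Minnesota. A minimal sketch, assuming the three rows from the output above and illustrative column and variable names:

```python
import pandas as pd

# The three States in the 5-6 million range, copied from the output above.
states = pd.DataFrame({
    "State": ["Minnesota", "Wisconsin", "Colorado"],
    "Population": [5527358.0, 5778394.0, 5531141.0],
})

mn_pop = states.loc[states["State"] == "Minnesota", "Population"].iloc[0]

# Rank every other State by its percent difference from Minnesota.
others = states[states["State"] != "Minnesota"].copy()
others["PercentDiff"] = (others["Population"] - mn_pop).abs() / mn_pop * 100
ranked = others.sort_values("PercentDiff").reset_index(drop=True)
print(ranked)
```

Ranking only captures population; the preference for a non-Midwestern State remains a separate, qualitative criterion.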

### Area/population density as parameter to validate comparison

|*Place*      | *Area (sq mi)* |
|-------------|:--------------:|
|Hennepin, MN |606.43          |
|Ramsey, MN   |170.16          |
|Wayne, MI    |672.26          |
|MINNESOTA    |86942.71        |
|MICHIGAN     |96810.22        |
|WISCONSIN    |65503.21        |
|COLORADO     |104100.32       |

This table shows all the places we would like to compare and their geographic areas, obtained manually from US Census data (see ./WorkingFiles/LandAreaLND01.xls). The areas of the counties we want to compare are reasonably close: Hennepin and Ramsey together cover 776.59 sq mi, against 672.26 sq mi for Wayne. Since the populations are also close, we conclude that the population densities are similar, which supports our choice. The population density of CO is lower than that of MN, but we are okay with that difference since it is at a macro level.
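
The density comparison can be made explicit by dividing population by area. The sketch below is illustrative and covers only the places whose populations appear in this notebook's outputs; Wayne County and Michigan are left out because their populations are not shown above. Column names are assumptions.

```python
import pandas as pd

# Populations from the outputs above, areas from the table above.
places = pd.DataFrame({
    "Place": ["Hennepin + Ramsey, MN", "MINNESOTA", "WISCONSIN", "COLORADO"],
    "Population": [1235478.0 + 541493.0, 5527358.0, 5778394.0, 5531141.0],
    "AreaSqMi": [606.43 + 170.16, 86942.71, 65503.21, 104100.32],
})

# People per square mile.
places["DensityPerSqMi"] = places["Population"] / places["AreaSqMi"]
print(places.round(1))
```

Adding Wayne County's population from the full county data would allow the same calculation for the metro comparison.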

## Summary
### Process
We use Census data obtained from the API to narrow down the options for comparing COVID-19 data, based on the population of Counties (as a stand-in for "metro") and States. Additionally, we used population density to validate the comparison.
### Interpretation
Despite the lack of a quantitative measure of "closeness" between the places being compared, we are comfortable moving forward with these:

* MSP (Hennepin/Ramsey Counties MN) vs Detroit (Wayne County MI) 
* MN vs CO 

### Further Investigation
Determine a quantitative measure for "closeness". Find an automated way to bring in geographic area data. Find a data source for metro areas, instead of relying on the county proxy.