## Week 2:
 
New Topics:

 - Resources for help with coding 
 - Creating a reproducible workflow
 - Merging _pandas_ DataFrames
 - Additional _pandas_ methods 
 

Coding Tasks:

Start a new Jupyter Notebook to complete these tasks. This week, you'll be combining two different datasets.

First, you'll work with a dataset containing the number of primary care physicians per county for each county in the United States. It was obtained from the Area Health Resources File, published by the [Health Resources and Services Administration](https://data.hrsa.gov/topics/health-workforce/ahrf). This data is contained in the file `primary_care_physicians.csv`.

Second, the file `population_by_county.csv` contains the Census Bureau's 2019 population estimates for each US county. It also contains a column `urban`, which uses data from the National Bureau of Economic Research to classify each county as either urban or rural. The U.S. Office of Management and Budget designates counties as metropolitan (a core urban area of 50,000 or more population), micropolitan (an urban core of at least 10,000 but less than 50,000 population), or neither. Here, a county is considered "urban" if it is part of a metropolitan or micropolitan area and "rural" if it is not.
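
The urban/rural rule described above can be sketched as a small function. The function name and the designation strings are assumptions made for illustration only; the provided file already contains the finished `urban` column.

```python
def classify_county(omb_designation):
    """Map an OMB designation to the 'Urban'/'Rural' labels used in this exercise.

    The input strings ('metropolitan', 'micropolitan', 'neither') are
    hypothetical; the real source data may encode the designation differently.
    """
    if omb_designation in ("metropolitan", "micropolitan"):
        return "Urban"
    return "Rural"


print(classify_county("micropolitan"))  # Urban
print(classify_county("neither"))       # Rural
```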

1. First, import the primary care physicians dataset (`primary_care_physicians.csv`) into a DataFrame named `physicians`.

In [4]:
import pandas as pd
physicians = pd.read_csv('../data/primary_care_physicians.csv')
physicians

Unnamed: 0,FIPS,state,county,primary_care_physicians
0,1001,Alabama,Autauga,26.0
1,1003,Alabama,Baldwin,153.0
2,1005,Alabama,Barbour,8.0
3,1007,Alabama,Bibb,12.0
4,1009,Alabama,Blount,12.0
...,...,...,...,...
3225,72151,Puerto Rico,Yabucoa,5.0
3226,72153,Puerto Rico,Yauco,43.0
3227,78010,US Virgin Islands,St. Croix,14.0
3228,78020,US Virgin Islands,St. John,1.0
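
County FIPS codes are five-digit identifiers, so codes for states such as Alabama lose their leading zero when pandas reads them as integers (1001 rather than 01001). If the codes are needed as text, one option is to read the column as a string and pad it. This sketch uses a small inline CSV built from rows in this notebook's outputs rather than the real file; whether the exercise needs padded codes at all is an assumption.

```python
import io
import pandas as pd

# Inline stand-in for primary_care_physicians.csv (rows taken from this notebook's outputs).
csv_text = """FIPS,state,county,primary_care_physicians
1001,Alabama,Autauga,26.0
1003,Alabama,Baldwin,153.0
47001,Tennessee,Anderson,39.0
"""

# Read FIPS as text, then left-pad to the standard five digits.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"FIPS": str})
sample["FIPS"] = sample["FIPS"].str.zfill(5)
print(sample["FIPS"].tolist())  # ['01001', '01003', '47001']
```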


2. Filter `physicians` down to just the counties in Tennessee. Save the filtered DataFrame back to `physicians`. Verify that the resulting DataFrame has 95 rows.

In [5]:
physicians = physicians.loc[physicians['state'] == 'Tennessee']
physicians.shape

(95, 4)
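
One way to make the row-count check fail loudly, instead of relying on reading the printed shape, is an `assert`. The sketch below uses a tiny stand-in frame and an expected count of 2 in place of the real data and its 95 counties; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for the physicians data (rows taken from this notebook's outputs).
toy = pd.DataFrame({
    "state": ["Alabama", "Tennessee", "Tennessee"],
    "county": ["Autauga", "Anderson", "Bedford"],
    "primary_care_physicians": [26.0, 39.0, 15.0],
})

# Boolean-mask filter; .copy() gives an independent frame so later column
# assignments do not trigger SettingWithCopyWarning.
tn = toy.loc[toy["state"] == "Tennessee"].copy()

expected_rows = 2  # would be 95 for the real Tennessee data
assert tn.shape[0] == expected_rows, f"expected {expected_rows} rows, got {tn.shape[0]}"
```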

3. Look at the distribution of the number of primary care physicians. What do you notice?

In [28]:
physicians['primary_care_physicians'].describe()

count    636.000000
mean      60.119497
std      133.713461
min        0.000000
25%        4.000000
50%       15.000000
75%       38.000000
max      806.000000
Name: primary_care_physicians, dtype: float64
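
One thing worth checking in a summary like this is how far the mean sits above the median, which suggests a right-skewed distribution. A minimal sketch on made-up values (not the real county data):

```python
import pandas as pd

# Made-up physician counts: mostly small values plus one very large one.
pcp_counts = pd.Series([0, 2, 4, 8, 15, 20, 38, 806], name="primary_care_physicians")

mean = pcp_counts.mean()
median = pcp_counts.median()
print(f"mean={mean:.1f}, median={median:.1f}")

# A mean well above the median, together with positive skewness,
# points to a long right tail.
print("right-skewed:", mean > median and pcp_counts.skew() > 0)
```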

4. Now, import the population by county dataset (`population_by_county.csv`) into a DataFrame named `population`.

In [29]:
population = pd.read_csv('../data/population_by_county.csv')
population.head()

Unnamed: 0,FIPS,population,county,state,urban
0,17051,21565,Fayette County,ILLINOIS,Rural
1,17107,29003,Logan County,ILLINOIS,Rural
2,17165,23994,Saline County,ILLINOIS,Rural
3,17097,701473,Lake County,ILLINOIS,Urban
4,17127,14219,Massac County,ILLINOIS,Rural


5. Merge the `physicians` DataFrame with the `population` DataFrame. Keep only the values for Tennessee. When you merge, be sure to include both the `population` and `urban` columns in the merged results. Save the result of the merge back to `physicians`.

In [67]:
# Strip the ' County' suffix so county names match the physicians data
population['county'] = population['county'].str.split(' County', expand=True)[0]
population

Unnamed: 0,FIPS,population,county,state,urban
283,47165,183437,Sumner,TENNESSEE,Urban
284,47169,10231,Trousdale,TENNESSEE,Urban
285,47027,7654,Clay,TENNESSEE,Rural
405,47157,936374,Shelby,TENNESSEE,Urban
406,47077,27977,Henderson,TENNESSEE,Rural
...,...,...,...,...,...
3195,47123,46064,Monroe,TENNESSEE,Rural
3196,47079,32284,Henry,TENNESSEE,Rural
3197,47033,14399,Crockett,TENNESSEE,Rural
3198,47095,7401,Lake,TENNESSEE,Rural
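
An alternative to `str.split` is an anchored regular expression, which removes ` County` only when it ends the name. This is a sketch on a few names, one of them invented to show where the two approaches differ; it is not the notebook's original approach.

```python
import pandas as pd

# The last name is invented so that ' County' also appears mid-string.
names = pd.Series(["Sumner County", "Lake County", "Tri County Area County"])

# Split approach from the cell above: keeps everything before the first ' County'.
via_split = names.str.split(" County", expand=True)[0]

# Anchored regex: strips ' County' only at the end of the string.
via_regex = names.str.replace(r" County$", "", regex=True)

print(via_split.tolist())  # ['Sumner', 'Lake', 'Tri']
print(via_regex.tolist())  # ['Sumner', 'Lake', 'Tri County Area']
```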


In [57]:
# Other option: population['county'].str.rsplit(' ', n=1, expand=True)

Unnamed: 0,0
283,Sumner
284,Trousdale
285,Clay
405,Shelby
406,Henderson
...,...
3195,Monroe
3196,Henry
3197,Crockett
3198,Lake


In [70]:
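# Keep only Tennessee rows (state names are upper-case in this file)
population = population.loc[population['state'] == 'TENNESSEE']
# Join on county name; only the population and urban columns are brought over
physicians = pd.merge(left=physicians,
                      right=population[['county', 'population', 'urban']],
                      on='county')
physicians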

Unnamed: 0,FIPS,state,county,primary_care_physicians,population_x,urban_x,population_y,urban_y,population_x.1,urban_x.1,population_y.1,urban_y.1,population,urban
0,47001,Tennessee,Anderson,39.0,76061,Urban,76061,Urban,76061,Urban,76061,Urban,76061,Urban
1,47003,Tennessee,Bedford,15.0,48292,Rural,48292,Rural,48292,Rural,48292,Rural,48292,Rural
2,47005,Tennessee,Benton,3.0,16140,Rural,16140,Rural,16140,Rural,16140,Rural,16140,Rural
3,47007,Tennessee,Bledsoe,1.0,14836,Rural,14836,Rural,14836,Rural,14836,Rural,14836,Rural
4,47009,Tennessee,Blount,90.0,129927,Urban,129927,Urban,129927,Urban,129927,Urban,129927,Urban
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,47181,Tennessee,Wayne,5.0,16693,Rural,16693,Rural,16693,Rural,16693,Rural,16693,Rural
90,47183,Tennessee,Weakley,18.0,33510,Rural,33510,Rural,33510,Rural,33510,Rural,33510,Rural
91,47185,Tennessee,White,9.0,26800,Rural,26800,Rural,26800,Rural,26800,Rural,26800,Rural
92,47187,Tennessee,Williamson,338.0,225389,Urban,225389,Urban,225389,Urban,225389,Urban,225389,Urban
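
When merging on a name column, spelling differences can silently drop rows. One way to check is an outer merge with `indicator=True`, which adds a `_merge` column showing where each row came from. The frames below are toy data with a deliberately mismatched, invented county name; they are not the real files.

```python
import pandas as pd

left = pd.DataFrame({
    "county": ["Anderson", "Bedford", "Smithville"],   # 'Smithville' is invented
    "primary_care_physicians": [39.0, 15.0, 4.0],
})
right = pd.DataFrame({
    "county": ["Anderson", "Bedford", "Smith Ville"],  # mismatched spelling on purpose
    "population": [76061, 48292, 20000],
    "urban": ["Urban", "Rural", "Rural"],
})

# validate="one_to_one" raises if a county name is duplicated on either side.
checked = pd.merge(left, right, on="county", how="outer",
                   indicator=True, validate="one_to_one")

# Rows that did not find a partner on the other side.
unmatched = checked.loc[checked["_merge"] != "both", ["county", "_merge"]]
print(unmatched)
```

An inner merge on the same frames would keep only the two matching rows, so comparing row counts before and after a merge is a cheap first check.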


In [63]:
# Alternative: merge on the FIPS code instead of the county name
# physicians = pd.merge(left=physicians,
#                       right=population.loc[population['state'] == 'TENNESSEE'][['FIPS', 'population', 'urban']],
#                       on='FIPS')

Unnamed: 0,FIPS,state,county,primary_care_physicians,population_x,urban_x,population_y,urban_y,population_x.1,urban_x.1,population_y.1,urban_y.1
0,47001,Tennessee,Anderson,39.0,76061,Urban,76061,Urban,76061,Urban,76061,Urban
1,47003,Tennessee,Bedford,15.0,48292,Rural,48292,Rural,48292,Rural,48292,Rural
2,47005,Tennessee,Benton,3.0,16140,Rural,16140,Rural,16140,Rural,16140,Rural
3,47007,Tennessee,Bledsoe,1.0,14836,Rural,14836,Rural,14836,Rural,14836,Rural
4,47009,Tennessee,Blount,90.0,129927,Urban,129927,Urban,129927,Urban,129927,Urban
...,...,...,...,...,...,...,...,...,...,...,...,...
89,47181,Tennessee,Wayne,5.0,16693,Rural,16693,Rural,16693,Rural,16693,Rural
90,47183,Tennessee,Weakley,18.0,33510,Rural,33510,Rural,33510,Rural,33510,Rural
91,47185,Tennessee,White,9.0,26800,Rural,26800,Rural,26800,Rural,26800,Rural
92,47187,Tennessee,Williamson,338.0,225389,Urban,225389,Urban,225389,Urban,225389,Urban
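
The `_x`/`_y` columns in output like this are what `pd.merge` adds when both frames already contain a column of the same name that is not a join key, which can happen when a merge cell that overwrites its own input is run more than once. A sketch of one way to keep such a cell safe to re-run, using toy data and the hypothetical names `merge_population` and `physicians_merged`:

```python
import pandas as pd

physicians_toy = pd.DataFrame({"FIPS": [47001, 47003],
                               "county": ["Anderson", "Bedford"],
                               "primary_care_physicians": [39.0, 15.0]})
population_toy = pd.DataFrame({"FIPS": [47001, 47003],
                               "population": [76061, 48292],
                               "urban": ["Urban", "Rural"]})


def merge_population(phys, pop):
    """Return a new merged frame; the inputs are left unchanged."""
    return pd.merge(phys, pop[["FIPS", "population", "urban"]], on="FIPS", how="inner")


# Running this twice gives the same columns because physicians_toy is never overwritten.
physicians_merged = merge_population(physicians_toy, population_toy)
physicians_merged = merge_population(physicians_toy, population_toy)
print(physicians_merged.columns.tolist())
```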


6. How many Tennessee counties are considered urban?

In [66]:
physicians['urban'].value_counts()

Rural    56
Urban    38
Name: urban, dtype: int64

7. The [State Health Access Data Assistance Center (SHADAC)](https://www.shadac.org/) classifies counties into three groups based on the number of residents per primary care physician. Counties with fewer than 1500 residents per primary care physician are considered to have an "adequate" supply. Counties with at least 1500 but fewer than 3500 residents per primary care physician are considered to have a "moderately inadequate" supply, and counties with at least 3500 residents per primary care physician are considered to have a "low inadequate" supply. How many counties in Tennessee are in each group?
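
The three groups above map naturally onto `pd.cut` with left-closed bins. This is a sketch on made-up ratios chosen to sit on and around the cut points, not the real data. One caveat, as far as I can tell: a county with zero physicians gives an infinite ratio, which the half-open last bin would not capture, so such rows would need separate handling.

```python
import numpy as np
import pandas as pd

# Made-up ratios chosen to sit on and around the cut points.
residents_per_pcp = pd.Series([666.8, 1499.9, 1500.0, 3499.0, 3500.0, 14836.0])

supply = pd.cut(
    residents_per_pcp,
    bins=[0, 1500, 3500, np.inf],
    labels=["adequate", "moderately inadequate", "low inadequate"],
    right=False,  # intervals are [0, 1500), [1500, 3500), [3500, inf)
)
print(supply.value_counts(sort=False))
```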

In [72]:
res_per_pcp = physicians['population'] / physicians['primary_care_physicians']
res_per_pcp

0      1950.282051
1      3219.466667
2      5380.000000
3     14836.000000
4      1443.633333
          ...     
89     3338.600000
90     1861.666667
91     2977.777778
92      666.831361
93     3178.279070
Length: 94, dtype: float64

In [80]:
# Add the residents-per-physician ratio to the DataFrame as a new column
physicians['residents_per_pcp'] = physicians['population'] / physicians['primary_care_physicians']
adequate = physicians.loc[physicians['residents_per_pcp'] < 1500]
adequate.shape

(14, 15)

In [81]:
# Residents per primary care physician, computed once and reused below
ratio = physicians['population'] / physicians['primary_care_physicians']
adequate = physicians.loc[ratio < 1500]
# "at least 1500 but fewer than 3500"
moderately_inadequate = physicians.loc[(ratio >= 1500) & (ratio < 3500)]
# "at least 3500"
low_inadequate = physicians.loc[ratio >= 3500]

In [82]:
len(adequate['county'])

14

In [83]:
len(moderately_inadequate['county'])

50

In [84]:
len(low_inadequate['county'])

30

8. Does there appear to be any detectable relationship between whether a county is urban or rural and its supply of primary care physicians?

In [85]:
adequate['urban'].value_counts()

Urban    9
Rural    5
Name: urban, dtype: int64

In [86]:
moderately_inadequate['urban'].value_counts()

Rural    31
Urban    19
Name: urban, dtype: int64

In [87]:
low_inadequate['urban'].value_counts()

Rural    20
Urban    10
Name: urban, dtype: int64
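
Raw counts are hard to compare because there are more rural than urban counties. A sketch that turns the counts from the three `value_counts()` outputs above into within-group shares; the table layout and variable names are choices made here for illustration:

```python
import pandas as pd

# Counts copied from the value_counts() outputs above.
counts = pd.DataFrame(
    {
        "adequate": [5, 9],
        "moderately inadequate": [31, 19],
        "low inadequate": [20, 10],
    },
    index=pd.Index(["Rural", "Urban"], name="urban"),
)

# Divide each row by its total so rural and urban counties are comparable.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))
```

With these counts, roughly 24% of urban counties fall in the adequate group versus roughly 9% of rural counties, although with this few counties the difference is suggestive rather than conclusive.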