# County Health Data Notebook - Sam Thoma

## An overview of my procedure...

#### I will be exploring three different aspects of the "County Health Data" dataset: Physical Inactivity by Region, Access to Exercise Opportunities by Region, and Adult Obesity by Region. I will do so by narrowing the dataset down into smaller, readable pieces.

#### These pieces are random samples, which are used to get a better general idea of how each variable typically varies by Region.

#### The final goal is to find out how the "Physical Inactivity" and "Access to Exercise" variables correlate with the most important variable, Adult Obesity.

## Steps:

##### The first step of this process is to import both numpy and pandas into the code. By convention, numpy is imported as "np" and pandas as "pd".


In [116]:
import numpy as np

In [117]:
import pandas as pd

##### The next step is to load the dataset, which in this case is "CountyHealthData_2014-2015.csv". This is done with: "rawdata=pd.read_csv(...)".

In [118]:
rawdata=pd.read_csv("CountyHealthData_2014-2015.csv")
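
##### As a minimal sketch of what read_csv does, the block below reads a tiny in-memory CSV instead of the real file. The three rows, the county names, and the name "demo" are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real file; all values are made up.
csv_text = """State,Region,County,Adult obesity
AL,South,Example County A,0.313
OH,Midwest,Example County B,0.297
OR,West,Example County C,
"""

# pd.read_csv accepts a file path or any file-like object.
demo = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN, so count the missing values per column.
missing_per_column = demo.isna().sum()
print(demo.dtypes)
print(missing_per_column)
```

##### Checking the column types and missing values right after loading is a quick way to confirm that the file was parsed as expected.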

##### The next three steps are optional. They report the sheer size, shape, and columns of the dataset, giving a better grasp of what kind of data is at hand. This is done through rawdata.shape, rawdata.size, and rawdata.columns.

In [119]:
rawdata.shape

(6109, 64)

In [120]:
rawdata.size

390976

rawdata.columns
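
##### The three attributes are related: shape gives (rows, columns), and size is the product of the two, which matches the outputs above (6109 × 64 = 390976). The sketch below shows the same relationship on a small, made-up DataFrame called "demo".

```python
import pandas as pd

# Hypothetical stand-in data; the values are made up for illustration.
demo = pd.DataFrame({
    "Region": ["South", "Midwest", "West"],
    "Physical inactivity": [0.338, 0.280, 0.236],
    "Adult obesity": [0.333, 0.297, 0.292],
})

n_rows, n_cols = demo.shape            # (number of rows, number of columns)
total_cells = demo.size                # rows multiplied by columns
column_names = demo.columns.tolist()   # column labels as a plain Python list

print(n_rows, n_cols, total_cells)
print(column_names)
```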

##### Next, we will examine the physical inactivity of the population in relation to Region. This is done by drawing a random sample of ten rows from the dataset, using rawdata.loc[:,[... , ...]].sample(...). The purpose of this step is to look for potential relationships between this variable and the main variable, adult obesity.

In [121]:
rawdata.loc[:,["Region","Physical inactivity"]].sample(n=10)

Unnamed: 0,Region,Physical inactivity
4192,South,0.338
901,South,0.252
3003,South,0.341
4053,Midwest,0.365
4235,South,0.334
1688,Midwest,0.28
5984,South,0.324
5378,South,0.292
3137,South,0.389
5886,Midwest,0.236
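
##### Because sample() picks rows at random, rerunning the cell gives different rows each time. A minimal sketch, assuming a reproducible sample is wanted: passing random_state fixes the draw. The DataFrame "demo" and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in data with 20 rows; the values are made up.
demo = pd.DataFrame({
    "Region": ["South", "Midwest", "West", "Northeast"] * 5,
    "Physical inactivity": [0.20 + 0.01 * i for i in range(20)],
})

columns = ["Region", "Physical inactivity"]

# The same random_state gives the same ten rows on every run.
first = demo.loc[:, columns].sample(n=10, random_state=0)
second = demo.loc[:, columns].sample(n=10, random_state=0)

same_rows = first.equals(second)
print(same_rows)
```

##### By default sample() draws without replacement, so no row appears twice in a sample.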


##### We will repeat this exact step for both the access to exercise opportunities and adult obesity variables.

In [122]:
rawdata.loc[:,["Region","Access to exercise opportunities"]].sample(n=10)

Unnamed: 0,Region,Access to exercise opportunities
193,South,0.096
1144,Midwest,0.69
4301,South,0.795
4588,South,0.692
3832,West,0.352
4023,Midwest,0.941
499,West,0.23
4876,South,0.229
1510,Midwest,0.399
2008,South,0.427
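
##### A ten-row sample only hints at regional differences. One possible way to summarize a variable by Region over all rows is a grouped mean, sketched below on made-up data. The DataFrame "demo" and its values are hypothetical, and this step is not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in data; the values are made up for illustration.
demo = pd.DataFrame({
    "Region": ["South", "South", "Midwest", "Midwest", "West"],
    "Access to exercise opportunities": [0.2, 0.4, 0.6, 0.8, 0.5],
})

# Group the rows by Region and average the variable within each group.
region_means = demo.groupby("Region")["Access to exercise opportunities"].mean()
print(region_means)
```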


##### Repeat the step above for the last variable, adult obesity.

In [123]:
rawdata.loc[:,["Region","Adult obesity"]].sample(n=10)

Unnamed: 0,Region,Adult obesity
3449,Midwest,0.313
3723,Northeast,0.339
2526,Midwest,0.321
4088,Midwest,0.322
5955,South,0.333
2478,Midwest,0.361
4718,Midwest,0.326
1109,Midwest,0.297
3091,South,0.315
5621,South,0.292
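
##### The three samples above are drawn independently, so they cannot be compared row by row. A minimal sketch of how the correlation between the variables could be measured directly uses DataFrame.corr() on the numeric columns. The values below are made up and were chosen to show a clear pattern, so they say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in data; the values are made up for illustration.
demo = pd.DataFrame({
    "Physical inactivity": [0.20, 0.25, 0.30, 0.35, 0.40],
    "Access to exercise opportunities": [0.90, 0.70, 0.60, 0.40, 0.30],
    "Adult obesity": [0.22, 0.27, 0.29, 0.34, 0.38],
})

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = demo.corr()

# Keep only the correlations of the other variables with adult obesity.
obesity_corr = corr_matrix["Adult obesity"].drop("Adult obesity")
print(obesity_corr)
```

##### On the real data, the same call would be applied to rawdata after selecting these three columns, since corr() works on numeric columns only.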


##### After investigation, the samples suggest a relationship between physical inactivity, access to exercise opportunities, and adult obesity, although ten-row samples on their own are only suggestive. We will now create a subset of the data to locate the areas in which a high share of adults are obese, to support this point further. We will do so by using the code, ..._subset = rawdata[rawdata[...] >= ...].copy().

In [124]:
Obesity_subset = rawdata[rawdata["Adult obesity"] >= 0.425].copy()

In [125]:
Obesity_subset

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
57,AL,South,East South Central,Bullock County,1011,1011,Insuff Data,1/1/2015,10744.0,0.23,...,,0.232,0.041,9976.0,0.167,66.0,25937,0.86,,
108,AL,South,East South Central,Greene County,1063,1063,Region 16,1/1/2014,10935.0,0.271,...,,0.225,0.056,9236.0,0.164,11.0,24592,0.873,18.27,
109,AL,South,East South Central,Greene County,1063,1063,Region 16,1/1/2015,11968.0,0.271,...,,0.206,0.043,9557.0,0.164,23.0,25864,0.878,20.8,
130,AL,South,East South Central,Lowndes County,1085,1085,Region 16,1/1/2014,15671.0,0.358,...,,0.222,0.057,9866.0,0.272,18.0,28544,0.908,22.7,0.35
131,AL,South,East South Central,Lowndes County,1085,1085,Region 16,1/1/2015,14513.0,0.358,...,,0.2,0.043,9288.0,0.272,19.0,29734,0.935,23.4,
177,AL,South,East South Central,Wilcox County,1131,1131,Insuff Data,1/1/2015,14288.0,0.305,...,,0.234,0.057,8682.0,0.155,18.0,23184,0.964,14.8,
2220,LA,South,West South Central,East Carroll Parish,22035,22035,Insuff Data,1/1/2015,12741.0,0.268,...,,0.298,0.053,13536.0,0.336,66.0,26781,0.946,20.4,
3014,MS,South,East South Central,Coahoma County,28027,28027,Region 16,1/1/2014,14788.0,0.243,...,7.9,0.256,0.065,11771.0,0.253,82.0,25662,0.911,27.92,0.325
3015,MS,South,East South Central,Coahoma County,28027,28027,Region 16,1/1/2015,15292.0,0.243,...,7.03,0.26,0.061,11554.0,0.253,87.0,27500,0.957,33.7,
3039,MS,South,East South Central,Holmes County,28051,28051,Region 16,1/1/2015,13393.0,0.238,...,,0.269,0.063,11102.0,0.329,54.0,22674,0.956,19.2,
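
##### The filtering step can be read in two parts: the comparison builds a True/False mask with one entry per row, and indexing with that mask keeps only the True rows. The sketch below shows this on a made-up DataFrame. The names "demo", "mask", "subset", and the "High obesity" column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the values are made up for illustration.
demo = pd.DataFrame({
    "County": ["A", "B", "C", "D"],
    "Adult obesity": [0.43, 0.31, np.nan, 0.425],
})

# Boolean mask: True where the obesity rate is at or above the threshold.
# Comparisons with NaN are False, so rows with missing values are dropped.
mask = demo["Adult obesity"] >= 0.425

# .copy() makes the subset independent of the original DataFrame.
subset = demo[mask].copy()
subset["High obesity"] = True  # changes only the copy, not demo

print(subset)
print("High obesity" in demo.columns)
```

##### Using .copy() means later changes to the subset do not affect the original data and do not trigger pandas warnings about modifying a slice.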


##### We now have the subset of the data that we were after, and need to export it. This is done through ...subset.to_csv("...csv", index=False).

In [126]:
Obesity_subset.to_csv("Obesity_subset.csv", index=False)
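
##### To illustrate what index=False does, the sketch below writes a small made-up subset to a temporary file and reads it back. The file name "demo_subset.csv" and the DataFrame values are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in subset; it keeps the row labels of the original data.
demo_subset = pd.DataFrame(
    {"County": ["A", "B"], "Adult obesity": [0.43, 0.45]},
    index=[57, 108],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_subset.csv")
    # index=False leaves the row labels (57, 108) out of the file.
    demo_subset.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded data has only the real columns and a fresh 0-based index.
print(reloaded.columns.tolist())
print(reloaded.index.tolist())
```

##### Without index=False, the row labels would be written as an extra unnamed first column, which shows up as "Unnamed: 0" when the file is read back.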