# Overview

### County Health Data

***CountyHealthData_2014-2015.csv*** contains a wide array of county-level health data for states across the nation. By the end of this procedure, we will have narrowed the dataset down to North Carolina and to the variables that best reflect correlations among healthcare costs, income, and the number of uninsured individuals who do not obtain the healthcare services they need.

This data is available to anyone interested in issues relating to healthcare costs in North Carolina, from policymakers to researchers to the general public.

# Getting Started

### 1. Retrieving *CountyHealthData_2014-2015.csv*
Refer to the file ***CountyHealthData_2014-2015.csv*** in this repository. Download the *.csv* file and save it to a specific *working directory* (a specific folder on your desktop, e.g. "County Health Files"). This is necessary for loading, manipulating, and exporting our data.

### 2. Additional Tools: *Pandas*
The ***Pandas Package*** is needed to manipulate and filter the retrieved dataset so that it shows only the data relevant to our interest.

#### Importing Pandas Package:

The following cells import NumPy and the Pandas Package under their conventional aliases, **np** and **pd**:

In [11]:
import numpy as np

In [12]:
import pandas as pd

### 3. Creating and Defining a Dataframe

#### The Working Directory

As a reminder, note that the ***CountyHealthData_2014-2015.csv*** file retrieved from this repository must be saved in the same *working directory* as this Python Notebook.
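
If you are unsure which folder the notebook treats as its working directory, a quick check such as the following can help. This is only a convenience sketch using Python's standard **pathlib** module; the variable names (**csv_name**, **cwd**, **csv_path**) are illustrative and not part of the original procedure.

```python
from pathlib import Path

# File name taken from this guide; the check itself is only a convenience.
csv_name = "CountyHealthData_2014-2015.csv"

cwd = Path.cwd()            # folder the notebook is currently running in
csv_path = cwd / csv_name   # where pd.read_csv(csv_name) would look for the file

print("Working directory:", cwd)
if csv_path.exists():
    print("Found", csv_name)
else:
    print(csv_name, "is not in the working directory; move it there or pass a full path to pd.read_csv()")
```

If the file is reported as missing, moving it into the printed folder is usually the simplest fix.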

#### Defining CountyHealthData_2014-2015.csv as "df":

Use the **pd.read_csv()** function as shown below to load the file into a dataframe named **df**. Working with this short name will simplify the filtering steps that follow.

In [9]:
df = pd.read_csv("CountyHealthData_2014-2015.csv")

Now that ***CountyHealthData_2014-2015.csv*** is defined as **"df"**, run **df** as shown below and the dataset will appear. Because the dataset is large, Pandas abbreviates the output: "..." stands in for the rows and columns that are not displayed.

In [10]:
df

|  | State | Region | Division | County | FIPS | GEOID | SMS Region | Year | Premature death | Poor or fair health | ... | Drug poisoning deaths | Uninsured adults | Uninsured children | Health care costs | Could not see doctor due to cost | Other primary care providers | Median household income | Children eligible for free lunch | Homicide rate | Inadequate social support |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AK | West | Pacific | Aleutians West Census Area | 2016 | 2016 | Insuff Data | 1/1/2014 |  | 0.122 | ... |  | 0.374 | 0.250 | 3791.0 | 0.185 | 216.0 | 69192 | 0.127 |  | 0.287 |
| 1 | AK | West | Pacific | Aleutians West Census Area | 2016 | 2016 | Insuff Data | 1/1/2015 |  | 0.122 | ... |  | 0.314 | 0.176 | 4837.0 | 0.185 | 254.0 | 74088 | 0.133 |  |  |
| 2 | AK | West | Pacific | Anchorage Borough | 2020 | 2020 | Region 22 | 1/1/2014 | 6827.0 | 0.125 | ... | 15.37 | 0.218 | 0.096 | 6588.0 | 0.119 | 135.0 | 71094 | 0.319 | 6.29 | 0.160 |
| 3 | AK | West | Pacific | Anchorage Borough | 2020 | 2020 | Region 22 | 1/1/2015 | 6856.0 | 0.125 | ... | 17.08 | 0.227 | 0.123 | 6582.0 | 0.119 | 148.0 | 76362 | 0.334 | 5.60 |  |
| 4 | AK | West | Pacific | Bethel Census Area | 2050 | 2050 | Insuff Data | 1/1/2014 | 13345.0 | 0.211 | ... |  | 0.394 | 0.124 | 5860.0 | 0.200 | 169.0 | 41722 | 0.668 | 12.77 | 0.477 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6104 | WY | West | Mountain | Uinta County | 56041 | 56041 | Insuff Data | 1/1/2015 | 7436.0 | 0.135 | ... | 18.66 | 0.192 | 0.090 | 7600.0 | 0.123 | 47.0 | 60953 | 0.273 |  |  |
| 6105 | WY | West | Mountain | Washakie County | 56043 | 56043 | Insuff Data | 1/1/2014 | 6580.0 | 0.106 | ... |  | 0.225 | 0.086 | 8202.0 | 0.099 | 47.0 | 49533 | 0.328 |  | 0.133 |
| 6106 | WY | West | Mountain | Washakie County | 56043 | 56043 | Insuff Data | 1/1/2015 | 7572.0 | 0.106 | ... |  | 0.226 | 0.101 | 7940.0 | 0.099 | 47.0 | 50740 | 0.309 |  |  |
| 6107 | WY | West | Mountain | Weston County | 56045 | 56045 | Insuff Data | 1/1/2014 | 5633.0 | 0.162 | ... |  | 0.201 | 0.084 | 6906.0 | 0.130 | 28.0 | 53665 | 0.232 |  | 0.171 |
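
Because the displayed output hides most rows and columns behind "...", a quick inspection can help confirm what was loaded. The sketch below uses a small stand-in dataframe, **sample** (a hypothetical name), built from a few values shown in the output above so that it runs on its own. With the real file loaded, the same calls can be applied to **df** directly.

```python
import pandas as pd

# Stand-in for df: three rows and five columns copied from the output above.
sample = pd.DataFrame(
    {
        "State": ["AK", "AK", "WY"],
        "County": ["Aleutians West Census Area", "Anchorage Borough", "Uinta County"],
        "Year": ["1/1/2014", "1/1/2014", "1/1/2015"],
        "Health care costs": [3791.0, 6588.0, 7600.0],
        "Median household income": [69192, 71094, 60953],
    }
)

print(sample.shape)              # (number of rows, number of columns)
print(sample.columns.tolist())   # every column name, including any hidden by "..."
print(sample["State"].unique())  # distinct state codes present in the data
```

Checking **df.columns.tolist()** on the full dataset is a convenient way to find the exact spelling of column names such as "Health care costs" before filtering on them.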


# Filtering Your Dataset to Focus on North Carolina

### 1. Creating a Subset

Since we want to focus on data pertaining to North Carolina, we will create a subset of the original dataframe, **df**, using the condition **df["State"] == "NC"**, which keeps only the rows whose State value is "NC", as shown below:

In [11]:
df[df["State"]=="NC"]

|  | State | Region | Division | County | FIPS | GEOID | SMS Region | Year | Premature death | Poor or fair health | ... | Drug poisoning deaths | Uninsured adults | Uninsured children | Health care costs | Could not see doctor due to cost | Other primary care providers | Median household income | Children eligible for free lunch | Homicide rate | Inadequate social support |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3243 | NC | South | South Atlantic | Alamance County | 37001 | 37001 | Region 20 | 1/1/2014 | 7123.0 | 0.192 | ... | 10.48 | 0.259 | 0.073 | 8640.0 | 0.167 | 46.0 | 41394 | 0.444 | 4.94 | 0.202 |
| 3244 | NC | South | South Atlantic | Alamance County | 37001 | 37001 | Region 20 | 1/1/2015 | 7291.0 | 0.192 | ... | 12.38 | 0.249 | 0.088 | 9050.0 | 0.167 | 56.0 | 43001 | 0.455 | 4.60 |  |
| 3245 | NC | South | South Atlantic | Alexander County | 37003 | 37003 | Region 20 | 1/1/2014 | 7974.0 | 0.178 | ... | 22.74 | 0.240 | 0.077 | 9316.0 | 0.205 | 30.0 | 39655 | 0.417 | 6.27 | 0.273 |
| 3246 | NC | South | South Atlantic | Alexander County | 37003 | 37003 | Region 20 | 1/1/2015 | 8079.0 | 0.178 | ... | 24.04 | 0.239 | 0.076 | 9242.0 | 0.205 | 32.0 | 46064 | 0.449 | 7.20 |  |
| 3247 | NC | South | South Atlantic | Alleghany County | 37005 | 37005 | Insuff Data | 1/1/2014 | 8817.0 | 0.234 | ... | 18.18 | 0.320 | 0.131 | 9585.0 | 0.210 | 55.0 | 34046 | 0.523 |  | 0.215 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3438 | NC | South | South Atlantic | Wilson County | 37195 | 37195 | Region 20 | 1/1/2015 | 8028.0 | 0.159 | ... | 7.31 | 0.262 | 0.079 | 9450.0 | 0.107 | 77.0 | 40772 | 0.556 | 9.60 |  |
| 3439 | NC | South | South Atlantic | Yadkin County | 37197 | 37197 | Region 20 | 1/1/2014 | 7893.0 | 0.207 | ... | 18.45 | 0.252 | 0.097 | 10084.0 | 0.158 | 32.0 | 40012 | 0.422 | 3.76 | 0.241 |
| 3440 | NC | South | South Atlantic | Yadkin County | 37197 | 37197 | Region 20 | 1/1/2015 | 7258.0 | 0.207 | ... | 20.21 | 0.242 | 0.094 | 10998.0 | 0.158 | 32.0 | 40998 | 0.455 |  |  |
| 3441 | NC | South | South Atlantic | Yancey County | 37199 | 37199 | Region 15 | 1/1/2014 | 6872.0 | 0.193 | ... | 20.79 | 0.268 | 0.110 | 7707.0 | 0.158 | 79.0 | 36019 | 0.477 |  | 0.176 |
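
To see why the expression above returns only North Carolina rows, it can help to look at the comparison on its own. The sketch below uses a small stand-in dataframe, **sample** (hypothetical, with row labels and values copied from the outputs above), and the illustrative names **mask** and **nc_rows**. The comparison produces a column of True/False values, and indexing with that column keeps only the True rows.

```python
import pandas as pd

# Stand-in for df: four rows copied from the outputs above, keeping their row labels.
sample = pd.DataFrame(
    {
        "State": ["AK", "NC", "NC", "WY"],
        "County": ["Anchorage Borough", "Alamance County", "Alexander County", "Uinta County"],
    },
    index=[2, 3243, 3245, 6104],
)

mask = sample["State"] == "NC"   # boolean Series: True where the state is NC
print(mask)
print(mask.sum())                # number of rows that match the condition

nc_rows = sample[mask]           # keeps only the rows where mask is True
print(nc_rows)
```

Note that the filtered rows keep their original row labels (3243 and 3245 here), which is why the North Carolina subset starts at 3243 rather than 0.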


### 2. Creating a Copy

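Before manipulating and exporting our new subset, an important step is to store it under its own name, ***NC_subset***, using the **.copy()** method as shown below. Copying makes ***NC_subset*** independent of **df**, so later changes to the subset do not affect the original dataframe: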

In [12]:
NC_subset = df[df["State"] == "NC"].copy()

In [13]:
NC_subset

|  | State | Region | Division | County | FIPS | GEOID | SMS Region | Year | Premature death | Poor or fair health | ... | Drug poisoning deaths | Uninsured adults | Uninsured children | Health care costs | Could not see doctor due to cost | Other primary care providers | Median household income | Children eligible for free lunch | Homicide rate | Inadequate social support |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3243 | NC | South | South Atlantic | Alamance County | 37001 | 37001 | Region 20 | 1/1/2014 | 7123.0 | 0.192 | ... | 10.48 | 0.259 | 0.073 | 8640.0 | 0.167 | 46.0 | 41394 | 0.444 | 4.94 | 0.202 |
| 3244 | NC | South | South Atlantic | Alamance County | 37001 | 37001 | Region 20 | 1/1/2015 | 7291.0 | 0.192 | ... | 12.38 | 0.249 | 0.088 | 9050.0 | 0.167 | 56.0 | 43001 | 0.455 | 4.60 |  |
| 3245 | NC | South | South Atlantic | Alexander County | 37003 | 37003 | Region 20 | 1/1/2014 | 7974.0 | 0.178 | ... | 22.74 | 0.240 | 0.077 | 9316.0 | 0.205 | 30.0 | 39655 | 0.417 | 6.27 | 0.273 |
| 3246 | NC | South | South Atlantic | Alexander County | 37003 | 37003 | Region 20 | 1/1/2015 | 8079.0 | 0.178 | ... | 24.04 | 0.239 | 0.076 | 9242.0 | 0.205 | 32.0 | 46064 | 0.449 | 7.20 |  |
| 3247 | NC | South | South Atlantic | Alleghany County | 37005 | 37005 | Insuff Data | 1/1/2014 | 8817.0 | 0.234 | ... | 18.18 | 0.320 | 0.131 | 9585.0 | 0.210 | 55.0 | 34046 | 0.523 |  | 0.215 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3438 | NC | South | South Atlantic | Wilson County | 37195 | 37195 | Region 20 | 1/1/2015 | 8028.0 | 0.159 | ... | 7.31 | 0.262 | 0.079 | 9450.0 | 0.107 | 77.0 | 40772 | 0.556 | 9.60 |  |
| 3439 | NC | South | South Atlantic | Yadkin County | 37197 | 37197 | Region 20 | 1/1/2014 | 7893.0 | 0.207 | ... | 18.45 | 0.252 | 0.097 | 10084.0 | 0.158 | 32.0 | 40012 | 0.422 | 3.76 | 0.241 |
| 3440 | NC | South | South Atlantic | Yadkin County | 37197 | 37197 | Region 20 | 1/1/2015 | 7258.0 | 0.207 | ... | 20.21 | 0.242 | 0.094 | 10998.0 | 0.158 | 32.0 | 40998 | 0.455 |  |  |
| 3441 | NC | South | South Atlantic | Yancey County | 37199 | 37199 | Region 15 | 1/1/2014 | 6872.0 | 0.193 | ... | 20.79 | 0.268 | 0.110 | 7707.0 | 0.158 | 79.0 | 36019 | 0.477 |  | 0.176 |
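
The following sketch illustrates what the copy provides. It uses a small stand-in dataframe, **sample** (hypothetical, with values copied from the outputs above), and adds an illustrative column, "Income (thousands)", which is not part of the original dataset. The new column appears in the copy only, leaving the original untouched.

```python
import pandas as pd

# Stand-in for df: three rows copied from the outputs above.
sample = pd.DataFrame(
    {"State": ["AK", "NC", "NC"], "Median household income": [71094, 41394, 39655]},
    index=[2, 3243, 3245],
)

# Same pattern as NC_subset: filter, then copy.
nc_copy = sample[sample["State"] == "NC"].copy()

# Add a column to the copy only (illustrative: income in thousands of dollars).
nc_copy["Income (thousands)"] = nc_copy["Median household income"] / 1000

print(nc_copy)
print("Income (thousands)" in sample.columns)  # the original has no such column
```

Working on an explicit copy also avoids the **SettingWithCopyWarning** that Pandas may raise when a filtered dataframe is modified directly.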


# A Step Further: Creating a Dataframe with Desired Columns

### 1. Using .loc Method to Filter Data

Now that we have created a subset from the original dataframe containing only North Carolina data, we could stop there and continue with exporting our data. However, our interest lies in finding correlations among healthcare-related variables, meaning we are interested in specific columns within the ***NC_subset*** dataset.

Using the **df.loc** method, we will filter the North Carolina dataset to show our desired columns. 

- Note that when we created the ***NC_subset*** in the section above, the output shows the range of row labels where NC data is found, ***[3243:3442]***. This is useful for the **df.loc** method shown below, as we must identify the range of rows to select. Unlike ordinary Python slices, a **df.loc** label range includes both endpoints, so row 3442 is selected as well.

The **df.loc** method takes the row selection first (here, a *range* of row labels) and the column *names* second:

In [28]:
df.loc[3243:3442,["State", "County", "Year", "Unemployment", "Uninsured", "Uninsured children", "Uninsured adults", "Health care costs", "Could not see doctor due to cost", "Median household income"]]

|  | State | County | Year | Unemployment | Uninsured | Uninsured children | Uninsured adults | Health care costs | Could not see doctor due to cost | Median household income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3243 | NC | Alamance County | 1/1/2014 | 0.094 | 0.206 | 0.073 | 0.259 | 8640.0 | 0.167 | 41394 |
| 3244 | NC | Alamance County | 1/1/2015 | 0.080 | 0.203 | 0.088 | 0.249 | 9050.0 | 0.167 | 43001 |
| 3245 | NC | Alexander County | 1/1/2014 | 0.101 | 0.195 | 0.077 | 0.240 | 9316.0 | 0.205 | 39655 |
| 3246 | NC | Alexander County | 1/1/2015 | 0.078 | 0.194 | 0.076 | 0.239 | 9242.0 | 0.205 | 46064 |
| 3247 | NC | Alleghany County | 1/1/2014 | 0.106 | 0.272 | 0.131 | 0.320 | 9585.0 | 0.210 | 34046 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3438 | NC | Wilson County | 1/1/2015 | 0.112 | 0.209 | 0.079 | 0.262 | 9450.0 | 0.107 | 40772 |
| 3439 | NC | Yadkin County | 1/1/2014 | 0.089 | 0.209 | 0.097 | 0.252 | 10084.0 | 0.158 | 40012 |
| 3440 | NC | Yadkin County | 1/1/2015 | 0.071 | 0.201 | 0.094 | 0.242 | 10998.0 | 0.158 | 40998 |
| 3441 | NC | Yancey County | 1/1/2014 | 0.111 | 0.228 | 0.110 | 0.268 | 7707.0 | 0.158 | 36019 |
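
Typing the row range by hand works, but it depends on the North Carolina rows sitting at exactly those labels. As an alternative sketch (not the approach used in this guide), **df.loc** also accepts a True/False condition in the row position, so the state filter and the column selection can be combined. The example uses a small stand-in dataframe, **sample** (hypothetical, with values copied from the outputs above), and the illustrative names **columns**, **by_range**, and **by_mask**.

```python
import pandas as pd

# Stand-in for df: five rows copied from the outputs above, keeping their row labels.
sample = pd.DataFrame(
    {
        "State": ["AK", "NC", "NC", "NC", "WY"],
        "County": ["Anchorage Borough", "Alamance County", "Alamance County",
                   "Alexander County", "Uinta County"],
        "Health care costs": [6588.0, 8640.0, 9050.0, 9316.0, 7600.0],
    },
    index=[2, 3243, 3244, 3245, 6104],
)

columns = ["County", "Health care costs"]

# Label range: .loc includes BOTH endpoints, so rows 3243, 3244 and 3245 are returned.
by_range = sample.loc[3243:3245, columns]

# Condition in the row position: no row labels need to be known in advance.
by_mask = sample.loc[sample["State"] == "NC", columns]

print(by_range)
print(by_range.equals(by_mask))  # both selections contain the same rows and columns
```

The condition-based form keeps working if the underlying file gains or loses rows, whereas a hard-coded range would need to be updated.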


### 2. Defining a New Subset

Now that we have filtered our data to only show columns we are interested in analyzing, we will create and define a new subset of this data named ***NC_subset_health***. 

In [31]:
NC_subset_health = df.loc[3243:3442,["State", "County", "Year", "Unemployment", "Uninsured", "Uninsured children", "Uninsured adults", "Health care costs", "Could not see doctor due to cost", "Median household income"]]

By assigning the **df.loc** expression to ***NC_subset_health***, we can refer to the whole selection by a single name instead of repeating the long expression. Run ***NC_subset_health*** as shown below, and our newly filtered dataset will appear.

In [32]:
NC_subset_health

|  | State | County | Year | Unemployment | Uninsured | Uninsured children | Uninsured adults | Health care costs | Could not see doctor due to cost | Median household income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3243 | NC | Alamance County | 1/1/2014 | 0.094 | 0.206 | 0.073 | 0.259 | 8640.0 | 0.167 | 41394 |
| 3244 | NC | Alamance County | 1/1/2015 | 0.080 | 0.203 | 0.088 | 0.249 | 9050.0 | 0.167 | 43001 |
| 3245 | NC | Alexander County | 1/1/2014 | 0.101 | 0.195 | 0.077 | 0.240 | 9316.0 | 0.205 | 39655 |
| 3246 | NC | Alexander County | 1/1/2015 | 0.078 | 0.194 | 0.076 | 0.239 | 9242.0 | 0.205 | 46064 |
| 3247 | NC | Alleghany County | 1/1/2014 | 0.106 | 0.272 | 0.131 | 0.320 | 9585.0 | 0.210 | 34046 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3438 | NC | Wilson County | 1/1/2015 | 0.112 | 0.209 | 0.079 | 0.262 | 9450.0 | 0.107 | 40772 |
| 3439 | NC | Yadkin County | 1/1/2014 | 0.089 | 0.209 | 0.097 | 0.252 | 10084.0 | 0.158 | 40012 |
| 3440 | NC | Yadkin County | 1/1/2015 | 0.071 | 0.201 | 0.094 | 0.242 | 10998.0 | 0.158 | 40998 |
| 3441 | NC | Yancey County | 1/1/2014 | 0.111 | 0.228 | 0.110 | 0.268 | 7707.0 | 0.158 | 36019 |


### 3. Creating a Copy of *NC_subset_health*

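Now, we create a copy of this new subset (like we did for ***NC_subset***) so that it is independent of **df** when we use and export it.

Use the **.copy()** method as shown below. Note that **.copy()** returns a new dataframe rather than changing the existing one, so the result must be assigned to a name in order to be kept:

In [33]:
NC_subset_health = NC_subset_health.copy()
NC_subset_health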

|  | State | County | Year | Unemployment | Uninsured | Uninsured children | Uninsured adults | Health care costs | Could not see doctor due to cost | Median household income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3243 | NC | Alamance County | 1/1/2014 | 0.094 | 0.206 | 0.073 | 0.259 | 8640.0 | 0.167 | 41394 |
| 3244 | NC | Alamance County | 1/1/2015 | 0.080 | 0.203 | 0.088 | 0.249 | 9050.0 | 0.167 | 43001 |
| 3245 | NC | Alexander County | 1/1/2014 | 0.101 | 0.195 | 0.077 | 0.240 | 9316.0 | 0.205 | 39655 |
| 3246 | NC | Alexander County | 1/1/2015 | 0.078 | 0.194 | 0.076 | 0.239 | 9242.0 | 0.205 | 46064 |
| 3247 | NC | Alleghany County | 1/1/2014 | 0.106 | 0.272 | 0.131 | 0.320 | 9585.0 | 0.210 | 34046 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3438 | NC | Wilson County | 1/1/2015 | 0.112 | 0.209 | 0.079 | 0.262 | 9450.0 | 0.107 | 40772 |
| 3439 | NC | Yadkin County | 1/1/2014 | 0.089 | 0.209 | 0.097 | 0.252 | 10084.0 | 0.158 | 40012 |
| 3440 | NC | Yadkin County | 1/1/2015 | 0.071 | 0.201 | 0.094 | 0.242 | 10998.0 | 0.158 | 40998 |
| 3441 | NC | Yancey County | 1/1/2014 | 0.111 | 0.228 | 0.110 | 0.268 | 7707.0 | 0.158 | 36019 |


# Exporting Your Data

### Exporting Using *.to_csv()*

Exporting your data will allow you to use and analyze this newly manipulated data as if it were an original dataset.

Recall our two subsets, ***NC_subset*** and ***NC_subset_health***.

We can export these subsets into our *working directory* using the **.to_csv** method as shown below:

In [19]:
NC_subset.to_csv("NC_subset.csv", index=False)

In [34]:
NC_subset_health.to_csv("NC_subset_health.csv", index=False)
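
The **index=False** argument tells **.to_csv** not to write the row labels (3243, 3244, ...) into the file. The sketch below shows the effect by exporting a small stand-in dataframe, **sample** (hypothetical, with values copied from the outputs above), to a temporary folder and reading it back. The file name "sample_export.csv" and the variable names are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for NC_subset_health: two rows copied from the outputs above.
sample = pd.DataFrame(
    {
        "State": ["NC", "NC"],
        "County": ["Alamance County", "Alexander County"],
        "Health care costs": [8640.0, 9316.0],
    },
    index=[3243, 3245],
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "sample_export.csv"  # hypothetical file name
    sample.to_csv(out_path, index=False)        # index=False: row labels are not written
    reloaded = pd.read_csv(out_path)            # read the exported file back in

print(reloaded.columns.tolist())  # only the data columns, no extra label column
print(reloaded.index.tolist())    # the reloaded rows are numbered from 0 again
```

Without **index=False**, the row labels would be saved as an extra unnamed column, which Pandas shows as "Unnamed: 0" when the file is read back.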

Now that we have our new data as a **.csv** file, we can use this data for analysis. 
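
As a first look at the correlations mentioned in the Overview, the **.corr()** method computes pairwise correlations between numeric columns. The sketch below is only an illustration: it uses a stand-in dataframe, **sample** (hypothetical), holding five 2014 rows copied from the ***NC_subset_health*** output above, which is far too few for any conclusion. With the exported file loaded, the same call can be applied to the full subset.

```python
import pandas as pd

# Five 2014 rows copied from the NC_subset_health output above (illustration only).
sample = pd.DataFrame(
    {
        "County": ["Alamance County", "Alexander County", "Alleghany County",
                   "Yadkin County", "Yancey County"],
        "Uninsured": [0.206, 0.195, 0.272, 0.209, 0.228],
        "Health care costs": [8640.0, 9316.0, 9585.0, 10084.0, 7707.0],
        "Median household income": [41394, 39655, 34046, 40012, 36019],
    }
)

numeric_columns = ["Uninsured", "Health care costs", "Median household income"]

# Pairwise Pearson correlations; each value lies between -1 and 1.
corr = sample[numeric_columns].corr()
print(corr)
```

Values near 1 or -1 suggest a strong linear relationship between two columns, while values near 0 suggest little linear relationship.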