#Creating a New Subset From a Dataset: County Health Care

##Instruction Overview
* The following instructions will show you how to create a new subset of data from the County Public Health Dataset using Python3.
* This is designed for people with little to no experience in working with data.

##First Steps
1. Download [this .csv file](https://uncch.instructure.com/courses/64002/files/8242988?wrap=1) to your computer.
2. Create a folder in your Google Drive with a name that is easy to remember, like "English".
3. Upload the .csv file into that folder in your Google Drive.

Open **Google Colab** and click **"New Notebook"**.

Now, you are ready to begin working with the data!

-----


#Uploading your data to Python3
First, you have to connect Python3 to your Google Drive.

- This lets Python access the data so you can start manipulating and filtering it.
- Colab will ask you for permission to connect to your Google Drive. Allow it so your data can transfer.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


To filter and manage the data, you first need to import two packages into Python3.

You can give these packages nicknames to make it easier for you to access them while working with the data.

* Import Pandas and give it the nickname "pd"
* Import numpy and give it the nickname "np"


In [2]:
import pandas as pd
import numpy as np

To actually import the data, you use **pd.read_csv()**

* When filling in the parentheses, make sure each folder name is typed exactly as it appears in your Drive, including spaces and capital letters.
* You do not need .copy() at this step. It becomes important later on, when you create subsets.
* Give the result a simple name, like dataframe, so it is easy to use later on.

In [3]:
dataframe = pd.read_csv('gdrive/My Drive/English/CountyHealthData_2014-2015.csv')
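
A mistyped folder name is the most common reason this step fails. The sketch below is one way to check the path before reading it. The helper name `load_csv` is made up for this example, and the path assumes the Drive was mounted at `/content/gdrive` with the folder called "English", as in this tutorial.

```python
from pathlib import Path

import pandas as pd


def load_csv(csv_path):
    """Hypothetical helper: read a CSV, with a clearer message if the path is wrong."""
    path = Path(csv_path)
    if not path.exists():
        raise FileNotFoundError(
            f"No file at {path}. Check every folder name, including spaces and capital letters."
        )
    return pd.read_csv(path)


# Assumed location: change the folder names to match your own Drive.
csv_path = "/content/gdrive/My Drive/English/CountyHealthData_2014-2015.csv"

try:
    loaded = load_csv(csv_path)
except FileNotFoundError as error:
    loaded = None
    print(error)
```

If the message prints, compare the path character by character with the folder names shown in your Drive.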

Now the name can be used to access the dataset.
* When you type in the name, Python3 will show you the first and last 5 rows as a sample of the data.

In [4]:
dataframe

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
0,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2014,,0.122,...,,0.374,0.250,3791.0,0.185,216.0,69192,0.127,,0.287
1,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2015,,0.122,...,,0.314,0.176,4837.0,0.185,254.0,74088,0.133,,
2,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2014,6827.0,0.125,...,15.37,0.218,0.096,6588.0,0.119,135.0,71094,0.319,6.29,0.160
3,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2015,6856.0,0.125,...,17.08,0.227,0.123,6582.0,0.119,148.0,76362,0.334,5.60,
4,AK,West,Pacific,Bethel Census Area,2050,2050,Insuff Data,1/1/2014,13345.0,0.211,...,,0.394,0.124,5860.0,0.200,169.0,41722,0.668,12.77,0.477
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6104,WY,West,Mountain,Uinta County,56041,56041,Insuff Data,1/1/2015,7436.0,0.135,...,18.66,0.192,0.090,7600.0,0.123,47.0,60953,0.273,,
6105,WY,West,Mountain,Washakie County,56043,56043,Insuff Data,1/1/2014,6580.0,0.106,...,,0.225,0.086,8202.0,0.099,47.0,49533,0.328,,0.133
6106,WY,West,Mountain,Washakie County,56043,56043,Insuff Data,1/1/2015,7572.0,0.106,...,,0.226,0.101,7940.0,0.099,47.0,50740,0.309,,
6107,WY,West,Mountain,Weston County,56045,56045,Insuff Data,1/1/2014,5633.0,0.162,...,,0.201,0.084,6906.0,0.130,28.0,53665,0.232,,0.171
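
If you only want a quick summary instead of the full preview, a few pandas attributes can help. This is a small sketch that uses a four-row stand-in table (with values copied from the output above) so it runs anywhere; on the real data you would use `dataframe` in place of `sample`.

```python
import io

import pandas as pd

# A tiny stand-in for the real file, so the example runs without Google Drive.
sample_csv = io.StringIO(
    "State,County,Year,Health care costs\n"
    "AK,Anchorage Borough,1/1/2014,6588.0\n"
    "AK,Anchorage Borough,1/1/2015,6582.0\n"
    "AZ,Apache County,1/1/2014,8968.0\n"
    "AZ,Apache County,1/1/2015,8097.0\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # exact column names, including capital letters
print(sample.head(2))        # only the first 2 rows
```

Printing the column names is a handy way to copy them exactly when you filter later on.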


---
#Creating a Subset

There are 6109 rows of data, and much of that information is irrelevant to our potential research questions. For this subset we are specifically looking at the information for **Arizona**. The following columns will be helpful with potential research questions.

* State
* County
* Year
* Health Care Costs



**Now we will filter out data so we only see these variables/columns**.


First we need to filter our data to only show information from Arizona.
 * You use a filtering command that keeps every row where dataframe["State"] == "AZ".
 * The inner statement, dataframe["State"] == "AZ", is a True/False test that Python runs on every row. It will return True 30 times because there are 30 rows with Arizona data in them.
 * Make sure to use an easy name for the subset so you can work with it later on.
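
To see what the True/False test described above looks like on its own, here is a minimal sketch on a four-row stand-in table. The names `sample`, `is_arizona`, and `arizona_rows` are only for illustration.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "State": ["AK", "AZ", "AZ", "WY"],
        "County": ["Anchorage Borough", "Apache County", "Cochise County", "Uinta County"],
    }
)

is_arizona = sample["State"] == "AZ"  # one True or False per row
print(is_arizona.tolist())            # [False, True, True, False]
print(is_arizona.sum())               # True counts as 1, so this prints 2

arizona_rows = sample[is_arizona]     # keeps only the rows marked True
print(arizona_rows)
```

On the full dataset, the same `.sum()` check is a quick way to confirm how many rows match before you build the subset.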



**It is important to add .copy() so you won't run into "SettingWithCopyWarning" later on.**
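
As a rough illustration of what .copy() gives you, the sketch below builds a subset from a small stand-in table and then changes it. Because the subset is its own copy, the original table stays the same. The table and the divide-by-1000 step are made up for the example.

```python
import pandas as pd

original = pd.DataFrame(
    {"State": ["AK", "AZ", "AZ"], "Health care costs": [6588.0, 8968.0, 8460.0]}
)

# .copy() makes the subset an independent table.
subset = original[original["State"] == "AZ"].copy()

# Changing the copy leaves the original untouched.
subset["Health care costs"] = subset["Health care costs"] / 1000

print(subset)
print(original)
```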

The following command filters the data to show just Arizona.

In [5]:
AZ_Subset = dataframe[dataframe["State"] == "AZ"].copy()

* Make sure you type the column name exactly as it is written in the data, including capital letters, so you don't get a KeyError.

Now we can test our subset and see the isolated Arizona data!

In [6]:
AZ_Subset

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
330,AZ,West,Mountain,Apache County,4001,4001,Insuff Data,1/1/2014,14039.0,0.203,...,9.83,0.26,0.141,8968.0,0.184,64.0,32886,0.245,13.11,0.366
331,AZ,West,Mountain,Apache County,4001,4001,Insuff Data,1/1/2015,14350.0,0.203,...,7.04,0.297,0.176,8097.0,0.184,64.0,30252,0.248,14.7,
332,AZ,West,Mountain,Cochise County,4003,4003,Region 8,1/1/2014,7828.0,0.187,...,13.74,0.209,0.143,8460.0,0.124,55.0,43017,0.431,5.03,0.206
333,AZ,West,Mountain,Cochise County,4003,4003,Region 8,1/1/2015,7412.0,0.187,...,14.27,0.208,0.143,8221.0,0.124,58.0,45294,0.465,5.3,
334,AZ,West,Mountain,Coconino County,4005,4005,Region 8,1/1/2014,7342.0,0.128,...,9.88,0.254,0.159,7835.0,0.128,69.0,45509,0.292,6.37,0.191
335,AZ,West,Mountain,Coconino County,4005,4005,Region 8,1/1/2015,7744.0,0.128,...,10.65,0.266,0.205,7739.0,0.128,77.0,48732,,7.2,
336,AZ,West,Mountain,Gila County,4007,4007,Insuff Data,1/1/2014,10843.0,0.214,...,16.78,0.22,0.141,10011.0,0.161,43.0,38267,0.422,10.01,0.24
337,AZ,West,Mountain,Gila County,4007,4007,Insuff Data,1/1/2015,10899.0,0.214,...,18.51,0.228,0.166,9918.0,0.161,43.0,39868,0.426,9.8,
338,AZ,West,Mountain,Graham County,4009,4009,Region 8,1/1/2014,8077.0,0.19,...,12.61,0.2,0.131,8120.0,0.091,75.0,41080,0.298,4.47,0.193
339,AZ,West,Mountain,Graham County,4009,4009,Region 8,1/1/2015,7350.0,0.19,...,16.48,0.236,0.168,7560.0,0.091,85.0,43497,,4.9,


We got the data down to Arizona, but now we need to filter it to the variables we decided on earlier for our final subset.
* To do this we can use the .loc[] format
* The .loc[] format uses a comma to separate the rows you want (before the comma) from the columns you want (after the comma)
* For our final dataset we are keeping all of the rows. A colon is used to tell Python3 that you are keeping all the rows.
* For the columns we need to filter down to State, County, Year, and Health care costs. You use this format: ["State","County","Year","Health care costs"]
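
Here is a small sketch of the .loc[] pattern described above, run on a two-row stand-in table so you can see the rows-then-columns order. The second call, which picks a single row by its index label, is an extra shown only for comparison.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "State": ["AZ", "AZ"],
        "Region": ["West", "West"],
        "County": ["Apache County", "Cochise County"],
        "Health care costs": [8968.0, 8460.0],
    }
)

# Before the comma: which rows (":" means all). After the comma: which columns.
fewer_columns = sample.loc[:, ["State", "County", "Health care costs"]]
print(fewer_columns)

# .loc can also pick rows by their index label; here, only the row labelled 0.
first_row = sample.loc[[0], ["County", "Health care costs"]]
print(first_row)
```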


**Remember to type the variables exactly as they are in the dataset so Python3 knows what you want to filter to.**

We are going to name this subset "Final_subset" so we know it is our last one.

* The following command creates the final subset.

In [7]:
Final_subset = AZ_Subset.loc[:,["State","County","Year","Health care costs"]].copy()

It is important that you use the AZ_Subset that we created earlier so it only takes that Arizona data.
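
Once both steps feel comfortable, they can also be combined, because .loc[] accepts a True/False test in the row position. This is an optional alternative, not a replacement for the two-step method in this tutorial. It is sketched on a three-row stand-in table with made-up variable names.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "State": ["AK", "AZ", "AZ"],
        "Region": ["West", "West", "West"],
        "County": ["Anchorage Borough", "Apache County", "Cochise County"],
        "Year": ["1/1/2014", "1/1/2014", "1/1/2014"],
        "Health care costs": [6588.0, 8968.0, 8460.0],
    }
)

wanted_columns = ["State", "County", "Year", "Health care costs"]

# Row filter before the comma, column list after it.
one_step = sample.loc[sample["State"] == "AZ", wanted_columns].copy()

# The two-step route from this tutorial, for comparison.
az_only = sample[sample["State"] == "AZ"].copy()
two_step = az_only.loc[:, wanted_columns].copy()

print(one_step.equals(two_step))  # True
```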

Now let's look at our final subset!

In [8]:
Final_subset

Unnamed: 0,State,County,Year,Health care costs
330,AZ,Apache County,1/1/2014,8968.0
331,AZ,Apache County,1/1/2015,8097.0
332,AZ,Cochise County,1/1/2014,8460.0
333,AZ,Cochise County,1/1/2015,8221.0
334,AZ,Coconino County,1/1/2014,7835.0
335,AZ,Coconino County,1/1/2015,7739.0
336,AZ,Gila County,1/1/2014,10011.0
337,AZ,Gila County,1/1/2015,9918.0
338,AZ,Graham County,1/1/2014,8120.0
339,AZ,Graham County,1/1/2015,7560.0


Now that we have our final subset, it is time to export it!

#Exporting the Subset

To export our data we will use .to_csv(), putting the filename and extension inside the parentheses.

* The dataframe we are using is our final subset, Final_subset.
* It is also important to include index=False to leave out the column of index numbers that pandas creates. index=False tells pandas not to write these index numbers to the file.
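
To see the difference index=False makes, the sketch below exports a two-row stand-in table both ways. It relies on the fact that .to_csv() returns the CSV text when no filename is given, which makes the two versions easy to compare side by side.

```python
import pandas as pd

subset = pd.DataFrame(
    {"County": ["Apache County", "Cochise County"], "Health care costs": [8968.0, 8460.0]},
    index=[330, 332],
)

# With no filename, to_csv() returns the CSV text instead of writing a file.
with_index = subset.to_csv()
without_index = subset.to_csv(index=False)

print(with_index)     # every line starts with an index number
print(without_index)  # only the County and Health care costs columns
```

In the first version the header line starts with an empty name followed by a comma, which is the extra unnamed column you would see if you opened the file later.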

The following command exports our new subset.


In [11]:
Final_subset.to_csv("Final_subset.csv", index=False)

* It is important to capitalize the F in index=False. Python does not recognize lowercase false, so you would get an error when exporting your subset.
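
One way to confirm the export worked is to read the new file back in and compare it with what you saved. This is a minimal sketch on a two-row stand-in table. It writes to a temporary folder (an assumption made so the example leaves no files behind); in your notebook you would simply read "Final_subset.csv".

```python
import tempfile
from pathlib import Path

import pandas as pd

subset = pd.DataFrame(
    {
        "State": ["AZ", "AZ"],
        "County": ["Apache County", "Cochise County"],
        "Year": ["1/1/2014", "1/1/2014"],
        "Health care costs": [8968.0, 8460.0],
    },
    index=[330, 332],
)

# A temporary folder keeps this example from leaving files behind.
with tempfile.TemporaryDirectory() as folder:
    out_path = Path(folder) / "Final_subset.csv"
    subset.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))        # the four columns, with no extra index column
print(len(reloaded) == len(subset))  # True when no rows were lost
```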

You have now successfully made a new subset from the County Public Health Data. Yay!