# How to compile a new subset of HealthData.csv
## By Thomas White
### English 105

#### Purpose:
* This guide is intended to help other GitHub users learn how to pull a smaller subset out of a larger dataset in order to draw more specific conclusions about that portion of the data.

#### Experience needed?
* You'll need to know some basics of Python 3, mainly how to use the pandas package for data analysis. These two guides cover it:
  * [Using the Pandas Package for Data Analysis, part one](https://uncch.instructure.com/courses/4810/files/1959947?wrap=1)
  * [Using the Pandas Package for Data Analysis, part two](https://uncch.instructure.com/courses/4810/files/1959948?wrap=1)

#### System Requirements:
* You'll need to download Anaconda to your computer. This comes with JupyterLab, which we'll be using. We'll also be using several files and links, which will be given to you throughout this guide. 

#### What are we doing?
* In my example, I have chosen to look at *median household income by region.*

* The data file we're using, HealthData.csv, is massive, and we don't need all of it to look at two columns of data, so we can reduce it via indexing and filtering. Here is how:

#### Instructions:
1. Open Anaconda Navigator, followed by JupyterLab.
2. Click on the file icon with a plus sign on it. Name the file whatever you wish.
3. Within your new file, click on the blue plus sign icon. 
4. Then, click on the big Python3 button under 'Notebook'. This should take you to a blank page with a list of number(s) on the side.
5. Open the "HealthData.csv" file. This is listed below:
[Health Data](https://uncch.instructure.com/courses/4810/files/1951167?wrap=1)
6. Download this data file to your computer. It should be downloaded as a ".csv" file.
7. Once back on your blank Python3 page, click the upload button in the top left corner (the arrow pointing up with a horizontal line below it). Select "HealthData.csv".
8. Go back to your blank Python3 file and click on the empty horizontal box in the middle of the screen.
##### Here is where the actual coding starts!
9. Import the packages that we'll need to use with Python. This is done by writing, `import numpy as np` and `import pandas as pd` in the first box.
* Keep these commands in the same box, but put them on separate lines.
* After writing your code in each box, hold Shift and press Enter to run it.

```python
import numpy as np
import pandas as pd
```

10. Create a new box by clicking on the plus symbol with a rectangle over it. This is located on the right side of your first box. 
11. In your new box, create a DataFrame named `df` by calling `pd.read_csv` with "HealthData.csv" in round brackets. It should look like this: `df=pd.read_csv("HealthData.csv")`

```python
df=pd.read_csv("HealthData.csv")
```

##### Quick note:
* In HealthData.csv, four regions are defined: West, South, Northeast, and Midwest. Since we are looking at median household income by region, these are of interest to us. 
* But in HealthData.csv, the number of data points from each region is not equal.
* For instance, we have 2803 responses for the South region, but we only have 434 for the Northeast region.
* This unevenness is worrisome, as fewer responses for a region could cause that region's data to be inaccurate. This being the case, I randomly selected 100 data points from each region and found the median of these. Here is how:
12. Go to HealthData.csv, and find a group of over 100 consecutive data points that are all from the same region. All of the regions have at least one of these groups. Ideally, find the largest group for each region. I recommend using the "Ctrl + F" search on your computer to locate them. For example, the West region has data points from line 330 to 592. Do this for each of the four regions.
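
The region counts from the quick note, and the consecutive blocks from step 12, can also be found with pandas instead of searching by hand. The following is only a sketch: the small `df` below is a made-up stand-in for HealthData.csv, and the names `region_counts`, `run_id`, `runs`, and `longest_runs` are hypothetical helpers, not part of the original steps.

```python
import pandas as pd

# Hypothetical stand-in for HealthData.csv; the real file has far more rows.
df = pd.DataFrame({
    "Region": ["South", "South", "West", "West", "West", "South", "Northeast"],
    "Median household income": [41000, 39500, 57510, 38663, 44059, 40200, 61000],
})

# Number of rows per region (shows how uneven the regions are).
region_counts = df["Region"].value_counts()

# Give every run of consecutive rows with the same region its own id.
run_id = (df["Region"] != df["Region"].shift()).cumsum().rename("run")

# Summarize each run: region, first and last row label, and length.
runs = (
    df.assign(row=df.index)
    .groupby(run_id)
    .agg(region=("Region", "first"), start=("row", "min"),
         end=("row", "max"), length=("row", "size"))
)

# Keep the longest run for each region.
longest_runs = runs.loc[runs.groupby("region")["length"].idxmax()]

print(region_counts)
print(longest_runs)
```

The `start` and `end` values are the row labels that `df.loc` expects. Note that pandas row labels start at 0 and do not count the header line, so they may differ from the line numbers shown in a spreadsheet or text editor.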
##### More coding:
13. Repeat step 10 to make a new box. 
14. (Part one) In the new box, we have to use the `df.loc` indexer. Write that, followed by an opening square bracket and the range of consecutive data points that you've found.
14. (Part two) After this, we must select the columns of data that we want to observe from HealthData.csv. In this instance, those are Region and Median household income. Write those in square brackets after a comma: `,["Region","Median household income"]]`
14. (Part three) We must now obtain a random sample of 100 data points to use. After your brackets, write `.sample(n=100)`. All together, your line of code should look like this: `df.loc[330:592,["Region","Median household income"]].sample(n=100)`

```python
df.loc[330:592,["Region","Median household income"]].sample(n=100)
```

| | Region | Median household income |
|---|---|---|
| 562 | West | 57510 |
| 373 | West | 38663 |
| 533 | West | 44059 |
| 485 | West | 34440 |
| 453 | West | 63398 |
| ... | ... | ... |
| 524 | West | 56191 |
| 404 | West | 42552 |
| 443 | West | 91843 |
| 471 | West | 55011 |
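
Two details of this line are easy to miss: a `.loc` slice includes both endpoints, and `.sample` draws different rows each time the cell runs. The sketch below illustrates both on made-up data. The `random_state` argument is an addition for illustration and is not part of the original steps; it is one way to make the draw repeatable.

```python
import pandas as pd

# Hypothetical stand-in data: 10 rows labeled 0 through 9.
df = pd.DataFrame({
    "Region": ["West"] * 10,
    "Median household income": [30000 + 1000 * i for i in range(10)],
})

# .loc slices by label and includes BOTH endpoints: labels 2, 3, 4, 5, 6.
block = df.loc[2:6, ["Region", "Median household income"]]

# random_state fixes the random draw, so rerunning returns the same rows.
sample_a = block.sample(n=3, random_state=0)
sample_b = block.sample(n=3, random_state=0)

print(list(block.index))
print(sample_a.equals(sample_b))
```

Without `random_state`, the 100 rows you get will differ from the ones shown in this guide, which is expected.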


15. Assign the result of this line to a variable named after the region it draws data from, followed by `_subset`. That should look like this: `west_subset = df.loc[330:592,["Region","Median household income"]].sample(n=100)`

```python
west_subset = df.loc[330:592,["Region","Median household income"]].sample(n=100)
```

16. Repeat steps 13 through 15 for each region. 
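
Repeating the steps by hand works fine, but the same pattern can be written once as a loop, which also makes it easy to compute the median income mentioned in the quick note. This is a sketch under assumptions: the `df` below is a made-up stand-in for HealthData.csv, the row ranges in `region_ranges` are hypothetical (substitute the ranges you found in step 12), and `n=4` replaces `n=100` only so the toy data is large enough.

```python
import pandas as pd

# Hypothetical stand-in for HealthData.csv: two regions in consecutive blocks.
df = pd.DataFrame({
    "Region": ["West"] * 6 + ["South"] * 6,
    "Median household income": [50000, 52000, 54000, 56000, 58000, 60000,
                                40000, 41000, 42000, 43000, 44000, 45000],
})

# Hypothetical row ranges; substitute the ranges you found in step 12.
region_ranges = {"west": (0, 5), "south": (6, 11)}

subsets = {}
medians = {}
for name, (start, end) in region_ranges.items():
    # Same pattern as step 15, with n=4 instead of 100 to fit the toy data.
    subset = df.loc[start:end, ["Region", "Median household income"]].sample(n=4)
    subsets[name + "_subset"] = subset
    # Median household income of the sampled rows for this region.
    medians[name] = subset["Median household income"].median()

print(medians)
```

Each entry in `subsets` plays the role of a variable such as `west_subset`, and `medians` holds one median per region.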
##### Changing your output into a '.csv' file
17. To save this subset as a '.csv' file, we must write: `west_subset.to_csv("west_subset.csv")`

```python
west_subset.to_csv("west_subset.csv")
```
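
By default, `to_csv` also writes the row labels as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If you'd rather not keep them, passing `index=False` is one option. The sketch below uses a three-row stand-in for `west_subset` and writes to a temporary folder; both are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for west_subset; row labels mimic a random sample.
west_subset = pd.DataFrame(
    {"Region": ["West", "West", "West"],
     "Median household income": [57510, 38663, 44059]},
    index=[562, 373, 533],
)

# Write to a temporary folder so nothing in your project is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "west_subset.csv")

# index=False (optional) leaves the row labels out of the file.
west_subset.to_csv(out_path, index=False)

# Read the file back to confirm what was saved.
check = pd.read_csv(out_path)
print(list(check.columns))
print(len(check))
```

If you want to keep track of which original rows were sampled, leave `index=False` out.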

#### Final thoughts
* Congrats! Now you know how to make a larger dataset smaller. You also know how to acquire random samples from larger sets of data. These are useful skills to have when navigating this software. There are various other guides for learning these skills, so thanks for using mine.