# Introduction

By following these steps, you should be able to create a smaller subset of the given data set. The data set stores information about cancer surgeries at hospitals in several counties across the state of California. You will use indexing and filtering to narrow the data down to specific rows and columns, and then export the resulting subset as a csv file.

#### *System Requirements*

1. Download Anaconda Navigator onto your computer.
2. Create a folder on your computer containing the CancerDataSet.csv file. This folder will also eventually store the new subset you are going to create.
3. Launch JupyterLab through Anaconda Navigator.

#### *Getting Started*

1. Once JupyterLab is launched, open the folder you created and make sure that the data set is available to your Jupyter notebook.
2. Starting in a blank notebook cell, type: import pandas as pd.

*This imports the pandas library and lets you refer to it with the shortcut pd.*

In [4]:
import pandas as pd

# Working with the Data

Before we can start working with the data, the csv file must be read into pandas.
1. Choose a name for the data. I named it "rawdata".
2. Use the function pd.read_csv to load the csv file (CancerDataSet.csv) and assign the result to that name (rawdata).

*Be aware of spelling and capitalization when typing the name of this csv file. It must match the file name exactly.*

In [5]:
rawdata=pd.read_csv("CancerDataSet.csv")

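The pd.read_csv function reads the file into a pandas DataFrame. Because the file name is given without a folder, it is treated as a relative path, so pandas looks for the file in the notebook's working folder. From here on, the name "rawdata" refers to the loaded data, so the file name does not need to be typed again.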
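
As a minimal sketch of what pd.read_csv returns, the cell below reads a tiny made-up sample instead of the real file. The name `sample` and the three rows of text are assumptions for illustration only; they are not the contents of CancerDataSet.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file has many more rows and columns.
sample_csv = io.StringIO(
    "Year,County,Hospital,Surgery\n"
    "2013,Alameda,Alameda Hospital,Breast\n"
    "2013,Orange,Orange Coast Memorial Medical Center,Lung\n"
    "2020,Yuba,Adventist Health and Rideout,Colon\n"
)

# read_csv accepts a file name or a file-like object and returns a DataFrame.
sample = pd.read_csv(sample_csv)

print(type(sample))             # the DataFrame class
print(sample.shape)             # (number of rows, number of columns)
print(sample.columns.tolist())  # column names taken from the first line
```

Checking the shape and column names right after loading is a quick way to confirm that the file was read the way you expected.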

### *Filtering Data*

Now that "rawdata" holds the contents of "CancerDataSet.csv", it can be used in the operations we will perform next.




##### **Utilizing the .loc indexer**

1. Now that rawdata is defined, start by typing rawdata.
2. Type .loc and add an opening square bracket.
3. The first value selects the rows. To start at the row labeled 13 and keep every row after it, type 13: followed by a comma, and then open a second square bracket for the list of columns.
4. To include only the Year, County, Hospital, and Surgery columns, type each name in quotation marks, separated by commas, inside the inner bracket.
5. To finish the expression, close both brackets.

*I chose to begin at row 13 because that's where the hospital data begins, and I left out the latitude and longitude columns because I wanted to focus on more tangible locations such as the county and the hospital.*

In [9]:
rawdata.loc[13:,["Year","County","Hospital","Surgery"]]

Unnamed: 0,Year,County,Hospital,Surgery
13,2013,Alameda,Alameda Hospital,Breast
14,2013,Alameda,Alta Bates Summit Medical Center ñ Alta Bates ...,Brain
15,2013,Alameda,Alta Bates Summit Medical Center ñ Alta Bates ...,Colon
16,2013,Alameda,Alta Bates Summit Medical Center ñ Alta Bates ...,Prostate
17,2013,Alameda,Alta Bates Summit Medical Center ñ Alta Bates ...,Esophagus
...,...,...,...,...
15694,2020,Yuba,Adventist Health and Rideout,Rectum
15695,2020,Yuba,Adventist Health and Rideout,Prostate
15696,2020,Yuba,Adventist Health and Rideout,Lung
15697,2020,Yuba,Adventist Health and Rideout,Colon
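
The same row-and-column selection can be tried on a small table. This is only a sketch: the DataFrame `demo` and its values are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Hypothetical stand-in for rawdata, with the default row labels 0, 1, 2, 3.
demo = pd.DataFrame({
    "Year": [2013, 2013, 2014, 2014],
    "County": ["Alameda", "Orange", "Orange", "Yuba"],
    "Hospital": ["A", "B", "C", "D"],
    "Surgery": ["Breast", "Lung", "Colon", "Brain"],
    "LATITUDE": [37.7, 33.7, 33.8, 39.1],
})

# Rows labeled 2 and onward, and only the three listed columns.
subset = demo.loc[2:, ["Year", "County", "Surgery"]]
print(subset)

# .loc slices by label and includes both endpoints,
# so 1:2 returns the rows labeled 1 and 2.
both_ends = demo.loc[1:2, ["County"]]
print(len(both_ends))
```

Note that .loc works with row labels rather than positions. With the default numbering the two happen to match, which is why 13: starts at the row shown as 13 in the output above.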


##### **Utilizing logical conditions**


To keep only the rows that have a specific value in a specific column, we can filter the data with a logical condition.

Here, I am going to filter the data to include only the rows that contain "Orange" under the "County" column. That way, I will create a subset that shows only surgeries that took place in Orange County.

To do this, we place a condition inside the square brackets that follow rawdata.

1. Type rawdata, followed by an opening square bracket, and then rawdata again.
2. Next, complete the condition by typing ["County"]=="Orange"]

The inner expression, rawdata["County"]=="Orange", is evaluated as True or False for every row, and the outer brackets keep only the rows where it is True.



In [6]:
rawdata[rawdata["County"]=="Orange"]

Unnamed: 0,Year,County,Hospital,OSHPDID,Surgery,# of Cases (ICD 9),# of Cases (ICD 10),LONGITUDE,LATITUDE
909,2013,Orange,Orange Coast Memorial Medical Center,106300225.0,Rectum,22.0,,-117.95524,33.70162
910,2013,Orange,Orange Coast Memorial Medical Center,106300225.0,Prostate,24.0,,-117.95524,33.70162
911,2013,Orange,Orange Coast Memorial Medical Center,106300225.0,Pancreas,13.0,,-117.95524,33.70162
912,2013,Orange,Orange Coast Memorial Medical Center,106300225.0,Lung,26.0,,-117.95524,33.70162
913,2013,Orange,Orange Coast Memorial Medical Center,106300225.0,Liver,33.0,,-117.95524,33.70162
...,...,...,...,...,...,...,...,...,...
14810,2020,Orange,Kaiser Foundation Hospital ñ Orange County ñ A...,106304409.0,Esophagus,,6.0,-117.84399,33.85442
14811,2020,Orange,Kaiser Foundation Hospital ñ Orange County ñ A...,106304409.0,Colon,,78.0,-117.84399,33.85442
14812,2020,Orange,Kaiser Foundation Hospital ñ Orange County ñ A...,106304409.0,Breast,,486.0,-117.84399,33.85442
14813,2020,Orange,Kaiser Foundation Hospital ñ Orange County ñ A...,106304409.0,Brain,,48.0,-117.84399,33.85442
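
To see the two parts of this filter separately, the sketch below builds the True/False condition first and then applies it. The DataFrame `demo` and the names `mask`, `orange_rows`, and `orange_lung` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for rawdata.
demo = pd.DataFrame({
    "County": ["Alameda", "Orange", "Orange", "Yuba"],
    "Surgery": ["Breast", "Lung", "Colon", "Brain"],
})

# Step 1: the condition alone gives one True/False value per row.
mask = demo["County"] == "Orange"
print(mask.tolist())

# Step 2: placing the condition inside the brackets keeps the True rows.
orange_rows = demo[mask]
print(orange_rows)

# Two conditions can be combined with & when each is wrapped in parentheses.
orange_lung = demo[(demo["County"] == "Orange") & (demo["Surgery"] == "Lung")]
print(orange_lung)
```

Writing the condition on its own line first can make longer filters easier to read and check.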


## Making a Copy

To be able to return to the subset you created, store it under its own name as a copy.

1. Name the new subset Orange_subset.
2. Assign to it the same expression used to create the subset, like this:



In [7]:
Orange_subset=rawdata[rawdata["County"]=="Orange"].copy()

3. End the expression with .copy() so that Orange_subset is an independent copy and later changes to it do not affect rawdata.

Now, simply type the name of the subset and the data will display. 

In [8]:
Orange_subset

Unnamed: 0,Year,County,Hospital,OSHPDID,Surgery,# of Cases (ICD 9),# of Cases (ICD 10),LONGITUDE,LATITUDE
909,2013,Orange,Orange Coast Memorial Medical Center,106300225.0,Rectum,22.0,,-117.95524,33.70162
910,2013,Orange,Orange Coast Memorial Medical Center,106300225.0,Prostate,24.0,,-117.95524,33.70162
911,2013,Orange,Orange Coast Memorial Medical Center,106300225.0,Pancreas,13.0,,-117.95524,33.70162
912,2013,Orange,Orange Coast Memorial Medical Center,106300225.0,Lung,26.0,,-117.95524,33.70162
913,2013,Orange,Orange Coast Memorial Medical Center,106300225.0,Liver,33.0,,-117.95524,33.70162
...,...,...,...,...,...,...,...,...,...
14810,2020,Orange,Kaiser Foundation Hospital ñ Orange County ñ A...,106304409.0,Esophagus,,6.0,-117.84399,33.85442
14811,2020,Orange,Kaiser Foundation Hospital ñ Orange County ñ A...,106304409.0,Colon,,78.0,-117.84399,33.85442
14812,2020,Orange,Kaiser Foundation Hospital ñ Orange County ñ A...,106304409.0,Breast,,486.0,-117.84399,33.85442
14813,2020,Orange,Kaiser Foundation Hospital ñ Orange County ñ A...,106304409.0,Brain,,48.0,-117.84399,33.85442
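
The effect of .copy() can be checked on a small table. This is a sketch under assumed data: `demo`, `orange_demo`, and the "Cases" values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for rawdata.
demo = pd.DataFrame({
    "County": ["Alameda", "Orange", "Orange"],
    "Cases": [10, 22, 24],
})

# Filter and copy in one step, as with Orange_subset.
orange_demo = demo[demo["County"] == "Orange"].copy()

# Change the copy only.
orange_demo["Cases"] = orange_demo["Cases"] * 2

print(orange_demo["Cases"].tolist())  # doubled values in the copy
print(demo["Cases"].tolist())         # original values are unchanged
```

Without .copy(), some pandas versions show a SettingWithCopyWarning when a filtered result is modified, so copying explicitly keeps the intent clear.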


# Exporting the Subset

Once the subset is complete, it can be exported for others to use.

#### *Export as CSV*

Calling .to_csv() on the subset writes it to a csv file with the file name you provide.
By default, pandas also writes the row labels (the index) as an extra first column. Because these labels are only the row numbers carried over from the original data set, adding index=False leaves them out of the file.

In [9]:
Orange_subset.to_csv("Orange_subset.csv",index=False)

At this point, a new csv file should have appeared in your folder. 
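
To see what index=False changes, the sketch below compares the two forms of output. It uses a small made-up table (`demo`) and calls to_csv without a file name, which returns the csv text as a string instead of writing a file.

```python
import io
import pandas as pd

# Hypothetical subset whose row labels were carried over from a larger table.
demo = pd.DataFrame(
    {"County": ["Orange", "Orange"], "Surgery": ["Lung", "Colon"]},
    index=[909, 910],
)

with_index = demo.to_csv()                # row labels written as a first column
without_index = demo.to_csv(index=False)  # row labels left out

print(with_index.splitlines()[0])     # header line starts with an empty name
print(without_index.splitlines()[0])  # header line has only the column names

# Reading the first version back shows the extra, unnamed column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())
```

If the original row numbers are useful to the people receiving the file, leaving index=False off is also a reasonable choice.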

The new subset data is now ready for use!