# Introduction

After following these steps, you should be able to create a smaller subset of the given data set. The data set stores information about cancer surgeries at hospitals in counties across the state of California. You will use indexing and filtering to narrow the data down to specific values, and then export the resulting subset as a CSV file.

#### *System Requirements*

1. Download Anaconda Navigator onto your computer.
2. Create a folder on your computer containing the CancerDataSet.csv file. This folder will also eventually store the new subset you are going to create.
3. Launch Jupyter Lab through Anaconda Navigator.

#### *Getting Started*

1. Once Jupyter Lab is launched, open the folder you created and make sure that the data set is available to your Jupyter notebook.
2. In a blank notebook cell, type: import pandas as pd

*This imports the pandas library under the shorter name pd, so that pandas functions can be called with the prefix pd.*

In [2]:
import pandas as pd

# Working with the Data

Before we can start working with the data, the csv file must be read into pandas.
1. Choose a name for the data. I named the data "rawdata".
2. Use the function pd.read_csv to read the csv file (CancerDataSet.csv) and store its contents under the chosen name (rawdata).

*Be aware of spelling and capitalization when inputting the name of this csv file. Be sure that it matches the exact file name.*

In [3]:
rawdata=pd.read_csv("CancerDataSet.csv")
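
A quick check right after loading can confirm that the file was read as expected. The sketch below is only an illustration: it reads a tiny in-memory stand-in (the name sample_csv and its rows are made up) so that it runs without CancerDataSet.csv. With the real file, the same checks would be applied to rawdata.

```python
import io
import pandas as pd

# Hypothetical stand-in for CancerDataSet.csv; the rows are made up for illustration.
sample_csv = io.StringIO(
    "Year,County,Hospital,Surgery\n"
    "2013,Alameda,Hospital A,Breast\n"
    "2013,Alameda,Hospital B,Brain\n"
    "2020,Yuba,Hospital C,Colon\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names exactly as spelled in the file
print(sample.head())         # the first rows of the table
```

Checking the column names at this point is useful because the later steps refer to columns by their exact spelling.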

### *Filtering Data*

Now that "rawdata" holds the contents of "CancerDataSet.csv", we can refer to the data by the name "rawdata" in the steps that follow.




##### **Utilizing the .loc indexer**

1. Now that rawdata is defined, start by typing rawdata.
2. Type .loc and add an opening square bracket.
3. The first entry selects the rows. To start at the row labeled 13 and continue through all the remaining rows, type 13: followed by a comma, then open a second square bracket to list the columns.
4. To include only the Year, County, Hospital, and Surgery columns, list each name in quotation marks, separated by commas, inside the nested bracket.
5. To close the entire expression, add two closing square brackets.

*I chose to begin at row 13 because that's where the hospital data begins, and I left out the latitude and longitude columns because I wanted to focus on more tangible locations such as the county and the hospital.*

In [9]:
rawdata.loc[13:,["Year","County","Hospital","Surgery"]]

Unnamed: 0,Year,County,Hospital,Surgery
13,2013,Alameda,Alameda Hospital,Breast
14,2013,Alameda,Alta Bates Summit Medical Center ñ Alta Bates ...,Brain
15,2013,Alameda,Alta Bates Summit Medical Center ñ Alta Bates ...,Colon
16,2013,Alameda,Alta Bates Summit Medical Center ñ Alta Bates ...,Prostate
17,2013,Alameda,Alta Bates Summit Medical Center ñ Alta Bates ...,Esophagus
...,...,...,...,...
15694,2020,Yuba,Adventist Health and Rideout,Rectum
15695,2020,Yuba,Adventist Health and Rideout,Prostate
15696,2020,Yuba,Adventist Health and Rideout,Lung
15697,2020,Yuba,Adventist Health and Rideout,Colon
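
The row and column selection described above can be tried on a small table first. The sketch below is a minimal illustration, assuming a made-up DataFrame named toy; only the column names follow the data set. It also shows that .loc slices by row label and includes the end label, unlike ordinary Python slicing.

```python
import pandas as pd

# Made-up rows for illustration; only the column names follow the data set.
toy = pd.DataFrame({
    "Year": [2013, 2013, 2013, 2020, 2020],
    "County": [None, "Alameda", "Alameda", "Yuba", "Yuba"],
    "Hospital": ["Statewide", "Hospital A", "Hospital B", "Hospital C", "Hospital C"],
    "Surgery": ["Brain", "Breast", "Brain", "Lung", "Colon"],
    "LONGITUDE": [None, -122.2, -122.2, -121.5, -121.5],
})

# Rows labeled 1 and onward, and only the two named columns.
subset = toy.loc[1:, ["Year", "Hospital"]]
print(subset)

# With .loc, a slice such as 1:3 includes the row labeled 3.
print(toy.loc[1:3, ["Surgery"]])
```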


##### **Utilizing logical conditions**


To keep only the rows that have a specific value in a specific column, we can filter the data with a logical condition.

Here, I am going to filter the data to include only the rows that contain "Brain" under the "Surgery" column. That way, I will create a subset that shows only the data points where brain cancer surgery occurred.

1. Type rawdata, followed by an opening bracket, and rawdata again.
2. Next, add the condition ["Surgery"]=="Brain" and a closing bracket.

The condition is checked against every row, and only the rows that make the statement true are kept. The match is exact, so the spelling and capitalization must be the same as in the data.



In [17]:
rawdata[rawdata["Surgery"]=="Brain"]

Unnamed: 0,Year,County,Hospital,OSHPDID,Surgery,# of Cases (ICD 9),# of Cases (ICD 10),LONGITUDE,LATITUDE
4,2013,,Statewide,,Brain,2719.0,,,
14,2013,Alameda,Alta Bates Summit Medical Center ñ Alta Bates ...,106010739.0,Brain,8.0,,-122.257840,37.856330
28,2013,Alameda,Highland Hospital,106010846.0,Brain,1.0,,-122.231200,37.799170
70,2013,Alameda,Washington Hospital ñ Fremont,106010987.0,Brain,7.0,,-121.980060,37.558470
88,2013,Alameda,Eden Medical Center,106014233.0,Brain,27.0,,-122.087406,37.698377
...,...,...,...,...,...,...,...,...,...
15646,2020,Ventura,Community Memorial Hospital ñ San Buenaventura,106560473.0,Brain,,4.0,-119.258124,34.274589
15652,2020,Ventura,Ventura County Medical Center,106560481.0,Brain,,13.0,-119.253840,34.276930
15665,2020,Ventura,Los Robles Hospital and Medical Center,106560492.0,Brain,,16.0,-118.882411,34.207620
15678,2020,Ventura,Adventist Health Simi Valley,106560525.0,Brain,,1.0,-118.743940,34.289730
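
It can help to see the filter in two separate steps. The sketch below is an illustration on a made-up table (toy and is_brain are hypothetical names): the comparison first produces one True/False value per row, and indexing with that result keeps only the True rows.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Hospital": ["Hospital A", "Hospital B", "Hospital C", "Hospital D"],
    "Surgery": ["Brain", "Breast", "Brain", "Colon"],
})

# Step 1: the comparison yields one True/False value per row.
is_brain = toy["Surgery"] == "Brain"
print(is_brain)

# Step 2: indexing with those values keeps only the True rows.
brain_rows = toy[is_brain]
print(brain_rows)

# The same condition can be combined with .loc to pick columns as well.
print(toy.loc[is_brain, ["Hospital"]])

# The comparison is case-sensitive, so "brain" matches no rows here.
print(len(toy[toy["Surgery"] == "brain"]))
```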


## Making a Copy

To be able to return to the subset later, save a copy of it under its own name.

1. Name the new subset Brain_subset.
2. Define the new subset using the same expression that was used to filter the data, like this:



In [18]:
Brain_subset=rawdata[rawdata["Surgery"]=="Brain"].copy()

3. End the expression with .copy() to create the copy.

Now, simply type the name of the subset and the data will display. 

In [19]:
Brain_subset

Unnamed: 0,Year,County,Hospital,OSHPDID,Surgery,# of Cases (ICD 9),# of Cases (ICD 10),LONGITUDE,LATITUDE
4,2013,,Statewide,,Brain,2719.0,,,
14,2013,Alameda,Alta Bates Summit Medical Center ñ Alta Bates ...,106010739.0,Brain,8.0,,-122.257840,37.856330
28,2013,Alameda,Highland Hospital,106010846.0,Brain,1.0,,-122.231200,37.799170
70,2013,Alameda,Washington Hospital ñ Fremont,106010987.0,Brain,7.0,,-121.980060,37.558470
88,2013,Alameda,Eden Medical Center,106014233.0,Brain,27.0,,-122.087406,37.698377
...,...,...,...,...,...,...,...,...,...
15646,2020,Ventura,Community Memorial Hospital ñ San Buenaventura,106560473.0,Brain,,4.0,-119.258124,34.274589
15652,2020,Ventura,Ventura County Medical Center,106560481.0,Brain,,13.0,-119.253840,34.276930
15665,2020,Ventura,Los Robles Hospital and Medical Center,106560492.0,Brain,,16.0,-118.882411,34.207620
15678,2020,Ventura,Adventist Health Simi Valley,106560525.0,Brain,,1.0,-118.743940,34.289730
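
One reason to end with .copy() is that the subset can then be edited without affecting the original data. The sketch below is a small illustration on made-up rows (toy and toy_brain are hypothetical names). Without .copy(), some pandas versions may also show a warning when a filtered subset is modified later.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Hospital": ["Hospital A", "Hospital B", "Hospital C"],
    "Surgery": ["Brain", "Breast", "Brain"],
})

# .copy() gives the subset its own data, independent of toy.
toy_brain = toy[toy["Surgery"] == "Brain"].copy()

# Editing the copy...
toy_brain["Surgery"] = "Brain (reviewed)"

# ...leaves the original table as it was.
print(toy["Surgery"].tolist())
print(toy_brain["Surgery"].tolist())
```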


# Exporting the Subset

Once the subset is complete, it can be exported so that others can use it.

#### *Export as CSV*

Calling .to_csv() on the subset writes it to a csv file.
Passing index=False leaves the row index (the row numbers carried over from the original data set) out of the file, so that only the data columns are written.

In [20]:
Brain_subset.to_csv("Brain_subset.csv",index=False)

At this point, a new csv file should have appeared in your folder. 
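
To see what index=False changes, the two forms of the export can be compared. The sketch below is an illustration on made-up rows (toy_brain is a hypothetical name); calling .to_csv() without a file name returns the csv text instead of writing a file, which makes the difference easy to print.

```python
import io
import pandas as pd

# Made-up subset whose row labels were carried over from a larger table.
toy_brain = pd.DataFrame(
    {"Hospital": ["Hospital A", "Hospital C"], "Surgery": ["Brain", "Brain"]},
    index=[14, 28],
)

with_index = toy_brain.to_csv()                # the index is written by default
without_index = toy_brain.to_csv(index=False)  # only the data columns are written

print(with_index)
print(without_index)

# Reading the first version back shows the index as an extra, unnamed column.
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))
```

Leaving the index out keeps the exported file free of an extra column when someone reads it back in.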

The new subset data is now ready for use!