# Exercise: Is this Drug Working?

In this exercise, we will use the pandas Python package, along with matplotlib for graphing, to look at some simulated data from a (simplified) drug trial. We have a simple question: based on these data, is our drug working?

## Getting the test data

First, we'll need to download a comma-separated values file that contains our drug data. This file can be found [here](./resources/treatment_results.csv).

Click on the link to download the file (by default it will go to `Downloads`).

Then, move it to the folder on your computer that you will be using for your analysis. You can do this either on the command line or using your graphical user interface. (NOTE: remember or write down which folder this is, so you can get to that same folder in your command line interface before starting Jupyter Notebook.)


## Examining the test data in Excel

CSV (comma-separated values) files can be opened in Excel or Google Sheets.

### Opening the data file in Excel
Depending on your version of Excel, you may be able to use one of the following methods:

- Use `File > Open`, then pick the file. (If you are in the folder where the file is but can't see it, there may be a pulldown menu that lets you choose which files Excel 'sees'; select 'All Files'.)
- Open Excel separately and drag the `.csv` file from your file browser (such as Finder) onto it.
- Open a new blank sheet and, from the Data tab, select `Get Data > From File > From Text/CSV`, then pick your file.

### Examining the Data

Now let's check out the data. It should look something like this:

<img src="./resources/data_screenshot.png" alt="An Excel file. There are 4 columns: Patient ID, Patient Age, Patient Health, and Treatment. Under each are numbers. The patient ID column has a unique number for each patient. The first few patient ages range between 15 and 88. The Patient Health column has a number in the thousands representing patient health. The treatment column is categorical and has either 'Treatment' or 'Placebo' marked in each row." width=800>

### Checking the path for your current working directory

Before we try to load the data in our Jupyter notebook, we have to move it into the notebook's current working directory. The `getcwd` function from Python's `os` module returns the path of your current working directory. This can be useful for moving files into that directory in your file browser, or for re-downloading them and saving them into that directory.

In [16]:
from os import getcwd
print(getcwd())

/Users/zaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/intro_to_data_studies/content/tabular_data


## Move the file into our current working directory using your graphical user interface

Try to download the file directly into the folder shown by your `getcwd()` command. Or, if you prefer, download it into `Downloads` and then move it to your current working directory.

Once you think you've got it there, you can use the `listdir` function from the `os` module to check whether it's there.

In [19]:
from os import listdir
listdir()

['.DS_Store',
 'exercise_is_this_drug_working.ipynb',
 'resources',
 '.ipynb_checkpoints']

In my case you can see I *didn't* put the file in my current directory (it's not in the list), but instead stuck it into a 'resources' folder within my current working directory. That's OK - we just use another listdir command to check that it is in the resources folder like I expect:

In [20]:
listdir('resources')

['data_screenshot.png', 'treatment_results.csv']

Great! I see it. So wherever your file is, you just need to remember that to load it into pandas, we'll need the *relative path* from our current working directory to where the file is (in my case `./resources/treatment_results.csv`, but if it were in the same folder it would just be `treatment_results.csv`).
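As a quick sanity check on relative paths, here is a minimal sketch using Python's standard `pathlib` module. The helper name `resolve_data_path` and the throwaway temporary folder are assumptions made for illustration; they are not part of the exercise files.

```python
from pathlib import Path
import tempfile


def resolve_data_path(relative_path, base_dir=None):
    """Return the absolute path that a relative path points to.

    base_dir defaults to the current working directory, which is
    where pandas starts looking when given a relative path.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return (base / relative_path).resolve()


# Build a throwaway folder laid out like the example above:
# a 'resources' subfolder that holds treatment_results.csv.
with tempfile.TemporaryDirectory() as tmp:
    resources = Path(tmp) / "resources"
    resources.mkdir()
    (resources / "treatment_results.csv").write_text("PatientID\n0\n")

    in_subfolder = resolve_data_path("./resources/treatment_results.csv", base_dir=tmp)
    in_same_folder = resolve_data_path("treatment_results.csv", base_dir=tmp)

    # Check existence while the temporary folder is still on disk
    subfolder_exists = in_subfolder.exists()
    same_folder_exists = in_same_folder.exists()

print("Found via ./resources/ path:", subfolder_exists)
print("Found via bare filename:", same_folder_exists)
```

In your own notebook you could call `resolve_data_path('./resources/treatment_results.csv').exists()` with no `base_dir` to test the path against your real current working directory.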

## Importing pandas and loading the data into python

In [12]:
import pandas as pd

#the data file variable
#should match the *relative path* from
#your current working directory (where you ran Jupyter Notebook from) 
#to the actual data file. Or if the file is in your current working
#directory, you can just use the filename

data_file = './resources/treatment_results.csv'

#Now we load the file into a pandas DataFrame called df
df = pd.read_csv(data_file)

#Show a few rows of our dataframe
df

Unnamed: 0,PatientID,Patient Age,Patient Health,Treatment
0,0,89,1538,Placebo
1,1,55,1167,Drug
2,2,15,1759,Drug
3,3,42,1361,Drug
4,4,52,1754,Placebo
...,...,...,...,...
95,95,85,718,Drug
96,96,61,1693,Placebo
97,97,38,1429,Drug
98,98,52,1206,Drug


Ideally you should now have a pandas dataframe with the same data as you saw in Excel.
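If you want to confirm what was loaded, a sketch like the following may help. It assumes, based on the output above, that the first column of the CSV has no header (pandas labels such a column `Unnamed: 0`). The name `sample_text` and its five rows, copied from the output above, are a stand-in so the example runs without the real file.

```python
import io

import pandas as pd

# Stand-in for the real file: the first five rows shown above,
# assuming the first column has an empty header.
sample_text = (
    ",PatientID,Patient Age,Patient Health,Treatment\n"
    "0,0,89,1538,Placebo\n"
    "1,1,55,1167,Drug\n"
    "2,2,15,1759,Drug\n"
    "3,3,42,1361,Drug\n"
    "4,4,52,1754,Placebo\n"
)

# Default load: the unnamed first column becomes 'Unnamed: 0'
raw_df = pd.read_csv(io.StringIO(sample_text))
print(raw_df.shape)          # (number of rows, number of columns)
print(list(raw_df.columns))  # column names as a list

# Optional: treat the first column as the row index instead
clean_df = pd.read_csv(io.StringIO(sample_text), index_col=0)
print(list(clean_df.columns))
print(clean_df.head(3))      # first three rows only
```

Using `index_col=0` is optional. The extra `Unnamed: 0` column does no harm in this exercise, since we only use the named columns.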

## Troubleshooting: If you get an error saying the file doesn't exist

The most common reason the above will fail is that the location you gave Python for the file doesn't match its actual location on your computer.

If you get an error, scroll down to the bottom first! The most important part is the last line. The rest just walks through the steps Python took to get to that error, which can be long and complex.

Read the whole last line of the error message out loud. Seriously! It sounds silly but it can really help to force your brain to engage with the error message (my brain, at least, REALLY wants to gloss over parts of it). If the previous command errored out, the most likely reason is an error that looks something like this:

`FileNotFoundError: [Errno 2] No such file or directory: './resources/treatment_resultss.csv'`

Can you see why I got this error? In my case, I added an extra 's' onto the filename (treatment_resultss.csv instead of treatment_results.csv). Of course, that misspelled file won't exist. In your case the filename may be right, but you may be in a different folder than you specified.

You can use the `listdir` function from Python's `os` (operating system) module to check what files are in any folder. Just import `listdir` from `os`, then call it. If you call it without parameters (like `listdir()`), it will list what's in your current working directory. If you give it a path, it will list the files at the location you specify.
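The check described above can be wrapped in a small helper. This is only a sketch: `check_data_file` is a hypothetical name, not a pandas or `os` function, and it relies only on the standard `os` module.

```python
from os import listdir
from os.path import dirname, exists


def check_data_file(path):
    """Report whether a file exists; if not, show what is in its folder."""
    if exists(path):
        return f"Found: {path}"

    # An empty dirname means the path is just a filename,
    # so look in the current working directory.
    folder = dirname(path) or "."
    if not exists(folder):
        return f"Folder not found: {folder}"

    # The folder is there but the file is not: list the folder so a
    # misspelled filename is easy to spot.
    return f"Not found: {path}. Files in {folder}: {sorted(listdir(folder))}"


# The misspelled filename from the error message above
print(check_data_file("./resources/treatment_resultss.csv"))
```

Comparing the name you typed against the printed list usually reveals the mismatch, whether it is a typo or a file sitting in a different folder.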

## Getting x and y data out of our columns
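
To plot the data, we first pull the columns we care about out of the DataFrame as plain Python lists. Indexing a DataFrame with a column name (like `df['Patient Age']`) returns that column, and `list()` converts it to a list.
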
In [23]:
age = list(df['Patient Age'])
treatment = list(df['Treatment'])
response = list(df['Patient Health'])

print(age,treatment,response)

[89, 55, 15, 42, 52, 70, 41, 54, 58, 68, 82, 23, 64, 70, 30, 84, 88, 18, 42, 79, 37, 70, 26, 35, 55, 21, 46, 74, 54, 42, 45, 33, 24, 19, 66, 5, 53, 22, 76, 43, 49, 32, 17, 31, 78, 36, 49, 83, 20, 69, 61, 69, 48, 33, 73, 46, 70, 40, 53, 17, 33, 82, 59, 40, 20, 60, 66, 42, 61, 34, 35, 66, 53, 6, 39, 48, 57, 50, 45, 83, 72, 45, 78, 90, 15, 32, 56, 64, 22, 62, 64, 67, 48, 48, 63, 85, 61, 38, 52, 67] ['Placebo', 'Drug', 'Drug', 'Drug', 'Placebo', 'Placebo', 'Placebo', 'Placebo', 'Placebo', 'Placebo', 'Drug', 'Placebo', 'Placebo', 'Placebo', 'Placebo', 'Placebo', 'Placebo', 'Drug', 'Drug', 'Placebo', 'Drug', 'Placebo', 'Drug', 'Drug', 'Drug', 'Drug', 'Drug', 'Drug', 'Drug', 'Placebo', 'Drug', 'Drug', 'Drug', 'Drug', 'Placebo', 'Placebo', 'Placebo', 'Drug', 'Placebo', 'Placebo', 'Placebo', 'Placebo', 'Placebo', 'Drug', 'Placebo', 'Placebo', 'Drug', 'Drug', 'Placebo', 'Drug', 'Drug', 'Drug', 'Drug', 'Drug', 'Placebo', 'Drug', 'Placebo', 'Placebo', 'Drug', 'Drug', 'Drug', 'Drug', 'Placebo', 'Pl