# NEUS 642 - Week 3 Homework

Your goal for the day? Perform a real statistical test on real data. In this case, you'll perform a T-test to assess the significance of changes in the auditory brainstem response (ABR) following noise exposure. Along the way, we'll deal with the very normal problem of missing/incomplete data.

## Reload the ABR dataset

To continue working with pandas, let's start by loading the exposure and threshold files from lecture and joining the tables into the `data` DataFrame.

In [None]:
import numpy as np
import pandas as pd
pd.options.display.max_rows = 7

# other imports useful for this homework
import matplotlib.pyplot as plt
from scipy import stats

In [None]:
exposure_data = pd.read_csv('exposure_data.csv', index_col=0, parse_dates=['exposure_date'])
threshold_data = pd.read_csv('abr_thresholds.csv', parse_dates=['abr_date'])
data = threshold_data.join(exposure_data, on=['animal'])

# pivot to one column per timepoint
data['days_re_exposure'] = (data['abr_date'] - data['exposure_date']).astype(str)
data_timepoints = data.pivot(columns='days_re_exposure', index=['animal', 'ear', 'exposure_level'], values='threshold').reset_index()

In [None]:
data_timepoints

As you see here, and might remember from lecture, there are *NaNs* (a special numpy value meaning "not a number") in cells where data were not collected. You'll have to deal with that in your homework.

## Question 1 - Paired T-test

We want to determine if the change in hearing threshold following noise exposure is significant. There are many ways to do this, but let's start with an old standard, the paired T-test. A paired T-test considers changes within-animal and can sometimes be more powerful than its cousin, the independent (or unpaired) T-test. Conveniently, the scientific Python package scipy has a paired T-test built in.

In [None]:
from scipy import stats
stats.ttest_rel

Write a function `my_ttest` that takes a dataframe and two column names as inputs and returns the results `T, p` of a T-test in a tuple. 

Remember that you have to exclude rows containing NaN values for this to work! There are many ways to deal with this issue. Some solutions are built into pandas. Alternatively, note the `nan_policy` parameter of `ttest_rel`.
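
To make the two options concrete, here is a minimal sketch on a made-up toy table (the column names `before` and `after` and all values are invented for illustration; this is not the ABR data). It compares dropping incomplete rows with pandas against letting scipy skip them, which should agree for a paired test.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data, invented for illustration (not the ABR dataset)
toy = pd.DataFrame({
    "before": [20.0, 25.0, np.nan, 30.0, 22.0],
    "after":  [35.0, 40.0, 45.0, np.nan, 30.0],
})

# Option A: keep only rows where BOTH columns have a value, then test
complete = toy.dropna(subset=["before", "after"])
t_a, p_a = stats.ttest_rel(complete["before"].to_numpy(),
                           complete["after"].to_numpy())

# Option B: pass everything and ask scipy to omit incomplete pairs
t_b, p_b = stats.ttest_rel(toy["before"].to_numpy(),
                           toy["after"].to_numpy(),
                           nan_policy="omit")

print(f"rows kept: {len(complete)}")
print(f"dropna: T={t_a:.3f} p={p_a:.3e}")
print(f"omit:   T={t_b:.3f} p={p_b:.3e}")
```

Either route works. Dropping rows yourself makes it easy to check how many pairs actually went into the test.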

In [None]:
def my_ttest(df, col1, col2):
    """ Perform a paired T-test between two columns of a dataframe
    Inputs:
    df : dataframe
    col1, col2 : string names of columns to compare
    Returns:
    T, p : tuple of T score and p value resulting from paired t-test
    """
    # Your answer here

    return T, p

Test it out:

In [None]:
T, p = my_ttest(data_timepoints, '-3 days', '1 days')
print(f"T={T:.3f} p={p:.3e}")

In [None]:
T, p = my_ttest(data_timepoints, '-3 days', '14 days')
print(f"T={T:.3f} p={p:.3e}")

## Question 2 - Avoiding bias from multiple measurements

When you look at data from each animal in the cell above, notice that there are two measurements on each day, one for the left ear and one for the right ear. You might also notice that they tend to be similar on the same day. To perform a more conservative T-test, let's average thresholds across ears for each animal before evaluating the significance of the threshold change.

Using the `groupby` and `mean` methods, generate a new DataFrame `data_ear_averaged`, which averages data across ears. Then pass your new DataFrame to the `my_ttest` function you wrote for Question 1.
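
If the `groupby`/`mean` pattern is unfamiliar, the sketch below shows it on a tiny invented table that mimics the layout of `data_timepoints` (the animal names and threshold values are made up). Selecting the numeric timepoint columns before calling `mean` is one way to avoid averaging the text-valued `ear` column; other approaches work too.

```python
import numpy as np
import pandas as pd

# Toy table, invented for illustration: two animals, two ears each
toy = pd.DataFrame({
    "animal": ["A1", "A1", "A2", "A2"],
    "ear": ["left", "right", "left", "right"],
    "exposure_level": [104, 104, 110, 110],
    "-3 days": [20.0, 30.0, 25.0, np.nan],
    "1 days": [40.0, 50.0, 60.0, 70.0],
})

# Group rows belonging to the same animal, then average each timepoint column.
# mean() skips NaN by default, so A2's "-3 days" value comes from one ear only.
toy_ear_averaged = (
    toy.groupby(["animal", "exposure_level"])[["-3 days", "1 days"]]
    .mean()
    .reset_index()
)
print(toy_ear_averaged)
```

Note that the result has one row per animal, so a T-test on it uses fewer, more independent samples.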

In [None]:
# Your answer here


Test it out:

In [None]:
T, p = my_ttest(data_ear_averaged, '-3 days', '1 days')
print(f"T={T:.3f} p={p:.3e}")

In [None]:
T, p = my_ttest(data_ear_averaged, '-3 days', '14 days')
print(f"T={T:.3f} p={p:.3e}")

## Question 3 - Scatter plot

Use your documentation-searching skills to figure out how to generate a scatter plot comparing thresholds for each animal between two timepoints, e.g., `-3 days` and `1 days` or `-3 days` and `14 days`. You can plot data for each measurement or the average across ears. Or both!

Let's also make the plot tidy: 
* Color the dots differently for the different noise exposure levels.
* Make sure the axes are labeled.
* Include a dashed, diagonal line running from (20,20) to (80,80), so it's easy to see which way the thresholds have shifted.
* In the title, report the results of your paired T-test.
* Optional: If you want to get fancy, make the color of the dots depend on noise exposure level.

You will likely need to use a combination of DataFrame methods and calls to matplotlib functions.
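
As a starting point, here is a rough sketch of how pandas and matplotlib can be combined for this kind of plot. The table is invented toy data, the axis labels leave out units because your real labels should state them, and the title is only a placeholder for your own T-test results.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data, invented for illustration (not the ABR dataset)
toy = pd.DataFrame({
    "exposure_level": [104, 104, 110, 110, 114, 114],
    "-3 days": [25.0, 30.0, 20.0, 35.0, 30.0, 25.0],
    "1 days": [35.0, 45.0, 50.0, 55.0, 70.0, 65.0],
})

fig, ax = plt.subplots(figsize=(4, 4))

# One scatter call per exposure level: each group gets its own
# color from the default color cycle and its own legend entry
for level, group in toy.groupby("exposure_level"):
    ax.scatter(group["-3 days"], group["1 days"], label=f"{level} dB")

# Dashed unity line from (20, 20) to (80, 80); points above the line
# have a higher threshold at the later timepoint
ax.plot([20, 80], [20, 80], linestyle="--", color="gray")

ax.set_xlabel("Threshold, -3 days")
ax.set_ylabel("Threshold, 1 days")
ax.set_title("Toy data (put your T and p here)")
ax.legend(title="Exposure level")
```

In a notebook the figure renders when the cell finishes. Looping over `groupby` is just one option; passing a list of colors to a single `scatter` call via its `c` argument is another.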

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Your answer here


## Bonus - Significant effect of noise exposure level?

We've ignored the `exposure_level` column in the homework so far. This number indicates the intensity of the noise exposure, and in class we saw that louder noise exposure may produce more severe hearing loss on day 1 after exposure. Can you perform T-tests to compare the mean threshold change between the 104, 110 and 114 dB exposure groups?

Logic: 
1. Define a new column as the threshold difference between -3 days and 1 days post-exposure
2. For each exposure level (104, 110 or 114 dB), select the subset of rows for that group.
3. Perform a t-test comparing the mean change in threshold between these exposure groups. Report the T score and p value.
4. Do you think this is a definitive result? Why/why not? Should you average across ears before testing or not?

Important: This will require an *independent* T-test, since the same animal cannot be in different exposure groups. This is implemented in a different scipy function.
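
The logic above can be sketched on invented toy data as follows. The column name `shift` is a made-up choice, only two of the three groups are compared, and the values are not real measurements, so treat this as an outline rather than the answer.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data, invented for illustration (not the ABR dataset)
toy = pd.DataFrame({
    "exposure_level": [104, 104, 104, 114, 114, 114],
    "-3 days": [25.0, 30.0, 20.0, 25.0, 30.0, 20.0],
    "1 days": [35.0, 45.0, np.nan, 65.0, 70.0, 55.0],
})

# Step 1: threshold change per row (NaN wherever either timepoint is missing)
toy["shift"] = toy["1 days"] - toy["-3 days"]

# Step 2: select the rows of each exposure group and drop missing changes
low = toy.loc[toy["exposure_level"] == 104, "shift"].dropna()
high = toy.loc[toy["exposure_level"] == 114, "shift"].dropna()

# Step 3: independent (unpaired) T-test between the two groups
T, p = stats.ttest_ind(low.to_numpy(), high.to_numpy())
print(f"T={T:.3f} p={p:.3e}")
```

By default `ttest_ind` assumes equal variances in the two groups. Passing `equal_var=False` runs Welch's T-test instead, which may be worth considering when group sizes are small or unequal.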

In [None]:
from scipy import stats
stats.ttest_ind

In [None]:
# Your answer here