## Unit Testing
While we will not cover the [unit testing library](https://docs.python.org/3/library/unittest.html) that Python provides, we want to introduce a simple way to test your code.

Unit testing is important because it is the only way you can be sure that your code is doing what you think it is doing.

Remember, just because there are no errors does not mean your code is correct.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 100) # Show up to 100 columns when displaying a dataframe

In [2]:
# Load the NHANES 2015-2016 data
df = pd.read_csv("data/nhanes_2015_2016.csv")
df.index = range(1,df.shape[0]+1) # Relabel the rows so the index starts at 1 instead of 0

In [None]:
df.head()

### Goal
We want to find the mean of the first 100 rows of 'BPXSY1' when 'RIDAGEYR' > 60

i.e., we want the mean of the blood pressure variable over the first 100 rows where age is greater than 60

In [3]:
# One possible way of doing this is:
pd.Series.mean(df[df.RIDAGEYR > 60].loc[range(0,100), 'BPXSY1']) 
# Older versions of pandas (before 1.0) emit the warning below; pandas 1.0 and later raise a KeyError instead

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return getattr(section, self.name)[new_key]


139.57142857142858

The code above does not do what we want, even though it returns a plausible value for the mean, so we should not always trust the output. The problem is that we used .loc, which selects rows by label, instead of .iloc, which selects rows by position. The filtered dataframe df[df.RIDAGEYR > 60] keeps the row labels of the original dataframe, so it does not contain every label from 0 to 99. For example, it has no row labeled 0, because we relabeled the index to start at 1. For each missing label, pandas inserts a row of NaNs. The mean skips those NaNs, so it is computed over fewer than 100 rows, and not over the first 100 rows of the filtered dataframe.
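
To make the difference between labels and positions concrete, here is a minimal sketch on a made-up Series. The name `ages` and its values are invented for illustration and are not NHANES data. It shows that filtering keeps the original row labels, so a label can disappear while positions are still counted from 0.

```python
import pandas as pd

# Hypothetical ages, labeled 1 to 4 like our relabeled dataframe
ages = pd.Series([72, 35, 64, 81], index=[1, 2, 3, 4])

# Filtering drops the row labeled 2 but keeps the other labels unchanged
older = ages[ages > 60]
labels_after_filter = list(older.index)   # [1, 3, 4]

# .iloc counts positions within the filtered data: position 1 is the second remaining row
second_by_position = older.iloc[1]        # 64

# .loc looks up the label: the same row is still labeled 3
same_row_by_label = older.loc[3]          # 64

# Label 2 no longer exists after filtering
has_label_2 = 2 in older.index            # False

print(labels_after_filter, second_by_position, same_row_by_label, has_label_2)
```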


To test, we will create a small test dataframe whose structure we know and whose mean we can easily calculate.

In [5]:
# test our code on only ten rows so we can easily check
test = pd.DataFrame({'col1': np.repeat([3,1],5), 'col2': range(3,13)}, index=range(1,11))
test

| | col1 | col2 |
|---|---|---|
| 1 | 3 | 3 |
| 2 | 3 | 4 |
| 3 | 3 | 5 |
| 4 | 3 | 6 |
| 5 | 3 | 7 |
| 6 | 1 | 8 |
| 7 | 1 | 9 |
| 8 | 1 | 10 |
| 9 | 1 | 11 |
| 10 | 1 | 12 |


Now let us run the test dataframe through the same code for calculating the mean. The rows with col1 > 2 are the first five rows, whose col2 values are 3, 4, 5, 6, and 7, so we know the mean should be 5.

In [6]:
# pd.Series.mean(df[df.RIDAGEYR > 60].loc[range(0,5), 'BPXSY1'])
# should return 5

pd.Series.mean(test[test.col1 > 2].loc[range(0,5), 'col2'])

4.5

What went wrong?
Let's display the values that .loc selects from the filtered test dataframe.
There is a NaN at row label 0. The original test dataframe had no NaNs, but the filtered test dataframe has no row labeled 0, so pandas inserts a NaN there. The mean skips the NaN and averages only the values 3, 4, 5, and 6 at labels 1 through 4, which gives 4.5 and leaves out the fifth row.

In [8]:
test[test.col1 > 2].loc[range(0,5), 'col2']
# 0 is not among the row index labels because our index starts at 1. Older versions of pandas fill the
# missing label with NaN

0    NaN
1    3.0
2    4.0
3    5.0
4    6.0
Name: col2, dtype: float64
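
The output above comes from an older version of pandas. As a rough sketch, assuming pandas 1.0 or later, the same `.loc` call raises a `KeyError` instead of filling in NaN, and `.reindex()` (the alternative named in the warning) reproduces the NaN-filled result. The variable names `filtered`, `loc_raised`, `reindexed`, and `mean_with_gap` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same ten-row test dataframe as above
test = pd.DataFrame({'col1': np.repeat([3, 1], 5), 'col2': range(3, 13)}, index=range(1, 11))
filtered = test[test.col1 > 2]

# In recent pandas, asking .loc for a missing label (0) is an error rather than a NaN row
try:
    filtered.loc[range(0, 5), 'col2']
    loc_raised = False
except KeyError:
    loc_raised = True

# .reindex() fills missing labels with NaN, which mimics the old .loc behavior
reindexed = filtered['col2'].reindex(range(0, 5))

# mean() skips NaN by default, so only the values at labels 1 to 4 are averaged
mean_with_gap = reindexed.mean()

print(loc_raised)
print(reindexed)
print(mean_with_gap)
```

Either way, the lesson is the same: the labels 0 to 4 are not the first five rows of the filtered dataframe.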

In [9]:
# Using .iloc instead, we correctly choose the first 5 rows by position, regardless of their row labels (column position 1 is col2)
test[test.col1 > 2].iloc[range(0,5), 1]

1    3
2    4
3    5
4    6
5    7
Name: col2, dtype: int64

In [10]:
pd.Series.mean(test[test.col1 > 2].iloc[range(0,5), 1])
# Now we get the expected mean of 5

5.0
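
One way to turn this manual check into a repeatable test is to wrap the calculation in a small function and assert the value we worked out by hand. The sketch below is only one possible approach; the helper name `mean_of_first_n` and its parameters are hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def mean_of_first_n(frame, mask, column, n):
    """Mean of `column` over the first `n` rows of `frame` where `mask` is True."""
    # Boolean filtering with .loc, then position-based slicing with .iloc
    return frame.loc[mask, column].iloc[:n].mean()

# Same ten-row test dataframe as above
test = pd.DataFrame({'col1': np.repeat([3, 1], 5), 'col2': range(3, 13)}, index=range(1, 11))

result = mean_of_first_n(test, test.col1 > 2, 'col2', 5)

# The first five rows with col1 > 2 have col2 values 3, 4, 5, 6, 7, so the mean must be 5
assert result == 5.0
print(result)
```

If the assertion fails, Python raises an `AssertionError`, so a wrong result can no longer pass silently.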

In [None]:
# We can compare what our real dataframe looks like with the incorrect and correct methods
df[df.RIDAGEYR > 60].loc[range(0,5), :] # Incorrect: filled with NaN wherever a row label is missing from the filtered dataframe

In [None]:
df[df.RIDAGEYR > 60].iloc[range(0,5), :] # Correctly picks the first five rows where 'RIDAGEYR' > 60

In [None]:
# Applying the correct method to the original question about BPXSY1, using its column position (16)
print(pd.Series.mean(df[df.RIDAGEYR > 60].iloc[range(0,100), 16]))

# Another way to reference the BPXSY1 variable, without hard-coding its position
print(pd.Series.mean(df[df.RIDAGEYR > 60].iloc[range(0,100), df.columns.get_loc('BPXSY1')]))
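
The two ways of referencing the column can themselves be checked on a toy dataframe. The sketch below uses made-up ages and blood pressure values (not NHANES data) and the illustrative name `toy`. It compares looking up the column position with `get_loc` against selecting the column by name and then slicing by position.

```python
import pandas as pd

# Made-up values for illustration only, labeled from 1 like the real dataframe
toy = pd.DataFrame(
    {'RIDAGEYR': [70, 45, 65, 80], 'BPXSY1': [140.0, 120.0, 130.0, 150.0]},
    index=range(1, 5),
)
older = toy[toy.RIDAGEYR > 60]   # keeps the rows labeled 1, 3, and 4

# Position-based: look up the column position instead of hard-coding it
by_position = older.iloc[:2, toy.columns.get_loc('BPXSY1')].mean()

# Name-based: select the column first, then take the first two rows by position
by_name = older['BPXSY1'].iloc[:2].mean()

# Both average 140 and 130, the first two rows where age > 60
assert by_position == by_name == 135.0
print(by_position, by_name)
```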