# Attributions

**Chris:** Came up with the 4th test, coded the validations

**Naveen:** Looked over final version

**Emily:** Came up with 3 tests, wrote all the explanations/reports

In [3]:
# Imports
import glob

import numpy as np
import pandas as pd

In [4]:
# Reading in the file and taking a peek to figure out patterns for validations we want
df = pd.read_csv('../data/150717_2A_2B.csv')

df.head()

Unnamed: 0,location,animal,user,sn,an,datatype,start,end,startreason,endreason,frect,fredur,midct,middur,burct,burdur,stdate,sttime
0,c97,z097,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,34,55.7,34,4.3,0,0.0,17/07/2015,14:30:00
1,c98,z098,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,0,60.0,0,0.0,0,0.0,17/07/2015,14:30:00
2,c99,z099,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,6,59.0,6,1.0,0,0.0,17/07/2015,14:30:00
3,c100,z100,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,25,55.8,25,4.2,0,0.0,17/07/2015,14:30:00
4,c101,z101,ZEBRALAB02\zebralab_user,1,0,quant,0.0,60.0,Beginning of session,End of period,1,60.0,0,0.0,0,0.0,17/07/2015,14:30:00


Just from looking through patterns of columns to see what each one represents (including what could possibly be real, meaningful values for each), a few things jump out at us as requirements or deviations from the norm that we'd want to run tests for:

1. We will make sure the column headers are the same as those given. We are choosing this test to ensure that we are receiving the data that we expect, and not assigning incorrect values to certain variables, which would be really bad!
2. For a number of observations of the "sttime" variable, we see lag from some part of the data collection such that the start time does not land on "\_:\_:00" seconds, which is the expectation, but on a value such as "\_:\_:59". We want to run a test to make sure that we do not have delay in recorded times due to hard-drive write delays. We want to keep the recorded sttimes accurate so that we can ensure the appropriate time interval was recorded for each bout of activity. Additionally, consistency of this column will allow us to see if there is any actual malfunction with the machine or other technology - we should be alarmed if we see "\_:\_:43," for instance.
3. Because "middur" should be used for calculation of activity, no activity should be negative. Run a test to check that "middur" $\geq$ 0. We chose this test because a negative activity would indicate a problem with the calibration; rest/no movement should just be 0 at a minimum.
4. The difference between the "start" and "end" values for a given observation should be exactly 60.0, as set by the experimentalist. We notice that some are not (we should check whether the reason is the end of the session). We will make a test to check the observation time length. We chose this test because we want to ensure that the activities are recorded across the same time length, or else the data are skewed; the analysis may only want to use activities that arise from the same length of time, in which case those that are not exactly 60.0 s would provide inaccurate information. The analysis could discard those data points, or else calculate the average activity per second and multiply it by 60. Regardless of the analysis, this time would need to be known and ideally consistent.


In [5]:
def test_col_headers(df, fname):
    """ Make sure that our column headers are correct """
    column_names = np.array( ['location', 'animal', 'user', 'sn', 'an', 'datatype', 'start', 'end',
                      'startreason', 'endreason', 'frect', 'fredur', 'midct', 'middur',
                      'burct', 'burdur', 'stdate', 'sttime'] )
    
    assert np.all(df.columns == column_names), (fname + ' has incorrect column names')

In [6]:
# Test for consistent column headers
test_col_headers(df, 'test')

We see that there were no errors thrown for this test, which means the column headers are what we expect them to be - great! Let's move on to checking for time delay from hard-drive recording.
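
The header check above compares names position by position, so a file with the right columns in a different order would fail without saying why. A minimal sketch of an order-insensitive variant that names the offending columns is shown here; `check_col_headers` and `EXPECTED_COLUMNS` are hypothetical names introduced for illustration, not part of the notebook.

```python
import pandas as pd

# Same 18 names used in test_col_headers
EXPECTED_COLUMNS = ['location', 'animal', 'user', 'sn', 'an', 'datatype', 'start', 'end',
                    'startreason', 'endreason', 'frect', 'fredur', 'midct', 'middur',
                    'burct', 'burdur', 'stdate', 'sttime']

def check_col_headers(df, fname, expected=EXPECTED_COLUMNS):
    """Hypothetical order-insensitive header check that reports which columns are off."""
    missing = [c for c in expected if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected]
    assert not missing and not unexpected, (
        f"{fname} has incorrect column names; missing: {missing}, unexpected: {unexpected}")

# Made-up empty frame with the right names in reversed order: the check passes
demo = pd.DataFrame(columns=EXPECTED_COLUMNS[::-1])
check_col_headers(demo, 'demo')
```

Whether column order should matter depends on how downstream code reads the file; if it indexes by name, as the tests here do, ignoring order is the more forgiving choice.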

In [8]:
def test_sttime(df, fname):
    """ Make sure that our times are not delayed due to hard-drive write delays """
    
    time_test = np.array([pd.to_datetime(getattr(row, 'sttime')).second == 0 for row in df.itertuples()])
    
    errors = np.where(time_test != True)[0]
    
    assert np.all(time_test), fname + " has invalid sttimes " + np.array_str(errors)

In [9]:
# Test for sttimes
test_sttime(df, 'test')

AssertionError: test has invalid sttimes [    96     97     98 ..., 790845 790846 790847]

As we saw earlier when we skimmed the data, there are many observations with invalid sttimes that are probably due to lag in hard drive recording. Looks like our test works.
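
Checking that the seconds equal zero assumes every session starts exactly on the minute. An alternative is to compare consecutive timestamps for the same well and require them to be 60 seconds apart. The sketch below assumes that "stdate" is day/month/year, that rows for a given "location" are already in time order, and that one "location" corresponds to one continuous recording; the helper names are hypothetical.

```python
import pandas as pd

def irregular_sttime_rows(df, expected_seconds=60.0):
    """Hypothetical helper: labels of rows whose gap to the previous
    row of the same location is not expected_seconds."""
    # Assumed formats: stdate as DD/MM/YYYY, sttime as HH:MM:SS
    stamps = pd.to_datetime(df['stdate'] + ' ' + df['sttime'],
                            format='%d/%m/%Y %H:%M:%S')
    # Previous timestamp within each location; the first row of each gets NaT
    previous = stamps.groupby(df['location']).shift()
    gaps = (stamps - previous).dt.total_seconds()
    # Skip the first row of each location (no previous timestamp to compare)
    return df.index[gaps.notna() & (gaps != expected_seconds)]

def check_sttime_spacing(df, fname):
    bad = irregular_sttime_rows(df)
    assert len(bad) == 0, f"{fname} has irregular sttime spacing at rows {bad.tolist()}"

# Made-up data: the third c97 reading is one second late
demo = pd.DataFrame({
    'location': ['c97', 'c98', 'c97', 'c98', 'c97', 'c98'],
    'stdate': ['17/07/2015'] * 6,
    'sttime': ['14:30:00', '14:30:00', '14:31:00', '14:31:00', '14:32:01', '14:32:00'],
})
print(irregular_sttime_rows(demo).tolist())  # [4]
```

Because the whole column is parsed in one call instead of row by row, this should also run considerably faster on a file of this size.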

In [12]:
def test_neg(df, fname):
    """ Make sure none of our activities are negative """
    
    neg_test = np.array([getattr(row, 'middur') >= 0 for row in df.itertuples()])
    
    errors = np.where(neg_test != True)[0]
    
    # Assert on the boolean mask; `errors` only holds the indices of the failing rows
    assert np.all(neg_test), fname + " has negative activities " + np.array_str(errors)

In [13]:
# Test to make sure there are no negative middur values
test_neg(df, 'test')

Whoo, there are no negative activities! Good stuff. All the activity observations are usable data.
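
The non-negativity check can be tightened at little cost. A minimal sketch, assuming "middur" is a duration in seconds within a 60-second observation window (so it should be a float between 0 and 60); `check_middur` is a hypothetical name.

```python
import pandas as pd

def check_middur(df, fname, max_seconds=60.0):
    """Hypothetical stricter middur check: float dtype and values in [0, max_seconds]."""
    middur = df['middur']
    assert pd.api.types.is_float_dtype(middur), f"{fname}: middur is not a float column"
    # NaN fails both comparisons, so missing values are flagged as well
    in_range = (middur >= 0) & (middur <= max_seconds)
    bad = df.index[~in_range]
    assert in_range.all(), f"{fname} has out-of-range middur at rows {bad.tolist()}"

# Made-up values in the valid range: the check passes
check_middur(pd.DataFrame({'middur': [4.3, 0.0, 1.0, 4.2]}), 'demo')
```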

In [81]:
def test_timediff(df, fname):
    """ Make sure the time difference between start and end times is exactly 60 """
    
    time_diffs = np.array( [getattr(row, 'end') - getattr(row, 'start') == 60.0 for row in df.itertuples()
                    if getattr(row, 'endreason') == 'End of period'] )
    
    errors = np.where(time_diffs != True)[0]
    
    assert np.all(time_diffs), fname + " has incorrect start/end times at " + np.array_str(errors)

In [82]:
# Check time difference between start and end
test_timediff(df, 'test')

AssertionError: test has incorrect start/end times at [619968 619969 619970 619971 619972 619973 619974 619975 619976 619977
 619978 619979 619980 619981 619982 619983 619984 619985 619986 619987
 619988 619989 619990 619991 619992 619993 619994 619995 619996 619997
 619998 619999 620000 620001 620002 620003 620004 620005 620006 620007
 620008 620009 620010 620011 620012 620013 620014 620015 620016 620017
 620018 620019 620020 620021 620022 620023 620024 620025 620026 620027
 620028 620029 620030 620031 620032 620033 620034 620035 620036 620037
 620038 620039 620040 620041 620042 620043 620044 620045 620046 620047
 620048 620049 620050 620051 620052 620053 620054 620055 620056 620057
 620058 620059 620060 620061 620062 620063 620064 620065 620066 620067
 620068 620069 620070 620071 620072 620073 620074 620075 620076 620077
 620078 620079 620080 620081 620082 620083 620084 620085 620086 620087
 620088 620089 620090 620091 620092 620093 620094 620095 620096 620097
 620098 620099 620100 620101 620102 620103 620104 620105 620106 620107
 620108 620109 620110 620111 620112 620113 620114 620115 620116 620117
 620118 620119 620120 620121 620122 620123 620124 620125 620126 620127
 620128 620129 620130 620131 620132 620133 620134 620135 620136 620137
 620138 620139 620140 620141 620142 620143 620144 620145 620146 620147
 620148 620149 620150 620151 620152 620153 620154 620155 620156 620157
 620158 620159 620160 620161 620162 620163 620164 620165 620166 620167
 620168 620169 620170 620171 620172 620173 620174 620175 620176 620177
 620178 620179 620180 620181 620182 620183 620184 620185 620186 620187
 620188 620189 620190 620191 620192 620193 620194 620195 620196 620197
 620198 620199 620200 620201 620202 620203 620204 620205 620206 620207
 620208 620209 620210 620211 620212 620213 620214 620215 620216 620217
 620218 620219 620220 620221 620222 620223 620224 620225 620226 620227
 620228 620229 620230 620231 620232 620233 620234 620235 620236 620237
 620238 620239 620240 620241 620242 620243 620244 620245 620246 620247
 620248 620249 620250 620251 620252 620253 620254 620255 620256 620257
 620258 620259 620260 620261 620262 620263 620264 620265 620266 620267
 620268 620269 620270 620271 620272 620273 620274 620275 620276 620277
 620278 620279 620280 620281 620282 620283 620284 620285 620286 620287
 620288 620289 620290 620291 620292 620293 620294 620295 620296 620297
 620298 620299 620300 620301 620302 620303 620304 620305 620306 620307
 620308 620309 620310 620311 620312 620313 620314 620315 620316 620317
 620318 620319 620320 620321 620322 620323 620324 620325 620326 620327
 620328 620329 620330 620331 620332 620333 620334 620335 620336 620337
 620338 620339 620340 620341 620342 620343 620344 620345 620346 620347
 620348 620349 620350 620351]

We can see that a large run of the later observations had incorrect time differences. If we take a closer look, we can see it's because those observations were after "End of Session." Looks like our test worked. We do not want to use those time points in the final data analysis anyway.
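
One caveat when reading the indices in the error message: `test_timediff` builds `time_diffs` only from the rows whose "endreason" is "End of period", so `np.where` returns positions within that filtered list, which need not match row numbers in `df`. A minimal sketch that keeps the original row labels by using boolean masks follows; `find_bad_timediffs` is a hypothetical helper.

```python
import pandas as pd

def find_bad_timediffs(df, expected=60.0):
    """Hypothetical helper: original row labels where a full period is not `expected` long."""
    full_period = df['endreason'] == 'End of period'
    # Exact comparison, matching the original test
    wrong_length = (df['end'] - df['start']) != expected
    return df.index[full_period & wrong_length]

# Made-up data: row 1 is a short end-of-session row (ignored), row 2 is a short full period
demo = pd.DataFrame({
    'start': [0.0, 60.0, 120.0],
    'end': [60.0, 75.0, 170.0],
    'endreason': ['End of period', 'End of session', 'End of period'],
})
print(find_bad_timediffs(demo).tolist())  # [2]
```

In the demo, the filtered-list position of the bad row would be 1, while its label in the DataFrame is 2, which is the kind of offset to watch for when looking rows up afterwards.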

<div class="alert alert-info">Final: 46/50 </div>

For this problem, I graded your four most complicated validation functions out of 12.5, and awarded +1 for each novel additional function. 

* test_col_headers (10/12.5) [Though you have altered this from the tutorial function, you could also make this function more sophisticated - what if the column names were in a different order? Or, would it be useful for the function to return which columns were missing?]

* test_sttime (12/12.5) [This is great, but is checking that the seconds are equal to zero the best way to do this? Ideally, you should compare the time stamps of adjacent measurements and make sure they are exactly sixty seconds apart. The experiment may not necessarily always start when seconds are 0.]

* test_neg (11.5/12.5) [Only note is that you could easily make this function more sophisticated by adding more checks - the middur column should be floats and bounded by 60, e.g.]

* test_timediff (12/12.5) [I think it would be more reasonable to look for differences in the `sttime` column for problems with writing to disk (which I think is where Justin actually found it). Also, the incorrect time difference isn't because of the end of session (that occurs at index > 700,000). You also wrote your function cleverly enough to exclude those rows! I think you found a real problem!]