## Temperature and humidity data
The data is in separate `.csv` files in the folder `..\data\TempHumidData_22_7_2016`.
That folder contains subfolders named `BOX <id number>`, and each subfolder holds `.csv` files in the following format:
```
Date/Time,Value,Nest Id,Visit Id
8/01/2014 12:54,57.961,117,4
8/01/2014 13:24,61.458,117,4
8/01/2014 13:54,64.33,117,4
```

Each file will be converted to the following format, for both the temperature and the humidity data:
```
Date/Time,
Value,
Nest Id (comes from the folder name),
filename of the data csv,
md5 hash of the record (for identification of duplicates)
```
The `.csv` files are to be merged into two data sets covering all observations for all nests: `a_Temp_merged.csv` and `a_Humidity_merged.csv`. (**in progress**)
* Files where the filename contains 'ibutton' are excluded.
* Files commencing with sensor metadata will have the metadata stripped during the append.
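
The metadata stripping described above can be sketched as a small helper that scans for the column header row. This is a minimal sketch rather than the notebook's actual implementation: the function name `find_header_row`, the `max_lines` parameter, and the sample metadata lines are assumptions made for illustration.

```python
import csv
import io


def find_header_row(lines, prefix='Date/Time', max_lines=50):
    """Return the 0-based index of the first line starting with prefix.

    Returns None if no such line appears within the first max_lines lines.
    """
    for i, line in enumerate(lines):
        if i >= max_lines:
            break
        if line.startswith(prefix):
            return i
    return None


# Hypothetical file contents: two made-up metadata lines, then the data
sample = (
    'Logger type: example\n'
    'Serial: 0000\n'
    'Date/Time,Unit,Value\n'
    '8/01/2014 12:54,C,28.618\n'
    '8/01/2014 13:24,C,26.619\n'
)

header_row = find_header_row(io.StringIO(sample))
rows = []
if header_row is not None:
    # Keep only the column header and the data lines that follow it
    data_lines = sample.splitlines()[header_row:]
    rows = list(csv.DictReader(data_lines))

print(header_row)
for row in rows:
    print(row['Date/Time'], row['Value'])
```

A file for which the helper returns `None` would be a candidate for the exclusions list.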

Once appending is complete, duplicate records will be extracted from the merged file and recorded in `b2_Temp_duplicates.csv` and `b2_Humidity_duplicates.csv`. Non-duplicates will be recorded in `b1_Temp_nondupes.csv` and `b1_Humidity_nondupes.csv`.
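
The duplicate split could look something like the following. It is only a sketch under stated assumptions: the sample records and shortened `md5` values are made up, and it assumes the first occurrence of each hash counts as the non-duplicate while later repeats count as duplicates.

```python
import pandas as pd

# Hypothetical merged records; the md5 values are shortened placeholders
merged = pd.DataFrame({
    'NestId': ['BOX 117', 'BOX 117', 'BOX 117'],
    'Date/Time': ['8/01/2014 12:54', '8/01/2014 13:24', '8/01/2014 12:54'],
    'Value': [57.961, 61.458, 57.961],
    'md5': ['aaa', 'bbb', 'aaa'],
})

# Assumption: keep the first occurrence of each hash as the
# non-duplicate and treat every later repeat as a duplicate
is_dupe = merged.duplicated(subset='md5', keep='first')
nondupes = merged[~is_dupe]
dupes = merged[is_dupe]

nondupes.to_csv('b1_Temp_nondupes.csv', index=False)
dupes.to_csv('b2_Temp_duplicates.csv', index=False)
```

Passing `keep=False` instead would flag every copy of a repeated record. If the hash includes the source filename, identical readings that arrive from two different files will hash differently and will not be flagged.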

In [84]:
import os
import sys
import pandas as pd
import csv
import logging
from progress.bar import Bar
import hashlib

In [40]:
# clear the log file
try:
    os.remove('temphumidity.log')
except OSError:
    pass
# set up logging
log = logging.getLogger("temphumidity")
log.setLevel(logging.DEBUG)
# log.setLevel(logging.INFO)
fh = logging.FileHandler('temphumidity.log', mode='w', encoding='utf-8')
formatter = logging.Formatter('%(asctime)s ~ %(name)s ~ %(levelname)s ~ %(message)s')
fh.setFormatter(formatter)
log.addHandler(fh)

In [85]:
# initialise variables
folder = os.path.normpath('../data/TempHumidData_22_7_2016')

file_list = [] # all files identified
temp_files = [] # identified temperature files
temp_files_excluded = [] # temperature files that we couldn't process
humid_files = [] # identified humidity files
humid_files_excluded = [] # humidity files that we couldn't process
ibutton_files = [] # iButton files to report, but ignore
bad_extensions = [] # non-csv files to report, but ignore
other_files = [] # csv files not containing 'temp', 'humid' or 'humd' in the file name

In [86]:
# helper functions
def md5_hash(nest, date, unit, value, source):
    # Concatenate the fields of one record and hash their UTF-8 bytes
    record = '{0}{1}{2}{3}{4}'.format(nest, date, unit, value, source)
    return hashlib.md5(record.encode('utf-8')).hexdigest()

In [87]:
print('------------------------------------------------------------')
print('-- Identifying the Temp and Humidity data files           --')
print('------------------------------------------------------------')
# Get the list of all files (file_list)
# Split that list into the following lists:
#  1. Temperature files (contain 'temp' in the filename)
#  2. Humidity files (contain 'humid' or 'humd' in the filename)
#  3. ibutton files (not needed, excluded)
#  4. Bad extensions (anything that is not a csv file)
#  5. Unrecognised (anything that doesn't fit any of the above criteria)

# Consider and categorise each file into the set of lists
for path, subdirs, files in os.walk(folder):
    for file in files:
        fp = os.path.join(path, file)
        file_list.append(fp)
        if not file.endswith('.csv'):
            bad_extensions.append(fp)
        elif 'ibutton' in file.lower():
            ibutton_files.append(fp)
        elif 'temp' in file.lower():
            temp_files.append(fp)
        elif 'humid' in file.lower() or 'humd' in file.lower():
            humid_files.append(fp)
        else:
            other_files.append(fp)
            
# Check that we've captured them all and print out a status report
total_parsed = len(temp_files) + \
    len(humid_files) + \
    len(ibutton_files) + \
    len(bad_extensions) + \
    len(other_files)
print()
print('Total files:        {:>10}'.format(len(file_list)))
print('Temp files:         {:>10}'.format(len(temp_files)))
# for file in temp_files:
#     log.debug('Temp: ' + file)
print('Humidity files:     {:>10}'.format(len(humid_files)))
# for file in humid_files:
#     log.debug('Humidity: ' + file)
print('Bad extension files:{:>10} <- Check log file'.format(len(bad_extensions)))
for file in bad_extensions:
    log.info('Non-csv: ' + file)
print('iButton files:      {:>10} <- Excluded'.format(len(ibutton_files)))
for file in ibutton_files:
    log.debug('iButton: ' + file)
print('Unrecognised files: {:>10} <- Check log file'.format(len(other_files)))
for file in other_files:
    log.info('Unrecognised: ' + file)
print('------------------------------------------------------------')
print('Reconciles:               {0}'.format(total_parsed == len(file_list)))
print('------------------------------------------------------------')
if total_parsed != len(file_list):
    log.warning('Parsed files do not reconcile. Total: ' +
                str(len(file_list)) +
                ' Parsed: ' +
                str(total_parsed))

INFO:temphumidity:Non-csv: ..\data\TempHumidData_22_7_2016\BOX 201\entered into access\201 HUMID AUG 6 TO OCT 25 2014_with deets.xlsx
INFO:temphumidity:Non-csv: ..\data\TempHumidData_22_7_2016\BOX 206\206 HUMID JAN 11 TO MAR 24 2015
INFO:temphumidity:Non-csv: ..\data\TempHumidData_22_7_2016\BOX 219\219 HUMID JAN 11 TO MAR 24 2015
INFO:temphumidity:Non-csv: ..\data\TempHumidData_22_7_2016\BOX 306\306 notes.xlsx
DEBUG:temphumidity:iButton: ..\data\TempHumidData_22_7_2016\BOX 103\ENTERED INTO ACCESS\103 HUMID 03 JUN TO 24 AUG 2015 ibutton info.csv
DEBUG:temphumidity:iButton: ..\data\TempHumidData_22_7_2016\BOX 103\ENTERED INTO ACCESS\103 HUMID 24th AUG TO 27TH OCT 2015 ibutton info.csv
DEBUG:temphumidity:iButton: ..\data\TempHumidData_22_7_2016\BOX 103\ENTERED INTO ACCESS\103 HUMID 27TH OCT 2015 TO 06TH JAN 2016 ibutton info.csv
DEBUG:temphumidity:iButton: ..\data\TempHumidData_22_7_2016\BOX 103\ENTERED INTO ACCESS\103 HUMID 6TH JAN TO 9TH MARCH 2016 ibutton info.csv
DEBUG:temphumidity:iB

------------------------------------------------------------
-- Identifying the Temp and Humidity data files           --
------------------------------------------------------------

Total files:              1728
Temp files:                697
Humidity files:            678
Bad extension files:         4 <- Check log file
iButton files:             348 <- Excluded
Unrecognised files:          1 <- Check log file

INFO:temphumidity:Unrecognised: ..\data\TempHumidData_22_7_2016\BOX 218.csv



------------------------------------------------------------
Reconciles:               True
------------------------------------------------------------


## Issues with temp files
1. Some files use a `TempC` column rather than `Unit` and `Value`; these need to be caught and renamed.
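
One way to handle the issue above is to normalise the columns before hashing and writing. The following is a sketch, not the notebook's final approach: the helper name `normalise_temp_columns` is hypothetical, and it assumes a `TempC` column holds the reading in degrees C.

```python
import pandas as pd


def normalise_temp_columns(df):
    """Return a copy of df that has both 'Value' and 'Unit' columns."""
    df = df.copy()
    # Assumption: a 'TempC' column holds the reading in degrees C
    if 'Value' not in df.columns and 'TempC' in df.columns:
        df = df.rename(columns={'TempC': 'Value'})
    if 'Unit' not in df.columns:
        df['Unit'] = 'C'
    return df


# Hypothetical example of a file that uses TempC
raw = pd.DataFrame({
    'Date/Time': ['8/01/2014 12:54', '8/01/2014 13:24'],
    'TempC': [28.618, 26.619],
})
normalised = normalise_temp_columns(raw)
print(normalised)
```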


In [92]:
print('------------------------------------------------------------')
print('-- Merging the Temp files                                 --')
print('------------------------------------------------------------')
print('\nFiles to merge:\t', len(temp_files), 
      'Commencing setup:', end='', flush=True)
log.debug('TEMP:Files to merge: ' + str(len(temp_files)) + 
          ' Creating output files')

# 1. Take the list of temperature files (temp_files)
# 2. In turn:
#    2.1. Check whether they include a header (and strip it)
#    2.2. Load them into a dataframe
#    2.3. Write/append them to the output csv
#       2.3.1. Write any non-conforming ones to an exclusions file for follow-up

# make a csv to write the data to (delete it first)
try:
    os.remove('a_Temp_merged.csv')
except OSError:
    pass
output = open('a_Temp_merged.csv', 'w', newline='')
try:
    os.remove('a_Temp_merged_excluded.csv')
except OSError:
    pass
output_excluded = open('a_Temp_merged_excluded.csv', 'w', newline='')
print(' done.', flush=True)
log.debug('TEMP:Output file creation complete.')

# 1. Take the list of temperature files (temp_files)
print('### Need to put this all in an iter loop ###')
for index, current_file in enumerate(temp_files):
    df = None
    log.debug('TEMP:Opening ' + str(current_file))

    # 2.1. Check whether they include a header (and strip it)
    # 2.2. Load them into a dataframe
    input_file = open(current_file, 'r')
    line = input_file.readline()
    if line[0:23] == 'Date/Time,Value,Nest Id':
        # Straight to the data, not metadata headers
        log.debug('TEMP:Data only; Loading into memory: {0} '
                  .format(current_file))
        df = pd.read_csv(current_file, encoding='utf-8', engine='python', 
                         parse_dates=True, infer_datetime_format=True)
        log.debug('TEMP:Loaded into memory: {0}'.format(current_file))
    else:
        # There are metadata headers, strip them out
        log.debug('TEMP:Stripping header: {0!s}'.format(current_file))
        i = 1
        while i<50 and line[0:9] != 'Date/Time':
            # we expect the column header on line 20, 
            #    but we read in the first 50 lines to be safe
            line = input_file.readline() # reads the next line in the file
            i += 1
            # at exit of the loop, we've either found the column 
            #    header at line i or we hit line 50 without finding it
        if line[0:9] == 'Date/Time':
            # we found the column header row
            log.debug('TEMP:Found column header at line {0!s} - {1}'
                      .format(i, line))
            log.debug('TEMP:Loading into memory: {0}'.format(current_file))
            df = pd.read_csv(current_file, encoding='utf-8', engine='python',
                             skiprows=i-1, parse_dates=True, 
                             infer_datetime_format=True)
            log.debug('TEMP:Loaded into memory: {0}'.format(current_file))
        else:
            # we haven't found the column header, so we don't know what 
            #    this file is. Exclude it.
            log.warning('TEMP:Did not find the column header for {0}, excluding.'
                        .format(current_file))
            temp_files_excluded.append(current_file)
            df = None
    # finished inspecting the header lines, release the file handle
    input_file.close()

    # 2.3. Write/append them to the output csv
    try:
        log.debug('TEMP:Normalising data fields: {0}'.format(current_file))
        columns = ['NestId', 'Date/Time', 'Unit', 'Value', 'SourceFile', 'md5']
        # Include temp units if not listed. Always degrees C.
        if 'Unit' not in df.columns:
            df['Unit'] = 'C'
        # Add/overwrite the Nest Id to always use the folder name
        # Below expression strips off the base temphumidity folder 
        #    path and takes the next directory (the box#)
        if 'Nest Id' in df.columns:
            df = df.drop(columns='Nest Id')
        nestid = current_file[len(folder)+1:].split(os.sep)[0]
        df['NestId'] = nestid
        df['SourceFile'] = current_file
        # Hash each record separately; passing whole columns to md5_hash
        #    would give every row the same hash
        df['md5'] = [md5_hash(nestid, date, unit, value, current_file)
                     for date, unit, value
                     in zip(df['Date/Time'], df['Unit'], df['Value'])]
        log.debug('TEMP:Normalised data fields: {0}'.format(current_file))
        log.debug('TEMP:Appending to merged file: {0}'.format(current_file))
        # write the header only while the output file is still empty,
        #    so it is not lost if the first file was excluded
        df.to_csv(output, header=(output.tell() == 0), columns=columns)
        log.debug('TEMP:Appended to merged file: {0}'.format(current_file))
    except AttributeError:
        # df is None when the file was excluded above; skip it
        pass
        
print(temp_files_excluded)

# close the files
output.close()
output_excluded.close()

------------------------------------------------------------
-- Merging the Temp files                                 --
------------------------------------------------------------

Files to merge:	 697 Commencing setup:

DEBUG:temphumidity:TEMP:Files to merge: 697 Creating output files


 done.


DEBUG:temphumidity:TEMP:Output file creation complete.
DEBUG:temphumidity:TEMP:Opening ..\data\TempHumidData_22_7_2016\108 TEMP 09 APR TO 03 JUL 14.csv
DEBUG:temphumidity:TEMP:Stripping header: ..\data\TempHumidData_22_7_2016\108 TEMP 09 APR TO 03 JUL 14.csv
DEBUG:temphumidity:TEMP:Found column header at line 20 - Date/Time,Unit,Value

DEBUG:temphumidity:TEMP:Loading into memory: ..\data\TempHumidData_22_7_2016\108 TEMP 09 APR TO 03 JUL 14.csv
DEBUG:temphumidity:TEMP:Loaded into memory: ..\data\TempHumidData_22_7_2016\108 TEMP 09 APR TO 03 JUL 14.csv
DEBUG:temphumidity:TEMP:Normalising data fields: ..\data\TempHumidData_22_7_2016\108 TEMP 09 APR TO 03 JUL 14.csv
DEBUG:temphumidity:TEMP:Normalised data fields: ..\data\TempHumidData_22_7_2016\108 TEMP 09 APR TO 03 JUL 14.csv
DEBUG:temphumidity:TEMP:Appending to merged file: ..\data\TempHumidData_22_7_2016\108 TEMP 09 APR TO 03 JUL 14.csv
DEBUG:temphumidity:TEMP:Appended to merged file: ..\data\TempHumidData_22_7_2016\108 TEMP 09 APR TO 0

### Need to put this all in an iter loop ###


KeyError: 'Value'

In [83]:
import hashlib

def md5_hash(nest, date, unit, value, source):
    # Concatenate the fields of one record and hash their UTF-8 bytes
    record = '{0}{1}{2}{3}{4}'.format(nest, date, unit, value, source)
    return hashlib.md5(record.encode('utf-8')).hexdigest()

md5_hash('BOX 117', '8/01/2014 12:54', 'C', '28.618', current_file)

'3c069f3b8333495c9b75414175107c34'

In [74]:
nestid = current_file[len(folder)+1:].split('\\')[0]
nestid
# nestid = nestid[1:]
# nestid = nestid[]

'BOX  117'
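
The double space in the result above suggests the folder name needs tidying before it is used as the nest id. A possible approach is sketched below; the helper name `nest_id_from_path` and the example file name are hypothetical, and `PureWindowsPath` is used only so the Windows-style paths behave the same on any platform.

```python
from pathlib import PureWindowsPath


def nest_id_from_path(file_path, base_folder):
    """Return the first directory below base_folder, whitespace collapsed."""
    relative = PureWindowsPath(file_path).relative_to(
        PureWindowsPath(base_folder))
    first_part = relative.parts[0]
    # split() with no arguments splits on any run of whitespace
    return ' '.join(first_part.split())


# Hypothetical path with a double space in the folder name
base = r'..\data\TempHumidData_22_7_2016'
path = base + r'\BOX  117\117 TEMP EXAMPLE.csv'
nest_id = nest_id_from_path(path, base)
print(nest_id)
```

A file that sits directly in the base folder has no box folder, so this would return the file name; such files would need separate handling.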

In [80]:
# temp_files[:3]
df

Unnamed: 0,Date/Time,Value,Nest Id,Unit
0,8/01/2014 12:54,28.618,BOX 117,C
1,8/01/2014 13:24,26.619,BOX 117,C
2,8/01/2014 13:54,26.619,BOX 117,C
3,8/01/2014 14:24,25.619,BOX 117,C
4,8/01/2014 14:54,25.619,BOX 117,C
5,8/01/2014 15:24,24.619,BOX 117,C
6,8/01/2014 15:54,24.619,BOX 117,C
7,8/01/2014 16:24,24.119,BOX 117,C
8,8/01/2014 16:54,23.619,BOX 117,C
9,8/01/2014 17:24,23.119,BOX 117,C


In [None]:
# raw string so the backslashes are not treated as escape sequences
with open(os.path.normpath(r'..\data\TempHumidData_22_7_2016\108 TEMP 09 APR TO 03 JUL 14.csv'), 'r') as inf:
    has_header = csv.Sniffer().has_header(inf.read(1024))

has_header

In [None]:
temp_files[0]