# 1A Data Preparation
This script performs the following tasks:
1. Set up the environment
2. Load the relevant data sets from file
 * `NestCharacteristic-Static.csv   -> df_nest_static`
 * `NestCharacteristic-Seasonal.csv -> df_nest_seasonal`
 * `BreedingDataCombined.csv        -> df_breeding`
 * `TempData_2_10_2016.txt          -> df_temp`
 * `HumidData_2_10_2016.txt         -> df_humd`
3. Join them into a) a data file with nests, seasonal coverage and breeding observations and b) a data file for transactional sensor data
4. Add additional computed features to the data
5. Write the prepared data to file
 * `SensorDataWithBreedingPhase.csv` contains the temp and humidity logs, together with the breeding phase for that nest.
 * `NestDataWithBreedingStats.csv` contains all static nest masterdata, together with nest cover and breeding observations

## 1. Set up the environment
### 1.1 Import the required libraries
The common libraries used throughout this script are imported below. If an import statement fails, install the missing library in your environment from the command line with `pip install <library>`.

In [5]:
print('Setting up environment and variables...', flush=True)
import pandas as pd
import os
import numpy as np
import datetime
import time

# all the useful and reusable functions are defined in helper_functions.py
from helper_functions import *

Setting up environment and variables...


### 1.2 Set up the variables
You will need to change the values of the variables below to suit the names and directory location of your files to be loaded.

In [6]:
# update these with your file paths
nest_static_file = os.path.normpath('./0_data/NestStaticDataTbl_access_27_01_2017.csv')
nest_seasonal_file = os.path.normpath('./0_data/NestSeasonDataTbl_access_27_01_2017.csv')
nest_annual_file = os.path.normpath('./0_data/NestAnnualDataTbl_access_28_01_2017.csv')
breeding_data_file = os.path.normpath('./0_data/BreedingDataTbl_17_01_2017.csv')
temp_humid_file = os.path.normpath('./0_data/TempHumid_all.txt')

# write intermediate tables to disk for debugging purposes
write_temps = True
output_path = os.path.normpath('./output/A_load_and_combine_data')
df_sensor_data = None

log('Done.')

Mon Apr 10 20:50:34 2017 - Done.


### 1.3 Set up helper functions
Most helper functions are in `helper_functions.py`. The functions below are defined in this module so that they can use its variables (for example `write_temps` and `output_path`).

In [7]:
def write_temp_file(df, filepath, df_name):
    '''
    If write_temps is true, this function will write the specified Pandas dataframe (df) to csv at the specified location (filepath).
    Variables:
        df: a Pandas dataframe to be written to csv.
        filepath: a string in Unix path format (using / not \) for the csv destination.
        df_name: human readable name or description of the dataframe for logging purposes.
    '''
    if write_temps:
        print('{0} - Writing intermediate table {1} to disk.'.format(str(time.ctime()), df_name), flush=True)
        if not os.path.exists(output_path):
            os.makedirs(output_path)
        df.to_csv(os.path.normpath(filepath))
        if os.path.getsize(filepath) > 0:
            print('{0} - Written {1}: {2:.3f} MB'.format(str(time.ctime()), filepath, os.path.getsize(filepath)/1000000), flush=True)
            
def breeding_year(date):
    '''
    Breeding year is defined as 01 Feb to 31 Jan using the year as at 01 Feb.
    Arguments:
        date is the datetime object representing the observation date to be converted into a breeding year.
    Returns:
        The Breeding Year for the provided date. A missing date (NaT) gives NaN, so a column
        with gaps ends up as float (because INTs don't handle NaNs).
    '''
    if date.month == 1:
        return date.year - 1
    else:
        return date.year
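
As a quick check of the Feb-to-Jan rule, the sketch below restates `breeding_year` so that it runs on its own and applies it to a few boundary dates. The vectorised variant `breeding_year_vectorised` is an assumption added for illustration and is not part of the original script.

```python
import pandas as pd


def breeding_year(date):
    # restated from the cell above so this example is self-contained
    if date.month == 1:
        return date.year - 1
    return date.year


def breeding_year_vectorised(dates):
    # hypothetical alternative: applies the same rule to a whole datetime Series
    # (NaT stays missing, so the result is float when there are gaps)
    return dates.dt.year - (dates.dt.month == 1).astype(int)


samples = pd.Series(pd.to_datetime(['2016-01-31', '2016-02-01', '2016-12-31', '2017-01-15']))
per_row = samples.apply(breeding_year).tolist()
vectorised = breeding_year_vectorised(samples).tolist()
print(per_row)     # [2015, 2016, 2016, 2016]
print(vectorised)  # [2015, 2016, 2016, 2016]
```

Both January dates are assigned to the year in which their breeding season started.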

## 2. Load the data from file
### 2.1.1 Read in the NestCharacteristic-Static data (df_nest_static)
This is the real nest master data to which everything else is joined. Refer to the GitHub Wiki for descriptions of the data fields.

In [8]:
read_file_handler_start(nest_static_file, 'Nest Characteristic (Static)')
data_types = {'nest_id': str,
              'nest_type': str,
              'shape': str,
              'easting': np.float32,
              'northing': np.float32,
              'elevation': np.float32,
              'aspect': np.float32,
              'slope': np.float32,
              'treatment': str,
              'duration_of_insolation': np.float32,
              'box_height_mm': np.float32,
              'box_length_mm': np.float32,
              'box_width_mm': np.float32,
              'box_wall_width_mm': np.float32,
              'box_lid_depth': np.float32,
              'entrance_bearing': np.float32,
              'entrance_height': np.float32,
              'entrance_length': np.float32,
              'entrance_width': np.float32,
              'vents': np.float32,
              'tunnel': np.float32,
              'distance_to_boardwalk_m': np.float32,
              'distance_to_landfall_m': np.float32,
              'distance_to_shore_vegetation_m': np.float32,
              'box_vol_L': np.float32,
              'box_area_cm2': np.float32,
              'comment': str,
              'lat': np.float32,
              'long': np.float32,
              'autoNumber': np.float32
             }
df_nest_static = pd.read_csv(nest_static_file, 
                             header=0,
                             dtype=data_types,
                             encoding='utf-8',
                             error_bad_lines=True,
                             warn_bad_lines=True)
read_file_handler_end(nest_static_file, 'Nest Characteristic (Static)', df_nest_static, 'df_nest_static')

Mon Apr 10 20:50:34 2017 - Loading the Nest Characteristic (Static) data file.
Mon Apr 10 20:50:34 2017 - Nest Characteristic (Static) file is 0.039 MB.
Mon Apr 10 20:50:34 2017 - Loading into memory.
Mon Apr 10 20:50:34 2017 - Success: loaded 247 records.


### 2.1.2 Update and cleanse fields in NestCharacteristic-Static data (df_nest_static)
* Make sure all the nest IDs are uppercase and trimmed
* Create field `box_vol_L`
* Create field `box_area_cm2`

In [9]:
# drop the unwanted fields
df_nest_static = df_nest_static[['nest_id', 'nest_type', 'shape', 'easting', 'northing', 'elevation',
                                 'aspect', 'slope', 'treatment', 'duration_of_insolation',
                                 'box_height_mm', 'box_length_mm', 'box_width_mm', 'box_wall_width_mm',
                                 'box_lid_depth', 'entrance_bearing', 'entrance_height',
                                 'entrance_length', 'entrance_width', 'vents', 'tunnel',
                                 'distance_to_boardwalk_m', 'distance_to_landfall_m',
                                 'distance_to_shore_vegetation_m', 'comment'
                                ]].copy()

# make sure the nest IDs are all caps
df_nest_static['nest_id'] = df_nest_static['nest_id'].apply(lambda x: x.strip()).apply(lambda x: x.upper())

# calc the volume
# some boxes have only external measurements, not internal (which we are trying to calc). If it has external
# but not internal, then use external measurements
def box_vol_L(row):
    # box measurements are external, so deduct walls. Boxes have no bottom, so deduct only lid depth from height.
    return (row['box_length_mm'] - 2*row['box_wall_width_mm']) * (row['box_width_mm'] - 2*row['box_wall_width_mm']) * (row['box_height_mm']- row['box_lid_depth']) / 1000000
df_nest_static['box_vol_L'] = df_nest_static.apply(box_vol_L, axis=1)
        
# calc the floor area
def box_area_cm2(row):
    return (row['box_length_mm'] - 2*row['box_wall_width_mm']) * (row['box_width_mm'] - 2*row['box_wall_width_mm']) / 100
df_nest_static['box_area_cm2'] = df_nest_static.apply(box_area_cm2, axis=1)

# create the output directory if it doesn't exist
if not os.path.exists(os.path.normpath('./output/A_load_and_combine_data')):
    os.makedirs('./output/A_load_and_combine_data')
write_temp_file(df_nest_static, './output/A_load_and_combine_data/df_nest_static.csv', 'df_nest_static')
log('df_nest_static prepared successfully.\n')

Mon Apr 10 20:50:34 2017 - Writing intermediate table df_nest_static to disk.
Mon Apr 10 20:50:34 2017 - Written ./output/A_load_and_combine_data/df_nest_static.csv: 0.048 MB
Mon Apr 10 20:50:34 2017 - df_nest_static prepared successfully.
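
To make the unit conversions in `box_vol_L` and `box_area_cm2` concrete, here is a minimal worked example on a single made-up box. The dimensions are hypothetical and the calculation is written column-wise rather than with `apply`, but the arithmetic follows the cell above: 1 L is 1,000,000 mm³ and 1 cm² is 100 mm².

```python
import pandas as pd

# hypothetical box: 400 x 300 x 250 mm external, 20 mm walls, 20 mm lid
boxes = pd.DataFrame({
    'box_length_mm': [400.0],
    'box_width_mm': [300.0],
    'box_height_mm': [250.0],
    'box_wall_width_mm': [20.0],
    'box_lid_depth': [20.0],
})

# external measurements minus two walls (length, width) or the lid (height)
inner_length = boxes['box_length_mm'] - 2 * boxes['box_wall_width_mm']
inner_width = boxes['box_width_mm'] - 2 * boxes['box_wall_width_mm']
inner_height = boxes['box_height_mm'] - boxes['box_lid_depth']

boxes['box_vol_L'] = inner_length * inner_width * inner_height / 1000000  # mm3 -> L
boxes['box_area_cm2'] = inner_length * inner_width / 100                  # mm2 -> cm2
print(boxes[['box_vol_L', 'box_area_cm2']])
```

For this box the internal space is 360 x 260 x 230 mm, which gives 21.528 L and a floor area of 936 cm².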



### 2.2.1 Read in the NestCharacteristic-Seasonal data (as df_nest_seasonal)
Recorded for old boxes and natural nests. Contains seasonal observations of nest vegetation and cover.
New boxes (not recorded) were an experiment in different building methods and their effect on box temperature. 

In [10]:
read_file_handler_start(nest_seasonal_file, 'Nest Characteristic (Seasonal)')

data_types = {'type': str,
              'nest_id': str,
              'NestSeasYear': str,
              'date': str,
              'year': str,
              'season': str,
              'BoxCoverTotal': np.float32,
              'BoxCoverDead': np.float32,
              'BoxWood': np.float32,
              'BoxWoodDead': np.float32,
              'BoxVeg': np.float32,
              'BoxVegDead': np.float32,
              'QuadCoverTotal': np.float32,
              'QuadCoverDead': np.float32,
              'QuadWood': np.float32,
              'QuadWoodDead': np.float32,
              'QuadVeg': np.float32,
              'QuadVegDead': np.float32,
              'cavity_cover': np.float32,
              'canopy_cover': np.float32,
              'comments': str
             }
df_nest_seasonal = pd.read_csv(nest_seasonal_file,
                               header=0,
                               dtype=data_types,
                               encoding='utf-8',
                               parse_dates=['date'],
                               dayfirst=True,
                               error_bad_lines=True,
                               warn_bad_lines=True)

read_file_handler_end(nest_seasonal_file, 'Nest Characteristic (Seasonal)', df_nest_seasonal, 'df_nest_seasonal')

Mon Apr 10 20:50:34 2017 - Loading the Nest Characteristic (Seasonal) data file.
Mon Apr 10 20:50:34 2017 - Nest Characteristic (Seasonal) file is 0.120 MB.
Mon Apr 10 20:50:34 2017 - Loading into memory.
Mon Apr 10 20:50:34 2017 - Success: loaded 1,929 records.


### 2.2.2 Update and cleanse fields in the NestCharacteristic-Seasonal data (df_nest_seasonal)
* Nest IDs to be all uppercase and trimmed
* recalculate the `year` and `season`
* create the unique ID `NestSeasYear`

In [11]:
# drop unwanted fields: 'year', 'season', 'month',  'NestSeasYear'
df_nest_seasonal = df_nest_seasonal[['nest_id', 'BoxCoverTotal', 'BoxCoverDead',
                                     'BoxWood', 'BoxWoodDead', 'BoxVeg', 'BoxVegDead', 'QuadCoverTotal',
                                     'QuadCoverDead', 'QuadWood', 'QuadWoodDead', 'QuadVeg', 'QuadVegDead',
                                     'comments', 'date', 'cavity_cover', 'canopy_cover'
                                    ]].copy()

# make sure the nest IDs are all caps
df_nest_seasonal['nest_id'] = df_nest_seasonal['nest_id'].apply(lambda x: x.strip()).apply(lambda x: x.upper())

# calculate the breeding year (Feb to Jan)
df_nest_seasonal['breeding_year'] = df_nest_seasonal['date'].apply(lambda x: breeding_year(x))

# recalculate season (because was manually created). See helper_functions.py
df_nest_seasonal['season'] = df_nest_seasonal['date'].apply(lambda x: season(x))

# calc the unique ID
df_nest_seasonal['NestSeasYear'] = df_nest_seasonal['nest_id'] + df_nest_seasonal['season'] + df_nest_seasonal['breeding_year'].apply(lambda x: str(x))

# all blank canopy_covers should be 0 not NaN
df_nest_seasonal['canopy_cover'] = df_nest_seasonal['canopy_cover'].apply(lambda x: 0 if np.isnan(x) else x)

write_temp_file(df_nest_seasonal, './output/A_load_and_combine_data/df_nest_seasonal.csv', 'df_nest_seasonal')

Mon Apr 10 20:50:34 2017 - Writing intermediate table df_nest_seasonal to disk.
Mon Apr 10 20:50:34 2017 - Written ./output/A_load_and_combine_data/df_nest_seasonal.csv: 0.158 MB


### 2.3.1 Read in the Annual Nest Data file (as df_nest_annual)

In [12]:
in_file = nest_annual_file
desc = 'Nest Characteristic (Annual)'

read_file_handler_start(in_file, desc)
data_types = {
    'nest_id': str,
    'year': np.float32,
    'nest_year': str,
    'is_discrete_bush': str,
    'vegetation_cover': np.float32,
    'veg_cover_species': str,
    'tetragonia': np.float32,
    'rhagodia': np.float32,
    'acacia': np.float32,
    'bush_height': np.float32,
    'budh_length': np.float32,
    'bush_width': np.float32,
    'bush_wall_depth': np.float32,
    'cavity_height': np.float32,
    'cavity_length': np.float32,
    'cavity_width': np.float32,
    'cavity_volume': np.float32,
    'cavity_area': np.float32,
    'entrance_bearing': np.float32,
    'entrance_direction': str,
    'entrance_height': np.float32,
    'entrance_length': np.float32,
    'entrance_width': np.float32,
    'distance_to_nearest_neighbour_m': np.float32,
    'notes': str,
    'observation_date': str,
}
df_nest_annual = pd.read_csv(nest_annual_file,
                          header=0, 
                          dtype=data_types,
                          encoding='utf-8',
                          parse_dates=['observation_date'],
                          dayfirst=True,
                          error_bad_lines=True,
                          warn_bad_lines=True)

read_file_handler_end(in_file, desc, df_nest_annual, 'df_nest_annual')

Mon Apr 10 20:50:34 2017 - Loading the Nest Characteristic (Annual) data file.
Mon Apr 10 20:50:34 2017 - Nest Characteristic (Annual) file is 0.064 MB.
Mon Apr 10 20:50:34 2017 - Loading into memory.
Mon Apr 10 20:50:35 2017 - Success: loaded 822 records.


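### 2.3.2 Update and cleanse fields in the Annual Nest data (df_nest_annual)
* Nest IDs to be all uppercase and trimmed
* recreate `veg_cover_species`
* calculate the `breeding_year`
* recalculate `cavity_area_cm2` and `cavity_volume_L`

In [13]: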
# drop unwanted fields: 'year', 'nest_year', 'veg_cover_species', 'cavity_volume', 'cavity_area',
df_nest_annual = df_nest_annual[[
        'nest_id', 'is_discrete_bush', 'vegetation_cover',
       'tetragonia', 'rhagodia', 'acacia', 'bush_height',
       'budh_length', 'bush_width', 'bush_wall_depth', 'cavity_height',
       'cavity_length', 'cavity_width', 
       'entrance_bearing', 'entrance_direction', 'entrance_height',
       'entrance_length', 'entrance_width', 'distance_to_nearest_neighbour_m',
       'notes', 'observation_date'
    ]].copy()

# make sure the nest IDs are all caps
df_nest_annual['nest_id'] = df_nest_annual['nest_id'].apply(lambda x: x.strip()).apply(lambda x: x.upper())

# recreate the veg_cover_species
def veg_cover_species(row):
    result = ''
    if not np.isnan(row['tetragonia']) and row['tetragonia'] > 0:
        result = 'Tetr '
    if not np.isnan(row['rhagodia']) and row['rhagodia'] > 0:
        result = '{0}{1}'.format(result, 'Rhag ')
    if not np.isnan(row['acacia']) and row['acacia'] > 0:
        result = '{0}{1}'.format(result, 'Acac ')
    return result
df_nest_annual['veg_cover_species'] = df_nest_annual.apply(veg_cover_species, axis=1)

# calculate the breeding_year
df_nest_annual['breeding_year'] = df_nest_annual['observation_date'].apply(lambda x: breeding_year(x))

# recalculate cavity_area in cm2. L, W, H are in mm.
df_nest_annual['cavity_area_cm2'] = df_nest_annual['cavity_length'] * df_nest_annual['cavity_width'] / 100

# recalculate cavity_volume in L, assumed a half ellipsoid. Vol of a full ellipsoid is 4/3 * pi * abc where a,b,c are the
# radii (semi-axes), so a half ellipsoid is 1/2 * 4/3 * pi * abc. L, W, H are in mm; 1 L = 1,000,000 mm3.
df_nest_annual['cavity_volume_L'] = ((0.5 * (4 / 3) * np.pi * 
                                      (df_nest_annual['cavity_length']/2) * 
                                      (df_nest_annual['cavity_width']/2) * 
                                      (df_nest_annual['cavity_height']/2)
                                     ) / 1000000).round(2)

write_temp_file(df_nest_annual, './output/A_load_and_combine_data/df_nest_annual.csv', 'df_nest_annual')

Mon Apr 10 20:50:35 2017 - Writing intermediate table df_nest_annual to disk.
Mon Apr 10 20:50:35 2017 - Written ./output/A_load_and_combine_data/df_nest_annual.csv: 0.072 MB
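
As a worked check of the half-ellipsoid volume, the sketch below computes it for one made-up cavity. A full ellipsoid with semi-axes a, b, c has volume 4/3 · π · abc, so half of it is 2/3 · π · abc. The function name and the dimensions are hypothetical; following the cell above, all three measurements are halved to obtain the semi-axes.

```python
import math


def half_ellipsoid_volume_litres(length_mm, width_mm, height_mm):
    # semi-axes: half of each measured dimension (same convention as the cell above)
    a, b, c = length_mm / 2, width_mm / 2, height_mm / 2
    # half of the full ellipsoid volume 4/3 * pi * a * b * c
    volume_mm3 = (2 / 3) * math.pi * a * b * c
    return volume_mm3 / 1000000  # mm3 -> L


volume = half_ellipsoid_volume_litres(400, 300, 200)
print(round(volume, 2))  # 6.28
```

With semi-axes of 200, 150 and 100 mm the product abc is 3,000,000 mm³, so the half ellipsoid holds 2,000,000 · π mm³, or about 6.28 L.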


### 2.4.1 Read in the BreedingDataCombined file (as df_breeding)

In [14]:
read_file_handler_start(breeding_data_file, 'Breeding')
data_types = {'nest_id': str,
              'observation_date': str,
              'Year': str,
              'Month': str,
              'ActivityStatus': np.float32,
              'adult': np.float32,
              'clutch': np.float32,
              'eggs': np.float32,
              'ChicksAlive': np.float32, # running obs, ignore
              'ChicksDead': np.float32, # use sum
              'TotalChicksHatch': np.float32, # use max
              'ChicksAge': np.float32, # use max
              'ChicksFledge': np.float32, # use max
              'ChicksMissing': np.float32, # ignore
              'ContentsNotVisible': np.float32, # ignore
              'EggLayDate': str, # use max, avg or min
              'IDChick1': np.float32,
              'MassChick1': np.float32, # use max
              'IDChick2': np.float32,
              'MassChick2': np.float32, # use max
              'comments': str
             }
df_breeding = pd.read_csv(breeding_data_file,
                          header=0, 
                          dtype=data_types,
                          encoding='utf-8',
                          parse_dates=['observation_date', 'EggLayDate'],
                          dayfirst=True,
                          error_bad_lines=True,
                          warn_bad_lines=True)
read_file_handler_end(breeding_data_file, 'Breeding', df_breeding, 'df_breeding')

Mon Apr 10 20:50:35 2017 - Loading the Breeding data file.
Mon Apr 10 20:50:35 2017 - Breeding file is 0.711 MB.
Mon Apr 10 20:50:35 2017 - Loading into memory.
Mon Apr 10 20:50:38 2017 - Success: loaded 16,608 records.


### 2.4.2 Update and cleanse fields in the Breeding data (df_breeding)
* Nest IDs to be all uppercase and trimmed
* `breeding_year` is the Feb-to-Jan breeding year of `observation_date`

In [15]:
# drop unwanted columns: 'Year', 'Month', 'ChicksMissing', 'ContentsNotVisible', 
df_breeding = df_breeding[[
       'nest_id', 'observation_date', 'ActivityStatus',
       'adult', 'clutch', 'eggs', 'ChicksDead', 'ChicksAlive', 
       'TotalChicksHatch', 'ChicksAge', 'ChicksFledge', 
       'EggLayDate', 'comments', 'IDChick1',
       'MassChick1', 'IDChick2', 'MassChick2'
    ]].copy()

# make sure the nest IDs are all caps
df_breeding['nest_id'] = df_breeding['nest_id'].apply(lambda x: str(x)).apply(lambda x: x.strip()).apply(lambda x: x.upper())

# create the breeding_year field (Feb to Jan)
df_breeding['breeding_year'] = df_breeding['observation_date'].apply(lambda x: breeding_year(x))

# convert the ActivityStatus into separate columns:
# 0 - no activity at all
# 1 - some evidence of nesting activity
# 2 - breeding initiated (egg laid)
# 3 - moulting activity
df_breeding['used_for_nesting'] = df_breeding['ActivityStatus'].apply(lambda x: 1 if x==1 else 0)
df_breeding['used_for_breeding'] = df_breeding['ActivityStatus'].apply(lambda x: 1 if x==2 else 0)
df_breeding['used_for_moulting'] = df_breeding['ActivityStatus'].apply(lambda x: 1 if x==3 else 0)
df_breeding['used_for_nesting_or_breeding'] = df_breeding['ActivityStatus'].apply(lambda x: 1 if x==1 or x==2 else 0)

write_temp_file(df_breeding, './output/A_load_and_combine_data/df_breeding.csv', 'df_breeding')

Mon Apr 10 20:50:38 2017 - Writing intermediate table df_breeding to disk.
Mon Apr 10 20:50:38 2017 - Written ./output/A_load_and_combine_data/df_breeding.csv: 0.970 MB
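
The mapping from `ActivityStatus` codes to the four `used_for_*` flags can be illustrated on a toy column. This is a sketch with made-up observations that uses vectorised comparisons instead of `apply`; a missing status yields 0 in every flag, as it does with the lambdas above.

```python
import numpy as np
import pandas as pd

# one row per status code, plus a missing value
obs = pd.DataFrame({'ActivityStatus': [0, 1, 2, 3, np.nan]})

obs['used_for_nesting'] = (obs['ActivityStatus'] == 1).astype(int)
obs['used_for_breeding'] = (obs['ActivityStatus'] == 2).astype(int)
obs['used_for_moulting'] = (obs['ActivityStatus'] == 3).astype(int)
obs['used_for_nesting_or_breeding'] = obs['ActivityStatus'].isin([1, 2]).astype(int)
print(obs)
```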


### 2.4.3 Aggregate the Breeding data to get annual stats
* **nest_id**
* **breeding_year**
* **clutch**
* clutch_count
* egg_count
* chick_count
* fledge_count
* lay_date
* age_at_fledging
* mass_at_fledging_chick1
* mass_at_fledging_chick2
* chick_id1
* chick_id2

Add field:
* `flag_activity_status`: True iff max(ActivityStatus) in year > 0. Note that ActivityStatus was not recorded for the numeric nest_ids, so this field should not be used for 'usage'.

In [16]:
log('Aggregating breeding data to get annual stats.')

# get the clutches per nest and year
# [[chosen columns]] -> groupby -> apply max -> add suffix -> remove multi-index
df_clutch_count = df_breeding[['nest_id', 'breeding_year', 'clutch']
                             ].groupby(['nest_id', 'breeding_year']).max().add_suffix('_count').reset_index()

# get the annual stats per nest, year and clutch
temp = df_breeding[['nest_id', 'breeding_year', 'ActivityStatus', 'clutch', 'eggs', 'ChicksDead', 'TotalChicksHatch',
                    'ChicksFledge', 'EggLayDate', 'ChicksAge', 'MassChick1', 'MassChick2'
                   ]].copy()
# named aggregation (pandas 0.25+): output_column=(input_column, function)
df_breeding_gb = temp.groupby(['nest_id', 'breeding_year', 'clutch']).agg(
    egg_count=('eggs', 'max'),
    num_deaths=('ChicksDead', 'sum'),
    num_hatched=('TotalChicksHatch', 'max'),
    num_fledged=('ChicksFledge', 'max'),
    EggLayDate=('EggLayDate', 'max'),
    age_at_fledge=('ChicksAge', 'max'),
    peak_mass_chick1=('MassChick1', 'max'),
    peak_mass_chick2=('MassChick2', 'max')
).reset_index()
# named aggregation gives flat column names, so there is no multi-index to drop and
# nest_id, breeding_year and clutch come back as ordinary columns from reset_index()
df_breeding_gb['EggLayDate'] = pd.to_datetime(df_breeding_gb['EggLayDate']) # confirmed the max EggLayDate is correct when converted back to datetime

write_temp_file(df_clutch_count, './output/A_load_and_combine_data/df_clutch_count.csv', 'df_clutch_count')
write_temp_file(df_breeding_gb, './output/A_load_and_combine_data/df_breeding_gb.csv', 'df_breeding_gb')
del temp

Mon Apr 10 20:50:38 2017 - Aggregating breeding data to get annual stats.
Mon Apr 10 20:50:38 2017 - Writing intermediate table df_clutch_count to disk.
Mon Apr 10 20:50:38 2017 - Written ./output/A_load_and_combine_data/df_clutch_count.csv: 0.015 MB
Mon Apr 10 20:50:38 2017 - Writing intermediate table df_breeding_gb to disk.
Mon Apr 10 20:50:38 2017 - Written ./output/A_load_and_combine_data/df_breeding_gb.csv: 0.015 MB
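
The `flag_activity_status` field described in this section is not computed in the cell as shown. One possible way to derive it is sketched below on made-up observations: take the maximum `ActivityStatus` per nest and breeding year and test whether it is greater than 0. The data and the choice of `groupby(...).max().gt(0)` are assumptions, not the original implementation.

```python
import numpy as np
import pandas as pd

# toy observations; nest '101' stands in for a numeric nest_id with no ActivityStatus recorded
obs = pd.DataFrame({
    'nest_id': ['A1', 'A1', 'A1', 'B2', 'B2', '101'],
    'breeding_year': [2015, 2015, 2016, 2015, 2015, 2015],
    'ActivityStatus': [0, 2, 0, 0, 0, np.nan],
})

flags = (obs.groupby(['nest_id', 'breeding_year'])['ActivityStatus']
            .max()                        # highest status seen in the nest-year
            .gt(0)                        # True iff that maximum is above 0
            .rename('flag_activity_status')
            .reset_index())
print(flags)
```

A nest-year with no recorded status has a maximum of NaN, which compares as not greater than 0, so its flag is False. That is consistent with the caution above about not using this field as a measure of usage.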


### 2.5.1 Load the combined temperature and humidity data
Read the temperature and humidity data file into memory and report on success/failure.

In [17]:
read_file_handler_start(temp_humid_file, 'Temperature and Humidity')
column_names = ['recnum', 'datetime', 'nest_id', 'humidity', 'temp_c']
data_types = {'recnum': np.int32, 
              'datetime': str, 
              'nest_id': str,
              'humidity': np.float32,
              'temp_c': np.float32
             }
df_sensor_data = pd.read_csv(temp_humid_file,
                      names=column_names,
                      usecols=[0,1,2,3,4],
                      dtype=data_types,
#                           nrows=10000,
                      parse_dates=['datetime'],
                      infer_datetime_format=True,
                      dayfirst=True,
                      encoding='utf-8',
                      error_bad_lines=True,
                      warn_bad_lines=True
                     )
read_file_handler_end(temp_humid_file, 'Temperature and Humidity', df_sensor_data, 'df_sensor_data')

Mon Apr 10 20:50:38 2017 - Loading the Temperature and Humidity data file.
Mon Apr 10 20:50:38 2017 - Temperature and Humidity file is 112.663 MB.
Mon Apr 10 20:50:38 2017 - Loading into memory. Please be patient.
Mon Apr 10 20:50:51 2017 - Success: loaded 2,505,456 records.


## 3. Join the data and add computed features
### 3.1 Calculations per sensor reading
The following calculations are added per sensor reading:
* `breeding_year`: the Feb-to-Jan breeding year of the reading (January readings count towards the previous year)
* `temp_bucket`: a category for each 5 °C temperature range: <0, 0-5, ..., 60+
* `humidity_bucket`: a category for roughly 20% humidity ranges, based on human comfort zones
* `average_activity_phase`: the average activity conducted at the time of the observation

Note: An `actual_activity_phase` (the current phase of breeding based on per-nest observations) is added later

In [18]:
# make sure the nest IDs are all caps
log('Cleanse the nest_ids...')
df_sensor_data['nest_id'] = df_sensor_data['nest_id'].apply(lambda x: x.strip()).apply(lambda x: x.upper())
log(' Done.')

# add the breeding_year (Feb to Jan, see breeding_year() above): 
log('Calculating breeding year...')
df_sensor_data['breeding_year'] = df_sensor_data['datetime'].apply(lambda x: breeding_year(x))
log(' Done.')

# Add flags for various temperature ranges. Uses temp_bucket() as defined in helper_functions.py
# These are summed to give the amount of time in the temp band
log('Calculating temperature buckets...')
df_sensor_data['temp_bucket'] = df_sensor_data['temp_c'].apply(temp_bucket)
log(' Done.')

# Add flags for various humidity ranges. Uses humidity_bucket() as defined in helper_functions.py
# These are summed to give the amount of time in the humidity band
log('Calculating humidity buckets...')
df_sensor_data['humidity_bucket'] = df_sensor_data['humidity'].apply(humidity_bucket)
log(' Done.')

Mon Apr 10 20:50:51 2017 - Cleanse the nest_ids...
Mon Apr 10 20:50:52 2017 -  Done.
Mon Apr 10 20:50:52 2017 - Calculating breeding year...
Mon Apr 10 20:51:01 2017 -  Done.
Mon Apr 10 20:51:01 2017 - Calculating temperature buckets...
Mon Apr 10 20:51:05 2017 -  Done.
Mon Apr 10 20:51:05 2017 - Calculating humidity buckets...
Mon Apr 10 20:51:08 2017 -  Done.
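
`temp_bucket()` lives in `helper_functions.py` and is not shown here. The sketch below is one way such a bucketing could be written with `pd.cut`, under the assumption that the bands are `<0`, then 5-degree steps from `0-5` to `55-60`, then `60+`, with each band including its lower edge. The function name `temp_bucket_sketch`, the edges and the labels are all hypothetical.

```python
import numpy as np
import pandas as pd

# assumed edges: below 0, then 5-degree steps up to 60, then 60 and above
edges = [-np.inf] + list(range(0, 65, 5)) + [np.inf]
labels = ['<0'] + ['{0}-{1}'.format(lo, lo + 5) for lo in range(0, 60, 5)] + ['60+']


def temp_bucket_sketch(temps):
    # right=False makes each band include its lower edge: '0-5' means 0 <= t < 5
    return pd.cut(temps, bins=edges, labels=labels, right=False)


readings = pd.Series([-2.5, 0.0, 4.9, 5.0, 37.2, 60.0])
buckets = temp_bucket_sketch(readings).astype(str).tolist()
print(buckets)  # ['<0', '0-5', '0-5', '5-10', '35-40', '60+']
```

Bucketing the whole column in one call avoids a Python-level function call per reading, which may matter with about 2.5 million rows.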


### 3.2 Join the NestCharacteristic Static and Seasonal data
`df_nest_seasonal` + `df_nest_static` -> `df_nest_joined`

Seasonal is `left` and Static is `right`, so that the Seasonal data is augmented with each nest's static metadata.

In [19]:
log('Joining the Nest Characteristic (Seasonal and Static) data sets .')
df_nest_joined = pd.merge(left=df_nest_seasonal,
                            right=df_nest_static,
                            how='left',
                            on=['nest_id'], # both have same keys
                            left_on=None, # same key names: don't need to specify R and L
                            right_on=None, # same key names: don't need to specify R and L
                            left_index=False, # don't use left df index as key
                            right_index=False, # don't use right df index as key
                            sort=True, # sort the result by the join keys
                            suffixes=['_seasonal', '_static']
                            )
if df_nest_joined is not None:
    log('Join complete. Here are the stats:')
    print('Records in seasonal data:    {0:>20,}'.format(len(df_nest_seasonal)))
    print('Records in static data:      {0:>20,}'.format(len(df_nest_static)))
    print('                              -------------------')
    print('Records in joined data:      {0:>20,}'.format(len(df_nest_joined)))
    print('\nOverview:')
    gb = df_nest_joined.groupby(['nest_id'])
    print('Number of nest_ids:          {0:>20,}\n'.format(len(gb)))
    write_temp_file(df_nest_joined, './output/A_load_and_combine_data/df_nest_joined.csv', 'df_nest_joined')
    del gb
else:
    log('JOIN FAILED!!!.')
log(' Done.')

Mon Apr 10 20:51:08 2017 - Joining the Nest Characteristic (Seasonal and Static) data sets .
Mon Apr 10 20:51:08 2017 - Join complete. Here are the stats:
Records in seasonal data:                   1,929
Records in static data:                       247
                              -------------------
Records in joined data:                     1,929

Overview:
Number of nest_ids:                           241

Mon Apr 10 20:51:08 2017 - Writing intermediate table df_nest_joined to disk.
Mon Apr 10 20:51:08 2017 - Written ./output/A_load_and_combine_data/df_nest_joined.csv: 0.526 MB
Mon Apr 10 20:51:08 2017 -  Done.
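
The record counts printed above are a manual check that the left join kept one output row per seasonal record. `pd.merge` can also enforce and report this itself. The sketch below uses toy frames and the `validate` and `indicator` parameters; the data is made up and the use of these parameters is a suggestion, not something the original cell does.

```python
import pandas as pd

seasonal = pd.DataFrame({'nest_id': ['A1', 'A1', 'B2', 'Z9'],
                         'BoxCoverTotal': [40.0, 55.0, 10.0, 80.0]})
static = pd.DataFrame({'nest_id': ['A1', 'B2', 'C3'],
                       'nest_type': ['box', 'natural', 'box']})

joined = pd.merge(left=seasonal, right=static, how='left', on='nest_id',
                  validate='many_to_one',  # raises MergeError if static repeats a nest_id
                  indicator=True)          # adds a _merge column: 'both' or 'left_only'

# seasonal rows whose nest_id has no static master record
unmatched = joined.loc[joined['_merge'] == 'left_only', 'nest_id'].tolist()
print(len(joined), unmatched)  # 4 ['Z9']
```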


### 3.3 Join the breeding stats together
`df_breeding_gb + df_clutch_count -> df_breeding_annual_stats`

Clutch counts per year and annual clutch survival stats.
Note that the reduced record count in `df_breeding_annual_stats` compared to `df_clutch_count` is due to a number of nest-years having blank/zero clutches. These are excluded.

In [20]:
# join the clutch count on to the annual stats
log('Merging the aggregated breeding stats.')
df_breeding_annual_stats = pd.merge(left=df_breeding_gb,
                                    right=df_clutch_count,
                                    how='left',
                                    on=['nest_id', 'breeding_year'], # both have same keys
                                    sort=True # sort the result by the join keys
                                   )

log('Join complete. Here are the stats:')
print('Records in annual stats data: {0:>20,}'.format(len(df_breeding_gb)))
print('Records in clutch count data: {0:>20,}'.format(len(df_clutch_count)))
print('                              -------------------')
print('Records in joined data:       {0:>20,}'.format(len(df_breeding_annual_stats)))
print('\nOverview:')
gb = df_breeding_annual_stats.groupby(['nest_id'])
print('Number of nest_ids in clutch count:   {0:>12,}'.format(len(df_clutch_count.groupby(['nest_id']))))
print('Number of nest_ids in breeding stats: {0:>12,}'.format(len(df_breeding_gb.groupby(['nest_id']))))
print('Number of nest_ids in joined:         {0:>12,}\n'.format(len(df_breeding_annual_stats.groupby(['nest_id']))))
write_temp_file(df_breeding_annual_stats, './output/A_load_and_combine_data/df_breeding_annual_stats.csv', 'df_breeding_annual_stats')
del gb
log(' Done.')

Mon Apr 10 20:51:08 2017 - Merging the aggregated breeding stats.
Mon Apr 10 20:51:08 2017 - Join complete. Here are the stats:
Records in annual stats data:                  277
Records in clutch count data:                  860
                              -------------------
Records in joined data:                        277

Overview:
Number of nest_ids in clutch count:            212
Number of nest_ids in breeding stats:          122
Number of nest_ids in joined:                  122

Mon Apr 10 20:51:09 2017 - Writing intermediate table df_breeding_annual_stats to disk.
Mon Apr 10 20:51:09 2017 - Written ./output/A_load_and_combine_data/df_breeding_annual_stats.csv: 0.016 MB
Mon Apr 10 20:51:09 2017 -  Done.
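
The drop from 860 clutch-count records to 277 annual-stats records is consistent with how `groupby` treats missing keys: by default, rows whose grouping key is NaN are left out. A minimal sketch on made-up observations, where nest `B2` never has a clutch number recorded:

```python
import numpy as np
import pandas as pd

obs = pd.DataFrame({
    'nest_id': ['A1', 'A1', 'B2', 'B2'],
    'breeding_year': [2015, 2015, 2015, 2015],
    'clutch': [1.0, 1.0, np.nan, np.nan],
    'eggs': [2.0, 2.0, np.nan, np.nan],
})

# clutch is a value here, so B2 keeps a row (with a NaN clutch count)
per_nest_year = obs.groupby(['nest_id', 'breeding_year'])['clutch'].max().reset_index()

# clutch is a grouping key here, so B2's NaN keys are dropped
per_clutch = obs.groupby(['nest_id', 'breeding_year', 'clutch'])['eggs'].max().reset_index()

print(len(per_nest_year))  # 2
print(len(per_clutch))     # 1
```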


### 3.4 Join the Nest data (seasonal and static) to the Breeding stats
`df_nest_joined + df_breeding_annual_stats -> df_nest_and_breeding`

In [21]:
# join the annual clutch and breeding stats onto the full seasonal and static nest data
log('Merging the aggregated breeding stats to the static and seasonal nest data.')
df_nest_and_breeding = pd.merge(left=df_nest_joined,
                                right=df_breeding_annual_stats,
                                how='left',
                                on=['nest_id','breeding_year'], # both have same keys
                                sort=True 
                               )

log('Join complete. Here are the stats:')
print('Records in nest data:                 {0:>12,}'.format(len(df_nest_joined)))
print('Records in breeding stats data:       {0:>12,}'.format(len(df_breeding_annual_stats)))
print('                                      ------------')
print('Records in joined data:               {0:>12,}'.format(len(df_nest_and_breeding)))
print('\nOverview:')
gb = df_breeding_annual_stats.groupby(['nest_id'])
print('Number of nest_ids in nest data:      {0:>12,}'.format(len(df_nest_joined.groupby(['nest_id']))))
print('Number of nest_ids in breeding stats: {0:>12,}'.format(len(df_breeding_annual_stats.groupby(['nest_id']))))
print('Number of nest_ids in joined:         {0:>12,}\n'.format(len(df_nest_and_breeding.groupby(['nest_id']))))
write_temp_file(df_nest_and_breeding, './output/A_load_and_combine_data/NestDataWithBreedingStats.csv', 'df_nest_and_breeding')
del gb
log('Done.')

Mon Apr 10 20:51:09 2017 - Merging the aggregated breeding stats to the static and seasonal nest data.
Mon Apr 10 20:51:09 2017 - Join complete. Here are the stats:
Records in nest data:                        1,929
Records in breeding stats data:                277
                                      ------------
Records in joined data:                      2,066

Overview:
Number of nest_ids in nest data:               241
Number of nest_ids in breeding stats:          122
Number of nest_ids in joined:                  241

Mon Apr 10 20:51:09 2017 - Writing intermediate table df_nest_and_breeding to disk.
Mon Apr 10 20:51:09 2017 - Written ./output/A_load_and_combine_data/NestDataWithBreedingStats.csv: 0.604 MB
Mon Apr 10 20:51:09 2017 - Done.
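
The joined table has more records (2,066) than the nest data on the left of the join (1,929). In a left join this can only happen when a key pair on the left matches more than one row on the right. The following is a minimal sketch of how one might check for that, using small made-up frames rather than the real tables; the frame names `left` and `right`, the `clutch_count` column and all values are illustrative assumptions.

```python
import pandas as pd

# Illustrative stand-ins for df_nest_joined and df_breeding_annual_stats.
left = pd.DataFrame({
    'nest_id': ['N1', 'N1', 'N2'],
    'breeding_year': [2014, 2015, 2014],
})
right = pd.DataFrame({
    'nest_id': ['N1', 'N1', 'N2'],
    'breeding_year': [2014, 2014, 2014],   # ('N1', 2014) appears twice
    'clutch_count': [1, 2, 1],             # hypothetical column
})
keys = ['nest_id', 'breeding_year']

# Right-hand rows whose key pair is not unique: these multiply left rows.
dupes = right[right.duplicated(subset=keys, keep=False)]

# indicator=True adds a '_merge' column that records where each row came from.
joined = pd.merge(left, right, how='left', on=keys, indicator=True)
unmatched = int((joined['_merge'] == 'left_only').sum())

print('left rows: {0}, joined rows: {1}'.format(len(left), len(joined)))
print('right rows with duplicated keys: {0}'.format(len(dupes)))
print('left rows with no match: {0}'.format(unmatched))
```

If the right-hand table is expected to hold one row per key pair, passing `validate='many_to_one'` to `pd.merge` makes pandas raise an error instead of silently adding rows.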


### 3.5 Aggregate the sensor data into stats per breeding phase
To understand how nest conditions (from sensor data) affect the choice of nest and its breeding success, we need to break up the stats into:
* *annual stats* which represent the averages, spikes etc for the entire year. These give an understanding of the nest itself.
* *phase stats* which represent the conditions during specific phases of the breeding cycle. E.g. during nesting, during incubation, during rearing. To get these phase stats, we need to get the phase boundary dates from the breeding observation data.

The nest sensor readings are aggregated to summarise the nest conditions by `nest`, `breeding_year` and `activity_phase`.
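
As a rough illustration of this kind of aggregation, the sketch below groups a few made-up readings by nest, year and phase and computes summary statistics. The column names `breeding_phase` and `temp_c` and all values are assumptions for the example, not the notebook's actual schema.

```python
import pandas as pd

# Made-up sensor readings; 'temp_c' is a hypothetical measurement column.
readings = pd.DataFrame({
    'nest_id': ['N1', 'N1', 'N1', 'N2'],
    'breeding_year': [2014, 2014, 2014, 2014],
    'breeding_phase': ['Incubating', 'Incubating', 'Rearing', 'Not in use'],
    'temp_c': [20.0, 22.0, 25.0, 18.0],
})

# One row per nest, year and phase, with summary statistics of the readings.
stats = (readings
         .groupby(['nest_id', 'breeding_year', 'breeding_phase'])['temp_c']
         .agg(['mean', 'min', 'max', 'count'])
         .reset_index())
print(stats)
```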

#### 3.5.1 Get the actual breeding phase dates
Summarise the breeding data to obtain the following:
* list of all nests (regardless of breeding activity)
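* the `courting_date` for each nest, year and clutch, taken as 31 days before the `egg_lay_date` (used in place of an observed `nesting_date`)
* the `egg_lay_date` for each nest, year and clutch
* the `hatch_date` for each nest, year and clutch, calculated as the observation date minus the chicks' age
* the `dead_or_fledge_date` for each nest, year and clutch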

Join these all back together to get the phase dates all in one place, then join the combined result on to the sensor data table and calculate the phase in which each sensor reading occurred.
This will take a while.

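**Issue: Nesting dates don't work. The second clutch would get the first clutch's nesting date, and for many nests the first observation falls after the lay date, giving `nesting_date > lay_date`. The code below instead derives a `courting_date` as 31 days before the `egg_lay_date`.**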

In [22]:
# for each nest, year and clutch, get the following:
# egg_lay_date (min EggLayDate) and courting_date (31 days before egg_lay_date)
# hatch_date (min of observation_date - ChicksAge) and dead_or_fledge_date (min qualifying observation_date)

log('Calculating the breeding phase dates for each nest and year.')
# take each unique nest id, which we'll join the rest of the data onto
df_all_nests = df_nest_static[['nest_id']].drop_duplicates()
write_temp_file(df_all_nests, './output/A_load_and_combine_data/df_all_nests.csv', 'df_all_nests')

# egg_lay_date
log(' Creating a table for the courting and egg lay dates...')
gb_lay_date = df_breeding[['nest_id', 'breeding_year', 'clutch', 'EggLayDate']
                         ].groupby(['nest_id', 'breeding_year', 'clutch']).min().reset_index()
gb_lay_date.rename(columns={'EggLayDate': 'egg_lay_date'}, inplace=True)
# courting date: 31 days before egg_lay_date
gb_lay_date['courting_date'] = gb_lay_date['egg_lay_date'] - datetime.timedelta(days=31)
write_temp_file(gb_lay_date, './output/A_load_and_combine_data/gb_lay_date.csv', 'gb_lay_date')
log(' Done.')

# hatch_date
log(' Creating a table for the hatch dates...')
def hatch_date(row):
    return row['observation_date'] - datetime.timedelta(days=row['ChicksAge'])
# get the observation date (select columns) where age is not blank (i.e. they're there)
gb_hatch_date = df_breeding[['nest_id', 'breeding_year', 'clutch', 
                             'observation_date', 'ChicksAge']].loc[df_breeding['ChicksAge'].notnull()]
gb_hatch_date['hatch_date'] = gb_hatch_date.apply(hatch_date, axis=1)
# get the min hatch_date 
gb_hatch_date = gb_hatch_date[['nest_id', 'breeding_year', 'clutch', 'hatch_date'
                              ]].groupby(['nest_id', 'breeding_year', 'clutch']).min().reset_index()
write_temp_file(gb_hatch_date, './output/A_load_and_combine_data/gb_hatch_date.csv', 'gb_hatch_date')
log(' Done.')



Mon Apr 10 20:51:09 2017 - Calculating the breeding phase dates for each nest and year.
Mon Apr 10 20:51:09 2017 - Writing intermediate table df_all_nests to disk.
Mon Apr 10 20:51:09 2017 - Written ./output/A_load_and_combine_data/df_all_nests.csv: 0.002 MB
Mon Apr 10 20:51:09 2017 -  Creating a table for the courting and egg lay dates...
Mon Apr 10 20:51:09 2017 - Writing intermediate table gb_lay_date to disk.
Mon Apr 10 20:51:09 2017 - Written ./output/A_load_and_combine_data/gb_lay_date.csv: 0.011 MB
Mon Apr 10 20:51:09 2017 -  Done.
Mon Apr 10 20:51:09 2017 -  Creating a table for the hatch dates...
Mon Apr 10 20:51:09 2017 - Writing intermediate table gb_hatch_date to disk.
Mon Apr 10 20:51:09 2017 - Written ./output/A_load_and_combine_data/gb_hatch_date.csv: 0.007 MB
Mon Apr 10 20:51:09 2017 -  Done.
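
The hatch date calculation above runs a Python function once per row. As a possible alternative, the same subtraction can be written as a single column operation with `pd.to_timedelta`. This is a sketch on made-up observations (the frame `obs` and its values are illustrative), not a drop-in replacement tested against the real data.

```python
import pandas as pd

# Made-up breeding observations with the columns used by the hatch date step.
obs = pd.DataFrame({
    'nest_id': ['N1', 'N1', 'N2'],
    'breeding_year': [2014, 2014, 2014],
    'clutch': [1, 1, 1],
    'observation_date': pd.to_datetime(['2014-10-10', '2014-10-17', '2014-11-01']),
    'ChicksAge': [5.0, 11.0, float('nan')],
})

# Keep only observations where an age was recorded.
with_age = obs[obs['ChicksAge'].notnull()].copy()

# Subtract the age (in days) from the observation date for the whole column at once.
with_age['hatch_date'] = (with_age['observation_date']
                          - pd.to_timedelta(with_age['ChicksAge'], unit='D'))

# The earliest estimate per nest, year and clutch is taken as the hatch date.
hatch = (with_age
         .groupby(['nest_id', 'breeding_year', 'clutch'])['hatch_date']
         .min()
         .reset_index())
print(hatch)
```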


In [23]:
# fledge_date
log(' Creating a table for the fledge dates...')
# is either the date that the chicks were of age and no longer observed in the nest, or were observed dead
# get the observation date (select columns) where there is a fledge flag
gb_fledge_date = df_breeding[['nest_id', 'breeding_year', 'clutch', 'observation_date', 
                              'ChicksAlive', 'ChicksDead', 'ChicksFledge']].fillna(0)
gb_fledge_date['dead_or_fledged'] = gb_fledge_date.apply(lambda row: row['ChicksFledge'] > 0 
                                                         or (row['ChicksDead'] > 0 and row['ChicksAlive'] == 0), axis=1)
gb_fledge_date = gb_fledge_date.query('dead_or_fledged')
# get the min obs date, which is the earliest fledge recording (per clutch)
gb_fledge_date = gb_fledge_date[['nest_id', 'breeding_year', 'clutch', 'observation_date'
                                ]].groupby(['nest_id', 'breeding_year', 'clutch']).min().reset_index()
# rename the obs date 
gb_fledge_date.rename(columns={'observation_date': 'dead_or_fledge_date'}, inplace=True)
write_temp_file(gb_fledge_date, './output/A_load_and_combine_data/gb_fledge_date.csv', 'gb_fledge_date')
log(' Done.')

Mon Apr 10 20:51:09 2017 -  Creating a table for the fledge dates...
Mon Apr 10 20:51:11 2017 - Writing intermediate table gb_fledge_date to disk.
Mon Apr 10 20:51:11 2017 - Written ./output/A_load_and_combine_data/gb_fledge_date.csv: 0.007 MB
Mon Apr 10 20:51:11 2017 -  Done.
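
The `dead_or_fledged` rule can also be expressed as a boolean mask over whole columns, which avoids the row-wise `apply`. The sketch below applies the same rule (chicks fledged, or chicks dead with none alive) to a few made-up observations; the frame `obs` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up observations; NaN means nothing was recorded for that count.
obs = pd.DataFrame({
    'nest_id': ['N1', 'N1', 'N1', 'N2'],
    'breeding_year': [2014, 2014, 2014, 2014],
    'clutch': [1, 1, 1, 1],
    'observation_date': pd.to_datetime(
        ['2014-11-01', '2014-11-20', '2014-11-27', '2014-11-05']),
    'ChicksAlive': [2, np.nan, np.nan, 1],
    'ChicksDead': [np.nan, np.nan, np.nan, 1],
    'ChicksFledge': [np.nan, 2, 2, np.nan],
})

# Treat missing counts as zero, as the cell above does.
counts = obs[['ChicksAlive', 'ChicksDead', 'ChicksFledge']].fillna(0)

# Fledged, or dead with no chicks left alive.
mask = (counts['ChicksFledge'] > 0) | (
    (counts['ChicksDead'] > 0) & (counts['ChicksAlive'] == 0))

# Earliest qualifying observation per nest, year and clutch.
fledge = (obs[mask]
          .groupby(['nest_id', 'breeding_year', 'clutch'])['observation_date']
          .min()
          .reset_index()
          .rename(columns={'observation_date': 'dead_or_fledge_date'}))
print(fledge)
```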


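`df_phase_dates` contains one row per nest, year and clutch with the `courting_date`, `egg_lay_date`, `hatch_date` and `dead_or_fledge_date`, plus the columns of `df_clutch_count`. Nests with no breeding observations are kept, with empty dates.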

In [24]:
# join the key date tables together
log(' Merging the phase date tables...')
df_phase_dates = pd.merge(left=df_all_nests, 
                          right=gb_lay_date, 
                          how='left', 
                          on=['nest_id'], 
                          sort=True
                         )
df_phase_dates = pd.merge(left=df_phase_dates, 
                          right=gb_hatch_date, 
                          how='left', 
                          on=['nest_id', 'breeding_year', 'clutch'], 
                          sort=True
                         )
df_phase_dates = pd.merge(left=df_phase_dates, 
                          right=gb_fledge_date, 
                          how='left', 
                          on=['nest_id', 'breeding_year', 'clutch'], 
                          sort=True
                         )
df_phase_dates = pd.merge(left=df_phase_dates, 
                          right=df_clutch_count, 
                          how='left', 
                          on=['nest_id', 'breeding_year'], 
                          sort=True
                         )
write_temp_file(df_phase_dates, './output/A_load_and_combine_data/df_phase_dates.csv', 'df_phase_dates')
log(' Done.')

Mon Apr 10 20:51:11 2017 -  Merging the phase date tables...
Mon Apr 10 20:51:11 2017 - Writing intermediate table df_phase_dates to disk.
Mon Apr 10 20:51:11 2017 - Written ./output/A_load_and_combine_data/df_phase_dates.csv: 0.019 MB
Mon Apr 10 20:51:11 2017 -  Done.


In [25]:
log(' Pivot breeding data to get the clutch dates...')
# get the required cols
df_clutch_pivot = gb_lay_date[['nest_id', 'breeding_year', 'clutch', 'egg_lay_date']].copy()
# we have to combine the index (because pivot() does not like a multi-index) and then split it apart a few steps later
df_clutch_pivot['nestyear'] = df_clutch_pivot['nest_id'] + '-' + (df_clutch_pivot['breeding_year'].apply(lambda x: str(int(x))))
# drop the old index fields
df_clutch_pivot = df_clutch_pivot[['nestyear', 'clutch', 'egg_lay_date']]
# do the pivot to get the (up to three) clutch dates per nest and year
df_clutch_pivot = df_clutch_pivot.pivot(index='nestyear', columns='clutch')['egg_lay_date'].reset_index()

# rename and restore the indexes
df_clutch_pivot.rename(columns={1.0: 'clutch_1', 2.0: 'clutch_2', 3.0: 'clutch_3'}, inplace=True)
# split on the last '-' only, so a nest_id that itself contains '-' is restored intact
df_clutch_pivot['nest_id'] = df_clutch_pivot['nestyear'].apply(lambda x: x.rsplit('-', 1)[0])
df_clutch_pivot['breeding_year'] = df_clutch_pivot['nestyear'].apply(lambda x: float(x.rsplit('-', 1)[1]))
df_clutch_pivot = df_clutch_pivot[['nest_id', 'breeding_year', 'clutch_1', 'clutch_2', 'clutch_3']]
write_temp_file(df_clutch_pivot, './output/A_load_and_combine_data/df_clutch_pivot.csv', 'df_clutch_pivot')
log(' Done.')
log('Done.')

Mon Apr 10 20:51:11 2017 -  Pivot breeding data to get the clutch dates...
Mon Apr 10 20:51:11 2017 - Writing intermediate table df_clutch_pivot to disk.
Mon Apr 10 20:51:11 2017 - Written ./output/A_load_and_combine_data/df_clutch_pivot.csv: 0.007 MB
Mon Apr 10 20:51:11 2017 -  Done.
Mon Apr 10 20:51:11 2017 - Done.
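
One way to avoid building and splitting the combined `nestyear` key is to reshape with `set_index` and `unstack`, which accept several index columns. The sketch below shows the idea on made-up lay dates (the frame `lay` and its values are illustrative). It assumes each nest, year and clutch combination appears only once, which holds after a `groupby(...).min()`.

```python
import pandas as pd

# Made-up lay dates, one row per nest, year and clutch.
lay = pd.DataFrame({
    'nest_id': ['N1', 'N1', 'N2'],
    'breeding_year': [2014.0, 2014.0, 2015.0],
    'clutch': [1.0, 2.0, 1.0],
    'egg_lay_date': pd.to_datetime(['2014-09-01', '2014-11-15', '2015-09-10']),
})

# Move 'clutch' from the row index to the columns: one column per clutch number.
wide = (lay
        .set_index(['nest_id', 'breeding_year', 'clutch'])['egg_lay_date']
        .unstack('clutch')
        .rename(columns={1.0: 'clutch_1', 2.0: 'clutch_2', 3.0: 'clutch_3'})
        # make sure all three clutch columns exist even if no nest had a 3rd clutch
        .reindex(columns=['clutch_1', 'clutch_2', 'clutch_3'])
        .reset_index())
wide.columns.name = None   # drop the leftover 'clutch' label on the column axis
print(wide)
```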


**Add clutch dates to the sensor data**

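The sensor data (`df_sensor_data`) has no `clutch` number, so joining it to the phase dates on `nest_id` and `breeding_year` alone would duplicate every reading once per clutch. Get the clutch dates and join them onto the sensor data first, then use them to assign a clutch number to each reading.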

In [26]:
# to avoid making epic dupes, we need to first add the clutch number on to the sensor data table

log('Join the clutch dates to the sensor data.')
# join on to the sensor data
df_sensor_clutch = pd.merge(left=df_sensor_data, 
                            right=df_clutch_pivot, 
                            how='left', 
                            on=['nest_id', 'breeding_year'],
                            sort=True
                           )
log('Done. Rows: {0:,}'.format(len(df_sensor_clutch)))

log('Assigning a clutch number to each sensor record. Be patient.')
# flag each reading with a clutch number
def clutch_number(row):
    if pd.isnull(row['clutch_1']):
        # there are no breeding observations for this nest and year
        return 0
    else:
        # there is at least 1 clutch
        if pd.isnull(row['clutch_2']) or row['datetime'] < row['clutch_2']:
            # there was only a single clutch, or there were >1 but this reading was before the 2nd clutch
            return 1
        elif pd.isnull(row['clutch_3']) or row['datetime'] < row['clutch_3']:
            # there is a 2nd clutch if we got this far. if there is no 3rd, or the reading is before the 3rd, then this is 2nd
            return 2
        else:
            # there is a 3rd clutch and the sensor reading is after the 3rd
            return 3

df_sensor_clutch['clutch_number'] = df_sensor_clutch.apply(lambda row: clutch_number(row), axis=1)
log('Done.')
write_temp_file(df_sensor_clutch, './output/A_load_and_combine_data/df_sensor_clutch.csv', 'df_sensor_clutch')

Mon Apr 10 20:51:11 2017 - Join the clutch dates to the sensor data.
Mon Apr 10 20:51:12 2017 - Done. Rows: 2,505,456
Mon Apr 10 20:51:12 2017 - Assigning a clutch number to each sensor record. Be patient.
Mon Apr 10 20:55:13 2017 - Done.
Mon Apr 10 20:55:13 2017 - Writing intermediate table df_sensor_clutch to disk.
Mon Apr 10 20:55:45 2017 - Written ./output/A_load_and_combine_data/df_sensor_clutch.csv: 287.631 MB
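
The row-wise `apply` above takes about four minutes on 2.5 million readings. A possible faster alternative is to express the same rules as column-wide conditions and pick the first one that holds with `np.select`. The sketch below uses a few made-up readings (the frame `df` and its values are illustrative) and has not been timed or checked against the real data.

```python
import numpy as np
import pandas as pd

# Made-up readings with the clutch lay dates already joined on.
df = pd.DataFrame({
    'datetime': pd.to_datetime(['2014-10-01', '2014-12-01', '2015-01-20', '2014-10-01']),
    'clutch_1': pd.to_datetime(['2014-09-01', '2014-09-01', '2014-09-01', None]),
    'clutch_2': pd.to_datetime(['2014-11-15', '2014-11-15', '2014-11-15', None]),
    'clutch_3': pd.to_datetime([None, None, '2015-01-10', None]),
})

# Conditions are checked in order; the first True one wins, as in the if/elif chain.
no_clutch = df['clutch_1'].isnull()
first = df['clutch_2'].isnull() | (df['datetime'] < df['clutch_2'])
second = df['clutch_3'].isnull() | (df['datetime'] < df['clutch_3'])

# Comparisons against a missing date (NaT) evaluate to False.
df['clutch_number'] = np.select([no_clutch, first, second], [0, 1, 2], default=3)
print(df[['datetime', 'clutch_number']])
```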


**Join the actual breeding phase dates on to the sensor data**

In [27]:
log('Join the phase dates on to the sensor data.')
df_sensor_phase = pd.merge(left=df_sensor_clutch,
                        right=df_phase_dates,
                        how='left',
                        left_on=['nest_id', 'breeding_year', 'clutch_number'], 
                        right_on=['nest_id', 'breeding_year', 'clutch'], 
                        sort=True 
                        )
del df_sensor_phase['clutch_number']
log('Done. Rows: {0:,}'.format(len(df_sensor_phase)))

Mon Apr 10 20:55:45 2017 - Join the phase dates on to the sensor data.
Mon Apr 10 20:55:47 2017 - Done. Rows: 2,505,456


**Calculate the breeding_phase for each sensor reading** 

In [28]:
log('Calculate the breeding phase for each sensor record... Be patient...')


def breeding_phase(row):
    '''
    Calculates the breeding phase per record based on the phase dates.
    There are additional rules when the subsequent phase did not occur:
    a) Eggs without a hatch date are assumed dead at 45 days and nest returns to 'Not in use'
    b) Hatched eggs without a fledge date are assumed fledged/dead after 90 days
    
    Parameters:
    -----------
    row: a dataframe record containing the following columns:
         'egg_lay_date', 
         'courting_date',
         'hatch_date', 
         'dead_or_fledge_date'
         
    Returns:
    --------
    A string representing the current breeding phase according to the phase dates above.
    Return values are:
        'Not in use' : Between the dead_or_fledge_date of the previous clutch/breeding season and next courting_date
        'Courting',  : Between the courting_date and egg_lay_date
        'Incubating',: Between the egg_lay_date and hatch_date
        'Rearing'    : Between the hatch_date and dead_or_fledge_date
    '''
    if row['courting_date'] <= row['datetime'] < row['egg_lay_date']:
        return 'Courting'
    elif row['egg_lay_date'] <= row['datetime'] < row['hatch_date']:
        return 'Incubating'
    elif pd.isnull(row['hatch_date']) and (row['egg_lay_date'] <= row['datetime'] < row['egg_lay_date'] 
                                         + datetime.timedelta(days=45)):
        return 'Incubating'
    elif row['hatch_date'] <= row['datetime'] < row['dead_or_fledge_date']:
        return 'Rearing'
    elif pd.isnull(row['dead_or_fledge_date']) and (row['hatch_date'] <= row['datetime'] < row['hatch_date'] 
                                                + datetime.timedelta(days=90)):
        return 'Rearing'
    else:
        return 'Not in use'


def season_from_month(month):
    '''
    Returns the season (southern hemisphere) for the provided month.
    
    Parameters:
    -----------
    month (scalar): the float/integer representing the calendar month
    
    Returns:
    --------
    A string representing the season (southern hemisphere) for the calendar month provided. 
    I.e. 'Spring', 'Summer', 'Autumn', 'Winter'
    '''
    if 3 <= month <= 5:
        return 'Autumn'
    elif 6 <= month <= 8:
        return 'Winter'
    elif 9 <= month <= 11:
        return 'Spring'
    elif month == 12 or 1 <= month <= 2:
        return 'Summer'
    else:
        raise ValueError('{0} is not a valid month. Expecting 1-12.'.format(month))
    

def season_year(date):
    '''
    Returns the season_year for a given date. The season_year represents 
    the year in which the season began, so Summer 2013 includes 
    the following months: December 2013, January 2014, February 2014.
    
    Parameters:
    -----------
    date: a date or datetime (anything with .year and .month attributes)
    
    Returns:
    --------
    An integer representing the year in which the season began.
    '''
    if date.month <= 2:
        return date.year - 1
    else:
        return date.year


df_sensor_phase['calendar_year'] = df_sensor_phase['datetime'].apply(lambda x: x.year)
log('  > Calendar year: done.')
df_sensor_phase['month'] = df_sensor_phase['datetime'].apply(lambda x: x.month)
log('  > Month: done.')
df_sensor_phase['day'] = df_sensor_phase['datetime'].apply(lambda x: x.day)
log('  > Day: done.')
df_sensor_phase['hour'] = df_sensor_phase['datetime'].apply(lambda x: x.hour)
log('  > Hour: done.')
df_sensor_phase['minute'] = df_sensor_phase['datetime'].apply(lambda x: x.minute)
log('  > Minute: done.')
df_sensor_phase['season'] = df_sensor_phase['month'].apply(lambda x: season_from_month(x))
log('  > Season: done.')
df_sensor_phase['season_year'] = df_sensor_phase['datetime'].apply(lambda x: season_year(x))
log('  > Season_Year: done.')
df_sensor_phase['breeding_phase'] = df_sensor_phase.apply(breeding_phase, axis=1)
log('  > Breeding Phase: done.')
log('Done.')

write_temp_file(df_sensor_phase, './output/A_load_and_combine_data/df_sensor_phase.csv', 'df_sensor_phase')

Mon Apr 10 20:55:47 2017 - Calculate the breeding phase for each sensor record... Be patient...
Mon Apr 10 20:55:56 2017 -   > Calendar year: done.
Mon Apr 10 20:56:05 2017 -   > Month: done.
Mon Apr 10 20:56:13 2017 -   > Day: done.
Mon Apr 10 20:56:20 2017 -   > Hour: done.
Mon Apr 10 20:56:28 2017 -   > Minute: done.
Mon Apr 10 20:56:29 2017 -   > Season: done.
Mon Apr 10 20:56:37 2017 -   > Season_Year: done.
Mon Apr 10 21:04:41 2017 -   > Breeding Phase: done.
Mon Apr 10 21:04:41 2017 - Done.
Mon Apr 10 21:04:41 2017 - Writing intermediate table df_sensor_phase to disk.
Mon Apr 10 21:05:34 2017 - Written ./output/A_load_and_combine_data/df_sensor_phase.csv: 452.662 MB
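
The breeding phase step dominates the run time (about eight minutes in the log above). As with the clutch number, the rules in `breeding_phase` could be written as ordered column-wide conditions for `np.select`. This is a sketch on made-up readings (the frame `df` and its values are illustrative), intended to mirror the same rules rather than to be a verified replacement.

```python
import numpy as np
import pandas as pd

# Made-up readings with phase dates joined on; the last row has no breeding data.
df = pd.DataFrame({
    'datetime': pd.to_datetime(
        ['2014-08-15', '2014-09-10', '2014-10-20', '2015-03-01', '2014-09-10']),
    'courting_date': pd.to_datetime(['2014-08-01'] * 4 + [None]),
    'egg_lay_date': pd.to_datetime(['2014-09-01'] * 4 + [None]),
    'hatch_date': pd.to_datetime(['2014-10-06'] * 4 + [None]),
    'dead_or_fledge_date': pd.to_datetime([None] * 5),
})

dt = df['datetime']
court, lay = df['courting_date'], df['egg_lay_date']
hatch, fledge = df['hatch_date'], df['dead_or_fledge_date']

# Same order as the if/elif chain; comparisons with NaT evaluate to False.
conditions = [
    (court <= dt) & (dt < lay),
    (lay <= dt) & (dt < hatch),
    hatch.isnull() & (lay <= dt) & (dt < lay + pd.Timedelta(days=45)),
    (hatch <= dt) & (dt < fledge),
    fledge.isnull() & (hatch <= dt) & (dt < hatch + pd.Timedelta(days=90)),
]
choices = ['Courting', 'Incubating', 'Incubating', 'Rearing', 'Rearing']
df['breeding_phase'] = np.select(conditions, choices, default='Not in use')
print(df[['datetime', 'breeding_phase']])
```

The date parts could be handled in a similar way with the `.dt` accessor, for example `df['datetime'].dt.month`.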


**Clean up intermediate tables**

In [29]:
log('Cleaning up intermediate data tables...')
del df_sensor_clutch
del gb_lay_date
del gb_hatch_date
del gb_fledge_date
del df_nest_joined
del df_breeding_annual_stats
del df_clutch_count
del df_breeding_gb
log('Done.')

Mon Apr 10 21:05:34 2017 - Cleaning up intermediate data tables...
Mon Apr 10 21:05:35 2017 - Done.


**Pickle the key data tables for use in later scripts**

In [31]:
log('Writing the final tables to pickle for future use...')
df_sensor_phase.to_pickle(os.path.normpath('./output/A_load_and_combine_data/df_sensor_phase.pkl'))
df_nest_and_breeding.to_pickle(os.path.normpath('./output/A_load_and_combine_data/df_nest_and_breeding.pkl'))
df_nest_static.to_pickle(os.path.normpath('./output/A_load_and_combine_data/df_nest_static.pkl'))
log('Done.')
    
log('\n\nScript complete.')

Mon Apr 10 21:08:37 2017 - Writing the final tables to pickle for future use...
Mon Apr 10 21:08:43 2017 - Done.
Mon Apr 10 21:08:43 2017 - 

Script complete.
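
Unlike CSV, a pickle keeps the column dtypes (including datetimes), so a later script can load the tables without re-parsing dates. A minimal round-trip sketch, using a temporary directory and a made-up frame rather than the real output files:

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a datetime column.
df = pd.DataFrame({
    'nest_id': ['N1'],
    'datetime': pd.to_datetime(['2014-09-10 12:30']),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.pkl')   # hypothetical file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame has the same values and the same dtypes.
same_dtypes = bool((restored.dtypes == df.dtypes).all())
print(restored.dtypes)
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.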
